Lexer Modes

Up until now we have been running our lexer using lexer.simple. As the name implies, this is the simplest way to use Nibble’s lexer and it is context-free. Where possible we should try to stick to these simple lexers, but sometimes we need to be able to lex things that are context-sensitive. That’s where lexer modes come in!

Indentation Sensitivity

Let’s imagine we’re writing a lexer for a Python-ish programming language and we want to produce Indent and Dedent tokens to represent indentation. We might define our tokens like this:

pub type TokenT {
  Var(String)
  Str(String)
  Num(Int)

  // Keywords
  Def
  For
  In
  Print

  // Indentation
  Indent(Int)
  Dedent(Int)
}

We could represent a chunk of code like this:

def wibble arr
  for x in arr
    print x

  print "done!"

def wobble
  wibble [1, 2, 3]

Indentation would change the meaning of this program, so we need to know when we are inside a block of indented code or not. Our Indent and Dedent tokens carry with them the level of indentation they represent such that when we come to parsing we can make sure everything is valid, but how do we produce the tokens in the first place?

We’ll need to do two things: (1) write a custom matcher using lexer.custom and (2) store the current indentation level as the lexer’s mode.

pub opaque type Lexer(a, mode)
pub opaque type Matcher(a, mode)

Modes allow us to chose different matchers for different contexts, or inject state into our matchers. For our indentation-sensitive lexer, that means we’ll end up with Lexer and Matcher types like this:

type Lexer = nibble.Lexer(TokenT, Int)
type Matcher = nibble.Matcher(TokenT, Int)

To write our indentation matcher, we’ll count the number of spaces that immediately follow a newline and compare that to the current indentation level. If that number is less than the current indentation level, we’ll produce a Dedent token, otherwise we’ll produce an Indent token. In either case we’ll also update the lexer’s mode with the new indentation level for subsequent lines.

fn indentation() -> Matcher(TokenT, Int) {
  let assert Ok(is_indent) = regex.from_string("^\\n[ \\t]*")
  use current_indent, lexeme, lookahead <- lexer.custom

  case regex.check(is_indent, lexeme), lookahead {
    False, _ -> NoMatch
    True, " " | True, "\t" -> Skip
    True, "\n" -> Drop(current_indent)
    True, _ -> {
      let spaces = string.length(lexeme) - 1

      case int.compare(spaces, current_indent) {
        Lt -> Keep(Dedent(spaces), spaces)
        Eq if spaces == 0 -> Drop(0)
        Eq -> Keep(Indent(spaces), spaces)
        Gt -> Keep(Indent(spaces), spaces)
      }
    }
  }
}

There’s actually a little more going on here than I just described, so let’s break the pattern matching down case by case.

False, _ -> NoMatch

This branch should be self-explanatory. If the lexeme doesn’t match the regex we have for indentation then we don’t have a match and should tell the lexer to move on and attempt a different matcher.

True, " " | True, "\t" -> Skip

If the lexeme matches the indentation regex but we can see from the lookahead that it is followed by more indentation, then we don’t want to produce a token right now. We also don’t want the lexer to try any other matchers that might consume these lexeme (like lexer.whitespace) so we tell the lexer to skip checking any other matchers and move on to the next lexeme.

True, "\n" -> Drop(current_indent)

When the lexeme matches the indentation regex but it’s followed by a new line that means we have an empty line of just whitespace. You could choose to emit a token here if you wanted to, but in this case we’ll ignore empty lines entirely and instruct the lexer to drop the lexeme without producing anything. Importantly we preserve the current indentation level as the lexer’s mode going forward.

True, _ -> {
  let spaces = string.length(lexeme) - 1

  case int.compare(spaces, current_indent) {
    Lt -> Keep(Dedent(spaces), spaces)
    Eq if spaces == 0 -> Drop(0)
    Eq -> Keep(Indent(spaces), spaces)
    Gt -> Keep(Indent(spaces), spaces)
  }
}

This is the actual indentation logic; we hit this branch when the lookahead is any other grapheme. To get the new indentation level we just need to count the length of the string, remembering to subtract one for the leading newline. By comparing the new indentation to the current, we can work out what token to keep.
- If the new indentation is less than the current, we keep a Dedent token and update the lexer’s mode to the new indentation level.
- If the new indentation is is equal to the current but that level is zero, then we drop the lexeme without producing any indentation token at all.
- Otherwise, if the new indentation is equal or greater than the current, we keep an Indent token and update the lexer’s mode to the new indentation level.

Lexer Modes

Indentation Sensitivity

String Interpolation