Lexer Modes

Up until now we have been running our lexer using lexer.simple. As the name implies, this is the simplest way to use Nibble’s lexer and it is context-free. Where possible we should try to stick to these simple lexers, but sometimes we need to be able to lex things that are context-sensitive. That’s where lexer modes come in!

Indentation Sensitivity

Let’s imagine we’re writing a lexer for a Python-ish programming language and we want to produce Indent and Dedent tokens to represent indentation. We might define our tokens like this:

pub type TokenT {
  Var(String)
  Str(String)
  Num(Int)

  // Keywords
  Def
  For
  In
  Print

  // Indentation
  Indent(Int)
  Dedent(Int)
}

A chunk of code in this language might look like this:

def wibble arr
  for x in arr
    print x

  print "done!"

def wobble
  wibble [1, 2, 3]

Changing the indentation would change the meaning of this program, so we need to know whether we are inside a block of indented code. Our Indent and Dedent tokens carry the level of indentation they represent, so that when we come to parsing we can make sure everything is valid. But how do we produce the tokens in the first place?

We’ll need to do two things: (1) write a custom matcher using lexer.custom and (2) store the current indentation level as the lexer’s mode.

pub opaque type Lexer(a, mode)
pub opaque type Matcher(a, mode)

Modes allow us to choose different matchers for different contexts, or inject state into our matchers. For our indentation-sensitive lexer, that means we’ll end up with Lexer and Matcher types like this:

type Lexer = lexer.Lexer(TokenT, Int)
type Matcher = lexer.Matcher(TokenT, Int)

To write our indentation matcher, we’ll count the number of spaces that immediately follow a newline and compare that to the current indentation level. If that number is less than the current indentation level, we’ll produce a Dedent token, otherwise we’ll produce an Indent token. In either case we’ll also update the lexer’s mode with the new indentation level for subsequent lines.

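// Imports this matcher relies on. These are assumed from the names used
// below: the Match constructors come from nibble/lexer.
import gleam/int
import gleam/order.{Eq, Gt, Lt}
import gleam/regex
import gleam/string
import nibble/lexer.{Drop, Keep, NoMatch, Skip}

fn indentation() -> lexer.Matcher(TokenT, Int) {
  let assert Ok(is_indent) = regex.from_string("^\\n[ \\t]*")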
  use current_indent, lexeme, lookahead <- lexer.custom

  case regex.check(is_indent, lexeme), lookahead {
    False, _ -> NoMatch
    True, " " | True, "\t" -> Skip
    True, "\n" -> Drop(current_indent)
    True, _ -> {
      let spaces = string.length(lexeme) - 1

      case int.compare(spaces, current_indent) {
        Lt -> Keep(Dedent(spaces), spaces)
        Eq if spaces == 0 -> Drop(0)
        Eq -> Keep(Indent(spaces), spaces)
        Gt -> Keep(Indent(spaces), spaces)
      }
    }
  }
}

There’s actually a little more going on here than I just described, so let’s break the pattern matching down case by case.

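// The lexeme is not a newline followed by whitespace, so this matcher
// does not apply.
False, _ -> NoMatch
// More whitespace follows: keep consuming before deciding anything.
True, " " | True, "\t" -> Skip
// The line was blank: discard it and leave the indentation level alone.
True, "\n" -> Drop(current_indent)
// The first non-whitespace character is next, so compare levels.
True, _ -> {
  // Everything after the leading newline counts as indentation.
  let spaces = string.length(lexeme) - 1

  case int.compare(spaces, current_indent) {
    Lt -> Keep(Dedent(spaces), spaces)
    // Still at the top level: no token needed.
    Eq if spaces == 0 -> Drop(0)
    // Same or deeper level: emit Indent and remember the new level.
    Eq -> Keep(Indent(spaces), spaces)
    Gt -> Keep(Indent(spaces), spaces)
  }
}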
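
To see the comparison logic on its own, here is a minimal sketch that doesn't touch Nibble at all. It walks a source string line by line and threads the current indentation level through a fold, much like the lexer threads its mode. The names IndentToken, leading_spaces, step, and indentation_tokens are hypothetical and exist only for this illustration; the sketch also counts spaces only and ignores tabs, which is a simplification of the matcher above.

```gleam
import gleam/int
import gleam/list
import gleam/order.{Eq, Gt, Lt}
import gleam/string

// Hypothetical stand-in for the Indent/Dedent constructors of TokenT.
pub type IndentToken {
  Indent(Int)
  Dedent(Int)
}

// Count the spaces at the start of a line (tabs are ignored in this sketch).
fn leading_spaces(line: String) -> Int {
  case line {
    " " <> rest -> 1 + leading_spaces(rest)
    _ -> 0
  }
}

// Decide what to emit for one line given the previous indentation level.
// Returns the token (if any) together with the new level.
fn step(current: Int, line: String) -> #(Result(IndentToken, Nil), Int) {
  case string.trim(line) {
    // Blank line: emit nothing and keep the current level.
    "" -> #(Error(Nil), current)
    _ -> {
      let spaces = leading_spaces(line)
      case int.compare(spaces, current) {
        Lt -> #(Ok(Dedent(spaces)), spaces)
        // Still at the top level: nothing to emit.
        Eq if spaces == 0 -> #(Error(Nil), 0)
        Eq | Gt -> #(Ok(Indent(spaces)), spaces)
      }
    }
  }
}

// Fold over the lines, carrying the indentation level as state.
pub fn indentation_tokens(source: String) -> List(IndentToken) {
  let #(tokens, _) =
    source
    |> string.split("\n")
    |> list.fold(#([], 0), fn(acc, line) {
      let #(tokens, current) = acc
      case step(current, line) {
        #(Ok(token), next) -> #([token, ..tokens], next)
        #(Error(Nil), next) -> #(tokens, next)
      }
    })
  list.reverse(tokens)
}
```

Run over the example program from earlier, this yields Indent(2), Indent(4), Dedent(2), Dedent(0), Indent(2). A line at the same non-zero level as the previous one still produces an Indent token, which matches the Eq branch of the matcher.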

String Interpolation
