nibble/lexer
Nibble takes a different approach from many other parser combinator libraries by also providing a lexer combinator module that you use to turn an input string into a list of tokens.
Parser combinators are a powerful and flexible way to build parsers, but they often come at a performance cost compared to hand-written parsers or parser generators. On the other hand, writing a lexer by hand can be a bit tedious and difficult. Nibble aims to provide a happy middle ground by making it easy to produce decent-performing lexers and then use parser combinators that can be much faster when working on the smaller token stream.
To see how Nibble’s lexer works, let’s consider the example from the introduction guide:
fn lexer() {
  lexer.simple([
    lexer.token("hello", Hello),
    lexer.variable("[a-zA-Z]", "\\w", Name),
    lexer.whitespace(Nil)
      |> lexer.ignore,
  ])
}
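This example assumes a token type with Hello and Name constructors, something along these lines (a sketch; the introduction guide defines its own):
pub type TokenT {
  Hello
  Name(String)
}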
We have three matchers here. One for the exact token “hello”, one for at least one letter followed by any number of word characters, and one for any amount of whitespace.
To see how these matchers work we’ll look at the input string "Hello Joe".
Nibble looks at the input string one grapheme at a time, and runs each of
the matchers in order. At the same time, it accumulates a list of the tokens
it has produced so far.
Tokens : []
Input  : Hello Joe
         ^
If no matcher matches the input, Nibble will add the current grapheme to its accumulated input and move on to the next one:
Tokens : []
Input  : Hello Joe
         -^
This continues until a matcher does match the input:
Tokens : []
Input  : Hello Joe
         ----^
The accumulated string (known as a lexeme) is consumed, and whatever token value the matcher produces is added to the list of tokens:
Tokens : [Hello]
Input  :  Joe
         ^
Here we have some whitespace that would be matched by lexer.whitespace, but we passed that matcher to the lexer.ignore combinator. The matcher will still consume the input, but this time it will not produce a new token value:
Tokens : [Hello]
Input  : Joe
         ^
As we expect, the lexer continues accumulating input. When it reaches the end of the input string with a value, it checks all the matchers one last time to see if they will produce a final token. In this case the lexer.variable matcher will match the accumulated “Joe” and we’re left with the following:
Tokens : [Hello, Name("Joe")]
Input  :
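Putting it together, running this lexer over the whole input might look like the sketch below, assuming the TokenT type above; lexer.run is the simple-lexer counterpart of run_advanced listed further down.
import nibble/lexer

pub fn tokens() {
  // A successful run gives back the list of tokens traced above; a failure
  // would be an Error(NoMatchFound(..)) pointing at the offending lexeme.
  lexer.run("Hello Joe", lexer())
}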
Types
pub type Error {
  NoMatchFound(row: Int, col: Int, lexeme: String)
}
Constructors
- NoMatchFound(row: Int, col: Int, lexeme: String)
When writing a custom matcher, a Match is what you return to tell the lexer what to do next.
pub type Match(a, mode) {
  Keep(a, mode)
  Skip
  Drop(mode)
  NoMatch
}
Constructors
- Keep(a, mode)
  Consume the accumulated input and produce a token with the given value. A Keep match can also transition the lexer into a new mode.
- Skip
  Skip running any additional matchers this iteration, add the next grapheme to the accumulated input, and run the next iteration.
- Drop(mode)
  Drop the accumulated input and move on to the next iteration. A Drop match can also transition the lexer into a new mode. This match is useful for discarding input like whitespace or comments.
- NoMatch
  The matcher did not match the input, so the lexer should try the next matcher in the list (or fail if there are no more matchers).
A Matcher is how we define the rules that match parts of the input string and turn them into tokens. At its core, a Matcher is a function that takes three arguments:
- The current mode of the lexer
- Any input we’ve accumulated so far
- A lookahead of one grapheme
With just these three arguments we can define arbitrary rules for consuming (or not) input and producing tokens!
pub opaque type Matcher(a, mode)
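For example, here is a sketch of an exact-string matcher built with lexer.custom (documented below); the function receives those three arguments and answers with a Match:
import nibble/lexer

// A hand-rolled version of lexer.token as a sketch: keep the current mode and
// produce `value` once the accumulated lexeme is exactly `str`.
fn exact(str: String, value: a) -> lexer.Matcher(a, mode) {
  lexer.custom(fn(mode, lexeme, _lookahead) {
    case lexeme == str {
      True -> lexer.Keep(value, mode)
      False -> lexer.NoMatch
    }
  })
}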
A source span is a range into the source string that represents the start and end of a lexeme in a human-readable way. That means instead of a straight index into the string, you get a row and column for the start and end!
pub type Span {
  Span(
    row_start: Int,
    col_start: Int,
    row_end: Int,
    col_end: Int,
  )
}
Constructors
- Span(row_start: Int, col_start: Int, row_end: Int, col_end: Int)
You use Nibble’s lexer to turn a string into a list of tokens that your parser will eventually consume. The Token type contains the lexeme that was consumed (aka the raw input string), the source Span of the consumed lexeme to locate it in the source, and whatever token value your lexer produces.
pub type Token(a) {
  Token(span: Span, lexeme: String, value: a)
}
Constructors
- Token(span: Span, lexeme: String, value: a)
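As a sketch of how you might use these fields, here is a small helper that pulls the lexeme and its starting row back out of a token, say for error reporting (the helper name is just for illustration):
import gleam/int
import nibble/lexer

fn describe(token: lexer.Token(a)) -> String {
  // The concrete row and column values depend on where the lexeme sat in the
  // source string.
  let lexer.Token(span: lexer.Span(row_start: row, ..), lexeme: lexeme, ..) = token
  "'" <> lexeme <> "' on line " <> int.to_string(row)
}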
Functions
pub fn advanced(
  matchers: fn(a) -> List(Matcher(b, a)),
) -> Lexer(b, a)
An advanced lexer is one that can change what matchers it uses based on the current mode. This is useful for sophisticated lexers that might need to handle things like interpolated strings or indentation-sensitive syntax.
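For instance, here is a sketch of a lexer that switches into a separate mode inside double-quoted strings. The Mode and Tok types, and the particular matchers chosen, are assumptions made for the example rather than anything the library prescribes, and escape sequences and other edge cases are ignored to keep it short:
import gleam/set
import nibble/lexer

pub type Mode {
  Code
  InString
}

pub type Tok {
  Word(String)
  StringLit(String)
}

fn string_lexer() -> lexer.Lexer(Tok, Mode) {
  lexer.advanced(fn(mode) {
    case mode {
      Code -> [
        // An opening quote is dropped and switches the lexer into InString.
        lexer.token("\"", Nil)
          |> lexer.ignore
          |> lexer.into(fn(_) { InString }),
        lexer.identifier("[a-z]", "[a-z]", set.new(), Word),
        lexer.whitespace(Nil)
          |> lexer.ignore,
      ]
      InString -> [
        // Everything up to the closing quote becomes a single StringLit.
        lexer.keep(fn(lexeme, lookahead) {
          case lookahead {
            "\"" -> Ok(StringLit(lexeme))
            _ -> Error(Nil)
          }
        }),
        // The closing quote is dropped and returns the lexer to Code mode.
        lexer.token("\"", Nil)
          |> lexer.ignore
          |> lexer.into(fn(_) { Code }),
      ]
    }
  })
}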
pub fn custom(
  f: fn(a, String, String) -> Match(b, a),
) -> Matcher(b, a)
pub fn drop(f: fn(String, String) -> Bool) -> Matcher(a, b)
Create a custom Matcher that will consume the input and move on to the next iteration without producing a token when the given function returns True, or report a NoMatch when it returns False. The first parameter of the given function is the current lexeme and the second parameter is a one-grapheme lookahead.
Matchers created with this convenience function cannot change the lexer’s mode or skip ahead to the next iteration without consuming the input.
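As an illustration, a space-dropping matcher built with lexer.drop might look like this (just a sketch; the library’s own whitespace matcher combined with ignore already covers this case):
import gleam/string
import nibble/lexer

fn spaces() -> lexer.Matcher(a, mode) {
  lexer.drop(fn(lexeme, lookahead) {
    // Match once the lexeme is nothing but whitespace and the next grapheme
    // is not another space, so a whole run gets dropped in one go.
    lexeme != "" && string.trim(lexeme) == "" && lookahead != " "
  })
}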
pub fn float_with_separator(
  separator: String,
  to_value: fn(Float) -> a,
) -> Matcher(a, b)
pub fn identifier(
  start: String,
  inner: String,
  reserved: Set(String),
  to_value: fn(String) -> a,
) -> Matcher(a, b)
pub fn ignore(matcher: Matcher(a, b)) -> Matcher(c, b)
Take a matcher that might Keep anything and silently Drop anything it produces instead. This is useful for things like whitespace or comments where you want to consume some input but you don’t want to emit a token.
pub fn int_with_separator(
  separator: String,
  to_value: fn(Int) -> a,
) -> Matcher(a, b)
pub fn into(
  matcher: Matcher(a, b),
  f: fn(b) -> b,
) -> Matcher(a, b)
Take an existing matcher and transition to a new mode. This only runs if the matcher is successful and either Keeps or Drops a value.
pub fn keep(
  f: fn(String, String) -> Result(a, Nil),
) -> Matcher(a, b)
Create a custom Matcher that will consume the input and produce a token with the returned value when the given function returns Ok, or report a NoMatch when it returns an Error. The first parameter of the given function is the current lexeme and the second parameter is a one-grapheme lookahead.
Matchers created with this convenience function cannot change the lexer’s mode or skip ahead to the next iteration without consuming the input.
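For example, a matcher for integer literals could be sketched with lexer.keep like this (the NumToken type and its Num constructor are assumptions made for the example):
import gleam/int
import gleam/result
import nibble/lexer

pub type NumToken {
  Num(Int)
}

// Only produce a token once the lookahead is no longer a digit, so the whole
// run of digits ends up in a single lexeme.
fn num() -> lexer.Matcher(NumToken, mode) {
  lexer.keep(fn(lexeme, lookahead) {
    case int.parse(lookahead) {
      // The next grapheme is still a digit: no match yet, keep accumulating.
      Ok(_) -> Error(Nil)
      // Otherwise try to turn the accumulated lexeme into a Num token.
      Error(_) -> int.parse(lexeme) |> result.map(Num)
    }
  })
}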
pub fn keyword(
  str: String,
  breaker: String,
  value: a,
) -> Matcher(a, b)
Match exactly the given string only when the lookahead is matched by the given breaker regex. Keywords are exact strings like let, but you wouldn’t want to lex letter as [Let, Var("tter")], so the breaker lets you say what characters should trigger a match.
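For instance, a hypothetical Let token might require that the next grapheme is not a word character, so letter is left for an identifier matcher instead (the breaker regex here is just one plausible choice):
lexer.keyword("let", "[^\\w]", Let)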
pub fn map(
  matcher: Matcher(a, b),
  f: fn(a) -> c,
) -> Matcher(c, b)
Take an existing matcher and transform it by applying a function to the value it produces.
pub fn number_with_separator(
  separator: String,
  from_int: fn(Int) -> a,
  from_float: fn(Float) -> a,
) -> Matcher(a, b)
pub fn run_advanced(
  source: String,
  mode: a,
  lexer: Lexer(b, a),
) -> Result(List(Token(b)), Error)
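A usage sketch, continuing the string_lexer example from advanced above and assuming the mode argument is the mode the lexer starts in:
import nibble/lexer

fn example() {
  // Start in Code mode; the lexer switches itself into InString and back as
  // it encounters quotes.
  lexer.run_advanced("print \"hello\"", Code, string_lexer())
}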
pub fn symbol(
  str: String,
  breaker: String,
  value: a,
) -> Matcher(a, b)
Match exactly the given string only when the lookahead is matched by the given breaker regex. This is an alias of keyword but it can be helpful to separate the two concepts.
pub fn then(
  matcher: Matcher(a, b),
  f: fn(a) -> Match(c, b),
) -> Matcher(c, b)
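Judging by its type, then chains a decision onto an existing matcher: the given function receives the value the matcher produced and returns a new Match, so it can keep, drop, or reject based on that value. A sketch, reusing the num matcher from the lexer.keep example above and assuming a simple lexer whose mode is Nil:
import nibble/lexer

// Only keep single-digit numbers; anything larger becomes NoMatch so a later
// matcher can have a go at the same input.
fn small_num() -> lexer.Matcher(NumToken, Nil) {
  num()
  |> lexer.then(fn(token) {
    case token {
      Num(n) if n < 10 -> lexer.Keep(token, Nil)
      _ -> lexer.NoMatch
    }
  })
}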
pub fn token(str: String, value: a) -> Matcher(a, b)
Match exactly the given string with no lookahead and produce the given value.
pub fn try_identifier(
  start: String,
  inner: String,
  reserved: Set(String),
  to_value: fn(String) -> a,
) -> Result(Matcher(a, b), CompileError)
pub fn whitespace(token: a) -> Matcher(a, b)