nibble/lexer
Nibble takes a different approach from many other parser combinator libraries by also providing a lexer combinator module that you use to turn an input string into a list of tokens.
Parser combinators are a powerful and flexible way to build parsers, but they often come at a performance cost compared to hand-written parsers or parser generators. On the other hand, writing a lexer by hand can be a bit tedious and difficult. Nibble aims to provide a happy middle ground by making it easy to produce decent-performing lexers and then use parser combinators that can be much faster when working on the smaller token stream.
To see how Nibble’s lexer works, let’s consider the example from the introduction guide:
fn lexer() {
  lexer.simple([
    lexer.token("hello", Hello),
    lexer.variable("[a-zA-Z]", "\\w", Name),
    lexer.whitespace(Nil)
      |> lexer.ignore,
  ])
}
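This example assumes a token type with Hello and Name constructors, something along these lines (a sketch; the introduction guide defines its own):
pub type TokenT {
  Hello
  Name(String)
}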
We have three matchers here. One for the exact token “hello”, one for at least one letter followed by any number of word characters, and one for any amount of whitespace.
To see how these matchers work we’ll look at the input string "Hello Joe".
Nibble looks at the input string one grapheme at a time, and runs each of
the matchers in order. At the same time, it accumulates a list of the tokens
it has produced so far.
Tokens : []
Input  : Hello Joe
         ^
If no matcher matches the input, Nibble will add the current grapheme to its accumulated input and move on to the next one:
Tokens : []
Input  : Hello Joe
         -^
This continues until a matcher does match the input:
Tokens : []
Input  : Hello Joe
         ----^
The accumulated string (known as a lexeme) is consumed, and whatever token value the matcher produces is added to the list of tokens:
Tokens : [Hello]
Input  :  Joe
         ^
Here we have some whitespace that would be matched by lexer.whitespace, but we passed that matcher to the lexer.ignore combinator. The matcher will still consume the input, but this time it will not produce a new token value:
Tokens : [Hello]
Input  : Joe
         ^
As we expect, the lexer continues accumulating input. When it reaches the end of the input string with a value, it checks all the matchers one last time to see if they will produce a final token. In this case the lexer.variable matcher will match the accumulated “Joe” and we’re left with the following:
Tokens : [Hello, Name("Joe")]
Input  :
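Putting it together, running this lexer over the whole input might look like the sketch below, assuming the TokenT type above; lexer.run is the simple-lexer counterpart of run_advanced listed further down.
import nibble/lexer

pub fn tokens() {
  // A successful run gives back the list of tokens traced above; a failure
  // would be an Error(NoMatchFound(..)) pointing at the offending lexeme.
  lexer.run("Hello Joe", lexer())
}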
Types
pub type Error {
  NoMatchFound(row: Int, col: Int, lexeme: String)
}
Constructors
- NoMatchFound(row: Int, col: Int, lexeme: String)
When writing a custom matcher, a Match is what you return to tell the lexer what to do next.
pub type Match(a, mode) {
  Keep(a, mode)
  Skip
  Drop(mode)
  NoMatch
}
Constructors
- Keep(a, mode)
  Consume the accumulated input and produce a token with the given value. A Keep match can also transition the lexer into a new mode.
- Skip
  Skip running any additional matchers this iteration, add the next grapheme to the accumulated input, and run the next iteration.
- Drop(mode)
  Drop the accumulated input and move on to the next iteration. A Drop match can also transition the lexer into a new mode. This match is useful for discarding input like whitespace or comments.
- NoMatch
  The matcher did not match the input, so the lexer should try the next matcher in the list (or fail if there are no more matchers).
A Matcher is how we define the rules that match parts of the input string and turn them into tokens. At its core, a Matcher is a function that takes three arguments:
- The current mode of the lexer
- Any input we’ve accumulated so far
- A lookahead of one grapheme
With just these three arguments we can define arbitrary rules for consuming (or not) input and producing tokens!
pub opaque type Matcher(a, mode)
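For example, here is a sketch of an exact-string matcher built with lexer.custom (documented below); the function receives those three arguments and answers with a Match:
import nibble/lexer

// A hand-rolled version of lexer.token as a sketch: keep the current mode and
// produce `value` once the accumulated lexeme is exactly `str`.
fn exact(str: String, value: a) -> lexer.Matcher(a, mode) {
  lexer.custom(fn(mode, lexeme, _lookahead) {
    case lexeme == str {
      True -> lexer.Keep(value, mode)
      False -> lexer.NoMatch
    }
  })
}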
A source span is a range into the source string that represents the start and end of a lexeme in a human-readable way. That means instead of a straight index into the string, you get a row and column for the start and end!
pub type Span {
  Span(
    row_start: Int,
    col_start: Int,
    row_end: Int,
    col_end: Int,
  )
}
Constructors
- Span(row_start: Int, col_start: Int, row_end: Int, col_end: Int)
You use Nibble’s lexer to turn a string into a list of tokens that your parser will eventually consume. The Token type contains the lexeme that was consumed (aka the raw input string), the source Span of the consumed lexeme to locate it in the source, and whatever token value your lexer produces.
pub type Token(a) {
  Token(span: Span, lexeme: String, value: a)
}
Constructors
- Token(span: Span, lexeme: String, value: a)
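As a sketch of how you might use these fields, here is a small helper that pulls the lexeme and its starting row back out of a token, say for error reporting (the helper name is just for illustration):
import gleam/int
import nibble/lexer

fn describe(token: lexer.Token(a)) -> String {
  // The concrete row and column values depend on where the lexeme sat in the
  // source string.
  let lexer.Token(span: lexer.Span(row_start: row, ..), lexeme: lexeme, ..) = token
  "'" <> lexeme <> "' on line " <> int.to_string(row)
}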
Functions
pub fn advanced(
  matchers: fn(a) -> List(Matcher(b, a)),
) -> Lexer(b, a)
An advanced lexer is one that can change what matchers it uses based on the current mode. This is useful for sophisticated lexers that might need to handle things like interpolated strings or indentation-sensitive syntax.
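For instance, here is a sketch of a lexer that switches into a separate mode inside double-quoted strings. The Mode and Tok types, and the particular matchers chosen, are assumptions made for the example rather than anything the library prescribes, and escape sequences and other edge cases are ignored to keep it short:
import gleam/set
import nibble/lexer

pub type Mode {
  Code
  InString
}

pub type Tok {
  Word(String)
  StringLit(String)
}

fn string_lexer() -> lexer.Lexer(Tok, Mode) {
  lexer.advanced(fn(mode) {
    case mode {
      Code -> [
        // An opening quote is dropped and switches the lexer into InString.
        lexer.token("\"", Nil)
          |> lexer.ignore
          |> lexer.into(fn(_) { InString }),
        lexer.identifier("[a-z]", "[a-z]", set.new(), Word),
        lexer.whitespace(Nil)
          |> lexer.ignore,
      ]
      InString -> [
        // Everything up to the closing quote becomes a single StringLit.
        lexer.keep(fn(lexeme, lookahead) {
          case lookahead {
            "\"" -> Ok(StringLit(lexeme))
            _ -> Error(Nil)
          }
        }),
        // The closing quote is dropped and returns the lexer to Code mode.
        lexer.token("\"", Nil)
          |> lexer.ignore
          |> lexer.into(fn(_) { Code }),
      ]
    }
  })
}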
pub fn custom(
  f: fn(a, String, String) -> Match(b, a),
) -> Matcher(b, a)
pub fn drop(f: fn(String, String) -> Bool) -> Matcher(a, b)
Create a custom Matcher that will consume the input and move on to the next iteration without producing a token when the given function returns True, or report a NoMatch when it returns False. The first parameter of the given function is the current lexeme and the second parameter is a one-grapheme lookahead.
Matchers created with this convenience function cannot change the lexer’s mode or skip ahead to the next iteration without consuming the input.
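As an illustration, a space-dropping matcher built with lexer.drop might look like this (just a sketch; the library’s own whitespace matcher combined with ignore already covers this case):
import gleam/string
import nibble/lexer

fn spaces() -> lexer.Matcher(a, mode) {
  lexer.drop(fn(lexeme, lookahead) {
    // Match once the lexeme is nothing but whitespace and the next grapheme
    // is not another space, so a whole run gets dropped in one go.
    lexeme != "" && string.trim(lexeme) == "" && lookahead != " "
  })
}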
pub fn float_with_separator(
  separator: String,
  to_value: fn(Float) -> a,
) -> Matcher(a, b)
pub fn identifier(
  start: String,
  inner: String,
  reserved: Set(String),
  to_value: fn(String) -> a,
) -> Matcher(a, b)
pub fn ignore(matcher: Matcher(a, b)) -> Matcher(c, b)
Take a matcher that might Keep anything and silently Drop anything it produces instead. This is useful for things like whitespace or comments where you want to consume some input but you don’t want to emit a token.
pub fn int_with_separator(
  separator: String,
  to_value: fn(Int) -> a,
) -> Matcher(a, b)
pub fn into(
  matcher: Matcher(a, b),
  f: fn(b) -> b,
) -> Matcher(a, b)
Take an existing matcher and transition to a new mode. This only runs if the matcher is successful and either Keeps or Drops a value.
pub fn keep(
  f: fn(String, String) -> Result(a, Nil),
) -> Matcher(a, b)
Create a custom Matcher that will consume the input and produce a token with the returned value when the given function returns Ok, or report a NoMatch when it returns an Error. The first parameter of the given function is the current lexeme and the second parameter is a one-grapheme lookahead.
Matchers created with this convenience function cannot change the lexer’s mode or skip ahead to the next iteration without consuming the input.
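For example, a matcher for integer literals could be sketched with lexer.keep like this (the NumToken type and its Num constructor are assumptions made for the example):
import gleam/int
import gleam/result
import nibble/lexer

pub type NumToken {
  Num(Int)
}

// Only produce a token once the lookahead is no longer a digit, so the whole
// run of digits ends up in a single lexeme.
fn num() -> lexer.Matcher(NumToken, mode) {
  lexer.keep(fn(lexeme, lookahead) {
    case int.parse(lookahead) {
      // The next grapheme is still a digit: no match yet, keep accumulating.
      Ok(_) -> Error(Nil)
      // Otherwise try to turn the accumulated lexeme into a Num token.
      Error(_) -> int.parse(lexeme) |> result.map(Num)
    }
  })
}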
pub fn keyword(
  str: String,
  breaker: String,
  value: a,
) -> Matcher(a, b)
Match exactly the given string only when the lookahead is matched by the given breaker regex. Keywords are exact strings like let, but you wouldn’t want to lex letter as [Let, Var("tter")], so the breaker lets you say what characters should trigger a match.
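For instance, a hypothetical Let token might require that the next grapheme is not a word character, so letter is left for an identifier matcher instead (the breaker regex here is just one plausible choice):
lexer.keyword("let", "[^\\w]", Let)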
pub fn map(
  matcher: Matcher(a, b),
  f: fn(a) -> c,
) -> Matcher(c, b)
Take an existing matcher and transform it by applying a function to the value it produces.
pub fn number_with_separator(
  separator: String,
  from_int: fn(Int) -> a,
  from_float: fn(Float) -> a,
) -> Matcher(a, b)
pub fn run_advanced(
  source: String,
  mode: a,
  lexer: Lexer(b, a),
) -> Result(List(Token(b)), Error)
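A usage sketch, continuing the string_lexer example from advanced above and assuming the mode argument is the mode the lexer starts in:
import nibble/lexer

fn example() {
  // Start in Code mode; the lexer switches itself into InString and back as
  // it encounters quotes.
  lexer.run_advanced("print \"hello\"", Code, string_lexer())
}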
pub fn symbol(
  str: String,
  breaker: String,
  value: a,
) -> Matcher(a, b)
Match exactly the given string only when the lookahead is matched by the given breaker regex. This is an alias of keyword but it can be helpful to separate the two concepts.
pub fn then(
  matcher: Matcher(a, b),
  f: fn(a) -> Match(c, b),
) -> Matcher(c, b)
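Judging by its type, then chains a decision onto an existing matcher: the given function receives the value the matcher produced and returns a new Match, so it can keep, drop, or reject based on that value. A sketch, reusing the num matcher from the lexer.keep example above and assuming a simple lexer whose mode is Nil:
import nibble/lexer

// Only keep single-digit numbers; anything larger becomes NoMatch so a later
// matcher can have a go at the same input.
fn small_num() -> lexer.Matcher(NumToken, Nil) {
  num()
  |> lexer.then(fn(token) {
    case token {
      Num(n) if n < 10 -> lexer.Keep(token, Nil)
      _ -> lexer.NoMatch
    }
  })
}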
pub fn token(str: String, value: a) -> Matcher(a, b)
Match exactly the given string with no lookahead and produce the given value.
pub fn try_identifier(
  start: String,
  inner: String,
  reserved: Set(String),
  to_value: fn(String) -> a,
) -> Result(Matcher(a, b), CompileError)
pub fn whitespace(token: a) -> Matcher(a, b)