chomp/lexer
Chomp takes a different approach to many other parser combinator libraries by also providing a lexer combinator module that you use to turn an input string into a list of tokens.
Parser combinators are a powerful and flexible way to build parsers, but they offer come at a performance cost compared to hand-written parsers or parser generators. On the other hand, writing a lexer by hand can be a bit tedious and difficult.
Chomp aims to provide a happy middle-ground by making it easy to produce OK-performing lexers and then use parser combinators that can be much faster working on the smaller token stream.
Types
pub type Error {
NoMatchFound(row: Int, col: Int, lexeme: String)
}
Constructors
-
NoMatchFound(row: Int, col: Int, lexeme: String)
When writing a custom matcher, a Match
is what you return to tell the lexer
what to do next.
pub type Match(a, mode) {
Keep(a, mode)
Skip(mode)
Drop(mode)
NoMatch
}
Constructors
-
Keep(a, mode)
Consume the accumulated input and produce a token with the given value. A
Keep
match can also transition the lexer into a new mode. -
Skip(mode)
Skip running any additional matchers this iteration, add the next grapheme to the accumulated input, and run the next iteration. A
Skip
match can also transition the lexer into a new mode. -
Drop(mode)
Drop the accumulated input and move on to the next iteration. A
Drop
match can also transition the lexer into a new mode. This match is useful for discarding input like whitespace or comments. -
NoMatch
The matcher did not match the input, so the lexer should try the next matcher in the list (or fail if there are no more matchers).
A Matcher
is how we define the rules that match parts of the input string
and turn them into tokens. At it’s core, a Match
is a function that takes
three arguments:
-
The current mode of the lexer
-
Any input we’ve accumulated so far
-
A lookahead of one grapheme
With just these three arguments we can define arbitrary rules for consuming (or not) input and producing tokens!
pub opaque type Matcher(a, mode)
You use chomp’s lexer to turn a string into a list of tokens that your parser
will eventually consume. The Token
type contains the lexeme that was consumed
(aka the raw input string), the source Span
of the
consumed lexeme to locate it in the source, and whatever token value your lexer
produces.
pub type Token(a) {
Token(span: Span, lexeme: String, value: a)
}
Constructors
-
Token(span: Span, lexeme: String, value: a)
Functions
pub fn advanced(
matchers: fn(a) -> List(Matcher(b, a)),
) -> Lexer(b, a)
An advanced
lexer is one that can change what matchers it uses based on the
current mode. This is useful for sophisticated lexers that might need to
handle things like interpolated strings or indentation-sensitive syntax.
pub fn custom(
f: fn(a, String, String) -> Match(b, a),
) -> Matcher(b, a)
pub fn drop(f: fn(String, String) -> Bool) -> Matcher(a, b)
Create a custom Matcher
that will consume the input and move
to the next iteration without producing a token if it is True
or return a
NoMatch
if it fails. The first parameter is a function that takes the
current lexeme and the second parameter is a one-grapheme lookahead.
Matchers created with this convienence function cannot change the lexer’s mode or skip ahead to the next iteration without consuming the input.
pub fn float_with_separator(
separator: String,
to_value: fn(Float) -> a,
) -> Matcher(a, b)
pub fn identifier(
start: String,
inner: String,
reserved: Set(String),
to_value: fn(String) -> a,
) -> Matcher(a, b)
pub fn ignore(matcher: Matcher(a, b)) -> Matcher(c, b)
Take a matcher than might Keep
anything and silently Drop
anything it
produces instead. This is useful for things like whitespace or comments
where you want to consume some input but you don’t want to emit a token.
pub fn int_with_separator(
separator: String,
to_value: fn(Int) -> a,
) -> Matcher(a, b)
pub fn into(
matcher: Matcher(a, b),
f: fn(b) -> b,
) -> Matcher(a, b)
Take an existing matcher and transition to a new mode. This only runs if
the matcher is successful and either Keep
s or Drop
s a value.
pub fn keep(
f: fn(String, String) -> Result(a, Nil),
) -> Matcher(a, b)
Create a custom Matcher
that will consume the input and produce
a token with the given value if it is Ok
or return a NoMatch
if it fails.
The first parameter is a function that takes the current lexeme and the
second parameter is a one-grapheme lookahead.
Matchers created with this convienence function cannot change the lexer’s mode or skip ahead to the next iteration without consuming the input.
pub fn keyword(
str: String,
breaker: String,
value: a,
) -> Matcher(a, b)
Match exactly the given string only when the lookahead is matched by the given
breaker regex. Keywords are exact strings like let
but you wouldn’t want
to lex letter
as [Let, Var("tter")]
so the breaker is used so you can say
what characters should trigger a match.
pub fn map(
matcher: Matcher(a, b),
f: fn(a) -> c,
) -> Matcher(c, b)
Take an existing matcher and transform it by applying a function to the value it produces.
pub fn number_with_separator(
separator: String,
from_int: fn(Int) -> a,
from_float: fn(Float) -> a,
) -> Matcher(a, b)
pub fn run_advanced(
source: String,
mode: a,
lexer: Lexer(b, a),
) -> Result(List(Token(b)), Error)
pub fn symbol(
str: String,
breaker: String,
value: a,
) -> Matcher(a, b)
Match exactly the given string only when the lookahead is matched by the given
breaker regex. This is an alias of keyword
but it can be
helpful to separate the two concepts.
pub fn then(
matcher: Matcher(a, b),
f: fn(a) -> Match(c, b),
) -> Matcher(c, b)
pub fn token(str: String, value: a) -> Matcher(a, b)
Match exactly the given string with no lookahead and produce the given value.
pub fn try_identifier(
start: String,
inner: String,
reserved: Set(String),
to_value: fn(String) -> a,
) -> Result(Matcher(a, b), CompileError)
pub fn whitespace(token: a) -> Matcher(a, b)