chomp/lexer

Chomp takes a different approach to many other parser combinator libraries by also providing a lexer combinator module that you use to turn an input string into a list of tokens.

Parser combinators are a powerful and flexible way to build parsers, but they offer come at a performance cost compared to hand-written parsers or parser generators. On the other hand, writing a lexer by hand can be a bit tedious and difficult.

Chomp aims to provide a happy middle-ground by making it easy to produce OK-performing lexers and then use parser combinators that can be much faster working on the smaller token stream.

Types

pub type Error {
  NoMatchFound(row: Int, col: Int, lexeme: String)
}

Constructors

  • NoMatchFound(row: Int, col: Int, lexeme: String)
pub opaque type Lexer(a, mode)

When writing a custom matcher, a Match is what you return to tell the lexer what to do next.

pub type Match(a, mode) {
  Keep(a, mode)
  Skip(mode)
  Drop(mode)
  NoMatch
}

Constructors

  • Keep(a, mode)

    Consume the accumulated input and produce a token with the given value. A Keep match can also transition the lexer into a new mode.

  • Skip(mode)

    Skip running any additional matchers this iteration, add the next grapheme to the accumulated input, and run the next iteration. A Skip match can also transition the lexer into a new mode.

  • Drop(mode)

    Drop the accumulated input and move on to the next iteration. A Drop match can also transition the lexer into a new mode. This match is useful for discarding input like whitespace or comments.

  • NoMatch

    The matcher did not match the input, so the lexer should try the next matcher in the list (or fail if there are no more matchers).

A Matcher is how we define the rules that match parts of the input string and turn them into tokens. At it’s core, a Match is a function that takes three arguments:

  • The current mode of the lexer

  • Any input we’ve accumulated so far

  • A lookahead of one grapheme

With just these three arguments we can define arbitrary rules for consuming (or not) input and producing tokens!

pub opaque type Matcher(a, mode)

You use chomp’s lexer to turn a string into a list of tokens that your parser will eventually consume. The Token type contains the lexeme that was consumed (aka the raw input string), the source Span of the consumed lexeme to locate it in the source, and whatever token value your lexer produces.

pub type Token(a) {
  Token(span: Span, lexeme: String, value: a)
}

Constructors

  • Token(span: Span, lexeme: String, value: a)

Functions

pub fn advanced(
  matchers: fn(a) -> List(Matcher(b, a)),
) -> Lexer(b, a)

An advanced lexer is one that can change what matchers it uses based on the current mode. This is useful for sophisticated lexers that might need to handle things like interpolated strings or indentation-sensitive syntax.

pub fn comment(
  start: String,
  to_value: fn(String) -> a,
) -> Matcher(a, b)
pub fn custom(
  f: fn(a, String, String) -> Match(b, a),
) -> Matcher(b, a)

Create a custom Matcher that is flexible enough to do anything you want! The first parameter is a function that takes the current lexer mode, the current lexeme, and a one-grapheme lookahead.

The function returns a Match that tells the lexer what to do next.

pub fn drop(f: fn(String, String) -> Bool) -> Matcher(a, b)

Create a custom Matcher that will consume the input and move to the next iteration without producing a token if it is True or return a NoMatch if it fails. The first parameter is a function that takes the current lexeme and the second parameter is a one-grapheme lookahead.

Matchers created with this convienence function cannot change the lexer’s mode or skip ahead to the next iteration without consuming the input.

pub fn float(to_value: fn(Float) -> a) -> Matcher(a, b)
pub fn float_with_separator(
  separator: String,
  to_value: fn(Float) -> a,
) -> Matcher(a, b)
pub fn identifier(
  start: String,
  inner: String,
  reserved: Set(String),
  to_value: fn(String) -> a,
) -> Matcher(a, b)
pub fn ignore(matcher: Matcher(a, b)) -> Matcher(c, b)

Take a matcher than might Keep anything and silently Drop anything it produces instead. This is useful for things like whitespace or comments where you want to consume some input but you don’t want to emit a token.

pub fn int(to_value: fn(Int) -> a) -> Matcher(a, b)
pub fn int_with_separator(
  separator: String,
  to_value: fn(Int) -> a,
) -> Matcher(a, b)
pub fn into(
  matcher: Matcher(a, b),
  f: fn(b) -> b,
) -> Matcher(a, b)

Take an existing matcher and transition to a new mode. This only runs if the matcher is successful and either Keeps or Drops a value.

pub fn keep(
  f: fn(String, String) -> Result(a, Nil),
) -> Matcher(a, b)

Create a custom Matcher that will consume the input and produce a token with the given value if it is Ok or return a NoMatch if it fails. The first parameter is a function that takes the current lexeme and the second parameter is a one-grapheme lookahead.

Matchers created with this convienence function cannot change the lexer’s mode or skip ahead to the next iteration without consuming the input.

pub fn keyword(
  str: String,
  breaker: String,
  value: a,
) -> Matcher(a, b)

Match exactly the given string only when the lookahead is matched by the given breaker regex. Keywords are exact strings like let but you wouldn’t want to lex letter as [Let, Var("tter")] so the breaker is used so you can say what characters should trigger a match.

pub fn map(
  matcher: Matcher(a, b),
  f: fn(a) -> c,
) -> Matcher(c, b)

Take an existing matcher and transform it by applying a function to the value it produces.

pub fn number(
  from_int: fn(Int) -> a,
  from_float: fn(Float) -> a,
) -> Matcher(a, b)
pub fn number_with_separator(
  separator: String,
  from_int: fn(Int) -> a,
  from_float: fn(Float) -> a,
) -> Matcher(a, b)
pub fn run(
  source: String,
  lexer: Lexer(a, Nil),
) -> Result(List(Token(a)), Error)
pub fn run_advanced(
  source: String,
  mode: a,
  lexer: Lexer(b, a),
) -> Result(List(Token(b)), Error)
pub fn simple(matchers: List(Matcher(a, Nil))) -> Lexer(a, Nil)
pub fn spaces(token: a) -> Matcher(a, b)
pub fn spaces_(to_value: fn(String) -> a) -> Matcher(a, b)
pub fn string(
  char: String,
  to_value: fn(String) -> a,
) -> Matcher(a, b)
pub fn symbol(
  str: String,
  breaker: String,
  value: a,
) -> Matcher(a, b)

Match exactly the given string only when the lookahead is matched by the given breaker regex. This is an alias of keyword but it can be helpful to separate the two concepts.

pub fn then(
  matcher: Matcher(a, b),
  f: fn(a) -> Match(c, b),
) -> Matcher(c, b)

Take an existing matcher and transform it by applying a function to the value it producs. The function you provide can return a different Match so you can, for example, take a matcher that Keeps a value and turn it into a matcher that Drops the value instead. This is out ignore works!

pub fn token(str: String, value: a) -> Matcher(a, b)

Match exactly the given string with no lookahead and produce the given value.

pub fn try_identifier(
  start: String,
  inner: String,
  reserved: Set(String),
  to_value: fn(String) -> a,
) -> Result(Matcher(a, b), CompileError)
pub fn variable(
  reserved: Set(String),
  to_value: fn(String) -> a,
) -> Matcher(a, b)
pub fn whitespace(token: a) -> Matcher(a, b)
Search Document