Creating a Basic Parser

This tutorial will explore how Ergo works through creating a parser to match numbers, be they integers like -5 or 42, or decimals like 25.6. The parser will match them and convert them into the appropriate Elixir numeric value. We will build it up in stages, starting with bare digits.

Parsing a digit

To begin with we need to be able to parse digits. One option for that is to use the basic char parser to match digit characters. Using IEx here is what you would do:

alias Ergo
alias Ergo.Context
import Ergo.Terminals

digit = char(?0..?9)
Ergo.parse(digit, "42")
%Context{status: :ok, ast: 52}

Where the integer 52 is the character code of the digit '4' (Try typing ?4 into the IEx console to see for yourself).

Now we can parse a single digit, how about multiple digits?

Parsing many digits

To parse multiple digits we use the many parser in conjunction with the digit parser as follows:

import Ergo.Combinators

digits = many(digit())
Ergo.parse(digits, "42")
%Context{status: :ok, ast: [52, 50]}

In fact you might see '42' in your version of the AST because IEx will try to render the list [52, 50] as a charlist. This is a hangover from Erlang. If you would prefer to see the list add the following to your ~/.iex.exs file:

IEx.configure(inspect: [charlists: :as_lists])

The many combinator parser repeatedly invokes the digit parser to match as many digit characters as possible, generating an AST that is a list of those digits. To transform them into a numeric value we need to apply another function to the AST.

In our case the AST list contains character values of the digits matched, e.g. [52, 50] for the digits ['4', '2'] respectively. Since the digit character '0' has character value of 48 we can turn the characters into digit values by subtracting 48.

The heavy lifting of transforming digits is done by c_transform below which is pipeline to transform ['4', '2'] -> [{4, 10}, {2, 1}] -> [40, 2] -> 42:

c_transform = fn ast ->
  bases = Stream.unfold(1, fn n -> {n, n * 10} end)
  digits = Enum.map(ast, fn digit -> digit - 48 end)
  Enum.zip(Enum.reverse(digits), bases)
  |> Enum.map(&Tuple.product/1)
  |> Enum.sum
  end

digits = many(digit) |> transform(c_transform)

Ergo.parse(digits, "42")
%Context{status: :ok, ast: 42}

In this case we are applying the transform parser to the many parser. Transform only operates on the AST of the parser it is given by applying a function to it. In this case we could also have used:

digits = many(digit(), ast: c_transform)

As many of the combinator parsers support an optional ctx: or ast: argument as a shortcut.

At this point we can parse positive integers of any length:

Ergo.parse(digits, "918212812783918723")
%Context{status: :ok, ast: 918212812783918723}

Parsing negative numbers

What about negative values? We need to look for a leading '-' character however, unlike the digits, the minus is optional. Parsing the minus is simple enough:

minus = char(?-)

Ergo.parse(minus, "-")
%Context{status: :ok, ast: 45}

45 is the char value of the char '-'. We can now use the optional combinator to allow a minus to be matched, or not:

minus = optional(char(?-))

Ergo.parse(minus, "-42")
%Context{status: :ok, ast: 45}

Ergo.parse(minus, "42")
%Context{status: :ok, ast: nil}

In the second case the status is :ok meaning the optional parser succeeded, however the ast is nil meaning nothing was matched. Let's make this a bit more useful:

minus = optional(char(?-)) |> transform(fn ast ->
  case ast do
    nil -> 1
    45 -> -1
  end
end)

Ergo.parse(minus, "-42")
%Context{status: :ok, ast: -1}

Ergo.parse(minus, "42")
%Context{status: :ok, ast: 1}

Now when minus matches a '-' it will transform it to the value -1. When it doesn't match anything it will transform it to the value 1. Now let's combine it with the other parser.

integer = sequence([
  minus,
  digits
])

The sequence parser tries to match a list of parser in turn and, if they all match, generates an AST composed of a list of each of their results. Let's see how it works:

Ergo.parse(integer, "1234")
%Context{status: :ok, ast: [1, 1234]}

Ergo.parse(integer, "-5678")
%Context{status: :ok, ast: [-1, 5678]}

So we can see that it's easy to get the right result by simply taking the product of the two values in the AST:

integer = sequence([
  minus,
  digits,
  ],
  ast: &Enum.product/1
)

Ergo.parse(integer, "1234")
%Context{status: :ok, ast: 1234}

Ergo.parse(integer, "-5678")
%Context{status: :ok, ast: -5678}

So far so good. We can now parse positive and negative integers.

Parsing decimals

If we want to parse decimal numbers as well we need to handle the (optional) mantissa, the digits to the right of the decimal point.

We can see that the mantissa is structurally the same as the integer part, a set of digits, but will need to be processed a little differently.

The m_transform function below should look familiar. It works the same way as the c_transform only instead of multiplying by increasing powers of 10, we're dividing by increasing powers of 10.

m_transform = fn ast ->
  ast
  |> Enum.map(fn digit -> digit - 48 end)
  |> Enum.zip(Stream.unfold(0.1, fn n -> {n, n / 10} end))
  |> Enum.map(&Tuple.product/1)
  |> Enum.sum
end

mantissa = many(digit, ast: m_transform)

Ergo.parse(mantissa, "5")
%Context{status: :ok, ast: 0.5}

Ergo.parse(mantissa, "42")
%Context{status: :ok, ast: 0.42000000000000004}

There may be a precision issue with this code but you can see the principle it is operating by.

Now to join the two components together, assuming there is a decimal point (suggesting we'll need optional again). Also we'll again make use of the ast: feature of the sequence combinator to process AST's to give us the right value.

number = sequence([
  integer,
  optional(
    sequence([
      ignore(char(?.)),
      mantissa
    ], ast: &List.first/1)
  )
], ast: &Enum.sum/1)

Ergo.parse(number, "42")
%Context{status: :ok, ast: 42}

Ergo.parse(number, "0.45")
%Context{status: :ok, ast: 0.45}

Ergo.parse(number, "-42")
%Context{status: :ok, ast: -42}

All looking good, just one more example:

Ergo.parse(number, "-4.2")
%Context{status: :ok, ast: -3.8}

Oops! There is a problem with our implementation in that we add together the integer and decimal parts. This works for positive numbers but in the latter case -4 + 0.2 = -3.8 not -4.2. When the integer part is negative we need to subtract the decimal part. We can no longer just use Enum.sum to process the result of the top-level sequence. Instead:

combine = fn
  [integer, decimal | []] ->
    if integer >= 0 do
      integer + decimal
    else
      integer - decimal
    end
  ast ->
    Enum.sum(ast)
end

number = sequence([
  integer,
  optional(
    sequence([
      ignore(char(?.)),
      mantissa
    ], ast: &List.first/1)
  )
], ast: combine)

Ergo.parse(number, "-4.2")
%Context{status: :ok, ast: -4.2}

Conclusion

Through a series of steps we have built a parser that can handle any kind of integer or decimal number we throw at it. We've seen the using of terminal parsers like char as well as combinator parsers like optional, ignore, many, and sequence and meta parsers like transform (and that often transform can be specified as a ast: argument to a combinator parser).

Hopefully this guide will be helpful in thinking about how to build your own parsers.