Creating a Basic Parser
This tutorial will explore how Ergo works through creating a parser to match numbers, be they integers like -5 or 42, or decimals like 25.6. The parser will match them and convert them into the appropriate Elixir numeric value. We will build it up in stages, starting with bare digits.
Parsing a digit
To begin with, we need to be able to parse digits. One option is to use the basic char parser to match digit characters. In IEx, you would write:
alias Ergo
alias Ergo.Context
import Ergo.Terminals
digit = char(?0..?9)
Ergo.parse(digit, "42")
%Context{status: :ok, ast: 52}
Here the integer 52 is the character code of the digit '4' (try typing ?4 into the IEx console to see for yourself). Note that only the '4' was matched, because a single char parser matches a single character.
Now that we can parse a single digit, how about multiple digits?
Parsing many digits
To parse multiple digits we use the many parser in conjunction with the digit parser as follows:
import Ergo.Combinators
digits = many(digit)
Ergo.parse(digits, "42")
%Context{status: :ok, ast: [52, 50]}
In fact, you might see '42' (or ~c"42" on Elixir 1.15 and later) in your version of the AST because IEx will try to render the list [52, 50] as a charlist. This is a hangover from Erlang. If you would prefer to see the list, add the following to your ~/.iex.exs file:
IEx.configure(inspect: [charlists: :as_lists])
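The charlist rendering described above can be checked outside of any parser. The following is a minimal sketch using only the Elixir standard library; the variable name digits is illustrative.

```elixir
digits = [52, 50]

# A list of printable code points is the same term as a charlist.
IO.puts(digits == [?4, ?2])          # prints true
IO.puts(List.to_string(digits))      # prints 42

# Asking inspect/2 for :as_lists always shows the raw integers.
IO.puts(inspect(digits, charlists: :as_lists))   # prints [52, 50]
```

The same charlists: :as_lists option is what the IEx.configure call sets for the whole session.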
The many combinator parser repeatedly invokes the digit parser to match as many digit characters as possible, generating an AST that is a list of those digits. To transform them into a numeric value we need to apply another function to the AST.
In our case the AST list contains the character values of the digits matched, e.g. [52, 50] for the digits ['4', '2'] respectively. Since the digit character '0' has a character value of 48, we can turn the characters into digit values by subtracting 48.
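The subtraction described above can be tried on its own, without a parser. This is a small sketch in plain Elixir; the names ast and values are illustrative, and ?0 is simply another way of writing 48.

```elixir
# [52, 50], the AST shape produced for the input "42"
ast = [?4, ?2]

# Subtract the code of '0' from each character code
values = Enum.map(ast, fn code -> code - ?0 end)

IO.inspect(values)   # prints [4, 2]
```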
The heavy lifting of transforming digits is done by c_transform below, which is a pipeline that transforms ['4', '2'] -> [4, 2] -> [{2, 1}, {4, 10}] -> [2, 40] -> 42:
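c_transform = fn ast ->
  # Infinite stream of place values: 1, 10, 100, ...
  bases = Stream.unfold(1, fn n -> {n, n * 10} end)
  # Convert character codes to digit values ('0' is 48)
  digits = Enum.map(ast, fn digit -> digit - 48 end)

  # Pair the least significant digit with the smallest place value
  Enum.zip(Enum.reverse(digits), bases)
  |> Enum.map(&Tuple.product/1)
  |> Enum.sum()
end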
digits = many(digit) |> transform(c_transform)
Ergo.parse(digits, "42")
%Context{status: :ok, ast: 42}
In this case we are applying the transform parser to the many parser. The transform parser operates only on the AST of the parser it is given, applying a function to that AST. We could also have written:
digits = many(digit, ast: c_transform)
This works because many of the combinator parsers support an optional ctx: or ast: argument as a shortcut.
At this point we can parse positive integers of any length:
Ergo.parse(digits, "918212812783918723")
%Context{status: :ok, ast: 918212812783918723}
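The digit-to-integer conversion used in this section can also be written as a single left-to-right fold, multiplying the running total by 10 before adding each digit. The sketch below is an alternative to c_transform in plain Elixir, not part of the Ergo API; the name to_integer is hypothetical.

```elixir
# Hypothetical alternative to c_transform: fold left to right,
# multiplying the running total by 10 before adding each new digit.
to_integer = fn ast ->
  Enum.reduce(ast, 0, fn code, acc -> acc * 10 + (code - ?0) end)
end

IO.inspect(to_integer.([?4, ?2]))       # prints 42
IO.inspect(to_integer.([?9, ?1, ?8]))   # prints 918
IO.inspect(to_integer.([]))             # prints 0
```

Because it takes the AST list and returns a number, a function of this shape could be passed wherever c_transform is used. It avoids reversing the list and building the stream of place values.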
Parsing negative numbers
What about negative values? We need to look for a leading '-' character; however, unlike the digits, the minus is optional. Parsing the minus is simple enough:
minus = char(?-)
Ergo.parse(minus, "-")
%Context{status: :ok, ast: 45}
45 is the character code of '-'. We can now use the optional combinator to allow a minus to be matched, or not:
minus = optional(char(?-))
Ergo.parse(minus, "-42")
%Context{status: :ok, ast: 45}
Ergo.parse(minus, "42")
%Context{status: :ok, ast: nil}
In the second case the status is :ok, meaning the optional parser succeeded; however, the ast is nil, meaning nothing was matched. Let's make this a bit more useful:
minus = optional(char(?-)) |> transform(fn ast ->
  case ast do
    nil -> 1
    45 -> -1
  end
end)
Ergo.parse(minus, "-42")
%Context{status: :ok, ast: -1}
Ergo.parse(minus, "42")
%Context{status: :ok, ast: 1}
Now when minus matches a '-' it will transform it to the value -1. When it doesn't match anything it will transform it to the value 1. Now let's combine it with the digits parser:
integer = sequence([
  minus,
  digits
])
The sequence parser tries to match a list of parsers in turn and, if they all match, generates an AST composed of a list of each of their results. Let's see how it works:
Ergo.parse(integer, "1234")
%Context{status: :ok, ast: [1, 1234]}
Ergo.parse(integer, "-5678")
%Context{status: :ok, ast: [-1, 5678]}
So we can see that it's easy to get the right result by simply taking the product of the two values in the AST:
integer = sequence(
  [
    minus,
    digits
  ],
  ast: &Enum.product/1
)
Ergo.parse(integer, "1234")
%Context{status: :ok, ast: 1234}
Ergo.parse(integer, "-5678")
%Context{status: :ok, ast: -5678}
So far so good. We can now parse positive and negative integers.
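The effect of the ast: function can be seen by applying it directly to the two AST shapes shown earlier. This is a plain Elixir sketch. Enum.product/1 (like Tuple.product/1 used in c_transform) requires Elixir 1.12 or later, so a hypothetical fallback named product is included for older versions.

```elixir
# The two AST shapes produced by the sequence for "1234" and "-5678".
IO.inspect(Enum.product([1, 1234]))    # prints 1234
IO.inspect(Enum.product([-1, 5678]))   # prints -5678

# Equivalent fold for Elixir versions that lack Enum.product/1.
product = fn list -> Enum.reduce(list, 1, fn x, acc -> x * acc end) end

IO.inspect(product.([-1, 5678]))       # prints -5678
```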
Parsing decimals
If we want to parse decimal numbers as well, we need to handle the (optional) mantissa: the digits to the right of the decimal point.
The mantissa is structurally the same as the integer part, a sequence of digits, but it needs to be processed a little differently.
The m_transform function below should look familiar. It works the same way as c_transform, only instead of multiplying by increasing powers of 10, we're dividing by increasing powers of 10 (that is, multiplying by 0.1, 0.01, and so on).
m_transform = fn ast ->
  ast
  |> Enum.map(fn digit -> digit - 48 end)
  |> Enum.zip(Stream.unfold(0.1, fn n -> {n, n / 10} end))
  |> Enum.map(&Tuple.product/1)
  |> Enum.sum()
end
mantissa = many(digit, ast: m_transform)
Ergo.parse(mantissa, "5")
%Context{status: :ok, ast: 0.5}
Ergo.parse(mantissa, "42")
%Context{status: :ok, ast: 0.42000000000000004}
As the second result shows, summing floating-point terms introduces a small rounding error, but you can see the principle the code operates on.
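One way to sidestep the accumulated rounding is to build the digits up as an integer first and divide only once. The following is a sketch of that idea in plain Elixir, not the tutorial's method; the name fraction is hypothetical, and Integer.pow/2 requires Elixir 1.12 or later.

```elixir
# Hypothetical alternative to m_transform: read the digits as an integer,
# then divide once by 10 raised to the number of digits.
fraction = fn ast ->
  numerator = Enum.reduce(ast, 0, fn code, acc -> acc * 10 + (code - ?0) end)
  numerator / Integer.pow(10, length(ast))
end

IO.inspect(fraction.([?5]))       # prints 0.5
IO.inspect(fraction.([?4, ?2]))   # prints 0.42
```

A single division still yields a float, so the result is the nearest representable value to the decimal rather than an exact one.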
Now to join the two components together. The decimal point and mantissa may be absent, so we'll need optional again. We'll also make use of the ast: feature of the sequence combinator to process the ASTs into the right value.
number = sequence([
  integer,
  optional(
    sequence([
      ignore(char(?.)),
      mantissa
    ], ast: &List.first/1)
  )
], ast: &Enum.sum/1)
Ergo.parse(number, "42")
%Context{status: :ok, ast: 42}
Ergo.parse(number, "0.45")
%Context{status: :ok, ast: 0.45}
Ergo.parse(number, "-42")
%Context{status: :ok, ast: -42}
All looking good, just one more example:
Ergo.parse(number, "-4.2")
%Context{status: :ok, ast: -3.8}
Oops! There is a problem with our implementation: we add together the integer and decimal parts. This works for positive numbers, but in this last case -4 + 0.2 = -3.8, not -4.2. When the integer part is negative we need to subtract the decimal part. We can no longer just use Enum.sum to process the result of the top-level sequence. Instead:
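combine = fn
  # Both parts present: move the decimal part away from zero
  [integer, decimal | []] ->
    if integer >= 0 do
      integer + decimal
    else
      integer - decimal
    end

  # No decimal part: the AST holds only the integer
  ast ->
    Enum.sum(ast)
end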
number = sequence([
  integer,
  optional(
    sequence([
      ignore(char(?.)),
      mantissa
    ], ast: &List.first/1)
  )
], ast: combine)
Ergo.parse(number, "-4.2")
%Context{status: :ok, ast: -4.2}
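One input that combine cannot distinguish is a negative number whose integer part is zero, such as "-0.5": the integer parser computes -1 * 0 = 0, so the sign is lost before combine sees it. The sketch below reproduces combine in plain Elixir to show the effect. It then shows a hypothetical variant, combine_signed, which assumes the parsers were rearranged so that the sign arrives as a separate AST element. That rearrangement is not shown here.

```elixir
# The combine function from the tutorial, reproduced for comparison.
combine = fn
  [integer, decimal] ->
    if integer >= 0, do: integer + decimal, else: integer - decimal

  ast ->
    Enum.sum(ast)
end

# For "-0.5" the integer part is -1 * 0 = 0, so the sign is already gone.
IO.inspect(combine.([0, 0.5]))              # prints 0.5

# Hypothetical variant: keep the sign as its own element and apply it last.
combine_signed = fn
  [sign, integer] -> sign * integer
  [sign, integer, decimal] -> sign * (integer + decimal)
end

IO.inspect(combine_signed.([-1, 0, 0.5]))   # prints -0.5
IO.inspect(combine_signed.([1, 42]))        # prints 42
```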
Conclusion
Through a series of steps we have built a parser that handles integers and decimal numbers. We've seen the use of terminal parsers like char, combinator parsers like optional, ignore, many, and sequence, and meta parsers like transform (and that transform can often be specified as an ast: argument to a combinator parser).
Hopefully this guide will be helpful in thinking about how to build your own parsers.