Getting started with SAX
This guide is an introduction of how you could parse a XML document in SAX mode with Saxy.
SAX (Simple API for XML)
SAX is an event driven algorithm to parse XML documents, which means that during parsing, a SAX parser will emit any meaningful data of the document such as start tag to a pre-configured handler, then the handler decides how to process with the emitted data.
SAX is especially useful when it comes to large file parsing, because unlike DOM parsing, it does not require fitting the whole parsed document into memory (for XPath operations for example).
Parsing in SAX mode is efficient, but it would take some time to get used to. This guide is here to help you get over it.
Implement the handler
Given a XML document as below needs to be parsed, and the desired outcome will be a list of foods with name, price and description.
<?xml version="1.0" encoding="UTF-8"?>
<breakfast_menu>
<food>
<name>Belgian Waffles</name>
<price>$5.95</price>
<description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
</food>
<food>
<name>Strawberry Belgian Waffles</name>
<price>$7.95</price>
<description>Light Belgian waffles covered with strawberries and whipped cream</description>
</food>
</breakfast_menu>
To parse a XML document the SAX way with Saxy, first you need to implement a handler.
Let's start with handling the start and end events of the document. No action to take here, we simply return whatever passed in.
defmodule FoodHandler do
@behaviour Saxy.Handler
def handle_event(:start_document, _prolog, state) do
{:ok, state}
end
def handle_event(:end_document, _, state) do
{:ok, state}
end
end
Next we will be handling the <food>
element. The action will be very simple as
well.
- When
<food>
element starts, we put a new struct into the food list. - When
<food>
element ends, we do nothing but return the list.
To make it clear, let's call the state foods
instead of state
.
defmodule FoodHandler do
@behaviour Saxy.Handler
def handle_event(:start_element, {name, _attributes}, foods) do
if name == "food" do
{:ok, [%Food{} | foods]}
else
{:ok, foods}
end
end
def handle_event(:end_document, _data, foods) do
{:ok, foods}
end
end
Now we shall start handling <name>
and its content. But we encounter a problem:
:characters
event, which we are supposed to get "Belgian Waffles" for the
first food name does not include which tag it belongs to.
So we need to somehow cache the current tag that is being parsed, let's revise our handler a little bit.
def handle_event(:start_element, {tag_name, _attributes}, {current_tag, foods}) do
if tag_name == "food" do
foods = [%Food{} | foods]
{:ok, {tag_name, foods}}
else
{:ok, {tag_name, foods}}
end
end
With this now we can import the content of "name", and probably other food properties too.
def handle_event(:characters, content, {current_tag, foods}) do
[current_food | foods] = foods
food =
case current_tag do
"name" ->
Map.put(current_food, :name, content)
"price" ->
Map.put(current_food, :price, content)
"description" ->
Map.put(current_food, :description, content)
_other ->
current_food
end
{:ok, {"food", [food | foods]}}
end
As now we have implemented the event handler, it is time to parse the document.
document = File.read!("/path/to/the/file")
Saxy.parse_string(document, {nil, []}, FoodHandler)
{:ok,
[
%Food{name: "Belgian Waffles", price: "$5.95", description: "Two of our famous Belgian Waffles with plenty of real maple syrup"},
%Food{name: "Strawberry Belgian Waffles", price: "$7.95", description: "Light Belgian waffles covered with strawberries and whipped cream"},
]}