LazyHTML (LazyHTML v0.1.3)
Efficient parsing and querying of HTML documents.
LazyHTML is designed around lazy HTML documents. Documents are parsed and kept natively in memory for as long as possible. Query selectors are executed in native code for performance and adhere to browser standards. Under the hood, LazyHTML uses Lexbor, a fast, dependency-free and comprehensive HTML engine, written entirely in C.
LazyHTML works with a flat list of nodes and all operations are batched by default, as shown below:
lazy_html =
LazyHTML.from_fragment("""
<div>
<a href="https://elixir-lang.org">Elixir</a>
<a href="https://www.erlang.org">Erlang</a>
</div>\
""")
#=> #LazyHTML<
#=> 1 node
#=>
#=> #1
#=> <div>
#=> <a href="https://elixir-lang.org">Elixir</a>
#=> <a href="https://www.erlang.org">Erlang</a>
#=> </div>
#=> >
hyperlinks = LazyHTML.query(lazy_html, "a")
#=> #LazyHTML<
#=> 2 nodes (from selector)
#=>
#=> #1
#=> <a href="https://elixir-lang.org">Elixir</a>
#=>
#=> #2
#=> <a href="https://www.erlang.org">Erlang</a>
#=> >
LazyHTML.attribute(hyperlinks, "href")
#=> ["https://elixir-lang.org", "https://www.erlang.org"]
LazyHTML also provides several high-level conveniences:
- an Inspect implementation to pretty-print nodes
- an Access implementation to run CSS selectors
- an Enumerable implementation to traverse them
For example:
lazy_html = LazyHTML.from_fragment(~S|<p><strong>Hello</strong>, <em>world</em>!</p>|)
#=> #LazyHTML<
#=> 1 node
#=>
#=> #1
#=> <p><strong>Hello</strong>, <em>world</em>!</p>
#=> >
lazy_html["strong, em"]
#=> #LazyHTML<
#=> 2 nodes (from selector)
#=>
#=> #1
#=> <strong>Hello</strong>
#=>
#=> #2
#=> <em>world</em>
#=> >
LazyHTML.text(lazy_html)
#=> "Hello, world!"
Enum.map(lazy_html["strong, em"], &LazyHTML.text/1)
#=> ["Hello", "world"]
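Because Access and Enumerable compose, a selector lookup can feed directly into a comprehension. The following is a minimal sketch rather than an official recipe: it assumes it runs as a standalone script (hence the Mix.install/1 call, which is not needed inside a Mix project), and the variable names are illustrative only.

```elixir
# Assumption: standalone script; Mix.install/1 fetches the lazy_html package from Hex.
Mix.install([{:lazy_html, "~> 0.1"}])

lazy_html =
  LazyHTML.from_fragment("""
  <div>
    <a href="https://elixir-lang.org">Elixir</a>
    <a href="https://www.erlang.org">Erlang</a>
  </div>
  """)

# Access runs the selector; the comprehension then visits one node at a time.
links =
  for link <- lazy_html["a"] do
    # attribute/2 always returns a list, so match on the single href of this node.
    [href] = LazyHTML.attribute(link, "href")
    {LazyHTML.text(link), href}
  end

IO.inspect(links)
```

Each element yielded by the enumeration is itself a LazyHTML struct holding a single node, which is why the regular functions can be called on it.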
If needed, the lazy nodes can be converted into an Elixir tree data structure, and vice-versa.
lazy_html = LazyHTML.from_fragment("<p><strong>Hello</strong>, <em>world</em>!</p>")
#=> #LazyHTML<
#=> 1 node
#=>
#=> #1
#=> <p><strong>Hello</strong>, <em>world</em>!</p>
#=> >
tree = LazyHTML.to_tree(lazy_html)
#=> [{"p", [], [{"strong", [], ["Hello"]}, ", ", {"em", [], ["world"]}, "!"]}]
LazyHTML.from_tree(tree)
#=> #LazyHTML<
#=> 1 node
#=>
#=> #1
#=> <p><strong>Hello</strong>, <em>world</em>!</p>
#=> >
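Since the tree is plain Elixir data, it can be processed with ordinary pattern matching. The sketch below collects every href from a tree of the shape shown above. The TreeLinks module is a hypothetical helper written for this illustration and is not part of LazyHTML; the input tree is written by hand so the example runs without the library.

```elixir
defmodule TreeLinks do
  # Hypothetical helper, not part of LazyHTML: collects every href value in a
  # tree shaped like the result of LazyHTML.to_tree/1.
  def hrefs(nodes) when is_list(nodes), do: Enum.flat_map(nodes, &hrefs/1)

  # Element node: {tag, attributes, children}.
  def hrefs({tag, attrs, children}) when is_binary(tag) do
    own = for {"href", value} <- attrs, do: value
    own ++ hrefs(children)
  end

  # Comment and text nodes carry no attributes.
  def hrefs({:comment, _content}), do: []
  def hrefs(text) when is_binary(text), do: []
end

tree = [
  {"div", [],
   [
     {:comment, " Links "},
     {"a", [{"href", "https://elixir-lang.org"}], ["Elixir"]},
     " and ",
     {"a", [{"href", "https://www.erlang.org"}], ["Erlang"]}
   ]}
]

hrefs = TreeLinks.hrefs(tree)
#=> ["https://elixir-lang.org", "https://www.erlang.org"]
```

For a handful of lookups, querying the lazy document is usually simpler; a tree walk like this mainly pays off when the data must be transformed in Elixir anyway.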
Summary
Functions
- attribute/2: Returns all values of the given attribute on the lazy_html root nodes.
- attributes/1: Returns attribute lists for every root element in lazy_html.
- child_nodes/1: Returns the child nodes of the root nodes in lazy_html.
- filter/2: Filters lazy_html root nodes, keeping only elements that match the given CSS selector.
- from_document/1: Parses an HTML document.
- from_fragment/1: Parses a segment of an HTML document.
- from_tree/1: Builds a lazy HTML document from an Elixir tree data structure.
- query/2: Finds elements in lazy_html matching the given CSS selector.
- query_by_id/2: Finds elements in lazy_html matching the given id.
- tag/1: Returns the tag name for every root element in lazy_html.
- text/1: Returns the text content of all nodes in lazy_html.
- to_html/2: Serializes lazy_html as an HTML string.
- to_tree/2: Builds an Elixir tree data structure representing the lazy_html document.
Types
@type t() :: %LazyHTML{resource: reference()}
Functions
attribute/2
Returns all values of the given attribute on the lazy_html root nodes.
Examples
iex> lazy_html =
...> LazyHTML.from_fragment("""
...> <div>
...> <span data-id="1">Hello</span>
...> <span data-id="2">world</span>
...> <span>!</span>
...> </div>
...> """)
iex> spans = LazyHTML.query(lazy_html, "span")
iex> LazyHTML.attribute(spans, "data-id")
["1", "2"]
iex> LazyHTML.attribute(spans, "data-other")
[]
Note that attributes without a value implicitly have an empty value:
iex> lazy_html = LazyHTML.from_fragment(~S|<div><button disabled>Click me</button></div>|)
iex> button = LazyHTML.query(lazy_html, "button")
iex> LazyHTML.attribute(button, "disabled")
[""]
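Because nodes without the attribute contribute nothing to the result, the returned list does not necessarily line up with the root nodes. When positions matter, one option is to enumerate the nodes and call attribute/2 on each of them. This is a sketch under the assumption that it runs as a standalone script (hence Mix.install/1); the variable names are illustrative.

```elixir
# Assumption: standalone script; Mix.install/1 fetches the lazy_html package from Hex.
Mix.install([{:lazy_html, "~> 0.1"}])

lazy_html =
  LazyHTML.from_fragment(
    ~S|<div><span data-id="1">Hello</span><span>!</span><span data-id="2">world</span></div>|
  )

spans = LazyHTML.query(lazy_html, "span")

# Batched call: the middle span has no data-id and is simply absent from the result.
batched = LazyHTML.attribute(spans, "data-id")

# Per-node call: one entry per span, with nil where the attribute is missing.
per_node =
  Enum.map(spans, fn span ->
    span |> LazyHTML.attribute("data-id") |> List.first()
  end)

IO.inspect({batched, per_node})
```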
attributes/1
Returns attribute lists for every root element in lazy_html.
Note that if there are text or comment root nodes, they are ignored, and they have no corresponding list in the result.
Examples
iex> lazy_html =
...> LazyHTML.from_fragment("""
...> <div>
...> <span class="text" data-id="1">Hello</span>
...> <span>world</span>
...> </div>
...> """)
iex> spans = LazyHTML.query(lazy_html, "span")
iex> LazyHTML.attributes(spans)
[
[{"class", "text"}, {"data-id", "1"}],
[]
]
iex> lazy_html =
...> LazyHTML.from_fragment("""
...> <!-- Comment-->
...> <span class="text">Hello</span>
...> world
...> """)
iex> LazyHTML.attributes(lazy_html)
[
[{"class", "text"}]
]
child_nodes/1
Returns the child nodes of the root nodes in lazy_html.
Examples
iex> lazy_html = LazyHTML.from_fragment(~S|<div><span>Hello</span> <span>world</span></div>|)
iex> LazyHTML.child_nodes(lazy_html)
#LazyHTML<
3 nodes (from selector)
#1
<span>Hello</span>
#2
[whitespace]
#3
<span>world</span>
>
iex> LazyHTML.child_nodes(LazyHTML.child_nodes(lazy_html))
#LazyHTML<
2 nodes (from selector)
#1
Hello
#2
world
>
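As the first example shows, the children include whitespace text nodes. If only element children are of interest, one possible approach is to combine child_nodes/1 with filter/2 and the universal selector, since filter/2 keeps only elements. This is a sketch that assumes a standalone script (hence Mix.install/1) and that the universal selector "*" is accepted by filter/2.

```elixir
# Assumption: standalone script; Mix.install/1 fetches the lazy_html package from Hex.
Mix.install([{:lazy_html, "~> 0.1"}])

lazy_html = LazyHTML.from_fragment(~S|<div><span>Hello</span> <span>world</span></div>|)

# All children of the <div>: two spans and the whitespace text node between them.
children = LazyHTML.child_nodes(lazy_html)

# filter/2 keeps only elements; "*" is assumed to match every element.
elements = LazyHTML.filter(children, "*")

IO.inspect({Enum.count(children), Enum.count(elements)})
```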
filter/2
Filters lazy_html root nodes, keeping only elements that match the given CSS selector.
Examples
iex> lazy_html = LazyHTML.from_fragment("""
...> <span>Hello</span>
...> <div>
...> <span>nested</span>
...> </div>
...> <span>world</span>
...> """)
iex> LazyHTML.filter(lazy_html, "span")
#LazyHTML<
2 nodes (from selector)
#1
<span>Hello</span>
#2
<span>world</span>
>
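The difference from query/2 is that filter/2 never descends into the root nodes. The sketch below runs both functions on the same fragment to contrast them; it assumes a standalone script (hence Mix.install/1), and the variable names are illustrative.

```elixir
# Assumption: standalone script; Mix.install/1 fetches the lazy_html package from Hex.
Mix.install([{:lazy_html, "~> 0.1"}])

lazy_html =
  LazyHTML.from_fragment("""
  <span>Hello</span>
  <div>
    <span>nested</span>
  </div>
  <span>world</span>
  """)

# filter/2 only inspects the root nodes themselves.
top_level = LazyHTML.filter(lazy_html, "span")

# query/2 also searches the descendants, so the nested span is found as well.
anywhere = LazyHTML.query(lazy_html, "span")

IO.inspect({Enum.count(top_level), Enum.count(anywhere)})
```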
from_document/1
Parses an HTML document.
This function expects a complete document. If any of the <html>, <head> or <body> tags is missing, it is added, which matches the usual browser behaviour. To parse a part of an HTML document, use from_fragment/1 instead.
Examples
iex> LazyHTML.from_document(~S|<html><head></head><body>Hello world!</body></html>|)
#LazyHTML<
1 node
#1
<html><head></head><body>Hello world!</body></html>
>
iex> LazyHTML.from_document(~S|<div>Hello world!</div>|)
#LazyHTML<
1 node
#1
<html><head></head><body><div>Hello world!</div></body></html>
>
from_fragment/1
Parses a segment of an HTML document.
As opposed to from_document/1, this function does not expect a full document and does not add any extra tags.
Examples
iex> LazyHTML.from_fragment(~S|<a class="button">Click me</a>|)
#LazyHTML<
1 node
#1
<a class="button">Click me</a>
>
iex> LazyHTML.from_fragment(~S|<span>Hello</span> <span>world</span>|)
#LazyHTML<
3 nodes
#1
<span>Hello</span>
#2
[whitespace]
#3
<span>world</span>
>
from_tree/1
@spec from_tree(LazyHTML.Tree.t()) :: t()
Builds a lazy HTML document from an Elixir tree data structure.
Examples
iex> tree = [
...> {"html", [], [{"head", [], [{"title", [], ["Page"]}]}, {"body", [], ["Hello world"]}]}
...> ]
iex> LazyHTML.from_tree(tree)
#LazyHTML<
1 node
#1
<html><head><title>Page</title></head><body>Hello world</body></html>
>
iex> tree = [
...> {"div", [], []},
...> {:comment, " Link "},
...> {"a", [{"href", "https://elixir-lang.org"}], ["Elixir"]}
...> ]
iex> LazyHTML.from_tree(tree)
#LazyHTML<
3 nodes
#1
<div></div>
#2
<!-- Link -->
#3
<a href="https://elixir-lang.org">Elixir</a>
>
query/2
Finds elements in lazy_html matching the given CSS selector.
Since lazy_html may have multiple root nodes, the root nodes are included in the search and they will appear in the result if they match the given selector.
Examples
iex> lazy_html =
...> LazyHTML.from_fragment("""
...> <div class="layout">
...> <span>Hello</span>
...> <span>world</span>
...> </div>
...> """)
iex> LazyHTML.query(lazy_html, "span")
#LazyHTML<
2 nodes (from selector)
#1
<span>Hello</span>
#2
<span>world</span>
>
iex> LazyHTML.query(lazy_html, ".layout")
#LazyHTML<
1 node (from selector)
#1
<div class="layout">
<span>Hello</span>
<span>world</span>
</div>
>
Note that for each root node, the selector respects its actual location in the document. Consequently, the nodes returned by one query/2 call are not necessarily siblings, which may affect a subsequent query:
iex> lazy_html =
...> LazyHTML.from_fragment("""
...> <div>
...> <span>Hello</span>
...> </div>
...> <div>
...> <span>World</span>
...> </div>
...> """)
iex> spans = LazyHTML.query(lazy_html, "span")
#LazyHTML<
2 nodes (from selector)
#1
<span>Hello</span>
#2
<span>World</span>
>
iex> LazyHTML.query(spans, ":first-child")
#LazyHTML<
2 nodes (from selector)
#1
<span>Hello</span>
#2
<span>World</span>
>
In the example above, each of the spans is the first child of its respective parent, so the second query matches both.
query_by_id/2
Finds elements in lazy_html matching the given id.
This function is similar to query/2, but it accepts an unescaped id string.
Note that while technically there should be only a single element with the given id, if there are multiple elements, all of them are included in the result.
Examples
iex> lazy_html =
...> LazyHTML.from_fragment("""
...> <div>
...> <span id="hello">Hello</span>
...> <span>world</span>
...> </div>
...> """)
iex> LazyHTML.query_by_id(lazy_html, "hello")
#LazyHTML<
1 node (from selector)
#1
<span id="hello">Hello</span>
>
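The unescaped id matters when it contains characters that have a meaning in CSS. In the sketch below the id contains a dot, which a selector such as "#section.intro" would read as an id followed by a class, so with query/2 the dot would have to be escaped. The id value is made up for illustration, and the Mix.install/1 call assumes a standalone script.

```elixir
# Assumption: standalone script; Mix.install/1 fetches the lazy_html package from Hex.
Mix.install([{:lazy_html, "~> 0.1"}])

lazy_html =
  LazyHTML.from_fragment("""
  <div>
    <h2 id="section.intro">Intro</h2>
    <p>Some text</p>
  </div>
  """)

# The id is passed as is, without any CSS escaping.
heading = LazyHTML.query_by_id(lazy_html, "section.intro")

IO.inspect(LazyHTML.text(heading))
```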
tag/1
Returns the tag name for every root element in lazy_html.
Note that if there are text or comment root nodes, they are ignored, and they have no corresponding entry in the result.
Examples
iex> lazy_html = LazyHTML.from_fragment(~S|<div><span>Hello</span> <span>world</span></div>|)
iex> LazyHTML.tag(lazy_html)
["div"]
iex> lazy_html = LazyHTML.from_fragment(~S|<span>Hello</span> <span>world</span>|)
iex> LazyHTML.tag(lazy_html)
["span", "span"]
text/1
Returns the text content of all nodes in lazy_html.
Examples
iex> lazy_html = LazyHTML.from_fragment(~S|<div><span>Hello</span> <span>world</span></div>|)
iex> LazyHTML.text(lazy_html)
"Hello world"
If you want to get the text for each root node separately, you can use Enum.map/2:
iex> lazy_html = LazyHTML.from_fragment(~S|<div><span>Hello</span> <span>world</span></div>|)
iex> spans = LazyHTML.query(lazy_html, "span")
#LazyHTML<
2 nodes (from selector)
#1
<span>Hello</span>
#2
<span>world</span>
>
iex> Enum.map(spans, &LazyHTML.text/1)
["Hello", "world"]
to_html/2
Serializes lazy_html as an HTML string.
Options
- :skip_whitespace_nodes - when true, ignores text nodes that consist entirely of whitespace, usually whitespace between tags. Defaults to false.
Examples
iex> lazy_html = LazyHTML.from_document(~S|<html><head></head><body>Hello world!</body></html>|)
iex> LazyHTML.to_html(lazy_html)
"<html><head></head><body>Hello world!</body></html>"
iex> lazy_html = LazyHTML.from_fragment(~S|<span>Hello</span> <span>world</span>|)
iex> LazyHTML.to_html(lazy_html)
"<span>Hello</span> <span>world</span>"
iex> lazy_html =
...> LazyHTML.from_fragment("""
...> <p>
...> <span> Hello </span>
...> <span> world </span>
...> </p>
...> """)
iex> LazyHTML.to_html(lazy_html, skip_whitespace_nodes: true)
"<p><span> Hello </span><span> world </span></p>"
to_tree/2
@spec to_tree(t(), keyword()) :: LazyHTML.Tree.t()
Builds an Elixir tree data structure representing the lazy_html document.
Options
- :sort_attributes - when true, attribute lists are sorted alphabetically by name. Defaults to false.
- :skip_whitespace_nodes - when true, ignores text nodes that consist entirely of whitespace, usually whitespace between tags. Defaults to false.
Examples
iex> lazy_html = LazyHTML.from_document(~S|<html><head><title>Page</title></head><body>Hello world</body></html>|)
iex> LazyHTML.to_tree(lazy_html)
[{"html", [], [{"head", [], [{"title", [], ["Page"]}]}, {"body", [], ["Hello world"]}]}]
iex> lazy_html = LazyHTML.from_fragment(~S|<div><!-- Link --><a href="https://elixir-lang.org">Elixir</a></div>|)
iex> LazyHTML.to_tree(lazy_html)
[
{"div", [], [{:comment, " Link "}, {"a", [{"href", "https://elixir-lang.org"}], ["Elixir"]}]}
]
You can get a normalized tree by passing sort_attributes: true:
iex> lazy_html = LazyHTML.from_fragment(~S|<div id="root" class="layout"></div>|)
iex> LazyHTML.to_tree(lazy_html, sort_attributes: true)
[{"div", [{"class", "layout"}, {"id", "root"}], []}]
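One use for a normalized tree is comparing two pieces of markup that differ only in attribute order, for example in tests. The following is a sketch of that idea, not a prescribed workflow; it assumes a standalone script (hence Mix.install/1), and the variable names are illustrative.

```elixir
# Assumption: standalone script; Mix.install/1 fetches the lazy_html package from Hex.
Mix.install([{:lazy_html, "~> 0.1"}])

left = LazyHTML.from_fragment(~S|<div id="root" class="layout"></div>|)
right = LazyHTML.from_fragment(~S|<div class="layout" id="root"></div>|)

# With sorted attributes, both fragments produce the same tree.
normalized_equal =
  LazyHTML.to_tree(left, sort_attributes: true) ==
    LazyHTML.to_tree(right, sort_attributes: true)

IO.inspect(normalized_equal)
```

Combining this with skip_whitespace_nodes: true may additionally help when the two inputs differ only in indentation between tags.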