LazyHTML (LazyHTML v0.1.3)

Efficient parsing and querying of HTML documents.

LazyHTML is designed around lazy HTML documents. Documents are parsed and kept natively in memory for as long as possible. Query selectors are executed in native code for performance and adhere to browser standards. Under the hood, LazyHTML uses Lexbor, a fast, dependency-free, and comprehensive HTML engine written entirely in C.

LazyHTML works with a flat list of nodes and all operations are batched by default, as shown below:

lazy_html =
  LazyHTML.from_fragment("""
  <div>
    <a href="https://elixir-lang.org">Elixir</a>
    <a href="https://www.erlang.org">Erlang</a>
  </div>\
  """)
#=> #LazyHTML<
#=>   1 node
#=>
#=>   #1
#=>   <div>
#=>     <a href="https://elixir-lang.org">Elixir</a>
#=>     <a href="https://www.erlang.org">Erlang</a>
#=>   </div>
#=> >

hyperlinks = LazyHTML.query(lazy_html, "a")
#=> #LazyHTML<
#=>   2 nodes (from selector)
#=>
#=>   #1
#=>   <a href="https://elixir-lang.org">Elixir</a>
#=>
#=>   #2
#=>   <a href="https://www.erlang.org">Erlang</a>
#=> >

LazyHTML.attribute(hyperlinks, "href")
#=> ["https://elixir-lang.org", "https://www.erlang.org"]

LazyHTML also provides several high-level conveniences:

  • an Inspect implementation to pretty-print nodes
  • an Access implementation to run CSS selectors
  • an Enumerable implementation to traverse them

For example:

lazy_html = LazyHTML.from_fragment(~S|<p><strong>Hello</strong>, <em>world</em>!</p>|)
#=> #LazyHTML<
#=>   1 node
#=>
#=>   #1
#=>   <p><strong>Hello</strong>, <em>world</em>!</p>
#=> >

lazy_html["strong, em"]
#=> #LazyHTML<
#=>   2 nodes (from selector)
#=>
#=>   #1
#=>   <strong>Hello</strong>
#=>
#=>   #2
#=>   <em>world</em>
#=> >

LazyHTML.text(lazy_html)
#=> "Hello, world!"

Enum.map(lazy_html["strong, em"], &LazyHTML.text/1)
#=> ["Hello", "world"]
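
Because enumeration yields one node at a time, the batched functions can also be applied per node. The following is a minimal sketch, assuming it runs as a standalone script (hence the `Mix.install/1` call, which is not needed inside a Mix project that already depends on `:lazy_html`). It pairs each link's text with its `href`:

```elixir
# Assumption: standalone script; in a Mix project, list :lazy_html in deps instead.
Mix.install([{:lazy_html, "~> 0.1"}])

lazy_html =
  LazyHTML.from_fragment("""
  <div>
    <a href="https://elixir-lang.org">Elixir</a>
    <a href="https://www.erlang.org">Erlang</a>
  </div>
  """)

# Each `link` is a LazyHTML struct holding a single node, so text/1 and
# attribute/2 operate on that one link only.
links =
  for link <- lazy_html["a"] do
    {LazyHTML.text(link), List.first(LazyHTML.attribute(link, "href"))}
  end

IO.inspect(links)
#=> [{"Elixir", "https://elixir-lang.org"}, {"Erlang", "https://www.erlang.org"}]
```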

If needed, the lazy nodes can be converted into an Elixir tree data structure, and vice-versa.

lazy_html = LazyHTML.from_fragment("<p><strong>Hello</strong>, <em>world</em>!</p>")
#=> #LazyHTML<
#=>   1 node
#=>
#=>   #1
#=>   <p><strong>Hello</strong>, <em>world</em>!</p>
#=> >

tree = LazyHTML.to_tree(lazy_html)
#=> [{"p", [], [{"strong", [], ["Hello"]}, ", ", {"em", [], ["world"]}, "!"]}]

LazyHTML.from_tree(tree)
#=> #LazyHTML<
#=>   1 node
#=>
#=>   #1
#=>   <p><strong>Hello</strong>, <em>world</em>!</p>
#=> >

Summary

Functions

attribute(lazy_html, name)

Returns all values of the given attribute on the lazy_html root nodes.

attributes(lazy_html)

Returns attribute lists for every root element in lazy_html.

child_nodes(lazy_html)

Returns the child nodes of the root nodes in lazy_html.

filter(lazy_html, selector)

Filters lazy_html root nodes, keeping only elements that match the given CSS selector.

from_document(html)

Parses an HTML document.

from_fragment(html)

Parses a segment of an HTML document.

from_tree(tree)

Builds a lazy HTML document from an Elixir tree data structure.

query(lazy_html, selector)

Finds elements in lazy_html matching the given CSS selector.

query_by_id(lazy_html, id)

Finds elements in lazy_html matching the given id.

tag(lazy_html)

Returns the tag name of every root element in lazy_html.

text(lazy_html)

Returns the text content of all nodes in lazy_html.

to_html(lazy_html, opts \\ [])

Serializes lazy_html as an HTML string.

to_tree(lazy_html, opts \\ [])

Builds an Elixir tree data structure representing the lazy_html document.

Types

t()

@type t() :: %LazyHTML{resource: reference()}

Functions

attribute(lazy_html, name)

@spec attribute(t(), String.t()) :: [String.t()]

Returns all values of the given attribute on the lazy_html root nodes.

Examples

iex> lazy_html =
...>   LazyHTML.from_fragment("""
...>   <div>
...>     <span data-id="1">Hello</span>
...>     <span data-id="2">world</span>
...>     <span>!</span>
...>   </div>
...>   """)
iex> spans = LazyHTML.query(lazy_html, "span")
iex> LazyHTML.attribute(spans, "data-id")
["1", "2"]
iex> LazyHTML.attribute(spans, "data-other")
[]

Note that attributes without a value implicitly have an empty value:

iex> lazy_html = LazyHTML.from_fragment(~S|<div><button disabled>Click me</button></div>|)
iex> button = LazyHTML.query(lazy_html, "button")
iex> LazyHTML.attribute(button, "disabled")
[""]
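
Since nodes without the attribute contribute nothing to the result, the returned list does not line up with the root nodes one-to-one. If positions matter, one option is to look up the attribute per node. This is a minimal sketch, assuming a standalone script (hence `Mix.install/1`); using `nil` for a missing attribute is a choice made for this example, not library behaviour:

```elixir
# Assumption: standalone script; in a Mix project, list :lazy_html in deps instead.
Mix.install([{:lazy_html, "~> 0.1"}])

lazy_html =
  LazyHTML.from_fragment("""
  <div>
    <span data-id="1">Hello</span>
    <span data-id="2">world</span>
    <span>!</span>
  </div>
  """)

spans = LazyHTML.query(lazy_html, "span")

# For a single node, attribute/2 returns [] or [value];
# List.first/1 maps that to nil or the value.
ids =
  Enum.map(spans, fn span ->
    span |> LazyHTML.attribute("data-id") |> List.first()
  end)

IO.inspect(ids)
#=> ["1", "2", nil]
```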

attributes(lazy_html)

@spec attributes(t()) :: [[{String.t(), String.t()}]]

Returns attribute lists for every root element in lazy_html.

Note that if there are text or comment root nodes, they are ignored, and they have no corresponding list in the result.

Examples

iex> lazy_html =
...>   LazyHTML.from_fragment("""
...>   <div>
...>     <span class="text" data-id="1">Hello</span>
...>     <span>world</span>
...>   </div>
...>   """)
iex> spans = LazyHTML.query(lazy_html, "span")
iex> LazyHTML.attributes(spans)
[
  [{"class", "text"}, {"data-id", "1"}],
  []
]

iex> lazy_html =
...>   LazyHTML.from_fragment("""
...>   <!-- Comment-->
...>   <span class="text">Hello</span>
...>   world
...>   """)
iex> LazyHTML.attributes(lazy_html)
[
  [{"class", "text"}]
]

child_nodes(lazy_html)

@spec child_nodes(t()) :: t()

Returns the child nodes of the root nodes in lazy_html.

Examples

iex> lazy_html = LazyHTML.from_fragment(~S|<div><span>Hello</span> <span>world</span></div>|)
iex> LazyHTML.child_nodes(lazy_html)
#LazyHTML<
  3 nodes (from selector)
  #1
  <span>Hello</span>
  #2
  [whitespace]
  #3
  <span>world</span>
>
iex> LazyHTML.child_nodes(LazyHTML.child_nodes(lazy_html))
#LazyHTML<
  2 nodes (from selector)
  #1
  Hello
  #2
  world
>

filter(lazy_html, selector)

@spec filter(t(), String.t()) :: t()

Filters lazy_html root nodes, keeping only elements that match the given CSS selector.

Examples

iex> lazy_html = LazyHTML.from_fragment("""
...> <span>Hello</span>
...> <div>
...>   <span>nested</span>
...> </div>
...> <span>world</span>
...> """)
iex> LazyHTML.filter(lazy_html, "span")
#LazyHTML<
  2 nodes (from selector)
  #1
  <span>Hello</span>
  #2
  <span>world</span>
>
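
To make the contrast with query/2 concrete, the sketch below runs both functions on the same fragment. It assumes a standalone script (hence `Mix.install/1`). filter/2 only inspects the root nodes, while query/2 also searches their descendants:

```elixir
# Assumption: standalone script; in a Mix project, list :lazy_html in deps instead.
Mix.install([{:lazy_html, "~> 0.1"}])

lazy_html =
  LazyHTML.from_fragment("""
  <span>Hello</span>
  <div>
    <span>nested</span>
  </div>
  <span>world</span>
  """)

# Only the two root-level spans are kept.
filtered = LazyHTML.filter(lazy_html, "span")

# The span nested inside the div is found as well.
queried = LazyHTML.query(lazy_html, "span")

IO.inspect(Enum.count(filtered))
#=> 2
IO.inspect(Enum.count(queried))
#=> 3
```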

from_document(html)

@spec from_document(String.t()) :: t()

Parses an HTML document.

This function expects a complete document, so if any of the <html>, <head>, or <body> tags is missing, it is added, which matches the usual browser behaviour. To parse a part of an HTML document, use from_fragment/1 instead.

Examples

iex> LazyHTML.from_document(~S|<html><head></head><body>Hello world!</body></html>|)
#LazyHTML<
  1 node
  #1
  <html><head></head><body>Hello world!</body></html>
>

iex> LazyHTML.from_document(~S|<div>Hello world!</div>|)
#LazyHTML<
  1 node
  #1
  <html><head></head><body><div>Hello world!</div></body></html>
>

from_fragment(html)

@spec from_fragment(String.t()) :: t()

Parses a segment of an HTML document.

As opposed to from_document/1, this function does not expect a full document and does not add any extra tags.

Examples

iex> LazyHTML.from_fragment(~S|<a class="button">Click me</a>|)
#LazyHTML<
  1 node
  #1
  <a class="button">Click me</a>
>

iex> LazyHTML.from_fragment(~S|<span>Hello</span> <span>world</span>|)
#LazyHTML<
  3 nodes
  #1
  <span>Hello</span>
  #2
  [whitespace]
  #3
  <span>world</span>
>
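
The difference between the two parsers is easiest to see by serializing the same input both ways. A minimal sketch, assuming a standalone script (hence `Mix.install/1`):

```elixir
# Assumption: standalone script; in a Mix project, list :lazy_html in deps instead.
Mix.install([{:lazy_html, "~> 0.1"}])

html = ~S|<div>Hello world!</div>|

# from_document/1 wraps the input in the missing <html>, <head>, and <body> tags.
as_document = html |> LazyHTML.from_document() |> LazyHTML.to_html()

# from_fragment/1 keeps the input as is.
as_fragment = html |> LazyHTML.from_fragment() |> LazyHTML.to_html()

IO.inspect(as_document)
#=> "<html><head></head><body><div>Hello world!</div></body></html>"
IO.inspect(as_fragment)
#=> "<div>Hello world!</div>"
```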

from_tree(tree)

@spec from_tree(LazyHTML.Tree.t()) :: t()

Builds a lazy HTML document from an Elixir tree data structure.

Examples

iex> tree = [
...>   {"html", [], [{"head", [], [{"title", [], ["Page"]}]}, {"body", [], ["Hello world"]}]}
...> ]
iex> LazyHTML.from_tree(tree)
#LazyHTML<
  1 node
  #1
  <html><head><title>Page</title></head><body>Hello world</body></html>
>

iex> tree = [
...>   {"div", [], []},
...>   {:comment, " Link "},
...>   {"a", [{"href", "https://elixir-lang.org"}], ["Elixir"]}
...> ]
iex> LazyHTML.from_tree(tree)
#LazyHTML<
  3 nodes
  #1
  <div></div>
  #2
  <!-- Link -->
  #3
  <a href="https://elixir-lang.org">Elixir</a>
>

query(lazy_html, selector)

@spec query(t(), String.t()) :: t()

Finds elements in lazy_html matching the given CSS selector.

Since lazy_html may have multiple root nodes, the root nodes are included in the search and they will appear in the result if they match the given selector.

Examples

iex> lazy_html =
...>   LazyHTML.from_fragment("""
...>   <div class="layout">
...>     <span>Hello</span>
...>     <span>world</span>
...>   </div>
...>   """)
iex> LazyHTML.query(lazy_html, "span")
#LazyHTML<
  2 nodes (from selector)
  #1
  <span>Hello</span>
  #2
  <span>world</span>
>
iex> LazyHTML.query(lazy_html, ".layout")
#LazyHTML<
  1 node (from selector)
  #1
  <div class="layout">
    <span>Hello</span>
    <span>world</span>
  </div>
>

Note that for each root node, the selector respects its actual location in the document. Consequently, the nodes returned by one query/2 call are not necessarily siblings, which may affect a subsequent query:

iex> lazy_html =
...>   LazyHTML.from_fragment("""
...>   <div>
...>     <span>Hello</span>
...>   </div>
...>   <div>
...>     <span>World</span>
...>   </div>
...>   """)
iex> spans = LazyHTML.query(lazy_html, "span")
#LazyHTML<
  2 nodes (from selector)
  #1
  <span>Hello</span>
  #2
  <span>World</span>
>
iex> LazyHTML.query(spans, ":first-child")
#LazyHTML<
  2 nodes (from selector)
  #1
  <span>Hello</span>
  #2
  <span>World</span>
>

In the example above, each span is the first child of its respective parent, so the second query matches both.

query_by_id(lazy_html, id)

@spec query_by_id(t(), String.t()) :: t()

Finds elements in lazy_html matching the given id.

This function is similar to query/2, but it accepts an unescaped id string.

Note that while technically there should be only a single element with the given id, if there are multiple elements, all of them are included in the result.

Examples

iex> lazy_html =
...>   LazyHTML.from_fragment("""
...>   <div>
...>     <span id="hello">Hello</span>
...>     <span>world</span>
...>   </div>
...>   """)
iex> LazyHTML.query_by_id(lazy_html, "hello")
#LazyHTML<
  1 node (from selector)
  #1
  <span id="hello">Hello</span>
>
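
The two notes above can be illustrated together. The sketch below assumes a standalone script (hence `Mix.install/1`) and uses made-up ids containing a colon, a character that would need CSS escaping in a query/2 selector. The id is passed verbatim, and both elements sharing it are returned:

```elixir
# Assumption: standalone script; in a Mix project, list :lazy_html in deps instead.
Mix.install([{:lazy_html, "~> 0.1"}])

# The ids below are hypothetical and chosen only for illustration.
lazy_html =
  LazyHTML.from_fragment("""
  <ul>
    <li id="user:1">Ada</li>
    <li id="user:1">Grace</li>
    <li id="user:2">Joe</li>
  </ul>
  """)

# No escaping of ":" is needed here.
matches = LazyHTML.query_by_id(lazy_html, "user:1")
names = Enum.map(matches, &LazyHTML.text/1)

IO.inspect(Enum.count(matches))
#=> 2
IO.inspect(names)
```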

tag(lazy_html)

@spec tag(t()) :: [String.t()]

Returns the tag name of every root element in lazy_html.

Note that if there are text or comment root nodes, they are ignored, and they have no corresponding entry in the result.

Examples

iex> lazy_html = LazyHTML.from_fragment(~S|<div><span>Hello</span> <span>world</span></div>|)
iex> LazyHTML.tag(lazy_html)
["div"]

iex> lazy_html = LazyHTML.from_fragment(~S|<span>Hello</span> <span>world</span>|)
iex> LazyHTML.tag(lazy_html)
["span", "span"]
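
As a sketch of the note about non-element root nodes, the fragment below has three root nodes (a comment, a span, and a text node), yet only the span contributes a tag name. It assumes a standalone script (hence `Mix.install/1`):

```elixir
# Assumption: standalone script; in a Mix project, list :lazy_html in deps instead.
Mix.install([{:lazy_html, "~> 0.1"}])

lazy_html = LazyHTML.from_fragment(~S|<!-- Comment --><span>Hello</span> world|)

# Three root nodes: the comment, the span, and the trailing text.
node_count = Enum.count(lazy_html)

# Only the element root node is reported.
tags = LazyHTML.tag(lazy_html)

IO.inspect(node_count)
#=> 3
IO.inspect(tags)
#=> ["span"]
```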

text(lazy_html)

@spec text(t()) :: String.t()

Returns the text content of all nodes in lazy_html.

Examples

iex> lazy_html = LazyHTML.from_fragment(~S|<div><span>Hello</span> <span>world</span></div>|)
iex> LazyHTML.text(lazy_html)
"Hello world"

If you want to get the text for each root node separately, you can use Enum.map/2:

iex> lazy_html = LazyHTML.from_fragment(~S|<div><span>Hello</span> <span>world</span></div>|)
iex> spans = LazyHTML.query(lazy_html, "span")
#LazyHTML<
  2 nodes (from selector)
  #1
  <span>Hello</span>
  #2
  <span>world</span>
>
iex> Enum.map(spans, &LazyHTML.text/1)
["Hello", "world"]

to_html(lazy_html, opts \\ [])

@spec to_html(
  t(),
  keyword()
) :: String.t()

Serializes lazy_html as an HTML string.

Options

  • :skip_whitespace_nodes - when true, ignores text nodes that consist entirely of whitespace, usually whitespace between tags. Defaults to false.

Examples

iex> lazy_html = LazyHTML.from_document(~S|<html><head></head><body>Hello world!</body></html>|)
iex> LazyHTML.to_html(lazy_html)
"<html><head></head><body>Hello world!</body></html>"

iex> lazy_html = LazyHTML.from_fragment(~S|<span>Hello</span> <span>world</span>|)
iex> LazyHTML.to_html(lazy_html)
"<span>Hello</span> <span>world</span>"

iex> lazy_html =
...>   LazyHTML.from_fragment("""
...>   <p>
...>     <span> Hello </span>
...>     <span> world </span>
...>   </p>
...>   """)
iex> LazyHTML.to_html(lazy_html, skip_whitespace_nodes: true)
"<p><span> Hello </span><span> world </span></p>"

to_tree(lazy_html, opts \\ [])

@spec to_tree(
  t(),
  keyword()
) :: LazyHTML.Tree.t()

Builds an Elixir tree data structure representing the lazy_html document.

Options

  • :sort_attributes - when true, attribute lists are sorted alphabetically by name. Defaults to false.

  • :skip_whitespace_nodes - when true, ignores text nodes that consist entirely of whitespace, usually whitespace between tags. Defaults to false.

Examples

iex> lazy_html = LazyHTML.from_document(~S|<html><head><title>Page</title></head><body>Hello world</body></html>|)
iex> LazyHTML.to_tree(lazy_html)
[{"html", [], [{"head", [], [{"title", [], ["Page"]}]}, {"body", [], ["Hello world"]}]}]

iex> lazy_html = LazyHTML.from_fragment(~S|<div><!-- Link --><a href="https://elixir-lang.org">Elixir</a></div>|)
iex> LazyHTML.to_tree(lazy_html)
[
  {"div", [], [{:comment, " Link "}, {"a", [{"href", "https://elixir-lang.org"}], ["Elixir"]}]}
]

You can get a normalized tree by passing sort_attributes: true:

iex> lazy_html = LazyHTML.from_fragment(~S|<div id="root" class="layout"></div>|)
iex> LazyHTML.to_tree(lazy_html, sort_attributes: true)
[{"div", [{"class", "layout"}, {"id", "root"}], []}]
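
The :skip_whitespace_nodes option can be sketched the same way. The example below assumes a standalone script (hence `Mix.install/1`) and compares the default tree with the one produced when whitespace-only text nodes are skipped:

```elixir
# Assumption: standalone script; in a Mix project, list :lazy_html in deps instead.
Mix.install([{:lazy_html, "~> 0.1"}])

lazy_html = LazyHTML.from_fragment(~S|<p> <span>Hello</span> <span>world</span> </p>|)

# By default, whitespace between tags is kept as text nodes.
full_tree = LazyHTML.to_tree(lazy_html)

# With the option set, whitespace-only text nodes are dropped.
compact_tree = LazyHTML.to_tree(lazy_html, skip_whitespace_nodes: true)

IO.inspect(full_tree)
#=> [{"p", [], [" ", {"span", [], ["Hello"]}, " ", {"span", [], ["world"]}, " "]}]
IO.inspect(compact_tree)
#=> [{"p", [], [{"span", [], ["Hello"]}, {"span", [], ["world"]}]}]
```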