Myhtmlex v0.2.1

A module to decode html into a tree structure.

Because it is built on Alexander Borisov’s myhtml, this binding is html-spec compliant and very fast.

Example

iex> Myhtmlex.decode("<h1>Hello world</h1>")
{"html", [], [{"head", [], []}, {"body", [], [{"h1", [], ["Hello world"]}]}]}
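
The tree shape shown above can be consumed with ordinary pattern matching. The following is a minimal sketch, not part of Myhtmlex: the module name `TreeText` is hypothetical, and it assumes the default output format (binary tags, list children, 2-tuple comments). It collects the text content of a decoded tree.

```elixir
defmodule TreeText do
  # Hypothetical helper: returns the text content of a tree in document order.
  def text(node), do: node |> collect() |> IO.iodata_to_binary()

  # Text nodes are plain binaries.
  defp collect(text) when is_binary(text), do: text
  # Default comment nodes are 2-tuples; they carry no visible text.
  defp collect({:comment, _comment}), do: []
  # Element nodes are {tag, attrs, children}.
  defp collect({_tag, _attrs, children}) when is_list(children),
    do: Enum.map(children, &collect/1)
end

# Tree literal copied from the example output above.
tree =
  {"html", [], [{"head", [], []}, {"body", [], [{"h1", [], ["Hello world"]}]}]}

IO.puts(TreeText.text(tree))
```

Building iodata and converting once at the end avoids repeated binary concatenation on large documents.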

Benchmark results (Nif calling mode) for various file sizes on a 2.5 GHz Core i7:

Settings:
  duration:      1.0 s

## FileSizesBench
[15:28:42] 1/3: github_trending_js.html 341k
[15:28:46] 2/3: w3c_html5.html 131k
[15:28:48] 3/3: wikipedia_hyperlink.html 97k

Finished in 7.52 seconds

## FileSizesBench
benchmark name                iterations   average time
wikipedia_hyperlink.html 97k        1000   1385.86 µs/op
w3c_html5.html 131k                 1000   2179.30 µs/op
github_trending_js.html 341k         500   5686.21 µs/op

Configuration

You always call into the Myhtmlex module. Depending on your application configuration, it chooses between the underlying implementations Myhtmlex.Safe (default) and Myhtmlex.Nif.
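
To illustrate the idea of one public module choosing an implementation from configuration, here is a minimal sketch. It is not the library's actual code: the modules `ModeSketch`, `ModeSketch.Safe` and `ModeSketch.Nif` and the `:mode_sketch` application key are hypothetical stand-ins, and whether Myhtmlex reads its `:mode` setting at compile time or at run time is not stated here.

```elixir
# Hypothetical stand-ins for the two implementations.
defmodule ModeSketch.Safe do
  def decode(_html), do: {:decoded_by, :safe}
end

defmodule ModeSketch.Nif do
  def decode(_html), do: {:decoded_by, :nif}
end

defmodule ModeSketch do
  # Looks up the configured implementation on each call and falls back
  # to the safe one when nothing is configured (an assumption of this sketch).
  def decode(html) do
    impl = Application.get_env(:mode_sketch, :mode, ModeSketch.Safe)
    impl.decode(html)
  end
end

IO.inspect(ModeSketch.decode("<br>"))
Application.put_env(:mode_sketch, :mode, ModeSketch.Nif)
IO.inspect(ModeSketch.decode("<br>"))
```

Callers only ever see `ModeSketch.decode/1`, so switching implementations needs no change at the call sites.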

Erlang interoperability is a tricky minefield. You can call into C directly using natively implemented functions (Nif). But this comes with the risk that if anything goes wrong within the C implementation, your whole VM will crash. No more supervisor cushions from here on, just violent crashes.

That is why the default mode of operation keeps your VM safe and happy. If you need ultimate parsing speed, or you can simply tolerate VM-level crashes, read on.

Call into C-Node (default)

This is the default mode of operation. If your application cannot tolerate VM-level crashes, this option allows you to gain the best of both worlds. The added overhead is client/server communications, and a worker OS-process that runs next to your VM under VM supervision.

You do not have to do anything to start the worker process; everything is taken care of within the library. If you are not running in distributed mode, your VM will automatically be assigned a short node name (sname).

The worker OS-process stays alive as long as it is under VM-supervision. If your VM goes down, the OS-process will die by itself. If the worker OS-process dies for some reason, your VM stays unaffected and will attempt to restart it seamlessly.

Call into Nif

If your application is aiming for ultimate parsing speed, and in the worst case can tolerate VM-level crashes, you can call directly into the Nif.

  1. Require myhtmlex without runtime

    in your mix.exs

    def deps do
      [
        {:myhtmlex, ">= 0.0.0", runtime: false}
      ]
    end

  2. Configure the mode to Myhtmlex.Nif

    e.g. in config/config.exs

    config :myhtmlex, mode: Myhtmlex.Nif

  3. Bonus: You can open in-memory references to parsed trees, instead of parsing and mapping to erlang terms in one go

Summary

Functions

decode(bin)
Returns a tree representation from the given html string

decode(bin, list)
Returns a tree representation from the given html string

decode_tree(ref)
Returns a tree representation from the given reference. See decode/1 for example output. (Nif only!)

decode_tree(ref, list)
Returns a tree representation from the given reference. See decode/2 for options and example output. (Nif only!)

open(bin)
Returns a reference to an internally parsed myhtml_tree_t. (Nif only!)

Types

attr_list() :: [] | [attr]

comment_node() :: {:comment, String.t}

comment_node3() :: {:comment, [], String.t}

format_flag() :: :html_atoms | :nil_self_closing | :comment_tuple3

tag() :: String.t | atom

Functions

decode(bin)
decode(String.t) :: tree

Returns a tree representation from the given html string.

Examples

iex> Myhtmlex.decode("<h1>Hello world</h1>")
{"html", [], [{"head", [], []}, {"body", [], [{"h1", [], ["Hello world"]}]}]}

iex> Myhtmlex.decode("<span class='hello'>Hi there</span>")
{"html", [],
 [{"head", [], []},
  {"body", [], [{"span", [{"class", "hello"}], ["Hi there"]}]}]}

iex> Myhtmlex.decode("<body><!-- a comment --!></body>")
{"html", [], [{"head", [], []}, {"body", [], [comment: " a comment "]}]}

iex> Myhtmlex.decode("<br>")
{"html", [], [{"head", [], []}, {"body", [], [{"br", [], []}]}]}
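
Since attributes come back as a list of `{name, value}` tuples, they can be read with `List.keyfind/3`. The sketch below is illustrative only: the module `TreeQuery` and its functions are hypothetical helpers, not part of Myhtmlex, and it assumes trees in the shapes shown in the examples above.

```elixir
defmodule TreeQuery do
  # Hypothetical helper: returns every element whose tag equals `wanted`,
  # in document order.
  def find({tag, _attrs, children} = node, wanted) when is_list(children) do
    rest = Enum.flat_map(children, &find(&1, wanted))
    if tag == wanted, do: [node | rest], else: rest
  end

  # Text nodes and comment nodes contain no elements.
  def find(_text_or_comment, _wanted), do: []

  # Looks up an attribute value on an element; nil when it is absent.
  def attr({_tag, attrs, _children}, name) do
    case List.keyfind(attrs, name, 0) do
      {^name, value} -> value
      nil -> nil
    end
  end
end

# Tree literal copied from the span example above.
tree =
  {"html", [],
   [{"head", [], []},
    {"body", [], [{"span", [{"class", "hello"}], ["Hi there"]}]}]}

[span] = TreeQuery.find(tree, "span")
IO.inspect(TreeQuery.attr(span, "class"))
```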

decode(bin, list)
decode(String.t, [{:format, [format_flag]}]) :: tree

Returns a tree representation from the given html string.

This variant allows you to pass in one or more of the following format flags:

  • :html_atoms uses atoms for known html tags (faster), binaries for everything else.
  • :nil_self_closing uses nil to designate self-closing tags and void elements. For example, <br> is then represented as {"br", [], nil}. See http://w3c.github.io/html-reference/syntax.html#void-elements for a full list of void elements.
  • :comment_tuple3 uses 3-tuple elements for comments, instead of the default 2-tuple element.

Examples

iex> Myhtmlex.decode("<h1>Hello world</h1>", format: [:html_atoms])
{:html, [], [{:head, [], []}, {:body, [], [{:h1, [], ["Hello world"]}]}]}

iex> Myhtmlex.decode("<br>", format: [:nil_self_closing])
{"html", [], [{"head", [], []}, {"body", [], [{"br", [], nil}]}]}

iex> Myhtmlex.decode("<body><!-- a comment --!></body>", format: [:comment_tuple3])
{"html", [], [{"head", [], []}, {"body", [], [{:comment, [], " a comment "}]}]}

iex> html = "<body><!-- a comment --!><unknown /></body>"
iex> Myhtmlex.decode(html, format: [:html_atoms, :nil_self_closing, :comment_tuple3])
{:html, [],
 [{:head, [], []},
  {:body, [], [{:comment, [], " a comment "}, {"unknown", [], nil}]}]}
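
Code that consumes trees has to account for whichever flags were used: tags may be atoms or binaries, children may be nil, and comments may be 2-tuples or 3-tuples. As a rough sketch (the module `TreeNormalize` is hypothetical and not part of Myhtmlex), the function below maps any of these shapes back to the default one, assuming comment text is always a binary.

```elixir
defmodule TreeNormalize do
  # Hypothetical helper: rewrites a tree decoded with any combination of
  # format flags into the default shape (binary tags, list children,
  # 2-tuple comments).
  def normalize(text) when is_binary(text), do: text

  # Comments: 2-tuple (default) or 3-tuple (:comment_tuple3).
  def normalize({:comment, text}) when is_binary(text), do: {:comment, text}
  def normalize({:comment, [], text}) when is_binary(text), do: {:comment, text}

  # Elements with nil children (:nil_self_closing) get an empty child list.
  def normalize({tag, attrs, nil}), do: {to_string(tag), attrs, []}

  # Regular elements: atom tags (:html_atoms) become binaries.
  def normalize({tag, attrs, children}) when is_list(children),
    do: {to_string(tag), attrs, Enum.map(children, &normalize/1)}
end

# Tree literal copied from the last example above.
flagged =
  {:html, [],
   [{:head, [], []},
    {:body, [], [{:comment, [], " a comment "}, {"unknown", [], nil}]}]}

IO.inspect(TreeNormalize.normalize(flagged))
```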

decode_tree(ref)
decode_tree(reference) :: tree

Returns a tree representation from the given reference. See decode/1 for example output. (Nif only!)

decode_tree(ref, list)
decode_tree(reference, [{:format, [format_flag]}]) :: tree

Returns a tree representation from the given reference. See decode/2 for options and example output. (Nif only!)

open(bin)
open(String.t) :: reference

Returns a reference to an internally parsed myhtml_tree_t. (Nif only!)
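
As a rough sketch of how open/1 and decode_tree/2 might be combined, the wrapper below parses first and maps to Erlang terms as a separate step. The module `OpenThenDecode` and its function are hypothetical; the sketch assumes the application is configured with mode: Myhtmlex.Nif and that an empty flag list is accepted for the :format option.

```elixir
defmodule OpenThenDecode do
  # Hypothetical wrapper: parses with Myhtmlex.open/1, then maps the
  # parsed tree to Erlang terms with Myhtmlex.decode_tree/2.
  # Only expected to work in Nif mode.
  def decode_later(html, flags \\ []) do
    ref = Myhtmlex.open(html)
    Myhtmlex.decode_tree(ref, format: flags)
  end
end
```

With Myhtmlex available in Nif mode, a call such as `OpenThenDecode.decode_later("<br>", [:nil_self_closing])` would be expected to return a tree like the ones shown for decode/2.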