Floki (Floki v0.34.0)

Floki is a simple HTML parser that enables search for nodes using CSS selectors.

Example

Assuming that you have the following HTML:

<!doctype html>
<html>
<body>
  <section id="content">
    <p class="headline">Floki</p>
    <a href="http://github.com/philss/floki">Github page</a>
    <span data-model="user">philss</span>
  </section>
</body>
</html>

To parse this, you can use the function Floki.parse_document/1:

{:ok, html} = Floki.parse_document(doc)
# =>
# [{"html", [],
#   [
#     {"body", [],
#      [
#        {"section", [{"id", "content"}],
#         [
#           {"p", [{"class", "headline"}], ["Floki"]},
#           {"a", [{"href", "http://github.com/philss/floki"}], ["Github page"]},
#           {"span", [{"data-model", "user"}], ["philss"]}
#         ]}
#      ]}
#   ]}]

With this document you can perform queries such as:

  • Floki.find(html, "#content")
  • Floki.find(html, ".headline")
  • Floki.find(html, "a")
  • Floki.find(html, "[data-model=user]")
  • Floki.find(html, "#content a")
  • Floki.find(html, ".headline, a")
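
As a rough sketch of how these queries behave, the script below parses the sample document and runs a few of the selectors listed above. Fetching Floki with Mix.install and the variable names headline, user and links are choices made for this example, not part of the original text.

```elixir
# Assumption: Floki is fetched with Mix.install so the script runs standalone.
Mix.install([{:floki, "~> 0.34.0"}])

doc = """
<!doctype html>
<html>
<body>
  <section id="content">
    <p class="headline">Floki</p>
    <a href="http://github.com/philss/floki">Github page</a>
    <span data-model="user">philss</span>
  </section>
</body>
</html>
"""

{:ok, html} = Floki.parse_document(doc)

# Class selector: returns the matching <p> node as a tuple.
headline = Floki.find(html, ".headline")
IO.inspect(headline)
# => [{"p", [{"class", "headline"}], ["Floki"]}]

# Attribute selector, then extract the text of the match.
user = html |> Floki.find("[data-model=user]") |> Floki.text()
IO.inspect(user)
# => "philss"

# Descendant selector, then read an attribute from the match.
links = html |> Floki.find("#content a") |> Floki.attribute("href")
IO.inspect(links)
# => ["http://github.com/philss/floki"]
```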

Each HTML node is represented by a tuple like:

{tag_name, attributes, children_nodes}

Example of node:

{"p", [{"class", "headline"}], ["Floki"]}

So even if the only child node is the element text, it is represented inside a list.
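
Because a node is a plain tuple, it can be taken apart with ordinary pattern matching. The following is a minimal sketch using only the Elixir standard library; the variable names are illustrative.

```elixir
# A node tuple as described above: {tag_name, attributes, children_nodes}.
html_node = {"p", [{"class", "headline"}], ["Floki"]}

# Destructure the three elements with pattern matching.
{tag_name, attributes, children} = html_node

# Attributes are a list of {name, value} tuples with string keys,
# so List.keyfind/3 can look one up by name (position 0 of each tuple).
class_attr = List.keyfind(attributes, "class", 0)

IO.inspect(tag_name)
# => "p"
IO.inspect(class_attr)
# => {"class", "headline"}
IO.inspect(children)
# => ["Floki"]
```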

Summary

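Functions

attr(html_elem_tuple, selector, attribute_name, mutation)
Changes the attribute values of the elements matched by selector with the function mutation and returns the whole element tree.

attribute(html, attribute_name)
Returns a list with attribute values from elements.

attribute(html, selector, attribute_name)
Returns a list with attribute values for a given selector.

children(html_node, opts \\ [include_text: true])
Returns the direct child nodes of a HTML node.

filter_out(html, selector)
Returns the nodes from a HTML tree that don't match the filter selector.

find(html, selector)
Find elements inside a HTML tree or string.

find_and_update(html_tree, selector, fun)
Searches for elements inside the HTML tree and updates those that match the selector.

parse(html) deprecated
Parses a HTML Document from a string.

parse_document(document, opts \\ [])
Parses a HTML Document from a string.

parse_document!(document, opts \\ [])
Parses a HTML Document from a string.

parse_fragment(fragment, opts \\ [])
Parses a HTML fragment from a string.

parse_fragment!(fragment, opts \\ [])
Parses a HTML fragment from a string.

raw_html(html_tree, options \\ [])
Converts HTML tree to raw HTML. Note that the resultant HTML may be different from the original one. Spaces after tags and doctypes are ignored.

traverse_and_update(html_tree, fun)
Traverses and updates a HTML tree structure.

traverse_and_update(html_tree, acc, fun)
Traverses and updates a HTML tree structure with an accumulator.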

Types

@type css_selector() :: String.t() | Floki.Selector.t() | [Floki.Selector.t()]
@type html_attribute() :: {String.t(), String.t()}
@type html_comment() :: {:comment, String.t()}
@type html_declaration() :: {:pi, String.t(), [html_attribute()]}
@type html_doctype() :: {:doctype, String.t(), String.t(), String.t()}
@type html_node() ::
  html_tag()
  | html_comment()
  | html_doctype()
  | html_declaration()
  | html_text()
@type html_tag() :: {String.t(), [html_attribute()], [html_node()]}
@type html_text() :: String.t()
@type html_tree() :: [html_node()]
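
To illustrate how the node types above differ in shape, here is a small sketch that classifies each node by pattern matching. NodeKind and kind/1 are hypothetical helpers written for this example and are not part of Floki; the sample tree is made up, and the contents of its doctype tuple are illustrative.

```elixir
defmodule NodeKind do
  # Hypothetical helper (not part of Floki): names the kind of an html_node().
  def kind({:comment, _text}), do: :comment
  def kind({:doctype, _, _, _}), do: :doctype
  def kind({:pi, _name, _attributes}), do: :declaration
  def kind({tag, attrs, children})
      when is_binary(tag) and is_list(attrs) and is_list(children),
      do: :tag
  def kind(text) when is_binary(text), do: :text
end

# A hand-written html_tree() containing one node of several kinds.
tree = [
  {:doctype, "html", "", ""},
  {:comment, " a comment "},
  {"p", [{"class", "headline"}], ["Floki"]},
  "loose text"
]

kinds = Enum.map(tree, &NodeKind.kind/1)
IO.inspect(kinds)
# => [:doctype, :comment, :tag, :text]
```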

Functions

attr(html_elem_tuple, selector, attribute_name, mutation)

@spec attr(
  binary() | html_tree() | html_node(),
  css_selector(),
  binary(),
  (binary() -> binary())
) ::
  html_tree()

Changes the attribute values of the elements matched by selector with the function mutation and returns the whole element tree.

Examples

iex> Floki.attr([{"div", [{"id", "a"}], []}], "#a", "id", fn(id) -> String.replace(id, "a", "b") end)
[{"div", [{"id", "b"}], []}]

iex> Floki.attr([{"div", [{"class", "name"}], []}], "div", "id", fn _ -> "b" end)
[{"div", [{"id", "b"}, {"class", "name"}], []}]

attribute(html, attribute_name)

@spec attribute(binary() | html_tree() | html_node(), binary()) :: list()

Returns a list with attribute values from elements.

Examples

iex> Floki.attribute([{"a", [{"href", "https://google.com"}], ["Google"]}], "href")
["https://google.com"]

iex> Floki.attribute([{"a", [{"href", "https://google.com"}, {"data-name", "google"}], ["Google"]}], "data-name")
["google"]

attribute(html, selector, attribute_name)

@spec attribute(binary() | html_tree() | html_node(), binary(), binary()) :: list()

Returns a list with attribute values for a given selector.

Examples

iex> Floki.attribute([{"a", [{"href", "https://google.com"}], ["Google"]}], "a", "href")
["https://google.com"]

iex> Floki.attribute([{"a", [{"class", "foo"}, {"href", "https://google.com"}], ["Google"]}], "a", "class")
["foo"]

iex> Floki.attribute([{"a", [{"href", "https://google.com"}, {"data-name", "google"}], ["Google"]}], "a[data-name]", "data-name")
["google"]

children(html_node, opts \\ [include_text: true])

@spec children(html_node(), Keyword.t()) :: html_tree() | nil

Returns the direct child nodes of a HTML node.

By default, it also includes all text nodes. You can disable this behaviour by setting the option include_text to false.

If the given node is not an HTML tag, then it returns nil.

Examples

iex> Floki.children({"div", [], ["text", {"span", [], []}]})
["text", {"span", [], []}]

iex> Floki.children({"div", [], ["text", {"span", [], []}]}, include_text: false)
[{"span", [], []}]

iex> Floki.children({:comment, "comment"})
nil

filter_out(html, selector)

@spec filter_out(html_node() | html_tree() | binary(), Floki.FilterOut.selector()) ::
  html_node() | html_tree()

Returns the nodes from a HTML tree that don't match the filter selector.

Examples

iex> Floki.filter_out({"div", [], [{"script", [], ["hello"]}, " world"]}, "script")
{"div", [], [" world"]}

iex> Floki.filter_out([{"body", [], [{"script", [], []}, {"div", [], []}]}], "script")
[{"body", [], [{"div", [], []}]}]

iex> Floki.filter_out({"div", [], [{:comment, "comment"}, " text"]}, :comment)
{"div", [], [" text"]}

iex> Floki.filter_out({"div", [], ["text"]}, :text)
{"div", [], []}

find(html, selector)

@spec find(binary() | html_tree() | html_node(), css_selector()) :: html_tree()

Find elements inside a HTML tree or string.

Examples

iex> {:ok, html} = Floki.parse_fragment("<p><span class=hint>hello</span></p>")
iex> Floki.find(html, ".hint")
[{"span", [{"class", "hint"}], ["hello"]}]

iex> {:ok, html} = Floki.parse_fragment("<div id=important><div>Content</div></div>")
iex> Floki.find(html, "#important")
[{"div", [{"id", "important"}], [{"div", [], ["Content"]}]}]

iex> {:ok, html} = Floki.parse_fragment("<p><a href='https://google.com'>Google</a></p>")
iex> Floki.find(html, "a")
[{"a", [{"href", "https://google.com"}], ["Google"]}]

iex> Floki.find([{ "div", [], [{"a", [{"href", "https://google.com"}], ["Google"]}]}], "div a")
[{"a", [{"href", "https://google.com"}], ["Google"]}]

find_and_update(html_tree, selector, fun)

@spec find_and_update(
  html_tree(),
  css_selector(),
  ({String.t(), [html_attribute()]} ->
     {String.t(), [html_attribute()]} | :delete)
) :: html_tree()

Searches for elements inside the HTML tree and updates those that match the selector.

It will return the updated HTML tree.

This function works in a way similar to traverse_and_update, but instead of updating the children nodes, it only updates the tag and attributes of the matching nodes.

If fun returns :delete, the HTML node will be removed from the tree.

Examples

iex> Floki.find_and_update([{"a", [{"href", "http://elixir-lang.com"}], ["Elixir"]}], "a", fn
...>   {"a", [{"href", href}]} ->
...>     {"a", [{"href", String.replace(href, "http://", "https://")}]}
...>   other ->
...>     other
...> end)
[{"a", [{"href", "https://elixir-lang.com"}], ["Elixir"]}]
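
The :delete return value is not covered by the example above. The following sketch, which assumes Floki is fetched with Mix.install and uses a made-up tree, shows one way it might be used to drop every matching node.

```elixir
# Assumption: Floki is fetched with Mix.install so the script runs standalone.
Mix.install([{:floki, "~> 0.34.0"}])

html = [
  {"div", [],
   [
     {"a", [{"href", "http://example.com"}], ["link"]},
     {"span", [], ["keep"]}
   ]}
]

# Returning :delete from fun removes each node matched by the selector.
updated = Floki.find_and_update(html, "a", fn _tag_and_attrs -> :delete end)

# No <a> element is left in the updated tree.
remaining_links = Floki.find(updated, "a")
IO.inspect(remaining_links)
# => []

# The sibling <span> and its text are untouched.
remaining_text = Floki.text(updated)
IO.inspect(remaining_text)
# => "keep"
```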

map(html_tree_list, fun)

This function is deprecated. Use `find_and_update/3` or `Enum.map/2` instead.

parse(html)

This function is deprecated. Use `parse_document/1` or `parse_fragment/1` instead.

@spec parse(binary()) :: html_tag() | html_tree() | String.t()

Parses a HTML Document from a string.

The expected string is valid HTML, but the parser will try to parse it even if it contains errors.

parse_document(document, opts \\ [])

@spec parse_document(binary(), Keyword.t()) ::
  {:ok, html_tree()} | {:error, String.t()}

Parses a HTML Document from a string.

It will use the available parser from application env or the one from the :html_parser option. Check https://github.com/philss/floki#alternative-html-parsers for more details.

Examples

iex> Floki.parse_document("<html><head></head><body>hello</body></html>")
{:ok, [{"html", [], [{"head", [], []}, {"body", [], ["hello"]}]}]}

iex> Floki.parse_document("<html><head></head><body>hello</body></html>", html_parser: Floki.HTMLParser.Mochiweb)
{:ok, [{"html", [], [{"head", [], []}, {"body", [], ["hello"]}]}]}
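
Since the return value is either {:ok, tree} or {:error, reason}, callers usually branch on it. A minimal sketch, assuming Floki is fetched with Mix.install; the error message text is made up for this example.

```elixir
# Assumption: Floki is fetched with Mix.install so the script runs standalone.
Mix.install([{:floki, "~> 0.34.0"}])

input = "<html><head></head><body>hello</body></html>"

body_text =
  case Floki.parse_document(input) do
    {:ok, tree} ->
      # Query the parsed tree and extract the text of <body>.
      tree |> Floki.find("body") |> Floki.text()

    {:error, reason} ->
      # According to the @spec above, reason is a string.
      raise "could not parse document: " <> reason
  end

IO.inspect(body_text)
# => "hello"
```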

parse_document!(document, opts \\ [])

@spec parse_document!(binary(), Keyword.t()) :: html_tree()

Parses a HTML Document from a string.

Similar to Floki.parse_document/1, but raises Floki.ParseError if there was an error parsing the document.

Example

iex> Floki.parse_document!("<html><head></head><body>hello</body></html>")
[{"html", [], [{"head", [], []}, {"body", [], ["hello"]}]}]

parse_fragment(fragment, opts \\ [])

@spec parse_fragment(binary(), Keyword.t()) ::
  {:ok, html_tree()} | {:error, String.t()}

Parses a HTML fragment from a string.

It will use the available parser from application env or the one from the :html_parser option.

Check https://github.com/philss/floki#alternative-html-parsers for more details.
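
This section has no example of its own, so here is a short sketch of parsing fragments with the default parser. It assumes Floki is fetched with Mix.install; the input strings are illustrative.

```elixir
# Assumption: Floki is fetched with Mix.install so the script runs standalone.
Mix.install([{:floki, "~> 0.34.0"}])

# A fragment does not need <html> or <body> wrappers.
{:ok, fragment} = Floki.parse_fragment("<p><span class=hint>hello</span></p>")
IO.inspect(fragment)
# => [{"p", [], [{"span", [{"class", "hint"}], ["hello"]}]}]

# Several top-level nodes come back as sibling entries of the tree list.
{:ok, items} = Floki.parse_fragment("<div>a</div><div>b</div>")
IO.inspect(items)
# => [{"div", [], ["a"]}, {"div", [], ["b"]}]
```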

parse_fragment!(fragment, opts \\ [])

@spec parse_fragment!(binary(), Keyword.t()) :: html_tree()

Parses a HTML fragment from a string.

Similar to Floki.parse_fragment/1, but raises Floki.ParseError if there was an error parsing the fragment.
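
Because the bang variant returns the tree directly, it fits naturally into a pipeline. A minimal sketch, assuming Floki is fetched with Mix.install:

```elixir
# Assumption: Floki is fetched with Mix.install so the script runs standalone.
Mix.install([{:floki, "~> 0.34.0"}])

hint =
  "<p><span class=hint>hello</span></p>"
  # Returns the tree itself rather than an {:ok, tree} tuple.
  |> Floki.parse_fragment!()
  |> Floki.find(".hint")
  |> Floki.text()

IO.inspect(hint)
# => "hello"
```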

raw_html(html_tree, options \\ [])

@spec raw_html(
  html_tree() | binary(),
  keyword()
) :: binary()

Converts HTML tree to raw HTML. Note that the resultant HTML may be different from the original one. Spaces after tags and doctypes are ignored.
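
One way the output can differ from the input is attribute quoting. The sketch below parses a fragment with an unquoted attribute and renders it back; it assumes Floki is fetched with Mix.install, and the input string is illustrative.

```elixir
# Assumption: Floki is fetched with Mix.install so the script runs standalone.
Mix.install([{:floki, "~> 0.34.0"}])

original = "<p class=a>hi</p>"

# Parse to a tree, then render the tree back to a string.
rendered = original |> Floki.parse_fragment!() |> Floki.raw_html()

IO.inspect(rendered)
# => "<p class=\"a\">hi</p>"

# The attribute value is now quoted, so the strings are not identical.
IO.inspect(rendered == original)
# => false
```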

Options

  • :encode: accepts true or false. Will encode html special characters to html entities. You can also control the encoding behaviour at the application level via config :floki, :encode_raw_html, true | false

  • :pretty: accepts true or false. Will format the output, ignoring line breaks and spaces from the input and inserting new ones in order to pretty format the html.

Examples

iex> Floki.raw_html({"div", [{"class", "wrapper"}], ["my content"]})
~s(<div class="wrapper">my content</div>)

iex> Floki.raw_html({"div", [{"class", "wrapper"}], ["10 > 5"]}, encode: true)
~s(<div class="wrapper">10 &gt; 5</div>)

iex> Floki.raw_html({"div", [{"class", "wrapper"}], ["10 > 5"]}, encode: false)
~s(<div class="wrapper">10 > 5</div>)

iex> Floki.raw_html({"div", [], ["\n   ", {"span", [], "Fully indented"}, "    \n"]}, pretty: true)
"""
<div>
  <span>
    Fully indented
  </span>
</div>
"""

text(html, opts \\ [deep: true, js: false, style: true, sep: ""])

Returns the text nodes from a HTML tree.

By default, it performs a deep search through the HTML tree. You can disable deep search by setting the option deep to false. You can include the content of script tags by setting the option js to true, and exclude the content of style tags by setting the option style to false. You can also specify a separator between the contents of nodes with the option sep.

Examples

iex> Floki.text({"div", [], [{"span", [], ["hello"]}, " world"]})
"hello world"

iex> Floki.text({"div", [], [{"span", [], ["hello"]}, " world"]}, deep: false)
" world"

iex> Floki.text({"div", [], [{"script", [], ["hello"]}, " world"]})
" world"

iex> Floki.text({"div", [], [{"script", [], ["hello"]}, " world"]}, js: true)
"hello world"

iex> Floki.text({"ul", [], [{"li", [], ["hello"]}, {"li", [], ["world"]}]}, sep: "-")
"hello-world"

iex> Floki.text([{"div", [], ["hello world"]}])
"hello world"

iex> Floki.text([{"p", [], ["1"]},{"p", [], ["2"]}])
"12"

iex> Floki.text({"div", [], [{"style", [], ["hello"]}, " world"]}, style: false)
" world"

iex> Floki.text({"div", [], [{"style", [], ["hello"]}, " world"]}, style: true)
"hello world"

traverse_and_update(html_tree, fun)

@spec traverse_and_update(
  html_node() | html_tree(),
  (html_node() -> html_node() | nil)
) :: html_node() | html_tree()

Traverses and updates a HTML tree structure.

This function returns a new tree structure that is the result of applying the given fun on all nodes except text nodes. The tree is traversed in a post-walk fashion, where the children are traversed before the parent.

When the function fun encounters an HTML tag, it receives a tuple with {name, attributes, children}, and should either return a similar tuple or nil to delete the current node.

The function fun can also encounter an HTML doctype, comment or declaration. For these types it receives, and should return, a different tuple. See the documentation for html_comment/0, html_doctype/0 and html_declaration/0 for details.

Note: this won't update text nodes, but you can transform them when working with children nodes.

Examples

iex> html = [{"div", [], ["hello"]}]
iex> Floki.traverse_and_update(html, fn
...>   {"div", attrs, children} -> {"p", attrs, children}
...>   other -> other
...> end)
[{"p", [], ["hello"]}]

iex> html = [{"div", [], [{:comment, "I am comment"}, {"span", [], ["hello"]}]}]
iex> Floki.traverse_and_update(html, fn
...>   {"span", _attrs, _children} -> nil
...>   {:comment, text} -> {"span", [], text}
...>   other -> other
...> end)
[{"div", [], [{"span", [], "I am comment"}]}]
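
The note about text nodes can be illustrated as follows: fun is never called with a text node, but it can rewrite the text it finds among a tag's children. This is a sketch under the assumption that Floki is fetched with Mix.install; the tree is made up.

```elixir
# Assumption: Floki is fetched with Mix.install so the script runs standalone.
Mix.install([{:floki, "~> 0.34.0"}])

html = [{"p", [], ["hello ", {"b", [], ["world"]}]}]

shouted =
  Floki.traverse_and_update(html, fn
    {tag, attrs, children} ->
      # Text nodes are plain strings inside the children list.
      new_children =
        Enum.map(children, fn
          text when is_binary(text) -> String.upcase(text)
          other -> other
        end)

      {tag, attrs, new_children}

    # Comments, doctypes and declarations pass through unchanged.
    other ->
      other
  end)

IO.inspect(shouted)
# => [{"p", [], ["HELLO ", {"b", [], ["WORLD"]}]}]
```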

traverse_and_update(html_tree, acc, fun)

@spec traverse_and_update(
  html_node() | html_tree(),
  traverse_acc,
  (html_node(), traverse_acc -> {html_node() | nil, traverse_acc})
) :: {html_node() | html_tree(), traverse_acc}
when traverse_acc: any()

Traverses and updates a HTML tree structure with an accumulator.

This function returns a new tree structure and the final value of the accumulator, which are the result of applying the given fun on all nodes except text nodes. The tree is traversed in a post-walk fashion, where the children are traversed before the parent.

When the function fun encounters an HTML tag, it receives a tuple with {name, attributes, children} and an accumulator. It should return a 2-tuple like {new_node, new_acc}, where new_node is either a similar tuple or nil to delete the current node, and new_acc is an updated value for the accumulator.

The function fun can also encounter an HTML doctype, comment or declaration. For these types it receives, and should return, a different tuple. See the documentation for html_comment/0, html_doctype/0 and html_declaration/0 for details.

Note: this won't update text nodes, but you can transform them when working with children nodes.

Examples

iex> html = [{"div", [], [{:comment, "I am a comment"}, "hello"]}, {"div", [], ["world"]}]
iex> Floki.traverse_and_update(html, 0, fn
...>   {"div", attrs, children}, acc ->
...>     {{"p", [{"data-count", to_string(acc)} | attrs], children}, acc + 1}
...>   other, acc -> {other, acc}
...> end)
{[
   {"p", [{"data-count", "0"}], [{:comment, "I am a comment"}, "hello"]},
   {"p", [{"data-count", "1"}], ["world"]}
 ], 2}

iex> html = {"div", [], [{"span", [], ["hello"]}]}
iex> Floki.traverse_and_update(html, [deleted: 0], fn
...>   {"span", _attrs, _children}, acc ->
...>     {nil, Keyword.put(acc, :deleted, acc[:deleted] + 1)}
...>   tag, acc ->
...>     {tag, acc}
...> end)
{{"div", [], []}, [deleted: 1]}
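
The accumulator can also make the post-walk order visible. The sketch below records each tag name as it is visited and leaves the tree unchanged; it assumes Floki is fetched with Mix.install, and the tree is made up.

```elixir
# Assumption: Floki is fetched with Mix.install so the script runs standalone.
Mix.install([{:floki, "~> 0.34.0"}])

html = {"div", [], [{"span", [], ["a"]}, {"p", [], ["b"]}]}

{same_tree, visited} =
  Floki.traverse_and_update(html, [], fn
    # Prepend each tag name; the node itself is returned unchanged.
    {tag, _attrs, _children} = tag_node, acc -> {tag_node, [tag | acc]}
    other, acc -> {other, acc}
  end)

# Names were prepended, so reverse them to get the visiting order.
order = Enum.reverse(visited)
IO.inspect(order)
# => ["span", "p", "div"]
```

The children appear before their parent, which matches the post-walk traversal described above.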