Floki v0.17.2

Floki is a simple HTML parser that lets you search for nodes using CSS selectors.

Example

Assuming that you have the following HTML:

<!doctype html>
<html>
<body>
  <section id="content">
    <p class="headline">Floki</p>
    <a href="http://github.com/philss/floki">Github page</a>
    <span data-model="user">philss</span>
  </section>
</body>
</html>

Examples of queries that you can perform:

Floki.find(html, "#content")
Floki.find(html, ".headline")
Floki.find(html, "a")
Floki.find(html, "[data-model=user]")
Floki.find(html, "#content a")
Floki.find(html, ".headline, a")

Each HTML node is represented by a tuple like:

{tag_name, attributes, children_nodes}

Example of a node:

{"p", [{"class", "headline"}], ["Floki"]}

So even if the only child node is the element text, it is represented inside a list.
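
Because a node is an ordinary three-element tuple, it can be taken apart with plain pattern matching. The following is a minimal sketch of that idea; the NodeShape module and its describe/1 function are hypothetical names used only for illustration and are not part of Floki.

```elixir
# Hypothetical helper, not part of Floki: it only illustrates the node shape.
defmodule NodeShape do
  # Destructure a node into its three parts.
  def describe({tag_name, attributes, children_nodes})
      when is_binary(tag_name) and is_list(attributes) and is_list(children_nodes) do
    %{
      tag: tag_name,
      # Attributes are {name, value} tuples, so List.keyfind/3 can look one up by name.
      class: attribute_value(attributes, "class"),
      # Text children are plain binaries; element children are nested tuples.
      text_children: Enum.filter(children_nodes, &is_binary/1)
    }
  end

  defp attribute_value(attributes, name) do
    case List.keyfind(attributes, name, 0) do
      {_name, value} -> value
      nil -> nil
    end
  end
end

headline = {"p", [{"class", "headline"}], ["Floki"]}
NodeShape.describe(headline)
# => %{tag: "p", class: "headline", text_children: ["Floki"]}
```

The same match works on any node returned by the queries above, since nested elements are simply tuples inside the children list.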

You can write a simple HTML crawler (with the help of HTTPoison) in a few lines of code:

html
|> Floki.find(".pages a")
|> Floki.attribute("href")
|> Enum.map(fn(url) -> HTTPoison.get!(url) end)

It is as simple as that!

Summary

Functions

attribute(html_tree, attribute_name)

Returns a list with attribute values from elements

attribute(html, selector, attribute_name)

Returns a list with attribute values for a given selector

filter_out(html_tree, selector)

Returns the nodes from a HTML tree that don’t match the filter selector

find(html, selector)

Find elements inside a HTML tree or string

parse(html)

Parses a HTML string

raw_html(html_tree)

Converts HTML tree to raw HTML. Note that the resultant HTML may be different from the original one. Spaces after tags and doctypes are ignored

text(html, opts \\ [deep: true, js: false, sep: ""])

Returns the text nodes from a HTML tree. By default, it will perform a deep search through the HTML tree. You can disable deep search with the option deep assigned to false. You can include content of script tags with the option js assigned to true. You can specify a separator between nodes content

transform(html_tree_list, transformation)

Types

html_tree()

html_tree() :: tuple | list

Functions

attribute(html_tree, attribute_name)

attribute(binary | html_tree, binary) :: list

Returns a list with attribute values from elements.

Examples

iex> Floki.attribute("<a href=https://google.com>Google</a>", "href")
["https://google.com"]

iex> Floki.attribute([{"a", [{"href", "https://google.com"}], ["Google"]}], "href")
["https://google.com"]

attribute(html, selector, attribute_name)

attribute(binary | html_tree, binary, binary) :: list

Returns a list with attribute values for a given selector.

Examples

iex> Floki.attribute("<a href='https://google.com'>Google</a>", "a", "href")
["https://google.com"]

iex> Floki.attribute([{"a", [{"href", "https://google.com"}], ["Google"]}], "a", "href")
["https://google.com"]

filter_out(html_tree, selector)

filter_out(binary | html_tree, binary) :: list

Returns the nodes from a HTML tree that don’t match the filter selector.

Examples

iex> Floki.filter_out("<div><script>hello</script> world</div>", "script")
{"div", [], [" world"]}

iex> Floki.filter_out([{"body", [], [{"script", [], []},{"div", [], []}]}], "script")
[{"body", [], [{"div", [], []}]}]

iex> Floki.filter_out("<div><!-- comment --> text</div>", :comment)
{"div", [], [" text"]}
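
To make the effect on the tree concrete, here is a rough sketch of removing elements by tag name from the tuple representation. It is not Floki's implementation: FilterSketch and drop_tag/2 are hypothetical names, and the sketch handles only a bare tag name, not CSS selectors or the :comment filter.

```elixir
# Hypothetical sketch, not part of Floki: drops elements by tag name only.
defmodule FilterSketch do
  # For a list of nodes, remove matching elements, then recurse into the rest.
  def drop_tag(tree, tag) when is_list(tree) do
    tree
    |> Enum.reject(&match?({^tag, _attrs, _children}, &1))
    |> Enum.map(&drop_tag(&1, tag))
  end

  # For a single element, keep it and filter its children.
  def drop_tag({name, attrs, children}, tag), do: {name, attrs, drop_tag(children, tag)}

  # Text nodes are returned unchanged.
  def drop_tag(text, _tag) when is_binary(text), do: text
end

FilterSketch.drop_tag([{"body", [], [{"script", [], []}, {"div", [], []}]}], "script")
# => [{"body", [], [{"div", [], []}]}]
```

Under these assumptions the sketch gives the same trees as the first two examples above when it is handed the already parsed input.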

find(html, selector)

find(binary | html_tree, binary) :: html_tree

Find elements inside a HTML tree or string.

Examples

iex> Floki.find("<p><span class=hint>hello</span></p>", ".hint")
[{"span", [{"class", "hint"}], ["hello"]}]

iex> Floki.find("<body><div id=important><div>Content</div></div></body>", "#important")
[{"div", [{"id", "important"}], [{"div", [], ["Content"]}]}]

iex> Floki.find("<p><a href='https://google.com'>Google</a></p>", "a")
[{"a", [{"href", "https://google.com"}], ["Google"]}]

iex> Floki.find([{ "div", [], [{"a", [{"href", "https://google.com"}], ["Google"]}]}], "div a")
[{"a", [{"href", "https://google.com"}], ["Google"]}]

parse(html)

parse(binary) :: html_tree

Parses a HTML string.

Examples

iex> Floki.parse("<div class=js-action>hello world</div>")
{"div", [{"class", "js-action"}], ["hello world"]}

iex> Floki.parse("<div>first</div><div>second</div>")
[{"div", [], ["first"]}, {"div", [], ["second"]}]

raw_html(html_tree)

raw_html(html_tree) :: binary

Converts HTML tree to raw HTML. Note that the resultant HTML may be different from the original one. Spaces after tags and doctypes are ignored.

Examples

iex> Floki.parse(~s(<div class="wrapper">my content</div>)) |> Floki.raw_html
~s(<div class="wrapper">my content</div>)

text(html, opts \\ [deep: true, js: false, sep: ""])

Returns the text nodes from a HTML tree. By default, it will perform a deep search through the HTML tree. You can disable deep search with the option deep assigned to false. You can include content of script tags with the option js assigned to true. You can specify a separator between nodes content.

Examples

iex> Floki.text("<div><span>hello</span> world</div>")
"hello world"

iex> Floki.text("<div><span>hello</span> world</div>", deep: false)
" world"

iex> Floki.text("<div><script>hello</script> world</div>")
" world"

iex> Floki.text("<div><script>hello</script> world</div>", js: true)
"hello world"

iex> Floki.text("<ul><li>hello</li><li>world</li></ul>", sep: " ")
"hello world"

iex> Floki.text([{"div", [], ["hello world"]}])
"hello world"

iex> Floki.text([{"p", [], ["1"]},{"p", [], ["2"]}])
"12"
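
The difference between deep and shallow search, and the role of sep, can be sketched directly on the tuple representation. This is an illustration only, not Floki's implementation: TextSketch is a hypothetical module, it works on an already parsed tree, and it ignores the js option and comment nodes.

```elixir
# Hypothetical sketch, not part of Floki: mirrors only the deep and sep options.
defmodule TextSketch do
  def text(tree, opts \\ []) do
    deep = Keyword.get(opts, :deep, true)
    sep = Keyword.get(opts, :sep, "")

    tree
    # Accept either a single node or a list of nodes.
    |> List.wrap()
    |> Enum.flat_map(&collect(&1, deep))
    |> Enum.join(sep)
  end

  # A text node contributes itself.
  defp collect(text, _deep) when is_binary(text), do: [text]

  # Deep search: descend into every child, in document order.
  defp collect({_tag, _attrs, children}, true) do
    Enum.flat_map(children, &collect(&1, true))
  end

  # Shallow search: keep only the direct text children of the node.
  defp collect({_tag, _attrs, children}, false) do
    Enum.filter(children, &is_binary/1)
  end
end

tree = {"div", [], [{"span", [], ["hello"]}, " world"]}
TextSketch.text(tree)
# => "hello world"
TextSketch.text(tree, deep: false)
# => " world"
```

With deep: false only the direct text children are kept, which is why the span content disappears in the second call.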

transform(html_tree_list, transformation)