HtmlQuery (HtmlQuery v4.0.0)

A concise HTML query API. HTML parsing is handled by Floki.

We created a related library called XmlQuery which has the same API but is used for querying XML. You can read more about them in Querying HTML and XML in Elixir with HtmlQuery and XmlQuery.

Data types

All functions can accept HTML in the form of a string, a Floki HTML tree, a Floki HTML node, or anything that implements the String.Chars protocol. See HtmlQuery.html/0.

Some functions take a CSS selector, which can be a string, a keyword list, or a list. See HtmlQuery.Css.selector/0.

Query functions

  • all/2 - returns all elements matching the selector
  • find/2 - returns the first element that matches the selector
  • find!/2 - returns the only element that matches the selector, or raises

Extraction functions

  • attr/2 - returns the attribute value as a string
  • form_fields/1 - returns the names and values of form fields as a map
  • meta_tags/1 - returns the names and values of metadata fields
  • table/2 - returns the cells of a table as a list of lists or maps
  • text/2 - returns the text contents as a single string

Parsing functions

  • parse/1 - parses an HTML fragment into a Floki HTML tree
  • parse_doc/1 - parses an HTML doc into a Floki HTML tree

Utility functions

  • inspect_html/2 - prints prettified HTML with a label
  • normalize/1 - parses and re-stringifies HTML
  • pretty/1 - prettifies HTML
  • reject/2 - removes nodes that match the selector

Alias

If you use HtmlQuery a lot, you may want to alias it to the recommended shortcut "Hq":

alias HtmlQuery, as: Hq
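
As a small illustration of the alias in use, the sketch below repeats the keyword-list query from the Examples section with Hq in place of HtmlQuery. It assumes the alias has been declared in the current module or IEx session; the variable name href is only for illustration.

```elixir
alias HtmlQuery, as: Hq

html = ~s|<div> <a href="/logout" test-role="logout-link">logout</a> </div>|

# Same query as the keyword-list example in Examples, written with the alias.
href = Hq.find!(html, test_role: "logout-link") |> Hq.attr("href")
# href is "/logout"
```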

Examples

Get the value of a selected option:

iex> html = ~s|<select> <option value="a" selected>apples</option> <option value="b">bananas</option> </select>|
iex> HtmlQuery.find(html, "select option[selected]") |> HtmlQuery.attr("value")
"a"

Get the text of the selected option, raising unless exactly one option is selected:

iex> html = ~s|<select> <option value="a" selected>apples</option> <option value="b">bananas</option> </select>|
iex> HtmlQuery.find!(html, "select option[selected]") |> HtmlQuery.text()
"apples"

Get the text of all the options:

iex> html = ~s|<select> <option value="a" selected>apples</option> <option value="b">bananas</option> </select>|
iex> HtmlQuery.all(html, "select option") |> Enum.map(&HtmlQuery.text/1)
["apples", "bananas"]

Use a keyword list as the selector (see HtmlQuery.Css for details on selectors):

iex> html = ~s|<div> <a href="/logout" test-role="logout-link">logout</a> </div>|
iex> HtmlQuery.find!(html, test_role: "logout-link") |> HtmlQuery.attr("href")
"/logout"

Summary

Types

attr() - A string or atom representing an attribute name. If an atom, underscores are converted to dashes.

html() - A string, a struct that implements the String.Chars protocol, a Floki HTML tree, or a Floki HTML node.

Functions

all(html, selector) - Finds all elements in html that match selector, returning a Floki HTML tree.

attr(html, attr) - Returns the value of attr from the outermost element of html. If attr is an atom, any underscores are converted to dashes.

find(html, selector) - Finds the first element in html that matches selector, returning a Floki HTML node.

find!(html, selector) - Like find/2 but raises unless exactly one element is found.

form_fields(html) - Returns a map containing the fields of the form in html. Because it returns a map, any information about the order of form fields is lost.

inspect_html(html, label \\ "INSPECTED HTML") - Prints prettified html with a label, and then returns the original html.

meta_tags(html) - Extracts all the meta tags from html, returning a list of maps.

normalize(html) - Parses and then re-stringifies html, increasing the likelihood that two equivalent HTML strings can be considered equal.

parse(html) - Parses an HTML fragment using Floki.parse_fragment!/1, returning a Floki HTML tree.

parse_doc(html) - Parses an HTML document using Floki.parse_document!/1, returning a Floki HTML tree.

pretty(html) - Returns html as a prettified string (delegates to Floki.raw_html/2 with the pretty: true option).

reject(html, selector) - Returns html after removing all nodes that match selector (delegates to Floki.filter_out/2).

table(html, opts \\ []) - Returns the contents of the table as a list of lists or, with as: :maps, as a list of maps.

text(html, separator \\ " ") - Returns the text value of html, separating substrings with a space by default. (Floki will split text into substrings.) You can pass a separator as the second argument; sometimes it's useful to pass an empty string.

Types

attr()

@type attr() :: binary() | atom()

A string or atom representing an attribute name. If an atom, underscores are converted to dashes.

html()

@type html() :: binary() | String.Chars.t() | Floki.html_tree() | Floki.html_node()

A string, a struct that implements the String.Chars protocol, a Floki HTML tree, or a Floki HTML node.

Functions

all(html, selector)

Finds all elements in html that match selector, returning a Floki HTML tree.

iex> html = ~s|<select> <option value="a" selected>apples</option> <option value="b">bananas</option> </select>|
iex> HtmlQuery.all(html, "option")
[
  {"option", [{"value", "a"}, {"selected", "selected"}], ["apples"]},
  {"option", [{"value", "b"}], ["bananas"]}
]

attr(html, attr)

@spec attr(html(), attr()) :: binary() | nil

Returns the value of attr from the outermost element of html. If attr is an atom, any underscores are converted to dashes.

iex> html = ~s|<div> <a href="/logout" test-role="logout-link">logout</a> </div>|
iex> HtmlQuery.find!(html, test_role: "logout-link") |> HtmlQuery.attr("href")
"/logout"
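
The underscore-to-dash conversion for atom attribute names can be sketched as follows. The variable names link, role, and missing are only for illustration, and the nil result for an absent attribute is an assumption based on the @spec above rather than something the examples confirm.

```elixir
html = ~s|<div> <a href="/logout" test-role="logout-link">logout</a> </div>|
link = HtmlQuery.find!(html, "a")

# :test_role is converted to "test-role" before the attribute is looked up.
role = HtmlQuery.attr(link, :test_role)
# role is "logout-link"

# Assumption: per the @spec (binary() | nil), an absent attribute yields nil.
missing = HtmlQuery.attr(link, :data_missing)
```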

find(html, selector)

@spec find(html(), HtmlQuery.Css.selector()) :: Floki.html_node() | nil

Finds the first element in html that matches selector, returning a Floki HTML node, or nil if no element matches.

iex> html = ~s|<select> <option value="a" selected>apples</option> <option value="b">bananas</option> </select>|
iex> HtmlQuery.find(html, "select option[selected]")
{"option", [{"value", "a"}, {"selected", "selected"}], ["apples"]}

find!(html, selector)

Like find/2 but raises unless exactly one element is found.
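
A minimal sketch of the difference between find/2 and find!/2 when several elements match. The exception type raised by find!/2 is not stated here, so the sketch rescues any exception and keeps only its message; the variable names are illustrative.

```elixir
html = ~s|<ul> <li>one</li> <li>two</li> </ul>|

# find/2 returns the first match even though two <li> elements exist.
first = HtmlQuery.find(html, "li")

# find!/2 requires exactly one match, so this call is expected to raise.
result =
  try do
    HtmlQuery.find!(html, "li")
  rescue
    error -> {:raised, Exception.message(error)}
  end
```

Prefer find!/2 in tests where a second match would indicate a bug, and find/2 when only the first match matters.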

form_fields(html)

@spec form_fields(html()) :: %{required(atom()) => binary() | map()}

Returns a map containing the fields of the form in html. Because it returns a map, any information about the order of form fields is lost.

iex> html = ~s|<form> <input type="text" name="color" value="green"> <textarea name="desc">A tree</textarea> </form>|
iex> html |> HtmlQuery.find("form") |> HtmlQuery.form_fields()
%{color: "green", desc: "A tree"}

Field names are converted to snake case atoms:

iex> html = ~s|<form> <input type="text" name="favorite-color" value="green"> </form>|
iex> html |> HtmlQuery.find("form") |> HtmlQuery.form_fields()
%{favorite_color: "green"}

If form field names are in foo[bar] format, then foo becomes a key to a nested map containing bar:

iex> html = ~s|<form> <input type="text" name="profile[name]" value="fido"> <input type="text" name="profile[age]" value="10"> </form>|
iex> html |> HtmlQuery.find("form") |> HtmlQuery.form_fields()
%{profile: %{name: "fido", age: "10"}}

If a text field has no value attribute, it will not be returned at all:

iex> html = ~s|<form> <input type="text" name="no-value"> </form>|
iex> html |> HtmlQuery.find("form") |> HtmlQuery.form_fields()
%{}

iex> html = ~s|<form> <input type="text" name="empty-value" value=""> </form>|
iex> html |> HtmlQuery.find("form") |> HtmlQuery.form_fields()
%{empty_value: ""}

iex> html = ~s|<form> <input type="text" name="non-empty-value" value="something"> </form>|
iex> html |> HtmlQuery.find("form") |> HtmlQuery.form_fields()
%{non_empty_value: "something"}

The checked value of a radio button set is returned, or nil is returned if no value is checked:

iex> html = ~s|<form> <input type="radio" name="x" value="1"> <input type="radio" name="x" value="2" checked> </form>|
iex> html |> HtmlQuery.find("form") |> HtmlQuery.form_fields()
%{x: "2"}

iex> html = ~s|<form> <input type="radio" name="x" value="1"> <input type="radio" name="x" value="2"> </form>|
iex> html |> HtmlQuery.find("form") |> HtmlQuery.form_fields()
%{x: nil}

For checkboxes, the name attribute of the input determines whether a single value or a list is returned. A name that ends in [] allows a browser to send multiple values, so form_fields/1 returns a list of the checked values. A name that does not end in [] yields a single value: the last checked value.

iex> html = ~s|<form> <input type="checkbox" name="x" value="1" checked> <input type="checkbox" name="x" value="2" checked> </form>|
iex> html |> HtmlQuery.find("form") |> HtmlQuery.form_fields()
%{x: "2"}

iex> html = ~s|<form> <input type="checkbox" name="x" value="1"> <input type="checkbox" name="x" value="2"> </form>|
iex> html |> HtmlQuery.find("form") |> HtmlQuery.form_fields()
%{x: nil}

iex> html = ~s|<form>
...>   <input type="hidden" name="x" value="false">
...>   <input type="checkbox" name="x" value="true">
...>   <input type="hidden" name="y" value="false">
...>   <input type="checkbox" name="y" checked value="true">
...> </form>|
iex> html |> HtmlQuery.find("form") |> HtmlQuery.form_fields()
%{x: "false", y: "true"}

iex> html = ~s|<form> <input type="checkbox" name="x[]" value="1" checked> <input type="checkbox" name="x[]" value="2" checked> </form>|
iex> html |> HtmlQuery.find("form") |> HtmlQuery.form_fields()
%{x: ["1", "2"]}

iex> html = ~s|<form> <input type="checkbox" name="x[]" value="1"> <input type="checkbox" name="x[]" value="2"> </form>|
iex> html |> HtmlQuery.find("form") |> HtmlQuery.form_fields()
%{x: []}

inspect_html(html, label \\ "INSPECTED HTML")

@spec inspect_html(html(), binary()) :: html()

Prints prettified html with a label, and then returns the original html.
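
Because inspect_html/2 returns its input, it can be dropped into the middle of a pipeline while debugging. The sketch below assumes nothing beyond the functions documented on this page; the label text and variable names are arbitrary.

```elixir
html = ~s|<div> <span id="name">Alice</span> </div>|

name =
  html
  # Prints the prettified HTML under the label, then passes html through.
  |> HtmlQuery.inspect_html("BEFORE FIND")
  |> HtmlQuery.find!("#name")
  |> HtmlQuery.text()

# name is "Alice"
```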

meta_tags(html)

@spec meta_tags(html()) :: [%{required(binary()) => binary()}]

Extracts all the meta tags from html, returning a list of maps.

iex> html = ~s|<head> <meta charset="utf-8"/> <meta http-equiv="X-UA-Compatible" content="IE=edge"/> </head>|
iex> HtmlQuery.meta_tags(html)
[%{"charset" => "utf-8"}, %{"content" => "IE=edge", "http-equiv" => "X-UA-Compatible"}]

normalize(html)

@spec normalize(html()) :: binary()

Parses and then re-stringifies html, increasing the likelihood that two equivalent HTML strings can be considered equal.

iex> a = ~s|<p id="color">green</p>|
iex> b = ~s|<p  id = "color" >green</p>|
iex> a == b
false
iex> HtmlQuery.normalize(a) == HtmlQuery.normalize(b)
true

parse(html)

@spec parse(html()) :: Floki.html_tree()

Parses an HTML fragment using Floki.parse_fragment!/1, returning a Floki HTML tree.

parse_doc(html)

@spec parse_doc(html()) :: Floki.html_tree()

Parses an HTML document using Floki.parse_document!/1, returning a Floki HTML tree.
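
Since every function accepts a Floki HTML tree, one possible use of parse/1 and parse_doc/1 is to parse once and then run several queries against the result. This is a sketch with made-up HTML and variable names, not a statement about how the library handles parsing internally.

```elixir
# Parse a fragment once, then query the tree more than once.
tree = HtmlQuery.parse(~s|<p id="a">apples</p> <p id="b">bananas</p>|)

a = tree |> HtmlQuery.find!("#a") |> HtmlQuery.text()
b = tree |> HtmlQuery.find!("#b") |> HtmlQuery.text()

# Parse a whole document and query it the same way.
doc =
  HtmlQuery.parse_doc(
    "<html><head><title>Fruit</title></head><body><p>apples</p></body></html>"
  )

title = doc |> HtmlQuery.find!("title") |> HtmlQuery.text()
```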

pretty(html)

@spec pretty(html()) :: binary()

Returns html as a prettified string (delegates to Floki.raw_html/2 with the pretty: true option).
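
A short usage sketch: pretty/1 returns a string, so it can be printed or written to a log. The exact indentation of the output depends on Floki and is not shown here.

```elixir
html = ~s|<div><span id="name">Alice</span></div>|

# Returns a binary; print it to see the indented markup.
pretty_html = HtmlQuery.pretty(html)
IO.puts(pretty_html)
```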

reject(html, selector)

@spec reject(html(), HtmlQuery.Css.selector()) :: html()

Returns html after removing all nodes that match selector (delegates to Floki.filter_out/2).

iex> html = ~s|<div> <span id="name">Alice</span> <span id="password">topaz</span> </div>|
iex> HtmlQuery.reject(html, id: "password") |> HtmlQuery.normalize()
~s|<div><span id="name">Alice</span></div>|

table(html, opts \\ [])

@spec table(
  html(),
  keyword()
) :: [[]] | [map()]

Returns the contents of the table as a list of lists or, with as: :maps, as a list of maps.

Options:

  • as - if :lists (the default), returns the table as a list of lists; if :maps, returns the table as a list of maps.
  • only - a list of the indices of the columns to return; a list of column headers (as strings) to return, assuming that the first row of the table contains the column names; or :all to return all columns (which is the same as not specifying this option at all).
  • except - returns all the columns except the ones whose indices or names are given. only and except can be combined to further reduce the set of columns.
  • headers - if true (the default), returns the list of headers along with the rows. Ignored if the as option is :maps.

Deprecated options:

  • columns - use only instead.

iex> html = "<table> <tr><th>A</th><th>B</th><th>C</th></tr> <tr><td>1</td><td>2</td><td>3</td></tr> </table>"
iex> HtmlQuery.table(html)
[
  ["A", "B", "C"],
  ["1", "2", "3"]
]
iex> HtmlQuery.table(html, as: :maps)
[
  %{"A" => "1", "B" => "2", "C" => "3"}
]
iex> HtmlQuery.table(html, only: [0, 2])
[
  ["A", "C"],
  ["1", "3"]
]
iex> HtmlQuery.table(html, only: [2, 0])
[
  ["C", "A"],
  ["3", "1"]
]
iex> HtmlQuery.table(html, only: ["C", "A"])
[
  ["C", "A"],
  ["3", "1"]
]
iex> HtmlQuery.table(html, except: ["C", "A"])
[
  ["B"],
  ["2"]
]
iex> HtmlQuery.table(html, only: ["C", "A"], headers: false)
[
  ["3", "1"]
]
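
The options can also be combined. The sketch below uses the same table as the examples above; the results are described in comments only as what the option descriptions suggest, and have not been verified here.

```elixir
html = "<table> <tr><th>A</th><th>B</th><th>C</th></tr> <tr><td>1</td><td>2</td><td>3</td></tr> </table>"

# Keep A and B with only, then drop B with except; per the option
# descriptions this should narrow the result to column A.
narrowed = HtmlQuery.table(html, only: ["A", "B"], except: ["B"])

# Return rows as maps restricted to two columns; headers is ignored
# when as is :maps.
maps = HtmlQuery.table(html, as: :maps, only: ["A", "C"])
```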

text(html, separator \\ " ")

@spec text(html(), String.t()) :: binary()

Returns the text value of html, separating substrings with a space by default. (Floki will split text into substrings.) You can pass a separator as the second argument; sometimes it's useful to pass an empty string.

iex> html = ~s|<select> <option value="a" selected>apples</option> <option value="b">bananas</option> </select>|
iex> HtmlQuery.find!(html, "select option[selected]") |> HtmlQuery.text()
"apples"
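
To illustrate the separator argument, here is a sketch with text split across sibling elements. The HTML and variable names are made up for the example, and the exact strings returned depend on how Floki splits the text, so no output is shown.

```elixir
html = ~s|<p><span>12</span><span>:30</span></p>|

# Substrings are joined with the default separator, a single space.
spaced = HtmlQuery.text(html)

# Passing "" joins the substrings with nothing in between, which can be
# useful when adjacent elements together form a single value.
joined = HtmlQuery.text(html, "")
```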