scrape v3.1.0 Scrape.IR.HTML

Information Retrieval functions for extracting data out of HTML documents.

Makes extensive use of Scrape.Tools.DOM under the hood, so a customized jQuery-like approach can be taken.

Summary

Functions

content(dom)

Try to extract the relevant text content from a given document.

description(dom)

Extract the best possible description from an HTML document, or nil if none is found.

feed_urls(dom, url \\ "")

Attempts to find all possible feed URLs in the given HTML document.

icon_url(dom, url \\ "")

Attempts to find something resembling a favicon URL, or nil if none is found.

image_url(dom, url \\ "")

Attempts to find the best image URL for the website, or nil if none is found.

paragraphs(dom)

Attempt to find the most meaningful content snippets in the HTML document.

sentences(dom)

Convenient fallback function if content/1 didn't work. Uses paragraphs/1 under the hood.

simple(dom)

Try to extract the semantically relevant part from a given document.

title(dom)

Extract the best possible title from an HTML document (string or DOM), or nil if none is found.

Functions

content(dom)

content(Scrape.Tools.DOM.dom()) :: nil | String.t()

Try to extract the relevant text content from a given document.

Uses the Readability algorithm, which might fail sometimes. Ideally, it returns a single string containing full sentences. Keep in mind that this function relies on a few heuristics that work well together in many cases, but it offers no guarantees.

description(dom)

description(Scrape.Tools.DOM.dom() | String.t()) :: nil | String.t()

Extract the best possible description from an HTML document, or nil if none is found.

Examples

iex> HTML.description("")
nil

iex> HTML.description("<meta name='description' content='abc' />")
"abc"

feed_urls(dom, url \\ "")

feed_urls(Scrape.Tools.DOM.dom(), String.t()) :: [String.t()]

Attempts to find all possible feed URLs in the given HTML document.

Examples

iex> HTML.feed_urls("")
[]

iex> HTML.feed_urls("<link rel='alternate' href='/feed.rss' />")
["/feed.rss"]

icon_url(dom, url \\ "")

icon_url(Scrape.Tools.DOM.dom(), String.t()) :: nil | String.t()

Attempts to find something resembling a favicon URL, or nil if none is found.

If a root URL is given, relative image URLs are transformed into absolute URLs.
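The relative-to-absolute transformation can be illustrated with a minimal sketch built on `URI.merge/2` from the Elixir standard library. The module `UrlSketch` and its `absolutize/2` function are hypothetical names used only for illustration; they are not part of Scrape, and the library's actual implementation may differ. The handling of the empty default root (`""`) is an assumption based on the examples in this section, where a relative path is returned unchanged.

```elixir
defmodule UrlSketch do
  # Hypothetical helper, NOT part of Scrape: resolves a possibly relative
  # URL against a root URL using the standard library's URI.merge/2.

  # Nothing was found in the document, so there is nothing to resolve.
  def absolutize(nil, _root), do: nil

  # No root URL given (assumed to mirror the `url \\ ""` default):
  # return the path unchanged.
  def absolutize(path, ""), do: path

  # URI.merge/2 requires an absolute base URI and raises otherwise.
  def absolutize(path, root) do
    root
    |> URI.merge(path)
    |> URI.to_string()
  end
end
```

With this sketch, `UrlSketch.absolutize("img.jpg", "https://example.com/blog/")` resolves the image path against the root, while `UrlSketch.absolutize("img.jpg", "")` leaves it as it is.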

Examples

iex> HTML.icon_url("")
nil

iex> HTML.icon_url("<link rel='shortcut icon' href='img.jpg' />")
"img.jpg"

image_url(dom, url \\ "")

image_url(Scrape.Tools.DOM.dom(), String.t()) :: nil | String.t()

Attempts to find the best image URL for the website, or nil if none is found.

If a root URL is given, relative image URLs are transformed into absolute URLs.

Examples

iex> HTML.image_url("")
nil

iex> HTML.image_url("<meta property='og:image' content='img.jpg' />")
"img.jpg"

paragraphs(dom)

paragraphs(Scrape.Tools.DOM.dom()) :: [String.t()]

Attempt to find the most meaningful content snippets in the HTML document.

Can be used as a fallback algorithm if content/1 returned nil but some text corpus is still needed.

A text paragraph is considered relevant if it has a minimum number of characters and contains indicators of a sentence-like structure. This is a very naive approach, but it works surprisingly well so far.
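A relevance check of this kind might look roughly like the sketch below. The module name `ParagraphSketch`, the 30-character threshold, and the choice of sentence indicators (`.`, `!`, `?`) are all assumptions made for illustration; the library's real thresholds and indicators are not documented here.

```elixir
defmodule ParagraphSketch do
  # Hypothetical threshold, NOT taken from Scrape.
  @min_length 30

  # A snippet counts as relevant when it is long enough and contains
  # at least one sentence-like indicator (assumed here: ".", "!" or "?").
  def relevant?(text) when is_binary(text) do
    trimmed = String.trim(text)

    String.length(trimmed) >= @min_length and
      Regex.match?(~r/[.!?]/, trimmed)
  end

  # Keeps only the relevant snippets from a list of candidate texts.
  def filter(texts), do: Enum.filter(texts, &relevant?/1)
end
```

Under these assumptions, a short navigation label such as `"Home | About | Contact"` is rejected, while a full sentence of sufficient length is kept.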

sentences(dom)

sentences(Scrape.Tools.DOM.dom()) :: nil | String.t()

Convenient fallback function if content/1 didn't work. Uses paragraphs/1 under the hood.
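The fallback pattern described above can be sketched as a small chain that tries extractors in order and returns the first non-nil result. `FallbackSketch` and `first_result/2` are hypothetical names and not part of Scrape; the commented usage line assumes an already parsed `dom` value.

```elixir
defmodule FallbackSketch do
  # Hypothetical helper, NOT part of Scrape: applies each extractor to the
  # document in order and returns the first truthy result, or nil if every
  # extractor returned nil.
  def first_result(dom, extractors) do
    Enum.find_value(extractors, fn extract -> extract.(dom) end)
  end
end

# Possible usage with the functions documented on this page:
#
#   FallbackSketch.first_result(dom, [
#     &Scrape.IR.HTML.content/1,
#     &Scrape.IR.HTML.sentences/1
#   ])
```

Passing the extractors as a list keeps the order of preference explicit and makes it easy to add or remove a fallback.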

simple(dom)

Try to extract the semantically relevant part from a given document.

title(dom)

title(Scrape.Tools.DOM.dom()) :: nil | String.t()

Extract the best possible title from an HTML document (string or DOM), or nil if none is found.

Examples

iex> HTML.title("")
nil

iex> HTML.title("<title>abc</title>")
"abc"
