scrape v3.1.0 Scrape.IR.HTML

Information Retrieval functions for extracting data out of HTML documents.

Makes extensive use of Scrape.Tools.DOM under the hood, so a customized jQuery-like approach can be taken.

Summary

Functions

Try to extract the relevant text content from a given document.

Extract the best possible description from an HTML document or nil.

Attempts to fetch all possible feed_urls from the given HTML document.

Attempts to find something resembling a favicon url or nil.

Attempts to find the best image_url for the website or nil.

Attempt to find the most meaningful content snippets in the HTML document.

Convenient fallback function if content/1 didn't work. Uses paragraphs/1 under the hood.

Try to extract the semantically relevant part from a given document.

Extract the best possible title from an HTML document (string or DOM) or nil.

Functions

content(dom)
content(Scrape.Tools.DOM.dom()) :: nil | String.t()

Try to extract the relevant text content from a given document.

Uses the Readability algorithm, which can fail on some documents. Ideally, it returns a single string containing full sentences. Keep in mind that this method relies on a handful of heuristics that work well together in many cases but offer no guarantees.

description(dom)
description(Scrape.Tools.DOM.dom() | String.t()) :: nil | String.t()

Extract the best possible description from an HTML document or nil.

Examples

iex> HTML.description("")
nil

iex> HTML.description("<meta name='description' content='abc' />")
"abc"

feed_urls(dom, url \\ "")
feed_urls(Scrape.Tools.DOM.dom(), String.t()) :: [String.t()]

Attempts to fetch all possible feed_urls from the given HTML document.

Examples

iex> HTML.feed_urls("")
[]

iex> HTML.feed_urls("<link rel='alternate' href='/feed.rss' />")
["/feed.rss"]

icon_url(dom, url \\ "")
icon_url(Scrape.Tools.DOM.dom(), String.t()) :: nil | String.t()

Attempts to find something resembling a favicon url or nil.

If a root URL is given, relative icon URLs are transformed into absolute URLs.
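
The relative-to-absolute transformation described above can be approximated with the standard library alone. The following is a minimal sketch, not Scrape's actual implementation: the module name AbsoluteUrlSketch and the function absolutize/2 are hypothetical, and the sketch assumes that an empty root means "leave the URL untouched", mirroring the "" default of the url argument.

```elixir
defmodule AbsoluteUrlSketch do
  # Hypothetical helper, not part of Scrape: resolves a possibly relative
  # URL against a root URL using only the standard library.

  # Nothing was found in the document, so there is nothing to resolve.
  def absolutize(nil, _root), do: nil

  # Assumption: an empty root means the URL is returned unchanged.
  def absolutize(url, ""), do: url

  # URI.merge/2 follows RFC 3986 reference resolution. The root must be
  # an absolute URI, otherwise URI.merge/2 raises an ArgumentError.
  def absolutize(url, root) do
    root
    |> URI.merge(url)
    |> URI.to_string()
  end
end
```

Under these assumptions, AbsoluteUrlSketch.absolutize("img.jpg", "https://example.com/blog/post.html") resolves against the directory of the root URL, while a URL that is already absolute passes through unchanged.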

Examples

iex> HTML.icon_url("")
nil

iex> HTML.icon_url("<link rel='shortcut icon' href='img.jpg' />")
"img.jpg"

image_url(dom, url \\ "")
image_url(Scrape.Tools.DOM.dom(), String.t()) :: nil | String.t()

Attempts to find the best image_url for the website or nil.

If a root URL is given, relative image URLs are transformed into absolute URLs.

Examples

iex> HTML.image_url("")
nil

iex> HTML.image_url("<meta property='og:image' content='img.jpg' />")
"img.jpg"

paragraphs(dom)
paragraphs(Scrape.Tools.DOM.dom()) :: [String.t()]

Attempt to find the most meaningful content snippets in the HTML document.

Can be used as a fallback algorithm if content/1 returned nil but some text corpus is still needed.

A text paragraph counts as relevant if it has a minimum number of characters and shows some indication of a sentence-like structure. This is a very naive approach, but it works surprisingly well so far.
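
The relevance rule above could look roughly like the following sketch. It is an illustration only: the module ParagraphSketch, the 30-character threshold, and the specific "sentence-like" checks (at least one space, terminal punctuation) are assumptions, since the documentation does not state which threshold or indicators Scrape actually uses.

```elixir
defmodule ParagraphSketch do
  # Assumed threshold; the value Scrape actually uses is not documented here.
  @min_length 30

  # A snippet is kept if it is long enough and looks like a sentence:
  # it contains at least one space and ends with terminal punctuation.
  def relevant?(text) when is_binary(text) do
    trimmed = String.trim(text)

    String.length(trimmed) >= @min_length and
      String.contains?(trimmed, " ") and
      String.ends_with?(trimmed, [".", "!", "?"])
  end

  # Keeps only the snippets that pass the relevance check, in order.
  def filter(snippets), do: Enum.filter(snippets, &relevant?/1)
end
```

Short navigation labels such as "Home" or "Menu" fail the length check, which is the main reason such a naive filter tends to work on typical article pages.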

sentences(dom)
sentences(Scrape.Tools.DOM.dom()) :: nil | String.t()

Convenient fallback function if content/1 didn't work. Uses paragraphs/1 under the hood.
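
One way to wire up such a fallback is a small helper that tries several extractors in order. This is a sketch under assumptions: FallbackSketch and first_text/2 are hypothetical names, and treating an empty string the same as nil is a choice made here, not documented behavior.

```elixir
defmodule FallbackSketch do
  # Hypothetical helper, not part of Scrape: tries each extractor in order
  # and returns the first non-empty string, or nil if none produces one.
  def first_text(dom, extractors) do
    Enum.find_value(extractors, fn extract ->
      case extract.(dom) do
        text when is_binary(text) and text != "" -> text
        _ -> nil
      end
    end)
  end
end
```

With Scrape loaded, the extractor list would presumably be [&Scrape.IR.HTML.content/1, &Scrape.IR.HTML.sentences/1], so that sentences/1 is only consulted when content/1 yields nothing usable.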

Try to extract the semantically relevant part from a given document.

Uses the Readability algorithm, which can fail on some documents. Ideally, it returns a single string containing full sentences. Keep in mind that this method relies on a handful of heuristics that work well together in many cases but offer no guarantees.

title(dom)

Extract the best possible title from an HTML document (string or DOM) or nil.

Examples

iex> HTML.title("")
nil

iex> HTML.title("<title>abc</title>")
"abc"