scrape v3.1.0 Scrape.IR.HTML
Information Retrieval functions for extracting data out of HTML documents.
Makes extensive use of Scrape.Tools.DOM under the hood, so a customized jQuery-like approach can be taken.
Summary
Functions
content(dom)
Try to extract the relevant text content from a given document.
description(dom)
Extract the best possible description from an HTML document or nil.
feed_urls(dom, url \\ "")
Attempts to fetch all possible feed_urls from the given HTML document.
icon_url(dom, url \\ "")
Attempts to find something resembling a favicon url or nil.
image_url(dom, url \\ "")
Attempts to find the best image_url for the website or nil.
paragraphs(dom)
Attempt to find the most meaningful content snippets in the HTML document.
sentences(dom)
Convenient fallback function if content/1 didn't work. Uses paragraphs/1 under the hood.
simple(dom)
Try to extract the semantically relevant part from a given document.
title(dom)
Extract the best possible title from an HTML document (string or DOM) or nil.
Functions
content(dom)
content(Scrape.Tools.DOM.dom()) :: nil | String.t()
Try to extract the relevant text content from a given document.
Uses the Readability algorithm, which can fail on some documents. Ideally, it returns a single string containing full sentences. Keep in mind that the method relies on a handful of heuristics that work well together in many cases but offer no guarantees.
description(dom)
description(Scrape.Tools.DOM.dom() | String.t()) :: nil | String.t()
Extract the best possible description from an HTML document or nil.
Examples
iex> HTML.description("")
nil
iex> HTML.description("<meta name='description' content='abc' />")
"abc"
feed_urls(dom, url \\ "")
feed_urls(Scrape.Tools.DOM.dom(), String.t()) :: [String.t()]
Attempts to fetch all possible feed_urls from the given HTML document.
Examples
iex> HTML.feed_urls("")
[]
iex> HTML.feed_urls("<link rel='alternate' href='/feed.rss' />")
["/feed.rss"]
icon_url(dom, url \\ "")
icon_url(Scrape.Tools.DOM.dom(), String.t()) :: nil | String.t()
Attempts to find something resembling a favicon URL, or returns nil.
If a root URL is given, relative icon URLs are transformed into absolute URLs.
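The relative-to-absolute transformation can be illustrated with the standard library alone. The following is a minimal sketch, not the library's actual implementation: the module name UrlSketch and the function absolutize/2 are hypothetical, and it assumes that an empty root (the documented default for the url argument) means the link is returned unchanged.

```elixir
defmodule UrlSketch do
  # Hypothetical helper, not part of Scrape: resolve `link` against `root`.
  def absolutize(link, root \\ "")

  # Assumption: without a root URL there is nothing to resolve against.
  def absolutize(link, ""), do: link

  # URI.merge/2 raises ArgumentError if `root` is not an absolute URI.
  def absolutize(link, root) when is_binary(link) and is_binary(root) do
    root
    |> URI.merge(link)
    |> URI.to_string()
  end
end

IO.inspect(UrlSketch.absolutize("img.jpg"))
# => "img.jpg"
IO.inspect(UrlSketch.absolutize("img.jpg", "https://example.com/blog/"))
# => "https://example.com/blog/img.jpg"
IO.inspect(UrlSketch.absolutize("/favicon.ico", "https://example.com/blog/post"))
# => "https://example.com/favicon.ico"
```

The same idea would apply to image_url/2, which documents the same root URL behaviour.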
Examples
iex> HTML.icon_url("")
nil
iex> HTML.icon_url("<link rel='shortcut icon' href='img.jpg' />")
"img.jpg"
image_url(dom, url \\ "")
image_url(Scrape.Tools.DOM.dom(), String.t()) :: nil | String.t()
Attempts to find the best image_url for the website, or returns nil.
If a root URL is given, relative image URLs are transformed into absolute URLs.
Examples
iex> HTML.image_url("")
nil
iex> HTML.image_url("<meta property='og:image' content='img.jpg' />")
"img.jpg"
paragraphs(dom)
paragraphs(Scrape.Tools.DOM.dom()) :: [String.t()]
Attempt to find the most meaningful content snippets in the HTML document.
Can be used as a fallback algorithm if content/1 returned nil but some text corpus is still needed to work with.
A text paragraph counts as relevant if it has a minimum number of characters and contains indicators of a sentence-like structure. This is a very naive approach, but it works surprisingly well so far.
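The relevance rule described above can be sketched as a simple filter over text snippets. This is only an illustration under stated assumptions: the module ParagraphSketch, the 30-character threshold, and the punctuation-based indicator are all hypothetical, since the documentation does not state the values or indicators the library uses.

```elixir
defmodule ParagraphSketch do
  # Assumed threshold; the library's actual minimum is not documented here.
  @min_length 30

  # Trim each snippet and keep only those that pass the relevance check.
  def relevant(snippets) when is_list(snippets) do
    snippets
    |> Enum.map(&String.trim/1)
    |> Enum.filter(&relevant?/1)
  end

  # A snippet is relevant if it is long enough and looks sentence-like.
  def relevant?(text) when is_binary(text) do
    String.length(text) >= @min_length and sentence_like?(text)
  end

  # Assumed indicator: terminal punctuation followed by whitespace or end of text.
  defp sentence_like?(text), do: Regex.match?(~r/[.!?](\s|$)/, text)
end

snippets = [
  "Home",
  "This is a full sentence with enough characters.",
  "A long navigation label without any punctuation at all"
]

IO.inspect(ParagraphSketch.relevant(snippets))
# => ["This is a full sentence with enough characters."]
```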
sentences(dom)
sentences(Scrape.Tools.DOM.dom()) :: nil | String.t()
Convenient fallback function if content/1 didn't work. Uses paragraphs/1 under the hood.
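One way to wire up such a fallback is to try extractors in order and take the first usable result. The sketch below is an assumption about how a caller might do this, not something the library provides: FallbackSketch and first_result/2 are hypothetical names, and the two anonymous functions are stand-ins so the example runs without the Scrape dependency.

```elixir
defmodule FallbackSketch do
  # Try each extractor in order and return the first usable result, or nil.
  def first_result(input, extractors) when is_list(extractors) do
    Enum.find_value(extractors, fn extract ->
      case extract.(input) do
        nil -> nil
        # Assumption: an empty string is treated like a failed extraction.
        "" -> nil
        result -> result
      end
    end)
  end
end

# Stand-in extractors: the first always fails, the second trims the input.
strict = fn _doc -> nil end
lenient = fn doc -> String.trim(doc) end

IO.inspect(FallbackSketch.first_result("  some text.  ", [strict, lenient]))
# => "some text."
```

With the library loaded, the extractor list would presumably be [&Scrape.IR.HTML.content/1, &Scrape.IR.HTML.sentences/1].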
simple(dom)
Try to extract the semantically relevant part from a given document.
Uses the Readability algorithm, which can fail on some documents. Ideally, it returns a single string containing full sentences. Keep in mind that the method relies on a handful of heuristics that work well together in many cases but offer no guarantees.
title(dom)
title(Scrape.Tools.DOM.dom()) :: nil | String.t()
Extract the best possible title from an HTML document (string or DOM) or nil.
Examples
iex> HTML.title("")
nil
iex> HTML.title("<title>abc</title>")
"abc"