Crawler.Parser (Crawler v1.5.0)

Parses pages and calls a link handler to handle the detected links.

Summary

Functions

parse(input)

Parses the links and returns the page.

parse_links(body, opts, link_handler)

Functions

parse(input)

Parses the links and returns the page.

There are two hooks:

link_handler is useful when a custom parser calls this default parser but uses a different link handler to process the links.
scraper is useful for scraping content immediately, as the parser parses the page. Alternatively, you can access the crawled data asynchronously; refer to the README.
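
The two hooks can be pictured with a small, self-contained sketch. It does not use Crawler's own code: HookSketch, CollectingScraper, the plain-map page, the two-argument handler, and the regex-based link extraction are all hypothetical stand-ins, chosen only to show where each hook would be invoked. The only assumption about the real library is that a scraper module exposes a scrape function returning {:ok, page}.

```elixir
defmodule HookSketch do
  # Hypothetical stand-in for a parser with two hooks; this is NOT Crawler's code.
  def parse(%{body: body, opts: opts} = page, link_handler) do
    # Naive href extraction, for illustration only.
    ~r/href='([^']+)'/
    |> Regex.scan(body, capture: :all_but_first)
    |> List.flatten()
    # Hook 1: every detected link is passed to the supplied link handler.
    |> Enum.each(&link_handler.(&1, opts))

    # Hook 2: the scraper module from opts sees the page as soon as it is parsed.
    {:ok, _} = opts.scraper.scrape(page)

    {:ok, page}
  end
end

defmodule CollectingScraper do
  # Hypothetical scraper: reports the body to the calling process, returns the page unchanged.
  def scrape(page) do
    send(self(), {:scraped, page.body})
    {:ok, page}
  end
end

# A custom link handler that records links instead of dispatching them for crawling.
handler = fn link, _opts -> send(self(), {:link, link}) end

page = %{body: "<a href='http://parser/1'>Link</a>", opts: %{scraper: CollectingScraper}}
{:ok, ^page} = HookSketch.parse(page, handler)
```

In this sketch the page comes back unchanged, mirroring the examples below where page.body equals the input; the hooks only observe the links and the page.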

Examples

iex> {:ok, page} = Parser.parse(%Page{
iex>   body: "Body",
iex>   opts: %{scraper: Crawler.Scraper, html_tag: "a", content_type: "text/html"}
iex> })
iex> page.body
"Body"

iex> {:ok, page} = Parser.parse(%Page{
iex>   body: "<a href='http://parser/1'>Link</a>",
iex>   opts: %{scraper: Crawler.Scraper, html_tag: "a", content_type: "text/html"}
iex> })
iex> page.body
"<a href='http://parser/1'>Link</a>"

iex> {:ok, page} = Parser.parse(%Page{
iex>   body: "<a name='hello'>Link</a>",
iex>   opts: %{scraper: Crawler.Scraper, html_tag: "a", content_type: "text/html"}
iex> })
iex> page.body
"<a name='hello'>Link</a>"

iex> {:ok, page} = Parser.parse(%Page{
iex>   body: "<a href='http://parser/2' target='_blank'>Link</a>",
iex>   opts: %{scraper: Crawler.Scraper, html_tag: "a", content_type: "text/html"}
iex> })
iex> page.body
"<a href='http://parser/2' target='_blank'>Link</a>"

iex> {:ok, page} = Parser.parse(%Page{
iex>   body: "<a href='parser/2'>Link</a>",
iex>   opts: %{scraper: Crawler.Scraper, html_tag: "a", content_type: "text/html", referrer_url: "http://hello"}
iex> })
iex> page.body
"<a href='parser/2'>Link</a>"

iex> {:ok, page} = Parser.parse(%Page{
iex>   body: "<a href='../parser/2'>Link</a>",
iex>   opts: %{scraper: Crawler.Scraper, html_tag: "a", content_type: "text/html", referrer_url: "http://hello"}
iex> })
iex> page.body
"<a href='../parser/2'>Link</a>"

iex> {:ok, page} = Parser.parse(%Page{
iex>   body: image_file(),
iex>   opts: %{scraper: Crawler.Scraper, html_tag: "img", content_type: "image/png"}
iex> })
iex> page.body
"#{image_file()}"

parse_links(body, opts, link_handler)
