Crawler.Parser (Crawler v1.1.2)

Parses pages and calls a link handler to handle the detected links.

Summary

Functions

Parses the links and returns the page.

Functions

Parses the links and returns the page.

There are two hooks:

  • link_handler is useful when a custom parser calls this default parser but needs a different link handler to process the links.
  • scraper is useful for scraping content immediately, as the parser parses the page. Alternatively, you can access the crawled data asynchronously; refer to the README.
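
As a rough illustration of the scraper hook, the sketch below defines a hypothetical TitleScraper module. It assumes the parser calls scrape/1 with the page and expects {:ok, page} back, mirroring the return shape of Parser.parse in the examples that follow. The module name, the extract_title/1 helper, and the plain map standing in for %Page{} are all illustrative assumptions, not part of Crawler's documented API.

```elixir
defmodule TitleScraper do
  # Hypothetical scraper. Assumption: the parser invokes scrape/1 with the
  # page and expects {:ok, page} in return.
  def scrape(%{body: body} = page) do
    case extract_title(body) do
      nil -> :ok
      title -> IO.puts("title: " <> title)
    end

    # Hand the page back unchanged so parsing can continue.
    {:ok, page}
  end

  # Illustrative helper: returns the trimmed <title> text, or nil if absent.
  def extract_title(body) do
    case Regex.run(~r{<title>(.*?)</title>}s, body, capture: :all_but_first) do
      [title] -> String.trim(title)
      nil -> nil
    end
  end
end

# A plain map stands in for %Page{} so the sketch runs without the library.
{:ok, _page} = TitleScraper.scrape(%{body: "<title> Hello </title>", opts: %{}})
```

Under these assumptions, such a module would be passed in place of Crawler.Scraper in the scraper option shown in the examples.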

Examples

iex> {:ok, page} = Parser.parse(%Page{
iex>   body: "Body",
iex>   opts: %{scraper: Crawler.Scraper, html_tag: "a", content_type: "text/html"}
iex> })
iex> page.body
"Body"

iex> {:ok, page} = Parser.parse(%Page{
iex>   body: "<a href='http://parser/1'>Link</a>",
iex>   opts: %{scraper: Crawler.Scraper, html_tag: "a", content_type: "text/html"}
iex> })
iex> page.body
"<a href='http://parser/1'>Link</a>"

iex> {:ok, page} = Parser.parse(%Page{
iex>   body: "<a name='hello'>Link</a>",
iex>   opts: %{scraper: Crawler.Scraper, html_tag: "a", content_type: "text/html"}
iex> })
iex> page.body
"<a name='hello'>Link</a>"

iex> {:ok, page} = Parser.parse(%Page{
iex>   body: "<a href='http://parser/2' target='_blank'>Link</a>",
iex>   opts: %{scraper: Crawler.Scraper, html_tag: "a", content_type: "text/html"}
iex> })
iex> page.body
"<a href='http://parser/2' target='_blank'>Link</a>"

iex> {:ok, page} = Parser.parse(%Page{
iex>   body: "<a href='parser/2'>Link</a>",
iex>   opts: %{scraper: Crawler.Scraper, html_tag: "a", content_type: "text/html", referrer_url: "http://hello"}
iex> })
iex> page.body
"<a href='parser/2'>Link</a>"

iex> {:ok, page} = Parser.parse(%Page{
iex>   body: "<a href='../parser/2'>Link</a>",
iex>   opts: %{scraper: Crawler.Scraper, html_tag: "a", content_type: "text/html", referrer_url: "http://hello"}
iex> })
iex> page.body
"<a href='../parser/2'>Link</a>"

iex> {:ok, page} = Parser.parse(%Page{
iex>   body: image_file(),
iex>   opts: %{scraper: Crawler.Scraper, html_tag: "img", content_type: "image/png"}
iex> })
iex> page.body
"#{image_file()}"

parse_links(body, opts, link_handler)
