Crawler.Parser (Crawler v1.5.0)

Parses pages and calls a link handler to handle the detected links.

Functions

Parses the links and returns the page.

There are two hooks:

  • link_handler is useful when a custom parser calls this default parser but uses a different link handler to process the detected links.
  • scraper is useful for scraping content immediately, as the parser parses the page. Alternatively, you can access the crawled data asynchronously; refer to the README.
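
As a rough illustration of the link_handler hook, the sketch below shows a custom parser that delegates link detection to parse_links/3 while supplying its own handler. MyApp.CustomParser and log_link/2 are hypothetical names. The sketch assumes that Crawler.Parser.Spec is the behaviour a custom parser implements and that the handler is invoked with two arguments (the detected link and the options map); check both against the library source before relying on them.

```elixir
defmodule MyApp.CustomParser do
  # Hypothetical module. Assumes Crawler.Parser.Spec is the parser behaviour.
  @behaviour Crawler.Parser.Spec

  require Logger

  @impl true
  def parse(%{body: body, opts: opts} = page) do
    # Reuse the default link detection, but route each detected link
    # through log_link/2 instead of the default handler.
    Crawler.Parser.parse_links(body, opts, &log_link/2)

    # This sketch skips the scraper hook and returns the page unchanged.
    {:ok, page}
  end

  # Anything that is not a page (for example an error tuple) is ignored here.
  def parse(_other), do: :ok

  # Assumption: the handler receives the detected link and the options map.
  defp log_link(link, _opts) do
    Logger.info("Detected link: #{inspect(link)}")
  end
end
```

Because this handler only logs, links are detected but not followed. A handler that should continue the crawl would need to pass the link on to whatever dispatches new requests.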

Examples

iex> {:ok, page} = Parser.parse(%Page{
iex>   body: "Body",
iex>   opts: %{scraper: Crawler.Scraper, html_tag: "a", content_type: "text/html"}
iex> })
iex> page.body
"Body"

iex> {:ok, page} = Parser.parse(%Page{
iex>   body: "<a href='http://parser/1'>Link</a>",
iex>   opts: %{scraper: Crawler.Scraper, html_tag: "a", content_type: "text/html"}
iex> })
iex> page.body
"<a href='http://parser/1'>Link</a>"

iex> {:ok, page} = Parser.parse(%Page{
iex>   body: "<a name='hello'>Link</a>",
iex>   opts: %{scraper: Crawler.Scraper, html_tag: "a", content_type: "text/html"}
iex> })
iex> page.body
"<a name='hello'>Link</a>"

iex> {:ok, page} = Parser.parse(%Page{
iex>   body: "<a href='http://parser/2' target='_blank'>Link</a>",
iex>   opts: %{scraper: Crawler.Scraper, html_tag: "a", content_type: "text/html"}
iex> })
iex> page.body
"<a href='http://parser/2' target='_blank'>Link</a>"

iex> {:ok, page} = Parser.parse(%Page{
iex>   body: "<a href='parser/2'>Link</a>",
iex>   opts: %{scraper: Crawler.Scraper, html_tag: "a", content_type: "text/html", referrer_url: "http://hello"}
iex> })
iex> page.body
"<a href='parser/2'>Link</a>"

iex> {:ok, page} = Parser.parse(%Page{
iex>   body: "<a href='../parser/2'>Link</a>",
iex>   opts: %{scraper: Crawler.Scraper, html_tag: "a", content_type: "text/html", referrer_url: "http://hello"}
iex> })
iex> page.body
"<a href='../parser/2'>Link</a>"

iex> {:ok, page} = Parser.parse(%Page{
iex>   body: image_file(),
iex>   opts: %{scraper: Crawler.Scraper, html_tag: "img", content_type: "image/png"}
iex> })
iex> page.body
"#{image_file()}"

parse_links(body, opts, link_handler)