Crawly (Crawly v0.17.2)

Crawly is a fast high-level web crawling & scraping framework for Elixir.

Summary

Types

headers_opt()

parsed_item_result()

parsed_items()

pipeline_state()

request_opt()

spider()

with_opt()

Functions

fetch(url, opts \\ [])

Fetches a given URL. This function is mainly used during spider development, when you need to fetch individual pages and parse them.

list_spiders()

Returns a list of known modules that implement the Crawly.Spider behaviour.

load_spiders()

Loads spiders from a given directory and from the simple storage.

parse(response, spider)

Parses a given response with a given spider. Lets you quickly see the outcome of the spider's parse_item/1 implementation.

Types

headers_opt()

Specs

headers_opt() :: {:headers, [Crawly.Request.header()]}

parsed_item_result()

Specs

parsed_item_result() :: Crawly.ParsedItem.t()

parsed_items()

Specs

parsed_items() :: [any()]

pipeline_state()

Specs

pipeline_state() :: %{optional(atom()) => any()}

request_opt()

Specs

request_opt() :: {:request_options, [Crawly.Request.option()]}

spider()

Specs

spider() :: module()

with_opt()

Specs

with_opt() :: {:with, nil | module()}

Functions

fetch(url, opts \\ [])

Specs

fetch(url, opts) ::
  HTTPoison.Response.t()
  | {HTTPoison.Response.t(), parsed_item_result(), parsed_items(),
     pipeline_state()}
when url: binary(), opts: [with_opt() | request_opt() | headers_opt()]

Fetches a given URL. This function is mainly used during spider development, when you need to fetch individual pages and parse them.

The URL is converted to a request, which is piped through the middlewares specified in the config (with the exception of Crawly.Middlewares.DomainFilter and Crawly.Middlewares.RobotsTxt).

Provide a spider with the :with option to fetch a given webpage using that spider.
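Besides :with, the spec above accepts :headers and :request_options. The following is a minimal sketch of passing both, assuming a project with the :crawly dependency and network access. The User-Agent value is hypothetical, and the assumption that :request_options are handed through to the HTTP client (recv_timeout is an HTTPoison option) is not stated in this page.

```elixir
# Assumes a project with the :crawly dependency and network access.
response =
  Crawly.fetch("https://www.example.com",
    # Request headers as {name, value} tuples (hypothetical value).
    headers: [{"User-Agent", "my-crawler/0.1"}],
    # Assumption: forwarded to the HTTP client; recv_timeout is an HTTPoison option.
    request_options: [recv_timeout: 10_000]
  )

# Without :with, the spec documents a bare HTTPoison.Response as the result.
IO.inspect(response.status_code, label: "status")
```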

Fetching with a spider

To fetch a response from a URL with a spider, define your spider, and pass the module name to the :with option.

iex> Crawly.fetch("https://www.example.com", with: MySpider)
{%HTTPoison.Response{...}, %{...}, [...], %{...}}

Using the :with option returns a 4-item tuple:

1. The HTTPoison response.
2. The result returned from the parse_item/1 callback.
3. The list of items that have been processed by the declared item pipelines.
4. The pipeline state, included for debugging purposes.
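Because fetch/2 returns either a bare response or the 4-item tuple, code that inspects the result may want to handle both shapes. The sketch below shows one way to do that with pattern matching. FetchResult is a hypothetical helper and not part of Crawly, and it assumes the parsed item exposes :items and :requests keys, as Crawly.ParsedItem does to the best of my knowledge.

```elixir
defmodule FetchResult do
  @moduledoc "Hypothetical helper for inspecting Crawly.fetch/2 results; not part of Crawly."

  # With :with, fetch/2 returns {response, parsed_item, items, pipeline_state}.
  def summarize({%{status_code: status}, parsed, items, state})
      when is_map(parsed) and is_list(items) and is_map(state) do
    %{
      status: status,
      # Assumption: the parsed item exposes :items and :requests keys.
      extracted: length(Map.get(parsed, :items, [])),
      follow_ups: length(Map.get(parsed, :requests, [])),
      # Items that made it through the declared item pipelines.
      after_pipelines: length(items),
      state_keys: Map.keys(state)
    }
  end

  # Without :with, fetch/2 returns only the response.
  def summarize(%{status_code: status}), do: %{status: status}
end
```

With a spider defined, this could be used as `"https://www.example.com" |> Crawly.fetch(with: MySpider) |> FetchResult.summarize()`. Comparing `extracted` with `after_pipelines` gives a quick hint of whether an item pipeline is dropping items.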

list_spiders()

Specs

list_spiders() :: [module()]

Returns a list of known modules that implement the Crawly.Spider behaviour.

Should not be used for spider management. Use functions defined in Crawly.Engine for that.

load_spiders()

Specs

load_spiders() :: :ok

Loads spiders from a given directory and from the simple storage.

parse(response, spider)

Specs

parse(response, spider) :: {:ok, result}
when response: Crawly.Response.t(),
     spider: atom(),
     result: Crawly.ParsedItem.t()

Parses a given response with a given spider. Lets you quickly see the outcome of the spider's parse_item/1 implementation.
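A typical development loop is to fetch a page once and then re-run parse/2 against it while editing the spider. The sketch below assumes a project with the :crawly dependency and network access. MySpider and the item fields are illustrative, and the fallback clause is there only because this page documents {:ok, result} without describing any other outcome.

```elixir
# Assumes a project with the :crawly dependency and network access.
defmodule MySpider do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://www.example.com"

  @impl Crawly.Spider
  def init(), do: [start_urls: ["https://www.example.com"]]

  @impl Crawly.Spider
  def parse_item(response) do
    # Illustrative item: the page URL and the size of its body.
    item = %{url: response.request_url, body_bytes: byte_size(response.body)}
    %Crawly.ParsedItem{items: [item], requests: []}
  end
end

# Fetch without :with to get the bare response, then parse it with the spider.
response = Crawly.fetch("https://www.example.com")

case Crawly.parse(response, MySpider) do
  # Shape documented in the spec above.
  {:ok, parsed} -> IO.inspect(parsed.items, label: "items")
  # Anything else is printed as-is rather than assumed.
  other -> IO.inspect(other, label: "unexpected result")
end
```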