Crawly (Crawly v0.17.2)

Crawly is a fast, high-level web crawling and scraping framework for Elixir.

Summary


Fetches a given URL. This function is mainly used during spider development, when you need to fetch individual pages and parse them.

Returns a list of known modules that implement the Crawly.Spider behaviour.

Loads spiders from a given directory and from the simple storage.

Parses a given response with a given spider. Lets you quickly see the outcome of the given :parse_item implementation.

Types


headers_opt() :: {:headers, [Crawly.Request.header()]}


parsed_item_result() :: Crawly.ParsedItem.t()


parsed_items() :: [any()]


pipeline_state() :: %{optional(atom()) => any()}


request_opt() :: {:request_options, [Crawly.Request.option()]}


spider() :: module()


with_opt() :: {:with, nil | module()}

Functions


Fetches a given URL. This function is mainly used during spider development, when you need to fetch individual pages and parse them.

The fetched URL is converted to a request, and the request is piped through the middlewares specified in the config (with the exception of Crawly.Middlewares.DomainFilter and Crawly.Middlewares.RobotsTxt).
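As a rough illustration of the middleware handling described above, a project's middleware list might look like the sketch below. The selection of middlewares and the user-agent string are assumptions made for the example, not Crawly's defaults.

```elixir
# config/config.exs (illustrative sketch; adjust to your project)
import Config

config :crawly,
  middlewares: [
    # Excluded when fetching a single URL with Crawly.fetch
    Crawly.Middlewares.DomainFilter,
    # Excluded when fetching a single URL with Crawly.fetch
    Crawly.Middlewares.RobotsTxt,
    # Still applied to the request built from the fetched URL
    Crawly.Middlewares.UniqueRequest,
    # Placeholder user agent; replace with your own
    {Crawly.Middlewares.UserAgent, user_agents: ["ExampleBot/1.0"]}
  ]
```

With a list like this, only the last two entries would take part when a single page is fetched during development.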

Pass a spider via the :with option to fetch the page using that spider.

Fetching with a spider

To fetch a response from a URL with a spider, define your spider and pass its module name to the :with option.

iex> Crawly.fetch("", with: MySpider)
{%HTTPoison.Response{...}, %{...}, [...], %{...}}

Using the :with option returns a 4-item tuple:

  1. The HTTPoison response.
  2. The result returned from the parse_item/1 callback.
  3. The list of items that have been processed by the declared item pipelines.
  4. The pipeline state, included for debugging purposes.
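The following is a minimal sketch of a spider that could be passed to :with, plus a commented-out call that destructures the tuple described above. It assumes a Mix project with :crawly as a dependency. The URL https://example.com is a placeholder, and the regex-based title extraction is a simplification chosen to keep the sketch free of extra dependencies (an HTML parser would normally be the better choice).

```elixir
defmodule MySpider do
  use Crawly.Spider

  # Placeholder site; replace with the site you are developing against.
  @impl Crawly.Spider
  def base_url, do: "https://example.com"

  @impl Crawly.Spider
  def init, do: [start_urls: ["https://example.com/"]]

  # Builds one item holding the page title and follows no further links.
  @impl Crawly.Spider
  def parse_item(response) do
    title =
      case Regex.run(~r/<title>(.*?)<\/title>/s, response.body) do
        [_full_match, text] -> String.trim(text)
        nil -> nil
      end

    %Crawly.ParsedItem{
      items: [%{title: title, url: response.request_url}],
      requests: []
    }
  end
end

# In iex, the four elements can then be bound by pattern matching:
#
#   {response, parsed_item, items, pipeline_state} =
#     Crawly.fetch("https://example.com/", with: MySpider)
```

Here parsed_item is whatever parse_item/1 returned, while items reflects what remains after the configured item pipelines have run, so comparing the two can help spot a pipeline that drops or alters items.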


list_spiders() :: [module()]

Returns a list of known modules that implement the Crawly.Spider behaviour.

Should not be used for spider management. Use functions defined in Crawly.Engine for that.


load_spiders() :: :ok

Loads spiders from a given directory and from the simple storage.


parse(response, spider) :: {:ok, result}
when response: Crawly.Response.t(),
     spider: atom(),
     result: Crawly.ParsedItem.t()

Parses a given response with a given spider. Lets you quickly see the outcome of the given :parse_item implementation.