Crawly (Crawly v0.16.0)

Crawly is a fast high-level web crawling & scraping framework for Elixir.

Summary

Functions

Fetches a given URL. Mainly used during spider development, when you need to fetch individual pages and parse them.

Returns a list of known modules that implement the Crawly.Spider behaviour.

Loads spiders from a given directory and the simple storage.

Parses a given response with a given spider. Lets you quickly see the outcome of the spider's parse_item implementation.

Types

Specs

headers_opt() :: {:headers, [Crawly.Request.header()]}

Specs

parsed_item_result() :: Crawly.ParsedItem.t()

Specs

parsed_items() :: [any()]

Specs

pipeline_state() :: %{optional(atom()) => any()}

Specs

request_opt() :: {:request_options, [Crawly.Request.option()]}

Specs

spider() :: module()

Specs

with_opt() :: {:with, nil | module()}

Functions

Specs

Fetches a given URL. Mainly used during spider development, when you need to fetch individual pages and parse them.

The URL is converted to a request, and the request is piped through the middlewares specified in the config, except for Crawly.Middlewares.DomainFilter and Crawly.Middlewares.RobotsTxt.

Provide a spider with the :with option to fetch a given webpage using that spider.
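For comparison with the spider-based call described below, here is a minimal sketch of a fetch without a spider that passes the :headers option from the Types section. It assumes Crawly is a project dependency and its application is running (for example inside iex -S mix). The module name FetchSketch and the User-Agent value are hypothetical. The sketch also assumes that a call without :with returns the response alone rather than a tuple.

```elixir
# Sketch only: FetchSketch and its header value are illustrative names,
# not part of Crawly.
defmodule FetchSketch do
  # Each header is a {name, value} tuple of binaries.
  def headers, do: [{"User-Agent", "CrawlyDocsExample/0.1"}]

  # Fetches a page without a spider. Assumption: with no :with option,
  # Crawly.fetch/2 returns the response on its own.
  def fetch_plain(url) do
    Crawly.fetch(url, headers: headers())
  end
end
```

Calling FetchSketch.fetch_plain("https://www.example.com") from iex should then give a response whose status_code and body fields can be inspected directly.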

Fetching with a spider

To fetch a response from a URL with a spider, define your spider and pass its module name to the :with option.

iex> Crawly.fetch("https://www.example.com", with: MySpider)
{%HTTPoison.Response{...}, %{...}, [...], %{...}}

Using the :with option returns a four-element tuple:

  1. The HTTPoison response.
  2. The result returned from the parse_item/1 callback.
  3. The list of items that have been processed by the declared item pipelines.
  4. The pipeline state, included for debugging purposes.
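The four-element tuple above can be taken apart with a single pattern match. The following is a minimal sketch, assuming Crawly is a project dependency with its application started. ExampleSpider, FetchWithSpider, and the item fields are hypothetical names chosen for illustration. The spider records only the URL and body size so that no HTML parser is required.

```elixir
# Hypothetical spider used only for this illustration.
defmodule ExampleSpider do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url, do: "https://www.example.com"

  @impl Crawly.Spider
  def init, do: [start_urls: ["https://www.example.com"]]

  @impl Crawly.Spider
  def parse_item(response) do
    # Build one item from fields available on the HTTPoison response.
    item = %{url: response.request_url, body_size: byte_size(response.body)}
    %Crawly.ParsedItem{items: [item], requests: []}
  end
end

defmodule FetchWithSpider do
  # Fetches `url` through ExampleSpider and summarizes the four tuple elements.
  def summarize(url) do
    {response, parsed, items, pipeline_state} =
      Crawly.fetch(url, with: ExampleSpider)

    %{
      status: response.status_code,
      extracted: length(parsed.items),
      kept_after_pipelines: length(items),
      pipeline_state: pipeline_state
    }
  end
end
```

Comparing extracted with kept_after_pipelines is a quick way to notice items that the configured pipelines dropped.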

Specs

list_spiders() :: [module()]

Returns a list of known modules that implement the Crawly.Spider behaviour.

Should not be used for spider management. Use functions defined in Crawly.Engine for that.
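As a read-only use of this function, the sketch below turns the discovered modules into a sorted list of names, for example to print them in a console. It assumes Crawly is a project dependency. SpiderListing is a hypothetical helper module, not part of Crawly.

```elixir
# Hypothetical helper: lists spiders for display only and does not
# start, stop, or otherwise manage them.
defmodule SpiderListing do
  # Returns the discovered spider module names as sorted strings.
  def names do
    Crawly.list_spiders()
    |> Enum.map(&inspect/1)
    |> Enum.sort()
  end
end
```

Starting or stopping a spider from that list goes through Crawly.Engine instead, as noted above.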

Specs

load_spiders() :: :ok

Loads spiders from a given directory and the simple storage.

Specs

parse(response, spider) :: {:ok, result}
when response: Crawly.Response.t(),
     spider: atom(),
     result: Crawly.ParsedItem.t()

Parses a given response with a given spider. Lets you quickly see the outcome of the spider's parse_item implementation.
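A minimal offline sketch of this workflow follows, assuming Crawly and HTTPoison are available as dependencies. ParseSketch, TitleSpider, and the sample HTML are hypothetical. The response is built by hand instead of being fetched, on the assumption that an HTTPoison response struct is an acceptable input. The title is extracted with a regular expression only to avoid adding an HTML parser to the example.

```elixir
defmodule ParseSketch do
  # Hypothetical spider; only parse_item/1 is exercised here.
  defmodule TitleSpider do
    use Crawly.Spider

    @impl Crawly.Spider
    def base_url, do: "https://www.example.com"

    @impl Crawly.Spider
    def init, do: [start_urls: ["https://www.example.com"]]

    @impl Crawly.Spider
    def parse_item(response) do
      # Capture the text between <title> tags, or nil when there is none.
      title =
        case Regex.run(~r{<title>(.*?)</title>}s, response.body) do
          [_, text] -> text
          nil -> nil
        end

      %Crawly.ParsedItem{items: [%{title: title}], requests: []}
    end
  end

  # Hand-built response, so no network access is needed.
  def sample_response do
    %HTTPoison.Response{
      status_code: 200,
      body: "<html><head><title>Hello</title></head></html>",
      request_url: "https://www.example.com"
    }
  end

  # Runs the spider's parse_item/1 through Crawly.parse/2 and returns
  # whatever it produces; inspect the value rather than assuming its shape.
  def run, do: Crawly.parse(sample_response(), TitleSpider)
end
```

Running ParseSketch.run() in iex and inspecting the result shows what the spider extracts from the sample page without starting a crawl.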