Crawly (Crawly v0.17.2)
Crawly is a fast high-level web crawling & scraping framework for Elixir.
Summary
Functions
fetch(url, opts)
Fetches a given URL. This function is mainly used during spider development, when you need to fetch individual pages and parse them.
list_spiders()
Returns a list of known modules that implement the Crawly.Spider behaviour.
load_spiders()
Loads spiders from a given directory and from the simple storage.
parse(response, spider)
Parses a given response with a given spider. Lets you quickly see the outcome of the spider's parse_item implementation.
Types
Specs
headers_opt() :: {:headers, [Crawly.Request.header()]}
Specs
parsed_item_result() :: Crawly.ParsedItem.t()
Specs
parsed_items() :: [any()]
Specs
request_opt() :: {:request_options, [Crawly.Request.option()]}
Specs
spider() :: module()
Specs
with_opt() :: {:with, nil | module()}
Functions
Specs
fetch(url, opts) ::
  HTTPoison.Response.t()
  | {HTTPoison.Response.t(), parsed_item_result(), parsed_items(), pipeline_state()}
when url: binary(), opts: [with_opt() | request_opt() | headers_opt()]
Fetches a given URL. This function is mainly used during spider development, when you need to fetch individual pages and parse them.
The URL is converted to a request, and the request is piped through the middlewares specified in the config (with the exception of Crawly.Middlewares.DomainFilter and Crawly.Middlewares.RobotsTxt).
Provide a spider with the :with option to fetch a given webpage using that spider.
Fetching with a spider
To fetch a response from a URL with a spider, define your spider and pass the module name to the :with option.
iex> Crawly.fetch("https://www.example.com", with: MySpider)
{%HTTPoison.Response{...}, %{...}, [...], %{...}}
Using the :with option will return a 4-item tuple:
- The HTTPoison response
- The result returned from the parse_item/1 callback
- The list of items that have been processed by the declared item pipelines
- The pipeline state, included for debugging purposes
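Because fetch can return either a bare response or the 4-item tuple described above, calling code may want to handle both shapes uniformly. The following is a minimal sketch of one way to do that. FetchResult and normalize/1 are hypothetical names introduced for illustration and are not part of Crawly.

```elixir
defmodule FetchResult do
  # Hypothetical helper, NOT part of Crawly: turns either return shape
  # of a fetch call into a map with the same four keys.

  # 4-item tuple shape, as returned when the :with option is given.
  def normalize({response, parsed_item, items, pipeline_state}) do
    %{
      response: response,
      parsed_item: parsed_item,
      items: items,
      pipeline_state: pipeline_state
    }
  end

  # Anything else is treated as a bare response (no spider was used).
  def normalize(response) do
    %{response: response, parsed_item: nil, items: [], pipeline_state: nil}
  end
end
```

Assuming a spider module named MySpider, as in the example above, the helper could be applied with `Crawly.fetch("https://www.example.com", with: MySpider) |> FetchResult.normalize()`.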
Specs
list_spiders() :: [module()]
Returns a list of known modules that implement the Crawly.Spider behaviour.
Should not be used for spider management; use the functions defined in Crawly.Engine for that.
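To illustrate what "implements a behaviour" means at the module level, here is a small sketch of how behaviour implementers can be detected from compiled module attributes. It is not taken from Crawly's source and may differ from how list_spiders works internally. DemoSpider is a hypothetical stand-in for Crawly.Spider so that the example runs without the library, and the other module names are likewise made up.

```elixir
defmodule DemoSpider do
  # Hypothetical stand-in for the Crawly.Spider behaviour.
  @callback parse_item(term()) :: map()
end

defmodule GoodSpider do
  @behaviour DemoSpider

  @impl DemoSpider
  def parse_item(_response), do: %{items: [], requests: []}
end

defmodule NotASpider do
  def hello, do: :world
end

defmodule BehaviourCheck do
  # Returns true when `module` declares `behaviour` via @behaviour.
  # Declared behaviours are stored in the compiled module's attributes.
  def implements?(module, behaviour) do
    module.module_info(:attributes)
    |> Keyword.get_values(:behaviour)
    |> List.flatten()
    |> Enum.member?(behaviour)
  end
end
```

With these definitions, `BehaviourCheck.implements?(GoodSpider, DemoSpider)` is true and `BehaviourCheck.implements?(NotASpider, DemoSpider)` is false.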
Specs
load_spiders() :: :ok
Loads spiders from a given directory and from the simple storage.
Specs
parse(response, spider) :: {:ok, result} when response: Crawly.Response.t(), spider: atom(), result: Crawly.ParsedItem.t()
Parses a given response with a given spider. Lets you quickly see the outcome of the spider's parse_item implementation.