Crawlie (crawlie v1.0.0)

The simple Elixir web crawler.

Summary

Functions

crawl(source, parser_logic, options \\ [])
  Crawls the urls provided in source, using the Crawlie.ParserLogic provided in parser_logic

crawl_and_track_stats(source, parser_logic, options \\ [])
  Crawls the urls provided in source, using the Crawlie.ParserLogic provided and collects the crawling statistics

Functions

crawl(source, parser_logic, options \\ [])
crawl(Stream.t, module, Keyword.t) :: Flow.t

Crawls the urls provided in source, using the Crawlie.ParserLogic provided in parser_logic.

The options are used to tweak the crawler’s behaviour. You can use most of the HTTPoison options, as well as Crawlie-specific options.

It is perfectly fine to run multiple crawling sessions at the same time; they are independent of each other.
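For example, a minimal crawling session could look like the following sketch. MyApp.MyParser is a hypothetical Crawlie.ParserLogic implementation and the urls and option values are illustrative:

    urls = ["https://example.com/", "https://example.com/about"]

    # crawl/3 returns a Flow, so the results can be consumed lazily
    # or collected eagerly with Enum:
    results =
      urls
      |> Crawlie.crawl(MyApp.MyParser, max_depth: 2)
      |> Enum.to_list()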

Arguments

  • source - a Stream or an Enum containing the urls to crawl
  • parser_logic - a Crawlie.ParserLogic behaviour implementation
  • options - a Keyword List of options

Crawlie-specific options

  • :http_client - module implementing the Crawlie.HttpClient behaviour, used to make the requests. Defaults to Crawlie.HttpClient.HTTPoisonClient.
  • :mock_client_fun - if you’re using the Crawlie.HttpClient.MockClient, this is the url :: String.t -> {:ok, body :: String.t} | {:error, term} function simulating making the requests. See Crawlie.HttpClient.MockClient for details.
  • :max_depth - maximum crawling “depth”. 0 by default.
  • :max_retries - maximum number of times Crawlie should try to fetch any individual page before giving up. 3 by default.
  • :fetch_phase - Flow partition configuration for the fetching phase of the crawling Flow. It should be a Keyword List containing any subset of the :min_demand, :max_demand and :stages properties. For the meaning of these options, see the Flow documentation.
  • :process_phase - same as :fetch_phase, but for the processing (page parsing, data and link extraction) part of the process
  • :pqueue_module - one of the pqueue implementations: :pqueue, :pqueue2, :pqueue3, :pqueue4. Different implementations have different performance characteristics and allow for different :max_depth values. Consult the pqueue docs for details. Defaults to :pqueue3, which has good performance and allows arbitrary :max_depth values.
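As a sketch of how these options combine, the snippet below tunes the Flow partitions and swaps in the mock HTTP client for testing. MyApp.MyParser, the urls and the specific option values are illustrative:

    # Simulate HTTP responses instead of hitting the network.
    mock_fun = fn
      "https://example.com/" -> {:ok, "<html><body>hello</body></html>"}
      _other -> {:error, :not_found}
    end

    ["https://example.com/"]
    |> Crawlie.crawl(MyApp.MyParser,
      http_client: Crawlie.HttpClient.MockClient,
      mock_client_fun: mock_fun,
      max_depth: 3,
      fetch_phase: [stages: 8, max_demand: 5],
      process_phase: [stages: 4],
      pqueue_module: :pqueue3
    )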
crawl_and_track_stats(source, parser_logic, options \\ [])
crawl_and_track_stats(Stream.t, module, Keyword.t) :: {Crawlie.Stats.Server.ref, Flow.t}

Crawls the urls provided in source, using the Crawlie.ParserLogic provided and collects the crawling statistics.

The statistics are accumulated independently, per Crawlie.crawl_and_track_stats/3 call.

See Crawlie.crawl/3 for details.

Additional options

(in addition to the ones from Crawlie.crawl/3, which all apply as well)

  • :max_fetch_failed_uris_tracked - 100 by default. The maximum number of uris for which the fetch operation failed that will be kept in the Crawlie.Stats.Server.
  • :max_parse_failed_uris_tracked - 100 by default. The maximum number of uris for which the parse operation failed that will be kept in the Crawlie.Stats.Server.
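For illustration, a stats-tracking session might look like the sketch below. MyApp.MyParser and the urls are hypothetical, and the Crawlie.Stats.Server.get_stats/1 call is an assumption about the stats server’s API:

    urls = ["https://example.com/"]

    {ref, flow} =
      Crawlie.crawl_and_track_stats(urls, MyApp.MyParser,
        max_depth: 2,
        max_fetch_failed_uris_tracked: 50
      )

    # The Flow is lazy; consuming it runs the crawl.
    results = Enum.to_list(flow)

    # Read back the statistics accumulated for this session.
    stats = Crawlie.Stats.Server.get_stats(ref)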