# Crawlie v1.0.0

The simple Elixir web crawler.
## Summary

### Functions

- `crawl/3` - Crawls the urls provided in `source`, using the `Crawlie.ParserLogic` provided in `parser_logic`
- `crawl_and_track_stats/3` - Crawls the urls provided in `source`, using the `Crawlie.ParserLogic` provided, and collects the crawling statistics
## Functions

### crawl(source, parser_logic, options)

Crawls the urls provided in `source`, using the `Crawlie.ParserLogic` provided in `parser_logic`.

The `options` are used to tweak the crawler's behaviour. You can use most of the options for HTTPoison, as well as Crawlie-specific options.

It is perfectly OK to run multiple crawling sessions at the same time; they're independent.
#### Arguments

- `source` - a `Stream` or an `Enum` containing the urls to crawl
- `parser_logic` - a `Crawlie.ParserLogic` behaviour implementation
- `options` - a Keyword List of options
#### Crawlie-specific options

- `:http_client` - module implementing the `Crawlie.HttpClient` behaviour, used to make the requests. If not provided, defaults to `Crawlie.HttpClient.HTTPoisonClient`.
- `:mock_client_fun` - if you're using the `Crawlie.HttpClient.MockClient`, this is the `url :: String.t -> {:ok, body :: String.t} | {:error, term}` function simulating making the requests; see the `Crawlie.HttpClient.MockClient` docs for details.
- `:max_depth` - maximum crawling "depth". `0` by default.
- `:max_retries` - maximum number of times Crawlie should try to fetch any individual page before giving up. `3` by default.
- `:fetch_phase` - `Flow` partition configuration for the fetching phase of the crawling `Flow`. It should be a Keyword List containing any subset of the `:min_demand`, `:max_demand` and `:stages` properties. For the meaning of these options, see the `Flow` documentation.
- `:process_phase` - same as `:fetch_phase`, but for the processing (page parsing, data and link extraction) part of the process.
- `:pqueue_module` - one of the pqueue implementations: `:pqueue`, `:pqueue2`, `:pqueue3`, `:pqueue4`. Different implementations have different performance characteristics and allow different `:max_depth` values. Consult the pqueue docs for details. `:pqueue3` is used by default - good performance while allowing arbitrary `:max_depth` values.
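To make the options concrete, here is a minimal sketch of a crawl run using the `MockClient`, so no real HTTP requests are made. `MyLogic` is a hypothetical parser module: the `parse/2`, `extract_uris/3` and `extract_data/3` callback names are assumptions about the `Crawlie.ParserLogic` behaviour (consult its docs for the exact callbacks), while the `:mock_client_fun` signature follows the one given above.

```elixir
defmodule MyLogic do
  # Hypothetical Crawlie.ParserLogic implementation; the callback names and
  # response shape are assumptions - check the Crawlie.ParserLogic docs.
  @behaviour Crawlie.ParserLogic

  # Treat the raw body as the "parsed" representation.
  def parse(response, _options), do: {:ok, response.body}

  # No link extraction - crawl only the urls given in the source.
  def extract_uris(_response, _parsed, _options), do: []

  # Emit the page body length as the crawling result.
  def extract_data(_response, parsed, _options), do: [String.length(parsed)]
end

# The mock function has the `url -> {:ok, body} | {:error, term}`
# shape described above.
mock_fun = fn url -> {:ok, "contents of #{url}"} end

["http://foo.example", "http://bar.example"]
|> Crawlie.crawl(MyLogic,
  http_client: Crawlie.HttpClient.MockClient,
  mock_client_fun: mock_fun,
  max_depth: 0,
  max_retries: 3,
  fetch_phase: [stages: 4, max_demand: 10]
)
|> Enum.to_list()
```

Since `crawl/3` returns a lazy `Flow`, nothing is fetched until the result is consumed, e.g. with `Enum.to_list/1` as above.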
### crawl_and_track_stats(source, parser_logic, options)

    crawl_and_track_stats(Stream.t, module, Keyword.t) :: {Crawlie.Stats.Server.ref, Flow.t}

Crawls the urls provided in `source`, using the `Crawlie.ParserLogic` provided, and collects the crawling statistics.

The statistics are accumulated independently per `Crawlie.crawl_and_track_stats/3` call.

See `Crawlie.crawl/3` for details.
Additional options
(apart from the ones from Crawlie.crawl/3
, which all apply as well)
:max_fetch_failed_uris_tracked
-100
by default. The maximum quantity of uris that will be kept in theCrawlie.Stats.Server
, for which the fetch operation was failed.:max_parse_failed_uris_tracked
-100
by default. The maximum quantity of uris that will be kept in theCrawlie.Stats.Server
, for which the parse operation was failed.
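A hedged sketch of a stats-tracking session: per the spec above, the call returns a `{ref, flow}` tuple. `MyLogic` stands for any `Crawlie.ParserLogic` implementation, and the stats-fetching call shown (`Crawlie.Stats.Server.get_stats/1`) is an assumption about the `Crawlie.Stats.Server` API.

```elixir
# MyLogic is a placeholder for your Crawlie.ParserLogic implementation.
{ref, flow} =
  Crawlie.crawl_and_track_stats(["http://foo.example"], MyLogic,
    max_depth: 1,
    max_fetch_failed_uris_tracked: 50,
    max_parse_failed_uris_tracked: 50
  )

# Run the crawl to completion by consuming the Flow.
results = Enum.to_list(flow)

# Fetch the statistics accumulated for this session via the returned ref;
# the function name is an assumption - consult the Crawlie.Stats.Server docs.
stats = Crawlie.Stats.Server.get_stats(ref)
```

Because statistics are accumulated per call, concurrent sessions each get their own `ref` and their stats never mix.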