Crawlie (crawlie v1.0.0)
The simple Elixir web crawler.
Summary
Functions
crawl(source, parser_logic, options)
Crawls the urls provided in source, using the Crawlie.ParserLogic provided in parser_logic.
crawl_and_track_stats(source, parser_logic, options)
Crawls the urls provided in source, using the Crawlie.ParserLogic provided, and collects the crawling statistics.
Functions
crawl(source, parser_logic, options)
Crawls the urls provided in source, using the Crawlie.ParserLogic provided in parser_logic.
The options are used to tweak the crawler's behaviour. You can use most of the HTTPoison options, as well as Crawlie-specific ones.
It is perfectly fine to run multiple crawling sessions at the same time; they're independent.
Arguments
source - a Stream or an Enum containing the urls to crawl
parser_logic - a Crawlie.ParserLogic behaviour implementation
options - a Keyword List of options
Crawlie-specific options
:http_client - module implementing the Crawlie.HttpClient behaviour, used to make the requests. If not provided, defaults to Crawlie.HttpClient.HTTPoisonClient.
:mock_client_fun - if you're using the Crawlie.HttpClient.MockClient, this is the url :: String.t -> {:ok, body :: String.t} | {:error, term} function simulating making the requests. See Crawlie.HttpClient.MockClient for details.
:max_depth - maximum crawling "depth". 0 by default.
:max_retries - maximum number of times Crawlie will try to fetch any individual page before giving up. 3 by default.
:fetch_phase - Flow partition configuration for the fetching phase of the crawling Flow. It should be a Keyword List containing any subset of the :min_demand, :max_demand and :stages properties. For the meaning of these options, see the Flow documentation.
:process_phase - same as :fetch_phase, but for the processing (page parsing, data and link extraction) phase.
:pqueue_module - one of the pqueue implementations: :pqueue, :pqueue2, :pqueue3 or :pqueue4. Different implementations have different performance characteristics and allow for different :max_depth values; consult the pqueue docs for details. Defaults to :pqueue3, which offers good performance and allows arbitrary :max_depth values.
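For illustration only, a minimal sketch of a crawl/3 call using the options above. MyApp.Parser and the url list are made-up placeholders; the module is assumed to implement the Crawlie.ParserLogic behaviour, and the return value is assumed to be consumable like any Enumerable.

    # MyApp.Parser is a hypothetical Crawlie.ParserLogic implementation.
    urls = ["https://example.com/", "https://example.org/"]

    results =
      urls
      |> Crawlie.crawl(MyApp.Parser,
        max_depth: 2,
        max_retries: 3,
        fetch_phase: [stages: 4, max_demand: 10]
      )
      # Assumption: the crawl produces a Flow of the extracted data,
      # which can be consumed with the Enum functions.
      |> Enum.to_list()

    # For tests, the HTTP layer can be stubbed out with the mock client:
    test_flow =
      Crawlie.crawl(urls, MyApp.Parser,
        http_client: Crawlie.HttpClient.MockClient,
        mock_client_fun: fn _url -> {:ok, "<html></html>"} end
      )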
crawl_and_track_stats(Stream.t, module, Keyword.t) :: {Crawlie.Stats.Server.ref, Flow.t}
Crawls the urls provided in source, using the Crawlie.ParserLogic provided and collects the crawling statistics.
The statistics are accumulated independently, per Crawlie.crawl_and_track_stats/3 call.
See Crawlie.crawl/3 for details.
Additional options
(apart from the ones from Crawlie.crawl/3, which all apply as well)
:max_fetch_failed_uris_tracked - 100 by default. The maximum number of uris for which the fetch operation failed that will be kept in the Crawlie.Stats.Server.
:max_parse_failed_uris_tracked - 100 by default. The maximum number of uris for which the parse operation failed that will be kept in the Crawlie.Stats.Server.
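A sketch based on the spec above, with made-up placeholder inputs; how the accumulated statistics are later read back from the Crawlie.Stats.Server is not shown, to avoid guessing at that API.

    # Per the spec, the call returns a stats ref and a Flow.
    {stats_ref, flow} =
      Crawlie.crawl_and_track_stats(urls, MyApp.Parser,
        max_depth: 1,
        max_fetch_failed_uris_tracked: 50
      )

    # Consuming the flow runs the crawl; stats_ref identifies this
    # session's statistics in the Crawlie.Stats.Server.
    results = Enum.to_list(flow)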