treewalker (treewalker v0.4.1)
This OTP application crawls websites while respecting robots.txt.
This module exposes high-level functions to add and remove crawlers and to start and stop them.
While most of the configuration is per crawler, the application can also be configured globally via the following sys.config settings:
{treewalker, [
    %% The minimum delay to wait before retrying a failed request
    {min_retry_delay, pos_integer()},
    %% The maximum delay to wait before retrying a failed request
    {max_retry_delay, pos_integer()},
    %% The maximum number of retries of a failed request
    {max_retries, pos_integer()},
    %% The maximum delay before starting a request (in seconds)
    {max_worker_delay, pos_integer()},
    %% The maximum number of concurrent workers making HTTP requests
    {max_concurrent_worker, pos_integer()},
    %% The user agent making the HTTP requests
    {user_agent, binary()}]},
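As an illustration of the settings above, a complete sys.config might look like the following sketch. All values are arbitrary examples rather than recommended or default settings, the user agent string is hypothetical, and the unit of the retry delays is not stated here, so the numbers chosen for them are assumptions.

```erlang
%% sys.config (illustrative values only, not the library's defaults)
[
 {treewalker, [
     %% Retry delays: the unit is not documented above, values are assumptions
     {min_retry_delay, 1},
     {max_retry_delay, 60},
     {max_retries, 3},
     %% Documented above as being in seconds
     {max_worker_delay, 5},
     {max_concurrent_worker, 10},
     %% Hypothetical user agent string
     {user_agent, <<"example-bot/1.0">>}
 ]}
].
```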
Summary
Functions
Add a new crawler with the specified configuration.
Types
-type child() :: treewalker_crawlers_sup:child().
-type options() ::
    #{scraper => module(),
      scraper_options => term(),
      fetcher => module(),
      fetcher_options => module(),
      max_depth => pos_integer(),
      store => module(),
      store_options => term(),
      link_filter => module()}.
-type url() :: treewalker_page:url().
Functions
-spec add_crawler(term(), url(), options()) -> {ok, child()} | {ok, child(), term()} | {error, term()}.
Add a new crawler with the specified configuration.
The available options are as follows:
- scraper: Module implementing the treewalker_scraper behaviour.
- scraper_options: The options to pass to the module implementing the treewalker_scraper behaviour.
- fetcher: Module implementing the treewalker_fetcher behaviour.
- fetcher_options: The options to pass to the module implementing the treewalker_fetcher behaviour.
- max_depth: The maximum depth to which the crawler will crawl.
- store: Module implementing the treewalker_store behaviour.
- store_options: The options to pass to the module implementing the treewalker_store behaviour.
- link_filter: Module implementing the treewalker_link_filter behaviour.
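The call can be sketched as follows, based only on the spec and option list above. The module name crawler_example, the crawler identifier example_crawler, and the URL are hypothetical. Passing the URL as a binary is an assumption, since treewalker_page:url() is not defined here. Only max_depth is set because every key in options() is optional at the type level. Whether the library supplies defaults for the omitted keys is not stated in this section.

```erlang
-module(crawler_example).
-export([options/0, add/0]).

%% Build an options() map. Only max_depth is set; the module-valued keys
%% (scraper, fetcher, store, link_filter) are omitted on the assumption
%% that they are optional, as the `=>` associations in the type suggest.
options() ->
    #{max_depth => 2}.

%% Register a crawler and normalise the three documented return shapes.
add() ->
    {ok, _Started} = application:ensure_all_started(treewalker),
    %% Assumption: the URL is given as a binary.
    Url = <<"https://example.com">>,
    case treewalker:add_crawler(example_crawler, Url, options()) of
        {ok, Child} -> {ok, Child};
        {ok, Child, _Info} -> {ok, Child};
        {error, Reason} -> {error, Reason}
    end.
```

Matching all three result shapes keeps the caller from crashing with a badmatch when the crawler cannot be added.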
-spec remove_crawler(term()) -> ok.
-spec start_crawler(term()) -> ok.
-spec stop_crawler(term()) -> ok.
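Taken together, the four functions suggest a simple crawler lifecycle, sketched below. It assumes that the term given as the first argument to add_crawler/3 is the same identifier expected by start_crawler/1, stop_crawler/1 and remove_crawler/1, and that a crawler does not start on its own when added; neither point is confirmed by the specs above. The module name crawler_lifecycle and the five-second pause are hypothetical.

```erlang
-module(crawler_lifecycle).
-export([run/2]).

%% Add a crawler, let it run briefly, then stop and remove it.
%% Name is any term identifying the crawler; Url is the start URL.
run(Name, Url) ->
    {ok, _Started} = application:ensure_all_started(treewalker),
    case treewalker:add_crawler(Name, Url, #{max_depth => 1}) of
        {error, Reason} ->
            {error, Reason};
        _Added ->
            %% Covers both {ok, Child} and {ok, Child, Info}
            ok = treewalker:start_crawler(Name),
            %% Arbitrary pause so the crawler has time to do some work
            timer:sleep(5000),
            ok = treewalker:stop_crawler(Name),
            treewalker:remove_crawler(Name)
    end.
```

For example, crawler_lifecycle:run(example_crawler, <<"https://example.com">>) would be one way to invoke it, again assuming the URL is a binary.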