Module treewalker: an OTP application for crawling websites while respecting robots.txt.
Copyright © 2020 Antoine Gagné
Authors: Antoine Gagné (gagnantoine@gmail.com).
This OTP application is used for crawling websites while respecting robots.txt.
This module exposes high-level functions to add new crawlers and to start and stop them.
While most of the configuration is per crawler, the application can also be configured globally
via the following sys.config settings:
{treewalker, [
    %% The minimum delay to wait before retrying a failed request
    {min_retry_delay, pos_integer()},
    %% The maximum delay to wait before retrying a failed request
    {max_retry_delay, pos_integer()},
    %% The maximum number of retries for a failed request
    {max_retries, pos_integer()},
    %% The maximum delay before starting a request (in seconds)
    {max_worker_delay, pos_integer()},
    %% The maximum number of concurrent workers making HTTP requests
    {max_concurrent_worker, pos_integer()},
    %% The user agent used for the HTTP requests
    {user_agent, binary()}]}
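As a sketch, a global configuration in sys.config might look like the following. All values are illustrative, not documented defaults, and the units of min_retry_delay and max_retry_delay are an assumption (the source only states the unit for max_worker_delay):

```erlang
%% sys.config (illustrative values, not defaults)
[{treewalker,
  [{min_retry_delay, 500},                 %% assumed milliseconds
   {max_retry_delay, 5000},                %% assumed milliseconds
   {max_retries, 3},
   {max_worker_delay, 10},                 %% seconds, per the setting's description
   {max_concurrent_worker, 8},
   {user_agent, <<"treewalker/1.0">>}]}].
```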
child() = treewalker_crawlers_sup:child()
options() = #{scraper => module(),
              scraper_options => term(),
              fetcher => module(),
              fetcher_options => term(),
              max_depth => pos_integer(),
              store => module(),
              store_options => term(),
              link_filter => module()}
url() = treewalker_page:url()
| Function | Description |
| --- | --- |
| add_crawler/2 | Add a new crawler with the default configuration. |
| add_crawler/3 | Add a new crawler with the specified configuration. |
| remove_crawler/1 | Remove the specified crawler. |
| start_crawler/1 | Start the specified crawler. |
| stop_crawler/1 | Stop the specified crawler. |
add_crawler(Name::term(), Url::url()) -> {ok, child()} | {ok, child(), term()} | {error, term()}
Add a new crawler with the default configuration.
add_crawler(Name::term(), Url::url(), Custom::options()) -> {ok, child()} | {ok, child(), term()} | {error, term()}
Add a new crawler with the specified configuration.
The available options are as follows:
- scraper: Module implementing the treewalker_scraper behaviour.
- scraper_options: The options to pass to the module implementing the treewalker_scraper behaviour.
- fetcher: Module implementing the treewalker_fetcher behaviour.
- fetcher_options: The options to pass to the module implementing the treewalker_fetcher behaviour.
- max_depth: The maximum depth that the crawler will crawl.
- store: Module implementing the treewalker_store behaviour.
- store_options: The options to pass to the module implementing the treewalker_store behaviour.
- link_filter: Module implementing the treewalker_link_filter behaviour.
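As a sketch, adding a crawler with custom options might look like the following. The modules my_scraper, my_fetcher, and my_store are hypothetical placeholders for your own implementations of the corresponding behaviours, and the option values are illustrative:

```erlang
%% Hypothetical example: my_scraper, my_fetcher, and my_store are
%% placeholder modules implementing the treewalker_scraper,
%% treewalker_fetcher, and treewalker_store behaviours.
Options = #{scraper => my_scraper,
            scraper_options => #{},
            fetcher => my_fetcher,
            fetcher_options => #{},
            max_depth => 3,
            store => my_store,
            store_options => #{}},
{ok, _Child} = treewalker:add_crawler(my_crawler, <<"https://example.com">>, Options).
```

Options omitted from the map fall back to the crawler's default configuration.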
remove_crawler(Name::term()) -> ok
Remove the specified crawler.
start_crawler(Name::term()) -> ok
Start the specified crawler.
stop_crawler(Name::term()) -> ok
Stop the specified crawler.
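Putting the functions above together, a typical crawler lifecycle might look like the following sketch (the crawler name and URL are illustrative):

```erlang
%% Lifecycle sketch using the default configuration.
{ok, _Child} = treewalker:add_crawler(docs_crawler, <<"https://example.com">>),
ok = treewalker:start_crawler(docs_crawler),
%% ... the crawler fetches pages until stopped ...
ok = treewalker:stop_crawler(docs_crawler),
ok = treewalker:remove_crawler(docs_crawler).
```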
Generated by EDoc