robots.txt.
Copyright © 2020 Antoine Gagné
Authors: Antoine Gagné (gagnantoine@gmail.com).
This OTP application is used for crawling websites while respecting robots.txt.
This module exposes high-level functions for adding new crawlers and starting or stopping them.
While most of the configuration is per crawler, this application is also configurable globally
via the following sys.config settings:
    {treewalker, [
        %% The minimum delay to wait before retrying a failed request
        {min_retry_delay, pos_integer()},
        %% The maximum delay to wait before retrying a failed request
        {max_retry_delay, pos_integer()},
        %% The maximum amount of retries of a failed request
        {max_retries, pos_integer()},
        %% The maximum amount of delay before starting a request (in seconds)
        {max_worker_delay, pos_integer()},
        %% The maximum amount of concurrent workers making HTTP requests
        {max_concurrent_worker, pos_integer()},
        %% The user agent making the HTTP requests
        {user_agent, binary()}]},
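As an illustration of the settings above, a complete sys.config file might look like the following sketch. Every value here is an arbitrary placeholder, not a documented default, and the user agent string is made up. The unit of the two retry delays is not stated in this documentation, so the numbers chosen for them are assumptions only.

```erlang
%% sys.config: illustrative values only, not documented defaults.
[
 {treewalker, [
   %% Retry delays: unit not stated in the docs; values are placeholders.
   {min_retry_delay, 1},
   {max_retry_delay, 60},
   %% Give up on a failed request after three retries.
   {max_retries, 3},
   %% Wait at most 5 seconds before starting a request.
   {max_worker_delay, 5},
   %% At most 10 workers making HTTP requests concurrently.
   {max_concurrent_worker, 10},
   %% Hypothetical user agent string.
   {user_agent, <<"example-crawler/0.1">>}
 ]}
].
```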
child() = treewalker_crawlers_sup:child()
options() = #{scraper => module(), scraper_options => term(), fetcher => module(), fetcher_options => module(), max_depth => pos_integer(), store => module(), store_options => term(), link_filter => module()}
url() = treewalker_page:url()
add_crawler/2 | Add a new crawler with the default configuration. |
add_crawler/3 | Add a new crawler with the specified configuration. |
remove_crawler/1 | Remove the specified crawler. |
start_crawler/1 | Start the specified crawler. |
stop_crawler/1 | Stop the specified crawler. |
Add a new crawler with the default configuration.
add_crawler(Name::term(), Url::url(), Custom::options()) -> {ok, child()} | {ok, child(), term()} | {error, term()}
Add a new crawler with the specified configuration.
The available options are as follows:
- scraper: Module implementing the treewalker_scraper behaviour.
- scraper_options: The options to pass to the module implementing the treewalker_scraper behaviour.
- fetcher: Module implementing the treewalker_fetcher behaviour.
- fetcher_options: The options to pass to the module implementing the treewalker_fetcher behaviour.
- max_depth: The max depth that the crawler will crawl.
- store: Module implementing the treewalker_store behaviour.
- store_options: The options to pass to the module implementing the treewalker_store behaviour.
- link_filter: Module implementing the treewalker_link_filter behaviour.
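The options listed above can be assembled into a map as in the sketch below. It assumes the documented functions are exported from a module named treewalker and that a binary is an acceptable url(); neither is confirmed in this section. The modules my_scraper, my_store and my_link_filter, the crawler name and the URL are hypothetical. Because the options() type uses `=>`, each key is optional in the type, so this example leaves out the fetcher settings.

```erlang
-module(crawler_example).
-export([options/0, add/0]).

%% Build an options() map. All callback modules named here are
%% hypothetical and would have to implement the matching behaviours.
options() ->
    #{scraper => my_scraper,            % treewalker_scraper behaviour
      scraper_options => #{},           % passed to my_scraper as-is
      max_depth => 3,                   % placeholder depth limit
      store => my_store,                % treewalker_store behaviour
      store_options => [],              % passed to my_store as-is
      link_filter => my_link_filter}.   % treewalker_link_filter behaviour

%% Register a crawler under a hypothetical name and URL.
%% Assumption: the API lives in the `treewalker` module and accepts
%% the URL as a binary.
add() ->
    treewalker:add_crawler(my_crawler, <<"https://example.org/">>, options()).
```

Per the spec above, add/0 returns {ok, child()}, {ok, child(), term()} or {error, term()}, so callers should match on all three shapes.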
remove_crawler(Name::term()) -> ok
Remove the specified crawler.
start_crawler(Name::term()) -> ok
Start the specified crawler.
stop_crawler(Name::term()) -> ok
Stop the specified crawler.
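Taken together, the four operations suggest a lifecycle of add, start, stop and remove. The sketch below strings them together under several assumptions: the functions are exported from a module named treewalker, add_crawler/2 returns the same shapes as add_crawler/3 (its spec is not shown in this section), and the helper names run/3 and run_for/2 are invented for this example.

```erlang
-module(crawler_lifecycle).
-export([run/3]).

%% Add a crawler with the default configuration, let it run for
%% RunMs milliseconds, then stop and remove it.
%% Assumption: add_crawler/2 has the same return type as add_crawler/3.
run(Name, Url, RunMs) ->
    case treewalker:add_crawler(Name, Url) of
        {ok, _Child}        -> run_for(Name, RunMs);
        {ok, _Child, _Info} -> run_for(Name, RunMs);
        {error, Reason}     -> {error, Reason}
    end.

%% Hypothetical helper: the specs above say each call returns ok,
%% so any other result crashes here with a badmatch.
run_for(Name, RunMs) ->
    ok = treewalker:start_crawler(Name),
    timer:sleep(RunMs),
    ok = treewalker:stop_crawler(Name),
    ok = treewalker:remove_crawler(Name).
```

Matching each result against ok is a deliberate "let it crash" choice: an unexpected return value surfaces immediately instead of being silently ignored.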