treewalker
A web crawler in Erlang that respects robots.txt.
Installation
This library is available on hex.pm.
Keep in mind that the library is not yet stable and its API is subject to change.
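To use it with rebar3, add it to the deps in your rebar.config. The version below is a placeholder; check hex.pm for the latest release:

%% rebar.config (version shown is illustrative; see hex.pm for the latest)
{deps, [
    {treewalker, "0.1.0"}
]}.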
Usage
%% This will add the specified crawler to the supervision tree
{ok, _} = treewalker:add_crawler(example, #{scraper => example_scraper,
                                            fetcher => example_fetcher,
                                            max_depth => 3,
                                            link_filter => example_link_filter,
                                            store => example_store}),
%% Starts crawling
ok = treewalker:start_crawler(example),
%% ...
%% Stops the crawler
%% Pending requests will be completed, but their results will be dropped
ok = treewalker:stop_crawler(example),
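The scraper, fetcher, link_filter and store options name modules that you provide; see the library's documentation for the behaviours they are expected to implement. Note also that the treewalker application must be running before crawlers can be added. In a release this is done by listing it among your application's dependencies; in a shell you can start it manually with a standard OTP call:

%% Start the treewalker application and its supervision tree
{ok, _Started} = application:ensure_all_started(treewalker).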
Options
The following settings can be configured via sys.config:
{treewalker, [
    %% The minimum delay to wait before retrying a failed request
    {min_retry_delay, pos_integer()},
    %% The maximum delay to wait before retrying a failed request
    {max_retry_delay, pos_integer()},
    %% The maximum number of retries for a failed request
    {max_retries, pos_integer()},
    %% The maximum delay before starting a request (in seconds)
    {max_worker_delay, pos_integer()},
    %% The maximum number of concurrent workers making HTTP requests
    {max_concurrent_worker, pos_integer()},
    %% The user agent used when making HTTP requests
    {user_agent, binary()}]},
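As an illustration, a complete sys.config entry might look as follows. The values are placeholders only, and the units of the retry delays are not specified above, so treat those numbers as assumptions:

%% sys.config (illustrative values only)
[{treewalker, [
    {min_retry_delay, 1000},     %% unit is an assumption (not specified above)
    {max_retry_delay, 30000},    %% unit is an assumption (not specified above)
    {max_retries, 5},
    {max_worker_delay, 5},       %% in seconds
    {max_concurrent_worker, 10},
    {user_agent, <<"treewalker/0.1 (+https://example.com/bot)">>}
]}].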
Development
Running all the tests and linters
You can run all the tests and linters with the rebar3 alias:
rebar3 check