Module treewalker

This OTP application is used for crawling websites while respecting robots.txt.

Copyright © 2020 Antoine Gagné

Authors: Antoine Gagné (gagnantoine@gmail.com).

Description

This OTP application is used for crawling websites while respecting robots.txt.

This module exposes high-level functions to add new crawlers and to start and stop them.

While most of the configuration is per crawler, this application is also configurable globally via the following sys.config settings:

   {treewalker, [
                 %% The minimum delay to wait before retrying a failed request
                 {min_retry_delay, pos_integer()},
                 %% The maximum delay to wait before retrying a failed request
                 {max_retry_delay, pos_integer()},
                 %% The maximum number of retries for a failed request
                 {max_retries, pos_integer()},
                 %% The maximum delay before starting a request (in seconds)
                 {max_worker_delay, pos_integer()},
                 %% The maximum number of concurrent workers making HTTP requests
                 {max_concurrent_worker, pos_integer()},
                 %% The user agent used for the HTTP requests
                 {user_agent, binary()}]}
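
For example, a complete sys.config file using these settings could look like the following sketch; the values are illustrative, not documented defaults:

   %% Illustrative values only; tune the delays and worker count for the
   %% sites being crawled.
   [{treewalker, [{min_retry_delay, 1000},
                  {max_retry_delay, 30000},
                  {max_retries, 5},
                  {max_worker_delay, 10},
                  {max_concurrent_worker, 4},
                  {user_agent, <<"treewalker/1.0">>}]}].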

Data Types

child()

child() = treewalker_crawlers_sup:child()

options()

options() = #{scraper => module(), scraper_options => term(), fetcher => module(), fetcher_options => term(), max_depth => pos_integer(), store => module(), store_options => term(), link_filter => module()}

url()

url() = treewalker_page:url()

Function Index

add_crawler/2 Add a new crawler with the default configuration.
add_crawler/3 Add a new crawler with the specified configuration.
remove_crawler/1 Remove the specified crawler.
start_crawler/1 Start the specified crawler.
stop_crawler/1 Stop the specified crawler.

Function Details

add_crawler/2

add_crawler(Name::term(), Url::url()) -> {ok, child()} | {ok, child(), term()} | {error, term()}

Add a new crawler with the default configuration.
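
For example, assuming a binary is a valid url() (the exact type is defined by treewalker_page:url()):

   %% my_crawler is an arbitrary, caller-chosen name.
   {ok, _Child} = treewalker:add_crawler(my_crawler, <<"https://example.com">>).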

add_crawler/3

add_crawler(Name::term(), Url::url(), Custom::options()) -> {ok, child()} | {ok, child(), term()} | {error, term()}

Add a new crawler with the specified configuration.

The available options are as follows (see the example after the list):

- scraper: Module implementing the treewalker_scraper behaviour.

- scraper_options: The options to pass to the module implementing the treewalker_scraper behaviour.

- fetcher: Module implementing the treewalker_fetcher behaviour.

- fetcher_options: The options to pass to the module implementing the treewalker_fetcher behaviour.

- max_depth: The maximum depth to which the crawler will follow links.

- store: Module implementing the treewalker_store behaviour.

- store_options: The options to pass to the module implementing the treewalker_store behaviour.

- link_filter: Module implementing the treewalker_link_filter behaviour.
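
For example, the following sketch adds a crawler with a custom scraper and a shallow crawl depth; my_scraper is a hypothetical module assumed to implement the treewalker_scraper behaviour, and since all keys in options() are optional (=>), omitted keys presumably fall back to the defaults used by add_crawler/2:

   %% my_scraper is hypothetical; only the options that differ from the
   %% defaults need to be present in the map.
   Options = #{scraper => my_scraper,
               scraper_options => #{},
               max_depth => 3},
   {ok, _Child} = treewalker:add_crawler(my_crawler, <<"https://example.com">>, Options).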

remove_crawler/1

remove_crawler(Name::term()) -> ok

Remove the specified crawler.

start_crawler/1

start_crawler(Name::term()) -> ok

Start the specified crawler.

stop_crawler/1

stop_crawler(Name::term()) -> ok

Stop the specified crawler.
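
Taken together, a typical crawler lifecycle reads as the following sketch (the crawler name and URL are illustrative):

   %% Register a crawler, run it, and later tear it down.
   {ok, _Child} = treewalker:add_crawler(my_crawler, <<"https://example.com">>),
   ok = treewalker:start_crawler(my_crawler),
   %% ... the crawler runs in the background ...
   ok = treewalker:stop_crawler(my_crawler),
   ok = treewalker:remove_crawler(my_crawler).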


Generated by EDoc