treewalker (treewalker v0.4.1)

This OTP application crawls websites while respecting robots.txt.

This module exposes high-level functions for adding, removing, starting, and stopping crawlers.

Most configuration is per crawler, but the application can also be configured globally through the following sys.config settings:

   {treewalker, [
                 %% The minimum delay to wait before retrying a failed request
                 {min_retry_delay, pos_integer()},
                 %% The maximum delay to wait before retrying a failed request
                 {max_retry_delay, pos_integer()},
                 %% The maximum number of retries for a failed request
                 {max_retries, pos_integer()},
                 %% The maximum delay before starting a request (in seconds)
                 {max_worker_delay, pos_integer()},
                 %% The maximum number of concurrent workers making HTTP requests
                 {max_concurrent_worker, pos_integer()},
                 %% The user agent sent with the HTTP requests
                 {user_agent, binary()}]},
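
For illustration, a complete sys.config using these settings might look like the sketch below. Every value is an arbitrary placeholder rather than a recommended or default setting, and the user agent string is hypothetical. Only max_worker_delay has a documented unit (seconds); the unit of the retry delays is not stated on this page.

```erlang
%% sys.config: illustrative values only, not defaults.
[
 {treewalker, [
               %% Unit not stated in the documentation above.
               {min_retry_delay, 1},
               {max_retry_delay, 10},
               {max_retries, 3},
               %% Documented as seconds.
               {max_worker_delay, 5},
               {max_concurrent_worker, 4},
               %% Hypothetical user agent string.
               {user_agent, <<"my-crawler/1.0">>}
              ]}
].
```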

Summary

Functions

add_crawler/2: Add a new crawler with the default configuration.

add_crawler/3: Add a new crawler with the specified configuration.

remove_crawler/1: Remove the specified crawler.

start_crawler/1: Start the specified crawler.

stop_crawler/1: Stop the specified crawler.

Types

-type child() :: treewalker_crawlers_sup:child().
-type options() ::
    #{scraper => module(),
      scraper_options => term(),
      fetcher => module(),
      fetcher_options => module(),
      max_depth => pos_integer(),
      store => module(),
      store_options => term(),
      link_filter => module()}.
-type url() :: treewalker_page:url().

Functions

-spec add_crawler(term(), url()) -> {ok, child()} | {ok, child(), term()} | {error, term()}.

Add a new crawler with the default configuration.

add_crawler(Name, Url, Custom)

-spec add_crawler(term(), url(), options()) -> {ok, child()} | {ok, child(), term()} | {error, term()}.

Add a new crawler with the specified configuration.

The available options are as follows:

- scraper: Module implementing the treewalker_scraper behaviour.

- scraper_options: The options to pass to the module implementing the treewalker_scraper behaviour.

- fetcher: Module implementing the treewalker_fetcher behaviour.

- fetcher_options: The options to pass to the module implementing the treewalker_fetcher behaviour.

- max_depth: The maximum depth to which the crawler will crawl.

- store: Module implementing the treewalker_store behaviour.

- store_options: The options to pass to the module implementing the treewalker_store behaviour.

- link_filter: Module implementing the treewalker_link_filter behaviour.
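
As a rough sketch of how these options might be passed, the module below builds an options map and registers a crawler with add_crawler/3. The module name crawl_setup, the crawler name example_crawler, the scraper module my_scraper, and the URL are all hypothetical. The URL is written as a binary on the assumption that treewalker_page:url() accepts one, which this page does not confirm, and keys left out of the map are assumed to fall back to defaults.

```erlang
-module(crawl_setup).
-export([options/0, add/0, normalize/1]).

%% Options map using keys from the options() type.
%% my_scraper is a hypothetical module implementing the
%% treewalker_scraper behaviour.
options() ->
    #{scraper => my_scraper,
      max_depth => 2}.

%% Register a crawler under a hypothetical name and URL.
%% Assumes the treewalker application is already running.
add() ->
    Result = treewalker:add_crawler(example_crawler,
                                    <<"https://example.com/">>,
                                    options()),
    normalize(Result).

%% Collapse the three documented return shapes into two.
normalize({ok, Child}) -> {ok, Child};
normalize({ok, Child, _Info}) -> {ok, Child};
normalize({error, Reason}) -> {error, Reason}.
```

The normalize/1 helper is only there so callers can match on two shapes instead of three; it is not part of treewalker.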

-spec remove_crawler(term()) -> ok.

Remove the specified crawler.

-spec start_crawler(term()) -> ok.

Start the specified crawler.

-spec stop_crawler(term()) -> ok.

Stop the specified crawler.
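
Taken together, the functions on this page suggest a lifecycle along the lines of the sketch below. The module crawl_lifecycle and its two function names are hypothetical. Whether a crawler has to be stopped before it is removed is not stated here, so the order used in end_crawl/1 is an assumption.

```erlang
-module(crawl_lifecycle).
-export([begin_crawl/2, end_crawl/1]).

%% Start the application, add a crawler with the default
%% configuration, and start it. Name and Url come from the caller.
begin_crawl(Name, Url) ->
    {ok, _Started} = application:ensure_all_started(treewalker),
    case treewalker:add_crawler(Name, Url) of
        {error, Reason} ->
            {error, Reason};
        _Added ->
            %% Matches both {ok, Child} and {ok, Child, Info}.
            treewalker:start_crawler(Name)
    end.

%% Stop the crawler, then remove it.
%% Assumption: stopping first is the safe order.
end_crawl(Name) ->
    ok = treewalker:stop_crawler(Name),
    treewalker:remove_crawler(Name).
```

A caller would invoke begin_crawl/2 once, let the crawler run for as long as needed, and then call end_crawl/1 with the same name.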