Crawly.Utils (Crawly v0.17.2)

Utility functions for Crawly

Summary

Functions

A helper function which joins a relative URL with a base URL

A helper function which joins a list of relative URLs with a base URL

Removes all previously registered dynamic spiders

Wrapper function for Code.ensure_loaded?/1 to allow mocking

A helper function that is used by YML spiders

A helper function that is used by YML spiders

Retrieves a header value from a list of key-value tuples or a map.

A helper for extracting a given setting.

Returns a list of known modules which implement the Crawly.Spider behaviour

Loads spiders from a given directory and stores them in a persistent term under :spiders

Pipeline/Middleware helper

A helper function that allows previewing spider results based on a given YML

Registers a given spider (so it's visible in the spiders list)

Returns a list of registered spiders

A helper function which returns a Crawly.Request struct for the given URL

A helper function which converts a list of URLs into a list of requests.

A wrapper over Process.send_after/3. This wrapper should be used instead of Process.send_after/3 so the latter can be mocked, which avoids race conditions when testing workers

Composes the log file path for a given spider and crawl ID.

Helper that normalizes a setting into a module and its options

Functions

build_absolute_url(url, base_url)

Specs

build_absolute_url(binary(), binary()) :: binary()

A helper function which joins a relative URL with a base URL.
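
For example (a minimal sketch; it assumes standard URI-merge semantics, so results for edge cases such as already-absolute inputs may differ):

iex> Crawly.Utils.build_absolute_url("/about", "https://example.com/blog")
"https://example.com/about"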

build_absolute_urls(urls, base_url)

Specs

build_absolute_urls([binary()], binary()) :: [binary()]

A helper function which joins a list of relative URLs with a base URL.
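
For example (a sketch; assumes the same URL-joining semantics as build_absolute_url/2):

iex> Crawly.Utils.build_absolute_urls(["/a", "/b"], "https://example.com")
["https://example.com/a", "https://example.com/b"]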

clear_registered_spiders()

Specs

clear_registered_spiders() :: :ok

Removes all previously registered dynamic spiders.

ensure_loaded?(module)

Specs

ensure_loaded?(atom()) :: boolean()

Wrapper function for Code.ensure_loaded?/1 to allow mocking

extract_items(document, field_selectors)

Specs

extract_items(document, field_selectors) :: items
when document: [Floki.html_node()], field_selectors: binary(), items: [map()]

A helper function used by YML spiders.

Extracts items (in practice, a single item) from a given document using a given set of selectors.

Selectors are provided as a JSON-encoded list of maps that contain name and selector binary keys. For example:

field_selectors = [%{"selector" => "h1", "name" => "title"}]
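
An end-to-end sketch, assuming the document comes from Floki.parse_document/1 and the selectors are JSON-encoded with a library such as Jason (the exact item shape shown in the comment is an assumption):

{:ok, document} = Floki.parse_document("<html><h1>Hello world</h1></html>")
field_selectors = Jason.encode!([%{"selector" => "h1", "name" => "title"}])
Crawly.Utils.extract_items(document, field_selectors)
# expected shape: [%{"title" => "Hello world"}]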

extract_requests(document, selectors, base_url)

Specs

extract_requests(document, selectors, base_url) :: requests
when document: [Floki.html_node()],
     selectors: binary(),
     base_url: binary(),
     requests: [Crawly.Request.t()]

A helper function used by YML spiders.

Extracts requests from a given document using a given set of selectors, and builds absolute URLs for them.

Selectors are provided as a JSON-encoded list of maps that contain selector and attribute keys, e.g. selectors = [%{"selector" => "a", "attribute" => "href"}]

The base URL is required to build absolute URLs from the extracted links.
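
An illustrative sketch under the same assumptions as above (Floki for parsing, Jason for JSON encoding):

{:ok, document} = Floki.parse_document(~s(<a href="/page/2">Next</a>))
selectors = Jason.encode!([%{"selector" => "a", "attribute" => "href"}])
Crawly.Utils.extract_requests(document, selectors, "https://example.com")
# expected: a list of Crawly.Request structs,
# here one request for "https://example.com/page/2"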

get_header(headers, key, default \\ nil)

Specs

get_header(
  headers ::
    [{atom() | binary(), binary()}] | %{required(binary()) => binary()},
  key :: binary(),
  default :: binary() | nil
) :: binary() | nil

Retrieves a header value from a list of key-value tuples or a map.

This function searches for a header with the specified key in the given list of headers or map. If found, it returns the corresponding value; otherwise, it returns the default value (nil when no default is provided).

Parameters

  • headers: A list of key-value tuples or a map representing headers.
  • key: The key of the header to retrieve.
  • default: (Optional) The default value to return if the header is not found. Defaults to nil.

Returns

The value of the header if found; otherwise the default value (nil when no default was provided).
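
For example (a sketch assuming exact key matching; case-insensitive lookup is not guaranteed here):

iex> Crawly.Utils.get_header([{"content-type", "text/html"}], "content-type")
"text/html"

iex> Crawly.Utils.get_header(%{"accept" => "*/*"}, "user-agent", "Crawly")
"Crawly"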

get_modules_from_applications()

Specs

get_modules_from_applications() :: [module()]

get_settings(setting_name, spider_name \\ nil, default \\ nil)

Specs

get_settings(setting_name, Crawly.spider(), default) :: result
when setting_name: atom(), default: term(), result: term()

A helper for extracting a given setting.

The returned value is the result of merging the global settings with the settings defined as settings_override inside the spider. Settings defined on the spider take precedence over the global settings defined in the config.
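
For example (a sketch; MySpider is a hypothetical spider module, and :concurrent_requests_per_domain is a standard Crawly setting):

# With config :crawly, concurrent_requests_per_domain: 4 in config.exs:
Crawly.Utils.get_settings(:concurrent_requests_per_domain, MySpider, 1)
# => the spider's settings_override value if one is set, otherwise 4,
#    falling back to the default (1) when neither is defined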

list_spiders()

Specs

list_spiders() :: [module()]

Returns a list of known modules which implement the Crawly.Spider behaviour.

load_spiders()

Specs

load_spiders() :: {:ok, [module()]} | {:error, :no_spiders_dir}

Loads spiders from a given directory and stores them in a persistent term under :spiders.

This makes it possible to load spiders stored in a specific directory that is not part of the Crawly application.

pipe(pipelines, item, state)

Specs

pipe(pipelines, item, state) :: result
when pipelines: [Crawly.Pipeline.t()],
     item: map(),
     state: map(),
     result: {new_item | false, new_state},
     new_item: map(),
     new_state: map()

Pipeline/Middleware helper

Executes a given list of pipelines on the given item, mimicking filter/map behavior. Takes an item and a state, and passes them through a list of modules that implement the pipeline behaviour, executing each pipeline's Crawly.Pipeline.run/3 function.

Each pipeline must return either false or an updated item.

If a pipeline returns false, the item is dropped and will not be processed by any subsequent pipelines.

If a pipeline crashes, that pipeline is skipped and the item is passed on to the subsequent pipelines.

The state variable is used to persist information across multiple items.

Usage in Tests

The Crawly.Utils.pipe/3 helper can be used in pipeline testing to simulate a set of middlewares/pipelines.

Internally, this function is used for both middlewares and pipelines. Hence, you can use it for testing modules that implement the Crawly.Pipeline behaviour.

For example, one can test that a given item is manipulated by a pipeline like so:

item = %{my: "item"}
state = %{}
pipelines = [ MyCustomPipelineOrMiddleware ]
{new_item, new_state} = Crawly.Utils.pipe(pipelines, item, state)
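
A minimal pipeline module for such a test might look like this (a sketch; the module name is just the placeholder from the example above):

defmodule MyCustomPipelineOrMiddleware do
  @behaviour Crawly.Pipeline

  @impl Crawly.Pipeline
  def run(item, state, _opts) do
    # Tag the item and pass the state through unchanged
    {Map.put(item, :processed, true), state}
  end
end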

preview(yml)

Specs

preview(yml) :: [result]
when yml: binary(),
     result:
       %{url: binary(), items: [map()], requests: [binary()]}
       | %{error: term()}
       | %{error: term(), url: binary()}

A helper function that allows previewing spider results based on a given YML.
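
For illustration, a YML document of roughly the shape used by Crawly's YML spiders (the exact schema keys here are an assumption; consult the YML spider documentation for the authoritative format):

yml = """
name: BooksSpider
base_url: "https://books.toscrape.com/"
start_urls:
  - "https://books.toscrape.com/"
fields:
  - name: title
    selector: "h3"
links_to_follow:
  - selector: "a"
    attribute: "href"
"""
Crawly.Utils.preview(yml)
# => a list of %{url: ..., items: ..., requests: ...} maps,
#    or error maps as described by the spec above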

register_spider(spider)

Specs

register_spider(module()) :: :ok

Registers a given spider (so it's visible in the spiders list).

registered_spiders()

Specs

registered_spiders() :: [module()]

Returns a list of registered spiders.
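
Taken together with register_spider/1 and clear_registered_spiders/0 (MySpider is a hypothetical spider module):

Crawly.Utils.register_spider(MySpider)
Crawly.Utils.registered_spiders()
# => [MySpider]
Crawly.Utils.clear_registered_spiders()
Crawly.Utils.registered_spiders()
# => []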

request_from_url(url)

Specs

request_from_url(binary()) :: Crawly.Request.t()

A helper function which returns a Crawly.Request struct for the given URL.

requests_from_urls(urls)

Specs

requests_from_urls([binary()]) :: [Crawly.Request.t()]

A helper function which converts a list of URLs into a list of requests.
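
A quick sketch of both helpers (relies on Crawly.Request exposing a url field):

request = Crawly.Utils.request_from_url("https://example.com")
request.url
# => "https://example.com"

["https://example.com/a", "https://example.com/b"]
|> Crawly.Utils.requests_from_urls()
|> Enum.map(& &1.url)
# => ["https://example.com/a", "https://example.com/b"]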

send_after(pid, message, timeout)

Specs

send_after(pid(), term(), pos_integer()) :: reference()

A wrapper over Process.send_after/3. This wrapper should be used instead of Process.send_after/3 so the latter can be mocked, which avoids race conditions when testing workers.

spider_log_path(spider_name, crawl_id)

Specs

spider_log_path(spider_name, crawl_id) :: path
when spider_name: atom(), crawl_id: String.t(), path: String.t()

Composes the log file path for a given spider and crawl ID.

Args:

  • spider_name (atom): The name of the spider to create the log path for.
  • crawl_id (string): The ID of the crawl to create the log path for.

Returns: The file path to the log file for the given spider and crawl ID, as a string.

Examples:

iex> spider_log_path(:my_spider, "crawl_123")
"/tmp/crawly/my_spider/crawl_123.log"

iex> spider_log_path(:my_spider, "crawl_456")
"/tmp/crawly/my_spider/crawl_456.log"

unwrap_module_and_options(setting)

Specs

unwrap_module_and_options(term()) :: {atom(), maybe_improper_list()}

Helper that normalizes a setting into a {module, options} tuple.
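
A sketch of the expected normalization (MyPipeline and its options are illustrative; the handling of other input shapes is an assumption):

Crawly.Utils.unwrap_module_and_options(MyPipeline)
# => {MyPipeline, []}

Crawly.Utils.unwrap_module_and_options({MyPipeline, max_items: 10})
# => {MyPipeline, [max_items: 10]}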