Crawly v0.9.0 Crawly.Utils

Utility functions for Crawly

Summary

Functions

build_absolute_url(url, base_url)

A helper function that joins a relative URL with a base URL.

build_absolute_urls(urls, base_url)

A helper function that joins each relative URL in a list with a base URL.

get_settings(setting_name, spider_name \\ nil, default \\ nil)

A helper that retrieves the value of a given setting.

pipe(arg1, item, state)

Pipeline/Middleware helper

request_from_url(url)

A helper function that returns a Request structure for the given URL.

requests_from_urls(urls)

A helper function that converts a list of URLs into a list of requests.

send_after(pid, message, timeout)

A wrapper over Process.send_after. Use this wrapper instead of calling Process.send_after directly so that the call can be mocked, which avoids race conditions when testing workers.

Functions

build_absolute_url(url, base_url)

build_absolute_url(binary(), binary()) :: binary()

A helper function that joins a relative URL with a base URL.

build_absolute_urls(urls, base_url)

build_absolute_urls([binary()], binary()) :: [binary()]

A helper function that joins each relative URL in a list with a base URL.
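
The joining described for these two helpers can be sketched with the standard library's URI.merge/2. This is a minimal illustration, assuming Crawly resolves URLs in a comparable way; the UrlSketch module and the example.com URLs are hypothetical and not part of Crawly.

```elixir
defmodule UrlSketch do
  # Hypothetical stand-in for illustration, not Crawly's implementation.
  # URI.merge/2 resolves a relative reference against an absolute base URL.
  def build_absolute_url(url, base_url) do
    base_url |> URI.merge(url) |> to_string()
  end

  # The list variant applies the same join to every URL in the list.
  def build_absolute_urls(urls, base_url) do
    Enum.map(urls, &build_absolute_url(&1, base_url))
  end
end

single = UrlSketch.build_absolute_url("/about", "https://example.com/blog/")
many = UrlSketch.build_absolute_urls(["post/1", "/contact"], "https://example.com/blog/")

IO.inspect(single)
IO.inspect(many)
```

A URL starting with a slash replaces the whole path of the base URL, while a bare relative path such as post/1 is appended to the base URL's directory.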

get_settings(setting_name, spider_name \\ nil, default \\ nil)

get_settings(setting_name, spider_name, default) :: result
when setting_name: atom(), spider_name: atom(), default: term(), result: term()

A helper that retrieves the value of a given setting.

The returned value is the result of combining the global settings with the settings defined as settings_override inside the spider. Settings defined on the spider take precedence over the global settings defined in the config.
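
The precedence rule can be sketched as a plain lookup over two keyword lists. This is only an illustration of the ordering (spider override, then global value, then default); the SettingsSketch module, its get_setting/4 function and the sample values are hypothetical, and Crawly reads its settings from the application config and the spider rather than from arguments.

```elixir
defmodule SettingsSketch do
  # Hypothetical lookup, not Crawly's implementation: a spider-level override
  # wins, then the global value, then the caller-supplied default.
  def get_setting(name, global, overrides, default \\ nil) do
    case Keyword.fetch(overrides, name) do
      {:ok, value} -> value
      :error -> Keyword.get(global, name, default)
    end
  end
end

# Sample values chosen for illustration only.
global = [concurrent_requests_per_domain: 4, closespider_timeout: 10]
overrides = [concurrent_requests_per_domain: 1]

overridden = SettingsSketch.get_setting(:concurrent_requests_per_domain, global, overrides)
inherited = SettingsSketch.get_setting(:closespider_timeout, global, overrides)
defaulted = SettingsSketch.get_setting(:missing, global, overrides, :fallback)

IO.inspect({overridden, inherited, defaulted})
```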

pipe(arg1, item, state)

pipe(pipelines, item, state) :: result
when pipelines: [Crawly.Pipeline.t()],
     item: map(),
     state: map(),
     result: {new_item | false, new_state},
     new_item: map(),
     new_state: map()

Pipeline/Middleware helper

Executes a given list of pipelines on the given item, mimicking filtermap behaviour. It takes an item and a state and passes them through a list of modules that implement the pipeline behaviour, calling each pipeline's Crawly.Pipeline.run/3 function.

The pipe function returns either false or an updated item, paired with the new state.

If a pipeline returns false, the item is dropped. It will not be processed by any subsequent pipelines.

If a pipeline crashes, that pipeline is skipped and the item is passed on to the subsequent pipelines.

The state variable is used to persist information across multiple items.
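
The drop and crash rules above can be sketched as a small recursive function. This is a simplified illustration, not Crawly's source: PipeSketch and the three toy pipeline modules are hypothetical, and the sketch assumes each pipeline is a bare module whose run/3 is called with an empty options list.

```elixir
defmodule PipeSketch do
  # Hypothetical re-implementation for illustration only.
  # A dropped item (false) is never handed to later pipelines.
  def pipe(_pipelines, false, state), do: {false, state}
  def pipe([], item, state), do: {item, state}

  def pipe([pipeline | rest], item, state) do
    {next_item, next_state} =
      try do
        pipeline.run(item, state, [])
      rescue
        # A crashing pipeline is skipped: item and state pass through unchanged.
        _error -> {item, state}
      end

    pipe(rest, next_item, next_state)
  end
end

# Toy pipelines, assumed for this example.
defmodule Upcase do
  def run(item, state, _opts), do: {Map.update!(item, :title, &String.upcase/1), state}
end

defmodule DropAll do
  def run(_item, state, _opts), do: {false, state}
end

defmodule Crashy do
  def run(_item, _state, _opts), do: raise("boom")
end

after_crash = PipeSketch.pipe([Crashy, Upcase], %{title: "hello"}, %{})
after_drop = PipeSketch.pipe([DropAll, Upcase], %{title: "hello"}, %{})

IO.inspect(after_crash)
IO.inspect(after_drop)
```

In the first call the crash in Crashy is swallowed and Upcase still runs; in the second call DropAll returns false, so Upcase never sees the item.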

Usage in Tests

The Crawly.Utils.pipe/3 helper can be used in tests to simulate a chain of middlewares or pipelines.

Internally, this function is used for both middlewares and pipelines, so you can use it to test any module that implements the Crawly.Pipeline behaviour.

For example, you can test that a given item is manipulated by a pipeline like so:

item = %{my: "item"}
state = %{}
pipelines = [ MyCustomPipelineOrMiddleware ]
{new_item, new_state} = Crawly.Utils.pipe(pipelines, item, state)

request_from_url(url)

request_from_url(binary()) :: Crawly.Request.t()

A helper function that returns a Request structure for the given URL.

requests_from_urls(urls)

requests_from_urls([binary()]) :: [Crawly.Request.t()]

A helper function that converts a list of URLs into a list of requests.

send_after(pid, message, timeout)

send_after(pid(), term(), pos_integer()) :: reference()

A wrapper over Process.send_after. Use this wrapper instead of calling Process.send_after directly so that the call can be mocked, which avoids race conditions when testing workers.
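
The idea behind such a wrapper can be sketched as follows. Routing the timer call through a module you own gives tests a single function to replace with a mock, instead of having to intercept the standard library's Process module. The TimerSketch module is hypothetical and only delegates to Process.send_after/3.

```elixir
defmodule TimerSketch do
  # Hypothetical wrapper for illustration: it simply delegates to
  # Process.send_after/3 and returns the timer reference.
  def send_after(pid, message, timeout) do
    Process.send_after(pid, message, timeout)
  end
end

# Schedule a message to the current process and wait for it to arrive.
ref = TimerSketch.send_after(self(), :tick, 10)

result =
  receive do
    :tick -> :received
  after
    1_000 -> :timeout
  end

IO.inspect({is_reference(ref), result})
```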
