API Reference Crawler v1.5.0


A high-performance web crawler in Elixir.
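A minimal usage sketch. The `Crawler.crawl/2` entry point returning `{:ok, opts}`, the pause/stop helpers, and the option names shown are assumptions drawn from the project's README rather than a confirmed contract:

```elixir
# Add {:crawler, "~> 1.5"} to your mix.exs deps first.
# The options shown (max_depths, workers, interval) are assumed names.
{:ok, opts} = Crawler.crawl("http://elixir-lang.org",
  max_depths: 2,  # follow links at most 2 levels deep
  workers: 10,    # number of concurrent worker processes
  interval: 500   # pause (ms) between dispatching requests
)

# Pause, resume, or stop the crawl queue (assumed API).
Crawler.pause(opts)
Crawler.resume(opts)
Crawler.stop(opts)
```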

Dispatches requests to a queue for crawling.

A worker that performs the crawling.

This example performs a Google search, then scrapes the results to find GitHub projects and output their names and descriptions.

We only scrape GitHub pages, specifically looking for a project's name and description.

We start with Google, then crawl only GitHub.
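The example above could be sketched roughly as follows. The module names (`GithubFilter`, `GithubScraper`) are hypothetical, and the callback shapes (`filter/2` returning `{:ok, boolean}`, `scrape/2` receiving a map with the page's `url` and `body`) are assumptions, not the library's confirmed interface:

```elixir
defmodule GithubFilter do
  # Only allow Google (the entry point) and GitHub URLs through.
  def filter(url, _opts) do
    {:ok, String.contains?(url, "github.com") or String.contains?(url, "google.com")}
  end
end

defmodule GithubScraper do
  # Print a GitHub project's name (its URL) and description.
  def scrape(%{url: "https://github.com/" <> _ = url, body: body} = page, _opts) do
    IO.puts("#{url}: #{extract_description(body)}")
    {:ok, page}
  end

  def scrape(page, _opts), do: {:ok, page}

  defp extract_description(body) do
    # Naive extraction from the <meta name="description"> tag.
    case Regex.run(~r/<meta name="description" content="([^"]*)"/, body) do
      [_, description] -> description
      _ -> "(no description found)"
    end
  end
end

Crawler.crawl("https://www.google.com/search?q=elixir+web+crawler",
  url_filter: GithubFilter,
  scraper: GithubScraper,
  max_depths: 2
)
```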

Fetches pages and performs tasks on them.

Captures and prepares HTTP response headers.

Modifies request options and headers before dispatch.

Checks a series of conditions to determine whether it is okay to continue.

Records information about each crawl for internal use.

Makes HTTP requests.

Handles retries for failed crawls.

Spec for defining a fetch retrier.
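An entirely hypothetical retrier to illustrate the extension point: a module that retries a failing fetch a fixed number of times. The `perform` callback's name and arity are assumptions based on this spec, not a confirmed signature:

```elixir
defmodule SimpleRetrier do
  # Retry the fetch up to `attempts` times, returning the first success
  # or the final error.
  def perform(fetch_fun, _opts, attempts \\ 3) do
    case fetch_fun.() do
      {:ok, result} -> {:ok, result}
      {:error, _} when attempts > 1 -> perform(fetch_fun, [], attempts - 1)
      error -> error
    end
  end
end
```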

A placeholder module that lets all URLs pass through.

Spec for defining a URL filter.
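A minimal URL filter sketch that restricts the crawl to a single domain. The `filter/2` callback returning `{:ok, boolean}`, and the `:url_filter` option used to plug it in, are assumptions based on the placeholder module above:

```elixir
defmodule SameDomainFilter do
  # Only crawl links whose host is example.com.
  def filter(url, _opts) do
    {:ok, URI.parse(url).host == "example.com"}
  end
end

# Plugged in via the (assumed) :url_filter option:
# Crawler.crawl("http://example.com", url_filter: SameDomainFilter)
```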

Custom HTTPoison base module for potential customisation.

A set of high-level functions for building online and offline URLs and links.

Builds a path for a link (which may be a full URL or a relative link) based on the input string, a URL with or without its protocol.

Expands the path by resolving any . and .. segments.
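A standalone sketch of this kind of dot-segment resolution (in the spirit of RFC 3986); illustrative only, not the library's internal implementation:

```elixir
defmodule PathDemo do
  # Resolve "." and ".." segments in a slash-separated path.
  def expand(path) do
    path
    |> String.split("/")
    |> Enum.reduce([], fn
      ".", acc -> acc            # current directory: drop the segment
      "..", [_ | rest] -> rest   # parent directory: pop one segment
      "..", [] -> []             # nothing left to pop at the root
      segment, acc -> [segment | acc]
    end)
    |> Enum.reverse()
    |> Enum.join("/")
  end
end

PathDemo.expand("a/b/../c/./d")
# => "a/c/d"
```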

Finds different components of a given URL, e.g. its domain name, directory path, or full path.

Transforms a link to be storable and linkable offline.

Returns prefixes (../s) according to the given URL's structure.
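An illustrative sketch of the idea: the number of `../` prefixes needed to climb from a URL's directory back to its domain root. This is not the library's exact code, only a demonstration using the standard `URI` and `Path` modules:

```elixir
defmodule PrefixDemo do
  # Count the directory segments in the URL's path and emit one "../"
  # per segment.
  def prefix(url) do
    depth =
      URI.parse(url).path
      |> Path.dirname()
      |> String.split("/", trim: true)
      |> length()

    String.duplicate("../", depth)
  end
end

PrefixDemo.prefix("http://example.com/dir/sub/page.html")
# => "../../"
```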

Options for the crawler.
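A sketch of a more fully configured crawl. The option keys shown (`save_to`, `assets`, `workers`, `interval`, `max_depths`, `timeout`, `user_agent`) are assumptions drawn from the project README; check `Crawler.Options` for the authoritative list and defaults:

```elixir
Crawler.crawl("http://example.com",
  save_to: "/tmp/crawls",           # store pages offline under this dir
  assets: ["css", "js", "images"],  # also fetch linked assets
  workers: 20,                      # concurrent worker processes
  interval: 100,                    # ms between dispatches (politeness)
  max_depths: 3,                    # how many link levels to follow
  timeout: 5_000,                   # per-request timeout in ms
  user_agent: "MyCrawler/1.0"
)
```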

Parses pages and invokes a link handler on the detected links.

Parses CSS files.

Detects whether a page is parsable.

Parses HTML files.

Parses links and transforms them if necessary.

Expands a link into a full URL.
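Elixir's standard library already illustrates the basic case: merging a relative link against the URL of the page it appeared on yields the full URL.

```elixir
# A link relative to the current directory:
URI.merge("http://example.com/docs/a.html", "b.html") |> to_string()
# => "http://example.com/docs/b.html"

# A link that climbs out of the current directory:
URI.merge("http://example.com/docs/a.html", "../c.html") |> to_string()
# => "http://example.com/c.html"
```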

Spec for defining a parser.

Handles the queueing of crawl requests.

A placeholder module that demonstrates the scraping interface.

Spec for defining a scraper.
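A hypothetical scraper that prints each page's `<title>`. The callback's exact arity and the shape of the data it receives (a map with `url` and `body`) are assumptions based on the placeholder module above, not the confirmed spec:

```elixir
defmodule TitleScraper do
  # Print the page title, then hand the page back unchanged.
  def scrape(%{url: url, body: body} = page, _opts) do
    case Regex.run(~r/<title>(.*?)<\/title>/s, body) do
      [_, title] -> IO.puts("#{url}: #{String.trim(title)}")
      _ -> :ok
    end

    {:ok, page}
  end
end
```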

Stores crawled pages offline.

Makes a new (nested) folder according to the options provided.

Replaces links found in a page so they work offline.

An internal data store for information related to each crawl.

An internal struct for keeping the URL and content of a crawled page.
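A sketch of what such a struct looks like; the actual field names in `Crawler.Store.Page` may differ:

```elixir
defmodule PageDemo do
  # A crawled page: where it came from and what was fetched.
  defstruct [:url, :body]
end

page = %PageDemo{url: "http://example.com", body: "<html>...</html>"}
page.url
# => "http://example.com"
```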

Handles the crawl tasks.