API Reference Crawly v0.17.2
Modules
Crawly is a fast high-level web crawling & scraping framework for Elixir.
Crawly HTTP API. Allows scheduling/stopping spiders and getting stats for all running spiders.
Data Storage is the module responsible for storing crawled items. At a high level, items from each spider are routed to a per-spider storage worker, where they are processed by the item pipelines.
A worker process which stores items for individual spiders. All items are pre-processed by item_pipelines.
Crawly Engine - process responsible for starting and stopping spiders.
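As an illustration of the engine's role, a spider is typically started and stopped from application code or IEx. A minimal sketch (MySpider is a hypothetical spider module, not part of Crawly):

```elixir
# Start a crawl for a spider module (MySpider is assumed to be defined elsewhere).
Crawly.Engine.start_spider(MySpider)

# ... later, stop it again once crawling should end.
Crawly.Engine.stop_spider(MySpider)
```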
Implements the Crawly.Fetchers.Fetcher behavior for JavaScript rendering via Crawly Render Server.
A behavior module for defining Crawly fetchers.
Implements the Crawly.Fetchers.Fetcher behavior using the HTTPoison HTTP client.
Implements the Crawly.Fetchers.Fetcher behavior for JavaScript rendering via Splash.
Crawler manager module
Set/update cookies for requests. The cookies are automatically picked up from the prev_responses stored by Crawly. Only name/value pairs are taken into account; other options such as domain and secure are ignored.
Filters out requests that go outside of the crawled domain.
Request settings middleware
Obey robots.txt
Avoid scheduling multiple requests for the same page. A hashing algorithm can be set via options to reduce the memory footprint; be aware of the reduced collision resistance, depending on the chosen algorithm.
Set/Rotate user agents for crawling. The user agents are read from the :crawly, :user_agents settings.
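The middlewares above are typically enabled through the application config. A minimal sketch, assuming the usual :crawly config keys; the hash option for UniqueRequest is inferred from the description above and may differ by version:

```elixir
# config/config.exs -- a sketch, not an exhaustive reference
import Config

config :crawly,
  # Picked up by the UserAgent middleware (the :user_agents setting mentioned above).
  user_agents: ["Crawly Bot 1.0"],
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.RobotsTxt,
    # Assumed option: choose a hashing algorithm to reduce the memory footprint.
    {Crawly.Middlewares.UniqueRequest, hash: :sha256},
    Crawly.Middlewares.UserAgent
  ]
```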
The Crawly.Models.Job module defines a struct and functions for managing and updating information about a web scraping job.
Defines the structure of a spider's result.
A behavior module for implementing a pipeline module. Pipelines allow for customization of how Crawly.Requests, Crawly.Responses, and :items set on Crawly.ParsedItem are processed. Each pipeline is called in sequence, with the result of each being passed to the next pipeline.
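A custom pipeline is a module implementing this behavior. A minimal sketch (the module name and the added field are hypothetical; the run/3 callback shape is assumed to match this Crawly version):

```elixir
defmodule MyCrawler.Pipelines.TagSource do
  @moduledoc "Hypothetical pipeline that annotates each scraped item with its source."
  @behaviour Crawly.Pipeline

  # Receives the current item and the pipeline state; the returned tuple is
  # passed on to the next pipeline in the configured sequence.
  @impl Crawly.Pipeline
  def run(item, state, _opts \\ []) do
    {Map.put(item, :source, "my_crawler"), state}
  end
end
```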
Encodes a given item (map) into CSV. Does not flatten nested maps.
Filters out duplicated items based on the provided item_id.
Encodes a given item (map) into JSON.
Ensures that a scraped item contains a set of required fields.
Stores a given item on the filesystem.
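The item pipelines above are usually combined through config, with options matching the descriptions (required fields for Validate, an item_id for DuplicatesFilter, a target folder for WriteToFile). A minimal sketch; field names and paths are placeholders:

```elixir
# config/config.exs -- a sketch of a typical pipeline chain
import Config

config :crawly,
  pipelines: [
    # Drop items that are missing required fields.
    {Crawly.Pipelines.Validate, fields: [:title, :url]},
    # Deduplicate items by their :url field.
    {Crawly.Pipelines.DuplicatesFilter, item_id: :url},
    # Encode each surviving item as JSON and write it to disk.
    Crawly.Pipelines.JSONEncoder,
    {Crawly.Pipelines.WriteToFile, folder: "/tmp", extension: "jl"}
  ]
```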
Request wrapper
Request storage, a module responsible for storing URLs for crawling.
Requests Storage is a module responsible for storing requests for a given spider.
Define Crawly response structure
Define Crawly setting types
A behavior module for implementing a Crawly Spider
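A spider is a module adopting this behavior. A minimal sketch, assuming the usual base_url/0, init/0, and parse_item/1 callbacks; the URLs and item fields are placeholders:

```elixir
defmodule MyCrawler.ExampleSpider do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://example.com"

  @impl Crawly.Spider
  def init(), do: [start_urls: ["https://example.com/articles"]]

  @impl Crawly.Spider
  def parse_item(response) do
    # A real spider would parse response.body (for example with Floki)
    # into extracted items and follow-up requests.
    %Crawly.ParsedItem{
      items: [%{url: response.request_url, title: "placeholder title"}],
      requests: []
    }
  end
end
```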
Utility functions for Crawly
A worker process responsible for the actual work (fetching requests, processing responses)
Mix Tasks
Generate Crawly configuration
Generate Crawly spider template