API Reference Crawly v0.17.2
Modules
Crawly is a fast high-level web crawling & scraping framework for Elixir.
Crawly HTTP API. Allows scheduling/stopping spiders and getting stats for all running spiders.
Data Storage is the module responsible for storing crawled items. At a high level, items from each spider are routed to a per-spider storage worker, where they are processed by the item pipelines.
A worker process which stores items for individual spiders. All items are pre-processed by item_pipelines.
Crawly Engine - process responsible for starting and stopping spiders.
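As an illustration of the engine's role, a spider is typically started and stopped from application code or IEx. A minimal sketch (MySpider is a hypothetical spider module, not part of Crawly):

```elixir
# Start a crawl for a spider module (MySpider is assumed to be defined elsewhere).
Crawly.Engine.start_spider(MySpider)

# ... later, stop it again once crawling should end.
Crawly.Engine.stop_spider(MySpider)
```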
Implements the Crawly.Fetchers.Fetcher behavior for JavaScript rendering via Crawly Render Server.
A behavior module for defining Crawly fetchers.
Implements the Crawly.Fetchers.Fetcher behavior using the HTTPoison HTTP client.
Implements the Crawly.Fetchers.Fetcher behavior for JavaScript rendering via Splash.
Crawler manager module
Set/update cookies for requests. The cookies are automatically picked up from the prev_responses stored by Crawly. Only name/value pairs are taken into account; other options such as domain and secure are ignored.
Filters out requests that go outside of the crawled domain.
Request settings middleware
Obey robots.txt
Avoid scheduling multiple requests for the same page. A hashing algorithm can be set via options to reduce the memory footprint; be aware of the reduced collision resistance, depending on the chosen algorithm.
Set/Rotate user agents for crawling. The user agents are read from the :crawly, :user_agents settings.
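The middlewares above are typically enabled through the application config. A minimal sketch, assuming the usual :crawly config keys; the hash option for UniqueRequest is inferred from the description above and may differ by version:

```elixir
# config/config.exs -- a sketch, not an exhaustive reference
import Config

config :crawly,
  # Picked up by the UserAgent middleware (the :user_agents setting mentioned above).
  user_agents: ["Crawly Bot 1.0"],
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.RobotsTxt,
    # Assumed option: choose a hashing algorithm to reduce the memory footprint.
    {Crawly.Middlewares.UniqueRequest, hash: :sha256},
    Crawly.Middlewares.UserAgent
  ]
```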
The Crawly.Models.Job module defines a struct and functions for managing and updating information about a web scraping job.
Defines the structure of a spider's result.
A behavior module for implementing a pipeline module. Pipelines allow for customization of how Crawly.Requests, Crawly.Responses, and :items set on Crawly.ParsedItem are processed. Each pipeline is called in sequence, with the result of each being passed to the next pipeline.
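A custom pipeline is a module implementing this behavior. A minimal sketch (the module name and the added field are hypothetical; the run/3 callback shape is assumed to match this Crawly version):

```elixir
defmodule MyCrawler.Pipelines.TagSource do
  @moduledoc "Hypothetical pipeline that annotates each scraped item with its source."
  @behaviour Crawly.Pipeline

  # Receives the current item and the pipeline state; the returned tuple is
  # passed on to the next pipeline in the configured sequence.
  @impl Crawly.Pipeline
  def run(item, state, _opts \\ []) do
    {Map.put(item, :source, "my_crawler"), state}
  end
end
```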
Encodes a given item (map) into CSV. Does not flatten nested maps.
Filters out duplicated items based on the provided item_id.
Encodes a given item (map) into JSON.
Ensures that a scraped item contains a set of required fields.
Stores a given item on the filesystem.
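The item pipelines above are usually combined through config, with options matching the descriptions (required fields for Validate, an item_id for DuplicatesFilter, a target folder for WriteToFile). A minimal sketch; field names and paths are placeholders:

```elixir
# config/config.exs -- a sketch of a typical pipeline chain
import Config

config :crawly,
  pipelines: [
    # Drop items that are missing required fields.
    {Crawly.Pipelines.Validate, fields: [:title, :url]},
    # Deduplicate items by their :url field.
    {Crawly.Pipelines.DuplicatesFilter, item_id: :url},
    # Encode each surviving item as JSON and write it to disk.
    Crawly.Pipelines.JSONEncoder,
    {Crawly.Pipelines.WriteToFile, folder: "/tmp", extension: "jl"}
  ]
```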
Request wrapper
Request storage, a module responsible for storing URLs for crawling.
Requests Storage is a module responsible for storing requests for a given spider.
Define Crawly response structure
Define Crawly setting types
A behavior module for implementing a Crawly Spider
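A spider is a module adopting this behavior. A minimal sketch, assuming the usual base_url/0, init/0, and parse_item/1 callbacks; the URLs and item fields are placeholders:

```elixir
defmodule MyCrawler.ExampleSpider do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://example.com"

  @impl Crawly.Spider
  def init(), do: [start_urls: ["https://example.com/articles"]]

  @impl Crawly.Spider
  def parse_item(response) do
    # A real spider would parse response.body (for example with Floki)
    # into extracted items and follow-up requests.
    %Crawly.ParsedItem{
      items: [%{url: response.request_url, title: "placeholder title"}],
      requests: []
    }
  end
end
```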
Utility functions for Crawly
A worker process responsible for the actual work (fetching requests, processing responses)
Mix Tasks
Generate Crawly configuration
Generate Crawly spider template