Basic Concepts
The Flow: from Request to Response to Parsed Item
Data is fetched in a linear series of operations:
1. New Requests are formed through Crawly.Spider.init/0.
2. New Requests are pre-processed individually.
3. Data is fetched, and a Response is returned.
4. The Spider receives the response, parses it, and returns new Requests and new parsed items.
5. Parsed items are post-processed individually. New Requests from the Spider go back to step 2.
Spiders
Spiders are modules which define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items). In other words, Spiders are the place where you define the custom behaviour for crawling and parsing pages for a particular site.
For spiders, the scraping cycle goes through something like this:
- You start by generating the initial Requests to crawl the first URLs, and use a callback function called with the response downloaded from those requests.
- In the callback function, you parse the response (web page) and return a %Crawly.ParsedItem{} struct. This struct should contain new requests to follow and items to be stored.
- In the callback function, you parse the page contents, typically using Floki (but you can also use any other library you prefer), and generate items with the parsed data.
Spiders are executed in the context of Crawly.Worker processes, and you can control the number of concurrent workers via the concurrent_requests_per_domain setting.
All requests are processed sequentially and are pre-processed by middlewares.
All items are processed sequentially and are processed by item pipelines.
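For example, the worker concurrency can be tuned in the application config; a minimal sketch (the value below is illustrative):
import Config

# Illustrative: allow at most 4 concurrent workers per crawled domain
config :crawly,
  concurrent_requests_per_domain: 4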
Behaviour functions
In order to make a working web crawler, all the behaviour callbacks need to be implemented.
init() - part of the Crawly.Spider behaviour. This function should return a keyword list which contains a start_urls entry with a list of URLs, which defines the starting requests made by Crawly. Alternatively, you may provide start_requests if you need to prepare the first requests inside init(), which might be useful if, for example, you want to pass a session cookie to the starting request. Note: start_requests are processed before start_urls.
** This callback is going to be deprecated in favour of init/1. For now, backwards compatibility is kept with the help of a macro which always generates init/1.
init(options) - the same as init/0, but also takes options (which can be passed from the engine during the spider start).
base_url() - defines the base_url of the given spider. This function is used by the DomainFilter middleware in order to filter out all requests which go outside of the crawled website.
parse_item(response) - a function which defines how a given response is translated into the Crawly.ParsedItem structure. At a high level, this function defines the extraction rules for both items and requests.
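Putting these callbacks together, a minimal spider might look like the following sketch (the module name, selectors and item fields are illustrative; Floki is used for parsing, as described above):
defmodule MyApp.ExampleSpider do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://example.com"

  @impl Crawly.Spider
  def init(), do: [start_urls: ["https://example.com/catalogue/"]]

  @impl Crawly.Spider
  def parse_item(response) do
    {:ok, document} = Floki.parse_document(response.body)

    # Extract one item per product title link (selectors are illustrative)
    items =
      document
      |> Floki.find(".product h3 a")
      |> Enum.map(fn link -> %{title: Floki.text([link])} end)

    # Follow pagination links; note that relative hrefs would need to be
    # converted to absolute URLs before being scheduled
    requests =
      document
      |> Floki.find("a.next")
      |> Floki.attribute("href")
      |> Enum.map(&Crawly.Utils.request_from_url/1)

    %Crawly.ParsedItem{items: items, requests: requests}
  end
end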
Requests and Responses
Crawly uses Request and Response objects for crawling web sites.
Typically, Request objects are generated in the spiders and passed across the system until they reach the Crawly.Worker process, which executes the request and returns a Response object that travels back to the spider that issued the request. Request objects are modified by the configured middlewares before they reach the worker.
The request is defined as the following structure:
@type t :: %Crawly.Request{
url: binary(),
headers: [header()],
prev_response: %{},
options: [option()]
}
@type header() :: {key(), value()}
Where:
- url - the URL of the request
- headers - the HTTP headers which are going to be used with the given request
- options - the request options (for example, whether to follow redirects)
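For illustration, a request can be built with the Crawly.Utils.request_from_url/1 helper or constructed as a struct directly; the URL, header and option values below are placeholders:
request = Crawly.Utils.request_from_url("https://example.com/products")

# Or build the struct by hand, attaching headers and fetcher options
request = %Crawly.Request{
  url: "https://example.com/products",
  headers: [{"User-Agent", "MyBot/1.0"}],
  options: [follow_redirect: true]
}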
Crawly uses the HTTPoison library to perform the requests, but we have plans to extend the support with other pluggable backends, such as Selenium and others.
Responses are defined in the same way as HTTPoison responses. See more details here: https://hexdocs.pm/httpoison/HTTPoison.Response.html#content
Parsed Item
ParsedItem is a structure which is filled by the parse_item/1
callback of the Spider. The structure is defined in the following way:
@type item() :: %{}
@type t :: %__MODULE__{
items: [item()],
requests: [Crawly.Request.t()]
}
The parsed item is processed by the Crawly.Worker process, which sends all requests to the Crawly.RequestsStorage process, responsible for pre-processing requests and storing them for future execution. All items are sent to the Crawly.DataStorage process, which is responsible for pre-processing items and storing them on disk.
For now, only one storage backend is supported (writing to disk), but in the future Crawly will also support other backends, such as Amazon S3, SQL and others.
The Crawly.Pipeline Behaviour
Crawly uses the concept of pipelines when it comes to processing the elements sent through the system. This applies to both request and scraped item manipulation. Conceptually, requests go through a series of manipulations before the response is fetched. The response then goes through another, different series of manipulations.
Importantly, the way that requests and responses are manipulated is abstracted into the Crawly.Pipeline behaviour. This allows for a modular system for declaring changes. Note that each declared Crawly.Pipeline module is applied sequentially through the Crawly.Utils.pipe/3 function.
Writing Tests for Custom Pipelines
Modules that implement the Crawly.Pipeline
behaviour can make use of the Crawly.Utils.pipe/3
function to test for expected behaviour. Refer to the function documentation for more information and examples.
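As an illustration, a test for a hypothetical MyCustomPipeline (the module, option and item fields here are made up) might look roughly like this:
defmodule MyCustomPipelineTest do
  use ExUnit.Case

  test "the pipeline passes items through with the expected changes" do
    item = %{title: "Hello"}
    state = %{spider_name: MySpider}

    # Run the item through a one-element pipeline list with tuple-based options
    {new_item, _new_state} =
      Crawly.Utils.pipe([{MyCustomPipeline, my_option: "value"}], item, state)

    assert new_item.title == "Hello"
  end
end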
Request Middlewares
These are configured under the middlewares
option. See configuration for more details.
Middleware: a pipeline module that modifies a request. It implements the Crawly.Pipeline behaviour.
Middlewares are able to make changes to the underlying request, a Crawly.Request
struct. The request, along with any options specified, is then passed to the fetcher (currently HTTPoison
).
The available configuration options should correspond to the underlying options of the fetcher in use.
Note that all request configuration options for HTTPoison, such as proxy, ssl, etc., can be configured through Crawly.Request.options.
Built-in middlewares:
- Crawly.Middlewares.DomainFilter - disables scheduling for all requests leading outside of the crawled site, based on base_url.
- Crawly.Middlewares.SameDomainFilter - filters by domain as well, but does not require base_url to be set; instead, the start_url is considered.
- Crawly.Middlewares.RobotsTxt - ensures that Crawly respects the robots.txt defined by the target website.
- Crawly.Middlewares.UniqueRequest - ensures that Crawly will not schedule the same URL (request) multiple times. Optionally supports hashing to reduce the memory footprint.
- Crawly.Middlewares.UserAgent - sets the User-Agent HTTP header. Allows rotating user agents if the value is defined as a list.
- Crawly.Middlewares.RequestOptions - allows setting additional request options, for example a timeout or a proxy string (at the moment the options should match the options of the individual fetcher, e.g. HTTPoison). Example: {Crawly.Middlewares.RequestOptions, [timeout: 30_000, recv_timeout: 15000]}
- Crawly.Middlewares.AutoCookiesManager - turns on automatic cookies management. Useful for cases when you need to log in or enter form data used by a website.
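Middlewares are enabled by listing them under the middlewares key of the Crawly config; a sketch combining several of the built-in modules above (the option values are illustrative):
import Config

config :crawly,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    {Crawly.Middlewares.UserAgent, user_agents: ["My Bot 1.0", "My Bot 2.0"]},
    {Crawly.Middlewares.RequestOptions, [timeout: 30_000, recv_timeout: 15_000]},
    Crawly.Middlewares.AutoCookiesManager
  ]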
Response Parsers
Parser: a pipeline module that parses a fetcher's response. If declared, a spider's c:Crawly.Spider.parse_item/1 callback is ignored. It is unused by default. It implements the Crawly.Pipeline behaviour.
Parsers allow for logic reuse when spiders parse a fetcher's response.
Item Pipelines
Item Pipeline: a pipeline module that modifies and pre-processes a scraped item. It implements the Crawly.Pipeline behaviour.
Built-in item pipelines:
- Crawly.Pipelines.Validate - validates that a given item has all the required fields. All items which don't have all required fields are dropped.
- Crawly.Pipelines.DuplicatesFilter - filters out items which are already stored in the system.
- Crawly.Pipelines.JSONEncoder - converts items into JSON format.
- Crawly.Pipelines.CSVEncoder - converts items into CSV format.
- Crawly.Pipelines.WriteToFile - writes information to a given file.
The list of item pipelines used with a given project is defined in the project settings.
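For example, a pipeline chain that validates, de-duplicates, encodes to JSON and writes to disk could be declared roughly as follows (the field names and output folder are placeholders, using the tuple-based option form described below):
import Config

config :crawly,
  pipelines: [
    {Crawly.Pipelines.Validate, fields: [:title, :url]},
    {Crawly.Pipelines.DuplicatesFilter, item_id: :title},
    Crawly.Pipelines.JSONEncoder,
    {Crawly.Pipelines.WriteToFile, folder: "/tmp", extension: "jl"}
  ]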
Creating a Custom Pipeline Module
Both item pipelines and request middlewares follow the Crawly.Pipeline behaviour. As such, when creating your custom pipeline, it will need to implement the required callback c:Crawly.Pipeline.run/3.
The c:Crawly.Pipeline.run/3 callback receives the processed item, item, from the previous pipeline module as the first argument. The second argument, state, is a map containing information such as the spider which the item originated from (under the :spider_name key), and may optionally store pipeline information. Finally, opts is a keyword list containing any tuple-based options.
Passing Configuration Options To Your Pipeline
Tuple-based option declaration is supported, similar to how a GenServer
is declared in a supervision tree. This allows for pipeline reusability for different use cases.
For example, you can pass options in this way through your pipeline declaration:
pipelines: [
{MyCustomPipeline, my_option: "value"}
]
In your pipeline, you will then receive the options passed through the opts
argument.
defmodule MyCustomPipeline do
  @impl Crawly.Pipeline
  def run(item, state, opts) do
    IO.inspect(opts) # shows the keyword list [my_option: "value"]
    # Do something with the item here, then return it along with the state
    {item, state}
  end
end
Best Practices
The use of global configs is discouraged, so one should pass options through a tuple-based pipeline declaration where possible.
When storing information in the state map, ensure that the state is namespaced with the pipeline name, to avoid key clashes. For example, to store state from MyEctoPipeline, store it under the key :my_ecto_pipeline_my_state.
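As a small illustration, a hypothetical MyEctoPipeline could keep a counter under its own key inside run/3:
# Inside MyEctoPipeline.run/3 (illustrative): namespace pipeline state under its own key
counter = Map.get(state, :my_ecto_pipeline_my_state, 0)
state = Map.put(state, :my_ecto_pipeline_my_state, counter + 1)
{item, state}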
Custom Request Middlewares
Request Middleware Example - Add a Proxy
Following the documentation for proxy options of a request in HTTPoison
, we can do the following:
defmodule MyApp.MyProxyMiddleware do
  @impl Crawly.Pipeline
  def run(request, state, opts \\ []) do
    # Set default proxy and proxy_auth to nil
    opts = Enum.into(opts, %{proxy: nil, proxy_auth: nil})

    case opts.proxy do
      nil ->
        # No proxy configured, pass the request through unchanged
        {request, state}

      _value ->
        old_options = request.options
        new_options = [proxy: opts.proxy, proxy_auth: opts.proxy_auth]
        new_request = Map.put(request, :options, old_options ++ new_options)
        {new_request, state}
    end
  end
end
Custom Item Pipelines
Item pipelines receive the parsed item (from the spider) and perform post-processing on the item.
Storing Parsed Items
You can use custom item pipelines to save the item to custom storages.
Example - Ecto Storage Pipeline
In this example, we insert the scraped item into a table with Ecto. This example does not directly call MyRepo.insert
, but delegates it to an application context function.
defmodule MyApp.MyEctoPipeline do
@impl Crawly.Pipeline
def run(item, state, _opts \\ []) do
case MyApp.insert_with_ecto(item) do
{:ok, _} ->
# insert successful, carry on with pipeline
{item, state}
{:error, _} ->
# insert not successful, drop from pipeline
{false, state}
end
end
end
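For completeness, MyApp.insert_with_ecto/1 is not part of Crawly; a hypothetical context function (the schema, changeset and repo names are made up) might delegate to Ecto roughly like this:
defmodule MyApp do
  # Hypothetical context function: casts the scraped map and inserts it with Ecto
  def insert_with_ecto(item) do
    %MyApp.ScrapedItem{}
    |> MyApp.ScrapedItem.changeset(item)
    |> MyApp.Repo.insert()
  end
end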
Multiple Different Types of Parsed Items
If you need to selectively post-process different types of scraped items, you can utilize pattern-matching at the item pipeline level.
There are two general methods of doing so:
- Struct-based pattern matching
defmodule MyApp.MyCustomPipeline do
  @impl Crawly.Pipeline
  def run(item, state, opts \\ [])

  def run(%MyItem{} = item, state, _opts) do
    # do something with the item
    {item, state}
  end

  # do nothing if it does not match
  def run(item, state, _opts), do: {item, state}
end
- Key-based pattern matching
defmodule MyApp.MyCustomPipeline do
  @impl Crawly.Pipeline
  def run(item, state, opts \\ [])

  def run(%{my_item: _my_item} = item, state, _opts) do
    # do something with the item
    {item, state}
  end

  # do nothing if it does not match
  def run(item, state, _opts), do: {item, state}
end
Use struct-based pattern matching when:
- you want to utilize existing Ecto schemas
- you have pre-defined structs that you want to conform to
Use key-based pattern matching when:
- you want to process two or more related and inter-dependent items together
- you want to bulk process multiple items for efficiency reasons. For example, processing the weather data for 365 days in one pass.
Caveats
When using the key-based pattern matching method, the spider's Crawly.Spider.parse_item/1 callback will need to return items with a single key (or a map with multiple keys, if doing related processing).
When using struct-based pattern matching with existing Ecto structs, you will need to do an intermediate conversion of the struct into a map before performing the insertion into the Ecto Repo. This is due to the underlying Ecto schema metadata still being attached to the struct before insertion.
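A rough sketch of that conversion (the struct and field names are illustrative):
# Convert an Ecto struct into a plain map before re-inserting its data,
# dropping schema metadata and autogenerated fields (illustrative)
attrs =
  my_ecto_item
  |> Map.from_struct()
  |> Map.drop([:__meta__, :id, :inserted_at, :updated_at])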
Example - Multi-Item Pipelines With Pattern Matching
In this example, your spider scrapes a "blog post" and "weather data" from a website. We will use the key-based pattern matching approach to selectively post-process the blog post parsed item.
# in MyApp.CustomSpider.ex
def parse_item(response) do
  # ... parse the response and extract blog_post and the weather data ...
  %{
    items: [
      %{blog_post: blog_post},
      %{weather: [january_weather, february_weather]}
    ],
    requests: []
  }
end
Then, in the custom pipeline, we will pattern match on the :blog_post
key, to ensure that we only process blog posts with this pipeline (and not weather data).
We then update the :blog_post
key of the received item.
defmodule MyApp.BlogPostPipeline do
  @impl Crawly.Pipeline
  def run(item, state, opts \\ [])

  def run(%{blog_post: _old_blog_post} = item, state, _opts) do
    # process the blog post
    updated_item = Map.put(item, :blog_post, %{my: "data"})
    {updated_item, state}
  end

  # do nothing if it does not match
  def run(item, state, _opts), do: {item, state}
end
Browser rendering
Browser rendering is one of the most complex problems of scraping. The Internet moves towards more dynamic content, where not only parts of the pages are loaded asynchronously, but entire applications might be rendered by JavaScript and AJAX.
In most cases it's still possible to extract the data from dynamically rendered pages (e.g. by sending additional POST requests from loaded pages), however this approach has visible drawbacks: from our point of view it makes the spider code quite complicated and fragile.
Of course, it's better when you can just get pages already rendered for you, and we solve this problem with the help of pluggable HTTP fetchers.
Crawly's codebase contains a special Splash fetcher, which performs browser rendering before the page content is parsed by a spider. It's also possible to build your own fetchers.
Using crawly-render-server for browser rendering
NOTE: Experimental
I have made a simple Puppeteer-based browser rendering tool, which is available here: https://github.com/elixir-crawly/crawly-render-server
I am actively testing it with various targets, and at least for me the results look fine. However, I am super interested in other feedback or contributions.
To run it, do the following:
- git clone https://github.com/elixir-crawly/crawly-render-server.git
- cd ./crawly-render-server
- docker run -p 3000:3000 --rm -it $(docker build -q .)
- configure it on the project or spider level:
(project level)
import Config
config :crawly,
fetcher: {Crawly.Fetchers.CrawlyRenderServer, [base_url: "http://localhost:3000/render"]}
Using splash fetcher for browser rendering
NOTE: It looks like Splash is not maintained anymore.
We could not run its Docker images on M1/M2 Mac machines, and we could not build it from source either :(
Splash is a lightweight open-source browser implementation built with Qt and Python. See: https://splash.readthedocs.io/en/stable/api.html
You can try using Splash with Crawly in the following way:
- Start Splash locally (e.g. using a Docker image):
docker run -it -p 8050:8050 scrapinghub/splash --max-timeout 300
- Configure Crawly to use Splash:
fetcher: {Crawly.Fetchers.Splash, [base_url: "http://localhost:8050/render.html"]}
- Now all your pages will be automatically rendered by Splash.