Crawly

Overview

Crawly is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.

Requirements

Elixir "~> 1.7"
Works on Linux, Windows, OS X and BSD

Quickstart

Add Crawly as a dependency:

   # mix.exs
   defp deps do
     [
       {:crawly, "~> 0.9.0"},
       {:floki, "~> 0.26.0"}
     ]
   end

Fetch dependencies: $ mix deps.get
Create a spider:

   # lib/crawly_example/esl_spider.ex
   defmodule EslSpider do
     use Crawly.Spider
     
     alias Crawly.Utils

     @impl Crawly.Spider
     def base_url(), do: "https://www.erlang-solutions.com"

     @impl Crawly.Spider
     def init(), do: [start_urls: ["https://www.erlang-solutions.com/blog.html"]]

     @impl Crawly.Spider
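     def parse_item(response) do
       # Parse the fetched HTML body into a Floki document tree
       {:ok, document} = Floki.parse_document(response.body)

       # Collect the href of every link matching the "a.more" selector
       hrefs = document |> Floki.find("a.more") |> Floki.attribute("href")

       # Turn the (possibly relative) links into follow-up requests
       requests =
         hrefs
         |> Utils.build_absolute_urls(base_url())
         |> Utils.requests_from_urls()

       # Extract the post title (an empty string if the selector matches nothing)
       title = document |> Floki.find("article.blog_post h1") |> Floki.text()

       %{
         requests: requests,
         items: [%{title: title, url: response.request_url}]
       }
     end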
   end
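
The link-following step in `parse_item/1` depends on turning relative `href` values into absolute URLs before they become requests. As a rough illustration of that idea using only the Elixir standard library (this is not Crawly's implementation; the module name `AbsoluteUrls` and the sample paths are made up for the sketch):

```elixir
defmodule AbsoluteUrls do
  # Hypothetical helper for illustration only. In a real spider,
  # Crawly.Utils.build_absolute_urls/2 does this job.
  def build(hrefs, base) do
    Enum.map(hrefs, fn href ->
      # URI.merge/2 resolves a relative reference against an absolute base
      # and leaves already-absolute URLs unchanged
      base |> URI.merge(href) |> URI.to_string()
    end)
  end
end

["/blog/a.html", "https://example.com/x"]
|> AbsoluteUrls.build("https://www.erlang-solutions.com")
|> IO.inspect()
```

Note that `URI.merge/2` raises if the base is not an absolute URL, so `base_url/0` should always return a full URL including the scheme.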

Configure Crawly
- By default, Crawly does not require any configuration, but you will usually want one to fine-tune your crawls:

   # in config.exs
   config :crawly,
     closespider_timeout: 10,
     concurrent_requests_per_domain: 8,
     middlewares: [
       Crawly.Middlewares.DomainFilter,
       Crawly.Middlewares.UniqueRequest,
       {Crawly.Middlewares.UserAgent, user_agents: ["Crawly Bot"]}
     ],
     pipelines: [
       {Crawly.Pipelines.Validate, fields: [:url, :title]},
       {Crawly.Pipelines.DuplicatesFilter, item_id: :title},
       Crawly.Pipelines.JSONEncoder,
       {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"}
     ]

Start the crawl:
- $ iex -S mix
- iex(1)> Crawly.Engine.start_spider(EslSpider)
Results can be seen with: $ cat /tmp/EslSpider.jl
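
To get a feel for what the `pipelines` list in the configuration does to each scraped item, here is a simplified, self-contained sketch. It is not Crawly's code: the module `PipelineSketch` and its functions are hypothetical stand-ins for the Validate and DuplicatesFilter steps, and it assumes the convention that a step returns `{item, state}` to keep an item or `{false, state}` to drop it. Treating an empty string as a missing field is also an assumption of this sketch.

```elixir
defmodule PipelineSketch do
  # Hypothetical stand-in for a validation step:
  # keep the item only if every required field is present and non-empty.
  def validate(item, state, fields) do
    if Enum.all?(fields, &(Map.get(item, &1) not in [nil, ""])) do
      {item, state}
    else
      {false, state}
    end
  end

  # Hypothetical stand-in for a duplicates filter:
  # remember the ids seen so far in the shared state and drop repeats.
  def dedupe(item, state, key) do
    seen = Map.get(state, :seen, MapSet.new())
    id = Map.get(item, key)

    if MapSet.member?(seen, id) do
      {false, state}
    else
      {item, Map.put(state, :seen, MapSet.put(seen, id))}
    end
  end

  # Run the steps in order and stop at the first one that drops the item.
  def run(item, state) do
    steps = [
      &validate(&1, &2, [:url, :title]),
      &dedupe(&1, &2, :title)
    ]

    Enum.reduce_while(steps, {item, state}, fn step, {item, state} ->
      case step.(item, state) do
        {false, new_state} -> {:halt, {false, new_state}}
        {new_item, new_state} -> {:cont, {new_item, new_state}}
      end
    end)
  end
end
```

Passing the state returned by one `PipelineSketch.run/2` call into the next is what lets the duplicates step recognise an item whose `:title` has already been seen.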

Browser rendering

Crawly can be configured so that all fetched pages are rendered in a browser, which is useful when you need to extract data from pages that have many asynchronous elements (for example, parts loaded by AJAX).

You can read more here:

Browser Rendering
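
As a hedged sketch of what such a setup might look like, the fragment below switches the fetcher to Splash (the renderer used by the JavaScript example project listed further down). The `fetcher` option and the `Crawly.Fetchers.Splash` module are taken from memory of the Crawly documentation, so check them against the docs for your version; the URL assumes a Splash instance running locally on port 8050.

```elixir
# in config.exs
# Assumption: a Splash instance is reachable at localhost:8050
config :crawly,
  fetcher: {Crawly.Fetchers.Splash, [base_url: "http://localhost:8050/render.html"]}
```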

Documentation

Roadmap

[x] Pluggable HTTP client
[x] Retries support
[x] Cookies support
[x] XPath support (can actually be done with Meeseeks)
[ ] Project generators (spiders)
[ ] UI for jobs management

Articles

Blog post on Erlang Solutions website: https://www.erlang-solutions.com/blog/web-scraping-with-elixir.html
Blog post about using Crawly inside a machine learning project with Tensorflow (Tensorflex): https://www.erlang-solutions.com/blog/how-to-build-a-machine-learning-project-in-elixir.html

Example projects

Blog crawler: https://github.com/oltarasenko/crawly-spider-example
E-commerce websites: https://github.com/oltarasenko/products-advisor
Car shops: https://github.com/oltarasenko/crawly-cars
JavaScript based website (Splash example): https://github.com/oltarasenko/autosites

Contributors

We would gladly accept your contributions!

Documentation

Please find the documentation on HexDocs.
