Crawly.RequestsStorage (Crawly v0.17.2)

Request storage: a module responsible for storing the requests (URLs) that are scheduled for crawling.

               ┌──────────────────┐
               │                  │             ┌------------------┐
               │ RequestsStorage  <─────────────┤ From crawlers1,2 │
               │                  │             └------------------┘
               └─────────┬────────┘
                         │
                         │
                         │
                         │
            ┌────────────▼─────────────────┐
            │                              │
            │                              │
            │                              │
┌───────────▼──────────┐       ┌───────────▼──────────┐
│RequestsStorageWorker1│       │RequestsStorageWorker2│
│      (Crawler1)      │       │      (Crawler2)      │
└──────────────────────┘       └──────────────────────┘

All requests go through a single RequestsStorage process, which looks up the worker belonging to the given spider and hands the request over to it. The worker then stores the request.
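The routing pattern described above can be illustrated with a minimal, self-contained sketch. It is not Crawly's implementation: the module names `SketchStorage` and `SketchStorage.Worker` are hypothetical, the workers are not supervised, and the order in which a worker returns stored requests is an arbitrary choice made for brevity. Only the return shapes are borrowed from the specs on this page.

```elixir
# Hypothetical per-spider worker: keeps its requests in a plain list.
defmodule SketchStorage.Worker do
  use GenServer

  def start_link(spider_name), do: GenServer.start_link(__MODULE__, spider_name)

  @impl true
  def init(_spider_name), do: {:ok, []}

  @impl true
  def handle_call({:store, request}, _from, requests),
    do: {:reply, :ok, [request | requests]}

  def handle_call(:pop, _from, []), do: {:reply, nil, []}
  def handle_call(:pop, _from, [request | rest]), do: {:reply, request, rest}
end

# Hypothetical router: one named process that maps spider names to worker pids.
defmodule SketchStorage do
  use GenServer

  def start_link(_opts \\ []),
    do: GenServer.start_link(__MODULE__, %{}, name: __MODULE__)

  def start_worker(spider), do: GenServer.call(__MODULE__, {:start_worker, spider})
  def store(spider, request), do: GenServer.call(__MODULE__, {:store, spider, request})
  def pop(spider), do: GenServer.call(__MODULE__, {:pop, spider})

  @impl true
  def init(workers), do: {:ok, workers}

  @impl true
  def handle_call({:start_worker, spider}, _from, workers) do
    if Map.has_key?(workers, spider) do
      {:reply, {:error, :already_started}, workers}
    else
      {:ok, pid} = SketchStorage.Worker.start_link(spider)
      {:reply, {:ok, pid}, Map.put(workers, spider, pid)}
    end
  end

  def handle_call({:store, spider, request}, _from, workers) do
    {:reply, with_worker(workers, spider, &GenServer.call(&1, {:store, request})), workers}
  end

  def handle_call({:pop, spider}, _from, workers) do
    {:reply, with_worker(workers, spider, &GenServer.call(&1, :pop)), workers}
  end

  # Find the worker for a spider, or report that none is running.
  defp with_worker(workers, spider, fun) do
    case Map.fetch(workers, spider) do
      {:ok, pid} -> fun.(pid)
      :error -> {:error, :storage_worker_not_running}
    end
  end
end
```

In this sketch the router holds no requests itself. It only resolves the spider name to a worker pid and forwards the call, which keeps each lookup cheap and keeps the stored requests of different spiders isolated from one another.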

Summary

Functions

child_spec(init_arg)

Returns a specification to start this module under a supervisor.

init(args)

Callback implementation for GenServer.init/1.

pop(spider_name)

Pop a request out of requests storage

requests(spider_name)

start_link(list)

start_worker(spider_name, crawl_id)

Starts a worker for a given spider

stats(spider_name)

Get statistics from the requests storage

store(spider_name, request)

Stores a single request or a list of requests in the corresponding child worker

Functions

child_spec(init_arg)

Returns a specification to start this module under a supervisor.

See Supervisor.

init(args)

Callback implementation for GenServer.init/1.

pop(spider_name)

Specs

pop(Crawly.spider()) ::
  nil | Crawly.Request.t() | {:error, :storage_worker_not_running}

Pop a request out of requests storage

requests(spider_name)

Specs

requests(atom()) ::
  {:requests, [Crawly.Request.t()]} | {:error, :spider_not_running}

start_link(list)

start_worker(spider_name, crawl_id)

Specs

start_worker(Crawly.spider(), crawl_id :: String.t()) ::
  {:ok, pid()} | {:error, :already_started}

Starts a worker for a given spider

stats(spider_name)

Specs

stats(Crawly.spider()) ::
  {:stored_requests, non_neg_integer()} | {:error, :storage_worker_not_running}

Get statistics from the requests storage

store(spider_name, request)

Specs

store(Crawly.spider(), Crawly.Request.t() | [Crawly.Request.t()]) ::
  :ok | {:error, :storage_worker_not_running}

Stores a single request or a list of requests in the corresponding child worker
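As a rough illustration of how a caller might handle the return values listed in the specs above, the sketch below pops requests until the storage is empty or a limit is reached. The module `StorageExample` and its `drain/3` helper are hypothetical and not part of Crawly. The storage module is passed as an argument (defaulting to `Crawly.RequestsStorage`) so the helper can be tried against a stand-in, and the sketch assumes that a `nil` result from `pop/1` means there is nothing left to pop.

```elixir
defmodule StorageExample do
  # Hypothetical helper: pops at most `limit` requests for `spider`.
  # Returns {:ok, requests} in the order they were popped, or the error
  # tuple from the spec if no storage worker is running for the spider.
  def drain(spider, limit, storage \\ Crawly.RequestsStorage) do
    do_drain(spider, limit, storage, [])
  end

  defp do_drain(_spider, 0, _storage, acc), do: {:ok, Enum.reverse(acc)}

  defp do_drain(spider, limit, storage, acc) do
    case storage.pop(spider) do
      # Assumption: nil signals that no request is currently stored.
      nil -> {:ok, Enum.reverse(acc)}
      {:error, :storage_worker_not_running} = error -> error
      request -> do_drain(spider, limit - 1, storage, [request | acc])
    end
  end
end
```

With Crawly running and a spider started, a call such as `StorageExample.drain(MySpider, 10)` would go through `Crawly.RequestsStorage.pop/1`, where `MySpider` stands for whatever spider module the application defines.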