SpiderMan behaviour (spider_man v0.6.3)
SpiderMan, a fast high-level web crawling & scraping framework for Elixir.
Components
Each spider has 3 components, and each component has its own job:
- Downloader: Download request.
- Spider: Analyze web pages.
- ItemProcessor: Store items.
Message flow: Downloader -> Spider -> ItemProcessor.
Spider Life Cycle
Spider.settings()
- Prepare For Start Stage
Spider.prepare_for_start(:pre, state)
Spider.prepare_for_start_component(:downloader, state)
Spider.prepare_for_start_component(:spider, state)
Spider.prepare_for_start_component(:item_processor, state)
Spider.prepare_for_start(:post, state)
Spider.init(state)
Spider.handle_response(response, context)
- Prepare For Stop Stage
Spider.prepare_for_stop_component(:downloader, state)
Spider.prepare_for_stop_component(:spider, state)
Spider.prepare_for_stop_component(:item_processor, state)
Spider.prepare_for_stop(state)
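The life cycle above can be sketched as a callback module. This is a minimal illustration only, not the library's real API surface: use SpiderMan is omitted so the sketch stays self-contained, and the engine state is simplified to a plain map.

```elixir
defmodule MySpider do
  # Minimal sketch of the SpiderMan life-cycle callbacks (hypothetical module).
  # A real spider would `use SpiderMan` and receive a SpiderMan.Engine.state().

  # 1. Called first to provide the spider's settings.
  def settings do
    [print_stats: true, downloader_options: [], spider_options: [], item_processor_options: []]
  end

  # 2. Prepare-for-start stage: :pre runs before the components start, :post after.
  def prepare_for_start(_stage, state), do: state

  # 3. Called once the engine is up.
  def init(state), do: state

  # 4. Called for every downloaded response; both returned keys are optional.
  def handle_response(_response, _context) do
    %{requests: [], items: []}
  end

  # 5. Called when the spider is stopping.
  def prepare_for_stop(_state), do: :ok
end
```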
Summary
Functions
continue a spider
fetch the spider's statistics for all ETS tables
fetch the spider's state
insert a request into the spider
insert multiple requests into the spider
list spiders that have already started
retry failed events for a spider
start a spider
fetch the spider's statistics
fetch a component's statistics
fetch the spider's status
stop a spider
suspend a spider
Types
@type component() :: :downloader | :spider | :item_processor
@type ets_stats() :: [size: pos_integer(), memory: pos_integer()] | nil
@type prepare_for_start_stage() :: :pre | :post
@type request() :: SpiderMan.Request.t()
@type requests() :: [request()]
@type settings() :: keyword()
- :print_stats (boolean/0) - Print the spider's stats. The default value is true.
- :log2file - Save the log to files. The default value is true.
- :status - Set the startup status for the spider. The default value is :running.
- :spider (atom/0) - Set the callback module for the spider.
- :spider_module (atom/0) - Set the callback module for the spider.
- :callbacks (keyword/0)
- :ets_file (String.t/0) - Set the filename for the spider, and load the spider's state from ETS files.
- :downloader_options (keyword/0) - See Downloader Options.
- :spider_options (keyword/0) - See Spider Options.
- :item_processor_options (keyword/0) - See ItemProcessor Options.
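Assembled from the options above, a full settings keyword list might look like the following. All values are illustrative and MySpider is a hypothetical callback module:

```elixir
settings = [
  print_stats: true,                 # default: true
  log2file: true,                    # default: true
  status: :running,                  # default: :running
  spider: MySpider,                  # hypothetical callback module
  downloader_options: [],            # see Downloader Options
  spider_options: [],                # see Spider Options
  item_processor_options: []         # see ItemProcessor Options
]
```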
Downloader options
- :requester - The default value is {SpiderMan.Requester.Finch, []}.
- :producer - The default value is SpiderMan.Producer.ETS.
- :context (term/0) - The default value is %{}.
- :processor (keyword/0) - See Processors Options. The default value is [max_demand: 1].
  - :concurrency (pos_integer/0) - The default value is 16.
  - :min_demand (non_neg_integer/0)
  - :max_demand (non_neg_integer/0) - The default value is 10.
  - :partition_by (function of arity 1)
  - :spawn_opt (keyword/0)
  - :hibernate_after (pos_integer/0)
- :rate_limiting - See Producers Options - rate_limiting. The default value is [allowed_messages: 10, interval: 1000].
- :pipelines - Each message is handled by each of these pipelines. The default value is [SpiderMan.Pipeline.DuplicateFilter].
- :post_pipelines - Each message is handled by each of these pipelines. The default value is [SpiderMan.Pipeline.DuplicateFilter].
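For example, a downloader_options list overriding a few of the defaults above. The nesting of processor sub-options is an assumption inferred from "See Processors Options"; all values are illustrative:

```elixir
downloader_options = [
  requester: {SpiderMan.Requester.Finch, []},             # default requester
  processor: [max_demand: 1, concurrency: 16],            # Broadway-style processor options
  rate_limiting: [allowed_messages: 10, interval: 1000],  # default rate limit
  pipelines: [SpiderMan.Pipeline.DuplicateFilter]         # default pipeline
]
```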
Spider options
- :producer - The default value is SpiderMan.Producer.ETS.
- :context (term/0) - The default value is %{}.
- :processor (keyword/0) - See Processors Options. The default value is [max_demand: 1].
  - :concurrency (pos_integer/0) - The default value is 16.
  - :min_demand (non_neg_integer/0)
  - :max_demand (non_neg_integer/0) - The default value is 10.
  - :partition_by (function of arity 1)
  - :spawn_opt (keyword/0)
  - :hibernate_after (pos_integer/0)
- :rate_limiting - See Producers Options - rate_limiting.
- :pipelines - Each message is handled by each of these pipelines. The default value is [].
- :post_pipelines - Each message is handled by each of these pipelines. The default value is [].
Batchers options
- :concurrency (pos_integer/0) - The default value is 1.
- :batch_size (pos_integer/0) - The default value is 100.
- :batch_timeout (pos_integer/0) - The default value is 1000.
- :partition_by (function of arity 1)
- :spawn_opt (keyword/0)
- :hibernate_after (pos_integer/0)
ItemProcessor options
- :storage - Set a storage module to store items. The default value is SpiderMan.Storage.JsonLines.
- :batchers (keyword/0) - See Batchers Options. The default value is [default: [concurrency: 1, batch_size: 50, batch_timeout: 1000]].
- :producer - The default value is SpiderMan.Producer.ETS.
- :context (term/0) - The default value is %{}.
- :processor (keyword/0) - See Processors Options. The default value is [].
  - :concurrency (pos_integer/0) - The default value is 16.
  - :min_demand (non_neg_integer/0)
  - :max_demand (non_neg_integer/0) - The default value is 10.
  - :partition_by (function of arity 1)
  - :spawn_opt (keyword/0)
  - :hibernate_after (pos_integer/0)
- :rate_limiting - See Producers Options - rate_limiting.
- :pipelines - Each message is handled by each of these pipelines. The default value is [SpiderMan.Pipeline.DuplicateFilter].
- :post_pipelines - Each message is handled by each of these pipelines. The default value is [SpiderMan.Pipeline.DuplicateFilter].
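As an illustration, an item_processor_options list that keeps the default storage but tunes the default batcher (values illustrative):

```elixir
item_processor_options = [
  storage: SpiderMan.Storage.JsonLines,   # default storage module
  batchers: [
    # one batcher named :default; see Batchers Options for each key
    default: [concurrency: 1, batch_size: 50, batch_timeout: 1000]
  ]
]
```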
@type status() :: :running | :suspended
Callbacks
@callback handle_response(SpiderMan.Response.t(), context :: map()) :: %{ optional(:requests) => [SpiderMan.Request.t()], optional(:items) => [SpiderMan.Item.t()] }
@callback init(state) :: state when state: SpiderMan.Engine.state()
@callback prepare_for_start(prepare_for_start_stage(), state) :: state when state: SpiderMan.Engine.state()
@callback prepare_for_stop(SpiderMan.Engine.state()) :: :ok
@callback settings() :: settings()
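As an illustration of the handle_response/2 contract above, here is a simplified implementation that extracts page titles. The response is stood in for by a plain map with a :body key (the real callback receives a SpiderMan.Response.t()), and returning bare strings as items is a simplification:

```elixir
defmodule TitleSpider do
  # Simplified handle_response/2: scrape <title> text and return it as items.
  def handle_response(%{body: body}, _context) do
    titles =
      ~r|<title>(.*?)</title>|
      |> Regex.scan(body, capture: :all_but_first)
      |> List.flatten()

    # Both :requests and :items are optional keys in the returned map.
    %{items: titles, requests: []}
  end
end
```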
Functions
@spec components() :: [component()]
continue a spider
@spec ets_stats(spider()) :: [ common_pipeline_tid: ets_stats(), downloader_tid: ets_stats(), failed_tid: ets_stats(), spider_tid: ets_stats(), item_processor_tid: ets_stats() ]
fetch the spider's statistics for all ETS tables
@spec get_state(spider()) :: SpiderMan.Engine.state()
fetch the spider's state
insert a request into the spider
insert multiple requests into the spider
@spec list_spiders() :: [spider()]
list spiders that have already started
retry failed events for a spider
@spec start(spider(), settings()) :: Supervisor.on_start_child()
start a spider
@spec stats(spider()) :: [ status: status(), common_pipeline_tid: ets_stats(), downloader_tid: ets_stats(), failed_tid: ets_stats(), spider_tid: ets_stats(), item_processor_tid: ets_stats(), throughputs: map() ]
fetch the spider's statistics
fetch a component's statistics
fetch the spider's status
@spec stop(spider()) :: :ok | {:error, error} when error: :not_found | :running | :restarting
stop a spider
suspend a spider
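Taken together, the functions above suggest a session like the following. This is a hedged sketch that needs the spider_man application running: start/2, stats/1, list_spiders/0, and stop/1 are specced above, and MySpider is a hypothetical spider module.

```elixir
# Hypothetical session against a running spider_man application.
SpiderMan.start(MySpider, print_stats: false)  # start a spider with settings
SpiderMan.list_spiders()                       # spiders that have already started
SpiderMan.stats(MySpider)                      # status, ETS table stats, throughputs
SpiderMan.stop(MySpider)                       # :ok | {:error, reason}
```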