cets_discovery behaviour (cets v0.3.0)
Node discovery logic.
Joins tables together when a new node appears.
Things that make discovery logic harder:
- The table list is dynamic (but eventually we add all the tables into it).
- Creating an Erlang distribution connection is async, but net_kernel:ping/1 is blocking.
- net_kernel:ping/1 can block for an unknown number of seconds (the net_kernel default timeout is 7 seconds).
- Resolving a node name can take a long time (5 seconds in tests), and it blocks unpredictably.
- Tables should be joined one by one to avoid OOM.
- Backend:get_nodes/1 could take a long time.
- cets_discovery:get_tables/1, cets_discovery:add_table/2 should be fast.
- The most important net_kernel flags for us to consider are:
* dist_auto_connect=never
* connect_all
* prevent_overlapping_partitions
These flags change the way the discovery logic behaves. Also, the module does not try to connect to hidden nodes.
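Two of the points above can be illustrated with a minimal sketch; it is not taken from the cets source. The module name discovery_env_sketch and both function names are hypothetical. It assumes the three flags are readable as kernel application parameters, and it uses OTP's net_adm:ping/1 as the blocking ping call. Running the ping in a helper process is one way to keep a server responsive while the ping blocks.

```erlang
-module(discovery_env_sketch).
-export([dist_flags/0, ping_async/1]).

%% Reads the three kernel parameters listed above.
%% A parameter that is not set explicitly is reported as undefined.
dist_flags() ->
    Flags = [dist_auto_connect, connect_all, prevent_overlapping_partitions],
    maps:from_list([{F, application:get_env(kernel, F, undefined)} || F <- Flags]).

%% Runs the blocking ping in a helper process so the caller is not blocked.
%% The caller later receives {ping_result, Ref, Node, pong | pang}.
ping_async(Node) ->
    Caller = self(),
    Ref = make_ref(),
    spawn(fun() -> Caller ! {ping_result, Ref, Node, net_adm:ping(Node)} end),
    Ref.
```

The caller matches on the returned reference in a receive clause (or in handle_info/2 of a gen_server) and can keep serving other requests in the meantime.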
Retry logic considerations:
- Backend:get_nodes/1 could return an error during startup, so we have to retry fast.
- There are two periods of operation for this module:
* startup phase, usually first 5 minutes.
* regular operation phase, after the startup phase.
- We don't need to check for the updated get_nodes too often in the regular operation phase.
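The retry considerations above could be sketched as follows. This is an illustration only: the module name retry_sketch, the function names and every interval value are assumptions, not values taken from cets. Only the phase names, the retry type names and the 5-minute startup phase come from this page.

```erlang
-module(retry_sketch).
-export([phase/2, choose_retry_type/2, retry_timeout/1]).

%% Startup phase length: "usually first 5 minutes" per the notes above.
-define(STARTUP_PHASE_MS, (5 * 60 * 1000)).

%% Which period of operation we are in, given start and current time in ms.
phase(StartTimeMs, NowMs) when NowMs - StartTimeMs < ?STARTUP_PHASE_MS -> initial;
phase(_StartTimeMs, _NowMs) -> regular.

%% A failed get_nodes call is retried fast, regardless of the phase.
choose_retry_type(_Phase, {error, _Reason}) -> after_error;
choose_retry_type(initial, _LastResult) -> initial;
choose_retry_type(regular, _LastResult) -> regular.

%% Hypothetical intervals in milliseconds, chosen only for illustration.
retry_timeout(after_error) -> 1000;
retry_timeout(initial) -> 5000;
retry_timeout(after_nodedown) -> 30000;
retry_timeout(regular) -> 5 * 60 * 1000.
```

The point of the split is that errors and the startup phase get short intervals, while the regular phase polls the backend rarely.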
Types
-type backend_state() :: term().
Backend state.
gen_server's caller.
Result of get_nodes/1 call.
-type join_result() :: #{
    node := node(),
    table := atom(),
    what := join_result | pid_not_found,
    result => ok | {error, _},
    reason => term()
}.
Join result information.
-type milliseconds() :: integer().
Number of milliseconds.
-type opts() :: #{name := atom(), _ := _}.
Backend could define its own options.
-type retry_type() :: initial | after_error | regular | after_nodedown.
Retry logic type.
Discovery server process.
Result of start_link/1.
-type state() :: #{
    phase := initial | regular,
    results := [join_result()],
    nodes := ordsets:ordset(node()),
    unavailable_nodes := ordsets:ordset(node()),
    tables := [atom()],
    backend_module := module(),
    backend_state := state(),
    get_nodes_status := not_running | running,
    should_retry_get_nodes := boolean(),
    last_get_nodes_result := not_called_yet | get_nodes_result(),
    last_get_nodes_retry_type := retry_type(),
    join_status := not_running | running,
    should_retry_join := boolean(),
    timer_ref := reference() | undefined,
    pending_wait_for_ready := [gen_server:from()],
    pending_wait_for_get_nodes := [gen_server:from()],
    nodeup_timestamps := #{node() => milliseconds()},
    nodedown_timestamps := #{node() => milliseconds()},
    node_start_timestamps := #{node() => milliseconds()},
    start_time := milliseconds()
}.
-type system_info() :: map().
Discovery status.
Callbacks
-callback get_nodes(backend_state()) -> {get_nodes_result(), backend_state()}.
-callback init(map()) -> backend_state().
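A minimal backend implementing the two callbacks might look like the sketch below. The module name static_backend_sketch and the nodes option key are hypothetical. The definition of get_nodes_result() is not shown on this page, so the {ok, [node()]} shape returned here is an assumption.

```erlang
-module(static_backend_sketch).
-behaviour(cets_discovery).
-export([init/1, get_nodes/1]).

%% Builds the backend state from the options map.
%% The nodes key is a hypothetical option used only by this sketch.
init(Opts) ->
    #{nodes => maps:get(nodes, Opts, [])}.

%% Returns a fixed node list together with the unchanged state.
%% The {ok, Nodes} result shape is an assumption about get_nodes_result().
get_nodes(State = #{nodes := Nodes}) ->
    {{ok, Nodes}, State}.
```

A real backend would typically query an external source here (a file, DNS or a database), which is why get_nodes/1 may be slow or fail during startup.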
Functions
-spec add_table(server(), cets:table_name()) -> ok.
Adds a table to be tracked and joined.
-spec delete_table(server(), cets:table_name()) -> ok.
Deletes a table from being tracked or joined.
-spec get_tables(server()) -> {ok, [cets:table_name()]}.
Gets a list of the tracked tables.
Gets information for each tracked table.
-spec start(opts()) -> start_result().
Starts a discovery process.
-spec start_link(opts()) -> start_result().
Starts a discovery process with a link.
-spec system_info(server()) -> system_info().
Gets discovery process status.
Waits for the current get_nodes call to return.
Returns immediately if no get_nodes call is running. Waits for another get_nodes call if the should_retry_get_nodes flag is set. Unlike wait_for_ready, it does not wait for unavailable nodes to return pang.
Blocks until the initial discovery is done.
This call also waits until the data is loaded from the remote nodes.