View Source cets_discovery behaviour (cets v0.2.0)
Node discovery logic.
Joins table together when a new node appears.
Things that make discovery logic harder:
- A table list is dynamic (but eventually we add all the tables into it).
- Creating Erlang distribution connection is async, but it net_kernel:ping/1
is blocking.
- net_kernel:ping/1
could block for unknown number of seconds (but net_kernel
default timeout is 7 seconds).
- Resolving nodename could take a lot of time (5 seconds in tests). It is unpredictable blocking.
- join tables should be one by one to avoid OOM.
- Backend:get_nodes/1
could take a long time.
- cets_discovery:get_tables/1
, cets_discovery:add_table/2
should be fast.
- The most important net_kernel flags for us to consider are:
* dist_auto_connect=never
* connect_all
* prevent_overlapping_partitions
These flags change the way the discovery logic behaves. Also the module would not try to connect to the hidden nodes.
Retry logic considerations:
- Backend:get_nodes/1 could return an error during startup, so we have to retry fast.
- There are two periods of operation for this module:
* startup phase, usually first 5 minutes.
* regular operation phase, after the startup phase.
- We don't need to check for the updated get_nodes too often in the regular operation phase.Link to this section Summary
Types
get_nodes/2
call.start_link/1
.Functions
Waits for the current get_nodes call to return.
Blocks until the initial discovery is done.
Link to this section Types
-type backend_state() :: term().
-type from() :: {pid(), reference()}.
-type get_nodes_result() :: {ok, [node()]} | {error, term()}.
get_nodes/2
call.
-type join_result() ::
#{node := node(),
table := atom(),
what := join_result | pid_not_found,
result => ok | {error, _},
reason => term()}.
-type milliseconds() :: integer().
-type opts() :: #{name := atom(), _ := _}.
-type retry_type() :: initial | after_error | regular | after_nodedown.
-type server() :: pid() | atom().
-type start_result() :: {ok, pid()} | {error, term()}.
start_link/1
.
-type state() :: #{phase := initial | regular, results := [join_result()], nodes := ordsets:ordset(node()), unavailable_nodes := ordsets:ordset(node()), tables := [atom()], backend_module := module(), backend_state := state(), get_nodes_status := not_running | running, should_retry_get_nodes := boolean(), last_get_nodes_result := not_called_yet | get_nodes_result(), last_get_nodes_retry_type := retry_type(), join_status := not_running | running, should_retry_join := boolean(), timer_ref := reference() | undefined, pending_wait_for_ready := [gen_server:from()], pending_wait_for_get_nodes := [gen_server:from()], nodeup_timestamps := #{node() => milliseconds()}, nodedown_timestamps := #{node() => milliseconds()}, node_start_timestamps := #{node() => milliseconds()}, start_time := milliseconds()}.
-type system_info() :: map().
Link to this section Callbacks
-callback get_nodes(backend_state()) -> {get_nodes_result(), backend_state()}.
-callback init(map()) -> backend_state().
Link to this section Functions
-spec add_table(server(), cets:table_name()) -> ok.
-spec delete_table(server(), cets:table_name()) -> ok.
-spec get_tables(server()) -> {ok, [cets:table_name()]}.
-spec start(opts()) -> start_result().
-spec start_link(opts()) -> start_result().
-spec system_info(server()) -> system_info().
-spec wait_for_get_nodes(server(), timeout()) -> ok.
Waits for the current get_nodes call to return.
Just returns if there is no gen_nodes call running. Waits for another get_nodes, if should_retry_get_nodes flag is set. It is different from wait_for_ready, because it does not wait for unavailable nodes to return pang.-spec wait_for_ready(server(), timeout()) -> ok.
Blocks until the initial discovery is done.
This call would also wait till the data is loaded from the remote nodes.