Lockstep.Cluster (Lockstep v0.1.0)

Multi-node simulation under Lockstep's controller.

A "node" here is a logical grouping of managed processes inside a single BEAM. Cross-node Lockstep.send/2 is just a controller-routed send that respects the active partition state.

This is a simulated cluster — there's no real BEAM distribution (no :net_kernel, no TCP, no epmd). The trade-off vs running real multi-node Erlang:

  • Faster — thousands of iterations per minute vs. minutes per iteration with real clustering
  • Reproducible — same seed produces byte-identical traces including network events
  • Catches user-code distributed bugs — CRDT non-convergence, cross-node monitor races, registry split-brain, leader election thrashing
  • Doesn't catch BEAM-distribution-layer bugs: :net_kernel and :gen_tcp issues are out of scope

Quick example

Lockstep.Runner.run(fn ->
  [n1, n2, n3] = Lockstep.Cluster.start_nodes([:n1, :n2, :n3])

  # Spawn a process on each node.
  for node <- [n1, n2, n3] do
    Lockstep.Cluster.spawn(node, fn ->
      # ... do per-node work ...
    end)
  end

  # Partition n1 from the rest.
  Lockstep.Cluster.partition([n1], [n2, n3])

  # ... do work that should split-brain ...

  Lockstep.Cluster.heal()

  # ... verify convergence ...
end)

Partition modes

  • :drop (default) — cross-partition messages are silently lost. Matches real BEAM-distribution behavior when TCP closes during a partition.
  • :defer — messages are queued and delivered when the partition heals, in original send order. Useful for stress-testing convergence-on-eventual-delivery code paths; see the sketch below.
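A minimal :defer run might look like this (the mode: option key is an assumption; partition/3's spec only shows a keyword() opts argument):

Lockstep.Runner.run(fn ->
  [n1, n2] = Lockstep.Cluster.start_nodes([:n1, :n2])

  # Assumed option key: partition/3 only documents keyword() opts.
  Lockstep.Cluster.partition([n1], [n2], mode: :defer)

  # Cross-partition sends made here are queued rather than lost...

  Lockstep.Cluster.heal()

  # ...and are delivered, in original send order, before heal/0 returns.
end)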

Summary

Functions

heal()

Heal all active partitions. With :defer mode partitions, all queued messages are delivered to their destinations in original send order before the heal completes.

list()

List all registered cluster node atoms (always includes :nonode@nohost).

list_up()

List currently-up nodes (excludes the default :nonode@nohost).

monitor_node(node)

Monitor node. When the node stops, the calling process receives {:nodedown, node}. Mirrors Node.monitor/2.

node(pid \\ nil)

The node pid belongs to. Defaults to self().

partition(group_a, group_b, opts \\ [])

Add a partition between two groups of nodes. While active, cross-group sends are dropped (:drop, default) or queued for later delivery (:defer).

run(node, fun)

Run fun on node and return its result. Synchronously spawns the function on the target node and waits for it to send back its return value.

set_ticktime(ms)

Set the virtual net_ticktime (default 15000ms). It determines how long after a partition starts that cross-partition monitors fire {:DOWN, ref, :process, pid, :noconnection}.

spawn(node, fun)

Spawn a process on node. Returns its pid. Like Lockstep.spawn/1 but tags the new process as belonging to node.

spawn_link(node, fun)

Spawn-link a process on node. Same as Lockstep.spawn_link/1 with explicit node tagging.

start_node(node)

Bring a previously-stopped node back online with fresh state. Pids that lived on the previous incarnation are gone permanently; new spawns on this node start from scratch.

start_nodes(names)

Register a list of notional cluster nodes. Returns the same list for convenience. Idempotent — re-registering an existing node is a no-op. The default node :nonode@nohost is always present even if not explicitly registered.

stop_node(node)

Stop a node. All its processes are killed with Process.exit(pid, :kill), monitors of pids on that node fire :DOWN with reason :nodedown, and any monitor_node watchers receive {:nodedown, node}.

Functions

heal()

@spec heal() :: :ok

Heal all active partitions. With :defer mode partitions, all queued messages are delivered to their destinations in original send order before the heal completes.

list()

@spec list() :: [atom()]

List all registered cluster node atoms (always includes :nonode@nohost).

list_up()

@spec list_up() :: [atom()]

List currently-up nodes (excludes the default :nonode@nohost).
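For example (list order is illustrative):

Lockstep.Cluster.start_nodes([:n1, :n2])

Lockstep.Cluster.list()     #=> [:nonode@nohost, :n1, :n2]
Lockstep.Cluster.list_up()  #=> [:n1, :n2]

Lockstep.Cluster.stop_node(:n2)
Lockstep.Cluster.list_up()  #=> [:n1]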

monitor_node(node)

@spec monitor_node(atom()) :: :ok

Monitor node. When the node stops, the calling process receives {:nodedown, node}. Mirrors Node.monitor/2.

Repeated calls register the caller multiple times; matching BEAM semantics, each duplicate registration produces its own {:nodedown, node} message.
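A minimal sketch:

Lockstep.Cluster.start_nodes([:n1])
Lockstep.Cluster.monitor_node(:n1)

Lockstep.Cluster.stop_node(:n1)

receive do
  {:nodedown, :n1} -> :observed
end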

node(pid \\ nil)

@spec node(pid()) :: atom()

The node pid belongs to. Defaults to self().
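For example:

pid = Lockstep.Cluster.spawn(:n1, fn -> :ok end)

Lockstep.Cluster.node(pid)  #=> :n1
Lockstep.Cluster.node()     #=> the calling process's node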

partition(group_a, group_b, opts \\ [])

@spec partition([atom()], [atom()], keyword()) :: :ok

Add a partition between two groups of nodes. While active, cross-group sends are dropped (:drop, default) or queued for later delivery (:defer).

Lockstep.Cluster.partition([:n1], [:n2, :n3])
# ... work happens with n1 isolated ...
Lockstep.Cluster.heal()

run(node, fun)

@spec run(atom(), (-> any())) :: any()

Run fun on node and return its result. Synchronously spawns the function on the target node and waits for it to send back its return value.

Useful for setup or assertions:

result = Lockstep.Cluster.run(:n1, fn ->
  Phoenix.Tracker.list(MyTracker, "topic")
end)

assert length(result) == 3

set_ticktime(ms)

@spec set_ticktime(non_neg_integer()) :: :ok

Set the virtual net_ticktime (default 15000ms). It determines how long after a partition starts that cross-partition monitors fire {:DOWN, ref, :process, pid, :noconnection}.

This is virtual time: the wait advances Lockstep's clock when no process is otherwise ready, rather than consuming wall-clock seconds.
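A sketch of how ticktime interacts with monitors (it assumes Process.monitor/1 works on Lockstep-managed pids and that partition/3 may be called from inside run/2):

Lockstep.Runner.run(fn ->
  [n1, n2] = Lockstep.Cluster.start_nodes([:n1, :n2])
  Lockstep.Cluster.set_ticktime(5_000)

  # A process on n2 that blocks forever.
  pid =
    Lockstep.Cluster.spawn(n2, fn ->
      receive do
        :never -> :ok
      end
    end)

  # From n1: watch the remote pid, cut the link, wait for the monitor.
  Lockstep.Cluster.run(n1, fn ->
    ref = Process.monitor(pid)
    Lockstep.Cluster.partition([n1], [n2])

    # Fires after 5_000ms of virtual time, not wall-clock time.
    receive do
      {:DOWN, ^ref, :process, ^pid, :noconnection} -> :detected
    end
  end)
end)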

spawn(node, fun)

@spec spawn(atom(), (-> any())) :: pid()

Spawn a process on node. Returns its pid. Like Lockstep.spawn/1 but tags the new process as belonging to node.

spawn_link(node, fun)

@spec spawn_link(atom(), (-> any())) :: pid()

Spawn-link a process on node. Same as Lockstep.spawn_link/1 with explicit node tagging.

start_node(node)

@spec start_node(atom()) :: :ok

Bring a previously-stopped node back online with fresh state. Pids that lived on the previous incarnation are gone permanently; new spawns on this node start from scratch.
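A stop/restart cycle might look like this (sketch):

Lockstep.Cluster.stop_node(:n1)
# Pids from the old incarnation are gone for good.

Lockstep.Cluster.start_node(:n1)

# The new incarnation starts empty; spawn fresh processes on it.
fresh_pid = Lockstep.Cluster.spawn(:n1, fn -> :ok end)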

start_nodes(names)

@spec start_nodes([atom()]) :: [atom()]

Register a list of notional cluster nodes. Returns the same list for convenience. Idempotent — re-registering an existing node is a no-op. The default node :nonode@nohost is always present even if not explicitly registered.

stop_node(node)

@spec stop_node(atom()) :: :ok

Stop a node. All its processes are killed with Process.exit(pid, :kill), monitors of pids on that node fire :DOWN with reason :nodedown, and any monitor_node watchers receive {:nodedown, node}.