Multi-node simulation under Lockstep's controller.
A "node" here is a logical grouping of managed processes inside a
single BEAM. Cross-node Lockstep.send/2 is just a controller-routed
send that respects the active partition state.
This is a simulated cluster — there's no real BEAM distribution
(no :net_kernel, no TCP, no epmd). The trade-offs vs. running real
multi-node Erlang:
- Faster — thousands of iterations per minute vs. minutes per iteration with real clustering
- Reproducible — same seed produces byte-identical traces including network events
- Catches user-code distributed bugs — CRDT non-convergence, cross-node monitor races, registry split-brain, leader election thrashing
- Doesn't catch BEAM-distribution-layer bugs — :net_kernel and :gen_tcp issues are out of scope
Quick example
```elixir
Lockstep.Runner.run(fn ->
  [n1, n2, n3] = Lockstep.Cluster.start_nodes([:n1, :n2, :n3])

  # Spawn a process on each node.
  for node <- [n1, n2, n3] do
    Lockstep.Cluster.spawn(node, fn ->
      # ... do per-node work ...
    end)
  end

  # Partition n1 from the rest.
  Lockstep.Cluster.partition([n1], [n2, n3])
  # ... do work that should split-brain ...
  Lockstep.Cluster.heal()
  # ... verify convergence ...
end)
```

Partition modes
- :drop (default) — cross-partition messages are silently lost. Matches real BEAM-distribution behavior when TCP closes during a partition.
- :defer — messages are queued and delivered when the partition heals, in original send order. Useful for stress-testing convergence-on-eventual-delivery code paths.
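A sketch of exercising :defer mode. The page doesn't show how a partition's mode is selected, so the mode: :defer option below is an assumption about partition's signature, not documented API:

```elixir
Lockstep.Runner.run(fn ->
  [n1, n2] = Lockstep.Cluster.start_nodes([:n1, :n2])

  sink = Lockstep.Cluster.spawn(n2, fn -> receive do: (:hello -> :ok) end)

  # `mode: :defer` is an assumed option -- check partition's real signature.
  Lockstep.Cluster.partition([n1], [n2], mode: :defer)

  Lockstep.Cluster.spawn(n1, fn -> Lockstep.send(sink, :hello) end)
  # :hello is queued, not delivered, while the partition is active.

  Lockstep.Cluster.heal()
  # Queued messages are now delivered in original send order.
end)
```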
Summary
Functions
Heal all active partitions. With :defer mode partitions, all
queued messages are delivered to their destinations in original
send order before the heal completes.
List all registered cluster node atoms (always includes :nonode@nohost).
List currently-up nodes (excludes the default :nonode@nohost).
Monitor node. When the node stops, the calling process receives
{:nodedown, node}. Mirrors Node.monitor/2.
The node that pid belongs to. Defaults to self().
Add a partition between two groups of nodes. While active,
cross-group sends are dropped (:drop, default) or queued for
later delivery (:defer).
Run fun on node and return its result. Synchronously spawns
the function on the target node and waits for it to send back its
return value.
Set the virtual net_ticktime (default 15000 ms). Determines how
long after a partition starts before cross-partition monitors fire
{:DOWN, ref, :process, pid, :noconnection}.
Spawn a process on node. Returns its pid. Like Lockstep.spawn/1
but tags the new process as belonging to node.
Spawn-link a process on node. Same as Lockstep.spawn_link/1
with explicit node tagging.
Bring a previously-stopped node back online with fresh state. Pids that lived on the previous incarnation are gone permanently; new spawns on this node start from scratch.
Register a list of notional cluster nodes. Returns the same list
for convenience. Idempotent — re-registering an existing node is a
no-op. The default node :nonode@nohost is always present even if
not explicitly registered.
Stop a node. All its processes are killed (Process.exit/:kill),
monitors of pids on that node fire :DOWN with reason
:nodedown, and any monitor_node watchers receive
{:nodedown, node}.
Functions
@spec heal() :: :ok
Heal all active partitions. With :defer mode partitions, all
queued messages are delivered to their destinations in original
send order before the heal completes.
@spec list() :: [atom()]
List all registered cluster node atoms (always includes :nonode@nohost).
@spec list_up() :: [atom()]
List currently-up nodes (excludes the default :nonode@nohost).
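A small sketch of how the two listing functions differ, based on the descriptions above. Whether a stopped node stays in list/0 is an inference (it should, since start_node/1 can revive it), not a documented guarantee:

```elixir
Lockstep.Cluster.start_nodes([:n1, :n2])

Lockstep.Cluster.list()    # all registered nodes, plus :nonode@nohost
Lockstep.Cluster.list_up() # [:n1, :n2] (no :nonode@nohost)

Lockstep.Cluster.stop_node(:n2)
Lockstep.Cluster.list_up() # [:n1], since stopped nodes are down (inferred)
```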
@spec monitor_node(atom()) :: :ok
Monitor node. When the node stops, the calling process receives
{:nodedown, node}. Mirrors Node.monitor/2.
Repeated calls register the caller multiple times — mirroring BEAM semantics, each registration produces its own {:nodedown, node} message.
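A minimal sketch of the monitor lifecycle inside a Lockstep run:

```elixir
Lockstep.Runner.run(fn ->
  [n1] = Lockstep.Cluster.start_nodes([:n1])

  Lockstep.Cluster.monitor_node(n1)
  Lockstep.Cluster.stop_node(n1)

  # The caller is notified once per monitor_node call.
  receive do
    {:nodedown, :n1} -> :ok
  end
end)
```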
The node pid belongs to. Defaults to self().
Add a partition between two groups of nodes. While active,
cross-group sends are dropped (:drop, default) or queued for
later delivery (:defer).
```elixir
Lockstep.Cluster.partition([:n1], [:n2, :n3])
# ... work happens with n1 isolated ...
Lockstep.Cluster.heal()
```
Run fun on node and return its result. Synchronously spawns
the function on the target node and waits for it to send back its
return value.
Useful for setup or assertions:
```elixir
result =
  Lockstep.Cluster.run(:n1, fn ->
    Phoenix.Tracker.list(MyTracker, "topic")
  end)

assert length(result) == 3
```
@spec set_ticktime(non_neg_integer()) :: :ok
Set the virtual net_ticktime (default 15000 ms). Determines how
long after a partition starts before cross-partition monitors fire
{:DOWN, ref, :process, pid, :noconnection}.
This is virtual time — the wait advances Lockstep's simulated clock when no process is otherwise ready; no wall-clock seconds elapse.
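A sketch of the timing behavior. Whether a plain Process.monitor/1 inside a simulated process is intercepted by Lockstep is an assumption here:

```elixir
Lockstep.Runner.run(fn ->
  [n1, n2] = Lockstep.Cluster.start_nodes([:n1, :n2])
  Lockstep.Cluster.set_ticktime(5_000)

  remote = Lockstep.Cluster.spawn(n2, fn -> receive do: (:never -> :ok) end)

  Lockstep.Cluster.spawn(n1, fn ->
    ref = Process.monitor(remote) # assumed to be simulation-aware
    Lockstep.Cluster.partition([n1], [n2])

    # Delivered once ~5_000ms of virtual time have advanced.
    receive do
      {:DOWN, ^ref, :process, ^remote, :noconnection} -> :ok
    end
  end)
end)
```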
Spawn a process on node. Returns its pid. Like Lockstep.spawn/1
but tags the new process as belonging to node.
Spawn-link a process on node. Same as Lockstep.spawn_link/1
with explicit node tagging.
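A sketch of node tagging. The name node/1 for the "node a pid belongs to" lookup follows the description earlier on this page and is an inference about the exact function name:

```elixir
pid = Lockstep.Cluster.spawn(:n1, fn -> receive do: (:stop -> :ok) end)

Lockstep.Cluster.node(pid)
# => :n1, the tag given at spawn time (assumed name for the lookup)
```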
@spec start_node(atom()) :: :ok
Bring a previously-stopped node back online with fresh state. Pids that lived on the previous incarnation are gone permanently; new spawns on this node start from scratch.
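A sketch of the restart semantics described above:

```elixir
[n1] = Lockstep.Cluster.start_nodes([:n1])
old = Lockstep.Cluster.spawn(n1, fn -> receive do: (:never -> :ok) end)

Lockstep.Cluster.stop_node(n1)  # `old` is killed along with the node
Lockstep.Cluster.start_node(n1) # fresh incarnation, no processes

# `old` stays dead permanently; any state it held must be rebuilt
# by new spawns on the restarted node.
new = Lockstep.Cluster.spawn(n1, fn -> receive do: (:never -> :ok) end)
```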
Register a list of notional cluster nodes. Returns the same list
for convenience. Idempotent — re-registering an existing node is a
no-op. The default node :nonode@nohost is always present even if
not explicitly registered.
@spec stop_node(atom()) :: :ok
Stop a node. All its processes are killed (Process.exit/:kill),
monitors of pids on that node fire :DOWN with reason
:nodedown, and any monitor_node watchers receive
{:nodedown, node}.
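A sketch of the three observable effects of a stop, from the vantage point of a surviving node. As elsewhere, treating Process.monitor/1 as simulation-aware is an assumption:

```elixir
Lockstep.Runner.run(fn ->
  [n1, n2] = Lockstep.Cluster.start_nodes([:n1, :n2])

  victim = Lockstep.Cluster.spawn(n1, fn -> receive do: (:never -> :ok) end)

  Lockstep.Cluster.spawn(n2, fn ->
    ref = Process.monitor(victim) # assumed to be simulation-aware
    Lockstep.Cluster.monitor_node(n1)

    Lockstep.Cluster.stop_node(n1)

    # victim is killed; its monitor fires with reason :nodedown; the
    # node watcher receives {:nodedown, :n1}.
    receive do: ({:DOWN, ^ref, :process, ^victim, :nodedown} -> :ok)
    receive do: ({:nodedown, :n1} -> :ok)
  end)
end)
```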