---
name: lockstep
description: |
  Use this skill when the user wants to find or verify concurrency bugs in
  BEAM code (Elixir, Erlang, Gleam) using Lockstep, the controlled-scheduling
  test framework. Triggers include: "test for races", "verify this is
  concurrency-safe", "find the schedule that causes this flaky test", "why
  does this fail occasionally", or any mention of GenServer/ETS/atomics race
  conditions.
---
# Lockstep — controlled concurrency testing for the BEAM
Lockstep runs an ExUnit test body many times with different
message-passing schedules. When it finds a bug, the schedule is
deterministic and replayable — same seed + same iterations
always produce the same trace.
## When to use Lockstep
| Situation | Use Lockstep? |
|---|---|
| "This test fails 1 in 50 runs in CI" | Yes — Lockstep gives you the schedule |
| "I'm worried about a TOCTTOU race" | Yes — POS strategy is good at these |
| "Two GenServers race over shared state" | Yes |
| "Is this single-pure-function correct?" | No — use property tests |
| "I have a logic bug" | No — use careful code review |
| "Test under network partition" | Yes — Lockstep.Cluster.partition/3 |
Lockstep's strength: schedule-dependent bugs where standard
testing finds them rarely or not at all. Lockstep gives you a
reproducible counterexample with a seed — re-running with the
same seed always reproduces.
## Adding to a project
```elixir
# mix.exs
defp deps do
  [{:lockstep, "~> 0.1.0", only: :test}]
end
```

## Writing your first Lockstep test
The simplest pattern: take an existing ExUnit test, rewrite its body to use
the Lockstep wrappers, and declare it with `ctest` inside a module that does
`use Lockstep.Test`.
```elixir
defmodule MyApp.RaceTest do
  use Lockstep.Test

  defmodule Counter do
    use Lockstep.GenServer  # NOTE: Lockstep.GenServer, not GenServer

    def start_link, do: Lockstep.GenServer.start_link(__MODULE__, 0)
    def value(pid), do: Lockstep.GenServer.call(pid, :value)
    def add(pid, n), do: Lockstep.GenServer.call(pid, {:add, n})
    def set(pid, n), do: Lockstep.GenServer.call(pid, {:set, n})

    def init(state), do: {:ok, state}
    def handle_call(:value, _, n), do: {:reply, n, n}
    def handle_call({:add, n}, _, total), do: {:reply, :ok, total + n}
    def handle_call({:set, n}, _, _total), do: {:reply, :ok, n}
  end

  ctest "two clients adding 1 each end at 2" do
    {:ok, pid} = Counter.start_link()
    parent = self()

    for _ <- 1..2 do
      Lockstep.spawn(fn ->
        # The "buggy" RMW: read the value, then write value + 1.
        # If both read 0 before either writes, both write 1, total = 1.
        v = Counter.value(pid)
        Counter.set(pid, v + 1)
        Lockstep.send(parent, :done)
      end)
    end

    for _ <- 1..2, do: Lockstep.recv_first(fn :done -> true; _ -> false end)

    final = Counter.value(pid)
    if final != 2, do: raise "lost update; counter is #{final}"
  end
end
```

Run with:
```shell
mix test path/to/race_test.exs
```
If Lockstep finds a bug, you'll see:
```
** (Lockstep.BugFound)
Lockstep found a concurrency bug on iteration 4.
seed: 1
strategy: :pct
trace path: traces/<test-name>-iter4-seed1.lockstep

Schedule:
  step 1   hello  P0(root)
  step 2   spawn  P0(root) -> P1
  ...
  step 14  exit   P0(root) reason={...}   <-- FAILED HERE

Replay with:
  mix lockstep.replay --trace traces/<test-name>-iter4-seed1.lockstep
```

## Strategy choice
- `:pct` (default) — Probabilistic Concurrency Testing. Best for coarse-grained interleaving exploration.
- `:pos` — Partial Order Sampling. Best for tight read-modify-write races on shared atomics/ETS.
- `:fair_pct` — PCT then random; protects against starvation in spin loops.
- `:random` — Pure random scheduling. Baseline.
```elixir
ctest "race", strategy: :pos, iterations: 1000 do
  # ...
end
```

## OTP wrappers — drop-in replacements
| OTP module | Lockstep equivalent |
|---|---|
| `GenServer` | `Lockstep.GenServer` |
| `:gen_statem` | `Lockstep.GenStatem` |
| `Agent`, `Task`, `Task.Supervisor` | `Lockstep.{Agent,Task,Task.Supervisor}` |
| `Registry`, `Supervisor` | `Lockstep.{Registry,Supervisor}` |
| `send/2`, `spawn/1`, `Process.send_after/3` | `Lockstep.{send,spawn,send_after}` |
| `:ets.{insert,lookup,update_counter}` | `Lockstep.ETS.*` |
| `:atomics.*`, `:persistent_term.*` | `Lockstep.{Atomics,PersistentTerm}.*` |
The semantics of every wrapper are identical to the underlying OTP function's — Lockstep just inserts a sync point so the strategy can interleave between operations.
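For instance, swapping a raw `:ets` write/read for the wrapped calls looks like this (a sketch based on the table above; arguments and return values are unchanged):

```elixir
table = :ets.new(:cache, [:set, :public])

# Plain OTP: Lockstep's scheduler cannot pause between these two calls.
:ets.insert(table, {:k, 1})
[{:k, 1}] = :ets.lookup(table, :k)

# Wrapped: each call is a sync point the strategy may interleave around.
Lockstep.ETS.insert(table, {:k, 1})
[{:k, 1}] = Lockstep.ETS.lookup(table, :k)
```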
## Replay + shrink
When Lockstep finds a bug, it writes a .lockstep trace file. You
can:
```shell
# Re-execute the exact failing schedule (deterministic)
mix lockstep.replay --trace traces/<bug>.lockstep

# Minimize the trace to the smallest reproducing schedule
mix lockstep.shrink --trace traces/<bug>.lockstep
```
Replay lets you attach a debugger / add IO.inspect and step
through the race. Shrinking turns a 5000-step trace into 12 steps.
## Multi-node testing (Lockstep.Cluster)
For testing distributed systems:
```elixir
ctest "partition + heal" do
  [a, b, c] = Lockstep.Cluster.start_nodes([:a, :b, :c])
  Lockstep.Cluster.run(a, fn -> MyService.start_link() end)
  Lockstep.Cluster.run(b, fn -> MyService.start_link() end)
  Lockstep.Cluster.run(c, fn -> MyService.start_link() end)

  Lockstep.Cluster.partition([a, b], [c], mode: :defer)
  # ... do work in each partition ...
  Lockstep.Cluster.heal()
  # Verify convergence
end
```

Also: Lockstep.Cluster.stop_node/1 and start_node/1 for
crash/recovery scenarios.
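A crash/recovery sketch built from those calls (MyService is a stand-in, and passing the node handle back to start_node/1 is an assumption):

```elixir
ctest "service recovers after a node crash" do
  [a, b] = Lockstep.Cluster.start_nodes([:a, :b])
  Lockstep.Cluster.run(a, fn -> MyService.start_link() end)
  Lockstep.Cluster.run(b, fn -> MyService.start_link() end)

  Lockstep.Cluster.stop_node(b)    # hard-stop one replica
  # ... do work against node a while b is down ...
  Lockstep.Cluster.start_node(b)   # bring it back

  # Verify node b catches up with node a
end
```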
## What Lockstep does NOT do
- It doesn't find logic bugs visible by reading source. Use code review.
- It doesn't replace property-based testing. They're complementary.
- It doesn't simulate disk fsync, network packet drops at the byte-level, or OS-level kill -9 (use chaos engineering for those).
- It doesn't model wall-clock-tight latency requirements.
## Common patterns to recognize
These are the bug shapes Lockstep finds well:
### TOCTTOU (read-then-act)
```elixir
def try_acquire({ref, limit}) do
  current = :atomics.get(ref, 1)   # time of check
  if current < limit do
    :atomics.add(ref, 1, 1)        # time of use; another caller can squeeze in between
    :ok
  else
    {:error, :limit_exceeded}
  end
end
```

→ test 4+ concurrent callers; under POS, found at iteration ~1.
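One race-free rewrite, sketched with OTP's `:atomics` directly: `:atomics.add_get/3` collapses check and use into a single atomic step (the `Gate` module name is illustrative):

```elixir
defmodule Gate do
  def new(limit), do: {:atomics.new(1, signed: true), limit}

  def try_acquire({ref, limit}) do
    # add_get/3 is one atomic read-modify-write: no window between check and use.
    if :atomics.add_get(ref, 1, 1) <= limit do
      :ok
    else
      :atomics.sub(ref, 1, 1)  # roll back the optimistic reservation
      {:error, :limit_exceeded}
    end
  end
end
```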
### Lost update on read-modify-write
```elixir
v = Counter.value(pid)
Counter.set(pid, v + 1)
```

→ test multiple concurrent processes; under POS/PCT, found at iteration 1-3.
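The standard fix is to make the increment a single message, so the read and the write happen atomically inside the server (the Counter from the first example already exposes add/2 for exactly this):

```elixir
# Buggy: another process can run between these two calls.
v = Counter.value(pid)
Counter.set(pid, v + 1)

# Fixed: one call, so the read-modify-write happens inside one handle_call.
Counter.add(pid, 1)
```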
### Message-ordering race
A NeighborReply arrives at the same mailbox as a connection_lost signal. Whichever is processed first determines outcome.
→ Lockstep.send + Process.monitor + handle_info patterns.
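A minimal shape of that race, sketched as a Lockstep.GenServer (the Relay module and its message names are illustrative, not part of the Lockstep API):

```elixir
defmodule Relay do
  use Lockstep.GenServer

  def init(_), do: {:ok, :waiting}

  # Whichever message Lockstep delivers first decides the final state;
  # under normal scheduling one order dominates and the other stays hidden.
  def handle_info({:neighbor_reply, peer}, :waiting), do: {:noreply, {:connected, peer}}
  def handle_info(:connection_lost, _state), do: {:noreply, :disconnected}
  def handle_info(_other, state), do: {:noreply, state}
end
```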
### Linearizability violations
Use Lockstep.Checker.Linearizable:
```elixir
ctest "registry is linearizable" do
  history = run_workload(...)
  assert :ok = Lockstep.Checker.Linearizable.check(history, model)
end
```

## Reading traces
Trace output uses pid aliases for readability:
```
P0(root) = #PID<0.123.0>
P1       = #PID<0.124.0>
P2       = #PID<0.125.0>

Schedule:
  step 1  hello  P0(root)
  step 2  spawn  P0(root) -> P1
  step 3  send   P0(root) -> P1 {:hello}
  step 4  recv   P1 {:hello}
  step 5  exit   P1 reason={:exception, :error, ...}   <-- FAILED HERE
```

The <-- FAILED HERE marker shows where the assertion or
invariant fired. Read the trace bottom-up to understand causality.
### Causal slice
Lockstep automatically slices traces to show only events causally
related to the failure. Set LOCKSTEP_NO_CAUSAL_SLICE=1 to disable.
For long traces, the slice is ~5-20% of the original.
## LLM-explained counterexamples
If you have an Anthropic API key:
```shell
export ANTHROPIC_API_KEY=sk-ant-...
mix test   # Failures will be explained in plain English
```
Set LOCKSTEP_LLM_OFF=1 to disable.
## Documentation
- Overview + tutorials: README.md
- Methodology: METHODOLOGY.md — playbook for testing real Hex packages
- Bug-finding case studies: docs/design/BUG_FINDINGS.md
- Design history + strategy: docs/design/
## Programmatic API
For non-ExUnit callers:
```elixir
Lockstep.Runner.run(
  fn -> my_test_body() end,
  iterations: 1000,
  strategy: :pos,
  max_steps: 5_000,
  seed: 1,
  iter_timeout: 30_000,
  suite: "my_suite"
)
```

Returns :ok or raises Lockstep.BugFound with iteration, seed,
strategy, and trace path.
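Usage sketch (the exact field names on Lockstep.BugFound, such as iteration, seed, and trace_path, are assumptions based on the description above):

```elixir
try do
  :ok =
    Lockstep.Runner.run(fn -> my_test_body() end,
      iterations: 500,
      strategy: :pct,
      seed: 1
    )
rescue
  bug in Lockstep.BugFound ->
    # Replay the exact failing schedule from the trace file it wrote.
    IO.puts("bug at iteration #{bug.iteration}, seed #{bug.seed}; " <>
            "replay: mix lockstep.replay --trace #{bug.trace_path}")
end
```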
## When to escalate
If a bug reproduces with seed: N but not with other seeds, the seed matters —
file the bug with that exact seed. Maintainers can verify with the same seed,
and the whole fix cycle stays deterministic.
If a test ALWAYS produces a "bug" at iteration 1 but you don't believe it's real, the scenario is likely too aggressive: it forces the race deterministically instead of testing whether the race can happen. Restructure the test toward more natural usage.