Lockstep is designed to be invokable from AI coding agents (Claude Code, Cursor, GitHub Copilot, etc.) to verify concurrency properties of BEAM code. This file documents the AI-agent contract.
## What an agent can do with Lockstep
| Task | How |
|---|---|
| Verify a hypothesized race exists | Generate a Lockstep test, run it, parse the result |
| Reproduce a flaky test deterministically | Wrap the test body in `ctest`, capture the seed |
| Replay a saved counterexample | `mix lockstep.replay --trace <path>` |
| Minimize a counterexample | `mix lockstep.shrink --trace <path>` |
| Verify a fix doesn't regress | Re-run the same test with same seed |
## CLI commands (Bash-callable)

These are the primary entry points for AI agents. All run via the `mix` command in the project root.

### `mix test <file>` — run a Lockstep test

```shell
mix test test/path/to/race_test.exs --seed 1
```
Exit code 0 = no bug found. Non-zero = test failed.
If a Lockstep test detected a bug, output includes:
- `iteration: N` — the iteration that triggered
- `seed: M` — top-level seed
- `strategy: :pct`/`:pos`/... — strategy used
- `trace path: traces/<name>.lockstep` — saved trace file
The trace file path is the key handoff: subsequent commands operate on that file.
### `mix lockstep.replay --trace <path>` — print a saved trace

```shell
mix lockstep.replay --trace traces/foo-iter4-seed1.lockstep
```

Prints metadata plus a human-readable schedule with pid aliases (`P0`, `P1`, ...) and step numbers. Includes a `<-- FAILED HERE` marker.
### `mix lockstep.replay --trace <path> --run "Mod.fn" --file <test_file>`

Re-executes a specific test function under the recorded schedule. Deterministic — same trace + same code = same outcome.
### `mix lockstep.shrink --trace <path>` — minimize a counterexample

```shell
mix lockstep.shrink --trace traces/foo-iter4-seed1.lockstep
```

Outputs a smaller trace file that reproduces the same bug. Useful before filing an issue.
### `mix lockstep.dump_trace --trace <path>` — print trace as terms

For programmatic parsing.
## Generating a Lockstep test from a hypothesis
The simplest pattern an agent should produce when it wants to verify a race hypothesis:
```elixir
defmodule Lockstep.Mimic.MyRaceTest do
  use Lockstep.Test

  ctest "verify hypothesis: <plain English description>",
        iterations: 1000, strategy: :pos, seed: 1 do
    # Build a minimal mimic of the production code path.
    # Use Lockstep.* wrappers everywhere instead of OTP defaults.

    # 1. Set up state
    {:ok, pid} = Lockstep.GenServer.start_link(MyServer, init_state)

    # 2. Spawn concurrent actors
    parent = self()

    for i <- 1..3 do
      Lockstep.spawn(fn ->
        # ... code that exercises the hypothesized race ...
        Lockstep.send(parent, {:done, i})
      end)
    end

    # 3. Collect results
    for _ <- 1..3 do
      Lockstep.recv_first(fn {:done, _} -> true; _ -> false end)
    end

    # 4. Assert the invariant
    if not invariant_holds?(...) do
      raise "invariant violated: ..."
    end
  end
end
```

Key rules for agent-generated tests:

- Use `Lockstep.*` wrappers for every send, spawn, GenServer.call, ETS op, atomics op. Bare OTP calls won't be intercepted.
- Use `Lockstep.send/2` in the test body (not `send/2`).
- Use `Lockstep.recv_first/1` (not `receive`).
- The test function's body raises on invariant violation — that's how Lockstep detects the bug.
## Strategy selection heuristic
| Bug shape | Strategy |
|---|---|
| Tight read-modify-write on atomics/ETS | `:pos` |
| Multi-process message-ordering races | `:pct` (default) |
| Spin-loop / liveness scenarios | `:fair_pct` |
| Unknown shape, exploring | `:random` |
For deep races (>5 sync points between cause and effect), bump `iterations` to 10_000+ and try multiple strategies.
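The heuristic above can be expressed as a lookup an agent consults when emitting a `ctest`. A sketch; the shape names and the `choose_strategy` helper are my own labels, not a Lockstep API:

```python
def choose_strategy(bug_shape: str, deep_race: bool = False) -> dict:
    """Map a hypothesized bug shape to a Lockstep strategy + iteration count."""
    table = {
        "read_modify_write": ":pos",   # tight atomics/ETS races
        "message_ordering": ":pct",    # multi-process message races (default)
        "liveness": ":fair_pct",       # spin-loops, starvation
        "unknown": ":random",          # exploratory
    }
    return {
        "strategy": table.get(bug_shape, ":pct"),  # :pct is the default
        "iterations": 10_000 if deep_race else 1_000,
    }
```

For an unknown-but-deep race, a reasonable loop is to call this once per shape and run each resulting configuration.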
## Parsing Lockstep output
A Lockstep failure exits with code 1 and prints an exception block that's parseable. Key fields:
```
** (Lockstep.BugFound)

   Lockstep found a concurrency bug on iteration <ITER>.

   seed: <TOP_SEED>
   iter seed: <ITER_SEED>
   strategy: :<STRATEGY>
   trace path: traces/<NAME>.lockstep

   Reason:
   <REASON>

   Schedule:
     step 1  <EVENT>
     ...
     step N  <EVENT>   <-- FAILED HERE
```

Regex for parsing (Python-flavoured):

```python
import re

# Non-greedy `.*?` so `seed:` matches the top-level seed line,
# not the later `iter seed:` line.
m = re.search(
    r"iteration (\d+).*?seed:\s+(\d+).*?strategy:\s+:(\w+).*?trace path:\s+(\S+)",
    output,
    re.DOTALL,
)
if m:
    iteration, seed, strategy, trace_path = m.groups()
```

## Common pitfalls when agents generate tests
- Forgetting to use Lockstep wrappers. The agent writes `GenServer.start_link(...)` and the GenServer is unmanaged; Lockstep can't schedule it.
- Pre-constructed bugs. The agent writes a test that always raises (no scheduling needed). A "bug found at iteration 1" is then not informative — restructure the test to be a real race.
- Drain too short. Many test bodies need `drain(100)` + ticks to let timers fire and processes settle.
- Bad invariant. The agent's invariant should fail ONLY in the bug case, not during normal flow.
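The first pitfall is mechanically checkable: an agent can lint its own generated test body for bare OTP calls before running it. A conservative sketch; the pattern set and `unwrapped_calls` helper are illustrative, not part of Lockstep:

```python
import re

# Bare-OTP patterns that Lockstep cannot intercept. Negative lookbehinds
# skip calls already routed through the Lockstep.* wrappers.
BARE_CALLS = {
    "GenServer.start_link": r"(?<!Lockstep\.)GenServer\.start_link\(",
    "spawn": r"(?<!Lockstep\.)\bspawn\(",
    "send": r"(?<!Lockstep\.)\bsend\(",
    "receive": r"\breceive\b",
}

def unwrapped_calls(test_source: str) -> list[str]:
    """Names of bare OTP calls found in a generated test body."""
    return [name for name, pat in BARE_CALLS.items() if re.search(pat, test_source)]
```

An empty result does not prove the test is well-formed (ETS and atomics ops are not covered here), but a non-empty result is a reliable signal to regenerate.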
## Ground truth: what to trust
- Lockstep's deterministic finding (same seed → same iteration → same trace) means the bug is REAL given the test setup.
- Lockstep's negative result (10000 iterations, no bug found) means the test setup didn't surface a bug under the strategies used. It does NOT mean the code is bug-free.
- The `<-- FAILED HERE` marker points to the exact event where the assertion fired.
- The causal slice (default on) shows the ~5-20% of trace events causally related to the failure.
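An agent applying these trust rules can classify each run mechanically. A sketch; the `interpret_run` function and its labels are illustrative, not a Lockstep API:

```python
def interpret_run(exit_code: int, output: str, iterations: int) -> str:
    """Classify a Lockstep run per the trust rules above."""
    if exit_code == 0:
        # A clean pass is evidence, not proof of bug-freedom.
        return f"no bug surfaced in {iterations} iterations (not proof of absence)"
    if "Lockstep.BugFound" in output:
        # Deterministic finding: same seed -> same iteration -> same trace.
        return "bug is real given this test setup"
    return "failed for a non-concurrency reason; inspect output"
```

The third branch matters in practice: a compile error or a setup crash also exits non-zero, and should not be reported as a concurrency bug.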
## Skill bundle
The Anthropic Skill format file is at `skill/lockstep/SKILL.md`. Agents using the Anthropic Agent SDK can load it via `Skill.load("lockstep")`.
## CI integration
Recommended pattern for CI:
```yaml
# .github/workflows/concurrency-tests.yml
- run: mix test --include lockstep --seed 1 --max-failures 1
```

The `--seed 1` ensures the same iteration counts hit the same schedule across runs. `--max-failures 1` stops on the first failure to keep the trace artifact small.
Save traces/ as an artifact so failures are debuggable.
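One way to save the traces, assuming GitHub Actions with the standard `actions/upload-artifact` action (the artifact name and step layout here are illustrative):

```yaml
# Sketch: upload traces/ only when the Lockstep step failed.
- uses: actions/upload-artifact@v4
  if: failure()
  with:
    name: lockstep-traces
    path: traces/
```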
## Filing a bug found by an agent
When an AI agent finds a bug in third-party code via Lockstep, the filable artifact is:
- The Lockstep test file
- The trace file (`traces/<bug>.lockstep`)
- The output of `mix lockstep.replay --trace <path>` (the human-readable schedule)
- A suggested fix + the `mix lockstep.replay --run` command that verifies the fix
Reproducibility is the value-add — the bug isn't "we saw it once", it's "seed N + strategy X + iteration M reproduces every time."
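The artifact list above can be assembled mechanically into an issue body. A sketch; the `issue_body` helper and its layout are my own, and only the `mix lockstep.replay` command comes from Lockstep:

```python
def issue_body(test_file: str, trace_path: str, replay_output: str,
               seed: int, strategy: str, iteration: int) -> str:
    """Render the filable artifacts as a plain-text issue body."""
    return "\n".join([
        f"Reproduces deterministically: seed {seed} + strategy {strategy}"
        f" + iteration {iteration}.",
        "",
        f"Test file: {test_file}",
        f"Trace file: {trace_path}",
        f"Replay: mix lockstep.replay --trace {trace_path}",
        "",
        "Human-readable schedule:",
        replay_output,
    ])
```

Leading with the reproduction line puts the value-add (determinism) first for maintainers triaging the report.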