Mission & Success Criteria
- Deliver a 100% feature-parity Elixir SDK mirroring the Python client (
openai/codex) while sharing the same codex-rs Rust engine. - Ship an idiomatic, well-documented OTP interface that honors repository conventions, keeps the Rust engine isolated, and maintains deterministic, reproducible builds.
- Sustain a continuous test-driven development cadence: every user-facing capability enters the codebase behind a failing test and lands with green CI, ≥95 % coverage, and lint/format compliance.
Guiding Principles
- Parity First: All behavior—including edge-case semantics, telemetry, and error surfaces—must match Python unless an intentional deviation is documented.
- TDD Discipline: The red → green → refactor loop is non-negotiable. Each milestone defines acceptance tests before implementation work begins.
- Deterministic Tooling: Tests rely on reproducible fixtures (golden event logs, fake binaries) and never call upstream services implicitly.
- Isolated Dependencies:
codex-rs stays in a dedicated git submodule with explicit build/install tooling; Elixir code treats it as a black box. - Continuous Documentation: Every module, behavior, and mix task gains docs/doctests; plans and checklists evolve alongside the implementation.
Dependency Management & Rust Engine Integration
- Git Submodule Setup
- Add
openai/codex as a submodule rooted at vendor/codex. - Configure sparse checkout to include only
codex-rs/**, codex-rs/scripts/**, and release manifests. Update .gitmodules accordingly. - Version pin lives in
config/native.exs and is echoed in mix.exs metadata (@codex_rs_commit).
- Mix Native Wrapper
- Introduce
Mix.Tasks.Codex.Install that bootstraps Rust toolchain checks, applies any patches/codex-rs/*.patch, and runs cargo build --release. - Built binaries land in
priv/codex/<platform>/codex; path recorded in {@literal Codex.Options.default_codex_path/0}. - Cache build artifacts in
_build/native/<platform> to keep CI fast.
- Prebuilt Artifact Fallback
- Optional opt-in flag
mix codex.install --prebuilt downloads signed archives from upstream release buckets. - Validate checksums, store in same
priv/codex/<platform> hierarchy, and respect offline installations by default (no network call unless flag provided).
- Isolation Guarantees
- Never modify submodule sources directly; maintain
patches/ and reapply during mix codex.install. - Add CI job that runs
git diff vendor/codex to ensure no uncommitted changes leak in.
Iterative TDD Roadmap
Milestone 0 – Discovery & Characterization (1 sprint)
- Clone Python SDK and enumerate public APIs; capture JSONL transcripts for baseline scenarios (threads, tools, approvals, structured output, attachments, telemetry).
- Build Python ↔ Elixir comparison harness producing golden fixtures under
integration/fixtures/python/*.jsonl. - Write failing contract tests that consume the fixtures, asserting parity at the event-stream level (known failures tolerated via
@tag :pending until corresponding milestone).
Milestone 1 – Core Thread & Turn Flow (2 sprints)
- Author ExUnit scenarios for thread lifecycle (
Codex.start_thread/2, Codex.resume_thread/3) matching Python behaviors. - Define event struct doctests covering JSON serialization/deserialization from fixtures.
- Implement minimal code to pass blocking turn tests, then expand to streaming by validating stream ordering and final response extraction.
- Write red tests for tool registration, invocation lifecycle, and auto-run loops based on Python decorators.
- Build Mox-backed tool registry ensuring concurrency safety and verifying approved command execution semantics.
- Implement sandbox/approval middleware with tests covering accept/deny/bypass cases.
Milestone 3 – Attachments & File APIs (1 sprint)
- Construct integration tests that simulate file staging, upload, and cleanup using a fake
codex-rs binary. - Provide unit tests for
Codex.Files helpers (content hashing, local caching).
Milestone 4 – Observability & Error Domains (1 sprint)
- Add failing tests for telemetry emission (
:telemetry events, logging hooks) matching Python callbacks. - Enumerate error classes (user errors, transport failures, timeouts) and encode them with property tests verifying message parity.
Milestone 5 – Regression Harness & Coverage Gate (ongoing)
- Stand up contract suite that spins both Python and Elixir clients against a mock
codex-rs responder, diffing event streams during CI. - Enforce coverage gate via
mix coveralls ≥ baseline; integrate mix credo --strict and mix dialyzer (with cached PLTs) into CI matrix (macOS + Linux).
Module Implementation Playbook
- Start with doctests capturing default propagation and environment overrides (
CODEX_API_KEY, CODEX_URL). - Use property tests to confirm that option merging is associative and idempotent.
Event Domain (Codex.Events.*)
- Generate struct modules per event type; tests enforce required keys and maintain exhaustive pattern matching.
- Provide
Codex.Events.parse!/1 with table-driven tests sourced from fixtures.
- Begin with failing Supertester scenarios verifying process supervision, stderr propagation, and crash resilience.
- Mock port interactions using
Port stubs; integration tests validate real Port behavior against fake binary.
Thread & Turn APIs (Codex.Thread, Codex.Turn)
- Tests define blocking and streaming semantics, auto-run loop, structured output handling, and tool-call bridging.
- Leverage property tests to ensure streaming enumerables remain cold (no side effects until consumption) and respect cancellation.
- Introduce behaviors and default implementations with tests covering registration, metadata exposure, and invocation context.
- Integration tests spawn supervised fake MCP servers to validate handshake and lifecycle management.
- Unit tests cover staging, deduplication, MIME detection; integration tests ensure cleanup on success/failure.
- Tests assert emission of telemetry events with expected metadata; include golden snapshots for log formatting.
- Unit Tests:
async: true, leverage Mox for external interactions; enforce fast runtime (<1 ms). - Integration Tests: Tagged
:integration, using Supertester to orchestrate deterministic Port interactions; rely on fixtures. - Contract Tests: Compare Python vs Elixir outputs using golden logs; executed nightly or on-demand due to runtime.
- Property Tests: Use StreamData for invariants (event encode/decode, options merging, stream ordering).
- Live Tests: Optional
:live suite hitting the real CLI/API; opt in with CODEX_TEST_LIVE=true mix test --only live --include live. - Test Utilities: Maintain
test/support/ helpers for fixture loading, fake codex binary creation, and telemetry capture. - Coverage & Linting: Gate merges on
mix coveralls, mix credo --strict, and mix dialyzer (post-PLT warmup).
Fixture & Golden Data Strategy
- Store Python-generated event transcripts in
integration/fixtures/python. - Maintain schema snapshots for structured outputs in
integration/fixtures/schemas. - Keep fake codex scripts under
test/support/fakes with helper for on-the-fly generation. - Version fixtures with semantic names (
thread_auto_run_success.jsonl) and document updates in docs/fixtures.md.
CI/CD & Automation
- Expand GitHub Actions to run:
mix deps.get && mix compile --warnings-as-errorsmix format --check-formattedmix test --include integrationmix coveralls.githubMIX_ENV=dev mix dialyzer- Nightly job: contract parity suite + Python harness
- Add job ensuring
mix codex.install --check verifies binary integrity and submodule cleanliness. - Publish artifacts (coverage reports, parity diff logs) as CI attachments to inform regressions.
Documentation & Communication
- Update
docs/02-architecture.md, docs/03-implementation-plan.md, and docs/04-testing-strategy.md after each milestone with distilled learnings. - Maintain running parity checklist in
docs/python-parity-checklist.md (new file) capturing feature status, associated tests, and owners. - Ensure every public module ships
@doc and doctests; include usage examples mirroring Python README scenarios. - Prepare release notes summarizing parity gaps closed per sprint; attach telemetry or fixture updates as needed.
Risk Register & Mitigations
- Rust Submodule Drift: Automate weekly reminder to sync commit hash; require explicit changelog entry for upgrades.
- Fixture Rot: Regenerate Python transcripts via automation script and diff results; tests fail if fixtures change unexpectedly.
- Cross-Platform Variance: Execute integration suite on macOS and Linux; provide container image to equalize developer environments.
- Binary Size Constraints: Consider optional Hex package
codex_sdk_native for binaries; document install steps. - TDD Discipline Slippage: Enforce code review checklist requiring failing test link before feature implementation.
- Introduce
vendor/codex submodule with sparse checkout and create Mix.Tasks.Codex.Install. - Build Python harness (
scripts/harvest_python_fixtures.py) to capture golden data for Milestone 0. - Author baseline failing contract tests asserting parity for thread lifecycle and event parsing.
- Schedule architecture review to validate roadmap, resourcing, and milestone timelines.