Testing Strategy

Overview

The Elixir Codex SDK follows a comprehensive test-driven development (TDD) approach using Supertester for deterministic OTP testing. This document outlines our testing philosophy, strategies, tools, and best practices.

Testing Philosophy

Core Principles

  1. Test First: Write tests before implementation
  2. Deterministic: Zero flaky tests, zero Process.sleep
  3. Fast: Full suite < 5 minutes, average test < 50ms
  4. Comprehensive: 95%+ coverage, all edge cases
  5. Maintainable: Clear, readable, well-organized tests
  6. Async: All tests run with async: true where possible

Red-Green-Refactor Cycle

  1. Red: Write a failing test that defines desired behavior
  2. Green: Write minimal code to make test pass
  3. Refactor: Improve code quality while keeping tests green
  4. Repeat: Continue with next feature
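
One pass through the cycle might look like this (a minimal sketch against the event parser used later in this document):

# Red: fails until Codex.Exec.Parser.parse_event/1 exists
test "parses thread.started events" do
  json = ~s({"type":"thread.started","thread_id":"t1"})

  assert {:ok, %Codex.Events.ThreadStarted{thread_id: "t1"}} =
           Codex.Exec.Parser.parse_event(json)
end

# Green: implement the smallest parse_event/1 that passes, then refactor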

Test Categories

1. Unit Tests

Purpose: Test individual functions and modules in isolation.

Characteristics:

  • Run with async: true
  • Mock all external dependencies
  • Focus on single responsibility
  • Fast (< 1ms per test)
  • High coverage of edge cases

Example:

defmodule Codex.EventsTest do
  use ExUnit.Case, async: true

  describe "ThreadStarted" do
    test "creates struct with required fields" do
      event = %Codex.Events.ThreadStarted{
        type: :thread_started,
        thread_id: "thread_abc123"
      }

      assert event.type == :thread_started
      assert event.thread_id == "thread_abc123"
    end

    test "enforces required fields" do
      assert_raise ArgumentError, fn ->
        %Codex.Events.ThreadStarted{}
      end
    end
  end
end

2. Integration Tests

Purpose: Test interactions between components.

Characteristics:

  • Tagged :integration
  • Use mock codex-rs script
  • Test full workflows
  • Medium speed (< 100ms per test)
  • May run synchronously

Example:

defmodule Codex.Thread.IntegrationTest do
  use ExUnit.Case
  use Supertester

  @moduletag :integration

  test "full turn execution with mock codex" do
    mock_script = create_mock_codex_script([
      ~s({"type":"thread.started","thread_id":"thread_123"}),
      ~s({"type":"turn.started"}),
      ~s({"type":"item.completed","item":{"id":"1","type":"agent_message","text":"Hello"}}),
      ~s({"type":"turn.completed","usage":{"input_tokens":10,"cached_input_tokens":0,"output_tokens":5}})
    ])

    # Remove the script even if an assertion fails
    on_exit(fn -> File.rm(mock_script) end)

    codex_opts = %Codex.Options{codex_path_override: mock_script}
    {:ok, thread} = Codex.start_thread(codex_opts)

    {:ok, result} = Codex.Thread.run(thread, "test input")

    assert result.final_response == "Hello"
    assert result.usage.input_tokens == 10
    assert thread.thread_id == "thread_123"
  end

  defp create_mock_codex_script(events) do
    script = """
    #!/bin/bash
    #{Enum.map_join(events, "\n", &"echo '#{&1}'")}
    """

    path = Path.join(System.tmp_dir!(), "mock_codex_#{:rand.uniform(10000)}")
    File.write!(path, script)
    File.chmod!(path, 0o755)
    path
  end
end

3. Live Tests

Purpose: Test against real codex-rs binary and OpenAI API.

Characteristics:

  • Tagged :live
  • Require API key via environment variable
  • Optional (skip in CI by default)
  • Slow (seconds per test)
  • Useful for validation and debugging

Example:

defmodule Codex.LiveTest do
  use ExUnit.Case

  @moduletag :live
  @moduletag timeout: 60_000

  # Excluding :live tests must happen before the suite starts, so calling
  # ExUnit.configure/1 inside setup is too late. Put this in test_helper.exs:
  #
  #   unless System.get_env("CODEX_API_KEY") do
  #     ExUnit.configure(exclude: [:live])
  #   end
  #
  #   ExUnit.start()

  test "real turn execution" do
    {:ok, thread} = Codex.start_thread()

    {:ok, result} = Codex.Thread.run(thread, "Say 'test successful' and nothing else")

    assert result.final_response =~ "test successful"
    assert result.usage.input_tokens > 0
  end
end
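
Run them explicitly when a key is available:

CODEX_API_KEY=... mix test --only live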

4. Property Tests

Purpose: Test properties that should hold for all inputs.

Characteristics:

  • Use StreamData for generation
  • Test invariants and laws
  • Discover edge cases automatically
  • Run many iterations

Example:

defmodule Codex.Events.PropertyTest do
  use ExUnit.Case, async: true
  use ExUnitProperties

  property "all events encode and decode correctly" do
    check all event <- event_generator() do
      json = Jason.encode!(event)
      {:ok, decoded} = Jason.decode(json)

      assert decoded["type"] in [
        "thread.started", "turn.started", "turn.completed",
        "turn.failed", "item.started", "item.updated",
        "item.completed", "error"
      ]
    end
  end

  defp event_generator do
    gen all type <- member_of([:thread_started, :turn_started, :turn_completed]),
            thread_id <- string(:alphanumeric, min_length: 1, max_length: 50) do
      case type do
        :thread_started ->
          %Codex.Events.ThreadStarted{
            type: :thread_started,
            thread_id: thread_id
          }

        :turn_started ->
          %Codex.Events.TurnStarted{type: :turn_started}

        :turn_completed ->
          %Codex.Events.TurnCompleted{
            type: :turn_completed,
            usage: %Codex.Events.Usage{
              input_tokens: 10,
              cached_input_tokens: 0,
              output_tokens: 5
            }
          }
      end
    end
  end
end
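
The iteration count can be raised per property where more exploration is worthwhile (max_runs is a standard ExUnitProperties option):

check all event <- event_generator(), max_runs: 500 do
  # same assertions as above
end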

5. Chaos Tests

Purpose: Test system resilience under adverse conditions.

Characteristics:

  • Simulate process crashes
  • Test resource cleanup
  • Verify supervision behavior
  • Test under high load

Example:

defmodule Codex.ChaosTest do
  use ExUnit.Case
  use Supertester

  describe "resilience" do
    test "handles Exec GenServer crash during turn" do
      {:ok, thread} = Codex.start_thread()

      # Start turn in separate process
      task = Task.async(fn ->
        Codex.Thread.run(thread, "test")
      end)

      # Wait for the Exec GenServer to register (no Process.sleep)
      eventually(fn ->
        assert [{_pid, _}] = Registry.lookup(CodexSdk.ExecRegistry, thread.thread_id)
      end)

      # Find and kill the Exec GenServer
      [{exec_pid, _}] = Registry.lookup(CodexSdk.ExecRegistry, thread.thread_id)
      Process.exit(exec_pid, :kill)

      # Should return error, not crash
      assert {:error, _} = Task.await(task)
    end

    test "cleans up resources on early stream halt" do
      {:ok, thread} = Codex.start_thread()
      {:ok, stream} = Codex.Thread.run_streamed(thread, "test")

      # Track temp files before
      temp_files_before = count_temp_files()

      # Take only the first event, halting the stream early
      [_first_event] = Enum.take(stream, 1)

      # Poll until cleanup completes instead of sleeping
      eventually(fn ->
        assert count_temp_files() <= temp_files_before
      end)
    end

    defp count_temp_files do
      Path.wildcard(Path.join(System.tmp_dir!(), "codex-output-schema-*"))
      |> length()
    end
  end
end

Supertester Integration

Why Supertester?

Supertester provides deterministic OTP testing without Process.sleep. It enables:

  1. Proper Synchronization: Wait for actual conditions, not arbitrary timeouts
  2. Async Safety: All tests can run async: true
  3. Clear Assertions: Readable test code with helpful error messages
  4. Zero Flakes: Deterministic behavior eliminates timing issues

Basic Usage

defmodule Codex.Exec.SupertesterTest do
  use ExUnit.Case, async: true
  use Supertester

  test "GenServer receives message" do
    {:ok, pid} = Codex.Exec.start_link(...)

    # Send message
    send(pid, {:test, self()})

    # Wait for response (not Process.sleep!)
    assert_receive {:response, value}
    assert value == :expected
  end

  test "GenServer state changes" do
    {:ok, pid} = Codex.Exec.start_link(...)

    # Trigger state change
    GenServer.call(pid, :change_state)

    # Assert state changed
    assert :sys.get_state(pid).changed == true
  end
end

Advanced Patterns

Testing Async Workflows:

test "async event processing" do
  {:ok, pid} = Codex.Exec.start_link(...)

  ref = make_ref()
  GenServer.cast(pid, {:process, ref, self()})

  # Wait for specific message pattern
  assert_receive {:processed, ^ref, result}, 1000
  assert result.success
end

Testing Supervision:

test "supervised restart" do
  {:ok, sup} = Codex.Supervisor.start_link()

  # Get child pid
  [{:undefined, pid, :worker, _}] = Supervisor.which_children(sup)

  # Kill child
  Process.exit(pid, :kill)

  # Wait for restart
  eventually(fn ->
    [{:undefined, new_pid, :worker, _}] = Supervisor.which_children(sup)
    assert new_pid != pid
    assert Process.alive?(new_pid)
  end)
end

Mock Strategies

1. Mox for Protocols

When: Testing modules that depend on behaviors.

Example:

# Define behavior
defmodule Codex.ExecBehaviour do
  @callback run_turn(pid(), String.t(), map()) :: reference()
  @callback get_events(pid(), reference()) :: [Codex.Events.t()]
end

# Define mock in test_helper.exs
Mox.defmock(Codex.ExecMock, for: Codex.ExecBehaviour)

# Use in tests
test "thread uses exec" do
  Mox.expect(Codex.ExecMock, :run_turn, fn _pid, input, _opts ->
    assert input == "test"
    make_ref()
  end)

  Mox.expect(Codex.ExecMock, :get_events, fn _pid, _ref ->
    [
      %Codex.Events.ThreadStarted{...},
      %Codex.Events.TurnCompleted{...}
    ]
  end)

  # Test with mock
  thread = %Codex.Thread{exec: Codex.ExecMock, ...}
  {:ok, result} = Codex.Thread.run(thread, "test")
end
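
Mox only flags unmet expectations when verification runs; the usual pattern is to import Mox and verify after each test:

defmodule Codex.ThreadMoxTest do
  use ExUnit.Case, async: true

  import Mox

  # Verifies all expectations when each test exits
  setup :verify_on_exit!
end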

2. Mock Scripts for Exec

When: Testing Exec GenServer with controlled output.

Example:

defmodule MockCodexScript do
  def create(events) when is_list(events) do
    script_content = """
    #!/bin/bash
    # Read stdin (ignore for mock)
    cat > /dev/null

    # Output events
    #{Enum.map_join(events, "\n", &"echo '#{&1}'")}

    exit 0
    """

    path = Path.join(System.tmp_dir!(), "mock_codex_#{System.unique_integer([:positive])}.sh")
    File.write!(path, script_content)
    File.chmod!(path, 0o755)

    path
  end

  def cleanup(path) do
    File.rm(path)
  end
end

# Usage in tests
test "exec processes events" do
  events = [
    Jason.encode!(%{type: "thread.started", thread_id: "t1"}),
    Jason.encode!(%{type: "turn.completed", usage: %{input_tokens: 5}})
  ]

  script = MockCodexScript.create(events)

  try do
    {:ok, pid} = Codex.Exec.start_link(codex_path: script, input: "test")
    # ... test assertions
  after
    MockCodexScript.cleanup(script)
  end
end

3. Test Doubles for Data

When: Testing with known data structures.

Example:

defmodule Codex.Fixtures do
  def thread_started_event(thread_id \\ "thread_test123") do
    %Codex.Events.ThreadStarted{
      type: :thread_started,
      thread_id: thread_id
    }
  end

  def agent_message_item(text \\ "Hello") do
    %Codex.Items.AgentMessage{
      id: "msg_#{System.unique_integer([:positive])}",
      type: :agent_message,
      text: text
    }
  end

  def complete_turn_result do
    %Codex.Turn.Result{
      items: [agent_message_item()],
      final_response: "Hello",
      usage: %Codex.Events.Usage{
        input_tokens: 10,
        cached_input_tokens: 0,
        output_tokens: 5
      }
    }
  end
end
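
Fixtures keep assertions focused on behavior rather than data setup:

test "fixture results carry the expected usage" do
  result = Codex.Fixtures.complete_turn_result()

  assert result.final_response == "Hello"
  assert result.usage.input_tokens == 10
end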

Coverage Goals

Overall Coverage: 95%+

Per Module:

  • Core modules (Codex, Thread, Exec): 100%
  • Type modules (Events, Items, Options): 100%
  • Utility modules (OutputSchemaFile): 95%
  • Test support modules: 80%

Coverage Tool: ExCoveralls

Configuration in mix.exs:

def project do
  [
    test_coverage: [tool: ExCoveralls],
    preferred_cli_env: [
      coveralls: :test,
      "coveralls.detail": :test,
      "coveralls.post": :test,
      "coveralls.html": :test
    ]
  ]
end
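
ExCoveralls also has to be declared as a test-only dependency (the version shown is indicative):

defp deps do
  [
    {:excoveralls, "~> 0.18", only: :test}
  ]
end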

Commands:

# Run tests with coverage
mix coveralls

# Detailed coverage report
mix coveralls.detail

# HTML coverage report
mix coveralls.html

# CI coverage (upload to Coveralls.io)
mix coveralls.github

Coverage Exceptions

Some code is deliberately excluded:

# coveralls-ignore-start
def debug_helper do
  # Only used in development
end
# coveralls-ignore-stop

Test Organization

Directory Structure

test/
├── codex_test.exs                   # Codex module tests
├── codex/
│   ├── thread_test.exs              # Thread module tests
│   ├── exec_test.exs                # Exec GenServer tests
│   ├── exec/
│   │   ├── parser_test.exs          # Event parser tests
│   │   └── integration_test.exs     # Exec integration tests
│   ├── events_test.exs              # Event type tests
│   ├── items_test.exs               # Item type tests
│   ├── options_test.exs             # Option struct tests
│   └── output_schema_file_test.exs  # Schema file helper tests
├── integration/
│   ├── basic_workflow_test.exs      # End-to-end workflows
│   ├── streaming_test.exs           # Streaming workflows
│   └── error_scenarios_test.exs     # Error handling
├── live/
│   └── real_codex_test.exs          # Tests with real API
├── property/
│   ├── events_property_test.exs     # Event properties
│   └── parsing_property_test.exs    # Parser properties
├── chaos/
│   └── resilience_test.exs          # Chaos engineering
├── support/
│   ├── fixtures.ex                  # Test data fixtures
│   ├── mock_codex_script.ex         # Mock script helper
│   └── supertester_helpers.ex       # Supertester utilities
└── test_helper.exs                  # Test configuration

File Naming

  • *_test.exs: Standard tests
  • *_integration_test.exs: Integration tests
  • *_property_test.exs: Property-based tests

Test Naming

Descriptive Names:

# Good
test "returns error when codex binary not found"
test "accumulates events until turn completes"
test "cleans up temp files on early stream halt"

# Bad
test "it works"
test "error case"
test "cleanup"

Describe Blocks:

describe "run/3" do
  test "executes turn successfully" do
    # ...
  end

  test "handles API errors gracefully" do
    # ...
  end
end

describe "run/3 with output schema" do
  test "creates temporary schema file" do
    # ...
  end

  test "cleans up schema file after turn" do
    # ...
  end
end

Assertions and Matchers

Standard Assertions

# Equality
assert result == expected
refute result == unexpected

# Pattern matching
assert %Codex.Events.ThreadStarted{thread_id: id} = event
assert id =~ ~r/thread_\w+/

# Boolean
assert Process.alive?(pid)
assert File.exists?(path)

# Membership
assert value in list
assert Map.has_key?(map, :key)

# Exceptions
assert_raise ArgumentError, fn ->
  %Codex.Events.ThreadStarted{}
end

# Messages
assert_receive {:event, ^ref, event}, 1000
refute_received {:error, _}

Custom Assertions

defmodule CodexSdk.Assertions do
  import ExUnit.Assertions

  def assert_valid_thread_id(thread_id) do
    assert is_binary(thread_id), "thread_id must be a string"
    assert String.starts_with?(thread_id, "thread_"), "thread_id must start with 'thread_'"
    assert String.length(thread_id) > 7, "thread_id must have content after prefix"
  end

  def assert_complete_turn_result(result) do
    assert %Codex.Turn.Result{} = result
    assert is_list(result.items)
    assert is_binary(result.final_response)
    assert %Codex.Events.Usage{} = result.usage
    assert result.usage.input_tokens > 0
  end

  def assert_events_in_order(events, expected_types) do
    actual_types = Enum.map(events, & &1.type)
    assert actual_types == expected_types,
      "Events out of order.\nExpected: #{inspect(expected_types)}\nActual: #{inspect(actual_types)}"
  end
end
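
Usage mirrors the built-in assertions:

import CodexSdk.Assertions

test "fixture events carry valid thread ids" do
  event = Codex.Fixtures.thread_started_event()
  assert_valid_thread_id(event.thread_id)
end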

Error Testing

Expected Errors

test "returns error for invalid schema" do
  thread = %Codex.Thread{...}

  result = Codex.Thread.run(
    thread,
    "test",
    %Codex.Turn.Options{output_schema: "invalid"}
  )

  assert {:error, {:invalid_schema, _}} = result
end

Error Propagation

test "propagates turn failure from codex" do
  mock_script = create_failing_mock([
    ~s({"type":"thread.started","thread_id":"t1"}),
    ~s({"type":"turn.failed","error":{"message":"API error"}})
  ])

  codex_opts = %Codex.Options{codex_path_override: mock_script}
  {:ok, thread} = Codex.start_thread(codex_opts)

  result = Codex.Thread.run(thread, "test")

  assert {:error, {:turn_failed, error}} = result
  assert error.message == "API error"
end

Error Recovery

test "recovers from transient errors" do
  # Test retry logic, fallbacks, etc.
end
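
A concrete version needs a failure that clears itself; a minimal sketch, using an Agent as a fail-once stub and a hypothetical retry_until_ok/2 helper:

test "retries transient failures before giving up" do
  {:ok, counter} = Agent.start_link(fn -> 0 end)

  # Fails on the first call, succeeds afterwards
  flaky = fn ->
    case Agent.get_and_update(counter, fn n -> {n, n + 1} end) do
      0 -> {:error, :transient}
      _ -> {:ok, :success}
    end
  end

  assert {:ok, :success} = retry_until_ok(flaky, 3)
end

# Hypothetical helper: retries a zero-arity fun until it returns {:ok, _}
defp retry_until_ok(fun, attempts) when attempts > 0 do
  case fun.() do
    {:ok, _} = ok -> ok
    {:error, _} when attempts > 1 -> retry_until_ok(fun, attempts - 1)
    error -> error
  end
end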

Performance Testing

Timing Assertions

test "parses event in under 1ms" do
  event_json = ~s({"type":"thread.started","thread_id":"t1"})

  {time_us, result} = :timer.tc(fn ->
    Codex.Exec.Parser.parse_event(event_json)
  end)

  assert {:ok, _event} = result
  assert time_us < 1000, "Parsing took #{time_us}µs, expected < 1000µs"
end

Load Testing

test "handles 100 concurrent turns" do
  threads = for _ <- 1..100 do
    {:ok, thread} = Codex.start_thread()
    thread
  end

  tasks = for thread <- threads do
    Task.async(fn ->
      Codex.Thread.run(thread, "test")
    end)
  end

  results = Task.await_many(tasks, 30_000)

  assert Enum.all?(results, fn
    {:ok, _} -> true
    _ -> false
  end)
end

Memory Testing

test "streaming does not accumulate memory" do
  {:ok, thread} = Codex.start_thread()
  {:ok, stream} = Codex.Thread.run_streamed(thread, "generate 1000 items")

  # Force a GC so measurements reflect retained memory, not pending garbage
  :erlang.garbage_collect()
  memory_before = :erlang.memory(:total)

  # Consume stream
  Enum.each(stream, fn _ -> :ok end)

  :erlang.garbage_collect()
  memory_after = :erlang.memory(:total)
  memory_delta = memory_after - memory_before

  # Should be roughly constant (< 1MB growth)
  assert memory_delta < 1_000_000,
    "Memory grew by #{memory_delta} bytes, expected < 1MB"
end

CI/CD Integration

GitHub Actions Workflow

name: CI

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        elixir: ['1.15', '1.16', '1.17']
        otp: ['25', '26', '27']
        exclude:
          # OTP 27 requires Elixir 1.17+
          - elixir: '1.15'
            otp: '27'
          - elixir: '1.16'
            otp: '27'

    steps:
      - uses: actions/checkout@v3

      - name: Setup Elixir
        uses: erlef/setup-beam@v1
        with:
          elixir-version: ${{ matrix.elixir }}
          otp-version: ${{ matrix.otp }}

      - name: Restore dependencies cache
        uses: actions/cache@v3
        with:
          path: deps
          key: ${{ runner.os }}-mix-${{ hashFiles('**/mix.lock') }}
          restore-keys: ${{ runner.os }}-mix-

      - name: Install dependencies
        run: mix deps.get

      - name: Run unit tests
        run: mix test --exclude live --exclude integration

      - name: Run integration tests
        run: mix test --only integration

      - name: Check coverage
        run: mix coveralls.github
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

      - name: Run Dialyzer
        run: mix dialyzer

      - name: Run Credo
        run: mix credo --strict

Local Pre-commit Checks

#!/bin/bash
# .git/hooks/pre-commit

echo "Running tests..."
mix test || exit 1

echo "Running Dialyzer..."
mix dialyzer || exit 1

echo "Running Credo..."
mix credo --strict || exit 1

echo "Checking coverage..."
mix coveralls || exit 1

echo "All checks passed!"

Best Practices

DO

  1. Write tests first - TDD approach
  2. Use descriptive names - Clearly state what's being tested
  3. Test one thing - Single responsibility per test
  4. Use Supertester - No Process.sleep
  5. Mock external deps - Fast, deterministic tests
  6. Test edge cases - Null, empty, invalid inputs
  7. Test errors - Both expected and unexpected
  8. Keep tests simple - Easy to understand and maintain
  9. Use fixtures - DRY test data
  10. Run tests often - Continuous feedback

DON'T

  1. Don't use Process.sleep - Use proper synchronization
  2. Don't test implementation - Test behavior, not internals
  3. Don't share state - Each test should be independent
  4. Don't skip failing tests - Fix or remove them
  5. Don't write flaky tests - Always reproducible
  6. Don't mock everything - Test real integrations when possible
  7. Don't ignore warnings - Keep Dialyzer clean
  8. Don't hardcode values - Use variables and constants
  9. Don't write long tests - Break into smaller tests
  10. Don't test external APIs - Mock or tag as :live

Troubleshooting

Flaky Tests

Symptoms: Test sometimes passes, sometimes fails.

Common Causes:

  • Using Process.sleep for synchronization
  • Shared state between tests
  • Race conditions
  • Timing assumptions

Solutions:

  • Use Supertester for proper sync
  • Ensure async: true is safe
  • Use assert_receive with timeout
  • Check for shared resources
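
For example, a sleep-based wait usually becomes a message assertion (a sketch with a hypothetical worker that reports back to the caller):

# Flaky: assumes the worker finishes within 100ms
# Process.sleep(100)
# assert :sys.get_state(worker).done

# Deterministic: block until the worker reports completion
GenServer.cast(worker, {:run, self()})
assert_receive {:done, ^worker}, 1_000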

Slow Tests

Symptoms: Tests take too long to run.

Common Causes:

  • Real API calls
  • Large data generation
  • Inefficient algorithms
  • Too much setup

Solutions:

  • Mock external calls
  • Use smaller test data
  • Optimize code under test
  • Cache expensive setup
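
Expensive setup shared across a module can run once via setup_all (load_expensive_fixtures/0 is a hypothetical helper):

setup_all do
  # Runs once per module instead of once per test
  {:ok, fixtures: load_expensive_fixtures()}
end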

Low Coverage

Symptoms: Coverage below target.

Common Causes:

  • Missing edge case tests
  • Untested error paths
  • Dead code

Solutions:

  • Review coverage report
  • Add missing tests
  • Remove dead code
  • Test all branches

Conclusion

A comprehensive testing strategy is essential for building reliable, maintainable software. By following TDD principles, using Supertester for deterministic OTP testing, maintaining high coverage, and organizing tests clearly, we ensure the Elixir Codex SDK is production-ready and trustworthy.

Key takeaways:

  • Test first - Write tests before implementation
  • No flakes - Use proper synchronization, not sleeps
  • High coverage - 95%+ with focus on critical paths
  • Fast feedback - Quick test runs enable rapid iteration
  • Clear organization - Well-structured tests are maintainable tests