Testing Strategy

Overview

The Elixir Codex SDK follows a comprehensive test-driven development (TDD) approach using Supertester for deterministic OTP testing. This document outlines our testing philosophy, strategies, tools, and best practices.

Testing Philosophy

Core Principles

  1. Test First: Write tests before implementation
  2. Deterministic: Zero flaky tests, zero Process.sleep
  3. Fast: Full suite < 5 minutes, average test < 50ms
  4. Comprehensive: 95%+ coverage, all edge cases
  5. Maintainable: Clear, readable, well-organized tests
  6. Async: All tests run with async: true where possible

Red-Green-Refactor Cycle

  1. Red: Write a failing test that defines desired behavior
  2. Green: Write minimal code to make test pass
  3. Refactor: Improve code quality while keeping tests green
  4. Repeat: Continue with next feature
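
One pass through the cycle might look like this (a minimal sketch against the event parser used later in this document):

# Red: fails until Codex.Exec.Parser.parse_event/1 exists
test "parses thread.started events" do
  json = ~s({"type":"thread.started","thread_id":"t1"})

  assert {:ok, %Codex.Events.ThreadStarted{thread_id: "t1"}} =
           Codex.Exec.Parser.parse_event(json)
end

# Green: implement the smallest parse_event/1 that passes, then refactor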

Test Categories

1. Unit Tests

Purpose: Test individual functions and modules in isolation.

Characteristics:

  • Run with async: true
  • Mock all external dependencies
  • Focus on single responsibility
  • Fast (< 1ms per test)
  • High coverage of edge cases

Example:

defmodule Codex.EventsTest do
  use ExUnit.Case, async: true

  describe "ThreadStarted" do
    test "creates struct with required fields" do
      event = %Codex.Events.ThreadStarted{
        type: :thread_started,
        thread_id: "thread_abc123"
      }

      assert event.type == :thread_started
      assert event.thread_id == "thread_abc123"
    end

    test "enforces required fields" do
      assert_raise ArgumentError, fn ->
        %Codex.Events.ThreadStarted{}
      end
    end
  end
end

2. Integration Tests

Purpose: Test interactions between components.

Characteristics:

  • Tagged :integration
  • Use mock codex-rs script
  • Test full workflows
  • Medium speed (< 100ms per test)
  • May run synchronously

Example:

defmodule Codex.Thread.IntegrationTest do
  use ExUnit.Case
  use Supertester

  @moduletag :integration

  test "full turn execution with mock codex" do
    mock_script = create_mock_codex_script([
      ~s({"type":"thread.started","thread_id":"thread_123"}),
      ~s({"type":"turn.started"}),
      ~s({"type":"item.completed","item":{"id":"1","type":"agent_message","text":"Hello"}}),
      ~s({"type":"turn.completed","usage":{"input_tokens":10,"cached_input_tokens":0,"output_tokens":5}})
    ])

    # Remove the script even if an assertion fails
    on_exit(fn -> File.rm(mock_script) end)

    codex_opts = %Codex.Options{codex_path_override: mock_script}
    {:ok, thread} = Codex.start_thread(codex_opts)

    {:ok, result} = Codex.Thread.run(thread, "test input")

    assert result.final_response == "Hello"
    assert result.usage.input_tokens == 10
    assert thread.thread_id == "thread_123"
  end

  defp create_mock_codex_script(events) do
    script = """
    #!/bin/bash
    #{Enum.map_join(events, "\n", &"echo '#{&1}'")}
    """

    path = Path.join(System.tmp_dir!(), "mock_codex_#{:rand.uniform(10000)}")
    File.write!(path, script)
    File.chmod!(path, 0o755)
    path
  end
end

3. Live Tests

Purpose: Test against real codex-rs binary and OpenAI API.

Characteristics:

  • Tagged :live
  • Require API key via environment variable
  • Optional (skip in CI by default)
  • Slow (seconds per test)
  • Useful for validation and debugging

Example:

defmodule Codex.LiveTest do
  use ExUnit.Case

  @moduletag :live
  @moduletag timeout: 60_000

  # Excluding :live tests must happen before the suite starts, so calling
  # ExUnit.configure/1 inside setup is too late. Put this in test_helper.exs:
  #
  #   unless System.get_env("CODEX_API_KEY") do
  #     ExUnit.configure(exclude: [:live])
  #   end
  #
  #   ExUnit.start()

  test "real turn execution" do
    {:ok, thread} = Codex.start_thread()

    {:ok, result} = Codex.Thread.run(thread, "Say 'test successful' and nothing else")

    assert result.final_response =~ "test successful"
    assert result.usage.input_tokens > 0
  end
end
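
Run them explicitly when a key is available:

CODEX_API_KEY=... mix test --only live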

4. Property Tests

Purpose: Test properties that should hold for all inputs.

Characteristics:

  • Use StreamData for generation
  • Test invariants and laws
  • Discover edge cases automatically
  • Run many iterations

Example:

defmodule Codex.Events.PropertyTest do
  use ExUnit.Case, async: true
  use ExUnitProperties

  property "all events encode and decode correctly" do
    check all event <- event_generator() do
      json = Jason.encode!(event)
      {:ok, decoded} = Jason.decode(json)

      assert decoded["type"] in [
        "thread.started", "turn.started", "turn.completed",
        "turn.failed", "item.started", "item.updated",
        "item.completed", "error"
      ]
    end
  end

  defp event_generator do
    gen all type <- member_of([:thread_started, :turn_started, :turn_completed]),
            thread_id <- string(:alphanumeric, min_length: 1, max_length: 50) do
      case type do
        :thread_started ->
          %Codex.Events.ThreadStarted{
            type: :thread_started,
            thread_id: thread_id
          }

        :turn_started ->
          %Codex.Events.TurnStarted{type: :turn_started}

        :turn_completed ->
          %Codex.Events.TurnCompleted{
            type: :turn_completed,
            usage: %Codex.Events.Usage{
              input_tokens: 10,
              cached_input_tokens: 0,
              output_tokens: 5
            }
          }
      end
    end
  end
end
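
The iteration count can be raised per property where more exploration is worthwhile (max_runs is a standard ExUnitProperties option):

check all event <- event_generator(), max_runs: 500 do
  # same assertions as above
end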

5. Chaos Tests

Purpose: Test system resilience under adverse conditions.

Characteristics:

  • Simulate process crashes
  • Test resource cleanup
  • Verify supervision behavior
  • Test under high load

Example:

defmodule Codex.ChaosTest do
  use ExUnit.Case
  use Supertester

  describe "resilience" do
    test "handles Exec GenServer crash during turn" do
      {:ok, thread} = Codex.start_thread()

      # Start turn in separate process
      task = Task.async(fn ->
        Codex.Thread.run(thread, "test")
      end)

      # Wait for the Exec GenServer to register (no Process.sleep)
      eventually(fn ->
        assert [{_pid, _}] = Registry.lookup(CodexSdk.ExecRegistry, thread.thread_id)
      end)

      # Find and kill the Exec GenServer
      [{exec_pid, _}] = Registry.lookup(CodexSdk.ExecRegistry, thread.thread_id)
      Process.exit(exec_pid, :kill)

      # Should return error, not crash
      assert {:error, _} = Task.await(task)
    end

    test "cleans up resources on early stream halt" do
      {:ok, thread} = Codex.start_thread()
      {:ok, stream} = Codex.Thread.run_streamed(thread, "test")

      # Track temp files before
      temp_files_before = count_temp_files()

      # Take only the first event, halting the stream early
      [_first_event] = Enum.take(stream, 1)

      # Poll until cleanup completes instead of sleeping
      eventually(fn ->
        assert count_temp_files() <= temp_files_before
      end)
    end

    defp count_temp_files do
      Path.wildcard(Path.join(System.tmp_dir!(), "codex-output-schema-*"))
      |> length()
    end
  end
end

Supertester Integration

Why Supertester?

Supertester provides deterministic OTP testing without Process.sleep. It enables:

  1. Proper Synchronization: Wait for actual conditions, not arbitrary timeouts
  2. Async Safety: All tests can run async: true
  3. Clear Assertions: Readable test code with helpful error messages
  4. Zero Flakes: Deterministic behavior eliminates timing issues

Basic Usage

defmodule Codex.Exec.SupertesterTest do
  use ExUnit.Case, async: true
  use Supertester

  test "GenServer receives message" do
    {:ok, pid} = Codex.Exec.start_link(...)

    # Send message
    send(pid, {:test, self()})

    # Wait for response (not Process.sleep!)
    assert_receive {:response, value}
    assert value == :expected
  end

  test "GenServer state changes" do
    {:ok, pid} = Codex.Exec.start_link(...)

    # Trigger state change
    GenServer.call(pid, :change_state)

    # Assert state changed
    assert :sys.get_state(pid).changed == true
  end
end

Advanced Patterns

Testing Async Workflows:

test "async event processing" do
  {:ok, pid} = Codex.Exec.start_link(...)

  ref = make_ref()
  GenServer.cast(pid, {:process, ref, self()})

  # Wait for specific message pattern
  assert_receive {:processed, ^ref, result}, 1000
  assert result.success
end

Testing Supervision:

test "supervised restart" do
  {:ok, sup} = Codex.Supervisor.start_link()

  # Get child pid
  [{:undefined, pid, :worker, _}] = Supervisor.which_children(sup)

  # Kill child
  Process.exit(pid, :kill)

  # Wait for restart
  eventually(fn ->
    [{:undefined, new_pid, :worker, _}] = Supervisor.which_children(sup)
    assert new_pid != pid
    assert Process.alive?(new_pid)
  end)
end

Mock Strategies

1. Mox for Protocols

When: Testing modules that depend on behaviors.

Example:

# Define behavior
defmodule Codex.ExecBehaviour do
  @callback run_turn(pid(), String.t(), map()) :: reference()
  @callback get_events(pid(), reference()) :: [Codex.Events.t()]
end

# Define mock in test_helper.exs
Mox.defmock(Codex.ExecMock, for: Codex.ExecBehaviour)

# Use in tests
test "thread uses exec" do
  Mox.expect(Codex.ExecMock, :run_turn, fn _pid, input, _opts ->
    assert input == "test"
    make_ref()
  end)

  Mox.expect(Codex.ExecMock, :get_events, fn _pid, _ref ->
    [
      %Codex.Events.ThreadStarted{...},
      %Codex.Events.TurnCompleted{...}
    ]
  end)

  # Test with mock
  thread = %Codex.Thread{exec: Codex.ExecMock, ...}
  {:ok, result} = Codex.Thread.run(thread, "test")
end
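
Mox only flags unmet expectations when verification runs; the usual pattern is to import Mox and verify after each test:

defmodule Codex.ThreadMoxTest do
  use ExUnit.Case, async: true

  import Mox

  # Verifies all expectations when each test exits
  setup :verify_on_exit!
end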

2. Mock Scripts for Exec

When: Testing Exec GenServer with controlled output.

Example:

defmodule MockCodexScript do
  def create(events) when is_list(events) do
    script_content = """
    #!/bin/bash
    # Read stdin (ignore for mock)
    cat > /dev/null

    # Output events
    #{Enum.map_join(events, "\n", &"echo '#{&1}'")}

    exit 0
    """

    path = Path.join(System.tmp_dir!(), "mock_codex_#{System.unique_integer([:positive])}.sh")
    File.write!(path, script_content)
    File.chmod!(path, 0o755)

    path
  end

  def cleanup(path) do
    File.rm(path)
  end
end

# Usage in tests
test "exec processes events" do
  events = [
    Jason.encode!(%{type: "thread.started", thread_id: "t1"}),
    Jason.encode!(%{type: "turn.completed", usage: %{input_tokens: 5}})
  ]

  script = MockCodexScript.create(events)

  try do
    {:ok, pid} = Codex.Exec.start_link(codex_path: script, input: "test")
    # ... test assertions
  after
    MockCodexScript.cleanup(script)
  end
end

3. Test Doubles for Data

When: Testing with known data structures.

Example:

defmodule Codex.Fixtures do
  def thread_started_event(thread_id \\ "thread_test123") do
    %Codex.Events.ThreadStarted{
      type: :thread_started,
      thread_id: thread_id
    }
  end

  def agent_message_item(text \\ "Hello") do
    %Codex.Items.AgentMessage{
      id: "msg_#{System.unique_integer([:positive])}",
      type: :agent_message,
      text: text
    }
  end

  def complete_turn_result do
    %Codex.Turn.Result{
      items: [agent_message_item()],
      final_response: "Hello",
      usage: %Codex.Events.Usage{
        input_tokens: 10,
        cached_input_tokens: 0,
        output_tokens: 5
      }
    }
  end
end
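
Fixtures keep assertions focused on behavior rather than data setup:

test "fixture results carry the expected usage" do
  result = Codex.Fixtures.complete_turn_result()

  assert result.final_response == "Hello"
  assert result.usage.input_tokens == 10
end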

Coverage Goals

Overall Coverage: 95%+

Per Module:

  • Core modules (Codex, Thread, Exec): 100%
  • Type modules (Events, Items, Options): 100%
  • Utility modules (OutputSchemaFile): 95%
  • Test support modules: 80%

Coverage Tool: ExCoveralls

Configuration in mix.exs:

def project do
  [
    test_coverage: [tool: ExCoveralls],
    preferred_cli_env: [
      coveralls: :test,
      "coveralls.detail": :test,
      "coveralls.post": :test,
      "coveralls.html": :test
    ]
  ]
end
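
ExCoveralls also has to be declared as a test-only dependency (the version shown is indicative):

defp deps do
  [
    {:excoveralls, "~> 0.18", only: :test}
  ]
end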

Commands:

# Run tests with coverage
mix coveralls

# Detailed coverage report
mix coveralls.detail

# HTML coverage report
mix coveralls.html

# CI coverage (upload to Coveralls.io)
mix coveralls.github

Coverage Exceptions

Some code is deliberately excluded:

# coveralls-ignore-start
def debug_helper do
  # Only used in development
end
# coveralls-ignore-stop

Test Organization

Directory Structure

test/
├── codex_test.exs                   # Codex module tests
├── codex/
│   ├── thread_test.exs              # Thread module tests
│   ├── exec_test.exs                # Exec GenServer tests
│   ├── exec/
│   │   ├── parser_test.exs          # Event parser tests
│   │   └── integration_test.exs     # Exec integration tests
│   ├── events_test.exs              # Event type tests
│   ├── items_test.exs               # Item type tests
│   ├── options_test.exs             # Option struct tests
│   └── output_schema_file_test.exs  # Schema file helper tests
├── integration/
│   ├── basic_workflow_test.exs      # End-to-end workflows
│   ├── streaming_test.exs           # Streaming workflows
│   └── error_scenarios_test.exs     # Error handling
├── live/
│   └── real_codex_test.exs          # Tests with real API
├── property/
│   ├── events_property_test.exs     # Event properties
│   └── parsing_property_test.exs    # Parser properties
├── chaos/
│   └── resilience_test.exs          # Chaos engineering
├── support/
│   ├── fixtures.ex                  # Test data fixtures
│   ├── mock_codex_script.ex         # Mock script helper
│   └── supertester_helpers.ex       # Supertester utilities
└── test_helper.exs                  # Test configuration

File Naming

  • *_test.exs: Standard tests
  • *_integration_test.exs: Integration tests
  • *_property_test.exs: Property-based tests

Test Naming

Descriptive Names:

# Good
test "returns error when codex binary not found"
test "accumulates events until turn completes"
test "cleans up temp files on early stream halt"

# Bad
test "it works"
test "error case"
test "cleanup"

Describe Blocks:

describe "run/3" do
  test "executes turn successfully" do
    # ...
  end

  test "handles API errors gracefully" do
    # ...
  end
end

describe "run/3 with output schema" do
  test "creates temporary schema file" do
    # ...
  end

  test "cleans up schema file after turn" do
    # ...
  end
end

Assertions and Matchers

Standard Assertions

# Equality
assert result == expected
refute result == unexpected

# Pattern matching
assert %Codex.Events.ThreadStarted{thread_id: id} = event
assert id =~ ~r/thread_\w+/

# Boolean
assert Process.alive?(pid)
assert File.exists?(path)

# Membership
assert value in list
assert Map.has_key?(map, :key)

# Exceptions
assert_raise ArgumentError, fn ->
  %Codex.Events.ThreadStarted{}
end

# Messages
assert_receive {:event, ^ref, event}, 1000
refute_received {:error, _}

Custom Assertions

defmodule CodexSdk.Assertions do
  import ExUnit.Assertions

  def assert_valid_thread_id(thread_id) do
    assert is_binary(thread_id), "thread_id must be a string"
    assert String.starts_with?(thread_id, "thread_"), "thread_id must start with 'thread_'"
    assert String.length(thread_id) > 7, "thread_id must have content after prefix"
  end

  def assert_complete_turn_result(result) do
    assert %Codex.Turn.Result{} = result
    assert is_list(result.items)
    assert is_binary(result.final_response)
    assert %Codex.Events.Usage{} = result.usage
    assert result.usage.input_tokens > 0
  end

  def assert_events_in_order(events, expected_types) do
    actual_types = Enum.map(events, & &1.type)
    assert actual_types == expected_types,
      "Events out of order.\nExpected: #{inspect(expected_types)}\nActual: #{inspect(actual_types)}"
  end
end
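
Usage mirrors the built-in assertions:

import CodexSdk.Assertions

test "fixture events carry valid thread ids" do
  event = Codex.Fixtures.thread_started_event()
  assert_valid_thread_id(event.thread_id)
end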

Error Testing

Expected Errors

test "returns error for invalid schema" do
  thread = %Codex.Thread{...}

  result = Codex.Thread.run(
    thread,
    "test",
    %Codex.Turn.Options{output_schema: "invalid"}
  )

  assert {:error, {:invalid_schema, _}} = result
end

Error Propagation

test "propagates turn failure from codex" do
  mock_script = create_failing_mock([
    ~s({"type":"thread.started","thread_id":"t1"}),
    ~s({"type":"turn.failed","error":{"message":"API error"}})
  ])

  codex_opts = %Codex.Options{codex_path_override: mock_script}
  {:ok, thread} = Codex.start_thread(codex_opts)

  result = Codex.Thread.run(thread, "test")

  assert {:error, {:turn_failed, error}} = result
  assert error.message == "API error"
end

Error Recovery

test "recovers from transient errors" do
  # Test retry logic, fallbacks, etc.
end
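
A concrete version needs a failure that clears itself; a minimal sketch, using an Agent as a fail-once stub and a hypothetical retry_until_ok/2 helper:

test "retries transient failures before giving up" do
  {:ok, counter} = Agent.start_link(fn -> 0 end)

  # Fails on the first call, succeeds afterwards
  flaky = fn ->
    case Agent.get_and_update(counter, fn n -> {n, n + 1} end) do
      0 -> {:error, :transient}
      _ -> {:ok, :success}
    end
  end

  assert {:ok, :success} = retry_until_ok(flaky, 3)
end

# Hypothetical helper: retries a zero-arity fun until it returns {:ok, _}
defp retry_until_ok(fun, attempts) when attempts > 0 do
  case fun.() do
    {:ok, _} = ok -> ok
    {:error, _} when attempts > 1 -> retry_until_ok(fun, attempts - 1)
    error -> error
  end
end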

Performance Testing

Timing Assertions

test "parses event in under 1ms" do
  event_json = ~s({"type":"thread.started","thread_id":"t1"})

  {time_us, result} = :timer.tc(fn ->
    Codex.Exec.Parser.parse_event(event_json)
  end)

  assert {:ok, _event} = result
  assert time_us < 1000, "Parsing took #{time_us}µs, expected < 1000µs"
end

Load Testing

test "handles 100 concurrent turns" do
  threads = for _ <- 1..100 do
    {:ok, thread} = Codex.start_thread()
    thread
  end

  tasks = for thread <- threads do
    Task.async(fn ->
      Codex.Thread.run(thread, "test")
    end)
  end

  results = Task.await_many(tasks, 30_000)

  assert Enum.all?(results, fn
    {:ok, _} -> true
    _ -> false
  end)
end

Memory Testing

test "streaming does not accumulate memory" do
  {:ok, thread} = Codex.start_thread()
  {:ok, stream} = Codex.Thread.run_streamed(thread, "generate 1000 items")

  # Force a GC so measurements reflect retained memory, not pending garbage
  :erlang.garbage_collect()
  memory_before = :erlang.memory(:total)

  # Consume stream
  Enum.each(stream, fn _ -> :ok end)

  :erlang.garbage_collect()
  memory_after = :erlang.memory(:total)
  memory_delta = memory_after - memory_before

  # Should be roughly constant (< 1MB growth)
  assert memory_delta < 1_000_000,
    "Memory grew by #{memory_delta} bytes, expected < 1MB"
end

CI/CD Integration

GitHub Actions Workflow

name: CI

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        elixir: ['1.15', '1.16', '1.17']
        otp: ['25', '26', '27']
        exclude:
          # OTP 27 requires Elixir 1.17+
          - elixir: '1.15'
            otp: '27'
          - elixir: '1.16'
            otp: '27'

    steps:
      - uses: actions/checkout@v3

      - name: Setup Elixir
        uses: erlef/setup-beam@v1
        with:
          elixir-version: ${{ matrix.elixir }}
          otp-version: ${{ matrix.otp }}

      - name: Restore dependencies cache
        uses: actions/cache@v3
        with:
          path: deps
          key: ${{ runner.os }}-mix-${{ hashFiles('**/mix.lock') }}
          restore-keys: ${{ runner.os }}-mix-

      - name: Install dependencies
        run: mix deps.get

      - name: Run unit tests
        run: mix test --exclude live --exclude integration

      - name: Run integration tests
        run: mix test --only integration

      - name: Check coverage
        run: mix coveralls.github
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

      - name: Run Dialyzer
        run: mix dialyzer

      - name: Run Credo
        run: mix credo --strict

Local Pre-commit Checks

#!/bin/bash
# .git/hooks/pre-commit

echo "Running tests..."
mix test || exit 1

echo "Running Dialyzer..."
mix dialyzer || exit 1

echo "Running Credo..."
mix credo --strict || exit 1

echo "Checking coverage..."
mix coveralls || exit 1

echo "All checks passed!"

Best Practices

DO

  1. Write tests first - TDD approach
  2. Use descriptive names - Clearly state what's being tested
  3. Test one thing - Single responsibility per test
  4. Use Supertester - No Process.sleep
  5. Mock external deps - Fast, deterministic tests
  6. Test edge cases - Null, empty, invalid inputs
  7. Test errors - Both expected and unexpected
  8. Keep tests simple - Easy to understand and maintain
  9. Use fixtures - DRY test data
  10. Run tests often - Continuous feedback

DON'T

  1. Don't use Process.sleep - Use proper synchronization
  2. Don't test implementation - Test behavior, not internals
  3. Don't share state - Each test should be independent
  4. Don't skip failing tests - Fix or remove them
  5. Don't write flaky tests - Always reproducible
  6. Don't mock everything - Test real integrations when possible
  7. Don't ignore warnings - Keep Dialyzer clean
  8. Don't hardcode values - Use variables and constants
  9. Don't write long tests - Break into smaller tests
  10. Don't test external APIs - Mock or tag as :live

Troubleshooting

Flaky Tests

Symptoms: Test sometimes passes, sometimes fails.

Common Causes:

  • Using Process.sleep for synchronization
  • Shared state between tests
  • Race conditions
  • Timing assumptions

Solutions:

  • Use Supertester for proper sync
  • Ensure async: true is safe
  • Use assert_receive with timeout
  • Check for shared resources
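
For example, a sleep-based wait usually becomes a message assertion (a sketch with a hypothetical worker that reports back to the caller):

# Flaky: assumes the worker finishes within 100ms
# Process.sleep(100)
# assert :sys.get_state(worker).done

# Deterministic: block until the worker reports completion
GenServer.cast(worker, {:run, self()})
assert_receive {:done, ^worker}, 1_000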

Slow Tests

Symptoms: Tests take too long to run.

Common Causes:

  • Real API calls
  • Large data generation
  • Inefficient algorithms
  • Too much setup

Solutions:

  • Mock external calls
  • Use smaller test data
  • Optimize code under test
  • Cache expensive setup
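
Expensive setup shared across a module can run once via setup_all (load_expensive_fixtures/0 is a hypothetical helper):

setup_all do
  # Runs once per module instead of once per test
  {:ok, fixtures: load_expensive_fixtures()}
end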

Low Coverage

Symptoms: Coverage below target.

Common Causes:

  • Missing edge case tests
  • Untested error paths
  • Dead code

Solutions:

  • Review coverage report
  • Add missing tests
  • Remove dead code
  • Test all branches

Conclusion

A comprehensive testing strategy is essential for building reliable, maintainable software. By following TDD principles, using Supertester for deterministic OTP testing, maintaining high coverage, and organizing tests clearly, we ensure the Elixir Codex SDK is production-ready and trustworthy.

Key takeaways:

  • Test first - Write tests before implementation
  • No flakes - Use proper synchronization, not sleeps
  • High coverage - 95%+ with focus on critical paths
  • Fast feedback - Quick test runs enable rapid iteration
  • Clear organization - Well-structured tests are maintainable tests