Observability Enhancements – Design (2025-10-17)
Overview
Augment the telemetry stack with execution timing, error classification, and optional OTLP exporter integration so downstream monitoring can track latency and failure modes without manual instrumentation.
Goals
- Emit duration metadata for turn execution, tool invocations, and attachment cleanup.
- Tag telemetry events with `originator` (CLI vs SDK) for correlation.
- Provide optional OTLP exporter settings via application config.
- Document runbook for capturing logs and metrics in production.
Non-Goals
- Ship a bundled OTLP collector.
- Provide dashboards (Grafana, etc.).
- Persist metrics locally.
Architecture
- `Codex.Telemetry.emit/3` wraps duration conversions: it accepts `System.monotonic_time/1` input and normalises payloads with `:duration_ms`.
- Thread, tool, and approval emitters surface `originator: :sdk`, span tokens, and stop timestamps to support OpenTelemetry spans.
- `Codex.Telemetry.configure/1` restarts the OTEL apps with a simple processor when OTLP is enabled (via `CODEX_OTLP_ENABLE=1`), reading `CODEX_OTLP_ENDPOINT` and optional `CODEX_OTLP_HEADERS`, defaulting to `otel_exporter_pid` during tests.
- Provide runbook with commands for enabling exporters, tailing telemetry, and cleaning erlexec state.
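The duration normalisation described above could look roughly like the following. This is a minimal sketch, not the actual body of `Codex.Telemetry.emit/3`: the module `DurationSketch`, the function `put_duration_ms/4`, and the choice to pass the time unit explicitly are all assumptions made for illustration.

```elixir
defmodule DurationSketch do
  @moduledoc false
  # Hypothetical helper; the real conversion lives inside Codex.Telemetry.emit/3.

  # Takes start/stop readings from System.monotonic_time/0,1 (in `unit`)
  # and adds :duration_ms to the measurements map.
  def put_duration_ms(measurements, start, stop, unit \\ :native)
      when is_map(measurements) and is_integer(start) and is_integer(stop) do
    # System.convert_time_unit/3 floors the result to a whole millisecond.
    duration_ms = System.convert_time_unit(stop - start, unit, :millisecond)
    Map.put(measurements, :duration_ms, duration_ms)
  end
end

# Usage: time a block of work with native-unit monotonic readings.
start = System.monotonic_time()
Enum.sum(1..100_000)
stop = System.monotonic_time()

measurements = DurationSketch.put_duration_ms(%{}, start, stop)
IO.inspect(measurements, label: "measurements")
```

Monotonic time is used rather than wall-clock time so that the measured duration is unaffected by system clock adjustments.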
Implementation Steps
- Add `:duration_ms` (and stop-system timestamps) across thread, tool, and approval events.
- Introduce `Codex.Telemetry.configure/1` that restarts OTEL with a configured exporter and attaches span handlers.
- Attach OpenTelemetry spans to thread lifecycle events via telemetry handlers and `otel_exporter_pid` for tests.
- Document how to enable the exporter and verify spans in the runbook/ops docs.
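As an illustration of the environment-driven part of `Codex.Telemetry.configure/1`, the sketch below turns the three `CODEX_OTLP_*` variables into exporter options. Several details are assumptions rather than facts from this design: the module name `OtlpEnvSketch`, the `key=value,key=value` format for `CODEX_OTLP_HEADERS`, the `{:error, :missing_endpoint}` return, and the use of `:otlp_endpoint` / `:otlp_headers` as option keys (chosen to resemble the `opentelemetry_exporter` application config).

```elixir
defmodule OtlpEnvSketch do
  @moduledoc false
  # Hypothetical module; not the real Codex.Telemetry.configure/1.

  # Accepts a map of environment variables (for example System.get_env/0)
  # so the parsing can be exercised without touching the real environment.
  def exporter_opts(env) when is_map(env) do
    case {Map.get(env, "CODEX_OTLP_ENABLE"), Map.get(env, "CODEX_OTLP_ENDPOINT")} do
      {"1", endpoint} when is_binary(endpoint) and endpoint != "" ->
        headers = parse_headers(Map.get(env, "CODEX_OTLP_HEADERS"))
        {:ok, [otlp_endpoint: endpoint, otlp_headers: headers]}

      {"1", _missing} ->
        # Assumed behaviour: enabling without an endpoint is reported, not raised.
        {:error, :missing_endpoint}

      _other ->
        :disabled
    end
  end

  # Assumed header format: "key1=value1,key2=value2".
  defp parse_headers(nil), do: []

  defp parse_headers(raw) when is_binary(raw) do
    raw
    |> String.split(",", trim: true)
    |> Enum.flat_map(fn pair ->
      case String.split(pair, "=", parts: 2) do
        [key, value] -> [{String.trim(key), String.trim(value)}]
        _malformed -> []
      end
    end)
  end
end

# Usage: read the current process environment.
IO.inspect(OtlpEnvSketch.exporter_opts(System.get_env()), label: "otlp")
```

Reading the variables at runtime keeps the exporter an optional, runtime-only concern, which matches the environment-driven default discussed under Open Questions.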
Risks
- Optional OTLP dependency should be runtime-only; guard runtime starts and tolerate `:tls_certificate_check` being absent.
- Ensure exporter init errors fail gracefully (log warning, continue).
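One way to guard runtime starts is sketched below, under the assumption that the optional applications are started by name and that any failure should only be logged. `ExporterStartSketch` and `start_optional/1` are hypothetical names; `Application.ensure_all_started/1` and `Logger.warning/1` are standard Elixir.

```elixir
defmodule ExporterStartSketch do
  @moduledoc false
  # Hypothetical helper illustrating "log warning, continue".
  require Logger

  # Tries to start an optional OTP application. A missing or failing
  # application is logged and reported as :skipped instead of raising.
  def start_optional(app) when is_atom(app) do
    case Application.ensure_all_started(app) do
      {:ok, _started} ->
        :ok

      {:error, reason} ->
        Logger.warning("optional app #{inspect(app)} not started: #{inspect(reason)}")
        :skipped
    end
  end
end

# Usage: each optional dependency is attempted independently, so an absent
# :tls_certificate_check does not prevent the exporter from being tried.
results =
  for app <- [:tls_certificate_check, :opentelemetry_exporter] do
    {app, ExporterStartSketch.start_optional(app)}
  end

IO.inspect(results, label: "optional apps")
```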
Verification
- Tests capturing telemetry ensure duration present and within expected range.
- Integration test enabling exporter with mock OTLP collector (use `opentelemetry_exporter` test handler).
- Runbook instructions validated manually.
Open Questions
- Should exporter configuration live in `config/*.exs`? Default to environment-driven to avoid compile-time dependency.
- Do we need sampling controls? Possibly later; start with full stream.