This guide covers the Realtime API integration for bidirectional voice interactions and the Voice Pipeline for non-realtime STT -> Workflow -> TTS processing.
## Important: Architecture Note

The Realtime and Voice modules are ported from the OpenAI Agents Python SDK (`openai-agents-python`). Unlike the main Codex SDK features (`Codex.start_thread/2`, `Codex.resume_thread/3`), these modules make direct API calls to OpenAI rather than wrapping the `codex` CLI.
This means:
- Realtime/Voice use API key auth precedence: `CODEX_API_KEY` -> `OPENAI_API_KEY` in `auth.json` -> `OPENAI_API_KEY` environment variable
- Realtime uses WebSocket connections to `wss://api.openai.com/v1/realtime`
- Voice uses HTTP calls to OpenAI's STT/TTS endpoints
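For intuition, the key lookup order can be sketched like this. This is a minimal illustration only: `resolve_api_key/0` and `read_auth_json_key/0` are hypothetical helpers, not SDK functions, and the `auth.json` key name and `~/.codex` default are assumptions:

```elixir
defmodule KeyResolutionSketch do
  # Illustrative only: mirrors the precedence described above.
  def resolve_api_key do
    System.get_env("CODEX_API_KEY") ||
      read_auth_json_key() ||
      System.get_env("OPENAI_API_KEY")
  end

  # Reads OPENAI_API_KEY out of auth.json under CODEX_HOME
  # (assumed here to default to ~/.codex).
  defp read_auth_json_key do
    home = System.get_env("CODEX_HOME") || Path.expand("~/.codex")

    with {:ok, body} <- File.read(Path.join(home, "auth.json")),
         {:ok, json} <- Jason.decode(body) do
      json["OPENAI_API_KEY"]
    else
      _ -> nil
    end
  end
end
```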
## Overview
The Codex SDK provides two complementary approaches for voice-based interactions:
- **Realtime API** (`Codex.Realtime.*`): Bidirectional WebSocket streaming for real-time voice conversations with the OpenAI Realtime API
- **Voice Pipeline** (`Codex.Voice.*`): Non-realtime processing pipeline for speech-to-text, custom workflow execution, and text-to-speech
## Prerequisites
Both Realtime and Voice features require an OpenAI API key with access to the relevant models:
```bash
# Recommended
export CODEX_API_KEY=your-api-key-here

# Also supported
export OPENAI_API_KEY=your-api-key-here

# Or store OPENAI_API_KEY in auth.json under CODEX_HOME
```
Tokens from `codex login` alone are not used for these direct API paths; an API key is required.
For realtime examples with actual audio capture/playback, you'll need appropriate audio hardware and libraries.
## Realtime API

### Architecture
The Realtime API integration uses WebSocket-based bidirectional streaming:
```text
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│    Your App     │────>│ Realtime.Session│────>│ OpenAI Realtime │
│                 │<────│   (WebSockex)   │<────│       API       │
└─────────────────┘     └─────────────────┘     └─────────────────┘
        │                       │
        │                       ▼
        │              ┌─────────────────┐
        └─────────────>│ Realtime.Runner │
                       │ (Orchestrator)  │
                       └─────────────────┘
```

### Key Components
- `Codex.Realtime`: Main module with agent builder and convenience functions
- `Codex.Realtime.Session`: WebSocket GenServer managing the connection
- `Codex.Realtime.Runner`: High-level orchestrator for agent sessions
- `Codex.Realtime.Agent`: Agent configuration struct
- `Codex.Realtime.Events`: Session and model event types
### Creating a Realtime Agent
```elixir
alias Codex.Realtime

# Simple agent
agent = Realtime.agent(
  name: "Assistant",
  instructions: "You are a helpful voice assistant."
)

# Agent with tools
agent_with_tools = Realtime.agent(
  name: "WeatherBot",
  instructions: "Help users check the weather.",
  tools: [
    %{
      name: "get_weather",
      description: "Get current weather for a location",
      parameters: %{
        type: "object",
        properties: %{
          location: %{type: "string", description: "City name"}
        },
        required: ["location"]
      }
    }
  ]
)
```

### Session Configuration
Configure session behavior with `RunConfig` and `SessionModelSettings`:
```elixir
alias Codex.Realtime.Config.{RunConfig, SessionModelSettings, TurnDetectionConfig}

config = %RunConfig{
  model_settings: %SessionModelSettings{
    # Voice options: alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, cedar
    voice: "alloy",
    # Turn detection configuration
    turn_detection: %TurnDetectionConfig{
      type: :semantic_vad,  # or :server_vad
      eagerness: :medium    # :low, :medium, :high
    }
  }
}
```

### Starting a Session
```elixir
# Start a realtime session
{:ok, session} = Realtime.start_session(agent, config)

# Subscribe to events
Realtime.subscribe(session, self())

# The session is now ready to send/receive audio
```

### Sending Audio
```elixir
# Send audio data (PCM16 format)
Realtime.send_audio(session, audio_data)

# Commit the audio buffer (signals end of user turn)
Realtime.commit_audio(session)
```
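In practice you will usually feed audio incrementally rather than as one large blob. A minimal sketch, assuming 24 kHz mono PCM16 and a raw `recording.pcm` file (both are assumptions; match whatever format your session is configured for):

```elixir
# ~200 ms of 24 kHz mono PCM16: 24_000 samples/s * 0.2 s * 2 bytes = 9_600 bytes
chunk_bytes = 9_600

"recording.pcm"
|> File.stream!([], chunk_bytes)
|> Enum.each(fn chunk -> Realtime.send_audio(session, chunk) end)

# Signal end of the user's turn
Realtime.commit_audio(session)
```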
### Handling Events

```elixir
def handle_info({:realtime_event, event}, state) do
  case event do
    %Codex.Realtime.Events.RealtimeAudioEvent{audio: audio} ->
      # Play audio from the agent
      play_audio(audio)

    %Codex.Realtime.Events.RealtimeAgentStartEvent{} ->
      IO.puts("Agent started speaking")

    %Codex.Realtime.Events.RealtimeAgentStateEvent{state: agent_state} ->
      IO.puts("Agent state: #{agent_state}")

    %Codex.Realtime.Events.RealtimeToolCallEvent{name: name, args: args} ->
      # Handle tool call; assumes the session pid is kept in state
      result = execute_tool(name, args)
      Realtime.send_tool_result(state.session, event.call_id, result)

    %Codex.Realtime.Events.RealtimeErrorEvent{error: error} ->
      Logger.error("Realtime error: #{inspect(error)}")

    _ ->
      :ok
  end

  {:noreply, state}
end
```
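`execute_tool/2` above is your own function. A minimal sketch of a dispatcher for the `get_weather` tool defined earlier (the canned response is purely illustrative):

```elixir
# Hypothetical dispatcher for the handler above. The args map is the
# JSON-decoded tool-call arguments from the model.
defp execute_tool("get_weather", %{"location" => location}) do
  # Replace with a real weather lookup; a canned answer keeps the sketch short.
  "It is currently sunny and 22°C in #{location}."
end

defp execute_tool(name, _args) do
  "Unknown tool: #{name}"
end
```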
### Agent Handoffs

Transfer conversations between specialized agents:
```elixir
# Create specialized agents
greeter = Realtime.agent(
  name: "Greeter",
  instructions: "Welcome users and route to appropriate specialist."
)

tech_support = Realtime.agent(
  name: "TechSupport",
  instructions: "Provide technical assistance."
)

sales = Realtime.agent(
  name: "Sales",
  instructions: "Handle sales inquiries."
)

# Configure handoffs
greeter_with_handoffs =
  greeter
  |> Realtime.add_handoff(tech_support, condition: "Technical issues")
  |> Realtime.add_handoff(sales, condition: "Sales questions")

# Start session with the greeter
{:ok, session} = Realtime.start_session(greeter_with_handoffs, config)
```

### Session Lifecycle
Session behavior notes:

- `subscribe/2` and `unsubscribe/2` are idempotent.
- Tool execution runs outside the session callback path, so other session messages stay responsive (sketched below).
- WebSocket process exits are trapped and surfaced as session error events; the session process does not crash from linked socket exits.
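A sketch of that tool-execution pattern, if you route tool calls through your own process; this mirrors the idea, not the SDK's internals:

```elixir
# Run the tool in a Task so the calling process keeps handling
# audio and events while the tool executes.
def handle_info({:realtime_event, %Codex.Realtime.Events.RealtimeToolCallEvent{} = event}, state) do
  session = state.session

  Task.start(fn ->
    result = execute_tool(event.name, event.args)
    Codex.Realtime.send_tool_result(session, event.call_id, result)
  end)

  {:noreply, state}
end
```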
```elixir
# Stop the session
Realtime.stop_session(session)

# Or let it timeout/disconnect naturally
```

## Voice Pipeline
### Architecture
The Voice Pipeline processes audio in stages:
```text
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ Audio Input │────>│     STT     │────>│  Workflow   │────>│     TTS     │
│             │     │ (Transcribe)│     │  (Process)  │     │ (Synthesize)│
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
                                                                   │
                                                                   ▼
                                                            ┌─────────────┐
                                                            │Audio Output │
                                                            │  (Stream)   │
                                                            └─────────────┘
```

### Key Components
- `Codex.Voice.Pipeline`: Main orchestrator for the STT -> Workflow -> TTS flow
- `Codex.Voice.Workflow`: Behaviour for custom processing logic
- `Codex.Voice.SimpleWorkflow`: Simple function-based workflow
- `Codex.Voice.AgentWorkflow`: Workflow backed by `Codex.Agent`
- `Codex.Voice.Input.AudioInput`: Single audio buffer input
- `Codex.Voice.Input.StreamedAudioInput`: Streaming audio input
- `Codex.Voice.Result`: Streamed audio output
### Simple Workflow
For basic request-response patterns:
```elixir
alias Codex.Voice.{SimpleWorkflow, Config, Pipeline}

# Create a workflow with a handler function
workflow = SimpleWorkflow.new(
  fn transcribed_text ->
    # Process the text and return response(s)
    ["I understood: #{transcribed_text}. How can I help?"]
  end,
  greeting: "Hello! I'm listening."
)
```

### Agent Workflow
For multi-turn conversations backed by a Codex agent:
```elixir
alias Codex.Voice.AgentWorkflow

workflow = AgentWorkflow.new(
  agent: %{
    instructions: """
    You are a helpful coding assistant accessible via voice.
    Keep responses concise and clear for audio delivery.
    """,
    tools: [Codex.Tools.FileSearchTool]
  }
)
```

### Pipeline Configuration
```elixir
alias Codex.Voice.Config
alias Codex.Voice.Config.{STTSettings, TTSSettings}

config = %Config{
  workflow_name: "MyVoiceAssistant",
  # Speech-to-text settings
  stt_settings: %STTSettings{
    model: "gpt-4o-transcribe"
  },
  # Text-to-speech settings
  tts_settings: %TTSSettings{
    model: "gpt-4o-mini-tts",
    voice: :nova  # :alloy, :echo, :fable, :onyx, :nova, :shimmer
  }
}
```

### Running the Pipeline
#### Single-Turn Processing
```elixir
alias Codex.Voice.Pipeline
alias Codex.Voice.Input.AudioInput

# Start the pipeline
{:ok, pipeline} = Pipeline.start_link(
  workflow: workflow,
  config: config
)

# Create audio input (WAV format)
input = AudioInput.new(audio_data, format: :wav)

# Run the pipeline
{:ok, result} = Pipeline.run(pipeline, input)

# Process the streamed audio output
for event <- result do
  case event do
    %Codex.Voice.Events.VoiceStreamEventAudio{data: audio_chunk} ->
      play_audio(audio_chunk)

    %Codex.Voice.Events.VoiceStreamEventLifecycle{event: :completed} ->
      IO.puts("Processing complete")

    %Codex.Voice.Events.VoiceStreamEventError{error: error} ->
      Logger.error("Error: #{inspect(error)}")

    _ ->
      :ok
  end
end
```

#### Multi-Turn Streaming
```elixir
alias Codex.Voice.Input.StreamedAudioInput

# Create streaming input
input = StreamedAudioInput.new()

# Start streaming processing
{:ok, result_stream} = Pipeline.run_streamed(pipeline, input)

# Feed audio chunks in a separate task
Task.start(fn ->
  for chunk <- audio_source do
    StreamedAudioInput.push(input, chunk)
  end

  StreamedAudioInput.close(input)
end)

# Process results as they arrive
for event <- result_stream do
  handle_voice_event(event)
end
```
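`handle_voice_event/1` is your own function; a minimal version mirroring the single-turn case clauses might look like:

```elixir
# Hypothetical handler mirroring the single-turn example above.
defp handle_voice_event(%Codex.Voice.Events.VoiceStreamEventAudio{data: chunk}),
  do: play_audio(chunk)

defp handle_voice_event(%Codex.Voice.Events.VoiceStreamEventLifecycle{event: :completed}),
  do: IO.puts("Turn complete")

defp handle_voice_event(_event), do: :ok
```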
### Custom Workflow Implementation

Implement the `Codex.Voice.Workflow` behaviour for custom processing:
```elixir
defmodule MyCustomWorkflow do
  @behaviour Codex.Voice.Workflow

  defstruct [:state, :greeting]

  @impl true
  def new(opts) do
    %__MODULE__{
      state: opts[:initial_state] || %{},
      greeting: opts[:greeting]
    }
  end

  @impl true
  def greeting(%__MODULE__{greeting: greeting}), do: greeting

  @impl true
  def run(%__MODULE__{} = workflow, input_text) do
    # Process input and generate response(s)
    responses = process_input(input_text, workflow.state)

    # Return list of response strings
    {:ok, responses, workflow}
  end

  defp process_input(text, _state) do
    # Your custom logic here
    ["Processed: #{text}"]
  end
end
```
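Wiring the custom workflow into a pipeline then looks the same as with the built-in workflows (reusing the `config` from the Pipeline Configuration section):

```elixir
# Construct the workflow and hand it to the pipeline as before.
workflow = MyCustomWorkflow.new(greeting: "Hi! Ask me anything.")

{:ok, pipeline} = Codex.Voice.Pipeline.start_link(
  workflow: workflow,
  config: config
)
```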
### Audio Formats

The pipeline supports various audio formats:
```elixir
# WAV format (recommended for recordings)
input = AudioInput.new(wav_data, format: :wav)

# Raw PCM16
input = AudioInput.new(pcm_data, format: :pcm16, sample_rate: 16000)

# The pipeline auto-detects WAV headers when format is not specified
input = AudioInput.new(audio_data)
```
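WAV auto-detection typically keys off the RIFF container header; a sketch of such a check (illustrative only, the pipeline's own detection may differ):

```elixir
# A WAV file starts with "RIFF", a 4-byte little-endian size, then "WAVE".
def wav?(<<"RIFF", _size::little-32, "WAVE", _rest::binary>>), do: true
def wav?(_binary), do: false
```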
### Collecting Audio Output

```elixir
# Collect all audio chunks
audio_output =
  result
  |> Enum.filter(&match?(%Codex.Voice.Events.VoiceStreamEventAudio{}, &1))
  |> Enum.map(& &1.data)
  |> IO.iodata_to_binary()

# Save to file
File.write!("output.wav", Codex.Voice.WAV.encode(audio_output))
```

## Telemetry Events
Both Realtime and Voice emit telemetry events for observability:
### Realtime Events
```elixir
# Session lifecycle
[:codex, :realtime, :session, :start]
[:codex, :realtime, :session, :stop]
[:codex, :realtime, :session, :error]

# Audio events
[:codex, :realtime, :audio, :sent]
[:codex, :realtime, :audio, :received]

# Tool calls
[:codex, :realtime, :tool, :call]
[:codex, :realtime, :tool, :result]
```

### Voice Pipeline Events
```elixir
# Pipeline lifecycle
[:codex, :voice, :pipeline, :start]
[:codex, :voice, :pipeline, :stop]

# STT events
[:codex, :voice, :stt, :start]
[:codex, :voice, :stt, :complete]

# TTS events
[:codex, :voice, :tts, :start]
[:codex, :voice, :tts, :chunk]
[:codex, :voice, :tts, :complete]
```

### Attaching Handlers
```elixir
:telemetry.attach_many(
  "voice-handler",
  [
    [:codex, :voice, :pipeline, :start],
    [:codex, :voice, :pipeline, :stop],
    [:codex, :realtime, :session, :start]
  ],
  fn event, measurements, _metadata, _config ->
    Logger.info("#{inspect(event)}: #{inspect(measurements)}")
  end,
  nil
)
```

## Examples
The SDK includes comprehensive examples for both Realtime and Voice:
### Realtime Examples
```bash
# Basic session setup
mix run examples/realtime_basic.exs

# Function tools with realtime
mix run examples/realtime_tools.exs

# Multi-agent handoffs
mix run examples/realtime_handoffs.exs

# Full interactive demo
mix run examples/live_realtime_voice.exs
```
### Voice Pipeline Examples
```bash
# Basic STT -> Workflow -> TTS
mix run examples/voice_pipeline.exs

# Multi-turn conversations
mix run examples/voice_multi_turn.exs

# Agent-backed voice
mix run examples/voice_with_agent.exs
```
## Best Practices

### Realtime
- Handle disconnections: The WebSocket may disconnect; implement reconnection logic (see the sketch after this list)
- Monitor latency: Use telemetry to track round-trip times
- Buffer audio: Send audio in reasonable chunks (e.g., 200ms)
- Use semantic VAD: Provides better turn detection than server VAD
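A minimal reconnection sketch, assuming you keep the agent, config, and session pid in your process state and receive the error event from the Handling Events section (the backoff strategy is up to you):

```elixir
# On a session error, back off briefly and start a fresh session.
def handle_info({:realtime_event, %Codex.Realtime.Events.RealtimeErrorEvent{}}, state) do
  Process.sleep(1_000)

  {:ok, session} = Codex.Realtime.start_session(state.agent, state.config)
  Codex.Realtime.subscribe(session, self())

  {:noreply, %{state | session: session}}
end
```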
### Voice Pipeline
- Streaming for long audio: Use `StreamedAudioInput` for audio longer than a few seconds
- Keep responses concise: Shorter responses work better for voice
- Handle errors gracefully: The pipeline may fail at any stage
- Cache workflows: Reuse `AgentWorkflow` instances for multi-turn conversations
### General
- Test with real audio: Synthetic test audio may not represent real-world conditions
- Monitor costs: Both STT and TTS incur API costs
- Respect rate limits: OpenAI APIs have rate limits
- Handle silence: Users may pause; configure appropriate timeouts
## Troubleshooting

### Common Issues
**WebSocket connection fails**
- Check API key validity
- Verify network connectivity
- Check for firewall restrictions on WebSocket connections
**Audio not transcribed correctly**
- Ensure audio is in a supported format (WAV, PCM16)
- Check sample rate matches what the API expects (usually 16kHz)
- Verify audio quality (minimize background noise)
**TTS output sounds robotic**
- Try different voice options
- Adjust text for better prosody (shorter sentences, punctuation)
**High latency**
- Check network conditions
- Consider geographic proximity to API servers
- Use streaming for faster first-byte response
### Debug Logging
Enable debug logging for troubleshooting:
```elixir
# In config/config.exs
config :logger, level: :debug

# Or at runtime
Logger.configure(level: :debug)
```