Realtime and Voice Guide

This guide covers the Realtime API integration for bidirectional voice interactions and the Voice Pipeline for non-realtime STT -> Workflow -> TTS processing.

Important: Architecture Note

The Realtime and Voice modules are ported from the OpenAI Agents Python SDK (openai-agents-python). Unlike the main Codex SDK features (Codex.start_thread/2, Codex.resume_thread/3), these modules make direct API calls to OpenAI rather than wrapping the codex CLI.

This means:

  • Realtime/Voice resolve the API key with this precedence: CODEX_API_KEY -> OPENAI_API_KEY from auth.json -> the OPENAI_API_KEY environment variable (see the sketch after this list)
  • Realtime uses WebSocket connections to wss://api.openai.com/v1/realtime
  • Voice uses HTTP calls to OpenAI's STT/TTS endpoints
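
The documented precedence can be reproduced if your application needs the same key elsewhere. This is a hedged sketch of the lookup order, not the SDK's actual resolver; the default CODEX_HOME path and the auth.json key name are assumptions, and Jason is assumed as the JSON decoder.

defmodule MyApp.ApiKey do
  # Resolve in the documented order:
  # CODEX_API_KEY -> auth.json OPENAI_API_KEY -> OPENAI_API_KEY env var.
  def resolve do
    System.get_env("CODEX_API_KEY") || auth_json_key() || System.get_env("OPENAI_API_KEY")
  end

  defp auth_json_key do
    # Assumed default location; the SDK may resolve CODEX_HOME differently.
    path = Path.join(System.get_env("CODEX_HOME", Path.expand("~/.codex")), "auth.json")

    with {:ok, body} <- File.read(path),
         {:ok, %{"OPENAI_API_KEY" => key}} <- Jason.decode(body) do
      key
    else
      _ -> nil
    end
  end
end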

Overview

The Codex SDK provides two complementary approaches for voice-based interactions:

  1. Realtime API (Codex.Realtime.*): Bidirectional WebSocket streaming for real-time voice conversations with the OpenAI Realtime API
  2. Voice Pipeline (Codex.Voice.*): Non-realtime processing pipeline for speech-to-text, custom workflow execution, and text-to-speech

Prerequisites

Both Realtime and Voice features require an OpenAI API key with access to the relevant models:

# Recommended
export CODEX_API_KEY=your-api-key-here

# Also supported
export OPENAI_API_KEY=your-api-key-here

# Or store OPENAI_API_KEY in auth.json under CODEX_HOME

Tokens obtained via codex login alone are not used for these direct API paths.

For realtime examples with actual audio capture/playback, you'll need appropriate audio hardware and libraries.


Realtime API

Architecture

The Realtime API integration uses WebSocket-based bidirectional streaming:

          
+------------+      +--------------------+      +------------------+
|            | ---> |  Realtime.Session  | ---> |  OpenAI Realtime |
|  Your App  |      |    (WebSockex)     |      |       API        |
|            | <--- |                    | <--- |                  |
+------------+      +--------------------+      +------------------+
                              |
                              v
                    +--------------------+
                    |  Realtime.Runner   |
                    |   (Orchestrator)   |
                    +--------------------+

Key Components

Creating a Realtime Agent

alias Codex.Realtime

# Simple agent
agent = Realtime.agent(
  name: "Assistant",
  instructions: "You are a helpful voice assistant."
)

# Agent with tools
agent_with_tools = Realtime.agent(
  name: "WeatherBot",
  instructions: "Help users check the weather.",
  tools: [
    %{
      name: "get_weather",
      description: "Get current weather for a location",
      parameters: %{
        type: "object",
        properties: %{
          location: %{type: "string", description: "City name"}
        },
        required: ["location"]
      }
    }
  ]
)

Session Configuration

Configure session behavior with RunConfig and SessionModelSettings:

alias Codex.Realtime.Config.{RunConfig, SessionModelSettings, TurnDetectionConfig}

config = %RunConfig{
  model_settings: %SessionModelSettings{
    # Voice options: alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, cedar
    voice: "alloy",

    # Turn detection configuration
    turn_detection: %TurnDetectionConfig{
      type: :semantic_vad,  # or :server_vad
      eagerness: :medium    # :low, :medium, :high
    }
  }
}

Starting a Session

# Start a realtime session
{:ok, session} = Realtime.start_session(agent, config)

# Subscribe to events
Realtime.subscribe(session, self())

# The session is now ready to send/receive audio

Sending Audio

# Send audio data (PCM16 format)
Realtime.send_audio(session, audio_data)

# Commit the audio buffer (signals end of user turn)
Realtime.commit_audio(session)

Handling Events

def handle_info({:realtime_event, event}, state) do
  case event do
    %Codex.Realtime.Events.RealtimeAudioEvent{audio: audio} ->
      # Play audio from the agent
      play_audio(audio)

    %Codex.Realtime.Events.RealtimeAgentStartEvent{} ->
      IO.puts("Agent started speaking")

    %Codex.Realtime.Events.RealtimeAgentStateEvent{state: agent_state} ->
      IO.puts("Agent state: #{agent_state}")

    %Codex.Realtime.Events.RealtimeToolCallEvent{name: name, args: args} ->
      # Handle tool call
      result = execute_tool(name, args)
      Realtime.send_tool_result(state.session, event.call_id, result)

    %Codex.Realtime.Events.RealtimeErrorEvent{error: error} ->
      Logger.error("Realtime error: #{inspect(error)}")

    _ ->
      :ok
  end

  {:noreply, state}
end

Agent Handoffs

Transfer conversations between specialized agents:

# Create specialized agents
greeter = Realtime.agent(
  name: "Greeter",
  instructions: "Welcome users and route to appropriate specialist."
)

tech_support = Realtime.agent(
  name: "TechSupport",
  instructions: "Provide technical assistance."
)

sales = Realtime.agent(
  name: "Sales",
  instructions: "Handle sales inquiries."
)

# Configure handoffs
greeter_with_handoffs = greeter
  |> Realtime.add_handoff(tech_support, condition: "Technical issues")
  |> Realtime.add_handoff(sales, condition: "Sales questions")

# Start session with the greeter
{:ok, session} = Realtime.start_session(greeter_with_handoffs, config)

Session Lifecycle

Session behavior notes:

  • subscribe/2 and unsubscribe/2 are idempotent.
  • Tool execution runs outside the session callback path so other session messages stay responsive.
  • WebSocket process exits are trapped and surfaced as session error events; the session process does not crash from linked socket exits (see the reconnection sketch below).

# Stop the session
Realtime.stop_session(session)

# Or let it timeout/disconnect naturally
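
Because socket failures surface as error events instead of crashing the session process, reconnection is the caller's responsibility. Below is a minimal sketch of a session holder that reconnects when an error event arrives; the immediate-restart policy (no backoff) is an illustrative assumption, not SDK behavior.

defmodule MyApp.VoiceSession do
  use GenServer

  alias Codex.Realtime

  def start_link({agent, config}), do: GenServer.start_link(__MODULE__, {agent, config})

  @impl true
  def init({agent, config}) do
    {:ok, connect(%{agent: agent, config: config, session: nil})}
  end

  @impl true
  def handle_info({:realtime_event, %Codex.Realtime.Events.RealtimeErrorEvent{}}, state) do
    # Tear down the failed session and start a fresh one. A production
    # version would add backoff, logging, and a retry limit.
    Realtime.stop_session(state.session)
    {:noreply, connect(state)}
  end

  def handle_info({:realtime_event, _event}, state), do: {:noreply, state}

  defp connect(state) do
    {:ok, session} = Realtime.start_session(state.agent, state.config)
    Realtime.subscribe(session, self())
    %{state | session: session}
  end
end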

Voice Pipeline

Architecture

The Voice Pipeline processes audio in stages:

               
+-------------+     +--------------+     +------------+     +--------------+
| Audio Input | --> |     STT      | --> |  Workflow  | --> |     TTS      |
|             |     | (Transcribe) |     | (Process)  |     | (Synthesize) |
+-------------+     +--------------+     +------------+     +--------------+
                                                                   |
                                                                   v
                                                           +--------------+
                                                           | Audio Output |
                                                           |   (Stream)   |
                                                           +--------------+

Key Components

Simple Workflow

For basic request-response patterns:

alias Codex.Voice.SimpleWorkflow

# Create a workflow with a handler function
workflow = SimpleWorkflow.new(
  fn transcribed_text ->
    # Process the text and return response(s)
    ["I understood: #{transcribed_text}. How can I help?"]
  end,
  greeting: "Hello! I'm listening."
)

Agent Workflow

For multi-turn conversations backed by a Codex agent:

alias Codex.Voice.AgentWorkflow

workflow = AgentWorkflow.new(
  agent: %{
    instructions: """
    You are a helpful coding assistant accessible via voice.
    Keep responses concise and clear for audio delivery.
    """,
    tools: [Codex.Tools.FileSearchTool]
  }
)

Pipeline Configuration

alias Codex.Voice.Config
alias Codex.Voice.Config.{STTSettings, TTSSettings}

config = %Config{
  workflow_name: "MyVoiceAssistant",

  # Speech-to-text settings
  stt_settings: %STTSettings{
    model: "gpt-4o-transcribe"
  },

  # Text-to-speech settings
  tts_settings: %TTSSettings{
    model: "gpt-4o-mini-tts",
    voice: :nova  # :alloy, :echo, :fable, :onyx, :nova, :shimmer
  }
}

Running the Pipeline

Single-Turn Processing

alias Codex.Voice.Pipeline
alias Codex.Voice.Input.AudioInput

# Start the pipeline
{:ok, pipeline} = Pipeline.start_link(
  workflow: workflow,
  config: config
)

# Create audio input (WAV format)
input = AudioInput.new(audio_data, format: :wav)

# Run the pipeline
{:ok, result} = Pipeline.run(pipeline, input)

# Process the streamed audio output
for event <- result do
  case event do
    %Codex.Voice.Events.VoiceStreamEventAudio{data: audio_chunk} ->
      play_audio(audio_chunk)

    %Codex.Voice.Events.VoiceStreamEventLifecycle{event: :completed} ->
      IO.puts("Processing complete")

    %Codex.Voice.Events.VoiceStreamEventError{error: error} ->
      Logger.error("Error: #{inspect(error)}")

    _ ->
      :ok
  end
end

Multi-Turn Streaming

alias Codex.Voice.Input.StreamedAudioInput

# Create streaming input
input = StreamedAudioInput.new()

# Start streaming processing
{:ok, result_stream} = Pipeline.run_streamed(pipeline, input)

# Feed audio chunks in a separate task
Task.start(fn ->
  for chunk <- audio_source do
    StreamedAudioInput.push(input, chunk)
  end
  StreamedAudioInput.close(input)
end)

# Process results as they arrive
for event <- result_stream do
  handle_voice_event(event)
end

Custom Workflow Implementation

Implement the Codex.Voice.Workflow behaviour for custom processing:

defmodule MyCustomWorkflow do
  @behaviour Codex.Voice.Workflow

  defstruct [:state, :greeting]

  @impl true
  def new(opts) do
    %__MODULE__{
      state: opts[:initial_state] || %{},
      greeting: opts[:greeting]
    }
  end

  @impl true
  def greeting(%__MODULE__{greeting: greeting}), do: greeting

  @impl true
  def run(%__MODULE__{} = workflow, input_text) do
    # Process input and generate response(s)
    responses = process_input(input_text, workflow.state)

    # Return list of response strings
    {:ok, responses, workflow}
  end

  defp process_input(text, _state) do
    # Your custom logic here
    ["Processed: #{text}"]
  end
end

Audio Formats

The pipeline supports various audio formats:

# WAV format (recommended for recordings)
input = AudioInput.new(wav_data, format: :wav)

# Raw PCM16
input = AudioInput.new(pcm_data, format: :pcm16, sample_rate: 16000)

# The pipeline auto-detects WAV headers when format is not specified
input = AudioInput.new(audio_data)

Collecting Audio Output

# Collect all audio chunks
audio_output = result
  |> Enum.filter(&match?(%Codex.Voice.Events.VoiceStreamEventAudio{}, &1))
  |> Enum.map(& &1.data)
  |> IO.iodata_to_binary()

# Save to file
File.write!("output.wav", Codex.Voice.WAV.encode(audio_output))

Telemetry Events

Both Realtime and Voice emit telemetry events for observability:

Realtime Events

# Session lifecycle
[:codex, :realtime, :session, :start]
[:codex, :realtime, :session, :stop]
[:codex, :realtime, :session, :error]

# Audio events
[:codex, :realtime, :audio, :sent]
[:codex, :realtime, :audio, :received]

# Tool calls
[:codex, :realtime, :tool, :call]
[:codex, :realtime, :tool, :result]

Voice Pipeline Events

# Pipeline lifecycle
[:codex, :voice, :pipeline, :start]
[:codex, :voice, :pipeline, :stop]

# STT events
[:codex, :voice, :stt, :start]
[:codex, :voice, :stt, :complete]

# TTS events
[:codex, :voice, :tts, :start]
[:codex, :voice, :tts, :chunk]
[:codex, :voice, :tts, :complete]

Attaching Handlers

:telemetry.attach_many(
  "voice-handler",
  [
    [:codex, :voice, :pipeline, :start],
    [:codex, :voice, :pipeline, :stop],
    [:codex, :realtime, :session, :start]
  ],
  fn event, measurements, _metadata, _config ->
    Logger.info("#{inspect(event)}: #{inspect(measurements)}")
  end,
  nil
)
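
Beyond logging, these events can drive rough latency measurement (see the Realtime best practices below). The sketch derives a send-to-receive round trip from the audio events; the ETS bookkeeping is illustrative, and in a real application the table should be owned by a supervised process.

# Create a table to remember when audio was last sent.
:ets.new(:rt_latency, [:named_table, :public])

:telemetry.attach_many(
  "latency-tracker",
  [
    [:codex, :realtime, :audio, :sent],
    [:codex, :realtime, :audio, :received]
  ],
  fn
    [:codex, :realtime, :audio, :sent], _measurements, _metadata, _config ->
      :ets.insert(:rt_latency, {:last_sent, System.monotonic_time(:millisecond)})

    [:codex, :realtime, :audio, :received], _measurements, _metadata, _config ->
      case :ets.lookup(:rt_latency, :last_sent) do
        [{:last_sent, t}] ->
          IO.puts("approx round trip: #{System.monotonic_time(:millisecond) - t} ms")

        [] ->
          :ok
      end
  end,
  nil
)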

Examples

The SDK includes comprehensive examples for both Realtime and Voice:

Realtime Examples

# Basic session setup
mix run examples/realtime_basic.exs

# Function tools with realtime
mix run examples/realtime_tools.exs

# Multi-agent handoffs
mix run examples/realtime_handoffs.exs

# Full interactive demo
mix run examples/live_realtime_voice.exs

Voice Pipeline Examples

# Basic STT -> Workflow -> TTS
mix run examples/voice_pipeline.exs

# Multi-turn conversations
mix run examples/voice_multi_turn.exs

# Agent-backed voice
mix run examples/voice_with_agent.exs

Best Practices

Realtime

  1. Handle disconnections: The WebSocket may disconnect; implement reconnection logic
  2. Monitor latency: Use telemetry to track round-trip times
  3. Buffer audio: Send audio in reasonable chunks, e.g. 200ms (see the sketch after this list)
  4. Use semantic VAD: Provides better turn detection than server VAD
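
A minimal sketch of practice 3, assuming 24 kHz mono PCM16 audio; the chunk size follows from that assumption, so match it to your session's configured audio format.

defmodule AudioChunker do
  # 200 ms of 24 kHz mono PCM16: 24_000 samples/s * 0.2 s * 2 bytes/sample.
  @chunk_bytes 9_600

  def send_in_chunks(session, pcm16) do
    # The binary comprehension walks the buffer in fixed-size chunks;
    # any trailing partial chunk is forwarded separately below.
    for <<chunk::binary-size(@chunk_bytes) <- pcm16>> do
      Codex.Realtime.send_audio(session, chunk)
    end

    rest = rem(byte_size(pcm16), @chunk_bytes)

    if rest > 0 do
      Codex.Realtime.send_audio(session, binary_part(pcm16, byte_size(pcm16) - rest, rest))
    end

    # Signal the end of the user's turn.
    Codex.Realtime.commit_audio(session)
  end
end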

Voice Pipeline

  1. Streaming for long audio: Use StreamedAudioInput for audio longer than a few seconds
  2. Keep responses concise: Shorter responses work better for voice
  3. Handle errors gracefully: The pipeline may fail at any stage
  4. Cache workflows: Reuse AgentWorkflow instances for multi-turn conversations (see the sketch after this list)
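
A minimal sketch of practice 4, reusing one pipeline (and therefore its workflow state) across turns. Here config is the pipeline configuration from earlier; the WAV binaries and handle_voice_event/1 are assumed to exist.

workflow = Codex.Voice.AgentWorkflow.new(agent: %{instructions: "Keep replies short."})
{:ok, pipeline} = Codex.Voice.Pipeline.start_link(workflow: workflow, config: config)

# Each turn goes through the same pipeline process, so conversation
# state accumulates in the workflow instead of being rebuilt per turn.
for wav <- [first_turn_wav, second_turn_wav] do
  input = Codex.Voice.Input.AudioInput.new(wav, format: :wav)
  {:ok, result} = Codex.Voice.Pipeline.run(pipeline, input)
  Enum.each(result, &handle_voice_event/1)
end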

General

  1. Test with real audio: Synthetic test audio may not represent real-world conditions
  2. Monitor costs: Both STT and TTS incur API costs
  3. Respect rate limits: OpenAI APIs have rate limits
  4. Handle silence: Users may pause; configure appropriate timeouts

Troubleshooting

Common Issues

WebSocket connection fails

  • Check API key validity
  • Verify network connectivity
  • Check for firewall restrictions on WebSocket connections

Audio not transcribed correctly

  • Ensure audio is in a supported format (WAV, PCM16)
  • Check sample rate matches what the API expects (usually 16kHz)
  • Verify audio quality (minimize background noise)

TTS output sounds robotic

  • Try different voice options
  • Adjust text for better prosody (shorter sentences, punctuation)

High latency

  • Check network conditions
  • Consider geographic proximity to API servers
  • Use streaming for faster first-byte response

Debug Logging

Enable debug logging for troubleshooting:

# In config/config.exs
config :logger, level: :debug

# Or at runtime
Logger.configure(level: :debug)