This guide walks through how `mix feline.talk` works: how microphone audio
flows through the pipeline, gets transcribed, generates an LLM response,
synthesizes speech, and plays it back, all in real time.
## The Pipeline

```
Mic (ffmpeg Port)
  │
  ▼ InputAudioRawFrame (16kHz, mono, 16-bit PCM)
VADProcessor
  │
  ▼ InputAudioRawFrame + UserStartedSpeakingFrame / UserStoppedSpeakingFrame
Deepgram.StreamingSTT
  │
  ▼ TranscriptionFrame
ConsoleLogger.UserInput ─── prints "You: ..."
  │
  ▼ TranscriptionFrame
UserContextAggregator ─── appends user message to LLM context, pushes LLMContextFrame
  │
  ▼ LLMContextFrame
OpenAI.StreamingLLM
  │
  ▼ LLMTextFrame (one per token) + LLMFullResponseStartFrame / LLMFullResponseEndFrame
AssistantContextAggregator ─── appends assistant message to LLM context
  │
  ▼ LLMTextFrame
ConsoleLogger.BotOutput ─── prints "Bot: ..." token by token
  │
  ▼ LLMTextFrame
SentenceAggregator ─── buffers tokens, emits TextFrame per complete sentence
  │
  ▼ TextFrame (one per sentence)
ElevenLabs.StreamingTTS ─── WebSocket streaming, emits audio chunks
  │
  ▼ TTSAudioRawFrame (24kHz, mono, 16-bit PCM)
AudioPlayer ─── buffers audio, plays via sox on TTSStoppedFrame
  │
  ▼ BotStartedSpeakingFrame / BotStoppedSpeakingFrame (upstream)
```

## Stage by Stage
### 1. Microphone Capture
The Mix task opens an `ffmpeg` process as an Erlang Port:

```sh
ffmpeg -f avfoundation -i :0 -ac 1 -ar 16000 -f s16le pipe:1
```

This captures the default macOS microphone and outputs raw PCM to stdout.
The task reads binary data from the port and chunks it into 640-byte frames
(20 ms at 16 kHz, mono, 16-bit). Each chunk becomes an `InputAudioRawFrame`
injected into the pipeline via `Pipeline.Task.queue_frame/2`.
See lib/mix/tasks/feline.talk.ex.
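A minimal sketch of that capture loop, assuming the frame struct has an `audio` field and that `Pipeline.Task.queue_frame/2` takes the task and a frame (module and helper names here are illustrative, not the exact source):

```elixir
defmodule Feline.MicCaptureSketch do
  # 20 ms at 16 kHz, mono, 16-bit PCM: 16_000 * 2 bytes * 0.020 s = 640 bytes
  @chunk_bytes 640

  def start(task) do
    port =
      Port.open(
        {:spawn_executable, System.find_executable("ffmpeg")},
        [:binary,
         args: ~w(-f avfoundation -i :0 -ac 1 -ar 16000 -f s16le pipe:1)]
      )

    loop(task, port, <<>>)
  end

  # Accumulate stdout bytes and peel off fixed-size 640-byte frames.
  defp loop(task, port, acc) do
    receive do
      {^port, {:data, bytes}} ->
        {frames, rest} = chunk(acc <> bytes, [])

        Enum.each(frames, fn pcm ->
          Pipeline.Task.queue_frame(task, %InputAudioRawFrame{audio: pcm})
        end)

        loop(task, port, rest)
    end
  end

  defp chunk(<<frame::binary-size(@chunk_bytes), rest::binary>>, acc),
    do: chunk(rest, [frame | acc])

  defp chunk(rest, acc), do: {Enum.reverse(acc), rest}
end
```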
### 2. Voice Activity Detection
`VADProcessor` runs energy-based VAD on each audio chunk. It maintains a
state machine with two states: `:quiet` and `:speaking`.

- When energy exceeds the threshold for `start_secs` (0.2 s), it emits `UserStartedSpeakingFrame`.
- When energy drops below the threshold for `stop_secs` (0.8 s), it emits `UserStoppedSpeakingFrame`.
Echo suppression: when `bot_speaking` is true (set by
`BotStartedSpeakingFrame` flowing upstream from `AudioPlayer`), the VAD drops
all `InputAudioRawFrame`s. This prevents the bot from hearing its own voice
through the speakers.
See lib/feline/processors/vad_processor.ex.
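The two-state transition logic can be sketched roughly as below, assuming one energy reading per 20 ms chunk and hypothetical field names (`threshold`, `start_secs`, `stop_secs`, and the elapsed-time accumulators):

```elixir
defmodule Feline.VADSketch do
  @chunk_secs 0.020

  # :quiet -> :speaking once energy has stayed above threshold for start_secs.
  def step(%{state: :quiet} = vad, energy) do
    if energy > vad.threshold do
      above = vad.above_secs + @chunk_secs

      if above >= vad.start_secs,
        do: {%{vad | state: :speaking, above_secs: 0.0}, [:user_started_speaking]},
        else: {%{vad | above_secs: above}, []}
    else
      {%{vad | above_secs: 0.0}, []}
    end
  end

  # :speaking -> :quiet once energy has stayed below threshold for stop_secs.
  def step(%{state: :speaking} = vad, energy) do
    if energy < vad.threshold do
      below = vad.below_secs + @chunk_secs

      if below >= vad.stop_secs,
        do: {%{vad | state: :quiet, below_secs: 0.0}, [:user_stopped_speaking]},
        else: {%{vad | below_secs: below}, []}
    else
      {%{vad | below_secs: 0.0}, []}
    end
  end
end
```

Requiring the energy condition to hold for a minimum duration on both edges debounces the detector against short noise spikes and brief pauses mid-utterance.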
### 3. Speech-to-Text (Deepgram)
`Deepgram.StreamingSTT` maintains a WebSocket connection to Deepgram's
real-time transcription API. It forwards `InputAudioRawFrame` audio bytes
over the socket and receives JSON transcription results back. Final
transcriptions become `TranscriptionFrame`s.
See lib/feline/services/deepgram/streaming_stt.ex.
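A sketch of turning one Deepgram live-API result message into a frame. The JSON shape (`is_final`, `channel.alternatives[].transcript`) follows Deepgram's documented streaming format; the use of `Jason` and the frame field names are assumptions:

```elixir
# Only forward final, non-empty transcripts; drop interim results.
def handle_message(json) do
  case Jason.decode!(json) do
    %{"is_final" => true,
      "channel" => %{"alternatives" => [%{"transcript" => text} | _]}}
    when text != "" ->
      {:push, %TranscriptionFrame{text: text}}

    _interim_or_empty ->
      :noop
  end
end
```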
### 4. Context Management
Two processors work as a pair to manage LLM conversation history:
- `UserContextAggregator` absorbs `TranscriptionFrame`, appends the user's message to a shared context (via an Agent process), and pushes an `LLMContextFrame` containing the full conversation history.
- `AssistantContextAggregator` collects `LLMTextFrame` tokens into the assistant's response and appends it to the shared context when `LLMFullResponseEndFrame` arrives.
The shared context Agent (`ContextAggregatorPair`) ensures both aggregators
see the same conversation history.
See lib/feline/processors/user_context_aggregator.ex and
lib/feline/processors/context_aggregator_pair.ex.
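The shared-context Agent can be sketched as below; the message shape (`%{role: ..., content: ...}`) mirrors the OpenAI chat format, and the function names are illustrative rather than the actual `ContextAggregatorPair` API:

```elixir
defmodule Feline.ContextSketch do
  use Agent

  # State is the ordered message list, seeded with the system prompt.
  def start_link(system_prompt) do
    Agent.start_link(fn -> [%{role: "system", content: system_prompt}] end)
  end

  # Both aggregators call this: "user" from transcriptions,
  # "assistant" from collected LLM tokens.
  def append(pid, role, content) do
    Agent.update(pid, &(&1 ++ [%{role: role, content: content}]))
  end

  def messages(pid), do: Agent.get(pid, & &1)
end
```

Because an Agent serializes access through a single process, the two aggregators never see a torn or interleaved history even though they run concurrently.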
### 5. LLM (OpenAI)
`OpenAI.StreamingLLM` receives `LLMContextFrame`, calls the OpenAI Chat
Completions API with `stream: true`, and pushes one `LLMTextFrame` per
token. The streaming request runs in a supervised task so it doesn't block
the processor's GenServer.

The response is bookended by `LLMFullResponseStartFrame` and
`LLMFullResponseEndFrame`.
See lib/feline/services/openai/streaming_llm.ex.
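The per-token push can be sketched as one handler per server-sent-events chunk. The `data: {"choices":[{"delta":{"content": ...}}]}` shape and the `data: [DONE]` sentinel follow OpenAI's documented streaming format; `Jason` and the frame structs are assumptions:

```elixir
# End of stream: OpenAI sends a literal "[DONE]" sentinel.
def handle_sse_chunk("data: [DONE]"), do: {:push, %LLMFullResponseEndFrame{}}

# One delta chunk -> one LLMTextFrame.
def handle_sse_chunk("data: " <> json) do
  case Jason.decode!(json) do
    %{"choices" => [%{"delta" => %{"content" => token}} | _]}
    when is_binary(token) ->
      {:push, %LLMTextFrame{text: token}}

    # Role-only deltas and finish_reason chunks carry no text.
    _other ->
      :noop
  end
end
```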
### 6. Sentence Aggregation
`SentenceAggregator` buffers `LLMTextFrame` tokens until a sentence
boundary (`.`, `!`, `?`) is found, then pushes the complete sentence as a
`TextFrame`. This allows TTS to start synthesizing as soon as the first
sentence is ready, without waiting for the entire LLM response.

On `LLMFullResponseEndFrame`, any remaining buffered text is flushed.
See lib/feline/processors/sentence_aggregator.ex.
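The buffering rule reduces to a small pure function; the boundary regex here is a simplification of whatever the real aggregator uses:

```elixir
defmodule Feline.SentenceSketch do
  # A sentence ends when the buffer's last non-space character is . ! or ?
  @boundary ~r/[.!?]\s*$/

  def push_token(buffer, token) do
    buffer = buffer <> token

    if Regex.match?(@boundary, buffer) do
      {:emit, String.trim(buffer)}
    else
      {:buffer, buffer}
    end
  end
end
```

Emitting per sentence rather than per response keeps time-to-first-audio close to the latency of generating one sentence, instead of the whole reply.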
### 7. Text-to-Speech (ElevenLabs)
`ElevenLabs.StreamingTTS` uses the ElevenLabs WebSocket streaming API.
The protocol:

- Connect to `wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input`
- BOS (Beginning of Stream): send `{"text": " "}` to initialize
- Text: send `{"text": "Hello!", "flush": true}` for each sentence
- EOS (End of Stream): send `{"text": ""}` when the LLM response is complete
Audio chunks arrive as base64-encoded PCM in `{"audio": "..."}` messages
and are decoded into `TTSAudioRawFrame`s. When `{"isFinal": true}` arrives,
a `TTSStoppedFrame` is pushed.
See lib/feline/services/elevenlabs/streaming_tts.ex.
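The inbound side of that protocol can be sketched as a single message handler; `Jason` and the frame struct fields are assumptions:

```elixir
# Decode one ElevenLabs stream-input message into a pipeline frame.
def handle_ws_message(json) do
  case Jason.decode!(json) do
    # Base64 PCM chunk -> raw audio frame for the AudioPlayer.
    %{"audio" => b64} when is_binary(b64) ->
      {:push, %TTSAudioRawFrame{audio: Base.decode64!(b64)}}

    # Server signals it has sent all audio for this stream.
    %{"isFinal" => true} ->
      {:push, %TTSStoppedFrame{}}

    _other ->
      :noop
  end
end
```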
### 8. Audio Playback
`AudioPlayer` buffers all `TTSAudioRawFrame` audio bytes for the current
utterance. When `TTSStoppedFrame` arrives, it writes the buffer to a temp
file and spawns sox's `play` to play it back.
Echo suppression coordination: `AudioPlayer` pushes
`BotStartedSpeakingFrame` upstream when audio starts arriving, and
`BotStoppedSpeakingFrame` when the `play` process finishes (not when
buffering stops). This keeps the VAD's mic-mute active for the entire
duration of audible playback.
See lib/feline/processors/audio_player.ex.
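The temp-file-then-`play` step might look like this; the sox flags describe the raw 24 kHz PCM format, and in the real processor the blocking `System.cmd/2` call would run in a spawned process so the GenServer can still receive interruptions:

```elixir
# Flush buffered 24 kHz mono 16-bit PCM to a temp file and play it with sox.
def play(pcm_iodata) do
  path =
    Path.join(System.tmp_dir!(), "feline-#{System.unique_integer([:positive])}.raw")

  File.write!(path, pcm_iodata)

  # `play` is sox's playback front end; the flags declare the raw format
  # since a headerless PCM file carries no metadata.
  {_out, 0} =
    System.cmd("play", [
      "-q", "-t", "raw", "-r", "24000",
      "-e", "signed", "-b", "16", "-c", "1",
      path
    ])

  File.rm(path)
end
```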
## Interruption Flow
When you speak while the bot is talking:
- VAD detects speech during `bot_speaking` and emits `InterruptionFrame`
- `InterruptionFrame` is a system frame: it jumps the queue via selective receive
- Each processor handles it: StreamingTTS closes the WebSocket, AudioPlayer kills the `play` process and discards buffered audio, SentenceAggregator clears its buffer
- The new user speech flows through normally once the interruption clears
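"Jumps the queue" here refers to Erlang selective receive: a `receive` pattern that matches only interruption messages is checked against the whole mailbox before anything else is processed. A minimal illustration, with hypothetical message shapes:

```elixir
# Drain any pending InterruptionFrame first, regardless of how many
# ordinary frames are queued ahead of it in the mailbox.
receive do
  {:frame, %InterruptionFrame{} = frame} ->
    handle_interruption(frame)
after
  # No interruption waiting: fall through to normal frame processing.
  0 -> process_next_queued_frame()
end
```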
## Typed Input
You can also type messages directly in the console. The stdin reader
appends the message to the shared context and injects an `LLMContextFrame`
into the pipeline, bypassing STT entirely.
## Debug Logging
All internal TTS and WebSocket logging uses `Logger.debug`. Enable it with:

```sh
FELINE_DEBUG=1 mix feline.talk
```