ExLLM User Guide
This comprehensive guide covers all features and capabilities of the ExLLM library.
Table of Contents
- Installation and Setup
- Configuration
- Basic Usage
- Providers
- Chat Completions
- Streaming
- Session Management
- Context Management
- Function Calling
- Vision and Multimodal
- Embeddings
- Structured Outputs
- Cost Tracking
- Error Handling and Retries
- Caching
- Response Caching
- Model Discovery
- Provider Capabilities
- Logging
- Testing with Mock Adapter
- Advanced Topics
Installation and Setup
Adding to Your Project
Add ExLLM to your mix.exs dependencies:
def deps do
[
{:ex_llm, "~> 0.4.1"},
# Included dependencies (automatically installed with ex_llm):
# - {:instructor, "~> 0.1.0"} - For structured outputs
# - {:bumblebee, "~> 0.5"} - For local model inference
# - {:nx, "~> 0.7"} - For numerical computing
# Optional hardware acceleration (choose one):
# {:exla, "~> 0.7"} # For CUDA/ROCm GPUs
# {:emlx, github: "elixir-nx/emlx", branch: "main"} # For Apple Silicon
]
end
Run mix deps.get to install the dependencies.
Optional Dependencies
- Req: HTTP client (automatically included)
- Jason: JSON parser (automatically included)
- Instructor: Structured outputs with schema validation (automatically included)
- Bumblebee: Local model inference (automatically included)
- Nx: Numerical computing (automatically included)
- EXLA: CUDA/ROCm GPU acceleration (optional)
- EMLX: Apple Silicon Metal acceleration (optional)
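If you add one of the acceleration backends, Nx also needs to be told to use it. A minimal sketch for EXLA (for Apple Silicon, the EMLX package provides an `EMLX.Backend` module instead; check each project's README):

```elixir
# config/config.exs -- assumes you chose the EXLA backend.
import Config

config :nx, default_backend: EXLA.Backend
```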
Configuration
ExLLM supports multiple configuration methods to suit different use cases.
Environment Variables
The simplest way to configure ExLLM:
# OpenAI
export OPENAI_API_KEY="sk-..."
export OPENAI_API_BASE="https://api.openai.com/v1" # Optional custom endpoint
# Anthropic
export ANTHROPIC_API_KEY="sk-ant-..."
# Google Gemini
export GOOGLE_API_KEY="..."
# or
export GEMINI_API_KEY="..."
# Groq
export GROQ_API_KEY="gsk_..."
# OpenRouter
export OPENROUTER_API_KEY="sk-or-..."
# X.AI
export XAI_API_KEY="xai-..."
# Mistral AI
export MISTRAL_API_KEY="..."
# Perplexity
export PERPLEXITY_API_KEY="pplx-..."
# AWS Bedrock
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_REGION="us-east-1"
# Ollama
export OLLAMA_API_BASE="http://localhost:11434"
# LM Studio
export LMSTUDIO_API_BASE="http://localhost:1234"
Static Configuration
For more control, use static configuration:
config = %{
openai: %{
api_key: "sk-...",
api_base: "https://api.openai.com/v1",
default_model: "gpt-4o"
},
anthropic: %{
api_key: "sk-ant-...",
default_model: "claude-3-5-sonnet-20241022"
}
}
{:ok, provider} = ExLLM.ConfigProvider.Static.start_link(config)
# Use with config_provider option
{:ok, response} = ExLLM.chat(:openai, messages, config_provider: provider)
Custom Configuration Provider
Implement your own configuration provider:
defmodule MyApp.ConfigProvider do
@behaviour ExLLM.ConfigProvider
def get([:openai, :api_key]), do: fetch_from_vault("openai_key")
def get([:anthropic, :api_key]), do: fetch_from_vault("anthropic_key")
def get(_path), do: nil
def get_all() do
%{
openai: %{api_key: fetch_from_vault("openai_key")},
anthropic: %{api_key: fetch_from_vault("anthropic_key")}
}
end
end
# Use it
{:ok, response} = ExLLM.chat(:openai, messages,
config_provider: MyApp.ConfigProvider
)
Basic Usage
Simple Chat
messages = [
%{role: "user", content: "Hello, how are you?"}
]
{:ok, response} = ExLLM.chat(:openai, messages)
IO.puts(response.content)
Provider/Model Syntax
# Use provider/model string syntax
{:ok, response} = ExLLM.chat("anthropic/claude-3-haiku-20240307", messages)
# Equivalent to
{:ok, response} = ExLLM.chat(:anthropic, messages,
model: "claude-3-haiku-20240307"
)
Response Structure
%ExLLM.Types.LLMResponse{
content: "I'm doing well, thank you!",
model: "gpt-4o",
finish_reason: "stop",
usage: %{
input_tokens: 12,
output_tokens: 8,
total_tokens: 20
},
cost: %{
input_cost: 0.00006,
output_cost: 0.00016,
total_cost: 0.00022,
currency: "USD"
}
}
Providers
Supported Providers
ExLLM supports these providers out of the box:
- :openai - OpenAI GPT models
- :anthropic - Anthropic Claude models
- :gemini - Google Gemini models
- :groq - Groq fast inference
- :mistral - Mistral AI models
- :perplexity - Perplexity search-enhanced models
- :ollama - Local models via Ollama
- :lmstudio - Local models via LM Studio
- :bedrock - AWS Bedrock
- :openrouter - OpenRouter (300+ models)
- :xai - X.AI Grok models
- :bumblebee - Local models via Bumblebee/NX
- :mock - Mock adapter for testing
Checking Provider Configuration
# Check if a provider is configured
if ExLLM.configured?(:openai) do
{:ok, response} = ExLLM.chat(:openai, messages)
end
# Get default model for a provider
model = ExLLM.default_model(:anthropic)
# => "claude-3-5-sonnet-20241022"
# List available models
{:ok, models} = ExLLM.list_models(:openai)
for model <- models do
IO.puts("#{model.id}: #{model.context_window} tokens")
end
Chat Completions
Basic Options
{:ok, response} = ExLLM.chat(:openai, messages,
model: "gpt-4o", # Specific model
temperature: 0.7, # Sampling temperature; higher = more creative (valid range varies by provider)
max_tokens: 1000, # Max response length
top_p: 0.9, # Nucleus sampling
frequency_penalty: 0.5, # Reduce repetition
presence_penalty: 0.5, # Encourage new topics
stop: ["\n\n", "END"], # Stop sequences
seed: 12345, # Reproducible outputs
timeout: 60_000 # Request timeout in ms (default: provider-specific)
)
Timeout Configuration
Different providers have different timeout requirements. ExLLM allows you to configure timeouts per request:
# Ollama with function calling (can be slow)
{:ok, response} = ExLLM.chat(:ollama, messages,
functions: functions,
timeout: 300_000 # 5 minutes
)
# Quick requests with shorter timeout
{:ok, response} = ExLLM.chat(:openai, messages,
timeout: 30_000 # 30 seconds
)
Default timeouts:
- Ollama: 120,000ms (2 minutes) - Local models can be slower
- Other providers: Use their HTTP client defaults (typically 30-60 seconds)
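If you want these defaults in one place in your own code, a small helper works; this is a hypothetical sketch (the 120_000 and 60_000 figures mirror the defaults described above, not values read from ExLLM):

```elixir
# Hypothetical helper: pick a per-provider timeout in milliseconds,
# falling back to a generic 60s when the provider has no special default.
defmodule MyApp.Timeouts do
  @defaults %{ollama: 120_000}

  def for_provider(provider, override \\ nil) do
    override || Map.get(@defaults, provider, 60_000)
  end
end
```

You can then pass `timeout: MyApp.Timeouts.for_provider(provider)` to each call.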
System Messages
messages = [
%{role: "system", content: "You are a helpful coding assistant."},
%{role: "user", content: "How do I read a file in Elixir?"}
]
{:ok, response} = ExLLM.chat(:openai, messages)
Multi-turn Conversations
conversation = [
%{role: "user", content: "What's the capital of France?"},
%{role: "assistant", content: "The capital of France is Paris."},
%{role: "user", content: "What's the population?"}
]
{:ok, response} = ExLLM.chat(:openai, conversation)
Streaming
Basic Streaming
{:ok, stream} = ExLLM.stream_chat(:openai, messages)
for chunk <- stream do
case chunk do
%{content: content} when content != nil ->
IO.write(content)
%{finish_reason: reason} when reason != nil ->
IO.puts("\nFinished: #{reason}")
_ ->
# Other chunk types (role, etc.)
:ok
end
end
Streaming with Callback
{:ok, stream} = ExLLM.stream_chat(:openai, messages,
on_chunk: fn chunk ->
if chunk.content, do: IO.write(chunk.content)
end
)
# Consume the stream
Enum.to_list(stream)
Collecting Streamed Response
{:ok, stream} = ExLLM.stream_chat(:openai, messages)
# Collect all chunks into a single response
full_content =
stream
|> Enum.map(& &1.content)
|> Enum.reject(&is_nil/1)
|> Enum.join("")
Stream Recovery
Enable automatic stream recovery for interrupted streams:
{:ok, stream} = ExLLM.stream_chat(:openai, messages,
stream_recovery: true,
recovery_strategy: :exact # :exact, :paragraph, or :summarize
)
# If stream is interrupted, you can resume
{:ok, resumed_stream} = ExLLM.resume_stream(recovery_id)
Session Management
Sessions provide stateful conversation management with automatic token tracking.
Creating and Using Sessions
# Create a new session
session = ExLLM.new_session(:openai, name: "Customer Support")
# Chat with session (automatically manages message history)
{:ok, {response, session}} = ExLLM.chat_with_session(
session,
"What's the weather like?"
)
# Continue the conversation
{:ok, {response2, session}} = ExLLM.chat_with_session(
session,
"What should I wear?"
)
# Check token usage
total_tokens = ExLLM.session_token_usage(session)
IO.puts("Total tokens used: #{total_tokens}")
Managing Session Messages
# Add messages manually
session = ExLLM.add_session_message(session, "user", "Hello!")
session = ExLLM.add_session_message(session, "assistant", "Hi there!")
# Get message history
messages = ExLLM.get_session_messages(session)
recent_10 = ExLLM.get_session_messages(session, 10)
# Clear messages but keep session metadata
session = ExLLM.clear_session(session)
Persisting Sessions
# Save session to JSON
{:ok, json} = ExLLM.save_session(session)
File.write!("session.json", json)
# Load session from JSON
{:ok, json} = File.read("session.json")
{:ok, restored_session} = ExLLM.load_session(json)
Session with Context
# Create session with default context
session = ExLLM.new_session(:openai,
name: "Tech Support",
context: %{
temperature: 0.3,
system_message: "You are a technical support agent."
}
)
# Context is automatically applied to all chats
{:ok, {response, session}} = ExLLM.chat_with_session(session, "Help!")
Context Management
Automatically manage conversation context to fit within model limits.
Context Window Validation
# Check if messages fit in context window
case ExLLM.validate_context(messages, provider: :openai, model: "gpt-4") do
{:ok, token_count} ->
IO.puts("Messages use #{token_count} tokens")
{:error, reason} ->
IO.puts("Messages too large: #{reason}")
end
# Get context window size for a model
window_size = ExLLM.context_window_size(:anthropic, "claude-3-opus-20240229")
# => 200000
Automatic Message Truncation
# Prepare messages to fit in context window
truncated = ExLLM.prepare_messages(long_conversation,
provider: :openai,
model: "gpt-4",
max_tokens: 4000, # Reserve tokens for response
strategy: :sliding_window, # or :smart
preserve_messages: 5 # Always keep last 5 messages
)
Truncation Strategies
- :sliding_window - Keep most recent messages
- :smart - Preserve system messages and recent context
# Smart truncation preserves important context
{:ok, response} = ExLLM.chat(:openai, very_long_conversation,
strategy: :smart,
preserve_messages: 10
)
Context Statistics
stats = ExLLM.context_stats(messages)
# => %{
# message_count: 20,
# total_tokens: 1500,
# by_role: %{"user" => 10, "assistant" => 9, "system" => 1},
# avg_tokens_per_message: 75
# }
Function Calling
Enable AI models to call functions/tools in your application.
Basic Function Calling
# Define available functions
functions = [
%{
name: "get_weather",
description: "Get current weather for a location",
parameters: %{
type: "object",
properties: %{
location: %{
type: "string",
description: "City and state, e.g. San Francisco, CA"
},
unit: %{
type: "string",
enum: ["celsius", "fahrenheit"],
description: "Temperature unit"
}
},
required: ["location"]
}
}
]
# Let the AI decide when to call functions
{:ok, response} = ExLLM.chat(:openai,
[%{role: "user", content: "What's the weather in NYC?"}],
functions: functions,
function_call: "auto" # or "none" or %{name: "get_weather"}
)
Handling Function Calls
# Parse function calls from response
case ExLLM.parse_function_calls(response, :openai) do
{:ok, [function_call | _]} ->
# AI wants to call a function
IO.inspect(function_call)
# => %ExLLM.FunctionCalling.FunctionCall{
# name: "get_weather",
# arguments: %{"location" => "New York, NY"}
# }
# Execute the function
result = get_weather_impl(function_call.arguments["location"])
# Format result for conversation
function_message = ExLLM.format_function_result(
%ExLLM.FunctionCalling.FunctionResult{
name: "get_weather",
result: result
},
:openai
)
# Continue conversation with the function result (response_message is
# the assistant message that requested the call)
messages = messages ++ [response_message, function_message]
{:ok, final_response} = ExLLM.chat(:openai, messages)
{:ok, []} ->
# No function call, regular response
IO.puts(response.content)
end
Function Execution
# Define functions with handlers
functions_with_handlers = [
%{
name: "calculate",
description: "Perform mathematical calculations",
parameters: %{
type: "object",
properties: %{
expression: %{type: "string"}
},
required: ["expression"]
},
handler: fn args ->
# Your implementation (Code.eval_string is unsafe for untrusted
# input; use a real expression parser in production)
{result, _} = Code.eval_string(args["expression"])
%{result: result}
end
}
]
# Execute function automatically
{:ok, result} = ExLLM.execute_function(function_call, functions_with_handlers)
Provider-Specific Notes
Different providers use different terminology:
- OpenAI: "functions" and "function_call"
- Anthropic: "tools" and "tool_use"
- ExLLM normalizes these automatically
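As an illustration of what that normalization involves, here is a hypothetical sketch (not ExLLM's internal code): one neutral definition maps onto each provider's wire format, where Anthropic's tools API names the schema field input_schema and OpenAI's functions API names it parameters.

```elixir
# Hypothetical sketch of tool-definition normalization. ExLLM's real
# internals differ; this only shows the shape of the translation.
defmodule ToolSketch do
  def to_provider(%{name: n, description: d, parameters: p}, :openai),
    do: %{"name" => n, "description" => d, "parameters" => p}

  def to_provider(%{name: n, description: d, parameters: p}, :anthropic),
    do: %{"name" => n, "description" => d, "input_schema" => p}
end
```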
Vision and Multimodal
Work with images and other media types.
Basic Image Analysis
# Create a vision message
{:ok, message} = ExLLM.vision_message(
"What's in this image?",
["path/to/image.jpg"]
)
# Send to vision-capable model
{:ok, response} = ExLLM.chat(:openai, [message],
model: "gpt-4o" # or any vision model
)
Multiple Images
{:ok, message} = ExLLM.vision_message(
"Compare these images",
[
"image1.jpg",
"image2.jpg",
"https://example.com/image3.png" # URLs work too
],
detail: :high # :low, :high, or :auto
)
Loading Images
# Load image with options
{:ok, image_part} = ExLLM.load_image("photo.jpg",
detail: :high,
resize: {1024, 1024} # Optional resizing
)
# Build custom message
message = %{
role: "user",
content: [
%{type: "text", text: "Describe this image"},
image_part
]
}
Checking Vision Support
# Check if provider/model supports vision
if ExLLM.supports_vision?(:anthropic, "claude-3-opus-20240229") do
# This model supports vision
end
# Find all vision-capable models
vision_models = ExLLM.find_models_with_features([:vision])
Text Extraction from Images
# OCR-like functionality
{:ok, text} = ExLLM.extract_text_from_image(:openai, "document.png",
model: "gpt-4o",
prompt: "Extract all text, preserving formatting and layout"
)
Image Analysis
# Analyze multiple images
{:ok, analysis} = ExLLM.analyze_images(:anthropic,
["chart1.png", "chart2.png"],
"Compare these charts and identify trends",
model: "claude-3-5-sonnet-20241022"
)
Embeddings
Generate vector embeddings for semantic search and similarity.
Basic Embeddings
# Generate embeddings for text
{:ok, response} = ExLLM.embeddings(:openai,
["Hello world", "Goodbye world"]
)
# Response structure
%ExLLM.Types.EmbeddingResponse{
embeddings: [
[0.0123, -0.0456, ...], # 1536 dimensions for text-embedding-3-small
[0.0789, -0.0234, ...]
],
model: "text-embedding-3-small",
usage: %{total_tokens: 8}
}
Embedding Options
{:ok, response} = ExLLM.embeddings(:openai, texts,
model: "text-embedding-3-large",
dimensions: 256, # Reduce dimensions (model-specific)
encoding_format: "float" # or "base64"
)
Similarity Search
# Calculate similarity between embeddings
similarity = ExLLM.cosine_similarity(embedding1, embedding2)
# => 0.87 (1.0 = identical, 0.0 = orthogonal, -1.0 = opposite)
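For reference, cosine similarity is just the normalized dot product of the two vectors. A self-contained sketch of the computation (illustrative, not ExLLM's implementation):

```elixir
# Cosine similarity: dot(a, b) / (|a| * |b|)
defmodule SimilaritySketch do
  def cosine(a, b) do
    dot = Enum.zip(a, b) |> Enum.reduce(0.0, fn {x, y}, acc -> acc + x * y end)
    norm = fn v -> :math.sqrt(Enum.reduce(v, 0.0, fn x, acc -> acc + x * x end)) end
    dot / (norm.(a) * norm.(b))
  end
end

SimilaritySketch.cosine([1.0, 0.0], [1.0, 0.0])
# => 1.0
```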
# Find similar items
query_embedding = get_embedding("search query")
items = [
%{id: 1, text: "Document 1", embedding: [...]},
%{id: 2, text: "Document 2", embedding: [...]},
# ...
]
results = ExLLM.find_similar(query_embedding, items,
top_k: 10,
threshold: 0.7 # Minimum similarity
)
# => [
# %{item: %{id: 2, ...}, similarity: 0.92},
# %{item: %{id: 5, ...}, similarity: 0.85},
# ...
# ]
Listing Embedding Models
{:ok, models} = ExLLM.list_embedding_models(:openai)
for model <- models do
IO.puts("#{model.name}: #{model.dimensions} dimensions")
end
Caching Embeddings
# Enable caching for embeddings
{:ok, response} = ExLLM.embeddings(:openai, texts,
cache: true,
cache_ttl: :timer.hours(24)
)
Structured Outputs
Generate structured data with schema validation using Instructor integration.
Basic Structured Output
defmodule EmailClassification do
use Ecto.Schema
embedded_schema do
field :category, Ecto.Enum, values: [:personal, :work, :spam]
field :priority, Ecto.Enum, values: [:high, :medium, :low]
field :summary, :string
end
end
{:ok, result} = ExLLM.chat(:openai,
[%{role: "user", content: "Classify this email: Meeting tomorrow at 3pm"}],
response_model: EmailClassification,
max_retries: 3 # Retry on validation failure
)
IO.inspect(result)
# => %EmailClassification{
# category: :work,
# priority: :high,
# summary: "Meeting scheduled for tomorrow"
# }
Complex Schemas
defmodule ProductExtraction do
use Ecto.Schema
embedded_schema do
field :name, :string
field :price, :decimal
field :currency, :string
field :in_stock, :boolean
embeds_many :features, Feature do
field :name, :string
field :value, :string
end
end
def changeset(struct, params) do
struct
|> Ecto.Changeset.cast(params, [:name, :price, :currency, :in_stock])
|> Ecto.Changeset.cast_embed(:features)
|> Ecto.Changeset.validate_required([:name, :price])
|> Ecto.Changeset.validate_number(:price, greater_than: 0)
end
end
{:ok, product} = ExLLM.chat(:anthropic,
[%{role: "user", content: "Extract product info from: iPhone 15 Pro, $999, 256GB storage, A17 chip"}],
response_model: ProductExtraction
)
Lists and Collections
defmodule TodoList do
use Ecto.Schema
embedded_schema do
embeds_many :todos, Todo do
field :task, :string
field :priority, Ecto.Enum, values: [:high, :medium, :low]
field :completed, :boolean, default: false
end
end
end
{:ok, todo_list} = ExLLM.chat(:openai,
[%{role: "user", content: "Create a todo list for launching a new feature"}],
response_model: TodoList
)
Cost Tracking
ExLLM automatically tracks API costs for all operations.
Automatic Cost Tracking
{:ok, response} = ExLLM.chat(:openai, messages)
# Cost is included in response
IO.inspect(response.cost)
# => %{
# input_cost: 0.00003,
# output_cost: 0.00006,
# total_cost: 0.00009,
# currency: "USD"
# }
# Format for display
IO.puts(ExLLM.format_cost(response.cost.total_cost))
# => "$0.009¢"
Manual Cost Calculation
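The arithmetic behind cost calculation is plain per-million-token pricing. A sketch using the example rates shown below (real pricing comes from ExLLM's bundled model data, not these numbers):

```elixir
# Per-million-token pricing arithmetic (illustrative rates only).
defmodule CostSketch do
  def cost(%{input_tokens: i, output_tokens: o}, per_m_in, per_m_out) do
    input_cost = i / 1_000_000 * per_m_in
    output_cost = o / 1_000_000 * per_m_out
    %{input_cost: input_cost, output_cost: output_cost, total_cost: input_cost + output_cost}
  end
end

CostSketch.cost(%{input_tokens: 1000, output_tokens: 500}, 30.0, 120.0)
# matches the 0.03 / 0.06 / 0.09 figures below, within float rounding
```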
usage = %{input_tokens: 1000, output_tokens: 500}
cost = ExLLM.calculate_cost(:openai, "gpt-4", usage)
# => %{
# input_cost: 0.03,
# output_cost: 0.06,
# total_cost: 0.09,
# currency: "USD",
# per_million_input: 30.0,
# per_million_output: 120.0
# }
Token Estimation
# Estimate tokens for text
tokens = ExLLM.estimate_tokens("Hello, world!")
# => 4
# Estimate for messages
tokens = ExLLM.estimate_tokens([
%{role: "user", content: "Hi"},
%{role: "assistant", content: "Hello!"}
])
# => 12
Disabling Cost Tracking
{:ok, response} = ExLLM.chat(:openai, messages,
track_cost: false
)
# response.cost will be nil
Error Handling and Retries
Automatic Retries
Retries are enabled by default with exponential backoff:
{:ok, response} = ExLLM.chat(:openai, messages,
retry: true, # Default: true
retry_count: 3, # Default: 3 attempts
retry_delay: 1000, # Default: 1 second initial delay
retry_backoff: :exponential, # or :linear
retry_jitter: true # Add randomness to prevent thundering herd
)
Error Types
case ExLLM.chat(:openai, messages) do
{:ok, response} ->
IO.puts(response.content)
{:error, %ExLLM.Error{type: :rate_limit} = error} ->
IO.puts("Rate limited. Retry after: #{error.retry_after}")
{:error, %ExLLM.Error{type: :invalid_api_key}} ->
IO.puts("Check your API key configuration")
{:error, %ExLLM.Error{type: :context_length_exceeded}} ->
IO.puts("Message too long for model")
{:error, %ExLLM.Error{type: :timeout}} ->
IO.puts("Request timed out")
{:error, error} ->
IO.inspect(error)
end
Custom Retry Logic
defmodule MyApp.RetryHandler do
def with_custom_retry(provider, messages, opts \\ []) do
Enum.reduce_while(1..5, nil, fn attempt, _acc ->
case ExLLM.chat(provider, messages, Keyword.put(opts, :retry, false)) do
{:ok, response} ->
{:halt, {:ok, response}}
{:error, %{type: :rate_limit} = error} ->
wait_time = error.retry_after || :timer.seconds(attempt * 10)
Process.sleep(wait_time)
{:cont, nil}
{:error, _} = error ->
if attempt == 5 do
{:halt, error}
else
Process.sleep(:timer.seconds(attempt))
{:cont, nil}
end
end
end)
end
end
Caching
Cache responses to reduce API calls and costs.
Basic Caching
# Enable caching globally
Application.put_env(:ex_llm, :cache_enabled, true)
# Or per request
{:ok, response} = ExLLM.chat(:openai, messages,
cache: true,
cache_ttl: :timer.minutes(15) # Default: 15 minutes
)
# Same request will use cache
{:ok, cached_response} = ExLLM.chat(:openai, messages, cache: true)
Cache Management
# Clear specific cache entry
ExLLM.Cache.delete(cache_key)
# Clear all cache
ExLLM.Cache.clear()
# Get cache stats
stats = ExLLM.Cache.stats()
# => %{size: 42, hits: 100, misses: 20}
Custom Cache Keys
# Cache key is automatically generated from:
# - Provider
# - Messages
# - Relevant options (model, temperature, etc.)
# You can also use manual cache management
cache_key = ExLLM.Cache.generate_cache_key(:openai, messages, options)
Response Caching
Cache real provider responses for offline testing and to cut development costs.
ExLLM provides two approaches for response caching:
- Unified Cache System (Recommended) - Extends the runtime cache with optional disk persistence
- Legacy Response Cache - Standalone response collection system
Unified Cache System (Recommended)
The unified cache system extends ExLLM's runtime performance cache with optional disk persistence. This provides both speed benefits and testing capabilities from a single system.
Enabling Unified Cache Persistence
# Method 1: Environment variables (temporary)
export EX_LLM_CACHE_PERSIST=true
export EX_LLM_CACHE_DIR="/path/to/cache" # Optional
# Method 2: Runtime configuration (recommended for tests)
ExLLM.Cache.configure_disk_persistence(true, "/path/to/cache")
# Method 3: Application configuration
config :ex_llm,
cache_persist_disk: true,
cache_disk_path: "/tmp/ex_llm_cache"
Automatic Response Collection with Unified Cache
When persistence is enabled, all cached responses are automatically stored to disk:
# Normal caching usage - responses automatically persist to disk when enabled
{:ok, response} = ExLLM.chat(messages, provider: :openai, cache: true)
{:ok, response} = ExLLM.chat(messages, provider: :anthropic, cache: true)
Benefits of Unified Cache System
- Zero performance impact when persistence is disabled (default)
- Single configuration controls both runtime cache and disk persistence
- Natural development workflow - enable during development, disable in production
- Automatic mock integration - cached responses work seamlessly with Mock adapter
Legacy Response Cache System
For compatibility, the original response cache system is still available:
Enabling Legacy Response Caching
# Enable response caching via environment variables
export EX_LLM_CACHE_RESPONSES=true
export EX_LLM_CACHE_DIR="/path/to/cache" # Optional: defaults to /tmp/ex_llm_cache
Automatic Response Collection
When caching is enabled, all provider responses are automatically stored:
# Normal usage - responses are automatically cached
{:ok, response} = ExLLM.chat(messages, provider: :openai)
{:ok, stream} = ExLLM.stream_chat(messages, provider: :anthropic)
Cache Structure
Responses are organized by provider and endpoint:
/tmp/ex_llm_cache/
├── openai/
│ ├── chat.json # Chat completions
│ └── streaming.json # Streaming responses
├── anthropic/
│ ├── chat.json # Claude messages
│ └── streaming.json # Streaming responses
└── openrouter/
└── chat.json # OpenRouter responses
Manual Response Storage
# Store a specific response
ExLLM.ResponseCache.store_response(
"openai", # Provider
"chat", # Endpoint
%{messages: messages}, # Request data
%{"choices" => [...]} # Response data
)
Mock Adapter Integration
Configure the Mock adapter to replay cached responses from any provider:
Using Unified Cache System
With the unified cache system, responses are automatically available for mock testing when disk persistence is enabled:
# 1. Enable disk persistence during development/testing
ExLLM.Cache.configure_disk_persistence(true, "/tmp/ex_llm_cache")
# 2. Use normal caching to collect responses
{:ok, response} = ExLLM.chat(:openai, messages, cache: true)
{:ok, response} = ExLLM.chat(:anthropic, messages, cache: true)
# 3. Configure mock adapter to use cached responses
ExLLM.ResponseCache.configure_mock_provider(:openai)
# 4. Mock calls now return authentic cached responses
{:ok, response} = ExLLM.chat(messages, provider: :mock)
# Returns real OpenAI response structure and content
# 5. Switch to different provider responses
ExLLM.ResponseCache.configure_mock_provider(:anthropic)
{:ok, response} = ExLLM.chat(messages, provider: :mock)
# Now returns real Anthropic response structure
Using Legacy Response Cache
For compatibility with the original caching approach:
# Enable legacy response caching
export EX_LLM_CACHE_RESPONSES=true
# Use cached OpenAI responses for realistic testing
ExLLM.ResponseCache.configure_mock_provider(:openai)
# Now mock calls return authentic OpenAI responses
{:ok, response} = ExLLM.chat(messages, provider: :mock)
# Returns real OpenAI response structure and content
Response Collection for Testing
Collect comprehensive test scenarios:
# Collect responses for common test cases
ExLLM.CachingInterceptor.create_test_collection(:openai)
# Collect specific scenarios
test_cases = [
{[%{role: "user", content: "Hello"}], []},
{[%{role: "user", content: "What is 2+2?"}], [max_tokens: 10]},
{[%{role: "user", content: "Tell me a joke"}], [temperature: 0.8]}
]
ExLLM.CachingInterceptor.collect_test_responses(:anthropic, test_cases)
Cache Management
# List available cached providers
providers = ExLLM.ResponseCache.list_cached_providers()
# => [{"openai", 15}, {"anthropic", 8}] # {provider, response_count}
# Clear cache for specific provider
ExLLM.ResponseCache.clear_provider_cache("openai")
# Clear all cached responses
ExLLM.ResponseCache.clear_all_cache()
# Get specific cached response
cached = ExLLM.ResponseCache.get_response("openai", "chat", request_data)
Configuration Options
# Environment variables
EX_LLM_CACHE_RESPONSES=true # Enable/disable caching
EX_LLM_CACHE_DIR="/custom/path" # Custom cache directory
# Check if caching is enabled
ExLLM.ResponseCache.caching_enabled?()
# => true
# Get current cache directory
ExLLM.ResponseCache.cache_dir()
# => "/tmp/ex_llm_cache"
Use Cases
Development Testing with Unified Cache:
# 1. Enable disk persistence during development
ExLLM.Cache.configure_disk_persistence(true)
# 2. Use normal caching - responses get collected automatically
{:ok, response} = ExLLM.chat(:openai, messages, cache: true)
{:ok, response} = ExLLM.chat(:anthropic, messages, cache: true)
# 3. Use cached responses in tests
ExLLM.ResponseCache.configure_mock_provider(:openai)
# Tests now use real OpenAI response structures
Development Testing with Legacy Cache:
# 1. Collect responses during development
export EX_LLM_CACHE_RESPONSES=true
# Run your app normally - responses get cached
# 2. Use cached responses in tests
ExLLM.ResponseCache.configure_mock_provider(:openai)
# Tests now use real OpenAI response structures
Cost Reduction:
# Unified cache approach - enable persistence temporarily
ExLLM.Cache.configure_disk_persistence(true)
# Cache expensive model responses during development
{:ok, response} = ExLLM.chat(:openai, messages,
cache: true,
model: "gpt-4o" # Expensive model
)
# Response is cached automatically both in memory and disk
# Later testing uses cached response - no API cost
ExLLM.ResponseCache.configure_mock_provider(:openai)
{:ok, same_response} = ExLLM.chat(messages, provider: :mock)
# Disable persistence for production
ExLLM.Cache.configure_disk_persistence(false)
Cross-Provider Testing:
# Test how your app handles different provider response formats
ExLLM.ResponseCache.configure_mock_provider(:openai)
test_openai_format()
ExLLM.ResponseCache.configure_mock_provider(:anthropic)
test_anthropic_format()
ExLLM.ResponseCache.configure_mock_provider(:openrouter)
test_openrouter_format()
Advanced Usage
Streaming Response Caching:
# Streaming responses are automatically cached
{:ok, stream} = ExLLM.stream_chat(messages, provider: :openai)
chunks = Enum.to_list(stream)
# Later, mock can replay the exact same stream
ExLLM.ResponseCache.configure_mock_provider(:openai)
{:ok, cached_stream} = ExLLM.stream_chat(messages, provider: :mock)
# Returns identical streaming chunks
Interceptor Wrapping:
# Manually wrap API calls for caching
{:ok, response} = ExLLM.CachingInterceptor.with_caching(:openai, fn ->
ExLLM.Adapters.OpenAI.chat(messages)
end)
# Wrap streaming calls
{:ok, stream} = ExLLM.CachingInterceptor.with_streaming_cache(
:anthropic,
messages,
options,
fn -> ExLLM.Adapters.Anthropic.stream_chat(messages, options) end
)
Model Discovery
Finding Models
# Get model information
{:ok, info} = ExLLM.get_model_info(:openai, "gpt-4o")
IO.inspect(info)
# => %ExLLM.ModelCapabilities.ModelInfo{
# id: "gpt-4o",
# context_window: 128000,
# max_output_tokens: 16384,
# capabilities: %{
# vision: %{supported: true},
# function_calling: %{supported: true},
# streaming: %{supported: true},
# ...
# }
# }
# Check specific capability
if ExLLM.model_supports?(:openai, "gpt-4o", :vision) do
# Model supports vision
end
Model Recommendations
# Get recommendations based on requirements
recommendations = ExLLM.recommend_models(
features: [:vision, :function_calling],
min_context_window: 100_000,
max_cost_per_1k_tokens: 1.0,
prefer_local: false,
limit: 5
)
for {provider, model, info} <- recommendations do
IO.puts("#{provider}/#{model}")
IO.puts(" Score: #{info.score}")
IO.puts(" Context: #{info.context_window}")
IO.puts(" Cost: $#{info.cost_per_1k}/1k tokens")
end
Finding Models by Feature
# Find all models with specific features
models = ExLLM.find_models_with_features([:vision, :streaming])
# => [
# {:openai, "gpt-4o"},
# {:anthropic, "claude-3-opus-20240229"},
# ...
# ]
# Group models by capability
grouped = ExLLM.models_by_capability(:vision)
# => %{
# supported: [{:openai, "gpt-4o"}, ...],
# not_supported: [{:openai, "gpt-3.5-turbo"}, ...]
# }
Comparing Models
comparison = ExLLM.compare_models([
{:openai, "gpt-4o"},
{:anthropic, "claude-3-5-sonnet-20241022"},
{:gemini, "gemini-1.5-pro"}
])
# See feature support across models
IO.inspect(comparison.features[:vision])
# => [
# %{model: "gpt-4o", supported: true, details: %{...}},
# %{model: "claude-3-5-sonnet", supported: true, details: %{...}},
# %{model: "gemini-1.5-pro", supported: true, details: %{...}}
# ]
Provider Capabilities
Capability Normalization
ExLLM automatically normalizes different provider terminologies:
# These all work and refer to the same capability
ExLLM.provider_supports?(:openai, :function_calling) # => true
ExLLM.provider_supports?(:anthropic, :tool_use) # => true
ExLLM.provider_supports?(:openai, :tools) # => true
# Find providers using any terminology
ExLLM.find_providers_with_features([:tool_use]) # Works!
ExLLM.find_providers_with_features([:function_calling]) # Also works!
Provider Discovery
# Get provider capabilities
{:ok, caps} = ExLLM.get_provider_capabilities(:openai)
IO.inspect(caps)
# => %ExLLM.ProviderCapabilities.ProviderInfo{
# id: :openai,
# name: "OpenAI",
# endpoints: [:chat, :embeddings, :images, ...],
# features: [:streaming, :function_calling, ...],
# limitations: %{max_file_size: 512MB, ...}
# }
# Find providers by feature
providers = ExLLM.find_providers_with_features([:embeddings, :streaming])
# => [:openai, :gemini, :bedrock, ...]
# Check authentication requirements
if ExLLM.provider_requires_auth?(:openai) do
# Provider needs API key
end
# Check if provider is local
if ExLLM.is_local_provider?(:ollama) do
# No API costs
end
Provider Recommendations
recommendations = ExLLM.recommend_providers(%{
required_features: [:vision, :streaming],
preferred_features: [:embeddings, :function_calling],
exclude_providers: [:mock],
prefer_local: false,
prefer_free: false
})
for %{provider: provider, score: score, matched_features: features} <- recommendations do
IO.puts("#{provider}: #{Float.round(score, 2)}")
IO.puts(" Features: #{Enum.join(features, ", ")}")
end
Comparing Providers
comparison = ExLLM.compare_providers([:openai, :anthropic, :gemini])
# See all features across providers
IO.puts("All features: #{Enum.join(comparison.features, ", ")}")
# Check specific provider capabilities
openai_features = comparison.comparison.openai.features
# => [:streaming, :function_calling, :embeddings, ...]
Logging
ExLLM provides a unified logging system with security features.
Basic Logging
alias ExLLM.Logger
# Log at different levels
Logger.debug("Starting chat request")
Logger.info("Chat completed in #{duration}ms")
Logger.warn("Rate limit approaching")
Logger.error("API request failed", error: reason)
Structured Logging
# Log with metadata
Logger.info("Chat completed",
provider: :openai,
model: "gpt-4o",
tokens: 150,
duration_ms: 523
)
# Context-aware logging
Logger.with_context(request_id: "abc123") do
Logger.info("Processing request")
# All logs in this block include request_id
end
Security Features
# API keys are automatically redacted
Logger.info("Using API key", api_key: "sk-1234567890")
# Logs: "Using API key [api_key: REDACTED]"
# Configure content filtering
Application.put_env(:ex_llm, :log_redact_messages, true)
Configuration
# In config/config.exs
config :ex_llm,
log_level: :info, # Minimum level to log
log_redact_keys: true, # Redact API keys
log_redact_messages: false, # Don't log message content
log_include_metadata: true, # Include structured metadata
log_filter_components: [:cache] # Don't log from the cache component
See the Logger User Guide for complete documentation.
Testing with Mock Adapter
The mock adapter helps you test LLM integrations without making real API calls.
Basic Mocking
# Start the mock adapter
{:ok, _} = ExLLM.Adapters.Mock.start_link()
# Configure mock response
{:ok, response} = ExLLM.chat(:mock, messages,
mock_response: "This is a mock response"
)
assert response.content == "This is a mock response"
Dynamic Responses
# Use a handler function
{:ok, response} = ExLLM.chat(:mock, messages,
mock_handler: fn messages, _options ->
last_message = List.last(messages)
%ExLLM.Types.LLMResponse{
content: "You said: #{last_message.content}",
model: "mock-model",
usage: %{input_tokens: 10, output_tokens: 20}
}
end
)
Simulating Errors
# Simulate specific errors
{:error, error} = ExLLM.chat(:mock, messages,
mock_error: %ExLLM.Error{
type: :rate_limit,
message: "Rate limit exceeded",
retry_after: 60
}
)
Streaming Mocks
{:ok, stream} = ExLLM.stream_chat(:mock, messages,
mock_chunks: [
%{content: "Hello"},
%{content: " world"},
%{content: "!", finish_reason: "stop"}
],
chunk_delay: 100 # Milliseconds between chunks
)
for chunk <- stream do
IO.write(chunk.content || "")
end
Request Capture
# Capture requests for assertions
ExLLM.Adapters.Mock.clear_requests()
{:ok, _} = ExLLM.chat(:mock, messages,
capture_requests: true,
mock_response: "OK"
)
requests = ExLLM.Adapters.Mock.get_requests()
assert length(requests) == 1
assert List.first(requests).messages == messages
Advanced Topics
Custom Adapters
Create your own adapter for unsupported providers:
defmodule MyApp.CustomAdapter do
@behaviour ExLLM.Adapter
@impl true
def configured?(options) do
# Check if adapter is properly configured
config = get_config(options)
config[:api_key] != nil
end
@impl true
def default_model() do
"custom-model-v1"
end
@impl true
def chat(messages, options) do
# Implement chat logic
# Return {:ok, %ExLLM.Types.LLMResponse{}} or {:error, reason}
end
@impl true
def stream_chat(messages, options) do
# Return {:ok, stream} where stream yields StreamChunk structs
end
# Optional callbacks
@impl true
def list_models(options) do
# Return {:ok, [%ExLLM.Types.Model{}]}
end
@impl true
def embeddings(inputs, options) do
# Return {:ok, %ExLLM.Types.EmbeddingResponse{}}
end
end
Stream Processing
Advanced stream handling:
defmodule StreamProcessor do
def process_with_buffer(provider, messages, opts) do
{:ok, stream} = ExLLM.stream_chat(provider, messages, opts)
stream
|> Stream.transform("", fn chunk, buffer ->
  buffer = buffer <> (chunk.content || "")
  # Emit the buffered text once a sentence completes, then reset the
  # buffer; scanning without a reset would reprint earlier sentences
  if String.ends_with?(buffer, ".") do
    {[buffer], ""}
  else
    {[], buffer}
  end
end)
|> Stream.each(fn sentence -> IO.puts("\nComplete: #{sentence}") end)
|> Stream.run()
end
end
Token Budget Management
Manage token usage across multiple requests:
defmodule TokenBudget do
use GenServer
def init(budget) do
{:ok, %{budget: budget, used: 0}}
end
def track_usage(pid, tokens) do
GenServer.call(pid, {:track, tokens})
end
def handle_call({:track, tokens}, _from, state) do
new_used = state.used + tokens
if new_used <= state.budget do
{:reply, :ok, %{state | used: new_used}}
else
{:reply, {:error, :budget_exceeded}, state}
end
end
end
# Use with ExLLM
{:ok, budget} = GenServer.start_link(TokenBudget, 10_000)
{:ok, response} = ExLLM.chat(:openai, messages)
:ok = TokenBudget.track_usage(budget, response.usage.total_tokens)
Multi-Provider Routing
Route requests to different providers based on criteria:
defmodule ProviderRouter do
def route_request(messages, requirements) do
cond do
# Use local for development (Mix.env/0 is compile-time only and
# unavailable in releases; prefer an application env flag in production)
Mix.env() == :dev ->
ExLLM.chat(:ollama, messages)
# Use Groq for speed-critical requests
is_integer(requirements[:max_latency_ms]) and requirements[:max_latency_ms] < 1000 ->
ExLLM.chat(:groq, messages)
# Use OpenAI for complex reasoning
requirements[:complexity] == :high ->
ExLLM.chat(:openai, messages, model: "gpt-4o")
# Default to Anthropic
true ->
ExLLM.chat(:anthropic, messages)
end
end
end
Batch Processing
Process multiple requests efficiently:
defmodule BatchProcessor do
def process_batch(items, opts \\ []) do
# Use Task.async_stream for parallel processing
items
|> Task.async_stream(
fn item ->
ExLLM.chat(opts[:provider] || :openai, [
%{role: "user", content: item}
])
end,
max_concurrency: opts[:concurrency] || 5,
timeout: opts[:timeout] || 30_000
)
|> Enum.map(fn
{:ok, {:ok, response}} -> {:ok, response}
{:ok, {:error, reason}} -> {:error, reason}
{:exit, reason} -> {:error, {:timeout, reason}}
end)
end
end
Custom Configuration Management
Implement advanced configuration strategies:
defmodule ConfigManager do
use GenServer
def start_link(opts) do
GenServer.start_link(__MODULE__, opts, name: __MODULE__)
end
def init(_opts) do
# Load from multiple sources
config = %{}
|> load_from_env()
|> load_from_file()
|> load_from_vault()
|> validate_config()
{:ok, config}
end
def get_config(provider) do
GenServer.call(__MODULE__, {:get, provider})
end
defp load_from_vault(config) do
# Fetch from HashiCorp Vault, AWS Secrets Manager, etc.
Map.merge(config, fetch_secrets())
end
end
Best Practices
- Always handle errors - LLM APIs can fail for various reasons
- Use streaming for long responses - Better user experience
- Enable caching for repeated queries - Save costs
- Monitor token usage - Stay within budget
- Use appropriate models - Don't use GPT-4 for simple tasks
- Implement fallbacks - Have backup providers ready
- Test with mocks - Don't make API calls in tests
- Use context management - Handle long conversations properly
- Track costs - Monitor spending across providers
- Follow rate limits - Respect provider limitations
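Several of these practices can be combined in a small helper. The sketch below is illustrative, not part of ExLLM: the `Fallback` module and `first_success/2` are hypothetical names. It tries a list of chat functions in order and returns the first success, which is one way to implement fallback providers:

```elixir
defmodule Fallback do
  @moduledoc """
  Try a list of chat functions in order, returning the first success.
  Each function takes the messages and returns {:ok, response} or
  {:error, reason} -- for example, &ExLLM.chat(:openai, &1).
  """
  def first_success(chat_funs, messages) do
    Enum.reduce_while(chat_funs, {:error, :all_providers_failed}, fn fun, acc ->
      case fun.(messages) do
        # Stop at the first provider that answers successfully
        {:ok, _response} = ok -> {:halt, ok}
        # Otherwise fall through to the next provider in the list
        {:error, _reason} -> {:cont, acc}
      end
    end)
  end
end
```

In practice you would pass something like `[&ExLLM.chat(:openai, &1), &ExLLM.chat(:anthropic, &1)]` so a failing primary provider falls through to the backup.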
Troubleshooting
Common Issues
"API key not found"
- Check environment variables
- Verify configuration provider is started
- Use ExLLM.configured?/1 to debug
"Context length exceeded"
- Use context management strategies
- Choose models with larger context windows
- Truncate conversation history
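A minimal truncation sketch is shown below. The `History` module is a hypothetical helper, and the 4-characters-per-token estimate is a rough heuristic, not an exact tokenizer; it keeps the most recent messages that fit under a token budget:

```elixir
defmodule History do
  # Keep the most recent messages whose estimated token count fits
  # under `max_tokens`, dropping the oldest messages first.
  def truncate(messages, max_tokens) do
    messages
    |> Enum.reverse()
    |> Enum.reduce_while({[], 0}, fn msg, {kept, tokens} ->
      # Rough heuristic: ~4 characters per token, plus message overhead
      cost = div(String.length(msg.content), 4) + 1

      if tokens + cost <= max_tokens do
        {:cont, {[msg | kept], tokens + cost}}
      else
        {:halt, {kept, tokens}}
      end
    end)
    |> elem(0)
  end
end
```

For production use, prefer the library's own context management strategies, which account for per-model context windows.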
"Rate limit exceeded"
- Enable automatic retry
- Implement backoff strategies
- Consider multiple API keys
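A backoff strategy can be sketched generically. The `Backoff` module below is illustrative (not an ExLLM API): it retries a function with exponentially increasing delays whenever it returns a rate-limit error:

```elixir
defmodule Backoff do
  # Retry `fun` with exponential backoff while it returns
  # {:error, :rate_limit}; give up after `attempts` tries.
  def retry(fun, attempts \\ 3, delay_ms \\ 500) do
    case fun.() do
      {:error, :rate_limit} when attempts > 1 ->
        Process.sleep(delay_ms)
        # Double the delay on each retry
        retry(fun, attempts - 1, delay_ms * 2)

      other ->
        other
    end
  end
end
```

You would wrap a call such as `fn -> ExLLM.chat(:openai, messages) end`, mapping the provider's rate-limit error to `{:error, :rate_limit}` as appropriate.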
"Stream interrupted"
- Enable stream recovery
- Implement reconnection logic
- Check network stability
"Invalid response format"
- Check provider documentation
- Verify model capabilities
- Use appropriate options
Debug Mode
Enable debug logging:
# In config
config :ex_llm, :log_level, :debug
# Or at runtime
Logger.configure(level: :debug)
Getting Help
- Check the API documentation
- Review example applications
- Open an issue on GitHub
- Read provider-specific documentation
Additional Resources
- Quick Start Guide - Get started quickly
- Provider Capabilities - Detailed provider information
- Logger Guide - Logging system documentation
- API Reference - Complete API documentation