Groq Provider Guide


Groq provides ultra-fast LLM inference on its custom LPU hardware, delivering exceptional performance for real-time applications.

Configuration

Set your Groq API key:

# Add to .env file (automatically loaded)
GROQ_API_KEY=gsk_...

Or use in-memory storage:

ReqLLM.put_key(:groq_api_key, "gsk_...")

Supported Models

Popular Groq models include:

  • llama-3.3-70b-versatile - Latest Llama 3.3
  • llama-3.1-8b-instant - Fast, efficient
  • mixtral-8x7b-32768 - Large context window
  • gemma2-9b-it - Google's Gemma 2

See the full list with mix req_llm.model_sync groq.

Basic Usage

# Simple text generation
{:ok, response} = ReqLLM.generate_text(
  "groq:llama-3.3-70b-versatile",
  "Explain async programming"
)

# Streaming (ultra-fast with Groq hardware)
{:ok, stream_response} = ReqLLM.stream_text(
  "groq:llama-3.1-8b-instant",
  "Write a story"
)

ReqLLM.StreamResponse.tokens(stream_response)
|> Stream.each(&IO.write/1)
|> Stream.run()

Provider-Specific Options

Service Tier

Control the performance tier used for requests:

{:ok, response} = ReqLLM.generate_text(
  "groq:llama-3.3-70b-versatile",
  "Hello",
  provider_options: [service_tier: "performance"]
)

Tiers:

  • "auto" - Automatic selection (default)
  • "on_demand" - Standard on-demand
  • "flex" - Flexible pricing
  • "performance" - Highest performance

Reasoning Effort

Control the reasoning level for compatible models:

{:ok, response} = ReqLLM.generate_text(
  "groq:deepseek-r1-distill-llama-70b",
  "Complex problem",
  provider_options: [reasoning_effort: "high"]
)

Levels: "none", "default", "low", "medium", "high"

Reasoning Format

Specify the format for reasoning output:

{:ok, response} = ReqLLM.generate_text(
  "groq:deepseek-r1-distill-llama-70b",
  "Problem to solve",
  provider_options: [reasoning_format: "detailed"]
)

Search Settings

Enable web search capabilities:

{:ok, response} = ReqLLM.generate_text(
  "groq:llama-3.3-70b-versatile",
  "Latest tech news",
  provider_options: [
    search_settings: %{
      include_domains: ["techcrunch.com", "arstechnica.com"],
      exclude_domains: ["spam.com"]
    }
  ]
)

Compound Custom

Custom configuration for Compound systems:

{:ok, response} = ReqLLM.generate_text(
  "groq:model",
  "Text",
  provider_options: [
    compound_custom: %{
      # Compound-specific settings
    }
  ]
)

Complete Example

import ReqLLM.Context

context = Context.new([
  system("You are a fast, helpful coding assistant"),
  user("Explain tail call optimization")
])

{:ok, response} = ReqLLM.generate_text(
  "groq:llama-3.3-70b-versatile",
  context,
  temperature: 0.7,
  max_tokens: 1000,
  provider_options: [
    service_tier: "performance",
    search_settings: %{
      include_domains: ["developer.mozilla.org", "stackoverflow.com"]
    }
  ]
)

text = ReqLLM.Response.text(response)
usage = response.usage

IO.puts(text)
IO.puts("Tokens: #{usage.total_tokens}, Cost: $#{usage.total_cost}")

Tool Calling

Groq supports function calling on compatible models:

weather_tool = ReqLLM.tool(
  name: "get_weather",
  description: "Get weather for a location",
  parameter_schema: [
    location: [type: :string, required: true]
  ],
  callback: {WeatherAPI, :fetch}
)

{:ok, response} = ReqLLM.generate_text(
  "groq:llama-3.3-70b-versatile",
  "What's the weather in Berlin?",
  tools: [weather_tool]
)
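
The callback: {WeatherAPI, :fetch} above refers to a module you provide. The sketch below is only an illustration of what that module could look like; the argument shape and return contract assumed here are not confirmed by this guide, so check the ReqLLM tool-calling documentation for the exact callback signature.

defmodule WeatherAPI do
  # Hypothetical callback module. Assumes the tool callback receives the
  # decoded arguments as a map and returns {:ok, result}; verify this shape
  # against the ReqLLM tool documentation before relying on it.
  def fetch(%{"location" => location}) do
    # A real implementation would call a weather service here.
    {:ok, "It is 18°C and partly cloudy in #{location}."}
  end
end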

Structured Output

Groq supports structured output generation:

schema = [
  name: [type: :string, required: true],
  age: [type: :integer, required: true],
  skills: [type: {:list, :string}]
]

{:ok, response} = ReqLLM.generate_object(
  "groq:llama-3.3-70b-versatile",
  "Generate a software engineer profile",
  schema
)

person = ReqLLM.Response.object(response)

Performance Tips

  1. Use Streaming: Groq's hardware excels at streaming - tokens start arriving almost instantly
  2. Choose the Right Model: use llama-3.1-8b-instant for speed, llama-3.3-70b-versatile for quality
  3. Service Tier: use the "performance" tier for the lowest latency
  4. Batch Requests: Groq handles concurrent requests efficiently - see the concurrency sketch after this list
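
As a rough sketch of concurrent usage, the example below fans out a few independent prompts with Task.async_stream/3. The prompts, concurrency limit, and timeout are arbitrary choices for illustration; only the ReqLLM.generate_text/2 and ReqLLM.Response.text/1 calls come from this guide.

prompts = [
  "Summarize GenServers in one sentence",
  "Summarize Supervisors in one sentence",
  "Summarize Tasks in one sentence"
]

# Fan the prompts out as concurrent requests.
results =
  prompts
  |> Task.async_stream(
    fn prompt -> ReqLLM.generate_text("groq:llama-3.1-8b-instant", prompt) end,
    max_concurrency: 3,
    timeout: 30_000
  )
  |> Enum.map(fn
    {:ok, {:ok, response}} -> ReqLLM.Response.text(response)
    {:ok, {:error, error}} -> "error: #{error.message}"
  end)

Enum.each(results, &IO.puts/1)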

Streaming Performance

Groq's custom hardware provides exceptional streaming performance:

{:ok, stream_response} = ReqLLM.stream_text(
  "groq:llama-3.1-8b-instant",
  "Count from 1 to 100"
)

# You'll see tokens appearing almost instantly
stream_response
|> ReqLLM.StreamResponse.tokens()
|> Stream.each(&IO.write/1)
|> Stream.run()
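
If you want to quantify this, a small sketch like the one below (using only the streaming API shown above plus System.monotonic_time/1 from the standard library) reports the time to the first token:

started = System.monotonic_time(:millisecond)

{:ok, stream_response} =
  ReqLLM.stream_text("groq:llama-3.1-8b-instant", "Count from 1 to 100")

# Print tokens as they arrive and report how long the first one took.
stream_response
|> ReqLLM.StreamResponse.tokens()
|> Stream.with_index()
|> Stream.each(fn {token, index} ->
  if index == 0 do
    elapsed = System.monotonic_time(:millisecond) - started
    IO.puts("first token after #{elapsed} ms")
  end

  IO.write(token)
end)
|> Stream.run()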

Error Handling

case ReqLLM.generate_text("groq:llama-3.3-70b-versatile", "Hello") do
  {:ok, response} -> 
    handle_success(response)
    
  {:error, error} -> 
    IO.puts("Error: #{error.message}")
end
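
For transient failures such as rate limits or timeouts, you can layer a simple retry on top. This is a minimal sketch that assumes only the {:ok, response} / {:error, error} shape shown above; real code should inspect the error before retrying so that non-recoverable errors (for example an invalid API key) fail fast.

defmodule GroqRetry do
  # Retry a generation a few times with a short fixed backoff.
  def generate_with_retry(model, prompt, attempts \\ 3) do
    case ReqLLM.generate_text(model, prompt) do
      {:ok, response} ->
        {:ok, response}

      {:error, _error} when attempts > 1 ->
        Process.sleep(500)
        generate_with_retry(model, prompt, attempts - 1)

      {:error, error} ->
        {:error, error}
    end
  end
end

{:ok, response} = GroqRetry.generate_with_retry("groq:llama-3.3-70b-versatile", "Hello")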

Key Advantages

  1. Speed: Custom LPU hardware for ultra-fast inference
  2. Cost: Competitive pricing for high performance
  3. Reliability: Enterprise-grade infrastructure
  4. Compatibility: OpenAI-compatible API

Resources