# Rate Limiting Guide

GeminiEx includes an automatic rate limiting system that is enabled by default. It helps you stay within Gemini API quotas and handles rate limit errors gracefully.
## Overview

The rate limiter provides:

- **Automatic rate limit enforcement** - Waits when rate limited (429 responses)
- **Concurrency gating** - Limits concurrent requests per model (default: 4)
- **Token budgeting** - Tracks usage to preemptively avoid rate limits
- **Adaptive mode** - Optionally adjusts concurrency based on 429 responses
- **Structured errors** - Returns `{:error, {:rate_limited, retry_at, details}}`
- **Telemetry events** - Observable rate limit wait/error events
## Default Behavior

Requests made through the Gemini API client automatically:

- Check against the shared retry window (from previous 429s) with jittered release
- Atomically reserve token budget for the estimated request size (with an optional safety multiplier)
- Are gated by concurrency permits (default: 4 per model)
- Retry with backoff on transient failures
```elixir
# Rate limiting is automatic - no changes needed!
{:ok, response} = Gemini.generate("Hello")

# This works even under heavy load
results =
  1..100
  |> Task.async_stream(fn _ -> Gemini.generate("Hello") end)
  |> Enum.to_list()
```

## Configuration
### Application Config

Configure globally via application environment:

```elixir
config :gemini_ex, :rate_limiter,
  max_concurrency_per_model: 4,  # nil or 0 disables concurrency gating
  permit_timeout_ms: :infinity,  # Default: no cap on queue wait; set a number to cap
  max_attempts: 3,               # Retry attempts for transient errors
  base_backoff_ms: 1000,         # Base backoff duration
  jitter_factor: 0.25,           # Jitter range (±25%)
  adaptive_concurrency: false,   # Enable adaptive mode
  adaptive_ceiling: 8,           # Max concurrency in adaptive mode
  profile: :prod                 # :dev, :prod, or :custom
  # Reserve with a safety multiplier (optional, default 1.0)
  # budget_safety_multiplier: 1.2
```

### Per-Request Options
Override behavior on individual requests:

```elixir
# Bypass rate limiting entirely
{:ok, response} = Gemini.generate("Hello", disable_rate_limiter: true)

# Return immediately if rate limited
case Gemini.generate("Hello", non_blocking: true) do
  {:ok, response} ->
    handle_response(response)

  {:error, {:rate_limited, retry_at, _details}} ->
    # Schedule a retry for later
    schedule_retry(retry_at)
end

# Override the concurrency limit
{:ok, response} = Gemini.generate("Hello", max_concurrency_per_model: 8)

# Override the permit wait timeout (defaults to :infinity)
{:ok, response} = Gemini.generate("Hello", permit_timeout_ms: 600_000)

# Partition the concurrency gate (e.g., by tenant/location)
{:ok, response} = Gemini.generate("Hello", concurrency_key: "tenant_a")

# Fail fast instead of waiting
{:error, {:rate_limited, nil, %{reason: :no_permit_available}}} =
  Gemini.generate("Hello", non_blocking: true)
```

## Quick Start
Choose a profile matching your Google Cloud tier:

| Profile | Best For | RPM | TPM | Token Budget |
|---|---|---|---|---|
| `:free_tier` | Development, testing | 15 | 1M | 32,000 |
| `:paid_tier_1` | Standard production | 500 | 4M | 1,000,000 |
| `:paid_tier_2` | High throughput | 1000 | 8M | 2,000,000 |

```elixir
# Select your tier
config :gemini_ex, :rate_limiter, profile: :paid_tier_1
```

**Default behavior:** If you don't choose a profile, `:prod` is used (`token_budget_per_window: 500_000`, `window_duration_ms: 60_000`). The base fallback defaults are 32,000 tokens per 60-second window and are used by `:custom` unless overridden. Set `token_budget_per_window: nil` if you need the pre-0.6.1 "unlimited" budgeting behavior.
## Profiles

The rate limiter supports tier-based configuration profiles that automatically set appropriate limits for your Google Cloud plan.

### Tier Profiles

#### Free Tier (`:free_tier`)

Conservative settings for Google's free tier (15 RPM / 1M TPM):

```elixir
config :gemini_ex, :rate_limiter, profile: :free_tier

# Equivalent to:
# max_concurrency_per_model: 2
# max_attempts: 5
# base_backoff_ms: 2000
# token_budget_per_window: 32_000
# adaptive_concurrency: true
# adaptive_ceiling: 4
```

#### Paid Tier 1 (`:paid_tier_1`)
Standard production settings for Tier 1 plans (500 RPM / 4M TPM):

```elixir
config :gemini_ex, :rate_limiter, profile: :paid_tier_1

# Equivalent to:
# max_concurrency_per_model: 10
# max_attempts: 3
# base_backoff_ms: 500
# token_budget_per_window: 1_000_000
# adaptive_concurrency: true
# adaptive_ceiling: 15
```

#### Paid Tier 2 (`:paid_tier_2`)
High throughput settings for Tier 2 plans (1000 RPM / 8M TPM):

```elixir
config :gemini_ex, :rate_limiter, profile: :paid_tier_2

# Equivalent to:
# max_concurrency_per_model: 20
# max_attempts: 2
# base_backoff_ms: 250
# token_budget_per_window: 2_000_000
# adaptive_concurrency: true
# adaptive_ceiling: 30
```

### Legacy Profiles
#### Development Profile (`:dev`)

Lower concurrency and longer backoff, ideal for local development:

```elixir
config :gemini_ex, :rate_limiter, profile: :dev

# Equivalent to:
# max_concurrency_per_model: 2
# max_attempts: 5
# base_backoff_ms: 2000
# token_budget_per_window: 16_000
# adaptive_ceiling: 4
```

#### Production Profile (`:prod`)
Balanced settings for typical production usage (the default):

```elixir
config :gemini_ex, :rate_limiter, profile: :prod

# Equivalent to:
# max_concurrency_per_model: 4
# max_attempts: 3
# base_backoff_ms: 1000
# token_budget_per_window: 500_000
# adaptive_ceiling: 8
```

## Fine-Tuning
### Concurrency vs Token Budget

- **Concurrency** limits parallel requests (affects RPM)
- **Token budget** limits total tokens per window (affects TPM)

For most applications, start with a profile and adjust:

- Seeing 429s? Lower both concurrency and budget
- Underutilizing quota? Raise the budget and enable adaptive concurrency
### Concurrency Semantics

The concurrency gate is per model by default (all callers to the same model share a queue). Use `concurrency_key:` to partition by tenant or location. `permit_timeout_ms` defaults to `:infinity`; a waiter only errors if you explicitly set a finite cap and it expires. Use `non_blocking: true` to fail fast instead of queueing.
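To illustrate partitioning, here is a minimal sketch of per-tenant gating. The module name, `tenant_opts/1` helper, and the per-tenant limit of 2 are hypothetical choices for this example, not part of the library:

```elixir
defmodule MyApp.TenantRequests do
  # Build per-tenant rate limiter options. Kept pure so it is easy to test.
  def tenant_opts(tenant_id) do
    [
      concurrency_key: tenant_id,   # each tenant gets its own permit queue
      max_concurrency_per_model: 2  # at most 2 in-flight requests per tenant
    ]
  end

  # A burst from "tenant_a" queues in its own partition and
  # cannot starve requests made on behalf of "tenant_b".
  def generate_for_tenant(tenant_id, prompt) do
    Gemini.generate(prompt, tenant_opts(tenant_id))
  end
end
```

Partitioning like this trades global throughput for fairness: total in-flight requests become `partitions × limit`, so size the per-partition limit accordingly.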
## Structured Errors

Rate limit errors include retry information:

```elixir
case Gemini.generate("Hello") do
  {:ok, response} ->
    handle_response(response)

  {:error, {:rate_limited, retry_at, details}} ->
    # retry_at is a DateTime telling you when you can retry
    # details contains quota information
    IO.puts("Rate limited until #{retry_at}")
    IO.puts("Quota: #{details.quota_metric}")

  {:error, {:transient_failure, attempts, original_error}} ->
    # Transient error after exhausting max retry attempts
    IO.puts("Failed after #{attempts} attempts: #{inspect(original_error)}")

  {:error, reason} ->
    # Other errors
    IO.puts("Error: #{inspect(reason)}")
end
```

## Non-Blocking Mode
For applications that need to handle rate limits without waiting:

```elixir
defmodule MyApp.GeminiWorker do
  def generate_with_queue(prompt) do
    case Gemini.generate(prompt, non_blocking: true) do
      {:ok, response} ->
        {:ok, response}

      {:error, {:rate_limited, retry_at, _details}} ->
        # Queue the prompt for later
        schedule_retry(prompt, retry_at)
        {:queued, retry_at}
    end
  end

  defp schedule_retry(prompt, retry_at) do
    delay_ms = DateTime.diff(retry_at, DateTime.utc_now(), :millisecond)
    Process.send_after(self(), {:retry, prompt}, max(0, delay_ms))
  end
end
```

## Adaptive Concurrency Mode
Adaptive mode starts with lower concurrency and adjusts based on API responses:

```elixir
config :gemini_ex, :rate_limiter,
  adaptive_concurrency: true,
  adaptive_ceiling: 8  # Maximum concurrency to reach
```

In adaptive mode, the limiter:

- Starts at the configured `max_concurrency_per_model`
- Increases by 1 on each success (up to `adaptive_ceiling`)
- Decreases by 25% on each 429 response

This is useful when you're unsure of your quota limits.
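The adjustment rule can be sketched as a pure function. This is a simplified model of the additive-increase/multiplicative-decrease behavior described above, not the library's internal implementation:

```elixir
defmodule AdaptiveSketch do
  # Additive increase on success, capped at the ceiling.
  def next_limit(current, :success, ceiling), do: min(current + 1, ceiling)

  # Multiplicative decrease (25%) on a 429; never drops below 1
  # so the limiter can still probe the API.
  def next_limit(current, :rate_limited, _ceiling), do: max(div(current * 3, 4), 1)
end

AdaptiveSketch.next_limit(4, :success, 8)       #=> 5
AdaptiveSketch.next_limit(8, :success, 8)       #=> 8 (already at the ceiling)
AdaptiveSketch.next_limit(8, :rate_limited, 8)  #=> 6 (25% decrease)
```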
## Token Budgeting

Token budgeting helps you stay within TPM (tokens per minute) limits by tracking token usage and preemptively blocking requests that would exceed your budget.

### Atomic Reservations (v0.7.1)

- Reservations happen before dispatch using `try_reserve_budget/3` to avoid thundering-herd launches.
- A safety multiplier (default `1.0`, configurable via `budget_safety_multiplier`) can over-reserve for extra safety.
- After responses, the limiter reconciles the reservation: surplus tokens are returned, and shortfalls are charged.
- Over-budget calls fail fast (or wait once in blocking mode) with `{:error, {:rate_limited, retry_at, %{reason: :over_budget}}}` before hitting the network.
### Automatic Token Estimation

By default, the rate limiter automatically estimates input tokens for each request before it is sent. This estimation happens on the original input (a string or Content list) and is used for proactive budget checking.

```elixir
# Token estimation happens automatically - no code changes needed!
{:ok, response} = Gemini.generate("Hello world")

# The estimated tokens are passed to the rate limiter internally.
# If the estimate would exceed your budget, the request is blocked locally.
```

### Default Token Budgets
Each profile includes a default token budget:

- `:free_tier` - 32,000 tokens per minute (~3% of 1M TPM)
- `:paid_tier_1` - 1,000,000 tokens per minute (25% of 4M TPM)
- `:paid_tier_2` - 2,000,000 tokens per minute (25% of 8M TPM)
- `:prod` - 500,000 tokens per minute
- `:dev` - 16,000 tokens per minute
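As a rough sizing exercise, you can estimate how many requests fit in one budget window. The 800-token average prompt size below is a hypothetical figure for illustration:

```elixir
budget = 32_000          # :free_tier tokens per 60-second window
avg_prompt_tokens = 800  # hypothetical average estimated input size

# Integer division: how many average-sized requests the window admits
# before reservations start being rejected or queued.
requests_per_window = div(budget, avg_prompt_tokens)
#=> 40
```

If that number is lower than your expected request rate, move to a higher tier profile or raise `token_budget_per_window` to match your actual quota.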
### Cached Context Tokens

Cached contexts still consume tokens (returned as `cachedContentTokenCount` in responses) and count toward the budget. The rate limiter records these tokens automatically on success. If you pre-compute the cache size and want proactive blocking before first use, supply both estimates:

```elixir
Gemini.generate("Run on cached context",
  cached_content: cache_name,
  estimated_input_tokens: 200,      # prompt size
  estimated_cached_tokens: 50_000,  # precomputed cache size
  token_budget_per_window: 1_000_000
)
```

### Over-Budget Behavior
- **Request too large:** If `estimated_input_tokens + estimated_cached_tokens` exceeds the budget (after applying the safety multiplier), the limiter returns `{:error, {:rate_limited, nil, %{reason: :over_budget, request_too_large: true}}}` immediately (no retries).
- **Window full:** If the current window is full but the request fits the budget, blocking mode waits once (respecting `max_budget_wait_ms`) and then retries the reservation; non-blocking mode returns `retry_at` set to that window's end.
### Limiting Wait Time

Blocking calls can cap their wait with `max_budget_wait_ms` (default: `nil` = no cap). If the cap is reached and the window is still full, the limiter returns `{:error, {:rate_limited, retry_at, details}}` where `retry_at` is the actual window end:

```elixir
Gemini.generate("...", [
  token_budget_per_window: 500_000,
  estimated_input_tokens: 20_000,
  max_budget_wait_ms: 5_000  # Block at most 5 seconds on budget waits
])
```

### Manual Token Estimation
You can override the automatic estimate if you have a more accurate count:

```elixir
# Use your own token estimate (e.g., from the countTokens API)
{:ok, token_count} = Gemini.count_tokens("Your long prompt here...")

opts = [
  estimated_input_tokens: token_count.total_tokens,
  token_budget_per_window: 500_000
]

case Gemini.generate("Your long prompt here...", opts) do
  {:ok, response} ->
    handle_response(response)

  {:error, {:rate_limited, _, %{reason: :over_budget}}} ->
    # Would exceed the budget; wait for the window to reset
    :retry_later
end
```

### Disabling Token Budgeting
To disable token budget checking:

```elixir
# Disable for a single request
{:ok, response} = Gemini.generate("Hello", token_budget_per_window: nil)

# Disable globally
config :gemini_ex, :rate_limiter,
  token_budget_per_window: nil
```

## Telemetry Events
The rate limiter emits telemetry events for monitoring:

```elixir
:telemetry.attach_many(
  "rate-limit-monitor",
  [
    [:gemini, :rate_limit, :request, :start],
    [:gemini, :rate_limit, :request, :stop],
    [:gemini, :rate_limit, :wait],
    [:gemini, :rate_limit, :error],
    [:gemini, :rate_limit, :budget, :reserved],
    [:gemini, :rate_limit, :budget, :rejected],
    [:gemini, :rate_limit, :retry_window, :set],
    [:gemini, :rate_limit, :retry_window, :hit],
    [:gemini, :rate_limit, :retry_window, :release]
  ],
  fn event, measurements, metadata, _config ->
    IO.puts("Event: #{inspect(event)}")
    IO.puts("Model: #{metadata.model}")
    IO.puts("Duration: #{measurements[:duration]}ms")
  end,
  nil
)
```

### Available Events
| Event | Description |
|---|---|
| `[:gemini, :rate_limit, :request, :start]` | Request submitted to the rate limiter |
| `[:gemini, :rate_limit, :request, :stop]` | Request completed |
| `[:gemini, :rate_limit, :wait]` | Waiting for a retry window |
| `[:gemini, :rate_limit, :error]` | Rate limit error occurred |
| `[:gemini, :rate_limit, :budget, :reserved]` | Budget reservation made pre-flight |
| `[:gemini, :rate_limit, :budget, :rejected]` | Reservation rejected as over-budget |
| `[:gemini, :rate_limit, :retry_window, :set]` | Retry window set from a 429 |
| `[:gemini, :rate_limit, :retry_window, :hit]` | Incoming request hit an active retry window |
| `[:gemini, :rate_limit, :retry_window, :release]` | Retry window released after jittered wait |
| `[:gemini, :rate_limit, :stream, :started]` | Streaming request acquired a permit/reservation |
| `[:gemini, :rate_limit, :stream, :completed]` | Streaming request released its permit on completion |
| `[:gemini, :rate_limit, :stream, :error]` | Streaming request released its permit on error |
| `[:gemini, :rate_limit, :stream, :stopped]` | Streaming request released its permit on manual stop |
## Streaming

Streaming requests now go through the same rate limiter:

- Concurrency permits are held for the entire stream duration.
- Budget reservations are made up-front (on estimated input) and reconciled when the stream finishes.
- Non-blocking mode returns `{:error, {:rate_limited, retry_at, details}}` when a stream cannot acquire a permit or budget.
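A minimal sketch of handling that error shape when starting a stream, assuming `Gemini.stream_generate/2` accepts the same rate limiter options as `Gemini.generate/2`:

```elixir
case Gemini.stream_generate("Tell me a long story", non_blocking: true) do
  {:ok, stream} ->
    # Permit and budget were acquired; consume the stream as usual
    Enum.each(stream, fn chunk -> IO.write(chunk.text) end)

  {:error, {:rate_limited, retry_at, _details}} ->
    # No permit or budget available; schedule a retry at retry_at
    IO.puts("Stream rate limited until #{retry_at}")
end
```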
## Model Aliases

Use the built-in use-case aliases to avoid scattering model strings:

| Use Case | Model Key | Recommended Token Budget |
|---|---|---|
| `:cache_context` | `:flash_2_5` | 32,000 |
| `:report_section` | `:pro_2_5` | 16,000 |
| `:fast_path` | `:flash_2_5_lite` | 8,000 |

```elixir
Gemini.Config.model_for_use_case(:cache_context)
#=> "gemini-2.5-flash"

Gemini.Config.resolved_use_case_models()
#=> %{cache_context: "gemini-2.5-flash", report_section: "gemini-2.5-pro", fast_path: "gemini-2.5-flash-lite"}
```

## Checking Status
You can check rate limit status before making requests:

```elixir
case Gemini.RateLimiter.check_status("gemini-flash-lite-latest") do
  :ok ->
    IO.puts("Ready to make requests")

  {:rate_limited, retry_at, _details} ->
    IO.puts("Rate limited until #{retry_at}")

  {:over_budget, usage} ->
    IO.puts("Over token budget: #{inspect(usage)}")

  {:no_permits, 0} ->
    IO.puts("No concurrency permits available")
end
```

## Disabling Rate Limiting
While not recommended, you can disable rate limiting:

```elixir
# Per-request
{:ok, response} = Gemini.generate("Hello", disable_rate_limiter: true)

# Globally (not recommended)
config :gemini_ex, :rate_limiter, disable_rate_limiter: true
```

## Best Practices
- **Use default settings in production** - They're tuned for typical usage patterns
- **Use adaptive mode when unsure of quotas** - It will find the right concurrency
- **Handle rate limit errors gracefully** - Use `non_blocking: true` with queuing for high-throughput apps
- **Monitor telemetry events** - Track rate limit events to optimize your usage
- **Set token budgets for predictable costs** - Prevents unexpected overages
- **Use the `:dev` profile during development** - Lower concurrency helps avoid rate limits while testing
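To apply the last two practices together, a sketch of environment-specific configuration, assuming the standard `config/dev.exs` / `config/prod.exs` split of a Mix project (the chosen tier is an example):

```elixir
# config/dev.exs
# Low concurrency and a small token budget while testing locally.
config :gemini_ex, :rate_limiter, profile: :dev

# config/prod.exs
# Pick the profile matching your Google Cloud tier, and let
# adaptive mode probe toward the quota ceiling.
config :gemini_ex, :rate_limiter,
  profile: :paid_tier_1,
  adaptive_concurrency: true
```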
## Streaming and In-Flight Requests

The rate limiter gates stream startup (permit and budget acquisition); once a stream is opened, it is never interrupted mid-stream. This ensures streaming responses complete even if you hit rate limits while the stream is in progress.

```elixir
# The rate limiter gates this initial request
{:ok, stream} = Gemini.stream_generate("Tell me a long story")

# Once streaming starts, it continues uninterrupted
Enum.each(stream, fn chunk ->
  IO.write(chunk.text)
end)
```