Sources and Engine

The Engine runs a build-time ETL pipeline: it loads data from the configured sources, then normalizes, validates, merges, filters, enriches, and indexes it, and finally writes priv/llm_db/snapshot.json. At runtime, only the snapshot is loaded.

Source Behaviour

All sources implement the LLMDB.Source behaviour:

@callback load(opts :: map()) :: {:ok, data :: map()} | {:error, term()}
@callback pull(opts :: map()) :: :ok | {:error, term()}  # Optional

Canonical Format

%{
  "providers" => %{
    openai: %{
      "id" => :openai,
      "name" => "OpenAI",
      "base_url" => "https://api.openai.com/v1",
      # ...
    }
  },
  "models" => [
    %{
      "id" => "gpt-4",
      "provider" => :openai,
      "name" => "GPT-4",
      # ...
    },
    # ...
  ]
}

The outer map uses string keys, providers are keyed by atom, and model IDs are strings. Use LLMDB.Source.assert_canonical!/1 to validate a source's output.
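To make the key conventions concrete, here is a minimal sketch of a shape check. It is not the implementation of LLMDB.Source.assert_canonical!/1; the module name CanonicalSketch and the function canonical?/1 are hypothetical, and the real validation may check more than this.

```elixir
defmodule CanonicalSketch do
  # Hypothetical helper for illustration only; the library's own check is
  # LLMDB.Source.assert_canonical!/1, whose exact rules are not shown here.

  # The outer map must use the string keys "providers" and "models".
  def canonical?(%{"providers" => providers, "models" => models})
      when is_map(providers) and is_list(models) do
    # Providers are keyed by atom.
    provider_keys_ok = Enum.all?(Map.keys(providers), &is_atom/1)

    # Each model carries a string "id" and an atom "provider".
    models_ok =
      Enum.all?(models, fn
        %{"id" => id, "provider" => provider} -> is_binary(id) and is_atom(provider)
        _other -> false
      end)

    provider_keys_ok and models_ok
  end

  # Anything else (for example, atom outer keys) is rejected.
  def canonical?(_other), do: false
end
```

A source under development could call such a check on the map returned by load/1 before handing it to the Engine.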

Built-in Sources

ModelsDev (Remote)

{LLMDB.Sources.ModelsDev, %{
  url: "https://models.dev/api/models",
  cache_path: "priv/llm_db/cache/models_dev.json"
}}

pull/1 downloads the models.dev data and caches it via Req; load/1 reads from the cache. The source transforms the models.dev schema to the canonical format (limit → limits, modality strings → atoms, unmapped fields → :extra).
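The transformation can be pictured roughly as follows. This is a sketch under assumptions, not the ModelsDev source's actual code: the module ModelsDevSketch, the list of known keys, the modality whitelist, the shape of the raw record, and the use of a string "extra" key are all hypothetical.

```elixir
defmodule ModelsDevSketch do
  # Assumed set of fields that map directly onto the canonical schema.
  @known_keys ["id", "name", "limit", "modalities"]

  # Whitelist instead of String.to_atom/1, so untrusted input cannot create atoms.
  @modalities %{"text" => :text, "image" => :image, "audio" => :audio, "video" => :video}

  def to_canonical(raw, provider) when is_map(raw) and is_atom(provider) do
    # Split recognised fields from everything else.
    {known, unmapped} = Map.split(raw, @known_keys)

    known
    |> rename("limit", "limits")
    |> Map.update("modalities", %{}, &atomize_modalities/1)
    |> Map.put("provider", provider)
    # Unmapped fields are kept rather than dropped.
    |> Map.put("extra", unmapped)
  end

  # Move a value from one key to another, if the key is present.
  defp rename(map, from, to) do
    case Map.pop(map, from) do
      {nil, rest} -> rest
      {value, rest} -> Map.put(rest, to, value)
    end
  end

  # Convert known modality strings to atoms; unknown strings are left untouched.
  defp atomize_modalities(mods) when is_map(mods) do
    Map.new(mods, fn {direction, list} ->
      {direction, Enum.map(list, &Map.get(@modalities, &1, &1))}
    end)
  end
end
```

The whitelist is a deliberate choice in this sketch: converting arbitrary remote strings to atoms would grow the atom table without bound.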

Local (TOML)

{LLMDB.Sources.Local, %{dir: "priv/llm_db"}}

Directory structure: provider.toml + models/{provider}/*.toml. The source atomizes keys and injects :provider from the directory name.

Configuring Sources

config :llm_db,
  sources: [
    {LLMDB.Sources.ModelsDev, %{}},
    {LLMDB.Sources.Local, %{dir: "priv/llm_db"}}
  ]

Sources are processed in order; later sources override earlier ones.
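As an illustration of this precedence, the sketch below folds per-source model lists into one, applying the merge rules listed for the Merge stage of the ETL pipeline (last wins, :aliases unioned, other lists replaced, maps deep merged). The module MergeSketch is hypothetical, and it assumes models have already been normalized to atom keys with :provider and :id fields; the Engine's real merge may differ in detail.

```elixir
defmodule MergeSketch do
  # Fold model lists (one list per source, in configured order) into a single
  # list. A model seen again in a later source is merged over the earlier one.
  def merge_sources(model_lists) do
    model_lists
    |> List.flatten()
    |> Enum.reduce(%{}, fn model, acc ->
      key = {model.provider, model.id}
      Map.update(acc, key, model, &deep_merge(&1, model))
    end)
    |> Map.values()
  end

  # Right-hand side wins, with two exceptions described in the Merge stage.
  def deep_merge(left, right) do
    Map.merge(left, right, fn
      # :aliases lists are unioned instead of replaced.
      :aliases, l, r when is_list(l) and is_list(r) -> Enum.uniq(l ++ r)
      # Nested maps are merged recursively.
      _key, l, r when is_map(l) and is_map(r) -> deep_merge(l, r)
      # Everything else, including other lists, is replaced.
      _key, _l, r -> r
    end)
  end
end
```

With a remote source listed first and a local source second, a local entry only needs to state the fields it wants to change; untouched nested fields survive from the earlier source.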

ETL Pipeline

LLMDB.Engine.run/1 executes 7 stages:

  1. Ingest: Load sources, validate canonical format, flatten nested provider data
  2. Normalize: Convert provider IDs to atoms, normalize modalities to atoms, parse dates
  3. Validate: Zoi validation via LLMDB.Validate, drop invalid, log warnings
  4. Merge: Last-wins precedence; :aliases are unioned, other lists replaced, maps deep merged
  5. Filter: Compile allow/deny patterns (deny wins, globs supported)
  6. Enrich: Derive :family, fill :provider_model_id, apply capability defaults
  7. Index: Build providers_by_id, models_by_key, models_by_provider, aliases_by_key, then assemble the v2 snapshot

A final check warns if the result contains zero providers or zero models.
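The Filter stage's "deny wins, globs supported" rule can be sketched as below. This is an assumption-laden illustration rather than the Engine's code: the module FilterSketch is hypothetical, patterns are assumed to match against model IDs, and only "*" is treated as a wildcard.

```elixir
defmodule FilterSketch do
  # Compile a glob such as "gpt-4*" into an anchored regex.
  # Literal segments are escaped so characters like "." match themselves.
  def compile_glob(glob) do
    source =
      glob
      |> String.split("*")
      |> Enum.map_join(".*", &Regex.escape/1)

    Regex.compile!("^" <> source <> "$")
  end

  # A model ID passes when some allow pattern matches and no deny pattern does.
  def allowed?(model_id, allow, deny) do
    any_match?(allow, model_id) and not any_match?(deny, model_id)
  end

  defp any_match?(patterns, model_id) do
    Enum.any?(patterns, fn glob -> Regex.match?(compile_glob(glob), model_id) end)
  end
end
```

For example, FilterSketch.allowed?("gpt-4-turbo", ["gpt-4*"], ["*-turbo"]) is false, because the deny pattern takes precedence over the matching allow pattern. A real implementation would likely compile each pattern once up front rather than on every check.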

Mix Tasks

Custom Source Example

defmodule MyApp.InternalModels do
  @behaviour LLMDB.Source

  @impl true
  def load(_opts) do
    # Return data in the canonical format: string outer keys, atom provider keys.
    {:ok, %{
      "providers" => %{internal: %{"id" => :internal, "name" => "Internal"}},
      "models" => [
        %{"id" => "custom-gpt", "provider" => :internal, "capabilities" => %{"chat" => true}}
      ]
    }}
  end
end

# config.exs
config :llm_db, sources: [{MyApp.InternalModels, %{}}]