# Sources and Engine

The Engine runs a build-time ETL pipeline that loads data from sources, normalizes, validates, merges, enriches, and indexes it, then writes `priv/llm_db/snapshot.json`. Runtime only loads the snapshot.

## Source Behaviour

All sources implement the `LLMDB.Source` behaviour:

```elixir
@callback load(opts :: map()) :: {:ok, data :: map()} | {:error, term()}
@callback pull(opts :: map()) :: :ok | {:error, term()}  # Optional
```

### Canonical Format

```elixir
%{
  "providers" => %{
    openai: %{
      "id" => :openai,
      "name" => "OpenAI",
      "base_url" => "https://api.openai.com/v1",
      # ...
    }
  },
  "models" => [
    %{
      "id" => "gpt-4",
      "provider" => :openai,
      "name" => "GPT-4",
      # ...
    },
    # ...
  ]
}
```

Outer map uses string keys; provider keys are atoms; model IDs are strings. Use `LLMDB.Source.assert_canonical!/1` for validation.

## Built-in Sources

### ModelsDev (Remote)

```elixir
{LLMDB.Sources.ModelsDev, %{
  url: "https://models.dev/api/models",
  cache_path: "priv/llm_db/cache/models_dev.json"
}}
```

`pull/1` downloads and caches via Req. `load/1` loads from cache. Transforms models.dev schema to canonical format (`limit` → `limits`, modality strings → atoms, unmapped → `:extra`).

### Local (TOML)

```elixir
{LLMDB.Sources.Local, %{dir: "priv/llm_db"}}
```

Structure: `provider.toml` + `models/{provider}/*.toml`. Atomizes keys, injects `:provider` from directory name.

## Configuring Sources

```elixir
config :llm_db,
  sources: [
    {LLMDB.Sources.ModelsDev, %{}},
    {LLMDB.Sources.Local, %{dir: "priv/llm_db"}}
  ]
```

Sources processed in order. Later sources override earlier ones.

## ETL Pipeline

`LLMDB.Engine.run/1` executes 7 stages:

1. **Ingest**: Load sources, validate canonical format, flatten nested provider data
2. **Normalize**: Convert provider IDs to atoms, normalize modalities to atoms, parse dates
3. **Validate**: Zoi validation via `LLMDB.Validate`, drop invalid, log warnings
4. **Merge**: Last-wins precedence; `:aliases` are unioned, other lists replaced, maps deep merged
5. **Filter**: Compile allow/deny patterns (deny wins, globs supported)
6. **Enrich**: Derive `:family`, fill `:provider_model_id`, apply capability defaults
7. **Index**: Build `providers_by_id`, `models_by_key`, `models_by_provider`, `aliases_by_key`, then v2 snapshot

Final check warns if zero providers/models.

## Mix Tasks

- `mix llm_db.pull` - Fetch and cache remote sources
- `mix llm_db.build` - Run ETL, write `priv/llm_db/snapshot.json` and `lib/llm_db/generated/valid_providers.ex`

## Custom Source Example

```elixir
defmodule MyApp.InternalModels do
  @behaviour LLMDB.Source

  @impl true
  def load(_opts) do
    {:ok, %{
      "providers" => %{internal: %{"id" => :internal, "name" => "Internal"}},
      "models" => [%{"id" => "custom-gpt", "provider" => :internal, "capabilities" => %{"chat" => true}}]
    }}
  end
end

# config.exs
config :llm_db, sources: [{MyApp.InternalModels, %{}}]
```
