LLMDB.Engine (LLM DB v2025.12.4)


Pure ETL pipeline for BUILD-TIME LLM model catalog generation.

Engine is a pure function: sources in, snapshot out. It processes only the sources passed explicitly via the :sources option or, when that option is omitted, the configured sources.

This module is designed for BUILD-TIME use (e.g., mix tasks) to generate complete, unfiltered snapshots from remote/local sources that will be packaged into the library.

Pipeline Stages

  1. Ingest - Load data from configured sources
  2. Normalize - Apply normalization to providers and models per layer
  3. Validate - Validate schemas and log dropped records per layer
  4. Merge - Combine layers with precedence rules (last wins)
  5. Finalize - Enrich and nest models under providers
  6. Ensure viable - Verify catalog has content (warns if empty)

Architecture

Sources are processed in order with last-wins precedence:

  1. First source (lowest precedence)
  2. Second source
  3. ... (higher precedence)
  4. Last source (highest precedence)

The engine coordinates data ingestion, normalization, validation, merging, and finalization to produce a complete v2 snapshot ready for JSON serialization.
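The last-wins precedence described above can be illustrated with a minimal sketch. The module LastWinsSketch, its merge_layers/1 function, and the :provider and :id keys are hypothetical and not part of LLMDB. The sketch assumes a shallow, field-by-field merge; whether the real engine merges fields or replaces whole records is not stated here.

```elixir
defmodule LastWinsSketch do
  # Hypothetical helper, not part of LLMDB.
  # Each layer is a list of model maps; layers are ordered from lowest
  # to highest precedence. Models are identified by {provider, id}.
  def merge_layers(layers) do
    layers
    |> Enum.reduce(%{}, fn layer, acc ->
      Enum.reduce(layer, acc, fn model, inner ->
        key = {model.provider, model.id}
        # Map.merge/2 lets fields from the later layer override earlier ones.
        Map.update(inner, key, model, &Map.merge(&1, model))
      end)
    end)
    |> Map.values()
    |> Enum.sort_by(&{&1.provider, &1.id})
  end
end
```

With two layers that both describe the same model, fields set only in the first layer survive, while fields set in both take the value from the last layer.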

Filtering and indexing are deferred to load-time - the snapshot contains ALL data from sources. Runtime policies (allow/deny patterns, preferences) are applied when the snapshot is loaded via LLMDB.load/1.

Summary

Functions

apply_filters(models, map) - Applies allow/deny filters to models.

build_nested_providers(providers, models) - Builds the nested v2 provider structure for snapshot serialization.

run(opts \\ []) - Runs the complete ETL pipeline to generate a model catalog snapshot.

Functions

apply_filters(models, map)

@spec apply_filters([map()], map()) :: [map()]

Applies allow/deny filters to models.

Deny patterns always win over allow patterns.

Parameters

  • models - List of model maps
  • filters - The second argument (shown as map in the signature), of the form %{allow: compiled_patterns, deny: compiled_patterns}

Returns

Filtered list of models
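To make the "deny always wins" rule concrete, here is a rough sketch. It is not the library's implementation: the module FilterSketch is hypothetical, and it assumes that compiled patterns are lists of Regex values matched against model.id, with :all accepted as an allow value. The real pattern format may differ.

```elixir
defmodule FilterSketch do
  # Hypothetical re-implementation for illustration only.
  # Assumed shapes: allow is :all or a list of Regex; deny is a list of Regex.
  def apply_filters(models, %{allow: allow, deny: deny}) do
    Enum.filter(models, fn model ->
      # A model is kept only if it is allowed AND no deny pattern matches,
      # so a deny match overrides any allow match.
      allowed?(model.id, allow) and not matches_any?(model.id, deny)
    end)
  end

  defp allowed?(_id, :all), do: true
  defp allowed?(id, patterns), do: matches_any?(id, patterns)

  defp matches_any?(id, patterns) do
    Enum.any?(patterns, &Regex.match?(&1, id))
  end
end
```

Under these assumptions, a model that matches both an allow pattern and a deny pattern is dropped.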

build_nested_providers(providers, models)

@spec build_nested_providers([map()], [map()]) :: %{required(atom()) => map()}

Builds the nested v2 provider structure for snapshot serialization.

Groups models by provider and nests them under their provider. Models are keyed by model.id for easy lookup.

Parameters

  • providers - List of provider maps
  • models - List of model maps

Returns

%{atom => %{provider fields + models: %{string => model}}}
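The grouping and nesting step might look roughly like the sketch below. NestingSketch is a hypothetical module, and the sketch assumes each provider map has an atom :id and each model map has a :provider atom and a string :id. These field names are assumptions for illustration.

```elixir
defmodule NestingSketch do
  # Hypothetical re-implementation for illustration only.
  def build_nested_providers(providers, models) do
    # Group the flat model list by the provider each model belongs to.
    models_by_provider = Enum.group_by(models, & &1.provider)

    Map.new(providers, fn provider ->
      # Key each provider's models by model id for direct lookup.
      nested =
        models_by_provider
        |> Map.get(provider.id, [])
        |> Map.new(&{&1.id, &1})

      {provider.id, Map.put(provider, :models, nested)}
    end)
  end
end
```

In this sketch a provider with no models still appears in the result, with an empty models map.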

run(opts \\ [])

@spec run(keyword()) :: {:ok, map()} | {:error, term()}

Runs the complete ETL pipeline to generate a model catalog snapshot.

Pure function that processes sources into a complete, unfiltered snapshot. BUILD-TIME only.

Options

  • :sources - List of {module, opts} source tuples (optional, defaults to Config.sources!())

Note: :allow, :deny, :prefer, and :filters options are ignored. Filtering is a load-time concern applied via LLMDB.load/1 and runtime config.

Returns

  • {:ok, snapshot_map} - Success with v2 snapshot structure. An empty catalog (for example, when there are no sources) also returns {:ok, snapshot_map}; the engine warns but still succeeds.
  • {:error, term} - Any failure
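A caller may want to distinguish the three outcomes listed above. The following sketch shows one way to do that. RunResultSketch and describe/1 are hypothetical names, and the sketch assumes that an empty catalog is represented by an empty :providers map, which the documentation does not state explicitly.

```elixir
defmodule RunResultSketch do
  # Hypothetical helper: classifies a result tuple of the shape
  # returned by run/1 and counts models on success.

  # Assumption: an empty catalog has an empty :providers map.
  def describe({:ok, %{providers: providers}}) when map_size(providers) == 0 do
    {:empty, 0}
  end

  def describe({:ok, %{providers: providers}}) do
    count =
      providers
      |> Map.values()
      |> Enum.map(&map_size(&1.models))
      |> Enum.sum()

    {:ok, count}
  end

  def describe({:error, reason}), do: {:error, reason}
end
```

In a build task, the result of LLMDB.Engine.run/1 could be piped straight into such a helper, for example to fail the build when the catalog is empty.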

Snapshot Structure (v2)

%{
  version: 2,
  generated_at: String.t(),
  providers: %{atom => %{provider_fields... + models: %{String.t() => Model.t()}}}
}

The snapshot contains ALL models from all sources. Indexes and filters are built at load-time by LLMDB.load/1 using the LLMDB.Index module.
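As an illustration of the v2 shape, the sketch below walks a snapshot and lists every model it contains. SnapshotSketch and model_keys/1 are hypothetical names, and the sample data in the usage note is invented for the example.

```elixir
defmodule SnapshotSketch do
  # Hypothetical helper: returns sorted "provider:model_id" strings
  # for every model in a v2 snapshot map.
  def model_keys(%{version: 2, providers: providers}) do
    keys =
      for {provider_id, provider} <- providers,
          {model_id, _model} <- provider.models do
        "#{provider_id}:#{model_id}"
      end

    Enum.sort(keys)
  end
end
```

For a snapshot with an :openai provider holding "gpt-4o" and an :anthropic provider holding "claude-3", this returns ["anthropic:claude-3", "openai:gpt-4o"].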