Configuration

Encoder Options

{:ok, encoder} = Stephen.Encoder.load(
  model: "colbert-ir/colbertv2.0",
  max_query_length: 32,
  max_doc_length: 180,
  projection_dim: 128
)

Model Selection

Model                     Description
colbert-ir/colbertv2.0    Official ColBERT v2 with trained projection (recommended)
colbert-ir/colbertv1.0    Original ColBERT model
bert-base-uncased         Standard BERT (requires random projection)
Any HuggingFace BERT      Custom models work with random projection

Parameters

Parameter           Default                   Description
:model              colbert-ir/colbertv2.0    HuggingFace model name
:max_query_length   32                        Query padding length
:max_doc_length     180                       Maximum document tokens
:projection_dim     128                       Output dimension (nil to disable)
:base_module        auto-detected             Override the Bumblebee module

ColBERT vs Standard Models

When loading official ColBERT models, Stephen:

  1. Downloads the model from HuggingFace
  2. Loads BERT backbone via Bumblebee
  3. Extracts trained projection weights from SafeTensors

For other models, Stephen initializes a random projection matrix.
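
For example, loading a base BERT checkpoint works the same way (a sketch using the options documented above; since the model ships no trained projection weights, the projection is randomly initialized):

{:ok, encoder} = Stephen.Encoder.load(
  # Any HuggingFace BERT model; no ColBERT projection weights are available,
  # so Stephen initializes a random projection of size :projection_dim.
  model: "bert-base-uncased",
  projection_dim: 128
)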

Document Encoding Options

embeddings = Stephen.Encoder.encode_document(encoder, text,
  skip_punctuation?: true,
  deduplicate?: true
)

Option               Default    Description
:skip_punctuation?   false      Filter out punctuation token embeddings
:deduplicate?        false      Remove near-duplicate embeddings

Index Options

Standard Index

index = Stephen.Index.new(
  embedding_dim: 128,
  space: :cosine,
  max_tokens: 100_000,
  m: 16,
  ef_construction: 200
)

Parameter          Default    Description
:embedding_dim     required   Must match encoder output
:space             :cosine    Distance metric (:cosine or :l2)
:max_tokens        100,000    Maximum token embeddings
:m                 16         HNSW connectivity
:ef_construction   200        Build quality

PLAID Index

plaid = Stephen.Plaid.new(
  embedding_dim: 128,
  num_centroids: 1024
)

Parameter         Default    Description
:embedding_dim    required   Must match encoder output
:num_centroids    1024       Number of clusters

Compressed Index

index = Stephen.Index.Compressed.new(
  embedding_dim: 128,
  num_centroids: 1024,
  compression_centroids: 2048,
  residual_bits: 8
)

Parameter                 Default    Description
:embedding_dim            required   Must match encoder output
:num_centroids            1024       PLAID centroids
:compression_centroids    2048       Compression codebook size
:residual_bits            8          Quantization depth (1, 2, 4, or 8)

Search Options

results = Stephen.search(encoder, index, query,
  top_k: 10,
  rerank?: true,
  candidates_per_token: 50
)

Option                   Default    Description
:top_k                   10         Number of results
:rerank?                 true       Full MaxSim reranking
:candidates_per_token    50         ANN candidates per query token (standard Index)
:nprobe                  32         Centroids to probe (PLAID)
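
The :nprobe option applies when searching a PLAID index. A minimal sketch, assuming Stephen.search accepts the PLAID index built above in place of the standard index:

results = Stephen.search(encoder, plaid, query,
  top_k: 10,
  # Probing more centroids increases recall at the cost of latency.
  nprobe: 64
)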

PRF Options

Pseudo-relevance feedback expands queries using top-ranked documents:

results = Stephen.search_with_prf(encoder, index, query,
  top_k: 10,
  feedback_docs: 3,
  expansion_tokens: 10,
  expansion_weight: 0.5
)

Option               Default    Description
:top_k               10         Final results to return
:feedback_docs       3          Documents used for feedback
:expansion_tokens    10         Tokens to add from feedback
:expansion_weight    0.5        Weight for expansion vs original query

Tuning tips (see the example after this list):

  • More feedback_docs adds diversity but may introduce noise
  • A higher expansion_weight emphasizes the expansion terms over the original query
  • More expansion_tokens broadens matching but may reduce precision
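
For instance, a recall-oriented configuration might lean more heavily on feedback (a sketch using the options documented above; the specific values are illustrative):

results = Stephen.search_with_prf(encoder, index, query,
  top_k: 10,
  # More feedback documents and a higher weight favor recall over precision.
  feedback_docs: 5,
  expansion_tokens: 15,
  expansion_weight: 0.7
)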

GPU Acceleration

Enable EXLA for GPU acceleration:

# mix.exs
{:exla, "~> 0.9"}

# config/config.exs
config :nx, default_backend: EXLA.Backend

For CPU-only execution with EXLA's compiler optimizations:

config :nx, default_backend: {EXLA.Backend, client: :host}
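
To confirm which backend is active at runtime (Nx.default_backend/0 is part of Nx; the exact return shape depends on your configuration):

# Returns the currently configured default backend, e.g. EXLA.Backend.
Nx.default_backend()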

Batch Processing

For large document sets, process in batches to manage memory:

documents
|> Stream.chunk_every(100)
|> Enum.reduce(index, fn batch, acc ->
  Stephen.index(encoder, acc, batch)
end)

For batch queries:

results = Stephen.Retriever.batch_search(encoder, index, queries, top_k: 10)