Stephen.Encoder (Stephen v1.0.0)


Encodes text into per-token embeddings using BERT.

ColBERT uses per-token embeddings rather than a single pooled [CLS] embedding, enabling fine-grained late-interaction scoring: each query token is matched against its most similar document token.
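The late-interaction (MaxSim) score can be sketched independently of the model. The module and toy vectors below are illustrative only; the library operates on Nx tensors, not plain lists:

```elixir
# Late-interaction (MaxSim) scoring over per-token embeddings,
# sketched with plain lists standing in for Nx tensors.
defmodule MaxSimSketch do
  # Dot product of two (already normalized) vectors, i.e. cosine similarity.
  def dot(a, b), do: Enum.zip(a, b) |> Enum.map(fn {x, y} -> x * y end) |> Enum.sum()

  # For each query token embedding, take its maximum similarity over
  # all document token embeddings, then sum the per-token maxima.
  def score(query_embs, doc_embs) do
    query_embs
    |> Enum.map(fn q -> Enum.max(Enum.map(doc_embs, &dot(q, &1))) end)
    |> Enum.sum()
  end
end

query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.8, 0.6], [0.0, 1.0]]
MaxSimSketch.score(query, doc)
# => 1.8  (0.8 from the first query token, 1.0 from the second)
```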

Features

  • Query padding with [MASK] tokens for query augmentation
  • Configurable max lengths for queries and documents
  • Batch encoding for efficient processing
  • Optional linear projection to reduce embedding dimension

Summary

Functions

Returns the output embedding dimension (after projection if enabled).

Encodes text into per-token embeddings without markers.

Encodes a document into per-token embeddings.

Encodes multiple documents in batch.

Encodes multiple queries in batch.

Encodes a query into per-token embeddings.

Returns the raw model embedding dimension (before projection).

Loads a BERT model for encoding.

Tokenizes text and returns the token strings.

Types

embeddings()

@type embeddings() :: Nx.Tensor.t()

encoder()

@type encoder() :: %{
  model: Axon.t(),
  params: map(),
  tokenizer: Tokenizers.Tokenizer.t(),
  embedding_dim: pos_integer(),
  output_dim: pos_integer(),
  max_query_length: pos_integer(),
  max_doc_length: pos_integer(),
  projection: Nx.Tensor.t() | nil,
  mask_token_id: non_neg_integer(),
  skiplist: MapSet.t()
}

Functions

embedding_dim(encoder)

@spec embedding_dim(encoder()) :: pos_integer()

Returns the output embedding dimension (after projection if enabled).

encode(encoder, text)

@spec encode(encoder(), String.t()) :: embeddings()

Encodes text into per-token embeddings without markers.

Returns normalized embeddings with shape {sequence_length, output_dim}.
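The per-token normalization can be pictured as L2-normalizing each embedding row; sketched here with a plain list (the function itself returns an Nx tensor):

```elixir
# Illustrative sketch: L2-normalize a single embedding vector.
normalize = fn vec ->
  norm = :math.sqrt(Enum.sum(Enum.map(vec, &(&1 * &1))))
  Enum.map(vec, &(&1 / norm))
end

normalize.([3.0, 4.0])
# => [0.6, 0.8]
```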

encode_document(encoder, text, opts \\ [])

@spec encode_document(encoder(), String.t(), keyword()) :: embeddings()

Encodes a document into per-token embeddings.

Prepends the document marker [D] before encoding. Returns normalized embeddings with shape {sequence_length, output_dim}.

Options

  • :skip_punctuation? - Whether to filter out punctuation token embeddings (default: false)
  • :deduplicate? - Whether to remove duplicate token embeddings (default: false)
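The two filters can be pictured as dropping skiplisted rows and then collapsing repeats; the token/embedding pairs and skiplist below are illustrative, not the library's internal representation:

```elixir
# Sketch of the :skip_punctuation? and :deduplicate? options, applied
# to token strings paired with their embedding rows (toy data).
skiplist = MapSet.new(["!", ",", "."])

filter = fn pairs ->
  pairs
  # :skip_punctuation? - drop embeddings for skiplisted tokens
  |> Enum.reject(fn {token, _emb} -> MapSet.member?(skiplist, token) end)
  # :deduplicate? - keep one embedding per repeated token
  |> Enum.uniq_by(fn {token, _emb} -> token end)
end

filter.([{"hello", [0.1]}, {",", [0.2]}, {"world", [0.3]}, {"world", [0.3]}])
# => [{"hello", [0.1]}, {"world", [0.3]}]
```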

encode_documents(encoder, texts, opts \\ [])

@spec encode_documents(encoder(), [String.t()], keyword()) :: [embeddings()]

Encodes multiple documents in batch.

Runs a single batched forward pass instead of encoding each document separately. Returns a list of normalized embeddings, one per document.

Options

  • :skip_punctuation? - Whether to filter out punctuation token embeddings (default: false)
  • :deduplicate? - Whether to remove duplicate token embeddings (default: false)
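Why a batched pass still returns per-document results: sequences are padded to a common length for the forward pass, then each document's embeddings are sliced back out by its true length. A sketch of that unbatching step (names and atoms are illustrative):

```elixir
# Sketch: recover per-document rows from a padded batch using
# each document's true (unpadded) sequence length.
unbatch = fn padded_rows_per_doc, lengths ->
  Enum.zip(padded_rows_per_doc, lengths)
  |> Enum.map(fn {rows, len} -> Enum.take(rows, len) end)
end

unbatch.([[:a, :b, :pad], [:c, :pad, :pad]], [2, 1])
# => [[:a, :b], [:c]]
```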

encode_queries(encoder, texts, opts \\ [])

@spec encode_queries(encoder(), [String.t()], keyword()) :: [embeddings()]

Encodes multiple queries in batch.

Returns a list of normalized embeddings, one per query.

encode_query(encoder, text, opts \\ [])

@spec encode_query(encoder(), String.t(), keyword()) :: embeddings()

Encodes a query into per-token embeddings.

Prepends the query marker [Q] and pads with [MASK] tokens to max_query_length. Returns normalized embeddings with shape {max_query_length, output_dim}.

Options

  • :pad - Whether to pad with [MASK] tokens (default: true)
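The [MASK] padding (query augmentation) step can be sketched on its own: token ids shorter than max_query_length are right-padded with the mask token id. The function name and ids below are illustrative:

```elixir
# Sketch of query augmentation: right-pad the query's token ids with
# the [MASK] token id up to max_query_length.
pad_query = fn token_ids, mask_token_id, max_query_length ->
  token_ids ++ List.duplicate(mask_token_id, max_query_length - length(token_ids))
end

pad_query.([101, 1, 7957, 102], 103, 8)
# => [101, 1, 7957, 102, 103, 103, 103, 103]
```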

hidden_dim(encoder)

@spec hidden_dim(encoder()) :: pos_integer()

Returns the raw model embedding dimension (before projection).

load(opts \\ [])

@spec load(keyword()) :: {:ok, encoder()} | {:error, term()}

Loads a BERT model for encoding.

Options

  • :model - HuggingFace model name (default: colbert-ir/colbertv2.0)
  • :max_query_length - Maximum query length in tokens (default: 32)
  • :max_doc_length - Maximum document length in tokens (default: 180)
  • :projection_dim - Output dimension after projection (default: 128, nil to disable)
  • :base_module - Override the Bumblebee module for ColBERT models (auto-detected from config.json)

ColBERT Models

When loading a ColBERT model (e.g., colbert-ir/colbertv2.0), the trained projection weights are automatically loaded from the model's SafeTensors file. The base model type is auto-detected from config.json, but can be overridden with :base_module.

Supported base models (auto-detected): BERT, RoBERTa, DistilBERT, ALBERT, XLM-RoBERTa. Note: Only BERT has been tested with official ColBERT weights. Other architectures should work if the model provides compatible weights.

Examples

{:ok, encoder} = Stephen.Encoder.load()
{:ok, encoder} = Stephen.Encoder.load(model: "bert-base-uncased", projection_dim: 128)

# Load official ColBERT model with trained weights
{:ok, encoder} = Stephen.Encoder.load(model: "colbert-ir/colbertv2.0")

# Override base model type for custom ColBERT models
{:ok, encoder} = Stephen.Encoder.load(
  model: "custom/roberta-colbert",
  base_module: Bumblebee.Text.Roberta
)

tokenize(encoder, text, opts \\ [])

@spec tokenize(encoder(), String.t(), keyword()) :: [String.t()]

Tokenizes text and returns the token strings.

Useful for visualization and debugging. The tokens correspond to the embeddings returned by encode_query/2 or encode_document/2.

Options

  • :type - :query or :document (default: :document)
  • :max_length - Maximum tokens (defaults based on type)

Examples

tokens = Stephen.Encoder.tokenize(encoder, "Stephen Colbert")
# => ["[CLS]", "[D]", "stephen", "colbert", "[SEP]"]

tokens = Stephen.Encoder.tokenize(encoder, "Conan", type: :query)
# => ["[CLS]", "[Q]", "conan", "[MASK]", ..., "[SEP]"]