Stephen.Encoder (Stephen v1.0.0)
Encodes text into per-token embeddings using BERT.
ColBERT uses per-token embeddings rather than pooled [CLS] embeddings, enabling fine-grained late interaction scoring.
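The late-interaction score itself is not part of this module, but it can be sketched with Nx: for each query token, take the maximum cosine similarity against all document tokens, then sum over query tokens (MaxSim). This sketch assumes both inputs are L2-normalized per-token embeddings, as this encoder returns them.

```elixir
defmodule MaxSimSketch do
  import Nx.Defn

  # query: {num_query_tokens, dim}, doc: {num_doc_tokens, dim},
  # both L2-normalized so the dot product is cosine similarity.
  defn score(query, doc) do
    query
    |> Nx.dot(Nx.transpose(doc))  # {q, d} similarity matrix
    |> Nx.reduce_max(axes: [1])   # best-matching doc token per query token
    |> Nx.sum()                   # MaxSim: sum of per-token maxima
  end
end
```

A higher score means more query tokens found a close match somewhere in the document.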
Features
- Query padding with [MASK] tokens for query augmentation
- Configurable max lengths for queries and documents
- Batch encoding for efficient processing
- Optional linear projection to reduce embedding dimension
Summary
Functions
Returns the output embedding dimension (after projection if enabled).
Encodes text into per-token embeddings without markers.
Encodes a document into per-token embeddings.
Encodes multiple documents in batch.
Encodes multiple queries in batch.
Encodes a query into per-token embeddings.
Returns the raw model embedding dimension (before projection).
Loads a BERT model for encoding.
Tokenizes text and returns the token strings.
Types
@type embeddings() :: Nx.Tensor.t()
@type encoder() :: %{
        model: Axon.t(),
        params: map(),
        tokenizer: Tokenizers.Tokenizer.t(),
        embedding_dim: pos_integer(),
        output_dim: pos_integer(),
        max_query_length: pos_integer(),
        max_doc_length: pos_integer(),
        projection: Nx.Tensor.t() | nil,
        mask_token_id: non_neg_integer(),
        skiplist: MapSet.t()
      }
Functions
@spec embedding_dim(encoder()) :: pos_integer()
Returns the output embedding dimension (after projection if enabled).
@spec encode(encoder(), String.t()) :: embeddings()
Encodes text into per-token embeddings without markers.
Returns normalized embeddings with shape {sequence_length, output_dim}.
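A minimal usage sketch, assuming the default model and projection dimension (128) from load/1:

```elixir
{:ok, encoder} = Stephen.Encoder.load()

emb = Stephen.Encoder.encode(encoder, "late interaction retrieval")
# Per-token, normalized embeddings: {sequence_length, output_dim}
{_seq_len, 128} = Nx.shape(emb)
```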
@spec encode_document(encoder(), String.t(), keyword()) :: embeddings()
Encodes a document into per-token embeddings.
Prepends the document marker [D] before encoding. Returns normalized embeddings with shape {sequence_length, output_dim}.
Options
- :skip_punctuation? - Whether to filter out punctuation token embeddings (default: false)
- :deduplicate? - Whether to remove duplicate token embeddings (default: false)
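A usage sketch (the document text is illustrative; both options shown are off by default):

```elixir
{:ok, encoder} = Stephen.Encoder.load()

doc_emb =
  Stephen.Encoder.encode_document(encoder, "Stephen Colbert hosted the show.",
    skip_punctuation?: true,
    deduplicate?: true
  )
# Fewer rows than the raw token count, since punctuation and
# duplicate token embeddings are dropped.
```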
@spec encode_documents(encoder(), [String.t()], keyword()) :: [embeddings()]
Encodes multiple documents in batch.
Uses true batched inference for efficiency. Returns a list of normalized embeddings, one per document.
Options
- :skip_punctuation? - Whether to filter out punctuation token embeddings (default: false)
- :deduplicate? - Whether to remove duplicate token embeddings (default: false)
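A batch-encoding sketch (document strings are illustrative):

```elixir
{:ok, encoder} = Stephen.Encoder.load()

docs = ["First document.", "A second, longer document about retrieval."]

[emb1, emb2] =
  Stephen.Encoder.encode_documents(encoder, docs, skip_punctuation?: true)
# One embeddings tensor per input document; sequence lengths may differ,
# but the embedding dimension is the same for all.
```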
@spec encode_queries(encoder(), [String.t()], keyword()) :: [embeddings()]
Encodes multiple queries in batch.
Returns a list of normalized embeddings, one per query.
@spec encode_query(encoder(), String.t(), keyword()) :: embeddings()
Encodes a query into per-token embeddings.
Prepends the query marker [Q] and pads with [MASK] tokens to max_query_length. Returns normalized embeddings with shape {max_query_length, output_dim}.
Options
- :pad - Whether to pad with [MASK] tokens (default: true)
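A query-encoding sketch, assuming the defaults from load/1 (max_query_length of 32, projection dimension of 128):

```elixir
{:ok, encoder} = Stephen.Encoder.load()

q_emb = Stephen.Encoder.encode_query(encoder, "who hosts the late show?")
# Padded with [MASK] tokens to max_query_length:
{32, 128} = Nx.shape(q_emb)

# Without [MASK] augmentation, the shape follows the actual token count:
short = Stephen.Encoder.encode_query(encoder, "who hosts the late show?", pad: false)
```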
Loads a BERT model for encoding.
Options
- :model - HuggingFace model name (default: colbert-ir/colbertv2.0)
- :max_query_length - Maximum query length in tokens (default: 32)
- :max_doc_length - Maximum document length in tokens (default: 180)
- :projection_dim - Output dimension after projection (default: 128, nil to disable)
- :base_module - Override the Bumblebee module for ColBERT models (auto-detected from config.json)
ColBERT Models
When loading a ColBERT model (e.g., colbert-ir/colbertv2.0), the trained projection
weights are automatically loaded from the model's SafeTensors file. The base model
type is auto-detected from config.json, but can be overridden with :base_module.
Supported base models (auto-detected): BERT, RoBERTa, DistilBERT, ALBERT, XLM-RoBERTa.
Note: Only BERT has been tested with official ColBERT weights. Other architectures
should work if the model provides compatible weights.
Examples
{:ok, encoder} = Stephen.Encoder.load()
{:ok, encoder} = Stephen.Encoder.load(model: "bert-base-uncased", projection_dim: 128)
# Load official ColBERT model with trained weights
{:ok, encoder} = Stephen.Encoder.load(model: "colbert-ir/colbertv2.0")
# Override base model type for custom ColBERT models
{:ok, encoder} = Stephen.Encoder.load(
model: "custom/roberta-colbert",
base_module: Bumblebee.Text.Roberta
)
Tokenizes text and returns the token strings.
Useful for visualization and debugging. The tokens correspond to the
embeddings returned by encode_query/2 or encode_document/2.
Options
- :type - :query or :document (default: :document)
- :max_length - Maximum tokens (defaults based on type)
Examples
tokens = Stephen.Encoder.tokenize(encoder, "Stephen Colbert")
# => ["[CLS]", "[D]", "stephen", "colbert", "[SEP]"]
tokens = Stephen.Encoder.tokenize(encoder, "Conan", type: :query)
# => ["[CLS]", "[Q]", "conan", "[MASK]", ..., "[SEP]"]