Stephen.Index (Stephen v1.0.0)
Manages the ColBERT document index.
Stores per-token embeddings from documents and enables efficient approximate nearest neighbor search using HNSWLib.
Each token embedding in the index maps back to its source document, enabling document-level retrieval through token-level search.
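The token-to-document mapping described above can be illustrated with plain maps. This is a minimal sketch, not the module's implementation: `TokenMapSketch`, its `add/3` helper, and the `next_id` counter are hypothetical names, and only the `token_to_doc` and `doc_to_tokens` keys mirror fields of the struct documented on this page.

```elixir
defmodule TokenMapSketch do
  # Hypothetical helper, not part of Stephen: assigns consecutive token IDs
  # to a document's tokens and records the mapping in both directions.
  def add(state, doc_id, num_tokens) do
    first = state.next_id
    # The explicit step of 1 yields an empty list when num_tokens is 0.
    ids = Enum.to_list(first..(first + num_tokens - 1)//1)

    token_to_doc =
      Enum.reduce(ids, state.token_to_doc, fn id, acc -> Map.put(acc, id, doc_id) end)

    %{
      state
      | token_to_doc: token_to_doc,
        doc_to_tokens: Map.put(state.doc_to_tokens, doc_id, ids),
        next_id: first + num_tokens
    }
  end
end

state = %{token_to_doc: %{}, doc_to_tokens: %{}, next_id: 0}
state = state |> TokenMapSketch.add("doc_a", 3) |> TokenMapSketch.add("doc_b", 2)

# A nearest-neighbor hit on token 3 resolves to the document that owns it.
IO.inspect(Map.fetch!(state.token_to_doc, 3))
```

Keeping both directions makes a token hit cheap to resolve to its document, and makes it cheap to find every token that belongs to a document when it is removed.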
Summary
Functions
Adds a document's embeddings to the index.
Adds multiple documents to the index.
Removes a document from the index.
Removes multiple documents from the index.
Returns all document IDs in the index.
Gets the stored embeddings for a document.
Checks if a document exists in the index.
Loads an index from disk.
Creates a new empty index.
Saves the index to disk.
Searches for the k nearest token embeddings to the query tokens.
Returns the number of documents in the index.
Returns the number of token embeddings in the index.
Updates a document in the index by replacing its embeddings.
Types
@type doc_id() :: term()
@type t() :: %Stephen.Index{
  deleted_token_ids: [non_neg_integer()],
  doc_count: non_neg_integer(),
  doc_embeddings: %{required(term()) => Nx.Tensor.t()},
  doc_to_tokens: %{required(term()) => [non_neg_integer()]},
  embedding_dim: non_neg_integer(),
  hnsw_index: HNSWLib.Index.t(),
  token_count: non_neg_integer(),
  token_to_doc: %{required(non_neg_integer()) => term()}
}
Functions
@spec add(t(), doc_id(), Nx.Tensor.t()) :: t()
Adds a document's embeddings to the index.
Arguments
index - The index struct
doc_id - Unique identifier for the document
embeddings - Tensor of shape {num_tokens, embedding_dim}
Returns
Updated index struct.
@spec add_all(t(), [{doc_id(), Nx.Tensor.t()}]) :: t()
Adds multiple documents to the index.
Arguments
index - The index struct
documents - List of {doc_id, embeddings} tuples
Returns
Updated index struct.
Removes a document from the index.
The document's token embeddings are marked as deleted in the HNSW index and their IDs are saved for reuse when new documents are added.
Arguments
index - The index struct
doc_id - The document ID to remove
Returns
Updated index struct, or the original index if doc_id is not found.
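The ID reuse described for removal can be sketched with a simple free list. This is an illustration under stated assumptions, not the library's code: `IdReuseSketch`, `delete/2`, `allocate/2`, and the `next_id` counter are hypothetical, and the sketch covers only the bookkeeping, not the step that marks embeddings as deleted in the HNSW index.

```elixir
defmodule IdReuseSketch do
  # Hypothetical bookkeeping only; the real module also marks the
  # embeddings as deleted inside the HNSW index.
  def delete(state, doc_id) do
    case Map.pop(state.doc_to_tokens, doc_id) do
      {nil, _} ->
        # Unknown document: return the state unchanged.
        state

      {ids, doc_to_tokens} ->
        %{
          state
          | doc_to_tokens: doc_to_tokens,
            token_to_doc: Map.drop(state.token_to_doc, ids),
            deleted_token_ids: ids ++ state.deleted_token_ids
        }
    end
  end

  # Takes up to `n` IDs from the free list, then falls back to fresh IDs.
  def allocate(state, n) do
    {reused, remaining} = Enum.split(state.deleted_token_ids, n)
    fresh_count = n - length(reused)
    fresh = Enum.to_list(state.next_id..(state.next_id + fresh_count - 1)//1)

    {reused ++ fresh,
     %{state | deleted_token_ids: remaining, next_id: state.next_id + fresh_count}}
  end
end

state = %{
  doc_to_tokens: %{"doc_a" => [0, 1], "doc_b" => [2, 3, 4]},
  token_to_doc: %{0 => "doc_a", 1 => "doc_a", 2 => "doc_b", 3 => "doc_b", 4 => "doc_b"},
  deleted_token_ids: [],
  next_id: 5
}

state = IdReuseSketch.delete(state, "doc_a")
{ids, state} = IdReuseSketch.allocate(state, 3)

# Two freed IDs are reused first, then one fresh ID is issued.
IO.inspect(ids)
```

Reusing freed IDs keeps the ID space within the capacity set when the index was created, instead of growing it with every delete-and-add cycle.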
Removes multiple documents from the index.
Arguments
index - The index struct
doc_ids - List of document IDs to remove
Returns
Updated index struct.
Returns all document IDs in the index.
@spec get_embeddings(t(), doc_id()) :: Nx.Tensor.t() | nil
Gets the stored embeddings for a document.
Checks if a document exists in the index.
Loads an index from disk.
Arguments
path - Directory path where the index was saved
Creates a new empty index.
Options
:embedding_dim - Dimension of embeddings (required)
:space - Distance space, :cosine or :l2 (default: :cosine)
:max_tokens - Maximum number of token embeddings (default: 100_000)
:m - HNSW M parameter (default: 16)
:ef_construction - HNSW ef_construction parameter (default: 200)
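As an illustration, the options can be collected in a keyword list like the one below. The constructor's name is not shown on this page, so only the options are sketched; the value 128 for :embedding_dim is an assumed example, while the other values repeat the documented defaults.

```elixir
# Options for creating an index. Only :embedding_dim is required; the rest
# restate the documented defaults and could be omitted.
opts = [
  embedding_dim: 128,      # assumed example value, must match your encoder
  space: :cosine,          # or :l2
  max_tokens: 100_000,     # capacity in token embeddings, not documents
  m: 16,                   # HNSW M parameter
  ef_construction: 200     # HNSW ef_construction parameter
]

IO.inspect(Keyword.keys(opts))
```

Note that :max_tokens counts token embeddings, so a corpus of long documents can reach the limit with far fewer documents than the number suggests.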
Saves the index to disk.
Arguments
index - The index struct
path - Directory path to save the index
@spec search_tokens(t(), Nx.Tensor.t(), pos_integer()) :: %{ required(doc_id()) => pos_integer() }
Searches for the k nearest token embeddings to the query tokens.
Returns candidate document IDs with their matching token counts.
Arguments
index - The index struct
query_embeddings - Tensor of shape {query_len, embedding_dim}
k - Number of nearest neighbors per query token (default: 10)
Returns
Map of doc_id => count of matching tokens
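The shape of the return value can be illustrated by aggregating token-level hits into per-document counts. This is a sketch under assumptions: `CandidateCountSketch` and its input format are hypothetical, the neighbor IDs are hard-coded rather than produced by HNSWLib, and whether the real function counts repeated hits on the same token is not stated here (this sketch counts every hit).

```elixir
defmodule CandidateCountSketch do
  # `neighbor_ids` holds, for each query token, the token IDs returned by a
  # k-nearest-neighbor lookup (hypothetical input; the real lookup goes
  # through HNSWLib).
  def count(neighbor_ids, token_to_doc) do
    neighbor_ids
    |> List.flatten()
    |> Enum.flat_map(fn id ->
      # Skip token IDs with no owning document, such as deleted tokens.
      case Map.fetch(token_to_doc, id) do
        {:ok, doc_id} -> [doc_id]
        :error -> []
      end
    end)
    |> Enum.frequencies()
  end
end

token_to_doc = %{0 => "doc_a", 1 => "doc_a", 2 => "doc_b", 3 => "doc_b", 4 => "doc_c"}

# Three query tokens with k = 2 neighbors each.
neighbor_ids = [[0, 2], [1, 3], [0, 4]]

counts = CandidateCountSketch.count(neighbor_ids, token_to_doc)
IO.inspect(counts)
```

Documents with higher counts matched more query tokens, which makes the map a natural candidate set for a later exact re-ranking step.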
@spec size(t()) :: non_neg_integer()
Returns the number of documents in the index.
@spec token_count(t()) :: non_neg_integer()
Returns the number of token embeddings in the index.
@spec update(t(), doc_id(), Nx.Tensor.t()) :: t()
Updates a document in the index by replacing its embeddings.
This is equivalent to deleting and re-adding the document.
Arguments
index - The index struct
doc_id - The document ID to update
embeddings - New embeddings tensor
Returns
Updated index struct.