Stephen.Index (Stephen v1.0.0)

Manages the ColBERT document index.

Stores per-token embeddings from documents and enables efficient approximate nearest neighbor search using HNSWLib.

Each token embedding in the index maps back to its source document, enabling document-level retrieval through token-level search.
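A minimal end-to-end sketch of that flow, using only the functions documented on this page. It assumes the stephen and nx packages are available as project dependencies; the 4-dimensional embeddings are toy values standing in for real ColBERT encoder output.

```elixir
# Toy data: two tokens for "doc-a", three tokens for "doc-b".
doc_a = Nx.tensor([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
doc_b = Nx.tensor([[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0], [0.5, 0.5, 0.0, 0.0]])

index =
  Stephen.Index.new(embedding_dim: 4)
  |> Stephen.Index.add("doc-a", doc_a)
  |> Stephen.Index.add("doc-b", doc_b)

# One query token; ask for its 2 nearest stored token embeddings.
query = Nx.tensor([[1.0, 0.1, 0.0, 0.0]])
candidates = Stephen.Index.search_tokens(index, query, 2)

IO.inspect(Stephen.Index.size(index), label: "documents")
IO.inspect(Stephen.Index.token_count(index), label: "token embeddings")
IO.inspect(candidates, label: "doc_id => matching token count")
```

The search runs over individual token embeddings, but the result is keyed by document ID, which is what makes the index usable as a candidate generator for document retrieval.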

Summary

Functions

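add(index, doc_id, embeddings)

Adds a document's embeddings to the index.

add_all(index, documents)

Adds multiple documents to the index.

delete(index, doc_id)

Removes a document from the index.

delete_all(index, doc_ids)

Removes multiple documents from the index.

doc_ids(index)

Returns all document IDs in the index.

get_embeddings(index, doc_id)

Gets the stored embeddings for a document.

has_doc?(index, doc_id)

Checks if a document exists in the index.

load(path)

Loads an index from disk.

new(opts \\ [])

Creates a new empty index.

save(index, path)

Saves the index to disk.

search_tokens(index, query_embeddings, k \\ 10)

Searches for the k nearest token embeddings to the query tokens.

size(index)

Returns the number of documents in the index.

token_count(index)

Returns the number of token embeddings in the index.

update(index, doc_id, embeddings)

Updates a document in the index by replacing its embeddings.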

Types

doc_id()

@type doc_id() :: term()

t()

@type t() :: %Stephen.Index{
  deleted_token_ids: [non_neg_integer()],
  doc_count: non_neg_integer(),
  doc_embeddings: %{required(term()) => Nx.Tensor.t()},
  doc_to_tokens: %{required(term()) => [non_neg_integer()]},
  embedding_dim: non_neg_integer(),
  hnsw_index: HNSWLib.Index.t(),
  token_count: non_neg_integer(),
  token_to_doc: %{required(non_neg_integer()) => term()}
}
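To illustrate how the doc_to_tokens and token_to_doc fields relate, here is a self-contained sketch in plain Elixir. The TokenBookkeeping module and its register/3 function are hypothetical and exist only for this example; they are not part of Stephen, and the library's internal bookkeeping may differ.

```elixir
defmodule TokenBookkeeping do
  # Hypothetical helper (not part of Stephen): records which token IDs
  # belong to a document, in both directions, and bumps the counters.
  def register(state, doc_id, token_ids) do
    reverse = Map.new(token_ids, fn token_id -> {token_id, doc_id} end)

    %{
      state
      | doc_to_tokens: Map.put(state.doc_to_tokens, doc_id, token_ids),
        token_to_doc: Map.merge(state.token_to_doc, reverse),
        doc_count: state.doc_count + 1,
        token_count: state.token_count + length(token_ids)
    }
  end
end

# A plain map mirroring four of the struct's fields.
state = %{doc_to_tokens: %{}, token_to_doc: %{}, doc_count: 0, token_count: 0}

state =
  state
  |> TokenBookkeeping.register("doc-a", [0, 1])
  |> TokenBookkeeping.register("doc-b", [2, 3, 4])

IO.inspect(state.doc_to_tokens, label: "doc_to_tokens")
IO.inspect(state.token_to_doc, label: "token_to_doc")
```

Keeping both directions makes the two common lookups cheap: token_to_doc resolves a search hit to its document, and doc_to_tokens finds every token to remove when a document is deleted.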

Functions

add(index, doc_id, embeddings)

@spec add(t(), doc_id(), Nx.Tensor.t()) :: t()

Adds a document's embeddings to the index.

Arguments

  • index - The index struct
  • doc_id - Unique identifier for the document
  • embeddings - Tensor of shape {num_tokens, embedding_dim}

Returns

Updated index struct.

add_all(index, documents)

@spec add_all(t(), [{doc_id(), Nx.Tensor.t()}]) :: t()

Adds multiple documents to the index.

Arguments

  • index - The index struct
  • documents - List of {doc_id, embeddings} tuples

Returns

Updated index struct.

delete(index, doc_id)

@spec delete(t(), doc_id()) :: t()

Removes a document from the index.

The document's token embeddings are marked as deleted in the HNSW index and their IDs are saved for reuse when new documents are added.

Arguments

  • index - The index struct
  • doc_id - The document ID to remove

Returns

Updated index struct, or the original index unchanged if doc_id is not found.
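The ID reuse described above can be pictured as a free list. The following self-contained sketch shows one way such recycling could work; the IdPool module is hypothetical and is not Stephen's actual implementation.

```elixir
defmodule IdPool do
  # Hypothetical free-list allocator (not Stephen's code).
  # `free` holds IDs released by deleted documents;
  # `next` is the smallest ID that has never been handed out.
  def new, do: %{free: [], next: 0}

  # Hands out `n` IDs, taking released IDs before minting fresh ones.
  def take(%{free: free, next: next}, n) do
    {reused, still_free} = Enum.split(free, n)
    fresh_count = n - length(reused)

    fresh =
      if fresh_count > 0 do
        Enum.to_list(next..(next + fresh_count - 1))
      else
        []
      end

    {reused ++ fresh, %{free: still_free, next: next + fresh_count}}
  end

  # Returns a deleted document's IDs to the pool.
  def release(pool, ids), do: %{pool | free: ids ++ pool.free}
end

pool = IdPool.new()
{doc_a_ids, pool} = IdPool.take(pool, 3)
{doc_b_ids, pool} = IdPool.take(pool, 2)

# Deleting "doc-a" frees its three IDs...
pool = IdPool.release(pool, doc_a_ids)

# ...so a new four-token document reuses them and needs only one fresh ID.
{doc_c_ids, pool} = IdPool.take(pool, 4)

IO.inspect({doc_a_ids, doc_b_ids, doc_c_ids}, label: "allocated IDs")
```

Recycling IDs this way keeps the label space compact, which matters when the underlying index has a fixed capacity such as the :max_tokens option of new/1.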

delete_all(index, doc_ids)

@spec delete_all(t(), [doc_id()]) :: t()

Removes multiple documents from the index.

Arguments

  • index - The index struct
  • doc_ids - List of document IDs to remove

Returns

Updated index struct.

doc_ids(index)

@spec doc_ids(t()) :: [doc_id()]

Returns all document IDs in the index.

get_embeddings(index, doc_id)

@spec get_embeddings(t(), doc_id()) :: Nx.Tensor.t() | nil

Gets the stored embeddings for a document. Per the spec, the result is nil when no embeddings are stored for doc_id.

has_doc?(index, doc_id)

@spec has_doc?(t(), doc_id()) :: boolean()

Checks if a document exists in the index.

load(path)

@spec load(Path.t()) :: {:ok, t()} | {:error, term()}

Loads an index from disk.

Arguments

  • path - Directory path where the index was saved

new(opts \\ [])

@spec new(keyword()) :: t()

Creates a new empty index.

Options

  • :embedding_dim - Dimension of embeddings (required)
  • :space - Distance space, :cosine or :l2 (default: :cosine)
  • :max_tokens - Maximum number of token embeddings (default: 100_000)
  • :m - HNSW M parameter (default: 16)
  • :ef_construction - HNSW ef_construction parameter (default: 200)
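As an illustration, the options can be combined as below. The values are arbitrary examples, not tuning recommendations; in HNSW generally, larger M and ef_construction values tend to improve recall at the cost of build time and memory. The sketch assumes the stephen package is available as a dependency.

```elixir
# Illustrative values only; pick them for your own corpus and encoder.
index =
  Stephen.Index.new(
    embedding_dim: 128,
    space: :cosine,
    max_tokens: 200_000,
    m: 32,
    ef_construction: 400
  )

# A freshly created index holds no documents.
IO.inspect(Stephen.Index.size(index), label: "documents")
```

Because :max_tokens counts token embeddings rather than documents, a rough sizing estimate is the expected number of documents multiplied by the average number of tokens per document.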

save(index, path)

@spec save(t(), Path.t()) :: :ok | {:error, term()}

Saves the index to disk.

Arguments

  • index - The index struct
  • path - Directory path to save the index
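A save and load round trip might look like the sketch below. It assumes the stephen and nx packages are available. This page does not say whether save/2 creates the directory, so the example creates it up front as a precaution; the directory name is made up for the example.

```elixir
# Hypothetical location for the example; any writable directory works.
path = Path.join(System.tmp_dir!(), "stephen_index_example")
File.mkdir_p!(path)

index =
  Stephen.Index.new(embedding_dim: 4)
  |> Stephen.Index.add("doc-a", Nx.tensor([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]))

# Both functions report failure as {:error, reason}, so `with` can chain them.
result =
  with :ok <- Stephen.Index.save(index, path),
       {:ok, restored} <- Stephen.Index.load(path) do
    {:ok, Stephen.Index.doc_ids(restored)}
  end

IO.inspect(result, label: "round trip")
```

Note the asymmetry in the return values: save/2 returns a bare :ok on success, while load/1 wraps the restored index in an {:ok, index} tuple.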

search_tokens(index, query_embeddings, k \\ 10)

@spec search_tokens(t(), Nx.Tensor.t(), pos_integer()) :: %{
  required(doc_id()) => pos_integer()
}

Searches for the k nearest token embeddings to the query tokens.

Returns candidate document IDs with their matching token counts.

Arguments

  • index - The index struct
  • query_embeddings - Tensor of shape {query_len, embedding_dim}
  • k - Number of nearest neighbors per query token (default: 10)

Returns

Map from doc_id to the count of matching tokens.
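The shape of this result can be sketched in plain Elixir: resolve each retrieved token ID to its document, then tally per document. The CandidateCounter module and the sample data are hypothetical, and the exact counting rule Stephen applies (for example, how repeated hits are treated) is an assumption here.

```elixir
defmodule CandidateCounter do
  # Hypothetical helper (not Stephen's code). `neighbor_ids` holds one list
  # of token IDs per query token, as a nearest-neighbor search might return.
  def count(neighbor_ids, token_to_doc) do
    neighbor_ids
    |> List.flatten()
    |> Enum.map(fn token_id -> Map.get(token_to_doc, token_id) end)
    # Drop IDs with no owning document, e.g. tokens of deleted documents.
    |> Enum.reject(&is_nil/1)
    |> Enum.frequencies()
  end
end

token_to_doc = %{0 => "doc-a", 1 => "doc-a", 2 => "doc-b", 3 => "doc-b", 4 => "doc-b"}

# Three query tokens with k = 2 neighbors each; ID 9 maps to no document.
neighbor_ids = [[0, 2], [1, 4], [3, 9]]

counts = CandidateCounter.count(neighbor_ids, token_to_doc)
IO.inspect(counts, label: "doc_id => matching token count")
```

A higher count means more query tokens found a close neighbor in that document, which makes the map a reasonable basis for choosing candidates to rescore with full late-interaction scoring.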

size(index)

@spec size(t()) :: non_neg_integer()

Returns the number of documents in the index.

token_count(index)

@spec token_count(t()) :: non_neg_integer()

Returns the number of token embeddings in the index.

update(index, doc_id, embeddings)

@spec update(t(), doc_id(), Nx.Tensor.t()) :: t()

Updates a document in the index by replacing its embeddings.

This is equivalent to deleting and re-adding the document.

Arguments

  • index - The index struct
  • doc_id - The document ID to update
  • embeddings - New embeddings tensor

Returns

Updated index struct.
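The equivalence can be sketched as follows, assuming the stephen and nx packages are available. Each variant starts from its own freshly built index so that the two runs do not share any state.

```elixir
# Builds a one-document index; called once per variant below.
build = fn ->
  Stephen.Index.new(embedding_dim: 4)
  |> Stephen.Index.add("doc-a", Nx.tensor([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]))
end

new_embeddings = Nx.tensor([[0.0, 0.0, 1.0, 0.0]])

# Variant 1: a single update call.
updated = Stephen.Index.update(build.(), "doc-a", new_embeddings)

# Variant 2: the delete-then-add sequence the description refers to.
replaced =
  build.()
  |> Stephen.Index.delete("doc-a")
  |> Stephen.Index.add("doc-a", new_embeddings)

IO.inspect(Stephen.Index.doc_ids(updated), label: "after update")
IO.inspect(Stephen.Index.doc_ids(replaced), label: "after delete + add")
```

The replacement tensor does not need the same number of tokens as the original, since the old embeddings are removed as a unit before the new ones are added.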