Text.Embedding (Text v0.5.0)


Word embeddings — load pre-trained vectors and compute similarity, nearest neighbours, and analogies.

Designed around the fastText .vec text format used by the pre-trained vectors that Facebook publishes for 157 languages (Common Crawl + Wikipedia, 300-dimensional, available from https://fasttext.cc/docs/en/crawl-vectors.html). word2vec emits the same layout when run with -binary 0, and GloVe text output loads once the n dim header line is prepended (gensim's glove2word2vec does this), so vectors from any of those tools work too.

File format

fastText .vec is a UTF-8 text file:

  • Line 1: two integers separated by a space — n dim where n is the number of vectors and dim is the vector dimensionality.

  • Lines 2..n+1: a token followed by dim space-separated floats. The token is everything before the first space; values may contain scientific notation.
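The layout above is simple enough to parse with plain Elixir. As an illustration only — VecParse is a hypothetical helper, not part of this module's API:

```elixir
# Hypothetical sketch of parsing the .vec layout; not Text.Embedding's internals.
defmodule VecParse do
  def parse(path) do
    [header | lines] = path |> File.stream!() |> Enum.to_list()

    # Line 1: "n dim".
    [n, dim] = header |> String.split() |> Enum.map(&String.to_integer/1)

    rows =
      for line <- lines do
        # The token is everything before the first space; the rest are floats.
        [token | values] = line |> String.trim_trailing() |> String.split(" ")
        {token, Enum.map(values, fn v -> v |> Float.parse() |> elem(0) end)}
      end

    {n, dim, rows}
  end
end
```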

Memory

A typical fastText .vec file for a single language runs to several gigabytes (English Common Crawl: ~7 GB). Loading it eagerly with load/2 materialises the entire vector table as an Nx tensor of shape {n, dim}. Where memory is tight, prefer fastText's quantised binary .ftz format (not yet supported in this module) or load only a relevant subset of the vocabulary via :filter.

Quick tour

{:ok, emb} = Text.Embedding.load("path/to/cc.en.300.vec")

Text.Embedding.vector(emb, "king")
#=> #Nx.Tensor<f32[300] [...]>

Text.Embedding.similarity(emb, "king", "queen")
#=> 0.84...

Text.Embedding.nearest(emb, "king", k: 3)
#=> [{"queen", 0.84}, {"prince", 0.79}, {"monarch", 0.77}]

Text.Embedding.analogy(emb, "king", "man", "woman", k: 1)
#=> [{"queen", 0.71}]

Cosine similarity

All similarity functions in this module use cosine similarity by default — the dot product of two L2-normalised vectors. The implementation pre-normalises the embedding matrix at load time so per-query cost is one Nx.dot against a {n, dim} matrix.
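The trick can be sketched with a toy matrix (values made up for illustration; a real matrix would be {n, 300}):

```elixir
# Toy {3, 2} "embedding matrix".
vectors = Nx.tensor([[3.0, 0.0], [1.0, 1.0], [0.0, 2.0]])

# Done once at load time: L2-normalise every row.
norms = vectors |> Nx.pow(2) |> Nx.sum(axes: [1], keep_axes: true) |> Nx.sqrt()
normalised = Nx.divide(vectors, norms)

# Per query: one matrix-vector product yields the cosine similarity of
# the query row against every row at once.
sims = Nx.dot(normalised, normalised[0])
```

Row 0 scores 1.0 against itself, ~0.707 against the diagonal row, and 0.0 against the orthogonal one.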

Summary

Functions

analogy(embeddings, a, b, c, options \\ [])
Returns the top :k candidates for the analogy a : b :: c : ?.

load(path, options \\ [])
Loads embeddings from a .vec file.

nearest(embeddings, token, options \\ [])
Returns the :k nearest neighbours of token by cosine similarity.

similarity(embeddings, a, b)
Returns the cosine similarity between two tokens.

size(embedding)
Returns the size of the loaded vocabulary.

vector(embeddings, token)
Returns the vector for token, or nil if the token is not in the vocabulary.

Types

t()

@type t() :: %Text.Embedding{
  dim: pos_integer(),
  index_to_token: %{required(non_neg_integer()) => String.t()},
  n: non_neg_integer(),
  norms: Nx.Tensor.t(),
  vectors: Nx.Tensor.t(),
  vocab: %{required(String.t()) => non_neg_integer()}
}

Functions

analogy(embeddings, a, b, c, options \\ [])

@spec analogy(t(), String.t(), String.t(), String.t(), keyword()) :: [
  {String.t(), float()}
]

Returns the top :k candidates for the analogy a : b :: c : ?.

Computes the query vector as b - a + c, normalises it, and finds the nearest neighbours by cosine similarity. The three input tokens are excluded from the result, since the most-similar vector to b - a + c is almost always one of them.
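The query arithmetic can be sketched with made-up 2-d vectors (in the real function these would be rows of the normalised matrix, looked up via the vocab map):

```elixir
# Made-up stand-ins for the vectors of a, b and c.
a = Nx.tensor([1.0, 0.0])
b = Nx.tensor([0.0, 1.0])
c = Nx.tensor([0.6, 0.8])

# Query vector b - a + c, then L2-normalised so that a plain dot product
# against the normalised matrix yields cosine similarities.
query = b |> Nx.subtract(a) |> Nx.add(c)
query = Nx.divide(query, Nx.LinAlg.norm(query))
```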

Arguments

  • a, b, c — the three tokens framing the analogy. All three must be in the vocabulary; if any is missing, the function returns [].

Options

  • :k — number of candidates to return. Defaults to 1.

Returns

  • A list of {token, similarity} pairs sorted by similarity descending.

Examples

Text.Embedding.analogy(emb, "king", "man", "woman", k: 3)
# => [{"queen", 0.71}, {"princess", 0.62}, ...]

load(path, options \\ [])

@spec load(
  Path.t(),
  keyword()
) :: {:ok, t()} | {:error, term()}

Loads embeddings from a .vec file.

Arguments

  • path — the path to a fastText- or word2vec-style .vec text file.

Options

  • :filter — a list or MapSet of tokens to keep. When given, only vectors whose token is in the filter are loaded. Useful for cutting memory by an order of magnitude when you only need a domain-specific vocabulary.

  • :max_tokens — load at most this many tokens (regardless of :filter). Useful for testing or a quick top-N baseline.

Returns

  • {:ok, %Text.Embedding{}} on success.

  • {:error, reason} if the file is missing or malformed.
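The effect of :filter can be approximated by streaming the file and keeping only the wanted rows. FilteredLoad below is a hypothetical illustration of that idea, not the module's actual implementation:

```elixir
# Hypothetical sketch of a filtered load; not Text.Embedding's internals.
defmodule FilteredLoad do
  def rows(path, filter) do
    path
    |> File.stream!()
    |> Stream.drop(1)                       # skip the "n dim" header line
    |> Stream.map(&String.split(String.trim_trailing(&1), " "))
    |> Stream.filter(fn [token | _] -> MapSet.member?(filter, token) end)
    |> Enum.map(fn [token | values] ->
      {token, Enum.map(values, fn v -> v |> Float.parse() |> elem(0) end)}
    end)
  end
end
```

Because the file is streamed line by line, only the filtered rows are ever held in memory.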

nearest(embeddings, token, options \\ [])

@spec nearest(t(), String.t(), keyword()) :: [{String.t(), float()}]

Returns the :k nearest neighbours of token by cosine similarity.

The token itself is excluded from the result.

Arguments

  • token — a string.

Options

  • :k — number of neighbours to return. Defaults to 10.

Returns

  • A list of {token, similarity} pairs sorted by similarity descending. Returns [] if the token is not in the vocabulary.

Examples

Text.Embedding.nearest(emb, "king", k: 3)
# => [{"queen", 0.84}, {"prince", 0.79}, {"monarch", 0.77}]
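This kind of lookup can be sketched as one dot product followed by an argsort over the similarity vector (toy data; token names and values are made up):

```elixir
# Toy pre-normalised matrix; tokens label the rows.
normalised = Nx.tensor([
  [1.0, 0.0],                 # "king"
  [0.9486833, 0.31622777],    # "queen"
  [0.0, 1.0]                  # "carrot"
])
tokens = {"king", "queen", "carrot"}

query_idx = 0
sims = Nx.dot(normalised, normalised[query_idx])

top =
  sims
  |> Nx.argsort(direction: :desc)
  |> Nx.to_flat_list()
  |> Enum.reject(&(&1 == query_idx))        # exclude the token itself
  |> Enum.take(2)
  |> Enum.map(fn i -> {elem(tokens, i), Nx.to_number(sims[i])} end)
```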

similarity(embeddings, a, b)

@spec similarity(t(), String.t(), String.t()) :: float() | nil

Returns the cosine similarity between two tokens.

Returns

  • A float in [-1.0, +1.0].

  • nil if either token is missing from the vocabulary.

Examples

Text.Embedding.similarity(emb, "king", "queen")
# => 0.84
Text.Embedding.similarity(emb, "king", "carrot")
# => 0.18

size(embedding)

@spec size(t()) :: non_neg_integer()

Returns the size of the loaded vocabulary.

vector(embeddings, token)

@spec vector(t(), String.t()) :: Nx.Tensor.t() | nil

Returns the vector for token, or nil if the token is not in the vocabulary.

Examples

vector = Text.Embedding.vector(embeddings, "king")
# => #Nx.Tensor<f32[300] [...]>

Text.Embedding.vector(embeddings, "no-such-word")
# => nil