Word embeddings — load pre-trained vectors and compute similarity, nearest neighbours, and analogies.
Designed around the fastText .vec text format used by the
pre-trained vectors that Facebook publishes for ~157 languages
(Common Crawl + Wikipedia, 300-dimensional, available from
https://fasttext.cc/docs/en/crawl-vectors.html). The format is
also produced by word2vec and GloVe when they are saved in plain-text
mode, so vectors from either of those tools load as well.
File format
fastText .vec is a UTF-8 text file:
Line 1: two integers separated by a space — n dim, where n is the number of
vectors and dim is the vector dimensionality.
Lines 2..n+1: a token followed by dim space-separated floats. The token is
everything before the first space; values may contain scientific notation.
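For illustration, a hypothetical file holding 2 vectors of dimension 3 (values
invented for the example) would contain:
2 3
king 0.12 -0.53 8.1e-2
queen 0.10 -0.49 1.1e-1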
Memory
A typical fastText .vec for a single language is several gigabytes
(English Common Crawl: ~7 GB). Loading it eagerly with load/2
materialises the entire vector table as an Nx tensor of shape
{n, dim}. For deployments where memory is tight, prefer the
quantised binary format used by lid.176.ftz (not yet supported in
this module) or load only a relevant subset of the vocabulary via
:filter.
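For example, a filtered load that keeps only a small, illustrative vocabulary:
vocab = MapSet.new(["king", "queen", "prince", "monarch"])
{:ok, emb} = Text.Embedding.load("path/to/cc.en.300.vec", filter: vocab)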
Quick tour
{:ok, emb} = Text.Embedding.load("path/to/cc.en.300.vec")
Text.Embedding.vector(emb, "king")
#=> #Nx.Tensor<f32[300] [...]>
Text.Embedding.similarity(emb, "king", "queen")
#=> 0.84...
Text.Embedding.nearest(emb, "king", k: 3)
#=> [{"queen", 0.84}, {"prince", 0.79}, {"monarch", 0.77}]
Text.Embedding.analogy(emb, "king", "man", "woman", k: 1)
#=> [{"queen", 0.71}]Cosine similarity
All similarity functions in this module use cosine similarity by
default — the dot product of two L2-normalised vectors. The
implementation pre-normalises the embedding matrix at load time
so the per-query cost is a single Nx.dot against the {n, dim} matrix.
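A minimal sketch of the idea (illustrative only, not this module's internal code):
# Normalise each row once, as load/2 does at load time...
vectors = Nx.tensor([[1.0, 2.0, 2.0], [0.0, 3.0, 4.0]])
norms = Nx.LinAlg.norm(vectors, axes: [1], keep_axes: true)
normalised = Nx.divide(vectors, norms)
# ...then each query is one matrix-vector product.
query = normalised[0]
Nx.dot(normalised, query)
#=> a length-2 tensor of cosine similarities (1.0 for the query's own row)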
Summary
Functions
Returns the top :k candidates for the analogy a : b :: c : ?.
Loads embeddings from a .vec file.
Returns the :k nearest neighbours of token by cosine similarity.
Returns the cosine similarity between two tokens.
Returns the size of the loaded vocabulary.
Returns the vector for token, or nil if the token is not in the
vocabulary.
Types
@type t() :: %Text.Embedding{
        dim: pos_integer(),
        index_to_token: %{required(non_neg_integer()) => String.t()},
        n: non_neg_integer(),
        norms: Nx.Tensor.t(),
        vectors: Nx.Tensor.t(),
        vocab: %{required(String.t()) => non_neg_integer()}
      }
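For instance, the struct's fields can be read by pattern matching (illustrative):
%Text.Embedding{n: n, dim: dim} = emb
# n   - number of vectors loaded
# dim - vector dimensionality, e.g. 300 for the published fastText vectors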
Functions
Returns the top :k candidates for the analogy a : b :: c : ?.
Computes the query vector as a - b + c, normalises it, and finds
the nearest neighbours by cosine similarity. The three input tokens
are excluded from the result, since the most-similar vector to
a - b + c is almost always one of them.
Arguments
a, b, c — the three tokens framing the analogy. All three must be in the vocabulary; if any is missing, the function returns [].
Options
:k — number of candidates to return. Defaults to 1.
Returns
- A list of {token, similarity} pairs sorted by similarity descending.
Examples
Text.Embedding.analogy(emb, "king", "man", "woman", k: 3)
# => [{"queen", 0.71}, {"princess", 0.62}, ...]
Loads embeddings from a .vec file.
Arguments
path is the path to a fastText- or word2vec-style .vec text file.
Options
:filter — a list or MapSet of tokens to keep. When given, only vectors whose token is in the filter are loaded. Useful for cutting memory by an order of magnitude when you only need a domain-specific vocabulary.
:max_tokens — load at most this many tokens (regardless of :filter). Useful for testing or a quick top-N baseline.
Returns
{:ok, %Text.Embedding{}} on success. {:error, reason} if the file is missing or malformed.
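For example (the path and token cap are illustrative):
{:ok, emb} = Text.Embedding.load("path/to/cc.en.300.vec", max_tokens: 50_000)
{:error, _reason} = Text.Embedding.load("no/such/file.vec")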
Returns the :k nearest neighbours of token by cosine similarity.
The token itself is excluded from the result.
Arguments
token — a string.
Options
:k — number of neighbours to return. Defaults to 10.
Returns
- A list of {token, similarity} pairs sorted by similarity descending. Returns [] if the token is not in the vocabulary.
Examples
Text.Embedding.nearest(emb, "king", k: 3)
# => [{"queen", 0.84}, {"prince", 0.79}, {"monarch", 0.77}]
Returns the cosine similarity between two tokens.
Returns
A float in [-1.0, +1.0], or nil if either token is missing from the vocabulary.
Examples
Text.Embedding.similarity(emb, "king", "queen")
# => 0.84
Text.Embedding.similarity(emb, "king", "carrot")
# => 0.18
@spec size(t()) :: non_neg_integer()
Returns the size of the loaded vocabulary.
@spec vector(t(), String.t()) :: Nx.Tensor.t() | nil
Returns the vector for token, or nil if the token is not in the
vocabulary.
Examples
vector = Text.Embedding.vector(embeddings, "king")
# => #Nx.Tensor<f32[300] [...]>
Text.Embedding.vector(embeddings, "no-such-word")
# => nil