# `Text.Embedding`
[🔗](https://github.com/kipcole9/text/blob/v0.5.0/lib/embedding.ex#L1)

Word embeddings — load pre-trained vectors and compute similarity,
nearest neighbours, and analogies.

Designed around the fastText `.vec` text format used by the
pre-trained vectors that Facebook publishes for ~157 languages
(Common Crawl + Wikipedia, 300-dimensional, available from
<https://fasttext.cc/docs/en/crawl-vectors.html>). The format is
also produced by word2vec and GloVe with `--save-format text`,
so any of those work too.

### File format

fastText `.vec` is a UTF-8 text file:

* Line 1: two integers separated by a space — `n dim` where `n` is
  the number of vectors and `dim` is the vector dimensionality.

* Lines 2..n+1: a token followed by `dim` space-separated floats.
  The token is everything before the first space; values may
  contain scientific notation.
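
Concretely, a minimal sketch of reading this layout (variable names
are illustrative; this is not the module's internal parser):

```elixir
# Peek at the header and the first vector line; streaming means the
# multi-gigabyte file is never loaded whole.
[header, first] = "cc.en.300.vec" |> File.stream!() |> Enum.take(2)

[n, dim] = header |> String.split() |> Enum.map(&String.to_integer/1)

# The token is everything before the first space; the rest are floats
# (scientific notation such as 1.2e-05 parses with Float.parse/1).
[token | floats] = String.split(first)
values = Enum.map(floats, &elem(Float.parse(&1), 0))
length(values) == dim
#=> true
```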

### Memory

A typical fastText `.vec` for a single language is several gigabytes
(English Common Crawl: ~7 GB). Loading it eagerly with `load/2`
materialises the entire vector table as an `Nx` tensor of shape
`{n, dim}`. For deployments where memory is tight, prefer the
quantised binary format used by `lid.176.ftz` (not yet supported in
this module) or load only a relevant subset of the vocabulary via
`:filter`.
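
For example, loading only a small, known vocabulary (the token set
here is illustrative):

```elixir
# Only rows whose token is in the filter are materialised.
vocab = MapSet.new(["king", "queen", "prince", "monarch", "man", "woman"])
{:ok, emb} = Text.Embedding.load("path/to/cc.en.300.vec", filter: vocab)

Text.Embedding.size(emb)
#=> 6 (assuming all six tokens occur in the file)
```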

### Quick tour

    {:ok, emb} = Text.Embedding.load("path/to/cc.en.300.vec")

    Text.Embedding.vector(emb, "king")
    #=> #Nx.Tensor<f32[300] [...]>

    Text.Embedding.similarity(emb, "king", "queen")
    #=> 0.84...

    Text.Embedding.nearest(emb, "king", k: 3)
    #=> [{"queen", 0.84}, {"prince", 0.79}, {"monarch", 0.77}]

    Text.Embedding.analogy(emb, "king", "man", "woman", k: 1)
    #=> [{"queen", 0.71}]

### Cosine similarity

All similarity functions in this module use cosine similarity by
default — the dot product of two L2-normalised vectors. The
implementation pre-normalises the embedding matrix at load time, so
the per-query cost is a single `Nx.dot` against the `{n, dim}` matrix.
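
A sketch of that idea in plain `Nx` (illustrative, not this module's
internals; assume `vectors` is the `{n, dim}` `f32` tensor and `v` a
single query vector):

```elixir
# Done once at load time: L2-normalise every row.
row_norms =
  vectors
  |> Nx.multiply(vectors)
  |> Nx.sum(axes: [1], keep_axes: true)
  |> Nx.sqrt()

normalised = Nx.divide(vectors, row_norms)

# Done per query: normalise v, then one matrix-vector dot.
query = Nx.divide(v, Nx.sqrt(Nx.sum(Nx.multiply(v, v))))
sims = Nx.dot(normalised, query)   # shape {n}; sims[i] = cosine(v, row i)
```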

# `t`

```elixir
@type t() :: %Text.Embedding{
  dim: pos_integer(),
  index_to_token: %{required(non_neg_integer()) => String.t()},
  n: non_neg_integer(),
  norms: Nx.Tensor.t(),
  vectors: Nx.Tensor.t(),
  vocab: %{required(String.t()) => non_neg_integer()}
}
```
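
The struct fields are plain data and can be read directly. Values
below are illustrative, assuming the 300-dimensional English Common
Crawl vectors (2 million tokens):

    emb.dim                     #=> 300
    emb.n                       #=> 2000000
    emb.vocab["king"]           #=> row index of "king" in `emb.vectors`
    emb.index_to_token[emb.vocab["king"]]
    #=> "king"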

# `analogy`

```elixir
@spec analogy(t(), String.t(), String.t(), String.t(), keyword()) :: [
  {String.t(), float()}
]
```

Returns the top `:k` candidates `?` for the analogy `a : b :: ? : c`.

Computes the query vector as `a - b + c`, normalises it, and finds its
nearest neighbours by cosine similarity. The three input tokens are
excluded from the result, since the most-similar vector to `a - b + c`
is almost always one of them.
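
A sketch of the query construction (illustrative, not the module's
internal code; assumes `emb.vectors` holds the pre-normalised rows
described under *Cosine similarity* above):

```elixir
# analogy(emb, "king", "man", "woman"): queen ~ king - man + woman.
get = fn token -> emb.vectors[emb.vocab[token]] end

q = get.("king") |> Nx.subtract(get.("man")) |> Nx.add(get.("woman"))
q = Nx.divide(q, Nx.sqrt(Nx.sum(Nx.multiply(q, q))))

# One pass over the matrix, then take the top `k` rows, skipping the
# rows for "king", "man" and "woman".
sims = Nx.dot(emb.vectors, q)
```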

### Arguments

* `a`, `b`, `c` — the three tokens framing the analogy. All three
  must be in the vocabulary; if any is missing, the function
  returns `[]`.

### Options

* `:k` — number of candidates to return. Defaults to `1`.

### Returns

* A list of `{token, similarity}` pairs sorted by similarity
  descending.

### Examples

    Text.Embedding.analogy(emb, "king", "man", "woman", k: 3)
    # => [{"queen", 0.71}, {"princess", 0.62}, ...]

# `load`

```elixir
@spec load(
  Path.t(),
  keyword()
) :: {:ok, t()} | {:error, term()}
```

Loads embeddings from a `.vec` file.

### Arguments

* `path` is the path to a fastText- or word2vec-style `.vec` text
  file.

### Options

* `:filter` — a list or `MapSet` of tokens to keep. When given,
  only vectors whose token is in the filter are loaded. Useful for
  cutting memory by an order of magnitude when you only need a
  domain-specific vocabulary.

* `:max_tokens` — load at most this many tokens (regardless of
  `:filter`). Useful for testing or a quick top-N baseline.

### Returns

* `{:ok, %Text.Embedding{}}` on success.

* `{:error, reason}` if the file is missing or malformed.
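
### Examples

The exact `reason` term is implementation-defined, so match on the
tuple shape rather than a specific atom:

    case Text.Embedding.load("path/to/cc.en.300.vec") do
      {:ok, emb} -> Text.Embedding.size(emb)
      {:error, reason} -> raise "could not load embeddings: #{inspect(reason)}"
    end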

# `nearest`

```elixir
@spec nearest(t(), String.t(), keyword()) :: [{String.t(), float()}]
```

Returns the `:k` nearest neighbours of `token` by cosine similarity.

The token itself is excluded from the result.

### Arguments

* `token` — the query token whose neighbourhood is searched.

### Options

* `:k` — number of neighbours to return. Defaults to `10`.

### Returns

* A list of `{token, similarity}` pairs sorted by similarity
  descending. Returns `[]` if the token is not in the vocabulary.

### Examples

    Text.Embedding.nearest(emb, "king", k: 3)
    # => [{"queen", 0.84}, {"prince", 0.79}, {"monarch", 0.77}]

# `similarity`

```elixir
@spec similarity(t(), String.t(), String.t()) :: float() | nil
```

Returns the cosine similarity between two tokens.

### Returns

* A float in `[-1.0, +1.0]`.

* `nil` if either token is missing from the vocabulary.

### Examples

    Text.Embedding.similarity(emb, "king", "queen")
    # => 0.84
    Text.Embedding.similarity(emb, "king", "carrot")
    # => 0.18

# `size`

```elixir
@spec size(t()) :: non_neg_integer()
```

Returns the size of the loaded vocabulary.
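
### Examples

With `:filter` or `:max_tokens` the count reflects what was actually
loaded, not the file header's `n`. Illustrative, assuming the file
holds at least 50,000 tokens:

    {:ok, emb} = Text.Embedding.load("path/to/cc.en.300.vec", max_tokens: 50_000)
    Text.Embedding.size(emb)
    #=> 50000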

# `vector`

```elixir
@spec vector(t(), String.t()) :: Nx.Tensor.t() | nil
```

Returns the vector for `token`, or `nil` if the token is not in the
vocabulary.

### Examples

    vector = Text.Embedding.vector(embeddings, "king")
    # => #Nx.Tensor<f32[300] [...]>

    Text.Embedding.vector(embeddings, "no-such-word")
    # => nil

---

*Consult [api-reference.md](api-reference.md) for the complete listing.*
