Text.Language.Classifier.Fasttext.Subwords (Text v0.5.0)

Character n-gram extraction and input-matrix indexing for fastText models.

Mirrors the C++ Dictionary::computeSubwords and Dictionary::pushHash routines in src/dictionary.cc. Used at inference time to convert a word into the subset of input-matrix rows whose embeddings are averaged into the word's feature vector.

Algorithm

fastText prefixes the word with < (BOW) and suffixes it with > (EOW), then generates every character-aligned UTF-8 substring whose length in characters is between minn and maxn. Single-character n-grams at the very start or very end of the padded word are dropped: those are the bare boundary markers < and >, which carry no information beyond the word identity.
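A minimal sketch of this enumeration in Elixir, assuming valid UTF-8 input (the shipped implementation walks raw bytes, but for well-formed UTF-8 the two approaches agree, since every character starts with exactly one leading byte). NgramSketch is a hypothetical module, not this library's API:

defmodule NgramSketch do
  # Illustrates the loop structure of computeSubwords: outer loop over
  # character start positions, inner loop over n-gram lengths.
  def ngrams(word, minn, maxn) do
    chars = String.codepoints("<" <> word <> ">")
    len = length(chars)

    for i <- 0..(len - 1),
        n <- minn..maxn,
        i + n <= len,
        # Skip the bare boundary markers "<" and ">" (only reachable when minn == 1).
        not (n == 1 and (i == 0 or i + n == len)) do
      chars |> Enum.slice(i, n) |> Enum.join()
    end
  end
end

NgramSketch.ngrams("the", 2, 4)
# => ["<t", "<th", "<the", "th", "the", "the>", "he", "he>", "e>"]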

Each n-gram's hash (see Text.Language.Classifier.Fasttext.Hash) is reduced modulo args.bucket and offset by dictionary.nwords, landing in the n-gram region of the input matrix (rows nwords through nwords + bucket - 1) where the subword embeddings live.
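In Elixir terms, the unpruned mapping from n-gram to row looks like the line below; Hash.hash/1 is a hypothetical stand-in for whatever Text.Language.Classifier.Fasttext.Hash actually exports:

row = dictionary.nwords + rem(Hash.hash(ngram), args.bucket)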

UTF-8 handling

fastText operates on UTF-8 byte sequences but counts characters, not bytes, when sizing n-grams. A continuation byte (0x80..0xBF) is never a valid n-gram start, and once a leading byte is consumed any continuation bytes that follow are pulled into the same character. The implementation inspects bytes directly via (byte &&& 0xC0) == 0x80 to identify continuation bytes — exactly what the C++ does.
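A quick illustration of that mask (a sketch, not library code):

import Bitwise

continuation? = fn byte -> (byte &&& 0xC0) == 0x80 end

# "中" encodes to the three bytes 0xE4 0xB8 0xAD: one leading byte
# followed by two continuation bytes.
for <<byte <- "中">>, do: {byte, continuation?.(byte)}
# => [{228, false}, {184, true}, {173, true}]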

This byte handling must match the reference bit-for-bit; otherwise non-Latin scripts (Chinese, Cyrillic, Devanagari, Arabic) hash to different buckets and inference quality collapses for those languages. The test suite includes golden subword-index fixtures, generated from the official fastText Python bindings on lid.176, for differential validation.

pushHash semantics

fastText's pushHash has three regimes, keyed off pruneidx_size (sketched in Elixir after the list):

  • < 0 — model was never pruned. Push nwords + id, where id is the n-gram's hash already reduced modulo bucket. This is the regime lid.176 runs in.

  • == 0 — model went through pruning but produced no entries. Drop the n-gram entirely. Rare in practice.

  • > 0 — model has a populated prune index. Look the hash up in pruneidx; if absent, drop the n-gram; if present, push nwords + remapped_id.

See Dictionary::pushHash in src/dictionary.cc.
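A minimal sketch of the three regimes, written against the push_hash/4 contract documented below (drop = empty list, keep = single-element list). PushHashSketch is a hypothetical module used only for illustration:

defmodule PushHashSketch do
  def push_hash(id, nwords, pruneidx_size, pruneidx) do
    cond do
      # Never pruned: every n-gram keeps its hashed slot.
      pruneidx_size < 0 ->
        [nwords + id]

      # Pruned to nothing: every n-gram is dropped.
      pruneidx_size == 0 ->
        []

      # Populated prune index: survivors are remapped, the rest dropped.
      true ->
        case Map.fetch(pruneidx, id) do
          {:ok, remapped} -> [nwords + remapped]
          :error -> []
        end
    end
  end
end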

Summary

Functions

compute_indices(word, args, dictionary)

Returns the input-matrix row indices contributed by a word's n-grams.

compute_ngrams(word, minn, maxn)

Generates the character n-grams of a word with the BOW/EOW boundary markers.

push_hash(id, nwords, pruneidx_size, pruneidx)

Direct port of fastText's pushHash. Returns either an empty list (drop) or a single-element list containing the input-matrix row index.

Functions

compute_indices(word, args, dictionary)

Returns the input-matrix row indices contributed by a word's n-grams.

Equivalent to calling fastText's pushHash once per generated n-gram, given the model's args and dictionary. Indices are returned in the same order the reference implementation appends them to its feature vector.
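One plausible composition of the functions on this page, inside this module (a sketch, not the shipped implementation; Hash.hash/1 again stands in for the real hashing function):

def compute_indices(word, args, dictionary) do
  word
  |> compute_ngrams(args.minn, args.maxn)
  |> Enum.flat_map(fn ngram ->
    id = rem(Hash.hash(ngram), args.bucket)
    push_hash(id, dictionary.nwords, dictionary.pruneidx_size, dictionary.pruneidx)
  end)
end

Enum.flat_map/2 preserves n-gram order, so dropped n-grams contribute nothing while surviving indices keep the reference ordering.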

Arguments

  • word is a UTF-8 binary. The caller passes the raw word; boundary markers are added internally (see compute_ngrams/3).

  • args is the model's args struct; minn, maxn, and bucket are the fields consulted here.

  • dictionary is the model's dictionary struct; nwords, pruneidx_size, and pruneidx are the fields consulted here.

Returns

  • A list of non-negative integers, each a valid row index in the model's input matrix (in the n-gram region, i.e. nwords <= idx < nwords + bucket).

Examples

iex> args = %Text.Language.Classifier.Fasttext.Args{
...>   minn: 2, maxn: 4, bucket: 100, dim: 16, ws: 0, epoch: 0,
...>   min_count: 0, neg: 0, word_ngrams: 1, loss: :softmax,
...>   model: :sup, lr_update_rate: 0, t: 0.0
...> }
iex> dict = %Text.Language.Classifier.Fasttext.Dictionary{
...>   nwords: 1000, nlabels: 0, size: 1000, ntokens: 0,
...>   pruneidx_size: -1, entries: [], word_to_index: %{}, pruneidx: %{}
...> }
iex> indices = Text.Language.Classifier.Fasttext.Subwords.compute_indices("a", args, dict)
iex> Enum.all?(indices, fn i -> i >= 1000 and i < 1100 end)
true

compute_ngrams(word, minn, maxn)

@spec compute_ngrams(binary(), pos_integer(), pos_integer()) :: [binary()]

Generates the character n-grams of a word with the BOW/EOW boundary markers.

Arguments

  • word is a UTF-8 binary. The caller passes the raw word; this function adds the < and > markers internally.

  • minn is the minimum n-gram length in characters.

  • maxn is the maximum n-gram length in characters.

Returns

  • A list of UTF-8 binaries in the order the reference implementation emits them. Order matters for pushHash to produce the same index sequence as the C++ implementation.

Examples

iex> Text.Language.Classifier.Fasttext.Subwords.compute_ngrams("the", 2, 4)
["<t", "<th", "<the", "th", "the", "the>", "he", "he>", "e>"]

iex> Text.Language.Classifier.Fasttext.Subwords.compute_ngrams("a", 2, 3)
["<a", "<a>", "a>"]

push_hash(id, nwords, pruneidx_size, pruneidx)

@spec push_hash(integer(), non_neg_integer(), integer(), %{required(integer()) => integer()}) ::
        [non_neg_integer()]

Direct port of fastText's pushHash. Returns either an empty list (drop) or a single-element list containing the input-matrix row index.

Exposed for differential testing; production code should call compute_indices/3.
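Examples

The expected values below follow directly from the regime behaviour described above (a usage sketch, not captured doctest output):

iex> alias Text.Language.Classifier.Fasttext.Subwords
iex> Subwords.push_hash(42, 1000, -1, %{})
[1042]
iex> Subwords.push_hash(42, 1000, 0, %{})
[]
iex> Subwords.push_hash(42, 1000, 1, %{42 => 7})
[1007]
iex> Subwords.push_hash(99, 1000, 1, %{42 => 7})
[]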