Character n-gram extraction and input-matrix indexing for fastText models.
Mirrors the C++ Dictionary::computeSubwords and Dictionary::pushHash
routines in src/dictionary.cc. Used at inference time to convert a word
into the subset of input matrix rows whose embeddings should be averaged
into the word's feature vector.
Algorithm
fastText prefixes the word with < (BOW) and suffixes it with > (EOW),
then generates every character-aligned UTF-8 substring whose length in
characters is between minn and maxn. Single-character n-grams consisting
of the bare < or > marker (possible only when minn is 1) are dropped:
the boundary markers appear in every word, so on their own they carry
no subword information.
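To make the walk concrete, here is a hedged grapheme-based sketch of the
same double loop. The module name is illustrative, and the shipped
implementation walks raw bytes instead (see UTF-8 handling below); this
sketch only shows the iteration order and the marker-dropping rule.

defmodule NgramSketch do
  def ngrams(word, minn, maxn) do
    # Mark the word, then walk every start position (outer loop) and
    # every length (inner loop), the same nesting computeSubwords uses.
    chars = String.graphemes("<" <> word <> ">")
    len = length(chars)

    for i <- 0..(len - 1),
        n <- minn..maxn,
        i + n <= len,
        # Drop the bare boundary markers; only reachable when minn == 1.
        not (n == 1 and (i == 0 or i + n == len)) do
      chars |> Enum.slice(i, n) |> Enum.join()
    end
  end
end

NgramSketch.ngrams("the", 2, 4) yields the same nine n-grams shown in
the compute_ngrams/3 examples below.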
Each n-gram's hash (see Text.Language.Classifier.Fasttext.Hash) is
reduced modulo args.bucket and offset by dictionary.nwords to land in
the second half of the input matrix where the subword embeddings live.
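As a worked example with the illustrative numbers used in the doctests
below (nwords = 1000, bucket = 100), an n-gram hashing to 2_166_136_261
lands at row:

iex> 1000 + rem(2_166_136_261, 100)
1061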
UTF-8 handling
fastText operates on UTF-8 byte sequences but counts characters, not
bytes, when sizing n-grams. A continuation byte (0x80..0xBF) is never a
valid n-gram start, and once a leading byte is consumed any continuation
bytes that follow are pulled into the same character. The implementation
inspects bytes directly via (byte &&& 0xC0) == 0x80 to identify
continuation bytes, exactly as the C++ code does.
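A minimal sketch of that byte test; the anonymous function is
illustrative, not part of the module's API:

import Bitwise

continuation? = fn byte -> (byte &&& 0xC0) == 0x80 end

continuation?.(0x61)  # "a" (ASCII)                    => false
continuation?.(0xD0)  # lead byte of Cyrillic "д"      => false
continuation?.(0xB4)  # trailing byte of "д" (D0 B4)   => true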
This must match the reference bit-for-bit; otherwise non-Latin scripts
(Chinese, Cyrillic, Devanagari, Arabic) hash to different buckets and
inference quality collapses for those languages. The test suite includes
golden subword-index fixtures generated from the official fastText
Python bindings on lid.176 for differential validation.
pushHash semantics
fastText's pushHash has three regimes keyed off pruneidx_size:
- pruneidx_size < 0: the model was never pruned. Push
  nwords + (hash % bucket). This is the regime lid.176 runs in.
- pruneidx_size == 0: the model went through pruning but produced no
  entries. Drop the n-gram entirely. Rare in practice.
- pruneidx_size > 0: the model has a populated prune index. Look the
  hash up in pruneidx; if absent, drop the n-gram; if present, push
  nwords + remapped_id.
See Dictionary::pushHash in src/dictionary.cc.
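A hedged sketch of that branching, assuming the
(hash, nwords, pruneidx_size, pruneidx) argument order implied by the
push_hash/4 spec below; the module name is illustrative:

defmodule PushHashSketch do
  # id is the n-gram hash already reduced modulo bucket.
  def push_hash(id, nwords, pruneidx_size, pruneidx) do
    cond do
      # Pruning ran but kept nothing: drop every n-gram.
      pruneidx_size == 0 -> []
      # Never pruned (lid.176): direct offset into the subword region.
      pruneidx_size < 0 -> [nwords + id]
      # Pruned model and this n-gram survived: use the remapped id.
      Map.has_key?(pruneidx, id) -> [nwords + Map.fetch!(pruneidx, id)]
      # Pruned model and this n-gram was removed: drop it.
      true -> []
    end
  end
end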
Summary
Functions
compute_indices/3
Returns the input-matrix row indices contributed by a word's n-grams.
compute_ngrams/3
Generates the character n-grams of a word with the BOW/EOW boundary markers.
push_hash/4
Direct port of fastText's pushHash. Returns either an empty list (drop)
or a single-element list containing the input-matrix row index.
Functions
@spec compute_indices(
        binary(),
        Text.Language.Classifier.Fasttext.Args.t(),
        Text.Language.Classifier.Fasttext.Dictionary.t()
      ) :: [non_neg_integer()]
Returns the input-matrix row indices contributed by a word's n-grams.
Equivalent to repeatedly calling fastText's pushHash for each generated
n-gram, given the model's args and dictionary. Indices are returned
in the same order the reference implementation appends them to its
feature vector.
Arguments
- word is the raw UTF-8 binary (without BOW/EOW markers).
- args is a Text.Language.Classifier.Fasttext.Args struct providing
  minn, maxn, and bucket.
- dictionary is a Text.Language.Classifier.Fasttext.Dictionary providing
  nwords and pruneidx.
Returns
- A list of non-negative integers, each a valid row index in the
model's input matrix (in the n-gram region, i.e.
nwords <= idx < nwords + bucket).
Examples
iex> args = %Text.Language.Classifier.Fasttext.Args{
...> minn: 2, maxn: 4, bucket: 100, dim: 16, ws: 0, epoch: 0,
...> min_count: 0, neg: 0, word_ngrams: 1, loss: :softmax,
...> model: :sup, lr_update_rate: 0, t: 0.0
...> }
iex> dict = %Text.Language.Classifier.Fasttext.Dictionary{
...> nwords: 1000, nlabels: 0, size: 1000, ntokens: 0,
...> pruneidx_size: -1, entries: [], word_to_index: %{}, pruneidx: %{}
...> }
iex> indices = Text.Language.Classifier.Fasttext.Subwords.compute_indices("a", args, dict)
iex> Enum.all?(indices, fn i -> i >= 1000 and i < 1100 end)
true
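The degenerate pruned regime follows directly from the pushHash
semantics above: with pruneidx_size set to 0, every n-gram is dropped.
An illustrative continuation of the same doctest bindings:

iex> pruned = %{dict | pruneidx_size: 0}
iex> Text.Language.Classifier.Fasttext.Subwords.compute_indices("a", args, pruned)
[]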
@spec compute_ngrams(binary(), pos_integer(), pos_integer()) :: [binary()]
Generates the character n-grams of a word with the BOW/EOW boundary markers.
Arguments
- word is a UTF-8 binary. The caller passes the raw word; this function
  adds the < and > markers internally.
- minn is the minimum n-gram length in characters.
- maxn is the maximum n-gram length in characters.
Returns
- A list of UTF-8 binaries in the order the reference implementation
  emits them. Order matters for pushHash to produce the same index
  sequence as the C++ implementation.
Examples
iex> Text.Language.Classifier.Fasttext.Subwords.compute_ngrams("the", 2, 4)
["<t", "<th", "<the", "th", "the", "the>", "he", "he>", "e>"]
iex> Text.Language.Classifier.Fasttext.Subwords.compute_ngrams("a", 2, 3)
["<a", "<a>", "a>"]
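Because lengths are counted in characters, multi-byte scripts come out
intact. An illustrative expectation derived from the algorithm above,
where Cyrillic "да" becomes the four-character "<да>":

iex> Text.Language.Classifier.Fasttext.Subwords.compute_ngrams("да", 2, 3)
["<д", "<да", "да", "да>", "а>"]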
@spec push_hash(integer(), non_neg_integer(), integer(), %{required(integer()) => integer()}) :: [non_neg_integer()]
Direct port of fastText's pushHash. Returns either an empty list (drop)
or a single-element list containing the input-matrix row index.
Exposed for differential testing; production code should call
compute_indices/3.
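Illustrative calls covering the three regimes, again assuming the
(hash, nwords, pruneidx_size, pruneidx) argument order implied by the
spec; the concrete values are arbitrary:

iex> alias Text.Language.Classifier.Fasttext.Subwords
iex> Subwords.push_hash(61, 1000, -1, %{})
[1061]
iex> Subwords.push_hash(61, 1000, 0, %{})
[]
iex> Subwords.push_hash(61, 1000, 2, %{61 => 5})
[1005]
iex> Subwords.push_hash(7, 1000, 2, %{61 => 5})
[]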