Converts an input string into the flat list of input-matrix row indices that fastText averages to produce a feature vector.
Mirrors Dictionary::getStringNoNewline / Dictionary::addSubwords /
Dictionary::addWordNgrams from the C++ reference (src/dictionary.cc),
specialized for the inference path used by the Python predict wrapper:
- Newlines are pre-replaced with spaces by the caller, so EOS tokens are not produced.
- `wordNgrams = 1` (the lid.176 setting) collapses `addWordNgrams` to a no-op, so word-level n-gram hashes are never pushed.
- Label-typed tokens, either a known label entry in the dictionary or an unknown token that starts with the `__label__` prefix, are excluded from the word-feature list. They would have been routed to the `labels` vector in the C++ code, which is unused at inference.
The returned list is the exact sequence of row indices that the C++
reference averages to compute the input feature vector. Phase 5 turns it
into an `Nx.take` and a mean.
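To make the averaging step concrete, here is a minimal sketch of what `Nx.take` followed by a mean computes, written with plain lists so it is self-contained. The module name `FeatureAverage` is hypothetical, and `rows` merely stands in for `model.input_matrix`; this is an illustration of the operation, not the Phase 5 implementation.

```elixir
defmodule FeatureAverage do
  # Gathers the rows named by `indices` and averages them element-wise,
  # the same result as Nx.take(matrix, indices) |> Nx.mean(axes: [0]).
  def mean_of_rows(rows, indices) do
    taken = Enum.map(indices, &Enum.at(rows, &1))
    n = length(taken)

    # Enum.zip_with/2 walks the rows column by column.
    Enum.zip_with(taken, fn column -> Enum.sum(column) / n end)
  end
end
```

A duplicated index contributes twice to the average, which matches feeding a repeated row index to the gather-then-mean pipeline.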
What ends up in the list, per token
| Token state | Contribution |
|---|---|
| In-vocab word entry | `[wid]` followed by character-n-gram subword indices |
| Out-of-vocab, no `__label__` prefix | character-n-gram subword indices only |
| In-vocab label entry | dropped |
| Out-of-vocab, starts with `__label__` prefix | dropped |
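The four rows of the table can be sketched as a single dispatch function. Everything here is illustrative: `TokenRouting`, the `vocab` map shape (`token => {type, id}`), and the `subword_ids` callback (standing in for the subword lookup) are hypothetical names, not the module's actual API.

```elixir
defmodule TokenRouting do
  # Returns the input-matrix row indices contributed by one token,
  # following the four cases in the table above.
  def indices_for(token, vocab, subword_ids) do
    case Map.get(vocab, token) do
      # In-vocab word entry: word id first, then its subword indices.
      {:word, wid} ->
        [wid | subword_ids.(token)]

      # In-vocab label entry: dropped from the word-feature list.
      {:label, _id} ->
        []

      nil ->
        if String.starts_with?(token, "__label__") do
          # Unknown label-shaped token: dropped.
          []
        else
          # Out-of-vocab word: subword indices only.
          subword_ids.(token)
        end
    end
  end
end
```

With a toy vocabulary `%{"hello" => {:word, 7}, "__label__en" => {:label, 0}}` and a stub `subword_ids` that always returns `[100, 101]`, an in-vocab word yields `[7, 100, 101]`, an OOV word yields `[100, 101]`, and both label cases yield `[]`.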
Subword indices are produced by
Text.Language.Classifier.Fasttext.Subwords.compute_indices/3, which
honours the model's pruneidx regime.
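For orientation, a simplified sketch of how a character n-gram maps to an input-matrix row in fastText: a 32-bit FNV-1a hash of the n-gram bytes, reduced modulo `bucket` and offset by `nwords`. The module name `SubwordHash` is hypothetical; the real `compute_indices/3` additionally consults the model's pruneidx table (omitted here), and fastText sign-extends each byte before hashing, which only matters for non-ASCII input.

```elixir
defmodule SubwordHash do
  import Bitwise

  # 32-bit FNV-1a constants.
  @fnv_offset 2_166_136_261
  @fnv_prime 16_777_619

  # Maps an n-gram (e.g. "<he") to a row index in [nwords, nwords + bucket).
  # Pruned models would remap this index via pruneidx; not modeled here.
  def row_index(ngram, nwords, bucket) do
    h =
      for <<byte <- ngram>>, reduce: @fnv_offset do
        acc -> bxor(acc, byte) * @fnv_prime &&& 0xFFFFFFFF
      end

    nwords + rem(h, bucket)
  end
end
```

The hash is deterministic, so the same n-gram always lands on the same row; collisions between distinct n-grams within the `bucket` range are expected and accepted, as in the reference implementation.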
Functions
@spec extract(binary(), Text.Language.Classifier.Fasttext.Model.t()) :: [non_neg_integer()]
Returns the input-matrix row indices for the features of text.
Arguments
- `text` is a UTF-8 binary. Newlines are treated as whitespace separators (matching the Python `predict` wrapper, which strips them before tokenizing).
- `model` is a fully-loaded `Text.Language.Classifier.Fasttext.Model`.
Returns
- A list of non-negative integers, each a valid row index into
model.input_matrix. The list may be empty if the input contains no word-typed tokens.
Examples
Given a loaded lid.176 model, extract/2 returns the same row index
list the C++ reference would average:
# iex> {:ok, model} = Text.Language.Classifier.Fasttext.ModelLoader.load("priv/lid_176/lid.176.bin")
# iex> Text.Language.Classifier.Fasttext.Features.extract("hello world", model)
# [..., ...] # word and subword indices for both tokens
Label-shaped tokens are dropped:
# iex> Text.Language.Classifier.Fasttext.Features.extract("__label__en hello", model)
# [...] # only the features for "hello"