Text.Language.Classifier.Fasttext.Features (Text v0.5.0)


Converts an input string into the flat list of input-matrix row indices that fastText averages to produce a feature vector.

Mirrors Dictionary::getStringNoNewline / Dictionary::addSubwords / Dictionary::addWordNgrams from the C++ reference (src/dictionary.cc), specialized for the inference path used by the Python predict wrapper:

  • Newlines are pre-replaced with spaces by the caller, so EOS tokens are not produced.

  • wordNgrams = 1 (the lid.176 setting) collapses addWordNgrams to a no-op, so word-level n-gram hashes are never pushed.

  • Label-typed tokens — either a known label entry in the dictionary, or an unknown token that starts with the __label__ prefix — are excluded from the word-feature list. They would have been routed to the labels vector in the C++ code, which is unused at inference.

The returned list is the exact sequence of row indices that the C++ code averages to compute the input feature vector. Phase 5 turns it into an Nx.take and a mean.
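The averaging step can be sketched in plain Elixir. This is a minimal illustration only: the real Phase 5 code operates on Nx tensors via Nx.take and Nx.mean, and AverageRows is a hypothetical name, not part of the library.

```elixir
defmodule AverageRows do
  # Hypothetical sketch of the averaging step.
  # input_matrix: the model's input matrix as a list of row vectors.
  # indices: the row indices produced by extract/2.
  # Returns the element-wise mean of the selected rows, i.e. the
  # feature vector the C++ reference computes.
  def feature_vector(input_matrix, indices) do
    rows = Enum.map(indices, &Enum.at(input_matrix, &1))
    n = length(rows)

    # Zip the selected rows column-wise and average each column.
    Enum.zip_with(rows, fn column -> Enum.sum(column) / n end)
  end
end
```

In the Nx version the same operation is a gather followed by a reduction along the row axis.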

What ends up in the list, per token:

| Token state | Contribution |
| --- | --- |
| In-vocab word entry | [wid] followed by character-n-gram subword indices |
| Out-of-vocab, no __label__ prefix | character-n-gram subword indices only |
| In-vocab label entry | dropped |
| Out-of-vocab, starts with __label__ prefix | dropped |

Subword indices are produced by Text.Language.Classifier.Fasttext.Subwords.compute_indices/3, which honours the model's pruneidx mapping.
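The per-token routing described above can be sketched as follows. The module name, the vocab map shape, and subword_indices/1 are all assumptions for illustration; the real subword indices come from Text.Language.Classifier.Fasttext.Subwords.compute_indices/3 and the real dictionary lookup lives in the model struct.

```elixir
defmodule TokenRouting do
  @label_prefix "__label__"

  # Hypothetical vocab: maps a token to {:word, wid} or :label;
  # any token absent from the map is out-of-vocab.
  def contribution(token, vocab) do
    case Map.fetch(vocab, token) do
      # In-vocab word: its row index, then its character-n-gram subwords.
      {:ok, {:word, wid}} ->
        [wid | subword_indices(token)]

      # In-vocab label: would go to the labels vector, unused at inference.
      {:ok, :label} ->
        []

      :error ->
        if String.starts_with?(token, @label_prefix) do
          # OOV token shaped like a label: also dropped.
          []
        else
          # Plain OOV word: character-n-gram subword indices only.
          subword_indices(token)
        end
    end
  end

  # Placeholder for the real character-n-gram hashing.
  defp subword_indices(_token), do: []
end
```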

Summary

Functions

Returns the input-matrix row indices for the features of text.

Functions

extract(text, model)

Returns the input-matrix row indices for the features of text.

Arguments

  • text is a UTF-8 binary. Newlines are treated as whitespace separators (matching the Python predict wrapper, which strips them before tokenizing).

  • model is a fully-loaded Text.Language.Classifier.Fasttext.Model.

Returns

  • A list of non-negative integers, each a valid row index into model.input_matrix. The list may be empty if the input contains no word-typed tokens.

Examples

Given a loaded lid.176 model, extract/2 returns the same row index list the C++ reference would average:

# iex> {:ok, model} = Text.Language.Classifier.Fasttext.ModelLoader.load("priv/lid_176/lid.176.bin")
# iex> Text.Language.Classifier.Fasttext.Features.extract("hello world", model)
# [..., ...]  # word and subword indices for both tokens

Label-shaped tokens are dropped:

# iex> Text.Language.Classifier.Fasttext.Features.extract("__label__en hello", model)
# [...]  # only the features for "hello"