# `Text.Language.Classifier.Fasttext.Features`
[🔗](https://github.com/kipcole9/text/blob/v0.5.0/lib/language/classifier/fasttext/features.ex#L1)

Converts an input string into the flat list of input-matrix row indices
that fastText averages to produce a feature vector.

Mirrors `Dictionary::getStringNoNewline` / `Dictionary::addSubwords` /
`Dictionary::addWordNgrams` from the C++ reference (`src/dictionary.cc`),
specialized for the inference path used by the Python `predict` wrapper:

* Newlines are pre-replaced with spaces by the caller, so EOS tokens are
  not produced.

* `wordNgrams = 1` (the `lid.176` setting) collapses
  `addWordNgrams` to a no-op, so word-level n-gram hashes are never
  pushed.

* Label-typed tokens — either a known label entry in the dictionary, or
  an unknown token that starts with the `__label__` prefix — are
  excluded from the word-feature list. They would have been routed to
  the `labels` vector in the C++ code, which is unused at inference.
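The whitespace handling described above can be sketched as follows. This is an approximation, not the library's tokenizer: fastText splits on the ASCII whitespace set, which `~r/\s+/` over-approximates for Unicode input.

```elixir
# Hypothetical sketch of the caller-side tokenization assumed above:
# newlines become spaces, then the string is split on whitespace runs,
# so no EOS token is ever produced.
tokens =
  "hello\nworld  foo"
  |> String.replace(["\n", "\r"], " ")
  |> String.split(~r/\s+/, trim: true)
# => ["hello", "world", "foo"]
```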

The returned list is the exact sequence of row indices that the C++
implementation averages to compute the input feature vector. Phase 5
turns it into an `Nx.take` and a mean.
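The take-and-average step can be sketched with Nx as below. This is a minimal illustration, not the library's Phase 5 code; `input_matrix` and the index values are made-up stand-ins.

```elixir
# Toy {rows, dim} input matrix in place of a real model's embedding table.
input_matrix = Nx.iota({5, 4}, type: :f32)

# Stand-in for the index list extract/2 would return.
indices = Nx.tensor([0, 2, 4])

# Gather the selected rows ({3, 4}) and average them into a {4} vector.
rows = Nx.take(input_matrix, indices, axis: 0)
feature_vector = Nx.mean(rows, axes: [0])
```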

### What ends up in the list, per token

| Token state                                  | Contribution                                       |
|----------------------------------------------|----------------------------------------------------|
| In-vocab word entry                          | `[wid]` followed by character-n-gram subword indices |
| Out-of-vocab, no `__label__` prefix          | character-n-gram subword indices only              |
| In-vocab label entry                         | dropped                                            |
| Out-of-vocab, starts with `__label__` prefix | dropped                                            |

Subword indices are produced by
`Text.Language.Classifier.Fasttext.Subwords.compute_indices/3`, which
honours the model's `pruneidx` regime.
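The four rows of the table can be sketched as a single dispatch. This is an illustrative stand-in, not the module's code: the toy `entries` map and `subwords_fun` callback replace the real dictionary lookup and `Subwords.compute_indices/3`.

```elixir
defmodule TokenFeaturesSketch do
  @moduledoc """
  Illustrative sketch of the per-token rules in the table above,
  using a toy dictionary in place of a loaded model.
  """

  @label_prefix "__label__"

  # entries: %{token => {:word, id} | {:label, id}}
  # subwords_fun: token -> list of subword row indices
  def token_features(token, entries, subwords_fun) do
    case Map.get(entries, token) do
      # In-vocab word: word id followed by its subword indices.
      {:word, wid} -> [wid | subwords_fun.(token)]
      # In-vocab label entry: dropped at inference.
      {:label, _id} -> []
      nil ->
        if String.starts_with?(token, @label_prefix) do
          # OOV label-shaped token: dropped.
          []
        else
          # OOV word: subword indices only.
          subwords_fun.(token)
        end
    end
  end
end
```

For example, with `entries = %{"hello" => {:word, 42}, "__label__en" => {:label, 0}}` and a `subwords_fun` returning `[100, 101]`, the token `"hello"` yields `[42, 100, 101]`, `"__label__en"` yields `[]`, and an OOV token yields `[100, 101]`.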

# `extract`

```elixir
@spec extract(binary(), Text.Language.Classifier.Fasttext.Model.t()) :: [
  non_neg_integer()
]
```

Returns the input-matrix row indices for the features of `text`.

### Arguments

* `text` is a UTF-8 binary. Newlines are treated as whitespace
  separators (matching the Python `predict` wrapper, which strips them
  before tokenizing).

* `model` is a fully-loaded
  `Text.Language.Classifier.Fasttext.Model`.

### Returns

* A list of non-negative integers, each a valid row index into
  `model.input_matrix`. The list may be empty if the input contains no
  word-typed tokens.

### Examples

Given a loaded `lid.176` model, `extract/2` returns the same row index
list the C++ reference would average:

    # iex> {:ok, model} = Text.Language.Classifier.Fasttext.ModelLoader.load("priv/lid_176/lid.176.bin")
    # iex> Text.Language.Classifier.Fasttext.Features.extract("hello world", model)
    # [..., ...]  # word and subword indices for both tokens

Label-shaped tokens are dropped:

    # iex> Text.Language.Classifier.Fasttext.Features.extract("__label__en hello", model)
    # [...]  # only the features for "hello"

---

*Consult [api-reference.md](api-reference.md) for the complete listing*
