Vocabulary and label table parsed from a fastText model file.
Mirrors the C++ fasttext::Dictionary data written by Dictionary::save.
Each entry is a Text.Language.Classifier.Fasttext.Entry carrying the
surface form (UTF-8 string), occurrence count from training, and a
word/label tag.
Entries are stored in two collections:
- entries is the original sequence in file order. Index i here is the same i used elsewhere in fastText to address the input matrix for word rows.
- word_to_index is a precomputed lookup keyed by the surface form, mapping back to the entry index. Built once at load time so feature extraction can do O(1) lookups.
See docs/lid176_binary_format.md (Section 3) for the byte layout.
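The relationship between the two collections can be sketched as follows. This is a minimal illustration, not the module's implementation: DictionaryIndexSketch and build_word_to_index/1 are hypothetical names, and entries are modelled as plain maps with assumed :word, :count and :type keys because the real Entry struct's field names are not shown here.

```elixir
defmodule DictionaryIndexSketch do
  # Hypothetical helper, not part of the documented module.
  # Entries are plain maps; the real Entry struct's fields may be named differently.

  # Builds a surface-form => position map from entries kept in file order.
  def build_word_to_index(entries) do
    entries
    |> Enum.with_index()
    |> Map.new(fn {entry, index} -> {entry.word, index} end)
  end
end

entries = [
  %{word: "the", count: 120, type: :word},
  %{word: "cat", count: 7, type: :word},
  %{word: "__label__en", count: 50, type: :label}
]

word_to_index = DictionaryIndexSketch.build_word_to_index(entries)

# A known surface form resolves to its position in the entries list.
IO.inspect(Map.fetch(word_to_index, "cat"))
# An unknown surface form is reported as :error rather than raising.
IO.inspect(Map.fetch(word_to_index, "dog"))
```

Because the map is derived from the list positions, the index it returns can be used directly to address the matching entry.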
Types
@type t() :: %Text.Language.Classifier.Fasttext.Dictionary{
  entries: [Text.Language.Classifier.Fasttext.Entry.t()],
  nlabels: non_neg_integer(),
  ntokens: non_neg_integer(),
  nwords: non_neg_integer(),
  pruneidx: %{required(integer()) => integer()},
  pruneidx_size: non_neg_integer(),
  size: non_neg_integer(),
  word_to_index: %{required(String.t()) => non_neg_integer()}
}
Functions
Decodes the dictionary section of a fastText model file.
Arguments
- binary is the raw byte sequence positioned at the start of the dictionary block (immediately after the args block).
Returns
- {:ok, dictionary, rest} where dictionary is a t/0 struct and rest is the binary remainder positioned at the start of the quant_input flag byte.
- {:error, reason} if the input is truncated or malformed (e.g. an unterminated word string, an out-of-range entry type byte).
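To make the success and error shapes concrete, here is a rough sketch of decoding a single entry. It assumes the per-entry layout written by upstream Dictionary::save on a little-endian host: the word bytes, a zero terminator, an int64 count, and an int8 type (0 for word, 1 for label). EntryDecodeSketch, decode_entry/1 and the error atoms are hypothetical; the documented function may report different reasons and also reads the header counts and prune index.

```elixir
defmodule EntryDecodeSketch do
  # Hypothetical sketch, not the documented module's implementation.
  # Assumed entry layout (little-endian):
  #   word bytes, 0x00 terminator, int64 count, int8 type (0 = word, 1 = label)

  def decode_entry(binary) do
    # Split once, at the first zero byte, to separate the word from what follows.
    case :binary.split(binary, <<0>>) do
      [word, <<count::little-signed-64, type, rest::binary>>] ->
        with {:ok, tag} <- entry_type(type) do
          {:ok, %{word: word, count: count, type: tag}, rest}
        end

      [_word, _too_short] ->
        # Terminator found, but fewer than 9 bytes remain for count and type.
        {:error, :truncated_entry}

      [_no_terminator] ->
        {:error, :unterminated_word}
    end
  end

  defp entry_type(0), do: {:ok, :word}
  defp entry_type(1), do: {:ok, :label}
  defp entry_type(other), do: {:error, {:invalid_entry_type, other}}
end

# One label entry followed by a single unrelated trailing byte.
bytes = <<"__label__en", 0, 42::little-signed-64, 1, 0xFF>>
{:ok, entry, rest} = EntryDecodeSketch.decode_entry(bytes)
IO.inspect({entry.word, entry.count, entry.type, rest})
```

Returning the unconsumed remainder alongside the decoded value lets the caller chain decoders for the sections that follow without tracking byte offsets.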
Returns the labels (in file order) with the __label__ prefix stripped.
For lid.176 this produces a 176-element list of language tags such as
["en", "zh", "fr", ...]. Index i in the returned list corresponds
to row i of the output matrix.
Arguments
- dictionary is a parsed t/0.
Returns
- A list of nlabels strings.
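The prefix stripping and the label-to-row correspondence might look roughly like this. LabelsSketch, labels/1 and best_label/2 are hypothetical names, entries are plain maps with assumed :word and :type keys, and the scores list stands in for one score per output-matrix row.

```elixir
defmodule LabelsSketch do
  # Hypothetical helper, not part of the documented module.
  @prefix "__label__"

  # Keeps label entries in file order and strips the prefix from each.
  def labels(entries) do
    for %{type: :label, word: word} <- entries do
      String.replace_prefix(word, @prefix, "")
    end
  end

  # Pairs label i with score i and returns the highest-scoring pair.
  def best_label(labels, scores) do
    labels
    |> Enum.zip(scores)
    |> Enum.max_by(fn {_label, score} -> score end)
  end
end

entries = [
  %{word: "the", type: :word},
  %{word: "__label__en", type: :label},
  %{word: "__label__fr", type: :label}
]

labels = LabelsSketch.labels(entries)
IO.inspect(labels)
# → ["en", "fr"]

IO.inspect(LabelsSketch.best_label(labels, [0.1, 0.9]))
# → {"fr", 0.9}
```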