Text.Language.Classifier.Fasttext.Dictionary (Text v0.5.0)


Vocabulary and label table parsed from a fastText model file.

Mirrors the C++ fasttext::Dictionary data written by Dictionary::save. Each entry is a Text.Language.Classifier.Fasttext.Entry carrying the surface form (UTF-8 string), occurrence count from training, and a word/label tag.

Entries are stored in two collections:

  • entries is the original sequence in file order. For word entries, index i is the same index fastText uses to address row i of the input matrix.

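  • word_to_index is a precomputed lookup keyed by the surface form, mapping back to the entry index. It is built once at load time so feature extraction does not have to scan entries for every token (map lookups are logarithmic in the number of keys, which is effectively constant here).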
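The relationship between the two collections can be sketched as follows. This is a minimal illustration, not the library's code: the module name IndexSketch is hypothetical, and plain maps with a :word key stand in for Text.Language.Classifier.Fasttext.Entry structs, whose field names are assumed here.

```elixir
defmodule IndexSketch do
  # Hypothetical helper: maps each surface form to its zero-based
  # position in file order. Entries are plain maps standing in for
  # Entry structs (the :word field name is an assumption).
  def build_index(entries) do
    entries
    |> Enum.with_index()
    |> Map.new(fn {%{word: word}, index} -> {word, index} end)
  end
end

entries = [
  %{word: "hello", count: 12, type: :word},
  %{word: "world", count: 7, type: :word},
  %{word: "__label__en", count: 3, type: :label}
]

index = IndexSketch.build_index(entries)

Map.fetch(index, "world")
# => {:ok, 1}

Map.fetch(index, "missing")
# => :error
```

Because the index is derived from entries, the position it returns can be used directly as the row number for a word.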

See docs/lid176_binary_format.md (Section 3) for the byte layout.

Summary

Functions

decode(arg1)

Decodes the dictionary section of a fastText model file.

labels(dictionary)

Returns the labels (in file order) with the __label__ prefix stripped.

Types

t()

@type t() :: %Text.Language.Classifier.Fasttext.Dictionary{
  entries: [Text.Language.Classifier.Fasttext.Entry.t()],
  nlabels: non_neg_integer(),
  ntokens: non_neg_integer(),
  nwords: non_neg_integer(),
  pruneidx: %{required(integer()) => integer()},
  pruneidx_size: non_neg_integer(),
  size: non_neg_integer(),
  word_to_index: %{required(String.t()) => non_neg_integer()}
}

Functions

decode(arg1)

@spec decode(binary()) :: {:ok, t(), binary()} | {:error, term()}

Decodes the dictionary section of a fastText model file.

Arguments

  • binary (shown as arg1 in the signature) is the raw byte sequence positioned at the start of the dictionary block (immediately after the args block).

Returns

  • {:ok, dictionary, rest} where dictionary is a t/0 struct and rest is the binary remainder positioned at the start of the quant_input flag byte.

  • {:error, reason} if the input is truncated or malformed (e.g. an unterminated word string, an out-of-range entry type byte).
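To make the return shapes concrete, here is a minimal sketch of a decoder with the same contract. It is not this module's implementation. It assumes the layout written by the C++ Dictionary::save: a little-endian header (size, nwords, nlabels as int32; ntokens, pruneidx_size as int64), then one NUL-terminated word, an int64 count, and an int8 type per entry, then pruneidx pairs of int32. The module name DictionarySketch, the error atoms, and the plain-map result are all hypothetical.

```elixir
defmodule DictionarySketch do
  # Hypothetical sketch, not the library's code. Returns a plain map
  # instead of the Dictionary struct and omits word_to_index.
  def decode(<<size::little-signed-32, nwords::little-signed-32,
               nlabels::little-signed-32, ntokens::little-signed-64,
               pruneidx_size::little-signed-64, rest::binary>>) do
    with {:ok, entries, rest} <- decode_entries(rest, size, []),
         {:ok, pruneidx, rest} <- decode_pruneidx(rest, pruneidx_size, %{}) do
      dictionary = %{
        size: size, nwords: nwords, nlabels: nlabels, ntokens: ntokens,
        pruneidx_size: pruneidx_size, entries: entries, pruneidx: pruneidx
      }

      {:ok, dictionary, rest}
    end
  end

  def decode(_other), do: {:error, :truncated_header}

  defp decode_entries(rest, 0, acc), do: {:ok, Enum.reverse(acc), rest}

  defp decode_entries(binary, remaining, acc) when remaining > 0 do
    # Each entry: NUL-terminated word, int64 count, int8 type tag.
    with [word, tail] <- :binary.split(binary, <<0>>),
         <<count::little-signed-64, type::signed-8, rest::binary>> <- tail,
         {:ok, tag} <- entry_type(type) do
      entry = %{word: word, count: count, type: tag}
      decode_entries(rest, remaining - 1, [entry | acc])
    else
      [_unterminated] -> {:error, :unterminated_word}
      {:error, _reason} = error -> error
      _short -> {:error, :truncated_entry}
    end
  end

  defp decode_entries(_binary, _negative, _acc), do: {:error, :bad_size}

  # A non-positive pruneidx_size means there are no pairs to read.
  defp decode_pruneidx(rest, remaining, acc) when remaining <= 0,
    do: {:ok, acc, rest}

  defp decode_pruneidx(<<from::little-signed-32, to::little-signed-32,
                         rest::binary>>, remaining, acc),
    do: decode_pruneidx(rest, remaining - 1, Map.put(acc, from, to))

  defp decode_pruneidx(_short, _remaining, _acc),
    do: {:error, :truncated_pruneidx}

  defp entry_type(0), do: {:ok, :word}
  defp entry_type(1), do: {:ok, :label}
  defp entry_type(other), do: {:error, {:bad_entry_type, other}}
end

# A hand-built two-entry dictionary followed by one trailing byte that
# stands in for the quant_input flag.
binary =
  <<2::little-32, 1::little-32, 1::little-32, 19::little-64,
    -1::little-signed-64,
    "hello", 0, 12::little-64, 0,
    "__label__en", 0, 7::little-64, 1,
    0>>

{:ok, dictionary, rest} = DictionarySketch.decode(binary)
```

In this sketch rest is the single trailing byte, mirroring how the real function hands the remainder to whatever parses the next section of the file.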

labels(dictionary)

@spec labels(t()) :: [String.t()]

Returns the labels (in file order) with the __label__ prefix stripped.

For lid.176 this produces a 176-element list of language codes such as "en", "fr", and "zh". Index i in the returned list corresponds to row i of the output matrix.

Arguments

  • dictionary is a parsed t/0.

Returns

  • A list of nlabels strings.
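The behaviour described above can be sketched as a filter over the entries followed by a prefix strip. This is an illustration under assumptions rather than the library's code: LabelsSketch is a hypothetical module, and plain maps with :word and :type keys stand in for Entry structs.

```elixir
defmodule LabelsSketch do
  @prefix "__label__"

  # Hypothetical helper: keeps label entries in file order and
  # strips the prefix. Word entries are skipped by the pattern.
  def labels(entries) do
    for %{type: :label, word: word} <- entries do
      String.replace_prefix(word, @prefix, "")
    end
  end
end

entries = [
  %{word: "hello", count: 12, type: :word},
  %{word: "__label__en", count: 7, type: :label},
  %{word: "__label__fr", count: 3, type: :label}
]

LabelsSketch.labels(entries)
# => ["en", "fr"]
```

Keeping file order matters because the position of each label is what ties it to a row of the output matrix.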