Text.Language.Classifier.Fasttext.ModelLoader (Text v0.5.0)

Parses a fastText .bin model file into a Text.Language.Classifier.Fasttext.Model struct.

The full byte layout is documented in docs/lid176_binary_format.md. In brief, the file is a sequence of:

8-byte magic + version header.
56-byte fixed Args block.
Variable-length Dictionary (28-byte header + size entries + optional prune index).
1-byte quant_input flag (must be 0 — quantized models are out of scope for v1).
Input matrix (16-byte shape header of two little-endian int64 values, then {nwords + bucket, dim} float32 data, row-major).
1-byte qout flag (must be 0).
Output matrix (16-byte shape header of two little-endian int64 values, then {nlabels, dim} float32 data, row-major).
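
As a rough illustration of the first two items in the layout above, the sketch below decodes the 8-byte header and the 56-byte Args block with binary pattern matching. LayoutSketch and parse_prefix/1 are hypothetical names, not part of this library. The Args field names and their order follow upstream fastText's Args serialization and are an assumption about how this loader names them. The version check is omitted because the set of supported versions is not stated here.

```elixir
defmodule LayoutSketch do
  # Hypothetical helper, not part of the library. It decodes only the 8-byte
  # header and the 56-byte Args block and returns the unparsed remainder.
  @magic 793_712_314

  def parse_prefix(<<magic::little-32, version::little-32, rest::binary>>) do
    if magic == @magic do
      parse_args(version, rest)
    else
      {:error, {:bad_magic, magic}}
    end
  end

  def parse_prefix(_short), do: {:error, :truncated_header}

  # Twelve little-endian int32 fields (48 bytes) plus one float64 (8 bytes).
  # Field names are assumed from upstream fastText, not from this library.
  defp parse_args(
         version,
         <<dim::little-signed-32, ws::little-signed-32, epoch::little-signed-32,
           min_count::little-signed-32, neg::little-signed-32,
           word_ngrams::little-signed-32, loss::little-signed-32,
           model::little-signed-32, bucket::little-signed-32,
           minn::little-signed-32, maxn::little-signed-32,
           lr_update_rate::little-signed-32, t::little-float-64, rest::binary>>
       ) do
    args = %{
      dim: dim, ws: ws, epoch: epoch, min_count: min_count, neg: neg,
      word_ngrams: word_ngrams, loss: loss, model: model, bucket: bucket,
      minn: minn, maxn: maxn, lr_update_rate: lr_update_rate, t: t
    }

    {:ok, %{version: version, args: args}, rest}
  end

  # Reached when fewer than 56 bytes remain (or t is not a finite float).
  defp parse_args(_version, _short), do: {:error, :truncated_args}
end
```

The returned remainder would then be handed to whatever parses the Dictionary section, so each stage consumes its own bytes and passes the rest along.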

The whole file is read into memory and then parsed as a single Elixir binary. For lid.176.bin (~126 MB) this means a transient peak of roughly twice the file size (about 250 MB) during loading: the original binary and the matrix tensors live concurrently until the binary becomes eligible for garbage collection. Models are normally loaded once at boot, so the peak is acceptable.
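
To make the matrix sections and the memory note more concrete, here is a minimal sketch of carving one dense matrix off the front of the in-memory binary. MatrixSketch and split_matrix/1 are hypothetical names. The 16-byte shape header (two little-endian 64-bit integers) is taken from the decode_model/2 example on this page, and the error atoms from load_error(). The real loader may structure this differently.

```elixir
defmodule MatrixSketch do
  # Hypothetical helper, not part of the library.
  @float32_bytes 4

  # Shape header: rows and cols as little-endian 64-bit integers,
  # followed by rows * cols float32 values in row-major order.
  def split_matrix(<<rows::little-64, cols::little-64, rest::binary>>) do
    data_bytes = rows * cols * @float32_bytes

    case rest do
      <<data::binary-size(data_bytes), tail::binary>> ->
        {:ok, {rows, cols, data}, tail}

      _too_short ->
        {:error, :truncated_matrix_data}
    end
  end

  def split_matrix(_short), do: {:error, :truncated_matrix_header}
end
```

For the 4 x 8 input matrix in the decode_model/2 example, the data segment is 4 * 8 * 4 = 128 bytes after the 16-byte header. On the BEAM, a matched segment of this kind is a sub-binary that still points into the original file binary. Whether a second copy is made depends on how the bytes are later turned into a tensor, which is where the roughly 2x peak described above would come from.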

Summary

Types

load_error()

Functions

decode_model(binary, tensor_type \\ {:f, 32})

Parses a fastText model from an in-memory binary.

load(path, options \\ [])

Loads and parses a fastText model file.

Types

load_error()

@type load_error() ::
  {:bad_magic, integer()}
  | {:unsupported_version, integer()}
  | {:quantized_input_unsupported, true}
  | {:quantized_output_unsupported, true}
  | {:input_matrix_shape_mismatch, %{expected: tuple(), actual: tuple()}}
  | {:output_matrix_shape_mismatch, %{expected: tuple(), actual: tuple()}}
  | :truncated_header
  | :truncated_args
  | :truncated_dictionary_header
  | :truncated_entry
  | :truncated_pruneidx
  | :truncated_quant_flag
  | :truncated_matrix_header
  | :truncated_matrix_data
  | :unterminated_word
  | {:unknown_loss, integer()}
  | {:unknown_model, integer()}
  | File.posix()
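
Callers will usually want to turn these reasons into something readable. The sketch below shows one way to pattern match over the shapes in load_error(). LoadErrorSketch and describe/1 are hypothetical names, and the message wording is purely illustrative.

```elixir
defmodule LoadErrorSketch do
  # Hypothetical formatter, not part of the library.
  def describe({:bad_magic, magic}), do: "not a fastText model (magic #{magic})"
  def describe({:unsupported_version, v}), do: "unsupported format version #{v}"
  def describe({:unknown_loss, id}), do: "unknown loss id #{id}"
  def describe({:unknown_model, id}), do: "unknown model id #{id}"

  def describe({tag, true})
      when tag in [:quantized_input_unsupported, :quantized_output_unsupported],
      do: "quantized models are not supported"

  def describe({tag, %{expected: expected, actual: actual}})
      when tag in [:input_matrix_shape_mismatch, :output_matrix_shape_mismatch],
      do: "#{tag}: expected #{inspect(expected)}, got #{inspect(actual)}"

  # Bare atoms: the truncation reasons, :unterminated_word, or a File.posix() code.
  def describe(reason) when is_atom(reason) do
    case Atom.to_string(reason) do
      "truncated_" <> section -> "file ends inside the #{section} section"
      "unterminated_word" -> "dictionary word is missing its NUL terminator"
      _posix -> "file error: #{:file.format_error(reason)}"
    end
  end
end
```

Matching on the tuple tags first and falling back to bare atoms keeps POSIX codes such as :enoent from needing an exhaustive list.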

Functions

decode_model(binary, tensor_type \\ {:f, 32})

@spec decode_model(binary(), Nx.Type.t()) ::
  {:ok, Text.Language.Classifier.Fasttext.Model.t()} | {:error, load_error()}

Parses a fastText model from an in-memory binary.

Useful for testing against synthetic fixtures and for users who already have the file contents in memory.

Arguments

binary is the complete byte sequence of a fastText .bin model.
tensor_type is the Nx tensor type used for the input and output matrices. Defaults to {:f, 32}.

Returns

{:ok, model} on success.
{:error, reason} on parse or validation failure.

Examples

iex> args = <<
...>   8::little-32, 0::little-32, 0::little-32, 0::little-32,
...>   0::little-32, 1::little-32, 3::little-32, 3::little-32,
...>   4::little-32, 2::little-32, 4::little-32, 0::little-32,
...>   1.0e-4::little-float-64
...> >>
iex> dict_header = <<
...>   1::little-32, 0::little-32, 1::little-32,
...>   0::little-64, 0::little-64
...> >>
iex> entry = "__label__en" <> <<0, 7::little-64, 1::little-8>>
iex> input_dim = 8
iex> input_rows = 4
iex> input_zeros = :binary.copy(<<0::little-float-32>>, input_rows * input_dim)
iex> input_matrix = <<input_rows::little-64, input_dim::little-64>> <> input_zeros
iex> output_zeros = :binary.copy(<<0::little-float-32>>, 1 * input_dim)
iex> output_matrix = <<1::little-64, input_dim::little-64>> <> output_zeros
iex> binary =
...>   <<793_712_314::little-32, 12::little-32>> <>
...>     args <> dict_header <> entry <>
...>     <<0>> <> input_matrix <>
...>     <<0>> <> output_matrix
iex> {:ok, model} = Text.Language.Classifier.Fasttext.ModelLoader.decode_model(binary)
iex> model.labels
["en"]
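
The entry in the example above packs a NUL-terminated word, an 8-byte count, and a 1-byte type. The following sketch shows how such an entry could be decoded and how "__label__en" might become "en". EntrySketch, parse_entry/1, and label_name/1 are hypothetical names. The type value 1 meaning "label" and the "__label__" prefix stripping are inferred from the example and from upstream fastText, not confirmed for this loader.

```elixir
defmodule EntrySketch do
  # Hypothetical helper, not part of the library.
  def parse_entry(binary) do
    # Split at the first NUL byte only; the word precedes it.
    case :binary.split(binary, <<0>>) do
      [word, <<count::little-64, type::little-8, rest::binary>>] ->
        {:ok, %{word: word, count: count, type: type}, rest}

      [_word, _short] ->
        {:error, :truncated_entry}

      [_no_nul] ->
        {:error, :unterminated_word}
    end
  end

  # Assumed convention: labels carry the "__label__" prefix on disk.
  def label_name("__label__" <> name), do: name
  def label_name(other), do: other
end
```

Applied to the entry from the example, parse_entry/1 yields the word "__label__en" with count 7 and type 1, and label_name/1 reduces it to "en".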

load(path, options \\ [])

@spec load(
  Path.t(),
  keyword()
) :: {:ok, Text.Language.Classifier.Fasttext.Model.t()} | {:error, load_error()}

Loads and parses a fastText model file.

Arguments

path is the absolute or relative path to a fastText .bin model file.

Options

:tensor_type is the Nx tensor type used for the input and output matrices. The default is {:f, 32}, matching fastText's on-disk layout. Override only if downstream code requires a different precision.

Returns

{:ok, model} where model is a Text.Language.Classifier.Fasttext.Model struct.
{:error, reason} where reason describes the parse or validation failure. See load_error/0 for the set of possible reasons.

Examples

# Loading the official lid.176 model (after running mix text.download_lid176):
# iex> path = Path.expand("priv/lid_176/lid.176.bin")
# iex> {:ok, model} = Text.Language.Classifier.Fasttext.ModelLoader.load(path)
# iex> model.dictionary.nlabels
# 176
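
Since the file is read whole and load_error() includes File.posix(), one plausible shape for load/2 is a File.read/1 followed by the in-memory decoder. The sketch below illustrates that flow only. LoadSketch is a hypothetical module, and decode_fun stands in for decode_model/2 so the block runs without the library.

```elixir
defmodule LoadSketch do
  # Hypothetical illustration, not the library's implementation.
  # decode_fun is a stand-in for the in-memory decoder.
  def load(path, decode_fun) when is_function(decode_fun, 1) do
    # File.read/1 failures such as {:error, :enoent} fall through unchanged.
    with {:ok, binary} <- File.read(path) do
      decode_fun.(binary)
    end
  end
end
```

With this shape, a missing file surfaces as {:error, :enoent}, while any parse failure comes back as whatever the decoder returned.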