Parses a fastText `.bin` model file into a `Text.Language.Classifier.Fasttext.Model` struct.
The full byte layout is documented in docs/lid176_binary_format.md. In
brief, the file is a sequence of:

- 8-byte magic + version header.
- 56-byte fixed `Args` block.
- Variable-length `Dictionary` (28-byte header + `size` entries + optional prune index).
- 1-byte `quant_input` flag (must be `0` — quantized models are out of scope for v1).
- Input matrix (`{nwords + bucket, dim}` float32, row-major).
- 1-byte `qout` flag (must be `0`).
- Output matrix (`{nlabels, dim}` float32, row-major).
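The first two steps of the layout above can be sketched with Elixir binary pattern matching. This is a minimal illustration, not the module's actual implementation; the function names and the version guard are assumptions, though the magic constant and header widths come from the layout described here:

```elixir
# Hypothetical sketch: match the 8-byte little-endian magic/version header
# and hand the remainder (Args block onward) to the next parsing stage.
@magic 793_712_314

defp parse_header(<<@magic::little-32, version::little-32, rest::binary>>),
  do: {:ok, version, rest}

defp parse_header(<<magic::little-32, _::binary>>),
  do: {:error, {:bad_magic, magic}}

defp parse_header(_too_short),
  do: {:error, :truncated_header}
```

Because Elixir binaries support cheap sub-binary references, `rest` shares memory with the original file contents rather than copying them.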
The whole file is read into memory and then parsed against an Elixir
binary. For lid.176.bin (~126 MB) this means a transient peak of roughly
twice the file size during loading: the original binary and the matrix
tensors live concurrently until the binary becomes garbage-collectable.
Models are normally loaded once at boot, so the peak is acceptable.
Summary
Types
@type load_error() ::
  {:bad_magic, integer()}
  | {:unsupported_version, integer()}
  | {:quantized_input_unsupported, true}
  | {:quantized_output_unsupported, true}
  | {:input_matrix_shape_mismatch, %{expected: tuple(), actual: tuple()}}
  | {:output_matrix_shape_mismatch, %{expected: tuple(), actual: tuple()}}
  | :truncated_header
  | :truncated_args
  | :truncated_dictionary_header
  | :truncated_entry
  | :truncated_pruneidx
  | :truncated_quant_flag
  | :truncated_matrix_header
  | :truncated_matrix_data
  | :unterminated_word
  | {:unknown_loss, integer()}
  | {:unknown_model, integer()}
  | File.posix()
Functions
@spec decode_model(binary(), Nx.Type.t()) :: {:ok, Text.Language.Classifier.Fasttext.Model.t()} | {:error, load_error()}
Parses a fastText model from an in-memory binary.
Useful for testing against synthetic fixtures and for users who already have the file contents in memory.
Arguments
- `binary` is the complete byte sequence of a fastText `.bin` model.
- `tensor_type` is an `Nx` tensor type. Defaults to `{:f, 32}`.
Returns
- `{:ok, model}` on success.
- `{:error, reason}` on parse or validation failure.
Examples
iex> args = <<
...> 8::little-32, 0::little-32, 0::little-32, 0::little-32,
...> 0::little-32, 1::little-32, 3::little-32, 3::little-32,
...> 4::little-32, 2::little-32, 4::little-32, 0::little-32,
...> 1.0e-4::little-float-64
...> >>
iex> dict_header = <<
...> 1::little-32, 0::little-32, 1::little-32,
...> 0::little-64, 0::little-64
...> >>
iex> entry = "__label__en" <> <<0, 7::little-64, 1::little-8>>
iex> input_dim = 8
iex> input_rows = 4
iex> input_zeros = :binary.copy(<<0::little-float-32>>, input_rows * input_dim)
iex> input_matrix = <<input_rows::little-64, input_dim::little-64>> <> input_zeros
iex> output_zeros = :binary.copy(<<0::little-float-32>>, 1 * input_dim)
iex> output_matrix = <<1::little-64, input_dim::little-64>> <> output_zeros
iex> binary =
...> <<793_712_314::little-32, 12::little-32>> <>
...> args <> dict_header <> entry <>
...> <<0>> <> input_matrix <>
...> <<0>> <> output_matrix
iex> {:ok, model} = Text.Language.Classifier.Fasttext.ModelLoader.decode_model(binary)
iex> model.labels
["en"]
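When the input binary fails validation, the error tuple identifies the failing stage. A hedged usage sketch (the possible `reason` shapes are those listed in `load_error/0`; the raise messages are illustrative):

```elixir
case Text.Language.Classifier.Fasttext.ModelLoader.decode_model(bad_binary) do
  {:ok, model} ->
    model

  {:error, {:bad_magic, magic}} ->
    raise "not a fastText model file (magic #{magic})"

  {:error, reason} ->
    raise "failed to parse model: #{inspect(reason)}"
end
```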
@spec load(Path.t(), keyword()) :: {:ok, Text.Language.Classifier.Fasttext.Model.t()} | {:error, load_error()}
Loads and parses a fastText model file.
Arguments
- `path` is the absolute or relative path to a fastText `.bin` model file.
Options
- `:tensor_type` is the `Nx` tensor type used for the input and output matrices. The default is `{:f, 32}`, matching fastText's on-disk layout. Override only if downstream code requires a different precision.
Returns
- `{:ok, model}` where `model` is a `Text.Language.Classifier.Fasttext.Model` struct.
- `{:error, reason}` where `reason` describes the parse or validation failure. See `load_error/0` for the set of possible reasons.
Examples
# Loading the official lid.176 model (after running `mix text.download_lid176`):
# iex> path = Path.expand("priv/lid_176/lid.176.bin")
# iex> {:ok, model} = Text.Language.Classifier.Fasttext.ModelLoader.load(path)
# iex> model.dictionary.nlabels
# 176
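Since the docs note that models are normally loaded once at boot, `load/2` is commonly wrapped in a process that holds the struct in its state and fails fast on a bad or missing file. A sketch under that assumption — the `MyApp.LanguageModel` module is hypothetical:

```elixir
# Hypothetical boot-time wrapper: load the model once and keep it in state.
defmodule MyApp.LanguageModel do
  use GenServer

  def start_link(path),
    do: GenServer.start_link(__MODULE__, path, name: __MODULE__)

  @impl true
  def init(path) do
    case Text.Language.Classifier.Fasttext.ModelLoader.load(path, tensor_type: {:f, 32}) do
      {:ok, model} -> {:ok, model}
      {:error, reason} -> {:stop, {:model_load_failed, reason}}
    end
  end
end
```

Crashing in `init/1` surfaces the `load_error/0` reason in the supervisor report, which is usually preferable to serving requests with no model.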