mix text.download_models (Text v0.5.0)


Pre-downloads every external model used by :text so that subsequent calls run without network access.

On-demand downloads work fine for development, but most production environments want every artefact present at boot. This task fetches:

  • lid.176.bin — fastText language identification (~126 MB), saved to priv/lid_176/lid.176.bin inside this project.

  • The default Hugging Face model used by Text.Sentiment.Backends.Bumblebee (XLM-RoBERTa, ~1.1 GB on first download) plus the tokenizer it loads (FacebookAI/xlm-roberta-base).

  • The default Hugging Face model used by Text.POS (English BERT, ~440 MB) plus its tokenizer (google-bert/bert-base-uncased).

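  • The default Hugging Face model used by Text.NER (multilingual BERT, ~700 MB) plus its tokenizer (google-bert/bert-base-multilingual-cased).

  • The default model used by Text.WordCloud.Backends.KeyBERT (multilingual sentence-transformer, ~470 MB) plus its tokenizer.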

Hugging Face artefacts land in Bumblebee's cache directory (~/.cache/bumblebee/ by default; override with BUMBLEBEE_CACHE_DIR or XDG_CACHE_HOME). Once cached, the corresponding Text.* modules load without any network round-trip.
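The lookup order described above can be sketched as a small shell helper, which may be handy in a deployment script that needs to know where the cached artefacts will land. This is only an illustration: bumblebee_cache_dir is a hypothetical name, not something shipped with :text or Bumblebee, and the fallback path assumes the Linux-style default quoted above.

```shell
# Hypothetical helper: prints the directory Bumblebee artefacts are expected
# to be cached in, following the precedence described in the text.
# Assumption: Linux-style defaults (~/.cache/bumblebee).
bumblebee_cache_dir() {
  if [ -n "${BUMBLEBEE_CACHE_DIR:-}" ]; then
    # An explicit override wins.
    printf '%s\n' "$BUMBLEBEE_CACHE_DIR"
  elif [ -n "${XDG_CACHE_HOME:-}" ]; then
    # Otherwise use the XDG cache root with a bumblebee subdirectory.
    printf '%s\n' "$XDG_CACHE_HOME/bumblebee"
  else
    # Default location when neither variable is set.
    printf '%s\n' "$HOME/.cache/bumblebee"
  fi
}
```

Setting BUMBLEBEE_CACHE_DIR to a path that is baked into the release image is one way to keep the download step and the running application pointed at the same directory.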

Usage

mix text.download_models                  # download everything
mix text.download_models --lid176         # just lid.176.bin
mix text.download_models --sentiment      # just the sentiment stack
mix text.download_models --pos --ner      # just POS + NER
mix text.download_models --bumblebee      # all Bumblebee stacks
mix text.download_models --force          # re-download even if cached
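After a download run, a build or boot script may want to confirm that the fastText model is where the task saves it. The following is a minimal sketch under that assumption; lid176_present is a hypothetical helper, and it only checks the full .bin file at the path given above, not the quantized variant.

```shell
# Hypothetical preflight check: succeeds only if lid.176.bin exists and is
# non-empty under the given project root (defaults to the current directory).
lid176_present() {
  [ -s "${1:-.}/priv/lid_176/lid.176.bin" ]
}
```

In a build script this could follow the download step, for example `mix text.download_models --lid176 && lid176_present . || exit 1`.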

Options

  • --lid176 — fetch lid.176.bin (or lid.176.ftz with --quantized).

  • --sentiment — fetch the default Text.Sentiment.Backends.Bumblebee model and tokenizer.

  • --pos — fetch the default Text.POS model and tokenizer.

  • --ner — fetch the default Text.NER model and tokenizer.

  • --keybert — fetch the default Text.WordCloud.Backends.KeyBERT multilingual sentence-transformer model and tokenizer (~470 MB).

  • --bumblebee — shorthand for --sentiment --pos --ner --keybert.

  • --all — download every model. This is the default when no selection flag is given.

  • --force — re-download lid.176.bin even if a cached copy is already present. Bumblebee artefacts are cached by etag and refresh automatically when the upstream model updates, so this flag has no effect on the sentiment, POS, or NER stacks.

  • --quantized — only meaningful with --lid176; downloads the .ftz quantized variant instead of the full .bin.

  • --model <repo> — override the Hugging Face repo for the single selected model. Only valid when exactly one of --sentiment, --pos, or --ner is passed; mirrors the :model option each of those modules accepts.

  • --tokenizer <repo> — pair with --model to override the tokenizer repo as well.
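The rule for --model (exactly one of --sentiment, --pos, or --ner) can be illustrated with a small check that a wrapper script might run before invoking the task. This is a sketch of the rule as stated above, not the task's own validation code; model_override_allowed is a hypothetical name.

```shell
# Hypothetical wrapper-side check: succeeds only when exactly one of
# --sentiment, --pos, or --ner appears among the arguments.
model_override_allowed() {
  count=0
  for arg in "$@"; do
    case "$arg" in
      --sentiment|--pos|--ner) count=$((count + 1)) ;;
    esac
  done
  [ "$count" -eq 1 ]
}
```

A wrapper could call `model_override_allowed "$@"` and refuse to pass --model through when the check fails.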

Bumblebee dependency

Downloading the sentiment, POS, or NER models requires the optional :bumblebee dependency to be present in the host application. If it is missing, those steps are skipped with a warning; the fastText download still proceeds.