Text.WordCloud.Backends.KeyBERT (Text v0.5.0)

Neural keyword-extraction backend backed by Bumblebee.

Implements KeyBERT-style scoring: embed the input document and each candidate phrase with a multilingual sentence-transformer, then rank candidates by cosine similarity to the document embedding. The intuition is that the best keyword candidates are the phrases whose meaning is closest to the document as a whole — exactly what neural sentence embeddings capture.
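The ranking step reduces to cosine similarity between embedding vectors. A minimal sketch of that step, using plain float lists in place of the Nx tensors the real backend gets from the sentence-transformer (module and function names here are illustrative, not part of this package's API):

```elixir
defmodule CosineSketch do
  # Cosine similarity between two embedding vectors (plain float lists here;
  # the actual backend operates on Nx tensors produced by Bumblebee).
  def cosine(a, b) do
    dot = Enum.zip(a, b) |> Enum.map(fn {x, y} -> x * y end) |> Enum.sum()
    norm = fn v -> :math.sqrt(Enum.reduce(v, 0.0, fn x, acc -> acc + x * x end)) end
    dot / (norm.(a) * norm.(b))
  end

  # Rank candidate phrases by similarity of their embedding to the
  # document embedding — the core of KeyBERT-style scoring.
  def rank(doc_vec, candidates) do
    candidates
    |> Enum.map(fn {phrase, vec} -> {phrase, cosine(doc_vec, vec)} end)
    |> Enum.sort_by(fn {_phrase, score} -> score end, :desc)
  end
end
```

A candidate whose embedding points in the same direction as the document embedding scores near 1.0 and rises to the top of the ranking.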

This backend is opt-in:

  • The :bumblebee and (recommended) :exla Hex packages must be declared as dependencies of the host application.

  • Either pass scoring: :key_bert to Text.WordCloud.terms/2, or use this module directly.
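Concretely, opting in might look like the following (the version requirements are illustrative — pin whatever versions your application actually resolves):

```elixir
# mix.exs of the host application — declare the optional dependencies.
defp deps do
  [
    {:text, "~> 0.5"},
    {:bumblebee, "~> 0.5"},
    # Recommended: EXLA for compiled inference.
    {:exla, "~> 0.7"}
  ]
end
```

With the dependencies in place, select the backend through the orchestrator:

```elixir
Text.WordCloud.terms(document_text, scoring: :key_bert)
```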

Cold start

The first call downloads the default model (~470 MB — sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2, multilingual across ~50 languages) from Hugging Face, traces the inference graph, and compiles it under EXLA. Subsequent calls hit a cached Nx.Serving in :persistent_term. Pre-download via mix text.download_models --keybert if your production environment needs everything present at boot.

When to use this backend

KeyBERT typically produces the highest-quality output of any backend in this package — at the cost of a model download, GPU/EXLA compilation, and substantially higher per-call latency than YAKE! Use this when quality matters more than throughput, or when YAKE!'s statistical features struggle with very short or very domain-specific text.

Options

  • :model — Hugging Face model id. Defaults to "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2". Any sentence-transformer model compatible with Bumblebee.Text.text_embedding/3 works.

  • :tokenizer_repo — overrides the tokenizer source repo (rarely needed for sentence-transformer models, which ship complete tokenizers).

  • :serving — name or pid of a pre-started Nx.Serving to skip the lazy :persistent_term cache. Recommended for production.

  • :candidate_pool_size — cap on the number of candidates embedded; large documents can produce hundreds of phrases and embedding all of them is wasteful. Defaults to 200. Candidates are pre-filtered by raw frequency before embedding.

  • :ngram_range — {min, max} bounds on candidate phrase length, in words. Defaults to {1, 3}.

Standard Text.WordCloud orchestrator options (:language, :stopwords, :case_fold, :locale) are honoured.
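For production, the docs above recommend a pre-started Nx.Serving via :serving. A sketch of what that could look like, assuming the default model; the serving name MyApp.KeyBERTServing and the batch/sequence sizes are placeholders, while the Bumblebee and Nx.Serving calls are the standard text-embedding setup:

```elixir
# In your application's start/2 — build a text-embedding serving once,
# supervise it, and hand its name to the backend via :serving.
repo = {:hf, "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"}
{:ok, model_info} = Bumblebee.load_model(repo)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)

serving =
  Bumblebee.Text.text_embedding(model_info, tokenizer,
    compile: [batch_size: 8, sequence_length: 128],
    defn_options: [compiler: EXLA]
  )

children = [
  {Nx.Serving, serving: serving, name: MyApp.KeyBERTServing, batch_timeout: 100}
]

# Later, at call sites — skip the lazy :persistent_term cache entirely:
Text.WordCloud.terms(document_text,
  scoring: :key_bert,
  serving: MyApp.KeyBERTServing,
  candidate_pool_size: 100,
  ngram_range: {1, 2}
)
```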

Summary

Functions

reset(model \\ "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

@spec reset(String.t() | :all) :: :ok

Drops the cached Nx.Serving for the given KeyBERT model.

Arguments

  • model — a model id string. Defaults to the package default. Pass :all to drop every cached serving for this backend.

Returns

  • :ok.
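For example, to release the cached serving after a one-off batch job (a usage sketch against this module's documented API):

```elixir
# Drop the cached serving for the package-default model…
:ok = Text.WordCloud.Backends.KeyBERT.reset()

# …or for a specific model id…
:ok =
  Text.WordCloud.Backends.KeyBERT.reset(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
  )

# …or clear every cached serving held by this backend.
:ok = Text.WordCloud.Backends.KeyBERT.reset(:all)
```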