Part-of-speech tagging via Bumblebee.
Exposes tag/2, which assigns a part-of-speech label
(:noun, :verb, :adj, …) to every word in an input sentence.
Backed by a pre-trained transformer loaded through Bumblebee's
token_classification/3 serving — the default model is
vblagoje/bert-english-uncased-finetuned-pos,
trained on the OntoNotes 5.0 / Penn Treebank tag set.
Optional dependency
POS tagging requires the :bumblebee and (recommended) :exla
Hex packages. They are declared as optional dependencies of
:text; add them to your application's mix.exs to enable this
module:
{:bumblebee, "~> 0.6"},
{:exla, "~> 0.9"}

Without :bumblebee, calling tag/2 raises with a clear "add
these to your deps" message.
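The guard described above can be approximated with Code.ensure_loaded?/1. This is a minimal sketch, not the library's own code: the module name OptionalDepSketch, the function ensure_available!/2, and the error wording are all hypothetical.

```elixir
defmodule OptionalDepSketch do
  # Hypothetical helper, not part of Text.POS.
  # Returns :ok if `module` can be loaded, otherwise raises with `hint`.
  def ensure_available!(module, hint) do
    if Code.ensure_loaded?(module) do
      :ok
    else
      raise RuntimeError, "#{inspect(module)} is not available. " <> hint
    end
  end
end

hint = ~s|Add {:bumblebee, "~> 0.6"} and {:exla, "~> 0.9"} to your deps.|

# Enum ships with Elixir, so this check passes.
:ok = OptionalDepSketch.ensure_available!(Enum, hint)
```

Checking at call time rather than at compile time keeps the dependency optional: the calling application still compiles without it and only fails when the feature is actually used.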
Cold start and caching
The first call to tag/2 downloads the model (~440 MB for the
default English model) from Hugging Face, traces the inference
graph, and compiles it under EXLA. Subsequent calls hit a cached
Nx.Serving in :persistent_term and run in single-digit
milliseconds.
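The build-once, reuse-afterwards pattern can be sketched with :persistent_term directly. This is an illustration under stated assumptions, not the module's actual implementation: ServingCacheSketch, fetch/2, and the key layout are hypothetical, and the `build` function stands in for the expensive download-and-compile step.

```elixir
defmodule ServingCacheSketch do
  # Hypothetical module illustrating lazy caching in :persistent_term.

  # Returns the cached value for `model`, calling `build` only on a miss.
  def fetch(model, build) when is_function(build, 0) do
    key = {__MODULE__, model}

    case :persistent_term.get(key, :missing) do
      :missing ->
        value = build.()
        :persistent_term.put(key, value)
        value

      cached ->
        cached
    end
  end

  # Drops one cached entry; returns :ok whether or not it existed.
  def reset(model) do
    _ = :persistent_term.erase({__MODULE__, model})
    :ok
  end
end

# Stand-in for the slow first call; a real build would return an Nx.Serving.
slow_build = fn ->
  send(self(), :built)
  :fake_serving
end

:fake_serving = ServingCacheSketch.fetch("demo-model", slow_build)
# The second call is served from the cache and does not run slow_build again.
:fake_serving = ServingCacheSketch.fetch("demo-model", slow_build)
```

The sketch does not guard against two processes missing the cache at the same moment, in which case both would run the build. Reads from :persistent_term are cheap, but writes and erases are comparatively expensive, which suits a value that is built once and rarely dropped.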
For production, prefer starting a named serving at boot:
# model_info and tokenizer come from Bumblebee.load_model/2 and
# Bumblebee.load_tokenizer/2 for the chosen checkpoint.
serving = Bumblebee.Text.token_classification(model_info, tokenizer)
{:ok, _pid} = Nx.Serving.start_link(serving: serving, name: MyApp.POS)
Text.POS.tag("the cat sat", serving: MyApp.POS)

Result shape
tag/2 returns a list of {token, tag, score} triples. The
tag is an atom drawn from the model's label set — for the
default English model that's the Penn Treebank-derived
:noun, :verb, :adj, :adv, :pron, :det, :punct, ….
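Because the result is a plain list of triples, it can be consumed with ordinary pattern matching. The data below is made up to match the documented shape; the tokens, tags, and scores are not real model output, and the sketch assumes the score is a float between 0 and 1.

```elixir
# Illustrative data only, shaped like the documented {token, tag, score} result.
tagged = [
  {"the", :det, 0.99},
  {"cat", :noun, 0.98},
  {"sat", :verb, 0.97},
  {".", :punct, 0.99}
]

# Keep only the nouns by matching on the tag position.
nouns = for {token, :noun, _score} <- tagged, do: token

# Group tokens by tag.
by_tag =
  Enum.group_by(
    tagged,
    fn {_token, tag, _score} -> tag end,
    fn {token, _tag, _score} -> token end
  )

# Drop low-confidence entries; the 0.9 threshold is arbitrary.
confident = Enum.filter(tagged, fn {_token, _tag, score} -> score >= 0.9 end)
```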
Languages
The default model is English-only. For multilingual POS, supply
a :model option pointing to a multilingual checkpoint — for
example, "QCRI/bert-base-multilingual-cased-pos-english" for
English, or one of the language-specific BERT POS models on
Hugging Face. The result shape is the same; only the tag
vocabulary changes.
Summary
Types
tagged_token()
A single token-and-tag entry in the result list.
Functions
reset/1
Drops the cached Nx.Serving for the given model (or all models).
tag/2
Returns the part-of-speech tags for text.
Types
tagged_token()
A single token-and-tag entry in the result list.
Functions
@spec reset(String.t() | :all) :: :ok
Drops the cached Nx.Serving for the given model (or all models).
@spec tag(String.t(), keyword()) :: [tagged_token()]
Returns the part-of-speech tags for text.
Arguments
- text is a UTF-8 binary.
Options
- :model is the Hugging Face model id to use. Defaults to "vblagoje/bert-english-uncased-finetuned-pos". Any sequence-tagging checkpoint compatible with Bumblebee.Text.token_classification/3 works.
- :serving is the name or pid of a pre-started Nx.Serving; passing it skips the lazy :persistent_term cache. Useful in production, especially for sharing a single serving across an application.
- :compile sets the defn-compilation options for the model. Defaults to [batch_size: 1, sequence_length: 128].
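One way a caller-side wrapper could fill in these documented defaults is Keyword.validate!/2 (Elixir 1.13 or later). This is a sketch only: TagOptionsSketch and resolve/1 are hypothetical, the nil default for :serving is an assumption, and Text.POS may resolve its options differently.

```elixir
defmodule TagOptionsSketch do
  # Hypothetical helper; the defaults mirror the documented option values.
  @defaults [
    model: "vblagoje/bert-english-uncased-finetuned-pos",
    # Assumption: nil means "no pre-started serving was supplied".
    serving: nil,
    compile: [batch_size: 1, sequence_length: 128]
  ]

  # Fills in defaults and raises ArgumentError on unknown keys.
  def resolve(opts), do: Keyword.validate!(opts, @defaults)
end

opts = TagOptionsSketch.resolve(serving: MyApp.POS)

# The supplied value wins; everything else falls back to the defaults.
MyApp.POS = Keyword.fetch!(opts, :serving)
[batch_size: 1, sequence_length: 128] = Keyword.fetch!(opts, :compile)
```

In this sketch a caller-supplied :compile list replaces the default list as a whole rather than being merged key by key.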
Returns
- A list of {token, tag, score} triples.