Text.POS (Text v0.5.0)


Part-of-speech tagging via Bumblebee.

Exposes tag/2, which assigns a part-of-speech label (:noun, :verb, :adj, …) to every word in an input sentence. It is backed by a pre-trained transformer loaded through Bumblebee's token_classification/3 serving. The default model is vblagoje/bert-english-uncased-finetuned-pos, trained on the OntoNotes 5.0 / Penn Treebank tag set.

Optional dependency

POS tagging requires the :bumblebee Hex package; :exla is recommended alongside it. Both are declared as optional dependencies of :text, so add them to your application's mix.exs to enable this module:

{:bumblebee, "~> 0.6"},
{:exla, "~> 0.9"}

Without :bumblebee, calling tag/2 raises with a clear "add these to your deps" message.
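Callers that prefer not to rely on the raise can check for the optional dependency up front. The following is a minimal sketch of such a guard, not the library's own implementation: `POSGuard`, `available?/0`, and `with_pos/1` are hypothetical names, and the only real call used is `Code.ensure_loaded?/1` from the standard library.

```elixir
defmodule POSGuard do
  # Hypothetical helper, not part of :text.
  # Returns true only when the optional Bumblebee module can be loaded.
  def available? do
    Code.ensure_loaded?(Bumblebee)
  end

  # Runs `fun` when Bumblebee is present; otherwise returns an error tuple
  # instead of letting the wrapped call raise.
  def with_pos(fun) when is_function(fun, 0) do
    if available?() do
      {:ok, fun.()}
    else
      {:error, :bumblebee_not_available}
    end
  end
end
```

A call site would then wrap the tagging call, for example `POSGuard.with_pos(fn -> Text.POS.tag("the cat sat") end)`, and handle the error tuple itself.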

Cold start and caching

The first call to tag/2 downloads the model (~440 MB for the default English model) from Hugging Face, traces the inference graph, and compiles it under EXLA. Subsequent calls hit a cached Nx.Serving in :persistent_term and run in single-digit milliseconds.
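The lazy caching pattern described above can be sketched in plain Elixir. This illustrates the general technique only; it is an assumption about the shape of such a cache, not the code used by Text.POS. `LazyCache` and its functions are hypothetical, and `build_fun` stands in for the expensive download, trace, and compile step.

```elixir
defmodule LazyCache do
  # Hypothetical illustration of a lazy :persistent_term cache keyed by model id.

  # Returns the cached value for `model`, building and storing it on first use.
  def fetch(model, build_fun) when is_function(build_fun, 0) do
    key = {__MODULE__, model}

    case :persistent_term.get(key, :none) do
      :none ->
        value = build_fun.()
        :persistent_term.put(key, value)
        value

      value ->
        value
    end
  end

  # Drops one cached entry; returns :ok whether or not it existed.
  def reset(model) do
    :persistent_term.erase({__MODULE__, model})
    :ok
  end
end
```

Reads from `:persistent_term` are cheap, which suits a value that is built once and read on every call. Note that this sketch does not guard against two processes building the same entry concurrently on the first call.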

For production, prefer starting a named serving at boot:

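# Load the default model; assumes its tokenizer lives in the same Hugging Face repo.
repo = {:hf, "vblagoje/bert-english-uncased-finetuned-pos"}
{:ok, model_info} = Bumblebee.load_model(repo)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
serving = Bumblebee.Text.token_classification(model_info, tokenizer)
{:ok, _pid} = Nx.Serving.start_link(serving: serving, name: MyApp.POS)

Text.POS.tag("the cat sat", serving: MyApp.POS)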

Result shape

tag/2 returns a list of {token, tag, score} triples. The tag is an atom drawn from the model's label set — for the default English model that's the Penn Treebank-derived :noun, :verb, :adj, :adv, :pron, :det, :punct, ….
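Because each entry is a plain three-element tuple, the result can be processed with ordinary pattern matching. The sketch below uses invented sample data in the documented `{token, tag, score}` shape. The tokens, tags, and scores are illustrative rather than real model output, and `TagFilter` is a hypothetical helper module.

```elixir
defmodule TagFilter do
  # Sample data shaped like the documented {token, tag, score} triples.
  # The values are invented for illustration, not produced by a model.
  def sample do
    [
      {"the", :det, 0.99},
      {"cat", :noun, 0.98},
      {"sat", :verb, 0.97},
      {".", :punct, 0.99}
    ]
  end

  # Keeps only the tokens whose tag is in `tags`, preserving order.
  def tokens_with(tagged, tags) do
    for {token, tag, _score} <- tagged, tag in tags, do: token
  end

  # Drops entries scored below `min_score`.
  def confident(tagged, min_score) do
    Enum.filter(tagged, fn {_token, _tag, score} -> score >= min_score end)
  end
end
```

With the sample above, `TagFilter.tokens_with(TagFilter.sample(), [:noun, :verb])` returns `["cat", "sat"]`.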

Languages

The default model is English-only. For other languages, supply a :model option naming a suitable checkpoint: for example, "QCRI/bert-base-multilingual-cased-pos-english", which builds on multilingual BERT but is fine-tuned for English tagging, or one of the language-specific BERT POS models on Hugging Face. The result shape is the same; only the tag vocabulary changes.

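Summary

Types

tagged_token()

A single token-and-tag entry in the result list.

Functions

reset(model \\ "vblagoje/bert-english-uncased-finetuned-pos")

Drops the cached Nx.Serving for the given model (or all models).

tag(text, options \\ [])

Returns the part-of-speech tags for text.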

Types

tagged_token()

@type tagged_token() :: {String.t(), atom(), float()}

A single token-and-tag entry in the result list.

Functions

reset(model \\ "vblagoje/bert-english-uncased-finetuned-pos")

@spec reset(String.t() | :all) :: :ok

Drops the cached Nx.Serving for the given model (or all models).

tag(text, options \\ [])

@spec tag(
  String.t(),
  keyword()
) :: [tagged_token()]

Returns the part-of-speech tags for text.

Arguments

  • text is a UTF-8 binary.

Options

  • :model — the Hugging Face model id to use. Defaults to "vblagoje/bert-english-uncased-finetuned-pos". Any sequence-tagging checkpoint compatible with Bumblebee.Text.token_classification/3 works.

  • :serving — pass a name or pid of a pre-started Nx.Serving to skip the lazy :persistent_term cache. Useful in production, especially for sharing a single serving across an application.

  • :compile — defn-compilation options for the model. Defaults to [batch_size: 1, sequence_length: 128].
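The options above could be resolved against their documented defaults along the following lines. This is a sketch under assumptions, not the library's actual option handling: `POSOptions` is a hypothetical module, and merging a partial `:compile` list into the defaults is an assumed behaviour that the documentation does not confirm.

```elixir
defmodule POSOptions do
  # Hypothetical sketch of resolving tag/2 options against the documented defaults.
  @default_model "vblagoje/bert-english-uncased-finetuned-pos"
  @default_compile [batch_size: 1, sequence_length: 128]

  def resolve(options) do
    # Keyword.validate!/2 raises ArgumentError on unknown keys and fills in defaults.
    opts =
      Keyword.validate!(options,
        model: @default_model,
        serving: nil,
        compile: @default_compile
      )

    # Assumed behaviour: a partial :compile list keeps the remaining defaults.
    Keyword.update!(opts, :compile, &Keyword.merge(@default_compile, &1))
  end
end
```

Validating the keys early turns a misspelled option such as `modle:` into an immediate `ArgumentError` rather than a silently ignored setting.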

Returns

  • A list of {token, tag, score} triples.