Text.WordCloud (Text v0.5.0)

Builds a weighted list of terms suitable for rendering as a word cloud.

The function returns a list of %{term, weight, count, kind} maps sorted by :weight (descending). The top term always has weight 1.0; every other weight is normalised relative to it. Visual layout — placing the words on a canvas — is handled separately by Text.WordCloud.Layout.
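The normalisation described above can be sketched in a few lines of plain Elixir. This is an illustration only, not the library's implementation: `WeightSketch` and `normalise/1` are hypothetical names, and the sketch assumes raw scores are positive and that a higher raw score means a more important term (individual backends may score differently before normalising).

```elixir
defmodule WeightSketch do
  # Hypothetical helper, not part of Text.WordCloud.
  # Takes [{term, raw_score}] with positive scores, higher meaning more important.
  def normalise([]), do: []

  def normalise(scored) do
    # Find the largest raw score; every weight is expressed relative to it.
    {_term, max} = Enum.max_by(scored, fn {_term, score} -> score end)

    scored
    |> Enum.map(fn {term, score} -> %{term: term, weight: score / max} end)
    |> Enum.sort_by(& &1.weight, :desc)
  end
end

WeightSketch.normalise([{"mat", 1}, {"cat", 4}, {"sat", 2}])
# "cat" comes first with weight 1.0, followed by "sat" (0.5) and "mat" (0.25)
```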

Supports several scoring algorithms via the :scoring option; :yake (the default) requires no reference corpus and is multilingual by construction. See the Text.WordCloud.Backends.* modules for the catalogue.

Multilingual end-to-end:

Tokenisation runs through Text.Segment.words/2 (Unicode UAX #29).
Sentence segmentation uses Text.Segment.sentences/2.
Stopwords come from the bundled Text.Stopwords (~60 languages) via the :stopwords option.
Language detection uses Text.Language.Classifier.Fasttext when :language is set to {:auto, model}. When :language is unset, or the classifier is not available, no language-specific behaviour is applied.

Types

term_entry()

@type term_entry() :: %{
  term: String.t(),
  weight: float(),
  count: pos_integer(),
  kind: :word | :phrase
}

A scored term, ready for rendering.

Functions

terms(text, options \\ [])

@spec terms(
  String.t() | [String.t()],
  keyword()
) :: [term_entry()]

Returns a weighted list of terms for text suitable for word-cloud rendering.

Arguments

text is a UTF-8 string or a list of strings. A list is treated as a corpus of independent documents.

Options

:scoring — :yake (default), :frequency, :tf_idf, :rake, :text_rank, :key_bert, or any module implementing Text.WordCloud.Backend.
:max_terms — cap on returned entries. Default 100.
:min_count — drop terms occurring fewer times than this. Default 1.
:ngram_range — {min, max} token length for candidate terms. Default depends on backend ({1, 3} for YAKE, {1, 1} for Frequency).
:language — atom, BCP-47 string, or Localize.LanguageTag. Default nil (no language-specific behaviour). Pass {:auto, model} to auto-detect via a pre-loaded Text.Language.Classifier.Fasttext.Model — the orchestrator does not load the fastText model itself, so callers wanting detection load it once at boot and hand it in.
:stopwords — :auto (use the bundled list for the resolved language; default), :none, a list, a MapSet, or {:extend, [extra]} to add to the bundled list.
:case_fold — boolean, default true.
:stem — boolean, default false. When true, candidate terms are bucketed by their Snowball stem so morphological variants (demolish, demolished, demolishing, demolition) collapse into a single entry. The most-frequent surface form represents the bucket; counts and raw scores are summed across members. Requires the optional :text_stemmer dependency. The stemmer language defaults to the resolved :language; override with :stem_language.
:stem_language — atom override for the stemmer language. Useful when the corpus language differs from the bucketing language (e.g. mixed-language text where you want only English variants consolidated). Defaults to :language.
:include — :all (default), :words (single words only), or :phrases (multi-word phrases only).
:reference_corpus — used by :tf_idf and :log_likelihood.
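The bucketing behaviour described for :stem can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's code: `StemBucketSketch`, `bucket/2` and the candidate map shape are hypothetical, and the toy stemmer (a six-character prefix) stands in for a real Snowball stemmer.

```elixir
defmodule StemBucketSketch do
  # Hypothetical helper, not part of Text.WordCloud.
  # `candidates` is a list of %{term: String.t(), count: pos_integer(), score: number()}.
  # `stem_fun` is any 1-arity function returning the stem of a term.
  def bucket(candidates, stem_fun) do
    candidates
    |> Enum.group_by(fn candidate -> stem_fun.(candidate.term) end)
    |> Enum.map(fn {_stem, members} ->
      # The most frequent surface form represents the whole bucket.
      representative = Enum.max_by(members, & &1.count)

      %{
        term: representative.term,
        # Counts and raw scores are summed across all members of the bucket.
        count: members |> Enum.map(& &1.count) |> Enum.sum(),
        score: members |> Enum.map(& &1.score) |> Enum.sum()
      }
    end)
  end
end

# Toy stemmer for illustration only: keeps the first six characters.
toy_stem = fn term -> String.slice(term, 0, 6) end

StemBucketSketch.bucket(
  [
    %{term: "demolish", count: 2, score: 0.5},
    %{term: "demolished", count: 5, score: 1.0},
    %{term: "demolition", count: 1, score: 0.25},
    %{term: "cat", count: 3, score: 0.75}
  ],
  toy_stem
)
```

In this sketch the three "demoli" variants collapse into one entry represented by "demolished", with a combined count of 8, while "cat" stays in its own bucket.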

Returns

A list of %{term, weight, count, kind} maps sorted by :weight descending. The top entry has weight: 1.0.
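Because weights fall in the range 0.0 to 1.0 with the top entry at 1.0, a renderer can map them directly onto a visual property. The sketch below maps weight to a font size. `FontSizeSketch`, `font_size/3` and the pixel bounds are hypothetical and unrelated to Text.WordCloud.Layout, and the entries are hand-written in the documented shape rather than produced by the library.

```elixir
defmodule FontSizeSketch do
  # Hypothetical helper, not part of Text.WordCloud or Text.WordCloud.Layout.
  # Linearly maps a weight in 0.0..1.0 onto min_px..max_px.
  def font_size(weight, min_px \\ 12, max_px \\ 64) do
    round(min_px + weight * (max_px - min_px))
  end
end

# Hand-written entries in the shape of term_entry().
entries = [
  %{term: "cat", weight: 1.0, count: 3, kind: :word},
  %{term: "the mat", weight: 0.5, count: 1, kind: :phrase}
]

for %{term: term, weight: weight} <- entries do
  {term, FontSizeSketch.font_size(weight)}
end
# "cat" gets the maximum size (64) and "the mat" lands halfway (38)
```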

Examples

iex> text = "the cat sat on the mat. the cat ran. the cat slept."
iex> [first | _] = Text.WordCloud.terms(text, scoring: :frequency, language: :en, max_terms: 3)
iex> first.term
"cat"