Text.WordCloud (Text v0.5.0)

Copy Markdown View Source

Builds a weighted list of terms suitable for rendering as a word cloud.

The function returns a list of %{term, weight, count, kind} maps sorted by :weight (descending). The top term always has weight 1.0; every other weight is normalised relative to it. Visual layout — placing the words on a canvas — is handled separately by Text.WordCloud.Layout.

Supports several scoring algorithms via the :scoring option; :yake (the default) requires no reference corpus and is multilingual by construction. See the Text.WordCloud.Backends.* modules for the catalogue.

Multilingual end-to-end:

Summary

Types

A scored term, ready for rendering.

Functions

Returns a weighted list of terms for text suitable for word-cloud rendering.

Types

term_entry()

@type term_entry() :: %{
  term: String.t(),
  weight: float(),
  count: pos_integer(),
  kind: :word | :phrase
}

A scored term, ready for rendering.

Functions

terms(text, options \\ [])

@spec terms(
  String.t() | [String.t()],
  keyword()
) :: [term_entry()]

Returns a weighted list of terms for text suitable for word-cloud rendering.

Arguments

  • text is a UTF-8 string or a list of strings. A list is treated as a corpus of independent documents.

Options

  • :scoring:yake (default), :frequency, :tf_idf, :rake, :text_rank, :key_bert, or any module implementing Text.WordCloud.Backend.

  • :max_terms — cap on returned entries. Default 100.

  • :min_count — drop terms occurring fewer times than this. Default 1.

  • :ngram_range{min, max} token length for candidate terms. Default depends on backend ({1, 3} for YAKE, {1, 1} for Frequency).

  • :language — atom, BCP-47 string, or Localize.LanguageTag. Default nil (no language-specific behaviour). Pass {:auto, model} to auto-detect via a pre-loaded Text.Language.Classifier.Fasttext.Model — the orchestrator does not load the fastText model itself, so callers wanting detection load it once at boot and hand it in.

  • :stopwords:auto (use the bundled list for the resolved language; default), :none, a list, a MapSet, or {:extend, [extra]} to add to the bundled list.

  • :case_fold — boolean, default true.

  • :stem — boolean, default false. When true, candidate terms are bucketed by their Snowball stem so morphological variants (demolish, demolished, demolishing, demolition) collapse into a single entry. The most-frequent surface form represents the bucket; counts and raw scores are summed across members. Requires the optional :text_stemmer dependency. The stemmer language defaults to the resolved :language; override with :stem_language.

  • :stem_language — atom override for the stemmer language. Useful when the corpus language differs from the bucketing language (e.g. mixed-language text where you want only English variants consolidated). Defaults to :language.

  • :include:all (default), :words only, or :phrases only.

  • :reference_corpus — used by :tf_idf and :log_likelihood.

Returns

  • A list of %{term, weight, count, kind} maps sorted by :weight descending. The top entry has weight: 1.0.

Examples

iex> text = "the cat sat on the mat. the cat ran. the cat slept."
iex> [first | _] = Text.WordCloud.terms(text, scoring: :frequency, language: :en, max_terms: 3)
iex> first.term
"cat"