Builds a weighted list of terms suitable for rendering as a word cloud.
The function returns a list of %{term, weight, count, kind} maps
sorted by :weight (descending). The top term always has weight
1.0; every other weight is normalised relative to it. Visual
layout — placing the words on a canvas — is handled separately by
Text.WordCloud.Layout.
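The normalisation described above can be sketched in a few lines. This is an illustration only, not the library's implementation: `WeightSketch`, `normalise/1`, and the `:score` key are hypothetical names, and the sketch assumes raw scores are positive with higher meaning more important (a backend where lower is better would need its scores inverted first).

```elixir
defmodule WeightSketch do
  # Hypothetical helper, not part of Text.WordCloud: scales raw scores so
  # the best-scoring term gets weight 1.0, then sorts by weight descending.
  # Assumes every raw score is positive and that higher means more important.
  def normalise([]), do: []

  def normalise(scored) do
    max = scored |> Enum.map(& &1.score) |> Enum.max()

    scored
    |> Enum.map(fn %{term: term, score: score} ->
      # `/` always returns a float, so the top term ends up at exactly 1.0.
      %{term: term, weight: score / max}
    end)
    |> Enum.sort_by(& &1.weight, :desc)
  end
end

raw = [
  %{term: "mat", score: 1},
  %{term: "cat", score: 3},
  %{term: "sat", score: 1}
]

WeightSketch.normalise(raw)
```

Dividing by the maximum keeps the relative proportions between terms, which is what a renderer typically needs when mapping weights to font sizes.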
Supports several scoring algorithms via the :scoring option;
:yake (the default) requires no reference corpus and is
multilingual by construction. See the Text.WordCloud.Backends.*
modules for the catalogue.
Multilingual end-to-end:
- Tokenisation runs through Text.Segment.words/2 (Unicode UAX #29).
- Sentence segmentation uses Text.Segment.sentences/2.
- Stopwords come from the bundled Text.Stopwords (~60 languages) via the :stopwords option.
- Language is auto-detected with Text.Language.Classifier.Fasttext when :language is unset, falling back to no language-specific behaviour if the classifier is not available.
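To make the tokenise-then-filter idea concrete, here is a rough, self-contained sketch. It does not call Text.Segment or Text.Stopwords: `TokenSketch` and its functions are hypothetical, the regex is only a crude approximation of UAX #29 word segmentation, and the tiny stopword sets stand in for the bundled lists. The option shapes accepted for stopwords mirror the ones documented for :stopwords.

```elixir
defmodule TokenSketch do
  # Hypothetical stand-ins for the bundled stopword lists; real lists are far longer.
  defp bundled(:de), do: MapSet.new(~w(die der auf))
  defp bundled(:en), do: MapSet.new(~w(the on))
  defp bundled(_other), do: MapSet.new()

  # Resolves the documented :stopwords option shapes into a MapSet.
  def stopwords(:none, _language), do: MapSet.new()
  def stopwords(:auto, language), do: bundled(language)

  def stopwords({:extend, extra}, language),
    do: MapSet.union(bundled(language), MapSet.new(extra))

  def stopwords(%MapSet{} = set, _language), do: set
  def stopwords(list, _language) when is_list(list), do: MapSet.new(list)

  # Approximate word segmentation: runs of Unicode letters and digits,
  # case-folded. This is NOT a full UAX #29 implementation.
  def words(text) do
    ~r/[\p{L}\p{N}]+/u
    |> Regex.scan(text)
    |> List.flatten()
    |> Enum.map(&String.downcase/1)
  end

  # Tokenise, then drop stopwords for the given language.
  def candidates(text, language, stopwords_opt \\ :auto) do
    stop = stopwords(stopwords_opt, language)
    Enum.reject(words(text), &MapSet.member?(stop, &1))
  end
end

TokenSketch.candidates("Die Katze saß auf der Matte.", :de)
# => ["katze", "saß", "matte"]
```

An unknown language resolves to an empty stopword set here, which loosely corresponds to the "no language-specific behaviour" fallback described above.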
Types
@type term_entry() :: %{
  term: String.t(),
  weight: float(),
  count: pos_integer(),
  kind: :word | :phrase
}
A scored term, ready for rendering.
Functions
@spec terms(String.t() | [String.t()], keyword()) :: [term_entry()]
Returns a weighted list of terms for text suitable for word-cloud rendering.
Arguments
- text is a UTF-8 string or a list of strings. A list is treated as a corpus of independent documents.
Options
- :scoring - :yake (default), :frequency, :tf_idf, :rake, :text_rank, :key_bert, or any module implementing Text.WordCloud.Backend.
- :max_terms - cap on returned entries. Default 100.
- :min_count - drop terms occurring fewer times than this. Default 1.
- :ngram_range - {min, max} token length for candidate terms. Default depends on backend ({1, 3} for YAKE, {1, 1} for Frequency).
- :language - atom, BCP-47 string, or Localize.LanguageTag. Default nil (no language-specific behaviour). Pass {:auto, model} to auto-detect via a pre-loaded Text.Language.Classifier.Fasttext.Model. The orchestrator does not load the fastText model itself, so callers who want detection load it once at boot and hand it in.
- :stopwords - :auto (use the bundled list for the resolved language; default), :none, a list, a MapSet, or {:extend, [extra]} to add to the bundled list.
- :case_fold - boolean, default true.
- :stem - boolean, default false. When true, candidate terms are bucketed by their Snowball stem so morphological variants (demolish, demolished, demolishing, demolition) collapse into a single entry. The most-frequent surface form represents the bucket; counts and raw scores are summed across members. Requires the optional :text_stemmer dependency. The stemmer language defaults to the resolved :language; override with :stem_language.
- :stem_language - atom override for the stemmer language. Useful when the corpus language differs from the bucketing language (e.g. mixed-language text where you want only English variants consolidated). Defaults to :language.
- :include - :all (default), :words only, or :phrases only.
- :reference_corpus - used by :tf_idf and :log_likelihood.
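The bucketing that :stem describes can be sketched as follows. This is a minimal illustration, not the library's code: `StemSketch`, `bucket/2`, and the `:score` key are hypothetical, and the stemmer is passed in as a plain function (here a lookup table standing in for a real Snowball stemmer).

```elixir
defmodule StemSketch do
  # Hypothetical helper: groups entries by stem, keeps the most frequent
  # surface form as the representative, and sums counts and raw scores.
  def bucket(entries, stem_fun) do
    entries
    |> Enum.group_by(fn entry -> stem_fun.(entry.term) end)
    |> Enum.map(fn {_stem, members} ->
      representative = Enum.max_by(members, & &1.count)

      %{
        term: representative.term,
        count: members |> Enum.map(& &1.count) |> Enum.sum(),
        score: members |> Enum.map(& &1.score) |> Enum.sum()
      }
    end)
    |> Enum.sort_by(& &1.score, :desc)
  end
end

# Lookup table standing in for a real stemmer; unknown words stem to themselves.
stems = %{
  "demolished" => "demolish",
  "demolishing" => "demolish",
  "demolition" => "demolish"
}

stem_fun = fn word -> Map.get(stems, word, word) end

entries = [
  %{term: "demolish", count: 2, score: 2.0},
  %{term: "demolished", count: 5, score: 5.0},
  %{term: "demolition", count: 1, score: 1.0},
  %{term: "crane", count: 3, score: 3.0}
]

StemSketch.bucket(entries, stem_fun)
# => [
#   %{term: "demolished", count: 8, score: 8.0},
#   %{term: "crane", count: 3, score: 3.0}
# ]
```

Picking the most frequent surface form as the label keeps the rendered cloud readable, since a raw stem is often not a real word.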
Returns
- A list of %{term, weight, count, kind} maps sorted by :weight descending. The top entry has weight: 1.0.
Examples
iex> text = "the cat sat on the mat. the cat ran. the cat slept."
iex> [first | _] = Text.WordCloud.terms(text, scoring: :frequency, language: :en, max_terms: 3)
iex> first.term
"cat"