# `Text.WordCloud`
[🔗](https://github.com/kipcole9/text/blob/v0.5.0/lib/word_cloud.ex#L1)

Builds a weighted list of terms suitable for rendering as a word cloud.

`terms/2` returns a list of `%{term, weight, count, kind}` maps
sorted by `:weight` (descending). The top term always has weight
`1.0`; every other weight is normalised relative to it. Visual
layout (placing the words on a canvas) is handled separately by
`Text.WordCloud.Layout`.
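
The normalisation described above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: `WeightSketch` and `normalise/1` are hypothetical names, and the sketch assumes non-negative raw scores where a higher score means a more important term. The actual backends may score differently before normalising.

```elixir
defmodule WeightSketch do
  # Hypothetical helper for illustration; not part of Text.WordCloud.
  # Takes [{term, raw_score}] with non-negative scores, higher = more important.
  def normalise([]), do: []

  def normalise(scored) do
    {_term, top} = Enum.max_by(scored, fn {_term, score} -> score end)

    scored
    |> Enum.map(fn {term, score} ->
      # Divide by the top score so the best term lands on exactly 1.0.
      weight = if top > 0, do: score / top, else: 0.0
      %{term: term, weight: weight}
    end)
    |> Enum.sort_by(& &1.weight, :desc)
  end
end

# "cat" has the highest raw score, so it comes first with weight 1.0.
WeightSketch.normalise([{"mat", 1}, {"cat", 3}, {"sat", 1}])
```

Dividing by the top score keeps the relative proportions between terms, which is what a renderer needs to scale font sizes.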

Supports several scoring algorithms via the `:scoring` option;
`:yake` (the default) requires no reference corpus and is
multilingual by construction. See the `Text.WordCloud.Backends.*`
modules for the catalogue.

Multilingual end-to-end:

* Tokenisation runs through `Text.Segment.words/2` (Unicode UAX #29).

* Sentence segmentation uses `Text.Segment.sentences/2`.

* Stopwords come from the bundled `Text.Stopwords` (~60 languages)
  via the `:stopwords` option.

* Language is auto-detected with `Text.Language.Classifier.Fasttext`
  when `:language` is `{:auto, model}`. When `:language` is unset, or
  the classifier is not available, no language-specific behaviour
  is applied.

# `term_entry`

```elixir
@type term_entry() :: %{
  term: String.t(),
  weight: float(),
  count: pos_integer(),
  kind: :word | :phrase
}
```

A scored term, ready for rendering.
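
As an illustration of how a renderer might consume these entries, the sketch below maps `:weight` onto a font size. `FontSizeSketch`, `font_size/1`, and the 12 to 64 pixel range are assumptions made for this example; they are not part of `Text.WordCloud` or `Text.WordCloud.Layout`.

```elixir
defmodule FontSizeSketch do
  # Hypothetical pixel range, chosen only for this example.
  @min_px 12
  @max_px 64

  # Linearly interpolates between @min_px and @max_px using the entry's weight.
  # Accepts any map with a :weight in 0.0..1.0, such as a term_entry().
  def font_size(%{weight: weight})
      when is_number(weight) and weight >= 0 and weight <= 1 do
    round(@min_px + weight * (@max_px - @min_px))
  end
end

entry = %{term: "cat", weight: 1.0, count: 3, kind: :word}
FontSizeSketch.font_size(entry)
# => 64
```

Because the top entry always has weight `1.0`, it always receives the maximum size under this mapping.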

# `terms`

```elixir
@spec terms(
  String.t() | [String.t()],
  keyword()
) :: [term_entry()]
```

Returns a weighted list of terms for `text` suitable for word-cloud rendering.

### Arguments

* `text` is a UTF-8 string or a list of strings. A list is treated
  as a corpus of independent documents.

### Options

* `:scoring` — `:yake` (default), `:frequency`, `:tf_idf`, `:rake`,
  `:text_rank`, `:key_bert`, or any module implementing
  `Text.WordCloud.Backend`.

* `:max_terms` — cap on returned entries. Default `100`.

* `:min_count` — drop terms occurring fewer times than this.
  Default `1`.

* `:ngram_range` — `{min, max}` token length for candidate terms.
  Default depends on the backend (`{1, 3}` for `:yake`, `{1, 1}` for
  `:frequency`).

* `:language` — atom, BCP-47 string, or `Localize.LanguageTag`.
  Default `nil` (no language-specific behaviour). Pass
  `{:auto, model}` to auto-detect via a pre-loaded
  `Text.Language.Classifier.Fasttext.Model` — the orchestrator
  does not load the fastText model itself, so callers wanting
  detection load it once at boot and hand it in.

* `:stopwords` — `:auto` (use the bundled list for the resolved
  language; default), `:none`, a list, a `MapSet`, or
  `{:extend, [extra]}` to add to the bundled list.

* `:case_fold` — boolean, default `true`.

* `:stem` — boolean, default `false`. When `true`, candidate terms
  are bucketed by their Snowball stem so morphological variants
  (`demolish`, `demolished`, `demolishing`, `demolition`) collapse
  into a single entry. The most-frequent surface form represents
  the bucket; counts and raw scores are summed across members.
  Requires the optional `:text_stemmer` dependency. The stemmer
  language defaults to the resolved `:language`; override with
  `:stem_language`.

* `:stem_language` — atom override for the stemmer language. Useful
  when the corpus language differs from the bucketing language
  (e.g. mixed-language text where you want only English variants
  consolidated). Defaults to `:language`.

* `:include` — `:all` (default), `:words` only, or `:phrases` only.

* `:reference_corpus` — used by `:tf_idf` and `:log_likelihood`.
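
The bucketing described for the `:stem` option can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's code: `StemBucketSketch` and `bucket/2` are hypothetical, candidates are assumed to be maps with `:term`, `:count`, and `:score` keys, and the stemmer is passed in as a plain function rather than calling the real Snowball stemmer from `:text_stemmer`.

```elixir
defmodule StemBucketSketch do
  # Hypothetical helper for illustration; not part of Text.WordCloud.
  # candidates: [%{term: String.t(), count: integer, score: number}]
  # stem_fun: any 1-arity function mapping a term to its stem.
  def bucket(candidates, stem_fun) when is_function(stem_fun, 1) do
    candidates
    |> Enum.group_by(fn candidate -> stem_fun.(candidate.term) end)
    |> Enum.map(fn {_stem, members} ->
      # The most frequent surface form represents the whole bucket.
      representative = Enum.max_by(members, & &1.count)

      %{
        term: representative.term,
        # Counts and raw scores are summed across all members.
        count: members |> Enum.map(& &1.count) |> Enum.sum(),
        score: members |> Enum.map(& &1.score) |> Enum.sum()
      }
    end)
  end
end

# Toy stand-in for a stemmer: the first six characters. A real Snowball
# stemmer behaves differently; this only makes the example self-contained.
toy_stem = fn term -> String.slice(term, 0, 6) end

StemBucketSketch.bucket(
  [
    %{term: "demolish", count: 2, score: 0.5},
    %{term: "demolished", count: 5, score: 1.0},
    %{term: "demolition", count: 1, score: 0.25}
  ],
  toy_stem
)
# => [%{term: "demolished", count: 8, score: 1.75}]
```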

### Returns

* A list of `%{term, weight, count, kind}` maps sorted by `:weight`
  descending. The top entry has `weight: 1.0`.

### Examples

    iex> text = "the cat sat on the mat. the cat ran. the cat slept."
    iex> [first | _] = Text.WordCloud.terms(text, scoring: :frequency, language: :en, max_terms: 3)
    iex> first.term
    "cat"

