# `Text.Stopwords`
[🔗](https://github.com/kipcole9/text/blob/v0.5.0/lib/stopwords.ex#L1)

Bundled multilingual stopword lists.

Stopwords are the high-frequency function words that carry little
topical content (`the`, `is`, `of` in English; `le`, `la`, `et` in
French; …). They are routinely filtered out of text before frequency
analysis, keyword extraction, or other content-focused processing.

This module ships the [stopwords-iso](https://github.com/stopwords-iso/stopwords-iso)
collection — a community-curated set of stopword lists covering
~60 languages, distributed under the MIT license. The raw `.txt`
files live in `priv/stopwords/<lang>.txt`; this module loads them
at compile time into per-language `MapSet`s and exposes a small
query API.
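As a rough illustration of the filtering step described above, the sketch below tokenizes a string and drops every token found in a stopword set. `StopwordFilter` and its regex-based tokenizer are hypothetical and not part of this library. The set is passed in as a plain `MapSet`, so the value returned by `for/1` could be supplied directly.

```elixir
defmodule StopwordFilter do
  # Hypothetical helper, not part of Text.Stopwords.
  # Lowercases the text, splits on anything that is not a letter, digit,
  # or apostrophe, and rejects tokens present in `stopwords`.
  def remove(text, stopwords) do
    text
    |> String.downcase()
    |> String.split(~r/[^\p{L}\p{N}']+/u, trim: true)
    |> Enum.reject(&MapSet.member?(stopwords, &1))
  end
end

# A tiny hand-built set keeps the example independent of the bundled lists.
sample = MapSet.new(["the", "is", "on"])
StopwordFilter.remove("The cat is on the mat", sample)
# => ["cat", "mat"]
```

With the library loaded, the same helper should accept `Text.Stopwords.for(:en)` in place of `sample`.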

### Languages

Every list is keyed by its ISO 639-1 two-letter code. Use
`available_languages/0` to enumerate the bundled set. Common codes
include `:en`, `:fr`, `:de`, `:es`, `:it`, `:pt`, `:ru`, `:zh`,
`:ja`, `:ar`, `:nl`, `:sv`, `:fi`, `:da`, `:no`, `:pl`, `:tr`,
`:ko`, `:hi`, …

### Composing lists

`union/2` merges the bundled sets for two languages, which is useful
for code-mixed text or for adding the emoticon-equivalent token list.
`extend/2` returns a new set with caller-supplied tokens added, so
callers can layer in domain-specific stopwords (e.g. boilerplate,
brand names) without rebuilding the whole list.
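Going by the specs, both `union/2` and `extend/2` take language inputs rather than existing sets, so combining more than two languages with custom tokens would go through `MapSet` directly. The sketch below shows one way to do that. `StopwordCompose.combine/2` is a hypothetical helper, not part of this module.

```elixir
defmodule StopwordCompose do
  # Hypothetical helper, not part of Text.Stopwords.
  # Merges any number of stopword sets, then adds caller-supplied tokens.
  def combine(sets, extra \\ []) do
    sets
    |> Enum.reduce(MapSet.new(), &MapSet.union/2)
    |> MapSet.union(MapSet.new(extra))
  end
end

# Hand-built sets stand in for the bundled per-language lists.
combined =
  StopwordCompose.combine(
    [MapSet.new(["the"]), MapSet.new(["le"]), MapSet.new(["der"])],
    ["acme"]
  )

MapSet.member?(combined, "der") and MapSet.member?(combined, "acme")
# => true
```

With the library loaded, the list of sets could be built as `Enum.map([:en, :fr, :de], &Text.Stopwords.for/1)`.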

### Licensing

The bundled `.txt` files are reproduced from stopwords-iso under
the MIT license. See `priv/stopwords/LICENSE` for the upstream
attribution. To regenerate the files, run `mix text.gen_stopwords`.

# `language`

```elixir
@type language() :: atom()
```

An ISO 639-1 language code, atom-typed.

# `language_input`

```elixir
@type language_input() :: atom() | String.t() | struct()
```

Anything `Text.Language.normalize/1` accepts: an atom language tag, a
BCP-47 string, or a `Localize.LanguageTag` struct.

# `available?`

```elixir
@spec available?(language_input()) :: boolean()
```

Returns whether a stopword list is bundled for the given language.

### Arguments

* `language` is any value accepted by `for/1`.

### Returns

* `true` if a list is bundled, `false` otherwise.

### Examples

    iex> Text.Stopwords.available?(:en)
    true

    iex> Text.Stopwords.available?(:zz)
    false

# `available_languages`

```elixir
@spec available_languages() :: [language()]
```

Returns the sorted list of bundled language tags.

### Returns

* A list of atom language tags (ISO 639-1 codes).

### Examples

    iex> :en in Text.Stopwords.available_languages()
    true

    iex> :fr in Text.Stopwords.available_languages()
    true

    iex> Text.Stopwords.available_languages() |> length() > 50
    true

# `contains?`

```elixir
@spec contains?(language_input(), String.t()) :: boolean()
```

Returns whether `token` is in the bundled stopword list for a language.

Equivalent to `MapSet.member?(Text.Stopwords.for(language), token)`,
with the same input flexibility on the `language` argument.

### Arguments

* `language` is any value accepted by `for/1`.

* `token` is a string. Matching is case-sensitive and the upstream
  lists are lowercase, so a capitalised token such as `"The"` will not
  match. Lowercase the token first (for example with
  `String.downcase/1`) if you need case-insensitive matching.

### Returns

* `true` if the token is in the list, `false` otherwise.

### Examples

    iex> Text.Stopwords.contains?(:en, "the")
    true

    iex> Text.Stopwords.contains?(:en, "Zebra")
    false

    iex> Text.Stopwords.contains?(:fr, "le")
    true
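The case-sensitivity note above can be sketched as a small wrapper that lowercases the token before the membership test. `StopwordMatch.stopword?/2` is a hypothetical name. It works on any `MapSet` of lowercase strings, which is what `for/1` is documented to return.

```elixir
defmodule StopwordMatch do
  # Hypothetical helper, not part of Text.Stopwords.
  # Case-insensitive membership check against a set of lowercase tokens.
  def stopword?(stopwords, token) do
    MapSet.member?(stopwords, String.downcase(token))
  end
end

# A hand-built set stands in for a bundled list.
set = MapSet.new(["the", "of"])

MapSet.member?(set, "The")
# => false (exact match only)

StopwordMatch.stopword?(set, "The")
# => true
```

The same effect should be available without a helper by calling `Text.Stopwords.contains?(:en, String.downcase(token))`.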

# `extend`

```elixir
@spec extend(language_input(), Enumerable.t(String.t())) :: MapSet.t(String.t())
```

Returns a stopword set augmented with extra tokens.

Lets callers layer domain-specific stopwords (e.g. brand names,
boilerplate, navigation chrome) on top of the bundled list without
mutating the bundled set.

### Arguments

* `language` is any value accepted by `for/1`.

* `extra` is a list, `MapSet`, or any enumerable of strings to add.

### Returns

* A `MapSet` containing the bundled set plus `extra`.

### Examples

    iex> set = Text.Stopwords.extend(:en, ["acme", "lorem"])
    iex> MapSet.member?(set, "the") and MapSet.member?(set, "acme")
    true

# `for`

```elixir
@spec for(language_input()) :: MapSet.t(String.t())
```

Returns the bundled stopword `MapSet` for a language.

### Arguments

* `language` is an atom language tag (`:en`, `:fr`, …), a string
  BCP-47 tag (`"en"`, `"en-US"`), or a `Localize.LanguageTag` struct.
  The input is normalised via `Text.Language.normalize/1` before
  lookup.

### Returns

* A `MapSet` of lowercased stopword strings.

* Raises `ArgumentError` if no list is bundled for the resolved
  language. Use `available_languages/0` to enumerate the supported
  set.

### Examples

    iex> Text.Stopwords.for(:en) |> MapSet.member?("the")
    true

    iex> Text.Stopwords.for("fr") |> MapSet.member?("le")
    true

    iex> Text.Stopwords.for(:en) |> MapSet.member?("zebra")
    false
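Because `for/1` raises `ArgumentError` for an unbundled language, callers handling user-supplied language tags may prefer a tagged result. The sketch below wraps any one-argument lookup function. `SafeStopwords.fetch/2` is a hypothetical helper, and rescuing `ArgumentError` is an assumption that would also swallow the same error raised for other reasons inside the lookup.

```elixir
defmodule SafeStopwords do
  # Hypothetical wrapper, not part of Text.Stopwords.
  # Calls `lookup` and converts a raised ArgumentError into :error.
  def fetch(language, lookup) when is_function(lookup, 1) do
    {:ok, lookup.(language)}
  rescue
    ArgumentError -> :error
  end
end

# A stub lookup that mimics the documented behaviour: a set for a known
# language, ArgumentError otherwise.
stub = fn
  :en -> MapSet.new(["the"])
  other -> raise ArgumentError, "no stopword list for #{inspect(other)}"
end

SafeStopwords.fetch(:zz, stub)
# => :error
```

With the library loaded, the lookup argument would be `&Text.Stopwords.for/1`. Checking `available?/1` before calling `for/1` is an alternative that avoids the rescue.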

# `union`

```elixir
@spec union(language_input(), language_input()) :: MapSet.t(String.t())
```

Returns the union of two bundled stopword sets.

Useful for code-mixed input (e.g. a French-English document) or for
layering in a small auxiliary set (an `:emoticon`-style list).

### Arguments

* `first` and `second` are anything accepted by `for/1`.

### Returns

* A `MapSet` containing every token from either list.

### Examples

    iex> set = Text.Stopwords.union(:en, :fr)
    iex> MapSet.member?(set, "the") and MapSet.member?(set, "le")
    true

