Text.Stopwords (Text v0.5.0)

Bundled multilingual stopword lists.

Stopwords are the high-frequency function words that carry little topical content (the, is, of in English; le, la, et in French; …). They are routinely filtered out of text before frequency analysis, keyword extraction, or other content-focused processing.

This module ships the stopwords-iso collection — a community-curated set of stopword lists covering ~60 languages, distributed under the MIT license. The raw .txt files live in priv/stopwords/<lang>.txt; this module loads them at compile time into per-language MapSets and exposes a small query API.
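The compile-time loading described above can be pictured roughly as follows. This is a minimal sketch, not the library's source: the module name StopwordsSketch, the fetch/1 function, and the temporary directory are hypothetical stand-ins for the real module and for priv/stopwords/<lang>.txt.

```elixir
# Hypothetical stand-in for priv/stopwords/<lang>.txt: two tiny lists in a temp dir.
dir = Path.join(System.tmp_dir!(), "stopwords_sketch")
File.mkdir_p!(dir)
File.write!(Path.join(dir, "en.txt"), "the\nis\nof\n")
File.write!(Path.join(dir, "fr.txt"), "le\nla\net\n")

defmodule StopwordsSketch do
  # Hypothetical module for illustration only.
  @dir Path.join(System.tmp_dir!(), "stopwords_sketch")

  # The module body runs while the module compiles, so this comprehension
  # reads every list file once and builds a map of language atom => MapSet.
  sets =
    for path <- Path.wildcard(Path.join(@dir, "*.txt")), into: %{} do
      # Ask the compiler to recompile this module if a list file changes.
      @external_resource path
      lang = path |> Path.basename(".txt") |> String.to_atom()
      words = path |> File.read!() |> String.split("\n", trim: true) |> MapSet.new()
      {lang, words}
    end

  @sets sets

  # Sorted list of the language codes found at compile time.
  def available_languages, do: @sets |> Map.keys() |> Enum.sort()

  # Returns the MapSet for a language, raising if none was loaded.
  def fetch(lang), do: Map.fetch!(@sets, lang)
end

IO.inspect(StopwordsSketch.available_languages())
IO.inspect(MapSet.member?(StopwordsSketch.fetch(:fr), "la"))
```

Because the sets are built during compilation, lookups at runtime never touch the filesystem.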

Languages

Every list is keyed by its ISO 639-1 two-letter code. Use available_languages/0 to enumerate the bundled set. Common codes include :en, :fr, :de, :es, :it, :pt, :ru, :zh, :ja, :ar, :nl, :sv, :fi, :da, :no, :pl, :tr, :ko, :hi, …

Composing lists

union/2 merges two bundled language sets, which is useful for code-mixed text or for layering in a small auxiliary list (e.g. an :emoticon-style set). extend/2 returns a new set with caller-supplied tokens added, so callers can layer in domain-specific stopwords (e.g. boilerplate, brand names) without rebuilding the whole list.
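Since both functions return ordinary MapSets, their results can be combined further with the standard MapSet API. The sketch below uses small hand-written sets as stand-ins for the bundled lists (the words and the token stream are illustrative assumptions) and mirrors what union/2 and extend/2 are described as returning.

```elixir
# Stand-ins for the MapSets that Text.Stopwords.for(:en) and
# Text.Stopwords.for(:fr) would return; the words are illustrative only.
en = MapSet.new(~w(the is of))
fr = MapSet.new(~w(le la et))

# Same shape as the result of union/2: every token from either list.
mixed = MapSet.union(en, fr)

# Same idea as extend/2, applied here to the merged set: add domain tokens.
custom = MapSet.union(mixed, MapSet.new(["acme", "lorem"]))

# Drop every token that appears in the combined set.
tokens = ~w(the acme rocket et la lune)
kept = Enum.reject(tokens, &MapSet.member?(custom, &1))
IO.inspect(kept)
# => ["rocket", "lune"]
```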

Licensing

The bundled .txt files are reproduced from stopwords-iso under the MIT license. See priv/stopwords/LICENSE for the upstream attribution. To regenerate the lists, run mix text.gen_stopwords.

Types

language()

@type language() :: atom()

An ISO 639-1 language code, atom-typed.

language_input()

@type language_input() :: atom() | String.t() | struct()

Anything Text.Language.normalize/1 accepts: an atom language tag, a BCP-47 string, or a Localize.LanguageTag struct.

Functions

available?(language)

@spec available?(language_input()) :: boolean()

Returns whether a stopword list is bundled for the given language.

Arguments

  • language is any value accepted by for/1.

Returns

  • true if a list is bundled, false otherwise.

Examples

iex> Text.Stopwords.available?(:en)
true

iex> Text.Stopwords.available?(:zz)
false

available_languages()

@spec available_languages() :: [language()]

Returns the sorted list of bundled language tags.

Returns

  • A list of atom language tags (ISO 639-1 codes).

Examples

iex> :en in Text.Stopwords.available_languages()
true

iex> :fr in Text.Stopwords.available_languages()
true

iex> Text.Stopwords.available_languages() |> length() > 50
true

contains?(language, token)

@spec contains?(language_input(), String.t()) :: boolean()

Returns whether token is in the bundled stopword list for a language.

Equivalent to MapSet.member?(Text.Stopwords.for(language), token), with the same input flexibility on the language argument.

Arguments

  • language is any value accepted by for/1.

  • token is a string. Matching is case-sensitive and the upstream lists are lowercase, so downcase the token first (e.g. with String.downcase/1) if you need case-insensitive matching.

Returns

  • true if the token is in the list, false otherwise.

Examples

iex> Text.Stopwords.contains?(:en, "the")
true

iex> Text.Stopwords.contains?(:en, "Zebra")
false

iex> Text.Stopwords.contains?(:fr, "le")
true
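Because matching is case-sensitive against lowercase lists, a capitalized token misses unless it is downcased first. A minimal sketch of that wrapper, using a hand-written set as a stand-in for the bundled English list (the is_stopword helper is hypothetical):

```elixir
# Stand-in for the bundled English set; the real list is much larger.
stopwords = MapSet.new(~w(the is of))

# Hypothetical helper: downcase the token before the membership check.
is_stopword = fn token -> MapSet.member?(stopwords, String.downcase(token)) end

IO.inspect(MapSet.member?(stopwords, "The"))
# => false (raw lookup is case-sensitive)
IO.inspect(is_stopword.("The"))
# => true
IO.inspect(is_stopword.("Zebra"))
# => false
```

String.downcase/1 uses default Unicode case mapping; some languages may need a different mode or normalization step, which this sketch does not cover.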

extend(language, extra)

Returns a stopword set augmented with extra tokens.

Lets callers layer domain-specific stopwords (e.g. brand names, boilerplate, navigation chrome) on top of the bundled list without mutating the bundled set.

Arguments

  • language is any value accepted by for/1.

  • extra is a list, MapSet, or any enumerable of strings to add.

Returns

  • A MapSet containing the bundled set plus extra.

Examples

iex> set = Text.Stopwords.extend(:en, ["acme", "lorem"])
iex> MapSet.member?(set, "the") and MapSet.member?(set, "acme")
true

for(language)

@spec for(language_input()) :: MapSet.t(String.t())

Returns the bundled stopword MapSet for a language.

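Arguments

  • language is an atom language tag, a BCP-47 string, or a Localize.LanguageTag struct (see language_input()).

Returns

  • A MapSet of the stopword strings bundled for the language.
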
Examples

iex> Text.Stopwords.for(:en) |> MapSet.member?("the")
true

iex> Text.Stopwords.for("fr") |> MapSet.member?("le")
true

iex> Text.Stopwords.for(:en) |> MapSet.member?("zebra")
false
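A typical use of the returned MapSet is to drop stopwords before counting term frequencies. The sketch below assumes a hand-written set in place of the bundled English list and a naive whitespace tokenizer, neither of which comes from this library.

```elixir
# Stand-in for the set returned by for/1; illustrative words only.
stopwords = MapSet.new(~w(the is of on and))

text = "The cat sat on the mat and the cat slept"

counts =
  text
  |> String.downcase()
  # Naive whitespace tokenizer, used here only to keep the sketch short.
  |> String.split()
  # Keep only tokens that are not in the stopword set.
  |> Enum.reject(&MapSet.member?(stopwords, &1))
  |> Enum.frequencies()

IO.inspect(counts)
# => %{"cat" => 2, "mat" => 1, "sat" => 1, "slept" => 1}
```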

union(first, second)

Returns the union of two bundled stopword sets.

Useful for code-mixed input (e.g. a French-English document) or for layering in a small auxiliary set (e.g. an :emoticon-style list).

Arguments

  • first and second are anything accepted by for/1.

Returns

  • A MapSet containing every token from either list.

Examples

iex> set = Text.Stopwords.union(:en, :fr)
iex> MapSet.member?(set, "the") and MapSet.member?(set, "le")
true