Bundled multilingual stopword lists.
Stopwords are the high-frequency function words that carry little
topical content (the, is, of in English; le, la, et in
French; …). They are routinely filtered out of text before frequency
analysis, keyword extraction, or other content-focused processing.
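As a rough illustration of that filtering step, the sketch below drops stopwords before counting token frequencies. The module name StopwordFilterSketch, the naive tokenizer, and the three-word stopword set are assumptions made for the example; they stand in for the bundled lists and are not part of this module's API.

```elixir
defmodule StopwordFilterSketch do
  # Hypothetical stand-in for a bundled list; real lists are far larger.
  @stopwords MapSet.new(["the", "is", "of"])

  # Lowercase the text, split on runs of non-letter/non-digit characters,
  # drop stopwords, then count what is left.
  def content_frequencies(text, stopwords \\ @stopwords) do
    text
    |> String.downcase()
    |> String.split(~r/[^\p{L}\p{N}]+/u, trim: true)
    |> Enum.reject(&MapSet.member?(stopwords, &1))
    |> Enum.frequencies()
  end
end

StopwordFilterSketch.content_frequencies("The colour of the sky is the colour of water")
# => %{"colour" => 2, "sky" => 1, "water" => 1}
```

The tokenizer here is deliberately simple (it splits on apostrophes and hyphens, for instance); a real pipeline would likely use a language-aware tokenizer before the stopword lookup.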
This module ships the stopwords-iso
collection — a community-curated set of stopword lists covering
~60 languages, distributed under the MIT license. The raw .txt
files live in priv/stopwords/<lang>.txt; this module loads them
at compile time into per-language MapSets and exposes a small
query API.
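The compile-time loading described above could look roughly like the following sketch. It is not this module's actual source: the module names StopwordParserSketch and StopwordLoaderSketch, the relative path lookup, and the one-token-per-line file format are all assumptions for illustration.

```elixir
defmodule StopwordParserSketch do
  # Turn the contents of one <lang>.txt file into a MapSet.
  # Assumes one token per line; blank lines and surrounding spaces are dropped.
  def parse(text) do
    text
    |> String.split(["\r\n", "\n"], trim: true)
    |> Enum.map(&String.trim/1)
    |> Enum.reject(&(&1 == ""))
    |> MapSet.new()
  end
end

defmodule StopwordLoaderSketch do
  # Code outside `def` runs while this module is being compiled.
  # Hypothetical location, resolved relative to the current directory.
  @paths Path.wildcard("priv/stopwords/*.txt")

  # Ask the compiler to rebuild this module when a list file changes.
  for path <- @paths do
    @external_resource path
  end

  # Build %{lang => MapSet} once, at compile time.
  @lists Map.new(@paths, fn path ->
           lang = path |> Path.basename(".txt") |> String.to_atom()
           {lang, StopwordParserSketch.parse(File.read!(path))}
         end)

  def languages, do: @lists |> Map.keys() |> Enum.sort()
  def fetch(lang), do: Map.fetch(@lists, lang)
end
```

Because the map is baked into the compiled module, lookups at runtime involve no file I/O; the trade-off is that changing a list requires recompiling.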
Languages
Every list is keyed by its ISO 639-1 two-letter code. Use
available_languages/0 to enumerate the bundled set. Common codes
include :en, :fr, :de, :es, :it, :pt, :ru, :zh,
:ja, :ar, :nl, :sv, :fi, :da, :no, :pl, :tr,
:ko, :hi, …
Composing lists
union/2 merges two language sets — useful for code-mixed text or
for adding the emoticon-equivalent token list. extend/2 returns
a new set with caller-supplied tokens added; that's how callers
layer in domain-specific stopwords (e.g. boilerplate, brand names)
without having to rebuild the whole list.
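In plain MapSet terms, both operations amount to a set union. The sketch below mirrors that idea with two miniature lists; the module name StopwordComposeSketch and the list contents are hypothetical, and the real functions also accept strings and language-tag structs, which this sketch does not handle.

```elixir
defmodule StopwordComposeSketch do
  # Hypothetical miniature lists; the bundled ones are far larger.
  @lists %{
    en: MapSet.new(["the", "is", "of"]),
    fr: MapSet.new(["le", "la", "et"])
  }

  # Merge two language sets into one.
  def union(first, second) do
    MapSet.union(Map.fetch!(@lists, first), Map.fetch!(@lists, second))
  end

  # Add caller-supplied tokens. The stored set is untouched because
  # MapSets are immutable; a new set is returned.
  def extend(lang, extra) do
    MapSet.union(Map.fetch!(@lists, lang), MapSet.new(extra))
  end
end

mixed = StopwordComposeSketch.union(:en, :fr)
MapSet.member?(mixed, "le")
# => true
```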
Licensing
The bundled .txt files are reproduced from stopwords-iso under
the MIT license. See priv/stopwords/LICENSE for the upstream
attribution. To regenerate the files, run mix text.gen_stopwords.
Types
@type language() :: atom()
An ISO 639-1 language code, atom-typed.
The language_input() type covers anything Text.Language.normalize/1
accepts: an atom language tag, a BCP-47 string, or a
Localize.LanguageTag struct.
Functions
@spec available?(language_input()) :: boolean()
Returns whether a stopword list is bundled for the given language.
Arguments
- language is any value accepted by for/1.
Returns
- true if a list is bundled, false otherwise.
Examples
iex> Text.Stopwords.available?(:en)
true
iex> Text.Stopwords.available?(:zz)
false
@spec available_languages() :: [language()]
Returns the sorted list of bundled language tags.
Returns
- A list of atom language tags (ISO 639-1 codes).
Examples
iex> :en in Text.Stopwords.available_languages()
true
iex> :fr in Text.Stopwords.available_languages()
true
iex> Text.Stopwords.available_languages() |> length() > 50
true
@spec contains?(language_input(), String.t()) :: boolean()
Returns whether token is in the bundled stopword list for a language.
Equivalent to MapSet.member?(Text.Stopwords.for(language), token),
with the same input flexibility on the language argument.
Arguments
- language is any value accepted by for/1.
- token is a string. Comparison is case-sensitive against the lowercased upstream lists; pass a folded token if you need case-insensitive matching.
Returns
- true if the token is in the list, false otherwise.
Examples
iex> Text.Stopwords.contains?(:en, "the")
true
iex> Text.Stopwords.contains?(:en, "Zebra")
false
iex> Text.Stopwords.contains?(:fr, "le")
true
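Since the lookup is case-sensitive, a caller who wants case-insensitive matching can fold the token first. A minimal sketch, assuming a plain MapSet of lowercased tokens; the helper name member_ci?/2 and the module StopwordFoldSketch are hypothetical, not part of this module:

```elixir
defmodule StopwordFoldSketch do
  # Fold case before the lookup so "The" and "THE" match a lowercased list.
  def member_ci?(stopwords, token) do
    MapSet.member?(stopwords, String.downcase(token))
  end
end

stopwords = MapSet.new(["the", "is", "of"])

StopwordFoldSketch.member_ci?(stopwords, "The")
# => true
StopwordFoldSketch.member_ci?(stopwords, "Zebra")
# => false
```

String.downcase/1 uses Unicode default case mapping. Some languages have special rules (String.downcase/2 accepts a :turkic mode, for example), so the right folding may depend on the language being checked.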
@spec extend(language_input(), Enumerable.t(String.t())) :: MapSet.t(String.t())
Returns a stopword set augmented with extra tokens.
Lets callers layer domain-specific stopwords (e.g. brand names, boilerplate, navigation chrome) on top of the bundled list without mutating the bundled set.
Arguments
- language is any value accepted by for/1.
- extra is a list, MapSet, or any enumerable of strings to add.
Returns
- A MapSet containing the bundled set plus extra.
Examples
iex> set = Text.Stopwords.extend(:en, ["acme", "lorem"])
iex> MapSet.member?(set, "the") and MapSet.member?(set, "acme")
true
@spec for(language_input()) :: MapSet.t(String.t())
Returns the bundled stopword MapSet for a language.
Arguments
- language is an atom language tag (:en, :fr, …), a string BCP-47 tag ("en", "en-US"), or a Localize.LanguageTag struct. The input is normalised via Text.Language.normalize/1 before lookup.
Returns
- A MapSet of lowercased stopword strings.
- Raises ArgumentError if no list is bundled for the resolved language. Use available_languages/0 to enumerate the supported set.
Examples
iex> Text.Stopwords.for(:en) |> MapSet.member?("the")
true
iex> Text.Stopwords.for("fr") |> MapSet.member?("le")
true
iex> Text.Stopwords.for(:en) |> MapSet.member?("zebra")
false
@spec union(language_input(), language_input()) :: MapSet.t(String.t())
Returns the union of two bundled stopword sets.
Useful for code-mixed input (e.g. a French-English document) or for
layering in a small auxiliary set (an :emoticon-style list).
Arguments
- first and second are anything accepted by for/1.
Returns
- A MapSet containing every token from either list.
Examples
iex> set = Text.Stopwords.union(:en, :fr)
iex> MapSet.member?(set, "the") and MapSet.member?(set, "le")
true