Gibran.Tokeniser
Provides functions for converting a string into a list of tokens using different strategies.
Functions
Takes a string and splits it into a list of tokens using a regular expression. If a regular expression is not provided, it defaults to @token_regexp.
iex> Gibran.Tokeniser.tokenise("The Prophet")
["the", "prophet"]
The default regular expression ignores punctuation, but accounts for apostrophes and compound words.
iex> Gibran.Tokeniser.tokenise("Prophet, The")
["prophet", "the"]
iex> Gibran.Tokeniser.tokenise("Al-Ajniha al-Mutakassira")
["al-ajniha", "al-mutakassira"]
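Because apostrophes are accounted for, a possessive or contracted word stays a single token. (The input string below is illustrative; the expected output follows from the documented behaviour of the default regular expression.)
iex> Gibran.Tokeniser.tokenise("The Prophet's Eye")
["the", "prophet's", "eye"]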
The tokeniser normalises input by downcasing all tokens.
iex> Gibran.Tokeniser.tokenise("THE PROPHET")
["the", "prophet"]
Options
:pattern - A regular expression used to tokenise the input. It defaults to @token_regexp.
:exclude - A filter that is applied to the string after it has been tokenised. It can be a function, a string, a regular expression, or a list combining any of those types.
Examples
Using :pattern
iex> Gibran.Tokeniser.tokenise("Broken Wings, 1912", pattern: ~r/\,/)
["broken wings", " 1912"]
Using :exclude with a function.
iex> Gibran.Tokeniser.tokenise("Kingdom of the Imagination", exclude: &(String.length(&1) < 10))
["imagination"]
Using :exclude with a regular expression.
iex> Gibran.Tokeniser.tokenise("Sand and Foam", exclude: ~r/and/)
["foam"]
Using :exclude with a string.
iex> Gibran.Tokeniser.tokenise("Eye of The Prophet", exclude: "eye of")
["the", "prophet"]
Using :exclude with a list combining several types.
iex> Gibran.Tokeniser.tokenise("Eye of The Prophet", exclude: ["eye", &(String.ends_with?(&1, "he")), ~r/of/])
["prophet"]
Using :exclude with a list of strings.
iex> Gibran.Tokeniser.tokenise("Eye of The Prophet", exclude: ["eye", "of"])
["the", "prophet"]