Gibran.Tokeniser

Provides functions for converting a string into a list of tokens using configurable strategies.

Functions

tokenise(input, opts \\ [])

Takes a string and splits it into a list of tokens using a regular expression. If a regular expression is not provided it defaults to @token_regexp.

iex> Gibran.Tokeniser.tokenise("The Prophet")
["the", "prophet"]

The default regular expression ignores punctuation, but accounts for apostrophes and compound words.

iex> Gibran.Tokeniser.tokenise("Prophet, The")
["prophet", "the"]

iex> Gibran.Tokeniser.tokenise("Al-Ajniha al-Mutakassira")
["al-ajniha", "al-mutakassira"]

The tokeniser normalises input by downcasing all tokens.

iex> Gibran.Tokeniser.tokenise("THE PROPHET")
["the", "prophet"]

Options

  • :pattern - A regular expression used to tokenise the input. Defaults to @token_regexp.
  • :exclude - A filter applied to the token list after the input has been tokenised. It can be a function, a string, a regular expression, or a list combining any of those types.

Examples

Using :pattern

iex> Gibran.Tokeniser.tokenise("Broken Wings, 1912", pattern: ~r/\,/)
["broken wings", " 1912"]

Using :exclude with a function

iex> Gibran.Tokeniser.tokenise("Kingdom of the Imagination", exclude: &(String.length(&1) < 10))
["imagination"]

Using :exclude with a regular expression

iex> Gibran.Tokeniser.tokenise("Sand and Foam", exclude: ~r/and/)
["foam"]

Using :exclude with a string

iex> Gibran.Tokeniser.tokenise("Eye of The Prophet", exclude: "eye of")
["the", "prophet"]

Using :exclude with a mixed list of types

iex> Gibran.Tokeniser.tokenise("Eye of The Prophet", exclude: ["eye", &(String.ends_with?(&1, "he")), ~r/of/])
["prophet"]

Using :exclude with a list of strings

iex> Gibran.Tokeniser.tokenise("Eye of The Prophet", exclude: ["eye", "of"])
["the", "prophet"]