TantivyEx.Tokenizer (TantivyEx v0.4.1)
Provides comprehensive tokenization functionality for TantivyEx.
This module allows you to register and use various types of tokenizers including:
- Simple and whitespace tokenizers
- Regex-based tokenizers
- N-gram tokenizers
- Text analyzers with filters (lowercase, stop words, stemming)
- Language-specific stemmers
- Pre-tokenized text support
Basic Usage
iex> TantivyEx.Tokenizer.register_default_tokenizers()
"Default tokenizers registered successfully"
iex> TantivyEx.Tokenizer.tokenize_text("simple", "Hello World!")
["hello", "world"]Advanced Usage
# Register a custom regex tokenizer
iex> TantivyEx.Tokenizer.register_regex_tokenizer("email", "\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Z|a-z]{2,}\\b")
"Regex tokenizer 'email' registered successfully"
# Register a text analyzer with multiple filters
iex> TantivyEx.Tokenizer.register_text_analyzer(
...> "en_full",
...> "simple",
...> true, # lowercase
...> "en", # stop words
...> "en", # stemming
...> 40 # remove long tokens threshold
...> )
{:ok, "Text analyzer 'en_full' registered successfully"}Supported Languages
The following languages are supported for stop words and stemming:
- English (en)
- French (fr)
- German (de)
- Spanish (es)
- Italian (it)
- Portuguese (pt)
- Russian (ru)
- Japanese (ja)
- Korean (ko)
- Arabic (ar)
- Hindi (hi)
- Chinese (zh)
- Danish (da)
- Dutch (nl)
- Finnish (fi)
- Hungarian (hu)
- Norwegian (no)
- Romanian (ro)
- Swedish (sv)
- Tamil (ta)
- Turkish (tr)
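As a convenience, the sketch below (not a doctest) registers analyzers for a handful of these languages in one pass using register_language_analyzer/2, documented later on this page; the exact return values may differ from what the comments suggest.
# Sketch: register full text analyzers for several supported languages.
# Each call is expected to register a "<lang>_text" analyzer, per the
# register_language_analyzer/2 examples below.
for lang <- ["en", "de", "fr", "es"] do
  TantivyEx.Tokenizer.register_language_analyzer(lang)
end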
Summary
Functions
Test tokenization performance with a given tokenizer and text.
Get a list of all registered tokenizers.
Process pre-tokenized text.
Register default tokenizers with sensible configurations.
Register a language-specific text analyzer with stop words and stemming.
Register an N-gram tokenizer.
Register a regex-based tokenizer.
Register a simple tokenizer.
Register a language-specific stemming tokenizer.
Register a comprehensive text analyzer with multiple filters.
Register a whitespace tokenizer.
Tokenize text using a registered tokenizer.
Tokenize text and return detailed token information including positions.
Types
@type detailed_tokens() :: [{String.t(), non_neg_integer(), non_neg_integer()}]
@type tokenizer_name() :: String.t()
@type tokens() :: [String.t()]
Functions
@spec benchmark_tokenizer(tokenizer_name(), String.t(), pos_integer()) :: {tokens(), number()}
Test tokenization performance with a given tokenizer and text.
Returns timing information along with the tokens.
Parameters
- tokenizer_name: Name of the registered tokenizer
- text: Text to tokenize
- iterations: Number of iterations to run (default: 1000)
Examples
iex> TantivyEx.Tokenizer.register_default_tokenizers()
iex> {tokens, microseconds} = TantivyEx.Tokenizer.benchmark_tokenizer("simple", "Hello World", 100)
iex> is_list(tokens) and is_number(microseconds)
true
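Beyond the doctest above, a rough comparison sketch (assuming the second tuple element is the elapsed time in microseconds across all iterations):
# Sketch: compare two registered tokenizers on the same input.
text = "The quick brown fox jumps over the lazy dog"

for name <- ["simple", "whitespace"] do
  {_tokens, micros} = TantivyEx.Tokenizer.benchmark_tokenizer(name, text, 1_000)
  IO.puts("#{name}: #{micros} microseconds for 1000 runs")
end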
@spec list_tokenizers() :: [String.t()]
Get a list of all registered tokenizers.
Examples
iex> TantivyEx.Tokenizer.register_default_tokenizers()
iex> tokenizers = TantivyEx.Tokenizer.list_tokenizers()
iex> "default" in tokenizers
true
iex> "en_stem" in tokenizers
true
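One way to use the returned list is to avoid re-registering a name, as in this sketch:
# Sketch: register a custom tokenizer only if its name is not taken yet.
tokenizers = TantivyEx.Tokenizer.list_tokenizers()

if "my_simple" not in tokenizers do
  TantivyEx.Tokenizer.register_simple_tokenizer("my_simple")
end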
Process pre-tokenized text.
Useful when you have already tokenized text and want to pass it to Tantivy.
Parameters
tokens: List of pre-tokenized strings
Examples
iex> tokens = ["hello", "world", "test"]
iex> TantivyEx.Tokenizer.process_pre_tokenized_text(tokens)
"PreTokenizedString(["hello", "world", "test"])"
@spec register_default_tokenizers() :: String.t()
Register default tokenizers with sensible configurations.
This registers commonly used tokenizers including:
"default","simple","whitespace","raw"- Language-specific stemmers:
"en_stem","fr_stem", etc. - English text analyzer:
"en_text"(lowercase + stop words + stemming)
Examples
iex> TantivyEx.Tokenizer.register_default_tokenizers()
"Default tokenizers registered successfully"
@spec register_language_analyzer(String.t(), pos_integer() | nil) :: tokenizer_result()
Register a language-specific text analyzer with stop words and stemming.
This creates a comprehensive text analyzer for the specified language including lowercasing, stop word removal, and stemming.
Parameters
- language: Language code (e.g., "en", "fr", "de")
- remove_long_threshold: Optional threshold for long word removal (nil to disable, integer for custom threshold)
Examples
iex> TantivyEx.Tokenizer.register_language_analyzer("en")
"Text analyzer 'en_text' registered successfully"
iex> TantivyEx.Tokenizer.register_language_analyzer("de")
"Text analyzer 'de_text' registered successfully"
iex> TantivyEx.Tokenizer.register_language_analyzer("en", 60)
"Text analyzer 'en_text' registered successfully"
@spec register_ngram_tokenizer(tokenizer_name(), pos_integer(), pos_integer(), boolean()) :: tokenizer_result()
Register an N-gram tokenizer.
N-gram tokenizers generate character or word n-grams of specified lengths.
Parameters
- name: Name to register the tokenizer under
- min_gram: Minimum n-gram length
- max_gram: Maximum n-gram length
- prefix_only: If true, only generate n-grams from the beginning of tokens
Examples
# Character bigrams and trigrams
iex> TantivyEx.Tokenizer.register_ngram_tokenizer("char_2_3", 2, 3, false)
"N-gram tokenizer 'char_2_3' registered successfully"
# Prefix-only trigrams
iex> TantivyEx.Tokenizer.register_ngram_tokenizer("prefix_3", 3, 3, true)
"N-gram tokenizer 'prefix_3' registered successfully"
@spec register_regex_tokenizer(tokenizer_name(), String.t()) :: tokenizer_result()
Register a regex-based tokenizer.
Regex tokenizers split text based on a regular expression pattern.
Parameters
- name: Name to register the tokenizer under
- pattern: Regular expression pattern for tokenization
Examples
# Split on any non-alphanumeric character
iex> TantivyEx.Tokenizer.register_regex_tokenizer("alphanum", "[^a-zA-Z0-9]+")
"Regex tokenizer 'alphanum' registered successfully"
# Extract email addresses
iex> TantivyEx.Tokenizer.register_regex_tokenizer("email", "\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Z|a-z]{2,}\\b")
"Regex tokenizer 'email' registered successfully"
@spec register_simple_tokenizer(tokenizer_name()) :: tokenizer_result()
Register a simple tokenizer.
Simple tokenizers split text on whitespace and punctuation, converting to lowercase.
Parameters
name: Name to register the tokenizer under
Examples
iex> TantivyEx.Tokenizer.register_simple_tokenizer("my_simple")
"Simple tokenizer 'my_simple' registered successfully"
@spec register_stemming_tokenizer(String.t(), pos_integer() | nil) :: tokenizer_result()
Register a language-specific stemming tokenizer.
This is a convenience function that creates a text analyzer with lowercasing and stemming for the specified language.
Parameters
- language: Language code (e.g., "en", "fr", "de")
- remove_long_threshold: Optional threshold for long word removal (nil to disable, integer for custom threshold)
Examples
iex> TantivyEx.Tokenizer.register_stemming_tokenizer("en")
"Text analyzer 'en_stem' registered successfully"
iex> TantivyEx.Tokenizer.register_stemming_tokenizer("fr")
"Text analyzer 'fr_stem' registered successfully"
iex> TantivyEx.Tokenizer.register_stemming_tokenizer("en", 50)
"Text analyzer 'en_stem' registered successfully"
@spec register_text_analyzer(tokenizer_name(), String.t(), boolean(), String.t() | nil, String.t() | nil, pos_integer() | nil) :: tokenizer_result()
Register a comprehensive text analyzer with multiple filters.
Text analyzers chain together a base tokenizer with various token filters.
Parameters
- name: Name to register the text analyzer under
- base_tokenizer: Base tokenizer ("simple" or "whitespace")
- lowercase: Whether to apply lowercase filter
- stop_words_language: Language for stop words filter (nil to disable)
- stemming_language: Language for stemming filter (nil to disable)
- remove_long_threshold: Custom threshold for long word removal (nil to disable, integer for custom threshold)
Examples
# Full English text analyzer with default 40-character threshold
iex> TantivyEx.Tokenizer.register_text_analyzer(
...> "en_full",
...> "simple",
...> true,
...> "en",
...> "en",
...> 40
...> )
{:ok, "Text analyzer 'en_full' registered successfully"}
# Custom threshold of 50 characters
iex> TantivyEx.Tokenizer.register_text_analyzer(
...> "custom_threshold",
...> "simple",
...> true,
...> "en",
...> "en",
...> 50
...> )
{:ok, "Text analyzer 'custom_threshold' registered successfully"}
# Disable long word filtering entirely
iex> TantivyEx.Tokenizer.register_text_analyzer(
...> "no_long_filter",
...> "simple",
...> true,
...> "en",
...> "en",
...> nil
...> )
{:ok, "Text analyzer 'no_long_filter' registered successfully"}
# French analyzer with stop words only
iex> TantivyEx.Tokenizer.register_text_analyzer(
...> "fr_stop",
...> "simple",
...> true,
...> "fr",
...> nil,
...> nil
...> )
{:ok, "Text analyzer 'fr_stop' registered successfully"}
@spec register_whitespace_tokenizer(tokenizer_name()) :: tokenizer_result()
Register a whitespace tokenizer.
Whitespace tokenizers split text only on whitespace characters.
Parameters
name: Name to register the tokenizer under
Examples
iex> TantivyEx.Tokenizer.register_whitespace_tokenizer("whitespace_only")
"Whitespace tokenizer 'whitespace_only' registered successfully"
@spec tokenize_text(tokenizer_name(), String.t()) :: tokens()
Tokenize text using a registered tokenizer.
Parameters
- tokenizer_name: Name of the registered tokenizer
- text: Text to tokenize
Examples
iex> TantivyEx.Tokenizer.register_default_tokenizers()
iex> TantivyEx.Tokenizer.tokenize_text("simple", "Hello, World!")
["hello", "world"]
iex> TantivyEx.Tokenizer.tokenize_text("en_stem", "running quickly")
["run", "quickli"]
@spec tokenize_text_detailed(tokenizer_name(), String.t()) :: detailed_tokens()
Tokenize text and return detailed token information including positions.
Returns tuples of {token, start_offset, end_offset}.
Parameters
- tokenizer_name: Name of the registered tokenizer
- text: Text to tokenize
Examples
iex> TantivyEx.Tokenizer.register_default_tokenizers()
iex> TantivyEx.Tokenizer.tokenize_text_detailed("simple", "Hello World")
[{"hello", 0, 5}, {"world", 6, 11}]