ExNlp.Tokenizer (ex_nlp v0.1.0)
Unified API for text tokenization, inspired by NLTK's tokenization package.
This module provides convenient top-level functions for common tokenization tasks,
similar to NLTK's word_tokenize(), wordpunct_tokenize(), etc.
For more control, use the specific tokenizer modules directly:
ExNlp.Tokenizer.Standard - Standard word tokenizer
ExNlp.Tokenizer.Whitespace - Whitespace-only tokenizer
ExNlp.Tokenizer.Regex - Regex-based tokenizer
ExNlp.Tokenizer.Ngram - Character n-gram tokenizer
ExNlp.Tokenizer.Keyword - Keyword (whole-text) tokenizer
Examples
# Quick tokenization - returns just text strings (like NLTK)
iex> ExNlp.Tokenizer.word_tokenize("Hello, world!")
["Hello", "world"]
# Full tokenization - returns tokens with offsets
iex> ExNlp.Tokenizer.tokenize("Hello, world!")
[
%ExNlp.Token{text: "Hello", position: 0, start_offset: 0, end_offset: 5},
%ExNlp.Token{text: "world", position: 1, start_offset: 7, end_offset: 12}
]
# Get spans (offsets) like NLTK's span_tokenize
iex> ExNlp.Tokenizer.span_tokenize("Hello, world!")
[{0, 5}, {7, 12}]
# Use specific tokenizers
iex> ExNlp.Tokenizer.Whitespace.tokenize("Hello, world!")
[
%ExNlp.Token{text: "Hello,", position: 0, start_offset: 0, end_offset: 6},
%ExNlp.Token{text: "world!", position: 1, start_offset: 7, end_offset: 13}
]
Reference: https://www.nltk.org/api/nltk.tokenize.html
Summary
Functions
Tokenizes using the keyword tokenizer (treats the entire input as a single token).
Tokenizes using the keyword tokenizer and returns just the text string.
Tokenizes using the n-gram tokenizer.
Tokenizes using the n-gram tokenizer and returns just the text strings.
Tokenizes using the regex tokenizer.
Tokenizes text using a regex pattern.
Returns spans (start_offset, end_offset) for tokens.
Tokenizes using the standard tokenizer.
Tokenizes text using the standard tokenizer (default).
Tokenizes using the whitespace tokenizer.
Tokenizes text and returns just the text strings (no offsets).
Tokenizes text using the whitespace-only tokenizer.
Types
@type span() :: ExNlp.Tokenizer.Base.span()
@type token() :: ExNlp.Tokenizer.Base.token()
Functions
Tokenizes using the keyword tokenizer (treats the entire input as a single token).
Tokenizes using the keyword tokenizer and returns just the text string.
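Neither variant carries a doctest on this page; the sketch below assumes the top-level functions are named keyword/1 and keyword_text/1 (names inferred from the ngram/ngram_text pairing, not confirmed here):

```elixir
# Hypothetical names: keyword/1 and keyword_text/1 are inferred, not confirmed.
# The keyword tokenizer treats the entire input as one token, so the single
# token should span the whole string (exclusive end_offset, as in the
# tokenize/1 doctest below).
iex> ExNlp.Tokenizer.keyword("Hello, world!")
[%ExNlp.Token{text: "Hello, world!", position: 0, start_offset: 0, end_offset: 13}]

iex> ExNlp.Tokenizer.keyword_text("Hello, world!")
"Hello, world!"
```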
@spec ngram(String.t(), pos_integer(), pos_integer()) :: [token()]
Tokenizes using the n-gram tokenizer.
Arguments
text - The text to tokenize
min_gram - Minimum n-gram size (default: 2)
max_gram - Maximum n-gram size (default: 3)
@spec ngram_text(String.t(), pos_integer(), pos_integer()) :: [String.t()]
Tokenizes using the n-gram tokenizer and returns just the text strings.
Arguments
text - The text to tokenize
min_gram - Minimum n-gram size (default: 2)
max_gram - Maximum n-gram size (default: 3)
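A hedged sketch of the call shape, using the arity fixed by the specs above; the order in which the tokenizer emits the grams is not documented here, so no expected output is shown:

```elixir
# Character n-grams of sizes 2..3. For "hey" the grams are "he", "ey",
# and "hey"; the emission order is an assumption, so the result line
# is deliberately omitted.
iex> ExNlp.Tokenizer.ngram_text("hey", 2, 3)
```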
Tokenizes using the regex tokenizer.
Arguments
text - The text to tokenize
pattern - A regex pattern or string to match tokens
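No doctest accompanies this entry; a minimal sketch, assuming the token-returning variant is named regex/2 (the name is inferred from the ExNlp.Tokenizer.Regex module and is not confirmed on this page):

```elixir
# Hypothetical name: regex/2 is inferred, not confirmed. Offsets follow the
# exclusive end_offset convention of the tokenize/1 doctest above.
iex> ExNlp.Tokenizer.regex("Hello123 world456", ~r/\w+/)
[
  %ExNlp.Token{text: "Hello123", position: 0, start_offset: 0, end_offset: 8},
  %ExNlp.Token{text: "world456", position: 1, start_offset: 9, end_offset: 17}
]
```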
Tokenizes text using a regex pattern.
Similar to NLTK's regexp_tokenize() function.
Examples
iex> ExNlp.Tokenizer.regexp_tokenize("Hello123 world456", "\\w+")
["Hello123", "world456"]
iex> ExNlp.Tokenizer.regexp_tokenize("abc def ghi", "[a-c]+")
["abc"]
Returns spans (start_offset, end_offset) for tokens.
Similar to NLTK's span_tokenize() method. Useful for aligning tokens
with the original text.
Examples
iex> ExNlp.Tokenizer.span_tokenize("Hello, world!")
[{0, 5}, {7, 12}]
Tokenizes using the standard tokenizer.
Tokenizes text using the standard tokenizer (default).
Similar to NLTK's word_tokenize(): splits on whitespace and punctuation.
This is the recommended general-purpose tokenizer.
Examples
iex> ExNlp.Tokenizer.tokenize("Hello, world!")
[
%ExNlp.Token{text: "Hello", position: 0, start_offset: 0, end_offset: 5},
%ExNlp.Token{text: "world", position: 1, start_offset: 7, end_offset: 12}
]
Tokenizes using the whitespace tokenizer.
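A hedged sketch, assuming this function delegates to ExNlp.Tokenizer.Whitespace.tokenize/1 from the module doc above (the delegation and the whitespace/1 name are assumptions; the output mirrors that doctest):

```elixir
# Assumes whitespace/1 wraps ExNlp.Tokenizer.Whitespace.tokenize/1;
# expected output copied from the Whitespace doctest in the module doc.
iex> ExNlp.Tokenizer.whitespace("Hello, world!")
[
  %ExNlp.Token{text: "Hello,", position: 0, start_offset: 0, end_offset: 6},
  %ExNlp.Token{text: "world!", position: 1, start_offset: 7, end_offset: 13}
]
```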
Tokenizes text and returns just the text strings (no offsets).
Similar to NLTK's word_tokenize(), which returns a list of strings.
This is a convenience wrapper around tokenize/1.
Examples
iex> ExNlp.Tokenizer.word_tokenize("Hello, world!")
["Hello", "world"]
Tokenizes text using the whitespace-only tokenizer.
Preserves punctuation within tokens. Similar to NLTK's WhitespaceTokenizer.
Examples
iex> ExNlp.Tokenizer.wordpunct_tokenize("Hello, world!")
["Hello,", "world!"]