ExNlp.Tokenizer (ex_nlp v0.1.0)

Unified API for text tokenization, inspired by NLTK's tokenization package.

This module provides convenient top-level functions for common tokenization tasks, similar to NLTK's word_tokenize(), wordpunct_tokenize(), etc.

For more control, use the specific tokenizer modules (such as ExNlp.Tokenizer.Whitespace or ExNlp.Tokenizer.Regex) directly.

Examples

# Quick tokenization - returns just text strings (like NLTK)
iex> ExNlp.Tokenizer.word_tokenize("Hello, world!")
["Hello", "world"]

# Full tokenization - returns tokens with offsets
iex> ExNlp.Tokenizer.tokenize("Hello, world!")
[
  %ExNlp.Token{text: "Hello", position: 0, start_offset: 0, end_offset: 5},
  %ExNlp.Token{text: "world", position: 1, start_offset: 7, end_offset: 12}
]

# Get spans (offsets) like NLTK's span_tokenize
iex> ExNlp.Tokenizer.span_tokenize("Hello, world!")
[{0, 5}, {7, 12}]

# Use specific tokenizers
iex> ExNlp.Tokenizer.Whitespace.tokenize("Hello, world!")
[
  %ExNlp.Token{text: "Hello,", position: 0, start_offset: 0, end_offset: 6},
  %ExNlp.Token{text: "world!", position: 1, start_offset: 7, end_offset: 13}
]

Reference: https://www.nltk.org/api/nltk.tokenize.html

Summary

Functions

keyword(text)
Tokenizes using the keyword tokenizer (treats the entire input as a single token).

keyword_text(text)
Tokenizes using the keyword tokenizer and returns just the text string.

ngram(text, min_gram \\ 2, max_gram \\ 3)
Tokenizes using the n-gram tokenizer.

ngram_text(text, min_gram \\ 2, max_gram \\ 3)
Tokenizes using the n-gram tokenizer and returns just the text strings.

regex(text, pattern)
Tokenizes using the regex tokenizer.

regexp_tokenize(text, pattern)
Tokenizes text using a regex pattern.

span_tokenize(text)
Returns spans (start_offset, end_offset) for tokens.

standard(text)
Tokenizes using the standard tokenizer.

tokenize(text)
Tokenizes text using the standard tokenizer (default).

whitespace(text)
Tokenizes using the whitespace tokenizer.

word_tokenize(text)
Tokenizes text and returns just the text strings (no offsets).

wordpunct_tokenize(text)
Tokenizes text using the whitespace-only tokenizer.

Types

span()

@type span() :: ExNlp.Tokenizer.Base.span()

token()

@type token() :: ExNlp.Tokenizer.Base.token()

Functions

keyword(text)

@spec keyword(String.t()) :: [token()]

Tokenizes using the keyword tokenizer (treats the entire input as a single token).
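
A minimal sketch of the expected shape, assuming the single token covers the whole input with offsets in the style of the module examples above:

iex> ExNlp.Tokenizer.keyword("Hello, world!")
[%ExNlp.Token{text: "Hello, world!", position: 0, start_offset: 0, end_offset: 13}]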

keyword_text(text)

@spec keyword_text(String.t()) :: [String.t()]

Tokenizes using the keyword tokenizer and returns just the text string.
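
For illustration, assuming the same single-token behavior as keyword/1:

iex> ExNlp.Tokenizer.keyword_text("Hello, world!")
["Hello, world!"]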

ngram(text, min_gram \\ 2, max_gram \\ 3)

@spec ngram(String.t(), pos_integer(), pos_integer()) :: [token()]

Tokenizes using the n-gram tokenizer.

Arguments

  • text - The text to tokenize
  • min_gram - Minimum n-gram size (default: 2)
  • max_gram - Maximum n-gram size (default: 3)
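
A hedged sketch: the min_gram/max_gram naming suggests character n-grams (every substring of length min_gram..max_gram), so the input here is chosen so only one n-gram is possible:

iex> ExNlp.Tokenizer.ngram("ab", 2, 3)
[%ExNlp.Token{text: "ab", position: 0, start_offset: 0, end_offset: 2}]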

ngram_text(text, min_gram \\ 2, max_gram \\ 3)

@spec ngram_text(String.t(), pos_integer(), pos_integer()) :: [String.t()]

Tokenizes using the n-gram tokenizer and returns just the text strings.

Arguments

  • text - The text to tokenize
  • min_gram - Minimum n-gram size (default: 2)
  • max_gram - Maximum n-gram size (default: 3)
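
A hedged sketch under the same character n-gram assumption; the emission order shown (by start offset, then length) is a guess:

iex> ExNlp.Tokenizer.ngram_text("abc", 2, 3)
["ab", "abc", "bc"]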

regex(text, pattern)

@spec regex(String.t(), ExNlp.Tokenizer.Regex.t() | String.t()) :: [token()]

Tokenizes using the regex tokenizer.

Arguments

  • text - The text to tokenize
  • pattern - A regex pattern or string to match tokens
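
A sketch of expected output, with the token texts taken from the regexp_tokenize/2 examples below (the exact struct values are an assumption):

iex> ExNlp.Tokenizer.regex("Hello123 world456", "\\w+")
[
  %ExNlp.Token{text: "Hello123", position: 0, start_offset: 0, end_offset: 8},
  %ExNlp.Token{text: "world456", position: 1, start_offset: 9, end_offset: 17}
]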

regexp_tokenize(text, pattern)

@spec regexp_tokenize(String.t(), ExNlp.Tokenizer.Regex.t() | String.t()) :: [
  String.t()
]

Tokenizes text using a regex pattern.

Similar to NLTK's regexp_tokenize() function.

Examples

iex> ExNlp.Tokenizer.regexp_tokenize("Hello123 world456", "\\w+")
["Hello123", "world456"]

iex> ExNlp.Tokenizer.regexp_tokenize("abc def ghi", "[a-c]+")
["abc"]

span_tokenize(text)

@spec span_tokenize(String.t()) :: [span()]

Returns spans (start_offset, end_offset) for tokens.

Similar to NLTK's span_tokenize() method. Useful for aligning tokens with the original text.

Examples

iex> ExNlp.Tokenizer.span_tokenize("Hello, world!")
[{0, 5}, {7, 12}]
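
Because spans index into the original string, they can recover the token text; a usage sketch built only from the documented return value:

iex> text = "Hello, world!"
iex> for {start, stop} <- ExNlp.Tokenizer.span_tokenize(text), do: String.slice(text, start, stop - start)
["Hello", "world"]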

standard(text)

@spec standard(String.t()) :: [token()]

Tokenizes using the standard tokenizer. This is the tokenizer behind tokenize/1.
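
Since tokenize/1 is documented as using the standard tokenizer, the output should match tokenize/1 below:

iex> ExNlp.Tokenizer.standard("Hello, world!")
[
  %ExNlp.Token{text: "Hello", position: 0, start_offset: 0, end_offset: 5},
  %ExNlp.Token{text: "world", position: 1, start_offset: 7, end_offset: 12}
]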

tokenize(text)

@spec tokenize(String.t()) :: [token()]

Tokenizes text using the standard tokenizer (default).

Similar to NLTK's word_tokenize(): splits on whitespace and punctuation, dropping the punctuation. This is the recommended general-purpose tokenizer.

Examples

iex> ExNlp.Tokenizer.tokenize("Hello, world!")
[
  %ExNlp.Token{text: "Hello", position: 0, start_offset: 0, end_offset: 5},
  %ExNlp.Token{text: "world", position: 1, start_offset: 7, end_offset: 12}
]

whitespace(text)

@spec whitespace(String.t()) :: [token()]

Tokenizes using the whitespace tokenizer.
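
Assuming this delegates to ExNlp.Tokenizer.Whitespace, the output should match the module example above:

iex> ExNlp.Tokenizer.whitespace("Hello, world!")
[
  %ExNlp.Token{text: "Hello,", position: 0, start_offset: 0, end_offset: 6},
  %ExNlp.Token{text: "world!", position: 1, start_offset: 7, end_offset: 13}
]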

word_tokenize(text)

@spec word_tokenize(String.t()) :: [String.t()]

Tokenizes text and returns just the text strings (no offsets).

Similar to NLTK's word_tokenize() function which returns a list of strings. This is a convenience wrapper around tokenize/1.

Examples

iex> ExNlp.Tokenizer.word_tokenize("Hello, world!")
["Hello", "world"]

wordpunct_tokenize(text)

@spec wordpunct_tokenize(String.t()) :: [String.t()]

Tokenizes text using the whitespace-only tokenizer.

Preserves punctuation within tokens, so it behaves like NLTK's WhitespaceTokenizer rather than NLTK's wordpunct_tokenize(), which splits punctuation into separate tokens.

Examples

iex> ExNlp.Tokenizer.wordpunct_tokenize("Hello, world!")
["Hello,", "world!"]