ExNlp.Tokenizer.Standard (ex_nlp v0.1.0)

Standard word tokenizer - splits text on whitespace and punctuation.

Preserves original case. Similar to NLTK's wordpunct_tokenize or an improved TreebankWordTokenizer.

Examples

iex> ExNlp.Tokenizer.Standard.tokenize("Hello, world!")
[
  %ExNlp.Token{text: "Hello", position: 0, start_offset: 0, end_offset: 5},
  %ExNlp.Token{text: "world", position: 1, start_offset: 7, end_offset: 12}
]

iex> ExNlp.Tokenizer.Standard.span_tokenize("Hello, world!")
[{0, 5}, {7, 12}]

Summary

Functions

span_tokenize(text)
Returns spans ({start_offset, end_offset}) for tokens.

tokenize(text)
Tokenizes text by splitting on whitespace and punctuation. Preserves original case.

tokenize_text(text)
Tokenizes text and returns just the text strings (no Token structs).

Types

span()

@type span() :: ExNlp.Tokenizer.Base.span()

token()

@type token() :: ExNlp.Tokenizer.Base.token()
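
Based on the doctests in this module, a span() is a {start_offset, end_offset} tuple and a token() is an ExNlp.Token struct, for example:

{0, 5}
%ExNlp.Token{text: "Hello", position: 0, start_offset: 0, end_offset: 5}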

Functions

span_tokenize(text)

@spec span_tokenize(String.t()) :: [span()]

Returns spans ({start_offset, end_offset}) for tokens.

Similar to NLTK's span_tokenize method.
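
Examples

The expected output below mirrors the module-level doctest above.

iex> ExNlp.Tokenizer.Standard.span_tokenize("Hello, world!")
[{0, 5}, {7, 12}]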

tokenize(text)

@spec tokenize(String.t()) :: [token()]

Tokenizes text by splitting on whitespace and punctuation. Preserves original case.
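
Examples

The expected output below mirrors the module-level doctest above.

iex> ExNlp.Tokenizer.Standard.tokenize("Hello, world!")
[
  %ExNlp.Token{text: "Hello", position: 0, start_offset: 0, end_offset: 5},
  %ExNlp.Token{text: "world", position: 1, start_offset: 7, end_offset: 12}
]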

tokenize_text(text)

@spec tokenize_text(String.t()) :: [String.t()]

Tokenizes text and returns just the text strings (no Token structs).

More efficient when you don't need position or offset information. Similar to NLTK's word_tokenize().

Examples

iex> ExNlp.Tokenizer.Standard.tokenize_text("Hello, world!")
["Hello", "world"]