ExNlp.Tokenizer.Standard (ex_nlp v0.1.0)
Standard word tokenizer - splits text on whitespace and punctuation, preserving the original case. Comparable to NLTK's wordpunct_tokenize or to an improved TreebankWordTokenizer.
Examples
iex> ExNlp.Tokenizer.Standard.tokenize("Hello, world!")
[
%ExNlp.Token{text: "Hello", position: 0, start_offset: 0, end_offset: 5},
%ExNlp.Token{text: "world", position: 1, start_offset: 7, end_offset: 12}
]
iex> ExNlp.Tokenizer.Standard.span_tokenize("Hello, world!")
[{0, 5}, {7, 12}]
Summary
Functions
span_tokenize(text)
Returns spans (start_offset, end_offset) for tokens.
tokenize(text)
Tokenizes text by splitting on whitespace and punctuation. Preserves original case.
tokenize_text(text)
Tokenizes text and returns just the text strings (no Token structs).
Types
@type span() :: ExNlp.Tokenizer.Base.span()
@type token() :: ExNlp.Tokenizer.Base.token()
Functions
span_tokenize(text)
Returns spans (start_offset, end_offset) for tokens.
Similar to NLTK's span_tokenize method.
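The module example above applies here as well:

```elixir
iex> ExNlp.Tokenizer.Standard.span_tokenize("Hello, world!")
[{0, 5}, {7, 12}]
```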
tokenize(text)
Tokenizes text by splitting on whitespace and punctuation. Preserves original case.
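As a sketch of how the offsets relate to the input (using the Token fields shown in the module example), each token's start_offset/end_offset pair indexes back into the original string:

```elixir
iex> text = "Hello, world!"
iex> tokens = ExNlp.Tokenizer.Standard.tokenize(text)
iex> Enum.map(tokens, fn t ->
...>   # slice the source string with the token's own offsets
...>   String.slice(text, t.start_offset, t.end_offset - t.start_offset)
...> end)
["Hello", "world"]
```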
tokenize_text(text)
Tokenizes text and returns just the text strings (no Token structs).
More efficient when you don't need position or offset information.
Similar to NLTK's word_tokenize().
Examples
iex> ExNlp.Tokenizer.Standard.tokenize_text("Hello, world!")
["Hello", "world"]