ExNlp.Tokenizer.Whitespace (ex_nlp v0.1.0)
Whitespace tokenizer - splits text only on whitespace boundaries.
Preserves punctuation within tokens. Similar to NLTK's WhitespaceTokenizer.
Examples
iex> ExNlp.Tokenizer.Whitespace.tokenize("Hello world")
[
  %ExNlp.Token{text: "Hello", position: 0, start_offset: 0, end_offset: 5},
  %ExNlp.Token{text: "world", position: 1, start_offset: 6, end_offset: 11}
]

iex> ExNlp.Tokenizer.Whitespace.tokenize("Hello, world!")
[
  %ExNlp.Token{text: "Hello,", position: 0, start_offset: 0, end_offset: 6},
  %ExNlp.Token{text: "world!", position: 1, start_offset: 7, end_offset: 13}
]

iex> ExNlp.Tokenizer.Whitespace.span_tokenize("Hello world")
[{0, 5}, {6, 11}]
Summary
Functions
Returns spans (start_offset, end_offset) for tokens.
Fast span tokenization using String.split.
Tokenizes text by splitting on whitespace only.
Fast tokenization using String.split.
Tokenizes text and returns just the text strings (no Token structs).
Fast text-only tokenization using String.split.
Types
@type span() :: ExNlp.Tokenizer.Base.span()
@type token() :: ExNlp.Tokenizer.Base.token()
Functions
span_tokenize(text)
Returns spans (start_offset, end_offset) for tokens.
Similar to NLTK's span_tokenize method.
span_tokenize_fast(text)
Fast span tokenization using String.split.
tokenize(text)
Tokenizes text by splitting on whitespace only.
Uses character-by-character iteration, which keeps start and end offsets exact even across runs of whitespace.
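The character-by-character approach can be sketched as follows. This is an illustrative reimplementation, not ExNlp's actual source, and the SpanSketch module name is invented for the demo:

```elixir
# Hypothetical sketch of character-by-character span computation.
defmodule SpanSketch do
  # Walks the string grapheme by grapheme, tracking each grapheme's
  # index, and emits a {start, end} span for every maximal run of
  # non-whitespace characters.
  def spans(text) do
    text
    |> String.graphemes()
    |> Enum.with_index()
    |> Enum.chunk_by(fn {g, _i} -> g =~ ~r/\s/ end)
    |> Enum.reject(fn [{g, _i} | _] -> g =~ ~r/\s/ end)
    |> Enum.map(fn chunk ->
      {_g, first} = List.first(chunk)
      {_g, last} = List.last(chunk)
      {first, last + 1}
    end)
  end
end
```

Because each grapheme's index is tracked explicitly, consecutive spaces cannot desynchronize the offsets: SpanSketch.spans("Hello  world") yields [{0, 5}, {7, 12}].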
tokenize_fast(text)
Fast tokenization using String.split.
This version uses String.split/2, which is typically faster than character-by-character iteration but may handle some edge cases (e.g., multiple consecutive spaces) differently.
Examples
iex> ExNlp.Tokenizer.Whitespace.tokenize_fast("Hello world")
[
  %ExNlp.Token{text: "Hello", position: 0, start_offset: 0, end_offset: 5},
  %ExNlp.Token{text: "world", position: 1, start_offset: 6, end_offset: 11}
]
iex> ExNlp.Tokenizer.Whitespace.tokenize_fast("Hello  world")
[
  %ExNlp.Token{text: "Hello", position: 0, start_offset: 0, end_offset: 5},
  %ExNlp.Token{text: "world", position: 1, start_offset: 7, end_offset: 12}
]
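One way a String.split-based tokenizer can still recover correct offsets is to scan forward for each token from the end of the previous match, so runs of whitespace do not shift the results. The sketch below is illustrative only (FastSketch is an invented name, not ExNlp's code); offsets are byte offsets, which coincide with character offsets for ASCII input:

```elixir
# Illustrative sketch, not ExNlp's implementation: split first, then
# locate each token in the original text starting from the previous
# match, so consecutive spaces cannot desynchronize the offsets.
defmodule FastSketch do
  def tokenize(text) do
    {tokens, _final_offset} =
      text
      |> String.split()
      |> Enum.map_reduce(0, fn word, from ->
        # Search only the not-yet-consumed suffix of the text.
        {start, len} =
          :binary.match(text, word, scope: {from, byte_size(text) - from})

        {{word, start, start + len}, start + len}
      end)

    tokens
  end
end
```

With this strategy, FastSketch.tokenize("Hello  world") produces [{"Hello", 0, 5}, {"world", 7, 12}], matching the documented tokenize_fast output above.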
tokenize_text(text)
Tokenizes text and returns just the text strings (no Token structs).
More efficient when you don't need position or offset information.
Examples
iex> ExNlp.Tokenizer.Whitespace.tokenize_text("Hello world")
["Hello", "world"]
iex> ExNlp.Tokenizer.Whitespace.tokenize_text("Hello, world!")
["Hello,", "world!"]
tokenize_text_fast(text)
Fast text-only tokenization using String.split.
Returns just strings without creating Token structs.
Examples
iex> ExNlp.Tokenizer.Whitespace.tokenize_text_fast("Hello world")
["Hello", "world"]