ExNlp.Tokenizer.Whitespace (ex_nlp v0.1.0)


A whitespace tokenizer that splits text only on whitespace boundaries.

Preserves punctuation within tokens. Similar to NLTK's WhitespaceTokenizer.

Examples

iex> ExNlp.Tokenizer.Whitespace.tokenize("Hello world")
[
  %ExNlp.Token{text: "Hello", position: 0, start_offset: 0, end_offset: 5},
  %ExNlp.Token{text: "world", position: 1, start_offset: 6, end_offset: 11}
]

iex> ExNlp.Tokenizer.Whitespace.tokenize("Hello, world!")
[
  %ExNlp.Token{text: "Hello,", position: 0, start_offset: 0, end_offset: 6},
  %ExNlp.Token{text: "world!", position: 1, start_offset: 7, end_offset: 13}
]

iex> ExNlp.Tokenizer.Whitespace.span_tokenize("Hello world")
[{0, 5}, {6, 11}]

Summary

Functions

span_tokenize(text)
Returns spans (start_offset, end_offset) for tokens.

span_tokenize_fast(text)
Fast span tokenization using String.split.

tokenize(text)
Tokenizes text by splitting on whitespace only.

tokenize_fast(text)
Fast tokenization using String.split.

tokenize_text(text)
Tokenizes text and returns just the text strings (no Token structs).

tokenize_text_fast(text)
Fast text-only tokenization using String.split.

Types

span()

@type span() :: ExNlp.Tokenizer.Base.span()

token()

@type token() :: ExNlp.Tokenizer.Base.token()

Functions

span_tokenize(text)

@spec span_tokenize(String.t()) :: [span()]

Returns {start_offset, end_offset} span tuples for tokens.

Similar to NLTK's span_tokenize method.
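
Examples

The spans below mirror the Token offsets shown in the module examples; punctuation stays inside the spans:

iex> ExNlp.Tokenizer.Whitespace.span_tokenize("Hello, world!")
[{0, 6}, {7, 13}]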

span_tokenize_fast(text)

@spec span_tokenize_fast(String.t()) :: [span()]

Fast span tokenization using String.split.
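
Examples

A sketch of the expected output, assuming span_tokenize_fast reports the same offsets as tokenize_fast (note the skipped character between consecutive spaces, as in the tokenize_fast example):

iex> ExNlp.Tokenizer.Whitespace.span_tokenize_fast("Hello  world")
[{0, 5}, {7, 12}]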

tokenize(text)

@spec tokenize(String.t()) :: [token()]

Tokenizes text by splitting on whitespace only.

Uses character-by-character iteration, which tracks offsets precisely.

tokenize_fast(text)

@spec tokenize_fast(String.t()) :: [token()]

Fast tokenization using String.split.

This version uses String.split/2 which is typically faster than character-by-character iteration, but may handle some edge cases differently (e.g., multiple consecutive spaces).

Examples

iex> ExNlp.Tokenizer.Whitespace.tokenize_fast("Hello world")
[
  %ExNlp.Token{text: "Hello", position: 0, start_offset: 0, end_offset: 5},
  %ExNlp.Token{text: "world", position: 1, start_offset: 6, end_offset: 11}
]

iex> ExNlp.Tokenizer.Whitespace.tokenize_fast("Hello  world")
[
  %ExNlp.Token{text: "Hello", position: 0, start_offset: 0, end_offset: 5},
  %ExNlp.Token{text: "world", position: 1, start_offset: 7, end_offset: 12}
]

tokenize_text(text)

@spec tokenize_text(String.t()) :: [String.t()]

Tokenizes text and returns just the text strings (no Token structs).

More efficient when you don't need position or offset information.

Examples

iex> ExNlp.Tokenizer.Whitespace.tokenize_text("Hello world")
["Hello", "world"]

iex> ExNlp.Tokenizer.Whitespace.tokenize_text("Hello, world!")
["Hello,", "world!"]

tokenize_text_fast(text)

@spec tokenize_text_fast(String.t()) :: [String.t()]

Fast text-only tokenization using String.split.

Returns just strings without creating Token structs.

Examples

iex> ExNlp.Tokenizer.Whitespace.tokenize_text_fast("Hello world")
["Hello", "world"]