ExNlp.Tokenizer.Regex (ex_nlp v0.1.0)


Regex tokenizer - extracts tokens matching a regular expression pattern.

Similar to NLTK's RegexpTokenizer. Useful for extracting specific patterns from text (e.g., words, numbers, emails).

Examples

iex> ExNlp.Tokenizer.Regex.tokenize("Hello123 world456", ~r/\w+/)
[
  %ExNlp.Token{text: "Hello123", position: 0, start_offset: 0, end_offset: 8},
  %ExNlp.Token{text: "world456", position: 1, start_offset: 9, end_offset: 17}
]

iex> ExNlp.Tokenizer.Regex.tokenize("abc def ghi", ~r/[a-c]+/)
[%ExNlp.Token{text: "abc", position: 0, start_offset: 0, end_offset: 3}]

iex> ExNlp.Tokenizer.Regex.span_tokenize("Hello world", ~r/\w+/)
[{0, 5}, {6, 11}]

Summary

Functions

span_tokenize(text, pattern) - Returns spans (start_offset, end_offset) for tokens.

tokenize(text, pattern) - Tokenizes text using a regex pattern to match tokens.

tokenize_text(text, pattern) - Tokenizes text and returns just the text strings (no Token structs).

Types

span()

@type span() :: ExNlp.Tokenizer.Base.span()

token()

@type token() :: ExNlp.Tokenizer.Base.token()

Functions

do_tokenize(text, pattern)

span_tokenize(text, pattern)

@spec span_tokenize(String.t(), Regex.t() | String.t()) :: [span()]

Returns spans (start_offset, end_offset) for tokens.

Similar to NLTK's span_tokenize method.
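As a sketch of how the spans relate to tokenize/2, each tuple carries the same exclusive-end offsets that appear as start_offset and end_offset on the corresponding Token struct (the input string here is illustrative):

iex> ExNlp.Tokenizer.Regex.span_tokenize("one, two", ~r/\w+/)
[{0, 3}, {5, 8}]

The comma and space at offsets 3-4 fall outside every match, so they are simply skipped.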

tokenize(text, pattern)

@spec tokenize(String.t(), Regex.t() | String.t()) :: [token()]

Tokenizes text using a regex pattern to match tokens.

Arguments

  • text - The text to tokenize
  • pattern - A regex pattern or string to match tokens
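A minimal illustrative example: extracting only the numeric token from a string. The offsets follow the same exclusive-end convention as the examples at the top of this page; the input string is made up for illustration.

iex> ExNlp.Tokenizer.Regex.tokenize("x = 42", ~r/\d+/)
[%ExNlp.Token{text: "42", position: 0, start_offset: 4, end_offset: 6}]

Per the @spec, pattern may also be given as a plain string (presumably compiled to a regex internally; this is an assumption from the typespec, not verified against the implementation).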

tokenize_text(text, pattern)

@spec tokenize_text(String.t(), Regex.t() | String.t()) :: [String.t()]

Tokenizes text and returns just the text strings (no Token structs).

More efficient when you don't need position or offset information.

Examples

iex> ExNlp.Tokenizer.Regex.tokenize_text("Hello123 world456", ~r/\w+/)
["Hello123", "world456"]