ExNlp.Tokenizer.Regex (ex_nlp v0.1.0)
Regex tokenizer - extracts tokens matching a regular expression pattern.
Similar to NLTK's RegexpTokenizer. Useful for extracting specific patterns from text (e.g., words, numbers, emails).
Examples
iex> ExNlp.Tokenizer.Regex.tokenize("Hello123 world456", ~r/\w+/)
[
  %ExNlp.Token{text: "Hello123", position: 0, start_offset: 0, end_offset: 8},
  %ExNlp.Token{text: "world456", position: 1, start_offset: 9, end_offset: 17}
]
iex> ExNlp.Tokenizer.Regex.tokenize("abc def ghi", ~r/[a-c]+/)
[%ExNlp.Token{text: "abc", position: 0, start_offset: 0, end_offset: 3}]
iex> ExNlp.Tokenizer.Regex.span_tokenize("Hello world", ~r/\w+/)
[{0, 5}, {6, 11}]
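The span values above can be reproduced with Elixir's standard `Regex` module. This is a sketch of the likely mechanics, not the library's actual code: `Regex.scan/3` with `return: :index` yields `{start, byte_length}` pairs per match, which convert directly to `{start_offset, end_offset}` spans.

```elixir
# Compute spans for ~r/\w+/ over "Hello world" using only the standard
# library. Each match comes back as [{start, byte_length}].
spans =
  Regex.scan(~r/\w+/, "Hello world", return: :index)
  |> Enum.map(fn [{start, len}] -> {start, start + len} end)

# spans == [{0, 5}, {6, 11}], matching span_tokenize above
```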
Summary
Functions
span_tokenize(text, pattern) - Returns spans (start_offset, end_offset) for tokens.
tokenize(text, pattern) - Tokenizes text using a regex pattern to match tokens.
tokenize_text(text, pattern) - Tokenizes text and returns just the text strings (no Token structs).
Types
@type span() :: ExNlp.Tokenizer.Base.span()
@type token() :: ExNlp.Tokenizer.Base.token()
Functions

span_tokenize(text, pattern)
Returns spans (start_offset, end_offset) for tokens.
Similar to NLTK's span_tokenize method.
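Because the offsets are byte-based (consistent with the module examples), each span can be turned back into its token text with `binary_part/3`. A minimal sketch, using a literal span list taken from the example above:

```elixir
text = "Hello world"
# Spans as returned by span_tokenize(text, ~r/\w+/) in the module example.
spans = [{0, 5}, {6, 11}]

# {start, end} -> the matched substring (byte offsets assumed).
Enum.map(spans, fn {s, e} -> binary_part(text, s, e - s) end)
# => ["Hello", "world"]
```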
tokenize(text, pattern)
Tokenizes text using a regex pattern to match tokens.
Arguments
text - The text to tokenize
pattern - A regex pattern or string to match tokens
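The Token fields (`text`, `position`, `start_offset`, `end_offset`) can be reconstructed with the standard library alone. This sketch illustrates what each field means; it is an assumption about the behavior, not a copy of the actual implementation:

```elixir
text = "Hello123 world456"

tokens =
  Regex.scan(~r/\w+/, text, return: :index)
  |> Enum.with_index()
  |> Enum.map(fn {[{start, len}], i} ->
    # position is the 0-based token index; offsets are byte offsets.
    %{
      text: binary_part(text, start, len),
      position: i,
      start_offset: start,
      end_offset: start + len
    }
  end)

# tokens mirrors the struct fields in the module example:
# [%{text: "Hello123", position: 0, start_offset: 0, end_offset: 8},
#  %{text: "world456", position: 1, start_offset: 9, end_offset: 17}]
```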
tokenize_text(text, pattern)
Tokenizes text and returns just the text strings (no Token structs).
More efficient when you don't need position or offset information.
Examples
iex> ExNlp.Tokenizer.Regex.tokenize_text("Hello123 world456", ~r/\w+/)
["Hello123", "world456"]