Nasty.Language.English.Tokenizer (Nasty v0.3.0)
English tokenizer using NimbleParsec.
Tokenizes English text into words, punctuation, numbers, and special tokens with accurate position tracking for AST span information.
Features
- Word tokenization with contractions ("don't", "I'm", "we've")
- Punctuation handling (periods, commas, quotes, etc.)
- Number recognition (integers, decimals, percentages)
- Sentence boundary detection
- Accurate line/column and byte offset tracking
- Unicode support
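To make the token categories above concrete, here is a rough approximation using a regular expression. It is only a sketch: the module itself is built on NimbleParsec, and `RegexTokenSketch` and its pattern are hypothetical names and rules chosen for illustration, not Nasty's actual grammar. The sketch does not track positions or sentence boundaries.

```elixir
defmodule RegexTokenSketch do
  # Hypothetical illustration only; not the grammar Nasty uses.
  # Alternatives, in order:
  #   1. numbers: integers, decimals, optional trailing percent sign
  #   2. words: Unicode letters, with apostrophe-joined parts ("don't")
  #   3. any other single non-space character, treated as punctuation
  def split(text) when is_binary(text) do
    pattern = ~r/\d+(?:\.\d+)?%?|\p{L}+(?:'\p{L}+)*|[^\s\p{L}\d]/u

    pattern
    |> Regex.scan(text)
    |> List.flatten()
  end
end

RegexTokenSketch.split("I don't know.")
# => ["I", "don't", "know", "."]

RegexTokenSketch.split("Rates rose 3.5% today.")
# => ["Rates", "rose", "3.5%", "today", "."]
```

The real tokenizer returns Token structs with span information rather than bare strings, so this sketch only mirrors the splitting behaviour for simple inputs.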
Examples
iex> {:ok, tokens} = Nasty.Language.English.Tokenizer.tokenize("Hello world!")
iex> Enum.map(tokens, & &1.text)
["Hello", "world", "!"]
Summary
Functions
@spec parse_text(binary(), keyword()) :: {:ok, [term()], rest, context, line, byte_offset} | {:error, reason, rest, context, line, byte_offset} when line: {pos_integer(), byte_offset}, byte_offset: non_neg_integer(), rest: binary(), reason: String.t(), context: map()
Parses the given binary as parse_text.
Returns {:ok, [token], rest, context, position, byte_offset} or
{:error, reason, rest, context, line, byte_offset} where position
describes the location of the parse_text (start position) as {line, offset_to_start_of_line}.
The column where the error occurred can be inferred from byte_offset - offset_to_start_of_line.
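The column arithmetic described here can be sketched as a small helper. `ColumnSketch` is a hypothetical module, not part of Nasty or NimbleParsec, and the input values below are made-up samples rather than real parser output.

```elixir
defmodule ColumnSketch do
  # Hypothetical helper for illustration.
  # Takes the {line, offset_to_start_of_line} tuple and the byte_offset
  # and returns a zero-based column measured in bytes.
  def column({_line, line_start}, byte_offset)
      when is_integer(line_start) and is_integer(byte_offset) and
             byte_offset >= line_start do
    byte_offset - line_start
  end
end

# "Hello\n" occupies 6 bytes, so line 2 starts at byte offset 6.
# A byte offset of 9 is therefore 3 bytes into line 2.
ColumnSketch.column({2, 6}, 9)
# => 3
```

Because the result is counted in bytes, it can differ from the visible character column when the line contains multi-byte UTF-8 characters.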
Options
- :byte_offset - the byte offset for the whole binary, defaults to 0
- :line - the line and the byte offset into that line, defaults to {1, byte_offset}
- :context - the initial context value. It will be converted to a map
@spec tokenize(String.t(), keyword()) :: {:ok, [Nasty.AST.Token.t()]} | {:error, term()}
Tokenizes English text into Token structs.
Returns a list of Token structs with:
- Accurate text content
- Position information (line, column, byte offset)
- Span covering the token's location
- Language set to :en
Note: POS tags and morphology are not set by the tokenizer; those are added by the POS tagger.
Parameters
- text - The text to tokenize
- opts - Options (currently unused)
Returns
- {:ok, tokens} - List of Token structs
- {:error, reason} - Parse error
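Callers will usually want to handle both return shapes. The following is a minimal sketch of that pattern. `TokenizeResultSketch` is a hypothetical helper, not part of Nasty, and it runs here against stand-in maps with a `:text` key instead of real Token structs.

```elixir
defmodule TokenizeResultSketch do
  # Hypothetical helper for illustration.
  # Accepts the documented result tuples and works on any list of
  # maps or structs that expose a :text field.
  def texts({:ok, tokens}) when is_list(tokens) do
    {:ok, Enum.map(tokens, & &1.text)}
  end

  def texts({:error, reason}) do
    {:error, "tokenization failed: #{inspect(reason)}"}
  end
end

# Stand-in data shaped like the documented return values.
TokenizeResultSketch.texts({:ok, [%{text: "Hello"}, %{text: "!"}]})
# => {:ok, ["Hello", "!"]}

TokenizeResultSketch.texts({:error, :example_reason})
# => {:error, "tokenization failed: :example_reason"}
```

With Nasty available, the same helper could presumably be fed directly from the tokenizer, for example `Nasty.Language.English.Tokenizer.tokenize("Hello!") |> TokenizeResultSketch.texts()`.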
Examples
iex> {:ok, tokens} = Nasty.Language.English.Tokenizer.tokenize("Hello!")
iex> length(tokens)
2
iex> hd(tokens).text
"Hello"
iex> {:ok, tokens} = Nasty.Language.English.Tokenizer.tokenize("I don't know.")
iex> Enum.map(tokens, & &1.text)
["I", "don't", "know", "."]