Nasty.Language.English.Tokenizer (Nasty v0.3.0)

English tokenizer using NimbleParsec.

Tokenizes English text into words, punctuation, numbers, and special tokens with accurate position tracking for AST span information.

Features

Word tokenization with contractions ("don't", "I'm", "we've")
Punctuation handling (periods, commas, quotes, etc.)
Number recognition (integers, decimals, percentages)
Sentence boundary detection
Accurate line/column and byte offset tracking
Unicode support

Examples

iex> {:ok, tokens} = Nasty.Language.English.Tokenizer.tokenize("Hello world!")
iex> Enum.map(tokens, & &1.text)
["Hello", "world", "!"]

Summary

Functions

parse_text(binary, opts \\ [])

Parses the given binary with the parse_text parser.

tokenize(text, opts \\ [])

Tokenizes English text into Token structs.

Functions

parse_text(binary, opts \\ [])

@spec parse_text(binary(), keyword()) ::
  {:ok, [term()], rest, context, line, byte_offset}
  | {:error, reason, rest, context, line, byte_offset}
when line: {pos_integer(), byte_offset},
     byte_offset: non_neg_integer(),
     rest: binary(),
     reason: String.t(),
     context: map()

Parses the given binary with the parse_text parser.

Returns {:ok, [token], rest, context, position, byte_offset} or {:error, reason, rest, context, position, byte_offset}, where position (named line in the spec above) describes the location of the parse_text (start position) as {line, offset_to_start_of_line}.

The column where the error occurred can be inferred from byte_offset - offset_to_start_of_line.
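
The column arithmetic above can be sketched as a small helper. ColumnSketch and column/2 are hypothetical names, not part of Nasty or NimbleParsec. The sketch assumes the {line, offset_to_start_of_line} tuple and byte_offset are taken from a parse_text result as documented here. The value it returns is a zero-based byte column, so it may differ from the character column on lines that contain multi-byte UTF-8 text.

```elixir
defmodule ColumnSketch do
  # Hypothetical helper, not part of Nasty or NimbleParsec.
  # Takes the {line, offset_to_start_of_line} tuple and the byte_offset
  # from a parse_text result and returns the zero-based byte column.
  @spec column({pos_integer(), non_neg_integer()}, non_neg_integer()) :: non_neg_integer()
  def column({_line, offset_to_start_of_line}, byte_offset)
      when byte_offset >= offset_to_start_of_line do
    byte_offset - offset_to_start_of_line
  end
end

# Illustrative values: line 2 starts at byte 6 and the parser stopped at byte 9.
ColumnSketch.column({2, 6}, 9)
# => 3
```

Add 1 to the result if a one-based column is needed for display.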

Options

:byte_offset - the byte offset for the whole binary, defaults to 0
:line - the line and the byte offset into that line, defaults to {1, byte_offset}
:context - the initial context value. It will be converted to a map
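
As an illustration of how :byte_offset and :line relate, the sketch below derives both values for input that begins partway through a larger document. ParseOptsSketch and opts_after/1 are hypothetical names, not part of Nasty. The sketch assumes lines are separated by "\n" and that the resulting keyword list would be passed as the second argument to parse_text.

```elixir
defmodule ParseOptsSketch do
  # Hypothetical helper, not part of Nasty.
  # Builds :byte_offset and :line options for parsing the part of a
  # document that starts right after `prefix`.
  def opts_after(prefix) when is_binary(prefix) do
    byte_offset = byte_size(prefix)

    # Each match is {position, length}; the list is empty if there is no newline.
    newline_positions = :binary.matches(prefix, "\n")

    # Line numbers are 1-based: one more than the number of newlines seen.
    line_number = length(newline_positions) + 1

    # The current line starts just after the last newline, or at 0 if none.
    line_start =
      case List.last(newline_positions) do
        nil -> 0
        {pos, 1} -> pos + 1
      end

    [byte_offset: byte_offset, line: {line_number, line_start}]
  end
end

ParseOptsSketch.opts_after("Hello\n")
# => [byte_offset: 6, line: {2, 6}]
```

With these options, positions reported by the parser would be relative to the whole document rather than to the fragment being parsed.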

tokenize(text, opts \\ [])

@spec tokenize(
  String.t(),
  keyword()
) :: {:ok, [Nasty.AST.Token.t()]} | {:error, term()}

Tokenizes English text into Token structs.

On success, returns {:ok, tokens}, where each Token struct has:

Accurate text content
Position information (line, column, byte offset)
Span covering the token's location
Language set to :en

Note: POS tags and morphology are not set by the tokenizer; those are added by the POS tagger.

Parameters

text - The text to tokenize
opts - Options (currently unused)

Returns

{:ok, tokens} - List of Token structs
{:error, reason} - Parse error
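
Both return shapes can be handled with pattern matching on function clauses. The sketch below is a minimal illustration; TokenizeResultSketch and texts/1 are hypothetical names, not part of Nasty. To stay self-contained it uses plain maps with a :text key as stand-ins for real Token structs.

```elixir
defmodule TokenizeResultSketch do
  # Hypothetical helper, not part of Nasty.
  # Accepts the documented return value of tokenize/2 and either
  # extracts each token's text or passes the error through unchanged.
  def texts({:ok, tokens}) when is_list(tokens) do
    {:ok, Enum.map(tokens, & &1.text)}
  end

  def texts({:error, reason}), do: {:error, reason}
end

# Stand-in maps with a :text key are used here instead of real Token structs.
TokenizeResultSketch.texts({:ok, [%{text: "Hello"}, %{text: "!"}]})
# => {:ok, ["Hello", "!"]}
```

In real use, the result of Nasty.Language.English.Tokenizer.tokenize/2 would be piped into such a helper in place of the stand-in tuple.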

Examples

iex> {:ok, tokens} = Nasty.Language.English.Tokenizer.tokenize("Hello!")
iex> length(tokens)
2
iex> hd(tokens).text
"Hello"

iex> {:ok, tokens} = Nasty.Language.English.Tokenizer.tokenize("I don't know.")
iex> Enum.map(tokens, & &1.text)
["I", "don't", "know", "."]