Nasty.Language.Spanish.Tokenizer (Nasty v0.3.0)
Spanish tokenizer using NimbleParsec.
Tokenizes Spanish text into words, punctuation, numbers, and special tokens with accurate position tracking for AST span information.
Spanish-Specific Features
- Inverted punctuation: ¿?, ¡!
- Guillemets: «», ‹›
- Contractions: del (de + el), al (a + el)
- Clitic pronouns: dámelo, dáselo, cómetelo
- Accented characters: á, é, í, ó, ú, ñ, ü
- Abbreviations: Sr., Sra., Dr., etc.
Examples
iex> {:ok, tokens} = Spanish.Tokenizer.tokenize("¡Hola mundo!")
iex> Enum.map(tokens, & &1.text)
["¡", "Hola", "mundo", "!"]
iex> {:ok, tokens} = Spanish.Tokenizer.tokenize("¿Cómo estás?")
iex> Enum.map(tokens, & &1.text)
["¿", "Cómo", "estás", "?"]
Summary
Functions
@spec parse_text(binary(), keyword()) :: {:ok, [term()], rest, context, line, byte_offset} | {:error, reason, rest, context, line, byte_offset} when line: {pos_integer(), byte_offset}, byte_offset: non_neg_integer(), rest: binary(), reason: String.t(), context: map()
Parses the given binary as parse_text.
Returns {:ok, [token], rest, context, position, byte_offset} or
{:error, reason, rest, context, line, byte_offset} where position
describes the location of the parse_text (start position) as {line, offset_to_start_of_line}.
The column where the error occurred can be inferred from byte_offset - offset_to_start_of_line.
Options
- :byte_offset - the byte offset for the whole binary, defaults to 0
- :line - the line and the byte offset into that line, defaults to {1, byte_offset}
- :context - the initial context value. It will be converted to a map
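The column arithmetic described above can be sketched as follows. This is a minimal illustration, not part of Nasty or NimbleParsec: the module name ColumnSketch and both helper functions are hypothetical, and the sample offsets are worked out by hand rather than taken from a real parse_text call. Because the subtraction works on bytes, the result is a zero-based byte column; for Spanish text with multi-byte characters such as ¿ or é, a character column needs an extra step. The sketch assumes the byte offset falls on a character boundary.

```elixir
defmodule ColumnSketch do
  # Hypothetical helpers for illustration only.
  # `line_info` is the {line, offset_to_start_of_line} tuple and
  # `byte_offset` is the absolute byte offset returned alongside it.

  # Zero-based column measured in bytes.
  def byte_column({_line, line_start}, byte_offset)
      when byte_offset >= line_start do
    byte_offset - line_start
  end

  # Zero-based column measured in characters (graphemes): count the
  # characters between the start of the line and the byte offset.
  def char_column(text, {_line, line_start}, byte_offset)
      when byte_offset >= line_start do
    text
    |> binary_part(line_start, byte_offset - line_start)
    |> String.length()
  end
end

# "Hola\n" occupies 5 bytes, so line 2 starts at byte offset 5.
# "¿" is 2 bytes in UTF-8, so byte offset 7 points just after it.
text = "Hola\n¿Qué"

ColumnSketch.byte_column({2, 5}, 7)
# => 2

ColumnSketch.char_column(text, {2, 5}, 7)
# => 1
```

The two results differ because ¿ takes two bytes but counts as one character, which matters when reporting positions to users.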
@spec tokenize(String.t(), keyword()) :: {:ok, [Nasty.AST.Token.t()]} | {:error, term()}
Tokenizes Spanish text into Token structs.
Returns a list of Token structs with:
- Accurate text content
- Position information (line, column, byte offset)
- Span covering the token's location
- Language set to :es
Note: POS tags and morphology are not set by the tokenizer; those are added by the POS tagger.
Parameters
- text - The Spanish text to tokenize
- opts - Options (currently unused)
Returns
- {:ok, tokens} - List of Token structs
- {:error, reason} - Parse error
Examples
iex> {:ok, tokens} = Spanish.Tokenizer.tokenize("¡Hola!")
iex> length(tokens)
3
iex> {:ok, tokens} = Spanish.Tokenizer.tokenize("Dámelo ahora.")
iex> Enum.map(tokens, & &1.text)
["Dámelo", "ahora", "."]
iex> {:ok, tokens} = Spanish.Tokenizer.tokenize("¿Cómo estás?")
iex> Enum.map(tokens, & &1.text)
["¿", "Cómo", "estás", "?"]
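Since tokenize returns either {:ok, tokens} or {:error, reason}, callers will usually want to handle both shapes. The sketch below shows one way to do that with function clauses. The module TokenResultSketch and its texts/1 function are hypothetical, and the sample input uses plain maps with a :text key as stand-ins for Nasty.AST.Token structs so that the snippet runs without Nasty installed.

```elixir
defmodule TokenResultSketch do
  # Hypothetical helper for illustration only.
  # Extracts token texts on success and passes errors through unchanged.
  def texts({:ok, tokens}), do: {:ok, Enum.map(tokens, & &1.text)}
  def texts({:error, reason}), do: {:error, reason}
end

# Stand-in for the result of tokenizing "¡Hola!": maps with a :text key
# instead of real Token structs.
result = {:ok, [%{text: "¡"}, %{text: "Hola"}, %{text: "!"}]}

TokenResultSketch.texts(result)
# => {:ok, ["¡", "Hola", "!"]}

TokenResultSketch.texts({:error, :unexpected_input})
# => {:error, :unexpected_input}
```

With Nasty available, the same function could be applied directly to the return value of Spanish.Tokenizer.tokenize/2, since only the :text field is accessed.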