Nasty.Language.Spanish.Tokenizer (Nasty v0.3.0)

Spanish tokenizer using NimbleParsec.

Tokenizes Spanish text into words, punctuation, numbers, and special tokens with accurate position tracking for AST span information.

Spanish-Specific Features

  • Inverted punctuation: ¿?, ¡!
  • Guillemets: «», ‹›
  • Contractions: del, al
  • Clitic pronouns: dámelo, dáselo, cómetelo
  • Accented characters: á, é, í, ó, ú, ñ, ü
  • Abbreviations: Sr., Sra., Dr., etc.

Examples

iex> {:ok, tokens} = Spanish.Tokenizer.tokenize("¡Hola mundo!")
iex> Enum.map(tokens, & &1.text)
["¡", "Hola", "mundo", "!"]

iex> {:ok, tokens} = Spanish.Tokenizer.tokenize("¿Cómo estás?")
iex> Enum.map(tokens, & &1.text)
["¿", "Cómo", "estás", "?"]

Summary

Functions

parse_text(binary, opts \\ [])

Parses the given binary as parse_text.

tokenize(text, opts \\ [])

Tokenizes Spanish text into Token structs.

Functions

parse_text(binary, opts \\ [])

@spec parse_text(binary(), keyword()) ::
  {:ok, [term()], rest, context, line, byte_offset}
  | {:error, reason, rest, context, line, byte_offset}
when line: {pos_integer(), byte_offset},
     byte_offset: non_neg_integer(),
     rest: binary(),
     reason: String.t(),
     context: map()

Parses the given binary as parse_text.

Returns {:ok, [token], rest, context, position, byte_offset} on success or {:error, reason, rest, context, line, byte_offset} on failure. Here position (named line in the spec and in the error tuple) is a {line, offset_to_start_of_line} tuple describing the location of the parse_text (start position).

The column where the error occurred can be inferred from byte_offset - offset_to_start_of_line.
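The column arithmetic above can be sketched as a small helper. This is a minimal illustration, not part of Nasty: the module name ColumnSketch and both function names are hypothetical. Because the offsets are byte offsets, the result is a zero-based byte column; accented Spanish letters and inverted punctuation take two bytes each in UTF-8, so the second function converts a byte column into a character count, assuming the caller has the text of the line at hand.

```elixir
defmodule ColumnSketch do
  # Hypothetical helper, not part of Nasty.
  # Zero-based column in bytes, from the byte_offset and the
  # {line, offset_to_start_of_line} tuple of a parse_text/2 result.
  def byte_column(byte_offset, {_line, offset_to_start_of_line}) do
    byte_offset - offset_to_start_of_line
  end

  # Converts a byte column into a count of characters (graphemes),
  # given the text of the line. Assumes byte_col falls on a character
  # boundary and does not exceed byte_size(line_text).
  def char_column(line_text, byte_col) do
    line_text
    |> binary_part(0, byte_col)
    |> String.length()
  end
end
```

For instance, a byte_offset of 19 with a line tuple of {2, 14} gives a byte column of 5, and a byte column of 7 into "¿Cómo estás?" corresponds to 5 characters, since "¿" and "ó" occupy two bytes each.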

Options

  • :byte_offset - the byte offset for the whole binary, defaults to 0
  • :line - the line and the byte offset into that line, defaults to {1, byte_offset}
  • :context - the initial context value, which is converted to a map
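When only a slice of a larger text is parsed, the :byte_offset and :line options let the reported positions refer to the full text. The following is a rough sketch of how such options might be derived. The module ParseOptsSketch and its opts_for/2 function are hypothetical and not part of Nasty; the sketch assumes lines are separated by "\n" and that the second element of :line uses the same byte coordinates as :byte_offset, as the column formula suggests.

```elixir
defmodule ParseOptsSketch do
  # Hypothetical helper, not part of Nasty.
  # Builds the :byte_offset and :line options for a chunk that starts
  # `offset` bytes into `full_text`.
  def opts_for(full_text, offset)
      when is_binary(full_text) and offset >= 0 and offset <= byte_size(full_text) do
    prefix = binary_part(full_text, 0, offset)

    # Byte positions immediately after each newline in the prefix,
    # i.e. where each subsequent line starts.
    line_starts =
      prefix
      |> :binary.matches("\n")
      |> Enum.map(fn {pos, _len} -> pos + 1 end)

    line_number = length(line_starts) + 1
    # List.last/1 returns nil for an empty list, so the first line starts at 0.
    line_start = List.last(line_starts) || 0

    [byte_offset: offset, line: {line_number, line_start}]
  end
end
```

The resulting keyword list could then be passed as the second argument to parse_text/2 together with the chunk, for example the part of the text starting at that offset.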

tokenize(text, opts \\ [])

@spec tokenize(
  String.t(),
  keyword()
) :: {:ok, [Nasty.AST.Token.t()]} | {:error, term()}

Tokenizes Spanish text into Token structs.

On success, returns {:ok, tokens}, where each Token struct carries:

  • Accurate text content
  • Position information (line, column, byte offset)
  • Span covering the token's location
  • Language set to :es

Note: POS tags and morphology are not set by the tokenizer; those are added by the POS tagger.

Parameters

  • text - The Spanish text to tokenize
  • opts - Options (currently unused)

Returns

  • {:ok, tokens} - List of Token structs
  • {:error, reason} - Parse error
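Both return shapes can be handled with pattern matching in function heads. The sketch below is illustrative only: the module TokenizeResultSketch and its texts/1 function are hypothetical, and the only token field it relies on is text, which the examples on this page use. It accepts the result of tokenize/2 rather than calling it, so it can be tried with plain maps standing in for Token structs.

```elixir
defmodule TokenizeResultSketch do
  # Hypothetical helper, not part of Nasty.
  # Success: extract the text of every token.
  def texts({:ok, tokens}) when is_list(tokens) do
    {:ok, Enum.map(tokens, & &1.text)}
  end

  # Failure: wrap the reason in a readable message.
  def texts({:error, reason}) do
    {:error, "tokenization failed: #{inspect(reason)}"}
  end
end
```

In application code the argument would typically be piped in, as in Spanish.Tokenizer.tokenize(text) |> TokenizeResultSketch.texts().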

Examples

iex> {:ok, tokens} = Spanish.Tokenizer.tokenize("¡Hola!")
iex> length(tokens)
3

iex> {:ok, tokens} = Spanish.Tokenizer.tokenize("Dámelo ahora.")
iex> Enum.map(tokens, & &1.text)
["Dámelo", "ahora", "."]

iex> {:ok, tokens} = Spanish.Tokenizer.tokenize("¿Cómo estás?")
iex> Enum.map(tokens, & &1.text)
["¿", "Cómo", "estás", "?"]