ExNlp.Tokenizer.Base (ex_nlp v0.1.0)

Base module for tokenizer implementations.

Defines common types and helper functions for tokenizers.

Summary

Types

span()

A span representing the start and end offsets of a token.

token()

A token with text, position, and offset information.

Functions

tokens_to_spans(tokens)

Converts tokens to spans ({start_offset, end_offset} tuples).

tokens_to_texts(tokens)

Extracts just the text from tokens.

Types

span()

@type span() :: {non_neg_integer(), non_neg_integer()}

A span representing the start and end offsets of a token.

token()

@type token() :: ExNlp.Token.t()

A token with text, position, and offset information.

Functions

tokens_to_spans(tokens)

@spec tokens_to_spans([token()]) :: [span()]

Converts a list of tokens to a list of spans, one {start_offset, end_offset} tuple per token.

Similar to NLTK's span_tokenize method.
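
The conversion described above can be illustrated with a minimal sketch. This is not the ExNlp implementation: SpanSketch is a hypothetical module, plain maps stand in for ExNlp.Token structs, and the field names :start_offset and :end_offset are assumptions inferred from the description. The sketch also assumes end offsets are exclusive, as in NLTK's span_tokenize; the page itself does not say.

```elixir
defmodule SpanSketch do
  # Hypothetical stand-in, not the ExNlp implementation. It assumes each token
  # is a map or struct with :start_offset and :end_offset keys.
  @type span :: {non_neg_integer(), non_neg_integer()}

  @spec tokens_to_spans([map()]) :: [span()]
  def tokens_to_spans(tokens) do
    # Keep only the two offsets of each token, preserving input order.
    Enum.map(tokens, fn %{start_offset: start_offset, end_offset: end_offset} ->
      {start_offset, end_offset}
    end)
  end
end

# Plain maps stand in for ExNlp.Token structs. The offsets were written by hand
# for the string "Hello world", with exclusive end offsets (an assumption).
tokens = [
  %{text: "Hello", position: 0, start_offset: 0, end_offset: 5},
  %{text: "world", position: 1, start_offset: 6, end_offset: 11}
]

spans = SpanSketch.tokens_to_spans(tokens)
IO.inspect(spans)
# prints [{0, 5}, {6, 11}]
```

Because the function pattern-matches on map keys, the same sketch would accept any struct that carries those two fields.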

tokens_to_texts(tokens)

@spec tokens_to_texts([token()]) :: [String.t()]

Extracts just the text from tokens.
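
A comparable sketch for text extraction, under the same caveats: TextSketch is a hypothetical module, plain maps stand in for ExNlp.Token structs, and the :text field name is an assumption based on the token() description.

```elixir
defmodule TextSketch do
  # Hypothetical stand-in, not the ExNlp implementation. It assumes each token
  # is a map or struct with a :text key.
  @spec tokens_to_texts([map()]) :: [String.t()]
  def tokens_to_texts(tokens) do
    # Drop position and offset information, keeping the text in input order.
    Enum.map(tokens, fn %{text: text} -> text end)
  end
end

# Plain maps stand in for ExNlp.Token structs.
tokens = [
  %{text: "Hello", position: 0, start_offset: 0, end_offset: 5},
  %{text: "world", position: 1, start_offset: 6, end_offset: 11}
]

texts = TextSketch.tokens_to_texts(tokens)
IO.inspect(texts)
# prints ["Hello", "world"]
```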