ExNlp.Tokenizer.Ngram (ex_nlp v0.1.0)


Character n-gram tokenizer - generates character-level n-grams from input text.

Useful for fuzzy (approximate) string matching. Similar to the character n-gram tokenizers found in other NLP libraries.

Examples

iex> ExNlp.Tokenizer.Ngram.tokenize("hello", 2, 2)
[
  %ExNlp.Token{text: "he", position: 0, start_offset: 0, end_offset: 2},
  %ExNlp.Token{text: "el", position: 1, start_offset: 1, end_offset: 3},
  %ExNlp.Token{text: "ll", position: 2, start_offset: 2, end_offset: 4},
  %ExNlp.Token{text: "lo", position: 3, start_offset: 3, end_offset: 5}
]

iex> ExNlp.Tokenizer.Ngram.tokenize("hello", 2, 3)
[
  %ExNlp.Token{text: "he", position: 0, start_offset: 0, end_offset: 2},
  %ExNlp.Token{text: "el", position: 1, start_offset: 1, end_offset: 3},
  %ExNlp.Token{text: "ll", position: 2, start_offset: 2, end_offset: 4},
  %ExNlp.Token{text: "lo", position: 3, start_offset: 3, end_offset: 5},
  %ExNlp.Token{text: "hel", position: 4, start_offset: 0, end_offset: 3},
  %ExNlp.Token{text: "ell", position: 5, start_offset: 1, end_offset: 4},
  %ExNlp.Token{text: "llo", position: 6, start_offset: 2, end_offset: 5}
]
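To illustrate the fuzzy-matching use case mentioned above, here is a self-contained sketch. The `NgramSketch` module and its functions are hypothetical helpers for illustration, not part of ex_nlp; `ngrams/2` mirrors what `ExNlp.Tokenizer.Ngram.tokenize_text/3` returns for a single gram size, and `similarity/3` compares two strings by the Jaccard overlap of their n-gram sets:

```elixir
defmodule NgramSketch do
  # Generate the character n-grams of a string as plain strings.
  def ngrams(text, n) do
    text
    |> String.graphemes()
    |> Enum.chunk_every(n, 1, :discard)
    |> Enum.map(&Enum.join/1)
  end

  # Jaccard similarity over the two n-gram sets: |A ∩ B| / |A ∪ B|.
  def similarity(a, b, n \\ 2) do
    set_a = MapSet.new(ngrams(a, n))
    set_b = MapSet.new(ngrams(b, n))
    union = MapSet.size(MapSet.union(set_a, set_b))

    if union == 0 do
      0.0
    else
      MapSet.size(MapSet.intersection(set_a, set_b)) / union
    end
  end
end

NgramSketch.ngrams("hello", 2)
# => ["he", "el", "ll", "lo"]

NgramSketch.similarity("hello", "hallo")
# shares "ll" and "lo" out of six distinct bigrams, so roughly 0.33
```

Strings that differ by a single character still share most of their n-grams, which is what makes character n-grams useful for approximate matching.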

Summary

Functions

Returns spans (start_offset, end_offset) for tokens.

Tokenizes text and returns just the text strings (no Token structs).

Types

span()

@type span() :: ExNlp.Tokenizer.Base.span()

token()

@type token() :: ExNlp.Tokenizer.Base.token()

Functions

span_tokenize(text, min_gram \\ 2, max_gram \\ 3)

@spec span_tokenize(String.t(), pos_integer(), pos_integer()) :: [span()]

Returns spans (start_offset, end_offset) for tokens.

Similar to NLTK's span_tokenize method; this implementation is not heavily optimized.
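A minimal self-contained sketch of the span generation, assuming spans are `{start_offset, end_offset}` tuples (an assumption based on the wording above; `SpanSketch` is a hypothetical module, not part of ex_nlp). For each gram size `n` from `min_gram` to `max_gram`, every start position up to `length - n` yields one span, matching the ordering shown in the tokenize examples:

```elixir
defmodule SpanSketch do
  # Enumerate {start, end} character offsets for all n-grams,
  # smaller gram sizes first, as in the tokenize/3 examples.
  def span_tokenize(text, min_gram \\ 2, max_gram \\ 3) do
    len = String.length(text)

    for n <- min_gram..max_gram, start <- 0..(len - n)//1 do
      {start, start + n}
    end
  end
end

SpanSketch.span_tokenize("hello", 2, 2)
# => [{0, 2}, {1, 3}, {2, 4}, {3, 5}]
```

The `//1` step keeps the inner range empty when the text is shorter than `n`, so short inputs simply produce no spans for that gram size.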

tokenize(text, min_gram \\ 2, max_gram \\ 3)

@spec tokenize(String.t(), pos_integer(), pos_integer()) :: [token()]

Tokenizes text into character n-gram Token structs with position and offset information.

tokenize_text(text, min_gram \\ 2, max_gram \\ 3)

@spec tokenize_text(String.t(), pos_integer(), pos_integer()) :: [String.t()]

Tokenizes text and returns just the text strings (no Token structs).

More efficient when you don't need position or offset information.

Examples

iex> ExNlp.Tokenizer.Ngram.tokenize_text("hello", 2, 2)
["he", "el", "ll", "lo"]