ExNlp.Tokenizer.Ngram (ex_nlp v0.1.0)
Character n-gram tokenizer that generates character-level n-grams.
Useful for fuzzy and approximate string matching, similar to the character n-gram tokenizers found in other NLP libraries.
Examples
iex> ExNlp.Tokenizer.Ngram.tokenize("hello", 2, 2)
[
%ExNlp.Token{text: "he", position: 0, start_offset: 0, end_offset: 2},
%ExNlp.Token{text: "el", position: 1, start_offset: 1, end_offset: 3},
%ExNlp.Token{text: "ll", position: 2, start_offset: 2, end_offset: 4},
%ExNlp.Token{text: "lo", position: 3, start_offset: 3, end_offset: 5}
]
iex> ExNlp.Tokenizer.Ngram.tokenize("hello", 2, 3)
[
%ExNlp.Token{text: "he", position: 0, start_offset: 0, end_offset: 2},
%ExNlp.Token{text: "el", position: 1, start_offset: 1, end_offset: 3},
%ExNlp.Token{text: "ll", position: 2, start_offset: 2, end_offset: 4},
%ExNlp.Token{text: "lo", position: 3, start_offset: 3, end_offset: 5},
%ExNlp.Token{text: "hel", position: 4, start_offset: 0, end_offset: 3},
%ExNlp.Token{text: "ell", position: 5, start_offset: 1, end_offset: 4},
%ExNlp.Token{text: "llo", position: 6, start_offset: 2, end_offset: 5}
]
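To make the behavior concrete, here is a minimal sketch of how the n-gram generation shown above could be implemented. `NgramSketch` is a hypothetical module written for illustration, not the library's source; it mirrors the documented ordering (all 2-grams first, then all 3-grams) and the documented character-level offsets by working on graphemes.

```elixir
defmodule NgramSketch do
  @moduledoc """
  Illustrative character n-gram generation (not part of ex_nlp).
  """

  # Returns every n-gram of length min_n through max_n, grouped by
  # ascending n, as plain strings. Grapheme-based, so multibyte
  # characters count as one unit.
  def tokenize_text(text, min_n, max_n) do
    graphemes = String.graphemes(text)
    len = length(graphemes)

    for n <- min_n..max_n,
        # The stepped range is empty when n exceeds the text length,
        # so short inputs simply yield no n-grams of that size.
        start <- 0..(len - n)//1 do
      graphemes |> Enum.slice(start, n) |> Enum.join()
    end
  end
end

NgramSketch.tokenize_text("hello", 2, 3)
# => ["he", "el", "ll", "lo", "hel", "ell", "llo"]
```

Wrapping each string with its index and offsets (`position`, `start_offset`, `end_offset`) would then yield the `%ExNlp.Token{}` structs shown in the examples.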
Summary
Functions
Returns spans (start_offset, end_offset) for tokens.
Tokenizes text into character n-gram tokens with positions and offsets.
Tokenizes text and returns just the text strings (no Token structs).
Types
@type span() :: ExNlp.Tokenizer.Base.span()
@type token() :: ExNlp.Tokenizer.Base.token()
Functions
@spec span_tokenize(String.t(), pos_integer(), pos_integer()) :: [span()]
Returns spans (start_offset, end_offset) for tokens.
Similar to NLTK's span_tokenize method, though not heavily optimized.
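Since each token's span is fully determined by its start index and n-gram length, the computation can be sketched without building token structs at all. `SpanSketch` is a hypothetical module, and it assumes spans are `{start_offset, end_offset}` tuples; the offsets match those in the tokenize/3 examples above.

```elixir
defmodule SpanSketch do
  @moduledoc """
  Illustrative span computation (not part of ex_nlp).
  """

  # Emits {start_offset, end_offset} for every n-gram of length
  # min_n..max_n, in the same order as the tokenize examples.
  def span_tokenize(text, min_n, max_n) do
    len = String.length(text)

    for n <- min_n..max_n,
        start <- 0..(len - n)//1 do
      {start, start + n}
    end
  end
end

SpanSketch.span_tokenize("hello", 2, 2)
# => [{0, 2}, {1, 3}, {2, 4}, {3, 5}]
```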
@spec tokenize(String.t(), pos_integer(), pos_integer()) :: [token()]
Tokenizes text into ExNlp.Token structs, producing every character n-gram of length min_n through max_n together with its position and offsets (see the examples above).
@spec tokenize_text(String.t(), pos_integer(), pos_integer()) :: [String.t()]
Tokenizes text and returns just the text strings (no Token structs).
More efficient when you don't need position or offset information.
Examples
iex> ExNlp.Tokenizer.Ngram.tokenize_text("hello", 2, 2)
["he", "el", "ll", "lo"]