ExNlp.Ngram (ex_nlp v0.1.0)
Word-level n-gram generation.
This module generates word n-grams from tokenized text, useful for phrase
matching and language modeling. Note: This is separate from Tokenizer.Ngram,
which generates character-level n-grams for tokenization.
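To make the word-level versus character-level distinction concrete, here is a minimal sketch. The module `NgramContrastSketch` and both of its functions are hypothetical and are not part of ex_nlp. The character-level half is a generic illustration and does not claim to reproduce what Tokenizer.Ngram does.

```elixir
defmodule NgramContrastSketch do
  # Hypothetical helper module for illustration only (not part of ex_nlp).

  # Word-level: slide a window of two tokens over a token list.
  def word_bigrams(tokens) do
    tokens
    |> Enum.chunk_every(2, 1, :discard)
    |> Enum.map(&Enum.join(&1, " "))
  end

  # Character-level: slide a window of two graphemes over a single string.
  def char_bigrams(text) do
    text
    |> String.graphemes()
    |> Enum.chunk_every(2, 1, :discard)
    |> Enum.map(&Enum.join/1)
  end
end

NgramContrastSketch.word_bigrams(["the", "quick", "brown"])
# => ["the quick", "quick brown"]

NgramContrastSketch.char_bigrams("fox")
# => ["fo", "ox"]
```

Word-level n-grams preserve whole tokens, which suits phrase matching. Character-level n-grams cut across token boundaries.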
Examples
# Generate bigrams and trigrams
iex> tokens = ["the", "quick", "brown", "fox"]
iex> ExNlp.Ngram.word_ngrams(tokens, 2, 3)
["the quick", "quick brown", "brown fox", "the quick brown", "quick brown fox"]
# Convenience functions
iex> ExNlp.Ngram.bigrams(["the", "quick", "brown"])
["the quick", "quick brown"]
iex> ExNlp.Ngram.trigrams(["the", "quick", "brown", "fox"])
["the quick brown", "quick brown fox"]
Types
@type token() :: String.t()
A token is a string.
Functions
Generates bigrams (2-grams) from tokens.
Examples
iex> ExNlp.Ngram.bigrams(["the", "quick", "brown"])
["the quick", "quick brown"]
Generates fourgrams (4-grams) from tokens.
Examples
iex> ExNlp.Ngram.fourgrams(["the", "quick", "brown", "fox", "jumps"])
["the quick brown fox", "quick brown fox jumps"]
Generates trigrams (3-grams) from tokens.
Examples
iex> ExNlp.Ngram.trigrams(["the", "quick", "brown", "fox"])
["the quick brown", "quick brown fox"]
@spec word_ngrams([token()], pos_integer(), pos_integer()) :: [String.t()]
Generates word n-grams of specified lengths.
Returns a list of n-gram strings (space-separated words) for all n values
from min_gram to max_gram inclusive.
Examples
iex> ExNlp.Ngram.word_ngrams(["the", "quick", "brown"], 2, 2)
["the quick", "quick brown"]
iex> ExNlp.Ngram.word_ngrams(["the", "quick", "brown", "fox"], 2, 3)
["the quick", "quick brown", "brown fox", "the quick brown", "quick brown fox"]
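The output order above (all bigrams first, then all trigrams) is consistent with a sliding window applied once per n. The following is a sketch of one way to get the same results, assuming `min_gram <= max_gram`. The module `NgramSketch` is hypothetical, and this is not necessarily how ex_nlp implements `word_ngrams`.

```elixir
defmodule NgramSketch do
  # Hypothetical module for illustration only (not part of ex_nlp).
  # Builds word n-grams for every n from min_gram to max_gram inclusive.
  def word_ngrams(tokens, min_gram, max_gram) when min_gram <= max_gram do
    Enum.flat_map(min_gram..max_gram, fn n ->
      tokens
      # Window of n tokens, advancing one token at a time;
      # :discard drops trailing windows shorter than n.
      |> Enum.chunk_every(n, 1, :discard)
      |> Enum.map(&Enum.join(&1, " "))
    end)
  end
end

NgramSketch.word_ngrams(["the", "quick", "brown", "fox"], 2, 3)
# => ["the quick", "quick brown", "brown fox", "the quick brown", "quick brown fox"]
```

In this sketch, a token list shorter than n contributes no n-grams for that n. Whether the library behaves the same way in that case is not stated in this page.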
@spec word_ngrams_with_position([token()], pos_integer(), pos_integer()) :: [{non_neg_integer(), String.t()}]
Generates word n-grams with position tracking.
Returns a list of {position, n_gram} tuples, where position is the zero-based
starting index of the n-gram in the token list.
Examples
iex> ExNlp.Ngram.word_ngrams_with_position(["the", "quick", "brown"], 2, 2)
[{0, "the quick"}, {1, "quick brown"}]
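Position tracking can be sketched by indexing each sliding window. With a step of one token, the window index equals the starting index of the n-gram. The module `NgramPositionSketch` is hypothetical and only mirrors the tuple order shown in the spec and example. It is not necessarily the library's implementation.

```elixir
defmodule NgramPositionSketch do
  # Hypothetical module for illustration only (not part of ex_nlp).
  # Returns {position, n_gram} tuples for every n from min_gram to max_gram.
  def word_ngrams_with_position(tokens, min_gram, max_gram) when min_gram <= max_gram do
    Enum.flat_map(min_gram..max_gram, fn n ->
      tokens
      |> Enum.chunk_every(n, 1, :discard)
      # Windows advance one token at a time, so the window index
      # is the zero-based index of the window's first token.
      |> Enum.with_index()
      |> Enum.map(fn {window, position} -> {position, Enum.join(window, " ")} end)
    end)
  end
end

NgramPositionSketch.word_ngrams_with_position(["the", "quick", "brown"], 2, 2)
# => [{0, "the quick"}, {1, "quick brown"}]
```

The positions make it possible to map a matched phrase back to its location in the original token list.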