ExNlp.Ngram (ex_nlp v0.1.0)

Word-level n-gram generation.

This module generates word n-grams from tokenized text, which is useful for phrase matching and language modeling. Note: this module is separate from Tokenizer.Ngram, which generates character-level n-grams for tokenization.
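To make the word-level versus character-level distinction concrete, here is a minimal sketch that uses only the Elixir standard library. The module `NgramContrastSketch` and both of its functions are hypothetical names chosen for illustration; they are not the ExNlp or Tokenizer.Ngram implementations, and the real modules may differ in options and edge-case handling.

```elixir
defmodule NgramContrastSketch do
  # Hypothetical helper: character-level n-grams slide a window over graphemes.
  def char_ngrams(text, n) when is_binary(text) and n > 0 do
    text
    |> String.graphemes()
    |> Enum.chunk_every(n, 1, :discard)
    |> Enum.map(&Enum.join/1)
  end

  # Hypothetical helper: word-level n-grams slide a window over whole tokens
  # and join each window with a single space.
  def word_ngrams(tokens, n) when is_list(tokens) and n > 0 do
    tokens
    |> Enum.chunk_every(n, 1, :discard)
    |> Enum.map(&Enum.join(&1, " "))
  end
end

NgramContrastSketch.char_ngrams("fox", 2)
# => ["fo", "ox"]

NgramContrastSketch.word_ngrams(["quick", "brown", "fox"], 2)
# => ["quick brown", "brown fox"]
```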

Examples

# Generate bigrams and trigrams
iex> tokens = ["the", "quick", "brown", "fox"]
iex> ExNlp.Ngram.word_ngrams(tokens, 2, 3)
["the quick", "quick brown", "brown fox", "the quick brown", "quick brown fox"]

# Convenience functions
iex> ExNlp.Ngram.bigrams(["the", "quick", "brown"])
["the quick", "quick brown"]

iex> ExNlp.Ngram.trigrams(["the", "quick", "brown", "fox"])
["the quick brown", "quick brown fox"]

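Summary

Types

token()

A token is a string

Functions

bigrams(tokens)

Generates bigrams (2-grams) from tokens.

fourgrams(tokens)

Generates fourgrams (4-grams) from tokens.

trigrams(tokens)

Generates trigrams (3-grams) from tokens.

word_ngrams(tokens, min_gram, max_gram)

Generates word n-grams of specified lengths.

word_ngrams_with_position(tokens, min_gram, max_gram)

Generates word n-grams with position tracking.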

Types

token()

@type token() :: String.t()

A token is a string

Functions

bigrams(tokens)

@spec bigrams([token()]) :: [String.t()]

Generates bigrams (2-grams) from tokens.

Examples

iex> ExNlp.Ngram.bigrams(["the", "quick", "brown"])
["the quick", "quick brown"]

fourgrams(tokens)

@spec fourgrams([token()]) :: [String.t()]

Generates fourgrams (4-grams) from tokens.

Examples

iex> ExNlp.Ngram.fourgrams(["the", "quick", "brown", "fox", "jumps"])
["the quick brown fox", "quick brown fox jumps"]

trigrams(tokens)

@spec trigrams([token()]) :: [String.t()]

Generates trigrams (3-grams) from tokens.

Examples

iex> ExNlp.Ngram.trigrams(["the", "quick", "brown", "fox"])
["the quick brown", "quick brown fox"]

word_ngrams(tokens, min_gram, max_gram)

@spec word_ngrams([token()], pos_integer(), pos_integer()) :: [String.t()]

Generates word n-grams of specified lengths.

Returns a list of n-gram strings (space-separated words) for all n values from min_gram to max_gram inclusive.

Examples

iex> ExNlp.Ngram.word_ngrams(["the", "quick", "brown"], 2, 2)
["the quick", "quick brown"]

iex> ExNlp.Ngram.word_ngrams(["the", "quick", "brown", "fox"], 2, 3)
["the quick", "quick brown", "brown fox", "the quick brown", "quick brown fox"]
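The inclusive min_gram to max_gram range can be pictured as one sliding-window pass per value of n, with the results concatenated in order of increasing n. The sketch below reproduces the documented output under that assumption. `NgramSketch` is a hypothetical module written for illustration, not the ExNlp source, and the guard requiring `min_gram <= max_gram` is an assumption of this sketch rather than documented behavior.

```elixir
defmodule NgramSketch do
  # Hypothetical re-implementation for illustration only.
  # Assumes min_gram <= max_gram; the library's handling of other inputs is not documented here.
  def word_ngrams(tokens, min_gram, max_gram)
      when is_list(tokens) and min_gram > 0 and min_gram <= max_gram do
    Enum.flat_map(min_gram..max_gram, fn n ->
      tokens
      # Windows of n tokens, advancing one token at a time; short trailing windows are dropped.
      |> Enum.chunk_every(n, 1, :discard)
      |> Enum.map(&Enum.join(&1, " "))
    end)
  end
end

NgramSketch.word_ngrams(["the", "quick", "brown", "fox"], 2, 3)
# => ["the quick", "quick brown", "brown fox", "the quick brown", "quick brown fox"]
```

In this sketch, a token list shorter than n simply contributes no n-grams for that n, because `:discard` drops incomplete windows.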

word_ngrams_with_position(tokens, min_gram, max_gram)

@spec word_ngrams_with_position([token()], pos_integer(), pos_integer()) :: [
  {non_neg_integer(), String.t()}
]

Generates word n-grams with position tracking.

Returns a list of {position, n_gram} tuples, where position is the zero-based starting index of the n-gram in the token list.

Examples

iex> ExNlp.Ngram.word_ngrams_with_position(["the", "quick", "brown"], 2, 2)
[{0, "the quick"}, {1, "quick brown"}]
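One way to obtain the starting index is to number the sliding windows as they are produced, since the k-th window of any size starts at token k. The sketch below does this with `Enum.with_index/1`. `NgramPositionSketch` is a hypothetical module for illustration, not the ExNlp source, and the ordering of results across different n values (grouped by n here) is an assumption, because the documented example covers only a single n.

```elixir
defmodule NgramPositionSketch do
  # Hypothetical re-implementation for illustration only.
  # Assumes min_gram <= max_gram and groups results by n-gram length.
  def word_ngrams_with_position(tokens, min_gram, max_gram)
      when is_list(tokens) and min_gram > 0 and min_gram <= max_gram do
    Enum.flat_map(min_gram..max_gram, fn n ->
      tokens
      |> Enum.chunk_every(n, 1, :discard)
      # The index of each window equals the index of its first token.
      |> Enum.with_index()
      |> Enum.map(fn {chunk, index} -> {index, Enum.join(chunk, " ")} end)
    end)
  end
end

NgramPositionSketch.word_ngrams_with_position(["the", "quick", "brown"], 2, 2)
# => [{0, "the quick"}, {1, "quick brown"}]
```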