Extract statistically significant word bigrams from a token stream.
Given a sequence of tokens (or a string that gets tokenised), returns the bigrams that occur together more often than chance would predict. Useful for surfacing multi-word expressions ("New York", "machine learning"), finding domain-specific phrases in a corpus, and as a pre-processing step for higher-order analytics.
Association measures
Three measures are supported:
- :frequency - raw bigram count. Cheapest baseline; biased toward high-frequency stop-word pairs.
- :pmi - pointwise mutual information, log(P(a,b) / (P(a) · P(b))). Highlights pairs that appear together far more than independence predicts. Sensitive to rare pairs, so combine with :min_count.
- :log_likelihood - Dunning's log-likelihood ratio (G²). Less sensitive to rare-pair noise than PMI; the standard choice in corpus linguistics.
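The three measures can be sketched from four counts: the bigram count, the count of bigrams starting with a, the count of bigrams ending with b, and the total number of bigrams. The AssocSketch module and its function names below are hypothetical and written only for illustration; the natural logarithm and the 2x2 contingency-table form of G² are assumptions, and the library's actual implementation may differ.

```elixir
defmodule AssocSketch do
  # Hypothetical helper module, not part of Text.Collocation.
  # c_ab: count of the bigram (a, b)
  # c_a:  count of bigrams whose first token is a
  # c_b:  count of bigrams whose second token is b
  # n:    total number of bigrams in the stream

  # Raw frequency: the bigram count itself.
  def frequency(c_ab, _c_a, _c_b, _n), do: c_ab

  # PMI = log(P(a,b) / (P(a) * P(b))), using the natural log (an assumption).
  def pmi(c_ab, c_a, c_b, n) do
    :math.log(c_ab * n / (c_a * c_b))
  end

  # Dunning's G² = 2 * sum(O * ln(O / E)) over the 2x2 contingency table.
  # Cells with an observed count of 0 contribute nothing and are skipped.
  def log_likelihood(c_ab, c_a, c_b, n) do
    observed = [c_ab, c_a - c_ab, c_b - c_ab, n - c_a - c_b + c_ab]

    expected = [
      c_a * c_b / n,
      c_a * (n - c_b) / n,
      (n - c_a) * c_b / n,
      (n - c_a) * (n - c_b) / n
    ]

    observed
    |> Enum.zip(expected)
    |> Enum.reduce(0.0, fn
      {0, _e}, acc -> acc
      {o, e}, acc -> acc + o * :math.log(o / e)
    end)
    |> Kernel.*(2)
  end
end
```

When the counts match independence exactly (one joint occurrence, marginals of 2 and 2, out of 4 bigrams), both PMI and G² come out as zero; a pair that always co-occurs scores above zero on both.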
Defaults
By default, bigrams/2 lower-cases tokens, drops bigrams that
appear fewer than 3 times, and ranks by :log_likelihood. Override
via the keyword options.
Types
@type bigram() :: [String.t()]
A token bigram represented as a 2-element list.
@type measure() :: :frequency | :pmi | :log_likelihood
An association measure used to rank bigrams.
Functions
Returns the most strongly-associated bigrams in input.
Arguments
- input is either:
  - a string (tokenised via the :tokenizer option), or
  - a pre-tokenised list of strings (treated as a single contiguous token stream).
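As a rough illustration of the two input shapes, the sketch below defines a string-to-tokens function of the kind the :tokenizer option expects. TokenizerSketch and its regex are hypothetical and are not the library's default tokenizer; the commented calls at the end only show how such a function might be passed in.

```elixir
defmodule TokenizerSketch do
  # Hypothetical tokenizer: keeps runs of letters, digits, and apostrophes,
  # dropping punctuation and whitespace.
  def words(text) when is_binary(text) do
    ~r/[\p{L}\p{N}']+/u
    |> Regex.scan(text)
    |> List.flatten()
  end
end

# Assumed usage, based on the argument description above:
#   Text.Collocation.bigrams("New York, New York", tokenizer: &TokenizerSketch.words/1)
#   Text.Collocation.bigrams(TokenizerSketch.words("New York, New York"))
```

Pre-tokenising is the more flexible route when tokens come from another pipeline stage, since the list is taken as-is.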
Options
- :measure - :frequency, :pmi, or :log_likelihood. Defaults to :log_likelihood.
- :min_count - drops bigrams with raw count below this. Defaults to 3. Keeps PMI and friends from being dominated by once-seen rare-pair noise.
- :k - number of results. Defaults to 20. Pass :infinity to return all bigrams that pass :min_count.
- :tokenizer - string-to-tokens function used when input is a string. Defaults to &Text.Segment.words/1.
- :fold_case - when true (default), tokens are lowercased.
- :stopwords - a list or MapSet of tokens to exclude from bigrams. A bigram with either side in the stopword set is dropped. Defaults to [].
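How these options might interact can be sketched with a small stand-in that ranks by raw frequency only. PipelineSketch is hypothetical and not the library's code; the order of the steps (fold case, pair adjacent tokens, drop stopword bigrams, count, apply :min_count, take :k) and the alphabetical tie-break are assumptions made for the example.

```elixir
defmodule PipelineSketch do
  # Hypothetical stand-in for illustration; supports frequency ranking only.
  def bigrams(tokens, opts \\ []) when is_list(tokens) do
    min_count = Keyword.get(opts, :min_count, 3)
    k = Keyword.get(opts, :k, 20)
    fold_case = Keyword.get(opts, :fold_case, true)
    # Accepts either a list or a MapSet of stopwords.
    stopwords = opts |> Keyword.get(:stopwords, []) |> MapSet.new()

    tokens = if fold_case, do: Enum.map(tokens, &String.downcase/1), else: tokens

    ranked =
      tokens
      # Adjacent pairs over one contiguous stream: [a, b], [b, c], ...
      |> Enum.chunk_every(2, 1, :discard)
      # Drop a bigram if either side is a stopword.
      |> Enum.reject(fn [a, b] ->
        MapSet.member?(stopwords, a) or MapSet.member?(stopwords, b)
      end)
      |> Enum.frequencies()
      |> Enum.filter(fn {_bigram, count} -> count >= min_count end)
      # Highest count first; ties broken alphabetically (an assumption).
      |> Enum.sort_by(fn {bigram, count} -> {-count, bigram} end)

    if k == :infinity, do: ranked, else: Enum.take(ranked, k)
  end
end
```

Note that stopword bigrams are removed after pairing, so dropping a stopword does not create a new adjacency between its neighbours.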
Returns
- A list of {bigram, score} tuples sorted by score descending, where bigram is a two-element list.
Examples
iex> tokens = ~w[the cat sat on the mat the cat ran fast the cat slept]
iex> [{top_bigram, _score} | _] =
...> Text.Collocation.bigrams(tokens, min_count: 2, k: 5)
iex> top_bigram
["the", "cat"]
iex> Text.Collocation.bigrams("a a a", min_count: 1, measure: :frequency)
[{["a", "a"], 2}]