Text.Collocation (Text v0.5.0)

Extract statistically significant word bigrams from a token stream.

Given a sequence of tokens (or a string that gets tokenised), returns the bigrams that occur together more often than chance would predict. Useful for surfacing multi-word expressions ("New York", "machine learning"), finding domain-specific phrases in a corpus, and as a pre-processing step for higher-order analytics.

Association measures

Three measures are supported:

  • :frequency — raw bigram count. Cheapest baseline; biased toward high-frequency stop-word pairs.

  • :pmi — pointwise mutual information, log(P(a,b) / (P(a) · P(b))). Highlights pairs that appear together far more than independence predicts. Sensitive to rare pairs, so combine with :min_count.

  • :log_likelihood — Dunning's log-likelihood ratio (G²). Less sensitive to rare-pair noise than PMI; the standard choice in corpus linguistics.
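The two probabilistic measures can be sketched from raw counts in plain Elixir. The module and function names below are illustrative, not part of the `Text.Collocation` API, and for simplicity the PMI sketch estimates P(a) and P(b) against the same bigram total `n`:

```elixir
defmodule CollocationMath do
  # c_ab: count of the bigram, c_a / c_b: counts of each word,
  # n: total number of bigrams in the stream.
  def pmi(c_ab, c_a, c_b, n) do
    # log(P(a,b) / (P(a) * P(b))) with all probabilities estimated over n
    :math.log(c_ab * n / (c_a * c_b))
  end

  # Dunning's G² over the 2x2 contingency table for (a, b).
  def log_likelihood(c_ab, c_a, c_b, n) do
    # Observed cells: [O11, O12, O21, O22]
    o = [c_ab, c_a - c_ab, c_b - c_ab, n - c_a - c_b + c_ab]
    # Expected cells under independence, in the same order
    e = for r <- [c_a, n - c_a], c <- [c_b, n - c_b], do: r * c / n

    2 *
      (Enum.zip(o, e)
       |> Enum.map(fn
         {0, _e} -> 0.0
         {oi, ei} -> oi * :math.log(oi / ei)
       end)
       |> Enum.sum())
  end
end
```

For instance, `CollocationMath.pmi(10, 10, 10, 100)` is `log(10)`, and `log_likelihood/4` returns `0.0` when the observed bigram count equals the independence expectation (e.g. `c_ab = 1`, `c_a = c_b = 10`, `n = 100`).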

Defaults

By default, bigrams/2 lower-cases tokens, drops bigrams that appear fewer than 3 times, and ranks by :log_likelihood. Override via the keyword options.

Summary

Types

A token bigram represented as a 2-element list.

An association measure used to rank bigrams.

Functions

Returns the most strongly associated bigrams in input.

Types

bigram()

@type bigram() :: [String.t()]

A token bigram represented as a 2-element list.

measure()

@type measure() :: :frequency | :pmi | :log_likelihood

An association measure used to rank bigrams.

Functions

bigrams(input, options \\ [])

@spec bigrams(
  String.t() | [String.t()],
  keyword()
) :: [{bigram(), number()}]

Returns the most strongly associated bigrams in input.

Arguments

  • input is either:
    • a string (tokenised via the :tokenizer option), or
    • a pre-tokenised list of strings (treated as a single contiguous token stream).

Options

  • :measure — :frequency, :pmi, or :log_likelihood. Defaults to :log_likelihood.

  • :min_count — drops bigrams with raw count below this. Defaults to 3. This keeps rare-pair-sensitive measures such as :pmi from being dominated by bigrams seen only once.

  • :k — number of results. Defaults to 20. Pass :infinity to return all bigrams that pass :min_count.

  • :tokenizer — string-to-tokens function used when input is a string. Defaults to &Text.Segment.words/1.

  • :fold_case — when true (default), tokens are lowercased.

  • :stopwords — a list or MapSet of tokens to exclude from bigrams. A bigram with either side in the stopword set is dropped. Defaults to [].
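Taken together, the options above describe a counting-and-filtering pipeline. A minimal sketch of that pipeline in plain Elixir (illustrative only, using the :frequency measure; this is not the library's implementation):

```elixir
tokens = ~w[The cat sat on the mat the cat ran]
stopwords = MapSet.new(["on"])

ranked =
  tokens
  |> Enum.map(&String.downcase/1)                 # :fold_case
  |> Enum.chunk_every(2, 1, :discard)             # adjacent token pairs
  |> Enum.reject(fn [a, b] ->
    a in stopwords or b in stopwords              # :stopwords
  end)
  |> Enum.frequencies()                           # raw :frequency score
  |> Enum.filter(fn {_bigram, count} -> count >= 2 end)  # :min_count
  |> Enum.sort_by(fn {_bigram, count} -> -count end)
```

With this input, only `{["the", "cat"], 2}` survives the count threshold.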

Returns

  • A list of {bigram, score} tuples sorted by score descending, where bigram is a two-element list.

Examples

iex> tokens = ~w[the cat sat on the mat the cat ran fast the cat slept]
iex> [{top_bigram, _score} | _] =
...>   Text.Collocation.bigrams(tokens, min_count: 2, k: 5)
iex> top_bigram
["the", "cat"]

iex> Text.Collocation.bigrams("a a a", min_count: 1, measure: :frequency)
[{["a", "a"], 2}]