# `Text.Collocation`
[🔗](https://github.com/kipcole9/text/blob/v0.5.0/lib/collocation.ex#L1)

Extract statistically significant word bigrams from a token stream.

Given a sequence of tokens (or a string that gets tokenised), returns
the bigrams that occur together more often than chance would predict.
Useful for surfacing multi-word expressions ("New York", "machine
learning"), finding domain-specific phrases in a corpus, and as a
pre-processing step for higher-order analytics.

### Association measures

Three measures are supported:

* `:frequency` — raw bigram count. Cheapest baseline; biased toward
  high-frequency stop-word pairs.

* `:pmi` — pointwise mutual information,
  `log(P(a,b) / (P(a) · P(b)))`. Highlights pairs that appear
  together far more than independence predicts. Sensitive to rare
  pairs, so combine with `:min_count`.

* `:log_likelihood` — Dunning's log-likelihood ratio (G²). Less
  sensitive to rare-pair noise than PMI; the standard choice in
  corpus linguistics.
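
To make the three measures concrete, here is a minimal sketch that scores a single bigram from a token list. `CollocationSketch` and its functions are hypothetical names, not part of `Text.Collocation`. The sketch assumes natural logarithms and estimates probabilities from the margins of a 2x2 table over adjacent pairs; the library may estimate them differently.

```elixir
defmodule CollocationSketch do
  # Hypothetical helper module for illustration only.

  # Adjacent token pairs as 2-element lists, matching the `bigram()` shape.
  def pairs(tokens), do: Enum.chunk_every(tokens, 2, 1, :discard)

  # :frequency -- raw count of one bigram.
  def frequency(tokens, bigram) do
    tokens |> pairs() |> Enum.count(&(&1 == bigram))
  end

  # 2x2 contingency table for [a, b] over all adjacent pairs:
  # k11 = (a, b), k12 = (a, not b), k21 = (not a, b), k22 = everything else.
  def table(tokens, [a, b]) do
    ps = pairs(tokens)
    k11 = Enum.count(ps, &(&1 == [a, b]))
    k12 = Enum.count(ps, fn [x, y] -> x == a and y != b end)
    k21 = Enum.count(ps, fn [x, y] -> x != a and y == b end)
    k22 = length(ps) - k11 - k12 - k21
    {k11, k12, k21, k22}
  end

  # :pmi -- log(P(a,b) / (P(a) * P(b))).
  # Assumes the bigram occurs at least once; otherwise the log is undefined.
  def pmi(tokens, bigram) do
    {k11, k12, k21, k22} = table(tokens, bigram)
    n = k11 + k12 + k21 + k22
    :math.log(k11 * n / ((k11 + k12) * (k11 + k21)))
  end

  # :log_likelihood -- G² = 2 * sum(observed * ln(observed / expected))
  # over the four cells; empty cells contribute nothing.
  def log_likelihood(tokens, bigram) do
    {k11, k12, k21, k22} = table(tokens, bigram)
    n = k11 + k12 + k21 + k22

    cells = [
      {k11, (k11 + k12) * (k11 + k21) / n},
      {k12, (k11 + k12) * (k12 + k22) / n},
      {k21, (k21 + k22) * (k11 + k21) / n},
      {k22, (k21 + k22) * (k12 + k22) / n}
    ]

    2 *
      Enum.reduce(cells, 0.0, fn
        {0, _expected}, acc -> acc
        {observed, expected}, acc -> acc + observed * :math.log(observed / expected)
      end)
  end
end
```

For the token list `~w[the cat sat on the mat the cat ran fast the cat slept]`, the pair `["the", "cat"]` occurs 3 times among 12 adjacent pairs. "the" opens 4 pairs and "cat" closes 3, so the PMI under these assumptions is `log(3 * 12 / (4 * 3)) = log(3)`.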

### Defaults

By default, `bigrams/2` lower-cases tokens, drops bigrams that
appear fewer than 3 times, and ranks by `:log_likelihood`. Override
via the keyword options.
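
As an illustration, the two calls below should behave the same if the defaults are exactly as documented in the options list for `bigrams/2`. This is a sketch that assumes the `text` package is available as a dependency; the sample sentence is made up.

```elixir
# Illustrative input only.
text = "the cat sat on the mat the cat ran fast the cat slept"

# Relying on the defaults.
implicit = Text.Collocation.bigrams(text)

# The same call with every documented default spelled out.
explicit =
  Text.Collocation.bigrams(text,
    measure: :log_likelihood,
    min_count: 3,
    k: 20,
    tokenizer: &Text.Segment.words/1,
    fold_case: true,
    stopwords: []
  )
```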

# `bigram`

```elixir
@type bigram() :: [String.t()]
```

A token bigram represented as a 2-element list.

# `measure`

```elixir
@type measure() :: :frequency | :pmi | :log_likelihood
```

An association measure used to rank bigrams.

# `bigrams`

```elixir
@spec bigrams(
  String.t() | [String.t()],
  keyword()
) :: [{bigram(), number()}]
```

Returns the most strongly associated bigrams in `input`.

### Arguments

* `input` is either:
  * a string (tokenised via the `:tokenizer` option), or
  * a pre-tokenised list of strings (treated as a single contiguous
    token stream).

### Options

* `:measure` — `:frequency`, `:pmi`, or `:log_likelihood`. Defaults
  to `:log_likelihood`.

* `:min_count` — drops bigrams with raw count below this. Defaults
  to `3`. Prevents PMI and the other measures from being dominated by
  pairs seen only once or twice.

* `:k` — number of results. Defaults to `20`. Pass `:infinity` to
  return all bigrams that pass `:min_count`.

* `:tokenizer` — string-to-tokens function used when `input` is a
  string. Defaults to `&Text.Segment.words/1`.

* `:fold_case` — when `true` (default), tokens are lowercased.

* `:stopwords` — a list or `MapSet` of tokens to exclude from
  bigrams. A bigram with either side in the stopword set is dropped.
  Defaults to `[]`.

### Returns

* A list of `{bigram, score}` tuples sorted by score descending,
  where `bigram` is a two-element list.
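
The way the options interact can be sketched as a small pipeline. `BigramPipelineSketch` is a hypothetical stand-in, not the library's code: it accepts only a pre-tokenised list, ranks by raw frequency, and assumes stopwords are filtered after pairing and compared against the case-folded tokens.

```elixir
defmodule BigramPipelineSketch do
  # Hypothetical module for illustration; scores are raw counts only.
  def bigrams(tokens, opts \\ []) do
    min_count = Keyword.get(opts, :min_count, 3)
    k = Keyword.get(opts, :k, 20)
    fold_case = Keyword.get(opts, :fold_case, true)
    # Accepts a list or a MapSet, as the :stopwords option does.
    stopwords = opts |> Keyword.get(:stopwords, []) |> MapSet.new()

    tokens = if fold_case, do: Enum.map(tokens, &String.downcase/1), else: tokens

    ranked =
      tokens
      # Treat the list as one contiguous stream of adjacent pairs.
      |> Enum.chunk_every(2, 1, :discard)
      # Drop a bigram when either side is a stopword.
      |> Enum.reject(fn pair -> Enum.any?(pair, &MapSet.member?(stopwords, &1)) end)
      |> Enum.frequencies()
      # Apply the :min_count threshold to the raw count.
      |> Enum.filter(fn {_bigram, count} -> count >= min_count end)
      # Highest score first.
      |> Enum.sort_by(fn {_bigram, count} -> count end, :desc)

    if k == :infinity, do: ranked, else: Enum.take(ranked, k)
  end
end
```

Because the threshold is applied to raw counts before ranking, a pair that fails `:min_count` never appears in the result, whatever its score would have been.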

### Examples

    iex> tokens = ~w[the cat sat on the mat the cat ran fast the cat slept]
    iex> [{top_bigram, _score} | _] =
    ...>   Text.Collocation.bigrams(tokens, min_count: 2, k: 5)
    iex> top_bigram
    ["the", "cat"]

    iex> Text.Collocation.bigrams("a a a", min_count: 1, measure: :frequency)
    [{["a", "a"], 2}]

