# `Text.Similarity`
[🔗](https://github.com/kipcole9/text/blob/v0.5.0/lib/similarity.ex#L1)

Set- and vector-based string similarity coefficients.

Where `Text.Distance` answers "how many edits separate these strings",
this module answers "how much do these strings have in common". The
two views complement each other — edit distance is sensitive to local
insertions and transpositions, while set-based similarity is robust
against word reorderings and is what most search-and-deduplication
pipelines reach for first.

### Inputs

Every function in this module accepts either:

* **two strings** — they are converted to a bag of grapheme n-grams
  using the `:n` option (default `2`, i.e. character bigrams), or

* **two pre-computed token bags** — any enumerable of comparable terms
  (typically strings, atoms, or n-gram lists from `Text.Ngram.ngram/2`).

String inputs that are shorter than `n` graphemes produce empty bags;
see "Edge cases" below for how empty bags are handled.

### Edge cases

* Two empty bags are vacuously similar (`1.0`).

* One empty and one non-empty is `0.0`.

* For very short strings, prefer a smaller `:n` or use
  `Text.Distance` instead — n-gram similarity loses meaning below
  the n-gram length.

### Algorithms

| Function | Formula | Range | Notes |
|---|---|---|---|
| `jaccard/3` | `|A ∩ B| / |A ∪ B|` | `0.0..1.0` | Set-level; ignores duplicates |
| `dice/3` | `2|A ∩ B| / (|A| + |B|)` | `0.0..1.0` | A.k.a. Sørensen-Dice |
| `overlap/3` | `|A ∩ B| / min(|A|, |B|)` | `0.0..1.0` | Szymkiewicz-Simpson; favours containment |
| `cosine/3` | `dot(a, b) / (‖a‖ · ‖b‖)` | `0.0..1.0` | Multi-set / TF; weighted by repetition |
| `jaro/2` | `1.0 - Text.Distance.jaro/2` | `0.0..1.0` | Same as `String.jaro_distance/2` |
| `jaro_winkler/3` | `1.0 - Text.Distance.jaro_winkler/3` | `0.0..1.0` | Jaro with prefix bonus |

# `bag`

```elixir
@type bag() :: String.t() | Enumerable.t()
```

A bag of tokens — anything `Enum` can iterate. Strings get converted
to grapheme n-grams; lists, MapSets, etc. are used as-is.

# `cosine`

```elixir
@spec cosine(bag(), bag(), keyword()) :: float()
```

Returns the cosine similarity over term-frequency vectors.

Unlike `jaccard/3`, `dice/3`, and `overlap/3`, cosine treats each
input as a *multiset* and weights similarity by how often each token
appears. For strings of n-grams this is rarely interesting — most
short strings have unique n-grams — but for word-level inputs (where
the same word repeats) cosine is the standard choice.

Computed as `dot(a, b) / (‖a‖ · ‖b‖)`, where the vectors are indexed
by the union of terms. Returns `1.0` for proportional vectors and
`0.0` when there are no shared terms.

### Arguments

* `a` is a string or token enumerable.

* `b` is a string or token enumerable.

### Options

* `:n` — n-gram size when inputs are strings. Defaults to `2`.

### Returns

* A float in `[0.0, 1.0]`.

### Examples

    iex> Text.Similarity.cosine("night", "night")
    1.0

    iex> Text.Similarity.cosine("abc", "xyz")
    0.0

    iex> Text.Similarity.cosine("", "")
    1.0

    iex> Text.Similarity.cosine(~w[the cat sat on the mat], ~w[the dog sat on the log])
    0.7499999999999999

# `dice`

```elixir
@spec dice(bag(), bag(), keyword()) :: float()
```

Returns the Sørensen-Dice coefficient.

Computed as `2|A ∩ B| / (|A| + |B|)` over the *sets* (deduplicated
bags) of tokens. Dice gives a higher score than Jaccard for the same
inputs and is often preferred for short strings. Returns `1.0` for
identical sets and `0.0` for disjoint sets.

### Arguments

* `a` is a string or token enumerable.

* `b` is a string or token enumerable.

### Options

* `:n` — n-gram size when inputs are strings. Defaults to `2`.

### Returns

* A float in `[0.0, 1.0]`.

### Examples

    iex> Text.Similarity.dice("night", "nacht")
    0.25

    iex> Text.Similarity.dice("night", "night")
    1.0

    iex> Text.Similarity.dice("abc", "xyz")
    0.0

    iex> Text.Similarity.dice("", "")
    1.0

# `jaccard`

```elixir
@spec jaccard(bag(), bag(), keyword()) :: float()
```

Returns the Jaccard similarity coefficient.

Computed as `|A ∩ B| / |A ∪ B|` over the *sets* (deduplicated bags) of
tokens. Returns `1.0` for identical sets, `0.0` for disjoint sets,
and `1.0` for two empty sets.

### Arguments

* `a` is a string or token enumerable.

* `b` is a string or token enumerable.

### Options

* `:n` — the n-gram size used when `a` or `b` is a string. Defaults
  to `2`. Ignored when both inputs are pre-tokenized.

### Returns

* A float in `[0.0, 1.0]`.

### Examples

    iex> Text.Similarity.jaccard("night", "nacht")
    0.14285714285714285

    iex> Text.Similarity.jaccard("night", "nacht", n: 1)
    0.42857142857142855

    iex> Text.Similarity.jaccard("night", "night")
    1.0

    iex> Text.Similarity.jaccard("abc", "xyz")
    0.0

    iex> Text.Similarity.jaccard("", "")
    1.0

    iex> Text.Similarity.jaccard(~w[red green blue], ~w[blue yellow])
    0.25

# `jaro`

```elixir
@spec jaro(String.t(), String.t()) :: float()
```

Returns the Jaro similarity between two strings.

This is the complement of `Text.Distance.jaro/2` and is also
available as `String.jaro_distance/2` in the standard library — it
is provided here so callers can keep the rest of the similarity
surface in one module.

### Arguments

* `a` is a UTF-8 string.

* `b` is a UTF-8 string.

### Returns

* A float in `[0.0, 1.0]` where `1.0` means identical and `0.0`
  means completely different.

### Examples

    iex> Text.Similarity.jaro("MARTHA", "MARTHA")
    1.0

    iex> Text.Similarity.jaro("MARTHA", "MARHTA") |> Float.round(4)
    0.9444

# `jaro_winkler`

```elixir
@spec jaro_winkler(String.t(), String.t(), keyword()) :: float()
```

Returns the Jaro-Winkler similarity between two strings.

Complement of `Text.Distance.jaro_winkler/3`. Accepts the same
options.

### Arguments

* `a` is a UTF-8 string.

* `b` is a UTF-8 string.

### Options

* `:prefix_scale` — defaults to `0.1`.

* `:max_prefix_length` — defaults to `4`.

### Returns

* A float in `[0.0, 1.0]`.

### Examples

    iex> Text.Similarity.jaro_winkler("MARTHA", "MARHTA") |> Float.round(4)
    0.9611

    iex> Text.Similarity.jaro_winkler("MARTHA", "MARTHA")
    1.0

# `overlap`

```elixir
@spec overlap(bag(), bag(), keyword()) :: float()
```

Returns the overlap coefficient (Szymkiewicz-Simpson similarity).

Computed as `|A ∩ B| / min(|A|, |B|)`. Unlike Jaccard or Dice, the
overlap coefficient reaches `1.0` whenever one set is fully contained
in the other — making it the right choice when the question is
"is one of these a subset of the other?" rather than "are these the
same size and content?".

### Arguments

* `a` is a string or token enumerable.

* `b` is a string or token enumerable.

### Options

* `:n` — n-gram size when inputs are strings. Defaults to `2`.

### Returns

* A float in `[0.0, 1.0]`.

### Examples

    iex> Text.Similarity.overlap("night", "nights")
    1.0

    iex> Text.Similarity.overlap("night", "nacht")
    0.25

    iex> Text.Similarity.overlap("abc", "xyz")
    0.0

    iex> Text.Similarity.overlap("", "")
    1.0

---

*Consult [api-reference.md](api-reference.md) for complete listing*