# `Text.Distance`
[🔗](https://github.com/kipcole9/text/blob/v0.5.0/lib/distance.ex#L1)

String edit-distance algorithms.

### Why this module exists

Elixir's standard library includes `String.jaro_distance/2` and
`String.bag_distance/2`, but both functions return values in
`[0.0, 1.0]` where `1.0` means "identical" and `0.0` means "completely
different". That is a *similarity* measure named "distance", which is
inconsistent with how virtually every other ecosystem (Python's
`Levenshtein` package, Rust's `strsim`, JavaScript's `string-similarity`
…) defines distance.

This module provides the conventional definitions instead — `0` (or
`0.0`) means identical, larger values mean more different — and adds
the algorithms the standard library does not ship: Levenshtein,
Damerau-Levenshtein, Hamming, and Jaro-Winkler. Callers wanting the
similarity form of any of these should use `Text.Similarity` (which
returns `1.0 - distance` for the bounded forms).

### Granularity

All algorithms operate at the *grapheme* level, matching the
convention used by `String.jaro_distance/2`. A grapheme is one
user-perceived character — `"é"` (U+00E9) and `"é"` (U+0065 + U+0301)
are both a single grapheme, even though their byte representations
differ. For ASCII-only inputs the grapheme-level analysis is
equivalent to byte-level operation.

### Algorithms

| Function | Returns | Range | Definition |
|---|---|---|---|
| `levenshtein/2` | non-negative integer | `0..max(len(a), len(b))` | minimum number of insertions, deletions, and substitutions |
| `damerau_levenshtein/2` | non-negative integer | `0..max(len(a), len(b))` | Levenshtein plus adjacent transposition (optimal-string-alignment variant) |
| `hamming/2` | non-negative integer | `0..len` (raises if lengths differ) | substitutions only |
| `jaro/2` | float | `0.0..1.0` | `1.0 - String.jaro_distance/2` |
| `jaro_winkler/2` | float | `0.0..1.0` | Jaro with a prefix bonus, then `1.0 - similarity` |

# `cosine`

```elixir
@spec cosine(String.t(), String.t(), keyword()) :: float()
```

Returns the cosine distance between `a` and `b` over character
n-grams treated as a binary bag of features.

Defined as `1 - |A ∩ B| / sqrt(|A| · |B|)`.

### Arguments

* `a` is a string.

* `b` is a string.

### Options

* `:n` — n-gram size in graphemes. Defaults to `2`.

### Returns

* A float in `[0.0, 1.0]`. `0.0` for identical strings; `1.0`
  when the n-gram sets are completely disjoint.

### Examples

    iex> Text.Distance.cosine("night", "nacht")
    0.75

    iex> Text.Distance.cosine("kitten", "kitten")
    0.0

    iex> Text.Distance.cosine("abcdef", "uvwxyz")
    1.0

# `damerau_levenshtein`

```elixir
@spec damerau_levenshtein(String.t(), String.t()) :: non_neg_integer()
```

Returns the Damerau-Levenshtein distance between `a` and `b`.

This is the *optimal-string-alignment* (OSA) variant, which is what
most libraries mean by "Damerau-Levenshtein": Levenshtein with the
added operation of transposing two adjacent graphemes, with the
restriction that no substring is edited more than once. The full
Damerau-Levenshtein distance (without the no-repeat-edit
restriction) is rarely used in practice and not provided here.

### Arguments

* `a` is a UTF-8 string.

* `b` is a UTF-8 string.

### Returns

* A non-negative integer.

### Examples

    iex> Text.Distance.damerau_levenshtein("ca", "ac")
    1

    iex> Text.Distance.damerau_levenshtein("kitten", "sitting")
    3

    iex> Text.Distance.damerau_levenshtein("abcdef", "bacdfe")
    2

    iex> Text.Distance.damerau_levenshtein("", "abc")
    3

# `hamming`

```elixir
@spec hamming(String.t(), String.t()) :: non_neg_integer()
```

Returns the Hamming distance between two equal-length strings.

The Hamming distance is the number of grapheme positions at which the
two strings differ. It is undefined for strings of different length;
passing such strings raises `ArgumentError`.

### Arguments

* `a` is a UTF-8 string.

* `b` is a UTF-8 string of the same grapheme length as `a`.

### Returns

* A non-negative integer.

### Examples

    iex> Text.Distance.hamming("karolin", "kathrin")
    3

    iex> Text.Distance.hamming("karolin", "kerstin")
    3

    iex> Text.Distance.hamming("1011101", "1001001")
    2

    iex> Text.Distance.hamming("", "")
    0

# `jaccard`

```elixir
@spec jaccard(String.t(), String.t(), keyword()) :: float()
```

Returns the Jaccard distance between `a` and `b` computed on
character n-grams (shingles).

Defined as `1 - |A ∩ B| / |A ∪ B|` where `A` and `B` are the
sets of n-grams of each string.

### Arguments

* `a` is a string.

* `b` is a string.

### Options

* `:n` — n-gram size in graphemes. Defaults to `2` (character
  bigrams).

### Returns

* A float in `[0.0, 1.0]`. `0.0` for identical strings; `1.0`
  when the n-gram sets are completely disjoint.

### Examples

    iex> Text.Distance.jaccard("night", "nacht")
    0.8571428571428572

    iex> Text.Distance.jaccard("kitten", "kitten")
    0.0

    iex> Text.Distance.jaccard("abcdef", "uvwxyz")
    1.0

# `jaro`

```elixir
@spec jaro(String.t(), String.t()) :: float()
```

Returns the Jaro distance between `a` and `b`.

Jaro distance is a normalised metric in `[0.0, 1.0]` where `0.0`
means the strings are identical and `1.0` means they have nothing in
common. It is computed as `1.0 - String.jaro_distance(a, b)` — the
standard library's "jaro_distance" function actually returns a
similarity, despite the name.

### Arguments

* `a` is a UTF-8 string.

* `b` is a UTF-8 string.

### Returns

* A float in `[0.0, 1.0]`.

### Examples

    iex> Text.Distance.jaro("MARTHA", "MARTHA")
    0.0

    iex> Text.Distance.jaro("MARTHA", "MARHTA") |> Float.round(4)
    0.0556

    iex> Text.Distance.jaro("DIXON", "DICKSONX") |> Float.round(4)
    0.2333

    iex> Text.Distance.jaro("", "")
    0.0

# `jaro_winkler`

```elixir
@spec jaro_winkler(String.t(), String.t(), keyword()) :: float()
```

Returns the Jaro-Winkler distance between `a` and `b`.

Jaro-Winkler is the Jaro similarity boosted by a per-grapheme bonus
for shared prefixes (up to four graphemes), then expressed as a
distance: `1.0 - similarity`. Two strings sharing a long prefix end
up much closer than under plain Jaro.

### Arguments

* `a` is a UTF-8 string.

* `b` is a UTF-8 string.

### Options

* `:prefix_scale` — the per-grapheme prefix bonus, traditionally
  `0.1` (the maximum value before Jaro-Winkler can exceed 1.0).
  Defaults to `0.1`.

* `:max_prefix_length` — the prefix length is capped at this many
  graphemes. Defaults to `4`.

### Returns

* A float in `[0.0, 1.0]`.

### Examples

    iex> Text.Distance.jaro_winkler("MARTHA", "MARHTA") |> Float.round(4)
    0.0389

    iex> Text.Distance.jaro_winkler("DIXON", "DICKSONX") |> Float.round(4)
    0.1867

    iex> Text.Distance.jaro_winkler("MARTHA", "MARTHA")
    0.0

# `levenshtein`

```elixir
@spec levenshtein(String.t(), String.t()) :: non_neg_integer()
```

Returns the Levenshtein distance between `a` and `b`.

The Levenshtein distance is the minimum number of single-grapheme
insertions, deletions, or substitutions needed to transform `a` into
`b`.

### Arguments

* `a` is a UTF-8 string.

* `b` is a UTF-8 string.

### Returns

* A non-negative integer.

### Examples

    iex> Text.Distance.levenshtein("kitten", "sitting")
    3

    iex> Text.Distance.levenshtein("Saturday", "Sunday")
    3

    iex> Text.Distance.levenshtein("", "abc")
    3

    iex> Text.Distance.levenshtein("naïve", "naive")
    1

# `sorensen_dice`

```elixir
@spec sorensen_dice(String.t(), String.t(), keyword()) :: float()
```

Returns the Sørensen–Dice distance between `a` and `b` computed on
character n-grams.

Defined as `1 - 2|A ∩ B| / (|A| + |B|)`. Closely related to Jaccard
but weights matches more heavily; widely used for fuzzy string
matching where a slightly more forgiving score is desirable.

### Arguments

* `a` is a string.

* `b` is a string.

### Options

* `:n` — n-gram size in graphemes. Defaults to `2`.

### Returns

* A float in `[0.0, 1.0]`. `0.0` for identical strings; `1.0`
  when the n-gram sets are completely disjoint.

### Examples

    iex> Text.Distance.sorensen_dice("night", "nacht")
    0.75

    iex> Text.Distance.sorensen_dice("kitten", "kitten")
    0.0

    iex> Text.Distance.sorensen_dice("abcdef", "uvwxyz")
    1.0

# `tanimoto`

```elixir
@spec tanimoto(String.t(), String.t(), keyword()) :: float()
```

Returns the Tanimoto distance between `a` and `b` over character
n-grams.

For binary feature vectors (which is what we have when treating
n-grams as set membership) Tanimoto reduces to Jaccard — so this
is an alias maintained for callers who specifically want the
Tanimoto name.

### Arguments

* `a` is a string.

* `b` is a string.

### Options

Same as `jaccard/3`.

### Returns

* A float in `[0.0, 1.0]`. Identical to the result of
  `jaccard/3` with the same arguments.

### Examples

    iex> Text.Distance.tanimoto("night", "nacht") == Text.Distance.jaccard("night", "nacht")
    true

---

*Consult [api-reference.md](api-reference.md) for complete listing*