String edit-distance algorithms.
Why this module exists
Elixir's standard library includes String.jaro_distance/2 and
String.bag_distance/2, but both functions return values in
[0.0, 1.0] where 1.0 means "identical" and 0.0 means "completely
different". That is a similarity measure named "distance", which is
inconsistent with how virtually every other ecosystem (Python's
Levenshtein package, Rust's strsim, JavaScript's string-similarity
…) defines distance.
This module provides the conventional definitions instead — 0 (or
0.0) means identical, larger values mean more different — and adds
the algorithms the standard library does not ship: Levenshtein,
Damerau-Levenshtein, Hamming, and Jaro-Winkler. Callers wanting the
similarity form of any of these should use Text.Similarity (which
returns 1.0 - distance for the bounded forms).
Granularity
All algorithms operate at the grapheme level, matching the
convention used by String.jaro_distance/2. A grapheme is one
user-perceived character — "é" (U+00E9) and "é" (U+0065 + U+0301)
are both a single grapheme, even though their byte representations
differ. For ASCII-only inputs the grapheme-level analysis is
equivalent to byte-level operation.
Algorithms
| Function | Returns | Range | Definition |
|---|---|---|---|
levenshtein/2 | non-negative integer | 0..max(len(a), len(b)) | minimum number of insertions, deletions, and substitutions |
damerau_levenshtein/2 | non-negative integer | 0..max(len(a), len(b)) | Levenshtein plus adjacent transposition (optimal-string-alignment variant) |
hamming/2 | non-negative integer | 0..len (raises if lengths differ) | substitutions only |
jaro/2 | float | 0.0..1.0 | 1.0 - String.jaro_distance/2 |
jaro_winkler/2 | float | 0.0..1.0 | Jaro with a prefix bonus, then 1.0 - similarity |
Summary
Functions
Returns the cosine distance between a and b over character
n-grams treated as a binary bag of features.
Returns the Damerau-Levenshtein distance between a and b.
Returns the Hamming distance between two equal-length strings.
Returns the Jaccard distance between a and b computed on
character n-grams (shingles).
Returns the Jaro distance between a and b.
Returns the Jaro-Winkler distance between a and b.
Returns the Levenshtein distance between a and b.
Returns the Sørensen–Dice distance between a and b computed on
character n-grams.
Returns the Tanimoto distance between a and b over character
n-grams.
Functions
Returns the cosine distance between a and b over character
n-grams treated as a binary bag of features.
Defined as 1 - |A ∩ B| / sqrt(|A| · |B|).
Arguments
ais a string.bis a string.
Options
:n— n-gram size in graphemes. Defaults to2.
Returns
- A float in
[0.0, 1.0].0.0for identical strings;1.0when the n-gram sets are completely disjoint.
Examples
iex> Text.Distance.cosine("night", "nacht")
0.75
iex> Text.Distance.cosine("kitten", "kitten")
0.0
iex> Text.Distance.cosine("abcdef", "uvwxyz")
1.0
@spec damerau_levenshtein(String.t(), String.t()) :: non_neg_integer()
Returns the Damerau-Levenshtein distance between a and b.
This is the optimal-string-alignment (OSA) variant, which is what most libraries mean by "Damerau-Levenshtein": Levenshtein with the added operation of transposing two adjacent graphemes, with the restriction that no substring is edited more than once. The full Damerau-Levenshtein distance (without the no-repeat-edit restriction) is rarely used in practice and not provided here.
Arguments
ais a UTF-8 string.bis a UTF-8 string.
Returns
- A non-negative integer.
Examples
iex> Text.Distance.damerau_levenshtein("ca", "ac")
1
iex> Text.Distance.damerau_levenshtein("kitten", "sitting")
3
iex> Text.Distance.damerau_levenshtein("abcdef", "bacdfe")
2
iex> Text.Distance.damerau_levenshtein("", "abc")
3
@spec hamming(String.t(), String.t()) :: non_neg_integer()
Returns the Hamming distance between two equal-length strings.
The Hamming distance is the number of grapheme positions at which the
two strings differ. It is undefined for strings of different length;
passing such strings raises ArgumentError.
Arguments
ais a UTF-8 string.bis a UTF-8 string of the same grapheme length asa.
Returns
- A non-negative integer.
Examples
iex> Text.Distance.hamming("karolin", "kathrin")
3
iex> Text.Distance.hamming("karolin", "kerstin")
3
iex> Text.Distance.hamming("1011101", "1001001")
2
iex> Text.Distance.hamming("", "")
0
Returns the Jaccard distance between a and b computed on
character n-grams (shingles).
Defined as 1 - |A ∩ B| / |A ∪ B| where A and B are the
sets of n-grams of each string.
Arguments
ais a string.bis a string.
Options
:n— n-gram size in graphemes. Defaults to2(character bigrams).
Returns
- A float in
[0.0, 1.0].0.0for identical strings;1.0when the n-gram sets are completely disjoint.
Examples
iex> Text.Distance.jaccard("night", "nacht")
0.8571428571428572
iex> Text.Distance.jaccard("kitten", "kitten")
0.0
iex> Text.Distance.jaccard("abcdef", "uvwxyz")
1.0
Returns the Jaro distance between a and b.
Jaro distance is a normalised metric in [0.0, 1.0] where 0.0
means the strings are identical and 1.0 means they have nothing in
common. It is computed as 1.0 - String.jaro_distance(a, b) — the
standard library's "jaro_distance" function actually returns a
similarity, despite the name.
Arguments
ais a UTF-8 string.bis a UTF-8 string.
Returns
- A float in
[0.0, 1.0].
Examples
iex> Text.Distance.jaro("MARTHA", "MARTHA")
0.0
iex> Text.Distance.jaro("MARTHA", "MARHTA") |> Float.round(4)
0.0556
iex> Text.Distance.jaro("DIXON", "DICKSONX") |> Float.round(4)
0.2333
iex> Text.Distance.jaro("", "")
0.0
Returns the Jaro-Winkler distance between a and b.
Jaro-Winkler is the Jaro similarity boosted by a per-grapheme bonus
for shared prefixes (up to four graphemes), then expressed as a
distance: 1.0 - similarity. Two strings sharing a long prefix end
up much closer than under plain Jaro.
Arguments
ais a UTF-8 string.bis a UTF-8 string.
Options
:prefix_scale— the per-grapheme prefix bonus, traditionally0.1(the maximum value before Jaro-Winkler can exceed 1.0). Defaults to0.1.:max_prefix_length— the prefix length is capped at this many graphemes. Defaults to4.
Returns
- A float in
[0.0, 1.0].
Examples
iex> Text.Distance.jaro_winkler("MARTHA", "MARHTA") |> Float.round(4)
0.0389
iex> Text.Distance.jaro_winkler("DIXON", "DICKSONX") |> Float.round(4)
0.1867
iex> Text.Distance.jaro_winkler("MARTHA", "MARTHA")
0.0
@spec levenshtein(String.t(), String.t()) :: non_neg_integer()
Returns the Levenshtein distance between a and b.
The Levenshtein distance is the minimum number of single-grapheme
insertions, deletions, or substitutions needed to transform a into
b.
Arguments
ais a UTF-8 string.bis a UTF-8 string.
Returns
- A non-negative integer.
Examples
iex> Text.Distance.levenshtein("kitten", "sitting")
3
iex> Text.Distance.levenshtein("Saturday", "Sunday")
3
iex> Text.Distance.levenshtein("", "abc")
3
iex> Text.Distance.levenshtein("naïve", "naive")
1
Returns the Sørensen–Dice distance between a and b computed on
character n-grams.
Defined as 1 - 2|A ∩ B| / (|A| + |B|). Closely related to Jaccard
but weights matches more heavily; widely used for fuzzy string
matching where a slightly more forgiving score is desirable.
Arguments
ais a string.bis a string.
Options
:n— n-gram size in graphemes. Defaults to2.
Returns
- A float in
[0.0, 1.0].0.0for identical strings;1.0when the n-gram sets are completely disjoint.
Examples
iex> Text.Distance.sorensen_dice("night", "nacht")
0.75
iex> Text.Distance.sorensen_dice("kitten", "kitten")
0.0
iex> Text.Distance.sorensen_dice("abcdef", "uvwxyz")
1.0
Returns the Tanimoto distance between a and b over character
n-grams.
For binary feature vectors (which is what we have when treating n-grams as set membership) Tanimoto reduces to Jaccard — so this is an alias maintained for callers who specifically want the Tanimoto name.
Arguments
ais a string.bis a string.
Options
Same as jaccard/3.
Returns
- A float in
[0.0, 1.0]. Identical to the result ofjaccard/3with the same arguments.
Examples
iex> Text.Distance.tanimoto("night", "nacht") == Text.Distance.jaccard("night", "nacht")
true