Text.Distance (Text v0.5.0)

Copy Markdown View Source

String edit-distance algorithms.

Why this module exists

Elixir's standard library includes String.jaro_distance/2 and String.bag_distance/2, but both functions return values in [0.0, 1.0] where 1.0 means "identical" and 0.0 means "completely different". That is a similarity measure named "distance", which is inconsistent with how virtually every other ecosystem (Python's Levenshtein package, Rust's strsim, JavaScript's string-similarity …) defines distance.

This module provides the conventional definitions instead — 0 (or 0.0) means identical, larger values mean more different — and adds the algorithms the standard library does not ship: Levenshtein, Damerau-Levenshtein, Hamming, and Jaro-Winkler. Callers wanting the similarity form of any of these should use Text.Similarity (which returns 1.0 - distance for the bounded forms).

Granularity

All algorithms operate at the grapheme level, matching the convention used by String.jaro_distance/2. A grapheme is one user-perceived character — "é" (U+00E9) and "é" (U+0065 + U+0301) are both a single grapheme, even though their byte representations differ. For ASCII-only inputs the grapheme-level analysis is equivalent to byte-level operation.

Algorithms

FunctionReturnsRangeDefinition
levenshtein/2non-negative integer0..max(len(a), len(b))minimum number of insertions, deletions, and substitutions
damerau_levenshtein/2non-negative integer0..max(len(a), len(b))Levenshtein plus adjacent transposition (optimal-string-alignment variant)
hamming/2non-negative integer0..len (raises if lengths differ)substitutions only
jaro/2float0.0..1.01.0 - String.jaro_distance/2
jaro_winkler/2float0.0..1.0Jaro with a prefix bonus, then 1.0 - similarity

Summary

Functions

Returns the cosine distance between a and b over character n-grams treated as a binary bag of features.

Returns the Damerau-Levenshtein distance between a and b.

Returns the Hamming distance between two equal-length strings.

Returns the Jaccard distance between a and b computed on character n-grams (shingles).

Returns the Jaro distance between a and b.

Returns the Jaro-Winkler distance between a and b.

Returns the Levenshtein distance between a and b.

Returns the Sørensen–Dice distance between a and b computed on character n-grams.

Returns the Tanimoto distance between a and b over character n-grams.

Functions

cosine(a, b, options \\ [])

@spec cosine(String.t(), String.t(), keyword()) :: float()

Returns the cosine distance between a and b over character n-grams treated as a binary bag of features.

Defined as 1 - |A ∩ B| / sqrt(|A| · |B|).

Arguments

  • a is a string.

  • b is a string.

Options

  • :n — n-gram size in graphemes. Defaults to 2.

Returns

  • A float in [0.0, 1.0]. 0.0 for identical strings; 1.0 when the n-gram sets are completely disjoint.

Examples

iex> Text.Distance.cosine("night", "nacht")
0.75

iex> Text.Distance.cosine("kitten", "kitten")
0.0

iex> Text.Distance.cosine("abcdef", "uvwxyz")
1.0

damerau_levenshtein(a, b)

@spec damerau_levenshtein(String.t(), String.t()) :: non_neg_integer()

Returns the Damerau-Levenshtein distance between a and b.

This is the optimal-string-alignment (OSA) variant, which is what most libraries mean by "Damerau-Levenshtein": Levenshtein with the added operation of transposing two adjacent graphemes, with the restriction that no substring is edited more than once. The full Damerau-Levenshtein distance (without the no-repeat-edit restriction) is rarely used in practice and not provided here.

Arguments

  • a is a UTF-8 string.

  • b is a UTF-8 string.

Returns

  • A non-negative integer.

Examples

iex> Text.Distance.damerau_levenshtein("ca", "ac")
1

iex> Text.Distance.damerau_levenshtein("kitten", "sitting")
3

iex> Text.Distance.damerau_levenshtein("abcdef", "bacdfe")
2

iex> Text.Distance.damerau_levenshtein("", "abc")
3

hamming(a, b)

@spec hamming(String.t(), String.t()) :: non_neg_integer()

Returns the Hamming distance between two equal-length strings.

The Hamming distance is the number of grapheme positions at which the two strings differ. It is undefined for strings of different length; passing such strings raises ArgumentError.

Arguments

  • a is a UTF-8 string.

  • b is a UTF-8 string of the same grapheme length as a.

Returns

  • A non-negative integer.

Examples

iex> Text.Distance.hamming("karolin", "kathrin")
3

iex> Text.Distance.hamming("karolin", "kerstin")
3

iex> Text.Distance.hamming("1011101", "1001001")
2

iex> Text.Distance.hamming("", "")
0

jaccard(a, b, options \\ [])

@spec jaccard(String.t(), String.t(), keyword()) :: float()

Returns the Jaccard distance between a and b computed on character n-grams (shingles).

Defined as 1 - |A ∩ B| / |A ∪ B| where A and B are the sets of n-grams of each string.

Arguments

  • a is a string.

  • b is a string.

Options

  • :n — n-gram size in graphemes. Defaults to 2 (character bigrams).

Returns

  • A float in [0.0, 1.0]. 0.0 for identical strings; 1.0 when the n-gram sets are completely disjoint.

Examples

iex> Text.Distance.jaccard("night", "nacht")
0.8571428571428572

iex> Text.Distance.jaccard("kitten", "kitten")
0.0

iex> Text.Distance.jaccard("abcdef", "uvwxyz")
1.0

jaro(a, b)

@spec jaro(String.t(), String.t()) :: float()

Returns the Jaro distance between a and b.

Jaro distance is a normalised metric in [0.0, 1.0] where 0.0 means the strings are identical and 1.0 means they have nothing in common. It is computed as 1.0 - String.jaro_distance(a, b) — the standard library's "jaro_distance" function actually returns a similarity, despite the name.

Arguments

  • a is a UTF-8 string.

  • b is a UTF-8 string.

Returns

  • A float in [0.0, 1.0].

Examples

iex> Text.Distance.jaro("MARTHA", "MARTHA")
0.0

iex> Text.Distance.jaro("MARTHA", "MARHTA") |> Float.round(4)
0.0556

iex> Text.Distance.jaro("DIXON", "DICKSONX") |> Float.round(4)
0.2333

iex> Text.Distance.jaro("", "")
0.0

jaro_winkler(a, b, options \\ [])

@spec jaro_winkler(String.t(), String.t(), keyword()) :: float()

Returns the Jaro-Winkler distance between a and b.

Jaro-Winkler is the Jaro similarity boosted by a per-grapheme bonus for shared prefixes (up to four graphemes), then expressed as a distance: 1.0 - similarity. Two strings sharing a long prefix end up much closer than under plain Jaro.

Arguments

  • a is a UTF-8 string.

  • b is a UTF-8 string.

Options

  • :prefix_scale — the per-grapheme prefix bonus, traditionally 0.1 (the maximum value before Jaro-Winkler can exceed 1.0). Defaults to 0.1.

  • :max_prefix_length — the prefix length is capped at this many graphemes. Defaults to 4.

Returns

  • A float in [0.0, 1.0].

Examples

iex> Text.Distance.jaro_winkler("MARTHA", "MARHTA") |> Float.round(4)
0.0389

iex> Text.Distance.jaro_winkler("DIXON", "DICKSONX") |> Float.round(4)
0.1867

iex> Text.Distance.jaro_winkler("MARTHA", "MARTHA")
0.0

levenshtein(a, b)

@spec levenshtein(String.t(), String.t()) :: non_neg_integer()

Returns the Levenshtein distance between a and b.

The Levenshtein distance is the minimum number of single-grapheme insertions, deletions, or substitutions needed to transform a into b.

Arguments

  • a is a UTF-8 string.

  • b is a UTF-8 string.

Returns

  • A non-negative integer.

Examples

iex> Text.Distance.levenshtein("kitten", "sitting")
3

iex> Text.Distance.levenshtein("Saturday", "Sunday")
3

iex> Text.Distance.levenshtein("", "abc")
3

iex> Text.Distance.levenshtein("naïve", "naive")
1

sorensen_dice(a, b, options \\ [])

@spec sorensen_dice(String.t(), String.t(), keyword()) :: float()

Returns the Sørensen–Dice distance between a and b computed on character n-grams.

Defined as 1 - 2|A ∩ B| / (|A| + |B|). Closely related to Jaccard but weights matches more heavily; widely used for fuzzy string matching where a slightly more forgiving score is desirable.

Arguments

  • a is a string.

  • b is a string.

Options

  • :n — n-gram size in graphemes. Defaults to 2.

Returns

  • A float in [0.0, 1.0]. 0.0 for identical strings; 1.0 when the n-gram sets are completely disjoint.

Examples

iex> Text.Distance.sorensen_dice("night", "nacht")
0.75

iex> Text.Distance.sorensen_dice("kitten", "kitten")
0.0

iex> Text.Distance.sorensen_dice("abcdef", "uvwxyz")
1.0

tanimoto(a, b, options \\ [])

@spec tanimoto(String.t(), String.t(), keyword()) :: float()

Returns the Tanimoto distance between a and b over character n-grams.

For binary feature vectors (which is what we have when treating n-grams as set membership) Tanimoto reduces to Jaccard — so this is an alias maintained for callers who specifically want the Tanimoto name.

Arguments

  • a is a string.

  • b is a string.

Options

Same as jaccard/3.

Returns

  • A float in [0.0, 1.0]. Identical to the result of jaccard/3 with the same arguments.

Examples

iex> Text.Distance.tanimoto("night", "nacht") == Text.Distance.jaccard("night", "nacht")
true