Set- and vector-based string similarity coefficients.
Where Text.Distance answers "how many edits separate these strings",
this module answers "how much do these strings have in common". The
two views complement each other — edit distance is sensitive to local
insertions and transpositions, while set-based similarity is robust
against word reorderings and is what most search-and-deduplication
pipelines reach for first.
Inputs
Every function in this module accepts either:
two strings — they are converted to a bag of grapheme n-grams using the
:noption (default2, i.e. character bigrams), ortwo pre-computed token bags — any enumerable of comparable terms (typically strings, atoms, or n-gram lists from
Text.Ngram.ngram/2).
String inputs that are shorter than n graphemes produce empty bags;
see "Edge cases" below for how empty bags are handled.
Edge cases
Two empty bags are vacuously similar (
1.0).One empty and one non-empty is
0.0.For very short strings, prefer a smaller
:nor useText.Distanceinstead — n-gram similarity loses meaning below the n-gram length.
Algorithms
| Function | Formula | Range | Notes |
|---|
| jaccard/3 | |A ∩ B| / |A ∪ B| | 0.0..1.0 | Set-level; ignores duplicates |
| dice/3 | 2|A ∩ B| / (|A| + |B|) | 0.0..1.0 | A.k.a. Sørensen-Dice |
| overlap/3 | |A ∩ B| / min(|A|, |B|) | 0.0..1.0 | Szymkiewicz-Simpson; favours containment |
| cosine/3 | dot(a, b) / (‖a‖ · ‖b‖) | 0.0..1.0 | Multi-set / TF; weighted by repetition |
| jaro/2 | 1.0 - Text.Distance.jaro/2 | 0.0..1.0 | Same as String.jaro_distance/2 |
| jaro_winkler/3 | 1.0 - Text.Distance.jaro_winkler/3 | 0.0..1.0 | Jaro with prefix bonus |
Summary
Functions
Returns the cosine similarity over term-frequency vectors.
Returns the Sørensen-Dice coefficient.
Returns the Jaccard similarity coefficient.
Returns the Jaro similarity between two strings.
Returns the Jaro-Winkler similarity between two strings.
Returns the overlap coefficient (Szymkiewicz-Simpson similarity).
Types
@type bag() :: String.t() | Enumerable.t()
A bag of tokens — anything Enum can iterate. Strings get converted
to grapheme n-grams; lists, MapSets, etc. are used as-is.
Functions
Returns the cosine similarity over term-frequency vectors.
Unlike jaccard/3, dice/3, and overlap/3, cosine treats each
input as a multiset and weights similarity by how often each token
appears. For strings of n-grams this is rarely interesting — most
short strings have unique n-grams — but for word-level inputs (where
the same word repeats) cosine is the standard choice.
Computed as dot(a, b) / (‖a‖ · ‖b‖), where the vectors are indexed
by the union of terms. Returns 1.0 for proportional vectors and
0.0 when there are no shared terms.
Arguments
ais a string or token enumerable.bis a string or token enumerable.
Options
:n— n-gram size when inputs are strings. Defaults to2.
Returns
- A float in
[0.0, 1.0].
Examples
iex> Text.Similarity.cosine("night", "night")
1.0
iex> Text.Similarity.cosine("abc", "xyz")
0.0
iex> Text.Similarity.cosine("", "")
1.0
iex> Text.Similarity.cosine(~w[the cat sat on the mat], ~w[the dog sat on the log])
0.7499999999999999
Returns the Sørensen-Dice coefficient.
Computed as 2|A ∩ B| / (|A| + |B|) over the sets (deduplicated
bags) of tokens. Dice gives a higher score than Jaccard for the same
inputs and is often preferred for short strings. Returns 1.0 for
identical sets and 0.0 for disjoint sets.
Arguments
ais a string or token enumerable.bis a string or token enumerable.
Options
:n— n-gram size when inputs are strings. Defaults to2.
Returns
- A float in
[0.0, 1.0].
Examples
iex> Text.Similarity.dice("night", "nacht")
0.25
iex> Text.Similarity.dice("night", "night")
1.0
iex> Text.Similarity.dice("abc", "xyz")
0.0
iex> Text.Similarity.dice("", "")
1.0
Returns the Jaccard similarity coefficient.
Computed as |A ∩ B| / |A ∪ B| over the sets (deduplicated bags) of
tokens. Returns 1.0 for identical sets, 0.0 for disjoint sets,
and 1.0 for two empty sets.
Arguments
ais a string or token enumerable.bis a string or token enumerable.
Options
:n— the n-gram size used whenaorbis a string. Defaults to2. Ignored when both inputs are pre-tokenized.
Returns
- A float in
[0.0, 1.0].
Examples
iex> Text.Similarity.jaccard("night", "nacht")
0.14285714285714285
iex> Text.Similarity.jaccard("night", "nacht", n: 1)
0.42857142857142855
iex> Text.Similarity.jaccard("night", "night")
1.0
iex> Text.Similarity.jaccard("abc", "xyz")
0.0
iex> Text.Similarity.jaccard("", "")
1.0
iex> Text.Similarity.jaccard(~w[red green blue], ~w[blue yellow])
0.25
Returns the Jaro similarity between two strings.
This is the complement of Text.Distance.jaro/2 and is also
available as String.jaro_distance/2 in the standard library — it
is provided here so callers can keep the rest of the similarity
surface in one module.
Arguments
ais a UTF-8 string.bis a UTF-8 string.
Returns
- A float in
[0.0, 1.0]where1.0means identical and0.0means completely different.
Examples
iex> Text.Similarity.jaro("MARTHA", "MARTHA")
1.0
iex> Text.Similarity.jaro("MARTHA", "MARHTA") |> Float.round(4)
0.9444
Returns the Jaro-Winkler similarity between two strings.
Complement of Text.Distance.jaro_winkler/3. Accepts the same
options.
Arguments
ais a UTF-8 string.bis a UTF-8 string.
Options
:prefix_scale— defaults to0.1.:max_prefix_length— defaults to4.
Returns
- A float in
[0.0, 1.0].
Examples
iex> Text.Similarity.jaro_winkler("MARTHA", "MARHTA") |> Float.round(4)
0.9611
iex> Text.Similarity.jaro_winkler("MARTHA", "MARTHA")
1.0
Returns the overlap coefficient (Szymkiewicz-Simpson similarity).
Computed as |A ∩ B| / min(|A|, |B|). Unlike Jaccard or Dice, the
overlap coefficient reaches 1.0 whenever one set is fully contained
in the other — making it the right choice when the question is
"is one of these a subset of the other?" rather than "are these the
same size and content?".
Arguments
ais a string or token enumerable.bis a string or token enumerable.
Options
:n— n-gram size when inputs are strings. Defaults to2.
Returns
- A float in
[0.0, 1.0].
Examples
iex> Text.Similarity.overlap("night", "nights")
1.0
iex> Text.Similarity.overlap("night", "nacht")
0.25
iex> Text.Similarity.overlap("abc", "xyz")
0.0
iex> Text.Similarity.overlap("", "")
1.0