ExNlp.Similarity (ex_nlp v0.1.0)
String and word set similarity metrics.
This module provides various algorithms for measuring similarity between strings and sets of words, useful for fuzzy matching and search.
Examples
# Levenshtein distance between strings
iex> ExNlp.Similarity.levenshtein("kitten", "sitting")
3
# Jaccard similarity between word sets
iex> ExNlp.Similarity.jaccard(["cat", "dog"], ["cat", "bird"])
0.3333333333333333
Functions
Calculates the Dice coefficient (Sørensen-Dice coefficient) between two sets of words.
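Similar to Jaccard, but gives more weight to common elements: the Dice coefficient is twice the size of the intersection divided by the sum of the sizes of the two sets.
Returns a value between 0.0 and 1.0, where 1.0 means identical sets.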
Examples
iex> ExNlp.Similarity.dice_coefficient(["cat", "dog"], ["cat", "bird"])
0.5
iex> ExNlp.Similarity.dice_coefficient(["cat", "dog"], ["cat", "dog"])
1.0
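The definition can be sketched with MapSet. This is a minimal illustration, not the library's implementation: the module name DiceSketch and the choice to return 1.0 for two empty inputs are assumptions.

```elixir
defmodule DiceSketch do
  # Hypothetical helper, not part of ExNlp.
  # Dice = 2 * |A ∩ B| / (|A| + |B|), computed over de-duplicated words.
  def dice(words_a, words_b) do
    a = MapSet.new(words_a)
    b = MapSet.new(words_b)
    total = MapSet.size(a) + MapSet.size(b)

    if total == 0 do
      # Assumption: two empty sets count as identical.
      1.0
    else
      common = a |> MapSet.intersection(b) |> MapSet.size()
      2 * common / total
    end
  end
end
```

With ["cat", "dog"] and ["cat", "bird"] there is one shared word and four words in total, so the sketch evaluates 2 * 1 / 4.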
@spec hamming(String.t(), String.t()) :: non_neg_integer()
Calculates the Hamming distance between two strings.
The Hamming distance is the number of positions at which the corresponding characters are different. Both strings must be of equal length.
Examples
iex> ExNlp.Similarity.hamming("karolin", "kathrin")
3
iex> ExNlp.Similarity.hamming("hello", "hello")
0
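A position-by-position comparison might look like the sketch below. The module name HammingSketch is hypothetical, and raising ArgumentError on unequal lengths is an assumption; the documentation above does not say how ExNlp reports that case.

```elixir
defmodule HammingSketch do
  # Hypothetical helper, not part of ExNlp.
  def hamming(a, b) do
    ga = String.graphemes(a)
    gb = String.graphemes(b)

    # Assumption: unequal lengths are rejected with an exception.
    if length(ga) != length(gb) do
      raise ArgumentError, "strings must have equal length"
    end

    # Pair up characters by position and count the pairs that differ.
    ga
    |> Enum.zip(gb)
    |> Enum.count(fn {x, y} -> x != y end)
  end
end
```

Comparing graphemes rather than bytes keeps multi-byte characters such as "é" as a single position.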
Calculates the Jaccard similarity coefficient between two sets of words.
Jaccard similarity is the size of the intersection divided by the size of the union of the two sets.
Returns a value between 0.0 and 1.0, where 1.0 means identical sets.
Examples
iex> ExNlp.Similarity.jaccard(["cat", "dog"], ["cat", "bird"])
0.3333333333333333
iex> ExNlp.Similarity.jaccard(["cat", "dog"], ["cat", "dog"])
1.0
iex> ExNlp.Similarity.jaccard(["cat"], ["dog"])
0.0
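The intersection-over-union definition can be sketched with MapSet. The module name JaccardSketch is hypothetical, and treating two empty inputs as identical is an assumption rather than documented ExNlp behaviour.

```elixir
defmodule JaccardSketch do
  # Hypothetical helper, not part of ExNlp.
  # Jaccard = |A ∩ B| / |A ∪ B|, computed over de-duplicated words.
  def jaccard(words_a, words_b) do
    a = MapSet.new(words_a)
    b = MapSet.new(words_b)
    union_size = a |> MapSet.union(b) |> MapSet.size()

    if union_size == 0 do
      # Assumption: two empty sets count as identical.
      1.0
    else
      MapSet.size(MapSet.intersection(a, b)) / union_size
    end
  end
end
```

For ["cat", "dog"] and ["cat", "bird"] the intersection has one word and the union has three, giving 1 / 3.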
Calculates the Jaro similarity between two strings.
Jaro similarity is a string metric based on the number of matching characters and the number of transpositions between two strings. Returns a value between 0.0 (no similarity) and 1.0 (identical strings).
Examples
iex> ExNlp.Similarity.jaro_similarity("martha", "marhta")
0.9444444444444445
iex> ExNlp.Similarity.jaro_similarity("dwayne", "duane")
0.8222222222222223
iex> ExNlp.Similarity.jaro_similarity("abc", "xyz")
0.0
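Once the matching characters and transpositions have been counted, the score follows from the standard Jaro formula. The sketch below covers only that final step; the module name JaroFormulaSketch and its argument order are hypothetical, and the matching pass itself is omitted.

```elixir
defmodule JaroFormulaSketch do
  # Hypothetical helper, not part of ExNlp.
  # len_a, len_b: string lengths; m: matching characters;
  # t: transpositions (half the number of matched characters that are out of order).
  def jaro(_len_a, _len_b, 0, _t), do: 0.0

  def jaro(len_a, len_b, m, t) do
    (m / len_a + m / len_b + (m - t) / m) / 3
  end
end
```

For "martha" and "marhta" all six characters match and "t"/"h" are swapped (one transposition), so the score is (6/6 + 6/6 + 5/6) / 3, roughly 0.944. For "dwayne" and "duane" four characters match with no transpositions, giving (4/6 + 4/5 + 4/4) / 3, roughly 0.822.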
Calculates the Jaro-Winkler similarity between two strings.
Jaro-Winkler is an extension of Jaro that gives more favorable ratings to strings that match from the beginning up to a maximum prefix length.
Options
:prefix_length - Maximum prefix length to consider (default: 4)
:prefix_weight - Weight factor for the prefix bonus (default: 0.1)
Examples
iex> ExNlp.Similarity.jaro_winkler_similarity("martha", "marhta")
0.9611111111111111
iex> ExNlp.Similarity.jaro_winkler_similarity("dwayne", "duane")
0.84
iex> ExNlp.Similarity.jaro_winkler_similarity("hello", "helo", prefix_length: 2)
0.9333333333333333
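The prefix bonus in the standard Jaro-Winkler formula is jaro + prefix * weight * (1 - jaro). The sketch below applies it to an already computed Jaro score. The module and function names are hypothetical, the option names mirror the ones listed above, and whether ExNlp applies any additional threshold is not stated in this documentation.

```elixir
defmodule JaroWinklerSketch do
  # Hypothetical helper, not part of ExNlp.
  # jaro: a Jaro score computed elsewhere; a, b: the compared strings.
  def with_prefix_bonus(jaro, a, b, opts \\ []) do
    max_prefix = Keyword.get(opts, :prefix_length, 4)
    weight = Keyword.get(opts, :prefix_weight, 0.1)
    prefix = common_prefix_length(a, b, max_prefix)

    jaro + prefix * weight * (1 - jaro)
  end

  # Length of the shared prefix, capped at max_prefix characters.
  defp common_prefix_length(a, b, max_prefix) do
    a
    |> String.graphemes()
    |> Enum.zip(String.graphemes(b))
    |> Enum.take(max_prefix)
    |> Enum.take_while(fn {x, y} -> x == y end)
    |> length()
  end
end
```

"martha" and "marhta" share the prefix "mar", so a Jaro score of about 0.944 becomes 0.944 + 3 * 0.1 * 0.056, roughly 0.961. "dwayne" and "duane" share only "d", so about 0.822 becomes roughly 0.84.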
@spec levenshtein(String.t(), String.t()) :: non_neg_integer()
@spec levenshtein(String.t(), String.t(), :compact) :: non_neg_integer()
@spec levenshtein(String.t(), String.t(), :matrix) :: non_neg_integer()
@spec levenshtein(String.t(), String.t(), :tabulated) :: non_neg_integer()
Calculates the Levenshtein (edit) distance between two strings using matrix-based DP.
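This is the default implementation, which fills a full matrix. For better performance with large strings, consider levenshtein_tabulated/2 or levenshtein_compact/2. As the specs above show, an implementation can also be selected by passing :matrix, :tabulated, or :compact as the third argument.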
Returns the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another.
Examples
iex> ExNlp.Similarity.levenshtein("kitten", "sitting")
3
iex> ExNlp.Similarity.levenshtein("", "abc")
3
iex> ExNlp.Similarity.levenshtein("abc", "abc")
0
@spec levenshtein_compact(String.t(), String.t()) :: non_neg_integer()
Calculates Levenshtein distance using a compact (space-optimized) approach.
Uses only two rows of the matrix instead of the full matrix, reducing memory usage from O(n*m) to O(min(n,m)).
Examples
iex> ExNlp.Similarity.levenshtein_compact("kitten", "sitting")
3
iex> ExNlp.Similarity.levenshtein_compact("abc", "xyz")
3
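The two-row idea can be sketched as follows: each new row is computed from the previous row only, and the shorter string is placed along the row so that a row holds min(n, m) + 1 cells. This is an illustrative sketch with a hypothetical module name, not the library's code.

```elixir
defmodule LevenshteinCompactSketch do
  # Hypothetical helper, not part of ExNlp.
  def distance(a, b) do
    xs = String.graphemes(a)
    ys = String.graphemes(b)
    # Put the shorter string along the row to keep the row small.
    {rows, cols} = if length(xs) >= length(ys), do: {xs, ys}, else: {ys, xs}

    # Row 0: distance from the empty prefix to each prefix of cols.
    first_row = Enum.to_list(0..length(cols))

    rows
    |> Enum.with_index(1)
    |> Enum.reduce(first_row, fn {x, i}, prev -> next_row(x, i, cols, prev) end)
    |> List.last()
  end

  # Builds row i from the previous row; only these two rows are ever alive.
  defp next_row(x, i, cols, [diag0 | prev_tail]) do
    {reversed, _left, _diag} =
      cols
      |> Enum.zip(prev_tail)
      |> Enum.reduce({[i], i, diag0}, fn {y, above}, {acc, left, diag} ->
        cost = if x == y, do: 0, else: 1
        cell = Enum.min([above + 1, left + 1, diag + cost])
        # The cell above becomes the diagonal for the next column.
        {[cell | acc], cell, above}
      end)

    Enum.reverse(reversed)
  end
end
```

The last cell of the final row is the distance, so LevenshteinCompactSketch.distance("kitten", "sitting") evaluates to 3.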
Calculates Levenshtein similarity between two strings.
Levenshtein similarity is a normalized measure derived from Levenshtein distance. Returns a value between 0.0 (completely different) and 1.0 (identical strings).
The similarity is calculated as: 1 - (distance / max(len1, len2))
Examples
iex> ExNlp.Similarity.levenshtein_similarity("kitten", "sitting")
0.5714285714285714
iex> ExNlp.Similarity.levenshtein_similarity("abc", "abc")
1.0
iex> ExNlp.Similarity.levenshtein_similarity("abc", "xyz")
0.0
iex> ExNlp.Similarity.levenshtein_similarity("", "abc")
0.0
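The normalization step can be written out on its own. The sketch below takes an already computed distance and the two string lengths; the module name is hypothetical, and returning 1.0 when both strings are empty is an assumption, since the formula divides by zero in that case.

```elixir
defmodule LevenshteinSimilaritySketch do
  # Hypothetical helper, not part of ExNlp.
  # similarity = 1 - distance / max(len_a, len_b)
  def similarity(distance, len_a, len_b) do
    case max(len_a, len_b) do
      # Assumption: two empty strings are treated as identical.
      0 -> 1.0
      longest -> 1 - distance / longest
    end
  end
end
```

For "kitten" (6 characters) and "sitting" (7 characters) the distance is 3, so the similarity is 1 - 3/7, roughly 0.571.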
@spec levenshtein_tabulated(String.t(), String.t()) :: non_neg_integer()
Calculates Levenshtein distance using a tabulated (array-based) approach.
This implementation uses Erlang's :array module for efficient memory usage.
Generally faster than the matrix-based approach for longer strings.
Examples
iex> ExNlp.Similarity.levenshtein_tabulated("kitten", "sitting")
3
iex> ExNlp.Similarity.levenshtein_tabulated("house", "horses")
2
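One way to tabulate the distance with Erlang's :array module is to store the table as a flat array, with cell (i, j) at index i * width + j. The sketch below is an assumption about how such an approach could look; the module name is hypothetical and the library's actual layout may differ.

```elixir
defmodule LevenshteinArraySketch do
  # Hypothetical helper, not part of ExNlp.
  def distance(a, b) do
    xs = String.graphemes(a)
    ys = String.graphemes(b)
    n = length(xs)
    m = length(ys)
    width = m + 1

    # Flat table of (n + 1) * (m + 1) cells, all initialised to 0.
    table = :array.new((n + 1) * width, default: 0)

    # Borders: row 0 holds 0..m and column 0 holds 0..n.
    table = Enum.reduce(0..m, table, fn j, t -> :array.set(j, j, t) end)
    table = Enum.reduce(0..n, table, fn i, t -> :array.set(i * width, i, t) end)

    filled =
      for {x, i} <- Enum.with_index(xs, 1),
          {y, j} <- Enum.with_index(ys, 1),
          reduce: table do
        t ->
          cost = if x == y, do: 0, else: 1

          best =
            Enum.min([
              # deletion
              :array.get((i - 1) * width + j, t) + 1,
              # insertion
              :array.get(i * width + j - 1, t) + 1,
              # substitution or match
              :array.get((i - 1) * width + j - 1, t) + cost
            ])

          :array.set(i * width + j, best, t)
      end

    :array.get(n * width + m, filled)
  end
end
```

The comprehension visits cells row by row, so every cell it reads has already been written.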
@spec longest_common_subsequence(String.t(), String.t()) :: non_neg_integer()
Finds the length of the longest common subsequence (LCS) between two strings.
A subsequence is a sequence that can be derived from another sequence by deleting some elements without changing the order.
Examples
iex> ExNlp.Similarity.longest_common_subsequence("ABCDGH", "AEDFHR")
3
iex> ExNlp.Similarity.longest_common_subsequence("AGGTAB", "GXTXAYB")
4
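The LCS length can be computed with the classic dynamic-programming recurrence: a cell extends the diagonal by one when the characters match, and otherwise takes the larger of the cell above and the cell to the left. The sketch below keeps only one previous row; the module name is hypothetical and this is not necessarily how ExNlp implements it.

```elixir
defmodule LcsSketch do
  # Hypothetical helper, not part of ExNlp.
  def lcs_length(a, b) do
    xs = String.graphemes(a)
    ys = String.graphemes(b)
    # Row 0: the LCS with an empty prefix is always 0.
    zero_row = List.duplicate(0, length(ys) + 1)

    xs
    |> Enum.reduce(zero_row, fn x, prev -> next_row(x, ys, prev) end)
    |> List.last()
  end

  # Builds the next row of the table from the previous one.
  defp next_row(x, ys, [diag0 | prev_tail]) do
    {reversed, _left, _diag} =
      ys
      |> Enum.zip(prev_tail)
      |> Enum.reduce({[0], 0, diag0}, fn {y, above}, {acc, left, diag} ->
        cell = if x == y, do: diag + 1, else: max(above, left)
        {[cell | acc], cell, above}
      end)

    Enum.reverse(reversed)
  end
end
```

For "ABCDGH" and "AEDFHR" one longest common subsequence is "ADH", so the length is 3.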