ExNlp.Similarity (ex_nlp v0.1.0)

String and word set similarity metrics.

This module provides various algorithms for measuring similarity between strings and sets of words, useful for fuzzy matching and search.

Examples

# Levenshtein distance between strings
iex> ExNlp.Similarity.levenshtein("kitten", "sitting")
3

# Jaccard similarity between word sets
iex> ExNlp.Similarity.jaccard(["cat", "dog"], ["cat", "bird"])
0.3333333333333333

Summary

Functions

dice_coefficient(set1, set2)

Calculates the Dice coefficient (Sørensen-Dice coefficient) between two sets of words.

hamming(s1, s2)

Calculates the Hamming distance between two strings.

jaccard(set1, set2)

Calculates the Jaccard similarity coefficient between two sets of words.

jaro_similarity(s1, s2)

Calculates the Jaro similarity between two strings.

jaro_winkler_similarity(s1, s2, opts \\ [])

Calculates the Jaro-Winkler similarity between two strings.

levenshtein(s1, s2)

levenshtein(s1, s2, atom)

Calculates the Levenshtein (edit) distance between two strings using matrix-based dynamic programming.

levenshtein_compact(s1, s2)

Calculates Levenshtein distance using a compact (space-optimized) approach.

levenshtein_similarity(s1, s2)

Calculates Levenshtein similarity between two strings.

levenshtein_tabulated(s1, s2)

Calculates Levenshtein distance using a tabulated (array-based) approach.

longest_common_subsequence(s1, s2)

Finds the length of the longest common subsequence (LCS) between two strings.

Functions

dice_coefficient(set1, set2)

@spec dice_coefficient([String.t()], [String.t()]) :: float()

Calculates the Dice coefficient (Sørensen-Dice coefficient) between two sets of words.

Similar to Jaccard, but shared elements count twice: the coefficient is twice the size of the intersection divided by the sum of the sizes of the two sets.

Returns a value between 0.0 and 1.0.

Examples

iex> ExNlp.Similarity.dice_coefficient(["cat", "dog"], ["cat", "bird"])
0.5

iex> ExNlp.Similarity.dice_coefficient(["cat", "dog"], ["cat", "dog"])
1.0
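
The formula can be sketched in plain Elixir with MapSet. This is an illustration only, not the library's implementation: the module name DiceSketch is hypothetical, and returning 1.0 for two empty inputs is an assumption.

```elixir
defmodule DiceSketch do
  # Hypothetical helper for illustration; not part of ExNlp.
  # Dice = 2 * |A ∩ B| / (|A| + |B|), computed on de-duplicated word sets.
  def dice(words1, words2) do
    a = MapSet.new(words1)
    b = MapSet.new(words2)
    total = MapSet.size(a) + MapSet.size(b)

    if total == 0 do
      # Assumption: two empty sets are treated as identical.
      1.0
    else
      2 * MapSet.size(MapSet.intersection(a, b)) / total
    end
  end
end

# One shared word out of 2 + 2 words in total: 2 * 1 / 4
IO.inspect(DiceSketch.dice(["cat", "dog"], ["cat", "bird"]))
# 0.5
```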

hamming(s1, s2)

@spec hamming(String.t(), String.t()) :: non_neg_integer()

Calculates the Hamming distance between two strings.

The Hamming distance is the number of positions at which the corresponding characters are different. Both strings must be of equal length.

Examples

iex> ExNlp.Similarity.hamming("karolin", "kathrin")
3

iex> ExNlp.Similarity.hamming("hello", "hello")
0
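
A minimal sketch of the position-by-position comparison follows. The module name HammingSketch is hypothetical, and raising ArgumentError on unequal lengths is an assumption: this page does not say how the library handles that case.

```elixir
defmodule HammingSketch do
  # Hypothetical helper for illustration; not part of ExNlp.
  def hamming(s1, s2) do
    g1 = String.graphemes(s1)
    g2 = String.graphemes(s2)

    if length(g1) != length(g2) do
      # Assumption: unequal lengths are rejected with an exception.
      raise ArgumentError, "strings must be of equal length"
    end

    # Pair up characters by position and count the pairs that differ.
    g1
    |> Enum.zip(g2)
    |> Enum.count(fn {a, b} -> a != b end)
  end
end

IO.inspect(HammingSketch.hamming("karolin", "kathrin"))
# 3
```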

jaccard(set1, set2)

@spec jaccard([String.t()], [String.t()]) :: float()

Calculates the Jaccard similarity coefficient between two sets of words.

Jaccard similarity is the size of the intersection divided by the size of the union of the two sets.

Returns a value between 0.0 and 1.0, where 1.0 means identical sets.

Examples

iex> ExNlp.Similarity.jaccard(["cat", "dog"], ["cat", "bird"])
0.3333333333333333

iex> ExNlp.Similarity.jaccard(["cat", "dog"], ["cat", "dog"])
1.0

iex> ExNlp.Similarity.jaccard(["cat"], ["dog"])
0.0
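
The intersection-over-union definition can be sketched with MapSet. This is not the library's implementation: JaccardSketch is a hypothetical name, and treating two empty inputs as identical (1.0) is an assumption.

```elixir
defmodule JaccardSketch do
  # Hypothetical helper for illustration; not part of ExNlp.
  # Jaccard = |A ∩ B| / |A ∪ B| on de-duplicated word sets.
  def jaccard(words1, words2) do
    a = MapSet.new(words1)
    b = MapSet.new(words2)
    union_size = MapSet.size(MapSet.union(a, b))

    if union_size == 0 do
      # Assumption: two empty sets are treated as identical.
      1.0
    else
      MapSet.size(MapSet.intersection(a, b)) / union_size
    end
  end
end

# Intersection {"cat"}, union {"cat", "dog", "bird"}: 1 / 3
IO.inspect(JaccardSketch.jaccard(["cat", "dog"], ["cat", "bird"]))
```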

jaro_similarity(s1, s2)

@spec jaro_similarity(String.t(), String.t()) :: float()

Calculates the Jaro similarity between two strings.

Jaro similarity scores two strings by the number of characters they share within a limited matching window and by the number of transpositions among those matches. Returns a value between 0.0 (no similarity) and 1.0 (identical strings).

Examples

iex> ExNlp.Similarity.jaro_similarity("martha", "marhta")
0.9444444444444445

iex> ExNlp.Similarity.jaro_similarity("dwayne", "duane")
0.8222222222222223

iex> ExNlp.Similarity.jaro_similarity("abc", "xyz")
0.0
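
To show how the scores above arise, the sketch below evaluates the standard Jaro formula from already-known counts. It does not find the matches itself. The module and function names are hypothetical; m is the number of matching characters and t the number of transpositions (half the count of matched characters that appear in a different order).

```elixir
defmodule JaroCountsSketch do
  # Hypothetical helper for illustration; not part of ExNlp.
  # Standard Jaro formula: (m/len1 + m/len2 + (m - t)/m) / 3.
  # With no matching characters the score is defined as 0.0.
  def from_counts(_len1, _len2, 0, _t), do: 0.0

  def from_counts(len1, len2, m, t) do
    (m / len1 + m / len2 + (m - t) / m) / 3
  end
end

# "martha" vs "marhta": all 6 characters match, "t"/"h" are swapped (t = 1).
IO.inspect(JaroCountsSketch.from_counts(6, 6, 6, 1))

# "dwayne" vs "duane": 4 matches (d, a, n, e), no transpositions.
IO.inspect(JaroCountsSketch.from_counts(6, 5, 4, 0))
```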

jaro_winkler_similarity(s1, s2, opts \\ [])

@spec jaro_winkler_similarity(String.t(), String.t(), keyword()) :: float()

Calculates the Jaro-Winkler similarity between two strings.

Jaro-Winkler extends Jaro by boosting the score of strings that share a common prefix. The prefix is counted only up to a maximum length.

Options

:prefix_length - Maximum prefix length to consider (default: 4)
:prefix_weight - Weight factor for the prefix bonus (default: 0.1)

Examples

iex> ExNlp.Similarity.jaro_winkler_similarity("martha", "marhta")
0.9611111111111111

iex> ExNlp.Similarity.jaro_winkler_similarity("dwayne", "duane")
0.84

iex> ExNlp.Similarity.jaro_winkler_similarity("hello", "helo", prefix_length: 2)
0.9333333333333333
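
The prefix bonus can be sketched with the standard Jaro-Winkler formula, jaro + prefix * weight * (1 - jaro). This is not the library's implementation and its results may differ: the module name is hypothetical, the option names are borrowed from the list above, and the Jaro score comes from Elixir's standard-library String.jaro_distance/2, which returns a similarity despite its name.

```elixir
defmodule JaroWinklerSketch do
  # Hypothetical helper for illustration; not part of ExNlp.
  def jaro_winkler(s1, s2, opts \\ []) do
    max_prefix = Keyword.get(opts, :prefix_length, 4)
    weight = Keyword.get(opts, :prefix_weight, 0.1)

    # Standard-library Jaro similarity (0.0 to 1.0).
    jaro = String.jaro_distance(s1, s2)
    prefix = common_prefix_length(s1, s2, max_prefix)

    # The bonus grows with the shared prefix and shrinks as jaro nears 1.0.
    jaro + prefix * weight * (1.0 - jaro)
  end

  # Counts leading characters that are equal, looking at no more than
  # max_prefix positions.
  defp common_prefix_length(s1, s2, max_prefix) do
    String.graphemes(s1)
    |> Enum.zip(String.graphemes(s2))
    |> Enum.take(max_prefix)
    |> Enum.take_while(fn {a, b} -> a == b end)
    |> length()
  end
end

# "martha" and "marhta" share the prefix "mar" (length 3).
IO.inspect(JaroWinklerSketch.jaro_winkler("martha", "marhta"))
```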

levenshtein(s1, s2)

@spec levenshtein(String.t(), String.t()) :: non_neg_integer()

levenshtein(s1, s2, atom)

@spec levenshtein(String.t(), String.t(), :compact) :: non_neg_integer()

@spec levenshtein(String.t(), String.t(), :matrix) :: non_neg_integer()

@spec levenshtein(String.t(), String.t(), :tabulated) :: non_neg_integer()

Calculates the Levenshtein (edit) distance between two strings using matrix-based dynamic programming.

This is the default implementation using a full matrix. As the specs above show, levenshtein/3 accepts :matrix, :compact, or :tabulated as its third argument. For better performance with large strings, consider levenshtein_tabulated/2 or levenshtein_compact/2.

Returns the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another.

Examples

iex> ExNlp.Similarity.levenshtein("kitten", "sitting")
3

iex> ExNlp.Similarity.levenshtein("", "abc")
3

iex> ExNlp.Similarity.levenshtein("abc", "abc")
0
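
A full-matrix version of the recurrence can be sketched as follows. The library's internals are not shown on this page, so the representation here (a map keyed by {row, column}) and the module name are assumptions made for illustration.

```elixir
defmodule LevenshteinMatrixSketch do
  # Hypothetical helper for illustration; not part of ExNlp.
  def distance(s1, s2) do
    a = String.graphemes(s1)
    b = String.graphemes(s2)
    n = length(a)
    m = length(b)

    # Base cases: turning a prefix into the empty string (or the reverse)
    # costs one edit per character.
    base =
      Map.merge(
        Map.new(0..n, fn i -> {{i, 0}, i} end),
        Map.new(0..m, fn j -> {{0, j}, j} end)
      )

    # Fill the table row by row; each cell depends on the cells above,
    # to the left, and diagonally above-left.
    table =
      for {ca, i} <- Enum.with_index(a, 1),
          {cb, j} <- Enum.with_index(b, 1),
          reduce: base do
        acc ->
          cost = if ca == cb, do: 0, else: 1

          value =
            Enum.min([
              acc[{i - 1, j}] + 1,
              acc[{i, j - 1}] + 1,
              acc[{i - 1, j - 1}] + cost
            ])

          Map.put(acc, {i, j}, value)
      end

    table[{n, m}]
  end
end

IO.inspect(LevenshteinMatrixSketch.distance("kitten", "sitting"))
# 3
```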

levenshtein_compact(s1, s2)

@spec levenshtein_compact(String.t(), String.t()) :: non_neg_integer()

Calculates Levenshtein distance using a compact (space-optimized) approach.

Uses only two rows of the matrix instead of the full matrix, reducing memory usage from O(n*m) to O(min(n,m)), where n and m are the lengths of the two strings.

Examples

iex> ExNlp.Similarity.levenshtein_compact("kitten", "sitting")
3

iex> ExNlp.Similarity.levenshtein_compact("abc", "xyz")
3
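
The two-row idea can be sketched as below: only the previous row is kept while the next one is built, and rows are sized by the shorter string. This is an illustration under assumed names, not the library's code.

```elixir
defmodule LevenshteinTwoRowSketch do
  # Hypothetical helper for illustration; not part of ExNlp.
  def distance(s1, s2) do
    a = String.graphemes(s1)
    b = String.graphemes(s2)

    # Rows run along the shorter string, so each row has min(n, m) + 1 cells.
    {long, short} = if length(a) >= length(b), do: {a, b}, else: {b, a}
    first_row = Enum.to_list(0..length(short))

    long
    |> Enum.with_index(1)
    |> Enum.reduce(first_row, fn {cl, i}, prev -> next_row(cl, i, short, prev) end)
    |> List.last()
  end

  # Builds row i from the previous row. `left` is the cell just written,
  # `up` the cell above, `diag` the cell above-left.
  defp next_row(cl, i, short, [diag0 | prev_tail]) do
    {row_rev, _left, _diag} =
      short
      |> Enum.zip(prev_tail)
      |> Enum.reduce({[i], i, diag0}, fn {cs, up}, {acc, left, diag} ->
        cost = if cl == cs, do: 0, else: 1
        value = Enum.min([up + 1, left + 1, diag + cost])
        {[value | acc], value, up}
      end)

    Enum.reverse(row_rev)
  end
end

IO.inspect(LevenshteinTwoRowSketch.distance("kitten", "sitting"))
# 3
```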

levenshtein_similarity(s1, s2)

@spec levenshtein_similarity(String.t(), String.t()) :: float()

Calculates Levenshtein similarity between two strings.

Levenshtein similarity is a normalized measure derived from Levenshtein distance. Returns a value between 0.0 (completely different) and 1.0 (identical strings).

The similarity is calculated as: 1 - (distance / max(len1, len2))

Examples

iex> ExNlp.Similarity.levenshtein_similarity("kitten", "sitting")
0.5714285714285714

iex> ExNlp.Similarity.levenshtein_similarity("abc", "abc")
1.0

iex> ExNlp.Similarity.levenshtein_similarity("abc", "xyz")
0.0

iex> ExNlp.Similarity.levenshtein_similarity("", "abc")
0.0
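
The normalization step can be sketched on its own, given a distance and the two string lengths. The module name is hypothetical, and returning 1.0 when both strings are empty is an assumption made to avoid dividing by zero.

```elixir
defmodule LevSimilaritySketch do
  # Hypothetical helper for illustration; not part of ExNlp.
  # Assumption: two empty strings are treated as identical.
  def normalize(_distance, 0, 0), do: 1.0

  # similarity = 1 - distance / max(len1, len2)
  def normalize(distance, len1, len2) do
    1 - distance / max(len1, len2)
  end
end

# "kitten" (6) vs "sitting" (7) with distance 3: 1 - 3/7
IO.inspect(LevSimilaritySketch.normalize(3, 6, 7))
```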

levenshtein_tabulated(s1, s2)

@spec levenshtein_tabulated(String.t(), String.t()) :: non_neg_integer()

Calculates Levenshtein distance using a tabulated (array-based) approach.

This implementation uses Erlang's :array module for efficient memory usage. It is generally faster than the matrix-based approach for longer strings.

Examples

iex> ExNlp.Similarity.levenshtein_tabulated("kitten", "sitting")
3

iex> ExNlp.Similarity.levenshtein_tabulated("house", "horses")
2
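
One way to apply Erlang's :array module to this problem is sketched below. How the library actually uses :array is not shown on this page, so the layout here (one array per row, indexed by column) and the module name are assumptions.

```elixir
defmodule LevenshteinArraySketch do
  # Hypothetical helper for illustration; not part of ExNlp.
  def distance(s1, s2) do
    a = String.graphemes(s1)
    b = :array.from_list(String.graphemes(s2))
    m = :array.size(b)

    # Row 0: distance from the empty prefix of s1 to each prefix of s2.
    first_row = :array.from_list(Enum.to_list(0..m))

    last_row =
      a
      |> Enum.with_index(1)
      |> Enum.reduce(first_row, fn {ca, i}, prev ->
        # Column 0 of row i: deleting i characters.
        row0 = :array.set(0, i, :array.new(m + 1))

        # 1..m//1 is empty when m == 0, so empty strings are handled.
        Enum.reduce(1..m//1, row0, fn j, row ->
          cost = if ca == :array.get(j - 1, b), do: 0, else: 1

          value =
            Enum.min([
              :array.get(j, prev) + 1,
              :array.get(j - 1, row) + 1,
              :array.get(j - 1, prev) + cost
            ])

          :array.set(j, value, row)
        end)
      end)

    :array.get(m, last_row)
  end
end

IO.inspect(LevenshteinArraySketch.distance("house", "horses"))
# 2
```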

longest_common_subsequence(s1, s2)

@spec longest_common_subsequence(String.t(), String.t()) :: non_neg_integer()

Finds the length of the longest common subsequence (LCS) between two strings.

A subsequence is a sequence that can be derived from another sequence by deleting some elements without changing the order of the remaining elements. The elements of a subsequence need not be contiguous in the original.

Examples

iex> ExNlp.Similarity.longest_common_subsequence("ABCDGH", "AEDFHR")
3

iex> ExNlp.Similarity.longest_common_subsequence("AGGTAB", "GXTXAYB")
4
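
The LCS length follows a recurrence similar to edit distance and can be sketched row by row. This is an illustration under assumed names, not the library's implementation.

```elixir
defmodule LcsSketch do
  # Hypothetical helper for illustration; not part of ExNlp.
  def lcs_length(s1, s2) do
    a = String.graphemes(s1)
    b = String.graphemes(s2)

    # Row 0: the LCS of an empty prefix with anything has length 0.
    zero_row = List.duplicate(0, length(b) + 1)

    a
    |> Enum.reduce(zero_row, fn ca, prev -> next_row(ca, b, prev) end)
    |> List.last()
  end

  # Equal characters extend the diagonal value by one; otherwise the
  # better of the cell above and the cell to the left is carried over.
  defp next_row(ca, b, [diag0 | prev_tail]) do
    {row_rev, _left, _diag} =
      b
      |> Enum.zip(prev_tail)
      |> Enum.reduce({[0], 0, diag0}, fn {cb, up}, {acc, left, diag} ->
        value = if ca == cb, do: diag + 1, else: max(up, left)
        {[value | acc], value, up}
      end)

    Enum.reverse(row_rev)
  end
end

# One longest common subsequence of these two strings is "ADH".
IO.inspect(LcsSketch.lcs_length("ABCDGH", "AEDFHR"))
# 3
```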