Nasty.Lexical.WordNet.Similarity (Nasty v0.3.0)

Semantic similarity metrics for WordNet synsets.

Provides various algorithms for measuring semantic similarity between words or synsets based on their position in the WordNet hierarchy and their definitions.

Metrics

Path Similarity - Based on shortest path length in hypernym hierarchy
Wu-Palmer Similarity - Based on depth of LCS (Least Common Subsumer)
Lesk Similarity - Based on definition overlap
Depth - Length of the longest path from a synset to a root of the taxonomy

Example

alias Nasty.Lexical.WordNet
alias Nasty.Lexical.WordNet.Similarity

# Compare "dog" and "cat"
dog_synset = WordNet.synsets("dog", :noun) |> hd()
cat_synset = WordNet.synsets("cat", :noun) |> hd()

# Path similarity
Similarity.path_similarity(dog_synset.id, cat_synset.id)  # ~0.2

# Wu-Palmer similarity
Similarity.wup_similarity(dog_synset.id, cat_synset.id)   # ~0.857

Summary

Types

language()

similarity_score()

synset_id()

Functions

combined_similarity(synset1_id, synset2_id, language \\ :en, opts \\ [])

Combines multiple similarity metrics with optional weights.

depth(synset_id, language \\ :en)

Calculates the depth of a synset in the taxonomy.

lcs(synset1_id, synset2_id, language \\ :en)

Finds the Least Common Subsumer (LCS) of two synsets.

lesk_similarity(synset1_id, synset2_id, language \\ :en)

Calculates Lesk similarity based on definition overlap.

path_similarity(synset1_id, synset2_id, language \\ :en)

Calculates path-based similarity between two synsets.

word_similarity(word1, word2, pos \\ nil, language \\ :en, opts \\ [])

Calculates similarity between two words (not synsets).

wup_similarity(synset1_id, synset2_id, language \\ :en)

Calculates Wu-Palmer similarity between two synsets.

Types

language()

@type language() :: atom()

similarity_score()

@type similarity_score() :: float()

synset_id()

@type synset_id() :: String.t()

Functions

combined_similarity(synset1_id, synset2_id, language \\ :en, opts \\ [])

@spec combined_similarity(synset_id(), synset_id(), language(), keyword()) ::
  similarity_score()

Combines multiple similarity metrics with optional weights.

Returns a weighted average of the selected similarity metrics.

Options

:metrics - List of metrics to use (default: all)
:weights - List of weights, one per metric in the same order as :metrics (default: equal weights)

Examples

iex> Similarity.combined_similarity(
...>   "oewn-02084071-n",
...>   "oewn-02121620-n",
...>   :en,
...>   metrics: [:path, :wup, :lesk],
...>   weights: [0.3, 0.5, 0.2]
...> )
0.5185  # 0.3 * 0.2 + 0.5 * 0.857 + 0.2 * 0.15, using the example scores on this page
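
The weighted average described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the module name `CombineSketch` and the function `combine/2` are hypothetical, the input scores are made-up numbers, and dividing by the sum of the weights is an assumption about how the average is normalized.

```elixir
defmodule CombineSketch do
  # Hypothetical helper: weighted average of already-computed scores.
  # When no weights are given, every score gets the same weight.
  def combine(scores, weights \\ nil) do
    weights = weights || List.duplicate(1.0, length(scores))
    total = Enum.sum(weights)

    weighted =
      scores
      |> Enum.zip(weights)
      |> Enum.map(fn {score, weight} -> score * weight end)
      |> Enum.sum()

    # Assumption: normalize by the sum of weights; 0.0 if all weights are zero.
    if total == 0, do: 0.0, else: weighted / total
  end
end

# Made-up scores for three metrics, with explicit weights.
IO.inspect(CombineSketch.combine([0.5, 1.0, 0.0], [0.3, 0.5, 0.2]))

# Equal weights by default.
IO.inspect(CombineSketch.combine([0.2, 0.8]))
```

With weights that sum to 1.0, the result is simply the sum of each score multiplied by its weight.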

depth(synset_id, language \\ :en)

@spec depth(synset_id(), language()) :: non_neg_integer()

Calculates the depth of a synset in the taxonomy.

Depth is measured as the length of the longest path from the synset to a root node (a synset with no hypernyms).

Returns a non-negative integer representing depth.

Examples

iex> Similarity.depth("oewn-00001740-n", :en)  # entity (root)
0

iex> Similarity.depth("oewn-02084071-n", :en)  # dog
13
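
To make the "longest path to a root" definition concrete, here is a small sketch over a toy hypernym map. The module `DepthSketch`, the map contents, and the synset names are all hypothetical and do not reflect real WordNet data or the library's internals.

```elixir
defmodule DepthSketch do
  # Toy hypernym map (synset => list of hypernyms); not real WordNet data.
  # "dog" has two hypernyms so that the longest and shortest paths differ.
  @hypernyms %{
    "entity" => [],
    "animal" => ["entity"],
    "carnivore" => ["animal"],
    "canine" => ["carnivore"],
    "dog" => ["canine", "animal"]
  }

  # Length of the longest path from `id` up to a root
  # (a synset with no hypernyms).
  def depth(id) do
    case Map.get(@hypernyms, id, []) do
      [] -> 0
      parents -> 1 + (parents |> Enum.map(&depth/1) |> Enum.max())
    end
  end
end

IO.inspect(DepthSketch.depth("entity"))  # 0 (root)
IO.inspect(DepthSketch.depth("dog"))     # 4 (via canine, not the shorter route via animal)
```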

lcs(synset1_id, synset2_id, language \\ :en)

@spec lcs(synset_id(), synset_id(), language()) :: synset_id() | nil

Finds the Least Common Subsumer (LCS) of two synsets.

The LCS is the most specific common ancestor (deepest common hypernym) of two synsets in the taxonomy.

Returns the synset ID of the LCS, or nil if no common ancestor exists.
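
One way to picture the LCS computation is to intersect the ancestor sets of both synsets and pick the deepest member. The sketch below does this on a toy taxonomy; `LcsSketch`, its data, and its helper functions are hypothetical and only illustrate the idea.

```elixir
defmodule LcsSketch do
  # Toy hypernym map (synset => list of hypernyms); not real WordNet data.
  @hypernyms %{
    "entity" => [],
    "animal" => ["entity"],
    "carnivore" => ["animal"],
    "canine" => ["carnivore"],
    "feline" => ["carnivore"],
    "dog" => ["canine"],
    "cat" => ["feline"]
  }

  # All ancestors of `id`, including `id` itself.
  def ancestors(id) do
    @hypernyms
    |> Map.get(id, [])
    |> Enum.reduce(MapSet.new([id]), fn parent, acc ->
      MapSet.union(acc, ancestors(parent))
    end)
  end

  # Longest path from `id` to a root.
  def depth(id) do
    case Map.get(@hypernyms, id, []) do
      [] -> 0
      parents -> 1 + (parents |> Enum.map(&depth/1) |> Enum.max())
    end
  end

  # Deepest synset shared by both ancestor sets, or nil if none is shared.
  def lcs(id1, id2) do
    common = MapSet.intersection(ancestors(id1), ancestors(id2))

    if MapSet.size(common) == 0 do
      nil
    else
      Enum.max_by(common, &depth/1)
    end
  end
end

IO.inspect(LcsSketch.lcs("dog", "cat"))      # "carnivore"
IO.inspect(LcsSketch.lcs("dog", "unknown"))  # nil
```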

Examples

iex> Similarity.lcs("oewn-02084071-n", "oewn-02121620-n", :en)  # dog, cat
"oewn-02075296-n"  # carnivore

lesk_similarity(synset1_id, synset2_id, language \\ :en)

@spec lesk_similarity(synset_id(), synset_id(), language()) :: similarity_score()

Calculates Lesk similarity based on definition overlap.

Measures similarity by counting the words that the two synset definitions (glosses) have in common. Unlike the other metrics, it relies on definition text rather than on the hypernym hierarchy.

Returns a score from 0.0 to 1.0, where:

Higher values = more overlapping words in definitions
0.0 = no overlap

Examples

iex> Similarity.lesk_similarity("oewn-02084071-n", "oewn-02121620-n", :en)  # dog, cat
0.15  # Some overlap in definitions (animal-related words)
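
The following sketch shows one possible way to turn definition overlap into a score between 0.0 and 1.0. The normalization used here (shared words divided by all distinct words in either definition) is an assumption, since the page does not say how the library normalizes the count; `LeskSketch` and the sample definitions are hypothetical.

```elixir
defmodule LeskSketch do
  # Lowercase the text, split on anything that is not a letter,
  # and keep the distinct words.
  def tokens(text) do
    text
    |> String.downcase()
    |> String.split(~r/[^a-z]+/, trim: true)
    |> MapSet.new()
  end

  # Assumed normalization: shared words / distinct words in either definition.
  def overlap(definition1, definition2) do
    a = tokens(definition1)
    b = tokens(definition2)
    union_size = MapSet.size(MapSet.union(a, b))

    if union_size == 0 do
      0.0
    else
      MapSet.size(MapSet.intersection(a, b)) / union_size
    end
  end
end

# Made-up definitions: 3 shared words out of 5 distinct words.
IO.inspect(LeskSketch.overlap("a small furry animal", "a large furry animal"))

# No shared words.
IO.inspect(LeskSketch.overlap("red", "blue"))
```

A real implementation would probably also drop stop words such as "a" and "the", which otherwise inflate the overlap.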

path_similarity(synset1_id, synset2_id, language \\ :en)

@spec path_similarity(synset_id(), synset_id(), language()) :: similarity_score()

Calculates path-based similarity between two synsets.

Uses the shortest path length in the hypernym/hyponym hierarchy. Formula: 1 / (path_length + 1), where path_length is the number of edges on that path.

Returns a score from 0.0 to 1.0, where:

1.0 = identical synsets
Higher values = more similar
0.0 = no path exists

Examples

iex> Similarity.path_similarity("oewn-02084071-n", "oewn-02084071-n")  # dog == dog
1.0

iex> Similarity.path_similarity("oewn-02084071-n", "oewn-02083346-n")  # dog -> canine
0.5
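
As an illustration of the formula, the sketch below finds the shortest path with a breadth-first search over a toy taxonomy, following edges in both the hypernym and hyponym direction. `PathSketch` and its data are hypothetical; the library may compute the path differently.

```elixir
defmodule PathSketch do
  # Toy hypernym map (synset => list of hypernyms); not real WordNet data.
  @hypernyms %{
    "entity" => [],
    "animal" => ["entity"],
    "carnivore" => ["animal"],
    "canine" => ["carnivore"],
    "feline" => ["carnivore"],
    "dog" => ["canine"],
    "cat" => ["feline"]
  }

  # Neighbors in either direction: hypernyms plus hyponyms.
  defp neighbors(id) do
    up = Map.get(@hypernyms, id, [])
    down = for {child, parents} <- @hypernyms, id in parents, do: child
    up ++ down
  end

  # Breadth-first search; returns the number of edges, or nil if unreachable.
  def path_length(from, to), do: bfs([{from, 0}], MapSet.new([from]), to)

  defp bfs([], _seen, _to), do: nil
  defp bfs([{id, dist} | _rest], _seen, to) when id == to, do: dist

  defp bfs([{id, dist} | rest], seen, to) do
    fresh = Enum.reject(neighbors(id), &MapSet.member?(seen, &1))
    seen = Enum.reduce(fresh, seen, &MapSet.put(&2, &1))
    bfs(rest ++ Enum.map(fresh, &{&1, dist + 1}), seen, to)
  end

  # 1 / (path_length + 1), or 0.0 when no path exists.
  def path_similarity(id1, id2) do
    case path_length(id1, id2) do
      nil -> 0.0
      len -> 1 / (len + 1)
    end
  end
end

IO.inspect(PathSketch.path_similarity("dog", "dog"))     # 1.0 (path length 0)
IO.inspect(PathSketch.path_similarity("dog", "canine"))  # 0.5 (path length 1)
IO.inspect(PathSketch.path_similarity("dog", "cat"))     # 0.2 (path length 4)
IO.inspect(PathSketch.path_similarity("dog", "rock"))    # 0.0 (no path)
```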

word_similarity(word1, word2, pos \\ nil, language \\ :en, opts \\ [])

@spec word_similarity(String.t(), String.t(), atom() | nil, language(), keyword()) ::
  similarity_score()

Calculates similarity between two words (not synsets).

Finds the maximum similarity across all synset pairs for the two words.

Examples

iex> Similarity.word_similarity("dog", "cat", :noun)
0.857
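
The "maximum across all synset pairs" step can be sketched as below. `WordSketch`, `max_pairwise/3`, the synset labels, and the score table are hypothetical; returning 0.0 when a word has no synsets is an assumption.

```elixir
defmodule WordSketch do
  # Best score over every pairing of the two words' synsets.
  # `score_fun` is any two-argument function returning a similarity score.
  def max_pairwise(synsets1, synsets2, score_fun) do
    scores = for s1 <- synsets1, s2 <- synsets2, do: score_fun.(s1, s2)

    case scores do
      # Assumption: no synsets for either word means no similarity.
      [] -> 0.0
      _ -> Enum.max(scores)
    end
  end
end

# Made-up scores between two senses of "dog" and one sense of "cat".
toy_scores = %{{"dog.1", "cat.1"} => 0.857, {"dog.2", "cat.1"} => 0.1}
score = fn a, b -> Map.get(toy_scores, {a, b}, 0.0) end

IO.inspect(WordSketch.max_pairwise(["dog.1", "dog.2"], ["cat.1"], score))  # 0.857
```

Taking the maximum means the closest pair of senses decides the score, so an unrelated sense of either word does not pull the result down.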

wup_similarity(synset1_id, synset2_id, language \\ :en)

@spec wup_similarity(synset_id(), synset_id(), language()) :: similarity_score()

Calculates Wu-Palmer similarity between two synsets.

Based on the depth of the Least Common Subsumer (LCS) and the depths of the two synsets in the taxonomy.

Formula: 2 * depth(LCS) / (depth(synset1) + depth(synset2))

Returns a score from 0.0 to 1.0, where:

1.0 = identical synsets
Higher values = more similar
0.0 = no common ancestor

This metric often gives more intuitive results than path similarity because it considers depth in the taxonomy.

Examples

iex> Similarity.wup_similarity("oewn-02084071-n", "oewn-02121620-n", :en)  # dog, cat
0.857  # High similarity (both are carnivores)

iex> Similarity.wup_similarity("oewn-02084071-n", "oewn-12345678-n", :en)  # dog, tree (placeholder ID for tree)
0.133  # Low similarity (different domains)
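
The arithmetic behind the formula can be checked with a small sketch that takes the three depths directly. `WupSketch` is hypothetical, the depths are made-up values, and returning 0.0 when both depths are zero is an assumption to avoid dividing by zero.

```elixir
defmodule WupSketch do
  # 2 * depth(LCS) / (depth(synset1) + depth(synset2)).
  def wup(lcs_depth, depth1, depth2) do
    case depth1 + depth2 do
      # Assumption: treat the degenerate 0 / 0 case as 0.0.
      0 -> 0.0
      total -> 2 * lcs_depth / total
    end
  end
end

IO.inspect(WupSketch.wup(4, 4, 4))  # 1.0: the LCS is the synset itself (identical synsets)
IO.inspect(WupSketch.wup(2, 4, 4))  # 0.5: same depth, but the LCS sits two levels higher
IO.inspect(WupSketch.wup(0, 4, 4))  # 0.0: the only shared ancestor is at depth 0
```

The second call shows that two distinct synsets at the same depth score below 1.0, because their LCS is necessarily shallower than either of them.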