Similarity.Simhash (Similarity v0.4.0)

Simhash string similarity algorithm. Description of Simhash

iex> Similarity.simhash("Barna", "Kovacs")
0.59375

iex> Similarity.simhash("Austria", "Australia")
0.65625

Summary

Functions

hamming_distance(left, right, acc \\ 0)
Returns the Hamming distance between the left and right hash, given as lists of bits.

hash(string, options)
Returns the Simhash of the given string, computed with the chosen :hash_function and returned in the chosen :return_type.

similarity(left, right, options \\ [])
Calculates the Simhash similarity between the left and right strings and returns it as a float.

Functions

hamming_distance(left, right, acc \\ 0)

(since 0.1.1)

Returns the Hamming distance between the left and right hash, given as lists of bits.

Examples

iex> Similarity.Simhash.hamming_distance([1, 1, 0, 1, 0], [0, 1, 1, 1, 0])
2
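
The signature above takes an accumulator that defaults to 0, which suggests a recursive count of differing positions. A minimal sketch of that idea follows; the module name HammingSketch is hypothetical and this is not the library's source.

```elixir
defmodule HammingSketch do
  # Hypothetical module, not part of Similarity.
  # Counts the positions at which two equal-length bit lists differ.
  def distance(left, right, acc \\ 0)

  # Both lists exhausted: the accumulator holds the distance.
  def distance([], [], acc), do: acc

  # Heads are equal (same variable bound twice): nothing to add.
  def distance([same | left], [same | right], acc), do: distance(left, right, acc)

  # Heads differ: count one more differing position.
  def distance([_ | left], [_ | right], acc), do: distance(left, right, acc + 1)
end

HammingSketch.distance([1, 1, 0, 1, 0], [0, 1, 1, 1, 0])
# => 2
```

Lists of unequal length raise a FunctionClauseError in this sketch, since no clause matches once only one list is empty.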

hash(string, options)

(since 0.1.1)
@spec hash(
  String.t(),
  keyword()
) :: [0 | 1] | integer()

Returns the Simhash of the given string, computed with the chosen :hash_function and returned in the chosen :return_type.

Options

  • :ngram_size - defaults to 3
  • :hash_function - defaults to :siphash, available options are :siphash, :md5, :sha256
  • :return_type - defaults to :list, available options are :list, :int64_unsigned, :int64_signed, :binary

The return types :int64_unsigned and :int64_signed are only available for the :siphash hash function.
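
To show how :ngram_size and :hash_function could fit together, here is a rough sketch of the general Simhash technique: split the string into overlapping n-grams, hash each one, let every hash vote on each bit position, and keep the majority. The module SimhashSketch and its functions are hypothetical, the first 64 bits of an MD5 digest stand in for the n-gram hash, and ties resolve to 0. None of these choices is confirmed by the library, so the output will not necessarily match Similarity.Simhash.hash.

```elixir
defmodule SimhashSketch do
  # Hypothetical module illustrating the general technique only.
  import Bitwise

  @bits 64

  # Overlapping character n-grams: "alma" with n = 3 gives ["alm", "lma"].
  def ngrams(string, n) do
    string
    |> String.graphemes()
    |> Enum.chunk_every(n, 1, :discard)
    |> Enum.map(&Enum.join/1)
  end

  # Assumption: the first 64 bits of an MD5 digest serve as the n-gram hash.
  defp hash64(ngram) do
    <<int::unsigned-integer-size(64), _rest::binary>> = :crypto.hash(:md5, ngram)
    int
  end

  # Returns a list of 64 bits, most significant bit first.
  def hash(string, n \\ 3) do
    string
    |> ngrams(n)
    |> Enum.map(&hash64/1)
    |> Enum.reduce(List.duplicate(0, @bits), &vote/2)
    |> Enum.map(fn count -> if count > 0, do: 1, else: 0 end)
  end

  # Each n-gram hash adds 1 to a position where its bit is set and subtracts 1 otherwise.
  defp vote(int, counts) do
    counts
    |> Enum.with_index()
    |> Enum.map(fn {count, i} ->
      bit = (int >>> (@bits - 1 - i)) &&& 1
      if bit == 1, do: count + 1, else: count - 1
    end)
  end
end

length(SimhashSketch.hash("alma korte"))
# => 64
```

Because the votes are summed, the order of the n-grams does not matter in this sketch; only the multiset of n-grams does.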

Examples

Similarity.Simhash.hash("alma korte")
[1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, ...]

iex> Similarity.Simhash.hash("alma korte", ngram_size: 3, hash_function: :siphash, return_type: :int64_unsigned)
15012197954348909067

iex> Similarity.Simhash.hash("alma korte", ngram_size: 3, hash_function: :siphash, return_type: :int64_signed)
-3434546119360642549
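
The two integers above are the same 64 bits read in two ways: the unsigned value minus 2^64 equals the signed value. The snippet below reproduces that relationship with plain Elixir binary pattern matching. It does not call the library, and whether the library converts between the return types this way is an assumption.

```elixir
unsigned = 15_012_197_954_348_909_067

# Reinterpret the same 64 bits as a two's-complement signed integer.
<<signed::signed-integer-size(64)>> = <<unsigned::unsigned-integer-size(64)>>
signed
# => -3434546119360642549

# Expand the integer into a list of 64 bits, most significant bit first.
binary = <<unsigned::unsigned-integer-size(64)>>
bits = for <<bit::size(1) <- binary>>, do: bit

Enum.take(bits, 12)
# => [1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1]
```

The first twelve bits agree with the :list example shown earlier for the same input string.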

hash_similarity(left, right, length)

(since 0.1.1)

similarity(left, right, options \\ [])

(since 0.1.1)
@spec similarity(String.t(), String.t(), keyword()) :: float()

Calculates the Simhash similarity between the left and right strings and returns it as a float.

Options

  • :ngram_size - defaults to 3
  • :hash_function - defaults to :siphash, available options are :siphash, :md5, :sha256

Examples

iex> Similarity.simhash("khan academy", "khan academia")
0.890625

iex> Similarity.simhash("khan academy", "academy khan", ngram_size: 1)
1.0
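
Every result shown on this page is a multiple of 1/64 (for example 0.890625 = 57/64), which is consistent with comparing two 64-bit hashes and reporting the share of matching bits. That reading is an inference from the examples, not something the documentation states. Under that assumption, the final step could look like the sketch below; SimilaritySketch.from_hashes is a hypothetical helper.

```elixir
defmodule SimilaritySketch do
  # Hypothetical helper, not part of Similarity.
  # Returns the share of positions on which two equal-length bit lists agree.
  def from_hashes(left_bits, right_bits) do
    total = length(left_bits)

    differing =
      left_bits
      |> Enum.zip(right_bits)
      |> Enum.count(fn {l, r} -> l != r end)

    # `/` always returns a float in Elixir, so the result is a float.
    1 - differing / total
  end
end

SimilaritySketch.from_hashes([1, 1, 0, 1], [1, 0, 0, 1])
# => 0.75
```

With 64-bit hashes, 7 differing bits would give 1 - 7/64 = 0.890625, the value in the first example above.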