Text.Phonetic.DoubleMetaphone (Text v0.5.0)


Double Metaphone phonetic encoding (Lawrence Philips, 2000).

Double Metaphone is the de facto standard for fuzzy matching of English names with non-English origins. Unlike single Metaphone, it returns two codes per input, a primary and an alternate, reflecting the fact that the same Anglicised name may be pronounced differently depending on the speaker's expectations.

Two names are considered a match when any of the four (primary_a, alternate_a) × (primary_b, alternate_b) combinations agree.
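The comparison rule can be sketched directly on two already-computed code pairs. This is only an illustration: the module PairMatchSketch and its any_match?/2 function are hypothetical names, not part of the library, and the sketch assumes that empty codes never count as a match, in line with the documented behaviour of match?/3.

```elixir
defmodule PairMatchSketch do
  # Hypothetical helper, not part of Text.Phonetic.DoubleMetaphone.
  # Each argument is a {primary, alternate} code pair.
  def any_match?({primary_a, alternate_a}, {primary_b, alternate_b}) do
    # Compare every code of the first name with every code of the second.
    # Empty codes are ignored so that two empty inputs never match.
    comparisons =
      for a <- [primary_a, alternate_a],
          b <- [primary_b, alternate_b],
          do: a != "" and a == b

    Enum.any?(comparisons)
  end
end

# Code pairs copied from the encode/2 examples for "Smith" and "Schmidt".
# The alternate of the first pair equals the primary of the second.
IO.inspect(PairMatchSketch.any_match?({"SM0", "XMT"}, {"XMT", "SMT"}))
# prints: true
```

Only one of the four comparisons has to succeed, which is why "Smith" and "Schmidt" match even though their primary codes differ.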

When to use

Double Metaphone is the strongest of the four phonetic encodings shipped with text for English-language name matching, and it handles non-Anglo-Saxon names (Slavic, Italian, Spanish, French, German, Greek, …) noticeably better than Soundex, Metaphone, or NYSIIS. It is the algorithm Apache Lucene's DoubleMetaphoneFilter uses, and the algorithm Python's metaphone package exposes as doublemetaphone.

Use Soundex only for compatibility with legacy systems that expose Soundex codes (databases, government records).

Use Cologne for German-only corpora — it outperforms Double Metaphone there.

Reference

Philips, L. (2000). The Double Metaphone Search Algorithm. C/C++ Users Journal, 18(6), 38–43.

This implementation is a port of the canonical algorithm and validates against the test vectors published with the original paper.

Types

code()

@type code() :: {String.t(), String.t()}

A {primary, alternate} code pair returned by encode/2.

Functions

encode(name, options \\ [])

@spec encode(
  String.t(),
  keyword()
) :: code()

Returns the Double Metaphone code pair {primary, alternate} for name.

When the name is unambiguous, primary and alternate are the same string.

Arguments

  • name is a string. Diacritics are folded via Text.Clean.unaccent/1 before encoding; non-Latin letters are discarded.

Options

  • :max_length — truncate both codes to this many characters. Defaults to 4 (the canonical Philips length); pass nil to skip truncation.
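The effect of :max_length can be pictured as a prefix cut applied to each code independently. The following is a minimal sketch under that assumption. TruncateSketch is a hypothetical module, the input codes are made-up placeholders rather than real encoder output, and the library may implement truncation differently internally.

```elixir
defmodule TruncateSketch do
  # Hypothetical helper illustrating the documented :max_length option.

  # nil means "skip truncation": the pair is returned unchanged.
  def truncate(pair, nil), do: pair

  # Otherwise keep at most max_length leading characters of each code.
  def truncate({primary, alternate}, max_length)
      when is_integer(max_length) and max_length >= 0 do
    {String.slice(primary, 0, max_length), String.slice(alternate, 0, max_length)}
  end
end

# Placeholder codes, chosen only to make the cut visible.
IO.inspect(TruncateSketch.truncate({"ABCDEF", "ABCXYZ"}, 4))
# prints: {"ABCD", "ABCX"}

IO.inspect(TruncateSketch.truncate({"ABCDEF", "ABCXYZ"}, nil))
# prints: {"ABCDEF", "ABCXYZ"}
```

A shorter limit makes more names collide on the same code, so raising or disabling it trades recall for precision.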

Returns

  • A 2-tuple {primary, alternate} of uppercase ASCII strings. Returns {"", ""} for empty input or input containing no Latin letters.

Examples

iex> Text.Phonetic.DoubleMetaphone.encode("Smith")
{"SM0", "XMT"}

iex> Text.Phonetic.DoubleMetaphone.encode("Schmidt")
{"XMT", "SMT"}

iex> Text.Phonetic.DoubleMetaphone.encode("Thompson")
{"TMPS", "TMPS"}

match?(name_a, name_b, options \\ [])

@spec match?(String.t(), String.t(), keyword()) :: boolean()

Returns true if name_a and name_b share at least one of the four primary/alternate code combinations (and both produce non-empty codes).

Arguments

  • name_a is a string.

  • name_b is a string.

Options

Same as encode/2. Both inputs are encoded with the same options.

Returns

  • true when both inputs produce a non-empty code pair and at least one of the four combinations (primary_a/primary_b, primary_a/alternate_b, alternate_a/primary_b, alternate_a/alternate_b) matches.

  • false otherwise.

Examples

iex> Text.Phonetic.DoubleMetaphone.match?("Smith", "Schmidt")
true

iex> Text.Phonetic.DoubleMetaphone.match?("Smith", "Brown")
false