View Source Difflib.SequenceMatcher (Difflib v0.1.0)

SequenceMatcher is a flexible module for comparing pairs of sequences of any type, so long as the sequence elements are hashable.

The algorithm

The basic algorithm predates, and is a little fancier than, an algorithm published in the late 1980's by Ratcliff and Obershelp under the hyperbolic name "gestalt pattern matching". The basic idea is to find the longest contiguous matching subsequence that contains no "junk" elements (R-O doesn't address junk).

The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence. This does not yield minimal edit sequences, but does tend to yield matches that "look right" to people.

SequenceMatcher tries to compute a "human-friendly diff" between two sequences. Unlike e.g. UNIX(tm) diff, the fundamental notion is the longest contiguous & junk-free matching subsequence. That's what catches peoples' eyes.

The Windows(tm) windiff has another interesting notion, pairing up elements that appear uniquely in each sequence. That, and the method here, appear to yield more intuitive difference reports than does diff.

This method appears to be the least vulnerable to synching up on blocks of "junk lines", though (like blank lines in ordinary text files, or maybe "<P>" lines in HTML files). That may be because this is the only method of the 3 that has a concept of "junk" <wink>.

Examples

Example, comparing two strings, and considering blanks to be "junk"

iex> is_junk = fn c -> c == " " end
iex> a = "private Thread currentThread;"
iex> b = "private volatile Thread currentThread;"
iex> SequenceMatcher.ratio(a, b, is_junk: is_junk)
0.8656716417910447

ratio/3 returns a float between 0 and 1, measuring the "similarity" of the sequences. As a rule of thumb, a ratio/3 value over 0.6 means the sequences are close matches.

If you're only interested in where the sequences match, get_matching_blocks/3 is handy:

iex> for {a, b, size} <- SequenceMatcher.get_matching_blocks(a, b, is_junk: is_junk) do
iex>   IO.puts("a[#{a}] and b[#{b}] match for #{size} elements")
iex> end
a[0] and b[0] match for 8 elements
a[8] and b[17] match for 21 elements
a[29] and b[38] match for 0 elements

Note that the last tuple returned by get_matching_blocks/3 is always a dummy, {length(a), length(b), 0}, and this is the only case in which the last tuple element (number of elements matched) is 0.

If you want to know how to change the first sequence into the second, use get_opcodes/3

iex> for {op, a1, a2, b1, b2} <- SequenceMatcher.get_opcodes(a, b, is_junk: is_junk) do
iex>   IO.puts("#{op} a[#{a1}:#{a2}] b[#{b1}:#{b2}]")
iex> end
equal a[0:8] b[0:8]
insert a[8:8] b[8:17]
equal a[8:29] b[17:38]

See also function get_close_matches/3 in this module, which shows how simple code building on SequenceMatcher can be used to do useful work.

Timing: Basic R-O is cubic time worst case and quadratic time expected case. SequenceMatcher is quadratic time for the worst case and has expected-case behavior dependent in a complicated way on how many elements the sequences have in common; best case time is linear.

Summary

Functions

Analyzes an input for junk elements.

Find longest matching block in a[alo...ahi] and b[blo...bhi].

Use SequenceMatcher to return list of the best "good enough" matches.

Isolate change clusters by eliminating ranges with no changes.

Return list of triples describing matching subsequences.

Return list of 5-tuples describing how to turn a into b.

Return an upper bound on ratio/3 relatively quickly.

Return a measure of the sequences' similarity (float between 0 and 1).

Return an upper bound on ratio/3 very quickly.

Functions

Analyzes an input for junk elements.

Background

Because is_junk is a user-defined function, and we test for junk a LOT, it's important to minimize the number of calls. Before the tricks described here, chain_b/2 was by far the most time-consuming routine in the whole module! If anyone sees Jim Roskind, thank him again for profile.py -- I never would have guessed that.

The first trick is to build b2j ignoring the possibility of junk. I.e., we don't call is_junk at all yet. Throwing out the junk later is much cheaper than building b2j "right" from the start.

Parameters

  • a - The first of two sequences to be compared. The elements of a must be hashable.
  • b - The second of two sequences to be compared. The elements of b must be hashable.
  • opts - Keyword list of options.
    • is_junk - Optional parameter is_junk is a one-argument function that takes a sequence element and returns true if the element is junk. The default is nil. For example, pass fn x -> x == " " if you're comparing lines as sequences of characters, and don't want to synch up on blanks or hard tabs.
    • auto_junk - Optional parameter autojunk should be set to false to disable the "automatic junk heuristic" that treats popular elements as junk. The default is true.

Example

iex> is_junk = fn x -> x == " " end
iex> b = "abcd abcd"
iex> SequenceMatcher.chain_b(b, is_junk: is_junk)
%{
  b2j: %{
    "a" => [0, 5],
    "b" => [1, 6],
    "c" => [2, 7],
    "d" => [3, 8]
  },
  isbjunk: #Function<1.118419402/1>,
  isbpopular: #Function<1.118419402/1>,
  bjunk: %{" " => true},
  bpopular: %{}
}
Link to this function

find_longest_match(a, b, opts \\ [])

View Source

Find longest matching block in a[alo...ahi] and b[blo...bhi].

Description

If is_junk is not defined:

Return {i,j,k} such that a[i...i+k] is equal to b[j...j+k], where

alo <= i <= i+k <= ahi
blo <= j <= j+k <= bhi

and for all {i',j',k'} meeting those conditions,

k >= k'
i <= i'
and if i == i', j <= j'

In other words, of all maximal matching blocks, return one that starts earliest in a, and of all those maximal matching blocks that start earliest in a, return the one that starts earliest in b.

Parameters

  • a - The first of two sequences to be compared. The elements of a must be hashable.
  • b - The second of two sequences to be compared. The elements of b must be hashable.
  • opts - Keyword list of options.
    • alo - Optional parameter alo is the lower bound of the range in a to consider. The default is 0.
    • ahi - Optional parameter ahi is the upper bound of the range in a to consider. The default is length of a.
    • blo - Optional parameter blo is the lower bound of the range in b to consider. The default is 0.
    • bhi - Optional parameter bhi is the upper bound of the range in b to consider. The default is length of b.
    • is_junk - Optional parameter is_junk is a one-argument function that takes a sequence element and returns true if the element is junk. The default is nil. For example, pass fn x -> x == " " if you're comparing lines as sequences of characters, and don't want to synch up on blanks or hard tabs.
    • auto_junk - Optional parameter autojunk should be set to false to disable the "automatic junk heuristic" that treats popular elements as junk. The default is true.

Examples

iex> is_junk = fn x -> x == " " end
iex> a = " abcd"
iex> b = "abcd abcd"
iex> SequenceMatcher.find_longest_match(a, b, alo: 0, ahi: 5, blo: 0, bhi: 9, is_junk: is_junk)
{1, 0, 4}

iex> a = "ab"
iex> b = "c"
iex> SequenceMatcher.find_longest_match(a, b, alo: 0, ahi: 2, blo: 0, bhi: 1)
{0, 0, 0}

CAUTION

CAUTION: stripping common prefix or suffix would be incorrect. E.g.,

ab
acab

Longest matching block is "ab", but if common prefix is stripped, it's "a" (tied with "b"). UNIX(tm) diff does so strip, so ends up claiming that ab is changed to acab by inserting "ca" in the middle. That's minimal but unintuitive: "it's obvious" that someone inserted "ac" at the front. Windiff ends up at the same place as diff, but by pairing up the unique 'b's and then matching the first two 'a's.

Link to this function

get_close_matches(word, possibilities, opts \\ [])

View Source

Use SequenceMatcher to return list of the best "good enough" matches.

Description

The best (no more than n) matches among the possibilities are returned in a list, sorted by similarity score, most similar first.

Parameters

  • word - The sequence for which close matches are desired. Typically a string.
  • possibilities - A list of sequences against which to match word. Typically a list of strings.
  • opts - Keyword list of options.
    • n - Optional parameter n is the maximum number of close matches to return. The default is 3 and n must be > 0.
    • cutoff - Optional parameter cutoff is a float between 0 and 1. Possibilities that don't score at least that similar to word are ignored. The default is 0.6.
    • is_junk - Optional parameter is_junk is a one-argument function that takes a sequence element and returns true if the element is junk. The default is nil. For example, pass fn x -> x == " " if you're comparing lines as sequences of characters, and don't want to synch up on blanks or hard tabs.
      • auto_junk - Optional parameter autojunk should be set to false to disable the "automatic junk heuristic" that treats popular elements as junk. The default is true.

Example

iex> SequenceMatcher.get_close_matches("appel", ["ape", "apple", "peach", "puppy"])
["apple", "ape"]
Link to this function

get_grouped_opcodes(a, b, opts \\ [])

View Source

Isolate change clusters by eliminating ranges with no changes.

Description

Return a list groups with up to n lines of context. Each group is in the same format as returned by get_opcodes/3.

Parameters

  • a - The first of two sequences to be compared. The elements of a must be hashable.
  • b - The second of two sequences to be compared. The elements of b must be hashable.
  • opts - Keyword list of options.
    • n - Optional parameter n is the number of lines of context to include in each group. The default is 3.
    • is_junk - Optional parameter is_junk is a one-argument function that takes a sequence element and returns true if the element is junk. The default is nil. For example, pass fn x -> x == " " if you're comparing lines as sequences of characters, and don't want to synch up on blanks or hard tabs.
    • auto_junk - Optional parameter autojunk should be set to false to disable the "automatic junk heuristic" that treats popular elements as junk. The default is true.

Example

iex> a = Enum.map(1..39, &Integer.to_string/1)
iex> b = Enum.slice(a, 0..-1)
iex> b = Enum.slice(b, 0..7) ++ ["i"] ++ Enum.slice(b, 8..-1)
iex> b = Enum.slice(b, 0..19) ++ ["20x"] ++ Enum.slice(b, 21..-1)
iex> b = Enum.slice(b, 0..22) ++ Enum.slice(b, 28..-1)
iex> b = Enum.slice(b, 0..29) ++ ["35y"] ++ Enum.slice(b, 31..-1)
iex> SequenceMatcher.get_grouped_opcodes(a, b)
[
  [
    {:equal, 5, 8, 5, 8},
    {:insert, 8, 8, 8, 9},
    {:equal, 8, 11, 9, 12}],
  [
    {:equal, 16, 19, 17, 20},
    {:replace, 19, 20, 20, 21},
    {:equal, 20, 22, 21, 23},
    {:delete, 22, 27, 23, 23},
    {:equal, 27, 30, 23, 26}
  ],
  [
    {:equal, 31, 34, 27, 30},
    {:replace, 34, 35, 30, 31},
    {:equal, 35, 38, 31, 34}
  ]
]
Link to this function

get_matching_blocks(a, b, opts \\ [])

View Source

Return list of triples describing matching subsequences.

Description

Each triple is of the form {i, j, n}, and means that a[i...i+n] == b[j...j+n]. The triples are monotonically increasing in i and in j. it's also guaranteed that if {i, j, n} and {i', j', n'} are adjacent triples in the list, and the second is not the last triple in the list, then i+n != i' or j+n != j'. IOW, adjacent triples never describe adjacent equal blocks.

The last triple is a dummy, {a.length, b.length, 0}, and is the only triple with n==0.

Parameters

  • a - The first of two sequences to be compared. The elements of a must be hashable.
  • b - The second of two sequences to be compared. The elements of b must be hashable.
  • opts - Keyword list of options.
    • is_junk - Optional parameter is_junk is a one-argument function that takes a sequence element and returns true if the element is junk. The default is nil. For example, pass fn x -> x == " " if you're comparing lines as sequences of characters, and don't want to synch up on blanks or hard tabs.
    • auto_junk - Optional parameter autojunk should be set to false to disable the "automatic junk heuristic" that treats popular elements as junk. The default is true.

Example

iex> a = "abxcd"
iex> b = "abcd"
iex> SequenceMatcher.get_matching_blocks(a, b)
[{0, 0, 2}, {3, 2, 2}, {5, 4, 0}]
Link to this function

get_opcodes(a, b, opts \\ [])

View Source

Return list of 5-tuples describing how to turn a into b.

Description

Each tuple is of the form {tag, i1, i2, j1, j2}. The first tuple has i1 == j1 == 0, and remaining tuples have i1 == the i2 from the tuple preceding it, and likewise for j1 == the previous j2.

The tags are atoms, with these meanings:

  • :replace - a[i1...i2] should be replaced by b[j1...j2]
  • :delete - a[i1...i2] should be deleted. Note that j1==j2 in this case.
  • :insert - b[j1...j2] should be inserted at a[i1...i1]. Note that i1==i2 in this case.
  • :equal - a[i1...i2] == b[j1...j2]

Parameters

  • a - The first of two sequences to be compared. The elements of a must be hashable.
  • b - The second of two sequences to be compared. The elements of b must be hashable.
  • opts - Keyword list of options.
    • is_junk - Optional parameter is_junk is a one-argument function that takes a sequence element and returns true if the element is junk. The default is nil. For example, pass fn x -> x == " " if you're comparing lines as sequences of characters, and don't want to synch up on blanks or hard tabs.
    • auto_junk - Optional parameter autojunk should be set to false to disable the "automatic junk heuristic" that treats popular elements as junk. The default is true.

Example

iex> a = "qabxcd"
iex> b = "abycdf"
iex> SequenceMatcher.get_opcodes(a, b)
[
  {:delete, 0, 1, 0, 0},
  {:equal, 1, 3, 0, 2},
  {:replace, 3, 4, 2, 3},
  {:equal, 4, 6, 3, 5},
  {:insert, 6, 6, 5, 6}
]
Link to this function

quick_ratio(a, b, opts \\ [])

View Source

Return an upper bound on ratio/3 relatively quickly.

Description

This isn't defined beyond that it is an upper bound on ratio/3, and is faster to compute.

Parameters

  • a - The first of two sequences to be compared. The elements of a must be hashable.
  • b - The second of two sequences to be compared. The elements of b must be hashable.
  • opts - Keyword list of options.
    • fullbcount - Optional parameter fullbcount is a map of the counts of each element in b. It will be constructed if it does not exist. The default is nil.

Example

iex> a = "abcd"
iex> b = "bcde"
iex> SequenceMatcher.quick_ratio(a, b)
0.75

Return a measure of the sequences' similarity (float between 0 and 1).

Description

Where T is the total number of elements in both sequences, and M is the number of matches, this is 2.0*M / T. Note that this is 1 if the sequences are identical, and 0 if they have nothing in common.

ratio/3 is expensive to compute if you haven't already computed get_matching_blocks/3 or get_opcodes/3, in which case you may want to try quick_ratio/3 or real_quick_ratio/3 first to get an upper bound.

Parameters

  • a - The first of two sequences to be compared. The elements of a must be hashable.
  • b - The second of two sequences to be compared. The elements of b must be hashable.
  • opts - Keyword list of options.
    • is_junk - Optional parameter is_junk is a one-argument function that takes a sequence element and returns true if the element is junk. The default is nil. For example, pass fn x -> x == " " if you're comparing lines as sequences of characters, and don't want to synch up on blanks or hard tabs.
    • auto_junk - Optional parameter autojunk should be set to false to disable the "automatic junk heuristic" that treats popular elements as junk. The default is true.

Example

iex> a = "abcd"
iex> b = "bcde"
iex> SequenceMatcher.ratio(a, b)
0.75

Return an upper bound on ratio/3 very quickly.

Description

This isn't defined beyond that it is an upper bound on ratio/3, and is faster to compute than either ratio/3 or quick_ratio/3.

Parameters

  • a - The first of two sequences to be compared. The elements of a must be hashable.
  • b - The second of two sequences to be compared. The elements of b must be hashable.

Example

iex> a = "abcd"
iex> b = "bcde"
iex> SequenceMatcher.real_quick_ratio(a, b)
1.0