View Source Difflib.SequenceMatcher (Difflib v0.1.0)
SequenceMatcher is a flexible module for comparing pairs of sequences of any type, so long as the sequence elements are hashable.
The algorithm
The basic algorithm predates, and is a little fancier than, an algorithm published in the late 1980's by Ratcliff and Obershelp under the hyperbolic name "gestalt pattern matching". The basic idea is to find the longest contiguous matching subsequence that contains no "junk" elements (R-O doesn't address junk).
The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence. This does not yield minimal edit sequences, but does tend to yield matches that "look right" to people.
SequenceMatcher tries to compute a "human-friendly diff" between two sequences. Unlike e.g. UNIX(tm) diff, the fundamental notion is the longest contiguous & junk-free matching subsequence. That's what catches peoples' eyes.
The Windows(tm) windiff has another interesting notion, pairing up elements that appear uniquely in each sequence. That, and the method here, appear to yield more intuitive difference reports than does diff.
This method appears to be the least vulnerable to synching up on blocks of "junk lines", though (like blank lines in ordinary text files, or maybe "<P>" lines in HTML files). That may be because this is the only method of the 3 that has a concept of "junk" <wink>.
Examples
Example, comparing two strings, and considering blanks to be "junk"
iex> is_junk = fn c -> c == " " end
iex> a = "private Thread currentThread;"
iex> b = "private volatile Thread currentThread;"
iex> SequenceMatcher.ratio(a, b, is_junk: is_junk)
0.8656716417910447
ratio/3
returns a float between 0 and 1, measuring the "similarity" of the
sequences. As a rule of thumb, a ratio/3
value over 0.6 means the
sequences are close matches.
If you're only interested in where the sequences match,
get_matching_blocks/3
is handy:
iex> for {a, b, size} <- SequenceMatcher.get_matching_blocks(a, b, is_junk: is_junk) do
iex> IO.puts("a[#{a}] and b[#{b}] match for #{size} elements")
iex> end
a[0] and b[0] match for 8 elements
a[8] and b[17] match for 21 elements
a[29] and b[38] match for 0 elements
Note that the last tuple returned by get_matching_blocks/3
is always a
dummy, {length(a), length(b), 0}
, and this is the only case in which the last
tuple element (number of elements matched) is 0.
If you want to know how to change the first sequence into the second,
use get_opcodes/3
iex> for {op, a1, a2, b1, b2} <- SequenceMatcher.get_opcodes(a, b, is_junk: is_junk) do
iex> IO.puts("#{op} a[#{a1}:#{a2}] b[#{b1}:#{b2}]")
iex> end
equal a[0:8] b[0:8]
insert a[8:8] b[8:17]
equal a[8:29] b[17:38]
See also function get_close_matches/3
in this module, which shows how
simple code building on SequenceMatcher can be used to do useful work.
Timing: Basic R-O is cubic time worst case and quadratic time expected case. SequenceMatcher is quadratic time for the worst case and has expected-case behavior dependent in a complicated way on how many elements the sequences have in common; best case time is linear.
Summary
Functions
Analyzes an input for junk elements.
Find longest matching block in a[alo...ahi] and b[blo...bhi].
Use SequenceMatcher to return list of the best "good enough" matches.
Isolate change clusters by eliminating ranges with no changes.
Return list of triples describing matching subsequences.
Return list of 5-tuples describing how to turn a into b.
Return an upper bound on ratio/3
relatively quickly.
Return a measure of the sequences' similarity (float between 0 and 1).
Return an upper bound on ratio/3
very quickly.
Functions
Analyzes an input for junk elements.
Background
Because is_junk
is a user-defined function, and we test
for junk a LOT, it's important to minimize the number of calls.
Before the tricks described here, chain_b/2
was by far the most
time-consuming routine in the whole module! If anyone sees
Jim Roskind, thank him again for profile.py -- I never would
have guessed that.
The first trick is to build b2j
ignoring the possibility
of junk. I.e., we don't call is_junk at all yet. Throwing
out the junk later is much cheaper than building b2j
"right"
from the start.
Parameters
a
- The first of two sequences to be compared. The elements of a must be hashable.b
- The second of two sequences to be compared. The elements of b must be hashable.opts
- Keyword list of options.is_junk
- Optional parameter is_junk is a one-argument function that takes a sequence element and returns true if the element is junk. The default isnil
. For example, passfn x -> x == " "
if you're comparing lines as sequences of characters, and don't want to synch up on blanks or hard tabs.auto_junk
- Optional parameter autojunk should be set to false to disable the "automatic junk heuristic" that treats popular elements as junk. The default istrue
.
Example
iex> is_junk = fn x -> x == " " end
iex> b = "abcd abcd"
iex> SequenceMatcher.chain_b(b, is_junk: is_junk)
%{
b2j: %{
"a" => [0, 5],
"b" => [1, 6],
"c" => [2, 7],
"d" => [3, 8]
},
isbjunk: #Function<1.118419402/1>,
isbpopular: #Function<1.118419402/1>,
bjunk: %{" " => true},
bpopular: %{}
}
Find longest matching block in a[alo...ahi] and b[blo...bhi].
Description
If is_junk
is not defined:
Return {i,j,k}
such that a[i...i+k]
is equal to b[j...j+k]
, where
alo <= i <= i+k <= ahi
blo <= j <= j+k <= bhi
and for all {i',j',k'}
meeting those conditions,
k >= k'
i <= i'
and if i == i', j <= j'
In other words, of all maximal matching blocks, return one that starts earliest in a, and of all those maximal matching blocks that start earliest in a, return the one that starts earliest in b.
Parameters
a
- The first of two sequences to be compared. The elements of a must be hashable.b
- The second of two sequences to be compared. The elements of b must be hashable.opts
- Keyword list of options.alo
- Optional parameter alo is the lower bound of the range in a to consider. The default is 0.ahi
- Optional parameter ahi is the upper bound of the range in a to consider. The default is length of a.blo
- Optional parameter blo is the lower bound of the range in b to consider. The default is 0.bhi
- Optional parameter bhi is the upper bound of the range in b to consider. The default is length of b.is_junk
- Optional parameter is_junk is a one-argument function that takes a sequence element and returns true if the element is junk. The default isnil
. For example, passfn x -> x == " "
if you're comparing lines as sequences of characters, and don't want to synch up on blanks or hard tabs.auto_junk
- Optional parameter autojunk should be set to false to disable the "automatic junk heuristic" that treats popular elements as junk. The default istrue
.
Examples
iex> is_junk = fn x -> x == " " end
iex> a = " abcd"
iex> b = "abcd abcd"
iex> SequenceMatcher.find_longest_match(a, b, alo: 0, ahi: 5, blo: 0, bhi: 9, is_junk: is_junk)
{1, 0, 4}
iex> a = "ab"
iex> b = "c"
iex> SequenceMatcher.find_longest_match(a, b, alo: 0, ahi: 2, blo: 0, bhi: 1)
{0, 0, 0}
CAUTION
CAUTION: stripping common prefix or suffix would be incorrect. E.g.,
ab
acab
Longest matching block is "ab", but if common prefix is stripped, it's "a" (tied with "b"). UNIX(tm) diff does so strip, so ends up claiming that ab is changed to acab by inserting "ca" in the middle. That's minimal but unintuitive: "it's obvious" that someone inserted "ac" at the front. Windiff ends up at the same place as diff, but by pairing up the unique 'b's and then matching the first two 'a's.
Use SequenceMatcher to return list of the best "good enough" matches.
Description
The best (no more than n
) matches among the possibilities are returned
in a list, sorted by similarity score, most similar first.
Parameters
word
- The sequence for which close matches are desired. Typically a string.possibilities
- A list of sequences against which to match word. Typically a list of strings.opts
- Keyword list of options.n
- Optional parameter n is the maximum number of close matches to return. The default is 3 and n must be > 0.cutoff
- Optional parameter cutoff is a float between 0 and 1. Possibilities that don't score at least that similar to word are ignored. The default is 0.6.is_junk
- Optional parameter is_junk is a one-argument function that takes a sequence element and returns true if the element is junk. The default isnil
. For example, passfn x -> x == " "
if you're comparing lines as sequences of characters, and don't want to synch up on blanks or hard tabs.auto_junk
- Optional parameter autojunk should be set to false to disable the "automatic junk heuristic" that treats popular elements as junk. The default istrue
.
Example
iex> SequenceMatcher.get_close_matches("appel", ["ape", "apple", "peach", "puppy"])
["apple", "ape"]
Isolate change clusters by eliminating ranges with no changes.
Description
Return a list groups with up to n
lines of context.
Each group is in the same format as returned by get_opcodes/3
.
Parameters
a
- The first of two sequences to be compared. The elements of a must be hashable.b
- The second of two sequences to be compared. The elements of b must be hashable.opts
- Keyword list of options.n
- Optional parameter n is the number of lines of context to include in each group. The default is 3.is_junk
- Optional parameter is_junk is a one-argument function that takes a sequence element and returns true if the element is junk. The default isnil
. For example, passfn x -> x == " "
if you're comparing lines as sequences of characters, and don't want to synch up on blanks or hard tabs.auto_junk
- Optional parameter autojunk should be set to false to disable the "automatic junk heuristic" that treats popular elements as junk. The default istrue
.
Example
iex> a = Enum.map(1..39, &Integer.to_string/1)
iex> b = Enum.slice(a, 0..-1)
iex> b = Enum.slice(b, 0..7) ++ ["i"] ++ Enum.slice(b, 8..-1)
iex> b = Enum.slice(b, 0..19) ++ ["20x"] ++ Enum.slice(b, 21..-1)
iex> b = Enum.slice(b, 0..22) ++ Enum.slice(b, 28..-1)
iex> b = Enum.slice(b, 0..29) ++ ["35y"] ++ Enum.slice(b, 31..-1)
iex> SequenceMatcher.get_grouped_opcodes(a, b)
[
[
{:equal, 5, 8, 5, 8},
{:insert, 8, 8, 8, 9},
{:equal, 8, 11, 9, 12}],
[
{:equal, 16, 19, 17, 20},
{:replace, 19, 20, 20, 21},
{:equal, 20, 22, 21, 23},
{:delete, 22, 27, 23, 23},
{:equal, 27, 30, 23, 26}
],
[
{:equal, 31, 34, 27, 30},
{:replace, 34, 35, 30, 31},
{:equal, 35, 38, 31, 34}
]
]
Return list of triples describing matching subsequences.
Description
Each triple is of the form {i, j, n}
, and means that
a[i...i+n] == b[j...j+n]
. The triples are monotonically increasing in
i
and in j
. it's also guaranteed that if
{i, j, n}
and {i', j', n'}
are adjacent triples in the list, and
the second is not the last triple in the list, then i+n != i'
or
j+n != j'
. IOW, adjacent triples never describe adjacent equal
blocks.
The last triple is a dummy, {a.length, b.length, 0}
, and is the only
triple with n==0
.
Parameters
a
- The first of two sequences to be compared. The elements of a must be hashable.b
- The second of two sequences to be compared. The elements of b must be hashable.opts
- Keyword list of options.is_junk
- Optional parameter is_junk is a one-argument function that takes a sequence element and returns true if the element is junk. The default isnil
. For example, passfn x -> x == " "
if you're comparing lines as sequences of characters, and don't want to synch up on blanks or hard tabs.auto_junk
- Optional parameter autojunk should be set to false to disable the "automatic junk heuristic" that treats popular elements as junk. The default istrue
.
Example
iex> a = "abxcd"
iex> b = "abcd"
iex> SequenceMatcher.get_matching_blocks(a, b)
[{0, 0, 2}, {3, 2, 2}, {5, 4, 0}]
Return list of 5-tuples describing how to turn a into b.
Description
Each tuple is of the form {tag, i1, i2, j1, j2}
. The first tuple
has i1 == j1 == 0
, and remaining tuples have i1 ==
the i2
from the
tuple preceding it, and likewise for j1 ==
the previous j2
.
The tags are atoms, with these meanings:
:replace
-a[i1...i2]
should be replaced byb[j1...j2]
:delete
-a[i1...i2]
should be deleted. Note thatj1==j2
in this case.:insert
-b[j1...j2]
should be inserted ata[i1...i1]
. Note thati1==i2
in this case.:equal
-a[i1...i2] == b[j1...j2]
Parameters
a
- The first of two sequences to be compared. The elements of a must be hashable.b
- The second of two sequences to be compared. The elements of b must be hashable.opts
- Keyword list of options.is_junk
- Optional parameter is_junk is a one-argument function that takes a sequence element and returns true if the element is junk. The default isnil
. For example, passfn x -> x == " "
if you're comparing lines as sequences of characters, and don't want to synch up on blanks or hard tabs.auto_junk
- Optional parameter autojunk should be set to false to disable the "automatic junk heuristic" that treats popular elements as junk. The default istrue
.
Example
iex> a = "qabxcd"
iex> b = "abycdf"
iex> SequenceMatcher.get_opcodes(a, b)
[
{:delete, 0, 1, 0, 0},
{:equal, 1, 3, 0, 2},
{:replace, 3, 4, 2, 3},
{:equal, 4, 6, 3, 5},
{:insert, 6, 6, 5, 6}
]
Return an upper bound on ratio/3
relatively quickly.
Description
This isn't defined beyond that it is an upper bound on ratio/3
, and is faster to compute.
Parameters
a
- The first of two sequences to be compared. The elements of a must be hashable.b
- The second of two sequences to be compared. The elements of b must be hashable.opts
- Keyword list of options.fullbcount
- Optional parameter fullbcount is a map of the counts of each element in b. It will be constructed if it does not exist. The default isnil
.
Example
iex> a = "abcd"
iex> b = "bcde"
iex> SequenceMatcher.quick_ratio(a, b)
0.75
Return a measure of the sequences' similarity (float between 0 and 1).
Description
Where T
is the total number of elements in both sequences, and
M
is the number of matches, this is 2.0*M / T
.
Note that this is 1 if the sequences are identical, and 0 if
they have nothing in common.
ratio/3
is expensive to compute if you haven't already computed
get_matching_blocks/3
or get_opcodes/3
, in which case you may
want to try quick_ratio/3
or real_quick_ratio/3
first to get an
upper bound.
Parameters
a
- The first of two sequences to be compared. The elements of a must be hashable.b
- The second of two sequences to be compared. The elements of b must be hashable.opts
- Keyword list of options.is_junk
- Optional parameter is_junk is a one-argument function that takes a sequence element and returns true if the element is junk. The default isnil
. For example, passfn x -> x == " "
if you're comparing lines as sequences of characters, and don't want to synch up on blanks or hard tabs.auto_junk
- Optional parameter autojunk should be set to false to disable the "automatic junk heuristic" that treats popular elements as junk. The default istrue
.
Example
iex> a = "abcd"
iex> b = "bcde"
iex> SequenceMatcher.ratio(a, b)
0.75
Return an upper bound on ratio/3
very quickly.
Description
This isn't defined beyond that it is an upper bound on ratio/3
, and
is faster to compute than either ratio/3
or quick_ratio/3
.
Parameters
a
- The first of two sequences to be compared. The elements of a must be hashable.b
- The second of two sequences to be compared. The elements of b must be hashable.
Example
iex> a = "abcd"
iex> b = "bcde"
iex> SequenceMatcher.real_quick_ratio(a, b)
1.0