View Source Dmp.Match (diff_match_patch v0.2.0)
Given a search string, find its best fuzzy match in a block of plain text. Weighted for both accuracy and location.
Link to this section Summary
Types
A bitarray encoding the locations of characters within the search pattern.
A bitarray encoding possible match sequences of the search pattern within the text.
Accumulator for bitap_update/3
. A tuple with these elements
Constants needed for bitap_update/3
. A tuple with these elements
Functions
Initialise the alphabet for the Bitap algorithm.
Search for the best instance of pattern
in text
near loc
, with errors,
using the Bitap algorithm.
Compute and return a weighted score for a match with e
errors and x
location.
Perform the bitap algorithm and calculate error score if a match is found.
Look up a character in the alphabet and return its encoded bitmap.
Locate the best instance of pattern
in text
near loc
.
Link to this section Types
@type alpha() :: %{required(non_neg_integer()) => non_neg_integer()}
A bitarray encoding the locations of characters within the search pattern.
The index is the ordinal value of the character.
@type bitap_array() :: %{required(integer()) => non_neg_integer()}
A bitarray encoding possible match sequences of the search pattern within the text.
The index is the location of the text, plus one.
@type options() :: Dmp.Options.t()
@type update_acc() :: {integer(), float(), non_neg_integer(), bitap_array()}
Accumulator for bitap_update/3
. A tuple with these elements:
best_loc
- The value ofloc
with the lowest score found so far.score
- The lowest score found so far.start
- The lowest value ofj
that will be searched.rd
- The bitarray being calculated at the current error leveld
.
@type update_constants() :: {non_neg_integer(), String.t(), integer(), bitap_array(), alpha(), non_neg_integer(), non_neg_integer(), non_neg_integer(), non_neg_integer()}
Constants needed for bitap_update/3
. A tuple with these elements:
d
- Current error level.text
- Text to search.loc
- The location (zero-based index intext
) to search around.last_rd
- The bitarray from error leveld - 1
s
- The alphabet constructed from pattern being matched (seealphabet/1
).match_mask
- A bitmap from the alphabet for the character atloc
intext
.overflow_mask
- A bitmap of all ones, of sizepattern_length
pattern_length
- The length of the pattern being matched.match_distance
- How far fromloc
that the search will be conducted.
Link to this section Functions
Initialise the alphabet for the Bitap algorithm.
pattern
- The text to encode.
Returns the map of character locations within the pattern. From the Wu and Manber paper:
$$\text{If } s\xsb{i} .. s\xsb{\Sigma} \text{ are the unique characters in the pattern,}$$ $$\text{and } p\xsb{1} .. p\xsb{m} \text{ are the characters in the search pattern, then}$$ $$S \text{ is the bitarray where } S\xsb{i}[r] = 1 \text{ if } p\xsb{r} = s\xsb{i}$$
For best results the number of characters in pattern
should less than or equal to the native bit size of an integer (32).
bitap(text, pattern, loc, match_threshold \\ 0.5, match_distance \\ 1000, more_results \\ false)
View Source@spec bitap( String.t(), String.t(), non_neg_integer(), float(), non_neg_integer(), boolean() ) :: integer()
Search for the best instance of pattern
in text
near loc
, with errors,
using the Bitap algorithm.
See: Fast Text Searching with Errors (Wu and Manber, 1991).
text
- The text to search.pattern
- The pattern to search for.loc
- The location (zero-based index intext
) to search around.match_threshold
- At what point is no match declared (0.0 = perfection, 1.0 = very loose, default = 0.5).match_distance
- How far to search for a match (0 = exact location, 1000+ = broad match, default = 1000).more_results
- Control which data is returned.
If more_results
is false (the default), returns the index of the best
match closest to loc
in text
, or -1 if no suitable match was found.
If more_results
is true, returns a tuple with three elements:
- The best match index in
text
, or -1 if no suitable match was found. - The highest error level
d
that was searched. - The adjusted match score at the returned index (a value less than
or equal to the
match_threshold
). If no match was found, the score is 1.0.
@spec bitap_score( non_neg_integer(), integer(), non_neg_integer(), integer(), non_neg_integer() ) :: float()
Compute and return a weighted score for a match with e
errors and x
location.
e
- Number of errors in match.x
- Location of match.loc
- Expected location of match.pattern_length
- Length of pattern being sought.match_distance
- How far to search for a match (0 = exact location, 1000+ = broad match).
Returns overall score for match (0.0 = good, 1.0 = bad).
@spec bitap_update(non_neg_integer(), update_acc(), update_constants()) :: {:cont | :halt, update_acc()}
Perform the bitap algorithm and calculate error score if a match is found.
acc
- Accumulator tuple, withbest_loc
,score_threshold
,start
, andrd
elements.constants
- Other constant values needed for calculations.
Updates the rd
bitarray for the current error level d
at the index j
(representing the zero-based location j - 1
in text
), and
then tests for a match (an exact match if d == 0
or a match with
d
errors).
If a match is found at position j
, calculate the error score
(based on the error level d
and the distance from expected location).
If the score is lower than the current threshold, stop calculating the update
if we have already gone below the minimum possible location,
or continue the update, limiting the range of j
(increasing the
start
value).
notes
Notes
The j
index is decremented from the end of the text to the start of the text.
Since the iteration is moving from high j
to low, bitap_update/3
does "Lshift"
operations, not the "Rshift" operations in the Wu and Manber paper, and uses
the previous values that were set at j + 1
, not j
.
Here the calculations are:
$$R\xsb{j}^d = \begin{cases} Lshift [ R\xsb{j+1}^d ] \text{ AND } S\xsb{c} &\text{if } d = 0 \cr Lshift [ R\xsb{j+1}^d ] \text{ AND } S\xsb{c} \text{ OR } Lshift [ R\xsb{j}^{d-1} \text{ OR } R\xsb{j+1}^{d-1} ] \text{ OR } R\xsb{j+1}^{d-1} &\text{otherwise} \end{cases}$$
versus in Wu and Manber's paper:
$$R\xsb{j}^d = \begin{cases} Rshift [ R\xsb{j}^d ] \text{ AND } S\xsb{c} &\text{if } d = 0 \cr Rshift [ R\xsb{j}^d ] \text{ AND } S\xsb{c} \text{ OR } Rshift [ R\xsb{j}^{d-1} \text{ OR } R\xsb{j+1}^{d-1} ] \text{ OR } R\xsb{j}^{d-1} &\text{otherwise} \end{cases}$$
@spec character_mask(alpha(), String.t()) :: non_neg_integer()
Look up a character in the alphabet and return its encoded bitmap.
s
- An alphabet constructed withalphabet/1
.ch
- A single character string.
Returns a bitarray of positions in the pattern
for the
character ch
. In Wu and Manber:
$$S\xsb{c}$$
@spec main(String.t(), String.t(), non_neg_integer(), options()) :: integer()
Locate the best instance of pattern
in text
near loc
.
text
- The text to search.pattern
- The pattern to search for.loc
- The location to search around.opts
- A options keyword list,[]
to use the default options. In addition to the general parameters, the option:more_results
, if set totrue
, can be used to return more data. Seebitap/6
.
Returns -1 if no match found.