View Source Dmp.Match (diff_match_patch v0.2.0)

Given a search string, find its best fuzzy match in a block of plain text. Weighted for both accuracy and location.

Link to this section Summary

Types

A bitarray encoding the locations of characters within the search pattern.

A bitarray encoding possible match sequences of the search pattern within the text.

Accumulator for bitap_update/3. A tuple with these elements

Constants needed for bitap_update/3. A tuple with these elements

Functions

Initialise the alphabet for the Bitap algorithm.

Search for the best instance of pattern in text near loc, with errors, using the Bitap algorithm.

Compute and return a weighted score for a match with e errors and x location.

Perform the bitap algorithm and calculate error score if a match is found.

Look up a character in the alphabet and return its encoded bitmap.

Locate the best instance of pattern in text near loc.

Link to this section Types

@type alpha() :: %{required(non_neg_integer()) => non_neg_integer()}

A bitarray encoding the locations of characters within the search pattern.

The index is the ordinal value of the character.

@type bitap_array() :: %{required(integer()) => non_neg_integer()}

A bitarray encoding possible match sequences of the search pattern within the text.

The index is the location of the text, plus one.

@type options() :: Dmp.Options.t()
@type update_acc() :: {integer(), float(), non_neg_integer(), bitap_array()}

Accumulator for bitap_update/3. A tuple with these elements:

  • best_loc - The value of loc with the lowest score found so far.
  • score - The lowest score found so far.
  • start - The lowest value of j that will be searched.
  • rd - The bitarray being calculated at the current error level d.

Constants needed for bitap_update/3. A tuple with these elements:

  • d - Current error level.
  • text - Text to search.
  • loc - The location (zero-based index in text) to search around.
  • last_rd - The bitarray from error level d - 1
  • s - The alphabet constructed from pattern being matched (see alphabet/1).
  • match_mask - A bitmap from the alphabet for the character at loc in text.
  • overflow_mask - A bitmap of all ones, of size pattern_length
  • pattern_length - The length of the pattern being matched.
  • match_distance - How far from loc that the search will be conducted.

Link to this section Functions

@spec alphabet(String.t()) :: alpha()

Initialise the alphabet for the Bitap algorithm.

pattern - The text to encode.

Returns the map of character locations within the pattern. From the Wu and Manber paper:

$$\text{If } s\xsb{i} .. s\xsb{\Sigma} \text{ are the unique characters in the pattern,}$$ $$\text{and } p\xsb{1} .. p\xsb{m} \text{ are the characters in the search pattern, then}$$ $$S \text{ is the bitarray where } S\xsb{i}[r] = 1 \text{ if } p\xsb{r} = s\xsb{i}$$

For best results the number of characters in pattern should less than or equal to the native bit size of an integer (32).

Link to this function

bitap(text, pattern, loc, match_threshold \\ 0.5, match_distance \\ 1000, more_results \\ false)

View Source
@spec bitap(
  String.t(),
  String.t(),
  non_neg_integer(),
  float(),
  non_neg_integer(),
  boolean()
) ::
  integer()

Search for the best instance of pattern in text near loc, with errors, using the Bitap algorithm.

See: Fast Text Searching with Errors (Wu and Manber, 1991).

  • text - The text to search.
  • pattern - The pattern to search for.
  • loc - The location (zero-based index in text) to search around.
  • match_threshold - At what point is no match declared (0.0 = perfection, 1.0 = very loose, default = 0.5).
  • match_distance - How far to search for a match (0 = exact location, 1000+ = broad match, default = 1000).
  • more_results - Control which data is returned.

If more_results is false (the default), returns the index of the best match closest to loc in text, or -1 if no suitable match was found.

If more_results is true, returns a tuple with three elements:

  • The best match index in text, or -1 if no suitable match was found.
  • The highest error level d that was searched.
  • The adjusted match score at the returned index (a value less than or equal to the match_threshold). If no match was found, the score is 1.0.
Link to this function

bitap_score(e, x, loc, pattern_length, match_distance)

View Source
@spec bitap_score(
  non_neg_integer(),
  integer(),
  non_neg_integer(),
  integer(),
  non_neg_integer()
) ::
  float()

Compute and return a weighted score for a match with e errors and x location.

  • e - Number of errors in match.
  • x - Location of match.
  • loc - Expected location of match.
  • pattern_length - Length of pattern being sought.
  • match_distance - How far to search for a match (0 = exact location, 1000+ = broad match).

Returns overall score for match (0.0 = good, 1.0 = bad).

Link to this function

bitap_update(j, acc, constants)

View Source
@spec bitap_update(non_neg_integer(), update_acc(), update_constants()) ::
  {:cont | :halt, update_acc()}

Perform the bitap algorithm and calculate error score if a match is found.

  • acc - Accumulator tuple, with best_loc, score_threshold, start, and rd elements.
  • constants - Other constant values needed for calculations.

Updates the rd bitarray for the current error level d at the index j (representing the zero-based location j - 1 in text), and then tests for a match (an exact match if d == 0 or a match with d errors).

If a match is found at position j, calculate the error score (based on the error level d and the distance from expected location). If the score is lower than the current threshold, stop calculating the update if we have already gone below the minimum possible location, or continue the update, limiting the range of j (increasing the start value).

notes

Notes

The j index is decremented from the end of the text to the start of the text. Since the iteration is moving from high j to low, bitap_update/3 does "Lshift" operations, not the "Rshift" operations in the Wu and Manber paper, and uses the previous values that were set at j + 1, not j.

Here the calculations are:

$$R\xsb{j}^d = \begin{cases} Lshift [ R\xsb{j+1}^d ] \text{ AND } S\xsb{c} &\text{if } d = 0 \cr Lshift [ R\xsb{j+1}^d ] \text{ AND } S\xsb{c} \text{ OR } Lshift [ R\xsb{j}^{d-1} \text{ OR } R\xsb{j+1}^{d-1} ] \text{ OR } R\xsb{j+1}^{d-1} &\text{otherwise} \end{cases}$$

versus in Wu and Manber's paper:

$$R\xsb{j}^d = \begin{cases} Rshift [ R\xsb{j}^d ] \text{ AND } S\xsb{c} &\text{if } d = 0 \cr Rshift [ R\xsb{j}^d ] \text{ AND } S\xsb{c} \text{ OR } Rshift [ R\xsb{j}^{d-1} \text{ OR } R\xsb{j+1}^{d-1} ] \text{ OR } R\xsb{j}^{d-1} &\text{otherwise} \end{cases}$$

@spec character_mask(alpha(), String.t()) :: non_neg_integer()

Look up a character in the alphabet and return its encoded bitmap.

  • s - An alphabet constructed with alphabet/1.
  • ch - A single character string.

Returns a bitarray of positions in the pattern for the character ch. In Wu and Manber:

$$S\xsb{c}$$

Link to this function

main(text, pattern, loc, opts \\ [])

View Source
@spec main(String.t(), String.t(), non_neg_integer(), options()) :: integer()

Locate the best instance of pattern in text near loc.

  • text - The text to search.
  • pattern - The pattern to search for.
  • loc - The location to search around.
  • opts - A options keyword list, [] to use the default options. In addition to the general parameters, the option :more_results, if set to true, can be used to return more data. See bitap/6.

Returns -1 if no match found.