View Source Dmp.Match (diff_match_patch v0.1.0)

Given a search string, find its best fuzzy match in a block of plain text. Weighted for both accuracy and location.

Link to this section Summary

Types

A bitarray encoding the locations of characters within the search pattern.

A bitarray encoding possible match sequences of the search pattern within the text.

Accumulator for bitap_update/3.

Functions

Initialise the alphabet for the Bitap algorithm.

Search for the best instance of pattern in text near loc, with errors, using the Bitap algorithm.

Compute and return a weighted score for a match with e errors and x location.

Perform the bitap algorithm and calculate error score if a match is found.

Locate the best instance of pattern in text near loc.

Link to this section Types

@type alpha() :: %{required(non_neg_integer()) => non_neg_integer()}

A bitarray encoding the locations of characters within the search pattern.

The index is the ordinal value of the character.

@type bitap_array() :: %{required(integer()) => non_neg_integer()}

A bitarray encoding possible match sequences of the search pattern within the text.

The index is the location of the text, plus one.

@type options() :: Dmp.Options.t()
@type update_acc() :: {integer(), float(), non_neg_integer(), bitap_array()}

Accumulator for bitap_update/3.

Link to this section Functions

@spec alphabet(String.t()) :: alpha()

Initialise the alphabet for the Bitap algorithm.

pattern - The text to encode.

Returns map of character locations within the pattern. $$S_c$$ in the Wu and Manber paper.

Link to this function

bitap(text, pattern, loc, match_threshold \\ 0.5, match_distance \\ 1000, return_score \\ false)

View Source
@spec bitap(
  String.t(),
  String.t(),
  non_neg_integer(),
  float(),
  non_neg_integer(),
  boolean()
) :: integer()

Search for the best instance of pattern in text near loc, with errors, using the Bitap algorithm.

See: Fast Text Searching with Errors (Wu and Manber, 1991).

  • text - The text to search.
  • pattern - The pattern to search for.
  • loc - The location to search around.
  • match_threshold - At what point is no match declared (0.0 = perfection, 1.0 = very loose, default = 0.5).
  • match_distance - How far to search for a match (0 = exact location, 1000+ = broad match, default = 1000).
  • return_score - Control what is returned. See the following.

If return_score is false (the default), returns the index of the best match closest to loc in text, or -1 if no suitable match was found.

If return_score is true, returns a tuple with two elements:

  • The best match index, or -1.
  • The adjusted match score at the returned index (a value less than or equal to the match_threshold). If no match was found, the score is 1.
Link to this function

bitap_score(e, x, loc, pattern_length, match_distance)

View Source
@spec bitap_score(
  non_neg_integer(),
  integer(),
  non_neg_integer(),
  integer(),
  non_neg_integer()
) :: float()

Compute and return a weighted score for a match with e errors and x location.

  • e - Number of errors in match.
  • x - Location of match.
  • loc - Expected location of match.
  • pattern_length - Length of pattern being sought.
  • match_distance - How far to search for a match (0 = exact location, 1000+ = broad match).

Returns overall score for match (0.0 = good, 1.0 = bad).

Link to this function

bitap_update(j, acc, constants)

View Source
@spec bitap_update(non_neg_integer(), update_acc(), update_constants()) ::
  {:cont | :halt, update_acc()}

Perform the bitap algorithm and calculate error score if a match is found.

  • acc - Accumulator tuple, with best_loc, score_threshold, start, and rd elements.
  • constants - Other constant values needed for calculations.

Updates the rd bitarray for the current error level d at the index j (representing the zero-based location j - 1 in text), and then tests for a match (an exact match if d == 0 or a match with d errors).

If a match is found at position j, calculate the error score (based on the error level d and the distance from expected location). If the score is lower than the current threshold, stop calculating the update if we have already gone below the minimum possible location, or continue the update, limiting the range of j (increasing the start value).

notes

Notes

The j index is decremented from the end of the text to the start of the text. Since the iteration is moving from high j to low, bitap_update does "Lshift" operations, not the "Rshift" operations in the Wu and Manber paper, and uses the previous values that were set at j + 1, not j.

Here the calculations are:

$$Rsubscptj^d = \begin{cases} Lshift [ Rsubscpt{j+1}^d ] \text{ AND } S_c &\text{if } d = 0 \cr Lshift [ Rsubscpt{j+1}^d ] \text{ AND } S_c \text{ OR } Lshift [ Rsubscptj^{d-1} \text{ OR } Rsubscpt{j+1}^{d-1} ] \text{ OR } Rsubscpt{j+1}^{d-1} &\text{otherwise} \end{cases}$$

versus in Wu and Manber's paper:

$$Rsubscptj^d = \begin{cases} Rshift [ Rsubscpt{j}^d ] \text{ AND } S_c &\text{if } d = 0 \cr Rshift [ Rsubscpt{j}^d ] \text{ AND } S_c \text{ OR } Rshift [ Rsubscptj^{d-1} \text{ OR } Rsubscpt{j+1}^{d-1} ] \text{ OR } Rsubscpt{j}^{d-1} &\text{otherwise} \end{cases}$$

Link to this function

main(text, pattern, loc, opts \\ [])

View Source
@spec main(String.t(), String.t(), non_neg_integer(), options()) :: integer()

Locate the best instance of pattern in text near loc.

  • text - The text to search.
  • pattern - The pattern to search for.
  • loc - The location to search around.
  • opts - A options keyword list, [] to use the default options.

Returns -1 if no match found.