View Source AhoCorasearch (AhoCorasearch v0.3.0)

Documentation for AhoCorasearch.

Summary

Functions

Returns the number of bytes that the tree is using on the native/Rust side.

Searches the given string, against the given tree.

This is a convenience function for doing a word based search.

Types

@type id() :: term()
@type match() :: {start(), stop(), id()} | {start(), stop(), [id()]}
@type match_kind() :: :standard | :leftmost_longest | :leftmost_first
@type matches() :: [match()]
@type options() :: [
  unique: boolean(),
  insensitive: boolean(),
  match_kind: match_kind()
]
@type pattern() :: binary()
@type patterns() :: [{pattern(), id()}]
@type search_opts() :: [overlap: boolean(), word_search: boolean()]
@type start() :: integer()
@type stop() :: integer()
@type tree() :: AhoCorasearch.Tree.t()

Functions

Link to this function

build_tree(patterns, opts \\ [unique: false, insensitive: true, match_kind: :standard])

View Source
@spec build_tree(patterns(), options()) :: {:ok, tree()} | {:error, term()}

Builds the tree needed for efficient string searching

Examples

iex> AhoCorasearch.build_tree([{"abc", 1}, {"abc", 2}, {"def", 36}, {"xyz", 72}])
{:ok, "#AhoCorasearch.Tree<0.1080113559.418906113.32320>"}

iex> AhoCorasearch.build_tree([{"abc", 1}, {"def", 2}], unique: true, match_kind: :standard, insensitive: false)
{:ok, "#AhoCorasearch.UniqueTree<0.1080113559.418906113.32348>"}

iex> AhoCorasearch.build_tree([{"abc", 1}, {"abc", 2}, {"def", 36}, {"xyz", 72}], unique: true)
{:error, :not_unique}

The options it accepts are:

  • unique - Whether the keys in the pattern list are unique or not. This determines what the result tuple will look like. {integer, integer, integer} for unique trees, and {integer, integer, list(integer)} for non-unique trees. Defaults to false.
  • insensitive - If the matching should be case insensitive. Insensitivity is acheived by downcasing the patterns, and also downcasing the search text. This has a minor performance impact, but in most cases a negligible one. Defaults to true.
  • match_kind - There are 3 different matching options, but they must be chosen a tree building time, not at search time. The options are standard, leftmost_longest, or leftmost_first. Defaults to standard.

Given the following patterns:

[
  {"headphone", 0},
  {"head", 1},
  {"phone", 2},
  {"pine", 3},
  {"cone", 4},
  {"pinecone", 5}
]

Searching with AhoCorasearch.search(tree, "headphones and pinecones")

  • standard:

    • will return [{0, 4, [1]}, {4, 9, [2]}, {15, 19, [3]}, {19, 23, [4]}]
    • which translates to ["head", "phone", "pine", "cone"]
  • leftmost_longest:

    • will return [{0, 9, [0]}, {15, 23, [5]}]
    • which translates to ["headphone", "pinecone"]
  • leftmost_first:

    • will return [{0, 9, [0]}, {15, 19, [3]}, {19, 23, [4]}]
    • which translates to ["headphone", "pine", "cone"](notice that first here is first in the pattern list, not first in the search string).

The examples above demonstrate the return value for a non-unique tree. The results is {start_index, end_index, ids}. Not that the indexes are in bytes, so you'll want to use :binary.part/3 to extract substrings, not String.slice/3. The ID part must be an integer, but it does not need to be unique. Multiple keys can use the same ID. If multiple patterns are given with the same key and ID, the result will be repeated, for example: [{34, 35, [1, 1]}]. IDs are returned in the same order they are given, so for patterns: [{"a", 3}, {"a", 1}, {"a", 2}, {"a", 1}], an "a" in the search text would give the result: {X, Y, [3, 1, 2, 1]}

@spec heap_bytes(AhoCorasearch.Tree.t()) :: pos_integer()

Returns the number of bytes that the tree is using on the native/Rust side.

This does not account for any memory being stored on the Beam side, which hold the duplicate map Total size(beam + native) can be approximated with: :erts_debug.flat_size(tree) + AhoCorasearch.heap_bytes(tree)

This function is only intended for use in development/debugging. I don't know the implications(performance or safety) of using it in production.

Link to this function

search(tree, string, opts \\ [overlap: false, word_search: false])

View Source
@spec search(AhoCorasearch.Tree.t(), binary(), search_opts()) :: matches()
@spec search(AhoCorasearch.Tree.t(), binary(), [{:overlap, boolean()}]) :: matches()

Searches the given string, against the given tree.

This is one of two available ways to search. This is the faster of the two methods, but also might give more unexpected data. It will return matches anywhere they appear in the string(assuming the rules still abide by the tree's match configuration), but it has no notion of words. So partial and subwords will match. If you're looking for only full word matches, look at the AhoCorasearch.word_search/3, or use the option word_search: true which will filter out any matches that are not complete words(starting and ending at a word boundary).

The inputs are:

  • tree: AhoCorasearch.Tree.t() - The tree that was build with build_tree/2-3
  • string: String.t() - The string that will be searched against the tree
  • opts:
    • overlap: boolean(default: false) - This argument is ONLY used if the tree type is standard. For trees with another match_kind it is ignored.
    • word_search: boolean(default: false) - Applies an additional filtering step to remove any matches that aren't complete words

The result is a list of matches, either in the form {integer, integer, integer} for unique tree, or {integer, integer, list(integer)} for non-unique trees. The values are: {start_index, end_index, ID}. ID is whatever you passed in as the pattern([{"string", ID}, ...]).

The start_index and end_index are in BYTE values, not character or codepoints, so be sure you operate on the string with them by using :binary.part/3 rather than String.slice/3.

returns a list of all the matches in the input string. leftmost, longest matches While this handles unicode optimally, the return start/stop values are in bytes. i.e. use :binary.part/3 rather than String.slice/3

Link to this function

word_search(tree, string, opts \\ [overlap: false])

View Source

This is a convenience function for doing a word based search.

Internally it just sets the word_search option to true for search/3, so either can be used as they're the same.