View Source AhoCorasearch (AhoCorasearch v0.3.0)
Documentation for AhoCorasearch
.
Summary
Functions
Builds the tree needed for efficient string searching
Returns the number of bytes that the tree is using on the native/Rust side.
Searches the given string, against the given tree.
This is a convenience function for doing a word based search.
Types
@type id() :: term()
@type match_kind() :: :standard | :leftmost_longest | :leftmost_first
@type matches() :: [match()]
@type options() :: [ unique: boolean(), insensitive: boolean(), match_kind: match_kind() ]
@type pattern() :: binary()
@type start() :: integer()
@type stop() :: integer()
@type tree() :: AhoCorasearch.Tree.t()
Functions
build_tree(patterns, opts \\ [unique: false, insensitive: true, match_kind: :standard])
View SourceBuilds the tree needed for efficient string searching
Examples
iex> AhoCorasearch.build_tree([{"abc", 1}, {"abc", 2}, {"def", 36}, {"xyz", 72}])
{:ok, "#AhoCorasearch.Tree<0.1080113559.418906113.32320>"}
iex> AhoCorasearch.build_tree([{"abc", 1}, {"def", 2}], unique: true, match_kind: :standard, insensitive: false)
{:ok, "#AhoCorasearch.UniqueTree<0.1080113559.418906113.32348>"}
iex> AhoCorasearch.build_tree([{"abc", 1}, {"abc", 2}, {"def", 36}, {"xyz", 72}], unique: true)
{:error, :not_unique}
The options it accepts are:
unique
- Whether the keys in the pattern list are unique or not. This determines what the result tuple will look like.{integer, integer, integer}
for unique trees, and{integer, integer, list(integer)}
for non-unique trees. Defaults tofalse
.insensitive
- If the matching should be case insensitive. Insensitivity is acheived by downcasing the patterns, and also downcasing the search text. This has a minor performance impact, but in most cases a negligible one. Defaults totrue
.match_kind
- There are 3 different matching options, but they must be chosen a tree building time, not at search time. The options arestandard
,leftmost_longest
, orleftmost_first
. Defaults tostandard
.
Given the following patterns:
[
{"headphone", 0},
{"head", 1},
{"phone", 2},
{"pine", 3},
{"cone", 4},
{"pinecone", 5}
]
Searching with AhoCorasearch.search(tree, "headphones and pinecones")
standard
:- will return
[{0, 4, [1]}, {4, 9, [2]}, {15, 19, [3]}, {19, 23, [4]}]
- which translates to
["head", "phone", "pine", "cone"]
- will return
leftmost_longest
:- will return
[{0, 9, [0]}, {15, 23, [5]}]
- which translates to
["headphone", "pinecone"]
- will return
leftmost_first
:- will return
[{0, 9, [0]}, {15, 19, [3]}, {19, 23, [4]}]
- which translates to
["headphone", "pine", "cone"]
(notice thatfirst
here is first in the pattern list, not first in the search string).
- will return
The examples above demonstrate the return value for a non-unique tree. The results is {start_index, end_index, ids}
. Not that the indexes are in bytes, so you'll want to use :binary.part/3
to extract substrings, not String.slice/3
. The ID part must be an integer, but it does not need to be unique. Multiple keys can use the same ID. If multiple patterns are given with the same key and ID, the result will be repeated, for example: [{34, 35, [1, 1]}]
. IDs are returned in the same order they are given, so for patterns: [{"a", 3}, {"a", 1}, {"a", 2}, {"a", 1}]
, an "a" in the search text would give the result: {X, Y, [3, 1, 2, 1]}
@spec heap_bytes(AhoCorasearch.Tree.t()) :: pos_integer()
Returns the number of bytes that the tree is using on the native/Rust side.
This does not account for any memory being stored on the Beam side, which hold the duplicate map
Total size(beam + native) can be approximated with:
:erts_debug.flat_size(tree) + AhoCorasearch.heap_bytes(tree)
This function is only intended for use in development/debugging. I don't know the implications(performance or safety) of using it in production.
search(tree, string, opts \\ [overlap: false, word_search: false])
View Source@spec search(AhoCorasearch.Tree.t(), binary(), search_opts()) :: matches()
@spec search(AhoCorasearch.Tree.t(), binary(), [{:overlap, boolean()}]) :: matches()
Searches the given string, against the given tree.
This is one of two available ways to search. This is the faster of the two methods, but also might give more unexpected data. It will return matches anywhere they appear in the string(assuming the rules still abide by the tree's match configuration), but it has no notion of words. So partial and subwords will match. If you're looking for only full word matches, look at the AhoCorasearch.word_search/3
, or use the option word_search: true
which will filter out any matches that are not complete words(starting and ending at a word boundary).
The inputs are:
tree
:AhoCorasearch.Tree.t()
- The tree that was build withbuild_tree/2-3
string
:String.t()
- The string that will be searched against the treeopts
:overlap: boolean
(default: false) - This argument is ONLY used if the tree type isstandard
. For trees with anothermatch_kind
it is ignored.word_search: boolean
(default: false) - Applies an additional filtering step to remove any matches that aren't complete words
The result is a list of matches, either in the form {integer, integer, integer}
for unique tree, or {integer, integer, list(integer)}
for non-unique trees. The values are: {start_index, end_index, ID}
. ID is whatever you passed in as the pattern([{"string", ID}, ...]
).
The start_index
and end_index
are in BYTE values, not character or codepoints, so be sure you operate on the string with them by using :binary.part/3
rather than String.slice/3
.
returns a list of all the matches in the input string.
leftmost, longest matches
While this handles unicode optimally, the return start/stop values are
in bytes. i.e. use :binary.part/3
rather than String.slice/3
This is a convenience function for doing a word based search.
Internally it just sets the word_search
option to true for search/3
, so either can be used as they're the same.