View Source ArangoXEcto.Analyzer (ArangoX Ecto v1.3.1)

Defines an analyzer for use in views

Since analyzer defintions are short and you may have many of them, you can just define multiple analyzer modules in one file, e.g. named analyzers.ex.

example

Example

defmodule MyApp.Analyzers do
  use ArangoXEcto.Analyzer

  norm :norm_en, [:frequency, :norm, :position], %{
    locale: "en",
    accent: false,
    case: :lower
  }

  # this exists by default but is just used as an example
  text :text_en, [:frequency, :norm, :position], %{
    locale: "en",
    accent: false,
    stemming: true,
    case: :lower
  }

  # Needed to compile the analyzers
  build()
end

features

Features

The following are the features available to all the analyzers. Some analyzers and functions need certin features enabled, refer to the ArangoDB docs for more info.

* `:frequency` - (boolean) - track how often a term occurs.
* `:norm` - (boolean) - calculate and store the field normalization factor that is used to score fairer if the same term is repeated, reducing its importance.
* `:position` - (boolean) - enumerate the tokens for position-dependent queries.

Link to this section Summary

Functions

Defines an aql type analyzer.

Compiles analyzers

Defines a classification type analyzer.

Defines a collation type analyzer.

Defines a delimiter type analyzer.

Defines a geo_s2 type analyzer.

Defines a geojson type analyzer.

Defines a geopoint type analyzer.

Defines an identity type analyzer.

Defines a minhash type analyzer.

Defines a nearest_neighbors type analyzer.

Defines a ngram type analyzer.

Defines a norm type analyzer.

Defines a pipeline type analyzer.

Defines a segmentation type analyzer.

Defines a stem type analyzer.

Defines a stopwords type analyzer.

Defines a text type analyzer.

Link to this section Functions

Link to this macro

aql(name, features, properties \\ %{})

View Source (macro)

Defines an aql type analyzer.

An Analyzer capable of running a restricted AQL query to perform data manipulation / filtering.

Refer to the ArangoDB AQL Docs for more info.

parameters

Parameters

* `name` - atom of the analyzer name
* `features` - the features options to be set (see [Analyzer Features](https://hexdocs.pm/arangox_ecto/ArangoXEcto.Analyzer.html#module-features))
* `properties` - a map of the properties to be set, see below

properties

Properties

* `queryString` - (string) - AQL query to be executed.
* `collapsePositions` - (boolean) - whether to set the position to 0 for all members of the query result array (true) or
  set the position corresponding to the index of the result array member (false).
* `keepNull` - (boolean) - either treat treat null like an empty string or discard null.
* `batchSize` - (integer) - number between 1 and 1000 (default = 1) that determines the batch size for reading data from the query.
* `memoryLimit` - (integer) - memory limit for query execution in bytes. (default is 1048576 = 1Mb) Maximum is 33554432U (32Mb).
* `returnType` - (string) - data type of the returned tokens.
  `:string` - convert emitted tokens to strings.
  `:number` - convert emitted tokens to numbers.
  `:bool` - convert emitted tokens to booleans.

Compiles analyzers

Link to this macro

classification(name, features, properties \\ %{})

View Source (macro)

Defines a classification type analyzer.

An Analyzer capable of classifying tokens in the input text.

Refer to the ArangoDB Classification Docs for more info.

parameters

Parameters

* `name` - atom of the analyzer name
* `features` - the features options to be set (see [Analyzer Features](https://hexdocs.pm/arangox_ecto/ArangoXEcto.Analyzer.html#module-features))
* `properties` - a map of the properties to be set, see below

properties

Properties

* `model_location` - (string) - the on-disk path to the trained fastText supervised model.
* `top_k` - (number) - the number of class labels that will be produced per input (default: 1).
* `threshold` - (number) - the probability threshold for which a label will be assigned to an input. A fastText
  model produces a probability per class label, and this is what will be filtered (default: 0.99).
Link to this macro

collation(name, features, properties \\ %{})

View Source (macro)

Defines a collation type analyzer.

An Analyzer capable of converting the input into a set of language-specific tokens. This makes comparisons follow the rules of the respective language, most notable in range queries against Views.

Refer to the ArangoDB Collation Docs for more info.

parameters

Parameters

* `name` - atom of the analyzer name
* `features` - the features options to be set (see [Analyzer Features](https://hexdocs.pm/arangox_ecto/ArangoXEcto.Analyzer.html#module-features))
* `properties` - a map of the properties to be set, see below

properties

Properties

* `locale` - (string) - a locale in the format language[_COUNTRY] (square brackets denote optional parts), e.g. "de" or "en_US".
Link to this macro

delimiter(name, features, properties \\ %{})

View Source (macro)

Defines a delimiter type analyzer.

An Analyzer capable of breaking up delimited text into tokens as per RFC 4180 (without starting new records on newlines).

Refer to the ArangoDB Delimiter Docs for more info.

parameters

Parameters

* `name` - atom of the analyzer name
* `features` - the features options to be set (see [Analyzer Features](https://hexdocs.pm/arangox_ecto/ArangoXEcto.Analyzer.html#module-features))
* `properties` - a map of the properties to be set, see below

properties

Properties

* `delimiter` - (string) - the delimiting character(s). The whole string is considered as one delimiter.
Link to this macro

geo_s2(name, features, properties \\ %{})

View Source (macro)

Defines a geo_s2 type analyzer.

An Analyzer capable of breaking up a GeoJSON object or coordinate array in [longitude, latitude] order into a set of indexable tokens for further usage with ArangoSearch Geo functions.

Refer to the ArangoDB Geo S2 Docs for more info.

parameters

Parameters

* `name` - atom of the analyzer name
* `features` - the features options to be set (see [Analyzer Features](https://hexdocs.pm/arangox_ecto/ArangoXEcto.Analyzer.html#module-features))
* `properties` - a map of the properties to be set, see below

properties

Properties

* `format` - (atom) - the internal binary representation to use for storing the geo-spatial data in an index 
  * `:latLngDouble` (default) - store each latitude and longitude value as an 8-byte floating-point value
    (16 bytes per coordinate pair). This format preserves numeric values exactly and is more compact than the
    VelocyPack format used by the geojson Analyzer.
  * `:latLngInt` - store each latitude and longitude value as an 4-byte integer value (8 bytes per coordinate
    pair). This is the most compact format but the precision is limited to approximately 1 to 10 centimeters.
  * `:s2Point` - store each longitude-latitude pair in the native format of Google S2 which is used for
    geo-spatial calculations (24 bytes per coordinate pair). This is not a particular compact format but it
    reduces the number of computations necessary when you execute geo-spatial queries. This format preserves
    numeric values exactly.
* `type` - (atom) - type of geojson object
  * `:shape` (default) - index all GeoJSON geometry types (Point, Polygon etc.)
  * `:centroid` - compute and only index the centroid of the input geometry
  * `:point` - only index GeoJSON objects of type Point, ignore all other geometry types
* `options` - (map) - options for fine-tuning geo queries. These options should generally remain unchanged 
  * `:maxCells` (number, optional) - maximum number of S2 cells (default: 20)
  * `:minLevel` (number, optional) - the least precise S2 level (default: 4)
  * `:maxLevel` (number, optional) - the most precise S2 level (default: 23)
Link to this macro

geojson(name, features, properties \\ %{})

View Source (macro)

Defines a geojson type analyzer.

An Analyzer capable of breaking up a GeoJSON object or coordinate array in [longitude, latitude] order into a set of indexable tokens for further usage with ArangoSearch Geo functions.

Refer to the ArangoDB GeoJSON Docs for more info.

parameters

Parameters

* `name` - atom of the analyzer name
* `features` - the features options to be set (see [Analyzer Features](https://hexdocs.pm/arangox_ecto/ArangoXEcto.Analyzer.html#module-features))
* `properties` - a map of the properties to be set, see below

properties

Properties

* `type` - (atom) - type of geojson object
  * `:shape` (default) - index all GeoJSON geometry types (Point, Polygon etc.)
  * `:centroid` - compute and only index the centroid of the input geometry
  * `:point` - only index GeoJSON objects of type Point, ignore all other geometry types
* `options` - (map) - options for fine-tuning geo queries. These options should generally remain unchanged 
  * `:maxCells` (number, optional) - maximum number of S2 cells (default: 20)
  * `:minLevel` (number, optional) - the least precise S2 level (default: 4)
  * `:maxLevel` (number, optional) - the most precise S2 level (default: 23)
Link to this macro

geopoint(name, features, properties \\ %{})

View Source (macro)

Defines a geopoint type analyzer.

An Analyzer capable of breaking up a coordinate array in [latitude, longitude] order or a JSON object describing a coordinate pair using two separate attributes into a set of indexable tokens for further usage with ArangoSearch Geo functions.

Refer to the ArangoDB Geo Point Docs for more info.

parameters

Parameters

* `name` - atom of the analyzer name
* `features` - the features options to be set (see [Analyzer Features](https://hexdocs.pm/arangox_ecto/ArangoXEcto.Analyzer.html#module-features))
* `properties` - a map of the properties to be set, see below

properties

Properties

* `latitude` - (list of string) - list of strings that describes the attribute path of the latitude value
  relative to the field for which the Analyzer is defined in the View
* `longitude` - (list of string) - list of strings that describes the attribute path of the longitude value
  relative to the field for which the Analyzer is defined in the View
* `options` - (map) - options for fine-tuning geo queries. These options should generally remain unchanged 
  * `:maxCells` (number, optional) - maximum number of S2 cells (default: 20)
  * `:minLevel` (number, optional) - the least precise S2 level (default: 4)
  * `:maxLevel` (number, optional) - the most precise S2 level (default: 23)
Link to this macro

identity(name, features)

View Source (macro)

Defines an identity type analyzer.

An Analyzer applying the identity transformation, i.e. returning the input unmodified.

Refer to the ArangoDB Identity Docs for more info.

This does not accept any properties.

parameters

Parameters

* `name` - atom of the analyzer name
* `features` - the features options to be set (see [Analyzer Features](https://hexdocs.pm/arangox_ecto/ArangoXEcto.Analyzer.html#module-features))
Link to this macro

minhash(name, features, properties, block)

View Source (macro)

Defines a minhash type analyzer.

An Analyzer that computes so called MinHash signatures using a locality-sensitive hash function. It applies an Analyzer of your choice before the hashing, for example, to break up text into words.

Refer to the ArangoDB MinHash Docs for more info.

parameters

Parameters

* `name` - atom of the analyzer name
* `features` - the features options to be set (see [Analyzer Features](https://hexdocs.pm/arangox_ecto/ArangoXEcto.Analyzer.html#module-features))
* `properties` - a map of the properties to be set, see below
* `analyzer` - block with one analyzer (if more then one is supplied, only the last will be used

properties

Properties

* `numHashes` - (number) - the size of the MinHash signature.
Link to this macro

nearest_neighbors(name, features, properties \\ %{})

View Source (macro)

Defines a nearest_neighbors type analyzer.

An Analyzer capable of finding nearest neighbors of tokens in the input.

Refer to the ArangoDB Nearest Neighbors Docs for more info.

parameters

Parameters

* `name` - atom of the analyzer name
* `features` - the features options to be set (see [Analyzer Features](https://hexdocs.pm/arangox_ecto/ArangoXEcto.Analyzer.html#module-features))
* `properties` - a map of the properties to be set, see below

properties

Properties

* `model_location` - (string) - the on-disk path to the trained fastText supervised model.
* `top_k` - (number) - the number of class labels that will be produced per input (default: 1).
Link to this macro

ngram(name, features, properties \\ %{})

View Source (macro)

Defines a ngram type analyzer.

An Analyzer capable of producing n-grams from a specified input in a range of min..max (inclusive). Can optionally preserve the original input.

Refer to the ArangoDB NGram Docs for more info.

parameters

Parameters

* `name` - atom of the analyzer name
* `features` - the features options to be set (see [Analyzer Features](https://hexdocs.pm/arangox_ecto/ArangoXEcto.Analyzer.html#module-features))
* `properties` - a map of the properties to be set, see below

properties

Properties

* `min` - (integer) - minimum n-gram length.
* `max` - (integer) - maximum n-gram length.
* `preserveOriginal` - (boolean) - whether to include the original value or just use the min & max values.
* `startMarker` - (string) - this value will be prepended to n-grams which include the beginning of the input.
* `endMarker` - (string) - this value will be appended to n-grams which include the end of the input.
* `streamType` - (atom) - type of the input stream.
  * `:binary` - one byte is considered as one character (default).
  * `:utf8` - one Unicode codepoint is treated as one character.
Link to this macro

norm(name, features, properties \\ %{})

View Source (macro)

Defines a norm type analyzer.

An Analyzer capable of normalizing the text, treated as a single token, i.e. case conversion and accent removal.

Refer to the ArangoDB Norm Docs for more info.

parameters

Parameters

* `name` - atom of the analyzer name
* `features` - the features options to be set (see [Analyzer Features](https://hexdocs.pm/arangox_ecto/ArangoXEcto.Analyzer.html#module-features))
* `properties` - a map of the properties to be set, see below

properties

Properties

* `locale` - (string) - a locale in the format language[_COUNTRY] (square brackets denote optional parts), e.g. "de" or "en_US".
* `accent` - (boolean) - whether to preserve accented characters or convert them to the base characters.
* `case` - (atom) - option of how to store case
  * `:lower` - to convert to all lower-case characters
  * `:upper` - to convert to all upper-case characters
  * `:none` - to not change character case (default)
Link to this macro

pipeline(name, features, block)

View Source (macro)

Defines a pipeline type analyzer.

An Analyzer capable of chaining effects of multiple Analyzers into one. The pipeline is a list of Analyzers, where the output of an Analyzer is passed to the next for further processing. The final token value is determined by last Analyzer in the pipeline.

Refer to the ArangoDB Pipeline Docs for more info.

note

Note

Features are only required on the pipeline and not on the individual analyzers within. Any features on sub analyzers will be ignored if supplied.

parameters

Parameters

* `name` - atom of the analyzer name
* `features` - the features options to be set (see [Analyzer Features](https://hexdocs.pm/arangox_ecto/ArangoXEcto.Analyzer.html#module-features))
* `analyzers` - a block with other analyzers

example

Example

pipeline :my_pipeline, [:frequency, :norm, :position] do
  norm "norm_en",  %{
    locale: "en",
    accent: false,
    case: :lower
  }

  text "text_en", %{
    locale: "en",
    accent: false,
    stemming: true,
    case: :lower
  }
end
Link to this macro

segmentation(name, features, properties \\ %{})

View Source (macro)

Defines a segmentation type analyzer.

An Analyzer capable of breaking up the input text into tokens in a language-agnostic manner as per Unicode Standard Annex #29, making it suitable for mixed language strings.

Refer to the ArangoDB Segmentation Docs for more info.

parameters

Parameters

* `name` - atom of the analyzer name
* `features` - the features options to be set (see [Analyzer Features](https://hexdocs.pm/arangox_ecto/ArangoXEcto.Analyzer.html#module-features))
* `properties` - a map of the properties to be set, see below

properties

Properties

* `break` - (atom) - character to break at
  * `:all` - return all tokens
  * `:alpha` - return tokens composed of alphanumeric characters only (default). Alphanumeric characters are Unicode codepoints from the
    Letter and Number categories, see Unicode Technical Note #36.
  * `:graphic` - return tokens composed of non-whitespace characters only. Note that the list of whitespace characters does not include line breaks:
    * `U+0009` Character Tabulation
    * `U+0020` Space
    * `U+0085` Next Line
    * `U+00A0` No-break Space
    * `U+1680` Ogham Space Mark
    * `U+2000` En Quad
    * `U+2028` Line Separator
    * `U+202F` Narrow No-break Space
    * `U+205F` Medium Mathematical Space
    * `U+3000` Ideographic Space
* `case` - (atom) - option of how to store case
  * `:lower` - to convert to all lower-case characters
  * `:upper` - to convert to all upper-case characters
  * `:none` - to not change character case (default)
Link to this macro

stem(name, features, properties \\ %{})

View Source (macro)

Defines a stem type analyzer.

An Analyzer capable of stemming the text, treated as a single token, for supported languages.

Refer to the ArangoDB Stem Docs for more info.

parameters

Parameters

* `name` - atom of the analyzer name
* `features` - the features options to be set (see [Analyzer Features](https://hexdocs.pm/arangox_ecto/ArangoXEcto.Analyzer.html#module-features))
* `properties` - a map of the properties to be set, see below

properties

Properties

* `locale` - (string) - a locale in the format language[_COUNTRY] (square brackets denote optional parts), e.g. "de" or "en_US".
Link to this macro

stopwords(name, features, properties \\ %{})

View Source (macro)

Defines a stopwords type analyzer.

An Analyzer capable of removing specified tokens from the input.

Refer to the ArangoDB Stopwords Docs for more info.

parameters

Parameters

* `name` - atom of the analyzer name
* `features` - the features options to be set (see [Analyzer Features](https://hexdocs.pm/arangox_ecto/ArangoXEcto.Analyzer.html#module-features))
* `properties` - a map of the properties to be set, see below

properties

Properties

* `stopwords` - (list of strings) - array of strings that describe the tokens to be discarded.
* `hex` - (boolean) - If false (default), then each string in stopwords is used verbatim.
Link to this macro

text(name, features, properties \\ %{})

View Source (macro)

Defines a text type analyzer.

An Analyzer capable of breaking up strings into individual words while also optionally filtering out stop-words, extracting word stems, applying case conversion and accent removal.

Refer to the ArangoDB Text Docs for more info.

parameters

Parameters

* `name` - atom of the analyzer name
* `features` - the features options to be set (see [Analyzer Features](https://hexdocs.pm/arangox_ecto/ArangoXEcto.Analyzer.html#module-features))
* `properties` - a map of the properties to be set, see below

properties

Properties

* `locale` - (string) - a locale in the format language[_COUNTRY] (square brackets denote optional parts), e.g. "de" or "en_US".
* `accent` - (boolean) - whether to preserve accented characters or convert them to the base characters.
* `case` - (string) - option of how to store case
  * `:lower` - to convert to all lower-case characters
  * `:upper` - to convert to all upper-case characters
  * `:none` - to not change character case (default)
* `stemming` - (boolean) - whether to apply stemming on returned words or leave as-is
* `edgeNgram` - (map) - if present, then edge n-grams are generated for each token (word). 
  * `min` - (integer) - minimum n-gram length.
  * `max` - (integer) - maximum n-gram length.
  * `preserveOriginal` - (boolean) - whether to include the original value or just use the min & max values.
* `stopwords` - (list of strings) - a list of strings with words to omit from result.
* `stopwordsPath` - (string) - path with a language sub-directory (e.g. en for a locale en_US) containing files with words to omit.