Torus (Torus v0.3.0)

View Source

Torus is a plug-and-play Elixir library that seamlessly integrates PostgreSQL's search into Ecto, streamlining the construction of advanced search queries.

Usage

The package can be installed by adding torus to your list of dependencies in mix.exs:

def deps do
  [
    {:torus, "~> 0.3"}
  ]
end

Then, in any query, you can (for example) add a full-text search:

import Torus
# ...

Post
# ... your complex query
|> Torus.full_text([p], [p.title, p.body], "uncovered hogwarts")
|> Repo.all()

See full_text/5 for more details.

  1. Pattern matching: Searches for a specific pattern in a string.

    iex> insert_posts!(["Wand", "Magic wand", "Owl"])
    ...> Post
    ...> |> Torus.ilike([p], [p.title], "wan%")
    ...> |> select([p], p.title)
    ...> |> Repo.all()
    ["Wand"]

    See like/5, ilike/5, and similar_to/5 for more details.

  2. Similarity: Searches for items that are closely alike based on attributes, often using measures like cosine similarity or Euclidean distance. Is great for fuzzy searching and ignoring typos in short texts.

    iex> insert_posts!(["Hogwarts Secrets", "Quidditch Fever", "Hogwart’s Secret"])
    ...> Post
    ...> |> Torus.similarity([p], [p.title], "hoggwarrds")
    ...> |> limit(2)
    ...> |> select([p], p.title)
    ...> |> Repo.all()
    ["Hogwarts Secrets", "Hogwart’s Secret"]

    See similarity/5 for more details.

  3. Text Search Vectors: Uses term-document matrix vectors for full-text search, enabling efficient querying and ranking based on term frequency. - PostgreSQL: Full Text Search. Is great for large datasets to quickly return relevant results.

       iex> insert_post!(title: "Hogwarts Shocker", body: "A spell disrupts the Quidditch Cup.")
       ...> insert_post!(title: "Diagon Bombshell", body: "Secrets uncovered in the heart of Hogwarts.")
       ...> insert_post!(title: "Completely unrelated", body: "No magic here!")
       ...>  Post
       ...> |> Torus.full_text([p], [p.title, p.body], "uncov hogwar")
       ...> |> select([p], p.title)
       ...> |> Repo.all()
       ["Diagon Bombshell"]

    See full_text/5 for more details.

  4. Semantic Search: Understands the contextual meaning of queries to match and retrieve related content, often utilizing natural language processing. Semantic Search with PostgreSQL and OpenAI Embeddings

    Will be added soon.

  5. Hybrid Search: Combines multiple search techniques (e.g., keyword and semantic) to leverage their strengths for more accurate results.

    Will be added soon.

  6. 3rd Party Engines/Providers: Utilizes external services or software specifically designed for optimized and scalable search capabilities, such as Elasticsearch or Algolia.

Optimizations and relevance

Torus is designed to be as efficient and relevant as possible from the start. But handling large datasets and complex search queries tends to be tricky. The best way to combine these two to achieve the best result is to:

  1. Create a query that returns as relevant results as possible (by tweaking the options of search function). If there is any option missing - feel free to open an issue/contribute back with it, or implement it manually using fragments.
  2. Test its performance on real production data - maybe it's good enough already?
  3. If it's not:

Debugging your queries

Torus offers a few helpers to debug, explain, and analyze your queries before using them on production. See Torus.QueryInspector for more details.

Torus support

For now, Torus supports pattern match, similarity, and full-text search, with plans to expand support further. These docs will be updated with more examples on which search type to choose and how to make them more performant (by adding indexes or using specific functions).

Summary

Functions

Full text search with rank ordering. Accepts a list of columns to search in. A list of columns can either be a text or tsvector type. If tsvectors are passed make sure to set stored: true.

Case-insensitive pattern matching search using PostgreSQL ILIKE operator.

Case-sensitive pattern matching search using PostgreSQL LIKE operator.

Similar to like/5, except that it interprets the pattern using the SQL standard's definition of a regular expression. SQL regular expressions are a curious cross between LIKE notation and common (POSIX) regular expression notation. See PostgreSQL SIMILAR TO

The substring function with three parameters provides extraction of a substring that matches an SQL regular expression pattern. The function can be written according to standard SQL syntax

Functions

full_text(query, bindings, qualifiers, term, opts \\ [])

(macro)

Full text search with rank ordering. Accepts a list of columns to search in. A list of columns can either be a text or tsvector type. If tsvectors are passed make sure to set stored: true.

Cleans the term, so it can be input directly by the user. The default preset of settings is optimal for most cases.

Full Text Searching (or just text search) provides the capability to identify natural-language documents that satisfy a query, and optionally to sort them by relevance to the query. The most common type of search is to find all documents containing given query terms and return them in order of their similarity to the query. Notions of query and similarity are very flexible and depend on the specific application. The simplest search considers query as a set of words and similarity as the frequency of query words in the document. Read more in PostgreSQL Full Text Search docs.

Options

  • :prefix_search - whether to apply prefix search.
    • true (default) - the term is treated as a prefix
    • false - only counts full-word matches
  • :stored
    • false (default) - columns (or expressions) passed as qualifiers are of type text
    • true - columns (or expressions) passed as qualifiers are tsvectors
  • :language - language used for the search. Defaults to "english".
  • :term_function - function used to convert the term to ts_query. Can be one of:
    • :websearch_to_tsquery (default) - converts term to a tsquery, normalizing words according to the specified or default configuration. Quoted word sequences are converted to phrase tests. The word “or” is understood as producing an OR operator, and a dash produces a NOT operator; other punctuation is ignored. This approximates the behavior of some common web search tools.
    • :plainto_tsquery - converts term to a tsquery, normalizing words according to the specified or default configuration. Any punctuation in the string is ignored (it does not determine query operators). The resulting query matches documents containing all non-stopwords in the term.
    • :phraseto_tsquery - converts term to a tsquery, normalizing words according to the specified or default configuration. Any punctuation in the string is ignored (it does not determine query operators). The resulting query matches phrases containing all non-stopwords in the text.
  • :rank_function - function used to rank the results.
    • :ts_rank_cd (default) - computes a score showing how well the vector matches the query, using a cover density algorithm. See Ranking Search Results for more details.
    • :ts_rank - computes a score showing how well the vector matches the query.
  • :rank_weights - a list of weights for each column. Defaults to [:A, :B, :C, :D]. The length of weights (if provided) should be the same as the length of the columns we search for. A single weight can be either a string or an atom. Possible values are:
    • :A - 1.0
    • :B - 0.4
    • :C - 0.2
    • :D - 0.1
  • :rank_normalization - a string that specifies whether and how a document's length should impact its rank. The integer option controls several behaviors, so it is a bit mask: you can specify one or more behaviors using | (for example, 2|4).
    • 0 (default for ts_rank) - ignores the document length
    • 1 - divides the rank by 1 + the logarithm of the document length
    • 2 - divides the rank by the document length
    • 4 (default for ts_rank_cs) - divides the rank by the mean harmonic distance between extents (this is implemented only by ts_rank_cd)
    • 8 - divides the rank by the number of unique words in document
    • 16 - divides the rank by 1 + the logarithm of the number of unique words in document
    • 32 - divides the rank by itself + 1
  • :order - describes the ordering of the results. Possible values are
    • :desc (default) - orders the results by similarity rank in descending order.
    • :asc - orders the results by similarity rank in ascending order.
    • :none - doesn't apply ordering at all.
  • :filter_type
    • :or (default) - uses OR operator to combine different column matches. Selecting this option means that the search term won't match across columns.
    • :concat - joins the columns into a single tsvector and searches for the term in the concatenated string containing all columns.
    • :none - doesn't apply any filtering and returns all results.
  • :coalesce
    • true (default) - when joining columns via :concat option, adds a COALESCE function to handle NULL values. Choose this when you can't guarantee that all columns are non-null.
    • false - doesn't add COALESCE function to the query. Choose this when you're using filter_type: :concat and can guarantee that all columns are non-null.

Example usage

iex> insert_post!(title: "Hogwarts Shocker", body: "A spell disrupts the Quidditch Cup.")
...> insert_post!(title: "Diagon Bombshell", body: "Secrets uncovered in the heart of Hogwarts.")
...> insert_post!(title: "Completely unrelated", body: "No magic here!")
...>  Post
...> |> Torus.full_text([p], [p.title, p.body], "uncov hogwar")
...> |> select([p], p.title)
...> |> Repo.all()
["Diagon Bombshell"]

Optimizations

  • Store precomputed tsvector in a separate column, add a GIN index to it, and use stored: true.

  • Add a GIN tsvector index on the column(s) you search in. Use Torus.QueryInspector.tap_sql/2 on your query (with all the options passed) to see the exact search string and add an index to it. For example for nullable title, the GIN index could look like:

    CREATE INDEX index_gin_posts_title
    ON posts USING GIN (to_tsvector('english', COALESCE(title, '')));

ilike(query, bindings, qualifiers, term, opts \\ [])

(macro)

Case-insensitive pattern matching search using PostgreSQL ILIKE operator.

Warning

Doesn't clean the term, so it needs to be sanitized before being passed in. See LIKE-injections.

Examples

iex> insert_posts!(titles: ["Wand", "Magic wand", "Owl"])
...> Post
...> |> Torus.ilike([p], [p.title], "wan%")
...> |> select([p], p.title)
...> |> Repo.all()
["Wand"]

iex> insert_posts!([%{title: "hogwarts", body: nil}, %{title: nil, body: "HOGWARTS"}])
...> Post
...> |> Torus.ilike([p], [p.title, p.body], "%OGWART%")
...> |> select([p], %{title: p.title, body: p.body})
...> |> order_by(:id)
...> |> Repo.all()
[%{title: "hogwarts", body: nil}, %{title: nil, body: "HOGWARTS"}]

iex> insert_post!(title: "MaGiC")
...> Post
...> |> Torus.ilike([p], p.title, "magi%")
...> |> select([p], p.title)
...> |> Repo.all()
["MaGiC"]

Optimizations

See like/5 optimization section for more details.

like(query, binding, qualifiers, term, opts \\ [])

(macro)

Case-sensitive pattern matching search using PostgreSQL LIKE operator.

Warning

Doesn't clean the term, so it needs to be sanitized before being passed in. See LIKE-injections.

Examples

iex> insert_posts!([%{title: "hogwarts", body: nil}, %{title: nil, body: "HOGWARTS"}])
...> Post
...> |> Torus.like([p], [p.title, p.body], "%OGWART%")
...> |> select([p], p.body)
...> |> Repo.all()
["HOGWARTS"]

Optimizations

  • like/5 is case-sensitive, so it can take advantage of B-tree indexes when there is no wildcard (%) at the beginning of the search term, prefer it over ilike/5 if possible.

    Adding a B-tree index:

      CREATE INDEX index_posts_on_title ON posts (title);
  • Use GIN or GiST Index with pg_trgm extension for LIKE and ILIKE.

    When searching for substrings (%word%), B-tree indexes won't help. Instead, use trigram indexing (pg_trgm extension):

      CREATE EXTENSION IF NOT EXISTS pg_trgm;
      CREATE INDEX posts_title_trgm_idx ON posts USING GIN (title gin_trgm_ops);
  • If using prefix search, convert data to lowercase and use B-tree index for case-insensitive search:

      ALTER TABLE posts ADD COLUMN title_lower TEXT GENERATED ALWAYS AS (LOWER(title)) STORED;
      CREATE INDEX index_posts_on_title ON posts (title_lower);
      Torus.like([p], [p.title_lower], "hogwarts%")
  • Use full-text search for large text fields, see full_text/5 for more details.

similar_to(query, bindings, qualifiers, term, opts \\ [])

(macro)

Similar to like/5, except that it interprets the pattern using the SQL standard's definition of a regular expression. SQL regular expressions are a curious cross between LIKE notation and common (POSIX) regular expression notation. See PostgreSQL SIMILAR TO

Examples

iex> insert_post!(body: "abc")
...> Post
...> |> Torus.similar_to([p], [p.title, p.body], "%(b|d)%")
...> |> select([p], p.body)
...> |> Repo.all()
["abc"]

Optimizations

  • If regex is needed, use POSIX regex with ~ or ~* operators since they may leverage GIN or GiST indexes in some cases. These operators will be introduced later on.
  • Use ilike/5 or like/5 when possible, similar_to/5 almost always does full table scans
  • Filter and limit the result set as much as possible before calling similar_to/5

similarity(query, bindings, qualifiers, term, opts \\ [])

(macro)

Case-insensitive similarity search using PostgreSQL similarity functions.

Warning

You need to have pg_trgm extension installed.

Options

  • :type - similarity type. Possible options are:
    • :full (default) - uses similarity function.
    • :word - uses word_similarity function. If you're dealing with sentences and you don't want the length of the strings to affect the search result.
    • :strict - uses strict_word_similarity function. Prioritizes full matches, forces extent boundaries to match word boundaries. Since we don't have cross-word trigrams, this function actually returns greatest similarity between first string and any continuous extent of words of the second string.
  • :order - describes the ordering of the results. Possible values are
    • :desc (default) - orders the results by similarity rank in descending order.
    • :asc - orders the results by similarity rank in ascending order.
    • :none - doesn't apply ordering and returns
  • :pre_filter - whether or not to pre-filter the results:
    • false (default) - omits pre-filtering and returns all results.
    • true - before applying the order, pre filters (using boolean operators which potentially use GIN indexes) the result set. The results above pg_trgm.similarity_threshold (which defaults to 0.3) are returned.

Examples

iex> insert_post!(title: "Hogwarts Shocker", body: "A spell disrupts the Quidditch Cup.")
...> insert_post!(title: "Diagon Bombshell", body: "Secrets uncovered in the heart of Hogwarts.")
...> insert_post!(title: "Completely unrelated", body: "No magic here!")
...>  Post
...> |> Torus.similarity([p], [p.title, p.body], "Diagon Bombshell")
...> |> limit(1)
...> |> select([p], p.title)
...> |> Repo.all()
["Diagon Bombshell"]

iex> insert_posts!(["Wand", "Owl", "What an amazing cloak"])
...> Post
...> |> Torus.similarity([p], [p.title], "what a cloak", pre_filter: true)
...> |> select([p], p.title)
...> |> Repo.all()
["What an amazing cloak"]

Optimizations

  • Use pre_filter: true to pre-filter the results before applying the order. This would significantly reduce the number of rows to order. The pre-filtering phase uses different (boolean) similarity operators which more actively leverage GIN indexes.
  • Limit the number of raws returned using limit.
  • Use order: :none argument if you don't care about the order of the results. The query will return all results that are above the similarity threshold, which you can set globally via SET pg_trgm.similarity_threshold = 0.3;.
  • When order: :desc (default) and the limit is not set, the query will do a full table scan, so it's recommended to manually limit the results (by applying where or limit clauses to filter the rows as much as possible).

Adding an index

CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX index_posts_on_title ON posts USING GIN (title gin_trgm_ops);

substring(string, pattern, escape_character)

(macro)

The substring function with three parameters provides extraction of a substring that matches an SQL regular expression pattern. The function can be written according to standard SQL syntax:

substring('foobar' similar '%#"o_b#"%' escape '#')   oob
substring('foobar' similar '#"o_b#"%' escape '#')    NULL

Examples

insert_post!(title: "Hello123World")
Post |> select([p], substring(p.title, "[0-9]+", "#")) |> Repo.all()
["123"]