Torus (Torus v0.3.0)
View SourceTorus is a plug-and-play Elixir library that seamlessly integrates PostgreSQL's search into Ecto, streamlining the construction of advanced search queries.
Usage
The package can be installed by adding torus
to your list of dependencies in mix.exs
:
def deps do
[
{:torus, "~> 0.3"}
]
end
Then, in any query, you can (for example) add a full-text search:
import Torus
# ...
Post
# ... your complex query
|> Torus.full_text([p], [p.title, p.body], "uncovered hogwarts")
|> Repo.all()
See full_text/5
for more details.
6 types of search:
Pattern matching: Searches for a specific pattern in a string.
iex> insert_posts!(["Wand", "Magic wand", "Owl"]) ...> Post ...> |> Torus.ilike([p], [p.title], "wan%") ...> |> select([p], p.title) ...> |> Repo.all() ["Wand"]
See
like/5
,ilike/5
, andsimilar_to/5
for more details.Similarity: Searches for items that are closely alike based on attributes, often using measures like cosine similarity or Euclidean distance. Is great for fuzzy searching and ignoring typos in short texts.
iex> insert_posts!(["Hogwarts Secrets", "Quidditch Fever", "Hogwart’s Secret"]) ...> Post ...> |> Torus.similarity([p], [p.title], "hoggwarrds") ...> |> limit(2) ...> |> select([p], p.title) ...> |> Repo.all() ["Hogwarts Secrets", "Hogwart’s Secret"]
See
similarity/5
for more details.Text Search Vectors: Uses term-document matrix vectors for full-text search, enabling efficient querying and ranking based on term frequency. - PostgreSQL: Full Text Search. Is great for large datasets to quickly return relevant results.
iex> insert_post!(title: "Hogwarts Shocker", body: "A spell disrupts the Quidditch Cup.") ...> insert_post!(title: "Diagon Bombshell", body: "Secrets uncovered in the heart of Hogwarts.") ...> insert_post!(title: "Completely unrelated", body: "No magic here!") ...> Post ...> |> Torus.full_text([p], [p.title, p.body], "uncov hogwar") ...> |> select([p], p.title) ...> |> Repo.all() ["Diagon Bombshell"]
See
full_text/5
for more details.Semantic Search: Understands the contextual meaning of queries to match and retrieve related content, often utilizing natural language processing. Semantic Search with PostgreSQL and OpenAI Embeddings
Will be added soon.
Hybrid Search: Combines multiple search techniques (e.g., keyword and semantic) to leverage their strengths for more accurate results.
Will be added soon.
3rd Party Engines/Providers: Utilizes external services or software specifically designed for optimized and scalable search capabilities, such as Elasticsearch or Algolia.
Optimizations and relevance
Torus is designed to be as efficient and relevant as possible from the start. But handling large datasets and complex search queries tends to be tricky. The best way to combine these two to achieve the best result is to:
- Create a query that returns as relevant results as possible (by tweaking the options of search function). If there is any option missing - feel free to open an issue/contribute back with it, or implement it manually using fragments.
- Test its performance on real production data - maybe it's good enough already?
- If it's not:
- See optimization sections for your search type in
Torus
docs - Inspect your query using
Torus.QueryInspector.tap_substituted_sql/3
orTorus.QueryInspector.tap_explain_analyze/3
- According to the above SQL - add indexes for the queried rows/vectors
- See optimization sections for your search type in
Debugging your queries
Torus offers a few helpers to debug, explain, and analyze your queries before using them on production. See Torus.QueryInspector
for more details.
Torus support
For now, Torus supports pattern match, similarity, and full-text search, with plans to expand support further. These docs will be updated with more examples on which search type to choose and how to make them more performant (by adding indexes or using specific functions).
Summary
Functions
Full text search with rank ordering. Accepts a list of columns to search in. A list of columns
can either be a text or tsvector
type. If tsvector
s are passed make sure to set
stored: true
.
Case-insensitive pattern matching search using
PostgreSQL ILIKE
operator.
Case-sensitive pattern matching search using PostgreSQL LIKE
operator.
Similar to like/5
, except that it interprets the pattern using the SQL standard's
definition of a regular expression. SQL regular expressions are a curious cross between
LIKE notation and common (POSIX) regular expression notation. See
PostgreSQL SIMILAR TO
Case-insensitive similarity search using PostgreSQL similarity functions.
The substring function with three parameters provides extraction of a substring that matches an SQL regular expression pattern. The function can be written according to standard SQL syntax
Functions
Full text search with rank ordering. Accepts a list of columns to search in. A list of columns
can either be a text or tsvector
type. If tsvector
s are passed make sure to set
stored: true
.
Cleans the term, so it can be input directly by the user. The default preset of settings is optimal for most cases.
Full Text Searching (or just text search) provides the capability to identify natural-language documents that satisfy a query, and optionally to sort them by relevance to the query. The most common type of search is to find all documents containing given query terms and return them in order of their similarity to the query. Notions of query and similarity are very flexible and depend on the specific application. The simplest search considers query as a set of words and similarity as the frequency of query words in the document. Read more in PostgreSQL Full Text Search docs.
Options
:prefix_search
- whether to apply prefix search.true
(default) - the term is treated as a prefixfalse
- only counts full-word matches
:stored
false
(default) - columns (or expressions) passed as qualifiers are of typetext
true
- columns (or expressions) passed as qualifiers are tsvectors
:language
- language used for the search. Defaults to"english"
.:term_function
- function used to convert the term tots_query
. Can be one of::websearch_to_tsquery
(default) - converts term to a tsquery, normalizing words according to the specified or default configuration. Quoted word sequences are converted to phrase tests. The word “or” is understood as producing an OR operator, and a dash produces a NOT operator; other punctuation is ignored. This approximates the behavior of some common web search tools.:plainto_tsquery
- converts term to a tsquery, normalizing words according to the specified or default configuration. Any punctuation in the string is ignored (it does not determine query operators). The resulting query matches documents containing all non-stopwords in the term.:phraseto_tsquery
- converts term to a tsquery, normalizing words according to the specified or default configuration. Any punctuation in the string is ignored (it does not determine query operators). The resulting query matches phrases containing all non-stopwords in the text.
:rank_function
- function used to rank the results.:ts_rank_cd
(default) - computes a score showing how well the vector matches the query, using a cover density algorithm. See Ranking Search Results for more details.:ts_rank
- computes a score showing how well the vector matches the query.
:rank_weights
- a list of weights for each column. Defaults to[:A, :B, :C, :D]
. The length of weights (if provided) should be the same as the length of the columns we search for. A single weight can be either a string or an atom. Possible values are::A
- 1.0:B
- 0.4:C
- 0.2:D
- 0.1
:rank_normalization
- a string that specifies whether and how a document's length should impact its rank. The integer option controls several behaviors, so it is a bit mask: you can specify one or more behaviors using|
(for example,2|4
).0
(default forts_rank
) - ignores the document length1
- divides the rank by 1 + the logarithm of the document length2
- divides the rank by the document length4
(default forts_rank_cs
) - divides the rank by the mean harmonic distance between extents (this is implemented only byts_rank_cd
)8
- divides the rank by the number of unique words in document16
- divides the rank by 1 + the logarithm of the number of unique words in document32
- divides the rank by itself + 1
:order
- describes the ordering of the results. Possible values are:desc
(default) - orders the results by similarity rank in descending order.:asc
- orders the results by similarity rank in ascending order.:none
- doesn't apply ordering at all.
:filter_type
:or
(default) - usesOR
operator to combine different column matches. Selecting this option means that the search term won't match across columns.:concat
- joins the columns into a single tsvector and searches for the term in the concatenated string containing all columns.:none
- doesn't apply any filtering and returns all results.
:coalesce
true
(default) - when joining columns via:concat
option, adds aCOALESCE
function to handle NULL values. Choose this when you can't guarantee that all columns are non-null.false
- doesn't addCOALESCE
function to the query. Choose this when you're usingfilter_type: :concat
and can guarantee that all columns are non-null.
Example usage
iex> insert_post!(title: "Hogwarts Shocker", body: "A spell disrupts the Quidditch Cup.")
...> insert_post!(title: "Diagon Bombshell", body: "Secrets uncovered in the heart of Hogwarts.")
...> insert_post!(title: "Completely unrelated", body: "No magic here!")
...> Post
...> |> Torus.full_text([p], [p.title, p.body], "uncov hogwar")
...> |> select([p], p.title)
...> |> Repo.all()
["Diagon Bombshell"]
Optimizations
Store precomputed tsvector in a separate column, add a GIN index to it, and use
stored: true
.Add a GIN tsvector index on the column(s) you search in. Use
Torus.QueryInspector.tap_sql/2
on your query (with all the options passed) to see the exact search string and add an index to it. For example for nullable title, the GIN index could look like:CREATE INDEX index_gin_posts_title ON posts USING GIN (to_tsvector('english', COALESCE(title, '')));
Case-insensitive pattern matching search using
PostgreSQL ILIKE
operator.
Warning
Doesn't clean the term, so it needs to be sanitized before being passed in. See LIKE-injections.
Examples
iex> insert_posts!(titles: ["Wand", "Magic wand", "Owl"])
...> Post
...> |> Torus.ilike([p], [p.title], "wan%")
...> |> select([p], p.title)
...> |> Repo.all()
["Wand"]
iex> insert_posts!([%{title: "hogwarts", body: nil}, %{title: nil, body: "HOGWARTS"}])
...> Post
...> |> Torus.ilike([p], [p.title, p.body], "%OGWART%")
...> |> select([p], %{title: p.title, body: p.body})
...> |> order_by(:id)
...> |> Repo.all()
[%{title: "hogwarts", body: nil}, %{title: nil, body: "HOGWARTS"}]
iex> insert_post!(title: "MaGiC")
...> Post
...> |> Torus.ilike([p], p.title, "magi%")
...> |> select([p], p.title)
...> |> Repo.all()
["MaGiC"]
Optimizations
See like/5
optimization section for more details.
Case-sensitive pattern matching search using PostgreSQL LIKE
operator.
Warning
Doesn't clean the term, so it needs to be sanitized before being passed in. See LIKE-injections.
Examples
iex> insert_posts!([%{title: "hogwarts", body: nil}, %{title: nil, body: "HOGWARTS"}])
...> Post
...> |> Torus.like([p], [p.title, p.body], "%OGWART%")
...> |> select([p], p.body)
...> |> Repo.all()
["HOGWARTS"]
Optimizations
like/5
is case-sensitive, so it can take advantage of B-tree indexes when there is no wildcard (%) at the beginning of the search term, prefer it overilike/5
if possible.Adding a B-tree index:
CREATE INDEX index_posts_on_title ON posts (title);
Use
GIN
orGiST
Index withpg_trgm
extension for LIKE and ILIKE.When searching for substrings (%word%), B-tree indexes won't help. Instead, use trigram indexing (
pg_trgm
extension):CREATE EXTENSION IF NOT EXISTS pg_trgm; CREATE INDEX posts_title_trgm_idx ON posts USING GIN (title gin_trgm_ops);
If using prefix search, convert data to lowercase and use B-tree index for case-insensitive search:
ALTER TABLE posts ADD COLUMN title_lower TEXT GENERATED ALWAYS AS (LOWER(title)) STORED; CREATE INDEX index_posts_on_title ON posts (title_lower);
Torus.like([p], [p.title_lower], "hogwarts%")
Use full-text search for large text fields, see
full_text/5
for more details.
Similar to like/5
, except that it interprets the pattern using the SQL standard's
definition of a regular expression. SQL regular expressions are a curious cross between
LIKE notation and common (POSIX) regular expression notation. See
PostgreSQL SIMILAR TO
Examples
iex> insert_post!(body: "abc")
...> Post
...> |> Torus.similar_to([p], [p.title, p.body], "%(b|d)%")
...> |> select([p], p.body)
...> |> Repo.all()
["abc"]
Optimizations
- If regex is needed, use POSIX regex with
~
or~*
operators since they may leverage GIN or GiST indexes in some cases. These operators will be introduced later on. - Use
ilike/5
orlike/5
when possible,similar_to/5
almost always does full table scans - Filter and limit the result set as much as possible before calling
similar_to/5
Case-insensitive similarity search using PostgreSQL similarity functions.
Warning
You need to have pg_trgm extension installed.
Options
:type
- similarity type. Possible options are::full
(default) - usessimilarity
function.:word
- usesword_similarity
function. If you're dealing with sentences and you don't want the length of the strings to affect the search result.:strict
- usesstrict_word_similarity
function. Prioritizes full matches, forces extent boundaries to match word boundaries. Since we don't have cross-word trigrams, this function actually returns greatest similarity between first string and any continuous extent of words of the second string.
:order
- describes the ordering of the results. Possible values are:desc
(default) - orders the results by similarity rank in descending order.:asc
- orders the results by similarity rank in ascending order.:none
- doesn't apply ordering and returns
:pre_filter
- whether or not to pre-filter the results:false
(default) - omits pre-filtering and returns all results.true
- before applying the order, pre filters (using boolean operators which potentially use GIN indexes) the result set. The results abovepg_trgm.similarity_threshold
(which defaults to 0.3) are returned.
Examples
iex> insert_post!(title: "Hogwarts Shocker", body: "A spell disrupts the Quidditch Cup.")
...> insert_post!(title: "Diagon Bombshell", body: "Secrets uncovered in the heart of Hogwarts.")
...> insert_post!(title: "Completely unrelated", body: "No magic here!")
...> Post
...> |> Torus.similarity([p], [p.title, p.body], "Diagon Bombshell")
...> |> limit(1)
...> |> select([p], p.title)
...> |> Repo.all()
["Diagon Bombshell"]
iex> insert_posts!(["Wand", "Owl", "What an amazing cloak"])
...> Post
...> |> Torus.similarity([p], [p.title], "what a cloak", pre_filter: true)
...> |> select([p], p.title)
...> |> Repo.all()
["What an amazing cloak"]
Optimizations
- Use
pre_filter: true
to pre-filter the results before applying the order. This would significantly reduce the number of rows to order. The pre-filtering phase uses different (boolean) similarity operators which more actively leverage GIN indexes. - Limit the number of raws returned using
limit
. - Use
order: :none
argument if you don't care about the order of the results. The query will return all results that are above the similarity threshold, which you can set globally viaSET pg_trgm.similarity_threshold = 0.3;
. - When
order: :desc
(default) and the limit is not set, the query will do a full table scan, so it's recommended to manually limit the results (by applyingwhere
orlimit
clauses to filter the rows as much as possible).
Adding an index
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX index_posts_on_title ON posts USING GIN (title gin_trgm_ops);
The substring function with three parameters provides extraction of a substring that matches an SQL regular expression pattern. The function can be written according to standard SQL syntax:
substring('foobar' similar '%#"o_b#"%' escape '#') oob
substring('foobar' similar '#"o_b#"%' escape '#') NULL
Examples
insert_post!(title: "Hello123World")
Post |> select([p], substring(p.title, "[0-9]+", "#")) |> Repo.all()
["123"]