Core Concepts
View SourceUnderstanding TantivyEx's core concepts will help you build better search applications and make informed design decisions.
What is TantivyEx?
TantivyEx is an Elixir wrapper around Tantivy, a full-text search engine library written in Rust. It provides:
- High Performance: Leverages Rust's memory safety and speed
- Full-Text Search: Advanced text processing and ranking
- Flexible Schema: Support for various field types and options
- Real-Time Updates: Add, update, and delete documents efficiently
- Query Language: Rich query syntax for complex searches
Key Components
1. Index
An Index is a data structure that stores your documents in a way that enables fast search operations. Think of it as a specialized database optimized for text search.
# Create an index in memory (for testing)
{:ok, index} = Index.create_in_ram(schema)
# Create a persistent index (for production)
{:ok, index} = Index.create_in_dir("/path/to/index", schema)
# Open an existing index
{:ok, index} = Index.open("/path/to/index")
# Open existing or create new index (recommended)
{:ok, index} = Index.open_or_create("/path/to/index", schema)Index Types:
- In-Memory: Fast, temporary, lost on restart
- Persistent: Stored on disk, survives restarts
- Distributed: Multiple shards (advanced topic)
Index Management Functions:
create_in_dir/2: Creates new index, fails if existscreate_in_ram/1: Creates temporary in-memory indexopen/1: Opens existing index, fails if doesn't existopen_or_create/2: Opens existing or creates new (production recommended)
2. Schema
A Schema defines the structure of your documents - what fields they have and how those fields should be processed and stored.
schema = Schema.new()
schema = Schema.add_text_field(schema, "title", :text_stored)
schema = Schema.add_u64_field(schema, "timestamp", :fast_stored)Schema Design Principles:
- Define all fields upfront
- Choose appropriate field types and options
- Consider storage vs. performance trade-offs
- Plan for future requirements
3. Documents
Documents are the actual data you want to search. They must conform to your schema.
document = %{
"title" => "Introduction to Elixir",
"timestamp" => 1640995200
}
{:ok, writer} = TantivyEx.IndexWriter.new(index)
:ok = TantivyEx.IndexWriter.add_document(writer, document)Document Characteristics:
- Must match the schema structure
- Can contain only defined fields
- Support various data types
- Immutable once added (update = delete + add)
4. Fields
Fields are the individual pieces of data in your documents. Different field types are optimized for different use cases:
Text Fields
# For full-text search
schema = Schema.add_text_field(schema, "content", :text)
# For exact matches and storage
schema = Schema.add_text_field(schema, "title", :text_stored)Numeric Fields
# For integers
schema = Schema.add_u64_field(schema, "timestamp", :fast_stored)
schema = Schema.add_i64_field(schema, "score", :fast)
# For floating point
schema = Schema.add_f64_field(schema, "price", :fast_stored)Facet Fields
# For hierarchical filtering
schema = Schema.add_facet_field(schema, "category", :facet)Binary Fields
# For storing raw data
schema = Schema.add_bytes_field(schema, "thumbnail", :stored)5. Queries
Queries are how you search your index. TantivyEx supports a rich query language:
# Simple term search
{:ok, searcher} = TantivyEx.Searcher.new(index)
TantivyEx.Searcher.search(searcher, "elixir", 10)
# Boolean queries
TantivyEx.Searcher.search(searcher, "elixir AND phoenix", 10)
# Range queries
TantivyEx.Searcher.search(searcher, "price:[10.0 TO 100.0]", 10)
# Field-specific search
TantivyEx.Searcher.search(searcher, "title:elixir", 10)Data Flow
Understanding the data flow helps you design better search applications:
- Schema Definition → Define your document structure
- Index Creation → Create storage for your documents
- Document Addition → Add your data to the index
- Commit → Make changes searchable
- Query Execution → Search and retrieve results
# 1. Define schema
{:ok, schema} = Schema.new() |> Schema.add_text_field("title", :text_stored)
# 2. Create index
{:ok, index} = Index.create_in_ram(schema)
# 3. Add documents
{:ok, writer} = TantivyEx.IndexWriter.new(index)
:ok = TantivyEx.IndexWriter.add_document(writer, %{"title" => "Hello World"})
# 4. Commit (make searchable)
:ok = TantivyEx.IndexWriter.commit(writer)
# 5. Search
{:ok, searcher} = TantivyEx.Searcher.new(index)
{:ok, results} = TantivyEx.Searcher.search(searcher, "hello", 10)Field Options Deep Dive
Understanding field options is crucial for performance and functionality:
Storage Options
:stored- Field value is stored and can be retrieved:fast- Field is indexed for fast filtering and sorting:indexed- Field is searchable (default for text fields)
Combined Options
:text_stored- Text field that's both searchable and stored:fast_stored- Numeric field that's fast-accessible and stored:facet- Facet field for hierarchical filtering
Choosing Options
# For full-text search with retrieval
Schema.add_text_field(schema, "content", :text_stored)
# For filtering without retrieval
Schema.add_u64_field(schema, "timestamp", :fast)
# For display-only fields
Schema.add_text_field(schema, "author", :stored)
# For hierarchical navigation
Schema.add_facet_field(schema, "category", :facet)Memory and Performance Considerations
Index Structure
- Segments: Immutable chunks of data
- Merging: Combines segments for efficiency
- Commits: Make changes visible to searchers
Memory Usage
- In-memory indexes: Fast but limited by RAM
- Disk indexes: Scalable but slower access
- Caching: Frequently accessed data in memory
Performance Tips
- Use appropriate field types
- Minimize stored fields for large datasets
- Batch document operations
- Optimize commit frequency
Error Handling Patterns
TantivyEx follows Elixir conventions with {:ok, result} and {:error, reason} tuples:
case TantivyEx.Searcher.search(searcher, query, limit) do
{:ok, results} ->
process_results(results)
{:error, :invalid_query} ->
{:error, "Invalid search query format"}
{:error, reason} ->
Logger.error("Search failed: #{inspect(reason)}")
{:error, "Search temporarily unavailable"}
endGlossary
A-C
Analyzer : A component that processes text fields during indexing. It typically includes tokenization, stemming, and filtering. Analyzers determine how text is broken down and normalized for search.
Commit : The operation that makes indexed documents visible to searchers. Until a commit occurs, newly added documents won't appear in search results. Commits also persist changes to disk for persistent indexes.
Collector : A component that gathers search results during query execution. Different collectors can sort results, apply scoring, or collect specific types of data (e.g., facet counts, top documents).
D-F
Document : A single record in your search index, represented as a collection of fields. Documents must conform to the schema and are the basic unit of indexing and retrieval.
Facet : A hierarchical field type used for categorical navigation and filtering. Facets allow users to drill down through categories (e.g., "Electronics/Computers/Laptops") and get counts for each category level.
# Example facet structure
"/Electronics/Computers/Laptops"
"/Electronics/Phones/Smartphones"
"/Books/Fiction/Science Fiction"Field : A named attribute of a document (e.g., "title", "content", "price"). Each field has a specific type and configuration that determines how it's indexed and stored.
Field Options : Configuration flags that control how fields are processed:
:stored- Field value is retrievable from search results:indexed- Field is searchable (default for text fields):fast- Field supports fast filtering, sorting, and aggregation
G-I
Index : The core data structure that stores documents in an optimized format for fast search operations. Can be in-memory (temporary) or persistent (disk-based).
Index Writer : The component responsible for adding, updating, and deleting documents in an index. Only one writer can be active per index at a time to ensure consistency.
Inverted Index : The underlying data structure that maps terms to the documents containing them. This enables fast text search by looking up terms directly rather than scanning all documents.
J-M
JSON Field : A special field type that can store and search structured JSON data. Allows dynamic schemas and complex nested data structures.
Merge Policy : Rules that determine when and how index segments are combined. Affects indexing performance and search speed by controlling the number and size of segments.
N-Q
Query : A search expression that specifies what documents to find. Can range from simple term queries to complex boolean expressions with filters and scoring modifications.
Query Parser : Component that converts human-readable query strings into internal query objects. Supports various syntaxes like boolean operators, field-specific searches, and range queries.
R-S
Schema : The blueprint that defines the structure of documents in an index. Specifies all possible fields, their types, and how they should be processed. Must be defined before creating an index.
Searcher : A read-only component that executes queries against an index. Multiple searchers can operate simultaneously, and they provide a consistent view of the index at a point in time.
Segment : An immutable chunk of the index containing a subset of documents. New documents are added to new segments, and segments are periodically merged for efficiency.
Snippet : A highlighted excerpt from a document that shows where query terms appear in context. Used to provide search result previews with relevant portions emphasized.
Stemming : The process of reducing words to their root form (e.g., "running" → "run"). Improves search recall by matching different word forms.
T-Z
Term : A single unit of text after tokenization and processing. For example, "machine learning" might become two terms: "machine" and "learn" (after stemming).
Term Dictionary : An internal data structure that efficiently stores all unique terms in the index along with their statistics and pointers to document lists.
Tokenizer : Component that breaks text into individual terms. Different tokenizers handle different languages and text types (e.g., splitting on whitespace, handling punctuation, processing URLs).
# Example tokenization
"Hello, world!" → ["hello", "world"]
"user@example.com" → ["user", "example", "com"] # with email tokenizerTF-IDF (Term Frequency-Inverse Document Frequency) : A scoring algorithm that ranks documents based on how frequently terms appear in the document versus how common they are across the entire index. More unique terms get higher scores.
Writer : See Index Writer.
Usage Examples by Term
Working with Facets
# Define facet field
{:ok, schema} = Schema.add_facet_field(schema, "category", :facet)
# Add document with facet
document = %{"category" => "/Electronics/Computers/Laptops"}
TantivyEx.IndexWriter.add_document(writer, document)
# Query with facet filter
{:ok, results} = TantivyEx.Searcher.search(searcher, "category:/Electronics/Computers", 10)Understanding Segments and Commits
# Documents are added to segments
TantivyEx.IndexWriter.add_document(writer, doc1) # Goes to segment 1
TantivyEx.IndexWriter.add_document(writer, doc2) # Goes to segment 1
# Commit makes documents searchable
TantivyEx.IndexWriter.commit(writer) # Segment 1 becomes visible
# More documents go to new segments
TantivyEx.IndexWriter.add_document(writer, doc3) # Goes to segment 2Field Options in Practice
# Different field configurations for different use cases
{:ok, schema} = Schema.new()
|> Schema.add_text_field("title", :text_stored) # Searchable + retrievable
|> Schema.add_text_field("content", :text) # Searchable only
|> Schema.add_u64_field("timestamp", :fast_stored) # Fast filtering + retrievable
|> Schema.add_text_field("author", :stored) # Retrievable only
|> Schema.add_facet_field("category", :facet) # Hierarchical navigationNext Steps
Now that you understand the core concepts, explore these guides:
- Schema Design Guide - Design effective schemas
- Indexing Guide - Master document operations
- Search Guide - Learn advanced querying
- Performance Tuning - Optimize for production