TantivyEx.Document (TantivyEx v0.4.1)

Comprehensive document operations for TantivyEx with schema-aware field mapping, validation, and batch processing capabilities.

This module fills the gaps in TantivyEx's document operations by providing:

Proper field-to-value mapping using schema information
Document validation against schema constraints
Support for all Tantivy field types in documents
Batch document operations for performance
Document updates and deletions (term-based deletion; updates delete the matching document and re-add it)
Enhanced JSON document handling with type conversion

Core Concepts

Schema-Aware Operations

All document operations use the schema to ensure proper field mapping and type validation. Fields are mapped to their correct Tantivy field types based on schema definitions.

Document Validation

Documents are validated against the schema before indexing to catch type mismatches and missing required fields early.

Batch Processing

Batch operations provide significant performance improvements for bulk indexing scenarios.

Field Type Support

Supports all Tantivy field types with proper type conversion:

Text: String values with optional tokenization
U64/I64/F64: Numeric values with range validation
Bool: Boolean true/false values
Date: DateTime values (Unix timestamps or ISO 8601 strings)
Facet: Hierarchical path strings (e.g., "/category/subcategory")
Bytes: Base64-encoded binary data
JSON: Complex JSON objects with schema-aware field extraction
IpAddr: IPv4 and IPv6 address strings
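
As a rough illustration of the checks implied by the list above, here is a minimal pure-Elixir sketch. The `FieldCheckSketch` module, its `check/2` function, and the type atoms are hypothetical names chosen for this example; they are not part of TantivyEx, and the library's actual validation may differ.

```elixir
defmodule FieldCheckSketch do
  # Hypothetical checks, not part of TantivyEx.
  # Each clause returns {:ok, value} or {:error, reason}.

  # Range validation for 64-bit integers.
  def check(:u64, n) when is_integer(n) and n in 0..18_446_744_073_709_551_615,
    do: {:ok, n}

  def check(:i64, n)
      when is_integer(n) and n in -9_223_372_036_854_775_808..9_223_372_036_854_775_807,
      do: {:ok, n}

  # Facets are hierarchical paths and must start with "/".
  def check(:facet, "/" <> _rest = path), do: {:ok, path}

  # Bytes arrive Base64-encoded; decode them to the raw binary.
  def check(:bytes, encoded) when is_binary(encoded) do
    case Base.decode64(encoded) do
      {:ok, raw} -> {:ok, raw}
      :error -> {:error, "invalid Base64"}
    end
  end

  # :inet.parse_address/1 accepts both IPv4 and IPv6 charlists.
  def check(:ip_addr, address) when is_binary(address) do
    case :inet.parse_address(String.to_charlist(address)) do
      {:ok, _tuple} -> {:ok, address}
      {:error, :einval} -> {:error, "invalid IP address"}
    end
  end

  # Anything that did not match a clause above is rejected.
  def check(type, value), do: {:error, "#{inspect(value)} is not a valid #{type}"}
end
```

With this sketch, `FieldCheckSketch.check(:facet, "/category/subcategory")` returns the path unchanged, while `FieldCheckSketch.check(:u64, -1)` falls through to the final clause and returns an error tuple.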

Usage Examples

# Basic document operations
{:ok, index} = TantivyEx.create_index_in_ram(schema)
{:ok, writer} = TantivyEx.writer(index)

# Single document with validation
doc = %{
  "title" => "Getting Started with TantivyEx",
  "content" => "This is a comprehensive guide...",
  "price" => 29.99,
  "published_at" => "2024-01-15T10:30:00Z",
  "category" => "/books/programming/elixir"
}

{:ok, validated_doc} = TantivyEx.Document.validate(doc, schema)
:ok = TantivyEx.Document.add(writer, validated_doc, schema)

# Batch operations
documents = [doc1, doc2, doc3]
{:ok, results} = TantivyEx.Document.add_batch(writer, documents, schema)

# Document updates (deletes by term, then re-adds the updated document)
{:ok, :updated} = TantivyEx.Document.update(writer, "id", "doc_123", updated_doc, schema)

Summary

Types

batch_result()

document()

validation_error()

Functions

add(writer, document, schema)

Adds a single document to the index with schema validation.

add_batch(writer, documents, schema, options \\ [])

Adds multiple documents to the index in a batch operation.

delete(writer, term_field, term_value, schema)

Deletes a document by term matching.

prepare_json(json_doc, schema, field_mapping \\ %{})

Prepares a JSON document for indexing by extracting and validating nested fields.

update(writer, term_field, term_value, updated_document, schema)

Updates a document by term-based deletion and re-addition.

validate(document, schema)

Validates a document against the provided schema.

validate_batch(documents, schema)

Validates a batch of documents against the schema.

Types

batch_result()

@type batch_result() :: {:ok, [any()]} | {:error, [{integer(), any()}]}

document()

@type document() :: map()

validation_error()

@type validation_error() :: {:error, String.t()}

Functions

add(writer, document, schema)

@spec add(TantivyEx.IndexWriter.t(), document(), TantivyEx.Schema.t()) ::
  :ok | {:error, String.t()}

Adds a single document to the index with schema validation.

Parameters

writer: IndexWriter reference
document: Document map to add
schema: Schema reference for validation and field mapping

Returns

:ok - Document successfully added
{:error, reason} - Addition failed with specific error

Examples

iex> doc = %{"title" => "Test Document", "content" => "Sample content"}
iex> :ok = TantivyEx.Document.add(writer, doc, schema)

add_batch(writer, documents, schema, options \\ [])

@spec add_batch(
  TantivyEx.IndexWriter.t(),
  [document()],
  TantivyEx.Schema.t(),
  keyword() | map()
) ::
  batch_result()

Adds multiple documents to the index in a batch operation.

Batch operations are significantly more efficient than individual additions for large document sets.

Parameters

writer: IndexWriter reference
documents: List of document maps
schema: Schema reference for validation and field mapping
options: Batch processing options

Options

:batch_size - Number of documents to process in each batch (default: 1000)
:validate - Whether to validate documents (default: true)
:continue_on_error - Whether to continue processing if a document fails (default: false)

Returns

{:ok, results} - List of results for each document
{:error, [{index, error}, ...]} - Errors with document indices

Examples

iex> docs = [%{"title" => "Doc 1"}, %{"title" => "Doc 2"}]
iex> {:ok, results} = TantivyEx.Document.add_batch(writer, docs, schema)
iex> length(results)
2

iex> # With options
iex> {:ok, results} = TantivyEx.Document.add_batch(writer, docs, schema,
...>   batch_size: 500, continue_on_error: true)
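
To make the interaction between `:batch_size` and `:continue_on_error` more concrete, the following sketch models one plausible way such a loop could behave. It is not TantivyEx's implementation: `BatchSketch`, `run/3`, and the `add_fun` callback (a stand-in for a per-document add) are hypothetical, and the choice to stop at the end of the chunk that contains the first failure is an assumption of this sketch.

```elixir
defmodule BatchSketch do
  # Hypothetical illustration, not TantivyEx's implementation.
  # `add_fun` stands in for a per-document add and must return
  # :ok or {:error, reason}.
  def run(documents, add_fun, opts \\ []) do
    batch_size = Keyword.get(opts, :batch_size, 1000)
    continue? = Keyword.get(opts, :continue_on_error, false)

    {results, errors} =
      documents
      # Tag each document with its zero-based position in the input list.
      |> Enum.with_index()
      |> Enum.chunk_every(batch_size)
      |> Enum.reduce_while({[], []}, fn chunk, acc ->
        {_results, errors} = acc = Enum.reduce(chunk, acc, &add_one(&1, &2, add_fun))

        # Without :continue_on_error, stop after the first chunk with a failure.
        if errors != [] and not continue?, do: {:halt, acc}, else: {:cont, acc}
      end)

    case errors do
      [] -> {:ok, Enum.reverse(results)}
      _ -> {:error, Enum.reverse(errors)}
    end
  end

  defp add_one({doc, index}, {results, errors}, add_fun) do
    case add_fun.(doc) do
      :ok -> {[:ok | results], errors}
      {:error, reason} -> {results, [{index, reason} | errors]}
    end
  end
end
```

The return value mirrors the shape of `batch_result()`: either `{:ok, results}` or `{:error, [{index, reason}, ...]}` with indices pointing back into the input list.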

delete(writer, term_field, term_value, schema)

@spec delete(TantivyEx.IndexWriter.t(), String.t(), String.t(), TantivyEx.Schema.t()) ::
  {:ok, :deleted} | {:error, String.t()}

Deletes a document by term matching.

Uses Tantivy's term-based deletion to remove every document that matches the specified field and value combination. To delete a single document, match on a field whose values are unique, such as an ID.

Parameters

writer: IndexWriter reference
term_field: Field name to use for identifying the document (e.g., "id")
term_value: Value to match for document identification
schema: Schema reference

Returns

{:ok, :deleted} - Document successfully deleted
{:error, reason} - Deletion failed

Examples

iex> {:ok, :deleted} = TantivyEx.Document.delete(writer, "id", "doc_123", schema)

prepare_json(json_doc, schema, field_mapping \\ %{})

@spec prepare_json(map() | String.t(), TantivyEx.Schema.t(), map()) ::
  {:ok, document()} | {:error, String.t()}

Prepares a JSON document for indexing by extracting and validating nested fields.

Parameters

json_doc: JSON document as a map or JSON string
schema: Schema reference for field extraction
field_mapping: Optional mapping of JSON paths to schema fields

Returns

{:ok, prepared_document} - Document ready for indexing
{:error, reason} - JSON processing failed

Examples

iex> json_doc = %{"metadata" => %{"title" => "Test", "tags" => ["elixir", "search"]}}
iex> mapping = %{"metadata.title" => "title", "metadata.tags" => "tags"}
iex> {:ok, doc} = TantivyEx.Document.prepare_json(json_doc, schema, mapping)
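
The dotted paths in the mapping above can be read as a walk through nested maps. The sketch below shows that idea in plain Elixir; `JsonPathSketch` and `extract/2` are hypothetical names, the code assumes every intermediate value along a path is a map, and it says nothing about how `prepare_json/3` is actually implemented.

```elixir
defmodule JsonPathSketch do
  # Hypothetical illustration of applying a path-to-field mapping;
  # not TantivyEx code. Assumes intermediate values are maps.
  def extract(json_doc, mapping) when is_map(json_doc) and is_map(mapping) do
    Enum.reduce(mapping, %{}, fn {path, field}, acc ->
      # "metadata.title" becomes ["metadata", "title"], which get_in/2
      # follows through the nested maps.
      case get_in(json_doc, String.split(path, ".")) do
        # Paths that resolve to nothing are skipped.
        nil -> acc
        value -> Map.put(acc, field, value)
      end
    end)
  end
end
```

Applied to the `json_doc` and `mapping` from the example, this yields a flat map with `"title"` and `"tags"` keys, which is the shape a schema-level validator would expect.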

update(writer, term_field, term_value, updated_document, schema)

@spec update(
  TantivyEx.IndexWriter.t(),
  String.t(),
  any(),
  map(),
  TantivyEx.Schema.t()
) ::
  {:ok, :updated} | {:error, String.t()}

Updates a document by term-based deletion and re-addition.

This implementation deletes the documents matching the term, then adds the updated document. When only a few documents change, this is cheaper than rebuilding the whole index.

Parameters

writer: IndexWriter reference
term_field: Field name to use for identifying the document (e.g., "id")
term_value: Value to match for document identification
updated_document: Complete updated document map
schema: Schema reference

Returns

{:ok, :updated} - Document successfully updated
{:error, reason} - Update failed

Examples

iex> updated_doc = %{"id" => "doc_123", "title" => "Updated Title", "price" => 39.99}
iex> {:ok, :updated} = TantivyEx.Document.update(writer, "id", "doc_123", updated_doc, schema)
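
Because an update is a deletion followed by an addition, the new document replaces the old one wholesale. The in-memory model below illustrates that consequence; `UpdateSketch` is a hypothetical module in which a plain list of maps stands in for the index, so it only mirrors the semantics described here, not the library's code.

```elixir
defmodule UpdateSketch do
  # Hypothetical in-memory model of delete-then-add;
  # a list of maps stands in for the index.
  def update(docs, term_field, term_value, updated_document) do
    # Step 1: drop every document whose term_field equals term_value.
    remaining = Enum.reject(docs, &(Map.get(&1, term_field) == term_value))

    # Step 2: append the replacement as-is; fields are not merged.
    remaining ++ [updated_document]
  end
end
```

In this model, a field that exists on the old document but is absent from `updated_document` is lost, which is why the parameter is described as a complete document. A caller who wants to change only some fields could merge them into the previous version with `Map.merge/2` before updating.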

validate(document, schema)

@spec validate(document(), TantivyEx.Schema.t()) ::
  {:ok, document()} | validation_error()

Validates a document against the provided schema.

Ensures all field types match schema expectations and converts values to appropriate types where possible.

Parameters

document: Map containing field names and values
schema: Schema reference to validate against

Returns

{:ok, validated_document} - Document with type-converted values
{:error, reason} - Validation error with specific details

Examples

iex> doc = %{"title" => "Test", "price" => "29.99", "published_at" => "2024-01-15T10:30:00Z"}
iex> {:ok, validated} = TantivyEx.Document.validate(doc, schema)
iex> validated["price"]
29.99
iex> is_integer(validated["published_at"])
true
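
The conversions in the example above (a numeric string becoming a float, an ISO 8601 string becoming an integer) could be produced by a loop like the one sketched here. Everything in it is an assumption made for illustration: `ValidateSketch` is hypothetical, a plain map of field names to type atoms replaces the real schema, dates are converted to Unix seconds, and unknown fields are treated as errors.

```elixir
defmodule ValidateSketch do
  # Hypothetical stand-in for schema-driven validation, not TantivyEx code.
  # `field_types` is a plain map of field name => type atom.
  def validate(document, field_types) do
    Enum.reduce_while(document, {:ok, %{}}, fn {field, value}, {:ok, acc} ->
      with {:ok, type} <- Map.fetch(field_types, field),
           {:ok, converted} <- convert(type, value) do
        {:cont, {:ok, Map.put(acc, field, converted)}}
      else
        # Map.fetch/2 returns a bare :error for fields missing from the schema.
        :error -> {:halt, {:error, "unknown field #{field}"}}
        {:error, reason} -> {:halt, {:error, "field #{field}: #{reason}"}}
      end
    end)
  end

  defp convert(:text, value) when is_binary(value), do: {:ok, value}
  defp convert(:f64, value) when is_float(value), do: {:ok, value}

  defp convert(:f64, value) when is_binary(value) do
    case Float.parse(value) do
      {float, ""} -> {:ok, float}
      _ -> {:error, "expected a float, got #{inspect(value)}"}
    end
  end

  # Dates: keep integer timestamps, convert ISO 8601 strings to Unix seconds.
  defp convert(:date, value) when is_integer(value), do: {:ok, value}

  defp convert(:date, value) when is_binary(value) do
    case DateTime.from_iso8601(value) do
      {:ok, datetime, _offset} -> {:ok, DateTime.to_unix(datetime)}
      {:error, _} -> {:error, "expected an ISO 8601 timestamp"}
    end
  end

  defp convert(type, value), do: {:error, "cannot use #{inspect(value)} as #{type}"}
end
```

Validation stops at the first failing field and returns a single `{:error, message}` tuple, matching the shape of `validation_error()`.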

validate_batch(documents, schema)

@spec validate_batch([document()], TantivyEx.Schema.t()) ::
  {:ok, [document()]} | {:error, [{integer(), String.t()}]}

Validates a batch of documents against the schema.

Parameters

documents: List of document maps
schema: Schema reference to validate against

Returns

{:ok, validated_documents} - All documents successfully validated
{:error, [{index, error}, ...]} - List of validation errors with document indices

Examples

iex> docs = [%{"title" => "Doc 1"}, %{"title" => "Doc 2"}]
iex> {:ok, validated} = TantivyEx.Document.validate_batch(docs, schema)
iex> length(validated)
2
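
On the caller's side, the indexed error list makes it straightforward to report which inputs failed. The helper below is a minimal sketch of that: `BatchReportSketch` and `describe/1` are hypothetical names, and it assumes each index refers to the document's position in the list passed to `validate_batch/2`.

```elixir
defmodule BatchReportSketch do
  # Hypothetical consumer-side helper, not part of TantivyEx.
  def describe({:ok, documents}), do: "all #{length(documents)} documents are valid"

  def describe({:error, errors}) do
    # Each entry pairs a document's position in the input list with its error.
    Enum.map_join(errors, "\n", fn {index, reason} ->
      "document #{index}: #{reason}"
    end)
  end
end
```

A call such as `docs |> TantivyEx.Document.validate_batch(schema) |> BatchReportSketch.describe()` would then produce one line per failing document.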