TantivyEx.Document (TantivyEx v0.4.1)

Comprehensive document operations for TantivyEx with schema-aware field mapping, validation, and batch processing capabilities.

This module addresses the 70% gap in document operations by providing:

  • Proper field-to-value mapping using schema information
  • Document validation against schema constraints
  • Support for all Tantivy field types in documents
  • Batch document operations for performance
  • Document updates and deletions (via term-based deletion and re-addition)
  • Enhanced JSON document handling with type conversion

Core Concepts

Schema-Aware Operations

All document operations use the schema to ensure proper field mapping and type validation. Fields are mapped to their correct Tantivy field types based on schema definitions.

Document Validation

Documents are validated against the schema before indexing to catch type mismatches and missing required fields early.

Batch Processing

Batch operations provide significant performance improvements for bulk indexing scenarios.

Field Type Support

Supports all Tantivy field types with proper type conversion:

  • Text: String values with optional tokenization
  • U64/I64/F64: Numeric values with range validation
  • Bool: Boolean true/false values
  • Date: DateTime values (Unix timestamps or ISO 8601 strings)
  • Facet: Hierarchical path strings (e.g., "/category/subcategory")
  • Bytes: Base64-encoded binary data
  • JSON: Complex JSON objects with schema-aware field extraction
  • IpAddr: IPv4 and IPv6 address strings
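
The string-based encodings in the list above can be illustrated with a minimal sketch that uses only the Elixir standard library and Erlang's :inet module. The module FieldEncodingSketch and its functions are hypothetical and not part of TantivyEx; the facet check in particular is an assumption about what a well-formed path looks like, not the rule Tantivy itself applies.

```elixir
defmodule FieldEncodingSketch do
  # Hypothetical helpers for illustration only; not part of TantivyEx.

  # Bytes: encode a raw binary as a Base64 string.
  def encode_bytes(bin) when is_binary(bin), do: Base.encode64(bin)

  # Facet: accept only absolute paths with non-empty segments,
  # e.g. "/category/subcategory". This rule is an assumption.
  def facet?(path) when is_binary(path) do
    case String.split(path, "/") do
      ["" | segments] when segments != [] -> Enum.all?(segments, &(&1 != ""))
      _ -> false
    end
  end

  # IpAddr: true if the string parses as an IPv4 or IPv6 address.
  def ip?(str) when is_binary(str) do
    match?({:ok, _}, :inet.parse_address(String.to_charlist(str)))
  end
end
```

With these helpers, FieldEncodingSketch.encode_bytes("hello") returns "aGVsbG8=", FieldEncodingSketch.facet?("/books/programming/elixir") returns true, and FieldEncodingSketch.ip?("::1") returns true.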

Usage Examples

# Basic document operations
{:ok, index} = TantivyEx.create_index_in_ram(schema)
{:ok, writer} = TantivyEx.writer(index)

# Single document with validation
doc = %{
  "title" => "Getting Started with TantivyEx",
  "content" => "This is a comprehensive guide...",
  "price" => 29.99,
  "published_at" => "2024-01-15T10:30:00Z",
  "category" => "/books/programming/elixir"
}

{:ok, validated_doc} = TantivyEx.Document.validate(doc, schema)
:ok = TantivyEx.Document.add(writer, validated_doc, schema)

# Batch operations
documents = [doc1, doc2, doc3]
{:ok, results} = TantivyEx.Document.add_batch(writer, documents, schema)

# Document updates (term-based deletion followed by re-addition)
{:ok, :updated} = TantivyEx.Document.update(writer, "id", doc_id, updated_doc, schema)

Summary

Functions

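add(writer, document, schema)

Adds a single document to the index with schema validation.

add_batch(writer, documents, schema, options \\ [])

Adds multiple documents to the index in a batch operation.

delete(writer, term_field, term_value, schema)

Deletes a document by term matching.

prepare_json(json_doc, schema, field_mapping \\ %{})

Prepares a JSON document for indexing by extracting and validating nested fields.

update(writer, term_field, term_value, updated_document, schema)

Updates a document by term-based deletion and re-addition.

validate(document, schema)

Validates a document against the provided schema.

validate_batch(documents, schema)

Validates a batch of documents against the schema.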

Types

batch_result()

@type batch_result() :: {:ok, [any()]} | {:error, [{integer(), any()}]}

document()

@type document() :: map()

validation_error()

@type validation_error() :: {:error, String.t()}

Functions

add(writer, document, schema)

@spec add(TantivyEx.IndexWriter.t(), document(), TantivyEx.Schema.t()) ::
  :ok | {:error, String.t()}

Adds a single document to the index with schema validation.

Parameters

  • writer: IndexWriter reference
  • document: Document map to add
  • schema: Schema reference for validation and field mapping

Returns

  • :ok - Document successfully added
  • {:error, reason} - Addition failed with specific error

Examples

iex> doc = %{"title" => "Test Document", "content" => "Sample content"}
iex> :ok = TantivyEx.Document.add(writer, doc, schema)

add_batch(writer, documents, schema, options \\ [])

@spec add_batch(
  TantivyEx.IndexWriter.t(),
  [document()],
  TantivyEx.Schema.t(),
  keyword() | map()
) ::
  batch_result()

Adds multiple documents to the index in a batch operation.

Batch operations are significantly more efficient than individual additions for large document sets.

Parameters

  • writer: IndexWriter reference
  • documents: List of document maps
  • schema: Schema reference for validation and field mapping
  • options: Batch processing options

Options

  • :batch_size - Number of documents to process in each batch (default: 1000)
  • :validate - Whether to validate documents (default: true)
  • :continue_on_error - Whether to continue processing if a document fails (default: false)

Returns

  • {:ok, results} - List of results for each document
  • {:error, [{index, error}, ...]} - Errors with document indices

Examples

iex> docs = [%{"title" => "Doc 1"}, %{"title" => "Doc 2"}]
iex> {:ok, results} = TantivyEx.Document.add_batch(writer, docs, schema)
iex> length(results)
2

iex> # With options
iex> {:ok, results} = TantivyEx.Document.add_batch(writer, docs, schema,
...>   batch_size: 500, continue_on_error: true)
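
How :batch_size and :continue_on_error might interact can be sketched in plain Elixir. This is a hypothetical model, not TantivyEx's implementation: BatchSketch, its add_fun callback (a stand-in for adding one document), and the choice to return {:error, ...} whenever any document fails are all assumptions made for illustration.

```elixir
defmodule BatchSketch do
  # Hypothetical model of the batch options; not TantivyEx's implementation.
  # `add_fun` stands in for adding one document and must return
  # :ok or {:error, reason}.
  def add_batch(documents, add_fun, opts \\ []) do
    batch_size = Keyword.get(opts, :batch_size, 1000)
    continue? = Keyword.get(opts, :continue_on_error, false)

    {oks, errors} =
      documents
      |> Enum.with_index()
      |> Enum.chunk_every(batch_size)
      |> Enum.reduce_while({[], []}, fn chunk, {oks, errors} ->
        {chunk_oks, chunk_errors} = run_chunk(chunk, add_fun, continue?)
        acc = {oks ++ chunk_oks, errors ++ chunk_errors}
        # Stop before the next chunk if this one failed and we may not continue.
        if chunk_errors != [] and not continue?, do: {:halt, acc}, else: {:cont, acc}
      end)

    if errors == [], do: {:ok, oks}, else: {:error, errors}
  end

  # Processes one chunk of {document, index} pairs, keeping original indices.
  defp run_chunk(chunk, add_fun, continue?) do
    {oks, errors} =
      Enum.reduce_while(chunk, {[], []}, fn {doc, idx}, {oks, errors} ->
        case add_fun.(doc) do
          :ok -> {:cont, {[:ok | oks], errors}}
          {:error, reason} when continue? -> {:cont, {oks, [{idx, reason} | errors]}}
          {:error, reason} -> {:halt, {oks, [{idx, reason} | errors]}}
        end
      end)

    {Enum.reverse(oks), Enum.reverse(errors)}
  end
end
```

In this model the default stops at the first failing document and reports only that index, while continue_on_error: true processes every document and reports all failing indices.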

delete(writer, term_field, term_value, schema)

@spec delete(TantivyEx.IndexWriter.t(), String.t(), String.t(), TantivyEx.Schema.t()) ::
  {:ok, :deleted} | {:error, String.t()}

Deletes a document by term matching.

Uses Tantivy's term-based deletion, which removes every document that matches the specified field and value combination. To delete exactly one document, use a field whose values are unique, such as an ID.

Parameters

  • writer: IndexWriter reference
  • term_field: Field name to use for identifying the document (e.g., "id")
  • term_value: Value to match for document identification
  • schema: Schema reference

Returns

  • {:ok, :deleted} - Document successfully deleted
  • {:error, reason} - Deletion failed

Examples

iex> {:ok, :deleted} = TantivyEx.Document.delete(writer, "id", "doc_123", schema)

prepare_json(json_doc, schema, field_mapping \\ %{})

@spec prepare_json(map() | String.t(), TantivyEx.Schema.t(), map()) ::
  {:ok, document()} | {:error, String.t()}

Prepares a JSON document for indexing by extracting and validating nested fields.

Parameters

  • json_doc: JSON document as a map or JSON string
  • schema: Schema reference for field extraction
  • field_mapping: Optional mapping of JSON paths to schema fields

Returns

  • {:ok, prepared_document} - Document ready for indexing
  • {:error, reason} - JSON processing failed

Examples

iex> json_doc = %{"metadata" => %{"title" => "Test", "tags" => ["elixir", "search"]}}
iex> mapping = %{"metadata.title" => "title", "metadata.tags" => "tags"}
iex> {:ok, doc} = TantivyEx.Document.prepare_json(json_doc, schema, mapping)
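
The effect of a field_mapping with dotted JSON paths can be approximated in plain Elixir. The sketch below is a hypothetical model, not TantivyEx's implementation: JsonMappingSketch is an invented name, and both the dot-separated path syntax and the decision to skip missing paths silently are assumptions based on the example above.

```elixir
defmodule JsonMappingSketch do
  # Hypothetical model of path-to-field mapping; not TantivyEx's implementation.
  # `mapping` maps dot-separated JSON paths to flat schema field names.
  def apply_mapping(json_doc, mapping) when is_map(json_doc) and is_map(mapping) do
    Enum.reduce(mapping, %{}, fn {path, field}, acc ->
      # "metadata.title" becomes ["metadata", "title"], which get_in/2
      # follows through the nested maps.
      case get_in(json_doc, String.split(path, ".")) do
        nil -> acc
        value -> Map.put(acc, field, value)
      end
    end)
  end
end
```

Applied to the json_doc and mapping from the example above, this returns %{"title" => "Test", "tags" => ["elixir", "search"]}.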

update(writer, term_field, term_value, updated_document, schema)

@spec update(
  TantivyEx.IndexWriter.t(),
  String.t(),
  any(),
  map(),
  TantivyEx.Schema.t()
) ::
  {:ok, :updated} | {:error, String.t()}

Updates a document by term-based deletion and re-addition.

This implementation uses Tantivy's term-based document deletion followed by adding the updated document. This is more efficient than full index rebuilding for sparse updates.

Parameters

  • writer: IndexWriter reference
  • term_field: Field name to use for identifying the document (e.g., "id")
  • term_value: Value to match for document identification
  • updated_document: Complete updated document map
  • schema: Schema reference

Returns

  • {:ok, :updated} - Document successfully updated
  • {:error, reason} - Update failed

Examples

iex> updated_doc = %{"id" => "doc_123", "title" => "Updated Title", "price" => 39.99}
iex> {:ok, :updated} = TantivyEx.Document.update(writer, "id", "doc_123", updated_doc, schema)
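
Because the update is a deletion followed by an addition, fields left out of updated_document are not carried over from the old version. The sketch below models that behavior with a plain list standing in for the index. UpdateSketch is hypothetical and says nothing about how TantivyEx stores documents.

```elixir
defmodule UpdateSketch do
  # Hypothetical in-memory model: the "index" is just a list of document maps.
  def update(docs, term_field, term_value, updated_document) do
    # Step 1: drop every document whose term_field equals term_value.
    kept = Enum.reject(docs, fn doc -> Map.get(doc, term_field) == term_value end)
    # Step 2: add the replacement document as given, without merging.
    kept ++ [updated_document]
  end
end

docs = [
  %{"id" => "doc_123", "title" => "Old Title", "content" => "Body text"},
  %{"id" => "doc_456", "title" => "Other"}
]

updated = UpdateSketch.update(docs, "id", "doc_123", %{"id" => "doc_123", "title" => "Updated Title"})
```

After the call, the entry for "doc_123" in updated has no "content" key, which is why the updated_document parameter should be the complete document.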

validate(document, schema)

@spec validate(document(), TantivyEx.Schema.t()) ::
  {:ok, document()} | validation_error()

Validates a document against the provided schema.

Ensures all field types match schema expectations and converts values to appropriate types where possible.

Parameters

  • document: Map containing field names and values
  • schema: Schema reference to validate against

Returns

  • {:ok, validated_document} - Document with type-converted values
  • {:error, reason} - Validation error with specific details

Examples

iex> doc = %{"title" => "Test", "price" => "29.99", "published_at" => "2024-01-15T10:30:00Z"}
iex> {:ok, validated} = TantivyEx.Document.validate(doc, schema)
iex> validated["price"]
29.99
iex> is_integer(validated["published_at"])
true
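
The conversions in this example (a numeric string to a float, an ISO 8601 string to an integer timestamp) can be modeled in plain Elixir. The sketch below is hypothetical and not TantivyEx's implementation: ValidateSketch, the representation of a schema as a map from field names to type atoms, and the error message wording are all assumptions.

```elixir
defmodule ValidateSketch do
  # Hypothetical model of schema-aware validation; not TantivyEx's implementation.
  # The schema is assumed to be a map of field name => :text | :f64 | :date.
  def validate(document, schema) when is_map(document) and is_map(schema) do
    Enum.reduce_while(document, {:ok, %{}}, fn {field, value}, {:ok, acc} ->
      with {:ok, type} <- fetch_type(schema, field),
           {:ok, converted} <- convert(type, value) do
        {:cont, {:ok, Map.put(acc, field, converted)}}
      else
        # Stop at the first field that fails and report which one it was.
        {:error, reason} -> {:halt, {:error, "#{field}: #{reason}"}}
      end
    end)
  end

  defp fetch_type(schema, field) do
    case Map.fetch(schema, field) do
      {:ok, type} -> {:ok, type}
      :error -> {:error, "field not in schema"}
    end
  end

  defp convert(:text, v) when is_binary(v), do: {:ok, v}
  defp convert(:f64, v) when is_float(v), do: {:ok, v}
  defp convert(:f64, v) when is_integer(v), do: {:ok, v * 1.0}

  defp convert(:f64, v) when is_binary(v) do
    # Accept only strings that are entirely a number, e.g. "29.99".
    case Float.parse(v) do
      {f, ""} -> {:ok, f}
      _ -> {:error, "expected a number, got #{inspect(v)}"}
    end
  end

  defp convert(:date, v) when is_integer(v), do: {:ok, v}

  defp convert(:date, v) when is_binary(v) do
    # ISO 8601 string -> Unix timestamp in seconds.
    case DateTime.from_iso8601(v) do
      {:ok, dt, _offset} -> {:ok, DateTime.to_unix(dt)}
      {:error, _} -> {:error, "expected an ISO 8601 date, got #{inspect(v)}"}
    end
  end

  defp convert(type, v), do: {:error, "cannot convert #{inspect(v)} to #{type}"}
end
```

Given the schema %{"title" => :text, "price" => :f64, "published_at" => :date}, validating the document from the example above yields a float for "price" and an integer for "published_at", while a value such as "abc" for "price" produces an {:error, reason} tuple naming the field.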

validate_batch(documents, schema)

@spec validate_batch([document()], TantivyEx.Schema.t()) ::
  {:ok, [document()]} | {:error, [{integer(), String.t()}]}

Validates a batch of documents against the schema.

Parameters

  • documents: List of document maps
  • schema: Schema reference to validate against

Returns

  • {:ok, validated_documents} - All documents successfully validated
  • {:error, [{index, error}, ...]} - List of validation errors with document indices

Examples

iex> docs = [%{"title" => "Doc 1"}, %{"title" => "Doc 2"}]
iex> {:ok, validated} = TantivyEx.Document.validate_batch(docs, schema)
iex> length(validated)
2