Behaviour module for Kreuzberg document extraction validators.
This module defines the callback interface for implementing custom validators in the Kreuzberg plugin system. Validators are responsible for validating extraction results and ensuring data quality and consistency.
Validators are executed in a pipeline with configurable priorities, allowing fine-grained control over validation order and result handling. Each validator can decide whether it should validate a given result based on custom logic.
Validator Lifecycle
The validator lifecycle consists of four main phases:
Initialization - Called once when the validator is registered
- Use this to set up resources, connect to services, etc.
- Must return :ok or {:error, reason}
Conditional Validation - Before validating, check if validation should run
- Use should_validate?/1 to conditionally apply validation logic
- Useful for document-type-specific validators
Validation - Perform the actual validation
- Check result structure and content
- Return :ok or {:error, reason} with a descriptive message
Shutdown - Called when the validator is unregistered
- Use this to clean up resources
- Must return :ok
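The registration and shutdown phases can be illustrated with a small host-side sketch. It is not Kreuzberg's actual registry: LifecycleSketch, its register/2 and unregister/2 functions, and the two stand-in validators are hypothetical, and the registry is just a map keyed by validator name. It only shows that a validator is added when initialize/0 returns :ok and that shutdown/0 runs on removal.

```elixir
defmodule LifecycleSketch do
  # Hypothetical stand-in validators; a real one would also declare
  # @behaviour Kreuzberg.Plugin.Validator and the remaining callbacks.
  defmodule Healthy do
    def name, do: "healthy"
    def initialize, do: :ok
    def shutdown, do: :ok
  end

  defmodule Broken do
    def name, do: "broken"
    def initialize, do: {:error, "Connection failed"}
    def shutdown, do: :ok
  end

  # Adds the validator to the registry only if initialize/0 succeeds.
  def register(registry, validator) do
    case validator.initialize() do
      :ok -> {:ok, Map.put(registry, validator.name(), validator)}
      {:error, reason} -> {:error, reason}
    end
  end

  # Removes the validator by name and calls shutdown/0 if it was registered.
  def unregister(registry, name) do
    case Map.pop(registry, name) do
      {nil, registry} ->
        registry

      {validator, registry} ->
        :ok = validator.shutdown()
        registry
    end
  end
end
```

With this sketch, registering LifecycleSketch.Broken returns {:error, "Connection failed"} and leaves the registry unchanged.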
Priority System
Validators are sorted by priority (descending) before execution. Higher priority values run first. This allows you to:
- Run fast validators first (fail-fast approach)
- Run validators with dependencies in order
- Control the order of detailed validation passes
Typical priority levels:
- 100+ - Critical validators (must pass)
- 50-99 - High priority validators
- 1-49 - Standard validators
- 0 or negative - Low priority (informational)
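As a rough illustration of the ordering rule, the sketch below sorts a list of validator modules by priority/0 in descending order. PrioritySketch and its three nested modules are hypothetical stand-ins, and the sort is an assumption about how such ordering could be done, not the plugin system's actual code.

```elixir
defmodule PrioritySketch do
  # Hypothetical validators; only name/0 and priority/0 matter here.
  defmodule Critical do
    def name, do: "critical"
    def priority, do: 150
  end

  defmodule Standard do
    def name, do: "standard"
    def priority, do: 25
  end

  defmodule Informational do
    def name, do: "informational"
    def priority, do: -10
  end

  # Sorts descending so that higher priority values come first.
  def ordered(validators) do
    Enum.sort_by(validators, & &1.priority(), :desc)
  end
end

[PrioritySketch.Standard, PrioritySketch.Informational, PrioritySketch.Critical]
|> PrioritySketch.ordered()
|> Enum.map(& &1.name())
# => ["critical", "standard", "informational"]
```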
Validation Results
All validation functions return one of:
- :ok - Validation passed
- {:error, reason} - Validation failed with human-readable reason
Error messages should be descriptive enough to help developers:
- Specify what was wrong
- Explain why it matters
- Provide hints for fixing the issue
Example Validators
See the examples below for common validator patterns.
Behaviour Callbacks
All modules implementing this behaviour must define:
- name/0 - Return a unique validator identifier
- validate/1 - Perform validation on extraction result
- should_validate?/1 - Decide if validation should run
- priority/0 - Return validation priority (integer)
- initialize/0 - Set up validator resources
- shutdown/0 - Clean up validator resources
- version/0 - Return validator version string
Examples
A minimal validator that checks for empty content:
defmodule MyApp.Validators.NonEmptyValidator do
  @behaviour Kreuzberg.Plugin.Validator

  def name, do: "non_empty_content_validator"

  def validate(result) do
    if String.length(result["content"] || "") > 0 do
      :ok
    else
      {:error, "Extraction result contains empty content"}
    end
  end

  def should_validate?(result) do
    is_map(result) and Map.has_key?(result, "content")
  end

  def priority, do: 100

  def initialize do
    :ok
  end

  def shutdown, do: :ok
  def version, do: "1.0.0"
end
A more complex validator that validates PDF metadata:
defmodule MyApp.Validators.PDFMetadataValidator do
  @behaviour Kreuzberg.Plugin.Validator

  def name, do: "pdf_metadata_validator"

  def validate(result) do
    # Each helper is expected to return {:ok, value} or {:error, reason};
    # the first non-matching value falls through `with` and is returned as-is
    with {:ok, _mime} <- validate_mime_type(result),
         {:ok, _metadata} <- validate_metadata_exists(result),
         {:ok, _fields} <- validate_required_fields(result) do
      :ok
    end
  end

  def should_validate?(result) do
    mime_type = result["mime_type"]
    String.starts_with?(mime_type || "", "application/pdf")
  end

  def priority, do: 75

  def initialize do
    # Could initialize PDF validation library here
    :ok
  end

  def shutdown, do: :ok
  def version, do: "2.1.0"

  # Private helpers would go here
  # (implementation details omitted for brevity)
end
A stateful validator that tracks statistics:
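defmodule MyApp.Validators.StatisticsValidator do
  @behaviour Kreuzberg.Plugin.Validator
  use GenServer

  @impl true
  def name, do: "statistics_validator"

  @impl true
  def validate(result) do
    try do
      # Update statistics; GenServer.call/2 exits if the server is down or times out
      :ok = GenServer.call(__MODULE__, {:track, result})
      :ok
    catch
      :exit, _ ->
        {:error, "Failed to record statistics"}
    end
  end

  @impl true
  def should_validate?(_result), do: true

  @impl true
  def priority, do: 10

  @impl true
  def initialize do
    # Report start failures instead of ignoring the start_link result
    case GenServer.start_link(__MODULE__, %{}, name: __MODULE__) do
      {:ok, _pid} -> :ok
      {:error, {:already_started, _pid}} -> :ok
      {:error, reason} -> {:error, inspect(reason)}
    end
  end

  @impl true
  def shutdown do
    # GenServer.stop/1 exits if the process is not running, so check first
    if Process.whereis(__MODULE__), do: GenServer.stop(__MODULE__)
    :ok
  end

  @impl true
  def version, do: "1.5.0"

  # GenServer callbacks
  @impl true
  def init(state) do
    {:ok, state}
  end

  @impl true
  def handle_call({:track, result}, _from, state) do
    new_state = update_stats(state, result)
    {:reply, :ok, new_state}
  end

  defp update_stats(state, _result) do
    # Track metrics based on result
    state
  end
end
Validator Registration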
Validators are registered with the Kreuzberg plugin system:
Kreuzberg.Plugin.register_validator(MyApp.Validators.NonEmptyValidator)
And can be unregistered when no longer needed:
Kreuzberg.Plugin.unregister_validator("non_empty_content_validator")
Validation Pipeline
During result processing, the system:
- Collects all registered validators
- Sorts by priority (highest first)
- For each validator:
  a. Calls should_validate?/1 to check applicability
  b. If true, calls validate/1
  c. Continues on :ok, may stop on error based on policy
- Returns combined validation result
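The steps above can be sketched as a small function. This is a minimal illustration that assumes a stop-on-first-error policy; PipelineSketch, run/2, and the two stand-in validators are hypothetical names, not part of Kreuzberg.

```elixir
defmodule PipelineSketch do
  # Hypothetical validators with only the callbacks the pipeline needs.
  defmodule NonEmpty do
    def priority, do: 100
    def should_validate?(result), do: is_map(result)

    def validate(%{"content" => content}) when is_binary(content) and content != "", do: :ok
    def validate(_result), do: {:error, "Content is empty"}
  end

  defmodule PdfOnly do
    def priority, do: 50
    def should_validate?(result), do: result["mime_type"] == "application/pdf"

    def validate(result) do
      if is_map(result["metadata"]), do: :ok, else: {:error, "PDF result has no metadata map"}
    end
  end

  # Runs validators highest priority first and halts at the first failure.
  def run(validators, result) do
    validators
    |> Enum.sort_by(& &1.priority(), :desc)
    |> Enum.reduce_while(:ok, fn validator, :ok ->
      if validator.should_validate?(result) do
        case validator.validate(result) do
          :ok -> {:cont, :ok}
          {:error, reason} -> {:halt, {:error, reason}}
        end
      else
        # Not applicable: skip without calling validate/1.
        {:cont, :ok}
      end
    end)
  end
end
```

For a plain-text result, PdfOnly is skipped entirely, so only NonEmpty decides the outcome.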
Error Handling
When validation fails:
- Single validation error: {:error, "reason"}
- Multiple validation errors: {:error, [{"validator_name", "reason"}, ...]}
- System errors: {:error, "Validator xyz: system error"}
Validators should avoid raising exceptions and instead return error tuples.
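One way a host could produce these three shapes is sketched below. It is an assumption about how the combination might work, not the plugin system's implementation: ErrorCollectionSketch, safe_validate/2, collect/2, and the stand-in validators are all hypothetical. A raised exception is converted into a "Validator name: message" error so one misbehaving validator does not crash the run.

```elixir
defmodule ErrorCollectionSketch do
  # Hypothetical validators covering the pass, fail, and raise cases.
  defmodule Passing do
    def name, do: "passing"
    def validate(_result), do: :ok
  end

  defmodule Failing do
    def name, do: "failing"
    def validate(_result), do: {:error, "Content is empty"}
  end

  defmodule Raising do
    def name, do: "raising"
    def validate(_result), do: raise("boom")
  end

  # Calls validate/1 and turns a raised exception into an error tuple.
  def safe_validate(validator, result) do
    validator.validate(result)
  rescue
    exception ->
      {:error, "Validator #{validator.name()}: #{Exception.message(exception)}"}
  end

  # Runs every validator and combines the failures.
  def collect(validators, result) do
    errors =
      for validator <- validators,
          {:error, reason} <- [safe_validate(validator, result)] do
        {validator.name(), reason}
      end

    case errors do
      [] -> :ok
      [{_name, reason}] -> {:error, reason}
      many -> {:error, many}
    end
  end
end
```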
Summary
Callbacks
initialize/0 - Initializes the validator.
name/0 - Returns the unique identifier/name for this validator.
priority/0 - Returns the priority for this validator.
should_validate?/1 - Determines whether this validator should validate the given result.
shutdown/0 - Shuts down the validator.
validate/1 - Validates an extraction result.
version/0 - Returns the version of this validator.
Types
Callbacks
@callback initialize() :: :ok | {:error, String.t()}
Initializes the validator.
This callback is called once when the validator is registered with the plugin system. Use it to:
- Set up resources (connections, file handles, etc.)
- Initialize state
- Validate configuration
- Perform one-time setup
If initialization fails, the validator will not be registered and an error will be returned to the caller.
Returns
- :ok - Initialization successful
- {:error, reason} - Initialization failed with a reason
Examples
# Minimal validator with no setup
iex> MyValidator.initialize()
:ok
# Validator that needs to connect to a service
iex> ServiceValidator.initialize()
# Attempts to connect, returns :ok or {:error, "Connection failed"}
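As an illustration of validating configuration during initialization, the sketch below checks one application environment value and reports a descriptive error if it is missing or malformed. The application name :my_app and the key :max_content_length are hypothetical; a real validator would also declare @behaviour Kreuzberg.Plugin.Validator and the remaining callbacks.

```elixir
defmodule ConfiguredValidatorSketch do
  # Hypothetical application and configuration key.
  @app :my_app
  @key :max_content_length

  # Succeeds only when the configured limit is a positive integer.
  def initialize do
    case Application.get_env(@app, @key) do
      limit when is_integer(limit) and limit > 0 ->
        :ok

      nil ->
        {:error, "Missing #{inspect(@key)} configuration for #{inspect(@app)}"}

      other ->
        {:error, "Invalid #{inspect(@key)}: #{inspect(other)} (expected a positive integer)"}
    end
  end
end
```

Because the error is returned rather than raised, the caller registering the validator receives the reason directly.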
@callback name() :: String.t()
Returns the unique identifier/name for this validator.
The name should be a descriptive string that uniquely identifies this validator within the plugin system. It will be used for logging, registration, and error messages.
Returns
A string identifier (e.g., "pdf_content_validator", "table_format_validator").
Examples
iex> MyValidator.name()
"content_length_validator"
@callback priority() :: integer()
Returns the priority for this validator.
Higher priority validators run first in the validation pipeline. Priority is used to control the order of validator execution, allowing:
- Critical validators to fail fast
- Validators with dependencies to run in order
- Expensive validators to run last
Typical values:
- 100-200: Critical system validators
- 50-99: High priority domain validators
- 1-49: Standard validators
- 0 or negative: Low priority, informational validators
Returns
An integer representing the priority (typically 0-200, but any integer is valid).
Examples
iex> MyValidator.priority()
50
iex> CriticalValidator.priority()
150
iex> InformationalValidator.priority()
-10
@callback should_validate?(result :: map()) :: boolean()
Determines whether this validator should validate the given result.
This callback allows validators to conditionally apply validation logic based on the result content. For example, a PDF-specific validator might only validate results with mime_type "application/pdf".
Returning false from this callback causes the validator to be skipped for
that result without calling validate/1.
Parameters
- result - A map containing the extraction result to check
Returns
- true - This validator should validate the result
- false - This validator should be skipped for this result
Examples
# Validate all results
iex> MyValidator.should_validate?(%{"content" => "text"})
true
# Only validate PDFs
iex> PDFValidator.should_validate?(%{"mime_type" => "application/pdf"})
true
iex> PDFValidator.should_validate?(%{"mime_type" => "text/plain"})
false
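A PDF-only check like the one in these examples could be written with pattern matching in the function head, which also returns false for results that lack a "mime_type" key. PdfGateSketch is a hypothetical module name used only for this sketch.

```elixir
defmodule PdfGateSketch do
  # Matches any MIME type string that starts with "application/pdf",
  # including values with trailing parameters.
  def should_validate?(%{"mime_type" => "application/pdf" <> _rest}), do: true

  # Everything else, including maps without "mime_type", is skipped.
  def should_validate?(_result), do: false
end
```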
@callback shutdown() :: :ok
Shuts down the validator.
This callback is called when the validator is unregistered from the plugin system. Use it to:
- Close resources (connections, files, etc.)
- Clean up state
- Stop processes
The shutdown callback should always return :ok and not raise exceptions.
Returns
- :ok - Always returns :ok to ensure cleanup completes
Examples
# Validator with no resources
iex> MyValidator.shutdown()
:ok
@callback validate(result :: map()) :: validation_result()
Validates an extraction result.
This is the main validation function called by the plugin system. It should
perform all necessary validation checks on the result and return either :ok
or an error tuple with a descriptive message.
The validator should not raise exceptions; use error tuples instead for consistent error handling in the plugin system.
Parameters
- result - A map containing the extraction result with keys like:
  - "content" - Extracted text content
  - "mime_type" - Document MIME type
  - "metadata" - Document metadata
  - "tables" - Extracted tables
  - And other extraction result fields
Returns
- :ok - If validation passes
- {:error, reason} - If validation fails, with a human-readable reason
Error Messages
Error messages should be specific and helpful:
- "Content is empty" - Good
- "Validation failed" - Poor
Examples
iex> MyValidator.validate(%{"content" => "Hello", "mime_type" => "text/plain"})
:ok
iex> MyValidator.validate(%{"content" => "", "mime_type" => "text/plain"})
{:error, "Content cannot be empty"}
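To make the guidance on error messages concrete, here is a sketch of a validate/1 whose messages say what was wrong and hint at a fix. ContentLengthSketch and its 10-character threshold are hypothetical choices for illustration only.

```elixir
defmodule ContentLengthSketch do
  # Hypothetical minimum; a real validator would pick its own limit.
  @min_length 10

  def validate(%{"content" => content}) when is_binary(content) do
    length = String.length(content)

    if length >= @min_length do
      :ok
    else
      {:error,
       "Content has #{length} characters but at least #{@min_length} are required; " <>
         "check that text extraction ran for this document"}
    end
  end

  # Any result without a string under "content" is reported, not raised.
  def validate(_result) do
    {:error, "Result has no \"content\" string; expected a map with a \"content\" key"}
  end
end
```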
@callback version() :: String.t()
Returns the version of this validator.
This should be a version string that identifies the specific implementation of this validator. It's useful for:
- Debugging (knowing which version of a validator is running)
- Logging and metrics
- Compatibility checking
Version format should follow semantic versioning (e.g., "1.2.3").
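A compatibility check along these lines could use Elixir's standard Version module, which parses semantic version strings. The sketch below is an assumption about how a host might use version/0; VersionCheckSketch, compatible?/2, and the stand-in validators are hypothetical.

```elixir
defmodule VersionCheckSketch do
  # Hypothetical validators exposing only version/0.
  defmodule Enhanced do
    def version, do: "2.1.0"
  end

  defmodule Legacy do
    def version, do: "1.0.0"
  end

  defmodule Malformed do
    # Not valid semantic versioning: the patch component is missing.
    def version, do: "1.0"
  end

  # Returns true when the validator's version parses and meets the requirement.
  def compatible?(validator, requirement) do
    case Version.parse(validator.version()) do
      {:ok, version} -> Version.match?(version, requirement)
      :error -> false
    end
  end
end

VersionCheckSketch.compatible?(VersionCheckSketch.Enhanced, "~> 2.0")
# => true
```

A version string that does not parse is treated as incompatible here rather than raising, in keeping with the error-tuple style used elsewhere in this module.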
Returns
A version string (e.g., "1.0.0", "2.1.5-beta").
Examples
iex> MyValidator.version()
"1.0.0"
iex> EnhancedValidator.version()
"2.1.0"