Configuration structure for document extraction operations.
Provides options for controlling extraction behavior including caching, quality processing, OCR, chunking, language detection, and post-processing. This module defines the configuration schema and provides validation utilities to ensure configurations are valid before passing them to the Rust extraction engine.
Configuration Fields
Boolean Flags (Top-level)
- :use_cache - Enable result caching (default: true)
- :enable_quality_processing - Enable quality post-processing (default: true)
- :force_ocr - Force OCR even for searchable PDFs (default: false)
Output Format Flags
- :output_format - Content text format (default: "plain"); one of "plain", "markdown", "djot", "html"
- :result_format - Result structure format (default: "unified"); one of "unified", "element_based"
Nested Configuration Maps (Optional)
- :chunking - Text chunking configuration with options like chunk_size, overlap, etc.
- :ocr - OCR backend configuration with settings for language, PSM mode, etc.
  - Can include nested :paddle_ocr_config for PaddleOCR-specific settings
  - Can include nested :element_config for OCR element extraction settings
- :language_detection - Language detection settings for multi-language content
- :postprocessor - Post-processor configuration for cleaning/formatting extracted text
- :images - Image extraction configuration (quality, format, preprocessing options)
- :pages - Page-level extraction configuration (which pages to extract, etc.)
- :token_reduction - Token reduction settings for optimizing output size
- :keywords - Keyword extraction configuration
- :pdf_options - PDF-specific options (requires pdf feature to be enabled)
- :html_options - HTML to Markdown conversion options (quality, format, preprocessing options)
- :max_concurrent_extractions - Maximum concurrent extractions in batch operations (positive integer or nil)
Default Values
The boolean flags have the following defaults:
- use_cache: true - Caching is enabled by default for better performance
- enable_quality_processing: true - Quality processing is enabled by default for better extraction results
- force_ocr: false - OCR is only used when necessary (searchable PDFs bypass OCR)
Format defaults:
- output_format: "plain" - Raw extracted text (no formatting)
- result_format: "unified" - All content in unified content field
All nested configurations default to nil, allowing the Rust implementation to apply its own defaults.
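To illustrate what "nil means the engine decides" can look like, here is a minimal sketch in plain Elixir. The module name DefaultsSketch and the default values are hypothetical; the real defaults live in the Rust implementation and are not documented on this page.

```elixir
defmodule DefaultsSketch do
  # Hypothetical engine-side defaults for the :chunking map. These numbers are
  # made up for illustration and are not Kreuzberg's actual defaults.
  @chunking_defaults %{"chunk_size" => 1000, "overlap" => 200}

  # nil means "nothing provided": fall back to the defaults wholesale.
  def resolve_chunking(nil), do: @chunking_defaults

  # A provided map overrides only the keys it sets.
  def resolve_chunking(provided) when is_map(provided) do
    Map.merge(@chunking_defaults, provided)
  end
end

IO.inspect(DefaultsSketch.resolve_chunking(nil))
IO.inspect(DefaultsSketch.resolve_chunking(%{"chunk_size" => 512}))
```

Leaving a nested field as nil therefore keeps the configuration small and lets engine defaults evolve without changes on the Elixir side.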
Field Validation
The validate/1 function ensures:
- Boolean fields are actually booleans
- Format fields are valid enum values
- Nested configurations are maps or nil
- No invalid field names are used
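The checks listed above could be sketched roughly as follows. This is not the library's implementation: ValidateSketch is a hypothetical standalone module that works on plain maps, covers only a few nested fields, uses its own error messages, and accepts only the four documented output formats.

```elixir
defmodule ValidateSketch do
  @boolean_fields [:use_cache, :enable_quality_processing, :force_ocr]
  @output_formats ["plain", "markdown", "djot", "html"]
  @result_formats ["unified", "element_based"]
  # Only a subset of the nested fields, to keep the sketch short.
  @nested_fields [:chunking, :ocr, :language_detection, :postprocessor]

  # Runs each check in turn; `with` stops at the first {:error, _}.
  def validate(config) when is_map(config) do
    with :ok <- check_booleans(config),
         :ok <- check_enum(config, :output_format, @output_formats),
         :ok <- check_enum(config, :result_format, @result_formats),
         :ok <- check_nested(config) do
      {:ok, config}
    end
  end

  defp check_booleans(config) do
    case Enum.find(@boolean_fields, fn f -> not is_boolean(Map.get(config, f)) end) do
      nil -> :ok
      field -> {:error, "#{field} must be a boolean"}
    end
  end

  defp check_enum(config, field, allowed) do
    if Map.get(config, field) in allowed do
      :ok
    else
      {:error, "#{field} must be one of: #{Enum.join(allowed, ", ")}"}
    end
  end

  defp check_nested(config) do
    invalid? = fn f ->
      value = Map.get(config, f)
      not (is_nil(value) or is_map(value))
    end

    case Enum.find(@nested_fields, invalid?) do
      nil -> :ok
      field -> {:error, "#{field} must be a map or nil"}
    end
  end
end
```

Returning the first failure keeps the error value a single descriptive string, which matches the {:error, reason} shape shown in the examples that follow.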
Examples
# Create config with chunking enabled
iex> config = %Kreuzberg.ExtractionConfig{
...>   chunking: %{"enabled" => true, "chunk_size" => 1024},
...>   use_cache: true
...> }
iex> Kreuzberg.ExtractionConfig.validate(config)
{:ok, config}
# Create config with markdown output format
iex> config = %Kreuzberg.ExtractionConfig{
...>   output_format: "markdown",
...>   result_format: "unified"
...> }
iex> Kreuzberg.ExtractionConfig.validate(config)
{:ok, config}
# Create config that forces OCR with element-based result format
iex> config = %Kreuzberg.ExtractionConfig{
...>   force_ocr: true,
...>   result_format: "element_based",
...>   enable_quality_processing: true
...> }
iex> Kreuzberg.ExtractionConfig.validate(config)
{:ok, config}
# Validate invalid configuration (non-boolean field)
iex> config = %Kreuzberg.ExtractionConfig{use_cache: "yes"}
iex> Kreuzberg.ExtractionConfig.validate(config)
{:error, "Field 'use_cache' must be a boolean, got: string"}
# Validate invalid format
iex> config = %Kreuzberg.ExtractionConfig{output_format: "invalid"}
iex> Kreuzberg.ExtractionConfig.validate(config)
{:error, "Field 'output_format' must be one of: plain, text, markdown, md, djot, html, got: invalid"}
# Convert to map for NIF
iex> config = %Kreuzberg.ExtractionConfig{chunking: %{"size" => 512}}
iex> Kreuzberg.ExtractionConfig.to_map(config)
%{
  "chunking" => %{"size" => 512},
  "ocr" => nil,
  "language_detection" => nil,
  "postprocessor" => nil,
  "images" => nil,
  "pages" => nil,
  "token_reduction" => nil,
  "keywords" => nil,
  "pdf_options" => nil,
  "html_options" => nil,
  "max_concurrent_extractions" => nil,
  "include_document_structure" => false,
  "use_cache" => true,
  "enable_quality_processing" => true,
  "force_ocr" => false,
  "output_format" => "plain",
  "result_format" => "unified"
}
Summary
Functions
discover() - Discover and load an ExtractionConfig by searching directories.
from_file(file_path) - Load an ExtractionConfig from a file.
new() - Creates a new ExtractionConfig with all default values.
new(opts) - Creates a new ExtractionConfig from keyword list or map.
to_map(config) - Converts an ExtractionConfig struct to a map for NIF serialization.
validate(config) - Validates an ExtractionConfig for correct field types and values.
Types
@type nested_config() :: config_map() | nil
@type output_format() :: String.t()
@type result_format() :: String.t()
@type t() :: %Kreuzberg.ExtractionConfig{
  chunking: nested_config(),
  enable_quality_processing: boolean(),
  force_ocr: boolean(),
  html_options: config_map() | nil,
  images: nested_config(),
  include_document_structure: boolean(),
  keywords: nested_config(),
  language_detection: nested_config(),
  max_concurrent_extractions: non_neg_integer() | nil,
  ocr: nested_config(),
  output_format: output_format(),
  pages: nested_config(),
  pdf_options: nested_config(),
  postprocessor: nested_config(),
  result_format: result_format(),
  security_limits: nested_config(),
  token_reduction: nested_config(),
  use_cache: boolean()
}
Functions
Discover and load an ExtractionConfig by searching directories.
Searches the current working directory and all parent directories for a configuration file in the following order:
- kreuzberg.toml
- kreuzberg.yaml
- kreuzberg.yml
- kreuzberg.json
Returns the first configuration file found.
Returns
- {:ok, config} - Successfully discovered and loaded configuration
- {:error, :not_found} - No configuration file found in directory tree
- {:error, reason} - Error loading or parsing the configuration file
Examples
# When no config file exists
iex> Kreuzberg.ExtractionConfig.discover()
{:error, :not_found}
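The upward search described above might look like the following sketch. DiscoverSketch is a hypothetical module, not the library's code, and it only locates the file rather than loading it. It assumes the four file names are tried in order within each directory before moving to the parent, which is one reading of the search order stated above.

```elixir
defmodule DiscoverSketch do
  @candidates ["kreuzberg.toml", "kreuzberg.yaml", "kreuzberg.yml", "kreuzberg.json"]

  # Walks from `dir` up to the filesystem root and returns the first match.
  def find(dir \\ File.cwd!()) do
    dir = Path.expand(dir)

    case Enum.find(@candidates, fn name -> File.regular?(Path.join(dir, name)) end) do
      nil ->
        parent = Path.dirname(dir)
        # Path.dirname/1 of the root returns the root itself, which ends the walk.
        if parent == dir, do: {:error, :not_found}, else: find(parent)

      name ->
        {:ok, Path.join(dir, name)}
    end
  end
end

IO.inspect(DiscoverSketch.find())
```

A caller could pass the returned path on to a loader, keeping file discovery separate from parsing.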
Load an ExtractionConfig from a file.
Supports TOML, YAML, and JSON configuration file formats. The file format is automatically detected based on the file extension or file contents.
Parameters
- file_path - Path to the configuration file (String or Path.t())
Returns
- {:ok, config} - Successfully loaded configuration as a struct
- {:error, reason} - Failed to load or parse the configuration file
Supported Formats
- .toml - TOML format (e.g., kreuzberg.toml)
- .yaml, .yml - YAML format (e.g., kreuzberg.yaml)
- .json - JSON format (e.g., kreuzberg.json)
Examples
Loading from a TOML file:
Kreuzberg.ExtractionConfig.from_file("kreuzberg.toml")
# => {:ok, %Kreuzberg.ExtractionConfig{...}}
Loading from a YAML file:
Kreuzberg.ExtractionConfig.from_file("/etc/config/extraction.yaml")
# => {:ok, %Kreuzberg.ExtractionConfig{...}}
Handling missing files:
Kreuzberg.ExtractionConfig.from_file("/nonexistent/file.toml")
# => {:error, "File not found: ..."}
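As a rough illustration of extension-based format detection, consider the sketch below. FormatSketch, its return atoms, and its error message are hypothetical. It covers only the extension half of the detection described above and does not inspect file contents.

```elixir
defmodule FormatSketch do
  # Maps a path to a format atom using only its extension (case-insensitive).
  def detect(path) do
    case path |> Path.extname() |> String.downcase() do
      ".toml" -> {:ok, :toml}
      ext when ext in [".yaml", ".yml"] -> {:ok, :yaml}
      ".json" -> {:ok, :json}
      other -> {:error, "Unsupported extension: #{inspect(other)}"}
    end
  end
end

IO.inspect(FormatSketch.detect("kreuzberg.toml"))
IO.inspect(FormatSketch.detect("/etc/config/extraction.yaml"))
```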
@spec new() :: t()
Creates a new ExtractionConfig with all default values.
Examples
iex> config = Kreuzberg.ExtractionConfig.new()
iex> config.use_cache
true
Creates a new ExtractionConfig from keyword list or map.
Parameters
- opts - Keyword list or map (supports string keys from JSON)
Examples
iex> config = Kreuzberg.ExtractionConfig.new(use_cache: false)
iex> config.use_cache
false
iex> config = Kreuzberg.ExtractionConfig.new(%{"output_format" => "markdown"})
iex> config.output_format
"markdown"
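One way a constructor can accept both keyword lists and string-keyed maps is sketched below. NewSketch is a hypothetical two-field struct, not the real ExtractionConfig, and raising on unknown keys is this sketch's choice rather than documented library behavior.

```elixir
defmodule NewSketch do
  defstruct use_cache: true, output_format: "plain"

  # Accepts a keyword list or a map with atom or string keys.
  def new(opts) when is_list(opts) or is_map(opts) do
    fields =
      Map.new(opts, fn
        # String keys (e.g. from decoded JSON) are converted to existing atoms
        # only, so arbitrary input cannot create new atoms.
        {key, value} when is_binary(key) -> {String.to_existing_atom(key), value}
        {key, value} when is_atom(key) -> {key, value}
      end)

    # struct!/2 raises KeyError for keys that are not struct fields.
    struct!(__MODULE__, fields)
  end
end

IO.inspect(NewSketch.new(use_cache: false))
IO.inspect(NewSketch.new(%{"output_format" => "markdown"}))
```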
Converts an ExtractionConfig struct to a map for NIF serialization.
Returns a map containing all configuration fields, both boolean flags and nested configurations. Serializes all values including nil for complete representation.
Parameters
- config - An ExtractionConfig struct, a plain map, nil, or a keyword list
Returns
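A map with string keys representing the configuration options. For a struct, every field is included, nil values too, so the Rust side receives each value that was set and can apply its own defaults for the rest. As the examples below show, nil is returned as nil and a plain map is returned unchanged.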
Field Descriptions
- "chunking" - Text chunking configuration (map or nil)
- "ocr" - OCR backend configuration (map or nil)
- "language_detection" - Language detection settings (map or nil)
- "postprocessor" - Post-processor configuration (map or nil)
- "images" - Image extraction configuration (map or nil)
- "pages" - Page-level extraction configuration (map or nil)
- "token_reduction" - Token reduction settings (map or nil)
- "keywords" - Keyword extraction configuration (map or nil)
- "pdf_options" - PDF-specific options (map or nil)
- "max_concurrent_extractions" - Maximum concurrent extractions (positive integer or nil)
- "html_options" - HTML to Markdown conversion options (map or nil)
- "include_document_structure" - Include document structure in extraction (boolean)
- "use_cache" - Enable caching (boolean)
- "enable_quality_processing" - Enable quality processing (boolean)
- "force_ocr" - Force OCR usage (boolean)
- "output_format" - Content text format (string: "plain", "markdown", "djot", "html")
- "result_format" - Result structure format (string: "unified", "element_based")
Examples
iex> config = %Kreuzberg.ExtractionConfig{chunking: %{"size" => 512}, output_format: "markdown"}
iex> Kreuzberg.ExtractionConfig.to_map(config)
%{
  "chunking" => %{"size" => 512},
  "ocr" => nil,
  "language_detection" => nil,
  "postprocessor" => nil,
  "images" => nil,
  "pages" => nil,
  "token_reduction" => nil,
  "keywords" => nil,
  "pdf_options" => nil,
  "html_options" => nil,
  "max_concurrent_extractions" => nil,
  "include_document_structure" => false,
  "use_cache" => true,
  "enable_quality_processing" => true,
  "force_ocr" => false,
  "output_format" => "markdown",
  "result_format" => "unified"
}
iex> config = %Kreuzberg.ExtractionConfig{}
iex> Kreuzberg.ExtractionConfig.to_map(config)
%{
  "chunking" => nil,
  "ocr" => nil,
  "language_detection" => nil,
  "postprocessor" => nil,
  "images" => nil,
  "pages" => nil,
  "token_reduction" => nil,
  "keywords" => nil,
  "pdf_options" => nil,
  "html_options" => nil,
  "max_concurrent_extractions" => nil,
  "use_cache" => true,
  "enable_quality_processing" => true,
  "include_document_structure" => false,
  "force_ocr" => false,
  "output_format" => "plain",
  "result_format" => "unified"
}
iex> Kreuzberg.ExtractionConfig.to_map(nil)
nil
iex> Kreuzberg.ExtractionConfig.to_map(%{"use_cache" => false, "output_format" => "markdown"})
%{"use_cache" => false, "output_format" => "markdown"}
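The conversions shown in these examples can be approximated with a few function clauses. ToMapSketch is a hypothetical three-field stand-in for the real struct. The keyword-list clause is an assumption, since the examples above do not show what the library returns for that input.

```elixir
defmodule ToMapSketch do
  defstruct chunking: nil, use_cache: true, output_format: "plain"

  # nil stays nil, so "no configuration" passes straight through.
  def to_map(nil), do: nil

  # Structs: drop the :__struct__ key and stringify field names, keeping nils.
  def to_map(%__MODULE__{} = config) do
    config
    |> Map.from_struct()
    |> Map.new(fn {key, value} -> {Atom.to_string(key), value} end)
  end

  # Keyword lists become string-keyed maps (assumed behavior, see lead-in).
  def to_map(opts) when is_list(opts) do
    Map.new(opts, fn {key, value} -> {to_string(key), value} end)
  end

  # Plain maps are returned unchanged.
  def to_map(%{} = map), do: map
end

IO.inspect(ToMapSketch.to_map(struct(ToMapSketch)))
IO.inspect(ToMapSketch.to_map(%{"use_cache" => false}))
```

The struct clause must come before the plain-map clause, because a struct also matches the `%{}` pattern.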
Validates an ExtractionConfig for correct field types and values.
Ensures that:
- Boolean fields (use_cache, enable_quality_processing, force_ocr) are actually booleans
- Format fields (output_format, result_format) are valid enum values
- Nested configuration fields are maps or nil
- All values are valid according to the configuration schema
This function is useful for early validation before passing configuration to the extraction functions.
Parameters
- config - An ExtractionConfig struct to validate
Returns
- {:ok, config} - If the configuration is valid
- {:error, reason} - If validation fails, with a descriptive error message
Examples
iex> config = %Kreuzberg.ExtractionConfig{use_cache: true}
iex> Kreuzberg.ExtractionConfig.validate(config)
{:ok, config}
iex> config = %Kreuzberg.ExtractionConfig{output_format: "markdown", result_format: "unified"}
iex> Kreuzberg.ExtractionConfig.validate(config)
{:ok, config}
iex> config = %Kreuzberg.ExtractionConfig{chunking: %{"size" => 1024}}
iex> Kreuzberg.ExtractionConfig.validate(config)
{:ok, config}
iex> config = %Kreuzberg.ExtractionConfig{use_cache: "yes"}
iex> Kreuzberg.ExtractionConfig.validate(config)
{:error, "Field 'use_cache' must be a boolean, got: string"}
iex> config = %Kreuzberg.ExtractionConfig{output_format: "invalid"}
iex> Kreuzberg.ExtractionConfig.validate(config)
{:error, "Field 'output_format' must be one of: plain, text, markdown, md, djot, html, got: invalid"}
iex> config = %Kreuzberg.ExtractionConfig{chunking: "invalid"}
iex> Kreuzberg.ExtractionConfig.validate(config)
{:error, "Field 'chunking' must be a map or nil, got: string"}
iex> config = %Kreuzberg.ExtractionConfig{force_ocr: true, enable_quality_processing: true}
iex> Kreuzberg.ExtractionConfig.validate(config)
{:ok, config}