Kreuzberg.ExtractionConfig (kreuzberg v4.0.8)
Configuration structure for document extraction operations.
Provides options for controlling extraction behavior including caching, quality processing, OCR, chunking, language detection, and post-processing. This module defines the configuration schema and provides validation utilities to ensure configurations are valid before passing them to the Rust extraction engine.
Configuration Fields
Boolean Flags (Top-level)
- :use_cache - Enable result caching (default: true)
- :enable_quality_processing - Enable quality post-processing (default: true)
- :force_ocr - Force OCR even for searchable PDFs (default: false)
Nested Configuration Maps (Optional)
- :chunking - Text chunking configuration with options like chunk_size, overlap, etc.
- :ocr - OCR backend configuration with settings for language, PSM mode, etc.
- :language_detection - Language detection settings for multi-language content
- :postprocessor - Post-processor configuration for cleaning/formatting extracted text
- :images - Image extraction configuration (quality, format, preprocessing options)
- :pages - Page-level extraction configuration (which pages to extract, etc.)
- :token_reduction - Token reduction settings for optimizing output size
- :keywords - Keyword extraction configuration
- :pdf_options - PDF-specific options (requires pdf feature to be enabled)
Default Values
The boolean flags have the following defaults:
- use_cache: true - Caching is enabled by default for better performance
- enable_quality_processing: true - Quality processing is enabled by default for better extraction results
- force_ocr: false - OCR runs only when necessary; searchable PDFs bypass it
All nested configurations default to nil, allowing the Rust implementation to apply its own defaults.
Field Validation
The validate/1 function ensures:
- Boolean fields are actually booleans
- Nested configurations are maps or nil
- No invalid field names are used
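The first two rules above can be illustrated with a minimal, self-contained sketch. `ValidationSketch` is a hypothetical stand-in with only a subset of the documented fields; it is not the library's implementation, and its error wording is illustrative only. The third rule needs no runtime check in this sketch, because Elixir rejects unknown keys when a struct literal is compiled.

```elixir
defmodule ValidationSketch do
  # Hypothetical stand-in for Kreuzberg.ExtractionConfig (subset of fields only).
  defstruct use_cache: true,
            enable_quality_processing: true,
            force_ocr: false,
            chunking: nil,
            ocr: nil

  @boolean_fields [:use_cache, :enable_quality_processing, :force_ocr]
  @nested_fields [:chunking, :ocr]

  # Returns {:ok, config} when every field has an acceptable type,
  # otherwise {:error, message} for the first offending field.
  def validate(%__MODULE__{} = config) do
    with :ok <- check(config, @boolean_fields, &is_boolean/1, "a boolean"),
         :ok <- check(config, @nested_fields, &(is_map(&1) or is_nil(&1)), "a map or nil") do
      {:ok, config}
    end
  end

  # Finds the first field whose value fails `valid_fun`; returns :ok if none fails.
  defp check(config, fields, valid_fun, expected) do
    Enum.find_value(fields, :ok, fn field ->
      value = Map.fetch!(config, field)
      if valid_fun.(value), do: nil, else: {:error, "Field '#{field}' must be #{expected}"}
    end)
  end
end
```

Returning the first failure keeps the result shape the same as the `{:ok, config}` / `{:error, reason}` tuples documented for validate/1.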
Examples
# Create config with chunking enabled
iex> config = %Kreuzberg.ExtractionConfig{
...> chunking: %{"enabled" => true, "chunk_size" => 1024},
...> use_cache: true
...> }
iex> Kreuzberg.ExtractionConfig.validate(config)
{:ok, config}
# Create config that forces OCR
iex> config = %Kreuzberg.ExtractionConfig{
...> force_ocr: true,
...> enable_quality_processing: true
...> }
iex> Kreuzberg.ExtractionConfig.validate(config)
{:ok, config}
# Validate invalid configuration (non-boolean field)
iex> config = %Kreuzberg.ExtractionConfig{use_cache: "yes"}
iex> Kreuzberg.ExtractionConfig.validate(config)
{:error, "Field 'use_cache' must be a boolean, got: string"}
# Convert to map for NIF
iex> config = %Kreuzberg.ExtractionConfig{chunking: %{"size" => 512}}
iex> Kreuzberg.ExtractionConfig.to_map(config)
%{
"chunking" => %{"size" => 512},
"ocr" => nil,
"language_detection" => nil,
"postprocessor" => nil,
"images" => nil,
"pages" => nil,
"token_reduction" => nil,
"keywords" => nil,
"pdf_options" => nil,
"use_cache" => true,
"enable_quality_processing" => true,
"force_ocr" => false
}
Types
@type nested_config() :: config_map() | nil
@type t() :: %Kreuzberg.ExtractionConfig{
  chunking: nested_config(),
  enable_quality_processing: boolean(),
  force_ocr: boolean(),
  images: nested_config(),
  keywords: nested_config(),
  language_detection: nested_config(),
  ocr: nested_config(),
  pages: nested_config(),
  pdf_options: nested_config(),
  postprocessor: nested_config(),
  token_reduction: nested_config(),
  use_cache: boolean()
}
Functions
discover()
Discover and load an ExtractionConfig by searching directories.
Searches the current working directory and all parent directories for a configuration file in the following order:
- kreuzberg.toml
- kreuzberg.yaml
- kreuzberg.yml
- kreuzberg.json
Returns the first configuration file found.
Returns
- {:ok, config} - Successfully discovered and loaded configuration
- {:error, :not_found} - No configuration file found in directory tree
- {:error, reason} - Error loading or parsing the configuration file
Examples
# With kreuzberg.toml in current directory
iex> Kreuzberg.ExtractionConfig.discover()
{:ok, %Kreuzberg.ExtractionConfig{...}}
# With kreuzberg.yaml in a parent directory
iex> Kreuzberg.ExtractionConfig.discover()
{:ok, config}
# When no config file exists
iex> Kreuzberg.ExtractionConfig.discover()
{:error, :not_found}
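The upward directory search described for this function can be sketched in plain Elixir. `DiscoverySketch` and `find_config/1` are hypothetical names, and the sketch assumes that all four file names are tried in one directory before moving to its parent; the library's actual lookup may differ. It only locates the file and does not parse it.

```elixir
defmodule DiscoverySketch do
  # Candidate file names, in the documented search order.
  @candidates ~w(kreuzberg.toml kreuzberg.yaml kreuzberg.yml kreuzberg.json)

  # Walks from `start_dir` up to the filesystem root and returns the first match.
  def find_config(start_dir \\ File.cwd!()) do
    walk(Path.expand(start_dir))
  end

  defp walk(dir) do
    # First candidate that exists as a regular file in this directory, or nil.
    found =
      Enum.find_value(@candidates, fn name ->
        path = Path.join(dir, name)
        if File.regular?(path), do: path
      end)

    parent = Path.dirname(dir)

    cond do
      found -> {:ok, found}
      # The root directory is its own parent, so the walk stops here.
      parent == dir -> {:error, :not_found}
      true -> walk(parent)
    end
  end
end
```

Under that assumption, a kreuzberg.toml in a parent directory is only reached after every candidate name in the current directory has been ruled out.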
from_file(file_path)
Load an ExtractionConfig from a file.
Supports TOML, YAML, and JSON configuration file formats. The file format is automatically detected based on the file extension or file contents.
Parameters
- file_path - Path to the configuration file (String or Path.t())
Returns
- {:ok, config} - Successfully loaded configuration as a struct
- {:error, reason} - Failed to load or parse the configuration file
Supported Formats
- .toml - TOML format (e.g., kreuzberg.toml)
- .yaml, .yml - YAML format (e.g., kreuzberg.yaml)
- .json - JSON format (e.g., kreuzberg.json)
Examples
iex> Kreuzberg.ExtractionConfig.from_file("kreuzberg.toml")
{:ok, %Kreuzberg.ExtractionConfig{...}}
iex> Kreuzberg.ExtractionConfig.from_file("/etc/config/extraction.yaml")
{:ok, config}
iex> Kreuzberg.ExtractionConfig.from_file("/nonexistent/file.toml")
{:error, "File not found: /nonexistent/file.toml"}
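Extension-based format detection, as listed under Supported Formats, could look roughly like the sketch below. `FormatSketch.detect_format/1` is a hypothetical helper, the case-insensitive match and the error tuple are assumptions, and the content-based fallback mentioned above is not covered.

```elixir
defmodule FormatSketch do
  # Maps a file path to a format atom based on its extension.
  # Lowercasing the extension is an assumption, not documented behavior.
  def detect_format(file_path) do
    case file_path |> Path.extname() |> String.downcase() do
      ".toml" -> {:ok, :toml}
      ext when ext in [".yaml", ".yml"] -> {:ok, :yaml}
      ".json" -> {:ok, :json}
      other -> {:error, {:unsupported_extension, other}}
    end
  end
end
```

A path without an extension, such as "config", falls through to the error branch here; that is the case where detection from file contents would have to take over.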
to_map(config)
Converts an ExtractionConfig struct to a map for NIF serialization.
Returns a map containing all configuration fields, both boolean flags and nested configurations. Serializes all values including nil for complete representation.
Parameters
- config - An ExtractionConfig struct, a plain map, nil, or a keyword list
Returns
A map with string keys representing the configuration options. For a struct, all fields are included, so the Rust side can override its defaults with any values provided.
Field Descriptions
- "chunking" - Text chunking configuration (map or nil)
- "ocr" - OCR backend configuration (map or nil)
- "language_detection" - Language detection settings (map or nil)
- "postprocessor" - Post-processor configuration (map or nil)
- "images" - Image extraction configuration (map or nil)
- "pages" - Page-level extraction configuration (map or nil)
- "token_reduction" - Token reduction settings (map or nil)
- "keywords" - Keyword extraction configuration (map or nil)
- "pdf_options" - PDF-specific options (map or nil)
- "use_cache" - Enable caching (boolean)
- "enable_quality_processing" - Enable quality processing (boolean)
- "force_ocr" - Force OCR usage (boolean)
Examples
iex> config = %Kreuzberg.ExtractionConfig{chunking: %{"size" => 512}}
iex> Kreuzberg.ExtractionConfig.to_map(config)
%{
"chunking" => %{"size" => 512},
"ocr" => nil,
"language_detection" => nil,
"postprocessor" => nil,
"images" => nil,
"pages" => nil,
"token_reduction" => nil,
"keywords" => nil,
"pdf_options" => nil,
"use_cache" => true,
"enable_quality_processing" => true,
"force_ocr" => false
}
iex> config = %Kreuzberg.ExtractionConfig{}
iex> Kreuzberg.ExtractionConfig.to_map(config)
%{
"chunking" => nil,
"ocr" => nil,
"language_detection" => nil,
"postprocessor" => nil,
"images" => nil,
"pages" => nil,
"token_reduction" => nil,
"keywords" => nil,
"pdf_options" => nil,
"use_cache" => true,
"enable_quality_processing" => true,
"force_ocr" => false
}
iex> Kreuzberg.ExtractionConfig.to_map(nil)
nil
iex> Kreuzberg.ExtractionConfig.to_map(%{"use_cache" => false})
%{"use_cache" => false}
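The struct-to-map conversion shown in these examples can be approximated with standard library calls. `ToMapSketch` is a hypothetical struct with a subset of the documented fields, and this is a sketch of the technique rather than the library's code; keyword-list input is left out because the examples above do not show its output.

```elixir
defmodule ToMapSketch do
  # Hypothetical struct with a subset of the documented fields.
  defstruct use_cache: true, force_ocr: false, chunking: nil

  # Structs: drop the :__struct__ key and stringify the atom keys, keeping nil values.
  def to_map(%__MODULE__{} = config) do
    config
    |> Map.from_struct()
    |> Map.new(fn {key, value} -> {Atom.to_string(key), value} end)
  end

  # nil and plain maps pass through unchanged, mirroring the documented examples.
  def to_map(nil), do: nil
  def to_map(map) when is_map(map), do: map
end
```

The struct clause comes first because structs are maps too; listed later, it would never be reached behind the plain-map clause.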
validate(config)
Validates an ExtractionConfig for correct field types and values.
Ensures that:
- Boolean fields (use_cache, enable_quality_processing, force_ocr) are actually booleans
- Nested configuration fields are maps or nil
- All values are valid according to the configuration schema
This function is useful for early validation before passing configuration to the extraction functions.
Parameters
- config - An ExtractionConfig struct to validate
Returns
- {:ok, config} - If the configuration is valid
- {:error, reason} - If validation fails, with a descriptive error message
Examples
iex> config = %Kreuzberg.ExtractionConfig{use_cache: true}
iex> Kreuzberg.ExtractionConfig.validate(config)
{:ok, config}
iex> config = %Kreuzberg.ExtractionConfig{chunking: %{"size" => 1024}}
iex> Kreuzberg.ExtractionConfig.validate(config)
{:ok, config}
iex> config = %Kreuzberg.ExtractionConfig{use_cache: "yes"}
iex> Kreuzberg.ExtractionConfig.validate(config)
{:error, "Field 'use_cache' must be a boolean, got: string"}
iex> config = %Kreuzberg.ExtractionConfig{chunking: "invalid"}
iex> Kreuzberg.ExtractionConfig.validate(config)
{:error, "Field 'chunking' must be a map or nil, got: string"}
iex> config = %Kreuzberg.ExtractionConfig{force_ocr: true, enable_quality_processing: true}
iex> Kreuzberg.ExtractionConfig.validate(config)
{:ok, config}