Kreuzberg.ExtractionConfig (kreuzberg v4.9.5)

Configuration structure for document extraction operations.

Provides options for controlling extraction behavior including caching, quality processing, OCR, chunking, language detection, and post-processing. This module defines the configuration schema and provides validation utilities to ensure configurations are valid before passing them to the Rust extraction engine.

Configuration Fields

Boolean Flags (Top-level)

:use_cache - Enable result caching (default: true)
:enable_quality_processing - Enable quality post-processing (default: true)
:force_ocr - Force OCR even for searchable PDFs (default: false)
:disable_ocr - Disable OCR entirely; image files then return empty content (default: false)

Output Format Flags

:output_format - Content text format (default: "plain"); one of "plain", "markdown", "djot", "html" (the validation error message also lists "text" and "md" as accepted values)
:result_format - Result structure format (default: "unified"); one of "unified", "element_based"

Nested Configuration Maps (Optional)

:chunking - Text chunking configuration with options like chunk_size, overlap, etc.
:ocr - OCR backend configuration with settings for language, PSM mode, etc.
- Can include nested :paddle_ocr_config for PaddleOCR-specific settings
- Can include nested :element_config for OCR element extraction settings
:language_detection - Language detection settings for multi-language content
:postprocessor - Post-processor configuration for cleaning/formatting extracted text
:images - Image extraction configuration (quality, format, preprocessing options)
:pages - Page-level extraction configuration (which pages to extract, etc.)
:token_reduction - Token reduction settings for optimizing output size
:keywords - Keyword extraction configuration
:pdf_options - PDF-specific options (requires pdf feature to be enabled)
:html_options - HTML to Markdown conversion options
:layout - Layout detection configuration (confidence_threshold, apply_heuristics, table_model)
:acceleration - GPU acceleration configuration (provider, device_id)
:security_limits - Security limits for archive extraction (max sizes, compression ratio, etc.)
:email - Email extraction configuration (msg_fallback_codepage)
:content_filter - Content filter configuration (include_headers, include_footers, strip_repeating_text, include_watermarks)
:max_concurrent_extractions - Maximum concurrent extractions in batch operations (positive integer or nil)
:cache_namespace - Cache namespace for tenant isolation (string or nil)
:cache_ttl_secs - Per-request cache TTL in seconds (non-negative integer or nil)
:extraction_timeout_secs - Per-request extraction timeout in seconds; when exceeded, extraction is cancelled (non-negative integer or nil)

Default Values

The top-level boolean flags default as follows:

use_cache: true - Caching is enabled by default for better performance
enable_quality_processing: true - Quality processing is enabled by default for better extraction results
force_ocr: false - OCR is only used when necessary (searchable PDFs bypass OCR)
disable_ocr: false - OCR is available when needed

Format defaults:

output_format: "plain" - Raw extracted text (no formatting)
result_format: "unified" - All content in unified content field

All nested configurations default to nil, allowing the Rust implementation to apply its own defaults.

Field Validation

The validate/1 function ensures:

Boolean fields are actually booleans
Format fields are valid enum values
Nested configurations are maps or nil
No invalid field names are used

Examples

# Create config with chunking enabled
iex> config = %Kreuzberg.ExtractionConfig{
...>   chunking: %{"enabled" => true, "chunk_size" => 1024},
...>   use_cache: true
...> }
iex> Kreuzberg.ExtractionConfig.validate(config)
{:ok, config}

# Create config with markdown output format
iex> config = %Kreuzberg.ExtractionConfig{
...>   output_format: "markdown",
...>   result_format: "unified"
...> }
iex> Kreuzberg.ExtractionConfig.validate(config)
{:ok, config}

# Create config that forces OCR with element-based result format
iex> config = %Kreuzberg.ExtractionConfig{
...>   force_ocr: true,
...>   result_format: "element_based",
...>   enable_quality_processing: true
...> }
iex> Kreuzberg.ExtractionConfig.validate(config)
{:ok, config}

# Validate invalid configuration (non-boolean field)
iex> config = %Kreuzberg.ExtractionConfig{use_cache: "yes"}
iex> Kreuzberg.ExtractionConfig.validate(config)
{:error, "Field 'use_cache' must be a boolean, got: string"}

# Validate invalid format
iex> config = %Kreuzberg.ExtractionConfig{output_format: "invalid"}
iex> Kreuzberg.ExtractionConfig.validate(config)
{:error, "Field 'output_format' must be one of: plain, text, markdown, md, djot, html, got: invalid"}

# Convert to map for NIF
iex> config = %Kreuzberg.ExtractionConfig{chunking: %{"size" => 512}}
iex> Kreuzberg.ExtractionConfig.to_map(config)
%{
  "acceleration" => nil,
  "cache_namespace" => nil,
  "cache_ttl_secs" => nil,
  "chunking" => %{"size" => 512},
  "concurrency" => nil,
  "content_filter" => nil,
  "email" => nil,
  "disable_ocr" => false,
  "enable_quality_processing" => true,
  "extraction_timeout_secs" => nil,
  "force_ocr" => false,
  "force_ocr_pages" => nil,
  "html_options" => nil,
  "images" => nil,
  "include_document_structure" => false,
  "keywords" => nil,
  "language_detection" => nil,
  "layout" => nil,
  "max_archive_depth" => 3,
  "max_concurrent_extractions" => nil,
  "ocr" => nil,
  "output_format" => "plain",
  "pages" => nil,
  "pdf_options" => nil,
  "postprocessor" => nil,
  "result_format" => "unified",
  "security_limits" => nil,
  "token_reduction" => nil,
  "tree_sitter" => nil,
  "use_cache" => true
}

Summary

Types

config_map()

layout_config()

nested_config()

output_format()

result_format()

t()

Functions

discover()

Discover and load an ExtractionConfig by searching directories.

from_file(file_path)

Load an ExtractionConfig from a file.

new()

Creates a new ExtractionConfig with all default values.

new(opts)

Creates a new ExtractionConfig from keyword list or map.

to_map(map)

Converts an ExtractionConfig struct to a map for NIF serialization.

validate(config)

Validates an ExtractionConfig for correct field types and values.

Types

config_map()

@type config_map() :: %{required(String.t()) => any()}

layout_config()

@type layout_config() ::
  %{
    confidence_threshold: float() | nil,
    apply_heuristics: boolean(),
    table_model: String.t() | nil
  }
  | nil

nested_config()

@type nested_config() :: config_map() | nil

output_format()

@type output_format() :: String.t()

result_format()

@type result_format() :: String.t()

t()

@type t() :: %Kreuzberg.ExtractionConfig{
  acceleration: nested_config(),
  cache_namespace: String.t() | nil,
  cache_ttl_secs: non_neg_integer() | nil,
  chunking: nested_config(),
  concurrency: nested_config(),
  content_filter: nested_config(),
  disable_ocr: boolean(),
  email: nested_config(),
  enable_quality_processing: boolean(),
  extraction_timeout_secs: non_neg_integer() | nil,
  force_ocr: boolean(),
  force_ocr_pages: [non_neg_integer()] | nil,
  html_options: config_map() | nil,
  html_output: nested_config(),
  images: nested_config(),
  include_document_structure: boolean(),
  keywords: nested_config(),
  language_detection: nested_config(),
  layout: layout_config(),
  max_archive_depth: non_neg_integer(),
  max_concurrent_extractions: non_neg_integer() | nil,
  ocr: nested_config(),
  output_format: output_format(),
  pages: nested_config(),
  pdf_options: nested_config(),
  postprocessor: nested_config(),
  result_format: result_format(),
  security_limits: nested_config(),
  token_reduction: nested_config(),
  tree_sitter: Kreuzberg.TreeSitterConfig.t() | nested_config(),
  use_cache: boolean()
}

Functions

discover()

@spec discover() :: {:ok, t()} | {:error, :not_found | String.t()}

Discover and load an ExtractionConfig by searching directories.

Searches the current working directory and all parent directories for a configuration file in the following order:

kreuzberg.toml
kreuzberg.yaml
kreuzberg.yml
kreuzberg.json

Returns the first configuration file found.

Returns

{:ok, config} - Successfully discovered and loaded configuration
{:error, :not_found} - No configuration file found in directory tree
{:error, reason} - Error loading or parsing the configuration file

Examples

# When no config file exists
iex> Kreuzberg.ExtractionConfig.discover()
{:error, :not_found}
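
The upward search described for discover/0 can be approximated as follows. This is a sketch under assumptions: DiscoverSketch and find/1 are hypothetical names, the code only locates a file and does not parse it, and it assumes each directory is checked for the four file names in the listed order before moving to its parent.

```elixir
defmodule DiscoverSketch do
  # Candidate file names, in the order the documentation lists them.
  @candidates ["kreuzberg.toml", "kreuzberg.yaml", "kreuzberg.yml", "kreuzberg.json"]

  # Walks from `dir` up to the filesystem root and returns the first match.
  def find(dir) do
    dir = Path.expand(dir)

    case Enum.find(@candidates, &File.regular?(Path.join(dir, &1))) do
      nil ->
        parent = Path.dirname(dir)
        # Path.dirname/1 of the root returns the root itself, which ends the walk.
        if parent == dir, do: {:error, :not_found}, else: find(parent)

      name ->
        {:ok, Path.join(dir, name)}
    end
  end
end
```

Calling `DiscoverSketch.find(File.cwd!())` would mirror a search that starts in the current working directory.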

from_file(file_path)

@spec from_file(String.t() | Path.t()) :: {:ok, t()} | {:error, String.t()}

Load an ExtractionConfig from a file.

Supports TOML, YAML, and JSON configuration file formats. The file format is automatically detected based on the file extension or file contents.

Parameters

file_path - Path to the configuration file (String or Path.t())

Returns

{:ok, config} - Successfully loaded configuration as a struct
{:error, reason} - Failed to load or parse the configuration file

Supported Formats

.toml - TOML format (e.g., kreuzberg.toml)
.yaml, .yml - YAML format (e.g., kreuzberg.yaml)
.json - JSON format (e.g., kreuzberg.json)

Examples

Loading from a TOML file:

Kreuzberg.ExtractionConfig.from_file("kreuzberg.toml")
# => {:ok, %Kreuzberg.ExtractionConfig{...}}

Loading from a YAML file:

Kreuzberg.ExtractionConfig.from_file("/etc/config/extraction.yaml")
# => {:ok, %Kreuzberg.ExtractionConfig{...}}

Handling missing files:

Kreuzberg.ExtractionConfig.from_file("/nonexistent/file.toml")
# => {:error, "File not found: ..."}
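
Extension-based format detection, as listed under Supported Formats, might look like the sketch below. FormatSketch and detect/1 are hypothetical names used only for illustration; the content-based detection that from_file/1 also mentions is not covered here.

```elixir
defmodule FormatSketch do
  # Maps a file extension to a format tag, ignoring letter case.
  def detect(path) do
    case path |> Path.extname() |> String.downcase() do
      ".toml" -> {:ok, :toml}
      ext when ext in [".yaml", ".yml"] -> {:ok, :yaml}
      ".json" -> {:ok, :json}
      other -> {:error, "unsupported extension: #{inspect(other)}"}
    end
  end
end
```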

new()

@spec new() :: t()

Creates a new ExtractionConfig with all default values.

Examples

iex> config = Kreuzberg.ExtractionConfig.new()
iex> config.use_cache
true

new(opts)

@spec new(keyword() | map()) :: t()

Creates a new ExtractionConfig from keyword list or map.

Parameters

opts - Keyword list or map (supports string keys from JSON)

Examples

iex> config = Kreuzberg.ExtractionConfig.new(use_cache: false)
iex> config.use_cache
false

iex> config = Kreuzberg.ExtractionConfig.new(%{"output_format" => "markdown"})
iex> config.output_format
"markdown"
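
Accepting both atom keys and string keys (for example from decoded JSON) can be done without creating atoms at runtime. The sketch below uses a hypothetical OptsSketch struct with three fields; dropping unknown keys is a choice made for this illustration and is not confirmed behaviour of Kreuzberg.ExtractionConfig.new/1.

```elixir
defmodule OptsSketch do
  # Hypothetical struct with a reduced field set.
  @fields [:use_cache, :output_format, :chunking]
  defstruct use_cache: true, output_format: "plain", chunking: nil

  # Builds the struct from a keyword list or a map with atom or string keys.
  def new(opts \\ []) when is_list(opts) or is_map(opts) do
    # The `field = ...` clause acts as a filter: unknown keys yield nil and are skipped.
    pairs = for {key, value} <- opts, field = to_field(key), do: {field, value}
    struct(__MODULE__, pairs)
  end

  # Atom keys are kept only when they name a known field.
  defp to_field(key) when is_atom(key), do: if(key in @fields, do: key)

  # String keys are matched against known field names, so no new atoms are created.
  defp to_field(key) when is_binary(key) do
    Enum.find(@fields, &(Atom.to_string(&1) == key))
  end
end
```

Matching strings against a fixed field list avoids `String.to_atom/1` on untrusted input, which could otherwise grow the atom table without bound.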

to_map(map)

@spec to_map(t() | map() | nil | list()) :: map() | nil

Converts an ExtractionConfig struct to a map for NIF serialization.

For a struct, returns a map containing every configuration field, both boolean flags and nested configurations, with nil values included. A plain map is returned unchanged, and nil is returned as nil.

Parameters

config - An ExtractionConfig struct, a plain map, nil, or a keyword list

Returns

A map with string keys representing the configuration options, or nil when given nil. For a struct, every field is included; fields left as nil let the Rust side apply its own defaults.

Field Descriptions

"chunking" - Text chunking configuration (map or nil)
"ocr" - OCR backend configuration (map or nil)
"language_detection" - Language detection settings (map or nil)
"postprocessor" - Post-processor configuration (map or nil)
"images" - Image extraction configuration (map or nil)
"pages" - Page-level extraction configuration (map or nil)
"token_reduction" - Token reduction settings (map or nil)
"keywords" - Keyword extraction configuration (map or nil)
"pdf_options" - PDF-specific options (map or nil)
"max_concurrent_extractions" - Maximum concurrent extractions (positive integer or nil)
"html_options" - HTML to Markdown conversion options (map or nil)
"layout" - Layout detection configuration (map or nil)
"include_document_structure" - Include document structure in extraction (boolean)
"use_cache" - Enable caching (boolean)
"enable_quality_processing" - Enable quality processing (boolean)
"force_ocr" - Force OCR usage (boolean)
"output_format" - Content text format (string: "plain", "markdown", "djot", "html")
"result_format" - Result structure format (string: "unified", "element_based")

Examples

iex> config = %Kreuzberg.ExtractionConfig{chunking: %{"size" => 512}, output_format: "markdown"}
iex> Kreuzberg.ExtractionConfig.to_map(config)
%{
  "acceleration" => nil,
  "cache_namespace" => nil,
  "cache_ttl_secs" => nil,
  "chunking" => %{"size" => 512},
  "concurrency" => nil,
  "content_filter" => nil,
  "email" => nil,
  "disable_ocr" => false,
  "enable_quality_processing" => true,
  "extraction_timeout_secs" => nil,
  "force_ocr" => false,
  "force_ocr_pages" => nil,
  "html_options" => nil,
  "images" => nil,
  "include_document_structure" => false,
  "keywords" => nil,
  "language_detection" => nil,
  "layout" => nil,
  "max_archive_depth" => 3,
  "max_concurrent_extractions" => nil,
  "ocr" => nil,
  "output_format" => "markdown",
  "pages" => nil,
  "pdf_options" => nil,
  "postprocessor" => nil,
  "result_format" => "unified",
  "security_limits" => nil,
  "token_reduction" => nil,
  "tree_sitter" => nil,
  "use_cache" => true
}

iex> config = %Kreuzberg.ExtractionConfig{}
iex> Kreuzberg.ExtractionConfig.to_map(config)
%{
  "acceleration" => nil,
  "cache_namespace" => nil,
  "cache_ttl_secs" => nil,
  "chunking" => nil,
  "concurrency" => nil,
  "content_filter" => nil,
  "email" => nil,
  "disable_ocr" => false,
  "enable_quality_processing" => true,
  "extraction_timeout_secs" => nil,
  "force_ocr" => false,
  "force_ocr_pages" => nil,
  "html_options" => nil,
  "images" => nil,
  "include_document_structure" => false,
  "keywords" => nil,
  "language_detection" => nil,
  "layout" => nil,
  "max_archive_depth" => 3,
  "max_concurrent_extractions" => nil,
  "ocr" => nil,
  "output_format" => "plain",
  "pages" => nil,
  "pdf_options" => nil,
  "postprocessor" => nil,
  "result_format" => "unified",
  "security_limits" => nil,
  "token_reduction" => nil,
  "tree_sitter" => nil,
  "use_cache" => true
}

iex> Kreuzberg.ExtractionConfig.to_map(nil)
nil

iex> Kreuzberg.ExtractionConfig.to_map(%{"use_cache" => false, "output_format" => "markdown"})
%{"use_cache" => false, "output_format" => "markdown"}
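
The struct-to-map conversion shown in these examples can be sketched with Map.from_struct/1. ToMapSketch is a hypothetical three-field stand-in for the real struct, and keyword-list input is left out because its handling is not shown in this document.

```elixir
defmodule ToMapSketch do
  # Hypothetical struct with a reduced field set.
  defstruct use_cache: true, output_format: "plain", chunking: nil

  # nil passes through unchanged.
  def to_map(nil), do: nil

  # A struct becomes a map with string keys; nil values are kept.
  def to_map(%__MODULE__{} = config) do
    config
    |> Map.from_struct()
    |> Map.new(fn {key, value} -> {Atom.to_string(key), value} end)
  end

  # A plain map is returned as given.
  def to_map(map) when is_map(map), do: map
end
```

The struct clause is placed before the plain-map clause because a struct also satisfies `is_map/1`.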

validate(config)

@spec validate(t()) :: {:ok, t()} | {:error, String.t()}

Validates an ExtractionConfig for correct field types and values.

Ensures that:

Boolean fields (use_cache, enable_quality_processing, force_ocr) are actually booleans
Format fields (output_format, result_format) are valid enum values
Nested configuration fields are maps or nil
All values are valid according to the configuration schema

This function is useful for early validation before passing configuration to the extraction functions.

Parameters

config - An ExtractionConfig struct to validate

Returns

{:ok, config} - If the configuration is valid
{:error, reason} - If validation fails, with a descriptive error message

Examples

iex> config = %Kreuzberg.ExtractionConfig{use_cache: true}
iex> Kreuzberg.ExtractionConfig.validate(config)
{:ok, config}

iex> config = %Kreuzberg.ExtractionConfig{output_format: "markdown", result_format: "unified"}
iex> Kreuzberg.ExtractionConfig.validate(config)
{:ok, config}

iex> config = %Kreuzberg.ExtractionConfig{chunking: %{"size" => 1024}}
iex> Kreuzberg.ExtractionConfig.validate(config)
{:ok, config}

iex> config = %Kreuzberg.ExtractionConfig{use_cache: "yes"}
iex> Kreuzberg.ExtractionConfig.validate(config)
{:error, "Field 'use_cache' must be a boolean, got: string"}

iex> config = %Kreuzberg.ExtractionConfig{output_format: "invalid"}
iex> Kreuzberg.ExtractionConfig.validate(config)
{:error, "Field 'output_format' must be one of: plain, text, markdown, md, djot, html, got: invalid"}

iex> config = %Kreuzberg.ExtractionConfig{chunking: "invalid"}
iex> Kreuzberg.ExtractionConfig.validate(config)
{:error, "Field 'chunking' must be a map or nil, got: string"}

iex> config = %Kreuzberg.ExtractionConfig{force_ocr: true, enable_quality_processing: true}
iex> Kreuzberg.ExtractionConfig.validate(config)
{:ok, config}