Kreuzberg.ExtractionConfig (kreuzberg v4.4.2)


Configuration structure for document extraction operations.

Provides options for controlling extraction behavior including caching, quality processing, OCR, chunking, language detection, and post-processing. This module defines the configuration schema and provides validation utilities for checking a configuration before it is passed to the Rust extraction engine.

Configuration Fields

Boolean Flags (Top-level)

  • :use_cache - Enable result caching (default: true)
  • :enable_quality_processing - Enable quality post-processing (default: true)
  • :force_ocr - Force OCR even for searchable PDFs (default: false)
  • :include_document_structure - Include document structure in extraction (default: false)

Output Format Flags

  • :output_format - Content text format (default: "plain") - "plain", "markdown", "djot", "html"; the validation error message also lists "text" and "md" as accepted values
  • :result_format - Result structure format (default: "unified") - "unified", "element_based"

Nested Configuration Maps (Optional)

  • :chunking - Text chunking configuration with options like chunk_size, overlap, etc.
  • :ocr - OCR backend configuration with settings for language, PSM mode, etc.
    • Can include nested :paddle_ocr_config for PaddleOCR-specific settings
    • Can include nested :element_config for OCR element extraction settings
  • :language_detection - Language detection settings for multi-language content
  • :postprocessor - Post-processor configuration for cleaning/formatting extracted text
  • :images - Image extraction configuration (quality, format, preprocessing options)
  • :pages - Page-level extraction configuration (which pages to extract, etc.)
  • :token_reduction - Token reduction settings for optimizing output size
  • :keywords - Keyword extraction configuration
  • :pdf_options - PDF-specific options (requires pdf feature to be enabled)
  • :html_options - HTML to Markdown conversion options
  • :max_concurrent_extractions - Maximum concurrent extractions in batch operations (positive integer or nil)

Default Values

All boolean flags default to reasonable values:

  • use_cache: true - Caching is enabled by default for better performance
  • enable_quality_processing: true - Quality processing is enabled by default for better extraction results
  • force_ocr: false - OCR is only used when necessary (searchable PDFs bypass OCR)

Format defaults:

  • output_format: "plain" - Raw extracted text (no formatting)
  • result_format: "unified" - All content in unified content field

All nested configurations default to nil, allowing the Rust implementation to apply its own defaults.

Field Validation

The validate/1 function ensures:

  • Boolean fields are actually booleans
  • Format fields are valid enum values
  • Nested configurations are maps or nil
  • No invalid field names are used
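The checks listed above can be approximated with a chain of small validators. The following is only a sketch under stated assumptions, not the library's implementation: ValidationSketch, its reduced field set, and the helper names are hypothetical. The accepted output formats and the wording of the messages are copied from the documented error strings.

```elixir
defmodule ValidationSketch do
  # Hypothetical, reduced stand-in for Kreuzberg.ExtractionConfig.
  defstruct use_cache: true, force_ocr: false, output_format: "plain", chunking: nil

  @boolean_fields [:use_cache, :force_ocr]
  @nested_fields [:chunking]
  # Accepted values as listed in the documented error message.
  @output_formats ~w(plain text markdown md djot html)

  # Runs each check in turn and stops at the first failure.
  def validate(%__MODULE__{} = config) do
    with :ok <- check_fields(config, @boolean_fields, &is_boolean/1, "a boolean"),
         :ok <- check_output_format(config.output_format),
         :ok <- check_fields(config, @nested_fields, &(is_map(&1) or is_nil(&1)), "a map or nil") do
      {:ok, config}
    end
  end

  # Returns :ok, or an error tuple for the first field that fails `valid?`.
  defp check_fields(config, fields, valid?, expected) do
    Enum.find_value(fields, :ok, fn field ->
      value = Map.fetch!(config, field)

      if valid?.(value) do
        nil
      else
        {:error, "Field '#{field}' must be #{expected}, got: #{type_name(value)}"}
      end
    end)
  end

  defp check_output_format(format) when format in @output_formats, do: :ok

  defp check_output_format(format) do
    allowed = Enum.join(@output_formats, ", ")
    shown = if is_binary(format), do: format, else: inspect(format)
    {:error, "Field 'output_format' must be one of: #{allowed}, got: #{shown}"}
  end

  # Only the type names that appear in the documented messages are modelled;
  # anything else falls back to inspect/1.
  defp type_name(value) when is_binary(value), do: "string"
  defp type_name(value) when is_map(value), do: "map"
  defp type_name(value), do: inspect(value)
end
```

Using `with` keeps the first error as the return value, which matches the single-message `{:error, reason}` shape shown in the examples.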

Examples

# Create config with chunking enabled
iex> config = %Kreuzberg.ExtractionConfig{
...>   chunking: %{"enabled" => true, "chunk_size" => 1024},
...>   use_cache: true
...> }
iex> Kreuzberg.ExtractionConfig.validate(config)
{:ok, config}

# Create config with markdown output format
iex> config = %Kreuzberg.ExtractionConfig{
...>   output_format: "markdown",
...>   result_format: "unified"
...> }
iex> Kreuzberg.ExtractionConfig.validate(config)
{:ok, config}

# Create config that forces OCR with element-based result format
iex> config = %Kreuzberg.ExtractionConfig{
...>   force_ocr: true,
...>   result_format: "element_based",
...>   enable_quality_processing: true
...> }
iex> Kreuzberg.ExtractionConfig.validate(config)
{:ok, config}

# Validate invalid configuration (non-boolean field)
iex> config = %Kreuzberg.ExtractionConfig{use_cache: "yes"}
iex> Kreuzberg.ExtractionConfig.validate(config)
{:error, "Field 'use_cache' must be a boolean, got: string"}

# Validate invalid format
iex> config = %Kreuzberg.ExtractionConfig{output_format: "invalid"}
iex> Kreuzberg.ExtractionConfig.validate(config)
{:error, "Field 'output_format' must be one of: plain, text, markdown, md, djot, html, got: invalid"}

# Convert to map for NIF
iex> config = %Kreuzberg.ExtractionConfig{chunking: %{"size" => 512}}
iex> Kreuzberg.ExtractionConfig.to_map(config)
%{
  "chunking" => %{"size" => 512},
  "ocr" => nil,
  "language_detection" => nil,
  "postprocessor" => nil,
  "images" => nil,
  "pages" => nil,
  "token_reduction" => nil,
  "keywords" => nil,
  "pdf_options" => nil,
  "html_options" => nil,
  "max_concurrent_extractions" => nil,
  "include_document_structure" => false,
  "use_cache" => true,
  "enable_quality_processing" => true,
  "force_ocr" => false,
  "output_format" => "plain",
  "result_format" => "unified"
}

Summary

Functions

discover()

Discover and load an ExtractionConfig by searching directories.

from_file(file_path)

Load an ExtractionConfig from a file.

new()

Creates a new ExtractionConfig with all default values.

new(opts)

Creates a new ExtractionConfig from keyword list or map.

to_map(map)

Converts an ExtractionConfig struct to a map for NIF serialization.

validate(config)

Validates an ExtractionConfig for correct field types and values.

Types

config_map()

@type config_map() :: %{required(String.t()) => any()}

nested_config()

@type nested_config() :: config_map() | nil

output_format()

@type output_format() :: String.t()

result_format()

@type result_format() :: String.t()

t()

@type t() :: %Kreuzberg.ExtractionConfig{
  chunking: nested_config(),
  enable_quality_processing: boolean(),
  force_ocr: boolean(),
  html_options: config_map() | nil,
  images: nested_config(),
  include_document_structure: boolean(),
  keywords: nested_config(),
  language_detection: nested_config(),
  max_concurrent_extractions: non_neg_integer() | nil,
  ocr: nested_config(),
  output_format: output_format(),
  pages: nested_config(),
  pdf_options: nested_config(),
  postprocessor: nested_config(),
  result_format: result_format(),
  security_limits: nested_config(),
  token_reduction: nested_config(),
  use_cache: boolean()
}

Functions

discover()

@spec discover() :: {:ok, t()} | {:error, :not_found | String.t()}

Discover and load an ExtractionConfig by searching directories.

Searches the current working directory and all parent directories for a configuration file in the following order:

  1. kreuzberg.toml
  2. kreuzberg.yaml
  3. kreuzberg.yml
  4. kreuzberg.json

Returns the first configuration file found.
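The search described above can be sketched as an upward directory walk. This is an illustration, not the library's code: DiscoverSketch and find/1 are hypothetical names, and the sketch assumes the file-name order is applied within each directory, with the nearest directory checked first. It only locates the file and does not parse it.

```elixir
defmodule DiscoverSketch do
  # File names in the priority order listed above.
  @candidates ["kreuzberg.toml", "kreuzberg.yaml", "kreuzberg.yml", "kreuzberg.json"]

  # Walks from `dir` up to the filesystem root and returns the first match.
  def find(dir \\ File.cwd!()) do
    dir = Path.expand(dir)

    case Enum.find(@candidates, &File.regular?(Path.join(dir, &1))) do
      nil ->
        parent = Path.dirname(dir)
        # Path.dirname/1 of the root returns the root itself, which ends the walk.
        if parent == dir, do: {:error, :not_found}, else: find(parent)

      name ->
        {:ok, Path.join(dir, name)}
    end
  end
end
```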

Returns

  • {:ok, config} - Successfully discovered and loaded configuration
  • {:error, :not_found} - No configuration file found in directory tree
  • {:error, reason} - Error loading or parsing the configuration file

Examples

# When no config file exists
iex> Kreuzberg.ExtractionConfig.discover()
{:error, :not_found}

from_file(file_path)

@spec from_file(String.t() | Path.t()) :: {:ok, t()} | {:error, String.t()}

Load an ExtractionConfig from a file.

Supports TOML, YAML, and JSON configuration file formats. The file format is automatically detected based on the file extension or file contents.

Parameters

  • file_path - Path to the configuration file (String or Path.t())

Returns

  • {:ok, config} - Successfully loaded configuration as a struct
  • {:error, reason} - Failed to load or parse the configuration file

Supported Formats

  • .toml - TOML format (e.g., kreuzberg.toml)
  • .yaml, .yml - YAML format (e.g., kreuzberg.yaml)
  • .json - JSON format (e.g., kreuzberg.json)
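As a rough illustration of the extension mapping above, the sketch below picks a format from the file name alone. FormatSketch, detect/1, and the returned atoms are hypothetical; the content-based detection mentioned earlier is not modelled here.

```elixir
defmodule FormatSketch do
  # Maps a file extension to a format atom; the atoms are illustrative only.
  def detect(path) do
    case path |> Path.extname() |> String.downcase() do
      ".toml" -> {:ok, :toml}
      ext when ext in [".yaml", ".yml"] -> {:ok, :yaml}
      ".json" -> {:ok, :json}
      other -> {:error, "Unsupported extension: #{inspect(other)}"}
    end
  end
end
```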

Examples

Loading from a TOML file:

Kreuzberg.ExtractionConfig.from_file("kreuzberg.toml")
# => {:ok, %Kreuzberg.ExtractionConfig{...}}

Loading from a YAML file:

Kreuzberg.ExtractionConfig.from_file("/etc/config/extraction.yaml")
# => {:ok, %Kreuzberg.ExtractionConfig{...}}

Handling missing files:

Kreuzberg.ExtractionConfig.from_file("/nonexistent/file.toml")
# => {:error, "File not found: ..."}

new()

@spec new() :: t()

Creates a new ExtractionConfig with all default values.

Examples

iex> config = Kreuzberg.ExtractionConfig.new()
iex> config.use_cache
true

new(opts)

@spec new(keyword() | map()) :: t()

Creates a new ExtractionConfig from keyword list or map.

Parameters

  • opts - Keyword list or map (supports string keys from JSON)

Examples

iex> config = Kreuzberg.ExtractionConfig.new(use_cache: false)
iex> config.use_cache
false

iex> config = Kreuzberg.ExtractionConfig.new(%{"output_format" => "markdown"})
iex> config.output_format
"markdown"
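One way a constructor could accept both atom keys and string keys (as produced by a JSON decoder) is to look each key up against the struct's known fields. The sketch below assumes this approach; NewSketch and its two fields are hypothetical, and silently dropping unknown keys is an assumption rather than documented behavior.

```elixir
defmodule NewSketch do
  # Hypothetical, reduced stand-in for the real struct.
  defstruct use_cache: true, output_format: "plain"

  # Builds the struct from a keyword list or a map with atom or string keys.
  def new(opts \\ []) do
    # Lookup table from field name (as a string) to the field atom. This avoids
    # calling String.to_atom/1 on arbitrary input.
    known =
      %__MODULE__{}
      |> Map.from_struct()
      |> Map.keys()
      |> Map.new(&{Atom.to_string(&1), &1})

    # Keys that do not name a struct field are filtered out.
    fields = for {key, value} <- opts, field = known[to_string(key)], do: {field, value}

    struct(__MODULE__, fields)
  end
end
```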

to_map(map)

@spec to_map(t() | map() | nil | list()) :: map() | nil

Converts an ExtractionConfig struct to a map for NIF serialization.

Returns a map containing all configuration fields, both boolean flags and nested configurations. Serializes all values including nil for complete representation.

Parameters

  • config - An ExtractionConfig struct, a plain map, nil, or a keyword list

Returns

A map with string keys representing the configuration options. Every serialized field is present, including those set to nil, so the Rust side receives the provided values and can apply its own defaults where a nested configuration is nil.

Field Descriptions

  • "chunking" - Text chunking configuration (map or nil)
  • "ocr" - OCR backend configuration (map or nil)
  • "language_detection" - Language detection settings (map or nil)
  • "postprocessor" - Post-processor configuration (map or nil)
  • "images" - Image extraction configuration (map or nil)
  • "pages" - Page-level extraction configuration (map or nil)
  • "token_reduction" - Token reduction settings (map or nil)
  • "keywords" - Keyword extraction configuration (map or nil)
  • "pdf_options" - PDF-specific options (map or nil)
  • "max_concurrent_extractions" - Maximum concurrent extractions (positive integer or nil)
  • "html_options" - HTML to Markdown conversion options (map or nil)
  • "include_document_structure" - Include document structure in extraction (boolean)
  • "use_cache" - Enable caching (boolean)
  • "enable_quality_processing" - Enable quality processing (boolean)
  • "force_ocr" - Force OCR usage (boolean)
  • "output_format" - Content text format (string: "plain", "markdown", "djot", "html")
  • "result_format" - Result structure format (string: "unified", "element_based")

Examples

iex> config = %Kreuzberg.ExtractionConfig{chunking: %{"size" => 512}, output_format: "markdown"}
iex> Kreuzberg.ExtractionConfig.to_map(config)
%{
  "chunking" => %{"size" => 512},
  "ocr" => nil,
  "language_detection" => nil,
  "postprocessor" => nil,
  "images" => nil,
  "pages" => nil,
  "token_reduction" => nil,
  "keywords" => nil,
  "pdf_options" => nil,
  "html_options" => nil,
  "max_concurrent_extractions" => nil,
  "include_document_structure" => false,
  "use_cache" => true,
  "enable_quality_processing" => true,
  "force_ocr" => false,
  "output_format" => "markdown",
  "result_format" => "unified"
}

iex> config = %Kreuzberg.ExtractionConfig{}
iex> Kreuzberg.ExtractionConfig.to_map(config)
%{
  "chunking" => nil,
  "ocr" => nil,
  "language_detection" => nil,
  "postprocessor" => nil,
  "images" => nil,
  "pages" => nil,
  "token_reduction" => nil,
  "keywords" => nil,
  "pdf_options" => nil,
  "html_options" => nil,
  "max_concurrent_extractions" => nil,
  "use_cache" => true,
  "enable_quality_processing" => true,
  "include_document_structure" => false,
  "force_ocr" => false,
  "output_format" => "plain",
  "result_format" => "unified"
}

iex> Kreuzberg.ExtractionConfig.to_map(nil)
nil

iex> Kreuzberg.ExtractionConfig.to_map(%{"use_cache" => false, "output_format" => "markdown"})
%{"use_cache" => false, "output_format" => "markdown"}
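The behavior shown in these examples (struct to string-keyed map, nil passed through, plain maps returned with string keys) can be approximated as follows. This is a sketch with a hypothetical ToMapSketch module and a reduced field set, not the library's implementation.

```elixir
defmodule ToMapSketch do
  # Hypothetical, reduced stand-in for the real struct.
  defstruct chunking: nil, use_cache: true, output_format: "plain"

  def to_map(nil), do: nil

  # The struct clause must come before the plain-map clause, because structs
  # also satisfy is_map/1.
  def to_map(%__MODULE__{} = config) do
    config |> Map.from_struct() |> stringify_keys()
  end

  def to_map(config) when is_map(config), do: stringify_keys(config)

  # Keyword lists are converted to a map first.
  def to_map(config) when is_list(config), do: config |> Map.new() |> stringify_keys()

  # Converts atom keys to strings; string keys are returned unchanged.
  defp stringify_keys(map) do
    Map.new(map, fn {key, value} -> {to_string(key), value} end)
  end
end
```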

validate(config)

@spec validate(t()) :: {:ok, t()} | {:error, String.t()}

Validates an ExtractionConfig for correct field types and values.

Ensures that:

  • Boolean fields (use_cache, enable_quality_processing, force_ocr) are actually booleans
  • Format fields (output_format, result_format) are valid enum values
  • Nested configuration fields are maps or nil
  • All values are valid according to the configuration schema

This function is useful for early validation before passing configuration to the extraction functions.

Parameters

  • config - An ExtractionConfig struct to validate

Returns

  • {:ok, config} - If the configuration is valid
  • {:error, reason} - If validation fails, with a descriptive error message

Examples

iex> config = %Kreuzberg.ExtractionConfig{use_cache: true}
iex> Kreuzberg.ExtractionConfig.validate(config)
{:ok, config}

iex> config = %Kreuzberg.ExtractionConfig{output_format: "markdown", result_format: "unified"}
iex> Kreuzberg.ExtractionConfig.validate(config)
{:ok, config}

iex> config = %Kreuzberg.ExtractionConfig{chunking: %{"size" => 1024}}
iex> Kreuzberg.ExtractionConfig.validate(config)
{:ok, config}

iex> config = %Kreuzberg.ExtractionConfig{use_cache: "yes"}
iex> Kreuzberg.ExtractionConfig.validate(config)
{:error, "Field 'use_cache' must be a boolean, got: string"}

iex> config = %Kreuzberg.ExtractionConfig{output_format: "invalid"}
iex> Kreuzberg.ExtractionConfig.validate(config)
{:error, "Field 'output_format' must be one of: plain, text, markdown, md, djot, html, got: invalid"}

iex> config = %Kreuzberg.ExtractionConfig{chunking: "invalid"}
iex> Kreuzberg.ExtractionConfig.validate(config)
{:error, "Field 'chunking' must be a map or nil, got: string"}

iex> config = %Kreuzberg.ExtractionConfig{force_ocr: true, enable_quality_processing: true}
iex> Kreuzberg.ExtractionConfig.validate(config)
{:ok, config}