High-performance document extraction for Elixir.
Examples
# Extract from binary with MIME type
{:ok, result} = Kreuzberg.extract(pdf_binary, "application/pdf")
# With configuration
config = %Kreuzberg.ExtractionConfig{force_ocr: true}
{:ok, result} = Kreuzberg.extract(pdf_binary, "application/pdf", config)
# Bang variant
result = Kreuzberg.extract!(pdf_binary, "application/pdf")
Summary
Functions
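- extract/3 - Extract content from binary document data.
- extract!/3 - Extract content from binary document data, raising on error.
- extract_file/3 - Extract content from a file at the given path.
- extract_file!/3 - Extract content from a file, raising on error.
- extract_with_plugins/4 - Extract content with plugin processing support.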
Functions
@spec extract(binary(), String.t(), Kreuzberg.ExtractionConfig.t() | map() | keyword() | nil) ::
  {:ok, Kreuzberg.ExtractionResult.t()} | {:error, String.t()}
Extract content from binary document data.
Performs document extraction on binary input with support for various file formats. Returns extracted content including text, metadata, tables, images, and more. If no configuration is provided, uses default extraction settings.
Parameters
- input - Binary document data to extract from
- mime_type - MIME type of the document (e.g., "application/pdf", "text/plain")
- config - ExtractionConfig struct, map, keyword list, or nil (optional, defaults to nil)
Returns
- {:ok, ExtractionResult.t()} - Successfully extracted content with metadata
- {:error, reason} - Extraction failed with error message
Examples
# Extract from binary with MIME type
{:ok, result} = Kreuzberg.extract(pdf_binary, "application/pdf")
result.content
# Extract with configuration
config = %Kreuzberg.ExtractionConfig{ocr: %{"enabled" => true}}
{:ok, result} = Kreuzberg.extract(pdf_binary, "application/pdf", config)
# With keyword list configuration
{:ok, result} = Kreuzberg.extract(
  pdf_binary,
  "application/pdf",
  ocr: %{"enabled" => true}
)
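The examples above match only on `{:ok, result}`, which raises a `MatchError` when extraction fails. A minimal sketch of handling both return shapes follows. `ExtractionHandling` and `summarize/1` are hypothetical names, not part of Kreuzberg, and the sketch assumes `result.content` is a string and that the error reason is a string, as the spec suggests.

```elixir
defmodule ExtractionHandling do
  # Hypothetical helper, not part of Kreuzberg: turns the tuple returned
  # by an extraction call into a short human-readable summary.
  def summarize({:ok, result}) do
    # Assumes `result.content` holds the extracted text as a string.
    "extracted #{String.length(result.content)} characters"
  end

  def summarize({:error, reason}) do
    # Assumes `reason` is a string, per the {:error, String.t()} spec.
    "extraction failed: #{reason}"
  end
end
```

With Kreuzberg available, this could be used as `pdf_binary |> Kreuzberg.extract("application/pdf") |> ExtractionHandling.summarize()`.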
@spec extract!(binary(), String.t(), Kreuzberg.ExtractionConfig.t() | map() | keyword() | nil) ::
  Kreuzberg.ExtractionResult.t()
Extract content from binary document data, raising on error.
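The relationship between a tuple-returning function and its bang variant can be sketched as below. This illustrates the general Elixir convention only, not Kreuzberg's actual implementation. `BangSketch`, `unwrap!/1`, and `DemoError` are hypothetical names, and the exception type raised by `extract!/3` is not stated in this section.

```elixir
defmodule BangSketch do
  # Hypothetical stand-in exception used only for this illustration.
  defmodule DemoError do
    defexception [:message]
  end

  # Returns the bare result on success.
  def unwrap!({:ok, result}), do: result

  # Raises on the error branch instead of returning a tuple.
  def unwrap!({:error, reason}), do: raise(DemoError, message: reason)
end
```

The bang form suits scripts and pipelines where a failure should stop execution; the tuple form suits code that needs to recover from a failure.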
@spec extract_file(String.t() | Path.t(), String.t() | nil, Kreuzberg.ExtractionConfig.t() | map() | keyword() | nil) ::
  {:ok, Kreuzberg.ExtractionResult.t()} | {:error, String.t()}
Extract content from a file at the given path.
Accepts a file path and optional MIME type, returning extracted content. If no MIME type is provided, the library will attempt to detect it from the file.
Parameters
- path - File path (String or Path.t())
- mime_type - MIME type of the file (optional, defaults to nil for auto-detection)
- config - ExtractionConfig struct, map, keyword list, or nil with extraction options (optional)
Returns
- {:ok, ExtractionResult.t()} - Successfully extracted content
- {:error, reason} - Extraction failed with error message
Examples
# Extract with explicit MIME type
{:ok, result} = Kreuzberg.extract_file("document.pdf", "application/pdf")
result.content
# Extract with auto-detection
{:ok, result} = Kreuzberg.extract_file("document.pdf")
# With configuration
config = %Kreuzberg.ExtractionConfig{force_ocr: true}
{:ok, result} = Kreuzberg.extract_file("document.pdf", "application/pdf", config)
# With keyword list configuration
{:ok, result} = Kreuzberg.extract_file(
  "document.pdf",
  "application/pdf",
  ocr: %{"enabled" => true}
)
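When several files are processed with auto-detection, it can help to separate successes from failures instead of matching on `{:ok, result}` for each one. The sketch below is one way to do that. `BatchExtract` and `run/2` are hypothetical names, not part of Kreuzberg; the extraction function is passed in as an argument so the helper works with any one-argument function that returns `{:ok, result}` or `{:error, reason}`.

```elixir
defmodule BatchExtract do
  # Hypothetical helper, not part of Kreuzberg. `extract_fun` is any
  # one-argument function returning {:ok, result} or {:error, reason},
  # for example &Kreuzberg.extract_file/1.
  def run(paths, extract_fun) when is_list(paths) and is_function(extract_fun, 1) do
    paths
    |> Enum.map(fn path -> {path, extract_fun.(path)} end)
    |> Enum.reduce(%{ok: [], error: []}, fn
      {path, {:ok, result}}, acc -> %{acc | ok: [{path, result} | acc.ok]}
      {path, {:error, reason}}, acc -> %{acc | error: [{path, reason} | acc.error]}
    end)
    # Entries were prepended, so restore the input order.
    |> Map.new(fn {key, entries} -> {key, Enum.reverse(entries)} end)
  end
end
```

With Kreuzberg available, a call might look like `BatchExtract.run(["a.pdf", "b.docx"], &Kreuzberg.extract_file/1)`; the file names are placeholders.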
@spec extract_file!(String.t() | Path.t(), String.t() | nil, Kreuzberg.ExtractionConfig.t() | map() | keyword() | nil) ::
  Kreuzberg.ExtractionResult.t()
Extract content from a file, raising on error.
Same as extract_file/3 but raises a Kreuzberg.Error exception if extraction fails.
Parameters
- path - File path (String or Path.t())
- mime_type - MIME type of the file (optional, defaults to nil for auto-detection)
- config - ExtractionConfig struct, map, keyword list, or nil with extraction options (optional)
Returns
- ExtractionResult.t() - Successfully extracted content
Raises
- Kreuzberg.Error - If extraction fails
Examples
# Extract with explicit MIME type, raising on error
result = Kreuzberg.extract_file!("document.pdf", "application/pdf")
result.content
# Extract with auto-detection, raising on error
result = Kreuzberg.extract_file!("document.pdf")
result.content
# With configuration
config = %Kreuzberg.ExtractionConfig{ocr: %{"enabled" => true}}
result = Kreuzberg.extract_file!("document.pdf", "application/pdf", config)
@spec extract_with_plugins(binary(), String.t(), Kreuzberg.ExtractionConfig.t() | map() | keyword() | nil, keyword()) ::
  {:ok, Kreuzberg.ExtractionResult.t()} | {:error, String.t()}
Extract content with plugin processing support.
Performs document extraction with additional processing through registered plugins. Applies validators before extraction, post-processors by stage (early, middle, late) after extraction, and optional final validators to the result.
Plugins are retrieved from the Plugin.Registry if not explicitly provided in plugin_opts.
Parameters
- input - Binary document data to extract from
- mime_type - MIME type of the document (e.g., "application/pdf")
- config - ExtractionConfig struct, map, keyword list, or nil for extraction (optional)
- plugin_opts - Keyword list of plugin options (optional):
  - :validators - List of validator modules to run before extraction
  - :post_processors - Map of stage atoms to lists of post-processor modules
    - :early - Applied first to extraction result
    - :middle - Applied after early processors
    - :late - Applied last before final validators
  - :final_validators - List of validator modules to run after post-processing
Returns
- {:ok, ExtractionResult.t()} - Successfully extracted and processed content
- {:error, reason} - Extraction or processing failed with error message
Plugin Processing Flow
- Validators - If specified, run input validators to check extraction preconditions
- Extraction - Call extract/3 to get initial result
- Post-Processors - Apply by stage in order (early → middle → late)
  - Each processor receives the extraction result or output from previous processor
  - Processors should return modified result or data
- Final Validators - If specified, validate the processed result
- Return - Return enhanced extraction result
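The flow above can be sketched as a small pipeline. This is an illustration under stated assumptions, not Kreuzberg's implementation: plugins are modelled as anonymous functions because the callback names of real plugin modules are not shown in this section, validators are assumed to return `:ok` or `{:error, reason}`, and `PluginFlowSketch` is a hypothetical name.

```elixir
defmodule PluginFlowSketch do
  # Illustrative only: plugins are plain functions here, not the
  # validator and post-processor modules that Kreuzberg expects.
  @stages [:early, :middle, :late]

  def run(input, extract_fun, opts \\ []) do
    validators = Keyword.get(opts, :validators, [])
    post_processors = Keyword.get(opts, :post_processors, %{})
    final_validators = Keyword.get(opts, :final_validators, [])

    with :ok <- validate_all(validators, input),
         {:ok, result} <- extract_fun.(input),
         processed = apply_stages(result, post_processors),
         :ok <- validate_all(final_validators, processed) do
      {:ok, processed}
    end
  end

  # Runs validators in order and stops at the first {:error, reason}.
  defp validate_all(validators, value) do
    Enum.reduce_while(validators, :ok, fn validator, :ok ->
      case validator.(value) do
        :ok -> {:cont, :ok}
        {:error, _reason} = error -> {:halt, error}
      end
    end)
  end

  # Applies processors stage by stage (early, middle, late); each
  # processor receives the output of the previous one.
  defp apply_stages(result, post_processors) do
    Enum.reduce(@stages, result, fn stage, acc ->
      post_processors
      |> Map.get(stage, [])
      |> Enum.reduce(acc, fn processor, current -> processor.(current) end)
    end)
  end
end
```

Because the stage order is fixed in `@stages`, the order of keys in the `post_processors` map does not affect the result, and a failing validator short-circuits the remaining steps.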
Examples
# Extract with explicitly provided validators and post-processors
{:ok, result} = Kreuzberg.extract_with_plugins(
  pdf_binary,
  "application/pdf",
  nil,
  validators: [MyApp.InputValidator],
  post_processors: %{
    early: [MyApp.EarlyProcessor],
    middle: [MyApp.MiddleProcessor],
    late: [MyApp.FinalProcessor]
  },
  final_validators: [MyApp.ResultValidator]
)
# Extract with only post-processors
{:ok, result} = Kreuzberg.extract_with_plugins(
  pdf_binary,
  "application/pdf",
  %{use_cache: true},
  post_processors: %{
    early: [MyApp.Processor1, MyApp.Processor2]
  }
)
# Extract with configuration and validators only
config = %Kreuzberg.ExtractionConfig{ocr: %{"enabled" => true}}
{:ok, result} = Kreuzberg.extract_with_plugins(
  pdf_binary,
  "application/pdf",
  config,
  validators: [MyApp.Validator]
)
# Extract with no plugins (standard extraction)
{:ok, result} = Kreuzberg.extract_with_plugins(pdf_binary, "application/pdf")