Extract text, tables, images, and metadata from 56 file formats, including PDF, Office documents, and images. Elixir bindings with native BEAM concurrency, OTP integration, and an idiomatic Elixir API.

Installation

Package Installation

Add to your mix.exs dependencies:

def deps do
  [
    {:kreuzberg, "~> 4.0"}
  ]
end

Then run:

mix deps.get

System Requirements

  • Elixir 1.12+ and Erlang/OTP 24+ required
  • Optional: ONNX Runtime version 1.22.x for embeddings support
  • Optional: Tesseract OCR for OCR functionality
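
Because Tesseract is optional, it can help to check for it before enabling OCR. The snippet below is a minimal sketch that only looks for a `tesseract` executable on the `PATH` using the standard library. The module name `MyApp.OcrCheck` is hypothetical, and the check says nothing about how Kreuzberg itself locates Tesseract.

```elixir
defmodule MyApp.OcrCheck do
  @moduledoc "Hypothetical helper: checks whether an executable is on the PATH."

  # Returns true when the named executable can be found on the PATH.
  def available?(executable \\ "tesseract") do
    System.find_executable(executable) != nil
  end
end

if MyApp.OcrCheck.available?() do
  IO.puts("Tesseract found; OCR can be enabled.")
else
  IO.puts("Tesseract not found; install it before enabling OCR.")
end
```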

Quick Start

Basic Extraction

Extract text, metadata, and structure from any supported document format:

```elixir
# Basic document extraction workflow
# Load file -> extract -> access results
{:ok, result} = Kreuzberg.extract_file("document.pdf")

IO.puts("Extracted Content:")
IO.puts(result.content)

IO.puts("\nMetadata:")
IO.puts("Format: #{inspect(result.metadata.format_type)}")
IO.puts("Tables found: #{length(result.tables)}")
```

Common Use Cases

Extract with Custom Configuration

Most use cases benefit from configuration to control extraction behavior:

With OCR (for scanned documents):

```elixir
alias Kreuzberg.ExtractionConfig

config = %ExtractionConfig{
  ocr: %{"enabled" => true, "backend" => "tesseract"}
}

{:ok, result} = Kreuzberg.extract_file("scanned_document.pdf", nil, config)

content = result.content
IO.puts("OCR Extracted content:")
IO.puts(content)
IO.puts("Metadata: #{inspect(result.metadata)}")
```

Table Extraction

See Table Extraction Guide for detailed examples.

Processing Multiple Files

```elixir
file_paths = ["document1.pdf", "document2.pdf", "document3.pdf"]

{:ok, results} = Kreuzberg.batch_extract_files(file_paths)

Enum.each(results, fn result ->
  IO.puts("MIME type: #{result.mime_type}")
  # byte_size/1 counts bytes, not characters
  IO.puts("Content length: #{byte_size(result.content)} bytes")
  IO.puts("Tables: #{length(result.tables)}")
  IO.puts("---")
end)

IO.puts("Total files processed: #{length(results)}")
```

Async Processing

For non-blocking document processing, run extraction in a separate BEAM process (for example with `Task`). The call itself returns `{:ok, result}` or `{:error, reason}`, shown here with explicit error handling:

```elixir
# Extract from different file types (PDF, DOCX, etc.)
case Kreuzberg.extract_file("document.pdf") do
  {:ok, result} ->
    IO.puts("Content: #{result.content}")
    IO.puts("Format: #{inspect(result.metadata.format_type)}")
    IO.puts("Tables: #{length(result.tables)}")

  {:error, reason} ->
    IO.puts("Extraction failed: #{inspect(reason)}")
end
```
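
One way to make the call non-blocking is to wrap it in a `Task` from the standard library. The sketch below takes the extraction function as an argument so that it runs without Kreuzberg installed; in a real project you would presumably pass `&Kreuzberg.extract_file/1`. `MyApp.AsyncExtract` and the stub extractor are hypothetical names used only for illustration.

```elixir
defmodule MyApp.AsyncExtract do
  @moduledoc "Hypothetical helper: runs an extraction function in a separate process."

  # Starts the extraction in a Task and returns immediately.
  # `extract` is any 1-arity function returning {:ok, result} | {:error, reason}.
  def start(path, extract) when is_function(extract, 1) do
    Task.async(fn -> extract.(path) end)
  end

  # Waits up to `timeout` milliseconds, then shuts the task down.
  def await(task, timeout \\ 30_000) do
    case Task.yield(task, timeout) || Task.shutdown(task) do
      {:ok, reply} -> reply
      {:exit, reason} -> {:error, {:exit, reason}}
      nil -> {:error, :timeout}
    end
  end
end

# Stub standing in for the real extractor (an assumption for this sketch).
stub_extract = fn path -> {:ok, %{content: "text from " <> path}} end

task = MyApp.AsyncExtract.start("document.pdf", stub_extract)
# ... other work can happen here while the extraction runs ...
{:ok, result} = MyApp.AsyncExtract.await(task)
IO.puts(result.content)
```

`Task.yield/2` followed by `Task.shutdown/1` is used instead of `Task.await/2` so that a slow extraction yields `{:error, :timeout}` rather than crashing the caller.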

Features

Supported File Formats (56+)

56 file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.

Office Documents

| Category | Formats | Capabilities |
|----------|---------|--------------|
| Word Processing | .docx, .odt | Full text, tables, images, metadata, styles |
| Spreadsheets | .xlsx, .xlsm, .xlsb, .xls, .xla, .xlam, .xltm, .ods | Sheet data, formulas, cell metadata, charts |
| Presentations | .pptx, .ppt, .ppsx | Slides, speaker notes, images, metadata |
| PDF | .pdf | Text, tables, images, metadata, OCR support |
| eBooks | .epub, .fb2 | Chapters, metadata, embedded resources |

Images (OCR-Enabled)

| Category | Formats | Features |
|----------|---------|----------|
| Raster | .png, .jpg, .jpeg, .gif, .webp, .bmp, .tiff, .tif | OCR, table detection, EXIF metadata, dimensions, color space |
| Advanced | .jp2, .jpx, .jpm, .mj2, .pnm, .pbm, .pgm, .ppm | OCR, table detection, format-specific metadata |
| Vector | .svg | DOM parsing, embedded text, graphics metadata |

Web & Data

| Category | Formats | Features |
|----------|---------|----------|
| Markup | .html, .htm, .xhtml, .xml, .svg | DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
| Structured Data | .json, .yaml, .yml, .toml, .csv, .tsv | Schema detection, nested structures, validation |
| Text & Markdown | .txt, .md, .markdown, .rst, .org, .rtf | CommonMark, GFM, reStructuredText, Org Mode |

Email & Archives

| Category | Formats | Features |
|----------|---------|----------|
| Email | .eml, .msg | Headers, body (HTML/plain), attachments, threading |
| Archives | .zip, .tar, .tgz, .gz, .7z | File listing, nested archives, metadata |

Academic & Scientific

| Category | Formats | Features |
|----------|---------|----------|
| Citations | .bib, .biblatex, .ris, .enw, .csl | Bibliography parsing, citation extraction |
| Scientific | .tex, .latex, .typst, .jats, .ipynb, .docbook | LaTeX, Jupyter notebooks, PubMed JATS |
| Documentation | .opml, .pod, .mdoc, .troff | Technical documentation formats |

See the Complete Format Reference for the full list.

Key Capabilities

  • Text Extraction - Extract all text content with position and formatting information

  • Metadata Extraction - Retrieve document properties, creation date, author, etc.

  • Table Extraction - Parse tables with structure and cell content preservation

  • Image Extraction - Extract embedded images and render page previews

  • OCR Support - Integrate multiple OCR backends for scanned documents

  • Async Processing - Non-blocking document processing with concurrent operations

  • Plugin System - Extensible post-processing for custom text transformation

  • Embeddings - Generate vector embeddings using ONNX Runtime models

  • Batch Processing - Efficiently process multiple documents in parallel

  • Memory Efficient - Stream large files without loading entirely into memory

  • Language Detection - Detect and support multiple languages in documents

  • Configuration - Fine-grained control over extraction behavior

Performance Characteristics

| Format | Speed | Memory | Notes |
|--------|-------|--------|-------|
| PDF (text) | 10-100 MB/s | ~50 MB per doc | Fastest extraction |
| Office docs | 20-200 MB/s | ~100 MB per doc | DOCX, XLSX, PPTX |
| Images (OCR) | 1-5 MB/s | Variable | Depends on OCR backend |
| Archives | 5-50 MB/s | ~200 MB per doc | ZIP, TAR, etc. |
| Web formats | 50-200 MB/s | Streaming | HTML, XML, JSON |
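
As a rough illustration of how to read these figures, the sketch below converts a file size and a throughput range into an estimated time range. The throughput numbers are taken from the table above; the helper `MyApp.Estimate` is hypothetical, and real timings depend on hardware and document content.

```elixir
defmodule MyApp.Estimate do
  @moduledoc "Hypothetical helper: back-of-the-envelope extraction time estimates."

  # Given a size in MB and a {slow, fast} throughput range in MB/s,
  # returns {best_case_seconds, worst_case_seconds}.
  def seconds(size_mb, {slow_mb_s, fast_mb_s})
      when slow_mb_s > 0 and fast_mb_s >= slow_mb_s do
    {size_mb / fast_mb_s, size_mb / slow_mb_s}
  end
end

# 200 MB of scanned images at the table's 1-5 MB/s OCR range
{best, worst} = MyApp.Estimate.seconds(200, {1, 5})
IO.puts("OCR estimate: between #{best} and #{worst} seconds")
```

For this input the estimate spans 40 to 200 seconds, which shows how much wider the OCR range is than the ranges for text-based formats.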

OCR Support

Kreuzberg supports multiple OCR backends for extracting text from scanned documents and images:

  • Tesseract

OCR Configuration Example

```elixir
alias Kreuzberg.ExtractionConfig

config = %ExtractionConfig{
  ocr: %{"enabled" => true, "backend" => "tesseract"}
}

{:ok, result} = Kreuzberg.extract_file("scanned_document.pdf", nil, config)

content = result.content
IO.puts("OCR Extracted content:")
IO.puts(content)
IO.puts("Metadata: #{inspect(result.metadata)}")
```

Async Support

This binding supports non-blocking document processing. Elixir has no `async`/`await` keywords, so concurrency is expressed with BEAM processes rather than language-level awaits. The example below shows the basic call with explicit error handling:

```elixir
# Extract from different file types (PDF, DOCX, etc.)
case Kreuzberg.extract_file("document.pdf") do
  {:ok, result} ->
    IO.puts("Content: #{result.content}")
    IO.puts("Format: #{inspect(result.metadata.format_type)}")
    IO.puts("Tables: #{length(result.tables)}")

  {:error, reason} ->
    IO.puts("Extraction failed: #{inspect(reason)}")
end
```

Plugin System

Kreuzberg supports extensible post-processing plugins for custom text transformation and filtering.

For detailed plugin documentation, visit Plugin System Guide.

Plugin Example

```elixir
alias Kreuzberg.Plugin

# Word Count Post-Processor Plugin
# This post-processor automatically counts words in extracted content
# and adds the word count to the metadata.
defmodule MyApp.Plugins.WordCountProcessor do
  @behaviour Kreuzberg.Plugin.PostProcessor
  require Logger

  @impl true
  def name do
    "WordCountProcessor"
  end

  @impl true
  def processing_stage do
    :post
  end

  @impl true
  def version do
    "1.0.0"
  end

  @impl true
  def initialize do
    :ok
  end

  @impl true
  def shutdown do
    :ok
  end

  @impl true
  def process(result, _options) do
    content = result["content"] || ""

    word_count =
      content
      |> String.split(~r/\s+/, trim: true)
      |> length()

    # Update metadata with word count
    metadata = Map.get(result, "metadata", %{})
    updated_metadata = Map.put(metadata, "word_count", word_count)

    {:ok, Map.put(result, "metadata", updated_metadata)}
  end
end

# Register the word count post-processor
Plugin.register_post_processor(:word_count_processor, MyApp.Plugins.WordCountProcessor)

# Example usage
result = %{
  "content" => "The quick brown fox jumps over the lazy dog. This is a sample document with multiple words.",
  "metadata" => %{
    "source" => "document.pdf",
    "pages" => 1
  }
}

case MyApp.Plugins.WordCountProcessor.process(result, %{}) do
  {:ok, processed_result} ->
    word_count = processed_result["metadata"]["word_count"]
    IO.puts("Word count added: #{word_count} words")
    IO.inspect(processed_result, label: "Processed Result")

  {:error, reason} ->
    IO.puts("Processing failed: #{reason}")
end

# List all registered post-processors
{:ok, processors} = Plugin.list_post_processors()
IO.inspect(processors, label: "Registered Post-Processors")
```
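
To illustrate how several transformation and filtering steps might be combined, here is a minimal sketch that threads a result map through a list of functions and stops at the first error. It uses only the standard library and does not touch the plugin registry. `MyApp.PostPipeline` and its step functions are hypothetical, and the string-keyed result shape is assumed from the example above.

```elixir
defmodule MyApp.PostPipeline do
  @moduledoc "Hypothetical helper: applies post-processing steps in order."

  # Each step is a 1-arity function returning {:ok, result} | {:error, reason}.
  # Processing halts at the first error.
  def run(result, steps) do
    Enum.reduce_while(steps, {:ok, result}, fn step, {:ok, acc} ->
      case step.(acc) do
        {:ok, next} -> {:cont, {:ok, next}}
        {:error, reason} -> {:halt, {:error, reason}}
      end
    end)
  end

  # Transformation step: collapse runs of whitespace in the content.
  def normalize_whitespace(result) do
    content =
      result
      |> Map.get("content", "")
      |> String.split()
      |> Enum.join(" ")

    {:ok, Map.put(result, "content", content)}
  end

  # Filtering step: reject results whose content is empty.
  def require_content(%{"content" => ""}), do: {:error, :empty_content}
  def require_content(result), do: {:ok, result}
end

steps = [
  &MyApp.PostPipeline.normalize_whitespace/1,
  &MyApp.PostPipeline.require_content/1
]

{:ok, cleaned} = MyApp.PostPipeline.run(%{"content" => "  Hello \n  world  "}, steps)
IO.inspect(cleaned["content"])
```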

Embeddings Support

Generate vector embeddings for extracted text using the built-in ONNX Runtime support. Requires an ONNX Runtime installation (version 1.22.x; see System Requirements).

See the Embeddings Guide for details.

Batch Processing

Process multiple documents efficiently:

```elixir
file_paths = ["document1.pdf", "document2.pdf", "document3.pdf"]

{:ok, results} = Kreuzberg.batch_extract_files(file_paths)

Enum.each(results, fn result ->
  IO.puts("MIME type: #{result.mime_type}")
  # byte_size/1 counts bytes, not characters
  IO.puts("Content length: #{byte_size(result.content)} bytes")
  IO.puts("Tables: #{length(result.tables)}")
  IO.puts("---")
end)

IO.puts("Total files processed: #{length(results)}")
```
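
When per-file error handling or a cap on concurrency is needed, one option is to drive single-file extraction through `Task.async_stream/3`. This is a sketch of that alternative, not a description of how `batch_extract_files` works internally. The extraction function is passed in so the example runs without Kreuzberg; `MyApp.ConcurrentExtract` and the stub are hypothetical.

```elixir
defmodule MyApp.ConcurrentExtract do
  @moduledoc "Hypothetical helper: extracts many files with bounded concurrency."

  # Returns a list of {path, outcome} tuples in input order. `outcome` is
  # whatever `extract` returned, or {:error, {:exit, reason}} when a task
  # times out and is killed.
  def run(paths, extract, opts \\ []) when is_list(paths) and is_function(extract, 1) do
    max = Keyword.get(opts, :max_concurrency, System.schedulers_online())
    timeout = Keyword.get(opts, :timeout, 60_000)

    outcomes =
      paths
      |> Task.async_stream(extract,
        max_concurrency: max,
        timeout: timeout,
        on_timeout: :kill_task
      )
      |> Enum.map(fn
        {:ok, outcome} -> outcome
        {:exit, reason} -> {:error, {:exit, reason}}
      end)

    Enum.zip(paths, outcomes)
  end
end

# Stub standing in for the real extractor (an assumption for this sketch).
stub_extract = fn
  "missing.pdf" -> {:error, :enoent}
  path -> {:ok, %{content: "text from " <> path}}
end

for {path, outcome} <- MyApp.ConcurrentExtract.run(["a.pdf", "missing.pdf"], stub_extract, max_concurrency: 2) do
  case outcome do
    {:ok, result} -> IO.puts("#{path}: #{byte_size(result.content)} bytes")
    {:error, reason} -> IO.puts("#{path}: failed (#{inspect(reason)})")
  end
end
```

Pairing each path with its own outcome means one failed file does not discard the results of the others.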

Configuration

For advanced configuration options, including language detection, table extraction, and OCR settings, see the Configuration Guide.

Contributing

Contributions are welcome! See Contributing Guide.

License

MIT License - see LICENSE file for details.
