Extract text, tables, images, and metadata from 88+ file formats including PDF, Office documents, and images. Elixir bindings with native BEAM concurrency, OTP integration, and idiomatic Elixir API.
## Installation

### Package Installation

Add to your `mix.exs` dependencies:
```elixir
def deps do
  [
    kreuzberg: "~> 4.4"
  ]
end
```

Then run:

```shell
mix deps.get
```
### System Requirements
- Elixir 1.12+ and Erlang/OTP 24+ required
- Optional: ONNX Runtime version 1.22.x for embeddings support
- Optional: Tesseract OCR for OCR functionality
## Quick Start

### Basic Extraction
Extract text, metadata, and structure from any supported document format:
```elixir
# Basic document extraction workflow:
# load file -> extract -> access results
{:ok, result} = Kreuzberg.extract_file("document.pdf")

IO.puts("Extracted Content:")
IO.puts(result.content)

IO.puts("\nMetadata:")
IO.puts("Format: #{inspect(result.metadata.format_type)}")
IO.puts("Tables found: #{length(result.tables)}")
```
### Common Use Cases
#### Extract with Custom Configuration
Most use cases benefit from configuration to control extraction behavior:
With OCR (for scanned documents):
```elixir
alias Kreuzberg.ExtractionConfig

config = %ExtractionConfig{
  ocr: %{"enabled" => true, "backend" => "tesseract"}
}

{:ok, result} = Kreuzberg.extract_file("scanned_document.pdf", nil, config)

IO.puts("OCR extracted content:")
IO.puts(result.content)
IO.puts("Metadata: #{inspect(result.metadata)}")
```
#### Table Extraction
See Table Extraction Guide for detailed examples.
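As a rough sketch of working with extracted table data (the shape here is hypothetical — a table as a plain list of rows — so check the Table Extraction Guide for the actual struct), rendering rows as a Markdown table is ordinary Elixir:

```elixir
# Hypothetical table shape: a list of rows, each row a list of cell strings.
# The real Kreuzberg table struct may differ - see the Table Extraction Guide.
defmodule TableDemo do
  @doc "Render a list-of-rows table as a Markdown table (first row = header)."
  def to_markdown([header | rows]) do
    separator = Enum.map(header, fn _ -> "---" end)

    [header, separator | rows]
    |> Enum.map(fn row -> "| " <> Enum.join(row, " | ") <> " |" end)
    |> Enum.join("\n")
  end
end

table = [["Name", "Qty"], ["Widget", "3"], ["Gadget", "7"]]
IO.puts(TableDemo.to_markdown(table))
```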
#### Processing Multiple Files
```elixir
file_paths = ["document1.pdf", "document2.pdf", "document3.pdf"]

{:ok, results} = Kreuzberg.batch_extract_files(file_paths)

Enum.each(results, fn result ->
  IO.puts("File: #{result.mime_type}")
  IO.puts("Content length: #{byte_size(result.content)} bytes")
  IO.puts("Tables: #{length(result.tables)}")
  IO.puts("---")
end)

IO.puts("Total files processed: #{length(results)}")
```
#### Async Processing
For non-blocking document processing:
```elixir
# Run extraction in a background task so the calling process is not blocked.
task = Task.async(fn -> Kreuzberg.extract_file("document.pdf") end)

# ... do other work while extraction runs ...

case Task.await(task, 30_000) do
  {:ok, result} ->
    IO.puts("Content: #{result.content}")
    IO.puts("Format: #{result.metadata.format_type}")
    IO.puts("Tables: #{length(result.tables)}")

  {:error, reason} ->
    IO.puts("Extraction failed: #{inspect(reason)}")
end
```
## Next Steps
- Installation Guide - Platform-specific setup
- API Documentation - Complete API reference
- Examples & Guides - Full code examples and usage guides
- Configuration Guide - Advanced configuration options
## Features

### Supported File Formats (88+)

Kreuzberg handles 88+ file formats across 8 major categories, with intelligent format detection and comprehensive metadata extraction.
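To illustrate the idea of routing files by format (this is a toy sketch, not Kreuzberg's actual detection logic, which is internal to the library), a minimal extension-based dispatcher might look like:

```elixir
# Toy extension-based format routing. Kreuzberg's real detection is
# more sophisticated; this only illustrates the concept.
defmodule FormatDemo do
  @extensions %{
    ".pdf" => :pdf,
    ".docx" => :office,
    ".png" => :image,
    ".html" => :web
  }

  def detect(path) do
    Map.get(@extensions, Path.extname(path), :unknown)
  end
end

IO.inspect(FormatDemo.detect("report.pdf"))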
#### Office Documents
| Category | Formats | Capabilities |
|---|---|---|
| Word Processing | .docx, .docm, .dotx, .dotm, .dot, .odt | Full text, tables, images, metadata, styles |
| Spreadsheets | .xlsx, .xlsm, .xlsb, .xls, .xla, .xlam, .xltm, .xltx, .xlt, .ods | Sheet data, formulas, cell metadata, charts |
| Presentations | .pptx, .pptm, .ppsx, .potx, .potm, .pot, .ppt | Slides, speaker notes, images, metadata |
| PDF | .pdf | Text, tables, images, metadata, OCR support |
| eBooks | .epub, .fb2 | Chapters, metadata, embedded resources |
| Database | .dbf | Table data extraction, field type support |
| Hangul | .hwp, .hwpx | Korean document format, text extraction |
#### Images (OCR-Enabled)
| Category | Formats | Features |
|---|---|---|
| Raster | .png, .jpg, .jpeg, .gif, .webp, .bmp, .tiff, .tif | OCR, table detection, EXIF metadata, dimensions, color space |
| Advanced | .jp2, .jpx, .jpm, .mj2, .jbig2, .jb2, .pnm, .pbm, .pgm, .ppm | OCR via hayro-jpeg2000 (pure Rust decoder), JBIG2 support, table detection, format-specific metadata |
| Vector | .svg | DOM parsing, embedded text, graphics metadata |
#### Web & Data
| Category | Formats | Features |
|---|---|---|
| Markup | .html, .htm, .xhtml, .xml, .svg | DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
| Structured Data | .json, .yaml, .yml, .toml, .csv, .tsv | Schema detection, nested structures, validation |
| Text & Markdown | .txt, .md, .markdown, .djot, .rst, .org, .rtf | CommonMark, GFM, Djot, reStructuredText, Org Mode |
#### Email & Archives
| Category | Formats | Features |
|---|---|---|
| Email | .eml, .msg | Headers, body (HTML/plain), attachments, threading |
| Archives | .zip, .tar, .tgz, .gz, .7z | File listing, nested archives, metadata |
#### Academic & Scientific
| Category | Formats | Features |
|---|---|---|
| Citations | .bib, .biblatex, .ris, .nbib, .enw, .csl | Structured parsing: BibTeX, RIS, PubMed/MEDLINE, EndNote, CSL JSON |
| Scientific | .tex, .latex, .typst, .jats, .ipynb, .docbook | LaTeX, Typst, Jupyter notebooks, PubMed JATS, DocBook |
| Documentation | .opml, .pod, .mdoc, .troff | Technical documentation formats |
### Key Capabilities

- **Text Extraction** - Extract all text content with position and formatting information
- **Metadata Extraction** - Retrieve document properties, creation date, author, etc.
- **Table Extraction** - Parse tables with structure and cell content preservation
- **Image Extraction** - Extract embedded images and render page previews
- **OCR Support** - Integrate multiple OCR backends for scanned documents
- **Async/Await** - Non-blocking document processing with concurrent operations
- **Plugin System** - Extensible post-processing for custom text transformation
- **Embeddings** - Generate vector embeddings using ONNX Runtime models
- **Batch Processing** - Efficiently process multiple documents in parallel
- **Memory Efficient** - Stream large files without loading them entirely into memory
- **Language Detection** - Detect and support multiple languages in documents
- **Configuration** - Fine-grained control over extraction behavior
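Parallel batch processing maps naturally onto `Task.async_stream` from the Elixir standard library. A minimal sketch (using a stand-in function where real code would call `Kreuzberg.extract_file/1`):

```elixir
# Sketch of concurrent per-file processing on the BEAM.
# `process_file` is a stand-in; in real code you would call
# Kreuzberg.extract_file/1 here instead.
process_file = fn path -> {:ok, String.upcase(path)} end

paths = ["a.pdf", "b.pdf", "c.pdf"]

results =
  paths
  |> Task.async_stream(process_file, max_concurrency: 4, timeout: 30_000)
  |> Enum.map(fn {:ok, result} -> result end)

IO.inspect(results)
```

`max_concurrency` bounds the number of files processed at once, which keeps memory predictable on large batches.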
### Performance Characteristics
| Format | Speed | Memory | Notes |
|---|---|---|---|
| PDF (text) | 10-100 MB/s | ~50MB per doc | Fastest extraction |
| Office docs | 20-200 MB/s | ~100MB per doc | DOCX, XLSX, PPTX |
| Images (OCR) | 1-5 MB/s | Variable | Depends on OCR backend |
| Archives | 5-50 MB/s | ~200MB per doc | ZIP, TAR, etc. |
| Web formats | 50-200 MB/s | Streaming | HTML, XML, JSON |
## OCR Support

Kreuzberg supports multiple OCR backends for extracting text from scanned documents and images:

- Tesseract
- PaddleOCR

### OCR Configuration Example
```elixir
alias Kreuzberg.ExtractionConfig

config = %ExtractionConfig{
  ocr: %{"enabled" => true, "backend" => "tesseract"}
}

{:ok, result} = Kreuzberg.extract_file("scanned_document.pdf", nil, config)

IO.puts("OCR extracted content:")
IO.puts(result.content)
IO.puts("Metadata: #{inspect(result.metadata)}")
```
## Async Support
This binding supports non-blocking document processing through standard BEAM concurrency primitives such as `Task`:

```elixir
# Run extraction in a background task so the calling process is not blocked.
task = Task.async(fn -> Kreuzberg.extract_file("document.pdf") end)

# ... do other work while extraction runs ...

case Task.await(task, 30_000) do
  {:ok, result} ->
    IO.puts("Content: #{result.content}")
    IO.puts("Format: #{result.metadata.format_type}")
    IO.puts("Tables: #{length(result.tables)}")

  {:error, reason} ->
    IO.puts("Extraction failed: #{inspect(reason)}")
end
```
## Plugin System
Kreuzberg supports extensible post-processing plugins for custom text transformation and filtering.
For detailed plugin documentation, visit Plugin System Guide.
### Plugin Example
```elixir
alias Kreuzberg.Plugin

# Word Count Post-Processor Plugin
#
# This post-processor automatically counts words in extracted content
# and adds the word count to the metadata.
defmodule MyApp.Plugins.WordCountProcessor do
  @behaviour Kreuzberg.Plugin.PostProcessor

  require Logger

  @impl true
  def name, do: "WordCountProcessor"

  @impl true
  def processing_stage, do: :post

  @impl true
  def version, do: "1.0.0"

  @impl true
  def initialize, do: :ok

  @impl true
  def shutdown, do: :ok

  @impl true
  def process(result, _options) do
    content = result["content"] || ""

    word_count =
      content
      |> String.split(~r/\s+/, trim: true)
      |> length()

    # Update metadata with the word count
    metadata = Map.get(result, "metadata", %{})
    updated_metadata = Map.put(metadata, "word_count", word_count)

    {:ok, Map.put(result, "metadata", updated_metadata)}
  end
end

# Register the word count post-processor
Plugin.register_post_processor(:word_count_processor, MyApp.Plugins.WordCountProcessor)

# Example usage
result = %{
  "content" => "The quick brown fox jumps over the lazy dog. This is a sample document with multiple words.",
  "metadata" => %{
    "source" => "document.pdf",
    "pages" => 1
  }
}

case MyApp.Plugins.WordCountProcessor.process(result, %{}) do
  {:ok, processed_result} ->
    word_count = processed_result["metadata"]["word_count"]
    IO.puts("Word count added: #{word_count} words")
    IO.inspect(processed_result, label: "Processed Result")

  {:error, reason} ->
    IO.puts("Processing failed: #{reason}")
end

# List all registered post-processors
{:ok, processors} = Plugin.list_post_processors()
IO.inspect(processors, label: "Registered Post-Processors")
```
## Embeddings Support
Generate vector embeddings for extracted text using the built-in ONNX Runtime support. Requires ONNX Runtime installation.
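How embeddings are produced depends on the configured ONNX model, but once you have two embedding vectors (assumed here to be plain lists of floats, which is an assumption about the return shape), comparing them is ordinary Elixir. A minimal cosine-similarity sketch:

```elixir
# Cosine similarity between two embedding vectors (lists of floats).
# The vectors below are made-up examples, not real model output.
defmodule EmbeddingDemo do
  def cosine_similarity(a, b) do
    dot =
      a
      |> Enum.zip(b)
      |> Enum.map(fn {x, y} -> x * y end)
      |> Enum.sum()

    norm = fn v -> :math.sqrt(Enum.reduce(v, 0.0, fn x, acc -> acc + x * x end)) end

    dot / (norm.(a) * norm.(b))
  end
end

IO.inspect(EmbeddingDemo.cosine_similarity([1.0, 0.0], [1.0, 0.0]))
```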
## Batch Processing
Process multiple documents efficiently:
```elixir
file_paths = ["document1.pdf", "document2.pdf", "document3.pdf"]

{:ok, results} = Kreuzberg.batch_extract_files(file_paths)

Enum.each(results, fn result ->
  IO.puts("File: #{result.mime_type}")
  IO.puts("Content length: #{byte_size(result.content)} bytes")
  IO.puts("Tables: #{length(result.tables)}")
  IO.puts("---")
end)

IO.puts("Total files processed: #{length(results)}")
```
## Configuration

For advanced configuration options including language detection, table extraction, OCR settings, and more, see the Documentation.
## Contributing

Contributions are welcome! See the Contributing Guide.

## License

MIT License - see the LICENSE file for details.

## Support
- Discord Community: Join our Discord
- GitHub Issues: Report bugs
- Discussions: Ask questions