High-performance document extraction for Elixir.
Examples
# Extract from binary with MIME type
{:ok, result} = Kreuzberg.extract(pdf_binary, "application/pdf")
# With configuration
config = %Kreuzberg.ExtractionConfig{force_ocr: true}
{:ok, result} = Kreuzberg.extract(pdf_binary, "application/pdf", config)
# Bang variant
result = Kreuzberg.extract!(pdf_binary, "application/pdf")
Summary
Functions
Generate text embeddings for a list of strings.
Generate text embeddings, raising on error.
Extract content from binary document data.
Extract content, raising on error.
Extract content from a file at the given path.
Extract content from a file, raising on error.
Extract content with plugin processing support.
Render a single PDF page as a PNG image.
Return a lazy Stream that yields {page_index, png_binary} tuples.
Functions
Generate text embeddings for a list of strings.
Parameters
- texts - List of strings to embed
- config - EmbeddingConfig struct or nil
Returns
- {:ok, [[float()]]} - List of embedding vectors
- {:error, reason} - Embedding failed
Examples
# Embed with default config (balanced preset)
iex> {:ok, embeddings} = Kreuzberg.embed(["Hello world", "How are you?"])
iex> length(embeddings) == 2
true
# Embed with a specific preset
iex> config = %Kreuzberg.EmbeddingConfig{model: {:preset, "fast"}}
iex> {:ok, embeddings} = Kreuzberg.embed(["Hello world"], config)
iex> is_list(hd(embeddings))
true
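Since each embedding is returned as a plain list of floats, vectors can be compared with ordinary list operations. The following is a minimal sketch of cosine similarity; the `EmbeddingMath` module and its function are hypothetical helpers for illustration and are not part of Kreuzberg.

```elixir
defmodule EmbeddingMath do
  @moduledoc "Hypothetical helper for comparing embedding vectors; not part of Kreuzberg."

  # Cosine similarity between two equal-length lists of floats.
  # Returns 0.0 when either vector has zero length (norm), to avoid dividing by zero.
  def cosine_similarity(a, b) when length(a) == length(b) do
    dot = a |> Enum.zip(b) |> Enum.reduce(0.0, fn {x, y}, acc -> acc + x * y end)
    norm_a = :math.sqrt(Enum.reduce(a, 0.0, fn x, acc -> acc + x * x end))
    norm_b = :math.sqrt(Enum.reduce(b, 0.0, fn x, acc -> acc + x * x end))

    if norm_a == 0.0 or norm_b == 0.0 do
      0.0
    else
      dot / (norm_a * norm_b)
    end
  end
end

# With real data, the two vectors would come from the list returned by Kreuzberg.embed/2.
EmbeddingMath.cosine_similarity([1.0, 0.0], [1.0, 0.0])
```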
Generate text embeddings, raising on error.
Same as do_embed/2 but raises a Kreuzberg.Error on failure.
Examples
# Embed and get results directly
iex> embeddings = Kreuzberg.embed!(["Hello world"])
iex> is_list(embeddings)
true
# Each embedding is a list of floats
iex> [vector | _rest] = Kreuzberg.embed!(["Test sentence"])
iex> is_float(hd(vector))
true
See Kreuzberg.do_embed/2.
@spec extract( binary(), String.t(), Kreuzberg.ExtractionConfig.t() | map() | keyword() | nil ) :: {:ok, Kreuzberg.ExtractionResult.t()} | {:error, String.t()}
Extract content from binary document data.
Performs document extraction on binary input with support for various file formats. Returns extracted content including text, metadata, tables, images, and more. If no configuration is provided, uses default extraction settings.
Parameters
- input - Binary document data to extract from
- mime_type - MIME type of the document (e.g., "application/pdf", "text/plain")
- config - ExtractionConfig struct, map, keyword list, or nil (optional, defaults to nil)
Returns
- {:ok, ExtractionResult.t()} - Successfully extracted content with metadata
- {:error, reason} - Extraction failed with error message
Examples
# Extract from binary with MIME type
{:ok, result} = Kreuzberg.extract(pdf_binary, "application/pdf")
result.content
# Extract with configuration
config = %Kreuzberg.ExtractionConfig{ocr: %{"enabled" => true}}
{:ok, result} = Kreuzberg.extract(pdf_binary, "application/pdf", config)
# With keyword list configuration
{:ok, result} = Kreuzberg.extract(
  pdf_binary,
  "application/pdf",
  ocr: %{"enabled" => true}
)
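The examples above match only on `{:ok, result}`, which raises a `MatchError` if extraction fails. A minimal sketch of handling both documented return shapes follows; `ExtractionHandling` is a hypothetical helper, and the tuples passed to it here are hand-written stand-ins for what `Kreuzberg.extract/3` returns.

```elixir
defmodule ExtractionHandling do
  @moduledoc "Hypothetical helper; not part of Kreuzberg."

  # Success: any map or struct with a binary :content field matches,
  # which is the field the examples above read from the result.
  def summarize({:ok, %{content: content}}) when is_binary(content) do
    "extracted #{String.length(content)} characters"
  end

  # Failure: the spec above documents the reason as a string.
  def summarize({:error, reason}) do
    "extraction failed: #{reason}"
  end
end

# In real use the tuple would come from the library, for example:
#   pdf_binary |> Kreuzberg.extract("application/pdf") |> ExtractionHandling.summarize()
ExtractionHandling.summarize({:ok, %{content: "Hello"}})
ExtractionHandling.summarize({:error, "unsupported format"})
```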
@spec extract!( binary(), String.t(), Kreuzberg.ExtractionConfig.t() | map() | keyword() | nil ) :: Kreuzberg.ExtractionResult.t()
Extract content, raising on error.
@spec extract_file( String.t() | Path.t(), String.t() | nil, Kreuzberg.ExtractionConfig.t() | map() | keyword() | nil ) :: {:ok, Kreuzberg.ExtractionResult.t()} | {:error, String.t()}
Extract content from a file at the given path.
Accepts a file path and optional MIME type, returning extracted content. If no MIME type is provided, the library will attempt to detect it from the file.
Parameters
- path - File path (String or Path.t())
- mime_type - MIME type of the file (optional, defaults to nil for auto-detection)
- config - ExtractionConfig struct, map, keyword list, or nil with extraction options (optional)
Returns
- {:ok, ExtractionResult.t()} - Successfully extracted content
- {:error, reason} - Extraction failed with error message
Examples
# Extract with explicit MIME type
{:ok, result} = Kreuzberg.extract_file("document.pdf", "application/pdf")
result.content
# Extract with auto-detection
{:ok, result} = Kreuzberg.extract_file("document.pdf")
# With configuration
config = %Kreuzberg.ExtractionConfig{force_ocr: true}
{:ok, result} = Kreuzberg.extract_file("document.pdf", "application/pdf", config)
# With keyword list configuration
{:ok, result} = Kreuzberg.extract_file(
  "document.pdf",
  "application/pdf",
  ocr: %{"enabled" => true}
)
@spec extract_file!( String.t() | Path.t(), String.t() | nil, Kreuzberg.ExtractionConfig.t() | map() | keyword() | nil ) :: Kreuzberg.ExtractionResult.t()
Extract content from a file, raising on error.
Same as extract_file/3 but raises a Kreuzberg.Error exception if extraction fails.
Parameters
- path - File path (String or Path.t())
- mime_type - MIME type of the file (optional, defaults to nil for auto-detection)
- config - ExtractionConfig struct, map, keyword list, or nil with extraction options (optional)
Returns
- ExtractionResult.t() - Successfully extracted content
Raises
- Kreuzberg.Error - If extraction fails
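The relationship between a tuple-returning function and its bang variant can be sketched as a small unwrap step. This is an illustration only: `BangSketch` and `BangSketch.Error` are hypothetical stand-ins, and the actual fields of `Kreuzberg.Error` are not shown in this document.

```elixir
defmodule BangSketch do
  @moduledoc "Hypothetical illustration of the bang convention; not Kreuzberg's code."

  defmodule Error do
    # Stand-in for Kreuzberg.Error; a single :message field is an assumption.
    defexception [:message]
  end

  # Returns the value on success, raises on failure.
  def unwrap!({:ok, value}), do: value
  def unwrap!({:error, reason}), do: raise(Error, message: to_string(reason))
end

# Success returns the bare value rather than a tuple.
BangSketch.unwrap!({:ok, "content"})

# Failure raises, so callers handle it with rescue instead of pattern matching.
message =
  try do
    BangSketch.unwrap!({:error, "file not found"})
  rescue
    e in BangSketch.Error -> e.message
  end
```

The bang variants suit scripts and pipelines where a failure should stop execution; the tuple variants suit code that needs to recover or report per-document errors.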
Examples
# Extract with explicit MIME type, raising on error
result = Kreuzberg.extract_file!("document.pdf", "application/pdf")
result.content
# Extract with auto-detection, raising on error
result = Kreuzberg.extract_file!("document.pdf")
result.content
# With configuration
config = %Kreuzberg.ExtractionConfig{ocr: %{"enabled" => true}}
result = Kreuzberg.extract_file!("document.pdf", "application/pdf", config)
@spec extract_with_plugins( binary(), String.t(), Kreuzberg.ExtractionConfig.t() | map() | keyword() | nil, keyword() ) :: {:ok, Kreuzberg.ExtractionResult.t()} | {:error, String.t()}
Extract content with plugin processing support.
Performs document extraction with additional processing through registered plugins. Applies validators before extraction, post-processors by stage (early, middle, late) after extraction, and optional final validators to the result.
Plugins are retrieved from the Plugin.Registry if not explicitly provided in plugin_opts.
Parameters
- input - Binary document data to extract from
- mime_type - MIME type of the document (e.g., "application/pdf")
- config - ExtractionConfig struct, map, keyword list, or nil for extraction (optional)
- plugin_opts - Keyword list of plugin options (optional):
  - :validators - List of validator modules to run before extraction
  - :post_processors - Map of stage atoms to lists of post-processor modules
    - :early - Applied first to extraction result
    - :middle - Applied after early processors
    - :late - Applied last before final validators
  - :final_validators - List of validator modules to run after post-processing
Returns
- {:ok, ExtractionResult.t()} - Successfully extracted and processed content
- {:error, reason} - Extraction or processing failed with error message
Plugin Processing Flow
- Validators - If specified, run input validators to check extraction preconditions
- Extraction - Call extract/3 to get initial result
- Post-Processors - Apply by stage in order (early → middle → late)
  - Each processor receives the extraction result or output from previous processor
  - Processors should return modified result or data
- Final Validators - If specified, validate the processed result
- Return - Return enhanced extraction result
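The processing flow above can be modelled with a short, self-contained sketch. It is not the library's implementation: `PipelineSketch` is hypothetical, validators and processors are plain anonymous functions rather than plugin modules, and the assumption that a validator returns `:ok` or `{:error, reason}` is made only for this illustration.

```elixir
defmodule PipelineSketch do
  @moduledoc "Hypothetical model of the validate, extract, post-process, validate flow."

  def run(input, extract_fun, opts \\ []) do
    validators = Keyword.get(opts, :validators, [])
    post_processors = Keyword.get(opts, :post_processors, %{})
    final_validators = Keyword.get(opts, :final_validators, [])

    # The first failing step short-circuits and its {:error, reason} is returned.
    with :ok <- validate(validators, input),
         {:ok, result} <- extract_fun.(input),
         processed = apply_stages(result, post_processors),
         :ok <- validate(final_validators, processed) do
      {:ok, processed}
    end
  end

  # Runs validators in order and stops at the first error.
  defp validate(validators, value) do
    Enum.reduce_while(validators, :ok, fn validator, :ok ->
      case validator.(value) do
        :ok -> {:cont, :ok}
        {:error, _reason} = error -> {:halt, error}
      end
    end)
  end

  # Threads the result through early, then middle, then late processors.
  defp apply_stages(result, post_processors) do
    for stage <- [:early, :middle, :late],
        processor <- Map.get(post_processors, stage, []),
        reduce: result do
      acc -> processor.(acc)
    end
  end
end

# Stand-ins for an extractor, a validator, and two processors.
extract = fn binary -> {:ok, %{content: binary}} end
non_empty = fn input -> if input == "", do: {:error, "empty input"}, else: :ok end
upcase = fn result -> %{result | content: String.upcase(result.content)} end
exclaim = fn result -> %{result | content: result.content <> "!"} end

PipelineSketch.run("hello", extract,
  validators: [non_empty],
  post_processors: %{early: [upcase], late: [exclaim]}
)
```

In this sketch the stage order is fixed by the `[:early, :middle, :late]` list, so a processor registered under `:late` always sees the output of every `:early` and `:middle` processor, regardless of the order of keys in the map.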
Examples
# Extract with registered validators and post-processors
{:ok, result} = Kreuzberg.extract_with_plugins(
  pdf_binary,
  "application/pdf",
  nil,
  validators: [MyApp.InputValidator],
  post_processors: %{
    early: [MyApp.EarlyProcessor],
    middle: [MyApp.MiddleProcessor],
    late: [MyApp.FinalProcessor]
  },
  final_validators: [MyApp.ResultValidator]
)
# Extract with only post-processors
{:ok, result} = Kreuzberg.extract_with_plugins(
  pdf_binary,
  "application/pdf",
  %{use_cache: true},
  post_processors: %{
    early: [MyApp.Processor1, MyApp.Processor2]
  }
)
# Extract with configuration and validators only
config = %Kreuzberg.ExtractionConfig{ocr: %{"enabled" => true}}
{:ok, result} = Kreuzberg.extract_with_plugins(
  pdf_binary,
  "application/pdf",
  config,
  validators: [MyApp.Validator]
)
# Extract with no plugins (standard extraction)
{:ok, result} = Kreuzberg.extract_with_plugins(pdf_binary, "application/pdf")
@spec render_pdf_page(String.t(), non_neg_integer(), keyword()) :: {:ok, binary()} | {:error, String.t()}
Render a single PDF page as a PNG image.
Parameters
- path - Path to the PDF file
- page_index - Zero-based page index
- opts - Keyword list of options:
  - :dpi - Rendering resolution (default 150)
Returns
- {:ok, binary()} - PNG-encoded binary
- {:error, reason} - Rendering failed
Examples
{:ok, png} = Kreuzberg.render_pdf_page("document.pdf", 0)
{:ok, png} = Kreuzberg.render_pdf_page("document.pdf", 2, dpi: 300)
@spec render_pdf_pages_stream( String.t(), keyword() ) :: Enumerable.t()
Return a lazy Stream that yields {page_index, png_binary} tuples.
Pages are rendered one at a time via the native PDF page iterator, so only one page's worth of PNG bytes is in memory at a time.
Parameters
- path - Path to the PDF file
- opts - Keyword list of options:
  - :dpi - Rendering resolution (default 150)
Returns
- Enumerable.t() - A Stream of {non_neg_integer(), binary()} tuples
Examples
Kreuzberg.render_pdf_pages_stream("document.pdf")
|> Enum.each(fn {page_index, png} ->
  File.write!("page_#{page_index}.png", png)
end)
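To illustrate what "lazy" means for a stream of `{page_index, png_binary}` tuples, here is a self-contained sketch built on `Stream.resource/3`. It does not use Kreuzberg's native iterator: `LazyPagesSketch`, the fake `render` function, and the counter are all hypothetical stand-ins that only mimic the documented shape of the stream.

```elixir
defmodule LazyPagesSketch do
  @moduledoc "Hypothetical stand-in for a lazy page renderer; not Kreuzberg's code."

  # Emits {index, rendered_page} tuples, calling render_fun only when a page is demanded.
  def pages_stream(page_count, render_fun) do
    Stream.resource(
      # Start at page 0.
      fn -> 0 end,
      # Render one page per step, or halt once all pages have been emitted.
      fn
        index when index < page_count -> {[{index, render_fun.(index)}], index + 1}
        index -> {:halt, index}
      end,
      # Nothing to clean up in this sketch.
      fn _index -> :ok end
    )
  end
end

# Count how many pages are actually rendered.
{:ok, counter} = Agent.start_link(fn -> 0 end)

render = fn index ->
  Agent.update(counter, &(&1 + 1))
  "png-bytes-for-page-#{index}"
end

# Only the first two pages of a 100-page "document" are consumed.
first_two = LazyPagesSketch.pages_stream(100, render) |> Enum.take(2)
rendered = Agent.get(counter, & &1)
```

Here `rendered` ends up as 2 even though the stream describes 100 pages, because `Enum.take/2` halts the stream as soon as it has enough elements. The same property is what lets a consumer write each page to disk and discard it before the next one is produced.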