Kreuzberg.ExtractionResult (kreuzberg v4.0.4)

Structure representing the result of a document extraction operation.

Contains all data extracted from a processed document: content, metadata, tables, detected languages, chunks with embeddings, images with OCR results, per-page information, and keywords.

Fields

  • :content - The main extracted text content as a UTF-8 string

    • Contains the primary textual output from document analysis
    • Cleaned and normalized from the original document
    • May include line breaks and structural markers
  • :mime_type - The MIME type of the processed document (e.g., "application/pdf")

    • Used to identify document type and format
    • Common types: "application/pdf", "text/plain", "image/png", etc.
    • Helps downstream processors know how to handle the content
  • :metadata - Metadata struct containing document-specific information

    • A Kreuzberg.Metadata struct with typed fields
    • Contains title, author, created_date, page_count, etc.
    • Can be an empty struct if no metadata is available
  • :tables - List of extracted table structs

    • Each table is a Kreuzberg.Table struct
    • Contains cells, headers, markdown, and other table data
    • Empty list [] if no tables are found in the document
  • :detected_languages - List of detected language codes (ISO 639-1 format)

    • Language codes: "en", "de", "fr", "es", "zh", etc.
    • May be nil if language detection is disabled
    • Contains multiple codes if the document has mixed-language content
    • Example: ["en", "de"] for a bilingual document
  • :chunks - Optional list of text chunk structs with embeddings

    • nil if chunking/embedding is not enabled
    • Each chunk is a Kreuzberg.Chunk struct with text and embedding
    • Used for semantic search and RAG applications
  • :images - Optional list of extracted image structs with OCR results

    • nil if image extraction is disabled
    • Each image is a Kreuzberg.Image struct with format, data, and ocr_text
    • OCR text is the output of Tesseract or another OCR backend
  • :pages - Optional list of per-page content structs

    • nil if page-level extraction is not enabled
    • Each page is a Kreuzberg.Page struct with number, content, and dimensions
    • Useful for documents where position and structure matter
  • :keywords - Optional list of extracted keyword maps

    • nil if keyword extraction is disabled
    • Each keyword is a map with "text" and "score" fields
    • Used for document classification, tagging, and search optimization
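
Because :detected_languages, :chunks, :images, :pages, and :keywords may each be nil, code that enumerates them has to guard against nil first. The following is a minimal sketch of one way to do that; the OptionalFields module is hypothetical and not part of Kreuzberg, and a plain map stands in for the struct so the snippet runs without the library installed.

```elixir
defmodule OptionalFields do
  # Hypothetical helper, not part of Kreuzberg.
  @optional [:detected_languages, :chunks, :images, :pages, :keywords]

  # Replaces nil optional fields with [] so callers can enumerate them
  # unconditionally. Works on any map or struct that has all of these keys;
  # Map.update!/3 raises KeyError if one is missing.
  def with_defaults(result) do
    Enum.reduce(@optional, result, fn key, acc ->
      Map.update!(acc, key, fn
        nil -> []
        value -> value
      end)
    end)
  end
end

# A plain map with the same keys stands in for %Kreuzberg.ExtractionResult{}.
result = %{
  content: "text",
  mime_type: "text/plain",
  tables: [],
  detected_languages: ["en"],
  chunks: nil,
  images: nil,
  pages: nil,
  keywords: nil
}

normalized = OptionalFields.with_defaults(result)
```

After normalization an empty list no longer distinguishes "feature disabled" from "nothing found", so keep the original result if that distinction matters.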

Examples

# Basic extraction result
iex> result = %Kreuzberg.ExtractionResult{
...>   content: "Document content",
...>   mime_type: "application/pdf",
...>   metadata: %Kreuzberg.Metadata{},
...>   tables: [],
...>   detected_languages: ["en"]
...> }
iex> result.content
"Document content"

# Rich extraction with metadata and tables
iex> result = %Kreuzberg.ExtractionResult{
...>   content: "Sales Report 2024\n\nQ1: 1M, Q2: 1.2M, Q3: 1.5M",
...>   mime_type: "application/pdf",
...>   metadata: %Kreuzberg.Metadata{title: "Sales Report"},
...>   tables: [%Kreuzberg.Table{headers: ["Quarter", "Amount"]}],
...>   detected_languages: ["en"],
...>   chunks: nil,
...>   images: nil,
...>   pages: nil
...> }
iex> result.metadata.title
"Sales Report"

# Full extraction with chunks, images, and pages populated (keywords omitted)
iex> result = %Kreuzberg.ExtractionResult{
...>   content: "Multi-page document content...",
...>   mime_type: "application/pdf",
...>   metadata: %Kreuzberg.Metadata{page_count: 5},
...>   tables: [%Kreuzberg.Table{cells: [["Data1", "Data2"]]}],
...>   detected_languages: ["en", "de"],
...>   chunks: [%Kreuzberg.Chunk{text: "chunk1 content"}],
...>   images: [%Kreuzberg.Image{format: "png", ocr_text: "Image text"}],
...>   pages: [%Kreuzberg.Page{number: 1, content: "Page 1 content"}]
...> }
iex> Enum.count(result.pages)
1
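
None of the examples above populate :keywords. As a rough sketch of how the keyword maps might be consumed, the snippet below ranks them by score. It assumes the "text" and "score" string keys described under Fields, and further assumes that the score is numeric with higher values meaning more relevant; KeywordRanking is a hypothetical helper, not part of Kreuzberg.

```elixir
defmodule KeywordRanking do
  # Hypothetical helper, not part of Kreuzberg.
  # Returns the texts of the `n` highest-scoring keywords.
  # nil means keyword extraction was disabled, so there is nothing to rank.
  def top(nil, _n), do: []

  def top(keywords, n) when is_list(keywords) do
    keywords
    |> Enum.sort_by(& &1["score"], :desc)
    |> Enum.take(n)
    |> Enum.map(& &1["text"])
  end
end

# Sample data in the shape described for the :keywords field.
keywords = [
  %{"text" => "report", "score" => 0.42},
  %{"text" => "sales", "score" => 0.91},
  %{"text" => "quarter", "score" => 0.67}
]

top_two = KeywordRanking.top(keywords, 2)
# => ["sales", "quarter"]
```

In real use the first argument would be result.keywords taken from an extraction result.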

Summary

Types

t()

@type t() :: %Kreuzberg.ExtractionResult{
  chunks: [Kreuzberg.Chunk.t()] | nil,
  content: String.t(),
  detected_languages: [String.t()] | nil,
  images: [Kreuzberg.Image.t()] | nil,
  keywords: [map()] | nil,
  metadata: Kreuzberg.Metadata.t(),
  mime_type: String.t(),
  pages: [Kreuzberg.Page.t()] | nil,
  tables: [Kreuzberg.Table.t()]
}

Functions

new(content, mime_type, metadata \\ %Kreuzberg.Metadata{}, tables \\ [], opts \\ [])

@spec new(
  String.t(),
  String.t(),
  Kreuzberg.Metadata.t() | map(),
  [Kreuzberg.Table.t() | map()],
  keyword()
) :: t()

Creates a new ExtractionResult from extracted data.

Parameters

  • content - The extracted text content
  • mime_type - The MIME type of the document
  • metadata - Document metadata struct or map (defaults to empty Metadata struct)
  • tables - List of extracted table structs or maps (defaults to empty list)
  • opts - Optional keyword list containing:
    • :detected_languages - List of detected language codes
    • :chunks - List of chunk structs or maps
    • :images - List of image structs or maps
    • :pages - List of page structs or maps
    • :keywords - List of keyword maps

Returns

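An ExtractionResult struct. Per the spec, metadata and tables may be passed as maps, while the returned struct holds them as Kreuzberg.Metadata and Kreuzberg.Table structs.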

Examples

iex> Kreuzberg.ExtractionResult.new("text", "text/plain")
%Kreuzberg.ExtractionResult{
  content: "text",
  mime_type: "text/plain",
  metadata: %Kreuzberg.Metadata{},
  tables: [],
  detected_languages: nil,
  chunks: nil,
  images: nil,
  pages: nil,
  keywords: nil
}

iex> metadata = %Kreuzberg.Metadata{page_count: 5}
iex> Kreuzberg.ExtractionResult.new("text", "application/pdf", metadata, [],
...>   detected_languages: ["en", "de"])
%Kreuzberg.ExtractionResult{
  content: "text",
  mime_type: "application/pdf",
  metadata: %Kreuzberg.Metadata{page_count: 5},
  tables: [],
  detected_languages: ["en", "de"],
  chunks: nil,
  images: nil,
  pages: nil,
  keywords: nil
}