LeXtract.Annotator (lextract v0.1.2)

Annotates documents with extractions using LLMs.

This module is the core extraction orchestrator. It:

  1. Chunks documents
  2. Generates prompts
  3. Calls the LLM via ReqLLM
  4. Parses and aligns the results
  5. Aggregates them into an AnnotatedDocument

Extraction Modes

The Annotator supports two modes of operation:

Text Generation Mode (Default)

Uses ReqLLM.generate_text/3 to generate free-form text responses in JSON or YAML format. The LLM response is parsed and converted to extractions.

template = %{
  description: "Extract medication entities",
  examples: [
    %{
      text: "Patient takes aspirin 100mg",
      extractions: [
        %{extraction_class: "Medication", name: "aspirin", dosage: "100mg"}
      ]
    }
  ]
}

annotator = LeXtract.Annotator.new(template,
  model: "gemini-2.0-flash",
  provider: :gemini,
  api_key: "your-api-key"
)

doc = LeXtract.Annotator.annotate_text(annotator, "Patient takes aspirin 100mg daily")

Structured Output Mode

Uses ReqLLM.generate_object/4 to generate structured output with schema validation. This mode automatically generates a schema from your examples and ensures the LLM response conforms to the expected structure.

Enable it with the :use_structured_output option:

template = %{
  description: "Extract medication entities with structured output",
  examples: [
    %{
      text: "Patient takes aspirin 100mg twice daily",
      extractions: [
        %{
          extraction_class: "Medication",
          name: "aspirin",
          dosage: "100mg",
          frequency: "twice daily"
        }
      ]
    }
  ]
}

annotator = LeXtract.Annotator.new(template,
  [model: "gemini-2.0-flash", provider: :gemini, api_key: "your-api-key"],
  use_structured_output: true
)

doc = LeXtract.Annotator.annotate_text(annotator, "Patient takes aspirin 100mg twice daily")

Structured output mode offers several benefits:

  • Automatic schema generation from examples
  • Built-in validation by the LLM provider
  • More reliable parsing (no free-form JSON/YAML response to parse)
  • Better support for complex nested structures
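To make "schema generation from examples" concrete, here is a minimal sketch of the idea. SchemaSketch and infer/1 are hypothetical names, and the coarse type map they return is an assumption for illustration only. The source does not show the schema format LeXtract actually builds or passes to ReqLLM.generate_object/4.

```elixir
defmodule SchemaSketch do
  # Hypothetical helper, not part of LeXtract: walks every extraction in the
  # template examples and records a coarse type for each attribute it sees.
  def infer(examples) do
    for %{extractions: extractions} <- examples,
        extraction <- extractions,
        {key, value} <- extraction,
        into: %{} do
      {key, type_of(value)}
    end
  end

  defp type_of(v) when is_binary(v), do: :string
  defp type_of(v) when is_boolean(v), do: :boolean
  defp type_of(v) when is_integer(v), do: :integer
  defp type_of(v) when is_float(v), do: :float
  defp type_of(v) when is_list(v), do: :list
  defp type_of(v) when is_map(v), do: :map
  defp type_of(_), do: :any
end
```

Under these assumptions, calling SchemaSketch.infer(template.examples) on the template above would yield a map with one entry each for extraction_class, name, dosage and frequency, all typed :string.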

Examples

iex> template = %{
...>   description: "Extract medication entities",
...>   examples: [
...>     %{
...>       text: "Patient takes aspirin",
...>       extractions: [%{medication: "aspirin"}]
...>     }
...>   ]
...> }
iex> annotator = LeXtract.Annotator.new(template,
...>   model: "gemini-2.0-flash",
...>   provider: :gemini,
...>   api_key: "test-key"
...> )
iex> is_struct(annotator, LeXtract.Annotator)
true

Summary

Types

t()

@type t() :: %LeXtract.Annotator{
  format_handler: LeXtract.FormatHandler.t(),
  prompt_generator: LeXtract.Prompting.t(),
  req_llm_config: keyword(),
  use_structured_output: boolean()
}

Functions

annotate_documents(annotator, documents, opts \\ [])

Annotates a stream of documents.

Main API for batch processing. Handles:

  • Chunking of long documents
  • Batch inference for efficiency
  • Alignment of extractions
  • Multi-pass extraction (if enabled)

Parameters

  • annotator - The annotator instance
  • documents - Enumerable of %Document{} structs
  • opts - Options (see below)

Options

  • :max_char_buffer - Max chunk size in chars (default: 1000)
  • :batch_size - Number of chunks per LLM batch (default: 5)
  • :extraction_passes - Number of passes for multi-pass (default: 1)
  • :show_progress - Show progress bar (default: false)
  • :chunk_overlap - Chunk overlap in chars (default: 200)
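The interaction between :max_char_buffer, :chunk_overlap and :batch_size can be illustrated with a small standalone sketch. ChunkSketch, chunk/3 and batches/2 are hypothetical names. LeXtract's real chunker may split on sentence or token boundaries instead of raw character windows, so treat this as an assumption, not as the library's algorithm.

```elixir
defmodule ChunkSketch do
  # Hypothetical helper, not part of LeXtract: splits `text` into windows of
  # at most `max_chars` graphemes, each starting `max_chars - overlap`
  # graphemes after the previous one.
  def chunk(text, max_chars, overlap)
      when is_binary(text) and is_integer(max_chars) and is_integer(overlap) and
             overlap >= 0 and max_chars > overlap do
    do_chunk(String.graphemes(text), max_chars, max_chars - overlap, [])
  end

  defp do_chunk([], _max, _step, acc), do: Enum.reverse(acc)

  defp do_chunk(graphemes, max, step, acc) do
    {window, rest} = Enum.split(graphemes, max)
    acc = [Enum.join(window) | acc]

    if rest == [] do
      # The window reached the end of the text, so no further chunk is needed.
      Enum.reverse(acc)
    else
      do_chunk(Enum.drop(graphemes, step), max, step, acc)
    end
  end

  # Groups chunks into batches of `batch_size`; the last batch may be smaller.
  def batches(chunks, batch_size \\ 5), do: Enum.chunk_every(chunks, batch_size)
end
```

With the documented defaults (1000 and 200), each window in this sketch would start 800 characters after the previous one, and every five windows would form one batch.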

Returns

A stream of %AnnotatedDocument{} structs with extractions.

annotate_text(annotator, text, opts \\ [])

@spec annotate_text(t(), String.t(), keyword()) :: LeXtract.AnnotatedDocument.t()

Annotates a single text string.

Convenience wrapper around annotate_documents/3 for single text inputs.

Parameters

  • annotator - The annotator instance
  • text - Text to extract from
  • opts - Options (see annotate_documents/3)

Returns

A single %AnnotatedDocument{} with extractions aligned to the text.
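The "convenience wrapper" relationship can be sketched as follows. WrapperSketch and both of its functions are hypothetical stand-ins. The sketch assumes the batch function returns a lazy stream, as the Returns section of annotate_documents/3 suggests, and it makes no LLM calls.

```elixir
defmodule WrapperSketch do
  # Stand-in for a batch API: lazily maps each input to a result map.
  # A real implementation would chunk the text and call the LLM here.
  def annotate_many(texts, opts \\ []) do
    Stream.map(texts, fn text -> %{text: text, opts: opts} end)
  end

  # Single-input wrapper: wraps the text in a list, reuses the batch
  # function, and pulls the first (and only) result out of the stream.
  def annotate_one(text, opts \\ []) do
    [text]
    |> annotate_many(opts)
    |> Enum.at(0)
  end
end
```

Because the options are forwarded unchanged, anything accepted by the batch function is also accepted by the single-input wrapper, which matches the "see annotate_documents/3" note on opts.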

Examples

iex> template = %{description: "Extract entities", examples: []}
iex> annotator = LeXtract.Annotator.new(template,
...>   model: "gemini-2.0-flash",
...>   provider: :gemini,
...>   api_key: "test"
...> )
iex> # Note: annotate_text/3 calls the LLM, so exercising it in tests requires mocking ReqLLM
iex> is_struct(annotator, LeXtract.Annotator)
true

new(prompt_template, req_llm_config, opts \\ [])

@spec new(LeXtract.Prompting.template(), keyword(), keyword()) :: t()

Creates a new annotator.

Parameters

  • prompt_template - Template with description and examples
  • req_llm_config - ReqLLM configuration (model, provider, API keys, etc.)
  • opts - Options (see below)

Options

  • :format - Output format (:json or :yaml, default: :yaml)
  • :fence_output - Whether to expect fenced output (default: false)
  • :attribute_suffix - Suffix for attributes (default: "_attributes")
  • :use_structured_output - Use ReqLLM's generate_object/4 for structured output (default: false)
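One way to picture how these defaults combine with caller-supplied options is the sketch below. AnnotatorOptsSketch and resolve/1 are hypothetical, and the source does not say whether LeXtract validates options this way. The default values are copied from the list above, and Keyword.validate!/2 is the standard library function (Elixir 1.13 or later).

```elixir
defmodule AnnotatorOptsSketch do
  # Defaults copied from the Options list above.
  @defaults [
    format: :yaml,
    fence_output: false,
    attribute_suffix: "_attributes",
    use_structured_output: false
  ]

  # Hypothetical helper, not part of LeXtract: returns the effective options
  # as a map. Raises ArgumentError on unknown keys or an unsupported :format.
  def resolve(opts \\ []) do
    resolved = opts |> Keyword.validate!(@defaults) |> Map.new()

    if resolved.format not in [:json, :yaml] do
      raise ArgumentError,
            "expected :format to be :json or :yaml, got: #{inspect(resolved.format)}"
    end

    resolved
  end
end
```

In this sketch, AnnotatorOptsSketch.resolve(format: :json) keeps the other three defaults and overrides only the format, while a misspelled key such as :fromat raises instead of being silently ignored.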

Examples

iex> template = %{description: "Extract entities", examples: []}
iex> config = [model: "gemini-2.0-flash", provider: :gemini, api_key: "test"]
iex> annotator = LeXtract.Annotator.new(template, config)
iex> annotator.format_handler.format
:yaml