LeXtract (lextract v0.1.2)

LeXtract

LLM-powered text extraction library for Elixir, based on Google's LangExtract.

LeXtract enables you to extract structured information from unstructured text using Large Language Models (LLMs). It provides a simple, streaming API with support for multiple LLM providers.

Features

  • Multi-Provider LLM Support - Works with OpenAI, Gemini, Anthropic, and other providers through ReqLLM
  • Streaming API - Memory-efficient batch processing with lazy streams
  • Automatic Text Chunking - Handles long documents with configurable chunk sizes and overlap
  • Character-Level Alignment - Precise alignment of extractions to source text positions
  • Schema Generation - Automatic schema inference from examples
  • Template-Based Configuration - Reusable extraction templates in JSON or YAML
  • Structured Output Mode - Enhanced reliability with schema validation
  • Multi-Pass Extraction - Improved recall through multiple extraction passes
  • Flexible Output Formats - Support for JSON and YAML output formats

Installation

Add lextract to your list of dependencies in mix.exs:

def deps do
  [
    {:lextract, "~> 0.1.0"}
  ]
end

Quick Start

Basic Entity Extraction

Extract named entities from text with inline template options:

{:ok, stream} = LeXtract.extract(
  "Dr. Smith prescribed aspirin 100mg to the patient.",
  prompt: "Extract medical entities from the text",
  examples: [
    %{
      text: "Patient takes ibuprofen 200mg",
      extractions: [
        %{extraction_class: "Medication", name: "ibuprofen", dosage: "200mg"}
      ]
    }
  ],
  model: "gpt-4o-mini",
  provider: :openai
)

annotated_docs = Enum.to_list(stream)

Using Template Files

Create a template file (JSON or YAML) for reusable extraction configurations:

# medication_template.yaml
description: Extract medication entities with dosage and frequency
examples:
  - text: "Patient takes aspirin 100mg twice daily"
    extractions:
      - extraction_class: Medication
        name: aspirin
        dosage: 100mg
        frequency: twice daily

Then extract using the template:

{:ok, stream} = LeXtract.extract(
  "Dr. Jones prescribed metformin 500mg once daily.",
  template_file: "medication_template.yaml",
  model: "gpt-4o-mini",
  provider: :openai
)

Batch Processing with Streams

Process multiple documents efficiently with streaming:

documents = [
  "First patient document...",
  "Second patient document...",
  "Third patient document..."
]

{:ok, stream} = LeXtract.extract(
  documents,
  prompt: "Extract medical conditions",
  examples: [...],
  model: "gpt-4o-mini",
  provider: :openai,
  batch_size: 5
)

stream
|> Stream.each(fn annotated_doc ->
  IO.puts("Document: #{annotated_doc.document_id}")
  IO.puts("Extractions: #{length(annotated_doc.extractions)}")
end)
|> Stream.run()
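
Because the result is a lazy stream, per-document summaries can be computed without materialising every result first. The following is a minimal sketch, assuming only the two fields used above (`document_id` and `extractions`). `ExtractionSummary` is a hypothetical helper and is not part of LeXtract; the sample maps stand in for real `AnnotatedDocument` results.

```elixir
defmodule ExtractionSummary do
  @moduledoc "Hypothetical helper (not part of LeXtract) for summarising results."

  # Lazily maps each annotated document to {document_id, extraction_count}.
  # Works on any enumerable whose elements expose :document_id and :extractions.
  def counts(annotated_docs) do
    Stream.map(annotated_docs, fn doc ->
      {doc.document_id, length(doc.extractions)}
    end)
  end
end

# Stand-in data with the same two fields; real input would be the stream
# returned by LeXtract.extract/2.
docs = [
  %{document_id: "doc_1", extractions: [%{name: "asthma"}]},
  %{document_id: "doc_2", extractions: []}
]

docs
|> ExtractionSummary.counts()
|> Enum.each(fn {id, count} -> IO.puts("#{id}: #{count} extraction(s)") end)
```

Nothing is computed until the stream is consumed, here by `Enum.each/2`.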

Structured Output Mode

For better reliability and schema validation, use structured output mode:

{:ok, stream} = LeXtract.extract(
  "Patient has hypertension and diabetes.",
  prompt: "Extract medical conditions",
  examples: [
    %{
      text: "Patient diagnosed with asthma",
      extractions: [
        %{extraction_class: "Condition", name: "asthma", severity: "mild"}
      ]
    }
  ],
  model: "gpt-4o-mini",
  provider: :openai,
  use_structured_output: true
)

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Functions

extract(input, opts)

@spec extract(
  source_document :: String.t() | [String.t()] | [LeXtract.Document.t()],
  options :: LeXtract.Config.options()
) ::
  {:ok, Enumerable.t(LeXtract.AnnotatedDocument.t())} | {:error, Exception.t()}

Extracts structured information from text using LLMs.

This is the main entry point for the library. It accepts text (string, list of strings, or list of Document structs) and returns a lazy Stream of AnnotatedDocument results.

Parameters

  • input - Text to extract from (String.t(), [String.t()], or [Document.t()])
  • opts - Extraction options (see module documentation for full list)

Returns

{:ok, Enumerable.t(AnnotatedDocument.t())} or {:error, exception}

Examples

iex> {:ok, _stream} = LeXtract.extract(
...>   "Sample text",
...>   prompt: "Extract entities",
...>   examples: [],
...>   model: "gpt-4o-mini",
...>   provider: :openai,
...>   api_key: "test-key"
...> )
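
Since extract/2 returns a tagged tuple whose error side is typed as an exception in the spec, callers will usually pattern match on both shapes. The sketch below shows one way to do that. `ResultHandling.collect/1` is a hypothetical helper, not a LeXtract function, and the decision to turn the exception into its message string is an assumption made for illustration.

```elixir
defmodule ResultHandling do
  @moduledoc "Hypothetical helper (not part of LeXtract) for handling extract/2 results."

  # Success: force the lazy stream into a list of results.
  def collect({:ok, stream}), do: {:ok, Enum.to_list(stream)}

  # Failure: the spec types the error as an exception struct, so
  # Exception.message/1 can render it as a readable string.
  def collect({:error, exception}) when is_exception(exception) do
    {:error, Exception.message(exception)}
  end
end

# Intended usage (requires the library and valid provider credentials):
#
#   "Sample text"
#   |> LeXtract.extract(opts)
#   |> ResultHandling.collect()
```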

extract!(input, opts)

@spec extract!(
  source_document :: String.t() | [String.t()] | [LeXtract.Document.t()],
  options :: LeXtract.Config.options()
) :: Enumerable.t(LeXtract.AnnotatedDocument.t())

Extracts structured information from text, raising on error.

Same as extract/2, but returns the stream directly and raises an exception on error.

Examples

iex> stream = LeXtract.extract!(
...>   "Sample text",
...>   prompt: "Extract entities",
...>   examples: [],
...>   model: "gpt-4o-mini",
...>   provider: :openai,
...>   api_key: "test-key"
...> )
iex> is_struct(stream, Stream)
true

extract_from_file(file_path, opts)

@spec extract_from_file(file_path :: Path.t(), options :: LeXtract.Config.options()) ::
  {:ok, Enumerable.t(LeXtract.AnnotatedDocument.t())} | {:error, Exception.t()}

Extracts structured information from a text file.

Reads the file content and then calls extract/2. Useful for processing single documents stored on disk.

Parameters

  • file_path - Path to text file
  • opts - Extraction options (see extract/2)

Returns

{:ok, Enumerable.t(AnnotatedDocument.t())} or {:error, exception}

Examples

iex> File.write!("/tmp/test_doc.txt", "Sample text")
iex> {:ok, stream} = LeXtract.extract_from_file(
...>   "/tmp/test_doc.txt",
...>   prompt: "Extract entities",
...>   examples: [],
...>   model: "gpt-4o-mini",
...>   provider: :openai,
...>   api_key: "test-key"
...> )
iex> is_struct(stream, Stream)
true
iex> File.rm("/tmp/test_doc.txt")
:ok
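
extract_from_file/2 handles one file at a time. For several files, one option is to read them yourself and pass the resulting list of strings to extract/2, which accepts `[String.t()]`. A minimal sketch follows. `FileBatch` and the glob pattern are hypothetical, and reading every file eagerly into memory is an assumption that suits small document sets only.

```elixir
defmodule FileBatch do
  @moduledoc "Hypothetical helper (not part of LeXtract) for loading several text files."

  # Expands a glob pattern, sorts the paths for a stable order, and reads
  # each file eagerly. File.read!/1 raises if a file cannot be read.
  def read_texts(pattern) do
    pattern
    |> Path.wildcard()
    |> Enum.sort()
    |> Enum.map(&File.read!/1)
  end
end

# Intended usage (requires the library and valid provider credentials):
#
#   texts = FileBatch.read_texts("reports/*.txt")
#   {:ok, stream} = LeXtract.extract(texts, opts)
```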

validate_options(opts)

@spec validate_options(LeXtract.Config.options()) ::
  {:ok, LeXtract.Config.options()} | {:error, Exception.t()}

Validates extraction options against the schema.

Useful for validating options before processing or for debugging configuration issues.

Parameters

  • opts - Keyword list of options

Returns

{:ok, validated_opts} or {:error, validation_error}

Examples

iex> {:ok, opts} = LeXtract.validate_options(
...>   prompt: "Extract",
...>   model: "gpt-4o-mini",
...>   provider: :openai,
...>   api_key: "key"
...> )
iex> Keyword.get(opts, :prompt)
"Extract"
iex> Keyword.get(opts, :format)
:yaml
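
Validation and extraction can be chained so that a configuration problem surfaces before any extraction is attempted. The sketch below uses `with` for that. `SafeExtract` is a hypothetical wrapper, not part of LeXtract; the two function arguments exist only so that the flow can be exercised with stubs, and whether extract/2 validates options again internally is not stated in these docs.

```elixir
defmodule SafeExtract do
  @moduledoc "Hypothetical wrapper (not part of LeXtract): validate first, then extract."

  # The defaults point at the documented LeXtract functions. Passing other
  # functions makes it possible to try the flow without the library.
  def run(
        text,
        opts,
        validate \\ &LeXtract.validate_options/1,
        extract \\ &LeXtract.extract/2
      ) do
    # Any {:error, _} from either step falls through `with` unchanged.
    with {:ok, validated} <- validate.(opts),
         {:ok, stream} <- extract.(text, validated) do
      {:ok, stream}
    end
  end
end

# Intended usage (requires the library and valid provider credentials):
#
#   SafeExtract.run("Sample text", prompt: "Extract", model: "gpt-4o-mini", provider: :openai)
```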