PdfExtractor

A powerful and easy-to-use Elixir library for extracting text and metadata from PDF files.

PdfExtractor calls Python's pdfplumber library through pythonx to extract text from PDFs. It accepts both file paths and in-memory binaries, so it suits use cases ranging from local file processing to handling PDFs downloaded over HTTP.

Features

  • 🔍 Extract text from single or multiple PDF pages
  • 📍 Area-based extraction using bounding boxes
  • 🌐 Work with PDF data directly from memory (e.g., HTTP downloads)
  • 📊 Get PDF metadata like title, author, creation date
  • 🐍 Leverages Python's powerful pdfplumber library
  • 🚀 Simple and intuitive API
  • ✅ Comprehensive test coverage
  • 📚 Full documentation

Installation

Add pdf_extractor to your list of dependencies in mix.exs:

def deps do
  [
    {:pdf_extractor, "~> 0.5.0"}
  ]
end

Then add it to the supervision tree in your application's start function:

defmodule MyApp.Application do
  use Application

  def start(_type, _args) do
    children = [
      PdfExtractor
      # ... other children
    ]

    opts = [strategy: :one_for_one, name: MyApp.Supervisor]
    Supervisor.start_link(children, opts)
  end
end

Usage

Extract text from specific regions using bounding boxes {x0, y0, x1, y1}:

areas = %{
  0 => {0, 0, 300, 200},     # Top-left area of page 0
  1 => [
    {200, 300, 600, 500},    # Bottom-right area of page 1
    {0, 0, 200, 250}         # Top-left area of page 1
  ]
}
PdfExtractor.extract_text("path/to/document.pdf", areas)
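
A hand-written box with swapped corners is easy to miss, so it can help to check the areas map before passing it on. The following is a minimal sketch, not part of PdfExtractor: `AreaCheck`, `valid_box?/1`, and `invalid_pages/1` are hypothetical names. It assumes, based on the comments in the example above, that coordinates start at the top-left corner of the page, so a well-formed box has `x0 < x1` and `y0 < y1`.

```elixir
defmodule AreaCheck do
  @moduledoc """
  Hypothetical helper, not part of PdfExtractor: sanity-checks an areas map
  before it is passed to PdfExtractor.extract_text/2.
  """

  # Assumption: a box is well-formed when all four coordinates are
  # non-negative numbers and the box has positive width and height.
  def valid_box?({x0, y0, x1, y1})
      when is_number(x0) and is_number(y0) and is_number(x1) and is_number(y1) do
    x0 >= 0 and y0 >= 0 and x0 < x1 and y0 < y1
  end

  def valid_box?(_other), do: false

  # Returns the sorted page numbers whose entry contains at least one
  # malformed box. A page entry may be a single box or a list of boxes.
  def invalid_pages(areas) when is_map(areas) do
    areas
    |> Enum.reject(fn {_page, boxes} ->
      boxes |> List.wrap() |> Enum.all?(&valid_box?/1)
    end)
    |> Enum.map(fn {page, _boxes} -> page end)
    |> Enum.sort()
  end
end
```

With the `areas` map from the example above, `AreaCheck.invalid_pages(areas)` returns `[]`. If a box had its corners swapped, for example `{200, 250, 0, 0}` on page 1, the result would be `[1]`.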

Return Format

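The function returns a map keyed by zero-based page number. Each value is the extracted text for that page: a single string, or a list of strings (one per area) when several areas were requested for the page: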

%{
  0 => "Text from page 0...",
  1 => ["Text from page 1 (first area)...", "Text from page 1 (second area)..."],
  2 => "Text from page 2..."
}
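
Because a page value can be either a string or a list of strings, calling code usually has to normalize it before further processing. The sketch below shows one way to do that. `PageText`, `join_pages/2`, and `in_page_order/1` are hypothetical names, not part of PdfExtractor, and the helpers take the result map itself; if your version of the library wraps the map (for example in an `{:ok, map}` tuple, which this section does not specify), unwrap it first.

```elixir
defmodule PageText do
  @moduledoc """
  Hypothetical helper, not part of PdfExtractor: normalizes a result map
  shaped like the one shown above.
  """

  # Joins the text of each page into one string. A page value may be a single
  # string or a list of strings (one per requested area).
  def join_pages(result, separator \\ "\n") when is_map(result) do
    Map.new(result, fn {page, text} ->
      {page, text |> List.wrap() |> Enum.join(separator)}
    end)
  end

  # Returns every extracted text as a flat list, ordered by page number and,
  # within a page, in the order the areas were returned.
  def in_page_order(result) when is_map(result) do
    result
    |> Enum.sort_by(fn {page, _text} -> page end)
    |> Enum.flat_map(fn {_page, text} -> List.wrap(text) end)
  end
end
```

`List.wrap/1` is what makes both shapes uniform: it leaves a list untouched and turns a bare string into a one-element list.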

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Built on top of the excellent pdfplumber Python library
  • Uses pythonx for seamless Python integration