PdfExtractor
A powerful and easy-to-use Elixir library for extracting text and metadata from PDF files.
PdfExtractor wraps Python's pdfplumber library, called through pythonx, to provide robust PDF text extraction.
It works on both file paths and in-memory binaries, so it suits local file processing as well as PDFs fetched over HTTP.
Features
- 🔍 Extract text from single or multiple PDF pages
- 📍 Area-based extraction using bounding boxes
- 🌐 Work with PDF data directly from memory (e.g., HTTP downloads)
- 📊 Get PDF metadata like title, author, creation date
- 🐍 Leverages Python's powerful pdfplumber library
- 🚀 Simple and intuitive API
- ✅ Comprehensive test coverage
- 📚 Full documentation
Installation
Add pdf_extractor to your list of dependencies in mix.exs:
def deps do
  [
    {:pdf_extractor, "~> 0.5.0"}
  ]
end
Then add it to the supervision tree in your application's start function:
defmodule MyApp.Application do
  use Application

  def start(_type, _args) do
    children = [
      PdfExtractor
      # ... other children
    ]

    opts = [strategy: :one_for_one, name: MyApp.Supervisor]
    Supervisor.start_link(children, opts)
  end
end
Usage
Extract text from specific regions using bounding boxes {x0, y0, x1, y1}. Pass a map from page number to either one box or a list of boxes:
areas = %{
  0 => {0, 0, 300, 200}, # Top-left area of page 0
  1 => [
    {200, 300, 600, 500}, # Bottom-right area of page 1
    {0, 0, 200, 250} # Top-left area of page 1
  ]
}
PdfExtractor.extract_text("path/to/document.pdf", areas)
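Because each box is given as two corners rather than a corner plus a size, it is easy to swap coordinates by accident. The following is a minimal sketch of helpers for building and checking an areas map before passing it to extract_text. The AreaHelpers module and its functions are hypothetical and not part of PdfExtractor; the sketch assumes only the map shape shown in the example above.

```elixir
defmodule AreaHelpers do
  # Hypothetical helpers, not part of PdfExtractor.

  # Build an {x0, y0, x1, y1} box from a top-left corner plus a width and height.
  def box(x, y, width, height) when width > 0 and height > 0 do
    {x, y, x + width, y + height}
  end

  # A box is usable only if all four values are numbers and it has
  # positive width and height.
  def valid_box?({x0, y0, x1, y1})
      when is_number(x0) and is_number(y0) and is_number(x1) and is_number(y1) do
    x0 < x1 and y0 < y1
  end

  def valid_box?(_other), do: false

  # Accepts the shape used above: each page number maps to one box
  # or to a list of boxes.
  def valid_areas?(areas) when is_map(areas) do
    Enum.all?(areas, fn {page, boxes} ->
      is_integer(page) and page >= 0 and
        boxes |> List.wrap() |> Enum.all?(&valid_box?/1)
    end)
  end
end

# Rebuilds the same areas map as the example above.
areas = %{
  0 => AreaHelpers.box(0, 0, 300, 200),
  1 => [
    AreaHelpers.box(200, 300, 400, 200),
    AreaHelpers.box(0, 0, 200, 250)
  ]
}

AreaHelpers.valid_areas?(areas)
```

Checking the map up front keeps a mistyped box from reaching the Python side, where the resulting error is likely to be harder to trace back.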
Return Format
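The function returns a map where keys are page numbers and values are the extracted text. A page with a single area (or no area restriction) maps to one string, while a page with several areas maps to a list of strings in the order the areas were given: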
%{
  0 => "Text from page 0...",
  1 => ["Text from page 1 (first area)...", "Text from page 1 (second area)..."],
  2 => "Text from page 2..."
}
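Since a page's value may be either a string or a list of strings, code that consumes the result has to handle both. Here is a minimal sketch of one way to flatten the map into a single string. The ResultHelpers module is hypothetical, and the sketch assumes the result has exactly the plain map shape shown above (if your installed version wraps the result in a tuple, unwrap it first).

```elixir
defmodule ResultHelpers do
  # Hypothetical helper, not part of PdfExtractor.

  # Normalise one page's value: a single string is returned unchanged,
  # a list of area texts is joined with newlines.
  def page_text(text) when is_binary(text), do: text
  def page_text(texts) when is_list(texts), do: Enum.join(texts, "\n")

  # Concatenate all pages in ascending page order, separated by blank lines.
  def full_text(result) when is_map(result) do
    result
    |> Enum.sort_by(fn {page, _value} -> page end)
    |> Enum.map_join("\n\n", fn {_page, value} -> page_text(value) end)
  end
end

# Sample data in the shape shown above.
result = %{
  0 => "Text from page 0...",
  1 => ["first area", "second area"],
  2 => "Text from page 2..."
}

ResultHelpers.full_text(result)
# => "Text from page 0...\n\nfirst area\nsecond area\n\nText from page 2..."
```

The explicit sort is there because maps do not guarantee iteration in page order.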
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Built on top of the excellent pdfplumber Python library
- Uses pythonx for seamless Python integration