PdfExtractor (PdfExtractor v0.5.0)
A powerful and easy-to-use Elixir library for extracting text and metadata from PDF files.
PdfExtractor leverages Python's pdfplumber library, called through pythonx, to provide robust PDF text extraction. It supports both file-based and binary-based operations, making it suitable for use cases ranging from local file processing to web-based PDF handling.
Features
- 🔍 Extract text from single or multiple PDF pages
- 📍 Area-based extraction using bounding boxes
- 🌐 Work with PDF data directly from memory (e.g., HTTP downloads)
- 📊 Get PDF metadata like title, author, creation date
- 🐍 Leverages Python's powerful pdfplumber library
- 🚀 Simple and intuitive API
- ✅ Comprehensive test coverage
- 📚 Full documentation
Installation
Add pdf_extractor to your list of dependencies in mix.exs:
def deps do
  [
    {:pdf_extractor, "~> 0.5.0"}
  ]
end
Then add it to the children started by your application's start function:
defmodule MyApp.Application do
  use Application

  def start(_type, _args) do
    children = [
      PdfExtractor
      # ... your other children
    ]

    opts = [strategy: :one_for_one, name: MyApp.Supervisor]
    Supervisor.start_link(children, opts)
  end
end
Usage
Extract text from specific regions using bounding boxes of the form {x0, y0, x1, y1}:
areas = %{
  0 => {0, 0, 300, 200},      # Top-left area of page 0
  1 => [
    {200, 300, 600, 500},     # Bottom-right area of page 1
    {0, 0, 200, 250}          # Top-left area of page 1
  ]
}

PdfExtractor.extract_text("path/to/document.pdf", areas)
Return Format
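On success, the function returns {:ok, result}, where result is a map keyed by page number. Each value is the extracted text: a string for a whole page or a single area, or a list of strings when several areas were requested for that page:
%{
  0 => "Text from page 0...",
  1 => ["Text from page 1 (first area)...", "Text from page 1 (second area)..."],
  2 => "Text from page 2..."
}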
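Because a page's value may be either a string or a list of strings, callers that want a uniform shape may find a small normalizing step useful. The following is a minimal sketch, not part of PdfExtractor: the ResultHelper module and its to_lists/1 and join_pages/1 functions are hypothetical names introduced here for illustration.

```elixir
defmodule ResultHelper do
  @moduledoc """
  Hypothetical helper (not part of PdfExtractor) that normalizes the
  result map so callers do not have to branch on string vs. list values.
  """

  # Wraps single strings in a one-element list and leaves lists untouched.
  def to_lists(result) when is_map(result) do
    Map.new(result, fn
      {page, text} when is_binary(text) -> {page, [text]}
      {page, texts} when is_list(texts) -> {page, texts}
    end)
  end

  # Joins all extracted text, in ascending page order, into one string.
  def join_pages(result) when is_map(result) do
    result
    |> to_lists()
    |> Enum.sort_by(fn {page, _texts} -> page end)
    |> Enum.flat_map(fn {_page, texts} -> texts end)
    |> Enum.join("\n")
  end
end
```

With the map shown above, ResultHelper.to_lists/1 would turn the values for pages 0 and 2 into one-element lists and leave page 1 unchanged.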
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Built on top of the excellent pdfplumber Python library
- Uses pythonx for seamless Python integration
Functions
child_spec
Returns a specification to start this module under a supervisor. See Supervisor.
extract_metadata
Extracts metadata from the Info trailer of a PDF file. The result typically includes keys such as "CreationDate", "ModDate", and "Producer".
Examples
iex> PdfExtractor.extract_metadata("priv/fixtures/book.pdf")
{:ok,
 %{
   "CreationDate" => "D:20250718212328Z",
   "Creator" => "Stirling-PDF v0.44.2",
   "ModDate" => "D:20250718212328Z",
   "Producer" => "Stirling-PDF v0.44.2"
 }}
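Date values such as "CreationDate" come back as raw PDF date strings rather than Elixir date structs. As a minimal sketch, assuming only the full UTC form D:YYYYMMDDHHmmSSZ seen in the example above needs handling, a conversion to DateTime could look like this. The PdfDate module and parse_utc/1 are hypothetical names and are not part of PdfExtractor.

```elixir
defmodule PdfDate do
  @moduledoc """
  Hypothetical helper (not part of PdfExtractor) for PDF date strings.
  Only the full UTC form `D:YYYYMMDDHHmmSSZ` is handled in this sketch.
  """

  # Splits the fixed-width fields and builds a UTC DateTime from them.
  def parse_utc(
        <<"D:", y::binary-size(4), mo::binary-size(2), d::binary-size(2),
          h::binary-size(2), mi::binary-size(2), s::binary-size(2), "Z">>
      ) do
    with {:ok, date} <- Date.from_iso8601("#{y}-#{mo}-#{d}"),
         {:ok, time} <- Time.from_iso8601("#{h}:#{mi}:#{s}") do
      DateTime.new(date, time, "Etc/UTC")
    end
  end

  # Timezone offsets and truncated dates are out of scope for this sketch.
  def parse_utc(_other), do: {:error, :unsupported_format}
end
```

Invalid field values (for example a month of 13) surface as the error tuple returned by Date.from_iso8601/1 or Time.from_iso8601/1.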
extract_metadata_from_binary
Extracts metadata from PDF binary data. Similar to extract_metadata/1, but works with PDF data in memory instead of files.
Examples
iex> content = File.read!("priv/fixtures/book.pdf")
...> PdfExtractor.extract_metadata_from_binary(content)
{:ok,
 %{
   "CreationDate" => "D:20250718212328Z",
   "Creator" => "Stirling-PDF v0.44.2",
   "ModDate" => "D:20250718212328Z",
   "Producer" => "Stirling-PDF v0.44.2"
 }}
extract_text
Extracts text from PDF pages.
It supports extracting from single pages, multiple pages, and specific areas within pages.
Page Numbers
- Integer: Extract from a single page (e.g., 0 for the first page)
- List: Extract from multiple pages (e.g., [0, 1, 2])
- Empty list []: Extract from all pages (default)
Areas Format
Areas are specified as a map where keys are page numbers and values are bounding boxes:
- Single area: %{0 => {x0, y0, x1, y1}}
- Multiple areas: %{0 => [{x0, y0, x1, y1}, {x2, y2, x3, y3}]}
- Mixed: %{0 => {x0, y0, x1, y1}, 1 => [{x2, y2, x3, y3}, {x4, y4, x5, y5}]}
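When the same region repeats on several pages, the areas map can be built programmatically instead of written out by hand. The sketch below is illustrative only: AreaBuilder, same_box/2, and valid_box?/1 are hypothetical names, not part of PdfExtractor, and the check that x0 < x1 and y0 < y1 is an assumption about the coordinate convention based on the boxes used in this documentation.

```elixir
defmodule AreaBuilder do
  @moduledoc """
  Hypothetical helpers (not part of PdfExtractor) for building areas maps.
  """

  # Applies one bounding box to every page number in `pages`.
  def same_box(pages, {_x0, _y0, _x1, _y1} = box) when is_list(pages) do
    Map.new(pages, fn page -> {page, box} end)
  end

  # Assumed sanity check: numeric coordinates with x0 < x1 and y0 < y1.
  def valid_box?({x0, y0, x1, y1})
      when is_number(x0) and is_number(y0) and is_number(x1) and is_number(y1) do
    x0 < x1 and y0 < y1
  end

  # Anything that is not a four-element tuple of numbers is rejected.
  def valid_box?(_other), do: false
end
```

For example, AreaBuilder.same_box([2, 8, 10], {0, 0, 612, 190}) builds a map that assigns the same box to pages 2, 8, and 10, which could then be passed as the areas argument.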
Examples
Extract text from all pages.
iex> PdfExtractor.extract_text("priv/fixtures/fatura.pdf")
{:ok,
 %{
   0 =>
     "Text Example Bill FATURA\n# 2025010002\nData: Jun 21, 2025\nProjeto de lei para:\nSaldo devedor: 1 525,59 €\nElixir Company\nItem Quantidade Avaliar Quantia\nTrabalho 1 1 500,00 € 1 500,00 €\nMais trabalho 1 25,59 € 25,59 €\nSubtotal: 1 525,59 €\nImposto (0%): 0,00 €\nTotal: 1 525,59 €",
   1 =>
     "✂\nReceipt Payment part Account / Payable to\nCH4431999123000889012\n✂\nMax Muster & Söhne\nAccount / Payable to\nCH4431999123000889012 Musterstrasse 123\nMax Muster & Söhne 8000 Seldwyla\nMusterstrasse 123\n8000 Seldwyla\nReference\n210000000003139471430009017\nReference\n210000000003139471430009017\nAdditional information\nBestellung vom 15.10.2020\nPayable by (name/address)\nSimon Muster\nPayable by (name/address)\nMusterstrasse 1\nCurrency Amount\nSimon Muster\n8000 Seldwyla\nCHF 1 949.75 Musterstrasse 1\n8000 Seldwyla\nCurrency Amount\nCHF 1 949.75\nAcceptance point"
 }}
Extract text from only some pages.
iex> PdfExtractor.extract_text("priv/fixtures/fatura.pdf", [0])
{:ok,
 %{
   0 =>
     "Text Example Bill FATURA\n# 2025010002\nData: Jun 21, 2025\nProjeto de lei para:\nSaldo devedor: 1 525,59 €\nElixir Company\nItem Quantidade Avaliar Quantia\nTrabalho 1 1 500,00 € 1 500,00 €\nMais trabalho 1 25,59 € 25,59 €\nSubtotal: 1 525,59 €\nImposto (0%): 0,00 €\nTotal: 1 525,59 €"
 }}
Extract only the chapter titles from the book.
iex> PdfExtractor.extract_text("priv/fixtures/book.pdf", %{
...>   2 => {0, 0, 612, 190},
...>   8 => {0, 0, 612, 190},
...>   10 => {0, 0, 612, 190}
...> })
{:ok,
 %{
   2 => "Introdução – Nota do tradutor",
   8 => "I. Sobre aproveitar o tempo",
   10 => "II. Sobre a falta de foco na Leitura"
 }}
Extract multiple areas from a single page.
iex> PdfExtractor.extract_text("priv/fixtures/book.pdf", %{
...>   1 => [{0, 100, 612, 140}, {0, 400, 612, 440}]
...> })
{:ok,
 %{
   1 => [
     "CARTAS DE UM ESTOICO, Volume I",
     "Montecristo Editora Ltda.\ne-mail: editora@montecristoeditora.com.br"
   ]
 }}
extract_text_from_binary
Extracts text from PDF binary data. See extract_text/3 for details on how to specify pages and areas.
This function extracts text from PDF data that is already in memory, such as data downloaded from a URL or received via an API, so the PDF never has to be written to the filesystem.
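As a rough sketch of the download-then-extract flow, the wrapper below takes the HTTP fetch and the extraction step as function arguments, so the control flow can be tried without a network connection or a Python runtime. RemotePdf and extract_from_url/3 are hypothetical names, not part of PdfExtractor; the {:ok, body} / {:error, reason} contract of the fetch function is likewise an assumption about whichever HTTP client you choose.

```elixir
defmodule RemotePdf do
  @moduledoc """
  Hypothetical wrapper (not part of PdfExtractor) that fetches a PDF and
  extracts its text without writing anything to the filesystem.
  """

  # `fetch` is assumed to return {:ok, binary} or {:error, reason}.
  # `extract` would be &PdfExtractor.extract_text_from_binary/1 in real use;
  # it is injected here so the flow can be exercised in isolation.
  def extract_from_url(url, fetch, extract)
      when is_function(fetch, 1) and is_function(extract, 1) do
    with {:ok, body} <- fetch.(url),
         :ok <- check_pdf_header(body) do
      extract.(body)
    end
  end

  # Well-formed PDF files begin with the marker "%PDF-".
  defp check_pdf_header(<<"%PDF-", _rest::binary>>), do: :ok
  defp check_pdf_header(_other), do: {:error, :not_a_pdf}
end
```

Checking the header before extraction is a design choice in this sketch: it lets an HTML error page or other non-PDF response fail early with a clear reason instead of being handed to the extractor.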
Examples
Extract text from all pages.
iex> content = File.read!("priv/fixtures/fatura.pdf")
...> PdfExtractor.extract_text_from_binary(content)
{:ok,
 %{
   0 =>
     "Text Example Bill FATURA\n# 2025010002\nData: Jun 21, 2025\nProjeto de lei para:\nSaldo devedor: 1 525,59 €\nElixir Company\nItem Quantidade Avaliar Quantia\nTrabalho 1 1 500,00 € 1 500,00 €\nMais trabalho 1 25,59 € 25,59 €\nSubtotal: 1 525,59 €\nImposto (0%): 0,00 €\nTotal: 1 525,59 €",
   1 =>
     "✂\nReceipt Payment part Account / Payable to\nCH4431999123000889012\n✂\nMax Muster & Söhne\nAccount / Payable to\nCH4431999123000889012 Musterstrasse 123\nMax Muster & Söhne 8000 Seldwyla\nMusterstrasse 123\n8000 Seldwyla\nReference\n210000000003139471430009017\nReference\n210000000003139471430009017\nAdditional information\nBestellung vom 15.10.2020\nPayable by (name/address)\nSimon Muster\nPayable by (name/address)\nMusterstrasse 1\nCurrency Amount\nSimon Muster\n8000 Seldwyla\nCHF 1 949.75 Musterstrasse 1\n8000 Seldwyla\nCurrency Amount\nCHF 1 949.75\nAcceptance point"
 }}
Extract text from only some pages.
iex> content = File.read!("priv/fixtures/fatura.pdf")
...> PdfExtractor.extract_text_from_binary(content, [0])
{:ok,
 %{
   0 =>
     "Text Example Bill FATURA\n# 2025010002\nData: Jun 21, 2025\nProjeto de lei para:\nSaldo devedor: 1 525,59 €\nElixir Company\nItem Quantidade Avaliar Quantia\nTrabalho 1 1 500,00 € 1 500,00 €\nMais trabalho 1 25,59 € 25,59 €\nSubtotal: 1 525,59 €\nImposto (0%): 0,00 €\nTotal: 1 525,59 €"
 }}
Extract only the chapter titles from the book.
iex> content = File.read!("priv/fixtures/book.pdf")
...>
...> PdfExtractor.extract_text_from_binary(content, %{
...>   2 => {0, 0, 612, 190},
...>   8 => {0, 0, 612, 190},
...>   10 => {0, 0, 612, 190}
...> })
{:ok,
 %{
   2 => "Introdução – Nota do tradutor",
   8 => "I. Sobre aproveitar o tempo",
   10 => "II. Sobre a falta de foco na Leitura"
 }}
Extract multiple areas from a single page.
iex> content = File.read!("priv/fixtures/book.pdf")
...>
...> PdfExtractor.extract_text_from_binary(content, %{
...>   1 => [{0, 100, 612, 140}, {0, 400, 612, 440}]
...> })
{:ok,
 %{
   1 => [
     "CARTAS DE UM ESTOICO, Volume I",
     "Montecristo Editora Ltda.\ne-mail: editora@montecristoeditora.com.br"
   ]
 }}