PdfExtractor (PdfExtractor v0.5.0)

A powerful and easy-to-use Elixir library for extracting text and metadata from PDF files.

PdfExtractor calls Python's pdfplumber library, embedded through pythonx, to extract text and metadata from PDFs. It accepts both file paths and in-memory binaries, so it works for local files as well as PDFs fetched over HTTP.

Features

  • 🔍 Extract text from single or multiple PDF pages
  • 📍 Area-based extraction using bounding boxes
  • 🌐 Work with PDF data directly from memory (e.g., HTTP downloads)
  • 📊 Get PDF metadata like title, author, creation date
  • 🐍 Leverages Python's powerful pdfplumber library
  • 🚀 Simple and intuitive API
  • ✅ Comprehensive test coverage
  • 📚 Full documentation

Installation

Add pdf_extractor to your list of dependencies in mix.exs:

def deps do
  [
    {:pdf_extractor, "~> 0.5.0"}
  ]
end

Then add it to the children started by your application's start function:

defmodule MyApp.Application do
  use Application

  def start(_type, _args) do
    children = [
      PdfExtractor,
      # ... your other children
    ]

    opts = [strategy: :one_for_one, name: MyApp.Supervisor]
    Supervisor.start_link(children, opts)
  end
end

Usage

Extract text from specific regions using bounding boxes {x0, y0, x1, y1}:

areas = %{
  0 => {0, 0, 300, 200},    # Top-left area of page 0
  1 => [
    {200, 300, 600, 500},   # Bottom-right area of page 1
    {0, 0, 200, 250}        # Top-left area of page 1
  ]
}
PdfExtractor.extract_text("path/to/document.pdf", areas)
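
Named helpers can make bounding boxes easier to read than raw numbers. The sketch below assumes coordinates are in PDF points measured from the top-left corner of the page (consistent with the comments in the example above) and a US Letter page of 612 x 792 points. The AreaHelper module and its functions are hypothetical and not part of PdfExtractor.

```elixir
defmodule AreaHelper do
  @moduledoc "Hypothetical helper for building {x0, y0, x1, y1} boxes; not part of PdfExtractor."

  # Assumed page size: US Letter in PDF points. Pass another size for other documents.
  @letter {612, 792}

  # Full-width box covering the top `fraction` (0 < fraction <= 1) of the page.
  def top_band(fraction, {width, height} \\ @letter) when fraction > 0 and fraction <= 1 do
    {0, 0, width, round(height * fraction)}
  end

  # Full-width box covering the bottom `fraction` (0 < fraction <= 1) of the page.
  def bottom_band(fraction, {width, height} \\ @letter) when fraction > 0 and fraction <= 1 do
    {0, round(height * (1 - fraction)), width, height}
  end
end

areas = %{
  0 => AreaHelper.top_band(0.25),
  1 => [AreaHelper.top_band(0.25), AreaHelper.bottom_band(0.5)]
}
```

The resulting areas map has the same shape as the hand-written one above and can be passed to PdfExtractor.extract_text in the same way.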

Return Format

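On success, the function returns {:ok, result}, where result is a map keyed by page number. A page's value is a single string when the whole page or one area was extracted, and a list of strings when several areas were requested for that page:

{:ok,
 %{
   0 => "Text from page 0...",
   1 => ["Text from page 1 (first area)...", "Text from page 1 (second area)..."],
   2 => "Text from page 2..."
 }}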
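
Because a page's value may be either a string or a list of strings, downstream code often wants a uniform shape. The following is a minimal sketch of one way to normalize the page map; PageText is a hypothetical module, not part of PdfExtractor, and the sample map stands in for real extraction output.

```elixir
defmodule PageText do
  # Hypothetical helper, not part of PdfExtractor.

  # Wraps single-string values in a list so every page maps to a list of strings.
  def normalize(pages) when is_map(pages) do
    Map.new(pages, fn {page, text} -> {page, List.wrap(text)} end)
  end

  # Concatenates all extracted chunks in ascending page order, one chunk per line.
  def join(pages) when is_map(pages) do
    pages
    |> normalize()
    |> Enum.sort_by(fn {page, _texts} -> page end)
    |> Enum.flat_map(fn {_page, texts} -> texts end)
    |> Enum.join("\n")
  end
end

# Stand-in for the page map returned by an extraction call.
result = %{0 => "Page 0 text", 1 => ["First area", "Second area"]}

normalized = PageText.normalize(result)
# normalized == %{0 => ["Page 0 text"], 1 => ["First area", "Second area"]}
```

List.wrap/1 leaves lists untouched and wraps any other value, which is why the same function handles both shapes.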

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Built on top of the excellent pdfplumber Python library
  • Uses pythonx for seamless Python integration

Summary

Functions

child_spec(init_arg)

Returns a specification to start this module under a supervisor.

extract_metadata(file_path)

Extracts metadata from a PDF file's Info trailers. Typically includes "CreationDate", "ModDate", "Producer", and similar fields.

extract_metadata_from_binary(binary)

Extracts metadata from PDF binary data. Similar to extract_metadata/1 but works with PDF data in memory instead of files.

extract_text(file_path, pages \\ [])

Extracts text from PDF pages.

extract_text_from_binary(binary, pages \\ [])

Extracts text from PDF binary data. See extract_text/2 for details on how to specify pages and areas.

start_link(opts \\ [])

Functions

child_spec(init_arg)

Returns a specification to start this module under a supervisor.

See Supervisor.

extract_metadata(file_path)

Extracts metadata from a PDF file's Info trailers. Typically includes "CreationDate", "ModDate", "Producer", and similar fields.

Examples

iex> PdfExtractor.extract_metadata("priv/fixtures/book.pdf")
{:ok,
 %{
   "CreationDate" => "D:20250718212328Z",
   "Creator" => "Stirling-PDF v0.44.2",
   "ModDate" => "D:20250718212328Z",
   "Producer" => "Stirling-PDF v0.44.2"
 }}
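
The date fields come back as raw PDF date strings. As a sketch, the UTC form shown above (D:YYYYMMDDHHmmSSZ) can be converted to a DateTime with binary pattern matching. PdfDate is a hypothetical module, not part of PdfExtractor, and it deliberately rejects the other variants the PDF format allows, such as explicit timezone offsets or truncated dates.

```elixir
defmodule PdfDate do
  # Hypothetical helper, not part of PdfExtractor.
  # Handles only the "D:YYYYMMDDHHmmSSZ" (UTC) form seen in the example output.
  def parse(
        <<"D:", year::binary-size(4), month::binary-size(2), day::binary-size(2),
          hour::binary-size(2), minute::binary-size(2), second::binary-size(2), "Z">>
      ) do
    iso = "#{year}-#{month}-#{day}T#{hour}:#{minute}:#{second}Z"

    case DateTime.from_iso8601(iso) do
      {:ok, datetime, _utc_offset} -> {:ok, datetime}
      {:error, reason} -> {:error, reason}
    end
  end

  # Anything else (offsets, truncated dates, non-binaries) is reported as unsupported.
  def parse(_other), do: {:error, :unsupported_format}
end

# Stand-in for the map returned inside {:ok, metadata}.
metadata = %{"CreationDate" => "D:20250718212328Z"}
created_at = PdfDate.parse(metadata["CreationDate"])
# created_at == {:ok, ~U[2025-07-18 21:23:28Z]}
```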

extract_metadata_from_binary(binary)

Extracts metadata from PDF binary data. Similar to extract_metadata/1 but works with PDF data in memory instead of files.

Examples

iex> content = File.read!("priv/fixtures/book.pdf")
...> PdfExtractor.extract_metadata_from_binary(content)
{:ok,
 %{
   "CreationDate" => "D:20250718212328Z",
   "Creator" => "Stirling-PDF v0.44.2",
   "ModDate" => "D:20250718212328Z",
   "Producer" => "Stirling-PDF v0.44.2"
 }}

extract_text(file_path, pages \\ [])

Extracts text from PDF pages.

It supports extracting from single pages, multiple pages, and specific areas within pages.

Page Numbers

  • Integer: Extract from a single page (e.g., 0 for the first page)
  • List: Extract from multiple pages (e.g., [0, 1, 2])
  • Empty list []: Extract from all pages (default)

Areas Format

Areas are specified as a map where keys are page numbers and values are bounding boxes:

  • Single area: %{0 => {x0, y0, x1, y1}}
  • Multiple areas: %{0 => [{x0, y0, x1, y1}, {x2, y2, x3, y3}]}
  • Mixed: %{0 => {x0, y0, x1, y1}, 1 => [{x2, y2, x3, y3}, {x4, y4, x5, y5}]}

Examples

Extract text from all pages.

iex> PdfExtractor.extract_text("priv/fixtures/fatura.pdf")
{:ok,
 %{
   0 =>
     "Text Example Bill FATURA\n# 2025010002\nData: Jun 21, 2025\nProjeto de lei para:\nSaldo devedor: 1 525,59 €\nElixir Company\nItem Quantidade Avaliar Quantia\nTrabalho 1 1 500,00 € 1 500,00 €\nMais trabalho 1 25,59 € 25,59 €\nSubtotal: 1 525,59 €\nImposto (0%): 0,00 €\nTotal: 1 525,59 €",
   1 =>
     "✂\nReceipt Payment part Account / Payable to\nCH4431999123000889012\n\nMax Muster & Söhne\nAccount / Payable to\nCH4431999123000889012 Musterstrasse 123\nMax Muster & Söhne 8000 Seldwyla\nMusterstrasse 123\n8000 Seldwyla\nReference\n210000000003139471430009017\nReference\n210000000003139471430009017\nAdditional information\nBestellung vom 15.10.2020\nPayable by (name/address)\nSimon Muster\nPayable by (name/address)\nMusterstrasse 1\nCurrency Amount\nSimon Muster\n8000 Seldwyla\nCHF 1 949.75 Musterstrasse 1\n8000 Seldwyla\nCurrency Amount\nCHF 1 949.75\nAcceptance point"
 }}

Extract text from only some pages.

iex> PdfExtractor.extract_text("priv/fixtures/fatura.pdf", [0])
{:ok,
 %{
   0 =>
     "Text Example Bill FATURA\n# 2025010002\nData: Jun 21, 2025\nProjeto de lei para:\nSaldo devedor: 1 525,59 €\nElixir Company\nItem Quantidade Avaliar Quantia\nTrabalho 1 1 500,00 € 1 500,00 €\nMais trabalho 1 25,59 € 25,59 €\nSubtotal: 1 525,59 €\nImposto (0%): 0,00 €\nTotal: 1 525,59 €"
 }}

Extract only the titles in the book chapters.

iex> PdfExtractor.extract_text("priv/fixtures/book.pdf", %{
...>   2 => {0, 0, 612, 190},
...>   8 => {0, 0, 612, 190},
...>   10 => {0, 0, 612, 190}
...> })
{:ok,
 %{
   2 => "Introdução – Nota do tradutor",
   8 => "I. Sobre aproveitar o tempo",
   10 => "II. Sobre a falta de foco na Leitura"
 }}
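
When the same box applies to many pages, the areas map can be built rather than typed out. A small sketch using the page numbers and box from the example above (the variable names are illustrative only):

```elixir
# Page numbers where chapters start and the region holding the title,
# both taken from the example above.
chapter_pages = [2, 8, 10]
title_box = {0, 0, 612, 190}

# Build %{page => box} for every chapter page.
areas = Map.new(chapter_pages, fn page -> {page, title_box} end)
# areas == %{2 => {0, 0, 612, 190}, 8 => {0, 0, 612, 190}, 10 => {0, 0, 612, 190}}
```

The resulting map is identical to the literal passed in the example, so it can be handed to extract_text as the second argument.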

Extract multiple areas from a single page.

iex> PdfExtractor.extract_text("priv/fixtures/book.pdf", %{
...>   1 => [{0, 100, 612, 140}, {0, 400, 612, 440}]
...> })
{:ok,
 %{
   1 => [
     "CARTAS DE UM ESTOICO, Volume I",
     "Montecristo Editora Ltda.\ne-mail: editora@montecristoeditora.com.br"
   ]
 }}
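
With several areas on one page, it can help to attach a name to each box and to the text it yields. The sketch below assumes the texts come back in the same order as the boxes were given, as the example output above suggests but does not guarantee. LabeledAreas is a hypothetical module, not part of PdfExtractor, and the texts list stands in for the extracted value of page 1.

```elixir
defmodule LabeledAreas do
  # Hypothetical helper, not part of PdfExtractor.

  # `labeled` is a keyword list of label => bounding box, in extraction order.
  def boxes(labeled), do: Keyword.values(labeled)

  # Pairs each label with the text extracted for its box.
  # Assumes `texts` is ordered the same way as the boxes in `labeled`.
  def label(labeled, texts) when length(labeled) == length(texts) do
    labeled
    |> Keyword.keys()
    |> Enum.zip(texts)
    |> Map.new()
  end
end

fields = [title: {0, 100, 612, 140}, publisher: {0, 400, 612, 440}]

# Areas map in the shape extract_text expects for page 1.
areas = %{1 => LabeledAreas.boxes(fields)}

# Stand-in for the list returned for page 1 in the example above.
texts = [
  "CARTAS DE UM ESTOICO, Volume I",
  "Montecristo Editora Ltda.\ne-mail: editora@montecristoeditora.com.br"
]

labeled = LabeledAreas.label(fields, texts)
```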

extract_text_from_binary(binary, pages \\ [])

Extracts text from PDF binary data. See extract_text/2 for details on how to specify pages and areas.

This function allows you to extract text from PDF data that's already in memory, such as data downloaded from a URL or received via an API. This avoids the need to write the PDF to the filesystem.

Examples

Extract text from all pages.

iex> content = File.read!("priv/fixtures/fatura.pdf")
...> PdfExtractor.extract_text_from_binary(content)
{:ok,
 %{
   0 =>
     "Text Example Bill FATURA\n# 2025010002\nData: Jun 21, 2025\nProjeto de lei para:\nSaldo devedor: 1 525,59 €\nElixir Company\nItem Quantidade Avaliar Quantia\nTrabalho 1 1 500,00 € 1 500,00 €\nMais trabalho 1 25,59 € 25,59 €\nSubtotal: 1 525,59 €\nImposto (0%): 0,00 €\nTotal: 1 525,59 €",
   1 =>
     "✂\nReceipt Payment part Account / Payable to\nCH4431999123000889012\n\nMax Muster & Söhne\nAccount / Payable to\nCH4431999123000889012 Musterstrasse 123\nMax Muster & Söhne 8000 Seldwyla\nMusterstrasse 123\n8000 Seldwyla\nReference\n210000000003139471430009017\nReference\n210000000003139471430009017\nAdditional information\nBestellung vom 15.10.2020\nPayable by (name/address)\nSimon Muster\nPayable by (name/address)\nMusterstrasse 1\nCurrency Amount\nSimon Muster\n8000 Seldwyla\nCHF 1 949.75 Musterstrasse 1\n8000 Seldwyla\nCurrency Amount\nCHF 1 949.75\nAcceptance point"
 }}

Extract text from only some pages.

iex> content = File.read!("priv/fixtures/fatura.pdf")
...> PdfExtractor.extract_text_from_binary(content, [0])
{:ok,
 %{
   0 =>
     "Text Example Bill FATURA\n# 2025010002\nData: Jun 21, 2025\nProjeto de lei para:\nSaldo devedor: 1 525,59 €\nElixir Company\nItem Quantidade Avaliar Quantia\nTrabalho 1 1 500,00 € 1 500,00 €\nMais trabalho 1 25,59 € 25,59 €\nSubtotal: 1 525,59 €\nImposto (0%): 0,00 €\nTotal: 1 525,59 €"
 }}

Extract only the titles in the book chapters.

iex> content = File.read!("priv/fixtures/book.pdf")
...>
...> PdfExtractor.extract_text_from_binary(content, %{
...>   2 => {0, 0, 612, 190},
...>   8 => {0, 0, 612, 190},
...>   10 => {0, 0, 612, 190}
...> })
{:ok,
 %{
   2 => "Introdução – Nota do tradutor",
   8 => "I. Sobre aproveitar o tempo",
   10 => "II. Sobre a falta de foco na Leitura"
 }}

Extract multiple areas from a single page.

iex> content = File.read!("priv/fixtures/book.pdf")
...>
...> PdfExtractor.extract_text_from_binary(content, %{
...>   1 => [{0, 100, 612, 140}, {0, 400, 612, 440}]
...> })
{:ok,
 %{
   1 => [
     "CARTAS DE UM ESTOICO, Volume I",
     "Montecristo Editora Ltda.\ne-mail: editora@montecristoeditora.com.br"
   ]
 }}

start_link(opts \\ [])