Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

1.0.1 — 2026-05-07 (fork: ExPDF)

First release of ex_pdf on Hex. Fork of andrewtimberlake/elixir-pdf v0.7.2 (Hex package :pdf, ©Andrew Timberlake, MIT). The writer API is preserved unchanged; this release adds a native PDF reader, error recovery, AcroForm/outlines/annotations extraction, encryption support, and many quality-of-life improvements.

Renamed

OTP app and Hex package: :pdf → :ex_pdf. Module names are unchanged (Pdf, Pdf.Reader, etc. — the namespace Pdf.* was kept to avoid breaking writer callers).
Mixfile module: Pdf.Mixfile → ExPdf.Mixfile.
Repo: andrewtimberlake/elixir-pdf → MisaelMa/ExPDF.
All internal :code.priv_dir(:pdf) and Application.compile_env(:pdf, …) references updated to :ex_pdf (transparent to library users).

Added — Native PDF reader

The reader is implemented in pure Elixir + Erlang/OTP stdlib (:zlib, :crypto, :binary, :unicode, :xmerl). No new Hex runtime deps, no system tool deps.

Unified entry point

Pdf.Reader.read/2 — one call returns %Pdf.Reader.Result{meta, pages} carrying document-level metadata + per-page lines with :kind-tagged tokens (:text | :link | :email | :image). See Pdf.Reader.Result and Pdf.Reader.Line for the full shape.
Convenience shapes: read(doc, shape: :text) returns [String.t()], read(doc, shape: :shapes) returns [%Pdf.Reader.Shape{}].
Image opt: image_bytes: true includes raw decoded :bytes in shape meta (off by default — :data_uri is always present).

Core extraction primitives

Pdf.Reader.open/2 (with optional password: and recover: opts)
Pdf.Reader.read_text/1 — plain text per page
Pdf.Reader.read_text_with_positions/1 — text runs with absolute X/Y
Pdf.Reader.read_lines/2 — logical lines split into tokens
Pdf.Reader.read_metadata/1 — Info dict + XMP (PDF 1.7 § 14.3)
Pdf.Reader.read_images/1 — embedded raster images with positions
Pdf.Reader.read_outlines/1 — bookmark tree
Pdf.Reader.read_annotations/1 — per-page annotations
Pdf.Reader.read_acroform/1 — interactive form fields
Pdf.Reader.read_shapes/1 — link-like elements (annotations + inferred)
Pdf.Reader.recovery_log/1 — recovery event accessor
Pdf.Reader.page_count/1
Bang variants for every read function

Encryption (PDF 1.7 § 7.6, PDF 2.0 § 7.6)

Standard Security Handler V1/V2 (RC4-40, RC4-128)
V4 with /AESV2 (AES-128) — crypto:crypto_init_dyn/4
V5/R6 with /AESV3 (AES-256) — full Algorithm 2.A round trip
Empty-password auto-try, file-key derivation per Algorithms 2/3/5/6/8

CID fonts (PDF 1.7 § 9.7)

Identity-H/V composite fonts via 2-byte CID tokenisation
40 predefined CMaps from adobe-type-tools/cmap-resources bundled in priv/cmap/ (UniJIS, GBK, KSC, ETen, Identity, …)
Adobe Japan1/CNS1/Korea1/GB1 collections bundled in priv/
PostScript-subset CMap parser with codespace-aware variable-length tokenizer
ToUnicode CMap fallback for glyphs outside predefined ranges

Per-glyph widths (PDF 1.7 § 9.4.4, § 9.6.2.1, § 9.7.4.3)

Full advance formula tx = ((w/1000 - Tj_kern) × Tfs + Tc + Tw_space) × Th applied per glyph
Heterogeneous CIDFont /W parsing (Form A + Form B + interleaved)
Standard 14 fallback to 500-unit average glyph width when /Widths is absent — restores correct positional advance for Helvetica / Times-Roman / Courier PDFs

Form XObject recursion (PDF 1.7 § 8.10)

Do operator transparent recursion into /Subtype /Form XObjects
CTM × Form /Matrix multiplication, resource merging, cycle detection, depth cap (8)

AcroForm field extraction (PDF 1.7 § 12.7)

Pdf.Reader.FormField struct with /FT inheritance walk
Text, button, choice, signature field types

Outlines and annotations (PDF 1.7 § 12.3, § 12.5)

Pdf.Reader.Outline tree with destinations resolved
Pdf.Reader.Annotation struct with subtype detection (link, text, highlight, file attachment, …)
Annotation-source links automatically merge into the unified Pdf.Reader.Shape API

Image extraction (PDF 1.7 § 8.9)

/Subtype /Image XObjects with absolute positions and CTM-derived rendered dimensions
:jpeg (DCTDecode passthrough) and :png_like (FlateDecode + Predictor) classification
shape.meta.data_uri — RFC 2397 data: URI ready for HTML embedding. JPEG is passthrough; png_like is re-encoded into a real PNG (PNG 1.2 § 5: signature + IHDR + IDAT + IEND with filter byte 0 + zlib) so the URI is browser-loadable.
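
The png_like re-encoding described above (signature, IHDR, IDAT, IEND, filter byte 0, zlib) can be sketched roughly as follows. This is an illustration, not the library's code: the module name PngSketch, its function names, and the restriction to already un-predicted 8-bit grayscale or RGB samples are all assumptions made for the example.

```elixir
defmodule PngSketch do
  # Hypothetical helper, not part of ex_pdf.
  @signature <<137, 80, 78, 71, 13, 10, 26, 10>>

  # Wraps raw 8-bit samples in a minimal PNG.
  # color_type: 0 = grayscale (1 byte per pixel), 2 = RGB (3 bytes per pixel).
  def encode(pixels, width, height, color_type \\ 2) do
    bytes_per_pixel = if color_type == 2, do: 3, else: 1
    row_size = width * bytes_per_pixel

    # Prefix every scanline with filter byte 0 ("None").
    scanlines =
      for <<row::binary-size(row_size) <- pixels>>, into: <<>>, do: <<0, row::binary>>

    # IHDR: width, height, bit depth 8, color type, then
    # compression 0, filter method 0, interlace 0.
    ihdr = <<width::32, height::32, 8, color_type, 0, 0, 0>>

    IO.iodata_to_binary([
      @signature,
      chunk("IHDR", ihdr),
      chunk("IDAT", :zlib.compress(scanlines)),
      chunk("IEND", <<>>)
    ])
  end

  # RFC 2397 data: URI for embedding in HTML.
  def data_uri(png), do: "data:image/png;base64," <> Base.encode64(png)

  # A PNG chunk is: length, type, data, CRC-32 over type and data.
  defp chunk(type, data) do
    <<byte_size(data)::32, type::binary, data::binary, :erlang.crc32(type <> data)::32>>
  end
end
```

For a 2×1 RGB image, `PngSketch.encode(<<255, 0, 0, 0, 255, 0>>, 2, 1) |> PngSketch.data_uri()` yields a string that can be used directly as an `<img src>` value.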

Shape inference

Pdf.Reader.Shape struct unifies link annotations and pattern-inferred URIs/emails/images.
URL regex per RFC 3986 § 3, email regex per RFC 5321 § 4.1.2.
Trailing punctuation (. , ; : ) ]) stripped from inferred URIs.
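
The trailing-punctuation rule above could look something like this. The module and function names are hypothetical, and the regex is only one way to express the listed character set (. , ; : ) ]); the library's actual pattern is not shown in this changelog.

```elixir
defmodule UriTrimSketch do
  # Hypothetical helper, not part of ex_pdf.
  # Removes any run of . , ; : ) ] at the very end of an inferred URI.
  def strip_trailing(uri) when is_binary(uri) do
    String.replace(uri, ~r/[.,;:)\]]+$/, "")
  end
end
```

One consequence of such a rule is that a URI that legitimately ends in `)` is trimmed as well, which is a common trade-off when inferring links from running text.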

Error recovery

Opt-in recover: true flag with four orthogonal phases:
- R-1 Per-page isolation — one bad page does not kill the doc
- R-2 Font lenience — bad font refs fall back to U+FFFD per byte
- R-3 XRef linear scan — :binary.matches/2 recovers from corrupted xref tables; trailer synthesis from last trailer<<…>> block or /Type /Catalog scan; multi-gen dedup (highest gen wins)
- R-4 Catalog/Pages tree fallback — /Type /Page xref scan when /Root or /Pages doesn't resolve
Closed set of recovery event tuples observable via recovery_log/1: :eof_marker_missing, :xref_recovered, :page_tree_recovered, :page_failed, :font_skipped.
Fatal errors (:not_a_pdf, :encrypted_password_required, :encrypted_wrong_password, :encrypted_unsupported_handler) remain hard errors even under recover: true.
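
To give a feel for the R-3 linear scan with multi-gen dedup, here is a small sketch. It is not the library's XRef.recover/1: the changelog names :binary.matches/2, whereas this example swaps in Regex.scan/3 with return: :index for brevity, and the module name and return shape are assumptions.

```elixir
defmodule XrefScanSketch do
  # Hypothetical helper, not part of ex_pdf.
  # Finds every "N G obj" header and returns %{obj_num => {gen_num, byte_offset}}.
  # When one object number appears with several generations, the highest wins.
  def scan(bin) when is_binary(bin) do
    ~r/(\d+)[ \t\r\n]+(\d+)[ \t\r\n]+obj/
    |> Regex.scan(bin, return: :index)
    |> Enum.reduce(%{}, fn [{offset, _len}, {n_at, n_len}, {g_at, g_len}], acc ->
      n = bin |> binary_part(n_at, n_len) |> String.to_integer()
      g = bin |> binary_part(g_at, g_len) |> String.to_integer()

      case acc do
        # An entry with a higher generation is already recorded: keep it.
        %{^n => {old_g, _old_offset}} when old_g > g -> acc
        _ -> Map.put(acc, n, {g, offset})
      end
    end)
  end
end
```

A real scanner also has to cope with false positives inside stream data, which this sketch ignores.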

Added — Tooling

releaser ~> 0.0.7 dev dependency for monorepo-aware version bumping, changelog generation, and Hex publishing.
Hex package metadata: maintainers, contributors (Andrew Timberlake + Misael Sánchez), licenses, links (GitHub + upstream + changelog).
ExDoc groups_for_modules separating Reader and Writer namespaces.
Comprehensive README documenting every reader feature with spec citations.

Test suite

1180 tests, 0 failures, 30 excluded as of this release.

Unreleased — pdf-reader-error-recovery

Added

Pdf.Reader.open/2 recover: true option — opts in to error-recovery mode. When recover: false (the default, unchanged), all existing strict behavior is preserved. When recover: true, the reader activates four orthogonal recovery phases (R-1..R-4) and logs each recovery action instead of returning {:error, _}.
Pdf.Reader.recovery_log/1 — public accessor returning the recovery event log in chronological (oldest-first) order. An empty list after open/2 guarantees that no recovery action occurred. Direct access to doc.recovery_log is discouraged in application code.
Pdf.Reader.Document struct extension — two new fields with defaults that are invisible to code that does not reference them:
- recover_mode :: boolean(), default false
- recovery_log :: [recovery_event()], default []
PUBLIC API CHANGE — read_text/1 and read_images/1 return shape. Both functions now return {:ok, list, doc} 3-tuples (doc is the updated document carrying the recovery log). The bang variants (read_text!/2, read_images!/1) are unchanged.
R-1 — Per-page isolation: when recover: true, a failed page is logged as {:page_failed, page_n, reason} and skipped; remaining pages continue. Spec: PDF 1.7 § 7.7.3, § 7.8.
R-2 — Font decoder lenience: when recover: true and a font dict fails to resolve, the decoder for that font is replaced with a per-byte U+FFFD identity decoder (<<0xFFFD::utf8>> per byte). The event {:font_skipped, page_n, font_name, reason} is logged. String.valid?/1 is guaranteed true on recovery output. Spec: PDF 1.7 § 9.6, § 9.10.
R-3 — XRef linear scan: when recover: true and normal xref loading fails (corrupt startxref offset, absent %%EOF), XRef.recover/1 performs a :binary.matches/2 scan to reconstruct the cross-reference table. A {:xref_recovered, n_objects} event is logged. When %%EOF is absent, an additional {:eof_marker_missing, :linear_scan_used} event is prepended. Spec: PDF 1.7 § 7.5.4, § 7.5.5, § 7.5.8.
R-4 — Catalog / Pages tree fallback: when recover: true and /Root or /Pages cannot be resolved, the reader scans the recovered xref entries for objects with /Type /Page and (/Contents OR /Parent). A {:page_tree_recovered, n_pages} event is logged. Form XObjects (which also carry /Type /Page sometimes) are correctly excluded by the filter. Spec: PDF 1.7 § 7.7.2, § 7.7.3.
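
The R-2 fallback (one U+FFFD per input byte) is simple enough to sketch in a few lines. The module name is hypothetical; only the <<0xFFFD::utf8>> per byte behavior comes from the entry above.

```elixir
defmodule FallbackDecoderSketch do
  # Hypothetical helper, not part of ex_pdf.
  # Emits one U+FFFD per input byte, so the result is always valid UTF-8
  # regardless of what the undecodable bytes were.
  def decode(bytes) when is_binary(bytes) do
    for <<_byte <- bytes>>, into: "", do: <<0xFFFD::utf8>>
  end
end
```

Because the output never contains the original bytes, String.valid?/1 holds for any input.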

Closed set of recovery event tuples

Tuple	Meaning
`{:xref_recovered, n}`	Linear scan recovered `n` object entries
`{:eof_marker_missing, :linear_scan_used}`	`%%EOF` absent; linear scan was invoked
`{:page_failed, page_n, reason}`	Page `page_n` skipped; `reason` is an atom or term
`{:font_skipped, page_n, font_name, reason}`	Font replaced with U+FFFD fallback
`{:page_tree_recovered, n_pages}`	Catalog/Pages fallback found `n_pages` page objects

No other tuple shapes are appended.
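
Because the set of event tuples is closed, calling code can match on every shape exhaustively. The sketch below works on a plain list of tuples such as the one recovery_log/1 is described as returning; the module name and the message wording are assumptions.

```elixir
defmodule RecoveryLogSketch do
  # Hypothetical helper, not part of ex_pdf.
  # One clause per tuple shape listed in the table above.
  def describe({:xref_recovered, n}),
    do: "linear scan recovered #{n} object entries"

  def describe({:eof_marker_missing, :linear_scan_used}),
    do: "%%EOF absent; linear scan was used"

  def describe({:page_failed, page_n, reason}),
    do: "page #{page_n} skipped: #{inspect(reason)}"

  def describe({:font_skipped, page_n, font_name, reason}),
    do: "font #{inspect(font_name)} on page #{page_n} replaced: #{inspect(reason)}"

  def describe({:page_tree_recovered, n_pages}),
    do: "page-tree fallback found #{n_pages} pages"
end
```

Usage with a hand-written log: `Enum.map([{:xref_recovered, 12}, {:page_failed, 3, :bad_stream}], &RecoveryLogSketch.describe/1)`.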

Known gaps (recovery)

Encrypted AND corrupted PDFs — when a PDF is both encrypted and has a corrupt xref/catalog, the synthetic trailer built by the linear scan does not include /Encrypt. Decryption cannot proceed; these PDFs are non-decryptable even with recover: true.
Catalog-fallback page order — when R-4 triggers, the page list is in xref-insertion order, NOT document order. The {:page_tree_recovered, n} event signals this known limitation to callers.
R-4 probe cost — with recover: true, do_open/2 runs a full page-tree walk immediately after xref load (to surface {:page_tree_recovered, n} on the doc returned from open/2). This is O(pages) and measurable on large documents. It is opt-in by design.

Internal

Test suite: 1128 tests, 0 failures, 29 excluded (was 1125 before this change).
New test file: test/pdf/reader/recovery_test.exs (65 tests: 16 RED, 11 GREEN, integration, smoke, and stress).
Strict TDD throughout (red → green → refactor per task).
Spec-driven via SDD (sdd/pdf-reader-error-recovery/* artifacts in engram).

Unreleased — pdf-reader-per-glyph-widths

Added

Per-glyph width support (Pdf.Reader.Font.Widths): text-matrix advance now uses the full PDF 1.7 § 9.4.4 formula — tx = ((w/1000) * Tfs + Tc + Tw_if_space) * Th — rather than a uniform 1-em approximation. Glyph widths are loaded from:
- /Widths, /FirstChar, /LastChar for simple fonts (Type1, TrueType) — § 9.6.2.1
- /W Form A/B arrays and /DW fallback for CIDFonts (Type0) — § 9.7.4.3
Pdf.Reader.Font.Widths — new module with closures of type (binary() -> [non_neg_integer()]), one per font, built alongside the existing decoder map in extract_page_runs/3 and threaded through Form XObject recursion.
GraphicsState.widths_fn — new field (default nil) storing the active font's width closure. Set by the Tf operator alongside decoder. (§ 9.4.4)
Tc, Tw, Tz, TL text-state operators now correctly update GraphicsState. Previously their operands were silently dropped.
TJ kerning shift now applies horizontal scaling (Th): shift = -(n/1000) * Tfs * Th (previously Th was omitted).
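
The two formulas in this section translate almost directly into code. The sketch below is a standalone illustration with hypothetical names; it takes horizontal scaling as a percentage (the Tz operand) and converts it to Th inside the function.

```elixir
defmodule GlyphAdvanceSketch do
  # Hypothetical helper, not part of ex_pdf.
  # w          glyph width in 1/1000 text-space units
  # tfs        font size
  # tc, tw     character and word spacing
  # th_percent horizontal scaling as a percentage
  # space?     whether the glyph is the single-byte space (Tw applies only then)
  def advance(w, tfs, tc, tw, th_percent, space?) do
    th = th_percent / 100
    tw_if_space = if space?, do: tw, else: 0.0
    (w / 1000 * tfs + tc + tw_if_space) * th
  end

  # Shift applied for a numeric element n inside a TJ array.
  def tj_shift(n, tfs, th_percent) do
    -(n / 1000) * tfs * (th_percent / 100)
  end
end
```

For example, a 500-unit glyph at 12 pt with no extra spacing and 100% scaling advances by 6 text-space units.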

Changed

GraphicsState.horizontal_scaling default changed from 1.0 to 100.0 (the PDF spec unit is a percentage; Th = horizontal_scaling / 100). Code that reads this field directly and expects the percentage form is unaffected; code that relied on the old 1.0 default as a ready-to-use scale factor must now divide by 100.

Documented gaps (not in scope)

Vertical writing widths (/W2, /DW2) — § 9.7.4.4
Standard-14 hardcoded AFM metrics — § 9.6.2.2 (fonts without embedded /Widths currently produce zero-width advance; Tc/Tw still apply correctly)
Non-default /FontMatrix scaling on CIDType2 fonts — § 9.7.4.3

Internal

Test suite: 1107 tests, 0 failures, 27 excluded (was 1095 before this change).
New file: lib/pdf/reader/font/widths.ex
New test file: test/pdf/reader/font/widths_test.exs (25 tests)

Unreleased — housekeeping-dialyzer-warnings

Internal

Removed 9 dead-code clauses flagged by Dialyzer "pattern can never match the type". All were defensive {:error, _} arms in bang-wrappers and downstream pattern dispatches where the upstream success typing was {:ok, ...}-only. No behavior change. Specifically: read_metadata!/1 error branch; extract_doc_id/1 {:hex_string, _} and _ patterns; resolve_page_resources/4 dead {n, g} and nil key branches plus unreachable {:error, _} cache arm; do_resolve_page_resources/4 dead {n, g} and nil parent_key branches; font.ex {:error, _} arm for CID.Decoder.build/2; decoder.ex parse_registry(nil) clause. Defensive _error/_ fallbacks in outlines.ex and annotations.ex that guard against future widening of the Destination.resolve/3 return type were intentionally kept and annotated with comments.

Unreleased — pdf-reader-cid-fonts-tier3

Added

10 Tier 3 predefined CMaps bundled in priv/cmap/:
- Adobe-Japan1: EUC-H, EUC-V
- Adobe-CNS1: B5-H, B5-V, ETenms-B5-H, ETenms-B5-V
- Adobe-GB1: GB-H, GB-V
- Adobe-Korea1: KSCms-UHC-HW-H, KSCms-UHC-HW-V
- Source: adobe-type-tools/cmap-resources (Apache-2.0)
Pdf.Reader.CID.PredefinedCMap.@bundled set extended from 30 to 40 names.

Internal

Test suite: 1063 tests, 3 pre-existing failures (encryption), 0 new failures.
Bundle size: +51.9 KB additional priv/cmap/ data.

Unreleased — housekeeping-mix-format

Internal

Auto-formatted 12 pre-existing files via mix format to satisfy --check-formatted. Affected files: lib/pdf.ex, lib/pdf/builder.ex, lib/pdf/fonts.ex, lib/pdf/images/png.ex, lib/pdf/layout.ex, lib/pdf/page.ex, lib/pdf/styled_table.ex, test/pdf/builder_test.exs, test/pdf/fonts_test.exs, test/pdf/layout_test.exs, test/pdf/page_templates_test.exs, test/pdf/styled_table_test.exs. No behavior change.

Unreleased — pdf-reader-annotations-outlines

Added

Document outlines (bookmarks) — Pdf.Reader.read_outlines/1 returns [%Pdf.Reader.Outline{title, level, dest_page, children}] walking catalog /Outlines linked list with cycle detection (visited MapSet) and depth cap 32.
Annotations — Pdf.Reader.read_annotations/1 returns [%Pdf.Reader.Annotation{type, page, rect, contents, ...}] for the 10 in-scope subtypes: Link, Text, Highlight, Underline, StrikeOut, Squiggly, Square, Circle, FreeText, FileAttachment. Other subtypes surface as :type :unknown with raw fields preserved in :kind_specific.
Pdf.Reader.Destination — resolves all 4 /Dest variants (direct array, named string, /A /S /GoTo /D <array>, /A /S /GoTo /D <name>). Name-tree walker handles depth-20 + cycle detection.
Pdf.Reader.Utils — extracted shared decode_pdf_string/1 (UTF-16BE BOM + hex string aware) and parse_rect/1. Pdf.Reader and Pdf.Reader.AcroForm migrated to use Utils; private duplicates removed.
Page index cache — :page_ref_index cached once per read_* call via Destination.ensure_page_index/1. Avoids O(n) page-ref lookups per annotation/outline.
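
The BOM handling mentioned for decode_pdf_string/1 can be illustrated as below. This is a sketch under assumptions, not the library function: the module name is hypothetical, hex-string unwrapping is left out, and strings without a BOM are decoded as Latin-1, which only approximates PDFDocEncoding.

```elixir
defmodule PdfStringSketch do
  # Hypothetical helper, not part of ex_pdf.
  # Text strings starting with the bytes 0xFE 0xFF are UTF-16BE.
  def decode(<<0xFE, 0xFF, rest::binary>>) do
    case :unicode.characters_to_binary(rest, {:utf16, :big}) do
      text when is_binary(text) -> text
      # Malformed or truncated UTF-16: fall back to a replacement character.
      _error_or_incomplete -> "\uFFFD"
    end
  end

  # Assumption: no BOM means a single-byte encoding; Latin-1 is used here as
  # a stand-in for PDFDocEncoding, which differs for some code points.
  def decode(bytes) when is_binary(bytes) do
    :unicode.characters_to_binary(bytes, :latin1)
  end
end
```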

Out of scope

Annotation appearance streams.
Markup popup hierarchies.
Sound/movie/screen/redact/3D annotations.
AcroForm widget annotations (covered by separate pdf-reader-acroform-extraction).

Internal

1053+ tests, 0 failures.
Strict TDD throughout.
Spec-driven via SDD (sdd/pdf-reader-annotations-outlines/* artifacts in engram).

Unreleased — pdf-reader-cid-fonts-cmap-resources

Added

30 Adobe predefined CMaps bundled in priv/cmap/ (Tier 1 + Tier 2):
- Tier 1 (16 files): UniJIS-UTF16-H/V, UniJIS-UCS2-H/V, UniCNS-UTF16-H/V, UniCNS-UCS2-H/V, UniGB-UTF16-H/V, UniGB-UCS2-H/V, UniKS-UTF16-H/V, UniKS-UCS2-H/V
- Tier 2 (14 files): GBK-EUC-H/V, GBKp-EUC-H/V, GBK2K-H/V, ETen-B5-H/V, KSCms-UHC-H/V, 90ms-RKSJ-H/V, 90msp-RKSJ-H/V
- Source: adobe-type-tools/cmap-resources (Apache-2.0)
Pdf.Reader.CID.CMapParser — minimal PostScript subset parser (codespacerange, cidchar, cidrange, notdefchar, notdefrange, usecmap). Silently skips all other PS content. Returns {:ok, cmap_fields} | {:error, reason}. Never raises on malformed input.
Pdf.Reader.CID.Codespace.tokenize/2 — variable-length 1–4 byte codespace-aware tokenizer per PDF 1.7 § 9.7.6 shortest-match rule. Bytes outside all codespace ranges are silently dropped one-at-a-time.
Pdf.Reader.CID.PredefinedCMap — lazy loader with Document.cache keyed {:predefined_cmap, name} and usecmap chain support (cycle detection via visited MapSet; missing/non-bundled parents fall back to empty CMap per discovery #182).
Pdf.Reader.CID.Decoder.build_predefined/2 — new dispatch branch resolves bytes → CID via codespace + CMap → Unicode via existing Adobe collection table. Resolution cascade: ToUnicode CMap → predefined CMap → Adobe registry → U+FFFD with sentinel.
Pdf.Reader.Font.cid_font_type/1 — extends the former cid_font?/1 predicate to recognise bundled predefined CMap names; dispatch returns :identity | {:predefined, name} | :not_cid.
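
The shortest-match, codespace-aware tokenisation described above can be sketched as follows. The names and the range representation ({low, high} binaries of equal length) are assumptions for this example and need not match Pdf.Reader.CID.Codespace.tokenize/2.

```elixir
defmodule CodespaceSketch do
  # Hypothetical helper, not part of ex_pdf.
  # ranges: list of {low, high} binaries, each 1 to 4 bytes long.
  def tokenize(bin, ranges) when is_binary(bin), do: step(bin, ranges, [])

  defp step(<<>>, _ranges, acc), do: Enum.reverse(acc)

  defp step(bin, ranges, acc) do
    # Try the shortest code length first.
    case Enum.find(1..4, fn len -> matches?(bin, len, ranges) end) do
      nil ->
        # No codespace range covers this position: drop one byte and retry.
        <<_dropped, rest::binary>> = bin
        step(rest, ranges, acc)

      len ->
        <<code::binary-size(len), rest::binary>> = bin
        step(rest, ranges, [code | acc])
    end
  end

  defp matches?(bin, len, ranges) when byte_size(bin) >= len do
    code = binary_part(bin, 0, len)
    Enum.any?(ranges, fn {lo, hi} -> byte_size(lo) == len and within?(code, lo, hi) end)
  end

  defp matches?(_bin, _len, _ranges), do: false

  # Every byte of the code must lie between the corresponding low and high bytes.
  defp within?(code, lo, hi) do
    [code, lo, hi]
    |> Enum.map(&:binary.bin_to_list/1)
    |> Enum.zip()
    |> Enum.all?(fn {c, l, h} -> c >= l and c <= h end)
  end
end
```

With ranges `[{<<0x00>>, <<0x80>>}, {<<0x81, 0x40>>, <<0xFE, 0xFE>>}]`, the input `<<0x41, 0x81, 0x40, 0xFF, 0x42>>` splits into a one-byte code, a two-byte code, and a final one-byte code, with the stray 0xFF dropped.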

Known Limitations

Tier 3 CMaps not bundled — EUC, B5, GB, ETenms-B5, KSCms-UHC-HW and similar encodings were deferred. Shipped in pdf-reader-cid-fonts-tier3.
Adobe-{Japan1,CNS1,Korea1,GB1}-UCS2 abstract parent files do not exist in adobe-type-tools/cmap-resources. The usecmap operator falls back to empty parent CMap if the named parent is not bundled — child's mappings still work standalone. Real-world usecmap chains are exercised via -V → -H pairs (e.g. UniJIS-UTF16-V usecmap UniJIS-UTF16-H).

Internal

Test suite: 909 tests, 0 failures (was 890 before this change's tests).
Strict TDD throughout (red → green → refactor per task pair).
Spec-driven via SDD (sdd/pdf-reader-cid-fonts-cmap-resources/*).

Unreleased — pdf-reader-cid-fonts

Added

CID composite font support — Type0 fonts with /Encoding /Identity-H or /Identity-V are now dispatched to a new CID decoder path in Pdf.Reader.Font.build_decoder_internal/2. Text extraction from standard CJK PDFs (Japanese, Chinese Traditional/Simplified, Korean) now returns correct Unicode instead of U+FFFD.
Four Adobe collection modules — compile-time CID → Unicode tables bundled as @external_resource pattern-match clauses (O(1) BEAM dispatch):
- Pdf.Reader.CID.AdobeJapan1 — ~9 600 entries (UniJIS-UCS2 column)
- Pdf.Reader.CID.AdobeCNS1 — ~18 300 entries (UniCNS-UCS2 column)
- Pdf.Reader.CID.AdobeKorea1 — ~17 100 entries (UniKS-UCS2 column)
- Pdf.Reader.CID.AdobeGB1 — ~28 700 entries (UniGB-UCS2 column)
Source data: adobe-type-tools/cmap-resources repository. Blob SHAs committed:
- Adobe-Japan1-7/cid2code.txt → 4aead36837da
- Adobe-CNS1-7/cid2code.txt → 13ebdcb98e07
- Adobe-Korea1-2/cid2code.txt → 0b5db6b5f5c3
- Adobe-GB1-6/cid2code.txt → c94c7bf8c943
- Repository HEAD at time of normalization: f5cf3bca7fdf
Pdf.Reader.CID.CIDToGIDMap — parses /CIDToGIDMap entries (/Identity, FlateDecode-decoded binary stream, or indirect ref). Stored for future glyph-rendering work; not used in the Unicode cascade.
Pdf.Reader.CID.Decoder — resolves per-CID Unicode via cascade: ToUnicode CMap → Adobe registry table → U+FFFD with sentinel {idx, "cid:0xHHHH"}.
mix.exs package.files — "priv" added so that @external_resource paths in the Adobe collection modules resolve correctly at Hex compile time.
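
A rough picture of the Identity-H path: bytes are read as 2-byte big-endian CIDs, and each CID goes through the cascade ToUnicode, then registry table, then U+FFFD with a sentinel. In this sketch both tables are plain maps and the return shape is invented for illustration; only the cascade order and the "cid:0xHHHH" sentinel format come from the entries above.

```elixir
defmodule CidCascadeSketch do
  # Hypothetical helper, not part of ex_pdf.
  # Splits a string operand into 2-byte big-endian CIDs.
  # A trailing odd byte is ignored by the comprehension.
  def cids(bytes) when is_binary(bytes), do: for(<<cid::16 <- bytes>>, do: cid)

  # to_unicode and registry are %{cid => codepoint} maps (an assumption).
  # Returns [{codepoint, nil | {index, sentinel}}].
  def decode(bytes, to_unicode, registry) do
    bytes
    |> cids()
    |> Enum.with_index()
    |> Enum.map(fn {cid, idx} ->
      case Map.get(to_unicode, cid) || Map.get(registry, cid) do
        nil -> {0xFFFD, {idx, sentinel(cid)}}
        codepoint -> {codepoint, nil}
      end
    end)
  end

  defp sentinel(cid) do
    "cid:0x" <> (cid |> Integer.to_string(16) |> String.pad_leading(4, "0"))
  end
end
```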

Known limitations

Non-Identity predefined CMaps not decoded — fonts with /Encoding /UniJIS-UTF16-H, /GBK-EUC-H, etc. fall through to the simple-font path and emit U+FFFD with sentinels. Full support planned for future change pdf-reader-cid-fonts-cmap-resources.
Vertical writing mode — Identity-V is dispatched to the same decoder as Identity-H. No positional adjustments for vertical layout.

Unreleased — pdf-reader-acroform-extraction

Added

Pdf.Reader.read_acroform/1 and read_acroform!/1 — extract interactive AcroForm form fields from a PDF document. Returns a flat list of %Pdf.Reader.FormField{} structs with decoded names, types, values, flags, and rectangles. Absent /AcroForm returns {:ok, [], doc} — never an error.
Pdf.Reader.FormField struct — carries :name (fully-qualified dot-path), :partial_name, :type (:text | :button | :choice | :signature | :unknown), :value (type-specific decoded value), :default, :tooltip, :flags (%{atom => boolean} decoded from /Ff bitmask), :rect.
Pdf.Reader.AcroForm walker module — depth-first leaf-only walker with cycle detection (MapSet of {n, g} xref keys), depth cap (@max_field_depth 8), /FT inheritance, hierarchical naming, and widget-only annotation filtering.
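
Decoding an /Ff bitmask into a %{atom => boolean} map might look like the sketch below. The bit positions follow the field-flag tables in PDF 1.7 § 12.7 (bit 1 is the lowest-order bit); the atom names and the subset of flags shown are assumptions and may not match the keys the library emits.

```elixir
defmodule FieldFlagsSketch do
  # Hypothetical helper, not part of ex_pdf.
  import Bitwise

  # {name, 1-based bit position}; a small subset for illustration.
  @flags [read_only: 1, required: 2, no_export: 3, multiline: 13, password: 14]

  def decode(ff) when is_integer(ff) do
    Map.new(@flags, fn {name, bit} ->
      {name, (ff &&& 1 <<< (bit - 1)) != 0}
    end)
  end
end
```

For instance, `FieldFlagsSketch.decode(3)` marks the field as both read-only and required.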

Unreleased — pdf-reader-resource-inheritance-multilevel

Fixed

Cyclic /Parent infinite loop — resolve_page_resources/4 now carries a visited MapSet of {obj_num, gen_num} xref refs during each /Parent-chain walk. If a ref is encountered a second time (direct self-ref or transitive cycle), the walk is silently terminated and %{} is returned. Prevents corrupt PDFs from hanging the reader indefinitely.
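
The visited-set guard can be shown on a toy object table. Everything here is a simplified stand-in: objects are a map from {obj_num, gen_num} to plain dicts, and the module is hypothetical rather than the real resolve_page_resources/4.

```elixir
defmodule ParentWalkSketch do
  # Hypothetical helper, not part of ex_pdf.
  # objects: %{{obj_num, gen_num} => %{"Resources" => map, "Parent" => ref}}
  def resolve(objects, ref), do: walk(objects, ref, MapSet.new())

  defp walk(objects, ref, visited) do
    if MapSet.member?(visited, ref) do
      # Seen this ref before: the /Parent chain is cyclic, so stop quietly.
      %{}
    else
      case Map.get(objects, ref) do
        %{"Resources" => resources} -> resources
        %{"Parent" => parent} -> walk(objects, parent, MapSet.put(visited, ref))
        _ -> %{}
      end
    end
  end
end
```

A page whose /Parent chain loops back on itself resolves to an empty map instead of recursing forever.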

Added

Per-leaf-page resource cache — resolved /Resources maps are now stored in doc.cache under {:page_resources, {n, g}} keyed by the leaf page's xref ref. Subsequent calls for the same page (e.g. a second read_text/1 call on an open doc) skip the /Parent-chain walk entirely and return the cached value.

Note

Moduledoc clarified — removed the stale "Known limitations" entry that stated resource inheritance was limited to one level of parent-chain walk. The full recursive walk has been in place since Phase 1.1; only the documentation was wrong. Added PDF 1.7 § 7.7.3 and § 7.7.3.4 spec references.

Unreleased — fix-writer-set-info-state

Fixed

Info dict lost after page mutations — Pdf.set_info/2 (and its single-key variants set_title/2, set_author/2, etc.) stores metadata by updating document.objects. However, Page carries its own copy of objects that was snapshotted at page-creation time. Any subsequent page mutation (set_font, text_at, …) calls sync_page/2, which replaces document.objects with page.objects — silently discarding the info update. The fix propagates the info-dict change into document.current.objects inside put_info/2, so both copies stay in sync and sync_page/2 no longer clobbers metadata.

Unreleased — pdf-reader-form-xobject-recursion (Phase 3)

Added — Phase 3 (pdf-reader-form-xobject-recursion)

Form XObject recursion — Do operators referencing /Type /XObject /Subtype /Form are now recursed into transparently. Text and images inside Forms (headers, footers, repeated logos, templated form fields) appear in Pdf.Reader.read_text/2, read_text_with_positions/1, and read_images/1 output. Previously these objects were emitted as {:deferred, :form_xobject, name} events and silently dropped — that behavior is REPLACED.
CTM × /Matrix inheritance — child Form's CTM is Form.Matrix × parent CTM at time of Do. Graphics state is saved on entry and restored on exit (effectively q ... Q around the form).
Resource merging — Form's /Resources is shallow-merged with the page's resources (Form wins on key collision). Per-Form font decoders are built via Pdf.Reader.Font.build_decoders_for_resources/2 and benefit from the existing Document.cache {:font_decoder, font_ref} cache.
Cycle detection — interpreter state carries a :visited MapSet of {obj_num, gen_num} xref keys, threaded forward into child states. When a Form references an already-visited Form (directly or transitively), an internal {:cycle_detected, ref} event is emitted and recursion is skipped.
Depth cap — recursion is capped at @max_form_depth 8. Beyond that, an internal {:max_depth_exceeded, ref} event is emitted and the Form is skipped.
Image bubble-up — images embedded inside Form XObjects bubble up to the parent's event stream and appear in read_images/1 output, with CTM reflecting the full transform (Form.Matrix × parent CTM × image local CTM).
Internal/cycle/depth events dropped from text output — the new {:cycle_detected, _} and {:max_depth_exceeded, _} event types flow through Pdf.Reader.ContentStream.interpret/3's output but are silently dropped by events_to_text_runs/2. Public read_text* API surface unchanged.
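
The CTM inheritance rule (Form.Matrix × parent CTM) is a product of two PDF matrices, each written [a b c d e f]. The sketch below represents them as 6-tuples; the module and function names are hypothetical.

```elixir
defmodule CtmSketch do
  # Hypothetical helper, not part of ex_pdf.
  # Multiplies two PDF transformation matrices given as {a, b, c, d, e, f}.
  # multiply(m, n) is m × n, i.e. m is applied first, then n.
  def multiply({a1, b1, c1, d1, e1, f1}, {a2, b2, c2, d2, e2, f2}) do
    {
      a1 * a2 + b1 * c2,
      a1 * b2 + b1 * d2,
      c1 * a2 + d1 * c2,
      c1 * b2 + d1 * d2,
      e1 * a2 + f1 * c2 + e2,
      e1 * b2 + f1 * d2 + f2
    }
  end

  # CTM in effect inside a Form XObject at the time of Do.
  def form_ctm(form_matrix, parent_ctm), do: multiply(form_matrix, parent_ctm)
end
```

A Form whose /Matrix translates by (10, 20), drawn under a parent CTM that scales by 2, ends up with a CTM that scales by 2 and translates by (20, 40).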

Modified — Phase 3

Pdf.Reader.ContentStream.interpret/3 — public arity and return shape unchanged (backward-compat). New private do_interpret_with_doc/5 for the recursive path; extract_page_runs/3 and extract_page_images/3 now use it.
Pdf.Reader.Image and Pdf.Reader.TextRun events from inside Forms are appended to the parent page's event list.
build_xobjects_map/1 simplified — passes raw {:ref, n, g} refs from resources["XObject"] instead of pre-classifying as :form. ContentStream classifies on demand inside Do.

Out of scope (Phase 3)

BBox clipping of Form contents — text outside a Form's /BBox is still extracted (presentational concern, not data extraction).
Pattern XObject recursion — /Type /Pattern objects referenced via Do are skipped.
Multi-level page-tree resource inheritance (still one-level walk only).
AcroForm interactive field extraction.

Internal — Phase 3

Test suite: 756 tests, 0 failures (738 default + 18 @tag :fixtures).
Strict TDD applied throughout (red → green → refactor per task).
Spec-driven via SDD (sdd/pdf-reader-form-xobject-recursion/* artifacts in engram).
Pdf.Reader.ContentStream @moduledoc cites PDF 1.7 § 8.10 (Form XObjects), § 8.10.2 (Form Dictionaries), § 8.4 (Coordinate Systems), § 8.8 (External Objects / Do operator) plus pdf.js + pdfminer-six reference impls.

Unreleased — pdf-reader-encryption (Phase 2)

Added — Phase 2 (pdf-reader-encryption)

Standard Security Handler support — encrypted PDFs are now READABLE via Pdf.Reader.open/2 when the correct password is provided (or empty for metadata-protection cases). Implements all four spec versions:
- V1 / R=2 — RC4 40-bit (legacy)
- V2 / R=3 — RC4 up to 128-bit (most common pre-2008)
- V4 / R=4 — Crypt Filters + AES-128 (PDF 1.6+)
- V5 / R=6 — AES-256 + SHA-256/384/512 mixing (PDF 2.0 / Acrobat X+)
Pdf.Reader.open/2 with password: String.t() opt (default "").
- Always tries empty password first (metadata-protection auto-unlock).
- If non-empty password supplied, tries as user → owner password.
- Pdf.Reader.open/1 retained — delegates to open/2 with empty opts.
New error atoms in Pdf.Reader.reason/0:
- :encrypted_password_required — no password supplied, empty failed.
- :encrypted_wrong_password — supplied password rejected as user AND owner.
- :encrypted_unsupported_handler — /Filter != /Standard, V5/R5 (deprecated), or RC4 unavailable on the runtime.
- The legacy :encrypted atom is REMOVED (existing test updated to assert the new atom).
Pdf.Reader.Document struct gained :encryption field (%StandardHandler{} when encrypted, nil otherwise).
Decryption hook integrated transparently in Pdf.Reader.ObjectResolver.resolve_in_use/3 only — resolve_compressed/3 is left untouched (object-stream contents are decrypted ONCE at the containing-stream level; double-decryption would corrupt them).
Per-object encryption key derivation per PDF 1.7 § 7.6.2 for V1/V2/V4 (file key + obj_num + gen_num + optional sAlT literal → MD5 → truncate). V5 uses the file encryption key directly.
Crypt Filter /Identity honored — V4 streams marked /Identity are passed through plaintext (common XMP metadata pattern).
/EncryptMetadata false honored — when set in the Encrypt dict, the catalog's /Metadata stream is read as plaintext regardless of the default Stream Filter.
mix.exs — :crypto added to extra_applications (required at release time; the OTP :crypto app is stdlib, not a Hex dep).
New modules: Pdf.Reader.Encryption (facade), Pdf.Reader.Encryption.{PasswordPad, ObjectKey, StandardHandler, V1V2, V4, V5}.
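
The per-object key derivation for V1/V2/V4 (file key + obj_num + gen_num + optional sAlT → MD5 → truncate) can be sketched with :crypto alone. The module name and argument order are assumptions; the byte layout follows PDF 1.7 § 7.6.2, where the object number contributes its low 3 bytes and the generation its low 2 bytes, both little-endian, and the result is cut to min(n + 5, 16) bytes.

```elixir
defmodule ObjectKeySketch do
  # Hypothetical helper, not part of ex_pdf.
  # file_key: the document-level encryption key (n bytes)
  # aes?:     true when the crypt filter uses AES, which appends "sAlT"
  def derive(file_key, obj_num, gen_num, aes? \\ false) do
    salt = if aes?, do: "sAlT", else: <<>>

    input =
      <<file_key::binary, obj_num::little-size(24), gen_num::little-size(16), salt::binary>>

    size = min(byte_size(file_key) + 5, 16)

    :md5
    |> :crypto.hash(input)
    |> binary_part(0, size)
  end
end
```

As the entry notes, V5 skips this step and uses the file encryption key directly.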

Known Limitations (Phase 2, carried forward)

End-to-end V4/V5 round-trip integration tests deferred — algorithm-level unit tests (73 total across V1V2/V4/V5) verify each cipher against published vectors from Mozilla pdf.js crypto_spec.js, cross-checked with Node.js. V2/R3 is fully covered end-to-end via craft_rc4_v2_pdf/1 (round-trip from hand-crafted PDF through open/2 → read_text/1). V4/V5 dispatch through the resolver hook is unit-validated but lacks a full hand-crafted PDF round-trip fixture. Planned as pdf-reader-encryption-fixtures-handcraft.
Real-world fixture PDFs not committed — would require qpdf as a build/test dependency, which contradicts the project's "native only, zero external dependencies" principle. Planned as a separate optional change if/when the constraint is relaxed.
R5 (deprecated V5 variant) — unsupported by design. PDFs with V=5 R=5 return {:error, :encrypted_unsupported_handler}.
Public-Key Security Handler (X.509 cert-based, /Filter /Adobe.PubSec or similar) — not supported. Returns :encrypted_unsupported_handler.
Permission flag enforcement — flags are read but NOT enforced. We are a reader; downstream tools may choose to honor /P bits.
RC4 availability — runtime dependent on OpenSSL configuration. On systems where RC4 is disabled (some OpenSSL 3 builds), V1/V2 PDFs return :encrypted_unsupported_handler. AES paths (V4/V5) work everywhere.
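
Whether RC4 is usable on a given runtime can be probed before opening V1/V2 files. The sketch below is one possible check, not the library's: it consults :crypto.supports/1 and then attempts a trial encryption, since a cipher can be listed yet still be refused by the underlying OpenSSL build. The module name is hypothetical.

```elixir
defmodule Rc4ProbeSketch do
  # Hypothetical helper, not part of ex_pdf.
  # Returns true only if RC4 is listed and a trial encryption succeeds.
  def rc4_available? do
    :rc4 in :crypto.supports(:ciphers) and trial_ok?()
  end

  defp trial_ok? do
    try do
      _ = :crypto.crypto_one_time(:rc4, "probe-key", "probe-data", true)
      true
    rescue
      # :crypto raises when the cipher is disabled in the linked OpenSSL.
      _ -> false
    end
  end
end
```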

Internal — Phase 2

Test suite: 726 tests, 0 failures (708 default + 18 @tag :fixtures).
73 unit tests across V1V2/V4/V5 verify algorithms 2, 4, 5, 6, 7, 8, 9, 10 against vectors sourced from Mozilla pdf.js test/unit/crypto_spec.js (Apache-2.0). Each vector independently re-computed with Node.js crypto and :crypto Erlang to confirm parity.
Strict TDD applied throughout (red → green → refactor per task).
Spec-driven via SDD (sdd/pdf-reader-encryption/* artifacts in engram).
All algorithm modules cite canonical spec URLs (PDF 1.7/2.0, NIST FIPS 197, NIST SP 800-38A, RFC 1321) in @moduledoc.

Unreleased — pdf-reader-cascade-wire (Phase 1.1)

Added — Phase 1.1 (pdf-reader-cascade-wire)

Encoding cascade wired through read_text/2 and read_text_with_positions/1 — text is now decoded to Unicode (was raw bytes in Phase 1). Per-font cascade order: ToUnicode CMap → /Differences + AGL → base encoding (WinAnsi/MacRoman/Standard) → U+FFFD.
Per-font decoder construction with cache — Pdf.Reader.Font.build_decoder/2 builds closures per font dict; decoders are cached in Document.cache keyed by {:font_decoder, font_ref} (indirect-ref fonts only; inline font dicts are not cached).
Tf operator switches active decoder mid-content-stream — font changes in the stream are respected; each text operation uses the decoder for the currently active font.
XMP metadata parsing via :xmerl (OTP stdlib) — read_metadata/1 merges XMP with /Info; XMP wins on conflict (PDF 1.7 § 14.3.2). Recognized namespaces: dc:, xmp:, pdf:. Malformed XMP falls back to /Info-only silently.
Pdf.Reader.Image struct gained :ctm, :render_width, :render_height, :rotation_radians fields. CTM decomposition follows PDF 1.7 § 8.3.3 and § 8.9.5.
Resource inheritance — one-level parent-chain walk added to resolve_page_resources/2 so writer-built PDFs (which store resources on the Pages parent node, not the leaf page) extract text and images correctly.
New modules: Pdf.Reader.Font, Pdf.Reader.XMP
New fixture: test/fixtures/images/tiny.jpg (32×32 px, ~900 B, public-domain JPEG from picsum.photos — used by image CTM integration tests)
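
One common way to derive rendered size and rotation from an image CTM [a b c d e f] is shown below. It is offered as an illustration of the idea, assuming no skew; the module name is hypothetical and the library's exact decomposition is not spelled out in this changelog.

```elixir
defmodule ImageCtmSketch do
  # Hypothetical helper, not part of ex_pdf.
  # An image occupies the unit square in its own space, so the lengths of the
  # transformed basis vectors give the rendered width and height.
  def decompose({a, b, c, d, _e, _f}) do
    %{
      render_width: :math.sqrt(a * a + b * b),
      render_height: :math.sqrt(c * c + d * d),
      rotation_radians: :math.atan2(b, a)
    }
  end
end
```

For an unrotated CTM such as `{100, 0, 0, 50, 0, 0}` this gives a 100 × 50 rendering with zero rotation.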

Known Limitations (Phase 1.1, carried forward)

Resource inheritance — only one level of parent-chain walk is implemented. PDFs with deeply nested page trees that store resources two or more levels above the leaf page may produce empty text. Planned as pdf-reader-resource-inheritance change.
Per-glyph advance via /Widths — glyph advance is approximated as uniform (char_count × font_size). Per-glyph widths are a separate change.
Form XObject Do recursion — content inside form XObjects (/Subtype /Form) is not extracted. Planned for Phase 3.
CID fonts beyond ToUnicode — CID-keyed fonts without a /ToUnicode CMap produce U+FFFD substitutions. Planned for Phase 3.
CCITTFaxDecode, JBIG2Decode, JPXDecode — not supported; these require third-party C libraries and are outside scope.

Internal — Phase 1.1

Test suite: 616 tests, 0 failures (598 default + 18 @tag :fixtures)
Strict TDD applied throughout (red → green → refactor per task)
Spec-driven via SDD (sdd/pdf-reader-cascade-wire/* artifacts in engram)

Unreleased — pdf-reader-core (Phase 1)

Added

Pdf.Reader.open/1, read_text/2, read_text_with_positions/1, read_images/1, read_metadata/1, page_count/1, close/1 — and bang variants (open!/1, etc.)
Stream filter pipeline:
- FlateDecode with PNG predictors 1–4 and 10–14 and TIFF Predictor 2 (horizontal differencing)
- ASCII85Decode with z shortcut and ~> EOD marker
- ASCIIHexDecode with whitespace tolerance and > EOD
- RunLengthDecode (128 = EOD, 0–127 = literal, 129–255 = repeat)
- LZWDecode with variable-width codes (9–12 bit), EarlyChange 0 and 1
Cross-reference table support: classic xref (PDF 1.0–1.4) AND xref streams (PDF 1.5+) with /Prev chain merging and hybrid chains (mixed classic + stream)
Object stream (/Type /ObjStm) decoding via Pdf.Reader.ObjectStream
Encoding cascade (per-glyph): ToUnicode CMap → /Differences + Adobe Glyph List → base encoding (WinAnsi / MacRoman / StandardEncoding) → U+FFFD with diagnostic sentinel
Bundled Adobe Glyph List 2.0 as a compile-time module (~4 500 entries, BSD-licensed)
Public-domain encoding tables:
- Apple ROMAN.TXT (canonical Mac Roman mapping)
- PDF 1.7 Annex D.2 StandardEncoding (cross-checked against Mozilla pdf.js)
Lazy indirect-object resolver with pure Map cache — no GenServer, no Agent
Pure tagged-tuple internal value model: {:ref, n, g}, {:name, _}, {:string, _}, {:hex_string, _}, {:stream, dict, body}, plain %{} for dicts, plain lists for arrays
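
Of the filters listed above, RunLengthDecode is the easiest to show in full. The sketch follows the rule quoted in the list (128 = EOD, 0–127 = literal run, 129–255 = repeat); the module name and the error tuple are assumptions, and input that ends without an EOD byte is treated as complete.

```elixir
defmodule RunLengthSketch do
  # Hypothetical helper, not part of ex_pdf.
  def decode(bin) when is_binary(bin), do: step(bin, [])

  # 128 marks end of data; anything after it is ignored.
  defp step(<<128, _rest::binary>>, acc), do: finish(acc)
  # Assumption: running out of input without EOD is accepted.
  defp step(<<>>, acc), do: finish(acc)

  # 0..127: copy the next n + 1 bytes literally.
  defp step(<<n, rest::binary>>, acc) when n < 128 do
    count = n + 1

    case rest do
      <<literal::binary-size(count), tail::binary>> -> step(tail, [literal | acc])
      _ -> {:error, :truncated}
    end
  end

  # 129..255: repeat the next byte 257 - n times.
  defp step(<<n, byte, rest::binary>>, acc) when n > 128 do
    step(rest, [:binary.copy(<<byte>>, 257 - n) | acc])
  end

  # A repeat marker with no byte after it.
  defp step(_truncated, _acc), do: {:error, :truncated}

  defp finish(acc), do: {:ok, acc |> Enum.reverse() |> IO.iodata_to_binary()}
end
```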

Known Limitations (Phase 1)

No encryption support — encrypted PDFs return {:error, :encrypted} (deferred to Phase 2)
No CID fonts beyond ToUnicode-mapped glyphs (deferred to Phase 3)
No CCITTFaxDecode, JBIG2Decode, or JPEG 2000 image filters — these require third-party C libraries and are outside scope
No AcroForm or XFA form field extraction
No OCR or scanned-PDF text extraction — not feasible without third-party tools
Form XObject (Do operator) is recognised but not recursed; content is not extracted
Glyph advance approximation: uses char_count × font_size instead of per-glyph /Widths; start-of-run position is exact, inter-run drift is possible for proportional fonts
CMap multi-codepoint mappings (ligatures): only the first codepoint is used
Malformed PDFs return strict {:error, :malformed} — no partial-recovery mode
XMP metadata streams are not parsed; read_metadata/1 reads only the /Info dictionary

Internal

Test suite: 550 tests, 0 failures (541 default + 9 @tag :fixtures)
Strict TDD applied throughout (red → green → refactor per task)
Spec-driven via SDD (sdd/pdf-reader-core/* artifacts in engram)

0.7.1 (2024-07-23)

Fix memory leak when cleaning up a PDF process

0.7.0 (2024-07-12)

Add autoprint/1 to automatically open the print dialog in a browser

0.6.1 (2023-01-19)

Fix bug with zero width strings and empty rows (also fixes [#24])
Fix issue with nil cap height [#35]
Raise RuntimeError when attempting to add text without a font [#36]
Fix typespec for text_wrap/5 [#37]

0.6.0 (2021-12-07)

Add :odd and :even to :row_style on table with a lower precedence than indexed styles
Fix bug where only the first non-WinAnsi character was replaced [#32]

0.5.0 (2020-12-02)

Catch errors raised within the GenServer and re-raise them in the calling process

0.4.0 (2020-08-12)

Add :encoding_replacement_character option to supply a replacement character when encoding fails
Add :allow_row_overflow option to Pdf.table/4 to allow row contents to be split across pages

0.3.7 (2020-04-29)

Bug fix: Fix memory leak by stopping internal processes

0.3.6 (2020-04-22)

Bug fix: Correctly handle encoded text as binary, not UTF-8 encoded string
Bug fix: External fonts now work like built-in fonts #17
Bug fix: Reset colours changed by attributed text
Bug fix: Fix global options for text_at/4 when using a string #11

0.3.5 (2020-04-14)

Deprecate: Pdf.delete/1 in favour of Pdf.cleanup/1
Deprecate: Pdf.open/2 in favour of Pdf.build/2
