SignCore.PDF.Reader (sign_core v0.1.0)


Minimal PDF trailer / xref scanner for the PAdES adapter.

Scope is the file-level structure only — the four primitives the Phase 4 plan calls out:

  1. Locate startxref and the most-recent xref offset.
  2. Parse the text-format xref subsections at that offset.
  3. Extract /Size, /Root, /Prev from the trailer dict.
  4. Walk the /Prev chain across revisions.

Out of scope (deliberately): content streams, encoded streams, page resources, font dictionaries, and any indirect-object body. None of those are required for incremental signature emission or for recomputing the byte-range covered by a /Sig.

Cross-reference streams (PDF 1.5+, /Type /XRef) are not handled in v1 and surface as {:error, {:malformed_pdf, :xref_stream_unsupported}}. Per the Phase 4 plan, the writer always emits the legacy text-format xref (still legal in PDF 1.7+); the reader is the side that needs to tolerate vendor variation, and we accept the limitation until a real corpus argues for lopdf on the verify path.

Types

error()

@type error() :: {:malformed_pdf, atom()}

Reader error. Always carries :malformed_pdf as the class atom.

Functions

merged_xref_offsets(pdf)

@spec merged_xref_offsets(binary()) ::
  {:ok, %{required(non_neg_integer()) => non_neg_integer()}} | {:error, error()}

Returns the merged xref offsets across every revision in the PDF — newest entry per object number wins (incremental updates override).

Used by the verify path to enumerate indirect objects by number without picking up older, superseded revisions of an object.
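The newest-wins rule can be illustrated with plain maps. This is a minimal sketch, not the module's implementation: `MergeSketch` and its input shape (a newest-first list of `%{object_number => byte_offset}` maps) are assumptions made for the example.

```elixir
defmodule MergeSketch do
  # Hypothetical helper: takes per-revision offset maps, newest first,
  # and folds them oldest-to-newest so later revisions override earlier ones.
  @spec merge([%{non_neg_integer() => non_neg_integer()}]) ::
          %{non_neg_integer() => non_neg_integer()}
  def merge(revisions_newest_first) do
    revisions_newest_first
    |> Enum.reverse()
    # Map.merge/2 keeps the value from the second map on key conflicts,
    # so the newer revision's offset replaces the older one.
    |> Enum.reduce(%{}, fn offsets, acc -> Map.merge(acc, offsets) end)
  end
end
```

With a newest revision that rewrites object 1 and adds object 4, the merged map keeps the new offset for object 1 and the original offset for the untouched object 2.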

next_object_number(pdf)

@spec next_object_number(binary()) :: {:ok, non_neg_integer()} | {:error, error()}

Returns the next free indirect-object number, derived from the most-recent revision's /Size. PAdES incremental updates allocate fresh object numbers starting here.
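/Size works as the next free number because the PDF specification defines it as one greater than the highest object number in the file. The sketch below shows one way the trailer fields could be pulled out of a trailer dict body with whitespace-tolerant regexes. `TrailerSketch` and the shape of its result map are hypothetical and are not taken from the module's source.

```elixir
defmodule TrailerSketch do
  # Hypothetical helper: extracts /Size, /Root and /Prev from the text
  # between the trailer's << and >>. Missing keys come back as nil.
  @spec fields(binary()) :: %{size: integer() | nil, prev: integer() | nil, root: tuple() | nil}
  def fields(trailer_body) do
    %{
      size: int(~r{/Size\s+(\d+)}, trailer_body),
      prev: int(~r{/Prev\s+(\d+)}, trailer_body),
      root: ref(~r{/Root\s+(\d+)\s+(\d+)\s+R}, trailer_body)
    }
  end

  # Single integer value, e.g. "/Size 7".
  defp int(regex, body) do
    case Regex.run(regex, body, capture: :all_but_first) do
      [digits] -> String.to_integer(digits)
      nil -> nil
    end
  end

  # Indirect reference, e.g. "/Root 1 0 R" becomes {1, 0}.
  defp ref(regex, body) do
    case Regex.run(regex, body, capture: :all_but_first) do
      [num, gen] -> {String.to_integer(num), String.to_integer(gen)}
      nil -> nil
    end
  end
end
```

Under these assumptions, a trailer with `/Size 7` means objects 0 through 6 are taken and an incremental update would start allocating at 7.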

parse(pdf)

@spec parse(binary()) :: {:ok, SignCore.PDF.Reader.Revision.t()} | {:error, error()}

Convenience: locate the most-recent xref and read it.

read_catalog_body(pdf)

@spec read_catalog_body(binary()) :: {:ok, binary()} | {:error, error()}

Returns the catalog dict body (the bytes between << and >> of the object pointed at by /Root). The catalog is what an incremental update must re-emit when adding a /Sig field — its /AcroForm and /Pages entries need to be preserved.

Returns {:error, {:malformed_pdf, :catalog_not_indirect}} if the catalog body isn't a plain dict (rare; would only happen if /Root pointed at an object stream).

read_dict_at(pdf, offset)

@spec read_dict_at(binary(), non_neg_integer()) ::
  {:ok, binary()} | {:error, error() | :not_a_dict}

Returns the dict body (the bytes between the object's outer << and matching >>) for the object at offset. Returns {:error, :not_a_dict} for objects that don't begin with a dict (streams, primitives).
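Finding the matching >> means tracking nesting depth, since dictionaries can contain other dictionaries. The following is a simplified sketch of that idea, not the module's code: `DictBodySketch` is a hypothetical name, it expects a binary that already starts at `<<` rather than an object offset, and it skips hex strings but does not handle literal `( ... )` strings that happen to contain `<<` or `>>`. The `:unterminated_dict` reason is likewise invented for illustration.

```elixir
defmodule DictBodySketch do
  # Hypothetical helper: returns the bytes between the outer << and its
  # matching >>. The input must begin with "<<".
  def body(<<"<<", rest::binary>>), do: scan(rest, 1, 0, rest)
  def body(_other), do: {:error, :not_a_dict}

  # Nested dictionary opens: go one level deeper.
  defp scan(<<"<<", t::binary>>, depth, pos, src), do: scan(t, depth + 1, pos + 2, src)
  # Closing the outermost dictionary: everything before pos is the body.
  defp scan(<<">>", _::binary>>, 1, pos, src), do: {:ok, binary_part(src, 0, pos)}
  # Closing a nested dictionary.
  defp scan(<<">>", t::binary>>, depth, pos, src), do: scan(t, depth - 1, pos + 2, src)

  # A single "<" starts a hex string; skip to its ">" so that input such as
  # "<00>>>" is not misread as closing the dictionary early.
  defp scan(<<"<", t::binary>>, depth, pos, src) do
    case :binary.match(t, ">") do
      {idx, 1} ->
        skip = idx + 1
        scan(binary_part(t, skip, byte_size(t) - skip), depth, pos + 1 + skip, src)

      :nomatch ->
        {:error, {:malformed_pdf, :unterminated_dict}}
    end
  end

  defp scan(<<_byte, t::binary>>, depth, pos, src), do: scan(t, depth, pos + 1, src)
  defp scan(<<>>, _depth, _pos, _src), do: {:error, {:malformed_pdf, :unterminated_dict}}
end
```

The hex-string clause matters for signature dictionaries in particular, where /Contents is a long hex string that may sit directly before the closing >>.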

read_object_body(pdf, offset)

@spec read_object_body(binary(), non_neg_integer()) ::
  {:ok, binary()} | {:error, error()}

Reads the textual body of the indirect object at the given offset. Returns the bytes between obj and endobj, trimmed.

Used by the Writer to extract the catalog dict so an incremental update can re-emit it with a merged /AcroForm entry. Does not parse stream contents; the bytes are returned verbatim.

read_revision(pdf, offset)

@spec read_revision(binary(), non_neg_integer()) ::
  {:ok, SignCore.PDF.Reader.Revision.t()} | {:error, error()}

Reads the xref table + trailer at the given offset and returns a Revision describing this PDF revision.
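For orientation, a text-format xref section is the keyword `xref`, followed by one or more subsections (a `first count` header line and then `count` entries of the form `offset generation n|f`), followed by `trailer`. The sketch below parses that layout into a map of in-use objects. It is an illustration under assumptions: `XrefSketch`, its return shape, and its error atoms are hypothetical, and the real Revision struct is not shown in this page.

```elixir
defmodule XrefSketch do
  # Hypothetical helper: parses "xref ... trailer" text into
  # %{object_number => byte_offset} for in-use ("n") entries only.
  def parse_table(text) do
    lines =
      text
      |> String.split(["\r\n", "\n", "\r"], trim: true)
      |> Enum.map(&String.trim/1)

    case lines do
      ["xref" | rest] -> subsections(rest, %{})
      _ -> {:error, {:malformed_pdf, :xref_keyword_missing}}
    end
  end

  # The trailer keyword ends the table.
  defp subsections(["trailer" | _], acc), do: {:ok, acc}

  # A subsection header "first count" followed by `count` entry lines.
  defp subsections([header | rest], acc) do
    with [first, count] <- String.split(header),
         {first, ""} <- Integer.parse(first),
         {count, ""} <- Integer.parse(count),
         {entries, rest} when length(entries) == count <- Enum.split(rest, count) do
      acc =
        entries
        |> Enum.with_index(first)
        |> Enum.reduce(acc, fn {entry, number}, acc ->
          case String.split(entry) do
            [offset, _generation, "n"] -> Map.put(acc, number, String.to_integer(offset))
            # Free ("f") entries carry no usable offset and are skipped.
            _free_or_other -> acc
          end
        end)

      subsections(rest, acc)
    else
      _ -> {:error, {:malformed_pdf, :bad_xref_subsection}}
    end
  end

  defp subsections([], _acc), do: {:error, {:malformed_pdf, :trailer_missing}}
end
```

Incremental updates typically produce several small subsections, which is why the header is re-read after every run of entries instead of assuming one contiguous block.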

revisions(pdf)

@spec revisions(binary()) ::
  {:ok, [SignCore.PDF.Reader.Revision.t()]} | {:error, error()}

Walks the /Prev chain newest-first. The first element is the most recent revision (the one startxref points at); the last is the original.
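The walk itself is a short recursion: read the revision at the current offset, then follow its /Prev until there is none. A rough sketch follows, with several assumptions: `PrevChainSketch` is a hypothetical name, revisions are modelled as plain maps with a `:prev` key instead of the real Revision struct, and the cycle guard with its `:prev_cycle` reason is an addition for the example that this page does not confirm.

```elixir
defmodule PrevChainSketch do
  # Hypothetical helper. `read_fun` maps an xref offset to
  # {:ok, %{prev: offset_or_nil}} or {:error, reason}.
  # Returns the revisions newest first.
  def walk(offset, read_fun, seen \\ MapSet.new(), acc \\ []) do
    if MapSet.member?(seen, offset) do
      # Assumed safeguard: a /Prev that loops back would never terminate.
      {:error, {:malformed_pdf, :prev_cycle}}
    else
      case read_fun.(offset) do
        # No /Prev: this is the original revision, so the chain ends here.
        {:ok, %{prev: nil} = rev} ->
          {:ok, Enum.reverse([rev | acc])}

        {:ok, %{prev: prev} = rev} ->
          walk(prev, read_fun, MapSet.put(seen, offset), [rev | acc])

        {:error, _reason} = error ->
          error
      end
    end
  end
end
```

Passing the reader as a function keeps the sketch self-contained; in practice the role of `read_fun` would be played by something like `read_revision(pdf, offset)`.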

signature_dicts(pdf)

@spec signature_dicts(binary()) ::
  {:ok, [{non_neg_integer(), binary()}]} | {:error, error()}

Returns the list of {object_number, dict_body} pairs for every indirect object whose body is a dictionary containing /Type /Sig.

This is the canonical way to locate signature dicts: it ignores comments, content-stream text that happens to mention /Type /Sig, and superseded older revisions of the same object number. Each returned dict body is bounded — only the dict content between its outer << and matching >>, suitable for whitespace-tolerant regex extraction of /ByteRange and /Contents.
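As an example of the regex extraction mentioned above, /ByteRange holds four integers that form two (offset, length) pairs covering the signed bytes on either side of /Contents. The sketch below is one possible pattern, assuming the array contains plain integers; `ByteRangeSketch` is a hypothetical name and not part of this module.

```elixir
defmodule ByteRangeSketch do
  # Hypothetical helper: pulls the four /ByteRange integers out of a
  # signature dict body, tolerating arbitrary whitespace and newlines.
  @spec extract(binary()) :: {:ok, [non_neg_integer()]} | :error
  def extract(dict_body) do
    regex = ~r{/ByteRange\s*\[\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*\]}

    case Regex.run(regex, dict_body, capture: :all_but_first) do
      nil -> :error
      digits -> {:ok, Enum.map(digits, &String.to_integer/1)}
    end
  end
end
```

Because the dict body is already bounded, a pattern like this cannot accidentally match a /ByteRange that belongs to a different object later in the file.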

startxref(pdf)

@spec startxref(binary()) :: {:ok, non_neg_integer()} | {:error, error()}

Returns the byte offset stored in the file's terminating startxref marker. Searches the last 8192 bytes: PDF 1.7 §7.5.5 places startxref on the lines immediately before the final %%EOF, and Acrobat's implementation notes only tolerate %%EOF anywhere within the last 1024 bytes, but real-world authoring tools emit trailing whitespace that pushes the marker further back.
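A tail search of this kind can be sketched in a few lines. This is an illustration only: `StartxrefSketch` and the `:startxref_not_found` reason are hypothetical, and taking the last match in the window is an assumption about how stale markers from earlier revisions should be treated.

```elixir
defmodule StartxrefSketch do
  # Window size taken from the description above.
  @tail_window 8192

  # Hypothetical helper: finds the final "startxref <offset> %%EOF"
  # sequence within the last @tail_window bytes of the file.
  @spec find(binary()) :: {:ok, non_neg_integer()} | {:error, {:malformed_pdf, atom()}}
  def find(pdf) when is_binary(pdf) do
    size = byte_size(pdf)
    start = max(size - @tail_window, 0)
    tail = binary_part(pdf, start, size - start)

    # Regex.scan/3 returns every match in order; the last one belongs to
    # the most recent revision.
    case Regex.scan(~r/startxref\s+(\d+)\s+%%EOF/, tail, capture: :all_but_first) do
      [] -> {:error, {:malformed_pdf, :startxref_not_found}}
      matches -> {:ok, matches |> List.last() |> hd() |> String.to_integer()}
    end
  end
end
```

Slicing the tail first keeps the regex from scanning the whole file, which matters for large documents with embedded images.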