FormatParser.Document (format_parser v2.14.0)

Defines the Document struct and the functions for parsing document files.

The Document struct contains the fields format, nature, and intrinsics.

Supported Formats

Format   Extension   Description
:rtf     .rtf        Rich Text Format
:pdf     .pdf        Portable Document Format
:docx    .docx       Microsoft Word (Open XML)
:doc     .doc        Microsoft Word (Legacy)
:xlsx    .xlsx       Microsoft Excel (Open XML)
:pptx    .pptx       Microsoft PowerPoint (Open XML)
:odt     .odt        OpenDocument Text
:ods     .ods        OpenDocument Spreadsheet
:odp     .odp        OpenDocument Presentation
:epub    .epub       Electronic Publication

Intrinsics

Some formats provide additional metadata in the intrinsics field:

  • PDF: %{page_count: integer} - Number of pages (from linearized PDFs)
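
Because the page count is only described as coming from linearized PDFs, callers may want to treat :page_count as optional. The following is a minimal sketch under that assumption; IntrinsicsSketch and page_count/1 are hypothetical names and are not part of format_parser. The function matches on plain map keys, so it accepts a %FormatParser.Document{} struct as well as an ordinary map.

```elixir
defmodule IntrinsicsSketch do
  # Hypothetical helper, not part of format_parser.
  # Returns {:ok, count} when a PDF result carries an integer :page_count,
  # and :error for every other value (including PDFs without the key).
  def page_count(%{format: :pdf, intrinsics: %{page_count: count}})
      when is_integer(count),
      do: {:ok, count}

  def page_count(_other), do: :error
end
```

Returning :error rather than raising keeps a missing page count from crashing code that only wants to display it when available.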

Types

t()

@type t() :: %FormatParser.Document{
  format: atom() | nil,
  intrinsics: map(),
  nature: :document
}

A struct representing a parsed document file.

Fields

  • :format - The document format as an atom (e.g., :pdf, :docx, :odt)
  • :nature - Always :document for document files
  • :intrinsics - A map containing format-specific metadata

Functions

parse(file)

@spec parse({:error, binary()} | binary() | any()) :: any()

Parses a document from the given input.

This function attempts to identify document formats by examining magic bytes and internal file structure. ZIP-based formats (DOCX, XLSX, PPTX, ODT, ODS, ODP, EPUB) are detected by examining their internal file entries.
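
To illustrate the two-stage idea described above (magic bytes first, then ZIP entry names), here is a rough sketch. It is not format_parser's implementation: MagicSketch, sniff/1, and classify_entries/1 are hypothetical names, and the entry-name prefixes checked are the conventional Office Open XML folder names rather than anything confirmed by this documentation.

```elixir
defmodule MagicSketch do
  # Illustrative only: hypothetical helpers, not format_parser's code.

  # PDF files begin with the ASCII signature "%PDF-".
  def sniff(<<"%PDF-", _rest::binary>>), do: :pdf
  # RTF files begin with "{\rtf".
  def sniff(<<"{\\rtf", _rest::binary>>), do: :rtf
  # ZIP archives begin with the local file header signature "PK\x03\x04".
  def sniff(<<0x50, 0x4B, 0x03, 0x04, _rest::binary>>), do: :zip_container
  # Legacy Office files use the OLE2 compound file signature.
  def sniff(<<0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1, _rest::binary>>),
    do: :ole_container
  def sniff(_other), do: :unknown

  # Second stage for ZIP containers: guess the format from entry names.
  # `names` is a list of entry paths as strings, obtained elsewhere.
  def classify_entries(names) do
    cond do
      Enum.any?(names, &String.starts_with?(&1, "word/")) -> :docx
      Enum.any?(names, &String.starts_with?(&1, "xl/")) -> :xlsx
      Enum.any?(names, &String.starts_with?(&1, "ppt/")) -> :pptx
      true -> :unknown
    end
  end
end
```

OpenDocument and EPUB archives are not covered by this sketch; they are conventionally identified by the contents of a mimetype entry rather than by folder names.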

Arguments

  • file - The input to parse. Can be one of:
    • {:error, binary} - An error tuple wrapping the binary file content, as passed along the parser chain
    • binary - Raw binary file content
    • any - Any other value, which is returned as-is (pass-through for the parser chain)

Returns

  • %FormatParser.Document{} - When a supported document format is detected
  • {:error, binary} - When the format is not recognized (for parser chain)
  • The input unchanged - When input is neither a binary nor an error tuple
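
Since the return value can take three shapes, calling code usually needs to branch on it. The sketch below shows one way to do that with function clauses; ResultSketch and describe/1 are hypothetical names used for illustration only, and the clause matches on map keys so it works with the Document struct without depending on it.

```elixir
defmodule ResultSketch do
  # Hypothetical helper, not part of format_parser.

  # A recognized document: a map or struct whose :nature is :document.
  def describe(%{nature: :document, format: format}) when not is_nil(format),
    do: {:document, format}

  # Unrecognized content, handed on for the next parser in the chain.
  def describe({:error, content}) when is_binary(content), do: :unrecognized

  # Anything else was passed through untouched.
  def describe(other), do: {:passed_through, other}
end

# Possible usage, assuming format_parser is available:
#   file |> FormatParser.Document.parse() |> ResultSketch.describe()
```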

Examples

iex> {:ok, file} = File.read("priv/test.pdf")
iex> result = FormatParser.Document.parse(file)
iex> result.format
:pdf

iex> FormatParser.Document.parse(%FormatParser.Image{})
%FormatParser.Image{}