FormatParser.Document
(format_parser v2.14.0)
Defines the Document struct and the function that parses supported document formats into it.
The Document struct contains the fields format, nature, and intrinsics.
Supported Formats
| Format | Extension | Description |
|---|---|---|
| :rtf | .rtf | Rich Text Format |
| :pdf | .pdf | Portable Document Format |
| :docx | .docx | Microsoft Word (Open XML) |
| :doc | .doc | Microsoft Word (Legacy) |
| :xlsx | .xlsx | Microsoft Excel (Open XML) |
| :pptx | .pptx | Microsoft PowerPoint (Open XML) |
| :odt | .odt | OpenDocument Text |
| :ods | .ods | OpenDocument Spreadsheet |
| :odp | .odp | OpenDocument Presentation |
| :epub | .epub | Electronic Publication |
Intrinsics
Some formats provide additional metadata in the intrinsics field:
- PDF:
  - %{page_count: integer} - Number of pages (from linearized PDFs)
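Since page_count is described as coming from linearized PDFs, a caller may not want to assume the key is always present. The following is a minimal sketch of a defensive read. The module name IntrinsicsSketch, the helper page_count/1, and the sample values are illustrative assumptions and not part of format_parser; plain maps stand in for the struct.

```elixir
defmodule IntrinsicsSketch do
  # Hypothetical helper (not part of format_parser).
  # Matches only when intrinsics is a map carrying an integer page_count.
  def page_count(%{intrinsics: %{page_count: n}}) when is_integer(n), do: {:ok, n}

  # Anything else (missing key, empty map, nil intrinsics) is reported as unknown.
  def page_count(_doc), do: :unknown
end

# Plain maps stand in for %FormatParser.Document{}; the values are made up.
with_count = %{format: :pdf, intrinsics: %{page_count: 12}}
without_count = %{format: :pdf, intrinsics: %{}}

IntrinsicsSketch.page_count(with_count)
# => {:ok, 12}
IntrinsicsSketch.page_count(without_count)
# => :unknown
```

Returning a tagged result rather than nil keeps the "no page count available" case explicit at the call site.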
Summary
Functions
parse(input)
Parses a document from the given input.
Types
Functions
parse(input)
Parses a document from the given input.
This function attempts to identify document formats by examining magic bytes and internal file structure. ZIP-based formats (DOCX, XLSX, PPTX, ODT, ODS, ODP, EPUB) are detected by examining their internal file entries.
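The two-step idea described above (magic bytes first, then internal entries for ZIP containers) can be sketched with binary pattern matching. This is an illustration only, not the library's implementation: the module DetectSketch and its functions are hypothetical, and the entry-name prefixes are the conventional Open XML folder names rather than anything confirmed from format_parser's source.

```elixir
defmodule DetectSketch do
  # Hypothetical module; illustrates the idea, not format_parser's code.

  # Step 1: inspect the leading "magic" bytes of the binary.
  def sniff(<<"%PDF-", _rest::binary>>), do: :pdf
  def sniff(<<"{\\rtf", _rest::binary>>), do: :rtf
  # "PK" 0x03 0x04 is the ZIP local file header signature.
  def sniff(<<"PK", 0x03, 0x04, _rest::binary>>), do: :zip_container
  # OLE2 compound file signature, used by legacy .doc files among others.
  def sniff(<<0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1, _rest::binary>>),
    do: :ole2_container
  def sniff(_other), do: :unknown

  # Step 2 (ZIP containers only): classify by the names of the archive entries.
  # The caller is assumed to have already listed the entry names.
  def classify_entries(names) when is_list(names) do
    cond do
      Enum.any?(names, &String.starts_with?(&1, "word/")) -> :docx
      Enum.any?(names, &String.starts_with?(&1, "xl/")) -> :xlsx
      Enum.any?(names, &String.starts_with?(&1, "ppt/")) -> :pptx
      true -> :unknown
    end
  end
end

DetectSketch.sniff("%PDF-1.7\n")
# => :pdf
DetectSketch.classify_entries(["[Content_Types].xml", "word/document.xml"])
# => :docx
```

OpenDocument and EPUB archives are usually told apart by the content of their mimetype entry rather than by folder names, so a fuller sketch would need to read that entry as well.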
Arguments
- input - Can be one of:
  - {:error, binary} - A tuple containing binary file content (used in the parser chain)
  - binary - Raw binary file content
  - any - Any other value is returned as-is (pass-through for the parser chain)
Returns
- %FormatParser.Document{} - When a supported document format is detected
- {:error, binary} - When the format is not recognized (for the parser chain)
- The input unchanged - When input is neither a binary nor an error tuple
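The three return shapes make sense once several parsers are piped together: a stage that fails hands the binary on as {:error, binary}, and a stage that receives an already-parsed result passes it through untouched. The sketch below mimics that contract with two hypothetical stages. ChainSketch, image_stage/1, and document_stage/1 are illustrative names, plain maps stand in for the result structs, and none of this is the library's own code.

```elixir
defmodule ChainSketch do
  # Hypothetical stages; plain maps stand in for the result structs.

  # An earlier stage that only recognizes PNG images.
  def image_stage({:error, bin}) when is_binary(bin), do: image_stage(bin)
  def image_stage(<<0x89, "PNG", _rest::binary>>), do: %{format: :png}
  def image_stage(bin) when is_binary(bin), do: {:error, bin}
  def image_stage(other), do: other

  # A later stage that only recognizes PDF documents.
  def document_stage({:error, bin}) when is_binary(bin), do: document_stage(bin)
  def document_stage(<<"%PDF-", _rest::binary>>), do: %{format: :pdf}
  def document_stage(bin) when is_binary(bin), do: {:error, bin}
  def document_stage(other), do: other

  # Pipe the input through both stages in order.
  def run(input), do: input |> image_stage() |> document_stage()
end

# The image stage fails, the document stage succeeds.
ChainSketch.run("%PDF-1.4")
# => %{format: :pdf}

# The image stage succeeds, the document stage passes the result through.
ChainSketch.run(<<0x89, "PNG", 0x0D, 0x0A>>)
# => %{format: :png}

# No stage recognizes the input, so the error tuple reaches the end.
ChainSketch.run("plain text")
# => {:error, "plain text"}
```

Carrying the original binary inside the error tuple is what lets each later stage retry detection without re-reading the file.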
Examples
iex> {:ok, file} = File.read("priv/test.pdf")
iex> result = FormatParser.Document.parse(file)
iex> result.format
:pdf
iex> FormatParser.Document.parse(%FormatParser.Image{})
%FormatParser.Image{}