ChromicPDF (ChromicPDF v1.1.0)

ChromicPDF is a fast HTML-to-PDF/A renderer based on Chrome & Ghostscript.

Usage

Start

Start ChromicPDF as part of your supervision tree:

defmodule MyApp.Application do
  use Application

  def start(_type, _args) do
    children = [
      # other apps...
      {ChromicPDF, chromic_pdf_opts()}
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end

  defp chromic_pdf_opts do
    []
  end
end

Print a PDF or PDF/A

ChromicPDF.print_to_pdf({:url, "file:///example.html"}, output: "output.pdf")

PDF printing comes with a ton of options. Please see ChromicPDF.print_to_pdf/2 and ChromicPDF.convert_to_pdfa/2 for details.

Security Considerations

Before adding a browser to your application's (perhaps already long) list of dependencies, you may want to consider the security hints below.

Escape user-supplied data

Make sure to escape any user-provided data with something like Phoenix.HTML.html_escape. Chrome is designed to make displaying HTML pages relatively safe, in terms of preventing undesired access of a page to the host operating system. However, the attack surface of your application is still increased. Running this in a containerized application with a small RPC interface creates an additional barrier (and has other benefits).

Running in offline mode

For an additional layer of security, browser targets can be spawned in "offline mode" (using the DevTools command Network.emulateNetworkConditions). Chrome targets with network conditions set to offline can't resolve any external URLs (e.g. https://), whether entered as the navigation URL or contained within the HTML body.

defp chromic_pdf_opts do
  [offline: true]
end

Chrome Sandbox in Docker containers

By default, ChromicPDF will allow Chrome to make use of its own "sandbox" process jail. The sandbox tries to limit system resource access of the renderer processes to the minimum resources they require to perform their task.

However, in Docker containers running Linux images (e.g. images based on Alpine), and which are configured to run their main job as a non-root user, this causes Chrome to crash on startup as it requires root privileges to enter the sandbox.

The error output (discard_stderr: false option) looks as follows:

Failed to move to new namespace: PID namespaces supported, Network namespace supported,
but failed: errno = Operation not permitted

The best way to resolve this issue is to configure your Docker container to use seccomp rules that grant Chrome access to the relevant system calls. See the excellent Zenika/alpine-chrome repository for details on how to make this work.

Alternatively, you may choose to disable Chrome's sandbox with the no_sandbox option.

defp chromic_pdf_opts do
  [no_sandbox: true]
end

SSL connections

If you are fetching your print source from a https:// URL, Chrome verifies the remote host's SSL certificate as usual when establishing the secure connection, and aborts navigation if the certificate has expired or is not signed by a known certificate authority (i.e. no self-signed certificates).

For production systems, this security check is essential and should not be circumvented. However, if for some reason you need to bypass certificate verification in development or test, you can do this with the :ignore_certificate_errors option.

defp chromic_pdf_opts do
  [ignore_certificate_errors: true]
end

Worker pools

ChromicPDF spawns two worker pools, the session pool and the ghostscript pool. By default, it will create as many sessions (browser tabs) as schedulers are online, and allow the same number of concurrent Ghostscript processes to run.

Concurrency

To increase or limit the number of concurrent workers, you can pass pool configuration to the supervisor. Please note that these are non-queueing worker pools. If you intend to max them out, you will need a job queue as well.

defp chromic_pdf_opts do
  [
    session_pool: [size: 3],
    ghostscript_pool: [size: 10]
  ]
end

Operation timeouts

By default, ChromicPDF allows the print process to take 5 seconds to finish. If you are printing large PDFs and run into timeouts, the limit can be raised by passing the timeout option to the session pool.

defp chromic_pdf_opts do
  [
    session_pool: [timeout: 10_000]   # in milliseconds
  ]
end

Automatic session restarts to avoid memory drain

By default, ChromicPDF will restart sessions within the Chrome process after 1000 operations. This helps to prevent infinite growth in Chrome's memory consumption. The "max age" of a session can be configured with the :max_session_uses option.

defp chromic_pdf_opts do
  [max_session_uses: 1000]
end

Chrome zombies

Help, a Chrome army tries to take over my memory!

ChromicPDF tries its best to gracefully close the external Chrome process when its supervisor is terminated. Unfortunately, when the BEAM is not shut down gracefully, Chrome processes will keep running. While in a containerized production environment this is unlikely to be of concern, in development it can lead to unpleasant performance degradation of your operating system.

In particular, the BEAM is not shut down properly…

  • when you exit your application or iex console with the Ctrl+C abort mechanism (see issue #56),
  • and when you run your tests. No, after an ExUnit run your application's supervisor is not terminated cleanly.

There are a few ways to mitigate this issue.

"On Demand" mode

In case you habitually end your development server with Ctrl+C, you should consider enabling "On Demand" mode which disables the session pool, and instead starts and stops Chrome instances as needed. If multiple PDF operations are requested simultaneously, multiple Chrome processes will be launched (each with a pool size of 1, disregarding the pool configuration).

defp chromic_pdf_opts do
  [on_demand: true]
end

To enable it only for development, you can load the option from the application environment.

# config/config.exs
config :my_app, ChromicPDF, on_demand: false

# config/dev.exs
config :my_app, ChromicPDF, on_demand: true

# application.ex
@chromic_pdf_opts Application.compile_env!(:my_app, ChromicPDF)
defp chromic_pdf_opts do
  @chromic_pdf_opts ++ [... other opts ...]
end

Terminating your supervisor after your test suite

You can enable "On Demand" mode for your tests, as well. However, please be aware that each test that prints a PDF will have an increased runtime (plus about 0.5s) due to the added Chrome boot time cost. Luckily, ExUnit provides a method to run code at the end of your test suite.

# test/test_helper.exs
ExUnit.after_suite(fn _ -> Supervisor.stop(MyApp.Supervisor) end)
ExUnit.start()

Only start ChromicPDF in production

The easiest way to prevent Chrome from spawning in development is to only run ChromicPDF in the prod environment. However, obviously you won't be able to print PDFs in development or test then.
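
One way to do this is to build the list of children conditionally. The following is a minimal sketch, not part of ChromicPDF: the module MyApp.Children and its function are hypothetical, and it assumes the environment name is made available to the application, for example via `config :my_app, env: config_env()` in config/config.exs.

```elixir
defmodule MyApp.Children do
  # Hypothetical helper (not part of ChromicPDF): returns the ChromicPDF
  # child spec tuple only for the :prod environment.
  def chromic_pdf_children(:prod, opts), do: [{ChromicPDF, opts}]
  def chromic_pdf_children(_env, _opts), do: []
end

# Possible use in MyApp.Application.start/2, assuming the :env key
# described above has been configured:
#
#   env = Application.fetch_env!(:my_app, :env)
#   children = other_children ++ MyApp.Children.chromic_pdf_children(env, chromic_pdf_opts())
```

With this in place, the supervisor in development and test simply starts without a ChromicPDF child, so no Chrome process is spawned.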

Chrome Options

Custom command line switches

The :chrome_args option allows you to pass arbitrary options to the Chrome/Chromium executable.

defp chromic_pdf_opts do
  [chrome_args: "--font-render-hinting=none"]
end

The :chrome_executable option allows you to specify a custom Chrome/Chromium executable.

defp chromic_pdf_opts do
  [chrome_executable: "/usr/bin/google-chrome-beta"]
end

Debugging Chrome errors

Chrome's stderr logging is silently discarded to not obscure your logfiles. In case you would like to take a peek, add the discard_stderr: false option.

defp chromic_pdf_opts do
  [discard_stderr: false]
end

Telemetry support

To provide insights into PDF and PDF/A generation performance, ChromicPDF executes the following telemetry events:

  • [:chromic_pdf, :print_to_pdf, :start | :stop | :exception]
  • [:chromic_pdf, :capture_screenshot, :start | :stop | :exception]
  • [:chromic_pdf, :convert_to_pdfa, :start | :stop | :exception]

Please see :telemetry.span/3 for details on their payloads, and :telemetry.attach/4 for how to attach to them.

Each of the corresponding functions accepts a telemetry_metadata option which is passed to the attached event handler. This can, for instance, be used to mark events with custom tags such as the type of the print document.

ChromicPDF.print_to_pdf(..., telemetry_metadata: %{template: "invoice"})

The print_to_pdfa function emits both the print_to_pdf and convert_to_pdfa event series, in that order.
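
As an illustration of attaching to one of these events, here is a minimal sketch of a handler for the print_to_pdf stop event. The module name and handler id are hypothetical. It assumes the standard :telemetry.span/3 behaviour, where the stop event carries a :duration measurement in native time units, and it simply inspects whatever metadata map the handler receives.

```elixir
defmodule MyApp.ChromicPDFTelemetry do
  require Logger

  # Attaches the handler below to the stop event of print_to_pdf.
  # The handler id is an arbitrary unique name chosen for this sketch.
  def attach do
    :telemetry.attach(
      "my-app-chromic-pdf-print-stop",
      [:chromic_pdf, :print_to_pdf, :stop],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  # Called by :telemetry with the event name, measurements, metadata
  # and the config term given to attach/4 (nil here).
  def handle_event([:chromic_pdf, :print_to_pdf, :stop], measurements, metadata, _config) do
    Logger.info(describe(measurements, metadata))
  end

  # Builds the log line; the duration is converted from native units
  # to milliseconds.
  def describe(%{duration: duration}, metadata) do
    ms = System.convert_time_unit(duration, :native, :millisecond)
    "print_to_pdf took #{ms}ms, metadata: #{inspect(metadata)}"
  end
end
```

Calling MyApp.ChromicPDFTelemetry.attach() once during application start would be enough to register the handler.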

How it works

PDF Printing

  • ChromicPDF spawns an instance of Chromium/Chrome (an OS process) and connects to its "DevTools" channel via file descriptors.
  • The Chrome process is supervised and the connected processes will automatically recover if it crashes.
  • A number of "targets" in Chrome are spawned, 1 per worker process in the SessionPool. By default, ChromicPDF will spawn each session in a new browser context (i.e., a profile).
  • When a PDF print is requested, a session will instruct its assigned "target" to navigate to the given URL, then wait until it receives a "frameStoppedLoading" event, and proceed to call the printToPDF function.
  • The printed PDF will be sent to the session as Base64 encoded chunks.

PDF/A Conversion

  • To convert a PDF to a PDF/A-3, ChromicPDF uses the ghostscript utility.
  • Since it is required to embed a color scheme into PDF/A files, ChromicPDF ships with a copy of the royalty-free eciRGB_V2 scheme by the European Color Initiative. If you need to be able to use a different color scheme, please open an issue.

Summary

Functions

Captures a screenshot.

Returns a specification to start this module as part of a supervision tree.

Converts a PDF to PDF/A (either PDF/A-2b or PDF/A-3b).

Prints a PDF and converts it to PDF/A in a single call.

Starts ChromicPDF.

Types

Specs

blob() :: iodata()

capture_screenshot_option()

Specs

capture_screenshot_option() ::
  {:capture_screenshot, map()}
  | navigate_option()
  | output_option()
  | telemetry_metadata_option()

Specs

evaluate_option() :: {:evaluate, %{expression: binary()}}

ghostscript_pool_option()

Specs

ghostscript_pool_option() :: {:size, non_neg_integer()}

Specs

global_option() ::
  {:offline, boolean()}
  | {:max_session_uses, non_neg_integer()}
  | {:session_pool, [session_pool_option()]}
  | {:no_sandbox, boolean()}
  | {:discard_stderr, boolean()}
  | {:chrome_args, binary()}
  | {:chrome_executable, binary()}
  | {:ignore_certificate_errors, boolean()}
  | {:ghostscript_pool, [ghostscript_pool_option()]}
  | {:on_demand, boolean()}

Specs

info_option() ::
  {:info,
   %{
     optional(:title) => binary(),
     optional(:author) => binary(),
     optional(:subject) => binary(),
     optional(:keywords) => binary(),
     optional(:creator) => binary(),
     optional(:creation_date) => binary(),
     optional(:mod_date) => binary()
   }}

Specs

navigate_option() ::
  {:set_cookie, map()} | evaluate_option() | wait_for_option()

Specs

output_function() :: (blob() -> output_function_result())

output_function_result()

Specs

output_function_result() :: any()

Specs

output_option() :: {:output, binary()} | {:output, output_function()}

Specs

path() :: binary()

Specs

pdf_option() ::
  {:print_to_pdf, map()}
  | navigate_option()
  | output_option()
  | telemetry_metadata_option()

Specs

pdfa_option() ::
  {:pdfa_version, binary()}
  | {:pdfa_def_ext, binary()}
  | info_option()
  | output_option()
  | telemetry_metadata_option()

Specs

return() :: :ok | {:ok, binary()} | {:ok, output_function_result()}

Specs

session_pool_option() :: {:size, non_neg_integer()} | {:timeout, timeout()}

Specs

source() :: {:url, url()} | {:html, blob()}

Specs

source_and_options() :: %{source: source(), opts: [pdf_option()]}

telemetry_metadata_option()

Specs

telemetry_metadata_option() :: {:telemetry_metadata, map()}

Specs

url() :: binary()

Specs

wait_for_option() :: {:wait_for, %{selector: binary(), attribute: binary()}}

Functions

capture_screenshot(input, opts \\ [])

Specs

capture_screenshot(url :: source(), opts :: [capture_screenshot_option()]) ::
  return()

Captures a screenshot.

This call blocks until the screenshot has been created.

Print and return Base64-encoded PNG

{:ok, blob} = ChromicPDF.capture_screenshot({:url, "file:///example.html"})
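
Because the returned blob is Base64-encoded, it has to be decoded before it can be written to disk as an image. The following minimal sketch shows one way to do that. The module MyApp.Blob is hypothetical, and it assumes the blob may arrive as iodata (as the blob() type suggests), so it is flattened to a binary first.

```elixir
defmodule MyApp.Blob do
  # Hypothetical helper: decodes a Base64-encoded blob (binary or iodata)
  # and writes the raw bytes to the given path. Returns :ok.
  def write_decoded!(blob, path) do
    binary =
      blob
      |> IO.iodata_to_binary()
      |> Base.decode64!()

    File.write!(path, binary)
  end
end
```

For example, `MyApp.Blob.write_decoded!(blob, "screenshot.png")` would store the screenshot from the call above. Passing the :output option instead avoids this step, since the output is then written to a file directly.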

Options

Options to the Page.captureScreenshot call can be supplied by passing a map to the :capture_screenshot option.

ChromicPDF.capture_screenshot(
  {:url, "file:///example.html"},
  capture_screenshot: %{
    format: "jpeg"
  }
)

For navigational options (source, cookies, evaluating scripts) see print_to_pdf/2.

Specs

child_spec([global_option()]) :: Supervisor.child_spec()

Returns a specification to start this module as part of a supervision tree.

convert_to_pdfa(pdf_path, opts \\ [])

Specs

convert_to_pdfa(pdf_path :: path(), opts :: [pdfa_option()]) :: return()

Converts a PDF to PDF/A (either PDF/A-2b or PDF/A-3b).

Convert an input PDF and return a Base64-encoded blob

{:ok, blob} = ChromicPDF.convert_to_pdfa("some_pdf_file.pdf")

Convert and write to file

ChromicPDF.convert_to_pdfa("some_pdf_file.pdf", output: "output.pdf")

PDF/A versions & levels

Ghostscript supports both PDF/A-2 and PDF/A-3 versions, both in their b (basic) level. By default, ChromicPDF generates PDF/A-3b files. Set the pdfa_version option to "2" for PDF/A-2b.

ChromicPDF.convert_to_pdfa("some_pdf_file.pdf", pdfa_version: "2")

Specifying PDF metadata

The converter is able to transfer PDF metadata (the Info dictionary) from the original PDF file to the output file. However, files printed by Chrome do not contain any metadata information (except "Creator" being "Chrome").

The :info option of the PDF/A converter allows you to specify metadata for the output file directly.

ChromicPDF.convert_to_pdfa("some_pdf_file.pdf", info: %{creator: "ChromicPDF"})

The converter understands the following keys, all of which accept only String values:

  • :title
  • :author
  • :subject
  • :keywords
  • :creator
  • :creation_date
  • :mod_date

By specification, date values in :creation_date and :mod_date do not need to follow a specific syntax. However, Ghostscript inserts date strings like "D:20200208153049+00'00'" and Info extractor tools might rely on this or another specific format. The converter will automatically format given DateTime values like this.

Both :creation_date and :mod_date are filled with the current date automatically (by Ghostscript), if the original file did not contain any.
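
If you prefer to pass an explicit date string, the format quoted above can be produced by hand. This is a minimal sketch rather than the converter's own implementation: the module MyApp.PDFDate is hypothetical, it relies on Calendar.strftime/2 (available in recent Elixir versions), and it only accepts UTC values, so the offset is always written as +00'00'.

```elixir
defmodule MyApp.PDFDate do
  # Hypothetical helper: formats a UTC DateTime in the
  # "D:YYYYMMDDHHMMSS+00'00'" style shown above. Non-UTC values are
  # rejected with a FunctionClauseError instead of being converted.
  def format(%DateTime{utc_offset: 0, std_offset: 0} = datetime) do
    Calendar.strftime(datetime, "D:%Y%m%d%H%M%S") <> "+00'00'"
  end
end
```

The result can then be given to the converter as a string, e.g. `info: %{creation_date: MyApp.PDFDate.format(DateTime.utc_now())}`.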

Adding more PostScript to the conversion

The pdfa_def_ext option can be used to feed more PostScript code into the final conversion step.

ChromicPDF.convert_to_pdfa(
  "some_pdf_file.pdf",
  pdfa_def_ext: "[/Title (OverriddenTitle) /DOCINFO pdfmark"
)

start_link(config \\ [])

Starts ChromicPDF.

If the given config includes the on_demand: true flag, this will instead spawn an Agent process that holds this configuration until a PDF operation is triggered which will then launch a supervisor temporarily, process the operation, and proceed to perform a graceful shutdown.