SafeNIF (SafeNIF v0.4.0)


Wrap your untrusted NIFs so that they can never crash your node.

Motivation

NIFs are great, sometimes. When they are written safely, have been in use for a long time, and are trusted by the community, most of the bugs in their underlying source have likely been found already. New libraries, however, are not always as battle-tested as you'd like. Some have bugs, and a bug in a NIF can crash your entire BEAM node! Code running inside a NIF does not get the safety guarantees that the BEAM normally gives.

But... what if it could?

I recently ran into this issue, using a library based on a NIF, and the NIF's underlying source was having sporadic crashes. I don't own the library, nor do I own the underlying C source, so while I can submit PRs to them to get it fixed, I still need some way to guarantee safety in the meantime. And thus, SafeNIF was born!

SafeNIF allows you to wrap your NIFs to run on an isolated peer node raised on the same machine. If the NIF crashes, only this peer node dies. The guarantees of the BEAM continue, and you get fault tolerance and crash isolation, even for NIFs, all in native Elixir (with a touch of Erlang's standard library).

Benchmarks

Benchmarks can be found in the bench directory.

As of v0.2.0, SafeNIF has implemented a lazy pool of reusable nodes which scale down when idle. On cold starts, a startup cost is incurred to initialize the peer node, which can take anywhere from 100ms to over a second, depending on how much code needs to be loaded onto the peer node. It should be noted that pooling also incurs costs around memory and CPU since it spins up a node on the same machine.

The benchmarks show that a CLI-based Port is slower than SafeNIF. Other workloads and other kinds of Ports may yield different results, though. For example, a Port that stays alive and answers requests over :stdio using a protocol may perform better than a CLI-based Port.

Ports have both upsides and downsides just like NIFs, so your mileage may vary as you work with them. SafeNIF's main concern is allowing any consumers to simply wrap any NIF by calling SafeNIF.wrap/1 and immediately having the safety and isolation that the BEAM natively provides.

The following information was generated by Claude and reviewed by @probably-not. If you find issues in this README, feel free to open a PR to fix them!

Usage

Basic Usage

SafeNIF's main entry point is SafeNIF.wrap/2. Pass it an MFA (module, function, arguments) tuple and the call runs on an isolated peer node:

# Successful execution returns {:ok, result}
{:ok, 6} = SafeNIF.wrap({Kernel, :+, [2, 4]})

# Complex return values work fine
{:ok, %{name: "test"}} = SafeNIF.wrap({Map, :put, [%{}, :name, "test"]})

Wrapping Potentially Dangerous NIFs

The primary use case is wrapping NIFs that might crash:

defmodule MyApp.ImageProcessor do
  def safe_process(image_binary) do
    # UntrustedNIF.process/1 might crash the BEAM
    case SafeNIF.wrap({UntrustedNIF, :process, [image_binary]}) do
      {:ok, processed} -> 
        {:ok, processed}
      {:error, :noconnection} -> 
        # The NIF crashed the peer node
        {:error, :nif_crashed}
      {:error, :timeout} -> 
        {:error, :processing_timeout}
      {:error, reason} -> 
        {:error, reason}
    end
  end
end

Timeouts

The default timeout is 5 seconds. Pass a custom timeout with the :timeout option; to_timeout/1 keeps the units readable:

# 30 second timeout for long-running operations
SafeNIF.wrap({HeavyComputation, :run, [data]}, timeout: to_timeout(second: 30))

# 2 minute timeout for very long operations
SafeNIF.wrap({BatchJob, :process, [items]}, timeout: to_timeout(minute: 2))

# 500ms timeout for quick operations
SafeNIF.wrap({QuickCheck, :validate, [input]}, timeout: to_timeout(millisecond: 500))

When a timeout occurs, the peer node is killed and {:error, :timeout} is returned.

Anonymous Functions

Anonymous functions are supported but with an important caveat: the module that defines the function must be loadable on the peer node.

# Works
SafeNIF.wrap(fn -> 1 + 1 end)

# Works (application modules are loaded on the peer)
SafeNIF.wrap(fn -> MyApp.Worker.do_work() end)

# May fail if defined inside a code path that is not part of the application.
defmodule MyTest do
  def run_test do
    SafeNIF.wrap(fn -> :test_result end)
  end
end

For maximum reliability, prefer MFA tuples over anonymous functions.
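
One way to follow that advice is to move the work into a named function and pass it as an MFA. The sketch below is only an illustration: `MyApp.Checksum`, `crc/1` and `safe_crc/1` are hypothetical names, and it assumes the module belongs to an application whose code is available on the peer.

```elixir
defmodule MyApp.Checksum do
  # Plain named function: callable on the peer as long as this module is
  # part of an application whose code is available there.
  def crc(data) when is_binary(data), do: :erlang.crc32(data)

  def safe_crc(data) do
    # Rather than SafeNIF.wrap(fn -> crc(data) end), pass the call as an MFA
    # so nothing depends on where a closure was defined.
    SafeNIF.wrap({__MODULE__, :crc, [data]})
  end
end
```

The arguments travel to the peer as ordinary terms, so `data` is passed explicitly instead of being captured by a closure.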

Error Handling

SafeNIF returns tagged tuples to distinguish between successful results and failures:

case SafeNIF.wrap({SomeModule, :some_function, [arg]}) do
  {:ok, result} ->
    # Function executed successfully, result is the return value
    handle_success(result)
    
  {:error, :timeout} ->
    # Function exceeded the timeout
    handle_timeout()
    
  {:error, :noconnection} ->
    # Peer node crashed (NIF crash, :erlang.halt, etc.)
    handle_crash()
    
  {:error, :not_alive} ->
    # Current node isn't running in distributed mode
    handle_not_distributed()
    
  {:error, reason} ->
    # Function raised/exited with reason
    handle_error(reason)
end

Note that if your wrapped function returns an error tuple, it's wrapped in {:ok, ...}:

# Function returns {:error, :not_found}
{:ok, {:error, :not_found}} = SafeNIF.wrap({MyModule, :find, [123]})

This follows the same convention as Task.async_stream/5.
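
If callers would rather not deal with the nested shape, a small helper can collapse it. This is a sketch, not part of SafeNIF: `MyApp.SafeResult` and `flatten/1` are hypothetical names, and it assumes the wrapped function follows the usual `{:ok, value} | {:error, reason}` convention.

```elixir
defmodule MyApp.SafeResult do
  # Hypothetical helper, not part of SafeNIF.

  # The wrapped function ran and returned its own tagged tuple: pass it through.
  def flatten({:ok, {:ok, value}}), do: {:ok, value}
  def flatten({:ok, {:error, reason}}), do: {:error, reason}

  # The wrapped function ran and returned a bare value.
  def flatten({:ok, value}), do: {:ok, value}

  # The call did not complete (timeout, peer crash, raise/exit, not distributed).
  # Tag the reason so it stays distinguishable from the function's own errors.
  def flatten({:error, reason}), do: {:error, {:safe_nif, reason}}
end
```

With this in place, `SafeNIF.wrap({MyModule, :find, [123]}) |> MyApp.SafeResult.flatten()` would yield `{:error, :not_found}` for the case above, while a timeout would surface as `{:error, {:safe_nif, :timeout}}`.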

Requirements

Distributed Mode

SafeNIF requires your node to be running in distributed mode. If you call SafeNIF.wrap/2 on a non-distributed node, you'll get {:error, :not_alive}.

For development, start IEx with a node name:

iex --sname myapp -S mix

For production releases, ensure your node is started with distribution enabled.
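
Since a non-distributed node only shows up as {:error, :not_alive} at call time, it can be useful to check once at boot. A minimal sketch using Node.alive?/0 from the standard library; `MyApp.DistributionCheck` is a hypothetical module, not something SafeNIF ships.

```elixir
defmodule MyApp.DistributionCheck do
  # Hypothetical helper, not part of SafeNIF.
  # Returns :ok when this node is distributed; otherwise returns the same
  # error tuple SafeNIF.wrap/2 is documented to return in that situation.
  def run do
    if Node.alive?() do
      :ok
    else
      {:error, :not_alive}
    end
  end
end
```

Calling `MyApp.DistributionCheck.run()` early in your application's start callback is one place such a check could live.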

Running Tests

Tests require distribution. Add this to your test/test_helper.exs:

# A host part containing dots (such as an IP address) needs :longnames
{:ok, _} = Node.start(:"test@127.0.0.1", :longnames)
ExUnit.start()

Or run tests with:

mix test --sname test

How It Works

When you call SafeNIF.wrap/2:

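  1. A peer node is checked out from the pool; on a cold start, a new BEAM node is started as a hidden peer using OTP's :peer module
  2. When a peer is started, all code paths and application configuration are copied to it
  3. Applications are started on the new peer
  4. Your function executes on the peer node
  5. The result is sent back via Erlang distribution
  6. The peer node goes back to the pool for reuse, and is shut down once it has been idle long enough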
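
To make the flow above more concrete, here is a rough sketch built directly on OTP's :peer module. It is not SafeNIF's implementation: `PeerSketch` is a hypothetical module, it starts a fresh peer on every call instead of pooling, it passes code paths with `-pa` as a stand-in for SafeNIF's own code loading, and it does not copy application configuration or start applications on the peer.

```elixir
defmodule PeerSketch do
  # Illustration only, not SafeNIF's implementation.
  def run({mod, fun, args}, timeout \\ 5_000) do
    if Node.alive?() do
      start_and_call(mod, fun, args, timeout)
    else
      # :peer's default connection relies on Erlang distribution.
      {:error, :not_alive}
    end
  end

  defp start_and_call(mod, fun, args, timeout) do
    # Start a hidden peer and hand it this node's code paths.
    options = %{
      name: :peer.random_name(),
      args: [~c"-hidden", ~c"-pa" | :code.get_path()]
    }

    case :peer.start(options) do
      {:ok, peer, _node} ->
        try do
          # Run the call on the peer; the result travels back to this node.
          {:ok, :peer.call(peer, mod, fun, args, timeout)}
        catch
          # A crashed peer or an expired timeout surfaces here.
          kind, reason -> {:error, {kind, reason}}
        after
          stop_quietly(peer)
        end

      {:error, reason} ->
        {:error, reason}
    end
  end

  # The peer may already be gone if the call crashed it, so ignore failures.
  defp stop_quietly(peer) do
    :peer.stop(peer)
  catch
    _kind, _reason -> :ok
  end
end
```

On a distributed node, `PeerSketch.run({Kernel, :+, [2, 4]})` is expected to go through the same broad phases as the list above, minus pooling. The error shapes in this sketch are its own and do not match SafeNIF's documented `:timeout` and `:noconnection` reasons.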

Hidden Nodes

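Peer nodes are started with the -hidden flag. This means they do not show up in Node.list/0 (only in Node.list(:hidden) or Node.list(:connected)), and connections to them are not transitive, so the other nodes in your cluster do not automatically connect to them.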

This prevents SafeNIF's ephemeral peers from interfering with your cluster topology.
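
You can observe the effect of hidden connections with Node.list/0 and Node.list/1 from the standard library. In the sketch below, `MyApp.ClusterView` is a hypothetical module name. Whether a pooled peer is connected at any given moment is an assumption, so the hidden list may well be empty.

```elixir
defmodule MyApp.ClusterView do
  # Hypothetical helper for inspecting connections from the current node.
  def nodes do
    %{
      # Nodes connected through normal (visible) connections.
      visible: Node.list(),
      # Hidden connections, such as peers started with -hidden,
      # must be requested explicitly.
      hidden: Node.list(:hidden)
    }
  end
end
```

Code that iterates over `Node.list()` to discover cluster members never sees the hidden entries.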

Performance Considerations

Since v0.2.0, SafeNIF keeps a lazy pool of ready peer nodes, so most calls avoid the cost of starting a node.

This does not mean, however, that SafeNIF is without overhead. There is still overhead in sending messages between the nodes, and in wrapping the function so that it can communicate with the caller.

SafeNIF is designed for "performant-enough" isolation, not for high performance: its goal is that functions, specifically untrusted NIFs, can run without affecting the current node. Use it for:

  • Untrusted or potentially crashy NIFs
  • Operations where safety trumps speed

Don't use it for:

  • Trusted code that won't crash the node

Installation

SafeNIF is available on Hex.

To install, add it to your dependencies in your project's mix.exs.

def deps do
  [
    {:safe_nif, ">= 0.0.1"}
  ]
end

Documentation can be generated with ExDoc and published on HexDocs. Once published, the docs can be found at https://hexdocs.pm/safe_nif.

Summary

Types

pool_start_opt()

Options to pass into a new pool when adding it to the supervision tree.

runnable()

Anything that is runnable. This may be a function, or an MFA tuple.

wrap_opt()

Options to pass into wrap/1 or wrap/4.

Functions

wrap(runnable, opts \\ [])

Wrap a call in a way that will ensure that it cannot affect the current BEAM node.

wrap(mod, fun, args, opts \\ [])

Wrap a call in a way that will ensure that it cannot affect the current BEAM node.

Types

pool_start_opt()

@type pool_start_opt() ::
  {:name, atom()}
  | {:size, pos_integer()}
  | {:idle_timeout, timeout()}
  | {:peer_applications, [atom()]}

Options to pass into a new pool when adding it to the supervision tree.

Options

  • :name (required) - An atom/0 name to give to the pool.
  • :size - How many workers to put in the pool. Defaults to System.schedulers_online/0.
  • :idle_timeout - How long to allow the pool and nodes to be idle until we start scaling them down.
  • :peer_applications - A list of applications to start on the peer node, defaulting to just :safe_nif. If you need a custom list of applications that must start on the peer node, make sure to pass their names into the pool.

runnable()

@type runnable() :: (-> term()) | {module(), atom(), list()}

Anything that is runnable. This may be a function, or an MFA tuple.

wrap_opt()

@type wrap_opt() ::
  {:timeout, timeout()} | {:pool_timeout, timeout()} | {:pool, atom()}

Options to pass into wrap/1 or wrap/4.

Options

  • :timeout - How long the function is allowed to run, defaulting to 5 seconds. If the function takes longer than the given timeout, the underlying process is forcibly killed and {:error, :timeout} is returned.
  • :pool - A pool configured with custom values for size and idle node time. Defaults to the default pool started up with the application.
  • :pool_timeout - How long to wait when checking out a node from the pool, defaulting to 5 seconds. If the checkout takes longer than the given timeout, {:error, :timeout} is returned.

Functions

wrap(runnable, opts \\ [])

@spec wrap(runnable(), [wrap_opt()]) :: {:ok, term()} | {:error, term()}

Wrap a call in a way that will ensure that it cannot affect the current BEAM node.

This will raise a separate BEAM node via the Erlang :peer module, and run the runnable on that node. The current node remains isolated, and results are communicated between the two via Erlang Distribution.

Since this uses Erlang Distribution under the hood, it requires that the current node be alive. If the current node is not alive, an error of {:error, :not_alive} will be returned.

The result of the function is returned wrapped in an :ok tuple. This mirrors Task.async_stream/5, which always emits an :ok tuple around the function's return value, even when that return value is itself an error.

Should the function cause a crash, the reason will be wrapped in an error tuple and returned as {:error, reason}.

Options

  • :timeout - How long the function is allowed to run, defaulting to 5 seconds. If the function takes longer than the given timeout, the underlying process is forcibly killed and {:error, :timeout} is returned.
  • :pool - A pool configured with custom values for size and idle node time. Defaults to the default pool started up with the application.
  • :pool_timeout - How long to wait when checking out a node from the pool, defaulting to 5 seconds. If the checkout takes longer than the given timeout, {:error, :timeout} is returned.
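
As a short illustration of the two timeouts together, consider the sketch below. `MyApp.Thumbnails` and `UntrustedNIF.resize/2` are placeholders for your own code, and the values are arbitrary.

```elixir
defmodule MyApp.Thumbnails do
  # UntrustedNIF.resize/2 is a placeholder for the NIF-backed call being wrapped.
  def resize(image, width) do
    SafeNIF.wrap({UntrustedNIF, :resize, [image, width]},
      # Allow the call itself up to 30 seconds (value in milliseconds).
      timeout: 30_000,
      # Give up after 1 second if no peer node can be checked out of the pool.
      pool_timeout: 1_000
    )
  end
end
```

Both timeouts are documented as returning {:error, :timeout}, so the return value alone does not tell you which of the two expired.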

Node Pools

In order to avoid startup costs of initializing a full BEAM node and loading all of the necessary code onto it on each call, SafeNIF implements a NimblePool based resource pool for :peer nodes, allowing nodes to be reused across calls. By default, a pool is created with default values for idle time (5 minutes) and sizing (based on System.schedulers_online/0). However, based on your demand, you may need to tune these values. All benchmarks in SafeNIF were conducted with these defaults, but you can create your own pools and use them by passing in the :pool option to wrap/1 and wrap/4.

Creating a pool is as easy as adding {SafeNIF, opts} (where opts is a list of pool_start_opt/0) into your supervision tree, and customizing these options for your use case.

Node Pool Options

  • :name (required) - An atom/0 name to give to the pool.
  • :size - How many workers to put in the pool. Defaults to System.schedulers_online/0.
  • :idle_timeout - How long to allow the pool and nodes to be idle until we start scaling them down.
  • :peer_applications - A list of applications to start on the peer node, defaulting to just :safe_nif. If you need a custom list of applications that must start on the peer node, make sure to pass their names into the pool.
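
A sketch of what a dedicated pool might look like in a supervision tree, using only the options listed above. `MyApp.Application`, `MyApp.NIFPool`, `MyApp.Images` and `UntrustedNIF.process/1` are hypothetical names, and the size and idle timeout are arbitrary values to tune for your workload.

```elixir
defmodule MyApp.Application do
  use Application

  # Child specs are built in a separate function so they are easy to inspect.
  def children do
    [
      # A dedicated pool of 2 peer nodes that may scale down after 60 s idle.
      {SafeNIF, name: MyApp.NIFPool, size: 2, idle_timeout: 60_000}
    ]
  end

  @impl true
  def start(_type, _args) do
    Supervisor.start_link(children(), strategy: :one_for_one, name: MyApp.Supervisor)
  end
end

defmodule MyApp.Images do
  # Route calls to the dedicated pool instead of the default one.
  def process(image) do
    SafeNIF.wrap({UntrustedNIF, :process, [image]}, pool: MyApp.NIFPool)
  end
end
```

A separate pool like this keeps one heavy or crash-prone NIF from competing with other callers for the default pool's nodes.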

wrap(mod, fun, args, opts \\ [])

@spec wrap(module(), atom(), list(), [wrap_opt()]) :: {:ok, term()} | {:error, term()}

Wrap a call in a way that will ensure that it cannot affect the current BEAM node.

Like wrap/1 but accepts an MFA that will be used with apply/3.

See wrap/1 for more details.
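
For illustration, the same addition used in the README's basic example could be written with separate module, function and argument values. `MyApp.Math` is a hypothetical module name.

```elixir
defmodule MyApp.Math do
  # Same intent as SafeNIF.wrap({Kernel, :+, [a, b]}, timeout: 1_000),
  # with the MFA passed as three separate arguments.
  def safe_add(a, b) do
    SafeNIF.wrap(Kernel, :+, [a, b], timeout: 1_000)
  end
end
```

This form can be convenient when the module, function and arguments are already held in separate variables.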