Wrap your untrusted NIFs so that they can never crash your node.
Motivation
NIFs are great - sometimes. When they're written in a safe way, have been in use for a very long time, and are trusted by the community, they have likely been through the process of shaking out most of the bugs in their underlying source. However, new libraries come out that have not been as battle tested as you'd like. Some may have bugs, and when a NIF has a bug, it can crash your entire BEAM node! Code running inside a NIF does not get the same safety guarantees that the BEAM provides.
But... what if it could?
I recently ran into this issue: a library I was using was based on a NIF whose underlying source was crashing sporadically. I don't own the library, nor do I own the underlying C source, so while I can submit PRs to get it fixed, I still need some way to guarantee safety in the meantime. And thus, SafeNIF was born!
SafeNIF allows you to wrap your NIFs to run on an isolated peer node raised on the same machine. If the NIF crashes, only this peer node dies. The guarantees of the BEAM continue, and you get fault tolerance and crash isolation, even for NIFs, all in native Elixir (with a touch of Erlang's standard library).
Benchmarks
Benchmarks can be found in the bench directory.
As of v0.2.0, SafeNIF has implemented a lazy pool of reusable nodes which scale down when idle. On cold starts, a startup cost is incurred to initialize the peer node, which can take anywhere from 100ms to over a second, depending on how much code needs to be loaded onto the peer node. It should be noted that pooling also incurs costs around memory and CPU since it spins up a node on the same machine.
The benchmarks show that a CLI based Port is slower than SafeNIF. However, different types of workloads and Ports may yield different results.
For example, a Port that communicates over :stdio with a protocol, staying alive and responsive between calls, may perform better than a CLI based Port does.
Ports have both upsides and downsides just like NIFs, so your mileage may vary as you work with them.
SafeNIF's main goal is to let consumers wrap any NIF by calling SafeNIF.wrap/1 and immediately get the safety and isolation that the BEAM natively provides.
The following information was generated by Claude and reviewed by @probably-not. If you find issues in this README, feel free to open a PR to fix them!
Usage
Basic Usage
SafeNIF provides a single function: SafeNIF.wrap/2. Pass it an MFA (module, function, arguments) tuple and it runs on an isolated peer node:
```elixir
# Successful execution returns {:ok, result}
{:ok, 6} = SafeNIF.wrap({Kernel, :+, [2, 4]})

# Complex return values work fine
{:ok, %{name: "test"}} = SafeNIF.wrap({Map, :put, [%{}, :name, "test"]})
```
Wrapping Potentially Dangerous NIFs
The primary use case is wrapping NIFs that might crash:
```elixir
defmodule MyApp.ImageProcessor do
  def safe_process(image_binary) do
    # UntrustedNIF.process/1 might crash the BEAM
    case SafeNIF.wrap({UntrustedNIF, :process, [image_binary]}) do
      {:ok, processed} ->
        {:ok, processed}

      {:error, :noconnection} ->
        # The NIF crashed the peer node
        {:error, :nif_crashed}

      {:error, :timeout} ->
        {:error, :processing_timeout}

      {:error, reason} ->
        {:error, reason}
    end
  end
end
```
Timeouts
The default timeout is 5 seconds. Specify a custom timeout as the second argument using to_timeout/1:
```elixir
# 30 second timeout for long-running operations
SafeNIF.wrap({HeavyComputation, :run, [data]}, to_timeout(second: 30))

# 2 minute timeout for very long operations
SafeNIF.wrap({BatchJob, :process, [items]}, to_timeout(minute: 2))

# 500ms timeout for quick operations
SafeNIF.wrap({QuickCheck, :validate, [input]}, to_timeout(millisecond: 500))
```
When a timeout occurs, the peer node is killed and {:error, :timeout} is returned.
Anonymous Functions
Anonymous functions are supported but with an important caveat: the module that defines the function must be loadable on the peer node.
```elixir
# Works
SafeNIF.wrap(fn -> 1 + 1 end)

# Works (application modules are loaded on the peer)
SafeNIF.wrap(fn -> MyApp.Worker.do_work() end)

# May fail if defined inside a code path that is not part of the application
defmodule MyTest do
  def run_test do
    SafeNIF.wrap(fn -> :test_result end)
  end
end
```
For maximum reliability, prefer MFA tuples over anonymous functions.
Error Handling
SafeNIF returns tagged tuples to distinguish between successful results and failures:
```elixir
case SafeNIF.wrap({SomeModule, :some_function, [arg]}) do
  {:ok, result} ->
    # Function executed successfully, result is the return value
    handle_success(result)

  {:error, :timeout} ->
    # Function exceeded the timeout
    handle_timeout()

  {:error, :noconnection} ->
    # Peer node crashed (NIF crash, :erlang.halt, etc.)
    handle_crash()

  {:error, :not_alive} ->
    # Current node isn't running in distributed mode
    handle_not_distributed()

  {:error, reason} ->
    # Function raised/exited with reason
    handle_error(reason)
end
```
Note that if your wrapped function returns an error tuple, it's wrapped in {:ok, ...}:
```elixir
# Function returns {:error, :not_found}
{:ok, {:error, :not_found}} = SafeNIF.wrap({MyModule, :find, [123]})
```
This follows the same convention as Task.async_stream/5.
Requirements
Distributed Mode
SafeNIF requires your node to be running in distributed mode. If you call SafeNIF.wrap/2 on a non-distributed node, you'll get {:error, :not_alive}.
For development, start IEx with a node name:
```shell
iex --sname myapp -S mix
```
For production releases, ensure your node is started with distribution enabled.
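For Mix releases, one common way to enable distribution is through rel/env.sh.eex, which Mix releases source at boot. The node name myapp below is an illustrative placeholder; substitute your own:

```shell
# rel/env.sh.eex: enable short-name distribution for the release
export RELEASE_DISTRIBUTION=sname
export RELEASE_NODE=myapp
```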
Running Tests
Tests require distribution. Add this to your test/test_helper.exs:
```elixir
{:ok, _} = Node.start(:"test@127.0.0.1", :shortnames)
ExUnit.start()
```
Or run tests with:
```shell
elixir --sname test -S mix test
```
How It Works
When you call SafeNIF.wrap/2:
- A new BEAM node is started as a hidden peer using OTP's :peer module
- All code paths and application configuration are copied to the peer
- Applications are started on the peer
- Your function executes on the peer node
- The result is sent back via Erlang distribution
- The peer node shuts down
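The lifecycle above can be sketched with Erlang's :peer module directly. This is only a minimal illustration, not SafeNIF's implementation (which additionally handles code paths, configuration, and pooling); it uses a :standard_io control connection so it runs even without distribution:

```elixir
# Start a throwaway BEAM node. With no :name given and a :standard_io
# connection, the peer runs without requiring Erlang distribution.
{:ok, pid} = :peer.start(%{connection: :standard_io})

# Run a function on the peer over the control connection.
# If this call crashed the node, only the peer would die.
result = :peer.call(pid, :erlang, :+, [2, 4])

# Tear the peer node down.
:ok = :peer.stop(pid)
```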
Hidden Nodes
Peer nodes are started with the -hidden flag. This means they:
- Don't appear in Node.list/0
- Don't trigger :net_kernel.monitor_nodes/1 callbacks
- Won't be discovered by clustering libraries (libcluster, Horde, etc.)
This prevents SafeNIF's ephemeral peers from interfering with your cluster topology.
Performance Considerations
Since v0.2.0, SafeNIF now creates a lazy pool of ready peer nodes for use.
This does not mean, however, that SafeNIF is without overhead. There is still overhead in sending messages between the nodes, and wrapping the function in a way that can communicate with the caller.
SafeNIF is designed for "performant-enough" isolation, not high performance: its job is ensuring that functions, specifically untrusted NIFs, can run without affecting the current node. Use it for:
- Untrusted or potentially crashy NIFs
- Operations where safety trumps speed
Don't use it for:
- Trusted code that won't crash the node
Installation
To install, add it to the dependencies in your project's mix.exs:
```elixir
def deps do
  [
    {:safe_nif, ">= 0.0.1"}
  ]
end
```
Documentation can be generated with ExDoc and published on HexDocs. Once published, the docs can be found at https://hexdocs.pm/safe_nif.
Summary
Types
Options to pass into a new pool when adding it to the supervision tree.
Anything that is runnable. This may be a function, or an MFA tuple.
Functions
Wrap a call in a way that will ensure that it cannot affect the current BEAM node.
Wrap a call in a way that will ensure that it cannot affect the current BEAM node.
Types
```elixir
@type pool_start_opt() ::
        {:name, atom()}
        | {:size, pos_integer()}
        | {:idle_timeout, timeout()}
        | {:peer_applications, [atom()]}
```
Options to pass into a new pool when adding it to the supervision tree.
Options
- :name (required) - An atom/0 name to give to the pool.
- :size - How many workers to put in the pool. Defaults to System.schedulers_online/0.
- :idle_timeout - How long to allow the pool and nodes to be idle before scaling them down.
- :peer_applications - A list of applications to start on the peer node, defaulting to just :safe_nif. If you need a custom list of applications that must start on the peer node, make sure to pass their names into the pool.
Anything that is runnable. This may be a function, or an MFA tuple.
Options to pass into wrap/1 or wrap/4.
Options
- :timeout - A timeout to be passed into the function, defaulting to 5 seconds. Should the function take longer than the given timeout, the underlying process will be force killed and {:error, :timeout} will be returned.
- :pool - A pool configured with custom values for size and idle node time. Defaults to the default pool started up with the application.
- :pool_timeout - A timeout for how long it should take to check out a node from the pool, defaulting to 5 seconds. Should the pool checkout take longer than the given timeout, {:error, :timeout} will be returned.
Functions
Wrap a call in a way that will ensure that it cannot affect the current BEAM node.
This will raise a separate BEAM node via the Erlang :peer module, and run the runnable on that node.
The current node remains isolated, and results are communicated between the two via Erlang Distribution.
Since this uses Erlang Distribution under the hood, it requires that the current node be alive. If the current
node is not alive, an error of {:error, :not_alive} will be returned.
The result of the function is returned wrapped in an :ok tuple. This mirrors Task.async_stream/5, which always emits
an :ok tuple wrapping the function's return value, regardless of whether that value is an error.
Should the function cause a crash, the reason will be wrapped in an error tuple and returned as {:error, reason}.
Options
- :timeout - A timeout to be passed into the function, defaulting to 5 seconds. Should the function take longer than the given timeout, the underlying process will be force killed and {:error, :timeout} will be returned.
- :pool - A pool configured with custom values for size and idle node time. Defaults to the default pool started up with the application.
- :pool_timeout - A timeout for how long it should take to check out a node from the pool, defaulting to 5 seconds. Should the pool checkout take longer than the given timeout, {:error, :timeout} will be returned.
Node Pools
In order to avoid startup costs of initializing a full BEAM node and loading all of the necessary code onto it on each call,
SafeNIF implements a NimblePool based resource pool for :peer nodes,
allowing nodes to be reused across calls. By default, a pool is created with default values for idle time (5 minutes) and sizing
(based on System.schedulers_online/0). However, based on your demand, you may need to tune these values.
All benchmarks in SafeNIF were conducted with these defaults, but you can create your own pools and use them by passing in the :pool
option to wrap/1 and wrap/4.
Creating a pool is as easy as adding {SafeNIF, opts} (where opts is a list of pool_start_opt/0) into your supervision tree, and customizing these options for your use case.
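For example, a dedicated pool might be added to an application supervisor like this (the pool name :image_pool and the :my_app application are illustrative placeholders for your own):

```elixir
children = [
  # A custom SafeNIF pool: 4 peer nodes, scaled down after a minute idle,
  # with your own application started on each peer alongside :safe_nif.
  {SafeNIF,
   name: :image_pool,
   size: 4,
   idle_timeout: :timer.minutes(1),
   peer_applications: [:safe_nif, :my_app]}
]

Supervisor.start_link(children, strategy: :one_for_one)
```

Wrapped calls can then target this pool via the :pool option.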
Node Pool Options
- :name (required) - An atom/0 name to give to the pool.
- :size - How many workers to put in the pool. Defaults to System.schedulers_online/0.
- :idle_timeout - How long to allow the pool and nodes to be idle before scaling them down.
- :peer_applications - A list of applications to start on the peer node, defaulting to just :safe_nif. If you need a custom list of applications that must start on the peer node, make sure to pass their names into the pool.
Wrap a call in a way that will ensure that it cannot affect the current BEAM node.
Like wrap/1 but accepts an MFA that will be used with apply/3.
See wrap/1 for more details.