Resiliency.Bulkhead (Resiliency v0.6.0)


A bulkhead for isolating workloads with per-partition concurrency limits.

A bulkhead wraps calls to a downstream service and limits how many can execute concurrently. When the limit is reached, callers are either rejected immediately or queued for a configurable wait time. This prevents one slow or overloaded service from consuming all available resources and cascading into other parts of the system.

Inspired by Resilience4j's SemaphoreBulkhead, with an Elixir-idiomatic API that follows the conventions of this library.

When to use

  • You need to isolate different workloads so that a spike in one does not starve others — e.g., separate bulkheads for search, payments, and notifications.
  • You want to cap the number of concurrent calls to a downstream service, with clear rejection semantics when the limit is reached.
  • You need server-managed wait queues with configurable timeouts and FIFO fairness, rather than caller-side timeouts.

Quick start

# 1. Add to your supervision tree
children = [
  {Resiliency.Bulkhead, name: MyApp.ApiBulkhead, max_concurrent: 10}
]
Supervisor.start_link(children, strategy: :one_for_one)

# 2. Use it
case Resiliency.Bulkhead.call(MyApp.ApiBulkhead, fn -> HttpClient.get(url) end) do
  {:ok, response} -> handle_response(response)
  {:error, :bulkhead_full} -> {:error, :overloaded}
  {:error, reason} -> {:error, reason}
end

How it works

The bulkhead runs as a GenServer that tracks active permits and a waiter queue. The protected function runs in the caller's process, not inside the GenServer. This means:

  • The GenServer is never blocked by slow downstream calls.
  • A crash in the protected function does not crash the GenServer.
  • Permit acquisition is synchronous (GenServer.call), and permit release is asynchronous (GenServer.cast) for minimal overhead.
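
The division of labour described above can be illustrated with a deliberately small sketch. This is not the library's implementation: `BulkheadSketch` and its message names are hypothetical, and the waiter queue, caller monitoring, and exit/throw handling are all left out. It only shows a synchronous acquire, the function running in the caller's process, and an asynchronous release.

```elixir
defmodule BulkheadSketch do
  @moduledoc "Illustrative only; not the Resiliency.Bulkhead implementation."
  use GenServer

  def start_link(max), do: GenServer.start_link(__MODULE__, max)

  def call(server, fun) do
    # Acquire synchronously so the caller knows whether it holds a permit.
    case GenServer.call(server, :acquire) do
      :ok ->
        try do
          # The protected function runs here, in the caller's process.
          {:ok, fun.()}
        rescue
          exception -> {:error, {exception, __STACKTRACE__}}
        after
          # Release asynchronously; the caller does not wait for a reply.
          GenServer.cast(server, :release)
        end

      :full ->
        {:error, :bulkhead_full}
    end
  end

  @impl true
  def init(max), do: {:ok, %{max: max, current: 0}}

  @impl true
  def handle_call(:acquire, _from, %{current: c, max: m} = state) when c < m,
    do: {:reply, :ok, %{state | current: c + 1}}

  def handle_call(:acquire, _from, state), do: {:reply, :full, state}

  # Exposed only so the sketch can be inspected.
  def handle_call(:current, _from, state), do: {:reply, state.current, state}

  @impl true
  def handle_cast(:release, state),
    do: {:noreply, %{state | current: state.current - 1}}
end
```

Because the server only increments and decrements a counter, a slow `fun` never occupies the GenServer itself. A real implementation would also need to monitor permit holders so that a caller killed mid-call does not leak its permit.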

Return values

| Function | Success | Bulkhead full | fn crashes |
|----------|---------|---------------|------------|
| call/2,3 | {:ok, result} | {:error, :bulkhead_full} | {:error, reason} |

Error handling

If the function raises, exits, or throws, the error is caught, the permit is released, and the error is returned to the caller:

{:error, {%RuntimeError{message: "boom"}, _stacktrace}} =
  Resiliency.Bulkhead.call(MyApp.ApiBulkhead, fn -> raise "boom" end)
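
As a rough illustration of how raises, exits, and throws can all be turned into an error tuple, here is a standalone sketch. `SafeApplySketch.safe_apply/1` is a hypothetical helper, not part of the library. Only the `{exception, stacktrace}` shape for raises comes from the example above; the `{:exit, reason}` and `{:throw, value}` shapes are assumptions made for this sketch.

```elixir
defmodule SafeApplySketch do
  # Hypothetical helper. The tuple shapes used for exits and throws are
  # assumptions; only the raise shape mirrors the documented example.
  def safe_apply(fun) when is_function(fun, 0) do
    try do
      {:ok, fun.()}
    rescue
      # Exceptions raised with raise/1,2
      exception -> {:error, {exception, __STACKTRACE__}}
    catch
      # exit/1 called in the current process
      :exit, reason -> {:error, {:exit, reason}}
      # throw/1 with no matching catch further down the stack
      :throw, value -> {:error, {:throw, value}}
    end
  end
end
```

A permit release would typically sit in an `after` clause of the same `try`, so that it runs on every path.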

Algorithm Complexity

| Function | Time | Space |
|----------|------|-------|
| call/2,3 | O(1) GenServer call + O(f) function | O(q), one entry per queued waiter |
| get_stats/1 | O(1) | O(1) |
| reset/1 | O(q), rejects all waiters | O(1) |

Telemetry

All events are emitted in the caller's process. See Resiliency.Telemetry for the complete event catalogue.

[:resiliency, :bulkhead, :call, :start]

Emitted at the beginning of every call/2,3 invocation, before the permit request.

Measurements

| Key | Type | Description |
|-----|------|-------------|
| system_time | integer | System.system_time() at emission time |

Metadata

| Key | Type | Description |
|-----|------|-------------|
| name | term | The bulkhead name passed to call/2,3 |

[:resiliency, :bulkhead, :call, :rejected]

Emitted when no permit could be obtained (the bulkhead is full and any wait time is exhausted) and the call is rejected without executing the function. Always followed immediately by a :stop event.

Measurements

(none)

Metadata

| Key | Type | Description |
|-----|------|-------------|
| name | term | The bulkhead name |

[:resiliency, :bulkhead, :call, :permitted]

Emitted when the bulkhead grants a permit and the function begins execution.

Measurements

(none)

Metadata

| Key | Type | Description |
|-----|------|-------------|
| name | term | The bulkhead name |

[:resiliency, :bulkhead, :call, :stop]

Emitted after every call/2,3 completes — whether rejected, successful, or failed.

Measurements

| Key | Type | Description |
|-----|------|-------------|
| duration | integer | Elapsed time in native time units (System.monotonic_time/0 delta) |

Metadata

| Key | Type | Description |
|-----|------|-------------|
| name | term | The bulkhead name |
| result | `:ok \| :error` | :ok on success, :error on failure or rejection |
| error | `term \| nil` | The error reason, :bulkhead_full if rejected, nil on success |
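
To show how the events above might be consumed, here is a hedged sketch of a handler module. It assumes the `:telemetry` package is available as a dependency; `BulkheadTelemetrySketch`, the handler id, and the log format are all hypothetical. The event names, the `duration` measurement, and the `name` and `result` metadata keys are taken from the tables above.

```elixir
defmodule BulkheadTelemetrySketch do
  require Logger

  @events [
    [:resiliency, :bulkhead, :call, :rejected],
    [:resiliency, :bulkhead, :call, :stop]
  ]

  # Assumes the :telemetry package is a dependency of the application.
  def attach do
    :telemetry.attach_many("bulkhead-sketch", @events, &__MODULE__.handle_event/4, nil)
  end

  # Telemetry handler callback: formats the event and logs it.
  def handle_event(event, measurements, metadata, _config) do
    Logger.info(describe(event, measurements, metadata))
  end

  # Pure formatting, kept separate so it can be checked without :telemetry.
  def describe([:resiliency, :bulkhead, :call, :stop], %{duration: duration}, meta) do
    # duration is in native units, so convert before reporting.
    ms = System.convert_time_unit(duration, :native, :millisecond)
    "bulkhead=#{inspect(meta.name)} result=#{meta.result} duration_ms=#{ms}"
  end

  def describe([:resiliency, :bulkhead, :call, :rejected], _measurements, meta) do
    "bulkhead=#{inspect(meta.name)} rejected"
  end
end
```

Since events are emitted in the caller's process, the handler also runs there, so it is worth keeping it cheap.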

Summary

Types

  • name() -- a bulkhead reference: a registered name, PID, or {:via, ...} tuple.

Functions

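  • call(name, fun, opts \\ []) -- executes fun through the bulkhead.
  • child_spec(opts) -- returns a child specification for starting under a supervisor.
  • get_stats(name) -- returns statistics about the bulkhead.
  • reset(name) -- resets the bulkhead to its initial state.
  • start_link(opts) -- starts a bulkhead linked to the current process.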

Types

name()

@type name() :: GenServer.server()

A bulkhead reference — a registered name, PID, or {:via, ...} tuple.

Functions

call(name, fun, opts \\ [])

@spec call(name(), (-> result), keyword()) :: {:ok, result} | {:error, term()}
when result: term()

Executes fun through the bulkhead.

If a permit is available (or becomes available within max_wait), the function runs in the caller's process. The permit is automatically released when the function returns, raises, exits, or throws.

If no permit is available and the wait time is exhausted, returns {:error, :bulkhead_full} without executing fun.

Parameters

  • name -- the name or PID of a running bulkhead.
  • fun -- a zero-arity function to execute.
  • opts -- optional keyword list.
    • :max_wait -- override the server's default max_wait for this call.

Returns

{:ok, result} on success, {:error, :bulkhead_full} when the bulkhead is full and the wait time is exhausted, or {:error, reason} if the function raises, exits, or throws.

Examples

iex> {:ok, _pid} = Resiliency.Bulkhead.start_link(name: :call_bh, max_concurrent: 2)
iex> Resiliency.Bulkhead.call(:call_bh, fn -> 1 + 1 end)
{:ok, 2}

child_spec(opts)

Returns a child specification for starting under a supervisor.

Options

  • :name -- (required) the name to register the bulkhead under.
  • :max_concurrent -- (required) the maximum number of concurrent calls. Must be a non-negative integer. 0 rejects all calls (useful as a kill-switch).
  • :max_wait -- the maximum time, in milliseconds, that a caller waits for a permit. 0 rejects immediately when the bulkhead is full; :infinity waits forever. Defaults to 0.
  • :on_call_permitted -- a one-arity callback (fn name -> any) fired when a call is permitted.
  • :on_call_rejected -- a one-arity callback (fn name -> any) fired when a call is rejected.
  • :on_call_finished -- a one-arity callback (fn name -> any) fired when a call finishes.
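
The options above can be combined in one child list. The following sketch builds two bulkheads with different limits; the module `MyApp.BulkheadChildren`, the bulkhead names, and the specific numbers are hypothetical and chosen only for illustration.

```elixir
defmodule MyApp.BulkheadChildren do
  require Logger

  # Returns child specs for two independently sized bulkheads.
  def children do
    [
      # Payments: few permits, callers may queue for up to 250 ms.
      {Resiliency.Bulkhead,
       name: MyApp.PaymentsBulkhead,
       max_concurrent: 5,
       max_wait: 250,
       on_call_rejected: fn name ->
         Logger.warning("#{inspect(name)} rejected a call")
       end},
      # Search: more permits, default max_wait of 0 (reject immediately).
      {Resiliency.Bulkhead, name: MyApp.SearchBulkhead, max_concurrent: 20}
    ]
  end
end
```

The list would then be passed to `Supervisor.start_link(children, strategy: :one_for_one)` as in the quick start.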

Examples

children = [
  {Resiliency.Bulkhead, name: MyApp.ApiBulkhead, max_concurrent: 10}
]
Supervisor.start_link(children, strategy: :one_for_one)

iex> spec = Resiliency.Bulkhead.child_spec(name: :my_bh, max_concurrent: 5)
iex> spec.id
{Resiliency.Bulkhead, :my_bh}

get_stats(name)

@spec get_stats(name()) :: map()

Returns statistics about the bulkhead.

Returns

A map containing:

  • :max_concurrent -- the configured maximum concurrent calls
  • :current -- the number of currently active calls
  • :available -- the number of available permits
  • :waiting -- the number of callers waiting in the queue
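
One possible use of this map is to derive simple health indicators. The sketch below uses only the four keys listed above; `BulkheadStatsSketch` and both of its functions are hypothetical helpers, and treating `max_concurrent: 0` as fully utilized is an assumption of this sketch.

```elixir
defmodule BulkheadStatsSketch do
  # Fraction of permits in use, between 0.0 and 1.0.
  # A bulkhead with max_concurrent 0 rejects everything, so this sketch
  # reports it as fully utilized (an assumption, not library behaviour).
  def utilization(%{max_concurrent: 0}), do: 1.0
  def utilization(%{max_concurrent: max, current: current}), do: current / max

  # True when no permits are free and callers are queued behind them.
  def saturated?(%{available: available, waiting: waiting}) do
    available == 0 and waiting > 0
  end
end
```

In an application this would be fed with `Resiliency.Bulkhead.get_stats(name)`, for example from a periodic metrics poller.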

Examples

iex> {:ok, _pid} = Resiliency.Bulkhead.start_link(name: :stats_bh, max_concurrent: 5)
iex> stats = Resiliency.Bulkhead.get_stats(:stats_bh)
iex> stats.max_concurrent
5
iex> stats.available
5

reset(name)

@spec reset(name()) :: :ok

Resets the bulkhead to its initial state.

Rejects all waiting callers with {:error, :bulkhead_full}, demonitors all active permit holders, and sets the current count to 0.

Examples

iex> {:ok, _pid} = Resiliency.Bulkhead.start_link(name: :reset_bh, max_concurrent: 5)
iex> Resiliency.Bulkhead.reset(:reset_bh)
:ok

start_link(opts)

@spec start_link(keyword()) :: GenServer.on_start()

Starts a bulkhead linked to the current process.

Typically you'd use child_spec/1 instead to start under a supervisor. See child_spec/1 for options.

Examples

{:ok, pid} = Resiliency.Bulkhead.start_link(name: MyApp.ApiBulkhead, max_concurrent: 10)