Handling Expected Failures

Reporting job errors to an external service is essential to maintaining application health. Noisy reports for flaky jobs, however, can become a distraction that gets ignored. Sometimes we expect that a job will error a few times. That could be because the job relies on an external service that is flaky, because it is prone to race conditions, or because the world is a crazy place. Regardless of why a job fails, reporting every failure may be undesirable.

Use Case: Silencing Initial Notifications for Flaky Services

One solution for reducing noisy error notifications is to start reporting only after a job has failed several times. Oban uses Telemetry to make reporting errors and exceptions a simple matter of attaching a handler function. In this example we will extend Honeybadger reporting from the Oban.Telemetry documentation, but account for the number of processing attempts.

To start, we'll define a Reportable protocol with a single reportable?/2 function:

defprotocol MyApp.Reportable do
  @fallback_to_any true
  def reportable?(worker, attempt)
end

defimpl MyApp.Reportable, for: Any do
  def reportable?(_worker, _attempt), do: true
end

The Reportable protocol has a default implementation which always returns true, meaning it reports all errors. Our application has a FlakyWorker that's known to fail a few times before succeeding. We don't want to see a report until after a job has failed three times, so we'll add an implementation of Reportable within the worker module:

defmodule MyApp.Workers.FlakyWorker do
  use Oban.Worker

  defstruct []

  defimpl MyApp.Reportable do
    @threshold 3

    def reportable?(_worker, attempt), do: attempt > @threshold
  end

  @impl true
  def perform(%{args: %{"email" => email}}) do
    MyApp.ExternalService.deliver(email)
  end
end

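Note that we've also used defstruct [] to make our worker a viable struct. This is necessary for our protocol to dispatch correctly: a bare module name is just an atom, so without a struct the call would never reach the worker's own implementation and would fall through to the Any implementation instead.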
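The difference between dispatching on a module name and on a struct can be seen in a small standalone sketch. The Demo.* names are hypothetical stand-ins for the application's modules, and no Oban dependency is assumed:

```elixir
defprotocol Demo.Reportable do
  @fallback_to_any true
  def reportable?(worker, attempt)
end

defimpl Demo.Reportable, for: Any do
  # Default: report every failure.
  def reportable?(_worker, _attempt), do: true
end

defmodule Demo.FlakyWorker do
  defstruct []

  defimpl Demo.Reportable do
    # Stay quiet for the first three attempts.
    def reportable?(_worker, attempt), do: attempt > 3
  end
end

# A bare module name is an atom, so this dispatches to the Any fallback.
IO.inspect(Demo.Reportable.reportable?(Demo.FlakyWorker, 1))

# A struct dispatches to the implementation defined inside the worker.
IO.inspect(Demo.Reportable.reportable?(struct(Demo.FlakyWorker), 1))
IO.inspect(Demo.Reportable.reportable?(struct(Demo.FlakyWorker), 4))
```

The first call returns true because the atom never reaches the worker's implementation; the second and third return false and true respectively.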

The final step is to call reportable?/2 from our application's error reporter, passing in a struct built from the worker module and the attempt number:

defmodule MyApp.ErrorReporter do
  alias MyApp.Reportable

  def handle_event(_, _, meta, _) do
    worker_struct = maybe_get_worker_struct(meta.job.worker)

    if Reportable.reportable?(worker_struct, meta.job.attempt) do
      context = Map.take(meta.job, [:id, :args, :queue, :worker])

      Honeybadger.notify(meta.reason, context, meta.stacktrace)
    end
  end

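  def maybe_get_worker_struct(worker) do
    # Fall back to the worker name when the module can't be resolved,
    # so the Any implementation of Reportable is used.
    case Oban.Worker.from_string(worker) do
      {:ok, module} -> struct(module)
      {:error, _} -> worker
    end
  rescue
    # Raised by struct/1 when the worker module doesn't define a struct.
    UndefinedFunctionError -> worker
  end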
end
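The fallback in maybe_get_worker_struct/1 relies on struct/1 raising UndefinedFunctionError for a module that doesn't call defstruct. A minimal sketch of that behaviour, using hypothetical Demo.* modules and taking a module directly rather than going through Oban.Worker.from_string/1:

```elixir
defmodule Demo.PlainWorker do
  # No defstruct here, so struct/1 can't build anything from this module.
end

defmodule Demo.StructWorker do
  defstruct []
end

defmodule Demo.Resolver do
  # Returns a struct when the module defines one, otherwise the module itself.
  def worker_struct(module) when is_atom(module) do
    struct(module)
  rescue
    UndefinedFunctionError -> module
  end
end

IO.inspect(Demo.Resolver.worker_struct(Demo.StructWorker))
IO.inspect(Demo.Resolver.worker_struct(Demo.PlainWorker))
```

Workers that never opted in by defining a struct are handed to the protocol as a plain value, which lands on the Any implementation and keeps them reporting every failure.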

Attach the failure handler somewhere in your application.ex module:

:telemetry.attach("oban-errors", [:oban, :job, :exception], &MyApp.ErrorReporter.handle_event/4, nil)

With the failure handler attached you will start getting error reports only after the third error, that is, from the fourth attempt onward.

Giving Time to Recover

If a service is especially flaky you may find that Oban's default backoff strategy retries too quickly. By defining a custom backoff/1 function on the FlakyWorker we can set a linear delay, in seconds, before retries:

# inside of MyApp.Workers.FlakyWorker

@impl true
def backoff(%Job{attempt: attempt}) do
  attempt * 60
end

Now the first retry is scheduled 60s later, the second 120s later, and so on.
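To see the resulting schedule, the same arithmetic can be run outside of a worker. This is only a sketch: Demo.Backoff is a hypothetical helper, and the capped variant with its 600 second ceiling is an assumption added for illustration, not part of the worker above:

```elixir
defmodule Demo.Backoff do
  # Linear delay in seconds, mirroring `attempt * 60` from the worker.
  def linear(attempt, step \\ 60) when attempt > 0, do: attempt * step

  # Hypothetical variant: the same linear delay, capped at `max` seconds.
  def capped(attempt, step \\ 60, max \\ 600) do
    min(linear(attempt, step), max)
  end
end

IO.inspect(Enum.map(1..5, &Demo.Backoff.linear/1))
# prints [60, 120, 180, 240, 300]

IO.inspect(Demo.Backoff.capped(15))
# prints 600
```

A cap like this may be worth considering for workers with many attempts, since an uncapped linear delay keeps growing with every retry.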

Building Blocks

Elixir's powerful primitives of behaviours, protocols and event handling make flexible error reporting seamless and extensible. While our Reportable protocol only considered the number of attempts, this same mechanism is suitable for filtering by any other meta value.
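As one possible illustration of filtering on other values, the sketch below passes the whole job to the protocol instead of only the attempt. Everything here is an assumption made for the example: the Demo.* names, the "bulk" queue, and the rule itself. Plain maps stand in for Oban.Job structs, using its queue, attempt and max_attempts fields:

```elixir
defprotocol Demo.MetaReportable do
  @fallback_to_any true
  def reportable?(worker, job)
end

defimpl Demo.MetaReportable, for: Any do
  def reportable?(_worker, _job), do: true
end

defmodule Demo.BulkWorker do
  defstruct []

  defimpl Demo.MetaReportable do
    # Hypothetical rule: jobs in the "bulk" queue only report on their final attempt.
    def reportable?(_worker, %{queue: "bulk", attempt: attempt, max_attempts: max}) do
      attempt >= max
    end

    # Jobs in any other queue report every failure.
    def reportable?(_worker, _job), do: true
  end
end

worker = struct(Demo.BulkWorker)

IO.inspect(Demo.MetaReportable.reportable?(worker, %{queue: "bulk", attempt: 1, max_attempts: 20}))
IO.inspect(Demo.MetaReportable.reportable?(worker, %{queue: "bulk", attempt: 20, max_attempts: 20}))
IO.inspect(Demo.MetaReportable.reportable?(worker, %{queue: "default", attempt: 1, max_attempts: 20}))
```

The error reporter would then pass meta.job as the second argument rather than meta.job.attempt.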
