Handling Expected Failures
Reporting job errors by sending notifications to an external service is essential to maintaining application health. While reporting is essential, noisy reports for flaky jobs can become a distraction that gets ignored. Sometimes we expect that a job will error a few times. That could be because the job relies on an external service that is flaky, because it is prone to race conditions, or because the world is a crazy place. Regardless of why a job fails, reporting every failure may be undesirable.
Use Case: Silencing Initial Notifications for Flaky Services
One solution for reducing noisy error notifications is to start reporting only
after a job has failed several times. Oban uses Telemetry to make
reporting errors and exceptions a simple matter of attaching a handler function.
In this example we will extend Honeybadger reporting from the
Oban.Telemetry documentation, but account for the number of processing attempts.
To start, we'll define a Reportable protocol with a single
reportable?/2 function:
defprotocol MyApp.Reportable do
@fallback_to_any true
def reportable?(worker, attempt)
end
defimpl MyApp.Reportable, for: Any do
def reportable?(_worker, _attempt), do: true
endThe Reportable protocol has a default implementation which always returns
true, meaning it reports all errors. Our application has a FlakyWorker
that's known to fail a few times before succeeding. We don't want to see a
report until after a job has failed three times, so we'll add an implementation
of Reportable within the worker module:
defmodule MyApp.Workers.FlakyWorker do
use Oban.Worker
defimpl MyApp.Reportable do
@threshold 3
def reportable?(_worker, attempt), do: attempt > @threshold
end
@impl true
def perform(%{args: %{"email" => email}}) do
MyApp.ExternalService.deliver(email)
end
endThe final step is to call reportable?/2 from our application's error reporter,
passing in the worker module and the attempt number:
defmodule MyApp.ErrorReporter do
alias MyApp.Reportable
def handle_event(_, _, %{attempt: attempt, worker: worker} = meta, _) do
if Reportable.reportable?(worker, attempt) do
context = Map.take(meta, [:id, :args, :queue, :worker])
Honeybadger.notify(meta.error, context, meta.stacktrace)
end
end
endAttach the failure handler somewhere in your application.ex module:
:telemetry.attach("oban-errors", [:oban, :job, :exception], &ErrorReporter.handle_event/4, nil)With the failure handler attached you will start getting error reports only after the third error.
Giving Time to Recover
If a service is especially flaky you may find that Oban's default backoff
strategy is too fast. By defining a custom backoff function on the
FlakyWorker we can set a linear delay before retries:
# inside of MyApp.Workers.FlakyWorker
@impl true
def backoff(attempt, base_amount \\ 60) do
attempt * base_amount
endNow the first retry is scheduled 60s later, the second 120s later, and so on.
Building Blocks
Elixir's powerful primitives of behaviours, protocols and event handling make
flexible error reporting seamless and extensible. While our Reportable
protocol only considered the number of attempts, this same mechanism is suitable
for filtering by any other meta value.
Explore the [event metadata][meta] that Oban provides for job failures to see how you can configure reporting by by worker, queue, or even specific arguments.