Shoehorn
View SourceShoehorn helps you handle OTP application failures
Motivation
By default, the Erlang VM exits when OTP applications unexpectedly stop. This
can happen if an application's Application.start/2 callback crashes or if a
GenServer crashes repeatedly and takes down the application's supervision
tree. Either way, recovery needs to happen outside of the Erlang VM.
Shoehorn provides a way of handling this inside the Erlang VM to allow you to
debug, restart an application, switch to a recovery mode, or something else of
your choosing. It does this by creating a custom release start script
(shoehorn.boot) and exposing the Shoehorn.Handler behaviour for your code to
decide what to do. The custom release start script turns off the default OTP
application mode that exits the VM on unexpected errors and orders application
starts to make sure that the handler is available.
Shoehorn has another benefit of letting you influence the OTP application start
order. Dependencies still determine the overall ordering, but it's possible to
sort applications earlier via Shoehorn's :init option. This can let you
improve the apparent release startup time on slow platforms.
Usage
Run mix release.init on your project and then add shoehorn to your mix releases configuration in the mix.exs (replace :simple_app):
def project do
[
...
releases: releases()
]
end
def releases do
[
simple_app: [
steps: [&Shoehorn.Release.init/1, :assemble]
]
]
end
defp deps do
[
{:shoehorn, "~> 0.9.2"}
]
end
endCreate a release:
mix release
Next, take a look at the start script so that you can see how your application
will now be started and how it compares to the default startup.script. Open
_build/dev/rel/simple_app/releases/0.1.0/shoehorn.script and go to the end.
You should see something like the following:
{progress,applications_loaded},
{apply,{application,start_boot,[kernel,permanent]}},
{apply,{application,start_boot,[stdlib,permanent]}},
{apply,{application,start_boot,[compiler,permanent]}},
{apply,{application,start_boot,[elixir,permanent]}},
{apply,{application,start_boot,[logger,permanent]}},
{apply,{application,start_boot,[crypto,permanent]}},
{apply,{application,start_boot,[shoehorn,permanent]}},
{apply,{application,start_boot,[sasl,permanent]}},
{apply,{application,start_boot,[simple_app,temporary]}},
{progress,started}This shows the order that applications will be started and their mode.
Applications marked permanent will exit the VM if they stop expectantly.
Shoehorn will change as much as it can to temporary so that it (and by
extension, you) can control what happens.
To start your release using the shoehorn boot script, run:
RELEASE_BOOT_SCRIPT=shoehorn _build/dev/rel/simple_app/bin/simple_app start_iex
It should work as expected with the possible exception that the Erlang VM won't
exit for any of the OTP applications marked temporary.
Now let's configure shoehorn to do something more interesting by adding some
minimal configuration. This is hypothetical unless you're using Nerves:
# config/config.exs
config :shoehorn,
init: [:nerves_runtime, :nerves_pack]Shoehorn will generate a release script that starts :nerves_runtime and its
dependencies as soon as it can. Then it will start :nerves_pack and its
dependencies. Then it will start the remainder of the applications in the
project. Inspect the shoehorn.script file in the release directory to verify
this.
Use the init application list to prioritize OTP applications that are needed
for early on or for error recovery. In the example above, we initialize the
runtime, bring up the network (in :nerves_pack), and ensure that we can
receive new firmware updates. Now, if simple_app fails to start, the device
would still be in a state where it can receive new firmware over the network.
Handling application failures
The Erlang VM will respond to application failures differently, depending on the mode specified when the application started. The modes are:
:permanent- if the application terminates, all other applications and the entire node are also terminated.:transient- if the application terminates with:normal reason, it is reported but no other applications are terminated. However, if the application terminates abnormally, all other applications and the entire node are also terminated.:temporary- if the application terminates, it is reported but no other applications are terminated (the default behaviour).
Unless overridden in the Mix release using the :applications
option, Shoehorn
most applications as :temporary and monitors application events by registering
with the Erlang error_logger.
Application start and exit events will attempt to execute a callback to the
configured Shoehorn.Handler module. By default, the module
Shoehorn.DefaultHandler will be called. This module is configured to continue
the Erlang VM if any OTP application were to exit, for any reason. In
production, you may want to customize the action on failure so you can gather
forensics or perform updates to the node. You can do this by overriding the
handler in the prod env of your application config.
# config/prod.exs
config :shoehorn,
handler: SimpleApp.ShoehornHandlerMore advanced failure cases can be handled by providing your own module that
implements the Shoehorn.Handler behaviour. For example, the Erlang :ssh
application used to exit when subjected to a brute force attack (this seems like
it has been fixed). Instead of the default production behaviour of forcing the
node to restart, we can restart the application.
defmodule Example.RestartHandler do
@behavior Shoehorn.Handler
def init(_opts) do
{:ok, :no_state}
end
def application_started(_app, state) do
{:continue, state}
end
def application_exited(:ssh, _reason, state) do
Logger.error("Stop bothering ssh!")
Process.sleep(1000)
Application.ensure_all_started(:ssh)
{:continue, state}
end
def application_exited(app, _reason, state) do
Logger.error("Application stopped! #{inspect(app)} #{inspect(state)}")
{:halt, state}
end
endThe application_exited/3 callback is limited in the amount of time is has to
execute by setting a shutdown timer. If the callback does not return within the
defined shutdown time, the node is instructed to halt. The default shutdown time
is 30 seconds but this value can be changed in the application config:
# config/config.exs
config :shoehorn,
shutdown_timer: 50_000 # 50 SecondsHave a look at the example application for more info on implementing custom strategies.