Nerves.Runtime.StartupGuard (nerves_runtime v0.13.12)

Monitor system startup and validate firmware

This module provides a default way to ensure that devices that fail to complete initialization either revert to an earlier firmware or reboot to try again. It allows enough time that a device doesn't get into an undebuggable boot loop, but it also doesn't wait forever in a state that may be just as impossible to debug.

This is a generic default that is intended to be suitable for all use cases. However, you will eventually find that you can do better, and you are encouraged to replace it when ready. For example, you may want to confirm connectivity to a firmware update server before validating a new image just in case a change broke networking. Please investigate using alarms (via :alarm_handler or alarmist) for aggregating these checks.
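As an illustration of the connectivity check mentioned above, here is a minimal sketch that tests whether a TCP connection to a server can be opened. The module name `ConnectivityCheck`, the function `reachable?/3`, and the host in the usage note are hypothetical; only `:gen_tcp` from OTP is used. A real check would likely also verify TLS and authentication.

```elixir
defmodule ConnectivityCheck do
  @moduledoc false
  # Hypothetical helper, not part of nerves_runtime.

  @doc "Returns true if a TCP connection to host:port opens within timeout_ms."
  def reachable?(host, port, timeout_ms \\ 5_000) do
    opts = [:binary, active: false]

    case :gen_tcp.connect(to_charlist(host), port, opts, timeout_ms) do
      {:ok, socket} ->
        # Only reachability matters here, so close the socket right away.
        :gen_tcp.close(socket)
        true

      {:error, _reason} ->
        false
    end
  end
end
```

A custom guard could call something like `ConnectivityCheck.reachable?("updates.example.com", 443)` and only run Nerves.Runtime.validate_firmware/0 when it returns true.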

If your Nerves system requires that new firmware images be validated, you will need this module or a replacement for it. In other words, if you have to run Nerves.Runtime.validate_firmware/0 every time you upload new firmware, then your Nerves system requires validation.

Setup

Add the following to your project's target.exs or config.exs:

config :nerves_runtime, startup_guard_enabled: true

Erlang may start fine but then hang before StartupGuard can register itself with Erlang's heart feature. To handle this case, Nerves Heart (which integrates with Erlang heart) supports an initialization handshake, but it must be enabled explicitly. To enable it, add the following to your project's rel/vm.args.eex:

## Require an initialization handshake within 10 minutes
-env HEART_INIT_TIMEOUT 600

Further discussion

Here's the high level summary of how this works:

  1. On init, OTP starts up all applications. When it starts up :nerves_runtime, StartupGuard gets run.
  2. StartupGuard registers a :heart callback. The callback is a time bomb that starts failing after 15 minutes.
  3. StartupGuard gets the list of OTP applications that should be started. Applications marked in the Mix release to only :load aren't counted.
  4. StartupGuard waits for all expected applications to start.
  5. Once everything starts, StartupGuard validates the firmware and removes the :heart callback.
  6. If anything goes wrong, StartupGuard logs the errors. Since the :heart callback is still registered, the system will be available for debugging, but it will eventually reboot.
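The steps above can be sketched roughly as follows. This is not the StartupGuard source: the module `StartupSketch` and its functions are hypothetical, the expected application list is passed in explicitly instead of being read from the release, and measuring the 15 minutes from VM start is an assumption. The callback returns :ok while healthy because that is what :heart expects.

```elixir
defmodule StartupSketch do
  @moduledoc false
  # Hypothetical names for illustration; not part of nerves_runtime.

  # 15 minutes, matching the time bomb described in step 2.
  @limit_ms 15 * 60 * 1000

  @doc "Zero-arity function of the shape :heart.set_callback/2 requires."
  def time_bomb do
    # Assumption: elapsed time is measured from when the VM started.
    start = :erlang.system_info(:start_time)
    uptime = System.monotonic_time() - start
    uptime_ms = System.convert_time_unit(uptime, :native, :millisecond)
    if expired?(uptime_ms, @limit_ms), do: :error, else: :ok
  end

  @doc "Pure helper so the deadline logic is easy to test."
  def expired?(uptime_ms, limit_ms), do: uptime_ms >= limit_ms

  @doc "Polls until every app in `expected` is started or `timeout_ms` elapses."
  def wait_for_apps(expected, timeout_ms, poll_ms \\ 100) do
    deadline = System.monotonic_time(:millisecond) + timeout_ms
    do_wait(expected, deadline, poll_ms)
  end

  defp do_wait(expected, deadline, poll_ms) do
    started = for {app, _description, _vsn} <- Application.started_applications(), do: app
    missing = expected -- started

    cond do
      missing == [] ->
        :ok

      System.monotonic_time(:millisecond) >= deadline ->
        {:error, {:not_started, missing}}

      true ->
        Process.sleep(poll_ms)
        do_wait(expected, deadline, poll_ms)
    end
  end
end
```

On a device where heart is running, the flow would be to call `:heart.set_callback(StartupSketch, :time_bomb)`, wait with `StartupSketch.wait_for_apps/3`, and on :ok call Nerves.Runtime.validate_firmware/0 followed by `:heart.clear_callback()`. On an error result, the callback stays registered and the error is logged.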

One nice alteration to this is to leave the :heart callback in place, but have it check some kind of "system ok" flag. If you do this, keep in mind that the callback is totally unforgiving of errors and of function calls that take too long. Making it too complicated can backfire and cause inadvertent reboots. Rebooting too quickly on errors can impact your ability to debug partial failures. If using this code as a template, try to keep your code in a Task, or change it to a GenServer or anything else that can be supervised. Decoupling the checks into alarms is another nice pattern.
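A minimal sketch of such a "system ok" flag, assuming the flag lives in :persistent_term so that the callback is a single fast lookup with no message passing. The module `SystemOkFlag` is hypothetical. Treating an unset flag as healthy is a design choice made here to avoid an inadvertent reboot; the opposite default is equally defensible.

```elixir
defmodule SystemOkFlag do
  @moduledoc false
  # Hypothetical helper, not part of nerves_runtime.

  @key {__MODULE__, :ok?}

  @doc "Called by whatever code aggregates your health checks."
  def set(ok?) when is_boolean(ok?), do: :persistent_term.put(@key, ok?)

  @doc "Zero-arity callback: returns :ok while healthy, :error otherwise."
  def check do
    # The default of true means an unset flag is treated as healthy.
    if :persistent_term.get(@key, true), do: :ok, else: :error
  end
end
```

It would be registered with `:heart.set_callback(SystemOkFlag, :check)`. Updating a persistent term can be comparatively expensive, so this approach suits a flag that changes rarely.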

Troubleshooting

  1. If you get the log message about exceeding the number of retries for getting firmware validation status, then Nerves.Runtime.firmware_validation_status/0 is returning :unknown. This is probably due to the Nerves system's fwup.conf not initializing <slot>.nerves_fw_validated to 0 (or 1 if always valid).
  2. If the device falls back without leaving logs, try installing ramoops_logger to capture log messages that don't make it to disk.

Functions

child_spec(arg)

Returns a specification to start this module under a supervisor.

arg is passed as the argument to Task.start_link/1 in the :start field of the spec.

For more information, see the Supervisor module, the Supervisor.child_spec/2 function and the Supervisor.child_spec/0 type.
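For a replacement guard, a module that calls `use Task` gets a child_spec/1 of the same shape. The sketch below assumes that pattern; `MyStartupCheck` and its `run/1` function are hypothetical names.

```elixir
defmodule MyStartupCheck do
  @moduledoc false
  # Hypothetical replacement guard built on `use Task`.
  use Task

  def start_link(arg), do: Task.start_link(__MODULE__, :run, [arg])

  def run(_arg) do
    # Startup checks and firmware validation would go here.
    :ok
  end
end

# The generated child_spec/1 passes `arg` through to start_link/1:
MyStartupCheck.child_spec(:some_arg)
# => %{id: MyStartupCheck, start: {MyStartupCheck, :start_link, [:some_arg]}, restart: :temporary}
```

Adding `{MyStartupCheck, :some_arg}` to a supervisor's child list starts the task under supervision. The default :temporary restart means the task is not restarted once it exits.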