Alarmist.Ops (alarmist v0.4.0)

Alarm operations for use with alarm_if

Summary

Functions

copy(engine, list)

Replicate an alarm status

debounce(engine, list)

Set an alarm when the input has been set for a specified duration

hold(engine, list)

Keep an alarm set for a guaranteed amount of time

intensity(engine, list)

Sets an alarm when the input alarm has been set and cleared too many times

logical_and(engine, list)

Set an alarm when all of the input alarms are set

logical_not(engine, list)

Set an alarm when the input alarm is cleared

logical_or(engine, list)

Set an alarm when one or more input alarms get set

on_time(engine, list)

Sets an alarm when the input has been set for too long in a given period

sustain_window(engine, list)

Sets an alarm when the input has been set for a minimum duration in a window

unknown_as_set(engine, list)

Return an alarm as set if it's unknown

Types

engine()

@opaque engine()

Functions

copy(engine, list)

@spec copy(engine(), [Alarmist.alarm_id()]) :: engine()

Replicate an alarm status

This is useful for aliasing alarm names. For example, if one library sets and clears an alarm ID in its own namespace, but another library wants to listen for changes to an alarm ID in a different namespace, a copy rule can glue them together.

Example:

defmodule NewAlarm do
  use Alarmist.Alarm

  alarm_if do
    OriginalAlarm
  end
end

debounce(engine, list)

@spec debounce(engine(), [Alarmist.alarm_id(), ...]) :: engine()

Set an alarm when the input has been set for a specified duration

This rule keeps transient alarms from triggering remediation unnecessarily. This is useful when remediation is expensive or service impacting and the input alarm is somewhat glitchy.

Alarmist already provides some debouncing because alarms that are set and cleared within a single alarm processing pass are ignored. That behavior should not be relied upon, though, whereas a debounce rule establishes an explicit duration.

An example of when debouncing is useful is delaying remediation of higher-level alarms such as being disconnected from a backend server. There are many reasons that a TCP connection could be interrupted, and client code probably already has some retry logic to reestablish the connection. In this case, it might be good to delay switching to an offline mode for a little while in the hope that the problem will go away on its own.

Example:

defmodule NewAlarm do
  use Alarmist.Alarm

  alarm_if do
    debounce(Alarm1, 1_000)
  end
end
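
The timing described above can be modeled in a few lines of plain Elixir. This is a minimal sketch of one reading of the rule, not Alarmist's implementation: the module name DebounceSketch, the event tuple format, and the end_ms parameter are all hypothetical, and the sketch assumes that the output clears as soon as the input clears and that input events alternate between set and clear.

```elixir
defmodule DebounceSketch do
  # Illustrative model only; not Alarmist's implementation.
  # `events` is a chronological list of {time_ms, :set | :clear} for the input
  # alarm and is assumed to alternate between :set and :clear.
  def output_events(events, timeout_ms, end_ms) do
    # Time of the following event, or the end of the observation for the last one
    next_times = (events |> Enum.drop(1) |> Enum.map(&elem(&1, 0))) ++ [end_ms]

    {out, _output_set?} =
      events
      |> Enum.zip(next_times)
      |> Enum.reduce({[], false}, fn
        # Input stayed set for the whole timeout: the output gets set
        {{t, :set}, next_t}, {acc, false} when next_t - t >= timeout_ms ->
          {[{t + timeout_ms, :set} | acc], true}

        # Input cleared while the output was set: the output clears too
        {{t, :clear}, _next_t}, {acc, true} ->
          {[{t, :clear} | acc], false}

        # Transient set, or a clear before the timeout elapsed: no output change
        _event, state ->
          state
      end)

    Enum.reverse(out)
  end
end

# A 200 ms glitch is ignored; the later 4 s outage sets the output after 1 s
events = [{0, :set}, {200, :clear}, {1_000, :set}, {5_000, :clear}]
IO.inspect(DebounceSketch.output_events(events, 1_000, 6_000))
# prints [{2000, :set}, {5000, :clear}]
```

The glitch at 0 ms never reaches the output, which is the behavior that protects expensive remediation from noisy inputs.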

hold(engine, list)

@spec hold(engine(), [Alarmist.alarm_id(), ...]) :: engine()

Keep an alarm set for a guaranteed amount of time

This keeps the output alarm set for at least timeout milliseconds after the input alarm is set. Each time the input alarm is set, the timer is restarted.

Hold is useful for remediation that is time based. In these cases, handling an alarm means turning something off for a while, because turning that feature back on as soon as the alarm clears would likely just result in the alarm being set again. Managing the timeout period via alarms rather than programmatically lets you manually clear the alarm when you want the feature enabled again immediately, for example while debugging.

Example:

defmodule NewAlarm do
  use Alarmist.Alarm

  alarm_if do
    hold(Alarm1, 1_000)
  end
end
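
As a rough illustration of the "at least timeout milliseconds" wording, the sketch below computes when a held alarm could clear. It is only one interpretation of the description and not Alarmist's implementation: HoldSketch and its function are hypothetical names, and the sketch assumes that the output clears at the later of the final input clear and the end of the most recently restarted timer.

```elixir
defmodule HoldSketch do
  # Illustrative model only; not Alarmist's implementation.
  # `set_times_ms` is a non-empty list of every time the input was set and
  # `input_clear_ms` is the time of the final clear of the input.
  def output_clear_time(set_times_ms, input_clear_ms, timeout_ms) do
    # Each set restarts the timer, so only the latest set matters
    hold_until = Enum.max(set_times_ms) + timeout_ms

    # Assumption: the output never clears before the input does
    max(input_clear_ms, hold_until)
  end
end

# Short blip: input clears at 200 ms, output is held until 1000 ms
IO.inspect(HoldSketch.output_clear_time([0], 200, 1_000))
# prints 1000

# Input set again at 800 ms: the timer restarts, so the hold ends at 1800 ms
IO.inspect(HoldSketch.output_clear_time([0, 800], 900, 1_000))
# prints 1800
```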

intensity(engine, list)

@spec intensity(engine(), [Alarmist.alarm_id(), ...]) :: engine()

Sets an alarm when the input alarm has been set and cleared too many times

This type of rule catches flapping alarms where it's desirable to take some kind of remediation when they trigger too many times in a row. Intensity is measured as count set/clear cycles within period milliseconds. This works like supervision restart intensity thresholds.

An example of an intensity-based alarm is handling the case where multiple network connections are available, but one that should be good is flaky. This happens if a device has both a cellular and a WiFi connection. Normally the WiFi connection is preferred, but if it keeps going up and down, it may be desirable to raise an alarm. That alarm could disable WiFi for a while. Combine this with hold/2 to manage the duration that WiFi is off.

Example:

defmodule NewAlarm do
  use Alarmist.Alarm

  alarm_if do
    intensity(Alarm1, 3, 60_000)
  end
end
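
The count-in-period idea can be sketched as a sliding-window check. This is an illustrative model rather than Alarmist's implementation: IntensitySketch is a hypothetical name, the sketch counts only the times at which the input was set, and it assumes the output trips once the number of sets in the window reaches count. The exact comparison (reaching versus exceeding the count) is an assumption.

```elixir
defmodule IntensitySketch do
  # Illustrative model only; not Alarmist's implementation.
  # `set_times_ms` lists the times at which the input alarm was set.
  def tripped?(set_times_ms, now_ms, count, period_ms) do
    # Count the sets that fall inside the window (now - period, now]
    recent =
      Enum.count(set_times_ms, fn t ->
        t > now_ms - period_ms and t <= now_ms
      end)

    # Assumption: reaching `count` is enough to trip the output
    recent >= count
  end
end

set_times = [0, 10_000, 20_000]

# Three sets within the last minute
IO.inspect(IntensitySketch.tripped?(set_times, 20_000, 3, 60_000))
# prints true

# Fifty seconds later, only one set is still inside the window
IO.inspect(IntensitySketch.tripped?(set_times, 70_000, 3, 60_000))
# prints false
```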

logical_and(engine, list)

@spec logical_and(engine(), [Alarmist.alarm_id()]) :: engine()

Set an alarm when all of the input alarms are set

This is useful when remediation only makes sense once several things have gone wrong. For example, if a device has more than one way of accomplishing a task, there could be a specific remediation when one way stops working. However, if every way is broken, the device could trigger a more significant remediation.

Example:

defmodule NewAlarm do
  use Alarmist.Alarm

  alarm_if do
    Alarm1 and Alarm2
  end
end

logical_not(engine, list)

@spec logical_not(engine(), [Alarmist.alarm_id()]) :: engine()

Set an alarm when the input alarm is cleared

This is useful for "proof-of-life" alarms where the presence of an alarm is a good thing.

Example:

defmodule NewAlarm do
  use Alarmist.Alarm

  alarm_if do
    not OriginalAlarm
  end
end

logical_or(engine, list)

@spec logical_or(engine(), [Alarmist.alarm_id()]) :: engine()

Set an alarm when one or more input alarms get set

This is useful for triggering a generic remediation. An example of this is setting an alarm that indicates that the device is "unhealthy" and needs to reboot. There are usually many disastrous alarms that, when raised, have no good remediation other than a reboot. This allows a handler to register for a combined alarm so that it's decoupled from the alarms that trigger it.

Example:

defmodule NewAlarm do
  use Alarmist.Alarm

  alarm_if do
    Alarm1 or Alarm2
  end
end

on_time(engine, list)

@spec on_time(engine(), [Alarmist.alarm_id(), ...]) :: engine()

Sets an alarm when the input has been set for too long in a given period

This records an alarm's status over a period of time and accumulates the total duration that the alarm has been set. If that duration exceeds on_time, then the output alarm is set.

This is useful in situations where you may want to use debounce/2, but where the input is flaky enough that it could bounce around and not trigger the alarm. Using intensity/3 might help in this situation, but coming up with a total time for on_time/3 is more intuitive.

Example:

defmodule NewAlarm do
  use Alarmist.Alarm

  alarm_if do
    on_time(Alarm1, 30_000, 60_000)
  end
end
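
The accumulation described above can be sketched as summing the portions of each set interval that overlap the window. This is an illustrative model, not Alarmist's implementation: OnTimeSketch, the {from_ms, to_ms} interval format, and the clipping of intervals to the window are assumptions made for this example.

```elixir
defmodule OnTimeSketch do
  # Illustrative model only; not Alarmist's implementation.
  # `intervals` is a list of {from_ms, to_ms} spans during which the input was set.
  def total_on_ms(intervals, now_ms, period_ms) do
    window_start = now_ms - period_ms

    intervals
    |> Enum.map(fn {from, to} ->
      # Keep only the part of the interval inside the window; never negative
      max(min(to, now_ms) - max(from, window_start), 0)
    end)
    |> Enum.sum()
  end

  def tripped?(intervals, now_ms, on_time_ms, period_ms) do
    # "Exceeds" is read as strictly greater than on_time
    total_on_ms(intervals, now_ms, period_ms) > on_time_ms
  end
end

# Three 12 s outages: none is long enough for a 30 s debounce,
# but together they add up to 36 s in the last minute
intervals = [{0, 12_000}, {20_000, 32_000}, {40_000, 52_000}]

IO.inspect(OnTimeSketch.total_on_ms(intervals, 60_000, 60_000))
# prints 36000

IO.inspect(OnTimeSketch.tripped?(intervals, 60_000, 30_000, 60_000))
# prints true
```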

sustain_window(engine, list)

@spec sustain_window(engine(), [Alarmist.alarm_id(), ...]) :: engine()

Sets an alarm when the input has been set for a minimum duration in a window

This only looks for a single occurrence of the input alarm being set for the on_time duration within a time period. If one exists, then the output is set.

This is useful for "good" alarms where being set is the desired state. The alarm may later be inverted to become a more typical alarm. For this case, the system is viewed as functioning well enough if the input alarm is on for a long enough period of time. For example, this could be a connection to a control server where being connected long enough in a time period is good enough for remotely fixing the device.

Compare this with debounce/2 followed by hold/2 which can implement similar behavior with appropriate parameters. sustain_window/3 conveys intent better and perhaps is easier to understand.

Example:

defmodule RemotelyFixableAlarm do
  use Alarmist.Alarm

  alarm_if do
    sustain_window(ConnectedToServer, 30_000, 60_000)
  end
end
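
The "one occurrence" check can be sketched by looking for any single set interval that is long enough inside the window. This is an illustrative model, not Alarmist's implementation: SustainWindowSketch, the {from_ms, to_ms} interval format, and the clipping of intervals to the window are assumptions made for this example.

```elixir
defmodule SustainWindowSketch do
  # Illustrative model only; not Alarmist's implementation.
  # `intervals` is a list of {from_ms, to_ms} spans during which the input was set.
  def tripped?(intervals, now_ms, on_time_ms, period_ms) do
    window_start = now_ms - period_ms

    # One continuous span of at least on_time inside the window is enough
    Enum.any?(intervals, fn {from, to} ->
      min(to, now_ms) - max(from, window_start) >= on_time_ms
    end)
  end
end

# Connected for one continuous 35 s stretch in the last minute
IO.inspect(SustainWindowSketch.tripped?([{5_000, 40_000}], 60_000, 30_000, 60_000))
# prints true

# Three 12 s connections total 36 s, but no single one lasts 30 s
short = [{0, 12_000}, {20_000, 32_000}, {40_000, 52_000}]
IO.inspect(SustainWindowSketch.tripped?(short, 60_000, 30_000, 60_000))
# prints false
```

The second call shows the difference from a total-duration rule: several short spans do not count, however much time they add up to.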

unknown_as_set(engine, list)

@spec unknown_as_set(engine(), [Alarmist.alarm_id()]) :: engine()

Return an alarm as set if it's unknown

All Alarmist operations except this one treat unknown alarms as cleared. Use this to treat unknown alarms as set. This is useful for detecting initialization failures where the code that should be setting or clearing the alarm doesn't run.

Example:

defmodule NewAlarm do
  use Alarmist.Alarm

  alarm_if do
    unknown_as_set(OriginalAlarm)
  end
end
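
The difference between the default handling of unknown alarms and unknown_as_set can be written as two small truth tables. This sketch is illustrative only: UnknownSketch and the atoms :set, :clear, and :unknown are stand-ins chosen for this example and may not match how Alarmist represents alarm state internally.

```elixir
defmodule UnknownSketch do
  # Illustrative model only; the state atoms are stand-ins for this sketch.

  # Default reading used by the other operations: unknown counts as cleared
  def set?(:set), do: true
  def set?(state) when state in [:clear, :unknown], do: false

  # unknown_as_set: unknown counts as set
  def unknown_as_set?(state) when state in [:set, :unknown], do: true
  def unknown_as_set?(:clear), do: false
end

for state <- [:set, :clear, :unknown] do
  IO.inspect({state, UnknownSketch.set?(state), UnknownSketch.unknown_as_set?(state)})
end

# prints:
# {:set, true, true}
# {:clear, false, false}
# {:unknown, false, true}
```

Only the last row differs, which is what makes the rule useful for spotting code that never got far enough to set or clear its alarm.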