Alarmist.Ops (alarmist v0.2.2)

Derivative alarm generation operations

Summary

Functions

copy(engine, list)

Replicate an alarm status

debounce(engine, list)

Set an alarm when the input has been set for a specified duration

hold(engine, list)

Keep an alarm set for a guaranteed amount of time

intensity(engine, list)

Set an alarm when the input alarm has been set and cleared too many times

logical_and(engine, list)

Set an alarm when all of the input alarms are set

logical_not(engine, list)

Set an alarm when the input alarm is cleared

logical_or(engine, list)

Set an alarm when one or more input alarms get set

Functions

copy(engine, list)

Replicate an alarm status

This is useful for aliasing alarm names. For example, if one library sets and clears an alarm ID in its own namespace, but another library wants to listen for changes to an alarm ID in a different namespace, a copy rule can glue them together.

Example:

defmodule NewAlarm do
  use Alarmist.Definition

  defalarm do
    OriginalAlarm
  end
end

debounce(engine, list)

@spec debounce(Alarmist.Engine.t(), [Alarmist.alarm_id(), ...]) :: Alarmist.Engine.t()

Set an alarm when the input has been set for a specified duration

This rule prevents transient alarms from triggering remediation unnecessarily. This is useful when remediation is expensive or service-impacting and the input alarm is somewhat glitchy.

Alarmist already provides some debouncing, since alarms that are set and cleared within one alarm processing pass are ignored. That behavior is unreliable, though, whereas a debounce rule establishes an explicit duration.

An example of when debouncing is useful is delaying remediation of higher-level alarms, such as being disconnected from a backend server. There are many reasons that a TCP connection could be interrupted, and client code probably already has some retry logic to reestablish the connection. In this case, it might be good to delay switching to an offline mode briefly in the hope that the problem goes away on its own.

Example:

defmodule NewAlarm do
  use Alarmist.Definition

  defalarm do
    debounce(Alarm1, 1_000)
  end
end
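
The timing behavior described above can be sketched as a small timeline model. This is only an illustration in plain Elixir, not Alarmist's implementation: the DebounceSketch module and its output_events/3 function are hypothetical names, and the sketch assumes the derived alarm is set once the input has stayed set for the full duration and is cleared when the input clears.

```elixir
defmodule DebounceSketch do
  @moduledoc """
  Hypothetical timeline model of a debounce rule. Not Alarmist's implementation.
  """

  @doc """
  Takes chronological `{time_ms, :set | :clear}` events for the input alarm and
  returns the events the debounced alarm would emit up to `now`.
  """
  def output_events(events, duration, now) do
    {out, set_since} = Enum.reduce(events, {[], nil}, &step(&1, &2, duration))

    # An input that is still set at `now` fires once it has lasted `duration`.
    out =
      if set_since != nil and now - set_since >= duration do
        [{set_since + duration, :set} | out]
      else
        out
      end

    Enum.reverse(out)
  end

  # Input becomes set: remember when, so the duration can be measured.
  defp step({t, :set}, {out, nil}, _duration), do: {out, t}
  # Input was already set: keep the original start time.
  defp step({_t, :set}, acc, _duration), do: acc
  # Input cleared while it was not set: nothing to do.
  defp step({_t, :clear}, {out, nil}, _duration), do: {out, nil}

  # Input cleared: only a set that lasted `duration` ever reached the output.
  defp step({t, :clear}, {out, since}, duration) do
    if t - since >= duration do
      {[{t, :clear}, {since + duration, :set} | out], nil}
    else
      {out, nil}
    end
  end
end
```

With a 1_000 ms duration, a 200 ms glitch starting at time 0 produces no output, while an input set from 1_000 to 2_500 produces a set at 2_000 and a clear at 2_500.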

hold(engine, list)

Keep an alarm set for a guaranteed amount of time

This keeps an alarm set for at least timeout milliseconds after it is set. Each time the alarm is set, the timer is restarted.

Hold is useful for types of remediation that are time-based. For example, handling an alarm might mean turning a feature off for a while, since turning it back on as soon as the alarm clears would likely just result in the alarm being set again. Managing the timeout period via alarms rather than programmatically lets you manually clear the alarm if you'd like that feature enabled again immediately, for instance while debugging.

Example:

defmodule NewAlarm do
  use Alarmist.Definition

  defalarm do
    hold(Alarm1, 1_000)
  end
end
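
The timer restart described above can be illustrated with a short calculation. This is a sketch in plain Elixir rather than Alarmist's implementation: HoldSketch, hold_until/2, and holding?/3 are hypothetical names, and the sketch models only the guaranteed minimum hold period, not what happens to the alarm afterwards.

```elixir
defmodule HoldSketch do
  @moduledoc """
  Hypothetical model of the hold timer. Not Alarmist's implementation.
  """

  # `set_times` are the times (in ms) at which the input alarm was set.
  # Every set restarts the timer, so only the latest one matters.
  def hold_until([], _timeout), do: nil
  def hold_until(set_times, timeout), do: Enum.max(set_times) + timeout

  # True while the guaranteed hold period is still running at `now`.
  def holding?(set_times, timeout, now) do
    case hold_until(set_times, timeout) do
      nil -> false
      deadline -> now < deadline
    end
  end
end
```

For an input set at 0 ms and again at 400 ms with a 1_000 ms timeout, the hold period ends at 1_400 ms rather than 1_000 ms, because the second set restarted the timer.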

intensity(engine, list)

@spec intensity(Alarmist.Engine.t(), [Alarmist.alarm_id(), ...]) ::
  Alarmist.Engine.t()

Set an alarm when the input alarm has been set and cleared too many times

This type of rule catches flapping alarms where it's desirable to take some kind of remediation when they trigger too many times in a row. Intensity is measured as count set/clears within duration milliseconds. This is the same as supervision restart intensity thresholds.

An example of an intensity-based alarm is handling the case where multiple network connections are available, but one that should be good is flaky. This can happen when a device has both a cellular and a WiFi connection. Normally the WiFi connection is preferred, but if it keeps going up and down, it may be desirable to raise an alarm. That alarm could disable WiFi for a while. Combine this with hold/2 to manage the duration that WiFi is off.

Example:

defmodule NewAlarm do
  use Alarmist.Definition

  defalarm do
    intensity(Alarm1, 3, 60_000)
  end
end
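
The count-within-duration measurement can be sketched as a sliding window over the times at which the input alarm was set. This is an illustration in plain Elixir, not Alarmist's implementation: IntensitySketch and tripped?/4 are hypothetical names, and the sketch assumes the alarm trips once count events fall inside the window. The exact boundary (at count versus more than count) is not stated in the text above.

```elixir
defmodule IntensitySketch do
  @moduledoc """
  Hypothetical sliding-window model of an intensity rule.
  Not Alarmist's implementation.
  """

  # `set_times` are the times (in ms) at which the input alarm was set.
  # Assumption: the rule trips when at least `count` of them fall within
  # the last `duration` milliseconds before `now`.
  def tripped?(set_times, count, duration, now) do
    recent = Enum.count(set_times, fn t -> t <= now and now - t < duration end)
    recent >= count
  end
end
```

With the parameters from the example above (3 and 60_000), sets at 0, 10_000, and 50_000 ms trip the rule when evaluated at 55_000 ms. Evaluated at 65_000 ms, the first set has aged out of the window and only two remain, so the rule no longer trips.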

logical_and(engine, list)

@spec logical_and(Alarmist.Engine.t(), [Alarmist.alarm_id()]) :: Alarmist.Engine.t()

Set an alarm when all of the input alarms are set

This is useful when remediation only makes sense once several things have gone wrong. For example, if a device has more than one way of accomplishing a task, there could be a specific remediation when one way stops working. However, if every way is broken, the device could trigger a more significant remediation.

Example:

defmodule NewAlarm do
  use Alarmist.Definition

  defalarm do
    Alarm1 and Alarm2
  end
end

logical_not(engine, list)

@spec logical_not(Alarmist.Engine.t(), [Alarmist.alarm_id()]) :: Alarmist.Engine.t()

Set an alarm when the input alarm is cleared

This is useful for "proof-of-life" alarms where the presence of an alarm is a good thing.

Example:

defmodule NewAlarm do
  use Alarmist.Definition

  defalarm do
    not OriginalAlarm
  end
end

logical_or(engine, list)

@spec logical_or(Alarmist.Engine.t(), [Alarmist.alarm_id()]) :: Alarmist.Engine.t()

Set an alarm when one or more input alarms get set

This is useful for triggering a generic remediation. An example of this is setting an alarm that indicates that the device is "unhealthy" and needs to reboot. There are usually many disastrous alarms that, when raised, have no good remediation other than a reboot. This allows a handler to register for a combined alarm so that it's decoupled from the alarms that trigger it.

Example:

defmodule NewAlarm do
  use Alarmist.Definition

  defalarm do
    Alarm1 or Alarm2
  end
end
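
The three logical rules in this module can be read as predicates over the set of currently active alarms. The following is a minimal sketch in plain Elixir, not Alarmist's implementation: LogicSketch and its function signatures are hypothetical and take a MapSet of active alarm IDs instead of an engine.

```elixir
defmodule LogicSketch do
  @moduledoc """
  Hypothetical model of the logical rules as predicates over the set of
  active alarm IDs. Not Alarmist's implementation.
  """

  # Set when every input alarm is active.
  def logical_and(active, inputs), do: Enum.all?(inputs, &MapSet.member?(active, &1))

  # Set when at least one input alarm is active.
  def logical_or(active, inputs), do: Enum.any?(inputs, &MapSet.member?(active, &1))

  # Set when the input alarm is not active.
  def logical_not(active, input), do: not MapSet.member?(active, input)
end
```

With only Alarm1 active, the or rule over Alarm1 and Alarm2 is set, the and rule over the same inputs is not, and the not rule applied to Alarm2 is set.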