Writing Reporters
Reporters are a crucial part of the Telemetry.Metrics ecosystem. Without them, metric definitions are merely... definitions. This guide explains how to write a reporter properly.
Before writing a reporter for your favourite monitoring system, make sure that one isn't already available on Hex.pm. It might make more sense to contribute to and improve an existing solution than to start from scratch.
Let's get started!
Specification
- Reporters MUST accept a list of metric definitions as input when being started
- Reporters MUST attach handlers to events contained in these definitions
- Reporters MUST extract the measurement and selected tags specified by the metric definitions
- Reporters SHOULD handle events in a way that makes sense for the system they publish to
- Reporters MUST clean up on exit by detaching all event handlers they have created
- Reporters MUST honor keep recording rule functions
- Reporters MUST skip events with missing or invalid measurements or tags
Accepting Metric Definitions as Input
This one is quite easy - you need to give your users a way to actually tell you what metrics
they want to track. It's essential to give users an option to provide metric definitions
at runtime (e.g. when their application starts). For example, let's say you're building a
PigeonReporter. If the reporter was process-based, you could provide a start_link/1 function
that accepts a list of metric definitions:
# counter/1, last_value/1 and summary/1 come from Telemetry.Metrics
import Telemetry.Metrics

metrics = [
  counter("..."),
  last_value("..."),
  summary("...")
]
PigeonReporter.start_link(metrics: metrics)
If the reporter doesn't support metrics of a particular type, it may either:
- Log a warning and discard the metric
- Log a warning and convert the metric to an equivalent type. For example, a reporter that does not support histograms may convert a histogram into a summary or a simpler metric
We recommend that all reporters include a summary table of which metrics are supported and their equivalents in the adapter's terminology.
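As a rough illustration of the first option, the sketch below discards unsupported metrics and logs a warning for each. The PigeonReporter.Support module, the filter_supported function, and the choice of supported types are hypothetical; only the Telemetry.Metrics.Counter and Telemetry.Metrics.LastValue struct names come from the library.

```elixir
defmodule PigeonReporter.Support do
  require Logger

  # Assumption for illustration: this reporter only understands these two types.
  @supported [Telemetry.Metrics.Counter, Telemetry.Metrics.LastValue]

  # Returns only the supported metrics, logging a warning for each discarded one.
  def filter_supported(metrics, supported \\ @supported) do
    Enum.filter(metrics, fn %module{} = metric ->
      if module in supported do
        true
      else
        Logger.warning("#{inspect(module)} is not supported, discarding #{inspect(metric)}")
        false
      end
    end)
  end
end
```

A reporter could call such a function once at startup, before attaching any handlers, so that unsupported metrics never reach the event handler.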
Reporter-specific options for individual metrics may be passed under the :reporter_options
key of the metric definitions. These options can be used to configure settings such as sample
rates, percentiles, rates, etc. Reporters should validate any options they accept and
provide useful exception messages.
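One possible way to validate such options is sketched below. It assumes a hypothetical :sample_rate option and Elixir 1.13 or later (for Keyword.validate!/2); the PigeonReporter.Options module and its validate!/1 function are made up for this example.

```elixir
defmodule PigeonReporter.Options do
  # Hypothetical option supported by this reporter, with its default value.
  @defaults [sample_rate: 1.0]

  # Returns the validated options or raises ArgumentError with a descriptive message.
  def validate!(metric) do
    # Raises ArgumentError if an unknown key is given; fills in defaults otherwise.
    opts = Keyword.validate!(metric.reporter_options, @defaults)
    rate = opts[:sample_rate]

    unless is_number(rate) and rate > 0 and rate <= 1 do
      raise ArgumentError,
            "expected :sample_rate for metric #{inspect(metric.name)} " <>
              "to be a number in the range (0, 1], got: #{inspect(rate)}"
    end

    opts
  end
end
```

Running the validation when the reporter starts means that a misconfigured metric fails fast, instead of surfacing later inside the event handler.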
Attaching event handlers
Event handlers are attached using the :telemetry.attach/4 function. To reduce the overhead of
installing many event handlers, you can install a single handler for multiple metrics
based on the same event, but note that any exception raised in that handler will detach it,
so all metrics under that handler stop being reported. You can achieve this by grouping the
metrics by event name:
Enum.group_by(metrics, & &1.event_name)
Note that handler IDs need to be unique. You can generate random IDs, or build them from values that are already unique in combination, e.g. the reporter name, the event name, and something that differs between instances of the same reporter (the PID is a good choice, as most reporters should be backed by a process):
id = {PigeonReporter, metric.event_name, self()}
Putting it all together:
for {event, metrics} <- Enum.group_by(metrics, & &1.event_name) do
  id = {__MODULE__, event, self()}
  :telemetry.attach(id, event, &handle_event/4, metrics)
end
Reacting to events
When consuming events, there are five steps to take into account:
- If a keep recording rule function has been provided, the reporter MUST record the metric only if the function returns true;
- Extract the measurement from the event. Measurements are optional, so we must skip reporting a particular measurement if it is not available;
- Extract all the relevant tags from the event metadata (if they are supported by the reporter);
- Implement the logic specific to the reporter;
- Decide how to react to errors. One option is to let the handle_event/4 callback fail, but that means we will no longer listen to any future event. Another option is to rescue any error and log it. That's the approach we will take in this example. However, be careful! If an event always contains bad data, then we will log an error every time it is emitted.
Let's see a simple handler implementation that takes all five of those items into account:
def handle_event(_event_name, measurements, metadata, metrics) do
  for metric <- metrics do
    try do
      if measurement = keep?(metric, metadata) && extract_measurement(metric, measurements) do
        tags = extract_tags(metric, metadata)
        # record and send
      end
    rescue
      e ->
        Logger.error("Could not format metric #{inspect metric}")
        Logger.error(Exception.format(:error, e, __STACKTRACE__))
    end
  end
end
The implementation of keep?/2 might look like:
defp keep?(%{keep: keep}, metadata) when keep != nil, do: keep.(metadata)
defp keep?(_metric, _metadata), do: true
The implementation of extract_measurement/2 might look as follows:
def extract_measurement(metric, measurements) do
  case metric.measurement do
    fun when is_function(fun, 1) -> fun.(measurements)
    key -> measurements[key]
  end
end
Since :measurement in the metric definition can be either an arbitrary term (to be used
as the key to fetch the measurement) or a function, we need to handle both cases.
Note: Telemetry.Metrics can't guarantee that the extracted measurement's value is a number. Each reporter should handle this scenario appropriately, for example by logging a warning or detaching the handler.
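A minimal sketch of the "log a warning" approach is shown below. The PigeonReporter.Measurement module, the fetch_number/2 function, and its {:ok, value} / :skip return values are assumptions made for this example, not part of Telemetry.Metrics.

```elixir
defmodule PigeonReporter.Measurement do
  require Logger

  # Extracts the measurement and checks that it is a number.
  # Returns {:ok, number} or :skip.
  def fetch_number(metric, measurements) do
    value =
      case metric.measurement do
        fun when is_function(fun, 1) -> fun.(measurements)
        key -> measurements[key]
      end

    cond do
      is_number(value) ->
        {:ok, value}

      # A missing measurement is expected from time to time, so skip it silently.
      is_nil(value) ->
        :skip

      # Anything else is likely a bug in the instrumentation, so log it.
      true ->
        Logger.warning(
          "Expected a number for metric #{inspect(metric.name)}, got: #{inspect(value)}"
        )

        :skip
    end
  end
end
```

Returning a tagged tuple rather than a bare value avoids treating a legitimate measurement of false or nil as a number by accident in the caller.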
We also need to implement the extract_tags/2 function:
def extract_tags(metric, metadata) do
  tag_values = metric.tag_values.(metadata)
  Map.take(tag_values, metric.tags)
end
First we apply a last-minute transformation to the metadata using the :tag_values function,
then we take the requested tags from the transformed metadata, ignoring any tag that is not available.
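To make the behaviour concrete, here is a small worked example. It uses plain maps as stand-ins for a metric definition and for event metadata; the PigeonReporter.Tags module name, the :status_class tag, and the sample metadata are invented for illustration.

```elixir
defmodule PigeonReporter.Tags do
  # Same logic as extract_tags/2 above, wrapped in a module so it can be run standalone.
  def extract_tags(metric, metadata) do
    tag_values = metric.tag_values.(metadata)
    Map.take(tag_values, metric.tags)
  end
end

# A stand-in for a metric definition; only the fields used here are present.
metric = %{
  tags: [:method, :status_class, :region],
  # Derive a coarse status class (e.g. "2xx") from the numeric status.
  tag_values: fn metadata ->
    Map.put(metadata, :status_class, "#{div(metadata.status, 100)}xx")
  end
}

metadata = %{method: "GET", status: 200, path: "/"}

tags = PigeonReporter.Tags.extract_tags(metric, metadata)
IO.inspect(tags)
# :region is not in the metadata, so it is left out of the result:
# %{method: "GET", status_class: "2xx"}
```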
Detaching Handlers on Termination
To leave the system in a clean state, the reporter must detach the event handlers it installed
when it is stopped or terminates unexpectedly. This can be done by trapping exits in the
init function and implementing the terminate callback, or by having a dedicated process
responsible only for the cleanup (e.g. by using monitors).
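The first approach could be sketched roughly as follows, assuming the reporter is a GenServer and that the :telemetry application is available. The state shape (a list of handler IDs) and the empty handle_event/4 body are choices made for this example only.

```elixir
defmodule PigeonReporter do
  use GenServer

  def start_link(opts) do
    GenServer.start_link(__MODULE__, Keyword.fetch!(opts, :metrics))
  end

  @impl true
  def init(metrics) do
    # Trap exits so that terminate/2 runs when the supervisor shuts us down.
    Process.flag(:trap_exit, true)

    handler_ids =
      for {event, metrics} <- Enum.group_by(metrics, & &1.event_name) do
        id = {__MODULE__, event, self()}
        # An external capture (with the module name) is used here instead of a
        # local one, which :telemetry flags as potentially slower.
        :ok = :telemetry.attach(id, event, &__MODULE__.handle_event/4, metrics)
        id
      end

    # Keep the handler IDs in the state so they can be detached later.
    {:ok, handler_ids}
  end

  @impl true
  def terminate(_reason, handler_ids) do
    Enum.each(handler_ids, &:telemetry.detach/1)
    :ok
  end

  # Placeholder: a real reporter would record and publish the metrics here.
  def handle_event(_event_name, _measurements, _metadata, _metrics), do: :ok
end
```

Note that terminate/2 is not invoked when the process is killed with the :kill exit reason, which is one reason a reporter may prefer the dedicated, monitor-based cleanup process mentioned above.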
Documentation
It's extremely important that reporters document how Telemetry.Metrics metric types, names,
and tags are translated to metric types and identifiers in the system they publish metrics to
(this is particularly important for a summary metric which is broadly defined). They should also
document if some metric types are not supported at all.
Examples
This repository ships with a Telemetry.Metrics.ConsoleReporter that prints data to the
terminal as an example. Official reporters can be found in the BEAM Telemetry GitHub organization. You may search for other reporters on Hex.pm.