TelemetryMetricsStatsd (telemetry_metrics_statsd v0.7.0)

Telemetry.Metrics reporter for StatsD-compatible metric servers.

To use it, start the reporter with the start_link/1 function, providing it a list of Telemetry.Metrics metric definitions:

import Telemetry.Metrics

TelemetryMetricsStatsd.start_link(
  metrics: [
    counter("http.request.count"),
    sum("http.request.payload_size"),
    last_value("vm.memory.total")
  ]
)

Note that in a real project the reporter should be started under a supervisor, e.g. the main supervisor of your application.

By default the reporter sends metrics to 127.0.0.1:8125 - both hostname and port number can be configured using the :host and :port options.

TelemetryMetricsStatsd.start_link(
  metrics: metrics,
  host: "statsd",
  port: 1234
)

Alternatively, a Unix domain socket path can be provided using the :socket_path option.

TelemetryMetricsStatsd.start_link(
  metrics: metrics,
  socket_path: "/var/run/statsd.sock"
)

If the :socket_path option is provided, :host and :port parameters are ignored and the connection is established exclusively via Unix domain socket.

Note that the reporter doesn't aggregate metrics in-process - it sends metric updates to StatsD whenever a relevant Telemetry event is emitted.

By default, the reporter sends metrics through a single socket. To reduce contention when there are many metrics to be sent, more sockets can be configured to be opened through the pool_size option.

TelemetryMetricsStatsd.start_link(
  metrics: metrics,
  pool_size: 10
)

When the pool_size is greater than 1, a socket is selected at random from the pool each time one is needed.

Translation between Telemetry.Metrics and StatsD

In this section we walk through how the Telemetry.Metrics metric definitions are mapped to StatsD metrics and their types at runtime.

Telemetry.Metrics metric names are translated as follows:

  • if the metric name was provided as a string, e.g. "http.request.count", it is sent to StatsD server as-is
  • if the metric name was provided as a list of atoms, e.g. [:http, :request, :count], it is first converted to a string by joining the segments with dots. In this example, the StatsD metric name would be "http.request.count" as well
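The two naming rules above can be sketched in a few lines of plain Elixir. NameSketch and to_statsd_name/1 are hypothetical names used only for illustration; they are not part of the library's API, and the reporter's own implementation may differ.

```elixir
defmodule NameSketch do
  # Hypothetical helper, not part of the library's public API.
  # Strings pass through unchanged.
  def to_statsd_name(name) when is_binary(name), do: name

  # Lists of atoms are converted to strings and joined with dots.
  def to_statsd_name(segments) when is_list(segments) do
    Enum.map_join(segments, ".", &Atom.to_string/1)
  end
end

NameSketch.to_statsd_name("http.request.count")
NameSketch.to_statsd_name([:http, :request, :count])
```

Both calls produce the same StatsD metric name.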

Since there are multiple implementations of StatsD and each of them provides a slightly different set of features, other aspects of metric translation are controlled by the formatters. The formatter can be selected using the :formatter option. Currently only two formats are supported - :standard and :datadog.

The following table shows how Telemetry.Metrics metrics map to standard StatsD metrics:

Telemetry.Metrics    StatsD
last_value           gauge
counter              counter
sum                  gauge or counter
summary              timer
distribution         timer

DataDog provides a richer set of metric types:

Telemetry.Metrics    DogStatsD
last_value           gauge
counter              counter
sum                  gauge or counter
summary              histogram
distribution         distribution

The standard StatsD formatter

The :standard formatter is compatible with the Etsy implementation of StatsD. Since this particular implementation doesn't support explicit tags, tag values are appended as consecutive segments of the metric name. For example, given the definition

counter("db.query.count", tags: [:table, :operation])

and the event

:telemetry.execute([:db, :query], %{}, %{table: "users", operation: "select"})

the StatsD metric name would be "db.query.count.users.select". Note that the tag values are appended to the base metric name in the order they were declared in the metric definition.

Another important aspect of the standard formatter is that all measurements are converted to integers, i.e. no floats are ever sent to the StatsD daemon.
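As a rough illustration of both points, the sketch below builds a metric name with tag values appended in declaration order and converts a measurement to an integer. StandardNameSketch and its functions are hypothetical. The use of round/1 is an assumption: the text only states that no floats are sent, not how the conversion is performed.

```elixir
defmodule StandardNameSketch do
  # Hypothetical helper illustrating the naming rule of the standard formatter.
  # `tag_keys` is the list given under :tags in the metric definition and
  # `metadata` is the event metadata map. Map.fetch!/2 raises if a tag is missing.
  def name_with_tags(base_name, tag_keys, metadata) do
    tag_values = Enum.map(tag_keys, fn key -> to_string(Map.fetch!(metadata, key)) end)
    Enum.join([base_name | tag_values], ".")
  end

  # Assumption: floats are rounded to the nearest integer.
  def to_integer(value) when is_integer(value), do: value
  def to_integer(value) when is_float(value), do: round(value)
end

StandardNameSketch.name_with_tags(
  "db.query.count",
  [:table, :operation],
  %{operation: "select", table: "users"}
)
```

The order of keys in the metadata map does not matter here; only the order of `tag_keys` decides the order of the appended segments.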

Now to the metric types!

Counter

A Telemetry.Metrics counter is simply represented as a StatsD counter. Each event the metric is based on increments the counter by 1. To be more concrete, given the metric definition

counter("http.request.count")

and the event

:telemetry.execute([:http, :request], %{duration: 120})

the following line would be sent to StatsD

"http.request.count:1|c"

Note that the counter was bumped by 1, regardless of the measurements included in the event (a careful reader will notice that the :count measurement we chose for the metric wasn't present in the map of measurements at all!). Such behaviour conforms to the specification of a counter as defined by the Telemetry.Metrics package - a counter should be incremented by 1 every time a given event is dispatched.

Last value

The last value metric is represented as a StatsD gauge, whose value is always set to the value of the measurement from the most recent event. With the following metric definition

last_value("vm.memory.total")

and the event

:telemetry.execute([:vm, :memory], %{total: 1024})

the following metric update would be sent to StatsD

"vm.memory.total:1024|g"

Sum

The sum metric is also represented as a gauge - the difference is that it always changes relatively and is never set to an absolute value. Given the metric definition below

sum("http.request.payload_size")

and the event

:telemetry.execute([:http, :request], %{payload_size: 1076})

the following line would be sent to StatsD

"http.request.payload_size:+1076|g"

When the measurement is negative, the StatsD gauge is decreased accordingly.
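The three mappings described so far (counter, last value and sum as a gauge) can be summarized with a small sketch. StandardLineSketch and the type atoms it accepts are hypothetical and exist only for this illustration. Sending zero with a "+" sign is an assumption that the text does not confirm.

```elixir
defmodule StandardLineSketch do
  # Hypothetical formatter; the type atoms are names chosen for this sketch.

  # A counter ignores the measurement and always increments by 1.
  def line(name, :counter, _value), do: "#{name}:1|c"

  # A last value sets the gauge to an absolute value.
  def line(name, :last_value, value), do: "#{name}:#{value}|g"

  # A sum changes the gauge relatively: "+n" increases it, "-n" decreases it.
  # Assumption: zero is sent with a "+" sign.
  def line(name, :sum, value) when value >= 0, do: "#{name}:+#{value}|g"
  def line(name, :sum, value), do: "#{name}:#{value}|g"
end

StandardLineSketch.line("http.request.count", :counter, 120)
StandardLineSketch.line("vm.memory.total", :last_value, 1024)
StandardLineSketch.line("http.request.payload_size", :sum, -76)
```

The explicit sign is what distinguishes a relative gauge change from an absolute one in the StatsD line format.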

When the report_as: :counter reporter option is passed, the sum metric is reported as a counter and increased by the value provided. Only positive values are allowed; negative measurements are discarded and logged.

Given the metric definition

sum("kafka.consume.batch_size", reporter_options: [report_as: :counter])

and the event

:telemetry.execute([:kafka, :consume], %{batch_size: 200})

the following would be sent to StatsD

"kafka.consume.batch_size:200|c"

Summary

The summary is simply represented as a StatsD timer, since it should generate statistics about gathered measurements. Given the metric definition below

summary("http.request.duration")

and the event

:telemetry.execute([:http, :request], %{duration: 120})

the following line would be sent to StatsD

"http.request.duration:120|ms"

Distribution

There is no metric in the original StatsD implementation equivalent to the Telemetry.Metrics distribution. However, histograms can be enabled for selected timer metrics in the StatsD daemon configuration. Because of that, the distribution is also reported as a timer. For example, given the following metric definition

distribution("http.request.duration")

and the event

:telemetry.execute([:http, :request], %{duration: 120})

the following line would be sent to StatsD

"http.request.duration:120|ms"

The DataDog formatter

The DataDog formatter is compatible with DogStatsD, the DataDog StatsD service bundled with its agent.

Tags

The main difference from the standard formatter is that DataDog supports explicit tagging in its protocol. Using the same example as with the standard formatter, given the following definition

counter("db.query.count", tags: [:table, :operation])

and the event

:telemetry.execute([:db, :query], %{}, %{table: "users", operation: "select"})

the metric update packet sent to StatsD would be db.query.count:1|c|#table:users,operation:select.
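The tag section of such a packet can be sketched as follows. DogTagsSketch.with_tags/3 is a hypothetical helper, not the library's API; it appends the tags to an already formatted "name:value|type" line, keeping the order in which the tags were declared.

```elixir
defmodule DogTagsSketch do
  # Hypothetical helper: appends DogStatsD tags to a formatted metric line.
  # Without tags the line is returned unchanged.
  def with_tags(line, [], _metadata), do: line

  # Tags are rendered as "key:value" pairs separated by commas, after "|#".
  # Map.fetch!/2 raises if the metadata lacks one of the declared tags.
  def with_tags(line, tag_keys, metadata) do
    tags = Enum.map_join(tag_keys, ",", fn key -> "#{key}:#{Map.fetch!(metadata, key)}" end)
    line <> "|#" <> tags
  end
end

DogTagsSketch.with_tags(
  "db.query.count:1|c",
  [:table, :operation],
  %{table: "users", operation: "select"}
)
```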

Metric types

There is no difference in how the counter and last value metrics are handled between the standard and DataDog formatters.

The sum metric is reported as a DataDog counter, which DataDog transforms into a rate metric: https://docs.datadoghq.com/developers/metrics/dogstatsd_metrics_submission/#count. To observe the actual sum of measurements, make sure to use the as_count() modifier in your DataDog dashboard. The report_as: :counter option does not have any effect with the DataDog formatter.

The summary metric is reported as a DataDog histogram, as that is the metric that provides a set of statistics about gathered measurements on the DataDog side.

The distribution is flushed as a DataDog distribution metric, which provides statistically correct aggregations of data gathered from multiple services or DogStatsD agents.

Also note that DataDog allows measurements to be floats, which is why no rounding is performed when formatting the metric.
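The DataDog mappings described in this section can be sketched in the same way. DogLineSketch and its type atoms are hypothetical. The "h" and "d" type suffixes are the DogStatsD protocol markers for histograms and distributions, and float measurements are passed through without rounding.

```elixir
defmodule DogLineSketch do
  # Hypothetical formatter for the DataDog mappings; not the library's API.
  def line(name, :counter, _value), do: "#{name}:1|c"
  def line(name, :last_value, value), do: "#{name}:#{value}|g"

  # The sum is sent as a counter carrying the measured value.
  def line(name, :sum, value), do: "#{name}:#{value}|c"

  # Summary maps to a histogram, distribution to a distribution.
  def line(name, :summary, value), do: "#{name}:#{value}|h"
  def line(name, :distribution, value), do: "#{name}:#{value}|d"
end

DogLineSketch.line("http.request.duration", :summary, 120.5)
DogLineSketch.line("http.request.duration", :distribution, 120.5)
```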

Global tags

The library provides an option to specify a set of global tag values, which are available to all metrics running under the reporter.

For example, if you're running your application in multiple deployment environments (staging, production, etc.), you might set the environment as a global tag:

TelemetryMetricsStatsd.start_link(
  metrics: [
    counter("http.request.count", tags: [:env])
  ],
  global_tags: [env: "prod"]
)

Note that if the global tag is to be sent with the metric, the metric needs to have it listed under the :tags option, just like any other tag.

Also, if the same key is configured as a global tag and emitted as a part of event metadata or returned by the :tag_values function, the metadata/:tag_values take precedence and override the global tag value.
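The two rules above (a global tag must be listed under :tags, and event metadata overrides it) can be sketched as a merge followed by a filter. GlobalTagsSketch.resolve/3 is a hypothetical helper and only illustrates the described precedence, not the reporter's actual code.

```elixir
defmodule GlobalTagsSketch do
  # Hypothetical helper: computes the tag values sent for one metric.
  # `global_tags` is the keyword list given to the reporter, `metadata` is the
  # event metadata (or the result of :tag_values), and `tag_keys` is the list
  # declared under :tags in the metric definition.
  def resolve(global_tags, metadata, tag_keys) do
    global_tags
    |> Map.new()
    # Values from the metadata win over global tags with the same key.
    |> Map.merge(metadata)
    # Only tags declared on the metric are kept.
    |> Map.take(tag_keys)
  end
end

GlobalTagsSketch.resolve([env: "prod"], %{env: "staging"}, [:env])
```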

Prefixing metric names

Sometimes it's convenient to prefix all metric names with a particular value, to group them by the name of the service, the host, or something else. You can use the :prefix option to provide a prefix which will be prepended to all metrics published by the reporter (regardless of the formatter used).

Maximum datagram size

Metrics are sent to StatsD over UDP, so it's important that the size of the datagram does not exceed the Maximum Transmission Unit, or MTU, of the link, so that no data is lost on the way. By default the reporter will break up the datagrams at 512 bytes, but this is configurable via the :mtu option.
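To make the idea of breaking up datagrams concrete, here is a minimal sketch that packs metric lines into newline-separated datagrams no larger than a given size. DatagramSketch.pack/2 is hypothetical, and the way the reporter itself splits datagrams is an assumption here; the sketch only relies on the StatsD convention that several metrics in one packet are separated by newlines.

```elixir
defmodule DatagramSketch do
  # Hypothetical packer: groups metric lines into datagrams whose byte size
  # does not exceed `mtu`. A single line longer than `mtu` still gets a
  # datagram of its own in this sketch.
  def pack(lines, mtu \\ 512) do
    lines
    |> Enum.reduce([], fn
      line, [] ->
        [line]

      line, [current | rest] ->
        # +1 accounts for the newline separator between two lines.
        if byte_size(current) + 1 + byte_size(line) <= mtu do
          [current <> "\n" <> line | rest]
        else
          [line, current | rest]
        end
    end)
    |> Enum.reverse()
  end
end

DatagramSketch.pack(["a:1|c", "b:2|g", "c:3|ms"], 12)
```

With a limit of 12 bytes, the first two 5-byte lines fit into one datagram (11 bytes including the newline) and the third line starts a new one.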

Sampling data

It's not always convenient to capture every piece of data, such as in the case of high-traffic applications. In those cases, you may want to capture a "sample" of the data. You can do this by passing [sampling_rate: <rate>] as an option to :reporter_options, where rate is a value between 0.0 and 1.0. The default :sampling_rate is 1.0, which means that all the measurements are being captured.

Example

TelemetryMetricsStatsd.start_link(
  metrics: [
    counter("http.request.count"),
    summary("http.request.duration", reporter_options: [sampling_rate: 0.1]),
    distribution("http.request.duration", reporter_options: [sampling_rate: 0.1])
  ]
)

In this example, we are capturing 100% of the measurements for the counter, but only 10% for both summary and distribution.
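The effect of a sampling rate can be illustrated with a small sketch. SamplingSketch.sample?/2 is hypothetical, and drawing a uniform random number per measurement is an assumption about how sampling could work, not a description of the reporter's internals. The random number is a parameter so that the decision can be checked deterministically.

```elixir
defmodule SamplingSketch do
  # Hypothetical helper: decides whether a measurement is captured.
  # `rate` is the sampling rate between 0.0 and 1.0.
  # `random` defaults to :rand.uniform/0, which returns a float in [0.0, 1.0).
  def sample?(rate, random \\ :rand.uniform()) when rate >= 0.0 and rate <= 1.0 do
    random < rate
  end
end

# With a rate of 0.1, roughly one in ten measurements is kept.
kept = Enum.count(1..10_000, fn _ -> SamplingSketch.sample?(0.1) end)
IO.puts("kept #{kept} of 10000 measurements")
```

A rate of 1.0 always captures the measurement and a rate of 0.0 never does.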

Summary

Functions

child_spec/1 - Reporter's child spec.

start_link/1 - Starts a reporter and links it to the calling process.

Types

Specs

host() :: String.t() | :inet.ip_address()

Specs

option() ::
  {:port, :inet.port_number()}
  | {:host, host()}
  | {:socket_path, Path.t()}
  | {:metrics, [Telemetry.Metrics.t()]}
  | {:mtu, non_neg_integer()}
  | {:prefix, prefix()}
  | {:formatter, :standard | :datadog}
  | {:global_tags, Keyword.t()}
  | {:host_resolution_interval, non_neg_integer()}

Specs

options() :: [option()]

Specs

prefix() :: String.t() | atom() | nil

Functions

Specs

child_spec(options()) :: Supervisor.child_spec()

Reporter's child spec.

This function allows you to start the reporter under a supervisor like this:

children = [
  {TelemetryMetricsStatsd, options}
]

See start_link/1 for a list of available options.

Specs

start_link(options()) :: GenServer.on_start()

Starts a reporter and links it to the calling process.

The available options are:

  • :metrics (list of term/0) - Required. A list of Telemetry.Metrics metric definitions that will be published by the reporter.

  • :host - Hostname or IP address of the StatsD server. If it's a hostname, the reporter will resolve it on start and send metrics to the resolved IP address. See :host_resolution_interval option to enable periodic hostname lookup. The default value is {127, 0, 0, 1}.

  • :port (non_neg_integer/0) - Port of the StatsD server. The default value is 8125.

  • :inet_address_family - The inet address family, as specified by the Erlang :inet.address_family() type. The default value is :inet.

  • :socket_path - Path to the Unix Domain Socket used for publishing instead of the hostname and port.

  • :formatter - Determines the format of the published metrics. Can be either :standard or :datadog. The default value is :standard.

  • :global_tags (keyword/0) - Additional tags published with every metric. Global tags are overridden by the tags specified in the metric definition. The default value is [].

  • :prefix - A prefix added to the name of each metric published by the reporter.

  • :pool_size (non_neg_integer/0) - A number of UDP sockets used for publishing metrics. The default value is 10.

  • :host_resolution_interval (non_neg_integer/0) - When set, the reporter resolves the configured hostname on the specified interval (in milliseconds) instead of looking up the name once on start. If the provided hostname resolves to multiple IP addresses, the first one on the list is used.

  • :mtu (non_neg_integer/0) - Maximum Transmission Unit of the link between your application and the StatsD server in bytes. If this value is greater than the actual MTU of the link, UDP packets with published metrics will be dropped. The default value is 512.

You can read more about all the options in the TelemetryMetricsStatsd module documentation.

Example

import Telemetry.Metrics

TelemetryMetricsStatsd.start_link(
  metrics: [
    counter("http.request.count"),
    sum("http.request.payload_size"),
    last_value("vm.memory.total")
  ],
  prefix: "my-service"
)