Parapet SLO Authoring Guide

Parapet is built around a simple conviction: an SLO should track whether users can do the things they came to your app to do, not whether the servers are breathing. A CPU gauge that stays under 80% tells you nothing about whether login is working. A journey SLO that burns at 2% tells you exactly what is wrong and who is affected.

This guide walks through how to decide what deserves a slice, how to use the built-in Parapet.SLO.StarterPack.WebSaaS slices as anchors for your own decisions, and how to handle the situations where low traffic or low volume makes naive alerting unreliable.

For the full provider and slice catalog - including built-in provider modules for Mailglass, Chimeway, Rindle, and the WebSaaS pack - see Parapet SLO Reference.

How to decide what to slice

The decision is not about what you can measure. It is about what failing would cost a user.

Use this tree to decide whether a potential signal warrants its own journey SLO:

Does this failure directly prevent a user task?
- Yes -> this is a candidate for a journey SLO. Continue down.
  - Is the failure observable through a metric Parapet already emits (or that your integration emits)?
    - Yes -> define a slice against that metric.
    - No -> wire the metric first (or use a synthetic probe - see the low-traffic section below), then define the slice.
  - Is the failure synchronous (request-time) or async (job, callback, provider-mediated)?
    - Synchronous -> use an HTTP availability or login-journey style ratio slice.
    - Async -> use a job-success or delivery-confirmation style slice.
- No (infrastructure-only signal, does not directly prevent a user task) -> this is not a journey SLO. Consider a system-health dashboard instead.

Litmus: "Does this failure directly prevent a user task?" is the one question you should always answer first.
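The tree above can be sketched as a small pure function. This is an illustration only, not part of Parapet: the module name, the map keys, and the returned atoms are all hypothetical labels for the branches of the tree.

```elixir
defmodule SliceDecision do
  # Hypothetical input keys:
  #   :blocks_user_task - does the failure directly prevent a user task?
  #   :metric_available - is there already a metric that observes the failure?
  #   :mode             - :sync (request-time) or :async (job, callback, provider-mediated)

  # Infrastructure-only signals are not journey SLOs.
  def recommend(%{blocks_user_task: false}), do: {:dashboard, :system_health}

  def recommend(%{blocks_user_task: true, metric_available: available?, mode: mode}) do
    style =
      case mode do
        :sync -> :ratio_slice
        :async -> :job_success_or_delivery_slice
      end

    first_step = if available?, do: :define_slice, else: :wire_metric_or_probe_first
    {:journey_slo, style, first_step}
  end
end

# A failed login: blocks the user, already measured, request-time.
IO.inspect(SliceDecision.recommend(%{blocks_user_task: true, metric_available: true, mode: :sync}))
```

The first clause matches before anything else is considered, which mirrors the rule that the user-task question is always answered first.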

Good examples from Parapet.SLO.StarterPack.WebSaaS:

- web_saas_login_journey - a failed login directly blocks the user from entering your app. Auth failures are low-volume, high-impact, and exactly what a journey SLO is for.
- web_saas_http_availability - request-level availability is the baseline user expectation. A user who cannot load a page is directly blocked.
- web_saas_oban_job_success - Oban job failures directly affect users when the job gates a user-visible outcome (order confirmation, email delivery, image processing, billing). Wire a job-success slice for each critical async path.

Bad example:

A CPU utilization or memory gauge SLO. CPU at 95% does not directly prevent a user task. You might be processing batch work, running GC, or handling a spike with headroom to spare. Alerting on raw infrastructure metrics produces noise without actionable user-impact framing.

Real anchor: The three web_saas_* slice names in Parapet.SLO.StarterPack.WebSaaS are the reference implementation. Each is pinned to a real Prometheus series, has a documented default objective in human terms, and is overridable. Read the source or Parapet SLO Reference to understand the defaults before changing them.

Writing a custom slice

When the built-in packs do not cover your journey, you define a custom provider module that returns Parapet.SLO.SliceSpec structs. The SliceSpec struct drives all generator output - you never write raw PromQL.

The minimum fields are name, integration, kind, alert_class, runbook, a good metric with its matchers, and a total metric with its matchers. Set objective as a percentage (e.g., 99.5) and the Generator derives the error-rate threshold for you. Once the provider module exists, register it alongside any starter packs you use:

config :parapet,
  providers: [
    Parapet.SLO.StarterPack.WebSaaS,
    MyApp.SLO.CheckoutJourney
  ]

Then run mix parapet.gen.prometheus to write the recording rules and alert expressions.
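As a rough sketch of what a custom provider such as MyApp.SLO.CheckoutJourney might contain, the example below uses a stand-in struct so that it compiles on its own. The stand-in mirrors the fields named in this section, but the exact key names for the good and total metrics and their matchers (good_metric, good_matchers, total_metric, total_matchers) are assumptions, as are every value, the value types, and the runbook path. Check Parapet.SLO.SliceSpec for the real field names before copying anything.

```elixir
defmodule Sketch.SliceSpec do
  # Stand-in for Parapet.SLO.SliceSpec. Metric and matcher key names are assumed.
  defstruct [
    :name,
    :integration,
    :kind,
    :alert_class,
    :runbook,
    :good_metric,
    :good_matchers,
    :total_metric,
    :total_matchers,
    objective: 99.5,
    min_total_rate: 0.01
  ]
end

defmodule MyApp.SLO.CheckoutJourney do
  # A real provider would declare `@behaviour Parapet.SLO.Provider`
  # and return Parapet.SLO.SliceSpec structs from slos/0.
  alias Sketch.SliceSpec

  def slos do
    [
      %SliceSpec{
        # All values below are hypothetical.
        name: "checkout_journey",
        integration: :my_app,
        kind: :ratio,
        alert_class: :page,
        runbook: "runbooks/checkout.md",
        good_metric: "my_app_checkout_total",
        good_matchers: [status: "ok"],
        total_metric: "my_app_checkout_total",
        total_matchers: [],
        objective: 99.5
      }
    ]
  end
end
```

The slice leaves min_total_rate unset, so it picks up the struct default rather than hard-coding a guard value.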

Provider-as-bundle pattern

A Parapet.SLO.Provider that returns slices from multiple sub-providers is the bundle abstraction. No separate macro or base module is required: the slos/0 callback returns a flat list, and list concatenation (++) is the composition primitive.

The canonical example is Parapet.SLO.StarterPack.DeliverySaaS, which composes three providers into one registration: the three WebSaaS slices plus conditionally guarded Mailglass and Chimeway delivery slices. Its slos/0 calls WebSaaS.slos() ++ delivery_slices(Mailglass, Chimeway), where each delivery slice set is included only when the corresponding host library is loaded. A bundle of your own follows the same shape:

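defmodule MyApp.SLO.FullStack do
  @behaviour Parapet.SLO.Provider

  @impl true
  def slos do
    # Built-in slices, then optional delivery slices, then your own.
    Parapet.SLO.StarterPack.WebSaaS.slos() ++
      (if Code.ensure_loaded?(Mailglass), do: Parapet.SLO.MailglassDelivery.slos(), else: []) ++
      my_custom_slices()
  end

  # Placeholder: return your own Parapet.SLO.SliceSpec structs here.
  defp my_custom_slices, do: []
end

# In your application config (for example config/config.exs):
config :parapet, providers: [MyApp.SLO.FullStack]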

Conditional registration: Use Code.ensure_loaded?/1 to guard slices for optional host libraries. The bundle module itself is always loadable (passes mix verify.public_api) regardless of whether the guarded library is present. This is the pattern used by Parapet.SLO.StarterPack.DeliverySaaS - see its moduledoc for the reference implementation.
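If a bundle guards several optional libraries, the check can be pulled into one helper. The sketch below is a hypothetical convenience, not something Parapet provides: MyApp.SLO.Optional and slos_if_loaded/2 are invented names. It relies only on Code.ensure_loaded?/1 and function_exported?/3 from Elixir itself.

```elixir
defmodule MyApp.SLO.Optional do
  # Returns provider.slos() only when the host library and the provider
  # module both load and the provider exports slos/0. Otherwise it
  # contributes no slices, so the bundle stays loadable either way.
  def slos_if_loaded(host_lib, provider) do
    if Code.ensure_loaded?(host_lib) and Code.ensure_loaded?(provider) and
         function_exported?(provider, :slos, 0) do
      provider.slos()
    else
      []
    end
  end
end
```

Inside a bundle's slos/0 this would read as WebSaaS.slos() ++ MyApp.SLO.Optional.slos_if_loaded(Mailglass, Parapet.SLO.MailglassDelivery), keeping the concatenation flat.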

For the full built-in provider catalog and starter packs, see Parapet SLO Reference.

Low-traffic and low-volume services

Low-traffic services introduce a specific failure mode: the SLO burns when there is not enough data to know. A single failed login attempt out of five total produces a 20% error rate - which would fire a page alert - even though five requests is not a meaningful signal. The naive solution is to lower the objective to stop the noise. That is the wrong move.

The denominator guard the generator renders

Every alert expression the Generator produces includes a denominator guard. For a slice named web_saas_login_journey with a :page alert class (14.4x multiplier, 5m window) and a 99.9% objective:

parapet:web_saas_login_journey:error_ratio:5m > 0.0144 and parapet:web_saas_login_journey:total_rate:5m > 0.01

The guard shape is:

parapet:<slice_name>:error_ratio:<window> > <threshold> and parapet:<slice_name>:total_rate:<window> > <min_total_rate>

The second condition - total_rate > min_total_rate - is the denominator guard. The alert fires only when there is enough traffic to make the error ratio meaningful. Without that guard, a single failure in a quiet window would trigger a page.

The 0.0144 threshold comes from the objective: 99.9% -> 0.001 error budget x 14.4 multiplier = 0.0144.
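The arithmetic and the guard shape can be reproduced in a few lines. This is a sketch for checking your own numbers, not the Generator's implementation: AlertSketch and its function names are hypothetical, and rounding to six decimals is a choice made here to hide floating-point noise.

```elixir
defmodule AlertSketch do
  # Error-ratio threshold = error budget x burn-rate multiplier.
  # objective_pct is a percentage such as 99.9.
  def threshold(objective_pct, multiplier) do
    budget = (100 - objective_pct) / 100
    Float.round(budget * multiplier, 6)
  end

  # Renders the guard shape for one slice and window:
  # error ratio above threshold AND total rate above min_total_rate.
  def expr(slice, window, objective_pct, multiplier, min_total_rate \\ 0.01) do
    "parapet:#{slice}:error_ratio:#{window} > #{threshold(objective_pct, multiplier)}" <>
      " and parapet:#{slice}:total_rate:#{window} > #{min_total_rate}"
  end
end

IO.puts(AlertSketch.expr("web_saas_login_journey", "5m", 99.9, 14.4))
```

Changing the objective to 99.5 with the same multiplier gives a threshold of 0.072, which shows how quickly a looser objective raises the bar for paging.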

The min_total_rate default and the six windows

The default min_total_rate is 0.01 - defined in Parapet.SLO.SliceSpec as the struct default and applied to every slice unless you override it. You can override it per-slice by passing min_total_rate: <value> when constructing a SliceSpec.

The Generator emits alert expressions for one window per alert class. The full set of recording rule windows is ["5m", "30m", "1h", "2h", "6h", "3d"]. The alert window and multiplier by class are:

- :page - 5m window, 14.4x multiplier
- :ticket - 30m window, 6.0x multiplier
- :warning - 6h window, 1.0x multiplier

Because recording rules are generated for all six windows, not only the three used for alerting, you have history for retrospectives and trend analysis at every granularity.
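To see what the class table implies for a given objective, the sketch below computes the alert window and threshold per class and lists recording rule names per window. It is illustrative only: ClassTable is a hypothetical module, and the rule names assume that the two series seen in the alert expression (error_ratio and total_rate) are the ones recorded for each window.

```elixir
defmodule ClassTable do
  # Alert window and burn-rate multiplier per alert class, as listed above.
  @classes %{
    page: {"5m", 14.4},
    ticket: {"30m", 6.0},
    warning: {"6h", 1.0}
  }

  @windows ["5m", "30m", "1h", "2h", "6h", "3d"]

  # Returns {window, error_ratio_threshold} for a class and objective percentage.
  def alert_threshold(class, objective_pct) do
    {window, multiplier} = Map.fetch!(@classes, class)
    {window, Float.round((100 - objective_pct) / 100 * multiplier, 6)}
  end

  # Rule names for every window, following the parapet:<slice>:<series>:<window> shape.
  def recording_rules(slice) do
    for window <- @windows, series <- ["error_ratio", "total_rate"] do
      "parapet:#{slice}:#{series}:#{window}"
    end
  end
end

for class <- [:page, :ticket, :warning] do
  IO.inspect({class, ClassTable.alert_threshold(class, 99.9)})
end
```

At a 99.9% objective the :warning threshold equals the raw error budget, since its multiplier is 1.0.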

The extended-window approach

The 6h and 3d windows the Generator already emits are naturally more tolerant of low-traffic variance - a service that handles 10 requests per day accumulates a larger denominator over six hours, and a larger one still over three days, than it ever will in five minutes, so a single failure moves the ratio far less. If you are seeing false-positive :warning alerts on a low-volume slice, the first question is not "should I lower the objective?" It is "is the denominator guard firing correctly, and am I looking at the right window?"
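A quick way to answer the guard question is to convert a slice's daily volume into a rate and into an expected request count per window. The sketch below assumes total_rate is a per-second rate, as Prometheus rate() would produce. This guide does not state the unit, so treat that as an assumption and verify it against the generated rules. TrafficCheck is a hypothetical helper.

```elixir
defmodule TrafficCheck do
  # Length of each recording window in seconds.
  @seconds %{
    "5m" => 300,
    "30m" => 1_800,
    "1h" => 3_600,
    "2h" => 7_200,
    "6h" => 21_600,
    "3d" => 259_200
  }

  # Average per-second rate for a given daily request volume.
  def rate_per_second(requests_per_day), do: requests_per_day / 86_400

  # Expected number of requests that land inside one window.
  def expected_in_window(requests_per_day, window) do
    rate_per_second(requests_per_day) * Map.fetch!(@seconds, window)
  end

  # True when average traffic would clear the denominator guard.
  def clears_guard?(requests_per_day, min_total_rate \\ 0.01) do
    rate_per_second(requests_per_day) > min_total_rate
  end
end

IO.inspect(TrafficCheck.expected_in_window(10, "6h"))
IO.inspect(TrafficCheck.clears_guard?(10))
```

Under the per-second assumption, 10 requests per day averages about 0.00012 requests per second, which is below the 0.01 default, so the guard would keep alerts silent for that slice unless min_total_rate is overridden. That is the kind of mismatch worth ruling out before touching the objective.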

Synthetic probes

When traffic is genuinely too low to produce a reliable signal even at the 6h window - for example, an internal-only workflow that runs once a week - the right tool is a synthetic probe.

Parapet.Metrics.Probe ships with Parapet as the fallback for this case. It emits two metrics:

- parapet.probe.run.total - a counter tagged with probe and status
- parapet.probe.run.duration.ms - a distribution for latency tracking

A synthetic probe continuously exercises the journey at a known rate, giving the SLO a stable denominator even on services with negligible organic traffic. The probe outcome then feeds into a slice the same way real traffic does - you define the slice against parapet.probe.run.total and the denominator guard works as intended.
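The shape of a probe run can be sketched without touching Parapet's API, which this guide does not show. In the example below, ProbeSketch, run/3, and the emit callback are hypothetical: the callback stands in for whatever call hands the outcome to Parapet.Metrics.Probe, and the result map simply carries the probe and status tags plus a duration in milliseconds, matching the two metrics described above.

```elixir
defmodule ProbeSketch do
  # Runs `check` (a zero-arity function expected to return :ok on success),
  # times it, and passes the outcome to `emit`. Any other return value or a
  # raised exception counts as a failed run.
  def run(probe, check, emit) do
    started = System.monotonic_time(:millisecond)

    status =
      try do
        case check.() do
          :ok -> :ok
          _other -> :error
        end
      rescue
        _exception -> :error
      end

    duration_ms = System.monotonic_time(:millisecond) - started
    emit.(%{probe: probe, status: status, duration_ms: duration_ms})
  end
end

# The check and the emit function are placeholders for a real journey and a real emitter.
ProbeSketch.run("login", fn -> :ok end, &IO.inspect/1)
```

Counting failed runs in the same counter as successful ones is what gives the slice a stable total to divide by.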

What not to do

These are the failure modes that produce noise instead of signal.

- Lower the objective to silence noise. This is the wrong move. Dropping a login-journey SLO from 99.9% to 90% because it was firing on low traffic means you will not page when 10% of your users cannot log in. The denominator guard, extended windows, and synthetic probes exist precisely so you do not have to choose between accuracy and quiet alerts.
- Alert on infrastructure metrics as if they were journey SLOs. CPU, memory, and disk are system-health signals. They are useful for capacity planning. They are not journey SLOs, and wiring them as SLOs produces alerts that are both noisy and unactionable.
- Emit a new journey SLO without wiring a denominator guard. The Generator handles this for you via the min_total_rate field on SliceSpec - but if you bypass the Generator and write raw PromQL, you need to add the guard yourself.
- Assume "no data" means "green." If a slice has no traffic - for example, the web_saas_login_journey slice before you wire the Sigra integration or another login-count emitter - the denominator guard prevents the alert from firing. That is correct behavior. But silence is not a health signal. Use mix parapet.doctor and check that the expected metrics are present before treating a quiet slice as a passing one.
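The difference between a quiet slice and a healthy one is easier to keep in mind as distinct states. The sketch below is a hypothetical classifier, not how Parapet or mix parapet.doctor evaluates slices: SliceStatus and its return atoms are invented for illustration, and a nil total rate stands for a series that is absent altogether.

```elixir
defmodule SliceStatus do
  # total_rate is nil when the series is absent, e.g. before the emitter is wired.
  def evaluate(_error_ratio, nil, _threshold, _min_total_rate), do: :no_data

  def evaluate(error_ratio, total_rate, threshold, min_total_rate) do
    cond do
      # Guard not cleared: the alert stays silent, but nothing has been proven healthy.
      total_rate <= min_total_rate -> :insufficient_traffic
      error_ratio > threshold -> :burning
      true -> :ok
    end
  end
end

IO.inspect(SliceStatus.evaluate(nil, nil, 0.0144, 0.01))
IO.inspect(SliceStatus.evaluate(0.2, 0.005, 0.0144, 0.01))
```

Only the last branch means the slice is passing. The first two both look like silence in an alert feed.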
