Hephaestus emits :telemetry events at every significant point in workflow and step execution. These events let you observe, measure, and react to workflow behavior without modifying your business logic.

Events are always emitted (the overhead is ~1-2 microseconds with no handler attached). Handlers are opt-in --- you attach only what you need.

Quick Start

1. Upgrade your dependency

Hephaestus includes :telemetry automatically. No extra dependencies are needed.

def deps do
  [
    {:hephaestus, "~> 0.2.0"}
  ]
end

2. Attach the built-in log handler

In your application.ex start callback, add:

def start(_type, _args) do
  Hephaestus.Telemetry.LogHandler.attach()

  children = [
    MyApp.Hephaestus
  ]

  Supervisor.start_link(children, strategy: :one_for_one)
end

3. See structured log output

Run a workflow and you will see output like:

[info] [abc-123] Workflow started
[info] [abc-123] Step MyApp.Steps.Validate started
[info] [abc-123] Step MyApp.Steps.Validate completed in 2ms -> :ok
[debug] [abc-123] Transition: MyApp.Steps.Validate -> MyApp.Steps.ProcessPayment
[info] [abc-123] Workflow completed in 15ms

Every log line includes the instance_id as a prefix and structured metadata fields (workflow, step, duration_ms) for machine parsing.
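If you use the console Logger backend, you can surface those structured fields in plain log output by whitelisting them in your Logger configuration. A minimal sketch (on Elixir 1.15+ the equivalent options live under :default_formatter rather than :console):

```elixir
# config/config.exs
import Config

# Print the structured fields that the Hephaestus log handler
# attaches to each message alongside the message text.
config :logger, :console,
  format: "$time [$level] $metadata$message\n",
  metadata: [:instance_id, :workflow, :step, :duration_ms]
```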

Events Reference

Hephaestus emits 11 events organized into two spans and five standalone events. All instance-scoped events also include workflow_version in their metadata, plus any caller-supplied telemetry_metadata fields passed to start_instance/3.

Workflow Span

Event Name | Measurements | Metadata
[:hephaestus, :workflow, :start] | system_time | instance_id, workflow, workflow_version, initial_step, context_keys, runner
[:hephaestus, :workflow, :stop] | duration, step_count, advance_count | instance_id, workflow, workflow_version, status (:completed), completed_steps, runner
[:hephaestus, :workflow, :exception] | duration | instance_id, workflow, workflow_version, status (:failed), failed_step, reason, kind, stacktrace, runner

Step Span

Event Name | Measurements | Metadata
[:hephaestus, :step, :start] | system_time | instance_id, workflow, workflow_version, step, step_key, concurrent, active_steps_count
[:hephaestus, :step, :stop] | duration | instance_id, workflow, workflow_version, step, step_key, event, has_context_updates, has_metadata_updates, transitions_to
[:hephaestus, :step, :exception] | duration | instance_id, workflow, workflow_version, step, step_key, kind, reason, stacktrace

Standalone Events

Event Name | Measurements | Metadata
[:hephaestus, :step, :async] | duration | instance_id, workflow, workflow_version, step, step_key, instance_status (:waiting)
[:hephaestus, :step, :resume] | system_time, wait_duration | instance_id, workflow, workflow_version, step, step_key, resume_event, source
[:hephaestus, :workflow, :transition] | system_time, targets_count | instance_id, workflow, workflow_version, from_step, event, targets, fan_out
[:hephaestus, :engine, :advance] | duration, active_steps_count, completed_in_advance | instance_id, workflow, workflow_version, status_before, status_after, iteration
[:hephaestus, :runner, :init] | system_time | name, runner, storage, pid

All duration values are in Erlang native time units. Convert with System.convert_time_unit(duration, :native, :millisecond).

By Use Case

I want to track failures

Attach a handler to [:hephaestus, :workflow, :exception] and [:hephaestus, :step, :exception]. The metadata includes kind, reason, and stacktrace for crash diagnostics, plus failed_step on the workflow event to identify which step caused the failure.

:telemetry.attach(
  "my-failure-handler",
  [:hephaestus, :workflow, :exception],
  fn _event, _measurements, metadata, _config ->
    Logger.error("Workflow #{inspect(metadata.workflow)} failed at #{inspect(metadata.failed_step)}: #{inspect(metadata.reason)}")
  end,
  nil
)

I want to measure latency

Use [:hephaestus, :workflow, :stop] for end-to-end workflow duration and [:hephaestus, :step, :stop] for individual step duration. Both include a duration measurement in native time units.

:telemetry.attach(
  "my-latency-handler",
  [:hephaestus, :step, :stop],
  fn _event, measurements, metadata, _config ->
    ms = System.convert_time_unit(measurements.duration, :native, :millisecond)
    Logger.info("Step #{inspect(metadata.step)} took #{ms}ms")
  end,
  nil
)

I want to monitor fan-out

The [:hephaestus, :workflow, :transition] event includes fan_out: true when a step triggers multiple concurrent targets. The targets_count measurement tells you the parallelism degree, and targets in metadata lists the activated step modules.
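A handler that flags wide fan-outs might look like the following sketch (the handler id is arbitrary, and Logger is assumed to be required in the enclosing module):

```elixir
:telemetry.attach(
  "fan-out-monitor",
  [:hephaestus, :workflow, :transition],
  fn _event, measurements, metadata, _config ->
    # fan_out is true only when a step activates multiple concurrent targets.
    if metadata.fan_out do
      Logger.warning(
        "Fan-out of #{measurements.targets_count} from " <>
          "#{inspect(metadata.from_step)}: #{inspect(metadata.targets)}"
      )
    end
  end,
  nil
)
```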

I want to debug async steps

Two events cover the async lifecycle:

  • [:hephaestus, :step, :async] --- emitted when a step returns {:async}, with duration measuring how long the step ran before going async.
  • [:hephaestus, :step, :resume] --- emitted when the step receives an external event or timeout, with wait_duration measuring how long the workflow spent waiting and source indicating whether the resume was :external or :timeout.
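Both events can share one handler via :telemetry.attach_many/4; a sketch that logs each side of the async lifecycle:

```elixir
:telemetry.attach_many(
  "async-step-debug",
  [
    [:hephaestus, :step, :async],
    [:hephaestus, :step, :resume]
  ],
  fn
    # How long the step ran before going async.
    [:hephaestus, :step, :async], measurements, metadata, _config ->
      ms = System.convert_time_unit(measurements.duration, :native, :millisecond)
      Logger.debug("#{inspect(metadata.step)} went async after #{ms}ms")

    # How long the workflow waited, and whether the resume was :external or :timeout.
    [:hephaestus, :step, :resume], measurements, metadata, _config ->
      ms = System.convert_time_unit(measurements.wait_duration, :native, :millisecond)
      Logger.debug("#{inspect(metadata.step)} resumed via #{metadata.source} after #{ms}ms")
  end,
  nil
)
```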

Log Handler

Hephaestus.Telemetry.LogHandler is a built-in handler that produces structured Logger output for all Hephaestus events.

Attaching

# Attach with defaults (all events, standard log levels)
Hephaestus.Telemetry.LogHandler.attach()

# Attach with custom options
Hephaestus.Telemetry.LogHandler.attach(
  level: %{[:hephaestus, :step, :stop] => :debug},
  events: [
    [:hephaestus, :workflow, :start],
    [:hephaestus, :workflow, :stop],
    [:hephaestus, :workflow, :exception]
  ]
)

Detaching

Hephaestus.Telemetry.LogHandler.detach()

Default Log Levels

Event | Level
workflow:start | :info
workflow:stop | :info
workflow:exception | :error
workflow:transition | :debug
step:start | :info
step:stop | :info
step:exception | :error
step:async | :warning
step:resume | :info
engine:advance | :debug
runner:init | :info

Override any level by passing a :level map keyed by event name.

Options

  • :level --- a map of event name (list of atoms) to log level atom. Overrides the defaults above.
  • :events --- a list of event names to subscribe to. Defaults to all events from Hephaestus.Telemetry.events/0.

Metrics

Hephaestus.Telemetry.Metrics returns pre-built Telemetry.Metrics structs that plug directly into any reporter (Prometheus, StatsD, LiveDashboard).

# In your application supervisor
metrics = Hephaestus.Telemetry.Metrics.metrics()

children = [
  {TelemetryMetricsPrometheus, metrics: metrics}
]

Scope Filtering

Filter metrics by scope to include only what you need:

Hephaestus.Telemetry.Metrics.metrics(scope: :workflow)  # workflow metrics only
Hephaestus.Telemetry.Metrics.metrics(scope: :step)      # step metrics only
Hephaestus.Telemetry.Metrics.metrics()                   # all metrics

Metric Definitions

Metric | Type | Tags
hephaestus.workflow.start.count | counter | workflow
hephaestus.workflow.stop.count | counter | workflow
hephaestus.workflow.exception.count | counter | workflow, failed_step
hephaestus.workflow.stop.duration | distribution | workflow
hephaestus.step.stop.duration | distribution | workflow, step
hephaestus.step.exception.count | counter | workflow, step, kind
hephaestus.step.async.count | counter | workflow, step
hephaestus.step.resume.wait_duration | distribution | workflow, step, source
hephaestus.engine.advance.active_steps_count | last_value | workflow

Custom Handlers

Use standard :telemetry.attach/4 or :telemetry.attach_many/4 to build your own handlers. Here is an example that sends Slack notifications on workflow failures:

defmodule MyApp.Telemetry.SlackNotifier do
  def attach do
    :telemetry.attach(
      "slack-workflow-failure",
      [:hephaestus, :workflow, :exception],
      &handle_event/4,
      %{webhook_url: System.fetch_env!("SLACK_WEBHOOK_URL")}
    )
  end

  def handle_event(_event, measurements, metadata, config) do
    duration_ms =
      case measurements[:duration] do
        nil -> "unknown"
        d -> "#{System.convert_time_unit(d, :native, :millisecond)}ms"
      end

    message = """
    Workflow failed: #{inspect(metadata.workflow)}
    Step: #{inspect(metadata.failed_step)}
    Reason: #{inspect(metadata.reason)}
    Duration: #{duration_ms}
    Instance: #{metadata.instance_id}
    """

    HTTPClient.post(config.webhook_url, %{text: message})
  end
end

And an example for Sentry error reporting:

:telemetry.attach(
  "sentry-step-exception",
  [:hephaestus, :step, :exception],
  fn _event, _measurements, metadata, _config ->
    if metadata[:stacktrace] do
      Sentry.capture_exception(metadata.reason,
        stacktrace: metadata.stacktrace,
        extra: %{
          workflow: inspect(metadata.workflow),
          step: inspect(metadata.step),
          instance_id: metadata.instance_id
        }
      )
    end
  end,
  nil
)

Caller Metadata

When starting a workflow, you can pass a telemetry_metadata map that gets merged into every telemetry event from that instance. This is useful for correlating workflow events with application-level context such as request IDs, user IDs, or trace context.

MyApp.Hephaestus.start_instance(
  MyApp.Workflows.OrderFlow,
  %{items: ["widget"]},
  telemetry_metadata: %{request_id: "req-abc-123", user_id: 42}
)

Every instance-scoped telemetry event from this instance will include request_id: "req-abc-123" and user_id: 42 in its metadata (the [:hephaestus, :runner, :init] event is not instance-scoped). This works across process boundaries because the data lives on the Instance struct, not in the process dictionary.
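Because caller-supplied fields are merged into the same metadata map, a handler reads them like any built-in field. A sketch assuming the request_id key from the example above:

```elixir
:telemetry.attach(
  "request-correlation",
  [:hephaestus, :workflow, :stop],
  fn _event, measurements, metadata, _config ->
    ms = System.convert_time_unit(measurements.duration, :native, :millisecond)
    # metadata[:request_id] is nil when the caller passed no telemetry_metadata.
    Logger.info("Workflow finished in #{ms}ms", request_id: metadata[:request_id])
  end,
  nil
)
```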

Constraints

  • Serializable values only: No PIDs, references, or functions. Values must survive JSON serialization for future Oban runner support.
  • Lower precedence: Caller-supplied fields are merged with lower precedence. You cannot override Hephaestus-defined fields like instance_id, workflow, or step.
  • No key validation: Avoid using keys that conflict with Hephaestus metadata field names.

API Stability

Event names and metadata structure are part of the public API. Changes follow semantic versioning:

  • Patch versions (e.g. 0.2.x): Only add new measurement or metadata fields. Never remove or rename existing fields.
  • Minor versions (0.x.0): May deprecate fields with a warning for one minor version before removal.
  • Event names: Event names like [:hephaestus, :workflow, :start] are never changed. New events may be added.

Known Limitations

Duration is invalid after VM restart

duration measurements use System.monotonic_time/0, which is relative to the current VM instance. If a workflow spans a VM restart (deploy, crash recovery), the stored start time becomes meaningless. In this case, duration is emitted as nil.

Handlers and metric reporters should handle nil durations gracefully. Long-running workflows that span deploys are better served by the Oban runner, which measures per-job durations independently.
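A handler that tolerates the nil case might be sketched as follows (the handler id is arbitrary):

```elixir
:telemetry.attach(
  "safe-duration-handler",
  [:hephaestus, :workflow, :stop],
  fn _event, measurements, metadata, _config ->
    # duration is emitted as nil when the workflow spanned a VM restart.
    case measurements[:duration] do
      nil ->
        Logger.info("Workflow #{inspect(metadata.workflow)} completed (duration unknown)")

      d ->
        ms = System.convert_time_unit(d, :native, :millisecond)
        Logger.info("Workflow #{inspect(metadata.workflow)} completed in #{ms}ms")
    end
  end,
  nil
)
```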

Step events run in the task process

Step telemetry events (step:start, step:stop, step:exception, step:async) are emitted inside the Task.Supervisor task process, not the Runner GenServer. Handlers that inspect self() or rely on the process dictionary will see the task context, not the GenServer context.

This is safe because the :telemetry library wraps each handler invocation in try/catch --- a handler that raises is detached (with a logged warning) and the crash never propagates to the step execution. Standard handlers (Logger, Prometheus, StatsD) do not depend on self(). If your handler needs GenServer context, use instance_id from metadata for correlation.

No Engine-level telemetry

The Engine is purely functional and emits no telemetry. Events are emitted from the Runtime layer (runners). Direct Engine usage in IEx or tests produces no telemetry output. This is intentional --- telemetry is a runtime concern.