Hephaestus emits :telemetry events at every significant point in workflow and step execution. These events let you observe, measure, and react to workflow behavior without modifying your business logic.
Events are always emitted (the overhead is ~1-2 microseconds with no handler attached). Handlers are opt-in --- you attach only what you need.
Quick Start
1. Upgrade your dependency
Hephaestus includes :telemetry automatically. No extra dependencies are needed.
```elixir
def deps do
  [
    {:hephaestus, "~> 0.2.0"}
  ]
end
```
2. Attach the built-in log handler
In your application.ex start callback, add:
```elixir
def start(_type, _args) do
  Hephaestus.Telemetry.LogHandler.attach()

  children = [
    MyApp.Hephaestus
  ]

  Supervisor.start_link(children, strategy: :one_for_one)
end
```
3. See structured log output
Run a workflow and you will see output like:
```
[info] [abc-123] Workflow started
[info] [abc-123] Step MyApp.Steps.Validate started
[info] [abc-123] Step MyApp.Steps.Validate completed in 2ms -> :ok
[debug] [abc-123] Transition: MyApp.Steps.Validate -> MyApp.Steps.ProcessPayment
[info] [abc-123] Workflow completed in 15ms
```
Every log line includes the `instance_id` as a prefix and structured metadata fields (`workflow`, `step`, `duration_ms`) for machine parsing.
Events Reference
Hephaestus emits 11 events, organized into two spans (three events each: start, stop, exception) and five standalone events.
All instance-scoped events also include workflow_version in their metadata, plus
any caller-supplied telemetry_metadata fields passed to start_instance/3.
Workflow Span
| Event Name | Measurements | Metadata |
|---|---|---|
| `[:hephaestus, :workflow, :start]` | system_time | instance_id, workflow, workflow_version, initial_step, context_keys, runner |
| `[:hephaestus, :workflow, :stop]` | duration, step_count, advance_count | instance_id, workflow, workflow_version, status (:completed), completed_steps, runner |
| `[:hephaestus, :workflow, :exception]` | duration | instance_id, workflow, workflow_version, status (:failed), failed_step, reason, kind, stacktrace, runner |
Step Span
| Event Name | Measurements | Metadata |
|---|---|---|
| `[:hephaestus, :step, :start]` | system_time | instance_id, workflow, workflow_version, step, step_key, concurrent, active_steps_count |
| `[:hephaestus, :step, :stop]` | duration | instance_id, workflow, workflow_version, step, step_key, event, has_context_updates, has_metadata_updates, transitions_to |
| `[:hephaestus, :step, :exception]` | duration | instance_id, workflow, workflow_version, step, step_key, kind, reason, stacktrace |
Standalone Events
| Event Name | Measurements | Metadata |
|---|---|---|
| `[:hephaestus, :step, :async]` | duration | instance_id, workflow, workflow_version, step, step_key, instance_status (:waiting) |
| `[:hephaestus, :step, :resume]` | system_time, wait_duration | instance_id, workflow, workflow_version, step, step_key, resume_event, source |
| `[:hephaestus, :workflow, :transition]` | system_time, targets_count | instance_id, workflow, workflow_version, from_step, event, targets, fan_out |
| `[:hephaestus, :engine, :advance]` | duration, active_steps_count, completed_in_advance | instance_id, workflow, workflow_version, status_before, status_after, iteration |
| `[:hephaestus, :runner, :init]` | system_time | name, runner, storage, pid |
All `duration` values are in Erlang native time units. Convert with `System.convert_time_unit(duration, :native, :millisecond)`.
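For instance, a handler receiving a native-unit duration can convert it before logging or exporting. The round-trip below fabricates a native value from a known millisecond count purely for illustration, standing in for a real `measurements.duration`:

```elixir
# Fabricate a native-unit duration from a known millisecond value,
# standing in for measurements.duration from a real event.
native_duration = System.convert_time_unit(15, :millisecond, :native)

# Convert to milliseconds before logging or exporting.
ms = System.convert_time_unit(native_duration, :native, :millisecond)
# ms == 15
```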
By Use Case
I want to track failures
Attach a handler to [:hephaestus, :workflow, :exception] and [:hephaestus, :step, :exception]. The metadata includes kind, reason, and stacktrace for crash diagnostics, plus failed_step on the workflow event to identify which step caused the failure.
```elixir
:telemetry.attach(
  "my-failure-handler",
  [:hephaestus, :workflow, :exception],
  fn _event, _measurements, metadata, _config ->
    Logger.error("Workflow #{inspect(metadata.workflow)} failed at #{inspect(metadata.failed_step)}: #{inspect(metadata.reason)}")
  end,
  nil
)
```
I want to measure latency
Use [:hephaestus, :workflow, :stop] for end-to-end workflow duration and [:hephaestus, :step, :stop] for individual step duration. Both include a duration measurement in native time units.
```elixir
:telemetry.attach(
  "my-latency-handler",
  [:hephaestus, :step, :stop],
  fn _event, measurements, metadata, _config ->
    ms = System.convert_time_unit(measurements.duration, :native, :millisecond)
    Logger.info("Step #{inspect(metadata.step)} took #{ms}ms")
  end,
  nil
)
```
I want to monitor fan-out
The [:hephaestus, :workflow, :transition] event includes fan_out: true when a step triggers multiple concurrent targets. The targets_count measurement tells you the parallelism degree, and targets in metadata lists the activated step modules.
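A sketch of a handler that warns on unusually wide fan-outs. The module name and the threshold of 10 are hypothetical choices for illustration; the event shape matches the tables above:

```elixir
defmodule MyApp.Telemetry.FanOutMonitor do
  require Logger

  # Hypothetical threshold: warn when one transition activates
  # more than this many concurrent target steps.
  @max_targets 10

  def attach do
    :telemetry.attach(
      "fan-out-monitor",
      [:hephaestus, :workflow, :transition],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event(_event, measurements, metadata, _config) do
    # Only act on fan-out transitions that exceed the threshold.
    if metadata[:fan_out] && measurements.targets_count > @max_targets do
      Logger.warning(
        "High fan-out from #{inspect(metadata.from_step)}: " <>
          "#{measurements.targets_count} targets"
      )
    end
  end
end
```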
I want to debug async steps
Two events cover the async lifecycle:
- `[:hephaestus, :step, :async]` --- emitted when a step returns `{:async}`, with `duration` measuring how long the step ran before going async.
- `[:hephaestus, :step, :resume]` --- emitted when the step receives an external event or timeout, with `wait_duration` measuring how long the workflow spent waiting and `source` indicating whether the resume was `:external` or `:timeout`.
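Both events can be handled with a single `:telemetry.attach_many/4` call. This is a sketch; the module name is hypothetical, and the metadata fields match the standalone-events table above:

```elixir
defmodule MyApp.Telemetry.AsyncDebugger do
  require Logger

  def attach do
    :telemetry.attach_many(
      "async-debugger",
      [
        [:hephaestus, :step, :async],
        [:hephaestus, :step, :resume]
      ],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  # The step suspended itself and the workflow is now waiting.
  def handle_event([:hephaestus, :step, :async], _measurements, metadata, _config) do
    Logger.info("[#{metadata.instance_id}] #{inspect(metadata.step)} went async; waiting")
  end

  # The step resumed; report how long the workflow waited and why it resumed.
  def handle_event([:hephaestus, :step, :resume], measurements, metadata, _config) do
    wait_ms = System.convert_time_unit(measurements.wait_duration, :native, :millisecond)

    Logger.info(
      "[#{metadata.instance_id}] #{inspect(metadata.step)} resumed " <>
        "after #{wait_ms}ms (source: #{metadata.source})"
    )
  end
end
```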
Log Handler
Hephaestus.Telemetry.LogHandler is a built-in handler that produces structured Logger output for all Hephaestus events.
Attaching
```elixir
# Attach with defaults (all events, standard log levels)
Hephaestus.Telemetry.LogHandler.attach()

# Attach with custom options
Hephaestus.Telemetry.LogHandler.attach(
  level: %{[:hephaestus, :step, :stop] => :debug},
  events: [
    [:hephaestus, :workflow, :start],
    [:hephaestus, :workflow, :stop],
    [:hephaestus, :workflow, :exception]
  ]
)
```
Detaching
```elixir
Hephaestus.Telemetry.LogHandler.detach()
```
Default Log Levels
| Event | Level |
|---|---|
| workflow:start | :info |
| workflow:stop | :info |
| workflow:exception | :error |
| workflow:transition | :debug |
| step:start | :info |
| step:stop | :info |
| step:exception | :error |
| step:async | :warning |
| step:resume | :info |
| engine:advance | :debug |
| runner:init | :info |
Override any level by passing a :level map keyed by event name.
Options
- `:level` --- a map of event name (list of atoms) to log level atom. Overrides the defaults above.
- `:events` --- a list of event names to subscribe to. Defaults to all events from `Hephaestus.Telemetry.events/0`.
Metrics
Hephaestus.Telemetry.Metrics returns pre-built Telemetry.Metrics structs that plug directly into any reporter (Prometheus, StatsD, LiveDashboard).
```elixir
# In your application supervisor
metrics = Hephaestus.Telemetry.Metrics.metrics()

children = [
  {TelemetryMetricsPrometheus, metrics: metrics}
]
```
Scope Filtering
Filter metrics by scope to include only what you need:
```elixir
Hephaestus.Telemetry.Metrics.metrics(scope: :workflow)  # workflow metrics only
Hephaestus.Telemetry.Metrics.metrics(scope: :step)      # step metrics only
Hephaestus.Telemetry.Metrics.metrics()                  # all metrics
```
Metric Definitions
| Metric | Type | Tags |
|---|---|---|
| hephaestus.workflow.start.count | counter | workflow |
| hephaestus.workflow.stop.count | counter | workflow |
| hephaestus.workflow.exception.count | counter | workflow, failed_step |
| hephaestus.workflow.stop.duration | distribution | workflow |
| hephaestus.step.stop.duration | distribution | workflow, step |
| hephaestus.step.exception.count | counter | workflow, step, kind |
| hephaestus.step.async.count | counter | workflow, step |
| hephaestus.step.resume.wait_duration | distribution | workflow, step, source |
| hephaestus.engine.advance.active_steps_count | last_value | workflow |
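The pre-built list can also be combined with your own `Telemetry.Metrics` definitions. A sketch, assuming the standard `Telemetry.Metrics` API (`distribution/2` with the `:unit` conversion and `:tags` options); the event and measurement names match the tables above:

```elixir
import Telemetry.Metrics

# A custom variant of the wait-time distribution, converted to
# milliseconds at report time via the :unit option.
custom = [
  distribution("hephaestus.step.resume.wait_duration",
    unit: {:native, :millisecond},
    tags: [:workflow, :step, :source]
  )
]

# Mix built-in workflow metrics with the custom definition.
metrics = Hephaestus.Telemetry.Metrics.metrics(scope: :workflow) ++ custom
```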
Custom Handlers
Use standard :telemetry.attach/4 or :telemetry.attach_many/4 to build your own handlers. Here is an example that sends Slack notifications on workflow failures:
```elixir
defmodule MyApp.Telemetry.SlackNotifier do
  def attach do
    :telemetry.attach(
      "slack-workflow-failure",
      [:hephaestus, :workflow, :exception],
      &handle_event/4,
      %{webhook_url: System.fetch_env!("SLACK_WEBHOOK_URL")}
    )
  end

  def handle_event(_event, measurements, metadata, config) do
    duration_ms =
      case measurements[:duration] do
        nil -> "unknown"
        d -> "#{System.convert_time_unit(d, :native, :millisecond)}ms"
      end

    message = """
    Workflow failed: #{inspect(metadata.workflow)}
    Step: #{inspect(metadata.failed_step)}
    Reason: #{inspect(metadata.reason)}
    Duration: #{duration_ms}
    Instance: #{metadata.instance_id}
    """

    HTTPClient.post(config.webhook_url, %{text: message})
  end
end
```
And an example for Sentry error reporting:
```elixir
:telemetry.attach(
  "sentry-step-exception",
  [:hephaestus, :step, :exception],
  fn _event, _measurements, metadata, _config ->
    if metadata[:stacktrace] do
      Sentry.capture_exception(metadata.reason,
        stacktrace: metadata.stacktrace,
        extra: %{
          workflow: inspect(metadata.workflow),
          step: inspect(metadata.step),
          instance_id: metadata.instance_id
        }
      )
    end
  end,
  nil
)
```
Caller Metadata
When starting a workflow, you can pass a telemetry_metadata map that gets merged into every telemetry event from that instance. This is useful for correlating workflow events with application-level context such as request IDs, user IDs, or trace context.
```elixir
MyApp.Hephaestus.start_instance(
  MyApp.Workflows.OrderFlow,
  %{items: ["widget"]},
  telemetry_metadata: %{request_id: "req-abc-123", user_id: 42}
)
```
All 11 telemetry events from this instance will include `request_id: "req-abc-123"` and `user_id: 42` in their metadata. This works across process boundaries because the data lives on the Instance struct, not in the process dictionary.
Constraints
- Serializable values only: No PIDs, references, or functions. Values must survive JSON serialization for future Oban runner support.
- Lower precedence: Caller-supplied fields are merged with lower precedence. You cannot override Hephaestus-defined fields like `instance_id`, `workflow`, or `step`.
- No key validation: Avoid using keys that conflict with Hephaestus metadata field names.
API Stability
Event names and metadata structure are part of the public API. Changes follow semantic versioning:
- Patch versions (0.1.x): Only add new measurement or metadata fields. Never remove or rename existing fields.
- Minor versions (0.x.0): May deprecate fields with a warning for one minor version before removal.
- Event names: Event names like `[:hephaestus, :workflow, :start]` are never changed. New events may be added.
Known Limitations
Duration is invalid after VM restart
`duration` measurements use `System.monotonic_time/0`, which is relative to the current VM instance. If a workflow spans a VM restart (deploy, crash recovery), the stored start time becomes meaningless. In this case, `duration` is emitted as nil.
Handlers and metric reporters should handle nil durations gracefully. Long-running workflows that span deploys are better served by the Oban runner, which measures per-job durations independently.
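A sketch of a nil-safe handler, with a hypothetical module name; it pattern-matches on the `duration` measurement so the nil case never reaches the unit conversion:

```elixir
defmodule MyApp.Telemetry.SafeDurations do
  require Logger

  def attach do
    :telemetry.attach(
      "safe-durations",
      [:hephaestus, :workflow, :stop],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  # Workflow spanned a VM restart: the duration is meaningless.
  # Count the completion, but do not report a timing.
  def handle_event(_event, %{duration: nil}, metadata, _config) do
    Logger.info("#{inspect(metadata.workflow)} completed (duration unknown: VM restart)")
  end

  def handle_event(_event, %{duration: d}, metadata, _config) do
    ms = System.convert_time_unit(d, :native, :millisecond)
    Logger.info("#{inspect(metadata.workflow)} completed in #{ms}ms")
  end
end
```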
Step events run in the task process
Step telemetry events (step:start, step:stop, step:exception, step:async) are emitted inside the Task.Supervisor task process, not the Runner GenServer. Handlers that inspect self() or rely on the process dictionary will see the task context, not the GenServer context.
This is safe because the :telemetry library wraps handlers in try/catch --- a handler crash never propagates to the step execution. Standard handlers (Logger, Prometheus, StatsD) do not depend on self(). If your handler needs GenServer context, use instance_id from metadata for correlation.
No Engine-level telemetry
The Engine is purely functional and emits no telemetry. Events are emitted from the Runtime layer (runners). Direct Engine usage in IEx or tests produces no telemetry output. This is intentional --- telemetry is a runtime concern.