Fly.io with Tigris Storage

Build an incident response system on Fly.io where Claude generates structured runbooks and a Native skill safely executes them.

Time: 35 minutes

Prerequisites:

Complete Native Elixir Skills tutorial
Fly.io CLI installed
Fly.io account (free tier works)

What You'll Build

An incident response service demonstrating a two-phase skill pipeline:

Claude (Anthropic) Skill: Analyzes incident descriptions and generates structured runbook artifacts
Native Skill: Validates runbooks against a schema, enforces action allow-lists, and executes safe dry-runs

┌─────────────────────────────────────────────────────────────────────────┐
│                            Fly.io Machine                               │
│                                                                         │
│  ┌────────────┐    ┌──────────────────┐    ┌─────────────────────────┐  │
│  │   User     │───▶│  Claude Skill    │───▶│  Runbook Artifact       │  │
│  │  Request   │    │  (Anthropic API) │    │  (JSON in Tigris)       │  │
│  └────────────┘    └──────────────────┘    └───────────┬─────────────┘  │
│                                                        │                │
│                                                        ▼                │
│                                            ┌─────────────────────────┐  │
│                                            │  Native Executor Skill  │  │
│                                            │  - Schema validation    │  │
│                                            │  - Action allow-list    │  │
│                                            │  - Safe dry-run         │  │
│                                            └───────────┬─────────────┘  │
│                                                        │                │
│                                                        ▼                │
│                                            ┌─────────────────────────┐  │
│                                            │  Execution Results      │  │
│                                            │  (auditable output)     │  │
│                                            └─────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────┘

Why this pattern?

Responsibility	Claude Skill	Native Skill
Reasoning	✓ Analyzes problems
Planning	✓ Generates runbooks
Validation		✓ Schema enforcement
Safety		✓ Action allow-listing
Execution		✓ Deterministic dry-run

Claude reasons and plans. Native code executes safely. The runbook artifact is the contract between them.

Step 1: Create the Fly.io Application

Generate a new Phoenix application:

mix phx.new incident_response --database postgres
cd incident_response

Add Conjure and dependencies to mix.exs:

defp deps do
  [
    # ... existing deps ...
    {:conjure, "~> 0.1.0"},
    {:req, "~> 0.5"}
  ]
end

Get dependencies:

mix deps.get

Step 2: Set Up Fly.io Infrastructure

Launch the Fly app with Postgres:

fly launch --name my-incident-response

Create a Tigris storage bucket for runbook artifacts:

fly storage create incident-runbooks

This automatically sets the following secrets on your app, which it reads as environment variables:

BUCKET_NAME - Your Tigris bucket name
AWS_ACCESS_KEY_ID - Tigris access key
AWS_SECRET_ACCESS_KEY - Tigris secret key
AWS_ENDPOINT_URL_S3 - Tigris endpoint
AWS_REGION - Region setting for S3-compatible clients
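
As a minimal sketch of how the application might read these values at boot, the module below collects them into a map and reports any that are missing. The module name `TigrisConfigSketch` and the keys of the returned map are illustrative assumptions, not part of Conjure or the Fly.io tooling.

```elixir
defmodule TigrisConfigSketch do
  # Hypothetical helper: gathers the Tigris settings listed above.
  # The keys of the returned map are assumptions made for illustration.
  @required ~w(BUCKET_NAME AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_ENDPOINT_URL_S3)

  # `env` defaults to the process environment; pass a plain map in tests.
  def load(env \\ System.get_env()) do
    case Enum.reject(@required, &Map.has_key?(env, &1)) do
      [] ->
        {:ok,
         %{
           bucket: env["BUCKET_NAME"],
           access_key_id: env["AWS_ACCESS_KEY_ID"],
           secret_access_key: env["AWS_SECRET_ACCESS_KEY"],
           endpoint: env["AWS_ENDPOINT_URL_S3"]
         }}

      missing ->
        {:error, {:missing_env, missing}}
    end
  end
end
```

Returning the list of missing names, rather than raising on the first one, makes a misconfigured deployment easier to diagnose from a single log line.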

Set your Anthropic API key:

fly secrets set ANTHROPIC_API_KEY=sk-ant-...

Step 3: Define the Runbook Schema

Create lib/incident_response/runbooks/schema.ex:

defmodule IncidentResponse.Runbooks.Schema do
  @moduledoc """
  Defines the runbook artifact contract between Claude and the executor.

  This schema is the boundary between LLM reasoning and safe execution.
  """

  @type check :: %{
    id: String.t(),
    action: String.t(),
    params: map()
  }

  @type recommended_action :: %{
    id: String.t(),
    action: String.t(),
    safe: boolean(),
    params: map()
  }

  @type runbook :: %{
    incident_type: String.t(),
    affected_system: String.t(),
    confidence: float(),
    hypotheses: [String.t()],
    checks: [check()],
    recommended_actions: [recommended_action()],
    rollback_plan: String.t()
  }

  @required_keys ~w(incident_type affected_system confidence checks recommended_actions)a

  @doc """
  Validates a runbook artifact against the schema.

  Returns `{:ok, runbook}` if valid, `{:error, reasons}` otherwise.
  """
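  def validate(runbook) when is_map(runbook) do
    # Jason.decode/1 returns string keys, so convert the known top-level keys
    # to atoms before running the atom-keyed checks below.
    runbook = atomize_keys(runbook)

    with :ok <- validate_required_keys(runbook),
         :ok <- validate_confidence(runbook),
         :ok <- validate_checks(runbook),
         :ok <- validate_actions(runbook) do
      {:ok, normalize(runbook)}
    end
  end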

  def validate(_), do: {:error, ["runbook must be a map"]}

  defp validate_required_keys(runbook) do
    missing = @required_keys -- Map.keys(runbook)

    if Enum.empty?(missing) do
      :ok
    else
      {:error, ["missing required keys: #{inspect(missing)}"]}
    end
  end

  defp validate_confidence(%{confidence: c}) when is_number(c) and c >= 0 and c <= 1, do: :ok
  defp validate_confidence(%{confidence: c}), do: {:error, ["confidence must be 0.0-1.0, got: #{inspect(c)}"]}

  defp validate_checks(%{checks: checks}) when is_list(checks) do
    errors =
      checks
      |> Enum.with_index()
      |> Enum.flat_map(fn {check, i} ->
        validate_check(check, i)
      end)

    if Enum.empty?(errors), do: :ok, else: {:error, errors}
  end

  defp validate_checks(_), do: {:error, ["checks must be a list"]}

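  defp validate_check(check, index) when is_map(check) do
    # Accept string keys (decoded JSON) as well as atom keys (normalized maps).
    missing =
      for {atom_key, string_key} <- [id: "id", action: "action"],
          not Map.has_key?(check, atom_key) and not Map.has_key?(check, string_key),
          do: string_key

    if Enum.empty?(missing) do
      []
    else
      ["check[#{index}] missing: #{Enum.join(missing, ", ")}"]
    end
  end

  defp validate_check(_check, index), do: ["check[#{index}] must be a map"]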

  defp validate_actions(%{recommended_actions: actions}) when is_list(actions) do
    errors =
      actions
      |> Enum.with_index()
      |> Enum.flat_map(fn {action, i} ->
        validate_action(action, i)
      end)

    if Enum.empty?(errors), do: :ok, else: {:error, errors}
  end

  defp validate_actions(_), do: {:error, ["recommended_actions must be a list"]}

  defp validate_action(action, index) do
    cond do
      not Map.has_key?(action, "id") and not Map.has_key?(action, :id) ->
        ["action[#{index}] missing: id"]

      not Map.has_key?(action, "action") and not Map.has_key?(action, :action) ->
        ["action[#{index}] missing: action"]

      true ->
        []
    end
  end

  defp normalize(runbook) do
    runbook
    |> Map.put_new(:hypotheses, [])
    |> Map.put_new(:rollback_plan, "No rollback plan specified")
    |> Map.update(:checks, [], &normalize_checks/1)
    |> Map.update(:recommended_actions, [], &normalize_actions/1)
  end

  defp normalize_checks(checks) do
    Enum.map(checks, fn check ->
      # Atomize first so the default below never shadows a string-keyed value.
      check
      |> atomize_keys()
      |> Map.put_new(:params, %{})
    end)
  end

  defp normalize_actions(actions) do
    Enum.map(actions, fn action ->
      # Actions default to unsafe unless the runbook says otherwise.
      action
      |> atomize_keys()
      |> Map.put_new(:params, %{})
      |> Map.put_new(:safe, false)
    end)
  end

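  # Converts known string keys to atoms. Unknown keys stay as strings, so one
  # unexpected key neither creates a new atom nor leaves the whole map unconverted.
  defp atomize_keys(map) when is_map(map) do
    Map.new(map, fn
      {k, v} when is_binary(k) -> {to_known_atom(k), v}
      {k, v} -> {k, v}
    end)
  end

  defp to_known_atom(key) do
    String.to_existing_atom(key)
  rescue
    ArgumentError -> key
  end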
end
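
The schema module works with atom keys, while JSON decoded from Claude's response arrives with string keys. The sketch below shows one way to convert only a fixed set of expected keys, so that untrusted input can never create new atoms. `KeySketch` and its key list are assumptions made for this illustration; they are not part of the module above.

```elixir
defmodule KeySketch do
  # Keys the sketch is willing to convert; anything else stays a string.
  @known ~w(id action params safe)a

  def atomize(map) when is_map(map) do
    # Lookup table from "id" to :id, "action" to :action, and so on.
    lookup = Map.new(@known, &{Atom.to_string(&1), &1})

    Map.new(map, fn
      {key, value} when is_binary(key) -> {Map.get(lookup, key, key), value}
      {key, value} -> {key, value}
    end)
  end
end

converted =
  KeySketch.atomize(%{"id" => "a1", "action" => "alert.notify", "note" => "extra"})

IO.inspect(converted)
```

Here the expected keys become `:id` and `:action`, and the unexpected `"note"` key is kept as a string rather than being dropped or turned into an atom.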

Step 4: Create the Runbook Generator (Claude Skill)

Create lib/incident_response/skills/runbook_generator.ex:

defmodule IncidentResponse.Skills.RunbookGenerator do
  @moduledoc """
  Claude skill that analyzes incidents and generates structured runbooks.

  This skill does NO execution - it only reasons about the problem
  and produces a machine-readable artifact for the executor skill.
  """

  alias IncidentResponse.Runbooks.Schema

  require Logger

  @system_prompt """
  You are an incident response analyst. When given an incident description,
  analyze the problem and generate a structured runbook for investigation.

  IMPORTANT: You must output a valid JSON runbook with this exact structure:

  {
    "incident_type": "performance_degradation|outage|data_issue|security",
    "affected_system": "string - the primary system affected",
    "confidence": 0.0-1.0,
    "hypotheses": ["list of possible root causes"],
    "checks": [
      {
        "id": "unique_check_id",
        "action": "system.operation_name",
        "params": { "key": "value" }
      }
    ],
    "recommended_actions": [
      {
        "id": "unique_action_id",
        "action": "system.operation_name",
        "safe": true/false,
        "params": { "key": "value" }
      }
    ],
    "rollback_plan": "description of how to revert if needed"
  }

  Available check actions:
  - database.slow_queries - params: {lookback_minutes}
  - database.table_stats - params: {table}
  - database.connection_count - params: {}
  - metrics.query - params: {query, range_minutes}
  - logs.search - params: {pattern, service, lookback_minutes}
  - http.health_check - params: {url}

  Available recommended actions:
  - database.analyze_table - params: {table}, safe: true
  - database.kill_query - params: {query_id}, safe: false
  - cache.invalidate - params: {key_pattern}, safe: true
  - service.restart - params: {service}, safe: false
  - alert.notify - params: {channel, message}, safe: true

  Only mark actions as safe:true if they are read-only or easily reversible.
  Output ONLY the JSON, no markdown fences or explanation.
  """

  @doc """
  Generate a runbook from an incident description.

  Returns `{:ok, runbook}` where runbook is a validated map,
  or `{:error, reason}` if generation or validation fails.
  """
  def generate(incident_description, opts \\ []) do
    with {:ok, response} <- call_claude(incident_description, opts),
         {:ok, json} <- extract_json(response),
         {:ok, runbook} <- Schema.validate(json) do
      {:ok, runbook}
    end
  end

  defp call_claude(incident_description, opts) do
    messages = [
      %{"role" => "user", "content" => incident_description}
    ]

    api_callback = Keyword.get(opts, :api_callback, &default_api_callback/1)
    api_callback.(messages)
  end

  defp extract_json(response) do
    text =
      response["content"]
      |> Enum.filter(&(&1["type"] == "text"))
      |> Enum.map(&(&1["text"]))
      |> Enum.join()
      |> String.trim()

    # Strip markdown code fences if present (with or without a language tag)
    text =
      text
      |> String.replace(~r/^```(?:json)?\s*/, "")
      |> String.replace(~r/\s*```$/, "")

    case Jason.decode(text) do
      {:ok, json} -> {:ok, json}
      {:error, _} -> {:error, "Failed to parse runbook JSON: #{String.slice(text, 0, 200)}..."}
    end
  end

  defp default_api_callback(messages) do
    url = "https://api.anthropic.com/v1/messages"
    api_key = System.get_env("ANTHROPIC_API_KEY")

    body = %{
      "model" => "claude-sonnet-4-5-20250929",
      "max_tokens" => 4096,
      "system" => @system_prompt,
      "messages" => messages
    }

    headers = [
      {"x-api-key", api_key},
      {"content-type", "application/json"},
      {"anthropic-version", "2023-06-01"}
    ]

    case Req.post(url, json: body, headers: headers, receive_timeout: 60_000) do
      {:ok, %{status: 200, body: response}} -> {:ok, response}
      {:ok, %{status: status, body: body}} -> {:error, "API error #{status}: #{inspect(body)}"}
      {:error, reason} -> {:error, "Request failed: #{inspect(reason)}"}
    end
  end
end
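
Because `generate/2` accepts an `:api_callback` option, the generator can be exercised without a network call. The sketch below builds a stub callback that returns a canned response in the content-block shape `extract_json/1` reads. `StubClaudeSketch` is a hypothetical helper introduced for this illustration.

```elixir
defmodule StubClaudeSketch do
  # Builds a one-argument callback that ignores the messages and returns a
  # canned body shaped like the response handled by extract_json/1.
  def callback(json_text) when is_binary(json_text) do
    fn _messages ->
      {:ok, %{"content" => [%{"type" => "text", "text" => json_text}]}}
    end
  end

  # Mirrors the text-joining step of extract_json/1 so the stub can be
  # checked on its own.
  def text_of(%{"content" => blocks}) do
    blocks
    |> Enum.filter(&(&1["type"] == "text"))
    |> Enum.map_join(& &1["text"])
    |> String.trim()
  end
end

stub = StubClaudeSketch.callback(~s({"incident_type": "outage"}))
{:ok, response} = stub.([%{"role" => "user", "content" => "API is down"}])
IO.puts(StubClaudeSketch.text_of(response))
```

In the tutorial's code, such a stub would be passed as `RunbookGenerator.generate(description, api_callback: stub)`, which keeps tests fast and independent of the Anthropic API.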

Step 5: Create the Runbook Executor (Native Skill)

Create lib/incident_response/skills/runbook_executor.ex:

defmodule IncidentResponse.Skills.RunbookExecutor do
  @moduledoc """
  Native skill that validates and executes runbook checks safely.

  This skill is the trust boundary - it enforces:
  - Schema validation (fail fast if malformed)
  - Action allow-list (only whitelisted operations)
  - Dry-run execution (no destructive actions without approval)
  """

  @behaviour Conjure.NativeSkill

  alias IncidentResponse.Runbooks.Schema

  require Logger

  # Allow-list of executable actions
  @allowed_checks ~w(
    database.slow_queries
    database.table_stats
    database.connection_count
    metrics.query
    logs.search
    http.health_check
  )

  @allowed_actions ~w(
    database.analyze_table
    cache.invalidate
    alert.notify
  )

  @impl true
  def __skill_info__ do
    %{
      name: "runbook-executor",
      description: """
      Safely executes incident runbooks generated by Claude.
      Validates schema, enforces allow-lists, and performs dry-run checks.
      """,
      allowed_tools: [:execute]
    }
  end

  @impl true
  def execute(command, context) do
    case parse_command(command) do
      {:run_checks, runbook_json} ->
        run_checks(runbook_json, context)

      {:run_action, runbook_json, action_id} ->
        run_action(runbook_json, action_id, context)

      {:validate, runbook_json} ->
        validate_only(runbook_json)

      {:error, reason} ->
        {:error, reason}
    end
  end

  # Command parsing

  defp parse_command(command) do
    command = String.trim(command)

    cond do
      String.starts_with?(command, "validate ") ->
        {:validate, String.replace_prefix(command, "validate ", "")}

      String.starts_with?(command, "run_checks ") ->
        {:run_checks, String.replace_prefix(command, "run_checks ", "")}

      String.starts_with?(command, "run_action ") ->
        parse_run_action(command)

      true ->
        {:error, "Unknown command. Available: validate <json>, run_checks <json>, run_action <json> <action_id>"}
    end
  end

  defp parse_run_action(command) do
    # Format: run_action <json> <action_id>
    # JSON ends at the last }, action_id follows
    case Regex.run(~r/run_action (.+})\s+(\S+)$/, command) do
      [_, json, action_id] -> {:run_action, json, action_id}
      _ -> {:error, "Invalid run_action format. Use: run_action <json> <action_id>"}
    end
  end

  # Execution

  defp validate_only(runbook_json) do
    with {:ok, json} <- Jason.decode(runbook_json),
         {:ok, runbook} <- Schema.validate(json) do
      {:ok, "Runbook valid. #{length(runbook.checks)} checks, #{length(runbook.recommended_actions)} actions."}
    else
      {:error, reasons} when is_list(reasons) ->
        {:error, "Validation failed:\n" <> Enum.join(reasons, "\n")}

      {:error, reason} ->
        {:error, "Validation failed: #{inspect(reason)}"}
    end
  end

  defp run_checks(runbook_json, context) do
    with {:ok, json} <- Jason.decode(runbook_json),
         {:ok, runbook} <- Schema.validate(json),
         {:ok, results} <- execute_checks(runbook.checks, context) do
      output = format_check_results(runbook, results)
      {:ok, output}
    else
      {:error, reasons} when is_list(reasons) ->
        {:error, "Validation failed:\n" <> Enum.join(reasons, "\n")}

      {:error, reason} ->
        {:error, "Failed: #{inspect(reason)}"}
    end
  end

  defp run_action(runbook_json, action_id, context) do
    with {:ok, json} <- Jason.decode(runbook_json),
         {:ok, runbook} <- Schema.validate(json),
         {:ok, action} <- find_action(runbook, action_id),
         :ok <- verify_action_allowed(action),
         :ok <- verify_action_safe(action),
         {:ok, result} <- execute_action(action, context) do
      {:ok, "Action #{action_id} completed: #{result}"}
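    else
      {:error, reasons} when is_list(reasons) ->
        {:error, "Validation failed:\n" <> Enum.join(reasons, "\n")}

      {:error, reason} when is_binary(reason) ->
        {:error, reason}

      {:error, reason} ->
        {:error, "Failed: #{inspect(reason)}"}
    end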
  end

  defp execute_checks(checks, context) do
    results =
      Enum.map(checks, fn check ->
        if check.action in @allowed_checks do
          result = execute_check(check, context)
          {check.id, result}
        else
          {check.id, {:blocked, "Action not in allow-list: #{check.action}"}}
        end
      end)

    {:ok, results}
  end

  defp execute_check(%{action: action, params: params} = check, _context) do
    Logger.info("Executing check: #{check.id} (#{action})")

    # Simulate check execution - in production, these would call real services
    case action do
      "database.slow_queries" ->
        lookback = Map.get(params, "lookback_minutes", 60)
        {:ok, "Found 3 slow queries in last #{lookback} minutes (avg 4.2s)"}

      "database.table_stats" ->
        table = Map.get(params, "table", "unknown")
        {:ok, "Table #{table}: 18,432 partitions, last analyzed 3 days ago"}

      "database.connection_count" ->
        {:ok, "Active connections: 47/100 (47% utilized)"}

      "metrics.query" ->
        {:ok, "Query latency p99: 2.3s (up from 0.8s baseline)"}

      "logs.search" ->
        pattern = Map.get(params, "pattern", "*")
        {:ok, "Found 142 matches for '#{pattern}' in last hour"}

      "http.health_check" ->
        url = Map.get(params, "url", "/health")
        {:ok, "#{url} returned 200 OK (latency: 45ms)"}

      _ ->
        {:warning, "Unknown check action: #{action}"}
    end
  end

  defp find_action(runbook, action_id) do
    case Enum.find(runbook.recommended_actions, &(&1.id == action_id)) do
      nil -> {:error, "Action not found: #{action_id}"}
      action -> {:ok, action}
    end
  end

  defp verify_action_allowed(%{action: action}) do
    if action in @allowed_actions do
      :ok
    else
      {:error, "Action not in allow-list: #{action}. Allowed: #{Enum.join(@allowed_actions, ", ")}"}
    end
  end

  defp verify_action_safe(%{safe: true}), do: :ok

  # Fail closed: anything other than an explicit `safe: true` needs approval.
  defp verify_action_safe(action) do
    {:error, "Action #{Map.get(action, :id, "unknown")} is not marked safe. Requires manual approval."}
  end

  defp execute_action(%{action: action, params: params} = act, _context) do
    Logger.info("Executing action: #{act.id} (#{action})")

    case action do
      "database.analyze_table" ->
        table = Map.get(params, "table", "unknown")
        {:ok, "ANALYZE TABLE #{table} completed (updated statistics)"}

      "cache.invalidate" ->
        pattern = Map.get(params, "key_pattern", "*")
        {:ok, "Invalidated cache keys matching '#{pattern}' (23 keys removed)"}

      "alert.notify" ->
        channel = Map.get(params, "channel", "#incidents")
        {:ok, "Notification sent to #{channel}"}

      _ ->
        {:ok, "Action #{action} simulated (dry-run)"}
    end
  end

  # Output formatting

  defp format_check_results(runbook, results) do
    checks_output =
      results
      |> Enum.map(fn {id, result} ->
        status = case result do
          {:ok, _} -> "OK"
          {:warning, _} -> "WARN"
          {:blocked, _} -> "BLOCKED"
          _ -> "ERROR"
        end

        detail = case result do
          {_, msg} -> msg
          msg -> inspect(msg)
        end

        "  [#{status}] #{id}: #{detail}"
      end)
      |> Enum.join("\n")

    recommendations =
      runbook.recommended_actions
      |> Enum.map(fn action ->
        safety = if action.safe, do: "[SAFE]", else: "[UNSAFE - requires approval]"
        "  - #{action.id}: #{action.action} #{safety}"
      end)
      |> Enum.join("\n")

    """
    === Runbook Execution Results ===

    Incident Type: #{runbook.incident_type}
    Affected System: #{runbook.affected_system}
    Confidence: #{Float.round(runbook.confidence * 100.0, 1)}%

    Hypotheses:
    #{Enum.map_join(runbook.hypotheses, "\n", &"  - #{&1}")}

    Check Results:
    #{checks_output}

    Recommended Actions:
    #{recommendations}

    Rollback Plan: #{runbook.rollback_plan}
    """
  end
end
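
The executor applies two independent gates before running an action: the allow-list and the `safe` flag. The sketch below isolates that decision in a standalone module to show why both are needed. `GateSketch` and its error atoms are assumptions made for illustration, not the executor's actual return values.

```elixir
defmodule GateSketch do
  # Same allow-list idea as @allowed_actions above, repeated so the sketch
  # stands alone.
  @allowed ~w(database.analyze_table cache.invalidate alert.notify)

  def authorize(%{action: action} = step) when is_binary(action) do
    cond do
      action not in @allowed -> {:error, :not_in_allow_list}
      # Only a literal `true` passes; a missing or non-boolean flag does not.
      Map.get(step, :safe) != true -> {:error, :requires_approval}
      true -> :ok
    end
  end

  def authorize(_other), do: {:error, :malformed}
end

# The model's own `safe: true` label is not enough: service.restart is absent
# from the allow-list, so it is rejected regardless.
IO.inspect(GateSketch.authorize(%{action: "service.restart", safe: true}))
IO.inspect(GateSketch.authorize(%{action: "cache.invalidate", safe: true}))
```

Checking the allow-list first means a runbook cannot widen its own permissions by labelling an action safe. The flag can only narrow what the native code already permits.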

Step 6: Create the Orchestration Layer

Create lib/incident_response/agent.ex:

defmodule IncidentResponse.Agent do
  @moduledoc """
  Orchestrates the incident response pipeline:
  1. Claude generates runbook artifact
  2. Artifact stored in Tigris
  3. Native skill executes checks safely
  """

  alias IncidentResponse.Skills.{RunbookGenerator, RunbookExecutor}

  require Logger

  @doc """
  Analyze an incident and run diagnostic checks.

  Returns a structured result with the runbook and execution results.
  """
  def analyze(incident_description, opts \\ []) do
    with {:ok, runbook} <- generate_runbook(incident_description, opts),
         {:ok, artifact_ref} <- store_artifact(runbook, opts),
         {:ok, results} <- execute_runbook(runbook, opts) do
      {:ok,
       %{
         runbook: runbook,
         artifact_ref: artifact_ref,
         execution_results: results
       }}
    end
  end

  @doc """
  Generate a runbook without executing it.
  """
  def generate_runbook(incident_description, opts \\ []) do
    Logger.info("Generating runbook for incident...")
    RunbookGenerator.generate(incident_description, opts)
  end

  @doc """
  Execute checks from a previously generated runbook.
  """
  def execute_runbook(runbook, opts \\ []) do
    Logger.info("Executing runbook checks...")

    runbook_json = Jason.encode!(runbook)
    context = Keyword.get(opts, :context, %{})

    RunbookExecutor.execute("run_checks #{runbook_json}", context)
  end

  @doc """
  Execute a specific action from a runbook (safe actions only).
  """
  def execute_action(runbook, action_id, opts \\ []) do
    Logger.info("Executing action: #{action_id}")

    runbook_json = Jason.encode!(runbook)
    context = Keyword.get(opts, :context, %{})

    RunbookExecutor.execute("run_action #{runbook_json} #{action_id}", context)
  end

  # Store artifact in Tigris for audit trail

  defp store_artifact(runbook, opts) do
    if storage = Keyword.get(opts, :storage) do
      artifact_path = "runbooks/#{DateTime.utc_now() |> DateTime.to_iso8601()}.json"
      content = Jason.encode!(runbook, pretty: true)

      case storage.write(storage, artifact_path, content) do
        {:ok, ref} ->
          Logger.info("Runbook stored: #{ref.path}")
          {:ok, ref}

        error ->
          Logger.warning("Failed to store runbook: #{inspect(error)}")
          {:ok, nil}
      end
    else
      {:ok, nil}
    end
  end
end
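
`store_artifact/2` only runs when a `:storage` option is supplied, and it calls `storage.write(storage, path, content)`. For local experiments, a stand-in module with the same call shape may be enough. The sketch below is a hypothetical in-memory substitute. It is not a Conjure or Tigris API, and the fields of the returned reference are assumptions.

```elixir
defmodule MemoryStorageSketch do
  # Hypothetical stand-in for a Tigris-backed storage module. It matches only
  # the call shape used by store_artifact/2: write(storage, path, content).
  def write(_storage, path, content) when is_binary(path) and is_binary(content) do
    {:ok, %{path: path, bytes: byte_size(content)}}
  end

  # Builds an artifact key the same way store_artifact/2 does.
  def artifact_path(now \\ DateTime.utc_now()) do
    "runbooks/#{DateTime.to_iso8601(now)}.json"
  end
end

path = MemoryStorageSketch.artifact_path(~U[2025-01-01 12:00:00Z])
{:ok, ref} = MemoryStorageSketch.write(MemoryStorageSketch, path, ~s({"checks": []}))
IO.inspect(ref.path)
```

With the tutorial's modules loaded, this stub could be supplied as `IncidentResponse.Agent.analyze(description, storage: MemoryStorageSketch)` to exercise the storage branch without a bucket.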

Step 7: Create a LiveView Interface

Create lib/incident_response_web/live/incident_live.ex:

defmodule IncidentResponseWeb.IncidentLive do
  use IncidentResponseWeb, :live_view

  alias IncidentResponse.Agent

  def mount(_params, _session, socket) do
    {:ok,
     socket
     |> assign(:incident, "")
     |> assign(:runbook, nil)
     |> assign(:results, nil)
     |> assign(:loading, false)
     |> assign(:error, nil)}
  end

  def handle_event("analyze", %{"incident" => incident}, socket) do
    socket = assign(socket, loading: true, error: nil)

    # Run analysis in background
    pid = self()

    Task.start(fn ->
      result = Agent.analyze(incident)
      send(pid, {:analysis_complete, result})
    end)

    {:noreply, assign(socket, :incident, incident)}
  end

  def handle_event("execute_action", %{"action_id" => action_id}, socket) do
    runbook = socket.assigns.runbook

    case Agent.execute_action(runbook, action_id) do
      {:ok, result} ->
        {:noreply,
         socket
         |> put_flash(:info, result)}

      {:error, reason} ->
        {:noreply,
         socket
         |> put_flash(:error, reason)}
    end
  end

  def handle_info({:analysis_complete, {:ok, result}}, socket) do
    {:noreply,
     socket
     |> assign(:loading, false)
     |> assign(:runbook, result.runbook)
     |> assign(:results, result.execution_results)}
  end

  def handle_info({:analysis_complete, {:error, reason}}, socket) do
    {:noreply,
     socket
     |> assign(:loading, false)
     |> assign(:error, inspect(reason))}
  end

  def render(assigns) do
    ~H"""
    <div class="max-w-4xl mx-auto p-6">
      <h1 class="text-3xl font-bold mb-6">Incident Response</h1>

      <!-- Input Form -->
      <form phx-submit="analyze" class="mb-8">
        <label class="block text-sm font-medium mb-2">
          Describe the incident
        </label>
        <textarea
          name="incident"
          rows="4"
          class="w-full border rounded-lg p-3 font-mono text-sm"
          placeholder="Our Athena queries for attendance analytics are timing out since this morning..."
          disabled={@loading}
        ><%= @incident %></textarea>
        <button
          type="submit"
          class="mt-3 bg-blue-600 text-white px-6 py-2 rounded-lg disabled:opacity-50"
          disabled={@loading}
        >
          <%= if @loading, do: "Analyzing...", else: "Analyze Incident" %>
        </button>
      </form>

      <!-- Error Display -->
      <%= if @error do %>
        <div class="bg-red-50 border border-red-200 rounded-lg p-4 mb-6">
          <p class="text-red-800"><%= @error %></p>
        </div>
      <% end %>

      <!-- Runbook Display -->
      <%= if @runbook do %>
        <div class="border rounded-lg p-6 mb-6">
          <h2 class="text-xl font-semibold mb-4">Generated Runbook</h2>

          <div class="grid grid-cols-2 gap-4 mb-4">
            <div>
              <span class="text-gray-500">Type:</span>
              <span class="font-medium ml-2"><%= @runbook.incident_type %></span>
            </div>
            <div>
              <span class="text-gray-500">System:</span>
              <span class="font-medium ml-2"><%= @runbook.affected_system %></span>
            </div>
            <div>
              <span class="text-gray-500">Confidence:</span>
              <span class="font-medium ml-2"><%= Float.round(@runbook.confidence * 100.0, 1) %>%</span>
            </div>
          </div>

          <div class="mb-4">
            <h3 class="font-medium mb-2">Hypotheses</h3>
            <ul class="list-disc list-inside text-gray-700">
              <%= for h <- @runbook.hypotheses do %>
                <li><%= h %></li>
              <% end %>
            </ul>
          </div>

          <div class="mb-4">
            <h3 class="font-medium mb-2">Recommended Actions</h3>
            <div class="space-y-2">
              <%= for action <- @runbook.recommended_actions do %>
                <div class="flex items-center justify-between bg-gray-50 p-3 rounded">
                  <div>
                    <span class="font-mono text-sm"><%= action.action %></span>
                    <%= if action.safe do %>
                      <span class="ml-2 text-xs bg-green-100 text-green-800 px-2 py-1 rounded">SAFE</span>
                    <% else %>
                      <span class="ml-2 text-xs bg-red-100 text-red-800 px-2 py-1 rounded">UNSAFE</span>
                    <% end %>
                  </div>
                  <%= if action.safe do %>
                    <button
                      phx-click="execute_action"
                      phx-value-action_id={action.id}
                      class="text-sm bg-blue-500 text-white px-3 py-1 rounded"
                    >
                      Execute
                    </button>
                  <% end %>
                </div>
              <% end %>
            </div>
          </div>
        </div>
      <% end %>

      <!-- Execution Results -->
      <%= if @results do %>
        <div class="border rounded-lg p-6 bg-gray-900 text-gray-100">
          <h2 class="text-xl font-semibold mb-4">Execution Results</h2>
          <pre class="font-mono text-sm whitespace-pre-wrap"><%= @results %></pre>
        </div>
      <% end %>
    </div>
    """
  end
end
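
The LiveView keeps the page responsive by running the analysis in a separate process and receiving the result as a message. The same hand-off can be sketched in plain Elixir. `BackgroundSketch` is a hypothetical module, and its blocking `receive` exists only so the sketch runs outside a LiveView, where `handle_info/2` would receive the message instead.

```elixir
defmodule BackgroundSketch do
  # Runs `work` in a separate process and waits for its message, mirroring the
  # Task.start/send pair in handle_event/3 and the handle_info/2 clauses.
  def run(work, timeout_ms \\ 1_000) when is_function(work, 0) do
    caller = self()

    {:ok, _pid} =
      Task.start(fn ->
        send(caller, {:analysis_complete, work.()})
      end)

    receive do
      {:analysis_complete, result} -> result
    after
      timeout_ms -> {:error, :timeout}
    end
  end
end

IO.inspect(BackgroundSketch.run(fn -> {:ok, "runbook ready"} end))
```

Note that `Task.start/1` does not link the task to the caller, so if the work raises, no message is ever sent. The timeout branch covers that case in the sketch. A LiveView would need its own handling for it.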

Add the route in lib/incident_response_web/router.ex:

scope "/", IncidentResponseWeb do
  pipe_through :browser

  live "/", IncidentLive
end

Step 8: Deploy to Fly.io

Deploy the application:

fly deploy

Run the database migration:

fly ssh console -C "/app/bin/migrate"

Open the application:

fly open

Usage Example

Input an incident description:

Our Athena queries for attendance analytics are timing out since this morning.
Users report 30+ second waits. No recent deployments.

Claude generates a structured runbook:

{
  "incident_type": "performance_degradation",
  "affected_system": "athena",
  "confidence": 0.82,
  "hypotheses": [
    "partition explosion in attendance_facts table",
    "stale table statistics causing poor query plans",
    "unexpected data volume increase"
  ],
  "checks": [
    {
      "id": "check_partitions",
      "action": "database.table_stats",
      "params": {"table": "attendance_facts"}
    },
    {
      "id": "check_slow_queries",
      "action": "database.slow_queries",
      "params": {"lookback_minutes": 180}
    }
  ],
  "recommended_actions": [
    {
      "id": "recompute_stats",
      "action": "database.analyze_table",
      "params": {"table": "attendance_facts"},
      "safe": true
    },
    {
      "id": "notify_team",
      "action": "alert.notify",
      "params": {"channel": "#data-platform", "message": "Athena perf issue under investigation"},
      "safe": true
    }
  ],
  "rollback_plan": "No destructive actions proposed"
}

The Native executor validates and runs checks:

=== Runbook Execution Results ===

Incident Type: performance_degradation
Affected System: athena
Confidence: 82.0%

Hypotheses:
  - partition explosion in attendance_facts table
  - stale table statistics causing poor query plans
  - unexpected data volume increase

Check Results:
  [OK] check_partitions: Table attendance_facts: 18,432 partitions, last analyzed 3 days ago
  [OK] check_slow_queries: Found 3 slow queries in last 180 minutes (avg 4.2s)

Recommended Actions:
  - recompute_stats: database.analyze_table [SAFE]
  - notify_team: alert.notify [SAFE]

Rollback Plan: No destructive actions proposed

Why This Pattern Works

Clear Separation of Concerns

Phase	Owner	Responsibility
Reasoning	Claude	Analyze problem, generate hypotheses
Planning	Claude	Structure checks and actions
Validation	Native	Schema enforcement, fail fast
Safety	Native	Action allow-listing
Execution	Native	Deterministic, auditable

Artifact-Driven

The JSON runbook is:

Inspectable: Review before execution
Testable: Unit test the schema
Replayable: Re-run checks without Claude
Auditable: Store in Tigris for compliance

Production-Ready

No LLM executing infrastructure changes
Native code is the trust boundary
Safe actions clearly marked
Unsafe actions require explicit approval

Testing Locally

For local development without Fly.io:

# Set API key
export ANTHROPIC_API_KEY=sk-ant-...

# Run the app
mix phx.server

Test the agent directly in IEx:

iex> IncidentResponse.Agent.analyze("Database queries are slow")
{:ok, %{runbook: %{...}, artifact_ref: nil, execution_results: "..."}}

Next Steps

Add real service integrations (replace simulated checks)
Implement approval workflow for unsafe actions
Add runbook versioning and history
Create Slack/PagerDuty integration for alert.notify
Add telemetry dashboards with Fly.io Metrics