Fly.io with Tigris Storage


Build an incident response system on Fly.io where Claude generates structured runbooks and a Native skill safely executes them.

Time: 35 minutes

Prerequisites:

What You'll Build

An incident response service demonstrating a two-phase skill pipeline:

  1. Claude (Anthropic) Skill: Analyzes incident descriptions and generates structured runbook artifacts
  2. Native Skill: Validates runbooks against a schema, enforces action allow-lists, and executes safe dry-runs

Fly.io Machine

  User Request --> Claude Skill (Anthropic API) --> Runbook Artifact (JSON in Tigris)
                                                              |
                                                              v
                                                    Native Executor Skill
                                                      - Schema validation
                                                      - Action allow-list
                                                      - Safe dry-run
                                                              |
                                                              v
                                                    Execution Results
                                                    (auditable output)

Why this pattern?

| Responsibility | Claude Skill | Native Skill |
|---|---|---|
| Reasoning | ✓ Analyzes problems | |
| Planning | ✓ Generates runbooks | |
| Validation | | ✓ Schema enforcement |
| Safety | | ✓ Action allow-listing |
| Execution | | ✓ Deterministic dry-run |

Claude reasons and plans. Native code executes safely. The runbook artifact is the contract between them.
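
The contract idea can be sketched in a few lines. This is a minimal illustration rather than the tutorial's implementation: `PipelineSketch` and its `plan/1`, `validate/1`, `execute/1` and `run/1` functions are hypothetical stand-ins for the modules built in the steps below, and the planner is a stub, not a call to Claude.

```elixir
defmodule PipelineSketch do
  # Hypothetical stand-in for the Claude skill: returns data only, runs nothing.
  def plan(_incident) do
    {:ok, %{"checks" => [%{"id" => "c1", "action" => "database.connection_count"}]}}
  end

  # Hypothetical stand-in for schema validation: the artifact must carry a list of checks.
  def validate(%{"checks" => checks} = runbook) when is_list(checks), do: {:ok, runbook}
  def validate(_other), do: {:error, "runbook must contain a list of checks"}

  # Hypothetical stand-in for the native executor: reports what it would run.
  def execute(%{"checks" => checks}) do
    {:ok, Enum.map(checks, &"would run #{&1["action"]}")}
  end

  # The artifact is the only thing handed from the planning phase to the execution phase.
  def run(incident) do
    with {:ok, artifact} <- plan(incident),
         {:ok, runbook} <- validate(artifact) do
      execute(runbook)
    end
  end
end
```

Because the phases only share data, either side can be replaced or tested on its own as long as the artifact shape stays the same.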

Step 1: Create the Fly.io Application

Generate a new Phoenix application:

mix phx.new incident_response --database postgres
cd incident_response

Add Conjure and dependencies to mix.exs:

defp deps do
  [
    # ... existing deps ...
    {:conjure, "~> 0.1.0"},
    {:req, "~> 0.5"}
  ]
end

Get dependencies:

mix deps.get

Step 2: Set Up Fly.io Infrastructure

Launch the Fly app with Postgres:

fly launch --name my-incident-response

Create a Tigris storage bucket for runbook artifacts:

fly storage create --name incident-runbooks

This sets the following secrets on the app, which are exposed to it as environment variables:

  • BUCKET_NAME - Your Tigris bucket name
  • AWS_ACCESS_KEY_ID - Tigris access key
  • AWS_SECRET_ACCESS_KEY - Tigris secret key
  • AWS_ENDPOINT_URL_S3 - Tigris endpoint
  • AWS_REGION - Tigris region

Set your Anthropic API key:

fly secrets set ANTHROPIC_API_KEY=sk-ant-...
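
One way the application might read the Tigris variables from this step at runtime is sketched below. `TigrisConfig` and `from_env/1` are hypothetical names, not part of Conjure or Fly.io; the function takes the environment as a map so it can be exercised without real credentials.

```elixir
defmodule TigrisConfig do
  # Variable names come from this step; the module itself is hypothetical.
  @required ~w(BUCKET_NAME AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_ENDPOINT_URL_S3)

  # Defaults to the real environment (System.get_env/0 returns a map of all variables).
  def from_env(env \\ System.get_env()) do
    case Enum.reject(@required, &Map.has_key?(env, &1)) do
      [] ->
        {:ok,
         %{
           bucket: env["BUCKET_NAME"],
           access_key_id: env["AWS_ACCESS_KEY_ID"],
           secret_access_key: env["AWS_SECRET_ACCESS_KEY"],
           endpoint: env["AWS_ENDPOINT_URL_S3"]
         }}

      missing ->
        # Report every missing variable at once instead of failing on the first.
        {:error, "missing environment variables: #{Enum.join(missing, ", ")}"}
    end
  end
end
```

Returning an error tuple instead of raising lets the caller decide whether missing storage credentials should stop the application or only disable artifact storage.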

Step 3: Define the Runbook Schema

Create lib/incident_response/runbooks/schema.ex:

defmodule IncidentResponse.Runbooks.Schema do
  @moduledoc """
  Defines the runbook artifact contract between Claude and the executor.

  This schema is the boundary between LLM reasoning and safe execution.
  """

  @type check :: %{
    id: String.t(),
    action: String.t(),
    params: map()
  }

  @type recommended_action :: %{
    id: String.t(),
    action: String.t(),
    safe: boolean(),
    params: map()
  }

  @type runbook :: %{
    incident_type: String.t(),
    affected_system: String.t(),
    confidence: float(),
    hypotheses: [String.t()],
    checks: [check()],
    recommended_actions: [recommended_action()],
    rollback_plan: String.t()
  }

  @required_keys ~w(incident_type affected_system confidence checks recommended_actions)a

  @doc """
  Validates a runbook artifact against the schema.

  Returns `{:ok, runbook}` if valid, `{:error, reasons}` otherwise.
  """
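  def validate(runbook) when is_map(runbook) do
    # Decoded JSON has string keys; convert the top-level keys to atoms so the
    # atom-keyed checks below apply to it.
    runbook = atomize_keys(runbook)

    with :ok <- validate_required_keys(runbook),
         :ok <- validate_confidence(runbook),
         :ok <- validate_checks(runbook),
         :ok <- validate_actions(runbook) do
      {:ok, normalize(runbook)}
    end
  end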

  def validate(_), do: {:error, ["runbook must be a map"]}

  defp validate_required_keys(runbook) do
    missing = @required_keys -- Map.keys(runbook)

    if Enum.empty?(missing) do
      :ok
    else
      {:error, ["missing required keys: #{inspect(missing)}"]}
    end
  end

  defp validate_confidence(%{confidence: c}) when is_number(c) and c >= 0 and c <= 1, do: :ok
  defp validate_confidence(%{confidence: c}), do: {:error, ["confidence must be 0.0-1.0, got: #{inspect(c)}"]}

  defp validate_checks(%{checks: checks}) when is_list(checks) do
    errors =
      checks
      |> Enum.with_index()
      |> Enum.flat_map(fn {check, i} ->
        validate_check(check, i)
      end)

    if Enum.empty?(errors), do: :ok, else: {:error, errors}
  end

  defp validate_checks(_), do: {:error, ["checks must be a list"]}

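  defp validate_check(check, index) when is_map(check) do
    # Accept string keys (decoded JSON) and atom keys (already-normalized runbooks)
    missing =
      Enum.reject([:id, :action], fn key ->
        Map.has_key?(check, key) or Map.has_key?(check, Atom.to_string(key))
      end)

    if Enum.empty?(missing) do
      []
    else
      ["check[#{index}] missing: #{Enum.join(missing, ", ")}"]
    end
  end

  defp validate_check(_check, index), do: ["check[#{index}] must be a map"]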

  defp validate_actions(%{recommended_actions: actions}) when is_list(actions) do
    errors =
      actions
      |> Enum.with_index()
      |> Enum.flat_map(fn {action, i} ->
        validate_action(action, i)
      end)

    if Enum.empty?(errors), do: :ok, else: {:error, errors}
  end

  defp validate_actions(_), do: {:error, ["recommended_actions must be a list"]}

  defp validate_action(action, index) do
    cond do
      not Map.has_key?(action, "id") and not Map.has_key?(action, :id) ->
        ["action[#{index}] missing: id"]

      not Map.has_key?(action, "action") and not Map.has_key?(action, :action) ->
        ["action[#{index}] missing: action"]

      true ->
        []
    end
  end

  defp normalize(runbook) do
    runbook
    |> Map.put_new(:hypotheses, [])
    |> Map.put_new(:rollback_plan, "No rollback plan specified")
    |> Map.update(:checks, [], &normalize_checks/1)
    |> Map.update(:recommended_actions, [], &normalize_actions/1)
  end

  defp normalize_checks(checks) do
    Enum.map(checks, fn check ->
      # Atomize first so the defaults below never shadow a string-keyed value
      check
      |> atomize_keys()
      |> Map.put_new(:params, %{})
    end)
  end

  defp normalize_actions(actions) do
    Enum.map(actions, fn action ->
      # Atomize first so a missing "safe" flag defaults to false
      action
      |> atomize_keys()
      |> Map.put_new(:params, %{})
      |> Map.put_new(:safe, false)
    end)
  end

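  # Converts string keys to existing atoms one key at a time, so a single
  # unknown key does not leave the whole map string-keyed.
  defp atomize_keys(map) when is_map(map) do
    Map.new(map, fn
      {k, v} when is_binary(k) -> {to_known_atom(k), v}
      {k, v} -> {k, v}
    end)
  end

  # Never creates new atoms from LLM output; unknown keys stay as strings.
  defp to_known_atom(key) do
    String.to_existing_atom(key)
  rescue
    ArgumentError -> key
  end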
end
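
Decoded JSON arrives with string keys, while the executor reads fields such as `check.action`, so key normalization is part of the contract. The following standalone sketch shows one conservative way to do it, assuming a fixed set of known field names: `KeySketch` and its `@known` list are hypothetical, and the approach relies on `String.to_existing_atom/1` so that untrusted input never creates new atoms.

```elixir
defmodule KeySketch do
  # Hypothetical list of field names the schema understands.
  @known [:id, :action, :params, :safe]

  # Converts string keys that name a known field; leaves every other key untouched.
  def atomize(map) when is_map(map) do
    Map.new(map, fn
      {key, value} when is_binary(key) -> {to_known_atom(key), value}
      {key, value} -> {key, value}
    end)
  end

  defp to_known_atom(key) do
    # Raises ArgumentError if no such atom exists, which is rescued below.
    atom = String.to_existing_atom(key)
    if atom in @known, do: atom, else: key
  rescue
    ArgumentError -> key
  end
end
```

Unknown keys survive as strings instead of being dropped, which keeps the stored artifact faithful to what the model actually produced.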

Step 4: Create the Runbook Generator (Claude Skill)

Create lib/incident_response/skills/runbook_generator.ex:

defmodule IncidentResponse.Skills.RunbookGenerator do
  @moduledoc """
  Claude skill that analyzes incidents and generates structured runbooks.

  This skill does NO execution - it only reasons about the problem
  and produces a machine-readable artifact for the executor skill.
  """

  alias IncidentResponse.Runbooks.Schema

  require Logger

  @system_prompt """
  You are an incident response analyst. When given an incident description,
  analyze the problem and generate a structured runbook for investigation.

  IMPORTANT: You must output a valid JSON runbook with this exact structure:

  {
    "incident_type": "performance_degradation|outage|data_issue|security",
    "affected_system": "string - the primary system affected",
    "confidence": 0.0-1.0,
    "hypotheses": ["list of possible root causes"],
    "checks": [
      {
        "id": "unique_check_id",
        "action": "system.operation_name",
        "params": { "key": "value" }
      }
    ],
    "recommended_actions": [
      {
        "id": "unique_action_id",
        "action": "system.operation_name",
        "safe": true/false,
        "params": { "key": "value" }
      }
    ],
    "rollback_plan": "description of how to revert if needed"
  }

  Available check actions:
  - database.slow_queries - params: {lookback_minutes}
  - database.table_stats - params: {table}
  - database.connection_count - params: {}
  - metrics.query - params: {query, range_minutes}
  - logs.search - params: {pattern, service, lookback_minutes}
  - http.health_check - params: {url}

  Available recommended actions:
  - database.analyze_table - params: {table}, safe: true
  - database.kill_query - params: {query_id}, safe: false
  - cache.invalidate - params: {key_pattern}, safe: true
  - service.restart - params: {service}, safe: false
  - alert.notify - params: {channel, message}, safe: true

  Only mark actions as safe:true if they are read-only or easily reversible.
  Output ONLY the JSON, no markdown fences or explanation.
  """

  @doc """
  Generate a runbook from an incident description.

  Returns `{:ok, runbook}` where runbook is a validated map,
  or `{:error, reason}` if generation or validation fails.
  """
  def generate(incident_description, opts \\ []) do
    with {:ok, response} <- call_claude(incident_description, opts),
         {:ok, json} <- extract_json(response),
         {:ok, runbook} <- Schema.validate(json) do
      {:ok, runbook}
    end
  end

  defp call_claude(incident_description, opts) do
    messages = [
      %{"role" => "user", "content" => incident_description}
    ]

    api_callback = Keyword.get(opts, :api_callback, &default_api_callback/1)
    api_callback.(messages)
  end

  defp extract_json(response) do
    text =
      response["content"]
      |> Enum.filter(&(&1["type"] == "text"))
      |> Enum.map(&(&1["text"]))
      |> Enum.join()
      |> String.trim()

    # Strip markdown code fences if present
    text =
      text
      |> String.replace(~r/^```json\s*/, "")
      |> String.replace(~r/\s*```$/, "")

    case Jason.decode(text) do
      {:ok, json} -> {:ok, json}
      {:error, _} -> {:error, "Failed to parse runbook JSON: #{String.slice(text, 0, 200)}..."}
    end
  end

  defp default_api_callback(messages) do
    url = "https://api.anthropic.com/v1/messages"
    api_key = System.get_env("ANTHROPIC_API_KEY")

    body = %{
      "model" => "claude-sonnet-4-5-20250929",
      "max_tokens" => 4096,
      "system" => @system_prompt,
      "messages" => messages
    }

    headers = [
      {"x-api-key", api_key},
      {"content-type", "application/json"},
      {"anthropic-version", "2023-06-01"}
    ]

    case Req.post(url, json: body, headers: headers, receive_timeout: 60_000) do
      {:ok, %{status: 200, body: response}} -> {:ok, response}
      {:ok, %{status: status, body: body}} -> {:error, "API error #{status}: #{inspect(body)}"}
      {:error, reason} -> {:error, "Request failed: #{inspect(reason)}"}
    end
  end
end
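
The `:api_callback` option makes it possible to exercise the generator without calling the Anthropic API. The sketch below is one way to do that, under stated assumptions: `StubClaude` is a hypothetical module, its canned response only imitates the `"content"` list of typed blocks that `extract_json/1` reads, and `text/1` repeats the filter, join and trim steps from that function for illustration.

```elixir
defmodule StubClaude do
  # Hypothetical stub with the same shape as the default callback:
  # takes the message list and returns {:ok, response}.
  def callback(_messages) do
    {:ok,
     %{
       "content" => [
         %{"type" => "text", "text" => ~s(  {"incident_type": "outage"}\n)}
       ]
     }}
  end

  # Mirrors the text handling in extract_json/1: keep text blocks, join, trim.
  def text(%{"content" => blocks}) do
    blocks
    |> Enum.filter(&(&1["type"] == "text"))
    |> Enum.map_join(& &1["text"])
    |> String.trim()
  end
end
```

In a test, the stub would be passed as `RunbookGenerator.generate(description, api_callback: &StubClaude.callback/1)`, with the canned text replaced by a complete runbook so that schema validation succeeds.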

Step 5: Create the Runbook Executor (Native Skill)

Create lib/incident_response/skills/runbook_executor.ex:

defmodule IncidentResponse.Skills.RunbookExecutor do
  @moduledoc """
  Native skill that validates and executes runbook checks safely.

  This skill is the trust boundary - it enforces:
  - Schema validation (fail fast if malformed)
  - Action allow-list (only whitelisted operations)
  - Dry-run execution (no destructive actions without approval)
  """

  @behaviour Conjure.NativeSkill

  alias IncidentResponse.Runbooks.Schema

  require Logger

  # Allow-list of executable actions
  @allowed_checks ~w(
    database.slow_queries
    database.table_stats
    database.connection_count
    metrics.query
    logs.search
    http.health_check
  )

  @allowed_actions ~w(
    database.analyze_table
    cache.invalidate
    alert.notify
  )

  @impl true
  def __skill_info__ do
    %{
      name: "runbook-executor",
      description: """
      Safely executes incident runbooks generated by Claude.
      Validates schema, enforces allow-lists, and performs dry-run checks.
      """,
      allowed_tools: [:execute]
    }
  end

  @impl true
  def execute(command, context) do
    case parse_command(command) do
      {:run_checks, runbook_json} ->
        run_checks(runbook_json, context)

      {:run_action, runbook_json, action_id} ->
        run_action(runbook_json, action_id, context)

      {:validate, runbook_json} ->
        validate_only(runbook_json)

      {:error, reason} ->
        {:error, reason}
    end
  end

  # Command parsing

  defp parse_command(command) do
    command = String.trim(command)

    cond do
      String.starts_with?(command, "validate ") ->
        {:validate, String.replace_prefix(command, "validate ", "")}

      String.starts_with?(command, "run_checks ") ->
        {:run_checks, String.replace_prefix(command, "run_checks ", "")}

      String.starts_with?(command, "run_action ") ->
        parse_run_action(command)

      true ->
        {:error, "Unknown command. Available: validate <json>, run_checks <json>, run_action <json> <action_id>"}
    end
  end

  defp parse_run_action(command) do
    # Format: run_action <json> <action_id>
    # JSON ends at the last }, action_id follows
    case Regex.run(~r/run_action (.+})\s+(\S+)$/, command) do
      [_, json, action_id] -> {:run_action, json, action_id}
      _ -> {:error, "Invalid run_action format. Use: run_action <json> <action_id>"}
    end
  end

  # Execution

  defp validate_only(runbook_json) do
    with {:ok, json} <- Jason.decode(runbook_json),
         {:ok, runbook} <- Schema.validate(json) do
      {:ok, "Runbook valid. #{length(runbook.checks)} checks, #{length(runbook.recommended_actions)} actions."}
    else
      {:error, reasons} when is_list(reasons) ->
        {:error, "Validation failed:\n" <> Enum.join(reasons, "\n")}

      {:error, reason} ->
        {:error, "Validation failed: #{inspect(reason)}"}
    end
  end

  defp run_checks(runbook_json, context) do
    with {:ok, json} <- Jason.decode(runbook_json),
         {:ok, runbook} <- Schema.validate(json),
         {:ok, results} <- execute_checks(runbook.checks, context) do
      output = format_check_results(runbook, results)
      {:ok, output}
    else
      {:error, reasons} when is_list(reasons) ->
        {:error, "Validation failed:\n" <> Enum.join(reasons, "\n")}

      {:error, reason} ->
        {:error, "Failed: #{inspect(reason)}"}
    end
  end

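  defp run_action(runbook_json, action_id, context) do
    with {:ok, json} <- Jason.decode(runbook_json),
         {:ok, runbook} <- Schema.validate(json),
         {:ok, action} <- find_action(runbook, action_id),
         :ok <- verify_action_allowed(action),
         :ok <- verify_action_safe(action),
         {:ok, result} <- execute_action(action, context) do
      {:ok, "Action #{action_id} completed: #{result}"}
    else
      # Always return a string so callers (such as put_flash/3) can display it
      {:error, reasons} when is_list(reasons) ->
        {:error, "Validation failed:\n" <> Enum.join(reasons, "\n")}

      {:error, reason} when is_binary(reason) ->
        {:error, reason}

      {:error, reason} ->
        {:error, "Failed: #{inspect(reason)}"}
    end
  end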

  defp execute_checks(checks, context) do
    results =
      Enum.map(checks, fn check ->
        if check.action in @allowed_checks do
          result = execute_check(check, context)
          {check.id, result}
        else
          {check.id, {:blocked, "Action not in allow-list: #{check.action}"}}
        end
      end)

    {:ok, results}
  end

  defp execute_check(%{action: action, params: params} = check, _context) do
    Logger.info("Executing check: #{check.id} (#{action})")

    # Simulate check execution - in production, these would call real services
    case action do
      "database.slow_queries" ->
        lookback = Map.get(params, "lookback_minutes", 60)
        {:ok, "Found 3 slow queries in last #{lookback} minutes (avg 4.2s)"}

      "database.table_stats" ->
        table = Map.get(params, "table", "unknown")
        {:ok, "Table #{table}: 18,432 partitions, last analyzed 3 days ago"}

      "database.connection_count" ->
        {:ok, "Active connections: 47/100 (47% utilized)"}

      "metrics.query" ->
        {:ok, "Query latency p99: 2.3s (up from 0.8s baseline)"}

      "logs.search" ->
        pattern = Map.get(params, "pattern", "*")
        {:ok, "Found 142 matches for '#{pattern}' in last hour"}

      "http.health_check" ->
        url = Map.get(params, "url", "/health")
        {:ok, "#{url} returned 200 OK (latency: 45ms)"}

      _ ->
        {:warning, "Unknown check action: #{action}"}
    end
  end

  defp find_action(runbook, action_id) do
    case Enum.find(runbook.recommended_actions, &(&1.id == action_id)) do
      nil -> {:error, "Action not found: #{action_id}"}
      action -> {:ok, action}
    end
  end

  defp verify_action_allowed(%{action: action}) do
    if action in @allowed_actions do
      :ok
    else
      {:error, "Action not in allow-list: #{action}. Allowed: #{Enum.join(@allowed_actions, ", ")}"}
    end
  end

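  defp verify_action_safe(%{safe: true}), do: :ok
  defp verify_action_safe(%{safe: false, id: id}) do
    {:error, "Action #{id} is marked unsafe. Requires manual approval."}
  end
  # Fail closed: a missing or non-boolean safe flag is treated as unsafe
  defp verify_action_safe(_), do: {:error, "Action has no valid safe flag. Requires manual approval."}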

  defp execute_action(%{action: action, params: params} = act, _context) do
    Logger.info("Executing action: #{act.id} (#{action})")

    case action do
      "database.analyze_table" ->
        table = Map.get(params, "table", "unknown")
        {:ok, "ANALYZE TABLE #{table} completed (updated statistics)"}

      "cache.invalidate" ->
        pattern = Map.get(params, "key_pattern", "*")
        {:ok, "Invalidated cache keys matching '#{pattern}' (23 keys removed)"}

      "alert.notify" ->
        channel = Map.get(params, "channel", "#incidents")
        {:ok, "Notification sent to #{channel}"}

      _ ->
        {:ok, "Action #{action} simulated (dry-run)"}
    end
  end

  # Output formatting

  defp format_check_results(runbook, results) do
    checks_output =
      results
      |> Enum.map(fn {id, result} ->
        status = case result do
          {:ok, _} -> "OK"
          {:warning, _} -> "WARN"
          {:blocked, _} -> "BLOCKED"
          _ -> "ERROR"
        end

        detail = case result do
          {_, msg} -> msg
          msg -> inspect(msg)
        end

        "  [#{status}] #{id}: #{detail}"
      end)
      |> Enum.join("\n")

    recommendations =
      runbook.recommended_actions
      |> Enum.map(fn action ->
        safety = if action.safe, do: "[SAFE]", else: "[UNSAFE - requires approval]"
        "  - #{action.id}: #{action.action} #{safety}"
      end)
      |> Enum.join("\n")

    """
    === Runbook Execution Results ===

    Incident Type: #{runbook.incident_type}
    Affected System: #{runbook.affected_system}
    Confidence: #{Float.round(runbook.confidence * 100.0, 1)}%

    Hypotheses:
    #{Enum.map_join(runbook.hypotheses, "\n", &"  - #{&1}")}

    Check Results:
    #{checks_output}

    Recommended Actions:
    #{recommendations}

    Rollback Plan: #{runbook.rollback_plan}
    """
  end
end
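
The `run_action` command packs a JSON document and an action id into one string, so it helps to see how the pattern in `parse_run_action/1` splits them. The sketch below reuses that regular expression in a standalone, hypothetical `RunActionSketch` module; the sample commands are made up for illustration.

```elixir
defmodule RunActionSketch do
  # Same pattern as parse_run_action/1: the greedy group runs to the last "}",
  # and the final whitespace-free token is taken as the action id.
  def parse(command) do
    case Regex.run(~r/run_action (.+})\s+(\S+)$/, command) do
      [_, json, action_id] -> {:ok, json, action_id}
      _ -> :error
    end
  end
end

# Nested objects are fine because the JSON group is greedy.
RunActionSketch.parse(~s(run_action {"a":{"b":1}} recompute_stats))

# A command without an action id does not match.
RunActionSketch.parse(~s(run_action {"a":1}))
```

The pattern assumes the JSON is on a single line, which holds for the output of `Jason.encode!/1` without the `pretty` option.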

Step 6: Create the Orchestration Layer

Create lib/incident_response/agent.ex:

defmodule IncidentResponse.Agent do
  @moduledoc """
  Orchestrates the incident response pipeline:
  1. Claude generates runbook artifact
  2. Artifact stored in Tigris
  3. Native skill executes checks safely
  """

  alias IncidentResponse.Skills.{RunbookGenerator, RunbookExecutor}

  require Logger

  @doc """
  Analyze an incident and run diagnostic checks.

  Returns a structured result with the runbook and execution results.
  """
  def analyze(incident_description, opts \\ []) do
    with {:ok, runbook} <- generate_runbook(incident_description, opts),
         {:ok, artifact_ref} <- store_artifact(runbook, opts),
         {:ok, results} <- execute_runbook(runbook, opts) do
      {:ok,
       %{
         runbook: runbook,
         artifact_ref: artifact_ref,
         execution_results: results
       }}
    end
  end

  @doc """
  Generate a runbook without executing it.
  """
  def generate_runbook(incident_description, opts \\ []) do
    Logger.info("Generating runbook for incident...")
    RunbookGenerator.generate(incident_description, opts)
  end

  @doc """
  Execute checks from a previously generated runbook.
  """
  def execute_runbook(runbook, opts \\ []) do
    Logger.info("Executing runbook checks...")

    runbook_json = Jason.encode!(runbook)
    context = Keyword.get(opts, :context, %{})

    RunbookExecutor.execute("run_checks #{runbook_json}", context)
  end

  @doc """
  Execute a specific action from a runbook (safe actions only).
  """
  def execute_action(runbook, action_id, opts \\ []) do
    Logger.info("Executing action: #{action_id}")

    runbook_json = Jason.encode!(runbook)
    context = Keyword.get(opts, :context, %{})

    RunbookExecutor.execute("run_action #{runbook_json} #{action_id}", context)
  end

  # Store artifact in Tigris for audit trail

  defp store_artifact(runbook, opts) do
    if storage = Keyword.get(opts, :storage) do
      artifact_path = "runbooks/#{DateTime.utc_now() |> DateTime.to_iso8601()}.json"
      content = Jason.encode!(runbook, pretty: true)

      case storage.write(storage, artifact_path, content) do
        {:ok, ref} ->
          Logger.info("Runbook stored: #{ref.path}")
          {:ok, ref}

        error ->
          Logger.warning("Failed to store runbook: #{inspect(error)}")
          {:ok, nil}
      end
    else
      {:ok, nil}
    end
  end
end
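
`store_artifact/2` only needs a `:storage` option whose `write/3` returns `{:ok, ref}` with a `path` field. For local experiments, a test double along the following lines may be enough. `InMemoryStorage` is a hypothetical module written to match that call shape; it is not Conjure's storage API and does not talk to Tigris.

```elixir
defmodule InMemoryStorage do
  # Hypothetical test double backed by an Elixir Agent process.
  use Agent

  def start_link(_opts \\ []) do
    Agent.start_link(fn -> %{} end, name: __MODULE__)
  end

  # The first argument mirrors the `storage.write(storage, path, content)` call shape.
  def write(_storage, path, content) when is_binary(path) and is_binary(content) do
    Agent.update(__MODULE__, &Map.put(&1, path, content))
    {:ok, %{path: path}}
  end

  # Reads back a stored artifact, for assertions in tests.
  def read(_storage, path) do
    case Agent.get(__MODULE__, &Map.fetch(&1, path)) do
      {:ok, content} -> {:ok, content}
      :error -> {:error, :not_found}
    end
  end
end
```

After `InMemoryStorage.start_link()`, it could be passed as `IncidentResponse.Agent.analyze(description, storage: InMemoryStorage)`, and the stored runbook read back with `read/2`.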

Step 7: Create a LiveView Interface

Create lib/incident_response_web/live/incident_live.ex:

defmodule IncidentResponseWeb.IncidentLive do
  use IncidentResponseWeb, :live_view

  alias IncidentResponse.Agent

  def mount(_params, _session, socket) do
    {:ok,
     socket
     |> assign(:incident, "")
     |> assign(:runbook, nil)
     |> assign(:results, nil)
     |> assign(:loading, false)
     |> assign(:error, nil)}
  end

  def handle_event("analyze", %{"incident" => incident}, socket) do
    socket = assign(socket, loading: true, error: nil)

    # Run analysis in background
    pid = self()

    Task.start(fn ->
      result = Agent.analyze(incident)
      send(pid, {:analysis_complete, result})
    end)

    {:noreply, assign(socket, :incident, incident)}
  end

  def handle_event("execute_action", %{"action_id" => action_id}, socket) do
    runbook = socket.assigns.runbook

    case Agent.execute_action(runbook, action_id) do
      {:ok, result} ->
        {:noreply,
         socket
         |> put_flash(:info, result)}

      {:error, reason} ->
        {:noreply,
         socket
         |> put_flash(:error, reason)}
    end
  end

  def handle_info({:analysis_complete, {:ok, result}}, socket) do
    {:noreply,
     socket
     |> assign(:loading, false)
     |> assign(:runbook, result.runbook)
     |> assign(:results, result.execution_results)}
  end

  def handle_info({:analysis_complete, {:error, reason}}, socket) do
    {:noreply,
     socket
     |> assign(:loading, false)
     |> assign(:error, inspect(reason))}
  end

  def render(assigns) do
    ~H"""
    <div class="max-w-4xl mx-auto p-6">
      <h1 class="text-3xl font-bold mb-6">Incident Response</h1>

      <!-- Input Form -->
      <form phx-submit="analyze" class="mb-8">
        <label class="block text-sm font-medium mb-2">
          Describe the incident
        </label>
        <textarea
          name="incident"
          rows="4"
          class="w-full border rounded-lg p-3 font-mono text-sm"
          placeholder="Our Athena queries for attendance analytics are timing out since this morning..."
          disabled={@loading}
        ><%= @incident %></textarea>
        <button
          type="submit"
          class="mt-3 bg-blue-600 text-white px-6 py-2 rounded-lg disabled:opacity-50"
          disabled={@loading}
        >
          <%= if @loading, do: "Analyzing...", else: "Analyze Incident" %>
        </button>
      </form>

      <!-- Error Display -->
      <%= if @error do %>
        <div class="bg-red-50 border border-red-200 rounded-lg p-4 mb-6">
          <p class="text-red-800"><%= @error %></p>
        </div>
      <% end %>

      <!-- Runbook Display -->
      <%= if @runbook do %>
        <div class="border rounded-lg p-6 mb-6">
          <h2 class="text-xl font-semibold mb-4">Generated Runbook</h2>

          <div class="grid grid-cols-2 gap-4 mb-4">
            <div>
              <span class="text-gray-500">Type:</span>
              <span class="font-medium ml-2"><%= @runbook.incident_type %></span>
            </div>
            <div>
              <span class="text-gray-500">System:</span>
              <span class="font-medium ml-2"><%= @runbook.affected_system %></span>
            </div>
            <div>
              <span class="text-gray-500">Confidence:</span>
              <span class="font-medium ml-2"><%= Float.round(@runbook.confidence * 100.0, 1) %>%</span>
            </div>
          </div>

          <div class="mb-4">
            <h3 class="font-medium mb-2">Hypotheses</h3>
            <ul class="list-disc list-inside text-gray-700">
              <%= for h <- @runbook.hypotheses do %>
                <li><%= h %></li>
              <% end %>
            </ul>
          </div>

          <div class="mb-4">
            <h3 class="font-medium mb-2">Recommended Actions</h3>
            <div class="space-y-2">
              <%= for action <- @runbook.recommended_actions do %>
                <div class="flex items-center justify-between bg-gray-50 p-3 rounded">
                  <div>
                    <span class="font-mono text-sm"><%= action.action %></span>
                    <%= if action.safe do %>
                      <span class="ml-2 text-xs bg-green-100 text-green-800 px-2 py-1 rounded">SAFE</span>
                    <% else %>
                      <span class="ml-2 text-xs bg-red-100 text-red-800 px-2 py-1 rounded">UNSAFE</span>
                    <% end %>
                  </div>
                  <%= if action.safe do %>
                    <button
                      phx-click="execute_action"
                      phx-value-action_id={action.id}
                      class="text-sm bg-blue-500 text-white px-3 py-1 rounded"
                    >
                      Execute
                    </button>
                  <% end %>
                </div>
              <% end %>
            </div>
          </div>
        </div>
      <% end %>

      <!-- Execution Results -->
      <%= if @results do %>
        <div class="border rounded-lg p-6 bg-gray-900 text-gray-100">
          <h2 class="text-xl font-semibold mb-4">Execution Results</h2>
          <pre class="font-mono text-sm whitespace-pre-wrap"><%= @results %></pre>
        </div>
      <% end %>
    </div>
    """
  end
end

Add the route in lib/incident_response_web/router.ex:

scope "/", IncidentResponseWeb do
  pipe_through :browser

  live "/", IncidentLive
end

Step 8: Deploy to Fly.io

Deploy the application:

fly deploy

Run the database migration:

fly ssh console -C "/app/bin/migrate"

Open the application:

fly open

Usage Example

Input an incident description:

Our Athena queries for attendance analytics are timing out since this morning.
Users report 30+ second waits. No recent deployments.

Claude generates a structured runbook:

{
  "incident_type": "performance_degradation",
  "affected_system": "athena",
  "confidence": 0.82,
  "hypotheses": [
    "partition explosion in attendance_facts table",
    "stale table statistics causing poor query plans",
    "unexpected data volume increase"
  ],
  "checks": [
    {
      "id": "check_partitions",
      "action": "database.table_stats",
      "params": {"table": "attendance_facts"}
    },
    {
      "id": "check_slow_queries",
      "action": "database.slow_queries",
      "params": {"lookback_minutes": 180}
    }
  ],
  "recommended_actions": [
    {
      "id": "recompute_stats",
      "action": "database.analyze_table",
      "params": {"table": "attendance_facts"},
      "safe": true
    },
    {
      "id": "notify_team",
      "action": "alert.notify",
      "params": {"channel": "#data-platform", "message": "Athena perf issue under investigation"},
      "safe": true
    }
  ],
  "rollback_plan": "No destructive actions proposed"
}

The Native executor validates and runs checks:

=== Runbook Execution Results ===

Incident Type: performance_degradation
Affected System: athena
Confidence: 82.0%

Hypotheses:
  - partition explosion in attendance_facts table
  - stale table statistics causing poor query plans
  - unexpected data volume increase

Check Results:
  [OK] check_partitions: Table attendance_facts: 18,432 partitions, last analyzed 3 days ago
  [OK] check_slow_queries: Found 3 slow queries in last 180 minutes (avg 4.2s)

Recommended Actions:
  - recompute_stats: database.analyze_table [SAFE]
  - notify_team: alert.notify [SAFE]

Rollback Plan: No destructive actions proposed

Why This Pattern Works

Clear Separation of Concerns

| Phase | Owner | Responsibility |
|---|---|---|
| Reasoning | Claude | Analyze problem, generate hypotheses |
| Planning | Claude | Structure checks and actions |
| Validation | Native | Schema enforcement, fail fast |
| Safety | Native | Action allow-listing |
| Execution | Native | Deterministic, auditable |

Artifact-Driven

The JSON runbook is:

  • Inspectable: Review before execution
  • Testable: Unit test the schema
  • Replayable: Re-run checks without Claude
  • Auditable: Store in Tigris for compliance

Production-Ready

  • No LLM executing infrastructure changes
  • Native code is the trust boundary
  • Safe actions clearly marked
  • Unsafe actions require explicit approval
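
The trust-boundary points above amount to a fail-closed gate: an action runs only if it is on the allow-list and explicitly marked safe. A minimal standalone sketch of that rule follows. `SafetyGateSketch` is a hypothetical module, and its allow-list is copied from the executor in Step 5.

```elixir
defmodule SafetyGateSketch do
  # Allow-list copied from the executor in Step 5.
  @allowed_actions ~w(database.analyze_table cache.invalidate alert.notify)

  # Permit only allow-listed actions that carry an explicit safe: true.
  def check(%{action: action, safe: true}) when action in @allowed_actions, do: :ok

  # Anything off the allow-list is refused, whatever its safe flag says.
  def check(%{action: action}) when action not in @allowed_actions,
    do: {:error, "action not in allow-list: #{action}"}

  # Everything else (safe: false, missing flag, non-boolean flag) is refused.
  def check(_action), do: {:error, "action is not marked safe; manual approval required"}
end
```

The allow-list is checked independently of the `safe` flag, so a model that labels `service.restart` as safe still cannot get it executed.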

Testing Locally

For local development without Fly.io:

# Set API key
export ANTHROPIC_API_KEY=sk-ant-...

# Run the app
mix phx.server

Test the agent directly in IEx:

iex> IncidentResponse.Agent.analyze("Database queries are slow")
{:ok, %{runbook: %{...}, artifact_ref: nil, execution_results: "..."}}

Next Steps

  • Add real service integrations (replace simulated checks)
  • Implement approval workflow for unsafe actions
  • Add runbook versioning and history
  • Create Slack/PagerDuty integration for alert.notify
  • Add telemetry dashboards with Fly.io Metrics

See Also