FlyDeploy.BlueGreen.PeerManager (FlyDeploy v0.4.1)


Manages the lifecycle of peer BEAM nodes for blue-green deploys.

Architecture Overview

Blue-green mode runs two BEAM layers on a single Fly machine: a parent node that never serves traffic, and a peer node (a child BEAM process started via OTP's :peer module) that runs the user's full application and binds the HTTP port. On upgrade, a new peer boots with new code (its Endpoint binds via SO_REUSEPORT alongside the old), the old peer's Endpoint is stopped, and the old peer is terminated.

Fly Machine (single VM instance)
│
├─ Parent BEAM (long-lived, never serves traffic)
│   └─ BlueGreen.Supervisor
│       ├─ PeerManager                         ← this module
│       │   ├─ starts/stops peer BEAM processes via :peer
│       │   ├─ handles cutover (stop old Endpoint)
│       │   └─ on startup, checks S3 for pending blue-green reapply
│       └─ Poller (mode: :blue_green)
│           ├─ polls S3 "blue_green_upgrade" field
│           └─ on change → calls PeerManager.upgrade(tarball_url)
│
│   (no Endpoint, no Repo, no app processes)
│
└─ Peer BEAM (child process, serves all traffic)
    └─ User's full supervision tree
        ├─ FlyDeploy Poller (mode: :hot)       ← polls "hot_upgrade" field
        │   ├─ applies hot code upgrades in-place inside the peer
        │   └─ on startup, checks S3 for pending hot upgrade reapply
        ├─ Repo, PubSub, Counter, ...
        └─ Endpoint                            ← binds port via reuseport

    Code loaded from /tmp/fly_deploy_bg_<ts>/ (not /app/)

How Peers Are Started

Each peer is a separate OS process started via :peer.start/1. The parent:

  1. Finds the bundled erl binary from the ERTS directory
  2. Builds exec args: -boot start_clean (bypasses the release boot script), -config (sys.config from the release or extracted tarball), -args_file (vm.args)
  3. Passes -pa flags for all code paths (ebin directories)
  4. Calls :peer.start(%{name: ..., exec: {erl, args}, ...})
  5. Boots Elixir + Logger via :erpc.call
  6. Marks the peer with Application.put_env(:fly_deploy, :__role__, :peer)
  7. Injects SO_REUSEPORT config so the Endpoint can bind alongside an existing peer
  8. Calls ensure_all_started(otp_app) (blocking — returns when fully started)

Why -boot start_clean? Without it, :peer inherits the parent's release boot script, which auto-starts all apps before we can mark __role__: :peer and computes node names from FLY_IMAGE_REF, producing invalid node names.
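A minimal sketch of steps 1–8, assuming `code_paths` (the extracted ebin directories) and `otp_app` are already in scope. The helper shapes are illustrative, not FlyDeploy's actual internals, and the -config/-args_file flags are omitted for brevity:

    # Locate the bundled erl from the ERTS directory of the running release.
    erts = "erts-" <> List.to_string(:erlang.system_info(:version))
    erl = Path.join([to_string(:code.root_dir()), erts, "bin", "erl"])

    # -boot start_clean bypasses the release boot script; -pa adds code paths.
    args = ["-boot", "start_clean"] ++ Enum.flat_map(code_paths, &["-pa", &1])

    {:ok, _pid, peer} =
      :peer.start(%{
        name: :"fly_peer_#{System.os_time(:second)}",
        exec: {String.to_charlist(erl), Enum.map(args, &String.to_charlist/1)}
      })

    # Boot Elixir + Logger, mark the role, then start the app (blocking).
    {:ok, _} = :erpc.call(peer, :application, :ensure_all_started, [:elixir])
    {:ok, _} = :erpc.call(peer, :application, :ensure_all_started, [:logger])
    :ok = :erpc.call(peer, Application, :put_env, [:fly_deploy, :__role__, :peer])
    {:ok, _} = :erpc.call(peer, Application, :ensure_all_started, [otp_app])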

Blue-Green Upgrade Flow

When the parent's Poller detects a new "blue_green_upgrade" in S3:

Poller ──→ PeerManager.upgrade(tarball_url)
             │
             ├─ 1. Download tarball from S3
             ├─ 2. Extract to /tmp/fly_deploy_bg_<ts>/
             ├─ 3. Build code paths from extracted ebin dirs
             ├─ 4. Start new peer with new code paths
             │      └─ Peer fully boots (Endpoint binds via reuseport)
             ├─ 5. Stop old peer's Endpoint
             └─ 6. Stop old peer entirely

Key properties:

  • Zero downtime: Both old and new Endpoints serve simultaneously via SO_REUSEPORT during the brief overlap, then old Endpoint stops.
  • Clean state: The new peer starts fresh — no code_change/3, no state migration. This is the key difference from hot upgrades.
  • New PID: Every process gets a new PID (new BEAM process).
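The six steps above could be sketched as follows. This is a hypothetical outline; every helper name (download!/1, start_peer/2, stop_endpoint/1) is an assumption, not FlyDeploy's real API:

    def do_upgrade(tarball_url, state) do
      dir = "/tmp/fly_deploy_bg_#{System.os_time(:second)}"
      tarball = download!(tarball_url)                                  # 1. fetch from S3
      :ok = :erl_tar.extract(tarball, [:compressed, cwd: String.to_charlist(dir)])  # 2. extract
      code_paths = Path.wildcard(Path.join(dir, "lib/*/ebin"))          # 3. ebin dirs

      {:ok, new_peer} = start_peer(state.otp_app, code_paths)           # 4. blocks until Endpoint is up

      stop_endpoint(state.peer)                                         # 5. old Endpoint stops serving
      :peer.stop(state.peer_pid)                                        # 6. old peer BEAM terminates
      %{state | peer: new_peer}
    end

Note the ordering: the new Endpoint is already bound (via SO_REUSEPORT) before step 5, which is what makes the cutover gapless.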

Hot Upgrades Inside Peers

The peer runs its own FlyDeploy.Poller with mode: :hot (started as {FlyDeploy, otp_app: :my_app} in the user's supervision tree). This Poller polls the "hot_upgrade" field in S3, completely independent of the parent's Poller which watches "blue_green_upgrade".
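In the user's application this is an ordinary child spec (assuming an app named :my_app with a Phoenix endpoint; the surrounding children are illustrative):

    # In MyApp.Application — this whole tree runs inside the peer BEAM.
    def start(_type, _args) do
      children = [
        MyApp.Repo,
        {Phoenix.PubSub, name: MyApp.PubSub},
        {FlyDeploy, otp_app: :my_app},   # Poller in mode: :hot inside the peer
        MyAppWeb.Endpoint
      ]

      Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
    end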

When a hot upgrade is detected inside the peer:

Peer's Poller ──→ FlyDeploy.hot_upgrade(tarball_url, app)
                    │
                    ├─ Download tarball from S3
                    ├─ Copy .beam files to where :code.which() says
                    │   they're loaded (/tmp/fly_deploy_bg_<ts>/lib/...)
                    ├─ Detect changed modules via :code.modified_modules()
                    ├─ Phase 1: Suspend ALL processes using changed modules
                    ├─ Phase 2: Purge + load ALL new code
                    ├─ Phase 3: :sys.change_code on ALL processes
                    └─ Phase 4: Resume ALL processes

This works because the Upgrader uses :code.which(module) to find where each module is currently loaded from, then copies new beams to that same path. Whether the peer loaded code from /app/lib/ or /tmp/fly_deploy_bg_<ts>/lib/, the hot upgrade lands in the right place.
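A sketch of that placement rule (the function name is hypothetical; :code.which/1 is the real OTP call):

    # Land a new .beam exactly where the module is currently loaded
    # from, so the same logic works for /app/ and /tmp/ code paths.
    def place_beam(module, new_beam_path) do
      case :code.which(module) do
        path when is_list(path) ->
          File.cp!(new_beam_path, to_string(path))

        _other ->
          # :non_existing, :preloaded, or :cover_compiled — nothing to overwrite
          :skip
      end
    end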

S3 State: Separate Fields

The deployment metadata in S3 (releases/<app>-current.json) has two independent fields so blue-green and hot upgrades coexist:

{
  "image_ref": "registry.fly.io/app:deployment-ABC",
  "blue_green_upgrade": {
    "tarball_url": "https://s3/.../app-0.2.0.tar.gz",
    "source_image_ref": "registry.fly.io/app:deployment-DEF",
    ...
  },
  "hot_upgrade": {
    "tarball_url": "https://s3/.../app-0.2.1.tar.gz",
    "source_image_ref": "registry.fly.io/app:deployment-GHI",
    ...
  }
}

Rules:

  • mix fly_deploy.hot (default mode) → writes "hot_upgrade", preserves "blue_green_upgrade"
  • mix fly_deploy.hot --mode blue_green → writes "blue_green_upgrade", clears "hot_upgrade" (new peer = fresh start, old hot patches subsumed)
  • fly deploy (cold deploy) → machines detect image_ref mismatch and reset both fields to nil
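The three rules above could be encoded like this (illustrative; the string keys match the JSON fields, the function shape is an assumption):

    def apply_rule(metadata, :hot, upgrade),
      do: Map.put(metadata, "hot_upgrade", upgrade)        # "blue_green_upgrade" preserved

    def apply_rule(metadata, :blue_green, upgrade) do
      metadata
      |> Map.put("blue_green_upgrade", upgrade)
      |> Map.put("hot_upgrade", nil)                       # old hot patches subsumed by fresh peer
    end

    def apply_rule(metadata, :cold_deploy, image_ref) do
      metadata
      |> Map.put("image_ref", image_ref)
      |> Map.put("blue_green_upgrade", nil)                # image changed: both overlays reset
      |> Map.put("hot_upgrade", nil)
    end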

Restart Reapply Flow

When a Fly machine restarts (crash, scaling, fly machine restart), both layers are reapplied from S3:

Machine restarts
  │
  ├─ Parent boots
  │   └─ BlueGreen.Supervisor starts
  │       ├─ PeerManager.init
  │       │   ├─ resolve_startup_code(otp_app)
  │       │   │   └─ Reads S3 "blue_green_upgrade" field
  │       │   │       └─ Downloads tarball → extracts to /tmp/bg_<ts>/
  │       │   └─ start_peer(otp_app, new_code_paths)
  │       │       └─ Peer boots with /tmp/bg_<ts>/ code (v2)
  │       │           ├─ {FlyDeploy, otp_app: :app} starts Poller (mode: :hot)
  │       │           │   └─ startup_apply_current reads S3 "hot_upgrade"
  │       │           │       ├─ Downloads v2-hot tarball
  │       │           │       ├─ Copies beams to /tmp/bg_<ts>/ paths
  │       │           │       ├─ Loads via :c.lm() (no suspend at startup)
  │       │           │       └─ Peer now running v2-hot code
  │       │           ├─ Counter, Repo, PubSub, ...
  │       │           └─ Endpoint (binds port via reuseport)
  │       └─ Poller (mode: :blue_green)
  │           └─ Polls for future blue-green upgrades
  │
  └─ Result: machine serves v2-hot traffic (blue-green base + hot overlay)

Cutover Details

With SO_REUSEPORT, both old and new Endpoints bind the same port simultaneously. The new peer's Endpoint starts during start_peer (blocking erpc.call). Once it's up, we just stop the old Endpoint. There is zero gap — both peers serve traffic during the overlap.
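One plausible shape for the injected config, assuming a Cowboy-style transport_options nesting (the exact keys depend on the HTTP adapter, and MyAppWeb.Endpoint / port 4000 are placeholders). The raw tuple uses the Linux constants SOL_SOCKET = 1, SO_REUSEPORT = 15:

    # Set SO_REUSEPORT on the listen socket so the new peer's Endpoint
    # can bind the same port while the old peer is still serving.
    reuseport = {:raw, 1, 15, <<1::32-native>>}

    :erpc.call(peer_node, Application, :put_env, [
      :my_app,
      MyAppWeb.Endpoint,
      [http: [port: 4000, transport_options: [socket_opts: [reuseport]]]]
    ])

In practice the injected options would be merged with the peer's existing Endpoint config rather than replacing it wholesale.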

Why the Parent Never Serves Traffic

The parent node's only job is process management:

  • Start/stop peer BEAM processes
  • Poll S3 for blue-green upgrades
  • Coordinate cutover

It has no Repo, no Endpoint, no business logic processes. This means:

  • Parent crashes don't affect traffic (peer keeps running independently)
  • Parent restarts cleanly without port conflicts
  • Upgrade logic is isolated from application logic

Tarball Types

PeerManager handles two tarball formats:

  • Full release (blue-green mode): Contains lib/ + releases/ (sys.config, vm.args, boot files, consolidated protocols). The peer uses 100% new code paths — no mixing with the parent's code.

  • Beam-only (hot mode, fallback): Contains just .beam files and consolidated protocols. Merged with the parent's existing code paths (new ebin dirs replace matching app dirs).

Full release tarballs are detected by the presence of a releases/ directory with a sys.config file.
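That detection rule could be sketched as (function name hypothetical):

    # A "full release" tarball has releases/<vsn>/sys.config after
    # extraction; anything else is treated as beam-only.
    def tarball_type(extract_dir) do
      case Path.wildcard(Path.join(extract_dir, "releases/*/sys.config")) do
        [] -> :beam_only
        [_ | _] -> :full_release
      end
    end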

Summary

Functions

child_spec(init_arg)
    Returns a specification to start this module under a supervisor.

get_info()
    Returns a status map for this machine's blue-green state.

peer_node()
    Returns the active peer's node name.

start_link(opts)

upgrade(tarball_url)
    Triggers a blue-green upgrade with new code paths.

upgrading?()
    Returns true if a blue-green upgrade is currently in progress.

Functions

child_spec(init_arg)

Returns a specification to start this module under a supervisor.

See Supervisor.

get_info()

Returns a status map for this machine's blue-green state.

peer_node()

Returns the active peer's node name.

Useful for remsh-ing into the peer from the parent:

/app/bin/myapp rpc 'IO.puts(FlyDeploy.BlueGreen.PeerManager.peer_node())'
RELEASE_NODE=<output> /app/bin/myapp remote

start_link(opts)

upgrade(tarball_url)

Triggers a blue-green upgrade with new code paths.

Called by the Poller when it detects a new release in S3. Downloads the tarball, extracts it, and starts a new peer with the new code.

Returns {:error, :upgrade_in_progress} if an upgrade is already running.

upgrading?()

Returns true if a blue-green upgrade is currently in progress.

Uses :persistent_term so it's readable from any process without blocking on the PeerManager GenServer (which is busy doing the upgrade).
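A sketch of that pattern (the key shape is an assumption; :persistent_term.get/2 with a default avoids crashing before the first put):

    # Readable from any process, with no message to the busy GenServer.
    def upgrading?, do: :persistent_term.get({__MODULE__, :upgrading?}, false)

    # Inside the PeerManager, wrapped around the upgrade work:
    :persistent_term.put({__MODULE__, :upgrading?}, true)
    # ... download, extract, start new peer, cutover ...
    :persistent_term.put({__MODULE__, :upgrading?}, false)

:persistent_term writes are expensive (they trigger a global scan), but a flag flipped twice per upgrade is exactly the low-write/high-read workload the module is designed for.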