FlyDeploy.BlueGreen.PeerManager
(FlyDeploy v0.4.1)
Manages the lifecycle of peer BEAM nodes for blue-green deploys.
Architecture Overview
Blue-green mode runs two BEAM layers on a single Fly machine: a parent
node that never serves traffic, and a peer node (a child BEAM process
started via OTP's :peer module) that runs the user's full application and
binds the HTTP port. On upgrade, a new peer boots with new code (its
Endpoint binds via SO_REUSEPORT alongside the old), the old peer's Endpoint
is stopped, and the old peer is terminated.
┌─ Fly Machine (single VM instance) ──────────────────────────────────────┐
│ │
│ Parent BEAM (long-lived, never serves traffic) │
│ ├─ BlueGreen.Supervisor │
│ │ ├─ PeerManager ← this module │
│ │ │ • starts/stops peer BEAM processes via :peer │
│ │ │ • handles cutover (stop old Endpoint) │
│ │ │ • on startup, checks S3 for pending blue-green reapply │
│ │ │ │
│ │ └─ Poller (mode: :blue_green) │
│ │ • polls S3 "blue_green_upgrade" field │
│ │ • on change → calls PeerManager.upgrade(tarball_url) │
│ │ │
│ └─ (no Endpoint, no Repo, no app processes) │
│ │
│ Peer BEAM (child process, serves all traffic) │
│ ├─ User's full supervision tree │
│ │ ├─ FlyDeploy Poller (mode: :hot) ← polls "hot_upgrade" field │
│ │ │ • applies hot code upgrades in-place inside the peer │
│ │ │ • on startup, checks S3 for pending hot upgrade reapply │
│ │ ├─ Repo, PubSub, Counter, ... │
│ │ └─ Endpoint ← binds port via reuseport │
│ │ │
│ └─ Code loaded from /tmp/fly_deploy_bg_<ts>/ (not /app/) │
│ │
└─────────────────────────────────────────────────────────────────────────┘

How Peers Are Started
Each peer is a separate OS process started via :peer.start/1. The parent:
- Finds the bundled erl binary from the ERTS directory
- Builds exec args: -boot start_clean (bypasses the release boot script), -config (sys.config from the release or extracted tarball), -args_file (vm.args)
- Passes -pa flags for all code paths (ebin directories)
- Calls :peer.start(%{name: ..., exec: {erl, args}, ...})
- Boots Elixir + Logger via :erpc.call
- Marks the peer with Application.put_env(:fly_deploy, :__role__, :peer)
- Injects SO_REUSEPORT config so the Endpoint can bind alongside an existing peer
- Calls ensure_all_started(otp_app) (blocking; returns when fully started)
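The steps above can be sketched in code. This is an illustrative condensation, not the module's actual internals; the variables sys_config_path, vm_args_path, code_paths, and otp_app are assumed inputs:

```elixir
# Sketch of peer startup (hypothetical; inputs assumed to be in scope).
erl =
  Path.join([
    to_string(:code.root_dir()),
    "erts-#{:erlang.system_info(:version)}",
    "bin",
    "erl"
  ])

args =
  ["-boot", "start_clean",        # bypass the release boot script
   "-config", sys_config_path,    # sys.config from the release or tarball
   "-args_file", vm_args_path] ++
    Enum.flat_map(code_paths, &["-pa", &1])

{:ok, pid, node} =
  :peer.start(%{
    name: :"peer_#{System.os_time(:millisecond)}",
    exec: {to_charlist(erl), Enum.map(args, &to_charlist/1)}
  })

# Boot Elixir + Logger on the peer, mark its role, then start the app
# (blocking: ensure_all_started returns once the tree is fully up).
{:ok, _} = :erpc.call(node, :application, :ensure_all_started, [:elixir])
{:ok, _} = :erpc.call(node, :application, :ensure_all_started, [:logger])
:ok = :erpc.call(node, Application, :put_env, [:fly_deploy, :__role__, :peer])
{:ok, _} = :erpc.call(node, Application, :ensure_all_started, [otp_app])
```
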
Why -boot start_clean? Without it, :peer inherits the parent's release boot
script, which auto-starts all apps before we can mark __role__: :peer and
derives node names from FLY_IMAGE_REF, producing invalid names.
Blue-Green Upgrade Flow
When the parent's Poller detects a new "blue_green_upgrade" in S3:
Poller ──→ PeerManager.upgrade(tarball_url)
│
├─ 1. Download tarball from S3
├─ 2. Extract to /tmp/fly_deploy_bg_<ts>/
├─ 3. Build code paths from extracted ebin dirs
├─ 4. Start new peer with new code paths
│ └─ Peer fully boots (Endpoint binds via reuseport)
├─ 5. Stop old peer's Endpoint
└─ 6. Stop old peer entirely

Key properties:
- Zero downtime: Both old and new Endpoints serve simultaneously via SO_REUSEPORT during the brief overlap, then old Endpoint stops.
- Clean state: The new peer starts fresh, with no code_change/3 and no state migration. This is the key difference from hot upgrades.
- New PID: Every process gets a new PID (new BEAM process).
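The six steps can be sketched as one function. Names like download!/1, start_peer/2, and MyApp.Endpoint are illustrative stand-ins, and the old-peer state fields are assumed to come from the GenServer state:

```elixir
# Hypothetical condensed upgrade flow; helper names are illustrative.
def upgrade(tarball_url) do
  dest = "/tmp/fly_deploy_bg_#{System.os_time(:second)}"
  File.mkdir_p!(dest)

  # 1-2. Download the tarball from S3 and extract it
  tarball_path = download!(tarball_url)
  :ok = :erl_tar.extract(to_charlist(tarball_path), [:compressed, cwd: to_charlist(dest)])

  # 3. Every ebin directory in the extracted release becomes a code path
  code_paths = Path.wildcard(Path.join(dest, "lib/*/ebin"))

  # 4. Boot the new peer; its Endpoint binds alongside the old via reuseport
  {:ok, _new_peer} = start_peer(otp_app, code_paths)

  # 5. Stop the old peer's Endpoint (a supervisor, so Supervisor.stop works)
  :ok = :erpc.call(old_peer_node, Supervisor, :stop, [MyApp.Endpoint, :normal])

  # 6. Terminate the old peer BEAM process entirely
  :ok = :peer.stop(old_peer_pid)
end
```
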
Hot Upgrades Inside Peers
The peer runs its own FlyDeploy.Poller with mode: :hot (started as
{FlyDeploy, otp_app: :my_app} in the user's supervision tree). This
Poller polls the "hot_upgrade" field in S3, completely independent of
the parent's Poller which watches "blue_green_upgrade".
When a hot upgrade is detected inside the peer:
Peer's Poller ──→ FlyDeploy.hot_upgrade(tarball_url, app)
│
├─ Download tarball from S3
├─ Copy .beam files to where :code.which() says
│ they're loaded (/tmp/fly_deploy_bg_<ts>/lib/...)
├─ Detect changed modules via :code.modified_modules()
├─ Phase 1: Suspend ALL processes using changed modules
├─ Phase 2: Purge + load ALL new code
├─ Phase 3: :sys.change_code on ALL processes
└─ Phase 4: Resume ALL processes

This works because the Upgrader uses :code.which(module) to find where
each module is currently loaded from, then copies new beams to that same
path. Whether the peer loaded code from /app/lib/ or
/tmp/fly_deploy_bg_<ts>/lib/, the hot upgrade lands in the right place.
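The path-resolution trick can be sketched as a small helper (the function name install_beam/2 is hypothetical):

```elixir
# Sketch: copy a new .beam to wherever the module is currently loaded from,
# so the same logic works for /app/lib/... and /tmp/fly_deploy_bg_<ts>/lib/...
defp install_beam(module, new_beam_path) do
  case :code.which(module) do
    # :code.which/1 can also return :non_existing for unloaded modules
    :non_existing ->
      {:error, :not_loaded}

    current_path when is_list(current_path) ->
      # Overwrite in place; the later purge + load picks up the new bytecode
      File.cp!(new_beam_path, to_string(current_path))
      :ok
  end
end
```
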
S3 State: Separate Fields
The deployment metadata in S3 (releases/<app>-current.json) has two
independent fields so blue-green and hot upgrades coexist:
{
"image_ref": "registry.fly.io/app:deployment-ABC",
"blue_green_upgrade": {
"tarball_url": "https://s3/.../app-0.2.0.tar.gz",
"source_image_ref": "registry.fly.io/app:deployment-DEF",
...
},
"hot_upgrade": {
"tarball_url": "https://s3/.../app-0.2.1.tar.gz",
"source_image_ref": "registry.fly.io/app:deployment-GHI",
...
}
}

Rules:
- mix fly_deploy.hot (default mode) → writes "hot_upgrade", preserves "blue_green_upgrade"
- mix fly_deploy.hot --mode blue_green → writes "blue_green_upgrade", clears "hot_upgrade" (new peer = fresh start, old hot patches subsumed)
- fly deploy (cold deploy) → machines detect the image_ref mismatch and reset both fields to nil
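These rules could be expressed as a single pure function over the decoded metadata map (the function name update_metadata/3 is illustrative, not part of the library's API):

```elixir
# Illustrative sketch of the field-update rules per deploy mode.
defp update_metadata(meta, mode, upgrade_info) do
  case mode do
    :hot ->
      # Hot deploy: set "hot_upgrade", leave "blue_green_upgrade" untouched
      Map.put(meta, "hot_upgrade", upgrade_info)

    :blue_green ->
      # Blue-green deploy: the new peer starts fresh, so any previously
      # applied hot patches are subsumed and the field is cleared
      meta
      |> Map.put("blue_green_upgrade", upgrade_info)
      |> Map.put("hot_upgrade", nil)
  end
end
```

Cold deploys need no write here: machines notice the image_ref mismatch themselves and reset both fields.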
Restart Reapply Flow
When a Fly machine restarts (crash, scaling, fly machine restart), both
layers are reapplied from S3:
Machine restarts
│
├─ Parent boots
│ └─ BlueGreen.Supervisor starts
│ ├─ PeerManager.init
│ │ ├─ resolve_startup_code(otp_app)
│ │ │ └─ Reads S3 "blue_green_upgrade" field
│ │ │ → Downloads tarball → extracts to /tmp/bg_<ts>/
│ │ ├─ start_peer(otp_app, new_code_paths)
│ │ │ └─ Peer boots with /tmp/bg_<ts>/ code (v2)
│ │ │ ├─ {FlyDeploy, otp_app: :app} starts Poller (mode: :hot)
│ │ │ │ └─ startup_apply_current reads S3 "hot_upgrade"
│ │ │ │ → Downloads v2-hot tarball
│ │ │ │ → Copies beams to /tmp/bg_<ts>/ paths
│ │ │ │ → Loads via :c.lm() (no suspend at startup)
│ │ │ │ → Peer now running v2-hot code
│ │ │ ├─ Counter, Repo, PubSub, ...
│ │ │ └─ Endpoint (binds port via reuseport)
│ │
│ └─ Poller (mode: :blue_green)
│ └─ Polls for future blue-green upgrades
│
└─ Result: machine serves v2-hot traffic (blue-green base + hot overlay)

Cutover Details
With SO_REUSEPORT, both old and new Endpoints bind the same port
simultaneously. The new peer's Endpoint starts during start_peer
(blocking erpc.call). Once it's up, we just stop the old Endpoint.
There is zero gap — both peers serve traffic during the overlap.
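As a sketch, SO_REUSEPORT can be expressed as a raw inet option, and the cutover itself is one remote call. Both snippets are assumptions: the constants are Linux-specific, how the option is threaded into the Endpoint's listener depends on the HTTP server in use, and MyAppWeb.Endpoint is a hypothetical module name:

```elixir
# SO_REUSEPORT as a raw socket option on Linux:
# protocol level SOL_SOCKET (1), option SO_REUSEPORT (15), value 1.
reuseport_opt = {:raw, 1, 15, <<1::32-native>>}

# Cutover: once the new peer's Endpoint is serving, stop the old one.
# A Phoenix Endpoint is a supervisor registered under its module name,
# so Supervisor.stop/2 over :erpc is enough; the OS then routes all new
# connections to the remaining reuseport socket.
:ok = :erpc.call(old_peer_node, Supervisor, :stop, [MyAppWeb.Endpoint, :normal])
```
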
Why the Parent Never Serves Traffic
The parent node's only job is process management:
- Start/stop peer BEAM processes
- Poll S3 for blue-green upgrades
- Coordinate cutover
It has no Repo, no Endpoint, no business logic processes. This means:
- Parent crashes don't affect traffic (peer keeps running independently)
- Parent restarts cleanly without port conflicts
- Upgrade logic is isolated from application logic
Tarball Types
PeerManager handles two tarball formats:
- Full release (blue-green mode): Contains lib/ + releases/ (sys.config, vm.args, boot files, consolidated protocols). The peer uses 100% new code paths, with no mixing with the parent's code.
- Beam-only (hot mode, fallback): Contains just .beam files and consolidated protocols. Merged with the parent's existing code paths (new ebin dirs replace matching app dirs).
Full release tarballs are detected by the presence of a releases/
directory with a sys.config file.
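The detection rule could be sketched as follows (the function name full_release?/1 is hypothetical; the releases/<version>/ layout of standard Mix releases is assumed):

```elixir
# Sketch: a tarball is a full release if it extracted a releases/ directory
# containing a sys.config under some version subdirectory.
defp full_release?(extracted_dir) do
  File.dir?(Path.join(extracted_dir, "releases")) and
    Path.wildcard(Path.join(extracted_dir, "releases/*/sys.config")) != []
end
```
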
Summary
Functions
Returns a specification to start this module under a supervisor.
Returns a status map for this machine's blue-green state.
Returns the active peer's node name.
Triggers a blue-green upgrade with new code paths.
Returns true if a blue-green upgrade is currently in progress.
Functions
Returns a specification to start this module under a supervisor.
See Supervisor.
Returns a status map for this machine's blue-green state.
Returns the active peer's node name.
Useful for remsh-ing into the peer from the parent:
/app/bin/myapp rpc 'IO.puts(FlyDeploy.BlueGreen.PeerManager.peer_node())'
RELEASE_NODE=<output> /app/bin/myapp remote
Triggers a blue-green upgrade with new code paths.
Called by the Poller when it detects a new release in S3. Downloads the tarball, extracts it, and starts a new peer with the new code.
Returns {:error, :upgrade_in_progress} if an upgrade is already running.
Returns true if a blue-green upgrade is currently in progress.
Uses :persistent_term so it's readable from any process without
blocking on the PeerManager GenServer (which is busy doing the upgrade).
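A minimal sketch of that pattern, assuming a key shape the library may not actually use:

```elixir
# Hypothetical sketch: a flag any process can read without calling into the
# busy PeerManager GenServer. The key tuple is an assumption.
@upgrading_key {:fly_deploy, :bg_upgrading?}

def upgrading?, do: :persistent_term.get(@upgrading_key, false)

# Inside the PeerManager, bracketing the upgrade work:
def handle_call({:upgrade, tarball_url}, _from, state) do
  :persistent_term.put(@upgrading_key, true)
  result = do_upgrade(tarball_url, state)   # download, extract, start peer, cutover
  :persistent_term.put(@upgrading_key, false)
  {:reply, result, state}
end
```

Note that :persistent_term.put/2 triggers a global GC scan on each write, which is acceptable here because the flag flips only once per deploy.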