FlyDeploy.BlueGreen.PeerManager (FlyDeploy v0.4.1)


Manages the lifecycle of peer BEAM nodes for blue-green deploys.

Architecture Overview

Blue-green mode runs two BEAM layers on a single Fly machine: a parent node that never serves traffic, and a peer node (a child BEAM process started via OTP's :peer module) that runs the user's full application and binds the HTTP port. On upgrade, a new peer boots with new code (its Endpoint binds via SO_REUSEPORT alongside the old), the old peer's Endpoint is stopped, and the old peer is terminated.

Fly Machine (single VM instance)
│
├─ Parent BEAM (long-lived, never serves traffic)
│   └─ BlueGreen.Supervisor
│       ├─ PeerManager                         ← this module
│       │   ├─ starts/stops peer BEAM processes via :peer
│       │   ├─ handles cutover (stop old Endpoint)
│       │   └─ on startup, checks S3 for pending blue-green reapply
│       └─ Poller (mode: :blue_green)
│           ├─ polls S3 "blue_green_upgrade" field
│           └─ on change → calls PeerManager.upgrade(tarball_url)
│
│   (no Endpoint, no Repo, no app processes)
│
└─ Peer BEAM (child process, serves all traffic)
    └─ User's full supervision tree
        ├─ FlyDeploy Poller (mode: :hot)       ← polls "hot_upgrade" field
        │   ├─ applies hot code upgrades in-place inside the peer
        │   └─ on startup, checks S3 for pending hot upgrade reapply
        ├─ Repo, PubSub, Counter, ...
        └─ Endpoint                            ← binds port via reuseport

    Code loaded from /tmp/fly_deploy_bg_<ts>/ (not /app/)

How Peers Are Started

Each peer is a separate OS process started via :peer.start/1. The parent:

  1. Finds the bundled erl binary from the ERTS directory
  2. Builds exec args: -boot start_clean (bypasses the release boot script), -config (sys.config from the release or extracted tarball), -args_file (vm.args)
  3. Passes -pa flags for all code paths (ebin directories)
  4. Calls :peer.start(%{name: ..., exec: {erl, args}, ...})
  5. Boots Elixir + Logger via :erpc.call
  6. Marks the peer with Application.put_env(:fly_deploy, :__role__, :peer)
  7. Injects SO_REUSEPORT config so the Endpoint can bind alongside an existing peer
  8. Calls ensure_all_started(otp_app) (blocking — returns when fully started)

Why -boot start_clean? Without it, :peer inherits the parent's release boot script, which auto-starts all apps before we can mark __role__: :peer and computes node names from FLY_IMAGE_REF, producing invalid node names.
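A minimal sketch of steps 1–8, assuming `code_paths` (the extracted ebin directories) and `otp_app` are already in scope. The helper shapes are illustrative, not FlyDeploy's actual internals, and the -config/-args_file flags are omitted for brevity:

    # Locate the bundled erl from the ERTS directory of the running release.
    erts = "erts-" <> List.to_string(:erlang.system_info(:version))
    erl = Path.join([to_string(:code.root_dir()), erts, "bin", "erl"])

    # -boot start_clean bypasses the release boot script; -pa adds code paths.
    args = ["-boot", "start_clean"] ++ Enum.flat_map(code_paths, &["-pa", &1])

    {:ok, _pid, peer} =
      :peer.start(%{
        name: :"fly_peer_#{System.os_time(:second)}",
        exec: {String.to_charlist(erl), Enum.map(args, &String.to_charlist/1)}
      })

    # Boot Elixir + Logger, mark the role, then start the app (blocking).
    {:ok, _} = :erpc.call(peer, :application, :ensure_all_started, [:elixir])
    {:ok, _} = :erpc.call(peer, :application, :ensure_all_started, [:logger])
    :ok = :erpc.call(peer, Application, :put_env, [:fly_deploy, :__role__, :peer])
    {:ok, _} = :erpc.call(peer, Application, :ensure_all_started, [otp_app])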

Blue-Green Upgrade Flow

When the parent's Poller detects a new "blue_green_upgrade" in S3:

Poller ──→ PeerManager.upgrade(tarball_url)
             │
             ├─ 1. Download tarball from S3
             ├─ 2. Extract to /tmp/fly_deploy_bg_<ts>/
             ├─ 3. Build code paths from extracted ebin dirs
             ├─ 4. Start new peer with new code paths
             │      └─ Peer fully boots (Endpoint binds via reuseport)
             ├─ 5. Stop old peer's Endpoint
             └─ 6. Stop old peer entirely

Key properties:

  • Zero downtime: Both old and new Endpoints serve simultaneously via SO_REUSEPORT during the brief overlap, then old Endpoint stops.
  • Clean state: The new peer starts fresh — no code_change/3, no state migration. This is the key difference from hot upgrades.
  • New PID: Every process gets a new PID (new BEAM process).
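The six steps above could be sketched as follows. This is a hypothetical outline; every helper name (download!/1, start_peer/2, stop_endpoint/1) is an assumption, not FlyDeploy's real API:

    def do_upgrade(tarball_url, state) do
      dir = "/tmp/fly_deploy_bg_#{System.os_time(:second)}"
      tarball = download!(tarball_url)                                  # 1. fetch from S3
      :ok = :erl_tar.extract(tarball, [:compressed, cwd: String.to_charlist(dir)])  # 2. extract
      code_paths = Path.wildcard(Path.join(dir, "lib/*/ebin"))          # 3. ebin dirs

      {:ok, new_peer} = start_peer(state.otp_app, code_paths)           # 4. blocks until Endpoint is up

      stop_endpoint(state.peer)                                         # 5. old Endpoint stops serving
      :peer.stop(state.peer_pid)                                        # 6. old peer BEAM terminates
      %{state | peer: new_peer}
    end

Note the ordering: the new Endpoint is already bound (via SO_REUSEPORT) before step 5, which is what makes the cutover gapless.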

Hot Upgrades Inside Peers

The peer runs its own FlyDeploy.Poller with mode: :hot (started as {FlyDeploy, otp_app: :my_app} in the user's supervision tree). This Poller polls the "hot_upgrade" field in S3, completely independent of the parent's Poller which watches "blue_green_upgrade".
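In the user's application this is an ordinary child spec (assuming an app named :my_app with a Phoenix endpoint; the surrounding children are illustrative):

    # In MyApp.Application — this whole tree runs inside the peer BEAM.
    def start(_type, _args) do
      children = [
        MyApp.Repo,
        {Phoenix.PubSub, name: MyApp.PubSub},
        {FlyDeploy, otp_app: :my_app},   # Poller in mode: :hot inside the peer
        MyAppWeb.Endpoint
      ]

      Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
    end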

When a hot upgrade is detected inside the peer:

Peer's Poller ──→ FlyDeploy.hot_upgrade(tarball_url, app)
                    │
                    ├─ Download tarball from S3
                    ├─ Copy .beam files to where :code.which() says
                    │   they're loaded (/tmp/fly_deploy_bg_<ts>/lib/...)
                    ├─ Detect changed modules via :code.modified_modules()
                    ├─ Phase 1: Suspend ALL processes using changed modules
                    ├─ Phase 2: Purge + load ALL new code
                    ├─ Phase 3: :sys.change_code on ALL processes
                    └─ Phase 4: Resume ALL processes

This works because the Upgrader uses :code.which(module) to find where each module is currently loaded from, then copies new beams to that same path. Whether the peer loaded code from /app/lib/ or /tmp/fly_deploy_bg_<ts>/lib/, the hot upgrade lands in the right place.
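A sketch of that placement rule (the function name is hypothetical; :code.which/1 is the real OTP call):

    # Land a new .beam exactly where the module is currently loaded
    # from, so the same logic works for /app/ and /tmp/ code paths.
    def place_beam(module, new_beam_path) do
      case :code.which(module) do
        path when is_list(path) ->
          File.cp!(new_beam_path, to_string(path))

        _other ->
          # :non_existing, :preloaded, or :cover_compiled — nothing to overwrite
          :skip
      end
    end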

S3 State: Separate Fields

The deployment metadata in S3 (releases/<app>-current.json) has two independent fields so blue-green and hot upgrades coexist:

{
  "image_ref": "registry.fly.io/app:deployment-ABC",
  "blue_green_upgrade": {
    "tarball_url": "https://s3/.../app-0.2.0.tar.gz",
    "source_image_ref": "registry.fly.io/app:deployment-DEF",
    ...
  },
  "hot_upgrade": {
    "tarball_url": "https://s3/.../app-0.2.1.tar.gz",
    "source_image_ref": "registry.fly.io/app:deployment-GHI",
    ...
  }
}

Rules:

  • mix fly_deploy.hot (default mode) → writes "hot_upgrade", preserves "blue_green_upgrade"
  • mix fly_deploy.hot --mode blue_green → writes "blue_green_upgrade", clears "hot_upgrade" (new peer = fresh start, old hot patches subsumed)
  • fly deploy (cold deploy) → machines detect image_ref mismatch and reset both fields to nil
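The three rules above could be encoded like this (illustrative; the string keys match the JSON fields, the function shape is an assumption):

    def apply_rule(metadata, :hot, upgrade),
      do: Map.put(metadata, "hot_upgrade", upgrade)        # "blue_green_upgrade" preserved

    def apply_rule(metadata, :blue_green, upgrade) do
      metadata
      |> Map.put("blue_green_upgrade", upgrade)
      |> Map.put("hot_upgrade", nil)                       # old hot patches subsumed by fresh peer
    end

    def apply_rule(metadata, :cold_deploy, image_ref) do
      metadata
      |> Map.put("image_ref", image_ref)
      |> Map.put("blue_green_upgrade", nil)                # image changed: both overlays reset
      |> Map.put("hot_upgrade", nil)
    end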

Restart Reapply Flow

When a Fly machine restarts (crash, scaling, fly machine restart), both layers are reapplied from S3:

Machine restarts
  │
  ├─ Parent boots
  │   └─ BlueGreen.Supervisor starts
  │       ├─ PeerManager.init
  │       │   ├─ resolve_startup_code(otp_app)
  │       │   │   └─ Reads S3 "blue_green_upgrade" field
  │       │   │       └─ Downloads tarball → extracts to /tmp/bg_<ts>/
  │       │   └─ start_peer(otp_app, new_code_paths)
  │       │       └─ Peer boots with /tmp/bg_<ts>/ code (v2)
  │       │           ├─ {FlyDeploy, otp_app: :app} starts Poller (mode: :hot)
  │       │           │   └─ startup_apply_current reads S3 "hot_upgrade"
  │       │           │       ├─ Downloads v2-hot tarball
  │       │           │       ├─ Copies beams to /tmp/bg_<ts>/ paths
  │       │           │       ├─ Loads via :c.lm() (no suspend at startup)
  │       │           │       └─ Peer now running v2-hot code
  │       │           ├─ Counter, Repo, PubSub, ...
  │       │           └─ Endpoint (binds port via reuseport)
  │       └─ Poller (mode: :blue_green)
  │           └─ Polls for future blue-green upgrades
  │
  └─ Result: machine serves v2-hot traffic (blue-green base + hot overlay)

Cutover Details

With SO_REUSEPORT, both old and new Endpoints bind the same port simultaneously. The new peer's Endpoint starts during start_peer (blocking erpc.call). Once it's up, we just stop the old Endpoint. There is zero gap — both peers serve traffic during the overlap.
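One plausible shape for the injected config, assuming a Cowboy-style transport_options nesting (the exact keys depend on the HTTP adapter, and MyAppWeb.Endpoint / port 4000 are placeholders). The raw tuple uses the Linux constants SOL_SOCKET = 1, SO_REUSEPORT = 15:

    # Set SO_REUSEPORT on the listen socket so the new peer's Endpoint
    # can bind the same port while the old peer is still serving.
    reuseport = {:raw, 1, 15, <<1::32-native>>}

    :erpc.call(peer_node, Application, :put_env, [
      :my_app,
      MyAppWeb.Endpoint,
      [http: [port: 4000, transport_options: [socket_opts: [reuseport]]]]
    ])

In practice the injected options would be merged with the peer's existing Endpoint config rather than replacing it wholesale.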

Why the Parent Never Serves Traffic

The parent node's only job is process management:

  • Start/stop peer BEAM processes
  • Poll S3 for blue-green upgrades
  • Coordinate cutover

It has no Repo, no Endpoint, no business logic processes. This means:

  • Parent crashes don't affect traffic (peer keeps running independently)
  • Parent restarts cleanly without port conflicts
  • Upgrade logic is isolated from application logic

Tarball Types

PeerManager handles two tarball formats:

  • Full release (blue-green mode): Contains lib/ + releases/ (sys.config, vm.args, boot files, consolidated protocols). The peer uses 100% new code paths — no mixing with the parent's code.

  • Beam-only (hot mode, fallback): Contains just .beam files and consolidated protocols. Merged with the parent's existing code paths (new ebin dirs replace matching app dirs).

Full release tarballs are detected by the presence of a releases/ directory with a sys.config file.
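That detection rule could be sketched as (function name hypothetical):

    # A "full release" tarball has releases/<vsn>/sys.config after
    # extraction; anything else is treated as beam-only.
    def tarball_type(extract_dir) do
      case Path.wildcard(Path.join(extract_dir, "releases/*/sys.config")) do
        [] -> :beam_only
        [_ | _] -> :full_release
      end
    end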

Summary

Functions

child_spec(init_arg)
    Returns a specification to start this module under a supervisor.

get_info()
    Returns a status map for this machine's blue-green state.

peer_node()
    Returns the active peer's node name.

start_link(opts)

upgrade(tarball_url)
    Triggers a blue-green upgrade with new code paths.

upgrading?()
    Returns true if a blue-green upgrade is currently in progress.

Functions

child_spec(init_arg)

Returns a specification to start this module under a supervisor.

See Supervisor.

get_info()

Returns a status map for this machine's blue-green state.

peer_node()

Returns the active peer's node name.

Useful for remsh-ing into the peer from the parent:

/app/bin/myapp rpc 'IO.puts(FlyDeploy.BlueGreen.PeerManager.peer_node())'
RELEASE_NODE=<output> /app/bin/myapp remote

start_link(opts)

upgrade(tarball_url)

Triggers a blue-green upgrade with new code paths.

Called by the Poller when it detects a new release in S3. Downloads the tarball, extracts it, and starts a new peer with the new code.

Returns {:error, :upgrade_in_progress} if an upgrade is already running.

upgrading?()

Returns true if a blue-green upgrade is currently in progress.

Uses :persistent_term so it's readable from any process without blocking on the PeerManager GenServer (which is busy doing the upgrade).
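A sketch of that pattern (the key shape is an assumption; :persistent_term.get/2 with a default avoids crashing before the first put):

    # Readable from any process, with no message to the busy GenServer.
    def upgrading?, do: :persistent_term.get({__MODULE__, :upgrading?}, false)

    # Inside the PeerManager, wrapped around the upgrade work:
    :persistent_term.put({__MODULE__, :upgrading?}, true)
    # ... download, extract, start new peer, cutover ...
    :persistent_term.put({__MODULE__, :upgrading?}, false)

:persistent_term writes are expensive (they trigger a global scan), but a flag flipped twice per upgrade is exactly the low-write/high-read workload the module is designed for.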