Object.DistributedTraining (object v0.1.2)
Distributed Low-Communication (DiLoCo) Training implementation for AAOS.
Implements the DiLoCo algorithm from "DiLoCo: Distributed Low-Communication Training of Language Models" (Douillard et al., 2023) as a specialized Object subtype within the AAOS framework.
This module enables training of large language models and other neural networks across islands of devices that are poorly connected, requiring minimal communication while maintaining performance comparable to fully synchronous training.
Key Features
- Low Communication: Workers communicate only every H steps (typically hundreds or thousands of steps)
- Federated Architecture: Each worker operates on its own data island
- Fault Tolerance: Byzantine resistance and graceful degradation
- Heterogeneous Hardware: Different islands can use different device types
- Performance: 500× less communication than fully synchronous training, as reported in the DiLoCo paper
- Integration: Full AAOS object lifecycle and coordination
DiLoCo Algorithm
The algorithm consists of two nested optimization loops (see the sketch after this list):
- Inner Optimization: each worker runs H local AdamW updates on its own data shard
- Outer Optimization: the per-worker parameter deltas are averaged into an outer gradient and applied to the global parameters with Nesterov momentum
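Below is a minimal, self-contained sketch of the double loop, independent of this module's actual API. Parameters are plain lists of floats, each worker contributes a fixed fake local gradient, and the outer step uses plain SGD rather than Nesterov momentum to keep the example short; it only illustrates the control and communication structure.

    defmodule DilocoSketch do
      @inner_lr 0.01
      @outer_lr 0.7

      # theta0: initial global parameters; worker_grads: one fake local gradient
      # per island; t_outer: outer steps T; h_inner: inner steps H.
      def train(theta0, worker_grads, t_outer, h_inner) do
        Enum.reduce(1..t_outer, theta0, fn _t, theta ->
          # Inner loop: every island starts from the current global parameters
          # and runs H local updates on its own data (here: a constant gradient).
          deltas =
            for grad <- worker_grads do
              local =
                Enum.reduce(1..h_inner, theta, fn _h, params ->
                  Enum.zip_with(params, grad, fn p, g -> p - @inner_lr * g end)
                end)

              # This worker's outer-gradient contribution: theta^(t-1) - theta_i^(t)
              Enum.zip_with(theta, local, fn a, b -> a - b end)
            end

          # Outer step: average the contributions (communication happens only here,
          # once every H inner steps) and apply them to the global parameters.
          # Plain SGD is used here for brevity; DiLoCo uses Nesterov momentum.
          k = length(deltas)
          outer_grad = Enum.zip_with(deltas, fn ds -> Enum.sum(ds) / k end)
          Enum.zip_with(theta, outer_grad, fn p, d -> p - @outer_lr * d end)
        end)
      end
    end

    # Example call with made-up values:
    # DilocoSketch.train([1.0, 2.0], [[0.1, 0.2], [0.3, 0.1]], 3, 100)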
Mathematical Foundation
For T outer steps and H inner steps per worker:
- Total training steps: N = T × H
- Communication frequency: Every H steps
- Workers: k islands of devices
- Outer gradient: Δ^(t) = (1/k) ∑ᵢ (θ^(t-1) - θᵢ^(t))
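For example, with k = 2 workers and illustrative numbers θ^(t-1) = (1.0, 2.0), θ₁^(t) = (0.8, 1.9), and θ₂^(t) = (0.9, 1.7), the outer gradient is Δ^(t) = ½[(0.2, 0.1) + (0.1, 0.3)] = (0.15, 0.2); the outer optimizer then applies Δ^(t) to θ^(t-1) with Nesterov momentum to produce θ^(t).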
Summary
Functions
Returns a specification to start this module under a supervisor.
Initializes a training coalition of distributed workers.
Gets current training metrics.
Performs H inner training steps on local data.
Loads a training checkpoint.
Creates a new DiLoCo distributed training object.
Performs a single outer optimization step (one of the T outer iterations).
Saves a training checkpoint.
Starts the distributed training object as a GenServer.
Synchronizes with other workers in the coalition.
Executes the DiLoCo training algorithm.
Types
@type communication_state() :: %{
        last_sync: DateTime.t(),
        pending_gradients: map(),
        sync_barrier_count: integer(),
        communication_overhead: float(),
        bandwidth_usage: float()
      }
@type fault_tolerance_state() :: %{
        failed_workers: [worker_id()],
        backup_checkpoints: %{required(String.t()) => model_state()},
        consensus_state: consensus_state(),
        health_status: :healthy | :degraded | :critical
      }
@type synchronization_barrier() :: %{
        barrier_id: String.t(),
        expected_workers: [worker_id()],
        arrived_workers: [worker_id()],
        timeout: integer(),
        start_time: DateTime.t()
      }
@type t() :: %Object.DistributedTraining{
        communication_state: communication_state(),
        coordination_service: pid(),
        data_shard: data_shard(),
        fault_tolerance_state: fault_tolerance_state(),
        global_model_state: model_state(),
        inner_optimizer_state: optimizer_state(),
        local_model_state: model_state(),
        object_id: String.t(),
        optimizer_config: optimizer_config(),
        outer_optimizer_state: optimizer_state(),
        performance_metrics: performance_metrics(),
        step_counters: step_counters(),
        synchronization_barrier: synchronization_barrier(),
        training_config: training_config(),
        worker_id: worker_id()
      }
@type vote() :: %{
        step: integer(),
        model_hash: binary(),
        timestamp: DateTime.t(),
        signature: binary()
      }
@type worker_id() :: String.t()
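For orientation, the literals below show what values of communication_state() and synchronization_barrier() might look like; all field values are made up and are not produced by any function in this module.

    comm_state = %{
      last_sync: ~U[2024-01-01 12:00:00Z],
      pending_gradients: %{},
      sync_barrier_count: 0,
      communication_overhead: 0.02,
      bandwidth_usage: 1.5
    }

    barrier = %{
      barrier_id: "outer-step-42",
      expected_workers: ["island_1", "island_2", "island_3"],
      arrived_workers: ["island_1"],
      # timeout unit is not documented; milliseconds assumed here
      timeout: 30_000,
      start_time: DateTime.utc_now()
    }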
Functions
Returns a specification to start this module under a supervisor.
See Supervisor.
Initializes a training coalition of distributed workers.
Gets current training metrics.
Performs H inner training steps on local data.
Loads a training checkpoint.
Creates a new DiLoCo distributed training object.
Performs a single outer optimization step (one of the T outer iterations).
Saves a training checkpoint.
Starts the distributed training object as a GenServer.
Synchronizes with other workers in the coalition.
Executes the DiLoCo training algorithm.
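A hypothetical end-to-end flow matching the one-line descriptions above. The function names (new/1, start_link/1, initialize_coalition/1, train/1, get_metrics/1) and all options are assumptions for illustration only and may not match this module's actual signatures.

    alias Object.DistributedTraining, as: DiLoCo

    # Create a training object for one island (constructor name and options assumed).
    worker = DiLoCo.new(worker_id: "island_1", inner_steps: 500, outer_steps: 100)

    # Start it as a GenServer.
    {:ok, pid} = DiLoCo.start_link(worker)

    # Form the coalition, run the DiLoCo loop, then read metrics
    # (all three call names are assumed).
    :ok = DiLoCo.initialize_coalition(["island_1", "island_2", "island_3"])
    {:ok, _trained} = DiLoCo.train(pid)
    metrics = DiLoCo.get_metrics(pid)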