Resource Silo Guide

What is the Resource Silo?

The Resource Silo is the system stability controller in the Liquid Conglomerate architecture. It monitors computational resources (memory, CPU, processes) and dynamically adjusts the evolution engine's concurrency to prevent system crashes while maximizing throughput.

Think of the Resource Silo as a traffic controller for a busy highway. When traffic is light (low resource pressure), it allows full speed (high concurrency). As congestion builds, it slows traffic down (throttles). In emergencies, it can stop traffic entirely (pause) until conditions improve.

The Resource Silo solves a fundamental problem in neuroevolution: resource exhaustion. Evaluating thousands of neural networks concurrently can quickly exhaust system memory or saturate CPU. Without active management, the system crashes. The Resource Silo prevents this by continuously monitoring system state and adapting evaluation intensity.

Architecture Overview

Resource Silo Architecture

The Resource Silo operates as a three-level hierarchical controller:

| Level | Name | Role | Time Constant |
|-------|------|------|---------------|
| L0 | Emergency | Hard limits, GC triggers, pause if critical | Always active |
| L1 | Reactive | Adjust concurrency based on current pressure | 5 updates |
| L2 | Predictive | Learn resource patterns, anticipate needs | Future |

Key Principle: Safety First

The levels are designed for fail-safe operation:

  • L0 is always active - it cannot be disabled and enforces absolute limits
  • If L2 produces bad predictions → L1 ignores them and uses defaults
  • If L1 computes extreme values → L0 clamps to safe bounds
  • Even if all higher layers fail, L0 will trigger GC and pause to protect the system
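
The fallback chain above can be sketched as a single function. This is an illustration under stated assumptions, not the code in resource_silo.erl: the module name silo_failsafe_sketch, the zero-arity L2Fun callback, and the map keys pressure_scale and min_scale are hypothetical. The accepted ranges and defaults are the ones this guide lists for pressure_scale_factor and min_scale_factor, and the final clamp uses the [1, base*2] range of max_concurrent_evaluations.

```erlang
-module(silo_failsafe_sketch).
-export([safe_concurrency/3]).

%% Defaults used when L2 guidance is missing or invalid
%% (values taken from the guide's L2 parameter defaults).
-define(DEFAULT_PRESSURE_SCALE, 0.9).
-define(DEFAULT_MIN_SCALE, 0.1).

%% L2Fun is a hypothetical zero-arity fun that returns
%% #{pressure_scale => float(), min_scale => float()}.
safe_concurrency(L2Fun, PressureFactor, Base) ->
    {PressureScale, MinScale} = l2_or_defaults(L2Fun),
    %% L1: reactive scaling based on current pressure.
    L1 = round(Base * max(MinScale, 1.0 - PressureFactor * PressureScale)),
    %% L0: always applied, whatever L1 and L2 produced.
    l0_clamp(L1, Base).

%% L1 ignores L2 output that crashes or falls outside the documented ranges.
l2_or_defaults(L2Fun) ->
    try L2Fun() of
        #{pressure_scale := P, min_scale := M}
          when is_float(P), P >= 0.5, P =< 0.99,
               is_float(M), M >= 0.1, M =< 0.5 ->
            {P, M};
        _Other ->
            {?DEFAULT_PRESSURE_SCALE, ?DEFAULT_MIN_SCALE}
    catch
        _:_ ->
            {?DEFAULT_PRESSURE_SCALE, ?DEFAULT_MIN_SCALE}
    end.

%% Clamp to the [1, Base * 2] range of max_concurrent_evaluations.
l0_clamp(Value, Base) ->
    max(1, min(Base * 2, Value)).
```

A crashing or out-of-range L2Fun yields the same result as having no L2 at all, which is the property the bullets describe.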

How It Works

Sensors (Inputs)

The Resource Silo observes 15 sensors that describe current system state:

| Sensor | Range | Description |
|--------|-------|-------------|
| memory_pressure | [0, 1] | Used memory / total memory |
| memory_velocity | [-1, 1] | Rate of memory change |
| cpu_pressure | [0, 1] | CPU utilization (scheduler-based) |
| cpu_velocity | [-1, 1] | Rate of CPU change |
| run_queue_pressure | [0, 1] | Run queue / (schedulers * 2) |
| process_pressure | [0, 1] | Process count / process limit |
| message_queue_pressure | [0, 1] | Aggregate message queue depth |
| binary_memory_ratio | [0, 1] | Binary heap / total memory |
| gc_frequency | [0, 1] | GC rate (normalized) |
| current_concurrency_ratio | [0, 1] | Current / max concurrency |
| task_silo_exploration | [0, 1] | Signal from Task Silo |
| evaluation_throughput | [0, 1] | Evaluations per second |
| time_since_last_gc | [0, 1] | Time since last forced GC |
| archive_memory_ratio | [0, 1] | Self-play archive memory usage |
| crdt_state_size_ratio | [0, 1] | CRDT sync overhead |
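
Three of these ratios can be read directly from the BEAM with standard BIFs. The sketch below is an assumption about how such normalization might look, not the contents of resource_l0_sensors.erl; the module silo_sensor_sketch and its helpers clamp01/1 and ratio/2 are hypothetical names. It leaves out memory_pressure, which needs OS-level memory totals that the VM alone does not report.

```erlang
-module(silo_sensor_sketch).
-export([clamp01/1, ratio/2, sample/0]).

%% Clamp any number into the [0.0, 1.0] sensor range.
clamp01(X) when X < 0.0 -> 0.0;
clamp01(X) when X > 1.0 -> 1.0;
clamp01(X) -> float(X).

%% Safe ratio: a non-positive denominator yields 0.0 instead of crashing.
ratio(_Num, Den) when Den =< 0 -> 0.0;
ratio(Num, Den) -> clamp01(Num / Den).

%% Read three sensors using the formulas from the table above.
sample() ->
    Schedulers = erlang:system_info(schedulers),
    #{run_queue_pressure =>
          ratio(erlang:statistics(run_queue), Schedulers * 2),
      process_pressure =>
          ratio(erlang:system_info(process_count),
                erlang:system_info(process_limit)),
      binary_memory_ratio =>
          ratio(erlang:memory(binary), erlang:memory(total))}.
```

Clamping at the sensor boundary keeps every downstream consumer free of range checks, at the cost of hiding how far past the limit a raw value was.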

Actuators (Outputs)

The Resource Silo controls 10 parameters that govern system behavior:

| Actuator | Range | Default | Description |
|----------|-------|---------|-------------|
| max_concurrent_evaluations | [1, base*2] | base_concurrency | Max parallel evals |
| evaluation_batch_size | [1, 50] | 10 | Evals per batch |
| gc_trigger_threshold | [0.5, 0.95] | 0.85 | Memory % to trigger GC |
| pause_threshold | [0.7, 0.99] | 0.95 | Memory % to pause |
| throttle_intensity | [0, 1] | 0.0 | How aggressively to throttle |
| max_evals_per_individual | [1, 20] | 5 | Evaluation cap per agent |
| task_silo_pressure_signal | [0, 1] | 0.0 | Cross-silo signal to Task |
| gc_aggressiveness | [0, 1] | 0.5 | GC intensity |
| archive_gc_pressure | [0, 1] | 0.0 | Force archive cleanup |
| evaluation_timeout | [1000, 10000] | 3000 | Worker timeout (ms) |
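
One way to enforce these ranges is to keep them in a single map and clamp every requested value against it. The following is a minimal sketch under that assumption; the module silo_actuator_sketch and its functions are hypothetical and are not the API of resource_l0_actuators.erl. Only five of the ten actuators are included to keep the example short.

```erlang
-module(silo_actuator_sketch).
-export([defaults/1, apply_overrides/2]).

%% {Min, Max, Default} per actuator, copied from the table above.
%% Base is the configured base_concurrency.
specs(Base) ->
    #{max_concurrent_evaluations => {1, Base * 2, Base},
      evaluation_batch_size      => {1, 50, 10},
      gc_trigger_threshold       => {0.5, 0.95, 0.85},
      pause_threshold            => {0.7, 0.99, 0.95},
      evaluation_timeout         => {1000, 10000, 3000}}.

%% Default value of every known actuator.
defaults(Base) ->
    maps:map(fun(_Name, {_Min, _Max, Default}) -> Default end, specs(Base)).

%% Merge requested values over the defaults and clamp each into its range.
%% Keys that are not known actuators are ignored.
apply_overrides(Overrides, Base) ->
    maps:map(
      fun(Name, {Min, Max, Default}) ->
              Requested = maps:get(Name, Overrides, Default),
              max(Min, min(Max, Requested))
      end,
      specs(Base)).
```

For example, requesting evaluation_batch_size => 500 yields 50, and pause_threshold => 0.5 is raised to 0.7.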

Adaptive Throttling

The Resource Silo uses a three-state machine to manage system load:

State: CONTINUE

  • Memory pressure below memory_high_threshold (default: 0.7)
  • Full concurrency allowed
  • Normal operation

State: THROTTLE

  • Memory pressure between memory_high_threshold and memory_critical_threshold
  • Concurrency reduced proportionally
  • May trigger GC if pressure is rising

State: PAUSE

  • Memory pressure at or above memory_critical_threshold (default: 0.9)
  • Minimum concurrency (or full stop)
  • Force GC on all processes
  • Wait for pressure to decrease
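
The three states reduce to two threshold comparisons. The sketch below assumes memory pressure is the only input and that a pressure exactly equal to memory_high_threshold counts as THROTTLE; the guide does not specify that boundary, and the module name silo_state_sketch is hypothetical.

```erlang
-module(silo_state_sketch).
-export([classify/3]).

%% classify(Pressure, HighThreshold, CriticalThreshold) -> continue | throttle | pause
%% The critical check comes first so it wins when both guards would match.
classify(Pressure, _High, Critical) when Pressure >= Critical -> pause;
classify(Pressure, High, _Critical) when Pressure >= High -> throttle;
classify(_Pressure, _High, _Critical) -> continue.
```

With the default thresholds, silo_state_sketch:classify(0.75, 0.7, 0.9) returns throttle and silo_state_sketch:classify(0.92, 0.7, 0.9) returns pause.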

The L1 Concurrency Formula

%% Compute pressure factor (0.0 to 1.0)
PressureFactor = max(MemFactor, CpuFactor),

%% Scale concurrency using L2-controlled factors
ScaleFactor = max(MinScale, 1.0 - (PressureFactor * PressureScale)),
NewConcurrency = max(1, round(Base * ScaleFactor))

Where:

  • PressureScale (L2-controlled): How aggressively to reduce (default: 0.9)
  • MinScale (L2-controlled): Minimum concurrency fraction (default: 0.1)
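
To make the formula concrete, here it is wrapped in a runnable function. The module name silo_l1_sketch is hypothetical, and MemFactor and CpuFactor are taken as inputs already in [0, 1], because this guide does not spell out how they are derived from raw pressure.

```erlang
-module(silo_l1_sketch).
-export([new_concurrency/5]).

%% Direct transcription of the L1 formula above.
new_concurrency(MemFactor, CpuFactor, Base, PressureScale, MinScale) ->
    %% The more constrained resource determines the pressure factor.
    PressureFactor = max(MemFactor, CpuFactor),
    %% Scale down linearly with pressure, but never below MinScale.
    ScaleFactor = max(MinScale, 1.0 - (PressureFactor * PressureScale)),
    %% Always allow at least one concurrent evaluation.
    max(1, round(Base * ScaleFactor)).
```

With the defaults (PressureScale 0.9, MinScale 0.1) and Base = 100: a pressure factor of 0.0 keeps concurrency at 100, a factor of 0.5 gives 1.0 - 0.45 = 0.55 and therefore 55, and a factor of 1.0 hits the MinScale floor at 10.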

The Control Loop

  1. Sample Metrics: Every second (sample_interval, 1000 ms in this guide's configuration), collect VM metrics via resource_monitor
  2. Update History: Add pressure to rolling window for velocity calculation
  3. Query L2: If enabled, get strategic guidance from meta_controller
  4. Compute L1 Concurrency: Calculate adaptive concurrency based on pressure
  5. Check L0 Emergency: If critical, trigger GC and increment pause count
  6. Emit Events: On state change, publish resource_alert event
  7. Return Recommendations: Engine queries for max_concurrent and action

%% Typical usage in evolution loop
run_evaluations(Population, State) ->
    %% Get recommendations from Resource Silo
    #{action := Action, max_concurrent := MaxConc} = resource_silo:get_recommendations(),

    case Action of
        pause ->
            %% Wait for resources to recover
            timer:sleep(1000),
            run_evaluations(Population, State);
        throttle ->
            %% Proceed with reduced concurrency
            evaluate_batch(Population, MaxConc, State);
        continue ->
            %% Full speed ahead
            evaluate_batch(Population, MaxConc, State)
    end.

Integration with the Neuroevolution Engine

Resource Silo Dataflow

Wiring Diagram

The Resource Silo integrates with the system at the lowest level:

Data Sources:

  • resource_monitor - Raw BEAM VM metrics (memory, CPU, processes)
  • neuroevolution_server - Evaluation throughput, current concurrency
  • opponent_archive - Self-play archive memory statistics (optional)

Data Consumers:

  • neuroevolution_server - Applies concurrency limits, timeout settings
  • task_silo - Receives pressure signals for hyperparameter adaptation
  • neuroevolution_events - Event bus for monitoring

Cross-Silo Interactions

The Resource Silo exchanges signals with other silos:

Signals Sent:

| Signal | To | Description |
|--------|-----|-------------|
| pressure_signal | Task | Current resource pressure (0-1) |
| max_evals_per_individual | Task | Caps evaluation budget |
| should_simplify | Task | Suggests simpler networks when resources stressed |
| local_capacity | Distribution | Available local compute capacity |

Signals Received:

| Signal | From | Effect |
|--------|------|--------|
| exploration_boost | Task | High exploration = expect more evaluations |
| desired_evals_per_individual | Task | Requested evaluation budget |
| expected_complexity_growth | Task | Anticipated memory needs |

Engine Integration Points

%% Start Resource Silo with configuration
{ok, _} = resource_silo:start_link(#{
    enabled_levels => [l0, l1],           % L2 not yet implemented
    base_concurrency => 100000,            % Erlang can handle it
    memory_high_threshold => 0.7,          % Start throttling at 70%
    memory_critical_threshold => 0.9,      % Pause at 90%
    cpu_high_threshold => 0.9,             % Throttle at 90% CPU
    sample_interval => 1000                % Sample every 1 second
}).

%% In evaluation loop
before_evaluation_batch() ->
    case resource_silo:should_pause() of
        true ->
            %% Wait for resources
            timer:sleep(500),
            before_evaluation_batch();
        false ->
            #{max_concurrent := N} = resource_silo:get_recommendations(),
            {ok, N}
    end.

Training Velocity Impact

| Metric | Without Resource Silo | With Resource Silo |
|--------|-----------------------|--------------------|
| System stability | Crashes under load | Graceful degradation |
| Concurrency management | Fixed (guess) | Adaptive to pressure |
| Memory safety | None (OOM kills) | Proactive GC, pause |
| Training velocity | 0.0x (crashes) | 1.0x (baseline) |

The Resource Silo establishes the baseline training velocity by keeping the system alive. Without it, aggressive neuroevolution experiments would crash, yielding 0x velocity.

Practical Examples

Example 1: Handling Memory Pressure

%% Scenario: Memory pressure climbing during evolution
%% resource_monitor reports memory_pressure = 0.75

%% Resource Silo detects THROTTLE condition
Recommendations = resource_silo:get_recommendations(),

%% With pressure at 0.75 (above 0.7 threshold):
%% - action => throttle
%% - max_concurrent => 65000 (reduced from 100000)
%% - reason => <<"High resource pressure - reducing concurrency">>

%% Event emitted:
%% {resource_alert, #{
%%     previous_action => continue,
%%     action => throttle,
%%     memory_pressure => 0.75,
%%     cpu_pressure => 0.45,
%%     message => <<"Resource pressure detected, reducing concurrency">>
%% }}

Example 2: Critical Pressure Response

%% Scenario: Memory pressure hits critical level
%% resource_monitor reports memory_pressure = 0.92

%% Resource Silo L0 triggers emergency response
%% 1. Forces garbage collection on all processes
%% 2. Increments pause_count
%% 3. Returns pause action

Recommendations = resource_silo:get_recommendations(),
%% #{
%%     action => pause,
%%     max_concurrent => 1,
%%     memory_pressure => 0.92,
%%     reason => <<"Memory critical - evolution should pause">>,
%%     gc_triggered_count => 5,
%%     pause_count => 3
%% }

Example 3: Pressure Rising Detection

%% Resource Silo tracks pressure history for velocity calculation
%% If pressure is rising rapidly, it preemptively triggers GC

%% Pressure history: [0.68, 0.65, 0.60, 0.55, 0.50]
%% Current pressure: 0.74
%% Velocity: (0.74 - 0.68) = 0.06 per sample

%% Even though 0.74 < 0.9 (critical),
%% pressure is rising above pressure_change_threshold (0.05)
%% L0 triggers preemptive GC to prevent reaching critical
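
The velocity check can be sketched as a comparison between the current sample and the most recent entry in the history. This is an assumption about the mechanism, not the silo's actual code: the module silo_velocity_sketch and both function names are hypothetical, and the history list is assumed to be ordered newest first.

```erlang
-module(silo_velocity_sketch).
-export([velocity/2, should_preempt_gc/3]).

%% Per-sample velocity: current pressure minus the most recent recorded one.
%% An empty history means there is nothing to compare against yet.
velocity(_Current, []) -> 0.0;
velocity(Current, [Previous | _Older]) -> Current - Previous.

%% Preemptive GC is warranted when pressure rises faster than Threshold
%% (pressure_change_threshold, default 0.05).
should_preempt_gc(Current, History, Threshold) ->
    velocity(Current, History) > Threshold.
```

With the history [0.68, 0.65, 0.60, 0.55, 0.50] and a threshold of 0.05, a current pressure of 0.74 (a rise of 0.06) triggers preemptive GC, while 0.72 (a rise of 0.04) does not.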

Tuning Guide

Key Parameters

| Parameter | When to Increase | When to Decrease |
|-----------|------------------|------------------|
| base_concurrency | Underutilizing resources | OOM errors |
| memory_high_threshold | Throttling too early | System sluggish |
| memory_critical_threshold | Pausing unnecessarily | Crashes before pause |
| sample_interval | Too much monitoring overhead | Slow response to spikes |
| L2 pressure_scale_factor | Throttling too weak | Oscillating concurrency |
| L2 min_scale_factor | Concurrency drops too low | Not enough throttling |

Common Pitfalls

  1. Base concurrency too low: Underutilizing Erlang's massive concurrency

    • Symptom: Low CPU utilization, slow training
    • Fix: Increase base_concurrency (Erlang handles 100K+ processes easily)
  2. Thresholds too aggressive: Constant throttling/pausing

    • Symptom: pause_count climbing rapidly, low throughput
    • Fix: Raise memory_high_threshold to 0.8
  3. Thresholds too permissive: System crashes before intervention

    • Symptom: OOM kills, VM crashes
    • Fix: Lower memory_critical_threshold to 0.85
  4. Sample interval too long: Slow reaction to pressure spikes

    • Symptom: Memory spikes cause crashes before silo reacts
    • Fix: Reduce sample_interval to 500ms

Debugging Tips

%% Get full Resource Silo state
State = resource_silo:get_state(),
io:format("Memory pressure: ~.1f%~n", [maps:get(memory_pressure, maps:get(current_metrics, State)) * 100]),
io:format("Current concurrency: ~p~n", [maps:get(current_concurrency, State)]),
io:format("GC triggered count: ~p~n", [maps:get(gc_triggered_count, State)]),
io:format("Pause count: ~p~n", [maps:get(pause_count, State)]),

%% Get sensor details
Sensors = maps:get(sensors, State),
io:format("Run queue pressure: ~.3f~n", [maps:get(run_queue_pressure, Sensors, 0.0)]),
io:format("Process pressure: ~.3f~n", [maps:get(process_pressure, Sensors, 0.0)]),

%% Check thresholds
Thresholds = maps:get(thresholds, State),
io:format("Memory high threshold: ~.2f~n", [maps:get(memory_high, Thresholds)]).

Events Reference

The Resource Silo emits events on state transitions:

| Event | Trigger | Key Payload |
|-------|---------|-------------|
| resource_alert | Action state changes | previous_action, action, memory_pressure, cpu_pressure |
| resource_sensors_updated | Sensors change significantly | sensors (full map) |
| gc_triggered | L0 forces garbage collection | memory_pressure, reason |

Example Event Payload:

{resource_alert, #{
    realm => <<"default">>,
    source => resource_silo,
    previous_action => continue,
    action => throttle,
    memory_pressure => 0.75,
    cpu_pressure => 0.45,
    message => <<"Resource pressure detected, reducing concurrency">>,
    timestamp => 1703318400000
}}
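
A consumer can pattern-match on these tuples to react only to the events it cares about. The sketch below shows the matching alone; how events are delivered (the subscription API of neuroevolution_events) is not covered in this guide and is not assumed here. The module silo_event_sketch and the function describe/1 are hypothetical.

```erlang
-module(silo_event_sketch).
-export([describe/1]).

%% Turn a Resource Silo event into a short log line.
%% Map patterns match on a subset of keys, so extra payload fields
%% (realm, source, timestamp, ...) are simply ignored.
describe({resource_alert, #{previous_action := Prev,
                            action := Action,
                            memory_pressure := Mem}}) ->
    lists:flatten(
      io_lib:format("~p -> ~p (memory ~.1f%)", [Prev, Action, Mem * 100]));
describe({gc_triggered, #{memory_pressure := Mem}}) ->
    lists:flatten(io_lib:format("gc at memory ~.1f%", [Mem * 100]));
describe(_Other) ->
    "ignored".
```

Applied to the example payload above, describe/1 returns "continue -> throttle (memory 75.0%)".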

L2 Guidance Parameters

When L2 is enabled (future), the meta_controller provides strategic guidance:

| Parameter | Range | Default | Effect |
|-----------|-------|---------|--------|
| memory_high_threshold | [0.5, 0.9] | 0.7 | When to start throttling |
| memory_critical_threshold | [0.7, 0.99] | 0.9 | When to pause |
| cpu_high_threshold | [0.7, 0.99] | 0.9 | CPU throttle threshold |
| pressure_scale_factor | [0.5, 0.99] | 0.9 | Throttling intensity |
| min_scale_factor | [0.1, 0.5] | 0.1 | Minimum concurrency |
| pressure_change_threshold | [0.01, 0.1] | 0.05 | Sensitivity to pressure changes |

Source Code Reference

| Module | Purpose | Location |
|--------|---------|----------|
| resource_silo.erl | Main gen_server | src/silos/resource_silo/ |
| resource_monitor.erl | BEAM VM metrics collection | Same |
| resource_l0_sensors.erl | Sensor normalization | Same |
| resource_l0_actuators.erl | Actuator application | Same |
| resource_l0_morphology.erl | TWEANN topology definition | Same |
| lc_cross_silo.erl | Cross-silo signals | src/silos/ |

Further Reading

References

Resource Management in BEAM

  • Armstrong, J. (2007). "Programming Erlang: Software for a Concurrent World." Pragmatic Bookshelf.
  • Cesarini, F., Thompson, S. (2009). "Erlang Programming." O'Reilly Media.

Adaptive Systems

  • Hellerstein, J.L., et al. (2004). "Feedback Control of Computing Systems." Wiley.
  • Diao, Y., et al. (2005). "Control of Computing Systems." IEEE Computer.

Garbage Collection

  • Jones, R., Hosking, A., Moss, E. (2011). "The Garbage Collection Handbook." CRC Press.