viva_tensor/quant/compression
INT8 Quantization and Memory Hierarchy System
Reference: Jacob et al. (2017) - “Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference” https://arxiv.org/abs/1712.05877
Compression math (INT8): 32 bits / 8 bits = 4x compression
- 24GB VRAM -> 96GB effective parameter storage
- RTX 4090 INT8 Tensor Cores: 2x throughput vs FP16 (660 vs 330 TOPS)
Why symmetric quantization? Because asymmetric zero-points are a cache-miss nightmare. The extra memory access for zero-point lookup kills throughput. Per-tensor symmetric is fast. Per-channel symmetric is accurate. Pick one.
absmax quantization: simple but loses dynamic range at the tails. For weights: per-channel absmax is worth the overhead. For activations: per-tensor is fine (they’re more uniform).
FP16 was a mistake for storage. It’s 2x larger than INT8 with minimal accuracy benefit for inference. Train in FP16/BF16, deploy in INT8.
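A minimal, self-contained sketch of the per-tensor absmax scheme described above; the helper name and example values are illustrative only:

import gleam/float
import gleam/list

// Per-tensor absmax symmetric INT8: scale = absmax / 127, q = round(x / scale).
// Dequantization is q * scale; no zero-point lookup is needed.
pub fn absmax_int8_example() -> List(Int) {
  let values = [0.42, -1.7, 0.03, 1.25]
  let absmax =
    list.fold(values, 0.0, fn(acc, x) { float.max(acc, float.absolute_value(x)) })
  let scale = absmax /. 127.0
  // -> [31, -127, 2, 93] with scale ~0.0134
  list.map(values, fn(x) { float.round(x /. scale) })
}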
Inspired by: ggml, llama.cpp, Candle, bitsandbytes
Types
Access tracking for smart offloading
pub type AccessRecord {
AccessRecord(
tensor_id: Int,
timestamp_ms: Int,
access_count: Int,
)
}
Constructors
- AccessRecord(tensor_id: Int, timestamp_ms: Int, access_count: Int)
Checkpoint for trading compute for memory. Key insight: recomputing the forward pass is cheaper than storing all activations.
pub type Checkpoint {
Checkpoint(
input: tensor.Tensor,
forward_fn_id: Int,
memory_saved_gb: Float,
)
}
Constructors
- Checkpoint(input: tensor.Tensor, forward_fn_id: Int, memory_saved_gb: Float)

Arguments
- input: Input saved for recomputation
- forward_fn_id: Function ID for forward-pass replay
- memory_saved_gb: Memory saved in GB (activations not stored)
Checkpointing strategy
pub type CheckpointStrategy {
NoCheckpoint
EveryN(n: Int)
LargeLayersOnly(threshold_mb: Float)
Adaptive(memory_pressure: Float)
}
Constructors
- NoCheckpoint: No checkpointing - fastest, but uses the most memory
- EveryN(n: Int): Checkpoint every N layers; n ≈ sqrt(num_layers) is the classic memory/compute optimum (see the sketch after this list)
- LargeLayersOnly(threshold_mb: Float): Only checkpoint layers above the threshold (target the big ones)
- Adaptive(memory_pressure: Float): Adapt based on memory pressure (for dynamic batch sizing)
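Why sqrt? With EveryN(n), roughly num_layers / n checkpointed inputs stay resident, plus up to n live activations while one segment is recomputed. A minimal sketch (hypothetical helper, not part of this module):

// Resident activation count under EveryN(n) checkpointing.
pub fn resident_activations(num_layers: Int, n: Int) -> Int {
  num_layers / n + n
}
// resident_activations(64, 8) == 16, versus 64 with NoCheckpoint;
// the sum num_layers / n + n is minimised at n = sqrt(num_layers).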
Compressed tensor
pub type CompressedTensor {
CompressedTensor(
data: List(Int),
shape: List(Int),
format: QuantFormat,
memory_bytes: Int,
)
}
Constructors
- CompressedTensor(data: List(Int), shape: List(Int), format: QuantFormat, memory_bytes: Int)

Arguments
- data: Quantized data (bytes simulated as ints for BEAM compatibility)
- shape: Original shape, preserved for dequantization
- format: Quantization format with metadata
- memory_bytes: Memory footprint in bytes (useful for memory budgeting)
Hierarchical memory system for offloading. Pattern: hot tensors on GPU, warm on RAM, cold on disk.
pub type MemoryHierarchy {
MemoryHierarchy(
gpu: MemoryTier,
ram: MemoryTier,
disk: option.Option(MemoryTier),
total_effective_gb: Float,
)
}
Constructors
- MemoryHierarchy(gpu: MemoryTier, ram: MemoryTier, disk: option.Option(MemoryTier), total_effective_gb: Float)

Arguments
- total_effective_gb: Effective capacity after the quantization multiplier
Memory pool for buffer reuse. Pattern: allocate once, reuse many times (avoids malloc overhead).
pub type MemoryPool {
MemoryPool(
free_buffers: List(#(Int, Int)),
used_buffers: Int,
total_allocated: Int,
)
}
Constructors
- MemoryPool(free_buffers: List(#(Int, Int)), used_buffers: Int, total_allocated: Int)

Arguments
- free_buffers: Available buffers by size: (size_bytes, available_count)
- used_buffers: Count of buffers currently in use
- total_allocated: Total bytes allocated (for memory tracking)
Memory tier with capacity and bandwidth tracking
pub type MemoryTier {
MemoryTier(
location: TensorLocation,
capacity_gb: Float,
used_gb: Float,
bandwidth_gbps: Float,
)
}
Constructors
- MemoryTier(location: TensorLocation, capacity_gb: Float, used_gb: Float, bandwidth_gbps: Float)

Arguments
- bandwidth_gbps: Bandwidth matters more than latency for large transfers
Offload policy determines when to move tensors between tiers
pub type OffloadPolicy {
KeepOnGpu
OffloadToRam(threshold_pct: Float)
OffloadToDisk(ram_threshold: Float, disk_path: String)
SmartOffload(access_history: List(AccessRecord))
}
Constructors
- KeepOnGpu: Keep everything on GPU (default, fastest)
- OffloadToRam(threshold_pct: Float): Move to RAM when GPU usage exceeds the threshold
- OffloadToDisk(ram_threshold: Float, disk_path: String): Tiered: GPU -> RAM -> Disk
- SmartOffload(access_history: List(AccessRecord)): LRU-based: evict the least recently accessed tensors first
Quantization format
pub type QuantFormat {
Fp32
Fp16
Int8(scale: Float)
Quant4(block_size: Int, scales: List(Float))
Quant4Min(
block_size: Int,
scales: List(Float),
mins: List(Float),
)
}
Constructors
- Fp32: Full precision (32 bits, 4 bytes per value). Memory: 4 bytes/param | Compression: 1x
- Fp16: Half precision (16 bits, 2 bytes per value). Memory: 2 bytes/param | Compression: 2x. Opinion: only use for training; deploy with INT8.
- Int8(scale: Float): 8-bit integer with a single per-tensor scale (1 byte per value + 1 float). Memory: ~1 byte/param | Compression: 4x. Sweet spot: best accuracy/compression tradeoff for inference.
- Quant4(block_size: Int, scales: List(Float)): 4-bit quantized (0.5 bytes per value), GGML style. Memory: 0.5 bytes/param + scale overhead | Compression: ~7x. Use NF4 instead for normally distributed weights (see nf4.gleam).
- Quant4Min(block_size: Int, scales: List(Float), mins: List(Float)): 4-bit with per-block min and scale (asymmetric; more accurate for ReLU activations). Memory: 0.5 bytes/param + 2 floats per block | Compression: ~6.5x. (A rough per-format budgeting sketch follows below.)
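A rough budgeting sketch over these formats; the helper is hypothetical and ignores per-block scale overhead for the 4-bit variants:

import viva_tensor/quant/compression.{
  type QuantFormat, Fp16, Fp32, Int8, Quant4, Quant4Min,
}

// Approximate storage cost per parameter for each format.
pub fn approx_bytes_per_param(format: QuantFormat) -> Float {
  case format {
    Fp32 -> 4.0
    Fp16 -> 2.0
    Int8(_) -> 1.0
    Quant4(_, _) -> 0.5
    Quant4Min(_, _, _) -> 0.5
  }
}
// A 7B-parameter model in Q4: 7e9 * 0.5 bytes ≈ 3.5 GB, versus 28 GB in FP32.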
Streaming tensor - loads chunks on demand. Use case: models too large for any single memory tier.
pub type StreamedTensor {
StreamedTensor(
id: Int,
shape: List(Int),
chunk_shape: List(Int),
loaded_chunks: List(Int),
total_chunks: Int,
format: QuantFormat,
)
}
Constructors
- StreamedTensor(id: Int, shape: List(Int), chunk_shape: List(Int), loaded_chunks: List(Int), total_chunks: Int, format: QuantFormat)
Tensor location in the memory hierarchy. Performance note: GPU -> RAM transfer on PCIe 4.0 x16 runs at ~25 GB/s, i.e. 40 ms for 1 GB. Plan your offloading accordingly (a transfer-time sketch follows the constructor list below).
pub type TensorLocation {
OnGpu(device_id: Int)
OnRam
OnDisk(path: String)
Hybrid(gpu_pct: Float)
}
Constructors
- OnGpu(device_id: Int): VRAM - fastest (RTX 4090: 1008 GB/s bandwidth)
- OnRam: System RAM - medium (DDR5-6400: ~100 GB/s dual channel)
- OnDisk(path: String): NVMe SSD - slow but huge (~7 GB/s sequential)
- Hybrid(gpu_pct: Float): Split across GPU and RAM (gradient offloading pattern)
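A back-of-the-envelope transfer-time helper (hypothetical, not part of this module), using the bandwidth figures quoted above:

// Milliseconds to move size_gb across a link with the given bandwidth (GB/s).
pub fn transfer_ms(size_gb: Float, bandwidth_gbps: Float) -> Float {
  size_gb /. bandwidth_gbps *. 1000.0
}
// transfer_ms(1.0, 25.0) == 40.0     GPU -> RAM over PCIe 4.0 x16
// transfer_ms(1.0, 7.0) ≈ 142.9      RAM -> NVMe SSD
// transfer_ms(1.0, 1008.0) ≈ 1.0     within RTX 4090 VRAM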
Values
pub fn allocate_tensor(
hierarchy: MemoryHierarchy,
tensor_size_gb: Float,
policy: OffloadPolicy,
) -> #(TensorLocation, MemoryHierarchy)
Allocate a tensor in the memory hierarchy based on the policy. Returns the new location and the updated hierarchy state.
pub fn checkpoint_savings(
num_layers: Int,
layer_size_mb: Float,
strategy: CheckpointStrategy,
) -> Float
Calculate memory savings from checkpointing. Note: this trades ~33% compute overhead for 50-75% memory savings.
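A usage sketch based on the signature above; the layer count and size are illustrative, and the return value is assumed to be in GB (matching memory_saved_gb elsewhere in this module):

import viva_tensor/quant/compression

pub fn checkpointing_example() -> Float {
  // 48 layers of 512 MB each, checkpointing every 7 (≈ sqrt(48)) layers.
  compression.checkpoint_savings(48, 512.0, compression.EveryN(7))
}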
pub fn create_memory_hierarchy(
vram_gb: Float,
ram_gb: Float,
disk_path: option.Option(String),
) -> MemoryHierarchy
Create a memory hierarchy for a typical workstation setup. Example: RTX 4090 (24GB) + DDR5 RAM (32GB) = 56GB physical. With INT8: 56GB * 4 = 224GB effective parameter storage.
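A usage sketch mirroring the workstation example above; the signatures come from this page, while the policy choice and threshold interpretation are assumptions:

import gleam/option.{None}
import viva_tensor/quant/compression

pub fn workstation_example() {
  // RTX 4090 (24 GB VRAM) + 32 GB DDR5 system RAM, no disk tier.
  let hierarchy = compression.create_memory_hierarchy(24.0, 32.0, None)
  // Place a 6 GB tensor, spilling to RAM once GPU usage passes the threshold
  // (assuming threshold_pct is expressed as a percentage).
  compression.allocate_tensor(hierarchy, 6.0, compression.OffloadToRam(90.0))
}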
pub fn create_streamed(
shape: List(Int),
chunk_dim: Int,
) -> StreamedTensor
Create streaming tensor with specified chunk size
pub fn demonstrate_compression() -> Nil
pub fn dequantize(ct: CompressedTensor) -> tensor.Tensor
Dequantize a compressed tensor back to FP32. Note: this is NOT lossless; quantization error is permanent.
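A round-trip sketch: quantize_int8 and dequantize are from this page, while the viva_tensor/tensor module path and tensor.from_list constructor are assumed for illustration:

import viva_tensor/quant/compression
import viva_tensor/tensor

pub fn roundtrip_example() -> tensor.Tensor {
  // Hypothetical constructor: build a small FP32 tensor.
  let original = tensor.from_list([0.42, -1.7, 0.03, 1.25])
  let compressed = compression.quantize_int8(original)
  // Not lossless: each value comes back with up to scale/2 of rounding error.
  compression.dequantize(compressed)
}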
pub fn load_chunk(
st: StreamedTensor,
chunk_idx: Int,
) -> StreamedTensor
Load specific chunk into memory
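A usage sketch from the signatures on this page, assuming chunk_dim selects the dimension to split along:

import viva_tensor/quant/compression

pub fn streaming_example() -> compression.StreamedTensor {
  // Chunk a [65536, 4096] weight matrix along dimension 0.
  let st = compression.create_streamed([65_536, 4096], 0)
  let st = compression.load_chunk(st, 0)
  // Consume the chunk, then release it to stay under budget.
  compression.unload_chunk(st, 0)
}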
pub fn pool_alloc(
pool: MemoryPool,
size: Int,
) -> #(MemoryPool, Bool)
Allocate from the pool (reuses an existing buffer if available). Returns: #(updated_pool, was_reused)
pub fn pool_free(pool: MemoryPool, size: Int) -> MemoryPool
Return buffer to pool for reuse
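A usage sketch of the pool API; the MemoryPool construction follows the record definition earlier on this page, and the sizes are illustrative:

import viva_tensor/quant/compression.{MemoryPool}

pub fn pool_example() -> compression.MemoryPool {
  // Start with two free 1 MiB buffers already allocated.
  let pool =
    MemoryPool(
      free_buffers: [#(1_048_576, 2)],
      used_buffers: 0,
      total_allocated: 2_097_152,
    )
  // Should reuse a free buffer rather than allocating a new one.
  let #(pool, _was_reused) = compression.pool_alloc(pool, 1_048_576)
  // Hand the buffer back so later allocations of this size can reuse it.
  compression.pool_free(pool, 1_048_576)
}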
pub fn quantize_int8(t: tensor.Tensor) -> CompressedTensor
Quantize a tensor to INT8 using absmax symmetric quantization.
Compression: 32/8 = 4x. Error: typically <0.5% for well-distributed weights.
Implementation: per-tensor absmax (fast, but less accurate than per-channel). For production, consider per-channel for weights and per-tensor for activations.
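The per-channel variant recommended above is not provided by this function; a minimal sketch (hypothetical helper) of computing one absmax scale per output channel:

import gleam/float
import gleam/list

// One symmetric INT8 scale per row (output channel) of a weight matrix.
pub fn per_channel_scales(rows: List(List(Float))) -> List(Float) {
  list.map(rows, fn(row) {
    let absmax =
      list.fold(row, 0.0, fn(acc, x) { float.max(acc, float.absolute_value(x)) })
    absmax /. 127.0
  })
}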
pub fn quantize_q4(
t: tensor.Tensor,
block_size: Int,
) -> CompressedTensor
Quantize to Q4 (4-bit) using block-wise absmax, GGML style.
Compression: 32/4 = 8x theoretical, ~7x effective with scale overhead. Block size tradeoff:
- Smaller blocks (32): More scales, more accurate, less compression
- Larger blocks (128): Fewer scales, less accurate, more compression
- Sweet spot: 64 (empirically validated in GGML/QLoRA)
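The effective ratios above follow from 4 data bits per value plus one 32-bit scale per block; a small helper (hypothetical, not in this module) makes the arithmetic explicit:

import gleam/int

pub fn q4_effective_compression(block_size: Int) -> Float {
  // 4 data bits per value plus a 32-bit scale amortised over the block.
  let bits_per_value = 4.0 +. 32.0 /. int.to_float(block_size)
  32.0 /. bits_per_value
}
// q4_effective_compression(32) ≈ 6.4x
// q4_effective_compression(64) ≈ 7.1x
// q4_effective_compression(128) ≈ 7.5x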
pub fn unload_chunk(
st: StreamedTensor,
chunk_idx: Int,
) -> StreamedTensor
Unload chunk to free memory