viva_tensor/quant/compression

INT8 Quantization and Memory Hierarchy System

Reference: Jacob et al. (2017) - “Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference” https://arxiv.org/abs/1712.05877

Compression math for INT8: 32 bits / 8 bits = 4x compression.

Why symmetric quantization? Because asymmetric zero-points are a cache-miss nightmare. The extra memory access for zero-point lookup kills throughput. Per-tensor symmetric is fast. Per-channel symmetric is accurate. Pick one.

Absmax quantization: simple, but outliers in the tails stretch the scale and cost precision for the bulk of values. For weights: per-channel absmax is worth the overhead. For activations: per-tensor is fine (they're more uniform).

FP16 was a mistake for storage. It’s 2x larger than INT8 with minimal accuracy benefit for inference. Train in FP16/BF16, deploy in INT8.
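The absmax scheme above, as a minimal Gleam sketch over a plain List(Float) rather than tensor.Tensor (absmax_int8 and dequantize_int8_list are illustrative names, not part of this module's API):

import gleam/float
import gleam/int
import gleam/list

// Per-tensor absmax symmetric quantization: scale = max(|x|) / 127,
// q = round(x / scale). No zero-point, so dequantization is a single multiply.
pub fn absmax_int8(values: List(Float)) -> #(List(Int), Float) {
  let absmax =
    values
    |> list.map(float.absolute_value)
    |> list.fold(0.0, float.max)
  let scale = case absmax >. 0.0 {
    True -> absmax /. 127.0
    False -> 1.0
  }
  #(list.map(values, fn(x) { float.round(x /. scale) }), scale)
}

// q * scale recovers an approximation of x; the rounding error is permanent.
pub fn dequantize_int8_list(quantized: List(Int), scale: Float) -> List(Float) {
  list.map(quantized, fn(q) { int.to_float(q) *. scale })
}

Because |x| / scale <= 127 by construction, no explicit clamping is needed; the rounding error introduced here is the permanent quantization error noted under dequantize below.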

Inspired by: ggml, llama.cpp, Candle, bitsandbytes

Types

Access tracking for smart offloading

pub type AccessRecord {
  AccessRecord(
    tensor_id: Int,
    timestamp_ms: Int,
    access_count: Int,
  )
}

Constructors

  • AccessRecord(
      tensor_id: Int,
      timestamp_ms: Int,
      access_count: Int,
    )

Checkpoint for trading compute for memory. Key insight: recomputing the forward pass is cheaper than storing all activations.

pub type Checkpoint {
  Checkpoint(
    input: tensor.Tensor,
    forward_fn_id: Int,
    memory_saved_gb: Float,
  )
}

Constructors

  • Checkpoint(
      input: tensor.Tensor,
      forward_fn_id: Int,
      memory_saved_gb: Float,
    )

    Arguments

    input

    Input saved for recomputation

    forward_fn_id

    Function ID for forward pass replay

    memory_saved_gb

    Memory saved in GB (activations not stored)

Checkpointing strategy

pub type CheckpointStrategy {
  NoCheckpoint
  EveryN(n: Int)
  LargeLayersOnly(threshold_mb: Float)
  Adaptive(memory_pressure: Float)
}

Constructors

  • NoCheckpoint

    No checkpointing - fastest but uses most memory

  • EveryN(n: Int)

Checkpoint every N layers (N ≈ sqrt(number of layers) is the classic memory/compute optimum)

  • LargeLayersOnly(threshold_mb: Float)

    Only checkpoint layers above threshold (target the big ones)

  • Adaptive(memory_pressure: Float)

    Adaptive based on memory pressure (for dynamic batch sizing)

Compressed tensor

pub type CompressedTensor {
  CompressedTensor(
    data: List(Int),
    shape: List(Int),
    format: QuantFormat,
    memory_bytes: Int,
  )
}

Constructors

  • CompressedTensor(
      data: List(Int),
      shape: List(Int),
      format: QuantFormat,
      memory_bytes: Int,
    )

    Arguments

    data

    Quantized data (bytes simulated as ints for BEAM compatibility)

    shape

    Original shape preserved for dequantization

    format

    Quantization format with metadata

    memory_bytes

    Memory footprint in bytes (useful for memory budgeting)

Hierarchical memory system for offloading. Pattern: hot tensors on GPU, warm on RAM, cold on disk.

pub type MemoryHierarchy {
  MemoryHierarchy(
    gpu: MemoryTier,
    ram: MemoryTier,
    disk: option.Option(MemoryTier),
    total_effective_gb: Float,
  )
}

Constructors

  • MemoryHierarchy(
      gpu: MemoryTier,
      ram: MemoryTier,
      disk: option.Option(MemoryTier),
      total_effective_gb: Float,
    )

Memory pool for buffer reuse. Pattern: allocate once, reuse many times (avoids malloc overhead).

pub type MemoryPool {
  MemoryPool(
    free_buffers: List(#(Int, Int)),
    used_buffers: Int,
    total_allocated: Int,
  )
}

Constructors

  • MemoryPool(
      free_buffers: List(#(Int, Int)),
      used_buffers: Int,
      total_allocated: Int,
    )

    Arguments

    free_buffers

    Available buffers by size: (size_bytes, available_count)

    used_buffers

    Currently allocated count

    total_allocated

    Total bytes allocated (for memory tracking)

Memory tier with capacity and bandwidth tracking

pub type MemoryTier {
  MemoryTier(
    location: TensorLocation,
    capacity_gb: Float,
    used_gb: Float,
    bandwidth_gbps: Float,
  )
}

Constructors

  • MemoryTier(
      location: TensorLocation,
      capacity_gb: Float,
      used_gb: Float,
      bandwidth_gbps: Float,
    )

    Arguments

    bandwidth_gbps

    Bandwidth matters more than latency for large transfers

Offload policy determines when to move tensors between tiers

pub type OffloadPolicy {
  KeepOnGpu
  OffloadToRam(threshold_pct: Float)
  OffloadToDisk(ram_threshold: Float, disk_path: String)
  SmartOffload(access_history: List(AccessRecord))
}

Constructors

  • KeepOnGpu

    Keep everything on GPU (default, fastest)

  • OffloadToRam(threshold_pct: Float)

    Move to RAM when GPU usage exceeds threshold

  • OffloadToDisk(ram_threshold: Float, disk_path: String)

    Tiered: GPU -> RAM -> Disk

  • SmartOffload(access_history: List(AccessRecord))

    LRU-based: evict least recently accessed tensors first

Quantization format

pub type QuantFormat {
  Fp32
  Fp16
  Int8(scale: Float)
  Quant4(block_size: Int, scales: List(Float))
  Quant4Min(
    block_size: Int,
    scales: List(Float),
    mins: List(Float),
  )
}

Constructors

  • Fp32

Full precision (32 bits, 4 bytes per value). Memory: 4 bytes/param | Compression: 1x

  • Fp16

Half precision (16 bits, 2 bytes per value). Memory: 2 bytes/param | Compression: 2x. Opinion: only use for training; deploy with INT8.

  • Int8(scale: Float)

Integer 8-bit with a scale factor (1 byte per value + 1 FP32 scale). Memory: ~1 byte/param | Compression: 4x. Sweet spot: best accuracy/compression tradeoff for inference.

  • Quant4(block_size: Int, scales: List(Float))

4-bit quantized (0.5 bytes per value), GGML style. Memory: 0.5 bytes/param + scale overhead | Compression: ~7x. Use NF4 instead for normal distributions (see nf4.gleam).

  • Quant4Min(
      block_size: Int,
      scales: List(Float),
      mins: List(Float),
    )

4-bit with per-block scale and min (asymmetric; more accurate for ReLU activations). Memory: 0.5 bytes/param + 2 floats per block | Compression: ~6.5x
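To make the per-format figures concrete, here is an illustrative sketch (not part of the module's API) that derives approximate bytes per parameter from a QuantFormat, assuming 4-byte FP32 scales and mins:

import gleam/int
import viva_tensor/quant/compression.{
  type QuantFormat, Fp16, Fp32, Int8, Quant4, Quant4Min,
}

// Approximate storage cost per parameter for each format.
pub fn approx_bytes_per_param(format: QuantFormat) -> Float {
  case format {
    Fp32 -> 4.0
    Fp16 -> 2.0
    // One FP32 scale amortised over the whole tensor is negligible.
    Int8(_) -> 1.0
    // 4 bits of data plus a 4-byte scale shared by each block.
    Quant4(block_size, _) -> 0.5 +. 4.0 /. int.to_float(block_size)
    // 4 bits of data plus a scale and a min per block.
    Quant4Min(block_size, _, _) -> 0.5 +. 8.0 /. int.to_float(block_size)
  }
}

With block_size = 64 this gives 0.5625 bytes/param for Quant4 (4 / 0.5625 ≈ 7.1x) and 0.625 bytes/param for Quant4Min (≈ 6.4x), consistent with the ~7x and ~6.5x figures above.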

Streaming tensor: loads chunks on demand. Use case: models too large for any single memory tier.

pub type StreamedTensor {
  StreamedTensor(
    id: Int,
    shape: List(Int),
    chunk_shape: List(Int),
    loaded_chunks: List(Int),
    total_chunks: Int,
    format: QuantFormat,
  )
}

Constructors

  • StreamedTensor(
      id: Int,
      shape: List(Int),
      chunk_shape: List(Int),
      loaded_chunks: List(Int),
      total_chunks: Int,
      format: QuantFormat,
    )

Tensor location in the memory hierarchy. Performance note: a GPU->RAM transfer over PCIe 4.0 x16 runs at ~25 GB/s, i.e. about 40 ms per GB (worked through in the sketch after the constructor list). Plan your offloading accordingly.

pub type TensorLocation {
  OnGpu(device_id: Int)
  OnRam
  OnDisk(path: String)
  Hybrid(gpu_pct: Float)
}

Constructors

  • OnGpu(device_id: Int)

    VRAM - fastest (RTX 4090: 1008 GB/s bandwidth)

  • OnRam

    System RAM - medium (DDR5-6400: ~100 GB/s dual channel)

  • OnDisk(path: String)

    NVMe SSD - slow but huge (7 GB/s sequential)

  • Hybrid(gpu_pct: Float)

    Hybrid: split across GPU and RAM (gradient offloading pattern)
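The transfer-time arithmetic from the note above, as a small illustrative helper (estimate_transfer_ms is not part of this module):

// Time to move `size_gb` gigabytes over a link with `bandwidth_gbps` GB/s.
pub fn estimate_transfer_ms(size_gb: Float, bandwidth_gbps: Float) -> Float {
  size_gb /. bandwidth_gbps *. 1000.0
}
// estimate_transfer_ms(1.0, 25.0) == 40.0   -- 1 GB over PCIe 4.0 x16
// estimate_transfer_ms(1.0, 7.0)  ~~ 143.0  -- 1 GB from NVMe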

Values

pub fn allocate_tensor(
  hierarchy: MemoryHierarchy,
  tensor_size_gb: Float,
  policy: OffloadPolicy,
) -> #(TensorLocation, MemoryHierarchy)

Allocate tensor in the memory hierarchy based on policy. Returns the new location and the updated hierarchy state.

pub fn checkpoint_savings(
  num_layers: Int,
  layer_size_mb: Float,
  strategy: CheckpointStrategy,
) -> Float

Calculate memory savings from checkpointing. Note: this trades ~33% compute overhead for 50-75% memory savings.
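A hedged usage sketch, assuming the returned Float is in the same megabyte units as layer_size_mb (the signature does not specify the unit):

import gleam/float
import gleam/io
import viva_tensor/quant/compression.{EveryN, checkpoint_savings}

pub fn checkpointing_example() -> Nil {
  // 48 transformer layers at ~500 MB of activations each.
  // EveryN(7) follows the sqrt(48) ~ 7 rule of thumb from CheckpointStrategy.
  let saved = checkpoint_savings(48, 500.0, EveryN(7))
  io.println("Activation memory saved: " <> float.to_string(saved))
}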

pub fn create_memory_hierarchy(
  vram_gb: Float,
  ram_gb: Float,
  disk_path: option.Option(String),
) -> MemoryHierarchy

Create memory hierarchy for a typical workstation setup. Example: RTX 4090 (24 GB) + DDR5 RAM (32 GB) = 56 GB physical. With INT8: 56 GB * 4 = 224 GB effective parameter storage.
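A usage sketch combining create_memory_hierarchy and allocate_tensor for the workstation above (whether threshold_pct is a fraction or a percentage is an assumption here):

import gleam/option.{None}
import viva_tensor/quant/compression.{
  type MemoryHierarchy, type TensorLocation, OffloadToRam, allocate_tensor,
  create_memory_hierarchy,
}

pub fn workstation_example() -> #(TensorLocation, MemoryHierarchy) {
  // 24 GB VRAM + 32 GB RAM, no disk tier.
  let hierarchy = create_memory_hierarchy(24.0, 32.0, None)
  // Place a 10 GB tensor; spill to RAM once GPU usage passes the threshold
  // (assumed here to be a fraction, i.e. 0.9 = 90%).
  allocate_tensor(hierarchy, 10.0, OffloadToRam(0.9))
}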

pub fn create_pool() -> MemoryPool

Create empty memory pool

pub fn create_streamed(
  shape: List(Int),
  chunk_dim: Int,
) -> StreamedTensor

Create streaming tensor with specified chunk size

pub fn demonstrate_compression() -> Nil
pub fn dequantize(ct: CompressedTensor) -> tensor.Tensor

Dequantize compressed tensor back to FP32. Note: this is NOT lossless; quantization error is permanent.

pub fn load_chunk(
  st: StreamedTensor,
  chunk_idx: Int,
) -> StreamedTensor

Load specific chunk into memory

pub fn main() -> Nil
pub fn pool_alloc(
  pool: MemoryPool,
  size: Int,
) -> #(MemoryPool, Bool)

Allocate from pool (reuses an existing buffer if available). Returns: (updated_pool, was_reused)

pub fn pool_free(pool: MemoryPool, size: Int) -> MemoryPool

Return buffer to pool for reuse
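A usage sketch of the pool round trip, following the documented semantics (was_reused is False on a cold allocation and True once a freed buffer of the same size is available):

import viva_tensor/quant/compression.{create_pool, pool_alloc, pool_free}

pub fn pool_example() -> Nil {
  let pool = create_pool()
  // Cold allocation: no free buffer of this size yet, so was_reused is False.
  let #(pool, _reused_first) = pool_alloc(pool, 4096)
  // Returning the buffer makes it available for the next 4096-byte request.
  let pool = pool_free(pool, 4096)
  // Same size again: the freed buffer is reused instead of a fresh allocation.
  let #(_pool, _reused_second) = pool_alloc(pool, 4096)
  Nil
}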

pub fn quantize_int8(t: tensor.Tensor) -> CompressedTensor

Quantize tensor to INT8 using absmax symmetric quantization

Compression: 32/8 = 4x. Error: typically <0.5% for well-distributed weights.

Implementation: absmax per-tensor (fast but less accurate than per-channel). For production, consider per-channel for weights and per-tensor for activations.
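A round-trip sketch using only this module's functions; the import path viva_tensor/tensor for the tensor module is an assumption:

import viva_tensor/tensor
import viva_tensor/quant/compression.{dequantize, quantize_int8}

// Compress weights for inference, then reconstruct an FP32 approximation.
pub fn int8_roundtrip(t: tensor.Tensor) -> tensor.Tensor {
  let compressed = quantize_int8(t)
  // compressed.memory_bytes is ~1/4 of the FP32 footprint.
  dequantize(compressed)
}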

pub fn quantize_q4(
  t: tensor.Tensor,
  block_size: Int,
) -> CompressedTensor

Quantize to Q4 (4-bit) using block-wise absmax - GGML style

Compression: 32/4 = 8x theoretical, ~7x effective with scale overhead (worked out in the sketch after the list). Block size tradeoff:

  • Smaller blocks (32): More scales, more accurate, less compression
  • Larger blocks (128): Fewer scales, less accurate, more compression
  • Sweet spot: 64 (empirically validated in GGML/QLoRA)
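As referenced above, a small illustrative helper for the scale-overhead arithmetic (q4_effective_compression is not part of this module; it assumes one 4-byte FP32 scale per block):

import gleam/int

// Effective compression vs FP32: 4 data bits per value plus a 32-bit scale
// shared by each block of block_size values.
pub fn q4_effective_compression(block_size: Int) -> Float {
  let bits_per_value = 4.0 +. 32.0 /. int.to_float(block_size)
  32.0 /. bits_per_value
}
// q4_effective_compression(32)  ~ 6.4x
// q4_effective_compression(64)  ~ 7.1x
// q4_effective_compression(128) ~ 7.5x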

pub fn unload_chunk(
  st: StreamedTensor,
  chunk_idx: Int,
) -> StreamedTensor

Unload chunk to free memory
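A hedged sketch of the streaming pattern, keeping one chunk resident at a time (visit_all_chunks is illustrative and assumes total_chunks >= 1):

import gleam/list
import viva_tensor/quant/compression.{
  type StreamedTensor, load_chunk, unload_chunk,
}

// Walk every chunk of a streamed tensor, keeping only one in memory at a time.
pub fn visit_all_chunks(st: StreamedTensor) -> StreamedTensor {
  list.range(0, st.total_chunks - 1)
  |> list.fold(st, fn(acc, idx) {
    let loaded = load_chunk(acc, idx)
    // ... compute against chunk `idx` here ...
    unload_chunk(loaded, idx)
  })
}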
