viva_tensor/quant/compression
INT8 Quantization and Memory Hierarchy System
Reference: Jacob et al. (2017) - “Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference” https://arxiv.org/abs/1712.05877
Compression math (INT8): 32 bits / 8 bits = 4x compression
- 24GB VRAM -> 96GB effective parameter storage
- RTX 4090 INT8 Tensor Cores: 2x throughput vs FP16 (660 vs 330 TOPS)
Why symmetric quantization? Because asymmetric zero-points are a cache-miss nightmare. The extra memory access for zero-point lookup kills throughput. Per-tensor symmetric is fast. Per-channel symmetric is accurate. Pick one.
absmax quantization: simple but loses dynamic range at the tails. For weights: per-channel absmax is worth the overhead. For activations: per-tensor is fine (they’re more uniform).
FP16 was a mistake for storage. It’s 2x larger than INT8 with minimal accuracy benefit for inference. Train in FP16/BF16, deploy in INT8.
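A minimal, self-contained sketch of the per-tensor absmax scheme described above; the helper name and example values are illustrative only:

import gleam/float
import gleam/list

// Per-tensor absmax symmetric INT8: scale = absmax / 127, q = round(x / scale).
// Dequantization is q * scale; no zero-point lookup is needed.
pub fn absmax_int8_example() -> List(Int) {
  let values = [0.42, -1.7, 0.03, 1.25]
  let absmax =
    list.fold(values, 0.0, fn(acc, x) { float.max(acc, float.absolute_value(x)) })
  let scale = absmax /. 127.0
  // -> [31, -127, 2, 93] with scale ~0.0134
  list.map(values, fn(x) { float.round(x /. scale) })
}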
Inspired by: ggml, llama.cpp, Candle, bitsandbytes
Types
Access tracking for smart offloading
pub type AccessRecord {
AccessRecord(
tensor_id: Int,
timestamp_ms: Int,
access_count: Int,
)
}
Constructors
- AccessRecord(tensor_id: Int, timestamp_ms: Int, access_count: Int)
Checkpoint for trading compute for memory. Key insight: recomputing the forward pass is cheaper than storing all activations.
pub type Checkpoint {
Checkpoint(
input: tensor.Tensor,
forward_fn_id: Int,
memory_saved_gb: Float,
)
}
Constructors
- Checkpoint(input: tensor.Tensor, forward_fn_id: Int, memory_saved_gb: Float)

Arguments
- input: Input saved for recomputation
- forward_fn_id: Function ID for forward-pass replay
- memory_saved_gb: Memory saved in GB (activations not stored)
Checkpointing strategy
pub type CheckpointStrategy {
NoCheckpoint
EveryN(n: Int)
LargeLayersOnly(threshold_mb: Float)
Adaptive(memory_pressure: Float)
}
Constructors
- NoCheckpoint: No checkpointing - fastest, but uses the most memory
- EveryN(n: Int): Checkpoint every N layers; n ≈ sqrt(num_layers) is the classic memory/compute optimum (see the sketch after this list)
- LargeLayersOnly(threshold_mb: Float): Only checkpoint layers above the threshold (target the big ones)
- Adaptive(memory_pressure: Float): Adapt based on memory pressure (for dynamic batch sizing)
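Why sqrt? With EveryN(n), roughly num_layers / n checkpointed inputs stay resident, plus up to n live activations while one segment is recomputed. A minimal sketch (hypothetical helper, not part of this module):

// Resident activation count under EveryN(n) checkpointing.
pub fn resident_activations(num_layers: Int, n: Int) -> Int {
  num_layers / n + n
}
// resident_activations(64, 8) == 16, versus 64 with NoCheckpoint;
// the sum num_layers / n + n is minimised at n = sqrt(num_layers).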
Compressed tensor
pub type CompressedTensor {
CompressedTensor(
data: List(Int),
shape: List(Int),
format: QuantFormat,
memory_bytes: Int,
)
}
Constructors
- CompressedTensor(data: List(Int), shape: List(Int), format: QuantFormat, memory_bytes: Int)

Arguments
- data: Quantized data (bytes simulated as ints for BEAM compatibility)
- shape: Original shape, preserved for dequantization
- format: Quantization format with metadata
- memory_bytes: Memory footprint in bytes (useful for memory budgeting)
Hierarchical memory system for offloading. Pattern: hot tensors on GPU, warm on RAM, cold on disk.
pub type MemoryHierarchy {
MemoryHierarchy(
gpu: MemoryTier,
ram: MemoryTier,
disk: option.Option(MemoryTier),
total_effective_gb: Float,
)
}
Constructors
- MemoryHierarchy(gpu: MemoryTier, ram: MemoryTier, disk: option.Option(MemoryTier), total_effective_gb: Float)

Arguments
- total_effective_gb: Effective capacity after the quantization multiplier
Memory pool for buffer reuse. Pattern: allocate once, reuse many times (avoids malloc overhead).
pub type MemoryPool {
MemoryPool(
free_buffers: List(#(Int, Int)),
used_buffers: Int,
total_allocated: Int,
)
}
Constructors
- MemoryPool(free_buffers: List(#(Int, Int)), used_buffers: Int, total_allocated: Int)

Arguments
- free_buffers: Available buffers by size: (size_bytes, available_count)
- used_buffers: Count of buffers currently in use
- total_allocated: Total bytes allocated (for memory tracking)
Memory tier with capacity and bandwidth tracking
pub type MemoryTier {
MemoryTier(
location: TensorLocation,
capacity_gb: Float,
used_gb: Float,
bandwidth_gbps: Float,
)
}
Constructors
- MemoryTier(location: TensorLocation, capacity_gb: Float, used_gb: Float, bandwidth_gbps: Float)

Arguments
- bandwidth_gbps: Bandwidth matters more than latency for large transfers
Offload policy determines when to move tensors between tiers
pub type OffloadPolicy {
KeepOnGpu
OffloadToRam(threshold_pct: Float)
OffloadToDisk(ram_threshold: Float, disk_path: String)
SmartOffload(access_history: List(AccessRecord))
}
Constructors
- KeepOnGpu: Keep everything on GPU (default, fastest)
- OffloadToRam(threshold_pct: Float): Move to RAM when GPU usage exceeds the threshold
- OffloadToDisk(ram_threshold: Float, disk_path: String): Tiered: GPU -> RAM -> Disk
- SmartOffload(access_history: List(AccessRecord)): LRU-based: evict the least recently accessed tensors first
Quantization format
pub type QuantFormat {
Fp32
Fp16
Int8(scale: Float)
Quant4(block_size: Int, scales: List(Float))
Quant4Min(
block_size: Int,
scales: List(Float),
mins: List(Float),
)
}
Constructors
- Fp32: Full precision (32 bits, 4 bytes per value). Memory: 4 bytes/param | Compression: 1x
- Fp16: Half precision (16 bits, 2 bytes per value). Memory: 2 bytes/param | Compression: 2x. Opinion: only use for training; deploy with INT8.
- Int8(scale: Float): 8-bit integer with a single per-tensor scale (1 byte per value + 1 float). Memory: ~1 byte/param | Compression: 4x. Sweet spot: best accuracy/compression tradeoff for inference.
- Quant4(block_size: Int, scales: List(Float)): 4-bit quantized (0.5 bytes per value), GGML style. Memory: 0.5 bytes/param + scale overhead | Compression: ~7x. Use NF4 instead for normally distributed weights (see nf4.gleam).
- Quant4Min(block_size: Int, scales: List(Float), mins: List(Float)): 4-bit with per-block min and scale (asymmetric; more accurate for ReLU activations). Memory: 0.5 bytes/param + 2 floats per block | Compression: ~6.5x. (A rough per-format budgeting sketch follows below.)
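A rough budgeting sketch over these formats; the helper is hypothetical and ignores per-block scale overhead for the 4-bit variants:

import viva_tensor/quant/compression.{
  type QuantFormat, Fp16, Fp32, Int8, Quant4, Quant4Min,
}

// Approximate storage cost per parameter for each format.
pub fn approx_bytes_per_param(format: QuantFormat) -> Float {
  case format {
    Fp32 -> 4.0
    Fp16 -> 2.0
    Int8(_) -> 1.0
    Quant4(_, _) -> 0.5
    Quant4Min(_, _, _) -> 0.5
  }
}
// A 7B-parameter model in Q4: 7e9 * 0.5 bytes ≈ 3.5 GB, versus 28 GB in FP32.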
Streaming tensor - loads chunks on demand. Use case: models too large for any single memory tier.
pub type StreamedTensor {
StreamedTensor(
id: Int,
shape: List(Int),
chunk_shape: List(Int),
loaded_chunks: List(Int),
total_chunks: Int,
format: QuantFormat,
)
}
Constructors
- StreamedTensor(id: Int, shape: List(Int), chunk_shape: List(Int), loaded_chunks: List(Int), total_chunks: Int, format: QuantFormat)
Tensor location in the memory hierarchy. Performance note: GPU -> RAM transfer on PCIe 4.0 x16 runs at ~25 GB/s, i.e. 40 ms for 1 GB. Plan your offloading accordingly (a transfer-time sketch follows the constructor list below).
pub type TensorLocation {
OnGpu(device_id: Int)
OnRam
OnDisk(path: String)
Hybrid(gpu_pct: Float)
}
Constructors
- OnGpu(device_id: Int): VRAM - fastest (RTX 4090: 1008 GB/s bandwidth)
- OnRam: System RAM - medium (DDR5-6400: ~100 GB/s dual channel)
- OnDisk(path: String): NVMe SSD - slow but huge (~7 GB/s sequential)
- Hybrid(gpu_pct: Float): Split across GPU and RAM (gradient offloading pattern)
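A back-of-the-envelope transfer-time helper (hypothetical, not part of this module), using the bandwidth figures quoted above:

// Milliseconds to move size_gb across a link with the given bandwidth (GB/s).
pub fn transfer_ms(size_gb: Float, bandwidth_gbps: Float) -> Float {
  size_gb /. bandwidth_gbps *. 1000.0
}
// transfer_ms(1.0, 25.0) == 40.0     GPU -> RAM over PCIe 4.0 x16
// transfer_ms(1.0, 7.0) ≈ 142.9      RAM -> NVMe SSD
// transfer_ms(1.0, 1008.0) ≈ 1.0     within RTX 4090 VRAM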
Values
pub fn allocate_tensor(
hierarchy: MemoryHierarchy,
tensor_size_gb: Float,
policy: OffloadPolicy,
) -> #(TensorLocation, MemoryHierarchy)
Allocate a tensor in the memory hierarchy based on the policy. Returns the new location and the updated hierarchy state.
pub fn checkpoint_savings(
num_layers: Int,
layer_size_mb: Float,
strategy: CheckpointStrategy,
) -> Float
Calculate memory savings from checkpointing. Note: this trades ~33% compute overhead for 50-75% memory savings.
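A usage sketch based on the signature above; the layer count and size are illustrative, and the return value is assumed to be in GB (matching memory_saved_gb elsewhere in this module):

import viva_tensor/quant/compression

pub fn checkpointing_example() -> Float {
  // 48 layers of 512 MB each, checkpointing every 7 (≈ sqrt(48)) layers.
  compression.checkpoint_savings(48, 512.0, compression.EveryN(7))
}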
pub fn create_memory_hierarchy(
vram_gb: Float,
ram_gb: Float,
disk_path: option.Option(String),
) -> MemoryHierarchy
Create a memory hierarchy for a typical workstation setup. Example: RTX 4090 (24GB) + DDR5 RAM (32GB) = 56GB physical. With INT8: 56GB * 4 = 224GB effective parameter storage.
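A usage sketch mirroring the workstation example above; the signatures come from this page, while the policy choice and threshold interpretation are assumptions:

import gleam/option.{None}
import viva_tensor/quant/compression

pub fn workstation_example() {
  // RTX 4090 (24 GB VRAM) + 32 GB DDR5 system RAM, no disk tier.
  let hierarchy = compression.create_memory_hierarchy(24.0, 32.0, None)
  // Place a 6 GB tensor, spilling to RAM once GPU usage passes the threshold
  // (assuming threshold_pct is expressed as a percentage).
  compression.allocate_tensor(hierarchy, 6.0, compression.OffloadToRam(90.0))
}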
pub fn create_streamed(
shape: List(Int),
chunk_dim: Int,
) -> StreamedTensor
Create streaming tensor with specified chunk size
pub fn demonstrate_compression() -> Nil
pub fn dequantize(ct: CompressedTensor) -> tensor.Tensor
Dequantize a compressed tensor back to FP32. Note: this is NOT lossless; quantization error is permanent.
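A round-trip sketch: quantize_int8 and dequantize are from this page, while the viva_tensor/tensor module path and tensor.from_list constructor are assumed for illustration:

import viva_tensor/quant/compression
import viva_tensor/tensor

pub fn roundtrip_example() -> tensor.Tensor {
  // Hypothetical constructor: build a small FP32 tensor.
  let original = tensor.from_list([0.42, -1.7, 0.03, 1.25])
  let compressed = compression.quantize_int8(original)
  // Not lossless: each value comes back with up to scale/2 of rounding error.
  compression.dequantize(compressed)
}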
pub fn load_chunk(
st: StreamedTensor,
chunk_idx: Int,
) -> StreamedTensor
Load specific chunk into memory
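A usage sketch from the signatures on this page, assuming chunk_dim selects the dimension to split along:

import viva_tensor/quant/compression

pub fn streaming_example() -> compression.StreamedTensor {
  // Chunk a [65536, 4096] weight matrix along dimension 0.
  let st = compression.create_streamed([65_536, 4096], 0)
  let st = compression.load_chunk(st, 0)
  // Consume the chunk, then release it to stay under budget.
  compression.unload_chunk(st, 0)
}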
pub fn pool_alloc(
pool: MemoryPool,
size: Int,
) -> #(MemoryPool, Bool)
Allocate from the pool (reuses an existing buffer if available). Returns: #(updated_pool, was_reused)
pub fn pool_free(pool: MemoryPool, size: Int) -> MemoryPool
Return buffer to pool for reuse
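A usage sketch of the pool API; the MemoryPool construction follows the record definition earlier on this page, and the sizes are illustrative:

import viva_tensor/quant/compression.{MemoryPool}

pub fn pool_example() -> compression.MemoryPool {
  // Start with two free 1 MiB buffers already allocated.
  let pool =
    MemoryPool(
      free_buffers: [#(1_048_576, 2)],
      used_buffers: 0,
      total_allocated: 2_097_152,
    )
  // Should reuse a free buffer rather than allocating a new one.
  let #(pool, _was_reused) = compression.pool_alloc(pool, 1_048_576)
  // Hand the buffer back so later allocations of this size can reuse it.
  compression.pool_free(pool, 1_048_576)
}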
pub fn quantize_int8(t: tensor.Tensor) -> CompressedTensor
Quantize a tensor to INT8 using absmax symmetric quantization.
Compression: 32/8 = 4x. Error: typically <0.5% for well-distributed weights.
Implementation: per-tensor absmax (fast, but less accurate than per-channel). For production, consider per-channel for weights and per-tensor for activations.
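The per-channel variant recommended above is not provided by this function; a minimal sketch (hypothetical helper) of computing one absmax scale per output channel:

import gleam/float
import gleam/list

// One symmetric INT8 scale per row (output channel) of a weight matrix.
pub fn per_channel_scales(rows: List(List(Float))) -> List(Float) {
  list.map(rows, fn(row) {
    let absmax =
      list.fold(row, 0.0, fn(acc, x) { float.max(acc, float.absolute_value(x)) })
    absmax /. 127.0
  })
}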
pub fn quantize_q4(
t: tensor.Tensor,
block_size: Int,
) -> CompressedTensor
Quantize to Q4 (4-bit) using block-wise absmax, GGML style.
Compression: 32/4 = 8x theoretical, ~7x effective with scale overhead. Block size tradeoff:
- Smaller blocks (32): More scales, more accurate, less compression
- Larger blocks (128): Fewer scales, less accurate, more compression
- Sweet spot: 64 (empirically validated in GGML/QLoRA)
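The effective ratios above follow from 4 data bits per value plus one 32-bit scale per block; a small helper (hypothetical, not in this module) makes the arithmetic explicit:

import gleam/int

pub fn q4_effective_compression(block_size: Int) -> Float {
  // 4 data bits per value plus a 32-bit scale amortised over the block.
  let bits_per_value = 4.0 +. 32.0 /. int.to_float(block_size)
  32.0 /. bits_per_value
}
// q4_effective_compression(32) ≈ 6.4x
// q4_effective_compression(64) ≈ 7.1x
// q4_effective_compression(128) ≈ 7.5x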
pub fn unload_chunk(
st: StreamedTensor,
chunk_idx: Int,
) -> StreamedTensor
Unload chunk to free memory