viva_tensor/compression

Compression System: makes 24 GB of VRAM behave like 48 GB or more.

COMBINED TECHNIQUES:

INT8 quantization → 4x less memory than FP32 (24 GB holds what would need 96 GB in FP32)
GPU/CPU offloading → +32 GB of RAM as an extension
Gradient checkpointing → recomputes instead of storing
Tensor streaming → loads on demand
Memory pooling → reuses buffers

RESULT: 24 GB VRAM + 32 GB RAM (56 GB physical) ≈ 80 GB effective once compression is applied.
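
The arithmetic behind figures like these can be sketched as follows. This is a back-of-the-envelope estimate, not the library's own calculation: `fp32_equivalent_gb` is a hypothetical helper, and it assumes every value is stored in the compressed format with no overhead for scales or metadata.

```gleam
import gleam/float
import gleam/io

/// Hypothetical helper: how many GB of FP32 data fit in `physical_gb`
/// when each value takes `bytes_per_value` bytes instead of 4.
pub fn fp32_equivalent_gb(physical_gb: Float, bytes_per_value: Float) -> Float {
  physical_gb *. 4.0 /. bytes_per_value
}

pub fn main() -> Nil {
  // 24 GB of VRAM holding INT8 values (1 byte each)
  io.println(float.to_string(fp32_equivalent_gb(24.0, 1.0)))
  // 24 GB VRAM + 32 GB RAM holding FP16 values (2 bytes each)
  io.println(float.to_string(fp32_equivalent_gb(24.0 +. 32.0, 2.0)))
}
```

The INT8 case gives 96.0, which is where the 24 GB to 96 GB figure comes from. Real effective capacity is lower, since activations, scales, and transfer buffers also take space.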

Inspired by: ggml, llama.cpp, Candle, bitsandbytes

Types

AccessRecord

Access record (used by smart offload)

pub type AccessRecord {
  AccessRecord(
    tensor_id: Int,
    timestamp_ms: Int,
    access_count: Int,
  )
}

Constructors

AccessRecord(
  tensor_id: Int,
  timestamp_ms: Int,
  access_count: Int,
)

Checkpoint

Gradient checkpoint

pub type Checkpoint {
  Checkpoint(
    input: tensor.Tensor,
    forward_fn_id: Int,
    memory_saved_gb: Float,
  )
}

Constructors

```
Checkpoint(
  input: tensor.Tensor,
  forward_fn_id: Int,
  memory_saved_gb: Float,
)
```
Arguments

input

Input saved for recomputation

forward_fn_id

ID of the forward function used for recomputation

memory_saved_gb

Memory saved, in GB

CheckpointStrategy

Checkpointing strategy

pub type CheckpointStrategy {
  NoCheckpoint
  EveryN(n: Int)
  LargeLayersOnly(threshold_mb: Float)
  Adaptive(memory_pressure: Float)
}

Constructors

```
NoCheckpoint
```
No checkpointing (uses the most memory)
```
EveryN(n: Int)
```
Checkpoint every N layers
```
LargeLayersOnly(threshold_mb: Float)
```
Checkpoint only large layers (selected by `threshold_mb`)
```
Adaptive(memory_pressure: Float)
```
Adaptive checkpointing based on memory pressure

CompressedTensor

Compressed tensor

pub type CompressedTensor {
  CompressedTensor(
    data: List(Int),
    shape: List(Int),
    format: QuantFormat,
    memory_bytes: Int,
  )
}

Constructors

```
CompressedTensor(
  data: List(Int),
  shape: List(Int),
  format: QuantFormat,
  memory_bytes: Int,
)
```
Arguments

data

Quantized data (bytes simulated as ints)

shape

Original shape

format

Quantization format

memory_bytes

Memory used, in bytes

MemoryHierarchy

Hierarchical memory system

pub type MemoryHierarchy {
  MemoryHierarchy(
    gpu: MemoryTier,
    ram: MemoryTier,
    disk: option.Option(MemoryTier),
    total_effective_gb: Float,
  )
}

Constructors

MemoryHierarchy(
  gpu: MemoryTier,
  ram: MemoryTier,
  disk: option.Option(MemoryTier),
  total_effective_gb: Float,
)

MemoryPool

Memory pool

pub type MemoryPool {
  MemoryPool(
    free_buffers: List(#(Int, Int)),
    used_buffers: Int,
    total_allocated: Int,
  )
}

Constructors

```
MemoryPool(
  free_buffers: List(#(Int, Int)),
  used_buffers: Int,
  total_allocated: Int,
)
```
Arguments

free_buffers

Available buffers, by size

used_buffers

Buffers in use

total_allocated

Total allocated, in bytes

MemoryTier

Memory tier for offloading

pub type MemoryTier {
  MemoryTier(
    location: TensorLocation,
    capacity_gb: Float,
    used_gb: Float,
    bandwidth_gbps: Float,
  )
}

Constructors

MemoryTier(
  location: TensorLocation,
  capacity_gb: Float,
  used_gb: Float,
  bandwidth_gbps: Float,
)

OffloadPolicy

Offload policy

pub type OffloadPolicy {
  KeepOnGpu
  OffloadToRam(threshold_pct: Float)
  OffloadToDisk(ram_threshold: Float, disk_path: String)
  SmartOffload(access_history: List(AccessRecord))
}

Constructors

```
KeepOnGpu
```
Keeps everything on the GPU (default)
```
OffloadToRam(threshold_pct: Float)
```
Moves tensors to RAM when GPU usage exceeds the threshold

OffloadToDisk(ram_threshold: Float, disk_path: String)

Moves tensors to disk when RAM usage exceeds the threshold

SmartOffload(access_history: List(AccessRecord))

Smart: prioritizes by access frequency
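
To illustrate what prioritizing by access frequency could look like, here is a minimal sketch that orders tensors from least to most frequently accessed, so the coldest ones would be offloaded first. `offload_order` is a hypothetical helper, not part of this module, and the library's actual `SmartOffload` logic may weigh `timestamp_ms` or other factors as well.

```gleam
import gleam/int
import gleam/io
import gleam/list
import viva_tensor/compression.{type AccessRecord, AccessRecord}

/// Hypothetical helper: tensor ids ordered from least to most accessed,
/// i.e. the order in which they would be candidates for offloading.
pub fn offload_order(history: List(AccessRecord)) -> List(Int) {
  history
  |> list.sort(by: fn(a, b) { int.compare(a.access_count, b.access_count) })
  |> list.map(fn(record) { record.tensor_id })
}

pub fn main() -> Nil {
  let history = [
    AccessRecord(tensor_id: 1, timestamp_ms: 1000, access_count: 42),
    AccessRecord(tensor_id: 2, timestamp_ms: 1500, access_count: 3),
    AccessRecord(tensor_id: 3, timestamp_ms: 2000, access_count: 17),
  ]
  // Prints one tensor id per line, coldest first
  list.each(offload_order(history), fn(id) { io.println(int.to_string(id)) })
}
```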

QuantFormat

Quantization format

pub type QuantFormat {
  Fp32
  Fp16
  Int8(scale: Float)
  Quant4(block_size: Int, scales: List(Float))
  Quant4Min(
    block_size: Int,
    scales: List(Float),
    mins: List(Float),
  )
}

Constructors

```
Fp32
```
Full precision (32 bits, 4 bytes per value)
```
Fp16
```
Half precision (16 bits, 2 bytes per value)
```
Int8(scale: Float)
```
8-bit integer with scale (1 byte per value + 1 float per block)
```
Quant4(block_size: Int, scales: List(Float))
```
4-bit quantized (0.5 bytes per value), GGML style

Quant4Min(
  block_size: Int,
  scales: List(Float),
  mins: List(Float),
)

4-bit with per-block scales and minimums (more precise)
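
The per-value sizes listed for each format translate into payload sizes roughly as sketched below. `payload_bytes` is a hypothetical helper written for illustration; it counts only the quantized values and ignores the extra floats stored for scales and minimums.

```gleam
import viva_tensor/compression.{
  type QuantFormat, Fp16, Fp32, Int8, Quant4, Quant4Min,
}

/// Hypothetical helper: bytes needed for `num_values` values in `format`,
/// excluding per-block scales and minimums.
pub fn payload_bytes(format: QuantFormat, num_values: Int) -> Int {
  case format {
    Fp32 -> num_values * 4
    Fp16 -> num_values * 2
    Int8(_) -> num_values
    // Two 4-bit values share one byte; round up for odd counts
    Quant4(_, _) | Quant4Min(_, _, _) -> { num_values + 1 } / 2
  }
}
```

For one million values this gives 4,000,000 bytes in `Fp32` and 500,000 bytes in `Quant4`, which is the 8x ratio quoted for Q4.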

StreamedTensor

Streamed tensor (not loaded all at once)

pub type StreamedTensor {
  StreamedTensor(
    id: Int,
    shape: List(Int),
    chunk_shape: List(Int),
    loaded_chunks: List(Int),
    total_chunks: Int,
    format: QuantFormat,
  )
}

Constructors

StreamedTensor(
  id: Int,
  shape: List(Int),
  chunk_shape: List(Int),
  loaded_chunks: List(Int),
  total_chunks: Int,
  format: QuantFormat,
)

Arguments

id: ID used for reference
shape: Full shape
chunk_shape: Size of each chunk
loaded_chunks: Chunks currently loaded
total_chunks: Total number of chunks
format: Compression format

TensorLocation

Tensor location

pub type TensorLocation {
  OnGpu(device_id: Int)
  OnRam
  OnDisk(path: String)
  Hybrid(gpu_pct: Float)
}

Constructors

```
OnGpu(device_id: Int)
```
In GPU VRAM (fast)
```
OnRam
```
In system RAM (medium)
```
OnDisk(path: String)
```
On disk (slow, but limited only by disk space)
```
Hybrid(gpu_pct: Float)
```
Hybrid: part on GPU, part in RAM

Values

allocate_tensor

pub fn allocate_tensor(
  hierarchy: MemoryHierarchy,
  tensor_size_gb: Float,
  policy: OffloadPolicy,
) -> #(TensorLocation, MemoryHierarchy)

Decides where to place a tensor
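
A possible usage sketch, based only on the signatures documented on this page. The value `0.8` for `threshold_pct` assumes the threshold is a fraction rather than a percentage; the page does not say which, so check the source before relying on it.

```gleam
import gleam/float
import gleam/io
import gleam/option
import viva_tensor/compression

pub fn main() -> Nil {
  // 24 GB of VRAM, 32 GB of RAM, no disk tier
  let hierarchy = compression.create_memory_hierarchy(24.0, 32.0, option.None)

  // Assumption: threshold_pct is a fraction (0.8 meaning 80%)
  let policy = compression.OffloadToRam(threshold_pct: 0.8)
  let #(location, hierarchy) =
    compression.allocate_tensor(hierarchy, 10.0, policy)

  let placement = case location {
    compression.OnGpu(device_id: _) -> "GPU"
    compression.OnRam -> "RAM"
    compression.OnDisk(path: path) -> "disk at " <> path
    compression.Hybrid(gpu_pct: _) -> "split between GPU and RAM"
  }
  io.println("Placed on: " <> placement)
  io.println("GPU used (GB): " <> float.to_string(hierarchy.gpu.used_gb))
}
```

The function returns the updated hierarchy alongside the location, so the caller threads it through successive allocations instead of mutating shared state.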

checkpoint_savings

pub fn checkpoint_savings(
  num_layers: Int,
  layer_size_mb: Float,
  strategy: CheckpointStrategy,
) -> Float

Computes the memory saved by checkpointing
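
One plausible model for the `EveryN` case is sketched below; it is not necessarily the formula `checkpoint_savings` uses. It assumes every layer's activations have the same size and that only one layer per group of `n` keeps its activations, with the rest recomputed during the backward pass. `every_n_savings_mb` is a hypothetical name.

```gleam
import gleam/int

/// Hypothetical estimate: MB saved when activations are kept for only
/// one layer in every group of `n` layers.
pub fn every_n_savings_mb(
  num_layers: Int,
  layer_size_mb: Float,
  n: Int,
) -> Float {
  // Guard against n < 1, then count the layers kept (rounded up)
  let step = int.max(n, 1)
  let kept = { num_layers + step - 1 } / step
  int.to_float(num_layers - kept) *. layer_size_mb
}
```

With 24 layers of 100 MB each and `n = 4`, six layers are kept and 18 are recomputed, for an estimated saving of 1800 MB.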

create_memory_hierarchy

pub fn create_memory_hierarchy(
  vram_gb: Float,
  ram_gb: Float,
  disk_path: option.Option(String),
) -> MemoryHierarchy

Creates a memory hierarchy from the given VRAM and RAM sizes (for example, an RTX 4090 with 24 GB of VRAM plus 32 GB of RAM)

create_pool

pub fn create_pool() -> MemoryPool

Creates a memory pool

create_streamed

pub fn create_streamed(
  shape: List(Int),
  chunk_dim: Int,
) -> StreamedTensor

Creates a tensor for streaming

demonstrate_compression

pub fn demonstrate_compression() -> Nil

dequantize

pub fn dequantize(ct: CompressedTensor) -> tensor.Tensor

Dequantizes back to FP32

load_chunk

pub fn load_chunk(
  st: StreamedTensor,
  chunk_idx: Int,
) -> StreamedTensor

Loads a specific chunk

main

pub fn main() -> Nil

pool_alloc

pub fn pool_alloc(
  pool: MemoryPool,
  size: Int,
) -> #(MemoryPool, Bool)

Allocates from the pool (reuses a buffer when possible)

pool_free

pub fn pool_free(pool: MemoryPool, size: Int) -> MemoryPool

Returns a buffer to the pool
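
A possible round trip through the pool, using only the signatures documented on this page. Two things are assumptions here: that `size` is in bytes, and that the `Bool` returned by `pool_alloc` reports whether an existing buffer was reused. Neither is stated on this page.

```gleam
import gleam/bool
import gleam/io
import viva_tensor/compression

pub fn main() -> Nil {
  let pool = compression.create_pool()

  // First request for a 1024-byte buffer (assumed unit: bytes)
  let #(pool, first) = compression.pool_alloc(pool, 1024)
  // Hand the buffer back, then ask for the same size again
  let pool = compression.pool_free(pool, 1024)
  let #(_pool, second) = compression.pool_alloc(pool, 1024)

  // Assumption: the flag tells whether a free buffer was reused
  io.println("first alloc flag: " <> bool.to_string(first))
  io.println("second alloc flag: " <> bool.to_string(second))
}
```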

quantize_int8

pub fn quantize_int8(t: tensor.Tensor) -> CompressedTensor

Quantizes a tensor to INT8 (4x compression relative to FP32)
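
For readers unfamiliar with the technique, here is a minimal sketch of symmetric absmax INT8 quantization over a plain `List(Float)`. It illustrates the general idea and is not the body of `quantize_int8`; the helper names and the single per-list scale are assumptions made for this example.

```gleam
import gleam/float
import gleam/int
import gleam/list

/// Hypothetical helper: maps floats to ints in -127..127 plus a scale.
pub fn quantize_absmax(values: List(Float)) -> #(List(Int), Float) {
  // Largest magnitude in the input decides the scale
  let max_abs =
    list.fold(values, 0.0, fn(acc, v) {
      float.max(acc, float.absolute_value(v))
    })
  // Avoid a zero scale when every value is 0.0
  let scale = case max_abs >. 0.0 {
    True -> max_abs /. 127.0
    False -> 1.0
  }
  let quantized =
    list.map(values, fn(v) {
      int.clamp(float.round(v /. scale), min: -127, max: 127)
    })
  #(quantized, scale)
}

/// Inverse step: multiply each int by the scale to approximate the input.
pub fn dequantize_absmax(quantized: List(Int), scale: Float) -> List(Float) {
  list.map(quantized, fn(q) { int.to_float(q) *. scale })
}
```

Each value then needs one byte instead of four, which is where the 4x figure comes from. The round trip is lossy: dequantized values differ from the originals by up to half a quantization step.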

quantize_q4

pub fn quantize_q4(
  t: tensor.Tensor,
  block_size: Int,
) -> CompressedTensor

Quantizes to Q4 (8x compression relative to FP32), GGML style
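
Block-wise 4-bit quantization can be sketched as below. This is a simplified symmetric variant (codes in -7..7, one scale per block) written for illustration; it does not reproduce GGML's exact bit layout or the internals of `quantize_q4`. `Q4Block` and `quantize_blocks` are hypothetical names, and `block_size` is assumed to be positive.

```gleam
import gleam/float
import gleam/int
import gleam/list

/// One quantized block: its scale and its 4-bit codes, kept as ints.
pub type Q4Block {
  Q4Block(scale: Float, codes: List(Int))
}

/// Hypothetical helper: splits `values` into blocks of `block_size`
/// and quantizes each block with its own absmax scale.
pub fn quantize_blocks(values: List(Float), block_size: Int) -> List(Q4Block) {
  values
  |> list.sized_chunk(into: block_size)
  |> list.map(fn(block) {
    let max_abs =
      list.fold(block, 0.0, fn(acc, v) {
        float.max(acc, float.absolute_value(v))
      })
    // Avoid a zero scale for an all-zero block
    let scale = case max_abs >. 0.0 {
      True -> max_abs /. 7.0
      False -> 1.0
    }
    let codes =
      list.map(block, fn(v) {
        int.clamp(float.round(v /. scale), min: -7, max: 7)
      })
    Q4Block(scale: scale, codes: codes)
  })
}
```

A per-block scale keeps one outlier from flattening the resolution of the whole tensor, at the cost of one extra float per block.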

unload_chunk

pub fn unload_chunk(
  st: StreamedTensor,
  chunk_idx: Int,
) -> StreamedTensor

Unloads a chunk (frees memory)
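
A possible streaming workflow, using `create_streamed`, `load_chunk`, and `unload_chunk` as documented on this page. The meaning of `chunk_dim` is an assumption here (chunk size along the first dimension), and the shape and chunk indices are illustrative.

```gleam
import gleam/int
import gleam/io
import gleam/list
import viva_tensor/compression

pub fn main() -> Nil {
  // Assumption: chunk_dim is the chunk size along the first dimension
  let st = compression.create_streamed([4096, 4096], 512)

  // Load two chunks, then release the first once it is no longer needed
  let st = compression.load_chunk(st, 0)
  let st = compression.load_chunk(st, 1)
  let st = compression.unload_chunk(st, 0)

  io.println(
    "chunks loaded: "
    <> int.to_string(list.length(st.loaded_chunks))
    <> " of "
    <> int.to_string(st.total_chunks),
  )
}
```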