viva_tensor: Memory Multiplication
Gabriel Maia · VIVA Research · 2026
Abstract
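viva_tensor is a pure Gleam tensor library that targets roughly 8x effective memory through weight quantization. The results reported below reach 7.5x with NF4 and 7.7x with AWQ, both relative to FP32.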
graph LR
A["24 GB Physical"] -->|"NF4"| B["192 GB Effective"]
Problem
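LLM inference is typically memory-bound rather than compute-bound: the weights have to fit in device memory before any computation can run. The table below compares weight storage in FP32 and NF4.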
| Model | FP32 | NF4 |
|---|---|---|
| LLaMA-7B | 28 GB | 3.7 GB |
| LLaMA-70B | 280 GB | 37 GB |
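The FP32 column follows directly from the parameter count, and the NF4 column from the 7.5x ratio reported in the Results table. The sketch below reproduces that arithmetic in Python, used here only for illustration (it is not viva_tensor's Gleam API). The function names are hypothetical, and it assumes 1 GB = 10^9 bytes and counts weights only, ignoring activations and KV cache.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Weight storage in GB (10^9 bytes); activations and KV cache are ignored."""
    return n_params * bits_per_weight / 8 / 1e9


def compressed_gb(fp32_gb: float, compression_ratio: float) -> float:
    """Footprint after applying a compression ratio measured against FP32."""
    return fp32_gb / compression_ratio


# Parameter counts are nominal (7e9 and 70e9); 7.5 is the NF4 ratio from the Results table.
for name, n_params in [("LLaMA-7B", 7e9), ("LLaMA-70B", 70e9)]:
    fp32 = weight_memory_gb(n_params, 32)
    nf4 = compressed_gb(fp32, 7.5)
    print(f"{name}: FP32 {fp32:.0f} GB, NF4 {nf4:.1f} GB")
```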
Solution
flowchart TB
subgraph Quantization
INT8["INT8: 4x"]
NF4["NF4: 7.5x"]
AWQ["AWQ: 7.7x"]
end
subgraph Result
M["Memory × 8"]
end
Quantization --> Result
Algorithms
INT8
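Linear (absmax) quantization: fast and simple. Values are scaled so that the largest magnitude maps to 127, then rounded to the nearest integer. Dequantization divides by the same scale.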
scale = 127 / max|x|
q = round(x × scale)
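The two formulas above can be sketched as a round trip in Python. This is a minimal illustration, not viva_tensor's implementation: the function names are hypothetical, one scale is used for the whole input list, and the all-zero case is handled by an assumed fallback scale of 1.0.

```python
def quantize_int8(xs):
    """Absmax INT8 quantization: scale = 127 / max|x|, q = round(x * scale)."""
    max_abs = max(abs(x) for x in xs)
    if max_abs == 0.0:
        # Assumption: an all-zero input gets a neutral scale.
        return [0] * len(xs), 1.0
    scale = 127.0 / max_abs
    # Clamp defensively so floating-point noise cannot leave the INT8 range.
    q = [max(-127, min(127, round(x * scale))) for x in xs]
    return q, scale


def dequantize_int8(q, scale):
    """Inverse mapping: x is approximately q / scale."""
    return [v / scale for v in q]


values = [0.4, -1.0, 0.25, 0.0]
q, scale = quantize_int8(values)
restored = dequantize_int8(q, scale)
```

The round-trip error per value is bounded by half a quantization step, that is `0.5 / scale`.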
NF4 (QLoRA)
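4-bit NormalFloat: 16 quantization levels placed at quantiles of a standard normal distribution. QLoRA describes this data type as information-theoretically optimal for normally distributed weights.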
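To make the idea of quantile-based levels concrete, here is a simplified Python sketch. It is an assumption-laden illustration rather than the exact NF4 table: it takes the midpoints of 16 equal-probability bins of a standard normal and rescales them to [-1, 1], whereas the NF4 data type in QLoRA uses an asymmetric construction that includes an exact zero. All function names are hypothetical.

```python
from statistics import NormalDist


def normal_quantile_levels(n_levels: int = 16):
    """Levels at the midpoints of equal-probability bins of N(0, 1), rescaled to [-1, 1]."""
    nd = NormalDist()
    qs = [nd.inv_cdf((i + 0.5) / n_levels) for i in range(n_levels)]
    largest = max(abs(q) for q in qs)
    return [q / largest for q in qs]


LEVELS = normal_quantile_levels()


def quantize_block(xs):
    """Normalize a block by its absmax, then store the index of the nearest level."""
    absmax = max(abs(x) for x in xs) or 1.0  # assumed fallback for an all-zero block
    codes = [
        min(range(len(LEVELS)), key=lambda i: abs(LEVELS[i] - x / absmax))
        for x in xs
    ]
    return codes, absmax


def dequantize_block(codes, absmax):
    """Look up each 4-bit code and undo the absmax normalization."""
    return [LEVELS[c] * absmax for c in codes]


codes, absmax = quantize_block([0.1, -0.8, 0.3, 0.8])
approx = dequantize_block(codes, absmax)
```

Each weight is stored as a 4-bit index plus one shared absmax per block, which is where the compression comes from.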
AWQ (MLSys 2024 Best Paper)
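Key insight: roughly 1% of weights are salient, and they are identified by activation magnitude rather than weight magnitude. Scaling those channels up before quantization reduces their relative rounding error.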
flowchart LR
A[Activations] --> S[Stats]
S --> T["Top 1% Salient"]
T --> U["Scale UP"]
U --> Q[Quantize]
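The pipeline in the diagram can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the AWQ reference implementation or viva_tensor's code: it uses a fixed `boost` factor for salient channels (AWQ itself derives per-channel scales from activation statistics via a search), simulates 4-bit quantization per row, and all names are hypothetical. Activations are divided by the same scales at inference so the product is unchanged up to rounding error.

```python
def channel_saliency(activations):
    """Mean absolute activation per input channel (the 'Stats' step)."""
    n_ch = len(activations[0])
    return [sum(abs(row[c]) for row in activations) / len(activations) for c in range(n_ch)]


def channel_scales(saliency, fraction=0.01, boost=2.0):
    """Scale the top `fraction` of channels (at least one) up by `boost`; leave the rest at 1."""
    k = max(1, round(len(saliency) * fraction))
    ranked = sorted(range(len(saliency)), key=lambda c: saliency[c], reverse=True)
    top = set(ranked[:k])
    return [boost if c in top else 1.0 for c in range(len(saliency))]


def quantize_row(row, bits=4):
    """Simulated symmetric quantization: round to a uniform grid, return floats."""
    qmax = 2 ** (bits - 1) - 1
    absmax = max(abs(w) for w in row) or 1.0
    step = absmax / qmax
    return [round(w / step) * step for w in row]


def awq_quantize(weights, activations, fraction=0.01, boost=2.0, bits=4):
    """Scale salient input channels up, then quantize each output row."""
    scales = channel_scales(channel_saliency(activations), fraction, boost)
    scaled = [[w * scales[c] for c, w in enumerate(row)] for row in weights]
    return [quantize_row(row, bits) for row in scaled], scales


def forward(q_weights, scales, x):
    """Divide activations by the channel scales so (W * s) . (x / s) approximates W . x."""
    x_scaled = [xi / si for xi, si in zip(x, scales)]
    return [sum(w * xi for w, xi in zip(row, x_scaled)) for row in q_weights]


W = [[0.5, 0.1, -0.3, 0.2]]
acts = [[0.1, 5.0, 0.2, 0.1], [0.2, 4.0, 0.1, 0.3]]
q_W, s = awq_quantize(W, acts)
y = forward(q_W, s, acts[0])
```

In this toy example channel 1 carries by far the largest activations, so it is the one scaled up before rounding.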
Results
| Method | Compression | Efficiency |
|---|---|---|
| INT8 | 4x | 40% |
| NF4 | 7.5x | 77% |
| AWQ | 7.7x | 53% |
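Compression ratios for block-wise formats land below the nominal 32/bits because each block also stores a scale. The Python sketch below shows that relationship and the resulting effective capacity. The block size and scale width in the example are assumptions chosen for illustration; the source does not state which values viva_tensor uses, so the output is not expected to match the table exactly.

```python
def effective_bits(bits: float, block_size: int, scale_bits: float) -> float:
    """Bits per weight including one shared scale per block."""
    return bits + scale_bits / block_size


def compression_vs_fp32(bits: float, block_size: int, scale_bits: float) -> float:
    """Compression ratio relative to 32-bit floats."""
    return 32.0 / effective_bits(bits, block_size, scale_bits)


def effective_capacity_gb(physical_gb: float, ratio: float) -> float:
    """How much FP32-equivalent weight data fits in `physical_gb` at a given ratio."""
    return physical_gb * ratio


# Hypothetical configuration: 4-bit codes, blocks of 64 weights, one FP32 scale per block.
ratio_4bit = compression_vs_fp32(bits=4, block_size=64, scale_bits=32)

# Ratios from the Results table applied to a 24 GB device.
for method, ratio in [("INT8", 4.0), ("NF4", 7.5), ("AWQ", 7.7)]:
    print(f"{method}: {effective_capacity_gb(24, ratio):.1f} GB effective")
```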
Why Gleam?
graph TB
subgraph BEAM
P1[Process 1]
P2[Process 2]
P3[Process N]
end
subgraph Properties
I[Immutable]
F[Fault-tolerant]
C[Concurrent]
end
BEAM --> Properties
| Property | OS threads | BEAM processes |
|---|---|---|
| Memory overhead per unit | ~1 MB | ~2 KB |
| Max concurrent | ~1K | ~1M |
| Fault isolation | Shared | Isolated |
References
- Lin et al., “AWQ”, MLSys 2024 (Best Paper)
- Dettmers et al., “QLoRA”, NeurIPS 2023
- NVIDIA, Blackwell Architecture, 2024