viva_tensor: Memory Multiplication

Gabriel Maia · VIVA Research · 2026


Abstract

A pure Gleam tensor library achieving roughly 8x effective memory multiplication through weight quantization (INT8, NF4, AWQ).

graph LR
    A["24 GB Physical"] -->|"NF4"| B["192 GB Effective"]

Problem

LLM inference is memory-bound, not compute-bound: the weight footprint, not arithmetic, is the limit.

Model        FP32      NF4
LLaMA-7B     28 GB     3.7 GB
LLaMA-70B    280 GB    37 GB
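The figures above follow from bits per parameter: FP32 stores 32 bits per weight, while NF4 stores 4 bits plus a per-block scale. A back-of-envelope sketch in Python (block size 64 and an FP32 scale per block are assumptions following QLoRA's defaults, not this library's confirmed layout):

```python
def model_memory_gb(n_params, bits_per_weight, scale_bits=0, block_size=64):
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    total_bits = n_params * bits_per_weight
    if scale_bits:
        # one quantization scale stored per block of weights
        total_bits += (n_params // block_size) * scale_bits
    return total_bits / 8 / 1e9

fp32 = model_memory_gb(7e9, 32)                # 28.0 GB
nf4  = model_memory_gb(7e9, 4, scale_bits=32)  # ~3.94 GB
```

Quantizing the scales themselves (QLoRA's "double quantization") shrinks the overhead further, which is consistent with the 3.7 GB in the table.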

Solution

flowchart TB
    subgraph Quantization
        INT8["INT8: 4x"]
        NF4["NF4: 7.5x"]
        AWQ["AWQ: 7.7x"]
    end

    subgraph Result
        M["Memory × 8"]
    end

    Quantization --> Result

Algorithms

INT8

Linear quantization. Fast, simple.

scale = 127 / max(|x|)
q = round(x × scale)
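A minimal sketch of this absmax scheme in plain Python (illustrative only, not the library's Gleam API):

```python
def quantize_int8(xs):
    """Symmetric absmax quantization: q = round(x * scale), scale = 127 / max|x|."""
    scale = 127.0 / max(abs(x) for x in xs)
    return [round(x * scale) for x in xs], scale

def dequantize_int8(qs, scale):
    """Recover approximate floats; the largest-magnitude value is exact."""
    return [q / scale for q in qs]

qs, scale = quantize_int8([0.1, -0.5, 2.0])
# the largest-magnitude input maps to ±127
```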

NF4 (QLoRA)

16 levels from normal distribution quantiles. Optimal for Gaussian weights.
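The construction can be sketched with the stdlib normal distribution. This is a simplification of QLoRA's exact codebook (which additionally guarantees an exact zero level); `nf4_levels` and `quantize_nf4` are illustrative names, not this library's API:

```python
from statistics import NormalDist

def nf4_levels(k=16):
    """k levels at evenly spaced quantiles of N(0, 1), renormalized to [-1, 1]."""
    nd = NormalDist()
    qs = [nd.inv_cdf((i + 0.5) / k) for i in range(k)]
    m = max(abs(q) for q in qs)
    return [q / m for q in qs]

def quantize_nf4(x, levels):
    """Map a weight (pre-normalized to [-1, 1] by its block's absmax)
    to the index of the nearest codebook level."""
    return min(range(len(levels)), key=lambda i: abs(levels[i] - x))

levels = nf4_levels()           # 16 values spanning [-1, 1]
idx = quantize_nf4(0.05, levels)
```

Because weights are approximately Gaussian, equal-probability quantiles place levels densely near zero where most weights live, minimizing expected quantization error.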

AWQ (MLSys 2024 Best Paper)

Key insight: 1% of weights are salient — identified by activation magnitude.

flowchart LR
    A[Activations] --> S[Stats]
    S --> T["Top 1% Salient"]
    T --> U["Scale UP"]
    U --> Q[Quantize]
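The salience step above can be sketched as follows (assuming calibration activations are available; this omits AWQ's grid search over scale factors, and all names are illustrative):

```python
def salient_channels(activations, top_frac=0.01):
    """Rank input channels by mean |activation| over calibration data;
    return the indices of the top 1%."""
    n = len(activations[0])
    mean_mag = [sum(abs(row[c]) for row in activations) / len(activations)
                for c in range(n)]
    k = max(1, int(n * top_frac))
    return sorted(range(n), key=lambda c: mean_mag[c], reverse=True)[:k]

def protect(weight_column, s=2.0):
    """Scale a salient weight column up before quantization, shrinking its
    relative quantization error; the inverse scale folds into the preceding
    activation so the layer's output is mathematically unchanged."""
    return [w * s for w in weight_column]
```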

Results

Method    Compression    Efficiency
INT8      4x             40%
NF4       7.5x           77%
AWQ       7.7x           53%

Why Gleam?

graph TB
    subgraph BEAM
        P1[Process 1]
        P2[Process 2]
        P3[Process N]
    end

    subgraph Properties
        I[Immutable]
        F[Fault-tolerant]
        C[Concurrent]
    end

    BEAM --> Properties

Property           Threads    BEAM
Overhead           1 MB       2 KB
Max concurrent     1K         1M
Fault isolation    Shared     Isolated

References

  1. Lin et al., "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration," MLSys 2024 (Best Paper).
  2. Dettmers et al., "QLoRA: Efficient Finetuning of Quantized LLMs," NeurIPS 2023.
  3. NVIDIA, Blackwell Architecture, 2024.