viva_tensor/rtx4090

RTX 4090 Optimized Engine

RTX 4090 ASUS ROG STRIX SPECIFICATIONS:

SPECIFIC OPTIMIZATIONS:

  1. VRAM-aware batch sizing (24 GB - 2 GB for the system = 22 GB usable)
  2. Tensor Core utilization (8x8 or 16x16 alignment)
  3. GDDR6X burst patterns (384-bit bus, aligned access)
  4. CUDA Warp-aware parallelism (32 threads)

Pure Gleam + BEAM concurrency for maximum utilization!
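
As a rough sketch of item 1 above, a batch size can be derived from the usable VRAM reported by get_specs (documented under Values). The 64 MB per-tensor footprint is an arbitrary assumption for this example, not something the module prescribes.

import gleam/float
import gleam/int
import gleam/io
import viva_tensor/rtx4090

pub fn example_batch_sizing() -> Nil {
  let specs = rtx4090.get_specs()

  // Usable VRAM in bytes: 24 GB minus the ~2 GB reserved for the system.
  let usable_bytes =
    float.truncate(specs.vram_available_gb *. 1024.0 *. 1024.0 *. 1024.0)

  // Assumed per-tensor footprint for this example: 64 MB.
  let bytes_per_tensor = 64 * 1024 * 1024

  // How many such tensors fit in VRAM at once.
  let batch_size = usable_bytes / bytes_per_tensor

  io.println("Tensors that fit in 22 GB: " <> int.to_string(batch_size))
}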

Types

Result of processing a batch of tensors

pub type BatchResult {
  BatchResult(
    tensors: List(blackwell.BlackwellTensor),
    total_time_ms: Int,
    throughput_tps: Float,
    compression_ratio: Float,
    memory_saved_mb: Float,
  )
}

Constructors

  • BatchResult(
      tensors: List(blackwell.BlackwellTensor),
      total_time_ms: Int,
      throughput_tps: Float,
      compression_ratio: Float,
      memory_saved_mb: Float,
    )

Type of performance bottleneck

pub type Bottleneck {
  ComputeBound
  MemoryBound
  LatencyBound
}

Constructors

  • ComputeBound
  • MemoryBound
  • LatencyBound

GPU memory state

pub type GpuMemoryState {
  GpuMemoryState(
    total_bytes: Int,
    used_bytes: Int,
    free_bytes: Int,
    allocated_tensors: Int,
    cached_bytes: Int,
  )
}

Constructors

  • GpuMemoryState(
      total_bytes: Int,
      used_bytes: Int,
      free_bytes: Int,
      allocated_tensors: Int,
      cached_bytes: Int,
    )

    Arguments

    total_bytes

    Total VRAM in bytes

    used_bytes

    Used VRAM in bytes

    free_bytes

    Free VRAM in bytes

    allocated_tensors

    Number of allocated tensors

    cached_bytes

    Cached bytes

Performance estimate

pub type PerformanceEstimate {
  PerformanceEstimate(
    theoretical_flops: Float,
    achievable_flops: Float,
    estimated_time_ms: Float,
    bottleneck: Bottleneck,
    efficiency_pct: Float,
  )
}

Constructors

  • PerformanceEstimate(
      theoretical_flops: Float,
      achievable_flops: Float,
      estimated_time_ms: Float,
      bottleneck: Bottleneck,
      efficiency_pct: Float,
    )

    Arguments

    theoretical_flops

    Theoretical FLOPS

    achievable_flops

    Achievable FLOPS (accounting for overhead)

    estimated_time_ms

    Estimated time in ms

    bottleneck

    Bottleneck (compute-, memory-, or latency-bound)

    efficiency_pct

    Estimated efficiency as a percentage

Quantization modes for the RTX 4090

pub type QuantMode4090 {
  Fp32Mode
  Fp16TensorMode
  Int8TensorMode
  MixedPrecisionMode
}

Constructors

  • Fp32Mode

    Pure FP32 (82.6 TFLOPS)

  • Fp16TensorMode

    FP16 with Tensor Cores (330 TFLOPS, 4x FP32!)

  • Int8TensorMode

    INT8 with Tensor Cores (661 TOPS, 8x FP32!)

  • MixedPrecisionMode

    Mixed precision (FP16 compute, FP32 accumulate)
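
The throughput figures above can be folded into a small helper when weighing a mode choice. This is an illustrative sketch only: the function name and the choice to reuse the FP16 figure for mixed precision are assumptions, not part of the module's API.

import viva_tensor/rtx4090.{
  type QuantMode4090, Fp16TensorMode, Fp32Mode, Int8TensorMode,
  MixedPrecisionMode,
}

/// Peak throughput documented for each mode, in TFLOPS (TOPS for INT8).
pub fn peak_throughput(mode: QuantMode4090) -> Float {
  case mode {
    Fp32Mode -> 82.6
    Fp16TensorMode -> 330.0
    Int8TensorMode -> 661.0
    // Assumed here to track the FP16 figure, since it computes in FP16.
    MixedPrecisionMode -> 330.0
  }
}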

Optimized configuration for the RTX 4090

pub type Rtx4090Config {
  Rtx4090Config(
    optimal_batch_size: Int,
    tensor_core_tile: Int,
    memory_alignment: Int,
    threads_per_block: Int,
    use_tensor_cores: Bool,
    quant_mode: QuantMode4090,
  )
}

Constructors

  • Rtx4090Config(
      optimal_batch_size: Int,
      tensor_core_tile: Int,
      memory_alignment: Int,
      threads_per_block: Int,
      use_tensor_cores: Bool,
      quant_mode: QuantMode4090,
    )

    Arguments

    optimal_batch_size

    Optimal batch size for 24 GB of VRAM

    tensor_core_tile

    Tile size for Tensor Cores (8 or 16)

    memory_alignment

    Memory alignment (256 bits = 32 bytes)

    threads_per_block

    Threads per CUDA block

    use_tensor_cores

    Whether to use Tensor Cores (FP16/INT8)

    quant_mode

    Quantization mode
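
A hand-built configuration might look like the sketch below. Every concrete value (batch of 512, 16x16 tiles, 256 threads per block) is an example choice, not a module default; default_config, speed_config, and precision_config under Values are the usual starting points.

import viva_tensor/rtx4090.{
  type Rtx4090Config, Fp16TensorMode, Rtx4090Config,
}

// Illustrative hand-built configuration; every value is an example choice.
pub fn my_config() -> Rtx4090Config {
  Rtx4090Config(
    optimal_batch_size: 512,
    // 16x16 tiles keep matrix dimensions aligned for the Tensor Cores.
    tensor_core_tile: 16,
    // 256-bit aligned accesses, i.e. 32 bytes.
    memory_alignment: 32,
    threads_per_block: 256,
    use_tensor_cores: True,
    quant_mode: Fp16TensorMode,
  )
}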

RTX 4090 specifications

pub type Rtx4090Specs {
  Rtx4090Specs(
    cuda_cores: Int,
    tensor_cores: Int,
    vram_gb: Float,
    vram_available_gb: Float,
    bandwidth_gbps: Float,
    tdp_watts: Int,
    tflops_fp32: Float,
    tflops_fp16: Float,
    tops_int8: Float,
    warp_size: Int,
    sm_count: Int,
    l2_cache_mb: Int,
  )
}

Constructors

  • Rtx4090Specs(
      cuda_cores: Int,
      tensor_cores: Int,
      vram_gb: Float,
      vram_available_gb: Float,
      bandwidth_gbps: Float,
      tdp_watts: Int,
      tflops_fp32: Float,
      tflops_fp16: Float,
      tops_int8: Float,
      warp_size: Int,
      sm_count: Int,
      l2_cache_mb: Int,
    )

    Arguments

    cuda_cores

    CUDA Cores

    tensor_cores

    Tensor Cores (4th Gen)

    vram_gb

    VRAM in GB

    vram_available_gb

    Available VRAM (after the system reserve)

    bandwidth_gbps

    Bandwidth in GB/s

    tdp_watts

    TDP in watts

    tflops_fp32

    FP32 TFLOPS

    tflops_fp16

    FP16 TFLOPS (Tensor Cores)

    tops_int8

    INT8 TOPS (Tensor Cores)

    warp_size

    Warp size (threads per warp)

    sm_count

    Streaming Multiprocessor (SM) count

    l2_cache_mb

    L2 cache in MB
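
The specs can be combined into derived figures such as the compute-to-bandwidth ratio that the bottleneck classification hinges on. A minimal sketch, with illustrative function names:

import gleam/float
import gleam/io
import viva_tensor/rtx4090

// FLOPs the card can execute per byte it can move. Operations whose
// arithmetic intensity sits above this ratio tend to be compute-bound,
// those below it memory-bound.
pub fn fp32_flops_per_byte() -> Float {
  let specs = rtx4090.get_specs()
  // TFLOPS -> GFLOP/s shares the "giga" scale with GB/s bandwidth.
  specs.tflops_fp32 *. 1000.0 /. specs.bandwidth_gbps
}

pub fn print_balance() -> Nil {
  io.println(
    "FP32 FLOPs per byte: " <> float.to_string(fp32_flops_per_byte()),
  )
}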

Values

pub fn allocate(
  state: GpuMemoryState,
  bytes: Int,
) -> Result(GpuMemoryState, String)

Allocates memory for a tensor

pub fn benchmark_rtx4090() -> Nil
pub fn can_allocate(state: GpuMemoryState, bytes: Int) -> Bool

Checks whether a tensor fits in VRAM
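
A minimal sketch of the allocation round trip: check with can_allocate, commit with allocate, and release with free (documented below). The 512 MB figure is arbitrary.

import gleam/io
import viva_tensor/rtx4090

pub fn allocation_example() -> Nil {
  let state = rtx4090.init_memory()
  // Arbitrary example allocation: 512 MB.
  let bytes = 512 * 1024 * 1024

  case rtx4090.can_allocate(state, bytes) {
    False -> io.println("Tensor does not fit in VRAM")
    True -> {
      // allocate returns the updated state, or an error message on failure.
      case rtx4090.allocate(state, bytes) {
        Error(reason) -> io.println("Allocation failed: " <> reason)
        Ok(after_alloc) -> {
          // ...use the memory, then hand it back to the tracker.
          let _after_free = rtx4090.free(after_alloc, bytes)
          io.println("Allocated and freed 512 MB")
        }
      }
    }
  }
}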

pub fn default_config() -> Rtx4090Config

Default optimized configuration

pub fn estimate_performance(
  flops_needed: Float,
  bytes_to_transfer: Float,
  config: Rtx4090Config,
) -> PerformanceEstimate

Estimates the performance of a tensor operation
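
A worked sketch for a square matrix multiply: an N x N x N multiply needs roughly 2*N^3 FLOPs and, at FP16, moves about 3*N^2*2 bytes for the two inputs and the output. These counts are back-of-the-envelope assumptions for the example, not values the module computes.

import gleam/int
import gleam/io
import viva_tensor/rtx4090

pub fn matmul_estimate() -> Nil {
  let n = 4096
  // ~2 * N^3 FLOPs for an N x N x N matrix multiply.
  let flops = 2.0 *. int.to_float(n * n * n)
  // Two inputs and one output, 2 bytes per FP16 element.
  let bytes = int.to_float(3 * n * n * 2)

  let config = rtx4090.speed_config()
  let estimate = rtx4090.estimate_performance(flops, bytes, config)

  let bottleneck = case estimate.bottleneck {
    rtx4090.ComputeBound -> "compute-bound"
    rtx4090.MemoryBound -> "memory-bound"
    rtx4090.LatencyBound -> "latency-bound"
  }
  io.println("4096x4096 matmul is estimated to be " <> bottleneck)
}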

pub fn free(state: GpuMemoryState, bytes: Int) -> GpuMemoryState

Frees memory

pub fn get_specs() -> Rtx4090Specs

Returns the RTX 4090 specifications

pub fn init_memory() -> GpuMemoryState

Creates the initial memory state for the RTX 4090

pub fn main() -> Nil
pub fn precision_config() -> Rtx4090Config

Configuration for maximum precision

pub fn process_batch(
  tensors: List(tensor.Tensor),
  config: Rtx4090Config,
) -> BatchResult

Processes a batch of tensors with compression
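
A sketch of driving process_batch and reading the resulting BatchResult. Building the List(tensor.Tensor) input is left to the tensor module, which is assumed here to live at viva_tensor/tensor.

import gleam/float
import gleam/io
import viva_tensor/rtx4090
import viva_tensor/tensor

pub fn run_batch(tensors: List(tensor.Tensor)) -> Nil {
  let config = rtx4090.default_config()
  let result = rtx4090.process_batch(tensors, config)

  io.println(
    "Compression ratio: "
    <> float.to_string(result.compression_ratio)
    <> ", memory saved: "
    <> float.to_string(result.memory_saved_mb)
    <> " MB",
  )
}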

pub fn speed_config() -> Rtx4090Config

Configuration for maximum speed

pub fn tensor_memory_bytes(
  shape: List(Int),
  mode: QuantMode4090,
) -> Int

Calculates the memory required for a tensor
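
A small comparison sketch: the same shape priced in bytes under FP32 and INT8 (one would expect roughly a 4x difference from 4-byte vs 1-byte elements). The shape is arbitrary.

import gleam/int
import gleam/io
import viva_tensor/rtx4090.{Fp32Mode, Int8TensorMode}

pub fn compare_footprint() -> Nil {
  // A 1024 x 1024 tensor as an example shape.
  let shape = [1024, 1024]

  let fp32 = rtx4090.tensor_memory_bytes(shape, Fp32Mode)
  let int8 = rtx4090.tensor_memory_bytes(shape, Int8TensorMode)

  io.println("FP32 bytes: " <> int.to_string(fp32))
  io.println("INT8 bytes: " <> int.to_string(int8))
}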
