viva_tensor/rtx4090

RTX 4090 Optimized Engine

RTX 4090 ASUS ROG STRIX SPECIFICATIONS:

SPECIFIC OPTIMIZATIONS:

  1. VRAM-aware batch sizing (24 GB - 2 GB for the system = 22 GB usable)
  2. Tensor Core utilization (8x8 or 16x16 alignment)
  3. GDDR6X burst patterns (384-bit bus, aligned access)
  4. CUDA Warp-aware parallelism (32 threads)

Pure Gleam + BEAM concurrency for maximum utilization!
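
As a rough sketch of item 1 above, a batch size can be derived from the usable VRAM reported by get_specs (documented under Values). The 64 MB per-tensor footprint is an arbitrary assumption for this example, not something the module prescribes.

import gleam/float
import gleam/int
import gleam/io
import viva_tensor/rtx4090

pub fn example_batch_sizing() -> Nil {
  let specs = rtx4090.get_specs()

  // Usable VRAM in bytes: 24 GB minus the ~2 GB reserved for the system.
  let usable_bytes =
    float.truncate(specs.vram_available_gb *. 1024.0 *. 1024.0 *. 1024.0)

  // Assumed per-tensor footprint for this example: 64 MB.
  let bytes_per_tensor = 64 * 1024 * 1024

  // How many such tensors fit in VRAM at once.
  let batch_size = usable_bytes / bytes_per_tensor

  io.println("Tensors that fit in 22 GB: " <> int.to_string(batch_size))
}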

Types

Result of processing a batch of tensors

pub type BatchResult {
  BatchResult(
    tensors: List(blackwell.BlackwellTensor),
    total_time_ms: Int,
    throughput_tps: Float,
    compression_ratio: Float,
    memory_saved_mb: Float,
  )
}

Constructors

  • BatchResult(
      tensors: List(blackwell.BlackwellTensor),
      total_time_ms: Int,
      throughput_tps: Float,
      compression_ratio: Float,
      memory_saved_mb: Float,
    )

Type of performance bottleneck

pub type Bottleneck {
  ComputeBound
  MemoryBound
  LatencyBound
}

Constructors

  • ComputeBound
  • MemoryBound
  • LatencyBound

GPU memory state

pub type GpuMemoryState {
  GpuMemoryState(
    total_bytes: Int,
    used_bytes: Int,
    free_bytes: Int,
    allocated_tensors: Int,
    cached_bytes: Int,
  )
}

Constructors

  • GpuMemoryState(
      total_bytes: Int,
      used_bytes: Int,
      free_bytes: Int,
      allocated_tensors: Int,
      cached_bytes: Int,
    )

    Arguments

    total_bytes

    Total VRAM in bytes

    used_bytes

    Used VRAM in bytes

    free_bytes

    Free VRAM in bytes

    allocated_tensors

    Number of allocated tensors

    cached_bytes

    Cached bytes

Performance estimate

pub type PerformanceEstimate {
  PerformanceEstimate(
    theoretical_flops: Float,
    achievable_flops: Float,
    estimated_time_ms: Float,
    bottleneck: Bottleneck,
    efficiency_pct: Float,
  )
}

Constructors

  • PerformanceEstimate(
      theoretical_flops: Float,
      achievable_flops: Float,
      estimated_time_ms: Float,
      bottleneck: Bottleneck,
      efficiency_pct: Float,
    )

    Arguments

    theoretical_flops

    Theoretical FLOPS

    achievable_flops

    Achievable FLOPS (accounting for overhead)

    estimated_time_ms

    Estimated time in ms

    bottleneck

    Bottleneck (compute-, memory-, or latency-bound)

    efficiency_pct

    Estimated efficiency as a percentage

Quantization modes for the RTX 4090

pub type QuantMode4090 {
  Fp32Mode
  Fp16TensorMode
  Int8TensorMode
  MixedPrecisionMode
}

Constructors

  • Fp32Mode

    Pure FP32 (82.6 TFLOPS)

  • Fp16TensorMode

    FP16 with Tensor Cores (330 TFLOPS, 4x FP32!)

  • Int8TensorMode

    INT8 with Tensor Cores (661 TOPS, 8x FP32!)

  • MixedPrecisionMode

    Mixed precision (FP16 compute, FP32 accumulate)
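
The throughput figures above can be folded into a small helper when weighing a mode choice. This is an illustrative sketch only: the function name and the choice to reuse the FP16 figure for mixed precision are assumptions, not part of the module's API.

import viva_tensor/rtx4090.{
  type QuantMode4090, Fp16TensorMode, Fp32Mode, Int8TensorMode,
  MixedPrecisionMode,
}

/// Peak throughput documented for each mode, in TFLOPS (TOPS for INT8).
pub fn peak_throughput(mode: QuantMode4090) -> Float {
  case mode {
    Fp32Mode -> 82.6
    Fp16TensorMode -> 330.0
    Int8TensorMode -> 661.0
    // Assumed here to track the FP16 figure, since it computes in FP16.
    MixedPrecisionMode -> 330.0
  }
}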

Optimized configuration for the RTX 4090

pub type Rtx4090Config {
  Rtx4090Config(
    optimal_batch_size: Int,
    tensor_core_tile: Int,
    memory_alignment: Int,
    threads_per_block: Int,
    use_tensor_cores: Bool,
    quant_mode: QuantMode4090,
  )
}

Constructors

  • Rtx4090Config(
      optimal_batch_size: Int,
      tensor_core_tile: Int,
      memory_alignment: Int,
      threads_per_block: Int,
      use_tensor_cores: Bool,
      quant_mode: QuantMode4090,
    )

    Arguments

    optimal_batch_size

    Optimal batch size for 24 GB of VRAM

    tensor_core_tile

    Tile size for Tensor Cores (8 or 16)

    memory_alignment

    Memory alignment (256 bits = 32 bytes)

    threads_per_block

    Threads per CUDA block

    use_tensor_cores

    Whether to use Tensor Cores (FP16/INT8)

    quant_mode

    Quantization mode
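
A hand-built configuration might look like the sketch below. Every concrete value (batch of 512, 16x16 tiles, 256 threads per block) is an example choice, not a module default; default_config, speed_config, and precision_config under Values are the usual starting points.

import viva_tensor/rtx4090.{
  type Rtx4090Config, Fp16TensorMode, Rtx4090Config,
}

// Illustrative hand-built configuration; every value is an example choice.
pub fn my_config() -> Rtx4090Config {
  Rtx4090Config(
    optimal_batch_size: 512,
    // 16x16 tiles keep matrix dimensions aligned for the Tensor Cores.
    tensor_core_tile: 16,
    // 256-bit aligned accesses, i.e. 32 bytes.
    memory_alignment: 32,
    threads_per_block: 256,
    use_tensor_cores: True,
    quant_mode: Fp16TensorMode,
  )
}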

RTX 4090 specifications

pub type Rtx4090Specs {
  Rtx4090Specs(
    cuda_cores: Int,
    tensor_cores: Int,
    vram_gb: Float,
    vram_available_gb: Float,
    bandwidth_gbps: Float,
    tdp_watts: Int,
    tflops_fp32: Float,
    tflops_fp16: Float,
    tops_int8: Float,
    warp_size: Int,
    sm_count: Int,
    l2_cache_mb: Int,
  )
}

Constructors

  • Rtx4090Specs(
      cuda_cores: Int,
      tensor_cores: Int,
      vram_gb: Float,
      vram_available_gb: Float,
      bandwidth_gbps: Float,
      tdp_watts: Int,
      tflops_fp32: Float,
      tflops_fp16: Float,
      tops_int8: Float,
      warp_size: Int,
      sm_count: Int,
      l2_cache_mb: Int,
    )

    Arguments

    cuda_cores

    CUDA Cores

    tensor_cores

    Tensor Cores (4th Gen)

    vram_gb

    VRAM in GB

    vram_available_gb

    Available VRAM (after the system reserve)

    bandwidth_gbps

    Bandwidth in GB/s

    tdp_watts

    TDP in watts

    tflops_fp32

    FP32 TFLOPS

    tflops_fp16

    FP16 TFLOPS (Tensor Cores)

    tops_int8

    INT8 TOPS (Tensor Cores)

    warp_size

    Warp size (threads per warp)

    sm_count

    Streaming Multiprocessor (SM) count

    l2_cache_mb

    L2 cache in MB
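
The specs can be combined into derived figures such as the compute-to-bandwidth ratio that the bottleneck classification hinges on. A minimal sketch, with illustrative function names:

import gleam/float
import gleam/io
import viva_tensor/rtx4090

// FLOPs the card can execute per byte it can move. Operations whose
// arithmetic intensity sits above this ratio tend to be compute-bound,
// those below it memory-bound.
pub fn fp32_flops_per_byte() -> Float {
  let specs = rtx4090.get_specs()
  // TFLOPS -> GFLOP/s shares the "giga" scale with GB/s bandwidth.
  specs.tflops_fp32 *. 1000.0 /. specs.bandwidth_gbps
}

pub fn print_balance() -> Nil {
  io.println(
    "FP32 FLOPs per byte: " <> float.to_string(fp32_flops_per_byte()),
  )
}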

Values

pub fn allocate(
  state: GpuMemoryState,
  bytes: Int,
) -> Result(GpuMemoryState, String)

Allocates memory for a tensor

pub fn benchmark_rtx4090() -> Nil
pub fn can_allocate(state: GpuMemoryState, bytes: Int) -> Bool

Checks whether a tensor fits in VRAM
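
A minimal sketch of the allocation round trip: check with can_allocate, commit with allocate, and release with free (documented below). The 512 MB figure is arbitrary.

import gleam/io
import viva_tensor/rtx4090

pub fn allocation_example() -> Nil {
  let state = rtx4090.init_memory()
  // Arbitrary example allocation: 512 MB.
  let bytes = 512 * 1024 * 1024

  case rtx4090.can_allocate(state, bytes) {
    False -> io.println("Tensor does not fit in VRAM")
    True -> {
      // allocate returns the updated state, or an error message on failure.
      case rtx4090.allocate(state, bytes) {
        Error(reason) -> io.println("Allocation failed: " <> reason)
        Ok(after_alloc) -> {
          // ...use the memory, then hand it back to the tracker.
          let _after_free = rtx4090.free(after_alloc, bytes)
          io.println("Allocated and freed 512 MB")
        }
      }
    }
  }
}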

pub fn default_config() -> Rtx4090Config

Default optimized configuration

pub fn estimate_performance(
  flops_needed: Float,
  bytes_to_transfer: Float,
  config: Rtx4090Config,
) -> PerformanceEstimate

Estimates the performance of a tensor operation
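
A worked sketch for a square matrix multiply: an N x N x N multiply needs roughly 2*N^3 FLOPs and, at FP16, moves about 3*N^2*2 bytes for the two inputs and the output. These counts are back-of-the-envelope assumptions for the example, not values the module computes.

import gleam/int
import gleam/io
import viva_tensor/rtx4090

pub fn matmul_estimate() -> Nil {
  let n = 4096
  // ~2 * N^3 FLOPs for an N x N x N matrix multiply.
  let flops = 2.0 *. int.to_float(n * n * n)
  // Two inputs and one output, 2 bytes per FP16 element.
  let bytes = int.to_float(3 * n * n * 2)

  let config = rtx4090.speed_config()
  let estimate = rtx4090.estimate_performance(flops, bytes, config)

  let bottleneck = case estimate.bottleneck {
    rtx4090.ComputeBound -> "compute-bound"
    rtx4090.MemoryBound -> "memory-bound"
    rtx4090.LatencyBound -> "latency-bound"
  }
  io.println("4096x4096 matmul is estimated to be " <> bottleneck)
}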

pub fn free(state: GpuMemoryState, bytes: Int) -> GpuMemoryState

Frees memory

pub fn get_specs() -> Rtx4090Specs

Returns the RTX 4090 specifications

pub fn init_memory() -> GpuMemoryState

Creates the initial memory state for the RTX 4090

pub fn main() -> Nil
pub fn precision_config() -> Rtx4090Config

Configuration for maximum precision

pub fn process_batch(
  tensors: List(tensor.Tensor),
  config: Rtx4090Config,
) -> BatchResult

Processes a batch of tensors with compression
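
A sketch of driving process_batch and reading the resulting BatchResult. Building the List(tensor.Tensor) input is left to the tensor module, which is assumed here to live at viva_tensor/tensor.

import gleam/float
import gleam/io
import viva_tensor/rtx4090
import viva_tensor/tensor

pub fn run_batch(tensors: List(tensor.Tensor)) -> Nil {
  let config = rtx4090.default_config()
  let result = rtx4090.process_batch(tensors, config)

  io.println(
    "Compression ratio: "
    <> float.to_string(result.compression_ratio)
    <> ", memory saved: "
    <> float.to_string(result.memory_saved_mb)
    <> " MB",
  )
}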

pub fn speed_config() -> Rtx4090Config

Configuration for maximum speed

pub fn tensor_memory_bytes(
  shape: List(Int),
  mode: QuantMode4090,
) -> Int

Calculates the memory required for a tensor
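
A small comparison sketch: the same shape priced in bytes under FP32 and INT8 (one would expect roughly a 4x difference from 4-byte vs 1-byte elements). The shape is arbitrary.

import gleam/int
import gleam/io
import viva_tensor/rtx4090.{Fp32Mode, Int8TensorMode}

pub fn compare_footprint() -> Nil {
  // A 1024 x 1024 tensor as an example shape.
  let shape = [1024, 1024]

  let fp32 = rtx4090.tensor_memory_bytes(shape, Fp32Mode)
  let int8 = rtx4090.tensor_memory_bytes(shape, Int8TensorMode)

  io.println("FP32 bytes: " <> int.to_string(fp32))
  io.println("INT8 bytes: " <> int.to_string(int8))
}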
