viva_tensor/optim/rtx4090

RTX 4090 Optimized Engine

RTX 4090 (ASUS ROG STRIX) SPECIFICATIONS:
- GPU: AD102 (16,384 CUDA cores)
- Tensor Cores: 512 (4th generation)
- VRAM: 24GB GDDR6X
- Bandwidth: 1008 GB/s
- TDP: 450W (boost up to 600W)
- FP32: 82.6 TFLOPS
- FP16 Tensor: 330 TFLOPS
- INT8 Tensor: 661 TOPS

SPECIFIC OPTIMIZATIONS:
- VRAM-aware batch sizing (24GB - 2GB reserved for the system = 22GB usable)
- Tensor Core utilization (8x8 or 16x16 tile alignment)
- GDDR6X burst patterns (384-bit bus, aligned access)
- CUDA warp-aware parallelism (32 threads per warp)

Pure Gleam + BEAM concurrency for maximum utilization!
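The tile-alignment rule above can be sketched as a small helper. This is a hypothetical illustration, not part of the module's public API:

```gleam
// Hypothetical helper: round a matrix dimension up to the Tensor Core
// tile size (8 or 16) so no MMA tile is left partially filled.
pub fn align_to_tile(dim: Int, tile: Int) -> Int {
  case dim % tile {
    0 -> dim
    rem -> dim + tile - rem
  }
}
```

For example, `align_to_tile(1000, 16)` rounds up to 1008, trading a little padding for full Tensor Core occupancy.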
Types
Batch processing result
pub type BatchResult {
BatchResult(
tensors: List(blackwell.BlackwellTensor),
total_time_ms: Int,
throughput_tps: Float,
compression_ratio: Float,
memory_saved_mb: Float,
)
}
Constructors

- BatchResult(tensors: List(blackwell.BlackwellTensor), total_time_ms: Int, throughput_tps: Float, compression_ratio: Float, memory_saved_mb: Float)
Bottleneck type
pub type Bottleneck {
ComputeBound
MemoryBound
LatencyBound
}
Constructors

- ComputeBound: operation limited by arithmetic throughput
- MemoryBound: operation limited by memory bandwidth
- LatencyBound: operation limited by launch and transfer latency
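A roofline-style comparison is one plausible way such a Bottleneck value could be derived: compare the time the operation needs at peak compute against the time its data needs at peak bandwidth. A sketch using the FP32 and bandwidth figures from the header; the module's actual rule may differ:

```gleam
// Sketch: classify an operation by comparing compute time against transfer
// time at the RTX 4090's peak FP32 rate (82.6 TFLOPS) and 1008 GB/s bandwidth.
fn classify_bottleneck(flops_needed: Float, bytes_to_transfer: Float) -> Bottleneck {
  let compute_s = flops_needed /. 82.6e12
  let memory_s = bytes_to_transfer /. 1008.0e9
  case compute_s >. memory_s {
    True -> ComputeBound
    False -> MemoryBound
  }
}
```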
GPU memory state
pub type GpuMemoryState {
GpuMemoryState(
total_bytes: Int,
used_bytes: Int,
free_bytes: Int,
allocated_tensors: Int,
cached_bytes: Int,
)
}
Constructors

- GpuMemoryState(total_bytes: Int, used_bytes: Int, free_bytes: Int, allocated_tensors: Int, cached_bytes: Int)

Arguments

- total_bytes: total VRAM in bytes
- used_bytes: used VRAM in bytes
- free_bytes: free VRAM in bytes
- allocated_tensors: number of currently allocated tensors
- cached_bytes: bytes held in cache
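Typical usage pairs `can_allocate` with `allocate` to guard a reservation before committing it. An illustrative sketch, assuming a `state: GpuMemoryState` value is already in scope:

```gleam
// Sketch: reserve ~100 MB for a tensor only if it fits in free VRAM,
// keeping the old state when the allocation is refused.
let bytes = 100 * 1024 * 1024
let state = case can_allocate(state, bytes) {
  True ->
    case allocate(state, bytes) {
      Ok(updated) -> updated
      Error(_reason) -> state
    }
  False -> state
}
```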
Performance estimate
pub type PerformanceEstimate {
PerformanceEstimate(
theoretical_flops: Float,
achievable_flops: Float,
estimated_time_ms: Float,
bottleneck: Bottleneck,
efficiency_pct: Float,
)
}
Constructors

- PerformanceEstimate(theoretical_flops: Float, achievable_flops: Float, estimated_time_ms: Float, bottleneck: Bottleneck, efficiency_pct: Float)

Arguments

- theoretical_flops: theoretical FLOPS
- achievable_flops: achievable FLOPS (after overhead)
- estimated_time_ms: estimated time in ms
- bottleneck: limiting factor (compute or memory)
- efficiency_pct: estimated efficiency in percent
Quantization modes for RTX 4090
pub type QuantMode4090 {
Fp32Mode
Fp16TensorMode
Int8TensorMode
MixedPrecisionMode
}
Constructors

- Fp32Mode: pure FP32 (82.6 TFLOPS)
- Fp16TensorMode: FP16 on Tensor Cores (330 TFLOPS, 4x FP32)
- Int8TensorMode: INT8 on Tensor Cores (661 TOPS, 8x FP32)
- MixedPrecisionMode: mixed precision (FP16 compute, FP32 accumulate)
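The storage cost per element follows directly from the mode. A hypothetical helper, assuming MixedPrecisionMode stores FP16 (the FP32 accumulation happens only during compute):

```gleam
// Bytes of VRAM needed per element for each quantization mode.
fn bytes_per_element(mode: QuantMode4090) -> Int {
  case mode {
    Fp32Mode -> 4
    Fp16TensorMode -> 2
    Int8TensorMode -> 1
    // Assumption: FP16 storage, FP32 only in the accumulators.
    MixedPrecisionMode -> 2
  }
}
```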
Optimized configuration for RTX 4090
pub type Rtx4090Config {
Rtx4090Config(
optimal_batch_size: Int,
tensor_core_tile: Int,
memory_alignment: Int,
threads_per_block: Int,
use_tensor_cores: Bool,
quant_mode: QuantMode4090,
)
}
Constructors

- Rtx4090Config(optimal_batch_size: Int, tensor_core_tile: Int, memory_alignment: Int, threads_per_block: Int, use_tensor_cores: Bool, quant_mode: QuantMode4090)

Arguments

- optimal_batch_size: optimal batch size for 24GB VRAM
- tensor_core_tile: tile size for Tensor Cores (8 or 16)
- memory_alignment: memory alignment in bytes (256 bits = 32 bytes)
- threads_per_block: threads per CUDA block
- use_tensor_cores: whether to use Tensor Cores (FP16/INT8)
- quant_mode: quantization mode
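A plausible configuration for large FP16 batches might look like the following. The concrete values are illustrative, not defaults taken from the module:

```gleam
let config =
  Rtx4090Config(
    optimal_batch_size: 256,
    tensor_core_tile: 16,
    // 256-bit (32-byte) aligned accesses for GDDR6X bursts.
    memory_alignment: 32,
    // A multiple of the 32-thread warp size.
    threads_per_block: 256,
    use_tensor_cores: True,
    quant_mode: Fp16TensorMode,
  )
```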
RTX 4090 specifications
pub type Rtx4090Specs {
Rtx4090Specs(
cuda_cores: Int,
tensor_cores: Int,
vram_gb: Float,
vram_available_gb: Float,
bandwidth_gbps: Float,
tdp_watts: Int,
tflops_fp32: Float,
tflops_fp16: Float,
tops_int8: Float,
warp_size: Int,
sm_count: Int,
l2_cache_mb: Int,
)
}
Constructors

- Rtx4090Specs(cuda_cores: Int, tensor_cores: Int, vram_gb: Float, vram_available_gb: Float, bandwidth_gbps: Float, tdp_watts: Int, tflops_fp32: Float, tflops_fp16: Float, tops_int8: Float, warp_size: Int, sm_count: Int, l2_cache_mb: Int)

Arguments

- cuda_cores: CUDA cores
- tensor_cores: Tensor Cores (4th generation)
- vram_gb: VRAM in GB
- vram_available_gb: available VRAM (after system reservation)
- bandwidth_gbps: bandwidth in GB/s
- tdp_watts: TDP in watts
- tflops_fp32: FP32 TFLOPS
- tflops_fp16: FP16 TFLOPS (Tensor)
- tops_int8: INT8 TOPS (Tensor)
- warp_size: warp size (threads per warp)
- sm_count: SM count
- l2_cache_mb: L2 cache in MB
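Filled in with the figures from the header above, plus the AD102's 128 SMs and 72 MB of L2 cache, a specs value would read:

```gleam
let specs =
  Rtx4090Specs(
    cuda_cores: 16_384,
    tensor_cores: 512,
    vram_gb: 24.0,
    vram_available_gb: 22.0,
    bandwidth_gbps: 1008.0,
    tdp_watts: 450,
    tflops_fp32: 82.6,
    tflops_fp16: 330.0,
    tops_int8: 661.0,
    warp_size: 32,
    sm_count: 128,
    l2_cache_mb: 72,
  )
```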
Values
pub fn allocate(
state: GpuMemoryState,
bytes: Int,
) -> Result(GpuMemoryState, String)
Allocates memory for a tensor, returning the updated memory state, or an error if the allocation does not fit
pub fn benchmark_rtx4090() -> Nil
pub fn can_allocate(state: GpuMemoryState, bytes: Int) -> Bool
Checks whether a tensor of the given size fits in free VRAM
pub fn estimate_performance(
flops_needed: Float,
bytes_to_transfer: Float,
config: Rtx4090Config,
) -> PerformanceEstimate
Estimates the performance of a tensor operation
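For a square matrix multiply of size n, the work is roughly 2·n³ FLOPs and about 3·n² elements move through memory. An illustrative call, assuming a `config: Rtx4090Config` value is already in scope:

```gleam
// 4096 x 4096 FP16 matmul: ~137 GFLOP of work, ~100 MB of traffic
// (3 matrices of n * n elements at 2 bytes each).
let n = 4096.0
let flops = 2.0 *. n *. n *. n
let bytes = 3.0 *. n *. n *. 2.0
let estimate = estimate_performance(flops, bytes, config)
```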
pub fn process_batch(
tensors: List(tensor.Tensor),
config: Rtx4090Config,
) -> BatchResult
Processes a batch of tensors with compression
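Typical usage: compress a list of tensors and report the result. A sketch, assuming `tensors` and `config` are in scope and `gleam/io` and `gleam/float` are imported:

```gleam
let result = process_batch(tensors, config)
// Log throughput and the memory saved by compression.
io.println(
  "tps: "
  <> float.to_string(result.throughput_tps)
  <> ", saved MB: "
  <> float.to_string(result.memory_saved_mb),
)
```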
pub fn tensor_memory_bytes(
shape: List(Int),
mode: QuantMode4090,
) -> Int
Computes the memory required for a tensor of the given shape and quantization mode
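Worked example: a 1024x1024 tensor in FP16 needs 1024 · 1024 · 2 = 2,097,152 bytes (2 MB). The function presumably returns this figure, possibly rounded up for alignment:

```gleam
let bytes = tensor_memory_bytes([1024, 1024], Fp16TensorMode)
// Expected: about 2_097_152 bytes (1024 * 1024 * 2), before any padding.
```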