viva_tensor/optim/rtx4090

RTX 4090 Optimized Engine

RTX 4090 (ASUS ROG STRIX) SPECIFICATIONS:
- GPU: AD102 (16,384 CUDA cores)
- Tensor Cores: 512 (4th generation)
- VRAM: 24GB GDDR6X
- Bandwidth: 1008 GB/s
- TDP: 450W (boost up to 600W)
- FP32: 82.6 TFLOPS
- FP16 Tensor: 330 TFLOPS
- INT8 Tensor: 661 TOPS

SPECIFIC OPTIMIZATIONS:
- VRAM-aware batch sizing (24GB - 2GB reserved for the system = 22GB usable)
- Tensor Core utilization (8x8 or 16x16 tile alignment)
- GDDR6X burst patterns (384-bit bus, aligned access)
- CUDA warp-aware parallelism (32 threads per warp)

Pure Gleam + BEAM concurrency for maximum utilization!
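The tile-alignment rule above can be sketched as a small helper. This is a hypothetical illustration, not part of the module's public API:

```gleam
// Hypothetical helper: round a matrix dimension up to the Tensor Core
// tile size (8 or 16) so no MMA tile is left partially filled.
pub fn align_to_tile(dim: Int, tile: Int) -> Int {
  case dim % tile {
    0 -> dim
    rem -> dim + tile - rem
  }
}
```

For example, `align_to_tile(1000, 16)` rounds up to 1008, trading a little padding for full Tensor Core occupancy.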
Types
Batch processing result
pub type BatchResult {
BatchResult(
tensors: List(blackwell.BlackwellTensor),
total_time_ms: Int,
throughput_tps: Float,
compression_ratio: Float,
memory_saved_mb: Float,
)
}
Constructors

- BatchResult(tensors: List(blackwell.BlackwellTensor), total_time_ms: Int, throughput_tps: Float, compression_ratio: Float, memory_saved_mb: Float)
Bottleneck type
pub type Bottleneck {
ComputeBound
MemoryBound
LatencyBound
}
Constructors

- ComputeBound: operation limited by arithmetic throughput
- MemoryBound: operation limited by memory bandwidth
- LatencyBound: operation limited by launch and transfer latency
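A roofline-style comparison is one plausible way such a Bottleneck value could be derived: compare the time the operation needs at peak compute against the time its data needs at peak bandwidth. A sketch using the FP32 and bandwidth figures from the header; the module's actual rule may differ:

```gleam
// Sketch: classify an operation by comparing compute time against transfer
// time at the RTX 4090's peak FP32 rate (82.6 TFLOPS) and 1008 GB/s bandwidth.
fn classify_bottleneck(flops_needed: Float, bytes_to_transfer: Float) -> Bottleneck {
  let compute_s = flops_needed /. 82.6e12
  let memory_s = bytes_to_transfer /. 1008.0e9
  case compute_s >. memory_s {
    True -> ComputeBound
    False -> MemoryBound
  }
}
```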
GPU memory state
pub type GpuMemoryState {
GpuMemoryState(
total_bytes: Int,
used_bytes: Int,
free_bytes: Int,
allocated_tensors: Int,
cached_bytes: Int,
)
}
Constructors

- GpuMemoryState(total_bytes: Int, used_bytes: Int, free_bytes: Int, allocated_tensors: Int, cached_bytes: Int)

Arguments

- total_bytes: total VRAM in bytes
- used_bytes: used VRAM in bytes
- free_bytes: free VRAM in bytes
- allocated_tensors: number of currently allocated tensors
- cached_bytes: bytes held in cache
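Typical usage pairs `can_allocate` with `allocate` to guard a reservation before committing it. An illustrative sketch, assuming a `state: GpuMemoryState` value is already in scope:

```gleam
// Sketch: reserve ~100 MB for a tensor only if it fits in free VRAM,
// keeping the old state when the allocation is refused.
let bytes = 100 * 1024 * 1024
let state = case can_allocate(state, bytes) {
  True ->
    case allocate(state, bytes) {
      Ok(updated) -> updated
      Error(_reason) -> state
    }
  False -> state
}
```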
Performance estimate
pub type PerformanceEstimate {
PerformanceEstimate(
theoretical_flops: Float,
achievable_flops: Float,
estimated_time_ms: Float,
bottleneck: Bottleneck,
efficiency_pct: Float,
)
}
Constructors

- PerformanceEstimate(theoretical_flops: Float, achievable_flops: Float, estimated_time_ms: Float, bottleneck: Bottleneck, efficiency_pct: Float)

Arguments

- theoretical_flops: theoretical FLOPS
- achievable_flops: achievable FLOPS (after overhead)
- estimated_time_ms: estimated time in ms
- bottleneck: limiting factor (compute or memory)
- efficiency_pct: estimated efficiency in percent
Quantization modes for RTX 4090
pub type QuantMode4090 {
Fp32Mode
Fp16TensorMode
Int8TensorMode
MixedPrecisionMode
}
Constructors

- Fp32Mode: pure FP32 (82.6 TFLOPS)
- Fp16TensorMode: FP16 on Tensor Cores (330 TFLOPS, 4x FP32)
- Int8TensorMode: INT8 on Tensor Cores (661 TOPS, 8x FP32)
- MixedPrecisionMode: mixed precision (FP16 compute, FP32 accumulate)
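The storage cost per element follows directly from the mode. A hypothetical helper, assuming MixedPrecisionMode stores FP16 (the FP32 accumulation happens only during compute):

```gleam
// Bytes of VRAM needed per element for each quantization mode.
fn bytes_per_element(mode: QuantMode4090) -> Int {
  case mode {
    Fp32Mode -> 4
    Fp16TensorMode -> 2
    Int8TensorMode -> 1
    // Assumption: FP16 storage, FP32 only in the accumulators.
    MixedPrecisionMode -> 2
  }
}
```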
Optimized configuration for RTX 4090
pub type Rtx4090Config {
Rtx4090Config(
optimal_batch_size: Int,
tensor_core_tile: Int,
memory_alignment: Int,
threads_per_block: Int,
use_tensor_cores: Bool,
quant_mode: QuantMode4090,
)
}
Constructors

- Rtx4090Config(optimal_batch_size: Int, tensor_core_tile: Int, memory_alignment: Int, threads_per_block: Int, use_tensor_cores: Bool, quant_mode: QuantMode4090)

Arguments

- optimal_batch_size: optimal batch size for 24GB VRAM
- tensor_core_tile: tile size for Tensor Cores (8 or 16)
- memory_alignment: memory alignment in bytes (256 bits = 32 bytes)
- threads_per_block: threads per CUDA block
- use_tensor_cores: whether to use Tensor Cores (FP16/INT8)
- quant_mode: quantization mode
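A plausible configuration for large FP16 batches might look like the following. The concrete values are illustrative, not defaults taken from the module:

```gleam
let config =
  Rtx4090Config(
    optimal_batch_size: 256,
    tensor_core_tile: 16,
    // 256-bit (32-byte) aligned accesses for GDDR6X bursts.
    memory_alignment: 32,
    // A multiple of the 32-thread warp size.
    threads_per_block: 256,
    use_tensor_cores: True,
    quant_mode: Fp16TensorMode,
  )
```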
RTX 4090 specifications
pub type Rtx4090Specs {
Rtx4090Specs(
cuda_cores: Int,
tensor_cores: Int,
vram_gb: Float,
vram_available_gb: Float,
bandwidth_gbps: Float,
tdp_watts: Int,
tflops_fp32: Float,
tflops_fp16: Float,
tops_int8: Float,
warp_size: Int,
sm_count: Int,
l2_cache_mb: Int,
)
}
Constructors

- Rtx4090Specs(cuda_cores: Int, tensor_cores: Int, vram_gb: Float, vram_available_gb: Float, bandwidth_gbps: Float, tdp_watts: Int, tflops_fp32: Float, tflops_fp16: Float, tops_int8: Float, warp_size: Int, sm_count: Int, l2_cache_mb: Int)

Arguments

- cuda_cores: CUDA cores
- tensor_cores: Tensor Cores (4th generation)
- vram_gb: VRAM in GB
- vram_available_gb: available VRAM (after system reservation)
- bandwidth_gbps: bandwidth in GB/s
- tdp_watts: TDP in watts
- tflops_fp32: FP32 TFLOPS
- tflops_fp16: FP16 TFLOPS (Tensor)
- tops_int8: INT8 TOPS (Tensor)
- warp_size: warp size (threads per warp)
- sm_count: SM count
- l2_cache_mb: L2 cache in MB
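Filled in with the figures from the header above, plus the AD102's 128 SMs and 72 MB of L2 cache, a specs value would read:

```gleam
let specs =
  Rtx4090Specs(
    cuda_cores: 16_384,
    tensor_cores: 512,
    vram_gb: 24.0,
    vram_available_gb: 22.0,
    bandwidth_gbps: 1008.0,
    tdp_watts: 450,
    tflops_fp32: 82.6,
    tflops_fp16: 330.0,
    tops_int8: 661.0,
    warp_size: 32,
    sm_count: 128,
    l2_cache_mb: 72,
  )
```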
Values
pub fn allocate(
state: GpuMemoryState,
bytes: Int,
) -> Result(GpuMemoryState, String)
Allocates memory for a tensor, returning the updated memory state, or an error if the allocation does not fit
pub fn benchmark_rtx4090() -> Nil
pub fn can_allocate(state: GpuMemoryState, bytes: Int) -> Bool
Checks whether a tensor of the given size fits in free VRAM
pub fn estimate_performance(
flops_needed: Float,
bytes_to_transfer: Float,
config: Rtx4090Config,
) -> PerformanceEstimate
Estimates the performance of a tensor operation
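For a square matrix multiply of size n, the work is roughly 2·n³ FLOPs and about 3·n² elements move through memory. An illustrative call, assuming a `config: Rtx4090Config` value is already in scope:

```gleam
// 4096 x 4096 FP16 matmul: ~137 GFLOP of work, ~100 MB of traffic
// (3 matrices of n * n elements at 2 bytes each).
let n = 4096.0
let flops = 2.0 *. n *. n *. n
let bytes = 3.0 *. n *. n *. 2.0
let estimate = estimate_performance(flops, bytes, config)
```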
pub fn process_batch(
tensors: List(tensor.Tensor),
config: Rtx4090Config,
) -> BatchResult
Processes a batch of tensors with compression
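Typical usage: compress a list of tensors and report the result. A sketch, assuming `tensors` and `config` are in scope and `gleam/io` and `gleam/float` are imported:

```gleam
let result = process_batch(tensors, config)
// Log throughput and the memory saved by compression.
io.println(
  "tps: "
  <> float.to_string(result.throughput_tps)
  <> ", saved MB: "
  <> float.to_string(result.memory_saved_mb),
)
```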
pub fn tensor_memory_bytes(
shape: List(Int),
mode: QuantMode4090,
) -> Int
Computes the memory required for a tensor of the given shape and quantization mode
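Worked example: a 1024x1024 tensor in FP16 needs 1024 · 1024 · 2 = 2,097,152 bytes (2 MB). The function presumably returns this figure, possibly rounded up for alignment:

```gleam
let bytes = tensor_memory_bytes([1024, 1024], Fp16TensorMode)
// Expected: about 2_097_152 bytes (1024 * 1024 * 2), before any padding.
```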