viva_tensor/cuda
CudaTensor - Persistent GPU Memory
Tensors that live on the GPU, ideal for model weights and heavy compute.
- FP32 (CudaTensor): Standard precision. 40+ TFLOPS on an RTX 4090.
- FP16 (CudaTensor16): Low precision, high throughput using Tensor Cores. 330+ TFLOPS!
Data is uploaded once and stays on the device; most operations are launched asynchronously, so compute can be queued without blocking on each call.
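A minimal end-to-end sketch of the FP32 path. The row-major data layout and the `main` wiring are assumptions for illustration; only the `cuda.*` calls come from this module's API:

```gleam
import gleam/io
import viva_tensor/cuda

pub fn main() {
  // Upload two 2x2 matrices (assumed row-major) to the GPU.
  let assert Ok(a) = cuda.new([1.0, 2.0, 3.0, 4.0], [2, 2])
  let assert Ok(b) = cuda.new([5.0, 6.0, 7.0, 8.0], [2, 2])

  // C = A @ B with m = n = k = 2; the result stays on the device.
  let assert Ok(c) = cuda.matmul(a, b, 2, 2, 2)

  // Only this final download copies data back to the host.
  let assert Ok(values) = cuda.to_list(c)
  io.debug(values)
}
```

Chaining device-side operations before a single `to_list` keeps data on the GPU and avoids repeated host/device transfers.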
Types
Reference to a tensor stored in GPU memory (FP32)
pub type CudaTensor =
ffi.CudaTensorRef
Reference to a tensor stored in GPU memory (FP16)
pub type CudaTensor16 =
ffi.CudaTensor16Ref
Values
pub fn matmul(
a: ffi.CudaTensorRef,
b: ffi.CudaTensorRef,
m: Int,
n: Int,
k: Int,
) -> Result(ffi.CudaTensorRef, String)
Matrix multiplication (FP32): C = A @ B, where A is m x k, B is k x n, and C is m x n.
pub fn matmul16(
a: ffi.CudaTensor16Ref,
b: ffi.CudaTensor16Ref,
m: Int,
n: Int,
k: Int,
) -> Result(ffi.CudaTensor16Ref, String)
Matrix multiplication (FP16, Tensor Cores): C = A @ B, where A is m x k, B is k x n, and C is m x n.
Uses HMMA (Half-precision Matrix Multiply Accumulate) instructions. Expect large speedups (up to 330 TFLOPS) when m, n, and k are multiples of 16, matching the 16x16 Tensor Core tile.
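A sketch of the FP16 path with Tensor Core friendly dimensions. The helper name and the use of `gleam/list`/`gleam/result` to build dummy data and chain results are illustrative assumptions:

```gleam
import gleam/list
import gleam/result
import viva_tensor/cuda

pub fn fp16_demo() -> Result(List(Float), String) {
  // 16x16 inputs keep m, n, and k aligned to the Tensor Core tile.
  let data = list.repeat(1.0, times: 256)
  use a <- result.try(cuda.new16(data, [16, 16]))
  use b <- result.try(cuda.new16(data, [16, 16]))
  // C = A @ B with m = n = k = 16, executed via HMMA instructions.
  use c <- result.try(cuda.matmul16(a, b, 16, 16, 16))
  cuda.to_list16(c)
}
```

If your problem sizes are not multiples of 16, padding the matrices up to the next multiple is a common way to keep the Tensor Core path fast.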
pub fn new(
data: List(Float),
shape: List(Int),
) -> Result(ffi.CudaTensorRef, String)
Upload data to GPU (FP32)
pub fn new16(
data: List(Float),
shape: List(Int),
) -> Result(ffi.CudaTensor16Ref, String)
Upload data to GPU (converts f64 -> f16)
pub fn shape(
tensor: ffi.CudaTensorRef,
) -> Result(List(Int), String)
Get shape of FP32 tensor
pub fn shape16(
tensor: ffi.CudaTensor16Ref,
) -> Result(List(Int), String)
Get shape of FP16 tensor
pub fn to_list(
tensor: ffi.CudaTensorRef,
) -> Result(List(Float), String)
Download data from GPU (FP32)
pub fn to_list16(
tensor: ffi.CudaTensor16Ref,
) -> Result(List(Float), String)
Download data from GPU (converts f16 -> f64)
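Because values pass through f16 on upload, a download does not round-trip f64 exactly: FP16 carries roughly 3 significant decimal digits and its largest finite value is 65504. A sketch of the effect (the exact values printed depend on the f16 rounding, so none are asserted here):

```gleam
import gleam/io
import viva_tensor/cuda

pub fn roundtrip_demo() {
  // 0.1 is not exactly representable in f16; expect a value close
  // to, but not equal to, 0.1 after the round trip. Small integers
  // like 1.0 and 2.0 are exact in f16 and survive unchanged.
  let assert Ok(t) = cuda.new16([0.1, 1.0, 2.0, 3.0], [4])
  let assert Ok(back) = cuda.to_list16(t)
  io.debug(back)
}
```

Keep FP32 (`new` / `to_list`) for data where this precision loss matters, and reserve FP16 for throughput-bound workloads that tolerate it.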