viva_tensor/cuda
CudaTensor - Persistent GPU Memory
Tensors that live on the GPU, ideal for model weights and heavy compute.
- FP32 (CudaTensor): Standard precision. 40+ TFLOPS on an RTX 4090.
- FP16 (CudaTensor16): Low precision, high throughput using Tensor Cores. 330+ TFLOPS!
Data is uploaded once and stays on the device; most operations are launched asynchronously, so compute can be queued without blocking on each call.
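A minimal end-to-end sketch of the FP32 path. The row-major data layout and the `main` wiring are assumptions for illustration; only the `cuda.*` calls come from this module's API:

```gleam
import gleam/io
import viva_tensor/cuda

pub fn main() {
  // Upload two 2x2 matrices (assumed row-major) to the GPU.
  let assert Ok(a) = cuda.new([1.0, 2.0, 3.0, 4.0], [2, 2])
  let assert Ok(b) = cuda.new([5.0, 6.0, 7.0, 8.0], [2, 2])

  // C = A @ B with m = n = k = 2; the result stays on the device.
  let assert Ok(c) = cuda.matmul(a, b, 2, 2, 2)

  // Only this final download copies data back to the host.
  let assert Ok(values) = cuda.to_list(c)
  io.debug(values)
}
```

Chaining device-side operations before a single `to_list` keeps data on the GPU and avoids repeated host/device transfers.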
Types
Reference to a tensor stored in GPU memory (FP32)
pub type CudaTensor =
ffi.CudaTensorRef
Reference to a tensor stored in GPU memory (FP16)
pub type CudaTensor16 =
ffi.CudaTensor16Ref
Values
pub fn matmul(
a: ffi.CudaTensorRef,
b: ffi.CudaTensorRef,
m: Int,
n: Int,
k: Int,
) -> Result(ffi.CudaTensorRef, String)
Matrix multiplication (FP32): C = A @ B, where A is m x k, B is k x n, and C is m x n.
pub fn matmul16(
a: ffi.CudaTensor16Ref,
b: ffi.CudaTensor16Ref,
m: Int,
n: Int,
k: Int,
) -> Result(ffi.CudaTensor16Ref, String)
Matrix multiplication (FP16, Tensor Cores): C = A @ B, where A is m x k, B is k x n, and C is m x n.
Uses HMMA (Half-precision Matrix Multiply Accumulate) instructions. Expect large speedups (up to 330 TFLOPS) when m, n, and k are multiples of 16, matching the 16x16 Tensor Core tile.
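A sketch of the FP16 path with Tensor Core friendly dimensions. The helper name and the use of `gleam/list`/`gleam/result` to build dummy data and chain results are illustrative assumptions:

```gleam
import gleam/list
import gleam/result
import viva_tensor/cuda

pub fn fp16_demo() -> Result(List(Float), String) {
  // 16x16 inputs keep m, n, and k aligned to the Tensor Core tile.
  let data = list.repeat(1.0, times: 256)
  use a <- result.try(cuda.new16(data, [16, 16]))
  use b <- result.try(cuda.new16(data, [16, 16]))
  // C = A @ B with m = n = k = 16, executed via HMMA instructions.
  use c <- result.try(cuda.matmul16(a, b, 16, 16, 16))
  cuda.to_list16(c)
}
```

If your problem sizes are not multiples of 16, padding the matrices up to the next multiple is a common way to keep the Tensor Core path fast.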
pub fn new(
data: List(Float),
shape: List(Int),
) -> Result(ffi.CudaTensorRef, String)
Upload data to GPU (FP32)
pub fn new16(
data: List(Float),
shape: List(Int),
) -> Result(ffi.CudaTensor16Ref, String)
Upload data to GPU (converts f64 -> f16)
pub fn shape(
tensor: ffi.CudaTensorRef,
) -> Result(List(Int), String)
Get shape of FP32 tensor
pub fn shape16(
tensor: ffi.CudaTensor16Ref,
) -> Result(List(Int), String)
Get shape of FP16 tensor
pub fn to_list(
tensor: ffi.CudaTensorRef,
) -> Result(List(Float), String)
Download data from GPU (FP32)
pub fn to_list16(
tensor: ffi.CudaTensor16Ref,
) -> Result(List(Float), String)
Download data from GPU (converts f16 -> f64)
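Because values pass through f16 on upload, a download does not round-trip f64 exactly: FP16 carries roughly 3 significant decimal digits and its largest finite value is 65504. A sketch of the effect (the exact values printed depend on the f16 rounding, so none are asserted here):

```gleam
import gleam/io
import viva_tensor/cuda

pub fn roundtrip_demo() {
  // 0.1 is not exactly representable in f16; expect a value close
  // to, but not equal to, 0.1 after the round trip. Small integers
  // like 1.0 and 2.0 are exact in f16 and survive unchanged.
  let assert Ok(t) = cuda.new16([0.1, 1.0, 2.0, 3.0], [4])
  let assert Ok(back) = cuda.to_list16(t)
  io.debug(back)
}
```

Keep FP32 (`new` / `to_list`) for data where this precision loss matters, and reserve FP16 for throughput-bound workloads that tolerate it.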