viva_tensor/sparse
SparseTensor - 2:4 Sparsity
Uses cuSPARSELt to prune and compress weight matrices.
- 2:4 Structure: For every block of 4 elements, 2 must be zero.
- Compression: Reduces memory usage by ~50% in theory (1.78x measured in practice).
- Speedup: Up to 2x theoretical (660 TFLOPS); 61% measured speedup vs. dense.
Ideal for Large Language Model (LLM) weights.
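To make the 2:4 constraint concrete, here is a minimal pure-Gleam sketch (illustration only, not part of this module; no GPU involved) that checks whether a 4-element block satisfies the pattern:

import gleam/list

// Illustration only: the real pruning runs inside cuSPARSELt.
// A block satisfies 2:4 sparsity when at least 2 of its 4 values are
// zero, e.g. [0.9, -0.1, 0.05, 0.7] prunes to [0.9, 0.0, 0.0, 0.7].
pub fn is_2_4_block(block: List(Float)) -> Bool {
  list.length(block) == 4
  && list.count(block, fn(x) { x == 0.0 }) >= 2
}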
Types
Reference to a 2:4 structured sparse tensor (GPU)
pub type SparseTensor =
ffi.SparseTensorRef
Values
pub fn compression_ratio(
tensor: ffi.SparseTensorRef,
) -> Result(Float, String)
Get the actual compression ratio (dense bytes / sparse bytes)
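A usage sketch for checking the achieved ratio; viva_tensor/ffi is an assumed import path for the Ref types (it is not shown on this page):

import gleam/float
import gleam/io
import gleam/result
import viva_tensor/ffi
import viva_tensor/sparse

// Print how much smaller the compressed tensor is than its dense
// original; a value near 1.78 matches the figure quoted above.
pub fn report_ratio(tensor: ffi.SparseTensorRef) -> Result(Nil, String) {
  use ratio <- result.try(sparse.compression_ratio(tensor))
  io.println("compression: " <> float.to_string(ratio) <> "x")
  Ok(Nil)
}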
pub fn from_cuda16(
tensor: ffi.CudaTensor16Ref,
) -> Result(ffi.SparseTensorRef, String)
Create SparseTensor from CudaTensor16 (Prune + Compress)
This operation is destructive: it prunes the two smallest values in every 4-element block. The resulting sparse tensor is stored in a compressed format on the GPU.
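A usage sketch. Creating the dense CudaTensor16 is outside this module's scope, so the weights arrive as a parameter; viva_tensor/ffi is again an assumed import path:

import viva_tensor/ffi
import viva_tensor/sparse

// Prune + compress an FP16 weight tensor already resident on the GPU.
// Lossy: two values in each 4-element block are zeroed, so keep the
// dense original around if the unpruned weights are still needed.
pub fn compress_weights(
  weights: ffi.CudaTensor16Ref,
) -> Result(ffi.SparseTensorRef, String) {
  sparse.from_cuda16(weights)
}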
pub fn matmul(
a_sparse: ffi.SparseTensorRef,
b_dense: ffi.CudaTensor16Ref,
m: Int,
n: Int,
k: Int,
) -> Result(ffi.CudaTensor16Ref, String)
Sparse Matrix Multiplication (SpMM)
C = Sparse(A) @ Dense(B)
a_sparse: Compressed weight matrix (2:4 sparse)
b_dense: Dense activation matrix (FP16)
Returns dense FP16 result.
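A sketch of an LLM-style linear layer, y = W @ x, with a 2:4-compressed weight matrix; shape bookkeeping is left to the caller, and viva_tensor/ffi is an assumed import path:

import viva_tensor/ffi
import viva_tensor/sparse

// C = Sparse(A) @ Dense(B): w is the compressed [m, k] weight matrix,
// x the dense [k, n] FP16 activations; the result is dense [m, n] FP16.
pub fn linear(
  w: ffi.SparseTensorRef,
  x: ffi.CudaTensor16Ref,
  m: Int,
  n: Int,
  k: Int,
) -> Result(ffi.CudaTensor16Ref, String) {
  sparse.matmul(w, x, m, n, k)
}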
pub fn shape(
tensor: ffi.SparseTensorRef,
) -> Result(List(Int), String)
Get the shape of the original dense tensor as [Rows, Cols]
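A sketch that formats the logical (pre-compression) shape for logging; viva_tensor/ffi is an assumed import path:

import gleam/int
import gleam/list
import gleam/result
import gleam/string
import viva_tensor/ffi
import viva_tensor/sparse

// shape returns the dimensions of the original dense tensor,
// not of the compressed representation.
pub fn shape_label(tensor: ffi.SparseTensorRef) -> Result(String, String) {
  use dims <- result.try(sparse.shape(tensor))
  Ok(string.join(list.map(dims, int.to_string), "x"))
}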