viva_tensor/sparse
SparseTensor - 2:4 Sparsity
Uses cuSPARSELt to prune and compress weight matrices.
- 2:4 Structure: For every block of 4 elements, 2 must be zero.
- Compression: Reduces memory usage by ~50% in theory (1.78x measured in practice).
- Speedup: Up to 2x theoretical (660 TFLOPS); 61% measured speedup vs. dense.
Ideal for Large Language Model (LLM) weights.
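To make the 2:4 constraint concrete, here is a minimal pure-Gleam sketch (illustration only, not part of this module; no GPU involved) that checks whether a 4-element block satisfies the pattern:

import gleam/list

// Illustration only: the real pruning runs inside cuSPARSELt.
// A block satisfies 2:4 sparsity when at least 2 of its 4 values are
// zero, e.g. [0.9, -0.1, 0.05, 0.7] prunes to [0.9, 0.0, 0.0, 0.7].
pub fn is_2_4_block(block: List(Float)) -> Bool {
  list.length(block) == 4
  && list.count(block, fn(x) { x == 0.0 }) >= 2
}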
Types
Reference to a 2:4 structured sparse tensor (GPU)
pub type SparseTensor =
ffi.SparseTensorRef
Values
pub fn compression_ratio(
tensor: ffi.SparseTensorRef,
) -> Result(Float, String)
Get the actual compression ratio (dense bytes / sparse bytes)
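A usage sketch for checking the achieved ratio; viva_tensor/ffi is an assumed import path for the Ref types (it is not shown on this page):

import gleam/float
import gleam/io
import gleam/result
import viva_tensor/ffi
import viva_tensor/sparse

// Print how much smaller the compressed tensor is than its dense
// original; a value near 1.78 matches the figure quoted above.
pub fn report_ratio(tensor: ffi.SparseTensorRef) -> Result(Nil, String) {
  use ratio <- result.try(sparse.compression_ratio(tensor))
  io.println("compression: " <> float.to_string(ratio) <> "x")
  Ok(Nil)
}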
pub fn from_cuda16(
tensor: ffi.CudaTensor16Ref,
) -> Result(ffi.SparseTensorRef, String)
Create SparseTensor from CudaTensor16 (Prune + Compress)
This operation is destructive: it prunes the two smallest values in every 4-element block. The resulting sparse tensor is stored in a compressed format on the GPU.
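A usage sketch. Creating the dense CudaTensor16 is outside this module's scope, so the weights arrive as a parameter; viva_tensor/ffi is again an assumed import path:

import viva_tensor/ffi
import viva_tensor/sparse

// Prune + compress an FP16 weight tensor already resident on the GPU.
// Lossy: two values in each 4-element block are zeroed, so keep the
// dense original around if the unpruned weights are still needed.
pub fn compress_weights(
  weights: ffi.CudaTensor16Ref,
) -> Result(ffi.SparseTensorRef, String) {
  sparse.from_cuda16(weights)
}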
pub fn matmul(
a_sparse: ffi.SparseTensorRef,
b_dense: ffi.CudaTensor16Ref,
m: Int,
n: Int,
k: Int,
) -> Result(ffi.CudaTensor16Ref, String)
Sparse Matrix Multiplication (SpMM)
C = Sparse(A) @ Dense(B)
a_sparse: Compressed weight matrix (2:4 sparse)
b_dense: Dense activation matrix (FP16)
Returns dense FP16 result.
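A sketch of an LLM-style linear layer, y = W @ x, with a 2:4-compressed weight matrix; shape bookkeeping is left to the caller, and viva_tensor/ffi is an assumed import path:

import viva_tensor/ffi
import viva_tensor/sparse

// C = Sparse(A) @ Dense(B): w is the compressed [m, k] weight matrix,
// x the dense [k, n] FP16 activations; the result is dense [m, n] FP16.
pub fn linear(
  w: ffi.SparseTensorRef,
  x: ffi.CudaTensor16Ref,
  m: Int,
  n: Int,
  k: Int,
) -> Result(ffi.CudaTensor16Ref, String) {
  sparse.matmul(w, x, m, n, k)
}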
pub fn shape(
tensor: ffi.SparseTensorRef,
) -> Result(List(Int), String)
Get the shape of the original dense tensor as [Rows, Cols]
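A sketch that formats the logical (pre-compression) shape for logging; viva_tensor/ffi is an assumed import path:

import gleam/int
import gleam/list
import gleam/result
import gleam/string
import viva_tensor/ffi
import viva_tensor/sparse

// shape returns the dimensions of the original dense tensor,
// not of the compressed representation.
pub fn shape_label(tensor: ffi.SparseTensorRef) -> Result(String, String) {
  use dims <- result.try(sparse.shape(tensor))
  Ok(string.join(list.map(dims, int.to_string), "x"))
}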