Beaver.MLIR.Dialect.GPU (beaver v0.4.7)
Summary
Functions
gpu.all_reduce - Reduce values among workgroup.
gpu.alloc - GPU memory allocation operation.
gpu.barrier
gpu.binary
gpu.block_dim
gpu.block_id
gpu.cluster_block_id
gpu.cluster_dim
gpu.cluster_dim_blocks
gpu.cluster_id
gpu.create_2to4_spmat - Create sparse matrix with 2:4 sparsity operation
gpu.create_bsr - Create sparse matrix in BSR format operation
gpu.create_coo - Create sparse matrix in COO format operation
gpu.create_coo_aos - Create sparse matrix in COO format operation (AoS)
gpu.create_csc - Create sparse matrix in CSC format operation
gpu.create_csr - Create sparse matrix in CSR format operation
gpu.create_dn_tensor - Create dense tensor operation
gpu.dealloc - GPU memory deallocation operation
gpu.destroy_dn_tensor - Destroy dense tensor operation
gpu.destroy_sp_mat - Destroy sparse matrix operation
gpu.dynamic_shared_memory
gpu.func
gpu.global_id
gpu.grid_dim
gpu.host_register - Registers a memref for access from device.
gpu.host_unregister - Unregisters a memref for access from device.
gpu.lane_id
gpu.launch - GPU kernel launch operation
gpu.launch_func - Launches a function as a GPU kernel
gpu.memcpy - GPU memcpy operation
gpu.memset - GPU memset operation
gpu.module
gpu.num_subgroups
gpu.printf - Device-side printf, as in CUDA or OpenCL, for debugging
gpu.return - Terminator for GPU functions.
gpu.rotate - Rotate values within a subgroup.
gpu.sddmm - SDDMM operation
gpu.sddmm_buffer_size - Precompute buffersize for SDDMM operation
gpu.set_csr_pointers - SpGEMM get size operation
gpu.set_default_device - Set default GPU for operations after this by index
gpu.shuffle - Shuffles values within a subgroup.
gpu.spgemm_copy - SpGEMM copy operation
gpu.spgemm_create_descr - SpGEMM Create Descr operation
gpu.spgemm_destroy_descr - SpGEMM Destroy Descr operation
gpu.spgemm_work_estimation_or_compute - SpGEMM work estimation operation
gpu.spmat_get_size - SpMat get size operation
gpu.spmm - SpMM operation
gpu.spmm_buffer_size - Precompute buffersize for SpMM operation
gpu.spmv - SpMV operation
gpu.spmv_buffer_size - Precompute buffersize for SpMV operation
gpu.subgroup_broadcast - Broadcasts a value from the specific lane across subgroup
gpu.subgroup_id
gpu.subgroup_mma_compute - GPU warp synchronous matrix multiply accumulate
gpu.subgroup_mma_constant_matrix - GPU warp synchronous constant matrix
gpu.subgroup_mma_elementwise - GPU warp elementwise operation on a matrix
gpu.subgroup_mma_extract_thread_local - Extract a value from GPU warp by invocation and indices
gpu.subgroup_mma_insert_thread_local - Insert a value into GPU warp by invocation and indices
gpu.subgroup_mma_load_matrix - GPU warp synchronous matrix load
gpu.subgroup_mma_store_matrix - GPU warp synchronous matrix store
gpu.subgroup_reduce - Reduce values among subgroup.
gpu.subgroup_size
gpu.terminator
gpu.thread_id
gpu.wait - Wait for async gpu ops to complete.
gpu.warp_execute_on_lane_0 - Executes operations in the associated region on thread #0 of aSPMD program
gpu.yield - GPU yield operation
Functions
gpu.all_reduce - Reduce values among workgroup.
This op has support for result type inference.
Attributes
op- Optional,GPU_AllReduceOperationAttr, built-in reduction operations supported by gpu.allreduce.uniform- Optional,UnitAttr, unit attribute
Operands
value- Single,AnyIntegerOrFloat, Integer or Float
Results
result- Single,AnyIntegerOrFloat, Integer or Float
Description
The all_reduce op reduces the value of every work item across a local
workgroup. The result is equal for all work items of a workgroup.
For example, both
%1 = gpu.all_reduce add %0 {} : (f32) -> (f32)
%2 = gpu.all_reduce %0 {
^bb(%lhs : f32, %rhs : f32):
%sum = arith.addf %lhs, %rhs : f32
"gpu.yield"(%sum) : (f32) -> ()
} : (f32) -> (f32)compute the sum of each work item's %0 value. The first version specifies the accumulation as operation, whereas the second version specifies the accumulation as code region. The reduction operation must be one of:
- Integer types:
add,mul,minui,minsi,maxui,maxsi,and,or,xor - Floating point types:
add,mul,minnumf,maxnumf,minimumf,maximumf
If uniform flag is set either none or all work items of a workgroup
need to execute this op in convergence.
gpu.alloc - GPU memory allocation operation.
Attributes
hostShared- Optional,UnitAttr, unit attribute
Operands
asyncDependencies- Variadic,GPU_AsyncToken, variadic of async token typedynamicSizes- Variadic,Index, variadic of indexsymbolOperands- Variadic,Index, variadic of index
Results
memref- Single,AnyMemRef, memref of any type valuesasyncToken- Optional,GPU_AsyncToken, async token type
Description
The gpu.alloc operation allocates a region of memory on the GPU. It is
similar to the memref.alloc op, but supports asynchronous GPU execution.
The op does not execute before all async dependencies have finished executing.
If the async keyword is present, the op is executed asynchronously (i.e.
it does not block until the execution has finished on the device). In
that case, it also returns a !gpu.async.token.
If the host_shared keyword is present, the memory will be allocated in a
memory accessible both on host and on device.
Example:
%memref, %token = gpu.alloc async [%dep] host_shared (%width) : memref<64x?xf32, 1>
gpu.barrier
gpu.binary
gpu.block_dim
gpu.block_id
gpu.cluster_block_id
gpu.cluster_dim
gpu.cluster_dim_blocks
gpu.cluster_id
gpu.create_2to4_spmat - Create sparse matrix with 2:4 sparsity operation
Attributes
pruneFlag- Single,GPU_Prune2To4SpMatFlagAttr, pruning strategy for 2:4 sparse matrix
Operands
asyncDependencies- Variadic,GPU_AsyncToken, variadic of async token typerows- Single,Index, indexcols- Single,Index, indexmemref- Single,AnyMemRef, memref of any type values
Results
spMat- Single,GPU_SparseSpMatHandle, sparse matrix handle typeasyncToken- Optional,GPU_AsyncToken, async token type
Description
The gpu.create_2to4_spmat operation initializes a sparse matrix in dense
format with 2:4 sparsity.
The buffers must already be copied from the host to the device prior to
using this operation. The operation returns a handle to the sparse
matrix descriptor.
If the async keyword is present, the op is executed asynchronously (i.e.
it does not block until the execution has finished on the device). In
that case, it returns a !gpu.async.token in addition to the environment.
Example:
%spmat, %token = gpu.create_2to4_spmat async [%dep] {PRUNE_AND_CHECK} %rows, %cols, %mem: memref<?xf64>
gpu.create_bsr - Create sparse matrix in BSR format operation
Operands
asyncDependencies- Variadic,GPU_AsyncToken, variadic of async token typebrows- Single,Index, indexbcols- Single,Index, indexbnnz- Single,Index, indexrBlockSize- Single,Index, indexcBlockSize- Single,Index, indexbRowPos- Single,AnyMemRef, memref of any type valuesbColIdxs- Single,AnyMemRef, memref of any type valuesvalues- Single,AnyMemRef, memref of any type values
Results
spmat- Single,GPU_SparseSpMatHandle, sparse matrix handle typeasyncToken- Optional,GPU_AsyncToken, async token type
Description
The gpu.create_bsr operation initializes a sparse matrix in BSR format
with the given sizes for the matrix and blocks from the given position,
index, and values buffers. The buffers must already be copied from the
host to the device prior to using this operation. The operation returns
a handle to the sparse matrix descriptor.
The BSR format is similar to CSR, where the column indices represent
two-dimensional blocks instead of a single matrix entry. Note that this
operation (currently) only supports storage with square blocks,
i.e., rBlockSize == cBlockSize.
If the async keyword is present, the op is executed asynchronously (i.e.
it does not block until the execution has finished on the device). In
that case, it returns a !gpu.async.token in addition to the environment.
Example:
%spmat, %token = gpu.create_bsr async [%dep]
%brows, %bcols, %bnnz, %rBlockSize, %cBlockSize,
%bRowPos, %bColIdxs, %values : memref<?xindex>, memref<?xindex>, memref<?xf64>
gpu.create_coo - Create sparse matrix in COO format operation
Operands
asyncDependencies- Variadic,GPU_AsyncToken, variadic of async token typerows- Single,Index, indexcols- Single,Index, indexnnz- Single,Index, indexrowIdxs- Single,AnyMemRef, memref of any type valuescolIdxs- Single,AnyMemRef, memref of any type valuesvalues- Single,AnyMemRef, memref of any type values
Results
spmat- Single,GPU_SparseSpMatHandle, sparse matrix handle typeasyncToken- Optional,GPU_AsyncToken, async token type
Description
The gpu.create_coo operation initializes a sparse matrix in COO format
with the given sizes from the given index and values buffers. The buffers
must already be copied from the host to the device prior to using this
operation. The operation returns a handle to the sparse matrix descriptor.
Note that this operation builds the COO in SoA format.
If the async keyword is present, the op is executed asynchronously (i.e.
it does not block until the execution has finished on the device). In
that case, it returns a !gpu.async.token in addition to the environment.
Example:
%spmat, %token = gpu.create_coo async [%dep] %rows, %cols, %nnz, %rowIdx,
%colIdx, %values : memref<?xindex>, memref<?xindex>, memref<?xf64>
gpu.create_coo_aos - Create sparse matrix in COO format operation (AoS)
Operands
asyncDependencies- Variadic,GPU_AsyncToken, variadic of async token typerows- Single,Index, indexcols- Single,Index, indexnnz- Single,Index, indexidxs- Single,AnyMemRef, memref of any type valuesvalues- Single,AnyMemRef, memref of any type values
Results
spmat- Single,GPU_SparseSpMatHandle, sparse matrix handle typeasyncToken- Optional,GPU_AsyncToken, async token type
Description
The gpu.create_coo_aos operation initializes a sparse matrix in COO format
with the given sizes from the given index and values buffers. The buffers
must already be copied from the host to the device prior to using this
operation. The operation returns a handle to the sparse matrix descriptor.
Unlike the default gpu.create_coo operation, this operation builds the
COO format from a single index buffer in AoS format (note that this
feature has been deprecated in cuSparse 11.2).
If the async keyword is present, the op is executed asynchronously (i.e.
it does not block until the execution has finished on the device). In
that case, it returns a !gpu.async.token in addition to the environment.
Example:
%spmat, %token = gpu.create_coo_aos async [%dep] %rows, %cols, %nnz, %idxs,
%values : memref<?xindex>, memref<?xf64>
gpu.create_csc - Create sparse matrix in CSC format operation
Operands
asyncDependencies- Variadic,GPU_AsyncToken, variadic of async token typerows- Single,Index, indexcols- Single,Index, indexnnz- Single,Index, indexcolPos- Single,AnyMemRef, memref of any type valuesrowIdxs- Single,AnyMemRef, memref of any type valuesvalues- Single,AnyMemRef, memref of any type values
Results
spmat- Single,GPU_SparseSpMatHandle, sparse matrix handle typeasyncToken- Optional,GPU_AsyncToken, async token type
Description
The gpu.create_csc operation initializes a sparse matrix in CSC format
with the given sizes from the given position, index, and values buffers.
The buffers must already be copied from the host to the device prior to
using this operation. The operation returns a handle to the sparse
matrix descriptor.
The CSC format has exactly the same memory layout as its transpose in CSR format (and vice versa).
If the async keyword is present, the op is executed asynchronously (i.e.
it does not block until the execution has finished on the device). In
that case, it returns a !gpu.async.token in addition to the environment.
Example:
%spmat, %token = gpu.create_csc async [%dep] %rows, %cols, %nnz, %colPos,
%rowIdx, %values : memref<?xindex>, memref<?xindex>, memref<?xf64>
gpu.create_csr - Create sparse matrix in CSR format operation
Operands
asyncDependencies- Variadic,GPU_AsyncToken, variadic of async token typerows- Single,Index, indexcols- Single,Index, indexnnz- Single,Index, indexrowPos- Single,AnyMemRef, memref of any type valuescolIdxs- Single,AnyMemRef, memref of any type valuesvalues- Single,AnyMemRef, memref of any type values
Results
spmat- Single,GPU_SparseSpMatHandle, sparse matrix handle typeasyncToken- Optional,GPU_AsyncToken, async token type
Description
The gpu.create_csr operation initializes a sparse matrix in CSR format
with the given sizes from the given position, index, and values buffers.
The buffers must already be copied from the host to the device prior to
using this operation. The operation returns a handle to the sparse
matrix descriptor.
The CSR format has exactly the same memory layout as its transpose in CSC format (and vice versa).
If the async keyword is present, the op is executed asynchronously (i.e.
it does not block until the execution has finished on the device). In
that case, it returns a !gpu.async.token in addition to the environment.
Example:
%spmat, %token = gpu.create_csr async [%dep] %rows, %cols, %nnz, %rowPos,
%colIdx, %values : memref<?xindex>, memref<?xindex>, memref<?xf64>
gpu.create_dn_tensor - Create dense tensor operation
Operands
asyncDependencies- Variadic,GPU_AsyncToken, variadic of async token typememref- Single,AnyMemRef, memref of any type valuesdims- Variadic,Index, variadic of index
Results
dnTensor- Single,GPU_SparseDnTensorHandle, dense tensor handle typeasyncToken- Optional,GPU_AsyncToken, async token type
Description
The gpu.create_dn_tensor operation initializes a dense tensor from
the given values buffer and sizes. The buffer must already be copied
from the host to the device prior to using this operation. The
operation returns a handle to the dense tensor descriptor.
If the async keyword is present, the op is executed asynchronously (i.e.
it does not block until the execution has finished on the device). In
that case, it returns a !gpu.async.token in addition to the environment.
Example:
%dmat, %token = gpu.create_dn_tensor async [%dep] %mem, %dims : index, index into memref<?xf64>
gpu.dealloc - GPU memory deallocation operation
Operands
asyncDependencies- Variadic,GPU_AsyncToken, variadic of async token typememref- Single,AnyMemRef, memref of any type values
Results
asyncToken- Optional,GPU_AsyncToken, async token type
Description
The gpu.dealloc operation frees the region of memory referenced by a
memref which was originally created by the gpu.alloc operation. It is
similar to the memref.dealloc op, but supports asynchronous GPU execution.
The op does not execute before all async dependencies have finished executing.
If the async keyword is present, the op is executed asynchronously (i.e.
it does not block until the execution has finished on the device). In
that case, it returns a !gpu.async.token.
Example:
%token = gpu.dealloc async [%dep] %memref : memref<8x64xf32, 1>
gpu.destroy_dn_tensor - Destroy dense tensor operation
Operands
asyncDependencies- Variadic,GPU_AsyncToken, variadic of async token typednTensor- Single,GPU_SparseDnTensorHandle, dense tensor handle type
Results
asyncToken- Optional,GPU_AsyncToken, async token type
Description
The gpu.destroy_dn_tensor operation releases all resources of a dense
tensor represented by a handle that was previously created by a
gpu.create_dn_tensor operation.
If the async keyword is present, the op is executed asynchronously (i.e.
it does not block until the execution has finished on the device). In
that case, it returns a !gpu.async.token in addition to the environment.
Example:
%token = gpu.destroy_dn_tensor async [%dep] %dnTensor
gpu.destroy_sp_mat - Destroy sparse matrix operation
Operands
asyncDependencies- Variadic,GPU_AsyncToken, variadic of async token typespmat- Single,GPU_SparseSpMatHandle, sparse matrix handle type
Results
asyncToken- Optional,GPU_AsyncToken, async token type
Description
The gpu.destroy_sp_mat operation releases all resources of a sparse
matrix represented by a handle that was previously created by a
one of the sparse matrix creation operations.
If the async keyword is present, the op is executed asynchronously (i.e.
it does not block until the execution has finished on the device). In
that case, it returns a !gpu.async.token in addition to the environment.
Example:
%token = gpu.destroy_sp_mat async [%dep] %spmat
gpu.func
gpu.global_id
gpu.grid_dim
gpu.host_register - Registers a memref for access from device.
Operands
value- Single,AnyUnrankedMemRef, unranked.memref of any type values
Description
This op maps the provided host buffer into the device address space.
This operation may not be supported in every environment, there is not yet a way to check at runtime whether this feature is supported.
Writes from the host are guaranteed to be visible to device kernels that are launched afterwards. Writes from the device are guaranteed to be visible on the host after synchronizing with the device kernel completion.
gpu.host_unregister - Unregisters a memref for access from device.
Operands
value- Single,AnyUnrankedMemRef, unranked.memref of any type values
Description
This op unmaps the provided host buffer from the device address space.
This operation may not be supported in every environment, there is not yet a
way to check at runtime whether this feature is supported.
gpu.lane_id
gpu.launch - GPU kernel launch operation
Attributes
module- Optional,FlatSymbolRefAttr, flat symbol reference attributefunction- Optional,FlatSymbolRefAttr, flat symbol reference attribute
Operands
asyncDependencies- Variadic,GPU_AsyncToken, variadic of async token typegridSizeX- Single,Index, indexgridSizeY- Single,Index, indexgridSizeZ- Single,Index, indexblockSizeX- Single,Index, indexblockSizeY- Single,Index, indexblockSizeZ- Single,Index, indexclusterSizeX- Optional,Index, indexclusterSizeY- Optional,Index, indexclusterSizeZ- Optional,Index, indexdynamicSharedMemorySize- Optional,I32, 32-bit signless integer
Results
asyncToken- Optional,GPU_AsyncToken, async token type
Description
Launch a kernel on the specified grid of thread blocks. The body of the kernel is defined by the single region that this operation contains. The operation takes an optional list of async dependencies followed by six operands and an optional operand.
The async keyword indicates the kernel should be launched asynchronously;
the operation returns a new !gpu.async.token when the keyword is specified.
The kernel launched does not start executing until the ops producing its
async dependencies (optional operands) have completed.
The first three operands (following any async dependencies) are grid sizes
along the x,y,z dimensions and the following three are block sizes along the
x,y,z dimensions. When a lower-dimensional kernel is required, unused sizes
must be explicitly set to 1. The last operand is optional and corresponds
to the amount of dynamic shared memory a kernel's workgroup should be
allocated; when this operand is not present, a zero size is assumed.
The body region has at least twelve arguments, or eighteen if cluster dimensions are present, grouped as follows:
- three optional arguments that contain cluster identifiers along x,y,z dimensions;
- three arguments that contain block identifiers along x,y,z dimensions;
- three arguments that contain thread identifiers along x,y,z dimensions;
- operands of the
gpu.launchoperation as is (i.e. the operands for grid and block sizes). - a variadic number of Workgroup memory attributions.
- a variadic number of Private memory attributions.
The function and module attributes are optional and specifies
the kernel name and a module in which the kernel should be outlined.
Syntax:
operation ::= `gpu.launch` (`async` (`[` ssa-id-list `]`)? )?
( `clusters` `(` ssa-id-list `)` `in` ssa-reassignment )?
`blocks` `(` ssa-id-list `)` `in` ssa-reassignment
`threads` `(` ssa-id-list `)` `in` ssa-reassignment
(dynamic_shared_memory_size ssa-use)?
(`module(` symbol-ref-id `)`)?
(`function(` symbol-ref-id `)`)?
memory-attribution
region attr-dict?
ssa-reassignment ::= `(` ssa-id `=` ssa-use (`,` ssa-id `=` ssa-use)* `)`
memory-attribution ::= (`workgroup` `(` ssa-id-and-type-list `)`)?
(`private` `(` ssa-id-and-type-list `)`)?Example:
gpu.launch blocks(%bx, %by, %bz) in (%sz_bx = %0, %sz_by = %1, %sz_bz = %2)
threads(%tx, %ty, %tz) in (%sz_tx = %3, %sz_ty = %4, %sz_tz = %5) {
// Block and thread identifiers, as well as block/grid sizes are
// immediately usable inside body region.
"some_op"(%bx, %tx) : (index, index) -> ()
// Assuming %val1 is defined outside the gpu.launch region.
%42 = load %val1[%bx] : memref<?xf32, 1>
}
// Generic syntax explains how the pretty syntax maps to the IR structure.
"gpu.launch"(%cst, %cst, %c1, // Grid sizes.
%cst, %c1, %c1) // Block sizes.
{/*attributes*/}
// All sizes and identifiers have "index" size.
: (index, index, index, index, index, index) -> () {
// The operation passes block and thread identifiers, followed by grid and
// block sizes.
^bb0(%bx : index, %by : index, %bz : index,
%tx : index, %ty : index, %tz : index,
%num_bx : index, %num_by : index, %num_bz : index,
%num_tx : index, %num_ty : index, %num_tz : index)
"some_op"(%bx, %tx) : (index, index) -> ()
%3 = "memref.load"(%val1, %bx) : (memref<?xf32, 1>, index) -> f32
}
// Launch with memory attributions.
gpu.launch blocks(%bx, %by, %bz) in (%sz_bx = %0, %sz_by = %1, %sz_bz = %2)
threads(%tx, %ty, %tz) in (%sz_tx = %3, %sz_ty = %4, %sz_tz = %5)
workgroup(%workgroup: memref<32xf32, 3>)
private(%private: memref<1xf32, 5>) {
// Block and thread identifiers, as well as block/grid sizes are
// immediately usable inside body region.
"some_op"(%bx, %tx) : (index, index) -> ()
// Assuming %val1 is defined outside the gpu.launch region.
%42 = load %workgroup[%bx] : memref<32xf32, 3>
}
// Launch with clusters.
gpu.launch clusters(%cx, %cy, %cz) in (%sz_cx = %0, %sz_cy = %1, %sz_cz = %2)
blocks(%bx, %by, %bz) in (%sz_bx = %3, %sz_by = %4, %sz_bz = %5)
threads(%tx, %ty, %tz) in (%sz_tx = %6, %sz_ty = %7, %sz_tz = %8)
{
// Cluster, block and thread identifiers, as well as cluster/block/grid
// sizes are immediately usable inside body region.
"some_op"(%cx, %bx, %tx) : (index, index, index) -> ()
}
// Launch with module and function attributes.
gpu.launch blocks(%bx, %by, %bz) in (%sz_bx = %0, %sz_by = %1, %sz_bz = %2)
threads(%tx, %ty, %tz) in (%sz_tx = %3, %sz_ty = %4, %sz_tz = %5)
module(@kernel_module) function(@kernel_func) {
"some_op"(%bx, %tx) : (index, index) -> ()
%42 = load %val1[%bx] : memref<?xf32, 1>
}Rationale: using operation/block arguments gives analyses a clear way of understanding that a value has additional semantics (e.g., we will need to know what value corresponds to threadIdx.x for coalescing). We can recover these properties by analyzing the operations producing values, but it is easier just to have that information by construction.
gpu.launch_func - Launches a function as a GPU kernel
Attributes
kernel- Single,SymbolRefAttr, symbol reference attribute
Operands
asyncDependencies- Variadic,GPU_AsyncToken, variadic of async token typegridSizeX- Single,LaunchIndx, index or 32-bit signless integer or 64-bit signless integergridSizeY- Single,LaunchIndx, index or 32-bit signless integer or 64-bit signless integergridSizeZ- Single,LaunchIndx, index or 32-bit signless integer or 64-bit signless integerblockSizeX- Single,LaunchIndx, index or 32-bit signless integer or 64-bit signless integerblockSizeY- Single,LaunchIndx, index or 32-bit signless integer or 64-bit signless integerblockSizeZ- Single,LaunchIndx, index or 32-bit signless integer or 64-bit signless integerclusterSizeX- Optional,LaunchIndx, index or 32-bit signless integer or 64-bit signless integerclusterSizeY- Optional,LaunchIndx, index or 32-bit signless integer or 64-bit signless integerclusterSizeZ- Optional,LaunchIndx, index or 32-bit signless integer or 64-bit signless integerdynamicSharedMemorySize- Optional,I32, 32-bit signless integerkernelOperands- Variadic,AnyType, variadic of any typeasyncObject- Optional,AnyType, any type
Results
asyncToken- Optional,GPU_AsyncToken, async token type
Description
Launch a kernel function on the specified grid of thread blocks.
gpu.launch operations are lowered to gpu.launch_func operations by
outlining the kernel body into a function in a dedicated module, which
reflects the separate compilation process. The kernel function is required
to have the gpu.kernel attribute. The module containing the kernel
function is required to be a gpu.module. And finally, the module containing
the kernel module (which thus cannot be the top-level module) is required
to have the gpu.container_module attribute. The gpu.launch_func
operation has a symbol attribute named kernel to identify the fully
specified kernel function to launch (both the gpu.module and func).
The gpu.launch_func supports async dependencies: the kernel does not start
executing until the ops producing those async dependencies have completed.
By the default, the host implicitly blocks until kernel execution has
completed. If the async keyword is present, the host does not block but
instead a !gpu.async.token is returned. Other async GPU ops can take this
token as dependency.
The operation requires at least the grid and block sizes along the x,y,z
dimensions as arguments. When a lower-dimensional kernel is required,
unused sizes must be explicitly set to 1.
The remaining operands are optional. The first optional operand corresponds to the amount of dynamic shared memory a kernel's workgroup should be allocated; when this operand is not present, a zero size is assumed.
The remaining operands if present are passed as arguments to the kernel function.
The gpu.launch_func also supports kernel launching with clusters if
supported by the target architecture. The cluster size can be set by
clusterSizeX, clusterSizeY, and clusterSizeZ arguments. When these
arguments are present, the Op launches a kernel that clusters the given
thread blocks. This feature is exclusive to certain architectures.
Example:
module attributes {gpu.container_module} {
// This module creates a separate compilation unit for the GPU compiler.
gpu.module @kernels {
func.func @kernel_1(%arg0 : f32, %arg1 : memref<?xf32, 1>)
attributes { nvvm.kernel = true } {
// Operations that produce block/thread IDs and dimensions are
// injected when outlining the `gpu.launch` body to a function called
// by `gpu.launch_func`.
%tIdX = gpu.thread_id x
%tIdY = gpu.thread_id y
%tIdZ = gpu.thread_id z
%bDimX = gpu.block_dim x
%bDimY = gpu.block_dim y
%bDimZ = gpu.block_dim z
%bIdX = gpu.block_id x
%bIdY = gpu.block_id y
%bIdZ = gpu.block_id z
%gDimX = gpu.grid_dim x
%gDimY = gpu.grid_dim y
%gDimZ = gpu.grid_dim z
// (Optional) Cluster size only for support architectures
%cIdX = gpu.cluster_id x
%cIdY = gpu.cluster_id y
%cIdZ = gpu.cluster_id z
%cDimX = gpu.cluster_dim x
%cDimY = gpu.cluster_dim y
%cDimZ = gpu.cluster_dim z
"some_op"(%bx, %tx) : (index, index) -> ()
%42 = load %arg1[%bx] : memref<?xf32, 1>
}
}
%t0 = gpu.wait async
gpu.launch_func
async // (Optional) Don't block host, return token.
[%t0] // (Optional) Execute only after %t0 has completed.
@kernels::@kernel_1 // Kernel function.
clusters in (%cst, %cst, %cst) // (Optional) Cluster size only for support architectures.
blocks in (%cst, %cst, %cst) // Grid size.
threads in (%cst, %cst, %cst) // Block size.
dynamic_shared_memory_size %s // (Optional) Amount of dynamic shared
// memory to allocate for a workgroup.
args(%arg0 : f32, // (Optional) Kernel arguments.
%arg1 : memref<?xf32, 1>)
}
gpu.memcpy - GPU memcpy operation
Operands
asyncDependencies- Variadic,GPU_AsyncToken, variadic of async token typedst- Single,AnyMemRef, memref of any type valuessrc- Single,AnyMemRef, memref of any type values
Results
asyncToken- Optional,GPU_AsyncToken, async token type
Description
The gpu.memcpy operation copies the content of one memref to another.
The op does not execute before all async dependencies have finished executing.
If the async keyword is present, the op is executed asynchronously (i.e.
it does not block until the execution has finished on the device). In
that case, it returns a !gpu.async.token.
Example:
%token = gpu.memcpy async [%dep] %dst, %src : memref<?xf32, 1>, memref<?xf32>
gpu.memset - GPU memset operation
Operands
asyncDependencies- Variadic,GPU_AsyncToken, variadic of async token typedst- Single,AnyMemRef, memref of any type valuesvalue- Single,AnyType, any type
Results
asyncToken- Optional,GPU_AsyncToken, async token type
Description
The gpu.memset operation sets the content of memref to a scalar value.
The op does not execute before all async dependencies have finished executing.
If the async keyword is present, the op is executed asynchronously (i.e.
it does not block until the execution has finished on the device). In
that case, it returns a !gpu.async.token.
Example:
%token = gpu.memset async [%dep] %dst, %value : memref<?xf32, 1>, f32
gpu.module
gpu.num_subgroups
gpu.printf - Device-side printf, as in CUDA or OpenCL, for debugging
Attributes
format- Single,StrAttr, string attribute
Operands
args- Variadic, anonymous/composite constraint, variadic of integer or index or floating-point
Description
gpu.printf takes a literal format string format and an arbitrary number of
scalar arguments that should be printed.
The format string is a C-style printf string, subject to any restrictions imposed by one's target platform.
gpu.return - Terminator for GPU functions.
Operands
operands- Variadic,AnyType, variadic of any type
Description
A terminator operation for regions that appear in the body of gpu.func
functions. The operands to the gpu.return are the result values returned
by an invocation of the gpu.func.
gpu.rotate - Rotate values within a subgroup.
This op has support for result type inference.
Attributes
offset- Single,I32Attr, 32-bit signless integer attribute whose minimum value is 0width- Single,I32Attr, 32-bit signless integer attribute whose value is a power of two > 0
Operands
value- Single,AnyIntegerOrFloatOr1DVector, Integer or Float or fixed-length vector of Integer or Float values of ranks 1
Results
rotateResult- Single,AnyIntegerOrFloatOr1DVector, Integer or Float or fixed-length vector of Integer or Float values of ranks 1valid- Single,I1, 1-bit signless integer
Description
The "rotate" op moves values across lanes in a subgroup (a.k.a., local
invocations) within the same subgroup. The width attribute specifies the
number of lanes that participate in the rotation, and must be uniform across
all participating lanes. Further, the first width lanes of the subgroup
must be active.
width must be a power of two, and offset must be in the range
[0, width).
Return the rotateResult of the invocation whose id within the group is
calculated as follows:
Invocation ID = ((LaneId + offset) & (width - 1)) + (LaneId & ~(width - 1))Returns the rotateResult and true if the current lane id is smaller than
width, and poison value and false otherwise.
example:
%1, %2 = gpu.rotate %0, 1, 16 : f32For lane k, returns the value from lane (k + cst1) % width.
gpu.sddmm - SDDMM operation
Attributes
modeA- Single,GPU_TransposeModeAttr, transpose mode of sparse matrix supported by sparse tensor opsmodeB- Single,GPU_TransposeModeAttr, transpose mode of sparse matrix supported by sparse tensor opscomputeType- Single,TypeAttr, any type attribute
Operands
asyncDependencies- Variadic,GPU_AsyncToken, variadic of async token typednmatA- Single,GPU_SparseDnTensorHandle, dense tensor handle typednmatB- Single,GPU_SparseDnTensorHandle, dense tensor handle typespmatC- Single,GPU_SparseSpMatHandle, sparse matrix handle typebuffer- Single,AnyMemRef, memref of any type values
Results
asyncToken- Optional,GPU_AsyncToken, async token type
Description
The gpu.sddmm operation performs the SDDMM operation on the given sparse and
dense matrices, and buffer. The operation expects handles returned by previous
sparse operations to construct an environment and the operands for SDDMM. The
buffer must have been allocated on the device.
If the async keyword is present, the op is executed asynchronously (i.e.
it does not block until the execution has finished on the device). In
that case, it returns a !gpu.async.token in addition to the environment.
Example:
%token = gpu.sddmm async [%dep] %dnmatA{TRANSPOSE}, %dnmatB{TRANSPOSE}, %spmatC, %buffer into f32The matrix arguments can also be associated with one of the following operators: NON_TRANSPOSE, TRANSPOSE, CONJUGATE_TRANSPOSE. The default value is NON_TRANSPOSE.
gpu.sddmm_buffer_size - Precompute buffersize for SDDMM operation
Attributes
modeA- Single,GPU_TransposeModeAttr, transpose mode of sparse matrix supported by sparse tensor opsmodeB- Single,GPU_TransposeModeAttr, transpose mode of sparse matrix supported by sparse tensor opscomputeType- Single,TypeAttr, any type attribute
Operands
asyncDependencies- Variadic,GPU_AsyncToken, variadic of async token typednmatA- Single,GPU_SparseDnTensorHandle, dense tensor handle typednmatB- Single,GPU_SparseDnTensorHandle, dense tensor handle typespmatC- Single,GPU_SparseSpMatHandle, sparse matrix handle type
Results
bufferSz- Single,Index, indexasyncToken- Optional,GPU_AsyncToken, async token type
Description
The gpu.sddmm_buffer_size operation returns the buffer size required
to perform the SDDMM operation on the given sparse and dense matrices.
The operation expects handles returned by previous sparse operations
to construct an environment and the operands for SDDMM.
If the async keyword is present, the op is executed asynchronously (i.e.
it does not block until the execution has finished on the device). In
that case, it returns a !gpu.async.token in addition to the environment.
Example:
%buffersz, %token = gpu.sddmm_buffer_size async [%dep] %dnmatA{TRANSPOSE}, %dnmatB{TRANSPOSE}, %spmatC into f32The matrix arguments can also be associated with one of the following operators: NON_TRANSPOSE, TRANSPOSE, CONJUGATE_TRANSPOSE. The default value is NON_TRANSPOSE.
gpu.set_csr_pointers - SpGEMM get size operation
Operands
asyncDependencies- Variadic,GPU_AsyncToken, variadic of async token typespmat- Single,GPU_SparseSpMatHandle, sparse matrix handle typepositions- Single,AnyMemRef, memref of any type valuescoordinates- Single,AnyMemRef, memref of any type valuesvalues- Single,AnyMemRef, memref of any type values
Results
asyncToken- Optional,GPU_AsyncToken, async token type
Description
The gpu.set_csr_pointers assigns the given positions, coordinates,
and values buffer that reside on the device directly to the given sparse
matrix descriptor in csr format.
If the async keyword is present, the op is executed asynchronously (i.e.
it does not block until the execution has finished on the device). In
that case, it returns a !gpu.async.token in addition to the environment.
Example:
%token = gpu.set_csr_pointers async [%dep] %positions, %coordinates, %values
: memref<?xf32>, memref<?xindex>, memref<?xindex>
gpu.set_default_device - Set default GPU for operations after this by index
Operands
devIndex- Single,I32, 32-bit signless integer
Description
Operation that sets the current default GPU, using a zero-based index into the set of GPUs on the system. The default GPU setting may be thread-local.
gpu.shuffle - Shuffles values within a subgroup.
This op has support for result type inference.
Attributes
mode- Single,GPU_ShuffleModeAttr, Indexing modes supported by gpu.shuffle.
Operands
value- Single,AnyIntegerOrFloatOr1DVector, Integer or Float or fixed-length vector of Integer or Float values of ranks 1offset- Single,I32, 32-bit signless integerwidth- Single,I32, 32-bit signless integer
Results
shuffleResult- Single,AnyIntegerOrFloatOr1DVector, Integer or Float or fixed-length vector of Integer or Float values of ranks 1valid- Single,I1, 1-bit signless integer
Description
The "shuffle" op moves values across lanes in a subgroup (a.k.a., local
invocation) within the same subgroup. The width argument specifies the
number of lanes that participate in the shuffle, and must be uniform
across all lanes. Further, the first width lanes of the subgroup must
be active.
The intepretation of the offset arguments depends on the selected
mode.
Returns the shuffleResult and true if the current lane id is smaller
than width, and an unspecified value and false otherwise.
xor example:
%1, %2 = gpu.shuffle xor %0, %offset, %width : f32For lane k, returns the value %0 from lane k ^ offset. Every lane
trades value with exactly one other lane.
down example:
%cst1 = arith.constant 1 : i32
%3, %4 = gpu.shuffle down %0, %cst1, %width : f32For lane k, returns the value from lane (k + cst1). If (k + cst1) is
bigger than or equal to width, the value is poison and valid is false.
up example:
%cst1 = arith.constant 1 : i32
%5, %6 = gpu.shuffle up %0, %cst1, %width : f32For lane k, returns the value from lane (k - cst1). If (k - cst1) is
smaller than 0, the value is poison and valid is false.
idx example:
%cst0 = arith.constant 0 : i32
%7, %8 = gpu.shuffle idx %0, %cst0, %width : f32Broadcasts the value from lane 0 to all lanes.
gpu.spgemm_copy - SpGEMM copy operation
Attributes
modeA- Single,GPU_TransposeModeAttr, transpose mode of sparse matrix supported by sparse tensor opsmodeB- Single,GPU_TransposeModeAttr, transpose mode of sparse matrix supported by sparse tensor opscomputeType- Single,TypeAttr, any type attribute
Operands
asyncDependencies- Variadic,GPU_AsyncToken, variadic of async token typedesc- Single,GPU_SparseSpGEMMOpHandle, SpGEMM operation handle typespmatA- Single,GPU_SparseSpMatHandle, sparse matrix handle typespmatB- Single,GPU_SparseSpMatHandle, sparse matrix handle typespmatC- Single,GPU_SparseSpMatHandle, sparse matrix handle type
Results
asyncToken- Optional,GPU_AsyncToken, async token type
Description
The gpu.spgemm_copy operation copies the sparse matrix result of
a SpGEMM computation.
If the async keyword is present, the op is executed asynchronously (i.e.
it does not block until the execution has finished on the device). In
that case, it returns a !gpu.async.token in addition to the environment.
Example:
gpu.spgemm_copy %spmatA, %spmatB, %spmatC, %spgemmDesc: f32The matrix arguments can also be associated with one of the following operators: NON_TRANSPOSE, TRANSPOSE, CONJUGATE_TRANSPOSE. The default value is NON_TRANSPOSE.
gpu.spgemm_create_descr - SpGEMM Create Descr operation
Operands
asyncDependencies- Variadic,GPU_AsyncToken, variadic of async token type
Results
desc- Single,GPU_SparseSpGEMMOpHandle, SpGEMM operation handle typeasyncToken- Optional,GPU_AsyncToken, async token type
Description
The gpu.spgemm_create_descr creates a descriptor for the SpGEMM operation.
The descriptor describes the SpGEMM operation and stores the internal data
throughout the computation. It needs to be passed as an argument to
spgemm_* operations.
If the async keyword is present, the op is executed asynchronously (i.e.
it does not block until the execution has finished on the device). In
that case, it returns a !gpu.async.token in addition to the environment.
Example:
%desc, %token = gpu.spgemm_create_descr async [%dep]
gpu.spgemm_destroy_descr - SpGEMM Destroy Descr operation
Operands
asyncDependencies- Variadic,GPU_AsyncToken, variadic of async token typedesc- Single,GPU_SparseSpGEMMOpHandle, SpGEMM operation handle type
Results
asyncToken- Optional,GPU_AsyncToken, async token type
Description
The gpu.spgemm_destroy_descr destroys the SpGEMM operation descriptor.
If the async keyword is present, the op is executed asynchronously (i.e.
it does not block until the execution has finished on the device). In
that case, it returns a !gpu.async.token in addition to the environment.
Example:
%token = gpu.spgemm_destroy_descr async [%dep] %desc
gpu.spgemm_work_estimation_or_compute - SpGEMM work estimation operation
Attributes
modeA- Single,GPU_TransposeModeAttr, transpose mode of sparse matrix supported by sparse tensor opsmodeB- Single,GPU_TransposeModeAttr, transpose mode of sparse matrix supported by sparse tensor opscomputeType- Single,TypeAttr, any type attributekind- Single,GPU_SpGEMMWorkEstimationOrComputeKindAttr, choose whether spgemm_work_estimation_or_compute does work estimation or compute
Operands
asyncDependencies- Variadic,GPU_AsyncToken, variadic of async token typedesc- Single,GPU_SparseSpGEMMOpHandle, SpGEMM operation handle typespmatA- Single,GPU_SparseSpMatHandle, sparse matrix handle typespmatB- Single,GPU_SparseSpMatHandle, sparse matrix handle typespmatC- Single,GPU_SparseSpMatHandle, sparse matrix handle typebufferSz- Single,Index, indexbuffer- Single,AnyMemRef, memref of any type values
Results
bufferSzNew- Single,Index, indexasyncToken- Optional,GPU_AsyncToken, async token type
Description
The gpu.spgemm_work_estimation_or_compute is used to call
cusparseSpGEMM_workEstimation or cusparseSpGEMM_compute. Both of them are
for both determining the buffer size and performing the actual computation.
The operation expects handles returned by previous sparse operations to
construct an environment and the operands for SpGEMM.
The buffer must have been allocated on the device.
C' = alpha op(A) op(B) + beta * C
If the async keyword is present, the op is executed asynchronously (i.e.
it does not block until the execution has finished on the device). In
that case, it returns a !gpu.async.token in addition to the environment.
Example:
%bufferSz, %token = gpu.spgemm_work_estimation_or_compute async [%dep] {COMPUTE}
%desc, %spmatA{NON_TRANSPOSE}, %spmatB{NON_TRANSPOSE},
%spmatC, %spgemmDesc, %c0, %alloc: f32 into
memref<0xi8>The matrix arguments can also be associated with one of the following operators: NON_TRANSPOSE, TRANSPOSE, CONJUGATE_TRANSPOSE. The default value is NON_TRANSPOSE.
gpu.spmat_get_size - SpMat get size operation
Operands
asyncDependencies- Variadic,GPU_AsyncToken, variadic of async token typespmat- Single,GPU_SparseSpMatHandle, sparse matrix handle type
Results
rows- Single,Index, indexcols- Single,Index, indexnnz- Single,Index, indexasyncToken- Optional,GPU_AsyncToken, async token type
Description
The gpu.spmat_get_size operation retrieves the number of rows, number of
columns, and number of non-zero elements of a sparse matrix.
If the async keyword is present, the op is executed asynchronously (i.e.
it does not block until the execution has finished on the device). In
that case, it returns a !gpu.async.token in addition to the environment.
Example:
%rows, %cols, %nnz, %token = gpu.spmat_get_size async [%dep] %spmatC
gpu.spmm - SpMM operation
Attributes
modeA- Single,GPU_TransposeModeAttr, transpose mode of sparse matrix supported by sparse tensor opsmodeB- Single,GPU_TransposeModeAttr, transpose mode of sparse matrix supported by sparse tensor opscomputeType- Single,TypeAttr, any type attribute
Operands
asyncDependencies- Variadic,GPU_AsyncToken, variadic of async token typespmatA- Single,GPU_SparseSpMatHandle, sparse matrix handle typednmatB- Single,GPU_SparseDnTensorHandle, dense tensor handle typednmatC- Single,GPU_SparseDnTensorHandle, dense tensor handle typebuffers- Variadic,AnyMemRef, variadic of memref of any type values
Results
asyncToken- Optional,GPU_AsyncToken, async token type
Description
The gpu.spmm operation performs the SpMM operation on the given sparse and
dense matrix, and buffer. The operation expects handles returned by previous
sparse operations to construct an environment and the operands for SpMM. The
buffer must have been allocated on the device.
If the async keyword is present, the op is executed asynchronously (i.e.
it does not block until the execution has finished on the device). In
that case, it returns a !gpu.async.token in addition to the environment.
The matrix arguments can also be associated with one of the following operators: NON_TRANSPOSE, TRANSPOSE, CONJUGATE_TRANSPOSE. The default value is NON_TRANSPOSE.
Example:
%token = gpu.spmm async [%dep] %spmatA{TRANSPOSE}, %dnmatB{TRANSPOSE}, %dnmatC, %buffers : type($buffers) into f32
gpu.spmm_buffer_size - Precompute buffersize for SpMM operation
Attributes
modeA- Single,GPU_TransposeModeAttr, transpose mode of sparse matrix supported by sparse tensor opsmodeB- Single,GPU_TransposeModeAttr, transpose mode of sparse matrix supported by sparse tensor opscomputeType- Single,TypeAttr, any type attribute
Operands
asyncDependencies- Variadic,GPU_AsyncToken, variadic of async token typespmatA- Single,GPU_SparseSpMatHandle, sparse matrix handle typednmatB- Single,GPU_SparseDnTensorHandle, dense tensor handle typednmatC- Single,GPU_SparseDnTensorHandle, dense tensor handle type
Results
bufferSzs- Variadic,Index, variadic of indexasyncToken- Optional,GPU_AsyncToken, async token type
Description
The gpu.spmm_buffer_size operation returns the buffer size required
to perform the SpMM operation on the given sparse and dense matrix.
The operation expects handles returned by previous sparse operations
to construct an environment and the operands for SpMM.
If the async keyword is present, the op is executed asynchronously (i.e.
it does not block until the execution has finished on the device). In
that case, it returns a !gpu.async.token in addition to the environment.
The matrix arguments can also be associated with one of the following operators: NON_TRANSPOSE, TRANSPOSE, CONJUGATE_TRANSPOSE. The default value is NON_TRANSPOSE.
Example:
%bufferszs, %token = gpu.spmm_buffer_size async [%dep] %spmatA{TRANSPOSE}, %dnmatB{TRANSPOSE}, %dnmatC : i64 into f32
gpu.spmv - SpMV operation
Attributes
modeA- Single,GPU_TransposeModeAttr, transpose mode of sparse matrix supported by sparse tensor opscomputeType- Single,TypeAttr, any type attribute
Operands
asyncDependencies- Variadic,GPU_AsyncToken, variadic of async token typespmatA- Single,GPU_SparseSpMatHandle, sparse matrix handle typednX- Single,GPU_SparseDnTensorHandle, dense tensor handle typednY- Single,GPU_SparseDnTensorHandle, dense tensor handle typebuffer- Single,AnyMemRef, memref of any type values
Results
asyncToken- Optional,GPU_AsyncToken, async token type
Description
The gpu.spmv operation performs the SpMV operation on the given sparse matrix,
dense vectors, and buffer. The operation expects handles returned by previous
sparse operations to construct an environment and the operands for SpMV. The
buffer must have been allocated on the device.
If the async keyword is present, the op is executed asynchronously (i.e.
it does not block until the execution has finished on the device). In
that case, it returns a !gpu.async.token in addition to the environment.
The matrix arguments can also be associated with one of the following operators: NON_TRANSPOSE, TRANSPOSE, CONJUGATE_TRANSPOSE. The default value is NON_TRANSPOSE.
Example:
%token = gpu.spmv async [%dep] %spmatA{TRANSPOSE}, %dnX, %dnY : memref<?xf64> into bf16
gpu.spmv_buffer_size - Precompute buffersize for SpMV operation
Attributes
modeA- Single,GPU_TransposeModeAttr, transpose mode of sparse matrix supported by sparse tensor opscomputeType- Single,TypeAttr, any type attribute
Operands
asyncDependencies- Variadic,GPU_AsyncToken, variadic of async token typespmatA- Single,GPU_SparseSpMatHandle, sparse matrix handle typednX- Single,GPU_SparseDnTensorHandle, dense tensor handle typednY- Single,GPU_SparseDnTensorHandle, dense tensor handle type
Results
bufferSz- Single,Index, indexasyncToken- Optional,GPU_AsyncToken, async token type
Description
The gpu.spmv_buffer_size operation returns the buffer size required
to perform the SpMV operation on the given sparse matrix and dense vectors.
The operation expects handles returned by previous sparse operations
to construct an environment and the operands for SpMV.
If the async keyword is present, the op is executed asynchronously (i.e.
it does not block until the execution has finished on the device). In
that case, it returns a !gpu.async.token in addition to the environment.
The matrix arguments can also be associated with one of the following operators: NON_TRANSPOSE, TRANSPOSE, CONJUGATE_TRANSPOSE. The default value is NON_TRANSPOSE.
Example:
%buffersz, %token = gpu.spmv_buffer_size async [%dep] %spmatA{TRANSPOSE}, %dnX, %dnY into f32
gpu.subgroup_broadcast - Broadcasts a value from the specific lane across subgroup
This op has support for result type inference.
Attributes
broadcast_type- Single,GPU_BroadcastTypeAttr, a lane to broadcast from
Operands
src- Single,AnyType, any typelane- Optional,I32, 32-bit signless integer
Results
result- Single,AnyType, any type
Description
Broadcasts a value from one lane to all active lanes in a subgroup. The result is guaranteed to be uniform across the active lanes in subgroup.
The possible broadcast types are:
first_active_lane- broadcasts the value from the first active lane in the subgroup.specific_lane- broadcasts from the specified lane. The lane index must be uniform and within the subgroup size. The result is poison if the lane index is invalid, non subgroup-uniform, or if the source lane is not active.
gpu.subgroup_id
gpu.subgroup_mma_compute - GPU warp synchronous matrix multiply accumulate
This op has support for result type inference.
Attributes
a_transpose- Optional,UnitAttr, unit attributeb_transpose- Optional,UnitAttr, unit attribute
Operands
opA- Single, anonymous/composite constraint, gpu.mma_matrix of 8-bit signed integer or 8-bit unsigned integer or 16-bit float or 32-bit float valuesopB- Single, anonymous/composite constraint, gpu.mma_matrix of 8-bit signed integer or 8-bit unsigned integer or 16-bit float or 32-bit float valuesopC- Single, anonymous/composite constraint, gpu.mma_matrix of 32-bit signless integer or 16-bit float or 32-bit float values
Results
res- Single,GPU_MMAMatrix, MMAMatrix type
Description
The gpu.subgroup_mma_compute operation performs a matrix-multiply accumulate (mma)
operation using all the threads in a subgroup.
This operation takes three !gpu.mma_matrixs as arguments: these hold A,
B and Coperands for the mma operation. The operation performed is represented
as C += A * B. The op returns a !gpu.mma_matrix which contains the result of
the operation held by all threads in a subgroup. a_transpose or
b_transpose if present, signify that the respective operand was loaded in a
transposed manner. The transpose operands are required to map to correct
underlying intrisics but they currently do not seem to affect correctness
even if they are absent given that the operands were loaded correctly using
the transpose attribute in gpu.subgroup_mma_load_matrix op.
For integer types, the A and B matrices carry their signedness with their
types. The accumulator type is expected to be signless and imply a signed integer
with a greater width than the other two operands.
This op is meant to be used along with gpu.subgroup_mma_store_matrix and
gpu.subgroup_mma_load_matrix ops.
Example:
%D = gpu.subgroup_mma_compute_matrix %A, %B, %C :
!gpu.mma_matrix<16x16xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp">>
-> !gpu.mma_matrix<16x16xf16, "COp">
gpu.subgroup_mma_constant_matrix - GPU warp synchronous constant matrix
Operands
value- Single, anonymous/composite constraint, 8-bit signed integer or 8-bit unsigned integer or 32-bit signless integer or 16-bit float or 32-bit float
Results
res- Single,GPU_MMAMatrix, MMAMatrix type
Description
The gpu.subgroup_mma_constant_matrix creates a !gpu.mma_matrix with
constant elements.
The operation takes a scalar input and return a !gpu.mma_matrix where
each element of is equal to the operand constant. The destination
mma_matrix type must have elememt type equal to the constant type. Since
the layout of !gpu.mma_matrix is opaque this only support setting all the
elements to the same value.
This op is meant to be used along with gpu.subgroup_mma_compute.
Example:
%0 = gpu.subgroup_mma_constant_matrix %a :
!gpu.mma_matrix<16x16xf16, "AOp">
%1 = gpu.subgroup_mma_constant_matrix %b :
!gpu.mma_matrix<16x16xf32, "COp">
gpu.subgroup_mma_elementwise - GPU warp elementwise operation on a matrix
Attributes
opType- Single,MMAElementWiseAttr, elementwise operation to apply to mma matrix
Operands
args- Variadic,GPU_MMAMatrix, variadic of MMAMatrix type
Results
res- Single,GPU_MMAMatrix, MMAMatrix type
Description
The gpu.subgroup_mma_elementwise takes !gpu.mma_matrix inputs and
compute a new !gpu.mma_matrix by applying an elementwise operation to each
element.
Since the operation is elementwise and the matrix type must match, the matrix elements are processed independently of the matrix layout.
This op is meant to be used along with gpu.subgroup_mma_compute.
Example:
%0 = %A, %B { opType = "ADD" } :
(!gpu.mma_matrix<16x16xf16, "COp">, !gpu.mma_matrix<16x16xf16, "COp">)
-> !gpu.mma_matrix<16x16xf16, "COp">
gpu.subgroup_mma_extract_thread_local - Extract a value from GPU warp by invocation and indices
This op has support for result type inference.
Operands
matrix- Single,GPU_MMAMatrix, MMAMatrix typeindices- Variadic,Index, variadic of index
Results
res- Single,AnyIntegerOrFloat, Integer or Float
Description
The gpu.subgroup_mma_extract_thread_local operation extracts a value from !gpu.mma_matrix
that is stored at subgroup level.
This operation takes !gpu.mma_matrix as its first operand. It is the source
matrix across a subgroup. The op returns a scalar value stored in the invocation
in the subgroup.
Since matrix is packed into the the threads within a subgroup, indices are
the indices into the values stored by each thread. That is, an index of 0 (or [0, 0])
does not necessarily refer to the first element of the matrix, but the first element
that a particular thread holds.
The mapping of matrix elements to threads is not defined by this operation and may
not be defined by some lowerings (such as the lowering to SPIR-V). However, if the
size of the subgroup is S, then subgroup_mma_extract_thread_local at each index in
[0, (M * N) / S) will have the entire matrix extracted across the subgroup.
Example:
%c0 = arith.constant 0 : index
%val = gpu.subgroup_mma_extract_thread_local %m[%c0] : !gpu.mma_matrix<16x16xf32, "AOp"> -> f32
gpu.subgroup_mma_insert_thread_local - Insert a value into GPU warp by invocation and indices
Operands
value- Single,AnyIntegerOrFloat, Integer or Floatmatrix- Single,GPU_MMAMatrix, MMAMatrix typeindices- Variadic,Index, variadic of index
Results
res- Single,GPU_MMAMatrix, MMAMatrix type
Description
The gpu.subgroup_mma_insert_thread_local operation inserts a value to !gpu.mma_matrix
that is stored at subgroup level.
This operation takes scalar value as its first operand and !gpu.mma_matrix
as its second operand. The op inserts the scalar value to the matrix.
Since matrix is packed into the the threads within a subgroup, indices are
the indices into the values stored by each thread. That is, an index of 0 (or [0, 0])
does not necessarily refer to the first element of the matrix, but the first element
that a particular thread holds.
The mapping of matrix elements to threads is not defined by this operation and may
not be defined by some lowerings (such as the lowering to SPIR-V). However, if the
size of the subgroup is S, then subgroup_mma_insert_thread_local at each index in
[0, (M * N) / S) will have the entire matrix inserted across the subgroup.
The op returns !gpu.mma_matrix with the updated value.
Example:
%c0 = arith.constant 0 : index
%s0 = gpu.subgroup_mma_insert_thread_local %val, %m[%c0] : f16, !gpu.mma_matrix<16x16xf16, "COp">
-> !gpu.mma_matrix<16x16xf16, "COp">
gpu.subgroup_mma_load_matrix - GPU warp synchronous matrix load
Attributes
leadDimension- Single,IndexAttr, index attributetranspose- Optional,UnitAttr, unit attribute
Operands
srcMemref- Single,GPU_MMAMemRef, memref of 8-bit signless integer or 32-bit signless integer or 16-bit float or 32-bit float or vector of 8-bit signless integer or 32-bit signless integer or 16-bit float or 32-bit float values of ranks 1 valuesindices- Variadic,Index, variadic of index
Results
res- Single,GPU_MMAMatrix, MMAMatrix type
Description
The gpu.subgroup_mma_load_matrix operation loads a matrix collectively
using all the threads in a subgroup.
This operation takes a memref as its first operand: it is the source matrix
from which data is to be loaded. The op returns a !gpu.mma_matrix. The
source memref can be in global memory or shared memory. The load address is
determined using indices. The matrix being loaded into is the result. The
leadDimension attribute specifies the leading dimension size of the source
matrix which eventually allows the lowering to determine the size of each
row. If the transpose attribute is present then the op does a transposed load.
For integer types, the resulting !gpu.mma_matrix type needs to specify the
signedness of the data if the matrix type is an A or B operand for
gpu.subgroup_mma_compute.
This op is often meant to be used along with gpu.subgroup_mma_store_matrix and
gpu.subgroup_mma_compute.
Example:
%0 = gpu.subgroup_mma_load_matrix src[%i,%j] : {leadDimension = 32 : i32}
: memref<32x32xf16, 3>, !gpu.mma_matrix<16x16xf16, "AOp">
gpu.subgroup_mma_store_matrix - GPU warp synchronous matrix store
Attributes
leadDimension- Single,IndexAttr, index attributetranspose- Optional,UnitAttr, unit attribute
Operands
src- Single, anonymous/composite constraint, scalable vector of 32-bit signless integer values of length 4dstMemref- Single,GPU_MMAMemRef, memref of 8-bit signless integer or 32-bit signless integer or 16-bit float or 32-bit float or vector of 8-bit signless integer or 32-bit signless integer or 16-bit float or 32-bit float values of ranks 1 valuesindices- Variadic,Index, variadic of index
Description
The gpu.subgroup_mma_store_matrix operation stores a matrix collectively
using all the threads in a subgroup.
This operation takes a !gpu.mma_matrix and a memref as operands.
!gpu.mma_matrix is the source value containing the data to be stored into the
destination memref which can be in global or shared memory. The store address
is determined using the indices provided. The leadDimension attribute
specifies the leading dimension of the destination matrix. If the
transpose attribute is present then the op does a transposed store.
This op is often meant to be used along with gpu.subgroup_mma_load_matrix and
gpu.subgroup_mma_compute.
Example:
gpu.subgroup_mma_store_matrix %D, %sg[%i,%j] : { leadDimension = 32 : i32}
: !gpu.mma_matrix<16x16xf16, "COp">, memref<32x32xf16, 3>
gpu.subgroup_reduce - Reduce values among subgroup.
This op has support for result type inference.
Attributes
op- Single,GPU_AllReduceOperationAttr, built-in reduction operations supported by gpu.allreduce.uniform- Optional,UnitAttr, unit attributecluster_size- Optional,I32Attr, 32-bit signless integer attributecluster_stride- Single,I32Attr, 32-bit signless integer attribute
Operands
value- Single,AnyIntegerOrFloatOr1DVector, Integer or Float or fixed-length vector of Integer or Float values of ranks 1
Results
result- Single,AnyIntegerOrFloatOr1DVector, Integer or Float or fixed-length vector of Integer or Float values of ranks 1
Description
The subgroup_reduce op reduces the values of lanes (work items) across a
subgroup.
The subgroup is divided into clusters starting at lane index 0. Within each
cluster, there are size lanes, and the lane index advances by stride.
A reduction is done for each cluster in parallel: every lane in the cluster
is reduced, and the result is equal for all lanes in the cluster. If size
is omitted, there is a single cluster covering the entire subgroup. If
stride is omitted, the stride is 1 (the cluster's lanes are contiguous).
When the reduced value is of a vector type, each vector element is reduced independently. Only 1-d vector types are allowed.
Example:
%1 = gpu.subgroup_reduce add %a : (f32) -> f32
%2 = gpu.subgroup_reduce add %b : (vector<4xf16>) -> vector<4xf16>
%3 = gpu.subgroup_reduce add %c cluster(size = 4) : (f32) -> f32
%3 = gpu.subgroup_reduce add %c cluster(size = 4, stride = 2) : (f32) -> f32If uniform flag is set either none or all lanes of a subgroup need to execute
this op in convergence.
The reduction operation must be one of:
- Integer types:
add,mul,minui,minsi,maxui,maxsi,and,or,xor - Floating point types:
add,mul,minnumf,maxnumf,minimumf,maximumf
gpu.subgroup_size
gpu.terminator
gpu.thread_id
gpu.wait - Wait for async gpu ops to complete.
Operands
asyncDependencies- Variadic,GPU_AsyncToken, variadic of async token type
Results
asyncToken- Optional,GPU_AsyncToken, async token type
Description
This op synchronizes the host or the device with a list of dependent ops.
If the op contains the async keyword, it returns a new async token which
is synchronized with the op arguments. This new token is merely a shortcut
to the argument list, and one could replace the uses of the result with the
arguments for the same effect. The async version of this op is primarily
used to make each async token have a single use during lowering and
thereby make forks in async execution explicit. Example usage:
%t0 = gpu.foo async : !gpu.async.token
%t1 = gpu.bar async : !gpu.async.token
%t2 = gpu.wait async [%t0, %t1]
// gpu.baz doesn't run until gpu.foo and gpu.bar have both completed, just
// as if the async dependencies were [%t0, %t1].
%t3 = gpu.baz async [%t2]If the op does not contain the async keyword, it does not return a new
async token but blocks until all ops producing the async dependency tokens
finished execution. All dependent memory operations are visible to the host
once this op completes. Example usage:
%t0 = gpu.foo async : !gpu.async.token
%t1 = gpu.bar async : !gpu.async.token
// The gpu.wait op blocks until gpu.foo and gpu.bar have completed.
gpu.wait [%t0, %t1]
gpu.warp_execute_on_lane_0 - Executes operations in the associated region on thread #0 of aSPMD program
Attributes
warp_size- Single,I64Attr, 64-bit signless integer attribute
Operands
laneid- Single,Index, indexargs- Variadic,AnyType, variadic of any type
Results
results- Variadic,AnyType, variadic of any type
Description
warp_execute_on_lane_0 is an operation used to bridge the gap between
vector programming and SPMD programming model like GPU SIMT. It allows to
trivially convert a region of vector code meant to run on a multiple threads
into a valid SPMD region and then allows incremental transformation to
distribute vector operations on the threads.
Any code present in the region would only be executed on first thread/lane
based on the laneid operand. The laneid operand is an integer ID between
[0, warp_size). The warp_size attribute indicates the number of lanes in
a warp.
Operands are vector values distributed on all lanes that may be used by the single lane execution. The matching region argument is a vector of all the values of those lanes available to the single active lane. The distributed dimension is implicit based on the shape of the operand and argument. the properties of the distribution may be described by extra attributes (e.g. affine map).
Return values are distributed on all lanes using laneId as index. The vector is distributed based on the shape ratio between the vector type of the yield and the result type. If the shapes are the same this means the value is broadcasted to all lanes. In the future the distribution can be made more explicit using affine_maps and will support having multiple Ids.
Therefore the warp_execute_on_lane_0 operations allow to implicitly copy
between lane0 and the lanes of the warp. When distributing a vector
from lane0 to all the lanes, the data are distributed in a block cyclic way.
For example vector<64xf32> gets distributed on 32 threads and map to
vector<2xf32> where thread 0 contains vector[0] and vector[1].
During lowering values passed as operands and return value need to be visible to different lanes within the warp. This would usually be done by going through memory.
The region is not isolated from above. For values coming from the parent region not going through operands only the lane 0 value will be accesible so it generally only make sense for uniform values.
Example:
// Execute in parallel on all threads/lanes.
gpu.warp_execute_on_lane_0 (%laneid)[32] {
// Serial code running only on thread/lane 0.
...
}
// Execute in parallel on all threads/lanes.This may be lowered to an scf.if region as below:
// Execute in parallel on all threads/lanes.
%cnd = arith.cmpi eq, %laneid, %c0 : index
scf.if %cnd {
// Serial code running only on thread/lane 0.
...
}
// Execute in parallel on all threads/lanes.When the region has operands and/or return values:
// Execute in parallel on all threads/lanes.
%0 = gpu.warp_execute_on_lane_0(%laneid)[32]
args(%v0 : vector<4xi32>) -> (vector<1xf32>) {
^bb0(%arg0 : vector<128xi32>) :
// Serial code running only on thread/lane 0.
...
gpu.yield %1 : vector<32xf32>
}
// Execute in parallel on all threads/lanes.values at the region boundary would go through memory:
// Execute in parallel on all threads/lanes.
...
// Store the data from each thread into memory and Synchronization.
%tmp0 = memreg.alloc() : memref<128xf32>
%tmp1 = memreg.alloc() : memref<32xf32>
%cnd = arith.cmpi eq, %laneid, %c0 : index
vector.store %v0, %tmp0[%laneid] : memref<128xf32>, vector<4xf32>
some_synchronization_primitive
scf.if %cnd {
// Serialized code running only on thread 0.
// Load the data from all the threads into a register from thread 0. This
// allow threads 0 to access data from all the threads.
%arg0 = vector.load %tmp0[%c0] : memref<128xf32>, vector<128xf32>
...
// Store the data from thread 0 into memory.
vector.store %1, %tmp1[%c0] : memref<32xf32>, vector<32xf32>
}
// Synchronization and load the data in a block cyclic way so that the
// vector is distributed on all threads.
some_synchronization_primitive
%0 = vector.load %tmp1[%laneid] : memref<32xf32>, vector<32xf32>
// Execute in parallel on all threads/lanes.
gpu.yield - GPU yield operation
Operands
values- Variadic,AnyType, variadic of any type
Description
gpu.yield is a special terminator operation for blocks inside regions
in gpu ops. It returns values to the immediately enclosing gpu op.
Example:
gpu.yield %f0, %f1 : f32, f32