Beaver.MLIR.Dialect.NVGPU (beaver v0.4.7)
Summary
Functions
nvgpu.device_async_copy - device-side asynchronous copy
nvgpu.device_async_create_group - device side asynchronous create group operation
nvgpu.device_async_wait - Wait for async gpu ops to complete.
nvgpu.ldmatrix
nvgpu.mbarrier.arrive - Performs arrive operation on the nvgpu.mbarrier.arrive.
nvgpu.mbarrier.arrive.expect_tx - Performs expect_tx operation on the nvgpu.mbarrier.arrive
nvgpu.mbarrier.arrive.nocomplete - Performs arrive operation on the nvgpu.mbarrier.arrive.nocomplete as non-blocking.
nvgpu.mbarrier.create
nvgpu.mbarrier.get - Return a pointer to an nvgpu.mbarrier.
nvgpu.mbarrier.init - Initialize the nvgpu.mbarrier.
nvgpu.mbarrier.test.wait - Checks if the nvgpu.mbarrier has completed its current phase.
nvgpu.mbarrier.try_wait.parity - Waits for the nvgpu.mbarrier to complete its current phase.
nvgpu.mma.sp.sync
nvgpu.mma.sync
nvgpu.rcp - The reciprocal calculation for vector types
nvgpu.tma.async.load - TMA asynchronous load
nvgpu.tma.async.store - TMA asynchronous store
nvgpu.tma.create.descriptor - TMA create descriptor
nvgpu.tma.fence.descriptor - Insert fence given nvgpu.tensormap.descriptor
nvgpu.tma.prefetch.descriptor - Prefetch given nvgpu.tensormap.descriptor
nvgpu.warpgroup.generate.descriptor - Generate a warpgroup matrix descriptor
nvgpu.warpgroup.mma
nvgpu.warpgroup.mma.init.accumulator
nvgpu.warpgroup.mma.store
Functions
nvgpu.device_async_copy - device-side asynchronous copy
This op has support for result type inference.
Attributes
dstElements - Single, IndexAttr, index attribute
bypassL1 - Optional, UnitAttr, unit attribute
Operands
dst - Single, AnyMemRef, memref of any type values
dstIndices - Variadic, Index, variadic of index
src - Single, AnyMemRef, memref of any type values
srcIndices - Variadic, Index, variadic of index
srcElements - Optional, Index, index
Results
asyncToken - Single, NVGPU_DeviceAsyncToken, device async token type
Description
The nvgpu.device_async_copy op initiates an asynchronous copy operation of
elements from source (global memory) to the destination (shared memory)
without blocking the thread. The async copy is added to a group.
This op is meant to be used with nvgpu.device_async_create_group and
nvgpu.device_async_wait to synchronize copies as explained in those ops
descriptions.
The bypassL1 attribute is a hint to the hardware to bypass the L1 cache during
the async copy; this hint may be ignored by the hardware.
The dstElements attribute is the total number of elements written to the
destination (shared memory).
The srcElements argument is the total number of elements read from the
source (global memory).
srcElements is an optional argument; when present, the op reads only
srcElements elements from the source (global memory) and zero-fills
the rest of the elements in the destination (shared memory).
In order to do a copy and wait for the result we need the following combination:
// copy 1.
%cp1 = nvgpu.device_async_copy %A[%c0], %B[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
// copy 2.
%cp2 = nvgpu.device_async_copy %C[%c0], %D[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
// group 1 contains copy 1 and copy 2.
%token1 = nvgpu.device_async_create_group %cp1, %cp2
// copy 3.
%cp3 = nvgpu.device_async_copy %E[%c0], %F[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
// group 2 contains copy 3.
%token2 = nvgpu.device_async_create_group %cp3
// after the wait copy 1 and copy 2 are complete.
nvgpu.device_async_wait %token1
// after the wait copy 3 is complete.
nvgpu.device_async_wait %token2
Example:
%0 = nvgpu.device_async_copy %src[%c0, %c0], %dst[%c0, %c0, %c0], 4 :
memref<4x5xf32> to memref<2x7x5xf32, 3>
nvgpu.device_async_create_group - device side asynchronous create group operation
This op has support for result type inference.
Operands
inputTokens - Variadic, NVGPU_DeviceAsyncToken, variadic of device async token type
Results
asyncToken - Single, NVGPU_DeviceAsyncToken, device async token type
Description
The nvgpu.device_async_create_group op creates a group of memory accesses
containing all the pending device_async_copy operations associated with
argument tokens. Each token can only be part of one group.
It returns a token that can be used to wait until the group fully completes.
This is meant to be used with nvgpu.device_async_wait to synchronize copies,
as explained in that op's description.
Groups are executed in the order they are created.
Example:
%0 = nvgpu.device_async_create_group
nvgpu.device_async_wait - Wait for async gpu ops to complete.
Attributes
numGroups - Optional, I32Attr, 32-bit signless integer attribute
Operands
asyncDependencies - Single, NVGPU_DeviceAsyncToken, device async token type
Description
The nvgpu.device_async_wait op will block the execution thread until the group
associated with the source token is fully completed.
The optional $numGroups attribute gives an upper bound on the number of
groups that may still be uncompleted when the wait unblocks the thread. For example, if
16 async groups are pushed and $numGroups is set to 12, then the thread
will unblock when 12 groups or fewer are in flight (4 groups have
completed).
Example:
nvgpu.device_async_wait %0
nvgpu.ldmatrix
Attributes
transpose - Single, BoolAttr, bool attribute
numTiles - Single, I32Attr, 32-bit signless integer attribute
Operands
srcMemref - Single, AnyMemRef, memref of any type values
indices - Variadic, Index, variadic of index
Results
res - Single, AnyVectorOfNonZeroRank, vector of any type values
Description
The nvgpu.ldmatrix op represents loading a matrix fragment from
memory to registers. The source and result type must be compatible
with lowering to the nvvm.ldmatrix instruction. This op represents
the distributed version of a vector.transfer_read as an intermediate
step between lowering from vector.transfer_read to nvvm.ldmatrix.
This operation is meant to follow the semantics described here: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-ldmatrix
Example:
%0 = nvgpu.ldmatrix %sm[%c0, %c0] {numTiles = 4 : i32, transpose = false} :
memref<?x?xf16, 3> -> vector<4x2xf16>
nvgpu.mbarrier.arrive - Performs arrive operation on the nvgpu.mbarrier.arrive.
This op has support for result type inference.
Operands
barriers - Single, NVGPU_MBarrierGroup, mbarrier barrier type
mbarId - Single, Index, index
Results
token - Single, NVGPU_MBarrierToken
Description
The Op performs an arrive-on operation on the mbarrier object and returns an
nvgpu.mbarrier.token.
For more information, see https://docs.nvidia.com/cuda/parallel-thread-execution/#arrive-on-operation-on-mbarrier-object
Example:
%token = nvgpu.mbarrier.arrive %barrier : !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>> -> !nvgpu.mbarrier.token
nvgpu.mbarrier.arrive.expect_tx - Performs expect_tx operation on the nvgpu.mbarrier.arrive
Operands
barriers - Single, NVGPU_MBarrierGroup, mbarrier barrier type
txcount - Single, Index, index
mbarId - Single, Index, index
predicate - Optional, I1, 1-bit signless integer
Description
A thread executing the Op performs an expect-tx operation on the mbarrier object at the location specified by the address operand $barrier. The expect-tx operation, with a $txcount argument, increases the tx-count of the mbarrier object by the value specified by $txcount. This makes the current phase of the mbarrier object expect and track the completion of additional asynchronous transactions.
The $txcount operand specifies the number of elements for the expect-tx operation.
Example:
nvgpu.mbarrier.arrive.expect_tx %barrier, %ic0 : !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>>
nvgpu.mbarrier.arrive.nocomplete - Performs arrive operation on the nvgpu.mbarrier.arrive.nocomplete as non-blocking.
This op has support for result type inference.
Operands
barriers - Single, NVGPU_MBarrierGroup, mbarrier barrier type
mbarId - Single, Index, index
count - Single, Index, index
Results
token - Single, NVGPU_MBarrierToken
Description
The Op performs an arrive-on operation on the mbarrier object and returns an
nvgpu.mbarrier.token.
The Op does not cause the nvgpu.mbarrier to complete its current phase.
Example:
%token = nvgpu.mbarrier.arrive.nocomplete %barrier, %count : !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>> -> !nvgpu.mbarrier.token
nvgpu.mbarrier.create
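Example (a minimal usage sketch; the result type mirrors the nvgpu.mbarrier.init example below and is illustrative, not taken from this page):
// Illustrative result type; create one or more mbarriers in workgroup shared memory.
%barrier = nvgpu.mbarrier.create -> !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>>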
nvgpu.mbarrier.get - Return a pointer to an nvgpu.mbarrier.
Operands
barriers - Single, NVGPU_MBarrierGroup, mbarrier barrier type
mbarId - Single, Index, index
Results
mbarrierPointer - Single, anonymous/composite constraint, 32-bit signless integer or 64-bit signless integer
Description
The nvgpu.mbarrier.get operation retrieves a pointer to a specific
mbarrier object from a group of barriers created by the nvgpu.mbarrier.create operation.
Example:
%mbars = nvgpu.mbarrier.create -> !nvgpu.mbarrier.group<memorySpace = #gpu.address_space<workgroup>, num_barriers = 10>
%mbar_pointer = nvgpu.mbarrier.get %mbars[%c2] : !nvgpu.mbarrier.group<memorySpace = #gpu.address_space<workgroup>>
nvgpu.mbarrier.init - Initialize the nvgpu.mbarrier.
Operands
barriers - Single, NVGPU_MBarrierGroup, mbarrier barrier type
count - Single, Index, index
mbarId - Single, Index, index
predicate - Optional, I1, 1-bit signless integer
Description
The Op initializes the mbarrier object with the given number of threads.
Example:
%num_threads = gpu.block_dim x
%barrier = nvgpu.mbarrier.create -> !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>>
nvgpu.mbarrier.init %barrier, %num_threads : !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>>
nvgpu.mbarrier.test.wait - Checks if the nvgpu.mbarrier has completed its current phase.
This op has support for result type inference.
Operands
barriers - Single, NVGPU_MBarrierGroup, mbarrier barrier type
token - Single, NVGPU_MBarrierToken
mbarId - Single, Index, index
Results
waitComplete - Single, I1, 1-bit signless integer
Description
Checks whether the mbarrier object has completed the phase. It is a non-blocking instruction which tests for the completion of the phase.
Example:
%isComplete = nvgpu.mbarrier.test.wait %barrier, %token : !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>>, !nvgpu.mbarrier.token
nvgpu.mbarrier.try_wait.parity - Waits for the nvgpu.mbarrier to complete its current phase.
Operands
barriers - Single, NVGPU_MBarrierGroup, mbarrier barrier type
phaseParity - Single, I1, 1-bit signless integer
ticks - Single, Index, index
mbarId - Single, Index, index
Description
Checks whether the mbarrier object has completed the phase. It is a potentially blocking instruction which tests for the completion of the phase. A suspended thread resumes execution when the specified phase completes, or before the phase completes following a system-dependent time limit.
The $phaseParity operand specifies whether to wait on the even phase (0) or the odd
phase (1).
Example:
nvgpu.mbarrier.try_wait.parity %barrier, %phaseParity, %ticks : !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>>
nvgpu.mma.sp.sync
Attributes
mmaShape - Single, I64ArrayAttr, 64-bit integer array attribute
sparsitySelector - Single, I32Attr, 32-bit signless integer attribute
tf32Enabled - Optional, UnitAttr, unit attribute
Operands
matrixA - Single, AnyVectorOfNonZeroRank, vector of any type values
matrixB - Single, AnyVectorOfNonZeroRank, vector of any type values
matrixC - Single, AnyVectorOfNonZeroRank, vector of any type values
sparseMetadata - Single, NVGPU_MmaSparseSyncMetadataType, fixed-length vector of 16-bit signless integer values of length 2
Results
res - Single, AnyVectorOfNonZeroRank, vector of any type values
Description
The nvgpu.mma.sp.sync operation performs a warp-distributed MMA operation
where operand A is "structured sparse". In this case, the matrixA operand
represents the (warp-distributed) non-zero values of operand A, and the
sparse_metadata operand provides the indices.
The sparsity storage format and distribution scheme are fully described in the PTX docs. This operation is meant to follow the semantics described in the PTX documentation here: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-for-sparse-mma
The way the indices are distributed among the threads in a warp is controlled
by the optional sparsity_selector operand, which is 0 by default. For
more information, please consult the PTX documentation linked above.
Example (targeting the f16 16x8x32 mma.sp PTX instruction):
nvgpu.mma.sp.sync (%a, %b, %c) metadata (%meta) {mmaShape = [16, 8, 32]} :
(vector<4x2xf16>, vector<2x2xf16>, vector<2x2xf16>) -> vector<2x2xf16>
nvgpu.mma.sync
Attributes
mmaShape - Single, I64ArrayAttr, 64-bit integer array attribute
tf32Enabled - Optional, UnitAttr, unit attribute
Operands
matrixA - Single, AnyVectorOfNonZeroRank, vector of any type values
matrixB - Single, AnyVectorOfNonZeroRank, vector of any type values
matrixC - Single, AnyVectorOfNonZeroRank, vector of any type values
Results
res - Single, AnyVectorOfNonZeroRank, vector of any type values
Description
The nvgpu.mma.sync op represents the warp-level matrix-multiply-and-
accumulate (mma) operation that is compatible with nvvm.mma.sync.
The operand and result vector sizes represent the thread-level ownership of
the warp-level mma operation shape. The mmaShape attribute holds the
warp-level matrix-multiply shape.
The nvgpu.mma.sync op serves as an intermediate point when lowering from
vector.contract to nvvm.mma.sync.
This operation is meant to follow the semantics described here: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma
Example:
%res = nvgpu.mma.sync (%matrixA, %matrixB, %matrixC) {mmaShape = [16, 8, 16]} :
(vector<4x2xf16>, vector<2x2xf16>, vector<2x2xf32>) -> vector<2x2xf32>
nvgpu.rcp - The reciprocal calculation for vector types
This op has support for result type inference.
Attributes
rounding - Single, RcpRoundingModeAttr, Rounding mode of rcp
ftz - Optional, UnitAttr, unit attribute
Operands
in - Single, anonymous/composite constraint, vector of 32-bit float values
Results
out - Single, anonymous/composite constraint, vector of 32-bit float values
Description
Reciprocal calculation for vector types using nvvm.rcp OPs.
Currently, only the approx rounding mode and ftz are supported, and only for the f32 type.
The input and output must be of the same vector type and shape.
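A minimal usage sketch (this assumes the op prints just its input operand followed by the vector type, with the rounding mode left at its approx default; the value names are illustrative, not taken from this page):
// Illustrative names; per-element reciprocal of a vector of f32 values.
%out = nvgpu.rcp %in : vector<4xf32>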
nvgpu.tma.async.load - TMA asynchronous load
Operands
dst - Single, AnyMemRef, memref of any type values
barriers - Single, NVGPU_MBarrierGroup, mbarrier barrier type
tensorMapDescriptor - Single, NVGPU_TensorMapDescriptor, TensorMap descriptor
coordinates - Variadic, Index, variadic of index
mbarId - Single, Index, index
multicastMask - Optional, I16, 16-bit signless integer
predicate - Optional, I1, 1-bit signless integer
Description
The Op loads a tiled memory region from global memory into shared memory using Tensor Memory Access (TMA).
$tensorMapDescriptor is a tensor map descriptor that carries information about the
tile shape. The descriptor is created by nvgpu.tma.create.descriptor.
The Op uses the $barriers mbarrier-based completion mechanism.
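Example (a minimal sketch; the SSA value names and the descriptor and barrier types are illustrative assumptions, not taken from this page):
// Illustrative names and types; load one tile into shared memory, completion tracked by the mbarrier.
nvgpu.tma.async.load %tensorMap[%c0, %c0], %barriers[%c0] to %dst
    : !nvgpu.tensormap.descriptor<tensor = memref<128x64xf16, 3>, swizzle = none, l2promo = none, oob = zero, interleave = none>,
      !nvgpu.mbarrier.group<memorySpace = #gpu.address_space<workgroup>>
    -> memref<128x64xf16, 3>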
nvgpu.tma.async.store - TMA asynchronous store
Operands
src - Single, AnyMemRef, memref of any type values
tensorMapDescriptor - Single, NVGPU_TensorMapDescriptor, TensorMap descriptor
coordinates - Variadic, Index, variadic of index
predicate - Optional, I1, 1-bit signless integer
Description
The Op stores a tiled memory region from shared memory to global memory using Tensor Memory Access (TMA).
$tensorMapDescriptor is a tensor map descriptor that carries information about the
tile shape. The descriptor is created by nvgpu.tma.create.descriptor.
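Example (a minimal sketch; the value names and types are illustrative assumptions, not taken from this page):
// Illustrative names and types; store one shared-memory tile back to the global tensor.
nvgpu.tma.async.store %src to %tensorMap[%c0, %c0]
    : memref<128x64xf16, 3>
    -> !nvgpu.tensormap.descriptor<tensor = memref<128x64xf16, 3>, swizzle = none, l2promo = none, oob = zero, interleave = none>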
nvgpu.tma.create.descriptor - TMA create descriptor
Operands
tensor - Single, AnyUnrankedMemRef, unranked.memref of any type values
boxDimensions - Variadic, Index, variadic of index
Results
tensorMap - Single, NVGPU_TensorMapDescriptor, TensorMap descriptor
Description
The Op creates a tensor map descriptor object representing a tiled memory
region. To do that, it calls the CUDA Driver's cuTensorMapEncodeTiled. The
descriptor is used by Tensor Memory Access (TMA).
The tensor is the source tensor to be tiled.
The boxDimensions are the sizes of the tiled memory region in each dimension.
For more information, see: https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html
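Example (a minimal sketch; the value names and the descriptor type parameters are illustrative assumptions, not taken from this page):
// Illustrative names and types; describe 128x64 tiles of an unranked f16 tensor.
%c64 = arith.constant 64 : index
%c128 = arith.constant 128 : index
%tensorMap = nvgpu.tma.create.descriptor %tensor box[%c128, %c64]
    : memref<*xf16>
    -> !nvgpu.tensormap.descriptor<tensor = memref<128x64xf16, 3>, swizzle = none, l2promo = none, oob = zero, interleave = none>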
nvgpu.tma.fence.descriptor - Insert fence given nvgpu.tensormap.descriptor
Operands
tensorMapDescriptor - Single, NVGPU_TensorMapDescriptor, TensorMap descriptor
Description
The Op fences the given $tmaDescriptor. This is necessary if the tensor map
descriptor was modified from the host using cudaMemcpy. In this case, the
kernel needs a fence after which it is safe to use the tensor map.
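Example (a minimal sketch; the value name and descriptor type are illustrative assumptions, not taken from this page):
// Illustrative name and type; fence the descriptor before it is used by TMA ops.
nvgpu.tma.fence.descriptor %tensorMap : !nvgpu.tensormap.descriptor<tensor = memref<128x64xf16, 3>, swizzle = none, l2promo = none, oob = zero, interleave = none>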
nvgpu.tma.prefetch.descriptor - Prefetch given nvgpu.tensormap.descriptor
Operands
tensorMapDescriptor - Single, NVGPU_TensorMapDescriptor, TensorMap descriptor
predicate - Optional, I1, 1-bit signless integer
Description
The Op prefetches the cache line containing the given $tmaDescriptor for
subsequent use by the tma.async.load instruction.
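Example (a minimal sketch; the value name and descriptor type are illustrative assumptions, not taken from this page):
// Illustrative name and type; prefetch the descriptor's cache line.
nvgpu.tma.prefetch.descriptor %tensorMap : !nvgpu.tensormap.descriptor<tensor = memref<128x64xf16, 3>, swizzle = none, l2promo = none, oob = zero, interleave = none>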
nvgpu.warpgroup.generate.descriptor - Generate a warpgroup matrix descriptor
Operands
tensor - Single, AnyMemRef, memref of any type values
tensorMap - Single, NVGPU_TensorMapDescriptor, TensorMap descriptor
Results
descriptor - Single, NVGPU_WarpgroupMatrixDescriptor, Warpgroup matrix descriptor type
Description
This Op builds an nvgpu.warpgroup.descriptor that is used by
nvgpu.warpgroup.mma to perform warpgroup-level matrix multiply and
accumulate.
The descriptor specifies the properties of the matrix in shared memory that is a multiplicand in the matrix multiply and accumulate operation.
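Example (a minimal sketch; the value names and types are illustrative assumptions, chosen to match the descriptor type in the nvgpu.warpgroup.mma example below):
// Illustrative names and types; build a wgmma descriptor for a shared-memory tile.
%descA = nvgpu.warpgroup.generate.descriptor %shmemA, %tensorMapA
    : memref<128x64xf16, 3>, !nvgpu.tensormap.descriptor<tensor = memref<128x64xf16, 3>, swizzle = none, l2promo = none, oob = zero, interleave = none>
    -> !nvgpu.warpgroup.descriptor<tensor = memref<128x64xf16, 3>>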
nvgpu.warpgroup.mma
Attributes
waitGroup - Optional, I64Attr, 64-bit signless integer attribute
transposeA - Optional, UnitAttr, unit attribute
transposeB - Optional, UnitAttr, unit attribute
Operands
descriptorA - Single, NVGPU_WarpgroupMatrixDescriptor, Warpgroup matrix descriptor type
descriptorB - Single, NVGPU_WarpgroupMatrixDescriptor, Warpgroup matrix descriptor type
matrixC - Single, NVGPU_WarpgroupAccumulator
Results
matrixD - Single, NVGPU_WarpgroupAccumulator
Description
The nvgpu.warpgroup.mma op performs the warpgroup-level (4 warps)
matrix-multiply-and-accumulate (mma) operation that results in
nvvm.wgmma.mma_async.
The operands descriptorA and descriptorB are wgmma matrix
descriptors that describe the properties of the matrices in shared memory. The
results represent the thread-level ownership of the warpgroup-level mma operation
shape. The shape is deduced from the descriptor types and the output vector.
The Op encapsulates multiple nvvm.wgmma.mma_async operations to complete
the given shape. Since the nvvm.wgmma.mma_async Op, and its corresponding PTX
instruction, is asynchronous, this Op groups the nvvm.wgmma.mma_async Ops and
surrounds them with wgmma.fence.aligned,
wgmma.commit.group.sync.aligned, and wgmma.wait.group.sync.aligned Ops.
Example:
%r1,%r2 = nvgpu.warpgroup.mma %descA, %descB, %acc1, %acc2:
!nvgpu.warpgroup.descriptor<tensor = memref<128x64xf16, 3>>,
!nvgpu.warpgroup.descriptor<tensor = memref<64x128xf16, 3>>,
!nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
!nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>
->
!nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
!nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>
nvgpu.warpgroup.mma.init.accumulator
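Example (a minimal sketch; the accumulator type mirrors the one in the nvgpu.warpgroup.mma example above and is illustrative):
// Illustrative type; zero-initialize a fragmented warpgroup accumulator.
%acc = nvgpu.warpgroup.mma.init.accumulator -> !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>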
nvgpu.warpgroup.mma.store
Operands
matrixD - Single, NVGPU_WarpgroupAccumulator
dstMemref - Single, AnyMemRef, memref of any type values
Description
The nvgpu.warpgroup.mma.store op stores the fragmented result held
in $matrixD to the given memref.
See the details of the register fragment layout for accumulator matrix D: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#wgmma-64n16-d
Note that the op must be run by a warpgroup.
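Example (a minimal sketch; the value names and the shared-memory destination type are illustrative assumptions, with the accumulator type mirroring the nvgpu.warpgroup.mma example above):
// Illustrative names and types; write the fragmented accumulator to a shared-memory buffer.
nvgpu.warpgroup.mma.store %matrixD, %dstMemref
    : !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>> to memref<64x128xf32, 3>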