Beaver.MLIR.Dialect.NVVM (beaver v0.4.7)
Summary
Functions
nvvm.bar.warp.sync - Warp Barrier Synchronization Op
nvvm.barrier0
nvvm.barrier - CTA Barrier Synchronization Op
nvvm.barrier.arrive
nvvm.breakpoint
nvvm.cluster.arrive
nvvm.cluster.arrive.relaxed
nvvm.cluster.wait
nvvm.clusterlaunchcontrol.query.cancel - Query the response of a clusterlaunchcontrol.try.cancel operation
nvvm.clusterlaunchcontrol.try.cancel - Request atomically canceling the launch of a cluster that has not started running yet
nvvm.convert.bf16x2.to.f8x2 - Convert a pair of bf16 inputs to f8x2
nvvm.convert.f16x2.to.f8x2 - Convert an f16x2 input to f8x2
nvvm.convert.f32x2.to.f6x2 - Convert a pair of float inputs to f6x2
nvvm.convert.f32x2.to.f8x2 - Convert a pair of float inputs to f8x2
nvvm.convert.float.to.tf32 - Convert the given float input to TF32
nvvm.cp.async.bulk.commit.group
nvvm.cp.async.bulk.global.shared.cta - Async bulk copy from Shared CTA memory to Global memory
nvvm.cp.async.bulk.prefetch - Async bulk prefetch from global memory to L2 cache
nvvm.cp.async.bulk.shared.cluster.global - Async bulk copy from global memory to Shared cluster memory
nvvm.cp.async.bulk.shared.cluster.shared.cta - Async bulk copy from Shared CTA memory to Shared cluster memory
nvvm.cp.async.bulk.tensor.global.shared.cta
nvvm.cp.async.bulk.tensor.prefetch
nvvm.cp.async.bulk.tensor.reduce
nvvm.cp.async.bulk.tensor.shared.cluster.global
nvvm.cp.async.bulk.wait_group
nvvm.cp.async.commit.group
nvvm.cp.async.mbarrier.arrive - NVVM Dialect Op for cp.async.mbarrier.arrive
nvvm.cp.async.mbarrier.arrive.shared - NVVM Dialect Op for cp.async.mbarrier.arrive.shared
nvvm.cp.async.shared.global
nvvm.cp.async.wait.group
nvvm.dot.accumulate.2way - Two-way 16-bit to 8-bit dot product-accumulate instruction
nvvm.dot.accumulate.4way - Four-way byte dot product-accumulate instruction
nvvm.elect.sync - Elect one leader thread
nvvm.exit
nvvm.fence.mbarrier.init
nvvm.fence.proxy
nvvm.fence.proxy.acquire - Uni-directional proxy fence operation with acquire semantics
nvvm.fence.proxy.release
nvvm.fence.sc.cluster
nvvm.griddepcontrol
nvvm.inline_ptx - Inline PTX Op
nvvm.ldmatrix - cooperative matrix load
nvvm.mapa
nvvm.match.sync - Broadcast and compare a value across threads in warp
nvvm.mbarrier.arrive - MBarrier Arrive Operation
nvvm.mbarrier.arrive.expect_tx - MBarrier Arrive with Expected Transaction Count
nvvm.mbarrier.arrive.expect_tx.shared - Shared MBarrier Arrive with Expected Transaction Count
nvvm.mbarrier.arrive.nocomplete - MBarrier Arrive No-Complete Operation
nvvm.mbarrier.arrive.nocomplete.shared - Shared MBarrier Arrive No-Complete Operation
nvvm.mbarrier.arrive.shared - Shared MBarrier Arrive Operation
nvvm.mbarrier.init - MBarrier Initialization Op
nvvm.mbarrier.init.shared - Shared MBarrier Initialization Op
nvvm.mbarrier.inval - MBarrier Invalidation Operation
nvvm.mbarrier.inval.shared - Shared MBarrier Invalidation Operation
nvvm.mbarrier.test.wait - MBarrier Non-Blocking Test Wait Operation
nvvm.mbarrier.test.wait.shared - Shared MBarrier Non-Blocking Test Wait Operation
nvvm.mbarrier.try_wait.parity - MBarrier Potentially-Blocking Try Wait with Phase Parity
nvvm.mbarrier.try_wait.parity.shared - Shared MBarrier Potentially-Blocking Try Wait with Phase Parity
nvvm.mma.sync - cooperative matrix-multiply and accumulate
nvvm.nanosleep
nvvm.pmevent
nvvm.prefetch - Brings the cache line containing an address into the specified cache level
nvvm.rcp.approx.ftz.f
nvvm.read.ptx.sreg.clock64
nvvm.read.ptx.sreg.clock
nvvm.read.ptx.sreg.cluster.ctaid.x
nvvm.read.ptx.sreg.cluster.ctaid.y
nvvm.read.ptx.sreg.cluster.ctaid.z
nvvm.read.ptx.sreg.cluster.ctarank
nvvm.read.ptx.sreg.cluster.nctaid.x
nvvm.read.ptx.sreg.cluster.nctaid.y
nvvm.read.ptx.sreg.cluster.nctaid.z
nvvm.read.ptx.sreg.cluster.nctarank
nvvm.read.ptx.sreg.clusterid.x
nvvm.read.ptx.sreg.clusterid.y
nvvm.read.ptx.sreg.clusterid.z
nvvm.read.ptx.sreg.ctaid.x
nvvm.read.ptx.sreg.ctaid.y
nvvm.read.ptx.sreg.ctaid.z
nvvm.read.ptx.sreg.envreg0
nvvm.read.ptx.sreg.envreg1
nvvm.read.ptx.sreg.envreg2
nvvm.read.ptx.sreg.envreg3
nvvm.read.ptx.sreg.envreg4
nvvm.read.ptx.sreg.envreg5
nvvm.read.ptx.sreg.envreg6
nvvm.read.ptx.sreg.envreg7
nvvm.read.ptx.sreg.envreg8
nvvm.read.ptx.sreg.envreg9
nvvm.read.ptx.sreg.envreg10
nvvm.read.ptx.sreg.envreg11
nvvm.read.ptx.sreg.envreg12
nvvm.read.ptx.sreg.envreg13
nvvm.read.ptx.sreg.envreg14
nvvm.read.ptx.sreg.envreg15
nvvm.read.ptx.sreg.envreg16
nvvm.read.ptx.sreg.envreg17
nvvm.read.ptx.sreg.envreg18
nvvm.read.ptx.sreg.envreg19
nvvm.read.ptx.sreg.envreg20
nvvm.read.ptx.sreg.envreg21
nvvm.read.ptx.sreg.envreg22
nvvm.read.ptx.sreg.envreg23
nvvm.read.ptx.sreg.envreg24
nvvm.read.ptx.sreg.envreg25
nvvm.read.ptx.sreg.envreg26
nvvm.read.ptx.sreg.envreg27
nvvm.read.ptx.sreg.envreg28
nvvm.read.ptx.sreg.envreg29
nvvm.read.ptx.sreg.envreg30
nvvm.read.ptx.sreg.envreg31
nvvm.read.ptx.sreg.globaltimer
nvvm.read.ptx.sreg.globaltimer.lo
nvvm.read.ptx.sreg.gridid
nvvm.read.ptx.sreg.laneid
nvvm.read.ptx.sreg.lanemask.eq
nvvm.read.ptx.sreg.lanemask.ge
nvvm.read.ptx.sreg.lanemask.gt
nvvm.read.ptx.sreg.lanemask.le
nvvm.read.ptx.sreg.lanemask.lt
nvvm.read.ptx.sreg.nclusterid.x
nvvm.read.ptx.sreg.nclusterid.y
nvvm.read.ptx.sreg.nclusterid.z
nvvm.read.ptx.sreg.nctaid.x
nvvm.read.ptx.sreg.nctaid.y
nvvm.read.ptx.sreg.nctaid.z
nvvm.read.ptx.sreg.nsmid
nvvm.read.ptx.sreg.ntid.x
nvvm.read.ptx.sreg.ntid.y
nvvm.read.ptx.sreg.ntid.z
nvvm.read.ptx.sreg.nwarpid
nvvm.read.ptx.sreg.smid
nvvm.read.ptx.sreg.tid.x
nvvm.read.ptx.sreg.tid.y
nvvm.read.ptx.sreg.tid.z
nvvm.read.ptx.sreg.warpid
nvvm.read.ptx.sreg.warpsize
nvvm.redux.sync - Redux Sync Op
nvvm.setmaxregister
nvvm.shfl.sync - NVVM Dialect Op for shfl.sync
nvvm.st.bulk - Bulk Store Op
nvvm.stmatrix - cooperative matrix store
nvvm.tcgen05.alloc - Tcgen05 alloc operation
nvvm.tcgen05.commit - Tcgen05 commit operations
nvvm.tcgen05.cp - Tcgen05 copy operation
nvvm.tcgen05.dealloc - Tcgen05 dealloc operation
nvvm.tcgen05.fence
nvvm.tcgen05.ld - tensor memory load instructions
nvvm.tcgen05.mma_smem_desc - Constructs a Shared Memory descriptor for MMA Operands A or B
nvvm.tcgen05.relinquish_alloc_permit
nvvm.tcgen05.shift - Tcgen05 shift operation
nvvm.tcgen05.st - tensor memory store instructions
nvvm.tcgen05.wait
nvvm.vote.sync - Vote across thread group
nvvm.wgmma.commit.group.sync.aligned
nvvm.wgmma.fence.aligned
nvvm.wgmma.mma_async
nvvm.wgmma.wait.group.sync.aligned
nvvm.wmma.load - Warp synchronous matrix load
nvvm.wmma.mma - Warp synchronous matrix-multiply accumulate using tensor cores.
nvvm.wmma.store - Warp synchronous matrix store
Functions
nvvm.bar.warp.sync - Warp Barrier Synchronization Op
Operands
mask- Single,LLVM_Type, LLVM dialect-compatible type
Description
The nvvm.bar.warp.sync operation performs barrier synchronization for threads
within a warp.
This operation causes the executing thread to wait until all threads corresponding
to the mask operand have executed a bar.warp.sync with the same mask value
before resuming execution.
The mask operand specifies the threads participating in the barrier, where each
bit position corresponds to the thread's lane ID within the warp. Only threads with
their corresponding bit set in the mask participate in the barrier synchronization.
Important constraints:
- The behavior is undefined if the executing thread is not included in the mask (i.e., the bit corresponding to the thread's lane ID is not set)
- For compute capability sm_6x or below, all threads in the mask must execute
the same
bar.warp.syncinstruction in convergence
This operation also guarantees memory ordering among participating threads.
Threads within the warp that wish to communicate via memory can store to memory,
execute bar.warp.sync, and then safely read values stored by other threads
in the warp.
nvvm.barrier0
nvvm.barrier - CTA Barrier Synchronization Op
Operands
barrierId- Optional,I32, 32-bit signless integernumberOfThreads- Optional,I32, 32-bit signless integer
Description
The nvvm.barrier operation performs barrier synchronization and communication
within a CTA (Cooperative Thread Array). It causes executing threads to wait for
all non-exited threads participating in the barrier to arrive.
The operation takes two optional operands:
barrierId: Specifies a logical barrier resource with value 0 through 15. Each CTA instance has sixteen barriers numbered 0..15. Defaults to 0 if not specified.numberOfThreads: Specifies the number of threads participating in the barrier. When specified, the value must be a multiple of the warp size. If not specified, all threads in the CTA participate in the barrier.
The barrier operation guarantees that when the barrier completes, prior memory accesses requested by participating threads are performed relative to all threads participating in the barrier. It also ensures that no new memory access is requested by participating threads before the barrier completes.
When a barrier completes, the waiting threads are restarted without delay, and the barrier is reinitialized so that it can be immediately reused.
This operation generates an aligned barrier, indicating that all threads in the CTA will execute the same barrier instruction. Behavior is undefined if all threads in the CTA do not reach this instruction.
nvvm.barrier.arrive
Operands
barrierId- Optional,I32, 32-bit signless integernumberOfThreads- Single,I32, 32-bit signless integer
Description
Thread that executes this op announces their arrival at the barrier with given id and continue their execution.
The default barrier id is 0 that is similar to nvvm.barrier Op. When
barrierId is not present, the default barrier id is used.
nvvm.breakpoint
nvvm.cluster.arrive
nvvm.cluster.arrive.relaxed
nvvm.cluster.wait
nvvm.clusterlaunchcontrol.query.cancel - Query the response of a clusterlaunchcontrol.try.cancel operation
Attributes
query_type- Single,ClusterLaunchControlQueryTypeAttr, NVVM ClusterLaunchControlQueryType
Operands
try_cancel_response- Single,I128, 128-bit signless integer
Results
res- Single, anonymous/composite constraint, 1-bit signless integer or 32-bit signless integer
Description
clusterlaunchcontrol.query.cancel queries the response of a
clusterlaunchcontrol.try.cancel operation specified by operand
try_cancel_response.
Operand query_type specifies the type of query to perform and can be one
of the following:
is_canceled: Returns true if the try cancel request succeeded, and false otherwise.get_first_cta_id_{x/y/z}: Returns the x, y, or z coordinate of the first CTA in the canceled cluster. Behaviour is defined only if the try cancel request succeeded.
nvvm.clusterlaunchcontrol.try.cancel - Request atomically canceling the launch of a cluster that has not started running yet
Attributes
multicast- Optional,UnitAttr, unit attribute
Operands
smemAddress- Single,LLVM_PointerShared, LLVM pointer in address space 3mbarrier- Single,LLVM_PointerShared, LLVM pointer in address space 3
Description
clusterlaunchcontrol.try.cancel requests atomically canceling the launch
of a cluster that has not started running yet. It asynchronously writes an
opaque response to shared memory indicating whether the operation succeeded
or failed.
Operand smemAddress specifies the naturally aligned address of the
16-byte wide shared memory location where the request's response is written.
Operand mbarrier specifies the mbarrier object used to track the
completion of the asynchronous operation.
If multicast is specified, the response is asynchronously written to the
corresponding local shared memory location (specifed by addr) of each CTA
in the requesting cluster.
nvvm.convert.bf16x2.to.f8x2 - Convert a pair of bf16 inputs to f8x2
Attributes
type- Single,ConvertFP8TypeAttr, NVVM ConvertFP8Type kindrnd- Single,FPRoundingModeAttr, NVVM FPRoundingMode kindsat- Single,SaturationModeAttr, NVVM SaturationMode kind
Operands
a- Single, anonymous/composite constraint, vector of bfloat16 type values of length 2
Results
dst- Single, anonymous/composite constraint, of ranks 1scalable vector of 16-bit float or bfloat16 type values of length 8
Description
This Op converts the given bf16 inputs in a bf16x2 vector to the specified
f8 type.
The result dst is represented as an i16 type or as a vector
of two i8 types.
If dst is returned as an i16 type, the converted values from a
are packed such that the value converted from the first element of a
is stored in the upper 8 bits of dst and the value converted from the
second element of a is stored in the lower 8 bits of dst.
If dst is returned as a vector type, each converted value is stored as an
i8 element in the vector.
The rnd and sat attributes specify the rounding and saturation modes
respectively.
nvvm.convert.f16x2.to.f8x2 - Convert an f16x2 input to f8x2
Attributes
type- Single,ConvertFP8TypeAttr, NVVM ConvertFP8Type kindrelu- Single,BoolAttr, bool attribute
Operands
a- Single, anonymous/composite constraint, of ranks 1scalable vector of 16-bit signless integer values of length 8
Results
dst- Single, anonymous/composite constraint, of ranks 1scalable vector of 16-bit float or bfloat16 type values of length 8
Description
This Op converts the given f16 inputs in an f16x2 vector to the specified
f8 type.
The result dst is represented as an i16 type or as a vector
of two i8 types.
If dst is returned as an i16 type, the converted values from a
are packed such that the value converted from the first element of a
is stored in the upper 8 bits of dst and the value converted from the
second element of a is stored in the lower 8 bits of dst.
If dst is returned as a vector type, each converted value is stored as an
i8 element in the vector.
The relu attribute, when set, lowers to the '.relu' variant of
the cvt instruction.
nvvm.convert.f32x2.to.f6x2 - Convert a pair of float inputs to f6x2
Attributes
type- Single,ConvertFP6TypeAttr, NVVM ConvertFP6Type kindrelu- Single,BoolAttr, bool attribute
Operands
a- Single,F32, 32-bit floatb- Single,F32, 32-bit float
Results
dst- Single, anonymous/composite constraint, of ranks 1scalable vector of 16-bit float or bfloat16 type values of length 8
Description
This Op converts each of the given float inputs to the specified fp6 type.
The result dst is represented either as an i16 type or as a vector
of two i8 types.
If dst is returned as an i16 type, the converted values are packed such
that the value converted from a is stored in the upper 8 bits of dst
with 2 MSB bits padded with zeros and the value converted from b is
stored in the lower 8 bits of dst with 2 MSB bits padded with zeros.
If dst is returned as a vector type, each converted value is stored as an
i8 element in the vector.
The relu attribute, when set, lowers to the '.relu' variant of
the cvt instruction.
nvvm.convert.f32x2.to.f8x2 - Convert a pair of float inputs to f8x2
Attributes
type- Single,ConvertFP8TypeAttr, NVVM ConvertFP8Type kindrnd- Single,FPRoundingModeAttr, NVVM FPRoundingMode kindsat- Single,SaturationModeAttr, NVVM SaturationMode kindrelu- Single,BoolAttr, bool attribute
Operands
a- Single,F32, 32-bit floatb- Single,F32, 32-bit float
Results
dst- Single, anonymous/composite constraint, of ranks 1scalable vector of 16-bit float or bfloat16 type values of length 8
Description
This Op converts each of the given float inputs to the specified fp8 type.
The result dst is represented as an i16 type or as a vector
of two i8 types.
If dst is returned as an i16 type, the converted values are packed such
that the value converted from a is stored in the upper 8 bits of dst
and the value converted from b is stored in the lower 8 bits of dst.
If dst is returned as a vector type, each converted value is stored as an
i8 element in the vector.
The rnd and sat attributes specify the rounding and saturation modes respectively.
The relu attribute, when set, lowers to the '.relu' variant of
the cvt instruction.
nvvm.convert.float.to.tf32 - Convert the given float input to TF32
Attributes
rnd- Single,FPRoundingModeAttr, NVVM FPRoundingMode kindsat- Single,SaturationModeAttr, NVVM SaturationMode kindrelu- Single,BoolAttr, bool attribute
Operands
src- Single,F32, 32-bit float
Results
res- Single,I32, 32-bit signless integer
Description
This Op converts the given f32 input to tf32.
The result res is represented as an i32 type.
The relu attribute, when set, lowers to the '.relu' variant of
the cvt instruction. The rnd and sat attributes specify the
the rounding and saturation modes respectively.
nvvm.cp.async.bulk.commit.group
nvvm.cp.async.bulk.prefetch - Async bulk prefetch from global memory to L2 cache
Operands
srcMem- Single,LLVM_PointerGlobal, LLVM pointer in address space 1size- Single,I32, 32-bit signless integerl2CacheHint- Optional,I64, 64-bit signless integer
Description
Initiates an asynchronous prefetch of data from the location
specified by srcMem to the L2 cache.
The l2CacheHint operand is optional, and it is used to specify cache
eviction policy that may be used during the memory access.
Example:
nvvm.cp.async.bulk.prefetch %src, %size : !llvm.ptr<1>
// with l2_cache_hint
nvvm.cp.async.bulk.prefetch %src, %size l2_cache_hint = %ch : !llvm.ptr<1>
nvvm.cp.async.bulk.tensor.prefetch
Attributes
mode- Single,TMALoadModeAttr, List of Load-Modes supported for TMA Tensor Ops
Operands
tmaDescriptor- Single,LLVM_PointerGeneric, LLVM pointer in address space 0coordinates- Variadic,I32, variadic of 32-bit signless integerim2colOffsets- Variadic,I16, variadic of 16-bit signless integerl2CacheHint- Optional,I64, 64-bit signless integer
Description
Initiates an asynchronous prefetch operation on the tensor data from global
memory to L2 cache. This Op supports all the load modes specified in
TMALoadMode.
The l2CacheHint operand is optional, and it is used to specify cache
eviction policy that may be used during the memory access.
nvvm.cp.async.bulk.tensor.reduce
Attributes
redKind- Single,TMAReduxKindAttr, NVVM TMA redux kindmode- Single,TMAStoreModeAttr, NVVM TMA Store Mode
Operands
tmaDescriptor- Single,LLVM_AnyPointer, LLVM pointer typesrcMem- Single,LLVM_PointerShared, LLVM pointer in address space 3coordinates- Variadic,I32, variadic of 32-bit signless integerl2CacheHint- Optional,I64, 64-bit signless integer
Description
Initiates an asynchronous reduction operation of tensor data in global memory with tensor data in shared memory.
The mode attribute indicates whether the copy mode is tile or im2col.
The redOp attribute specifies the reduction operations applied.
The supported reduction operations are:
{add, min, max, inc, dec, and, or, xor}
The l2CacheHint operand is optional, and it is used to specify cache
eviction policy that may be used during the memory access.
nvvm.cp.async.bulk.wait_group
nvvm.cp.async.commit.group
nvvm.cp.async.mbarrier.arrive - NVVM Dialect Op for cp.async.mbarrier.arrive
Attributes
noinc- Single,I1Attr, 1-bit signless integer attribute
Operands
addr- Single,LLVM_AnyPointer, LLVM pointer type
Description
The cp.async.mbarrier.arrive Op makes the mbarrier object track
all prior cp.async operations initiated by the executing thread.
The addr operand specifies the address of the mbarrier object
in generic address space. The noinc attr impacts how the
mbarrier's state is updated.
nvvm.cp.async.wait.group
nvvm.dot.accumulate.2way - Two-way 16-bit to 8-bit dot product-accumulate instruction
Attributes
a_type- Single,DotAccumulateTypeAttr, NVVM DotAccumulateTypeb_type- Single,DotAccumulateTypeAttr, NVVM DotAccumulateTypeb_hi- Single,BoolAttr, bool attribute
Operands
a- Single, anonymous/composite constraint, vector of 16-bit signless integer values of length 2b- Single, anonymous/composite constraint, vector of 8-bit signless integer values of length 4c- Single,I32, 32-bit signless integer
Results
res- Single,I32, 32-bit signless integer
Description
Performs a two-way 16-bit to 8-bit dot-product which is accumulated in a
32-bit result.
Operand a is a vector of two 16-bit elements and operand b a vector
of four 8-bit elements between which the dot product is computed.
The a_type and b_type attributes specify the type of the elements in a
and b respectively.
If a_type or b_type is s, then the elements in the corresponding
vector are sign-extended to 32-bit before the dot product is computed.
If a_type or b_type is u, then the elements in the corresponding
vector are zero-extended to 32-bit instead.
The b_hi boolean attribute specifies which two bytes of b are used for
the dot product. If b_hi is true, then the dot product is computed
between a and elements at indices 2 and 3 of b. If b_hi is false,
then the dot product is computed between a and elements at indices 0 and
1 of b.
Operand c is a 32-bit integer to which the result is accumulated. It is
treated as holding a signed integer if any of a_type or b_type is
signed.
nvvm.dot.accumulate.4way - Four-way byte dot product-accumulate instruction
Attributes
a_type- Single,DotAccumulateTypeAttr, NVVM DotAccumulateTypeb_type- Single,DotAccumulateTypeAttr, NVVM DotAccumulateType
Operands
a- Single, anonymous/composite constraint, vector of 8-bit signless integer values of length 4b- Single, anonymous/composite constraint, vector of 8-bit signless integer values of length 4c- Single,I32, 32-bit signless integer
Results
res- Single,I32, 32-bit signless integer
Description
Performs a four-way byte dot-product which is accumulated in a 32-bit
result.
Operand a and b are vectors of 4 bytes between which the dot product is
computed.
The a_type and b_type attributes specify the type of the elements in a
and b respectively.
If a_type or b_type is signed, then the elements in the corresponding
vector are sign-extended to 32-bit before the dot product is computed.
If a_type or b_type is unsigned, then the elements in the
corresponding vector are zero-extended to 32-bit instead.
Operand c is a 32-bit integer to which the result is accumulated. It is
treated as holding a signed integer if any of a_type or b_type is s8.
nvvm.elect.sync - Elect one leader thread
Operands
membermask- Optional,I32, 32-bit signless integer
Results
pred- Single,I1, 1-bit signless integer
Description
The elect.sync instruction elects one predicated active leader
thread from among a set of threads specified in the membermask.
When the membermask is not provided explicitly, a default value
of 0xFFFFFFFF is used. The predicate result is set to True for
the leader thread, and False for all other threads.
nvvm.exit
nvvm.fence.mbarrier.init
nvvm.fence.proxy
nvvm.fence.proxy.acquire - Uni-directional proxy fence operation with acquire semantics
Attributes
scope- Single,MemScopeKindAttr, NVVM Memory Scope kindfromProxy- Single,ProxyKindAttr, Proxy kindtoProxy- Single,ProxyKindAttr, Proxy kind
Operands
addr- Single,LLVM_PointerGeneric, LLVM pointer in address space 0size- Single,I32, 32-bit signless integer
Description
fence.proxy.acquire is a uni-directional fence used to establish ordering
between a prior memory access performed via the generic proxy and a
subsequent memory access performed via the tensormap proxy
The address operand addr and the operand size together specify the
memory range [addr, addr+size) on which the ordering guarantees on the
memory accesses across the proxies is to be provided. The only supported
value for the size operand is 128 and must be an immediate. Generic Addressing
is used unconditionally, and the address specified by the operand addr must
fall within the .global state space. Otherwise, the behavior is undefined
nvvm.fence.proxy.release
nvvm.fence.sc.cluster
nvvm.griddepcontrol
nvvm.inline_ptx - Inline PTX Op
Attributes
ptxCode- Single,StrAttr, string attribute
Operands
readOnlyArgs- Variadic,AnyType, variadic of any typereadWriteArgs- Variadic,AnyType, variadic of any typepredicate- Optional,PtxPredicate, 1-bit signless integer
Results
writeOnlyArgs- Variadic,AnyType, variadic of any type
Description
This op allows using PTX directly within the NVVM
dialect, while greatly simplifying llvm.inline_asm generation. It
automatically handles register size selection and sets the correct
read/write access for each operand. The operation leverages the
`BasicPtxBuilderInterface` to abstract away low-level details of
PTX assembly formatting.
The `predicate` attribute is used to specify a predicate for the
PTX instruction.
Example 1: Read-only Parameters
```mlir
nvvm.inline_ptx "mbarrier.init.b64 [$0], $1;" (%barrier_gen, %count) : !llvm.ptr, i32
// Lowers to:
llvm.inline_asm has_side_effects asm_dialect = att
"mbarrier.init.b64 [$0], $1;", "l,r" %arg0, %arg2 : (!llvm.ptr, i32) -> ()
```
Example 2: Read-only and Write-only Parameters
```mlir
%0 = nvvm.inline_ptx "ex2.approx.ftz.f32 $0, $1;" (%input) : f32 -> f32
// Lowers to:
%0 = llvm.inline_asm has_side_effects asm_dialect = att
"ex2.approx.ftz.f32 $0, $1;", "=f,f" %arg0 : (f32) -> f32
```
Example 3: Predicate Usage
```mlir
nvvm.inline_ptx "mbarrier.init.b64 [$0], $1;" (%barrier_gen, %count),
predicate = %pred : !llvm.ptr, i32, i1
// Lowers to:
llvm.inline_asm has_side_effects asm_dialect = att
"@$2 mbarrier.init.b64 [$0], $1;", "l,r,b" %arg0, %arg2, %arg3
: (!llvm.ptr, i32, i1) -> ()
```
nvvm.ldmatrix - cooperative matrix load
Attributes
num- Single,I32Attr, 32-bit signless integer attributelayout- Single,MMALayoutAttr, NVVM MMA layoutshape- Single,LdStMatrixShapeAttr, Matrix shape for ldmatrix and stmatrixeltType- Single,LdStMatrixEltTypeAttr, Element type for ldmatrix and stmatrix
Operands
ptr- Single,LLVM_PointerShared, LLVM pointer in address space 3
Results
res- Single,AnyType, any type
nvvm.mapa
Operands
a- Single, anonymous/composite constraint, LLVM pointer in address space 0 or LLVM pointer in address space 3b- Single,I32, 32-bit signless integer
Results
res- Single, anonymous/composite constraint, LLVM pointer in address space 0 or LLVM pointer in address space 7
nvvm.match.sync - Broadcast and compare a value across threads in warp
Attributes
kind- Single,MatchSyncKindAttr, NVVM match sync kind
Operands
thread_mask- Single,I32, 32-bit signless integerval- Single, anonymous/composite constraint, 32-bit signless integer or 64-bit signless integer
Results
res- Single, anonymous/composite constraint, 32-bit signless integer or LLVM struct type
Description
The match.sync op performs broadcast and compare of operand val across
all non-exited threads in thread_mask and returns a mask depending on the
kind and an optional predicate.
The matching operation kinds are:
any: Returns a mask corresponding to the non-exited threads in thethread_maskthat have the same value of operandval.all: Returns a mask and a predicate. If all non-exited threads in thethread_maskhave the same value of operandval, the predicate is set to true and the mask corresponds to the non-exited threads in thethread_mask. Otherwise, the predicate is set to false and the mask is 0.
nvvm.mbarrier.arrive - MBarrier Arrive Operation
Operands
addr- Single,LLVM_AnyPointer, LLVM pointer type
Results
res- Single,LLVM_Type, LLVM dialect-compatible type
Description
The nvvm.mbarrier.arrive operation performs an arrive-on operation on the
mbarrier object at the specified address. Uses the default .release.cta semantics.
This release pattern establishes memory ordering for operations occurring in program
order before this arrive instruction by making operations from the current thread
visible to subsequent operations in other threads within the CTA. When other threads
perform corresponding acquire operations (like 'mbarrier.test.wait'), they synchronize
with this release pattern.
This operation causes the executing thread to signal its arrival at the barrier. The operation returns an opaque value that captures the phase of the mbarrier object prior to the arrive-on operation. The contents of this state value are implementation-specific.
The operation takes the following operand:
addr: A pointer to the memory location of the mbarrier object. Uses generic addressing, but the address must still be in the shared memory space.
nvvm.mbarrier.arrive.expect_tx - MBarrier Arrive with Expected Transaction Count
Operands
addr- Single,LLVM_AnyPointer, LLVM pointer typetxcount- Single,I32, 32-bit signless integerpredicate- Optional,PtxPredicate, 1-bit signless integer
Description
The nvvm.mbarrier.arrive.expect_tx operation performs an expect-tx operation
followed by an arrive-on operation on the mbarrier object. Uses the default
.release.cta semantics. This release pattern establishes memory ordering for
operations occurring in program order before this arrive instruction by making
operations from the current thread visible to subsequent operations in other
threads within the CTA. When other threads perform corresponding acquire operations
(like 'mbarrier.test.wait'), they synchronize with this release pattern.
This operation first performs an expect-tx operation with the specified transaction count, then performs an arrive-on operation with an implicit count of 1. The expect-tx operation increases the tx-count of the mbarrier object by the specified expectCount value, setting the current phase to expect and tracks the completion of additional asynchronous transactions.
The operation takes the following operands:
addr: A pointer to the memory location of the mbarrier object. Uses generic addressing, but the address must still be in the shared memory space.txcount: An unsigned integer specifying the expected transaction count for the expect-tx operation. This represents the number of asynchronous transactions expected to complete before the barrier phase completes.predicate: Optional predicate for conditional execution.
nvvm.mbarrier.arrive.nocomplete - MBarrier Arrive No-Complete Operation
Operands
addr- Single,LLVM_AnyPointer, LLVM pointer typecount- Single,I32, 32-bit signless integer
Results
res- Single,LLVM_Type, LLVM dialect-compatible type
Description
The nvvm.mbarrier.arrive.nocomplete operation performs an arrive-on operation
on the mbarrier object with the guarantee that it will not cause the barrier to
complete its current phase. Uses the default .release.cta semantics. This release
pattern establishes memory ordering for operations occurring in program order before
this arrive instruction by making operations from the current thread visible to
subsequent operations in other threads within the CTA. When other threads perform
corresponding acquire operations (like 'mbarrier.test.wait'), they synchronize with
this release pattern.
This operation causes the executing thread to signal its arrival at the barrier with a specified count, but ensures that the barrier phase will not complete as a result of this operation. The operation returns an opaque value that captures the phase of the mbarrier object prior to the arrive-on operation.
The operation takes the following operands:
addr: A pointer to the memory location of the mbarrier object. Uses generic addressing, but the address must still be in the shared memory space.count: Integer specifying the count argument to the arrive-on operation. Must be in the valid range as specified in the mbarrier object contents.
nvvm.mbarrier.init - MBarrier Initialization Op
Operands
addr- Single,LLVM_AnyPointer, LLVM pointer typecount- Single,I32, 32-bit signless integerpredicate- Optional,PtxPredicate, 1-bit signless integer
Description
The nvvm.mbarrier.init operation initializes an mbarrier object at the specified
memory location.
This operation initializes the mbarrier object with the following state:
- Current phase: 0
- Expected arrival count:
count - Pending arrival count:
count - Transaction count (tx-count): 0
The operation takes the following operands:
addr: A pointer to the memory location of the mbarrier object. Uses generic addressing, but the address must still be in the shared memory space.count: Integer specifying the number of threads that will participate in barrier synchronization. Must be in the range [1, 2²⁰ - 1].predicate: Optional predicate for conditional execution.
nvvm.mbarrier.inval - MBarrier Invalidation Operation
Operands
addr- Single,LLVM_AnyPointer, LLVM pointer type
Description
The nvvm.mbarrier.inval operation invalidates an mbarrier object at the
specified memory location.
This operation marks the mbarrier object as invalid, making it safe to repurpose the memory location for other uses or to reinitialize it as a new mbarrier object. It is undefined behavior if the mbarrier object is already invalid.
The operation takes the following operand:
addr: A pointer to the memory location of the mbarrier object. Uses generic addressing, but the address must still be in the shared memory space.
nvvm.mbarrier.test.wait - MBarrier Non-Blocking Test Wait Operation
Operands
addr- Single,LLVM_AnyPointer, LLVM pointer typestate- Single,LLVM_Type, LLVM dialect-compatible type
Results
res- Single,LLVM_Type, LLVM dialect-compatible type
Description
The nvvm.mbarrier.test.wait operation performs a non-blocking test for the
completion of a specific phase of an mbarrier object. It uses the default
.acquire.cta semantics. This acquire pattern establishes memory ordering for
operations occurring in program order after this wait instruction by making
operations from other threads in the CTA visible to subsequent operations in the current
thread. When this wait completes, it synchronizes with the corresponding release
pattern from the mbarrier.arrive operation, establishing memory ordering within
the CTA.
This operation tests whether the mbarrier phase specified by the state operand has completed. It is a non-blocking instruction that immediately returns the completion status without suspending the executing thread.
The operation takes the following operands:
addr: A pointer to the memory location of the mbarrier object. Uses generic addressing, but the address must still be in the shared memory space.state: An opaque value returned by a previousmbarrier.arriveoperation on the same mbarrier object during the current or immediately preceding phase.
The operation returns a boolean value indicating whether the specified phase has completed:
true: The immediately preceding phase has completedfalse: The phase is still incomplete (current phase)
Memory ordering guarantees: When this wait returns true, the following ordering guarantees hold:
- All memory accesses (except async operations) requested prior to
mbarrier.arrivehaving release semantics by participating CTA threads are visible to the executing thread. - All
cp.asyncoperations requested prior tocp.async.mbarrier.arriveby participating CTA threads are visible to the executing thread. - All
cp.async.bulkoperations using the same mbarrier object requested prior tombarrier.arrivehaving release semantics by participating CTA threads are visible to the executing thread. - Memory accesses requested after this wait are not visible to memory
accesses performed prior to
mbarrier.arriveby other participating threads. - No ordering guarantee exists for memory accesses by the same thread
between
mbarrier.arriveand this wait.
nvvm.mbarrier.try_wait.parity - MBarrier Potentially-Blocking Try Wait with Phase Parity
Operands
addr- Single,LLVM_AnyPointer, LLVM pointer typephase- Single,I32, 32-bit signless integerticks- Single,I32, 32-bit signless integer
Description
The nvvm.mbarrier.try_wait.parity operation performs a potentially-blocking
test for the completion of a specific phase of an mbarrier object using phase
parity. It uses the default .acquire.cta semantics. This acquire pattern
establishes memory ordering for operations occurring in program order after this
wait instruction by making operations from other threads in the CTA visible to subsequent
operations in the current thread. When this wait completes, it synchronizes with
the corresponding release pattern from the mbarrier.arrive operation, establishing
memory ordering within the CTA.
This operation waits for the completion of the mbarrier phase indicated by the
phase parity. While it uses the underlying PTX mbarrier.try_wait.parity
instruction, this MLIR operation generates a loop that enforces the test to
complete before continuing execution, ensuring the barrier phase is actually
completed rather than potentially timing out.
The operation takes the following operands:
addr: A pointer to the memory location of the mbarrier object. Uses generic addressing, but the address must still be in the shared memory space.phase: An integer specifying the phase parity (0 or 1). Even phases have parity 0, odd phases have parity 1.ticks: An unsigned integer specifying the suspend time hint in nanoseconds. This may be used instead of the system-dependent time limit.
Memory ordering guarantees: When this wait returns true, the following ordering guarantees hold:
- All memory accesses (except async operations) requested prior to
mbarrier.arrivehaving release semantics by participating CTA threads are visible to the executing thread. - All
cp.asyncoperations requested prior tocp.async.mbarrier.arriveby participating CTA threads are visible to the executing thread. - All
cp.async.bulkoperations using the same mbarrier object requested prior tombarrier.arrivehaving release semantics by participating CTA threads are visible to the executing thread. - Memory accesses requested after this wait are not visible to memory
accesses performed prior to
mbarrier.arriveby other participating threads. - No ordering guarantee exists for memory accesses by the same thread
between
mbarrier.arriveand this wait.
Implementation behavior:
This operation generates a PTX loop that repeatedly calls the underlying
mbarrier.try_wait.parity instruction until the barrier phase completes.
Unlike the raw PTX instruction which may return without completion after a
timeout, this MLIR operation guarantees completion by continuing to loop until
the specified phase is reached.
nvvm.mma.sync - cooperative matrix-multiply and accumulate
Attributes
shape- Single,NVVM_MMAShapeAttr, Attribute for MMA operation shape.b1Op- Optional,MMAB1OpAttr, MMA binary operationsintOverflowBehavior- Optional,MMAIntOverflowAttr, MMA overflow optionslayoutA- Single,MMALayoutAttr, NVVM MMA layoutlayoutB- Single,MMALayoutAttr, NVVM MMA layoutmultiplicandAPtxType- Optional,MMATypesAttr, NVVM MMA typesmultiplicandBPtxType- Optional,MMATypesAttr, NVVM MMA types
Operands
operandA- Variadic,LLVM_Type, variadic of LLVM dialect-compatible typeoperandB- Variadic,LLVM_Type, variadic of LLVM dialect-compatible typeoperandC- Variadic,LLVM_Type, variadic of LLVM dialect-compatible type
Results
res- Single,LLVM_AnyStruct, LLVM structure type
Description
The nvvm.mma.sync operation collectively performs the operation
D = matmul(A, B) + C using all threads in a warp.
All the threads in the warp must execute the same mma.sync operation.
For each possible multiplicand PTX data type, there are one or more possible instruction shapes given as "mMnNkK". The below table describes the posssibilities as well as the types required for the operands. Note that the data type for C (the accumulator) and D (the result) can vary independently when there are multiple possibilities in the "C/D Type" column.
When an optional attribute cannot be immediately inferred from the types of the operands and the result during parsing or validation, an error will be raised.
b1Op is only relevant when the binary (b1) type is given to
multiplicandDataType. It specifies how the multiply-and-acumulate is
performed and is either xor_popc or and_poc. The default is xor_popc.
intOverflowBehavior is only relevant when the multiplicandType attribute
is one of u8, s8, u4, s4, this attribute describes how overflow is handled
in the accumulator. When the attribute is satfinite, the accumulator values
are clamped in the int32 range on overflow. This is the default behavior.
Alternatively, accumulator behavior wrapped can also be specified, in
which case overflow wraps from one end of the range to the other.
layoutA and layoutB are required and should generally be set to
#nvvm.mma_layout<row> and #nvvm.mma_layout<col> respectively, but other
combinations are possible for certain layouts according to the table below.
| A/B Type | Shape | ALayout | BLayout | A Type | B Type | C/D Type |
|----------|-----------|---------|---------|----------|----------|-------------------|
| f64 | .m8n8k4 | row | col | 1x f64 | 1x f64 | 2x f64 |
| f16 | .m8n8k4 | row/col | row/col | 2x f16x2 | 2x f16x2 | 4x f16x2 or 8xf32 |
| | .m16n8k8 | row | col | 2x f16x2 | 1x f16x2 | 2x f16x2 or 4 f32 |
| | .m16n8k16 | row | col | 4x f16x2 | 2x f16x2 | 2x f16x2 or 4 f32 |
| bf16 | .m16n8k8 | row | col | 2x i32 | 1x i32 | 4x f32 |
| | .m16n8k16 | row | col | 4x i32 | 2x i32 | 4x f32 |
| tf32 | .m16n8k4 | row | col | 2x i32 | 1x i32 | 4x f32 |
| | .m16n8k8 | row | col | 4x i32 | 2x i32 | 2x f16x2 or 4 f32 |
| u8/s8 | .m8n8k16 | row | col | 1x i32 | 1x i32 | 2x i32 |
| | .m16n8k16 | row | col | 2x i32 | 1x i32 | 4x i32 |
| | .m16n8k32 | row | col | 4x i32 | 2x i32 | 4x i32 |
| u4/s4 | .m8n8k32 | row | col | 1x i32 | 1x i32 | 2x i32 |
| | m16n8k32 | row | col | 2x i32 | 1x i32 | 4x i32 |
| | m16n8k64 | row | col | 4x i32 | 2x i32 | 4x i32 |
| b1 | m8n8k128 | row | col | 1x i32 | 1x i32 | 2x i32 |
| | m16n8k128 | row | col | 2x i32 | 1x i32 | 4x i32 |Example:
%128 = nvvm.mma.sync A[%120, %121, %122, %123]
B[%124, %125]
C[%126, %127]
{layoutA = #nvvm.mma_layout<row>,
layoutB = #nvvm.mma_layout<col>,
shape = {k = 16 : i32, m = 16 : i32, n = 8 : i32}}
: (vector<2xf16>, vector<2xf16>, vector<2xf16>)
-> !llvm.struct<(vector<2xf16>, vector<2xf16>)>
nvvm.nanosleep
nvvm.pmevent
nvvm.prefetch - Brings the cache line containing an address into the specified cache level
Attributes
cacheLevel- Optional,PrefetchCacheLevelAttr, NVVM Prefetch Cache LevelevictPriority- Optional,CacheEvictionPriorityAttr, NVVM Cache Eviction Prioritytensormap- Optional,UnitAttr, unit attributeuniform- Optional,UnitAttr, unit attributein_param_space- Optional,UnitAttr, unit attribute
Operands
addr- Single, anonymous/composite constraint, LLVM pointer in address space 1 or LLVM pointer in address space 5 or LLVM pointer in address space 0 or LLVM pointer in address space 4predicate- Optional,PtxPredicate, 1-bit signless integer
Description
Prefetches the cache line containing the address given by addr. The
operand may be a global, local, or generic pointer. When tensormap is
specified, the operand may instead be a constant or generic pointer. If the
address maps to shared memory, the operation has no effect.
At most one of cacheLevel or tensormap may be present. The cacheLevel
attribute selects the target cache level. When combined with uniform, the
prefetch is performed to the uniform cache, in which case addr must be a
generic pointer.
When tensormap is used, the line containing addr is brought from the
constant or parameter state space for later use by cp.async.bulk.tensor.
If in_param_space is specified, the generic pointer is interpreted as
referring to the parameter state space.
uniform can be specified after the cacheLevel to indicate that the
prefetch is performed to the specified uniform cache level. If uniform is
specified, addr must be a generic address pointer and no operation is
performed if addr maps to a const, local, or shared memory location.
The evictPriority attribute is optional and specifies the cache eviction
priority when cacheLevel is L2.
nvvm.rcp.approx.ftz.f
Operands
arg- Single,F32, 32-bit float
Results
res- Single,F32, 32-bit float
nvvm.read.ptx.sreg.clock64
nvvm.read.ptx.sreg.clock
nvvm.read.ptx.sreg.cluster.ctaid.x
nvvm.read.ptx.sreg.cluster.ctaid.y
nvvm.read.ptx.sreg.cluster.ctaid.z
nvvm.read.ptx.sreg.cluster.ctarank
nvvm.read.ptx.sreg.cluster.nctaid.x
nvvm.read.ptx.sreg.cluster.nctaid.y
nvvm.read.ptx.sreg.cluster.nctaid.z
nvvm.read.ptx.sreg.cluster.nctarank
nvvm.read.ptx.sreg.clusterid.x
nvvm.read.ptx.sreg.clusterid.y
nvvm.read.ptx.sreg.clusterid.z
nvvm.read.ptx.sreg.ctaid.x
nvvm.read.ptx.sreg.ctaid.y
nvvm.read.ptx.sreg.ctaid.z
nvvm.read.ptx.sreg.envreg0
nvvm.read.ptx.sreg.envreg1
nvvm.read.ptx.sreg.envreg2
nvvm.read.ptx.sreg.envreg3
nvvm.read.ptx.sreg.envreg4
nvvm.read.ptx.sreg.envreg5
nvvm.read.ptx.sreg.envreg6
nvvm.read.ptx.sreg.envreg7
nvvm.read.ptx.sreg.envreg8
nvvm.read.ptx.sreg.envreg9
nvvm.read.ptx.sreg.envreg10
nvvm.read.ptx.sreg.envreg11
nvvm.read.ptx.sreg.envreg12
nvvm.read.ptx.sreg.envreg13
nvvm.read.ptx.sreg.envreg14
nvvm.read.ptx.sreg.envreg15
nvvm.read.ptx.sreg.envreg16
nvvm.read.ptx.sreg.envreg17
nvvm.read.ptx.sreg.envreg18
nvvm.read.ptx.sreg.envreg19
nvvm.read.ptx.sreg.envreg20
nvvm.read.ptx.sreg.envreg21
nvvm.read.ptx.sreg.envreg22
nvvm.read.ptx.sreg.envreg23
nvvm.read.ptx.sreg.envreg24
nvvm.read.ptx.sreg.envreg25
nvvm.read.ptx.sreg.envreg26
nvvm.read.ptx.sreg.envreg27
nvvm.read.ptx.sreg.envreg28
nvvm.read.ptx.sreg.envreg29
nvvm.read.ptx.sreg.envreg30
nvvm.read.ptx.sreg.envreg31
nvvm.read.ptx.sreg.globaltimer
nvvm.read.ptx.sreg.globaltimer.lo
nvvm.read.ptx.sreg.gridid
nvvm.read.ptx.sreg.laneid
nvvm.read.ptx.sreg.lanemask.eq
nvvm.read.ptx.sreg.lanemask.ge
nvvm.read.ptx.sreg.lanemask.gt
nvvm.read.ptx.sreg.lanemask.le
nvvm.read.ptx.sreg.lanemask.lt
nvvm.read.ptx.sreg.nclusterid.x
nvvm.read.ptx.sreg.nclusterid.y
nvvm.read.ptx.sreg.nclusterid.z
nvvm.read.ptx.sreg.nctaid.x
nvvm.read.ptx.sreg.nctaid.y
nvvm.read.ptx.sreg.nctaid.z
nvvm.read.ptx.sreg.nsmid
nvvm.read.ptx.sreg.ntid.x
nvvm.read.ptx.sreg.ntid.y
nvvm.read.ptx.sreg.ntid.z
nvvm.read.ptx.sreg.nwarpid
nvvm.read.ptx.sreg.smid
nvvm.read.ptx.sreg.tid.x
nvvm.read.ptx.sreg.tid.y
nvvm.read.ptx.sreg.tid.z
nvvm.read.ptx.sreg.warpid
nvvm.read.ptx.sreg.warpsize
nvvm.redux.sync - Redux Sync Op
Attributes
kind- Single,ReduxKindAttr, NVVM redux kindabs- Single,BoolAttr, bool attributenan- Single,BoolAttr, bool attribute
Operands
val- Single,LLVM_Type, LLVM dialect-compatible typemask_and_clamp- Single,I32, 32-bit signless integer
Results
res- Single,LLVM_Type, LLVM dialect-compatible type
Description
redux.sync performs a reduction operation kind of the 32 bit source
register across all non-exited threads in the membermask.
The abs and nan attributes can be used in the case of f32 input type,
where the abs attribute causes the absolute value of the input to be used
in the reduction operation, and the nan attribute causes the reduction
operation to return NaN if any of the inputs to participating threads are
NaN.
nvvm.setmaxregister
nvvm.shfl.sync - NVVM Dialect Op for shfl.sync
Attributes
kind- Single,ShflKindAttr, NVVM shuffle kindreturn_value_and_is_valid- Optional,UnitAttr, unit attribute
Operands
thread_mask- Single,I32, 32-bit signless integerval- Single,LLVM_Type, LLVM dialect-compatible typeoffset- Single,I32, 32-bit signless integermask_and_clamp- Single,I32, 32-bit signless integer
Results
res- Single,LLVM_Type, LLVM dialect-compatible type
Description
The shfl.sync Op implements data shuffle within threads of a warp.
The thread_mask denotes the threads participating in the Op where
the bit position corresponds to a particular thread's laneid.
The offset specifies a source lane or source lane offset
(depending on kind). The val is the input value to be copied from
the source. The mask_and_clamp contains two packed values specifying
a mask for logically splitting warps into sub-segments and an upper bound
for clamping the source lane index.
nvvm.st.bulk - Bulk Store Op
Attributes
initVal- Single,I64Attr, 64-bit signless integer attribute
Operands
addr- Single, anonymous/composite constraint, LLVM pointer in address space 0 or LLVM pointer in address space 3size- Single,I64, 64-bit signless integer
Description
Initializes a region of shared memory at the address given by addr.
The size operand specifies the number of bytes to initialize and must be
a multiple of 8.
The initVal operand specifies the value to initialize the memory to. The
only supported value is 0.
nvvm.stmatrix - cooperative matrix store
Attributes
layout- Single,MMALayoutAttr, NVVM MMA layoutshape- Single,LdStMatrixShapeAttr, Matrix shape for ldmatrix and stmatrixeltType- Single,LdStMatrixEltTypeAttr, Element type for ldmatrix and stmatrix
Operands
ptr- Single,LLVM_PointerShared, LLVM pointer in address space 3sources- Variadic,I32, variadic of 32-bit signless integer
Description
Collectively store one or more matrices across all threads in a warp to the location indicated by the address operand $ptr in shared memory.
nvvm.tcgen05.alloc - Tcgen05 alloc operation
Attributes
group- Single,CTAGroupKindAttr, NVVM CTA group kind
Operands
addr- Single, anonymous/composite constraint, LLVM pointer type or LLVM pointer in address space 3nCols- Single,I32, 32-bit signless integer
Description
The tcgen05.alloc Op allocates tensor core memory for
the amount specified by nCols and writes the destination
address to the addr argument. The nCols operand specifies the
number of columns to be allocated and it must be a power-of-two.
For more information, see PTX ISA
nvvm.tcgen05.commit - Tcgen05 commit operations
Attributes
group- Single,CTAGroupKindAttr, NVVM CTA group kind
Operands
addr- Single, anonymous/composite constraint, LLVM pointer type or LLVM pointer in address space 3multicastMask- Optional,I16, 16-bit signless integer
Description
The tcgen05.commit makes the mbarrier object, specified by
the operand addr, track the completion of all the prior
async-tcgen05 operations initiated by the executing thread.
The multicast variants allow signaling on the mbarrier objects
of multiple CTAs within the cluster. Operand multicastMask,
when present, specifies the destination CTAs in the cluster such
that each bit position in the 16-bit multicastMask operand
corresponds to the nvvm.read.ptx.sreg.ctaid of the destination CTA.
For more information, see PTX ISA
nvvm.tcgen05.cp - Tcgen05 copy operation
Attributes
shape- Single,Tcgen05CpShapeAttr, tcgen05 cp shapesgroup- Single,CTAGroupKindAttr, NVVM CTA group kindmulticast- Single,Tcgen05CpMulticastAttr, tcgen05 cp multicastsrcFormat- Optional,Tcgen05CpSrcFormatAttr, tcgen05 cp source format
Operands
taddr- Single,LLVM_PointerTensor, LLVM pointer in address space 6smem_desc- Single,I64, 64-bit signless integer
Description
Instruction tcgen05.cp initiates an asynchronous copy operation from
shared memory to the location specified by the address operand taddr
in the Tensor Memory. The 64-bit register operand smem_desc specifies
the matrix descriptor representing the source matrix in the shared memory
that needs to be copied.
Example:
nvvm.tcgen05.cp %taddr, %smem_desc {
group = #nvvm.tcgen05_group<cta_2>,
shape = #nvvm.tcgen05_cp_shape<shape_64x128b>,
multicast = #nvvm.tcgen05_cp_multicast<warpx2_01_23>,
srcFormat = #nvvm.tcgen05_cp_src_fmt<b6x16_p32>
}
nvvm.tcgen05.dealloc - Tcgen05 dealloc operation
Attributes
group- Single,CTAGroupKindAttr, NVVM CTA group kind
Operands
taddr- Single,LLVM_PointerTensor, LLVM pointer in address space 6nCols- Single,I32, 32-bit signless integer
Description
The tcgen05.dealloc Op de-allocates the tensor core memory
specified by tmemAddr, which must be from a previous tensor
memory allocation. The nCols operand specifies the number
of columns to be de-allocated, and it must be a power-of-two.
For more information, see PTX ISA
nvvm.tcgen05.fence
nvvm.tcgen05.ld - tensor memory load instructions
Attributes
pack- Optional,UnitAttr, unit attributeshape- Single,Tcgen05LdStShapeAttr, allowed 32-bit signless integer cases: 0, 1, 2, 3, 4
Operands
tmemAddr- Single,LLVM_PointerTensor, LLVM pointer in address space 6offset- Optional,I64, 64-bit signless integer
Results
res- Single, anonymous/composite constraint, 32-bit signless integer or vector of 32-bit signless integer values of length 2/4/8/16/32/64/128
Description
Instruction tcgen05.ld asynchronously loads data from the Tensor Memory at
the location specified by the 32-bit address operand tmemAddr into the
destination register res, collectively across all threads of the warps.
The shape and the num attribute together determines the total
dimension of the data which is loaded from the Tensor Memory. The shape
attribute indicates the base dimension of data to be accessed as described
in the Data Movement Shape. The num attribute indicates the repeat
factor on the base dimension resulting in the total dimension of the data
that is accessed.
The shape 16x32bx2 performs two accesses into Tensor Memory of the shape
16x32b. The base address of the first access is specified by tmemAddr
and the base address of the second access is specified by
tmemAddr + offset, where offset is an immediate argument.
The unit attribute pack can be used to pack two 16-bit
elements from adjacent columns into a single 32-bit element during the load.
The following table describes the size of the vector for various combinations
of num and shape attributes:
|=====================================================================|
| num/shape | 16x32bx2/16x64b/32x32b | 16x128b | 16x256b |
|=====================================================================|
| x1 | 1 | 2 | 4 |
| x2 | 2 | 4 | 8 |
| x4 | 4 | 8 | 16 |
| x8 | 8 | 16 | 32 |
| x16 | 16 | 32 | 64 |
| x32 | 32 | 64 | 128 |
| x64 | 64 | 128 | NA |
| x128 | 128 | NA | NA |
|=====================================================================|Example:
nvvm.tcgen05.ld %tmemAddr, %offset pack {
shape = #nvvm.tcgen05_ldst_shape<shape_16x32bx2>,
} : <2xi32>
nvvm.tcgen05.mma_smem_desc - Constructs a Shared Memory descriptor for MMA Operands A or B
Operands
startAddr- Single,I32, 32-bit signless integerleadingDimOffset- Single,I32, 32-bit signless integerstrideDimOffset- Single,I32, 32-bit signless integerbaseOffset- Single,I8, 8-bit signless integerleadingDimMode- Single,I1, 1-bit signless integerswizzleMode- Single,I8, 8-bit signless integer
Results
res- Single,I64, 64-bit signless integer
Description
The nvvm.tcgen05_mma_smem_desc constructs a Shared Memory descriptor
for tcgen05.mma. This descriptor is a 64-bit value which describes the
properties of multiplicand matrix in shared memory including its location
in the shared memory of the current CTA.
+-----------+------+------------------------------------------------------+
| Bit-field | Size | Description |
+-----------+------+------------------------------------------------------+
| 0-13 | 14 | Matrix start address |
| 14-15 | 2 | Reserved |
| 16-29 | 14 | Leading dim relative-offset (or) absolute-address |
| 30-31 | 2 | Reserved |
| 32-45 | 14 | Stride dimension byte offset |
| 46-48 | 3 | Fixed constant value of 0b001 |
| 49-51 | 3 | Matrix base offset |
| 52 | 1 | Leading dimension stride mode: |
| | | 0: byte offset relative |
| | | 1: byte address absolute |
| 53-60 | 8 | Fixed constant value of 0xb00000000 |
| 61-63 | 3 | Swizzling mode: |
| | | 0: No swizzling |
| | | 1: 128-Byte with 32B atomic swizzling |
| | | 2: 128-Byte swizzling |
| | | 4: 64-Byte swizzling |
| | | 6: 32-Byte swizzling |
| | | (Values 3, 5 and 7 are invalid) |
+-----------+------+------------------------------------------------------+ Example:
%desc = nvvm.tcgen05.mma_smem_desc (%startAddr, %leadingDimOffset, %strideDimOffset,
%baseOffset, %leadingDimMode, %swizzleMode) : (i32, i32, i32, i8, i1, i8) -> i64
nvvm.tcgen05.relinquish_alloc_permit
nvvm.tcgen05.shift - Tcgen05 shift operation
Attributes
group- Single,CTAGroupKindAttr, NVVM CTA group kind
Operands
taddr- Single,LLVM_PointerTensor, LLVM pointer in address space 6
Description
The tcgen05.shift is an asynchronous instruction which initiates
the shifting of 32-byte elements downwards across all the rows,
except the last, by one row. The operand taddr specifies the base
address of the matrix in Tensor Memory whose rows must be down shifted.
nvvm.tcgen05.st - tensor memory store instructions
Attributes
unpack- Optional,UnitAttr, unit attributeshape- Single,Tcgen05LdStShapeAttr, allowed 32-bit signless integer cases: 0, 1, 2, 3, 4
Operands
tmemAddr- Single,LLVM_PointerTensor, LLVM pointer in address space 6val- Single, anonymous/composite constraint, 32-bit signless integer or vector of 32-bit signless integer values of length 2/4/8/16/32/64/128offset- Optional,I64, 64-bit signless integer
Description
Instruction tcgen05.st asynchronously stores data from the source register r
into the Tensor Memory at the location specified by the 32-bit address operand
tmemAddr, collectively across all threads of the warps.
The shape and the num attribute together determines the total dimension of
the data which is stored to the Tensor Memory. The shape indicates the base
dimension of data to be accessed. The num attribute indicates the repeat
factor on the base dimension resulting in the total dimension of the data that
is accessed.
The shape 16x32bx2 performs two accesses into Tensor Memory of the shape
16x32b. The base address of the first access is specified by tmemAddr
and the base address of the second access is specified by
tmemAddr + offset, where offset is an immediate argument.
The unit attribute unpack can be used to unpack a 32-bit element
in the register into two 16-bit elements and store them in adjacent columns.
The following table describes the size of the vector for various combinations
of num and shape attributes:
|=====================================================================|
| num/shape | 16x32bx2/16x64b/32x32b | 16x128b | 16x256b |
|=====================================================================|
| x1 | 1 | 2 | 4 |
| x2 | 2 | 4 | 8 |
| x4 | 4 | 8 | 16 |
| x8 | 8 | 16 | 32 |
| x16 | 16 | 32 | 64 |
| x32 | 32 | 64 | 128 |
| x64 | 64 | 128 | NA |
| x128 | 128 | NA | NA |
|=====================================================================|Example:
nvvm.tcgen05.st %tmemAddr, %val, %offset unpack {
shape = #nvvm.tcgen05_ldst_shape<shape_16x32bx2>,
} : <2xi32>
nvvm.tcgen05.wait
nvvm.vote.sync - Vote across thread group
Attributes
kind- Single,VoteSyncKindAttr, NVVM vote sync kind
Operands
mask- Single,I32, 32-bit signless integerpred- Single,I1, 1-bit signless integer
Results
res- Single, anonymous/composite constraint, 32-bit signless integer or 1-bit signless integer
Description
The vote.sync op will cause executing thread to wait until all non-exited
threads corresponding to membermask have executed vote.sync with the same
qualifiers and same membermask value before resuming execution.
The vote operation kinds are:
any: True if source predicate is True for some thread in membermask.all: True if source predicate is True for all non-exited threads in membermask.uni: True if source predicate has the same value in all non-exited threads in membermask.ballot: In the ballot form, the destination result is a 32 bit integer. In this form, the predicate from each thread in membermask are copied into the corresponding bit position of the result, where the bit position corresponds to the thread's lane id.
nvvm.wgmma.commit.group.sync.aligned
nvvm.wgmma.fence.aligned
nvvm.wgmma.mma_async
Attributes
shape- Single,NVVM_MMAShapeAttr, Attribute for MMA operation shape.typeA- Single,WGMMATypesAttr, NVVM WGMMA typestypeB- Single,WGMMATypesAttr, NVVM WGMMA typestypeD- Single,WGMMATypesAttr, NVVM WGMMA typesscaleD- Single,WGMMAScaleOutAttr, WGMMA input predicatescaleA- Single,WGMMAScaleInAttr, WGMMA overflow optionsscaleB- Single,WGMMAScaleInAttr, WGMMA overflow optionslayoutA- Single,MMALayoutAttr, NVVM MMA layoutlayoutB- Single,MMALayoutAttr, NVVM MMA layoutsatfinite- Optional,MMAIntOverflowAttr, MMA overflow options
Operands
inouts- Single,LLVM_AnyStruct, LLVM structure typedescriptorA- Single,I64, 64-bit signless integerdescriptorB- Single,I64, 64-bit signless integer
Results
results- Single,LLVM_AnyStruct, LLVM structure type
Description
The warpgroup (128 threads) level matrix multiply and accumulate operation has either of the following forms, where matrix D is called accumulator: D = A B + D D = A B, where the input from accumulator D is disabled.
Supported shapes:
|--------------|--------------|------------|--------------|---------------|
| | | | |f16+=e4m3*e4m3 |
| | | | |f16+=e5m2*e5m2 |
|f32+=tf32*tf32|f16+=f16 *f16 | s32+=s8*s8 |s32 += b1 * b1|f16+=e5m2*e4m3 |
| |f32+=f16 *f16 | s32+=u8*u8 | |f16+=e4m3*e5m2 |
| |f32+=bf16*bf16| s32+=u8*u8 | |f16+=e4m3*e5m2 |
| |f32+=bf16*bf16| s32+=s8*u8 | |f32+=e4m3*e4m3 |
| | | s32+=u8*s8 | |f32+=e5m2*e5m2 |
| | | | |f32+=e4m3*e5m2 |
| | | | |f32+=e4m3*e5m2 |
|--------------|--------------|------------|--------------|---------------|
| .m64n8k8 | .m64n8k16 | .m64n8k32 | .m64n8k256 | .m64n8k32 |
| .m64n16k8 | .m64n16k16 | .m64n16k32 | .m64n16k256 | .m64n16k32 |
| .m64n24k8 | .m64n24k16 | .m64n24k32 | .m64n24k256 | .m64n24k32 |
| .m64n32k8 | .m64n32k16 | .m64n32k32 | .m64n32k256 | .m64n32k32 |
| .m64n40k8 | .m64n40k16 | .m64n48k32 | .m64n48k256 | .m64n40k32 |
| .m64n48k8 | .m64n48k16 | .m64n64k32 | .m64n64k256 | .m64n48k32 |
| .m64n56k8 | .m64n56k16 | .m64n80k32 | .m64n80k256 | .m64n56k32 |
| .m64n64k8 | .m64n64k16 | .m64n96k32 | .m64n96k256 | .m64n64k32 |
| .m64n72k8 | .m64n72k16 | .m64n112k32| .m64n112k256 | .m64n72k32 |
| .m64n80k8 | .m64n80k16 | .m64n128k32| .m64n128k256 | .m64n80k32 |
| .m64n88k8 | .m64n88k16 | .m64n144k32| .m64n144k256 | .m64n88k32 |
| .m64n96k8 | .m64n96k16 | .m64n160k32| .m64n160k256 | .m64n96k32 |
| .m64n104k8 | .m64n104k16 | .m64n176k32| .m64n176k256 | .m64n104k32 |
| .m64n112k8 | .m64n112k16 | .m64n192k32| .m64n192k256 | .m64n112k32 |
| .m64n120k8 | .m64n120k16 | .m64n208k32| .m64n208k256 | .m64n120k32 |
| .m64n128k8 | .m64n128k16 | .m64n224k32| .m64n224k256 | .m64n128k32 |
| .m64n136k8 | .m64n136k16 | .m64n240k32| .m64n240k256 | .m64n136k32 |
| .m64n144k8 | .m64n144k16 | .m64n256k32| .m64n256k256 | .m64n144k32 |
| .m64n152k8 | .m64n152k16 | | | .m64n152k32 |
| .m64n160k8 | .m64n160k16 | | | .m64n160k32 |
| .m64n168k8 | .m64n168k16 | | | .m64n168k32 |
| .m64n176k8 | .m64n176k16 | | | .m64n176k32 |
| .m64n184k8 | .m64n184k16 | | | .m64n184k32 |
| .m64n192k8 | .m64n192k16 | | | .m64n192k32 |
| .m64n200k8 | .m64n200k16 | | | .m64n200k32 |
| .m64n208k8 | .m64n208k16 | | | .m64n208k32 |
| .m64n216k8 | .m64n216k16 | | | .m64n216k32 |
| .m64n224k8 | .m64n224k16 | | | .m64n224k32 |
| .m64n232k8 | .m64n232k16 | | | .m64n232k32 |
| .m64n240k8 | .m64n240k16 | | | .m64n240k32 |
| .m64n248k8 | .m64n248k16 | | | .m64n248k32 |
| .m64n256k8 | .m64n256k16 | | | .m64n256k32 |
|--------------|--------------|------------|--------------|---------------|
nvvm.wgmma.wait.group.sync.aligned
nvvm.wmma.load - Warp synchronous matrix load
Attributes
m- Single,I32Attr, 32-bit signless integer attributen- Single,I32Attr, 32-bit signless integer attributek- Single,I32Attr, 32-bit signless integer attributelayout- Single,MMALayoutAttr, NVVM MMA layouteltype- Single,MMATypesAttr, NVVM MMA typesfrag- Single,MMAFragAttr, NVVM MMA frag type
Operands
ptr- Single,LLVM_AnyPointer, LLVM pointer typestride- Single,I32, 32-bit signless integer
Results
res- Single,LLVM_AnyStruct, LLVM structure type
nvvm.wmma.mma - Warp synchronous matrix-multiply accumulate using tensor cores.
Attributes
m- Single,I32Attr, 32-bit signless integer attributen- Single,I32Attr, 32-bit signless integer attributek- Single,I32Attr, 32-bit signless integer attributelayoutA- Single,MMALayoutAttr, NVVM MMA layoutlayoutB- Single,MMALayoutAttr, NVVM MMA layouteltypeA- Single,MMATypesAttr, NVVM MMA typeseltypeB- Single,MMATypesAttr, NVVM MMA types
Operands
args- Variadic,LLVM_Type, variadic of LLVM dialect-compatible type
Results
res- Single,LLVM_AnyStruct, LLVM structure type
nvvm.wmma.store - Warp synchronous matrix store
Attributes
m- Single,I32Attr, 32-bit signless integer attributen- Single,I32Attr, 32-bit signless integer attributek- Single,I32Attr, 32-bit signless integer attributelayout- Single,MMALayoutAttr, NVVM MMA layouteltype- Single,MMATypesAttr, NVVM MMA types
Operands
ptr- Single,LLVM_AnyPointer, LLVM pointer typeargs- Variadic,LLVM_Type, variadic of LLVM dialect-compatible typestride- Single,I32, 32-bit signless integer