Beaver.MLIR.Dialect.NVVM (beaver v0.4.7)

Summary

Functions

nvvm.bar.warp.sync - Warp Barrier Synchronization Op

nvvm.barrier0

nvvm.barrier - CTA Barrier Synchronization Op

nvvm.barrier.arrive

nvvm.breakpoint

nvvm.cluster.arrive

nvvm.cluster.arrive.relaxed

nvvm.cluster.wait

nvvm.clusterlaunchcontrol.query.cancel - Query the response of a clusterlaunchcontrol.try.cancel operation

nvvm.clusterlaunchcontrol.try.cancel - Request atomically canceling the launch of a cluster that has not started running yet

nvvm.convert.bf16x2.to.f8x2 - Convert a pair of bf16 inputs to f8x2

nvvm.convert.f16x2.to.f8x2 - Convert an f16x2 input to f8x2

nvvm.convert.f32x2.to.f6x2 - Convert a pair of float inputs to f6x2

nvvm.convert.f32x2.to.f8x2 - Convert a pair of float inputs to f8x2

nvvm.convert.float.to.tf32 - Convert the given float input to TF32

nvvm.cp.async.bulk.commit.group

nvvm.cp.async.bulk.global.shared.cta - Async bulk copy from Shared CTA memory to Global memory

nvvm.cp.async.bulk.prefetch - Async bulk prefetch from global memory to L2 cache

nvvm.cp.async.bulk.shared.cluster.global - Async bulk copy from global memory to Shared cluster memory

nvvm.cp.async.bulk.shared.cluster.shared.cta - Async bulk copy from Shared CTA memory to Shared cluster memory

nvvm.cp.async.bulk.tensor.global.shared.cta

nvvm.cp.async.bulk.tensor.prefetch

nvvm.cp.async.bulk.tensor.reduce

nvvm.cp.async.bulk.tensor.shared.cluster.global

nvvm.cp.async.bulk.wait_group

nvvm.cp.async.commit.group

nvvm.cp.async.mbarrier.arrive - NVVM Dialect Op for cp.async.mbarrier.arrive

nvvm.cp.async.mbarrier.arrive.shared - NVVM Dialect Op for cp.async.mbarrier.arrive.shared

nvvm.cp.async.shared.global

nvvm.cp.async.wait.group

nvvm.dot.accumulate.2way - Two-way 16-bit to 8-bit dot product-accumulate instruction

nvvm.dot.accumulate.4way - Four-way byte dot product-accumulate instruction

nvvm.elect.sync - Elect one leader thread

nvvm.exit

nvvm.fence.mbarrier.init

nvvm.fence.proxy

nvvm.fence.proxy.acquire - Uni-directional proxy fence operation with acquire semantics

nvvm.fence.proxy.release

nvvm.fence.sc.cluster

nvvm.griddepcontrol

nvvm.inline_ptx - Inline PTX Op

nvvm.ldmatrix - cooperative matrix load

nvvm.mapa

nvvm.match.sync - Broadcast and compare a value across threads in warp

nvvm.mbarrier.arrive - MBarrier Arrive Operation

nvvm.mbarrier.arrive.expect_tx - MBarrier Arrive with Expected Transaction Count

nvvm.mbarrier.arrive.expect_tx.shared - Shared MBarrier Arrive with Expected Transaction Count

nvvm.mbarrier.arrive.nocomplete - MBarrier Arrive No-Complete Operation

nvvm.mbarrier.arrive.nocomplete.shared - Shared MBarrier Arrive No-Complete Operation

nvvm.mbarrier.arrive.shared - Shared MBarrier Arrive Operation

nvvm.mbarrier.init - MBarrier Initialization Op

nvvm.mbarrier.init.shared - Shared MBarrier Initialization Op

nvvm.mbarrier.inval - MBarrier Invalidation Operation

nvvm.mbarrier.inval.shared - Shared MBarrier Invalidation Operation

nvvm.mbarrier.test.wait - MBarrier Non-Blocking Test Wait Operation

nvvm.mbarrier.test.wait.shared - Shared MBarrier Non-Blocking Test Wait Operation

nvvm.mbarrier.try_wait.parity - MBarrier Potentially-Blocking Try Wait with Phase Parity

nvvm.mbarrier.try_wait.parity.shared - Shared MBarrier Potentially-Blocking Try Wait with Phase Parity

nvvm.mma.sync - cooperative matrix-multiply and accumulate

nvvm.nanosleep

nvvm.pmevent

nvvm.prefetch - Brings the cache line containing an address into the specified cache level

nvvm.rcp.approx.ftz.f

nvvm.read.ptx.sreg.clock64

nvvm.read.ptx.sreg.clock

nvvm.read.ptx.sreg.cluster.ctaid.x

nvvm.read.ptx.sreg.cluster.ctaid.y

nvvm.read.ptx.sreg.cluster.ctaid.z

nvvm.read.ptx.sreg.cluster.ctarank

nvvm.read.ptx.sreg.cluster.nctaid.x

nvvm.read.ptx.sreg.cluster.nctaid.y

nvvm.read.ptx.sreg.cluster.nctaid.z

nvvm.read.ptx.sreg.cluster.nctarank

nvvm.read.ptx.sreg.clusterid.x

nvvm.read.ptx.sreg.clusterid.y

nvvm.read.ptx.sreg.clusterid.z

nvvm.read.ptx.sreg.ctaid.x

nvvm.read.ptx.sreg.ctaid.y

nvvm.read.ptx.sreg.ctaid.z

nvvm.read.ptx.sreg.envreg0

nvvm.read.ptx.sreg.envreg1

nvvm.read.ptx.sreg.envreg2

nvvm.read.ptx.sreg.envreg3

nvvm.read.ptx.sreg.envreg4

nvvm.read.ptx.sreg.envreg5

nvvm.read.ptx.sreg.envreg6

nvvm.read.ptx.sreg.envreg7

nvvm.read.ptx.sreg.envreg8

nvvm.read.ptx.sreg.envreg9

nvvm.read.ptx.sreg.envreg10

nvvm.read.ptx.sreg.envreg11

nvvm.read.ptx.sreg.envreg12

nvvm.read.ptx.sreg.envreg13

nvvm.read.ptx.sreg.envreg14

nvvm.read.ptx.sreg.envreg15

nvvm.read.ptx.sreg.envreg16

nvvm.read.ptx.sreg.envreg17

nvvm.read.ptx.sreg.envreg18

nvvm.read.ptx.sreg.envreg19

nvvm.read.ptx.sreg.envreg20

nvvm.read.ptx.sreg.envreg21

nvvm.read.ptx.sreg.envreg22

nvvm.read.ptx.sreg.envreg23

nvvm.read.ptx.sreg.envreg24

nvvm.read.ptx.sreg.envreg25

nvvm.read.ptx.sreg.envreg26

nvvm.read.ptx.sreg.envreg27

nvvm.read.ptx.sreg.envreg28

nvvm.read.ptx.sreg.envreg29

nvvm.read.ptx.sreg.envreg30

nvvm.read.ptx.sreg.envreg31

nvvm.read.ptx.sreg.globaltimer

nvvm.read.ptx.sreg.globaltimer.lo

nvvm.read.ptx.sreg.gridid

nvvm.read.ptx.sreg.laneid

nvvm.read.ptx.sreg.lanemask.eq

nvvm.read.ptx.sreg.lanemask.ge

nvvm.read.ptx.sreg.lanemask.gt

nvvm.read.ptx.sreg.lanemask.le

nvvm.read.ptx.sreg.lanemask.lt

nvvm.read.ptx.sreg.nclusterid.x

nvvm.read.ptx.sreg.nclusterid.y

nvvm.read.ptx.sreg.nclusterid.z

nvvm.read.ptx.sreg.nctaid.x

nvvm.read.ptx.sreg.nctaid.y

nvvm.read.ptx.sreg.nctaid.z

nvvm.read.ptx.sreg.nsmid

nvvm.read.ptx.sreg.ntid.x

nvvm.read.ptx.sreg.ntid.y

nvvm.read.ptx.sreg.ntid.z

nvvm.read.ptx.sreg.nwarpid

nvvm.read.ptx.sreg.smid

nvvm.read.ptx.sreg.tid.x

nvvm.read.ptx.sreg.tid.y

nvvm.read.ptx.sreg.tid.z

nvvm.read.ptx.sreg.warpid

nvvm.read.ptx.sreg.warpsize

nvvm.redux.sync - Redux Sync Op

nvvm.setmaxregister

nvvm.shfl.sync - NVVM Dialect Op for shfl.sync

nvvm.st.bulk - Bulk Store Op

nvvm.stmatrix - cooperative matrix store

nvvm.tcgen05.alloc - Tcgen05 alloc operation

nvvm.tcgen05.commit - Tcgen05 commit operations

nvvm.tcgen05.cp - Tcgen05 copy operation

nvvm.tcgen05.dealloc - Tcgen05 dealloc operation

nvvm.tcgen05.fence

nvvm.tcgen05.ld - tensor memory load instructions

nvvm.tcgen05.mma_smem_desc - Constructs a Shared Memory descriptor for MMA Operands A or B

nvvm.tcgen05.relinquish_alloc_permit

nvvm.tcgen05.shift - Tcgen05 shift operation

nvvm.tcgen05.st - tensor memory store instructions

nvvm.tcgen05.wait

nvvm.vote.sync - Vote across thread group

nvvm.wgmma.commit.group.sync.aligned

nvvm.wgmma.fence.aligned

nvvm.wgmma.mma_async

nvvm.wgmma.wait.group.sync.aligned

nvvm.wmma.load - Warp synchronous matrix load

nvvm.wmma.mma - Warp synchronous matrix-multiply accumulate using tensor cores.

nvvm.wmma.store - Warp synchronous matrix store

Functions

bar_warp_sync(ssa)

nvvm.bar.warp.sync - Warp Barrier Synchronization Op

Operands

  • mask - Single, LLVM_Type, LLVM dialect-compatible type

Description

The nvvm.bar.warp.sync operation performs barrier synchronization for threads within a warp.

This operation causes the executing thread to wait until all threads corresponding to the mask operand have executed a bar.warp.sync with the same mask value before resuming execution.

The mask operand specifies the threads participating in the barrier, where each bit position corresponds to the thread's lane ID within the warp. Only threads with their corresponding bit set in the mask participate in the barrier synchronization.

Important constraints:

  • The behavior is undefined if the executing thread is not included in the mask (i.e., the bit corresponding to the thread's lane ID is not set)
  • For compute capability sm_6x or below, all threads in the mask must execute the same bar.warp.sync instruction in convergence

This operation also guarantees memory ordering among participating threads. Threads within the warp that wish to communicate via memory can store to memory, execute bar.warp.sync, and then safely read values stored by other threads in the warp.
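
Example (a hedged sketch: the textual assembly format is assumed to take the mask operand followed by its type, and %mask is an illustrative SSA value):

  // Synchronize all 32 lanes of the warp (full mask).
  %mask = llvm.mlir.constant(-1 : i32) : i32
  nvvm.bar.warp.sync %mask : i32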

For more information, see PTX ISA

barrier0(ssa)

nvvm.barrier0

barrier(ssa)

nvvm.barrier - CTA Barrier Synchronization Op

Operands

  • barrierId - Optional, I32, 32-bit signless integer
  • numberOfThreads - Optional, I32, 32-bit signless integer

Description

The nvvm.barrier operation performs barrier synchronization and communication within a CTA (Cooperative Thread Array). It causes executing threads to wait for all non-exited threads participating in the barrier to arrive.

The operation takes two optional operands:

  • barrierId: Specifies a logical barrier resource with a value from 0 through 15. Each CTA instance has sixteen barriers numbered 0..15. Defaults to 0 if not specified.
  • numberOfThreads: Specifies the number of threads participating in the barrier. When specified, the value must be a multiple of the warp size. If not specified, all threads in the CTA participate in the barrier.

The barrier operation guarantees that when the barrier completes, prior memory accesses requested by participating threads are performed relative to all threads participating in the barrier. It also ensures that no new memory access is requested by participating threads before the barrier completes.

When a barrier completes, the waiting threads are restarted without delay, and the barrier is reinitialized so that it can be immediately reused.

This operation generates an aligned barrier, indicating that all threads in the CTA will execute the same barrier instruction. Behavior is undefined if all threads in the CTA do not reach this instruction.
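
Example (a hedged sketch following the upstream NVVM assembly format; %id and %n are illustrative SSA values):

  // All threads in the CTA, barrier 0
  nvvm.barrier

  // Explicit barrier id
  nvvm.barrier id = %id

  // Explicit barrier id and participating thread count
  nvvm.barrier id = %id number_of_threads = %n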

For more information, see PTX ISA

barrier_arrive(ssa)

nvvm.barrier.arrive

Operands

  • barrierId - Optional, I32, 32-bit signless integer
  • numberOfThreads - Single, I32, 32-bit signless integer

Description

A thread that executes this op announces its arrival at the barrier with the given id and continues its execution.

When barrierId is not present, the default barrier id 0 is used, which is the same barrier targeted by the nvvm.barrier Op.

For more information, see PTX ISA

breakpoint(ssa)

nvvm.breakpoint

cluster_arrive(ssa)

nvvm.cluster.arrive

cluster_arrive_relaxed(ssa)

nvvm.cluster.arrive.relaxed

cluster_wait(ssa)

nvvm.cluster.wait

clusterlaunchcontrol_query_cancel(ssa)

nvvm.clusterlaunchcontrol.query.cancel - Query the response of a clusterlaunchcontrol.try.cancel operation

Attributes

  • query_type - Single, ClusterLaunchControlQueryTypeAttr, NVVM ClusterLaunchControlQueryType

Operands

  • try_cancel_response - Single, I128, 128-bit signless integer

Results

  • res - Single, anonymous/composite constraint, 1-bit signless integer or 32-bit signless integer

Description

clusterlaunchcontrol.query.cancel queries the response of a clusterlaunchcontrol.try.cancel operation specified by operand try_cancel_response.

Operand query_type specifies the type of query to perform and can be one of the following:

  • is_canceled : Returns true if the try cancel request succeeded, and false otherwise.
  • get_first_cta_id_{x/y/z} : Returns the x, y, or z coordinate of the first CTA in the canceled cluster. Behaviour is defined only if the try cancel request succeeded.

For more information, see PTX ISA

clusterlaunchcontrol_try_cancel(ssa)

nvvm.clusterlaunchcontrol.try.cancel - Request atomically canceling the launch of a cluster that has not started running yet

Attributes

  • multicast - Optional, UnitAttr, unit attribute

Operands

  • smemAddress - Single, LLVM_PointerShared, LLVM pointer in address space 3
  • mbarrier - Single, LLVM_PointerShared, LLVM pointer in address space 3

Description

clusterlaunchcontrol.try.cancel requests atomically canceling the launch of a cluster that has not started running yet. It asynchronously writes an opaque response to shared memory indicating whether the operation succeeded or failed.

Operand smemAddress specifies the naturally aligned address of the 16-byte wide shared memory location where the request's response is written.

Operand mbarrier specifies the mbarrier object used to track the completion of the asynchronous operation.

If multicast is specified, the response is asynchronously written to the corresponding local shared memory location (specified by smemAddress) of each CTA in the requesting cluster.

For more information, see PTX ISA

convert_bf16x2_to_f8x2(ssa)

nvvm.convert.bf16x2.to.f8x2 - Convert a pair of bf16 inputs to f8x2

Attributes

  • type - Single, ConvertFP8TypeAttr, NVVM ConvertFP8Type kind
  • rnd - Single, FPRoundingModeAttr, NVVM FPRoundingMode kind
  • sat - Single, SaturationModeAttr, NVVM SaturationMode kind

Operands

  • a - Single, anonymous/composite constraint, vector of bfloat16 type values of length 2

Results

  • dst - Single, anonymous/composite constraint, 16-bit signless integer or vector of 8-bit signless integer values of length 2

Description

This Op converts the given bf16 inputs in a bf16x2 vector to the specified f8 type. The result dst is represented as an i16 type or as a vector of two i8 types. If dst is returned as an i16 type, the converted values from a are packed such that the value converted from the first element of a is stored in the upper 8 bits of dst and the value converted from the second element of a is stored in the lower 8 bits of dst. If dst is returned as a vector type, each converted value is stored as an i8 element in the vector. The rnd and sat attributes specify the rounding and saturation modes respectively.

For more information, see PTX ISA

convert_f16x2_to_f8x2(ssa)

nvvm.convert.f16x2.to.f8x2 - Convert an f16x2 input to f8x2

Attributes

  • type - Single, ConvertFP8TypeAttr, NVVM ConvertFP8Type kind
  • relu - Single, BoolAttr, bool attribute

Operands

  • a - Single, anonymous/composite constraint, vector of 16-bit float type values of length 2

Results

  • dst - Single, anonymous/composite constraint, 16-bit signless integer or vector of 8-bit signless integer values of length 2

Description

This Op converts the given f16 inputs in an f16x2 vector to the specified f8 type. The result dst is represented as an i16 type or as a vector of two i8 types. If dst is returned as an i16 type, the converted values from a are packed such that the value converted from the first element of a is stored in the upper 8 bits of dst and the value converted from the second element of a is stored in the lower 8 bits of dst. If dst is returned as a vector type, each converted value is stored as an i8 element in the vector. The relu attribute, when set, lowers to the '.relu' variant of the cvt instruction.

For more information, see PTX ISA

convert_f32x2_to_f6x2(ssa)

nvvm.convert.f32x2.to.f6x2 - Convert a pair of float inputs to f6x2

Attributes

  • type - Single, ConvertFP6TypeAttr, NVVM ConvertFP6Type kind
  • relu - Single, BoolAttr, bool attribute

Operands

  • a - Single, F32, 32-bit float
  • b - Single, F32, 32-bit float

Results

  • dst - Single, anonymous/composite constraint, 16-bit signless integer or vector of 8-bit signless integer values of length 2

Description

This Op converts each of the given float inputs to the specified fp6 type. The result dst is represented either as an i16 type or as a vector of two i8 types. If dst is returned as an i16 type, the converted values are packed such that the value converted from a is stored in the upper 8 bits of dst with 2 MSB bits padded with zeros and the value converted from b is stored in the lower 8 bits of dst with 2 MSB bits padded with zeros. If dst is returned as a vector type, each converted value is stored as an i8 element in the vector. The relu attribute, when set, lowers to the '.relu' variant of the cvt instruction.

For more information, see PTX ISA

convert_f32x2_to_f8x2(ssa)

nvvm.convert.f32x2.to.f8x2 - Convert a pair of float inputs to f8x2

Attributes

  • type - Single, ConvertFP8TypeAttr, NVVM ConvertFP8Type kind
  • rnd - Single, FPRoundingModeAttr, NVVM FPRoundingMode kind
  • sat - Single, SaturationModeAttr, NVVM SaturationMode kind
  • relu - Single, BoolAttr, bool attribute

Operands

  • a - Single, F32, 32-bit float
  • b - Single, F32, 32-bit float

Results

  • dst - Single, anonymous/composite constraint, 16-bit signless integer or vector of 8-bit signless integer values of length 2

Description

This Op converts each of the given float inputs to the specified fp8 type. The result dst is represented as an i16 type or as a vector of two i8 types. If dst is returned as an i16 type, the converted values are packed such that the value converted from a is stored in the upper 8 bits of dst and the value converted from b is stored in the lower 8 bits of dst. If dst is returned as a vector type, each converted value is stored as an i8 element in the vector. The rnd and sat attributes specify the rounding and saturation modes respectively. The relu attribute, when set, lowers to the '.relu' variant of the cvt instruction.

For more information, see PTX ISA

convert_float_to_tf32(ssa)

nvvm.convert.float.to.tf32 - Convert the given float input to TF32

Attributes

  • rnd - Single, FPRoundingModeAttr, NVVM FPRoundingMode kind
  • sat - Single, SaturationModeAttr, NVVM SaturationMode kind
  • relu - Single, BoolAttr, bool attribute

Operands

  • src - Single, F32, 32-bit float

Results

  • res - Single, I32, 32-bit signless integer

Description

This Op converts the given f32 input to tf32. The result res is represented as an i32 type. The relu attribute, when set, lowers to the '.relu' variant of the cvt instruction. The rnd and sat attributes specify the rounding and saturation modes respectively.
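
Example (a hedged sketch; both the attribute spelling and the trailing type syntax shown here are assumptions, and %src is an illustrative SSA value):

  // Round-to-nearest-even conversion of an f32 to a TF32 bit pattern held in i32
  %res = nvvm.convert.float.to.tf32 %src {rnd = #nvvm.fp_rnd_mode<rn>} : i32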

For more information, see PTX ISA

cp_async_bulk_commit_group(ssa)

nvvm.cp.async.bulk.commit.group

cp_async_bulk_global_shared_cta(ssa)

nvvm.cp.async.bulk.global.shared.cta - Async bulk copy from Shared CTA memory to Global memory

Operands

  • dstMem - Single, LLVM_PointerGlobal, LLVM pointer in address space 1
  • srcMem - Single, LLVM_PointerShared, LLVM pointer in address space 3
  • size - Single, I32, 32-bit signless integer
  • l2CacheHint - Optional, I64, 64-bit signless integer
  • byteMask - Optional, I16, 16-bit signless integer

Description

Initiates an asynchronous copy operation from Shared CTA memory to global memory. The 32-bit operand size specifies the amount of memory to be copied, in terms of number of bytes. size must be a multiple of 16. The l2CacheHint operand is optional, and it is used to specify cache eviction policy that may be used during the memory access. The byteMask operand is optional. The i-th bit in the 16-bit wide byteMask specifies whether the i-th byte of each 16-byte wide chunk of source data is copied to the destination. If the bit is set, the byte is copied.

Example:

  nvvm.cp.async.bulk.global.shared.cta %dst, %src, %size
      : !llvm.ptr<1>, !llvm.ptr<3>

  // with l2_cache_hint
  nvvm.cp.async.bulk.global.shared.cta %dst, %src, %size l2_cache_hint = %ch
      : !llvm.ptr<1>, !llvm.ptr<3>

  // with byte_mask
  nvvm.cp.async.bulk.global.shared.cta %dst, %src, %size byte_mask = %mask
      : !llvm.ptr<1>, !llvm.ptr<3>

  // with both l2_cache_hint and byte_mask
  nvvm.cp.async.bulk.global.shared.cta %dst, %src, %size l2_cache_hint = %ch byte_mask = %mask
      : !llvm.ptr<1>, !llvm.ptr<3>

For more information, see PTX ISA

cp_async_bulk_prefetch(ssa)

nvvm.cp.async.bulk.prefetch - Async bulk prefetch from global memory to L2 cache

Operands

  • srcMem - Single, LLVM_PointerGlobal, LLVM pointer in address space 1
  • size - Single, I32, 32-bit signless integer
  • l2CacheHint - Optional, I64, 64-bit signless integer

Description

Initiates an asynchronous prefetch of data from the location specified by srcMem to the L2 cache.

The l2CacheHint operand is optional, and it is used to specify cache eviction policy that may be used during the memory access.

Example:

  nvvm.cp.async.bulk.prefetch %src, %size : !llvm.ptr<1>

  // with l2_cache_hint
  nvvm.cp.async.bulk.prefetch %src, %size l2_cache_hint = %ch : !llvm.ptr<1>

For more information, see PTX ISA

cp_async_bulk_shared_cluster_global(ssa)

nvvm.cp.async.bulk.shared.cluster.global - Async bulk copy from global memory to Shared cluster memory

Operands

  • dstMem - Single, LLVM_PointerSharedCluster, LLVM pointer in address space 7
  • srcMem - Single, LLVM_PointerGlobal, LLVM pointer in address space 1
  • mbar - Single, LLVM_PointerShared, LLVM pointer in address space 3
  • size - Single, I32, 32-bit signless integer
  • multicastMask - Optional, I16, 16-bit signless integer
  • l2CacheHint - Optional, I64, 64-bit signless integer

Description

Initiates an asynchronous copy operation from global memory to cluster's shared memory.

The multicastMask operand is optional. When it is present, the Op copies data from global memory to shared memory of multiple CTAs in the cluster. Operand multicastMask specifies the destination CTAs in the cluster such that each bit position in the 16-bit multicastMask operand corresponds to the nvvm.read.ptx.sreg.ctaid of the destination CTA.

The l2CacheHint operand is optional, and it is used to specify cache eviction policy that may be used during the memory access.
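
Example (a hedged sketch patterned on the nvvm.cp.async.bulk.global.shared.cta examples elsewhere in this dialect; the keyword spellings for the optional operands are assumptions):

  nvvm.cp.async.bulk.shared.cluster.global %dst, %src, %mbar, %size
      : !llvm.ptr<7>, !llvm.ptr<1>

  // with multicast_mask and l2_cache_hint
  nvvm.cp.async.bulk.shared.cluster.global %dst, %src, %mbar, %size
      multicast_mask = %mc l2_cache_hint = %ch
      : !llvm.ptr<7>, !llvm.ptr<1>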

For more information, see PTX ISA

cp_async_bulk_shared_cluster_shared_cta(ssa)

nvvm.cp.async.bulk.shared.cluster.shared.cta - Async bulk copy from Shared CTA memory to Shared cluster memory

Operands

  • dstMem - Single, LLVM_PointerSharedCluster, LLVM pointer in address space 7
  • srcMem - Single, LLVM_PointerShared, LLVM pointer in address space 3
  • mbar - Single, LLVM_PointerShared, LLVM pointer in address space 3
  • size - Single, I32, 32-bit signless integer

Description

Initiates an asynchronous copy operation from Shared CTA memory to Shared cluster memory.

For more information, see PTX ISA

cp_async_bulk_tensor_global_shared_cta(ssa)

nvvm.cp.async.bulk.tensor.global.shared.cta

Attributes

  • mode - Single, TMAStoreModeAttr, NVVM TMA Store Mode

Operands

  • tmaDescriptor - Single, LLVM_PointerGeneric, LLVM pointer in address space 0
  • srcMem - Single, LLVM_PointerShared, LLVM pointer in address space 3
  • coordinates - Variadic, I32, variadic of 32-bit signless integer
  • l2CacheHint - Optional, I64, 64-bit signless integer
  • predicate - Optional, PtxPredicate, 1-bit signless integer

Description

Initiates an asynchronous copy of the tensor data from shared::cta memory to global memory. This Op supports all the store modes specified in TMAStoreMode.

The l2CacheHint operand is optional, and it is used to specify cache eviction policy that may be used during the memory access.

For more information, see PTX ISA

cp_async_bulk_tensor_prefetch(ssa)

nvvm.cp.async.bulk.tensor.prefetch

Attributes

  • mode - Single, TMALoadModeAttr, List of Load-Modes supported for TMA Tensor Ops

Operands

  • tmaDescriptor - Single, LLVM_PointerGeneric, LLVM pointer in address space 0
  • coordinates - Variadic, I32, variadic of 32-bit signless integer
  • im2colOffsets - Variadic, I16, variadic of 16-bit signless integer
  • l2CacheHint - Optional, I64, 64-bit signless integer

Description

Initiates an asynchronous prefetch operation on the tensor data from global memory to L2 cache. This Op supports all the load modes specified in TMALoadMode.

The l2CacheHint operand is optional, and it is used to specify cache eviction policy that may be used during the memory access.

For more information, see PTX ISA

cp_async_bulk_tensor_reduce(ssa)

nvvm.cp.async.bulk.tensor.reduce

Attributes

  • redKind - Single, TMAReduxKindAttr, NVVM TMA redux kind
  • mode - Single, TMAStoreModeAttr, NVVM TMA Store Mode

Operands

  • tmaDescriptor - Single, LLVM_AnyPointer, LLVM pointer type
  • srcMem - Single, LLVM_PointerShared, LLVM pointer in address space 3
  • coordinates - Variadic, I32, variadic of 32-bit signless integer
  • l2CacheHint - Optional, I64, 64-bit signless integer

Description

Initiates an asynchronous reduction operation of tensor data in global memory with tensor data in shared memory.

The mode attribute indicates whether the copy mode is tile or im2col. The redKind attribute specifies the reduction operation applied. The supported reduction operations are: {add, min, max, inc, dec, and, or, xor}.

The l2CacheHint operand is optional, and it is used to specify cache eviction policy that may be used during the memory access.

For more information, see PTX ISA

cp_async_bulk_tensor_shared_cluster_global(ssa)

nvvm.cp.async.bulk.tensor.shared.cluster.global

Attributes

  • mode - Single, TMALoadModeAttr, List of Load-Modes supported for TMA Tensor Ops
  • isCTAOnly - Single, BoolAttr, bool attribute
  • group - Optional, CTAGroupKindAttr, NVVM CTA group kind

Operands

  • dstMem - Single, anonymous/composite constraint, LLVM pointer in address space 3 or LLVM pointer in address space 7
  • tmaDescriptor - Single, LLVM_PointerGeneric, LLVM pointer in address space 0
  • coordinates - Variadic, I32, variadic of 32-bit signless integer
  • mbar - Single, LLVM_PointerShared, LLVM pointer in address space 3
  • im2colOffsets - Variadic, I16, variadic of 16-bit signless integer
  • multicastMask - Optional, I16, 16-bit signless integer
  • l2CacheHint - Optional, I64, 64-bit signless integer
  • predicate - Optional, PtxPredicate, 1-bit signless integer

Description

Initiates an asynchronous copy operation on the tensor data from global memory to shared::cluster (or) shared::cta memory. This Op supports all the load modes specified in TMALoadMode.

The multicastMask operand is optional. When it is present, the Op copies data from global memory to shared memory of multiple CTAs in the cluster. Operand multicastMask specifies the destination CTAs in the cluster such that each bit position in the 16-bit multicastMask operand corresponds to the nvvm.read.ptx.sreg.ctaid of the destination CTA.

The l2CacheHint operand is optional, and it is used to specify cache eviction policy that may be used during the memory access.

When the isCTAOnly attribute is set to true, the destination is shared::cta only. Hence, multicastMask and CTAGroup are not applicable when isCTAOnly is true.

For more information, see PTX ISA

cp_async_bulk_wait_group(ssa)

nvvm.cp.async.bulk.wait_group

cp_async_commit_group(ssa)

nvvm.cp.async.commit.group

cp_async_mbarrier_arrive(ssa)

nvvm.cp.async.mbarrier.arrive - NVVM Dialect Op for cp.async.mbarrier.arrive

Attributes

  • noinc - Single, I1Attr, 1-bit signless integer attribute

Operands

  • addr - Single, LLVM_AnyPointer, LLVM pointer type

Description

The cp.async.mbarrier.arrive Op makes the mbarrier object track all prior cp.async operations initiated by the executing thread. The addr operand specifies the address of the mbarrier object in generic address space. The noinc attr impacts how the mbarrier's state is updated.
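
Example (a hedged sketch; the generic-pointer type and operand syntax are assumptions, and %addr is an illustrative SSA value):

  // Make the mbarrier at %addr track prior cp.async operations of this thread
  nvvm.cp.async.mbarrier.arrive %addr : !llvm.ptr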

For more information, see PTX ISA

cp_async_mbarrier_arrive_shared(ssa)

nvvm.cp.async.mbarrier.arrive.shared - NVVM Dialect Op for cp.async.mbarrier.arrive.shared

Attributes

  • noinc - Single, I1Attr, 1-bit signless integer attribute

Operands

  • addr - Single, LLVM_PointerShared, LLVM pointer in address space 3

Description

The cp.async.mbarrier.arrive.shared Op makes the mbarrier object track all prior cp.async operations initiated by the executing thread. The addr operand specifies the address of the mbarrier object in shared memory. The noinc attr impacts how the mbarrier's state is updated.

For more information, see PTX ISA

cp_async_shared_global(ssa)

nvvm.cp.async.shared.global

Attributes

  • size - Single, I32Attr, 32-bit signless integer attribute
  • modifier - Single, LoadCacheModifierAttr, NVVM load cache modifier kind

Operands

  • dst - Single, LLVM_PointerShared, LLVM pointer in address space 3
  • src - Single, LLVM_PointerGlobal, LLVM pointer in address space 1
  • cpSize - Optional, LLVM_Type, LLVM dialect-compatible type

cp_async_wait_group(ssa)

nvvm.cp.async.wait.group

dot_accumulate_2way(ssa)

nvvm.dot.accumulate.2way - Two-way 16-bit to 8-bit dot product-accumulate instruction

Attributes

  • a_type - Single, DotAccumulateTypeAttr, NVVM DotAccumulateType
  • b_type - Single, DotAccumulateTypeAttr, NVVM DotAccumulateType
  • b_hi - Single, BoolAttr, bool attribute

Operands

  • a - Single, anonymous/composite constraint, vector of 16-bit signless integer values of length 2
  • b - Single, anonymous/composite constraint, vector of 8-bit signless integer values of length 4
  • c - Single, I32, 32-bit signless integer

Results

  • res - Single, I32, 32-bit signless integer

Description

Performs a two-way 16-bit to 8-bit dot-product which is accumulated in a 32-bit result. Operand a is a vector of two 16-bit elements and operand b is a vector of four 8-bit elements between which the dot product is computed.

The a_type and b_type attributes specify the type of the elements in a and b respectively. If a_type or b_type is s, then the elements in the corresponding vector are sign-extended to 32-bit before the dot product is computed. If a_type or b_type is u, then the elements in the corresponding vector are zero-extended to 32-bit instead.

The b_hi boolean attribute specifies which two bytes of b are used for the dot product. If b_hi is true, then the dot product is computed between a and elements at indices 2 and 3 of b. If b_hi is false, then the dot product is computed between a and elements at indices 0 and 1 of b.

Operand c is a 32-bit integer to which the result is accumulated. It is treated as holding a signed integer if any of a_type or b_type is signed.

For more information, see PTX ISA

dot_accumulate_4way(ssa)

nvvm.dot.accumulate.4way - Four-way byte dot product-accumulate instruction

Attributes

  • a_type - Single, DotAccumulateTypeAttr, NVVM DotAccumulateType
  • b_type - Single, DotAccumulateTypeAttr, NVVM DotAccumulateType

Operands

  • a - Single, anonymous/composite constraint, vector of 8-bit signless integer values of length 4
  • b - Single, anonymous/composite constraint, vector of 8-bit signless integer values of length 4
  • c - Single, I32, 32-bit signless integer

Results

  • res - Single, I32, 32-bit signless integer

Description

Performs a four-way byte dot-product which is accumulated in a 32-bit result. Operands a and b are vectors of 4 bytes between which the dot product is computed.

The a_type and b_type attributes specify the type of the elements in a and b respectively. If a_type or b_type is signed, then the elements in the corresponding vector are sign-extended to 32-bit before the dot product is computed. If a_type or b_type is unsigned, then the elements in the corresponding vector are zero-extended to 32-bit instead.

Operand c is a 32-bit integer to which the result is accumulated. It is treated as holding a signed integer if any of a_type or b_type is s8.

For more information, see PTX ISA

elect_sync(ssa)

nvvm.elect.sync - Elect one leader thread

Operands

  • membermask - Optional, I32, 32-bit signless integer

Results

  • pred - Single, I1, 1-bit signless integer

Description

The elect.sync instruction elects one predicated active leader thread from among a set of threads specified in the membermask. When the membermask is not provided explicitly, a default value of 0xFFFFFFFF is used. The predicate result is set to True for the leader thread, and False for all other threads.
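
Example (a hedged sketch of the default-mask form):

  // Elect one leader among all active threads (default mask 0xFFFFFFFF);
  // %pred is true only in the elected leader thread.
  %pred = nvvm.elect.sync -> i1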

For more information, see PTX ISA

exit(ssa)

nvvm.exit

fence_mbarrier_init(ssa)

nvvm.fence.mbarrier.init

fence_proxy(ssa)

nvvm.fence.proxy

fence_proxy_acquire(ssa)

nvvm.fence.proxy.acquire - Uni-directional proxy fence operation with acquire semantics

Attributes

  • scope - Single, MemScopeKindAttr, NVVM Memory Scope kind
  • fromProxy - Single, ProxyKindAttr, Proxy kind
  • toProxy - Single, ProxyKindAttr, Proxy kind

Operands

  • addr - Single, LLVM_PointerGeneric, LLVM pointer in address space 0
  • size - Single, I32, 32-bit signless integer

Description

fence.proxy.acquire is a uni-directional fence used to establish ordering between a prior memory access performed via the generic proxy and a subsequent memory access performed via the tensormap proxy.

The address operand addr and the operand size together specify the memory range [addr, addr+size) over which the ordering guarantees on memory accesses across the proxies are provided. The only supported value for the size operand is 128, and it must be an immediate. Generic addressing is used unconditionally, and the address specified by the operand addr must fall within the .global state space; otherwise, the behavior is undefined.

For more information, see PTX ISA

fence_proxy_release(ssa)

nvvm.fence.proxy.release

fence_sc_cluster(ssa)

nvvm.fence.sc.cluster

griddepcontrol(ssa)

nvvm.griddepcontrol

inline_ptx(ssa)

nvvm.inline_ptx - Inline PTX Op

Attributes

  • ptxCode - Single, StrAttr, string attribute

Operands

  • readOnlyArgs - Variadic, AnyType, variadic of any type
  • readWriteArgs - Variadic, AnyType, variadic of any type
  • predicate - Optional, PtxPredicate, 1-bit signless integer

Results

  • writeOnlyArgs - Variadic, AnyType, variadic of any type

Description

This op allows using PTX directly within the NVVM dialect, while greatly simplifying llvm.inline_asm generation. It automatically handles register size selection and sets the correct read/write access for each operand. The operation leverages the `BasicPtxBuilderInterface` to abstract away low-level details of PTX assembly formatting.

The `predicate` operand specifies a predicate for the PTX instruction.

Example 1: Read-only Parameters
```mlir
nvvm.inline_ptx "mbarrier.init.b64 [$0], $1;" (%barrier_gen, %count) : !llvm.ptr, i32

// Lowers to:
llvm.inline_asm has_side_effects asm_dialect = att 
  "mbarrier.init.b64 [$0], $1;", "l,r" %arg0, %arg2 : (!llvm.ptr, i32) -> ()
```

Example 2: Read-only and Write-only Parameters
```mlir
%0 = nvvm.inline_ptx "ex2.approx.ftz.f32 $0, $1;" (%input) : f32 -> f32

// Lowers to:
%0 = llvm.inline_asm has_side_effects asm_dialect = att 
  "ex2.approx.ftz.f32 $0, $1;", "=f,f" %arg0 : (f32) -> f32
```

Example 3: Predicate Usage
```mlir
nvvm.inline_ptx "mbarrier.init.b64 [$0], $1;" (%barrier_gen, %count), 
  predicate = %pred : !llvm.ptr, i32, i1

// Lowers to:
llvm.inline_asm has_side_effects asm_dialect = att 
  "@$2 mbarrier.init.b64 [$0], $1;", "l,r,b" %arg0, %arg2, %arg3 
  : (!llvm.ptr, i32, i1) -> ()
```

ldmatrix(ssa)

nvvm.ldmatrix - cooperative matrix load

Attributes

  • num - Single, I32Attr, 32-bit signless integer attribute
  • layout - Single, MMALayoutAttr, NVVM MMA layout
  • shape - Single, LdStMatrixShapeAttr, Matrix shape for ldmatrix and stmatrix
  • eltType - Single, LdStMatrixEltTypeAttr, Element type for ldmatrix and stmatrix

Operands

  • ptr - Single, LLVM_PointerShared, LLVM pointer in address space 3

Results

  • res - Single, AnyType, any type

mapa(ssa)

nvvm.mapa

Operands

  • a - Single, anonymous/composite constraint, LLVM pointer in address space 0 or LLVM pointer in address space 3
  • b - Single, I32, 32-bit signless integer

Results

  • res - Single, anonymous/composite constraint, LLVM pointer in address space 0 or LLVM pointer in address space 7

match_sync(ssa)

nvvm.match.sync - Broadcast and compare a value across threads in warp

Attributes

  • kind - Single, MatchSyncKindAttr, NVVM match sync kind

Operands

  • thread_mask - Single, I32, 32-bit signless integer
  • val - Single, anonymous/composite constraint, 32-bit signless integer or 64-bit signless integer

Results

  • res - Single, anonymous/composite constraint, 32-bit signless integer or LLVM struct type

Description

The match.sync op performs a broadcast and compare of operand val across all non-exited threads in thread_mask and returns a mask and, depending on the kind, an optional predicate.

The matching operation kinds are:

  • any: Returns a mask corresponding to the non-exited threads in the thread_mask that have the same value of operand val.
  • all: Returns a mask and a predicate. If all non-exited threads in the thread_mask have the same value of operand val, the predicate is set to true and the mask corresponds to the non-exited threads in the thread_mask. Otherwise, the predicate is set to false and the mask is 0.

For more information, see PTX ISA
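
As an illustrative sketch of the two kinds (SSA names are placeholders; the exact assembly format is the one defined by the dialect):

```mlir
// 'any': each thread receives the mask of threads that hold the
// same value of %val as it does.
%mask = nvvm.match.sync any %thread_mask, %val : i32 -> i32

// 'all': additionally yields a predicate that is true only when all
// participating threads hold the same value.
%res = nvvm.match.sync all %thread_mask, %val : i32 -> !llvm.struct<(i32, i1)>
```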

mbarrier_arrive(ssa)

nvvm.mbarrier.arrive - MBarrier Arrive Operation

Operands

  • addr - Single, LLVM_AnyPointer, LLVM pointer type

Results

  • res - Single, LLVM_Type, LLVM dialect-compatible type

Description

The nvvm.mbarrier.arrive operation performs an arrive-on operation on the mbarrier object at the specified address. Uses the default .release.cta semantics. This release pattern establishes memory ordering for operations occurring in program order before this arrive instruction by making operations from the current thread visible to subsequent operations in other threads within the CTA. When other threads perform corresponding acquire operations (like 'mbarrier.test.wait'), they synchronize with this release pattern.

This operation causes the executing thread to signal its arrival at the barrier. The operation returns an opaque value that captures the phase of the mbarrier object prior to the arrive-on operation. The contents of this state value are implementation-specific.

The operation takes the following operand:

  • addr: A pointer to the memory location of the mbarrier object. Uses generic addressing, but the address must still be in the shared memory space.

For more information, see PTX ISA
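
A minimal usage sketch, assuming an i64 state token and illustrative SSA names (the exact assembly format is defined by the dialect):

```mlir
// Signal arrival at the barrier and capture the opaque state token
// describing the phase prior to the arrive-on operation.
%token = nvvm.mbarrier.arrive %barrier : !llvm.ptr -> i64
```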

mbarrier_arrive_expect_tx(ssa)

nvvm.mbarrier.arrive.expect_tx - MBarrier Arrive with Expected Transaction Count

Operands

  • addr - Single, LLVM_AnyPointer, LLVM pointer type
  • txcount - Single, I32, 32-bit signless integer
  • predicate - Optional, PtxPredicate, 1-bit signless integer

Description

The nvvm.mbarrier.arrive.expect_tx operation performs an expect-tx operation followed by an arrive-on operation on the mbarrier object. Uses the default .release.cta semantics. This release pattern establishes memory ordering for operations occurring in program order before this arrive instruction by making operations from the current thread visible to subsequent operations in other threads within the CTA. When other threads perform corresponding acquire operations (like 'mbarrier.test.wait'), they synchronize with this release pattern.

This operation first performs an expect-tx operation with the specified transaction count, then performs an arrive-on operation with an implicit count of 1. The expect-tx operation increases the tx-count of the mbarrier object by the specified txcount value, causing the current phase to expect and track the completion of additional asynchronous transactions.

The operation takes the following operands:

  • addr: A pointer to the memory location of the mbarrier object. Uses generic addressing, but the address must still be in the shared memory space.
  • txcount: An unsigned integer specifying the expected transaction count for the expect-tx operation. This represents the number of asynchronous transactions expected to complete before the barrier phase completes.
  • predicate: Optional predicate for conditional execution.

For more information, see PTX ISA
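
A plausible usage sketch with illustrative SSA names (the exact assembly format is the one defined by the dialect):

```mlir
// Expect %txcount bytes of asynchronous transactions for the current
// phase, then arrive on the barrier with an implicit count of 1.
nvvm.mbarrier.arrive.expect_tx %barrier, %txcount : !llvm.ptr, i32
```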

mbarrier_arrive_expect_tx_shared(ssa)

nvvm.mbarrier.arrive.expect_tx.shared - Shared MBarrier Arrive with Expected Transaction Count

Operands

  • addr - Single, LLVM_PointerShared, LLVM pointer in address space 3
  • txcount - Single, I32, 32-bit signless integer
  • predicate - Optional, PtxPredicate, 1-bit signless integer

Description

This Op is the same as nvvm.mbarrier.arrive.expect_tx except that the mbarrier object should be accessed using a shared-memory pointer instead of a generic-memory pointer.

For more information, see PTX ISA

mbarrier_arrive_nocomplete(ssa)

nvvm.mbarrier.arrive.nocomplete - MBarrier Arrive No-Complete Operation

Operands

  • addr - Single, LLVM_AnyPointer, LLVM pointer type
  • count - Single, I32, 32-bit signless integer

Results

  • res - Single, LLVM_Type, LLVM dialect-compatible type

Description

The nvvm.mbarrier.arrive.nocomplete operation performs an arrive-on operation on the mbarrier object with the guarantee that it will not cause the barrier to complete its current phase. Uses the default .release.cta semantics. This release pattern establishes memory ordering for operations occurring in program order before this arrive instruction by making operations from the current thread visible to subsequent operations in other threads within the CTA. When other threads perform corresponding acquire operations (like 'mbarrier.test.wait'), they synchronize with this release pattern.

This operation causes the executing thread to signal its arrival at the barrier with a specified count, but ensures that the barrier phase will not complete as a result of this operation. The operation returns an opaque value that captures the phase of the mbarrier object prior to the arrive-on operation.

The operation takes the following operands:

  • addr: A pointer to the memory location of the mbarrier object. Uses generic addressing, but the address must still be in the shared memory space.
  • count: Integer specifying the count argument to the arrive-on operation. Must be in the valid range as specified in the mbarrier object contents.

For more information, see PTX ISA

mbarrier_arrive_nocomplete_shared(ssa)

nvvm.mbarrier.arrive.nocomplete.shared - Shared MBarrier Arrive No-Complete Operation

Operands

  • addr - Single, LLVM_PointerShared, LLVM pointer in address space 3
  • count - Single, I32, 32-bit signless integer

Results

  • res - Single, LLVM_Type, LLVM dialect-compatible type

Description

This Op is the same as nvvm.mbarrier.arrive.nocomplete except that the mbarrier object should be accessed using a shared-memory pointer instead of a generic-memory pointer.

For more information, see PTX ISA

mbarrier_arrive_shared(ssa)

nvvm.mbarrier.arrive.shared - Shared MBarrier Arrive Operation

Operands

  • addr - Single, LLVM_PointerShared, LLVM pointer in address space 3

Results

  • res - Single, LLVM_Type, LLVM dialect-compatible type

Description

This Op is the same as nvvm.mbarrier.arrive except that the mbarrier object should be accessed using a shared-memory pointer instead of a generic-memory pointer.

For more information, see PTX ISA

mbarrier_init(ssa)

nvvm.mbarrier.init - MBarrier Initialization Op

Operands

  • addr - Single, LLVM_AnyPointer, LLVM pointer type
  • count - Single, I32, 32-bit signless integer
  • predicate - Optional, PtxPredicate, 1-bit signless integer

Description

The nvvm.mbarrier.init operation initializes an mbarrier object at the specified memory location.

This operation initializes the mbarrier object with the following state:

  • Current phase: 0
  • Expected arrival count: count
  • Pending arrival count: count
  • Transaction count (tx-count): 0

The operation takes the following operands:

  • addr: A pointer to the memory location of the mbarrier object. Uses generic addressing, but the address must still be in the shared memory space.
  • count: Integer specifying the number of threads that will participate in barrier synchronization. Must be in the range [1, 2²⁰ - 1].
  • predicate: Optional predicate for conditional execution.

For more information, see PTX ISA
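
A minimal sketch of initializing a barrier, with illustrative SSA names (the exact assembly format is defined by the dialect):

```mlir
// Initialize an mbarrier object in shared memory for %count
// participating threads (phase 0, tx-count 0).
nvvm.mbarrier.init %barrier, %count : !llvm.ptr, i32
```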

mbarrier_init_shared(ssa)

nvvm.mbarrier.init.shared - Shared MBarrier Initialization Op

Operands

  • addr - Single, LLVM_PointerShared, LLVM pointer in address space 3
  • count - Single, I32, 32-bit signless integer
  • predicate - Optional, PtxPredicate, 1-bit signless integer

Description

This Op is the same as nvvm.mbarrier.init except that the mbarrier object should be accessed using a shared-memory pointer instead of a generic-memory pointer.

For more information, see PTX ISA

mbarrier_inval(ssa)

nvvm.mbarrier.inval - MBarrier Invalidation Operation

Operands

  • addr - Single, LLVM_AnyPointer, LLVM pointer type

Description

The nvvm.mbarrier.inval operation invalidates an mbarrier object at the specified memory location.

This operation marks the mbarrier object as invalid, making it safe to repurpose the memory location for other uses or to reinitialize it as a new mbarrier object. It is undefined behavior if the mbarrier object is already invalid.

The operation takes the following operand:

  • addr: A pointer to the memory location of the mbarrier object. Uses generic addressing, but the address must still be in the shared memory space.

For more information, see PTX ISA

mbarrier_inval_shared(ssa)

nvvm.mbarrier.inval.shared - Shared MBarrier Invalidation Operation

Operands

  • addr - Single, LLVM_PointerShared, LLVM pointer in address space 3

Description

This Op is the same as nvvm.mbarrier.inval except that the mbarrier object should be accessed using a shared-memory pointer instead of a generic-memory pointer.

For more information, see PTX ISA

mbarrier_test_wait(ssa)

nvvm.mbarrier.test.wait - MBarrier Non-Blocking Test Wait Operation

Operands

  • addr - Single, LLVM_AnyPointer, LLVM pointer type
  • state - Single, LLVM_Type, LLVM dialect-compatible type

Results

  • res - Single, LLVM_Type, LLVM dialect-compatible type

Description

The nvvm.mbarrier.test.wait operation performs a non-blocking test for the completion of a specific phase of an mbarrier object. It uses the default .acquire.cta semantics. This acquire pattern establishes memory ordering for operations occurring in program order after this wait instruction by making operations from other threads in the CTA visible to subsequent operations in the current thread. When this wait completes, it synchronizes with the corresponding release pattern from the mbarrier.arrive operation, establishing memory ordering within the CTA.

This operation tests whether the mbarrier phase specified by the state operand has completed. It is a non-blocking instruction that immediately returns the completion status without suspending the executing thread.

The operation takes the following operands:

  • addr: A pointer to the memory location of the mbarrier object. Uses generic addressing, but the address must still be in the shared memory space.
  • state: An opaque value returned by a previous mbarrier.arrive operation on the same mbarrier object during the current or immediately preceding phase.

The operation returns a boolean value indicating whether the specified phase has completed:

  • true: The immediately preceding phase has completed
  • false: The phase is still incomplete (current phase)

Memory ordering guarantees: When this wait returns true, the following ordering guarantees hold:

  1. All memory accesses (except async operations) requested prior to mbarrier.arrive having release semantics by participating CTA threads are visible to the executing thread.
  2. All cp.async operations requested prior to cp.async.mbarrier.arrive by participating CTA threads are visible to the executing thread.
  3. All cp.async.bulk operations using the same mbarrier object requested prior to mbarrier.arrive having release semantics by participating CTA threads are visible to the executing thread.
  4. Memory accesses requested after this wait are not visible to memory accesses performed prior to mbarrier.arrive by other participating threads.
  5. No ordering guarantee exists for memory accesses by the same thread between mbarrier.arrive and this wait.

For more information, see PTX ISA
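
A sketch of the arrive/test-wait pairing, assuming an i64 state token and illustrative SSA names (the exact assembly format is defined by the dialect):

```mlir
// Non-blocking poll: %done becomes true once the phase recorded in
// %token (from a prior mbarrier.arrive) has completed.
%done = nvvm.mbarrier.test.wait %barrier, %token : !llvm.ptr, i64 -> i1
```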

mbarrier_test_wait_shared(ssa)

nvvm.mbarrier.test.wait.shared - Shared MBarrier Non-Blocking Test Wait Operation

Operands

  • addr - Single, LLVM_PointerShared, LLVM pointer in address space 3
  • state - Single, LLVM_Type, LLVM dialect-compatible type

Results

  • res - Single, LLVM_Type, LLVM dialect-compatible type

Description

This Op is the same as nvvm.mbarrier.test.wait except that the mbarrier object should be accessed using a shared-memory pointer instead of a generic-memory pointer.

For more information, see PTX ISA

mbarrier_try_wait_parity(ssa)

nvvm.mbarrier.try_wait.parity - MBarrier Potentially-Blocking Try Wait with Phase Parity

Operands

  • addr - Single, LLVM_AnyPointer, LLVM pointer type
  • phase - Single, I32, 32-bit signless integer
  • ticks - Single, I32, 32-bit signless integer

Description

The nvvm.mbarrier.try_wait.parity operation performs a potentially-blocking test for the completion of a specific phase of an mbarrier object using phase parity. It uses the default .acquire.cta semantics. This acquire pattern establishes memory ordering for operations occurring in program order after this wait instruction by making operations from other threads in the CTA visible to subsequent operations in the current thread. When this wait completes, it synchronizes with the corresponding release pattern from the mbarrier.arrive operation, establishing memory ordering within the CTA.

This operation waits for the completion of the mbarrier phase indicated by the phase parity. While it uses the underlying PTX mbarrier.try_wait.parity instruction, this MLIR operation generates a loop that enforces the test to complete before continuing execution, ensuring the barrier phase is actually completed rather than potentially timing out.

The operation takes the following operands:

  • addr: A pointer to the memory location of the mbarrier object. Uses generic addressing, but the address must still be in the shared memory space.
  • phase: An integer specifying the phase parity (0 or 1). Even phases have parity 0, odd phases have parity 1.
  • ticks: An unsigned integer specifying the suspend time hint in nanoseconds. This may be used instead of the system-dependent time limit.

Memory ordering guarantees: When this wait returns true, the following ordering guarantees hold:

  1. All memory accesses (except async operations) requested prior to mbarrier.arrive having release semantics by participating CTA threads are visible to the executing thread.
  2. All cp.async operations requested prior to cp.async.mbarrier.arrive by participating CTA threads are visible to the executing thread.
  3. All cp.async.bulk operations using the same mbarrier object requested prior to mbarrier.arrive having release semantics by participating CTA threads are visible to the executing thread.
  4. Memory accesses requested after this wait are not visible to memory accesses performed prior to mbarrier.arrive by other participating threads.
  5. No ordering guarantee exists for memory accesses by the same thread between mbarrier.arrive and this wait.

Implementation behavior: This operation generates a PTX loop that repeatedly calls the underlying mbarrier.try_wait.parity instruction until the barrier phase completes. Unlike the raw PTX instruction which may return without completion after a timeout, this MLIR operation guarantees completion by continuing to loop until the specified phase is reached.

For more information, see PTX ISA
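
An illustrative sketch, with placeholder SSA names; the exact assembly format is the one defined by the dialect:

```mlir
// Loop until the phase with parity %phase (0 or 1) completes;
// %ticks is a suspend-time hint in nanoseconds.
nvvm.mbarrier.try_wait.parity %barrier, %phase, %ticks : !llvm.ptr, i32, i32
```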

mbarrier_try_wait_parity_shared(ssa)

nvvm.mbarrier.try_wait.parity.shared - Shared MBarrier Potentially-Blocking Try Wait with Phase Parity

Operands

  • addr - Single, LLVM_PointerShared, LLVM pointer in address space 3
  • phase - Single, I32, 32-bit signless integer
  • ticks - Single, I32, 32-bit signless integer

Description

This Op is the same as nvvm.mbarrier.try_wait.parity except that the mbarrier object should be accessed using a shared-memory pointer instead of a generic-memory pointer.

For more information, see PTX ISA

mma_sync(ssa)

nvvm.mma.sync - cooperative matrix-multiply and accumulate

Attributes

  • shape - Single, NVVM_MMAShapeAttr, Attribute for MMA operation shape.
  • b1Op - Optional, MMAB1OpAttr, MMA binary operations
  • intOverflowBehavior - Optional, MMAIntOverflowAttr, MMA overflow options
  • layoutA - Single, MMALayoutAttr, NVVM MMA layout
  • layoutB - Single, MMALayoutAttr, NVVM MMA layout
  • multiplicandAPtxType - Optional, MMATypesAttr, NVVM MMA types
  • multiplicandBPtxType - Optional, MMATypesAttr, NVVM MMA types

Operands

  • operandA - Variadic, LLVM_Type, variadic of LLVM dialect-compatible type
  • operandB - Variadic, LLVM_Type, variadic of LLVM dialect-compatible type
  • operandC - Variadic, LLVM_Type, variadic of LLVM dialect-compatible type

Results

  • res - Single, LLVM_AnyStruct, LLVM structure type

Description

The nvvm.mma.sync operation collectively performs the operation D = matmul(A, B) + C using all threads in a warp.

All the threads in the warp must execute the same mma.sync operation.

For each possible multiplicand PTX data type, there are one or more possible instruction shapes given as "mMnNkK". The table below describes the possibilities as well as the types required for the operands. Note that the data type for C (the accumulator) and D (the result) can vary independently when there are multiple possibilities in the "C/D Type" column.

When an optional attribute cannot be immediately inferred from the types of the operands and the result during parsing or validation, an error will be raised.

b1Op is only relevant when the binary (b1) type is given as the multiplicand type. It specifies how the multiply-and-accumulate is performed and is either xor_popc or and_popc. The default is xor_popc.

intOverflowBehavior is only relevant when the multiplicand type is one of u8, s8, u4, or s4; it describes how overflow is handled in the accumulator. When the attribute is satfinite, the accumulator values are clamped to the int32 range on overflow. This is the default behavior. Alternatively, wrapped may be specified, in which case overflow wraps from one end of the range to the other.

layoutA and layoutB are required and should generally be set to #nvvm.mma_layout<row> and #nvvm.mma_layout<col> respectively, but other combinations are possible for certain layouts according to the table below.

| A/B Type | Shape      | ALayout | BLayout | A Type   | B Type   | C/D Type           |
|----------|------------|---------|---------|----------|----------|--------------------|
| f64      | .m8n8k4    | row     | col     | 1x f64   | 1x f64   | 2x f64             |
| f16      | .m8n8k4    | row/col | row/col | 2x f16x2 | 2x f16x2 | 4x f16x2 or 8x f32 |
|          | .m16n8k8   | row     | col     | 2x f16x2 | 1x f16x2 | 2x f16x2 or 4x f32 |
|          | .m16n8k16  | row     | col     | 4x f16x2 | 2x f16x2 | 2x f16x2 or 4x f32 |
| bf16     | .m16n8k8   | row     | col     | 2x i32   | 1x i32   | 4x f32             |
|          | .m16n8k16  | row     | col     | 4x i32   | 2x i32   | 4x f32             |
| tf32     | .m16n8k4   | row     | col     | 2x i32   | 1x i32   | 4x f32             |
|          | .m16n8k8   | row     | col     | 4x i32   | 2x i32   | 2x f16x2 or 4x f32 |
| u8/s8    | .m8n8k16   | row     | col     | 1x i32   | 1x i32   | 2x i32             |
|          | .m16n8k16  | row     | col     | 2x i32   | 1x i32   | 4x i32             |
|          | .m16n8k32  | row     | col     | 4x i32   | 2x i32   | 4x i32             |
| u4/s4    | .m8n8k32   | row     | col     | 1x i32   | 1x i32   | 2x i32             |
|          | .m16n8k32  | row     | col     | 2x i32   | 1x i32   | 4x i32             |
|          | .m16n8k64  | row     | col     | 4x i32   | 2x i32   | 4x i32             |
| b1       | .m8n8k128  | row     | col     | 1x i32   | 1x i32   | 2x i32             |
|          | .m16n8k128 | row     | col     | 2x i32   | 1x i32   | 4x i32             |

Example:

```mlir
%128 = nvvm.mma.sync A[%120, %121, %122, %123]
                     B[%124, %125]
                     C[%126, %127]
                     {layoutA = #nvvm.mma_layout<row>,
                      layoutB = #nvvm.mma_layout<col>,
                      shape = {k = 16 : i32, m = 16 : i32, n = 8 : i32}}
    : (vector<2xf16>, vector<2xf16>, vector<2xf16>)
       -> !llvm.struct<(vector<2xf16>, vector<2xf16>)>
```

nanosleep(ssa)

nvvm.nanosleep

pmevent(ssa)

nvvm.pmevent

prefetch(ssa)

nvvm.prefetch - Brings the cache line containing an address into the specified cache level

Attributes

  • cacheLevel - Optional, PrefetchCacheLevelAttr, NVVM Prefetch Cache Level
  • evictPriority - Optional, CacheEvictionPriorityAttr, NVVM Cache Eviction Priority
  • tensormap - Optional, UnitAttr, unit attribute
  • uniform - Optional, UnitAttr, unit attribute
  • in_param_space - Optional, UnitAttr, unit attribute

Operands

  • addr - Single, anonymous/composite constraint, LLVM pointer in address space 1 or LLVM pointer in address space 5 or LLVM pointer in address space 0 or LLVM pointer in address space 4
  • predicate - Optional, PtxPredicate, 1-bit signless integer

Description

Prefetches the cache line containing the address given by addr. The operand may be a global, local, or generic pointer. When tensormap is specified, the operand may instead be a constant or generic pointer. If the address maps to shared memory, the operation has no effect.

At most one of cacheLevel or tensormap may be present. The cacheLevel attribute selects the target cache level. When combined with uniform, the prefetch is performed to the uniform cache, in which case addr must be a generic pointer.

When tensormap is used, the line containing addr is brought from the constant or parameter state space for later use by cp.async.bulk.tensor. If in_param_space is specified, the generic pointer is interpreted as referring to the parameter state space.

When uniform is specified, no operation is performed if addr maps to a const, local, or shared memory location.

The evictPriority attribute is optional and specifies the cache eviction priority when cacheLevel is L2.

For more information, see PTX ISA

rcp_approx_ftz_f(ssa)

nvvm.rcp.approx.ftz.f

Operands

  • arg - Single, F32, 32-bit float

Results

  • res - Single, F32, 32-bit float

read_ptx_sreg_clock64(ssa)

nvvm.read.ptx.sreg.clock64

read_ptx_sreg_clock(ssa)

nvvm.read.ptx.sreg.clock

read_ptx_sreg_cluster_ctaid_x(ssa)

nvvm.read.ptx.sreg.cluster.ctaid.x

read_ptx_sreg_cluster_ctaid_y(ssa)

nvvm.read.ptx.sreg.cluster.ctaid.y

read_ptx_sreg_cluster_ctaid_z(ssa)

nvvm.read.ptx.sreg.cluster.ctaid.z

read_ptx_sreg_cluster_ctarank(ssa)

nvvm.read.ptx.sreg.cluster.ctarank

read_ptx_sreg_cluster_nctaid_x(ssa)

nvvm.read.ptx.sreg.cluster.nctaid.x

read_ptx_sreg_cluster_nctaid_y(ssa)

nvvm.read.ptx.sreg.cluster.nctaid.y

read_ptx_sreg_cluster_nctaid_z(ssa)

nvvm.read.ptx.sreg.cluster.nctaid.z

read_ptx_sreg_cluster_nctarank(ssa)

nvvm.read.ptx.sreg.cluster.nctarank

read_ptx_sreg_clusterid_x(ssa)

nvvm.read.ptx.sreg.clusterid.x

read_ptx_sreg_clusterid_y(ssa)

nvvm.read.ptx.sreg.clusterid.y

read_ptx_sreg_clusterid_z(ssa)

nvvm.read.ptx.sreg.clusterid.z

read_ptx_sreg_ctaid_x(ssa)

nvvm.read.ptx.sreg.ctaid.x

read_ptx_sreg_ctaid_y(ssa)

nvvm.read.ptx.sreg.ctaid.y

read_ptx_sreg_ctaid_z(ssa)

nvvm.read.ptx.sreg.ctaid.z

read_ptx_sreg_envreg0(ssa)

nvvm.read.ptx.sreg.envreg0

read_ptx_sreg_envreg1(ssa)

nvvm.read.ptx.sreg.envreg1

read_ptx_sreg_envreg2(ssa)

nvvm.read.ptx.sreg.envreg2

read_ptx_sreg_envreg3(ssa)

nvvm.read.ptx.sreg.envreg3

read_ptx_sreg_envreg4(ssa)

nvvm.read.ptx.sreg.envreg4

read_ptx_sreg_envreg5(ssa)

nvvm.read.ptx.sreg.envreg5

read_ptx_sreg_envreg6(ssa)

nvvm.read.ptx.sreg.envreg6

read_ptx_sreg_envreg7(ssa)

nvvm.read.ptx.sreg.envreg7

read_ptx_sreg_envreg8(ssa)

nvvm.read.ptx.sreg.envreg8

read_ptx_sreg_envreg9(ssa)

nvvm.read.ptx.sreg.envreg9

read_ptx_sreg_envreg10(ssa)

nvvm.read.ptx.sreg.envreg10

read_ptx_sreg_envreg11(ssa)

nvvm.read.ptx.sreg.envreg11

read_ptx_sreg_envreg12(ssa)

nvvm.read.ptx.sreg.envreg12

read_ptx_sreg_envreg13(ssa)

nvvm.read.ptx.sreg.envreg13

read_ptx_sreg_envreg14(ssa)

nvvm.read.ptx.sreg.envreg14

read_ptx_sreg_envreg15(ssa)

nvvm.read.ptx.sreg.envreg15

read_ptx_sreg_envreg16(ssa)

nvvm.read.ptx.sreg.envreg16

read_ptx_sreg_envreg17(ssa)

nvvm.read.ptx.sreg.envreg17

read_ptx_sreg_envreg18(ssa)

nvvm.read.ptx.sreg.envreg18

read_ptx_sreg_envreg19(ssa)

nvvm.read.ptx.sreg.envreg19

read_ptx_sreg_envreg20(ssa)

nvvm.read.ptx.sreg.envreg20

read_ptx_sreg_envreg21(ssa)

nvvm.read.ptx.sreg.envreg21

read_ptx_sreg_envreg22(ssa)

nvvm.read.ptx.sreg.envreg22

read_ptx_sreg_envreg23(ssa)

nvvm.read.ptx.sreg.envreg23

read_ptx_sreg_envreg24(ssa)

nvvm.read.ptx.sreg.envreg24

read_ptx_sreg_envreg25(ssa)

nvvm.read.ptx.sreg.envreg25

read_ptx_sreg_envreg26(ssa)

nvvm.read.ptx.sreg.envreg26

read_ptx_sreg_envreg27(ssa)

nvvm.read.ptx.sreg.envreg27

read_ptx_sreg_envreg28(ssa)

nvvm.read.ptx.sreg.envreg28

read_ptx_sreg_envreg29(ssa)

nvvm.read.ptx.sreg.envreg29

read_ptx_sreg_envreg30(ssa)

nvvm.read.ptx.sreg.envreg30

read_ptx_sreg_envreg31(ssa)

nvvm.read.ptx.sreg.envreg31

read_ptx_sreg_globaltimer(ssa)

nvvm.read.ptx.sreg.globaltimer

read_ptx_sreg_globaltimer_lo(ssa)

nvvm.read.ptx.sreg.globaltimer.lo

read_ptx_sreg_gridid(ssa)

nvvm.read.ptx.sreg.gridid

read_ptx_sreg_laneid(ssa)

nvvm.read.ptx.sreg.laneid

read_ptx_sreg_lanemask_eq(ssa)

nvvm.read.ptx.sreg.lanemask.eq

read_ptx_sreg_lanemask_ge(ssa)

nvvm.read.ptx.sreg.lanemask.ge

read_ptx_sreg_lanemask_gt(ssa)

nvvm.read.ptx.sreg.lanemask.gt

read_ptx_sreg_lanemask_le(ssa)

nvvm.read.ptx.sreg.lanemask.le

read_ptx_sreg_lanemask_lt(ssa)

nvvm.read.ptx.sreg.lanemask.lt

read_ptx_sreg_nclusterid_x(ssa)

nvvm.read.ptx.sreg.nclusterid.x

read_ptx_sreg_nclusterid_y(ssa)

nvvm.read.ptx.sreg.nclusterid.y

read_ptx_sreg_nclusterid_z(ssa)

nvvm.read.ptx.sreg.nclusterid.z

read_ptx_sreg_nctaid_x(ssa)

nvvm.read.ptx.sreg.nctaid.x

read_ptx_sreg_nctaid_y(ssa)

nvvm.read.ptx.sreg.nctaid.y

read_ptx_sreg_nctaid_z(ssa)

nvvm.read.ptx.sreg.nctaid.z

read_ptx_sreg_nsmid(ssa)

nvvm.read.ptx.sreg.nsmid

read_ptx_sreg_ntid_x(ssa)

nvvm.read.ptx.sreg.ntid.x

read_ptx_sreg_ntid_y(ssa)

nvvm.read.ptx.sreg.ntid.y

read_ptx_sreg_ntid_z(ssa)

nvvm.read.ptx.sreg.ntid.z

read_ptx_sreg_nwarpid(ssa)

nvvm.read.ptx.sreg.nwarpid

read_ptx_sreg_smid(ssa)

nvvm.read.ptx.sreg.smid

read_ptx_sreg_tid_x(ssa)

nvvm.read.ptx.sreg.tid.x

read_ptx_sreg_tid_y(ssa)

nvvm.read.ptx.sreg.tid.y

read_ptx_sreg_tid_z(ssa)

nvvm.read.ptx.sreg.tid.z

read_ptx_sreg_warpid(ssa)

nvvm.read.ptx.sreg.warpid

read_ptx_sreg_warpsize(ssa)

nvvm.read.ptx.sreg.warpsize

redux_sync(ssa)

nvvm.redux.sync - Redux Sync Op

Attributes

  • kind - Single, ReduxKindAttr, NVVM redux kind
  • abs - Single, BoolAttr, bool attribute
  • nan - Single, BoolAttr, bool attribute

Operands

  • val - Single, LLVM_Type, LLVM dialect-compatible type
  • mask_and_clamp - Single, I32, 32-bit signless integer

Results

  • res - Single, LLVM_Type, LLVM dialect-compatible type

Description

redux.sync performs a reduction operation, specified by kind, of the 32-bit source register across all non-exited threads in the membermask.

The abs and nan attributes apply when the input type is f32: abs causes the absolute value of the input to be used in the reduction operation, and nan causes the reduction to return NaN if any input from a participating thread is NaN.

For more information, see PTX ISA
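
An illustrative sketch of an integer add reduction (SSA names are placeholders; the exact assembly format is the one defined by the dialect):

```mlir
// Warp-wide add reduction of %val across the non-exited threads
// selected by the mask; every participating thread receives the sum.
%sum = nvvm.redux.sync add %val, %mask_and_clamp : i32 -> i32
```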

setmaxregister(ssa)

nvvm.setmaxregister

shfl_sync(ssa)

nvvm.shfl.sync - NVVM Dialect Op for shfl.sync

Attributes

  • kind - Single, ShflKindAttr, NVVM shuffle kind
  • return_value_and_is_valid - Optional, UnitAttr, unit attribute

Operands

  • thread_mask - Single, I32, 32-bit signless integer
  • val - Single, LLVM_Type, LLVM dialect-compatible type
  • offset - Single, I32, 32-bit signless integer
  • mask_and_clamp - Single, I32, 32-bit signless integer

Results

  • res - Single, LLVM_Type, LLVM dialect-compatible type

Description

The shfl.sync Op implements data shuffle within threads of a warp. The thread_mask denotes the threads participating in the Op where the bit position corresponds to a particular thread's laneid. The offset specifies a source lane or source lane offset (depending on kind). The val is the input value to be copied from the source. The mask_and_clamp contains two packed values specifying a mask for logically splitting warps into sub-segments and an upper bound for clamping the source lane index.

For more information, see PTX ISA

st_bulk(ssa)

nvvm.st.bulk - Bulk Store Op

Attributes

  • initVal - Single, I64Attr, 64-bit signless integer attribute

Operands

  • addr - Single, anonymous/composite constraint, LLVM pointer in address space 0 or LLVM pointer in address space 3
  • size - Single, I64, 64-bit signless integer

Description

Initializes a region of shared memory at the address given by addr. The size operand specifies the number of bytes to initialize and must be a multiple of 8. The initVal attribute specifies the value to initialize the memory to; the only supported value is 0.
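
Example (a sketch with hypothetical SSA values; the `size =` / `init =` keyword syntax is an assumption and may differ across MLIR versions):

  // Zero-initialize %size bytes of shared memory starting at %addr.
  nvvm.st.bulk %addr, size = %size, init = 0 : !llvm.ptr<3>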

For more information, see PTX ISA

stmatrix(ssa)

nvvm.stmatrix - cooperative matrix store

Attributes

  • layout - Single, MMALayoutAttr, NVVM MMA layout
  • shape - Single, LdStMatrixShapeAttr, Matrix shape for ldmatrix and stmatrix
  • eltType - Single, LdStMatrixEltTypeAttr, Element type for ldmatrix and stmatrix

Operands

  • ptr - Single, LLVM_PointerShared, LLVM pointer in address space 3
  • sources - Variadic, I32, variadic of 32-bit signless integer

Description

Collectively store one or more matrices across all threads in a warp to the location indicated by the address operand ptr in shared memory.
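
Example (a sketch with hypothetical SSA values; the attribute mnemonics shown are assumptions and may differ across MLIR versions):

  // Store one 8x8 matrix of 16-bit elements from register %r0 to shared memory.
  nvvm.stmatrix %ptr, %r0 {
    layout = #nvvm.mma_layout<row>,
    shape = #nvvm.ld_st_matrix_shape<m = 8, n = 8>,
    eltType = #nvvm.ld_st_matrix_elt_type<b16>
  } : !llvm.ptr<3>, i32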

For more information, see PTX ISA

tcgen05_alloc(ssa)

nvvm.tcgen05.alloc - Tcgen05 alloc operation

Attributes

  • group - Single, CTAGroupKindAttr, NVVM CTA group kind

Operands

  • addr - Single, anonymous/composite constraint, LLVM pointer type or LLVM pointer in address space 3
  • nCols - Single, I32, 32-bit signless integer

Description

The tcgen05.alloc Op allocates tensor core memory of the size specified by nCols and writes the address of the allocated memory to the location specified by addr. The nCols operand specifies the number of columns to be allocated and must be a power of two. For more information, see PTX ISA
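
Example (a sketch with hypothetical SSA values; the exact assembly format may vary across MLIR versions):

  // Allocate %ncols columns of tensor memory; the resulting address is written to %dst.
  nvvm.tcgen05.alloc %dst, %ncols {group = #nvvm.tcgen05_group<cta_1>} : !llvm.ptr<3>, i32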

tcgen05_commit(ssa)

nvvm.tcgen05.commit - Tcgen05 commit operations

Attributes

  • group - Single, CTAGroupKindAttr, NVVM CTA group kind

Operands

  • addr - Single, anonymous/composite constraint, LLVM pointer type or LLVM pointer in address space 3
  • multicastMask - Optional, I16, 16-bit signless integer

Description

The tcgen05.commit makes the mbarrier object, specified by the operand addr, track the completion of all the prior async-tcgen05 operations initiated by the executing thread. The multicast variants allow signaling on the mbarrier objects of multiple CTAs within the cluster. Operand multicastMask, when present, specifies the destination CTAs in the cluster such that each bit position in the 16-bit multicastMask operand corresponds to the nvvm.read.ptx.sreg.ctaid of the destination CTA. For more information, see PTX ISA
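
Example (a sketch with a hypothetical SSA value; the exact assembly format may vary across MLIR versions):

  // Make the mbarrier at %addr track completion of prior async tcgen05 operations.
  nvvm.tcgen05.commit %addr {group = #nvvm.tcgen05_group<cta_1>} : !llvm.ptr<3>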

tcgen05_cp(ssa)

nvvm.tcgen05.cp - Tcgen05 copy operation

Attributes

  • shape - Single, Tcgen05CpShapeAttr, tcgen05 cp shapes
  • group - Single, CTAGroupKindAttr, NVVM CTA group kind
  • multicast - Single, Tcgen05CpMulticastAttr, tcgen05 cp multicast
  • srcFormat - Optional, Tcgen05CpSrcFormatAttr, tcgen05 cp source format

Operands

  • taddr - Single, LLVM_PointerTensor, LLVM pointer in address space 6
  • smem_desc - Single, I64, 64-bit signless integer

Description

Instruction tcgen05.cp initiates an asynchronous copy operation from shared memory to the location specified by the address operand taddr in the Tensor Memory. The 64-bit register operand smem_desc specifies the matrix descriptor representing the source matrix in the shared memory that needs to be copied.

Example:

  nvvm.tcgen05.cp %taddr, %smem_desc {
    group = #nvvm.tcgen05_group<cta_2>,
    shape = #nvvm.tcgen05_cp_shape<shape_64x128b>,
    multicast = #nvvm.tcgen05_cp_multicast<warpx2_01_23>,
    srcFormat = #nvvm.tcgen05_cp_src_fmt<b6x16_p32>
  }

For more information, see PTX ISA

tcgen05_dealloc(ssa)

nvvm.tcgen05.dealloc - Tcgen05 dealloc operation

Attributes

  • group - Single, CTAGroupKindAttr, NVVM CTA group kind

Operands

  • taddr - Single, LLVM_PointerTensor, LLVM pointer in address space 6
  • nCols - Single, I32, 32-bit signless integer

Description

The tcgen05.dealloc Op de-allocates the tensor core memory specified by taddr, which must come from a previous tensor memory allocation. The nCols operand specifies the number of columns to be de-allocated, and it must be a power of two. For more information, see PTX ISA
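
Example (a sketch with hypothetical SSA values; the exact assembly format may vary across MLIR versions):

  // Release %ncols columns previously obtained from tcgen05.alloc.
  nvvm.tcgen05.dealloc %taddr, %ncols {group = #nvvm.tcgen05_group<cta_1>} : !llvm.ptr<6>, i32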

tcgen05_fence(ssa)

nvvm.tcgen05.fence

tcgen05_ld(ssa)

nvvm.tcgen05.ld - tensor memory load instructions

Attributes

  • pack - Optional, UnitAttr, unit attribute
  • shape - Single, Tcgen05LdStShapeAttr, allowed 32-bit signless integer cases: 0, 1, 2, 3, 4

Operands

  • tmemAddr - Single, LLVM_PointerTensor, LLVM pointer in address space 6
  • offset - Optional, I64, 64-bit signless integer

Results

  • res - Single, anonymous/composite constraint, 32-bit signless integer or vector of 32-bit signless integer values of length 2/4/8/16/32/64/128

Description

Instruction tcgen05.ld asynchronously loads data from the Tensor Memory at the location specified by the 32-bit address operand tmemAddr into the destination register res, collectively across all threads of the warps.

The shape and the num attribute together determine the total dimension of the data which is loaded from the Tensor Memory. The shape attribute indicates the base dimension of data to be accessed as described in the Data Movement Shape. The num attribute indicates the repeat factor on the base dimension, resulting in the total dimension of the data that is accessed.

The shape 16x32bx2 performs two accesses into Tensor Memory of the shape 16x32b. The base address of the first access is specified by tmemAddr and the base address of the second access is specified by tmemAddr + offset, where offset is an immediate argument.

The unit attribute pack can be used to pack two 16-bit elements from adjacent columns into a single 32-bit element during the load.

The following table describes the size of the vector for various combinations of num and shape attributes:

|=====================================================================|
| num/shape      |     16x32bx2/16x64b/32x32b |  16x128b   | 16x256b  |
|=====================================================================|
| x1             |          1                 |    2       |    4     |
| x2             |          2                 |    4       |    8     |
| x4             |          4                 |    8       |    16    |
| x8             |          8                 |    16      |    32    |
| x16            |          16                |    32      |    64    |
| x32            |          32                |    64      |    128   |
| x64            |          64                |    128     |    NA    |
| x128           |          128               |    NA      |    NA    |
|=====================================================================|

Example:

  nvvm.tcgen05.ld %tmemAddr, %offset pack {
    shape = #nvvm.tcgen05_ldst_shape<shape_16x32bx2>
  } : <2xi32>

For more information, see PTX ISA

tcgen05_mma_smem_desc(ssa)

nvvm.tcgen05.mma_smem_desc - Constructs a Shared Memory descriptor for MMA Operands A or B

Operands

  • startAddr - Single, I32, 32-bit signless integer
  • leadingDimOffset - Single, I32, 32-bit signless integer
  • strideDimOffset - Single, I32, 32-bit signless integer
  • baseOffset - Single, I8, 8-bit signless integer
  • leadingDimMode - Single, I1, 1-bit signless integer
  • swizzleMode - Single, I8, 8-bit signless integer

Results

  • res - Single, I64, 64-bit signless integer

Description

The nvvm.tcgen05.mma_smem_desc Op constructs a Shared Memory descriptor for tcgen05.mma. This descriptor is a 64-bit value which describes the properties of the multiplicand matrix in shared memory, including its location in the shared memory of the current CTA.

+-----------+------+------------------------------------------------------+
| Bit-field | Size | Description                                          |
+-----------+------+------------------------------------------------------+
| 0-13      | 14   | Matrix start address                                 |
| 14-15     | 2    | Reserved                                             |
| 16-29     | 14   | Leading dim relative-offset (or) absolute-address    |
| 30-31     | 2    | Reserved                                             |
| 32-45     | 14   | Stride dimension byte offset                         |
| 46-48     | 3    | Fixed constant value of 0b001                        |
| 49-51     | 3    | Matrix base offset                                   |
| 52        | 1    | Leading dimension stride mode:                       |
|           |      |   0: byte offset relative                            |
|           |      |   1: byte address absolute                           |
| 53-60     | 8    | Fixed constant value of 0b00000000                   |
| 61-63     | 3    | Swizzling mode:                                      |
|           |      |   0: No swizzling                                    |
|           |      |   1: 128-Byte with 32B atomic swizzling              |
|           |      |   2: 128-Byte swizzling                              |
|           |      |   4: 64-Byte swizzling                               |
|           |      |   6: 32-Byte swizzling                               |
|           |      |   (Values 3, 5 and 7 are invalid)                    |
+-----------+------+------------------------------------------------------+    

Example:

  %desc = nvvm.tcgen05.mma_smem_desc (%startAddr, %leadingDimOffset, %strideDimOffset,
                                      %baseOffset, %leadingDimMode, %swizzleMode) : (i32, i32, i32, i8, i1, i8) -> i64

For more information, see PTX ISA

tcgen05_relinquish_alloc_permit(ssa)

nvvm.tcgen05.relinquish_alloc_permit

tcgen05_shift(ssa)

nvvm.tcgen05.shift - Tcgen05 shift operation

Attributes

  • group - Single, CTAGroupKindAttr, NVVM CTA group kind

Operands

  • taddr - Single, LLVM_PointerTensor, LLVM pointer in address space 6

Description

The tcgen05.shift is an asynchronous instruction which initiates the shifting of 32-byte elements downwards across all the rows, except the last, by one row. The operand taddr specifies the base address of the matrix in Tensor Memory whose rows must be down shifted.
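
Example (a sketch with a hypothetical SSA value; the exact assembly format may vary across MLIR versions):

  // Shift the rows of the matrix at %taddr down by one row.
  nvvm.tcgen05.shift %taddr {group = #nvvm.tcgen05_group<cta_1>} : !llvm.ptr<6>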

For more information, see PTX ISA

tcgen05_st(ssa)

nvvm.tcgen05.st - tensor memory store instructions

Attributes

  • unpack - Optional, UnitAttr, unit attribute
  • shape - Single, Tcgen05LdStShapeAttr, allowed 32-bit signless integer cases: 0, 1, 2, 3, 4

Operands

  • tmemAddr - Single, LLVM_PointerTensor, LLVM pointer in address space 6
  • val - Single, anonymous/composite constraint, 32-bit signless integer or vector of 32-bit signless integer values of length 2/4/8/16/32/64/128
  • offset - Optional, I64, 64-bit signless integer

Description

Instruction tcgen05.st asynchronously stores data from the source register val into the Tensor Memory at the location specified by the 32-bit address operand tmemAddr, collectively across all threads of the warps.

The shape and the num attribute together determine the total dimension of the data which is stored to the Tensor Memory. The shape attribute indicates the base dimension of data to be accessed. The num attribute indicates the repeat factor on the base dimension, resulting in the total dimension of the data that is accessed.

The shape 16x32bx2 performs two accesses into Tensor Memory of the shape 16x32b. The base address of the first access is specified by tmemAddr and the base address of the second access is specified by tmemAddr + offset, where offset is an immediate argument.

The unit attribute unpack can be used to unpack a 32-bit element in the register into two 16-bit elements and store them in adjacent columns.

The following table describes the size of the vector for various combinations of num and shape attributes:

|=====================================================================|
| num/shape      |     16x32bx2/16x64b/32x32b |  16x128b   | 16x256b  |
|=====================================================================|
| x1             |          1                 |    2       |    4     |
| x2             |          2                 |    4       |    8     |
| x4             |          4                 |    8       |    16    |
| x8             |          8                 |    16      |    32    |
| x16            |          16                |    32      |    64    |
| x32            |          32                |    64      |    128   |
| x64            |          64                |    128     |    NA    |
| x128           |          128               |    NA      |    NA    |
|=====================================================================|

Example:

  nvvm.tcgen05.st %tmemAddr, %val, %offset unpack {
    shape = #nvvm.tcgen05_ldst_shape<shape_16x32bx2>
  } : <2xi32>

For more information, see PTX ISA

tcgen05_wait(ssa)

nvvm.tcgen05.wait

vote_sync(ssa)

nvvm.vote.sync - Vote across thread group

Attributes

  • kind - Single, VoteSyncKindAttr, NVVM vote sync kind

Operands

  • mask - Single, I32, 32-bit signless integer
  • pred - Single, I1, 1-bit signless integer

Results

  • res - Single, anonymous/composite constraint, 32-bit signless integer or 1-bit signless integer

Description

The vote.sync op causes the executing thread to wait until all non-exited threads corresponding to membermask have executed vote.sync with the same qualifiers and the same membermask value before resuming execution.

The vote operation kinds are:

  • any: True if source predicate is True for some thread in membermask.
  • all: True if source predicate is True for all non-exited threads in membermask.
  • uni: True if source predicate has the same value in all non-exited threads in membermask.
  • ballot: In the ballot form, the destination result is a 32-bit integer. In this form, the predicate from each thread in membermask is copied into the corresponding bit position of the result, where the bit position corresponds to the thread's lane id.
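
Example (a sketch with hypothetical SSA values; the exact assembly format may vary across MLIR versions):

  // All-lane predicate vote and ballot over the lanes selected by %mask.
  %any    = nvvm.vote.sync any %mask, %pred -> i1
  %ballot = nvvm.vote.sync ballot %mask, %pred -> i32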

For more information, see PTX ISA

wgmma_commit_group_sync_aligned(ssa)

nvvm.wgmma.commit.group.sync.aligned

wgmma_fence_aligned(ssa)

nvvm.wgmma.fence.aligned

wgmma_mma_async(ssa)

nvvm.wgmma.mma_async

Attributes

  • shape - Single, NVVM_MMAShapeAttr, Attribute for MMA operation shape.
  • typeA - Single, WGMMATypesAttr, NVVM WGMMA types
  • typeB - Single, WGMMATypesAttr, NVVM WGMMA types
  • typeD - Single, WGMMATypesAttr, NVVM WGMMA types
  • scaleD - Single, WGMMAScaleOutAttr, WGMMA input predicate
  • scaleA - Single, WGMMAScaleInAttr, WGMMA overflow options
  • scaleB - Single, WGMMAScaleInAttr, WGMMA overflow options
  • layoutA - Single, MMALayoutAttr, NVVM MMA layout
  • layoutB - Single, MMALayoutAttr, NVVM MMA layout
  • satfinite - Optional, MMAIntOverflowAttr, MMA overflow options

Operands

  • inouts - Single, LLVM_AnyStruct, LLVM structure type
  • descriptorA - Single, I64, 64-bit signless integer
  • descriptorB - Single, I64, 64-bit signless integer

Results

  • results - Single, LLVM_AnyStruct, LLVM structure type

Description

The warpgroup (128 threads) level matrix multiply and accumulate operation has either of the following forms, where matrix D is called the accumulator:

  • D = A * B + D
  • D = A * B, where the input from accumulator D is disabled.

Supported shapes:

|--------------|--------------|------------|--------------|---------------|
|              |              |            |              |f16+=e4m3*e4m3 |
|              |              |            |              |f16+=e5m2*e5m2 |
|f32+=tf32*tf32|f16+=f16 *f16 | s32+=s8*s8 |s32 += b1 * b1|f16+=e5m2*e4m3 |
|              |f32+=f16 *f16 | s32+=u8*u8 |              |f16+=e4m3*e5m2 |
|              |f32+=bf16*bf16| s32+=s8*u8 |              |f32+=e4m3*e4m3 |
|              |              | s32+=u8*s8 |              |f32+=e5m2*e5m2 |
|              |              |            |              |f32+=e5m2*e4m3 |
|              |              |            |              |f32+=e4m3*e5m2 |
|--------------|--------------|------------|--------------|---------------|
|   .m64n8k8   |  .m64n8k16   | .m64n8k32  | .m64n8k256   | .m64n8k32     |
|   .m64n16k8  |  .m64n16k16  | .m64n16k32 | .m64n16k256  | .m64n16k32    |
|   .m64n24k8  |  .m64n24k16  | .m64n24k32 | .m64n24k256  | .m64n24k32    |
|   .m64n32k8  |  .m64n32k16  | .m64n32k32 | .m64n32k256  | .m64n32k32    |
|   .m64n40k8  |  .m64n40k16  | .m64n48k32 | .m64n48k256  | .m64n40k32    |
|   .m64n48k8  |  .m64n48k16  | .m64n64k32 | .m64n64k256  | .m64n48k32    |
|   .m64n56k8  |  .m64n56k16  | .m64n80k32 | .m64n80k256  | .m64n56k32    |
|   .m64n64k8  |  .m64n64k16  | .m64n96k32 | .m64n96k256  | .m64n64k32    |
|   .m64n72k8  |  .m64n72k16  | .m64n112k32| .m64n112k256 | .m64n72k32    |
|   .m64n80k8  |  .m64n80k16  | .m64n128k32| .m64n128k256 | .m64n80k32    |
|   .m64n88k8  |  .m64n88k16  | .m64n144k32| .m64n144k256 | .m64n88k32    |
|   .m64n96k8  |  .m64n96k16  | .m64n160k32| .m64n160k256 | .m64n96k32    |
|   .m64n104k8 |  .m64n104k16 | .m64n176k32| .m64n176k256 | .m64n104k32   |
|   .m64n112k8 |  .m64n112k16 | .m64n192k32| .m64n192k256 | .m64n112k32   |
|   .m64n120k8 |  .m64n120k16 | .m64n208k32| .m64n208k256 | .m64n120k32   |
|   .m64n128k8 |  .m64n128k16 | .m64n224k32| .m64n224k256 | .m64n128k32   |
|   .m64n136k8 |  .m64n136k16 | .m64n240k32| .m64n240k256 | .m64n136k32   |
|   .m64n144k8 |  .m64n144k16 | .m64n256k32| .m64n256k256 | .m64n144k32   |
|   .m64n152k8 |  .m64n152k16 |            |              | .m64n152k32   |
|   .m64n160k8 |  .m64n160k16 |            |              | .m64n160k32   |
|   .m64n168k8 |  .m64n168k16 |            |              | .m64n168k32   |
|   .m64n176k8 |  .m64n176k16 |            |              | .m64n176k32   |
|   .m64n184k8 |  .m64n184k16 |            |              | .m64n184k32   |
|   .m64n192k8 |  .m64n192k16 |            |              | .m64n192k32   |
|   .m64n200k8 |  .m64n200k16 |            |              | .m64n200k32   |
|   .m64n208k8 |  .m64n208k16 |            |              | .m64n208k32   |
|   .m64n216k8 |  .m64n216k16 |            |              | .m64n216k32   |
|   .m64n224k8 |  .m64n224k16 |            |              | .m64n224k32   |
|   .m64n232k8 |  .m64n232k16 |            |              | .m64n232k32   |
|   .m64n240k8 |  .m64n240k16 |            |              | .m64n240k32   |
|   .m64n248k8 |  .m64n248k16 |            |              | .m64n248k32   |
|   .m64n256k8 |  .m64n256k16 |            |              | .m64n256k32   |
|--------------|--------------|------------|--------------|---------------|

For more information, see PTX ISA

wgmma_wait_group_sync_aligned(ssa)

nvvm.wgmma.wait.group.sync.aligned

wmma_load(ssa)

nvvm.wmma.load - Warp synchronous matrix load

Attributes

  • m - Single, I32Attr, 32-bit signless integer attribute
  • n - Single, I32Attr, 32-bit signless integer attribute
  • k - Single, I32Attr, 32-bit signless integer attribute
  • layout - Single, MMALayoutAttr, NVVM MMA layout
  • eltype - Single, MMATypesAttr, NVVM MMA types
  • frag - Single, MMAFragAttr, NVVM MMA frag type

Operands

  • ptr - Single, LLVM_AnyPointer, LLVM pointer type
  • stride - Single, I32, 32-bit signless integer

Results

  • res - Single, LLVM_AnyStruct, LLVM structure type

wmma_mma(ssa)

nvvm.wmma.mma - Warp synchronous matrix-multiply accumulate using tensor cores.

Attributes

  • m - Single, I32Attr, 32-bit signless integer attribute
  • n - Single, I32Attr, 32-bit signless integer attribute
  • k - Single, I32Attr, 32-bit signless integer attribute
  • layoutA - Single, MMALayoutAttr, NVVM MMA layout
  • layoutB - Single, MMALayoutAttr, NVVM MMA layout
  • eltypeA - Single, MMATypesAttr, NVVM MMA types
  • eltypeB - Single, MMATypesAttr, NVVM MMA types

Operands

  • args - Variadic, LLVM_Type, variadic of LLVM dialect-compatible type

Results

  • res - Single, LLVM_AnyStruct, LLVM structure type

wmma_store(ssa)

nvvm.wmma.store - Warp synchronous matrix store

Attributes

  • m - Single, I32Attr, 32-bit signless integer attribute
  • n - Single, I32Attr, 32-bit signless integer attribute
  • k - Single, I32Attr, 32-bit signless integer attribute
  • layout - Single, MMALayoutAttr, NVVM MMA layout
  • eltype - Single, MMATypesAttr, NVVM MMA types

Operands

  • ptr - Single, LLVM_AnyPointer, LLVM pointer type
  • args - Variadic, LLVM_Type, variadic of LLVM dialect-compatible type
  • stride - Single, I32, 32-bit signless integer