Matrix Core Programming on AMD GPUs


TL;DR In this blog post, we walk through how to use Matrix Cores in HIP kernels, with a focus on low-precision data types such as FP16, FP8, and FP4, as well as the new family of Matrix Core instructions with exponent block scaling introduced in the AMD CDNA™4 architecture. Through code examples and illustrations, we provide the necessary knowledge to start programming Matrix Cores, covering modern low-precision floating-point types, the Matrix Core compiler intrinsics, and the data layouts required by the Matrix Core instructions. The blog post is also available on ROCm Blogs.

1. Matrix Cores


Matrix multiplication is an essential part of AI and HPC workloads. The AMD CDNA™ architecture features special-purpose hardware, the Matrix Cores, to accelerate matrix fused-multiply-add (MFMA) operations, defined as D := A*B + C. Note that MFMA instructions are often used to update a matrix in place (accumulation), so that D = C and C := A*B + C. The matrices A and B are called the input matrices, while the matrix D is referred to as the output matrix or accumulator.

The performance gains from using Matrix Cores are especially significant in mixed-precision mode, where the input matrices use lower-precision data types instead of FP32. The output matrix, however, is stored in FP32 to minimize accuracy loss during accumulation. The tables below show the theoretical peak performance of Matrix Cores with different input data types on both the AMD CDNA™3 and AMD CDNA™4 architectures. On the AMD Instinct™ MI325X, using FP16 input matrices delivers nearly an 8x performance increase compared to single precision, with only minimal accuracy loss. Switching to FP8 doubles the performance again, providing a 16x increase over FP32. The AMD CDNA™4 architecture further improves Matrix Core performance, delivering up to 2x higher throughput for FP16 and FP8 compared to the AMD CDNA™3 architecture. In addition, AMD CDNA™4 introduces new low-precision data types such as FP6 and FP4, enabling up to a 64x performance gain relative to FP32. Please refer to the AMD CDNA™3 and AMD CDNA™4 white papers for detailed architecture specifications.


| Type | AMD Instinct™ MI325X (CDNA™3) | Speedup vs. FP32 |
| --- | --- | --- |
| Matrix FP64 | 163.4 TFLOP/s | 1x |
| Matrix FP32 | 163.4 TFLOP/s | 1x |
| Matrix FP16 | 1307.4 TFLOP/s | ~8x |
| Matrix FP8 | 2614.9 TFLOP/s | ~16x |

| Type | AMD Instinct™ MI355X (CDNA™4) | Speedup vs. FP32 |
| --- | --- | --- |
| Matrix FP64 | 78.6 TFLOP/s | ~0.5x |
| Matrix FP32 | 157.3 TFLOP/s | 1x |
| Matrix FP16 | 2.5 PFLOP/s | ~16x |
| Matrix FP8 | 5 PFLOP/s | ~32x |
| Matrix FP6 | 10 PFLOP/s | ~64x |
| Matrix FP4 | 10 PFLOP/s | ~64x |

2. Low-Precision Floating-Point Types

A binary representation of a floating-point number consists of n bits, where m of n bits represent the mantissa, 1 bit determines the sign and n-m-1 bits represent the exponent. The following image illustrates the binary format of a floating-point number and how the exponent and mantissa are calculated based on its binary representation.

Figure 1: Binary representation of a floating-point number.

Floating-point types are characterized by the number of bits used for the exponent and for the mantissa. Increasing the exponent width extends the range of representable values, while increasing the mantissa width improves precision. Since all floating-point types include the sign bit, a shorthand notation typically specifies only the exponent and mantissa widths. For example, the E4M3 type is an 8-bit floating-point type with a 4-bit exponent and a 3-bit mantissa. Additionally, a floating-point type is specified by its exponent bias – a number that is subtracted from the exponent during conversion from the binary representation to the real value. Given the exponent width, mantissa width, and exponent bias, one can convert the binary representation of a floating-point type (except E8M0) into its real value using the following equation:

Figure 2: Conversion to real value from binary representation for floating-point numbers.
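In text form, this standard conversion rule (the general rule referenced above, excluding E8M0) can be written as follows, where S is the sign bit, E the stored exponent, M the stored mantissa, and m the mantissa width:

x = \begin{cases}
(-1)^{S} \, (1 + M \cdot 2^{-m}) \, 2^{E - \mathrm{bias}}, & E \neq 0 \quad \text{(normal)} \\
(-1)^{S} \, (M \cdot 2^{-m}) \, 2^{1 - \mathrm{bias}}, & E = 0 \quad \text{(subnormal)}
\end{cases}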

Please note that the equation takes different forms depending on whether the exponent is zero or not. Often, certain exponent and mantissa values are reserved for special values (e.g. NaN, Infinity), which limits the range of representable real numbers. For example, the FP16 type has a 5-bit exponent with a nominal range of [0, 1, … 2^5-1] = [0, 1, … 31]. However, the exponent value E = 31 is reserved for NaN (if the mantissa M != 0) and infinity (if the mantissa M = 0). Therefore, the largest exponent value that can represent a real number is E = 30.

The following table summarizes low-precision types commonly used in modern AI/ML workloads:

| Width | Shorthand | Exp. bias | Range | Zero | NaN | Infinity |
| --- | --- | --- | --- | --- | --- | --- |
| 16-bit | E5M10 (FP16) | 15 | ±65504 | S 00000 0000000000 | S 11111 xxxxxxxxxx | S 11111 0000000000 |
| 16-bit | E8M7 (BF16) | 127 | ±3.3895 * 10^38 | S 00000000 0000000 | S 11111111 xxxxxxx | S 11111111 0000000 |
| 8-bit | E4M3FN (FP8, OCP) | 7 | ±448 | S 0000 000 | S 1111 111 | n/a |
| 8-bit | E4M3FNUZ (FP8) | 8 | ±240 | 0 0000 000 | 1 0000 000 | n/a |
| 8-bit | E5M2 (BF8, OCP) | 15 | ±57344 | S 00000 00 | S 11111 {01, 10, 11} | S 11111 00 |
| 8-bit | E5M2FNUZ (BF8) | 16 | ±57344 | 0 00000 00 | 1 00000 00 | n/a |
| 8-bit | E8M0 | 127 | 2^(±127) | n/a | 11111111 | n/a |
| 6-bit | E2M3 | 1 | ±7.5 | S 00 000 | n/a | n/a |
| 6-bit | E3M2 (BF6) | 3 | ±28 | S 000 00 | n/a | n/a |
| 4-bit | E2M1 (FP4) | 1 | ±6 | S 00 0 | n/a | n/a |
Please note that the E4M3 type has two variants: E4M3FN and E4M3FNUZ. Both use 4 bits for the exponent and 3 bits for the mantissa, but they have different exponent biases and differ in the special values they can represent. Neither variant supports infinities, which is why their names include FN (FiNite). However, E4M3FN supports +0, -0, +NaN, and -NaN, while E4M3FNUZ supports only +0 and a single NaN, hence UZ (Unsigned Zero). The image below demonstrates how to convert a binary sequence into a real value, using the E4M3FNUZ type as an example:

Figure 3: E4M3FNUZ encoding details.
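To make the encoding concrete, the sketch below decodes a raw E4M3FNUZ byte on the host, following the rules described above (bias = 8, NaN encoded as 1 0000 000, a single unsigned zero, no infinities). The helper name is ours for illustration and is not part of HIP or ROCm:

#include <cmath>
#include <cstdint>
#include <cstdio>

// Decode a raw E4M3FNUZ byte into a float (illustrative helper, not a ROCm API).
float e4m3fnuz_to_float(uint8_t bits) {
    const int sign = bits >> 7;          // 1 sign bit
    const int exp  = (bits >> 3) & 0xF;  // 4 exponent bits
    const int man  = bits & 0x7;         // 3 mantissa bits

    if (bits == 0x80) return NAN;   // 1 0000 000 encodes NaN
    if (bits == 0x00) return 0.0f;  // 0 0000 000 encodes the single (unsigned) zero

    const float value = (exp == 0)
        ? std::ldexp(man / 8.0f, 1 - 8)           // subnormal: (M / 2^3) * 2^(1 - bias)
        : std::ldexp(1.0f + man / 8.0f, exp - 8); // normal: (1 + M / 2^3) * 2^(E - bias)
    return sign ? -value : value;
}

int main() {
    std::printf("%g\n", e4m3fnuz_to_float(0x4A)); // 0 1001 010 -> (1 + 2/8) * 2^(9-8) = 2.5
    std::printf("%g\n", e4m3fnuz_to_float(0x7F)); // 0 1111 111 -> 1.875 * 2^7 = 240 (max)
}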

FP8 types are divided into E4M3 and E5M2 formats. The E5M2 format is sometimes referred to as BF8, by analogy with BF16, since it trades mantissa bits for a wider exponent. Like E4M3, E5M2 is further subdivided into two variants: E5M2 (OCP) and E5M2FNUZ. The AMD CDNA™3 architecture uses the FNUZ variants for both E4M3 and E5M2, whereas the CDNA™4 architecture uses the E4M3FN and E5M2 (OCP) variants. E4M3FN and E5M2 are standardized formats defined by the Open Compute Project (OCP). For detailed specifications, see the OCP Microscaling Formats (MX) Specification and the ONNX documentation. For a visualization of FP8 values and their binary representations, please refer to the FP8 Data table. Additionally, see the chapter “Low-precision floating-point types” in the AMD ROCm™ documentation for details on using low-precision types in HIP.

There is a special 8-bit format, E8M0, which is not used as a standard element data type but instead serves as a scale factor for microscaling types and block-scaled MFMA operations (discussed later in this article). Its value is calculated according to the equation below:

Figure 4: E8M0 encoding details.

The exponent value E = 255 is reserved for NaN values, limiting the range of representable real numbers to [2^-127 … 2^127].
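In other words, an E8M0 scale factor with stored exponent E decodes as:

x = \begin{cases}
2^{E - 127}, & 0 \le E \le 254 \\
\mathrm{NaN}, & E = 255
\end{cases}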

Similar to FP8, FP6 has two formats: E2M3 and E3M2. The latter, E3M2, is often referred to as BF6 due to its larger exponent width compared to E2M3.

3. Matrix fused-multiply-add (MFMA) Instructions

The AMD CDNA™3 and CDNA™4 architectures support a variety of MFMA operations, which are characterized by the matrix dimensions M, N, K and the data type of input/output matrices. The following table lists all available floating-point MFMA instructions for the AMD CDNA™3 and CDNA™4 architectures. As can be seen from the table, the AMD CDNA™4 architecture extends the set of available MFMA instructions by adding new FP16/BF16 instructions with larger matrix dimensions. Furthermore, it introduces FP6/FP4 data types and provides a completely new set of FP8/FP6/FP4 instructions where the types can be independently used for the matrices A and B. Finally, the AMD CDNA™4 architecture enables MFMA with block exponent scaling.

| Type (C,D) ← (A,B) | MxNxK (CDNA™3) | MxNxK (CDNA™4) | Cycles | Note |
| --- | --- | --- | --- | --- |
| FP64 ← FP64 | 16x16x4 | 16x16x4 | 64 | |
| FP32 ← FP32 | 32x32x2 | 32x32x2 | 64 | |
| | 16x16x4 | 16x16x4 | 32 | |
| FP32 ← FP16 (BF16) | 32x32x8 | 32x32x8 | 32 | Both A and B are either FP16 or BF16 |
| | 16x16x16 | 16x16x16 | 16 | |
| | n/a | 16x16x32 | 16 | |
| | n/a | 32x32x16 | 32 | |
| FP32 ← FP8 | 16x16x32 | 16x16x32 | 16 | FP8 (E4M3) or BF8 (E5M2) can be used independently for A and B |
| | 32x32x16 | 32x32x16 | 32 | |
| FP32 ← FP8/FP6/FP4 | n/a | 16x16x128 | 16 or 32 | FP4, FP6, or FP8 can be used independently for A and B. Larger cycle count if either matrix A or B is FP8 |
| | n/a | 32x32x64 | 32 or 64 | |
| FP32 ← MXFP8/MXFP6/MXFP4 | n/a | 16x16x128 | 16 or 32 | FP4, FP6, or FP8 can be used independently for A and B. Larger cycle count if either matrix A or B is FP8 |
| | n/a | 32x32x64 | 32 or 64 | |

Please note that the table lists only floating-point MFMA instructions with batch size = 1. In addition, the AMD CDNA™3 and CDNA™4 architectures support batched MFMA operations, where multiple output matrices are computed in parallel. These instructions are not covered in this article. See Chapter 7, “Matrix Arithmetic Instructions”, in the AMD CDNA™3 and AMD CDNA™4 ISA reference guides for the full list of available MFMA instructions.

The table above specifies the cycle count for each MFMA operation. Given the cycle count, one can estimate the theoretical peak performance of the corresponding MFMA operation in TFLOP/s using the formula below:

2*M*N*K * num_matrix_cores * (max_engine_clock / cycle_count) / 10^6,

where

num_matrix_cores is the total number of Matrix Cores in the GPU (specified in the white paper)
max_engine_clock is the peak engine clock in MHz (specified in the white paper)
cycle_count is the cycle count of the corresponding MFMA instruction
M, N, K are the matrix dimensions

Using this formula and the 32x32x8 FP16 MFMA instruction as an example, we can estimate the theoretical peak FP16 Matrix Core performance on the AMD Instinct™ MI325X:

2*32*32*8 * 1216 * (2100 / 32) / 10^6 = 1307.4 TFLOP/s.
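As a quick sanity check, the same calculation can be scripted in a few lines of C++ (a small sketch; the MI325X figures of 1216 Matrix Cores and a 2100 MHz peak engine clock are the ones used in the example above):

#include <cstdio>

// Theoretical peak throughput in TFLOP/s for one MFMA instruction shape,
// using the formula 2*M*N*K * num_matrix_cores * (max_engine_clock / cycle_count) / 10^6.
double mfma_peak_tflops(int M, int N, int K, int cycle_count,
                        int num_matrix_cores, double max_engine_clock_mhz) {
    return 2.0 * M * N * K * num_matrix_cores
           * (max_engine_clock_mhz / cycle_count) / 1e6;
}

int main() {
    // 32x32x8 FP16 MFMA on the AMD Instinct MI325X: 1216 Matrix Cores, 2100 MHz peak clock.
    std::printf("%.1f TFLOP/s\n", mfma_peak_tflops(32, 32, 8, 32, 1216, 2100.0)); // ~1307.4
}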

4. Compiler Intrinsics

To use Matrix Core instructions in HIP kernels, LLVM provides built-in compiler intrinsic functions. The list of all available compiler intrinsics can be found in the LLVM GitHub repository. The MFMA intrinsics have the following syntax:

d_reg = __builtin_amdgcn_mfma_ODType_MxNxKInDType(a_reg, b_reg, c_reg, cbsz, abid, blgp),

where

MxNxK specifies the shapes of the matrices A, B, C, D,
ODType is the data type of the output matrices C and D,
InDType is the data type of the input matrices A and B,
a_reg is a scalar/vector containing a portion of the matrix A,
b_reg is a scalar/vector containing a portion of the matrix B,
c_reg is a vector containing a portion of the matrix C,
d_reg is a vector containing a portion of the matrix D,
cbsz, abid, blgp are broadcast flags. For the following discussion, these flags are irrelevant and are, therefore, set to 0 by default, unless specified otherwise. Please refer to the ISA reference guide for detailed information on the broadcast flags.

For example,

__builtin_amdgcn_mfma_f32_16x16x16f16 performs 16x16x16 MFMA, where both input matrices A and B have type FP16 and the output matrix has type FP32
__builtin_amdgcn_mfma_f32_32x32x16_fp8_fp8 performs 32x32x16 MFMA, where both input matrices A and B have type FP8(E4M3) and the output matrix is stored in FP32
__builtin_amdgcn_mfma_f32_32x32x16_fp8_bf8 performs 32x32x16 MFMA, where the matrix A has type FP8(E4M3) and the matrix B has type BF8(E5M2).
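As a small illustration of how these intrinsics appear in device code, the sketch below wraps the FP16 variant mentioned above. Per the LLVM intrinsic definitions, each lane supplies 4 packed FP16 values for A and B and a 4-element FP32 accumulator; the wrapper name and the vector typedefs (written here with Clang's vector_size extension, as in the Section 5 example) are our own and may be spelled differently in other codebases:

#include <hip/hip_runtime.h>

using fp16x4_t = __attribute__((vector_size(4 * sizeof(_Float16)))) _Float16;
using fp32x4_t = __attribute__((vector_size(4 * sizeof(float)))) float;

// Each wavefront collectively computes D = A*B + C for a 16x16x16 tile;
// a_reg/b_reg/c_reg hold this lane's portion of the operands (layouts: see the ISA guide).
__device__ fp32x4_t mfma_16x16x16_f16(fp16x4_t a_reg, fp16x4_t b_reg, fp32x4_t c_reg) {
    // Broadcast flags cbsz, abid, blgp set to 0, as discussed above.
    return __builtin_amdgcn_mfma_f32_16x16x16f16(a_reg, b_reg, c_reg, 0, 0, 0);
}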

The MFMA instructions are wavefront-level (warp-level) instructions: all work-items (threads) within a wavefront collectively perform a single MFMA operation, and the operands A, B, C, D are distributed across the work-items so that each work-item in the wavefront holds a portion of the operands. To use the MFMA instructions, it is necessary to understand how the operands are distributed across the threads within a wavefront. The ISA reference guide fully specifies the data layout for all available MFMA instructions. For illustrative purposes, the next chapter explains a subset of the MFMA instructions and the corresponding data layouts.

5. Examples

Important note: In the following discussion we assume the matrices are stored in row-major order. The wavefront size on the AMD CDNA™ architecture is 64. The shapes of the matrices A, B, C, D are MxK, KxN, MxN, and MxN, respectively. The first dimension denotes the number of rows and the second dimension the number of columns in a matrix. For example, the matrix A has M rows and K columns.

5.1. __builtin_amdgcn_mfma_f32_32x32x2f32

In this example we will multiply a matrix A of size 32×2 with a matrix B of size 2×32 using a single wavefront (64 threads) and a single MFMA instruction. The output matrix C has shape 32×32. The input and output matrices are FP32. Since the threads within a wavefront collectively perform a single MFMA instruction, the operands are distributed across the threads. Each thread stores

M * K / wavefront_size = 32 * 2 / 64 = 1 entry of the matrix A
K * N / wavefront_size = 2 * 32 / 64 = 1 entry of the matrix B
M * N / wavefront_size = 32 * 32 / 64 = 16 entries of the matrix C

The operands are distributed according to the scheme below. The matrix elements highlighted in light red are those stored by the thread with index 0 within the wavefront.

Figure 5: Data layout for `__builtin_amdgcn_mfma_f32_32x32x2f32`. The operands are stored in row-major order.

The code example below demonstrates how this operation can be implemented as a HIP kernel:

#include <hip/hip_runtime.h>

using fp32x16_t = __attribute__((vector_size(16 * sizeof(float)))) float;

__global__ void
mfma_fp32_32x32x2_fp32(const float* A, const float* B, float* C) {
    float a_reg;
    float b_reg;
    fp32x16_t c_reg {};

    // A is 32x2 (row-major): lane t reads element (row = t % 32, col = t / 32).
    const float* ldg_a_ptr = A + threadIdx.x / 32 + 2 * (threadIdx.x % 32);
    // B is 2x32 (row-major): lane t reads element (row = t / 32, col = t % 32).
    const float* ldg_b_ptr = B + threadIdx.x % 32 + (threadIdx.x / 32) * 32;

    a_reg = *ldg_a_ptr;
    b_reg = *ldg_b_ptr;

    c_reg = __builtin_amdgcn_mfma_f32_32x32x2f32(a_reg, b_reg, c_reg, 0, 0, 0);

    // Write this lane's 16 accumulator entries back to C (32x32, row-major),
    // following the output layout illustrated in Figure 5.
    for (int i = 0; i < 16; ++i) {
        const int row = (i % 4) + 4 * (threadIdx.x / 32) + 8 * (i / 4);
        const int col = threadIdx.x % 32;
        C[row * 32 + col] = c_reg[i];
    }
}
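For completeness, a minimal host-side launch for this kernel might look as follows (a sketch, not part of the original post; error checking is omitted). With all-ones inputs, every entry of C should equal K = 2:

#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    std::vector<float> A(32 * 2, 1.0f), B(2 * 32, 1.0f), C(32 * 32, 0.0f);
    float *dA, *dB, *dC;
    hipMalloc(&dA, A.size() * sizeof(float));
    hipMalloc(&dB, B.size() * sizeof(float));
    hipMalloc(&dC, C.size() * sizeof(float));
    hipMemcpy(dA, A.data(), A.size() * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(dB, B.data(), B.size() * sizeof(float), hipMemcpyHostToDevice);

    // One block of 64 threads = one wavefront performing a single MFMA.
    mfma_fp32_32x32x2_fp32<<<1, 64>>>(dA, dB, dC);

    hipMemcpy(C.data(), dC, C.size() * sizeof(float), hipMemcpyDeviceToHost);
    std::printf("C[0] = %g (expected 2)\n", C[0]);

    hipFree(dA); hipFree(dB); hipFree(dC);
    return 0;
}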
