# **GPU-FPX** and **FTTN**

### Xinyi Li\*, Ganesh Gopalakrishnan\*\*

\* Postdoctoral Fellow, Pacific Northwest National Laboratory, Richland, WA



xinyinicole.com \*\* Professor, Kahlert School of Computing, University of Utah, Salt Lake City, UT https://www.cs.utah.edu/~ganesh









# Presenters



Prof. Ganesh Gopalakrishnan

Professor, Kahlert School of Computing, University of Utah, Salt Lake City, UT

https://www.cs.utah.edu/~ganesh



Dr. Xinyi Li

Postdoctoral Associate, Pacific Northwest National Laboratory, Richland, WA

https://xinyinicole.com

#### **TOOLS PRESENTED :**

**GPU-FPX:** For detecting Floating-Point Exceptions

FTTN: To discover the numerical behavior of Tensor Cores

#### • FP Exceptions

- Cause non-determinism even serious accidents
- May be masked, resulting in users not seeing the effects
- Numerical solutions with NaNs are useless
- Exceptions can change with platforms
- Exception-checking support is limited in Heterogeneous Hardware
- Closed-source libraries make the problem worse
- We provide the first usable binary-instrumentation framework for NVIDIA GPUs
- Long term: Manufacturer Help is essential

#### • FTTN

- Important to know the "HW substrate" which increasingly uses Tensor-Cores
- These are poorly documented
- In the short-term, we must experimentally discover what the behaviors are
- We provide a set of tests to reveal many of the important Tensor-Core behaviors
- Long term: Manufacturer Help is Essential

### NaN ... how perhaps born, how it flows, how likely killed





No need to put NAN-Checker here !! No need to put NAN-Checker here !! Must put NAN-Checker here !!

#### GPU-FPX Tool Efficiently Detects Exceptions in NVIDIA GPU Binaries

(C ? X. Li, I. Laguna, B. Fang, K. Swirydowicz, A. Li and G. Gopalakrishnan, **"Design and Evaluation of GPU-FPX: A Low-Overhead tool for Floating-Point Exception Detection in NVIDIA GPUs,"** HPDC '23: Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing, August 2023, Pages 59–71, https://doi.org/10.1145/3588195.3592991

- HPDC 2023 paper published on new tool GPU-FPX released at <a href="https://github.com/LLNL/GPU-FPX">https://github.com/LLNL/GPU-FPX</a>
- Found 27 previously unknown exceptions detected across 151 programs on their own data sets
  - Some repairs also identified based on tool feedback

| 0(0) 0 |           | Program         | Source available? | Diagnose? | <b>Exceptions Matter?</b> | Fixed?  | How Fixed?                   |
|--------|-----------|-----------------|-------------------|-----------|---------------------------|---------|------------------------------|
| 0/0)   | == 0      | GRAMSCHM        | yes               | yes       | yes                       | yes     | Remove 0 from input          |
| >      |           | LU              | yes               | yes       | yes                       | yes     | Remove 0 from input          |
|        | predicate | myocyte         | yes               | no        | N.A.                      | N.A.    | N.A.                         |
|        | predicate | S3D             | yes               | yes       | no                        | N.A.    | N.A.                         |
| /      | NaN       | Interval        | yes               | yes       | no                        | N.A.    | N.A.                         |
|        |           | Laghos          | yes               | no        | N.A.                      | N.A.    | N.A.                         |
|        | Т         | Sw4lite         | yes               | no        | N.A.                      | N.A.    | N.A.                         |
|        |           | HPCG            | no                | no        | N.A.                      | N.A.    | N.A.                         |
|        |           | CuMF-Movielens  | yes               | yes       | yes                       | yes     | Enforce variable consistency |
|        | ↓ ↓       | cuML-HousePrice | partial           | yes       | yes                       | partial | N.A.                         |
| 42     |           | CUDA GMRES      | partial           | yes       | yes                       | partial | Diagonal boosting            |
| 2      | †∠        | SRU-Example     | yes               | yes       | yes                       | yes     | Change input generator       |
|        |           |                 |                   |           |                           |         |                              |

Table 7. Overview of Exception Diagnoses and Repairs using Analyzer for Programs with Severe Exceptions

### Non-Deterministic Behavior: This is incorrect in FP

# #define $MAX(x, y) ((x) \ge (y)?(x) : (y))$

# **GPU-FPX**

# A Low-Overhead tool for Floating-Point Exception Detection in NVIDIA GPUs



Programs may not run correctly



All these may stem from a simple floating-point exception





### Floating point exceptions

| Invalid<br>Operations | Division<br>by Zero      | Overflow         | Underflow                 |
|-----------------------|--------------------------|------------------|---------------------------|
| Resulting in NaN      | Resulting in NaN,<br>INF | Resulting in INF | Resulting in<br>Subnormal |
| sqrt(-1)              | 0/0, 3/0                 | power(2,2000)    | 1.0e-308 / 1.0e308        |

# Hardware exception traps can be enabled on CPUs



| Invalid<br>Operations | Division<br>by Zero      | Overflow         | Underflow                 |  |  |
|-----------------------|--------------------------|------------------|---------------------------|--|--|
| Resulting in NaN      | Resulting in NaN,<br>INF | Resulting in INF | Resulting in<br>Subnormal |  |  |
| sqrt(-1)              | 0/0, 3/0                 | power(2,2000)    | 1.0e-308 / 1.0e308        |  |  |

### ... but not on NVIDIA GPUs!!





| Invalid<br>Operations | Division<br>by Zero      | Overflow         | Underflow                 |
|-----------------------|--------------------------|------------------|---------------------------|
| Resulting in NaN      | Resulting in NaN,<br>INF | Resulting in INF | Resulting in<br>Subnormal |
| sqrt(-1)              | 0/0, 3/0                 | power(2,2000)    | 1.0e-308 / 1.0e308        |

### ... but not on NVIDIA GPUs!!





requiring exceptions to be detected by examining the results in **Software** 

| Invalid<br>Operations | Division<br>by Zero      | Overflow         | Underflow                 |
|-----------------------|--------------------------|------------------|---------------------------|
| Resulting in NaN      | Resulting in NaN,<br>INF | Resulting in INF | Resulting in<br>Subnormal |
| sqrt(-1)              | 0/0, 3/0                 | power(2,2000)    | 1.0e-308 / 1.0e308        |

#### Programs may not run correctly



# All these may stem from a simple floating-point exception



### Show-Stopper Floating-Point Exceptions on GPUs



Ax = b

A is a near singular matrix

**Run on GPU** 

No warning raised

Loss became NaN !!

If A < B then P else Q

If Either A or B is NaN

then Q is executed P is ignored ... P may contain a NaN too User has no tools to root-cause and fix!





Need tool to detect and analyze Floating-point exceptions



Need tool to detect and analyze Floating-point exceptions



# **GPU-FPX: 3 features**



# **GPU-FPX: 2 components**



### Detector

Pinpoints exception-generating locations across all kernels







### Analyzer

Reports how exceptions flow within one instruction





**GPU-FPX** 

```
global void dot prod(float *x, float *y, int size)
float d;
for (int i=0; i < size; ++i)</pre>
{
float tmp;
// division by zero, produce NaN
tmp = x[i] * y[i] / 0
d += tmp; // d=NaN
}
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid == 0) {
printf("dot: %f\n", d);
}
}
```



### LD PRELOAD=detector.so ./dot-prod

```
// division by zero, produce NaN
tmp = x[i]*y[i] / 0
d += tmp; // d=NaN
}
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid == 0) {
printf("dot: %f\n", d);
}
```



### LD PRELOAD=detector.so ./dot-prod

// division by zero, produce NaN

$$tmp = x[i] * y[i] / 0$$

#GPU-FPX LOC-EXCEP INFO: in kernel [dot\_prod], DIV0 found @ dot-prod.cu:13 [FP32] #GPU-FPX LOC-EXCEP INFO: in kernel [dot\_prod], NaN found @ dot-prod.cu:13 [FP32] dot: nan #GPU-FPX LOC-EXCEP INFO: in kernel [dot\_prod], NaN found @ dot-prod.cu:21 [FP32] #GPU-FPX LOC-EXCEP INFO: in kernel [dot\_prod], NaN found @ dot-prod.cu:14 [FP32]



#### LD PRELOAD=detector.so ./dot-prod

// division by zero, produce NaN

$$tmp = x[i] * y[i] / 0$$

#GPU-FPX LOC-EXCEP INFO: in kernel [dot\_prod], DIVO found @

#### dot-prod.cu:13 [FP32]

#GPU-FPX LOC-EXCEP INFO: in kernel [dot\_prod], NaN found @ dot-prod.cu:13 [FP32]
dot: nan
#GPU-FPX LOC-EXCEP INFO: in kernel [dot\_prod], NaN found @ dot-prod.cu:21 [FP32]
#GPU-FPX LOC-EXCEP INFO: in kernel [dot prod], NaN found @ dot-prod.cu:14 [FP32]



# **Using the Analyzer**

#### LD PRELOAD=analyzer.so ./dot-prod

```
for (int i=0; i < size; ++i)
{
float tmp;
// division by zero, produce NaN
tmp = x[i]*y[i] / 0</pre>
```

GPU-FPX-ANA APPEAR : INF appear at the destination @ dot-prod.cu:13 Instruction: MUFU.RCP R0, R10 ; We have 2 registers in total. Register 0 is INF. Register 1 is VAL. #GPU-FPX-ANA APPEAR : NaN appear at the destination @ dot-prod.cu:13 Instruction: FFMA R9, -R10, R0, 1 ; We have 3 registers in total. Register 0 is NaN. Register 1 is VAL. Register 2 is INF. #GPU-FPX-ANA PROPAGATION: ...



# Using the Analyzer

#### LD PRELOAD=analyzer.so ./dot-prod

```
for (int i=0; i < size; ++i)
{
float tmp;
// division by zero, produce NaN
tmp = x[i]*y[i] / 0</pre>
```

**#GPU-FPX-ANA APPEAR** : INF appear at the destination @ dot-prod.cu:13 Instruction: **MUFU.RCP R0**, **R10** ; We have 2 registers in total. Register 0 is INF. Register 1 is VAL.

#GPU-FPX-ANA APPEAR : NaN appear at the destination @ dot-prod.cu:13 Instruction: FFMA R9, -R10, R0, 1 ; We have 3 registers in total. Register 0 is NaN. Register 1 is VAL. Register 2 is INF.

#### **#GPU-FPX-ANA PROPAGATION:** ...

# Next step

Low or mixed-precision exceptions detect and analysis

- AI/ML workload are using lower precision
- Libraries are developed to use mixed precision
  - torch.autocast
- Current hardware vendor are developed low-precision unit
  - NVIDIA tensor cores, AMD matrix cores
- More exceptional issues about using mixed precision

# Next step

Low or mixed-precision exceptions detect and analysis

- AI/ML workload are using lower precision
- Libraries are developed to use mixed precision
  - torch.autocast
- Current hardware vendor are developed low-precision unit
  - NVIDIA tensor cores, AMD matrix cores
- We studied their numerical behaviors
- More exceptional issues about using mixed precision

### Will cover in the next half

Nowadays GPUs are always equipped with special hardware for **mixed-precision matrix multiplication** D = A\*B+C



| Inputs       | GPU    | Subnormal<br>inputs<br>handled? | Subnormal<br>outputs<br>handled? | Extra bit<br>present?<br>How many? | Rounding<br>mode<br>exhibited | FMA<br>unit<br>width | Order<br>within one<br>FMA unit<br>is controllable? | Rounding mode for:<br>1. outputting FP16/BF16<br>(only for FP16/BF16<br>inputs)<br>2. product (only for<br>FP32/FP64 inputs) |
|--------------|--------|---------------------------------|----------------------------------|------------------------------------|-------------------------------|----------------------|-----------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------|
|              | V100   | 1                               | 1                                | 0                                  | truncate                      | N.A.                 | N.A.                                                | RTN-TE                                                                                                                       |
|              | A100   | 1                               | ×                                | 1                                  | truncate                      | 8                    | ×                                                   | RTN-TE                                                                                                                       |
| FP16         | H100   | 1                               | 1                                | $\geq 2$                           | truncate                      | $\geq 16$            | ×                                                   | RTN-TE                                                                                                                       |
|              | MI100  | 1                               | 1                                | 3                                  | RTN-TE                        | 4                    | ×                                                   | RTN-TE                                                                                                                       |
|              | MI250X | ×                               | ×                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |
|              | A100   | 1                               | 1                                | 1                                  | truncate                      | 8                    | ×                                                   | N.A.**                                                                                                                       |
| BF16         | H100   | 1                               | 1                                | $\geq 2$                           | truncate                      | $\geq 16$            | ×                                                   | RTN-TE                                                                                                                       |
| DI 10        | MI100  | 1                               | 1                                | 3                                  | RTN-TE                        | 2                    | ×                                                   | RTN-TE                                                                                                                       |
|              | MI250X | ×                               | ×                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |
|              | A100   | 1                               | 1                                | 1                                  | truncate                      | 4                    | ×                                                   | N.A.                                                                                                                         |
| TF32(NVIDIA) | H100   | 1                               | 1                                | $\geq 2$                           | truncate                      | 4                    | X                                                   | N.A.                                                                                                                         |
| FP32(AMD)    | MI100  | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |
|              | MI250X | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |
|              | A100   | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | ×                                                   | RTN-TE                                                                                                                       |
| FP64         | H100   | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | X                                                   | RTN-TE                                                                                                                       |
|              | MI250X | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |

\* Since the V100 does not preserve extra bits, its FMA functionality cannot be evaluated. \*\* A100 doesn't support BF16 output.

The results can also be applied to GPUs with the same architecture (e.g. A100 to RTX 30 series), as they use the same generation of tensor cores.

# Part 2: GPU-FPX in practice Debugging a SRU open issue

# **Problem description**

- Link: https://github.com/asappresearch/sru/issues/193



(..)

hoagy-davis-digges commented on Nov 2, 2021 • edited -

I have run the example code in the readme on both 2.6.0 and 3.0.0-dev and both have nan values in both the output and state objects using pytorch 1.9, I've tried this on my computer using a Titan X and also with a fresh install on a cloud T4, this doesn't seem to relate to the other nan issue raised here <a href="https://github.com/asappresearch/sru/issues/185">https://github.com/asappresearch/sru/issues/185</a> because this problem appears immediately.

...

### **Problem description**

- Link: https://github.com/asappresearch/sru/issues/193

```
taolei87 commented on Nov 5, 2021
                                                                             Contributor
                                                                                        ...
hi @hoagy-davis-digges, did you mean you tried the following example and got NaN?
                                                                                       Q
  import torch
  from sru import SRU, SRUCell
  # input has length 20, batch size 32 and dimension 128
  x = torch.FloatTensor(20, 32, 128).cuda()
  input_size, hidden_size = 128, 128
  rnn = SRU(input_size, hidden_size,
     num_layers = 2,  # number of stacking RNN layers
     dropout = 0.0,  # dropout applied between RNN layers
     bidirectional = False, # bidirectional RNN
      layer_norm = False, # apply layer normalization on the output of each layer
     highway_bias = -2,  # initial bias of highway gate (<= 0)
  )
  rnn.cuda()
  output states, c states = rnn(x)
                                      # forward pass
```



### **Use Detector**

Running #GPU-FPX: kernel [ampere\_saemm\_32x128\_nn] ... #GPU-FPX LOC-EXCEP INFO: in kernel [ampere\_sgemm\_32x128\_nn], NaN found @ /unknown\_path in [ampere\_sge mm 32x128 nn1:0 [FP32] #GPU-FPX LOC-EXCEP INFO: Warning: in kernel [ampere\_sgemm\_32x128\_nn], very small guantity (SUB) found @ /unknown\_path in [ampere\_sgemm\_32x128\_nn]:0 [FP32] Running #GPU-FPX: kernel [void (anonymous namespace)::sru\_cuda\_forward\_kernel\_simple] ... #GPU-FPX LOC-EXCEP INFO: in kernel [void (anonymous namespace)::sru\_cuda\_forward\_kernel\_simple], NaN found @ /unknown\_path in [void (anonymous namespace)::sru\_cuda\_forward\_kernel\_simple]:0 [FP32] #GPU-FPX LOC-EXCEP INFO: in kernel [void (anonymous namespace)::sru\_cuda\_forward\_kernel\_simple], INF found @ /unknown\_path in [void (anonymous namespace)::sru\_cuda\_forward\_kernel\_simple]:0 [FP32] #GPU-FPX LOC-EXCEP INFO: in kernel [void (anonymous namespace)::sru\_cuda\_forward\_kernel\_simple], DIV0 found @ /unknown\_path in [void (anonymous namespace)::sru\_cuda\_forward\_kernel\_simple]:0 [FP32] #GPU-FPX LOC-EXCEP INFO: Warning: in kernel [void (anonymous namespace)::sru\_cuda\_forward\_kernel\_simp le], very small quantity (SUB) found @ /unknown\_path in [void (anonymous namespace)::sru\_cuda\_forward \_kernel\_simple]:0 [FP32] Running #GPU-FPX: kernel [void at::native::(anonymous namespace)::CatArrayBatchedCopy] ... #GPU-FPX LOC-EXCEP INFO: in kernel [void at::native::vectorized\_elementwise\_kernel], NaN found @ /unk nown\_path in [void at::native::vectorized\_elementwise\_kernel]:0 [FP32] #GPU-FPX LOC-EXCEP INFO: Warning: in kernel [void at::native::vectorized\_elementwise\_kernel], very sm all quantity (SUB) found @ /unknown\_path in [void at::native::vectorized\_elementwise\_kernel]:0 [FP32] Running #GPU-FPX: kernel [void at\_cuda\_detail::cub::DeviceReduceSingleTileKernel] ... Running #GPU-FPX: kernel [void at\_cuda\_detail::cub::DeviceCompactInitKernel] ... Running #GPU-FPX: kernel [void at\_cuda\_detail::cub::DeviceSelectSweepKernel] ... Running #GPU-FPX: kernel [void at::native::index\_elementwise\_kernel] ... Running #GPU-FPX: kernel [void at::native::unrolled\_elementwise\_kernel] ... Running #GPU-FPX: kernel [void at::native::reduce\_kernel] ...

#### **Closed-source library**

### **Use Detector**

Running #GPU-FPX: kernel [ampere\_sgemm\_32x128\_nn] ... #GPU-FPX LOC-EXCEP INFO: in kernel [ampere\_sgemm\_32x128\_nn], NaN found @ /unknown\_path in [ampere\_sge mm 32x128 nn1:0 [FP32] #GPU-FPX LOC-EXCEP INFO: Warning: in kernel [ampere\_sgemm\_32x128\_nn], very small quantity (SUB) found @ /unknown\_path in [ampere\_sgemm\_32x128\_nn]:0 [FP32] Running #GPU-FPX: kernel [void (anonymous namespace)::sru\_cuda\_forward\_kernel\_simple] ... #GPU-FPX LOC-EXCEP INFO: in kernel [void (anonymous namespace)::sru\_cuda\_forward\_kernel\_simple], NaN found @ /unknown\_path in [void (anonymous namespace)::sru\_cuda\_forward\_kernel\_simple]:0 [FP32] #GPU-FPX LOC-EXCEP INFO: in kernel [void (anonymous namespace)::sru\_cuda\_forward\_kernel\_simple], INF found @ /unknown\_path in [void (anonymous namespace)::sru\_cuda\_forward\_kernel\_simple]:0 [FP32] #GPU-FPX LOC-EXCEP INFO: in kernel [void (anonymous namespace)::sru\_cuda\_forward\_kernel\_simple], DIV0 found @ /unknown\_path in [void (anonymous namespace)::sru\_cuda\_forward\_kernel\_simple]:0 [FP32] #GPU-FPX LOC-EXCEP INFO: Warning: in kernel [void (anonymous namespace)::sru\_cuda\_forward\_kernel\_simp le], very small quantity (SUB) found @ /unknown\_path in [void (anonymous namespace)::sru\_cuda\_forward \_kernel\_simple]:0 [FP32] Running #GPU-FPX: kernel [void at::native::(anonymous namespace)::CatArrayBatchedCopy] ... #GPU-FPX LOC-EXCEP INFO: in kernel [void at::native::vectorized\_elementwise\_kernel], NaN found @ /unk nown\_path in [void at::native::vectorized\_elementwise\_kernel]:0 [FP32] #GPU-FPX LOC-EXCEP INFO: Warning: in kernel [void at::native::vectorized\_elementwise\_kernel], very sm all quantity (SUB) found @ /unknown\_path in [void at::native::vectorized\_elementwise\_kernel]:0 [FP32] Running #GPU-FPX: kernel [void at\_cuda\_detail::cub::DeviceReduceSingleTileKernel] ... Running #GPU-FPX: kernel [void at\_cuda\_detail::cub::DeviceCompactInitKernel] ... Running #GPU-FPX: kernel [void at\_cuda\_detail::cub::DeviceSelectSweepKernel] ... Running #GPU-FPX: kernel [void at::native::index\_elementwise\_kernel] ... Running #GPU-FPX: kernel [void at::native::unrolled\_elementwise\_kernel] ... Running #GPU-FPX: kernel [void at::native::reduce\_kernel] ...

#### **Closed-source library**

### **Use Detector**

Running #GPU-FPX: kernel [ampere\_sgemm\_32x128\_nn] ... #GPU-FPX LOC-EXCEP INFO: in kernel [ampere\_sgemm\_32x128\_nn], NaN found @ /unknown\_path in [ampere\_sge mm 32x128 nn1:0 [FP32] #GPU-FPX LOC-EXCEP INFO: Warning: in kernel [ampere\_sgemm\_32x128\_nn], very small guantity (SUB) found @ /unknown\_path in [ampere\_sgemm\_32x128\_nn]:0 [FP32] Running #GPU-FPX: kernel [void (anonymous namespace)::sru\_cuda\_forward\_kernel\_simple] ... #GPU-FPX LOC-EXCEP INFO: in kernel [void (anonymous namespace)::sru\_cuda\_forward\_kernel\_simple], NaN found @ /unknown\_path in [void (anonymous namespace)::sru\_cuda\_forward\_kernel\_simple]:0 [FP32] #GPU-FPX LOC-EXCEP INFO: in kernel [void (anonymous namespace)::sru\_cuda\_forward\_kernel\_simple], INF found @ /unknown\_path in [void (anonymous namespace)::sru\_cuda\_forward\_kernel\_simple]:0 [FP32] #GPU-FPX LOC-EXCEP INFO: in kernel [void (anonymous namespace)::sru\_cuda\_forward\_kernel\_simple], DIV0 n in [void (anonymous namespace)::sru\_cuda\_forward\_kernel\_simple]:0 [FP32] 0: Warning: in kernel [void (anonymous namespace)::sru\_cuda\_forward\_kernel\_simp Not clear how to fix it ty (SUB) found @ /unknown\_path in [void (anonymous namespace)::sru\_cuda\_forward \_kernel\_simple]:0 [FP32] Running #GPU-FPX: kernel [void at::native::(anonymous namespace)::CatArrayBatchedCopy] ... #GPU-FPX LOC-EXCEP INFO: in kernel [void at::native::vectorized\_elementwise\_kernel], NaN found @ /unk nown\_path in [void at::native::vectorized\_elementwise\_kernel]:0 [FP32] #GPU-FPX LOC-EXCEP INFO: Warning: in kernel [void at::native::vectorized\_elementwise\_kernel], very sm all quantity (SUB) found @ /unknown\_path in [void at::native::vectorized\_elementwise\_kernel]:0 [FP32] Running #GPU-FPX: kernel [void at\_cuda\_detail::cub::DeviceReduceSingleTileKernel] ... Running #GPU-FPX: kernel [void at\_cuda\_detail::cub::DeviceCompactInitKernel] ... Running #GPU-FPX: kernel [void at\_cuda\_detail::cub::DeviceSelectSweepKernel] ... Running #GPU-FPX: kernel [void at::native::index\_elementwise\_kernel] ... Running #GPU-FPX: kernel [void at::native::unrolled\_elementwise\_kernel] ... Running #GPU-FPX: kernel [void at::native::reduce\_kernel] ...

### Use Analyzer

### Speedup by enabling necessary kernels

Running #GPU-FPX: kernel [ampere\_sgemm\_32x128\_nn].

#GPU-FPX LOC-EXCEP INF0: in kernel [ampere\_sgemm\_32x128\_nn], NaN found @ /unknown\_path in [ampere\_sge
mm\_32x128\_nn]:0 [FP32]
#GPU-FPX LOC-EXCEP INF0: Warning: in kernel [ampere\_sgemm\_32x128\_nn], very small quantity (SUB) found
@ /unknown\_path in [ampere\_sgemm\_32x128\_nn]:0 [EP32]

### First and most exceptions happened in this kernel, so we can limit our instrumentation in this kernel

simple] ...
'd\_kernel\_simple], NaN
simple]:0 [FP32]
'd\_kernel\_simple], INF

rouna 🛯 /unknown\_path in Lvoia (anonymous namespace)::sru\_cuaa\_torwara\_kernei\_simple]:0 [FP32] #GPU-FPX LOC-EXCEP INFO: in kernel [void (anonymous namespace)::sru\_cuda\_forward\_kernel\_simple], DIV0 found @ /unknown\_path in [void (anonymous namespace)::sru\_cuda\_forward\_kernel\_simple]:0 [FP32] #GPU-FPX LOC-EXCEP INFO: Warning: in kernel [void (anonymous namespace)::sru\_cuda\_forward\_kernel\_simp le], very small quantity (SUB) found @ /unknown\_path in [void (anonymous namespace)::sru\_cuda\_forward \_kernel\_simple]:0 [FP32] Running #GPU-FPX: kernel [void at::native::(anonymous namespace)::CatArrayBatchedCopy] ... #GPU-FPX LOC-EXCEP INFO: in kernel [void at::native::vectorized\_elementwise\_kernel], NaN found @ /unk nown\_path in [void at::native::vectorized\_elementwise\_kernel]:0 [FP32] #GPU-FPX LOC-EXCEP INFO: Warning: in kernel [void at::native::vectorized\_elementwise\_kernel], very sm all quantity (SUB) found @ /unknown\_path in [void at::native::vectorized\_elementwise\_kernel]:0 [FP32] Running #GPU-FPX: kernel [void at\_cuda\_detail::cub::DeviceReduceSingleTileKernel] ... Running #GPU-FPX: kernel [void at\_cuda\_detail::cub::DeviceCompactInitKernel] ... Running #GPU-FPX: kernel [void at\_cuda\_detail::cub::DeviceSelectSweepKernel] ... Running #GPU-FPX: kernel [void at::native::index\_elementwise\_kernel] ... Running #GPU-FPX: kernel [void at::native::unrolled\_elementwise\_kernel] ... Running #GPU-FPX: kernel [void at::native::reduce\_kernel] ...

### Analysis

#### Kuninting #GFO-TEX. Kernet Lumpere\_Sgemm\_JZXIZO\_IIIJ ...

#GPU-FPX-ANA SHARED REGISTER Before executing the instruction @ /unknown\_path in [ampere\_sgemm\_32x1 28\_nn]:0 Instruction: FFMA R1, R32.reuse, R40.reuse, R1 ; We have 4 registers in total. Register 0 is VAL. Register 1 is VAL. Register 2 is NaN. Register 3 is VAL.

#GPU-FPX-ANA SHARED REGISTER Before executing the instruction @ /unknown\_path in [ampere\_sgemm\_32x1 28\_nn]:0 Instruction: FFMA R0, R32\_ R41.reuse, R0 ; We have 4 registers in total. Register 0 is VAL. Register 1 is VAL. Register 2 is NaN. Register 3 is VAL.

#GPU-FPX-ANA SHARED REGISTER: After executing the instruction @ /unknown\_path in [ampere\_sgemm\_32x12 8\_nn]:0 Instruction: FFMA R1, R32.reuse, R40.reuse, R1 ; We have 4 registers in total. Register 0 is NaN. Register 1 is VAL. Register 2 is NaN. Register 3 is NaN.

#GPU-FPX-ANA SHARED REGISTER: After executing the instruction @ /unknown\_path in [ampere\_sgemm\_32x12 8\_nn]:0 Instruction: FFMA R0, R32, R41.reuse, R0 ; We have 4 registers in total. Register 0 is NaN. R egister 1 is VAL. Register 2 is NaN. Register 3 is NaN.

# Analysis

#GPU-FPX-ANA SHARED REGISTER Before executing the instruction @ (unknown nath in Compere sgemm\_32x1 28\_nn]:0 Instruction: FFMA R1, R32.reuse, R40.reuse, R1 VAL. Register 1 is VAL. Register 2 is NaN. Register 3 i within the initial data!

#GPU-FPX-ANA SHARED REGISTER Before executing the instruction @ /unknown\_path in [ampere\_sgemm\_32x1 28\_nn]:0 Instruction: FFMA R0, R32\_R41.reuse, R0 ; We have 4 registers in total. Register 0 is VAL. Register 1 is VAL. Register 2 is NaN. Register 3 is VAL.

#GPU-FPX-ANA SHARED REGISTER: After executing the instruction @/unknown\_path in [ampere\_sgemm\_32x12 8 nn]:0 Instruction: FFMA R1, K32.reuse, R40.reuse, R1 ; We have 4 registers in total. Register 0 is NaN. Register 1 is VAL. Register 2 is NaN. Register 3 is NaN.

#GPU-FPX-ANA SHARED REGISTER: After executing the instruction @/unknown\_path in [ampere\_sgemm 32x12 8\_nn]:0 Instruction: FFMA R0, R32, R41.reuse, R0 ; We have 4 registers in total. Register 0 is NaN. R egister 1 is VAL. Register 2 is NaN. Register 3 is NaN.

# Analysis

| This creates a<br>memory. | tensor with unir                                                | nitialized data on GPU                                                                                                                | butor ···· |
|---------------------------|-----------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------|------------|
|                           | dropout = 0.0,<br>bidirectional = False,<br>layer_norm = False, | <pre>size 32 and dimension 128 , 128).cuda() 8, 128 _size, # number of stacking RNN layers # dropout applied between RNN layers</pre> | r          |
|                           | <pre>output_states, c_states = rn</pre>                         | n(x) # forward pass                                                                                                                   |            |

 $\odot$ 

### Fix



### Matrix Accelerators

Nowadays GPUs are always equipped with special hardware for **mixed-precision matrix multiplication** D = A\*B+C



### Matrix Accelerators Issues

Nowadays GPUs are always equipped with special hardware for (mixed-precision) matrix multiplication D = A\*B+C



Matrix Accelerators

A specific hardware to speed up matrix multiplication D=A\*B+C

Mixed-precision computation

**Block-wise computation** 

Lack of numerical standardization!

Numerical inconsistency

A numerical inconsistent example caused by matrix accelerators

Two 2<sup>13</sup> X 2<sup>13</sup> matrix doing the matrix multiplication

 $D = -1 \times A \times B + 1 \times C$ 



 $D_{ij} = -(2^{10} * 2^{10} - \Sigma 2^{-2} * 2^{-3} - \Sigma 2^{-3} * 2^{-3}) + 2^{20} = 2^7 + 2^6 - 2^{-6} \approx 191.99218$ 

Run on different hardware (with matrix accelerators

| NVIDIA | NVIDIA | NVIDIA  | AMD     | AMD   | CPU |
|--------|--------|---------|---------|-------|-----|
| V100   | A100   | H100    | MI100   | MI250 |     |
| ο      | 0      | 191.875 | 255.875 | ο     | 0   |

Numerical behaviors we want to test



| Inputs       | GPU    | Subnormal<br>inputs<br>handled? | Subnormal<br>outputs<br>handled? | Extra bit<br>present?<br>How many? | Rounding<br>mode<br>exhibited | FMA<br>unit<br>width | Order<br>within one<br>FMA unit<br>is controllable? | Rounding mode for:<br>1. outputting FP16/BF16<br>(only for FP16/BF16<br>inputs)<br>2. product (only for<br>FP32/FP64 inputs) |
|--------------|--------|---------------------------------|----------------------------------|------------------------------------|-------------------------------|----------------------|-----------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------|
|              | V100   | 1                               | 1                                | 0                                  | truncate                      | N.A.                 | N.A.                                                | RTN-TE                                                                                                                       |
|              | A100   | 1                               | ×                                | 1                                  | truncate                      | 8                    | ×                                                   | RTN-TE                                                                                                                       |
| FP16         | H100   | 1                               | 1                                | $\geq 2$                           | truncate                      | $\geq 16$            | ×                                                   | RTN-TE                                                                                                                       |
|              | MI100  | 1                               | 1                                | 3                                  | RTN-TE                        | 4                    | ×                                                   | RTN-TE                                                                                                                       |
|              | MI250X | ×                               | ×                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |
|              | A100   | 1                               | 1                                | 1                                  | truncate                      | 8                    | ×                                                   | N.A.**                                                                                                                       |
| BF16         | H100   | 1                               | 1                                | $\geq 2$                           | truncate                      | $\geq 16$            | ×                                                   | RTN-TE                                                                                                                       |
| DI 10        | MI100  | 1                               | 1                                | 3                                  | RTN-TE                        | 2                    | ×                                                   | RTN-TE                                                                                                                       |
|              | MI250X | ×                               | ×                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |
|              | A100   | 1                               | 1                                | 1                                  | truncate                      | 4                    | ×                                                   | N.A.                                                                                                                         |
| TF32(NVIDIA) | H100   | 1                               | 1                                | $\geq 2$                           | truncate                      | 4                    | X                                                   | N.A.                                                                                                                         |
| FP32(AMD)    | MI100  | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |
|              | MI250X | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |
|              | A100   | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | ×                                                   | RTN-TE                                                                                                                       |
| FP64         | H100   | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | X                                                   | RTN-TE                                                                                                                       |
|              | MI250X | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |

\* Since the V100 does not preserve extra bits, its FMA functionality cannot be evaluated. \*\* A100 doesn't support BF16 output.

#### Subnormal supported

| Inputs       | GPU    | Subnormal<br>inputs<br>handled? | Subnormal<br>outputs<br>handled? | Extra bit<br>present?<br>How many? | Rounding<br>mode<br>exhibited | FMA<br>unit<br>width | Order<br>within one<br>FMA unit<br>is controllable? | Rounding mode for:<br>1. outputting FP16/BF16<br>(only for FP16/BF16<br>inputs)<br>2. product (only for<br>FP32/FP64 inputs) |
|--------------|--------|---------------------------------|----------------------------------|------------------------------------|-------------------------------|----------------------|-----------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------|
|              | V100   | 1                               | 1                                | 0                                  | truncate                      | N.A.                 | N.A.                                                | RTN-TE                                                                                                                       |
|              | A100   | 1                               | 1                                | 1                                  | truncate                      | 8                    | ×                                                   | RTN-TE                                                                                                                       |
| FP16         | H100   | 1                               | 1                                | $\geq 2$                           | truncate                      | $\geq 16$            | ×                                                   | RTN-TE                                                                                                                       |
| _            | MI100  | 1                               | 1                                | 3                                  | RTN-TE                        | 4                    | ×                                                   | RTN-TE                                                                                                                       |
|              | MI250X | ×                               | ×                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |
|              | A100   | 1                               | 1                                | 1                                  | truncate                      | 8                    | ×                                                   | N.A.**                                                                                                                       |
| BF16         | H100   | 1                               | 1                                | $\geq 2$                           | truncate                      | $\geq 16$            | ×                                                   | RTN-TE                                                                                                                       |
| DITO         | MI100  | 1                               | 1                                | 3                                  | RTN-TE                        | 2                    | ×                                                   | RTN-TE                                                                                                                       |
|              | MI250X | ×                               | ×                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |
|              | A100   | V                               | 1                                | 1                                  | truncate                      | 4                    | ×                                                   | N.A.                                                                                                                         |
| TF32(NVIDIA) | H100   | 1                               | 1                                | $\geq 2$                           | truncate                      | 4                    | ×                                                   | N.A.                                                                                                                         |
| FP32(AMD)    | MI100  | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |
|              | MI250X | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |
|              | A100   | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | ×                                                   | RTN-TE                                                                                                                       |
| FP64         | H100   | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | ×                                                   | RTN-TE                                                                                                                       |
|              | MI250X | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |

\* Since the V100 does not preserve extra bits, its FMA functionality cannot be evaluated. \*\* A100 doesn't support BF16 output.

#### Extra bits and rounding mode

| Inputs       | GPU    | Subnormal<br>inputs<br>handled? | Subnormal<br>outputs<br>handled? | Extra bit<br>present?<br>How many? | Rounding<br>mode<br>exhibited | FMA<br>unit<br>width | Order<br>within one<br>FMA unit<br>is controllable? | Rounding mode for:<br>1. outputting FP16/BF16<br>(only for FP16/BF16<br>inputs)<br>2. product (only for<br>FP32/FP64 inputs) |
|--------------|--------|---------------------------------|----------------------------------|------------------------------------|-------------------------------|----------------------|-----------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------|
|              | V100   | 1                               | 1                                | 0                                  | truncate                      | N.A.                 | N.A.                                                | RTN-TE                                                                                                                       |
|              | A100   | 1                               | 1                                | 1                                  | truncate                      | 8                    | ×                                                   | RTN-TE                                                                                                                       |
| FP16         | H100   | 1                               | 1                                | > 2                                | truncate                      | $\geq 16$            | ×                                                   | RTN-TE                                                                                                                       |
|              | MI100  | 1                               | 1                                | 3                                  | RTN-TE                        | 4                    | ×                                                   | RTN-TE                                                                                                                       |
|              | MI250X | ×                               | ×                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |
|              | A100   | 1                               | 1                                | 1                                  | truncate                      | 8                    | ×                                                   | N.A.**                                                                                                                       |
| BF16         | H100   | 1                               | 1                                | > 2                                | truncate                      | $\geq 16$            | ×                                                   | RTN-TE                                                                                                                       |
| DITO         | MI100  | 1                               | 1                                | 3                                  | RTN-TE                        | 2                    | ×                                                   | RTN-TE                                                                                                                       |
|              | MI250X | ×                               | ×                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |
|              | A100   | 1                               | 1                                | 1                                  | truncate                      | 4                    | ×                                                   | N.A.                                                                                                                         |
| TF32(NVIDIA) | H100   | 1                               | 1                                | $\geq 2$                           | truncate                      | 4                    | ×                                                   | N.A.                                                                                                                         |
| FP32(AMD)    | MI100  | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |
|              | MI250X | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |
|              | A100   | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | ×                                                   | RTN-TE                                                                                                                       |
| FP64         | H100   | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | ×                                                   | RTN-TE                                                                                                                       |
|              | MI250X | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |

\* Since the V100 does not preserve extra bits, its FMA functionality cannot be evaluated. \*\* A100 doesn't support BF16 output.

#### Extra bits and rounding mode

| Inputs       | GPU    | Subnormal<br>inputs<br>handled? | Subnormal<br>outputs<br>handled? | Extra bit<br>present?<br>How many? | Rounding<br>mode<br>exhibited | FMA<br>unit<br>width | Order<br>within one<br>FMA unit<br>is controllable? | Rounding mode for:<br>1. outputting FP16/BF16<br>(only for FP16/BF16<br>inputs)<br>2. product (only for<br>FP32/FP64 inputs) |
|--------------|--------|---------------------------------|----------------------------------|------------------------------------|-------------------------------|----------------------|-----------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------|
|              | V100   | 1                               | 1                                | 0                                  | truncate                      | N.A.                 | N.A.                                                | RTN-TE                                                                                                                       |
|              | A100   | 1                               | 1                                | 1                                  | truncate                      | 8                    | ×                                                   | RTN-TE                                                                                                                       |
| FP16         | H100   | 1                               | 1                                | $\geq 2$                           | truncate                      | $\geq 16$            | ×                                                   | RTN-TE                                                                                                                       |
|              | MI100  | 1                               | 1                                | 3                                  | RTN-TE                        | 4                    | ×                                                   | RTN-TE                                                                                                                       |
|              | MI250X | ×                               | ×                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |
|              | A100   | 1                               | 1                                | 1                                  | truncate                      | 8                    | ×                                                   | N.A.**                                                                                                                       |
| BF16         | H100   | 1                               | 1                                | $\geq 2$                           | truncate                      | $\geq 16$            | ×                                                   | RTN-TE                                                                                                                       |
| DI 10        | MI100  | 1                               | 1                                | 3                                  | RTN-TE                        | 2                    | ×                                                   | RTN-TE                                                                                                                       |
|              | MI250X | ×                               | ×                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |
|              | A100   | 1                               | 1                                | 1                                  | truncate                      | 4                    | ×                                                   | N.A.                                                                                                                         |
| TF32(NVIDIA) | H100   | 1                               | 1                                | $\geq 2$                           | truncate                      | 4                    | ×                                                   | N.A.                                                                                                                         |
| FP32(AMD)    | MI100  | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |
|              | MI250X | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |
|              | A100   | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | ×                                                   | RTN-TE                                                                                                                       |
| FP64         | H100   | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | ×                                                   | RTN-TE                                                                                                                       |
|              | MI250X | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |

\* Since the V100 does not preserve extra bits, its FMA functionality cannot be evaluated. \*\* A100 doesn't support BF16 output.

#### Extra bits and rounding mode

| Inputs       | GPU    | Subnormal<br>inputs<br>handled? | Subnormal<br>outputs<br>handled? | Extra bit<br>present?<br>How many? | Rounding<br>mode<br>exhibited | FMA<br>unit<br>width | Order<br>within one<br>FMA unit<br>is controllable? | Rounding mode for:<br>1. outputting FP16/BF16<br>(only for FP16/BF16<br>inputs)<br>2. product (only for<br>FP32/FP64 inputs) |
|--------------|--------|---------------------------------|----------------------------------|------------------------------------|-------------------------------|----------------------|-----------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------|
|              | V100   | 1                               | 1                                | 0                                  | truncate                      | N.A.                 | N.A.                                                | RTN-TE                                                                                                                       |
|              | A100   | 1                               | 1                                | 1                                  | truncate                      | 8                    | ×                                                   | RTN-TE                                                                                                                       |
| FP16         | H100   | 1                               | 1                                | $\geq 2$                           | truncate                      | $\geq 16$            | ×                                                   | RTN-TE                                                                                                                       |
|              | MI100  | 1                               | 1                                | 3                                  | RTN-TE                        | 4                    | ×                                                   | RTN-TE                                                                                                                       |
|              | MI250X | ×                               | ×                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |
|              | A100   | 1                               | 1                                | 1                                  | truncate                      | 8                    | ×                                                   | N.A.**                                                                                                                       |
| BF16         | H100   | 1                               | 1                                | $\geq 2$                           | truncate                      | $\geq 16$            | ×                                                   | RTN-TE                                                                                                                       |
| DITO         | MI100  | 1                               | 1                                | 3                                  | RTN-TE                        | 2                    | ×                                                   | RTN-TE                                                                                                                       |
|              | MI250X | ×                               | ×                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |
|              | A100   | 1                               | 1                                | 1                                  | truncate                      | 4                    | ×                                                   | N.A.                                                                                                                         |
| TF32(NVIDIA) | H100   | 1                               | 1                                | $\geq 2$                           | truncate                      | 4                    | ×                                                   | N.A.                                                                                                                         |
| FP32(AMD)    | MI100  | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |
|              | MI250X | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |
|              | A100   | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | ×                                                   | RTN-TE                                                                                                                       |
| FP64         | H100   | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | ×                                                   | RTN-TE                                                                                                                       |
|              | MI250X | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |

\* Since the V100 does not preserve extra bits, its FMA functionality cannot be evaluated. \*\* A100 doesn't support BF16 output.

| Inputs       | GPU    | Subnormal<br>inputs<br>handled? | Subnormal<br>outputs<br>handled? | Extra bit<br>present?<br>How many? | Rounding<br>mode<br>exhibited | FMA<br>unit<br>width | Order<br>within one<br>FMA unit<br>is controllable? | Rounding mode for:<br>1. outputting FP16/BF16<br>(only for FP16/BF16<br>inputs)<br>2. product (only for<br>FP32/FP64 inputs) |
|--------------|--------|---------------------------------|----------------------------------|------------------------------------|-------------------------------|----------------------|-----------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------|
|              | V100   | 1                               | 1                                | 0                                  | truncate                      | N.A.                 | N.A.                                                | RTN-TE                                                                                                                       |
|              | A100   | 1                               | 1                                | 1                                  | truncate                      | 8                    | ×                                                   | RTN-TE                                                                                                                       |
| FP16         | H100   | 1                               | 1                                | $\geq 2$                           | truncate                      | $\geq 16$            | ×                                                   | RTN-TE                                                                                                                       |
|              | MI100  | 1                               | 1                                | 3                                  | RTN-TE                        | 4                    | ×                                                   | RTN-TE                                                                                                                       |
|              | MI250X | ×                               | ×                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |
|              | A100   | 1                               | 1                                | 1                                  | truncate                      | 8                    | ×                                                   | N.A.**                                                                                                                       |
| BF16         | H100   | 1                               | 1                                | $\geq 2$                           | truncate                      | $\geq 16$            | ×                                                   | RTN-TE                                                                                                                       |
| DITO         | MI100  | 1                               | 1                                | 3                                  | RTN-TE                        | 2                    | ×                                                   | RTN-TE                                                                                                                       |
|              | MI250X | ×                               | ×                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |
|              | A100   | 1                               | 1                                | 1                                  | truncate                      | 4                    | ×                                                   | N.A.                                                                                                                         |
| TF32(NVIDIA) | H100   | 1                               | 1                                | $\geq 2$                           | truncate                      | 4                    | ×                                                   | N.A.                                                                                                                         |
| FP32(AMD)    | MI100  | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |
|              | MI250X | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |
|              | A100   | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | ×                                                   | RTN-TE                                                                                                                       |
| FP64         | H100   | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | ×                                                   | RTN-TE                                                                                                                       |
|              | MI250X | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |

\* Since the V100 does not preserve extra bits, its FMA functionality cannot be evaluated. \*\* A100 doesn't support BF16 output.

| Inputs       | GPU    | Subnormal<br>inputs<br>handled? | Subnormal<br>outputs<br>handled? | Extra bit<br>present?<br>How many? | Rounding<br>mode<br>exhibited | FMA<br>unit<br>width | Order<br>within one<br>FMA unit<br>is controllable? | Rounding mode for:<br>1. outputting FP16/BF16<br>(only for FP16/BF16<br>inputs)<br>2. product (only for<br>FP32/FP64 inputs) |
|--------------|--------|---------------------------------|----------------------------------|------------------------------------|-------------------------------|----------------------|-----------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------|
|              | V100   | 1                               | 1                                | 0                                  | truncate                      | N.A.                 | N.A.                                                | RTN-TE                                                                                                                       |
|              | A100   | 1                               | 1                                | 1                                  | truncate                      | 8                    | ×                                                   | RTN-TE                                                                                                                       |
| FP16         | H100   | 1                               | 1                                | $\geq 2$                           | truncate                      | $\geq 16$            | ×                                                   | RTN-TE                                                                                                                       |
|              | MI100  | 1                               | 1                                | 3                                  | RTN-TE                        | 4                    | ×                                                   | RTN-TE                                                                                                                       |
|              | MI250X | ×                               | ×                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |
|              | A100   | 1                               | 1                                | 1                                  | truncate                      | 8                    | ×                                                   | N.A.**                                                                                                                       |
| BF16         | H100   | 1                               | 1                                | $\geq 2$                           | truncate                      | $\geq 16$            | ×                                                   | RTN-TE                                                                                                                       |
| DI IO        | MI100  | 1                               | 1                                | 3                                  | RTN-TE                        | 2                    | ×                                                   | RTN-TE                                                                                                                       |
|              | MI250X | ×                               | ×                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |
|              | A100   | 1                               | 1                                | 1                                  | truncate                      | 4                    | ×                                                   | N.A.                                                                                                                         |
| TF32(NVIDIA) | H100   | 1                               | 1                                | $\geq 2$                           | truncate                      | 4                    | ×                                                   | N.A.                                                                                                                         |
| FP32(AMD)    | MI100  | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |
|              | MI250X | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |
|              | A100   | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | ×                                                   | RTN-TE                                                                                                                       |
| FP64         | H100   | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | X                                                   | RTN-TE                                                                                                                       |
|              | MI250X | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |

\* Since the V100 does not preserve extra bits, its FMA functionality cannot be evaluated. \*\* A100 doesn't support BF16 output.

| Inputs       | GPU    | Subnormal<br>inputs<br>handled? | Subnormal<br>outputs<br>handled? | Extra bit<br>present?<br>How many? | Rounding<br>mode<br>exhibited | FMA<br>unit<br>width | Order<br>within one<br>FMA unit<br>is controllable? | Rounding mode for:<br>1. outputting FP16/BF16<br>(only for FP16/BF16<br>inputs)<br>2. product (only for<br>FP32/FP64 inputs) |
|--------------|--------|---------------------------------|----------------------------------|------------------------------------|-------------------------------|----------------------|-----------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------|
|              | V100   | 1                               | 1                                | 0                                  | truncate                      | N.A.                 | N.A.                                                | RTN-TE                                                                                                                       |
|              | A100   | 1                               | 1                                | 1                                  | truncate                      | 8                    | X                                                   | RTN-TE                                                                                                                       |
| FP16         | H100   | 1                               | 1                                | $\geq 2$                           | truncate                      | $\geq 16$            | ×                                                   | RTN-TE                                                                                                                       |
|              | MI100  | 1                               | 1                                | 3                                  | RTN-TE                        | 4                    | ×                                                   | RTN-TE                                                                                                                       |
|              | MI250X | ×                               | ×                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |
|              | A100   | 1                               | 1                                | 1                                  | truncate                      | 8                    | ×                                                   | N.A.**                                                                                                                       |
| BF16         | H100   | 1                               | 1                                | $\geq 2$                           | truncate                      | $\geq 16$            | ×                                                   | RTN-TE                                                                                                                       |
| DI IO        | MI100  | 1                               | 1                                | 3                                  | RTN-TE                        | 2                    | ×                                                   | RTN-TE                                                                                                                       |
|              | MI250X | ×                               | ×                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |
|              | A100   | 1                               | 1                                | 1                                  | truncate                      | 4                    | ×                                                   | N.A.                                                                                                                         |
| TF32(NVIDIA) | H100   | 1                               | 1                                | $\geq 2$                           | truncate                      | 4                    | X                                                   | N.A.                                                                                                                         |
| FP32(AMD)    | MI100  | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |
|              | MI250X | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |
|              | A100   | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | X                                                   | RTN-TE                                                                                                                       |
| FP64         | H100   | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | X                                                   | RTN-TE                                                                                                                       |
|              | MI250X | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    | N.A.                                                | RTN-TE                                                                                                                       |

\* Since the V100 does not preserve extra bits, its FMA functionality cannot be evaluated. \*\* A100 doesn't support BF16 output.

| Inputs       | GPU    | Subnormal<br>inputs<br>handled? | Subnormal<br>outputs<br>handled? | Extra bit<br>present?<br>How many? | Rounding<br>mode<br>exhibited | FMA<br>unit<br>width | I | Order<br>within one<br>FMA unit<br>is controllable? |   | Rounding mode for:<br>1. outputting FP16/BF16<br>(only for FP16/BF16<br>inputs)<br>2. product (only for<br>FP32/FP64 inputs) |
|--------------|--------|---------------------------------|----------------------------------|------------------------------------|-------------------------------|----------------------|---|-----------------------------------------------------|---|------------------------------------------------------------------------------------------------------------------------------|
|              | V100   | 1                               | 1                                | 0                                  | truncate                      | N.A.                 |   | N.A.                                                |   | RTN-TE                                                                                                                       |
|              | A100   | 1                               | 1                                | 1                                  | truncate                      | 8                    |   | X                                                   |   | RTN-TE                                                                                                                       |
| FP16         | H100   | 1                               | 1                                | $\geq 2$                           | truncate                      | $\geq 16$            |   | ×                                                   |   | RTN-TE                                                                                                                       |
|              | MI100  | 1                               | 1                                | 3                                  | RTN-TE                        | 4                    |   | X                                                   |   | RTN-TE                                                                                                                       |
|              | MI250X | ×                               | ×                                | 3                                  | RTN-TE                        | 1                    |   | N.A.                                                |   | RTN-TE                                                                                                                       |
|              | A100   | 1                               | 1                                | 1                                  | truncate                      | 8                    |   | X                                                   |   | N.A.**                                                                                                                       |
| BF16         | H100   | 1                               | 1                                | $\geq 2$                           | truncate                      | $\geq 16$            |   | ×                                                   |   | RTN-TE                                                                                                                       |
| DI 10        | MI100  | 1                               | 1                                | 3                                  | RTN-TE                        | 2                    |   | ×                                                   |   | RTN-TE                                                                                                                       |
|              | MI250X | ×                               | ×                                | 3                                  | RTN-TE                        | 1                    |   | N.A.                                                | Î | RTN-TE                                                                                                                       |
|              | A100   | 1                               | 1                                | 1                                  | truncate                      | 4                    |   | ×                                                   |   | N.A.                                                                                                                         |
| TF32(NVIDIA) | H100   | 1                               | 1                                | $\geq 2$                           | truncate                      | 4                    |   | ×                                                   |   | N.A.                                                                                                                         |
| FP32(AMD)    | MI100  | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    |   | N.A.                                                |   | RTN-TE                                                                                                                       |
|              | MI250X | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    |   | N.A.                                                |   | RTN-TE                                                                                                                       |
|              | A100   | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    |   | X                                                   |   | RTN-TE                                                                                                                       |
| FP64         | H100   | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    |   | ×                                                   |   | RTN-TE                                                                                                                       |
|              | MI250X | 1                               | 1                                | 3                                  | RTN-TE                        | 1                    |   | N.A.                                                |   | RTN-TE                                                                                                                       |

\* Since the V100 does not preserve extra bits, its FMA functionality cannot be evaluated. \*\* A100 doesn't support BF16 output.

### Thanks!

- **1**. Questions?
- 2. Any applications you want us to help?
- 3. What features you want to add?

Try GPU-FPX!

Try FTTN!





# Thanks!