Tutorial on Floating-Point Analysis Tools

Tools To Detect and Diagnose Floating-Point Errors in Heterogeneous Computing Hardware and Software

SC25, St. Louis, MO

Nov 16, 2025

Time: 8:30am - 12pm CST (half-day tutorial)
Location: America’s Center Convention Complex, Room 121

Click Here for Tutorial Slides (PDF)

Description

High-performance computing and machine learning applications increasingly rely on mixed-precision arithmetic on CPUs and GPUs to gain performance. However, this shift introduces challenging numerical issues, such as larger round-off errors and INF and NaN exceptions, that can render computed solutions useless.
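To make these issues concrete, here is a minimal sketch that emulates IEEE half precision (binary16) with Python's standard `struct` module. It is an illustration only and does not use any of the tools presented in the tutorial; the helper names `to_half` and `sum_in_half` are hypothetical.

```python
import math
import struct


def to_half(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary16 value.

    struct raises OverflowError for values too large for binary16; we map
    that case to a signed infinity, as IEEE round-to-nearest would.
    """
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def sum_in_half(values):
    """Accumulate values, rounding every intermediate result to binary16."""
    acc = 0.0
    for v in values:
        acc = to_half(acc + to_half(v))
    return acc


# Round-off: the same sum computed in binary64 and in emulated binary16.
values = [0.1] * 1000
double_sum = 0.0
for v in values:
    double_sum += v
half_sum = sum_in_half(values)

# Overflow: 300 * 300 exceeds the largest finite binary16 value (65504).
a = to_half(300.0)
overflowed = to_half(a * a)

# Invalid operation: INF - INF has no meaningful result and yields NaN.
invalid = to_half(overflowed - overflowed)

print("binary64 sum:", double_sum)
print("binary16 sum:", half_sum)
print("300 * 300 in binary16:", overflowed)
print("INF - INF:", invalid)
```

Under these assumptions, the half-precision sum should drift much further from 100 than the double-precision sum, and the last two values show how a single overflow turns into a NaN that then propagates through later arithmetic.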

At present, this places a heavy burden on developers, interrupting their work while they diagnose these problems manually. This tutorial presents three tools that target specific issues leading to floating-point bugs.

First, we present FPChecker, which detects and reports INF/NaN exceptions in parallel and distributed CPU codes. It also reports the ranges of exponent values that a program produces, which helps programmers avoid exceptions while keeping rounding errors small.
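The idea of an exponent value range can be sketched in a few lines of Python. This is not FPChecker's implementation or output format; the `FORMATS` table and the functions `exponent_range` and `fits` are hypothetical names used only to show how observed exponents can be compared against the normal range of a target format.

```python
import math

# Smallest and largest normal (unbiased) binary exponents of IEEE 754 formats.
FORMATS = {
    "binary16": (-14, 15),
    "binary32": (-126, 127),
    "binary64": (-1022, 1023),
}


def exponent_range(values):
    """Return (min, max) unbiased exponent over finite, non-zero values.

    math.frexp returns (m, e) with x = m * 2**e and 0.5 <= |m| < 1, so the
    exponent in the usual 1.f * 2**E form is e - 1.
    Raises ValueError if there is no finite, non-zero value.
    """
    exps = [
        math.frexp(v)[1] - 1
        for v in values
        if v != 0.0 and math.isfinite(v)
    ]
    return min(exps), max(exps)


def fits(values, fmt):
    """True if every observed exponent lies in the normal range of fmt."""
    lo, hi = exponent_range(values)
    emin, emax = FORMATS[fmt]
    return emin <= lo and hi <= emax


observed = [1.5, 1000.0, 1.0e5]
print("exponent range:", exponent_range(observed))
for name in FORMATS:
    print(name, "holds all values as normal numbers:", fits(observed, name))
```

A range that sits close to a format's upper limit suggests a risk of overflow to INF, while a range near the lower limit suggests underflow and loss of precision in subnormal numbers.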

Second, we present GPU-FPX, which detects floating-point exceptions generated on NVIDIA GPUs; its “nixnan” extension extends detection to Tensor Cores.

Third, we present FloatGuard, a unique tool that detects exceptions on AMD GPUs.

The tutorial aims to help programmers avoid exception bugs. To that end, we will demonstrate our tools on simple examples with seeded bugs. Attendees may optionally install and run our tools.

The tutorial also allocates question/answer time to address real situations faced by the attendees.

Note for Attendees

An overview video of the tutorial is at https://youtu.be/1Ka8g_06Nxg?si=EpYCeuADEVk2qT4u.

Presenters

Presentation Slides

Repositories: