DGEMM using Tensor Cores, and Its Accurate and Reproducible Versions
Math Library Design
Time: Wednesday, June 24th, 11:00am - 11:30am
Location: Analog 1, 2
Description: This paper proposes a method to implement dense matrix multiplication in FP64 (DGEMM) and FP32 (SGEMM) using Tensor Cores on NVIDIA GPUs. Tensor Cores are special processing units that perform 4-by-4 matrix multiplications on FP16 inputs with FP32 precision and return the result in FP32. The proposed method is based on the Ozaki scheme, an accurate matrix multiplication algorithm built on an error-free transformation for matrix multiplication. One of the most significant advantages of the proposed method is that it can be built on the cublasGemmEx routine, which uses Tensor Core operations. Moreover, the method supports tunable accuracy: it can achieve higher accuracy than standard DGEMM and SGEMM, including correct rounding, and it gives reproducible results even when the number of cores or threads changes. The performance of our method depends on the range of the absolute values of the elements of the input matrices. On a Tesla V100, which delivers 113 TFlops on Tensor Cores, our DGEMM-equivalent implementation achieved up to approximately 3.5 TFlops of FP64 operations for matrices initialized with random numbers having a dynamic range of 1E+9. Our results reveal the possibility of using hardware with limited FP32/FP64 resources and fast low-precision processing units, e.g., AI-oriented processors, for general-purpose workloads. Furthermore, we offer a new perspective on hardware design for AI and traditional HPC workloads in terms of the precision of hardware-supported floating-point operations.
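The splitting idea behind the scheme can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: it uses a dot product as a stand-in for matrix multiplication, emulates the FP16-input, FP32-accumulate step with exact integer arithmetic, and assumes non-empty vectors of moderate magnitude (no overflow or underflow). The names `split_vector` and `sliced_dot` are hypothetical. Each FP64 vector is split into slices whose elements are small integers times a shared power of two, so every slice-by-slice product is exact, and only the final summation rounds.

```python
import math
from fractions import Fraction

def split_vector(vec, bits):
    """Split vec into slices (ints, exp), each meaning ints[i] * 2**exp.

    Every integer satisfies |k| <= 2**bits, and the slices sum to vec exactly.
    """
    slices = []
    rem = list(vec)
    while any(v != 0.0 for v in rem):
        amax = max(abs(v) for v in rem)
        _, e = math.frexp(amax)            # amax < 2**e
        exp = e - bits                     # weight of one unit in this slice
        ints = [round(math.ldexp(v, -exp)) for v in rem]
        # The remainder is computed without rounding error.
        rem = [v - math.ldexp(float(k), exp) for v, k in zip(rem, ints)]
        slices.append((ints, exp))
    return slices

def sliced_dot(a, b):
    """Dot product of two FP64 vectors via exact low-precision partial products."""
    n = len(a)
    # Choose the slice width so that n * (2**bits)**2 <= 2**24 (exact in an
    # FP32 accumulator) and 2**bits is exactly representable in FP16.
    bits = min(10, (24 - math.ceil(math.log2(n))) // 2)
    partials = []
    for ints_a, exp_a in split_vector(a, bits):
        for ints_b, exp_b in split_vector(b, bits):
            # Stand-in for one low-precision multiply-accumulate: exact.
            acc = sum(x * y for x, y in zip(ints_a, ints_b))
            partials.append(math.ldexp(float(acc), exp_a + exp_b))
    # Only this final summation rounds; fsum rounds the exact sum once.
    return math.fsum(partials)

a = [1.0e9, 0.1, -3.7, 2.5e-4]
b = [3.3, -7.0e8, 1.0e-3, 42.0]
reference = float(sum(Fraction(x) * Fraction(y) for x, y in zip(a, b)))
print(sliced_dot(a, b) == reference)
```

In this sketch the number of slices grows with the spread of magnitudes in the input, which is consistent with the abstract's remark that performance depends on the range of the absolute values. Because every partial product is exact, the result does not depend on the order in which the partial products are computed.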