SHARP Streaming-Aggregation Hardware Design and Evaluation
Time: Tuesday, June 23rd, 8:30pm - 8:55pm
Description: This paper introduces SHARP Streaming-Aggregation, a new hardware-based reduction capability of the Mellanox Scalable Hierarchical Aggregation and Reduction Protocol (SHARP), added to Mellanox’s HDR InfiniBand switches. This capability is designed to achieve reduction bandwidths similar to those of point-to-point messages of the same size, and complements the latency-optimized Low-Latency-Aggregation reduction capability, which is aimed at small data reductions. MPI_Allreduce() bandwidth measured on an HDR InfiniBand-based system reaches about 96% of point-to-point bandwidth. For medium and large data reductions, this also improves reduction bandwidth by a factor of two to about five relative to host-based (i.e., software-based) reduction algorithms. Using this capability also increased DL-Poly and PyTorch application performance by as much as 7% and 10%, respectively. The paper describes the SHARP Streaming-Aggregation hardware architecture, a set of synthetic and application benchmarks used to study this new reduction capability, and the range of data sizes for which Streaming-Aggregation performs better than the Low-Latency-Aggregation algorithm.
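
To make the idea of hierarchical, in-network reduction concrete, the following is a minimal toy sketch in Python of an allreduce performed over an aggregation tree: each switch sums the vectors arriving from its children and forwards one partial result upward, and the root's result is then distributed back to every host. It is only an illustration under stated assumptions. The function names (`reduce_level`, `tree_allreduce`), the `radix` parameter, and the fixed grouping of hosts under switches are all hypothetical, and the sketch models neither the SHARP hardware, its streaming pipeline, nor any Mellanox API.

```python
from typing import List, Sequence, Tuple


def reduce_level(vectors: List[List[float]], radix: int) -> List[List[float]]:
    """One aggregation level: each (hypothetical) switch sums up to `radix` child vectors."""
    partials = []
    for start in range(0, len(vectors), radix):
        group = vectors[start:start + radix]
        # Element-wise sum across the children attached to this switch.
        partials.append([sum(column) for column in zip(*group)])
    return partials


def tree_allreduce(
    host_vectors: Sequence[Sequence[float]], radix: int = 2
) -> Tuple[List[List[float]], int]:
    """Sum-allreduce over a toy aggregation tree.

    Returns the per-host results (every host receives the same vector)
    and the number of aggregation levels that were traversed.
    """
    if not host_vectors:
        raise ValueError("at least one host vector is required")
    if radix < 2:
        raise ValueError("radix must be at least 2")

    level = [list(v) for v in host_vectors]
    levels = 0
    # Reduce upward until a single vector remains at the root.
    while len(level) > 1:
        level = reduce_level(level, radix)
        levels += 1

    root_result = level[0]
    # Distribute the root's result back down to every host.
    return [list(root_result) for _ in host_vectors], levels


if __name__ == "__main__":
    results, depth = tree_allreduce([[1, 2], [3, 4], [5, 6], [7, 8]], radix=2)
    print(results[0], depth)  # prints: [16, 20] 2
```

In this toy model each host injects its data once and receives the final result once, regardless of the number of hosts, which is the intuition behind comparing reduction bandwidth with point-to-point bandwidth for messages of the same size.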