Presentation
SHARP Streaming-Aggregation Hardware Design and Evaluation
Session: Research Paper Session
Event Type: Research Paper, Pre-Recorded
Time: Tuesday, June 23rd, 8:30pm - 8:55pm
Location: Digital
Description: This paper introduces SHARP Streaming-Aggregation, a new hardware-based reduction capability of the Mellanox Scalable Hierarchical Aggregation and Reduction Protocol (SHARP), added to Mellanox’s HDR InfiniBand switches. This capability is designed to achieve reduction bandwidths similar to those of point-to-point messages of the same size, and it complements the latency-optimized Low-Latency-Aggregation reduction capability, which is aimed at small data reductions. MPI_Allreduce() bandwidth measured on an HDR InfiniBand-based system reaches about 96% of point-to-point bandwidth. For medium and large data reductions, Streaming-Aggregation also improves reduction bandwidth by a factor of two to about five relative to host-based (i.e., software-based) reduction algorithms. Using this capability also increased DL-Poly and PyTorch application performance by as much as 7% and 10%, respectively. The paper describes the SHARP Streaming-Aggregation hardware architecture, a set of synthetic and application benchmarks used to study this new reduction capability, and the range of data sizes for which Streaming-Aggregation performs better than the Low-Latency-Aggregation algorithm.