Reinit++: Evaluating the Performance of Global-Restart Recovery Methods For MPI Fault Tolerance
Event Type
Research Paper
Time
Tuesday, June 23rd, 6:25pm - 6:50pm
Description
Scaling supercomputers comes with an increase in failure rates due to
the increasing number of hardware components. In standard practice,
applications are made resilient through checkpointing data and
restarting execution after a failure occurs to resume from the latest
checkpoint. However, re-deploying an application incurs overhead by
tearing down and re-instating execution, and it may restrict
checkpoint retrieval to slow permanent storage.

In this paper we present Reinit++, a new design and implementation of
the Reinit approach for global-restart recovery, which avoids
application re-deployment. We extensively evaluate Reinit++
contrasted with the leading MPI fault-tolerance approach of
ULFM, implementing global-restart recovery, and the typical practice of
restarting an application to derive new insight on performance.
Experimentation with three different HPC proxy applications made
resilient to withstand process and node failures shows that Reinit++
recovers much faster than restarting, up to 6 times, or ULFM, up to
3 times, and that it scales excellently as the number of MPI processes