Research Paper
Evaluating the Performance of Global-Restart Recovery For MPI Fault Tolerance
Event Type
Research Paper
System Software & Runtime Systems
Time
Wednesday, June 24th, 2:15pm - 2:45pm
Location
Analog 1, 2
Description
Scaling supercomputers comes with an increase in failure rates due to
the growing number of hardware components. In standard practice,
applications are made resilient through checkpointing data and
restarting the execution to resume from the latest checkpoint when a
failure occurs. Notably, re-deploying an application incurs overhead from
tearing down and re-instating execution, and possibly from writing
checkpoints to slow permanent storage.

Various techniques have been proposed for MPI fault tolerance to
support resilient application execution without the need for
re-deployment. However, their efficacy and performance have not been
rigorously evaluated. We present an extensive evaluation of two leading
MPI fault-tolerance approaches, Reinit and ULFM, contrasted with the
typical practice of restarting an application to implement
global-restart recovery. The evaluation uses HPC proxy applications made
resilient to withstand process and node failures. Experimentation shows
that Reinit recovers much faster than restarting or ULFM and that it
scales excellently as the number of MPI processes grows.