Bitwise Reproducible Execution of Unstructured Mesh Applications
Time: Monday, June 22nd, 7:37pm - 8:14pm
Description: In high performance computing, there is a need for tools and methods that achieve reproducibility: when an error occurs during a program execution, we may want to reproduce the problem exactly. There may also be contractual obligations, for example in applications from the human sciences or finance. Comparing the output of a trusted sequential run with the output of a parallel run is a very common way to determine the correctness of the parallel implementation. Without reproducibility, it can be difficult to determine whether a discrepancy comes from the floating-point representation or from a bug in the parallel code.
Engineering applications use floating-point arithmetic, which is not associative under the IEEE specification. In a parallel environment, this usually means the application becomes non-reproducible because of the non-deterministic ordering of operations.
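The effect can be seen in a few lines of Python. The sketch below is only an illustration and is not taken from the paper: the helper names `naive_sum` and `partitioned_sum` and the input values are hypothetical, and the "partitions" stand in for MPI processes that each compute a partial sum before the partial sums are combined.

```python
def naive_sum(values):
    """Left-to-right floating-point sum using plain additions."""
    acc = 0.0
    for v in values:
        acc += v
    return acc


def partitioned_sum(values, parts):
    """Sum `values` as `parts` contiguous chunks, then combine the partial sums.

    Each chunk stands in for the data owned by one (hypothetical) process.
    """
    n = len(values)
    bounds = [n * p // parts for p in range(parts + 1)]
    partials = [naive_sum(values[bounds[p]:bounds[p + 1]]) for p in range(parts)]
    return naive_sum(partials)


# Regrouping three numbers already changes the rounded result.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c
right = a + (b + c)
print(left == right)  # False

# The same data summed with one or two partitions gives different answers.
values = [1e16, 1.0, -1e16, 1.0]
one_part = partitioned_sum(values, 1)
two_parts = partitioned_sum(values, 2)
print(one_part, two_parts)
```

Neither result equals the exact sum of the four inputs. The point is only that the grouping, which in a parallel run depends on the decomposition and on the reduction order, changes the bits of the result.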
Reproducibility comes at a price, and that price depends on the application area being addressed. The ReproBLAS project uses a different representation, introducing a binned method. With this method, summing n floating-point numbers incurs an overhead of 5n to 9n floating-point operations. Using it would also require rewriting our algorithm and introducing a dependency on a new library. Lulesh presents some work on achieving reproducibility, but only between runs with the same number of MPI processes.
In this paper we present work on developing a method in the field of unstructured mesh computations, commonly used for the discretized solution of partial differential equations. We provide bitwise reproducibility between separate runs, even if they are started with different numbers of MPI processes. We implement our work in the OP2 domain-specific library, which provides an API that abstracts the solution of unstructured mesh computations, and demonstrate how the whole process can be automated without intervention from the user. We carry out a performance analysis of our method applied to two applications: a simple finite volume application, and a more complex finite element code that uses a conjugate-gradient solver. We show a 2.37x to 1.49x slowdown on these applications as the price of full bitwise reproducibility.
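To make the goal concrete, the sketch below shows one generic way to make a reduction independent of the number of partitions: tag every contribution with a global element ID and always add the contributions in ascending ID order. This is an assumption made for illustration, not the method described in the paper and not part of the OP2 API. The names `partition` and `ordered_global_sum` are hypothetical, and gathering every contribution in one place would not scale in a real MPI code.

```python
def partition(contribs, parts):
    """Distribute (global_id, value) pairs round-robin over `parts` hypothetical ranks."""
    return [contribs[r::parts] for r in range(parts)]


def ordered_global_sum(local_lists):
    """Gather all (global_id, value) pairs and add them in ascending global-ID order.

    Because the order of additions depends only on the global IDs, the
    rounded result does not depend on how the data was partitioned.
    """
    gathered = [pair for local in local_lists for pair in local]
    gathered.sort(key=lambda pair: pair[0])
    acc = 0.0
    for _, value in gathered:
        acc += value
    return acc


# Hypothetical per-element contributions, indexed by global element ID.
contribs = list(enumerate([1e16, 1.0, -1e16, 1.0, 0.1, 0.2, 0.3]))

# The same data decomposed over 1, 2, 3 and 7 ranks.
results = [ordered_global_sum(partition(contribs, p)) for p in (1, 2, 3, 7)]
print(len(set(results)) == 1)  # True: every decomposition yields the same bits
```

The fixed ordering buys reproducibility at the cost of extra bookkeeping and communication, which is the kind of overhead the reported slowdown quantifies for the authors' own approach.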
We also introduce some new problems, and a simple model to support reproducibility on a broader set of applications.