# Presentation

Distributed Memory Task-Based Block Low Rank Direct Solver

Session: Research Poster Session

Event Type: Research Poster

Pre-Recorded

Time: Tuesday, June 23rd, 3:20pm - 3:25pm

Location: Analog 1

Description: LU factorization of a dense matrix takes $O(N^3)$ time, which can lead to very long run times as the matrix size grows. Given a matrix whose off-diagonal blocks have rapidly decaying singular values, we can represent the dense matrix as a Block Low Rank (BLR) matrix and perform the LU factorization in $O(N^2)$ time. When the matrix is distributed across multiple nodes, load imbalance between nodes can lead to poor node utilization. In this research, we explore a way of mitigating the load imbalance by using the distributed run-time system StarPU for dependency resolution and asynchronous execution.
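The low-rank structure the abstract relies on can be sketched as follows. This is a minimal illustration (not the solver's actual compression routine), assuming a smooth kernel evaluated on well-separated point sets, which is a typical source of off-diagonal blocks with rapidly decaying singular values; the tolerance and kernel are arbitrary choices for the demo.

```python
import numpy as np

# Off-diagonal block of a kernel matrix for two well-separated point sets:
# such blocks have rapidly decaying singular values, so they compress well.
n = 256
x = np.linspace(0.0, 1.0, n)                 # source points
y = np.linspace(2.0, 3.0, n)                 # well-separated target points
A = 1.0 / np.abs(x[:, None] - y[None, :])    # dense off-diagonal block

# Truncated SVD at a relative tolerance gives the numerical rank.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
tol = 1e-8
rank = int(np.sum(s > tol * s[0]))

# Low-rank factors: A ~= U_r @ diag(s_r) @ Vt_r, stored in O(n * rank)
# entries instead of O(n^2).
U_r, s_r, Vt_r = U[:, :rank], s[:rank], Vt[:rank, :]
A_lr = (U_r * s_r) @ Vt_r
err = np.linalg.norm(A - A_lr) / np.linalg.norm(A)
print(f"block size {n}, numerical rank {rank}, relative error {err:.1e}")
```

Performing the LU factorization's update operations on such compressed factors, rather than on the full dense blocks, is what reduces the overall cost from $O(N^3)$ toward $O(N^2)$.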

We begin with scaling experiments comparing the distributed dense LU factorizations from ScaLAPACK and Elemental against our task-based distributed dense factorization for a matrix of size 32768 on up to 16 nodes. We then compare these dense cases with our task-based BLR solver, which shows improved processor efficiency due to better overlap of communication and computation between nodes.

We also run strong scaling experiments on a single node with multiple threads and observe improving scaling up to 25 threads for BLR matrices, after which scaling degrades. Upon further investigation, we find that the cause of the degradation beyond 25 threads is memory-bandwidth contention: too many threads access data from main memory at once during a BLR factorization, where the amount of computation per byte of data is lower than in the dense case.

