Towards a Heterogeneous data-centric framework for efficient Linear algebra
Time: Monday, June 22nd, 11:21pm - 11:59pm
Description: BLAS (Basic Linear Algebra Subprograms) are used extensively in scientific code for graph processing, simulation, systems, machine learning and more. Their optimization is one of the most explored areas of computational science and is still a very active field. Much effort has gone into implementing BLAS efficiently on different architectures and paradigms. However, harnessing the performance of the new heterogeneous systems requires full hardware utilization, optimized data movement and sufficient parallelism.
‘Qilin’ is the first hybrid execution approach; it splits the problem into equal chunks and uses optimized vendor libraries for computation, based on a simple empirical model. Stanimire proposes LAPACK with heterogeneous BLAS-level parallelism. Werkhoven proposes a model for predicting the optimal communication/computation overlap on GPUs. cuBLAS-XT is Nvidia’s multi-GPU library of BLAS 3 routines with built-in communication overlap and optional hybrid host execution for GEMM only, but tuning the chunk size and CPU ratio is left to the user. ‘BLASXT’ is the state-of-the-art hybrid CPU/multi-GPU (Nvidia) library for BLAS 3; it uses a cache representation of the multi-GPU memory problem to reduce PCIe communication and to load-balance work between GPUs.
Previous approaches either focus on one branch of the problem (e.g. BLAS 3) and provide a full solution [4,5], or simplify the problem with assumptions about data location, availability and layout and lack a complete implementation [1,2,3]. Our objective is to create a complete framework that aims to provide optimal heterogeneous BLAS performance. The framework will provide a BLAS-style high-level interface with no extra input parameters, and will be responsible for automatically and optimally distributing work and data and for scheduling execution (using existing libraries for computation) simultaneously on the available heterogeneous components.
Since most linear algebra operations (BLAS, SparseBLAS, LAPACK) work with vectors and arrays, data movement and access are the main issues in heterogeneous clusters. To account for this we follow a data-centric approach and isolate three main challenges. The first is to create a lightweight mechanism for predicting the optimal data distribution. The second is to engineer the underlying data tiling/transfer methods and integrate existing BLAS libraries, to provide a complete framework. The third is to keep the framework modular, extensible and adaptable.
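To make the data tiling step concrete, the following is a minimal Python sketch of cutting a matrix into rectangular tiles that need not be square and whose edge tiles may be smaller than the rest. The function names `split_extent` and `tile_matrix` and the tuple layout are hypothetical, chosen only for illustration; this is not the framework's tiling code.

```python
def split_extent(n, tile):
    """Split the index range [0, n) into consecutive spans of at most `tile` elements.

    The last span is shorter when `tile` does not divide `n`.
    """
    if n < 0 or tile <= 0:
        raise ValueError("n must be non-negative and tile must be positive")
    return [(start, min(start + tile, n)) for start in range(0, n, tile)]


def tile_matrix(rows, cols, tile_rows, tile_cols):
    """Return tiles as (row_start, row_stop, col_start, col_stop) tuples.

    Tiles are listed row-block by row-block. tile_rows and tile_cols may differ,
    so tiles can be non-square (an illustrative layout, not the authors' format).
    """
    return [
        (r0, r1, c0, c1)
        for r0, r1 in split_extent(rows, tile_rows)
        for c0, c1 in split_extent(cols, tile_cols)
    ]
```

Each tuple describes one block that could be transferred and computed independently, for example by handing the corresponding sub-matrix to an existing BLAS routine.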
Our early evaluation of this work is the following. We have implemented simple regression models in Python for dot, gemv, spmv and gemm for hybrid CPU/GPU execution. Our model takes into account the available system resources, the algorithm, and its input/output (type, size and layout), and predicts the ideal CPU/GPU ratio and the chunk number and size for maximum overlap. It also considers non-square tiling blocks of varying size (all previous works consider equal square blocks). To test the accuracy of our model, we have also implemented the aforementioned algorithms in C using Intel MKL and cuBLAS. Our early results show that the model provides good insight into the optimal data split, and that the accuracy of the prediction depends on all of the model inputs we use.
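As a rough illustration of how a regression model can yield a CPU/GPU split, the sketch below fits a straight line (time = overhead + cost per element x size) to timing samples for each device and then picks the split at which both devices finish at the same time. It is a deliberately simplified assumption: it ignores transfer/compute overlap, data layout and tiling, and the names `fit_line` and `gpu_share` are hypothetical rather than taken from the authors' models.

```python
def fit_line(sizes, times):
    """Ordinary least-squares fit of times ~ intercept + slope * sizes.

    Returns (intercept, slope).
    """
    n = len(sizes)
    if n < 2 or n != len(times):
        raise ValueError("need at least two (size, time) samples of equal length")
    mean_x = sum(sizes) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    if sxx == 0:
        raise ValueError("sizes must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def gpu_share(n, cpu_model, gpu_model):
    """Number of elements (out of n) to assign to the GPU.

    Solves a_c + b_c * (n - n_gpu) = a_g + b_g * n_gpu for n_gpu, so that both
    devices finish together under the fitted linear models, then clamps the
    result to [0, n]. The GPU intercept is assumed to absorb fixed costs such
    as kernel launch and transfer setup.
    """
    a_c, b_c = cpu_model
    a_g, b_g = gpu_model
    if b_c + b_g <= 0:
        raise ValueError("fitted per-element costs must sum to a positive value")
    n_gpu = (a_c - a_g + b_c * n) / (b_c + b_g)
    return min(max(n_gpu, 0.0), float(n))
```

In practice the two models would be fitted from measured runs of the same routine on each device, and the resulting share would be rounded to a whole number of tiles.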