Presentation
Optimizing Deep Learning Recommender Systems On CPU Architectures
Session: Research Poster Session
Event Type: Research Poster, Pre-Recorded
Time: Tuesday, June 23rd, 4:30pm - 4:35pm
Location: Analog 1
Description: In the last two years, the compute demand of AI applications, especially deep learning training, has exploded. Since then, the goal of much research has been to squeeze the last bit of performance out of HPC systems for AI tasks, and the community has started to talk about the convergence of AI and HPC. Often this discussion is held in the context of how fast ResNet50 can be trained. Unfortunately, ResNet50 is no longer a representative workload in 2020. We therefore focus on recommender systems, which account for most of the AI cycles in cloud computing centers. More specifically, we focus on Facebook's DLRM benchmark. By enabling it to run on the latest CPU hardware and software tailored for HPC, we achieve an improvement in performance of nearly two orders of magnitude (91x) on a single socket, and high scaling efficiency up to 64 sockets, while fitting ultra-large datasets. For these large datasets we cannot compare with GPUs, as our GPU cluster is simply not able to hold the models in device memory. This poster discusses the optimization techniques for the various operators in DLRM and which system components are stressed by each of these operators. The poster also demonstrates that, for these novel workloads on small datasets, current CPUs and GPUs are within 2-3x of each other in performance and within 2x in performance per watt. For large datasets, CPUs have a scaling advantage because they are not limited by on-board device memory: each socket easily houses 100+ GB.
Poster PDF