Global Experiences with HPC Operational Data Measurement, Collection and Analysis
Time: Tuesday, June 23rd, 4:06pm - 4:08pm
Description: High-performance computing (HPC) data centers often have unique operational requirements related to the size and scale of the machines they support. Operating these data centers efficiently typically requires continuously gathering, monitoring, and analyzing information in near real time from the facility (e.g., power, cooling), the HPC machines (e.g., job scheduling, application efficiency), and the support infrastructure (e.g., network, filesystems, web services). At the extreme scales of next-generation systems, collecting this high-frequency, high-volume operational data from heterogeneous sensors, machines, and resources distributed throughout the HPC data center, without impacting underlying data center operations and HPC resource utilization, presents many challenges. To address these challenges, many HPC data centers have implemented customized data collection and analysis systems that meet their own operational needs. To share and analyze these solutions with the wider HPC data center community, the Energy Efficient HPC Working Group (EEHPCWG) conducted a global survey of multiple HPC data centers. This poster summarizes the methodology, analysis, and findings of the survey; specifically, the technical details of the in-house data collection solutions each site has developed, what data is being collected, how the data is being used, and what lessons were learned during the design and deployment of their Operational Data Analytics (ODA) systems. Real-world use cases from 8 of the sites illustrate the value of integrated facilities and systems data collection in HPC data centers for improving overall operational efficiency.