Workshop Organizers: Ronald Brightwell, YongHong Yan, Maya Gokhale, Ivy B. Peng
Keynote: Follow the Data: Memory-Centric Designs for Modern Datacenters
Brief Bio: Dr. Daniel Ernst is currently a Principal Architect in Microsoft's Azure Hardware Architecture team, which is responsible for long-range technology pathfinding for future Azure Cloud systems. Within AHA, Dan leads the team responsible for future memory systems. This team investigates future architecture directions for Azure and serves as the primary architecture contact point in technical relationships with compute, memory, and device partners, as well as the primary driver of Microsoft’s memory standards activity.
Prior to joining Microsoft, Dan spent 10 years at Cray/HPE, most recently as a Distinguished Technologist in the HPC Advanced Technology team. While at Cray, Dan led multiple customer-visible collaborative pathfinding investigations into future HPC architectures and was part of the team that architected the Department of Energy’s Frontier and El Capitan Exascale systems.
Dan has served on multiple industry standards bodies throughout his career, including JEDEC and the CXL and CCIX consortia, and was a founding Board of Directors member of the Gen-Z Consortium.
Dan received his Ph.D. in Computer Science and Engineering from the University of Michigan, where he studied high-performance, low-power, and fault-tolerant microarchitectures. He also holds an MSE from Michigan and a BS in Computer Engineering from Iowa State University.
Author/Presenters: Benjamin Sepanski, Tuowen Zhao, Hans Johansen, Samuel Williams
Abstract: Computations on structured grids using standard multidimensional array layouts can incur substantial data movement costs through the memory hierarchy. This presentation explores the benefits of using a framework (Bricks) to separate the complexity of data layout and optimized communication from the functional representation. To that end, we provide three novel contributions and evaluate them on several kernels taken from GENE, a phase-space fusion tokamak simulation code. We extend Bricks to support 6-dimensional arrays and kernels that operate on complex data types, and we integrate Bricks with cuFFT. We demonstrate how to optimize Bricks for data reuse, spatial locality, and GPU hardware utilization, achieving up to a 2.67× speedup on a single A100 GPU. We conclude with insights on how to rearchitect memory subsystems.
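As a rough illustration of the layout idea behind a bricked decomposition (this is a sketch, not the Bricks library API; names such as BDIM, Brick, and BrickedGrid are hypothetical), the following C++ fragment stores each small fixed-size sub-block of the grid contiguously instead of using one large row-major array, which is what improves spatial locality for stencil-style access:

// Illustrative sketch only -- not the Bricks library API. The domain is tiled
// into small fixed-size "bricks" whose elements are stored contiguously,
// improving spatial locality for stencil-style access. Assumes the grid
// dimensions are divisible by BDIM.
#include <vector>
#include <cstddef>

constexpr int BDIM = 4;                 // brick edge length (4x4x4 elements)

struct Brick {
    double data[BDIM][BDIM][BDIM];      // all elements of one brick are contiguous
};

struct BrickedGrid {
    int nbx, nby, nbz;                  // number of bricks per dimension
    std::vector<Brick> bricks;

    BrickedGrid(int nx, int ny, int nz)
        : nbx(nx / BDIM), nby(ny / BDIM), nbz(nz / BDIM),
          bricks(static_cast<std::size_t>(nbx) * nby * nbz) {}

    // Map a global index (i,j,k) to a brick and an offset inside it.
    double& at(int i, int j, int k) {
        std::size_t b = (static_cast<std::size_t>(i / BDIM) * nby + j / BDIM) * nbz + k / BDIM;
        return bricks[b].data[i % BDIM][j % BDIM][k % BDIM];
    }
};

The 6-dimensional, complex-valued arrays used in GENE extend the same idea to more dimensions and a complex element type.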
Author/Presenters: Jacob Wahlgren, Maya Gokhale, Ivy Peng
Abstract: Current HPC systems provide memory resources that are tightly coupled with compute nodes. However, HPC applications are evolving: diverse workloads demand different memory resources to achieve both high performance and high utilization. In this study, we evaluate a memory subsystem that leverages CXL-enabled memory to provide configurable capacity and bandwidth. We propose an emulator to explore the performance impact of various memory configurations, and a profiler to identify optimization opportunities. We evaluate the performance of seven HPC workloads and six graph workloads on the emulated system. Our results show that three HPC workloads see less than 10% performance impact, and two see less than 18%, when 75% of their memory is pooled. Also, a dynamically configured high-bandwidth system could effectively support bandwidth-bottlenecked workloads such as grid-based solvers. Finally, we identify interference through shared memory pools as a practical challenge for HPC systems adopting CXL-enabled memory.
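One common way to approximate a disaggregated memory tier on today's hardware (a sketch of the general approach, not necessarily the emulator described in the paper) is to bind a workload's data to a remote NUMA node whose higher latency and lower bandwidth stand in for a CXL memory pool, for example with libnuma; the node number and buffer size below are assumptions about the test machine:

// Hedged sketch, not the emulator from the paper: place a workload's data on a
// remote NUMA node so its extra latency and lower bandwidth stand in for a
// CXL-attached memory pool. Requires libnuma (link with -lnuma).
#include <numa.h>
#include <cstddef>
#include <cstdio>
#include <cstring>

int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "NUMA not available on this system\n");
        return 1;
    }

    const int far_node = 1;                    // hypothetical node emulating pooled CXL memory
    const std::size_t bytes = 1ull << 30;      // 1 GiB working set

    // Allocate the buffer entirely on the "far" node.
    void* buf = numa_alloc_onnode(bytes, far_node);
    if (!buf) { std::perror("numa_alloc_onnode"); return 1; }

    std::memset(buf, 0, bytes);                // touch pages so they are actually placed

    // ... run the workload against buf and compare with a local-node allocation ...

    numa_free(buf, bytes);
    return 0;
}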
Author/Presenters: Alex Fallin, Martin Burtscher
Abstract: Energy consumption is a major concern in high-performance computing. One important contributing factor is the number of times the wires are charged and discharged, i.e., how often they switch from '0' to '1' and vice versa. We describe a software technique to minimize this switching activity in GPUs, thereby lowering the energy usage. Our technique targets the memory bus, which comprises many high-capacitance wires that are frequently used. Our approach is to strategically change data values in the source code such that loading and storing them yields fewer bit flips. The new values are guaranteed to produce the same control flow and program output. Measurements on GPUs from two generations show that our technique allows programmers to save up to 9.3% of the whole-GPU energy consumption and 1.2% on average across eight graph-analytics CUDA codes without impacting performance.
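A minimal sketch of the underlying metric (this is not the authors' tool, and the sentinel example is hypothetical): switching activity on a bus can be approximated by counting the bits that differ between consecutively transferred words, i.e., popcount(prev XOR next), so any semantics-preserving choice of data values that lowers this count reduces dynamic energy:

// Minimal sketch (not the authors' tool): approximate bus switching activity by
// counting the bits that differ between consecutively transferred words.
// Requires C++20 for std::popcount.
#include <cstdint>
#include <vector>
#include <bit>
#include <cstdio>

// Total bit transitions when the words in `stream` are sent back to back.
std::uint64_t bus_bit_flips(const std::vector<std::uint32_t>& stream) {
    std::uint64_t flips = 0;
    std::uint32_t prev = 0;                       // assume the bus starts at all zeros
    for (std::uint32_t w : stream) {
        flips += std::popcount(prev ^ w);
        prev = w;
    }
    return flips;
}

int main() {
    // Hypothetical example: a graph code might mark "no edge" with 0xFFFFFFFF or
    // with 0. Both encodings can yield identical control flow, but they differ
    // in how many wires toggle.
    std::vector<std::uint32_t> marker_ff = {5, 0xFFFFFFFFu, 7, 0xFFFFFFFFu};
    std::vector<std::uint32_t> marker_0  = {5, 0u,          7, 0u};
    std::printf("flips with 0xFFFFFFFF sentinel: %llu\n",
                (unsigned long long)bus_bit_flips(marker_ff));
    std::printf("flips with 0 sentinel:          %llu\n",
                (unsigned long long)bus_bit_flips(marker_0));
}

Which encoding wins depends on the surrounding data; the point is only that the metric is cheap to compute and that semantically equivalent value choices can differ substantially in switching activity.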
Author/Presenters: Galen Shipman, Jered Dominguez-Trujillo, Kevin Sheridan, Sriram Swaminarayan
Abstract: Many of Los Alamos National Laboratory's HPC codes are memory-bandwidth bound. These codes exhibit high levels of sparse memory access that differ significantly from standard benchmarks. In this paper we present an analysis of the memory access patterns of some of our most important code bases. We then generate micro-benchmarks that preserve the memory access characteristics of our codes using two approaches: one based on statistical sampling of relative memory offsets in a sliding time window at the function level, and another at the loop level. The function-level approach is used to assess the impact of advanced memory technologies such as LPDDR5 and HBM3 using the gem5 simulator. Our simulation results show significant improvements for sparse memory access workloads using HBM3 relative to LPDDR5, as well as better scaling on a per-core basis. Assessment of two different architectures shows that higher peak memory bandwidth results in higher bandwidth on sparse workloads.
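The general idea of offset-based sampling and replay can be sketched as follows (this is not LANL's generator; the function names and the random replay policy are assumptions): a histogram of relative offsets between successive accesses is collected over a window of the trace, and a micro-benchmark then walks a large buffer using offsets drawn from that histogram so that the sparse access pattern is approximately preserved:

// Hedged sketch of the general idea (not LANL's tool): build a histogram of
// address deltas between successive accesses, then replay offsets drawn from
// that histogram over a large buffer. Assumes the histogram is non-empty.
#include <vector>
#include <map>
#include <random>
#include <cstdint>
#include <cstddef>

// Histogram of address deltas (in elements) observed in an access trace window.
std::map<std::int64_t, std::uint64_t>
offset_histogram(const std::vector<std::size_t>& accesses) {
    std::map<std::int64_t, std::uint64_t> hist;
    for (std::size_t i = 1; i < accesses.size(); ++i)
        ++hist[static_cast<std::int64_t>(accesses[i]) -
               static_cast<std::int64_t>(accesses[i - 1])];
    return hist;
}

// Replay: walk a buffer using offsets sampled from the histogram.
// The running sum keeps the loads from being optimized away.
double replay(const std::map<std::int64_t, std::uint64_t>& hist,
              const std::vector<double>& buf, std::size_t steps) {
    std::vector<std::int64_t> deltas;
    std::vector<double> weights;
    for (const auto& [d, c] : hist) { deltas.push_back(d); weights.push_back(static_cast<double>(c)); }

    std::mt19937_64 rng(42);
    std::discrete_distribution<std::size_t> pick(weights.begin(), weights.end());

    const std::int64_t n = static_cast<std::int64_t>(buf.size());
    std::int64_t pos = n / 2;
    double sum = 0.0;
    for (std::size_t s = 0; s < steps; ++s) {
        pos = ((pos + deltas[pick(rng)]) % n + n) % n;   // wrap into [0, n)
        sum += buf[static_cast<std::size_t>(pos)];       // the measured load
    }
    return sum;
}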
The growing disparity between CPU speed and memory speed, known as the memory wall problem, has been one of the most critical and long-standing challenges in the computing industry. The situation is further complicated by the recent expansion of the memory hierarchy and the blurring boundary between memory and storage. The memory hierarchy is becoming deeper and more diverse with the adoption of new memory technologies and architectures, including 3D-stacked memory, non-volatile random-access memory (NVRAM), memristors, and hybrid software/hardware caches. Computer architecture and hardware systems, operating systems, storage and file systems, the programming stack, and performance models and tools are being enhanced, augmented, or even redesigned to address the performance, programmability, and energy-efficiency challenges of increasingly complex and heterogeneous memory systems for HPC and data-intensive applications.

The MCHPC workshop aims to bring together computer and computational science researchers from industry, government labs, and academia who are concerned with the challenges of efficiently using existing and emerging memory systems. Performance for memory systems is a broad term: it spans latency, bandwidth, power consumption, and reliability, from the level of hardware memory technologies up to how they manifest in application performance. The topics of interest for the MCHPC workshop include, but are not limited to:
November 13, 2022 - Workshop
Authors are invited to submit manuscripts in English structured as technical papers of up to 8 pages or as short papers of up to 5 pages, both in letter size (8.5in x 11in) and including figures, tables, and references. Submissions not conforming to these guidelines may be returned without review. Papers should be formatted using the IEEE conference format, which can be found at https://www.ieee.org/conferences/publishing/templates.html. The workshop encourages submitters to include reproducibility information, using the Reproducibility Initiative for SC'22 Technical Papers as a guideline.
All manuscripts will be peer-reviewed and judged on correctness, originality, technical strength, significance, quality of presentation, and interest and relevance to the workshop attendees. Submitted papers must represent original unpublished research that is not currently under review for any other conference or journal. Papers not following these guidelines will be rejected without review, and further action may be taken, including (but not limited to) notifications sent to the heads of the authors' institutions and to the sponsors of the conference. Submissions received after the due date, exceeding the length limit, or not appropriately structured may also not be considered. At least one author of an accepted paper must register for and attend the workshop. Authors may contact the workshop organizers for more information.
Papers should be submitted electronically at https://submissions.supercomputing.org/; choose "SC22 Workshop: MCHPC'22: Workshop on Memory Centric High Performance Computing".
The final papers are planned to be published through the IEEE Computer Society, and the published proceedings will be included in the IEEE Xplore digital library.