The growing disparity between CPU speed and memory speed, known as the memory wall problem, has been one of the most critical and long-standing challenges in the computing industry. To address the memory wall problem, memory systems have been becoming increasingly complex in recent years. New memory technologies and architectures are introduced into conventional memory hierarchies. This newly added memory complexity plus the existing programming complexity and architecture heterogeneity make utilizing high performance computer systems extremely challenging. Performance optimization has thus shifted from computing to data access, especially for data-intensive applications. Significant amount of efforts of a user is often spent on optimizing local and shared data access regarding the memory hierarchy rather than for decomposing and mapping task parallelism onto hardware. This increase of memory optimization complexity also demands significant system support, from tools to compiler technologies, and from modeling to new programming paradigms. Explicitly or implicitly, to address the memory wall performance bottleneck, the development of programming interfaces, compiler tool chains, and applications are becoming memory oriented or memory centric.
The organization of this workshop believe it is important to elevate the notion of memory-centric programming to utilize the unprecedented and ever-elevating modern memory systems. Memory-centric programming refers to the notion and techniques of exposing the hardware memory system and its hierarchy, which include NUMA regions, shared and private caches, scratch pad, 3-D stack memory, and non-volatile memory, to the programmer for extreme performance programming via portable abstraction and APIs for explicit memory allocation, data movement and consistency enforcement between memories. The concept has been gradually adopted in main stream programming interfaces, for example and to name a few, the use of place
in OpenMP and X10 and locale
in Chapel to represent memory regions in a system, the use of shared
modifier in CUDA or cache
modifier in OpenACC for representing scratch-pad SRAM for GPUs, the memkind
library and the recent effort for OpenMP memory management for supporting 3-D stack memory (HBM or HMC), and PMEM library for persistent memory programming. The MCHPC workshop aims to bring together computer and computational science researchers, from industry and academia, concerned with the programmability and performance of the existing and emerging memory systems. The term performance for memory system is general, which include latency, bandwidth, power consumption and reliability from the aspect of hardware memory technologies to what it is manifested in the application performance.
The topics of interest for the workshop include, but are not limited to:
November 12th 2017 - MCHPC2017 Workshop
Authors are invited to submit manuscripts in English structured as technical papers up to 8 pages or as short papers up to 5 pages, both of letter size (8.5in x 11in) and including figures, tables, and references using the IEEE format for conference proceedings. Submissions not conforming to these guidelines may be returned without review. Reference style files are available at http://www.ieee.org/conferences_events/conferences/publishing/templates.html.
All manuscripts will be reviewed and judged on correctness, originality, technical strength, and significance, quality of presentation, and interest and relevance to the workshop attendees. Submitted papers must represent original unpublished research that is not currently under review for any other conference or journal. Papers not following these guidelines will be rejected without review and further action may be taken, including (but not limited to) notifications sent to the heads of the institutions of the authors and sponsors of the conference. Submissions received after the due date, exceeding length limit, or not appropriately structured may also not be considered. At least one author of an accepted paper must register for and attend the workshop. Authors may contact the workshop organizers for more information.
Papers should be submitted electronically at: https://easychair.org/conferences/?conf=mchpc2017.
It is widely recognized that a major disruption is under way in computer hardware as processors strive to extend, and go beyond, the end-game of Moore's Law. This disruption will include new forms of processor and memory hierarchies, including near-memory computation structures. In this talk, we summarize compiler and runtime challenges for memory centric programming, based on past experiences with the X10 project at IBM and the Habanero project at Rice University and Georgia Tech. A key insight in addressing compiler challenges is to expand the state-of-the-art in analyzing and transforming explicitly-parallel programs, so as to encourage programmers to write forward-scalable layout-independent code rather than hardwiring their programs to specific hardware platforms and specific data layouts. A key insight in addressing runtime challenges is to focus on asynchrony in both computation and data movement, while supporting both in a unified and integrated manner. A cross-cutting opportunity across compilers and runtimes is to broaden the class of computation and data mappings that can be considered for future systems. Based on these and other insights, we will discuss recent trends in compilers and runtime systems that point the way towards possible directions for addressing the challenges of memory centric programming.
Vivek Sarkar is a Professor in the School of Computer Science, and the Stephen Fleming Chair for Telecommunications in the College of Computing at at Georgia Institute of Technology, since August 2017. Prior to joining Georgia Tech, Sarkar was a Professor of Computer Science at Rice University, and the E.D. Butcher Chair in Engineering. During 2007 - 2017, Sarkar built Rice's Habanero Extreme Scale Software Research Group with the goal of unifying parallelism and concurrency elements of high-end computing, multicore, and embedded software stacks (http://habanero.rice.edu). He also served as Chair of the Department of Computer Science at Rice during 2013 - 2016.
Prior to joining Rice in 2007, Sarkar was Senior Manager of Programming Technologies at IBM Research. His research projects at IBM included the X10 programming language, the Jikes Research Virtual Machine for the Java language, the ASTI optimizer used in IBM’s XL Fortran product compilers, and the PTRAN automatic parallelization system. Sarkar became a member of the IBM Academy of Technology in 1995, and was inducted as an ACM Fellow in 2008. He has been serving as a member of the US Department of Energy’s Advanced Scientific Computing Advisory Committee (ASCAC) since 2009, and on CRA’s Board of Directors since 2015.
In this talk, Andy will describe the emerging Persistent Memory technology and how it can be applied to HPC-related use cases. Andy will also discuss some of the challenges using Persistent Memory, and the ongoing work the ecosystem is doing to mitigate those challenges.
Andy Rudoff is a Senior Principal Engineer at Intel Corporation, focusing on Non-Volatile Memory programming. He is a contributor to the SNIA NVM Programming Technical Work Group. His more than 30 years industry experience includes design and development work in operating systems, file systems, networking, and fault management at companies large and small, including Sun Microsystems and VMware. Andy has taught various Operating Systems classes over the years and is a co-author of the popular UNIX Network Programming text book.
Given the recent resurgence of research into processing in or near memory systems, we find an ever increasing need to augment traditional system software tools in order to make efficient use of the PIM hardware abstractions. One such architecture, the Micron In-Memory Intelligence (IMI) DRAM, provides a unique processing capability within the sense amp stride of a traditional 2D DRAM architecture. This accumulator processing circuit has the ability to compute both horizontally and vertically on pitch within the array. This unique processing capability requires a memory allocator that provides physical bit locality in order to ensure numerical consistency.
In this work we introduce a new memory allocation methodology that provides bit contiguous allocation mechanisms for horizontal and vertical memory allocations for the Micron IMI DRAM device architecture. Our methodology drastically reduces the complexity by which to find new, unallocated memory blocks by combining a sparse matrix representation of the array with dense continuity vectors that represent the relative probability of finding candidate free blocks. We demonstrate our methodology using a set of pathological and standard benchmark applications in both horizontal and vertical memory modes.
Stencil computations are a key class of applications, widely used in the scientific computing community, and a class that has particularly benefited from performance improvements on architectures with high memory bandwidth. Unfortunately, such architectures come with a limited amount of fast memory, which is limiting the size of the problems that can be efficiently solved. In this paper, we address this challenge by applying the well-known cache-blocking tiling technique to large scale stencil codes implemented using the OPS domain specific language, such as CloverLeaf 2D, CloverLeaf 3D, and OpenSBLI. We introduce a number of techniques and optimisations to help manage data resident in fast memory, and minimise data movement. Evaluating our work on Intel's Knights Landing Platform as well as NVIDIA P100 GPUs, we demonstrate that it is possible to solve 3 times larger problems than the on-chip memory size with at most 15% loss in efficiency.
Experience with Intel Xeon Phi suggests that NUMA alone is inadequate for assignment of pages to devices in heterogeneous memory systems. We argue that this is because NUMA is based on a single distance metric between all domains (i.e., number of devices “in between” the domains), while relationships between heterogeneous domains can and should be characterized by multiple metrics (e.g., latency, bandwidth, capacity). We therefore propose elaborating the concept of NUMA distance to give better and more intuitive control of placement of pages, while retaining most of the simplicity of the NUMA abstraction. This can be based on minor modification of the Linux kernel, with the possibility for further development by hardware vendors.
General Purpose Graphics Processing Units (GPGPU) have become a popular platform to accelerate computing. However, while they provide additional computing powers, GPGPU have put even more pressure on the already behindhand memory systems. Memory performance is an identified performance killer of GPGPU. Evaluating, understanding, and improving GPGPU data access delay is an imperative research issue of high-performance computing. In this study, we utilize the newly proposed C-AMAT (Concurrent Average Memory Access Time) model to measure the memory performance of GPGPU. We first introduce a GPGPU-specialized measurement design of C-AMAT. Then the modern GPGPU simulator, GPGPU-Sim, is used to carry the performance study. Finally, the performance results are analyzed.