Node-level parallel programming is becoming more challenging as we enter the era of heterogeneous and manycore computing architectures. Finding a single intra-node programming model that meets the diverse requirements of emerging computer architectures and the growing demands of domain applications is harder than ever.
Compiler directives (pragmas) offer an elegant solution to the portability and productivity challenges of parallel programming, as demonstrated by industry standards such as OpenMP. Most of my research on improving the productivity of parallel programming uses, extends, and implements OpenMP directives. Specific topics include the implementation of OpenMP 4.x features (accelerator support, dependent tasks, etc.) and extensions to support a data-driven execution model, power and energy modeling and tuning, advanced data distributions, and large-scale data processing on distributed and cloud systems.
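As a brief illustration of the directive-based style, the following minimal sketch uses two OpenMP 4.x features named above: the `target` construct for accelerator offloading and the `depend` clause for dependent tasks. The function names `scale` and `produce_then_consume` are hypothetical and chosen only for this example; they are not taken from any of the projects described here.

```c
/* Hypothetical helper: scale a vector in place.
 * The target construct asks the compiler to offload the loop to an
 * accelerator; it falls back to the host when no device is available. */
void scale(double *x, int n, double a)
{
    #pragma omp target teams distribute parallel for map(tofrom: x[0:n])
    for (int i = 0; i < n; i++)
        x[i] *= a;
}

/* Hypothetical helper: two tasks ordered by a data dependence on `a`.
 * The consumer task may only start after the producer task has finished. */
double produce_then_consume(int n)
{
    double a = 0.0, b = 0.0;

    #pragma omp parallel
    #pragma omp single
    {
        /* Producer: writes a. */
        #pragma omp task shared(a) depend(out: a)
        a = (double)n;

        /* Consumer: reads a, writes b. */
        #pragma omp task shared(a, b) depend(in: a) depend(out: b)
        b = a * 2.0;

        /* Wait for both tasks before leaving the single region. */
        #pragma omp taskwait
    }
    return b;
}
```

Because the parallelism is expressed only through pragmas, a compiler without OpenMP support ignores the directives and the same source runs sequentially with the same result, which is the portability argument made above.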
Conventional compiler optimizations have focused on exploiting instruction-level parallelism (ILP) and the vertical direction of the memory hierarchy (private caches) of microarchitectures. Multicore or SMP systems, which often imply a NUCA or NUMA memory hierarchy, introduce global (macro) optimization challenges for compilers: coordinating the sharing of, and contention for, system resources among multiple cores. We refer to these high-level compilation techniques as machine-aware compilation or macro optimization, as opposed to (micro)architecture-level compilation. Current topics in this area include abstract machine models, machine-aware compilation, representation of application parallel structure, and compile-time modeling for performance and energy efficiency.
Emerging and future HPC systems will exhibit an unprecedented level of heterogeneity and complexity within compute nodes, along with a dramatic increase in core counts within nodes and in node counts across the system. An efficient runtime system is essential to take full advantage of the hardware capabilities. Runtimes implement the execution models supported by the hardware and should efficiently map the execution patterns expressed in programming models onto those execution models. My previous work on the Habanero-C and OpenMP runtimes and on synchronization for the Cyclops64 manycore architecture established a solid foundation for moving toward a unified runtime system for heterogeneous and manycore computing nodes. Current topics include the creation of an offloading runtime, a data-driven execution model, efficient memory management techniques for threaded runtimes, interoperability of threaded runtime systems, and runtime adaptivity and profiling.
Performance tools such as TAU, HPCToolkit, and Score-P are excellent utilities that help HPC programmers diagnose performance bottlenecks and load-balance issues in parallel applications. However, they treat all applications alike. Skeleton-guided performance profiling and monitoring, which uses compile-time modeling and algorithm skeleton performance data, aims to make these tools considerably more intelligent. Topics include the collection of algorithm skeleton performance data, compile-time modeling, coordinated online and offline adaptation of the runtime and tools, and runtime monitoring with alerts for performance degradation.