Research Background

High performance computing-enabled simulation has been widely considered a third pillar of science, along with theory and experimentation, and is a strategic tool in many aspects of scientific discovery and innovation. The recent goal of scaling emerging supercomputers to exaflop performance (10^18 operations per second) poses unprecedented challenges for performance, energy efficiency and programmability. The following links provide background on these challenges and other information relevant to this area of work.

  1. Reports from the DoE Office of Science Advanced Scientific Computing Research (ASCR) program, e.g., Top Ten Exascale Research Challenges by the ASCAC Subcommittee, February 10, 2014, and the reports on Software Productivity for Extreme Scale Science, 2011 and 2014.
  2. White House Executive Order – Creating a National Strategic Computing Initiative
  3. The TOP500 supercomputer list, HPCwire magazine

Current research and development at PASSLab creates software and hardware solutions to improve the performance, productivity and energy efficiency of existing and emerging parallel and high performance computing systems.


Research Area and Topics

Currently, there are five major areas of ongoing research in our group.

Parallel programming models and compiler/runtime implementation for HPC

The research in this area improves the programmability and performance of existing parallel programming models based on OpenMP directives. Specific topics include the implementation of OpenMP 4.x features (accelerator support, dependent tasks, etc.), as well as extensions to support a data-driven execution model, power and energy modeling and tuning, advanced data distributions, and large-scale data processing on distributed and cloud systems.
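As a minimal illustration of the OpenMP 4.x dependent-task feature mentioned above (an illustrative sketch, not code from our runtime work), the function below chains a producer task and two consumer tasks through depend clauses:

```c
/* Minimal sketch of OpenMP 4.x dependent tasks: the two consumer tasks
 * may run concurrently with each other, but only after the producer task
 * has written x. Compile with -fopenmp; without it the pragmas are
 * ignored and the function simply runs serially with the same result. */
int dependent_tasks_demo(void) {
    int x = 0, y = 0, z = 0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: x)               /* producer */
        x = 40;
        #pragma omp task depend(in: x) depend(out: y) /* consumer of x */
        y = x + 1;
        #pragma omp task depend(in: x) depend(out: z) /* independent of y's task */
        z = x + 2;
    } /* the implicit barrier here waits for all three tasks */
    return y + z; /* 41 + 42 = 83 */
}
```

Because the two consumer tasks only declare an input dependence on x, a runtime is free to schedule them on different threads at the same time.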

More recently, memory systems have become the next wave of increasing complexity in the computer industry. New memory technologies and architectures are being introduced into the conventional memory hierarchy, e.g. 3D-stacked memory, NVRAM, and hybrid software/hardware cache architectures. Users now often spend more effort optimizing local and shared data access with respect to the memory hierarchy than decomposing and mapping parallelism onto hardware.

In this direction of research, we extend OpenMP with notions of explicit memory mapping and data distribution, explicit data and computation binding and coherence enforcement, a more relaxed memory consistency model, and an execution model of asynchronous data movement and data-driven computation for applying aggressive latency-hiding techniques. Our goal is to further address the memory wall challenge (both latency and bandwidth) and minimize its impact on performance and power consumption in today's deeper memory hierarchies without compromising programmability.

Our development leverages the OpenMP parallel programming model, the LLVM/Clang compiler, and its OpenMP runtime system.
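Standard OpenMP 4.x already provides explicit host-device data mapping through map clauses, which is the baseline our memory-mapping and data-distribution extensions build on; a minimal sketch:

```c
/* Minimal sketch of explicit data mapping in standard OpenMP 4.x target
 * offload: 'to' copies a[0:n] to the device before the loop and 'from'
 * copies b[0:n] back afterwards. Without an attached device (or without
 * -fopenmp) the loop runs on the host and produces the same result. */
void scale_by_two(const double *a, double *b, int n) {
    #pragma omp target teams distribute parallel for \
            map(to: a[0:n]) map(from: b[0:n])
    for (int i = 0; i < n; i++)
        b[i] = 2.0 * a[i];
}
```

The map clauses make the programmer's intent about data movement explicit; our extensions aim to carry this idea further into the deeper on-node memory hierarchy.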

Hardware support for parallel programming

Implementing parts of a parallel runtime system in hardware or FPGAs may greatly improve the scalability and performance of parallel applications, e.g. hardware-supported barrier operations, hardware queues for task scheduling, hardware message passing to reduce the use of locks, transactional memory for parallel data structures, and intelligent NICs and NIC-shared memory to implement shared memory between computing nodes.

In this area of research, we leverage the RISC-V ecosystem (the ISA and the UC Berkeley Rocket and BOOM cores and chips) to create new hardware logic for supporting parallel programming. We also explore the notion of hybrid dataflow architectures, from both historical research (WaveScalar and TRIPS) and recent developments (neuromorphic chips, Maxeler dataflow in FPGAs, and ANN chips), for parallel computing.

For computer architecture research we build on the RISC-V ISA and the Rocket and BOOM chip designs; we also use the Xilinx Zed board for RISC-V development, the Intel/Altera DE5-Net board to experiment with OpenCL support on FPGAs, and the Maxeler Galava PCIe DataFlow Engine card for the dataflow model.

Binary-based performance visualization, parallelization, instrumentation and tool support

The approach of directly analyzing and modifying binary code has the obvious benefits of simplicity and security. We use binary analysis and instrumentation to diagnose memory access and cache behavior and improve software prefetching without the compiler or source code, to visualize data layout and access patterns on NUMA architectures for pinpointing performance bottlenecks, and to apply machine learning to memory traces for detecting malicious activities. We also develop neural network-based approaches to performance and power modeling and prediction based on hardware counter data.
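As a toy illustration of one analysis step behind such prefetching work (hypothetical code, not our actual tool), a detector can scan a recorded address trace for a dominant constant stride, the classic candidate pattern for inserting software prefetches:

```c
#include <stdint.h>
#include <stddef.h>

/* Toy sketch of a constant-stride detector over a recorded address trace.
 * If at least 90% of the successive address deltas agree with the first
 * delta, report that stride (a natural software-prefetch candidate):
 * returns 1 and stores the stride, otherwise returns 0. */
int detect_constant_stride(const uint64_t *trace, size_t n, int64_t *stride_out) {
    if (n < 3)
        return 0;
    int64_t cand = (int64_t)(trace[1] - trace[0]); /* candidate stride */
    size_t hits = 0, deltas = n - 1;
    for (size_t i = 0; i + 1 < n; i++)
        if ((int64_t)(trace[i + 1] - trace[i]) == cand)
            hits++;
    if (hits * 10 >= deltas * 9) { /* >= 90% of deltas match */
        *stride_out = cand;
        return 1;
    }
    return 0;
}
```

A real tool would, of course, classify many interleaved streams per instruction and feed the result back into binary rewriting; this only shows the per-stream pattern test.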

The PAPI, HPCToolkit, Pin and Dyninst tools are used in this area of development.

Innovative algorithms, hardware and software support for medical image processing

We explore the use of NVIDIA GPUs and Intel/Xilinx/Maxeler FPGA solutions for large-scale image analysis and deep learning in medical applications, including deformable image registration and radiation dose calculation for cancer diagnosis and treatment.

Autonomous and intelligent vehicle/drone using deep learning and image analysis

This is a new area in which we explore the use of NVIDIA GPUs, 3-D depth cameras, image processing and deep neural networks for robotics and embedded applications. We use the NVIDIA Jetson TX1 and the ZED camera from Stereolabs.


Benchmarks and Applications

We use benchmarks such as Rodinia, NPB, SPEC CPU and SPEC OMP, applications from the DoE Co-Design centers, and mini-apps from the Mantevo suite.


Our research is supported by the College of Engineering and Computing at the University of South Carolina, the National Science Foundation, Xilinx, Intel/Altera, Maxeler Technologies, and the Beaumont Cancer Institute of Beaumont Health System.