Orchestrating Non-Blocking Asynchronous Framework for HPC Systems and Applications

Author/Creator

Author/Creator ORCID

Date

2021-01-01

Department

Computer Science and Electrical Engineering

Program

Computer Science

Citation of Original Publication

Rights

This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu
Distribution Rights granted to UMBC by the author.
Access limited to the UMBC community. Item may possibly be obtained via Interlibrary Loan through a local library, pending author/copyright holder's permission.

Subjects

Abstract

In modern supercomputing applications, communication dominates computation. This is evident in the largest supercomputer applications, whose achieved performance is only around 9% of peak. In addition, large-scale graph data analytics pose challenges to systems with a traditional memory hierarchy because of their unstructured data sources and irregular memory access patterns. Standard benchmarks such as LINPACK, which focus on floating-point operations per second (FLOPS), give no weight to communication, which many of today's scientific applications depend on. In the analytics world, when the data size of an application becomes sufficiently larger than DRAM, it is difficult to keep processors busy, which in turn creates a need for faster memory and higher bandwidth. The rate of improvement in microprocessor speed greatly exceeds the rate of improvement in DRAM speed. To overcome this mismatch between memory bandwidth and processor speed, asynchronous programming provides a way to avoid blocking waits by executing events independently of the main program flow. The performance of many HPC applications depends critically on how well they can hide the long latency of data movement by overlapping communication with ongoing computation, thereby minimizing wait time and data transfers. In this thesis, we designed and developed a multi-step Non-Blocking Asynchronous Framework (N-BAF) that enables a user to efficiently increase application performance on high-performance computing systems. The first step of N-BAF addresses the data movement and memory bandwidth problems through an analytical performance model that automatically extracts an execution flow graph to identify application communication hotspots. The next step uses the flow graph to optimize out-of-core I/O requests through prefetching, operation reordering, in-memory shuffling, and a mailbox abstraction.
We provide tools to decompose blocking communications into non-blocking operations and alleviate the long latency of irregular, data-movement-intensive applications. We also evaluated and addressed data movement problems with N-BAF methodologies on the novel Parallel Migratory Thread Architecture, the Coherent Accelerator Processor Interface, and Persistent Memory Allocators. To illustrate the performance improvement gained from this framework, we implemented three applications that show irregular behavior: (1) an N-Body simulation with the Barnes-Hut algorithm from the molecular dynamics domain, (2) the Navier-Stokes equations from the computational fluid dynamics domain, and (3) the Breadth-First Search algorithm. We obtained a 20-45% improvement compared to the base case.
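The general idea of hiding data-movement latency by overlapping it with computation can be illustrated with a minimal Python sketch. This is not the N-BAF implementation, which is not described in this record. The functions `load_chunk` and `compute` are hypothetical stand-ins for a slow transfer and for the work done on each block of data, and the single-worker thread pool is an assumed substitute for a real non-blocking I/O or communication layer.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def load_chunk(i, latency=0.01):
    """Hypothetical stand-in for a slow transfer (disk or network)."""
    time.sleep(latency)  # simulated data-movement latency
    return list(range(i * 4, i * 4 + 4))


def compute(chunk):
    """Hypothetical stand-in for the computation on one chunk."""
    return sum(x * x for x in chunk)


def blocking_pipeline(n_chunks):
    """Baseline: each load must finish before its computation starts."""
    return [compute(load_chunk(i)) for i in range(n_chunks)]


def overlapped_pipeline(n_chunks):
    """Prefetch chunk i+1 in the background while computing on chunk i."""
    results = []
    if n_chunks == 0:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_chunk, 0)  # start the first transfer
        for i in range(n_chunks):
            chunk = pending.result()  # wait only if the data is not ready yet
            if i + 1 < n_chunks:
                # Issue the next transfer before computing (non-blocking).
                pending = pool.submit(load_chunk, i + 1)
            results.append(compute(chunk))
    return results
```

Both functions return the same results. The overlapped version differs only in issuing the next load before it starts computing on the current chunk, so transfer time can be hidden behind computation when the two take comparable time.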