Porting and Tuning Numerical Kernels in Real-World Applications to Many-Core Intel Xeon Phi Accelerators

Date

2016-01-01

Department

Mathematics and Statistics

Program

Mathematics, Applied

Rights

This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu
Distribution Rights granted to UMBC by the author.

Abstract

Modern architectures with multiple memory hierarchies in multi-core CPUs and coprocessors such as the massively parallel GPGPU and many-core Intel Xeon Phi offer opportunities to drastically speed up numerical kernels. Coprocessors, which supplement the work of the CPUs, generally have significantly more cores and threads than a multi-core CPU and use power more efficiently. The Intel Xeon Phi is a newer piece of hardware, released to the public only in 2013. Each Intel Phi has between 57 and 61 cores, each capable of running up to four threads. Each core is x86 compatible and is capable of running its own instruction stream, allowing programmers to use familiar CPU frameworks such as MPI and OpenMP. Three modes of execution are available on the Intel Phi: (i) offloading, where the program is run on the CPU and segments of the code are moved to the Intel Phi, similar to GPGPU programming, (ii) native, where the program is run directly on the Intel Phi, and (iii) symmetric, where the program is run on the CPU and Phi jointly. We report the performance of three test problems whose structure is representative of kernels of real-world application codes. The first problem is the classical elliptic test problem of a Poisson equation with homogeneous Dirichlet boundary conditions in two and three dimensions. The second problem is a model of calcium-induced calcium release in a heart cell. In this model, calcium activates calcium release from the sarcoplasmic reticulum in the cytosol, an essential part of the excitation-contraction coupling in the cardiac muscle. This process is modeled by a system of coupled, nonlinear, time-dependent advection-diffusion-reaction equations solved by a method of lines approach. The third problem is a model of pancreatic beta cells in a computational islet. Results are presented for a model without coupling between cells and a model with electrical coupling between cells.
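For orientation, the first test problem can be written out in its standard textbook form. The abstract does not state the domain, the right-hand side, or the discretization, so the symbols below (domain \Omega, source term f, uniform mesh width h, grid values u_{i,j}) and the choice of the second-order five-point finite difference stencil in two dimensions are assumptions made for illustration only.

```latex
% Poisson problem with homogeneous Dirichlet boundary conditions
\begin{align*}
  -\Delta u &= f && \text{in } \Omega, \\
          u &= 0 && \text{on } \partial\Omega.
\end{align*}

% Assumed five-point finite difference approximation on a uniform
% two-dimensional mesh with spacing h at interior point (i,j)
\[
  \frac{4\,u_{i,j} - u_{i-1,j} - u_{i+1,j} - u_{i,j-1} - u_{i,j+1}}{h^2}
  = f_{i,j}.
\]
```

A discretization of this kind leads to a large, sparse linear system in which every unknown couples only to its nearest neighbors, which is why the problem serves as a representative kernel.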
Code can easily be ported to the Intel Phi with the use of a compiler flag; however, real-world applications may require significant modifications to existing CPU code in order to perform well on the Intel Phi. Code with a high degree of parallelism is required to take advantage of the many cores of the Phi. Offload mode performs poorly for real-world problems due to the cost of communication between the CPU and Phi and the restriction to using only OpenMP on the Phi. For good performance, a combination of MPI and OpenMP is required to take advantage of the complex memory hierarchy of the Intel Phi in native mode. The use of manual loop unrolling to fully utilize the vector registers may significantly improve performance on the Intel Phi. Symmetric mode requires MPI for communication between Phis and multi-core CPUs as well as OpenMP for good performance on the Phis. Using all available resources of the hybrid node in symmetric mode results in the best performance. Studies on multiple hybrid nodes connected by a high-performance interconnect exhibit excellent strong and weak scalability.
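As an illustration of what manual loop unrolling combined with OpenMP can look like, the following minimal C sketch computes a dot product, a typical building block of iterative solvers. It is not taken from the thesis: the function name dot_unrolled8 is hypothetical, and the unrolling factor of 8 is an assumption chosen because eight double-precision values fill one 512-bit vector register. The abstract does not say which loops were unrolled or by how much.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the thesis): dot product of x and y,
 * with the main loop manually unrolled by a factor of 8 and the
 * iterations shared among OpenMP threads. Without OpenMP support the
 * pragma is ignored and the loop runs serially. */
double dot_unrolled8(const double *x, const double *y, long n)
{
    assert(n >= 0);

    /* Eight independent partial sums, one per unrolled statement. */
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    double s4 = 0.0, s5 = 0.0, s6 = 0.0, s7 = 0.0;

    /* Largest multiple of 8 that does not exceed n. */
    const long nblock = n - n % 8;
    long i;

    #pragma omp parallel for reduction(+:s0,s1,s2,s3,s4,s5,s6,s7)
    for (i = 0; i < nblock; i += 8) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
        s4 += x[i + 4] * y[i + 4];
        s5 += x[i + 5] * y[i + 5];
        s6 += x[i + 6] * y[i + 6];
        s7 += x[i + 7] * y[i + 7];
    }

    /* Combine the partial sums. */
    double sum = ((s0 + s1) + (s2 + s3)) + ((s4 + s5) + (s6 + s7));

    /* Remainder loop for the last n % 8 entries. */
    for (i = nblock; i < n; i++) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

In a hybrid MPI and OpenMP code, each MPI process would typically call such a routine on its local portion of the vectors and then combine the local results with a collective such as MPI_Allreduce. Whether unrolling by hand actually beats the compiler's own vectorization depends on the compiler and the kernel, so it should be confirmed by timing.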