Parallel Performance Studies for an Elliptic Test Problem on the Cluster maya

Date

2014

Rights

This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.

Abstract

The UMBC High Performance Computing Facility (HPCF) is the community-based, interdisciplinary core facility for scientific computing and research on parallel algorithms at UMBC. Released in Summer 2014, the current machine in HPCF is the 240-node distributed-memory cluster maya. The cluster comprises three uniform portions: one consisting of 72 nodes based on 2.6 GHz Intel E5-2650v2 Ivy Bridge CPUs from 2013, another consisting of 84 nodes based on 2.8 GHz Intel Nehalem X5560 CPUs from 2010, and another consisting of 84 nodes based on 2.6 GHz Intel Nehalem X5550 CPUs from 2009. All nodes are connected via InfiniBand to a central storage of more than 750 TB. The performance of parallel computer code depends on an intricate interplay of the processors, the architecture of the compute nodes, their interconnect network, the numerical algorithm, and its implementation. The solution of large, sparse, highly structured systems of linear equations by an iterative linear solver that requires communication between the parallel processes at every iteration is an instructive and classical test case of this interplay. This note considers the classical elliptic test problem of a Poisson equation with homogeneous Dirichlet boundary conditions in two spatial dimensions, whose approximation by the finite difference method results in a linear system of this type. Our existing implementation of the conjugate gradient method for the iterative solution of this system is known to have the potential to perform well up to many parallel processes, provided the interconnect network has low latency. Since the algorithm is known to be memory-bound, it is also vital for good performance that the architecture of the nodes does not create a bottleneck. We report parallel performance studies on each of the three uniform portions of the cluster maya. The results show very good performance up to 64 compute nodes on all portions and support several key conclusions: (i) The newer nodes are faster per core as well as per node; however, for most serial production code, one of the 2010 nodes with 2.8 GHz CPUs is a good default. (ii) The high-performance interconnect supports optimal parallel scalability up to at least 64 nodes. (iii) It is often fastest to use all cores of modern multi-core nodes, but it is useful to track memory usage to determine whether this is the case for memory-bound code. (iv) There is no disadvantage to several jobs sharing a node, which justifies the default scheduling setup.
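To make the setting concrete, the following is a minimal serial sketch of the conjugate gradient (CG) method applied matrix-free to the five-point finite difference discretization of the Poisson equation -Δu = f with homogeneous Dirichlet boundary conditions, assumed here to be posed on the unit square. It is an illustration under stated assumptions, not the report's actual implementation: the mesh size N and the right-hand side f are hypothetical choices, and the comments mark where a distributed-memory version would need the inter-process communication (dot products and nearest-neighbor exchanges) mentioned in the abstract.

```c
/* Minimal serial sketch (illustrative only): matrix-free conjugate
 * gradient for the five-point finite difference discretization of
 * -Laplace(u) = f on the unit square, u = 0 on the boundary. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define N 64                   /* interior mesh points per dimension (hypothetical) */
#define IDX(i, j) ((i) * N + (j))

/* y = A x for the 2D five-point Laplacian, scaled by 1/h^2; off-mesh
 * neighbors are zero by the homogeneous Dirichlet boundary condition.
 * In an MPI version this is where nearest-neighbor halo exchanges occur. */
static void apply_A(const double *x, double *y)
{
    double h2 = 1.0 / ((N + 1.0) * (N + 1.0));  /* h^2 with h = 1/(N+1) */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double v = 4.0 * x[IDX(i, j)];
            if (i > 0)     v -= x[IDX(i - 1, j)];
            if (i < N - 1) v -= x[IDX(i + 1, j)];
            if (j > 0)     v -= x[IDX(i, j - 1)];
            if (j < N - 1) v -= x[IDX(i, j + 1)];
            y[IDX(i, j)] = v / h2;
        }
}

/* In an MPI version this local sum would be followed by an MPI_Allreduce,
 * the global communication CG requires at every iteration. */
static double dot(const double *a, const double *b, int n)
{
    double s = 0.0;
    for (int k = 0; k < n; k++) s += a[k] * b[k];
    return s;
}

int main(void)
{
    int n = N * N;
    double h = 1.0 / (N + 1);
    double *u = calloc(n, sizeof *u);   /* initial guess u = 0 */
    double *r = malloc(n * sizeof *r);  /* residual */
    double *p = malloc(n * sizeof *p);  /* search direction */
    double *q = malloc(n * sizeof *q);  /* q = A p */

    /* Illustrative right-hand side f = 2 pi^2 sin(pi x) sin(pi y),
     * whose exact solution is sin(pi x) sin(pi y). */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            r[IDX(i, j)] = 2.0 * M_PI * M_PI
                         * sin(M_PI * (i + 1) * h) * sin(M_PI * (j + 1) * h);

    for (int k = 0; k < n; k++) p[k] = r[k];  /* with u = 0, r = f - A u = f */
    double rho = dot(r, r, n);
    double rho0 = rho;
    double tol2 = 1e-12 * rho0;  /* stop when ||r|| / ||r0|| < 1e-6 */

    for (int it = 0; it < n && rho > tol2; it++) {
        apply_A(p, q);
        double alpha = rho / dot(p, q, n);
        for (int k = 0; k < n; k++) { u[k] += alpha * p[k]; r[k] -= alpha * q[k]; }
        double rho_new = dot(r, r, n);
        for (int k = 0; k < n; k++) p[k] = r[k] + (rho_new / rho) * p[k];
        rho = rho_new;
    }

    printf("relative residual norm %.3e\n", sqrt(rho / rho0));
    free(u); free(r); free(p); free(q);
    return 0;
}
```

Each CG iteration streams several length-N² vectors through memory while performing only a handful of floating-point operations per entry, which is one way to see why the algorithm is memory-bound and why per-node memory bandwidth, not peak flop rate, tends to limit its performance.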