Browsing by Subject "Parallel Performance"
Now showing 1 - 9 of 9
Item: A Comparative Study of the Parallel Performance of the Blocking and Non-Blocking MPI Communication Commands on an Elliptic Test Problem on the Cluster tara (2016)
Tari, Hafez; Gobbert, Matthias K.
In this report we study the parallel solution of the elliptic test problem of a Poisson equation with homogeneous Dirichlet boundary conditions in a two-dimensional domain. We use the finite difference method to approximate the governing equation by a system of N^2 linear equations, with N the number of interior grid points in either spatial direction. To parallelize the computation, we distribute blocks of the rows of the interior mesh point values among the parallel processes. We then use the conjugate gradient method with a so-called matrix-free implementation to solve the system of linear equations iteratively, with each process operating on its local block of the data. The conjugate gradient method starts from zero vectors as the initial solution and updates the iterates until the Euclidean norm of the global residual, relative to that of the initial residual, falls below a predefined tolerance. Because the conjugate gradient method requires communication between neighboring processes, i.e., the processes holding the data at the grid interfaces, two modes of MPI communication, namely blocking and non-blocking send and receive, are employed for this data exchange. The results show excellent performance on the cluster tara with up to 512 parallel processes when using 64 compute nodes, especially when non-blocking MPI commands are used. The cluster tara is an IBM Server x iDataPlex purchased in 2009 by the UMBC High Performance Computing Facility (www.umbc.edu/hpcf). It is an 86-node distributed-memory cluster comprised of 82 compute, 2 develop, 1 user, and 1 management nodes. Each node features two quad-core Intel Nehalem X5550 processors (2.66 GHz, 8 MB cache), 24 GB memory, and a 120 GB local hard drive. All nodes and the 160 TB central storage are connected by an InfiniBand (QDR) interconnect network.
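The blocking versus non-blocking comparison above comes down to how the interface rows are swapped with the upper and lower neighbor processes at each iteration. The following is a minimal sketch, not the authors' code; the buffer and neighbor names (send_up, ghost_dn, up, dn) are illustrative only, and up or dn may be MPI_PROC_NULL at the domain boundary.

    #include <mpi.h>

    /* Blocking exchange: MPI_Sendrecv is the deadlock-safe blocking form;
       each call must complete before the solver can continue. */
    void exchange_blocking(double *send_up, double *send_dn,
                           double *ghost_up, double *ghost_dn,
                           int n, int up, int dn, MPI_Comm comm)
    {
        MPI_Sendrecv(send_up, n, MPI_DOUBLE, up, 0,
                     ghost_dn, n, MPI_DOUBLE, dn, 0, comm, MPI_STATUS_IGNORE);
        MPI_Sendrecv(send_dn, n, MPI_DOUBLE, dn, 1,
                     ghost_up, n, MPI_DOUBLE, up, 1, comm, MPI_STATUS_IGNORE);
    }

    /* Non-blocking exchange: post all receives and sends, then wait,
       so the transfers in both directions can proceed concurrently. */
    void exchange_nonblocking(double *send_up, double *send_dn,
                              double *ghost_up, double *ghost_dn,
                              int n, int up, int dn, MPI_Comm comm)
    {
        MPI_Request req[4];
        MPI_Irecv(ghost_dn, n, MPI_DOUBLE, dn, 0, comm, &req[0]);
        MPI_Irecv(ghost_up, n, MPI_DOUBLE, up, 1, comm, &req[1]);
        MPI_Isend(send_up,  n, MPI_DOUBLE, up, 0, comm, &req[2]);
        MPI_Isend(send_dn,  n, MPI_DOUBLE, dn, 1, comm, &req[3]);
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    }

In the non-blocking variant, any local computation that does not need the ghost rows can be placed before MPI_Waitall, which is one reason the non-blocking commands tend to perform at least as well in the report's results.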
Item: Comparison of Parallel Performance between MVAPICH2 and OpenMPI Applied to a Hyperbolic Test Problem
Reid, Michael J.
During the manufacture of integrated circuits, the process of atomic layer deposition (ALD) is used to deposit a uniform seed layer of solid material atop the surface of a silicon wafer. The process can be modeled on the molecular level by a system of transient, linear integro-partial differential Boltzmann equations, coupled with a non-linear surface reaction model, together called the kinetic transport and reaction model (KTRM). Each Boltzmann equation can be approximated by discretizing the velocity space, which yields a system of transient hyperbolic conservation laws that involve only the position vector and time as independent variables. The system can then be solved with DG, a computer implementation of the discontinuous Galerkin method. Due to the large size of the systems being solved and the large number of time steps required, it is necessary to use parallel computing to obtain a solution in a reasonable amount of time. We analyze the performance of the DG code on multiple mesh resolutions by measuring its speedup and efficiency on UMBC's new distributed-memory cluster, hpc (www.umbc.edu/hpcf). We also compare the performance of DG when it is compiled using the MVAPICH2 and OpenMPI implementations of MPI, the most prevalent parallel communication library today. Testing on a variety of mesh sizes shows that the MVAPICH2 implementation runs as fast as or faster than OpenMPI in all cases. This senior thesis is part of undergraduate research conducted under the direction of Dr. Matthias K. Gobbert.

Item: Parallel Performance Studies for a Clustering Algorithm (2008)
Blasberg, Robin V.; Gobbert, Matthias K.
Affinity propagation is a clustering algorithm that functions by identifying similar data points in an iterative process. Its structure allows for taking full advantage of parallel computing by enabling the solution of larger problems and by solving them faster than is possible in serial. We show that our memory-optimal implementation with a minimal number of communication commands per iteration performs excellently on the distributed-memory cluster hpc and that it is efficient to use all 128 processor cores currently available.

Item: Parallel Performance Studies for a Hyperbolic Test Problem (2008)
Reid, Michael J.; Gobbert, Matthias K.
The performance of parallel computer code depends on an intricate interplay of the processors, the architecture of the compute nodes, their interconnect network, the numerical algorithm, and the scheduling policy used. This note considers a case study of a solver for a system of transient hyperbolic conservation laws which utilizes both point-to-point and collective communications between parallel processes at each time step. The solver is already known to scale well to many parallel processes on distributed-memory clusters with a high-performance interconnect network. The results presented here show excellent overall performance of the new cluster hpc with InfiniBand interconnect and confirm that it is beneficial to use the maximum number of cores possible on every node, allowing a total of 128 parallel processes on the 32 compute nodes.

Item: Parallel Performance Studies for a Maximum Likelihood Estimation Problem Using TAO (2009)
Raim, Andrew M.; Gobbert, Matthias K.
In this report, we present an application of parallel computing to an estimation procedure in statistics. The method of maximum likelihood estimation (MLE) is based on the ability to perform maximizations of probability functions. In practice, this work is often performed by computer with numerical methods and may be time consuming for some likelihood functions. We consider one such likelihood function based on the Finite Mixture Multinomial distribution. We perform estimation for this problem in parallel using the Toolkit for Advanced Optimization (TAO) software library. The computations are performed on a distributed-memory cluster with InfiniBand interconnect in the High Performance Computing Facility at the University of Maryland, Baltimore County (UMBC). We study how the resource requirements change as problem sizes vary, and demonstrate that scaling the number of processes for larger problems decreases wall clock time significantly.
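Several of these studies report speedup and efficiency as their central metrics: with T_p the observed wall clock time on p processes, speedup is S_p = T_1 / T_p and efficiency is E_p = S_p / p, with S_p = p and E_p = 1 as the optimal values. The helper below is an illustrative sketch only; the timings in it are hypothetical placeholders, not results from any of the reports.

    #include <stdio.h>

    /* Print speedup S_p = T_1/T_p and efficiency E_p = S_p/p from observed
       wall clock times; times[i] is the time on procs[i] processes, and
       procs[0] == 1 is the serial baseline. */
    void print_speedup_table(const int *procs, const double *times, int m)
    {
        printf("%8s %12s %10s %10s\n", "p", "T_p (s)", "S_p", "E_p");
        for (int i = 0; i < m; i++) {
            double sp = times[0] / times[i];
            double ep = sp / procs[i];
            printf("%8d %12.2f %10.2f %10.2f\n", procs[i], times[i], sp, ep);
        }
    }

    int main(void)
    {
        /* Hypothetical timings, for illustration only. */
        int    procs[] = { 1, 2, 4, 8, 16 };
        double times[] = { 160.0, 82.0, 42.0, 22.0, 12.0 };
        print_speedup_table(procs, times, 5);
        return 0;
    }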
Item: Parallel Performance Studies for a Parabolic Test Problem (2008)
Muscedere, Michael; Gobbert, Matthias K.
The performance of a parallel computer depends on an intricate interplay of the processors, the architecture of the compute nodes, their interconnect network, the numerical algorithm, and the scheduling policy used. This note considers a parabolic test problem given by a time-dependent linear reaction-diffusion equation in three space dimensions, whose spatial discretization results in a large system of ordinary differential equations. These are integrated in time by the family of numerical differentiation formulas, which requires the solution of a system of linear equations at every time step. The results presented here show excellent performance on the cluster hpc in the UMBC High Performance Computing Facility and confirm that it is beneficial to use all four cores of the two dual-core processors on each node simultaneously, giving us in effect a computer that can run jobs efficiently with up to 128 parallel processes.

Item: Parallel Performance Studies for a Parabolic Test Problem on the Cluster tara (2010)
Muscedere, Michael; Raim, Andrew M.; Gobbert, Matthias K.
The performance of parallel computer code depends on the intricate interplay of the processors, the architecture of the compute nodes, their interconnect network, the numerical algorithm, and its implementation. The solution of large, sparse, highly structured systems of linear equations by an iterative linear solver that requires communication between the parallel processes at every iteration is an instructive test of this interplay. This note considers a parabolic test problem given by a time-dependent, scalar, linear reaction-diffusion equation in three dimensions, whose time stepping requires the solution of such a system of linear equations at every time step. The results presented here show excellent performance on the cluster tara with up to 512 parallel processes when using 64 compute nodes. The results support the scheduling policy implemented, since they confirm that it is beneficial to use all eight cores of the two quad-core processors on each node simultaneously, giving us in effect a computer that can run jobs efficiently with up to 656 parallel processes when using all 82 compute nodes. The cluster tara is an IBM Server x iDataPlex purchased in 2009 by the UMBC High Performance Computing Facility (www.umbc.edu/hpcf). It is an 86-node distributed-memory cluster comprised of 82 compute, 2 develop, 1 user, and 1 management nodes. Each node features two quad-core Intel Nehalem X5550 processors (2.66 GHz, 8 MB cache), 24 GB memory, and a 120 GB local hard drive. All nodes and the 160 TB central storage are connected by an InfiniBand (QDR) interconnect network.
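Both parabolic studies above rest on the same structure: an implicit time integrator turns every time step into one large, sparse linear solve. In the simplest implicit setting (backward Euler, shown here only for illustration; the reports use the family of numerical differentiation formulas), the spatially discretized system u'(t) = A u(t) + r(t) leads to

    (I - \Delta t \, A) \, u^{n+1} = u^{n} + \Delta t \, r(t^{n+1}),

one such system per time step, so the per-iteration communication cost of the iterative linear solver largely determines the observed parallel performance.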
Item: Parallel Performance Studies for an Elliptic Test Problem (2008)
Gobbert, Matthias K.
The performance of parallel computer code depends on an intricate interplay of the processors, the architecture of the compute nodes, their interconnect network, the numerical algorithm, and the scheduling policy used. The solution of large, sparse, highly structured systems of linear equations by an iterative linear solver that requires communication between the parallel processes at every iteration is an instructive test of this interplay. This note considers the classical elliptic test problem of a Poisson equation with Dirichlet boundary conditions, whose approximation by the finite difference method results in a linear system of this type. Our existing implementation of the conjugate gradient method for the iterative solution of this system is known to have the potential to perform well up to many parallel processes, provided the interconnect network has low latency. Since the algorithm is known to be memory bound, it is also vital for good performance that the architecture of the nodes, in conjunction with the scheduling policy, does not create a bottleneck. The results presented here show excellent performance on the cluster hpc in the UMBC High Performance Computing Facility and give guidance on the scheduling policy to be implemented. Specifically, they confirm that it is beneficial to use all four cores of the two dual-core processors on each node simultaneously, giving us in effect a computer that can run jobs efficiently with up to 128 parallel processes.

Item: Parallel Performance Studies for an Elliptic Test Problem on the Stampede2 Cluster and Comparison of Networks (2018)
Arora, Kritesh; Barajas, Carlos; Gobbert, Matthias K.
We study the parallel performance of dual-socket compute nodes with Intel Xeon Platinum 8160 Skylake CPUs with 24 cores and 192 GB of memory, connected by a 100 Gbps Intel Omni-Path (OPA) interconnect. The experiments use the classical test problem of a Poisson equation in two spatial dimensions, discretized by the finite difference method to give a very large and sparse system of linear equations that is solved by the conjugate gradient method. The tests are performed on the Skylake nodes of Stampede2 in the Texas Advanced Computing Center (TACC) at The University of Texas at Austin. This national supercomputer is funded by the National Science Foundation (NSF) and can be accessed through the XSEDE program. We also compare the performance of the test code using different inter-node networks, Omni-Path (OPA), InfiniBand (IB), and Ethernet, on test clusters graciously provided to us by Dell. The results demonstrate excellent scalability when using more nodes due to the low latency of the high-performance interconnect and good speedup when using all cores of the multi-core CPUs. Comparison to past results brings out that core-per-core performance improvements have stalled, but that node-per-node performance continues to improve due to the larger number of cores available on a node.
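The elliptic studies in this listing, including the 2016 tara report at the top, solve the finite difference discretization of the Poisson equation with a matrix-free conjugate gradient method: the system matrix is never stored, and only its action on a vector is applied stencil by stencil. The following is a minimal serial sketch of such a matrix-free matrix-vector product for the two-dimensional five-point stencil; it is illustrative only, since the reports' codes distribute the rows of the mesh across processes and exchange the interface rows as sketched earlier.

    /* Matrix-free product v = A*u for the five-point stencil of the negative
       Laplacian on an N-by-N interior grid with mesh spacing h = 1/(N+1);
       u and v are stored row by row as u[i*N + j]. */
    void poisson2d_matvec(int N, const double *u, double *v)
    {
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                double s = 4.0 * u[i * N + j];
                if (i > 0)     s -= u[(i - 1) * N + j];
                if (i < N - 1) s -= u[(i + 1) * N + j];
                if (j > 0)     s -= u[i * N + j - 1];
                if (j < N - 1) s -= u[i * N + j + 1];
                v[i * N + j] = s;  /* scale by 1/h^2 if A is not pre-scaled */
            }
        }
    }

Because this kernel performs only a few floating-point operations per memory access, the overall algorithm is memory bound, which is exactly the bottleneck that the 2008 elliptic study cautions the node architecture and scheduling policy not to aggravate.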