An Empirical Evaluation of Allgatherv on Multi-GPU Systems
Links to Files
Author/Creator
Author/Creator ORCID
Date
Type of Work
Department
Program
Citation of Original Publication
T. B. Rolinger, T. A. Simon and C. D. Krieger, "An Empirical Evaluation of Allgatherv on Multi-GPU Systems," 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), Washington, DC, USA, 2018, pp. 123-132, doi: 10.1109/CCGRID.2018.00027.
Rights
© 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Subjects
Abstract
Applications for deep learning and big data analytics have compute and memory requirements that exceed the limits of a single GPU. However, effectively scaling out an application to multiple GPUs is challenging due to the complexities of communication between the GPUs, particularly for collective communication with irregular message sizes. In this work, we provide a performance evaluation of the Allgatherv routine on multi-GPU systems, focusing on GPU network topology and the communication library used. We present results from the OSU-micro benchmark as well as conduct a case study for sparse tensor factorization, one application that uses Allgatherv with highly irregular message sizes. We extend our existing tensor factorization tool to run on systems with different node counts and varying number of GPUs per node. We then evaluate the communication performance of our tool when using traditional MPI, CUDA-aware MVAPICH and NCCL across a suite of real-world data sets on three different systems: a 16-node cluster with one GPU per node, NVIDIA's DGX-1 with 8 GPUs and Cray's CS-Storm with 16 GPUs. Our results show that irregularity in the tensor data sets produce trends that contradict those in the OSU micro-benchmark, as well as trends that are absent from the benchmark.
