Rethinking Latency-Aware DNN Design With GPU Tail Effect Analysis

dc.contributor.author: Yu, Fuxun
dc.contributor.author: Xu, Zirui
dc.contributor.author: Shangguan, Longfei
dc.contributor.author: Wang, Di
dc.contributor.author: Stamoulis, Dimitrios
dc.contributor.author: Madhok, Rishi
dc.contributor.author: Karianakis, Nikolaos
dc.contributor.author: Li, Ang
dc.contributor.author: Liu, ChenChen
dc.contributor.author: Chen, Yiran
dc.contributor.author: Chen, Xiang
dc.date.accessioned: 2024-07-12T14:57:26Z
dc.date.available: 2024-07-12T14:57:26Z
dc.date.issued: 2024-05-22
dc.description.abstract: As the size of Deep Neural Networks (DNNs) continues to grow, their runtime latency scales with it. While model pruning and Neural Architecture Search (NAS) can effectively reduce the computation workload, this reduction fails to consistently translate into lower runtime latency. In this paper, we identify that the root cause of the mismatch between workload reduction and latency reduction is the GPU tail effect – a classic system issue caused by resource under-utilization in the last processing wave of the GPU. We conduct a detailed DNN workload characterization, demonstrate the prevalence of the GPU tail effect across different DNN architectures, and reveal that the unique deep structure and lightweight per-layer workloads of DNNs exacerbate the tail effect during inference. We then propose a tail-aware design-space enhancement and DNN optimization algorithm that improves existing NAS and pruning designs, achieving better runtime latency and model accuracy. Extensive experiments show an 11%-27% latency reduction over state-of-the-art (SOTA) DNN pruning and NAS methods.
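The tail effect the abstract describes can be illustrated with simple wave arithmetic: a kernel's thread blocks execute in "waves" bounded by how many blocks the GPU can run concurrently, and a partially filled last wave leaves SMs idle. The sketch below uses hypothetical numbers (160 concurrent blocks, 168 launched) that are not from the paper, purely to show the quantization behavior.

```python
import math

def wave_stats(num_blocks: int, concurrent_blocks: int):
    """Illustrative tail-effect arithmetic (hypothetical numbers).

    A kernel's thread blocks run in waves of at most `concurrent_blocks`
    (roughly: SM count x resident blocks per SM). If the final wave is
    not full, the remaining SM capacity sits idle -- the tail effect.
    """
    waves = math.ceil(num_blocks / concurrent_blocks)
    last_wave = num_blocks - (waves - 1) * concurrent_blocks
    tail_utilization = last_wave / concurrent_blocks
    return waves, tail_utilization

# A layer launching 168 blocks on a GPU that fits 160 concurrently
# needs 2 full wave slots, but the second wave is only 5% occupied:
waves, util = wave_stats(168, 160)
print(waves, util)  # 2 0.05
```

This quantization is why a small workload reduction (e.g., pruning 8 of those 168 blocks) can eliminate an entire wave and cut latency sharply, while a larger reduction that still leaves a partial wave may not help at all.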
dc.description.uri: https://ieeexplore.ieee.org/abstract/document/10537049
dc.format.extent: 14 pages
dc.genre: journal articles
dc.genre: postprints
dc.identifier: doi:10.13016/m20hnm-cbw2
dc.identifier.citation: Yu, Fuxun, Zirui Xu, Longfei Shangguan, Di Wang, Dimitrios Stamoulis, Rishi Madhok, Nikolaos Karianakis, et al. "Rethinking Latency-Aware DNN Design With GPU Tail Effect Analysis." IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2024. https://doi.org/10.1109/TCAD.2024.3404413.
dc.identifier.uri: https://doi.org/10.1109/TCAD.2024.3404413
dc.identifier.uri: http://hdl.handle.net/11603/34887
dc.language.iso: en_US
dc.publisher: IEEE
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Faculty Collection
dc.relation.ispartof: UMBC Computer Science and Electrical Engineering Department
dc.rights: © 20XX IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
dc.subject: Artificial neural networks
dc.subject: Computational modeling
dc.subject: Graphics processing units
dc.subject: Hardware
dc.subject: Optimization
dc.subject: Runtime
dc.subject: Tail
dc.title: Rethinking Latency-Aware DNN Design With GPU Tail Effect Analysis
dc.type: Text
dcterms.creator: https://orcid.org/0000-0001-7749-0640

Files

Original bundle

Name: Rethinking_LatencyAware_DNN_Design_With_GPU_Tail_Effect_Analysis.pdf
Size: 12.89 MB
Format: Adobe Portable Document Format