Rethinking Latency-Aware DNN Design With GPU Tail Effect Analysis

dc.contributor.author: Yu, Fuxun
dc.contributor.author: Xu, Zirui
dc.contributor.author: Shangguan, Longfei
dc.contributor.author: Wang, Di
dc.contributor.author: Stamoulis, Dimitrios
dc.contributor.author: Madhok, Rishi
dc.contributor.author: Karianakis, Nikolaos
dc.contributor.author: Li, Ang
dc.contributor.author: Liu, ChenChen
dc.contributor.author: Chen, Yiran
dc.contributor.author: Chen, Xiang
dc.date.accessioned: 2024-07-12T14:57:26Z
dc.date.available: 2024-07-12T14:57:26Z
dc.date.issued: 2024-05-22
dc.description.abstract: As the size of Deep Neural Networks (DNNs) continues to grow, their runtime latency scales with it. While model pruning and Neural Architecture Search (NAS) can effectively reduce the computation workload, this reduction fails to consistently translate into lower runtime latency. In this paper, we identify that the root cause of the mismatch between workload reduction and latency reduction is the GPU tail effect – a classic system issue caused by resource under-utilization in the last processing wave of the GPU. We conduct a detailed DNN workload characterization, demonstrate the prevalence of the GPU tail effect across different DNN architectures, and reveal that the unique deep structure and lightweight per-layer workloads of DNNs exacerbate the tail effect during inference. We then propose a tail-aware design-space enhancement and DNN optimization algorithm that improves existing NAS and pruning designs, achieving better runtime latency and model accuracy. Extensive experiments show an 11%-27% latency reduction over state-of-the-art (SOTA) DNN pruning and NAS methods.
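The tail effect the abstract describes can be illustrated with simple wave arithmetic: a kernel's thread blocks execute in "waves" bounded by how many blocks the GPU can run concurrently, and a partially filled last wave leaves SMs idle. The sketch below uses hypothetical numbers (160 concurrent blocks, 168 launched) that are not from the paper, purely to show the quantization behavior.

```python
import math

def wave_stats(num_blocks: int, concurrent_blocks: int):
    """Illustrative tail-effect arithmetic (hypothetical numbers).

    A kernel's thread blocks run in waves of at most `concurrent_blocks`
    (roughly: SM count x resident blocks per SM). If the final wave is
    not full, the remaining SM capacity sits idle -- the tail effect.
    """
    waves = math.ceil(num_blocks / concurrent_blocks)
    last_wave = num_blocks - (waves - 1) * concurrent_blocks
    tail_utilization = last_wave / concurrent_blocks
    return waves, tail_utilization

# A layer launching 168 blocks on a GPU that fits 160 concurrently
# needs 2 full wave slots, but the second wave is only 5% occupied:
waves, util = wave_stats(168, 160)
print(waves, util)  # 2 0.05
```

This quantization is why a small workload reduction (e.g., pruning 8 of those 168 blocks) can eliminate an entire wave and cut latency sharply, while a larger reduction that still leaves a partial wave may not help at all.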
dc.description.uri: https://ieeexplore.ieee.org/abstract/document/10537049
dc.format.extent: 14 pages
dc.genre: journal articles
dc.genre: postprints
dc.identifier: doi:10.13016/m20hnm-cbw2
dc.identifier.citation: Yu, Fuxun, Zirui Xu, Longfei Shangguan, Di Wang, Dimitrios Stamoulis, Rishi Madhok, Nikolaos Karianakis, et al. "Rethinking Latency-Aware DNN Design With GPU Tail Effect Analysis." IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2024. https://doi.org/10.1109/TCAD.2024.3404413.
dc.identifier.uri: https://doi.org/10.1109/TCAD.2024.3404413
dc.identifier.uri: http://hdl.handle.net/11603/34887
dc.language.iso: en_US
dc.publisher: IEEE
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Faculty Collection
dc.relation.ispartof: UMBC Computer Science and Electrical Engineering Department
dc.rights: © 20XX IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
dc.subject: Artificial neural networks
dc.subject: Computational modeling
dc.subject: Graphics processing units
dc.subject: Hardware
dc.subject: Optimization
dc.subject: Runtime
dc.subject: Tail
dc.title: Rethinking Latency-Aware DNN Design With GPU Tail Effect Analysis
dc.type: Text
dcterms.creator: https://orcid.org/0000-0001-7749-0640

Files

Original bundle

Name: Rethinking_LatencyAware_DNN_Design_With_GPU_Tail_Effect_Analysis.pdf
Size: 12.89 MB
Format: Adobe Portable Document Format