Automated Runtime-Aware Scheduling for Multi-Tenant DNN Inference on GPU

dc.contributor.author: Yu, Fuxun
dc.contributor.author: Wang, Di
dc.contributor.author: Shangguan, Longfei
dc.contributor.author: Tang, Xulong
dc.contributor.author: Liu, Chenchen
dc.contributor.author: Chen, Xiang
dc.date.accessioned: 2022-01-11T15:40:10Z
dc.date.available: 2022-01-11T15:40:10Z
dc.date.issued: 2021-11-28
dc.description: Accepted in the 40th IEEE International Conference on Computer-Aided Design (ICCAD'21)
dc.description.abstract: With the fast development of deep neural networks (DNNs), many real-world applications adopt multiple models to conduct compound tasks, such as co-running classification, detection, and segmentation models on autonomous vehicles. Such multi-tenant DNN inference cases greatly exacerbate the computational complexity and call for comprehensive collaboration among graph-level operator scheduling, runtime-level resource awareness, and hardware scheduler support. However, current scheduling support for such multi-tenant inference remains relatively immature. In this work, we propose a resource-aware scheduling framework for efficient multi-tenant DNN inference on GPU, which automatically coordinates DNN computing at different execution levels. Leveraging the unified scheduling intermediate representation and the automated ML-based searching algorithm, optimal schedules can be generated to wisely adjust model concurrency and interleave DNN model operators, maintaining a continuously balanced resource utilization across the entire inference process and eventually improving runtime efficiency. Experiments show that we consistently achieve a 1.3× to 1.7× speed-up compared to regular DNN runtime libraries (e.g., cuDNN, TVM) and particular concurrent scheduling methods (e.g., NVIDIA Multi-Stream).
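The abstract's operator-interleaving idea can be illustrated with a toy sketch. This is not the paper's algorithm, only a minimal greedy illustration of interleaving two models' operator queues so that cumulative estimated utilization stays balanced; all operator names and cost figures below are invented for illustration.

```python
def interleave(model_a, model_b):
    """Merge two (name, estimated_utilization) operator lists, always
    advancing the model whose cumulative estimated load is currently lower.
    This mimics, in spirit, balancing GPU utilization across co-running
    models; the real framework searches schedules with an ML-based method."""
    schedule = []
    load_a = load_b = 0.0
    i = j = 0
    while i < len(model_a) or j < len(model_b):
        # Take from model A if B is exhausted, or A's running load is lower.
        take_a = j >= len(model_b) or (i < len(model_a) and load_a <= load_b)
        if take_a:
            name, cost = model_a[i]
            i += 1
            load_a += cost
        else:
            name, cost = model_b[j]
            j += 1
            load_b += cost
        schedule.append(name)
    return schedule

# Hypothetical operator lists for two co-running models.
resnet = [("conv1", 0.6), ("conv2", 0.5), ("fc", 0.1)]
ssd = [("conv_a", 0.3), ("conv_b", 0.3), ("nms", 0.05)]
print(interleave(resnet, ssd))
# → ['conv1', 'conv_a', 'conv_b', 'conv2', 'nms', 'fc']
```

Note how the heavy `conv1` is followed by two lighter operators from the second model before the scheduler returns to the first, keeping the two load counters close; the actual framework makes this decision over a unified scheduling IR with measured, not assumed, costs.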
dc.description.uri: https://arxiv.org/abs/2111.14255
dc.format.extent: 9 pages
dc.genre: conference papers and proceedings
dc.genre: preprints
dc.identifier: doi:10.13016/m2c1cv-iids
dc.identifier.uri: http://hdl.handle.net/11603/23954
dc.language.iso: en_US
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Computer Science and Electrical Engineering Department Collection
dc.relation.ispartof: UMBC Student Collection
dc.relation.ispartof: UMBC Faculty Collection
dc.rights: This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
dc.rights: Attribution 4.0 International (CC BY 4.0)
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.title: Automated Runtime-Aware Scheduling for Multi-Tenant DNN Inference on GPU
dc.type: Text
dcterms.creator: https://orcid.org/0000-0001-7749-0640

Files

- 2111.14255.pdf (1.27 MB, Adobe Portable Document Format)
- license.txt (2.56 KB, item-specific license agreed upon at submission)