Automated Runtime-Aware Scheduling for Multi-Tenant DNN Inference on GPU

dc.contributor.author: Yu, Fuxun
dc.contributor.author: Wang, Di
dc.contributor.author: Shangguan, Longfei
dc.contributor.author: Tang, Xulong
dc.contributor.author: Liu, Chenchen
dc.contributor.author: Chen, Xiang
dc.date.accessioned: 2022-01-11T15:40:10Z
dc.date.available: 2022-01-11T15:40:10Z
dc.date.issued: 2021-11-28
dc.description: Accepted in the 40th IEEE International Conference on Computer-Aided Design (ICCAD'21)
dc.description.abstract: With the fast development of deep neural networks (DNNs), many real-world applications adopt multiple models to conduct compound tasks, such as co-running classification, detection, and segmentation models on autonomous vehicles. Such multi-tenant DNN inference cases greatly exacerbate the computational complexity and call for comprehensive collaboration among graph-level operator scheduling, runtime-level resource awareness, and hardware scheduler support. However, current scheduling support for such multi-tenant inference remains relatively immature. In this work, we propose a resource-aware scheduling framework for efficient multi-tenant DNN inference on GPU, which automatically coordinates DNN computing at different execution levels. Leveraging the unified scheduling intermediate representation and the automated ML-based searching algorithm, optimal schedules can be generated to wisely adjust model concurrency and interleave DNN model operators, maintaining a continuously balanced resource utilization across the entire inference process and eventually improving runtime efficiency. Experiments show that we consistently achieve a 1.3× to 1.7× speed-up compared to regular DNN runtime libraries (e.g., cuDNN, TVM) and particular concurrent scheduling methods (e.g., NVIDIA Multi-Stream).
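The abstract's operator-interleaving idea can be illustrated with a toy sketch. This is not the paper's algorithm, only a minimal greedy illustration of interleaving two models' operator queues so that cumulative estimated utilization stays balanced; all operator names and cost figures below are invented for illustration.

```python
def interleave(model_a, model_b):
    """Merge two (name, estimated_utilization) operator lists, always
    advancing the model whose cumulative estimated load is currently lower.
    This mimics, in spirit, balancing GPU utilization across co-running
    models; the real framework searches schedules with an ML-based method."""
    schedule = []
    load_a = load_b = 0.0
    i = j = 0
    while i < len(model_a) or j < len(model_b):
        # Take from model A if B is exhausted, or A's running load is lower.
        take_a = j >= len(model_b) or (i < len(model_a) and load_a <= load_b)
        if take_a:
            name, cost = model_a[i]
            i += 1
            load_a += cost
        else:
            name, cost = model_b[j]
            j += 1
            load_b += cost
        schedule.append(name)
    return schedule

# Hypothetical operator lists for two co-running models.
resnet = [("conv1", 0.6), ("conv2", 0.5), ("fc", 0.1)]
ssd = [("conv_a", 0.3), ("conv_b", 0.3), ("nms", 0.05)]
print(interleave(resnet, ssd))
# → ['conv1', 'conv_a', 'conv_b', 'conv2', 'nms', 'fc']
```

Note how the heavy `conv1` is followed by two lighter operators from the second model before the scheduler returns to the first, keeping the two load counters close; the actual framework makes this decision over a unified scheduling IR with measured, not assumed, costs.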
dc.description.uri: https://arxiv.org/abs/2111.14255
dc.format.extent: 9 pages
dc.genre: conference papers and proceedings
dc.genre: preprints
dc.identifier: doi:10.13016/m2c1cv-iids
dc.identifier.uri: http://hdl.handle.net/11603/23954
dc.language.iso: en_US
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Computer Science and Electrical Engineering Department Collection
dc.relation.ispartof: UMBC Student Collection
dc.relation.ispartof: UMBC Faculty Collection
dc.rights: This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
dc.rights: Attribution 4.0 International (CC BY 4.0)
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.title: Automated Runtime-Aware Scheduling for Multi-Tenant DNN Inference on GPU
dc.type: Text
dcterms.creator: https://orcid.org/0000-0001-7749-0640

Files

- 2111.14255.pdf (1.27 MB, Adobe Portable Document Format)
- license.txt (2.56 KB, item-specific license agreed upon at submission)