Automated Runtime-Aware Scheduling for Multi-Tenant DNN Inference on GPU
Date
2021-11-28
Rights
This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
Attribution 4.0 International (CC BY 4.0)
Abstract
With the fast development of deep neural networks (DNNs), many real-world applications are adopting multiple models to conduct compound tasks, such as co-running classification, detection, and segmentation models on autonomous vehicles. Such multi-tenant DNN inference cases greatly exacerbate the computational complexity and call for comprehensive collaboration among graph-level operator scheduling, runtime-level resource awareness, and hardware scheduler support. However, current scheduling support for such multi-tenant inference remains limited. In this work, we propose a resource-aware scheduling framework for efficient multi-tenant DNN inference on GPU, which automatically coordinates DNN computation across these execution levels. Leveraging a unified scheduling intermediate representation and an automated ML-based search algorithm, the framework generates optimized schedules that adjust model concurrency and interleave DNN operators, maintaining balanced resource utilization across the entire inference process and ultimately improving runtime efficiency. Experiments show that we consistently achieve a 1.3× to 1.7× speed-up compared to regular DNN runtime libraries (e.g., cuDNN, TVM) and dedicated concurrent scheduling methods (e.g., NVIDIA Multi-Stream).
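To make the multi-tenant setting concrete, below is a minimal sketch (not the paper's framework) of the NVIDIA multi-stream baseline the abstract compares against: two tenant models are issued on separate CUDA streams so their kernels can overlap on one GPU. The model choices, batch size, and input shapes are illustrative assumptions.

```python
# Hypothetical multi-stream baseline: two DNN tenants sharing one GPU.
import torch
import torchvision.models as models

device = torch.device("cuda")
# Two example tenants (assumed models, standing in for e.g. a
# classification and a detection backbone on an autonomous vehicle).
classifier = models.resnet50(weights=None).eval().to(device)
detector_backbone = models.mobilenet_v2(weights=None).eval().to(device)

x1 = torch.randn(8, 3, 224, 224, device=device)
x2 = torch.randn(8, 3, 224, 224, device=device)

# One CUDA stream per tenant, so the hardware scheduler may overlap work.
s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
with torch.no_grad():
    with torch.cuda.stream(s1):
        out1 = classifier(x1)          # tenant 1
    with torch.cuda.stream(s2):
        out2 = detector_backbone(x2)   # tenant 2
torch.cuda.synchronize()  # wait for both tenants before using results
```

This baseline leaves concurrency decisions entirely to the hardware scheduler; the framework described in the abstract instead searches over operator interleavings and concurrency levels to keep resource utilization balanced throughout inference.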