Deploying Deep Neural Networks in Embedded Real-Time Systems

Author/Creator

Author/Creator ORCID

Date

2016-01-01

Department

Computer Science and Electrical Engineering

Program

Engineering, Computer

Citation of Original Publication

Rights

This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu
Distribution Rights granted to UMBC by the author.

Abstract

Deep neural networks have been shown to outperform prior state-of-the-art solutions that rely heavily on hand-engineered features coupled with simple classification techniques. In addition to achieving improvements of several orders of magnitude, they offer a number of additional benefits, such as the ability to perform end-to-end learning by combining hierarchical feature abstraction with inference. Furthermore, their success continues to be demonstrated in a growing number of fields and for a wide range of applications, including computer vision, speech recognition, and model forecasting. As this area of machine learning matures, a major remaining challenge is the ability to efficiently deploy such deep networks in embedded, resource-bound settings that have strict power and area budgets. While GPUs have been shown to improve throughput and energy efficiency over traditional computing paradigms, they still impose a significant power burden in such low-power embedded settings. To further reduce power while still achieving the desired throughput and accuracy, classification-efficient networks are required in addition to optimal deployment onto embedded hardware. In this work, we address both of these objectives.

For the first objective, we analyze simple, biologically inspired reduction strategies that are applied both before and after training. The central theme of these techniques is the introduction of sparsification to help dissolve the dense connectivity often found at different levels in neural networks. The sparsification techniques include feature compression partition, structured filter pruning, and dynamic feature pruning. Additionally, we explore filter factorization and filter quantization approximation techniques to further reduce the complexity of convolutional layers.

For the second contribution, we propose scalable, FPGA-based accelerators that enable deployment of networks in such resource-bound settings by exploiting both the efficient forms of parallelism inherent in convolutional layers and the sparsification and approximation techniques proposed here. In particular, we developed SPARCNet, a hardware accelerator for the efficient deployment of SPARse Convolutional NETworks. Utilizing the reduction techniques, we demonstrate the ability to reduce computation and memory by up to 60% and 93%, respectively, with less than 1% impact on accuracy when evaluated on several public datasets, including the 1000-class ImageNet dataset. The SPARCNet accelerator has been evaluated in real time on a number of popular networks, including VGGNet, AlexNet, and SqueezeNet, trained on the CIFAR-10 and ImageNet datasets. When deployed on a Zynq-based FPGA platform, the reduction techniques enabled up to a 6x improvement in energy efficiency relative to the baseline network. Relative to the platform's integrated dual-core ARM Cortex-A9 CPU, the SPARCNet accelerator improved throughput by up to 22x while decreasing energy consumption by 13x. The SPARCNet accelerator was further evaluated against a number of other platforms, including the NVIDIA Jetson TK1 with its embedded Tegra K1 GPU. When evaluated on AlexNet, the SPARCNet accelerator running on the ZedBoard platform with a Zynq-7000 FPGA achieves an efficiency of 8.07 GOP/J while drawing under 3 W, versus the Jetson TK1, which obtained an efficiency of 4.58 GOP/J at a total system power of 12 W.
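
Illustration of structured filter pruning

The abstract names structured filter pruning among the sparsification techniques but does not describe the selection criterion. The sketch below is illustrative only and is not the thesis's implementation: it assumes convolutional filters are ranked by L1 norm, the weakest fraction is removed, and the matching input channels of the following layer are dropped so that the pruned layers stay dense and can run on standard convolution hardware. The function name prune_conv_filters and the prune_ratio parameter are hypothetical.

    # Illustrative sketch of structured filter pruning (not the thesis's code).
    # Assumption: filters are ranked by L1 norm; the weakest fraction is removed
    # together with the corresponding input channels of the next layer.
    import numpy as np

    def prune_conv_filters(weights, next_weights, prune_ratio=0.5):
        """weights:      (num_filters, in_ch, k, k) convolution weights
           next_weights: (next_filters, num_filters, k, k) next layer's weights
           Returns pruned copies of both tensors plus the kept filter indices."""
        # Score each filter by the sum of its absolute weights (L1 norm).
        scores = np.abs(weights).reshape(weights.shape[0], -1).sum(axis=1)
        num_keep = max(1, int(round(weights.shape[0] * (1.0 - prune_ratio))))
        keep = np.sort(np.argsort(scores)[-num_keep:])  # strongest filters, in order

        pruned = weights[keep]               # drop whole filters (structured sparsity)
        pruned_next = next_weights[:, keep]  # drop the now-missing input channels
        return pruned, pruned_next, keep

    # Example: prune half the filters of a toy 3x3 conv layer with 64 filters.
    rng = np.random.default_rng(0)
    w1 = rng.standard_normal((64, 3, 3, 3))
    w2 = rng.standard_normal((128, 64, 3, 3))
    p1, p2, kept = prune_conv_filters(w1, w2, prune_ratio=0.5)
    print(p1.shape, p2.shape)  # (32, 3, 3, 3) (128, 32, 3, 3)

Because entire filters and their output feature maps are removed, the remaining computation stays dense and regular, which is the property that makes structured pruning attractive for fixed-function accelerators such as the FPGA design described in the abstract.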