CSCMAC - Cyclic Sparsely Connected Neural Manycore Accelerator

Department

Computer Science and Electrical Engineering

Program

Engineering, Computer

Rights

Access limited to the UMBC community. Item may possibly be obtained via Interlibrary Loan through a local library, pending author/copyright holder's permission.
This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu

Abstract

In deep neural networks (DNNs), model size and computational complexity are two important factors that impact memory footprint and performance, respectively; both can be reduced by compressing the DNN with methods such as weight pruning or structural compression of the model. Recent works on DNN weight pruning have shown significant reductions in model size, but at the expense of irregularity in the DNN architecture, which necessitates additional indexing memory to address the non-zero weights. Structural compression of DNNs, on the other hand, requires minimal or no indexing, is on par with pruning methods in terms of accuracy, and can be used as an overlay for traditional DNN layers. The recently proposed Cyclic Sparsely Connected (CSC) layers structurally compress and sparsify DNNs and can reduce the memory footprint of dense layers from O(N²) to O(N log N). In this thesis, we propose an energy-efficient, domain-specific manycore accelerator named CSCMAC (Cyclic Sparsely Connected Neural Network Manycore Accelerator), which effectively maps and executes DNNs compressed with CSC architectures. We implement a kernel-specific instruction for CSC-layer inference on a manycore platform, take advantage of their cyclic architecture, and show that implementing them in software, even on a parallel-computing processor, is straightforward. To further exploit this implementation simplicity, we propose customized instructions for the manycore that fuse frequently used sequences of machine code and, by means of Amdahl's law, evaluate the optimization gained by the customization. Our experimental results using LeNet-300-100 on MNIST (an image classification application) and a Multi-Layer Perceptron (MLP) on Physical Activity Monitoring (a physical-activity data-processing application) indicate that by replacing Fully-Connected (FC) layers with CSC layers, we can achieve 46x and 6x compression, respectively, within a margin of 2% accuracy loss. With only 2 mW of power overhead, a novel CSC instruction is added to the ISA of the CSCMAC, replacing a frequently used sequence that would have taken 11 clock cycles with a single clock cycle. A 64-cluster architecture of the CSCMAC is fully placed and routed in 65 nm TSMC CMOS technology. The layout of each cluster occupies an area of 0.73 mm² and consumes 230.2 mW at a 980 MHz clock frequency. Our proposed CSCMAC achieves 57% higher throughput and 56% lower energy compared to its predecessor manycore (PENC). The CSCMAC also achieves 90x higher throughput and consumes 69x lower energy compared to a CPU implementation on the NVIDIA Jetson TX2 platform.
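To make the O(N²) to O(N log N) reduction concrete, the sketch below shows one way a CSC-style layer can be built as a cascade of sparse stages whose input indices follow a cyclic (modular) stride pattern, so connectivity is recomputed on the fly rather than stored in an index table. This is an illustrative reading of the abstract, not the thesis's actual construction; the names csc_indices, csc_forward, and fan_in are hypothetical, and the fan-in, stage count, and activation are assumptions.

```python
# Illustrative sketch (not the thesis implementation): replace an N x N FC layer
# with ceil(log_f N) cascaded sparse stages. Each output neuron connects to a
# fixed fan-in `f` of inputs chosen by a cyclic stride, so no per-weight index
# memory is needed.

import math
import numpy as np

def csc_indices(n, fan_in, stage):
    """Cyclic input indices for each of the n outputs at a given cascade stage."""
    stride = fan_in ** stage                       # spacing grows per stage
    base = np.arange(n).reshape(n, 1)              # one row of indices per output
    offsets = np.arange(fan_in).reshape(1, fan_in)
    return (base + offsets * stride) % n           # cyclic wrap-around, no index table

def csc_forward(x, weights, fan_in):
    """Apply the cascade of sparse stages; weights[s] has shape (n, fan_in)."""
    n = x.shape[0]
    for stage, w in enumerate(weights):
        idx = csc_indices(n, fan_in, stage)        # connectivity recomputed, not stored
        x = np.maximum((w * x[idx]).sum(axis=1), 0.0)   # sparse dot product + ReLU
    return x

n, fan_in = 256, 4
stages = math.ceil(math.log(n, fan_in))            # log_f(N) cascaded stages
rng = np.random.default_rng(0)
weights = [rng.standard_normal((n, fan_in)) * 0.1 for _ in range(stages)]
y = csc_forward(rng.standard_normal(n), weights, fan_in)

# Parameter count: stages * n * fan_in = O(N log N), versus n * n = O(N^2) for FC.
print(stages * n * fan_in, "CSC weights vs", n * n, "FC weights")
```

On the instruction-fusion claim, Amdahl's law bounds the gain: if a fraction p of dynamic cycles is spent in the 11-cycle sequence that the fused CSC instruction collapses to 1 cycle, the attainable speedup is 1 / ((1 - p) + p/11). The actual value of p for the CSCMAC workloads is not stated in the abstract.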