Accelerating Convolutional Neural Network with FFT on Embedded Hardware







Computer Science and Electrical Engineering


Engineering, Computer



This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please contact Special Collections at speccoll(at)
Distribution Rights granted to UMBC by the author.


Fueled by the ILSVRC and COCO competitions, Convolutional Neural Networks (CNNs) have become important in computer vision and natural language processing. However, state-of-the-art CNNs are computationally and memory intensive, so energy-efficient implementation on embedded platforms is challenging. VGGNet and ResNet recently showed that deep networks with more convolution layers and fewer fully connected layers achieve lower error rates, and recent architectures follow this trend, so reducing the complexity of the convolution layers is of utmost importance. In this paper we evaluate three convolution variants, direct convolution (Direct-Conv), Fast Fourier Transform based convolution (FFT-Conv), and FFT overlap-and-add convolution (FFT-OVA-Conv), in terms of computational complexity and memory storage requirements for popular CNNs on embedded hardware, through two case studies: object detection and atmospheric big-data compression. For the object detection case study, we implement the three techniques for ResNet-20 with the CIFAR-10 dataset on a low-power domain-specific many-core called Power Efficient Nano Clusters (PENC), an NVIDIA Jetson TX1 GPU, and an ARM Cortex-A53 CPU to explore the trade-offs between software and hardware implementation, domain-specific logic and instructions, and the forms of parallelism available across the architectures. Results are evaluated and compared with respect to per-layer throughput, energy consumption, and execution time for the three methods. Using PENC's built-in FFT instruction, FFT-OVA-Conv runs 2.9x and 1.65x faster and achieves 6.7x and 2.3x higher throughput per watt than Direct-Conv and FFT-Conv, respectively. On the ARM Cortex-A53 CPU, FFT-OVA-Conv achieves 3.36x and 1.38x faster execution time and 2.72x and 1.32x higher throughput than Direct-Conv and FFT-Conv.
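The three convolution variants compared above can be sketched for the 1-D case in a few lines of NumPy. This is an illustrative sketch only, not the paper's implementation (which targets 2-D CNN layers and hardware FFT instructions); the block size in `fft_ova_conv` is an arbitrary choice, and all three functions compute the same full linear convolution.

```python
import numpy as np

def direct_conv(x, h):
    # Direct-Conv: naive sliding-window linear convolution, O(N*K) MACs.
    return np.convolve(x, h)

def fft_conv(x, h):
    # FFT-Conv: zero-pad both inputs to the full output length,
    # multiply spectra, and inverse-transform back.
    n = len(x) + len(h) - 1
    return np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(h, n), n)

def fft_ova_conv(x, h, block=64):
    # FFT-OVA-Conv: split x into short blocks, convolve each block with h
    # using a small FFT, and add the overlapping tails together.
    k = len(h)
    n = block + k - 1                  # FFT size per block (no circular wrap)
    H = np.fft.rfft(h, n)              # filter spectrum computed once
    y = np.zeros(len(x) + k - 1)
    for start in range(0, len(x), block):
        seg = x[start:start + block]
        y_seg = np.fft.irfft(np.fft.rfft(seg, n) * H, n)
        y[start:start + len(seg) + k - 1] += y_seg[:len(seg) + k - 1]
    return y
```

Overlap-and-add keeps each FFT small (sized to the block plus the filter tail), which is what makes it attractive when the filter is much shorter than the signal or feature map.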
On the TX1 GPU, FFT-Conv is 1.9x faster, 2.2x more energy efficient, and achieves 5.6x higher per-layer throughput than Direct-Conv. PENC is 10,916x and 1.8x faster, 9,200x and 7.9x more energy efficient, and achieves 7.5x and 1.2x higher per-layer throughput than the ARM Cortex-A53 CPU and the TX1 GPU, respectively. For the case study of atmospheric big-data compression, we apply the proposed FFT-based convolution techniques to the compression and decompression (CoDec) of LIDAR backscattering profiles from a Vaisala CL31 ceilometer. We evaluate Direct-Conv, FFT-Conv, and FFT-OVA-Conv based discrete wavelet transforms for compression and decompression on the ARM Cortex-A53 CPU. With one level of compression applied to 24 hours of data at the sensor and decompression at the base station, we achieve a run time of 32.13 s at a throughput of 138.6 Ksamples/s, with a 75% reduction in transmission and storage consumption. The FFT-OVA-Conv based CoDec achieves 10.6x and 4x faster execution time than the Direct-Conv and FFT-Conv based CoDec methods on the ARM Cortex-A53 CPU.
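The convolution-based discrete wavelet transform at the heart of the CoDec can be sketched as a one-level analysis/synthesis round trip. The Haar filter pair below is an illustrative assumption (the wavelet used for the ceilometer CoDec is not specified above), and `np.convolve` stands in for any of the three convolution methods; in a lossy CoDec, discarding the detail band halves the samples stored per level, while the round trip here keeps both bands to show exact reconstruction.

```python
import numpy as np

# One level of a discrete wavelet transform expressed as two convolutions
# followed by downsampling. Haar analysis filters (assumed for illustration):
LO = np.array([1.0, 1.0]) / np.sqrt(2.0)   # low-pass: approximation band
HI = np.array([1.0, -1.0]) / np.sqrt(2.0)  # high-pass: detail band

def dwt1(x):
    # Analysis: convolve with each filter, then keep every other sample.
    # Each band has half the length of x (x assumed even-length).
    approx = np.convolve(x, LO)[1::2]
    detail = np.convolve(x, HI)[1::2]
    return approx, detail

def idwt1(approx, detail):
    # Synthesis: invert the 2x2 Haar butterfly to interleave the
    # even/odd samples back into the original signal.
    x = np.empty(2 * len(approx))
    x[0::2] = (approx - detail) / np.sqrt(2.0)
    x[1::2] = (approx + detail) / np.sqrt(2.0)
    return x
```

Because the analysis step is just two filterings, the `np.convolve` calls can be swapped for an FFT or overlap-and-add convolution, which is the substitution the FFT-Conv and FFT-OVA-Conv CoDec variants make.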