Enabling Edge-Optimized AI Acceleration through Energy-Recycling Clocks and Compute-in-Memory Architectures

Author/Creator ORCID

Department

Computer Science and Electrical Engineering

Program

Engineering, Computer

Citation of Original Publication

Rights

This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu
Distribution Rights granted to UMBC by the author.

Abstract

Energy-efficient high-performance computing has been a pivotal driver of the microprocessor industry. Meeting the immense computational demands of rapidly advancing AI while keeping energy consumption low requires addressing the significant dynamic power consumed by two critical microprocessor subsystems: the clock architecture and the cache architecture. First, we propose a low-power, wideband energy-recycling clock architecture that uses resonant flip-flops (FFs) with series LC resonance and an inductor-tuning technique. The inductor tuning reduces clock skew and increases the robustness of the clock network. Across a 1–5 GHz range, our design saves over 43% power and reduces skew by 90% in clock-tree networks, and saves 44% power with a 90% skew reduction in mesh networks, compared to industry-standard primary-secondary FF-based networks. To enhance edge artificial intelligence (AI) computational efficiency, we introduce two Compute-in-Memory (CiM) architectures that minimize costly data transfers between memory and the CPU. The first, an energy-recycling resonant 10T Compute-in-Memory SRAM (rCiM) macro, performs Boolean logic computations within the memory, reducing core-cache data movement. This work also proposes an automation tool that, given a combinational logic circuit and memory and latency constraints, generates an energy- and latency-optimized rCiM implementation strategy. An 8 KB rCiM evaluated on the EPFL combinational benchmark suite showed 55.42% lower average energy consumption than standard von Neumann architectures, achieving 88.2–106.6 GOPS throughput and 8.64–10.45 TOPS/W energy efficiency.
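The automation tool's core decision, selecting an implementation that meets a latency budget at minimum energy, can be sketched in a few lines. This is a hypothetical illustration only: the topology names, per-operation cost numbers, and function names below are invented for the example and are not taken from the thesis.

```python
# Hypothetical sketch of the rCiM mapping tool's selection step: given the
# number of Boolean operations in a combinational circuit and a set of
# candidate macro topologies, pick the lowest-energy topology that meets a
# latency budget. All names and cost figures are illustrative.

from dataclasses import dataclass


@dataclass
class MacroTopology:
    name: str
    energy_per_op_pj: float  # illustrative energy per in-memory Boolean op
    cycles_per_op: int       # illustrative latency per op


def select_topology(num_ops, topologies, max_cycles):
    """Return the lowest-energy topology meeting the latency budget, or None."""
    feasible = [t for t in topologies if num_ops * t.cycles_per_op <= max_cycles]
    if not feasible:
        return None
    return min(feasible, key=lambda t: num_ops * t.energy_per_op_pj)


# Illustrative candidates and constraints.
topologies = [
    MacroTopology("nor-based", energy_per_op_pj=0.8, cycles_per_op=2),
    MacroTopology("nand-based", energy_per_op_pj=0.7, cycles_per_op=3),
    MacroTopology("xor-augmented", energy_per_op_pj=1.1, cycles_per_op=1),
]
best = select_topology(num_ops=1000, topologies=topologies, max_cycles=2500)
```

With these numbers the nand-based option is cheapest per operation but misses the latency budget, so the tool would settle on the nor-based topology; the real tool presumably uses characterized energy/latency models rather than constants.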
The proposed combinational-logic operation-mapping methodology demonstrates that a three-topology macro strategy further cuts energy by 40.52% compared to single-macro designs. The second architecture is a resonant time-domain CiM (rTD-CiM) for Convolutional Neural Networks (CNNs) that avoids analog-to-digital converters (ADCs) by using a low-overhead time-to-digital converter (TDC) to digitize Multiply-Accumulate (MAC) results, mitigating the area, power, and non-linearity issues of traditional ADCs. In addition, a weight-stationary data-mapping strategy combined with an automated SRAM macro selection algorithm optimizes memory usage for quantized CNNs. Evaluated across six CNNs and nine SRAM configurations, the algorithm achieves an 87.5% reduction in latency for ResNet-18 when mapped to a 256 KB SRAM macro and improves energy efficiency by 8× over a 32 KB SRAM. The rTD-CiM achieves 320 GOPS throughput and 38.46 TOPS/W energy efficiency on an 8 KB macro.
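The intuition behind the SRAM macro selection can be sketched as follows: under a weight-stationary mapping, a larger macro holds more of each layer's weights and therefore needs fewer weight-reload passes. The helper names, layer sizes, and candidate capacities below are illustrative assumptions, not data from the thesis.

```python
# Hypothetical sketch of the SRAM macro selection idea for quantized CNNs:
# estimate how many weight-reload passes each candidate macro needs under a
# weight-stationary mapping, then pick the smallest macro that minimizes the
# total. All sizes are illustrative, not measured results.

import math


def reload_passes(layer_weights_kb, macro_kb):
    """Total reload passes: each layer needs ceil(weights / capacity) loads."""
    return sum(math.ceil(w / macro_kb) for w in layer_weights_kb)


def pick_macro(layer_weights_kb, candidate_macros_kb):
    """Smallest candidate macro that minimizes total weight-reload passes."""
    return min(candidate_macros_kb,
               key=lambda m: (reload_passes(layer_weights_kb, m), m))


layers = [9, 36, 72, 144, 288]   # illustrative per-layer weight sizes (KB)
macros = [32, 64, 128, 256]      # candidate SRAM macro capacities (KB)
choice = pick_macro(layers, macros)
```

Here the 256 KB candidate wins because only the largest layer needs a second pass; a production flow would also weigh macro leakage and access energy, which this sketch omits.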