© 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/ republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. Access to this work was provided by the University of Maryland, Baltimore County (UMBC) ScholarWorks@UMBC digital repository on the Maryland Shared Open Access (MD-SOAR) platform.


# Negative Capacitance Clock Distribution

Riadul Islam, Member, IEEE

**Abstract**—In this paper, we investigate the ever-increasing power issue that is poised to jeopardize the performance and robustness of future low-power microprocessor design. We have observed a tremendous amount of research in low-power clock network design to bolster energy-efficient computing, without, however, any substantial improvement in overall microprocessor clock power and performance. In this work, we used the emerging ferroelectric negative capacitance field-effect transistor (NCFET) to reduce clock network effective capacitance and active elements, which enables low-power clocking. According to accurate HSPICE simulation, the proposed NCFET-based clocking can save up to 70% and 73% average power compared to the industry standard clocking schemes on industrial ISPD 2009 and ISPD 2010 benchmarks, respectively. In addition, the proposed methodology uses up to 20% fewer clock buffers compared to the existing synthesized clocking scheme and exhibits 49% lower crosstalk-induced delay variation compared to the traditional CMOS-based design.

Index Terms—Ferroelectric capacitance, NCFET, clock network, power, skew.

## 1 INTRODUCTION

As more and more functionality and devices are introduced into microprocessors, a tremendous signal traffic demand in next generation microprocessors or system-on-chips (SOCs) is surely foreseen. How to carry such a huge signal-traffic demand with a bounded time limit and power budget becomes a very thorny problem that needs to be dealt with immediately and carefully. Among all the data and communication signals, the notion of clock and clocking is essential for the concept of synchronous design of a digital system. The clock provides the reference signal necessary for a synchronous circuit that has simultaneous switching activity. However, in a microprocessor, the clock has the fastest switching activity or frequency. As a result, a clock distribution network (CDN) consumes a significant amount of power in a microprocessor power-arc, which is considered as 30%-40% of total dynamic power [1]. The primary reason is that dynamic power  $(P_{dynamic})$  has a linear relationship to the frequency (f) and can be expressed as

$$P_{dynamic} = \alpha C_{clock} V_{DD}^2 f, \tag{1}$$

where f is clock frequency,  $\alpha$  is activity factor,  $V_{DD}$  is the supply voltage, and  $C_{clock}$  is the total CDN capacitance. It is apparent from Equation 1 that power is proportional to  $C_{clock}$  along with the f. In this work, we introduce negative capacitance in the CDN to effectively reduce the total CDN capacitance in order to enable ultra low-power clocking.
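As a rough numerical illustration of Equation 1, the sketch below evaluates the dynamic power for a baseline CDN capacitance and for a halved capacitance. All numeric values (activity factor, supply voltage, capacitance) are assumptions chosen for readability and are not taken from the experiments in this paper.

```python
def dynamic_power(alpha, c_clock, v_dd, f):
    """Equation 1: P_dynamic = alpha * C_clock * V_DD^2 * f (in watts)."""
    return alpha * c_clock * v_dd ** 2 * f


# Illustrative values only (assumptions, not measured data).
ALPHA = 1.0      # activity factor assumed to be 1 for a clock net
V_DD = 1.0       # supply voltage in volts
F_CLK = 1e9      # 1 GHz clock frequency
C_BASE = 10e-12  # assumed total CDN capacitance of 10 pF

p_base = dynamic_power(ALPHA, C_BASE, V_DD, F_CLK)
p_reduced = dynamic_power(ALPHA, 0.5 * C_BASE, V_DD, F_CLK)

print(f"baseline power: {p_base * 1e3:.2f} mW")
print(f"power with half the capacitance: {p_reduced * 1e3:.2f} mW")
```

Because the relation is linear in $C_{clock}$, any fractional reduction in effective capacitance carries over directly to the dynamic power at a fixed $V_{DD}$ and $f$.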

#### 1.1 Prior Work and Motivations

The reliability of a clock depends on the clock skew; jitter; resource usage; power; latency; slew rate; robustness to noise; and process, supply voltage, and chip temperature (PVT) variation. As a result of these diverse metrics, there has been a tremendous amount of work on clock design to improve one or several metrics at a time. However, the primary goal of this work is to reduce the CDN power without affecting the skew and slew rate of the clock. The most widespread low-power clocking techniques use dynamic voltage and frequency scaling (DVFS) [2], [3]. According to Equation 1, dynamic power is proportional to $V_{DD}^2 f$, so scaling $V_{DD}$ and $f$ together results in a cubic power reduction. In addition, a decrease in supply voltage means slower circuits; accordingly, researchers reduce both $V_{DD}$ and $f$ simultaneously to make the design compatible with dynamic workload requirements [2], [3], [4]. However, most low-swing clocking requires dual-supply voltage routing and is highly susceptible to crosstalk noise.

Other researchers proposed resonant energy recovery clocking to reduce CDN power. Resonant clocking uses the clock capacitance and an on-chip inductor to resonate at a central frequency. The basic idea of resonant clocking is that it saves power by storing and recycling energy into the magnetic field of an inductor and the electric field of a capacitor [4], [5]. However, resonant clocking suffers from a large slew rate due to the sinusoidal nature of the produced signal and consumes high short-circuit power. In addition, it requires a high quality factor for the on-chip inductor, which is hard to maintain over a large frequency band. Recently, intermittent-resonant clocking solved both of these issues by using extra clock driver circuitry [5]. However, all of these resonant clocking schemes require passive devices and additional bias circuits. Another interesting idea, perhaps very relevant to our work, reduces resonant clock passive device resources; however, it suffers from frequency scaling issues [6].

In recent years, researchers have shown considerable interest in current-mode (CM) clocking [7], [8], [9]. The primary reason is that it eliminates the requirement of active elements (i.e., buffers) from the clock tree and consumes extremely low power due to its very low-swing operation. However, CM clocking requires a current transmitter (voltage-to-current converter) and receiver (current-to-voltage converter) or CM flip-flops, which require extra attention to tackle analog issues. In addition, similar to resonant clocking, the traditional clock synthesis tools do not support CM clocking.

In addition to clock power, in multi-gigahertz microprocessor design, it is very challenging to meet the stringent timing budget, primarily due to the variation in interconnect, clock loading, temperature, and supply voltage. As a result, it requires complex clock tree synthesis (CTS), which is also a part of the digital design flow. There has been much research to implement a robust CDN with minimal clock resources (power, wirelength) and timing uncertainties. In order to achieve high performance, however, most researchers focus on improving clock skew [10], [11], [12]. H-tree routing is considered to be the most widely used technique in the microprocessor and field-programmable gate array (FPGA) industry. However, in high-performance design it is impossible to use H-tree routing due to chip blockages and the physical asymmetry of the flip-flop locations. Similar to H-tree routing, the method of mean and median (MMM) is a top-down approach. MMM routing recursively partitions the clock region, alternating between the X and Y directions, and merges the centers of mass into their parent region to build a CDN with optimal skew [12]. Unlike MMM routing, the geometric matching algorithm (GMA) utilizes a recursive bottom-up approach to build the CDN [11]. The one common issue that hinders the quality of results in all of these algorithms is that they do not consider clock load while balancing the wire length to minimize skew. The deferred-merge embedding (DME) based zero skew routing algorithm (ZSA) tackles the loading issue by applying an Elmore delay model to achieve theoretical zero skew in the CDN [13], [14], [15]. Although the ZSA uses optimal wirelength, it struggles to meet an optimal power budget at high-frequency operation. The primary reason is the large CDN capacitance.

Riadul Islam is with the Department of Electrical and Computer Engineering, University of Michigan-Dearborn, Dearborn, MI 48128.
 E-mail: riaduli@umich.edu

<sup>•</sup> This work was supported by UM-Dearborn CECS Start-Up U056930.

Manuscript received April 19, 2018; revised August 26, 2018.

Fig. 1: (a) The RC ladder equivalent buffered circuit of a balanced clock tree, (b) NCFET-based equivalent buffered circuit.

In order to enable low-power high-performance microprocessor operation, the negative capacitance field effect transistor (NCFET) has recently attracted great attention and has been studied in recent works [16], [17]. Using the NCFET, the researchers' primary goal is to break the Boltzmann switching barrier for conventional CMOS transistors, which limits the subthreshold swing to 60mV/decade [16]. To reduce clock capacitance, we introduce NCFET buffers in the CDN. We will discuss more details about the NCFET characteristics in Section 2. Figure 1(a) shows the *RC* ladder equivalent buffered circuit of a balanced clock tree. The CMOS-based buffers are automatically sized using our in-house automated tool to meet the slew requirement, which is 10% of the clock period. Figure 1(b) shows the same *RC* ladder circuit but uses NCFET buffers to drive the *RC* network and maintain the slew budget. According to our analysis, the proposed NCFET-based network consumes $0.18\times$ power and has $0.78\times$ latency compared to the traditional CMOS design.

Fig. 2: (a) The bulk-MOSFET, (b) MOSFET schematic diagram, and (c) equivalent capacitance model of a MOSFET.

#### 1.2 Main Contributions

In this work, we reduced the CDN power by lowering effective clock capacitance using NCFET devices. As a result, we increased the capacitive load of the CDN buffer to reduce the overall buffer/active area requirement. In particular, the major contributions of this work are:

- The first clock synthesis tool to incorporate NCFET devices and apply on industrial testbenches
- The first attempt to reduce clock capacitance using NCFET device ferroelectric capacitance
- The significant reduction of the total buffer requirement of the CDN by increasing the capacitive load of a buffer
- The power-performance analysis of traditional CMOS-buffered and NCFET-based buffered CDN

#### 1.3 Paper Organization

The rest of this paper is organized as follows. In Section 2, we first introduce the NCFET device characteristics. Section 3 presents the proposed NCFET-based clocking scheme. In Section 4, the power efficiency of the proposed design with existing industry standard schemes is investigated. Finally, Section 5 concludes our main findings and observations in this paper.

## 2 NCFET CHARACTERISTICS AND MODELING

The recent development of the ferroelectric NCFET foretells a new era of computation [16], [18]. While the conventional MOSFET requires a 60mV change in the channel potential to change the current by $10\times$, the NCFET promises to lower the 60mV/decade subthreshold swing (SS) to allow a lower supply voltage that enables low-power operation. The ferroelectric (FE) materials, or negative capacitance dielectric materials, improve SS through amplification of the gate bias in an NCFET. Figure 2(a) and Figure 2(b) show the traditional MOSFET and its schematic representation. Figure 2(c) shows the symbolic capacitance diagram, which represents a transistor as two capacitances in series: an oxide capacitance ($C_{ox}$) and a semiconductor capacitance ($C_S$).



Fig. 3: (a) The bulk-NCFET, (b) NCFET schematic diagram, and (c) equivalent capacitance model of an NCFET.

The SS of a conventional MOSFET can be represented as

$$SS = \frac{dV_{GS}}{d\log_{10}(I_D)} = \frac{dV_{GS}}{d\Psi_S} \frac{d\Psi_S}{d\log_{10}(I_D)} = m \times p \tag{2}$$

where $V_{GS}$ is the applied gate-to-source voltage, $I_D$ is the drain current, and $\Psi_S$ is the surface potential. The body factor $m$, represented by the first term $\frac{dV_{GS}}{d\Psi_S}$ of Equation 2, can be easily derived from capacitive voltage division, ignoring the gate-source and gate-drain overlap capacitance $(C_{ov})$, as [16]:

$$m_{cmos} = \frac{dV_{GS}}{d\Psi_S} = 1 + \frac{C_S}{C_{ox}} \tag{3}$$

Since both the capacitances $C_S$ and $C_{ox}$ are positive, the body factor $m_{cmos} > 1$. However, NCFET devices can produce $m < 1$. Figure 3(a) and Figure 3(b) show the two-dimensional NCFET and its schematic diagram, respectively. Figure 3(c) is the symbolic capacitance diagram, which represents an NCFET as a series combination of the FE capacitance ($C_{FE}$) and the regular transistor capacitances. Now, we can rewrite the body factor of an NCFET as

$$m_{NC} = \frac{dV_{GS}^{NC}}{d\Psi_S} = 1 + \frac{C_S C_{ox}}{C_{FE}(C_S + C_{ox})} \tag{4}$$
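To make Equations 2 to 4 concrete, the following minimal sketch evaluates both body factors and the resulting subthreshold swing. The capacitance values are normalized, assumed numbers chosen only to show the sign effect of a negative $C_{FE}$; the factor $p$ is taken as the 60mV/decade value quoted in the text.

```python
def body_factor_cmos(c_s, c_ox):
    """Equation 3: m_cmos = 1 + C_S / C_ox."""
    return 1.0 + c_s / c_ox


def body_factor_ncfet(c_s, c_ox, c_fe):
    """Equation 4: m_NC = 1 + C_S*C_ox / (C_FE * (C_S + C_ox))."""
    return 1.0 + (c_s * c_ox) / (c_fe * (c_s + c_ox))


P_MV_PER_DECADE = 60.0  # the factor p in Equation 2, as quoted in the text

# Normalized capacitances (assumed values, arbitrary units).
C_S = 1.0
C_OX = 4.0
C_FE = -2.0  # negative ferroelectric capacitance

m_cmos = body_factor_cmos(C_S, C_OX)
m_nc = body_factor_ncfet(C_S, C_OX, C_FE)

ss_cmos = m_cmos * P_MV_PER_DECADE  # SS = m * p
ss_nc = m_nc * P_MV_PER_DECADE

print(f"m_cmos = {m_cmos:.2f}, SS = {ss_cmos:.1f} mV/decade")
print(f"m_NC   = {m_nc:.2f}, SS = {ss_nc:.1f} mV/decade")
```

With these assumed numbers the second term of Equation 4 is negative, so $m_{NC}$ drops below one and the swing falls below 60mV/decade, whereas the CMOS body factor stays above one.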

From Equation 4, we can clearly see that a negative $C_{FE}$ makes the second term negative, so it is possible to achieve a body factor lower than one using NCFET devices. However, in this work we utilized an NCFET device to reduce CDN capacitance without scaling the supply voltage to enable low-power computation. In particular, we used an existing NCFET model [19] that utilizes a traditional MOSFET model and Landau-Khalatnikov (LK) theory to provide the characteristics of an NCFET. According to the LK model, the dynamic FE capacitor can be explained using the following expression:

$$\rho_{FE}\frac{dQ_G}{dt} + \nabla_{Q_G}U = 0 \tag{5}$$

where $\rho_{FE}$ is the heat dissipation within the FE, $Q_G$ is the gate charge, and $U$ is the free energy of the FE material, which can be expressed using the Gibbs free energy as

$$U = \alpha_{FE}Q_G^2 + \beta_{FE}Q_G^4 + \gamma_{FE}Q_G^6 - \frac{V_{FE}}{t_{FE}}Q_G \tag{6}$$

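where $\alpha_{FE}$, $\beta_{FE}$, and $\gamma_{FE}$ are coefficients of the FE material and $t_{FE}$ is the thickness of the FE dielectric. Substituting Equation 6 into Equation 5, the voltage $V_{FE}$ can be expressed in terms of the charge $Q_G$ as

$$V_{FE} = t_{FE}\left(2\alpha_{FE}Q_G + 4\beta_{FE}Q_G^3 + 6\gamma_{FE}Q_G^5 + \rho_{FE}\frac{dQ_G}{dt}\right) \tag{7}$$
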
Fig. 4: The proposed clocking scheme uses an existing greedy algorithm to build an analytical zero-skew tree and reduces the overall buffer requirement by increasing the maximum capacitance limit while buffering the tree using NCFET buffers.

Equation 7 is modeled as a dependent voltage source in the SPICE model [19]. Using Equation 7, the steady-state negative capacitance can be expressed as

$$C_{FE} = \frac{1}{2\alpha_{FE}t_{FE} + 12\beta_{FE}t_{FE}Q_{G}^{2} + 30\gamma_{FE}t_{FE}Q_{G}^{4}} \tag{8}$$
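The sketch below evaluates Equation 8 and checks it numerically against the steady-state voltage obtained by setting $dQ_G/dt = 0$ in Equation 5 with the free energy of Equation 6. The coefficient values are normalized assumptions (a negative $\alpha_{FE}$, a positive $\beta_{FE}$, and $\gamma_{FE} = 0$) picked only to show where the capacitance is negative; they are not the parameters of the model in [19].

```python
def v_fe_steady(q, alpha, beta, gamma, t_fe):
    """Steady-state FE voltage from Equations 5 and 6 with dQ_G/dt = 0."""
    return t_fe * (2 * alpha * q + 4 * beta * q ** 3 + 6 * gamma * q ** 5)


def c_fe(q, alpha, beta, gamma, t_fe):
    """Equation 8: steady-state ferroelectric capacitance dQ_G/dV_FE."""
    return 1.0 / (2 * alpha * t_fe + 12 * beta * t_fe * q ** 2
                  + 30 * gamma * t_fe * q ** 4)


# Normalized, assumed coefficients (not the values used in [19]).
ALPHA, BETA, GAMMA, T_FE = -1.0, 1.0, 0.0, 1.0

c_at_zero = c_fe(0.0, ALPHA, BETA, GAMMA, T_FE)   # small charge
c_at_one = c_fe(1.0, ALPHA, BETA, GAMMA, T_FE)    # large charge

# Finite-difference check that 1 / C_FE equals dV_FE / dQ_G.
Q0, H = 0.2, 1e-6
dv_dq = (v_fe_steady(Q0 + H, ALPHA, BETA, GAMMA, T_FE)
         - v_fe_steady(Q0 - H, ALPHA, BETA, GAMMA, T_FE)) / (2 * H)
inv_c = 1.0 / c_fe(Q0, ALPHA, BETA, GAMMA, T_FE)

print(f"C_FE at Q=0: {c_at_zero:.3f}, C_FE at Q=1: {c_at_one:.3f}")
print(f"dV/dQ = {dv_dq:.6f}, 1/C_FE = {inv_c:.6f}")
```

Under these assumptions the capacitance is negative for small gate charge and turns positive once the $Q_G^2$ term dominates, which is the qualitative behavior the negative capacitance region relies on.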

In order to understand the driving capability of the NCFET buffer cell, we performed HSPICE simulation on a standard buffer (with PMOS and NMOS sizing ratio of 1.5) driving a capacitive load ( $C_L$ ). We sized our buffers considering a 1GHz clock frequency that can drive the maximum  $C_L$  to achieve a slew rate of 5%–10% of clock period. According to our analysis, the maximum  $C_L$  for an NCFET-based buffer is  $1.1 \times$  compared to the CMOS-based buffer with similar slew-rate and will be used as the maximum capacitance limit ( $C_{max}$ ) in Section 3. However, NCFET-based clocking can increase the  $C_{max}$  due to the very small switching-power compared to the CMOS-based design.

## 3 PROPOSED CLOCKING SCHEME

In a physical design flow, after place and route, the clock tree synthesis (CTS) is the most critical step to improve routing congestion and ensure the correct functionality of a design. The primary goal of CTS is to create the routed buffered tree such that skew and power are minimized for any given network. Figure 4 shows the proposed CTS methodology. The first phase of the CTS is routing and topology generation to achieve minimal skew or zero skew. After that, we need to insert buffers to drive the clock signal and distribute it to each sink. In this work, we utilized traditional deferred merge embedding (DME) to meet the skew constraint [13], [14], [15]. In addition, we optimized the overall wire length using a bottom-up merging of the sinks to identify the potential zero-skew merging sectors, while the final merging location is selected depending on the parent location. In a nutshell, the overall DME approach confirms analytical zero skew using the Elmore delay model. According to the Elmore model, the time delay $(d_{tij})$ from node *i* to node *j* along the signal path $(P_{ij})$ is computed as:

$$d_{tij} = \sum_{n=i}^{j} R_n C_n \tag{9}$$



Fig. 5: (a) Equivalent RC network of an NCFET-CDN, and (b) we modeled the NCFET buffers as a negative FE capacitance and repeating clock signal to reduce overall CDN capacitance.

where $R_n$ is the wire resistance of segment *n* on the path from node *i* to node *j* and $C_n$ is the downstream capacitance observed from that segment. Using Equation 9 we can define the skew of a clock network as the maximum delay or arrival time difference between any logically connected pair of sinks as:

$$Skew = \max_{i,j\in sinks} (d_{ti} - d_{tj}) = \max_{i\in sinks} (d_{ti}) - \min_{j\in sinks} (d_{tj}) \tag{10}$$
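Equations 9 and 10 can be exercised on a small example. The sketch below is a simplified lumped model, assuming each tree edge has one resistance and each node one capacitance; the tree, the node names, and the element values are hypothetical and chosen so that the two sinks are balanced.

```python
from collections import defaultdict


def elmore_sink_delays(parent, r_edge, c_node, root):
    """Return {sink: delay}, summing R_n * C_n along each root-to-sink path."""
    children = defaultdict(list)
    for node, par in parent.items():
        children[par].append(node)

    def downstream_cap(node):
        # Capacitance at this node plus everything below it.
        return c_node[node] + sum(downstream_cap(ch) for ch in children[node])

    delays = {}

    def walk(node, delay_so_far):
        kids = children[node]
        if not kids:
            delays[node] = delay_so_far  # a leaf node is a clock sink
        for ch in kids:
            walk(ch, delay_so_far + r_edge[ch] * downstream_cap(ch))

    walk(root, 0.0)
    return delays


def skew(delays):
    """Equation 10: largest minus smallest sink arrival time."""
    return max(delays.values()) - min(delays.values())


# Hypothetical tree: root -> m -> {a, b}; resistances in ohms, caps in farads.
PARENT = {"m": "root", "a": "m", "b": "m"}
R_EDGE = {"m": 50.0, "a": 100.0, "b": 200.0}
C_NODE = {"root": 0.0, "m": 5e-15, "a": 20e-15, "b": 10e-15}

sink_delays = elmore_sink_delays(PARENT, R_EDGE, C_NODE, "root")
clock_skew = skew(sink_delays)

for name, d in sorted(sink_delays.items()):
    print(f"sink {name}: {d * 1e12:.2f} ps")
print(f"skew: {clock_skew * 1e12:.2f} ps")
```

In this example the branch with twice the resistance drives half the capacitance, so both sinks see the same Elmore delay and the skew is zero; this is the kind of balance a zero-skew merge aims for.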

In order to efficiently drive the clock tree, we used an existing greedy algorithm that provides an optimal number of buffers with zero buffer skew trees [20]. The algorithm controls the slew rate of each node by considering a maximum capacitance load ($C_{max}$), which represents the clock wire and buffer gate capacitances. The primary reason to adopt the bottom-up algorithm is that it uses edge and node rules to buffer the clock tree. According to the first rule, if the wire length or capacitance at the source of the edge is greater than $C_{max}$, the algorithm splits the edge and inserts a new node and buffer at the point on the edge nearest the root in such a way that the downstream $C_{max}$ is not violated, while the latter rule equalizes the number of buffer levels in each subtree. However, after these buffer insertions, if the node capacitance exceeds $C_{max}$, an extra buffer is added to drive each subtree. In each case, the proposed methodology reduces the effective wire capacitance by inserting NCFET buffers. As a result, we increase the capacitance limit to reduce the overall number of buffers in the clock network. Figure 5(a) shows the buffered RC clock network with two subtrees. The clock network is driven by NCFET buffers. Figure 5(b) shows the equivalent RC model of the buffered network. Using Equation 9 we can write the delay equations for subtree 1 as

$$d_{t12} = R_1 \left(\frac{C_1}{2} + C_{L1} - C_{FE}\right) + t_1 \tag{11}$$

and for subtree 2 as

$$d_{t13} = R_2 \left(\frac{C_2}{2} + C_{L2} - C_{FE}\right) + t_2 \tag{12}$$

where $R_1$, $C_1$, $C_{L1}$, and $t_1$ are the resistance, capacitance, load, and downstream network delay of subtree 1, while $R_2$, $C_2$, $C_{L2}$, and $t_2$ are the resistance, capacitance, load, and downstream network delay of subtree 2. Clearly, from Equation 11 and Equation 12, the effective capacitance of the network is reduced by the introduction of NCFET buffers. The proposed methodology increases the capacitance limit to meet the skew budget of the network. Figure 6(a) and Figure 6(b) show the resulting CMOS-based and NCFET-based CDNs.

Fig. 6: The resulting ISPD 2009 s2r1 CDN, (a) CMOS-based CDN and (b) the proposed NCFET-based CDN.
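A small numerical sketch of Equations 11 and 12 follows. The resistances, capacitances, downstream delays, and the magnitude of $C_{FE}$ are all assumed values for illustration; the function name `subtree_delay` is hypothetical and is not part of the synthesis tool described here.

```python
def subtree_delay(r, c_wire, c_load, t_down, c_fe=0.0):
    """Equations 11 and 12: R * (C/2 + C_L - C_FE) + t.

    c_fe is the magnitude of the FE compensation; 0.0 models a CMOS buffer.
    """
    return r * (c_wire / 2.0 + c_load - c_fe) + t_down


# Assumed values: ohms, farads, seconds.
SUBTREE_1 = dict(r=100.0, c_wire=40e-15, c_load=30e-15, t_down=5e-12)
SUBTREE_2 = dict(r=200.0, c_wire=20e-15, c_load=15e-15, t_down=5e-12)
C_FE_MAG = 10e-15

d12_cmos = subtree_delay(**SUBTREE_1)
d13_cmos = subtree_delay(**SUBTREE_2)
d12_nc = subtree_delay(**SUBTREE_1, c_fe=C_FE_MAG)
d13_nc = subtree_delay(**SUBTREE_2, c_fe=C_FE_MAG)

print(f"CMOS : d12 = {d12_cmos * 1e12:.1f} ps, d13 = {d13_cmos * 1e12:.1f} ps")
print(f"NCFET: d12 = {d12_nc * 1e12:.1f} ps, d13 = {d13_nc * 1e12:.1f} ps")
```

With these numbers both subtree delays shrink once $C_{FE}$ is subtracted, but by different amounts ($R_1 C_{FE}$ versus $R_2 C_{FE}$), so two branches that were balanced with CMOS buffers are no longer exactly balanced. This suggests why the capacitance limit and the skew budget have to be considered together.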

## 4 EXPERIMENTAL RESULTS

#### 4.1 Experimental Setup

We implemented the proposed methodology in C++ and ran the simulations on a quad-core Intel Xeon 2.6 GHz processor. We validated our proposed technique using 45nm technology applied on ISPD 2009 and ISPD 2010 industrial benchmarks. IBM provides the ISPD 2009 benchmarks from their application-specific IC designs. These benchmark circuits are distributed over a chip area of $50.4$ to $133.3\,mm^2$ and consist of 81 to 623 distributed flip-flops with capacitive loads [21]. ISPD 2010 benchmarks are derived from IBM and Intel's original microprocessor designs [22]. These benchmark circuits are distributed over an area of $1.4$ to $91.0\,mm^2$ and consist of 981 to 2249 nonuniformly distributed flip-flops with unbalanced loading. The 45nm technology parameters and NCFET models were used to design the buffers and for HSPICE simulations [19]. In addition, we considered a 1GHz clock frequency to optimize our design.

We compared the proposed methodology with a state-of-the-art traditional buffered algorithm. This algorithm uses common industry methods with minimum wire length using DME [13], [15] and CMOS buffers [23] inserted to meet the clock skew and slew rate [20].

#### 4.2 ISPD Benchmarks Result

The proposed NCFET-based methodology consumes substantially less power than the existing CMOS-based clocking scheme on all of the ISPD 2009 and 2010 benchmarks. Table 1 shows results obtained from running the proposed methodology on ISPD 2009 test benches.

IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTING, VOL. 14, NO. 8, AUGUST 2018

TABLE 1: The proposed NCFET-based clocking scheme enables 69.6% average power and 20% average buffer saving when compared to traditional CMOS buffered-based system, with 11.3ps higher clock skew using 2009 ISPD benchmarks.

| Benchmark |      |           | Traditional CMOS |         |      | NCFET-based   |         |      | NCFET compared to CMOS |         |               |
|-----------|------|-----------|------------------|---------|------|---------------|---------|------|------------------------|---------|---------------|
| ISPD 2009 | Sink | Chip area | Power            | Buffers | Skew | Power         | Buffers | Skew | Power                  | Buffers | $\Delta$ Skew |
|           | (#)  | $(mm^2)$  | ( <i>mW</i> )    | (#)     | (ps) | ( <i>mW</i> ) | (#)     | (ps) | (%)                    | (%)     | (ps)          |
| s1r1      | 81   | 69.4      | 8.0              | 174     | 24.0 | 2.5           | 127     | 27.0 | 68.9                   | 27.0    | -3.0          |
| s2r1      | 88   | 54.6      | 7.9              | 172     | 20.0 | 2.2           | 146     | 23.0 | 72.1                   | 15.1    | -3.0          |
| s4r3      | 623  | 120.7     | 20.6             | 591     | 42.0 | 6.2           | 464     | 63.0 | 69.9                   | 21.5    | -21.0         |
| f11       | 121  | 109.2     | 12.2             | 288     | 24.0 | 3.7           | 222     | 42.0 | 69.6                   | 22.9    | -18.0         |
| f21       | 117  | 133.3     | 12.6             | 290     | 23.0 | 4.0           | 242     | 43.0 | 68.2                   | 16.6    | -20.0         |
| f22       | 91   | 50.4      | 7.1              | 155     | 16.0 | 2.2           | 142     | 19.0 | 69.0                   | 8.4     | -3.0          |
| Average   | 187  | 89.6      | 11.4             | 279     | 24.8 | 3.5           | 224     | 36.2 | 69.6                   | 19.6    | -11.3         |

We considered the standard 10% of clock period as the skew and slew constraint. The quality of the proposed technique is evidenced by the power consumption and the number of buffer requirements. The proposed NCFET-based clocking can save more than 69% of the average power compared to the existing methodology. In addition, the proposed algorithm requires 20% fewer buffers compared to the traditional technique and 11.3*ps* more average skew. However, these are skew-constrained comparisons, and the existing algorithms are unable to utilize it for further power savings. Figure 7(a) shows the required buffer area for the traditional CMOS technique and the NCFET-based design. Using ISPD 2009 benchmark circuits, the proposed NCFET-CDN consumes 7.5% lower buffer area on average compared to the traditional CMOS-based design.
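The per-benchmark buffer savings in Table 1 follow from the buffer counts in the same table. The short sketch below recomputes them as a relative reduction; the buffer counts are copied from Table 1, while the helper name `percent_saving` is only illustrative.

```python
def percent_saving(baseline, proposed):
    """Relative reduction of `proposed` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - proposed) / baseline


# (CMOS buffers, NCFET buffers) per ISPD 2009 benchmark, from Table 1.
TABLE1_BUFFERS = {
    "s1r1": (174, 127),
    "s2r1": (172, 146),
    "s4r3": (591, 464),
    "f11": (288, 222),
    "f21": (290, 242),
    "f22": (155, 142),
}

buffer_savings = {
    name: round(percent_saving(cmos, ncfet), 1)
    for name, (cmos, ncfet) in TABLE1_BUFFERS.items()
}

for name, saving in buffer_savings.items():
    print(f"{name}: {saving:.1f}% fewer buffers")
```

The same helper applies to the power columns, although recomputing those from the rounded table entries can differ from the reported percentages in the last digit.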

Table 2 shows results obtained from running the proposed methodology on ISPD 2010 test benches with a similar slew rate and skew constraint. The proposed NCFET-based clocking can save 73.4% average power compared to the existing methodology. Similar to the ISPD 2009 benchmark results, the proposed algorithm requires 20.1% fewer buffers compared to the traditional technique and exhibits 21.4*ps* more average skew. Figure 7(b) shows the required buffer area for the traditional CMOS technique and the NCFET-based design. Using ISPD 2010 benchmark circuits, the proposed NCFET-CDN consumes 6.9% lower buffer area on average compared to the CMOS-based design.

#### 4.3 Crosstalk Analysis

Crosstalk is considered to be a major threat in highly compact digital design. We observe crosstalk when two neighboring signals switch at the same time. However, any victim line signal is most susceptible when the aggressor's signal is switching in the opposite direction or when the victim and aggressor signals are 180° out of phase, as shown in Figure 8(a). Empirically, with the reduction of wire-segment capacitance within buffers, the crosstalk susceptibility increases. As a result, it is critical to perform crosstalk analysis when we are considering an NCFET buffer-based interconnect. A representative NCFET-based interconnect is shown in Figure 8(b), where we apply a traditional buffer insertion method to meet slew and skew constraints. However, using this technique, the delay variation of the victim line due to crosstalk is 36% more than that of the conventional CMOS-based design. In the proposed buffer insertion method with increased $C_{max}$, the delay variation is 21% and 49% lower than the conventional CMOS- and NCFET-based designs, respectively. The proposed scheme is more robust to crosstalk noise because the victim line has a larger capacitance with a smaller number of buffers, as shown in Figure 8(c).

Fig. 7: The proposed NCFET-based design consumes lower active area compared to the CMOS design: (a) using ISPD 2009 benchmarks, the NCFET-based design requires 7.5% lower average buffer area; and (b) using ISPD 2010, the NCFET-based design requires 6.9% lower average buffer area compared to traditional CMOS-based design.

#### 4.4 Effect of Process Variation

Due to process variation, the transistor threshold voltage $(V_{th})$ can vary and affect the CDN performance. In order to quantify the clock delay variation due to $V_{th}$ variation, we used a 2-level H-tree network distributed in a $0.96mm \times 0.96mm$ area. In this analysis, we consider $\pm 10\%$ $V_{th}$ variation from the nominal value. The CMOS-based design exhibits -9.2% to 1% delay variation. The proposed NCFET-based CDN exhibits -9.5% to 2.6% delay variation due to $\pm 10\%$ $V_{th}$ variation.

TABLE 2: Using 2010 ISPD benchmarks, the proposed NCFET-based clocking scheme enables 73% average power saving and 20% average buffer saving when compared to traditional CMOS buffered-based system, with 21.4ps higher clock skew.

| Benchmark |      |           | Traditional CMOS |         |      | NCFET-based |         |      | NCFET compared to CMOS |         |               |
|-----------|------|-----------|------------------|---------|------|-------------|---------|------|------------------------|---------|---------------|
| ISPD 2010 | Sink | Chip area | Power            | Buffers | Skew | Power       | Buffers | Skew | Power                  | Buffers | $\Delta$ Skew |
|           | (#)  | $(mm^2)$  | (mW)             | (#)     | (ps) | (mW)        | (#)     | (ps) | (%)                    | (%)     | (ps)          |
| cns01     | 1107 | 64.0      | 16.6             | 596     | 31.0 | 4.8         | 515     | 60.0 | 71.2                   | 13.6    | -29.0         |
| cns02     | 2249 | 91.0      | 31.9             | 1072    | 34.0 | 8.6         | 907     | 50.0 | 73.0                   | 15.4    | -16.0         |
| cns03     | 1200 | 1.4       | 2.0              | 251     | 14.0 | 0.4         | 230     | 65.0 | 79.7                   | 8.4     | -51.0         |
| cns04     | 1845 | 5.7       | 4.1              | 247     | 21.0 | 1.0         | 179     | 37.0 | 75.8                   | 27.5    | -16.0         |
| cns05     | 1016 | 5.8       | 2.4              | 146     | 14.0 | 0.5         | 93      | 33.0 | 77.1                   | 36.3    | -19.0         |
| cns07     | 1915 | 3.5       | 3.3              | 374     | 24.0 | 0.8         | 257     | 34.0 | 74.8                   | 31.3    | -10.0         |
| cns08     | 1134 | 2.6       | 2.6              | 208     | 28.0 | 0.6         | 130     | 37.0 | 77.9                   | 37.5    | -9.0          |
| Average   | 1496 | 24.9      | 9.0              | 414     | 23.7 | 2.4         | 331     | 45.1 | 73.4                   | 20.1    | -21.4         |

Fig. 8: Any signal is most vulnerable to crosstalk noise when the victim line signal switches in the opposite direction to the aggressors: (a) testbench to measure the effect of crosstalk in traditional CMOS interconnect; (b) when the NCFET buffers replace the CMOS buffers, the crosstalk-induced delay increases up to 36% compared to the traditional design due to the effective capacitance reduction; (c) the proposed technique reduces the number of buffers and increases robustness by reducing delay variation up to 49%.

#### 4.5 Effect of Supply Voltage Variation

$V_{DD}$ variation is considered to be one of the major sources of variation in a microprocessor's performance. In order to quantify CDN performance variation due to the $V_{DD}$ variation, we used the same 2-level H-tree network. In addition, we considered a typical $\pm 10\%$ $V_{DD}$ variation from the nominal value. According to our analysis, the CMOS system has -6.7% to 6.8% clock delay variation due to $\pm 10\%$ $V_{DD}$ variation. The proposed NCFET-based system has a slightly narrower delay variation range of -5.6% to 7.0% compared to the CMOS-based system due to the smaller number of buffers in the network.

#### 4.6 Effect of Temperature Variation

For temperature variation analysis, we used the same H-tree network with 16 sinks. We observed almost identical clock skew due to temperature variation (ranging from  $-25^{\circ}C$  to  $125^{\circ}C$ ) for the CMOS-based and the proposed NCFET-based CDN schemes. We found that the NCFET-based scheme has only 0.5% more clock skew compared to the CMOS system. However, at room temperature  $(25^{\circ}C)$ , the proposed system has 23.1% lower clock latency compared to the existing CMOS system. The primary reason is the lesser number of buffers in the NCFET-CDN compared to the CMOS-based design.

## 5 CONCLUSION

In this paper, we presented the first NCFET-based clock synthesis scheme to reduce CDN power. The proposed clocking scheme consumes 69.6% and 73.4% lower average power compared to the conventional CMOS-based design using ISPD 2009 and ISPD 2010 benchmarks, respectively, at 1GHz. The proposed technique inserts approximately 20% fewer buffers compared to the standard technique on both the ISPD 2009 and ISPD 2010 benchmarks. The proposed system has 23.1% lower clock latency compared to the existing CMOS system. Better yet, the proposed clocking is more robust to crosstalk noise and exhibits up to 49% lower delay variation compared to the traditional buffer insertion method.

## REFERENCES

- [1] J. Rabaey, "Low Power Design Essentials. Second Edition," Springer Science and Business Media, January 2009.
- [2] T. D. Burd and R. W. Brodersen, "Design issues for dynamic voltage scaling," *ISLPED*, July 2000, pp. 9–14.
- [3] K. J. Nowka and G. D. Carpenter and E. W. MacDonald and H. C. Ngo and B. C. Brock and K. I. Ishii and T. Y. Nguyen and J. L. Burns, "A 32-bit PowerPC system-on-a-chip with support for dynamic voltage scaling and dynamic frequency scaling," *JSSC*, November 2002, pp. 1441–1447.

- [4] S. E. Esmaeili and A. J. Al-Kahlili and G. E. R. Cowan, "Low-Swing Differential Conditional Capturing Flip-Flop for LC Resonant Clock Distribution Networks," *TVLSI*, vol. 20, no. 8, pp. 1547–155, August 2012.
- [5] F. U. Rahman and V. Sathe, "Quasi-Resonant Clocking: Continuous Voltage-Frequency Scalable Resonant Clocking System for Dynamic Voltage-Frequency Scaling Systems," *JSSC*, vol. 53, no. 3, pp. 924–935, March 2018.
- [6] P. Y. Lin and H. A. Fahmy and R. Islam and M. R. Guthaus, "LC resonant clock resource minimization using compensation capacitance," *ISCAS*, pp. 1406–1409, May 2015.
- [7] M. Dave and M. Jain and S. Baghini and D. Sharma, "A Variation Tolerant Current-Mode Signaling Scheme for On-Chip Interconnects," *TVLSI*, vol. 21, no. 2, pp. 342–353, February 2013.
- [8] R. Islam and M.R. Guthaus, "Low-Power Clock Distribution Using a Current-Pulsed Clocked Flip-Flop," TCAS, vol. 62, no. 4, pp. 1156–1164, March 2015.
- [9] R. Islam and M. R. Guthaus, "CMCS: Current-Mode Clock Synthesis," TVLSI, vol. 25, no. 3, Mar 2017, pp. 1054–1062.
- [10] Guthaus, Matthew R. and Wilke, Gustavo and Reis, Ricardo, "Revisiting Automated Physical Synthesis of High-performance Clock Networks," *TODAES*, vol. 18, no. 2, pp. 31:1–31:27, April 2013.
- [11] A. Kahng and J. Cong and G. Robins, "High-performance clock routing based on recursive geometric matching," DAC, pp. 322– 327, June 1991.
- [12] M. A. B. Jackson and A. Srinivasan and E. S. Kuh, "Clock routing for high-performance ICs," DAC, pp. 573–579, June 1990.
- [13] Tsay, R.-S., "Exact zero skew," ICCAD, pp. 336–339, November 1991.
- [14] T. H. Chao and Y. C. Hsu and J. M. Ho, "Zero skew clock net routing," DAC, pp. 518–523, June 1992.
- [15] K. D. Boese and A. B. Kahng, "Zero-skew clock routing trees with minimum wirelength," *ASIC*, pp. 17–21, September 1992.
- [16] Salahuddin, Sayeef and Datta, Supriyo, "Use of Negative Capacitance to Provide Voltage Amplification for Low Power Nanoscale Devices," *Nano Letters*, vol. 8, no. 2, pp. 405–410, 2008.
- [17] X. Li and S. George and K. Ma and W. Y. Tsai and A. Aziz and J. Sampson and S. K. Gupta and M. F. Chang and Y. Liu and S. Datta and V. Narayanan, "Advancing Nonvolatile Computing With Nonvolatile NCFET Latches and Flip-Flops," *TCAS*, vol. 64, no. 11, pp. 2907–2919, November 2017.
- [18] A. Sharma and K. Roy, "Design Space Exploration of Hysteresis-Free HfZrOx-Based Negative Capacitance FETs," EDL, vol. 38, no. 8, pp. 1165–1167, August 2017.
- [19] Muhammad Abdul Wahab and Muhammad A. Alam, "A Verilog-A Compact Model for Negative Capacitance FET," https://nanohub.org/publications/95/5, November 2017.
- [20] Tellez, G.E. and Sarrafzadeh, M., "Minimal buffer insertion in clock trees with skew and slew rate constraints," *TCAD*, vol. 16, no. 4, pp. 333–342, April 1997.
- [21] C. N. Sze, P. Restle, G. J. Nam, and C. J. Alpert, "Clocking and the ISPD'09 clock synthesis contest," *ISPD*, March 2009, pp. 149–150.
- [22] C. N. Sze, "ISPD 2010 High Performance Clock Network Synthesis Contest," ISPD, March 2010.
- [23] Shaloo Rakheja and Dimitri Antoniadis, "MVS Nanotransistor Model (Silicon)," https://nanohub.org/publications/15/4, December 2015.



**Riadul Islam** is currently an assistant professor in the Department of Electrical and Computer Engineering at the University of Michigan-Dearborn, MI. In his Ph.D. dissertation work at UCSC, Riadul designed the first current-pulsed flip-flop/register that resulted in the first-ever one-to-many current-mode clock distribution networks for high-performance microprocessors. From 2007 to 2009, he worked as a full-time faculty member in the Department of Electrical and Electronic Engineering of the University of Asia Pacific, Dhaka, Bangladesh. He is a member of the IEEE, IEEE Circuits and Systems (CAS) society. He holds one US patent and several IEEE/ACM/Springer Nature journal and conference publications in TVLSI, TCAS, JETTA, ISCAS, MWSCAS, ISQED, and ASICON. His current research interests include digital, analog, and mixed-signal CMOS ICs/SOCs for a variety of applications; verification and testing techniques for analog, digital and mixed-signal ICs; hardware security; CAD tools for design and analysis of microprocessors and FPGAs; automobile electronics; and biochips.