# Clockless Spintronic Logic: A Robust and Ultra-Low Power Computing Paradigm

## Yu Bai, Ronald F. DeMara, Jia Di, and Mingjie Lin

Department of Electrical and Computer Engineering, University of Central Florida, Orlando, FL 32816-2362 November 1, 2017

Abstract—Asynchronous logic offers the advantages of no clock tree, robust circuit operation, avoidance of worst-case timing margins, and a reduced emission spectrum. Thus, computational paradigms are sought to attain advantages of clockless logic by leveraging the complementary characteristics of emerging devices and CMOS transistors within novel circuit designs. This paper introduces Spin Torque Enabled NULL Convention Logic (STENCL), which exploits the physical characteristics of non-volatile Domain-Wall (DW) and memristive devices to realize the Quasi-Delay-Insensitive (QDI) NULL Convention Logic (NCL) asynchronous design methodology. First, a formal algorithm is developed to transform NCL-based threshold m-of-n gate realizations to STENCL, in order to generate the corresponding input memristance and NULL module memristance required for nominal currents achieving DW device biasing. Second, hysteresis and set/reset conditions are realized by determining the corresponding current fluctuations required to move the DW within each threshold logic gate to realize all 27 foundational NCL gate structures, which are then simulated to assess energy and delay metrics. Third, a case study of a four-stage pipelined 32-bit IEEE single-precision floating point co-processor implemented as a dual-rail STENCL architecture is compared to a conventional CMOS-based NCL design implemented in an IBM SOI1250 45nm CMOS process. Fourth, a sensitivity analysis is performed to assess the impact of write accuracy and drift on memristor and DW device operation. Results indicate that STENCL-based designs achieve between a 2-fold and 20-fold reduction in energy consumption, with up to an 8-fold reduction in area, over an equivalent CMOS-based NCL design for 32-bit full adders.
Comparisons for various four-stage pipelined 32-bit IEEE single-precision floating-point co-processors and ISCAS benchmarks further substantiate those benefits for operation within acceptable tolerances at identical process technology nodes.

Index Terms—Asynchronous logic; NULL Convention Logic (NCL); Domain Wall (DW); Memristor; Quasi-Delay Insensitive (QDI) Logic

## **1** INTRODUCTION

Quasi-Delay Insensitive (QDI) circuits offer several potential benefits, such as low Process Voltage Temperature (PVT) susceptibility, module reusability due to clockless operation, and a much-coveted correct-by-construction property, i.e., timing analysis is not vital to ensure correct operation [1]. Among many architectural variations of asynchronous circuits, NULL Convention Logic (NCL) remains a popular and proven candidate [2]. Nonetheless, NCL circuits have some notable shortcomings despite their significant advantages. First, the correct operation of an NCL circuit depends on some active or semi-static realization of hysteresis, which implies the presence of an appropriate state-holding mechanism [3]. Second, its use of dual-rail logic signaling based on 1-hot delay-insensitive encoding requires two wires per bit. Finally, NCL circuit design requires utilization of adapted EDA tools capable of leveraging this signaling style, as compared to synchronous design [4]. We address each of those issues herein.

To access the benefits of NCL while minimizing costs, innovations in circuit design are sought. Fortunately, emerging spintronic device technology offers two valuable attributes towards such goals:

- intrinsic hysteresis property: one essential requirement for NCL's correct operation is to maintain QDI behavior via hysteresis, which is expensive in terms of area and energy to implement using MOS transistors. Interestingly, some emerging spintronic-based devices naturally exhibit physical properties similar to hysteresis [5]. Therefore, it becomes plausible to devise innovative circuit designs to natively exploit such behaviors without complicated control mechanisms.
- near-zero leakage current: spintronic devices, such as Magnetic Tunnel Junctions (MTJs) [6], [7], [8], spin valves, and Domain Wall (DW) devices, offer near-zero leakage power operation while supporting fast switching speed and facilitating area-sparing vertical integration.

In this paper, we propose a QDI computational paradigm based on magnetic DW logic called Spin Torque Enabled NULL Convention Logic (STENCL). Major contributions include:

- 1) Costly control modules supporting hysteresis of NCL state-holding behavior, which have high transistor counts, are avoided by utilizing the inherent switching properties of DW devices.
- 2) Leveraging emerging device technology for improved energy and area performance is attracting increasing research interest. However, most existing studies focus on using spintronic devices as high-performance switching devices, and therefore follow almost identical circuit design methodologies as CMOS [9]. We instead pioneer an emerging device approach. As a result, the correct operation of our STENCL approach only requires the device parameters to be in a predetermined range, thus increasing tolerance to multiple sources of parametric variation.

<sup>•</sup> Y. Bai is with the Department of Engineering and Computer Science, California State University, Fullerton, CA 92831. E-mail: ybai@fullerton.edu

<sup>•</sup> J. Di is Professor and 21st Century Research Leadership Chair of Department of Computer Science and Computer Engineering, University of Arkansas, AR 72701.

<sup>•</sup> Ronald F. DeMara and M. Lin are with the Department of Electrical and Computer Engineering, University of Central Florida, FL 32816.

The remainder of this paper is organized as follows. Section 2 introduces the fundamental concept and logic realization techniques underlying the NCL paradigm. Section 3 briefly describes the operational physics of a DW device. In particular, we identify the relevant relationships between the physics of spintronic devices and NCL techniques. Sections 4 and 5 describe why spin-torque-driven NCL design is significant and provide the implementation details of representative threshold gates using spin-based devices. In Section 6, a transformation algorithm is developed to convert Boolean NCL circuits to their corresponding STENCL realization. Based on NCL gate design methodology, the dual-rail STENCL architecture is proposed in Section 7. Section 8 presents a performance comparison using various case studies for the baseline STENCL architecture. In Section 9, the architecture is extended using an improved handshaking scheme. Subsequently, Section 10 presents area and delay results for numerous benchmarks. Finally, we conduct an error analysis in Section 11 and conclude the manuscript in Section 12.

## **2** NCL CONCEPT AND CIRCUIT IMPLEMENTATION

NCL circuits typically consist of cascaded logic and registration stages, which can be finely pipelined by inserting additional registers. As shown in Figure 1 (a), two adjacent register stages interact through their request and acknowledge signals,  $K_i$  and  $K_o$ , respectively [10]. To prevent the present Boolean {TRUE, FALSE} signals from overwriting the previous DATA wavefront, these two DATA wavefronts are separated by a NULL wavefront, which represents a spacer signal condition to clear Boolean states prior to the subsequent DATA wavefront. The acknowledge signals are combined in the completion detection circuitry to produce the request signal to the previous register stage, utilizing either a full-word or bit-wise completion strategy. Specifically, the NCL circuit methodology leverages two core ideas, dual-rail signaling and NULL signal propagation, in order to achieve QDI operation. In NCL, each dual-rail signal, D, transported by two wires,  $(D^0, D^1)$ , can assume one of three possible values being logic 0, logic 1, and a NULL state which are encoded as (1,0), (0,1), and (0,0), respectively. The NULL state denotes that the value of *D* is not yet available. Note that the assertion of  $D^0$  and  $D^1$  are mutually exclusive, such that both rails can never be asserted simultaneously, therefore (1,1) is defined as an illegal state.
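The dual-rail encoding described above can be sketched in a few lines of Python. This is only an illustrative model: the names `DualRail`, `decode`, and `encode_bit` are hypothetical and do not come from the paper; the rail pairs follow the text, with $(D^0, D^1)$ equal to (1,0) for logic 0, (0,1) for logic 1, (0,0) for NULL, and (1,1) rejected as illegal.

```python
from enum import Enum


class DualRail(Enum):
    """Dual-rail values keyed by their (D0, D1) rail pair, as given in the text."""
    NULL = (0, 0)   # spacer: value of D is not yet available
    DATA0 = (1, 0)  # logic 0
    DATA1 = (0, 1)  # logic 1


def decode(d0: int, d1: int) -> DualRail:
    """Map a rail pair to its dual-rail value; (1, 1) is the illegal state."""
    if (d0, d1) == (1, 1):
        raise ValueError("illegal dual-rail state: both rails asserted")
    return DualRail((d0, d1))


def encode_bit(bit: int) -> tuple:
    """Encode a Boolean bit as a (D0, D1) rail pair (hypothetical helper)."""
    return DualRail.DATA1.value if bit else DualRail.DATA0.value


if __name__ == "__main__":
    # A DATA wavefront for the 2-bit word 10, followed by a NULL spacer.
    data_wavefront = [encode_bit(1), encode_bit(0)]
    null_wavefront = [DualRail.NULL.value] * 2
    print([decode(*pair).name for pair in data_wavefront])
    print([decode(*pair).name for pair in null_wavefront])
```

Keying the enumeration by the rail pair makes the one-hot property explicit: any pair outside the three legal values fails to decode.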

Figure 1: NCL logic signaling scheme: input wavefronts are controlled by local handshaking and completion detection signals. (a) Conventional NCL pipeline, (b) Symbol and structure of TH23 threshold gate, (c) Implementation of logic function $Z = X \oplus Y$, and (d) Two-bit register and completion detector.

NCL utilizes threshold gates exhibiting hysteresis behavior for its fundamental circuit elements. Its generic threshold logic primitive is the $\text{TH}_{m,n}$ gate with n inputs $(1 \le m \le n)$, where at least m of n inputs must be asserted before the output will become asserted. The typical gate symbol denoting a TH23 gate is shown in Figure 1 (b). Threshold gates can be cascaded to construct NCL combinational logic blocks, NCL registers, and completion detectors. Figure 1 (c) illustrates the implementation of an NCL combinational logic block $Z = X \oplus Y$ using threshold gates. Figure 1 (d) depicts the implementation of a 2-bit NCL register and a 2-bit completion detector using threshold gates. Generally, the implementation of an *n*-bit NCL register requires 2n TH22 gates, and the implementation of an *n*-bit completion detector requires *n* 2-input OR gates, i.e., TH12 gates, and an *n*-input C-element, i.e., a THnn gate. One important property for the design of NCL circuits is that a set of only 27 fundamental NCL gates can implement any logic function with four or fewer variables, i.e., the set is logically complete [11].
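A behavioral sketch of the $\text{TH}_{m,n}$ primitive may help make the hysteresis concrete. It assumes the usual NCL convention, consistent with the set and reset behavior described later in the paper, that the output is asserted once at least m inputs are asserted and is cleared only when all inputs return to zero. The class `ThresholdGate` and the helper `register_and_completion_gate_counts` are hypothetical names introduced for illustration.

```python
class ThresholdGate:
    """Behavioral model of a THmn gate with hysteresis (illustrative only)."""

    def __init__(self, m: int, n: int):
        if not 1 <= m <= n:
            raise ValueError("threshold must satisfy 1 <= m <= n")
        self.m, self.n = m, n
        self.output = 0  # gate starts in the NULL (deasserted) state

    def update(self, inputs) -> int:
        """Apply one input vector and return the resulting output."""
        if len(inputs) != self.n:
            raise ValueError("expected %d inputs" % self.n)
        asserted = sum(1 for bit in inputs if bit)
        if asserted >= self.m:
            self.output = 1   # set: threshold reached
        elif asserted == 0:
            self.output = 0   # reset: all inputs deasserted
        # otherwise hold the previous output (hysteresis)
        return self.output


def register_and_completion_gate_counts(n_bits: int) -> dict:
    """Gate counts quoted in the text for an n-bit register and completion detector."""
    return {"TH22": 2 * n_bits, "TH12": n_bits, "THnn": 1}


th23 = ThresholdGate(2, 3)
vectors = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 0, 0), (0, 0, 0)]
trace = [th23.update(v) for v in vectors]
print(trace)  # [0, 0, 1, 1, 0]
print(register_and_completion_gate_counts(2))
```

The fourth vector shows the hold behavior: only one input remains asserted, yet the output stays at 1 until every input has returned to zero.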

## **3** DOMAIN WALL (DW) DEVICE AND MEMRISTOR PHYSICS

A DW device employs linear displacement of magnetic domains to encode information. In Figure 2 (a), the conceptual view of a DW device is depicted. When current flows through the nano-strip, it exerts an effect on the magnetic moment, which is observed as DW motion. By controlling the position of the DW, a current-induced DW device in a three-terminal structure can be used to implement non-volatile memory and logic operations.

Figure 2: Schematic illustration of DW device depicting up-spin and down-spin domains. (a) Simplified conceptual view, (b) Realistic three-terminal DW structure, and (c) Equivalent circuit view.

In Figure 2 (b), the two terminals T1 and T2 of the magnetic nano-strip have fixed magnetization in anti-parallel directions. The bidirectional currents are injected into the wire laterally through terminals T1 and T2, and drag the DW back and forth; the DW position encodes the stored data bit. An MTJ device is placed on top of the nano-strip, thus providing a fixed-polarity magnetic head for reading the resistance of the three-terminal DW structure. Usually, T1 receives the input current and T2 is tied to ground. Once the current is applied, the spin polarity of domain D1 is written parallel to T1. Therefore, the DW can move through the nano-strip by current injection, leading to the switching of the spin polarity at a specific location [12], [13], [14], [8]. The areas occupied by D1 and D2 are sensed using the MTJ to determine parallel and anti-parallel regions of spin. The ratio between these two regions exhibits a difference in resistance, which is dictated by the DW position. The equivalent circuit is shown in Figure 2 (c). Two variable resistors are used to represent the resistances of the anti-parallel and parallel regions, respectively. The fixed resistor in Figure 2 (c) is used to represent the DW region depicted in Figure 2 (b).



Figure 3: (a) Simulation of DW motion by current injection in terminal T1, the DW moves towards the left by the spin-polarized electrons and (b) Compact model indicates agreement with micromagnetic simulation for DW motion velocity,  $V_{DW}$ , as a function of current density, *J* [15].

To illustrate and validate such behavior, we have conducted a DW motion simulation with the mumax<sup>3</sup> simulator. Results presented in Figure 3 (a) show that DW motion having a fixed velocity can be obtained by injecting a current density of $1.5 \times 10^{13} \text{ A/m}^2$ into terminal T1 of Figure 2 (b). This simulation utilized the device parameters: damping coefficient $\alpha = 0.02$, uniaxial anisotropy constant $K_u = 5.9 \times 10^5 \text{ J/m}^3$, saturation magnetization $M_s = 6 \times 10^5 \text{ A/m}$, exchange stiffness $A_{\text{ex}} = 1 \times 10^{11}$, and polarization $P = 1$ [13]. Terminal T3 is used to read the position of the DW according to the MTJ resistance. The resistance model of the MTJ is based on the supplied voltage, tunneling oxide thickness ($t_{\text{ox}}$), and angle of magnetization between the free layer and the pinned layer. The resistance model of the device is divided into three regions: the parallel region, the anti-parallel region, and the DW region. Let x denote the DW position at its middle point, let L denote the length of the free layer (100nm), and let W denote the width of the free layer. As depicted in Figure 2 (b), $RA_P$, $RA_{AP}$, and $RA_{DW}$ are the MTJ resistance-area products for the parallel region, anti-parallel region, and DW region, respectively. Those values were selected as follows: $RA_P = 2\,\Omega \cdot \mu m^2$, $RA_{AP} = 5\,\Omega \cdot \mu m^2$, and $RA_{DW} \approx 3.5\,\Omega \cdot \mu m^2$ [12], [16]. Within Figure 2 (b), the resistance of the leftmost region is calculated as $RA_P(x) = \frac{RA_P}{W(L-x+L_{DW})}$ and the resistance of the rightmost region is calculated as $RA_{AP}(x) = \frac{RA_{AP}}{W(x+L_{DW})}$. The resistance of the DW region is calculated as $RA_{DW} = \frac{RA_{DW}}{W L_{DW}}$. Therefore, the vertical resistance is given by the parallel combination $RA_P \parallel RA_{AP} \parallel RA_{DW}$. Meanwhile, the resistance model for the DW is given by $\frac{A}{B \cdot x + C}$, where $A = RA_{AP} \cdot RA_{P} \cdot RA_{DW}$, $B = (RA_{AP} - RA_{P}) RA_{DW} \cdot W$, and $C = RA_P \cdot RA_{DW} \cdot W \cdot L + (RA_{AP} \cdot RA_P - 0.5 RA_P \cdot RA_{DW} - 0.5 RA_{AP} \cdot RA_{DW}) W \cdot L_{DW}$. Therefore, the output voltage can be computed as a function of the DW position (0 < x < 100 nm). Finally, Figure 3 (b) exhibits a hysteresis phenomenon found in the DW switching characteristics: it shows the critical current simulation for DW motion having velocity $V_{DW}$ as a function of current density, J. The DW velocity is equal to zero whenever the input current is less than the critical current. This critical current can be adjusted either by choosing memristances or by choosing the device width.
In this paper, we employ specific combinations of the input and NULL module memristances to achieve the hysteresis property required to realize an NCL threshold logic gate.
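The closed-form resistance model $\frac{A}{B \cdot x + C}$ can be evaluated numerically as a rough sanity check. The sketch below uses the resistance-area products and free-layer length quoted in the text; the free-layer width `W_UM` and the DW length `L_DW_UM` are not stated alongside the model, so the values used here are assumptions chosen only for illustration, and the function name `dw_resistance` is hypothetical.

```python
# Resistance-area products from the text, in ohm * um^2.
RA_P = 2.0
RA_AP = 5.0
RA_DW = 3.5

L_UM = 0.100      # free-layer length from the text (100 nm), in um
W_UM = 0.020      # ASSUMPTION: free-layer width of 20 nm, in um
L_DW_UM = 0.010   # ASSUMPTION: domain-wall length of 10 nm, in um


def dw_resistance(x_um: float, w: float = W_UM, l: float = L_UM,
                  l_dw: float = L_DW_UM) -> float:
    """Evaluate A / (B*x + C) in ohms for a DW position x given in um."""
    a = RA_AP * RA_P * RA_DW
    b = (RA_AP - RA_P) * RA_DW * w
    c = (RA_P * RA_DW * w * l
         + (RA_AP * RA_P - 0.5 * RA_P * RA_DW - 0.5 * RA_AP * RA_DW) * w * l_dw)
    return a / (b * x_um + c)


if __name__ == "__main__":
    # Sweep the DW position across the free layer.
    for x_nm in (0, 25, 50, 75, 100):
        print("x = %3d nm -> R = %.0f ohm" % (x_nm, dw_resistance(x_nm / 1000.0)))
```

Because $B > 0$ for $RA_{AP} > RA_P$, the modeled resistance falls monotonically as x increases, which is what allows the DW position to be read out as a voltage.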



Figure 4: (a) Memristor structure with switch transistor and (b) Memristor resistance with different writing currents.

In the proposed architecture, a STENCL gate employs memristors whose conductance can be precisely modulated by the charge or flux through them. A weighted current flow can be generated from a constant $V_{dd}$ through different programmable memristor configurations. Figure 4 shows the architecture of the memristor memory write and read scheme. The control signal generated by the Read/Write (R/W) module switches the memristor connection between a constant $V_{dd}$ supply and a write voltage. For write operations, a write voltage pulse is applied. During programming, a vulnerability arises: a voltage applied across two cross-connected memristors can induce sneak current paths through other devices, which can disturb the state of unselected memristors. To overcome this issue, Manem [17] proposed using an access transistor and diodes to facilitate disturb-free write operations on the selected device, while Jung [18] proposed a reduced-area design without transistors and diodes. However, additional delay is incurred in such designs, as only one memristor can be written at a time. The simulation of the proposed architecture under various write currents is shown in Figure 4 (b), where distinct memristances are generated based on the model in [19]. From Figure 4 (b), it is observed that a larger write current produces a more steeply rising resistance curve, whereas a smaller write current produces a shallower one.

## 4 MOTIVATION FOR SPIN TORQUE NULL CONVENTION LOGIC

Among the CMOS-based circuit realizations of NCL, a static NCL gate implementation can attain a suitable design offering faster and more reliable operation than other CMOS-based alternatives. However, its area utilization and energy consumption are notably high, as compared to a corresponding clocked CMOS logic implementation. The conventional static NCL gate is shown in Figure 5 (b), which is comprised of four transistor networks: SET, RESET, HOLD0, and HOLD1, each of which uses CMOS transistors. According to TH gate functionality, the SET and HOLD1 function of an NCL static gate with n inputs can be expressed as:

$$\begin{aligned}
HOLD1 &= I_1 + I_2 + \dots + I_n \\
Z &= SET + (Z^- \times HOLD1)
\end{aligned} \tag{1}$$

where $Z^-$ is the previous output value of the static NCL gate and Z is the current output value. The RESET function of an NCL static gate with n inputs can be expressed as:

$$Z' = RESET + (Z^{-'} \times HOLD0) \tag{2}$$

where Z' is the complement of Z, and $Z^{-'}$ is the complement of the previous output value of the static NCL gate. In Figure 1 (b), the TH23 static NCL gate is given. The functions of the four CMOS networks are given by:

$$\begin{aligned}
SET &= AB \\
HOLD1 &= A + B \\
RESET &= A'B' \\
HOLD0 &= A' + B'
\end{aligned} \tag{3}$$
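As a small consistency check on Equations (1) to (3), the sketch below evaluates both the Z and Z' expressions for the two-input networks listed in Equation (3) and confirms that they are complementary for every combination of inputs and previous output. The function names are hypothetical and the code is a Boolean model only, not a transistor-level description.

```python
from itertools import product


def next_z(a: int, b: int, z_prev: int) -> int:
    """Equation (1) with the SET and HOLD1 networks of Equation (3)."""
    set_net = a & b
    hold1 = a | b
    return set_net | (z_prev & hold1)


def next_z_complement(a: int, b: int, z_prev: int) -> int:
    """Equation (2) with the RESET and HOLD0 networks of Equation (3)."""
    not_a, not_b, not_z_prev = 1 - a, 1 - b, 1 - z_prev
    reset_net = not_a & not_b
    hold0 = not_a | not_b
    return reset_net | (not_z_prev & hold0)


# Exhaustively compare the pull-up and pull-down descriptions.
consistent = all(
    next_z_complement(a, b, z) == 1 - next_z(a, b, z)
    for a, b, z in product((0, 1), repeat=3)
)
print(consistent)  # True
```

The check makes the role of the HOLD networks visible: whenever exactly one input is asserted, neither SET nor RESET is active and the output is determined purely by its previous value.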

Thus, this QDI-capable NCL gate requires additional transistors to realize circuitry for HOLD0 and HOLD1 that detract significantly from the overall circuit area efficiency. The area cost of CMOS-based NCL is typically 1.5-fold to 2-fold greater than that of the corresponding conventional synchronous CMOS-based circuit. For example, as identified in [20], a four-stage pipelined 32-bit IEEE single-precision floating-point co-processor was implemented as both a synchronous CMOS-based circuit and a CMOS-based asynchronous NCL circuit. The given design utilized an IBM SOI1250 45nm process and was evaluated for performing addition, subtraction, and multiplication. The synchronous CMOS circuit utilized 104,571 transistors, whereas the asynchronous NCL circuit required 158,059 transistors, i.e., roughly 1.5 times as many.

To help address the area challenges facing NCL, it is feasible for DW devices to offer a partial replacement for numerous CMOS transistors within the NCL hold subcircuits. DW devices exhibit fast switching times; however, their device physics dictates the utilization of spin torque mechanisms, which give rise to hysteresis switching behavior. The hysteresis switching behavior follows DW device transfer characteristics, as shown in Figure 2 (c). The DW moves if the combined input current exceeds the positive critical current $I_c$ or falls below the negative critical current $-I_c$. According to device physics, a typical DW with $3 \times 20 \times 100\,\text{nm}^3$ dimensions has critical current densities of $J_{c,i}^1 = 5.2 \times 10^{12}\,\text{A/m}^2$ in the positive direction and $-5.2 \times 10^{12}\,\text{A/m}^2$ in the negative direction. Therefore, the hysteresis dictated by DW device physics can be leveraged to avoid the use of HOLD transistors within CMOS-based NCL realizations. Furthermore, the vertical integration of a DW device and memristor can dramatically reduce hardware area requirements. For instance, the layout of DW devices and their associated control transistors is shown in Figure 6. It depicts a two-bit DW device with access transistors, which achieves twice the area density of a single DW device.



Figure 5: (a) A non-zero current is injected to achieve DW motion with a hysteresis characteristic [21], [22], (b) CMOS implementation of an NCL TH23 gate, and (c) STENCL TH23 gate.

## 5 PROPOSED STENCL REALIZATION

To realize the needed hysteresis behavior, an early attempt towards building a C-element realizing a THnn NCL threshold gate using CMOS and MTJ devices was introduced in [32]. However, this approach involves complex CMOS control logic and multiple MTJs, resulting in high write power consumption. Herein, a more efficient NCL gate structure is proposed leveraging physical characteristics of spintronic devices. The hysteresis phenomenon of DW devices is illustrated in Figure 5 (a), where hysteresis is exhibited between the applied current and the resulting polarization. The scaling of such hysteresis behavior can be adjusted by either the resistance of the memristor or the DW device width. In the proposed STENCL approach, the hysteresis blocks with CMOS devices shown in Figure 5 (b), i.e., HOLD0 and HOLD1, are replaced by a DW device. In Figure 5 (c), the DW device utilizes write current along the lateral path $d_1$ to $d_3$, while read current flows along the vertical path. Meanwhile, $d_1$ and $d_3$ construct an input current path to perform logical set and reset. The DW $d_2$ moves through the free layer with a velocity that depends on the magnitude, direction, and duration of the given lateral current. If the set operation is active (the number of active inputs reaches the threshold), then the proposed architecture generates a current that is larger than the critical current to move the DW from $d_1$ to $d_3$. On the other hand, when the reset operation is active such that all inputs are zero, the opposite current generated at terminal $d_3$ pushes the DW from $d_3$ to $d_1$. The hold operations employ the hysteresis characteristics of the DW device by generating a lateral current smaller than the critical current. During a read operation, $V_{pa}$ and $V_{pb}$ provide a constant sensing current to read the DW position.



Figure 6: (a) Layout of single DW with two access transistors and (b) Layout of two bit DW with three access transistors.

To sense the DW position, separate read and write paths help to reduce stress on the oxide layer of the MTJs. The supplied voltages, $V_{pa}$ and $V_{pb}$, are applied on two terminals to sense the DW position using the access transistor. Distinct clock signals are also needed to control the sensing of NCL gates. They can realize the required delay element for various NCL layer gates. A similar scheme within a C-element asynchronous circuit was proposed by Zianbetov [23]. According to the DW position, the reference is in the $2.5\,\mathrm{k}\Omega$ range and achieves the largest sensing margin between $V_{pa}$ and $V_{pb}$ of approximately 350mV. Therefore, we set $V_{pa}$ and $V_{pb}$ as 50mV and -50mV in order to maintain an adequate sensing margin. Within the NULL domain, the inputs are all 0. Therefore, the difference between the sum of write currents and the NULL current exceeds the resetting critical current. Thus, the DW moves back to its initial position and is ready to perform the next calculation.

## 6 TRANSFORMATION FROM BOOLEAN NCL TO STENCL

To utilize DW devices within QDI designs at the architecture level, a concise approach to align the various design tool libraries is developed. To date, several NCL design automation flows have been proposed to address two main issues: 1) synthesizing a synchronous register-transfer level (RTL) design into an NCL netlist while enforcing input completeness and observability; 2) optimizing the circuit under a predefined cost function [24], [25], [26], [27]. In this paper, we employ elements from those existing NCL design flows to synthesize an RTL design into an input-complete and observable NCL netlist. However, our method differs in the technology-mapping approach used and the levels at which optimization occurs. In Figure 7, the proposed STENCL design flow is shown.

The input to the STENCL design flow is an RTL design, which is subsequently partitioned into smaller modules. A constraint is enforced to synthesize combinational blocks via two-input Boolean gates. Specifically, this procedure is imposed by the contents of the STENCL gate library. While the decomposition of a logic expression containing several input terms is a needed synthesis procedure, partitioning of DW-based gates into smaller expressions is non-trivial and may also introduce gate orphans. Thus, the STENCL flow first replaces individual Boolean signals with their dual-rail counterparts, and then remaps the Boolean realization to utilize an equivalent network of STENCL threshold gates. The proposed STENCL mapping techniques are based on the UNCLE design compiler [24], which conducts a constraint-based minimization of CMOS-based NCL. In addition to systematic Boolean and general cell merging / mapping techniques, we also utilize the physical constraint information received from design mapping, such as the fan-in constraint of memristor devices. Finally, the transformation algorithm converts the optimized NCL gates to corresponding STENCL gates via library-level substitution.

In the previous section, a memristor associated with the DW device is used to generate the required weights and thresholds of NCL realizations. The transformation algorithm from NCL Boolean logic to STENCL form is therefore identified here. The procedure for generating distinct input memristances and the NULL module memristance is delineated in Algorithm 1. Following Algorithm 1, the weights and thresholds of NCL Boolean logic are converted to DW device physics equivalents and the corresponding memristance conditions. Before introducing the transformation procedure, some default definitions and values have to be declared, which are listed in steps 1-8 of Algorithm 1. The given Boolean NCL-based netlist G is the input to the algorithm. The indices i and j denote an NCL gate and an individual input of that gate, respectively. The given $V_{dd}$ is used to generate different weighted currents. Meanwhile, $T_i$ and $w_{i,j}$ are assigned by the functions Thres(G) and Weigh(G), which read the logic thresholds and weights of each NCL gate from the given Boolean NCL netlist. Algorithm 1 outputs the calculated memristances of the inputs, $m_{i,j}$, and of the NULL module, $M_i$, constrained to lie between $m_{min}$ and $m_{max}$. The values of $m_{min}$ and $m_{max}$ are obtained from memristor device physics. In the present case, the range is from $100\,\Omega$ to $38000\,\Omega$. The two critical current densities of the DW device ($J_{c,i}^1$ and $J_{c,i}^2$) are used to achieve the hysteresis behavior of NCL [13]. The critical current density $J^2_{c,i} = 6.2 \times 10^{12}\,\mathrm{A/m^2}$ causes domain wall motion with a velocity of 20m/s, whereas a current density of $J^1_{c,i} = 5.2 \times 10^{12}\,\mathrm{A/m^2}$ leaves the DW stationary. The critical currents $I_{c,i}^1$ and $I_{c,i}^2$ are calculated from the injection area and the critical current densities.

Figure 7: STENCL design flow.

To explain the algorithm, we consider the two Boolean NCL gates TH23W2 and TH44, with logic functions f = A + BC and f = ABCD, as examples. For the Boolean NCL function f = A + BC, the three input weights are (2,1,1) and the threshold is 2. Since the input weights are not all equal, steps 19-27 of the algorithm are invoked. Under those conditions, the three input memristances and the NULL module memristance are calculated for the function f = A + BC as follows, where the memristances of A, B, and C are denoted $m_{1,1}, m_{1,2}, m_{1,3}$, respectively:

**Case 1: HOLD 1:**
The sum of input currents is smaller than the threshold and thus does not incur DW motion. Namely, $\frac{V_{dd}}{m_{1,2}} - \frac{V_{dd}}{M_1} < I_{c,1}^1$ and $\frac{V_{dd}}{m_{1,3}} - \frac{V_{dd}}{M_1} < I_{c,1}^1$ are both true.

**Case 2: SET:**
The sum of input currents is larger than the threshold value required to incur DW motion. Namely, $2 \cdot \frac{V_{dd}}{m_{1,2}} - \frac{V_{dd}}{M_1} > I_{c,1}^2$ and $\frac{V_{dd}}{m_{1,1}} - \frac{V_{dd}}{M_1} > I_{c,1}^2$ are both true.

**Case 3: HOLD 0:**
The sum of input currents relative to the negative threshold does not cause the DW to return to its original location. Namely, $\frac{V_{dd}}{m_{1,2}} - \frac{V_{dd}}{M_1} > -I_{c,1}^1$ and $\frac{V_{dd}}{m_{1,1}} - \frac{V_{dd}}{M_1} > -I_{c,1}^1$ are both true.

**Case 4: RESET:**
The sum of input currents is zero, causing the DW to return to its initial position. Hence, the condition $-\frac{V_{dd}}{M_1} < -I_{c,1}^2$ is true.
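As a worked arithmetic check of the RESET condition, the critical currents follow from the injection area and critical current densities. The values $V_{dd} = 0.3\,\text{V}$ and $S = 40\,\text{nm}^2$ are taken from steps 1-2 of Algorithm 1; the rounding below is ours.

$$\begin{aligned}
I_{c,1}^1 &= S \cdot J_{c,i}^1 = (40 \times 10^{-18}\,\text{m}^2)(5.2 \times 10^{12}\,\text{A/m}^2) = 208\,\mu\text{A} \\
I_{c,1}^2 &= S \cdot J_{c,i}^2 = (40 \times 10^{-18}\,\text{m}^2)(6.2 \times 10^{12}\,\text{A/m}^2) = 248\,\mu\text{A} \\
-\frac{V_{dd}}{M_1} < -I_{c,1}^2 \;&\Longrightarrow\; M_1 < \frac{V_{dd}}{I_{c,1}^2} = \frac{0.3\,\text{V}}{248\,\mu\text{A}} \approx 1209.7\,\Omega
\end{aligned}$$

Under these assumptions, the RESET condition alone caps the NULL module memristance at roughly $1209.7\,\Omega$.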

The possible memristances of the three inputs and the NULL module are given by the set of cases listed above. The memristance of input A is $m_{i,A} = 608\Omega$, the memristance of input B is $m_{i,B} = 1209\Omega$, the memristance of input C is $m_{i,C} = 1209\Omega$, and the memristance of the NULL module is $M_i = 1209\Omega$. For the TH44 gate realizing f = ABCD, the method is similar. The memristance of input A is $m_{i,A} = 2418\Omega$, the memristance of input B is $m_{i,B} = 2418\Omega$, the memristance of input C is $m_{i,C} = 2418\Omega$, the memristance of input D is $m_{i,D} = 2418\Omega$, and the memristance of the NULL module is $M_i = 1209\Omega$.
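The equal-weight constraints of Algorithm 1 (steps 13-17) can be checked numerically against the TH44 values reported above. The sketch below is an illustration under stated assumptions, not the authors' tool: it takes $V_{dd} = 0.3\,\text{V}$ and $S = 40\,\text{nm}^2$ from steps 1-2 of Algorithm 1, and the function name `check_equal_weight_gate` and the dictionary keys are hypothetical.

```python
V_DD = 0.3                 # supply voltage in volts (Algorithm 1, step 1)
S = 40e-18                 # DW injection area in m^2 (40 nm^2, step 2)
I_C1 = S * 5.2e12          # critical current for zero DW velocity, in amperes
I_C2 = S * 6.2e12          # critical current for 20 m/s DW velocity, in amperes
M_MIN, M_MAX = 100.0, 38000.0  # memristance range in ohms


def check_equal_weight_gate(threshold: int, weight: int,
                            m_in: float, m_null: float) -> dict:
    """Evaluate the five constraints for a gate whose inputs share one weight."""
    i_in = V_DD / m_in      # current contributed by one asserted input
    i_null = V_DD / m_null  # opposing NULL module current
    return {
        "set": threshold * i_in - i_null > I_C2,
        "hold1": (threshold - weight) * i_in - i_null < I_C1,
        "reset": -i_null < -I_C2,
        "hold0": i_in - i_null > -I_C1,
        "device": M_MIN < m_in < M_MAX and M_MIN < m_null < M_MAX,
    }


# TH44 with the memristances reported in the text.
th44 = check_equal_weight_gate(threshold=4, weight=1, m_in=2418.0, m_null=1209.0)
print(th44)
```

With these numbers the SET and RESET margins are small (a fraction of a microampere), which gives some intuition for why the paper later examines sensitivity to write accuracy and drift.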

Algorithm 1 was applied to the 27 standard TH gate truth tables to verify results. It shows that the various TH gates can be classified into five different categories according to their threshold. In particular, the corresponding parameters for the DW devices were determined by the methods substantiated by Fukami [13]. According to the configuration of this DW device, a current density of $6.2 \times 10^{12} \text{A/m}^2$ causes DW motion with a velocity of 20m/s, whereas a current density of $5.2 \times 10^{12} \text{A/m}^2$ leaves the DW stationary. The mapping outcomes of Algorithm 1 are listed in Table 1.

According to the results from Table 1, we consider the NCL TH44 gate as an example. The DW simulation results



Figure 8: Simulation of proposed TH44 gate through DW device.

were obtained from the mumax<sup>3</sup> simulator using the parameters listed in Table 2. Whenever the sum of input current has less or equal magnitude than the critical current, then the proposed design will not lead to any DW movement. When the magnitude of summed current is larger than the critical current, then the DW moves towards terminal T2. Therefore, the various combinations of inputs create specific DW motion. The simulation of the TH44 gate is shown in Figure 8. The number of inputs is increased sequentially to test hysteresis. Before the four inputs are all ones, the different combinations of input are shown in Figure 8 as follows: (A = 0, B = 0, C = 0, D = 0); (A = 0, B = 0)0, C = 0, D = 1; (A = 0, B = 0, C = 1, D = 1); and (A = 0, B = 1, C = 1, D = 1). During those cases, the DW does not move since the sum of input currents and the NULL module current is not larger than the critical current. While all four inputs are active, the sum of the input current and NULL module current is larger than the critical current and induces the movement of the DW. After the DW moves to a specific position, the number of inputs asserted may change. However, the current does not induce the movement of the DW back to the initial position due to insufficient inverse current. Thus, the DW remains at its

**Algorithm 1:** Algorithm for calculating STENCL weight and threshold conditions.

Input : G-Boolean NCL netlist Output: N-STENCL netlist 1  $V_{dd} \leftarrow 0.3V$  $\mathbf{2} \ S \leftarrow 40 nm^2$ // injection area of DW  $3 T_i \leftarrow \text{Thres}(G)$ // read threshold of each node 4  $w_{i,j} \leftarrow \text{Weigh}(G)$ // read weight of each node 5  $m_{min} \leftarrow 100\Omega$ // set minimal memristance 6  $m_{max} \leftarrow 38000\Omega$ // set minimal memristance 7  $Ic_i^1 \leftarrow S \cdot 5.2 \times 10^{12} A/m^2$ // set critical current density for DW velocity=0 s  $Ic_i^{2} \leftarrow S \cdot 6.2 \times 10^{12} A/m^2$ // set critical current density for DW velocity=20m/s 9 for i = 1 : N do if  $w_{i,j} = w_{i,1}, \cdots, = w_{i,n_i}$  then 10 // find the minimal memristance of input j=1:n 11  $minimize(m_{i,j=1:n})$ subject to : 12  $T_i \cdot \frac{V_{dd}}{m_{i,j}} - \frac{V_{dd}}{M_i} > Ic_i^2$ 13 // set 1  $(T_i - w_{i,j}) \cdot \frac{V_{dd}}{m_{i,j}} - \frac{V_{dd}}{M_i} < Ic_i^{\dagger}$ // hysteresis 14  $-\frac{V_{dd}}{M_i} < -Ic_i^2$ 15 // NULL  $\frac{V_{dd}}{m_{i,i}} - \frac{V_{dd}}{M_i} > -Ic_i^1$ 16 // hysteresis  $\frac{\mathbf{u}}{m_{i,j}}$  $m_{min}^{i,j} < m_{i,j}, M_i < m_{max}$ 17 // device constraint 18 else 19  $w_{min} \leftarrow findmin(w_{i,j})$  // find the minimal Boolean weight of input 20 21 22 23 // set 1 24 // hysteresis  $-\frac{V_{dd}}{M_{\cdot}} < -Ic_i^2$ // NULL 25  $\frac{\frac{M_i}{V_{dd}}}{\frac{1}{w_{min}}} - \frac{V_{dd}}{M_i} > -Ic_i^1$ 26  $\frac{1}{m_{w_{min}}} - \frac{1}{M_i}$   $m_{min} < m_{i,j}, M_i < m_{max}$ // hysteresis 27 // device constraint

current position. When all inputs are zeros, the sum of input currents and NULL module current is larger than resetting critical current, thus pushing the DW back to its original position. The corresponding simulation results are shown in Figure 8. In the simulation, the hysteresis of the NCL gate is realized through DW movement by utilizing the appropriate memristances as determined by Algorithm 1.
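As a rough illustration of the equal-weight branch of Algorithm 1, the following Python sketch evaluates the current constraints for a candidate input memristance `m` and NULL module memristance `big_m`. The function names and the trial values are hypothetical, and the constraint forms follow one reading of the algorithm listing, so this is a feasibility check under those assumptions rather than the mapping tool itself.

```python
"""Feasibility check for the equal-weight branch of Algorithm 1 (illustrative sketch)."""

V_DD = 0.3                      # supply voltage (V), as set in Algorithm 1
S = 40e-18                      # DW injection area: 40 nm^2 expressed in m^2
IC1 = S * 5.2e12                # critical current (A) for DW velocity = 0
IC2 = S * 6.2e12                # critical current (A) for DW velocity = 20 m/s
M_MIN, M_MAX = 100.0, 38000.0   # memristance limits (ohm)


def null_memristance_upper_bound():
    """Largest NULL memristance allowed by the NULL constraint V_dd / M > Ic2."""
    return V_DD / IC2


def is_feasible(threshold, weight, m, big_m):
    """Return True if (m, big_m) satisfies every constraint of the equal-weight branch.

    threshold and weight are the gate threshold T_i and the common input weight;
    m is the input memristance and big_m the NULL module memristance (both in ohm).
    """
    i_in = V_DD / m        # current contributed by one asserted input
    i_null = V_DD / big_m  # opposing current from the NULL module
    checks = (
        threshold * i_in - i_null > IC2,             # set 1
        (threshold - weight) * i_in - i_null < IC1,  # hysteresis (no early set)
        -i_null < -IC2,                              # NULL (reset when inputs are 0)
        i_in - i_null > -IC1,                        # hysteresis (no early reset)
        M_MIN < m < M_MAX and M_MIN < big_m < M_MAX, # device constraint
    )
    return all(checks)


if __name__ == "__main__":
    print(f"NULL memristance must stay below {null_memristance_upper_bound():.1f} ohm")
    # Hypothetical trial values for a TH22 gate (threshold 2, unit weights).
    print(is_feasible(2, 1, m=800.0, big_m=1000.0))
    print(is_feasible(2, 1, m=200.0, big_m=1000.0))
```

Under these assumptions the NULL constraint alone caps the NULL module memristance at roughly 1209 Ω, which agrees with the upper limit of $M_i$ listed in the memristance table.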

Among the 27 standard NCL gates with up to four input variables, three special NCL functions, namely TH24comp, THand0, and THxor0, are not formally considered to be threshold gates [28]. To implement these NCL gates, we decompose them into constituent threshold gates. Figure 9 (a), (b), and (c) show the realization of spin-transfer-torque DW device-based TH24comp, THand0, and THxor0 gates, respectively. For example, the NCL gate THxor0 can be decomposed into two levels consisting of two TH22 gates and one TH21 gate, as shown in Figure 9 (d). The THand0 gate can be decomposed into two stages that consist of three TH22 gates and one TH21 gate, as shown in Figure 9 (e). The NCL gate TH24comp can be decomposed into a dual-layer design that consists of two TH21 gates and one TH22 gate, as shown in Figure 9 (f). The simulation results for the proposed NCL gates are shown in Figure 9 (g), (h), and (i), respectively. Each simulation applies an increasing number of asserted inputs. For the THxor0 gate, when inputs C and D are active, the first-layer DW moves because the sum of the input currents exceeds the critical current of the DW device. The sum of currents injected from AB and CD into the DW device at the second layer is then larger than the second DW critical current. Therefore, the second-layer DW moves and hence results in the desired high voltage output.
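The two-level decomposition described above can be sketched behaviorally. The Python model below is a minimal sketch that assumes a threshold gate sets its output when the weighted input sum reaches the threshold, resets only when all inputs are zero, and otherwise holds its state. The class names are hypothetical, and the second-level gate is modelled as a 1-of-2 threshold gate, which is an assumption about the gate the text labels TH21.

```python
class ThresholdGate:
    """Behavioral m-of-n threshold gate with NCL-style hysteresis."""

    def __init__(self, weights, threshold):
        self.weights = weights
        self.threshold = threshold
        self.output = 0  # state held between evaluations (the DW position)

    def evaluate(self, inputs):
        total = sum(w * x for w, x in zip(self.weights, inputs))
        if total >= self.threshold:
            self.output = 1          # set condition
        elif not any(inputs):
            self.output = 0          # reset only when every input is 0
        return self.output           # otherwise hold the previous value


class THxor0:
    """THxor0 = AB + CD built from two TH22 gates and one OR-like gate."""

    def __init__(self):
        self.ab = ThresholdGate((1, 1), 2)
        self.cd = ThresholdGate((1, 1), 2)
        self.second_level = ThresholdGate((1, 1), 1)  # assumed 1-of-2 gate

    def evaluate(self, a, b, c, d):
        first_level = (self.ab.evaluate((a, b)), self.cd.evaluate((c, d)))
        return self.second_level.evaluate(first_level)


gate = THxor0()
sequence = [(0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 1, 1), (0, 0, 0, 1), (0, 0, 0, 0)]
trace = [gate.evaluate(*vector) for vector in sequence]
print(trace)
```

The fourth vector shows the hysteresis: once C and D have set the output, de-asserting C alone does not clear it, and only the all-zero vector returns the gate to 0.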

## 7 DUAL-RAIL STENCL CIRCUIT DESIGN

A CMOS-based NCL circuit was implemented using a QDI pipeline with a dual-rail four-phase handshaking protocol. In Figure 11 (e), dual-rail signal $D$ is encoded as two wires, $D^0$ and $D^1$. Any value from the dual-rail set $\{DATA^0, DATA^1, NULL\}$ can be represented through different combinations of $D^0$ and $D^1$. The $DATA^0$ and $DATA^1$ conditions are encoded as $(D^0 = 1, D^1 = 0)$ and $(D^0 = 0, D^1 = 1)$, respectively. The NULL signal $(D^0 = 0, D^1 = 0)$ corresponds to an empty set of input data values. Nonetheless, the extra logic cost is a significant drawback of a CMOS-based dual-rail NCL system. A dual-rail NCL system contains at least two asynchronous registers, one each on the input and output sides, as shown in Figure 11 (f). A multi-stage pipelined NCL system can be achieved by inserting additional asynchronous registers. Two adjacent registers are connected through their request and acknowledge signals, labeled $K_i$ and $K_o$, respectively. The purpose of the handshake lines is to prevent DATA wavefronts from being overwritten; successive DATA wavefronts are therefore separated by a NULL wavefront. Acknowledge signals are used in the completion detection module to generate request signals to the previous stage. Since each rail of the dual-rail set is implemented using distinct logic, the duplicated hardware incurs area inefficiencies.

Figure 11 (a) shows the proposed architecture of the dual-rail STENCL realization. Two adjacent DW devices having comparable resistance are connected to a shared terminal, which is injected with the NULL module current. Normally, one side is injected with the sum of currents designated as $D^0$, while the other is injected with the sum of currents designated as $D^1$. The resistances of the vertical write current path for the left side $R_l$ and the right side $R_r$ of the DW are identical, as identified by the governing relationships defined in [29].
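The dual-rail encoding and the role of completion detection described in this section can be sketched in a few lines of Python. This is an illustrative model only: the helper names `encode`, `decode`, and `completion` are hypothetical, and completion detection is reduced to checking whether every signal of a word is DATA or every signal is NULL.

```python
# Rail pairs are written as (D0, D1), following the encoding given in the text.
NULL, DATA0, DATA1 = (0, 0), (1, 0), (0, 1)


def encode(bit):
    """Map a Boolean value (or None for NULL) to a (D0, D1) rail pair."""
    if bit is None:
        return NULL
    return DATA1 if bit else DATA0


def decode(rails):
    """Return 0, 1, or None (NULL); both rails asserted is an invalid state."""
    if rails == (1, 1):
        raise ValueError("both rails asserted: invalid dual-rail state")
    if rails == NULL:
        return None
    return 1 if rails == DATA1 else 0


def completion(word):
    """Report 'DATA' or 'NULL' once the whole word agrees, else None."""
    is_data = [decode(rails) is not None for rails in word]
    if all(is_data):
        return "DATA"
    if not any(is_data):
        return "NULL"
    return None  # wavefront still arriving; the acknowledge must not toggle yet


word = [encode(1), encode(0), encode(1)]
print(completion(word), [decode(rails) for rails in word])
```

A stage would only acknowledge its predecessor when `completion` reports a full DATA or full NULL word, which is what keeps successive DATA wavefronts separated.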
In order to illustrate the operation of the proposed dual-rail STENCL architecture, the equivalent circuits are shown in Figure 11 (b), (c), and (d). In Figure 11 (b), the NULL case is realized for the two input combinations. When the inputs are all zero, $V_{sum}^0 = 0$ and $V_{sum}^1 = 0$ while $V_{null}$ is larger. Thus, two currents with opposite directions are created. If we designate the direction from the NULL terminal to the input terminal as the positive direction, then the combination of the summed input currents and the NULL module current is less than the negative critical current. Therefore, the DW moves back to its original position. In Figure 11 (c), the input voltage $V_0^0 \cdots V_n^0$ is less than the NULL module supplied voltage $V_{null}$. Therefore, the input $V_{sum}^0$ does not induce DW motion. The input voltage $V_0^1 \cdots V_n^1$ is higher than the NULL voltage $V_{null}$. Therefore, the input voltage $V_{sum}^1$ leads to DW motion. In Figure 11 (d), the input voltage $V_0^0 \cdots V_n^0$ exceeds $V_{null}$; therefore, the DW for input $V_{sum}^{0}$ moves. On the other terminal, the input vector $V_0^1 \cdots V_n^1$ has a lower voltage than the NULL module supplied voltage $V_{null}$; therefore, the DW for input $V_{sum}^1$ does not move. For reading the DW position of the proposed design, the reading margin needs to be considered since two DW devices share the same terminal. Therefore, the proposed architecture has four different reading levels. Figure 10 (a) shows dual-rail

| NCL gate | Boolean function  | Weight:Threshold | Memristance Range ( $\Omega$ )                                                                                  |
|----------|-------------------|------------------|-----------------------------------------------------------------------------------------------------------------|
| TH12     | A+B               | (1,1:1)          | $m_{i,A}, m_{i,B} \in [100, M_i/2]; M_i \in [100, 1209]$                                                        |
| TH13     | A+B+C             | (1,1,1:1)        | $m_{i,A}, m_{i,B}, m_{i,C} \in [100, M_i/2]; M_i \in [100, 1209]$                                               |
| TH14     | A+B+C+D           | (1,1,1,1:1)      | $m_{i,A}, m_{i,B}, m_{i,C}, m_{i,D} \in [100, M_i/2]; M_i \in [100, 1209]$                                      |
| TH22     | AB                | (1,1:2)          | $m_{i,A}, m_{i,B}, M_i \in [100, 1209]$                                                                         |
| TH23     | AB+AC+BC          | (1,1,1:2)        | $m_{i,A}, m_{i,B}, m_{i,C}, M_i \in [100, 1209]$                                                                |
| TH23W2   | A+BC              | (2,1,1:2)        | $m_{i,A} \in [100, M_i/2]; m_{i,B}, m_{i,C}, M_i \in [100, 1209]$                                               |
| TH24     | AB+AC+AD+BC+BD+CD | (1,1,1,1:2)      | $m_{i,A}, m_{i,B}, m_{i,C}, m_{i,D}, M_i \in [100, 1209]$                                                       |
| TH24W2   | A+BC+BD+CD        | (2,1,1,1:2)      | $m_{i,A} \in [100, M_i/2]; m_{i,B}, m_{i,C}, m_{i,D}, M_i \in [100, 1209]$                                      |
| TH24W22  | A+B+CD            | (2,2,1,1:2)      | $m_{i,A}, m_{i,B} \in [100, M_i/2]; m_{i,C}, m_{i,D}, M_i \in [100, 1209]$                                      |
| TH33     | ABC               | (1,1,1:3)        | $m_{i,A}, m_{i,B}, m_{i,C} \in [100, (2/3) \cdot M_i]; M_i \in [100, 1209]$                                     |
| TH33W2   | AB+AC             | (2,1,1:3)        | $m_{i,A} \in [100, (3/4) \cdot M_i]; m_{i,B}, m_{i,C} \in [100, (3/2) \cdot M_i]; M_i \in [100, 1209]$          |
| TH34     | ABC+ABD+ACD+BCD   | (1,1,1,1:3)      | $m_{i,A}, m_{i,B}, m_{i,C}, m_{i,D} \in [100, (3/2) \cdot M_i]; M_i \in [100, 1209]$                            |
| TH34W2   | AB+AC+AD+BCD      | (2,1,1,1:3)      | $m_{i,A} \in [100, (3/4) \cdot M_i]; m_{i,B}, m_{i,C}, m_{i,D} \in [100, (3/2) \cdot M_i]; M_i \in [100, 1209]$ |
| TH34W3   | A+BCD             | (3,1,1,1:3)      | $m_{i,A} \in [100, M_i/2]; m_{i,B}, m_{i,C}, m_{i,D} \in [100, (3/2) \cdot M_i]; M_i \in [100, 1209]$           |
| TH34W22  | AB+AC+AD+BC+BD    | (2,2,1,1:3)      | $m_{i,A}, m_{i,B} \in [100, (2/3) \cdot M_i]; m_{i,C}, m_{i,D} \in [100, (3/2) \cdot M_i]; M_i \in [100, 1209]$ |
| TH34W32  | A+BC+BD           | (3,2,1,1:3)      | $m_{i,A} \in [100, M_i/2]; m_{i,B} \in [100, (2/3) \cdot M_i]; m_{i,C}, m_{i,D} \in [100, (3/2) \cdot M_i]; M_i \in [100, 1209]$ |
| TH44     | ABCD              | (1,1,1,1:4)      | $m_{i,A}, m_{i,B}, m_{i,C}, m_{i,D} \in [100, 2 \cdot M_i]; M_i \in [100, 1209]$                                |
| TH44W2   | ABC+ABD+ACD       | (2,1,1,1:4)      | $m_{i,A}, M_i \in [100, 1209]; m_{i,B}, m_{i,C}, m_{i,D} \in [100, 2 \cdot M_i]$                                |
| TH44W3   | AB+AC+AD          | (3,1,1,1:4)      | $m_{i,A} \in [100, (2/3) \cdot M_i]; m_{i,B}, m_{i,C}, m_{i,D} \in [100, 2 \cdot M_i]; M_i \in [100, 1209]$     |
| TH44W22  | AB+ACD+BCD        | (2,2,1,1:4)      | $m_{i,A}, m_{i,B}, M_i \in [100, 1209]; m_{i,C}, m_{i,D} \in [100, 2 \cdot M_i]$                                |
| TH44W322 | AB+AC+AD+BC       | (3,2,2,1:4)      | $m_{i,A} \in [100, (2/3) \cdot M_i]; m_{i,B}, m_{i,C}, M_i \in [100, 1209]; m_{i,D} \in [100, 2 \cdot M_i]$     |
| TH54W22  | ABC+ABD           | (2,2,1,1:5)      | $m_{i,A}, m_{i,B} \in [100, (5/4) \cdot M_i]; m_{i,C}, m_{i,D} \in [100, (5/2) \cdot M_i]; M_i \in [100, 1209]$ |
| TH54W32  | AB+ACD            | (3,2,1,1:5)      | $m_{i,A} \in [100, (5/4) \cdot M_i]; m_{i,B} \in [100, (5/4) \cdot M_i]; m_{i,C}, m_{i,D} \in [100, (5/2) \cdot M_i]; M_i \in [100, 1209]$ |
| TH54W322 | AB+AC+BCD         | (3,2,2,1:5)      | $m_{i,A} \in [100, (5/6) \cdot M_i]; m_{i,B}, m_{i,C} \in [100, (5/4) \cdot M_i]; m_{i,D} \in [100, (5/2) \cdot M_i]; M_i \in [100, 1209]$ |
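The relationship between the Weight:Threshold column and the Boolean function column of the table above can be checked mechanically. The following Python sketch, in which the helper name `minimal_terms` is hypothetical, enumerates the minimal sets of inputs whose summed weights reach the threshold; each such set corresponds to one product term of the gate's set condition (hysteresis is not modelled here).

```python
from itertools import combinations


def minimal_terms(weights, threshold):
    """List the minimal input subsets whose total weight meets the threshold."""
    names = "ABCD"[:len(weights)]
    terms = []
    # Visit subsets in order of increasing size so that any subset containing
    # an already accepted term can be skipped as non-minimal.
    for size in range(1, len(weights) + 1):
        for combo in combinations(range(len(weights)), size):
            if sum(weights[i] for i in combo) < threshold:
                continue
            if any(set(term) <= set(combo) for term in terms):
                continue
            terms.append(combo)
    return ["".join(names[i] for i in term) for term in terms]


for gate, weights, threshold in [
    ("TH23W2", (2, 1, 1), 2),
    ("TH34W22", (2, 2, 1, 1), 3),
    ("TH54W32", (3, 2, 1, 1), 5),
]:
    print(gate, "+".join(minimal_terms(weights, threshold)))
```

For the three gates shown, the enumerated terms reproduce the Boolean functions listed in the table.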

Table 2: Device simulation parameters utilized to simulate the STENCL TH44 gate.

| Symbol   | Description                       | Value                             |
|----------|-----------------------------------|-----------------------------------|
| α        | damping coefficient               | 0.02                              |
| Ku       | uniaxial anisotropy constant      | $0.59 \times 10^6 \mathrm{J/m^3}$ |
| Xi       | non-adiabaticity of spin-transfer torque | 0.2                        |
| Ms       | saturation magnetization          | $6 \times 10^5 \mathrm{A/m}$      |
| P        | polarization                      | 0.6                               |
| $A_{ex}$ | exchange stiffness                | $1.1 \times 10^{-11} \text{J/m}$  |

STENCL architecture with its sensing scheme. In Figure 10 (a), a series of domains d1 to d5 are associated with two fixed MTJs located above the nano-strip. As mentioned earlier, the two free domains d2 and d4 can be programmed to be in either parallel or anti-parallel configurations with respect to the MTJ magnetization in order to store a '0' or a '1'. The separated read-write paths can facilitate the use of increased oxide thickness, which beneficially raises the Tunneling Magneto Resistance (TMR) ratio leading to larger read margins [30]. Although Sharad [30] proposed a new device structure for similar multi-DW architectures to distinguish the two resistance states of two DW devices, our method utilizes a vertical read path and PMOS transistor to realize an accurate read operation without consuming excessive area.

In particular, the read disturb margin is defined as the difference of read currents ($I_{MTJ1}$ and $I_{MTJ2}$) passing through the DW regions (d2 and d4) during read operations. Table 3 lists read current values for the four states of the proposed dual-rail STENCL architecture. The peak values of transient read current are provided for a thickness of the free layer $t_{ox} = 1.6$ nm and a pulse duration of 0.5 ns, where $I_{m1}$ and $I_{m2}$ denote the currents passing through the two MTJs. A parallel state within the MTJ passes higher current than the anti-parallel state within the MTJ. The quantities $I_{d2}$ and $I_{d4}$ represent current passing through two free domains d2 and

Table 3: Read current values for four states of proposed dual-rail STENCL architecture.

| Current \ DW state     | d2: P, d4: AP | d2: AP, d4: P     | d2: AP, d4: AP     | d2: P, d4: P     |
|------------------------|---------------|-------------------|--------------------|------------------|
| $I_{m1}(\mu A)$        | 3.2           | 15.7              | 16.7               | invalid          |
| $I_{m2}(\mu A)$        | 15.8          | 4.1               | 14.4               | invalid          |
| $I_{d2}(\mu A)$        | 16.3          | 4.6               | 1.4                | invalid          |
| $I_{d4}(\mu A)$        | 3.9           | 14.9              | 0.9                | invalid          |

d4, respectively. In Table 3, the corresponding levels of read current are generated during the read operation.

The PMOS transistor transmits these currents to the next stage. The PMOS transistor serves two purposes. The first is amplification, since the DW read current is kept below $30\mu$A to maintain an accurate read margin. The second is thresholding: the read current for the anti-parallel state does not produce current at the next stage due to the PMOS transistor's threshold, as shown in Figure 10. If all DW states are anti-parallel, then zero-magnitude currents $I_{d2}$ and $I_{d4}$ are generated, which represents the NULL state in NCL. On the other hand, two parallel DW states are invalid according to the NCL dual-rail encoding.

To examine the cascadability and delay performance of dual-rail STENCL, a one-bit NCL full adder is developed. In Figure 12 (b), the one-bit full adder employs two TH23 gates and two TH34W2 gates to implement the DATA0 and DATA1 conditions, where X and Y are the input addends and C is the carry input. The optimized circuit is obtained through the Threshold Combinational Reduction (TCR) method #2 [11], and the carry and sum outputs are given by $C_o^0 = X^0 Y^0 + C^0 X^0 + C^0 Y^0$, $C_o^1 = X^1 Y^1 + C^1 X^1 + C^1 Y^1$, $S^0 = C_o^1 X^0 + C_o^1 Y^0 + C_o^1 C^0 + X^0 Y^0 C^0$, and $S^1 = C_o^0 X^1 + C_o^0 Y^1 + C_o^0 C^1 + X^1 Y^1 C^1$. Therefore, the one-bit full adder can be implemented through four TH gates, TH34W2 and




Figure 9: (a) CMOS NCL THxor0 gate, (b) CMOS NCL THand0 gate, (c) CMOS NCL TH24comp gate, (d) STENCL THxor0 gate architecture, (e) STENCL THand0 gate architecture, (f) STENCL TH24comp gate architecture, (g) Simulation of STENCL THxor0 gate architecture, (h) Simulation of STENCL THand0 gate architecture, and (i) Simulation of STENCL TH24comp gate architecture.

TH23 gates. Although TCR reduces transistor counts, area and power consumption remain drawbacks of widely used asynchronous circuit approaches. Figure 12 (a) shows the STENCL implementation of a one-bit full adder. The two TH23 NCL gates are implemented by two DW devices connected with a shared terminal, and the two TH34W2 gates are implemented in the same manner. The operation of the spintronic dual-rail design is similar to that of the previously described static NCL gate. However, the mapping algorithm is modified, since the NULL module current should be determined through various input combinations. The simulation of the one-bit full adder, shown in Figure 13, uses the same parameters as the previous TH44 gate. The testbench employs several different input combinations, as described in the subsequent section.
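The four-gate structure of the dual-rail full adder can be checked exhaustively with a short behavioral sketch. The Python code below models only the set conditions of the TH23 and TH34W2 gates for DATA inputs (hysteresis and the NULL wavefront are omitted), and it assumes the usual NCL arrangement in which each sum rail takes the opposite carry rail as its weight-2 input. The function names are hypothetical.

```python
from itertools import product


def th23(a, b, c):
    """TH23 set condition: at least two of three inputs asserted."""
    return int(a + b + c >= 2)


def th34w2(a, b, c, d):
    """TH34W2 set condition: input a carries weight 2, threshold is 3."""
    return int(2 * a + b + c + d >= 3)


def full_adder(x, y, c):
    """Return ((S0, S1), (Co0, Co1)) for Boolean DATA inputs x, y, c."""
    x0, x1 = 1 - x, x          # dual-rail encoding of each input
    y0, y1 = 1 - y, y
    c0, c1 = 1 - c, c
    co0 = th23(x0, y0, c0)
    co1 = th23(x1, y1, c1)
    s0 = th34w2(co1, x0, y0, c0)   # sum is 0: carry out with one input low, or all low
    s1 = th34w2(co0, x1, y1, c1)   # sum is 1: no carry with one input high, or all high
    return (s0, s1), (co0, co1)


for x, y, c in product((0, 1), repeat=3):
    (s0, s1), (co0, co1) = full_adder(x, y, c)
    print(x, y, c, "-> S =", (s0, s1), "Co =", (co0, co1))
```

For every input combination exactly one rail of each output is asserted, and the asserted rails match the arithmetic sum and carry of the three input bits.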

## 8 PERFORMANCE ANALYSIS

In this section, we compare the performance of the proposed design at the gate level and the system level. At the gate level, the proposed STENCL logic realization is simulated for DATA and NULL input wavefronts. The MTJ resistance is calculated from the length of the free layer (100 nm), the width of the free layer $W$, the DW position $x$ (middle point), and $RA_{AP}$, $RA_{DW}$, and $RA_P$, which are the MTJ resistance-area products for the anti-parallel, DW, and parallel configurations, respectively. During phase one and phase three, there is no DW motion



Figure 10: (a) STENCL dual-rail architecture's read scheme and (b) Simulation of PMOS transistor. The read current is generated from the DW sensing current and then amplified through the PMOS transistor.

through the device. However, during phase two and signaling SET and RESET processing, the DW moves forward and backward, respectively. The dynamic power dissipation can be obtained via integration over the time interval of DW transit. The simulation results are shown in Figure 15. Figure 15 (c) depicts the delay measurement through two different implementations, whereby one is the proposed STENCL design and the other is CMOS-based NCL [31].

The delay of the STENCL design exceeds that of CMOS-based NCL, since the result reading time spans the interval until the DW motion terminates. Figure 15 (b) presents an energy comparison of both designs. The proposed STENCL design exhibits a one-third reduction in energy consumption compared to the CMOS-based design. The proposed STENCL design significantly reduces power dissipation through its near-zero leakage current. Moreover, the use of a non-volatile DW device to implement the hysteresis function avoids the extra logic cost of the state-holding functionality, including both HOLD0 and HOLD1 logic totaling eight transistors. The area comparison of the two implementations is shown in Figure 15 (a). The elimination of HOLD logic transistors, along with vertical integration of the MTJs and the DW device, leads to a ten-fold area savings. Figure 6 (a) presents the layout of the DW device and associated CMOS transistors.

At the system level, we compare conventional CMOS-based NCL and the proposed STENCL circuit using 1-bit, 4-bit, 8-bit, 16-bit, and 32-bit adders. The conventional NCL full adder follows the architecture shown in Figure 12 (b). The circuit is implemented and simulated using an IBM SOI1250 45nm CMOS process standard cell library. The simulation utilized a nominal power supply voltage of 0.92 V, a temperature of 27 °C, and a capacitive load of 10 fF. The proposed STENCL parameters are listed in Table 2. Figure 16 (a) shows the delay of each design. The STENCL realization exhibits greater delay than the CMOS-based adder when the velocity of the DW is approximately 20 m/s. A tuning procedure may improve the delay performance by adjusting the device threshold to create a larger write current, which increases the DW velocity, to realize the desired energy vs. delay tradeoff. In our case, we use $Jc_i^2 = 6.2 \times 10^{12} A/m^2$, which moves the DW at 20 m/s. Since the full adder is pipelined, its steady-state throughput does not vary with word width. In Figure 16 (b), we compare the energy consumption of the two implementations on a log scale. The proposed circuit operates under very low current levels and thus dissipates only a few $\mu W$ in the memristors, $0.15\mu W$ in the sensing unit, and a few $\mu W$ in the DW device. The STENCL adder achieves a 20-fold energy saving for a 32-bit word width. In Figure 16 (c), the area comparison between the CMOS NCL and STENCL adders is shown. By using a 3D structure, the area of the STENCL adder is significantly decreased. In a direct comparison, the STENCL full adder achieves an 8-fold area savings relative to a CMOS-based NCL full adder.

## 9 STENCL ERROR RESILIENCE HANDSHAKING

Herein, the interweaving of error resilience within the handshaking protocol is developed as a throughput-sustaining method. After the memristor has been programmed, ions drift through the electric field across the device, which can induce resistance changes over time. Beyond existing approaches which focus on increasing device dimensions or exotic material properties to minimize such variations, a broadly applicable architectural-level approach is developed to handle such variations inside the NCL pipeline. A novel refresh mechanism is added to the NCL handshaking protocol and activated at specific times determined by device parameters, type of logic function, and usage scenario. Since refresh may degrade throughput, the refresh control mechanism has been co-designed with NCL handshaking while combating memristor variation. The refresh schematic and control signal waveforms are depicted in Figure 17. In Figure 17 (a), the new handshaking mechanism is illustrated. In addition to the conventional completion detection circuit, the memristor refresh is designed to activate in NULL states only, as shown in Figure 17 (c). The refresh control module receives the current state as either DATA or NULL. The refresh mechanism reuses access transistors without large overhead, as shown in Figure 17 (b). Here, the spintronic NCL circuit shares one refresh control module with different parameters. If the resistance of the memristor is defined as $R_{initial}$ at the beginning of its operation and as $R_t$ at time $t$, the difference of the two resistances is bounded as $\frac{|R_{initial} - R_t|}{R_{initial}} \le \varepsilon$.

Since the resistance of a memristor is controlled by flux $\varphi$ [32], the resistance read over time can be modeled by the flux across the memristor. The flux is given by an analytical form: $\varphi_{in}(t) = \Phi_D\left[\left(1 - \frac{w_0}{D}\right) - \left(1 - \frac{w(t)}{D}\right) + \left(1 - \frac{w_0}{D}\right)^2 - \left(1 - \frac{w(t)}{D}\right)^2\right]$, where $\varphi_{in}(t)$ is the input flux required to transition the memristor from $w_0$ to $w(t)$. The internal variable $w$ represents the width of the doped region, $D$ is the full length of the memristor, and $\Phi_D$ is a parameter defined as $\frac{\beta D^2}{2\mu_v}$, where $\beta$ is the OFF/ON ratio and $\mu_v$ is the average ion mobility. According to the error boundary, the variation in flux is given by $|\Delta \varphi| \leq \varepsilon |V_{apply}T|$, where $V_{apply}$ is the read voltage and $T$ is the read pulse duration. In the worst case, $N$ cycles of read current will create $\varphi_{in} = N\Delta\varphi$. Thus, substituting the given equations, $N$ can be expressed as $N = \left|\frac{\Phi_D}{V_{apply} T \varepsilon}\left[\left(1 - \frac{w_0}{D}\right) - \left(1 - \frac{w(t)}{D}\right) + \left(1 - \frac{w_0}{D}\right)^2 - \left(1 - \frac{w(t)}{D}\right)^2\right]\right|$, which is the count in Figure 17 (c). By using this analytical



Figure 11: (a) Dual-rail STENCL implementation, whereby two dual-rail signals are realized using two DW devices separated by shared terminals, (b) The equivalent analog circuit of dual-rail STENCL architecture for NULL case, (c) The equivalent analog circuit of dual-rail STENCL architecture for DATA0 case, (d) The equivalent analog circuit of dual-rail STENCL architecture for DATA1 case, (e) The spintronic dual-rail 4-phase communication protocol, and (f) Dual-rail STENCL pipeline architecture whereby the input is controlled by local handshaking via a completion detection signal (ACK).



Figure 12: (a) CMOS-based dual-rail NCL architecture of one-bit full adder and (b) Dual-rail STENCL architecture of one-bit full adder.

model, the gates will automatically receive their appropriate refresh schedule. For example, the minimum and maximum resistance values for each input of a TH33W2 gate are calculated separately. Since each input employs a distinct resistance, the worst-case interval between refresh cycles can be calculated based on the supply voltage and read current duration. In this way, the refresh cycle count $N$ becomes a library parameter for all 27 NCL gates and the corresponding refresh mechanisms are realized.
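The refresh count expression of this section can be evaluated numerically as follows. This Python sketch simply transcribes the formula for $N$ and the definition of $\Phi_D$; the function names and all numeric values in the example call are hypothetical placeholders rather than parameters of the devices studied here.

```python
def flux_parameter(beta, length, mobility):
    """Phi_D = beta * D^2 / (2 * mu_v)."""
    return beta * length ** 2 / (2.0 * mobility)


def refresh_count(phi_d, v_apply, t_pulse, eps, w0_over_d, wt_over_d):
    """Worst-case number of read cycles N before the flux budget is exhausted.

    w0_over_d and wt_over_d are the normalized doped-region widths w0/D and
    w(t)/D; eps is the tolerated relative resistance change.
    """
    a = 1.0 - w0_over_d
    b = 1.0 - wt_over_d
    bracket = (a - b) + (a ** 2 - b ** 2)
    return abs(phi_d / (v_apply * t_pulse * eps) * bracket)


# Hypothetical example values, chosen only to exercise the formula.
phi_d = flux_parameter(beta=100.0, length=10e-9, mobility=1e-14)
n = refresh_count(phi_d, v_apply=0.3, t_pulse=1e-9, eps=0.05,
                  w0_over_d=0.50, wt_over_d=0.49)
print(f"Phi_D = {phi_d:.3g} V*s, refresh after about {n:.3g} read cycles")
```

Evaluating this per gate input, with the smallest resulting `n` taken as the gate's schedule, is one way the count could be stored as a library parameter.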

## 10 LARGE-SCALE APPLICATION OF STENCL-BASED ARCHITECTURE

To compare STENCL with alternatives, various four-stage pipelined 32-bit IEEE single-precision floating-point co-processors were simulated. The co-processor consists of several functional blocks to perform addition, subtraction, and multiplication [20]. The conventional CMOS-based NCL design was realized using an IBM SOI1250 45nm CMOS process, which was simulated at the transistor level using the Cadence UltraSim simulator. The Verilog-A library was created to contain 25 sets of randomly selected floating-point numbers for each add/subtract and multiplication operation. To validate the STENCL architecture and circuit design method, we have implemented the same case study using STENCL. Besides verifying its computational accuracy, we also quantitatively measured its performance metrics, such as energy consumption, area, and delay. For the STENCL realization, we take advantage of both the logic synthesis tool and the technology mapping capability of the Cadence toolchain. Specifically, we started by building a cell library of 27 different NCL gates. Subsequently, these designs were read by the Cadence Spectre tool, which creates a SPICE circuit library, as depicted in the toolchain flow of Figure 14.

| Design Type | # Transistors | Delay (ns), Add/Sub. | Delay (ns), Multi. | Operation Energy (pJ), Add/Sub. | Operation Energy (pJ), Multi. | Idle Power (nW), Add/Sub. | Idle Power (nW), Multi. | Power-Delay-Product (J·s), Add/Sub. | Power-Delay-Product (J·s), Multi. |
|---|---|---|---|---|---|---|---|---|---|
| NCL Low-$V_t$ [20] | 158059 | 14.1 | 14.4 | 27.4 | 23.7 | 12300 | 12300 | 2.03e-19 | 3.41e-19 |
| NCL High-$V_t$ [20] | 158059 | 32.7 | 33.4 | 28.5 | 25.1 | 208 | 208 | 9.03e-19 | 8.38e-19 |
| MTCMOS Synchronous [20] | 104571 | 10 | 13.0 | 124.3 | 124.7 | 156000 | 132000 | 12.4e-19 | 17.3e-19 |
| SMTNCL1 SECII [20] | 119244 | 10.7 | 15.4 | 14.6 | 26 | 121.1 | 121.1 | 1.56e-19 | 4e-19 |
| STENCL | 18801 | 34.77 | 35.14 | 0.876 | 1.03 | 11.254 | 12.22 | 0.3e-19 | 0.36e-19 |

Figure 13: Simulation of proposed STENCL full adder.

Figure 14: CAD flow of STENCL simulation framework.

Table 4 lists performance results for various implementations of the 32-bit IEEE single-precision floating-point co-processor. The delay of asynchronous designs is measured as the average (DATA+NULL) processing time. For synchronous designs, we measure delay according to the maximum operating clock frequency. In Table 4, the results of the conventional NCL implementation using Low-$V_t$ and High-$V_t$ transistors are presented. The High-$V_t$ transistor design incurs the largest delay among the CMOS-based designs being compared. Its operational energy consumption and idle power are among the largest of the asynchronous designs; however, they remain below those of the MTCMOS synchronous design. Synchronous designs typically incur the overhead of sleep transistors and switching topology for power-gating. Although the MTCMOS synchronous design requires fewer transistors than the CMOS asynchronous designs, its operational energy consumption and idle power are the highest in Table 4, since the MTCMOS design only sleeps after a preset number of inputs [33], [34]. Among asynchronous designs, the SMTNCL1 SECII design exhibits reduced transistor counts, delay, and power consumption. The proposed logic implementation with the acknowledgment completion logic can reduce area, energy, and leakage power. Compared to other asynchronous and synchronous designs, the proposed STENCL design requires one to two orders fewer transistors, and consumes as little as one tenth the energy under operational and idle conditions, at a tradeoff of up to 3X delay. Table 4 also lists the energy delay product (EDP) of the different designs. It is seen that the proposed STENCL paradigm achieves EDP within 10% of the most favorable EDP. Moreover, its memristor and DW-based design requires fewer devices, which can help reduce the area requirement and interconnect burden. The DATA phase contributes two components towards energy consumption, which are programming operations and sensing operations.
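For the rows used below, the last column of Table 4 is consistent with the product of the measured delay and the operation energy. The following Python sketch, with the hypothetical helper `energy_delay_product`, reproduces that arithmetic from the tabulated delay (ns) and energy (pJ) values; it is offered only as a check of the units, not as the authors' measurement procedure.

```python
def energy_delay_product(delay_ns, energy_pj):
    """Return delay * energy in J*s from a delay in ns and an energy in pJ."""
    return (delay_ns * 1e-9) * (energy_pj * 1e-12)


# (delay in ns, operation energy in pJ) taken from Table 4.
rows = {
    "STENCL, Add/Sub.": (34.77, 0.876),
    "STENCL, Multi.": (35.14, 1.03),
    "NCL High-Vt, Multi.": (33.4, 25.1),
}

for name, (delay_ns, energy_pj) in rows.items():
    print(f"{name}: {energy_delay_product(delay_ns, energy_pj):.3g} J*s")
```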

On average, approximately  $40\mu$ A flows through the memristor. Therefore, programming energy is measured as ~ 0.5fJ for a 1ns write interval. The sensing energy is measured as ~ 2.5fJ for a 1ns read current duration. For NULL wavefront processing, the resetting energy which occurs is also accounted for. In the proposed STENCL architecture, approximately  $50\mu$ A current is used to shift the DW in 1ns,



Figure 15: (a) Area measurement of different TH gates, (b) Energy measurement of selected TH gates, and (c) Delay measurement of selected TH gates.



Figure 16: (a) Delay measurement of NCL adder with increasing bit-width, (b) Energy measurement in log scale of NCL adder with increasing bit-width, and (c) Area measurement in log scale of NCL adder with increasing bit-width.



Figure 17: (a) The proposed Error Resilience Handshaking architecture. The refresh signal is controlled by inputs of DW reset signal and acknowledgment from the next stage, (b) Memristor circuit, and (c) Waveform of control signal in R/W control module.

which results in approximately 0.75 fJ resetting energy.

A quantitative comparison of a synchronous Combinational Gate (CG) design, a CMOS-based NCL design, and a STENCL design was conducted. The study evaluated the transistor count, energy consumption, and propagation delay of each design. Six of the most complex of the 228 ISCAS benchmarks are listed in Table 5. Compared with CG, the CMOS-based NCL circuit generally has more transistors because an NCL gate utilizes more transistors than CG to maintain hysteresis while avoiding a clock. Therefore, for most of the benchmarks listed in Table 5, the CMOS-based NCL design exhibits a larger delay and transistor count compared to CG. Thus, in order to utilize CMOS-based NCL efficiently, a large-scale application should be selected, as was seen with the 32-bit IEEE single-precision floating-point co-processor. In that design, CMOS-based NCL may achieve a reduction in energy consumption because a clocked co-processor requires complex timing support or else provision of appropriate sleep islands. In contrast, the STENCL design requires fewer transistors, but exhibits larger delay than synchronous CG. Although NCL has been considered at times to constitute a low-energy logic paradigm, within relatively small benchmark circuits NCL can be seen to exhibit a range of energy performance. Therein, the absence of a sophisticated clock tree and the reduced benefits of sparse on-demand data-driven computation can result in larger energy consumption than a corresponding synchronous Boolean logic realization. Meanwhile, STENCL exploits the physical characteristics of spintronic devices to reduce the transistor count and the energy consumption attributable to state-holding hysteresis logic and its associated leakage current. Therefore, STENCL is seen to offer a useful broadening of beneficial features beyond previous asynchronous realizations.

#### 11 ERROR SENSITIVITY ANALYSIS

#### 11.1 Memristor Error Analysis

Memristor write accuracy is influenced by the accuracy of analog components, such as the random offset comparator, digital-to-analog converter, and the current source. Fan [12] proposed an analysis methodology for memristor write accuracy that entails higher design complexity for these blocks and lower write speed. Meanwhile, the read accuracy

Table 5: Comparison of various design implementations for ISCAS benchmark circuits relative to previous approaches presented in [20], [26], [35].

| Circuit | Synchronous CG # Transistors | Synchronous CG Delay | Synchronous CG Energy (pJ) | NCL # Transistors | NCL Delay | NCL Energy (pJ) | STENCL # Transistors | STENCL Delay | STENCL Energy (pJ) |
|---------|------------------------------|----------------------|----------------------------|-------------------|-----------|-----------------|----------------------|--------------|--------------------|
| alu4    | 38436          | 110.7 | 8.52        | 43434         | 113.8 | 9.11        | 10858         | 133.6 | 3.77        |
| apex4   | 43044          | 86.1  | 9.54        | 36771         | 104.8 | 8.21        | 7865          | 131.7 | 3.26        |
| c6288   | 22334          | 360.8 | 4.95        | 32496         | 581.8 | 6.21        | 5944          | 638.2 | 2.43        |
| des     | 52872          | 106.6 | 11.72       | 58206         | 115.8 | 12.12       | 10588         | 129.6 | 3.54        |
| misex3  | 51518          | 98.4  | 11.42       | 55134         | 104.8 | 10.23       | 9189          | 127.8 | 3.33        |
| seq     | 33274          | 98.4  | 7.38        | 35970         | 104.8 | 7.18        | 8540          | 125.3 | 2.89        |



Figure 18: (a) Memristor drift simulation of different input current as time increases and (b) Memristor drift simulation of various pulse duration currents.

has remained an important aspect to be considered. When a memristor is programmed through a write current, the ions drift due to the electric field across the device, and thus the memristance may change over time. The analytical model of memristor drift has been validated by linear and non-linear drift velocity models and by experiments using fabricated memristors [36]. Drift may shift the memristance either from $R_{on}$ toward $R_{off}$ or from $R_{off}$ toward $R_{on}$, depending on the polarity of the applied voltage. Two parameters govern the memristance change due to drift: the applied voltage and the elapsed time. In this paper, we use a memristor drift model from Kvatinsky [37] and simulate the impact on memristance variation due to changes in supply voltage over time, based upon device measurements obtained via previous work. Device parameters that were measured previously are utilized herein [38]. As shown in Figure 18 (a), as time elapses, the resistance of the memristor varies according to the drift model and an increased supplied voltage. Figure 18 (b) shows the resistance of the memristor changing according to this drift model along with the memristance refresh approach described in Section 9.
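The dependence of drift on applied voltage, polarity, and elapsed time can be illustrated with a small numerical sketch. Note that this is not the model of [37] nor the measured parameter set of [38]: it assumes the simpler linear ion-drift model, in which the normalized state $x$ obeys $dx/dt = \mu_v R_{on} i(t) / D^2$ and the memristance is $R_{on} x + R_{off}(1 - x)$. All parameter values and function names below are hypothetical and chosen only for illustration.

```python
def memristance(x, r_on, r_off):
    """Memristance for normalized state x in [0, 1] (x = 1 is fully ON)."""
    return r_on * x + r_off * (1.0 - x)


def simulate_drift(voltage, duration, steps=10_000,
                   r_on=100.0, r_off=16_000.0,
                   thickness=10e-9, mobility=1e-14, x0=0.1):
    """Forward-Euler integration of the linear ion-drift model.

    voltage   : constant applied voltage in volts (sign sets drift direction)
    duration  : total elapsed time in seconds
    thickness : device thickness D in meters (hypothetical value)
    mobility  : ion mobility in m^2/(V*s) (hypothetical value)
    x0        : initial normalized state (hypothetical value)
    Returns the memristance in ohms after `duration` seconds.
    """
    dt = duration / steps
    x = x0
    for _ in range(steps):
        current = voltage / memristance(x, r_on, r_off)
        x += mobility * r_on / thickness**2 * current * dt
        x = min(max(x, 0.0), 1.0)  # the state cannot leave the device
    return memristance(x, r_on, r_off)


if __name__ == "__main__":
    start = memristance(0.1, 100.0, 16_000.0)
    print(f"initial memristance: {start:.1f} ohm")
    for v in (0.5, 1.0, 2.0, -1.0):
        print(f"V = {v:+.1f} V after 0.1 s: {simulate_drift(v, 0.1):.1f} ohm")
```

Under these assumptions a larger voltage or a longer exposure moves the memristance further from its programmed value, and reversing the polarity reverses the direction of the shift.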

Device-level approaches for addressing memristor drift have included fabricating devices with a 36nm-thick titanium dioxide layer between a 9nm titanium electrode and a 12nm titanium electrode [39]. However, such a structure is difficult to fabricate and is therefore not considered herein. Fortunately, a clockless design inherently exhibits substantial robustness to changes in memristance. The ideal refresh cycle is set to the maximum interval of reliable operation. However, every NCL logic combination has a distinct drift tolerance. Therefore, only an application-based refresh cycle is required to maintain the desired throughput, as introduced in Section 9.
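The idea of an application-based refresh cycle can be sketched as follows. This is an illustrative calculation only, assuming the linear ion-drift model under a constant positive voltage, for which $M(t)^2 = M_0^2 - 2kVt$ with $k = \mu_v R_{on}(R_{off} - R_{on})/D^2$. The device parameters, the per-gate tolerances in `HYPOTHETICAL_TOLERANCE`, and the function names are assumptions made for this example and are not values from the paper.

```python
def refresh_interval(m0, tolerance, voltage,
                     r_on=100.0, r_off=16_000.0,
                     thickness=10e-9, mobility=1e-14):
    """Time in seconds for the memristance to fall from m0 to m0 * (1 - tolerance).

    Assumes the linear ion-drift model with a constant positive voltage.
    All default device parameters are hypothetical.
    """
    if not 0.0 < tolerance < 1.0:
        raise ValueError("tolerance must lie strictly between 0 and 1")
    if voltage <= 0.0:
        raise ValueError("this sketch only covers a positive applied voltage")
    k = mobility * r_on * (r_off - r_on) / thickness**2
    m_limit = max(m0 * (1.0 - tolerance), r_on)  # cannot drop below R_on
    return (m0**2 - m_limit**2) / (2.0 * k * voltage)


# Hypothetical fractional drift tolerances for two NCL threshold gates.
HYPOTHETICAL_TOLERANCE = {"TH22": 0.10, "TH34": 0.05}


def application_refresh_interval(gates, m0, voltage):
    """Refresh interval for a design: the least tolerant gate sets the limit."""
    return min(
        refresh_interval(m0, HYPOTHETICAL_TOLERANCE[g], voltage) for g in gates
    )


if __name__ == "__main__":
    m0 = 14_410.0  # programmed memristance in ohms (hypothetical)
    for gate, tol in HYPOTHETICAL_TOLERANCE.items():
        t = refresh_interval(m0, tol, 1.0)
        print(f"{gate}: tolerance {tol:.0%}, refresh every {t * 1e3:.1f} ms")
    both = application_refresh_interval(["TH22", "TH34"], m0, 1.0)
    print(f"design using both gates: refresh every {both * 1e3:.1f} ms")
```

The design choice illustrated here is that a circuit built only from drift-tolerant gates can refresh less often than one that includes a less tolerant gate, so the refresh period follows the application rather than a single worst-case device bound.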

#### 11.2 DW Error Analysis

The reliability of DW devices benefits from a relative lack of sensitivity with respect to domain velocity, critical current, and temperature [40]. Co/Ni nano-strips have also been reported to exhibit ten-year retention times at $150^{\circ}C$ and a write endurance of $1 \times 10^{14}$ write occurrences. To analyze the heating effect on magnetic-metallic DW devices, Joule heating simulations have been considered [12]. The conclusion is that the thin and short central free domain is the portion most critical to current-driven heating. To reduce the Joule heating effect, a larger contact area of the two fixed domains and a shorter free domain can be utilized.

#### 12 CONCLUSION

When realized using CMOS device technology alone, many innovative computational methodologies, such as NCL, may incur significant area and energy costs. Fortunately, emerging spintronic devices offer additional opportunities to innovate computational circuits by leveraging non-volatility to realize the needed hysteresis behavior. One valuable perspective is to utilize emerging devices by exploiting their inherent switching behaviors, instead of merely treating them as "super switches" that directly replace CMOS transistors. By leveraging inherent physical behaviors, the proposed STENCL paradigm offers improvements in energy efficiency and area as compared to CMOS-based NCL for several representative benchmarks of sufficient size and complexity. Moreover, the general approach of realizing hysteresis with state-holding sub-circuits could also be extended to other non-volatile devices having favorable energy and switching profiles as they emerge and mature.

#### REFERENCES

- [1] P. A. Beerel, R. O. Ozdag, and M. Ferretti, A Designer's Guide to Asynchronous VLSI. New York, NY, USA: Cambridge University Press, 1st ed., 2010.
- [2] K. Fant, Logically Determined Design: Clockless System Design with NULL Convention Logic. Wiley, 2005.
- [3] M. C. Chang, P. H. Yang, and Z. G. Pan, "Register-Less NULL Convention Logic," *IEEE Transactions on Circuits and Systems II: Express Briefs*, vol. 64, pp. 314–318, March 2017.
- [4] L. D. Tran, G. I. Matthews, P. Beckett, and A. Stojcevski, "Null convention logic (NCL) based asynchronous design fundamentals and recent advances," in 2017 International Conference on Recent Advances in Signal Processing, Telecommunications Computing (SigTelCom), pp. 158–163, Jan 2017.
- [5] K. Ali, F. Li, S. Y. H. Lua, and C. H. Heng, "Compact spin transfer torque non-volatile flip-flop design for power-gating architecture," in 2016 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS), pp. 119–122, Oct 2016.
- [6] X. Chen, N. Khoshavi, R. F. DeMara, J. Wang, D. Huang, W. Wen, and Y. Chen, "Energy-Aware Adaptive Restore Schemes for MLC STT-RAM Cache," *IEEE Transactions on Computers*, vol. 66, pp. 786– 798, May 2017.
- [7] Q. Li, Y. He, J. Li, L. Shi, Y. Chen, and C. J. Xue, "Compiler-Assisted Refresh Minimization for Volatile STT-RAM Cache," *IEEE Transactions on Computers*, vol. 64, pp. 2169–2181, Aug 2015.
- [8] Z. Sun, X. Bi, W. Wu, S. Yoo, and H. Li, "Array organization and data management exploration in racetrack memory," *IEEE Transactions on Computers*, vol. 65, pp. 1041–1054, April 2016.
- [9] Y. Bai, B. Hu, W. Kuang, and M. Lin, "Ultra-robust null convention logic circuit with emerging domain wall devices," in *Proceedings of the 26th edition on Great Lakes Symposium on VLSI*, pp. 251–256, ACM, 2016.
- [10] D. Sokolov, J. Murphy, A. Bystrov, and A. Yakovlev, "Design and analysis of dual-rail circuits for security applications," *IEEE Transactions on Computers*, vol. 54, pp. 449–460, April 2005.
- [11] S. Smith, R. DeMara, J. Yuan, D. Ferguson, and D. Lamb, "Optimization of NULL convention self-timed circuits," *Integration, the VLSI Journal*, vol. 37, no. 3, pp. 135 – 165, 2004.
- [12] D. Fan, Y. Shim, A. Raghunathan, and K. Roy, "STT-SNN: A Spin-Transfer-Torque Based Soft-Limiting Non-Linear Neuron for Low-Power Artificial Neural Networks," *IEEE Transactions on Nanotechnology*, vol. 14, pp. 1013–1023, Nov 2015.
- [13] S. Fukami, M. Yamanouchi, K.-J. Kim, T. Suzuki, N. Sakimura, D. Chiba, S. Ikeda, T. Sugibayashi, N. Kasai, T. Ono, and H. Ohno, "20-nm magnetic domain wall motion memory with ultralow-power operation," in *Electron Devices Meeting (IEDM)*, 2013 IEEE International, pp. 3.5.1–3.5.4, Dec 2013.
- [14] M. Feigenson, J. W. Reiner, and L. Klein, "Efficient Current-Induced Domain-Wall Displacement in SrRuO<sub>3</sub>," *Phys. Rev. Lett.*, vol. 98, p. 247204, Jun 2007.
- [15] Y. Zhang, W. S. Zhao, D. Ravelosona, J.-O. Klein, J. V. Kim, and C. Chappert, "Perpendicular-magnetic-anisotropy CoFeB racetrack memory," *Journal of Applied Physics*, vol. 111, no. 9, pp. 1–6, 2012.
- [16] X. Fong, S. Gupta, N. Mojumder, S. Choday, C. Augustine, and K. Roy, "KNACK: A hybrid spin-charge mixed-mode simulator for evaluating different genres of spin-transfer torque MRAM bitcells," in Simulation of Semiconductor Processes and Devices (SISPAD), 2011 International Conference on, pp. 51–54, Sept 2011.
- [17] H. Manem and G. Rose, "A read-monitored write circuit for 1T1M multi-level memristor memories," in *Circuits and Systems (ISCAS)*, 2011 IEEE International Symposium on, pp. 2938–2941, May 2011.
- [18] C.-M. Jung, J.-M. Choi, and K.-S. Min, "Two-step write scheme for reducing sneak-path leakage in complementary memristor array," *Nanotechnology, IEEE Transactions on*, vol. 11, pp. 611–618, May 2012.
- [19] A. Radwan, M. Zidan, and K. Salama, "On the mathematical modeling of memristors," in *Microelectronics (ICM)*, 2010 International Conference on, pp. 284–287, Dec 2010.

- [20] L. Zhou, R. Parameswaran, F. A. Parsan, S. C. Smith, and J. Di, "Multi-Threshold NULL Convention Logic (MTNCL): An ultralow power asynchronous circuit design methodology," *Journal of Low Power Electronics and Applications*, vol. 5, no. 2, pp. 81–100, 2015.
- [21] D. Fan, M. Sharad, A. Sengupta, and K. Roy, "Hierarchical Temporal Memory Based on Spin-Neurons and Resistive Memory for Energy-Efficient Brain-Inspired Computing," arXiv preprint arXiv:1402.2902, 2014.
- [22] T. Koyama, K. Ueda, K.-J. Kim, Y. Yoshimura, D. Chiba, K. Yamada, J.-P. Jamet, A. Mougin, A. Thiaville, S. Mizukami, et al., "Current-induced magnetic domain wall motion below intrinsic threshold triggered by Walker breakdown," *Nature nanotechnology*, vol. 7, no. 10, pp. 635–639, 2012.
- [23] E. Zianbetov, E. Beigne, and G. Di Pendina, "Non-volatility for Ultra-Low Power Asynchronous Circuits in Hybrid CMOS/Magnetic Technology," in Asynchronous Circuits and Systems (ASYNC), 2015 21st IEEE International Symposium on, pp. 139– 146, May 2015.
- [24] R. B. Reese, S. C. Smith, and M. A. Thornton, "Uncle - An RTL Approach to Asynchronous Design," in *IEEE International Symposium on Asynchronous Circuits and Systems*, pp. 65–72, 2012.
- [25] F. A. Parsan, W. K. Al-Assadi, and S. C. Smith, "Gate mapping automation for asynchronous null convention logic circuits," *IEEE Transactions on Very Large Scale Integration Systems*, vol. 22, no. 1, pp. 99–112, 2013.
- [26] I. Lemberski and P. Fiser, "Area and speed oriented implementations of asynchronous logic operating under strong constraints," in 2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools, pp. 155–162, Sept 2010.
- [27] I. Lemberski, P. Fiser, and R. Suleimanov, "Asynchronous sum-of-products logic minimization and orthogonalization," *International Journal of Circuit Theory and Applications*, vol. 42, no. 6, pp. 562–571, 2014.
- [28] F. A. Parsan and S. C. Smith, "CMOS implementation of static threshold gates with hysteresis: A new approach," in VLSI and System-on-Chip, 2012 (VLSI-SoC), IEEE/IFIP 20th International Conference on, pp. 41–45, Oct 2012.
- [29] S. Fukami, Y. Nakatani, T. Suzuki, K. Nagahara, N. Ohshima, and N. Ishiwata, "Relation between critical current of domain wall motion and wire dimension in perpendicularly magnetized Co/Ni nanowires," *Applied Physics Letters*, vol. 95, no. 23, pp. 1–4, 2009.
- [30] M. Sharad, R. Venkatesan, A. Raghunathan, and K. Roy, "Domain-wall shift based multi-level MRAM for high-speed, high-density and energy-efficient caches," in *Device Research Conference (DRC)*, 2013 71st Annual, pp. 99–100, June 2013.
- [31] A. Burg, A. Coskun, M. Guthaus, S. Katkoori, and R. Reis, VLSI-SoC: From Algorithms to Circuits and System-on-Chip Design 20th IFIP WG 10.5/IEEE International Conference on Very Large Scale Integration, 2012. Springer Publishing Company, Incorporated, 2013.
- [32] Y. Ho, G. M. Huang, and P. Li, "Nonvolatile memristor memory: Device characteristics and design implications," in *IEEE/ACM International Conference on Computer-Aided Design - Digest of Technical Papers*, pp. 485–490, 2009.
- [33] R. Thian, Multi-Threshold CMOS Circuit Design Methodology from 2D To 3D. Masters Thesis, University of Arkansas, Fayetteville, December 2010.
- [34] S. C. Smith and J. Di, "Designing asynchronous circuits using NULL convention logic (NCL)," Synthesis Lectures on Digital Circuits and Systems, vol. 4, no. 1, pp. 1–96, 2009.
- [35] Q. Xie, X. Lin, Y. Wang, S. Chen, M. J. Dousti, and M. Pedram, "Performance Comparisons between 7nm FinFET and Conventional Bulk CMOS Standard Cell Libraries," vol. 62, pp. 1–1, 08 2015.
- [36] R. Williams, "How we found the missing memristor," Spectrum, IEEE, vol. 45, pp. 28–35, Dec 2008.
- [37] S. Kvatinsky, K. Talisveyberg, D. Fliter, A. Kolodny, U. Weiser, and E. Friedman, "Models of memristors for SPICE simulations," in *Electrical Electronics Engineers in Israel (IEEEI)*, 2012 IEEE 27th Convention of, pp. 1–5, Nov 2012.
- [38] K.-H. Kim, S. Gaba, D. Wheeler, J. M. Cruz-Albrecht, T. Hussain, N. Srinivasa, and W. Lu, "A functional hybrid memristor crossbar-array/CMOS system for data storage and neuromorphic applications," *Nano Letters*, vol. 12, no. 1, pp. 389–395, 2011.
- [39] Q. Xia, W. Robinett, M. W. Cumbie, N. Banerjee, T. J. Cardinali, J. J. Yang, W. Wu, X. Li, W. M. Tong, D. B. Strukov, G. S. Snider, G. Medeiros-Ribeiro, and R. S. Williams, "Memristor-CMOS Hybrid Integrated Circuits for Reconfigurable Logic," *Nano Letters*, vol. 9, no. 10, pp. 3640–3645, 2009. PMID: 19722537.
- [40] S. Fukami, M. Yamanouchi, T. Koyama, K. Ueda, Y. Yoshimura, K.-J. Kim, D. Chiba, H. Honjo, N. Sakimura, R. Nebashi, Y. Kato, Y. Tsuji, A. Morioka, K. Kinoshita, S. Miura, T. Suzuki, H. Tanigawa, S. Ikeda, T. Sugibayashi, N. Kasai, T. Ono, and H. Ohno, "High-speed and reliable domain wall motion device: Material design for embedded memory and logic application," in VLSI Technology (VLSIT), 2012 Symposium on, pp. 61–62, June 2012.



Jia Di received B.S. and M.S. degrees from Tsinghua University, China, in 1997 and 2000, respectively. He completed his Ph.D. in Electrical Engineering at the University of Central Florida in 2004. He joined the Computer Science and Computer Engineering Department of the University of Arkansas as an Assistant Professor in Fall 2004, where he is now a Professor and 21st Century Research Leadership Chair. His research area is asynchronous integrated circuit design and hardware security. Dr. Di has published one book and over 100 papers in technical journals and conferences. He also has 5 U.S. patents. Dr. Di is a senior member of IEEE and an elected member of the National Academy of Inventors.



Yu Bai is an Assistant Professor in the Computer Engineering Program in the College of Engineering and Computer Science at the California State University, Fullerton. He earned his Ph.D. degree in Electrical Engineering from the University of Central Florida in 2016. Yu received his B.S. degree in 2008 in Electrical Engineering from the Ukraine National Aviation University and his M.S. degree in 2011 in Electrical and Computer Engineering from the University of Texas Pan American. Prior to his academic career, he had been at Siemens Energy Inc. His research interests include stochastic computing, neuromorphic computing, FPGA design, nano-scale computing systems with novel silicon and post-silicon devices, and low power digital and mixed-signal CMOS circuit design.



**Ronald F. DeMara** (S'87-M'93-SM'05) received the Ph.D. degree in Computer Engineering from the University of Southern California in 1992. Since 1993, he has been a full-time faculty member at the University of Central Florida where he is a Professor of Electrical and Computer Engineering, and joint faculty of Computer Science, and has served as Associate Chair, ECE Graduate Coordinator, and Computer Engineering Program Coordinator. His research interests are in computer architecture with emphasis on reconfigurable logic devices, evolvable hardware, and emerging devices, on which he has published over 200 articles and holds one patent. He is a Senior Member of IEEE and has served on the Editorial Boards of IEEE Transactions on VLSI Systems, Journal of Circuits, Systems, and Computers, the journal Microprocessors and Microsystems, and as Associate Guest Editor of ACM Transactions on Embedded Computing Systems. He has been the Keynote Speaker of IEEE-sponsored conferences including the International Conference on Reconfigurable Computing and FPGAs (ReConFig) and the Reconfigurable Architectures Workshop (RAW), and a panel organizer and invited panelist at several conferences. He is the lead Guest Editor of IEEE Transactions on Computers joint with IEEE Transactions on Emerging Topics in Computing 2017 Special Section on Innovation in Reconfigurable Computing Fabrics: from Devices to Architectures. He is currently an Associate Editor of IEEE Transactions on Computers, and serves on various IEEE conference program committees, including ISVLSI and SSCI. He received IEEE's Joseph M. Biedenbach Outstanding Engineering Educator Award in 2008.