**3. Logic-in-memory with RRAM devices**

In recent years, several IMC hardware accelerators based on resistive memory technologies have been proposed in the literature as a solution for improving computation efficiency, especially for data-intensive tasks. Among these accelerators, LIM circuits enable the computation of logic operations on arrays of RRAM devices. Two approaches are commonly distinguished: the stateful and the non-stateful computing paradigms. Stateful LIM frameworks include the material implication [4, 24] and the memristor-aided logic (MAGIC) [2]. In these approaches, RRAM devices act at the same time as computing and storage elements. When performing an operation, the inputs are stored as the resistive states of RRAM devices, and by applying specific voltage pulses, the resistive state of an output device is either changed or preserved depending on the input combination. Non-stateful LIM approaches such as the scouting logic [25] and the smart material implication logic (SIMPLY) [26], instead, exploit the peripheral circuitry to perform part of the computation. Inputs are encoded in the nonvolatile resistance of RRAM devices, which are read in parallel with small voltages (i.e., ≈100 mV). The voltage at a specific circuit node changes depending on the input combination, and by sensing this voltage it is possible to implement different logic operations.

In the following subsections, the compact model is used to study the performance and reliability of a stateful and a non-stateful LIM framework based on the material implication logic. In addition, in Section 3.3 the performance and characteristics of different LIM frameworks are compared, considering the computation of a 1-bit full addition as a benchmark.

### **3.1 RRAM-based material implication logic**

The material implication logic is a type of stateful logic that is functionally complete [27] and is based on two core operations, the IMPLY and the FALSE. These two operations were demonstrated in [4] to be easily implemented with RRAM devices and a circuit such as the one reported in **Figure 4a**, which consists of RRAM devices that store the inputs and outputs of IMPLY and FALSE operations and act as computing elements, a control logic and analog tri-state drivers that deliver appropriate voltages to the RRAM devices, and a resistor RG. The IMPLY is a two-input, one-output operation whose truth table is reported in **Figure 4b**. To execute an IMPLY operation, the inputs are encoded in the resistive states of devices P and Q. Then, the control logic delivers two voltage pulses with amplitudes VCOND and VSET on P and Q, respectively. The voltage pulse amplitudes are dimensioned so that the state of device P never changes, while the state of device Q changes according to the truth table in **Figure 4b**. The FALSE operation is a one-input, one-output logic operation which always results in a logic 0 output and is implemented by delivering a negative voltage VFALSE to an RRAM device.
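As a quick functional cross-check of the truth table in **Figure 4b**, the two core operations can be sketched as plain Boolean functions (this is only a behavioral sketch of the logic, not of the device physics or circuit described above):

```python
# Sketch (not from the chapter): IMPLY and FALSE as Boolean functions,
# to cross-check the truth table in Figure 4b. Inputs are the logic
# states stored in devices P and Q; the IMPLY result overwrites Q
# (Q' in the figure), and FALSE always returns logic 0.

def imply(p: int, q: int) -> int:
    """Material implication: Q' = (NOT P) OR Q."""
    return int((not p) or q)

def false_op(_q: int) -> int:
    """FALSE: unconditionally reset the device to logic 0."""
    return 0

if __name__ == "__main__":
    for p in (0, 1):
        for q in (0, 1):
            print(f"P={p} Q={q} -> Q'={imply(p, q)}")
```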

However, the use of high voltage pulses to conditionally program the output device in the IMPLY operation leads to intrinsic reliability challenges that need to be studied with accurate compact models, including the effects of temperature and variability. Using the UNIMORE RRAM physics-based compact model, simulations of the IMPLY operation performed in [5] for different pairs of VSET and VCOND voltages show that only a very narrow design space exists. As shown in **Figure 4c**, voltage variations of tens of mV, easily introduced by voltage drivers and line parasitic resistances, can lead to a malfunction of the circuit [5]. Also, the use of high voltage pulses induces the problem of logic state degradation [3, 5, 26] on devices that should preserve an HRS after the operation execution. These are devices P and Q in the first and third cases of the truth table, respectively. In these cases, although the voltage drop across these RRAM devices cannot fully switch a device, it can induce a drop of their resistance. Repeatedly executing IMPLY operations on these devices eventually leads to a bit

#### **Figure 4.**

*(a) RRAM-based material implication logic core gate. (b) IMPLY operation truth table. Q' is the state of Q after the operation execution. (c) Map of the VSET and VCOND pairs leading to a correct IMPLY gate operation. The effect of C2C variations is considered. Data from [5]. (d) and (e) Effect of the resistive state degradation on devices P and Q, due to the repeated execution of IMPLY operations on the same input device when both inputs are zero and when P = 1 and Q = 0, respectively.*

*Study of RRAM-Based Binarized Neural Networks Inference Accelerators Using an RRAM… DOI: http://dx.doi.org/10.5772/intechopen.110340*

corruption. This effect cannot be prevented completely, due to the existence of opposite requirements on VCOND for the two cases of the truth table. The number of cycles before a corruption depends on the VSET and VCOND pair, but also on the initial device resistance, as shown in **Figure 4d** and **e**.
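The dependence on the initial resistance can be illustrated with a toy phenomenological sketch of the degradation mechanism. All numbers below (initial HRS values, read threshold, per-cycle resistance drop) are purely illustrative assumptions of ours, not fitted to the compact model or to the data in [5]:

```python
# Toy sketch of logic state degradation: a device that should stay in
# HRS loses a small fraction of its resistance at every IMPLY disturb
# (each execution applies a sub-switching voltage stress). All
# parameters are illustrative, not from the compact model in [5].

def cycles_to_corruption(r_initial, r_threshold, drop_per_cycle=0.03):
    """Count IMPLY disturbs until the HRS device crosses the read
    threshold and its stored bit is corrupted."""
    r, cycles = r_initial, 0
    while r > r_threshold:
        r *= (1.0 - drop_per_cycle)   # small resistance drop per disturb
        cycles += 1
    return cycles

# A higher initial HRS survives more disturb cycles, consistent with
# the qualitative trend shown in Figure 4d and e.
print(cycles_to_corruption(1e6, 2e5))   # higher initial HRS
print(cycles_to_corruption(5e5, 2e5))   # lower initial HRS, fewer cycles
```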

Thus, although promising, the analysis performed with the compact model highlighted that this stateful LIM framework is intrinsically unreliable and other approaches should be preferred.

#### **3.2 The smart material implication logic**

The SIMPLY LIM framework was proposed in [26] as a more reliable and efficient solution to implement the material implication logic on RRAM devices. The circuit used to implement an IMPLY operation is similar to the one used in the stateful approach, with the addition of a comparator whose output is fed back to the control logic, see **Figure 5a**. Since, during an IMPLY operation, the state of the output device Q changes only when both P and Q are in the HRS, in the SIMPLY framework the operation is split into a read step, which detects this input configuration, followed by a conditional programming step, as sketched in **Figure 5b**. During a read step, small (i.e., ≈100 mV) voltage pulses are delivered to the input RRAM devices. As a result, the voltage across RG (i.e., *VN*) is lower when both inputs are in the HRS than in all the other cases. Thus, a sufficient read margin exists to discriminate the first case of the truth table from all the others by using a comparator with an appropriate threshold *VTH*. In the conditional programming step, the control logic delivers VSET to Q only when the comparator detects the first case of the truth table, while it keeps the analog drivers in high impedance otherwise. Since only a sufficient read margin needs to be ensured, this approach is intrinsically more reliable than the stateful one. In fact, circuit simulations performed in [26, 28, 29] with the compact model, including the effects of variability and RTN, showed that a sufficiently large read margin is easily obtained by tuning the read voltage and VFALSE (i.e., a higher HRS leads to a larger memory window). Also, simulations confirmed that by delivering only small read voltages to the RRAM devices, the effect of variability is virtually eliminated. In addition, VCOND is no longer

#### **Figure 5.**

*a) SIMPLY core logic gate. b) Signals used to perform an IMPLY operation in the SIMPLY framework. When the input configuration* P=Q=0 *is detected,* VSET *is delivered to* Q*. In all the other cases the voltage drivers are kept in high impedance (hi-Z). c) Sense amplifier circuit from [26] used as comparator. d) Control (i.e.,* CTRLIN*,* CTRLOUT*,* RD*,* TAIL*, and* RST*) and output signals (i.e.,* VOUT*) of the sense amplifier circuit in c). A comparison is completed in four steps: 1) the internal nodes of the VSA are pre-charged with the input and threshold voltage (i.e.,* VTH*); 2) the comparison is performed by turning on the* TAIL *transistor; 3) the result of the comparison is passed to the output node; 4) the control logic delivers the conditional* VSET *pulse and resets the internal nodes of the VSA.*

required, and since VSET is applied only when both P and Q are zero, in three out of four cases of the truth table the energy consumption is greatly reduced with respect to the stateful implementation, as discussed in [28]. Additional energy can be saved by using the same approach for the FALSE operation, thus splitting it into a read step followed by a conditional VFALSE pulse, since delivering VFALSE to a device that is already in the HRS would only waste energy.
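The two-step read/conditional-programming sequence can be sketched behaviorally as follows. The resistance, read voltage, and threshold values are illustrative assumptions chosen only so the voltage divider separates the P = Q = 0 case; they are not the design values from [26]:

```python
# Behavioral sketch of the SIMPLY IMPLY operation. Logic 0 = HRS,
# logic 1 = LRS. All component values are illustrative assumptions.

V_READ = 0.1                 # ~100 mV read pulse
R_HRS, R_LRS = 1e6, 1e4      # illustrative resistive states
R_G = 5e3

def v_n(states):
    """Divider voltage across R_G when the input devices are read in
    parallel with V_READ applied to their top electrodes."""
    g = sum(1.0 / (R_LRS if s else R_HRS) for s in states)
    r_parallel = 1.0 / g
    return V_READ * R_G / (R_G + r_parallel)

def simply_imply(p, q, v_th=0.002):
    # Read step: V_N is lowest when both devices are in the HRS.
    if v_n([p, q]) < v_th:
        q = 1     # conditional programming step: VSET sets Q to LRS
    # otherwise the drivers stay in high impedance and Q is untouched
    return q
```

Sweeping the four input combinations reproduces the IMPLY truth table of **Figure 4b** without ever applying a high voltage to a device that must preserve its state.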

Also, the multi-input IMPLY operation was proposed in [30] to speed up computations, and its feasibility up to four-input operations was demonstrated by means of circuit simulations on SIMPLY-based architectures in [31]. In the SIMPLY framework, the multi-input IMPLY operation (hereafter called n-IMPLY, where n indicates the number of inputs) is executed in the same way as the two-input IMPLY operation, with the only difference that all the input devices are read in parallel. VSET is then delivered to the output device only when all the inputs are in the HRS. The logic operation is equivalent to *Q*′ = *Q* + ¬(*P* + *S* + …), where the output is written on Q while P and S are other input devices. Although the read margin at the input of the comparator decreases for an increasing number of inputs, the same voltage threshold *VTH* can be used. The SIMPLY architecture enabling the execution of n-IMPLY operations is referred to as n-SIMPLY in this chapter.
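The shrinking of the read margin with the number of inputs can be illustrated with the same voltage-divider sketch: every additional HRS device adds a parallel conduction path, which raises *VN* in the all-HRS case toward the closest non-firing case. Component values below are illustrative assumptions, not the values from [30, 31]:

```python
# Sketch of how the n-IMPLY read margin shrinks with n: the all-HRS
# case (which must fire VSET) is compared against the worst-case
# non-firing configuration (exactly one device in LRS). Resistance
# and voltage values are illustrative assumptions.

V_READ, R_HRS, R_LRS, R_G = 0.1, 1e6, 1e4, 5e3

def v_n(states):
    g = sum(1.0 / (R_LRS if s else R_HRS) for s in states)
    return V_READ * R_G / (R_G + 1.0 / g)

def read_margin(n):
    """Gap between the all-HRS case and the closest non-firing case."""
    all_hrs = v_n([0] * n)
    one_lrs = v_n([1] + [0] * (n - 1))
    return one_lrs - all_hrs

for n in (2, 3, 4):
    print(n, read_margin(n))
```

With these numbers the margin decreases monotonically with n but remains open at n = 4, consistent with the feasibility result quoted above.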

To correctly evaluate the performance of SIMPLY-based architectures it is important to also evaluate the performance of the comparator. The latter was implemented as a voltage sense amplifier (VSA) designed in [26] with a 45 nm technology from [32]. The circuit of the VSA and the timing of the control signals are reported in **Figure 5c** and **d**. The proposed design consumes less than 8 fJ per comparison when VDD is 2 V and the temperature is in the range 0 to 85°C.

#### *3.2.1 Full adder implementation on a n-SIMPLY-based architecture*

To implement more complex logic functions, sequences of IMPLY and FALSE operations and a sufficient number of devices are required. To highlight the remarkable energy efficiency of the n-SIMPLY architecture, here we consider the implementation of a full adder (FA). As shown in **Figure 6a**, to implement a 1-bit FA an array with at least 8 RRAM devices is needed. These devices are used to store the inputs (i.e., IN1, IN2, and CIN), the outputs (i.e., S and COUT), and the partial results (i.e., M1, M2, and M3) of the operation. To compute the result of the 1-bit addition, the 15 computing steps of n-IMPLY and FALSE operations reported in **Figure 6b** need to be executed sequentially [31]. The circuit in **Figure 6a** was simulated in [31] using the compact model and considering a clock frequency of 500 MHz, to estimate its performance. As reported in [31], the computation of a 1-bit FA consumes a worst-case

#### **Figure 6.**

*a) Implementation of a 1-bit FA on the n-SIMPLY architecture. b) Sequence of computing steps from [31] required to compute the output for a 1-bit FA when exploiting the n-SIMPLY framework.*


energy of 4.2 pJ and completes in 60 ns. Although these numbers are orders of magnitude higher than those achievable with full CMOS implementations, n-SIMPLY-based, and LIM-based architectures in general, provide considerable advantages when data-intensive tasks are considered. Thus, in [31] the parallel execution of 512 32-bit FA operations on conventional CMOS circuits and on the n-SIMPLY architecture was compared. When including the von Neumann bottleneck (VNB) overhead, the n-SIMPLY architecture was demonstrated to achieve a >10<sup>6</sup> energy-delay product (EDP) improvement with respect to the CMOS implementation.
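The quoted figures can be tied together with a back-of-envelope sketch. The step count, clock frequency, worst-case energy, and 60 ns delay are taken from the text; the two-cycles-per-step assumption (one read plus one conditional programming cycle) is ours, introduced only so that the step count reproduces the reported delay:

```python
# Back-of-envelope sketch of the metrics used in the comparison.
# 15 steps, 500 MHz, 4.2 pJ, and 60 ns come from the text [31];
# CYCLES_PER_STEP = 2 is our assumption (read + conditional program).

F_CLK = 500e6                 # clock frequency
STEPS = 15                    # n-IMPLY/FALSE steps for a 1-bit FA
CYCLES_PER_STEP = 2           # assumed: read + conditional programming

delay = STEPS * CYCLES_PER_STEP / F_CLK    # seconds
energy = 4.2e-12                           # worst-case energy, from [31]
edp = energy * delay                       # energy-delay product

print(f"delay = {delay * 1e9:.0f} ns, EDP = {edp:.2e} J*s")
```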

#### **3.3 Comparison with other RRAM-based logic circuits**

Although in this chapter material implication-based LIM frameworks are analyzed as a use-case example, other approaches have been studied in the literature. What follows is a description and analysis of the main characteristics and performance of the most common LIM frameworks in which RRAM devices are used both as computing and as memory elements, and which can hence enable the implementation of non-von Neumann computing architectures.

#### *3.3.1 Memristor ratioed logic*

Although the Memristor Ratioed Logic (MRL) [33, 34] enables the fabrication of logic gates with smaller footprints with respect to conventional CMOS logic circuits, it is not included in the analysis because in this framework RRAM devices are used only as computing elements, while inputs are encoded as voltages. Ali et al. demonstrated in [34] a possible realization of an MRL-based FA design on a crossbar array; however, in this approach the computation parallelism is limited, and retrieving the input data would still incur the VNB overhead.

#### *3.3.2 Memristor-aided logic*

The MAGIC LIM framework is similar to the stateful RRAM-based material implication logic described in Section 3.1. The core logic operations in the MAGIC framework are the NOR and the NOT, which can be implemented with RRAM devices using the circuit shown in **Figure 7a**. Differently from the stateful material implication logic gate, the resistor RG is replaced with an RRAM device (i.e., *O* in **Figure 7a**) which stores the result at the end of an operation execution. Thus, before each operation *O* is set to the LRS. The inputs are stored on the devices P and Q (see **Figure 7a**), and by delivering a negative voltage pulse with amplitude *V0* to their TE, the resistive state of the device *O* changes depending on the resistive states of the inputs. As in the stateful material implication logic implementation, the conditional programming of an RRAM device reduces the circuit reliability, increasing the BER. As discussed in [35, 36], RRAM-based MAGIC implementations are affected by a small design space and by logic state degradation, the latter affecting both the input and output devices. Also, as discussed in [35], the circuit reliability is strongly influenced by the RRAM technology characteristics. Indeed, larger |VSET/VRESET| and ROFF/RON ratios can potentially improve the circuit reliability, provided that the effect of C2C variability is considered in the design phase. This further underlines the importance of accurate compact models for the implementation of device-circuit co-optimization strategies.
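The MAGIC NOR semantics described above can be sketched behaviorally: the output device starts in the LRS and the applied pulse resets it whenever at least one input is in the LRS. This captures only the logic, not the voltage-divider physics of the gate:

```python
# Behavioral sketch of the MAGIC NOR/NOT operations (Figure 7a).
# Logic 0 = HRS, logic 1 = LRS.

def magic_nor(inputs):
    o = 1                  # initialization step: O is set to the LRS
    if any(inputs):        # the voltage across O exceeds the reset
        o = 0              # condition when any input is in the LRS
    return o

def magic_not(p):
    # NOT is the single-input special case of the NOR gate
    return magic_nor([p])
```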

#### **Figure 7.**

*a) NOR operation as implemented in the MAGIC LIM framework on a linear RRAM array. The resistive state of the output device* O *changes depending on the input configuration. b) Sketch of the scouting logic LIM framework and associated reliability issues. Similarly to SIMPLY, two devices are read in parallel and the voltage (VN) at the input of a sensing circuit changes depending on the input configuration. The sensing circuit tries to detect how many devices are in LRS, but as shown in the sketched distribution of* VN *for different input combinations, while a sufficient read margin exists to detect the 00 combination, the 01 and the 11* VN *distributions are partially overlapped, thus leading to high BER.*

#### *3.3.3 Scouting logic*

The approach used in the scouting logic is similar to the one employed in the SIMPLY framework. As shown in **Figure 7b**, also in this case RRAM devices encode the input bits of logic operations, and their state is read in parallel using current or voltage sense amplifiers. The equivalent parallel resistance of the input RRAM devices changes depending on the number of inputs in the LRS, leading to different distributions at the input of the sense amplifier [25]. By using different thresholds, the sense amplifier can implement different logic operations. Specifically, by discriminating the 11 input combination from the others using the sense amplifier and *VTH2* (see **Figure 7b**), the control logic in the array periphery can compute the AND and NAND logic operations. Conversely, the OR and NOR operations can be implemented by discriminating the 00 input combination from the others using the *VTH1* threshold (see **Figure 7b**). Also, in this framework the XOR and XNOR operations could potentially be implemented in only two read steps which discriminate the 01 input combination from the others. However, although the 00 input combination can be reliably distinguished from the others, as discussed in Section 3.2 for the SIMPLY framework, the distributions at the input of the sense amplifier for the 01 and 11 input combinations overlap due to the effects of variability and RTN, leading to very high bit error rates and a low circuit reliability when computing AND, NAND, XOR, or XNOR operations. Thus, in this framework only the OR and NOR operations can be reliably computed [37]. Also, differently from SIMPLY, in which the output device is also an input of IMPLY operations, in the scouting logic the partial results of more complex operations need to be stored on an additional device. Using the same device to accumulate multiple partial results (i.e., to OR together the partial results) may cause a set voltage pulse to be applied to a device that is already in the LRS, thus wasting energy.
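The threshold-placement problem sketched in **Figure 7b** can be illustrated with nominal divider voltages: the gap that separates 00 from the rest (used for OR/NOR) is much wider than the gap between 01 and 11 that AND/NAND sensing would rely on. Component values are illustrative assumptions; the real distributions are further broadened by variability and RTN:

```python
# Sketch of scouting-logic sensing: the parallel conductance of the
# two input devices sets V_N at the sense amplifier input. Values are
# illustrative; the point is only the relative size of the gaps.

V_READ, R_HRS, R_LRS, R_G = 0.1, 1e6, 1e4, 5e3

def v_n(bits):
    g = sum(1.0 / (R_LRS if b else R_HRS) for b in bits)
    return V_READ * R_G / (R_G + 1.0 / g)

v00, v01, v11 = v_n([0, 0]), v_n([0, 1]), v_n([1, 1])
print(f"V_N: 00={v00:.4f}  01={v01:.4f}  11={v11:.4f}")

# The 00-vs-01 gap (OR/NOR sensing with VTH1) is wider than the
# 01-vs-11 gap (AND/NAND sensing with VTH2), so the latter is the
# first to close once variability and RTN broaden the distributions.
print((v01 - v00) / (v11 - v01))
```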

#### *3.3.4 Enhanced scouting logic*

To improve the reliability of the scouting logic, Yu et al. introduced in [37] an enhanced version which is based on 2T1R memory arrays. As sketched in **Figure 8a** and **b**, the transistors in series with the RRAM devices can be used to read either the

#### **Figure 8.**

*Operation of the enhanced scouting logic implementation on 2T1R arrays from [37]. a) Execution of an OR/NOR operation. The WL and BL are controlled so that the parallel resistance between the input devices is connected to the sense amplifier. b) Execution of an AND/NAND operation. The WL and BL are controlled so that the series resistance between the input devices is connected to the sense amplifier.*

equivalent parallel or series resistance of the two RRAM devices storing the input bits. As in the scouting logic, when reading the resistive states of the two devices in parallel, the 00 input combination can be easily distinguished from the others to compute the OR or NOR operations. Instead, by reading the equivalent series resistance, the 11 input combination can be distinguished from the others thanks to a sufficiently large read margin, which enables the correct implementation of the AND and NAND operations. Thus, by trading an increased chip area for reliability, in this framework both OR and AND operations can be reliably computed. Compared to the SIMPLY framework, this approach can potentially reduce the number of computing steps required to compute more complex operations, albeit at the cost of the larger chip area required to accommodate the additional bit line and selector transistor, and of a more complex sense amplifier circuit.
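The reason the series read restores the AND/NAND margin is simple resistor arithmetic: in the parallel configuration only the 00 case keeps a high resistance, while in the series configuration only the 11 case gives a low one. A sketch with illustrative resistance values (not those of [37]):

```python
# Sketch of the two read modes of the enhanced scouting logic on a
# 2T1R device pair. Logic 0 = HRS, logic 1 = LRS; values illustrative.

R_HRS, R_LRS = 1e6, 1e4

def r_of(bit):
    return R_LRS if bit else R_HRS

def r_parallel(a, b):
    # WL/BL configured so the sense amplifier sees the parallel combination
    return 1.0 / (1.0 / r_of(a) + 1.0 / r_of(b))

def r_series(a, b):
    # WL/BL configured so the sense amplifier sees the series combination
    return r_of(a) + r_of(b)

# Parallel read: only 00 keeps a high resistance -> OR/NOR sensing.
assert r_parallel(0, 0) > 10 * max(r_parallel(0, 1), r_parallel(1, 1))
# Series read: only 11 gives a low resistance -> AND/NAND sensing.
assert r_series(1, 1) < 0.1 * min(r_series(0, 1), r_series(0, 0))
```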

#### *3.3.5 Performance comparison on a 1-bit FA implementation*

The performance of LIM frameworks can be compared considering different metrics. As shown in **Table 1**, these include both metrics that can be directly associated with specific LIM frameworks (i.e., number of devices and of computing steps used to implement a specific logic function, and the feasibility in crossbar arrays), and other metrics that also depend on design choices and the specific technology employed to fabricate or to simulate the circuits, making the comparison between different LIM solutions non-trivial. To identify key differences between different LIM approaches the execution of a 1-bit FA operation is considered, since it represents a common simple benchmark for LIM frameworks.

Considering the stateful material implication-based implementations (i.e., "Stateful Mat. Imp." in **Table 1**) different approaches can be followed to optimize the number of required devices or the number of computing steps. As reported in the


*Neuromorphic Computing*





#### **Table 1.**

*Comparison of 1-bit FA implementations based on different LIM frameworks.*


literature [3, 26, 27, 40], a 1-bit FA operation can typically be implemented using from 6 up to 9 RRAM devices and sequences of IMPLY and FALSE operations whose length varies from as low as 23 steps, when the inputs are overwritten, up to 136 steps when the sequence is not optimized. To further reduce the number of computing steps, the multi-input stateful material implication framework was proposed by Siemon et al. in [39], showing that by using three-input IMPLY operations the number of computing steps can be reduced down to 19 when using 8 RRAM devices. The n-SIMPLY framework discussed previously in Section 3.2 enables the reliable computation of four-input IMPLY operations, and thus reduces the number of computing steps down to 15. In terms of number of computing steps, the different set of core operations used in the enhanced scouting logic and MAGIC LIM frameworks provides additional advantages compared to most of the material implication-based solutions. Although not explicitly reported in the literature, by combining AND, NAND, NOR, and OR operations in the enhanced scouting logic array implementation (see **Figure 8**), the 1-bit FA operation can be executed in 15 steps, which could be further reduced as in the n-SIMPLY approach by increasing the parallelism of the OR/NOR operations (the parallelism of the AND/NAND operations cannot be increased in the architecture reported in [37]). MAGIC-based implementations in [38] achieve the lowest number of computing steps (i.e., 13 steps) among the different LIM implementations reported in **Table 1**.

Although the number of computing steps is linked to the computing delay (i.e., *delay* = #*steps* × *t*<sub>step</sub>) and can give an idea of the energy efficiency of a specific implementation (i.e., *E*<sub>tot</sub> = Σ<sub>*i*=1</sub><sup>#*steps*</sup> *E*<sub>*i*</sub>, where *E*<sub>*i*</sub> is the energy dissipated in each step), the achievable time (*t*<sub>step</sub>) and energy (*E*<sub>*i*</sub>) per single computing step depend on the characteristics of the RRAM technology that is used and on the adopted design choices. Also, the accuracy of the performance and reliability analysis depends on the type of compact model used in the circuit simulations. Indeed, compared to general-purpose memristor models, the use of physics-based compact models can provide more accurate results when fast voltage pulses are used, and enables the analysis of important nonideal effects such as the logic state degradation. Thus, in **Table 1**, along with the energy and computing delay of different implementations from the literature, the main circuit parameters and an indication of the type of compact model used in the simulations are also reported. In general, to reduce the energy per computing step, increasing the LRS resistance can be a viable solution. However, larger voltages must then be employed to program a device without increasing the programming pulse width, potentially conflicting with the desired circuit design space. In terms of energy efficiency, a hybrid FET-RRAM design from [41] currently shows the best performance. However, this implementation is not feasible in crossbar arrays, limiting the area efficiency, and its simulations were performed with an extremely simplified RRAM model, suggesting the need for a more accurate performance analysis. Among the available data regarding crossbar-feasible implementations, MAGIC and n-SIMPLY achieve the best energy and delay performance.
However, although MAGIC is slightly more efficient than n-SIMPLY, its 1-bit FA implementation does not retain the input states and is affected by the problem of logic state degradation, which can discourage its adoption in applications such as hardware accelerators for BNNs. In BNNs, the network parameters, stored in the nonvolatile resistance of RRAM devices, must be preserved through computations. Thus, the effect of logic state degradation introduces the need for frequent memory refresh cycles, causing high inefficiencies. Therefore, in the following sections, the design of a LIM-based accelerator for BNNs is discussed considering an n-SIMPLY implementation.
