*4.2.1 State of the art on RRAM-based analog BNN vector matrix multiplication accelerators*

Thanks to their intrinsic reconfigurability, LIM computing architectures such as SIMPLY are an effective solution for resource-constrained devices deployed at the edge of the network. However, the high level of reconfigurability comes at the cost of suboptimal performance when specific tasks need to be accelerated. For instance, in BNN inference tasks, most of the computation is spent evaluating dot products [45]. Thus, designing optimized accelerators for the VMM operation can considerably improve the overall accelerator performance. Indeed, several analog/mixed-signal RRAM-based accelerators of the VMM used in BNNs have been proposed in the literature [6, 8, 42, 46, 47].

To encode the −1 and +1 weights of BNNs, a common solution adopted in analog/mixed-signal RRAM-based accelerators is to use a pair of devices (i.e., either 1T1R or 1R devices) programmed in complementary resistive states for each weight of the BNN. Driving these pairs of devices with appropriate voltage schemes enables computing the result of XNOR operations using only positive RRAM conductances. However, implementations differ in the array organization, the sensing scheme (i.e., current sensing or voltage sensing), and the peripheral circuitry. The main approaches found in the literature are summarized in **Figure 14**.

Specifically, **Figure 14a** shows a solution exploiting 1T1R arrays and a current sensing scheme to compute the bitwise XNOR and accumulation operations. In this approach, the weights of the BNN are encoded in pairs of devices located in adjacent rows of the same column of the array. The select lines of the selector transistors encode the input activations using complementary signals (see **Figure 14a**). In the array periphery, a transimpedance amplifier, implemented with an operational amplifier (OPA), ensures a virtual ground at the end of each column of the array, so that the current flowing in each column (i.e., *Isum* in **Figure 14a**) is linearly proportional to the number of +1 results from the bitwise XNOR operations in that column. The transimpedance amplifier also converts the current to a voltage that is then digitized using an analog-to-digital converter (ADC). However, this solution is inefficient in terms of energy consumption and chip area, so the transimpedance amplifier and the ADC must be shared among adjacent columns by means of analog multiplexers. A similar approach that is more efficient in terms of area occupation and energy efficiency was proposed by Yin et al. in [8]; the core elements of their solution are sketched in **Figure 14b**. The working principle is similar to the previous approach, with the main difference being that no OPA is used in the array periphery, thus saving considerable area and energy. However, without the OPA, no virtual ground is provided at the end of the columns of the array.
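
As a rough illustration of this current-sensing scheme, the following Python sketch models a single column of **Figure 14a** under idealized assumptions: a perfect virtual ground, linear devices, and illustrative LRS/HRS resistance and read-voltage values that are not taken from the chapter. It shows that the column current is linear in the number of +1 XNOR results.

```python
import numpy as np

# Hypothetical device parameters (illustrative values only)
G_LRS = 1 / 10e3   # LRS conductance [S], assuming R_LRS = 10 kOhm
G_HRS = 1 / 1e6    # HRS conductance [S], assuming R_HRS = 1 MOhm
V_READ = 0.2       # read voltage [V]

def column_current(weights, inputs):
    """Behavioral model of the 1T1R current-sensing scheme of Figure 14a.

    Each weight in {-1, +1} is a pair of devices in complementary
    resistive states on adjacent rows; each input in {-1, +1} selects
    one device of the pair via complementary select lines. With a
    virtual ground at the column end, the column current is the sum
    of the selected conductances times the read voltage.
    """
    weights = np.asarray(weights)
    inputs = np.asarray(inputs)
    # XNOR = +1 when input matches weight -> the LRS device is selected
    match = (weights == inputs)
    g_selected = np.where(match, G_LRS, G_HRS)
    return V_READ * g_selected.sum()

rng = np.random.default_rng(0)
w = rng.choice([-1, 1], size=64)
x = rng.choice([-1, 1], size=64)
n_plus_one = int(np.sum(w == x))   # number of +1 XNOR results
i_sum = column_current(w, x)
# I_sum = V_READ * (n*G_LRS + (N - n)*G_HRS): linear in the match count
assert np.isclose(i_sum, V_READ * (n_plus_one*G_LRS + (64 - n_plus_one)*G_HRS))
print(f"{n_plus_one} positive XNORs -> I_sum = {i_sum*1e6:.1f} uA")
```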

#### **Figure 14.**

*Different implementations of the VMM operation with RRAM arrays. (a) 1T1R implementation in which each weight of the network is encoded as the complementary resistive states of RRAM devices located in adjacent rows. Weights associated with the same neuron are stored in the same column of the array. An OPA provides a virtual ground node at the end of each column so that the current flowing in each column is equivalent to the sum of all the positive results of the bitwise XNOR operation. (b) Similar 1T1R implementation from [8], which uses a pull-down FET device and a flash ADC implemented with multiple VSAs instead of introducing a virtual ground node. (c) Sketch of the 2T2R implementation from [46]. Differently from (a) and (b), the weights associated with the same neuron are located on the same row. A single vector-vector product is computed at a time by activating the desired word line (WL). Appropriate capacitive sensing circuits convert the results of the bitwise XNOR operations encoded in the select line (SL) voltages to the output result. (d) 1T1R array implementation from [47], which exploits the select line capacitance to compute the result of the vector-vector multiplication. The SL is kept in Hi-Z so that during a read-out step the voltage at the input of the ADC is proportional to the number of positive results of the bitwise XNOR operation.*


Instead, the authors introduced a pull-up or pull-down FET device, which forms a voltage divider with the equivalent resistance of the column; this resistance depends on the results of the bitwise XNOR operations. The voltage drop on the pull-down (pull-up) device increases (decreases) with the number of +1 results from the bitwise XNOR operations in that column. Such a voltage can be compared with a threshold using a VSA to compute the output activation. As discussed by Yin et al. in [8], multiple VSAs with different threshold voltages can be used to implement a flash ADC and improve the accuracy of the computation. The compact area and faster response of VSAs compared to OPAs not only reduce the area occupation but also increase the computation speed. In fact, the VSAs need to be shared among fewer columns of the array than in OPA-based solutions, thus increasing the throughput of the computation. A drawback of this approach is the nonlinear increase of the voltage at the input of the VSAs with the number of positive results, which may require fine-tuning of the voltage thresholds to retain high accuracy in BNN inference tasks.

Alternatively, approaches based on voltage-mode sensing schemes have recently been proposed in the literature [46, 47] and shown to achieve high accuracy and energy efficiency. Specifically, Ezzadeen et al. proposed a solution based on 2T2R arrays [46], sketched in **Figure 14c**. In this approach, the weights of each neuron are stored on the same row of the array, and the result for each neuron is computed one at a time: a single WL (see **Figure 14c**) is turned on, and the input activations and their complements are applied to the corresponding pairs of devices encoding the neuron's weights. As a result, the voltage at each SL (see **Figure 14c**) encodes the result of a single XNOR operation. The authors proposed a capacitive sensing scheme to compute the result of the accumulation. Such a scheme requires a smaller chip area compared to other solutions, though at the cost of reduced computing speed, since the result for a single neuron is computed at each step.

Another voltage-mode sensing scheme, sketched in **Figure 14d**, was proposed by Zhao et al. [47]. In this approach, the weights associated with the same neuron are encoded in adjacent RRAM devices programmed in complementary resistive states and connected to the same SL (see **Figure 14d**). Input activations are encoded as a fixed bias term (*Vbias*) applied to both devices encoding a single weight, plus an additional voltage contribution of amplitude *Vpulse* applied to the first or the second line depending on the polarity of the input activation, as shown in **Figure 14d**. During a VMM, the WLs are activated and the SLs are kept in Hi-Z. Thanks to the intrinsic capacitance of the array lines, the voltage on each SL is linearly proportional to the number of positive bitwise XNOR results. This voltage is then digitized by means of an ADC. Compared with the more common current-mode sensing scheme, this approach is more energy efficient since smaller currents flow in the circuit.
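
The linearity of this Hi-Z scheme can be checked with a minimal steady-state model: a floating SL settles to the conductance-weighted average of the voltages applied to the two lines of each complementary pair. The sketch below assumes ideal resistive devices and placeholder values for the conductances, *Vbias*, and *Vpulse* (none taken from [47]).

```python
import numpy as np

# Illustrative parameters (assumed, not taken from [47])
G_LRS, G_HRS = 1 / 10e3, 1 / 1e6   # conductances [S]
V_BIAS, V_PULSE = 0.1, 0.1         # bias and pulse amplitudes [V]

def sl_voltage(weights, inputs):
    """Steady-state SL voltage for the Hi-Z voltage-sensing scheme of
    Figure 14d: with the SL floating, it settles to the conductance-
    weighted average of the voltages applied to the two lines of each
    complementary device pair.
    """
    weights, inputs = np.asarray(weights), np.asarray(inputs)
    # Conductances of each pair: (first line, second line)
    g_first = np.where(weights == 1, G_LRS, G_HRS)
    g_second = np.where(weights == 1, G_HRS, G_LRS)
    # The pulse is added to the first or second line depending on the input
    v_first = V_BIAS + np.where(inputs == 1, V_PULSE, 0.0)
    v_second = V_BIAS + np.where(inputs == -1, V_PULSE, 0.0)
    num = (g_first * v_first + g_second * v_second).sum()
    den = (g_first + g_second).sum()
    return num / den

rng = np.random.default_rng(0)
w = rng.choice([-1, 1], size=64)
for flips in (0, 16, 32, 48, 64):
    x = w.copy()
    x[:flips] *= -1                  # control the number of mismatches
    n = int(np.sum(w == x))          # positive XNOR results
    print(f"{n:2d} matches -> V_SL = {sl_voltage(w, x)*1e3:.2f} mV")
```

In this model the SL voltage evaluates to *Vbias* plus a term strictly linear in the match count, consistent with the linear proportionality claimed for this scheme.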

Overall, each scheme provides a different tradeoff between chip area (i.e., cost), computing speed, and energy consumption. Thus, it is up to circuit designers to choose the design that best suits a specific application.

#### *4.2.2 Circuit design tradeoffs: analysis of a case study*

In this section, we analyze in detail the tradeoffs of the BNN analog VMM accelerator shown in **Figure 14b** and proposed by Yin et al. [8], since it can provide high throughput combined with high energy efficiency and a compact peripheral circuit design. An example of the core elements of a BNN VMM accelerator based on this approach is sketched in **Figure 15a**, where the additional FET devices and the required control logic are also represented. As discussed in the previous section, the voltage at the input of the VSA increases nonlinearly with the number of positive bitwise XNOR results, owing to the voltage divider formed by the pull-down device and the equivalent resistance of the devices connected to the same column of the array.

The reliability of this accelerator is linked to the linearity of the transfer characteristic and to the read margin available at the input of the VSA. Indeed, a higher read margin over the whole span of possible inputs (i.e., the number of positive bitwise XNOR operations) reduces the error rate of the VSA. Several design parameters influence the linearity of the transfer characteristic and the available read margin. Specifically, the pull-down resistance (i.e., *RPD*) changes the linearity of the transfer characteristic. As shown in **Figure 15b**, low *RPD* values increase the linearity of the transfer characteristic, but excessively low values reduce the dynamic range, and thus the read margin, at the input of the VSA. Also, very low *RPD* values may require excessively large FET devices, which would reduce the area efficiency. Conversely, larger *RPD* values increase the nonlinearity of the transfer characteristic, and excessively large values quickly saturate the input of the VSA when only a few positive results are accumulated. In addition, C2C variability can further reduce the available read margin, introducing possible overlaps between the distributions of the VSA input voltage for consecutive numbers of accumulated XNOR results, as shown in **Figure 15c**. The read margin can be increased, at the cost of higher power consumption, by raising the read voltage, provided that devices in the HRS are not corrupted during a read step. All these circuit parameters must be co-optimized to ensure reliable circuit operation. Still, only a limited number of devices can be reliably read in parallel, introducing the need for the input-split method described in [7]. In this approach, the computation of the complete sum of products is split into multiple steps, each computing a partial sum of products, and the partial results are accumulated using digital circuits. As suggested in [8], the accuracy of the accelerator can be improved by employing multiple VSAs with appropriately designed thresholds for each column, implementing a flash ADC. Specifically, in their work [8], the authors achieved high inference accuracy by using a 3-bit flash ADC implemented with seven VSAs for computing operations with 64 inputs.
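
These tradeoffs can be explored with a simple numerical model. The sketch below, a behavioral approximation assuming ideal resistive devices, a lognormal C2C spread, and illustrative parameter values (none taken from [8]), computes the nominal transfer characteristic and a Monte Carlo estimate of the worst-case read margin for a few *RPD* values.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed parameters for illustration (not the values used in [8])
R_LRS, R_HRS = 10e3, 1e6   # LRS/HRS resistances [Ohm]
V_READ, N = 0.2, 64        # read voltage and inputs read in parallel
SIGMA = 0.15               # lognormal C2C spread of the resistances
TRIALS = 2000              # Monte Carlo samples per count

def v_vsa(r_pd, r_lrs, r_hrs):
    """VSA input voltage: divider between R_PD and the parallel
    combination of the selected column devices."""
    g_eq = (1.0 / r_lrs).sum(axis=-1) + (1.0 / r_hrs).sum(axis=-1)
    return V_READ * r_pd / (r_pd + 1.0 / g_eq)

def v_nominal(r_pd, n):
    """Transfer characteristic for n positive XNOR results, no variability."""
    return v_vsa(r_pd, np.full(n, R_LRS), np.full(N - n, R_HRS))

def v_montecarlo(r_pd, n):
    """Samples of the VSA input voltage with lognormal C2C variability."""
    r_l = R_LRS * rng.lognormal(0.0, SIGMA, (TRIALS, n))
    r_h = R_HRS * rng.lognormal(0.0, SIGMA, (TRIALS, N - n))
    return v_vsa(r_pd, r_l, r_h)

for r_pd in (500.0, 2e3, 10e3):
    v = np.array([v_nominal(r_pd, n) for n in range(N + 1)])
    # Nonlinearity: the voltage step between consecutive counts shrinks
    # as the column conductance grows (step ratio << 1 means saturation).
    step_ratio = (v[-1] - v[-2]) / (v[1] - v[0])
    # Worst-case read margin: gap between the highest sample at count n
    # and the lowest at count n + 1; a negative value means the adjacent
    # distributions of Figure 15c overlap.
    margin = min(v_montecarlo(r_pd, n + 1).min() - v_montecarlo(r_pd, n).max()
                 for n in range(N))
    print(f"R_PD={r_pd:>6.0f} Ohm: range={1e3*(v[-1]-v[0]):6.1f} mV, "
          f"step ratio={step_ratio:.3f}, worst margin={1e3*margin:+7.2f} mV")
```

Raising *RPD* widens the dynamic range but shrinks the step ratio, while variability erodes the margin at high counts, reproducing qualitatively the tradeoff discussed above.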

#### **Figure 15.**

*(a) 1T1R memory array and peripheral circuits that can be used to accelerate the binary VMM operation in the analog domain, by storing the weights of the neural network and their complements in columns of the array and controlling the transistors' select lines depending on the input activations. As shown in (b), the current flowing in each column increases with the number of positive products; the transfer characteristic depends on the value of the pull-down resistance (*RPD*). (c) Effect of RRAM resistive state variability and RTN on the equivalent output voltage, showing possible overlaps between adjacent accumulated positive products.*

Even when the number of VSAs in the array periphery is increased, this approach can provide a considerable efficiency improvement with respect to LIM-based accelerators, since only read operations are performed, without the need for more energy-hungry programming steps.
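
To make the input-split method mentioned above concrete, the following sketch emulates, at a purely behavioral level, the evaluation of a long binary dot product in 64-input slices digitized by a 3-bit flash ADC, in the spirit of [7, 8]. The uniform quantizer is a stand-in (real designs tune the thresholds to the nonlinear transfer characteristic), the sign-based activation ignores any bias term, and all names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

CHUNK = 64        # inputs read in parallel per analog step
ADC_BITS = 3      # flash ADC resolution, as in [8]

def adc_quantize(n_match):
    """Coarse uniform quantization of the per-chunk match count,
    standing in for the flash ADC reading the analog sense voltage."""
    codes = 2 ** ADC_BITS
    step = (CHUNK + 1) / codes
    code = min(int(n_match / step), codes - 1)
    return (code + 0.5) * step   # reconstruct at the code mid-point

def bnn_dot_sign(weights, inputs):
    """Input-split evaluation of a long binary dot product.

    The full sum of products is split into CHUNK-sized slices; each
    slice's count of +1 XNOR results is sensed in the analog domain
    (emulated here), digitized, and the partial results are
    accumulated digitally.
    """
    total = 0.0
    for start in range(0, len(weights), CHUNK):
        w = weights[start:start + CHUNK]
        x = inputs[start:start + CHUNK]
        n_match = int(np.sum(w == x))   # analog XNOR-and-accumulate
        total += adc_quantize(n_match)  # digitized partial sum
    # dot(w, x) = 2*(#matches) - (#inputs); binarized activation = sign
    dot = 2 * total - len(weights)
    return 1 if dot >= 0 else -1

w = rng.choice([-1, 1], size=256)
x = rng.choice([-1, 1], size=256)
print("output activation:", bnn_dot_sign(w, x))
```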

### **4.3 Reconfigurable hybrid BNN SIMPLY/analog accelerator**

To combine the reconfigurability of the SIMPLY architecture described in Section 4.1 with the efficiency of the BNN analog VMM accelerator discussed in Section 4.2.2, a hybrid architecture merging the two approaches on the same array was proposed in [28]. Here we discuss a variation of the architecture proposed in [28] that is less sensitive to sneak paths and requires fewer VSAs in the array periphery. This architecture is shown in **Figure 16** and is similar to the SIMPLY-based architecture shown in **Figure 9c**. IMPLY and FALSE operations can be performed in the array as described in Section 4.1. Conversely, the architecture is slightly different from the one implementing the analog BNN VMM accelerator shown in **Figure 15a**. Still, the working principle is the same; the only differences are that the currents encoding the multiplication results flow in the rows of the array instead of the columns, that the columns of the array are biased with the read voltage, and that the pairs of devices in complementary states encoding a single weight are located in adjacent columns instead of adjacent rows. To enable both computing paradigms, the control logic and peripheral circuits must be adapted so that the reference voltage of the VSAs and the bias voltage of the FET (implementing either *RG* or *RPD*) can be changed when performing computations in the SIMPLY or in the analog BNN VMM framework, respectively. Also, if multiple VSAs are used to implement a flash ADC and improve the accuracy of the analog VMM operation, multiple rows of the array may have to share the same group of VSAs, reducing the maximum achievable parallelism compared to the SIMPLY architecture in **Figure 9c**.
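
The reconfiguration described above can be summarized, at a purely illustrative level, as a small set of periphery settings that the control logic swaps between the two modes. The sketch below is hypothetical: the voltage values, the sharing factor, and all names are placeholders, since the chapter only states that the VSA reference and the FET gate bias must change between the two frameworks.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PeripheryConfig:
    """Hypothetical periphery settings for the hybrid array of Figure 16."""
    vsa_reference: float   # VSA threshold voltage [V] (placeholder)
    fet_gate_bias: float   # gate bias setting the FET resistance [V] (placeholder)
    rows_per_vsa_group: int  # rows multiplexed onto one VSA / flash-ADC group

# SIMPLY mode: the FET plays the role of the gate resistor R_G and each
# row has its own VSA, preserving full parallelism.
SIMPLY_MODE = PeripheryConfig(vsa_reference=0.10,
                              fet_gate_bias=0.50,
                              rows_per_vsa_group=1)

# Analog VMM mode: the FET plays the role of the pull-down R_PD and a
# flash-ADC group may be shared by several rows, reducing parallelism.
ANALOG_VMM_MODE = PeripheryConfig(vsa_reference=0.08,
                                  fet_gate_bias=0.30,
                                  rows_per_vsa_group=8)

def configure(mode: PeripheryConfig) -> None:
    """Stand-in for the control logic that reprograms the periphery when
    switching between SIMPLY operations and analog VMM reads."""
    print(f"VSA ref = {mode.vsa_reference} V, "
          f"FET bias = {mode.fet_gate_bias} V, "
          f"rows per VSA group = {mode.rows_per_vsa_group}")

configure(SIMPLY_MODE)
configure(ANALOG_VMM_MODE)
```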

Despite the increased complexity of the peripheral circuits and the possibly reduced parallelism, the proposed hybrid accelerator exploits more efficiently the hardware resources that are often scarce in low-power embedded devices.

#### **Figure 16.**

*Sketch of the hybrid IMC accelerator enabling the coexistence of the SIMPLY and analog BNN VMM frameworks shown in Figures 9c and 15a, respectively. Shaded VSAs highlight the possible use of flash ADCs in the architecture.*
