**4. RRAM-based binarized neural networks accelerators**

In BNNs [9], the weights and activations of a neural network are represented with a single bit. Despite the considerable reduction in the required hardware resources, for some classification tasks BNNs were demonstrated to achieve performance comparable to full precision neural network implementations, with only slight accuracy degradation [9]. The use of binary weights and activations not only reduces the size of the memory used to store the network parameters, but also reduces the complexity of the operations performed by each neuron of the network. As shown in **Figure 9a**, in full precision neural networks each neuron computes the dot product between its input activations (i.e., *X*) and the learned weights (i.e., *W*), adds a bias term (i.e., *b*) to the result, and finally applies a nonlinear activation function (i.e., *ReLU* in **Figure 9a**) to compute the output activation. In BNNs, these operations translate into simpler logic operations. Specifically, the dot product becomes a bitwise XNOR followed by an accumulation, or popcount, of the results, and the activation function becomes a comparison with a learned threshold. Despite using less memory space to store the network parameters, BNNs are still characterized by a large number of parameters, and running inference tasks on conventional hardware would incur the VNB, therefore degrading the performance. Thus, several IMC hardware accelerators based on RRAM technologies have been proposed in the literature [7, 8, 42], showing promising efficiency improvements without considerable accuracy degradation. In fact, binary storage is currently a more robust approach for state-of-the-art RRAM technologies with respect to multibit storage on a single device, which is more affected by device nonidealities.
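The resulting neuron operation is compact enough to capture in a few lines. The following Python sketch is a purely functional model of a binarized neuron with 0/1 encoding (the function name and the use of NumPy are illustrative assumptions, not part of any specific accelerator discussed below):

```python
import numpy as np

def bnn_neuron(x, w, threshold):
    """Functional model of a binarized neuron: XNOR replaces the
    multiplications, popcount replaces the sum, and comparison with a
    learned threshold replaces the activation function."""
    xnor = 1 - (x ^ w)           # 1 where input and weight bits agree
    popcount = int(xnor.sum())   # accumulate the matching bits
    return int(popcount > threshold)

x = np.array([1, 0, 1, 1], dtype=np.uint8)  # input activations
w = np.array([1, 1, 1, 0], dtype=np.uint8)  # binary weights
print(bnn_neuron(x, w, threshold=2))        # 2 matches, not above 2 -> 0
```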

#### **Figure 9.**

*(a) Sketch of a fully connected neural network, with* k *inputs,* m *neurons in the hidden layer, and* n *neurons in the output layer. The operation performed by each neuron is shown. (b) Implementation of a neuron in BNNs. The vector-vector multiplication becomes the accumulation of the results from bitwise XNOR operations, and the activation function the comparison with a threshold. (c) 2D array of 1T1R devices that can be used to accelerate in hardware the operations required for classification tasks based on BNNs, in the SIMPLY framework. FETs control signals are managed by the control logic.*


In the following sections, three different BNN hardware accelerators based on 1T1R memory arrays are discussed and benchmarked. Specifically, Section 4.1 describes SIMPLY-based implementations, Section 4.2 discusses analog accelerators for VMM used in BNNs, and Section 4.3 describes a hybrid accelerator combining the previous two approaches on the same array.

#### **4.1 SIMPLY-based BNN accelerators**

Since the core operations of BNNs are logic operations, SIMPLY-based architectures can be used to realize energy-efficient hardware accelerators for inference tasks. Such accelerators can be realized with a memory array and a peripheral circuit such as the one shown in **Figure 9c**, enabling the implementation of single-instruction, multiple-data architectures, which can exploit the intrinsic parallelism of BNN inference algorithms [28, 31, 43]. In fact, depending on the number of comparators placed in the array periphery and on the design of the drivers and control logic, multiple devices in different rows and in the same column can be read and programmed in parallel. For instance, to perform an IMPLY operation in the SIMPLY framework on multiple rows of the array, the columns of interest are biased with the read voltage and the corresponding select lines are biased to turn on the selector FETs. At the same time, FET devices in the array periphery are biased to implement RG and bring the voltage at the rows of interest to the inputs of the VSAs. Then, the column corresponding to the output devices is biased with VSET, and FET devices in the array periphery provide a low conductive path for the devices that need to be programmed.

In the rest of this section, the implementation of the different core operations of BNNs on the SIMPLY architecture is discussed.

#### *4.1.1 XNOR operation*

In BNNs, a dot product operation is equivalent to a bitwise XNOR operation between the input activations and the weights of the neuron, followed by an accumulation of the positive results. As shown in **Figure 10**, for BNN applications, each XNOR operation can be implemented with 5 RRAM devices and 5 computing steps [31], provided that both the weight (i.e., *W*) and its complement (i.e., *W̄*) are stored in the memory array and never overwritten. The other three devices store the input (i.e., *IN*), the output (i.e., *O*), and the partial result (i.e., *N1*) of the XNOR. Since operations

#### **Figure 10.**

*(a) XNOR gate implementation used for BNN inference tasks in the SIMPLY framework. To reduce the number of computing steps, the weight and its complement (i.e., W and W̄, respectively) are stored in the array and preserved through the computations. (b) Sequence of computing steps in the n-SIMPLY framework.*

on the same row of an array are sequentially executed, *N1* can be reused in the subsequent XNOR operation when computing a dot product.
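Before detailing the accumulation, it is useful to see how XNOR emerges from the two SIMPLY primitives. The Python sketch below models FALSE and IMPLY (q ← ¬p ∨ q) behaviorally and verifies a straightforward seven-operation XNOR sequence derived from these truth tables; it is meant only as a functional illustration, not the optimized 5-step schedule of [31], which further exploits the stored complement and parallel initialization steps:

```python
def FALSE(mem, q):
    mem[q] = 0                        # unconditional reset to logic 0

def IMPLY(mem, p, q):
    mem[q] = (1 - mem[p]) | mem[q]    # q <- (NOT p) OR q

def xnor(mem):
    """Compute O = XNOR(IN, W); W and Wb (= NOT W) are never overwritten."""
    FALSE(mem, "N1")
    FALSE(mem, "O")
    IMPLY(mem, "IN", "N1")   # N1 = NOT IN
    IMPLY(mem, "W", "N1")    # N1 = NAND(IN, W)
    IMPLY(mem, "Wb", "IN")   # IN = IN OR W (the input device is reused)
    IMPLY(mem, "IN", "O")    # O  = NOR(IN, W)
    IMPLY(mem, "N1", "O")    # O  = AND(IN, W) OR NOR(IN, W) = XNOR(IN, W)
    return mem["O"]

# exhaustive check of the truth table
for i in (0, 1):
    for w in (0, 1):
        mem = {"IN": i, "W": w, "Wb": 1 - w, "N1": 0, "O": 0}
        assert xnor(mem) == 1 - (i ^ w)
```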

#### *4.1.2 Accumulation operation*

To implement the accumulation operation, different approaches corresponding to different sequences of IMPLY and FALSE operations can be followed. As proposed in [31, 43], the accumulation operation can be implemented with a chain of half adders (HAs), see **Figure 11a** and **b**. Each HA receives as input its own output from the previous step, together with a new input bit when it computes the least significant bit (LSB), or the output carry from the previous HA in the chain otherwise. In the SIMPLY implementation, when a new bit is accumulated, the result of each HA in the chain is computed sequentially, starting from the LSB. However, based on its position (*i*) in the chain, the result of an HA is computed only after *2<sup>i</sup>* input bits have been accumulated, since with fewer accumulated bits the corresponding HA output would be zero. Although the number of HAs grows as the *log2()* of the number of input bits, the number of computing steps grows exponentially with the length of the chain. As shown in **Figure 11e** and **f**, exploiting different maximum parallelism in the framework of n-SIMPLY to optimize the sequence of computing steps results in a > 15% reduction in the number of computing steps with respect to the 2-SIMPLY implementation [31]. However, different approaches can be followed to further improve the efficiency.
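Behaviorally, the HA chain acts as a ripple counter: each incoming bit enters the LSB stage, and carries propagate up the chain. A minimal software model of this behavior (ignoring the underlying IMPLY/FALSE scheduling) is sketched below:

```python
def ha(a, b):
    return a ^ b, a & b        # 1-bit half adder: (sum, carry)

def ha_chain_accumulate(bits, n_stages):
    """Accumulate a stream of input bits with a chain of n_stages HAs.
    state[i] holds the i-th bit of the running count."""
    state = [0] * n_stages
    for bit in bits:
        carry = bit
        for i in range(n_stages):
            state[i], carry = ha(state[i], carry)
        assert carry == 0, "counter overflow: more than 2**n_stages - 1 bits"
    return sum(s << i for i, s in enumerate(state))

assert ha_chain_accumulate([1, 0, 1, 1, 1], n_stages=3) == 4
```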

Here we propose a SIMPLY-based implementation of a binary tree adder architecture such as the one shown in **Figure 11d**. In each level of the tree, the results from the previous level are added two by two, until all the input bits are accumulated. Thus, the number of levels of the tree corresponds to the *log2()* of the number of input bits. From one level of the tree to the following, the size of the full adder (FA) increases by 1 bit. Each adder is implemented as a ripple-carry adder, see **Figure 11c**, with a 1-bit HA requiring 9 computing steps, and a 1-bit FA requiring 15 steps (see **Figure 6b**). This

#### **Figure 11.**

*(a) Implementation of the accumulation operation on* m *input bits with a chain of* n *HAs, as in [31]. (b) Truth table of the 1-bit HA used in the HA chain. (c) n-bit ripple-carry adder implementation used as the core element of the binary tree adder-based accumulator. (d) Example of a binary tree adder used to accumulate 16 input bits. The number of levels in the binary tree grows as* log2(m)*. (e) Comparison of the number of steps required to accumulate (2<sup>n</sup> − 1) bits for different accumulator implementations. (f) Percentage of saved computing steps of the different accumulator implementations with respect to the* 2-SIMPLY HA chain *implementation.*


approach provides a considerable performance improvement compared to the HA chain implementation, as shown in **Figure 11e** and **f**.
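Using the step counts quoted above (9 computing steps per 1-bit HA and 15 per 1-bit FA), the cost of the tree accumulator can be estimated with a simple counting model. The sketch below assumes a fully sequential schedule and a power-of-two number of input bits; it is an illustrative estimate rather than the exact schedule behind **Figure 11e**, since row-level parallelism lowers the effective count:

```python
import math

HA_STEPS = 9    # computing steps of a 1-bit half adder
FA_STEPS = 15   # computing steps of a 1-bit full adder (Figure 6b)

def tree_adder_steps(m):
    """Sequential computing steps to accumulate m = 2**k input bits with a
    binary tree of ripple-carry adders. At level l there are m / 2**l
    adders, each adding two l-bit operands with one HA and (l - 1) FAs."""
    k = int(math.log2(m))
    return sum((m >> l) * (HA_STEPS + (l - 1) * FA_STEPS)
               for l in range(1, k + 1))

print(tree_adder_steps(16))  # 8*9 + 4*24 + 2*39 + 1*54 = 300
```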

#### *4.1.3 Comparison operation*

In the original BNN implementation from Courbariaux et al. [9], weights and activations are encoded with +1 and −1 values, meaning that the result of the accumulation operation spans from negative to positive values. Thus, the activation operation is commonly implemented using the *sign()* function. Conversely, in the SIMPLY-based implementation, weights and activations are encoded with 0 and 1, and the result of the accumulation is a positive value, ranging from zero to the number of products performed in the neuron. Thus, the activation function in SIMPLY-based implementations corresponds to the comparison with a threshold (i.e., half the number of products performed in the corresponding neuron). The comparator outputs a logic 1 only when the result of the accumulation is above the threshold. The logic function implementing this operation in sum-of-products form is reported in **Figure 12a**. As discussed in [31], increasing the parallelism in the read step of the n-IMPLY operation considerably reduces the number of computing steps required for a comparison. In fact, when using only 2-IMPLY operations the number of computing steps grows exponentially with the number of inputs, while using n-IMPLY operations with n > 2 leads to linear trends, as shown in **Figure 12b**. More details regarding the different sequences of computing steps implementing the comparison operation are available in [31].
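The Boolean function computed by the comparator can also be stated compactly in software: a > th holds when, at some bit position, the accumulated value has a 1 where the threshold has a 0 and all more significant bits are pairwise equal. The two-level sum-of-products form in **Figure 12a** flattens exactly this logic; the sketch below (bit ordering and names are illustrative) checks it bit-serially:

```python
def greater_than(a_bits, th_bits):
    """Return 1 if a > th, comparing equal-length MSB-first bit lists."""
    gt, eq = 0, 1
    for a_i, t_i in zip(a_bits, th_bits):
        gt |= eq & a_i & (1 - t_i)   # first position where a wins
        eq &= 1 - (a_i ^ t_i)        # still equal so far?
    return gt

assert greater_than([1, 0, 1], [0, 1, 1]) == 1   # 5 > 3
assert greater_than([0, 1, 1], [1, 0, 1]) == 0   # 3 > 5 is false
assert greater_than([1, 0, 1], [1, 0, 1]) == 0   # equality is not "above"
```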

#### *4.1.4 Hard max operation*

To determine the output class of an inference task, the hard max function can be used. This function determines the index of the maximum value among a group of elements. A sketch of the approach used for the SIMPLY-based implementation is shown in **Figure 12c**. As for the binary tree adder, a binary tree structure is used, and at each level of the tree pairs of elements are compared. To do so, for each pair of elements, the one stored in the upper position of the array is copied, together with its ID, into the row of the other element. Then the two elements are

#### **Figure 12.**

*(a) Comparator sketch and corresponding logic function, which outputs a logic 1 when* a > th*, and a logic 0 otherwise. (b) Number of computing steps required for different n-SIMPLY-based implementations and an increasing number of compared bits (*l*). In the legend,* n *indicates the maximum degree of parallelism used in the parallel read step of IMPLY operations. (c) Sketch of the hard max activation function implemented in the n-SIMPLY framework. Inputs are compared pairwise using the comparator implementation in (a) and (b).*

compared using the same comparator implementation used for the neuron activation function. For each pair, the initial value and ID stored in the row are replaced with the larger element resulting from the comparison. In just *log2(number of elements)* levels of the tree, the result is computed and stored in place of the last element. Additional information regarding the SIMPLY-based implementation is available in [11, 31].
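Functionally, this reduction is a pairwise tournament that propagates the larger value together with its ID up the tree. A minimal Python model (list-based, with illustrative names) is:

```python
def hard_max(values, ids):
    """Tournament reduction: at each tree level, pairs are compared and
    the larger value survives together with its ID."""
    entries = list(zip(values, ids))
    while len(entries) > 1:
        nxt = [max(entries[i], entries[i + 1], key=lambda e: e[0])
               for i in range(0, len(entries) - 1, 2)]
        if len(entries) % 2:      # an unpaired element passes through
            nxt.append(entries[-1])
        entries = nxt
    return entries[0][1]          # ID of the maximum value

assert hard_max([3, 9, 4, 7], ids=[0, 1, 2, 3]) == 1
```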

#### *4.1.5 Batch normalization operation/bias term addition*

Batch normalization layers are often used in neural networks to speed up the training of the network parameters and consist in the scaling of the inputs of a layer using learned parameters. In BNNs, instead, batch normalization is introduced between the MAC operation and the *sign()* activation function, as discussed by Courbariaux et al. [9]. Thus, during training, the average (*m*) and the standard deviation (*σ*) of the results of the MAC operations of a layer are computed, while a scaling term (*γ*) and a bias term (*β*) are learned and optimized at each iteration. In the forward passes of the neural network, these parameters are used to regularize the results of the MAC operations (*a*), so that the input to the *sign()* function is $\hat{a} = \left(a - m\right) \odot \gamma \odot \frac{1}{\sigma} + \beta$.

As discussed in Section 4.1.3, in the SIMPLY-based implementation the activation function is implemented as the comparison with a threshold (*th*), which is equivalent to half the number of inputs of a layer if no bias term is added to the result of the MAC operations. Thus, the effect of the batch normalization parameters can be introduced in the SIMPLY-based neural network by adjusting, for each neuron, the value that is mapped to the threshold of the activation function ($\hat{th} = \frac{th - \beta}{\gamma}\sigma + m$).
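In other words, the batch normalization parameters are folded offline into the comparison threshold of each neuron. A small numerical check of this equivalence (assuming *γ* > 0, so that the direction of the inequality is preserved) can be written as:

```python
import numpy as np

def folded_threshold(th, gamma, beta, m, sigma):
    # assumes gamma > 0; a negative gamma would flip the inequality
    return (th - beta) / gamma * sigma + m

rng = np.random.default_rng(0)
th, gamma, beta, m, sigma = 128.0, 0.7, 0.3, 120.0, 5.0
a = rng.uniform(0, 256, size=1000)            # raw MAC results
bn = (a - m) * gamma / sigma + beta           # batch-normalized inputs to sign()
th_hat = folded_threshold(th, gamma, beta, m, sigma)
assert np.array_equal(bn > th, a > th_hat)    # same activations either way
```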

In the output layer of a neural network, the hard max activation function is used to infer the output of the network. Thus, batch normalization, or equivalently the addition of a bias term, also needs to be performed on the SIMPLY-based architectures. These operations consist in the addition of a fixed value to the result of the MAC operation and can be implemented on the SIMPLY-based architecture by adding the bias term, which is stored on the memory array, using the ripple-carry adder described in Section 3.2.1 and shown in **Figure 11c**.
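For reference, the behavior of the ripple-carry addition used for the bias term (and as the core element of the tree accumulator) is sketched below; this is a functional model only, with the 1-bit FA logic made explicit:

```python
def fa(a, b, cin):
    # 1-bit full adder: (sum, carry-out)
    return a ^ b ^ cin, (a & b) | (cin & (a ^ b))

def ripple_add(a_bits, b_bits):
    """Add two equal-length little-endian bit vectors; the LSB stage
    reduces to an HA since its carry-in is 0."""
    out, carry = [], 0
    for a, b in zip(a_bits, b_bits):
        s, carry = fa(a, b, carry)
        out.append(s)
    out.append(carry)        # the carry-out becomes the extra MSB
    return out

assert ripple_add([1, 0, 1], [1, 1, 0]) == [0, 0, 0, 1]   # 5 + 3 = 8
```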

#### *4.1.6 Array level implementation*

As discussed in the previous sections, the SIMPLY-based implementation of BNN inference accelerators requires mapping the inputs, the neuron weights, the thresholds of the activation functions, and the IDs of the output layer to the resistive states of the RRAM devices of memory arrays, together with reserving devices for storing the partial results of computations. Ideally, infinitely large arrays would simplify this task by storing all the parameters of a single layer of the neural network onto a single array, and the parameters of a single neuron all on a single row of that array. However, real memory arrays have a limited size due to nonideal effects, such as line parasitic resistance [44]. Thus, in practical applications, the parameters of a neural network are split onto multiple arrays, and the parameters of a single neuron onto multiple adjacent rows of the same sub-array, as shown in **Figure 13a**. When the parameters related to a single neuron are split onto multiple rows of the array, as in **Figure 13b**, each row of the array must still contain an equal number of devices for storing the inputs, the weights, their complements, and the outputs, and for computing the partial results of the accumulation operations. Also, sufficient space on the array must be left to store support devices and other network parameters (e.g., the thresholds of the activation functions). After partial results are


#### **Figure 13.**

*(a) Mapping of the neural network parameters on a* 256 × 256 *memory array. The number of required arrays depends on the number of inputs (*m*) and outputs (*n*) of each layer of the neural network. (b) Description of the mapping strategy used in a single array. Each row of the array stores portions of the input activations (*IN*), the weights, and their complements, for a single neuron. Additional space to store partial results of the computations and additional parameters needs to be reserved and appropriately sized. The parameters and inputs of a single neuron may be split onto multiple rows of the array based on the number of available columns in the array.*

computed on each row, these must be copied to adjacent rows to compute the final result. Despite the increased complexity in the mapping of the neural network weights and the increased chip area occupied by the peripheral circuitry, the use of multiple arrays increases the computation parallelism, thus reducing the computation latency: results relative to different neurons stored on different sub-arrays can be computed in parallel. Also, the intrinsic parallelism of 2D array implementations of the SIMPLY architecture enables the parallel computation of partial results relative to a single neuron when its parameters are split onto multiple rows of the same array.
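As a rough sizing aid, the number of rows needed per neuron and the number of sub-arrays per layer can be estimated from the array geometry. The sketch below assumes a 256 × 256 array, three devices (IN, W, W̄) per stored weight bit, and a fixed number of columns reserved for support devices; the reserved-column count is a placeholder, since its actual value depends on the chosen accumulator implementation:

```python
import math

def rows_per_neuron(n_inputs, n_cols=256, reserved_cols=16):
    """Rows needed to store one neuron when each weight bit occupies three
    devices (IN, W, and its complement) on the same row; reserved_cols is
    an assumed budget for partial results and other support devices."""
    bits_per_row = (n_cols - reserved_cols) // 3
    return math.ceil(n_inputs / bits_per_row)

def arrays_per_layer(n_inputs, n_neurons, n_rows=256, n_cols=256):
    """Sub-arrays needed for one fully connected layer, packing whole
    neurons row-wise (no neuron is split across arrays)."""
    rows = rows_per_neuron(n_inputs, n_cols)
    return math.ceil(n_neurons / (n_rows // rows))

print(rows_per_neuron(784), arrays_per_layer(784, 128))  # illustrative 784-input, 128-neuron layer
```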

However, the number of devices on the same column of an array that can be programmed in parallel is constrained by the maximum current that can flow through a single array line, potentially decreasing the maximum parallelism that can be achieved. Indeed, this effect is connected to the current compliance used to program the RRAM devices in the array, and the use of RRAM technologies that can reliably store binary values when programmed at low current compliance values would alleviate this constraint, enabling the parallel programming of devices located on multiple rows of the array.

#### **4.2 BNN analog vector matrix multiplication accelerators**
