**4.4 Performance benchmark**

In this section, the performance of the architectures discussed in Sections 4.1–4.3 is estimated on a BNN inference task and benchmarked against an implementation running on conventional embedded system hardware [48]. The performance was estimated on a classification task on the MNIST handwritten digits dataset. The 20x20-pixel images of the dataset were converted to black-and-white images before training. The adopted neural network consists of a single fully connected hidden layer with 1000 neurons and 10 output neurons and was trained on 9500 sample images using the DoReFa-Net algorithm [49]. A further 2500 and 2000 images were used for validation and testing, respectively, and the trained neural network achieves an accuracy of 91.4% [29], which is comparable with other results in the literature and with the BNN implementation considered as a reference [48].

The performance of each accelerator was estimated assuming that the computations were executed on an appropriate number of 256x256 1T1R memory arrays. In the SIMPLY-based implementations, the network parameters were mapped to memory arrays as discussed in Section 4.1.6, while in the hybrid architectures, the network weights and their complements were mapped to memory arrays, filling them completely, and the parameters and devices required to compute logic operations in the SIMPLY framework were mapped onto other memory arrays. The two approaches require a similar number of arrays, since most of the memory is used to store the neural network parameters.

Energy and latency estimates were computed by mapping all the IMPLY, FALSE, and VMM operations required to implement the BNN inference task to the memory arrays. For the SIMPLY accelerator, two different implementations were studied to evaluate the performance improvement provided by the binary tree adder accumulator with respect to the HA chain implementation. The whole test set was randomly shuffled and classified by the implemented neural network, preserving the memory states between classification tasks. For the SIMPLY-based implementations, the worst-case energy consumption of n-IMPLY and FALSE operations was estimated using the maximum energy consumption for each input combination reported in **Table 2**, obtained from circuit simulations performed at a 500 MHz clock frequency (i.e., IMPLY and FALSE operations are computed in 4 ns) and including the comparator overhead. The energy of the BNN analog VMM operations was estimated by approximating the RRAM devices as resistors and computing the equivalent resistance of each active row of the array. To estimate the latency, operations were parallelized among different arrays whenever possible, while within each array two different degrees of parallelism were considered. Specifically, in the "serial" case (see **Table 3**), no
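To make the inference task concrete, the sketch below runs the network described above (400 binary inputs, one fully connected hidden layer of 1000 neurons, 10 outputs) in plain NumPy. The random ±1 weights stand in for the DoReFa-Net-trained parameters, and the block is an illustrative model of the computation only, not of the accelerator hardware:

```python
import numpy as np

# Network shape from the text: 20x20 binary inputs, one fully
# connected hidden layer with 1000 neurons, 10 output neurons.
N_IN, N_HID, N_OUT = 20 * 20, 1000, 10

rng = np.random.default_rng(0)
# Illustrative random binarized parameters; the actual accelerator loads
# DoReFa-Net-trained weights mapped onto 256x256 1T1R arrays.
W1 = rng.choice([-1, 1], size=(N_HID, N_IN))
W2 = rng.choice([-1, 1], size=(N_OUT, N_HID))

def binarize(x):
    """Sign activation mapping to {-1, +1} (0 is treated as +1)."""
    return np.where(x >= 0, 1, -1)

def bnn_infer(pixels):
    """pixels: flat 400-element vector of {0, 1} values."""
    x = pixels * 2 - 1                 # {0, 1} -> {-1, +1}
    h = binarize(W1 @ x)               # hidden-layer VMM + sign activation
    return int(np.argmax(W2 @ h))      # predicted digit class

# Classify one random "image" as a smoke test.
print(bnn_infer(rng.integers(0, 2, size=N_IN)))
```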


#### **Table 2.**

*Worst-case energy estimates from circuit simulations of FALSE and n-IMPLY operations executed on the SIMPLY architecture.*



#### **Table 3.**

*Benchmark of different SIMPLY-based BNN accelerators against a state-of-the-art embedded system implementation from the literature.*

additional parallelism was introduced, and all operations were executed sequentially. In the "parallel" case (see **Table 3**), the read step is executed in parallel on the rows of the array but, due to the high compliance current of the considered technology (i.e., I<sub>C</sub> = 100 μA), the conditional programming step is performed sequentially on the rows of the array. For the computation of the BNN VMM operations in the hybrid accelerators, the same design choices as in [8] were adopted: up to 128 lines were activated in parallel, 3-bit flash ADCs were used to digitize the partial results of the products, and each ADC was shared among 8 rows of the array. The digitized partial results were written onto other arrays, grouping the partial results related to the same neuron on the same rows of the array. The logic operations required to accumulate the results and to compute the activation functions were executed in the SIMPLY framework.
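As a minimal sketch of the hybrid VMM dataflow just described: slices of up to 128 rows are activated, each slice's analog partial sum is digitized by an idealized 3-bit ADC, and the digitized partials are accumulated (digitally here; via SIMPLY logic on other arrays in the actual design). The ADC transfer function and full-scale choice are assumptions, not details taken from [8]:

```python
import numpy as np

ROWS_PER_STEP = 128   # up to 128 lines driven in parallel (from the text)
ADC_BITS = 3          # 3-bit flash ADCs (from the text)

def adc3(partial, full_scale):
    """Idealized 3-bit ADC: quantize a partial sum in [-fs, +fs] to 8
    levels and return the reconstructed (dequantized) value. The transfer
    function is an assumption, not a detail taken from [8]."""
    levels = 2 ** ADC_BITS - 1
    code = np.clip(np.round((partial + full_scale)
                            / (2 * full_scale) * levels), 0, levels)
    return code / levels * 2 * full_scale - full_scale

def hybrid_vmm(W, x):
    """VMM over 128-row slices: each slice's analog partial sum is
    digitized, then the digitized partials are accumulated (digitally
    here; via SIMPLY logic on other arrays in the actual design).
    ADC sharing among 8 rows affects latency only and is not modeled."""
    acc = np.zeros(W.shape[0])
    for start in range(0, W.shape[1], ROWS_PER_STEP):
        sl = slice(start, start + ROWS_PER_STEP)
        partial = W[:, sl] @ x[sl]          # analog MAC on active rows
        acc += adc3(partial, ROWS_PER_STEP) # digitized partial result
    return acc

# Example: output layer of the network above (10 neurons, 1000 inputs).
rng = np.random.default_rng(0)
W = rng.choice([-1, 1], size=(10, 1000))
x = rng.choice([-1, 1], size=1000)          # binarized hidden activations
print(np.argmax(hybrid_vmm(W, x)))
```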

The results of the simulations are reported in **Table 3** and show that all the proposed accelerators provide a >7∙10<sup>2</sup> EDP improvement with respect to the reference embedded system implementation [48]. All the accelerators considerably reduce the energy consumption and, except for the "n-SIMPLY Serial HA chain" implementation, they also considerably reduce the computing time. In the SIMPLY-based accelerators, the adoption of the binary tree adder accumulator markedly reduces the number of computing steps, leading to substantial latency and energy savings. As expected, additional improvements can be achieved by performing the VMM in the analog domain using the hybrid accelerators, which achieve an EDP improvement of >10<sup>5</sup> with respect to the embedded system implementation and of >3.5 with respect to the SIMPLY-based implementations.
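For reference, the energy-delay product used as the figure of merit in **Table 3** is simply the product of energy and latency, so the improvement factor is the ratio of the reference EDP to the accelerator EDP. The numbers below are placeholders for illustration, not values from **Table 3**:

```python
def edp(energy_j, latency_s):
    """Energy-delay product: lower is better."""
    return energy_j * latency_s

# Placeholder numbers for illustration only (not from Table 3):
ref = edp(energy_j=1.0e-3, latency_s=1.0e-2)   # embedded reference
acc = edp(energy_j=5.0e-6, latency_s=2.0e-4)   # hypothetical accelerator
print(f"EDP improvement: {ref / acc:.1e}x")    # -> 1.0e+04x
```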

The latency of the SIMPLY-based implementations could be further improved by adopting RRAM technologies with a lower compliance current, provided that a sufficient read margin and high retention are preserved. Lower compliance currents would potentially increase the maximum number of devices that can be written in parallel. They would also reduce the energy consumption of both computing approaches, further highlighting their advantages.
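A back-of-envelope sketch of this scaling, assuming programming power is dominated by the compliance current and the write drivers have a fixed total current budget (all numeric values below are illustrative assumptions, except the 4 ns pulse matching the 500 MHz clock above):

```python
I_BUDGET = 1.0e-3   # total write-driver current budget, 1 mA (assumption)
V_WRITE = 1.5       # write voltage in volts (assumption)
T_PULSE = 4e-9      # 4 ns pulse, matching the 500 MHz clock above

for i_c in (100e-6, 10e-6):               # 100 uA (text) vs 10 uA (what-if)
    n_parallel = int(I_BUDGET // i_c)     # devices writable in parallel
    e_write = V_WRITE * i_c * T_PULSE     # energy per written device
    print(f"I_C = {i_c * 1e6:.0f} uA: {n_parallel} parallel writes, "
          f"{e_write * 1e15:.0f} fJ/write")
```

Under these assumptions, dropping I<sub>C</sub> from 100 μA to 10 μA both multiplies the number of parallel writes by ten and divides the per-device write energy by ten, which is the double benefit noted above.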

In addition, to compare the performance of the n-SIMPLY and hybrid BNN accelerators with the state of the art, the performance and characteristics of RRAM-based BNN inference accelerators from the literature [6, 8, 42, 46, 47, 50, 51] are summarized in **Table 4**. In general, these works employ the different schemes discussed in Section 4.2.1 to accelerate the BNN VMM in the analog domain.


#### **Table 4.**

*Performance comparison of different BNN inference accelerators from the literature.*



However, each work uses different array sizes, topologies, devices, and technologies, which complicates the comparison between accelerators. Ideally, for a direct comparison between implementations, the same task should be executed on all the accelerators; however, such data are rarely reported. Some works focus only on demonstrating the feasibility of their proposed VMM implementation [46], while other studies simulate a different inference task [6] or report metrics that cannot easily be used to estimate the performance on a specific inference task. Specifically, [8, 47] report the TOPS/W metric; however, this metric indicates the maximum performance that can be achieved under specific conditions, which are not necessarily those achieved by the circuit on a generic inference task. Still, some works [42, 50–52] directly report the performance, or sufficient data to estimate the performance, of their accelerator on an MNIST handwritten digits classification task. Among these works, the results reported by Yu et al. in [42, 52] show the worst performance; however, their solution was optimized for training the entire set of network parameters directly on chip, which introduces additional overheads. Compared with the results reported by Minguet et al. in [50], the n-SIMPLY Binary Tree Adder implementation achieves similar EDP performance, while the hybrid implementation can further reduce the EDP. The lowest energy consumption for an MNIST inference task was estimated from the work of Lopez et al. [51], where the authors proposed a subthreshold read scheme in which 1S1R devices are read using a read voltage lower than the threshold voltage of the selector device, achieving a read energy of 76 fJ/bit.
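Two quick calculations help put these metrics in perspective. First, TOPS/W is ops-per-joule in disguise, so it fixes a peak energy per operation but says nothing about utilization or latency on a given task. Second, the 76 fJ/bit figure from [51] can be turned into a rough lower bound for the network used in this section, assuming each weight is read exactly once per inference and ignoring all peripheral overheads (both assumptions are ours, not claims from [51]):

```python
def pj_per_op(tops_per_watt):
    """X TOPS/W = X * 1e12 ops per joule, i.e., (1/X) pJ per operation."""
    return 1.0 / tops_per_watt

print(pj_per_op(20.0))  # a hypothetical 20 TOPS/W peak -> 0.05 pJ/op

# Rough lower bound using the 76 fJ/bit read energy from [51]:
# every weight of the 400 -> 1000 -> 10 network read once per inference.
n_weights = 400 * 1000 + 1000 * 10            # 410,000 binary weights
print(f"{n_weights * 76e-15 * 1e9:.1f} nJ")   # ~31.2 nJ per inference
```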
