**4. ReRAM-based in-memory computing architectures**

In today's DNN-based AI, the concept of neuromorphic computing has been expanded to non-von Neumann computing paradigms that integrate processing and memory on a single chip. The successful hardware implementation of synapses and neurons with ReRAM devices has enabled the rapid development of ReRAM-based in-memory accelerators for various DNN applications, including convolutional neural networks (CNNs), graph neural networks (GNNs), etc. These accelerators have demonstrated significant computing speedup and energy efficiency in AI applications such as image processing and language processing. In this chapter, we discuss early explorations of ReRAM-based accelerators and computing architectures for various DNNs and applications.

#### **4.1 ReRAM-based in-memory computing accelerators**

Building on the successful demonstration of ReRAM-based neuromorphic hardware, researchers also began to pursue architecture-level innovations, developing novel ReRAM-based in-memory processing accelerators.

In 2016, researchers proposed PRIME, a ReRAM-based neural network accelerator for DNN applications with a novel in-memory processing architecture [42]. As shown in **Figure 4**, unlike previous in-memory processing architectures that integrate additional processing units into memory, PRIME has full-function (FF) subarrays that can act as memory units for data storage or perform the matrix multiplications in DNN computation. The FF subarrays mainly consist of the following components: (1) decoders and drivers that provide analog inputs to the ReRAM crossbar arrays by controlling voltages through amplifiers, latches, etc.; (2) multiplexers that support subtraction and the sigmoid activation function; and (3) sense amplifiers that read the computing results from the ReRAM devices and support the ReLU function with additional hardware units. PRIME maximally reuses the peripheral circuits so that the ReRAM arrays can switch between storage and computing at reduced design cost. Moreover, PRIME proposed high-precision NN acceleration based on its architecture and provided software/hardware interfaces for developers.

#### **Figure 4.**

*(a) Traditional shared memory-based processor-coprocessor architecture, (b) processing in-memory approach using 3D integration technologies, (c) PRIME design [42].*
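To make the role of the crossbar concrete, the following minimal sketch (in NumPy, with illustrative parameter names such as `g_min`, `g_max`, and `v_read`) models the analog matrix-vector multiplication that such FF subarrays perform: weights are mapped to device conductances, inputs are applied as word-line voltages, and bit-line currents accumulate the dot products. It is an idealized model that ignores device non-idealities, ADC quantization, and PRIME's specific peripheral circuits.

```python
import numpy as np

def crossbar_mvm(weights, inputs, g_min=1e-6, g_max=1e-4, v_read=0.2):
    """Idealized ReRAM crossbar matrix-vector multiply (illustrative only).

    weights : (rows, cols) array of signed weights.
    inputs  : (rows,) activation vector applied on the word lines.
    Signed weights use a common differential scheme: two conductance
    columns per output, whose bit-line currents (summed by Kirchhoff's
    current law) are subtracted to recover the signed dot product.
    """
    w = np.asarray(weights, dtype=float)
    x = np.asarray(inputs, dtype=float)

    # Map |w| linearly into the conductance range [g_min, g_max].
    scale = (g_max - g_min) / max(np.abs(w).max(), 1e-12)
    g_pos = np.where(w > 0,  w * scale + g_min, g_min)   # positive-weight column
    g_neg = np.where(w < 0, -w * scale + g_min, g_min)   # negative-weight column

    # Word-line voltages encode the inputs; each bit line sums its currents.
    v = x * v_read
    i_pos = v @ g_pos
    i_neg = v @ g_neg

    # Convert the differential current back to the numeric domain.
    return (i_pos - i_neg) / (scale * v_read)

# Example: a 4-input, 3-output layer evaluated in one crossbar read.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
x = rng.normal(size=4)
print(np.allclose(crossbar_mvm(W, x), x @ W))  # True in this ideal model
```

The differential positive/negative column pair shown here is only one common way to represent signed weights; actual designs pair it with their own peripheral circuitry for subtraction and activation functions.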

In the same year, researchers at the University of Utah introduced ISAAC, a comprehensive ReRAM-based accelerator aimed at offering high-speed and energy-efficient computation of convolutional neural networks (CNNs) [43]. ISAAC is a multi-stage architecture that includes tiles, in situ multiply-accumulate units (IMAs), and ReRAM crossbar arrays, as illustrated in **Figure 5**. The fundamental component used for deploying a neural network layer is the IMA, where several IMAs share eDRAM activation buffers and other peripheral circuits. ISAAC presented an inter-layer pipeline coupled with a buffer optimization strategy: all layers of a CNN were processed concurrently, and layers were redeployed to maintain the pipeline's continuity, which improved overall throughput and computational efficiency. The subsequent layer started processing as soon as the output feature maps of the current layer met the size requirement of the next layer's convolution kernels, instead of waiting for the final results. This strategy significantly reduced the capacity requirement and on-chip area of the eDRAM buffers. In addition, ISAAC developed an encoding scheme that reduces the required ADC resolution by 1 bit, which significantly lowers hardware cost and improves computing efficiency. ISAAC achieved significant improvements in computing throughput, energy, and computational density compared to previous designs.
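The buffering condition behind this inter-layer pipeline can be sketched in a few lines. The function below is an illustrative model of the scheduling idea (not the paper's actual scheduler), assuming row-by-row production of feature maps and hypothetical parameters `out_rows`, `kernel_h`, and `stride`.

```python
# Minimal sketch of ISAAC-style inter-layer pipelining: layer l+1 starts
# consuming as soon as the first kernel-height rows of layer l's output are
# available, so the eDRAM buffer holds a sliding band of rows rather than
# whole feature maps.

def pipeline_schedule(out_rows, kernel_h, stride=1):
    """Return (producer_row, consumer_row) pairs: which output row of the
    consumer layer becomes computable as each producer row is emitted."""
    schedule = []
    consumer_row = 0
    for produced in range(1, out_rows + 1):
        # The consumer's window for output row r spans producer rows
        # [r*stride, r*stride + kernel_h - 1].
        while consumer_row * stride + kernel_h <= produced:
            schedule.append((produced, consumer_row))
            consumer_row += 1
    return schedule

# A 3x3, stride-1 convolution in the next layer: its first output row is
# ready after only 3 producer rows, long before the producer layer finishes.
print(pipeline_schedule(out_rows=6, kernel_h=3)[:3])
# -> [(3, 0), (4, 1), (5, 2)]
```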

#### **Figure 5.**

*Architecture and pipeline of ISAAC [43].*

#### **Table 3.**

*The comparison of ReRAM-based in-memory computing accelerators.*

Although the previous architectures demonstrated high computing efficiency, limitations remained. For example, they were designed only for inference and did not support the training process, and their computing pipelines were deep and could introduce pipeline bubbles. Therefore, in 2017, a novel ReRAM-based in-memory computing architecture, PipeLayer, was proposed to address these challenges [44]. The proposed architecture builds on PRIME and ISAAC and supports the training process. In addition, PipeLayer optimized the inter-layer and intra-layer pipelines to achieve high throughput in both training and inference. As in PRIME, the ReRAM subarrays were utilized for both storage and computing, with the ability to switch between the two functions. During training, the error is obtained by comparing the outputs against the labels through the loss function after the last forward cycle. In the error backpropagation process, weights and errors, which were also computed in the morphable ReRAM subarrays and saved in the storage subarrays, were updated using the layer errors and partial derivatives and passed to the previous layer. Notably, the depth of the logical cycles was independent of the number of NN layers, preventing potential issues arising from deep pipelines, such as stalls, bubbles, and latency. In addition, PipeLayer introduced a parallelism parameter to deploy duplicate copies of operators, building an intra-layer parallel pipeline that trades space for time. This intra-layer pipeline, together with the inter-layer pipeline, supports PipeLayer's high throughput. A comparison of these ReRAM-based in-memory computing accelerators is shown in **Table 3**.
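A rough way to see this space-for-time trade-off is the sketch below; the function name and the numbers are purely illustrative and not taken from the PipeLayer paper.

```python
import math

# Minimal sketch of the space-for-time idea: replicating a layer's weight
# crossbars lets several samples be processed per logical cycle, spending
# extra ReRAM area to shorten training/inference time.
def layer_cycles(batch, cycles_per_sample, weight_copies):
    """Logical cycles needed to push `batch` samples through one layer whose
    weights are duplicated across `weight_copies` crossbar groups."""
    return math.ceil(batch / weight_copies) * cycles_per_sample

for copies in (1, 2, 4):
    print(copies, layer_cycles(batch=64, cycles_per_sample=1, weight_copies=copies))
# -> 1 64, 2 32, 4 16: cycles drop as more crossbar copies (area) are used.
```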

#### **4.2 ReRAM-based accelerators for various DNNs and applications**

After the great success of ReRAM-based neuromorphic accelerators, customized ReRAM accelerators tailored to specific types of neural networks and applications have also emerged. For example, generative adversarial networks (GANs) [91] require frequent data transmission between the generator and discriminator networks, which makes them highly demanding in terms of memory, computational resources, and data transfer. ReGAN, a ReRAM-based accelerator for GANs, leverages the advantages of processing in memory and ReRAM crossbar arrays, together with optimized parallel computation and data-dependency handling, to achieve high performance and energy efficiency [45]. To expedite the execution of recurrent neural networks (RNNs), ReRAM accelerators tailored to various RNN configurations were also proposed [46].

In graph processing tasks, high data transfer costs have led researchers to focus on optimizing memory access [92]. The inherent *in situ* computing of ReRAM architectures can minimize these overheads. GraphR was the first to introduce a ReRAM-based architecture for graph computation [51]. It analyzed the feasibility of ReRAM for graph processing and applied sparse matrix-vector multiplication to data blocks in a compressed representation. This approach used a ReRAM-based graph engine to accelerate graph processing with parallel computation, although a drawback remains: sparsity still leads to additional, useless multiplications with zeros. Subsequently, Spara presented a novel vertex mapping strategy to address this challenge [52]. Other ReRAM-based architectures for graph processing focus on sparsity [53, 54], three-dimensional architecture [55, 56], regularization, redundant computation [57], etc.
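The block-skipping idea behind processing compressed graph data on crossbars can be illustrated as follows. This is a simplified NumPy sketch rather than GraphR's actual data format: the crossbar is modeled as a plain dense tile multiply, and it also exposes the drawback mentioned above, since zeros inside a non-empty block are still multiplied.

```python
import numpy as np

def blocked_spmv(adj, x, block=4):
    """Sparse matrix-vector multiply over crossbar-sized tiles: all-zero
    blocks are skipped (the compressed representation), and each remaining
    block is evaluated as one dense crossbar operation."""
    n = adj.shape[0]
    y = np.zeros(n)
    for i in range(0, n, block):
        for j in range(0, n, block):
            tile = adj[i:i + block, j:j + block]
            if not tile.any():
                continue  # empty block: never mapped to a crossbar
            # Zeros *inside* a non-empty block still get multiplied here.
            y[i:i + block] += tile @ x[j:j + block]
    return y

rng = np.random.default_rng(1)
A = (rng.random((8, 8)) < 0.15).astype(float)  # sparse adjacency matrix
x = rng.random(8)
print(np.allclose(blocked_spmv(A, x), A @ x))  # True
```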

Transformers, among the most advanced models in current natural language processing (NLP), present several challenges to ReRAM-based neuromorphic computing [93]. For example, due to the self-attention mechanism, one layer's input includes the results of the previous layer, generating data dependencies that cause bubbles in the traditional inter-layer pipeline. In addition, the calculation of scaled dot-product attention differs significantly from the traditional multiply-accumulate operations in CNNs. ReTransformer [58] and ReBERT [59] are dedicated to addressing these challenges. ReTransformer mathematically decomposed the matrix multiplication into two steps, reducing pipeline bubbles by performing the first (initialization) step and the second (calculation) step sequentially in two separate cycles. ReBERT concentrated on the attention mechanism and proposed a window self-attention mechanism, with a corresponding window-size search algorithm, suited to ReRAM crossbar arrays, achieving significant speedup and energy savings. There are also other ReRAM-based architectures for NLP, including sparse attention [60] and BERT deployment [61].
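The data-dependency problem can be seen in a plain NumPy sketch of standard scaled dot-product attention (this is the textbook formulation, not ReTransformer's specific decomposition): both operands of the attention product are activations produced at runtime, so one of them would have to be written into crossbars between pipeline steps.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Standard scaled dot-product attention. Unlike a CNN layer, both
    operands of Q @ K.T are computed at runtime rather than being
    pre-programmed weights, which is awkward for weight-stationary
    ReRAM crossbars and for the inter-layer pipeline."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # runtime-by-runtime matmul
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over keys
    return weights @ V                                   # second runtime matmul

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 8))                 # 5 tokens, model dimension 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape)                            # (5, 8)
```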
