#### *2.1.1 RRAM Device*

RRAM-based IMC architectures consist of an RRAM memory cell at each crosspoint within the IMC crossbar array. RRAM is a two-terminal device with a programmable resistance that represents the neural network's weights. It offers high integration density, fast read speed, high memory access bandwidth, and good compatibility with CMOS fabrication technology. For example, the RRAM device stack can include a TiN bottom electrode, an HfO2 resistive-switching layer, a PVD Ti oxygen exchange layer (OEL), and a 40 nm TiN top electrode [59, 60]. This specific stack is implemented between the M1 and M2 metallization layers using a CMOS-compatible back-end-of-line (BEOL) process flow.

#### **Figure 3.**

*Generic block diagram of an IMC architecture for DNN acceleration. It consists of an array of IMC tiles connected by an NoC with each tile consisting of a number of IMC arrays [44, 45].*

Each RRAM cell can be characterized by the number of resistance levels it can access. Broadly, RRAM cells can be classified into single-level cells (SLC) and multilevel cells (MLC). An SLC has only two resistance levels, that is, it can store only binary data. MLC cells, on the other hand, have multiple resistance levels that represent higher-precision data. The number of available resistance levels is governed by the ratio of the off resistance (R_off) to the on resistance (R_on) [61]. This ratio defines the range of resistances accessible in a given RRAM device. The overall resistance range can be divided into two main states: a low resistance state (LRS) and a high resistance state (HRS). The LRS covers the lower portion of the resistance band, while the HRS covers the upper portion.
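
To make this concrete, the short sketch below (a simplified model; the R_on and R_off values are hypothetical and not taken from [61]) maps a normalized weight onto one of the discrete conductance levels available between 1/R_on and 1/R_off, for both the SLC and MLC cases.

```python
# Hypothetical device parameters: R_on = 10 kOhm, R_off = 1 MOhm (R_off/R_on = 100).
R_ON, R_OFF = 10e3, 1e6
G_ON, G_OFF = 1.0 / R_ON, 1.0 / R_OFF   # conductance bounds in siemens

def weight_to_conductance(w, num_levels):
    """Map a weight in [0, 1] to the nearest of num_levels conductance levels.
    num_levels = 2 models an SLC; num_levels > 2 models an MLC."""
    assert 0.0 <= w <= 1.0 and num_levels >= 2
    step = (G_ON - G_OFF) / (num_levels - 1)   # conductance spacing per level
    level = round(w * (num_levels - 1))        # nearest discrete level index
    return G_OFF + level * step

print(weight_to_conductance(0.7, 2))    # SLC: snaps to G_on (LRS)
print(weight_to_conductance(0.7, 16))   # MLC: 16 levels give finer resolution
```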

To program the RRAM device, a series of steps needs to be followed [62]. First, the RRAM device is formed by applying a large voltage across its two terminals. This process breaks down the pristine barrier and allows electron flow between the terminals. Next, the RRAM is programmed to the required resistance by passing a specific current (the compliance current) through the two electrodes; the resulting resistance depends on the compliance current. Furthermore, different resistance levels can be achieved depending on the RRAM device (SLC or MLC). Finally, once the RRAM device is programmed, a read is performed by applying a voltage across the device electrodes. For the RRAM device proposed in [59, 60], a read voltage of up to 0.4 V can be sustained. Applying a higher voltage either damages the device or drives it into the write state, changing the programmed resistance level.
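
The sketch below captures this form-program-read sequence as a toy behavioral model; the `RRAMCell` class, its forming threshold, and its compliance-current-to-resistance relation are illustrative assumptions, with only the 0.4 V read limit taken from the text above.

```python
class RRAMCell:
    """Toy behavioral model of a single RRAM cell (illustrative only)."""
    READ_V_MAX = 0.4           # read voltages above this may disturb or damage the cell

    def __init__(self):
        self.formed = False
        self.resistance = None  # ohms

    def form(self, voltage):
        # Forming: a large one-time voltage creates the conductive filament.
        if voltage >= 2.5:      # assumed forming threshold
            self.formed = True

    def program(self, compliance_current):
        # SET: the compliance current limits filament growth and therefore
        # selects the programmed resistance (higher current -> lower resistance).
        if not self.formed:
            raise RuntimeError("cell must be formed before programming")
        self.resistance = 0.5 / compliance_current   # assumed current-to-resistance relation

    def read(self, voltage):
        # READ: a small voltage senses the stored resistance without changing it.
        if voltage > self.READ_V_MAX:
            raise ValueError("read voltage would disturb the programmed state")
        return voltage / self.resistance             # Ohm's law: sensed current

cell = RRAMCell()
cell.form(voltage=3.0)
cell.program(compliance_current=50e-6)   # ~10 kOhm with the assumed relation
print(cell.read(voltage=0.2))            # sensed current in amperes
```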

#### *2.1.2 IMC Architecture*

Studies of crossbar architectures have demonstrated 100× to 1000× improvements in energy efficiency compared to traditional CPU and GPU architectures [44, 45, 47, 49, 50, 52, 63–67]. **Figure 3** shows the block diagram of an IMC architecture with an RRAM/SRAM memory cell. The architecture consists of an array of IMC tiles connected by a network-on-chip (NoC). The architecture also includes a global pooling unit, a nonlinear activation unit, an accumulator, and input/output buffers. A global control logic handles the overall coordination of these blocks.

Each tile consists of an array of processing elements (PEs), where each PE is an IMC crossbar array with either an SRAM or an RRAM cell. Each IMC crossbar array also includes a set of peripheral circuits that enable the MAC computations.

**Figure 4** shows the generic block diagram for a single RRAM-based IMC crossbar array. In the RRAM IMC, the gate of the access transistor in each cell connects to the wordline (WL) of the IMC crossbar array [60]. For the SRAM-based IMC with a conventional 6T structure, the WL connects to the access transistors of the bit-cells. The IMC crossbar arrays consist of a wordline (WL) decoder, a WL driver, a column multiplexer, an analog-to-digital converter (ADC) or a sense amplifier, a shift and add circuit, control logic, and input/output buffers. The WL decoder turns the WLs of the IMC crossbar array on and off, while the WL driver and level shifter ensure that the WL can be driven to the voltage needed to turn on the memory cell. Next, for an N × N IMC crossbar array, groups of M columns share a single read-out circuit, which consists of the ADC, the shift and add circuit, and the precharge circuit for the read operation. A column multiplexer enables this sharing of M columns. Finally, custom control logic drives the control signals during the operation of the IMC crossbar array. We will now go over the operation of both the SRAM- and RRAM-based IMC architectures, beginning with the RRAM-based design.
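
As a rough illustration of the column sharing, the following sketch time-multiplexes one idealized ADC across M columns; the array size, mux ratio, ADC resolution, and the `adc` helper are assumptions for illustration rather than details from [60].

```python
def adc(analog_value, bits=4, full_scale=1.0):
    """Idealized ADC: uniform quantization of a value in [0, full_scale]."""
    levels = 2 ** bits - 1
    return round(min(max(analog_value, 0.0), full_scale) / full_scale * levels)

def read_out(column_currents, mux_ratio=8):
    """Digitize N column outputs with N / mux_ratio shared ADCs.
    Columns in the same mux group are read in consecutive cycles."""
    digital_outputs = [0] * len(column_currents)
    for cycle in range(mux_ratio):                        # one mux setting per cycle
        for group_start in range(0, len(column_currents), mux_ratio):
            col = group_start + cycle                     # column selected by the mux
            if col < len(column_currents):
                digital_outputs[col] = adc(column_currents[col])
    return digital_outputs

# 32 columns with an 8:1 column mux -> 4 shared ADCs, 8 read-out cycles.
print(read_out([i / 32 for i in range(32)], mux_ratio=8))
```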


#### **Figure 4.**

*Block diagram of an RRAM-based IMC crossbar array. An array of RRAM cells forms the IMC crossbar array. Peripheral circuits such as the bitline (BL)/select line (SL)/column multiplexer (MUX), precharge circuit, wordline (WL) decoder and driver, buffers, level shifters, ADC, and shift and add circuit complete the RRAM-based IMC.*

As shown in **Figure 4**, the RRAM devices are programmed by applying a voltage across their two terminals. To facilitate this, the terminals are connected to the bitline (BL) and the select line (SL). By applying a voltage across the BL and the SL, the forming, programming, and reading operations are performed cell by cell. During the write state of the IMC, each cell is selected and then written. During the compute state, the RRAM undergoes the read operation. Two kinds of read-out are performed: parallel and serial. During the parallel read-out, all (or multiple) WLs are turned on simultaneously, and the outputs are accumulated along the BL. Two kinds of input schemes are employed for single-bit and multibit inputs. The first method uses a digital-to-analog converter (DAC) to convert the input vector to an analog voltage and performs the computation in the charge domain [44]. The second method is bit-serial computing, where each bit of the input vector is processed one at a time, and the bit significance of each input bit is handled using a shift and add circuit [45, 49, 63].

Depending on the resistance programmed in the RRAM, an output current/charge is generated as the product of the applied voltage and the cell's conductance (the inverse of its resistance). This operation is analogous to the multiply in the MAC. The current/charge is then accumulated across all rows of a given column to perform the addition in the MAC. In the case of the serial read-out, the IMC array is accessed row by row for the MAC computations. Overall, the final MAC output is generated by accumulating across all rows of the IMC crossbar array.
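
A minimal numerical sketch of this MAC operation is shown below, assuming the bit-serial input scheme: each cycle applies one input bit on the wordlines, every column accumulates the currents I = V × G of its active rows, and a shift-and-add restores the bit significance. The array size and conductance values are arbitrary, and the per-cycle ADC quantization is omitted for brevity.

```python
# Weights are stored as conductances G[row][col]; inputs are unsigned integers
# applied one bit at a time on the wordlines (bit-serial scheme).
V_READ = 0.2   # assumed read voltage for an active wordline

def crossbar_mac_bitserial(G, inputs, input_bits=4):
    """Compute one MAC per column: out[col] = sum_row inputs[row] * G[row][col]."""
    num_rows, num_cols = len(G), len(G[0])
    outputs = [0.0] * num_cols
    for b in range(input_bits):                        # one cycle per input bit
        for col in range(num_cols):
            # Kirchhoff summation: currents of all active rows add on the bitline.
            i_col = sum(V_READ * G[row][col]
                        for row in range(num_rows)
                        if (inputs[row] >> b) & 1)     # wordline on iff bit b is 1
            # Shift-and-add restores the significance of input bit b.
            outputs[col] += i_col * (1 << b)
    return outputs

G = [[1e-4, 2e-5],        # 2x2 toy array of conductances (siemens)
     [5e-5, 1e-4]]
print(crossbar_mac_bitserial(G, inputs=[3, 5]))
```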

**Figure 5** shows the generic block diagram for a single SRAM-based IMC crossbar array. Next, we discuss the operation of an SRAM-based IMC architecture [48, 49, 51, 52, 68–70]. Depending on the SRAM bit-cell type and the degree of parallelism, IMC designs can be broadly divided into three categories [71]: the 6T bit-cell with parallel computing, the 6T bit-cell with local computing, and the (6T + extra-T) bit-cell with parallel computing.

#### **Figure 5.**

*Block diagram of an SRAM-based IMC crossbar array. An array of SRAM cells (6T or 6T + additional circuit) forms the IMC crossbar array. Peripheral circuits such as bitline (BL) and bitline bar (BLB) precharge and conditioning circuits, row decoder and WL driver, column multiplexers, buffers, write drivers, ADC, and shift and add circuit complete the SRAM-based IMC.*

Early SRAM-based IMC architectures employed the 6T bit-cell with parallel computation [72, 73]. Parallel computation is achieved by turning on all the WLs together to perform the MAC operations. Each WL is driven by the corresponding element of the input vector: a one turns that cell on, while a zero keeps it off. Next, a 6T bit-cell with a local compute structure is used, where a dedicated compute engine performs the MAC operation [70]. Here, the MAC operation is performed row by row, similar to the serial read-out in RRAM-based IMC. Finally, in addition to the 6T cell, extra transistors can be added to each bit-cell to enable parallel compute [51, 68, 69]. In addition to the bit-cell structure, peripheral circuits such as precharge circuits, ADCs, write drivers, column multiplexers, row decoders, and row drivers are used.
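
The sketch below illustrates the 6T parallel-compute scheme at the functional level: every wordline is driven by one input bit, and each column accumulates the products of input bits and stored binary weights. The array contents are illustrative, and all analog bitline effects are abstracted away.

```python
def sram_imc_parallel_mac(weights, input_bits):
    """Binary MAC of a 6T SRAM IMC with all wordlines asserted in parallel.
    weights[row][col] and input_bits[row] are 0/1; each column output is the
    count of rows where both the input bit and the stored bit are 1."""
    num_cols = len(weights[0])
    return [sum(input_bits[row] & weights[row][col]
                for row in range(len(weights)))
            for col in range(num_cols)]

weights = [[1, 0, 1],
           [1, 1, 0],
           [0, 1, 1]]
print(sram_imc_parallel_mac(weights, input_bits=[1, 0, 1]))  # -> [1, 1, 2]
```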

#### *2.1.3 Challenges with IMC Architectures*

IMC architectures are known for their improved energy efficiency and throughput, but they have some drawbacks. One such drawback is the limited precision of the IMC crossbar array, particularly in the memory cell and ADC, which can affect the accuracy of DNN inference [74, 75]. Additionally, noise within the analog computation can also harm DNN inference accuracy. The challenges associated with an RRAM-based IMC architecture are discussed next. RRAM devices suffer from several non-idealities, including limited resistance levels, device-to-device write variations, stuck-at-faults, and a limited R_off/R_on ratio, which make it difficult to design reliable RRAM-based IMC architectures [60, 61, 76–84]. These non-idealities can cause programmed weight values (resistance values) to deviate, significantly reducing post-mapping accuracy for DNNs. Moreover, the limited array size of the IMC crossbar structure necessitates splitting large convolution (conv) or fully connected (FC) layers into partial operations, which can introduce additional errors due to the limited precision of the peripheral circuits (ADC and shift and add) of the RRAM-based IMC crossbar.
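
To illustrate the layer-splitting point, the sketch below partitions a weight matrix larger than a single crossbar into sub-arrays and digitally accumulates their partial MAC results; the crossbar dimensions are arbitrary assumptions, and the limited-precision ADC that each partial sum would pass through in hardware is omitted.

```python
def split_mac(weights, inputs, xbar_rows=64, xbar_cols=64):
    """MAC for a layer larger than one crossbar: the weight matrix is tiled into
    xbar_rows x xbar_cols sub-arrays, each mapped to one crossbar, and the
    partial sums of the row tiles are accumulated digitally per output column."""
    num_rows, num_cols = len(weights), len(weights[0])
    outputs = [0.0] * num_cols
    for r0 in range(0, num_rows, xbar_rows):          # tile over input rows
        for c0 in range(0, num_cols, xbar_cols):      # tile over output columns
            for col in range(c0, min(c0 + xbar_cols, num_cols)):
                # Partial MAC computed by one crossbar; in hardware each partial
                # sum would also pass through a limited-precision ADC here.
                partial = sum(inputs[row] * weights[row][col]
                              for row in range(r0, min(r0 + xbar_rows, num_rows)))
                outputs[col] += partial
    return outputs

# A 128x128 layer mapped onto 64x64 crossbars -> 4 crossbars, partial sums added.
w = [[1.0] * 128 for _ in range(128)]
print(split_mac(w, inputs=[0.01] * 128)[0])   # approximately 128 * 0.01 = 1.28
```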


Previous research has proposed several methods to address the post-mapping accuracy loss associated with RRAM-based in-memory acceleration of DNNs. Two proposed methods are Closed-Loop-on-Device (CLD) and Open-Loop-off-Device (OLD), which involve iterative read-verify-write (R-V-W) operations on the RRAM device until the resistance converges to the desired value [85, 86]. Other methods, such as those in References [78, 87], utilize variation-aware training (VAT) based on the known device variation (*σ*) characterized from RRAM devices. In [76], VAT is combined with dynamic precision quantization to mitigate the post-mapping accuracy loss. Another approach, proposed in [75], injects RRAM macro measurement results that include variability and noise into the DNN training process to improve the DNN accuracy of the RRAM IMC hardware. Mohanty et al. [88] propose post-mapping training that selects a random subset of weights and maps them to an on-chip memory to recover accuracy. Charan et al. [79] utilize knowledge distillation and online adaptation to mitigate accuracy loss, using an SRAM-based IMC as the parallel network, while [88] proposes to use a register file and a randomization circuit. Finally, [77, 80] propose a custom unary mapping scheme that maps the most significant bit (MSB) and least significant bit (LSB) of the weights to RRAM devices based on individual cell variations and bit significance.
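
The following sketch outlines the iterative read-verify-write idea behind such closed-loop programming; the `ToyCell` model, pulse step sizes, and convergence tolerance are hypothetical and not taken from [85, 86].

```python
class ToyCell:
    """Toy cell whose SET/RESET pulses move the resistance by a fixed factor."""
    def __init__(self, resistance=1e6):
        self.resistance = resistance
    def read_resistance(self):
        return self.resistance
    def apply_set_pulse(self):
        self.resistance *= 0.8          # SET lowers the resistance
    def apply_reset_pulse(self):
        self.resistance *= 1.25         # RESET raises the resistance

def program_rvw(cell, target_resistance, tolerance=0.1, max_iters=100):
    """Iterative read-verify-write: pulse the cell until the read-back
    resistance is within `tolerance` (relative) of the target."""
    for _ in range(max_iters):
        r = cell.read_resistance()                              # READ
        error = (r - target_resistance) / target_resistance
        if abs(error) <= tolerance:                             # VERIFY
            return True
        cell.apply_set_pulse() if error > 0 else cell.apply_reset_pulse()  # WRITE
    return False                        # did not converge within max_iters

cell = ToyCell()
print(program_rvw(cell, target_resistance=5e4), cell.read_resistance())
```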

Next, we discuss the challenges associated with SRAM-based IMC architectures. A compromise between parallelism and reliability must be struck for the best performance. In a conventional 6T SRAM IMC architecture, parallel computation is achieved by turning on all or multiple rows. The higher parallelism raises the critical issue of read disturbance, which forces the WL to be driven at a lower voltage [72, 73, 89]. To mitigate this, reduced parallelism is employed by exploiting a local compute engine [70]; the reduced parallelism, in turn, lowers the throughput of DNN inference. Yin et al. [51, 68, 69] propose adding transistors that isolate the bit-cell and enable parallel computation. Such a solution comes at the cost of increased area overhead, thus limiting the density of the SRAM-based IMC architecture. The additional-transistor solution is typically implemented using either a resistance or a capacitance. The resistive IMC method implements a multi-bit MAC operation by utilizing resistive pull-up/down transistors [51, 68, 69]. The pull-up/down characteristics of the transistors exhibit nonlinear behavior in the read bitline (RBL) transfer curve across different voltage ranges, thus reducing reliability. In contrast, the capacitive SRAM-based IMC uses a capacitor per bit-cell and relies on charge sharing and capacitive coupling to perform the MAC operations [52]. The capacitive SRAM IMC exhibits a more linear transfer characteristic on the RBL, but at the cost of one capacitor per bit-cell. Finally, the limited precision of the ADC and the noise on the bitline (BL) require careful algorithm design to achieve the best inference accuracy [89].
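
As a rough illustration of the capacitive scheme, the sketch below models the RBL voltage produced by ideal charge sharing across identical per-bit-cell capacitors; the supply voltage and unit capacitance are assumed values, and transistor non-idealities are ignored.

```python
def capacitive_rbl_voltage(products, v_dd=0.9, c_unit=1e-15):
    """Charge-sharing MAC: each bit-cell charges its unit capacitor to v_dd when
    its (input AND weight) product is 1, then all capacitors share charge on the
    read bitline, giving an RBL voltage proportional to the sum of products."""
    total_charge = sum(products) * c_unit * v_dd       # charge before sharing
    total_capacitance = len(products) * c_unit         # all caps tied to the RBL
    return total_charge / total_capacitance            # linear in sum(products)

# 8 bit-cells, 5 of which produce a 1: the RBL settles to 5/8 of v_dd.
print(capacitive_rbl_voltage([1, 1, 0, 1, 0, 1, 1, 0]))   # 0.5625
```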

### **2.2 2.5D/Chiplet-Based AI Accelerators**

The area of monolithic hardware accelerators grows with the number of parameters of AI algorithms. The larger silicon area of a single monolithic chip reduces the yield and therefore increases the fabrication cost [46]. Chiplet-based systems address the higher fabrication cost by integrating multiple small chips (known as chiplets) within a single package. Since the area of each chiplet in the system is considerably smaller than that of a monolithic chip (for the same AI algorithm), the yield of the chiplet-based system increases, which reduces the fabrication cost. The communication between chiplets is performed through a network-on-package (NoP), as shown in **Figure 6**.
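
To make the yield argument concrete, the sketch below applies a textbook negative-binomial die-yield model (not taken from [46]); the die areas, defect density, and clustering parameter are illustrative assumptions.

```python
def die_yield(area_cm2, defect_density=0.15, alpha=3.0):
    """Negative-binomial yield model: Y = (1 + A*D0/alpha)^(-alpha),
    with D0 in defects/cm^2 and alpha the defect-clustering parameter."""
    return (1.0 + area_cm2 * defect_density / alpha) ** (-alpha)

monolithic_area = 4.0                      # 400 mm^2 monolithic die (assumed)
chiplet_area = monolithic_area / 16        # the same logic split into 16 chiplets

print(f"monolithic die yield: {die_yield(monolithic_area):.1%}")   # ~57.9%
print(f"per-chiplet yield:    {die_yield(chiplet_area):.1%}")      # ~96.3%
```

With these assumed numbers, splitting one 400 mm² die into sixteen 25 mm² chiplets raises the per-die yield from roughly 58% to 96%, which is the effect the chiplet approach exploits to lower fabrication cost.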

#### **Figure 6.**

*Chiplet-based IMC architecture [46, 64] that includes an NoP for on-package communication, an NoC for on-chip communication within each chiplet, and a point-to-point network such as an H-Tree for within-tile communication.*

Several works in the literature propose NoPs for chiplet-based systems considering different performance objectives (e.g., latency, energy) [42, 46, 90–92].

Kite is a family of NoP topologies proposed in [90], targeted mainly at general-purpose processors. In this work, three topologies are proposed: Kite-Small, Kite-Medium, and Kite-Large. First, an objective function is constructed as a combination of the average delay between source and destination, the diameter, and the bisection bandwidth of the NoP. Experimental evaluations on synthetic traffic show that the proposed Kite topologies reduce latency by 7% and improve peak throughput by 17% with respect to other well-known interconnect topologies. A chiplet-based system with a 96-core processor, INTACT, is proposed in [91]. The chiplets are connected through a generic chiplet-interposer interface (called 3D-plugs in the paper), which consists of micro-bump arrays. However, neither Kite nor INTACT is specific to AI workloads.

Shao et al. designed and fabricated a 36-chiplet system called SIMBA for deep-learning inference [92]. The chiplets in the system are connected through a mesh NoP. Ground-referenced signaling (GRS) is used for intra-package communication, and the NoP follows a hybrid wormhole/cut-through flow control. The NoP bandwidth is 100 GB/s per chiplet, and the latency for one hop is 20 ns. Extensive evaluation of the fabricated chip shows up to a 16% speedup compared to the baseline layer mapping for ResNet-50. A simulator for chiplet-based systems, SIAM, is proposed in [46], targeting AI workloads. In this simulator, a mesh topology is considered for the NoP. It is shown that up to 85% of the total system area is contributed by the NoP. In this work, multiple studies were performed by varying NoP parameters. For example, it is shown that increasing the NoP channel width increases the energy-delay product of the NoP for ResNet-110; this phenomenon is demonstrated for systems with 25 and 36 chiplets. However, none of the prior works considered any workload-aware optimization of the NoP. Therefore, there is ample opportunity for future research on NoP optimization for AI accelerators.
