**3.2 SIAM: Chiplet-based AI benchmarking simulator**

This section introduces SIAM [46], a benchmarking simulator for chiplet-based IMC architectures. SIAM supports both generic (homogeneous) and custom chiplet-based IMC architectures. A homogeneous architecture has a fixed, user-specified number of chiplets, whereas a custom architecture comprises exactly the number of chiplets required to map the DNN under consideration. In both cases, each chiplet contains a fixed, user-defined number of IMC crossbar arrays. **Figure 6** illustrates the homogeneous chiplet-based IMC architecture used by SIAM.

**Figure 8** illustrates how the comprehensive framework provided by SIAM can be used to benchmark the performance of chiplet-based IMC architectures. SIAM generates a chiplet-based IMC architecture from user-defined inputs and assesses its hardware performance by computing various metrics, including area, energy, latency, energy efficiency, power, leakage energy, and IMC utilization. The SIAM framework is developed in Python and C++ and has a top-level Python wrapper that integrates the different components of the simulator. Additionally, SIAM interfaces with widely used deep learning frameworks, such as PyTorch and TensorFlow, and supports a range of network structures from the current literature; it can therefore also be used to explore neural architecture search (NAS) techniques. **Table 1** describes the user inputs and their associated parameters for the SIAM benchmarking tool. SIAM consists of four engines:

1. the partition and mapping engine,
2. the circuit and NoC engine,
3. the NoP engine, and
4. the DRAM engine.

The individual engines within SIAM operate independently on different subsets of the user inputs and communicate with each other through the top-level Python wrapper. **Figure 8** provides an overview of the simulation flow used by SIAM.

**Figure 8.**

*Block diagram of the chiplet-based IMC architecture simulator SIAM [46].*


#### **Table 1.**

*Definition of the user inputs to SIAM.*

First, the partition and mapping engine performs layer partitioning and mapping onto the chiplets and IMC crossbars. This step generates the IMC architecture structure, the number of chiplets and IMC tiles required per layer, the IMC architecture utilization, the volume of intra-chiplet and inter-chiplet data movement, and the number of global accumulator accesses. Next, the circuit and NoC engine evaluates the intra-chiplet circuit performance and the global interconnect performance, respectively, providing hardware metrics such as area, energy, and latency. Meanwhile, the NoP engine evaluates the cost of chiplet-to-chiplet data movement. Lastly, the DRAM engine assesses the memory access cost, providing energy and latency metrics. Except for the partition and mapping engine, all engines operate concurrently, which shortens simulation time. Additionally, SIAM can be used to benchmark traditional monolithic IMC architectures. In the following sections, we provide detailed information on the four engines that comprise SIAM's core functionality.
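The simulation flow just described, partition and mapping first, then the remaining engines in parallel, can be sketched in Python as follows. All engine functions here are hypothetical stand-ins for SIAM's components, not its actual API:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative stand-ins for SIAM's engines (not the real interfaces);
# each would return hardware metrics such as area, energy, and latency.
def partition_and_map(user_inputs):
    return {"tiles_per_layer": [4, 8]}

def circuit_noc_engine(mapping):
    return {"area_mm2": 1.0}

def nop_engine(mapping):
    return {"nop_energy_pj": 2.0}

def dram_engine(mapping):
    return {"dram_latency_ns": 3.0}

def run_siam(user_inputs):
    """Sketch of the top-level wrapper: the partition and mapping engine
    must finish first; the other three engines then run concurrently,
    as described in the text."""
    mapping = partition_and_map(user_inputs)
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(engine, mapping)
                   for engine in (circuit_noc_engine, nop_engine, dram_engine)]
        return [f.result() for f in futures]
```

Running the three downstream engines in parallel is what yields the shorter simulation times mentioned above; only the mapping result needs to be shared between them.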

### **3.3 Partition and mapping engine**

The operational steps of the partition and mapping engine are explained in Algorithm 1. The engine is responsible for partitioning the DNN layers and mapping them onto the IMC chiplets and crossbar arrays. The partitioning and mapping are carried out layer by layer for the entire DNN. The engine takes various user inputs such as DNN structure, the precision of DNN weights, the mapping scheme for IMC chiplets, the size of IMC chiplets, and the size of IMC crossbars.
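The user inputs listed above can be gathered into a simple configuration object. This is an illustrative sketch of the engine's inputs, not SIAM's actual input format; all field names are ours:

```python
from dataclasses import dataclass

@dataclass
class PartitionInputs:
    """Hypothetical container for the inputs consumed by the partition
    and mapping engine (names are illustrative, not SIAM's API)."""
    layer_shapes: list      # per-layer (Kx, Ky, Nif, Nof) tuples of the DNN
    weight_precision: int   # N_bits, DNN weight precision in bits
    mapping_scheme: str     # "homogeneous" or "custom" chiplet partition
    chiplet_size: int       # S, number of IMC crossbars per chiplet
    crossbar_rows: int      # PE_x, rows per IMC crossbar
    crossbar_cols: int      # PE_y, columns per IMC crossbar
```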

To begin with, we describe the IMC mapping approach employed in SIAM. Consider a layer $i$ with weight matrix $W_i$ of dimensions $Kx_i \times Ky_i \times Nif_i \times Nof_i$, where $Kx$ and $Ky$ indicate the kernel size, $Nif$ refers to the number of input features, and $Nof$ denotes the number of output features. SIAM uses the same mapping scheme presented in [44, 45]:

$$N_i^r = \left\lceil \frac{\text{Kx}_i \times \text{Ky}_i \times \text{Nif}_i}{PE_x} \right\rceil; \quad N_i^c = \left\lceil \frac{\text{Nof}_i \times N_{bits}}{PE_y} \right\rceil \tag{1}$$

The equation given above calculates the number of IMC crossbar rows and columns required to map layer $i$ of the DNN. $N_i^r$ and $N_i^c$ represent the required numbers of rows and columns, while $N_{bits}$, $PE_x$, and $PE_y$ denote the DNN weight precision and the number of rows and columns in the IMC crossbar array, respectively. The total number of IMC crossbar arrays required to map layer $i$ of the DNN is the product $N_i^{Total} = N_i^r \times N_i^c$ (line 7 of Algorithm 1).
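As a concrete illustration, Eq. (1) can be implemented in a few lines of Python. The function name and signature are ours, not SIAM's:

```python
import math

def crossbars_for_layer(kx, ky, nif, nof, n_bits, pe_x, pe_y):
    """Number of IMC crossbar arrays needed to map one DNN layer,
    per Eq. (1). Illustrative re-implementation; SIAM's internal
    code may differ."""
    n_rows = math.ceil((kx * ky * nif) / pe_x)   # N_i^r
    n_cols = math.ceil((nof * n_bits) / pe_y)    # N_i^c
    return n_rows, n_cols, n_rows * n_cols       # N_i^Total = N_i^r * N_i^c

# Example: a 3x3 conv layer with 64 input and 128 output features,
# 8-bit weights, on 128x128 crossbars:
#   rows = ceil(3*3*64 / 128) = 5
#   cols = ceil(128*8 / 128)  = 8   ->  total = 40 crossbars
```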

SIAM can create homogeneous and custom chiplet-based IMC architectures using two types of chiplet partitions. The partition and mapping engine generates architectures under two assumptions: a DNN layer can be divided across multiple chiplets, and each chiplet can support multiple layers for optimal chiplet utilization. Mapping an entire layer of the DNN requires multiple chiplets with IMC crossbar arrays because each layer contains numerous multi-bit weights. Dividing a layer across multiple chiplets incurs overhead: additional control logic for routing inputs, higher chiplet-to-chiplet communication energy and latency, and greater inter-chiplet data communication volume. To prevent workload imbalance, the engine divides each layer uniformly across its chiplets during partitioning. The engine determines the number of chiplets required to map layer $i$ of the DNN from the total number of required IMC crossbar arrays, $N_i^{Total}$, using the equation $N_i^{Chiplet} = \left\lceil N_i^{Total} / S \right\rceil$, where $S$ represents the total number of IMC crossbar arrays within a chiplet (the chiplet size). The resulting architectures generated by the partition and mapping engine can be seen in **Figure 9**.
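The chiplet count and the uniform division described above can be sketched as follows. The function names and the even-split policy are illustrative assumptions, not SIAM's exact implementation:

```python
import math

def chiplets_for_layer(n_total, chiplet_size):
    """N_i^Chiplet = ceil(N_i^Total / S): chiplets needed to map one
    layer, where S is the number of crossbars per chiplet."""
    return math.ceil(n_total / chiplet_size)

def uniform_split(n_total, n_chiplets):
    """Divide a layer's crossbars as evenly as possible across its
    chiplets to avoid workload imbalance (illustrative policy)."""
    base, rem = divmod(n_total, n_chiplets)
    return [base + (1 if k < rem else 0) for k in range(n_chiplets)]

# Example: 40 crossbars on chiplets of size S = 16
#   -> 3 chiplets, split as [14, 13, 13]
```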

After determining the number of chiplets required to map each layer of the DNN, the next step is to determine the total number of chiplets in the architecture (line 9 of Algorithm 1). In the *homogeneous chiplet partition* scheme, the user specifies a fixed number of chiplets to map the DNN. The engine compares the total number of

#### **Figure 9.**

*The figure illustrates two chiplet-based IMC architectures, namely homogeneous (left) and custom (right), generated by SIAM for the same DNN. The homogeneous architecture is a generic architecture, whereas the custom architecture is tailored for the specific DNN. The NoP router is denoted by R in both architectures.*
