**3.4 Circuit and NoC engine**

After the DNN has been partitioned and mapped, the next step in SIAM is to floorplan and place the inter- and intra-chiplet components. This step yields the final design of the chiplet-based IMC architecture. Once the architecture is determined, the circuit and NoC engine estimates the performance of the hardware, as illustrated in **Figure 10**. The engine uses a model-based estimator for the circuits and a trace-based estimator for the interconnect.

#### **Figure 10.**

*Block diagram of the circuit and NoC engine within SIAM. The engine uses separate circuit and NoC simulators that together perform the overall hardware performance estimation.*

#### *3.4.1 Circuit estimator*

The circuit estimator evaluates the hardware performance of the chiplets, global accumulator, and global buffer in the entire chiplet-based IMC architecture. To perform this evaluation, the estimator takes a range of inputs: the placement of components within and across chiplets, the number of chiplets and IMC crossbars per layer, the IMC utilization per layer, the technology node, the operating frequency, the type of IMC cell, the number of bits per cell, and the ADC precision. The intra-chiplet circuits include the IMC crossbar array, buffer, accumulator, activation unit, and pooling unit. The peripheral circuits of each crossbar consist of the ADC, multiplexer circuit, shift-and-add circuit, and decoders. The circuit estimator is calibrated using NeuroSim [95].

The circuit estimator evaluates the performance of the entire chiplet-based IMC architecture, where each layer of the DNN is considered separately, and each chiplet is responsible for the computations of a subset of layers. The partition and mapping engine provides the chiplet count, IMC crossbar count, and IMC utilization values for each layer. The estimator computes the area, energy, and latency from the device level up to the circuit and architecture levels, using user inputs such as technology node, IMC cell type, IMC crossbar size, ADC precision, and read-out mode. Within each chiplet, the cost of a single crossbar and its peripheral circuits is evaluated first. This cost is then combined with that of the buffer, shift-and-add circuitry, accumulator, pooling, and activation units to obtain the total area, energy, and latency of the IMC chiplet. At the chiplet level, the global accumulator and global buffer accumulate the partial sums of a layer across chiplets. The estimator uses the number of additions performed, the data volume from each chiplet, and the accumulator size to determine the area, energy, and latency of the global accumulator and buffer. Finally, the estimator repeats the estimation for all chiplets required for a given layer of the DNN to determine the overall hardware performance.
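The bottom-up roll-up described above can be illustrated with a minimal sketch. This is not SIAM's or NeuroSim's implementation: the `Cost` type, the function names, and the aggregation rules are assumptions made for illustration. In particular, the sketch assumes that crossbars within a chiplet and chiplets within a layer operate in parallel, so area and energy add while latency does not scale with the count.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Cost:
    """Hypothetical cost record; units are illustrative only."""
    area: float     # e.g. mm^2
    energy: float   # e.g. pJ
    latency: float  # e.g. ns


def chiplet_cost(crossbar: Cost, n_crossbars: int, shared: Cost) -> Cost:
    """Combine one crossbar (with its peripherals) into a chiplet total.

    `shared` lumps the buffer, accumulator, pooling, and activation units.
    Assumption: crossbars run in parallel, so area and energy scale with
    the crossbar count while latency is that of a single crossbar.
    """
    return Cost(
        area=crossbar.area * n_crossbars + shared.area,
        energy=crossbar.energy * n_crossbars + shared.energy,
        latency=crossbar.latency + shared.latency,
    )


def layer_cost(chiplets: Sequence[Cost], global_units: Cost) -> Cost:
    """Combine the chiplets of one layer with the global accumulator/buffer.

    Assumption: chiplets run in parallel, so the slowest chiplet sets the
    latency before the global accumulation step.
    """
    return Cost(
        area=sum(c.area for c in chiplets) + global_units.area,
        energy=sum(c.energy for c in chiplets) + global_units.energy,
        latency=max(c.latency for c in chiplets) + global_units.latency,
    )


# Usage with made-up numbers: two identical chiplets of four crossbars each.
one_chiplet = chiplet_cost(Cost(1.0, 2.0, 3.0), 4, Cost(0.5, 1.0, 2.0))
layer_total = layer_cost([one_chiplet, one_chiplet], Cost(1.0, 1.0, 1.0))
```

A calibrated estimator would replace the constant per-crossbar cost with values derived from the technology node, cell type, and ADC precision; the sketch only shows how the per-component costs could be composed.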

#### *3.4.2 NoC estimator*

Effective communication is essential for achieving optimal hardware performance in DNN accelerators [63]. Communication-intensive DNN accelerators are discussed extensively in [103]. Every layer of a DNN transmits a significant amount of data to other layers, and studies have shown that communication can account for up to 90% of the total inference latency in DNNs [45]. It is therefore vital to model on-chip communication accurately and to make it efficient. To this end, we incorporate the cost of communication between multiple layers within a chiplet. As the NoC is the standard interconnect fabric in the SoC domain [104], we use an NoC for intra-chiplet communication. We customize a cycle-accurate NoC simulator, BookSim [98], to evaluate NoC performance. First, we generate a trace file for each chiplet following Algorithm 2. The algorithm takes as input the number of tiles, input activations, chiplets, layer-to-chiplet mapping, quantization bit-precision, and bus width. We then determine the source and destination tile information for each layer in each chiplet. Next, we calculate the number of packets for each source-destination pair and generate a trace file in which each entry is a tuple consisting of the source tile ID, destination tile ID, and timestamp. Finally, we simulate the trace file using BookSim to obtain the area, energy, and latency for on-chip communication within each chiplet.
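The trace-generation steps above can be sketched as follows. This is an illustrative approximation, not Algorithm 2 itself: the function name `generate_trace`, the contiguous tile numbering, the even split of traffic across source-destination pairs, and the one-packet-per-cycle timestamps are all assumptions. Only the output format, a tuple of source tile ID, destination tile ID, and timestamp, is taken from the text.

```python
import math
from typing import List, Sequence, Tuple

TraceEntry = Tuple[int, int, int]  # (source tile ID, destination tile ID, timestamp)


def generate_trace(
    tiles_per_layer: Sequence[int],
    activations_per_layer: Sequence[int],
    bits: int,
    bus_width: int,
) -> List[TraceEntry]:
    """Build a packet trace for the layers mapped to one chiplet.

    tiles_per_layer[i]       : number of tiles holding layer i
    activations_per_layer[i] : activations layer i sends to layer i + 1
                               (the last entry is unused)
    bits                     : quantization bit-precision per activation
    bus_width                : NoC link width in bits
    """
    # Assumption: tiles are numbered contiguously, layer by layer.
    first_tile = []
    next_id = 0
    for n in tiles_per_layer:
        first_tile.append(next_id)
        next_id += n

    trace: List[TraceEntry] = []
    timestamp = 0
    for layer in range(len(tiles_per_layer) - 1):
        src = range(first_tile[layer], first_tile[layer] + tiles_per_layer[layer])
        dst = range(first_tile[layer + 1],
                    first_tile[layer + 1] + tiles_per_layer[layer + 1])

        # Assumption: the layer's traffic is split evenly over all pairs,
        # and each packet carries one bus width of data.
        total_bits = activations_per_layer[layer] * bits
        packets_per_pair = math.ceil(total_bits / (bus_width * len(src) * len(dst)))

        # Interleave the pairs and inject one packet per cycle (simplification).
        for _ in range(packets_per_pair):
            for s in src:
                for d in dst:
                    trace.append((s, d, timestamp))
                    timestamp += 1
    return trace


# Usage: layer 0 on tiles 0-1 sends 64 8-bit activations to layer 1 on tile 2.
example_trace = generate_trace([2, 1], [64, 10], bits=8, bus_width=32)
```

The resulting list would then be written to a file in whatever format the customized BookSim front end expects; that format is not specified here and is left out of the sketch.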
