#### **3.5 NoP engine**

Using specialized signaling techniques and driver circuits, the NoP handles data movement between chiplets, routed through a silicon interposer or an organic substrate, as shown in [105, 106]. **Figure 11** (left) shows a cross-sectional image of a 2.5D integration with chiplets and an interposer. However, modeling the NoP's performance is challenging due to its complex interconnect structure, specialized driver architectures, and the corresponding signaling techniques. To accurately estimate performance, the NoP engine models each component of the NoP. **Figure 11** (right) shows different NoP implementations with the corresponding energy per bit (E*bit*) proposed in prior works. The NoP performance evaluation comprises two main components: (1) NoP latency estimation and (2) NoP area and power estimation. To estimate NoP latency, the engine generates the NoP trace using Algorithm 2, similar to the one used for NoCs, based on the chiplet-to-chiplet data volume produced by the partition and mapping engine. The generated traces are then run through a cycle-accurate simulator or the NoP estimator to determine the latency of the NoP interconnect. To estimate NoP area and power consumption, the engine obtains interconnect parameters such as wire length, pitch, width, and stack-up. It then determines the interconnect capacitance and resistance using the PTM interconnect models [107]. Based on these parameters, it generates timing parameters for the interconnect and compares them against the target bandwidth. If the timing parameters do not meet the bandwidth requirement, the NoP engine instead chooses the maximum allowable bandwidth.
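The timing check at the end of this step can be sketched in a few lines. This is a minimal illustration, not the engine's implementation: the first-order RC (Elmore-style) delay model and all per-mm resistance/capacitance values are placeholder assumptions standing in for the PTM interconnect models [107].

```python
# Hedged sketch of the NoP engine's timing/bandwidth check.
# R and C values are illustrative placeholders, not PTM model data.

def max_achievable_bandwidth(wire_length_mm, r_per_mm_ohm, c_per_mm_f):
    """Approximate the maximum per-wire bit rate from a first-order
    (Elmore-style) RC delay of the full wire: t = 0.69 * R * C."""
    r_total = r_per_mm_ohm * wire_length_mm   # total wire resistance (ohm)
    c_total = c_per_mm_f * wire_length_mm     # total wire capacitance (F)
    delay_s = 0.69 * r_total * c_total        # first-order RC delay (s)
    return 1.0 / delay_s                      # approx. bits/s per wire

def select_bandwidth(target_bw, wire_length_mm, r_per_mm_ohm, c_per_mm_f):
    """If the interconnect timing cannot sustain the target bandwidth,
    fall back to the maximum allowable bandwidth, as described above."""
    return min(target_bw, max_achievable_bandwidth(
        wire_length_mm, r_per_mm_ohm, c_per_mm_f))
```

For example, under these placeholder values a 2 mm wire at 50 Ω/mm and 200 fF/mm sustains roughly 36 Gb/s per wire, so a 4 Gb/s target passes unchanged, while a 1 Tb/s target is clamped to the maximum allowable bandwidth.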

The NoP engine evaluates the NoP transmitter/receiver (TX/RX) circuits, including the clocking circuitry, based on the energy per bit (E*bit*), the number of TX/RX channels, the bandwidth, the chiplet-to-chiplet data volume, and the operating frequency to estimate the energy and latency cost of the TX/RX circuits. The energy calculation for the NoP driver is given in Algorithm 3. The NoP engine first calculates the number of bits transferred between chiplets. It then retrieves the energy per bit (E*bit*) from prior work, as illustrated in **Figure 11** (right). The total energy for the TX/RX channel is computed by multiplying the number of bits by the energy per bit, as indicated in line 9 of Algorithm 3. Next, the NoP engine determines the NoP driver area using the driver area cost reported by prior implementations (**Figure 11**). Finally, the engine combines the interconnect and driver performance metrics to obtain the overall NoP performance.
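The driver-energy calculation in line 9 of Algorithm 3 reduces to a single multiply. A minimal sketch, with the caveat that the function name is an assumption and the E*bit* argument is a placeholder rather than a value taken from **Figure 11**:

```python
def nop_driver_energy_j(data_volume_bytes, e_bit_pj):
    """Total TX/RX channel energy in joules: bits transferred times the
    energy per bit, as in line 9 of Algorithm 3. e_bit_pj is in pJ/bit
    and is a placeholder, not a value read from Figure 11."""
    n_bits = data_volume_bytes * 8       # chiplet-to-chiplet bits
    return n_bits * e_bit_pj * 1e-12     # convert pJ to J
```

For instance, moving 1 MB between chiplets at 2 pJ/bit costs 8e6 × 2 pJ = 16 μJ under this model.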

#### **Figure 11.**

*(Left) Cross-sectional image of the NoP interconnect. The NoP is routed within the interposer connecting different chiplets across the architecture; μ-bumps connect the chiplets to the interposer. (Right) Energy per bit for different NoP driver circuits and signaling techniques proposed in prior works.*

The functional flow of the NoP engine is summarized below:
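As a condensed sketch of that flow: all names below are assumptions, and the bandwidth-limited transfer time is a deliberately crude stand-in for the cycle-accurate simulator or NoP estimator described above.

```python
# Hedged sketch of the NoP engine's flow; not the actual SIAM code.

def evaluate_nop(chiplet_traffic_bytes, bandwidth_bps, e_bit_pj):
    """chiplet_traffic_bytes: dict mapping (src, dst) chiplet pairs to
    the data volume produced by the partition and mapping engine."""
    # 1. Generate per-link traces from the chiplet-to-chiplet data volume.
    traces = [(pair, 8 * vol) for pair, vol in chiplet_traffic_bytes.items()]
    total_bits = sum(bits for _, bits in traces)
    # 2. Latency: bandwidth-limited transfer time as a placeholder for
    #    the cycle-accurate simulator / NoP estimator.
    latency_s = total_bits / bandwidth_bps
    # 3. TX/RX energy: bits times energy-per-bit (Algorithm 3 style).
    energy_j = total_bits * e_bit_pj * 1e-12
    # 4. Combine interconnect and driver metrics into one result.
    return {"latency_s": latency_s, "energy_j": energy_j}
```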


#### **3.6 DRAM engine**

The chiplet-based IMC architecture includes a DRAM chiplet that serves as the external memory for the IMC chiplets. The DRAM engine estimates the external memory accesses required by this architecture. It assumes that the DRAM transfers the entire set of weights to the chiplets only once, before the inference task begins; this assumption holds across different architectural configurations and inference runs for a given DNN model. The DRAM engine includes a DRAM request generator, RAMULATOR [108] for estimating the latency of DRAM transactions, and VAMPIRE [109] for estimating the DRAM transaction power. The DRAM choice depends on user input, with DDR3 and DDR4 currently supported [110, 111]. For a given DNN model, the DRAM engine generates the required traces and memory requests with timestamps, which include the location within the DRAM and the operation. SIAM uses a customized version of the cycle-accurate simulator RAMULATOR and the model-based power analysis tool VAMPIRE. To reduce simulation time, the DRAM engine simulates a smaller set of instructions and multiplies the result by the total number of sets required to represent all the weights in the DNN. A calibration experiment (**Figure 12(a)** and **(b)**) showed that reducing the number of DRAM instructions fed to the engine by 50% results in less than 2% EDP accuracy degradation compared to simulating 100% of the instructions. The overall EDP for different networks across different datasets for DDR4 shows an exponential increase in EDP with increasing DNN model size.

#### **Figure 12.**

*(a) The accuracy of EDP prediction for different numbers of instructions processed to represent 3000 DRAM instructions. Reducing the number of instructions by half results in less than 2% EDP accuracy degradation for half the simulation time. (b) EDP of DRAM transactions (DDR4) for different DNNs. There is an exponential increase in DRAM cost with an increase in DNN model size.*

Through this method, the DRAM engine provides a fast and accurate estimation of external memory accesses for the entire range of DNNs. To summarize, the DRAM engine's execution involves the following essential steps:
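The sample-and-scale trick behind the calibration in **Figure 12(a)** can be sketched as follows. This is an assumed illustration: `simulate` is a hypothetical stand-in for the customized RAMULATOR/VAMPIRE back end, returning an (energy, delay) pair for a given instruction count.

```python
def estimate_dram_edp(total_instructions, sample_fraction, simulate):
    """Simulate only a fraction of the DRAM instruction trace and scale
    the result up, trading a small EDP error for shorter simulation time.
    simulate(n) -> (energy_j, delay_s) for n instructions; here it is a
    placeholder for the RAMULATOR + VAMPIRE back end."""
    n_sample = max(1, int(total_instructions * sample_fraction))
    energy_j, delay_s = simulate(n_sample)
    scale = total_instructions / n_sample   # sets needed for all weights
    return (energy_j * scale) * (delay_s * scale)   # EDP in J*s
```

With a cost model that is linear in the instruction count, simulating 50% of a 3000-instruction trace reproduces the full-trace EDP exactly; the small (<2%) error reported in **Figure 12(a)** comes from non-linear DRAM timing effects that this sketch omits.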

