**4. SIAM dataflow**

This section outlines the default dataflow of the SIAM chiplet-based IMC architecture, illustrated in **Figure 13**. Before the inference task begins, the weights are fetched from DRAM and allocated to the IMC chiplets by the partition and mapping engine described in Section III-C. Two scenarios arise from this mapping: either no layer is spread across several chiplets, or a layer is distributed across multiple chiplets. Consider the case where layer N of the DNN is assigned to the first chiplet, as depicted in **Figure 13(a)**. During computation, the entire layer is processed within one chiplet, generating the output activations for layer N. The global accumulator and buffer are not needed during this process and are turned off. When the computation is done, the output activations are sent to the chiplets that execute layer N + 1. If two chiplets are required to map the weights for layer N + 1, the NoP transfers the output activations from layer N to both chiplets, as shown in **Figure 13(a)**.

### **Figure 13.**

*Computation dataflow within the chiplet-based IMC architecture in SIAM. Two cases arise: (a) no layer is partitioned across two or more chiplets and (b) a layer is partitioned across two or more chiplets.*

**Figure 13(b)** presents the computation flow for layer N + 1, where both chiplets execute computations in parallel. The mapping assigns an equal number of weights to each chiplet, avoiding workload imbalance. After the computation completes, the generated partial sums are aggregated by the global accumulator and buffer. The accumulated outputs from layer N + 1 are then transferred to the chiplets that hold the weights for layer N + 2. This process repeats until all layers are processed and the final output is generated. The algorithmic implementation of the dataflow used in the SIAM IMC chiplet architecture is explained in Algorithm 4.
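The two cases above can be sketched functionally in a few lines of NumPy. This is a minimal behavioral model, not SIAM's implementation: the `chiplet_capacity` parameter, the even row-wise split of a layer's weights, and the `run_layer` helper are all illustrative assumptions. It shows only the key distinction of the dataflow: a single-chiplet layer needs no global accumulation (case a), while a multi-chiplet layer produces per-chiplet partial sums that the global accumulator adds together (case b).

```python
import numpy as np

def run_layer(weights, activations, chiplet_capacity):
    """Behavioral sketch of one layer of the SIAM dataflow.

    weights: (in_dim, out_dim) matrix for this layer
    activations: (in_dim,) input activation vector
    chiplet_capacity: hypothetical max number of weight rows per chiplet
    """
    in_dim = weights.shape[0]
    n_chiplets = -(-in_dim // chiplet_capacity)  # ceiling division

    if n_chiplets == 1:
        # Case (a): the whole layer fits in one chiplet; the global
        # accumulator and buffer stay off.
        return weights.T @ activations

    # Case (b): split the weight rows evenly across chiplets; each
    # chiplet computes a partial sum over its slice of the inputs.
    partials = [
        w_slice.T @ a_slice
        for w_slice, a_slice in zip(np.array_split(weights, n_chiplets),
                                    np.array_split(activations, n_chiplets))
    ]
    # The global accumulator adds the per-chiplet partial sums.
    return np.sum(partials, axis=0)
```

Either path yields the same output activations as an unpartitioned matrix-vector product; only where the accumulation happens differs.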
