**3.1. Overall architecture**

**Figure 8** shows a typical architecture overview of a CNN accelerator based on an embedded FPGA platform; it is a CPU + FPGA architecture. The whole system consists of two parts: the Processing System (PS) and the Programmable Logic (PL).

The Processing System (PS) mainly consists of the CPU and off-chip memory. Due to the enormous amount of input data and weights, it is impossible to store all of them in on-chip memory. Therefore, data and weights are usually stored in off-chip memory such as DDR3 at the beginning. The CPU can configure the control module of the accelerator, adjusting the accelerator to accommodate CONV layers of different scales. In addition, the CPU can realize some simple operations such as the Softmax function of the CNN model. The operations of the CONV layers usually constitute more than 90% of the total CNN operations, so accelerating the operations other than the CONV layers brings little performance improvement. Therefore, we can use the CPU to handle the non-CONV operations such as the Softmax function.

**Figure 8.** Overview of a typical architecture of a CNN accelerator based on an embedded FPGA platform.
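Since the Softmax function contributes only a tiny fraction of the total operations, it can be left to the CPU. A minimal, numerically stable Softmax sketch in Python (the function name and the example scores are illustrative, not taken from any particular accelerator framework):

```python
import math

def softmax(logits):
    """Numerically stable Softmax, suitable for running on the host CPU."""
    # Subtract the max to avoid overflow in exp().
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Example: class scores produced by the final layer of a CNN.
probs = softmax([2.0, 1.0, 0.1])
```

Because Softmax is a cheap element-wise pass over a short score vector, running it on the CPU costs almost nothing compared with the CONV layers handled by the FPGA.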

The Programmable Logic (PL) is actually an FPGA chip, which we can program to meet our requirements. The PL consists of several parts: the processing element (PE) array, the control module, and the on-chip buffer. The PE array is composed of a certain number of PEs, which are the computation units for convolution; the number of PEs usually determines the computational performance of the CNN accelerator. Data can be exchanged between PEs so that it can be reused without accessing the buffer. The on-chip buffer caches data and weights for the PEs and stores the results. Since the data and weights of CONV layers are reused repeatedly, buffering and reusing them reduces off-chip memory accesses, as will be introduced in Section 4. The control module receives configuration information from the PS and controls the computational process and dataflow of the PE array.
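As a rough software analogy of what the PE array computes, the sketch below performs the core multiply-accumulate loop of one CONV layer on data assumed to be already staged in the on-chip buffer (modelled here as plain Python lists); the name `conv_layer` and the direct-convolution loop order are our own illustrative choices, not a specific hardware dataflow:

```python
def conv_layer(ifmap, weights):
    """Direct 2D convolution: each output pixel is one K x K
    multiply-accumulate, the operation a single PE carries out."""
    H, W = len(ifmap), len(ifmap[0])
    K = len(weights)
    out_h, out_w = H - K + 1, W - K + 1
    ofmap = [[0.0] * out_w for _ in range(out_h)]
    for oy in range(out_h):
        for ox in range(out_w):
            acc = 0.0
            for ky in range(K):
                for kx in range(K):
                    # The same weight is reused at every output position,
                    # which is why keeping weights on chip pays off.
                    acc += ifmap[oy + ky][ox + kx] * weights[ky][kx]
            ofmap[oy][ox] = acc
    return ofmap

# Example: 3x3 input and 2x2 kernel of ones -> each output sums 4 inputs.
out = conv_layer([[1.0, 1.0, 1.0]] * 3, [[1.0, 1.0]] * 2)
```

In hardware, the two outer loops are distributed across the PE array so that many output positions are computed in parallel, and the overlap between adjacent windows is what the inter-PE data exchange exploits.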

The whole working process of a CNN accelerator implementation can be divided into three steps. First, before the system starts working, all data and weights are stored in off-chip memory. Next, the CPU configures the control module of the accelerator, and the control module directs the on-chip buffer to fetch data from off-chip memory, so that the PE array can read data and weights from the on-chip buffer and start computation. Finally, the PE array finishes all computation and returns the results to off-chip memory so that the CPU can read them.
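The three steps above can be sketched as host-side pseudocode. The class and method names (`Accelerator`, `configure`, `run`) are illustrative placeholders, since the real interface depends on the platform's driver, and the PE-array computation is modelled as a simple dot product:

```python
class Accelerator:
    """Toy stand-in for the FPGA-side accelerator, modelling the
    three-step flow: preload -> configure & compute -> read back."""

    def __init__(self, ddr):
        self.ddr = ddr        # off-chip memory (step 1: data/weights preloaded)
        self.buffer = {}      # on-chip buffer
        self.cfg = None

    def configure(self, layer_cfg):
        # Step 2a: the CPU writes the layer configuration into the
        # control module so the datapath matches the CONV layer's scale.
        self.cfg = layer_cfg

    def run(self):
        # Step 2b: the control module stages data and weights from
        # off-chip memory into the on-chip buffer, then the PE array
        # computes (modelled here as a dot product).
        self.buffer = {"data": self.ddr["data"],
                       "weights": self.ddr["weights"]}
        result = sum(d * w for d, w in zip(self.buffer["data"],
                                           self.buffer["weights"]))
        # Step 3: results return to off-chip memory for the CPU to read.
        self.ddr["result"] = result

ddr = {"data": [1.0, 2.0, 3.0], "weights": [0.5, 0.5, 0.5]}
acc = Accelerator(ddr)
acc.configure({"layer": "conv1"})
acc.run()
```

The key point the sketch preserves is that the CPU only configures and reads back, while all bulk data movement happens between off-chip memory, the on-chip buffer, and the PE array.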
