**3. CNN accelerator**

Deep convolutional neural networks involve an enormous amount of data and operations, as shown in Section 2, so it is necessary to accelerate CNN computation in hardware. Indeed, many previous works have developed hardware platforms to accelerate CNNs, and these implementations obtained good performance and high energy efficiency. In general, CNNs can be accelerated on hardware platforms such as FPGAs [7, 8], GPUs [9], and ASICs [10, 11]. These works share a common feature: they achieve high computational performance for CNNs thanks to the parallelism of the hardware accelerator. In this work, we mainly discuss and analyze a CNN accelerator based on an embedded FPGA platform.

In this section, we introduce the CNN accelerator and the challenges it faces. As the previous section showed, the operations of the convolutional (CONV) layers constitute a large proportion of the whole CNN. Therefore, in the rest of this work, the discussion and optimization of the CNN accelerator focus on the CONV layers.

Optimizing of Convolutional Neural Network Accelerator

http://dx.doi.org/10.5772/intechopen.75796

The CPU handles some simple operations, such as the Softmax function of the CNN model. Since the operations of the CONV layers usually constitute more than 90% of the total CNN operations, accelerating the operations outside the CONV layers brings little performance improvement. Therefore, we can use the CPU to handle the operations other than those of the CONV layers, such as the Softmax function.

The Programmable Logic (PL) is essentially an FPGA fabric that we can program to meet our requirements. The PL consists of several parts, including a processing element (PE) array, a control module, and an on-chip buffer. The PE array consists of a certain number of PEs, which are the computation units for convolution; the number of PEs usually decides the computational performance of the CNN accelerator. Data can be exchanged between PEs so that it can be reused without accessing the buffer. The on-chip buffer caches data and weights for the PEs and stores the results. Since the data and weights of CONV layers are reused repeatedly, buffering and reusing them reduces off-chip memory accesses, as will be introduced in Section 4. The control module receives configuration information from the Processing System (PS) and controls the computational process and the dataflow of the PE array.

The working process of one implementation of the CNN accelerator can be divided into three steps. First, before the system starts working, all data and weights are stored in off-chip memory. Next, the CPU configures the control module of the accelerator; the control module then directs the on-chip buffer to fetch data from off-chip memory, so that the PE array can read data and weights from the on-chip buffer and start computing. Finally, the PE array finishes all computation and returns the results to off-chip memory so that the CPU can read them.

**3.2. PE architecture**

A processing element (PE) is the computation unit of the CNN accelerator. **Figure 9** shows a typical PE architecture. A PE consists of multipliers, adder trees, a ReLU module, a pooling module, and registers for the inputs, weights, and partial sums (psums). The multipliers and adder trees carry out the convolutional operation, and the ReLU module applies the ReLU function to the psums.

**Figure 9.** Typical architecture of processing element.
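The PE datapath of Figure 9 can be sketched in software as a bank of parallel multipliers feeding an adder tree, with the result accumulated into a psum register and ReLU applied once the psum is complete. This is a behavioral sketch only; the function names are illustrative and not taken from the chapter.

```python
# Behavioral sketch of one PE cycle: multipliers -> adder tree -> psum -> ReLU.
# Names (adder_tree, pe_cycle) are illustrative, not from the chapter.

def adder_tree(values):
    """Pairwise reduction, mirroring a hardware adder tree."""
    while len(values) > 1:
        paired = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:            # an odd leftover passes through one level
            paired.append(values[-1])
        values = paired
    return values[0]

def pe_cycle(inputs, weights, psum):
    """One accumulation cycle: multiply in parallel, reduce, add to the psum."""
    products = [i * w for i, w in zip(inputs, weights)]   # multiplier bank
    return psum + adder_tree(products)

def relu(x):
    return max(0, x)

# A dot product split across two cycles, with ReLU applied to the final psum.
psum = pe_cycle([1, 2, 3, 4], [1, 0, -1, 2], 0)     # 1 + 0 - 3 + 8 = 6
psum = pe_cycle([5, 6, 7, 8], [0, 1, 0, -1], psum)  # 6 + (6 - 8) = 4
print(relu(psum))                                    # 4
```

Splitting the accumulation across cycles is the reason the psum register exists: a real convolution window rarely fits the multiplier bank in one pass, so partial sums must persist between cycles.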
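The three-step working process described above can also be summarized as a small behavioral model: data starts in off-chip memory, the control module fills the on-chip buffer, the PE array computes, and the result is written back for the CPU. All class and function names here are invented for illustration.

```python
# Behavioral model of the three-step accelerator working process.
# AcceleratorModel and its methods are invented names, for illustration only.

def relu(x):
    return x if x > 0 else 0

class AcceleratorModel:
    def __init__(self, off_chip):
        self.off_chip = off_chip      # step 1: data and weights start off-chip
        self.on_chip = {}             # on-chip buffer

    def configure_and_fetch(self, keys):
        # Step 2: the CPU configures the control module, which directs the
        # on-chip buffer to fetch data and weights from off-chip memory.
        for k in keys:
            self.on_chip[k] = self.off_chip[k]

    def compute(self):
        # The PE array reads from the on-chip buffer: dot product, then ReLU.
        acc = sum(d * w for d, w in zip(self.on_chip["data"],
                                        self.on_chip["weights"]))
        # Step 3: the result goes back to off-chip memory for the CPU to read.
        self.off_chip["result"] = relu(acc)

mem = {"data": [1, -2, 3], "weights": [4, 5, -6]}
acc = AcceleratorModel(mem)
acc.configure_and_fetch(["data", "weights"])
acc.compute()
print(mem["result"])   # relu(4 - 10 - 18) = relu(-24) = 0
```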
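The claim that CONV layers dominate the operation count can be checked with a back-of-the-envelope multiply-accumulate (MAC) count. The layer dimensions below are hypothetical, chosen only to illustrate the arithmetic; they are not taken from the chapter.

```python
# Rough MAC counts: one CONV layer vs. one fully connected (FC) layer.
# All layer dimensions are hypothetical, for illustration only.

def conv_macs(h_out, w_out, c_out, c_in, k):
    """Each of the H*W*C_out output values needs a K*K*C_in dot product."""
    return h_out * w_out * c_out * c_in * k * k

def fc_macs(n_in, n_out):
    """A fully connected layer is a single N_in x N_out matrix-vector product."""
    return n_in * n_out

conv = conv_macs(h_out=56, w_out=56, c_out=256, c_in=128, k=3)  # ~0.92 G MACs
fc = fc_macs(n_in=4096, n_out=1000)                             # ~4.1 M MACs

print(conv, fc, conv / (conv + fc))
```

Even a single mid-sized CONV layer of this kind outweighs a large FC layer by more than two orders of magnitude, which is why accelerating only the CONV layers captures nearly all of the potential speedup.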
