154 Green Electronics

**3. CNN accelerator**

In this section, we introduce the CNN accelerator and the challenges it faces. As the previous section showed, the operations of the convolutional (CONV) layers constitute a large proportion of the whole CNN. Therefore, in the following, our discussion and optimization of the CNN accelerator focus on the CONV layers.

Because deep convolutional neural networks involve an enormous amount of data and operations, as shown in Section 2, it is necessary to accelerate CNN computation in hardware. Many previous works have developed hardware platforms to accelerate CNNs and achieved good performance and high energy efficiency. In general, a CNN can be accelerated on hardware platforms such as FPGAs [7, 8], GPUs [9], and ASICs [10, 11]. These works share a common feature: they obtain high computational performance for CNNs by exploiting the parallelism of the hardware accelerator. In this work, we mainly discuss and analyze a CNN accelerator based on an embedded FPGA platform.

**3.1. Overall architecture**

**Figure 8** shows an overview of a typical CNN accelerator architecture based on an embedded FPGA platform; it is a CPU + FPGA architecture. The whole system consists of two parts: the Processing System (PS) and the Programmable Logic (PL).

**Figure 8.** Overview of a typical architecture of CNN accelerator based on an embedded FPGA platform.

The Processing System (PS) mainly consists of the CPU and the off-chip memory. Because of the enormous amount of input data and weights, it is impossible to store them all in on-chip memory; therefore, the data and weights are usually stored in off-chip memory such as DDR3 at the beginning. The CPU can configure the control module of the accelerator, adjusting the accelerator to accommodate CONV layers of different scales. In addition, the CPU can realize

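The per-layer configuration step performed by the PS can be sketched in software. In the following Python model, the register names (`IN_CH`, `KSIZE`, and so on) and the whole interface are illustrative assumptions, not the chapter's actual control interface:

```python
# Hypothetical sketch of the PS-side configuration step: before a CONV
# layer runs, the CPU writes the layer's shape parameters into the
# accelerator's control module. Register names and fields are
# illustrative assumptions, not an actual hardware interface.
from dataclasses import dataclass

@dataclass
class ConvLayerConfig:
    in_channels: int
    out_channels: int
    feature_size: int  # input feature-map height/width
    kernel_size: int

def configure_accelerator(ctrl_regs: dict, cfg: ConvLayerConfig) -> None:
    """Model of the CPU programming the control module for one CONV layer."""
    ctrl_regs["IN_CH"] = cfg.in_channels
    ctrl_regs["OUT_CH"] = cfg.out_channels
    ctrl_regs["FMAP"] = cfg.feature_size
    ctrl_regs["KSIZE"] = cfg.kernel_size
    ctrl_regs["START"] = 1  # hand control to the PL side

regs = {}
configure_accelerator(regs, ConvLayerConfig(3, 64, 224, 3))
```

Writing a new configuration for each layer is what lets one fixed accelerator serve CONV layers of different scales.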
**3.2. PE architecture**

The processing element (PE) is the computation unit of the CNN accelerator. **Figure 9** shows a typical PE architecture. It consists of multipliers, adder trees, a ReLU module, a pooling module, and registers for the input, weights, and partial sums (psums). The multipliers and adder trees carry out the convolutional operation. The ReLU module applies the ReLU function to the psums from the adder trees. The pooling module performs the max or average pooling operation and writes the result to the on-chip buffer. The registers store the input, weights, and psums so that the PE can reuse the data. A PE can also exchange data with other PEs, which makes data access more convenient for the PE and reduces the number of on-chip buffer accesses.

**Figure 9.** Typical architecture of processing element.

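The PE datapath can be modeled in a few lines of software. The following Python sketch is an illustration of the dataflow (multipliers feeding an adder tree, accumulation into a psum register, then ReLU and pooling), not the chapter's actual RTL; the nine-lane flattening of a 3×3 kernel is an assumption for the example:

```python
# Illustrative software model of the PE datapath: multipliers feed an
# adder tree, the sum accumulates into the psum register, and the result
# passes through ReLU and pooling before reaching the on-chip buffer.

def adder_tree(values):
    """Reduce a list pairwise, the way a hardware adder tree would."""
    while len(values) > 1:
        values = [values[i] + values[i + 1] if i + 1 < len(values)
                  else values[i]
                  for i in range(0, len(values), 2)]
    return values[0]

def pe_mac(inputs, weights, psum=0):
    """One PE step: multiply element-wise, reduce, accumulate into psum."""
    products = [x * w for x, w in zip(inputs, weights)]
    return psum + adder_tree(products)

def relu(x):
    return max(0, x)

def max_pool(window):
    return max(window)

# One 3x3 kernel position (flattened to nine multiplier lanes), then ReLU:
psum = pe_mac([1, 2, 3, 4, 5, 6, 7, 8, 9],
              [1, 0, -1, 1, 0, -1, 1, 0, -1])
out = relu(psum)  # a negative psum is clamped to 0
```

The pairwise reduction in `adder_tree` mirrors why hardware adder trees are attractive: the additions at each level are independent and can run in parallel, so N products reduce in about log2(N) adder stages.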
Before the PE starts working, the control module receives the configuration information and configures the PE. After configuration, the PE starts to operate. The input and weight registers read and store data; the multipliers and adder trees read data from the registers or from other PEs, and the generated psums are stored in the psum register. When the convolutional operation finishes, the psums are sent to the ReLU and pooling modules, and the output is generated and written to the on-chip buffer. The PE operation is then complete.

Reducing data precision appropriately has almost no impact on recognition accuracy, which suggests that reducing data precision is an efficient and significant way to reduce the memory footprint and the computational hardware resources. In addition, data reuse [7, 8, 23] is another optimization direction for CNN accelerators. Because most of the data in a CNN are used repeatedly, data reuse is an efficient way to reduce memory accesses, and consequently the memory bandwidth and power consumption. In the next section, we will introduce these two optimization methods in detail.

Optimizing of Convolutional Neural Network Accelerator

http://dx.doi.org/10.5772/intechopen.75796

**4. Accelerator optimization**

**4.1. Data precision**

Generally, to guarantee high recognition accuracy, 32-bit floating point data and weights are used to train a CNN model. However, such high data precision puts more pressure on the hardware, because it usually requires more computational resources and a larger memory footprint. A growing number of studies indicate that reducing data precision appropriately has almost no impact on the accuracy of a CNN model [12–14]. In Ref. [15], the authors used the MNIST dataset to test the impact of data precision on accuracy. The results showed that using 16-bit fixed point data in the inference process while keeping 32-bit floating point in the training process caused almost no loss of accuracy, whereas using 16-bit fixed point in inference with 32-bit fixed point in training reduced the accuracy from 99.18 to 99.09%. In Ref. [16], the authors explored the precision requirements of CNNs using the AlexNet and VGG models, testing 8-bit fixed point convolution weights in the inference process. Compared to full precision weights, the results showed that accuracy dropped by less than 1%.

These studies show that in many cases a CNN model does not need high data precision such as 32-bit floating point, which makes reducing data precision a feasible way to optimize a CNN accelerator. Reducing data precision brings several benefits. First, because floating point computation requires more resources than fixed point computation, using fixed point data in a CNN accelerator saves a large amount of computational resources. Second, lower precision data make the accelerator more energy efficient: the power consumption of an 8-bit fixed point adder is 30× less than that of a 32-bit floating point adder, and that of an 8-bit fixed point multiplier is 18.5× less than that of a 32-bit floating point multiplier [24]. Finally, reducing the precision of data or weights directly reduces the memory footprint and the bandwidth requirement; for instance, using 16-bit fixed point data instead of 32-bit fixed point shrinks the memory footprint by half and halves the bandwidth requirement.

Because of these advantages, many works [16, 25, 26] optimize their accelerators by reducing the data precision of the CNN accelerator, under the premise of meeting the recognition accuracy requirement. Ref. [16] used 8–16 bit fixed point data precision for both the AlexNet and VGG models; the accuracy reduction caused by the fixed point operations in the FPGA implementation is less than 2% for top-1 accuracy and less than 1% for top-5 accuracy.

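As a concrete sketch of such a precision reduction, the following Python snippet quantizes a 32-bit float weight to 16-bit fixed point. The Q8.8 format (16 bits, 8 of them fractional) is an assumption chosen for illustration; the chapter does not prescribe a specific fixed-point format:

```python
# Sketch of a 32-bit float -> 16-bit fixed point reduction, assuming a
# Q8.8 format (16 bits, 8 fractional). Each value then occupies 2 bytes
# instead of 4, halving the memory footprint as described above.

FRAC_BITS = 8
SCALE = 1 << FRAC_BITS  # 256 quantization steps per unit

def to_fixed(x):
    """Quantize a float to signed 16-bit fixed point, with saturation."""
    q = round(x * SCALE)
    return max(-(1 << 15), min((1 << 15) - 1, q))

def to_float(q):
    """Recover the real value a fixed-point code represents."""
    return q / SCALE

w = 0.7931                  # a 32-bit float weight (4 bytes)
q = to_fixed(w)             # its 16-bit code (2 bytes)
err = abs(to_float(q) - w)  # rounding error is at most half an LSB
```

The rounding error of each quantized value is bounded by half a least significant bit, here 1/(2·256) ≈ 0.002, which is consistent with the small accuracy losses reported in the studies above.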