from adder trees. The pooling module performs the max or average pooling operation and writes the result to the on-chip buffer. Registers are used to store the Input, Weights and Psum so that the PE can reuse the data. A PE can also exchange data with other PEs, which improves the convenience of data accessing for the PEs and reduces the number of on-chip buffer accesses.

Before a PE starts working, the control module receives the configuration information and configures the PE. After configuration, the PE starts to operate. The Input and Weights registers read and store data. The multipliers and adder trees read data from the registers or from other PEs, and the generated psums are stored in the psum register. After the convolution operation finishes, the psums are sent to the ReLU module and the Pooling module, and the Output is generated and sent to the on-chip buffer. At this point, the PE operation is done.

**3.3. Challenges**

Although the CONV layer algorithm is simple, the enormous amount of data and computation poses grave challenges for hardware accelerators. One challenge is the limited off-chip memory bandwidth. Generally, a CNN accelerator computes with high parallelism by increasing the number of processing elements (PEs), which improves the computational performance of the accelerator. However, this is accompanied by pressure on the bandwidth caused by the large amount of data accesses. Another challenge is that the large amount of off-chip memory access consumes a lot of energy.

**Figure 10** is a diagram from reference [22]. It shows the normalized energy cost of each level of the memory hierarchy relative to the cost of one multiply-accumulate (MAC) operation; the data were extracted from a commercial 65-nm process. The energy cost of a DRAM access is much higher than that of an on-chip buffer access or a MAC operation. Therefore, a large number of DRAM accesses causes high power consumption. Besides, the large memory footprint caused by the enormous amount of data is also an inevitable challenge.

**Figure 10.** Normalized energy cost of each level of the memory hierarchy [22].

To solve these problems, previous works have proposed several optimization methods. One of them is reducing the data precision. Several studies show that appropriately reducing the data precision of a CNN model has almost no impact on image recognition accuracy.

**4.1. Data precision**


Generally, to guarantee high recognition accuracy, 32-bit floating point data and weights are used to train a CNN model. However, such high data precision puts more pressure on the hardware, because it usually requires more computational resources and a larger memory footprint. More and more studies indicate that appropriately reducing the data precision has almost no impact on the accuracy of a CNN model [12–14]. In Ref. [15], the authors used the MNIST dataset to test the impact of data precision on accuracy. The results showed that using 16-bit fixed point data in the inference process while keeping 32-bit floating point in the training process caused almost no reduction in accuracy. When 16-bit fixed point data were used in the inference process and 32-bit fixed point in the training process, the accuracy dropped only from 99.18 to 99.09%. In Ref. [16], the authors explored the precision requirements of CNNs using the AlexNet and VGG models. 8-bit fixed point convolution weights were tested in the inference process; compared to full-precision weights, the accuracy dropped by less than 1%.

These studies show that in many cases it is not necessary for a CNN model to use high data precision such as 32-bit floating point, which suggests that reducing the data precision is a feasible way to optimize a CNN accelerator. Reducing the data precision brings several benefits. First, because floating point computation requires more computational resources than fixed point computation, using fixed point data in a CNN accelerator saves a large amount of computational resources. In addition, using lower-precision data makes the CNN accelerator more energy-efficient: the power consumption of an 8-bit fixed point adder is 30× lower than that of a 32-bit floating point adder, and the power consumption of an 8-bit fixed point multiplier is 18.5× lower than that of a 32-bit floating point multiplier [24]. Finally, reducing the precision of data or weights directly reduces the memory footprint and the bandwidth requirement. For instance, using 16-bit fixed point data instead of 32-bit fixed point halves both the memory footprint and the bandwidth requirement.

Due to these advantages, many works [16, 25, 26] optimize their accelerators by reducing the data precision, under the premise of meeting the required recognition accuracy. Ref. [16] used 8–16 bit fixed point data for both the AlexNet and VGG models; the accuracy reduction caused by the fixed point operations in the FPGA implementation was less than 2% for top-1 accuracy and less than 1% for top-5 accuracy.
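As an illustration of the inference-side quantization described above, the sketch below converts 32-bit floating point weights to a 16-bit fixed point format and measures the rounding error. The Q1.15-style format (15 fractional bits) and the random kernel are assumptions made for the example, not values taken from the chapter.

```python
import numpy as np

def to_fixed(x, frac_bits=15, total_bits=16):
    """Quantize float32 values to signed fixed point with frac_bits fractional bits."""
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))        # -32768 for 16-bit
    hi = (1 << (total_bits - 1)) - 1     #  32767 for 16-bit
    q = np.clip(np.round(x * scale), lo, hi)
    return q / scale                     # dequantize to inspect the error

rng = np.random.default_rng(0)
w = rng.normal(0, 0.05, size=(64, 3, 3, 3)).astype(np.float32)  # CONV1-like kernel
w16 = to_fixed(w)
print("max abs rounding error:", np.abs(w - w16).max())  # bounded by 2**-16
print("footprint: 4 bytes -> 2 bytes per weight")        # halved, as noted above
```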

#### **4.2. Data-reusing**

Data-reusing is an important optimization method for CNN accelerators to reduce memory accesses. The main idea is to use the on-chip buffer to hold data that will be used repeatedly. The next time these data are needed, they can be read from the on-chip buffer, so there is no need to access the off-chip memory.

**Figure 11** shows the computational process of convolution. In this process, a 3 × 3 weight kernel shifts over the input feature map and convolves with 9 pixels of the input feature map at each position. Between two successive convolutions, shown as the red frame and the green frame in **Figure 11**, the values of the weight kernel do not change, and the red area of the input feature map is accessed again. Therefore the weight kernel and that part of the input feature map can be reused.

**Figure 11.** Data-reusing in the process of convolution.
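The overlap is easy to quantify. A minimal sketch, assuming a stride of 1 as in **Figure 11**:

```python
# Two successive 3 x 3 windows at stride 1 (the red and green frames of
# Figure 11) share 6 of their 9 input pixels, and all 9 weights.
R = S = 3
red   = {(r, s) for r in range(R) for s in range(S)}      # window at column 0
green = {(r, s + 1) for r in range(R) for s in range(S)}  # window shifted by 1
print(len(red & green), "of", R * S, "input pixels shared")  # -> 6 of 9
```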

**Table 1** shows the amount of data, the total number of data accesses, and the average number of accesses per datum for the Input, Weights and Output of CONV1 of the VGG-16 model. The amount of data is the number of actual Input, Weights or Output values. The total number of data accesses is the number of times the Input, Weights or Output are accessed during the computation. The average number of uses per datum is the total number of data accesses divided by the amount of data, and it represents how many times each Input, Weight or Output value is used. From **Table 1**, we can see that although a large amount of Input, Weights and Output are involved in the computation of a convolutional layer, these data are in fact used repeatedly; some data are even used tens of thousands of times. Therefore, data-reusing is an effective optimization method to reduce the accesses to off-chip memory.

| | # of data (K) | # of total data accesses (K) | Average # of uses per datum |
|---|---|---|---|
| Input | 150 | 86,704 | 576 |
| Weights | 1.728 | 86,704 | 50,176 |
| Output | 3211 | 86,704 | 27 |

**Table 1.** Statistics of CONV1 of VGG-16 model.
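These numbers follow directly from the CONV1 shape (a 3 × 224 × 224 input, 64 kernels of size 3 × 3 × 3 and a 224 × 224 output). A quick sketch that reproduces **Table 1**:

```python
# CONV1 of VGG-16: C input channels, M kernels of size R x S, E x F output.
C, M, R, S, E, F = 3, 64, 3, 3, 224, 224

data = {"Input": C * 224 * 224,       # input feature map values
        "Weights": M * C * R * S,     # kernel values
        "Output": M * E * F}          # output feature map values
macs = M * C * E * F * R * S          # each MAC touches one value of each tensor

for name, n in data.items():
    print(f"{name}: {n / 1e3:g}K data, {macs / 1e3:,.0f}K accesses, "
          f"{macs / n:,.0f} uses per datum")
```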

Data-reusing greatly reduces off-chip memory accesses, and the benefits are obvious. First, it relieves the pressure on the off-chip memory bandwidth. Second, it shortens the operation time of the accelerator, since accessing off-chip memory takes much longer than accessing the on-chip buffer. Finally, as **Figure 10** shows, since the energy cost of an off-chip memory access is much higher than that of an on-chip buffer access, data-reusing makes the accelerator more energy-efficient.

Although data-reusing is an effective method to reduce off-chip memory accesses, it requires extra on-chip memory to buffer the data. The benefits and the costs of data-reusing usually vary greatly between different data and different CONV layers. From **Table 1** we can infer that reusing Weights removes more off-chip memory accesses than reusing Input or Output, because the average number of uses per Weight is far larger than that of the Input and Output. In addition, the buffer needed for reusing Weights in CONV1 is smaller than in CONV13, because the amount of Weights in CONV1 is far smaller than in CONV13, as **Figure 7(a)** shows. In summary, although data-reusing is an effective method, the differing benefits and costs of reusing different data in different CONV layers make it a complicated optimization involving many factors. Therefore, it is necessary to explore the strategy of data-reusing.

Some previous works have explored the optimization of memory accesses by reusing data. Work [7] explored the influence of loop tiling and loop transformation on data-reusing, which affects computing throughput and memory bandwidth. Work [8] explored the influence of multiple design variables on the accelerator design. These works accounted for some of the factors influencing data-reusing, such as loop tiling and loop interchange; however, they did not account for the influence of parallel computation on data-reusing. Work [27] proposed a methodology for determining the parallelism strategy, but it did not account for the influence of loop tiling and loop interchange. In the rest of this chapter, we discuss the factors that impact data-reusing in depth, including the loop execution order, the reusing strategy and the parallelism strategy.

#### *4.2.1. Loop execution order*

The computational procedure of a convolutional layer can be expressed as a convolution loop nest, as in **Figure 3**. Similar iteration loop dimensions can be merged into a new loop dimension, and the loop execution order can be transformed, as shown in **Figure 12(a)** and **(b)**. **Figure 12(b)** is obtained by transforming **Figure 12(a)**. The loop execution orders of the two loop nests are different, but the result is the same.
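Since Figures 3 and 12 are not reproduced here, the sketch below reconstructs the idea under stated assumptions: loop *a* iterates over the E × F output pixels, loop *b* over the C input channels, loop *c* over the R × S kernel window and loop *d* over the M output channels (the mapping implied by Eq. (2) later in this section), with stride 1 and no padding. Interchanging the loops changes the access pattern but not the result:

```python
import numpy as np

def conv_order_a(inp, w, E, F, C, R, S, M):
    """Loop order in the style of Figure 12(a): loop a (output pixels) outermost."""
    out = np.zeros((M, E, F))
    for a in range(E * F):                 # loop a: output pixels
        e, f = divmod(a, F)
        for b in range(C):                 # loop b: input channels
            for c in range(R * S):         # loop c: kernel window
                r, s = divmod(c, S)
                for d in range(M):         # loop d: output channels
                    out[d, e, f] += inp[b, e + r, f + s] * w[d, b, r, s]
    return out

def conv_order_b(inp, w, E, F, C, R, S, M):
    """Same computation with loop a moved innermost, as in Figure 12(b)."""
    out = np.zeros((M, E, F))
    for d in range(M):
        for c in range(R * S):
            r, s = divmod(c, S)
            for b in range(C):
                for a in range(E * F):     # loop a now innermost
                    e, f = divmod(a, F)
                    out[d, e, f] += inp[b, e + r, f + s] * w[d, b, r, s]
    return out

# Tiny shapes keep the check fast; any shapes give identical results.
rng = np.random.default_rng(1)
E = F = 4; C = 3; R = S = 3; M = 2
inp = rng.normal(size=(C, E + R - 1, F + S - 1))
w = rng.normal(size=(M, C, R, S))
assert np.allclose(conv_order_a(inp, w, E, F, C, R, S, M),
                   conv_order_b(inp, w, E, F, C, R, S, M))
```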

**Figure 12.** Two different loop execution orders: (a) order a; (b) order b.

Because the weights are accessed repeatedly in loop *a*, we can buffer the weights that will be accessed repeatedly in loop *a* and reuse them. If we reuse weights in loop *a*, we need a corresponding on-chip buffer to store them: all weights that will be accessed within loop *a* must be buffered. For example, in the loop execution order of **Figure 12(a)**, when executing loop *a*, the weights in the inner loops *b*, *c* and *d* are all accessed. Thus, to reuse the weights in loop *a*, we need to buffer the weights accessed not only in loop *a* but also in the inner loops *b*, *c* and *d*. Therefore, the on-chip buffer size depends on the number of weights accessed while executing loop *a*, and that number depends on the relative position of loop *a* in the whole loop nest.





Compare the loop execution orders of the two cases in **Figure 12(a)** and **(b)**. In **Figure 12(a)**, if we want to reuse in loop *a* the weight represented by the red pixel in **Figure 13(a)**, then between two accesses to the red-pixel weight, the other weights shown by the blue frame in **Figure 13(a)** are accessed as well. Thus, we need to buffer the weights shown by the blue frames in **Figure 13(a)**, and the number of these weights is *C* × *R* × *S* × *M*. Assuming the bit width of a weight is 2 bytes, the buffer size for reusing weights is 2 × *C* × *R* × *S* × *M* bytes.

**Figure 13.** Reusing weight in two situations with different loop execution orders: (a) situation a; (b) situation b.

In **Figure 12(b)**, since loop *a* is the innermost loop of the loop nest, if we want to reuse the weight shown as the red pixel in **Figure 13(b)**, no other weights need to be accessed between two accesses to the red-pixel weight. We therefore only need to buffer the single weight shown by the blue frame in **Figure 13(b)**, and the buffer size for reusing weights is just 2 bytes. Comparing these two cases, we can see that the loop execution order has a critical influence on the on-chip buffer size needed for reusing data.
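A worked instance of the two buffer sizes, plugging in the CONV1 shape of VGG-16 used earlier (the choice of layer is ours; the chapter states the formulas generically):

```python
# Buffer needed to reuse weights for the two loop orders of Figure 12,
# with the CONV1 shape of VGG-16 and 16-bit (2-byte) weights.
C, R, S, M = 3, 3, 3, 64
BYTES_PER_WEIGHT = 2

buf_order_a = BYTES_PER_WEIGHT * C * R * S * M  # loop a outermost: 3456 bytes
buf_order_b = BYTES_PER_WEIGHT * 1              # loop a innermost: 2 bytes
print(buf_order_a, buf_order_b)
```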

Similarly, the Input and Output can also be reused, and the buffer sizes for reusing them can be calculated in the same way. When considering the influence of the loop execution order on the buffer size, the Input, Weights and Output should all be taken into account in order to find the optimal loop execution order.

#### *4.2.2. Reusing strategy*

Above, we introduced the influence of the loop execution order on the buffer size. The reusing strategy is another important factor in data reuse. A reusing strategy is a concrete scheme indicating which data are reused in which loop dimension. For instance, in **Figure 12(a)**, we can either reuse or not reuse the Weights in loop *a*; similarly, we can choose whether to reuse the Input in loop *a* or in loop *d*. Thus there are many combinations of reusing strategies to choose from. Given a loop execution order, different reusing strategies lead to different buffer sizes and memory accesses. When considering data-reusing, the reusing strategy should therefore be taken into account in order to find the corresponding optimal one.
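Counting the candidates makes the combinatorics concrete. The encoding below (each tensor reused in one of the loop dimensions *a*–*d*, or not at all) is an assumed notation; the chapter does not fix one:

```python
from itertools import product

# Each of Input / Weights / Output is either not reused (None) or reused
# in one of the four loop dimensions (assumed encoding for illustration).
choices = [None, "a", "b", "c", "d"]
strategies = list(product(choices, repeat=3))   # (Input, Weights, Output)
print(len(strategies), "candidate reusing strategies")  # -> 125
```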

#### *4.2.3. Parallelism strategy*


Parallel computing is an effective method to improve the computational performance of a CNN model. Parallelism is in fact also a way of reusing data indirectly: in some cases the parallel computations involve the same data, so the data only need to be accessed once, whereas serial computation would access them repeatedly. However, parallelism also brings problems such as higher hardware overhead, higher bandwidth requirements, extra buffer size, etc. Generally, a legal parallelism strategy for the CONV layer of **Figure 3** must meet the following constraints:

$$\begin{cases} 0 < P_a \times P_b \times P_c \times P_d \le (\# \text{ of PEs}) \\ 0 < P_a \le E \times F \\ 0 < P_b \le C \\ 0 < P_c \le R \times S \\ 0 < P_d \le M \end{cases} \tag{2}$$

In Eq. (2), $P_x$ represents the parallelism degree in loop dimension *x*. Eq. (2) indicates that the number of PEs limits the total parallelism, and that the parallelism degree in each loop dimension cannot exceed the number of iterations of that loop.
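A minimal sketch of this legality check; the PE budget of 500 and the CONV1 dimensions are just example values:

```python
def is_legal(Pa, Pb, Pc, Pd, num_pes, E, F, C, R, S, M):
    """Check the parallelism constraints of Eq. (2)."""
    degrees_ok = (0 < Pa <= E * F and 0 < Pb <= C and
                  0 < Pc <= R * S and 0 < Pd <= M)
    return degrees_ok and Pa * Pb * Pc * Pd <= num_pes

# Example: 7 output pixels x 64 output channels = 448 of 500 PEs busy.
print(is_legal(Pa=7, Pb=1, Pc=1, Pd=64, num_pes=500,
               E=224, F=224, C=3, R=3, S=3, M=64))   # -> True
```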

Under the constraints of Eq. (2), there are many possible parallelism strategies. Since the computational performance of a CNN accelerator is generally proportional to the total parallelism degree, strategies with similar total parallelism degrees achieve similar computational performance. However, the memory accesses of different parallelism strategies can differ greatly. Consider two cases. In the first case, the computation is parallelized in loop dimension *b* with parallelism degree Pm; the computing process is shown in **Figure 14(a)**, where the red frame represents the data involved in one parallel computation. When parallelizing in loop dimension *b*, there are Pm Inputs and Pm Weights, so 2Pm data in total are involved in each parallel computation. In the second case, the computation is parallelized in loop dimension *d*, also with parallelism degree Pm, as shown in **Figure 14(b)**. Similarly, there are 1 Input and Pm Weights, so only Pm + 1 data are involved in each parallel computation. The amount of data accessed by the second strategy is thus approximately half that of the first. Comparing these two cases shows that different parallelism strategies lead to different amounts of data accesses. Therefore, when designing a CNN accelerator with parallel computing, a good parallelism strategy can reduce the amount of data accessed, and accordingly the number of memory accesses.

**Figure 14.** Computing process of two parallelism strategies: (a) parallelism in loop dimension *b*; (b) parallelism in loop dimension *d*.
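The operand counts per parallel step, as a quick check (Pm = 64 is an arbitrary example):

```python
Pm = 64  # parallelism degree (hypothetical)

ops_loop_b = Pm + Pm  # loop b: Pm Inputs + Pm Weights per parallel step
ops_loop_d = 1 + Pm   # loop d: one broadcast Input + Pm Weights per step
print(ops_loop_b, ops_loop_d)  # 128 vs 65: roughly half the operand traffic
```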


Parallel computing usually requires extra on-chip buffer. When calculating the on-chip buffer size, we can regard a parallel computation as a combination of a number of independent computations (the number depends on the parallelism degree). In that case, the total buffer size for parallel computing can be obtained by calculating the buffer size of each independent computation and combining them.

#### **4.3. Design space evaluation**

Above, we introduced the influence of data precision and data-reusing on the CNN accelerator. In this section we combine all the design factors of the accelerator and make a comprehensive analysis. Here we use the buffer size and the off-chip memory access as the evaluation parameters of the CNN accelerator. We enumerate all design possibilities, which are the combinations of different loop execution orders, reusing strategies and parallelism strategies; calculate the buffer size and the off-chip memory access of each design possibility using the analysis of the previous sections; generate a series of groups consisting of design point, buffer size and off-chip memory access; and depict all these groups on a graph.

Based on CONV1 of the VGG-16 model, using 16-bit fixed point data precision and under the constraint that the number of PEs is less than 500 (which means the total parallelism degree is less than 500), we enumerate all design possibilities and obtain a series of design-space groups. To simplify the analysis, we remove the overlapping and clearly inferior design groups and keep a series of optimal design groups. **Figure 15** depicts these optimal design groups and shows the relationship between the on-chip buffer size and the off-chip memory access of all design groups. The x axis denotes the buffer size of the CNN accelerator and the y axis denotes the off-chip memory access. Each point in the graph represents one design. The blue curve depicts the distribution trend of the optimal design space. As the blue curve shows, the optimal design points lie approximately on a hyperbola. This indicates that when we design a CNN accelerator, the buffer size and the off-chip memory access are, on the whole, inversely related: to reduce the off-chip memory access, a corresponding on-chip buffer is required to reuse data, and reducing the on-chip buffer size is usually accompanied by an increase in off-chip memory access.

**Figure 15.** The relationship between the on-chip buffer size and off-chip memory access of all design groups for CONV1 of VGG-16.
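A minimal sketch of this enumerate-and-prune flow. The cost functions below are toy placeholders: in the chapter, the buffer size and off-chip access counts follow from the analysis of Sections 4.2.1–4.2.3, so only the enumeration and Pareto-pruning steps are meaningful here:

```python
from itertools import permutations, product

# Toy stand-ins for the chapter's analytical cost model (values are arbitrary;
# real ones come from the buffer-size and access analysis above).
def buffer_size(order, reuse, pdeg):
    return 2 * (1 + 100 * sum(reuse)) * pdeg       # placeholder, bytes

def memory_access(order, reuse, pdeg):
    return 86_704_000 // (1 + 10 * sum(reuse))     # placeholder, accesses

def pareto_front(points):
    """Keep (buffer, access) points that no other point dominates."""
    best = []
    for buf, acc in sorted(points):                # ascending buffer size
        if not best or acc < best[-1][1]:          # strictly fewer accesses
            best.append((buf, acc))
    return best

designs = [(buffer_size(o, r, p), memory_access(o, r, p))
           for o in permutations("abcd")           # loop execution order
           for r in product([0, 1], repeat=3)      # reuse Input/Weights/Output?
           for p in (1, 8, 64)]                    # toy parallelism degrees
print(pareto_front(designs))
```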

Based on **Figure 15**, we can choose the optimal design point according to the hardware requirements. For example, if our FPGA platform provides only 1 KB of on-chip buffer, we can choose the design at point A in **Figure 15**. If we want to reduce the memory bandwidth and the power consumption of the accelerator and are not concerned about the buffer size, we can choose the design at point C. Moreover, if we want to balance the on-chip buffer size against the off-chip memory access, we can choose the design at point B. In a word, we can obtain an appropriate and optimal design for a CNN accelerator by referring to **Figure 15**.

#### **4.4. Power consumption evaluation**


In the previous section, we mentioned that the power consumption of different memory accesses differs greatly. According to **Figure 10**, under the commercial 65-nm process, the power consumption of a DRAM access is 33 times that of a global buffer access. Therefore, we can consider DRAM accesses to be the main part of the power consumption of memory accesses. Storing and reusing data in the on-chip buffer reduces the number of DRAM accesses, and thereby reduces the power consumption of the computing process of the CNN accelerator.

As **Figure 15** shows, the off-chip memory access of point A is 4.5 times that of point B, while the on-chip buffer of point B is only approximately 2 KB larger than that of point A. This means that by adding a small on-chip buffer (2 KB) we obtain a 4.5× reduction in the power consumption of the memory accesses of CONV1.

Although this is a theoretical estimate of the power consumption, it shows that we can trade on-chip memory size against power consumption. Our design space evaluation method can help us choose an optimal design for a CNN accelerator that reduces the power consumption by adding a small amount of on-chip buffer. Facing the trend toward miniaturization and low power consumption in the IoT, our evaluation method is an effective design strategy and matches the concept of green electronics.
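A back-of-the-envelope check that the access reduction translates almost directly into a memory-energy reduction, with assumed normalized costs (chosen to give the 33× DRAM/buffer ratio) and illustrative access counts:

```python
# Normalized energy per access (assumed values giving the 33x DRAM/buffer
# ratio of Figure 10; the figure itself is not reproduced here).
E_BUF, E_DRAM = 6.0, 200.0

def mem_energy(dram_accesses, buffer_accesses):
    return dram_accesses * E_DRAM + buffer_accesses * E_BUF

# Hypothetical access counts for points A and B of Figure 15 (illustrative
# only): A makes 4.5x more DRAM accesses, buffer accesses are equal.
e_a = mem_energy(4.5e6, 5e6)
e_b = mem_energy(1.0e6, 5e6)
print(f"energy(A) / energy(B) = {e_a / e_b:.2f}")  # ~4x, dominated by DRAM term
```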

**Author details**

Wenquan Du, Zixin Wang and Dihu Chen\*

Sun Yat-Sen University, Guangzhou, China

\*Address all correspondence to: stscdh@mail.sysu.edu.cn

**References**

[1] Lécun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proceedings of the IEEE. 1998;**86**(11):2278-2324

[2] Krizhevsky A, Sutskever I, Hinton GE, editors. ImageNet classification with deep convolutional neural networks. In: International Conference on Neural Information Processing Systems; 2012

[3] He K, Zhang X, Ren S, Sun J, editors. Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition; 2016

[4] Sainath TN, Mohamed AR, Kingsbury B, Ramabhadran B, editors. Deep convolutional neural networks for LVCSR. In: IEEE International Conference on Acoustics, Speech and Signal Processing; 2013

[5] Hinton G, Deng L, Yu D, Dahl GE, Mohamed AR, Jaitly N, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine. 2012;**29**(6):82-97

[6] Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Driessche GVD, et al. Mastering the game of go with deep neural networks and tree search. Nature. 2016;**529**(7587):484

[7] Zhang C, Li P, Sun G, Guan Y, Xiao B, Cong J, editors. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In: ACM/SIGDA International Symposium on Field-Programmable Gate Arrays; 2015

[8] Ma Y, Cao Y, Vrudhula S, Seo JS, editors. Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks. In: ACM/SIGDA International Symposium on Field-Programmable Gate Arrays; 2017

[9] Chetlur S, Woolley C, Vandermersch P, Cohen J, Tran J, Catanzaro B, et al. cuDNN: Efficient primitives for deep learning. Computer Science. 2014;arXiv:1410.0759

[10] Chen YH, Krishna T, Emer JS, Sze V. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits. 2016;**PP**(99):1-12

[11] Luo T, Liu S, Li L, Wang Y, Zhang S, Chen T, et al. DaDianNao: A neural network supercomputer. IEEE Transactions on Computers. 2017;**66**(1):73-88
