
Microsoft Research's Catapult project garnered considerable attention in the industry when it contended that using FPGAs could be as much as 10 times more power efficient than GPUs [5]. Although the performance of a single FPGA was much lower than that of a comparably priced GPU, its much lower power consumption could have significant implications for applications where raw performance is not the top priority. Figure 3(a) shows a logical view of FPGAs in a cloud-scale application and Figure 3(b) shows how the FPGA-based accelerator fits into a host server.

Figure 3. (a) Decoupled programmable hardware plane, (b) server plus FPGA schematic.

As Figure 3(b) shows, an FPGA-based machine learning accelerator typically involves hardware blocks such as DRAM, CPUs, a network interface controller (NIC), and FPGAs. The DRAM acts as a large buffer to store temporary data, while the CPU manages the computation, including sending instructions to the FPGAs. The FPGA is programmed to fit the ML algorithm. Since the ML algorithm is optimized at the hardware level through FPGA programming, high data-access efficiency is obtained compared with regular GPU computation, which does not apply any hardware optimization to the corresponding ML algorithms.

Although the FPGA reduces the power consumption of the computation by optimizing the ML algorithm in the hardware design, its overall efficiency is still much lower than that of an ASIC built for a single kind of algorithm. Compared with an ASIC, the programmability of the FPGA also brings complicated logic that increases the hardware design cost. In addition, the speed of the FPGA is usually limited to around 300 MHz, which is 4–5 times lower than that of a typical ASIC [6].

In HPC and datacenters, hardware accelerator solutions are dominated by GPU and FPGA solutions, and state-of-the-art machine-learning computation mostly relies on cloud servers.

4 Machine Learning - Advanced Techniques and Emerging Applications

2.2. ASIC-based CNN accelerator at edge

2.2.1. Introduction

2.2.2. CNN accelerator layer function definition

State-of-the-art convolutional neural networks commonly include three different computational layers: the convolution layer, the pooling layer, and the fully connected layer. The convolution layer is the most computation-intensive part of the network. Pooling layers are inserted between convolution layers to reduce the intermediate data size and remap the feature maps. The fully connected layer is usually the last layer of the CNN and predicts the labels of the input data; it is limited by memory bandwidth rather than by computation resources.

The primary role of a convolution layer is to apply a convolution function that maps the images of the input (previous) layer to the next layer. The data of each input layer are composed of multiple channels, forming a three-dimensional tensor. One set of regional filter windows is defined as one filter, or weight. The results are produced by the inner product of the filter weights and the input data: an output feature is obtained by using the filter to scan and accumulate across the different input channels. After the inner-product computation, a separate bias vector (with the same dimension as the number of output features) is added to each final result. The analytical representation of the convolution layer is shown in Eq. (1) and Figure 4.

$$\mathbf{O}[o][m][x][y] = \mathbf{B}[m] + \sum_{k=1}^{C} \sum_{i=1}^{K} \sum_{j=1}^{K} \mathbf{I}[o][k][\alpha x + i][\alpha y + j] \times \mathbf{W}[m][k][i][j]$$

$$1 \le o \le N,\quad 1 \le m \le M,\quad 1 \le x, y \le S_o \tag{1}$$

Here O, B, I, and W are the output features, biases, input features, and filters, respectively; o indexes the N input images, m indexes the M output features, k runs over the C input channels, K is the kernel size, α is the stride, and S_o is the output feature size.
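Eq. (1) can be written out directly as a naive Python/NumPy reference (a sketch for clarity, not an optimized kernel; the array shapes and the function name are illustrative assumptions, not from the chapter):

```python
import numpy as np

def conv_layer(I, W, B, stride=1):
    """Naive direct convolution following Eq. (1).

    I: input features, shape (N, C, H, H) -- N images, C channels
    W: filters,        shape (M, C, K, K) -- M output features
    B: biases,         shape (M,)
    Returns O with shape (N, M, S_o, S_o), where S_o = (H - K) // stride + 1.
    """
    N, C, H, _ = I.shape
    M, _, K, _ = W.shape
    S_o = (H - K) // stride + 1
    O = np.zeros((N, M, S_o, S_o))
    for n in range(N):                 # each input image
        for m in range(M):             # each output feature / filter
            for x in range(S_o):
                for y in range(S_o):
                    # inner product of the filter with one input window,
                    # summed over all C channels, plus the per-feature bias
                    patch = I[n, :, x*stride:x*stride+K, y*stride:y*stride+K]
                    O[n, m, x, y] = B[m] + np.sum(patch * W[m])
    return O
```

The four nested loops make the O(N·M·S_o²·C·K²) cost of the convolution layer explicit, which is why it dominates the computation of the network.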

In addition to the convolution layer, the pooling layer compresses information over a group of local image pixels in each input channel. There are two types of pooling operations: max pooling and average pooling. For max pooling, the output collects the maximum of the pixel data in the local group window, while for average pooling, the output is the mean of the pixel data in the local group window. These two pooling operations are defined in Eqs. (2) and (3). Figure 5 shows an example of the max pooling function.

Figure 4. Concept of computation of CONV layer.

$$O_{\text{avg}}[r][c] = \text{avg} \begin{bmatrix} I[r][c] & \cdots & I[r][c+K-1] \\ \vdots & \ddots & \vdots \\ I[r+K-1][c] & \cdots & I[r+K-1][c+K-1] \end{bmatrix} \tag{2}$$

$$O_{\text{max}}[r][c] = \max \begin{bmatrix} I[r][c] & \cdots & I[r][c+K-1] \\ \vdots & \ddots & \vdots \\ I[r+K-1][c] & \cdots & I[r+K-1][c+K-1] \end{bmatrix} \tag{3}$$
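Both pooling operations of Eqs. (2) and (3) can be sketched in a few lines of Python/NumPy (a toy single-channel version; the equations leave the stride unspecified, so this sketch assumes the common case of non-overlapping windows, i.e. a stride equal to the window size K):

```python
import numpy as np

def pool2d(I, K, mode="max"):
    """Max or average pooling over non-overlapping K x K windows.

    I: one input channel, shape (H, W); H and W assumed divisible by K.
    mode: "max" for Eq. (3), "avg" for Eq. (2).
    """
    H, W = I.shape
    out = np.zeros((H // K, W // K))
    for r in range(H // K):
        for c in range(W // K):
            window = I[r*K:(r+1)*K, c*K:(c+1)*K]   # the K x K local group
            out[r, c] = window.max() if mode == "max" else window.mean()
    return out
```

Note that pooling has no weights at all, which is why it is cheap compared with the convolution layer.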


Hardware Accelerator Design for Machine Learning http://dx.doi.org/10.5772/intechopen.72845 7


Here I[r][c] represents the input channel's data at position (r, c), and the kernel size of the pooling window is K.

Figure 5. Example of computation of a max pooling layer.

#### 2.2.3. CNN accelerator architecture overview

Today's CNN accelerator architectures can mainly be separated into two categories: the central computation architecture and the sparse computation architecture. Figure 6 shows a typical central computation architecture reported in 2015 [10]. The central computation architecture has one large PE array. Multiple filters are sent into the PE array to enable parallel computation, and the output result of each filter is gathered at the PE array's output and fed back to memory for the next layer's computation. The large PE array gives the central computation architecture an advantage in computing large kernel-sized CNNs; however, the array needs to be reconstructed when computing small kernel-sized CNNs.

Figure 6. Central computation architecture of the CNN accelerator.

On the other hand, a sparse computation architecture is made of many parallel small convolution units that fit small-sized kernels [11]. Figure 7 shows one such implementation. The computing unit (CU) engine array is made of sixteen 3×3 kernel-sized convolution units. It is well suited to computing small kernel-sized convolution operations and simplifies the data flow. However, the computing units only support 3×3 convolution, so for kernel sizes larger than 3×3, a kernel decomposition technique is proposed in the following section.

#### 2.2.4. Kernel decomposition technique


The filter kernel size in a typical CNN can range from very small (1×1) to very large (11×11), so a hardware engine needs to be designed to support convolutions of various sizes. However, in the sparse architecture, the computation units are separated into many small blocks; each block consists of a small-sized processing engine array and can only support small-sized convolutions, making it hard for a block to process a large convolution. To minimize hardware resource usage, a filter decomposition algorithm is proposed to compute any large kernel-sized (>3×3) convolution using only 3×3-sized CUs [11]. The algorithm consists of three steps: (1) It first examines the kernel size of the filter. If the original filter's kernel size is not an exact multiple of three, zero-padding weights are added at the filter's kernel boundary to extend the kernel size to a multiple of three. The added weights are all zero so that the extended filter's convolution result remains the same as the original one. (2) The extended filter is decomposed into several 3×3-sized filters. Each decomposed filter is assigned a shift address based on its top-left weight's relative position in the original filter, and each decomposed filter is computed individually. (3) The output results of the decomposed filters are summed together according to their shift addresses to generate the final output. The mathematical derivation of this decomposition technique is also explained in [11].

Figure 7. Sparse computation architecture of the CNN accelerator in [11].

Figure 8. A 5×5 filter decomposed into four 3×3 sub-filters.

Figure 9. Filter decomposition technique to compute a 5×5 filter on a 7×7 image. The 5×5 filter is decomposed into four separate 3×3 filters F0, F1, F2, F3, generating four sub-images. The sub-images are summed together to generate the final output: pixels of the same color in each sub-image are added together to generate the corresponding pixels in the output image.
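The three steps above can be sketched in Python/NumPy (a toy reference under stated assumptions: square filters, "valid" convolution without flipping as is usual in CNNs, and image zero-padding to match the zero-padded kernel; `conv_valid` is a helper introduced here, not part of the chapter):

```python
import numpy as np

def conv_valid(img, filt):
    """Plain 'valid' 2-D convolution (no kernel flipping, CNN convention)."""
    k = filt.shape[0]
    S = img.shape[0] - k + 1
    out = np.zeros((S, S))
    for x in range(S):
        for y in range(S):
            out[x, y] = np.sum(img[x:x+k, y:y+k] * filt)
    return out

def conv_by_decomposition(img, filt):
    """Large-kernel convolution using only 3x3 units, following steps (1)-(3)."""
    k = filt.shape[0]
    pad = (-k) % 3                            # step 1: pad kernel to a multiple of 3
    f = np.pad(filt, ((0, pad), (0, pad)))    # added weights are all zero
    imgp = np.pad(img, ((0, pad), (0, pad)))  # matching image padding (multiplied by zeros)
    S = img.shape[0] - k + 1                  # output size of the original convolution
    out = np.zeros((S, S))
    for dr in range(0, k + pad, 3):           # step 2: split into 3x3 sub-filters;
        for dc in range(0, k + pad, 3):       #         (dr, dc) is the shift address
            sub = f[dr:dr+3, dc:dc+3]
            part = conv_valid(imgp, sub)      # each 3x3 result computed individually
            out += part[dr:dr+S, dc:dc+S]     # step 3: accumulate at the shift address
    return out
```

For a 5×5 filter this produces exactly the four sub-filters of Figure 8, with shift addresses (0,0), (0,3), (3,0), (3,3), and the summed result matches the direct 5×5 convolution.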

Figure 8 shows an example of decomposing a 5×5 filter into four 3×3 filters using this technique. One row and one column of zero padding are added to the original filter. The decomposed filters F0, F1, F2, F3 have shift addresses (0,0), (0,3), (3,0), and (3,3), respectively. Figure 9 shows the detailed procedure.

### 2.3. Model compression

In addition to development at the hardware architecture level, model compression has also been reported as a way to improve the hardware computation efficiency of machine learning. Ref. [12] reported a methodology to prune a neural network and achieve up to a 35× to 49× reduction of the model parameters. The procedure is shown in Figure 10. The original network is pruned and retrained several times to reduce the number of parameters. After that, quantization with clustered weights is applied to further reduce the parameter size. Finally, Huffman encoding is applied to the final weights to achieve further model size reduction.
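The first two stages of this pipeline can be sketched as follows (a toy illustration of magnitude pruning and weight clustering in the spirit of [12]; the retraining loop and Huffman coding are omitted, and the threshold and cluster count are illustrative assumptions, not values from the paper):

```python
import numpy as np

def prune(weights, threshold):
    """Magnitude pruning: zero out weights below the threshold (retraining omitted)."""
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

def cluster_quantize(weights, n_clusters=16):
    """Weight sharing: snap each surviving weight to one of n_clusters centroids,
    refined with a few k-means-style iterations."""
    nz = weights[weights != 0]
    if nz.size == 0:
        return weights.copy()
    # linear initialization of centroids over the weight range
    centroids = np.linspace(nz.min(), nz.max(), n_clusters)
    for _ in range(10):
        idx = np.argmin(np.abs(nz[:, None] - centroids[None, :]), axis=1)
        for c in range(n_clusters):
            if np.any(idx == c):
                centroids[c] = nz[idx == c].mean()
    q = weights.copy()
    idx = np.argmin(np.abs(nz[:, None] - centroids[None, :]), axis=1)
    q[weights != 0] = centroids[idx]
    return q
```

After these two stages, only the cluster indices and the small centroid table need to be stored, which is what makes the subsequent Huffman encoding stage effective.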


Figure 10. Neural network compression reported in Ref [12].


Due to the rapid growth of deep learning model sizes, model compression is becoming more and more important for machine-learning hardware acceleration, especially for edge-side use cases. In addition, the fixed-point data format is used in many deep learning applications to reduce the computation cost [13].
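As a minimal illustration of the fixed-point idea, the sketch below quantizes values to a signed fixed-point format with saturation (the particular integer/fraction split is an assumption for illustration, not a format from [13]):

```python
import numpy as np

def to_fixed_point(x, int_bits=3, frac_bits=4):
    """Quantize x to a signed fixed-point value with int_bits integer bits
    and frac_bits fractional bits, saturating at the representable range."""
    scale = 2 ** frac_bits
    lo = -(2 ** (int_bits + frac_bits))       # most negative integer code
    hi = 2 ** (int_bits + frac_bits) - 1      # most positive integer code
    codes = np.clip(np.round(x * scale), lo, hi)
    return codes / scale                      # de-quantized value actually computed with
```

A multiply-accumulate on such 8-bit codes needs only small integer arithmetic, which is the source of the computation-cost saving compared with 32-bit floating point.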


#### 2.4. Analog computing

In addition to traditional digital accelerator design, analog computing is becoming one of the trends for improving processor computation ability on machine learning problems. Here, we use the charge-trapping transistor (CTT) technique as an example to introduce analog computing [14]. The complementary metal-oxide-semiconductor (CMOS)-compatible feature of CTTs makes them very promising devices for implementing large-sized computation using an analog methodology.

As the scaling of transistors reaches its manufacturing limit, the computation throughput of current architectures will inevitably saturate. Recent research reports the development of analog computing engines. Compared to traditional digital computation, analog computing shows substantial advantages in power, design cost, and computation speed. Among analog computing systems, memristor-based ones have been widely reported [14]. Recently, the more promising charge-trapping transistors (CTTs) were reported as digital memory devices with reliable trapping and de-trapping behavior. Unlike other charge-trapping devices, such as floating-gate transistors, transistors with an organic gate dielectric, and carbon nanotube transistors, CTTs are manufacturing-ready and fully CMOS compatible in terms of both process and operation. It has been shown that more than 90% of the trapped charge can be retained after 10 years, even when the device is baked at 85 °C [15].

Figure 11. A schematic showing the basic operation of a CTT device (equally applicable to FinFET-based CTTs): (1) charge-trapping operation, (2) charge de-trapping operation.


A schematic of the basic operation of a CTT device is depicted in Figure 11. The device threshold voltage, VT, is modulated by the charge trapped in the gate dielectric of the transistor. VT increases when positive pulses are applied to the gate to trap electrons in the high-k layer and decreases when negative pulses are applied to the gate to de-trap electrons from the high-k layer. CTT devices can be programmed by applying logic-compatible voltages.

A memristive computing engine based on the charge-trapping transistor (CTT) has been proposed [14]. It consists of 784-by-784 CTT analog multipliers and achieves a 100× power and area reduction compared to the conventional digital approach. By implementing a novel sequential analog fabric (SAF), the mixed-signal interfaces are simplified, and only an 8-bit analog-to-digital converter (ADC) is required in the system. The top-level system architecture is shown in Figure 12. A 784-by-784 CTT computing engine implemented in TSMC 28 nm CMOS technology occupies 0.68 mm², as shown in Figure 13. It achieves 69.9 TOPS at a 500 MHz clock frequency and consumes 14.8 mW.

Figure 12. Top-level system architecture of the proposed memristive computing engine, including the CTT array and the mixed-signal interfaces: a tunable low-dropout regulator (LDO), an analog-to-digital converter (ADC), and the novel sequential analog fabric (SAF).

Figure 13. Layout view in TSMC 28 nm CMOS technology.

Compared with a traditional digital processor, the analog computing processor achieves much lower power as well as a large area reduction. Table 1 compares the computation ability of the analog processor and a digital processor. As it shows, the analog processor achieves more than 100 times the computing speed of the digital processor with about one tenth of the area.

Table 1. Comparison table between analog computing and digital computing in Ref [14].

Even though analog computing shows advantages in computation speed and design cost, its low computing resolution limits its application in most ML algorithms. Due to the design challenges of the ADC in the analog processor, the processor can only handle computation resolutions of around 10 bits or less, making it unsuitable for most AI applications.

3. Conclusion

In this chapter, various computation hardware platforms for machine learning algorithms are discussed. Among them, the GPU is the most widely used due to its fast computation speed and compatibility with various algorithms. The FPGA shows better energy efficiency than the GPU when computing machine learning algorithms, at the cost of lower speed. Finally, different ASIC architectures are proposed to support certain kinds of machine learning algorithms, such as deep convolutional neural networks with model compression techniques, to improve hardware performance. Compared with the GPU and FPGA, the ASIC shows the best energy efficiency and computation speed, however at the cost of reconfigurability across ML algorithms. Depending on the specific application, designers should select the most suitable computation hardware platform.

Author details

Li Du\* and Yuan Du

Hardware Architecture Research Engineer, Kneron Inc., Research Scientist, UCLA, Los Angeles, USA

\*Address all correspondence to: dl1989113@ucla.edu

References

[1] LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. May 2015;521:436-444

[2] Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Proceedings of Advances in Neural Information Processing Systems. 2012;25:1097-1105

[3] Silver D et al. Mastering the game of Go with deep neural networks and tree search. Nature. Jan. 2016;529(7587):484-489

[4] Gawande NA, Landwehr JB, Daily JA, Tallent NR, Vishnu A, Kerbyson DJ. Scaling deep learning workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing. In: IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW); Lake Buena Vista. 2017. pp. 399-408

[5] Putnam A. The configurable cloud – accelerating hyperscale datacenter services with FPGA. In: IEEE 33rd International Conference on Data Engineering (ICDE); San Diego. 2017. p. 1587

[6] Chang AXM, Culurciello E. Hardware accelerators for recurrent neural networks on FPGA. In: IEEE International Symposium on Circuits and Systems (ISCAS); Baltimore, MD. 2017. pp. 1-4
