individuals, the number of iterations used was increased; thus, more time was used in each iteration. The PSO algorithm obtained higher coverage percentages than GA and pseudorandom generation. A main characteristic is that fewer individuals or particles than GA are required. In the case of the BPSOr algorithm, the number of iterations required was less than PSO and GA in most of the experiments; therefore, the verification time was reduced. Consequently, hybrid verification methods can improve the performance during the functional verification at block level of digital systems.

Author details

Alfonso Martínez-Cruz<sup>1</sup>\*, Ignacio Algredo-Badillo<sup>1</sup>, Alejandro Medina-Santiago<sup>1</sup>, Kelsey Ramírez-Gutiérrez<sup>1</sup>, Prometeo Cortés-Antonio<sup>2</sup>, Ricardo Barrón-Fernández<sup>3</sup>, René Cumplido-Parra<sup>1</sup> and Kwang-Ting Cheng<sup>4</sup>

\*Address all correspondence to: amartinezc@inaoep.mx

1 National Institute of Astrophysics, Optics and Electronics, San Andres Cholula, Mexico

2 Tijuana Institute of Technology, Tijuana, Mexico

3 Computing Research Center, National Polytechnic Institute, Mexico City, Mexico

4 College of Engineering, University of California, Santa Barbara, California, USA



**Section 3**

**Machine Learning in Digital Systems**


**Chapter 6**


**Efficient Deep Learning in Network Compression and Acceleration**

Shiming Ge

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/intechopen.79562

© 2018 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


#### **Abstract**

While deep learning delivers state-of-the-art accuracy on many artificial intelligence tasks, it comes at the cost of high computational complexity due to the large number of parameters. It is important to design or develop efficient methods to support deep learning toward enabling its scalable deployment, particularly for embedded devices such as mobile phones, Internet of Things (IoT) devices, and drones. In this chapter, I will present a comprehensive survey of several advanced approaches for efficient deep learning in network compression and acceleration. I will describe the central ideas behind each approach and explore the similarities and differences between different methods. Finally, I will present some future directions in this field.

**Keywords:** deep learning, deep neural networks, network compression, network acceleration, artificial intelligence

### **1. Introduction**

With the rapid development of modern computing power and large-scale data collection techniques, deep neural networks (DNNs) have pushed artificial intelligence limits in a wide range of inference tasks, including but not limited to visual recognition [1], face recognition [2], speech recognition [3], and the game of Go [4]. For example, the visual recognition method proposed in [5] achieves a 3.57% top-5 test error on the ImageNet LSVRC-2012 classification dataset, while the face recognition system in [6] achieves over 99.5% accuracy on the public face benchmark LFW [7]; both have surpassed human-level performance (5.1% on ImageNet [8] and 97.53% on LFW [9], respectively).


These powerful methods usually rely on DNNs containing millions or even billions of parameters. For example, the "very deep" VGG-16 [10], which achieves very impressive performance on ImageNet LSVRC 2014, uses a 16-layer deep network containing 138 million parameters and takes more than 500 MB to store the model. Beyond the remarkable performance, there is increasing concern that the large number of parameters consumes considerable resources (e.g., storage, memory, and energy), which hinders practical deployment. First, when a deep neural network (DNN) is used on mobile devices, storage bandwidth is critical both for the model size and for data computation. For example, mobile-first companies (such as Facebook and Baidu) care greatly about the size of the uploaded files, while mobile sensor data companies (such as Google and Microsoft) usually build largely cloud-powered systems with limited mobile computation. Second, when a DNN is deployed in the cloud, the memory bandwidth demand is very important for saving transmission and power. Therefore, smaller models obtained via DNN compression at least mean that they (1) are easier to download from an app store, (2) need less bandwidth when updating models on an autonomous car, (3) are easier to deploy on embedded hardware with limited memory, (4) need less communication across servers during distributed training, and (5) need less energy to perform face recognition.

The objective of efficient methods is to improve the efficiency of deep learning through smaller model size, higher prediction accuracy, faster prediction speed, and lower power consumption. Toward this end, a feasible solution is performing model compression and acceleration on well-trained networks. In this chapter, I will first introduce some background on deep neural networks in Section 2, which provides the motivation toward efficient algorithms. Then, I will present a comprehensive survey of recent advanced approaches for efficient deep learning in network compression and acceleration, which are mainly grouped into five categories: network pruning in Section 3, network quantization in Section 4, network approximation (parameter structuring) in Section 5, network distillation in Section 6, and compact network design (network densifying) in Section 7. After that, I will discuss some future directions in this field in Section 8. Finally, Section 9 gives the conclusion.


### **2. Background**

In this section, a brief introduction and analysis of the structure of deep networks, their computation and storage complexity, weight distribution, and memory bandwidth are given, with some classic networks as examples. This analysis reveals the motivation behind model compression and acceleration approaches.

Recently, deep convolutional neural networks (CNNs) have become very popular due to their powerful representational capacity. A deep convolutional neural network (CNN) usually has a hierarchical structure of a number of layers, containing multiple blocks of convolutional layers, activation layers, and pooling layers, followed by multiple fully connected layers. **Figure 1** gives the structures of two classic CNNs, where (a) AlexNet [1] and (b) VGG-16 [10] consist of eight and sixteen layers, respectively. The two networks are larger than 200 MB and 500 MB, which makes them difficult to deploy on mobile devices. The convolutional layers dominate most of the computational complexity since they need a lot of multiplication-and-addition (MAC) operations to extract local patterns, while they contain fewer weights due to weight sharing and local connectivity. By contrast, fully connected layers contain most of the weights since dense matrix-vector multiplications are very resource-intensive. In addition, an activation layer (such as ReLU) contains a nonlinear function to activate or suppress some neurons; it can make the network sparser and more robust against over-fitting while reducing the number of connections. A pooling layer follows a convolutional layer and aims to merge semantically similar features to reduce memory.

**Figure 1.** The structures of two classic deep networks. AlexNet (a) and VGGNet (b) both contain multiple convolutional layers (red), activation layers, and pooling layers (yellow), followed by multiple fully connected layers (green). The input and loss layer are marked in mazarine and blue, respectively.

As shown in **Table 1**, the complexity of CNNs can be split into two parts: (1) the computational complexity of a CNN is dominated by the convolutional layers and (2) the number of parameters is mainly related to the fully connected layers. Therefore, most model acceleration approaches focus on decreasing the computational complexity of the convolutional layers, while model compression approaches mainly try to compress the parameters of the fully connected layers.


| Network | MACs | Conv (%) | FC (%) | Parameters | Conv (%) | FC (%) |
|---|---|---|---|---|---|---|
| AlexNet | 724 M | 91.9 | 8.1 | 61 M | 3.8 | 96.2 |
| VGG-16 | 15.5 G | 99.2 | 0.8 | 138 M | 10.6 | 89.4 |
| GoogleNet | 1.6 G | 99.9 | 0.1 | 6.9 M | 85.1 | 14.9 |
| ResNet-50 | 3.9 G | 100 | 0 | 25.5 M | 100 | 0 |

**Table 1.** The computational and parameter complexities and distributions for deep CNNs. The first three data columns give the computational complexity (MACs) and its distribution over layer types; the last three give the parameter complexity (model size) and its distribution.
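To make the split in **Table 1** concrete, the short sketch below counts MACs and parameters for a single convolutional layer and a single fully connected layer; the layer shapes are illustrative, VGG-style assumptions rather than figures taken from the table.

```python
# Count MACs and parameters for one convolutional and one fully connected layer.
# The shapes are illustrative VGG-style assumptions, not values from Table 1.

def conv_cost(h_out, w_out, k, c_in, c_out):
    """MACs and parameters of a k x k convolution with c_out output maps of size h_out x w_out."""
    params = k * k * c_in * c_out       # weight sharing: independent of the output resolution
    macs = params * h_out * w_out       # the same kernel is applied at every output position
    return macs, params

def fc_cost(n_in, n_out):
    """MACs and parameters of a dense (fully connected) layer."""
    params = n_in * n_out               # dense weight matrix
    macs = params                       # each weight is used once per forward pass
    return macs, params

conv_macs, conv_params = conv_cost(h_out=56, w_out=56, k=3, c_in=128, c_out=256)
fc_macs, fc_params = fc_cost(n_in=25088, n_out=4096)

print(f"conv: {conv_macs / 1e6:.0f} M MACs, {conv_params / 1e6:.2f} M params")
print(f"fc:   {fc_macs / 1e6:.0f} M MACs, {fc_params / 1e6:.2f} M params")
```

Even at these modest sizes, the convolutional layer needs roughly nine times the MACs of the fully connected layer while holding only a fraction of a percent of its parameters, which is exactly the pattern summarized in **Table 1**.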

DNNs are known to be over-parameterized, which facilitates convergence to good local minima of the loss function during model training [11]. Therefore, such redundancy can be removed from the trained networks at test or inference time. Moreover, each layer contains many weights near zero. **Figure 2** shows the probability distribution of the weights in two layers of AlexNet and VGG-16, respectively, where the weights are scaled and quantized into [−1, 1] with 32 levels for convenient visual display. It can be seen that the distribution is biased: most of the (quantized) weights on each layer are distributed around a zero-value peak. This observation demonstrates that the weights can be compressed through weight coding, such as Huffman coding.

**Figure 2.** The probability distribution of weights in two layers of AlexNet and VGG-16. It is shown that the weight distribution is around a zero-value peak.
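The following sketch reproduces the flavor of this analysis on synthetic weights (the actual AlexNet/VGG-16 weights are assumed unavailable here): it scales a weight vector into [−1, 1], quantizes it to 32 levels, and reports the mass at the zero peak together with the entropy, which indicates roughly how many bits per weight an entropy coder such as Huffman coding would need.

```python
import numpy as np

# Synthetic stand-in for one layer's weights (a narrow Gaussian around zero).
rng = np.random.default_rng(0)
weights = rng.normal(loc=0.0, scale=0.05, size=100_000)

scaled = weights / np.max(np.abs(weights))                 # scale into [-1, 1]
levels = np.linspace(-1.0, 1.0, 32)                        # 32 quantization levels
quantized = levels[np.argmin(np.abs(scaled[:, None] - levels[None, :]), axis=1)]

counts = np.array([(quantized == lv).sum() for lv in levels])
probs = counts / counts.sum()

# Entropy in bits per weight: far below the 5 bits needed for 32 uncoded levels,
# which is why entropy coding (e.g., Huffman coding) shrinks the quantized weights.
entropy = -np.sum(probs[probs > 0] * np.log2(probs[probs > 0]))
print(f"probability at the peak level: {probs.max():.2f}")
print(f"entropy: {entropy:.2f} bits/weight (vs. 5 bits uncoded)")
```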

The memory bandwidth of a CNN model relates to its inference processing and greatly impacts the energy consumption, especially when running on embedded or mobile devices. To analyze the memory bandwidth of a trained CNN model, a simple but effective approach is applied here: performing forward inference on multiple images and then analyzing the range of each layer's output. The memory of each layer depends on the bit width of each feature and the number of output features. 1000 images from the ImageNet dataset are randomly selected to perform inference with AlexNet and VGG-16, respectively; the mean range of output features on each layer is shown in **Figure 3**. It shows that the ranges of memory bandwidths in each layer are different and variable. Inspired by that, network compression and acceleration approaches can be designed to dynamically control the memory allocation in network layers by evaluating the ranges of each layer. Following these observations, many efficient methods for network compression and acceleration have been proposed, and several survey papers can be found in [12–14]. As shown in **Figure 4**, these approaches are grouped into five main categories according to their scheme for processing deep networks: pruning, quantization, approximation, distillation, and densification. In the following sections, I will introduce the advanced approaches in these categories.
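A minimal sketch of this range analysis is shown below, assuming PyTorch and a tiny stand-in network with random inputs instead of AlexNet/VGG-16 and real ImageNet images; forward hooks record the minimum and maximum output value seen at each layer.

```python
import torch
import torch.nn as nn

# Tiny stand-in CNN; AlexNet/VGG-16 and real ImageNet images are assumed unavailable here.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 10),
)
model.eval()

ranges = {}  # layer name -> (min, max) seen over all forward passes

def make_hook(name):
    def hook(module, inputs, output):
        lo, hi = output.min().item(), output.max().item()
        old_lo, old_hi = ranges.get(name, (lo, hi))
        ranges[name] = (min(old_lo, lo), max(old_hi, hi))
    return hook

for name, module in model.named_modules():
    if isinstance(module, (nn.Conv2d, nn.ReLU, nn.Linear)):
        module.register_forward_hook(make_hook(name))

with torch.no_grad():
    for _ in range(10):                      # stand-in for the 1000 ImageNet images
        model(torch.randn(1, 3, 64, 64))

for name, (lo, hi) in ranges.items():
    print(f"layer {name}: output range [{lo:.2f}, {hi:.2f}]")
```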


**Figure 3.** The memory bit-width range of each layer of AlexNet (left) and VGG-16 (right).

**Figure 4.** The main categories of network compression and acceleration approaches.


### **3. Network pruning**


DNNs are known to be over-parameterized, which facilitates convergence to good local minima of the loss function during network training [11]. Therefore, the optimally trained deep networks usually contain redundancy in their parameters. Inspired by that, the network pruning category aims to remove such redundancy from the pre-trained networks at inference time. In this way, pruning approaches are applied to prune the unimportant or unnecessary parameters and thus significantly increase the sparsity of the parameters. Recently, many approaches have been proposed, which consist of regular pruning approaches and irregular pruning approaches. As stated in [13], irregular pruning refers to fine-grained pruning, while regular pruning approaches are further categorized into four classes according to the pruning levels: vector level, kernel level, group level, and filter level. **Figure 5** shows different pruning methods. The core of network pruning is measuring the importance of weights or parameters.


Fine-grained pruning is the most popular approach used in network pruning. It removes any unimportant parameters in convolutional kernels in an irregular manner. In early work, LeCun et al. proposed optimal brain damage, a fine-grained pruning technique that estimates the saliency of the parameters by using the approximate second-order derivatives of the loss function w.r.t. the parameters and then removes the parameters with low saliency. This technique was shown to work better than the naive approach. Later, Hassibi and Stork [15] came up with optimal brain surgeon, which performed much better than optimal brain damage although at a much higher computational cost. Recently, Chaber and Lawrynczuk [16] applied optimal brain damage to pruning recurrent neural models. Han et al. developed a method to prune unimportant connections and then retrain the weights to reduce storage and computation [17]. Later, they proposed a hybrid method, called Deep Compression [18], to compress deep neural networks with pruning, quantization, and Huffman coding. On the ImageNet dataset, the method reduced the storage required by AlexNet by 35×, from 240 MB to 6.9 MB, and by VGG-16 by 49×, from 552 MB to 11.3 MB, both without loss of accuracy. Recently, Guo et al. [19] improved Deep Compression via dynamic network surgery, which incorporated connection splicing into the whole process to avoid incorrect pruning. For face recognition, Sun et al. [20] proposed to iteratively learn sparse ConvNets. Instead of removing individual weights, Srinivas et al. [21] proposed to remove one neuron at a time; they presented a systematic way to remove the redundancy by wiring similar neurons together. In general, these irregular pruning approaches can achieve efficient compression of model sizes, but the memory footprint is still not reduced.
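As a hedged illustration of the prune-then-retrain idea behind [17] (not the exact published procedure), the sketch below performs magnitude-based fine-grained pruning on a random kernel and keeps the binary mask that a retraining step would use to freeze pruned connections.

```python
import numpy as np

rng = np.random.default_rng(0)
kernel = rng.normal(size=(64, 32, 3, 3))           # (filters, channels, height, width)

sparsity = 0.7                                     # fraction of weights to remove (assumed)
threshold = np.quantile(np.abs(kernel), sparsity)  # magnitude below which weights are cut
mask = (np.abs(kernel) >= threshold).astype(kernel.dtype)
pruned = kernel * mask

print(f"zeroed weights: {(mask == 0).mean():.1%}")
# During retraining, gradients are multiplied by `mask` so that pruned
# connections stay at zero, following the prune-retrain loop described above.
```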

**Figure 5.** Different pruning methods for a convolutional layer that has three convolutional filters of size 3 × 3 × 3 [13].

Different from fine-grained pruning, vector-level and kernel-level pruning remove vectors in the convolutional kernels and 2D convolutional kernels in the filters in a regular manner, respectively. Anwar et al. [22] proposed pruning a vector at a fixed stride via intra-kernel strided pruning. Mao et al. [23] explored different granularity levels in pruning. Group-level pruning aims to remove network parameters according to the same sparse pattern on the filters; in this way, convolutional computation can be efficiently implemented with reduced matrix multiplication. Lebedev and Lempitsky [24] revised the brain damage-based pruning approach in a group-wise manner; their approach added group-wise pruning to the training process to speed up the convolutional operations by using group-sparsity regularization. Similarly, Wen et al. [25] pruned groups of parameters by using group Lasso. Filter-level pruning aims to remove convolutional filters or channels to thin the deep networks. Since the number of input channels of the next layer is reduced after a filter is pruned, such pruning is more efficient for accelerating network inference. Polyak and Wolf proposed two compression strategies [26]: one based on eliminating lowly active channels and the other on coupling pruning with the repeated use of already computed elements. Luo et al. [27] proposed ThiNet to perform filter-level pruning; the pruning is guided by feature maps, and the channels are selected by minimizing the reconstruction error between two successive layers. Similarly, He et al. [28] applied an iterative two-step algorithm to prune filters by minimizing the reconstruction error of the feature maps. Generally speaking, these regular pruning (vector-level, kernel-level, group-level, and filter-level) approaches are more suitable for hardware implementations.
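The following sketch illustrates filter-level pruning in its simplest form: filters are ranked by their L1 norm and the weakest ones are removed, which also shrinks the input channels of the following layer. The shapes, the L1 criterion, and the 50% keep ratio are illustrative assumptions rather than the exact selection rules of [26–28].

```python
import numpy as np

rng = np.random.default_rng(0)
conv_w = rng.normal(size=(64, 32, 3, 3))      # this layer: (out_channels, in_channels, h, w)
next_w = rng.normal(size=(128, 64, 3, 3))     # next layer consumes the 64 output channels

keep = 32                                     # keep the strongest half of the filters
scores = np.abs(conv_w).sum(axis=(1, 2, 3))   # L1 norm of each filter
kept = np.sort(np.argsort(scores)[-keep:])    # indices of the filters to keep

conv_pruned = conv_w[kept]                    # 64 -> 32 filters in this layer
next_pruned = next_w[:, kept]                 # matching input channels in the next layer

print(conv_w.shape, "->", conv_pruned.shape)
print(next_w.shape, "->", next_pruned.shape)
```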

### **4. Network quantization**


Typically, DNNs apply floating-point (such as 32-bit) precision for training and inference, which may lead to a large cost in memory, storage, and computation. To save this cost, the network quantization category uses reduced precision to approximate network parameters. These approaches consist of scalar or vector quantization and fixed-point quantization (see **Figure 4**).

Scalar or vector quantization techniques are originally designed for data compression, where a codebook and a set of quantization codes are used to represent the original data. Considering that the size of the codebook is much smaller than the original data, the original data can be efficiently compressed via quantization. Inspired by that, scalar or vector quantization approaches are applied to represent the parameters or weights of a deep network for compression. In [29], Gong et al. applied k-means clustering to the weights or conducted product quantization and achieved a very good balance between model size and recognition accuracy; they achieved 16–24× compression of the network with only 1% loss of accuracy on the ImageNet classification task. Wu et al. [30] proposed quantized CNN to simultaneously speed up the computation and reduce the storage and memory overhead of CNN models. This method obtains a 4–6× speedup and 15–20× compression with 1% loss of accuracy on ImageNet. With the quantized CNN model, even mobile devices can accurately classify images within 1 second. Soulié et al. [31] proposed compressing the deep network during the learning phase by adding an extra regularization term and combining product quantization of the network parameters.
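A minimal sketch of codebook-based scalar quantization in the spirit of [29] is given below: the weights of a layer are clustered with a small hand-rolled k-means, and each weight is then stored as a 4-bit index into a 16-entry codebook. The synthetic weights and the cluster count are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.05, size=50_000).astype(np.float32)

k = 16                                         # 16 centroids -> 4-bit indices (assumed)
centroids = np.linspace(weights.min(), weights.max(), k)
for _ in range(20):                            # a few Lloyd (k-means) iterations
    assign = np.argmin(np.abs(weights[:, None] - centroids[None, :]), axis=1)
    for j in range(k):
        if np.any(assign == j):
            centroids[j] = weights[assign == j].mean()

quantized = centroids[assign]                  # decode: index -> codebook value
original_bits = weights.size * 32
compressed_bits = weights.size * 4 + k * 32    # per-weight indices plus the codebook
print(f"compression: {original_bits / compressed_bits:.1f}x, "
      f"mean absolute error: {np.abs(weights - quantized).mean():.5f}")
```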

Different from scalar and vector quantization approaches, fixed-point quantization approaches directly reduce the precision of parameters without codebooks. In [32], Dettmers proposed 8-bit approximation algorithms to make better use of the available bandwidth by compressing 32-bit gradients and nonlinear activations to 8-bit approximations, which obtains a speedup of 50×. In [33], Gupta et al. used only a 16-bit wide fixed-point number representation with stochastic rounding and incurred little to no degradation in the classification accuracy. Lin et al. [34] proposed a fixed-point quantizer design by formulating an optimization problem to identify the optimal fixed-point bit-width allocation across network layers. The approach offered more than a 20% reduction in the model size without performance loss, and the performance continued to improve after fine-tuning.

Beyond fixed-point quantization with reduced precision, an alternative is using binary or ternary precision to lower the parameter representation. Soudry et al. [35] proposed the Expectation Propagation (EP) algorithm to train multilayer neural networks. The algorithm has the advantages of parameter-free training and discrete weights, which are useful for large-scale parameter tuning and efficient training implementation on precision-limited hardware, respectively. Courbariaux et al. [36] introduced BinaryConnect to provide deep neural network learning with binary weights. BinaryConnect acts as a regularizer like other dropout schemes. The approach obtained near state-of-the-art results on permutation-invariant MNIST, CIFAR-10, and SVHN. Esser et al. [37] proposed training with standard backpropagation in binary precision by treating spikes and discrete synapses as continuous probabilities. They trained a sparsely connected network running on the TrueNorth chip, which achieved a high accuracy of 99.42% on the MNIST dataset with an ensemble of 64 and 92.7% accuracy with an ensemble of 1. Hubara et al. [38] introduced a method to train Binarized Neural Networks (BNNs) at run-time; in the trained neural networks, both the weights and activations are binary. BNNs achieved near state-of-the-art results on MNIST, CIFAR-10, and SVHN. Moreover, BNNs achieved competitive results on the challenging ImageNet dataset (36.1% top-1 using AlexNet) while drastically reducing memory consumption (size and number of accesses) and improving the speed of matrix multiplication by seven times. Later, Rastegari et al. [39] proposed XNOR-Net for ImageNet classification. They proposed two approximations to standard CNNs: binary-weight-networks and XNOR-Networks. The first approximation achieved a 32× memory saving by replacing 32-bit floating-point weights with binary values. The second approximation enabled both the filters and the activations to be binary; moreover, it approximated convolutions using primarily binary operations. In this way, it achieved 58× faster convolutional operations and 32× memory savings with a much higher classification accuracy (53.8% top-1 using AlexNet) than BNNs. Beyond the great reductions in network sizes and convolutional operations, these binarization schemes are based on simple matrix approximations and ignore the effect of binarization on the loss. To address this problem, Hou et al. [40] recently proposed a loss-aware binarization method by directly minimizing the loss w.r.t. the binarized weights with a proximal Newton algorithm with diagonal Hessian approximation. This method achieved good binarization performance and was robust for wide and deep networks. Motivated by local binary patterns (LBP), Xu et al. [41] proposed an efficient alternative to convolutional layers called local binary convolution (LBC) to facilitate binary network training. Compared to a standard convolutional layer, the LBC layer affords significant parameter savings, 9×–169× in the number of parameters, as well as 9×–169× savings in model size. Moreover, the resulting CNNs with LBC layers achieved comparable performance on a range of visual classification tasks, such as MNIST, SVHN, CIFAR-10, and ImageNet. Targeting more faithful inference and a better trade-off for practical applications, Guo et al. [42] introduced network sketching for pursuing binary-weight CNNs. They applied a coarse-to-fine model approximation by directly exploiting the binary structure in pre-trained filters and generated binary-weight models via tensor expansion. Moreover, an associative implementation of binary tensor convolutions was proposed to further speed up the generated models. As a result, the resulting models outperformed the other binary-weight models on the ImageNet large-scale classification task (55.2% top-1 using AlexNet). In order to reduce the accuracy loss or even improve accuracy, Zhu et al. [43] proposed Trained Ternary Quantization (TTQ) to reduce the precision of weights in neural networks to ternary values. TTQ trained the models from scratch with both ternary values and ternary assignment, while network inference only needed the ternary values (2-bit weights) and scaling factors. The resulting models achieved an improved top-1 accuracy of 57.5% using AlexNet on the ImageNet large-scale classification task against the full-precision model (57.2% top-1). TTQ can be viewed as a sparse binary-weight network, which can potentially be accelerated with custom circuits. Generally speaking, the binary or ternary quantization approaches can greatly save the costs in model size, memory footprint, and computation, which makes them friendly for hardware implementations. However, the accuracy needs to be improved, especially in large-scale classification problems.
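As a small illustration of the binary-weight approximation discussed above (the first XNOR-Net approximation and, up to the scaling factor, BinaryConnect), the sketch below replaces a real-valued filter *W* with α·sign(*W*), where α = mean(|*W*|) minimizes the L2 approximation error; the filter itself is random.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 3, 256)).astype(np.float32)   # one random convolutional filter

alpha = np.abs(W).mean()      # scaling factor that minimizes ||W - alpha * sign(W)||^2
B = np.sign(W)                # 1-bit weights
W_hat = alpha * B

rel_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
print(f"alpha = {alpha:.4f}, relative approximation error = {rel_err:.3f}")
# Storage drops from 32 bits per weight to 1 bit per weight plus one float per
# filter, which is the source of the ~32x memory saving quoted above.
```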


### **5. Network approximation**


As stated in Section 2, most of the computational cost of network inference comes from the convolution operators. In general, the convolutional kernel of a convolutional layer is represented with a 4D tensor *K* ∈ ℝ<sup>*w*×*h*×*c*×*s*</sup>, where *w* and *h* are the width and height of the kernel filter, *c* is the number of input channels, and *s* indicates the target number of feature maps. The convolutional operation is performed by first transforming the kernel into a *t*-D (*t* = 1, 2, 3, 4) tensor and then computing it with efficient mathematical algorithms, such as Basic Linear Algebra Subprograms (BLAS). Inspired by that, network approximation aims to approximate the operation with low-rank decomposition.

Some approaches approximate the 2D tensor by using singular value decomposition (SVD). Jaderberg et al. [44] decomposed the spatial dimension *w* × *h* into *w* × 1 and 1 × *h* filters, which achieved a 4.5× speedup for a CNN trained on a text character recognition dataset, with only an accuracy drop of 1%. Observing that the computation is dominated by the convolution operations in the lower layers of the network, Denton et al. [45] exploited the redundancy present within the convolutional filters to derive approximations that significantly reduce the required computation. The approach delivered a 2× speedup on both CPU and GPU while keeping the accuracy within 1% of the original network on object recognition tasks. In [46], the authors proposed using a sparse decomposition to reduce the redundancy in model parameters. They obtained maximum sparsity by exploiting both interchannel and intrachannel redundancy and performing fine-tuning to minimize the recognition loss, which zeros out more than 90% of the parameters with a less than 1% loss of accuracy on ImageNet. Inspired by the fact that the convolutional layer can be calculated with matrix-matrix multiplication, Figurnov et al. [47] used a loop perforation technique to eliminate redundant multiplications, which allows the inference time to be reduced by 50%.
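The sketch below illustrates the general low-rank idea behind these approaches: reshape the 4D kernel into a matrix, truncate its SVD, and compare parameter counts. The reshaping, the rank, and the random kernel are assumptions for illustration; the methods in [44–51] choose the exact factorization (SVD, CP, Tucker, block-term) differently, and trained kernels are far more compressible than the random one used here.

```python
import numpy as np

rng = np.random.default_rng(0)
w, h, c, s = 3, 3, 64, 128
K = rng.normal(size=(w, h, c, s)).astype(np.float32)   # random 4D kernel (assumed)

M = K.reshape(w * h * c, s)                   # (spatial x input channels) x output maps
U, S, Vt = np.linalg.svd(M, full_matrices=False)

r = 16                                        # target rank (assumed)
M_low = (U[:, :r] * S[:r]) @ Vt[:r, :]        # rank-r reconstruction of the kernel matrix

rel_err = np.linalg.norm(M - M_low) / np.linalg.norm(M)
full_params = M.size
low_params = U[:, :r].size + Vt[:r, :].size   # two thin factors replace one large matrix
print(f"rank {r}: parameters {full_params} -> {low_params}, relative error {rel_err:.3f}")
# A trained kernel is far more redundant than this random one, so in practice the
# error at a given rank is much lower, which is what makes low-rank approximation work.
```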

pre-trained deep network or the ensemble of multiple pre-trained deep networks. The training applies a teacher-student learning manner. Early work proposed by [56] proposed model compression, where the main idea was to use a fast and compact model to approximate the function learned by a slower, larger but better-performing model. Later, Hinton et al. proposed knowledge distillation [57] that trained a smaller neural network (called student network) by taking the output of a large, capable, but slow pre-trained one (called teacher network). The main strength of this idea comes from using the vast network to take care of the regularization process facilitating subsequent training operations. However, this method requires a large pretrained network to begin with which is not always feasible. In [58], the authors extended this method to allow the training of a student that is deeper and thinner than the teacher, using not only the outputs but also the intermediate representations learned by the teacher as hints to improve the training process and final performance of the student. Inspired by these methods, Luo et al. [59] proposed to utilize the learned knowledge of a large teacher network or its ensemble as supervision to train a compact student network. The knowledge is represented by using the neurons at the higher hidden layer, which preserve as much information as the label probabilities but are more compact. When using an ensemble of DeepID2+ as teacher, a mimicked student is able to outperform it and achieves 51.6× compression ratio and 90× speedup in inference, making this model applicable on mobile devices. Lu et al. [60] investigated the teacher-student training for small-footprint acoustic models. Shi et al. [61] proposed a taskspecified knowledge distillation algorithm to derive a simplified model with preset computation cost and minimized accuracy loss, which suits the resource-constraint front-end systems well. The knowledge distillation method relied on transferring the learned discriminative information from a teacher model to a student model. The method first analyzed the redundancy of the neural network related to a priori complexity of the given task and then trains a student model by redefining the loss function from a subset of the relaxed target knowledge according to the task information. Recently, Yom et al. [62] defined the distilled knowledge in terms of flow between layers and computed it with the inner product between features from two layers. The knowledge distillation idea was used to compress networks for object detec-

Efficient Deep Learning in Network Compression and Acceleration

http://dx.doi.org/10.5772/intechopen.79562

105

Generally speaking, the network distillation approaches achieve a very high compression ratio. Another advantage of these approaches is making the resulting deep networks more interpretable. One issue needed to be addressed is reducing the accuracy drop ever improv-

Another direct category for obtaining network compression and acceleration is to design more efficient but low-cost network architecture. I call this category as "network densifying," which aims to design compact deep networks to provide high accurate inference. In recent years, several approaches have been proposed following this line. The general ideas to achieve this goal include the usage of small filter kernels, grouping convolution, and

tion tasks [63–65].

ing the accuracy.

**7. Network densifying**

advanced regularization.

By successive 2D tensor decompositions, 3D tensor decompositions can be obtained directly. Zhang et al. [48] applied the strategy that conducts a 2D decomposition on the first weight tensor after SVD. Their approach had been used to accelerate very deep networks for object classification and detection tasks. Another 3D tensor decomposition, Tucker decomposition [49], was proposed to compress deep CNNs for mobile applications by performing SVD along the input channel dimension for the first tensor after 2D decomposition. To further reduce complexity, a block-term decomposition [50] method based on low-rank and group sparse decomposition was proposed by approximating the original weight tensor by the sum of some smaller subtensors. By rearranging these subtensors, the block-term decomposition can be seen as a Tucker decomposition where the second decomposed tensor is a block diagonal tensor.

4D tensor decomposition can be obtained by exploring the low-rank property along the channel dimension and the spatial dimension. This is used in [51], and the decomposition is CP decomposition. The CP decomposition can achieve a very high speedup, for example, as 4.5× speedup for the second layer of AlexNet at only 1% accuracy drop.

Beyond low-rank tensor decomposition approaches, which are performed in the original spatial domain, there are some network approximation approaches that process parameter approximation in a transformation domain. In [52], Wang et al. proposed CNNPack to compress deep networks in the frequency domain. CNNPack treated convolutional filters as images and then decomposed their representations in the frequency domain into common parts shared by other similar filters and their individual private parts (i.e., individual residuals). In this way, a large number of low-energy frequency coefficients in both parts can be discarded to produce high compression without significantly compromising accuracy. Moreover, the computational burden of convolution operations in CNNs was relaxed by linearly combining the convolution responses of discrete cosine transform (DCT) bases. Later, Wang et al. [53] extended the frequency-domain method to the compression of feature maps. They proposed to extract an intrinsic representation of the feature maps and preserve the discriminability of the features. The core is employing a circulant matrix to formulate the feature map transformation. In this way, both online memory and processing time were reduced. Another transformation-domain scheme is hashing, such as HashedNets [54] and FunHashNN [55].
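As a rough illustration of the frequency-domain idea (a simplified sketch, not the actual CNNPack algorithm), one can transform each filter with a 2D DCT, zero out the low-energy coefficients, and invert the transform. The function names, filter size, and keep ratio below are illustrative assumptions.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    c = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] *= 1 / np.sqrt(2)
    return c * np.sqrt(2.0 / n)

def compress_filter(f, keep_ratio=0.25):
    """Keep only the largest-magnitude DCT coefficients of a 2D filter."""
    n = f.shape[0]
    c = dct_matrix(n)
    coeffs = c @ f @ c.T                   # 2D DCT of the filter
    thresh = np.quantile(np.abs(coeffs), 1.0 - keep_ratio)
    coeffs[np.abs(coeffs) < thresh] = 0.0  # discard low-energy coefficients
    return c.T @ coeffs @ c                # inverse 2D DCT

filt = np.random.randn(7, 7)
rec = compress_filter(filt, keep_ratio=0.25)
print("relative error:", np.linalg.norm(filt - rec) / np.linalg.norm(filt))
```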

In general, the network approximation category focuses on accelerating network inference and reducing network size with a minimal performance drop. However, the memory footprint usually cannot be reduced.

### **6. Network distillation**

Different from the above approaches, which compress a pretrained deep network, the network distillation category aims to train a smaller network to simulate the behaviors of a more complex pre-trained deep network or an ensemble of multiple pre-trained deep networks. The training applies a teacher-student learning manner. Early work [56] proposed model compression, where the main idea was to use a fast and compact model to approximate the function learned by a slower, larger, but better-performing model. Later, Hinton et al. proposed knowledge distillation [57], which trained a smaller neural network (called the student network) by taking the output of a large, capable, but slow pre-trained one (called the teacher network). The main strength of this idea comes from using the large network to take care of the regularization process, facilitating subsequent training operations. However, this method requires a large pretrained network to begin with, which is not always feasible. In [58], the authors extended this method to allow the training of a student that is deeper and thinner than the teacher, using not only the outputs but also the intermediate representations learned by the teacher as hints to improve the training process and the final performance of the student. Inspired by these methods, Luo et al. [59] proposed to utilize the learned knowledge of a large teacher network or its ensemble as supervision to train a compact student network. The knowledge is represented by the neurons at the higher hidden layer, which preserve as much information as the label probabilities but are more compact. When using an ensemble of DeepID2+ as the teacher, a mimicked student is able to outperform it and achieves a 51.6× compression ratio and a 90× speedup in inference, making this model applicable on mobile devices. Lu et al. [60] investigated teacher-student training for small-footprint acoustic models. Shi et al. [61] proposed a task-specified knowledge distillation algorithm to derive a simplified model with a preset computation cost and minimized accuracy loss, which suits resource-constrained front-end systems well. This knowledge distillation method relied on transferring the learned discriminative information from a teacher model to a student model. The method first analyzed the redundancy of the neural network related to the a priori complexity of the given task and then trained a student model by redefining the loss function from a subset of the relaxed target knowledge according to the task information. Recently, Yim et al. [62] defined the distilled knowledge in terms of the flow between layers and computed it with the inner product between features from two layers. The knowledge distillation idea has also been used to compress networks for object detection tasks [63–65].
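To make the teacher-student idea concrete, the following is a minimal, hypothetical PyTorch-style sketch of the soft-target loss popularized by [57]: the student is trained on a weighted sum of the usual cross-entropy and a KL term between temperature-softened teacher and student outputs. The temperature and weighting values are illustrative choices, not prescriptions from the cited papers.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.7):
    """Soft-target distillation loss (a sketch in the spirit of [57]).

    alpha balances the soft teacher targets against the hard labels;
    the T^2 factor keeps gradient magnitudes comparable across temperatures.
    """
    soft_student = F.log_softmax(student_logits / temperature, dim=1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=1)
    soft_loss = F.kl_div(soft_student, soft_teacher,
                         reduction="batchmean") * temperature ** 2
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# toy usage with random logits for a batch of 8 samples and 10 classes
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```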

Generally speaking, the network distillation approaches achieve a very high compression ratio. Another advantage of these approaches is that they make the resulting deep networks more interpretable. One issue that still needs to be addressed is reducing the accuracy drop, or even improving the accuracy.

### **7. Network densifying**


Another direct category for obtaining network compression and acceleration is to design more efficient, low-cost network architectures. I call this category "network densifying"; it aims to design compact deep networks that provide highly accurate inference. In recent years, several approaches have been proposed along this line. The general ideas to achieve this goal include the use of small filter kernels, grouped convolution, and advanced regularization.

Lin et al. [66] proposed the Network-In-Network (NIN) architecture, whose main idea is to use 1 × 1 convolutions to increase network capacity while keeping the computational complexity small. NIN also replaced the fully connected layers with global average pooling to reduce the storage requirement. The idea of 1 × 1 convolution is widely used in many advanced networks such as GoogleNet [67], ResNet [68], and DenseNet [69]. In [70], Iandola et al. designed a small DNN architecture termed SqueezeNet that achieves AlexNet-level accuracy on ImageNet with 50× fewer parameters. In addition, with model compression techniques, SqueezeNet can be compressed to less than 1 MB (461× smaller than AlexNet). By using multiple group convolutions, ResNeXt [71] achieved much higher accuracy than ResNet at the same computational cost. MobileNet [72] applied depth-wise convolution to reduce the computation cost, achieving a 32× smaller model size and a 27× faster speed than the VGG-16 model with comparable accuracy on ImageNet. ShuffleNet [73] introduced the channel shuffle operation to increase the information exchange among the multiple groups; it achieved about a 13× speedup over AlexNet with comparable accuracy. DarkNet [74, 75] was proposed to facilitate object detection tasks and mostly applies small convolutional kernels.
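The building blocks mentioned above can be written down directly. Below is a minimal sketch (layer sizes and the exact block layout are illustrative assumptions, not the published MobileNet definition) of a depthwise separable block that combines a depthwise (grouped) 3 × 3 convolution with a 1 × 1 pointwise convolution, the two ingredients that cut the cost of a standard convolution.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 conv followed by a 1x1 pointwise conv (MobileNet-style sketch)."""

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # groups=in_channels makes the 3x3 convolution depthwise
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   stride=stride, padding=1,
                                   groups=in_channels, bias=False)
        # the 1x1 convolution mixes information across channels
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                   bias=False)
        self.bn1 = nn.BatchNorm2d(in_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        return self.relu(self.bn2(self.pointwise(x)))

block = DepthwiseSeparableConv(64, 128)
y = block(torch.randn(1, 64, 56, 56))   # -> torch.Size([1, 128, 56, 56])
```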


Moreover, some advanced regularization techniques are used to enhance the sparsity and robustness of deep networks. Dropout [76] and DropConnect [77] are widely exploited in many networks to increase the sparsity of activations for memory saving and of weights for model size reduction, respectively. Activation functions such as the rectified linear unit (ReLU) [1] and its extensions such as P-ReLU [78] are used to increase the sparsity of activations for memory saving while providing a speedup for model training; therefore, they can facilitate the design of more compact networks.
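As a small illustration of how these regularizers induce activation sparsity, the hedged sketch below (with arbitrary layer sizes and dropout rate) applies ReLU and dropout to a batch of activations and measures the fraction of zeros.

```python
import torch
import torch.nn as nn

layer = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Dropout(p=0.5))
layer.train()  # dropout is only active in training mode

x = torch.randn(32, 512)
h = layer(x)
sparsity = (h == 0).float().mean().item()
print(f"fraction of zero activations: {sparsity:.2f}")  # roughly 0.75 here
```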

### **8. Conclusions and future directions**

It is necessary to develop efficient methods for deep learning via network compression and acceleration to facilitate the real-world deployment of advanced deep networks. In this chapter, I give a survey of recent network compression and acceleration approaches in five categories. In the following, I further introduce a few future directions in this area, including a hybrid scheme for network compression, network acceleration for other visual tasks, hardware-software codesign for on-device applications, and more effective distillation methods.


• **Hybrid scheme for network compression.** Current network compression approaches mainly focus on one single scheme, such as network quantization or network approximation. This leads to insufficient compression or large accuracy loss. It is necessary to exploit a hybrid scheme that combines the advantages of each network compression category. Some attempts can be found in [18, 79], which have demonstrated good performance.

• **Network acceleration for other visual tasks.** Most current approaches aim to compress and accelerate deep networks for image classification tasks, such as ImageNet large-scale object classification, MNIST handwriting recognition, CIFAR object recognition, and so on. Very little effort has been devoted to other visual tasks, such as object detection, object tracking, semantic segmentation, and human pose estimation. Generally, directly applying network acceleration approaches designed for image classification to these visual tasks may cause a sharp drop in performance. The reason may be that these visual tasks require more complex feature representations or richer knowledge than image classification. The work in [80] has provided an attempt at facial landmark localization. Therefore, this challenging problem of network acceleration for other visual tasks is one of the future directions.

• **Hardware-software codesign for on-device applications.** To realize practical deployment on resource-limited devices, network compression and acceleration algorithms should take the hardware design into consideration besides software algorithm modeling. The requirements of recent on-device applications such as autopiloting, video surveillance, and on-device AI make it highly desirable to design hardware-efficient deep learning algorithms according to the specific hardware platforms. This co-design scheme will be one future direction.

• **More effective distillation methods.** Network distillation methods have proven efficient for model compression in widespread fields beyond image classification, for example, machine translation [81]. However, these methods usually suffer from an accuracy drop in inference, especially for complex inference tasks. Considering their efficacy, it is necessary to develop more effective distillation methods to extend their applications. Recent works [82–84] have made some attempts. Therefore, developing more effective distillation methods is one of the future directions in this field.


### **Acknowledgements**


This work was partially supported by grants from National Key Research and Development Plan (2016YFC0801005), National Natural Science Foundation of China (61772513), and the International Cooperation Project of Institute of Information Engineering at Chinese Academy of Sciences (Y7Z0511101). Shiming Ge is also supported by Youth Innovation Promotion Association, Chinese Academy of Sciences.

### **Author details**

Shiming Ge

Address all correspondence to: geshiming@iie.ac.cn

Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China

## **References**

[1] Krizhevsky A, Sutskever I, Hinton G. Imagenet classification with deep convolutional neural networks. In: Neural Information Processing Systems (NIPS '12); 3-6 December 2012; Lake Tahoe; 2012. pp. 1097-1105

[2] Taigman Y, Yang M, Ranzato M, et al. DeepFace: Closing the gap to human-level performance in face verification. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR '14); 24-27 June 2014; Columbus. New York: IEEE; 2014. pp. 1701-1708

[3] Amodei D, Ananthanarayanan S, Anubhai R, et al. Deep speech 2: End-to-end speech recognition in English and Mandarin. In: International Conference on Machine Learning (ICML '16); 19-24 June 2016; New York: IEEE; 2016. pp. 173-182

[4] Silver D, Huang A, Maddison CJ, et al. Mastering the game of Go with deep neural networks and tree search. Nature. 2016;**529**:484-489. DOI: 10.1038/nature16961

[5] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR '16); 27-30 June 2016; Las Vegas. Nevada State: IEEE; 2016. pp. 770-778

[6] Schroff F, Kalenichenko D, Philbin J. Facenet: A unified embedding for face recognition and clustering. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR '15); 8-10 June 2015; Boston. New York: IEEE; 2015. pp. 815-823

[7] Learned-Miller E, Huang GB, RoyChowdhury A, et al. Labeled faces in the wild: A survey. In: Kawulok M, Celebi M, Smolka B, editors. Advances in Face Detection and Facial Image Analysis. Cham: Springer; 2016. pp. 189-248. DOI: 10.1007/978-3-319-25958-1\_8

[8] Russakovsky O, Deng J, Su H, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV). 2015;**115**(3):211-252. DOI: 10.1007/s11263-015-0816-y

[9] Kumar N, Berg AC, Belhumeur PN, et al. Attribute and simile classifiers for face verification. In: IEEE International Conference on Computer Vision (ICCV '09); 29 September–2 October 2009; Kyoto. New York: IEEE; 2009. pp. 365-372

[10] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations (ICLR '15); 7-9 May 2015; San Diego; 2015

[11] Denil M, Shakibi B, Dinh L, et al. Predicting parameters in deep learning. In: Neural Information Processing Systems (NIPS '13); 5-10 December 2013; Lake Tahoe; 2013. pp. 2148-2156

[12] Sze V, Chen YH, Yang TJ, et al. Efficient processing of deep neural networks: A tutorial and survey. Proceedings of the IEEE. 2017;**105**(12):2295-2329. DOI: 10.1109/JPROC.2017.2761740

[13] Cheng J, Wang P, Li G, et al. Recent advances in efficient computation of deep convolutional neural networks. Frontiers of Information Technology & Electronic Engineering. 2018;**19**(1):64-77. DOI: 10.1631/FITEE.1700789

[14] Cheng Y, Wang D, Zhou P, et al. Model compression and acceleration for deep neural networks: The principles, progress, and challenges. IEEE Signal Processing Magazine. 2018;**35**(1):126-136. DOI: 10.1109/MSP.2017.2765695

[15] Hassibi B, Stork DG. Second order derivatives for network pruning: Optimal brain surgeon. In: Neural Information Processing Systems (NIPS '93); 1993; Denver. San Francisco: Morgan Kaufmann; 1994. pp. 164-171

[16] Chaber P, Lawrynczuk M. Pruning of recurrent neural models: An optimal brain damage approach. Nonlinear Dynamics. 2018;**92**(2):763-780. DOI: 10.1007/s11071-018-4089-1

[17] Han S, Pool J, Tran J, Dally W. Learning both weights and connections for efficient neural networks. In: Neural Information Processing Systems (NIPS '14); 11-12 December 2014; Montreal; 2014. pp. 1135-1143

[18] Han S, Mao H, Dally W. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In: International Conference on Learning Representations (ICLR '16); 2-4 May 2016; San Juan; 2016

[19] Guo Y, Yao A, Chen Y. Dynamic network surgery for efficient DNNs. In: Neural Information Processing Systems (NIPS '16); 5-10 December 2016; Barcelona; 2016. pp. 1379-1387

[20] Sun Y, Wang X, Tang X. Sparsifying neural network connections for face recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR '16); 27-30 June 2016; Las Vegas. Nevada State: IEEE; 2016. pp. 4856-4864

[21] Srinivas S, Babu RV. Data-free parameter pruning for deep neural networks. In: Proceedings of the British Machine Vision Conference (BMVC '15); 7-10 September 2015; Swansea: BMVA Press; 2015. pp. 31.1-31.12

[22] Anwar S, Hwang K, Sung W. Structured pruning of deep convolutional neural networks. ACM Journal on Emerging Technologies in Computing Systems (JETC). 2017;**13**(3):32. DOI: 10.1145/3005348

[23] Mao H, Han S, Pool J, et al. Exploring the Regularity of Sparse Structure in Convolutional Neural Networks [Internet]. Available from: https://arxiv.org/pdf/1707.06342 [Accessed: May 12, 2018]

[24] Lebedev V, Lempitsky V. Fast convnets using group-wise brain damage. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR '16); 27-30 June 2016; Las Vegas. Nevada State: IEEE; 2016. pp. 2554-2564

[25] Wen W, Wu C, Wang Y, et al. Learning structured sparsity in deep neural networks. In: Neural Information Processing Systems (NIPS '16); 5-10 December 2016; Barcelona; 2016. pp. 2074-2082

[26] Polyak A, Wolf L. Channel-level acceleration of deep face representations. IEEE Access. 2015;**3**:2163-2175. DOI: 10.1109/access.2015.2494536

[27] Luo JH, Wu J, Lin W. ThiNet: A filter level pruning method for deep neural network compression. In: IEEE International Conference on Computer Vision (ICCV '17); 22-29 October 2017; Venice, Italy; 2017. pp. 5058-5066


[28] He Y, Zhang X, Sun J. Channel pruning for accelerating very deep neural networks. In: IEEE International Conference on Computer Vision (ICCV '17); 22-29 October 2017; Venice. New York: IEEE; 2017. pp. 1389-1397

[29] Gong Y, Liu L, Yang M, et al. Compressing deep convolutional networks using vector quantization. In: International Conference on Learning Representations (ICLR '15); 7-9 May 2015; San Diego; 2015

[30] Wu J, Leng C, Wang Y, et al. Quantized convolutional neural networks for mobile devices. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR '16); 27-30 June 2016; Las Vegas. Nevada State: IEEE; 2016. pp. 4820-4828

[31] Soulié G, Gripon V, Robert M. Compression of deep neural networks on the fly. In: International Conference on Artificial Neural Networks; 6-9 September 2016; Barcelona. Cham: Springer; 2016. pp. 153-160

[32] Dettmers T. 8-bit approximations for parallelism in deep learning. In: International Conference on Learning Representations (ICLR '16); 2-4 May 2016; Caribe Hilton; 2016

[33] Gupta S, Agrawal A, Gopalakrishnan K, et al. Deep learning with limited numerical precision. In: International Conference on Machine Learning (ICML '15); 6-11 July 2015; Lille. JMLR; 2015. pp. 1737-1746

[34] Lin DD, Talathi SS, Annapureddy VS. Fixed point quantization of deep convolutional networks. In: Proceedings of The International Conference on Machine Learning (ICML '16); 19-24 June 2016; New York. JMLR; 2016. pp. 2849-2858

[35] Soudry D, Hubara I, Meir R. Expectation backpropagation: Parameter-free training of multilayer neural networks with continuous or discrete weights. In: Neural Information Processing Systems (NIPS '14); 8-13 December 2014; Montreal; 2014. pp. 963-971

[36] Courbariaux M, Bengio Y, David JP. Binaryconnect: Training deep neural networks with binary weights during propagations. In: Neural Information Processing Systems (NIPS '14); 11-12 December 2014; Montreal; 2014. pp. 3123-3131

[37] Esser SK, Appuswamy R, Merolla P, et al. Backpropagation for energy-efficient neuromorphic computing. In: Neural Information Processing Systems (NIPS '14); 11-12 December 2014; Montreal; 2014. pp. 1117-1125

[38] Hubara I, Courbariaux M, Soudry D, et al. Binarized neural networks. In: Neural Information Processing Systems (NIPS '16); 5-10 December 2016; Barcelona; 2016. pp. 4107-4115

[39] Rastegari M, Ordonez V, Redmon J, et al. XNOR-Net: Imagenet classification using binary convolutional neural networks. In: European Conference on Computer Vision (ECCV '16); 8-16 October 2016; Amsterdam. Cham: Springer; 2016. pp. 525-542

[40] Hou L, Yao Q, Kwok JT. Loss-aware binarization of deep networks. In: International Conference on Learning Representations (ICLR '17); 24-26 April 2017; Palais des Congrès Neptune, Toulon, France; 2017

[41] Juefei-Xu F, Boddeti VN, Savvides M. Local binary convolutional neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR '17); 21-26 July 2017; Honolulu, Hawaii. New York: IEEE; 2017. pp. 19-28

[42] Guo Y, Yao A, Zhao H, Chen Y. Network sketching: Exploiting binary structure in deep CNNs. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR '17); 21-26 July 2017; Honolulu, Hawaii. New York: IEEE; 2017. pp. 5955-5963

[43] Zhu C, Han S, Mao H, Dally WJ. Trained ternary quantization. In: International Conference on Learning Representations (ICLR '17); 24-26 April 2017; Palais des Congrès Neptune, Toulon, France; 2017

[44] Jaderberg M, Vedaldi A, Zisserman A. Speeding up convolutional neural networks with low rank expansions. In: Proceedings of the British Machine Vision Conference (BMVC '15); 7-10 September 2015; Swansea: BMVA Press; 2015

[45] Denton EL, Zaremba W, Bruna J, et al. Exploiting linear structure within convolutional networks for efficient evaluation. In: Neural Information Processing Systems (NIPS '14); 11-12 December 2014; Montreal; 2014. pp. 1269-1277

[46] Liu B, Wang M, Foroosh H, et al. Sparse convolutional neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR '15); 8-10 June 2015; Boston. New York: IEEE; 2015. pp. 806-814

[47] Figurnov M, Ibraimova A, Vetrov DP, et al. Perforated CNNs: Acceleration through elimination of redundant convolutions. In: Neural Information Processing Systems (NIPS '16); 5-10 December 2016; Barcelona; 2016. pp. 947-955

[48] Zhang X, Zou J, He K, et al. Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2016;**38**(10):1943-1955. DOI: 10.1109/tpami.2015.2502579

[49] Kim YD, Park E, Yoo S, et al. Compression of Deep Convolutional Neural Networks for Fast and Low Power Mobile Applications [Internet]. Available from: https://arxiv.org/pdf/1511.06530 [Accessed: May 12, 2018]

[50] Wang P, Cheng J. Accelerating convolutional neural networks for mobile applications. In: Proceedings of the 2016 ACM on Multimedia Conference; 15-19 October 2016; Amsterdam, The Netherlands; 2016. pp. 541-545

[51] Lebedev V, Ganin Y, Rakhuba M, et al. Speeding-up convolutional neural networks using fine-tuned cp-decomposition. In: International Conference on Learning Representations (ICLR '15); 7-10 May 2015; San Diego; 2015

[52] Wang Y, Xu C, You S, et al. Cnnpack: Packing convolutional neural networks in the frequency domain. In: Neural Information Processing Systems (NIPS '16); 5-10 December 2016; Barcelona; 2016. pp. 253-261

[53] Wang Y, Xu C, Xu C, Tao D. Beyond filters: Compact feature map for portable deep model. In: Proceedings of The International Conference on Machine Learning (ICML '17); 8-11 August 2017; Sydney. JMLR; 2017. pp. 3703-3711


[54] Chen W, Wilson J, Tyree S, et al. Compressing neural networks with the hashing trick. In: Proceedings of the International Conference on Machine Learning (ICML '15); 6-11 July 2015; Lille. JMLR; 2015. pp. 2285-2294

[55] Shi L, Feng S. Functional Hashing for Compressing Neural Networks [Internet]. Available from: https://arxiv.org/pdf/1605.06560 [Accessed: May 12, 2018]

[56] Bucilu C, Caruana R, Niculescu-Mizil A. Model compression. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 20-23 August 2006; Philadelphia. ACM; 2006. pp. 535-541

[57] Hinton G, Vinyals O, Dean J. Distilling the Knowledge in a Neural Network [Internet]. Available from: https://arxiv.org/pdf/1503.02531 [Accessed: May 12, 2018]

[58] Romero A, Ballas N, Kahou SE, et al. FitNets: Hints for thin deep nets. In: International Conference on Learning Representations (ICLR '15); 7-10 May 2015; San Diego; 2015

[59] Luo P, Zhu Z, Liu Z, et al. Face model compression by distilling knowledge from neurons. In: Thirtieth AAAI Conference on Artificial Intelligence (AAAI '16); 12-17 February 2016; Phoenix, Arizona, USA; 2016. pp. 3560-3566

[60] Lu L, Guo M, Renals S. Knowledge distillation for small-footprint highway networks. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '17); 5-9 March 2017; New Orleans. New York: IEEE; 2017. pp. 4820-4824

[61] Shi M, Qin F, Ye Q, et al. A scalable convolutional neural network for task-specified scenarios via knowledge distillation. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 5-9 March 2017; New Orleans. New York: IEEE; 2017. pp. 2467-2471

[62] Yim J, Joo D, Bae J, et al. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR '17); 21-26 July 2017; Honolulu, Hawaii. New York: IEEE; 2017. pp. 7130-7138

[63] Shen J, Vesdapunt N, Boddeti VN, et al. In Teacher We Trust: Learning Compressed Models for Pedestrian Detection [Internet]. Available from: https://arxiv.org/pdf/1602.00478 [Accessed: May 12, 2018]

[64] Li Q, Jin S, Yan J. Mimicking very efficient network for object detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR '17); 21-26 July 2017; Honolulu, Hawaii. New York: IEEE; 2017. pp. 7341-7349

[65] Chen G, Choi W, Yu X, et al. Learning efficient object detection models with knowledge distillation. In: Neural Information Processing Systems (NIPS '17); 4-9 December 2017; Long Beach, CA; 2017. pp. 742-751

[66] Lin M, Chen Q, Yan S. Network in Network [Internet]. Available from: https://arxiv.org/pdf/1312.4400 [Accessed: May 12, 2018]

[67] Szegedy C, Liu W, Jia Y, et al. Going deeper with convolutions. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR '17); 21-26 July 2017; Honolulu, Hawaii. New York: IEEE; 2017. pp. 1-9

[68] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR '17); 21-26 July 2017; Honolulu, Hawaii. New York: IEEE; 2017. pp. 770-778

[69] Huang G, Liu Z, Weinberger KQ, et al. Densely connected convolutional networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR '17); 21-26 July 2017; Honolulu, Hawaii. New York: IEEE; 2017. pp. 4700-4708

[70] Iandola FN, Han S, Moskewicz MW, et al. SqueezeNet: AlexNet-Level Accuracy with 50× Fewer Parameters and < 0.5 MB Model Size [Internet]. Available from: https://arxiv.org/pdf/1602.07360 [Accessed: May 12, 2018]

[71] Xie S, Girshick R, Dollár P, et al. Aggregated residual transformations for deep neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR '17); 21-26 July 2017; Honolulu, Hawaii. New York: IEEE; 2017. pp. 5987-5995

[72] Howard AG, Zhu M, Chen B, et al. Mobilenets: Efficient Convolutional Neural Networks for Mobile Vision Applications [Internet]. Available from: https://arxiv.org/pdf/1704.04861 [Accessed: May 12, 2018]

[73] Zhang X, Zhou X, Lin M, et al. Shufflenet: An Extremely Efficient Convolutional Neural Network for Mobile Devices [Internet]. Available from: https://arxiv.org/pdf/1707.01083 [Accessed: May 12, 2018]

[74] Redmon J, Divvala S, Girshick R, Farhadi A. You only look once: Unified, real-time object detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR '16); 27-30 June 2016; Las Vegas. Nevada State: IEEE; 2016. pp. 779-788

[75] Redmon J, Farhadi A. YOLO9000: Better, faster, stronger. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR '17); 21-26 July 2017; Honolulu, Hawaii. New York: IEEE; 2017. pp. 7263-7271

[76] Srivastava N, Hinton GE, Krizhevsky A, et al. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research. 2014;**15**(1):1929-1958

[77] Wan L, Zeiler MD, Zhang S, et al. Regularization of neural networks using DropConnect. In: Proceedings of the 30th International Conference on Machine Learning (ICML '13); 16-21 June 2013; Atlanta, GA; 2013. pp. 1058-1066

[78] He K, Zhang X, Ren S, Sun J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: IEEE International Conference on Computer Vision (ICCV '15); 7-13 December 2015; Santiago, Chile. IEEE; 2015. pp. 1-9

[79] Ge S, Luo Z, Zhao S, et al. Compressing deep neural networks for efficient visual inference. In: IEEE International Conference on Multimedia and Expo (ICME '17); 10-14 July 2017; Hong Kong. IEEE; 2017. pp. 667-672


[80] Zeng D, Zhao F, Shen W, Ge S. Compressing and accelerating neural network for facial point localization. Cognitive Computation. 2018;**10**(2):359-367. DOI: 10.1007/s12559-017-9506-0

[81] Kim Y, Rush AM. Sequence-level knowledge distillation. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '16); 1-4 November 2016; Austin, Texas; 2016. pp. 1317-1327

[82] Lopez-Paz D, Bottou L, Schölkopf B, Vapnik V. Unifying distillation and privileged information. In: International Conference on Learning Representations (ICLR '16); 2-4 May 2016; San Juan; 2016. pp. 1-10

[83] Hu Z, Ma X, Liu Z, et al. Harnessing deep neural networks with logic rules. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL '16); 7-12 August 2016; Berlin, Germany; 2016. pp. 1-11

[84] Luo Z, Jiang L, Hsieh JT, et al. Graph Distillation for Action Detection with Privileged Information [Internet]. Available from: https://arxiv.org/abs/1712.00108 [Accessed: December 30, 2017]



**Chapter 7**

#### **Neural Network Principles and Applications**

Amer Zayegh and Nizar Al Bassam

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/intechopen.80416

#### Abstract


Due to the recent trend of intelligent systems and their ability to adapt with varying conditions, deep learning becomes very attractive for many researchers. In general, neural network is used to implement different stages of processing systems based on learning algorithms by controlling their weights and biases. This chapter introduces the neural network concepts, with a description of major elements consisting of the network. It also describes different types of learning algorithms and activation functions with the examples. These concepts are detailed in standard applications. The chapter will be useful for undergraduate students and even for postgraduate students who have simple background on neural networks.

Keywords: neural network, neuron, digital signal processing, training, supervised learning, unsupervised learning, classification, time series

#### 1. Introduction

The artificial neural network is a computing technique designed to simulate the human brain's method in problem-solving. In 1943, McCulloch, a neurobiologist, and Pitts, a statistician, published a seminal paper titled "A logical calculus of ideas immanent in nervous activity" in Bulletin of Mathematical Biophysics [1], where they explained the way how brain works and how simple processing units—neurons—work together in parallel to make a decision based on the input signals.

The similarity between artificial neural networks and the human brain is that both acquire the skills in processing data and finding solutions through training [1].

© 2018 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### 2. Neural network's architecture

To illustrate the structure of the artificial neural network, an anatomical and functional look must be taken on the human brain first.

The human brain consists of about 10<sup>11</sup> computing units "neurons" working in parallel and exchanging information through their connectors "synapses"; these neurons sum up all information coming into them, and if the result is higher than the given potential called action potential, they send a pulse via axon to the next stage. Human neuron anatomy is shown in Figure 1 [2].

In the same way, artificial neural network consists of simple computing units "artificial neurons," and each unit is connected to the other units via weight connectors; then, these units calculate the weighted sum of the coming inputs and find out the output using squashing function or activation function. Figure 2 shows the block diagram of artificial neuron.

Figure 1. Human neuron anatomy.

Figure 2. Block diagram of artificial neuron.

Based on the block diagram and function of the neural network, three basic elements of the neural model can be identified:

1. Synapses, or connecting links, each of which has a weight or strength; the input signal xi connected to neuron k is multiplied by the synaptic weight wki.

2. An adder for summing the weighted inputs.

3. An activation function to produce the output of a neuron. It is also referred to as a squashing function, in that it squashes (limits) the amplitude range of the output signal to a finite value.


The bias bk has the effect of increasing or decreasing the net input of the activation function, depending on whether it is positive or negative, respectively.

Mathematically, the output on the neuron k can be described as

$$y\_k = \varphi\left(\sum\_{i=1}^m x\_i.w\_{ki} + b\_k\right) \tag{1}$$

where


x1, x2, x3, …, xm are the input signals.

wk1, wk2, wk3, …, wkm are the respective weights of the neuron.

bk is the bias.

φ is the activation function.

To clarify the effect of the bias on the performance of the neuron, the output given in Eq. (1) is processed in two stages, where the first stage includes the weighted inputs and the sum, which is denoted as Sk:

$$S\_k = \sum\_{i=1}^{m} \mathbf{x}\_i.w\_{ki} \tag{2}$$

Then, the output of the adder will be given in Eq. (3):

$$\upsilon\_k = S\_k + b\_k \tag{3}$$

where the output of the neuron will be

$$y\_k = \varphi(\upsilon\_k) \tag{4}$$

Depending on the value of the bias, the relationship between the weighted input and adder output will be modified [3] as shown in Figure 3.

Figure 3. Effect of bias.

Bias could be considered as an input signal x0 fixed at +1 with synaptic weight equal to the bias bk as shown in Figure 4 [3].

Figure 4. Neuron structure with considering bias as input [1].
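A minimal numerical sketch of Eqs. (1)–(4) is shown below; the input values, weights, and bias are arbitrary, and a logistic sigmoid (see Section 3.3) is used as the activation function φ.

```python
import numpy as np

def neuron_output(x, w, b, phi):
    """Compute y_k = phi(sum_i x_i * w_ki + b_k) as in Eqs. (1)-(4)."""
    s_k = np.dot(x, w)   # weighted sum, Eq. (2)
    v_k = s_k + b        # adder output plus bias, Eq. (3)
    return phi(v_k)      # activation, Eq. (4)

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

x = np.array([0.5, -1.0, 2.0])   # inputs x_1..x_m
w = np.array([0.4, 0.3, -0.2])   # weights w_k1..w_km
b = 0.1                          # bias b_k
print(neuron_output(x, w, b, sigmoid))   # a single scalar output y_k
```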

#### 3. Types of activation function

The activation function defines the output of a neuron as a function of the adder's output vk. The following sections describe the different activation functions:

#### 3.1. Linear function

The neuron output is proportional to the input, as shown in Figure 5. It can be described by

$$y\_k = v\_k \tag{5}$$

#### 3.2. Threshold (step) function

This activation function is described in Figure 6 where the output of neuron is given by

$$y\_k = \begin{cases} \quad 1 \text{ if } v\_k \ge 0 \\ 0 \text{ if } v\_k < 0 \end{cases} \tag{6}$$

Figure 5. Linear activation function.

Figure 6. Threshold activation function.

In neural computation, such a neuron is referred to as the McCulloch-Pitts model in recognition of the pioneering work done by McCulloch and Pitts (1943); the output of the neuron takes on the value of 1 if the induced local field of that neuron is nonnegative and 0 otherwise. This statement describes the all-or-none property of the McCulloch-Pitts model [4].

#### 3.3. Sigmoid function

The most common type of activation functions in neural network is described by

$$y\_k = \frac{1}{1 + e^{-v\_k}} \tag{7}$$


Figure 7 shows the sigmoid activation function. It is clearly observed that this function has a nonlinear nature and can produce an analogue output, unlike threshold functions, which produce output in the discrete range [0, 1].

Also, we can note that the sigmoid activation function is limited between 0 and 1, which gives it an advantage over the linear activation function, which produces output from −∞ to +∞ [5].

#### 3.4. Tanh activation function

This activation function has the advantages of the sigmoid function, while it is characterized by an output range between −1 and 1, as shown in Figure 8.

The output is described by

$$y\_k = \frac{2}{1 + e^{-2v\_k}} - 1\tag{8}$$
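The four activation functions of Eqs. (5)–(8) can be written compactly as follows; this is a small illustrative sketch using NumPy.

```python
import numpy as np

def linear(v):              # Eq. (5)
    return v

def threshold(v):           # Eq. (6)
    return np.where(v >= 0, 1.0, 0.0)

def sigmoid(v):             # Eq. (7)
    return 1.0 / (1.0 + np.exp(-v))

def tanh_act(v):            # Eq. (8), identical to np.tanh(v)
    return 2.0 / (1.0 + np.exp(-2.0 * v)) - 1.0

v = np.linspace(-3, 3, 7)
for f in (linear, threshold, sigmoid, tanh_act):
    print(f.__name__, np.round(f(v), 3))
```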

Figure 7. Sigmoid activation function.

Figure 8. Tanh activation function.


#### 4. Neural network models

The manner in which the neurons of a neural network are structured is intimately linked with the learning algorithm used to train the network [1]. Three main models can be identified for the neural network.

#### 4.1. Single-layer feedforward neural network

In a layered neural network, the neurons are organized in the form of layers [1]. The simplest structure is the single-layer feedforward network that consists of input nodes connected directly to the single layer of neurons. The node outputs are based on the activation function, as shown in Figure 9.

Figure 9. Single-layer neural network.

Mathematically, the inputs will be presented as a vector with dimensions of 1 × i, while the weights will be presented as a matrix with dimensions of i × k, and the outputs will be presented as a vector with dimensions of 1 × k, as given in Eq. (9):

$$\begin{bmatrix} y\_1 & y\_2 & \cdots & y\_k \end{bmatrix} = \begin{bmatrix} x\_1 & x\_2 & \cdots & x\_i \end{bmatrix} \begin{bmatrix} w\_{11} & w\_{21} & \cdots & w\_{k1} \\ \vdots & & \ddots & \vdots \\ w\_{1k} & w\_{2k} & \cdots & w\_{ik} \end{bmatrix} \tag{9}$$
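Eq. (9) is simply a vector-matrix product; the following sketch (with arbitrary dimensions and random values) computes the outputs of a single-layer feedforward network with a sigmoid activation.

```python
import numpy as np

rng = np.random.default_rng(0)

i, k = 4, 3                      # number of inputs and of output neurons
x = rng.normal(size=(1, i))      # input row vector, shape 1 x i
W = rng.normal(size=(i, k))      # weight matrix, shape i x k
b = rng.normal(size=(1, k))      # one bias per output neuron

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
y = sigmoid(x @ W + b)           # output row vector, shape 1 x k, as in Eq. (9)
print(y)
```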


#### 4.2. Multilayer feedforward neural network

The second class of a feedforward neural network distinguishes itself by the presence of one or more hidden layers, whose computation nodes are correspondingly called hidden neurons as shown in Figure 10.

Figure 10. Multilayer feedforward neural network.

By adding one or more hidden layers, the network is enabled to extract higher-order statistics from its input [1].

#### 5. Neural network training

The process of calibrating the values of weights and biases of the network is called training of neural network to perform the desired function correctly [2].

Learning methods or algorithms can be classified into:

#### 5.1. Supervised learning


In supervised learning, the data will be presented in a form of couples (input, desired output), and then the learning algorithm will adapt the weights and biases depending on the error signal between the real output of network and the desired output as shown in Figure 11.


Figure 11. Supervised learning.


As a performance measure for the system, we may think in terms of the mean squared error or the sum of squared errors over the training sample defined as a function of the free parameters (i.e., synaptic weights) of the system [1].

#### 5.2. Unsupervised learning

To perform unsupervised learning, a competitive learning rule is used. For example, we may use a neural network that consists of two layers—an input layer and a competitive layer. The input layer receives the available data. The competitive layer consists of neurons that compete with each other (in accordance with a learning rule) for the "opportunity" to respond to features contained in the input data (Figure 12) [1].

Figure 12. Unsupervised learning.
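A minimal sketch of the competitive (winner-take-all) rule described above: only the neuron whose weight vector is closest to the current input is allowed to adapt. The learning rate, the number of competitive neurons, and the toy 2-D data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 2))      # unlabeled input data (illustrative)
weights = rng.normal(size=(3, 2))     # one weight vector per competitive neuron
eta = 0.1                             # learning rate (assumed)

for epoch in range(20):
    for x in data:
        # Neurons compete: the one closest to the input wins
        winner = np.argmin(np.linalg.norm(weights - x, axis=1))
        # Only the winning neuron moves toward the input
        weights[winner] += eta * (x - weights[winner])

print(weights)   # each row has drifted toward a region of the input data
```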

### 6. Neural networks' applications in digital signal processing

Digital signal processing could be defined using field of interest statement of the IEEE Signal Processing Society as follows:

Signal processing is the enabling technology for the generation, transformation, extraction, and interpretation of information. It comprises the theory, algorithms with associated architectures and implementations, and applications related to processing information contained in many different formats broadly designated as signals. Signal processing uses mathematical, statistical, computational, heuristic, and/or linguistic representations, formalisms, modelling techniques and algorithms for generating, transforming, transmitting, and learning from signals. [6].

Based on this definition, many neural network structures could be developed to achieve the different processes mentioned in the definition.

### 6.1. Classification


One of the most important applications of an artificial neural network is classification, which can be used in different digital signal processing applications such as speech recognition, signal separation, and handwriting recognition and detection [7].

The objects of interest can be classified according to their features, and classification process could be considered as probability process, since the classification of any object under a given class depends on the likelihood that the object belongs to the class more than the probability of belonging to the other classes [8].

Assume that X is the vector of features for the objects of interest which could be classified into classes c∈ψ where ψ is the pool of classes. Then, classification will be applied as follows:

$$X \text{ belongs to the class } c_i \text{ if } P(c_i|X) > P(c_j|X) \text{ when } i \neq j \tag{10}$$

To decrease the difficulty of solving the probability equations in Eq. (10), a discriminant function is used, and then Eq. (10) becomes

$$Q_i(X) > Q_j(X) \quad \text{if } P(c_i|X) > P(c_j|X) \text{ when } i \neq j \tag{11}$$

The classification process will then be described using Eq. (12):

$$X \text{ belongs to the class } c\_i \text{ if } \ Q\_i(X) > Q\_j(X) \tag{12}$$
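Eqs. (10)–(12) amount to assigning X to the class whose discriminant is largest. A small sketch of that decision rule, assuming (purely for illustration, this is not from the chapter) Gaussian class models whose log-likelihoods serve as the discriminant functions Q_i:

```python
import numpy as np

def gaussian_discriminants(x, means, var=1.0):
    # Q_i(X): log-likelihood of X under an isotropic Gaussian for each class
    return np.array([-np.sum((x - m) ** 2) / (2 * var) for m in means])

class_means = np.array([[0.0, 0.0], [3.0, 3.0], [0.0, 4.0]])  # illustrative classes
x = np.array([2.5, 2.8])                                      # feature vector X

Q = gaussian_discriminants(x, class_means)
print("X belongs to class", np.argmax(Q))   # Eq. (12): largest discriminant wins
```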

One of the examples of classification is QPSK modulator output detection, where detection is considered as a special case of classification.

Assume that the received signal is X:

$$X = s + n \tag{13}$$

where n is the normally distributed noise signal and s is the transmitted signal.

The output of the QPSK modulator is shown in Figure 13, where the samples are arranged in four classes.

Figure 13. QPSK modulator output.

By adding white Gaussian noise, the received signal will be as shown in Figure 14.

Figure 14. QPSK output with noise.

The neural network shown in Figure 15 is used to detect and demodulate the received signal, where the network consists of one hidden layer with five neurons and an output layer with two neurons.

Figure 15. Structure of neural network.
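To make the QPSK example concrete, the sketch below generates the four-class modulator output of Eq. (13) with additive white Gaussian noise. A nearest-constellation-point decision is used here only as a simple stand-in for the small neural network detector described above; the SNR and symbol count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
constellation = np.array([1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j]) / np.sqrt(2)  # four QPSK classes

symbols = rng.integers(0, 4, size=1000)          # transmitted class labels
s = constellation[symbols]                       # transmitted signal s
snr_db = 10                                      # assumed signal-to-noise ratio
noise_std = np.sqrt(10 ** (-snr_db / 10) / 2)
n = noise_std * (rng.normal(size=s.shape) + 1j * rng.normal(size=s.shape))
x = s + n                                        # received signal, Eq. (13)

# Detection: pick the closest constellation point (stand-in for the NN detector)
detected = np.argmin(np.abs(x[:, None] - constellation[None, :]), axis=1)
print("symbol error rate:", np.mean(detected != symbols))
```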

Figures 16 and 17 show the performance of the neural network evaluated using the mean squared error (MSE) criterion.

Figure 16. MSE of training, validation, and test vs. no. of epochs.

Figure 17. Training parameters and results.

#### 6.2. Time series prediction

A series is a sequence of values as a function of parameter; in the case of time series, the values will be as a function of the time. So, many applications use time series to express their data, for example, metrology, where the temperature is described as time series [7].

The interesting problem in time series is the future prediction of the series values; neural networks can be used to predict the future results in series in three ways [9]:

• Predict the future values based on the past values of the same series; this way can be described by

$$\hat{y}(t) = E\{y(t)|y(t-1), y(t-2), \dots\} \tag{14}$$


• Predict the future values based on the values of relevant time series, where


$$\hat{y}(t) = E\{y(t)|\mathbf{x}(t), \mathbf{x}(t-1), \mathbf{x}(t-2), \dots\} \tag{15}$$

• Predict the future values based on both previous cases, where

$$\hat{y}(t) = E\{y(t)|\mathbf{x}(t), \mathbf{x}(t-1), \mathbf{x}(T-2), \dots, y(t-1), y(t-2), \dots\} \tag{16}$$

Figure 18 shows the series predicted by the neural network using the first approach: samples of the original series are given over a determined period, and then the neural network predicts the future values of the series based on its behavior.

Figure 18. Time series prediction.
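A brief sketch of the first prediction mode, Eq. (14): past values of the series are arranged as lagged input features, which could then be fed to any regression network. The sine series, the lag order, and the plain least-squares fit used here in place of a neural network are illustrative assumptions.

```python
import numpy as np

t = np.arange(0, 20, 0.1)
y = np.sin(t) + 0.05 * np.random.default_rng(2).normal(size=t.size)  # toy series

lags = 5   # predict y(t) from y(t-1), ..., y(t-lags), as in Eq. (14)
X = np.column_stack([y[i:len(y) - lags + i] for i in range(lags)])   # lagged inputs
target = y[lags:]

# Linear least squares as a stand-in for the neural predictor
w, *_ = np.linalg.lstsq(X, target, rcond=None)
y_hat = X @ w
print("one-step prediction MSE:", np.mean((y_hat - target) ** 2))
```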


#### 6.3. Independent component analysis

The goal of the independent component analysis (ICA) is to separate the linearly mixed signals. ICA is a type of blind source separation when the separation is performed without the pre-information about the source of signals or the signal-mixing coefficients. Although the problem of separating the blind source, in general, is not specified, the solution of use can be obtained under some assumptions [10].

The ICA model assumes that n independent signals $s_i(t)$, where $i = 1, 2, 3, \dots, n$, are mixed using the matrix:

$$A = \begin{bmatrix} a\_{11} & a\_{12} & \dots & a\_{1n} \\ a\_{21} & a\_{22} & \dots & a\_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a\_{n1} & a\_{n2} & \dots & a\_{nn} \end{bmatrix} \tag{17}$$

Then, the mixed signal $x_i(t)$ can be expressed as

$$x_i(t) = \sum_{j=1}^{n} a_{ij}\, s_j(t) \tag{18}$$


As the separation process is blind, that is, both $a_{ij}$ and $s_j(t)$ are unknown, ICA assumes that the mixed signals are statistically independent and have a non-Gaussian distribution [11].

The neural network shown in Figure 19 is used to estimate the unmixing matrix W.



The separated signals $y_i(t)$ are given as

$$\mathbf{y} = \mathbf{W}\mathbf{x} \tag{19}$$

$$W = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n1} & w_{n2} & \cdots & w_{nn} \end{bmatrix} \tag{20}$$

Different methods can be applied to find W; for example, the natural gradient approach defines the update of W as

$$\frac{dW}{dt} = \eta(t)\left[I - f(\mathbf{y}(t))\,\mathbf{g}^T(\mathbf{y}(t))\right]W \tag{21}$$

where $\eta(t)$ is the training factor and both f and g are odd functions.

Figure 19. ICA neural network [12].
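A minimal sketch of the natural-gradient update of Eq. (21) for two mixed sources. The choice f(y) = y**3 and g(y) = y as the odd functions, the constant learning rate, and the toy mixing matrix are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 5000
s = np.vstack([np.sign(np.sin(np.arange(T) * 0.05)),   # two independent sources
               rng.uniform(-1, 1, T)])
A = np.array([[0.8, 0.3], [0.4, 0.9]])                  # unknown mixing matrix (assumed)
x = A @ s                                               # mixed observations, Eq. (18)

W = np.eye(2)                                           # unmixing matrix estimate
eta = 1e-3                                              # training factor eta(t), kept constant here

for t in range(T):
    y = W @ x[:, t]                                     # separated signals, Eq. (19)
    # Natural-gradient rule, Eq. (21), with f(y) = y**3 and g(y) = y (odd functions)
    W += eta * (np.eye(2) - np.outer(y ** 3, y)) @ W

print(np.round(W @ A, 2))   # roughly a scaled permutation matrix if separation worked
```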

### Author details

Amer Zayegh and Nizar Al Bassam*

Middle East College, Muscat, Oman

*Address all correspondence to: nizar@mec.edu.om

### References

[1] Haykin S. Neural Networks and Learning Machines. 3rd ed. Hamilton, Ontario, Canada: Pearson Education, Inc.; 2009. 938 p

[2] Smith S. The Scientist and Engineer's Guide to Digital Signal Processing. 2011

[3] Hu Y, Hwang J. Handbook of Neural Network Signal Processing. Boca Raton: CRC Press; 2002

[4] Milad MAMRAN. Neural Network Demodulator For. International Journal of Advanced Studies. 2016;5(7):10-14

[5] Michalík J. Applied Neural Networks for Digital Signal Processing with DSC TMS320 F28335. Technical University of Ostrava; 2009

[6] Constitution. IEEE Signal Processing Society [Online]. 2018. Available from: https://signalprocessingsociety.org/volunteers/constitution [Accessed: March 03, 2018]

[7] Kriesel D. A Brief Introduction to Neural Networks. Bonn: University of Bonn in Germany; 2005

[8] Gurney K. An Introduction to Neural Networks. London: CRC Press; 1997

[9] Maier H, Dandy G. Neural networks for the prediction and forecasting of water resources variables: A review of modelling issues and applications. Environmental Modelling & Software. 2000;15(1):101-124

[10] Hansen LK, Larsen J, Kolenda T. On Independent Component Analysis for Multimedia Signals. In: Multimedia Image and Video Processing. CRC Press; 2000. pp. 175-199

[11] Mørup M, Schmidt MN. Transformation invariant sparse coding. In: Machine Learning for Signal Processing, IEEE International Workshop on (MLSP). Informatics and Mathematical Modelling, Technical University of Denmark, DTU; 2011

[12] Pedersen MS, Wang D, Larsen J, Kjems U. Separating Underdetermined Convolutive Speech Mixtures. ICA2006. 2006


**Chapter 8**

**Applications of General Regression Neural Networks in Dynamic Systems**

Ahmad Jobran Al-Mahasneh, Sreenatha Anavatti, Matthew Garratt and Mahardhika Pratama

Additional information is available at the end of the chapter

DOI: 10.5772/intechopen.80258

http://dx.doi.org/10.5772/intechopen.80258

#### Abstract

Nowadays, computational intelligence (CI) receives much attention in academic and industry due to a plethora of possible applications. CI includes fuzzy logic (FL), evolutionary algorithms (EA), expert systems (ES) and artificial neural networks (ANN). Many CI components have applications in modeling and control of dynamic systems. FL mimics the human reasoning by converting linguistic variables into a set of rules. EA are metaheuristic population-based algorithms which use evolutionary operations such as mutation, crossover, and selection to find an optimal solution for a given problem. ES are programmed based on an expert knowledge to make informed decisions in complex tasks. ANN models how the neurons are connected in animal nervous systems. ANN have learning abilities and they are trained using data to make intelligent decisions. Since ANN have universal approximation abilities, they can be used to solve regression, classification, and forecasting problems. ANNs are made of interconnected layers where every layer is made of neurons and these neurons have connections with other neurons. These layers consist of an input layer, hidden layer/layers, and an output layer.

Keywords: applications, general regression, neural networks, dynamic systems

#### 1. Introduction

Nowadays, computational intelligence (CI) receives much attention in academic and industry due to a plethora of possible applications. CI includes fuzzy logic (FL), evolutionary algorithms (EA), expert systems (ES), and artificial neural networks (ANN). Many CI components have applications in modeling and control of dynamic systems. FL mimics the human reasoning by converting linguistic variables into a set of rules. EA are metaheuristic population-based

© 2018 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

algorithms which use evolutionary operations such as mutation, crossover, and selection to find an optimal solution for a given problem. ES are programmed based on an expert knowledge to make informed decisions in complex tasks. ANN model how the neurons are connected in animal nervous systems. ANN have learning abilities and they are trained using data to make intelligent decisions. Since ANN have universal approximation abilities [1], they can be used to solve regression, classification, and forecasting problems. ANNs are made of interconnected layers where every layer is made of neurons, and these neurons have connections with other neurons. These layers consist of an input layer, hidden layer/layers, and an output layer. ANN have two major types as shown in Figure 1: feed-forward neural network (FFNN) and recurrent neural network (RNN). In FFNN, the data can only flow from the input to hidden layer, while in RNN, the data can flow in any direction. The output of a single-hidden-layer FFNN can be written as

$$Y = \left(W\_{HO} \, h(\mathbf{x} \, W\_{IH} + \mathbf{b}\_{\mathbf{I}})\right) + \mathbf{b}\_{\mathbf{O}} \tag{1}$$


where Y is the network output, $W_{HO}$ is the hidden-output layers weights matrix, h is the hidden layer activation function, x is the input vector, $W_{IH}$ is the input-hidden layers weights matrix, $b_I$ is the input layer bias vector, and $b_O$ is the hidden layer bias vector.
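A small NumPy sketch of the single-hidden-layer FFNN forward pass in Eq. (1); the layer sizes and the tanh hidden activation are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 8, 2            # layer sizes (illustrative)

W_IH = rng.normal(size=(n_in, n_hidden))   # input-hidden weights
b_I = np.zeros(n_hidden)                   # bias added inside the hidden activation
W_HO = rng.normal(size=(n_hidden, n_out))  # hidden-output weights
b_O = np.zeros(n_out)                      # output bias

def ffnn(x, h=np.tanh):
    # Eq. (1): Y = W_HO * h(x W_IH + b_I) + b_O
    return h(x @ W_IH + b_I) @ W_HO + b_O

print(ffnn(np.array([0.1, -0.4, 0.7])))
```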

The output of a single-hidden-layer RNN with a recurrent hidden layer can be written as

$$Y = \left(\mathbf{W}\_{HO} \, h(\mathbf{x} \, \mathbf{W}\_{IH} + \mathbf{h}\_{\mathbf{t}-\mathbf{1}} \, \mathbf{W}\_{HH} + \mathbf{b}\_{\mathbf{I}})\right) + \mathbf{b}\_{\mathbf{O}} \tag{2}$$

The training of neural networks involves modifying the neural network parameters to reduce a given error function. Gradient descent (GD) [2, 3] is the most common ANN training method:

$$
\theta\_{new} = \theta\_{old} - \lambda \frac{\partial E}{\partial \theta} \tag{3}
$$

Figure 1. Feed-forward and recurrent networks.

where θ are the network parameters, λ is the learning rate, and E is the error function:

$$E = \frac{1}{N} \sum\_{l=1}^{N} \left( y - t \right)^{2} \tag{4}$$

where N is the number of samples, y is the network output, and t is the network target.
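The following self-contained sketch applies the gradient-descent update of Eq. (3) with the error function of Eq. (4) to a toy one-neuron linear model; the data, learning rate, and parameter packing are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 100)
t = 2.0 * x + 0.5 + 0.01 * rng.normal(size=x.size)   # toy targets (illustrative)

theta = np.zeros(2)          # network parameters [w, b] of a one-neuron linear model
lam = 0.1                    # learning rate lambda

for step in range(500):
    y = theta[0] * x + theta[1]                  # network output
    E = np.mean((y - t) ** 2)                    # error function, Eq. (4)
    grad = np.array([np.mean(2 * (y - t) * x),   # dE/dw
                     np.mean(2 * (y - t))])      # dE/db
    theta = theta - lam * grad                   # gradient-descent update, Eq. (3)

print(theta)   # should approach [2.0, 0.5]
```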

#### 2. General regression neural network (GRNN)

The general regression neural network (GRNN) is a single-pass neural network which uses a Gaussian activation function in the hidden layer [4]. GRNN consists of input, hidden, summation, and division layers.

The regression of the random variable y on the observed values X of random variable x can be found using

$$E[y|X] = \frac{\int\_{-\infty}^{\infty} yf(X, y) dy}{\int\_{-\infty}^{\infty} f(X, y) dy} \tag{5}$$

where $f(X, y)$ is a known joint continuous probability density function.

When $f(X, y)$ is unknown, it should be estimated from a set of observations of x and y. $f(X, y)$ can be estimated using the nonparametric consistent estimator suggested by Parzen as follows:

$$\hat{f}(X,Y) = \frac{1}{2\pi^{(p+1)/2}} \frac{1}{\sigma^{(p+1)}} \frac{1}{n} \sum\_{i=1}^{n} e^{-\frac{\left(X - X^i\right)^T \left(X - X^i\right)}{2\sigma^2}} e^{-\frac{\left(Y - Y^i\right)^2}{2\sigma^2}}\tag{6}$$

where n is the number of observations, p is the dimension of the vector variable x, and σ is the smoothing factor.

Substituting (6) into (5) leads to


$$\hat{Y}(X) = \frac{\sum_{i=1}^{n} e^{-\frac{(X - X^i)^T (X - X^i)}{2\sigma^2}} \int_{-\infty}^{\infty} y\, e^{-\frac{(y - Y^i)^2}{2\sigma^2}}\, dy}{\sum_{i=1}^{n} e^{-\frac{(X - X^i)^T (X - X^i)}{2\sigma^2}} \int_{-\infty}^{\infty} e^{-\frac{(y - Y^i)^2}{2\sigma^2}}\, dy} \tag{7}$$

After solving the integration, the following will result:

$$\hat{Y}(X) = \frac{\sum_{i=1}^{n} Y^i\, e^{-\frac{(X - X^i)^T (X - X^i)}{2\sigma^2}}}{\sum_{i=1}^{n} e^{-\frac{(X - X^i)^T (X - X^i)}{2\sigma^2}}} \tag{8}$$
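Eq. (8) can be evaluated directly: the training inputs act as kernel centres and the training targets weight the kernel responses. A minimal sketch, assuming a 1-D toy function and a hand-picked σ:

```python
import numpy as np

def grnn_predict(X_train, y_train, X_query, sigma=0.1):
    # Eq. (8): kernel-weighted average of the stored training targets
    d2 = np.sum((X_query[:, None, :] - X_train[None, :, :]) ** 2, axis=2)
    K = np.exp(-d2 / (2 * sigma ** 2))
    return (K @ y_train) / np.sum(K, axis=1)

rng = np.random.default_rng(4)
X_train = rng.uniform(0, 1, size=(50, 1))
y_train = np.sin(2 * np.pi * X_train[:, 0])        # toy target function (illustrative)
X_query = np.linspace(0, 1, 5)[:, None]

print(grnn_predict(X_train, y_train, X_query, sigma=0.05))
```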

#### 2.1. Previous studies

GRNN was used in different applications related to modeling, system identification, prediction, and control of dynamic systems including: feedback linearization controller [5], HVAC process identification and control [6], modeling and monitoring of batch processes [7], cooling load prediction for buildings [8], fault diagnosis of a building's air handling unit [9], intelligent control [10], optimal control for variable-speed wind generation systems [11], annual power load forecasting model [12], vehicle sideslip angle estimation [13], fault diagnosis for methane sensors [14], fault detection of excavator's hydraulic system [15], detection of time-varying inter-turn short circuit in a squirrel cage induction machine [16], system identification of nonlinear rotorcraft heave mode [17], and modeling of traveling wave ultrasonic motors [18].

Some significant modifications of GRNN include using fuzzy c-means clustering to cluster the input data of GRNN [19], modified GRNN which uses different types of Parzen estimators to estimate the density function of the regression [20], density-driven GRNN combining GRNN, density-dependent kernels and regularization for function approximation [21], GRNN to model time-varying systems [22], adapting GRNN for modeling of dynamic plants [23] using different adaptation approaches including modifying the training targets, and adding a new pattern and dynamic initialization of σ.


#### 2.2. GRNN training algorithm

GRNN training is rather simple. The input weights are the training inputs transposed, and the output weights are the training targets. Since GRNN is an associative memory, after training, the number of the hidden neurons is equal to the number of the training samples. However, this training procedure is not efficient if there are many training samples, so one of the suggested solutions is using a data dimensionality reduction technique such as clustering or principal component analysis (PCA). One of the novel solutions to data dimensionality reduction is using an error-based algorithm to grow GRNN [24] as explained in Algorithm 1. The algorithm will check whether an input is required to be included in the training, based on prediction error before training GRNN with that input. If the prediction error without including that input is more than the certain level, then GRNN should be trained with it.
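Algorithm 1 itself is not reproduced in this excerpt; the sketch below shows one plausible reading of the error-based growing idea, where a sample is stored only if the current network predicts it poorly. The error threshold and toy data are assumptions.

```python
import numpy as np

def grnn_predict(X_train, y_train, x, sigma=0.1):
    # Eq. (8) for a single query point x
    d2 = np.sum((X_train - x) ** 2, axis=1)
    K = np.exp(-d2 / (2 * sigma ** 2))
    return np.sum(K * y_train) / np.sum(K)

rng = np.random.default_rng(5)
X = rng.uniform(0, 1, size=(200, 1))
y = np.sin(2 * np.pi * X[:, 0])

threshold = 0.05                       # error level above which a sample is stored (assumed)
X_mem, y_mem = X[:1], y[:1]            # start with the first sample in memory

for xi, yi in zip(X[1:], y[1:]):
    err = abs(grnn_predict(X_mem, y_mem, xi) - yi)
    if err > threshold:                # only poorly predicted samples grow the network
        X_mem = np.vstack([X_mem, xi])
        y_mem = np.append(y_mem, yi)

print("stored neurons:", len(y_mem), "of", len(y))
```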



| Dataset | Training error after/before k-means (MSE) | Testing error after/before k-means (MSE) | Size reduction % |
|---|---|---|---|
| Abalone | 0.0177/0.002 | 0.0141/0.006 | 99.76 |
| Building energy | 0.047/3.44e-05 | 0.0165/0.023 | 99.76 |
| Chemical sensor | 0.241/0.016 | 0.328/0.034 | 97.99 |
| Cholesterol | 0.050/4.605e-05 | 0.030/0.009 | 92 |

Table 1. Using GRNN with k-means clustering.


#### 2.2.1. Reducing data dimensionality using clustering

Clustering techniques can be used to reduce the data dimensionality before feeding it to the GRNN. k-means clustering is one of the popular clustering techniques. The k-means clustering algorithm is explained in Algorithm 2. Also, results of comparing GRNN performance before and after applying k-means algorithm are shown in Table 1. Although the training and testing errors will increase, there are large reductions in the network size.

The aim of the algorithm is to minimize the distance objective function:

$$J = \sum\_{i=1}^{N} \sum\_{j=1}^{M} ||\mathbf{x}\_i - \mathbf{c}\_j||^2 \tag{9}$$
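A compact sketch of the k-means step described above (Algorithm 2 is not reproduced in this excerpt): the objective of Eq. (9) is reduced by alternating nearest-centroid assignment and centroid updates. The number of clusters and the toy data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 2))                  # training inputs to be compressed
M = 10                                         # number of clusters / retained centres (assumed)
C = X[rng.choice(len(X), M, replace=False)]    # initial centroids

for _ in range(50):
    # Assign every sample to its nearest centroid
    labels = np.argmin(np.linalg.norm(X[:, None] - C[None, :], axis=2), axis=1)
    # Move each centroid to the mean of its assigned samples
    C = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
                  for j in range(M)])

J = np.sum((X - C[labels]) ** 2)               # objective of Eq. (9)
print("J =", J, " centres kept:", M)
```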


#### 2.2.2. Reducing data dimensionality using PCA

PCA can be used to reduce a large dataset into a smaller dataset which still carries most of the important information from the large dataset. In a mathematical sense, PCA converts a number of correlated variables into a number of uncorrelated variables. PCA algorithm is explained in Algorithm 3.
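Algorithm 3 is likewise not reproduced here; a standard PCA reduction can be sketched with an eigendecomposition of the covariance matrix, keeping the leading components. The retained dimension and toy data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated toy data

Xc = X - X.mean(axis=0)                       # centre the data
cov = np.cov(Xc, rowvar=False)                # covariance of the variables
eigval, eigvec = np.linalg.eigh(cov)          # eigen-decomposition (ascending order)

k = 2                                         # number of principal components kept (assumed)
components = eigvec[:, ::-1][:, :k]           # leading eigenvectors
X_reduced = Xc @ components                   # uncorrelated, lower-dimensional inputs
print(X_reduced.shape)
```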

| Dataset | Training error after/before PCA (MSE) | Testing error after/before PCA (MSE) | Size reduction % |
|---|---|---|---|
| Abalone | 0.197/0.002 | 0.188/0.006 | 99.8 |
| Building energy | 0.061/3.44e-05 | 0.049/0.023 | 99.6 |
| Chemical sensor | 0.241/0.016 | 0.328/0.034 | 98.3 |
| Cholesterol | 0.026/4.605e-05 | 0.028/0.009 | 92 |

Table 2. Using GRNN with PCA.

#### 2.3. GRNN output algorithm

After GRNN is trained, the output of GRNN can be calculated using

$$D = \left(\mathbf{X} - \mathbf{W}\_i\right)^T \left(\mathbf{X} - \mathbf{W}\_i\right) \tag{10}$$


$$\hat{Y} = \frac{\sum_{i=1}^{N} W_o\, e^{-D/(2\sigma^2)}}{\sum_{i=1}^{N} e^{-D/(2\sigma^2)}} \tag{11}$$

where D is the Euclidean distance between the input X and the input weights Wi, Wo is the output weight, and σ is the smoothing factor of the radial basis function.

GRNN output calculation is explained in Algorithm 4.


Other distance measures can be also used such as Manhattan (city block), so (10) will become

$$D = X - W\_i \tag{12}$$

## 3. Estimation of GRNN smoothing parameter (σ)


Since σ is the only free parameter in GRNN and suitable values of it will improve GRNN accuracy, it should be estimated. Since there is no optimal analytical solution for finding σ, numerical approaches can be used to estimate it. The holdout method is one of the suggested methods. In this method, samples are randomly removed from the training dataset; then using the GRNN with a fixed σ, the output is calculated using the removed samples; then the error is calculated between the network outputs and the sample targets. This procedure is repeated for different σ values. The smoothing parameter (σ) with the lowest sum of errors is selected as the best σ. The holdout algorithm is explained in Algorithm 5.
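A sketch of the holdout idea (Algorithm 5 is not shown in this excerpt): samples are held out, the remaining samples form the GRNN, and the σ with the lowest held-out error is kept. The candidate σ grid, holdout size, and toy data are assumptions.

```python
import numpy as np

def grnn_predict(X_train, y_train, X_query, sigma):
    d2 = np.sum((X_query[:, None, :] - X_train[None, :, :]) ** 2, axis=2)
    K = np.exp(-d2 / (2 * sigma ** 2))
    return (K @ y_train) / np.sum(K, axis=1)

rng = np.random.default_rng(8)
X = rng.uniform(0, 1, size=(120, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.05 * rng.normal(size=120)

idx = rng.permutation(len(X))
hold, keep = idx[:30], idx[30:]                   # held-out vs. retained samples

best_sigma, best_err = None, np.inf
for sigma in np.logspace(-2, 0, 20):              # candidate sigmas (assumed grid)
    pred = grnn_predict(X[keep], y[keep], X[hold], sigma)
    err = np.mean((pred - y[hold]) ** 2)          # error over the held-out samples
    if err < best_err:
        best_sigma, best_err = sigma, err

print("best sigma:", best_sigma, "holdout MSE:", best_err)
```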


Other search and optimization methods might be also used to find σ. For instance, genetic algorithms (GA) and differential evolution (DE) are suitable options. Algorithm 6 explains how to find σ using DE or GA. Also, the results of using DE and GA are depicted in Figure 2.



Figure 2. DE and GA used to estimate GRNN σ. (a) Estimation of σ using DE, (b) MSE evolution when using DE to estimate σ, (c) estimation of σ using GA, and (d) MSE evolution when using GA to estimate σ.

Both GA and DE can find a good approximation of σ within only 100 iterations; however, DE converges faster since it is a vectorized algorithm.

#### 4. GRNN vs. back-propagation neural networks (BPNN)

There are many differences between GRNN and BPNN. Firstly, GRNN is a single-pass learning algorithm, while BPNN needs two passes: a forward and a backward pass. This means that GRNN consumes significantly less training time. Secondly, the only free parameter in GRNN is the smoothing parameter σ, while in BPNN more parameters are required, such as weights, biases, and learning rates. This also indicates GRNN's quick learning ability and its suitability for online systems or for systems where minimal computation is required. Also, another difference is that since GRNN is an autoassociative memory network, it will store all the distinct input/output samples, while BPNN has a limited predefined size. This size growth


issue is resolved by either using clustering or PCA (see Sections 2.2.1 and 2.2.2). Finally, GRNN is based on the general regression theory, while BPNN is based on a gradient-descent iterative optimization method.

To show the advantages of GRNN over BPNN, a comparison is held using standard regression datasets built into MATLAB software [25]. All the datasets are divided 70% for training and 30% for testing. After training the network with the 70% training data, the output of the neural network is found using the remaining testing data. The most notable advantage of GRNN over BPNN is the shorter training time, which confirms its selection for dynamic systems modeling and control. Also, GRNN has a lower testing error, which means it has better generalization abilities than BPNN. The comparison results are summarized in Table 3.

| Type | Dataset | Training time (sec) | Training error (MSE) | Testing error (MSE) |
|---|---|---|---|---|
| GRNN | Abalone | 0.621 | 0.342 | 0.384 |
| BPNN | Abalone | 1.323 | 0.436 | 0.395 |
| GRNN | Building energy | 0.630 | 0.0731 | 0.628 |
| BPNN | Building energy | 1.880 | 0.1152 | 0.631 |
| GRNN | Chemical sensor | 0.701 | 0.888 | 1.316 |
| BPNN | Chemical sensor | 1.473 | 0.228 | 1.584 |
| GRNN | Cholesterol | 0.801 | 0.037 | 0.172 |
| BPNN | Cholesterol | 2.099 | 0.061 | 0.215 |

Table 3. GRNN vs. BPNN training and testing performance.

### 5. GRNN in identification of dynamic systems

System identification is the process of building a model of unknown/partially known dynamic system based on observed input/output data. Gray-box and black-box identification are two common approaches of system identification. In the gray-box approach, a nominal model of a dynamic system is known, but its exact parameters are unknown, so an identifier is used to find these parameters. In the black-box approach, the identification is based only on the data. Examples of black-box identification include fuzzy logic (FL) and neural networks (NN). GRNN can be used to identify dynamic systems quickly and accurately. There are two methods to use GRNN for system identification: the batch mode (off-line training) and sequential mode (online training). In the batch mode, all the observed data is available before the system identification, so GRNN can be trained with a big chunk of the data, while in the sequential mode only a few data samples are available for identification.

#### 5.1. GRNN identification in batch training mode


In the batch mode, the observed data should be divided into training, validation, and testing sets. GRNN will be fed with all the training data to identify the system. Then, in the validation stage, the network should be tested with different data, usually randomly selected, and the error is recorded for every validation test. The validation process is then repeated several times; 10 repetitions is standard. The average validation error is then found based on all the validation tests. This validation procedure is called k-fold cross-validation, a standard technique in machine learning (ML) applications. To test the generalization ability of an identified model, a new dataset called the testing dataset is used. Based on the model performance in the testing stage, one can decide whether the model is suitable or not.

#### 5.1.1. Batch training GRNN to identify hexacopter attitude dynamics

In this example, GRNN is used to identify the attitude (pitch/roll/yaw) of a hexacopter drone based on real flight test data in the free flight mode. The data consist of three inputs: rolling, pitching, and yawing control values and three outputs: rolling, pitching, and yawing rates. The dataset contains 6691 data samples with a sample rate of 0.01 seconds. A total of 4683 samples are used to train GRNN in the batch mode, and the remaining data samples (2008) are used for testing. The results of hexacopter attitude identification are shown in Figure 3(a–c). The results are accurate with very low error. MSE in training stage is 0.001139 and 0.00258 in the testing stage. Also, the training time was only 0.720 seconds.

Figure 3. Attitude identification of hexacopter in batch training: (a) rolling rate identification, (b) pitching rate identification, and (c) yawing rate identification.

#### 5.2. GRNN identification in sequential training mode

In sequential training, the data arrive one sample at a time, which makes using the batch training procedure impossible, so GRNN should be able to find the system model from only the current and past measurements; it is therefore a prediction problem. Since GRNN converges to a regression surface even with a few data samples, and since it is accurate and quick, it can be used in online dynamic system identification.

#### 5.2.1. Sequential training GRNN to identify hexacopter attitude dynamics

To use GRNN in sequential mode, it is preferred to use the delayed output of the plant as an input in addition to the current input, as shown in Figure 4. The same data which was used for batch mode is used in the sequential training. The inputs to GRNN are the control values of rolling, pitching, and yawing and the delayed rolling, pitching, and yawing rates. The results of using GRNN in the sequential training mode are shown in Figure 5(a–c). The results of sequential training are more accurate than the results in batch training.

Figure 4. Sequential training GRNN.

Figure 5. Attitude identification of hexacopter in sequential training: (a) rolling rate identification, (b) pitching rate identification, and (c) yawing rate identification.


### 6. GRNN in control of dynamic systems

The aim of adding a closed-loop controller to a dynamic system is either to reach the desired performance or to stabilize an unstable system. GRNN can be used in controlling dynamic systems as a predictive or feedback controller. GRNN in control systems can be used in either a supervised or an unsupervised way. When GRNN is trained as a predictive controller, the controller input and output data are known, so this is a supervised problem. On the other hand, if GRNN is utilized as a feedback controller (see Figure 6) without being pretrained, only the controller input data is known, so GRNN has to find the suitable control signal u.

### 6.1. GRNN as predictive controller

To utilize GRNN as a predictive controller, it should be trained with input-output data from another controller. For example, training a GRNN with a proportional integral derivative (PID) controller input/output data as shown in Figure 7. Then the trained GRNN can be used as a controller.

Figure 6. Unsupervised learning problem in control.

Figure 7. Training GRNN as predictive controller.

#### 6.1.1. Example 1: GRNN as predictive controller

Consider the discrete-time system described by Liu [26] as

$$y(k+1) = 0.8 \ast \sin\left(y(k)\right) + 15 \ast u(k)\tag{13}$$

The desired reference is $y\_d(k) = 2 \ast \sin(0.1\pi t)$.

The perfect control law can be written as


$$
u(k) = \frac{y\_d(k+1)}{15} - \frac{0.8 \ast \sin\left(y(k)\right)}{15} \tag{14}
$$

To train GRNN as a predictive controller, the system described by (13) and (14) is simulated for 50 seconds, and the controller output u and the plant output y are stored. GRNN is then trained with the plant output as input and the controller output as target. At each time step, the plant output is fed to GRNN, and the controller output u is estimated. The controller output estimated by GRNN and the perfect controller output are almost identical, as shown in Figure 8. Also, the tracking performance when using GRNN as a predictive controller is very accurate, as shown in Figure 9.
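A minimal MATLAB sketch of this procedure follows; the sampling time, the spread value, and the variable names are assumptions rather than values taken from the chapter.

```matlab
% Example 1: train GRNN as a predictive controller for plant (13).
Ts = 0.1;  N = round(50/Ts);                 % assumed sampling time, 50 s of data
t  = (0:N)*Ts;
yd = 2*sin(0.1*pi*t);                        % desired reference
y  = zeros(1, N+1);  u = zeros(1, N);
for k = 1:N
    u(k)   = yd(k+1)/15 - 0.8*sin(y(k))/15;  % perfect control law (14)
    y(k+1) = 0.8*sin(y(k)) + 15*u(k);        % plant (13)
end
net   = newgrnn(y(1:N), u, 0.05);            % train: plant output -> controller output
u_hat = net(y(1:N));                         % GRNN used as the predictive controller
```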

#### 6.2. GRNN as an adaptive estimator controller

Since GRNN has robust approximation abilities, it can be used to approximate the dynamics of a given system in order to find the control law, especially when the system is partially known or unknown.

Assume there is a nonlinear dynamic system written as

$$
\dot{\mathbf{x}} = f(\mathbf{x}, t) + bu + d\tag{15}
$$

where $\dot{\mathbf{x}}$ is the derivative of the states, $f(\mathbf{x}, t)$ is a known function of the states, b is the input gain, and d is the external disturbance.

Figure 8. Perfect vs. estimated GRNN controller output.

Figure 9. GRNN tracking performance.

The perfect control law can be written as

$$
u = \frac{1}{b} (\dot{\mathbf{x}} - f(\mathbf{x}, t) - d) \tag{16}
$$


If $f(\mathbf{x}, t)$ is unknown, then the control law in (16) cannot be computed; hence, the alternative is to use GRNN to estimate the unknown function $f(\mathbf{x}, t)$. To derive the update law of the GRNN weights, let us define the objective function as the mean squared error (MSE):

$$E = \frac{1}{2}(\hat{y} - y)^2\tag{17}$$

where $\hat{y}$ is the GRNN estimate and y is the optimal value of $f(\mathbf{x}, t)$. To derive the update law, the error should be minimized with respect to the GRNN weights W:

$$\frac{\partial E}{\partial W} = \left(\hat{W}H - y\right) \* H \tag{18}$$

where $\hat{W}$ denotes the current hidden-to-output layer weights of GRNN and H is the hidden-layer output, so the update law of the GRNN weights becomes

$$W\_{i+1} = W\_i - H \left(\hat{W}H - y\right) \tag{19}$$
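In code, a single update of (17)–(19) amounts to the following two lines, assuming the hidden-to-output weights W are stored as a row vector and H is the normalized pattern-layer output stored as a column vector.

```matlab
y_hat = W*H;                  % GRNN estimate; the error (y_hat - y) drives (18)
W     = W - (y_hat - y)*H.';  % gradient step on E, i.e., update law (19)
```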

#### 6.3. Example 2: using GRNN to approximate the unknown dynamics

Let us consider the same discrete-time system as in example 1:

$$y(k+1) = f(k) + 15 \* u(k)\tag{20}$$

The desired reference is $y\_d(k) = 2 \ast \sin(0.1\pi t)$,

where f(k) is an unknown nonlinear function.

The perfect control law can be written as

$$
u(k) = \frac{-f(k)}{15} + \frac{y\_d(k)}{15} \tag{21}
$$

GRNN is used to estimate the unknown function f(k). By applying the update law in (19), f(k) is estimated with acceptable accuracy, as shown in Figure 10; the MSE between the ideal and the estimated f(k) is 0.0033. The accurate tracking performance of the controller is shown in Figure 11.
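A sketch of this example in MATLAB is given below. The pattern-layer centres, the spread, the step count, and the use of the measured output to recover f(k) as y(k+1) − 15u(k) for the update are assumptions of the sketch, not details given in the chapter.

```matlab
% Example 2: adapt the GRNN output weights online with (19) to estimate f(k).
sigma = 0.5;  Ts = 0.1;  N = 500;
c  = linspace(-3, 3, 25);                        % fixed pattern-layer centres (assumed)
W  = zeros(1, numel(c));                         % hidden-to-output weights, adapted online
yd = 2*sin(0.1*pi*(0:N)*Ts);                     % desired reference
y  = zeros(1, N+1);
for k = 1:N
    H      = exp(-(y(k) - c).^2/(2*sigma^2)).';  % pattern-layer activations
    H      = H/max(sum(H), eps);                 % GRNN normalization
    f_hat  = W*H;                                % current estimate of f(k)
    u      = (-f_hat + yd(k))/15;                % control law (21)
    y(k+1) = 0.8*sin(y(k)) + 15*u;               % plant (20); f(k) is hidden from the controller
    f_meas = y(k+1) - 15*u;                      % f(k) recovered from the measured output
    W      = W - (f_hat - f_meas)*H.';           % update law (19)
end
```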

#### 6.4. GRNN as an adaptive optimal controller


GRNN has learning abilities, which makes it suitable as an adaptive intelligent controller. Rather than approximating only the unknown function in the control law (16), one can use GRNN to approximate the whole controller output, as shown in Figure 12. The same update law as in (19) can be used to adjust the GRNN weights so that the network approximates the controller output u.

#### 6.4.1. Example 3: using GRNN as an adaptive controller

Let us consider the same discrete system as in (13):

$$y(k+1) = 0.8 \ast \sin\left(y(k)\right) + 15 \ast u(k)$$

with the same desired reference $y\_d(k) = 2 \ast \sin(0.1\pi t)$, but in this case GRNN is used to estimate the full controller output u, as shown in Figure 14, and the tracking performance is shown in Figure 13.

Figure 10. Using GRNN to estimate the unknown dynamics.

Figure 11. GRNN tracking performance for example 2.

Figure 12. Training GRNN as an adaptive controller.

Figure 13. GRNN tracking performance for example 3.

Figure 14. GRNN estimated control law for example 3.

Figure 15. GRNN as an adaptive controller in example 4.

#### 6.4.2. Example 4: using GRNN as an adaptive controller

Let us use GRNN to control a more complex discrete plant [27] described as

$$\begin{split} y(k+1) &= 0.2\cos\left(0.8(y(k) + y(k-1))\right) + 0.4\sin\left(0.8(y(k-1) + y(k) + 2u(k) + u(k-1))\right) \\ &+ 0.1(9 + y(k) + y(k-1)) + \frac{2(u(k) + u(k-1))}{(1 + \cos(y(k)))} \end{split} \tag{22}$$

The desired reference in this case is

$$y\_d(k) = 0.8 + 0.05(\sin\left(\pi k/50\right) + \sin\left(\pi k/100\right) + \sin\left(\sin\left(\pi k/150\right)\right))$$
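For reference, plant (22) translates directly into a MATLAB step function; the function name is only illustrative.

```matlab
% One-step update of plant (22); y_k, y_km1, u_k, u_km1 stand for y(k), y(k-1), u(k), u(k-1).
% Saved as plant22.m.
function y_next = plant22(y_k, y_km1, u_k, u_km1)
    y_next = 0.2*cos(0.8*(y_k + y_km1)) ...
           + 0.4*sin(0.8*(y_km1 + y_k + 2*u_k + u_km1)) ...
           + 0.1*(9 + y_k + y_km1) ...
           + 2*(u_k + u_km1)/(1 + cos(y_k));
end
```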


The tracking performance of the adaptive GRNN controller is shown in Figure 15.

### 7. MATLAB examples

In this section, GRNN MATLAB code examples are provided.

#### 7.1. Basic GRNN Commands in MATLAB

In this example, GRNN is trained to find the square of a given number.

To design a GRNN in MATLAB:

Firstly, create the inputs and the targets and specify the spread parameter:
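For example (the data values below are illustrative; any set of sample numbers and their squares can be used):

```matlab
x1 = [1 3 5 7];       % training inputs (illustrative values)
t1 = x1.^2;           % targets: the squares of the inputs
spread = 1;           % spread (sigma) parameter
```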

Secondly, create GRNN:
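With the newgrnn command:

```matlab
net1 = newgrnn(x1, t1, spread);   % create the generalized regression neural network
```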

To view GRNN after creating it:
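The view command opens the diagram of the network:

```matlab
view(net1)
```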

The results are shown in Figure 16.

To find GRNN output based on a given input:

```matlab
y2 = net1(4)
```

The result is 17, which is the GRNN approximation of the exact square 16.


Figure 16. View GRNN in MATLAB.


#### 7.2. The holdout method to find σ
