Model Reference Adaptive Control of Quadrotor UAVs: A Neural Network Perspective
http://dx.doi.org/10.5772/intechopen.71487

"Neurons that fire together, wire together"

The above quotation may be familiar. It is based on Donald Hebb's theory explaining neural adaptation and learning, and it forms the basis of learning in modern-day artificial neural networks.

Therefore it was theorized that the neuron, which either fires or does not, can be modeled as a function that has multiple inputs and a single output (which may be connected to several other neurons) and only 'fires' when the sum of its inputs exceeds a certain threshold. The connections between neurons have weights, which strengthen or weaken a connection between two neurons.

#### 5.3. Formalization

As shown in Figure 3, a basic neural network consists of layers of nodes (or neurons), where each node has a connection to all the nodes of the next layer and takes input from each of the nodes in the previous layer. Each connection has a real-valued weight associated with it. Every neuron performs some simple computation (we will restrict ourselves to the sigmoid activation and linear activation) on the sum of its inputs to yield an output value.

Figure 3. A depiction of a basic neural network. Image courtesy of Michael A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015.

The above definition is that of a multilayer perceptron (MLP) network, which is the most basic form of a feedforward neural network.

Let us assume that x is the input (vector) for this neural network; the dimension of x is 6 × 1. Let the weights of the connections between the input layer and the first hidden layer be represented by the matrix θ<sup>(1)</sup>, with each value in the matrix referenced as θ<sup>(1)</sup><sub>ij</sub>. The dimension of θ<sup>(1)</sup> is 6 × 4.

$$\sigma(z) = \frac{1}{1 + e^{-z}}\tag{8}$$

Eq. (8) shows the sigmoid activation function that will be applied at every node in the hidden layers. If z is a vector then applying the sigmoid activation function will result in a vector of the same dimension. The sigmoid function is a continuous, differentiable function that is bounded between 0 and 1.

Adaptive Robust Control Systems

$$\mathbf{a}^{(1)} = \sigma \left( \left( \boldsymbol{\theta}^{(1)} \right)^{T} \mathbf{x} + \mathbf{b}^{(1)} \right) \tag{9}$$

Eq. (9) shows the first step in forward propagation. a<sup>(1)</sup> is known as the activation of the first hidden layer. As a sanity check, we can see the dimension of a<sup>(1)</sup> is 4 × 1, which is congruent with what we expect.

Neural networks have a bias unit (not shown in Figure 3), which is a neuron that is always firing and is connected to every node in the next layer but does not take input from the previous layer. Mathematically it can be represented as the additive term b<sup>(1)</sup> of dimension 4 × 1 shown in Eq. (9).

$$\mathbf{a}^{(2)} = \sigma \left( \left( \theta^{(2)} \right)^T \mathbf{a}^{(1)} + \mathbf{b}^{(2)} \right) \tag{10}$$

$$\mathbf{y}\_{\text{predicted}} = \mathbf{a}^{(3)} = \mathbf{z}^{(3)} = \left(\boldsymbol{\theta}^{(3)}\right)^{T}\mathbf{a}^{(2)} + \mathbf{b}^{(3)} \tag{11}$$

Eqs. (10) and (11) complete the forward propagation. Eq. (11) can include a sigmoid activation too if the desired output is between 0 and 1, as in a classification problem. Using no activation function is also known as linear activation.
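The forward pass of Eqs. (9)–(11) can be sketched in NumPy. The 6 × 1 input and 6 × 4 first weight matrix follow the text; the second hidden layer size (3 here) is an assumption, since the chapter does not state it:

```python
import numpy as np

def sigmoid(z):
    # Eq. (8): element-wise, bounded between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 1))                    # input, 6 x 1

# weight matrices for an assumed 6-4-3-1 architecture,
# stored so that (theta^(l))^T left-multiplies as in Eqs. (9)-(11)
theta1, b1 = 0.1 * rng.standard_normal((6, 4)), np.zeros((4, 1))
theta2, b2 = 0.1 * rng.standard_normal((4, 3)), np.zeros((3, 1))
theta3, b3 = 0.1 * rng.standard_normal((3, 1)), np.zeros((1, 1))

a1 = sigmoid(theta1.T @ x + b1)                    # Eq. (9),  4 x 1
a2 = sigmoid(theta2.T @ a1 + b2)                   # Eq. (10), 3 x 1
y_predicted = theta3.T @ a2 + b3                   # Eq. (11), linear output, 1 x 1

print(a1.shape, a2.shape, y_predicted.shape)       # (4, 1) (3, 1) (1, 1)
```

The shape check mirrors the sanity check in the text: each activation's dimension matches the number of neurons in its layer.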

The neural network learns by optimizing an objective function (or cost function) such as the squared-error cost function for dealing with regression problems as in this text.

$$J(\theta) = \frac{1}{2m} \sum\_{i=1}^{m} \left( y\_{predicted}^{(i)} - y^{(i)} \right)^2 \tag{12}$$

where y<sup>(i)</sup> is the vector of target values of the output layer. In our example it is 1 × 1, but in general the neural network can have multiple outputs. In an offline setting we have all our data beforehand, indicated here by m examples, and we compute the cost iterating over all y<sup>(i)</sup>.
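Over a batch of m examples, Eq. (12) can be sketched as follows (the predictions and targets are made-up numbers for illustration):

```python
import numpy as np

def squared_error_cost(y_predicted, y):
    # Eq. (12): squared error with the 1/(2m) convention
    m = y.shape[0]
    return np.sum((y_predicted - y) ** 2) / (2 * m)

y_predicted = np.array([0.5, 1.5, 2.0])  # made-up predictions
y = np.array([1.0, 1.0, 2.0])            # made-up targets
cost = squared_error_cost(y_predicted, y)
print(cost)  # (0.25 + 0.25 + 0.0) / 6 = 0.0833...
```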

As seen in Figure 4, the sigmoid function is approximately linear between −2 and 2 and is almost a flat line beyond −4 and 4. This has three implications:

1. It gives us a convenient, differentiable function that is nearly binary.
2. If the input of a neuron is too extreme, the neuron becomes saturated and therefore information gets attenuated going through the network if it is too deep.
3. In the case of saturated nodes, backpropagation, which involves computing the gradient, becomes ineffective, as the gradient at the extremes is nearly zero.

Figure 4. Graph of sigmoid function.

Due to the latter two points it is beneficial to ensure that the weights of the network are small. If the weights have large values, the network is sure those connections are very important, which makes it hard for it to learn otherwise. Therefore, it makes sense to keep weights small so the network remains responsive to new information. To accomplish this we incorporate the weights into the cost function. In L1 regularization we add the modulus of the weights to the cost function in Eq. (12). In L2 regularization we add the squares of the weights to the cost function, resulting in Eq. (13).

$$J(\boldsymbol{\theta}) = \frac{1}{2m} \sum\_{i=1}^{m} \left( \mathbf{y}\_{\text{predicted}}^{(i)} - \mathbf{y}^{(i)} \right)^2 + \frac{\lambda}{2} \cdot \sum\_{l=1}^{L-1} \sum\_{i=1}^{s\_l} \sum\_{j=1}^{s\_{l+1}} \left( \theta\_{ij}^{(l)} \right)^2 \tag{13}$$

where L is the total number of layers in the neural network and s<sub>l</sub> is the number of nodes in the l<sup>th</sup> layer. Note that regularization is not applied to the weights from the bias node. λ is the regularization parameter that lets us control the extent to which we penalize the neural network for the size of its weights.
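A minimal sketch of the L2 penalty term in Eq. (13), summing squared weights over every layer while leaving the bias vectors out (the weight shapes are illustrative):

```python
import numpy as np

def l2_penalty(thetas, lam):
    # second term of Eq. (13): lambda/2 times the sum of squared weights,
    # with the bias vectors deliberately left out of the sum
    return (lam / 2.0) * sum(np.sum(theta ** 2) for theta in thetas)

# illustrative weight shapes for a 6-4-3-1 network
thetas = [np.ones((6, 4)), np.ones((4, 3)), np.ones((3, 1))]
print(l2_penalty(thetas, lam=0.1))  # 0.05 * (24 + 12 + 3) = 1.95
```

Adding this term to the squared-error cost of Eq. (12) yields the regularized cost of Eq. (13).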

Gradient descent is used to train the neural network, which means running several iterations making small adjustments to the parameters θ in the direction that minimizes the cost function. The weight update equation is shown in Eq. (14).

$$\theta^{(l)} \coloneqq \theta^{(l)} - \alpha \cdot \frac{\delta J}{\delta \theta^{(l)}}, \quad l = 1, \dots, L - 1 \tag{14}$$

where α is the learning rate that controls the size of the adjustment. It must not be too small, else the learning will take very long, and it must not be too large, else the network will not converge. θ<sup>(l)</sup> is a matrix, and therefore the derivative term is also a matrix.
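A single update from Eq. (14), with an illustrative θ and gradient:

```python
import numpy as np

def gradient_step(theta, grad, alpha):
    # Eq. (14): adjust theta against the gradient, scaled by the learning rate
    return theta - alpha * grad

theta = np.array([[1.0, -2.0]])
grad = np.array([[0.5, 0.5]])
theta_new = gradient_step(theta, grad, alpha=0.1)
print(theta_new)  # theta moves to [[0.95, -2.05]]
```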

The backpropagation algorithm is used to calculate the derivative term. The intuition is to calculate the error term at every node in the network, from the output layer backwards, and use it to compute the derivative. We shall denote the error vector as δ<sup>(l)</sup>, where l denotes the layer number. Eq. (15) denotes the error in the output layer of our example network.

$$\delta^{(4)} = \mathbf{y} - \mathbf{a}^{(3)} \tag{15}$$

Except for the output layer the error term is defined as:


$$\delta^{(l)} = \left(\theta^{(l)}\right)^{T} \delta^{(l+1)} \diamond \sigma'\left(\mathbf{a}^{(l)}\right), \; l \in \{2, \ldots, L - 1\} \tag{16}$$

where σ'(·) denotes the derivative of the sigmoid function, ⋄ signifies an element-wise multiplication, and L is the total number of layers in the network; here L = 4. Note that δ<sup>(1)</sup> can be calculated, but the error on our inputs does not have any significance.

$$\sigma'\left(\mathbf{a}^{(l)}\right) = \sigma\left(\mathbf{a}^{(l)}\right) \diamond \left(\mathbf{1} - \sigma\left(\mathbf{a}^{(l)}\right)\right) \tag{17}$$

Eq. (18) is the derivative of the cost function (without the regularization term) computed using the errors previously found. The mathematical proof<sup>2</sup> of backpropagation is beyond the scope of this chapter. The final gradient, averaged over all training examples and including the regularization term, is Eq. (19).

$$\frac{\delta J}{\delta \theta^{(l)}} = \delta^{(l+1)} \left(\mathbf{a}^{(l)}\right)^T \tag{18}$$

$$\frac{\delta J}{\delta \theta^{(l)}} = \frac{1}{m} \cdot \sum\_{i=1}^{m} \left( \delta^{(l+1)} \left( \mathbf{a}^{(l)} \right)^{T} \right) + \lambda \cdot \theta^{(l)} \tag{19}$$

The weights of the network are randomly initialized to small positive or negative real values. If one were to initialize all the weights to the same value (say zero), then the gradient calculated at every node in a layer would be the same, and we would end up with a neural network full of redundancies. Note that if there were no hidden layers this would not be the case, but the power of the algorithm goes down significantly without them.
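The full loop above can be sketched end to end for an assumed 6–4–3–1 network. The index bookkeeping is adjusted so that matrix shapes line up under the chapter's (θ<sup>(l)</sup>)<sup>T</sup> forward convention (the backward pass then multiplies by θ without the transpose), and the sigmoid derivative uses the standard identity σ'(z) = a(1 − a) for a = σ(z):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
# small random initialization, as recommended above (never all zeros)
theta1, b1 = 0.01 * rng.standard_normal((6, 4)), np.zeros((4, 1))
theta2, b2 = 0.01 * rng.standard_normal((4, 3)), np.zeros((3, 1))
theta3, b3 = 0.01 * rng.standard_normal((3, 1)), np.zeros((1, 1))

x = rng.standard_normal((6, 1))
y = np.array([[1.0]])

# forward pass, Eqs. (9)-(11)
a1 = sigmoid(theta1.T @ x + b1)
a2 = sigmoid(theta2.T @ a1 + b2)
y_pred = theta3.T @ a2 + b3

# backward pass: output error, then hidden-layer errors (cf. Eqs. (15)-(17));
# the sign is flipped relative to Eq. (15) so the update of Eq. (14) descends
d4 = y_pred - y
d3 = (theta3 @ d4) * (a2 * (1.0 - a2))   # sigmoid derivative as a * (1 - a)
d2 = (theta2 @ d3) * (a1 * (1.0 - a1))

# gradients shaped like the corresponding theta (cf. Eqs. (18)-(19), lambda = 0)
g3 = a2 @ d4.T
g2 = a1 @ d3.T
g1 = x @ d2.T
```

Applying Eq. (14) with these gradients and a small α reduces the squared-error cost on this example.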

#### 5.4. Limitations


Neural networks have large time and space requirements. Assume a fully connected neural network with n hidden layers and m neurons in each hidden layer. We have (n − 1) · m × m parameters just in the hidden layers. This number is for the basic MLP network, and more sophisticated implementations (deep learning) will have even more parameters.
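To make the (n − 1) · m × m count concrete (hidden-to-hidden weight matrices only, ignoring biases and the input/output layers):

```python
def hidden_parameter_count(n, m):
    # (n - 1) weight matrices of size m x m connect the n hidden layers
    return (n - 1) * m * m

print(hidden_parameter_count(3, 1000))  # 2000000 weights for just 3 hidden layers of 1000 neurons
```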

For example, ILSVRC<sup>3</sup> evaluates algorithms for object detection and image classification at large scale to measure the progress of computer vision for large scale image indexing for retrieval and annotation. Over the years, deep neural networks have been used to solve the problem statement with better accuracy every year. AlexNet [5] had 60 million parameters and took two weeks to train on 2 GPUs in 2012 with 16.4% classification error using convolutional neural networks. GoogLeNet [6] had 4 million parameters achieving classification error of 6.66% in 2014 with the advent of their inception module in the convolutional neural network. So, even as the situation continues to improve, neural networks are still time and memory intensive.

<sup>2</sup> The concise proof can be found in chapter 2 of the book by Michael A. Nielsen, "Neural Networks and Deep Learning", Determination Press, 2015.

<sup>3</sup> http://www.image-net.org/challenges/LSVRC/

The second problem is predictability. Just as the human brain is an enigma that man has been trying to understand and find patterns in, the fact remains that humans are still unpredictable to quite an extent. Similarly, as far as machine learning algorithms go, neural networks are a black box: no one fully understands the functions that have been mapped within them once trained. Therefore no one can predict when they might fail, and it is not always possible to justify the results produced, as opposed to a rule-based classification method such as decision trees. Yet neural networks have proved to be a great tool and are widely used by organizations today.
