**5. Training algorithms**

Training algorithms are one of the most important parts of deep learning. A deep neural network is distinguished mainly by its number of layers: as the number of layers increases, the network becomes deeper and more complex. Each layer has its specific function and detects, or helps in the detection of, a particular feature.

According to the author [31], if the problem is face recognition, the first layer has the responsibility to recognize edges, the second detects higher-level features such as the nose, eyes, and ears, the next layer digs out still finer features, and so on. However, such hand-designed feature layers predate training algorithms like gradient descent, which is why these kinds of classifiers are not suitable for datasets with huge volume or variation. This was discussed by Yann et al. [32]; they further concluded that a system with less manual and more automatic design gives better results in pattern recognition.

Backpropagation is the solution: it takes information from the data without going through hand-built classifiers and finds the representation needed for recognition. A few famous training algorithms are listed below.

#### **5.1 Gradient descent**

In statistics, data science, and machine learning, we optimize a lot of things: when we fit a line with linear regression, we optimize the intercept and the slope; when we use logistic regression, we optimize a squiggle; when we use t-SNE, we optimize clusters. Gradient descent is used to optimize all of these and tons of others as well.

The gradient descent algorithm is similar to Newton's root-finding algorithm for a 2D function. The methodology is very simple: pick a point randomly on the curve and move right or left along the x-axis, depending on whether the slope of the function at that point is positive or negative, until the value on the y-axis, that is, the function f(x), becomes zero. The same concept lies behind gradient descent: we traverse a path in a many-dimensional weight space, and when the error rate is reduced to within our limits, we stop. It is one of the underlying concepts of most deep learning and machine learning algorithms. The cost to be minimized can be written as the squared difference between the expected and the actual output:

$$C = \frac{1}{2} \left( Y_{\text{expected}} - Y_{\text{actual}} \right)^2 \tag{1}$$
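As a minimal sketch of the idea (a toy example of our own, not code from the chapter), the loop below fits the slope and intercept of a line by repeatedly stepping both parameters against the derivative of the cost in Eq. (1), summed over all samples:

```python
import numpy as np

# Toy data lying roughly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_actual = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

slope, intercept = 0.0, 0.0        # arbitrary starting point on the cost surface
lr = 0.01                          # learning rate: fixed step-size multiplier

for step in range(5000):
    y_expected = slope * x + intercept
    error = y_expected - y_actual           # derivative of Eq. (1) w.r.t. the prediction
    slope -= lr * np.sum(error * x)         # dC/d(slope), summed over all samples
    intercept -= lr * np.sum(error)         # dC/d(intercept), summed over all samples

print(slope, intercept)   # approaches 2.0 and 1.0
```

Note that every step touches every sample; that per-step cost is exactly what stochastic gradient descent (next subsection) avoids.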

#### **5.2 Stochastic gradient descent**

Stochastic gradient descent is a method for optimizing an objective function iteratively; it can also be viewed as a stochastic approximation of gradient descent optimization. Stochastic gradient descent randomly picks one sample at each step and uses just that one sample to calculate the derivatives; thus, in a three-sample example, stochastic gradient descent reduces the number of terms to be computed by a factor of 3.

If we had one million samples, then stochastic gradient descent would reduce the number of terms to be computed by a factor of one million. In stochastic gradient descent, the updates are applied after each minibatch of samples has been run; the weight updates are thus more frequent, so we reach a global minimum in less time (**Figure 7**).

**Figure 7.** *Comparison of GD and SGD.*
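Continuing the toy example above (again a sketch with our own variable names), the only change is that each step draws a single random sample and differentiates the cost for that sample alone:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_actual = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

rng = np.random.default_rng(seed=0)
slope, intercept = 0.0, 0.0
lr = 0.01

for step in range(20000):
    i = rng.integers(len(x))                 # pick ONE sample at random
    error = (slope * x[i] + intercept) - y_actual[i]
    slope -= lr * error * x[i]               # derivative from this sample only
    intercept -= lr * error
```

Each step is five times cheaper here; with a million samples it would be a million times cheaper, at the price of noisier convergence.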


#### **5.3 Momentum**

In stochastic gradient descent, a fixed multiplier, the learning rate, is used to update the weights or to calculate the step size; this can cause the update to overshoot a potential minimum if the gradient is too steep, or, if the gradient is too shallow, to make the convergence noisy and slow. Momentum, a concept borrowed from physics, keeps a velocity that is an exponentially decaying average of the gradients [33]. This prevents the descent from going in the wrong direction.
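A minimal sketch of the update (our own function and variable names; `beta` is the decay factor for the velocity, typically 0.9):

```python
def momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    """One momentum update: velocity is an exponentially
    decaying average of past gradients [33]."""
    velocity = beta * velocity - lr * grad   # decay old velocity, add new gradient
    return w + velocity, velocity            # move along the smoothed direction

# Usage on a 1-D bowl f(w) = w**2, whose gradient is 2*w:
w, v = 5.0, 0.0
for _ in range(100):
    w, v = momentum_step(w, 2.0 * w, v)
print(w)   # close to the minimum at 0
```

Because successive gradients are averaged, a single steep or noisy gradient cannot fling the weights in the wrong direction on its own.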

#### **5.4 Levenberg-Marquardt algorithm**

This type of algorithm is used for curve fitting or non-linear least-squares problems; it is also called the damped least-squares method, and these kinds of problems usually arise in least-squares curve fitting. It was first introduced by Kenneth Levenberg in 1944, and it was later rediscovered by the statistician Donald Marquardt in 1963.
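For reference (the chapter does not spell out the update), the standard Levenberg-Marquardt step $\boldsymbol{\delta}$ for a residual vector $\mathbf{r}$ with Jacobian $\mathbf{J}$ solves the damped normal equations

$$\left( \mathbf{J}^{\top} \mathbf{J} + \lambda \mathbf{I} \right) \boldsymbol{\delta} = \mathbf{J}^{\top} \mathbf{r}$$

where the damping factor $\lambda$ interpolates between the Gauss-Newton method (small $\lambda$) and a small gradient-descent step (large $\lambda$).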

#### **5.5 Backpropagation through time**

It is one of the famous and standard methods used to train recurrent neural networks and was developed independently by several researchers. Unlike general-purpose optimization techniques, it is faster at training RNNs. However, backpropagation through time also has issues with local optima [34].
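As an illustrative sketch (a scalar RNN of our own, not the chapter's formulation), backpropagation through time unrolls the recurrence in a forward pass and then walks the timesteps in reverse, accumulating each weight's gradient across all of them:

```python
import numpy as np

def bptt_grads(x, y, w_h, w_x, h0=0.0):
    """Gradients of L = 0.5 * sum_t (h_t - y_t)**2 for the scalar RNN
    h_t = tanh(w_h * h_{t-1} + w_x * x_t), unrolled through time."""
    T = len(x)
    h = np.empty(T)
    h_prev = h0
    for t in range(T):                       # forward pass: unroll the recurrence
        h[t] = np.tanh(w_h * h_prev + w_x * x[t])
        h_prev = h[t]

    dw_h = dw_x = 0.0
    dh_next = 0.0                            # gradient arriving from step t + 1
    for t in reversed(range(T)):             # backward pass through time
        dh = (h[t] - y[t]) + dh_next         # local loss gradient + future gradient
        dz = dh * (1.0 - h[t] ** 2)          # back through tanh
        dw_h += dz * (h[t - 1] if t > 0 else h0)
        dw_x += dz * x[t]
        dh_next = dz * w_h                   # hand the gradient to step t - 1
    return dw_h, dw_x
```

The long chain of `w_h` factors accumulated in `dh_next` is also where this method's troubles with local optima and vanishing gradients come from.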

**6. Routine challenges of deep learning**

According to the Google Trends graph, more and more experts and professionals have been attracted toward deep learning in the last five years; the relative interest score increased from 12 to 100 [35, 36]. Deep learning is used everywhere, that is, bioinformatics, computer vision, IoT security, healthcare, e-commerce, digital marketing, natural language processing, and many more [37, 38]. Because it is such a hot research area, it comes with some challenges, which are enlisted below.

#### **6.1 Non-contributing columns or inputs**

When dealing with data or building a model, several inputs are not necessary for finding any feature, so it is advised to drop the unnecessary attributes. It is also necessary to find the one best column and separate it from the dataset; this can be done using a NumPy array in Keras, but it is difficult and challenging to find the best matching attribute.
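A sketch of the routine part (the file and column names here are hypothetical): dropping non-contributing columns with pandas and splitting off the target column as NumPy arrays, which Keras accepts directly:

```python
import pandas as pd

df = pd.read_csv("records.csv")            # hypothetical dataset

# Drop attributes that contribute nothing to the features of interest.
df = df.drop(columns=["record_id", "free_text_notes"])

# Separate the one target column from the inputs; Keras consumes
# the resulting NumPy arrays directly.
y = df.pop("label").to_numpy()
X = df.to_numpy()
```

The hard part, deciding *which* columns actually contribute, is not automated by this snippet.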

#### **6.2 Number of hidden layers**

The number of hidden layers is directly proportional to the computational complexity and the deepness of the network. Dealing with a large number of layers requires a high computational cost, and it is difficult to manage a large number of neurons.
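To make the growth concrete, here is a small sketch using the Keras Sequential API (the input size, width, and depths are arbitrary choices of ours):

```python
from tensorflow import keras

def mlp(n_hidden_layers, width=128):
    """Fully connected network with a configurable number of hidden layers."""
    model = keras.Sequential([keras.Input(shape=(64,))])
    for _ in range(n_hidden_layers):
        model.add(keras.layers.Dense(width, activation="relu"))
    model.add(keras.layers.Dense(1))
    return model

# Every extra hidden layer adds width * (width + 1) trainable parameters,
# so depth translates directly into computational cost.
print(mlp(2).count_params(), mlp(8).count_params())
```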

#### **6.3 Optimization algorithms**

In model optimization, a gradient descent optimizer helps to minimize the model cost by adjusting the weight values; choosing an optimizer is also a challenging task, because a poor choice can sometimes make the cost of the model high rather than decreasing the model cost.
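In Keras, for instance, swapping the optimizer is a one-line change (a sketch reusing the hypothetical `mlp` builder from the previous subsection); the challenge lies in which line to pick, since the optimizer and its learning rate interact with the problem:

```python
from tensorflow import keras

# Candidate optimizers; a badly chosen one (or learning rate) can make
# the cost rise instead of fall.
for opt in (keras.optimizers.SGD(learning_rate=0.01),
            keras.optimizers.RMSprop(),
            keras.optimizers.Adam()):
    model = mlp(4)
    model.compile(optimizer=opt, loss="mse")
```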




