*Advances and Applications in Deep Learning*

Backpropagation is the solution; it takes information from the data without going through classifiers and finds the representation needed for recognition. A few well-known training algorithms are listed below.

**5.1 Gradient descent**

In statistics, data science, and machine learning we optimize many things: when we fit a line with linear regression, we optimize the intercept and the slope; when we use logistic regression, we optimize a squiggle; when we use t-SNE, we optimize clusters. Gradient descent is used to optimize all of these and many others as well.

The gradient descent algorithm is similar to Newton's root-finding algorithm for a function of a single variable. The methodology is simple: pick a random point on the curve and move left or right along the x-axis, depending on whether the slope of the function at that point is positive or negative, until the value of the function f(x) becomes zero. The same concept underlies gradient descent: we traverse a path through a many-dimensional weight space and stop once the error has been reduced below our chosen limit. It is one of the underlying concepts of most deep learning and machine learning algorithms. The error is measured by the squared-error cost

*C* = ½(*Y<sub>expected</sub>* − *Y<sub>actual</sub>*)<sup>2</sup> (1)
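The procedure above can be sketched as a minimal gradient-descent loop that fits the intercept and slope of a line by repeatedly stepping against the gradient of the squared-error cost of Eq. (1). All names, data, and hyperparameters below are illustrative, not taken from the text.

```python
# Minimal gradient descent sketch: fit slope w and intercept b of a line
# by minimizing the mean of 1/2 * (y_expected - y_actual)^2 over the data.

def gradient_descent(xs, ys, lr=0.01, steps=5000):
    w, b = 0.0, 0.0  # start from an arbitrary point in weight space
    n = len(xs)
    for _ in range(steps):
        # Gradients of the mean of 1/2 * (w*x + b - y)^2 w.r.t. w and b
        dw = sum((w * x + b - y) * x for x, y in zip(xs, ys)) / n
        db = sum((w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * dw  # step against the gradient
        b -= lr * db
    return w, b

# Data generated from y = 2x + 1; the fit should recover those values.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = gradient_descent(xs, ys)
print(round(w, 2), round(b, 2))  # close to 2.0 and 1.0
```

Each step uses every sample to compute the gradient, which is the cost that stochastic gradient descent, discussed next, avoids.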

**5.2 Stochastic gradient descent**

Stochastic gradient descent is an iterative method for optimizing an objective function; it can also be called gradient descent optimization. At each step, stochastic gradient descent randomly picks one sample and uses only that sample to calculate the derivatives; in our simple example, this reduces the number of terms computed by a factor of three.

If we had one million samples, then stochastic gradient descent would reduce the number of terms computed by a factor of one million. In stochastic gradient descent, the updates are applied once a minibatch of samples has finished running; the weights are updated more frequently, so we reach a global minimum in less time (**Figure 7**).

**Figure 7.**

*Comparison of GD and SGD.*
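As a sketch of the update just described: at each step a single randomly chosen sample supplies the gradient, instead of the whole dataset. The data, names, and learning rate below are illustrative assumptions.

```python
import random

# Minimal stochastic gradient descent sketch for fitting a line
# y = w*x + b: each update uses one randomly picked sample only.

def sgd(xs, ys, lr=0.01, steps=20000, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        i = rng.randrange(len(xs))     # pick one sample at random
        err = w * xs[i] + b - ys[i]    # residual for that sample only
        w -= lr * err * xs[i]          # one-sample gradient step
        b -= lr * err
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1, no noise
w, b = sgd(xs, ys)
```

With one million samples, each step here would still touch only one of them, which is where the factor-of-a-million saving in computed terms comes from.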

**5.3 Momentum**

In stochastic gradient descent, a fixed multiplier, the learning rate, is used to update the weights, that is, to calculate the step size. This can cause the update to overshoot a potential minimum if the gradient is too steep, or delay convergence if the gradient is noisy. The concept of momentum, borrowed from physics, replaces the raw gradient with a velocity: an exponentially decaying average of past gradients [33]. This prevents the descent from going in the wrong direction.
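The momentum update can be sketched as follows; the decay factor `beta`, the learning rate, and the test function are illustrative assumptions, not values from the text.

```python
# Minimal momentum sketch: keep a velocity v that is an exponentially
# decaying average of past gradients, and step along v instead of the
# raw gradient. Consistent directions accelerate; noisy ones cancel out.

def momentum_minimize(grad, w0, lr=0.1, beta=0.9, steps=200):
    w, v = w0, 0.0
    for _ in range(steps):
        v = beta * v + grad(w)  # decaying average of gradients
        w -= lr * v             # step along the velocity
    return w

# Minimize f(w) = (w - 3)^2, whose gradient is 2*(w - 3).
w = momentum_minimize(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

Setting `beta=0` recovers plain gradient descent, which makes the role of the decaying average easy to see.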
