Abstract

Although it is a very old topic, unconstrained optimization remains an active area of research for many scientists. Today, the results of unconstrained optimization are applied in different branches of science, as well as in practice generally. Here, we present line search techniques. Further, in this chapter we consider some unconstrained optimization methods; we try to present these methods together with some contemporary results in this area.

Keywords: unconstrained optimization, line search, steepest descent method, Barzilai-Borwein method, Newton method, modified Newton method, inexact Newton method, quasi-Newton method

### 1. Introduction

Optimization is a very old subject of great interest; we can search deep into human history to find important examples of optimization in everyday life. For example, the need to produce food led people to seek the best piece of land for farming and, as time went by, the best ways of treating the chosen land and the chosen seedlings to obtain the best results.

From the very beginning of manufacturing, manufacturers have tried to find ways to obtain maximum income with minimum expenses.

There are plenty of examples of optimization processes in pharmacology (for determination of the geometry of a molecule), in meteorology, in optimization of a trajectory of a deep-water vehicle, in optimization of power management (optimization of the production of electrical power plants), etc.

Optimization presents an important tool in decision theory and analysis of physical systems.

Optimization theory is a very developed area with its wide application in science, engineering, business management, military, and space technology.

Optimization can be defined as the process of finding the best solution to a problem in a certain sense and under certain conditions.

Along with the passage of time, optimization was evolving. Optimization became an independent area of mathematics in 1947, when Dantzig presented the so-called simplex algorithm for linear programming.

The development of nonlinear programming accelerated after the introduction of conjugate gradient methods and quasi-Newton methods in the 1950s.


Today, there exist many modern optimization methods which are made to solve a variety of optimization problems. Now, they present the necessary tool for solving problems in diverse fields.

At the beginning, it is necessary to define an objective function, which, for example, could be a technical expense, profit or purity of materials, time, potential energy, etc.

The objective function depends on certain characteristics of the system, which are known as variables. The goal is to find the values of those variables for which the objective function reaches its best value, which we call an extremum or an optimum.

It can happen that those variables must be chosen in such a way that they satisfy certain conditions, i.e., restrictions.

The process of identifying the objective function, variables, and restrictions for a given problem is called modeling.

The first and most important step in an optimization process is the construction of an appropriate model, and this step can be a problem by itself. Namely, if the model is oversimplified, it cannot be a faithful reflection of the practical problem. On the other hand, if the constructed model is too complicated, then solving the problem becomes too complicated as well.

After the construction of the appropriate model, it is necessary to apply an appropriate algorithm to solve the problem. There is no need to emphasize that no universal algorithm exists for solving every given problem.

Sometimes, in applications, the set of input parameters is bounded, i.e., the input parameters have values within the allowed space of input parameters $D\_x$; we can write

$$x \in D\_x. \tag{1}$$

In addition to (1), the next conditions can also be imposed:

$$
\varphi\_l(x\_1, \dots, x\_n) = \varphi\_{0l}, \quad l = 1, \dots, m\_1 \le n,\tag{2}
$$

$$
\psi\_j(x\_1, \dots, x\_n) \le \psi\_{0j}, \quad j = 1, \dots, m\_2. \tag{3}
$$

The optimization task is to find the minimum (maximum) of the objective function $f(x) = f(x\_1, \dots, x\_n)$ under the conditions (1), (2), and (3).

If the objective function and the functions $\varphi\_l(x\_1, \dots, x\_n)$, $l = 1, \dots, m\_1$, and $\psi\_j(x\_1, \dots, x\_n)$, $j = 1, \dots, m\_2$, are all linear, then it is a linear programming problem; if at least one of the mentioned functions is nonlinear, it is a nonlinear programming problem.

The unconstrained optimization problem can be presented as

$$\min\_{\mathbf{x}\in\mathbb{R}^{n}} f(\mathbf{x}),\tag{4}$$

where $f : \mathbb{R}^n \to \mathbb{R}$ is a smooth function.

Problem (4) is, in fact, the unconstrained minimization problem. But, it is well known that an unconstrained minimization problem is equivalent to an unconstrained maximization problem, i.e.,

$$\min f(\mathbf{x}) = -\max(-f(\mathbf{x})),\tag{5}$$

as well as

$$\max f(\mathbf{x}) = -\min(-f(\mathbf{x})).\tag{6}$$

Some Unconstrained Optimization Methods DOI: http://dx.doi.org/10.5772/intechopen.83679

Definition 1.1.1. $x^\*$ is called a global minimizer of $f$ if $f(x^\*) \le f(x)$ for all $x \in \mathbb{R}^n$.

The ideal situation is finding a global minimizer of $f$. Because our knowledge of the function $f$ is usually only local, the global minimizer can be very difficult to find. In fact, most algorithms are able to find only a local minimizer, i.e., a point that achieves the smallest value of $f$ in its neighborhood.

So, we could be satisfied by finding the local minimizer of the function f. We distinguish weak and strict (or strong) local minimizer.

The following two definitions formally introduce a weak and a strict local minimizer of the function $f$, respectively.

Definition 1.1.2. $x^\*$ is called a weak local minimizer of $f$ if there exists a neighborhood $N$ of $x^\*$ such that $f(x^\*) \le f(x)$ for all $x \in N$.

Definition 1.1.3. $x^\*$ is called a strict (strong) local minimizer of $f$ if there exists a neighborhood $N$ of $x^\*$ such that $f(x^\*) < f(x)$ for all $x \in N$, $x \ne x^\*$.

Considering Definitions 1.1.2 and 1.1.3, the procedure of finding a local minimizer (weak or strict) does not seem so easy; it appears that we should examine all points from a neighborhood of $x^\*$, which looks like a very difficult task.

Fortunately, if the objective function $f$ satisfies some special conditions, we can solve this task in a much easier way.

For example, we can assume that the objective function $f$ is smooth or, furthermore, twice continuously differentiable. Then, we concentrate on the gradient $\nabla f(x^\*)$ as well as on the Hessian $\nabla^2 f(x^\*)$.

All algorithms for unconstrained minimization require the user to supply a certain point, the so-called starting point, which we usually denote by $\mathbf{x}\_0$. It is good to choose $\mathbf{x}\_0$ as a reasonable estimate of the solution. But, to find such an estimate, a little more knowledge about the considered set of data is needed, as well as a systematic investigation. So, it often seems much simpler to use one of the algorithms to find $\mathbf{x}\_0$, or to take it arbitrarily.

There exist two important classes of iterative methods, line search methods and trust-region methods, designed to solve the unconstrained optimization problem (4).

In this chapter, we first discuss different kinds of line search. Then, we consider some line search optimization methods in detail, i.e., we study the steepest descent method, the Barzilai-Borwein gradient method, the Newton method, and quasi-Newton methods.

Also, we try to give some of the most recent results in these areas.

### 2. Line search

Now, let us consider the problem

$$\min\_{\mathbf{x}\in\mathbb{R}^n} f(\mathbf{x}),\tag{7}$$

where $f : \mathbb{R}^n \to \mathbb{R}$ is a continuously differentiable function, bounded from below. A great number of methods have been devised to solve the problem (7).

The optimization methods based on line search utilize the next iterative scheme:

$$
\mathbf{x}\_{k+1} = \mathbf{x}\_k + t\_k d\_k, \tag{8}
$$

where $\mathbf{x}\_k$ is the current iterate, $\mathbf{x}\_{k+1}$ is the next iterate, $d\_k$ is the search direction, and $t\_k$ is the step size in the direction $d\_k$.

At first, we consider the monotone line search.

Now, we give the iterative scheme of this kind of search.

Algorithm 1.2.1. (Monotone line search).

Assumptions: $\epsilon > 0$, $\mathbf{x}\_0$, $k := 0$.

Step 1. If $\|\mathbf{g}\_k\| \le \epsilon$, then STOP.

Step 2. Find the descent direction $d\_k$.

Step 3. Find the step size $t\_k$ such that $f(\mathbf{x}\_k + t\_k d\_k) < f(\mathbf{x}\_k)$.

Step 4. Set $\mathbf{x}\_{k+1} = \mathbf{x}\_k + t\_k d\_k$.

Step 5. Take $k := k + 1$ and go to Step 1.

Denote

$$
\Phi(t) = f(\mathbf{x}\_k + t d\_k).
$$

Trying to solve the minimization problem, we search for the step size $t = t\_k$ in the direction $d\_k$ such that the next relation holds:

$$
\Phi(t\_k) < \Phi(0).
$$

That procedure is called the monotone line search. We can search for the step size tk in such a way that the next relation holds:

$$f(\mathbf{x}\_k + t\_k d\_k) = \min\_{t \ge 0} f(\mathbf{x}\_k + t d\_k), \tag{9}$$

i.e.

$$\Phi(t\_k) = \min\_{t \ge 0} \Phi(t), \tag{10}$$

or we can use the next formula:

$$t\_k = \min \left\{ t \mid \mathbf{g}(\mathbf{x}\_k + t d\_k)^T d\_k = 0,\ t \ge 0 \right\}. \tag{11}$$

In this case we are talking about the exact or the optimal line search, where the parameter $t\_k$, obtained as the solution of the one-dimensional problem (10), is the optimal step size.

On the other hand, instead of using relation (9) or relation (11), we can be satisfied with any $t\_k$ for which the next relation holds:

$$f(\mathbf{x}\_k) - f(\mathbf{x}\_k + t\_k d\_k) \ge \delta\_k \ge \mathbf{0}.$$

Then, we are talking about the inexact, approximate, or acceptable line search, which is widely used in practice.

There are several reasons to use the inexact instead of the exact line search. One of them is that the exact line search is expensive. Further, when the iterate is far from the solution, the exact line search is not efficient. Finally, in practice, the convergence rate of many optimization methods (such as Newton or quasi-Newton) does not depend on the exact line search.

First, we are going to mention so-called basic and, by the way, very well-known inexact line searches.

Algorithm 1.2.2. (Backtracking).

Assumptions: $\mathbf{x}\_k$, the descent direction $d\_k$, $0 < \delta < \frac{1}{2}$, $\eta \in (0, 1)$.

Step 1. $t := 1$.

Step 2. While $f(\mathbf{x}\_k + t d\_k) > f(\mathbf{x}\_k) + \delta t \mathbf{g}\_k^T d\_k$, set $t := t \cdot \eta$.

Step 3. Set $t\_k = t$.

Now, we describe the Armijo rule.
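As an illustration, the backtracking procedure can be sketched in Python; the test function $f(\mathbf{x}) = \mathbf{x}^T \mathbf{x}$, its gradient, and the parameter values below are illustrative assumptions, not part of the original presentation.

```python
import numpy as np

def backtracking(f, grad_f, x, d, delta=1e-4, eta=0.5):
    """Backtracking line search (Algorithm 1.2.2): shrink t by the
    factor eta until f(x + t*d) <= f(x) + delta * t * g^T d holds."""
    g = grad_f(x)
    t = 1.0
    while f(x + t * d) > f(x) + delta * t * (g @ d):
        t *= eta
    return t

# Illustrative problem: f(x) = x^T x, steepest descent direction d = -g.
f = lambda x: x @ x
grad_f = lambda x: 2 * x
x = np.array([1.0, 1.0])
d = -grad_f(x)
t = backtracking(f, grad_f, x, d)
```

Note that the loop terminates only for a descent direction $d\_k$; in a full implementation one would also guard against excessively small steps.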

Theorem 1.2.1. [1] Let $f \in C^1(\mathbb{R}^n)$ and let $d\_k$ be a descent direction. Then, there exists a nonnegative integer $m\_k$ such that

$$f(\mathbf{x}\_k + \eta^{m\_k} d\_k) \le f(\mathbf{x}\_k) + c\_1 \eta^{m\_k} \mathbf{g}\_k^T d\_k,$$

where $c\_1 \in (0, 1)$ and $\eta \in (0, 1)$.

Next, we describe the Goldstein rule [2]. The step size $t\_k$ is chosen in such a way that

$$\begin{aligned} f(\mathbf{x}\_k + t d\_k) &\le f(\mathbf{x}\_k) + \delta t \mathbf{g}\_k^T d\_k, \\ f(\mathbf{x}\_k + t d\_k) &\ge f(\mathbf{x}\_k) + (1 - \delta) t \mathbf{g}\_k^T d\_k, \end{aligned}$$

where $0 < \delta < \frac{1}{2}$.

Now, the Wolfe line search rules follow [3, 4]. The standard Wolfe line search conditions are

$$f(\mathbf{x}\_k + t\_k d\_k) - f(\mathbf{x}\_k) \le \delta t\_k \mathbf{g}\_k^T d\_k,\tag{12}$$

$$\mathbf{g}\_{k+1}^T d\_k \ge \sigma \mathbf{g}\_k^T d\_k,\tag{13}$$

where $d\_k$ is a descent direction and $0 < \delta \le \sigma < 1$.

This efficient strategy means that we should accept a positive step length tk, if conditions (12)–(13) are satisfied.

Strong Wolfe line search conditions consist of (12) and the next, stronger version of (13):

$$|\mathbf{g}\_{k+1}^T d\_k| \le -\sigma \mathbf{g}\_k^T d\_k. \tag{14}$$
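A simple numerical check of the standard Wolfe conditions (12)–(13) and the strong variant (14) can be sketched as follows; the quadratic test function and the parameter values are illustrative assumptions.

```python
import numpy as np

def wolfe_conditions(f, grad_f, x, d, t, delta=1e-4, sigma=0.9):
    """Check the step length t along d against the Wolfe conditions.

    Returns (standard_ok, strong_ok):
      standard: sufficient decrease (12) and curvature (13),
      strong:   (12) together with |g_{k+1}^T d| <= -sigma * g_k^T d (14).
    """
    g = grad_f(x)
    g_next = grad_f(x + t * d)
    decrease = f(x + t * d) - f(x) <= delta * t * (g @ d)   # (12)
    curvature = g_next @ d >= sigma * (g @ d)               # (13)
    strong = abs(g_next @ d) <= -sigma * (g @ d)            # (14)
    return decrease and curvature, decrease and strong

# Illustrative check on f(x) = x^T x with d = -g and t = 0.5.
f = lambda x: x @ x
grad_f = lambda x: 2 * x
x = np.array([1.0, 1.0])
d = -grad_f(x)
standard_ok, strong_ok = wolfe_conditions(f, grad_f, x, d, t=0.5)
```

The defaults $\delta = 10^{-4}$ and $\sigma = 0.9$ respect $0 < \delta \le \sigma < 1$.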

In the generalized Wolfe line search conditions, the absolute value in (14) is replaced by the inequalities:

$$
\sigma\_1 \mathbf{g}\_k^T d\_k \le \mathbf{g}\_{k+1}^T d\_k \le -\sigma\_2 \mathbf{g}\_k^T d\_k,\\
0 < \delta < \sigma\_1 < 1, \quad \sigma\_2 \ge 0. \tag{15}
$$

On the other hand, in the approximate Wolfe line search conditions, the inequalities (15) are changed into the next ones:

$$
\sigma \mathbf{g}\_k^T d\_k \le \mathbf{g}\_{k+1}^T d\_k \le (2\delta - 1)\mathbf{g}\_k^T d\_k,\\
0 < \delta < \frac{1}{2}, \quad \delta \le \sigma < 1. \tag{16}
$$

The next lemma is very important.

Lemma 1.2.1. [5] Let $f \in C^1(\mathbb{R}^n)$. Let $d\_k$ be a descent direction at the point $\mathbf{x}\_k$, and assume that the function $f$ is bounded from below along the ray $\{\mathbf{x}\_k + t d\_k \mid t > 0\}$. Then, if $0 < \delta < \sigma < 1$, there exist intervals of step lengths satisfying the standard Wolfe conditions and the strong Wolfe conditions.

On the other hand, the introduction of the non-monotone line search is motivated by the existence of problems where the search direction does not have to be a descent direction. This can happen, for example, in stochastic optimization [6].

Next, some efficient quasi-Newton methods, for example, SR1 update, do not produce the descent direction in every iteration [5].

Further, some efficient methods, like spectral gradient methods, are not monotone at all.

Some numerical results given in [7–11] show that non-monotone techniques are better than monotone ones when the problem is to find the global optimal value of the objective function.

Algorithms of the non-monotone line search do not insist on a descent of the objective function in every step. But, even these algorithms require a reduction of the objective function after a predetermined number of iterations.

The first non-monotone line search technique is presented in [12]. Namely, in [12], the problem is to find the step size which satisfies

$$f(\mathbf{x}\_k + t\_k d\_k) \le \max\_{0 \le j \le m(k)} f\left(\mathbf{x}\_{k-j}\right) + \delta t\_k \mathbf{g}\_k^T d\_k,$$

where $m(0) = 0$, $0 \le m(k) \le \min\{m(k-1) + 1, M\}$ for $k \ge 1$, and $\delta \in (0, 1)$, where $M$ is a nonnegative integer.

This strategy is in fact a generalization of the Armijo line search. In the same work, the authors suppose that the search directions satisfy the next conditions for some positive constants $b\_1$ and $b\_2$:

$$\begin{aligned} \mathbf{g}\_k^T d\_k &\le -b\_1 \|\mathbf{g}\_k\|^2, \\ \|d\_k\| &\le b\_2 \|\mathbf{g}\_k\|. \end{aligned}$$
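A minimal sketch of the acceptance test from [12], where sufficient decrease is measured against the maximum of the last few function values; here we take $m(k) = \min\{k, M\}$, one admissible choice, and the test function, iterate history, and parameters are illustrative assumptions.

```python
import numpy as np

def nonmonotone_armijo_ok(f, grad_f, xs, t, d, M=5, delta=1e-4):
    """Non-monotone Armijo test of Grippo et al.: accept t if
    f(x_k + t*d) <= max of the last m(k)+1 values + delta * t * g_k^T d,
    with the specific choice m(k) = min(k, M)."""
    x_k = xs[-1]
    g_k = grad_f(x_k)
    m_k = min(len(xs) - 1, M)
    f_max = max(f(xs[-1 - j]) for j in range(m_k + 1))
    return f(x_k + t * d) <= f_max + delta * t * (g_k @ d)

# Illustrative history of iterates for f(x) = x^T x.
f = lambda x: float(x @ x)
grad_f = lambda x: 2 * x
xs = [np.array([2.0]), np.array([1.0])]
ok = nonmonotone_armijo_ok(f, grad_f, xs, t=0.5, d=-grad_f(xs[-1]))
```

With $M = 0$ the test reduces to the ordinary monotone Armijo rule.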

The next non-monotone line search is described in [11]. Let x<sup>0</sup> be the starting point, and let

$$0 \le \eta\_{\min} \le \eta\_{\max} \le 1,\\
0 < \delta < \sigma < 1 < \rho, \quad \mu > 0.$$

Let $C\_0 = f(\mathbf{x}\_0)$, $Q\_0 = 1$. The step size has to satisfy the next conditions:

$$f(\mathbf{x}\_k + \mathbf{t}\_k d\_k) \le \mathbf{C}\_k + \delta \mathbf{t}\_k \mathbf{g}\_k^T d\_k,\tag{17}$$

$$\mathbf{g}(\mathbf{x}\_k + t\_k d\_k)^T d\_k \ge \sigma \mathbf{g}\_k^T d\_k. \tag{18}$$

The value $\eta\_k$ is chosen from the interval $[\eta\_{\min}, \eta\_{\max}]$, and then

$$Q\_{k+1} = \eta\_k Q\_k + \mathbf{1},\\
\mathbf{C}\_{k+1} = \frac{\eta\_k Q\_k \mathbf{C}\_k + f(\mathbf{x}\_{k+1})}{Q\_{k+1}}.$$
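The recursion for $Q\_{k+1}$ and $C\_{k+1}$ is cheap to implement; a small sketch with illustrative function values also verifies that the choice $\eta\_k = 1$ reproduces the running average of all function values seen so far.

```python
def zhang_hager_update(C_k, Q_k, f_next, eta_k):
    """One update of the reference value from [11]:
    Q_{k+1} = eta_k * Q_k + 1,
    C_{k+1} = (eta_k * Q_k * C_k + f(x_{k+1})) / Q_{k+1}."""
    Q_next = eta_k * Q_k + 1.0
    C_next = (eta_k * Q_k * C_k + f_next) / Q_next
    return C_next, Q_next

# With eta_k = 1 for all k, C_k equals the running average of the
# function values seen so far (illustrative values below).
values = [5.0, 3.0, 4.0]
C, Q = values[0], 1.0      # C_0 = f(x_0), Q_0 = 1
for v in values[1:]:
    C, Q = zhang_hager_update(C, Q, v, eta_k=1.0)
```

Conversely, $\eta\_k = 0$ gives $C\_{k+1} = f(\mathbf{x}\_{k+1})$, i.e., the monotone test.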

Non-monotone rules which contain a sequence of nonnegative parameters $\{\epsilon\_k\}$ were first used in [13], and they have been successfully used in many other algorithms, for example, in [14]. The next property of the parameters $\epsilon\_k$ is assumed:

$$
\epsilon\_k > 0, \quad \sum\_k \epsilon\_k = \epsilon < \infty,
$$

and the corresponding rule is

$$f(\mathbf{x}\_k + t\_k d\_k) \le f(\mathbf{x}\_k) + c\_1 t\_k \mathbf{g}\_k^T d\_k + \epsilon\_k.$$

Now, we give the non-monotone line search algorithm, shortly NLSA, presented in [11].

#### Algorithm 1.2.3. (NLSA).

Assumptions: $\mathbf{x}\_0$, $0 \le \eta\_{\min} \le \eta\_{\max} \le 1$, $0 < \delta < \sigma < 1 < \rho$, $\mu > 0$. Set $C\_0 = f(\mathbf{x}\_0)$, $Q\_0 = 1$, $k = 0$.


Step 1. If $\|\nabla f(\mathbf{x}\_k)\|$ is sufficiently small, then STOP.

Step 2. Set $\mathbf{x}\_{k+1} = \mathbf{x}\_k + t\_k d\_k$, where $t\_k$ satisfies either the (non-monotone) Wolfe conditions (17) and (18) or the (non-monotone) Armijo conditions: $t\_k = \bar{t}\_k \rho^{h\_k}$, where $\bar{t}\_k > 0$ is the trial step and $h\_k$ is the largest integer such that (17) holds and $t\_k \le \mu$.

Step 3. Choose $\eta\_k \in [\eta\_{\min}, \eta\_{\max}]$, and set

$$Q\_{k+1} = \eta\_k Q\_k + \mathbf{1},\\
\mathbf{C}\_{k+1} = (\eta\_k Q\_k \mathbf{C}\_k + f(\mathbf{x}\_{k+1})) / Q\_{k+1}.$$

Step 4. Set $k := k + 1$ and go to Step 1.

We can notice [11] that $C\_{k+1}$ is a convex combination of $f(\mathbf{x}\_0), f(\mathbf{x}\_1), \dots, f(\mathbf{x}\_{k+1})$. The parameter $\eta\_k$ controls the degree of non-monotonicity.

If $\eta\_k = 0$ for all $k$, then this non-monotone line search becomes the monotone Wolfe or Armijo line search.

If $\eta\_k = 1$ for all $k$, then $C\_k = A\_k$, where

$$A\_k = \frac{1}{k+1} \sum\_{i=0}^k f(\mathbf{x}\_i).$$

Lemma 1.2.2. [11] If $\nabla f(\mathbf{x}\_k)^T d\_k \le 0$ for each $k$, then for the iterates generated by the non-monotone line search algorithm, we have $f\_k \le C\_k \le A\_k$ for each $k$. Moreover, if $\nabla f(\mathbf{x}\_k)^T d\_k < 0$ and $f(\mathbf{x})$ is bounded from below, then there exists $t\_k$ satisfying either the Wolfe or the Armijo conditions of the line search update.

This study would be very incomplete unless we mention that there are many modifications of the abovementioned line searches. All these modifications are made to improve the previous results.

For example, in [15], the new inexact line search is described in the next way. Let $\beta \in (0, 1)$, $\sigma \in \left(0, \frac{1}{2}\right)$, let $B\_k$ be a symmetric positive definite matrix which approximates $\nabla^2 f(\mathbf{x}\_k)$, and let $s\_k = -\frac{\mathbf{g}\_k^T d\_k}{d\_k^T B\_k d\_k}$. The step size $t\_k$ is the largest one in $\{s\_k, s\_k \beta, s\_k \beta^2, \dots\}$ such that

$$f(\varkappa\_k + td\_k) - f(\varkappa\_k) \le \sigma t \left[ \mathbf{g}\_k^T d\_k + \frac{1}{2} t d\_k^T B\_k d\_k \right].$$

Further, in [16], a new inexact line search rule is presented. This rule is a modified version of the classical Armijo line search rule. We describe it now.

Let $\mathbf{g} = \nabla f(\mathbf{x})$ be Lipschitz continuous with Lipschitz constant $L$, and let $L\_k$ be an approximation of $L$. Set

$$\boldsymbol{\beta}\_k = -\frac{\mathbf{g}\_k^T d\_k}{L\_k \|\boldsymbol{d}\_k\|^2}.$$

Find the step size $t\_k$ as the largest element in the set $\{\beta\_k, \beta\_k \rho, \beta\_k \rho^2, \dots\}$ such that the inequality

$$f(\mathbf{x}\_k + t\_k d\_k) \le f(\mathbf{x}\_k) + \sigma t\_k \left( \mathbf{g}\_k^T d\_k - \frac{1}{2} t\_k \mu L\_k \left\| d\_k \right\|^2 \right)$$

holds, where $\sigma \in (0, 1)$, $\mu \in [0, \infty)$, and $\rho \in (0, 1)$ are given constants.

Next, in [17], a new, modified Wolfe line search is given in the next way. Find $t\_k > 0$ such that

$$\begin{aligned} &f(\mathbf{x}\_k + t\_k d\_k) - f(\mathbf{x}\_k) \le \min\left\{\delta t\_k \mathbf{g}\_k^T d\_k, -\gamma t\_k^2 \|d\_k\|^2\right\}, \\ &\mathbf{g}(\mathbf{x}\_k + t\_k d\_k)^T d\_k \ge \sigma \mathbf{g}\_k^T d\_k, \end{aligned}$$

where $\delta \in (0, 1)$, $\sigma \in (\delta, 1)$, and $\gamma > 0$.

More recent results on this topic can be found, for example, in [18–23].

#### 2.1 Steepest descent (SD)

The classical steepest descent method, designed by Cauchy [24], can be considered one of the most important procedures for minimization of a real-valued function defined on $\mathbb{R}^n$.

Steepest descent is one of the simplest minimization methods for unconstrained optimization. Since it uses the negative gradient as its search direction, it is known also as the gradient method.

It has low computational cost and low matrix storage requirements, because it does not need computation of second derivatives to calculate the search direction [25].

Suppose that $f(\mathbf{x})$ is continuously differentiable in a certain neighborhood of a point $\mathbf{x}\_k$, and also suppose that $\mathbf{g}\_k \triangleq \nabla f(\mathbf{x}\_k) \ne 0$.

Using the Taylor expansion of the function $f$ near $\mathbf{x}\_k$, as well as the Cauchy-Schwarz inequality, one can easily prove that the greatest decrease of $f$ is achieved for $d\_k = -\mathbf{g}\_k$, i.e., $-\mathbf{g}\_k$ is the steepest descent direction.

The iterative scheme of the SD method is

$$
\mathbf{x}\_{k+1} = \mathbf{x}\_k - t\_k \mathbf{g}\_k. \tag{19}
$$

The classical steepest descent method uses the exact line search.

Now, we give the algorithm of the steepest descent method which refers to the exact as well as to the inexact line search.

Algorithm 1.2.4. (Steepest descent method, i.e., SD method).

Assumptions: $0 < \epsilon \ll 1$, $\mathbf{x}\_0 \in \mathbb{R}^n$. Let $k = 0$.

Step 1. If $\|\mathbf{g}\_k\| \le \epsilon$, then STOP; else set $d\_k = -\mathbf{g}\_k$.

Step 2. In the case of the exact line search, find the step size $t\_k$ as the solution of the problem

$$\min\_{t \ge 0} f(\mathbf{x}\_k + td\_k),\tag{20}$$

otherwise, find the step size $t\_k$ by any of the inexact line search methods.

Step 3. Set $\mathbf{x}\_{k+1} = \mathbf{x}\_k + t\_k d\_k$.

Step 4. Set $k := k + 1$ and go to Step 1.

The classical and oldest steepest descent step size $t\_k$, designed by Cauchy (in the case of the exact line search), is computed as [26]

$$t\_k = \frac{\mathbf{g}\_k^T \mathbf{g}\_k}{\mathbf{g}\_k^T \mathbf{G} \mathbf{g}\_k},$$

where $\mathbf{g}\_k = \nabla f(\mathbf{x}\_k)$ and $G = \nabla^2 f(\mathbf{x}\_k)$.
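For a strictly convex quadratic $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T G \mathbf{x} - \mathbf{b}^T \mathbf{x}$, where $\mathbf{g}\_k = G\mathbf{x}\_k - \mathbf{b}$, the Cauchy step admits this closed form; a minimal Python sketch follows (the matrix, vector, and tolerances are illustrative assumptions).

```python
import numpy as np

def steepest_descent_quadratic(G, b, x0, eps=1e-8, max_iter=1000):
    """SD with the exact (Cauchy) step for f(x) = 0.5 x^T G x - b^T x,
    where g_k = G x_k - b and t_k = (g_k^T g_k) / (g_k^T G g_k)."""
    x = x0.astype(float)
    for _ in range(max_iter):
        g = G @ x - b
        if np.linalg.norm(g) <= eps:
            break
        t = (g @ g) / (g @ (G @ g))   # Cauchy step size
        x = x - t * g                 # iteration (19)
    return x

# Illustrative 2x2 problem; the minimizer solves G x = b.
G = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_star = steepest_descent_quadratic(G, b, np.zeros(2))
```

On ill-conditioned problems the same code exhibits the well-known slow zigzag behavior of SD.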

Theorem 1.2.2. [27] (Global convergence theorem of the SD method) Let $f \in C^1$. Then, each accumulation point of the iterative sequence $\{\mathbf{x}\_k\}$, generated by Algorithm 1.2.4, is a stationary point.

Remark 1.2.1. The steepest descent method has at least the linear convergence rate. More information about the convergence of the SD method can be found in [5, 27].

Although known as the first unconstrained optimization method, this method is still a theme considered by scientists.

Different modifications of this method are made, for example, see [25, 28–32].

In [28], the authors presented a new search direction for Cauchy's method, in the form of two parameters, known as the Zubai'ah-Mustafa-Rivaie-Ismail method, shortly, the ZMRI method:

$$d\_k = -\mathbf{g}\_k - \|\mathbf{g}\_k\| \mathbf{g}\_{k-1}.\tag{21}$$

So, in [28], a new modification of SD method is suggested using a new search direction, dk, given by (21). The numerical results are presented based on the number of iterations and CPU time. It is shown that this new method is efficient when it is compared to the classical SD.

In [25], a new scaled search direction of SD method is presented. The inspiration for this new method is the work of Andrei [33], in which the author presents and analyzes a new scaled conjugate gradient algorithm, based on an interpretation of the secant equation and on the inexact Wolfe line search conditions.

The method proposed in [25] is known as Rashidah-Rivaie-Mamat (RRM) method, and it suggests the direction dk given by the next relation:

$$d\_k = \begin{cases} -\mathbf{g}\_k, & \text{if } k = 0, \\ -\theta\_k \mathbf{g}\_k - \|\mathbf{g}\_k\| \mathbf{g}\_{k-1}, & \text{if } k \ge 1, \end{cases} \tag{22}$$

where $\theta\_k$ is a scaling parameter, $\theta\_k = \frac{d\_{k-1}^T y\_{k-1}}{\|\mathbf{g}\_{k-1}\|^2}$, $y\_{k-1} = \mathbf{g}\_k - \mathbf{g}\_{k-1}$.

Further, in [25], a comparison among the RRM, ZMRI, and SD methods is made; it is shown that the RRM method is better than the ZMRI and SD methods.

It is interesting that the exact line search is used in [25].

In [34], the properties of steepest descent method from the literature are reviewed together with advantages and disadvantages of each step size procedure.

Namely, the step size procedures, which are compared in this paper, are:

1. $t\_k = \frac{\mathbf{g}\_k^T \mathbf{g}\_k}{\mathbf{g}\_k^T H\_k \mathbf{g}\_k}$: step size method by Cauchy [24], computed by the exact line search (C step size).

2. Given $s > 0$ and $\beta, \sigma \in (0, 1)$, $t\_k = \max\{s, s\beta, s\beta^2, \dots\}$ such that $f(\mathbf{x}\_k + t\_k d\_k) \le f(\mathbf{x}\_k) + \sigma t\_k \mathbf{g}\_k^T d\_k$: Armijo's line search (A step size).

3. Given $\beta, \sigma \in (0, 1)$ and $\tilde{t}\_0 = 1$, $t\_k = \beta \tilde{t}\_k$ such that $f(\mathbf{x}\_k + t\_k d\_k) \le f(\mathbf{x}\_k) + \sigma t\_k \mathbf{g}\_k^T d\_k$: backtracking line search (B step size).

4. $t\_k = \frac{s\_{k-1}^T y\_{k-1}}{\|y\_{k-1}\|^2}$ (BB1) and $t\_k = \frac{\|s\_{k-1}\|^2}{s\_{k-1}^T y\_{k-1}}$ (BB2), where $s\_{k-1} = \mathbf{x}\_k - \mathbf{x}\_{k-1}$ and $y\_{k-1} = \mathbf{g}\_k - \mathbf{g}\_{k-1}$: Barzilai and Borwein's formulas. The convergence is R-superlinear.

5. $t\_k = \frac{t\_{k-1}^2 \mathbf{g}\_k^T \mathbf{g}\_k}{2 \left( f(\mathbf{x}\_k + t\_{k-1} d\_k) - f(\mathbf{x}\_k) + t\_{k-1} \mathbf{g}\_k^T \mathbf{g}\_k \right)}$: elimination line search (EL step size), which estimates the step size without computation of the Hessian.

The comparison is based on time execution, number of total iteration, total percentage of function, gradient and Hessian evaluation, and the most decreased value of objective function obtained.

From the numerical results, the authors conclude that the A method and the BB1 method are the best among the considered methods.

Further, in [34], general conclusions about the steepest descent method are given; among them, that $\mathbf{x}\_k$ approaches the minimizer slowly, in fact in a zigzag way.

In [35], with the aim of achieving fast convergence and the monotone property, a new step size for the steepest descent method is suggested.

In [36], for quadratic positive definite problems, an over-relaxation has been considered. Namely, Raydan and Svaiter [36] proved that the poor behavior of the steepest descent method is due to the optimal Cauchy choice of step size and not to the choice of the search direction. These results are extended in [29] to convex, well-conditioned functions. Further, in [29], it is shown that a simple modification of the step length, by means of a random variable uniformly distributed in $(0, 1]$, represents an improvement of the classical gradient descent algorithm for strongly convex functions. Namely, in this paper, the idea is to modify the gradient descent method by introducing a relaxation of the following form:

$$\mathbf{x}\_{k+1} = \mathbf{x}\_k + \theta\_k \mathbf{t}\_k d\_k,\tag{23}$$

where θ<sup>k</sup> is the relaxation parameter, a random variable uniformly distributed between 0 and 1.

In the recent years, the steepest descent method has been applied in many branches of science; one can be inspired, for example, by [37–43].

#### 2.2 Barzilai and Borwein gradient method

Recall that the SD method performs poorly, converges linearly, and is badly affected by ill-conditioning.

Also, recall that this poor behavior of the SD method is due to the optimal choice of the step size and not to the choice of the steepest descent direction $-\mathbf{g}\_k$.

Barzilai and Borwein presented [44] a two-point step size gradient method, which is well known as BB method.

The step size is derived from a two-point approximation to the secant equation. Consider the gradient iteration form:

$$
\mathbf{x}\_{k+1} = \mathbf{x}\_k - t\_k \mathbf{g}\_k.
$$

It can be rewritten as $\mathbf{x}\_{k+1} = \mathbf{x}\_k - D\_k \mathbf{g}\_k$, where $D\_k = t\_k I$.

To make the matrix $D\_k$ have the quasi-Newton property, the step size $t\_k$ is computed so that

$$\min \|s\_{k-1} - D\_k \mathcal{y}\_{k-1}\|.$$


This yields that

$$t\_k^{BB1} = \frac{s\_{k-1}^T y\_{k-1}}{y\_{k-1}^T y\_{k-1}}, \mathbf{s}\_{k-1} = \mathbf{x}\_k - \mathbf{x}\_{k-1}, y\_{k-1} = \mathbf{g}\_k - \mathbf{g}\_{k-1}.\tag{24}$$

But, using symmetry, we may minimize $\|D\_k^{-1} s\_{k-1} - y\_{k-1}\|$ with respect to $t\_k$, and we get:

$$t\_k^{BB2} = \frac{\left\|\mathbf{s}\_{k-1}\right\|^2}{s\_{k-1}^T y\_{k-1}}, \mathbf{s}\_{k-1} = \mathbf{x}\_k - \mathbf{x}\_{k-1}, y\_{k-1} = \mathbf{g}\_k - \mathbf{g}\_{k-1}.\tag{25}$$

Now, we give the algorithm of BB method.

Algorithm 1.2.5. (Barzilai-Borwein gradient method, i.e., BB method).

Assumptions: $0 < \epsilon \ll 1$, $\mathbf{x}\_0 \in \mathbb{R}^n$. Let $k = 0$.

Step 1. If $\|\mathbf{g}\_k\| \le \epsilon$, then STOP; else set $d\_k = -\mathbf{g}\_k$.

Step 2. If $k = 0$, then find the step size $t\_0$ by a line search; else compute $t\_k$ using formula (24) or (25).

Step 3. Set $\mathbf{x}\_{k+1} = \mathbf{x}\_k + t\_k d\_k$.

Step 4. Set $k := k + 1$ and go to Step 1.
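A minimal sketch of this method with the BB1 step (24), applied to a strictly convex quadratic; the test problem, the fixed initial step, and the tolerances are illustrative assumptions.

```python
import numpy as np

def bb_method(grad_f, x0, eps=1e-8, t0=1.0, max_iter=500):
    """Barzilai-Borwein gradient method with the BB1 step (24):
    t_k = (s_{k-1}^T y_{k-1}) / (y_{k-1}^T y_{k-1}), d_k = -g_k."""
    x = x0.astype(float)
    g = grad_f(x)
    t = t0                        # t_0 would come from a line search; fixed here
    for _ in range(max_iter):
        if np.linalg.norm(g) <= eps:
            break
        x_new = x - t * g
        g_new = grad_f(x_new)
        s, y = x_new - x, g_new - g
        t = (s @ y) / (y @ y)     # BB1 step size (24)
        x, g = x_new, g_new
    return x

# Illustrative strictly convex quadratic f(x) = 0.5 x^T A x - b^T x.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x_star = bb_method(lambda x: A @ x - b, np.zeros(2))
```

Note that no matrix factorization and no line search (after the first step) are needed, in line with the remarks below.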

Considering Algorithm 1.2.5, we can conclude that this method does not require any matrix computation or any line search.

The Barzilai-Borwein method is in fact a gradient method which requires less computational work than the SD method, and it speeds up the convergence of the gradient method. Barzilai and Borwein proved that the BB algorithm is R-superlinearly convergent for the two-dimensional quadratic case.

In the general non-quadratic case, a globalization strategy based on nonmonotone line search is applied in this method.

In this general case, $t\_k$, computed by (24) or (25), may be unacceptably large or small. That is the reason why we assume that there exist numbers $t^l$ and $t^r$ such that

$$0 < t^l \le t\_k \le t^r, \text{ for all } k.$$

Using the iteration

$$\mathbf{x}\_{k+1} = \mathbf{x}\_k - \frac{1}{t\_k} \mathbf{g}\_k = \mathbf{x}\_k - \lambda\_k \mathbf{g}\_k,\tag{26}$$

with

$$t\_k = \frac{s\_{k-1}^T \mathcal{Y}\_{k-1}}{s\_{k-1}^T s\_{k-1}}, \lambda\_k = \frac{1}{t\_k},$$

$$s\_k = -\frac{1}{t\_k} \mathbf{g}\_k = -\lambda\_k \mathbf{g}\_k,$$

we get

$$t\_{k+1} = \frac{s\_k^T \mathcal{Y}\_k}{s\_k^T s\_k} = \frac{-\lambda\_k \mathcal{g}\_k^T \mathcal{Y}\_k}{\lambda\_k^2 \mathcal{g}\_k^T \mathcal{g}\_k} = -\frac{\mathcal{g}\_k^T \mathcal{Y}\_k}{\lambda\_k \mathcal{g}\_k^T \mathcal{g}\_k}.$$

Now, we give the algorithm of the Barzilai-Borwein method with non-monotone line search.

Algorithm 1.2.6. (BB method with non-monotone line search).

Assumptions: $0 < \epsilon \ll 1$, $\mathbf{x}\_0 \in \mathbb{R}^n$, an integer $M \ge 0$, $\rho \in (0, 1)$, $\delta > 0$, $0 < \sigma\_1 < \sigma\_2 < 1$, $t^l$, $t^r$. Let $k = 0$.

Step 1. If $\|\mathbf{g}\_k\| \le \epsilon$, then STOP.

Step 2. If $t\_k \le t^l$ or $t\_k \ge t^r$, then set $t\_k = \delta$.

Step 3. Set $\lambda = \frac{1}{t\_k}$.

Step 4. (non-monotone line search) If

$$f(\mathbf{x}\_k - \lambda \mathbf{g}\_k) \le \max\_{0 \le j \le \min(k, \mathcal{M})} f(\mathbf{x}\_{k-j}) - \rho \lambda \mathbf{g}\_k^T \mathbf{g}\_k,$$

then set

$$\lambda\_k = \lambda, \quad x\_{k+1} = x\_k - \lambda\_k g\_k,$$

and go to Step 6.

Step 5. Choose $\sigma \in [\sigma\_1, \sigma\_2]$, set $\lambda = \sigma\lambda$, and go to Step 4.

Step 6. Set $t\_{k+1} = -\frac{g\_k^T y\_k}{\lambda\_k g\_k^T g\_k}$ and $k := k + 1$, and return to Step 1.

Obviously, the above algorithm is globally convergent.
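The steps above can be sketched in Python as follows; everything here (the function name `bb_nonmonotone`, the default parameter values, the test quadratic) is an illustrative choice of ours, not prescribed by the text:

```python
import numpy as np

def bb_nonmonotone(f, grad, x0, eps=1e-6, M=10, rho=1e-4,
                   delta=1.0, sigma=0.5, t_l=1e-10, t_r=1e10,
                   max_iter=1000):
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    t_k = delta
    f_hist = [f(x)]
    for _ in range(max_iter):
        if np.linalg.norm(g) <= eps:          # Step 1
            break
        if not (t_l <= t_k <= t_r):           # Step 2: safeguard t_k
            t_k = delta
        lam = 1.0 / t_k                       # Step 3
        fmax = max(f_hist[-(M + 1):])         # non-monotone reference value
        while f(x - lam * g) > fmax - rho * lam * (g @ g):  # Step 4
            lam *= sigma                      # Step 5: shrink lambda
        x_new = x - lam * g
        g_new = grad(x_new)
        y = g_new - g
        t_k = -(g @ y) / (lam * (g @ g))      # Step 6: next BB quantity
        x, g = x_new, g_new
        f_hist.append(f(x))
    return x

# usage on a convex quadratic f(x) = x^T A x / 2 - b^T x
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
x_star = bb_nonmonotone(f, grad, np.zeros(2))
```

Note that on a non-quadratic function the quantity computed in Step 6 can become negative or huge, which is exactly what the safeguard of Step 2 catches.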

Several authors paid attention to the Barzilai-Borwein method, and they proposed some variants of this method.

In [8], the globally convergent Barzilai-Borwein method is proposed, using the non-monotone line search of Grippo et al. [12]. In the same paper, Raydan proves the global convergence of the non-monotone Barzilai-Borwein method.

Further, Grippo and Sciandrone [45] propose another type of the non-monotone Barzilai-Borwein method.

Dai [7] gives the basic analysis of the non-monotone line search strategy. Moreover, in [46] numerical results are presented, using

$$t\_k = \frac{s\_{\nu(k)}^T y\_{\nu(k)}}{s\_{\nu(k)}^T s\_{\nu(k)}},\tag{27}$$

and

$$\nu(k) = M\_c \left\lfloor \frac{k-1}{M\_c} \right\rfloor + 1,$$

where, for $r \in R$, $\lfloor r \rfloor$ denotes the largest integer $j$ such that $j \le r$, and $M\_c$ is a positive integer. The gradient method with (27) is called the cyclic Barzilai-Borwein method. Numerical results in [46] show that this method performs better than the Barzilai-Borwein method.

Many researchers study the gradient method for minimizing a strictly convex quadratic function, namely,

$$\min f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T A \mathbf{x} - \mathbf{b}^T \mathbf{x},\tag{28}$$

where $A \in R^{n \times n}$ is a symmetric positive definite matrix and $b \in R^n$ is a given vector. For an application of the Barzilai-Borwein method to the problem (28), Raydan [47] establishes global convergence, and Dai and Liao [48] prove the R-linear rate of convergence. Friedlander, Martinez, Molina, and Raydan [49] propose a new gradient method with retards, in which $t\_k$ is defined by

Some Unconstrained Optimization Methods DOI: http://dx.doi.org/10.5772/intechopen.83679

$$t\_k = \frac{g\_{\nu(k)}^T A^{\rho(k)+1} g\_{\nu(k)}}{g\_{\nu(k)}^T A^{\rho(k)} g\_{\nu(k)}}, \quad \nu(k) \in \{k, k-1, \dots, \max\{0, k-m\}\} \tag{29}$$

and $\rho(k) \in \{q\_1, \dots, q\_m\}$, where $m$ is a positive integer and $q\_1, \dots, q\_m \ge -2$ are integers. In the same paper, they establish its global convergence for problem (28) and prove the Q-superlinear rate of convergence in a special case.

In [50], the authors extend the Barzilai-Borwein method to the extended Barzilai-Borwein method, which they denote EBB. They also establish global and Q-superlinear convergence properties of the proposed method for minimizing a strictly convex quadratic function. Furthermore, they discuss an application of their method to general objective functions. In [50], following Friedlander et al. [49], a new step size is proposed by extending (29) as follows:

$$\begin{aligned} t\_k &= \sum\_{i=1}^l \phi\_i \frac{g\_{\nu\_i(k)}^T A^{\rho\_i(k)+1} g\_{\nu\_i(k)}}{g\_{\nu\_i(k)}^T A^{\rho\_i(k)} g\_{\nu\_i(k)}}, \\ \phi\_i &\ge 0, \quad \sum\_{i=1}^l \phi\_i = 1, \\ \nu\_i(k) &\in \{k, k-1, \dots, \max\{0, k-m\}\} \end{aligned}$$

and

$$\rho\_i(k) \in \{q\_1, \ldots, q\_m\},$$

where $l$ and $m$ are positive integers and $q\_1, \dots, q\_m$ are integers.

Also, an application of algorithm EBB to general unconstrained minimization problems (4) is considered.

Following Raydan [8], the authors [50] further combine the non-monotone line search and algorithm EBB to get the algorithm called NEBB. They also prove the global convergence of the algorithm NEBB, under some classical assumptions.

The Barzilai-Borwein method and its related methods are reviewed by Dai and Yuan [51] and Fletcher [52].

In [53], a new concept of the approximate optimal step size for gradient method is introduced and used to interpret the BB method; an efficient gradient method with the approximate optimal step size for unconstrained optimization is presented. The next definition is introduced in [53].

Definition 1.2.1. Let $\Phi(t)$ be an approximation model of $f(x\_k - t g\_k)$. A positive constant $t^*$ is called the approximate optimal step size associated to $\Phi(t)$ for the gradient method if $t^*$ satisfies

$$t^\* = \arg\min\_{t \ge 0} \Phi(t).$$

The approximate optimal step size is different from the exact steepest descent step size, which can lead to expensive computational cost. The approximate optimal step size is generally easy to calculate, and it can be applied to unconstrained optimization.

Due to the effectiveness of $t\_k^{BB1}$ and the fact that $t\_k^{BB1} = \arg\min\_{t > 0} \Phi(t)$, we can naturally ask whether more suitable approximation models can be constructed to generate more efficient approximate optimal step sizes.


This is the purpose of the work [53]. Further, if the objective function $f(x)$ is not close to a quadratic function on the line segment between $x\_{k-1}$ and $x\_k$, a conic model is developed in this paper to generate the approximate optimal step size, if the conic model is suitable to be used. Otherwise, the authors consider two cases:


In [54], a derivative-free iterative scheme that uses the residual vector as search direction for solving large-scale systems of nonlinear monotone equations is presented.

The Barzilai-Borwein method is widely used; some interesting results can be found in [55–57].

#### 2.3 Newton method

The basic idea of the Newton method for unconstrained optimization is the iterative use of the quadratic approximation $q^{(k)}$ to the objective function $f$ at the current iterate $x\_k$, and then the minimization of this approximation $q^{(k)}$.

Let $f : R^n \to R$ be twice continuously differentiable, $x\_k \in R^n$, and let the Hessian $\nabla^2 f(x\_k)$ be positive definite.

We model $f$ at the current point $x\_k$ by the quadratic approximation $q^{(k)}$:

$$f(x\_k + s) \approx q^{(k)}(s) = f(x\_k) + \nabla f(x\_k)^T s + \frac{1}{2} s^T \nabla^2 f(x\_k) s, \quad s = x - x\_k.$$

Minimization of $q^{(k)}(s)$ gives the next iterative scheme:

$$x\_{k+1} = x\_k - \left(\nabla^2 f(x\_k)\right)^{-1} \nabla f(x\_k),$$

which is known as the Newton formula. Denote $G\_k = \nabla^2 f(x\_k)$, $g\_k = \nabla f(x\_k)$. Then, we have a simpler form:

$$x\_{k+1} = x\_k - G\_k^{-1} g\_k. \tag{30}$$

The Newton direction is

$$s\_k = x\_{k+1} - x\_k = -G\_k^{-1} g\_k. \tag{31}$$

We have supposed that $G\_k$ is positive definite, so the Newton direction is a descent direction. This we can conclude from

$$g\_k^T s\_k = -g\_k^T G\_k^{-1} g\_k < 0, \quad g\_k \ne 0.$$


Now, we give the algorithm of the Newton method.

Algorithm 1.2.7. (Newton method).

Assumptions: $\epsilon > 0$, $x\_0 \in R^n$. Let $k = 0$.

Step 1. If $\|g\_k\| \le \epsilon$, then STOP.

Step 2. Solve $G\_k s = -g\_k$ for $s\_k$.

Step 3. Set $x\_{k+1} = x\_k + s\_k$.

Step 4. $k := k + 1$, return to Step 1.
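A minimal Python sketch of the pure Newton iteration; the test function $f(x, y) = x^4 + x^2 + y^2$ and all names and tolerances are illustrative choices of ours:

```python
import numpy as np

def newton(grad, hess, x0, eps=1e-8, max_iter=50):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:       # Step 1
            break
        s = np.linalg.solve(hess(x), -g)   # Step 2: solve G_k s = -g_k
        x = x + s                          # Steps 3-4
    return x

# usage on f(x, y) = x**4 + x**2 + y**2, whose minimizer is the origin
grad = lambda x: np.array([4 * x[0] ** 3 + 2 * x[0], 2 * x[1]])
hess = lambda x: np.array([[12 * x[0] ** 2 + 2, 0.0], [0.0, 2.0]])
x_star = newton(grad, hess, np.array([1.0, 1.0]))
```

Solving the linear system $G\_k s = -g\_k$ is preferable in practice to forming $G\_k^{-1}$ explicitly, which is why the sketch follows Step 2 of the algorithm rather than formula (30).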

The next theorem shows the local convergence and the quadratic convergence rate of Newton method.

Theorem 1.2.3. [27] (Convergence theorem of Newton method) Let $f \in C^2$ and let $x\_k$ be close enough to the solution $x^*$ of the minimization problem with $g(x^*) = 0$. If the Hessian $G(x^*)$ is positive definite and $G(x)$ satisfies the Lipschitz condition

$$|G\_{ij}(x) - G\_{ij}(y)| \le \beta \|x - y\|, \text{ for some } \beta, \text{ for all } i, j,$$

where $G\_{ij}(x)$ is the $(i, j)$ element of $G(x)$, then for all $k$ the Newton direction (31) is well-defined, and the generated sequence $\{x\_k\}$ converges to $x^*$ with a quadratic rate.

But, in spite of this quadratic rate, the Newton method is a local method: when the starting point is far away from the solution, $G\_k$ may fail to be positive definite, and the Newton direction may fail to be a descent direction.

So, to guarantee global convergence, we can use the Newton method with line search. Recall that the Newton method converges with the quadratic rate only when the step size sequence $\{t\_k\}$ tends to 1.

Newton iteration with line search is as follows:

$$d\_k = -G\_k^{-1} g\_k, \tag{32}$$

$$x\_{k+1} = x\_k + t\_k d\_k. \tag{33}$$

Now, we give the algorithm.

Algorithm 1.2.8. (Newton method with line search).

Assumptions: $\epsilon > 0$, $x\_0 \in R^n$. Let $k = 0$.

Step 1. If $\|g\_k\| \le \epsilon$, then STOP.

Step 2. Solve $G\_k d = -g\_k$ for $d\_k$.

Step 3. Line search step: find $t\_k$ such that

$$f(\mathbf{x}\_k + t\_k d\_k) = \min\_{t \ge 0} f(\mathbf{x}\_k + t d\_k),$$

or find tk such that (inexact) Wolfe line search rules hold.

Step 4. Set xkþ<sup>1</sup> ¼ xk þ tkdk and k ¼ k þ 1, and go to Step 1.

The next theorems claim that Algorithm 1.2.8 is globally convergent with the exact line search, as well as with the inexact line search.

Theorem 1.2.4. [27] Let $f : R^n \to R$ be twice continuously differentiable on an open convex set $D \subset R^n$. Assume that for any $x\_0 \in D$ there exists a constant $m > 0$ such that $f(x)$ satisfies

$$u^T \nabla^2 f(x) u \ge m \|u\|^2, \text{ for all } u \in R^n, x \in L(x\_0),\tag{34}$$

where $L(x\_0) = \{x \mid f(x) \le f(x\_0)\}$ is the corresponding level set. Then the sequence $\{x\_k\}$, generated by Algorithm 1.2.8 with the exact line search, satisfies:


Note that the next relation holds from the standard Wolfe line search:

$$f(x\_k) - f(x\_k + t\_k d\_k) \ge \eta \|g\_k\|^2 \cos^2 \angle(d\_k, -g\_k),\tag{35}$$

where the constant η does not depend on k.

Theorem 1.2.5. [27] Let $f : R^n \to R$ be twice continuously differentiable on an open convex set $D \subset R^n$. Assume that for any $x\_0 \in D$ there exists a constant $m > 0$ such that $f(x)$ satisfies the relation (34) on the level set $L(x\_0)$. If the line search satisfies the relation (35), then the sequence $\{x\_k\}$, generated by Algorithm 1.2.8 with the inexact Wolfe line search, satisfies

$$\lim\_{k \to \infty} \|\mathbf{g}\_k\| = \mathbf{0}$$

and $\{x\_k\}$ converges to the unique minimizer of $f(x)$.

#### 2.4 Modified Newton method

The main problem with the Newton method could be the fact that the Hessian $G\_k$ may not be positive definite. In that case, we are not sure that the quadratic model has a minimizer; furthermore, when $G\_k$ is indefinite, the quadratic model is unbounded below.

So, many modified schemes have been devised. Now, we briefly describe the next two methods.

In [58], Goldstein and Price use the steepest descent direction when $G\_k$ is not positive definite. Denoting the angle between $d\_k$ and $-g\_k$ by $\theta$, and having in view the angle rule $\theta \le \frac{\pi}{2} - \mu$, where $\mu > 0$, they determine the direction $d\_k$ as

$$d\_k = \begin{cases} -\mathbf{G}\_k^{-1} \mathbf{g}\_k, & \text{if } \cos \theta \ge \eta, \\\ -\mathbf{g}\_k, & \text{otherwise}, \end{cases}$$

where $\eta > 0$ is a given constant.

In [59], the authors present another modified Newton method. When $G\_k$ is not positive definite, the Hessian $G\_k$ is changed into $G\_k + \nu\_k I$, where $\nu\_k > 0$ is chosen in such a way that $G\_k + \nu\_k I$ is positive definite and well-conditioned. Otherwise, when $G\_k$ is positive definite, $\nu\_k = 0$.
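This shifting strategy can be sketched as follows; the trial-shift schedule (`tau0`, `beta`) and the use of a Cholesky factorization as the positive definiteness test are illustrative choices of ours, not prescribed by [59]:

```python
import numpy as np

def regularized_newton_direction(G, g, tau0=1e-3, beta=10.0):
    n = G.shape[0]
    nu = 0.0
    while True:
        try:
            # Cholesky succeeds iff G + nu*I is positive definite
            L = np.linalg.cholesky(G + nu * np.eye(n))
            break
        except np.linalg.LinAlgError:
            nu = tau0 if nu == 0.0 else beta * nu   # increase the shift
    # solve (G + nu*I) d = -g using the Cholesky factor
    d = np.linalg.solve(L.T, np.linalg.solve(L, -g))
    return d, nu

# usage: an indefinite matrix standing in for a Hessian G_k
G = np.array([[1.0, 0.0], [0.0, -2.0]])
g = np.array([1.0, 1.0])
d, nu = regularized_newton_direction(G, g)
```

Since the shifted matrix is positive definite, the resulting direction always satisfies the descent condition $g\_k^T d\_k < 0$.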

To consider the other modified Newton methods, such as finite difference Newton method, negative curvature direction method, Gill-Murray stable Newton method, etc., one can see [27], for example.

#### 2.5 Inexact Newton method

On the other hand, because of the high cost of the exact Newton method, especially when the dimension $n$ is large, the inexact Newton method might be a good solution. This type of method means that we solve the Newton equation only approximately.

Consider solving the nonlinear equations:

$$F(\mathbf{x}) = \mathbf{0},\tag{36}$$

where $F : R^n \to R^n$ is assumed to have the next properties:

A1. There exists $x^*$ such that $F(x^*) = 0$.

A2. $F$ is continuously differentiable in a neighborhood of $x^*$.

A3. $F'(x^*)$ is nonsingular.

Recall that the basic Newton step is obtained by solving

$$F'(\mathfrak{x}\_k)\mathfrak{s}\_k = -F(\mathfrak{x}\_k)$$

and setting

$$x\_{k+1} = x\_k + s\_k.$$

The inexact Newton method means that we solve

$$F'(x\_k) s\_k = -F(x\_k) + r\_k, \tag{37}$$

where

$$\|\boldsymbol{r}\_{k}\| \leq \eta\_{k} \|\boldsymbol{F}(\boldsymbol{x}\_{k})\|.\tag{38}$$

Set

$$x\_{k+1} = x\_k + s\_k. \tag{39}$$

Here, $r\_k$ denotes the residual, and the sequence $\{\eta\_k\}$, where $0 < \eta\_k < 1$, controls the inexactness.
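A minimal sketch of iteration (37)–(39): instead of a real truncated linear solver, the inner solve is perturbed by the deliberately chosen residual $r\_k = \eta F(x\_k)$, which satisfies (38) with equality. The test system and all names are illustrative assumptions of ours:

```python
import numpy as np

def inexact_newton(F, J, x0, eta=0.1, eps=1e-10, max_iter=100):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        Fx = F(x)
        if np.linalg.norm(Fx) <= eps:
            break
        r = eta * Fx                        # residual with ||r|| = eta*||F(x_k)||
        s = np.linalg.solve(J(x), -Fx + r)  # inexact Newton equation (37)
        x = x + s                           # step (39)
    return x

# usage: F(x) = (x1 + x2 - 3, x1^2 + x2^2 - 9); one root is (0, 3)
F = lambda x: np.array([x[0] + x[1] - 3.0, x[0] ** 2 + x[1] ** 2 - 9.0])
J = lambda x: np.array([[1.0, 1.0], [2 * x[0], 2 * x[1]]])
x_star = inexact_newton(F, J, np.array([1.0, 5.0]))
```

In practice $r\_k$ arises from stopping an iterative linear solver early rather than from an explicit perturbation, but the convergence behavior is governed by the bound (38) in either case.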

Now, we give two theorems; the first of them claims the linear convergence, and the second claims the superlinear convergence of the inexact Newton method.

Theorem 1.2.6. [27] Let $F : R^n \to R^n$ satisfy the assumptions A1–A3. Let the sequence $\{\eta\_k\}$ satisfy $0 \le \eta\_k \le \eta < t < 1$. Then, for some $\epsilon > 0$, if the starting point is sufficiently near $x^*$, the sequence $\{x\_k\}$ generated by the inexact Newton method (37)–(39) converges to $x^*$, and the convergence rate is linear, i.e.,

$$\|x\_{k+1} - x^*\|\_* \le t \|x\_k - x^*\|\_*,$$

where $\|y\|\_* = \|F'(x^*) y\|$.

Theorem 1.2.7. [27] Let all assumptions of Theorem 1.2.6 hold. Assume that the sequence $\{x\_k\}$, generated by the inexact Newton method, converges to $x^*$. Then

$$\|r\_k\| = o(\|F(\mathfrak{x}\_k)\|), k \to \infty,$$

if and only if $\{x\_k\}$ converges to $x^*$ superlinearly. The relation

$$x\_{k+1} = x\_k - \frac{f'(x\_k)}{f'(x\_k) - f'(x\_{k-1})} (x\_k - x\_{k-1}),\tag{40}$$

presents the secant method.
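Relation (40) can be sketched directly; the stopping rule and the test function $f(x) = x^4/4 - x$ (so $f'(x) = x^3 - 1$, with minimizer $x^* = 1$) are illustrative choices of ours:

```python
def secant_min(fprime, x0, x1, eps=1e-10, max_iter=100):
    xm, x = x0, x1
    for _ in range(max_iter):
        gm, g = fprime(xm), fprime(x)
        if abs(g) <= eps or g == gm:
            break
        xm, x = x, x - g / (g - gm) * (x - xm)   # relation (40)
    return x

# usage: f(x) = x**4 / 4 - x, f'(x) = x**3 - 1, minimizer x* = 1
x_star = secant_min(lambda x: x ** 3 - 1.0, 0.5, 2.0)
```

The iteration needs two starting points but only first derivatives, which is exactly the simplification of the Newton method mentioned below.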

In [60], a modification of the classical secant method for solving nonlinear, univariate, and unconstrained optimization problems, based on the development of a cubic approximation, is presented. The iteration formula, which includes an approximation of the third derivative of $f(x)$ obtained by using the Taylor series expansion, is derived. The basic assumption on the objective function $f(x)$ is that $f(x)$ is a real-valued function of a single real variable $x$ and that $f(x)$ has a minimum at $x^*$. Furthermore, in this chapter it is noted that the secant method is a simplification of the Newton method. But the order of the secant method is lower than that of the Newton method; it is Q-superlinearly convergent, and its order is $\frac{\sqrt{5}+1}{2} \approx 1.618$.

This modified secant method is constructed in [60], having in view, as it is emphasized, that it is possible to construct a cubic function which agrees with $f(x)$ up to the third derivative. The third derivative of the objective function $f$ is approximated as

$$f'''(x) \approx \frac{3\left\{\frac{2\left[f'(x\_k) - \frac{f(x\_k) - f(x\_{k-1})}{x\_k - x\_{k-1}}\right]}{x\_k - x\_{k-1}} - f''(x\_k)\right\}}{x\_{k-1} - x\_k}.$$
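As a sanity check, a Taylor expansion shows that this cubic-fit approximation is exact when $f$ itself is a cubic polynomial; a small script (with an illustrative cubic of our own choosing, whose third derivative is the constant 6) confirms this:

```python
# f is an illustrative cubic: its third derivative is the constant 6
f = lambda x: x ** 3 + 2 * x ** 2 - x
fp = lambda x: 3 * x ** 2 + 4 * x - 1        # f'
fpp = lambda x: 6 * x + 4                    # f''

xk, xkm1 = 1.0, 0.7
D = (f(xk) - f(xkm1)) / (xk - xkm1)          # divided difference
inner = 2 * (fp(xk) - D) / (xk - xkm1) - fpp(xk)
approx = 3 * inner / (xkm1 - xk)             # the approximation of f'''
```

Here `approx` reproduces the true third derivative up to rounding error.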

In [61], the authors propose an inexact Newton-like conditional gradient method for solving constrained systems of nonlinear equations. The local convergence of the new method as well as results on its rate is established by using a general majorant condition.

#### 2.6 Quasi-Newton method

Consider the Newton method.

For various practical problems, the computation of the Hessian may be very expensive or difficult, or the Hessian can be unavailable analytically. So, the class of so-called quasi-Newton methods was formed; these methods use only the objective function values and the gradients of the objective function, and they are close to the Newton method. A quasi-Newton method does not compute the Hessian, but it generates a sequence of Hessian approximations and maintains a fast rate of convergence.

So, we would like to construct a Hessian approximation $B\_k$ in the quasi-Newton method. Naturally, it is desirable that the sequence $\{B\_k\}$ possesses positive definiteness, as well as that its direction $d\_k = -B\_k^{-1} g\_k$ be a descent one.

Now, let $f : R^n \to R$ be a twice continuously differentiable function on an open set $D \subset R^n$. Consider the quadratic approximation of $f$ at $x\_{k+1}$:

$$f(x) \approx f(x\_{k+1}) + g\_{k+1}^T(x - x\_{k+1}) + \frac{1}{2}(x - x\_{k+1})^T G\_{k+1}(x - x\_{k+1}).$$

Differentiating, we get

$$\mathbf{g}(\mathbf{x}) \approx \mathbf{g}\_{k+1} + G\_{k+1}(\mathbf{x} - \mathbf{x}\_{k+1}).$$

Setting $x = x\_k$ and using the standard notation $s\_k = x\_{k+1} - x\_k$, $y\_k = g\_{k+1} - g\_k$, from the last relation we get

$$G\_{k+1}^{-1} \mathcal{Y}\_k \approx \mathfrak{s}\_k. \tag{41}$$

Relation (41) transforms into the next one if $f$ is a quadratic function:

$$G\_{k+1}^{-1} \mathcal{Y}\_k = \mathfrak{s}\_k. \tag{42}$$

Let $H\_k$ be the approximation of the inverse of the Hessian. Then, we want $H\_{k+1}$ to satisfy the relation (42). In this way, we come to the quasi-Newton condition or quasi-Newton equation:

$$H\_{k+1} \mathfrak{y}\_k = \mathfrak{s}\_k. \tag{43}$$

Let $B\_{k+1} = H\_{k+1}^{-1}$ be the approximation of the Hessian $G\_{k+1}$. Then

$$B\_{k+1} \mathfrak{s}\_k = \mathfrak{y}\_k \tag{44}$$

is also the quasi-Newton equation. If

$$s\_k^T y\_k > 0,\tag{45}$$

then the matrix Bkþ<sup>1</sup> is positive definite. The condition (45) is known as the curvature condition.

Algorithm 1.2.9. (A general quasi-Newton method).

Assumptions: $0 \le \epsilon < 1$, $x\_0 \in R^n$, $H\_0 \in R^{n \times n}$. Let $k = 0$.

Step 1. If $\|g\_k\| \le \epsilon$, then STOP.

Step 2. Compute $d\_k = -H\_k g\_k$.

Step 3. Find $t\_k$ by line search and set $x\_{k+1} = x\_k + t\_k d\_k$.

Step 4. Update $H\_k$ into $H\_{k+1}$ such that quasi-Newton equation (43) holds.

Step 5. Set $k = k + 1$ and go to Step 1.

In Algorithm 1.2.9, usually we take $H\_0 = I$, where $I$ is an identity matrix. Sometimes, instead of $H\_k$, we use $B\_k$ in Algorithm 1.2.9. Then, Step 2 becomes

Step 2*. Solve

$$B\_k d = -g\_k \text{ for } d\_k.$$

On the other hand, Step 4 becomes

Step 4*. Update $B\_k$ into $B\_{k+1}$ in such a way that quasi-Newton equation (44) holds.

#### 2.7 Symmetric rank-one (SR1) update

Let $H\_k$ be the inverse Hessian approximation at the $k$th iteration. We are trying to update $H\_k$ into $H\_{k+1}$, i.e.,

$$H\_{k+1} = H\_k + E\_k,$$

where $E\_k$ is a matrix of lower rank. If it is a rank-one update, we get

$$H\_{k+1} = H\_k + u v^T,\tag{46}$$

where u, v∈ R<sup>n</sup>. Using quasi-Newton equation (43), we can get

$$H\_{k+1} y\_k = \left(H\_k + u v^T\right) y\_k = s\_k,$$

wherefrom

$$(v^T \mathcal{Y}\_k)u = \mathfrak{s}\_k - H\_k \mathcal{Y}\_k. \tag{47}$$

Further, from (46) and (47), we have

$$H\_{k+1} = H\_k + \frac{1}{\nu^T \mathcal{Y}\_k} \left(\mathfrak{s}\_k - H\_k \mathfrak{y}\_k\right) \nu^T.$$

Having in view that the inverse Hessian approximation $H\_k$ has to be symmetric, we use $v = s\_k - H\_k y\_k$, so we get the symmetric rank-one update (i.e., the SR1 update):

$$H\_{k+1} = H\_k + \frac{\left(\mathbf{s}\_k - H\_k \mathbf{y}\_k\right)\left(\mathbf{s}\_k - H\_k \mathbf{y}\_k\right)^T}{\left(\mathbf{s}\_k - H\_k \mathbf{y}\_k\right)^T \mathbf{y}\_k}. \tag{48}$$

Theorem 1.2.8. [27] (Property theorem of SR1 update) Let $s\_0, s\_1, \dots, s\_{n-1}$ be linearly independent. Then, for a quadratic function with a positive definite Hessian, the SR1 method terminates after $n + 1$ steps, i.e., $H\_n = G^{-1}$.

More information about the SR1 update can be found in the literature.
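Update (48) can be sketched as a single function; the safeguard that skips the update when the denominator is near zero is a standard practical addition of ours, not part of the formula. By construction, the updated matrix satisfies the quasi-Newton equation $H\_{k+1} y\_k = s\_k$:

```python
import numpy as np

def sr1_update(H, s, y, tol=1e-8):
    v = s - H @ y
    denom = v @ y
    # standard safeguard: skip the update when the denominator is tiny
    if abs(denom) < tol * np.linalg.norm(v) * np.linalg.norm(y):
        return H
    return H + np.outer(v, v) / denom        # SR1 update (48)

# usage: for a quadratic with Hessian A, the pair satisfies y_k = A s_k
A = np.array([[4.0, 1.0], [1.0, 3.0]])
H = np.eye(2)
s = np.array([1.0, 0.0])
y = A @ s
H = sr1_update(H, s, y)
```

The update is symmetric whenever $H\_k$ is, but, unlike the rank-two updates below, it need not preserve positive definiteness.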

#### 2.8 Davidon-Fletcher-Powell (DFP) update

There exists another type of update, which is a rank-two update. In fact, we get Hkþ<sup>1</sup> using two symmetric, rank-one matrices:

$$H\_{k+1} = H\_k + auu^T + bvv^T,\tag{49}$$

where u, v∈ R<sup>n</sup> and a, b are scalars which have to be determined. Using quasi-Newton equation (43), we can get

$$H\_k y\_k + a u u^T y\_k + b v v^T y\_k = s\_k. \tag{50}$$

The values of u, v are not determined in a unique way, but the good choice is

$$u = s\_k, \quad v = H\_k y\_k.$$

Now, from (50), we get:

$$a = \frac{1}{s\_k^T \mathcal{Y}\_k}, b = -\frac{1}{\mathcal{Y}\_k^T H\_k \mathcal{Y}\_k}.$$

Hence, we get the formula

$$H\_{k+1} = H\_k + \frac{s\_k s\_k^T}{s\_k^T \boldsymbol{\nu}\_k} - \frac{H\_k \boldsymbol{\nu}\_k \boldsymbol{\nu}\_k^T H\_k}{\boldsymbol{\nu}\_k^T H\_k \boldsymbol{\nu}\_k},\tag{51}$$

which is the DFP update.
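Update (51) is equally compact in code; the quadratic test data below are an illustrative choice of ours. By construction, the updated matrix satisfies the quasi-Newton equation (43), and since $s\_k^T y\_k > 0$ here, it stays positive definite:

```python
import numpy as np

def dfp_update(H, s, y):
    Hy = H @ y
    # rank-two DFP correction (51) of the inverse Hessian approximation
    return H + np.outer(s, s) / (s @ y) - np.outer(Hy, Hy) / (y @ Hy)

# usage: quadratic data with Hessian A, so y_k = A s_k and s_k^T y_k > 0
A = np.array([[2.0, 0.5], [0.5, 1.0]])
s = np.array([1.0, -1.0])
y = A @ s
H1 = dfp_update(np.eye(2), s, y)
```

Plugged into Step 4 of Algorithm 1.2.9, this function turns the general quasi-Newton scheme into the DFP method.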

Theorem 1.2.9. [27] (Positive definiteness of DFP update) DFP update (51) retains positive definiteness if and only if $s\_k^T y\_k > 0$.

Theorem 1.2.10. [27] (Quadratic termination theorem of DFP method) Let $f(x)$ be a quadratic function with positive definite Hessian $G$. Then, if the exact line search is used, the sequence $\{s\_j\}$, generated by the DFP method, satisfies, for $i = 0, 1, \dots, m$, where $m \le n - 1$:


