**Paradigms & Algorithms**

In Chapter 6, entitled "Facility Layout Problem for Cellular Manufacturing Systems," a heuristic algorithm has been designed to allocate and displace facilities in the radial direction for the underlying optimization problem. In order to improve the search efficiency of the developed algorithm, different cell sizes have been considered in the initialization. In this chapter, a real-life optimization problem from industry has been used. In Chapter 7, entitled "Application of Simulated Annealing and Adaptive Simulated Annealing in Search for Efficient Optimal Solutions of a Groundwater Contamination-Related Problem," the optimization problem of source characterization has been considered using a methodology based on ASA. In this study, it has also been shown that ASA provides reliable results for the considered optimization problem.

I would like to add that it was a great time working on the chapters as an editor. I hope the contents of the book attract interest from all fields of the engineering and scientific sectors, as they have been intended to do. I welcome any further questions or comments on the book chapters and their contents. Interested readers may send their comments directly to the editor using the following e-mail address.

I would like to thank my family for their support, particularly Reza, for letting me work on the book in times that I should have spent with him. I would like to thank Professor Muhammad Ali Imran, Vice Dean of UESTC at the University of Glasgow, for his encouragement and kind words. I would also like to thank Ms. Ana Pantar, senior commissioning editor, and Ms. Martina Usljebrka, publishing process manager, for their assistance in preparing the book during the year.

**Dr. Hossein Peyvandi**

PhD, FHEA, MIEEE, MIET, MSc, BSc

University of Surrey, Guildford, Surrey

United Kingdom

## **Deterministic Annealing: A Variant of Simulated Annealing and its Application to Fuzzy Clustering**

## Makoto Yasuda

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/66072

#### **Abstract**

Deterministic annealing (DA) is a deterministic variant of simulated annealing. In this chapter, after briefly introducing DA, we explain how DA is combined with fuzzy *c*-means (FCM) clustering by employing the entropy maximization method, especially the Tsallis entropy maximization. The Tsallis entropy is a *q*-parameter extension of the Shannon entropy. We then focus on Tsallis-entropy-maximized FCM (Tsallis-DAFCM) and examine the effects of DA cooling functions on accuracy and convergence. The shape of the membership function of Tsallis-DAFCM depends on both the system temperature and *q*. Accordingly, the relationship between the temperature and *q* is quantitatively investigated.

**Keywords:** deterministic annealing, simulated annealing, free energy, fuzzy *c*-means clustering, entropy maximization, Shannon entropy, fuzzy entropy, Tsallis entropy

## **1. Introduction**

Statistical mechanics investigates the macroscopic properties of a physical system consisting of many elements. Recently, there has been great research activity in applying statistical mechanical models and tools to information engineering problems. The entropy maximization method applied to fuzzy *c*-means clustering is a good example of such models.

Simulated annealing (SA) [1, 2] is one of the most commonly used optimization techniques and plays an important role in the field of engineering, because many engineering problems can be formulated as optimization problems. SA is a stochastic relaxation method that treats an objective function as a system energy and, by analogy with the annealing process of solids, searches for its minimum while decreasing the system temperature. SA searches randomly at a high temperature, but more deterministically at a low temperature. As long as a neighborhood can be defined and the temperature is lowered sufficiently slowly, SA is a global optimization technique. However, it requires a very long time to find a globally optimal solution because of the stochastic search at each temperature. Thus, in practice, SA is an approximation method.

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Deterministic annealing (DA) is a deterministic variant of SA, first proposed by Rose et al. [3] for a vector quantization algorithm. DA characterizes the minimization of the objective function as the minimization of the free energy of the system, which depends on the temperature. DA tracks the minimum of the free energy while decreasing the temperature. Thus, it can deterministically optimize the objective function at each temperature; it is more efficient than SA, but does not guarantee an optimal solution. In addition, the effect of the cooling function on the quality of a DA solution is still unclear.

Membership functions of fuzzy *c*-means (FCM) clustering [4] maximized or regularized with entropy [5, 6] have forms similar to the distribution functions that appear in statistical mechanics. For example, a membership function obtained by Shannon entropy maximization has a form similar to the Boltzmann-Gibbs distribution function [3, 5]. Similarly, a membership function obtained from the fuzzy entropy [7] has a form similar to the Fermi-Dirac distribution function [8]. Annealing methods are applicable to these membership functions because they contain a parameter that can be identified with the system temperature. The advantage of applying the entropy maximization methods to FCM is that fuzzy clustering can be analyzed from both information processing and statistical physics points of view.

Tsallis [9], by extending Boltzmann-Gibbs statistics nonextensively with a generalization parameter *q*, postulated a generalized formulation of entropy. This entropy is now well known as the Tsallis entropy; in the limit of *q* → 1, it approaches the Shannon entropy. The Tsallis entropy is applicable to numerous fields, including physics, bioscience, chemistry, networks, and computer science [10–12]. Menard et al. [13, 14] investigated fuzzy clustering in the framework of nonextensive thermostatistics and derived the possibilistic membership function by taking the possibilistic constraint into account.

On the other hand, based on the Tsallis entropy, Yasuda [15] defined another form of entropy for FCM and derived the membership function by maximizing this entropy within FCM. By combining this membership function with DA, a new fuzzy clustering algorithm (the Tsallis-DAFCM algorithm) was developed. Tsallis-DAFCM was shown to yield superior results in comparison with the conventional annealing methods.

As with SA, the performance of Tsallis-DAFCM depends strongly on how the temperature is decreased. Among the cooling functions for SA, the very fast annealing (VFA) method decreases the temperature fastest. Thus, VFA was applied to Tsallis-DAFCM to improve its performance, and proved to be effective [16].

Despite its performance, it remains unknown how an appropriate *q* value and initial annealing temperature *T*high for Tsallis-DAFCM should be determined according to the data distribution. One of the important characteristics of the membership function of Tsallis-DAFCM is that the centers of clusters are given as a weighted function of the membership function raised to the power of *q*. Furthermore, it changes its shape in a similar way when the temperature is decreased or *q* is increased. In order to reveal the relationship between the temperature and *q*, the membership function is quantitatively analyzed using the Iris Data Set [17]. The result shows that the temperature and *q* affect the membership function almost inversely, suggesting that a *q*-incrementation algorithm is possible. This algorithm might be a solution to the initial-value problem of Tsallis-DAFCM [18, 19].

This chapter is composed of five sections. Section 1 is this introduction. In Section 2, how DA works is explained, and its applications are summarized. In Section 3, each of the components of the entropy-maximized FCM clustering methods is explained. In Section 4, VFA is used as the cooling function of Tsallis-DAFCM, and its effects are experimentally investigated. The effects of the temperature and *q* values on the membership function of Tsallis-DAFCM are also examined. Section 5 concludes the chapter.

## **2. Deterministic annealing**


4 Computational Optimization in Engineering - Paradigms and Applications


Vector quantization is a classification method for big data, widely used in the field of image compression. The Linde-Buzo-Gray [20] and self-organizing feature map [21] algorithms, for example, are well-known vector quantization algorithms. However, the classification results of these algorithms depend on the initial settings of the reference vectors, and can easily fall into local minima.

In order to overcome this problem, a DA method has been proposed by analogy with statistical mechanics. DA does not suffer from the initial-vector problem and usually performs better than other methods. However, it is not theoretically guaranteed to find a globally optimal solution. Furthermore, its performance depends on the way the system temperature is decreased.

This section gives a brief introduction to the deterministic annealing method. In Section 2.1, a definition and the major characteristics of DA are explained. In Section 2.2, DA is compared with SA. In Section 2.3, applications and modifications of DA are summarized.

## **2.1. Major characteristics of deterministic annealing**

In DA, by analogy with statistical mechanics, the free energy *F* is derived from an objective function *J* of a problem. At a high temperature, *F* represents the global structure of *J*. As the temperature decreases, *F* gradually approaches *J*.

Based on these characteristics, at a high temperature, DA is able to find the global minimum of *F* by the steepest descent method, because *F* should not yet have multiple local minima at that temperature. When the temperature is lowered a little, *F* changes its shape only a little. Accordingly, by setting the previous global minimum as the initial value of the steepest descent method, DA searches for the next global minimum. This procedure continues until the temperature is lowered sufficiently and *F* reaches *J*. Consequently, at each temperature, DA searches for the local minimum of *J* deterministically.
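The procedure above can be sketched as a temperature loop wrapped around a steepest-descent inner loop. This is only a minimal illustration under assumed choices: the toy free energy in the demo, the exponential cooling rate, and the step size are all hypothetical and not taken from the chapter.

```python
import numpy as np

def deterministic_annealing(free_energy_grad, x0, t_high=100.0, t_low=0.01,
                            r=0.8, lr=0.01, n_steps=200):
    """Track the minimum of a free energy F(x, T) while lowering T.

    free_energy_grad(x, T) returns dF/dx at temperature T.  At each
    temperature, steepest descent is run for a fixed number of steps,
    starting from the minimum found at the previous temperature.
    """
    x, t = np.asarray(x0, dtype=float), t_high
    while t > t_low:
        for _ in range(n_steps):            # steepest descent at fixed T
            x = x - lr * free_energy_grad(x, t)
        t *= r                              # exponential cooling
    return x

# Demo with a hypothetical free energy F(x, T) = (x^2 - 1)^2 + T * x^2:
# at high T the temperature term dominates and F has a single minimum
# near 0; as T -> 0 the minima at x = +/-1 emerge, and DA tracks one.
grad = lambda x, t: 4 * x**3 - 4 * x + 2 * t * x
x_min = deterministic_annealing(grad, x0=0.5, t_high=10.0)
```

The essential point is that the descent at each temperature starts from the previous temperature's solution, so the tracked minimum deforms smoothly as the landscape sharpens.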

#### **2.2. Comparison with simulated annealing**

While decreasing the temperature, SA searches for the minimum stochastically at each temperature and thus requires a very long time to find an optimal solution. Hence, though it is theoretically guaranteed to find the optimal solution, SA is practically an approximation method.

By contrast, DA consumes less computational time because it searches for the local minimum deterministically at each temperature. It should be noted, however, that in case multiple local minima exist at some temperature, DA might not be able to find the true minimum. For this reason, DA is not guaranteed, even theoretically, to find the optimal solution.

Approaches to speeding up SA are mainly based on improving the transition method or the cooling function, including parallelization. For example, adaptive SA (ASA) [22], which may belong to both categories, is an implementation of very fast simulated re-annealing (VFSA) [23]. Compared with the acceleration method called fast annealing (FA) [24], in which the temperature is lowered inversely proportionally to the number of iterations, ASA is faster. In addition, among the many features included in ASA, it can benefit from speeding up by simulated quenching.

For DA, it seems that no comprehensive studies on this topic have been conducted.

The summary of comparison is shown in **Table 1**.


| | SA | DA |
| --- | --- | --- |
| Search strategy | Stochastic search based on the Metropolis algorithm. | Deterministic search based on the steepest descent algorithm. |
| Cooling function | Two categories of cooling functions are well used: (1) functions based on statistical analysis; (2) adaptive functions depending on the problems. | Cooling functions that appear in SA are used empirically. |
| Optimality | The global minimum can be achieved if the temperature is decreased as slowly as *T* ∝ 1/log(iterations). | Not guaranteed. |

**Table 1.** Comparison of SA and DA.

#### **2.3. Applications and modifications of deterministic annealing**

The study of DA first addressed avoiding poor local minima in data clustering [25]. It was then extensively applied to various subjects such as combinatorial optimization problems [26], vector quantization [27, 28], maximum likelihood estimation [29], classifier design [30], and pairwise data clustering [31].

In order to cluster large amounts of data, research activities that attempt to parallelize DA have become popular. Kim et al. [32] discussed a parallelization method for DA using GPUs and applied it to color image segmentation. In order to cluster large amounts of bioscientific data, Fox et al. [33, 34] parallelized DA using MPI. Qiu et al. [35] compared DA's performance using the C# messaging runtime library CCR with that using MPI.

## **3. Application of deterministic annealing to fuzzy** *c***-means clustering maximized with entropy**

One of the important applications of DA is fuzzy clustering. In this section, we focus on fuzzy *c*-means (FCM) clustering. By maximizing the objective function of FCM with various entropies, membership functions similar to the statistical mechanical distribution functions are obtained. These membership functions can be easily combined with DA, because they contain a parameter corresponding to a system temperature.

In this section, first we outline the formulation of FCM. Then, we describe how to apply the entropy maximization methods to FCM. In Section 3.1, the classical FCM clustering method is introduced. In Section 3.2, various entropy maximization methods are explained. The free energies for each entropy maximization method are derived in Section 3.3. In Section 3.4, representative annealing functions are presented.

#### **3.1. Fuzzy** *c***-means clustering**


Let $X = \{\mathbf{x}_1, \cdots, \mathbf{x}_n\}$ ($\mathbf{x}_k = (x_k^1, \cdots, x_k^p) \in \mathbf{R}^p$) be a data set in *p*-dimensional real space, which is to be divided into *c* clusters. Let $V = \{\mathbf{v}_1, \cdots, \mathbf{v}_c\}$ ($\mathbf{v}_i = (v_i^1, \cdots, v_i^p) \in \mathbf{R}^p$) be the centers of the clusters, and let $u_{ik} \in [0, 1]$ ($i = 1, \cdots, c$; $k = 1, \cdots, n$) be the membership functions. Then, let

$$J = \sum_{k=1}^{n} \sum_{i=1}^{c} u_{ik}^{\,m} d_{ik} \quad \left( d_{ik} = \left\| \mathbf{x}_k - \mathbf{v}_i \right\|^2;\ m \in R;\ 1 < m \right) \tag{1}$$

be the objective function of FCM to be minimized. Under the normalization constraint of

$$\sum_{i=1}^{c} u_{ik} = 1 \quad (\forall k), \tag{2}$$

the Lagrange function *L* is given by


$$L = J - \sum_{k=1}^{n} \eta_k \left(\sum_{i=1}^{c} u_{ik} - 1\right), \tag{3}$$

where $\eta_k$ denotes the Lagrange multiplier. $\partial L / \partial u_{ik} = 0$ gives the membership function of the form:

$$u_{ik} = \left[ \sum_{j=1}^{c} \left( \frac{d_{ik}}{d_{jk}} \right)^{\frac{1}{m-1}} \right]^{-1}. \tag{4}$$

Similarly, $\mathbf{v}_i$ can be determined from $\partial L / \partial \mathbf{v}_i = 0$ as follows:

$$\mathbf{v}_i = \frac{\sum_{k=1}^{n} u_{ik}^{\,m}\, \mathbf{x}_k}{\sum_{k=1}^{n} u_{ik}^{\,m}}. \tag{5}$$

Desirable cluster centers can be obtained by calculating Eq. (4) and Eq. (5) repeatedly.
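As a concrete sketch, the alternating updates of Eqs. (4) and (5) can be written as follows. The function name, random initialization, and fixed iteration count are illustrative choices, not the chapter's implementation.

```python
import numpy as np

def fcm(x, c, m=2.0, n_iter=100, seed=0):
    """Classical fuzzy c-means: alternate the membership update of
    Eq. (4) and the center update of Eq. (5).

    x: (n, p) data, c: number of clusters, m > 1: fuzzifier.
    Returns memberships u of shape (c, n) and centers v of shape (c, p).
    """
    rng = np.random.default_rng(seed)
    v = x[rng.choice(len(x), size=c, replace=False)]        # initial centers
    for _ in range(n_iter):
        d = ((x[None, :, :] - v[:, None, :]) ** 2).sum(-1)  # d_ik = ||x_k - v_i||^2
        w = np.fmax(d, 1e-12) ** (-1.0 / (m - 1.0))         # guard against d = 0
        u = w / w.sum(axis=0)                               # Eq. (4)
        um = u ** m
        v = um @ x / um.sum(axis=1, keepdims=True)          # Eq. (5)
    return u, v
```

Note that Eq. (4) is implemented in the equivalent form $u_{ik} = d_{ik}^{-1/(m-1)} / \sum_j d_{jk}^{-1/(m-1)}$, which avoids the double loop over cluster pairs.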

#### **3.2. Entropy maximization method for fuzzy** *c***-means**

#### *3.2.1. Shannon entropy maximization*

Shannon entropy for FCM takes the form:

$$S_{SE} = -\sum_{k=1}^{n} \sum_{i=1}^{c} u_{ik} \log u_{ik}. \tag{6}$$

By setting *m* to 1 in Eq. (1), under the normalization constraint of Eq. (2), the Shannon entropy functional is given by

$$\delta S_{SE} - \sum_{k=1}^{n} \alpha_k \delta \left(\sum_{i=1}^{c} u_{ik} - 1\right) - \beta \sum_{k=1}^{n} \sum_{i=1}^{c} \delta (u_{ik} d_{ik}), \tag{7}$$

where *αk* and *β* denote the Lagrange multipliers. The stationary condition for Eq. (7) leads to the following Gaussian membership function:


$$u_{ik} = \frac{e^{-\beta d_{ik}}}{\sum_{j=1}^{c} e^{-\beta d_{jk}}}, \tag{8}$$

and the same formula for $\mathbf{v}_i$ as in Eq. (5).

#### *3.2.2. Fuzzy entropy maximization*


Fuzzy entropy for FCM is defined as [36]

$$S_{FE} = -\sum_{k=1}^{n} \sum_{i=1}^{c} \left\{ u_{ik} \log u_{ik} + (1 - u_{ik}) \log(1 - u_{ik}) \right\}. \tag{9}$$

The fuzzy entropy functional is given by

$$\delta S_{FE} - \sum_{k=1}^{n} \alpha_k \delta \left(\sum_{i=1}^{c} u_{ik} - 1\right) - \beta \sum_{k=1}^{n} \sum_{i=1}^{c} \delta \left(u_{ik} d_{ik}\right). \tag{10}$$

The stationary condition for Eq. (10) leads to the following membership function:

$$u_{ik} = \frac{1}{e^{\alpha_k + \beta d_{ik}} + 1}, \tag{11}$$

and the same formula holds for $\mathbf{v}_i$ as in Eq. (5). Eq. (11) is similar to the Fermi-Dirac distribution function.

#### *3.2.3. Tsallis entropy maximization*

The Tsallis entropy for FCM is defined as [15]

$$S_{Tsallis} = -\frac{1}{q-1} \left( \sum_{k=1}^{n} \sum_{i=1}^{c} u_{ik}^{\,q} - 1 \right), \tag{12}$$

where *q* ∈ *R* is a real number. In the case of the Tsallis entropy maximization, the objective function should be rewritten as

$$J_{Tsallis} = \sum_{k=1}^{n} \sum_{i=1}^{c} u_{ik}^{\,q} d_{ik}. \tag{13}$$

Accordingly, the Tsallis entropy functional is given by

$$\delta S_{Tsallis} - \sum_{k=1}^{n} \alpha_k \delta \left(\sum_{i=1}^{c} u_{ik} - 1\right) - \beta \sum_{k=1}^{n} \sum_{i=1}^{c} \delta \left(u_{ik}^{\,q} d_{ik}\right). \tag{14}$$

The stationary condition for Eq. (14) leads to the membership function of the form:

$$u_{ik} = \frac{\left\{1 - \beta(1 - q)d_{ik}\right\}^{\frac{1}{1-q}}}{Z}, \tag{15}$$

where

$$Z = \sum\_{j=1}^{c} \left\{ 1 - \beta (1 - q) d\_{jk} \right\}^{\frac{1}{1 - q}}.$$

$\mathbf{v}_i$ is defined as

$$\mathbf{v}_i = \frac{\sum_{k=1}^{n} u_{ik}^{\,q}\, \mathbf{x}_k}{\sum_{k=1}^{n} u_{ik}^{\,q}}. \tag{16}$$
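Eqs. (15) and (16) can be sketched together as a single update step. This is an illustrative sketch under assumed conventions (NumPy, the hypothetical function name `tsallis_step`, and $\beta = 1/T$ as stated in Section 3.3), not the chapter's implementation.

```python
import numpy as np

def tsallis_step(x, v, q=2.0, t_sys=1.0):
    """One Tsallis-DAFCM update: memberships via Eq. (15) and centers
    via Eq. (16), at system temperature t_sys (beta = 1 / t_sys)."""
    beta = 1.0 / t_sys
    d = ((x[None, :, :] - v[:, None, :]) ** 2).sum(-1)      # d_ik, shape (c, n)
    w = (1.0 - beta * (1.0 - q) * d) ** (1.0 / (1.0 - q))   # numerator of Eq. (15)
    u = w / w.sum(axis=0)                                   # division by Z
    uq = u ** q
    v_new = uq @ x / uq.sum(axis=1, keepdims=True)          # Eq. (16)
    return u, v_new
```

For *q* > 1 the bracket $1 - \beta(1-q)d_{ik} = 1 + \beta(q-1)d_{ik}$ stays positive, so the power is always well defined and the membership decays with distance.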

#### **3.3. Free energy for entropy maximized fuzzy** *c***-means**

In each of the entropy maximization methods introduced in Section 3.2, *β* can be regarded as the inverse of the system temperature, $T^{-1}$. This feature makes it possible to apply DA, and Shannon-entropy-maximized FCM with DA (Shannon-DAFCM, hereafter), fuzzy-entropy-maximized FCM with DA (fuzzy-DAFCM, hereafter), and Tsallis-entropy-maximized FCM with DA (Tsallis-DAFCM, hereafter) have been developed.

#### *3.3.1. Free energy for Shannon-DAFCM*

In Shannon-DAFCM, by analogy with statistical mechanics [37], the sum of the states (the partition function) for the grand canonical ensemble of FCM can be expressed as


$$\xi = \prod_{k=1}^{n} \sum_{i=1}^{c} e^{-\beta d_{ik}}. \tag{17}$$

The free energy is derived as


$$F = -\frac{1}{\beta} \log \xi = -\frac{1}{\beta} \sum_{k=1}^{n} \log \sum_{i=1}^{c} e^{-\beta d_{ik}}. \tag{18}$$

Stable thermal equilibrium requires minimization of the free energy. By formulating deterministic annealing as a minimization of the free energy, $\partial F / \partial \mathbf{v}_i = 0$ yields the same expression for $\mathbf{v}_i$ as that in Eq. (5).

#### *3.3.2. Free energy for fuzzy-DAFCM*

Similarly, in fuzzy-DAFCM, the grand partition function for the grand canonical ensemble for FCM can be written as

$$\Xi = \prod_{k=1}^{n} \prod_{i=1}^{c} \left(1 + e^{-\alpha_k - \beta d_{ik}}\right). \tag{19}$$

The free energy is calculated as

$$F = -\frac{1}{\beta} \left( \log \Xi - \alpha_k \frac{\partial \log \Xi}{\partial \alpha_k} \right) = -\frac{1}{\beta} \sum_{k=1}^{n} \left\{ \sum_{i=1}^{c} \log \left( 1 + e^{-\alpha_k - \beta d_{ik}} \right) + \alpha_k \right\}. \tag{20}$$

$\partial F / \partial \mathbf{v}_i = 0$ yields the same expression for $\mathbf{v}_i$ as that in Eq. (5).

#### *3.3.3. Free energy for Tsallis-DAFCM*

In Tsallis-DAFCM, the free energy can be derived as

$$F = J_{Tsallis} - T S_{Tsallis} = -\frac{1}{\beta} \sum_{k=1}^{n} \frac{Z^{1-q} - 1}{1 - q}, \tag{21}$$

where *T* is the system temperature. $\partial F / \partial \mathbf{v}_i = 0$ yields the same expression for $\mathbf{v}_i$ as that in Eq. (16).

#### **3.4. Effect of annealing temperature on clustering**

#### *3.4.1. Dependency of shape of membership function on temperature*

While reducing the system temperature *T*, DA achieves thermal equilibrium at each temperature by minimizing the free energy. Thus, DA searches for a cluster distribution that minimizes the free energy at each temperature. When the temperature is high, the membership functions are widely distributed, which makes the clusters to which each data point belongs fuzzy. In the case of Tsallis-DAFCM, when *q* is nearly equal to 2, the width of the membership function is almost proportional to $\sqrt{T}$. By contrast, at low temperatures, fuzzy clustering approaches hard clustering. The relationship $F = J_{Tsallis} - T S_{Tsallis}$ in Eq. (21) suggests that, at higher temperatures, a larger-entropy (more chaotic) state is caused by a widening of the extent of the membership function.
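This temperature dependence can be checked numerically from Eq. (15): for a single cluster center at the origin with *q* = 2 and $\beta = 1/T$, the unnormalized membership is $(1 + d/T)^{-1}$, which falls to half its peak at a distance of exactly $\sqrt{T}$. A minimal sketch (the grid resolution and function name are arbitrary choices):

```python
import numpy as np

def half_width(t_sys, q=2.0):
    """Distance at which the unnormalized Tsallis membership of Eq. (15),
    for a single cluster center at the origin, falls to half its peak."""
    x = np.linspace(0.0, 50.0, 200001)                    # distance grid
    u = (1.0 + (1.0 / t_sys) * (q - 1.0) * x**2) ** (1.0 / (1.0 - q))
    return x[np.argmax(u < 0.5)]                          # first point below 1/2
```

Quadrupling the temperature doubles the half-width, consistent with the $\sqrt{T}$ scaling of the membership function's extent.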

#### *3.4.2. Cooling function*

In SA, the temperature decreases according to a cooling function or an annealing schedule. The representative cooling functions for SA [38] are:

(I) Exponential function

$$T = T_{high}\, r^t, \tag{22}$$

where *Thigh* is the highest initial temperature, *r* is a parameter that defines the temperature reduction rate, and *t* is the number of iterations of temperature reduction.

(II) Inversely linear function

$$T = \frac{T\_{high}}{t}.\tag{23}$$

(III) Inversely logarithmic function

$$T = \frac{T_{high}}{\ln t}. \tag{24}$$

(IV) Inversely exponential function

$$T = \frac{T\_{high}}{e^{rt}}.\tag{25}$$

#### (V) Very fast annealing


Rosen [39] proposed another inversely exponential function known as VFA. VFA decreases the temperature exponentially in a similar way to ASA:

$$T = \frac{T\_{high}}{e^{rt^{1/D}}},\tag{26}$$

where *D* is the dimension of the state space.<sup>1</sup> **Figure 1** compares plots of Eq. (25) and Eq. (26).

**Figure 1.** Plots of (a) Eq. (25) and (b) Eq. (26) (*T*high = 1.0 × 10<sup>5</sup>, *D* = 2).
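For concreteness, the five schedules of Eqs. (22)–(26) can be collected into a short sketch (illustrative Python; the function and parameter names are ours, not the chapter's):

```python
import math

def exponential(T_high, r, t):       # Eq. (22): T = T_high * r^t
    return T_high * r ** t

def inv_linear(T_high, t):           # Eq. (23): T = T_high / t
    return T_high / t

def inv_log(T_high, t):              # Eq. (24): T = T_high / ln t
    return T_high / math.log(t)

def inv_exponential(T_high, r, t):   # Eq. (25): T = T_high / e^(r t)
    return T_high / math.exp(r * t)

def vfa(T_high, r, t, D):            # Eq. (26), VFA: T = T_high / e^(r t^(1/D))
    return T_high / math.exp(r * t ** (1.0 / D))

# Reproduce the setting of Figure 1 (T_high = 1.0e5, D = 2) at a few iterations:
for t in (1, 4, 16):
    print(t, inv_exponential(1e5, 1.0, t), vfa(1e5, 1.0, t, 2))
```

The only difference between Eq. (25) and VFA is the exponent *t*<sup>1/*D*</sup> in place of *t*, which is what couples the schedule to the dimension of the state space.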

## **4. Tsallis-entropy-maximized fuzzy** *c***-means clustering with deterministic annealing**

In this section, we focus on Tsallis-DAFCM, and its important experimental results are explained. In Section 4.1, we present the Tsallis-DAFCM clustering algorithm. In Section 4.2, how VFA affects Tsallis-DAFCM is experimentally investigated. In Section 4.3, effects of the temperature and *q*-values on the membership function are examined.

#### **4.1. Tsallis-DAFCM clustering algorithm**

The Tsallis-entropy-maximization method, fuzzy *c*-means clustering, and the deterministic annealing method can be combined as the following Tsallis-DAFCM clustering algorithm [15]:

**1.** Set the number of clusters *c*, the highest temperature *Thigh*, the temperature reduction rate *m*, and the thresholds of the convergence tests *δ*1 and *δ*2;

**2.** Generate *c* clusters at random positions. Set the current temperature *T* to *Thigh*;

**3.** Calculate the membership function *uik* by Eq. (15);

**4.** Calculate the centers of clusters *vi* by Eq. (17);

**5.** Compare the difference between the current cluster centers *vi* and the cluster centers obtained in the previous iteration *v̂i*. If the convergence condition ‖*vi* − *v̂i*‖ < *δ*1 is satisfied for all clusters 1 ≤ *i* ≤ *c*, then go to **6**, otherwise go back to **3**;

**6.** Compare the difference between the current cluster centers *vi* and the cluster centers obtained at the previous temperature *v̄i*. If the convergence condition ‖*vi* − *v̄i*‖ < *δ*2 is satisfied for all clusters 1 ≤ *i* ≤ *c*, then stop, otherwise decrease the temperature *T* using the cooling function, and go back to **3**.

<sup>1</sup> In clustering, a state space refers to an input space of a data set. For example, if a data set consists of 3 attributes, *D* is set to 3.

#### **4.2. Effect of cooling functions**

In general, the temperature should be reduced gradually in DA. However, this takes a long time to converge. If VFA is applicable to Tsallis-DAFCM, it is of great advantage to this method. Accordingly, VFA is tested as a cooling function of Tsallis-DAFCM.

In this subsection, both Shannon- and Tsallis-DAFCM are examined.

#### *4.2.1. Experiment 1*

In experiment 1, numerical data composed of five clusters and 2000 data points are used, as shown in **Figure 2**. The parameters are set as follows: *c* = 10, *δ*1 = 50, *δ*2 = 2, *q* = 1.5, *T*high = 1.0×10<sup>6</sup> or 1.0×10<sup>5</sup>.

**Figure 2.** Numerical data.

First, the inversely exponential cooling function is applied to Tsallis-DAFCM. The changes of *β* parameterized by *r* are plotted in **Figure 3**. In the case of *r* = 1000.0, when *T*high = 1.0×10<sup>5</sup>, as *T* is lowered from **Figures 3 (A)** to **(D)**, the data are clustered gradually and desirably. In the cases of *r* = 10.0 and *r* = 1.0 (**Figures 3 (E)** and **(F)**, respectively), it is observed that *uik* and *β* converge more rapidly.

**Figure 3.** Increase of *β* by the inversely exponential cooling function.

However, when *T*high = 1.0×10<sup>6</sup>, the algorithm fails to converge with *r* = 100.0 and 10.0 (expressed by "Not converged" in **Figure 3**) because the initial distribution of *uik* is too wide. This result indicates that it is important to set both *Thigh* and *r* properly.

**Figure 4.** Shifts of cluster centers during clustering obtained by (a) Shannon-DAFCM and (b) Tsallis-DAFCM.

In order to clarify the adaptability of VFA as a cooling function of DA, numerical experiments on Shannon- and Tsallis-DAFCM are performed. The shifts of the cluster centers with decreasing temperature are illustrated in **Figures 4 (a)** and **(b)**. Initially, clusters are located randomly. At higher temperatures, *β* is comparatively small and the clusters move near the center of gravity of the data, because the membership functions extend over the data area and become extremely uniform. As *T* is lowered, on the contrary, the membership functions become narrower and the associations of data to the clusters become less fuzzy. In this process, in Shannon-DAFCM, the clusters move to their nearest local data distribution centers. However, in Tsallis-DAFCM, the clusters can move a long distance to optimal positions because the membership functions have gentle base slopes.

**Figures 5 (a)** and **(b)** illustrate the three-dimensional plots of *uik* in the progress of Shannon- and Tsallis-DAFCM clustering combined with VFA. When the temperature is as high as 3.7×10<sup>4</sup>, the roughness of *uik* of Tsallis-DAFCM is smaller than that of Shannon-DAFCM. After that, the shapes of both membership functions do not change greatly, because VFA reduces the temperature extremely only at the early annealing stage. When the temperature is lowered to 1.3×10<sup>4</sup>, both methods cluster the data desirably.

**Figure 5.** Initial and final landscapes of *uik* of (a) Shannon-DAFCM and (b) Tsallis-DAFCM.

Consequently, because Tsallis-DAFCM has a gentle slope in the region far from the origin, clusters can move a long distance to optimal positions stably. This feature makes it possible to reduce the temperature rapidly. Thus, VFA is suitable as a cooling function of Tsallis-DAFCM. On the other hand, the final cluster positions obtained by Shannon-DAFCM tend to depend on their initial positions.
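Putting Sections 4.1 and 4.2 together, the clustering loop (steps 1–6) with a VFA-style schedule can be sketched in illustrative Python for one-dimensional data. The deterministic initialization, the tolerances, and the safety caps are our assumptions, not the chapter's code:

```python
import math

def tsallis_dafcm(data, c, q=1.5, T_high=10.0, r=1.0, D=1,
                  d1=1e-8, d2=1e-6, max_temps=100, max_iters=500):
    """Sketch of Tsallis-DAFCM (steps 1-6 of Section 4.1) on 1-D data, q > 1."""
    v = list(data[:c])                         # step 2: initial centers (first c points)
    T = T_high
    for t in range(1, max_temps + 1):
        v_prev_T = v[:]
        for _ in range(max_iters):             # steps 3-5 at a fixed temperature
            v_prev = v[:]
            beta = 1.0 / T
            # step 3: Tsallis membership (cf. Eq. (15)); q > 1 keeps the base >= 1
            num = [[(1.0 - (1.0 - q) * beta * (x - vi) ** 2) ** (1.0 / (1.0 - q))
                    for vi in v] for x in data]
            u = [[n / sum(row) for n in row] for row in num]
            # step 4: centers as u_ik^q-weighted means (cf. Eq. (17))
            v = [sum(u[k][i] ** q * data[k] for k in range(len(data))) /
                 sum(u[k][i] ** q for k in range(len(data))) for i in range(c)]
            if max(abs(a - b) for a, b in zip(v, v_prev)) < d1:
                break                          # step 5: converged at this temperature
        if max(abs(a - b) for a, b in zip(v, v_prev_T)) < d2:
            break                              # step 6: converged across temperatures
        T = T_high / math.exp(r * t ** (1.0 / D))  # cool with a VFA-style schedule
    return sorted(v)

# Two well-separated 1-D blobs: the centers should end up near 0 and 10.
print(tsallis_dafcm([-0.2, 0.0, 0.2, 9.8, 10.0, 10.2], c=2))
```

Even though both centers start inside the same blob, the gentle base slope of the Tsallis membership lets them drift apart as the temperature is lowered, which is the long-distance movement discussed above.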

## *4.2.2. Experiment 2*


In experiment 2, the Iris Data Set [17], consisting of 150 four-dimensional vectors of iris flowers, is used. The three clusters of flowers to be detected are Versicolor, Virginica, and Setosa, each consisting of 50 vectors. VFA is used as the cooling function of DA. The parameters are set as follows: *c* = 3, *δ*1 = 0.1, *δ*2 = 0.01, *q* = 1.5, *T*high = 2.0.

The minimum, maximum, and average numbers of misclassified data points over 100 trials are summarized in **Table 2**. It can be seen that Shannon-DAFCM gives slightly better results than Tsallis-DAFCM. However, it is also confirmed that Tsallis-DAFCM gives its best results when the temperature reduction rate *r* is set to 1 or 2, whereas the best result for Shannon-DAFCM is obtained only when *r* = 2. Furthermore, the variances of Tsallis-DAFCM are smaller than those of Shannon-DAFCM. These features indicate that a wide range of *r* values is applicable to Tsallis-DAFCM. On the other hand, with larger *r* values, Shannon-DAFCM becomes unstable.


**Table 2.** Misclassified data points of Iris Data Set (100 trials).

**Figure 6.** Plots of the numerator of *uik* as a function of the distance from the cluster center, parameterized by *T* and *q*.

#### **4.3. Dependencies of the membership function on temperature and** *q*

In Eq. (16), it can be seen that the membership function *uik* plays an important role as a weight value to each data point, and it determines the cluster centers. For this reason, the dependencies of *uik* on *T* and *q* are to be investigated.

In this subsection, for simplicity, *vi* is set to 0, because this makes the denominator of Eq. (15) become the sum of the same formulas as its numerator. **Figure 6** illustrates the numerator of *uik* (expressed by *u*<sup>*q*</sup>) as a function of the distance from the cluster center, parameterized by *T* and *q*, respectively. In order to plot the shape of *u*<sup>*q*</sup> as a function of the distance between the cluster center and various data points, in this figure, the distance is replaced by a continuous variable *x*. **Figure 6** confirms that the extent of *u*<sup>*q*</sup> becomes narrower with increasing *q*. Similarly, as the temperature decreases, the distribution becomes narrower.

#### *4.3.1. Quantitative relationship between temperature and q*

As stated in the previous subsection, *T* and *q* inversely affect the extent of *u*<sup>*q*</sup>, which changes in a similar way with decreasing *T* or increasing *q*. In order to examine the quantitative relationship between *T* and *q* in more detail, they are changed independently as follows:

First, we define

$$u^{q}(\mathbf{x},T,q) = \left\{ 1 - \frac{1-q}{T} \mathbf{x}^{2} \right\}^{\frac{q}{1-q}}.\tag{27}$$

Then, Eq. (27) is calculated by fixing *T* and *q* to some constants *T*0 and *q*0. Next, by decreasing *T*, we search for the *q* values that minimize the sum of squares of the residuals of the following two functions:

$$\sum\_{S=0}^{S\_{max}} \left| u^q(\Delta \mathbf{x}S, T\_0, q\_0) - u^q(\Delta \mathbf{x}S, T, q) \right|^2. \tag{28}$$

In the following calculations, the parameters are set as follows: *T*high(= T0) = 2.0; the domain of *x* is 0 ≤ x ≤ 100; the number of sampling points of the sum of residuals *S*max = 10,000; ∆x = 0.01.

For *q* (= *q*0) values of 1.01, 2.0, 6.0, and 10.0, and for *T* decreasing from *T*high, the *q* values that minimize Eq. (28) (expressed by *q*min) are shown in **Figure 7 (a)**. **Figure 7 (b)**, on the other hand, shows the results of cases in which *q* is set to 2.0 and *T* is lowered from *T*high = 2.0, 20.0, 100.0, and 200.0. The approximate curves plotted in **Figure 7** are obtained by fitting the data to the following function:

Deterministic Annealing: A Variant of Simulated Annealing and its Application to Fuzzy Clustering http://dx.doi.org/10.5772/66072 19

$$q\_{\min} = aT^{-b},\tag{29}$$

**Figure 7.** Plots of *q*min as a function of *T*, parameterized by (a) *q*0 and (b) *T*high.

where *a* and *b* denote the fitting parameters. Optimal values for these parameters, obtained by the least-squares method, are summarized in **Table 3** and **Table 4**. From these tables, it is concluded that *b* is nearly equal to 1.0, indicating that *q*min is inversely proportional to *T*. In addition, it can be seen that, though *b* does not change its value much, *a* increases with increasing *T*high.


**Table 3.** Parameters of approximate curves (*T*high = 2.0).


**Table 4.** Parameters of approximate curves (*q* = 2.0).

As a result, by using the approximate relationship between *T* and *q*min, *q*-incrementation clustering might be possible instead of annealing or *T*-reduction [18].
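The inverse *T*–*q* relationship can be probed numerically with a sketch along the lines of Eqs. (27)–(29). Note the pairing below: *q* is raised at the fixed reference temperature to mimic lowering *T* at fixed *q*0, which is the reading consistent with Eq. (29) and the *q*-incrementation idea; this pairing, the search grid, and the sampling settings are our interpretation, not the chapter's code:

```python
def u_q(x, T, q):
    """Eq. (27): [1 - (1-q) x^2 / T]^(q/(1-q)), used here for q > 1."""
    return (1.0 - (1.0 - q) * x * x / T) ** (q / (1.0 - q))

def q_min(T, T0=2.0, q0=2.0, dx=0.05, s_max=200):
    """q at the fixed reference temperature T0 whose kernel best matches
    (least squares over sampled x, cf. Eq. (28)) the kernel at the lowered
    temperature T with q fixed to q0."""
    grid = [round(1.0 + 0.02 * i, 2) for i in range(1, 701)]  # q in [1.02, 15.0]
    def cost(q):
        return sum((u_q(s * dx, T0, q) - u_q(s * dx, T, q0)) ** 2
                   for s in range(s_max + 1))
    return min(grid, key=cost)

# q_min grows as T is lowered, in line with q_min = a * T**(-b), b near 1:
results = {T: q_min(T) for T in (2.0, 1.0, 0.5)}
print(results)
```

At *T* = *T*0 the two kernels coincide, so the search returns *q*0 exactly; as *T* is lowered, the matching *q* climbs, reproducing the trend fitted by Eq. (29).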

## **5. Conclusion**

In this chapter, we first explained the major characteristics of DA and compared it with SA. DA is a variant of SA and searches a minimum deterministically. Thus, generally it is more efficient than SA. We then explained how DA could be applied to the fuzzy *c*-means clustering by employing the entropy maximization method.

After that, focusing on Tsallis-entropy-maximized FCM combined with DA (Tsallis-DAFCM), the effect of VFA on DA was examined. VFA reduces the temperature extremely only at the early annealing stage, and the experimental results showed that this feature improved the performance of Tsallis-DAFCM, because its membership function has a gentle slope in the region far from the origin and clusters can move a long distance to optimal positions from the beginning.

The Tsallis entropy is an extension of the Shannon entropy with a generalization parameter *q*. A shape of a membership function of Tsallis-DAFCM strongly depends on both the temperature and *q*. Accordingly, a relationship between the temperature and *q* was quantitatively investigated, and it was experimentally confirmed that they affected the area covered by the membership function almost inversely. Based on the result, a development of a *q*-incrementation algorithm is our future subject.

## **Author details**

Makoto Yasuda

Address all correspondence to: yasuda@gifu-nct.ac.jp

National Institute of Technology, Gifu College, Gifu, Japan

## **References**


[1] E. Aarts, J. Korst. Simulated Annealing and Boltzmann Machines. Chichester: John Wiley & Sons; 1989.

[2] S. Kirkpatrick, C. D. Gelatt, M. P. Vecchi. Optimization by simulated annealing. Science. 1983; 220: 671–680.

[3] K. Rose, E. Gurewitz, B. C. Fox. A deterministic annealing approach to clustering. Pattern Recognition Letters. 1990; 11(9): 589–594.

[4] J. C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. New York: Plenum Press; 1981.

[5] R.-P. Li, M. Mukaidono. A maximum entropy approach to fuzzy clustering. In: Proceedings of the 4th IEEE International Conference on Fuzzy Systems (FUZZ-IEEE/IFES '95); 1995 March 20–24; Yokohama, Japan; 1995. p. 2227–2232.

[19] M. Yasuda. Quantitative analyses and development of a q-incrementation algorithm for FCM with Tsallis entropy maximization. Advances in Fuzzy Systems. 2015; 404510: 7.

[20] Y. Linde, A. Buzo, R. M. Gray. An algorithm for vector quantizer design. IEEE Transaction on Communication. 1980; Com-28(1): 84–95.

[21] T. Kohonen. Self-Organizing Maps. 3rd ed. New York: Springer; 2000.

[22] L. Ingber. Adaptive Simulated Annealing. In: H. A. Oliveira, A. Petraglia Jr., L. Ingber, M. A. S. Machado, M. R. Petraglia (Eds.), Stochastic Global Optimization and Its Applications with Fuzzy Adaptive Simulated Annealing. New York: Springer; 2012. p. 33–61.

[23] L. Ingber. Very fast simulated re-annealing. Mathematical Computer Modelling. 1989; 12(8): 967–973.

[24] H. Szu, R. Hartley. Fast simulated annealing. Physics Letters A. 1987; 122(3–4): 157–162.

[25] K. Rose. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE. 1998; 86(11): 2210–2239.

[26] K. Rose, E. Gurewitz, G. C. Fox. Constrained clustering as an optimization method. IEEE Transaction on Pattern Analysis and Machine Intelligence. 1993; 15(8): 785–794.

[27] J. Buhmann, H. Kuhnel. Vector quantization with complexity costs. IEEE Transaction on Information Theory. 1993; 39(4): 1133–1143.

[28] K. Rose, E. Gurewitz, G. C. Fox. Vector quantization by deterministic annealing. IEEE Transaction on Information Theory. 1992; 38(4): 1249–1257.

[29] N. Ueda, R. Nakano. Mixture density estimation via EM algorithm with deterministic annealing. In: Proceedings of 1994 IEEE Neural Networks for Signal Processing; 6–8 September 1994; Ermioni, Greece. IEEE; 1994. p. 69–77.

[30] D. Miller, A. V. Rao, K. Rose, A. Gersho. A global optimization technique for statistical classifier design. IEEE Transaction on Signal Processing. 1996; 44: 3108–3122.

[31] T. Hofmann, J. Buhmann. Pairwise data clustering by deterministic annealing. IEEE Transaction on Pattern Analysis and Machine Intelligence. 1997; 19: 1–14.

[32] E. Kim, W. Wang, H. Li, X. Huang. A parallel annealing method for automatic color cervigram image segmentation. In: Medical Image Computing and Computer Assisted Intervention, MICCAI-GRID 2009 HPC Workshop; 2009 September 20–24; London, UK; 2009.

[33] G. C. Fox, D. R. Mani, S. Pyne. Detailed results of evaluation of DAVS(c) deterministic annealing clustering and its application to LC-MS data analysis. 2013. Available from: http://grids.ucs.indiana.edu/ptliupages/publications/DAVS2.pdf

[34] G. C. Fox, D. R. Mani. Parallel deterministic annealing clustering and its application to LC-MS data analysis. In: 2013 IEEE International Conference on Big Data; 6–9 October 2013; Silicon Valley, USA. IEEE; 2013. p. 665–673.


#### **Chapter 2**

#### **Generalized Simulated Annealing**

Yang Xiang, Sylvain Gubian and Florian Martin

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/66071

#### Abstract


Many problems in mathematics, statistics, finance, biology, pharmacology, physics, applied mathematics, economics, and chemistry involve the determination of the global minimum of multidimensional real-valued functions. Simulated annealing methods have been widely used for different global optimization problems. Multiple versions of simulated annealing have been developed, including classical simulated annealing (CSA), fast simulated annealing (FSA), and generalized simulated annealing (GSA). After revisiting the basic idea of GSA using Tsallis statistics, we implemented a modified GSA approach using the R package GenSA. This package was designed to solve complicated nonlinear objective functions with a large number of local minima. In this chapter, we provide a brief introduction to this R package and demonstrate its utility by solving non-convex optimization problems in different fields: physics, environmental science, and finance. We performed a comprehensive comparison between GenSA and other widely used R packages, including rgenoud and DEoptim. GenSA is useful and can provide a solution that is comparable with or even better than that provided by other widely used R packages for optimization.

Keywords: classical simulated annealing (CSA), fast simulated annealing (FSA), generalized simulated annealing (GSA), GenSA

## 1. Introduction

Determining the global minimum of a multidimensional function is the focus of many problems in statistics, biology, physics, applied mathematics, economics, and chemistry [1–6]. Although there is a wide spectrum of problems, computing the global minimum remains a challenging task, because, for example, modern problem dimensionality is increasing.

The optimization of convex functions is usually reasonably conducted using standard optimization approaches, such as simplex optimization, the steepest descent method, and the quasi-Newton method. These methods can also provide reasonable results for the study of simple non-convex functions with only a few dimensions and well-separated local minima.

Deterministic methods are usually faster than stochastic methods although they tend to be trapped into a local minimum. To overcome this particular issue, stochastic methods have been widely developed and can determine a good approximation of the global minimum with a modest computational cost. Among stochastic methods, genetic algorithms [7], evolution algorithms [8], simulated annealing (SA) [9], and taboo search [10–12] have been successfully applied.

Among popular approaches, genetic algorithms [7] mimic the process of natural DNA evolution. In this approach, a population of randomly generated solutions is generated. The solutions are encoded as strings and evolve over many iterations toward better solutions. In each generation, the fitness of each individual in the population is evaluated, and in the next generation, strings are generated by crossover, mutation, and selection, based on their fitness. Differential evolution belongs to such genetic algorithms.

Ant colony optimization (ACO) [13] is another set of stochastic optimization methods, which is inspired by ants wandering to find food for the colony. An ant starts wandering randomly while laying down pheromone trails that will influence other ants because they will be attracted (increase in probability) by the trail, and if they eventually locate food, will return and reinforce the trail. To avoid the algorithm converging to local minima, the pheromone trail is set to evaporate proportionally to the time it takes to traverse the trail to decrease its attractiveness. As a consequence, the pheromone density of short paths becomes higher than that of longer paths. The design of ACO perfectly matches graph-based optimization (e.g., the traveling salesman problem), but it can be adapted to determine the global minimum of real-valued functions [14] by allowing local random moves in the neighborhood of the current states of the ant.

The SA algorithm was inspired by the annealing process that takes place in metallurgy, whereby annealing a molten metal causes it to achieve its global minimum in terms of thermodynamic energy (crystalline state) [9]. In the SA algorithm, the objective function is treated as the energy function of a molten metal, and one or more artificial temperatures are introduced and gradually cooled, which is analogous to the annealing technique, to attempt to achieve the global minimum. To escape from local minima, this artificial temperature (or set of temperatures) acts as a source of stochasticity. Following the metallurgy analogy, at the end of the process, the system is foreseen to reside inside the attractive basin of the global minimum (or in one of the global minima if more than one global minimum exists). In classical simulated annealing (CSA), the visiting distribution is a Gaussian function (a local search distribution) for each temperature. It has been observed that this distribution is not optimal for moving across the entire search space [5]. Generalized simulated annealing (GSA) was developed to overcome this issue by using a distorted Cauchy-Lorentz distribution.

For a more extensive review of stochastic optimization algorithms, see the review provided by Fouskakis and Draper [15].

The R language and environment for statistical computing will be the language of choice in this chapter because it enables a fast implementation of algorithms, access to a variety of statistical modeling packages, and easy-to-use plotting functionalities. These advantages make the use of R preferable in many situations to other programming languages, such as Java, C++, Fortran, and Pascal [16].

In this chapter, we elaborate further on the background and improvements of GSA and the use of the R package GenSA [17], which is an implementation of a modified GSA. We will also discuss the performance of GenSA and show that it outperforms the genetic algorithm (R package rgenoud) and differential evolution (R package DEoptim) in an extensive testbed comprising 134 testing functions based on the success rate and number of function calls. The core function of GenSA is written in C++ to ensure that the package runs as efficiently as possible. The utility of this R package and its use will be presented by way of several applications, such as the famous Thomson problem in physics, non-convex portfolio optimization in finance, and kinetic modeling of pesticide degradation in environmental science.

## 2. Method


As mentioned above, SA methods attempt to determine the global minimum of an objective function by simulating the annealing process of a molten metal. Given an objective function $f(\mathbf{x})$ with $\mathbf{x} = (x\_1, x\_2, \ldots, x\_n)^{T}$, we attempt to determine its global minimum using SA. The general procedure for SA is as follows:

- The temperature $T\_i$ is decreased according to some cooling function.
- Generate a new state $\mathbf{x}\_i = \mathbf{x}\_{i-1} + \Delta\mathbf{x}$, where $\Delta\mathbf{x}$ follows a predefined visiting distribution (e.g., a Gaussian distribution). Let $E\_i = f(\mathbf{x}\_i)$ and $\Delta E = E\_i - E\_{i-1}$.
- Calculate the acceptance probability $p$ of the move $\mathbf{x}\_{i-1} \to \mathbf{x}\_i$. If $p < \mathrm{random}(0,1)$, $\mathbf{x}\_i$ is set back to its previous state $\mathbf{x}\_{i-1}$, and $E\_i$ is set back to $E\_{i-1}$.

We provide more details of SA methods as follows.

## 2.1. Classical simulated annealing (CSA)

According to the process of cooling and the visiting distribution, SA methods can be classified into several categories, among which CSA [9], fast simulated annealing (FSA) [18], and GSA [19] are the most common.

In CSA, proposed by Kirkpatrick et al., the visiting distribution is a Gaussian function, which is a local search distribution [5, 19]:

$$g(\Delta \mathbf{x}) \propto \exp\left(-\frac{(\Delta \mathbf{x})^2}{T}\right),\tag{1}$$

where Δx is the trial jump distance of variable x and T is an artificial temperature in reduced units. In a local search distribution such as the Gaussian distribution, Δx is always localized around zero. The jump is accepted if it is downhill of the energy/fitness/objective function. If the jump is uphill, it might be accepted according to an acceptance probability, which is computed using the Metropolis algorithm [20]:

$$p = \min\left[1, \exp\left(-\frac{\Delta E}{T}\right)\right].\tag{2}$$

Geman and Geman [21] showed that for the classical case, a necessary and sufficient condition for having probability 1 of ending at the global minimum is that the temperature decreases logarithmically with the simulation time, which is impossible in practice because this would dramatically increase the computational time.
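A minimal base-R sketch may help make Eqs. (1) and (2) concrete. The function names (`csa`, `metropolis.p`), the test objective, and the cooling and step-size settings are illustrative choices, not part of any package:

```r
# Metropolis acceptance probability, Eq. (2)
metropolis.p <- function(dE, T) min(1, exp(-dE / T))

# Classical SA sketch: Gaussian visiting distribution, logarithmic cooling
csa <- function(f, x0, n.iter = 2000, T0 = 10, sigma = 1) {
  x <- x0
  E <- f(x)
  x.best <- x
  E.best <- E
  for (i in seq_len(n.iter)) {
    T <- T0 / log(1 + i)                       # logarithmic cooling
    x.new <- x + rnorm(length(x), sd = sigma)  # Gaussian jump, Eq. (1)
    E.new <- f(x.new)
    if (runif(1) < metropolis.p(E.new - E, T)) {  # uphill moves allowed
      x <- x.new
      E <- E.new
    }
    if (E < E.best) {                          # track the best state visited
      x.best <- x
      E.best <- E
    }
  }
  list(par = x.best, value = E.best)
}

# Tilted double well: local minimum near x = 2, global minimum near x = -2
f <- function(x) (x^2 - 4)^2 + x
set.seed(1)
res <- csa(f, x0 = 4)
```

Because the best state visited is tracked, `res$value` can never exceed `f(4)`; at early iterations the high temperature keeps uphill moves likely enough for the walker to cross the barrier between the two wells.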

#### 2.2. Fast simulated annealing (FSA)

In 1987, Szu and Hartley proposed a method called FSA [18], in which the Cauchy-Lorentz visiting distribution, that is, a semi-local search distribution, is introduced:

$$g(\Delta \mathbf{x}) \propto \frac{T}{\left(T^2 + (\Delta \mathbf{x})^2\right)^{\frac{D+1}{2}}},\tag{3}$$

where D is the dimension of the variable space. In a semi-local search distribution, for example, the Cauchy-Lorentz distribution, the jumps Δx are frequently local, but can occasionally be quite long. The temperature T in FSA decreases with the inverse of the simulation time, and the acceptance algorithm is the Metropolis algorithm shown in Eq. (2).
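The contrast with Eq. (1) can be sketched in a few lines of R; the helper names (`fsa.T`, `g.cauchy`) are illustrative, not part of any package:

```r
# FSA cooling: temperature decreases with the inverse of simulation time
fsa.T <- function(t, T0) T0 / t

# Cauchy-Lorentz visiting distribution, Eq. (3), up to normalization
g.cauchy <- function(dx, T, D = 1) T / (T^2 + dx^2)^((D + 1) / 2)

# For D = 1, Eq. (3) is the Cauchy density (up to 1/pi), so jumps can be
# sampled directly with rcauchy; most jumps are local, but the heavy tails
# occasionally produce very long ones.
set.seed(1)
jumps <- rcauchy(1000, location = 0, scale = fsa.T(10, T0 = 1))
```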

#### 2.3. Generalized simulated annealing (GSA)

#### 2.3.1. Introduction to GSA

A generalization of classical statistical mechanics was proposed by Tsallis and Stariolo [19]. In the Tsallis formalism, a generalized statistic is built from generalized entropy:

$$s\_q = k \frac{1 - \sum p\_i^q}{q - 1},\tag{4}$$

where q is a real number, i is the index of the energy spectrum, and $s\_q$ tends to the information entropy:

$$s = -k\sum p\_i \text{ln}p\_i\tag{5}$$

when $q \to 1$. Maximizing the Tsallis entropy with the constraints

$$\begin{cases} \sum p\_i = 1, \text{ and} \\ \sum p\_i^q \varepsilon\_i = const, \end{cases} \tag{6}$$

where ε<sup>i</sup> is the energy spectrum, and the generalized probability distribution is determined to be

$$p\_i = \frac{[1 - (1 - q)\beta \varepsilon\_i]^{\frac{1}{1 - q}}}{z\_q},\tag{7}$$

where $z\_q$ is the normalization constant that ensures that $p\_i$ sums to 1. This distribution converges pointwise to the Gibbs-Boltzmann distribution as q tends to 1.
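The $q \to 1$ limits quoted above follow from the expansion $p\_i^q = p\_i e^{(q-1)\ln p\_i} \approx p\_i\left[1 + (q-1)\ln p\_i\right]$; for the entropy, for instance,

$$s\_q = k\frac{1 - \sum p\_i e^{(q-1)\ln p\_i}}{q-1} \approx k\frac{1 - \sum p\_i\left[1 + (q-1)\ln p\_i\right]}{q-1} = -k\sum p\_i \ln p\_i,$$

using $\sum p\_i = 1$ in the last step; the same expansion applied to Eq. (7) recovers the Gibbs-Boltzmann factor $e^{-\beta \varepsilon\_i}$.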

The GSA algorithm [19] refers to the generalization of both CSA and FSA according to Tsallis statistics. It uses the Tsallis-Stariolo form of the Cauchy-Lorentz visiting distribution, whose shape is controlled by the parameter qv:

$$g\_{q\_v}\left(\Delta \mathbf{x}(t)\right) \propto \frac{\left[T\_{q\_v}(t)\right]^{-\frac{D}{3-q\_v}}}{\left[1 + (q\_v - 1)\frac{\left(\Delta \mathbf{x}(t)\right)^2}{\left[T\_{q\_v}(t)\right]^{\frac{2}{3-q\_v}}}\right]^{\frac{1}{q\_v-1}+\frac{D-1}{2}}}.\tag{8}$$

Note that $q\_v$ also controls the rate of cooling:

$$T\_{q\_v}(t) = \frac{2^{q\_v - 1} - 1}{(1 + t)^{q\_v - 1} - 1} T\_{q\_v}(1),\tag{9}$$

where $T\_{q\_v}(t)$ is the visiting temperature. In turn, a generalized Metropolis algorithm is used for the acceptance probability:

$$p\_{q\_a} = \min\left\{1, \left[1 - (1 - q\_a)\beta \Delta E\right]^{\frac{1}{1 - q\_a}}\right\},\tag{10}$$

where $\beta = 1/(kT\_{q\_a})$. The acceptance probability is controlled by the artificial temperature $T\_{q\_a}$. When $q\_v = 1$ and $q\_a = 1$, GSA is exactly CSA. Another special instance is given by $q\_v = 2$ and $q\_a = 1$, for which GSA is exactly FSA. Asymptotically, as $T \to 0$, GSA behaves similarly to stochastic steepest descent. Cooling faster than in CSA and FSA is achieved when $q\_v > 2$.
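To make Eqs. (9) and (10) concrete, they can be coded directly in R (taking $k = 1$; the names `T.qv` and `p.qa` are illustrative):

```r
# Visiting temperature, Eq. (9); by construction T.qv(1, qv, T1) equals T1
T.qv <- function(t, qv, T1) {
  T1 * (2^(qv - 1) - 1) / ((1 + t)^(qv - 1) - 1)
}

# Generalized acceptance probability, Eq. (10), with beta = 1/T (k = 1)
p.qa <- function(dE, T, qa) {
  base <- 1 - (1 - qa) * dE / T
  if (base <= 0) return(0)   # Tsallis cutoff: reject when the power is undefined
  min(1, base^(1 / (1 - qa)))
}
```

As $q\_a \to 1$, `p.qa` approaches the Metropolis rule $\exp(-\Delta E / T)$, consistent with GSA reducing to CSA and FSA in the special cases above.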

It has been shown in a few instances that GSA is superior to FSA and CSA. Xiang et al. established that a pronounced reduction in the fluctuation of energy and a faster convergence to the global minimum are achieved in the optimization of the Thomson problem and Nickel cluster structure [4–6]. Lemes et al. [22] observed a similar trend when optimizing the structure of a silicon cluster.

#### 2.3.2. Improvement of GSA

A simple technique to accelerate convergence is as follows:

$$T\_{q\_a} = \frac{T\_{q\_v}}{t} \tag{11}$$

where $T\_{q\_a}$ is the acceptance temperature, $T\_{q\_v}$ is the visiting temperature, and t is the number of time steps. Our testing shows that this simple technique accelerates convergence [6].

The performance of GSA can be further improved by combining GSA with a local search method: the large-scale bound-constrained Broyden-Fletcher-Goldfarb-Shanno (BFGS) method [23] for a smooth function, or the Nelder-Mead method [24] for a non-smooth function. A local search is performed at the end of each Markov chain of GSA.
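In base R, both local searches are available through `stats::optim`; a sketch of refining a candidate point, with an illustrative smooth objective, is:

```r
# Smooth toy objective with minimum at (3, 3)
f.smooth <- function(x) sum((x - 3)^2)

# L-BFGS-B handles box constraints; suitable for smooth functions
ref1 <- optim(par = c(0, 0), fn = f.smooth, method = "L-BFGS-B",
              lower = c(-5, -5), upper = c(5, 5))

# Nelder-Mead needs no derivatives; suitable for non-smooth functions
ref2 <- optim(par = c(0, 0), fn = f.smooth, method = "Nelder-Mead")
```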

#### 3. Results

The GenSA R package has been developed and added to the toolkit for solving optimization problems in the Comprehensive R Archive Network (CRAN) repository. The package GenSA has proven to be a useful tool for solving global optimization problems [17].

#### 3.1. Usage

The GenSA R package provides a single function, GenSA, whose arguments are described in the associated help page [25]:


par: Numeric vector. Initial values for the parameters to be optimized over. The default value is NULL, in which case the initial values are generated automatically.

lower: Numeric vector with the same length as par. Lower bounds for the parameters to be optimized over.

upper: Numeric vector with the same length as par. Upper bounds for the parameters to be optimized over.

fn: A function to be minimized, whose first argument is the vector of parameters over which minimization is to take place. It should return a scalar result. The function must always return a numerical value; however, not applicable (NA) and infinity are supported.

...: Allows the user to pass additional arguments to fn.

control: A list of control parameters that can be used to control the behavior of the algorithm. Some components of control are the following:

The default values of the control components are set for a complicated optimization problem. For typical optimization problems of medium complexity, GenSA can determine a reasonable solution quickly; thus, setting the variable threshold.stop to the expected function value is recommended to make GenSA stop at an earlier stage. A maximum run time or a maximum number of function calls can also be set with the max.time or max.call argument, respectively.

#### 3.2. Performance

An extensive performance comparison of currently available R packages for continuous global optimization problems has been published [26]. In this comparison, 48 benchmark functions were used to compare 18 R packages for continuous global optimization. Performance was measured in terms of the quality of the solutions determined and speed. The author concluded that GenSA and rgenoud are recommended in general for continuous global optimization [26]. Based on this conclusion, we set out to perform a more extensive performance test by including more benchmark functions and the additional algorithm DEoptim. The SciPy Python scientific toolkit provides an extensive set of 196 benchmark functions. Because these 196 benchmark functions are coded in Python, we had to convert the Python code to R code. As a result, a subset containing 134 functions was available for this test. One hundred runs using a random initial starting point were performed for every combination of the 134 benchmark functions and the aforementioned three methods. We used a local search method to further refine the best solution provided by DEoptim, because this technique provides a more accurate final result [17]. The default values of the parameters of every package were used in this comparison. A tolerance of 1e-8 was used to establish whether the algorithm determined the global minimum value.

The reliability of a method can be measured by the success rate%, which is defined as the number of successful runs over 100 runs. For each testing function, the number of function calls required to achieve the global minimum was recorded for every method, and we refer to this as the number of function calls. Please note that rgenoud required a longer time to complete 100 runs compared with GenSA and DEoptim. Table 1 shows the success rate% and the average number of function calls (in parentheses).

The mean of the success rate% over all the benchmark functions is 92, 85, and 86% for GenSA, DEoptim, and rgenoud, respectively. Because the number of function calls changes dramatically, the median rather than the mean of the number of function calls is provided: 244.3 for GenSA, 1625.9 for DEoptim, and 1772.1 for rgenoud.

A heatmap of the success rate% for GenSA, DEoptim, and rgenoud is displayed in Figure 1. The color scaling from red to green represents the success rate% from 0 to 100. Clearly, GenSA has a larger green region (high success rate%) than DEoptim and rgenoud.

Both the success rate% and number of function calls show that GenSA performed better than DEoptim and rgenoud in the testbed composed of 134 benchmark functions.

## 3.3. Applications

#### 3.3.1. The Thomson problem in physics

The physicist J.J. Thomson posed the famous Thomson problem after proposing his plum pudding atomic model, based on his knowledge of the existence of negatively charged electrons within neutrally charged atoms [27]. The objective of the Thomson problem is to determine the minimum electrostatic potential energy configuration of N equal point charges constrained at the surface of a unit sphere that repel each other with a force given by Coulomb's law. The Thomson model has been widely studied in physics [28–31]. In the Thomson model, the energy function is

$$E = \frac{1}{2} \sum\_{j \neq i} \frac{1}{|\overrightarrow{r}\_i - \overrightarrow{r}\_j|}. \tag{12}$$

The number of metastable structures (local minima) of the Thomson problem grows exponentially with N [28]. The region containing the global minimum is often small compared with those of other minima [30]. The Thomson problem has been selected as a benchmark for global optimization algorithms in a number of previous studies [4, 5, 28, 32]. The Thomson problem has been solved by both deterministic methods, including steepest descent [28], and stochastic methods, including (but not limited to) constrained global optimization (CGO) [29], the GSA algorithm [4, 5], genetic algorithms [30], and the Monte Carlo method [31, 33]. Typically, deterministic methods with multiple starts can provide a good solution when there are fewer point charges, whereas stochastic methods have to be used when N is large.

| Function name | GenSA | DEoptim-LBFGS | rgenoud |
|---|---|---|---|
| Decanomial | 100% (1519.3) | 100% (741.4) | 100.0% (2050.8) |
| DeckkersAarts | 100% (74.6) | 100% (1525.0) | 100.0% (1988.7) |
| Dolan | 100% (2504.7) | 1% (10293.0) | 82.0% (25067.4) |
| DropWave | 100% (3768.7) | 85% (4009.8) | 83.0% (2973.3) |
| Easom | 97% (5077.5) | 100% (1343.0) | 100.0% (1875.7) |
| EggCrate | 100% (122.9) | 100% (912.2) | 100.0% (1697.2) |
| ElAttarVidyasagarDutta | 100% (675.7) | 93% (1625.9) | 100.0% (2115.7) |
| Exp2 | 100% (85.3) | 100% (781.2) | 100.0% (1707.7) |
| Exponential | 100% (20.6) | 100% (580.3) | 100.0% (1682.5) |
| FreudensteinRoth | 100% (395.4) | 83% (2620.7) | 100.0% (1700.9) |
| Gear | 100% (2225.8) | 100% (1118.0) | 93.0% (1609.6) |
| Giunta | 100% (39.6) | 100% (592.3) | 100.0% (1686.6) |
| GoldsteinPrice | 100% (158.7) | 100% (1023.0) | 100.0% (1703.5) |
| Gulf | 100% (1739.0) | 100% (4076.4) | 1.0% (1810.0) |
| Hansen | 100% (149.7) | 100% (104.0) | 100.0% (132.0) |
| HimmelBlau | 100% (53.7) | 100% (2384.8) | 100.0% (1698.7) |
| HolderTable | 100% (138.0) | 100% (2010.3) | 100.0% (1701.7) |
| Hosaki | 100% (49.8) | 100% (583.2) | 100.0% (1699.5) |
| Infinity | 100% (225.8) | 100% (113.4) | 100.0% (492.0) |
| JennrichSampson | 0% (NA) | 0% (NA) | 0.0% (NA) |
| Keane | 100% (21.4) | 100% (679.1) | 100.0% (629.5) |
| Leon | 100% (128.0) | 100% (1338.2) | 35.0% (2156.5) |
| Levy05 | 100% (152.8) | 100% (144.7) | 100.0% (224.6) |
| Levy13 | 100% (867.3) | 100% (1138.5) | 100.0% (1771.7) |
| Matyas | 100% (33.8) | 100% (967.7) | 100.0% (1702.4) |
| McCormick | 100% (42.2) | 100% (747.8) | 100.0% (1705.4) |
| Michalewicz | 100% (97.3) | 100% (69.0) | 100.0% (157.3) |
| MieleCantrell | 100% (2935.1) | 100% (1942.7) | 100.0% (3438.9) |
| Mishra03 | 81% (138334.7) | 0% (NA) | 24.0% (5745.0) |
| Mishra04 | 0% (NA) | 0% (NA) | 13.0% (1833.2) |
| Mishra05 | 100% (1289.1) | 77% (1396.9) | 100.0% (1680.0) |
| Mishra06 | 0% (NA) | 0% (NA) | 0.0% (NA) |
| Mishra07 | 100% (38.9) | 100% (3055.9) | 100.0% (1679.7) |
| Mishra08 | 100% (1519.3) | 100% (741.4) | 100.0% (1900.8) |
| Mishra09 | 100% (80.0) | 85% (4832.0) | 99.0% (12356.0) |
| SixHumpCamel | 100% (83.3) | 100% (909.2) | 100.0% (1695.7) |
| Sodp | 100% (64.3) | 100% (477.7) | 100.0% (1697.5) |
| Sphere | 100% (17.4) | 100% (730.2) | 100.0% (1683.2) |
| Step | 100% (471.0) | 100% (219.3) | 100.0% (1131.5) |
| Step2 | 100% (433.4) | 100% (222.0) | 100.0% (1177.1) |
| StyblinskiTang | 100% (94.7) | 100% (861.4) | 100.0% (1824.2) |
| TestTubeHolder | 100% (839.0) | 98% (3957.5) | 100.0% (2088.5) |
| ThreeHumpCamel | 100% (86.3) | 100% (817.5) | 100.0% (1699.0) |
| Treccani | 100% (49.1) | 100% (1015.2) | 100.0% (1696.4) |
| Trefethen | 64% (20972.8) | 0% (NA) | 46.0% (29439.5) |
| Trigonometric02 | 100% (1766.2) | 100% (1319.6) | 99.0% (4809.6) |
| Ursem01 | 100% (47.7) | 100% (682.1) | 100.0% (1753.8) |
| Ursem03 | 100% (584.0) | 100% (1645.7) | 100.0% (1777.7) |
| Ursem04 | 100% (210.2) | 100% (1372.3) | 100.0% (1847.3) |
| UrsemWaves | 100% (2565.0) | 26% (4025.8) | 50.0% (1674.8) |
| VenterSobiezcczanskiSobieski | 100% (698.3) | 100% (1748.0) | 100.0% (2004.8) |
| Vincent | 100% (41.4) | 100% (3013.4) | 100.0% (1703.5) |
| Wavy | 100% (535.1) | 100% (3110.8) | 100.0% (1745.0) |
| WayburnSeader01 | 100% (156.1) | 96% (2258.3) | 93.0% (16665.5) |
| WayburnSeader02 | 100% (262.8) | 100% (1772.3) | 100.0% (1826.0) |
| Wolfe | 100% (50.3) | 100% (3252.9) | 100.0% (1720.7) |
| XinSheYang02 | 100% (1048.0) | 87% (2284.5) | 100.0% (1768.5) |
| XinSheYang04 | 100% (457.0) | 88% (3329.7) | 100.0% (1777.5) |
| Xor | 100% (26204.2) | 48% (18245.1) | 100.0% (1852.9) |
| YaoLiu09 | 100% (482.3) | 98% (3753.6) | 100.0% (1824.3) |
| Zacharov | 100% (59.7) | 100% (944.3) | 100.0% (1696.6) |
| Zettl | 100% (123.4) | 100% (846.8) | 100.0% (1712.5) |
| Zirilli | 100% (131.0) | 99% (702.3) | 100.0% (1695.1) |

Table 1. The success rate% and average number of function calls (in parentheses) for GenSA, DEoptim, and rgenoud.

Figure 1. Heat map of the success rate% for GenSA, DEoptim, and rgenoud.

We can define an R function for the Thomson problem as follows:

```
>Thomson.fn <- function(x) {
  fn.call <<- fn.call + 1                  # count objective evaluations
  x <- matrix(x, ncol = 2)                 # column 1: theta, column 2: phi
  # spherical angles -> Cartesian coordinates on the unit sphere
  y <- t(apply(x, 1, function(z) {
    c(sin(z[1]) * cos(z[2]),
      sin(z[1]) * sin(z[2]),
      cos(z[1]))
  }))
  n <- nrow(x)
  tmp <- matrix(NA, nrow = n, ncol = n)
  index <- cbind(as.vector(row(tmp)), as.vector(col(tmp)))
  index <- index[index[, 1] < index[, 2], , drop = FALSE]  # pairs with i < j
  rdist <- apply(index, 1, function(z) {   # Coulomb term 1/r for each pair
    1/sqrt(sum((y[z[1], ] - y[z[2], ])^2))
  })
  res <- sum(rdist)
  return(res)
}
```
In this example, we chose six point charges because our purpose is only to show how to use our package GenSA. The global energy minimum of six equal point charges on a unit sphere is 9.98528137 [28].
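As a sanity check, this value can be reproduced in closed form: the minimizing configuration is a regular octahedron, which has 12 nearest-neighbour pairs at distance $\sqrt{2}$ and 3 antipodal pairs at distance 2:

```r
# Closed-form energy of six unit charges at the vertices of a regular
# octahedron: 12 pairs at distance sqrt(2), 3 pairs at distance 2
E.octahedron <- 12 / sqrt(2) + 3 / 2   # = 9.98528137..., matching [28]
```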

We applied GenSA with default settings to the Thomson problem. As the number of point charges is small, GenSA can determine the global minimum easily. We set the maximum number of function calls allowed, max.call, to 600:

```
>n.particles <- 6  # regular octahedron with global minimum 9.98528137
>lower.T <- rep(0, 2 * n.particles)
>upper.T <- c(rep(pi, n.particles), rep(2 * pi, n.particles))
>require(GenSA)
>options(digits = 9)
>set.seed(1234)
>fn.call <<- 0
>out.GenSA <- GenSA(par = NULL, lower = lower.T, upper = upper.T,
+  fn = Thomson.fn, control = list(max.call = 600))
>print(out.GenSA[c("value", "counts")])
$value
[1] 9.98528137
$counts
[1] 600
>cat("GenSA called the function", fn.call, "times.\n")
GenSA called the function 600 times.
```
The global minimum 9.98528137 for six point charges is determined within 600 function calls.

## 3.3.2. Kinetic modeling of pesticide degradation


Various types of pesticides have been widely used in modern agriculture. It is important to calculate the concentration of a pesticide in groundwater and surface water. We will show how GenSA can be used to fit a degradation model for a parent compound with one transformation product. All the data and models are from the R packages mkin [34] and FOCUS (2006) [35].

After loading the library, we obtain the data (FOCUS Example Dataset D). The observed concentrations are in the column named "value" at the time specified in column "time" for the two observed variables named "parent" and "m1."

```
>require(mkin)
>require(GenSA)
>require(deSolve)
>options(digits = 9)
>set.seed(1234)
>str(FOCUS_2006_D)
'data.frame':	44 obs. of 3 variables:
 $ name : Factor w/ 2 levels "m1","parent": 2 2 2 2 2 2 2 2 2 2 ...
 $ time : num 0 0 1 1 3 3 7 7 14 14 ...
 $ value: num 99.5 102 93.5 92.5 63.2 ...
```
The change in the measured concentrations of the parent and m1 over time is displayed in the lower panel of Figure 2.

According to the kinetic model displayed in the upper panel of Figure 2, we define the derivative function as follows:

```
>df <- function(t, y, parameters) {
+  d_parent <- -parameters["k_parent_sink"] * y["parent"] -
+    parameters["k_parent_m1"] * y["parent"]
+  d_m1 <- parameters["k_parent_m1"] * y["parent"] -
+    parameters["k_m1_sink"] * y["m1"]
+  return(list(c(d_parent, d_m1)))
+ }
```
There is one initial concentration, parent\_0, and three kinetic parameters, k\_parent\_m1, k\_parent\_sink, and k\_m1\_sink, whose values need to be estimated by fitting the concentration curves. The initial concentration of m1, m1\_0, is always zero. We define the objective function, fn, as the sum of squared residuals (deviations of predicted from observed values for both the parent and m1):

```
>fn <- function(x,
+  names_x = c("parent_0", "k_parent_sink", "k_parent_m1",
+              "k_m1_sink"),
+  names_y = c("parent", "m1"),
+  names_parameters = c("k_parent_sink", "k_parent_m1",
+                       "k_m1_sink"),
+  tspan = if (!is.null(dat.fitting))
+    sort(unique(dat.fitting[["time"]])) else NULL,
+  df, dat.fitting = NULL) {
+  m1_0 <- 0
+  names(x) <- names_x
+  y0 <- c(x["parent_0"], m1_0)
+  names(y0) <- names_y
+  parameters <- x[c("k_parent_sink", "k_parent_m1", "k_m1_sink")]
+  stopifnot(!is.null(tspan))
+  out <- ode(y0, tspan, df, parameters)
+  if (is.null(dat.fitting)) {
+    rss <- out
+  } else {
+    # sum of squared residuals over both observed variables
+    rss <- sum(sapply(c("parent", "m1"), function(nm) {
+      o.time <- as.character(dat.fitting[dat.fitting$name == nm,
+                                         "time"])
+      sum((out[, nm][match(o.time, out[, "time"])] -
+           dat.fitting[dat.fitting$name == nm, "value"])^2,
+          na.rm = TRUE)
+    }))
+  }
+  return(rss)
+ }
```
Then, we use GenSA to estimate the four parameters. As the model is not complicated, we limit the running time of GenSA to 5 seconds by setting max.time = 5:

```
>res<- GenSA(fn = fn, lower = c(90, rep(0.001, 3)),
+upper = c(110, rep(0.1, 3)),
+control = list(max.time = 5), df = df,
+dat.fitting = FOCUS_2006_D)
>names(res$par) <- c("parent_0", "k_parent_sink", "k_parent_m1",
+                    "k_m1_sink")
>print(round(res$par, digits = 6))
     parent_0 k_parent_sink   k_parent_m1     k_m1_sink
    99.598491      0.047920      0.050778      0.005261
```
GenSA successfully determines the correct value of the initial concentration of the parent and the three kinetic parameters. The fitting curves for the parent and m1 are displayed in the lower panel of Figure 2.

Figure 2. Kinetic modeling of pesticide degradation. Upper panel: illustration of the pesticide degradation model. Lower panel: experimental concentration data and fitting curves for the parent and m1.

#### 3.3.3. Finance: risk allocation portfolios

Portfolio selection problems were addressed by mean-risk models in the 1950s. The most popular measures of downside risk are the value-at-risk (VaR) and conditional VaR (CVaR). Portfolio weights for which the portfolio has the lowest CVaR and each investment can contribute at most 22.5% to the total portfolio CVaR risk were estimated using differential evolution algorithms in Mullen et al. [16] and Ardia et al. [36]. The code for the objective function in portfolio optimization, rewritten from Ardia et al. [36], is given below:

```
> library("quantmod")
> tickers <- c("GE", "IBM", "JPM", "MSFT", "WMT")
> getSymbols(tickers, from = "2000-12-01", to = "2010-12-31")
[1] "GE"   "IBM"  "JPM"  "MSFT" "WMT"
> P <- NULL
> for (ticker in tickers) {
+   tmp <- Cl(to.monthly(eval(parse(text = ticker))))
+   P <- cbind(P, tmp)
+ }
> colnames(P) <- tickers
> R <- diff(log(P))
> R <- R[-1, ]
> mu <- colMeans(R)
> sigma <- cov(R)
> library("PerformanceAnalytics")
> pContribCVaR <- ES(weights = rep(0.2, 5),
+   method = "gaussian", portfolio_method = "component",
+   mu = mu, sigma = sigma)$pct_contrib_ES
> obj <- function(w) {
+   if (sum(w) == 0) { w <- w + 1e-2 }
+   w <- w / sum(w)
+   CVaR <- ES(weights = w,
+     method = "gaussian", portfolio_method = "component",
+     mu = mu, sigma = sigma)
+   tmp1 <- CVaR$ES
+   tmp2 <- max(CVaR$pct_contrib_ES - 0.225, 0)
+   out <- tmp1 + 1e3 * tmp2
+   return(out)
+ }
```
GenSA can be used to determine the global optimum of this function, using a search domain bounded between 0 and 1 for each of the five parameters to be optimized:

```
> library(GenSA)
> lb <- rep(0, 5)  # lower bounds: all 5 parameters have minimum value 0
> ub <- rep(1, 5)  # upper bounds: all 5 parameters have maximum value 1
> out1.GenSA <- GenSA(fn = obj, lower = lb, upper = ub)
```
For non-differentiable objective functions, the smooth parameter in the control argument can be set to FALSE, which means that the Nelder-Mead method is used in the local search:

```
> out2.GenSA <- GenSA(fn = obj, lower = rep(0, 5), upper = rep(1, 5),
+   control = list(smooth = FALSE, max.call = 3000))
```
The max.call parameter is set to 3000 to make the algorithm stop earlier:
```
> out2.GenSA$value
[1] 0.1141484884
> out2.GenSA$counts
[1] 3000
> cat("GenSA call functions", fn.call.GenSA, "times.\n")
GenSA call functions 3000 times.
> wstar.GenSA <- out2.GenSA$par
> wstar.GenSA <- wstar.GenSA / sum(wstar.GenSA)
```

```
> rbind(tickers, round(100 * wstar.GenSA, 2))
        [,1]    [,2]    [,3]   [,4]    [,5]
tickers "GE"    "IBM"   "JPM"  "MSFT"  "WMT"
        "18.92" "21.23" "8.33" "15.92" "35.6"
> 100 * (sum(wstar.GenSA * mu) - mean(mu))
[1] 0.03790568876
```
GenSA determined a minimum of 0.1141484884 within 3000 function calls.

## 4. Discussion and conclusions


Discrete optimization problems, in particular the feature selection problem, arise widely in practice. GSA can also be applied to such discrete problems; see [37] for details.
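As an illustration of how an annealing-type search applies to a discrete problem, the following Python toy anneals over binary feature-inclusion masks. It is a sketch only: it uses a simple Metropolis rule rather than GSA's generalized visiting distribution, and the loss function is hypothetical.

```python
import math
import random

def sa_feature_select(loss, n_features, t0=1.0, alpha=0.99, n_iter=800, seed=1):
    """Anneal over binary inclusion masks: flip one feature at a time,
    accept worse masks with Metropolis probability exp(-(new - cur) / T)."""
    rng = random.Random(seed)
    mask = [rng.random() < 0.5 for _ in range(n_features)]
    cur = loss(mask)
    best_mask, best = list(mask), cur
    t = t0
    for _ in range(n_iter):
        j = rng.randrange(n_features)
        mask[j] = not mask[j]          # flip one feature in/out
        new = loss(mask)
        if new < cur or rng.random() < math.exp(-(new - cur) / t):
            cur = new
            if cur < best:
                best_mask, best = list(mask), cur
        else:
            mask[j] = not mask[j]      # revert the flip
        t *= alpha                     # geometric cooling
    return best_mask, best

# hypothetical toy loss: features {0, 2} are informative, extras are penalized
target = {0, 2}
loss = lambda m: sum(1 for i, on in enumerate(m) if on != (i in target))
bm, bl = sa_feature_select(loss, 6)
```

With a unimodal toy loss like this, the late near-greedy phase reliably reaches the optimal mask; real feature-selection losses (cross-validated model error) are far rougher.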

GSA is a powerful method for non-convex global optimization problems. We developed the R package GenSA based on Tsallis statistics and GSA. In an extensive performance testbed composed of 134 benchmark functions, GenSA achieved a higher average success rate and a smaller median number of function calls than two widely used R packages, DEoptim and rgenoud. GenSA is useful and can provide solutions that are comparable with, or even better than, those provided by other widely used R optimization packages.

R is very good for prototyping. When heavy computation is needed, other languages, such as C/C++, Fortran, Java, and Python, are recommended. Considering both speed and usability, a Python version of GSA, PyGenSA, is being developed and will be released within the SciPy scientific toolkit at the end of 2016.

## Acknowledgements

This work would not have been possible without the R open-source software project. We would like to thank our colleague Julia Hoeng for inspirational discussions and support. This study was funded by Philip Morris International.

## Author details

Yang Xiang† \*, Sylvain Gubian† and Florian Martin

\*Address all correspondence to: yang.xiang@pmi.com

Philip Morris International R&D, Philip Morris Products S.A., Neuchâtel, Switzerland

† These authors contributed equally to this work

## References

[1] Agostini FP, Soares-Pinto DD, Moret MA, Osthoff C, Pascutti PG. Generalized simulated annealing applied to protein folding studies. Journal of Computational Chemistry. 2006 Aug 1;27(11):1142–55.

[2] Serra P, Stanton AF, Kais S, Bleil RE. Comparison study of pivot methods for global optimization. The Journal of Chemical Physics. 1997 May 1;106(17):7170–7.

[3] De Guire V, Caron M, Scott N, Ménard C, Gaumont-Leclerc MF, Chartrand P, Major F, Ferbeyre G. Designing small multiple-target artificial RNAs. Nucleic Acids Research. 2010 Jul 1;38(13):e140.

[4] Xiang Y, Sun DY, Fan W, Gong XG. Generalized simulated annealing algorithm and its application to the Thomson model. Physics Letters A. 1997 Aug 25;233(3):216–20.

[5] Xiang Y, Gong XG. Efficiency of generalized simulated annealing. Physical Review E. 2000 Sep 1;62(3):4473.

[6] Xiang Y, Sun DY, Gong XG. Generalized simulated annealing studies on structures and properties of Ni_n (n = 2–55) clusters. The Journal of Physical Chemistry A. 2000 Mar 30;104(12):2746–51.

[7] Holland JH. Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence. U Michigan Press, 839 Greene Street, Ann Arbor, MI 48104-3209; 1975.

[8] Storn R, Price K. Differential evolution—a simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization. 1997 Dec 1;11(4):341–59.

[9] Kirkpatrick S, Gelatt CD, Vecchi MP. Optimization by simulated annealing. Science. 1983;220(4598):671–80.

[10] Cvijovic D, Klinowski J. Taboo search: an approach to the multiple minima problem. Science. 1995 Feb 3;267(5198):664.

[11] Cvijović D, Klinowski J. Taboo search: An approach to the multiple-minima problem for continuous functions. In Handbook of Global Optimization 2002 (pp. 387–406). Springer Publishing Company, 11 West 42nd Street, 15th Floor, New York, NY 10036, US.

[12] Glover F, Taillard E. A user's guide to tabu search. Annals of Operations Research. 1993 Mar 1;41(1):1–28.

[13] Dorigo M. Optimization, learning and natural algorithms. Ph.D. Thesis, Politecnico di Milano, Italy. 1992.

[14] Toksari MD. Ant colony optimization for finding the global minimum. Applied Mathematics and Computation. 2006 May 1;176(1):308–16.

[15] Fouskakis D, Draper D. Stochastic optimization: a review. International Statistical Review. 2002 Dec 1;70(3):315–49.

[16] Mullen KM, Ardia D, Gil DL, Windover D, Cline J. DEoptim: An R package for global optimization by differential evolution. Journal of Statistical Software. 2011;40(6):1–26.

[32] Altschuler EL, Pérez-Garrido A. Defect-free global minima in Thomson's problem of charges on a sphere. Physical Review E. 2006 Mar 6;73(3):036108.

[33] Wales DJ, McKay H, Altschuler EL. Defect motifs for spherical topologies. Physical Review B. 2009 Jun 22;79(22):224115.

[34] Ranke J. mkin: Routines for fitting kinetic models with one or more state variables to chemical degradation data, 2013b. URL http://CRAN.R-project.org/package=mkin.

[35] Boesten JJ, Aden K, Beigel CY, Beulke S, Dust M, Dyson JS, Fomsgaard IS, Jones RL, Karlsson S, Van der Linden AM, Richter O. Guidance document on estimating persistence and degradation kinetics from environmental fate studies on pesticides in EU registration. Report of the FOCUS Work Group on Degradation Kinetics, EC Doc. Ref. Sanco/10058/2005, version 1; 2005.

[36] Ardia D, Boudt K, Carl P, Mullen KM, Peterson BG. Differential evolution with DEoptim. R Journal. 2011 Jun 1;3(1):27–34.

[37] Xiang Y, Hoeng J, Martin F, inventors; Philip Morris Products SA, assignee. Systems and methods for generating biomarker signatures with integrated dual ensemble and generalized simulated annealing techniques. United States patent application US 14/409,679. 2013 Jun 21.

## **A Simulated Annealing Based Optimization Algorithm**

Yoel Tenne

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/66455

#### **Abstract**

In modern engineering finding an optimal design is formulated as an optimization problem which involves evaluating a computationally expensive black-box function. To alleviate these difficulties, such problems are often solved by using a metamodel, which approximates the computer simulation and provides predicted values at a much lower computational cost. While metamodels can significantly improve the efficiency of the design process, they also introduce several challenges, such as a high evaluation cost, the need to effectively search the metamodel landscape and to locate good solutions, and the selection of which metamodel is most suitable to the problem being solved. To address these challenges, this chapter proposes an algorithm that uses a hybrid simulated annealing and SQP search to effectively search the metamodel. It also uses ensembles that combine prediction of several metamodels to improve the overall prediction accuracy. To further improve the ensemble accuracy, it adapts the ensemble topology during the search. Finally, to ensure convergence to a valid optimum in the presence of metamodel inaccuracies, the proposed algorithm operates within a trust-region framework. An extensive performance analysis based on both mathematical test functions and an engineering application shows the effectiveness of the proposed algorithm.

**Keywords:** simulated annealing, metamodelling, ensembles, trust-region, expensive optimization problems

## **1. Introduction**

With the advent of high performance computing, intricate computer simulations can now model real-world physics with high accuracy. This, in turn, transformed the engineering design process into a simulation-driven process in which candidate designs are evaluated by

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


a computer simulation instead of a laboratory experiment. In this set-up, the design process is formulated as an optimization problem with several unique features [1]:

• The computer simulation acts as the objective function, since it assigns objective values to candidate designs. However, the simulation is often a legacy code or commercial software whose inner workings are inaccessible to the user, and so there is no analytic expression that defines how candidate designs are mapped to objective values. Such a *black-box function* precludes the use of optimization algorithms that require an analytic function.

• Each simulation run is *computationally expensive*, that is, it requires a lengthy run time, and this severely restricts the number of candidate designs that can be evaluated.

• Both the real-world physics being modelled and the numerical simulation process may result in an objective function which has a complicated landscape, and this further complicates the optimization process.

An optimization strategy that has proven effective in such problems is that of *metamodel-assisted* optimization, namely, where a metamodel approximates the true expensive function and provides predicted objective values at a lower computational cost [1]. However, the integration of metamodels and ensembles into the optimization search introduces several challenges:

• Locating a good solution requires effectively searching the metamodel, which can have a complicated landscape with multiple local solutions, and hence can be a difficult task.

• Since function evaluations are expensive, only a small number of evaluated vectors will be available and hence the metamodel will be inaccurate. In severe cases, the optimization search can converge to a false optimum, namely, one artificially created by the metamodel's inaccurate approximation of the true expensive function.

• A variety of metamodels have been proposed, for example, artificial neural networks (ANNs), Kriging and radial basis functions (RBFs) [2, 3], but the optimal variant is problem-dependent and is typically not known *a priori*. In an attempt to address this issue, *ensembles* employ several metamodels concurrently and aggregate their individual predictions into a single one [4, 5]. However, the effectiveness of ensembles depends on their topology, namely, which metamodels they incorporate, but the optimal topology is again typically unknown *a priori*, and may be impractical to identify by numerical experiments due to the high cost of each simulation run.

To address these issues, this chapter proposes an optimization algorithm that uses a hybrid simulated annealing (SA) search followed by a local refinement of solutions based on an SQP search. In this manner, this set-up achieves both an effective global and local search, which assists in locating good solutions. To address the issue of inaccurate metamodel predictions, the proposed algorithm operates within a trust region (TR) approach that manages the metamodel and ensures convergence to a valid optimum. Finally, to further improve the prediction accuracy, the proposed algorithm uses ensembles and selects the most suitable topology during the search.

Accordingly, this chapter presents several contributions: (a) a hybrid SA-SQP metamodel-assisted search, (b) integration within a TR framework, and (c) continuous selection of the ensemble topology during the search. An extensive performance analysis based on both mathematical test functions and an engineering problem of airfoil shape optimization shows the effectiveness of the proposed algorithm.

The remainder of this chapter is organized as follows: Section 2 provides the background information, Section 3 describes the proposed algorithm, Section 4 provides an extensive performance evaluation and discussion, and finally Section 5 concludes this chapter.

## **2. Background**

## **2.1. Optimization techniques**

As mentioned in Section 1, simulation-driven problems often include a challenging objective function, such as having multiple local optima or lacking an analytic expression. In such settings, classical gradient-based optimization algorithms can perform poorly, and therefore researchers have explored various alternative approaches [1]. One class of such algorithms is the *nature-inspired metaheuristics,* which include evolutionary algorithms (EAs), particle swarm optimization (PSO), and SA. The latter was inspired by the physics of the annealing process in metals: initially a metal has a high temperature and so particles have a high probability of moving to a higher energy state. As the metal cools in the annealing process, particles are more likely to move to a lower energy level than to a higher one. The process is completed once the system has reached the lowest possible energy level, typically its temperature of equilibrium with the environment. In the realm of global optimization, these mechanics have been translated into a heuristic search, which starts with an initial vector, namely, a candidate solution. During the search, the current vector is perturbed so that new vectors in its vicinity are obtained. These vectors are evaluated and replace the original vector if: (a) they are better, or (b) they are worse and with probability *p*, which is analogous to the energy state changes. As *p* decreases, the search is transformed from being explorative to being more localized. Two main parameters of the SA algorithm are the *annealing schedule*, namely, the duration of the search process, which is determined by the manner that the temperature is decreased, and the selection probability function, which defines the dynamic threshold for accepting a worse solution. Algorithm 1 gives a pseudocode of a baseline SA algorithm, while Section 3 gives the specific parameter settings of the SA implementation used in this study.
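The loop described above can be sketched in a few lines. The following Python sketch is an illustration rather than the chapter's Algorithm 1 verbatim: it uses a geometric annealing schedule and a Metropolis-style acceptance probability, and the step size, initial temperature and cooling factor are illustrative choices.

```python
import math
import random

def simulated_annealing(f, x0, step=0.5, t0=10.0, alpha=0.95, n_iter=2000, seed=0):
    """Baseline SA: perturb the current vector, always accept better moves,
    and accept worse moves with probability exp(-(f_new - f_cur) / T)."""
    rng = random.Random(seed)
    x_cur, f_cur = list(x0), f(x0)
    x_best, f_best = list(x_cur), f_cur
    t = t0
    for _ in range(n_iter):
        # perturb the current vector to obtain a neighbour
        x_new = [xi + rng.uniform(-step, step) for xi in x_cur]
        f_new = f(x_new)
        # Metropolis-style acceptance of the neighbour
        if f_new < f_cur or rng.random() < math.exp(-(f_new - f_cur) / t):
            x_cur, f_cur = x_new, f_new
            if f_cur < f_best:
                x_best, f_best = list(x_cur), f_cur
        t *= alpha  # geometric annealing schedule
    return x_best, f_best

# minimise a shifted sphere function with optimum at (1, 1)
xb, fb = simulated_annealing(lambda x: sum((xi - 1.0) ** 2 for xi in x), [4.0, -3.0])
```

As the temperature `t` decays, the acceptance probability for worse moves shrinks and the search shifts from exploration to local descent, exactly as described above.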

The underlying mechanism of the SA algorithm was originally proposed by Metropolis et al. [6], while the more common annealing-inspired formulation was later proposed by Černy [7] and Kirkpatrick et al. [8]. Since then the algorithm has been widely used in the literature, and some recent examples include [9] in finance, [10] in machine learning, [11] in chemical engineering and [12] in production line machine scheduling.

### **2.2. Expensive black-box problems**

Computationally expensive optimization problems are common across engineering and science. Typically in such problems, candidate designs are parameterized as vectors of design variables and sent to the simulation for evaluation, as shown in **Figure 1**.

**Figure 1.** The layout of an expensive black-box optimization problem. The optimization algorithm generates candidate solutions, and these are evaluated by the simulation, which acts as a 'black-box' function, to obtain their corresponding objective values.

As mentioned in Section 1, metamodels are often used in such settings to alleviate the high computational cost of the simulation runs [2, 3]. However, the integration of metamodels into the search introduces two main challenges:

• Prediction uncertainty: Due to the high cost of the simulation runs, only a small number of designs can be evaluated, which in turn degrades the prediction accuracy of the metamodel and leads to optimization with uncertainty regarding the validity of the predicted objective values [13]. To address this, the metamodel needs to be updated during the search to ensure that its accuracy is sufficient to drive the search to a correct final result. To accomplish this, the proposed algorithm is structured based on the TR approach [14]. In this way, the algorithm performs a sequence of *trial steps* that are constrained to the TR, namely, the region where the metamodel is assumed to be accurate. Based on the success of the trial step, namely, whether a new optimum has been found, the TR and the set of vectors are updated. Section 3 presents a detailed description of the TR approach implemented in this study.

• Metamodel suitability: Various metamodel variants have been proposed, but the optimal variant is problem dependent and is typically unknown *a priori* [15]. To address this, ensembles employ multiple metamodels concurrently and combine their predictions into a single one to obtain a more accurate prediction [4, 5, 16]. **Figure 2** shows an example based on the Rosenbrock function.

**Figure 2.** An ensemble topology consisting of RBF, RBFN and Kriging metamodels. The overall prediction is the aggregation of the individual predictions.

The ensemble *topology*, namely, which metamodel variants are incorporated, is typically determined *a priori* and is unchanged during the search. However, the topology directly affects the prediction accuracy, and hence an unsuitable topology can degrade the search effectiveness. As an example, RBF, RBFN and Kriging metamodels (described in Appendix 1) were used to generate several ensemble topologies. The same testing and training samples were used with all the topologies (sized 30 and 20 vectors, respectively), such that each ensemble was trained with the training sample and its prediction accuracy was tested on the testing sample. The prediction accuracy was measured both with the root mean squared error (RMSE),

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum\_{i=1}^{n} \left( \hat{f}(\mathbf{x}\_i) - f(\mathbf{x}\_i) \right)^2} \tag{1}$$

and the *R*<sup>2</sup> indicator,

$$R^2 = \frac{S\_t - S\_r}{S\_t},\tag{2a}$$

where

$$S\_t = \sum\_{i=1}^{n} \left( f(\mathbf{x}\_i) - \overline{f} \right)^2, \quad S\_r = \sum\_{i=1}^{n} \left( f(\mathbf{x}\_i) - \hat{f}(\mathbf{x}\_i) \right)^2,\tag{2b}$$

where $f(\mathbf{x})$ and $\hat{f}(\mathbf{x})$ are the true and the ensemble-predicted values, respectively, $\overline{f}$ is the mean of the true values, and $\mathbf{x}_i$, $i = 1 \ldots n$, are the testing vectors. **Table 1** presents the test results, from which it follows that the prediction accuracy varied with the topology and that no single topology was the overall best.

Note: R: RBF, RN: RBF network, K: Kriging.

**Table 1.** Statistics for prediction accuracies of different topologies.

Addressing this issue, the algorithm proposed in this study selects the most suitable ensemble topology during the search, as explained in the following section.

## **3. Proposed algorithm**

This section describes the algorithm proposed in this study, which is designed to address the issues described in Sections 1 and 2. The proposed algorithm operates in five main steps, as follows:

Step 1. *Initialization*: The algorithm begins by generating an initial sample of vectors based on the optimal Latin hypercube design (OLHD) method [17]. The method ensures that the resultant sample is space-filling, namely, adequately covers the search space, which in turn improves the prediction accuracy of the metamodels. The OLHD method exploits patterns of point locations for optimal Latin hypercube designs based on a variation of the maximum distance criterion to produce near-optimal designs efficiently. After generating the sample, the vectors are evaluated with the true expensive function.
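For intuition, a basic Latin hypercube sample can be generated as follows. This Python sketch deliberately omits the maximum-distance optimization that distinguishes the OLHD method of [17] from plain Latin hypercube sampling; it only guarantees the stratification property (each dimension split into equal bins, each bin used exactly once).

```python
import random

def latin_hypercube(n, d, seed=0):
    """Plain (non-optimal) Latin hypercube sample of n points in [0, 1)^d."""
    rng = random.Random(seed)
    cols = []
    for _ in range(d):
        perm = list(range(n))
        rng.shuffle(perm)  # one random bin order per dimension
        cols.append(perm)
    # point i takes one jittered value from bin cols[k][i] in dimension k
    return [[(cols[k][i] + rng.random()) / n for k in range(d)]
            for i in range(n)]

pts = latin_hypercube(10, 3)
```

Each of the 10 points occupies a distinct one-tenth-wide bin in every dimension, which is the space-filling property the initialization step relies on.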

#### Step 2. *Ensemble selection*:

Step 2.1. Following the cross-validation (CV) procedure [18], the vectors that have been evaluated so far are split into a training set and a testing set. In turn, each candidate metamodel variant is trained with the training set and is tested on the testing set. The prediction accuracy is measured with the root mean squared error (RMSE), which is calculated as

$$\varepsilon\_j = \sqrt{\frac{1}{l} \sum\_{i=1}^{l} \left( m\_j(\mathbf{x}\_i) - f(\mathbf{x}\_i) \right)^2},\tag{3}$$

where $\mathbf{x}_i$, $i = 1 \ldots l$, are the vectors in the testing set, $f(\mathbf{x}_i)$ is the exact and already known function value at $\mathbf{x}_i$, and $m_j(\mathbf{x})$ is the prediction of the *j*th metamodel variant that has been trained with the training set.

Step 2.2. The evaluated vectors are again split into a training and a testing set. For each candidate ensemble topology, the following steps are performed:

• Each metamodel variant that is active in the ensemble is assigned an *ensemble weight*, which is calculated as

$$w\_i = \frac{\varepsilon\_i^{-1}}{\sum\_{j=1}^{n\_m} \varepsilon\_j^{-1}},\tag{4}$$

and the overall ensemble prediction is then given by

$$
\psi(\mathbf{x}) = \sum\_{i=1}^{n\_m} w\_i m\_i(\mathbf{x}),
\tag{5}
$$

where $n_m$ is the number of participating metamodels in the ensemble and $m_i(\mathbf{x})$ is a metamodel that has been trained with the new training sample.

• The prediction accuracy of the ensemble is estimated based on its RMSE on the testing set

$$\varepsilon = \sqrt{\frac{1}{k} \sum\_{i=1}^{k} \left( \psi(\mathbf{x}\_i) - f(\mathbf{x}\_i) \right)^2},\tag{6}$$

where $\mathbf{x}_i$, $i = 1 \ldots k$, are the vectors in the testing set.
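Eqs. (4)–(6) amount to inverse-error weighting of the member metamodels. A small Python sketch follows, with hypothetical one-dimensional stand-ins in place of trained RBF/RBFN/Kriging metamodels:

```python
import math

def ensemble_weights(errors):
    """Eq. (4): weight each metamodel by its inverse testing RMSE."""
    inv = [1.0 / e for e in errors]
    total = sum(inv)
    return [v / total for v in inv]

def ensemble_predict(models, weights, x):
    """Eq. (5): weighted aggregation of the individual predictions."""
    return sum(w * m(x) for w, m in zip(weights, models))

def rmse(predict, xs, fs):
    """Eq. (6)-style root mean squared error on a testing set."""
    return math.sqrt(sum((predict(x) - f) ** 2 for x, f in zip(xs, fs)) / len(xs))

# hypothetical stand-ins for two trained metamodels
m1 = lambda x: x * x + 0.1   # nearly exact
m2 = lambda x: x * x + 0.5   # more biased
xs, fs = [0.0, 1.0, 2.0], [0.0, 1.0, 4.0]
w = ensemble_weights([rmse(m1, xs, fs), rmse(m2, xs, fs)])
```

The more accurate metamodel receives the larger weight, so the aggregated prediction of Eq. (5) is pulled toward it.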


Step 2.3. After repeating the process with all the candidate topologies, the one having the lowest RMSE, as defined in Eq. (6), is selected for the current optimization iteration. A new ensemble is then trained based on the selected topology and on all the evaluated vectors. This ensemble will be used in the following step.

It is emphasized that during the above process, the true expensive function is not used, and hence the process requires negligible computational resources.

For the specific implementation in this chapter, three well-established metamodels were considered for the ensembles: RBF, RBFN and Kriging that are all described in Appendix 1, while **Table 2** presents the candidate ensemble topologies. Also, the split ratio between the training and testing vectors was calibrated by numerical experiments, as described in Appendix 2.
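With three base metamodels, the candidate topologies are the non-empty subsets of {RBF, RBFN, Kriging}, of which there are 2³ − 1 = 7. The one-liner below is an illustration, assuming Table 2 lists every such combination:

```python
from itertools import combinations

base = ["RBF", "RBFN", "Kriging"]
# every non-empty subset of the base metamodels is a candidate topology
topologies = [c for r in range(1, len(base) + 1)
              for c in combinations(base, r)]
```

This yields the three single-metamodel topologies, the three pairs, and the full ensemble.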

Step 3. *Optimization trial step*: The proposed algorithm now seeks an optimum based on the ensemble prediction in the bounded TR around the current best solution (*xb*), namely

$$\mathcal{T} = \{\,x : \|x - x_b\|_2 \le \Delta\,\},\tag{7}$$

where *Δ* is the TR radius. The search is performed by using a hybrid approach [19], which is composed of an initial search performed by an SA algorithm, followed by a localized refinement of the solution with a sequential quadratic programming (SQP) algorithm. The main settings of the SA algorithm were based on existing studies [20–22], as follows:


• Cooling schedule: A geometric reduction of the temperature, namely

$$T(t) = 0.95\,T(t-1),\tag{8}$$


| Index | RBF | RBFN | Kriging |
|-------|-----|------|---------|
| 1 | • | | |
| 2 | | • | |
| 3 | | | • |
| 4 | • | • | |
| 5 | • | | • |
| 6 | | • | • |
| 7 | • | • | • |

**Table 2.** Candidate ensemble topologies (• indicates the metamodels participating in the ensemble).

where *t* is the current time-step counter. This way the temperature initially decreases at a slow rate, which assists the exploration search, but the reduction is much faster in later stages when the search is localized.

• Acceptance probability function: A decaying exponent, namely

$$p(T) = \exp\left(-\frac{f(x_n) - f(x_c)}{T(t)}\right),\tag{9}$$

where *xn* is the new vector being examined and *xc* is the current vector.

During the entire process, objective values are obtained only from the ensemble at a negligible computational cost, and hence the SA and SQP were able to evaluate a large number of candidate vectors.
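The trial-step search can be illustrated with a minimal SA loop over the ensemble landscape (a sketch only: for simplicity the TR is clipped as a box rather than the Euclidean ball of Eq. (7), and the step size and iteration budget are arbitrary choices, not the chapter's settings):

```python
import math, random

def sa_in_trust_region(ensemble, x_best, delta, t0=1.0, n_iter=500, seed=0):
    """SA search of the (cheap) ensemble landscape around x_best, using the
    geometric cooling of Eq. (8) and the acceptance probability of Eq. (9)."""
    rng = random.Random(seed)
    clip = lambda x: [max(xb - delta, min(xb + delta, xi))
                      for xi, xb in zip(x, x_best)]   # stay inside a box-shaped TR
    xc = list(x_best)
    fc = ensemble(xc)
    best, f_best = xc, fc
    T = t0
    for t in range(1, n_iter + 1):
        T = 0.95 * T                                   # Eq. (8)
        xn = clip([xi + rng.gauss(0.0, 0.1 * delta) for xi in xc])
        fn = ensemble(xn)
        # Eq. (9): always accept improvements, otherwise accept with prob p(T)
        if fn < fc or rng.random() < math.exp(-(fn - fc) / T):
            xc, fc = xn, fn
            if fc < f_best:
                best, f_best = xc, fc
    return best, f_best

# toy quadratic standing in for the ensemble prediction
x_star, f_star = sa_in_trust_region(lambda x: sum(v * v for v in x),
                                    x_best=[0.8, -0.6], delta=1.0)
```

Because the "objective" here is the ensemble, the loop can afford many evaluations; in the full algorithm the result would then be polished by SQP.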

Step 4. *TR updates*: The best vector found in the previous step (*x*⋆) is evaluated with the true expensive function, and the following updates are performed [14]:

• If *f*(*x*⋆) < *f*(*x*b): The trial step was successful, since the vector found was indeed better than the current best one. This indicates that the ensemble prediction is accurate, and accordingly the TR radius is doubled.

• If *f*(*x*⋆) ≥ *f*(*x*b) *and* there are sufficient vectors in the TR: The search was unsuccessful, since the solution found is not better than the current best one. However, since there are enough vectors in the TR to train the metamodels, the failure is attributed to the TR being too large, and accordingly the TR radius is halved.

• If *f*(*x*⋆) ≥ *f*(*x*b) *and* there are insufficient vectors in the TR: As above, but now the failure is attributed to having too few vectors in the TR to train the metamodels with. Therefore, a new vector is sampled in the TR, as explained below.

The implementation described above differs from the classical TR framework in two main aspects:

To conclude the description, Algorithm 2 gives the pseudocode of the proposed algorithm.
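The Step 4 update rule reduces to a few lines; in the sketch below the "sufficient vectors" test is expressed through a hypothetical threshold `n_min` (in the spirit of the parameter *n* calibrated in Appendix 2):

```python
def tr_update(f_trial, f_best, n_in_tr, radius, n_min):
    """Step 4: double the TR radius on success, halve it on a failure that
    cannot be blamed on sparse data, otherwise request a new sample in the TR."""
    sample_new = False
    if f_trial < f_best:
        radius *= 2.0        # ensemble prediction was accurate
    elif n_in_tr >= n_min:
        radius *= 0.5        # enough vectors, so the TR was too large
    else:
        sample_new = True    # too few vectors to train the metamodels
    return radius, sample_new

# a successful trial step doubles the radius and needs no extra sample
radius, need_sample = tr_update(0.5, 1.0, 4, 1.0, n_min=5)
```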

## **4. Performance analysis**


## **4.1. Benchmark tests based on mathematical test functions**

To evaluate its effectiveness, the proposed algorithm was applied to the established set of mathematical test functions [23]: Ackley, Griewank, Rastrigin, Rosenbrock, Schwefel 2.13 and Weierstrass, in dimensions 5–40, as listed in **Table 3**. These test functions represent challenging features such as high multimodality, deceptive landscapes and noise, and are therefore well suited for the testing purpose.

For a comprehensive evaluation, the proposed algorithm was benchmarked against the following four reference algorithms:

• *Variant 1* (V1): a variant of the proposed algorithm that uses a single fixed metamodel instead of an ensemble.

• *Variant 2* (V2): a variant of the proposed algorithm that uses a fixed ensemble topology, which is not updated during the search.

• *Evolutionary algorithm with periodic sampling* (EA-PS): The algorithm leverages the concepts in references [24, 25]. It uses a Kriging metamodel and a real-coded evolutionary algorithm (EA). The accuracy of the metamodel is maintained by periodically evaluating a small subset of the EA population with the true objective function, and using these sampled vectors to update the metamodel.

• *Expected improvement with covariance matrix adaptation evolutionary strategy* (EI-CMA-ES) [26]: The algorithm combines a covariance matrix adaptation evolutionary strategy (CMA-ES) algorithm with a Kriging metamodel, and uses the expected improvement (EI) framework to select new vectors for evaluation, based both on the response and the uncertainty in the metamodel prediction. This way the algorithm balances between a local search around the current best solution and an explorative search in less-visited regions of the search space.

**Table 3.** Mathematical test functions.


This testing set-up was used for several reasons: (i) any gains brought by using ensembles are highlighted through the comparisons to the V1 and V2 algorithms and (ii) the performance of the proposed algorithm is benchmarked against representative metamodel-assisted algorithms from the literature, namely, the EA-PS and EI-CMA-ES algorithms. Each algorithm-objective function combination was tested over 30 runs to support a valid statistical analysis, and the number of evaluations of the true objective function was limited to 200, to represent the tight optimization budget in expensive problems.

**Tables 4** and **5** give the resultant test statistics of mean, standard deviation (SD), median, minimum (best) and maximum (worst) objective value for each algorithm-objective function combination. It also gives the statistic *α* that indicates the significance level at which the results of the proposed algorithm were better than those of the competing algorithms, measured at either the 0.01 or 0.05 levels, while an empty entry indicates no such statistically significant advantage. The *α* statistic was obtained by using the Mann-Whitney non-parametric test [27].
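For readers who want to reproduce the significance analysis, a self-contained version of the one-sided Mann-Whitney test can be written as follows (a sketch using the normal approximation, without a tie correction in the variance, which is adequate for samples of 30 runs):

```python
import math

def mann_whitney_u(a, b):
    """Test whether values in `a` tend to be lower than in `b`.
    Returns (U, p): a small p indicates `a` is significantly lower."""
    n1, n2 = len(a), len(b)
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    # assign average ranks, handling ties
    ranks, i = [0.0] * (n1 + n2), 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg = (i + j + 1) / 2.0          # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg
        i = j
    r1 = sum(r for r, (_, lab) in zip(ranks, pooled) if lab == 0)
    u1 = r1 - n1 * (n1 + 1) / 2.0        # U statistic of sample `a`
    mu = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u1 - mu) / sigma
    p = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # normal approximation
    return u1, p

u, p = mann_whitney_u([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
```

Library implementations (e.g. in statistics packages) additionally correct the variance for ties; the sketch above suffices to see how the *α* entries in the tables arise.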

Test results show that the proposed algorithm performed well, as it obtained the best mean and median statistics in 8 out of 12 cases. Its results also had a statistically significant advantage over the other algorithms in 30 out of 48 cases. The performance advantage of the proposed algorithm was particularly pronounced in the high-dimensional cases, where it obtained the best mean and median in five out of six test functions. In terms of repeatability of performance, its SD was often slightly higher but was competitive with the best SD in each test case.

Overall, the proposed algorithm consistently outperformed the V1 and V2 variants, which shows that selecting the ensemble topology during the search improved the search effectiveness, both when compared to using a fixed metamodel or a fixed ensemble topology. Also, the proposed algorithm consistently outperformed the two reference algorithms from the literature, which shows that it compares well with existing approaches.

The analysis of the experiments also examined how the ensemble topology was updated, to examine if either a single or multiple topologies were predominantly selected. Accordingly, **Figure 3** shows representative plots of the ensemble topologies selected during a run with the Ackley-10D function and another with the Rosenbrock-20D function. It follows that in both


**Table 4.** Test statistics: test functions—low dimension.




**Figure 3.** Ensemble topologies selected during two test runs. Abbreviations: K: Kriging, R: RBF, RN: RBF network.

cases, different topologies were selected, which indicates that no single topology was the overall optimal, and further justifies the proposed approach.

#### **4.2. Engineering test problem**

| Function | Statistic | Proposed | V1 | V2 | EA-PS | EI-CMA-ES |
|---|---|---|---|---|---|---|
| Ackley-20 | Mean | **6.446e+00** | 8.795e+00 | 1.947e+01 | 6.814e+00 | 1.863e+01 |
| | SD | 5.712e+00 | 6.302e+00 | 2.836e-01 | 2.461e-01 | 1.921e+00 |
| | Median | **3.885e+00** | 5.524e+00 | 1.944e+01 | 6.744e+00 | 1.934e+01 |
| | Min (best) | 2.683e+00 | 3.574e+00 | 1.909e+01 | 6.468e+00 | 1.493e+01 |
| | Max (worst) | 1.802e+01 | 1.804e+01 | 1.998e+01 | 7.203e+00 | 2.044e+01 |
| | *α* | | 0.05 | 0.01 | 0.05 | 0.01 |
| Griewank-40 | Mean | **1.045e+00** | 1.270e+00 | 8.192e+00 | 1.461e+00 | 1.102e+00 |
| | SD | 2.942e-02 | 1.185e-01 | 1.151e+00 | 6.031e-02 | 3.032e-02 |
| | Median | **1.040e+00** | 1.246e+00 | 8.068e+00 | 1.454e+00 | 1.096e+00 |
| | Min (best) | 1.013e+00 | 1.120e+00 | 6.567e+00 | 1.387e+00 | 1.071e+00 |
| | Max (worst) | 1.114e+00 | 1.475e+00 | 1.035e+01 | 1.595e+00 | 1.157e+00 |
| | *α* | | 0.01 | 0.01 | 0.01 | 0.01 |
| Rastrigin-20 | Mean | **6.490e+01** | 6.501e+01 | 1.492e+02 | 1.223e+02 | 2.105e+02 |
| | SD | 3.753e+01 | 1.703e+01 | 2.754e+01 | 1.219e+01 | 3.914e+01 |
| | Median | **4.808e+01** | 6.683e+01 | 1.449e+02 | 1.230e+02 | 2.296e+02 |
| | Min (best) | 3.924e+01 | 4.203e+01 | 1.144e+02 | 1.046e+02 | 1.395e+02 |
| | Max (worst) | 1.568e+02 | 8.807e+01 | 1.983e+02 | 1.429e+02 | 2.507e+02 |
| | *α* | | | 0.01 | 0.01 | 0.01 |
| Rosenbrock-20 | Mean | **5.653e+02** | 1.005e+03 | 9.013e+03 | 8.435e+02 | 3.967e+03 |
| | SD | 2.067e+02 | 5.794e+02 | 5.142e+03 | 3.012e+02 | 9.406e+02 |
| | Median | **5.807e+02** | 8.295e+02 | 8.112e+03 | 7.782e+02 | 3.685e+03 |
| | Min (best) | 2.042e+02 | 5.281e+02 | 4.165e+03 | 4.676e+02 | 3.141e+03 |
| | Max (worst) | 8.756e+02 | 2.497e+03 | 2.271e+04 | 1.439e+03 | 6.144e+03 |
| | *α* | | 0.01 | 0.01 | 0.05 | 0.01 |
| Schwefel-40 | Mean | **7.503e+05** | 8.786e+05 | 2.322e+06 | 1.774e+06 | 1.667e+06 |
| | SD | 2.173e+05 | 2.506e+05 | 5.317e+05 | 2.509e+05 | 6.520e+05 |
| | Median | **7.074e+05** | 8.420e+05 | 2.369e+06 | 1.744e+06 | 1.528e+06 |
| | Min (best) | 4.989e+05 | 5.815e+05 | 1.666e+06 | 1.415e+06 | 8.933e+05 |
| | Max (worst) | 1.126e+06 | 1.348e+06 | 3.186e+06 | 2.104e+06 | 2.871e+06 |
| | *α* | | | 0.01 | 0.01 | 0.01 |
| Weierstrass-40 | Mean | 2.730e+01 | 4.048e+01 | 5.105e+01 | **2.343e+01** | 3.598e+01 |
| | SD | 4.414e+00 | 4.106e+00 | 2.138e+00 | 1.265e+00 | 1.463e+01 |
| | Median | 2.445e+01 | 4.100e+01 | 5.135e+01 | **2.304e+01** | 2.597e+01 |
| | Min (best) | 2.365e+01 | 3.351e+01 | 4.817e+01 | 2.214e+01 | 2.100e+01 |
| | Max (worst) | 3.419e+01 | 4.685e+01 | 5.333e+01 | 2.567e+01 | 5.817e+01 |
| | *α* | | 0.01 | 0.01 | | |

**Table 5.** Test statistics: test functions—high dimension.

The proposed algorithm was also applied to a representative simulation-driven engineering problem, where the goal is to find an airfoil shape, which maximizes the lift *L* and minimizes the drag (aerodynamic friction) *D* at some prescribed flight conditions. In practice, the design requirements for airfoils are specified in terms of the non-dimensional *lift and drag coefficients*, *cl* and *cd*, respectively, defined as

$$c_l = \frac{L}{\frac{1}{2}\rho V^2 S},\tag{10a}$$

$$c_d = \frac{D}{\frac{1}{2}\rho V^2 S},\tag{10b}$$

where *L* and *D* are the lift and drag forces, respectively, *ρ* is the air density, *V* is the aircraft speed, and *S* is the reference area, such as the wing area. The relevant flight conditions are the aircraft altitude, speed and angle of attack (AOA) that is the angle between the aircraft velocity and the airfoil *chord line*. **Figure 4** gives a schematic layout of the airfoil problem.

Candidate airfoils were represented with the Hicks-Henne method [28], such that an airfoil profile was defined as

$$y = y\_b + \sum\_{i=1}^{h} \alpha\_i b\_i(\mathbf{x})\,,\tag{11}$$

where *yb* is a baseline profile taken here to be the NACA0012 symmetric profile, and *bi* are the shape basis functions [29].

$$b_i(x) = \left[\sin\left(\pi x^{\frac{\log(0.5)}{\log(i/(h+1))}}\right)\right]^4,\tag{12}$$

where *αi* ∈ [−0.01, 0.01] are the variables, as shown in **Figure 4**. Ten basis functions were used for the upper and lower airfoil profiles, respectively, which resulted in a total of 20 variables per airfoil. Also, for structural integrity the thickness of an airfoil between 0.2 and 0.8 of its chord line was required to be greater than a critical thickness *t*⋆ = 0.1. The lift and drag coefficients of candidate airfoils were obtained by using *XFoil*, an aerodynamics simulation code for subsonic isolated airfoils [30]. Each airfoil evaluation required up to 30 s on a desktop computer, so evaluations were not prohibitively expensive and the tests could be completed within a reasonable time.

Based on the above discussion, the objective function used was

$$f = -\frac{c_l}{c_d} + p\,,\qquad p = \begin{cases}\dfrac{t^\star}{t}\left|\dfrac{c_l}{c_d}\right| & \text{if } t < t^\star\\[4pt] 0 & \text{otherwise,}\end{cases}\tag{13}$$

where *p* is the penalty for violation of the thickness constraint. The flight conditions were an altitude of 30,000 ft, a speed of Mach 0.7, namely 70% of the speed of sound, and an AOA of 2°.
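The parameterization and the penalized objective translate directly into code. The sketch below uses scalar inputs and assumes the lift and drag coefficients and the minimum thickness *t* have already been obtained from the simulation; the baseline profile is a hypothetical placeholder:

```python
import math

def hicks_henne_basis(x, i, h):
    """Eq. (12): the i-th Hicks-Henne bump function, i = 1..h, x in [0, 1]."""
    e = math.log(0.5) / math.log(i / (h + 1.0))
    return math.sin(math.pi * x ** e) ** 4

def profile(x, y_base, alphas):
    """Eq. (11): baseline profile plus weighted bump functions."""
    h = len(alphas)
    return y_base(x) + sum(a * hicks_henne_basis(x, i + 1, h)
                           for i, a in enumerate(alphas))

def objective(cl, cd, t, t_crit=0.1):
    """Eq. (13): maximize cl/cd via minimization, with a thickness penalty."""
    p = (t_crit / t) * abs(cl / cd) if t < t_crit else 0.0
    return -cl / cd + p

# illustrative call with a flat baseline and uniform bump amplitudes
y = profile(0.5, y_base=lambda x: 0.0, alphas=[0.01] * 10)
```

Note that each bump peaks at *x* = *i*/(*h*+1), so the ten functions spread their influence along the chord; the penalty term guarantees that any airfoil violating the thickness constraint scores worse than any feasible one with the same lift-to-drag ratio.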

Tests were performed along the set-up of Section 4.1, and **Table 6** gives the resultant test statistics. The trends are consistent with those of the test functions, and the proposed algorithm

**Figure 4.** The layout of the airfoil problem: main components (left) and parameterization (right).


| Statistic | Proposed | V1 | V2 | EA-PS | EI-CMA-ES |
|---|---|---|---|---|---|
| Mean | **-3.376e+00** | -3.279e+00 | -3.368e+00 | -3.231e+00 | -3.278e+00 |
| SD | 1.242e-01 | 6.683e-02 | 1.008e-01 | 7.164e-02 | 9.597e-02 |
| Median | **-3.355e+00** | -3.274e+00 | -3.352e+00 | -3.227e+00 | -3.267e+00 |
| Min (best) | -3.624e+00 | -3.393e+00 | -3.533e+00 | -3.335e+00 | -3.395e+00 |
| Max (worst) | -3.158e+00 | -3.161e+00 | -3.214e+00 | -3.134e+00 | -3.098e+00 |
| *α* | | 0.01 | | 0.01 | 0.05 |

**Table 6.** Test statistics for the airfoil problem.

outperformed the other candidate algorithms here as well. It obtained the best mean and median statistics, and had a competitively low SD.

## **5. Conclusion**


While computer simulations can improve the efficiency of the engineering design process, they also introduce new optimization challenges. Metamodels aim to alleviate the challenge of a high evaluation cost by providing computationally cheaper approximations of the true expensive function.

While metamodels can significantly improve the search effectiveness, they also introduce various challenges, such as identifying an optimal combination of metamodel variants, and effectively searching the metamodel landscape. To address these issues, this chapter has proposed a hybrid algorithm that uses SA to perform a global search, and it then refines the solutions with an SQP local search. To further enhance its effectiveness, the proposed algorithm uses ensembles of metamodels and selects the most suitable ensemble topology during the search. Lastly, to ensure convergence to an optimum of the true expensive function in the light of the inherent metamodel inaccuracies, the proposed algorithm operates within a TR approach such that the optimization is performed through a series of trial steps.

In an extensive performance analysis, the proposed algorithm was benchmarked against two implementations without selection of the ensemble topology, and two reference algorithms from the literature, which also do not use topology adaptation. Analysis of the results shows that the proposed algorithm consistently outperformed the other algorithms: it achieved better results in 30 out of 48 cases with mathematical test functions, and also performed well with a simulation-driven problem. Its performance advantage was evident from the superior mean and median statistics that were obtained across the tests, and was particularly pronounced in the high-dimensional problems.

The analysis also showed that the optimal ensemble topology varied continuously during the optimization search, and that no single topology was the overall optimal. This observation further supports the proposed approach of selecting the ensemble topology during the search.

Overall, the solid performance of the proposed algorithm shows the merit of the proposed hybrid SA+SQP approach. It also suggests that the proposed algorithm could be applied to problems from a variety of domains, such as scheduling, systems engineering and model calibration, to name a few.

## **Appendix 1: Candidate metamodels**

This appendix describes the metamodels used, as follows:

• Kriging: A statistical metamodel that combines a global 'drift' function and a local adjustment based on the correlation between the sample vectors. The metamodel replicates the

observed responses precisely (Lagrangian interpolation). For a constant drift function, the metamodel becomes

$$m(x) = \beta + \kappa(x),\tag{14}$$

where *κ*ð*x*Þ is the local correction. The latter is defined by a stationary Gaussian process with mean zero and covariance

$$\text{Cov}[\kappa(\mathfrak{x})\kappa(\mathfrak{y})] = \sigma^2 c(\theta, \mathfrak{x}, \mathfrak{y}), \tag{15}$$

where *c*ð*θ;x;y*Þ is a user-prescribed correlation function. With the Gaussian correlation function [3]

$$c(\theta,x,y) = \prod_{i=1}^{d}\exp\left(-\theta(x_i - y_i)^2\right),\tag{16}$$

the above metamodel becomes

$$m(\mathbf{x}) = \hat{\boldsymbol{\beta}} + \boldsymbol{r}(\mathbf{x})^T \boldsymbol{R}^{-1} (\mathbf{f} - \mathbf{1}\hat{\boldsymbol{\beta}}) \tag{17}$$

where *β*^ is the estimated drift coefficient, *R* is the symmetric matrix of correlations between all interpolation vectors, *f* is the vector of objective values and **1** is a vector with all elements equal to 1. *r<sup>T</sup>* is the correlation vector between a new vector **x** and the sample vectors, namely,

$$\mathbf{r}^T = [c(\theta, \mathbf{x}, \mathbf{x}\_1), \dots, c(\theta, \mathbf{x}, \mathbf{x}\_n)] \tag{18}$$

The estimated drift coefficient *β*^ and variance *σ*^<sup>2</sup> are calculated as

$$
\hat{\boldsymbol{\beta}} = (\mathbf{1}^T \boldsymbol{R}^{-1} \mathbf{1})^{-1} \mathbf{1}^T \boldsymbol{R}^{-1} \boldsymbol{f}, \tag{19a}
$$

$$
\hat{\sigma}^2 = \frac{1}{n} [(\mathbf{f} - \mathbf{1}\hat{\boldsymbol{\beta}})^T \mathbf{R}^{-1} (\mathbf{f} - \mathbf{1}\hat{\boldsymbol{\beta}})] \tag{19b}
$$

For an isotropic (single correlation parameter) Kriging metamodel, the optimal value of the parameter is obtained by maximizing the metamodel likelihood [31], which is equivalent to minimizing the concentrated criterion

$$\theta^\star : \min\{|R|^{1/n}\,\hat{\sigma}^2\}\tag{20}$$
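As an illustration of Eqs. (14)–(19), a bare-bones constant-drift Kriging predictor can be written as follows (a one-dimensional sketch with a fixed θ and a small Gaussian-elimination solver in place of a linear algebra library):

```python
import math

def solve(a, b):
    """Solve A y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bv] for row, bv in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            fac = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= fac * m[col][c]
    y = [0.0] * n
    for r in range(n - 1, -1, -1):
        y[r] = (m[r][n] - sum(m[r][c] * y[c] for c in range(r + 1, n))) / m[r][r]
    return y

def kriging(xs, fs, theta=1.0):
    """Constant-drift Kriging with the Gaussian correlation of Eq. (16);
    returns the predictor of Eq. (17)."""
    n = len(xs)
    corr = lambda u, v: math.exp(-theta * (u - v) ** 2)
    R = [[corr(xi, xj) for xj in xs] for xi in xs]
    Rinv_f = solve(R, list(fs))
    Rinv_1 = solve(R, [1.0] * n)
    beta = sum(Rinv_f) / sum(Rinv_1)          # Eq. (19a), since R is symmetric
    Rinv_res = solve(R, [f - beta for f in fs])
    def m(x):
        r = [corr(x, xi) for xi in xs]        # correlation vector, Eq. (18)
        return beta + sum(ri * wi for ri, wi in zip(r, Rinv_res))
    return m

m = kriging([0.0, 1.0, 2.0], [1.0, 0.0, 1.0])
```

By construction the predictor interpolates the sample responses exactly, which is the defining property noted above.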

• *Radial basis functions* (RBF): The metamodel approximates the true objective function with a set of basis functions, namely

A Simulated Annealing Based Optimization Algorithm http://dx.doi.org/10.5772/66455 63

$$m(x) = \sum_{i=1}^{n}\alpha_i\,\phi_i(x) + c,\tag{21}$$

where *φ<sup>i</sup>* ð*x*Þ are basis functions of the form


$$\phi\_i(\mathbf{x}) = \phi(\|\mathbf{x} - \mathbf{x}\_i\|\_2),\tag{22}$$

where *x<sup>i</sup>* is a sampled vector and *c* is a constant. The coefficients *α<sup>i</sup>* and *c* are determined from the interpolation conditions

$$m(\mathfrak{x}\_i) = f(\mathfrak{x}\_i), i = 1 \dots n,\tag{23a}$$

$$\sum\_{i=1}^{n} \alpha\_i = 0 \; . \tag{23b}$$

In this study, the widely used Gaussian basis function [32] was used:

$$\phi_i(x) = \exp\left(-\frac{\|x - x_i\|_2^2}{\tau}\right),\tag{24}$$

where *τ* controls the width of the Gaussians, and is determined by cross-validation [33, 34].
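A compact one-dimensional version of the RBF interpolant of Eqs. (21)–(24), again with a small elimination solver in place of a linear algebra library, is:

```python
import math

def solve(a, b):
    """Solve A y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bv] for row, bv in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            fac = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= fac * m[col][c]
    y = [0.0] * n
    for r in range(n - 1, -1, -1):
        y[r] = (m[r][n] - sum(m[r][c] * y[c] for c in range(r + 1, n))) / m[r][r]
    return y

def rbf_interpolant(xs, fs, tau=1.0):
    """Gaussian RBF interpolant: the bordered linear system enforces the
    interpolation conditions (23a) and the side constraint (23b)."""
    n = len(xs)
    phi = lambda u, v: math.exp(-((u - v) ** 2) / tau)     # Eq. (24), 1-D
    A = [[phi(xi, xj) for xj in xs] + [1.0] for xi in xs]  # [Phi | 1]
    A.append([1.0] * n + [0.0])                            # sum(alpha) = 0 row
    coef = solve(A, list(fs) + [0.0])
    alphas, c = coef[:n], coef[n]
    return lambda x: sum(a * phi(x, xi) for a, xi in zip(alphas, xs)) + c

m = rbf_interpolant([0.0, 1.0, 2.0], [1.0, 0.0, 1.0])
```

The bordered system is nonsingular because the Gaussian kernel matrix is positive definite, so the coefficients and the constant *c* are uniquely determined.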

• *Radial basis function network* (RBFN): A variant of the RBF approach but in which the number of basis functions is smaller than the sample size, which can improve the prediction accuracy in certain scenarios. The metamodel is given by

$$m(\mathbf{x}) = \sum\_{j=1}^{\hat{n}} \alpha\_j \phi\_j(\mathbf{x}),\tag{25}$$

where the coefficients *αj* are determined from the least-squares interpolation conditions

$$Q^T Q\,\boldsymbol{\alpha} = Q^T \mathbf{f},\tag{26}$$

where

$$Q_{i,j} = \phi_j(x_i),\tag{27}$$

and *xi*, *i* = 1…*n*, are the sample vectors, **f** is the vector of corresponding objective function values, and *φj*(*x*), *j* = 1…*n̂* are the basis functions, which in this study were taken as the Gaussian functions described above. The basis function centres *xj* are obtained by clustering the sampled vectors and using the resultant cluster centres.
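The least-squares fit of Eqs. (25)–(27) is a textbook normal-equations computation. In the sketch below the cluster centres are simply fixed by hand, and the toy data are generated from the model class itself, so the fit recovers the data exactly:

```python
import math

def solve(a, b):
    """Solve A y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bv] for row, bv in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            fac = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= fac * m[col][c]
    y = [0.0] * n
    for r in range(n - 1, -1, -1):
        y[r] = (m[r][n] - sum(m[r][c] * y[c] for c in range(r + 1, n))) / m[r][r]
    return y

def rbfn(xs, fs, centers, tau=1.0):
    """RBFN least-squares fit: n_hat = len(centers) < len(xs) basis functions,
    with coefficients from the normal equations Q^T Q a = Q^T f (Eq. (26))."""
    phi = lambda u, v: math.exp(-((u - v) ** 2) / tau)
    Q = [[phi(x, c) for c in centers] for x in xs]         # Eq. (27)
    nh, ns = len(centers), len(xs)
    QtQ = [[sum(Q[k][i] * Q[k][j] for k in range(ns)) for j in range(nh)]
           for i in range(nh)]
    Qtf = [sum(Q[k][i] * fs[k] for k in range(ns)) for i in range(nh)]
    alphas = solve(QtQ, Qtf)
    return lambda x: sum(a * phi(x, c) for a, c in zip(alphas, centers))

true = lambda x: 2.0 * math.exp(-(x - 0.5) ** 2) + math.exp(-(x - 1.5) ** 2)
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
m = rbfn(xs, [true(x) for x in xs], centers=[0.5, 1.5])
```

Because fewer basis functions than samples are used, the system is solved in the least-squares sense rather than by interpolation, which is exactly the smoothing effect the RBFN variant is meant to provide on noisy responses.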

## **Appendix 2: Parameter sensitivity analysis**

As described in Section 3, the proposed algorithm relies on two main parameters: (i) the minimum number (*n*) of TR vectors needed to allow a TR contraction, and (ii) the split ratio (*s*) between the training and testing subsets, as used in estimating the accuracy of candidate metamodel and ensembles.

To calibrate these parameters, a set of numerical experiments were performed where different parameter settings were used, and for each setting the algorithm was tested with the Rastrigin-10D, Rosenbrock-10D, Rastrigin-20D and Rosenbrock-20D functions. The parameter settings examined were:

• *n*: 0.1*d*, 0.5*d*, *d*, where *d* is the dimension of the objective function.

• *s*: 80–20, 60–40, 40–60, in percent.


**Table 7** gives the resultant test statistics of mean objective value, rank per objective function and the overall rank based on each setting where a lower score is better. From these results it follows:



**(a) Statistics: different TR vectors thresholds (*n*)**

| Function | *n* = 0.1*d* | *n* = 0.5*d* | *n* = *d* |
|---|---|---|---|
| Ras-10 | 4.598e+01 (2) | 4.827e+01 (3) | 3.530e+01 (1) |
| Ros-10 | 1.233e+02 (2) | 4.393e+01 (1) | 1.317e+02 (3) |
| Ras-20 | 9.256e+01 (3) | 8.656e+01 (2) | 7.997e+01 (1) |
| Ros-20 | 4.503e+02 (1) | 5.361e+02 (2) | 8.518e+02 (3) |
| Overall rank | 8 | 8 | 8 |

**(b) Statistics: different split ratios (*s*)**

| Function | 80–20 | 60–40 | 40–60 |
|---|---|---|---|
| Ras-10 | 4.598e+01 (2) | 3.981e+01 (1) | 5.414e+01 (3) |
| Ros-10 | 1.233e+02 (2) | 1.266e+02 (3) | 4.929e+01 (1) |
| Ras-20 | 9.256e+01 (3) | 8.899e+01 (2) | 7.970e+01 (1) |
| Ros-20 | 4.503e+02 (1) | 5.578e+02 (3) | 5.318e+02 (2) |
| Overall rank | 8 | 9 | **7** |

Note: ratios are in percent; entries are the mean objective value, with the per-function rank in parentheses.

**Table 7.** Parameter sensitivity analysis results.


The above settings were then used during the numerical experiments described in Section 4.

## **Author details**

Yoel Tenne


Address all correspondence to: y.tenne@ariel.ac.il

Department of Mechanical and Mechatronic Engineering, Ariel University, Ariel, Israel

## **References**


[10] Belloni A, Liang T, Narayanan H, Rakhlin A. Escaping the local minima via simulated annealing: Optimization of approximately convex functions. Journal of Machine Learning Research. 2015;40:240–265.

[11] Rodríguez DA, Oteiza PP, Brignole NB. Simulated annealing optimization for hydrocarbon pipeline networks. Industrial & Engineering Chemistry Research. 2013;52(25):8579–8588.

[12] Sivasankaran P, Sornakumar T, Panneerselvam R. Design and comparison of simulated annealing algorithm and GRASP to minimize makespan in single machine scheduling with unrelated parallel machines. Intelligent Information Management. 2010;2:406–416.

[13] Jin Y, Olhofer M, Sendhoff B. A framework for evolutionary optimization with approximate fitness functions. IEEE Transactions on Evolutionary Computation. 2002;6(5):481–494.

[14] Conn AR, Scheinberg K, Toint PL. On the convergence of derivative-free methods for unconstrained optimization. In: Iserles A, Buhmann MD, editors. Approximation Theory and Optimization: Tributes to M.J.D. Powell. Cambridge, New York: Cambridge University Press; 1997. pp. 83–108.

[15] Simpson TW, Poplinski JD, Koch PN, Allen JK. Metamodels for computer-based engineering design: Survey and recommendations. Engineering with Computers. 2001;17:129–150.

[16] Jin Y, Sendhoff B. Reducing fitness evaluations using clustering techniques and neural network ensembles. In: Deb K, et al., editors. Proceedings of the Genetic and Evolutionary Computation Conference – GECCO 2004. Berlin, Heidelberg: Springer-Verlag; 2004. pp. 688–699.

[17] Viana FAC, Venter G, Balabanov V. An algorithm for fast optimal Latin hypercube design of experiments. International Journal for Numerical Methods in Engineering. 2009;82(2):135–156.

[18] Burnham KP, Anderson DR. Model Selection and Inference: A Practical Information-Theoretic Approach. New York: Springer; 2002.

[19] Mininno E, Neri F. A memetic differential evolution approach in noisy optimization. Journal of Soft Computing. 2010;2:111–135.

[20] Yang B, Guang L, Säntti T, Plosila J. Parameter-optimized simulated annealing for application mapping on networks-on-chip. In: Hamadi Y, Schoenauer M, editors. Learning and Intelligent Optimization: 6th International Conference, LION 6, Paris, France, January 16–20, 2012, Revised Selected Papers. Berlin, Heidelberg: Springer; 2012. pp. 307–322.

[21] Frausto-Solis J, Alonso-Pecina F. Analytically tuned parameters of simulated annealing for the timetabling problem. WSEAS Transactions on Advances in Engineering Education. 2008;5(5):272–281.

[22] Park MW, Kim YD. A systematic procedure for setting parameters in simulated annealing algorithms. Computers and Operations Research. 1998;25(3):207–217.

[23] Suganthan PN, Hansen N, Liang JJ, Deb K, Chen YP, Auger A, et al. Problem Definitions and Evaluation Criteria for the CEC 2005 Special Session on Real-Parameter Optimization. Nanyang Technological University, Singapore and Kanpur Genetic Algorithms Laboratory, Indian Institute of Technology Kanpur, India; 2005. KanGAL Report 2005005.



## **Simulated Annealing of Constrained Statistical Functions**

Barry Smith and Augustine Wong

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/66069

#### **Abstract**

In 1987, Corana et al. published a simulated annealing (SA) algorithm. Soon thereafter, in 1993, Goffe et al. coded the algorithm in FORTRAN and showed that SA could uncover global optima missed by traditional optimization software when applied to statistical modeling and estimation in economics (econometrics). This chapter shows how and why SA can be used successfully to perform likelihood-based statistical inference on models where the likelihood is constrained by often very complicated functions defined on a compact parameter space. These constraints arise because likelihood-based inference involves comparing the maxima of constrained versus unconstrained statistical optimization models. The chapter begins with a review of the relevant literature on SA and constrained optimization using penalty techniques. Next, a constrained optimization problem based on maximum likelihood stress-strength modeling is introduced, and its statistical and numerical properties are summarized. SA is then used to solve a sequence of penalty-constrained optimization problems, and the results are used to construct a confidence interval for the parameter of interest in the statistical model. The chapter concludes with a brief summary of the results and some ways we were able to enhance the performance of SA in this setting.

**Keywords:** SA, constrained optimization, penalty, likelihood

## **1. SA and penalty-constrained optimization**

Several chapters in this book consider the foundations and development of the simulated annealing (SA) algorithm. In this chapter, we focus on just one version of the algorithm given in Ref. [1] and one FORTRAN implementation of the algorithm presented in Ref. [2]. The reasons for this are efficiency and familiarity. The algorithm in Ref. [1] is tight and effective, and the FORTRAN implementation in Ref. [2] is dependable, easily extensible and sufficiently fast that it can be applied to both complicated small-scale problems and to very large Monte Carlo studies such as the one in Ref. [3].

### **1.1. Background**

The SA algorithm in Ref. [1] was developed as an approach to find the unconstrained global optimum of functions with a potential multiplicity of optima, some of which may lie on the boundaries of the function's domain. The function being optimized need not be differentiable or even continuous, but it must be bounded. As well, the domain of the function must be compact. The search algorithm in Ref. [1] depends upon function evaluations. Derivatives play no role. The reader is referred to the careful statement of the algorithm on pages 266–269 in Ref. [1]. Here we provide an overview of the algorithm and its implementation.

In a multivariate setting, the domain of the function is initially defined as a (perhaps very) large hypercube. Random paths of individual choice variable values are searched at each stage. The number of such searches is controlled by the user. The step sizes that partially define the random paths are part of the algorithm and will in general be different for each choice variable at each stage. The algorithm cycles through and individually changes (increases or decreases) all choice variables. Each time a variable change leads to a "better" value of the function being optimized, this point is accepted. Sometimes "worse" points are temporarily accepted. That is, sometimes, the algorithm deliberately allows a search path at a stage to contain "worse" points. By allowing the paths to meander through better and worse points in the domain, the algorithm allows the domain to be searched for better optima that can only be reached by first passing through the worse regions. In this way, the search process can escape from local optima that are dominated by one or more global optima. This distinguishes SA from traditional hill-climbing algorithms that use the local properties of the function to always move in better directions. Movements in worse directions are governed by a Metropolis decision. In a sense, the Metropolis decision can be thought of as the algorithm giving permission to the search process to move in a worse direction, depending upon the roll of a (weighted) die. As the algorithm progresses through subsequent ("cooling") stages, the chance that a worse point will be permitted/accepted decreases. In addition, as the algorithm progresses, the effective domain of the function (that part of the domain to which the search is effectively restricted) is contracted. In part, the search evolves in a fashion consistent with the overall (global versus local) topography of the surface of the function being optimized. Asymptotically, the algorithm converges to a domain that contains the best optimum encountered and which has a user-specified (small) volume. Convergence criteria and the number and length of searches at each stage are determined by parameters set by the user.
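
The acceptance logic just described can be sketched in a few lines. The following is a deliberately simplified illustration, not the algorithm of Ref. [1] itself: it uses one fixed step size per choice variable, a plain geometric cooling schedule, and parameter names of our own choosing.

```python
import math
import random

def simulated_annealing(f, x0, lower, upper, t0=10.0, rt=0.85,
                        n_stages=50, moves_per_stage=200, seed=0):
    """Maximize f over the box [lower, upper] with a bare-bones SA loop.

    Simplified relative to Ref. [1]: fixed step sizes and a plain
    geometric cooling schedule.
    """
    rng = random.Random(seed)
    x, fx = list(x0), f(list(x0))
    best_x, best_f = list(x), fx
    step = [(u - l) / 2.0 for l, u in zip(lower, upper)]
    t = t0
    for _ in range(n_stages):
        for _ in range(moves_per_stage):
            i = rng.randrange(len(x))        # perturb one choice variable
            cand = list(x)
            cand[i] += rng.uniform(-step[i], step[i])
            cand[i] = min(max(cand[i], lower[i]), upper[i])
            fc = f(cand)
            # Metropolis decision: always accept a better point; accept a
            # worse one with probability exp((fc - fx) / t).
            if fc >= fx or rng.random() < math.exp((fc - fx) / t):
                x, fx = cand, fc
                if fx > best_f:
                    best_x, best_f = list(x), fx
        t *= rt                               # cooling between stages
    return best_x, best_f
```

For instance, `simulated_annealing(lambda v: -(v[0] - 2.0) ** 2, [8.0], [-10.0], [10.0])` returns a point close to the global maximum at 2, even though the search starts far away.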

The FORTRAN implementation in Ref. [2] is faithful to the algorithm in Ref. [1], but it does include some features and suggestions that tend to help in deciding whether a global optimum has been reached and, in the initial stages of optimization, the features allow the researcher to search individual subsets of the domain of the function. The researcher can control the number of searches, the initial "temperature" of the model and the rate at which temperature decreases. Uphill and rejected downhill moves are balanced by changes to the upper and lower bounds on parameter changes. In Ref. [2], the suggestion is made that starting the SA algorithm from a variety of randomly selected domain points may provide information about whether a global optimum has been found. This is balanced, to some extent, by the realization that the properties of SA that make it a global optimizer also tend to make it independent of initial values of the choice variables. Any indication of sensitivity to starting values is, therefore, a strong suggestion that the SA user-determined parameters should be changed to provide a more thorough search.

As with any mathematical tool, becoming adept at using SA to solve optimization problems requires practice. FORTRAN code for the SA implementation in Ref. [2] is widely available. This code, written in FORTRAN 77, is carefully documented and contains an example problem that illustrates many of the features of SA. These features, such as convergence criteria, cooling rate, lengths of search paths, and the like can be adjusted to study how the implementation works. It is straightforward to code and solve new problems. The advice in Ref. [2] and in the code is helpful in finding the set of search parameters that solves the problem at hand.

Finally, Goffe et al. [2] address the issue of the speed of SA and recommend ways that the user can tune the implementation to run more quickly. The implementation was published in 1994, and since then very fast multicore processors with at least 64-bit precision have become the norm. At the same time, though, there are optimization problems that are now being addressed at the limit of current technology.

### **1.2. Penalty-constrained optimization**


The foregoing discussion has provided an overview of SA and how it can be applied to optimize a function. We now turn to a discussion of how SA can be used to find the global optimum of functions when there are constraints on the choice variables. Our principal concern is how to find the optimum of the statistical functions when the choice variables must satisfy an integral equality constraint. Our approach, however, is quite general and it can be adapted to solve optimization problems subject to several inequality as well as equality constraints that may or may not involve integrals. At this point, it is helpful to introduce some notation.

Our choice variables are represented by the vector *θ*, the objective function to be maximized is *l*(*θ*), and there is a constraint *R*(*θ*) = *ψ*0. In the statistical problem considered in the next section, we examine the unconstrained problem.

#### *1.2.1. (Unconstrained problem) U: choose θ to maximize l(θ)*

(Unconstrained Problem) *U*: Choose *θ* to maximize *l*(*θ*) is an important part of the analysis. It is almost always the case in statistical settings that the unconstrained problem can be solved in a straightforward manner using SA.

#### *1.2.2. (Constrained problem) C: choose θ to maximize l(θ) subject to R(θ) = ψ<sup>0</sup>*

(Constrained Problem) *C*: Choose *θ* to maximize *l*(*θ*) subject to *R*(*θ*) = *ψ*<sup>0</sup> can also be solved using simulated annealing. Indeed, as we will see, SA is a very natural approach to solving problem *C*.

Within a statistical setting, substitution and Lagrange multiplier techniques tend not to work well when attempting to solve *C*. Typically, the objective function, *l*(*θ*), and the constraint, *R* (*θ*) = *ψ*<sup>0</sup> will not have properties that guarantee a straightforward solution to the optimization problem. For example, the constraint can reasonably be expected to be a highly nonlinear function of the choice variables, *θ*. In both the substitution and Lagrange approaches, it is necessary at each iteration to solve the constraint equation to express one choice variable in terms of all of the others. If one or more of the choice variables takes on an extreme value (very large or very small) during the iteration process, then this can lead in turn to an extreme constraint solution and such extreme values tend to perpetuate themselves through subsequent iterations. The traditional absence of some form of textbook concavity or convexity on the objective function and/or the constraint function typically results in a failed optimization attempt. Standard derivative-based optimizers often get lost when derivatives take extreme values or when the Hessian of the function is indefinite. In part, this is due to their strong dependence upon local properties of the function being optimized.

The penalty function approach provides an alternative way of dealing with the constraint, which does not require exact satisfaction of the constraint equation at each "iteration." In fact, the constraint equation only holds asymptotically. To clarify this, we begin by introducing the penalty function *PL*:

$$PL\left(\theta, \psi\_0\right) = l(\theta) - k\left[R(\theta) - \psi\_0\right]^2, \quad k > 0. \tag{1}$$

This is the penalty function associated with the constrained optimization problem *C* introduced above. In this case, *k* is a positive parameter controlled by the researcher.

In a recent paper, Byrne [4] studied how penalty functions like *PL* can be used to solve constrained optimization problems such as *C*. The following three conditions are introduced:

**1.** *θ* is chosen from a compact set.

**2.** The functions *l*(*θ*) and *R*(*θ*) are continuous.

**3.** Let *θ<sup>k</sup>* be the vector that corresponds to the global maximum of *PL* in Eq. (1) when the parameter multiplying the squared term is *k*. We assume that each element in the sequence *θ<sup>k</sup>*, *k* = 1, 2, … exists.

Then, based on these conditions, it is proved in Ref. [4] that the sequence *θ<sup>k</sup>*, *k* = 1, 2, … converges to the *θ*\* that solves problem *C*.

This result is extremely important for applying simulated annealing to solve constrained optimization problems. First, SA chooses candidate optimizers from a compact set. Second, continuity of the penalty function *PL* is more than what is required for SA to reach a global optimum. The complicating point is that the result in Ref. [4] is expressed in terms of the limit of a sequence of SA optimizations. But, in our experience, this is not a complication of considerable practical importance. In the first place, even for large values of *k*, SA is not helped by starting the iterations for the (*k* + 1)st solution at the optimal values from the *k*th solution. In practice, given that SA searches ("cools") sufficiently slowly and follows long enough paths in the domain, it tends to find the global solution of the (*k* + 1)st problem regardless of the starting values it is given. Second, there is a practical issue of how much accuracy can be expected. The SA algorithm terminates when successive improvements in the value of the objective function are all less than a user-specified threshold. This means that the contribution of the term *k*[*R*(*θ*) − *ψ*0]<sup>2</sup> must also be small in absolute value. In all of the problems we have considered, setting *k* = 100,000 is certainly enough to get a high-quality estimate of *θ*\*. That is, it is reasonable to truncate the sequence at this value of *k*.
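
The truncated-sequence strategy can be sketched as follows. For brevity, the inner maximizer below is a simple deterministic compass search rather than SA (and so, unlike SA, it benefits from being warm-started at the previous solution), and the objective and constraint are toy stand-ins of our own invention; only the outer loop over increasing *k* mirrors the penalty procedure described above.

```python
def maximize(f, x0, step=0.25, tol=1e-8):
    """Crude coordinate (compass) search; a stand-in for the SA maximizer."""
    x, fx = list(x0), f(x0)
    while step > tol:
        improved = False
        for i in range(len(x)):
            for d in (step, -step):
                cand = list(x)
                cand[i] += d
                fc = f(cand)
                if fc > fx:
                    x, fx, improved = cand, fc, True
        if not improved:
            step *= 0.5          # no move improved: refine the step size
    return x

def l(theta):                    # toy "log-likelihood" to be maximized
    return -(theta[0] ** 2 + theta[1] ** 2)

def R(theta):                    # toy constraint function
    return theta[0] + theta[1]

psi0 = 1.0
theta = [0.0, 0.0]
for k in (1.0, 10.0, 100.0, 1e3, 1e4):
    # Penalty function PL of Eq. (1) for the current value of k.
    PL = lambda th, k=k: l(th) - k * (R(th) - psi0) ** 2
    theta = maximize(PL, theta)
print(theta)   # approaches the constrained optimum (0.5, 0.5) as k grows
```

Each pass leaves the constraint slightly violated; as *k* increases, the maximizer of *PL* is squeezed toward the feasible set, exactly as the convergence result above describes.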

In concluding this section, we reconsider the question: "Why does a penalty approach work when the Lagrange approach fails?" In our experience, the Lagrange approach, which requires solutions of an equation to a given level of accuracy, is prone to problems of numerical accuracy and their propagation. Alternatively, the penalty approach never requires the constraint to be exactly satisfied. Instead, it increasingly discourages large squared deviations of the constraint function *R*(*θ*) from its constraint value *ψ*<sup>0</sup> as *k* increases. As a final point, when conditions are satisfied for the Lagrange multiplier formally to exist, it can be obtained as the limit of the partial derivative of the penalty function with respect to the parameter *ψ*<sup>0</sup> as *k* increases.

## **2. Modeling reliability using SA penalized likelihood**

#### **2.1. Background on reliability**


We consider two variables *X* and *Y*. We refer to them respectively as Strength and Stress. For example, Strength could refer to the "time before failure" of a component such as a digital storage device. Alternatively, Stress might measure the total time that the device is used. From the standpoint of a manufacturer, *X* and *Y* are both random variables with distributions that can, in principle, be estimated from available breakdown and usage data. Reliability, *R*, is defined as the probability that the component will withstand the stress it faces in use. In particular,

$$R = P(Y < X). \tag{2}$$

A variety of distributions have been used for *X* and *Y* in the literature. The actual choice depends upon the process being studied. It is standard and reasonable to suppose that *X* and *Y* are independent.<sup>1</sup>

<sup>1</sup> That is, the probability distribution of *Y* does not depend upon any realized value of *X* and vice versa. If dependence is possible in a given setting, it is easily accommodated. Formally, Eq. (2) will continue to hold, but additional parameters, associated with the interdependence, may appear in both the objective function and the constraint. In a formal sense, the penalty approach is unchanged.

In this section, we suppose that both *X* and *Y* are distributed as exponentiated exponential distributions. Exponentiated exponential distributions, *EE*(*α*, *β*), have two parameters: *α >* 0 controls shape and *β >* 0 controls scale. Adopting the notation introduced in Ref. [5], the cumulative distribution function is:

$$F(x; \alpha, \beta) = \left(1 - e^{-\beta x}\right)^{\alpha}, \quad \alpha > 0,\ \beta > 0,\ x > 0. \tag{3}$$

As in Ref. [6], we assume that *X* is distributed as *EE*(*α*1, *β*1) and *Y* is distributed as *EE*(*α*2, *β*2). We do not, however, constrain the scale parameters, *β*<sup>1</sup> and *β*2, to be equal. As a result, the expression for reliability is:

$$R = P(Y < X) = \int\_0^\infty \alpha\_1 \beta\_1 \left(1 - e^{-\beta\_1 x}\right)^{\alpha\_1 - 1} e^{-\beta\_1 x} \left(1 - e^{-\beta\_2 x}\right)^{\alpha\_2} \, dx. \tag{4}$$

There is no known closed-form solution for this integral. As noted in Ref. [7], introducing the change of variables *z* = *β*1*x* allows one to see that *R* is homogeneous of degree 0 in (*β*1, *β*2). The contours of *R* are, therefore, constant along any ray *β*2 = *cβ*1, *c* > 0, in the (*β*1, *β*2) parameter space.
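
Although Eq. (4) has no closed form, it is easy to evaluate numerically. The sketch below uses composite Simpson's rule over a truncated range (the function name and quadrature settings are illustrative choices of ours; any standard quadrature would do), and the second call illustrates the degree-0 homogeneity in (*β*1, *β*2).

```python
import math

def reliability(a1, b1, a2, b2, n=4000, upper=60.0):
    """Evaluate R = P(Y < X) of Eq. (4) by composite Simpson's rule.

    The infinite range is truncated where the integrand is negligible.
    Assumes a1 >= 1 so the integrand is finite at x = 0.
    """
    hi = upper / min(b1, b2)          # integrand decays like e^(-beta*x)
    h = hi / n

    def integrand(x):
        return (a1 * b1 * (1.0 - math.exp(-b1 * x)) ** (a1 - 1.0)
                * math.exp(-b1 * x) * (1.0 - math.exp(-b2 * x)) ** a2)

    s = integrand(0.0) + integrand(hi)
    for i in range(1, n):
        s += (4.0 if i % 2 else 2.0) * integrand(i * h)
    return s * h / 3.0

# With a1 = 2, b1 = 1, a2 = 1, b2 = 1 the integral reduces to 2/3 by the
# substitution u = 1 - e^(-x); scaling both scale parameters leaves R
# unchanged, consistent with the homogeneity noted above.
print(reliability(2.0, 1.0, 1.0, 1.0))
print(reliability(2.0, 3.0, 1.0, 3.0))
```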

### **2.2. The unconstrained EE likelihood equation and its properties**

Following Ref. [7], we let *x* = (*x*1, …, *xn*)′ and *y* = (*y*1, …, *ym*)′ denote the realizations of random samples from *EE*(*α*1, *β*1) and *EE*(*α*2, *β*2), respectively. The log-likelihood function of the above model can be written:

$$\begin{aligned} l(\alpha\_1, \beta\_1, \alpha\_2, \beta\_2; x, y) &= n \log \alpha\_1 + n \log \beta\_1 + (\alpha\_1 - 1) \sum\_{i=1}^n \log \left(1 - e^{-\beta\_1 x\_i} \right) - \beta\_1 \sum\_{i=1}^n x\_i \\ &\quad + m \log \alpha\_2 + m \log \beta\_2 + (\alpha\_2 - 1) \sum\_{j=1}^m \log \left(1 - e^{-\beta\_2 y\_j} \right) - \beta\_2 \sum\_{j=1}^m y\_j. \end{aligned} \tag{5}$$

We denote the parameter vector as *θ* = (*α*1, *α*2, *β*1, *β*2)′.

The Appendix in Ref. [7] contains a derivation of the properties of *l*(*θ*) = *l*(*α*1, *β*1, *α*2, *β*2; *x*, *y*). In particular, *l*(*θ*) is not a concave function of *θ* nor is it quasi- or pseudo-concave. There is a small region around the point where the gradient of *l*(*θ*) vanishes and in that region, the Hessian matrix is negative definite. Elsewhere in the parameter space, the determinant of the Hessian matrix changes sign frequently. Thus, extreme care must be taken in trying to maximize *l*(*θ*), using a derivative-based algorithm. We found that a variable-metric algorithm would work as long as an approximate Hessian matrix, constrained to be negative definite, is used over a restricted parameter space.

One example, which we consider in greater detail later in the paper, uses the following data with sample sizes of 11 and 9: *x*= (2.1828, 0.5911, 1.0711, 0.9007, 1.7814, 1.3616, 0.8629, 0.2301, 1.5183, 0.8481, 1.0845) and *y*= (0.8874, 1.1482, 0.8227, 0.4086, 0.5596, 1.1978, 1.1324, 0.5625, 1.0679). Our SA program quickly and easily solved the associated unconstrained maximum likelihood optimization problem.
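
Eq. (5) is straightforward to code. The sketch below (the function name is ours, and the full SA maximization is omitted) evaluates *l*(*θ*) for the sample just given; because the two samples are independent, the log-likelihood is simply the sum of an *X* part and a *Y* part.

```python
import math

# Data from Section 2.2 (sample sizes n = 11 and m = 9).
x = [2.1828, 0.5911, 1.0711, 0.9007, 1.7814, 1.3616, 0.8629,
     0.2301, 1.5183, 0.8481, 1.0845]
y = [0.8874, 1.1482, 0.8227, 0.4086, 0.5596, 1.1978, 1.1324,
     0.5625, 1.0679]

def loglik(a1, b1, a2, b2, x=x, y=y):
    """Log-likelihood of Eq. (5) for two independent EE samples."""
    n, m = len(x), len(y)
    lx = (n * math.log(a1) + n * math.log(b1)
          + (a1 - 1.0) * sum(math.log(1.0 - math.exp(-b1 * xi)) for xi in x)
          - b1 * sum(x))
    ly = (m * math.log(a2) + m * math.log(b2)
          + (a2 - 1.0) * sum(math.log(1.0 - math.exp(-b2 * yj)) for yj in y)
          - b2 * sum(y))
    return lx + ly
```

This is the objective handed to the optimizer in the unconstrained problem *U*; the penalty term of Eq. (1) is subtracted from it in the constrained runs.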

### **2.3. Constrained likelihood maximization**


As will be discussed in the next section, inference for the "parameter" *R*(*θ*) in our reliability model requires that the likelihood function *l*(*θ*) be maximized subject to the constraint *R* (*θ*) = *ψ*<sup>0</sup> for a range of values of the constraint parameter *ψ*0. These constrained optimization problems are all solved using the penalty function approach introduced in Section 1.2 and using the penalty function *PL*(*θ*, *ψ*0) given in Eq. (1). The functions *l*(*θ*) and *R*(*θ*) can be thought of now as the unconstrained *EE* likelihood function and the reliability function, respectively.

## **3. Likelihood-based inference and penalty functions**

In Eq. (5), we introduced the statistical log-likelihood function primarily as an example of a function that needs to be maximized (with and without constraint) in a statistical setting. In this section, we provide more details about likelihood functions and inference. Our brief discussion is not intended as a complete explanation of the underlying statistical notions. Rather, it is intended only to motivate the importance of constrained and unconstrained optimization within statistics. Our discussion is rooted in the example of Eq. (5).

### **3.1. Background on likelihood models in statistics**

Likelihood is akin to probability. The difference in the notions for our purposes is that likelihood is measured in terms of the probability density governing the realizations of a continuous random variable. Technically, the probability of any one outcome, say *x*0, of a continuous random variable is 0. The value of the density, say *h*(*x*), evaluated at *x*<sup>0</sup> and multiplied by *dx*, that is, *h*(*x*0) *dx*, can be thought of as approximately the probability that there will be a realization of the random variable *X* in a very small interval containing *x*0. It is common to have situations where the realized (observed) values of a random variable *X* arise from a process of random sampling where the outcomes are independent of each other yet are governed by identical probability density functions, *h*(*x*). The likelihood of a given sample of realized values is defined as the product of the densities corresponding to each of the outcomes in the sample; so the likelihood of a given sample is, up to a scaling factor, a notion similar to the probability of the sample. The likelihood of a sample of realizations of *X* and *Y* is, given our assumptions, the product of the likelihoods of the *X* and the *Y* samples. For a variety of reasons, it is often easier to work with a positive monotonic transformation of the sample likelihood. In particular, we work with the log-likelihood of the sample. In Eq. (5), we are given the log-likelihood of a sample where the densities come from possibly different exponentiated exponential distributions.

The derivation of the log-likelihood associated with a sample of realizations is just the beginning of the modeling process. Extensions include forecasting the next realization of a random variable or perhaps finding an interval where one can be 95% confident that the next realization of a function of the random variables will fall. For example, we could ask for a 95% confidence interval of the measure of reliability in Eq. (4), incorporating the information in the sample given at the end of Section 2.2. These are questions of statistical inference. We answer these questions by solving the corresponding optimization problems.

The likelihood function given in Eq. (5) can be combined with the sample of 11 realizations of *X* and 9 realizations of *Y* given in Section 2.2. We can use the information in the data to estimate the unknown vector of parameters: *θ* = (*α*<sub>1</sub>, *α*<sub>2</sub>, *β*<sub>1</sub>, *β*<sub>2</sub>). One set of estimates of the parameters of the model is obtained by maximizing the likelihood function with the choice variables being the parameters. These maximum likelihood parameter estimates can be thought of as the parameter values that yield specific density functions that are most likely to have generated the data. Of course, 20 observations are not enough to achieve certainty, so there is a related theory about where the true (population) parameters lie in relation to their estimates. Indeed, there are probability distributions associated with the maximum likelihood parameter estimation process, and the parameter values that maximize the log-likelihood for a given sample of data realizations are themselves just realizations. These probability distributions or their approximations allow us to estimate how close the parameter realizations are to the true parameter values.
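The idea of choosing the parameter values that maximize the log-likelihood can be sketched numerically. The toy example below recovers the rate parameter of an exponential sample by a crude grid search standing in for the chapter's SA-based maximization; for the exponential density the result can be checked against the closed-form MLE *λ̂* = *n*/Σ*x*. The sample, bounds, and names are illustrative assumptions, not the chapter's data.

```python
import math

def log_likelihood(sample, lam):
    # exponential log-likelihood: n * log(lam) - lam * sum(x)
    return len(sample) * math.log(lam) - lam * sum(sample)

def mle_by_search(sample, lo=1e-3, hi=10.0, steps=100000):
    # crude one-dimensional grid search for the maximizing parameter;
    # a stand-in for a stochastic global optimizer such as SA
    best_lam, best_ll = lo, log_likelihood(sample, lo)
    for i in range(1, steps + 1):
        lam = lo + (hi - lo) * i / steps
        ll = log_likelihood(sample, lam)
        if ll > best_ll:
            best_lam, best_ll = lam, ll
    return best_lam

sample = [0.5, 1.2, 0.3, 2.0]
lam_hat = mle_by_search(sample)
# analytic check: the exponential MLE is n / sum(x) = 4 / 4.0 = 1.0 here
```

The grid search is only viable in one dimension; for the four-parameter vector *θ* of Eq. (5), a global method such as SA becomes the practical choice.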

There is also a probability distribution for the maximized value of the log-likelihood function. This allows us to ask questions such as the following: do I induce a "significant" change in the maximized likelihood value when I constrain the parameter estimates (choice variables) to satisfy an additional condition or set of conditions? This leads back to the constrained optimization problem *C* in Section 1.2, and the penalty function in Eq. (1).

In the subsection that follows, we present the process of inference for our reliability model. The presentation is more technical.

#### **3.2. Inference in the reliability model**

Given the log-likelihood function *l*(*θ*), we denote by *θ̂* the unconstrained maximum likelihood estimator obtained when *l*(*θ*) alone is maximized. As well, we define

$$j\left(\widehat{\boldsymbol{\theta}}\right) = -\frac{\partial^2 l(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}\,\partial \boldsymbol{\theta}'}\Big|\_{\widehat{\boldsymbol{\theta}}} \tag{6}$$

as the observed information matrix evaluated at *θ̂*. Finally, we let *θ̂*<sub>*ψ*</sub> be the constrained maximum likelihood estimator of *θ* given by maximizing *l*(*θ*) subject to *R*(*θ*) = *ψ*. Formally, *θ̂*<sub>*ψ*</sub> can be obtained for any *ψ* in the range of *R*(*θ*) by maximizing *l*(*θ*) subject to the constraint *R*(*θ*) = *ψ* using the penalty function approach within SA.
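A minimal sketch of this penalty-within-SA idea follows, under simplifying assumptions: two independent exponential samples, *R*(*θ*) taken as λ<sub>1</sub>/(λ<sub>1</sub> + λ<sub>2</sub>) (which is P(*X* < *Y*) for exponentials, not the chapter's exponentiated exponential model), a quadratic penalty with a fixed weight `mu` rather than the chapter's penalty schedule, and illustrative data and tuning constants throughout.

```python
import math
import random

def penalized_objective(theta, sample_x, sample_y, psi, mu):
    lam1, lam2 = theta
    # exponential log-likelihoods of the two independent samples
    ll = (len(sample_x) * math.log(lam1) - lam1 * sum(sample_x)
          + len(sample_y) * math.log(lam2) - lam2 * sum(sample_y))
    r = lam1 / (lam1 + lam2)          # illustrative R(theta) = P(X < Y)
    return ll - mu * (r - psi) ** 2   # quadratic penalty enforcing R(theta) = psi

def constrained_mle_sa(sample_x, sample_y, psi, mu=1e4, iters=20000, t0=1.0, seed=0):
    rng = random.Random(seed)
    theta = [1.0, 1.0]
    f = penalized_objective(theta, sample_x, sample_y, psi, mu)
    best, best_f = list(theta), f
    for k in range(iters):
        t = t0 / (1.0 + k)            # simple cooling schedule
        # Gaussian proposal, clamped so both rates stay positive
        cand = [max(1e-6, v + rng.gauss(0.0, 0.05)) for v in theta]
        fc = penalized_objective(cand, sample_x, sample_y, psi, mu)
        # Metropolis rule: always accept uphill, downhill with decaying probability
        if fc > f or rng.random() < math.exp((fc - f) / t):
            theta, f = cand, fc
            if f > best_f:
                best, best_f = list(theta), f
    return best

theta_psi = constrained_mle_sa([0.5, 1.2, 0.3, 2.0], [1.0, 0.8, 1.5], psi=0.5)
```

Because the penalty weight is large, the returned *θ̂*<sub>*ψ*</sub> satisfies the constraint *R*(*θ*) ≈ *ψ* to within the penalty's tolerance while maximizing the likelihood along the constraint set.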

The aim next is to obtain inference concerning *R* = *R*(*θ*), where *dim*(*R*) = 1. Two widely used likelihood-based methods for obtaining confidence intervals for *R* are based on the asymptotic distribution of the maximum likelihood estimator *θ̂* and the (log) likelihood ratio statistic.

Taking *θ* as the true population parameter vector, the quadratic form

$$\left(\widehat{\boldsymbol{\theta}} - \boldsymbol{\theta}\right)' \left[\widehat{var}\left(\widehat{\boldsymbol{\theta}}\right)\right]^{-1} \left(\widehat{\boldsymbol{\theta}} - \boldsymbol{\theta}\right)$$

is asymptotically distributed as chi-square with degrees of freedom equal to *dim*(*θ*), with the variance-covariance matrix approximated by $$\widehat{var}\left(\widehat{\boldsymbol{\theta}}\right) \approx j^{-1}\left(\widehat{\boldsymbol{\theta}}\right).$$ Since *R* = *R*(*θ*) depends upon the entire vector of parameters, we approximate its variance by applying the Delta method to *R̂* = *R*(*θ̂*) and obtain:

$$\widehat{var}\left(\widehat{R}\right) \approx R\_{\theta}'\left(\widehat{\boldsymbol{\theta}}\right)\widehat{var}\left(\widehat{\boldsymbol{\theta}}\right)R\_{\theta}\left(\widehat{\boldsymbol{\theta}}\right) \tag{7}$$

where


$$R\_{\theta} \left( \widehat{\boldsymbol{\theta}} \right) = \frac{\partial R(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \Big|\_{\widehat{\boldsymbol{\theta}}}.\tag{8}$$

Since *dim*(*R̂*) = 1, we have

$$\frac{\widehat{R} - R}{\sqrt{\widehat{var}\left(\widehat{R}\right)}}\tag{9}$$

asymptotically distributed as standard normal. An approximate (1 − *α*)100 % confidence interval for *R* based on *θ̂* is

$$\left(\widehat{R} - z\_{\alpha/2}\sqrt{\widehat{var}\left(\widehat{R}\right)},\; \widehat{R} + z\_{\alpha/2}\sqrt{\widehat{var}\left(\widehat{R}\right)}\right) \tag{10}$$

where *zα*/2 is the (1 − *α*/2)100th percentile of the standard normal distribution. This is our first confidence interval.
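Computing the interval in Eq. (10) is mechanical once *R̂* and its estimated variance from Eq. (7) are in hand. The sketch below uses illustrative placeholder values for *R̂* and its variance, not the chapter's actual estimates.

```python
import math

def wald_ci(r_hat, var_hat, z=1.959964):
    # (1 - alpha) Wald interval from Eq. (10); z is the standard-normal
    # percentile z_{alpha/2} (about 1.96 for a 95% interval)
    half = z * math.sqrt(var_hat)
    return (r_hat - half, r_hat + half)

# illustrative inputs: R-hat = 0.62 with standard error 0.16 (assumed values)
ci = wald_ci(0.62, 0.16 ** 2)
# -> roughly (0.306, 0.934)
```

The interval is symmetric about *R̂* by construction, which is one reason it can behave poorly when *R̂* lies near a boundary such as 0 or 1.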

Alternatively, under the regularity conditions stated in Refs. [8, 9], the log-likelihood ratio statistic:

$$W(\psi) = 2\left[l\left(\widehat{\boldsymbol{\theta}}\right) - l\left(\widehat{\boldsymbol{\theta}}\_{\psi}\right)\right] \tag{11}$$

is asymptotically distributed as chi-square with 1 degree of freedom. Therefore, an approximate (1 − *α*)100 % confidence interval for *R* based on the likelihood ratio statistic is:

$$\left\{\psi: W(\psi) \leq \chi^2\_{1,\alpha}\right\}.\tag{12}$$

The set of all constrained values *ψ* in the range of *R* that cannot be rejected at the (1 − *α*)100 % level as the true value *R*(*θ*) is defined in Eq. (12). These values form our second confidence interval.
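The set in Eq. (12) can be found by a simple scan over candidate values *ψ*. The sketch below uses a deliberately simple one-parameter case where *R*(*θ*) = *θ*, so the constrained maximum is just *l*(*ψ*), with an exponential likelihood and illustrative data; in the chapter's setting, each *l*(*θ̂*<sub>*ψ*</sub>) would instead come from a constrained SA run.

```python
import math

CHI2_1_95 = 3.841459  # 95th percentile of chi-square with 1 degree of freedom

def log_likelihood(sample, lam):
    # exponential log-likelihood: n * log(lam) - lam * sum(x)
    return len(sample) * math.log(lam) - lam * sum(sample)

def lr_interval(sample, lo=0.01, hi=10.0, steps=2000):
    # scan psi over a grid and keep the values whose likelihood-ratio
    # statistic W(psi) stays at or below the chi-square cutoff (Eq. 12);
    # with R(theta) = theta, the constrained maximum is simply l(psi)
    lam_hat = len(sample) / sum(sample)        # closed-form exponential MLE
    ll_hat = log_likelihood(sample, lam_hat)
    kept = []
    for i in range(steps + 1):
        psi = lo + (hi - lo) * i / steps
        w = 2.0 * (ll_hat - log_likelihood(sample, psi))
        if w <= CHI2_1_95:
            kept.append(psi)
    return min(kept), max(kept)

lo_ci, hi_ci = lr_interval([0.5, 1.2, 0.3, 2.0])
```

Note that the resulting interval need not be symmetric about the MLE, which is exactly the flexibility that makes the likelihood-ratio interval invariant to reparameterization.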

It should be noted that both methods have rates of convergence *O*(*n*<sup>−1/2</sup>). While the MLE-based interval is often preferred because of its simplicity of calculation, the log-likelihood ratio method has the advantage of being invariant to reparameterization, whereas the MLE-based method is not. The results presented in Ref. [10] suggest that, in terms of coverage, the confidence interval based on the log-likelihood ratio statistic should be preferred to the MLE-based interval. In particular, when 95% confidence intervals for both statistics are compared, the interval from the log-likelihood ratio statistic is shorter and therefore more precise.


| Method | **95% Confidence interval** |
|--------|-----------------------------|
| MLE | (0.306, 0.934) |
| *χ*<sup>2</sup> | (0.378, 0.823) |

**Table 1.** Interval estimates of *R*.

The results are summarized in **Table 1**. We note that, consistent with Ref. [10], the *χ*<sup>2</sup> interval is indeed shorter than the MLE interval.

## **4. Conclusion**

This chapter has considered how SA can play an important role as a global optimizer of constrained likelihood-based statistical models. SA is naturally paired with the penalty function approach to constrained optimization. SA and the penalty approach both require compact domains and bounded functions. Penalty functions must be continuous and, within a statistical setting, this almost always holds. SA supplies the global optimization property that guarantees that the penalty function approach converges to the global constrained optimum. Even though our implementation of SA does not make use of derivatives, the converged penalty function will often be differentiable and Lagrange multipliers, gradients, and Hessian matrices can be calculated. The extension of these results to multiple constraints is computationally straightforward.

In this chapter, we have motivated the pairing of SA and penalty functions in a statistical setting. But the approach can be used to solve many other types of numerical constrained optimization problems. Over time, we have accumulated a considerable amount of experience solving constrained problems using SA and the penalty functions. One lesson stands out: SA is a global optimizer and, for the most part, it should be independent of initial conditions such as starting values of parameters (choice variables). If you find that you get a different optimum after changing the starting values, then it is likely that neither solution is the true global optimum. You can usually remedy this by increasing the initial temperature, slowing the rate of cooling, and/or increasing the length and number of search paths.
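This multi-start check can be sketched as follows, on a toy bounded one-dimensional objective; the function, cooling schedule, and tuning constants are illustrative assumptions, not the chapter's implementation. If runs from different starting values disagree, the schedule is too aggressive.

```python
import math
import random

def objective(x):
    # toy multimodal objective on the compact domain [-4, 4];
    # its global maximum is f(0) = 1
    return math.cos(3.0 * x) * math.exp(-x * x)

def sa_maximize(f, x0, iters=5000, t0=2.0, step=0.3, seed=0):
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best_x, best_f = x, fx
    for k in range(iters):
        t = t0 * (0.999 ** k)                               # geometric cooling
        cand = min(4.0, max(-4.0, x + rng.gauss(0.0, step)))  # stay in domain
        fc = f(cand)
        # Metropolis rule: always accept uphill moves, sometimes downhill
        if fc > fx or rng.random() < math.exp((fc - fx) / t):
            x, fx = cand, fc
            if fx > best_f:
                best_x, best_f = x, fx
    return best_x, best_f

# multi-start check: a well-tuned global optimizer should agree across starts
results = [sa_maximize(objective, x0, seed=i) for i, x0 in enumerate([-3.0, 0.5, 3.0])]
```

If the three runs returned materially different optima, the remedy described above applies: raise `t0`, bring the cooling factor closer to 1, or increase `iters`.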

## **Author details**

Barry Smith<sup>1</sup> and Augustine Wong<sup>2</sup>\*

\*Address all correspondence to: august@yorku.ca

1 Department of Economics, York University, Toronto, Canada

2 Department of Mathematics and Statistics, York University, Toronto, Canada

## **References**
