**Fuzzy c-Means Clustering, Entropy Maximization, and Deterministic and Simulated Annealing**

Makoto Yasuda

Simulated Annealing – Advances, Applications and Hybridizations


http://dx.doi.org/10.5772/48659

## **1. Introduction**

Many engineering problems can be formulated as optimization problems, and the deterministic annealing (DA) method [20] is known as an effective optimization method for such problems. DA is a deterministic variant of simulated annealing (SA) [1, 10]. DA characterizes the minimization of a cost function as the minimization of the Helmholtz free energy, which depends on a (pseudo) temperature, and tracks the minimum of the free energy while decreasing the temperature; thus it can deterministically optimize the cost function at each temperature [20]. Hence, DA is more efficient than SA, but it does not guarantee a globally optimal solution. The study of DA in [20] addressed the avoidance of poor local minima of the cost function in data clustering. DA was subsequently applied to various subjects such as combinatorial optimization problems [21], vector quantization [4], classifier design [13], pairwise data clustering [9], and so on.

On the other hand, clustering is a method which partitions a given set of data points into subgroups, and it is one of the major tools for data analysis. In the real world, cluster boundaries are usually not clear-cut, so fuzzy clustering is often more suitable than crisp clustering. Bezdek [2] proposed fuzzy c-means (FCM), which is now well known as the standard technique for fuzzy clustering.

Then, after the work of Li et al. [11], which formulated the regularization of the FCM with the Shannon entropy, Miyamoto et al. [14] discussed the FCM within the framework of Shannon entropy based clustering. From the historical point of view, however, it should be noted that Rose et al. [20] first studied the statistical mechanical analogy of the FCM with the maximum entropy method, which was basically probabilistic clustering.

To measure the "indefiniteness" of a fuzzy set, De Luca and Termini [6] defined fuzzy entropy after Shannon. Afterwards, some similar measures were proposed from wider viewpoints of indefiniteness [15, 16]. Fuzzy entropy has been used for knowledge retrieval from fuzzy databases [3] and for image processing [31], and it has proved to be useful.

Tsallis [24] achieved a nonextensive extension of Boltzmann-Gibbs statistics, postulating a generalized form of entropy with a generalization parameter *q* which, in the limit of *q* → 1, recovers the Shannon entropy. Later on, Ménard et al. [12] derived a membership function by regularizing the FCM with the Tsallis entropy.

©2012 Yasuda, licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


In this chapter, by maximizing various entropies within the framework of the FCM, membership functions which take the familiar forms of statistical mechanical distribution functions are derived. The advantage of using the statistical mechanical membership functions is that fuzzy c-means clustering can be interpreted and analyzed from a statistical mechanical point of view [27, 28].

After that, we focus on the Fermi-Dirac-like membership function because, compared to the Maxwell-Boltzmann-like membership function, it has extra parameters $\alpha_k$ ($\alpha_k$ corresponds to a chemical potential in statistical mechanics [19], and *k* denotes a data point), which make it possible to represent various cluster shapes, as do earlier clustering methods based on, for example, Gaussian mixtures [7] and the degree of fuzzy entropy [23]. The $\alpha_k$ strongly affect clustering results, and they must be optimized under the normalization constraint of the FCM. On the other hand, the DA method, though efficient, does not give appropriate values of the $\alpha_k$ by itself, and DA clustering sometimes fails if the $\alpha_k$ are improperly given. Accordingly, we introduce SA to optimize the $\alpha_k$ because, as pointed out above, both DA and SA contain a parameter corresponding to the system temperature, and thus they can be naturally combined as DASA.

Nevertheless, this approach raises a few problems: (1) How to estimate the initial values of the $\alpha_k$ under the normalization constraint? (2) How to estimate the initial annealing temperature? (3) SA must optimize real-valued $\alpha_k$ [5, 26]. (4) SA must optimize many $\alpha_k$ [22].

Linear approximations of the Fermi-Dirac-like membership function are useful in guessing the initial $\alpha_k$ and the initial annealing temperature of DA.

In order to perform SA in a many-variable domain, the $\alpha_k$ to be optimized are selected according to a selection rule. In the early annealing stages, most $\alpha_k$ are optimized. In the final annealing stage, however, only the $\alpha_k$ of data located sufficiently far away from all cluster centers are optimized, because their memberships might be fuzzy. Distances between the data and the cluster centers are measured by using linear approximations of the Fermi-Dirac-like membership function.
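As an illustrative sketch only (the chapter gives no pseudocode, and the quadratic objective below is a hypothetical stand-in for the actual DASA cost), a Metropolis-type SA over a vector of real parameters, perturbing one randomly selected component per step, looks like this:

```python
import numpy as np

def sa_minimize(a0, cost, T0=1.0, cool=0.995, steps=2000, sigma=0.3, seed=0):
    """Metropolis SA over a vector of real parameters: perturb one randomly
    selected component per step; accept uphill moves with probability e^(-dc/T)."""
    rng = np.random.default_rng(seed)
    a, c = np.array(a0, dtype=float), cost(a0)
    T = T0
    for _ in range(steps):
        k = rng.integers(len(a))            # selection: which component to move
        cand = a.copy()
        cand[k] += rng.normal(0.0, sigma)   # real-valued perturbation
        cc = cost(cand)
        if cc < c or rng.random() < np.exp(-(cc - c) / T):
            a, c = cand, cc
        T *= cool                           # geometric cooling schedule
    return a, c

# Toy quadratic stand-in with a known minimizer (not the DASA objective).
target = np.array([1.0, -2.0, 0.5])
cost = lambda a: float(((np.asarray(a) - target) ** 2).sum())
a_opt, c_opt = sa_minimize(np.zeros(3), cost)
```

In DASA, a move of this kind would be applied only to the $\alpha_k$ chosen by the selection rule above.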

However, DASA suffers from a few disadvantages. One of them is that it is not necessarily easy to interpolate the membership functions obtained by DASA, since their values differ greatly from each other. The fractal interpolation method [17] is suitable for such rough functions [30].

Numerical experiments show that DASA clusters data distributed in various shapes more properly and stably than DA alone. The effectiveness of the fractal interpolation is also examined.

### **2. Fuzzy c-means**

Let $X = \{\mathbf{x}_1,\ldots,\mathbf{x}_n\}$ $(\mathbf{x}_k = (x_k^1,\ldots,x_k^p) \in R^p)$ be a data set in a *p*-dimensional real space, which should be divided into *c* clusters $C = \{C_1,\ldots,C_c\}$. Let $V = \{\mathbf{v}_1,\ldots,\mathbf{v}_c\}$ $(\mathbf{v}_i = (v_i^1,\ldots,v_i^p))$ be the centers of the clusters and $u_{ik} \in [0,1]$ $(i=1,\ldots,c;\ k=1,\ldots,n)$ be the membership functions. Also let

$$J = \sum_{k=1}^{n} \sum_{i=1}^{c} (u_{ik})^m d_{ik} \quad (m > 1) \tag{1}$$

be the objective function of the FCM, where $d_{ik} = \|\mathbf{x}_k - \mathbf{v}_i\|^2$. In the FCM, under the normalization constraint

$$\sum_{i=1}^{c} u_{ik} = 1 \quad \forall k, \tag{2}$$

the Lagrange function *LFCM* is given by


$$L_{FCM} = J - \sum_{k=1}^{n} \eta_k \left(\sum_{i=1}^{c} u_{ik} - 1\right),\tag{3}$$

where the $\eta_k$ are Lagrange multipliers. Bezdek [2] showed that the FCM approaches crisp clustering as *m* decreases to +1.
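As a minimal sketch (our code, not the author's), the alternating optimization that follows from the stationary conditions of Eqs. (1)-(3) can be written as follows; the membership update $u_{ik} = \bigl(\sum_j (d_{ik}/d_{jk})^{1/(m-1)}\bigr)^{-1}$ is the standard FCM result:

```python
import numpy as np

def fcm(X, c, m=2.0, iters=100, seed=0):
    """Minimal fuzzy c-means: alternate the membership and center updates
    that follow from the stationary conditions of Eqs. (1)-(3)."""
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)                                    # constraint (2)
    for _ in range(iters):
        W = U ** m
        V = (W @ X) / W.sum(axis=1, keepdims=True)        # cluster centers
        D = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)  # d_ik
        D = np.maximum(D, 1e-12)                          # avoid division by zero
        U = D ** (-1.0 / (m - 1))
        U /= U.sum(axis=0)                                # membership update
    return U, V

# two well-separated blobs (synthetic data, our choice)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
U, V = fcm(X, c=2)
```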

#### **3. Entropy maximization of FCM**

#### **3.1. Shannon entropy maximization**

First, we introduce the Shannon entropy into the FCM clustering. The Shannon entropy is given by

$$S = -\sum_{k=1}^{n} \sum_{i=1}^{c} u_{ik} \log u_{ik}.\tag{4}$$

Under the normalization constraint (2), and setting *m* to 1, the entropy functional is given by

$$\delta S - \sum_{k=1}^{n} \alpha_k \delta \left(\sum_{i=1}^{c} u_{ik} - 1\right) - \beta \sum_{k=1}^{n} \sum_{i=1}^{c} \delta (u_{ik} d_{ik}),\tag{5}$$

where *α<sup>k</sup>* and *β* are the Lagrange multipliers, and *α<sup>k</sup>* must be determined so as to satisfy Eq. (2). The stationary condition for Eq. (5) leads to the following membership function
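Spelling out the intermediate step (a routine variation, added here for completeness), the stationary condition of Eq. (5) reads

```latex
\frac{\partial}{\partial u_{ik}}\!\left[
  -\sum_{k=1}^{n}\sum_{i=1}^{c} u_{ik}\log u_{ik}
  -\sum_{k=1}^{n}\alpha_k\!\left(\sum_{i=1}^{c} u_{ik}-1\right)
  -\beta\sum_{k=1}^{n}\sum_{i=1}^{c} u_{ik}d_{ik}
\right]
 = -\log u_{ik} - 1 - \alpha_k - \beta d_{ik} = 0,
```

so $u_{ik} = e^{-1-\alpha_k} e^{-\beta d_{ik}}$; the prefactor $e^{-1-\alpha_k}$ is fixed by constraint (2) and cancels in the normalization, which gives Eq. (6).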

$$u_{ik} = \frac{e^{-\beta d_{ik}}}{\sum_{j=1}^{c} e^{-\beta d_{jk}}},\tag{6}$$

and the cluster centers

$$\mathbf{v}_i = \frac{\sum_{k=1}^{n} u_{ik}\,\mathbf{x}_k}{\sum_{k=1}^{n} u_{ik}}.\tag{7}$$
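Equation (6) is a softmax over $-\beta d_{ik}$, so the pair (6)-(7) is easy to sketch numerically (our illustration, with made-up data; the shift by the row minimum is a standard numerical-stability trick, not part of the derivation):

```python
import numpy as np

def shannon_memberships(X, V, beta):
    """Eq. (6): softmax of -beta * d_ik over the clusters i."""
    D = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)  # d_ik
    E = np.exp(-beta * (D - D.min(axis=0)))                 # shifted for stability
    return E / E.sum(axis=0)

def centers(U, X):
    """Eq. (7): membership-weighted means."""
    return (U @ X) / U.sum(axis=1, keepdims=True)

X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
V = np.array([[0., 0.5], [5., 5.5]])
U_sharp = shannon_memberships(X, V, beta=2.0)    # larger beta -> crisper
U_soft = shannon_memberships(X, V, beta=1e-9)    # beta -> 0: nearly uniform
```

The two calls illustrate the role of $\beta$ as an inverse temperature: as $\beta \to 0$ the memberships become uniform, and as $\beta$ grows they become crisp.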

#### **3.2. Fuzzy entropy maximization**

We then introduce the fuzzy entropy into the FCM clustering.

The fuzzy entropy is given by

$$\hat{S} = -\sum_{k=1}^{n} \sum_{i=1}^{c} \{\hat{u}_{ik} \log \hat{u}_{ik} + (1 - \hat{u}_{ik}) \log(1 - \hat{u}_{ik})\}.\tag{8}$$

The fuzzy entropy functional is given by

$$\delta \hat{S} - \sum_{k=1}^{n} \alpha_k \delta \left(\sum_{i=1}^{c} \hat{u}_{ik} - 1\right) - \beta \sum_{k=1}^{n} \sum_{i=1}^{c} \delta (\hat{u}_{ik} d_{ik}),\tag{9}$$

where *α<sup>k</sup>* and *β* are the Lagrange multipliers[28]. The stationary condition for Eq. (9) leads to the following membership function
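Analogously (again a step not written out in the text), setting the variation of Eq. (9) with respect to $\hat{u}_{ik}$ to zero gives

```latex
-\log\frac{\hat{u}_{ik}}{1-\hat{u}_{ik}} - \alpha_k - \beta d_{ik} = 0
\;\Longrightarrow\;
\frac{\hat{u}_{ik}}{1-\hat{u}_{ik}} = e^{-\alpha_k-\beta d_{ik}}
\;\Longrightarrow\;
\hat{u}_{ik} = \frac{1}{e^{\alpha_k+\beta d_{ik}}+1},
```

which is Eq. (10); here, unlike the Shannon case, $\alpha_k$ does not cancel and must be determined from constraint (2).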

$$\hat{u}_{ik} = \frac{1}{e^{\alpha_k + \beta d_{ik}} + 1},\tag{10}$$


and the cluster centers

$$\mathbf{v}_i = \frac{\sum_{k=1}^{n} \hat{u}_{ik}\,\mathbf{x}_k}{\sum_{k=1}^{n} \hat{u}_{ik}}.\tag{11}$$

In Eq. (10), *β* defines the extent of the distribution [27]. Equation (10) is formally normalized as

$$\hat{u}_{ik} = \frac{1}{e^{\alpha_k + \beta d_{ik}} + 1} \Big/ \sum_{j=1}^{c} \frac{1}{e^{\alpha_k + \beta d_{jk}} + 1}.\tag{12}$$
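Since $\sum_i \hat{u}_{ik}$ in Eq. (10) is monotonically decreasing in $\alpha_k$, the normalization constraint fixes $\alpha_k$ uniquely for each data point; a simple bisection (our choice of root finder, with assumed distances, not a method prescribed by the chapter) suffices:

```python
import numpy as np

def fd_membership(alpha_k, beta, d_k):
    """Eq. (10) for one data point k; d_k holds its c distances."""
    return 1.0 / (np.exp(alpha_k + beta * d_k) + 1.0)

def solve_alpha(beta, d_k, lo=-50.0, hi=50.0):
    """Bisection on alpha_k so that sum_i u_ik = 1: the sum decreases
    monotonically in alpha_k, from ~c at lo to ~0 at hi."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if fd_membership(mid, beta, d_k).sum() > 1.0:
            lo = mid          # sum too large -> alpha_k must grow
        else:
            hi = mid
    return 0.5 * (lo + hi)

d_k = np.array([0.2, 1.0, 3.0])   # assumed distances to c = 3 centers
beta = 1.5
alpha_k = solve_alpha(beta, d_k)
u_k = fd_membership(alpha_k, beta, d_k)
```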

#### **3.3. Tsallis entropy maximization**

Let $\tilde{\mathbf{v}}_i$ and $\tilde{u}_{ik}$ be the centers of clusters and the membership functions, respectively.

The Tsallis entropy is defined as

$$\tilde{S} = -\frac{1}{q-1} \left( \sum_{k=1}^{n} \sum_{i=1}^{c} \tilde{u}_{ik}^{q} - 1 \right),\tag{13}$$

where *q* ∈ R is any real number. The objective function is rewritten as

$$\tilde{U} = \sum_{k=1}^{n} \sum_{i=1}^{c} \tilde{u}_{ik}^{q}\, \tilde{d}_{ik},\tag{14}$$

where $\tilde{d}_{ik} = \|\mathbf{x}_k - \tilde{\mathbf{v}}_i\|^2$.

Accordingly, the Tsallis entropy functional is given by

$$\delta \tilde{S} - \sum_{k=1}^{n} \alpha_k \delta \left( \sum_{i=1}^{c} \tilde{u}_{ik} - 1 \right) - \beta \sum_{k=1}^{n} \sum_{i=1}^{c} \delta (\tilde{u}_{ik}^{q} \tilde{d}_{ik}).\tag{15}$$


The stationary condition for Eq. (15) yields the following membership function

$$\tilde{u}_{ik} = \frac{\{1 - \beta(1 - q)\tilde{d}_{ik}\}^{\frac{1}{1 - q}}}{\tilde{Z}},\tag{16}$$

where

$$\tilde{Z} = \sum\_{j=1}^{c} \{1 - \beta(1 - q)\tilde{d}\_{jk}\}^{\frac{1}{1-q}}.\tag{17}$$

In this case, the cluster centers are given by

$$\tilde{\mathbf{v}}_i = \frac{\sum_{k=1}^{n} \tilde{u}_{ik}^{q}\,\mathbf{x}_k}{\sum_{k=1}^{n} \tilde{u}_{ik}^{q}}.\tag{18}$$

In the limit of *q* → 1, the Tsallis entropy recovers the Shannon entropy [24], and $\tilde{u}_{ik}$ approaches $u_{ik}$ in Eq. (6).
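The q → 1 limit can be checked numerically (our sanity check, with assumed distances): the q-exponential $\{1-\beta(1-q)\tilde{d}_{ik}\}^{1/(1-q)}$ in Eqs. (16)-(17) tends to $e^{-\beta \tilde{d}_{ik}}$, i.e. to the softmax of Eq. (6).

```python
import numpy as np

def tsallis_memberships(d, beta, q):
    """Eqs. (16)-(17); valid while 1 - beta*(1-q)*d_ik > 0 for all i."""
    w = (1.0 - beta * (1.0 - q) * d) ** (1.0 / (1.0 - q))   # q-exponential
    return w / w.sum()

d = np.array([0.5, 1.0, 2.0])      # assumed distances of one data point
beta = 0.8
u_q = tsallis_memberships(d, beta, q=1.001)                # near the q -> 1 limit
u_shannon = np.exp(-beta * d) / np.exp(-beta * d).sum()    # Eq. (6)
```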

#### **4. Entropy maximization and statistical physics**

#### **4.1. Shannon entropy based FCM statistics**


In the Shannon entropy based FCM, the sum of the states (the partition function) of the canonical ensemble of fuzzy clustering can be written as

$$Z = \prod_{k=1}^{n} \sum_{i=1}^{c} e^{-\beta d_{ik}}.\tag{19}$$

By substituting Eq. (19) for *F* = −(1/*β*)(log *Z*)[19], the free energy becomes

$$F = -\frac{1}{\beta} \sum_{k=1}^{n} \log \left\{ \sum_{i=1}^{c} e^{-\beta d_{ik}} \right\}.\tag{20}$$

Stable thermal equilibrium requires a minimization of the free energy. By formulating deterministic annealing as a minimization of the free energy, *∂F*/*∂*v*<sup>i</sup>* = 0 yields

$$\mathbf{v}_i = \frac{\sum_{k=1}^{n} u_{ik}\,\mathbf{x}_k}{\sum_{k=1}^{n} u_{ik}}.\tag{21}$$

This cluster center is the same as that in Eq. (7).
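A minimal sketch of the resulting DA iteration for the Shannon-entropy case (the cooling schedule, initialization, and two-blob data are our assumptions): at each inverse temperature $\beta$, Eqs. (6) and (7) are iterated to a self-consistent minimum of the free energy (20) before cooling further.

```python
import numpy as np

def da_fcm(X, V0, betas, inner=50):
    """DA sketch: for each beta in the (increasing) schedule, iterate
    Eqs. (6) and (7) to self-consistency, then cool (raise beta)."""
    V = np.array(V0, dtype=float)
    for beta in betas:
        for _ in range(inner):
            D = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
            E = np.exp(-beta * (D - D.min(axis=0)))
            U = E / E.sum(axis=0)                          # Eq. (6)
            V = (U @ X) / U.sum(axis=1, keepdims=True)     # Eq. (7) / Eq. (21)
    return U, V

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-3.0, 0.2, (30, 2)), rng.normal(3.0, 0.2, (30, 2))])
U, V = da_fcm(X, V0=[[-1.0, 0.0], [1.0, 0.0]], betas=[0.2, 1.0, 5.0, 25.0])
```

At small $\beta$ the memberships are nearly uniform and the centers stay close together; as $\beta$ grows, the memberships sharpen and the centers settle onto the two blobs.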

#### **4.2. Fuzzy entropy based FCM statistics**

In a group of independent particles, the total energy and the total number of particles are given by $E = \sum_l \varepsilon_l n_l$ and $N = \sum_l n_l$, respectively, where $\varepsilon_l$ represents the energy level and $n_l$ represents the number of particles that occupy $\varepsilon_l$. We can write the sum of states, or the partition function, in the form:

$$Z_N = \sum_{\sum_l \varepsilon_l n_l = E,\ \sum_l n_l = N} e^{-\beta \sum_l \varepsilon_l n_l},\tag{22}$$

where $\beta = 1/(k_B T)$ is the inverse temperature ($k_B$ is the Boltzmann constant). However, it is difficult to evaluate the sums in (22) by counting up all possible divisions. Accordingly, we make the number of particles $n_l$ a variable and adjust a new parameter $\alpha$ (the chemical potential) so that $\sum_l \varepsilon_l n_l = E$ and $\sum_l n_l = N$ are satisfied. This yields the grand canonical distribution, and the sum of states (the grand partition function) $\Xi$ is given by [8, 19]

$$\Xi = \sum_{N=0}^{\infty} (e^{-\alpha})^N Z_N = \prod_{l} \sum_{n_l=0}^{\infty} (e^{-\alpha - \beta \varepsilon_l})^{n_l}.\tag{23}$$

For particles governed by the Fermi-Dirac distribution, Ξ can be rewritten as

$$\Xi = \prod\_{l} (1 + e^{-\alpha - \beta \varepsilon\_{l}}) . \tag{24}$$

Also, *nl* is averaged as

$$\langle n_l \rangle = \frac{1}{e^{\alpha + \beta \varepsilon_l} + 1},\tag{25}$$

where $\alpha$ is defined by the condition that $N = \sum_l \langle n_l \rangle$ [19]. The Helmholtz free energy *F* is, from the relationship $F = -k_B T \log Z_N$,

$$F = -k_B T \left( \log \Xi - \alpha \frac{\partial}{\partial \alpha} \log \Xi \right) = -\frac{1}{\beta} \left\{ \sum_{l} \log(1 + e^{-\alpha - \beta \varepsilon_l}) + \alpha N \right\}.\tag{26}$$
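Equations (23)-(25) can be verified by brute force (a sanity check we add, with assumed level energies): for fermions each $n_l \in \{0,1\}$, so enumerating all occupation configurations and weighting by $(e^{-\alpha})^N e^{-\beta E}$ must reproduce both Eq. (24) and the Fermi-Dirac average of Eq. (25).

```python
import itertools
import numpy as np

eps = np.array([0.3, 1.0, 2.5])   # assumed energy levels
alpha, beta = 0.2, 1.3

# Brute-force grand canonical sums over fermionic occupations n_l in {0, 1}.
Z, n_avg = 0.0, np.zeros(len(eps))
for occ in itertools.product([0, 1], repeat=len(eps)):
    n = np.array(occ)
    w = np.exp(-alpha * n.sum()) * np.exp(-beta * (eps * n).sum())  # (e^-alpha)^N e^{-beta E}
    Z += w
    n_avg = n_avg + n * w
n_avg /= Z

Xi = np.prod(1.0 + np.exp(-alpha - beta * eps))   # Eq. (24)
fd = 1.0 / (np.exp(alpha + beta * eps) + 1.0)     # Eq. (25)
```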

Taking that

$$E = \sum_{l} \frac{\varepsilon_l}{e^{\alpha + \beta \varepsilon_l} + 1}\tag{27}$$


into account, the entropy *S* = (*E* − *F*)/*T* has the form

$$S = -k\_B \sum\_{l} \left\{ \langle n\_l \rangle \log \langle n\_l \rangle + (1 - \langle n\_l \rangle) \log(1 - \langle n\_l \rangle) \right\}.\tag{28}$$

If the states are degenerate with degeneracy $\nu_l$, the number of particles which occupy $\varepsilon_l$ is

$$\langle N_l \rangle = \nu_l \langle n_l \rangle,\tag{29}$$

and we can rewrite the entropy *S* as

$$S = -k_B \sum_{l} \left\{ \frac{\langle N_l \rangle}{\nu_l} \log \frac{\langle N_l \rangle}{\nu_l} + \left( 1 - \frac{\langle N_l \rangle}{\nu_l} \right) \log \left( 1 - \frac{\langle N_l \rangle}{\nu_l} \right) \right\},\tag{30}$$

which is similar to the fuzzy entropy in (8). As a result, $u_{ik}$ corresponds to a grain density $\langle n_l \rangle$, and the inverse of $\beta$ in (10) represents the system or computational temperature *T*.

In FCM clustering, where any data point can belong to any cluster, the grand partition function can be written as

$$\Xi = \prod_{k=1}^{n} \prod_{i=1}^{c} (1 + e^{-\alpha_k - \beta d_{ik}}),\tag{31}$$

which, from the relationship *F* = −(1/*β*)(log Ξ − *αk∂* log Ξ/*∂αk*), gives the Helmholtz free energy

$$F = -\frac{1}{\beta} \sum\_{k=1}^{n} \left\{ \sum\_{i=1}^{c} \log(1 + e^{-\alpha\_k - \beta d\_{ik}}) + \alpha\_k \right\}.\tag{32}$$

The inverse of *β* represents the system or computational temperature *T*.

#### **4.3. Correspondence between Fermi-Dirac statistics and fuzzy clustering**

In the previous subsection, we formulated the fuzzy entropy regularized FCM as DA clustering and showed that its mechanics is nothing other than the statistics of a particle system (Fermi-Dirac statistics). The correspondences between fuzzy clustering (FC) and Fermi-Dirac statistics (FD) are summarized in Table 1. The major difference between fuzzy clustering and statistical mechanics is that data are distinguishable and can belong to multiple clusters, whereas particles which occupy the same energy state are not distinguishable. This causes a summation or a multiplication not only over *i* but over *k* as well in fuzzy clustering. Thus, the fuzzy clustering and the statistical mechanics described in this chapter are not mathematically equivalent.

| Quantity | Fermi-Dirac Statistics (FD) | Fuzzy Clustering (FC) |
|---|---|---|
| Constraints | (a) $\sum_l n_l = N$, (b) $\sum_l \varepsilon_l n_l = E$ | (a) $\sum_{i=1}^{c} u_{ik} = 1$, (b) none |
| Distribution function | $\langle n_l \rangle = \frac{1}{e^{\alpha+\beta\varepsilon_l}+1}$ | $u_{ik} = \frac{1}{e^{\alpha_k+\beta d_{ik}}+1}$ |
| Partition function ($\Xi$) | $\prod_l (1+e^{-\alpha-\beta\varepsilon_l})$ | $\prod_{k=1}^{n}\prod_{i=1}^{c}(1+e^{-\alpha_k-\beta d_{ik}})$ |
| Free energy ($F$) | $-\frac{1}{\beta}\left\{\sum_l \log(1+e^{-\alpha-\beta\varepsilon_l})+\alpha N\right\}$ | $-\frac{1}{\beta}\sum_{k=1}^{n}\left\{\sum_{i=1}^{c}\log(1+e^{-\alpha_k-\beta d_{ik}})+\alpha_k\right\}$ |
| Energy ($E$) | $\sum_l \frac{\varepsilon_l}{e^{\alpha+\beta\varepsilon_l}+1}$ | $\sum_{k=1}^{n}\sum_{i=1}^{c}\frac{d_{ik}}{e^{\alpha_k+\beta d_{ik}}+1}$ |
| Entropy | $S=-k_B\sum_l\left\{\frac{\langle N_l\rangle}{\nu_l}\log\frac{\langle N_l\rangle}{\nu_l}+\left(1-\frac{\langle N_l\rangle}{\nu_l}\right)\log\left(1-\frac{\langle N_l\rangle}{\nu_l}\right)\right\}$ | $S_{FE}=-\sum_{k=1}^{n}\sum_{i=1}^{c}\left\{u_{ik}\log u_{ik}+(1-u_{ik})\log(1-u_{ik})\right\}$ |
| Temperature ($T$) | given | given |

**Table 1.** Correspondence of Fermi-Dirac Statistics and Fuzzy Clustering.

• **Constraints:** (a) The constraint that the sum of all particles *N* is fixed in FD corresponds to the normalization constraint in FC. The energy level *l* is equivalent to the cluster number *i*. In addition, the fact that data can belong to multiple clusters leads to the summation on *k*. (b) There is no constraint in FC which corresponds to the constraint that the total energy equals *E* in FD. We have to minimize $\sum_{k=1}^{n}\sum_{i=1}^{c} d_{ik}u_{ik}$ in FC.

• **Distribution Function:** In FD, $\langle n_l \rangle$ gives the average number of particles occupying energy level *l*, because particles cannot be distinguished from each other. In FC, however, data are distinguishable, and for that reason $u_{ik}$ gives the probability of data belonging to multiple clusters.

• **Entropy:** $\langle N_l \rangle$ is supposed to correspond to a cluster capacity. The meanings of *S* and $S_{FE}$ will be discussed in detail in the next subsection.

• **Partition Function:** The fact that data can belong to multiple clusters simultaneously causes the product over *k* for FC.

• **Free Energy:** The Helmholtz free energy *F* is given by $-T(\log \Xi - \alpha_k\, \partial \log \Xi / \partial \alpha_k)$ in FC. Both *S* and $S_{FE}$ equal $-\partial F/\partial T$, as expected from statistical physics.

• **Energy:** The relationship $E = F + TS$ or $E = F + TS_{FE}$ holds between *E*, *F*, *T*, and *S* or $S_{FE}$.

• **Temperature:** Temperature is given in both cases.<sup>1</sup>

<sup>1</sup> In the FCM, however, temperature is determined as a result of clustering.
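The claim that $S_{FE} = -\partial F/\partial T$ can be checked numerically (a sanity check we add, for a single data point with assumed distances and $k_B = 1$): re-solving $\alpha_k$ from the normalization constraint at each temperature and taking a finite difference of *F* reproduces $S_{FE}$, because $\partial F/\partial \alpha_k = 0$ exactly when $\sum_i \hat{u}_{ik} = 1$.

```python
import numpy as np

d = np.array([0.4, 1.1, 2.0])   # assumed distances of one data point to c = 3 centers

def solve_alpha(beta):
    """alpha_k from the normalization constraint sum_i u_ik = 1 (bisection)."""
    lo, hi = -50.0, 50.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if (1.0 / (np.exp(mid + beta * d) + 1.0)).sum() > 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def F(T):
    """FC free energy row of Table 1 for one data point (k_B = 1, beta = 1/T)."""
    beta = 1.0 / T
    a = solve_alpha(beta)
    return -T * (np.log(1.0 + np.exp(-a - beta * d)).sum() + a)

T = 0.7
a = solve_alpha(1.0 / T)
u = 1.0 / (np.exp(a + d / T) + 1.0)                          # Eq. (10)
S_FE = -(u * np.log(u) + (1.0 - u) * np.log(1.0 - u)).sum()  # Eq. (8), one point
h = 1e-5
dFdT = (F(T + h) - F(T - h)) / (2.0 * h)                     # central difference
```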

6 Will-be-set-by-IN-TECH

 ∑ *l*

*�l*

<sup>1</sup> <sup>−</sup> �*Nl*� *νl*

log(1 + *e*−*αk*<sup>−</sup> *<sup>β</sup>dik* ) + *α<sup>k</sup>*

which is similar to fuzzy entropy in (8). As a result, *uik* corresponds to a grain density �*nl*�

In the FCM clustering, note that any data can belong to any cluster, the grand partition

which, from the relationship *F* = −(1/*β*)(log Ξ − *αk∂* log Ξ/*∂αk*), gives the Helmholtz free

In the previous subsection, we have formulated the fuzzy entropy regularized FCM as the DA clustering and showed that its mechanics was no other than the statistics of a particle system (the Fermi-Dirac statistics). The correspondences between fuzzy clustering (FC) and the Fermi-Dirac statistics (FD) are summarized in TABLE 1. The major difference between fuzzy clustering and statistical mechanics is the fact that data are distinguishable and can belong to multiple clusters, though particles which occupy a same energy state are not distinguishable. This causes a summation or a multiplication not only on *i* but on *k* as well in fuzzy clustering. Thus, fuzzy clustering and statistical mechanics described in this paper are

• **Constraints:** (a) Constraint that the sum of all particles *N* is fixed in FD is correspondent with the normalization constraint in FC. Energy level *l* is equivalent to the cluster number

 log 

log(1 + *e*

{�*nl*�log�*nl*� + (1 − �*nl*�)log(1 − �*nl*�)} . (28)

�*Nl*� = *νl*�*nl*�, (29)

<sup>1</sup> <sup>−</sup> �*Nl*� *νl*

(1 + *e*−*αk*−*βdik* ), (31)

}, (30)

. (32)

−*α*−*β�<sup>l</sup>*

*<sup>e</sup>α*+*β�<sup>l</sup>* <sup>+</sup> <sup>1</sup> (27)

) + *αN*

. (26)

 <sup>=</sup> <sup>−</sup> <sup>1</sup> *β*

*E* = ∑ *l*

If states are degenerated to the degree of *νl*, the number of particles which occupy *�<sup>l</sup>* is

log �*Nl*� *νl* + 

Ξ =

*n* ∑ *k*=1

The inverse of *β* represents the system or computational temperature *T*.

and the inverse of *β* in (10) represents the system or computational temperature *T*.

*n* ∏ *k*=1

 *c* ∑ *i*=1

**4.3. Correspondence between Fermi-Dirac statistics and fuzzy clustering**

*c* ∏ *i*=1

*F* = −*kBT*

and we can rewrite the entropy *S* as

function can be written as

not mathematically equivalent.

energy

*<sup>S</sup>* = −*kB* ∑

*l*

*<sup>F</sup>* <sup>=</sup> <sup>−</sup> <sup>1</sup> *β*

Taking that

log Ξ − *α*

into account, the entropy *S* = (*E* − *F*)/*T* has the form

*<sup>S</sup>* = −*kB* ∑

*l*

{ �*Nl*� *νl*

*∂ ∂α* log <sup>Ξ</sup>

> *i*. In addition, the fact that data can belong to multiple clusters leads to the summation on *k*. (b) There is no constraint in FC which corresponds to the constraint that the total energy equals *E* in FD. We have to minimize ∑*<sup>n</sup> <sup>k</sup>*=<sup>1</sup> <sup>∑</sup>*<sup>c</sup> <sup>i</sup>*=<sup>1</sup> *dikuik* in FC.
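As a small numerical illustration (not from the chapter; all values below are made up), the Fermi-Dirac-like memberships and the Helmholtz free energy (32) can be evaluated directly:

```python
import numpy as np

# Sketch with made-up values: memberships of the Fermi-Dirac form and the
# Helmholtz free energy of Eq. (32). d[i, k] is the distance measure between
# cluster i and datum k; alpha[k] and beta (inverse temperature) are multipliers.
d = np.array([[0.5, 2.0, 4.0],
              [4.0, 2.0, 0.5]])          # c = 2 clusters, n = 3 data
alpha = np.array([-1.0, -1.0, -1.0])     # one alpha_k per datum
beta = 1.0

# u_ik = 1 / (exp(alpha_k + beta * d_ik) + 1)
u = 1.0 / (np.exp(alpha[None, :] + beta * d) + 1.0)

# F = -(1/beta) * sum_k { sum_i log(1 + exp(-alpha_k - beta * d_ik)) + alpha_k }
F = -(np.log1p(np.exp(-alpha[None, :] - beta * d)).sum(axis=0) + alpha).sum() / beta
```

Raising *β* (lowering the temperature) drives the memberships toward hard 0/1 values, which is the annealing behavior exploited in Section 5.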


#### **4.4. Meanings of Fermi-Dirac distribution function and fuzzy entropy**

In the entropy function (28) or (30) for the particle system, we can consider the first term to be the entropy of electrons and the second to be that of holes. In this case, the physical limitation that only one particle can occupy an energy level at a time results in an entropy that describes a state in which an electron and a hole exist simultaneously and exchanging them makes no difference. Meanwhile, what correspond to the electron and the hole in fuzzy clustering are the probability of the fuzzy event that a datum belongs to a cluster and the probability of its complementary event, respectively.

<sup>1</sup> In the FCM, however, temperature is determined as a result of clustering.

Fig.1 shows a two-dimensional virtual cluster density distribution model. A lattice can have at most one data. Let *Ml* be the total number of lattices and *ml* be the number of lattices which have a data in it (marked by a black box). Then, the number of available divisions of data to lattices is denoted by

$$\mathcal{W} = \prod\_{l} \frac{M\_{l}!}{m\_{l}!(M\_{l} - m\_{l})!} \tag{33}$$

**Figure 1.** Simple lattice model of clusters. *M*1, *M*2, . . . represent clusters. Black and white boxes represent whether a data exists or not.


which, from *S* = *kB* log *W* (the Gibbs entropy), gives a form similar to (30) [8]. By extremizing *S*, we obtain the most probable distribution like (25). In this case, as there is no distinction between data, only the numbers of black and white lattices constitute the entropy. Fuzzy entropy in (8), on the other hand, gives the amount of information of whether a data belongs to a fuzzy set (or cluster) or not, averaged over independent data **x***k*.

Changing the viewpoint, the stationary entropy value of the particle system can be seen as a requirement for stability against perturbations caused by collisions between particles. In fuzzy clustering, the reconfiguration of data between clusters caused by the movement of cluster centers or the change of cluster shapes corresponds to this stability. Let ⟨*C*⟩ represent the data density of cluster *C*. If data transfer from clusters *Ca* and *Cb* to *Cc* and *Cd* (as magnitudes of the membership function), the transition probability from {..., *Ca*,..., *Cb*,...} to {..., *Cc*,..., *Cd*,...} will be proportional to ⟨*Ca*⟩⟨*Cb*⟩(1 − ⟨*Cc*⟩)(1 − ⟨*Cd*⟩), because a data enters a vacant lattice. Similarly, the transition probability from {..., *Cc*,..., *Cd*,...} to {..., *Ca*,..., *Cb*,...} will be proportional to ⟨*Cc*⟩⟨*Cd*⟩(1 − ⟨*Ca*⟩)(1 − ⟨*Cb*⟩). In the equilibrium state, the transitions balance (this is known as the principle of detailed balance [19]). This requires

$$\frac{\langle \mathcal{C}_{a} \rangle \langle \mathcal{C}_{b} \rangle}{(1 - \langle \mathcal{C}_{a} \rangle)(1 - \langle \mathcal{C}_{b} \rangle)} = \frac{\langle \mathcal{C}_{c} \rangle \langle \mathcal{C}_{d} \rangle}{(1 - \langle \mathcal{C}_{c} \rangle)(1 - \langle \mathcal{C}_{d} \rangle)}. \tag{34}$$

As a result, if the energy *di* is conserved before and after the transition, ⟨*Ci*⟩ must have the form

$$\frac{\langle \mathcal{C}_{i} \rangle}{1 - \langle \mathcal{C}_{i} \rangle} = e^{-\alpha - \beta d_{i}}, \tag{35}$$

or Fermi-Dirac distribution

$$
\langle \mathcal{C}\_i \rangle = \frac{1}{e^{\alpha + \beta d\_i} + 1},
\tag{36}
$$

where *α* and *β* are constants.

Consequently, an entropy of the fuzzy-entropy form arises statistically in a system that allows complementary states. Fuzzy clustering handles each data point itself, while statistical mechanics handles a large number of particles and examines changes of macroscopic physical quantities. It is therefore concluded that fuzzy clustering lies in a limiting case of Fermi-Dirac statistics; in other words, Fermi-Dirac statistics conceptually includes fuzzy clustering.
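The detailed-balance argument above is easy to check numerically. The sketch below (with illustrative constants only) verifies that the Fermi-Dirac form (36) makes the two transition rates in (34) equal whenever the total energy is conserved, i.e. when *da* + *db* = *dc* + *dd*:

```python
import math

# Occupation of the Fermi-Dirac form (36); alpha and beta are arbitrary constants.
def occ(d, alpha=-1.0, beta=0.5):
    return 1.0 / (math.exp(alpha + beta * d) + 1.0)

# One side of the detailed-balance condition (34):
# <C_a><C_b> / ((1 - <C_a>)(1 - <C_b>)) = exp(-2*alpha - beta*(d_a + d_b))
def rate(da, db):
    return occ(da) * occ(db) / ((1.0 - occ(da)) * (1.0 - occ(db)))

# d_a + d_b = d_c + d_d  (1 + 3 = 2 + 2), so the two rates must coincide.
lhs, rhs = rate(1.0, 3.0), rate(2.0, 2.0)
```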

#### **4.5. Tsallis entropy based FCM statistics**

On the other hand, *U*˜ and *S*˜ satisfy

$$
\tilde{S} - \beta \tilde{U} = \sum\_{k=1}^{n} \frac{\tilde{Z}^{1-q} - 1}{1 - q} \, \tag{37}
$$

which leads to

$$\frac{\partial \tilde{S}}{\partial \tilde{U}} = \beta. \tag{38}$$


Equation (38) makes it possible to regard *β*−<sup>1</sup> as an artificial system temperature *T* [19]. Then, the free energy can be defined as

$$\tilde{F} = \tilde{U} - T\tilde{S} = -\frac{1}{\beta} \sum\_{k=1}^{n} \frac{\tilde{Z}^{1-q} - 1}{1 - q}. \tag{39}$$

*U*˜ can be derived from *F*˜ as

$$\tilde{U} = -\frac{\partial}{\partial \beta} \sum_{k=1}^{n} \frac{\tilde{Z}^{1-q} - 1}{1 - q}. \tag{40}$$

*∂F*˜/*∂*v˜*<sup>i</sup>* = 0 also gives


$$
\tilde{\mathbf{v}}_{i} = \frac{\sum_{k=1}^{n} \tilde{u}_{ik}^{q} \mathbf{x}_{k}}{\sum_{k=1}^{n} \tilde{u}_{ik}^{q}}. \tag{41}
$$
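A minimal sketch of the center update (41): each center is the mean of the data weighted by the *q*-th power of the memberships (the value of *q* and the memberships below are made up):

```python
import numpy as np

# Tsallis-type center update of Eq. (41): centers are means of the data
# weighted by u_ik^q.
def update_centers(u, X, q=1.5):
    w = u ** q                                   # (n, c) weights u_ik^q
    return (w.T @ X) / w.sum(axis=0)[:, None]    # (c, dim) centers

X = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]])
u = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])   # crisp memberships for clarity
V = update_centers(u, X)
```

With crisp memberships the update reduces to ordinary cluster means, which makes the formula easy to verify by hand.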

### **5. Deterministic annealing**

The DA method is a deterministic variant of SA. DA characterizes the minimization problem of the cost function as the minimization of the Helmholtz free energy, which depends on the temperature, and tracks the minimum of the free energy while decreasing the temperature; thus it can deterministically optimize the cost function at each temperature.

According to the principle of minimal free energy in statistical mechanics, the minimum of the Helmholtz free energy determines the distribution at thermal equilibrium [19]. Thus, formulating the DA clustering as a minimization of (32) leads to *∂F*/*∂***v***<sup>i</sup>* = 0 at the current temperature, and gives (10) and (11) again. Desirable cluster centers are obtained by calculating (10) and (11) repeatedly.

In this chapter, we focus on the application of DA to the Fermi-Dirac-like distribution function described in Section 4.2.

#### **5.1. Linear approximation of Fermi-Dirac distribution function**

The Fermi-Dirac distribution function can be approximated by linear functions. That is, as shown in Fig.2, the Fermi-Dirac distribution function of the form

**Figure 2.** The Fermi-Dirac distribution function *f*(*x*) and its linear approximation functions *g*(*x*).

**Figure 3.** Reduction of the extent of the Fermi-Dirac distribution function from *x* to *xnew* as the temperature decreases from *T* to *Tnew*.

$$f(x) = \frac{1}{e^{\alpha + \beta x^2} + 1} \tag{42}$$


is approximated by the linear functions

$$g(x) = \begin{cases} 1.0 & \left(x \le \frac{-\alpha - 1}{\kappa}\right) \\ -\frac{\kappa}{2}x - \frac{\alpha}{2} + \frac{1}{2} & \left(\frac{-\alpha - 1}{\kappa} \le x \le \frac{-\alpha + 1}{\kappa}\right), \\ 0.0 & \left(\frac{-\alpha + 1}{\kappa} \le x\right) \end{cases} \tag{43}$$

where $\kappa = \sqrt{-\alpha\beta}$. *g*(*x*) satisfies $g(\sqrt{-\alpha/\beta}) = 0.5$, and requires *α* to be negative.

In Fig.3, Δ*x* = *x* − *xnew* denotes the reduction of the extent of the distribution as the temperature decreases from *T* to *Tnew* (*T* > *Tnew*). The extent of the distribution also narrows with increasing *α*. The *αnew* (*α* < *αnew*) which satisfies $g^{-1}(0.5)\big|_{\alpha} - g^{-1}(0.5)\big|_{\alpha_{new}} = \Delta x$ is obtained as

$$\alpha_{new} = -\left\{\sqrt{-\alpha} + \sqrt{-\alpha\beta_{new}} \left(\frac{1}{\sqrt{\beta}} - \frac{1}{\sqrt{\beta_{new}}}\right)\right\}^2, \tag{44}$$

where *β* = 1/*T* and *βnew* = 1/*Tnew*. Thus, taking *T* to be the temperature at which the previous DA step was executed and *Tnew* to be the next temperature, the covariance of *αk*'s distribution is defined as

$$
\Delta \alpha = \alpha_{new} - \alpha. \tag{45}
$$
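The approximation (43) is easy to sanity-check numerically. A sketch with arbitrary constants *α* = −4 and *β* = 1, so that *κ* = √(−*αβ*) = 2:

```python
import math

alpha, beta = -4.0, 1.0
kappa = math.sqrt(-alpha * beta)           # kappa = sqrt(-alpha * beta)

def f(x):                                  # Fermi-Dirac-like function (42)
    return 1.0 / (math.exp(alpha + beta * x * x) + 1.0)

def g(x):                                  # piecewise-linear approximation (43)
    if x <= (-alpha - 1.0) / kappa:
        return 1.0
    if x >= (-alpha + 1.0) / kappa:
        return 0.0
    return -kappa / 2.0 * x - alpha / 2.0 + 0.5

# Both functions cross 0.5 at x = sqrt(-alpha/beta), as stated after Eq. (43).
x0 = math.sqrt(-alpha / beta)
```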

### **5.2. Initial estimation of** *α<sup>k</sup>* **and annealing temperature**

Before executing DA, it is very important to estimate the initial values of *αk*s and the initial annealing temperature in advance.

From Fig.2, distances between a data point and the cluster centers are averaged as

$$L_k = \frac{1}{c} \sum_{i=1}^{c} \|\mathbf{x}_k - \mathbf{v}_i\|, \tag{46}$$

and this gives


$$
\alpha_k = -\beta (L_k)^2. \tag{47}
$$

With given initial clusters distributing wide enough, (47) overestimates *αk*, so that *α<sup>k</sup>* needs to be adjusted by decreasing its value gradually.

Still more, Fig.2 gives the width of the Fermi-Dirac distribution function as $2(-\alpha + 1)/\sqrt{-\alpha\beta}$, which must be equal to or smaller than the width of the data distribution (= 2*R*). This condition leads to

$$2\,\frac{-\alpha + 1}{\sqrt{-\alpha\beta}} = 2R. \tag{48}$$

As a result, the initial value of *β* or the initial annealing temperature is roughly determined as

$$
\beta \simeq \frac{4}{R^2} \quad \left( T \simeq \frac{R^2}{4} \right). \tag{49}
$$
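The initial estimates of this subsection can be sketched on synthetic data (all sizes and the radius *R* below are assumed values):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-350.0, 350.0, size=(100, 2))   # data, assumed radius R = 350
V = rng.uniform(-350.0, 350.0, size=(3, 2))     # c = 3 random initial centers

R = 350.0
beta0 = 4.0 / R**2                               # initial inverse temperature, (49)
# L_k: distance from datum k averaged over the c cluster centers, (46)
L = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2).mean(axis=1)
alpha0 = -beta0 * L**2                           # (47); to be decreased gradually
```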

#### **5.3. Deterministic annealing algorithm**

The DA algorithm for fuzzy clustering is given as follows:

**1** *Initialize*: Set a rate *Trate* at which the temperature is lowered and a threshold *δ*0 for the convergence test. Calculate an initial temperature *Thigh*(= 1/*βlow*) by (49) and set the current temperature *T* = *Thigh* (*β* = *βlow*). Place *c* clusters randomly, estimate initial *αk*s by (47), and adjust them to satisfy the normalization constraint (2).

**2** Calculate *uik* by (12).

**3** Calculate **v***i* by (11).

**4** Compare the current objective value $J_{m=1} = \sum_{k=1}^{n}\sum_{i=1}^{c} d_{ik}u_{ik}$ with that obtained at the previous iteration, $\hat{J}$. If $|J_{m=1} - \hat{J}|/J_{m=1} < \delta_0 \cdot T/T_{high}$ is satisfied, then return. Otherwise decrease the temperature as *T* = *T* ∗ *Trate*, and go back to **2**.
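The DA procedure of this subsection can be sketched as a single loop. This is an illustrative re-implementation under simplifying assumptions (squared Euclidean *dik*, the normalized Fermi-Dirac-like form for the memberships, and *αk* re-estimated by (47) at every temperature), not the authors' code:

```python
import numpy as np

def da_clustering(X, c, T_rate=0.8, delta0=0.5, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), size=c, replace=False)]   # step 1: random centers
    R = 0.5 * np.ptp(X, axis=0).max()                  # rough radius of the data
    beta = 4.0 / R**2                                  # initial temperature by (49)
    T_high = 1.0 / beta
    J_prev = np.inf
    for _ in range(max_iter):
        d = ((X[:, None, :] - V[None, :, :]) ** 2).sum(-1)   # squared distances
        alpha = -beta * np.sqrt(d).mean(axis=1) ** 2         # (46)-(47)
        z = np.clip(alpha[:, None] + beta * d, -50.0, 50.0)  # avoid exp overflow
        u = 1.0 / (np.exp(z) + 1.0)                          # step 2: memberships
        u /= u.sum(axis=1, keepdims=True)                    # normalization (2)
        V = (u.T @ X) / u.sum(axis=0)[:, None]               # step 3: centers (11)
        J = (d * u).sum()                                    # step 4: objective
        if np.isfinite(J_prev) and abs(J - J_prev) / J < delta0 * (1.0 / beta) / T_high:
            break
        J_prev = J
        beta /= T_rate                                       # T = T * T_rate
    return u, V
```

The temperature-dependent stopping threshold mirrors step 4: at high temperature the test is loose, and it tightens as *T* falls.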


## **6. Combinatorial algorithm of deterministic and simulated annealing**

#### **6.1. Simulated annealing**

The cost function for SA is

$$E(\alpha_k) = J_{m=1} + S_{FE} + K \sum_{k=1}^{n} \left( \sum_{i=1}^{c} u_{ik} - 1 \right)^2, \tag{50}$$


where *K* is a constant.

In order to optimize each *αk* by SA, its neighbor $\alpha_k^{new}$ (a displacement from the current *αk*) is generated by assuming a normal distribution with mean 0 and covariance Δ*αk* defined in (45).

The SA's initial temperature *T*0(= 1/*β*0) is determined so that the acceptance probability becomes

$$\exp\left[-\beta_0 \{E(\alpha_k) - E(\alpha_k^{new})\}\right] = 0.5 \qquad \left(E(\alpha_k) - E(\alpha_k^{new}) \ge 0\right). \tag{51}$$

By selecting the *αk*s to be optimized from outside a transition region in which the membership function changes from 0 to 1, the computational time of SA can be shortened. The boundary of the transition region is easily obtained from the linear approximation of the Fermi-Dirac-like membership function: from Fig.2, the data whose distances from each cluster center are larger than $\sqrt{-\alpha_k/\beta}$ are selected.

#### **6.2. Simulated annealing algorithm**

The SA algorithm is stated as follows:

**1** *Initialize*: Calculate an initial temperature *T*0(= 1/*β*0) from (51). Set the current temperature *T* to *T*0 and an iteration count *t* to 1. Calculate a covariance Δ*αk* for each *αk* by (45).

**2** Select data to be optimized, if necessary.

**3** Calculate neighbors of the current *αk*s.

**4** Apply the Metropolis algorithm to the selected *αk*s using (50) as the objective function.

**5** If *max* < *t* is satisfied, then return. Otherwise decrease the temperature as *T* = *T*0/log(*t* + 1), increment *t*, and go back to **2**.
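The Metropolis core of this SA procedure can be sketched on a toy one-dimensional cost (the cost function and all constants below are made up; in DASA the cost is (50) and each *αk* is perturbed with the covariance Δ*αk* from (45)):

```python
import math
import random

def sa_optimize(alpha0, cost, delta_alpha, T0, steps=500, seed=0):
    rng = random.Random(seed)
    alpha, e = alpha0, cost(alpha0)
    best_alpha, best_e = alpha, e
    for t in range(1, steps + 1):
        T = T0 / math.log(t + 1.0)                   # cooling: T = T0 / log(t + 1)
        cand = alpha + rng.gauss(0.0, delta_alpha)   # neighbor (cf. (45))
        e_new = cost(cand)
        # Metropolis acceptance: always downhill, uphill with prob exp(-dE / T)
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / T):
            alpha, e = cand, e_new
        if e < best_e:
            best_alpha, best_e = alpha, e
    return best_alpha, best_e
```

For example, `sa_optimize(0.0, lambda a: (a + 3.0) ** 2, 0.5, 1.0)` random-walks toward the minimum near −3 while the logarithmic schedule gradually suppresses uphill moves.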


#### **6.3. Combinatorial algorithm of deterministic and simulated annealing**

The DA and SA algorithms are combined as follows:

**1** *Initialize*: Set a threshold *δ*1 for the convergence test and an iteration count *l* to 1. Set the maximum iteration counts *max*0, *max*1, and *max*2.

**2** Execute the DA algorithm.

**3** Set *max* = *max*0, and execute the SA algorithm.

**4** Compare the difference between the current objective value *e* and that obtained at the previous iteration, *ê*. If |*e* − *ê*|/*e* < *δ*1 or *max*2 < *l* is satisfied, then go to **5**. Otherwise increment *l*, and go back to **2**.

**5** Set *max* = *max*1, execute the SA algorithm a final time, and then stop.


### **7. Experiments 1**


**Figure 4.** Experimental result 1. (Fuzzy clustering result using DASA. Big circles indicate centers of clusters.)

To demonstrate effectiveness of the proposed algorithm, numerical experiments were carried out. DASA's results were compared with those of DA (single DA).

We set *δ*0 = 0.5, *δ*1 = 0.01, *Trate* = 0.8, *max*0 = 500, *max*1 = 20000, and *max*2 = 10. We also set *R* in (48) to 350.0 for experimental data 1∼3, and to 250.0 for experimental data 4.<sup>2</sup>

In experiment 1, 11,479 data points were generated as ten equally sized normal distributions. Fig.4 shows a fuzzy clustering result by DASA. Single DA similarly clusters these data.

In experiment 2-1, three differently sized normal distributions consisting of 2,249 data points (Fig.5-1) were used. Fig.5-1(0) shows the initial clusters obtained by the initial estimation of the *αk*s and the annealing temperature. Fig.5-1(1)∼(6a) shows the fuzzy clustering process of DASA. At the high temperature in Fig.5-1(1), as described in 4.3, the membership functions were widely distributed and the clusters to which a data belongs were fuzzy. However, with decreasing temperature (from Fig.5-1(2) to Fig.5-1(5)), the distribution became less and less fuzzy. After executing DA and SA alternately, the clusters in Fig.5-1(6a) were obtained. Then, the data to be optimized by SA were selected by the criterion stated in the section 4, and SA was executed. The final result of DASA in Fig.5-1(6b) shows that the data were desirably clustered. On the contrary, because of the randomness of the initial cluster positions and the difficulty of estimating good initial *αk*s, single DA becomes unstable, and sometimes gives satisfactory results as shown in Fig.5-1(6c) and sometimes does not, as shown in Fig.5-1(6d). By comparing Fig.5-1(6b) to (6c), it is found that, due to the optimization of the *αk*s by SA, the resultant cluster shapes of DASA are far less smooth than those of single DA.

<sup>2</sup> These parameters have not been optimized particularly for the experimental data.

<sup>3</sup> No close case was observed in this experiment.

**Figure 5-1.** Experimental result 2-1. (Fuzzy clustering result by DASA and single DA. "Experimental Data" are given data distributions. "Selected Data" are data selected for final SA by the selection rule. (1)∼(6a) and (6b) are results using DASA. (6c) and (6d) are results using single DA (success and failure, respectively). Data plotted on the xy plane show the cross sections of *uik* at 0.2 and 0.8.)

Changes of the costs of DASA (*Jm*=1 + *SFE* for the DA stage and (50) for the SA stage, with *K* set to 1 × 10<sup>15</sup> in (50)) are plotted as a function of iteration in Fig.5-2; both costs decrease with increasing iteration. In this experiment, the total number of iterations of the SA stage was about 12,500, while that of the DA stage was only 7. Accordingly, most of DASA's simulation time was consumed in the SA stage.

**Figure 5-2.** Experimental result 2-1. (Change of the cost of DASA as a function of iteration: $J_{m=1} + S_{FE}$ for the DA stage and $J_{m=1} + S_{FE} + K \sum_{k=1}^{n} (\sum_{i=1}^{c} u_{ik} - 1)^2$ for the SA stage, respectively.)

In experiment 2-2, in order to examine the effectiveness of the SA stage introduced in DASA, experiment 2 was re-conducted ten times as in Table 2, where *ratio*, listed in the first row, is the ratio of data optimized at the SA stage. "UP" means increasing *ratio* as 1.0 − 1.0/*t*, where *t* is the number of executions of the SA stage; "DOWN" means decreasing *ratio* as 1.0/*t*. Results are judged "Success" or "Failure" from a human viewpoint.<sup>3</sup> From Table 2, it is concluded that DASA always clusters the data properly if *ratio* is large enough (0.6 < *ratio*), whereas, as listed in the last column, single DA succeeds only 50% of the time.


**Table 2.** Experimental result 2-2. (Comparison of numbers of successes and failures of fuzzy clustering using DASA for *ratio* = 0.3, 0.6, 1.0, 1.0 − 1.0/*t* (UP), 1.0/*t* (DOWN) and single DA. *t* is the number of executions of the SA stage.)
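The *ratio* schedules compared in Table 2 can be written compactly as follows (a trivial sketch; the function name is ours):

```python
def ratio_schedule(t, mode):
    """Fraction of data optimized at the t-th SA stage (t = 1, 2, ...)."""
    if mode == "UP":        # grows toward 1: 0.0, 0.5, 0.667, ...
        return 1.0 - 1.0 / t
    if mode == "DOWN":      # shrinks toward 0: 1.0, 0.5, 0.333, ...
        return 1.0 / t
    return float(mode)      # fixed ratio, e.g. 0.3, 0.6, 1.0
```

The "UP" schedule defers most SA work to late stages, when the DA stage has already roughly located the clusters, which is consistent with its better success rate in Table 2.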

In experiments 3 and 4, two elliptic distributions consisting of 2,024 data points and two horseshoe-shaped distributions consisting of 1,380 data points were used, respectively. Figs.6 and 7 show DASA's clustering results; DASA clusters these data properly. In experiment 3, the success rate of DASA is 90%, while that of single DA is 50%. In experiment 4, the success rate of DASA is 80%, while that of single DA is 40%.



*(Panels of Fig.5-1: (0) initial distribution; (1) *β* = 3.3 × 10<sup>-5</sup>; (2) *β* = 4.1 × 10<sup>-5</sup>; (3) *β* = 5.1 × 10<sup>-5</sup>; (4) *β* = 6.4 × 10<sup>-5</sup>; (5) *β* = 8.0 × 10<sup>-5</sup>; (6a)–(6d) *β* = 1.3 × 10<sup>-4</sup>.)*

<sup>3</sup> No close case was observed in this experiment.

**Figure 6.** Experimental result 3. (Fuzzy clustering result of elliptic distributions using DASA. Data plotted on the xy plane show the cross sections of *uik* at 0.2 and 0.8.)



**Figure 7.** Experimental result 4. (Fuzzy clustering result of horseshoe-shaped distributions using DASA. Data plotted on the xy plane show the cross sections of *uik* at 0.2 and 0.8.)

These experimental results demonstrate the advantage of DASA over single DA. Nevertheless, DASA suffers from two disadvantages. First, it takes so long to execute SA repeatedly that, instead of (10), it might be better to use linear approximations of the membership functions. Second, since the *αk*s differ from each other, it is difficult to interpolate them.

## **8. Experiments 2**

#### **8.1. Interpolation of membership function**

DASA suffers from a few disadvantages. First, it is not necessarily easy to interpolate *α<sub>k</sub>* or *u<sub>ik</sub>*, since they differ from each other. Second, it takes a long time to execute SA repeatedly.

A simple solution for the first problem is to interpolate membership functions. Thus, the following step was added to the DASA algorithm.

**6.** When a new datum is given, neighboring membership functions are interpolated at its position.
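Step 6 is stated abstractly; as one concrete (hypothetical) realization for memberships stored on a rectangular grid, as in the experiment below, a new point's membership can be interpolated linearly from its four nearest grid neighbors. All names here are our illustration, not the chapter's code:

```python
import numpy as np

def interpolate_membership(u_grid, xs, ys, x, y):
    """Bilinearly interpolate a membership value u(x, y) from the four
    nearest neighbors on a rectangular grid.

    u_grid : (len(ys), len(xs)) array of memberships u_ik on the grid
    xs, ys : sorted 1-D arrays of grid coordinates
    """
    # locate the grid cell containing (x, y)
    i = int(np.clip(np.searchsorted(xs, x) - 1, 0, len(xs) - 2))
    j = int(np.clip(np.searchsorted(ys, y) - 1, 0, len(ys) - 2))
    # fractional position inside the cell
    tx = (x - xs[i]) / (xs[i + 1] - xs[i])
    ty = (y - ys[j]) / (ys[j + 1] - ys[j])
    # weighted combination of the four corner memberships
    return ((1 - tx) * (1 - ty) * u_grid[j, i]
            + tx * (1 - ty) * u_grid[j, i + 1]
            + (1 - tx) * ty * u_grid[j + 1, i]
            + tx * ty * u_grid[j + 1, i + 1])
```

This is the "linear" baseline compared in the experiment; smoother schemes (bicubic, spline) replace the corner weighting, while fractal interpolation replaces the whole model.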


**Figure 8.** Experimental data and membership functions obtained by DASA. (Data plotted on the xy plane show the cross sections of *uik* at 0.2 and 0.8.)


To examine the effectiveness of interpolation, the proposed algorithm was applied to the experimental data shown in Fig.8(a). For simplicity, the data were placed on rectangular grids on the xy plane.
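The fractal interpolation used later in this experiment relies on a box-counting estimate of the fractal dimension [25]. A minimal sketch of such an estimator for a 2-D point set follows (our illustration under standard box-counting assumptions, not the chapter's code):

```python
import numpy as np

def box_counting_dimension(points, scales):
    """Estimate the fractal dimension of a 2-D point set by box counting:
    count occupied boxes N(s) at several grid resolutions s and fit the
    slope of log N(s) versus log s."""
    points = np.asarray(points, dtype=float)
    mins = points.min(axis=0)
    span = (points - mins).max() + 1e-12    # normalize to a unit box
    counts = []
    for s in scales:                         # s = number of boxes per side
        idx = np.floor((points - mins) / span * s).astype(int)
        idx = np.clip(idx, 0, s - 1)
        counts.append(len({tuple(i) for i in idx}))
    # slope of log N(s) vs log s approximates the dimension
    slope, _ = np.polyfit(np.log(scales), np.log(counts), 1)
    return slope
```

A smooth curve yields a slope near 1 and a space-filling set near 2; the rough membership surfaces in Fig.8(c) fall in between, which is what makes a fractal model appropriate.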

First, some regional data were randomly selected from the given data. The initial and final membership functions obtained by DASA are shown in Figs.8(b) and (c), respectively.

After that, the remaining data in the region were used as test data; at each data point, memberships were interpolated from the four nearest neighboring membership values. Linear, bicubic and fractal interpolation methods were compared.

The prediction error of linear interpolation was 6.8%, which is not accurate enough. Bicubic interpolation [18] also gave a poor result, because it depends on good estimates of the gradients at neighboring points. Accordingly, in this case, fractal interpolation [17] is more suitable than smooth interpolation methods such as bicubic or spline interpolation, because the membership functions in Fig.8(c) are very rough.

The well-known InterpolatedFM (Fractal motion via interpolation and variable scaling) algorithm [17] was used in this experiment. The fractal dimension was estimated by the standard box-counting method [25]. Figs.9(a) and (b) show both the membership functions and their interpolated values. The prediction error of fractal interpolation (averaged over 10 trials) was 2.2%, a slightly better result.

**Figure 9.** Plotted lines show the membership functions obtained by DASA. The functions are interpolated by the InterpolatedFM algorithm. Crosses show the interpolated data.

## **9. Conclusion**

In this article, by combining the deterministic and simulated annealing methods, we proposed a new statistical mechanical fuzzy c-means clustering algorithm (DASA). Numerical experiments showed the effectiveness and the stability of DASA.

However, as stated at the end of **Experiments**, DASA has problems to be considered. In addition, a major problem of the fuzzy c-means methodologies is that they do not determine the number of clusters by themselves. Thus, a method such as [28], which can determine the number of clusters automatically, should be combined with DASA.

Another problem is that it is difficult to interpolate the membership functions obtained by DASA, since their values differ considerably. Accordingly, the fractal interpolation method (the InterpolatedFM algorithm) was introduced into DASA and its effectiveness was examined.

Our future works include further experiments on and examinations of the properties of DASA, a comparison of interpolation methods (linear, bicubic, spline, fractal and so on), interpolation of higher-dimensional data, an adjustment of DASA's parameters, its annealing scheduling problem, and applications to fuzzy modeling [29].

## **Author details**

Makoto Yasuda

*Gifu National College of Technology, Japan*

## **10. References**

- [1] E. Aarts and J. Korst, "Simulated Annealing and Boltzmann Machines", Chichester: John Wiley & Sons, 1989.
- [2] J.C. Bezdek, "Pattern Recognition with Fuzzy Objective Function Algorithms", New York: Plenum Press, 1981.
- [3] B.P. Buckles and F.E. Petry, "Information-theoretical characterization of fuzzy relational databases", IEEE Trans. Systems, Man and Cybernetics, vol.13, no.1, pp.74-77, 1983.
- [4] J. Buhmann and H. Kühnel, "Vector quantization with complexity costs", IEEE Trans. Information Theory, vol.39, no.4, pp.1133-1143, 1993.
- [5] A. Corana, M. Marchesi, C. Martini, and S. Ridella, "Minimizing multimodal functions of continuous variables with the simulated annealing algorithm", ACM Trans. on Mathematical Software, vol.13, no.3, pp.262-280, 1987.
- [6] A. DeLuca and S. Termini, "A definition of a nonprobabilistic entropy in the setting of fuzzy sets theory", Information and Control, vol.20, pp.301-312, 1972.
- [7] A.P. Dempster, N.M. Laird, and D.B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm", Journal of Royal Stat. Soc., Series B, vol.39, pp.1-38, 1977.
- [8] W. Greiner, L. Neise, and H. Stöcker, "Thermodynamics and Statistical Mechanics", New York: Springer-Verlag, 1995.
- [9] T. Hofmann and J. Buhmann, "Pairwise data clustering by deterministic annealing", IEEE Trans. Pattern Analysis and Machine Intelligence, vol.19, pp.1-14, 1997.
- [10] S. Kirkpatrick, C.D. Gelatt, and M.P. Vecchi, "Optimization by simulated annealing", Science, vol.220, pp.671-680, 1983.
- [11] R.-P. Li and M. Mukaidono, "A maximum entropy approach to fuzzy clustering", Proc. of the 4th IEEE Int. Conf. Fuzzy Systems (FUZZ-IEEE/IFES'95), pp.2227-2232, 1995.
- [12] M. Menard, V. Courboulay, and P. Dardignac, "Possibilistic and probabilistic fuzzy clustering: unification within the framework of the non-extensive thermostatistics", Pattern Recognition, vol.36, pp.1325-1342, 2003.
- [13] D. Miller, A.V. Rao, K. Rose, and A. Gersho, "A global optimization technique for statistical classifier design", IEEE Trans. Signal Processing, vol.44, pp.3108-3122, 1996.

- [14] S. Miyamoto and M. Mukaidono, "Fuzzy c-means as a regularization and maximum entropy approach", Proc. of the 7th Int. Fuzzy Systems Association World Congress, vol.II, pp.86-92, 1997.
- [15] N.R. Pal and J.C. Bezdek, "Measuring fuzzy uncertainty", IEEE Trans. Fuzzy Systems, vol.2, no.2, pp.107-118, 1994.
- [16] N.R. Pal, "On quantification of different facets of uncertainty", Fuzzy Sets and Systems, vol.107, pp.81-91, 1999.
- [17] H.-O. Peitgen et al., *The Science of Fractal Images*, Springer-Verlag, 1988.
- [18] W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery, *Numerical Recipes in C++*, Cambridge University Press, 2002.
- [19] L.E. Reichl, *A Modern Course in Statistical Physics*, New York: John Wiley & Sons, 1998.
- [20] K. Rose, E. Gurewitz, and G.C. Fox, "A deterministic annealing approach to clustering", Pattern Recognition Letters, vol.11, no.9, pp.589-594, 1990.
- [21] K. Rose, E. Gurewitz, and G.C. Fox, "Constrained clustering as an optimization method", IEEE Trans. Pattern Analysis and Machine Intelligence, vol.15, no.8, pp.785-794, 1993.
- [22] P. Siarry, "Enhanced simulated annealing for globally minimizing functions of many-continuous variables", ACM Trans. on Mathematical Software, vol.23, no.2, pp.209-228, 1997.
- [23] D. Tran and M. Wagner, "Fuzzy entropy clustering", Proc. of the 9th IEEE Int. Conf. Fuzzy Systems (FUZZ-IEEE2000), vol.1, pp.152-157, 2000.
- [24] C. Tsallis, "Possible generalization of Boltzmann-Gibbs statistics", Journal of Statistical Physics, vol.52, pp.479-487, 1988.
- [25] R. Voss, "Random fractals: characterization and measurement", Plenum Press, 1986.
- [26] P.R. Wang, "Continuous optimization by a variant of simulated annealing", Computational Optimization and Applications, vol.6, pp.59-71, 1996.
- [27] M. Yasuda, T. Furuhashi, and S. Okuma, "Statistical mechanical analysis of fuzzy clustering based on fuzzy entropy", IEICE Trans. Information and Systems, Vol.E90-D, No.6, pp.883-888, 2007.
- [28] M. Yasuda and T. Furuhashi, "Fuzzy entropy based fuzzy c-means clustering with deterministic and simulated annealing methods", IEICE Trans. Information and Systems, Vol.E92-D, No.6, pp.1232-1239, 2009.
- [29] M. Yasuda and T. Furuhashi, "Statistical mechanical fuzzy c-means clustering with deterministic and simulated annealing methods", Proc. of the Joint 3rd Int. Conf. on Soft Computing and Intelligent Systems, in CD-ROM, 2006.
- [30] M. Yasuda, "Entropy based annealing approach to fuzzy c-means clustering and its interpolation", Proc. of the 8th Int. Conf. on Fuzzy Systems and Knowledge Discovery, pp.424-428, 2011.
- [31] S.D. Zenzo and L. Cinque, "Image thresholding using fuzzy entropies", IEEE Trans. Systems, Man and Cybernetics-Part B, vol.28, no.1, pp.15-23, 1998.

