Clustering and Machine Learning Methods

#### **Chapter 1**

## A New Functional Clustering Method with Combined Dissimilarity Sources and Graphical Interpretation

*Wenlin Dai, Stavros Athanasiadis and Tomáš Mrkvička*

## **Abstract**

Clustering is an essential task in functional data analysis. In this study, we propose a framework for clustering procedures based on functional rankings or depth. Our methods naturally combine various types of between-cluster variation equally, which caters to the various discriminative sources of functional data; for example, they combine raw data with transformed data, or the components of multivariate functional data with their covariance. Our methods also enhance the clustering results with a visualization tool that allows intrinsic graphical interpretation. Finally, our methods are model-free and nonparametric and hence are robust to heavy-tailed distributions and potential outliers. The implementation and performance of the proposed methods are illustrated with a simulation study and three real-world applications.

**Keywords:** depth, insurance, intrinsic graphical interpretation, robustness, statistical rankings

#### **1. Introduction**

Cluster analysis is a critical step in exploratory data analysis intended to identify homogeneous subgroups among observations. Cluster analysis is also widely used for functional data in tasks such as the classification of electrocardiogram curves in the diagnosis of cardiovascular ischemic diseases [1] and the extraction of representative wind behavior [2, 3]. The various functional data clustering methods described in the literature can generally be categorized into two subgroups: distance-based methods and filtering-based methods.

The distance-based methods involve the construction of a distance matrix with a specific metric; the clustering results may be derived with hierarchical or centroid-based clustering tools [3, 4]. The filtering-based methods involve the approximation of the curves with linear combinations of finite basis functions, such as splines and functional principal components, and the cluster analysis is conducted based on the coefficients or scores of finite dimensions [5–7]. The focus of this study is on distance-based methods. In this paper, we propose a new family of clustering algorithms based on a chosen functional ordering. The dissimilarity matrix is constructed via the chosen functional ordering, which is applied to the set of

differences of all pairs of the functional data under investigation. Various functional orderings can be chosen, but we concentrate on orderings with intrinsic graphical interpretation; nevertheless, any ordering that treats the sources equally can be used, including the modified band depth [8] and the simplicial band depth [9]. The choice of a functional ordering with intrinsic graphical interpretation allows us to show the resulting clusters and a central region that attains a natural interpretation: all functions contained in the central region do not leave the plot of the central region, and all functions not contained in the central region leave the plot of the central region at at least one point. It has to be mentioned that the classical functional orderings mentioned above do not satisfy this natural condition, and therefore we concentrate on the functional orderings defined in [10].

Functional data differ in various ways, such as in magnitude, shape, phase, and dependence structure, and hence they are difficult to analyze when clusters exist from multiple perspectives. The existing methods either focus on a single type of variation or pool the various sources of variation with weightings that rely on a delicate selection procedure. Without balancing, the clustering results could be dominated by the component with the greatest absolute variation. In order to achieve some balancing between the various sources, it is possible to standardize the curves before applying existing methods, such as *k*-means or model-based methods. By "standardization", we mean that the marginal empirical distributions are standardized so that they have zero mean and unit variance. This approach is used in the simulation study in order to compare the performance of existing methods with the proposed methods.

Since the proposed procedure applies functional ordering, such that every part of the function is treated equally, the different sources of variation are combined in an equal manner. For univariate cases, it may combine the raw curves and the derivatives equally to measure the magnitude and shape variation simultaneously. For multivariate cases, it may combine the marginal curves and the covariance functions equally to account for both marginal and joint variation among curves. Furthermore, the proposed method provides a reasonable graphical interpretation of the clustering result. Finally, it inherits the robustness of functional orderings and can stably recover the clusters when abnormal observations contaminate the data.

The remainder of this paper is organized as follows. In Section 2, we define the newly proposed procedure with an arbitrary functional ordering. Further, we review several functional orderings already defined in [10] which satisfy the intrinsic graphical interpretation. Finally, we study the metric properties of the derived dissimilarity. In Section 3, we describe the simulation studies we conducted to assess the performance of the proposed methods and compare them with some existing competitors in cases where the combination of various sources is of interest. In Sections 4–6, we demonstrate the effectiveness of our method with three real-world examples. The proposed methods will be available soon in the R package GET.

#### **2. Description of methods**

#### **2.1 Dissimilarity matrix**

Assume that the functions $f_i(x)$, $i = 1, \ldots, s$, are observed at a fixed set of points $x_1, \ldots, x_d$, so that the functions can be represented as $d$-dimensional vectors $\mathbf{T}_i$, $i = 1, \ldots, s$. If the functions of interest are not observed at the same set of points, a nonparametric smoothing method can be applied to address the situation.

To induce a dissimilarity measure from a functional ordering, we construct the set of functional differences:


$$D_f = \left\{ df_{ii'} = f_i(x) - f_{i'}(x), \quad i, i' = 1, \ldots, s \right\}.$$

We remark here that $df \equiv 0$ is an element of $D_f$. We then apply a functional ordering to $D_f$ and obtain the induced measure of centrality of $df_{ii'} = f_i(x) - f_{i'}(x)$ as $M_{ii'}$. Finally, the dissimilarity between $f_i(x)$ and $f_{i'}(x)$ is defined as $d_{ii'} = 1 - M_{ii'}$, and this forms the dissimilarity matrix of $\{f_i\}_{i=1}^{s}$. Such an ordering can take the form of any functional depth notion or ranking in the literature, such as the band depth and modified band depth [8], the simplicial band depth [9], the spatial functional depth [11], or the curve depth [12]. These notions naturally give equal treatment to the variation at each design point, in contrast with norm-based methods such as the $L^1$ or $L^2$ distance.

After a dissimilarity matrix is established, the partitioning around medoids procedure can be used to determine the given number of clusters. This produces a family of clustering algorithms that depends on the choice of the functional ordering.
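A minimal sketch of this pipeline in R is given below. For concreteness it uses the studentized maximum measure of Section 2.3.4, for which the dissimilarity is $d_{ii'} = S_{ii'}$ itself (Theorem 1.2), together with `pam()` from the `cluster` package; the matrix `curves` and the helper names are illustrative assumptions and not the GET package implementation.

```r
# Minimal sketch (R): dissimilarity from a functional ordering + PAM clustering.
# `curves` is an s x d matrix with one discretized function per row.
library(cluster)  # for pam()

studentized_dissimilarity <- function(curves) {
  s <- nrow(curves)
  # Set of all pairwise differences D_f; its column-wise sample means are 0.
  diffs <- do.call(rbind, lapply(seq_len(s), function(i)
    sweep(curves, 2, curves[i, ])))          # rows: f_k(x_j) - f_i(x_j)
  sd_j <- apply(diffs, 2, sd)                # pointwise sd over D_f
  d <- matrix(0, s, s)
  for (i in seq_len(s))
    for (k in seq_len(s))
      d[i, k] <- max(abs(curves[i, ] - curves[k, ]) / sd_j)   # Eq. (4)
  d
}

cluster_curves <- function(curves, n_clusters) {
  d <- studentized_dissimilarity(curves)
  pam(as.dist(d), k = n_clusters)            # partitioning around medoids
}
```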

In the following, we discuss possible choices of the functional ordering. First, we consider functional orderings that treat the different sources of data variability equally; we call such an ordering a combined functional ordering. This approach is useful when the investigator wants to join different pieces of information about the data and combine them in one universal procedure. Second, we review several functional orderings which satisfy the intrinsic graphical interpretation.

Our proposed procedure then consists of the following steps:

1. Apply the chosen transformations to the observed functions and bind them into the long vectors of Section 2.2.

2. Construct the set of differences $D_f$ of all pairs of these vectors.

3. Apply the chosen functional ordering to $D_f$ to obtain the centrality measures $M_{ii'}$.

4. Form the dissimilarity matrix with entries $d_{ii'} = 1 - M_{ii'}$.

5. Apply partitioning around medoids to the dissimilarity matrix to obtain the clusters.

#### **2.2 Combined functional ordering**

We now consider functions $T_i(x)$, $i = 1, \ldots, s'$, and specify their combined functional ordering. Various perspectives, such as different magnitudes and different shapes of the functions, can be used to order the functions. Here we provide a general method to combine these different perspectives in an equal manner. As suggested by [13], data transformation is an effective way to convert different types of variation into types that are easy to handle with functional depth. Hence, various transformations could be applied to the raw functions to obtain the transformed data sets of interest, $V_1, \ldots, V_k$. These transformations are computed on the same fixed set of points $x_1, \ldots, x_d$; for instance, shifting each curve to zero mean eliminates the magnitude variation, and normalizing the centered curves by their $L^2$ norms extracts pure shape information. In the case of multivariate functional data, each component of the data and its transformations could be treated similarly. Also, the covariance function

between the components can be added to take into account the dependence structure.

We denote by $V_k(T_{ij})$ the resulting values of $T_{ij}$ under the transformation $V_k$, and we can express the long vector as:

$$\mathbf{T}\_i = (V\_1(T\_{i1}), \dots, V\_1(T\_{id}), \dots, V\_k(T\_{i1}), \dots, V\_k(T\_{id})), \quad i = 1, \dots, s' \tag{1}$$

We can then apply to them the corresponding ordering and hence construct the dissimilarity matrix. Note that each of the orderings to be introduced considers each element equally by ranking or scaling, so the desired perspectives of ordering are considered and treated equally in such a combined ordering. To enhance the interpretability of the clustering results, we focus only on the notions that satisfy the intrinsic graphical interpretation.
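The following sketch, assuming an s x d matrix `curves` of raw functions, builds the long vectors of Eq. (1) from the raw curves together with the two shape transformations of [13]; the helper names are illustrative.

```r
# Sketch: combined data sources of Eq. (1) (raw, centered, and shape curves).
center_curves <- function(curves) curves - rowMeans(curves)          # remove magnitude
normalize_l2  <- function(curves) curves / sqrt(rowSums(curves^2))   # keep shape only

combine_sources <- function(curves) {
  centered <- center_curves(curves)
  # bind V_1(T_i.), ..., V_k(T_i.) column-wise into one long vector per curve
  cbind(curves, centered, normalize_l2(centered))
}
```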

#### **2.3 Functional ordering with intrinsic graphical interpretation**

The following definition specifies the properties of *the global envelope that has an intrinsic graphical interpretation with respect to an ordering*. This definition was already used in [10] to define global envelope tests and central regions with graphical interpretation.

**Definition 1**: Assume a general ordering $\prec$ of the vectors $\mathbf{T}_i$, $i = 1, \ldots, s'$, that is induced by a univariate measure $M_i$. That is, $M_i \geq M_j$ iff $\mathbf{T}_i \prec \mathbf{T}_j$, which means that $\mathbf{T}_i$ is less extreme than or as extreme as $\mathbf{T}_j$. (The smaller the measure $M_i$, the more extreme $\mathbf{T}_i$.) The $100(1-\alpha)\%$ global envelope $[\mathbf{T}^{(\alpha)}_{\mathrm{low}\,j}, \mathbf{T}^{(\alpha)}_{\mathrm{upp}\,j}]$ has *intrinsic graphical interpretation* (IGI) with respect to the ordering $\prec$ if:

1. $m_{(\alpha)} \in \mathbb{R}$ is the largest of the $M_i$ such that the number of those $i$ for which $M_i < m_{(\alpha)}$ is less than or equal to $\alpha s'$;

2. $T_{ij} < \mathbf{T}^{(\alpha)}_{\mathrm{low}\,j}$ or $T_{ij} > \mathbf{T}^{(\alpha)}_{\mathrm{upp}\,j}$ for some $j = 1, \ldots, d$ iff $M_i < m_{(\alpha)}$, for every $i = 1, \ldots, s'$;

3. $\mathbf{T}^{(\alpha)}_{\mathrm{low}\,j} \le T_{ij} \le \mathbf{T}^{(\alpha)}_{\mathrm{upp}\,j}$ for all $j = 1, \ldots, d$ iff $M_i \ge m_{(\alpha)}$, for every $i = 1, \ldots, s'$.

We call an ordering *an ordering with intrinsic graphical interpretation* if there exists a global envelope with IGI with respect to this ordering. Remark here that $m_{(\alpha)}$ is not exactly the $\alpha$ quantile of the $M_i$ and that points 2 and 3 are equivalent; we kept both to show the interpretability of the IGI. The simple ordering criterion based on the $L^\infty$ distance, $M_i = \max_j |T_{ij} - T_{0j}|$, clearly satisfies such a property, but it does not account for the changes in the marginal distribution of $T_{\cdot j}$ for different values of $j$ [14, 15]. To address this problem, Myllymäki et al. [14] proposed studentized and directional quantile scaling of the maximum ordering, which also satisfies IGI. Furthermore, [15, 16] simultaneously defined the extreme rank length ordering, which is based on the number of the most extreme pointwise ranks and satisfies IGI. Finally, [10] extended this family with the continuous rank ordering, which is based on the continuous extension of pointwise ranking, and the area rank ordering, which is based on the area of the most extreme continuous ranks. To the best of our knowledge, no other functional (respectively, multivariate) orderings satisfy IGI.

The definitions of all previously mentioned orderings are given in [10]. For the sake of completeness, we provide here a short list of these definitions.


#### *2.3.1 Extreme rank length ordering*

Let $r_{1j}, r_{2j}, \ldots, r_{s'j}$ be the raw ranks of $T_{1j}, T_{2j}, \ldots, T_{s'j}$, such that the smallest $T_{ij}$ has rank 1. In the case of ties, the raw ranks are averaged. The two-sided pointwise ranks are then calculated as $R_{ij} = \min(r_{ij}, s' + 1 - r_{ij})$. Consider now the vectors of pointwise ordered ranks $\mathbf{R}_i = (R_{i[1]}, R_{i[2]}, \ldots, R_{i[d]})$, where $\{R_{i[1]}, \ldots, R_{i[d]}\} = \{R_{i1}, \ldots, R_{id}\}$ and $R_{i[k]} \le R_{i[k']}$ whenever $k \le k'$. The extreme rank length measure of the vectors $\mathbf{R}_i$ is equal to:

$$E_i = \frac{1}{s'} \sum_{i'=1}^{s'} \mathbf{1}\left(\mathbf{R}_{i'} \prec \mathbf{R}_i\right) \tag{2}$$

where

$$\mathbf{R}_{i'} \prec \mathbf{R}_i \;\Leftrightarrow\; \exists\, n \leq d : R_{i'[k]} = R_{i[k]} \ \forall\, k < n, \quad R_{i'[n]} < R_{i[n]}.$$

The division by $s'$ leads to normalized ranks that obtain values between 0 and 1. Consequently, the ERL measure corresponds to the extremal depth as defined in [16].

Let $e_{(\alpha)}$ be defined according to point 1 of Definition 1, and let $I_\alpha = \{ i \in 1, \ldots, s' : E_i \geq e_{(\alpha)} \}$ be the index set of vectors less extreme than or as extreme as $e_{(\alpha)}$. Then, the $100(1-\alpha)\%$ global extreme rank length envelope (or global extreme rank length central region) induced by $E_i$ is:

$$\mathbf{T}^{(\alpha)}_{\mathrm{low}\,k} = \min_{i \in I_\alpha} T_{ik} \quad \text{and} \quad \mathbf{T}^{(\alpha)}_{\mathrm{upp}\,k} = \max_{i \in I_\alpha} T_{ik} \quad \text{for } k = 1, \ldots, d. \tag{3}$$
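A direct transcription of Eq. (2) into R is sketched below; it assumes the curves are stored row-wise in a matrix `curves` and is meant only to make the definition concrete, not to reproduce the GET implementation.

```r
# Sketch: extreme rank length (ERL) measure of Eq. (2).
erl_measure <- function(curves) {
  s <- nrow(curves)
  r  <- apply(curves, 2, rank)             # raw pointwise ranks, ties averaged
  R  <- pmin(r, s + 1 - r)                 # two-sided pointwise ranks R_ij
  Rs <- t(apply(R, 1, sort))               # ordered vectors R_i[1] <= ... <= R_i[d]
  precedes <- function(a, b) {             # a precedes b in the ERL ordering
    n <- which(a != b)[1]                  # first position where they differ
    !is.na(n) && a[n] < b[n]
  }
  sapply(seq_len(s), function(i)
    sum(sapply(seq_len(s), function(ip) precedes(Rs[ip, ], Rs[i, ]))) / s)
}
```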

#### *2.3.2 Global continuous rank ordering*

The continuous rank measure is:

$$C_i = \min_{j=1,\ldots,d} c_{ij} / \lceil s'/2 \rceil,$$

where *cij* are the pointwise continuous ranks defined as:

$$
c_{ij} =
\begin{cases}
\displaystyle\sum_{i'} \mathbf{1}\left(T_{i'j} > T_{ij}\right) + \dfrac{T_{[i+1]j} - T_{ij}}{T_{[i+1]j} - T_{[i-1]j}} & \text{for } i: T_{ij} \neq \max_{i'} T_{i'j} \text{ and } T_{ij} > \operatorname{median}(T_{\cdot j}), \\[2ex]
\exp\left( - \dfrac{T_{ij} - T_{[i-1]j}}{T_{[i-1]j} - \min_{i'} T_{i'j}} \right) & \text{for } i: T_{ij} = \max_{i'} T_{i'j}, \\[2ex]
\displaystyle\sum_{i'} \mathbf{1}\left(T_{i'j} < T_{ij}\right) + \dfrac{T_{ij} - T_{[i-1]j}}{T_{[i+1]j} - T_{[i-1]j}} & \text{for } i: T_{ij} \neq \min_{i'} T_{i'j} \text{ and } T_{ij} < \operatorname{median}(T_{\cdot j}), \\[2ex]
\exp\left( - \dfrac{T_{[i+1]j} - T_{ij}}{\max_{i'} T_{i'j} - T_{[i+1]j}} \right) & \text{for } i: T_{ij} = \min_{i'} T_{i'j}, \\[2ex]
R_{ij} & \text{for } T_{ij} = \operatorname{median}(T_{\cdot j}).
\end{cases}
$$

Here, $T_{[i-1]j}$ and $T_{[i+1]j}$ denote the values of the functions that are, in the $j$-th element, immediately below and above $T_{ij}$, respectively (i.e., $T_{[i-1]j} = \max_{i':\, T_{i'j} < T_{ij}} T_{i'j}$ and $T_{[i+1]j} = \min_{i':\, T_{i'j} > T_{ij}} T_{i'j}$).

The $100(1-\alpha)\%$ global continuous rank envelope induced by $C_i$ is constructed in the same manner as the global extreme rank length envelope.

#### *2.3.3 Global area rank ordering*

The area rank measure is:

$$A_i = \frac{1}{\lceil s'/2 \rceil\, d} \sum_{j} \min\left(R_i, c_{ij}\right),$$

where $R_i = \min_j (R_{ij})$ and the $R_{ij}$ are the two-sided pointwise ranks defined above. The $100(1-\alpha)\%$ global area rank envelope induced by $A_i$ is constructed in a manner similar to that of the global extreme rank length envelope.

#### *2.3.4 Studentized maximum ordering*

Because we construct a symmetric set of functions to compute the dissimilarity matrix, here we use only the symmetric studentized ordering. The above orderings are based on the whole distributions of $T_{\cdot j}$, $j = 1, \ldots, d$. It is also possible to approximate the distribution from a few sample characteristics. The studentized maximum ordering approximates the distribution of $T_{\cdot j}$, $j = 1, \ldots, d$, by the sample mean $T_{0j}$ and the sample standard deviation $\operatorname{sd}(T_{\cdot j})$. The studentized measure is:

$$S_i = \max_j \left| \frac{T_{ij} - T_{0j}}{\operatorname{sd}(T_{\cdot j})} \right|. \tag{4}$$

The $100(1-\alpha)\%$ global studentized envelope induced by $S_i$ is defined by:

$$\mathbf{T}^{(\alpha)}_{\mathrm{low}\,j} = T_{0j} - s_\alpha \operatorname{sd}(T_{\cdot j}) \quad \text{and} \quad \mathbf{T}^{(\alpha)}_{\mathrm{upp}\,j} = T_{0j} + s_\alpha \operatorname{sd}(T_{\cdot j}) \quad \text{for } j = 1, \ldots, d, \tag{5}$$

where $s_\alpha$ is taken according to point 1 of Definition 1.

#### **2.4 Dissimilarity matrix based on the combined ordering**

In this section, we validate the dissimilarity matrix construction defined in Section 2.1 for the studentized measure by showing that $d_{ii'} = S_{ii'}$ is a metric, and for the global area rank measure by showing that $d_{ii'} = 1 - A_{ii'}$ is a semi-metric. The latter means that $d_{ii'} = 1 - A_{ii'}$ satisfies all the properties of a metric except for the triangle inequality, which is violated in specific cases. Metric properties are usually required when choosing a distance measure, but they are not necessary for the partitioning around medoids algorithm, which is used to calculate the clusters afterward. Furthermore, our simulation study demonstrates that the specific cases in which the triangle inequality of the global area rank measure is not satisfied are not realized by the functions appearing in real data studies. We also provide a thorough check of the triangle inequality for the global area rank measure in our implementation of the algorithm. Thus, in practice, a user can check this feature of the metric for the particular data of interest. For every dataset considered in our simulation and data studies, the triangle inequality was satisfied.
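A brute-force check of the triangle inequality over all triples, as referred to above, can be sketched as follows; `d` is assumed to be a precomputed symmetric dissimilarity matrix, and the function name is illustrative.

```r
# Sketch: check the triangle inequality d[i,k] <= d[i,ip] + d[ip,k] over all triples.
check_triangle_inequality <- function(d, tol = 1e-10) {
  s <- nrow(d)
  satisfied <- 0L; total <- 0L
  for (i in seq_len(s)) for (ip in seq_len(s)) for (k in seq_len(s)) {
    total <- total + 1L
    if (d[i, k] <= d[i, ip] + d[ip, k] + tol) satisfied <- satisfied + 1L
  }
  satisfied / total            # fraction of triples satisfying the inequality
}
```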

**Theorem 1.1**: Define the distance between $\mathbf{T}_i$ and $\mathbf{T}_{i'}$ as:

$$d_{ii'} = 1 - A_{ii'},$$

where $A_{ii'}$ is the global area rank measure of $\mathbf{T}_i - \mathbf{T}_{i'}$ on $D_f$. Then $d_{ii'}$ satisfies, for any $i$, $i'$:

1. Non-negativity: $d_{ii'} \geq 0$;

2. Identity of indiscernibles: $d_{ii'} = 0$ iff $\mathbf{T}_i = \mathbf{T}_{i'}$;

3. Symmetry: $d_{ii'} = d_{i'i}$.

#### **Proof:**

*Non-negativity: The set $D_f$ contains $s'$ curves, including a zero element, which is the deepest point of $D_f$; i.e., 0 is the median in every coordinate. For the area ordering of these curves, the two-sided pointwise ranks of the curve $\mathbf{T}_i - \mathbf{T}_{i'}$ satisfy $R_{ii'j} \leq \lceil s'/2 \rceil$ and $R_{ii'} = \min_j \{R_{ii'1}, \ldots, R_{ii'd}\} \leq \lceil s'/2 \rceil$. Hence, we have $A_{ii'} \leq 1$, i.e., $d_{ii'} \geq 0$.*

*Identity of indiscernibles: $d_{ii'} = 0 \Leftrightarrow A_{ii'} = 1 \Leftrightarrow R_{ii'j} = \lceil s'/2 \rceil$ for every $j = 1, \ldots, d$ $\Leftrightarrow$ $\mathbf{T}_i - \mathbf{T}_{i'}$ is the deepest curve of $D_f$ $\Leftrightarrow$ $\mathbf{T}_i = \mathbf{T}_{i'}$.*

*Symmetry: This property holds implicitly due to the symmetry of $D_f$.*

The fourth property of the metric, i.e.,

4. Triangle inequality: $d_{ii'} + d_{i'k} \geq d_{ik}$ for any $i$, $i'$, and $k$,

is not satisfied when $\mathbf{T}_i \equiv t_i$ for every $i$. The results of our simulation study suggest that if the system of data provides enough crossings of the functions, then the triangle inequality is satisfied.

**Theorem 1.2**: Define the distance between $\mathbf{T}_i$ and $\mathbf{T}_{i'}$ as:

$$d_{ii'} = S_{ii'},$$

where $S_{ii'}$ is the studentized measure of $\mathbf{T}_i - \mathbf{T}_{i'}$ on $D_f$. Then $d_{ii'}$ is a valid metric.

**Proof:**

*The first three properties obviously hold for the studentized difference distance. We prove the triangle inequality for $d_{ii'}$. Note that $df \equiv 0$ is an element of $D_f$, and hence the sample mean $T_{0j} = 0$ for $j = 1, \ldots, d$. Let us denote the sample standard deviation of the $j$-th element of $D_f$ by $\operatorname{sd}(D_{\cdot j})$. Then, we have:*

$$\begin{split} d_{ik} &= \max_j \left| \frac{T_{ij} - T_{kj} - 0}{\operatorname{sd}(D_{\cdot j})} \right| \leq \max_j \left\{ \left| \frac{T_{ij} - T_{i'j}}{\operatorname{sd}(D_{\cdot j})} \right| + \left| \frac{T_{i'j} - T_{kj}}{\operatorname{sd}(D_{\cdot j})} \right| \right\} \\ &\leq \max_j \left| \frac{T_{ij} - T_{i'j}}{\operatorname{sd}(D_{\cdot j})} \right| + \max_j \left| \frac{T_{i'j} - T_{kj}}{\operatorname{sd}(D_{\cdot j})} \right| \\ &= d_{ii'} + d_{i'k}. \end{split}$$

*This completes the proof.*

#### **3. Simulation study**

This section describes the intensive simulation studies we conducted to assess the empirical performance of the proposed clustering methods and to compare it with that of existing methods when the clusters demonstrate differences from various perspectives. For comparison, we consider two clustering methods for functional data: the *k*-means method available in the R package *fda.usc* [17] and the model-based clustering method proposed by [18], which is available in the R package *fdapace* [19]. For fairness of comparison, the standardization procedure is applied to normalize the empirical marginal distributions as described in Section 1 so that they can be combined equally.

Specifically, we consider the following five models on $t \in [0, 1]$:


Here, $U$ follows a uniform distribution on $[0.5, 0.6]$, and $e(t)$ is generated from a Gaussian process with zero mean and covariance function $\gamma(s, t) = \sigma^2 \exp\{-\phi |t - s|^{\nu}\}$, where $\sigma^2 = 0.2$, $\phi = 2$, and $\nu = 1$.

In addition, to assess the robustness of the proposed methods, we also consider another situation by replacing $e(t)$ with a multivariate $t$ distribution with two degrees of freedom, $t_2(\mu, \Sigma)$, where $\mu = 0$ and $\Sigma$ is generated with $\gamma(s, t)$. The heavy

#### **Figure 1.**

*Top panel: Realizations of two settings. Bottom panel: Adjusted Rand index of four clustering methods with the two settings.*

#### *A New Functional Clustering Method with Combined Dissimilarity Sources and Graphical… DOI: http://dx.doi.org/10.5772/intechopen.100124*

tail property of the marginal distribution allows the data to be viewed as contaminated by some outliers, which are commonly encountered in practice. We generate 100 samples for each of the five classes with 20 equally spaced design points; as a result, 500 curves are clustered into five groups. The top panel of **Figure 1** demonstrates one realization of the simulated samples under two settings. To account for both the magnitude and the shape variation among clusters, we make two

#### **Figure 2.**

*Clusters for setting 1 visualized on raw curves (top panel) and normalized curves (bottom panel). In each panel, from top to bottom: Area, studentized, k-means, and model based.*

transformations suggested by [13] to the raw curves, shifting the curves so that each has a zero mean and then normalizing the centered curves by their *L*<sup>2</sup> norm. We then bind the three components together as long vectors for clustering. For each run, we use the true number of clusters for all four methods and calculate the adjusted Rand index [20] to compare their clustering results. We repeat the procedure 100 times, and the results are reported in the bottom panel of **Figure 1**.
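For concreteness, one replication of this comparison could look like the sketch below; it assumes a 500 x 20 matrix `curves` with true labels `true_labels`, reuses the illustrative helpers `combine_sources()` and `cluster_curves()` from the earlier sketches, and computes the adjusted Rand index with `adjustedRandIndex()` from the mclust package.

```r
# Sketch: one replication of the simulation comparison (proposed method only).
library(mclust)                                   # for adjustedRandIndex()

combined <- combine_sources(curves)               # raw + centered + shape components
fit <- cluster_curves(combined, n_clusters = 5)   # dissimilarity matrix + PAM
adjustedRandIndex(fit$clustering, true_labels)    # agreement with the true partition
```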

**Figure 3.**

*Clusters for setting 2 visualized on raw curves (top panel) and normalized curves (bottom panel). In each panel, from top to bottom: Area, studentized, k-means, and model based.*


Note that in all cases of the simulation study, the triangle inequality of the area measure was satisfied for all combinations of curves.

Under the first setting, data are generated from a Gaussian process. With regard to the adjusted Rand index, the four methods are quite comparable, but the proposed methods are slightly better than the other two. Moreover, our methods recover the characteristics of the true clusters much better; see **Figure 2**, which illustrates one clustering result for each of the four methods with both raw curves and normalized curves. In contrast, the *k*-means method merges classes 2 and 5, and the model-based method merges classes 1 and 3.

As for the second setting, in which the marginal distribution becomes heavy-tailed, our methods obtain more robust clustering results than the other two methods and reach higher adjusted Rand indices (**Figure 3**). The model-based method relies heavily on the Gaussian assumption and thus shows less satisfactory behavior. Again, our methods still accurately recover the patterns of each cluster, whereas the other two methods completely fail to reveal reasonable group structures. Specifically, both the *k*-means and the model-based methods suggest a cluster with only a few curves, which indicates a clear misinterpretation of the situation.

#### **4. Clustering of insurance penetration**

Insurance consumption indicates the equilibrium of supply and demand of insurance products. For a given insurance market, the collection of total (life and non-life) yearly insurance consumption observations helps to explain the variation of insurance market development over time. A common measure of insurance consumption, and hence of insurance development, is insurance penetration (IP), defined as the ratio of insurance premiums to GDP. The pattern of the development variation is evident when one views the IP as a function of time, known as the IP curve.

In their effort to promote the European single insurance market through the integration process, European policymakers put emphasis on homogeneity and convergence aspects of development patterns of European insurance markets. That is equivalent to saying that they are interested in identifying a single group (cluster) of countries whose IP curves exhibit similarity in magnitude and shape. The clustering of European countries in terms of their IP curves provides a method for testing the magnitude and shape similarity of the insurance industry in Europe. In particular, functional clustering methods are appropriate for our data, given the time dependency in the observations.

IP curves (time-series data on IP) originating from the Swiss Re (2016) database were analyzed by the proposed functional clustering (FC) method based on the Area measure. The exploration concentrated on the IP curves of 34 European countries (EU and non-EU members) observed over the 13 years between 2004 and 2016, that is, before, during, and after the financial and sovereign debt crises.

The FC method extracts the partitioning information from both the magnitude and the shape of the IP curves. While the magnitude is captured directly in the IP curves, the shape is not straightforward to detect. To this end, we performed two types of transformations on the raw IP curves to reveal their shape, as proposed by [13] for extracting shape information. First, the raw IP curves were centred relative to each country's average IP rate to mitigate the widely different magnitudes in the IP data. The resulting centred IP curves were then normalized by their L2 norms to a unit norm (to have a length of 1). By normalizing the centred IP curves in this manner, we eliminate their amplitude signal and are left with only the shape signal of the raw IP curves.

For the FC method to run properly, the most suitable number of clusters must be determined. We chose 6 clusters, even though the median value over all methods presented in the NbClust library of the R software is 5. Our choice is justified as it better serves the analysis and the characterization of the produced clusters.
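A sketch of such a check with the NbClust package follows; the data object `combined` (the long vectors of Section 2.2 for the 34 countries), the distance, the linkage method, and the search range are illustrative assumptions rather than the authors' exact settings.

```r
# Sketch: candidate numbers of clusters proposed by the indices in NbClust.
library(NbClust)

nb <- NbClust(data = combined, distance = "euclidean",
              min.nc = 2, max.nc = 10, method = "ward.D2", index = "all")
median(nb$Best.nc[1, ])    # first row: number of clusters proposed by each index
```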

Given the IP curves of each cluster, the FC method also provides a graphical representation, through the central regions, of the deepest central IP curves within each cluster. We are interested in the so-called marginal plot style approach to the clustering solution. This means that the central regions are computed separately for magnitude and shape to better express each cluster component's shape. Remark here that the proposed method also allows showing the central region with respect to the combined ordering of magnitude and shape together. The appearance of the clusters is demonstrated by the deepest IP curve (solid curve), which corresponds to the medoid IP curve, and the envelope of 50% central IP curves (gray area), which reflects the band within which the 50% of IP curves surrounding the deepest curve vary. See **Figures 4** and **5**. Note that the triangle inequality for the Area measure was satisfied for all combinations of countries. With this visualization, we can describe the clusters produced by the FC method as follows:

Cluster 1: Developed insurance markets with middle-to-high IP levels and decreasing IP patterns in the whole period. This cluster includes Belgium, France, Ireland, Austria\*, the UK, Portugal, Switzerland, Malta, Slovakia, and Germany.

Cluster 2: Developing insurance markets with low-to-middle IP levels and an increasing IP pattern until 2010 and a varying (decreasing) one thereafter. This cluster of countries consists of Cyprus\*, Turkey, Greece, and Luxembourg.

Cluster 3: Developed insurance markets with middle-to-high IP levels and increasing IP patterns in the whole period. This cluster unites Finland\*, Italy, Spain, Denmark, and the Netherlands.

Cluster 4: Developing insurance markets with low-to-middle IP levels and an increasing IP pattern until 2009 and a decreasing one thereafter. The within-cluster countries are Croatia\*, Slovenia, Iceland, the Czech Republic, Sweden, and Romania.

Cluster 5: Developing insurance markets with low-to-middle IP levels and an almost quadratic IP

**Figure 4.** *Clustering results of the IP curves: Magnitude plot.*


**Figure 5.** *Clustering results of the IP curves: Shape plot.*

pattern with a vertex point in 2008. In this cluster, we see countries such as Russia\*, Ukraine, and Norway.

Cluster 6: Least Developed insurance markets with a low IP level and an increasing IP pattern followed by a decreasing one initiated in 2007, right at the start of the financial crisis. Members of this cluster are countries such as Lithuania\*, Bulgaria, Hungary, Estonia, Serbia, and Poland.

The \* symbol denotes the medoid IP curve produced by the clustering for each cluster.

The results bring to the surface, first, the difficulty of the European insurance industry in converging and exhibiting homogeneity among national insurance markets during the whole period, a fact that otherwise could have led to the building of a single European insurance industry. Second, they reveal the differential behavior of European insurance markets under different phases of the macroeconomic environment. For instance, Least Developed non-EU insurance markets faced contraction challenges, especially during and after the financial and sovereign debt crisis period. The same challenge, with a time lag of approximately two years, was evident for some Developing insurance markets. Russia and Ukraine had their insurance markets running in parallel and separated from the other two Developing insurance markets to follow their own smile-shaped development pattern. A slight improvement in insurance activity was also observed for the remaining Developing insurance markets, which lasted almost until the end of the sovereign debt crisis in 2011. However, this improvement was offset by their unstable development pattern thereafter. Over the past years, the overall development of Developed insurance markets has decreased, due to a contraction in the life insurance business. However, a few of them managed to sustain an increasing pattern with varying IP rate changes over the years.

#### **5. Clustering of population growth data**

Over the last century, the world has seen rapid population growth. Particularly, the global population more than quadrupled. The magnitude of the population rate of change from one year to another is found by the fold change ratio (FCR). Fold

change is calculated simply as the ratio of the year-end population to the year-start population of a certain country. We refer to the evolution of the FCR over the course of time as the population growth rate (PGF) curve. In this example, our objective is to find clusters of world countries whose PGF curves share similar magnitude and shape properties. We use the output of the FC method based on the Area measure for clustering world countries. This output also hints at the distribution of the world population and reveals the trends and dynamics that are defining our world, so that policymakers can set sustainable development goals for our societies.

Thus, we consider the world population data (United Nations 2016), which was analyzed by [21]. This dataset includes estimates of the total population (both sexes) in 233 countries, areas, or regions in July of each year from 1950 to 2015. Motivated by these estimates and the arguments needed for the execution of the FC method, we follow three steps. In the first step, we preprocess the dataset by selecting those countries with populations of more than one million in July 1950. In total, 134 countries are included in our analysis. For each of these countries, we collect 65 data points that correspond to the FCR of each year interval and connect them to make the PGF curve. In the second step, we derive the shape information from the L2 normalization of the PGF curves shifted to their center. This particular step provides the set of PGF pattern (PGFP) curves. In the last step, we specify the number of clusters, which is required as an input argument by the FC method. The optimal number of clusters was arrived at by calculating the median value over all methods presented in the NbClust library of the R software. Based on the result of this calculation, the chosen number of clusters was three.

**Figure 6** follows the marginal plot style approach used in our case studies by presenting the output of the FC method in a two-panel display. The first panel is dedicated to the magnitude clustering (it helps discern broad trends in the PGF curves), and the second to the shape clustering (it helps identify patterns of pace for the population rate of change). The first plot of each panel is the plot of the median curves of the clusters. Remark that the triangle inequality for the Area measure was satisfied for all combinations of countries.

Next, we present both the derived clusters and their characterization, which is based on the United Nations (UN) geographical region and classification of economies. For instance, we see that the population growth rates in Cluster 1 appear to follow an increasing trend, or at least maintain a certain degree of stability, because of natural increase and migration. Most countries in this cluster have a developing economy and are mainly located in Sub-Saharan Africa. However, three European countries (Ireland, Norway and Spain) with developed economies are also members of this cluster. In contrast, the other two characteristic population growth trends, present in Clusters 2 and 3, paint a picture of a stagnating or shrinking population in the future, the only difference being that the population in Cluster 3 shrinks faster than in Cluster 2. The most populated cluster (that is, Cluster 2 with 64 curves) is mostly associated with another set of developing economies (such as those of Brazil, China and Singapore) located, this time, in Latin America and the Caribbean along with East Asia and the Pacific. Additionally, the only developed economy that appears to reside in this cluster is that of the United States, while a few economies in transition that belong to the Commonwealth of Independent States (such as those of Azerbaijan and Kazakhstan) make their presence visible for the first time.

Finally, Cluster 3 has united mostly the developed economies of Europe and East Asia and the Pacific, along with the economies in transition of South-Eastern Europe (Albania, Serbia and North Macedonia). Moreover, the populations of a few


**Figure 6.**

*Clustering results for the population curves. Top panel: Magnitude plot; bottom panel: Shape plot.*

developing economies, located, for example, in Cuba, Jamaica, Puerto Rico, Ghana and Mozambique, have distinguished themselves from the vast majority of developing economies in Cluster 1 or Cluster 2 by following the population behavior of developed economies.

In conclusion, developing economies and economies in transition are split between two clusters, while the majority of the developed economies belong to one cluster. Based on the characterization of these clusters, it is understood that countries with developing economies experience population growth (or at least population stability). However, the more developed a country's economy, the more its population growth decreases. In certain cases, this decrease might even have a severe negative effect on a country's future projected population size, whereas in other cases its effect is smoother, without forcing the population size to reach record lows.

#### **6. Multivariate clustering of insurance penetration with ratio of life and total insurance**

The insurance industry generates a large volume of multivariate functional data from simultaneously obtained measurements on variables related to life, non-life, and total insurance activities. In our case of interest, two main country-specific variables that include data on premiums are available. The first is the total IP (TIP), which represents the development of total activities. The second is the ratio R of life IP to TIP, which represents the development of the share of life premiums in total premiums.

Since the insurance industry of a country can be represented by the bivariate variables of TIP and R, it is important to take into account the dependence between them. We compute a variable that describes this dependence through the covariance function:

$$\text{Cov}(t) = \text{sign}((IP(t) - m\_1(t))(R(t) - m\_2(t))) \sqrt{| (IP(t) - m\_1(t))(R(t) - m\_2(t)) |},$$

where $m_1(t)$ is the mean IP over all countries and $m_2(t)$ is the mean R over all countries; the Cov variable represents the development of the link between total and life share dynamics.
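A short sketch of this computation follows, assuming country-by-year matrices `tip` and `r` on the same annual grid (the object names are illustrative):

```r
# Sketch: signed square-root covariance curve between TIP and R per country.
cov_curve <- function(tip, r) {
  m1 <- colMeans(tip)                       # mean IP over all countries, per year
  m2 <- colMeans(r)                         # mean R over all countries, per year
  prod <- sweep(tip, 2, m1) * sweep(r, 2, m2)
  sign(prod) * sqrt(abs(prod))              # one Cov curve per country (rows)
}
```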

There is no doubt that the development of total activities differs from that of the life share. Nevertheless, it may be assumed that a common development coordinates these differential developments of the different insurance variables. It is then of great interest to identify groups of insurance markets with similar joint development patterns. With this consideration in mind, we aim to discover whether the European insurance market is homogeneous when national insurance markets develop jointly in terms of both their total activities and their life share.

We again obtained insurance data from the Swiss Re (2016) database, for the same European (EU and non-EU) countries as in the univariate case. In particular, we employ a dataset of our main variables for 34 European countries sampled at annual frequency between 2004 and 2016. That is to say, the data for each variable can be viewed as curves. Besides the curves for the TIP and R variables, we also included the computed curves for the Cov variable in our dataset and ended up with a set of three-dimensional vectors of curves.

Viewing the curves for each variable as a set of curves, we construct a three-component list of curve sets to serve as the input for the FC method. This time, the optimal number of clusters is three, consistent with the median value over all methods presented in the NbClust library of the R software. Our proposed method concentrates on visualizing, in the marginal plot style approach, the clusters of multivariate insurance functional data with regard to their magnitudes and covariance function (**Figures 7**–**9**).
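As a usage sketch, the three components could be bound into long vectors and clustered with the illustrative helpers from the earlier sketches (`cov_curve()` and `cluster_curves()` are assumptions, not the GET implementation):

```r
# Sketch: multivariate input (TIP, R, Cov) bound column-wise per country.
multivariate_input <- cbind(tip, r, cov_curve(tip, r))
fit <- cluster_curves(multivariate_input, n_clusters = 3)
fit$clustering                     # cluster label for each of the 34 countries
```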

The clustering results are summarized as follows:

Cluster 1: Countries of high TIP and high R with no correlation whatsoever between the two variables throughout the study period.

Cluster 2: Countries of high TIP and high R with a positive correlation between the two variables throughout the study period.


**Figure 7.** *Clustering results for bivariate curves of TIP and R: TIP plots.*

**Figure 8.** *Clustering results for bivariate curves of TIP and R: R plots.*

Cluster 3: Countries of low TIP and low R with no correlation whatsoever between the two variables throughout the study period.

Additionally, the FC method suggests that the total and life share developments in Cluster 1 and Cluster 3 have independent paths, since the curves for the Cov variable almost coincide with the *x*-axis in **Figure 9**. At the same time, it succeeded in not clustering them together, owing to their different magnitude levels. On the contrary, in

**Figure 9.** *Clustering results for bivariate curves of TIP and R: Covariance plots.*

Cluster 2, the curves for the Cov variable are positioned above the *x*-axis, which means that total and life share are dependent (positively correlated) functions over the years.

The functional cluster analysis revealed some differences in the dynamics of insurance markets in Europe. The clustering results clearly reject the hypothesis of the homogeneity of the European insurance market. Europe continues to be a two-speed insurance market, with countries with high development and independent paths of total and life insurance business, and others with low development. For both market speeds, detecting an increasing pattern in the total insurance business does not guarantee that life premiums will follow the same pattern at the same time. Any similarity in their patterns could be explained by social, economic, demographic, or other factors and not by the total business pattern itself. However, there is another high-speed market where the increase of the total insurance business in the economy is an additional factor that accelerates the development of the life business share.

#### **7. Conclusions**

In this study, we introduce a new class of functional cluster analysis methods based on functional orderings. We intended to work with methods that allow intrinsic graphical interpretation in order to obtain a natural interpretation of clusters via their central regions. Therefore, we propose the use of a studentized measure, which forms a metric on the set of functional differences. We also suggest the use of the area measure, which orders the functions according to the area of the most extreme continuous ranks and considers the entire distribution of the functions. This measure does not form a metric on an arbitrary set of functions, but the simulation study results and the real data studies suggest that it behaves as a metric on real data sets of functions. A check of the triangle inequality is provided for any given set of functions.

This study's primary aim is to introduce methods that combine the various functional information sources equally. It is possible to study clustering while showing equal concern for both magnitude and shape, as shown in the first and

second data examples. In other words, it is possible to study the clustering of multivariate functions when the marginal functions are treated equally. It is also possible to add to the study a term that summarizes the covariance between the marginals of the multivariate function, as shown in the third data example.

The simulation study suggests that the proposed method is robust and more powerful than the studied alternatives that give equal treatment to the various sources. The studied alternatives are the *k*-means method, with pre-standardization of every coordinate by its mean and variance, and the model-based method, which assumes a normal distribution of the data and considers the marginal means, variances, and the covariance function.

Our proposed methods consider the covariance structure of the functional data via the ordering of the entire functional differences. Our proposed methods are also nonparametric and, as such, have no model requirement. Our simulation study also showed that our proposed methods are quite robust to heavy-tailed functions, which can be considered a type of functional cluster outlier. The data studies show that our methods can cluster the functions with respect to magnitude and shape and that they provide a sensible graphical interpretation of the resulting clusters. The third example shows that the clusters can also be constructed with respect to the covariance of the marginals in a multivariate function. This study does not examine methods to choose the number of clusters in an optimal manner, and this problem is left to the user's choice or to further development.

#### **Acknowledgements**

Wenlin Dai has been financially supported by the National Natural Science Foundation of China (Project No. 11901573). T. Mrkvička has been financially supported by the Grant Agency of the Czech Republic (Project No. 19-04412S).

#### **Author details**

Wenlin Dai<sup>1</sup>, Stavros Athanasiadis<sup>2</sup> and Tomáš Mrkvička<sup>2</sup>\*

1 Center for Applied Statistics, Institute of Statistics and Big Data, Renmin University of China, Beijing, China

2 University of South Bohemia, České Budĕjovice, Czech Republic

\*Address all correspondence to: mrkvicka.toma@gmail.com

© 2021 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/ by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## **References**

[1] Ieva, F., A. M. Paganoni, D. Pigoli, and V. Vitelli (2013). Multivariate functional clustering for the morphological analysis of electrocardiograph curves. Journal of the Royal Statistical Society: Series C 62 (3), 401–418.

[2] Kazor, K. and A. S. Hering (2015). Assessing the performance of modelbased clustering methods in multivariate time series with application to identifying regional wind regimes. Journal of Agricultural, Biological, and Environmental Statistics 20(2), 192–217.

[3] Tupper, L. L., D. S. Matteson, C. L. Anderson, and L. Zephyr (2018). Band depth clustering for nonstationary time series and wind speed behavior. Technometrics 60(2), 245–254.

[4] Ferraty, F. and P. Vieu (2006). Nonparametric Functional Data Analysis: Theory and Practice (1 ed.). Springer Series in Statistics. Springer.

[5] Chiou, J.-M. and P.-L. Li (2007). Functional clustering and identifying substructures of longitudinal data. Journal of the Royal Statistical Society: Series B 69(4), 679–699.

[6] Jacques, J. and C. Preda (2014). Model-based clustering for multivariate functional data. Computational Statistics and Data Analysis 71, 92–106.

[7] Zeng, P., J. Q. Shi, and W.-S. Kim (2019). Simultaneous registration and clustering for multidimensional functional data. Journal of Computational and Graphical Statistics 28(4), 943–953.

[8] López-Pintado, S. and J. Romo (2009). On the concept of depth for functional data. Journal of the American Statistical Association 104(486), 718–734.

[9] López-Pintado, S., Y. Sun, J. K. Lin, and M. G. Genton (2014). Simplicial

band depth for multivariate functional data. Advances in Data Analysis and Classification 8(3), 321–338.

[10] Myllymäki, M. and T. Mrkvička (2020). GET: Global envelopes in R. arXiv preprint arXiv:1911.06583.

[11] Sguera, C., P. Galeano, and R. Lillo (2014). Spatial depth-based classification for functional data. TEST 23(4), 725–750.

[12] de Micheaux, P. L., P. Mozharovskyi, and M. Vimond (2020). Depth for curve data and applications. Journal of the American Statistical Association 0(0), 1–17.

[13] Dai, W., T. Mrkvička, Y. Sun, and M. G. Genton (2020). Functional outlier detection and taxonomy by sequential transformations. Computational Statistics and Data Analysis 149, 106960.

[14] Myllymäki, M., P. Grabarnik, H. Seijo, and D. Stoyan (2015). Deviation test construction and power comparison for marked spatial point patterns. Spatial Statistics 11, 19–34.

[15] Myllymäki, M., T. Mrkvička, P. Grabarnik, H. Seijo, and U. Hahn (2017). Global envelope tests for spatial processes. J. R. Statist. Soc. B 79(2), 381–404.

[16] Narisetty, N. N. and V. N. Nair (2016). Extremal depth for functional data and applications. Journal of American Statistical Association 111 (516), 1705–1714.

[17] Febrero-Bande, M. and M. Oviedo de la Fuente (2012). Statistical computing in functional data analysis: The R package fda.usc. Journal of Statistical Software 51(4), 1–28.

[18] Fraley, C. and A. E. Raftery (2002). Model-based clustering, discriminant


analysis, and density estimation. Journal of the American Statistical Association 97(458), 611–631.

[19] Carroll, C., A. Gajardo, Y. Chen, X. Dai, J. Fan, P. Z. Hadjipantelis, K. Han, H. Ji, H.-G. Mueller, and J.-L. Wang (2021). fdapace: Functional Data Analysis and Empirical Dynamics. R package version 0.5.6.

[20] Hubert, L. and P. Arabie (1985). Comparing partitions. Journal of classification 2(1), 193–218.

[21] Nagy, S., I. Gijbels, and D. Hlubinka (2017). Depth-based recognition of shape outlying functions. Journal of Computational and Graphical Statistics 26(4), 883–893.

#### **Chapter 2**

## Computational Statistics with Dummy Variables

*Adji Achmad Rinaldo Fernandes, Solimun and Nurjannah*

#### **Abstract**

Cluster analysis is a technique commonly used to group objects; further analysis can then be carried out to obtain a model, an approach named cluster integration. This process can be continued with various analyses, including path analysis, discriminant analysis, logistic regression, etc. In this chapter, the authors discuss the reasons for using dummy variables in this type of cluster analysis. Dummy variables are the main way that categorical variables are included as predictors in modeling. With statistical models such as linear regression, one of the dummy variables needs to be excluded; otherwise, the predictor variables are perfectly correlated. Thus, if a categorical variable can take *k* values, we usually need only *k-1* dummy variables; the *k-th* variable is redundant, as it does not bring any new information. Using more dummy variables than needed is known as the dummy variable trap. The advantage of dummy variables is that they are simple to use and make the decision-making process easier to manage. The novelty of this chapter is the perspective of the dummy variable technique combined with cluster analysis in statistical modeling. The data used in this study are an assessment of credit risk provision at a bank in Indonesia. All analyses were carried out using the R software.

**Keywords:** dummy, cluster, integrated cluster with logistic regression, integrated cluster with discriminant analysis, integrated cluster with path analysis

#### **1. Introduction**

Cluster analysis is commonly used to group objects; further analysis can then be carried out to obtain a model, namely cluster integration. Cluster integration can be continued with various analyses, including path analysis, discriminant analysis, logistic regression, etc. Cluster integration with path analysis aims to group homogeneous objects into one group, so that the resulting residual variance is homogeneous, in addition to maximizing the adjusted *R<sup>2</sup>* value. In cluster integration with discriminant analysis, the resulting clusters can maximize the accuracy, sensitivity, and specificity of the model. In this chapter, we explain the technical perspective of dummy variables using cluster analysis in statistical modeling, such as regression analysis, path analysis, and discriminant analysis.

#### **2. Why use dummy variables**

Dummy variables are numerical variables that represent categorical data, such as gender, race, political affiliation, etc. Technically, a dummy variable is a dichotomous quantitative variable. Its value range is small; it can take only two quantitative values. As a practical matter, regression results are easiest to interpret when the dummy variable is constrained to two specific values, 1 or 0. Typically, 1 represents the presence of a qualitative attribute, and 0 represents its absence. Categorical variables with more than two categories can be represented by a set of dummy variables, with one variable for each category. Numerical variables can also be coded to explore nonlinear effects. Dummy variables are also known as indicator variables, design variables, contrasts, one-hot coding, and binary basis variables [1].

Dummy variables are the main way that categorical variables are included as predictors in modeling. For example, in a linear regression analysis, the response variable may be profit and the predictor variable the employee group. With statistical models such as linear regression, one of the dummy variables needs to be excluded (by convention, either the first or the last); otherwise, the predictor variables are perfectly correlated [2].

When defining dummy variables, a common mistake is to define too many of them. If a categorical variable can take on *k* values, it is tempting to define *k* dummy variables, but only *k-1* dummy variables are needed.

The *k-th* dummy variable is redundant; it does not bring any new information, and it creates a severe multicollinearity problem for the analysis. Using *k* dummy variables when only *k-1* are needed is known as the dummy variable trap.
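As a brief illustration, the following R sketch (with a hypothetical three-level factor) shows how the default coding produces *k-1* dummy columns, and how coding all *k* levels recreates the trap.

```r
# Minimal sketch (hypothetical data): coding a k-level categorical variable
# with k-1 dummy variables to avoid the dummy variable trap.
employee_group <- factor(c("A", "B", "C", "A", "B"))   # k = 3 categories

# model.matrix() drops the first level by default: intercept + k-1 dummies.
model.matrix(~ employee_group)

# Coding all k levels (no intercept): adding an intercept back would make
# the columns perfectly collinear, i.e. the dummy variable trap.
model.matrix(~ employee_group - 1)
```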

Regression analysis treats all independent variables (X) as numerical. A numeric variable is an interval or ratio scale variable whose values can be directly compared, e.g. "10 is double 5" or "3 minus 1 equals 2." However, you may want to include a nominal scale attribute such as "Product Brand" or "Defect Type" in your study. Say you have three types of defects, numbered "1," "2" and "3." In this case, "3 minus 1" means nothing: you cannot subtract defect type 1 from defect type 3. The numbers here are used only to label the type of defect and have no intrinsic meaning of their own. A dummy variable is created in this situation so that the regression algorithm handles the variable correctly in the analysis [3].

The main benefit of dummy variables is that they are simple. Often there are better alternative basis functions, such as orthogonal polynomials, effect coding, and splines. When dummy variables are used in linear regression analysis, they offer several advantages [4], including:


#### **3. Hierarchical cluster**

Cluster analysis (group analysis) is an analytical method that aims to group objects into several groups, such that objects within a group are homogeneous (similar) while members of different groups are heterogeneous (different) [5]. The procedures for group formation in cluster analysis are divided into two types, namely hierarchical and non-hierarchical methods. The hierarchical method is used when there is no information about the number of clusters; its main principle is to group objects that have something in common into one group. The non-hierarchical method is used when the number of clusters is known or has been determined in advance [6].

This method starts by grouping the two or more objects that are closest to each other. The process then continues with the object that is next closest, and so on, forming a tree with a hierarchy of levels from the most similar to the most different. The tree formed by this clustering is called a dendrogram, and it provides deeper insight into the clustering process.

The stages of grouping data using the hierarchical method are [7]:


According to [6], there are two approaches to forming groups in the hierarchical method, namely agglomerative hierarchical methods and divisive hierarchical methods. The agglomerative method starts by treating each object as its own cluster. The two objects (clusters) with the smallest distance between them are then combined into one cluster, and the process continues until a single cluster containing all objects is formed.
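A minimal R sketch of this agglomerative process, on hypothetical data, is shown below; `hclust()` performs the successive merges and `plot()` draws the resulting dendrogram.

```r
# Minimal sketch (hypothetical data): agglomerative hierarchical clustering.
set.seed(1)
x <- matrix(rnorm(40), ncol = 2)        # 20 hypothetical objects
d <- dist(x, method = "euclidean")      # pairwise distances
tree <- hclust(d, method = "average")   # start from singletons, merge closest pairs
plot(tree, main = "Dendrogram")         # hierarchy from most similar to most different
```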

#### **4. Integrated cluster with logistic regression**

#### **4.1 Integrated cluster equation model with logistic regression analysis**

The model for the integration of cluster analysis with logistic regression analysis using the dummy variable approach has the same structure in its linear predictor as the general model of multiple linear regression analysis with dummy variables.

The general model of the integrated cluster with logistic regression analysis can be written as Eq. (1).

$$\pi(\mathbf{x}_i) = \frac{\exp\begin{pmatrix} \beta_0 + \beta_1 x_{1i} + \dots + \beta_p x_{pi} \\ +\, D_1\beta_{p+1} + D_1\beta_{p+2}x_{1i} + \dots + D_1\beta_{2p+1}x_{pi} \\ +\, D_2\beta_{2p+2} + D_2\beta_{2p+3}x_{1i} + \dots + D_2\beta_{3p+2}x_{pi} \\ +\, \dots + D_q\beta_{qp+q} + D_q\beta_{qp+q+1}x_{1i} + \dots + D_q\beta_{(q+1)p+q}x_{pi} \end{pmatrix}}{1 + \exp\begin{pmatrix} \beta_0 + \beta_1 x_{1i} + \dots + \beta_p x_{pi} \\ +\, D_1\beta_{p+1} + D_1\beta_{p+2}x_{1i} + \dots + D_1\beta_{2p+1}x_{pi} \\ +\, D_2\beta_{2p+2} + D_2\beta_{2p+3}x_{1i} + \dots + D_2\beta_{3p+2}x_{pi} \\ +\, \dots + D_q\beta_{qp+q} + D_q\beta_{qp+q+1}x_{1i} + \dots + D_q\beta_{(q+1)p+q}x_{pi} \end{pmatrix}} \tag{1}$$

where,

*yi*: response variable at the *i-th* observation unit

*xki*: the *k-th* predictor variable on the *i-th* observation unit

*βp*: coefficient of the *p-th* logistic function

*Dq*: the *q-th* dummy variable

*p*: number of predictor variables

*q*: the number of clusters formed minus 1

*i*: 1, 2, 3, … , *n*

#### **4.2 Logistic regression analysis assumptions**

Before conducting the analysis, several basic assumptions that underlie logistic regression analysis must be considered, namely [8]:


#### **4.3 Integrated cluster analysis method with logistic regression analysis**

The linkage used in this study is average linkage, and the distance between clusters is measured with the Euclidean distance. The number of clusters is determined in advance, namely 2 and 3 groups. The average linkage method is based on the average distance between clusters. The number of members in each cluster for the integrated cluster analysis method with regression analysis is presented in **Table 1**.

From **Table 1** it can be seen that, with 3 groups, there are 71 customers in Cluster 1, 15 customers in Cluster 2, and 14 customers in Cluster 3, while with 2 groups there are 93 customers in Cluster 1 and 7 customers in Cluster 2. The best linkage and model are selected by choosing the model with the largest total *R*<sup>2</sup>, summarized in **Table 2**.

Based on **Table 2**, the logistic regression analysis model with cluster integration into 3 groups has the greatest total determination value, so the logistic regression analysis with cluster integration into 3 groups is the best model compared to 2 groups. The total determination value of 89.23% is considered very good for describing the model.



**Table 1.**

*Number of members of each cluster (average linkage method) in the integrated cluster analysis method with logistic regression analysis.*


**Table 2.**

*Adjusted R*<sup>2</sup> *values for each integrated cluster analysis model with logistic regression analysis.*


Based on **Table 2**, the regression analysis with cluster integration into 3 groups yields an adjusted *R*<sup>2</sup> value of 0.4258, meaning that the variables age, work experience, and loan to value are able to explain 42.58% of the variability in credit collectibility, while the remaining 57.42% is influenced by variables outside the model. The logistic regression analysis with cluster integration into 3 groups yields an adjusted *R*<sup>2</sup> value of 0.8492, meaning that the same variables explain 84.92% of the variability in credit collectibility, while the remaining 15.08% is influenced by variables outside the model. The total coefficient of determination of the cluster-integrated logistic regression model with 3 groups is 0.8923, so it can be concluded that the model explains 89.23% of the variability in the data, while the remaining 10.77% is explained by variables outside the model.

The integrated cluster logistic regression analysis with 3 groups therefore has the highest adjusted *R*<sup>2</sup> value. Comparing the average values of the variables in each cluster, Cluster 2 has the highest averages, so Cluster 2 is labeled the high cluster, while Cluster 1 has the lowest averages and is labeled the low cluster. The average value for each cluster is presented in **Table 3**.

Based on **Table 3**, the average customer age is 39 years in the low cluster, 37 years in the high cluster, and 38 years in the medium cluster. The average work experience of customers is 40 months in the low cluster, 194 months in the high cluster, and 108 months in the medium cluster.


**Table 3.**

*Average value of each variable in each cluster for the integrated cluster analysis model with logistic regression analysis.*


The integrated cluster analysis method with logistic regression analysis with 3 groups separates the data optimally. The resulting model is given in Eq. (2).

$$\pi(\mathbf{x}) = \frac{\exp\left(-0.027x_1 - 0.041x_2 + 0.850x_3 - 0.374D_1x_1 - 0.006D_1x_2 + 9.971D_1x_3 + 0.090D_2x_1 + 0.026D_2x_2 - 1.559D_2x_3\right)}{1 + \exp\left(-0.027x_1 - 0.041x_2 + 0.850x_3 - 0.374D_1x_1 - 0.006D_1x_2 + 9.971D_1x_3 + 0.090D_2x_1 + 0.026D_2x_2 - 1.559D_2x_3\right)} \tag{2}$$

Low cluster (*D*1 = 0 and *D*2 = 0) can be seen in Eq. (3).

$$\pi(\mathbf{x}) = \frac{\exp\left(-0.027x_1 - 0.041x_2 + 0.850x_3\right)}{1 + \exp\left(-0.027x_1 - 0.041x_2 + 0.850x_3\right)} \tag{3}$$

High cluster (*D*1 = 1 and *D*2 = 0) can be seen in Eq. (4).

$$\pi(\mathbf{x}) = \frac{\exp\left(-0.401x_1 - 0.047x_2 + 10.821x_3\right)}{1 + \exp\left(-0.401x_1 - 0.047x_2 + 10.821x_3\right)} \tag{4}$$

Medium cluster (*D*1 = 0 and *D*2 = 1) can be seen in Eq. (5).

$$\pi(\mathbf{x}) = \frac{\exp\left(0.063x_1 - 0.015x_2 - 0.709x_3\right)}{1 + \exp\left(0.063x_1 - 0.015x_2 - 0.709x_3\right)} \tag{5}$$
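For readers who wish to reproduce this type of analysis, the R sketch below outlines the workflow of Section 4.3 on a hypothetical data frame `credit` with predictors `x1` (age), `x2` (work experience), `x3` (loan to value) and a binary collectibility status; the variable names are illustrative, and the standard `glm()` formulation shown here also includes intercept and dummy main-effect terms.

```r
# Sketch of the cluster-integrated logistic regression workflow
# (hypothetical data; not the original study's code).
X <- scale(credit[, c("x1", "x2", "x3")])       # standardized predictors
tree <- hclust(dist(X, method = "euclidean"),   # Euclidean distance,
               method = "average")              # average linkage
grp <- cutree(tree, k = 3)                      # 3 clusters
table(grp)                                      # cluster sizes (cf. Table 1)

# k = 3 clusters -> q = 2 dummy variables (cluster 1 as the reference)
credit$D1 <- as.numeric(grp == 2)
credit$D2 <- as.numeric(grp == 3)

# Logistic regression with cluster dummies interacting with the predictors
fit <- glm(status ~ (x1 + x2 + x3) * (D1 + D2),
           family = binomial, data = credit)
summary(fit)
```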

#### **5. Integrated cluster with discriminant analysis**

#### **5.1 Discriminant analysis**

Discriminant analysis is a multivariate analysis used to model the relationship between a categorical response variable and one or more quantitative predictor variables [9]. Discriminant analysis can be used as a grouping method because it produces a function that can distinguish between groups. The function is formed by maximizing the distance between groups. If the response variable consists of only two groups, the model is called two-group discriminant analysis, whereas if it consists of more than two categories it is called multiple discriminant analysis. Discriminant analysis has two assumptions that must be met, namely multivariate normality and homogeneity of the variance-covariance matrix.

According to [6], discriminant analysis is included in the multivariate dependence method. The model can be written as in Eq. (6).

$$y\_i = \beta\_1 X\_{1i} + \beta\_2 X\_{2i} + \dots + \beta\_p X\_{pi} \tag{6}$$

where,

*yi*: the response variable, categorical or nominal data, on the *i-th* observation unit

*Xpi*: the *p-th* explanatory variable on the *i-th* observation unit

*βp*: the coefficient of the *p-th* discriminant function

*i*: 1, 2, 3, … , *n*

#### **5.2 Integration of cluster analysis with discriminant analysis of dummy variable approach**

The integration of cluster analysis with discriminant analysis using the dummy variable approach combines the two methods: the dummy variables are obtained from the cluster results, with the number of clusters formed determining the categories of the dummy variables.

An integrated cluster model with discriminant analysis can be written in Eq. (7).

$$\begin{aligned} y_i &= \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_p x_{pi} \\ &\quad + D_1\beta_{p+1}x_{1i} + D_1\beta_{p+2}x_{2i} + \dots + D_1\beta_{2p}x_{pi} \\ &\quad + D_2\beta_{2p+1}x_{1i} + D_2\beta_{2p+2}x_{2i} + \dots + D_2\beta_{3p}x_{pi} \\ &\quad + \dots + D_q\beta_{qp+1}x_{1i} + D_q\beta_{qp+2}x_{2i} + \dots + D_q\beta_{(q+1)p}x_{pi} \end{aligned} \tag{7}$$

where,

*yi*: response variable at the *i-th* observation unit

*xpi*: the *p-th* explanatory variable on the *i-th* observation unit

*βp*: the coefficient of the *p-th* discriminant function

*Dq*: the *q-th* dummy variable

*p*: number of explanatory variables

*q*: the number of clusters formed minus 1

*i*: 1, 2, 3, … , *n*

If there are 3 research variables and 2 clusters, the integrated cluster model with multiple discriminant analysis can be written as in Eq. (8).

General model:

$$\mathbf{y}\_{i} = \beta\_{1}\mathbf{x}\_{1i} + \beta\_{2}\mathbf{x}\_{2i} + \beta\_{3}\mathbf{x}\_{3i} + D\_{1}\beta\_{4}\mathbf{x}\_{1i} + D\_{1}\beta\_{5}\mathbf{x}\_{2i} + D\_{1}\beta\_{6}\mathbf{x}\_{3i} \tag{8}$$

Cluster 1 (*D*1=0)

$$y_i = \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} \tag{9}$$

Cluster 2 (*D*1=1)

$$y\_i = (\beta\_1 + \beta\_4)\mathbf{x}\_{1i} + (\beta\_2 + \beta\_5)\mathbf{x}\_{2i} + (\beta\_3 + \beta\_6)\mathbf{x}\_{3i} \tag{10}$$

#### **5.3 Model efficiency**

Efficiency can be assessed using three criteria, namely model accuracy, sensitivity, and specificity. Accuracy measures how correctly a diagnostic test identifies and excludes a certain condition; in other words, accuracy measures the goodness of the model. Sensitivity and specificity are measures of the ability of a diagnostic test to correctly identify objects relative to reality [10]; the difference is that sensitivity measures the positive group while specificity measures the negative group. The values of accuracy, sensitivity, and specificity can be obtained from the confusion matrix (**Table 4**) as follows.

$$\begin{aligned} \text{Accuracy} &= \frac{a+d}{a+b+c+d} \\ \text{Sensitivity} &= \frac{a}{a+c} \\ \text{Specificity} &= \frac{d}{b+d} \end{aligned} \tag{11}$$


**Table 4.** *Confusion matrix.*
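A minimal R helper implementing Eq. (11) from the confusion matrix cells is sketched below; the counts passed in the example call are hypothetical.

```r
# Accuracy, sensitivity and specificity from a 2 x 2 confusion matrix
# with cells a, b, c, d laid out as in Table 4.
classification_metrics <- function(a, b, c, d) {
  c(accuracy    = (a + d) / (a + b + c + d),
    sensitivity = a / (a + c),
    specificity = d / (b + d))
}

classification_metrics(a = 40, b = 5, c = 10, d = 45)   # hypothetical counts
```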

#### **5.4 Implementation of integrated cluster with discriminant analysis**

As an example, consider secondary data on homeownership loans obtained from Bank X in Indonesia, where the variables studied are age, credit period, loan to value, and credit collectibility status. The collectibility status consists of two categories, namely current and non-current loans. The age and credit period variables are measured in different units from the loan to value, which is a proportion; therefore, the data need to be standardized before analysis.

When using an integrated cluster with discriminant analysis, the first step is to perform a cluster analysis to obtain the dummy variable. In cluster analysis it is not necessary to test assumptions, because cluster analysis is an exploratory method. If the cluster analysis produces *n* clusters, then *n-1* dummy variables are formed. The analysis used is hierarchical cluster analysis with the average linkage method and Euclidean distance. The number of clusters is determined based on the Silhouette value; the Silhouette values for each candidate number of clusters can be seen in **Table 5**.
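A hedged R sketch of this cluster step, assuming a hypothetical matrix `X` of standardized predictors, is given below; it uses the average silhouette width from the `cluster` package to choose the number of clusters.

```r
# Choosing the number of clusters by the average silhouette width
# (hypothetical standardized predictor matrix X).
library(cluster)

d    <- dist(X, method = "euclidean")
tree <- hclust(d, method = "average")

avg_sil <- sapply(2:5, function(k) {
  grp <- cutree(tree, k = k)
  mean(silhouette(grp, d)[, "sil_width"])   # average silhouette width for k clusters
})
k_best <- (2:5)[which.max(avg_sil)]          # e.g. 2 clusters (cf. Table 5)

grp <- cutree(tree, k = k_best)
D1  <- as.numeric(grp == 2)                  # dummy: 1 if the object falls in cluster 2
```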

Based on **Table 5**, the largest Silhouette value is obtained with 2 clusters, so the optimum number of clusters is 2. The cluster analysis places 71 customers in cluster 1 and 29 customers in cluster 2. Thus, one dummy variable is formed: if an object (customer) belongs to cluster 2, the dummy variable takes the value 1; if it belongs to cluster 1, the dummy variable takes the value 0.

After obtaining the dummy variable, the next step is to test the assumptions of discriminant analysis. Multivariate normality is tested with the Shapiro–Wilk test on the predictor variables, and homogeneity of the covariance matrix is tested with Box's M test, which gives a p-value of 0.9917 (> 0.05). It can therefore be concluded that the data meet the assumptions of multivariate normality and homogeneity of the variance matrix.

Next, the data are analyzed using the integrated cluster with discriminant analysis. Based on the analysis carried out, the following integrated cluster model with discriminant analysis is obtained:


**Table 5.**

*Cluster analysis silhouette results.*

$$y_i = 0.0838x_{1i} + 0.0606x_{2i} - 0.0241x_{3i} + 0.0569D_1x_{1i} + 0.0358D_1x_{2i} - 0.0752D_1x_{3i} \tag{12}$$

Cluster 1 (D*<sup>1</sup>* = 0)

$$y_i = 0.0838x_{1i} + 0.0606x_{2i} - 0.0241x_{3i} \tag{13}$$

Cluster 2 (D*<sup>1</sup>* = 1)

$$y_i = 0.1407x_{1i} + 0.0964x_{2i} - 0.0993x_{3i} \tag{14}$$

Based on the above equations, the coefficients of age and credit term are positive, meaning that the higher the age and the longer the credit term, the greater the probability that customers in cluster 1 and cluster 2 have current credit collectibility. Conversely, loan to value has a negative coefficient, so an increase in its value increases the probability of non-current credit collectibility. The variable that most influences credit collectibility in both clusters is age, which has the largest discriminant coefficient. The classification accuracy, sensitivity, and specificity of the integrated cluster analysis method with discriminant analysis are given in **Table 6** below.

Based on **Table 6**, the classification accuracy is 84%, meaning that the model classifies 84 out of 100 customers correctly. A sensitivity of 84% means that 60 of the 71 customers belonging to the current category are classified correctly by the model. A specificity of 16% means that 5 of the 29 customers belonging to the non-current category are classified correctly by the model.
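A hedged sketch of this integrated cluster and discriminant step in R, continuing the hypothetical `credit` data and `D1` dummy from the previous sketches, is shown below; `MASS::lda()` is used here as a stand-in for the discriminant fit.

```r
# Integrated cluster + discriminant analysis (hypothetical data as above).
library(MASS)

credit$D1 <- D1
fit_lda <- lda(status ~ x1 + x2 + x3 + D1:x1 + D1:x2 + D1:x3, data = credit)
pred    <- predict(fit_lda)$class

conf <- table(predicted = pred, actual = credit$status)   # confusion matrix (cf. Table 4)
sum(diag(conf)) / sum(conf)                               # overall accuracy (cf. Table 6)
```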

#### **6. Regression analysis with dummy variable**

#### **6.1 Regression analysis**

Regression analysis is a method that describes the strength of the relationship between variables. It is divided into two types: simple regression analysis, which involves one predictor variable and one response variable, and multiple regression analysis, which involves several predictor variables and one response variable. Based on Gauss–Markov theory, regression analysis rests on several classical assumptions that must be met, namely that the relationship between variables is correctly specified, the predictor variables are fixed or non-stochastic, the variance is homogeneous, there is no autocorrelation, the errors are normally distributed, and there is no multicollinearity [11].

#### **6.2 Regression analysis with dummy variables**

There are many ways to create a regression model with qualitative predictor variables, one of which is regression with dummy variables.


**Table 6.**

*Value of classification accuracy, sensitivity, and specificity.*

The dummy variable is used to obtain an estimator in a regression model involving qualitative predictor variables [12]. There is no difference in the assumptions underlying regression with or without a dummy variable, because adding a dummy variable is the same as adding any other predictor variable.

There are several rules for coding dummy variables, for example by using binary code (0, 1). For example, there is a qualitative predictor variable with two categories (category 1 and category 2), then the qualitative variable can be defined in the dummy variable as shown in the following equation:

$$D = \begin{cases} 1, & \text{for category 1} \\ 0, & \text{for other} \end{cases} \tag{15}$$

The regression model with dummy variables can be expressed in the following equation:

$$Y\_i = \beta\_0 + \beta\_1 D + \beta\_2 X\_2 + \beta\_3 X\_3 + \beta\_4 X\_4 + \varepsilon\_i \tag{16}$$

Information:

*Yi*: the value of the response variable at the *i-th* observation.

*Xi*: the value of the *i-th* predictor variable.

*β*: regression model parameter.

*εi*: Random error at *i-th* observation.

*i*: index for observation, *i* = 1, 2, … , *n*.

Dummy variables can be entered into the regression model in three different ways (a short R sketch follows this list), namely:

1. Dummy variable as intercept component

2. Dummy variable as slope component

3. Dummy variables as components of both intercept and slope
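A minimal sketch of these three options in R, using hypothetical variables `Y`, `D`, `X2`, `X3`, `X4` in a data frame `dat` that mirrors the notation of Eq. (16), is given below.

```r
# The three ways a dummy variable D can enter a linear regression in lm():
fit_intercept <- lm(Y ~ D + X2 + X3 + X4, data = dat)      # D shifts the intercept
fit_slope     <- lm(Y ~ X2 + D:X2 + X3 + X4, data = dat)   # D modifies the slope of X2
fit_both      <- lm(Y ~ D * X2 + X3 + X4, data = dat)      # D shifts intercept and slope
```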

#### **6.3 Application of regression analysis with dummy variables**

The available data consist of Y = willingness to pay; X1 = a dummy variable, with category 1 indicating that the income in one family is not combined and category 2 indicating that it is combined; X2 = service quality; X3 = environment; and X4 = fairness.

The regression model formed is $Y = b_0 + b_1 D + b_2 X_2 + b_3 X_3 + b_4 X_4$ (**Figure 1**).

Based on the regression analysis performed, the regression model with dummy variables is obtained as follows:

$$Y = 0.54088 + 0.08676 \, D + 0.1579X\_2 + 0.4309X\_3 + 0.2545X\_4 \tag{17}$$

With this model, it is possible to see the difference in willingness to pay between creditors whose family income is combined and those whose income is not combined.

#### **6.4 Assumptions of regression analysis with dummy variable**

#### *6.4.1 Non multicollinearity*

Multicollinearity is a problem in regression in which the predictor variables are correlated with one another. A good regression model has no multicollinearity problem.


**Figure 1.** *Output R.*

Multicollinearity can be checked with the VIF value: if the VIF value is less than 10, there is no multicollinearity problem. In the data used, the VIF value for all variables is less than 10, so the assumption of non-multicollinearity is fulfilled.

#### *6.4.2 Normality error*

The assumption of error normality requires that the errors be normally distributed with mean 0 and variance *σ*<sup>2</sup>. Normality of the errors can be tested with the Shapiro–Wilk test.

H0: the errors are normally distributed

H1: the errors are not normally distributed

*α* = 5%

Based on the normality test, a p-value of 0.91 was obtained, which means that the errors are normally distributed; thus, the assumption of error normality is met.

#### *6.4.3 Non autocorrelation*

The non-autocorrelation assumption test aims to find out whether the errors of different observations are correlated. If the covariance and correlation between errors are not equal to zero, the assumption is violated. The non-autocorrelation assumption can be tested with the Durbin–Watson method. Based on the Durbin–Watson test, a p-value of 0.6132 was obtained, which means that the data meet the non-autocorrelation assumption.

#### *6.4.4 Homoscedasticity*

The assumption of homoscedasticity states that the variance of the errors should remain constant as the mean increases; however, it is possible that an increase in the mean causes the variance to increase as well, so the assumption of homogeneous variance must be tested. Testing this assumption ensures that the resulting estimators are efficient. The assumption of homoscedasticity can be tested with the Breusch–Pagan method.

Based on the analysis conducted using the Breusch–Pagan test, a p-value of 0.130 (greater than 0.05) was obtained, which means that the data meet the assumption of homoscedasticity.
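The assumption checks of this section can be reproduced with standard R functions; a hedged sketch for a model of the form of Eq. (16), on the hypothetical `dat` data frame used above, follows.

```r
# Assumption checks for a dummy-variable regression model (hypothetical data).
library(car)      # vif()
library(lmtest)   # dwtest(), bptest()

fit <- lm(Y ~ D + X2 + X3 + X4, data = dat)

vif(fit)                       # non-multicollinearity: VIF < 10
shapiro.test(residuals(fit))   # error normality (Shapiro-Wilk)
dwtest(fit)                    # non-autocorrelation (Durbin-Watson)
bptest(fit)                    # homoscedasticity (Breusch-Pagan)
```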

#### **6.5 Parameter significance test**

a. Simultaneous test

*H*0: *β*1 = *β*2 = *β*3 = *β*4 = 0

*H*1: there is at least one *βi* ≠ 0

*α* = 5%

Based on the analysis, a p-value of 0.000 was obtained, which means that at least one regression coefficient is significant.

b. Partial test

$$H\_0 \colon \beta\_i = \mathbf{0}$$

$$H\_1 \colon \beta\_i \neq \mathbf{0}$$

$$\alpha = 5\%$$

Based on the analysis, three regression coefficients have a p-value of less than 0.05, namely the coefficients of X2 (service quality), X3 (environment), and X4 (fairness). This means that service quality, environment, and fairness have a significant effect on willingness to pay.

#### **6.6 Model interpretation**

The model obtained, which fulfills all the assumptions of regression analysis with dummy variables, is as follows:

$$y = 0.54088 + 0.08676 \, D + 0.1579x\_2 + 0.4309x\_3 + 0.2545x\_4 \tag{18}$$

With this model, it is possible to see the difference in willingness to pay between creditors whose family income is combined and those whose income is not combined. The coefficient of the dummy variable is 0.08676, which means that when the incomes of creditors in one family are combined, the willingness to pay is greater than when the incomes are not combined. The estimated regression coefficient for X2 (service quality) is 0.1579, which means that the better the bank's service quality, the greater the willingness to pay the credit. The estimated coefficient for X3 (environment) is 0.4309, which means that the better the creditor's environmental conditions, the greater the willingness to pay. The same holds for X4 (fairness), with an estimated coefficient of 0.2545, which means that the fairer the bank institution, the more willing creditors are to pay.

#### **7. Conclusion**

The use of cluster analysis in statistical modeling greatly facilitates capturing the diversity of objects, so that objects with the same characteristics can be grouped into the same group. This is useful in classification methods such as discriminant analysis, because objects within one group are more homogeneous while diversity between groups is high.


The novelty of this chapter is therefore the perspective of the dummy variable technique in which the number of categories of the dummy variable is determined by the number of clusters formed from the cluster analysis. This is then carried forward into statistical modeling, helping researchers divide objects into several groups according to the characteristics of each object while minimizing the diversity within each group.

#### **Conflict of interest**

The authors declare no conflict of interest.

#### **Author details**

Adji Achmad Rinaldo Fernandes\*, Solimun and Nurjannah Departement of Statistics, University of Brawijaya, Malang City, Indonesia

\*Address all correspondence to: fernandes@staff.ub.ac.id

© 2022 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/ by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## **References**

[1] Stattrek.com. Dummy Variables in Regression. 2021. Available from: https://stattrek.com/multipleregression/dummy-variables.aspx

[2] Displayr.com. What are Dummy Variables?. 2019. Available from: https://www.displayr.com/what-aredummy-variables/

[3] Skrivanek S. The Use of Dummy Variables in Regression Analysis. Powell, OH: More Steam, LLC; 2009

[4] Artaya IP. Analisa Regresi Linier Berganda Metode Dummy Banyak Kriteria. 2019. DOI: 10.13140/ RG.2.2.13471.41122

[5] Fernandes AAR. Metode Statistika Multivariat Pemodelan PErsamaan Struktural (SEM) Pendekatan WarPLS. Malang: UB Press; 2017

[6] Johnson RA, Wichern DW. Applied Multivariate Statistical Analysis. Upper Saddle River, NJ: Prentice-Hall; 2007

[7] Gudono. Analisis Data Multivariat Edisi Pertama. Yogyakarta: BPFE; 2011

[8] Tatham RL, Hair JF, Anderson RE, Black WC. Multivariate Data Analysis. New Jersey: Prentice Hall; 1998

[9] Wong HB, Lim GH. Measures of Diagnostic Accuracy: Sensitivity, Specificity, PPV and NPV. Proceedings of Singapore Healthcare. 2011;**20**(4):316-318

[10] Gujarati D. Ekonometri Dasar: Terjemahan Sumarno Zein. Erlangga: Jakarta; 2003

[11] Nawari. Analisis Regresi dengan MS Excel 2007 SPSS 17. Jakarta: PT Elex Media Komputindo; 2010

[12] Le Cessie S, Van Houwelingen JC. Logistic regression for correlated binary data. Journal of the Royal Statistical Society: Series C (Applied Statistics). 1994;**43**(1):95-108

#### **Chapter 3**

## Sparse Boosting Based Machine Learning Methods for High-Dimensional Data

*Mu Yue*

#### **Abstract**

In high-dimensional data, penalized regression is often used for variable selection and parameter estimation. However, these methods typically require time-consuming cross-validation methods to select tuning parameters and retain more false positives under high dimensionality. This chapter discusses sparse boosting based machine learning methods in the following high-dimensional problems. First, a sparse boosting method to select important biomarkers is studied for right censored survival data with high-dimensional biomarkers. Then, a two-step sparse boosting method to carry out variable selection and model-based prediction is studied for high-dimensional longitudinal observations measured repeatedly over time. Finally, a multi-step sparse boosting method to identify patient subgroups that exhibit different treatment effects is studied for high-dimensional dense longitudinal observations. This chapter intends to address the problem of how to improve the accuracy and computational speed of variable selection and parameter estimation in high-dimensional data. It aims to expand the application scope of sparse boosting and to develop new methods for high-dimensional survival analysis, longitudinal data analysis, and subgroup analysis, which have great application prospects.

**Keywords:** sparse boosting, high-dimensional data, machine learning, variable selection, data analysis

#### **1. Introduction**

High-dimensional models have become very popular in the statistical literature, and many new machine learning techniques have been developed to deal with data with a very large number of features. In the past decades, researchers have carried out a great deal of high-dimensional data analysis where the sample size *n* is relatively small but the number of features *p* under consideration is extremely large. It is widely known that including irrelevant predictors in the statistical model may result in unstable estimation and serious computational issues. Thus, variable selection is crucial to address these challenges. Among all developments, regularization procedures such as the LASSO [1], the smoothly clipped absolute deviation (SCAD) [2], the MCP [3] and their various extensions [4–6] have been thoroughly studied and widely used to perform variable selection and estimation simultaneously in order to improve the prediction accuracy and interpretability of the statistical model. However, these penalized estimation approaches all have tuning parameters that must be selected by computationally expensive methods such as cross-validation.

In recent years, machine learning methods such as boosting have become very prominent in high-dimensional data settings, since they can improve the selection accuracy substantially and reduce the chance of including irrelevant features. The original boosting algorithm was proposed by Schapire [7]; it is an ensemble method that iteratively combines weak learners to minimize the expected loss. The major difference among boosting algorithms is the loss function. For example, AdaBoost [8] has the exponential loss function, L2 boosting [9] the squared error loss function, sparse boosting [10] the penalized loss function and HingeBoost [11] the weighted hinge loss function. Recently, further versions of boosting algorithms have been proposed; see, for example, Bühlmann and Hothorn [12] for twin boosting, Komori and Eguchi [13] for pAUCBoost, Wang [14] for twin HingeBoost, Zhao [15] for GSBoosting and Yang and Zou [16] for ER-Boost. Besides these extensions, much effort has been devoted to understanding the advantages of boosting, such as a relatively lower over-fitting risk, a smaller computational cost, and simpler adjustment to include additional constraints.

In this chapter we review sparse boosting based methods for the following high-dimensional problems, based on three research papers. First, a sparse boosting method to select important biomarkers is studied for right censored survival data with high-dimensional biomarkers [17]. Then, a two-step sparse boosting method to carry out variable selection and model-based prediction is studied for high-dimensional longitudinal observations measured repeatedly over time [18]. Finally, a multi-step sparse boosting method to identify patient subgroups that exhibit different treatment effects is studied for high-dimensional dense longitudinal observations [19]. This chapter intends to address the problem of how to improve the accuracy and computational speed of variable selection and parameter estimation in high-dimensional data. It aims to expand the application scope of sparse boosting and to develop new methods for high-dimensional survival analysis, longitudinal data analysis, and subgroup analysis, which have great application prospects.

The rest of the chapter is arranged as follows. In Section 2, a sparse boosting method to fit high-dimensional survival data is studied. In Section 3, a two-step sparse boosting approach to carry out variable selection and model-based prediction by fitting high-dimensional models with longitudinal data is studied. In Section 4, a subgroup identification method incorporating multi-step sparse boosting algorithm for high-dimensional dense longitudinal data is studied. Finally, Section 5 provides concluding remarks.

#### **2. Sparse boosting for survival data**

Survival time data are usually referred to as time-to-event data and are usually censored. Predicting survival time and identifying risk factors can be very helpful for patient treatment selection, disease prevention strategies or disease management in evidence-based medicine. A well-known model in survival analysis is the Cox proportional hazards (PH) model [20], which assumes multiplicative covariate effects on the hazard function. Another popular model is the accelerated failure time (AFT) model [21], which assumes that the covariate effect is to accelerate or decelerate the life course of a disease. The coefficients in the AFT regression model have a direct interpretation as covariate effects on the mean survival time. Recently, researchers have developed boosting methods to analyze survival data.


For example, Schmid and Hothorn [22] proposed a flexible boosting method for parametric AFT models, and Wang and Wang [23] proposed Buckley–James boosting for survival data with right censoring and high dimensionality.

In this section, a sparse boosting method to fit high-dimensional varying-coefficient AFT models is presented; in particular, sparse boosting techniques for right censored survival data are studied. In Section 2.1, the varying-coefficient AFT model for survival data is formulated and a detailed sparse boosting algorithm to fit the model is proposed. In Section 2.2, the proposed sparse boosting techniques are evaluated through simulation studies. In Section 2.3, the performance of sparse boosting is examined with a lung cancer data example.

#### **2.1 Methodology**

#### *2.1.1 Model and estimation*

Let $T_i$ and $C_i$ be the logarithm of the survival time and the censoring time for the $i$th subject in a random sample of size $n$, respectively. In practice, $Y_i = \min\{T_i, C_i\}$ and the censoring indicator $\delta_i = I(T_i \le C_i)$ [24] are observed. Denote by $\mathbf{X}_i = (X_{i,1}, \dots, X_{i,p-1})$ the corresponding $(p-1)$-dimensional predictors, such as gene expressions or biomarkers, for the $i$th subject, and by $U_i$ the univariate index variable. The observed data set $\{(\mathbf{X}_i, \delta_i, Y_i, U_i): \mathbf{X}_i \in \mathbb{R}^{p-1}, \delta_i \in \{0,1\}, Y_i \in \mathbb{R}, U_i \in \mathbb{R}, i = 1, 2, \dots, n\}$ is an independently and identically distributed random sample from $(\mathbf{X}, \delta, Y, U)$. The varying-coefficient AFT model is:

$$T\_i = \beta\_0(U\_i) + \sum\_{j=1}^{p-1} X\_{i,j} \beta\_j(U\_i) + \varepsilon\_i, \quad i = 1, \ldots, n,\tag{1}$$

where $\beta_0(\cdot), \beta_1(\cdot), \dots, \beta_{p-1}(\cdot)$ are the unknown varying-coefficient functions of the confounder $U$ and $\varepsilon_i$ is the random error with $E(\varepsilon_i \mid \mathbf{X}_i, U_i) = 0$.

A weighted least squares estimation approach is adopted. Let the $w_i$ be the Kaplan–Meier weights [25], i.e. the jumps of the Kaplan–Meier estimator, computed as $w_1 = \delta_{(1)}/n$ and $w_i = \frac{\delta_{(i)}}{n-i+1} \prod_{j=1}^{i-1} \left(\frac{n-j}{n-j+1}\right)^{\delta_{(j)}}$, $i = 2, \dots, n$. Let $Y_{(1)} \le \dots \le Y_{(n)}$ be the order statistics of the $Y_i$'s, $\delta_{(1)}, \dots, \delta_{(n)}$ the corresponding censoring indicators of the ordered $Y_i$'s, and let $X_{(1),j}, \dots, X_{(n),j}$, $j = 1, \dots, p-1$, and $U_{(1)}, \dots, U_{(n)}$ be defined similarly. Then the weighted least squares loss function is

$$\sum\_{i=1}^{n} w\_i \left( Y\_{(i)} - \beta\_0 \left( U\_{(i)} \right) - \sum\_{j=1}^{p-1} X\_{(i),j} \beta\_j \left( U\_{(i)} \right) \right)^2. \tag{2}$$
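As a small illustration of the weighting scheme, the following R sketch computes the Kaplan–Meier weights from the formula above; the inputs `y` (observed log-times) and `delta` (censoring indicators) are hypothetical.

```r
# Kaplan-Meier weights w_i for the ordered observations.
km_weights <- function(y, delta) {
  n   <- length(y)
  ord <- order(y)                 # sort by observed time
  d   <- delta[ord]
  w   <- numeric(n)
  w[1] <- d[1] / n
  for (i in 2:n) {
    j <- 1:(i - 1)
    w[i] <- d[i] / (n - i + 1) * prod(((n - j) / (n - j + 1))^d[j])
  }
  w                                # weights aligned with the ordered data
}
```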

Let $\mathbf{B}(\cdot) = (B_1(\cdot), \dots, B_L(\cdot))^T$ be an equal-spaced B-spline basis, where $L$ is the dimension of the basis. Under certain smoothness conditions, the Curry–Schoenberg theorem [26] implies that every smooth function $\beta_j(\cdot)$ can be approximated by

$$\beta_j(\cdot) \approx \mathbf{B}^T(\cdot)\,\boldsymbol{\gamma}_j, \quad j = 0, \dots, p-1, \tag{3}$$

where $\boldsymbol{\gamma}_j$ is a vector of length $L$. Then the weighted least squares loss function Eq. (2) can be approximated by

$$\sum\_{i=1}^{n} w\_i \left( Y\_{(i)} - \mathbf{B}^T (U\_{(i)}) \gamma\_0 - \sum\_{j=1}^{p-1} X\_{(i),j} \mathbf{B}^T (U\_{(i)}) \gamma\_j \right)^2. \tag{4}$$

Denote by $\tilde{\mathbf{Y}} = (Y_{(1)}, \dots, Y_{(n)})^T$, $X_{(i),0} = 1$ for $i = 1, \dots, n$, $\tilde{\mathbf{X}}_j = (\mathbf{B}(U_{(1)})X_{(1),j}, \dots, \mathbf{B}(U_{(n)})X_{(n),j})^T$, $\tilde{\mathbf{X}} = (\tilde{\mathbf{X}}_0, \dots, \tilde{\mathbf{X}}_{p-1})$, $\mathbf{W} = \operatorname{diag}(w_1, \dots, w_n)$ and $\boldsymbol{\gamma} = (\boldsymbol{\gamma}_0^T, \dots, \boldsymbol{\gamma}_{p-1}^T)^T$. Then the objective function Eq. (4) may be written in the following matrix form:

$$\left(\tilde{\mathbf{Y}} - \tilde{\mathbf{X}}\boldsymbol{\gamma}\right)^{T} \mathbf{W} \left(\tilde{\mathbf{Y}} - \tilde{\mathbf{X}}\boldsymbol{\gamma}\right). \tag{5}$$

The estimation may yield a closed-form solution for the coefficients when the dimensionality $p$ is small or moderate. With high dimensionality the solution cannot be easily achieved. Let $\boldsymbol{\gamma}^{[\hat{K}]} = \left( (\boldsymbol{\gamma}_0^{[\hat{K}]})^T, \dots, (\boldsymbol{\gamma}_{p-1}^{[\hat{K}]})^T \right)^T$ be the estimator of $\boldsymbol{\gamma}$ from the sparse boosting approach with the weighted squared loss function Eq. (5), where $\hat{K}$ is the estimated number of stopping iterations. Then the estimates of the coefficient functions are given by

$$\hat{\beta}_j(u) = \mathbf{B}^T(u)\,\boldsymbol{\gamma}_j^{[\hat{K}]}, \quad j = 0, \dots, p-1. \tag{6}$$

Instead of using regularized estimation approaches, a sparse boosting method to estimate $\boldsymbol{\gamma}^{[\hat{K}]}$ is presented in the following subsection.
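Before turning to the boosting algorithm, the sketch below shows how the basis-expanded design matrix $\tilde{\mathbf{X}}$ and the weight matrix of Eq. (5) might be built in R with the `splines` package; the inputs `Y`, `X`, `U`, `delta` are hypothetical, and `km_weights()` is the helper sketched after Eq. (2).

```r
# Building the design matrix X~ and weight matrix W of Eq. (5)
# (hypothetical inputs: Y, delta, an n x (p-1) predictor matrix X,
# and an index variable U).
library(splines)

L   <- 5                                     # B-spline basis dimension
ord <- order(Y)                              # sort by observed log-time
B   <- bs(U[ord], df = L, intercept = TRUE)  # n x L spline basis B(U_(i))
X0  <- cbind(1, X[ord, , drop = FALSE])      # prepend X_(i),0 = 1

# j-th block of X~: columns of B multiplied elementwise by the j-th predictor
Xtilde <- do.call(cbind, lapply(seq_len(ncol(X0)), function(j) B * X0[, j]))
W <- diag(km_weights(Y, delta))              # Kaplan-Meier weight matrix
```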

#### *2.1.2 Sparse boosting techniques*

The key idea of sparse boosting is to replace the empirical risk function in L2 boosting with the penalized empirical risk function which is a combination of squared loss and the trace of boosting operator as a measure of boosting complexity, and then perform gradient descent in a function space iteratively. Thus sparse boosting produces sparser models compared to L2 boosting. The g-prior minimum description length (gMDL) proposed by [27] can be used as the penalized empirical risk function to estimate the update criterion in each iteration and the stopping criterion. The gMDL takes the form:

$$\mathrm{gMDL}(RSS, \operatorname{trace}(\mathcal{B})) = \log(S) + \frac{\operatorname{trace}(\mathcal{B})}{n}\log\left(\frac{\tilde{\mathbf{Y}}^T\tilde{\mathbf{Y}} - RSS}{\operatorname{trace}(\mathcal{B}) \times S}\right), \quad S = \frac{RSS}{n - \operatorname{trace}(\mathcal{B})}, \tag{7}$$

Here *RSS* is the residual sum of squares and B is the boosting operator. The model that achieves the shortest description of data will be selected. The advantage is that it has a data-dependent penalty for each dimension since it is explicitly given as a function of data only, thus the selection of the tuning parameter can be avoided.
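As a small helper, Eq. (7) can be written directly as an R function; `yty` here denotes $\tilde{\mathbf{Y}}^T\tilde{\mathbf{Y}}$.

```r
# gMDL criterion of Eq. (7): RSS, trace of the boosting operator,
# sample size n, and yty = t(Y) %*% Y.
gMDL <- function(RSS, traceB, n, yty) {
  S <- RSS / (n - traceB)
  log(S) + traceB / n * log((yty - RSS) / (traceB * S))
}
```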

The sparse boosting procedure is described in detail as follows. The initial value of $\boldsymbol{\gamma}$ is set to the zero vector, i.e. $\boldsymbol{\gamma}^{[k]} = \mathbf{0}$ for $k = 0$. In each $k$th iteration ($1 \le k \le K$, where $K$ is the total number of iterations), only the current residual $R^{[k]} = \tilde{\mathbf{Y}} - \tilde{\mathbf{X}}\boldsymbol{\gamma}^{[k-1]}$ is used to regress every $j$th working element $\tilde{\mathbf{X}}_j$, $j = 0, \dots, p-1$. The fit, denoted by $\hat{\lambda}_j^{[k]}$, is obtained by minimizing the weighted squared loss function $(R^{[k]} - \tilde{\mathbf{X}}_j\lambda)^T \mathbf{W} (R^{[k]} - \tilde{\mathbf{X}}_j\lambda)$ with respect to $\lambda$. Hence the weighted least squares estimate is $\hat{\lambda}_j^{[k]} = \big[(\tilde{\mathbf{X}}_j)^T \mathbf{W} \tilde{\mathbf{X}}_j\big]^{-1} (\tilde{\mathbf{X}}_j)^T \mathbf{W} R^{[k]}$, the corresponding hat matrix is $\mathcal{H}_j = \tilde{\mathbf{X}}_j\big[(\tilde{\mathbf{X}}_j)^T \mathbf{W} \tilde{\mathbf{X}}_j\big]^{-1} (\tilde{\mathbf{X}}_j)^T \mathbf{W}$, and the weighted residual sum of squares is $RSS_j^{[k]} = (R^{[k]} - \tilde{\mathbf{X}}_j\hat{\lambda}_j^{[k]})^T \mathbf{W} (R^{[k]} - \tilde{\mathbf{X}}_j\hat{\lambda}_j^{[k]})$. The selected component $\hat{s}_k$ is obtained by:

$$\hat{\mathbf{s}}\_{k} = \operatorname{argmin}\_{0 \le j \le p-1} \mathbf{g} \text{MDL} \left( \operatorname{RSS}\_{j}^{[k]}, \operatorname{trace} \left( \mathcal{B}\_{j}^{[k]} \right) \right), \tag{8}$$

where $\mathcal{B}_j^{[1]} = \mathcal{H}_j$ and $\mathcal{B}_j^{[k]} = I - (I - \mathcal{H}_j)(I - \nu\mathcal{H}_{\hat{s}_{k-1}})\cdots(I - \nu\mathcal{H}_{\hat{s}_1})$ for $k > 1$ is the boosting operator for selecting the $j$th component in the $k$th iteration. Therefore, at each iteration only one working component $\tilde{\mathbf{X}}_{\hat{s}_k}$ is chosen, and only the corresponding coefficient vector $\boldsymbol{\gamma}_{\hat{s}_k}^{[k]}$ changes, i.e. $\boldsymbol{\gamma}_{\hat{s}_k}^{[k]} = \boldsymbol{\gamma}_{\hat{s}_k}^{[k-1]} + \nu\hat{\lambda}_{\hat{s}_k}^{[k]}$, where $\nu$ is the step size, while all other $\boldsymbol{\gamma}_j^{[k]}$ for $j \ne \hat{s}_k$ remain the same. This process is repeated for $K$ iterations, and the stopping iteration $\hat{K}$ is estimated by

$$\hat{K} = \operatorname*{argmin}_{1 \le k \le K} \mathrm{gMDL}\left(RSS_{\hat{s}_k}^{[k]}, \operatorname{trace}\left(\mathcal{B}^{[k]}\right)\right), \tag{9}$$

where $\mathcal{B}^{[k]} = I - (I - \nu\mathcal{H}_{\hat{s}_k})\cdots(I - \nu\mathcal{H}_{\hat{s}_1})$.

From this sparse boosting procedure, the estimator of $\boldsymbol{\gamma}$ is obtained as $\boldsymbol{\gamma}^{[\hat{K}]} = \left( (\boldsymbol{\gamma}_0^{[\hat{K}]})^T, \dots, (\boldsymbol{\gamma}_{p-1}^{[\hat{K}]})^T \right)^T$. The sparse boosting algorithm for the varying-coefficient AFT model can be summarized as follows:

Sparse Boosting Algorithm for Varying-Coefficient AFT Model.


The output $\boldsymbol{\gamma}^{[\hat{K}]}$ is the estimate of $\boldsymbol{\gamma}$, and $\hat{\beta}_j(u) = \mathbf{B}^T(u)\boldsymbol{\gamma}_j^{[\hat{K}]}$, $j = 0, \dots, p-1$, are the estimators of the varying coefficients. The final estimator of $\tilde{\mathbf{Y}}$ is $\tilde{\mathbf{Y}}^{[\hat{K}]} = \tilde{\mathbf{X}}\boldsymbol{\gamma}^{[\hat{K}]}$.

According to [10] and references therein, the selection of the step size $\nu$ is of minor importance as long as it is small. A smaller value of $\nu$ achieves higher prediction accuracy but requires a larger number of boosting iterations and more computing time. A typical value used in the literature is $\nu = 0.1$.
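To make the iteration concrete, the following simplified R sketch implements componentwise sparse boosting with gMDL-based selection and stopping for an unweighted linear model in which each column of a hypothetical design matrix `X` is a base learner; the varying-coefficient AFT version described above instead uses the blocks $\tilde{\mathbf{X}}_j$ and the Kaplan–Meier weight matrix in every inner product, and the stopping rule is simplified here to the gMDL value of the current shrunken fit.

```r
# Simplified componentwise sparse boosting with gMDL (hypothetical X, y).
sparse_boost <- function(X, y, K = 100, nu = 0.1) {
  n <- nrow(X); p <- ncol(X)
  Inn <- diag(n)
  gamma <- numeric(p)
  prodB <- Inn                          # (I - nu*H_{s_{k-1}}) ... (I - nu*H_{s_1})
  gmdl <- function(rss, trB) {          # Eq. (7) with yty = sum(y^2)
    S <- rss / (n - trB)
    log(S) + trB / n * log((sum(y^2) - rss) / (trB * S))
  }
  H <- lapply(seq_len(p), function(j)   # hat matrix of each single-column learner
    outer(X[, j], X[, j]) / sum(X[, j]^2))
  path <- matrix(0, K, p); crit <- numeric(K); sel <- integer(K)

  for (k in seq_len(K)) {
    r <- as.vector(y - X %*% gamma)
    # pick the component minimizing gMDL of its candidate boosting operator
    score <- sapply(seq_len(p), function(j) {
      rss <- sum((r - H[[j]] %*% r)^2)
      trB <- sum(diag(Inn - (Inn - H[[j]]) %*% prodB))
      gmdl(rss, trB)
    })
    s <- which.min(score); sel[k] <- s
    gamma[s] <- gamma[s] + nu * sum(X[, s] * r) / sum(X[, s]^2)   # shrunken update
    prodB <- (Inn - nu * H[[s]]) %*% prodB
    path[k, ] <- gamma
    crit[k] <- gmdl(sum((y - X %*% gamma)^2), sum(diag(Inn - prodB)))
  }
  K_hat <- which.min(crit)              # estimated stopping iteration
  list(gamma = path[K_hat, ], selected = unique(sel[seq_len(K_hat)]), K_hat = K_hat)
}
```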

#### **2.2 Simulation**

The performance of the above sparse boosting algorithm is evaluated on simulated data, comparing L2 boosting and sparse boosting in terms of variable selection and function estimation. Sparse boosting is the method presented in this section, while L2 boosting is a relatively simpler version that may not achieve a sparse solution in general.

The simulation results from [17] show that both boosting methods can identify important variables, while sparse boosting selects much fewer irrelevant variables than L2 boosting. Although the in-sample prediction error (defined as $\sum_{i=1}^{n}\delta_i\big(Y_i - \tilde{Y}_i^{[\hat{K}]}\big)^2 / \sum_{i=1}^{n}\delta_i$) using L2 boosting is slightly smaller than that using sparse boosting, since the former has larger model sizes, the average root mean integrated squared error (defined as $\sqrt{\frac{1}{n}\sum_{j=0}^{5}\sum_{i=1}^{n}\big(\beta_j(u_i) - \hat{\beta}_j(u_i)\big)^2}$) using sparse boosting is much smaller than that using L2 boosting. Furthermore, when the smoothness assumption of the Curry–Schoenberg theorem is violated for the coefficient functions, the performance of variable selection remains good. In summary, sparse boosting outperforms L2 boosting in terms of parameter estimation and variable selection.

#### **2.3 Lung cancer data analysis**

Lung cancer is the leading cause of cancer death in the U.S. Identifying relevant gene expressions in lung cancer is important for treatment and prevention. The data are from a large multi-site blinded validation study [28] with 442 lung adenocarcinomas. Age is treated as the potential confounder in this analysis, since it is usually strongly correlated with survival time [29]. After removing observations with missing measurements in the predictors or in overall survival, a total of 439 patients remain in the analysis. For each patient, 22,283 gene expressions are available. The median follow-up time is 46 months (range: 0.03 to 204 months), with an overall censoring rate of 46.47%. The median age at diagnosis is 65 years (range: 33 to 87 years). After adopting a marginal screening procedure to screen out irrelevant genes, variable selection approaches are used to identify important genes associated with lung cancer. For comparison, in addition to L2 boosting and the proposed sparse boosting, the following existing variable selection approaches for constant-coefficient AFT models are also considered: Buckley–James boosting with linear least squares [23], Buckley–James twin boosting with linear least squares [23], and Buckley–James regression with the elastic net penalty [30] and the SCAD penalty, respectively.

The results from [17] show that L2 boosting and sparse boosting for the varying-coefficient AFT model not only produce relatively sparser models, but also have smaller in-sample and out-of-sample prediction errors compared with the four methods for the constant-coefficient AFT model. Again, sparse boosting produces an even sparser model than L2 boosting. In conclusion, including age in the varying-coefficient AFT model can lead to more accurate estimates than the constant-coefficient AFT model, and the proposed sparse boosting method for the varying-coefficient AFT model performs well in terms of estimation, prediction and sparsity.


#### **3. Two-step sparse boosting for longitudinal data**

Longitudinal data contain repeated measurements collected from the same respondents over time. The assumption that all measurements are independent does not hold for such data. One important question in longitudinal analysis is how to make efficient inference by taking into account the within-subject correlation. This question has been investigated in depth by many researchers [31, 32] for parametric models. Semiparametric and nonparametric models for longitudinal data are also presented in the literature; see [33, 34]. Recently, there have been some developments on longitudinal data with high dimensionality using varying-coefficient models [35, 36]. All previous studies adopted penalty methods.

In this section, a two-step sparse boosting approach is presented to perform variable selection and model-based prediction. Specifically, high-dimensional varying-coefficient models with longitudinal data are considered. In the first step, the sparse boosting approach is used to obtain an estimate of the correlation structure. In the second step, the within-subject correlation structure is taken into account, and variable selection and coefficient estimation are again achieved by sparse boosting. The rest of this section is arranged as follows. In Section 3.1, the varying-coefficient model for longitudinal data is formulated and a two-step sparse boosting algorithm is presented. In Section 3.2, simulation studies are conducted to illustrate the validity of the two-step sparse boosting method. In Section 3.3, the performance of the two-step method is assessed with yeast cell cycle gene expression data.

#### **3.1 Methodology**

#### *3.1.1 Model and estimation*

Let $Y_{ij}$ be the continuous outcome for the $j$th measurement of individual $i$, taken at time $t_{ij} \in T$, where $T$ is the time interval on which the measurements are taken. Denote by $\mathbf{X}_{ij} = (X_{ij,1}, \dots, X_{ij,p-1})$ the corresponding $(p-1)$-dimensional covariate vector. The varying-coefficient model, which can capture the dynamic impacts of the covariates on the response variable, is considered:

$$Y_{ij} = \beta_0(t_{ij}) + \sum_{d=1}^{p-1} X_{ij,d}\,\beta_d(t_{ij}) + \varepsilon_{ij}, \quad i = 1, \dots, n, \; j = 1, \dots, n_i, \tag{10}$$

where $\beta_0(\cdot), \beta_1(\cdot), \dots, \beta_{p-1}(\cdot)$ are the unknown smooth coefficient functions of time and $\boldsymbol{\varepsilon}_i = (\varepsilon_{i1}, \dots, \varepsilon_{in_i})^T$, $i = 1, \dots, n$, are multivariate error terms with mean zero. Errors are assumed to be uncorrelated for different $i$, but the components of $\boldsymbol{\varepsilon}_i$ are correlated with each other. Without loss of generality, a balanced longitudinal study is considered in the following implementation, i.e., $t_{ij} = t_{kj}$ and $n_i = m$ for all $i$.

The estimation procedure is presented below. In the first step, the within-subject correlation is ignored first and the coefficients are estimated by minimizing the following least squares loss function:

$$\sum_{i=1}^{n}\sum_{j=1}^{m}\left(Y_{ij} - \beta_0(t_{ij}) - \sum_{d=1}^{p-1} X_{ij,d}\,\beta_d(t_{ij})\right)^2. \tag{11}$$

The B-spline basis is used to estimate the coefficient functions $\beta_0(\cdot), \beta_1(\cdot), \dots, \beta_{p-1}(\cdot)$. Denote by $\mathbf{B}(\cdot) = (B_1(\cdot), \dots, B_L(\cdot))^T$ an equal-spaced B-spline basis of dimension $L$. Under certain smoothness assumptions, the function $\beta_d(\cdot)$ can be approximated by

$$\beta_d(\cdot) \approx \mathbf{B}^T(\cdot)\,\boldsymbol{\gamma}_d, \quad d = 0, \dots, p-1, \tag{12}$$

where $\boldsymbol{\gamma}_d$ is a loading vector of length $L$. Then the least squares loss function Eq. (11) can be approximated by

$$\sum_{i=1}^{n}\sum_{j=1}^{m}\left(Y_{ij} - \mathbf{B}^T(t_{ij})\,\boldsymbol{\gamma}_0 - \sum_{d=1}^{p-1} X_{ij,d}\,\mathbf{B}^T(t_{ij})\,\boldsymbol{\gamma}_d\right)^2. \tag{13}$$

Further denote $\mathbf{Y}_i = (Y_{i1}, \dots, Y_{im})^T$, $\mathbf{Y} = (\mathbf{Y}_1^T, \dots, \mathbf{Y}_n^T)^T$, $X_{ij,0} = 1$, $\tilde{\mathbf{X}}_{i,d} = (\mathbf{B}(t_{i1})X_{i1,d}, \dots, \mathbf{B}(t_{im})X_{im,d})^T$, $\tilde{\mathbf{X}}_i = (\tilde{\mathbf{X}}_{i,0}, \dots, \tilde{\mathbf{X}}_{i,p-1})$, $\tilde{\mathbf{X}} = (\tilde{\mathbf{X}}_1^T, \dots, \tilde{\mathbf{X}}_n^T)^T$ and $\boldsymbol{\gamma} = (\boldsymbol{\gamma}_0^T, \dots, \boldsymbol{\gamma}_{p-1}^T)^T$. Then the target function Eq. (13) can be expressed in the matrix form:

$$\sum\_{i=1}^{n} \left( Y\_i - \tilde{\mathbf{X}}\_i \boldsymbol{\gamma} \right)^T \left( Y\_i - \tilde{\mathbf{X}}\_i \boldsymbol{\gamma} \right) \equiv \left( Y - \tilde{\mathbf{X}} \boldsymbol{\gamma} \right)^T \left( Y - \tilde{\mathbf{X}} \boldsymbol{\gamma} \right). \tag{14}$$
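The following sketch shows one way to build the stacked design matrix $\tilde{\mathbf{X}}$ of Eqs. (13)–(14) for a balanced design. It assumes SciPy's `BSpline` for the basis evaluation; the helper names `bspline_basis` and `build_design`, and the cubic default degree, are illustrative choices rather than specifications from the chapter.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(x, L, degree=3, lo=0.0, hi=1.0):
    """Evaluate an equally spaced B-spline basis of dimension L (L >= degree + 1) at the points x."""
    n_interior = L - degree - 1
    interior = np.linspace(lo, hi, n_interior + 2)[1:-1]
    knots = np.concatenate([[lo] * (degree + 1), interior, [hi] * (degree + 1)])
    # Identity coefficient matrix -> one column per basis function B_1, ..., B_L.
    return BSpline(knots, np.eye(L), degree)(x)          # shape (len(x), L)

def build_design(X, t, L):
    """Stack the blocks Xtilde_{,d} of Eqs. (13)-(14) for a balanced design.

    X : (n, m, p-1) covariate array, t : (m,) common time points, L : basis dimension.
    Returns the (n*m, p*L) matrix [Xtilde_{,0}, ..., Xtilde_{,p-1}], intercept block first.
    """
    n, m, _ = X.shape
    B = bspline_basis(t, L)                               # same basis values for every subject
    blocks = [np.tile(B, (n, 1))]                         # intercept block (X_{ij,0} = 1)
    for d in range(X.shape[2]):
        blocks.append(np.tile(B, (n, 1)) * X[:, :, d].reshape(n * m, 1))
    return np.hstack(blocks)
```

Splitting the returned matrix columnwise into $p$ blocks of $L$ columns recovers the component matrices $\tilde{\mathbf{X}}_{,d}$ used by the componentwise boosting steps below.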

Denote $\boldsymbol{\gamma}^{[\widehat{K}_1]}$ to be the estimator of $\boldsymbol{\gamma}$ obtained by sparse boosting with the squared loss in Eq. (14) as the loss function, where $\widehat{K}_1$ is the estimated number of stopping iterations in this step. There is no closed form for $\boldsymbol{\gamma}^{[\widehat{K}_1]}$ since it is derived from an iterative algorithm; however, it can be evaluated very quickly in a computer implementation. The detailed algorithm is presented in the next subsection.

The first step coefficient estimates are given by

$$\tilde{\beta}_d(t) = \mathbf{B}^T(t)\,\boldsymbol{\gamma}_d^{[\widehat{K}_1]}, \quad d = 0, \cdots, p-1. \tag{15}$$

Write $\hat{\boldsymbol{\varepsilon}}_i = \mathbf{Y}_i - \tilde{\mathbf{X}}_i\boldsymbol{\gamma}^{[\widehat{K}_1]}$, $i = 1, \cdots, n$. The $m \times m$ covariance matrix $\mathrm{Cov}(\mathbf{Y}_i) \equiv \Sigma$ can be estimated by the following empirical estimator:

$$\widehat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} \hat{\boldsymbol{\varepsilon}}_i \hat{\boldsymbol{\varepsilon}}_i^T. \tag{16}$$

In the second step, the estimated correlation structure within repeated measurements is taken into account to form the weighted least squares loss function as follows:

$$\sum\_{i=1}^{n} \left(\mathbf{Y}\_{i} - \tilde{\mathbf{X}}\_{i}\boldsymbol{\gamma}^{\star}\right)^{T} \hat{\boldsymbol{\Sigma}}^{-1} \left(\mathbf{Y}\_{i} - \tilde{\mathbf{X}}\_{i}\boldsymbol{\gamma}^{\star}\right) \equiv \left(\mathbf{Y} - \tilde{\mathbf{X}}\boldsymbol{\gamma}^{\star}\right)^{T} \boldsymbol{W} \left(\mathbf{Y} - \tilde{\mathbf{X}}\boldsymbol{\gamma}^{\star}\right),\tag{17}$$

where $W = \mathrm{diag}\left(\widehat{\Sigma}^{-1}, \cdots, \widehat{\Sigma}^{-1}\right)$ is the estimated $(n \cdot m) \times (n \cdot m)$ weight matrix.

Denote $\boldsymbol{\gamma}^{\star[\widehat{K}_2]}$ to be the estimator of $\boldsymbol{\gamma}^{\star}$ obtained by sparse boosting with the weighted loss in Eq. (17) as the loss function, where $\widehat{K}_2$ is the estimated number of stopping iterations in the second step. Then the coefficient estimates from the second step are given by

$$\hat{\beta}_d(t) = \mathbf{B}^T(t)\,\boldsymbol{\gamma}_d^{\star[\widehat{K}_2]}, \quad d = 0, \cdots, p-1. \tag{18}$$


Reliable estimates of the coefficient functions can then be obtained. More details on how sparse boosting is used to obtain $\boldsymbol{\gamma}^{[\widehat{K}_1]}$ and $\boldsymbol{\gamma}^{\star[\widehat{K}_2]}$ are provided in the following subsection.

#### *3.1.2 Two-step sparse boosting techniques*

gMDL can be adopted as the penalized empirical risk function that serves as both the update criterion in each iteration and the stopping criterion. gMDL can be expressed in the following form:

$$\mathrm{gMDL}\left(RSS, \mathrm{trace}(\mathcal{B})\right) = \log(F) + \frac{\mathrm{trace}(\mathcal{B})}{n \times m} \log\left(\frac{\mathbf{Y}^T\mathbf{Y} - RSS}{\mathrm{trace}(\mathcal{B}) \times F}\right), \tag{19}$$

$$F = \frac{RSS}{n \times m - \mathrm{trace}(\mathcal{B})},$$

where B is the boosting operator and *RSS* is the residual sum of squares.
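As a minimal illustration, the gMDL criterion of Eq. (19) can be computed as follows. The function name is illustrative, and the sketch assumes $0 < \mathrm{trace}(\mathcal{B}) < nm$ and $0 < RSS < \mathbf{Y}^T\mathbf{Y}$ so that both logarithms are defined.

```python
import numpy as np

def gmdl(y, rss, trace_B):
    """gMDL criterion of Eq. (19); y is the stacked response of length n*m."""
    n_obs = len(y)
    F = rss / (n_obs - trace_B)
    return np.log(F) + (trace_B / n_obs) * np.log((y @ y - rss) / (trace_B * F))
```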

The two-step sparse boosting approach is now presented more specifically. In the first step, the starting value of $\boldsymbol{\gamma}$ is set to the zero vector, i.e. $\boldsymbol{\gamma}^{[0]} = \mathbf{0}$, and in each $k_1$th iteration ($0 < k_1 \le K_1$, where $K_1$ is the maximum number of iterations considered in the first step), the residual $R^{[k_1]} = \mathbf{Y} - \tilde{\mathbf{X}}\boldsymbol{\gamma}^{[k_1-1]}$ in the present iteration is used to fit each $d$th component $\tilde{\mathbf{X}}_{,d} = \left(\tilde{\mathbf{X}}_{1,d}^T, \cdots, \tilde{\mathbf{X}}_{n,d}^T\right)^T$, $d = 0, \cdots, p-1$, by treating all within-subject observations as uncorrelated. The fit, denoted by $\hat{\lambda}_d^{[k_1]}$, is calculated by minimizing the squared loss $\left(R^{[k_1]} - \tilde{\mathbf{X}}_{,d}\lambda\right)^T\left(R^{[k_1]} - \tilde{\mathbf{X}}_{,d}\lambda\right)$ with respect to $\lambda$. Therefore, the least squares estimate is $\hat{\lambda}_d^{[k_1]} = \left[\left(\tilde{\mathbf{X}}_{,d}\right)^T\tilde{\mathbf{X}}_{,d}\right]^{-1}\left(\tilde{\mathbf{X}}_{,d}\right)^T R^{[k_1]}$, the corresponding hat matrix is $\mathcal{H}_d = \tilde{\mathbf{X}}_{,d}\left[\left(\tilde{\mathbf{X}}_{,d}\right)^T\tilde{\mathbf{X}}_{,d}\right]^{-1}\left(\tilde{\mathbf{X}}_{,d}\right)^T$, and the residual sum of squares is $RSS_d^{[k_1]} = \left(R^{[k_1]} - \tilde{\mathbf{X}}_{,d}\hat{\lambda}_d^{[k_1]}\right)^T\left(R^{[k_1]} - \tilde{\mathbf{X}}_{,d}\hat{\lambda}_d^{[k_1]}\right)$. The chosen element $\hat{s}_{k_1}$ is attained by:

$$\hat{s}_{k_1} = \operatorname{argmin}_{0 \le d \le p-1} \mathrm{gMDL}\left(RSS_d^{[k_1]}, \mathrm{trace}\left(\mathcal{B}_d^{[k_1]}\right)\right), \tag{20}$$

where $\mathcal{B}_d^{[1]} = \mathcal{H}_d$ and $\mathcal{B}_d^{[k_1]} = I - \left(I - \mathcal{H}_d\right)\left(I - \nu\mathcal{H}_{\hat{s}_{k_1-1}}\right)\cdots\left(I - \nu\mathcal{H}_{\hat{s}_1}\right)$ for $k_1 > 1$ is the first-step boosting operator for choosing the $d$th element in the $k_1$th iteration. Hence, a unique element $\tilde{\mathbf{X}}_{,\hat{s}_{k_1}}$ is selected at each iteration, and only the corresponding coefficient vector $\boldsymbol{\gamma}_{\hat{s}_{k_1}}^{[k_1]}$ changes, i.e., $\boldsymbol{\gamma}_{\hat{s}_{k_1}}^{[k_1]} = \boldsymbol{\gamma}_{\hat{s}_{k_1}}^{[k_1-1]} + \nu\hat{\lambda}_{\hat{s}_{k_1}}^{[k_1]}$, where $\nu$ is the pre-specified step-size parameter. All the other $\boldsymbol{\gamma}_d^{[k_1]}$ for $d \neq \hat{s}_{k_1}$ remain unchanged. This procedure is repeated $K_1$ times and the stopping iteration $\widehat{K}_1$ can be estimated by

$$\widehat{K}_1 = \operatorname{argmin}_{1 \le k_1 \le K_1} \mathrm{gMDL}\left(RSS_{\hat{s}_{k_1}}^{[k_1]}, \mathrm{trace}\left(\mathcal{B}^{[k_1]}\right)\right), \tag{21}$$

where $\mathcal{B}^{[k_1]} = I - \left(I - \nu\mathcal{H}_{\hat{s}_{k_1}}\right)\cdots\left(I - \nu\mathcal{H}_{\hat{s}_1}\right)$.

From the first step of sparse boosting, the estimator of $\boldsymbol{\gamma}$ is obtained as $\boldsymbol{\gamma}^{[\widehat{K}_1]} = \left(\left(\boldsymbol{\gamma}_0^{[\widehat{K}_1]}\right)^T, \cdots, \left(\boldsymbol{\gamma}_{p-1}^{[\widehat{K}_1]}\right)^T\right)^T$. The weight matrix $W$ then follows directly from Eq. (16); a code sketch of this first step is given below.
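The first-step iterations can be sketched as follows. This is a schematic NumPy implementation under the chapter's notation, not the authors' code: it favors clarity over efficiency (the boosting operators are formed explicitly only to obtain their traces), and the stopping rule evaluates gMDL at the residual of the shrunken fit, which is how sparse boosting is commonly implemented; the chapter's $RSS_{\hat{s}_{k_1}}^{[k_1]}$ notation may be read slightly differently. All function names are illustrative.

```python
import numpy as np

def gmdl(y, rss, tr):
    # Compact form of the gMDL criterion of Eq. (19) (see the sketch after Eq. (19)).
    F = rss / (len(y) - tr)
    return np.log(F) + tr / len(y) * np.log((y @ y - rss) / (tr * F))

def sparse_boost_step1(Y, X_blocks, nu=0.1, K1=200):
    """First-step componentwise sparse boosting on the unweighted loss (14).

    Y        : stacked response vector of length n*m.
    X_blocks : list of the p component matrices Xtilde_{,d}, each of shape (n*m, L).
    Returns the per-component coefficient vectors gamma_d at the gMDL-selected stop.
    """
    n_obs = len(Y)
    p = len(X_blocks)
    L = X_blocks[0].shape[1]
    I = np.eye(n_obs)
    # Per-component least squares projections P_d and hat matrices H_d.
    P = [np.linalg.solve(Xd.T @ Xd, Xd.T) for Xd in X_blocks]
    H = [Xd @ Pd for Xd, Pd in zip(X_blocks, P)]
    gamma = [np.zeros(L) for _ in range(p)]
    B = np.zeros((n_obs, n_obs))        # boosting operator B^{[k-1]} (zero before the first iteration)
    R = Y.copy()
    path = []                           # (stopping gMDL, snapshot of gamma) per iteration
    for _ in range(K1):
        scores, fits = [], []
        for d in range(p):
            lam = P[d] @ R
            resid = R - X_blocks[d] @ lam
            B_d = I - (I - H[d]) @ (I - B)          # candidate operator B_d^{[k]}
            scores.append(gmdl(Y, resid @ resid, np.trace(B_d)))
            fits.append(lam)
        s = int(np.argmin(scores))                  # selection step, Eq. (20)
        gamma[s] = gamma[s] + nu * fits[s]          # shrunken coefficient update
        R = R - nu * (X_blocks[s] @ fits[s])
        B = I - (I - nu * H[s]) @ (I - B)           # update the operator B^{[k]}
        path.append((gmdl(Y, R @ R, np.trace(B)), [g.copy() for g in gamma]))
    # Stopping step in the spirit of Eq. (21): keep the iterate minimising gMDL.
    return min(path, key=lambda item: item[0])[1]
```

The returned coefficients give the residuals $\hat{\boldsymbol{\varepsilon}}_i$, from which $\widehat{\Sigma}$ in Eq. (16) and the block-diagonal weight matrix $W$ follow directly.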

In the second step, sparse boosting is used again, taking into account the correlation structure of the repeated measurements estimated in the first step. The initial value of $\boldsymbol{\gamma}^{\star}$ is set to the coefficient estimator from the first step, i.e. $\boldsymbol{\gamma}^{\star[0]} = \boldsymbol{\gamma}^{[\widehat{K}_1]}$, and in each $k_2$th iteration ($0 < k_2 \le K_2$, where $K_2$ is the maximum number of iterations considered in the second step), the residual $R^{\star[k_2]} = \mathbf{Y} - \tilde{\mathbf{X}}\boldsymbol{\gamma}^{\star[k_2-1]}$ in the current iteration is used to fit each $d$th working element $\tilde{\mathbf{X}}_{,d}$, $d = 0, \cdots, p-1$, by incorporating the within-subject correlation estimator from the first step. The fit, denoted by $\hat{\lambda}_d^{\star[k_2]}$, is obtained by minimizing the weighted squared loss $\left(R^{\star[k_2]} - \tilde{\mathbf{X}}_{,d}\lambda\right)^T W \left(R^{\star[k_2]} - \tilde{\mathbf{X}}_{,d}\lambda\right)$ with respect to $\lambda$. Thus, the weighted least squares estimate is $\hat{\lambda}_d^{\star[k_2]} = \left[\left(\tilde{\mathbf{X}}_{,d}\right)^T W \tilde{\mathbf{X}}_{,d}\right]^{-1}\left(\tilde{\mathbf{X}}_{,d}\right)^T W R^{\star[k_2]}$, the corresponding hat matrix is $\mathcal{H}_d^{\star} = \tilde{\mathbf{X}}_{,d}\left[\left(\tilde{\mathbf{X}}_{,d}\right)^T W \tilde{\mathbf{X}}_{,d}\right]^{-1}\left(\tilde{\mathbf{X}}_{,d}\right)^T W$, and the weighted residual sum of squares is $RSS_d^{\star[k_2]} = \left(R^{\star[k_2]} - \tilde{\mathbf{X}}_{,d}\hat{\lambda}_d^{\star[k_2]}\right)^T W \left(R^{\star[k_2]} - \tilde{\mathbf{X}}_{,d}\hat{\lambda}_d^{\star[k_2]}\right)$. The chosen element $\hat{s}_{k_2}$ is obtained by:

$$\hat{s}_{k_2} = \operatorname{argmin}_{0 \le d \le p-1} \mathrm{gMDL}\left(RSS_d^{\star[k_2]}, \mathrm{trace}\left(\mathcal{B}_d^{\star[k_2]}\right)\right), \tag{22}$$

where $\mathcal{B}_d^{\star[1]} = I - \left(I - \mathcal{B}^{[\widehat{K}_1]}\right)\left(I - \mathcal{H}_d^{\star}\right)$ and $\mathcal{B}_d^{\star[k_2]} = I - \left(I - \mathcal{B}^{[\widehat{K}_1]}\right)\left(I - \mathcal{H}_d^{\star}\right)\left(I - \nu\mathcal{H}_{\hat{s}_{k_2-1}}^{\star}\right)\cdots\left(I - \nu\mathcal{H}_{\hat{s}_1}^{\star}\right)$ for $k_2 > 1$ is the second-step boosting operator for choosing the $d$th element in the $k_2$th iteration. Thus, a unique element $\tilde{\mathbf{X}}_{,\hat{s}_{k_2}}$ is selected at each iteration, and only the corresponding coefficient vector $\boldsymbol{\gamma}_{\hat{s}_{k_2}}^{\star[k_2]}$ changes, i.e., $\boldsymbol{\gamma}_{\hat{s}_{k_2}}^{\star[k_2]} = \boldsymbol{\gamma}_{\hat{s}_{k_2}}^{\star[k_2-1]} + \nu\hat{\lambda}_{\hat{s}_{k_2}}^{\star[k_2]}$, while all the other $\boldsymbol{\gamma}_d^{\star[k_2]}$ for $d \neq \hat{s}_{k_2}$ remain the same. This procedure is repeated $K_2$ times and the estimated stopping iteration $\widehat{K}_2$ is

$$\widehat{K}_2 = \operatorname{argmin}_{1 \le k_2 \le K_2} \mathrm{gMDL}\left(RSS_{\hat{s}_{k_2}}^{\star[k_2]}, \mathrm{trace}\left(\mathcal{B}^{\star[k_2]}\right)\right), \tag{23}$$

$$\text{where } \mathcal{B}^{\star \left[k\_{2}\right]} = I - \left(I - \mathcal{B}^{\left[\widehat{K}\_{1}\right]}\right) \left(I - \nu \mathcal{H}^{\star}\_{\widehat{s}\_{k\_{2}}}\right) \cdots \left(I - \nu \mathcal{H}^{\star}\_{\widehat{s}\_{1}}\right).$$
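Relative to the first step, the only change in each inner computation is that all inner products are taken in the $W$ metric. A short sketch of the weighted component fit and of building $W$ is given below; `Sigma_hat` is assumed to be the estimate from Eq. (16), and the function names are illustrative.

```python
import numpy as np
from scipy.linalg import block_diag

def build_weight_matrix(Sigma_hat, n):
    # W = diag(Sigma^{-1}, ..., Sigma^{-1}) as defined below Eq. (17).
    Sigma_inv = np.linalg.inv(Sigma_hat)
    return block_diag(*([Sigma_inv] * n))

def weighted_component_fit(Y, Xd, W):
    """Weighted fit of one component Xtilde_{,d}: coefficient, hat matrix H*_d, weighted RSS."""
    XtW = Xd.T @ W
    G = XtW @ Xd
    lam = np.linalg.solve(G, XtW @ Y)             # weighted least squares fit
    H_star = Xd @ np.linalg.solve(G, XtW)         # weighted hat matrix
    resid = Y - Xd @ lam
    return lam, H_star, resid @ (W @ resid)       # weighted RSS of Eq. (17)
```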

From the second step of sparse boosting, the estimator of $\boldsymbol{\gamma}^{\star}$ is obtained as $\boldsymbol{\gamma}^{\star[\widehat{K}_2]} = \left(\left(\boldsymbol{\gamma}_0^{\star[\widehat{K}_2]}\right)^T, \cdots, \left(\boldsymbol{\gamma}_{p-1}^{\star[\widehat{K}_2]}\right)^T\right)^T$. The two-step sparse boosting algorithm for the varying-coefficient model with longitudinal data can be summarized as follows:

Two-step Sparse Boosting Algorithm with Longitudinal Data.

Step I: Use sparse boosting to estimate the covariance matrix.

Apply sparse boosting to the unweighted loss in Eq. (14) to obtain the first-step estimator $\boldsymbol{\gamma}^{[\widehat{K}_1]} = \left(\left(\boldsymbol{\gamma}_0^{[\widehat{K}_1]}\right)^T, \cdots, \left(\boldsymbol{\gamma}_{p-1}^{[\widehat{K}_1]}\right)^T\right)^T$, and estimate the covariance matrix by

$$\widehat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} \left(\mathbf{Y}_i - \tilde{\mathbf{X}}_i\boldsymbol{\gamma}^{[\widehat{K}_1]}\right)\left(\mathbf{Y}_i - \tilde{\mathbf{X}}_i\boldsymbol{\gamma}^{[\widehat{K}_1]}\right)^T.$$

Step II: Incorporate the estimated covariance structure through the weight matrix $W$ and apply sparse boosting to the weighted loss in Eq. (17) to obtain $\boldsymbol{\gamma}^{\star[\widehat{K}_2]}$ and the coefficient estimates in Eq. (18).


#### **3.2 Simulation**

Simulation studies from [18] compare the following four methods in terms of variable selection and function estimation performance. M1: two-step L2 boosting (squared loss for the update criterion and gMDL for the stopping criterion); M2: two-step sparse boosting; M3: two-step lasso (lasso regression in the first step to estimate the within-subject correlation structure using Eq. (14), and lasso regression again in the second step taking the estimated correlation structure into account); and M4: two-step elastic net regression (as M3, with the elastic net mixing parameter set to 0.5).

The simulation results from [18] show that all methods are able to identify the important variables. However, in terms of sparsity, the two-step sparse boosting method performs best, with the smallest number of false positives. Both penalization methods select many more irrelevant variables than the boosting methods, with the elastic net selecting the most. For two-step sparse boosting, the variable selection results are quite stable from step I to step II, whereas for the other approaches the false positives, and thus the model sizes, grow from step I to step II. Two-step sparse boosting also yields the smallest bias in coefficient estimation among the competing methods. The refined estimates that incorporate the within-subject correlation generally perform better than the initial estimates that ignore it, since the second step reduces the bias, especially when the within-subject correlation is high. In other words, the reduction in bias from step I to step II is much larger when the within-subject correlation is stronger. This is intuitive, as the second step takes into account the within-subject correlation structure estimated in the first step. Similar results are obtained for the bias of the estimated covariance matrix: the bias is smaller under weaker within-subject correlation, and two-step sparse boosting yields a smaller bias of the estimated covariance matrix than the competing methods when the within-subject correlation is high. In summary, the variable selection and functional coefficient estimation performance of two-step sparse boosting is quite satisfactory.

#### **3.3 Yeast cell cycle gene expression data analysis**

The cell cycle is one of the most important activities of life, by which cells grow, replicate their chromosomes, undergo mitosis, and split into daughter cells. Identifying cell cycle-regulated genes is therefore very important. Adopting a model-based approach, Luan and Li [37] identified $n = 297$ cell cycle-regulated genes based on the $\alpha$-factor synchronization experiments. All gene expression levels were measured at $m = 18$ different time points covering two cell-cycle periods. Using the same subset of the original data as in [38], a total of $p = 96$ transcription factors (TFs) are included as predictors in the downstream analysis. Wei, Huang and Li [39] showed that the effects of the TFs on gene expression levels are time-dependent. After independence screening by the $l_2$-norm [40] to screen out irrelevant predictors in a first step, several methods can be used to identify the key TFs involved in gene regulation. In addition to two-step L2 boosting and two-step sparse boosting, which take the within-subject correlation into account in the second step, one-step L2 boosting and one-step sparse boosting, which ignore the within-subject correlation, are also considered for comparison. Besides, some two-step penalized approaches are also considered: two-step lasso, two-step adaptive lasso and two-step elastic net (with elastic net mixing parameter 0.5).

The results from [18] show that the boosting approaches yield sparser models than the penalized methods. Sparse boosting yields an even sparser model and smaller estimation and prediction errors than L2 boosting. Two-step boosting achieves better performance than one-step boosting, with smaller estimation and


prediction errors. The two-step sparse boosting method yields the sparsest model, with the smallest in-sample and out-of-sample prediction errors among all methods. In terms of the selected TFs, there is a significant overlap between two-step sparse boosting and each of the other methods. In conclusion, the two-step sparse boosting approach performs quite well in terms of variable selection, coefficient estimation and prediction, and can provide useful information for identifying the important TFs that take part in the network of regulations.

#### **4. Multi-step sparse boosting for subgroup identification**

As personalized medicine gains popularity, identifying subgroups of patients that can obtain a higher efficacy from a treatment becomes greatly important. Recently, a number of statistical approaches have been proposed to identify subgroups of patients who may be suitable for different treatments. Traditionally, subgroup identification is achieved by parametric partitioning approaches such as Bayesian approaches [41] or classification and regression trees (CART) [42]. More recently, recursive partitioning methods have gained popularity since they achieve greater generalizability and efficiency. Such methods include MOB [43], PRIM [44], sequential-BATTing [45] and other non-parametric methods. For a detailed literature review of subgroup identification, refer to Lipkovich et al. [46]. In this section, a sparse boosting based subgroup identification method is presented in the context of dense longitudinal data.

In particular, a formal subgroup identification method for high-dimensional dense longitudinal data is presented. It incorporates multi-step sparse boosting into homogeneity pursuit via change point detection. First, a sparse boosting algorithm for individual modeling is performed to obtain initial estimates. Then, change point detection via binary segmentation is used to identify the subgroup structure of the patients. Lastly, the model is refitted on each identified subgroup, and sparse boosting is again utilized to remove irrelevant predictors and yield reliable final estimates. The rest of the section is organized as follows. In Section 4.1, the subgroup model is formulated and a detailed method for subgroup identification and estimation is presented. In Section 4.2, the subgroup identification technique is evaluated through simulation studies. In Section 4.3, the feasibility and applicability of the approach is validated by studying a wallaby growth dataset.

#### **4.1 Methodology**

#### *4.1.1 Patients model*

Let $Y_{it}$ be the continuous measurement at the $t$th follow-up for patient $i$, where $i = 1, \cdots, n$ and $t = 1, \cdots, T_i$. Let $\mathbf{X}_{it} = \left(X_{it,1}, \cdots, X_{it,p}\right)$ be the corresponding $p$-dimensional predictor vector. Assume the $n$ patients are independent. The following longitudinal model for the patients is considered:

$$Y_{it} = \tilde{\beta}_{i,0} + \sum_{j=1}^{p} X_{it,j}\,\tilde{\beta}_{i,j} + \varepsilon_{it}, \quad i = 1, \cdots, n,\ t = 1, \cdots, T_i, \tag{24}$$

where $\boldsymbol{\varepsilon}_i = \left(\varepsilon_{i1}, \cdots, \varepsilon_{iT_i}\right)^T$, $i = 1, \cdots, n$, are multivariate error terms with mean zero. Errors are assumed to be uncorrelated across different $i$, but the components of $\boldsymbol{\varepsilon}_i$ are correlated with each other.

Moreover, the model is further assumed to have the following subgroup structure:

$$
\tilde{\beta}\_{i,j} = \begin{cases}
\beta\_{1,j} & \text{when} \quad i \in \Omega\_{1,j} \\
\beta\_{2,j} & \text{when} \quad i \in \Omega\_{2,j} \\
\vdots & \vdots \\
\beta\_{N\_j+1,j} & \text{when} \quad i \in \Omega\_{N\_j+1,j}
\end{cases}
\tag{25}
$$

The partition of the regression coefficients $\left\{\tilde{\beta}_{i,j} : 1 \le i \le n\right\}$ is $\left\{\Omega_{k,j} : 1 \le k \le N_j + 1\right\}$, which is unknown, and thus there are $N_j + 1$ subgroups for the $j$th predictor. All patients are divided into at least $\max_j \left(N_j + 1\right)$ and at most $\prod_{j=0}^{p}\left(N_j + 1\right)$ subgroups by the model. Patients in the same subgroup share a similar relationship between the response and the predictors and have the same set of regression coefficients, while different subgroups have different overall relationships between the response and the covariates. The main aim is to investigate the effects of the predictors on the response for the different subgroups.

However, if the number of predictors under consideration is much larger than the number of patients and the number of follow-ups, estimating the regression coefficients becomes a serious challenge. Therefore, instead of adopting traditional methods (e.g., MLE), sparse boosting can be used to estimate the regression coefficients. In this way, the dimensionality of the features is reduced and the coefficient estimates are obtained simultaneously.

#### *4.1.2 Subgroup identification and estimation*

Denote $\tilde{\boldsymbol{\beta}}_i = \left(\tilde{\beta}_{i,0}, \cdots, \tilde{\beta}_{i,p}\right)^T$ and $\tilde{\boldsymbol{\beta}} = \left(\tilde{\boldsymbol{\beta}}_1^T, \cdots, \tilde{\boldsymbol{\beta}}_n^T\right)^T$. First, an initial estimator of $\tilde{\boldsymbol{\beta}}_i$ is calculated for each subject $i$ through the sparse boosting approach using his or her own repeated measurements; then, homogeneity pursuit via change point detection is used to identify the change points among the $\tilde{\beta}_{i,j}$s; lastly, the $\tilde{\boldsymbol{\beta}}_i$s are replaced according to the identified subgroup structure, and the final estimator of the regression coefficients is obtained by the sparse boosting algorithm again. The steps for estimating $\tilde{\boldsymbol{\beta}}_i$ are outlined below.

In the first step, individualized modeling via sparse boosting is performed. For each individual $i$, the initial coefficients $\tilde{\boldsymbol{\beta}}_i$ can be estimated by minimizing the following least squares loss function:

$$\sum\_{t=1}^{T\_i} \left( Y\_{it} - \tilde{\boldsymbol{\beta}}\_{i,0} - \sum\_{j=1}^p X\_{it,j} \tilde{\boldsymbol{\beta}}\_{i,j} \right)^2. \tag{26}$$

Let $\mathbf{Y}_i = \left(Y_{i1}, \cdots, Y_{iT_i}\right)^T$, $X_{it,0} = 1$, $\mathbf{X}_{i,j} = \left(X_{i1,j}, \cdots, X_{iT_i,j}\right)^T$ and $\mathbf{X}_i = \left(\mathbf{X}_{i,0}, \cdots, \mathbf{X}_{i,p}\right)$. Then the function in Eq. (26) can be written in matrix form:

$$\left(\mathbf{Y}\_{i} - \mathbf{X}\_{i}\tilde{\boldsymbol{\beta}}\_{i}\right)^{T}\left(\mathbf{Y}\_{i} - \mathbf{X}\_{i}\tilde{\boldsymbol{\beta}}\_{i}\right). \tag{27}$$

Denote $\tilde{\boldsymbol{\beta}}_i^{[\widehat{L}_i]} = \left(\tilde{\beta}_{i,0}^{[\widehat{L}_i]}, \cdots, \tilde{\beta}_{i,p}^{[\widehat{L}_i]}\right)^T$ to be the estimator of $\tilde{\boldsymbol{\beta}}_i$ obtained by sparse boosting with Eq. (27) as the loss function, where $\widehat{L}_i$ is the estimated number of stopping iterations in this step. This is the initial estimator of $\tilde{\boldsymbol{\beta}}_i$. The detailed sparse boosting algorithm is presented in the next subsection.

In the second step, homogeneity pursuit via change point detection is performed. The binary segmentation algorithm [47] is used to detect the change points among $\tilde{\beta}_{i,j}^{[\widehat{L}_i]}$, $i = 1, \cdots, n$, and to identify the subgroup structure. Let $\tilde{\beta}_{i,j}^{[\widehat{L}_i]}$ be the $(j+1)$th component of $\tilde{\boldsymbol{\beta}}_i^{[\widehat{L}_i]}$. For the $j$th covariate, the values $\tilde{\beta}_{i,j}^{[\widehat{L}_i]}$, $i = 1, \cdots, n$, are sorted in ascending order and denoted by $b_{(1)} \le \cdots \le b_{(n)}$. Denote $r_{i,j}$ to be the rank of $\tilde{\beta}_{i,j}^{[\widehat{L}_i]}$.

For any 1≤ *l*<sup>1</sup> < *l*<sup>2</sup> ≤*n*, denote the scaled difference between the partial means of the first *τ* � *l*<sup>1</sup> þ 1 observations and the last *l*<sup>2</sup> � *τ* observations to be

$$H_{l_1 l_2}(\tau) = \sqrt{\frac{(l_2 - \tau)(\tau - l_1 + 1)}{l_2 - l_1 + 1}} \left( \frac{\sum_{i=\tau+1}^{l_2} b_{(i)}}{l_2 - \tau} - \frac{\sum_{i=l_1}^{\tau} b_{(i)}}{\tau - l_1 + 1} \right). \tag{28}$$

Denote $\delta$ to be the threshold, which is a tuning parameter that can be selected by AIC or BIC. The binary segmentation algorithm then proceeds as follows (a code sketch follows the list below):

1. Find $\hat{t}_1$ such that

$$H\_{1,n}(\hat{t}\_1) = \max\_{1 \le \tau < n} H\_{1,n}(\tau). \tag{29}$$

If $H_{1,n}(\hat{t}_1) \le \delta$, there are no change points among $b_{(l)}$, $l = 1, \cdots, n$, and the change point detection process terminates. Otherwise, $\hat{t}_1$ is added to the set of change points and the region $\{\tau : 1 \le \tau \le n\}$ is divided into two subregions: $\{\tau : 1 \le \tau \le \hat{t}_1\}$ and $\{\tau : \hat{t}_1 + 1 \le \tau \le n\}$.

2. Find the change points in the two subregions derived in part (1), respectively. Consider the region $\{\tau : 1 \le \tau \le \hat{t}_1\}$ first. Find $\hat{t}_2$ such that

$$H\_{1, \hat{t}\_1}(\hat{t}\_2) = \max\_{1 \le \tau < \hat{t}\_1} H\_{1, \hat{t}\_1}(\tau). \tag{30}$$

If $H_{1,\hat{t}_1}(\hat{t}_2) \le \delta$, there is no change point in the region $\{\tau : 1 \le \tau \le \hat{t}_1\}$. Otherwise, add $\hat{t}_2$ to the set of change points and divide the region $\{\tau : 1 \le \tau \le \hat{t}_1\}$ into two subregions: $\{\tau : 1 \le \tau \le \hat{t}_2\}$ and $\{\tau : \hat{t}_2 + 1 \le \tau \le \hat{t}_1\}$. Similarly, for the region $\{\tau : \hat{t}_1 + 1 \le \tau \le n\}$, $\hat{t}_3$ can be found such that

$$H\_{\hat{t}\_1+1,n}(\hat{t}\_3) = \max\_{\hat{t}\_1+1 \le \tau < n} H\_{\hat{t}\_1+1,n}(\tau). \tag{31}$$

If $H_{\hat{t}_1+1,n}(\hat{t}_3) \le \delta$, there is no change point in the region $\{\tau : \hat{t}_1 + 1 \le \tau \le n\}$. Otherwise, add $\hat{t}_3$ to the set of change points and divide the region $\{\tau : \hat{t}_1 + 1 \le \tau \le n\}$ into two subregions: $\{\tau : \hat{t}_1 + 1 \le \tau \le \hat{t}_3\}$ and $\{\tau : \hat{t}_3 + 1 \le \tau \le n\}$.

3. For each subregion derived in part (2), the above procedure is repeated recursively until no change point is detected in any subregion.
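A compact sketch of the statistic in Eq. (28) and of the recursive search above is given below. It assumes the sorted values $b_{(1)} \le \cdots \le b_{(n)}$ are supplied as a NumPy array; the function names and the recursive formulation are illustrative.

```python
import numpy as np

def H_stat(b, l1, l2, tau):
    # Scaled difference of partial means in Eq. (28); b holds the sorted values b_(1), ..., b_(n),
    # and l1, l2, tau follow the 1-based indexing of the text.
    left = b[l1 - 1:tau].mean()
    right = b[tau:l2].mean()
    scale = np.sqrt((l2 - tau) * (tau - l1 + 1) / (l2 - l1 + 1))
    return scale * (right - left)

def binary_segmentation(b, delta):
    """Recursive change point detection over the sorted values b with threshold delta."""
    change_points = []

    def search(l1, l2):
        if l2 <= l1:                              # fewer than two points: nothing to split
            return
        taus = list(range(l1, l2))                # candidate splits l1 <= tau < l2
        stats = [H_stat(b, l1, l2, tau) for tau in taus]
        k = int(np.argmax(stats))
        if stats[k] <= delta:                     # no change point detected in this region
            return
        t_hat = taus[k]
        change_points.append(t_hat)
        search(l1, t_hat)                         # left subregion  {l1, ..., t_hat}
        search(t_hat + 1, l2)                     # right subregion {t_hat + 1, ..., l2}

    search(1, len(b))
    return sorted(change_points)
```

Running this once per covariate on the sorted initial estimates, and cutting their ranks at the detected change points, yields the group labels and subgroups described around Eq. (33).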

The estimated locations for change points are sorted in increasing order and denoted by

$$
\hat{t}\_{(1)} < \hat{t}\_{(2)} < \cdots < \hat{t}\_{\left(\hat{N}\_{j}\right)},\tag{32}
$$

where $\widehat{N}_j$ is the number of detected change points and can be used to estimate $N_j$. Further denote $\hat{t}_{(0)} = 0$ and $\hat{t}_{(\widehat{N}_j+1)} = n$. Let $\widehat{R}_{i,j} = \left\{\ell : \hat{t}_{(\ell-1)} < r_{i,j} \le \hat{t}_{(\ell)}\right\}$, $1 \le \ell \le \widehat{N}_j + 1$; then $\left\{\widehat{R}_{i,j} : 1 \le i \le n\right\}$ can be used to estimate the grouping index $\left\{R_{i,j} : 1 \le i \le n\right\}$. The above algorithm is applied to identify the change points for all $j = 0, \cdots, p$, which yields $\left\{\widehat{R}_{i,j} : 1 \le i \le n,\ 0 \le j \le p\right\}$. Let $\left\{\widehat{R}_{\ell,j}^{\star} : 1 \le \ell \le \widehat{N},\ 0 \le j \le p\right\}$ be the unique rows of $\left\{\widehat{R}_{i,j} : 1 \le i \le n,\ 0 \le j \le p\right\}$; then $\widehat{N}$ is the estimated total number of subgroups for the patients, and the index set of patients in group $\ell$ is

$$
\widehat{\Omega}_{\ell} = \left\{ i : \widehat{R}_{i,\cdot} = \widehat{R}_{\ell,\cdot}^{\star} \right\}, \quad 1 \le \ell \le \widehat{N}. \tag{33}
$$

All the coefficients $\tilde{\beta}_{i,j}$ for patients in the same estimated subgroup $\widehat{\Omega}_{\ell}$ are treated as equal.

In the third step, subgroup modeling is performed by sparse boosting. Incorporating the patient structure identified in step 2, the model is refitted on each of the subgroups via sparse boosting with the following least squares loss function:

$$\sum\_{i \in \hat{\Omega}\_{\ell}} \sum\_{t=1}^{T\_i} \left( Y\_{it} - \tilde{\boldsymbol{\beta}}\_{i,0} - \sum\_{j=1}^{p} X\_{it,j} \tilde{\boldsymbol{\beta}}\_{i,j} \right)^2, \quad \mathbf{1} \le \ell \le \hat{\mathcal{N}}. \tag{34}$$

Further denote $\mathbf{Y}_{\ell}^{\star} = \left(\mathbf{Y}_{\widehat{\Omega}_{\ell}[1]}^T, \cdots, \mathbf{Y}_{\widehat{\Omega}_{\ell}[|\widehat{\Omega}_{\ell}|]}^T\right)^T$, $\mathbf{X}_{\ell,j}^{\star} = \left(\mathbf{X}_{\widehat{\Omega}_{\ell}[1],j}^T, \cdots, \mathbf{X}_{\widehat{\Omega}_{\ell}[|\widehat{\Omega}_{\ell}|],j}^T\right)^T$, $\mathbf{X}_{\ell}^{\star} = \left(\mathbf{X}_{\ell,0}^{\star}, \cdots, \mathbf{X}_{\ell,p}^{\star}\right)$ and $\tilde{\boldsymbol{\beta}}_{\ell}^{\star} = \left(\tilde{\boldsymbol{\beta}}_{\widehat{\Omega}_{\ell}[1]}^T, \cdots, \tilde{\boldsymbol{\beta}}_{\widehat{\Omega}_{\ell}[|\widehat{\Omega}_{\ell}|]}^T\right)^T$ for $\ell = 1, \cdots, \widehat{N}$, where $\widehat{\Omega}_{\ell}[i]$ is the $i$th element of $\widehat{\Omega}_{\ell}$ and $|\widehat{\Omega}_{\ell}|$ is the number of elements in $\widehat{\Omega}_{\ell}$. The function in Eq. (34) can be written in matrix form:

$$\left(\mathbf{Y}_{\ell}^{\star} - \mathbf{X}_{\ell}^{\star}\tilde{\boldsymbol{\beta}}_{\ell}^{\star}\right)^T\left(\mathbf{Y}_{\ell}^{\star} - \mathbf{X}_{\ell}^{\star}\tilde{\boldsymbol{\beta}}_{\ell}^{\star}\right), \quad 1 \le \ell \le \widehat{N}. \tag{35}$$

Denote $\tilde{\boldsymbol{\beta}}_{\ell}^{\star[\widehat{L}_{\ell}^{\star}]}$ to be the estimate of $\tilde{\boldsymbol{\beta}}_{\ell}^{\star}$ obtained by sparse boosting with Eq. (35) as the loss function, where $\widehat{L}_{\ell}^{\star}$ is the estimated number of stopping iterations in this step. The estimator of the coefficients $\tilde{\boldsymbol{\beta}}_i$ is

$$\hat{\boldsymbol{\beta}}_i = \tilde{\boldsymbol{\beta}}_{\ell}^{\star[\widehat{L}_{\ell}^{\star}]} \quad \text{for } i \in \widehat{\Omega}_{\ell}, \quad 1 \le i \le n. \tag{36}$$

More details about how to use sparse boosting to obtain $\left\{\tilde{\boldsymbol{\beta}}_i^{[\widehat{L}_i]} : 1 \le i \le n\right\}$ and $\left\{\tilde{\boldsymbol{\beta}}_{\ell}^{\star[\widehat{L}_{\ell}^{\star}]} : 1 \le \ell \le \widehat{N}\right\}$ are given in the following subsection.

#### *4.1.3 Multi-step sparse boosting techniques*

gMDL can again be used as the penalized empirical risk function that serves as the update criterion in each iteration and as the stopping criterion, thereby avoiding the selection of a tuning parameter. gMDL can be expressed in the following form:


$$\text{gMDL}(Y,\text{RSS},\text{trace}(\mathcal{B})) = \log(F) + \frac{\text{trace}(\mathcal{B})}{|Y|} \log\left(\frac{Y^T Y - \text{RSS}}{\text{trace}(\mathcal{B}) \times F}\right),\tag{37}$$

$$F = \frac{\text{RSS}}{|Y| - \text{trace}(\mathcal{B})},$$

where $Y$ is the response vector, $|Y|$ is the length of $Y$, $\mathcal{B}$ is the boosting operator and $RSS$ is the residual sum of squares.

The sparse boosting procedure is described in detail. The starting value of $\tilde{\boldsymbol{\beta}}_i$ is set to the zero vector, i.e. $\tilde{\boldsymbol{\beta}}_i^{[0]} = \mathbf{0}$, and in each $l_i$th iteration ($0 < l_i \le L_i$, where $L_i$ is the maximum number of iterations considered in this step), the residual $R^{[l_i]} = \mathbf{Y}_i - \mathbf{X}_i\tilde{\boldsymbol{\beta}}_i^{[l_i-1]}$ in the present iteration is used to fit each $j$th element $\mathbf{X}_{i,j}$, $j = 0, \cdots, p$. The fit, denoted by $\hat{\lambda}_j^{[l_i]}$, is obtained by minimizing the squared loss $\left(R^{[l_i]} - \mathbf{X}_{i,j}\lambda\right)^T\left(R^{[l_i]} - \mathbf{X}_{i,j}\lambda\right)$ with respect to $\lambda$. Thus, the least squares estimate is $\hat{\lambda}_j^{[l_i]} = \left[\left(\mathbf{X}_{i,j}\right)^T\mathbf{X}_{i,j}\right]^{-1}\left(\mathbf{X}_{i,j}\right)^T R^{[l_i]}$, the corresponding hat matrix is $\mathcal{H}_j = \mathbf{X}_{i,j}\left[\left(\mathbf{X}_{i,j}\right)^T\mathbf{X}_{i,j}\right]^{-1}\left(\mathbf{X}_{i,j}\right)^T$, and the residual sum of squares is $RSS_j^{[l_i]} = \left(R^{[l_i]} - \mathbf{X}_{i,j}\hat{\lambda}_j^{[l_i]}\right)^T\left(R^{[l_i]} - \mathbf{X}_{i,j}\hat{\lambda}_j^{[l_i]}\right)$. The selected entry $\hat{s}_{l_i}$ is obtained by:

$$\hat{s}_{l_i} = \operatorname{argmin}_{0 \le j \le p} \mathrm{gMDL}\left(\mathbf{Y}_i, RSS_j^{[l_i]}, \mathrm{trace}\left(\mathcal{B}_j^{[l_i]}\right)\right), \tag{38}$$

where $\mathcal{B}_j^{[1]} = \mathcal{H}_j$ and $\mathcal{B}_j^{[l_i]} = I - \left(I - \mathcal{H}_j\right)\left(I - \nu\mathcal{H}_{\hat{s}_{l_i-1}}\right)\cdots\left(I - \nu\mathcal{H}_{\hat{s}_1}\right)$ for $l_i > 1$ is the boosting operator for choosing the $j$th entry in the $l_i$th iteration of this step. Hence, a unique element $\mathbf{X}_{i,\hat{s}_{l_i}}$ is selected at each iteration, and only the corresponding coefficient $\tilde{\beta}_{i,\hat{s}_{l_i}}^{[l_i]}$ changes, i.e., $\tilde{\beta}_{i,\hat{s}_{l_i}}^{[l_i]} = \tilde{\beta}_{i,\hat{s}_{l_i}}^{[l_i-1]} + \nu\hat{\lambda}_{\hat{s}_{l_i}}^{[l_i]}$, where $\nu$ is the pre-specified step-size parameter. All the other $\tilde{\beta}_{i,j}^{[l_i]}$ for $j \neq \hat{s}_{l_i}$ remain unchanged. This procedure is repeated $L_i$ times and the stopping iteration $\widehat{L}_i$ can be estimated by

$$\widehat{L}_i = \operatorname{argmin}_{1 \le l_i \le L_i} \mathrm{gMDL}\left(\mathbf{Y}_i, RSS_{\hat{s}_{l_i}}^{[l_i]}, \mathrm{trace}\left(\mathcal{B}^{[l_i]}\right)\right), \tag{39}$$

where $\mathcal{B}^{[l_i]} = I - \left(I - \nu\mathcal{H}_{\hat{s}_{l_i}}\right)\cdots\left(I - \nu\mathcal{H}_{\hat{s}_1}\right)$.

From the above sparse boosting approach, the estimator of $\tilde{\boldsymbol{\beta}}_i$ is $\tilde{\boldsymbol{\beta}}_i^{[\widehat{L}_i]} = \left(\tilde{\beta}_{i,0}^{[\widehat{L}_i]}, \cdots, \tilde{\beta}_{i,p}^{[\widehat{L}_i]}\right)^T$, $i = 1, \cdots, n$. The subgroup structure can then be obtained by homogeneity pursuit via change point detection.

Next, sparse boosting is used again for each estimated subgroup. The starting value of $\tilde{\boldsymbol{\beta}}_{\ell}^{\star}$ is set to the zero vector, i.e. $\tilde{\boldsymbol{\beta}}_{\ell}^{\star[0]} = \mathbf{0}$, and in each $l_{\ell}^{\star}$th iteration ($0 < l_{\ell}^{\star} \le L_{\ell}^{\star}$, where $L_{\ell}^{\star}$ is the maximum number of iterations considered in this stage), the residual $R^{\star[l_{\ell}^{\star}]} = \mathbf{Y}_{\ell}^{\star} - \mathbf{X}_{\ell}^{\star}\tilde{\boldsymbol{\beta}}_{\ell}^{\star[l_{\ell}^{\star}-1]}$ in the present iteration is used to fit each $j$th component $\mathbf{X}_{\ell,j}^{\star}$, $j = 0, \cdots, p$. The fit, denoted by $\hat{\lambda}_j^{\star[l_{\ell}^{\star}]}$, is calculated by minimizing the squared loss $\left(R^{\star[l_{\ell}^{\star}]} - \mathbf{X}_{\ell,j}^{\star}\lambda\right)^T\left(R^{\star[l_{\ell}^{\star}]} - \mathbf{X}_{\ell,j}^{\star}\lambda\right)$ with respect to $\lambda$. Therefore, the least squares estimate is $\hat{\lambda}_j^{\star[l_{\ell}^{\star}]} = \left[\left(\mathbf{X}_{\ell,j}^{\star}\right)^T\mathbf{X}_{\ell,j}^{\star}\right]^{-1}\left(\mathbf{X}_{\ell,j}^{\star}\right)^T R^{\star[l_{\ell}^{\star}]}$, the corresponding hat matrix is $\mathcal{H}_j^{\star} = \mathbf{X}_{\ell,j}^{\star}\left[\left(\mathbf{X}_{\ell,j}^{\star}\right)^T\mathbf{X}_{\ell,j}^{\star}\right]^{-1}\left(\mathbf{X}_{\ell,j}^{\star}\right)^T$, and the residual sum of squares is $RSS_j^{\star[l_{\ell}^{\star}]} = \left(R^{\star[l_{\ell}^{\star}]} - \mathbf{X}_{\ell,j}^{\star}\hat{\lambda}_j^{\star[l_{\ell}^{\star}]}\right)^T\left(R^{\star[l_{\ell}^{\star}]} - \mathbf{X}_{\ell,j}^{\star}\hat{\lambda}_j^{\star[l_{\ell}^{\star}]}\right)$. The chosen element $\hat{s}_{l_{\ell}^{\star}}^{\star}$ is attained by:

$$\hat{s}_{l_{\ell}^{\star}}^{\star} = \operatorname{argmin}_{0 \le j \le p} \mathrm{gMDL}\left(\mathbf{Y}_{\ell}^{\star}, RSS_j^{\star[l_{\ell}^{\star}]}, \mathrm{trace}\left(\mathcal{B}_j^{\star[l_{\ell}^{\star}]}\right)\right), \tag{40}$$

where $\mathcal{B}_j^{\star[1]} = \mathcal{H}_j^{\star}$ and $\mathcal{B}_j^{\star[l_{\ell}^{\star}]} = I - \left(I - \mathcal{H}_j^{\star}\right)\left(I - \nu\mathcal{H}_{\hat{s}_{l_{\ell}^{\star}-1}}^{\star}\right)\cdots\left(I - \nu\mathcal{H}_{\hat{s}_1}^{\star}\right)$ for $l_{\ell}^{\star} > 1$ is the boosting operator for choosing the $j$th element in the $l_{\ell}^{\star}$th iteration of this stage. Hence, a unique element $\mathbf{X}_{\ell,\hat{s}_{l_{\ell}^{\star}}}^{\star}$ is selected at each iteration, and only the corresponding coefficient $\tilde{\beta}_{\ell,\hat{s}_{l_{\ell}^{\star}}}^{\star[l_{\ell}^{\star}]}$ changes, i.e., $\tilde{\beta}_{\ell,\hat{s}_{l_{\ell}^{\star}}}^{\star[l_{\ell}^{\star}]} = \tilde{\beta}_{\ell,\hat{s}_{l_{\ell}^{\star}}}^{\star[l_{\ell}^{\star}-1]} + \nu\hat{\lambda}_{\hat{s}_{l_{\ell}^{\star}}}^{\star[l_{\ell}^{\star}]}$, where $\nu$ is the pre-specified step-size parameter. All the other $\tilde{\beta}_{\ell,j}^{\star[l_{\ell}^{\star}]}$ for $j \neq \hat{s}_{l_{\ell}^{\star}}^{\star}$ remain unchanged. This procedure is repeated $L_{\ell}^{\star}$ times and the stopping iteration $\widehat{L}_{\ell}^{\star}$ can be estimated by

$$\widehat{L}_{\ell}^{\star} = \operatorname{argmin}_{1 \le l_{\ell}^{\star} \le L_{\ell}^{\star}} \mathrm{gMDL}\left(\mathbf{Y}_{\ell}^{\star}, RSS_{\hat{s}_{l_{\ell}^{\star}}}^{\star[l_{\ell}^{\star}]}, \mathrm{trace}\left(\mathcal{B}^{\star[l_{\ell}^{\star}]}\right)\right), \tag{41}$$

where $\mathcal{B}^{\star[l_{\ell}^{\star}]} = I - \left(I - \nu\mathcal{H}_{\hat{s}_{l_{\ell}^{\star}}}^{\star}\right)\cdots\left(I - \nu\mathcal{H}_{\hat{s}_1}^{\star}\right)$.

From this second stage of sparse boosting, the estimator of $\tilde{\boldsymbol{\beta}}_{\ell}^{\star}$ is

$$\tilde{\boldsymbol{\beta}}_{\ell}^{\star[\widehat{L}_{\ell}^{\star}]} = \left(\tilde{\beta}_{\ell,0}^{\star[\widehat{L}_{\ell}^{\star}]}, \cdots, \tilde{\beta}_{\ell,p}^{\star[\widehat{L}_{\ell}^{\star}]}\right)^T, \quad \ell = 1, \cdots, \widehat{N}.$$

#### **4.2 Simulation**

Extensive simulations are conducted to evaluate the performance of the proposed procedure. The accuracy of subgrouping, feature selection, coefficient estimation and prediction is assessed under different numbers of patients and repeated measurements. To better understand the advantages of the proposed method, the following four approaches are considered. M1: the homogeneous model fitting method, which treats all patients as one group and uses sparse boosting for a single model to estimate $\tilde{\boldsymbol{\beta}}$; M2: the heterogeneous model fitting method, which uses the initial pre-grouping estimates $\tilde{\boldsymbol{\beta}}_i^{[\widehat{L}_i]}$ as the final estimates of $\tilde{\boldsymbol{\beta}}_i$; M3: the same as the proposed method, but in step 2, instead of detecting the change points of the coefficients of each covariate $\tilde{\beta}_{i,j}^{[\widehat{L}_i]}$, $i = 1, \cdots, n$, for $j = 0, \cdots, p$, it detects the change points among the stacked vectors $\left(\left(\tilde{\boldsymbol{\beta}}_1^{[\widehat{L}_1]}\right)^T, \cdots, \left(\tilde{\boldsymbol{\beta}}_n^{[\widehat{L}_n]}\right)^T\right)^T$, similarly to Ke et al. [48]; M4: the proposed method.


The results from [19] show that the naive homogeneous model fitting method M1 can rarely identify the important covariates, while the over-parameterized model fitting method M2 and the two methods (M3 and M4) that identify subgroup structures consistently yield true positives equal to the true number of important covariates. Among these three methods that can identify the important covariates, the proposed method produces the fewest false positives. In addition, the number of false positives decreases as the cluster size increases. Neither the homogeneous nor the heterogeneous model fitting method is able to identify the true structure among patients. Method M3 produces many more subgroups than actually exist, while the proposed method M4 identifies the number of subgroups closest to the actual number. Furthermore, the probability of identifying the true subgroups becomes larger as the number of repeated measurements increases. For in-sample prediction, the over-parameterized model M2 performs best, while methods M3 and M4 perform very competitively. However, for out-of-sample prediction, method M4 is the best. M1 is inferior to M4, yielding poor estimation and prediction results. In summary, the proposed method performs well in terms of subgroup identification, variable selection, estimation and prediction.

#### **4.3 Wallaby growth data analysis**

The proposed subgroup identification method is applied to wallaby growth data from the Australasian Data and Story Library (OzDASL), available at http://www.statsci.org/data/oz/wallaby.html. The data set contains growth measurements of 77 Tammar wallabies taken longitudinally. The response variable is the weight of the wallabies (in tenths of a gram). The predictors involve the lengths of the head, ear, body, arm, leg, tail and foot and their second-order interactions, so a total of 35 predictors are included in the analysis. After removing missing data, 43 Tammar wallabies are kept in the dataset. The number of repeated measurements ranges from 9 to 34 (median: 23). To better understand the wallabies' growth trend, two questions are investigated: which body parts affect the weight, and whether the lengths of the body parts have the same effects on the weight for all wallabies, i.e. whether there are subgroups among the wallabies. In addition to the above subgroup identification method (SB-CPD1), the other three methods studied in the simulation are also considered: the homogeneous model fitting method (SB-Homogeneous), the heterogeneous model fitting method (SB-Heterogeneous), and the method similar to SB-CPD1 but identifying subgroups via the approach of Ke et al. [48] (SB-CPD2). In addition, the following subgroup identification methods incorporating penalized estimation are also investigated: methods similar to the proposed one but using, instead of sparse boosting, lasso (Lasso-CPD1), elastic net (ElasticNet-CPD1), SCAD (SCAD-CPD1) or MCP (MCP-CPD1).

The results from [19] show that although Lasso-CPD1 and ElasticNet-CPD1 yield smaller in-sample prediction errors by keeping all 35 covariates, they have relatively large out-of-sample prediction errors due to over-fitting. The subgroup identification methods based on sparse boosting keep a smaller number of predictors and achieve sparser models than the penalized methods. The proposed method SB-CPD1 identifies fewer subgroups and predictors than the competing methods while producing the smallest out-of-sample prediction errors. In conclusion, the proposed subgroup identification method provides a more precise definition of the various subgroups and may lead to more accurate medical decision making for these subjects.

## **5. Conclusions**

In this chapter, we discussed various sparse boosting based machine learning methods in the context of high-dimensional data problems. Specifically, we presented the sparse boosting procedure and the two-step sparse boosting procedure for nonparametric varying-coefficient models with survival data and repeatedly measured longitudinal data, respectively, to simultaneously perform variable selection and estimation of the functional coefficients. We further presented the multi-step sparse boosting based subgroup identification method for longitudinal patient data to identify subgroups that exhibit different treatment effects. Extensive numerical studies show the validity and effectiveness of the proposed methods, and the real data analyses further demonstrate their usefulness and advantages.

### **Author details**

Mu Yue Singapore University of Technology and Design, Singapore

\*Address all correspondence to: yuemu.moon@gmail.com

© 2021 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/ by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


### **References**

[1] Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological). 1996 Jan;58(1):267– 288.

[2] Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association. 2001 Dec 1;96(456):1348–1360.

[3] Zhang CH. Nearly unbiased variable selection under minimax concave penalty. The Annals of statistics. 2010; 38(2):894–942.

[4] Zou H. The adaptive lasso and its oracle properties. Journal of the American statistical association. 2006 Dec 1;101(476):1418–1429.

[5] Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the royal statistical society: series B (statistical methodology). 2005 Apr 1;67(2):301–320.

[6] Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2006 Feb;68(1):49–67.

[7] Schapire RE. The strength of weak learnability. Machine learning. 1990 Jun 1;5(2):197–227.

[8] Freund Y, Schapire RE. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences. 1997 Aug 1;55(1):119–139.

[9] Bühlmann P, Yu B. Boosting with the L2 loss: regression and classification. Journal of the American Statistical Association. 2003 Jun 1;98(462):324–339.

[10] Bühlmann P, Yu B, Singer Y, Wasserman L. Sparse Boosting. Journal of Machine Learning Research. 2006 Jun 1;7(6).

[11] Wang Z. HingeBoost: ROC-based boost for classification and variable selection. The International Journal of Biostatistics. 2011 Feb 4;7(1).

[12] Bühlmann P, Hothorn T. Twin boosting: improved feature selection and prediction. Statistics and Computing. 2010 Apr;20(2):119–138.

[13] Komori O, Eguchi S. A boosting method for maximizing the partial area under the ROC curve. BMC bioinformatics. 2010 Dec;11(1):1–7.

[14] Wang Z. Multi-class hingeboost. Methods of information in medicine. 2012;51(02):162–167.

[15] Zhao J. General sparse boosting: improving feature selection of l2 boosting by correlation-based penalty family. Communications in Statistics-Simulation and Computation. 2015 Jul 3; 44(6):1612–1640.

[16] Yang Y, Zou H. Nonparametric multiple expectile regression via ER-Boost. Journal of Statistical Computation and Simulation. 2015 May 3;85(7):1442–1458.

[17] Yue M, Li J, Ma S. Sparse boosting for high-dimensional survival data with varying coefficients. Statistics in medicine. 2018 Feb 28;37(5):789–800.

[18] Yue M, Li J, Cheng MY. Two-step sparse boosting for high-dimensional longitudinal data with varying coefficients. Computational Statistics Data Analysis. 2019 Mar 1;131:222–234.

[19] Yue M, Huang L. A new approach of subgroup identification for high-dimensional longitudinal data. Journal of Statistical Computation and

Simulation. 2020 Jul 23;90(11):2098– 2116.

[20] David CR. Regression models and life tables (with discussion). Journal of the Royal Statistical Society. 1972;34(2): 187–220.

[21] Wei LJ. The accelerated failure time model: a useful alternative to the Cox regression model in survival analysis. Statistics in medicine. 1992;11(14–15): 1871–1879.

[22] Schmid M, Hothorn T. Flexible boosting of accelerated failure time models. BMC bioinformatics. 2008 Dec; 9(1):1–3.

[23] Wang Z, Wang CY. Buckley-James boosting for survival analysis with high-dimensional biomarker data. Statistical Applications in Genetics and Molecular Biology. 2010 Jun 8;9(1).

[24] Li J, Ma S. Survival analysis in medicine and genetics. CRC Press; 2013 Jun 4.

[25] Stute W. Consistent estimation under random censorship when covariables are present. Journal of Multivariate Analysis. 1993 Apr 1;45(1): 89–103.

[26] Curry HB, Schoenberg IJ. On Pólya frequency functions IV: the fundamental spline functions and their limits. InIJ Schoenberg Selected Papers 1988 (pp. 347–383). Birkhäuser, Boston, MA.

[27] Hansen MH, Yu B. Model selection and the principle of minimum description length. Journal of the American Statistical Association. 2001 Jun 1;96(454):746–774.

[28] Shedden K, Taylor JM, Enkemann SA, Tsao MS, Yeatman TJ, Gerald WL, Eschrich S, Jurisica I, Giordano TJ, Misek DE, Chang AC. Gene expression–based survival

prediction in lung adenocarcinoma: a multi-site, blinded validation study. Nature medicine. 2008 Aug;14(8):822.

[29] Consonni D, Bertazzi PA, Zocchetti C. Why and how to control for age in occupational epidemiology. Occupational and environmental medicine. 1997 Nov 1;54(11):772–776.

[30] Wang S, Nan B, Zhu J, Beer DG. Doubly penalized Buckley–James method for survival data with high-dimensional covariates. Biometrics. 2008 Mar;64(1):132–140.

[31] Diggle P, Heagerty P, Liang KY, Zeger S. Analysis of Longitudinal Data: Oxford University Press. 2002.

[32] Fitzmaurice GM, Laird NM, Ware JH. Applied longitudinal analysis, vol. 998 John Wiley & Sons. Hoboken NJ. 2012.

[33] Lin X, Carroll RJ. Semiparametric regression for clustered data using generalized estimating equations. Journal of the American Statistical Association. 2001 Sep 1;96(455):1045– 1056.

[34] Fan J, Huang T, Li R. Analysis of longitudinal data with semiparametric estimation of covariance function. Journal of the American Statistical Association. 2007 Jun 1;102(478):632–641.

[35] Cheng MY, Honda T, Li J, Peng H. Nonparametric independence screening and structure identification for ultra-high dimensional longitudinal data. Annals of Statistics. 2014;42(5):1819–1849.

[36] Cheng MY, Honda T, Li J. Efficient estimation in semivarying coefficient models for longitudinal/clustered data. The Annals of Statistics. 2016;44(5): 1988–2017.

[37] Luan Y, Li H. Clustering of time-course gene expression data using a mixed-effects model with B-splines.

*Sparse Boosting Based Machine Learning Methods for High-Dimensional Data DOI: http://dx.doi.org/10.5772/intechopen.100506*

Bioinformatics. 2003 Mar 1;19(4): 474–482.

[38] Wang L, Chen G, Li H. Group SCAD regression analysis for microarray time course gene expression data. Bioinformatics. 2007 Jun 15;23(12): 1486–1494.

[39] Wei F, Huang J, Li H. Variable selection and estimation in high-dimensional varying-coefficient models. Statistica Sinica. 2011 Oct 1;21(4):1515.

[40] Yue M, Li J. Improvement screening for ultra-high dimensional data with censored survival outcomes and varying coefficients. The international journal of biostatistics. 2017 May 18;13(1).

[41] Sivaganesan S, Müller P, Huang B. Subgroup finding via Bayesian additive regression trees. Statistics in medicine. 2017 Jul 10;36(15):2391–2403.

[42] Zhang H, Singer BH. Recursive partitioning and applications. Springer Science & Business Media; 2010 Jul 1.

[43] Zeileis A, Hothorn T, Hornik K. Model-based recursive partitioning. Journal of Computational and Graphical Statistics. 2008 Jun 1;17(2):492–514.

[44] Chen G, Zhong H, Belousov A, Devanarayan V. A PRIM approach to predictive-signature development for patient stratification. Statistics in medicine. 2015 Jan 30;34(2):317–342.

[45] Huang X, Sun Y, Trow P, Chatterjee S, Chakravartty A, Tian L, Devanarayan V. Patient subgroup identification for clinical drug development. Statistics in medicine. 2017 Apr 30;36(9):1414–1428.

[46] Lipkovich I, Dmitrienko A, B D'Agostino Sr R. Tutorial in biostatistics: data-driven subgroup identification and analysis in clinical trials. Statistics in medicine. 2017 Jan 15; 36(1):136–196.

[47] Bai J. Estimating multiple breaks one at a time. Econometric theory. 1997 Jun 1:315–352.

[48] Ke Y, Li J, Zhang W. Structure identification in panel data analysis. Annals of Statistics. 2016;44(3):1193– 1233.

#### **Chapter 4**

## Fast Computation of the EM Algorithm for Mixture Models

*Masahiro Kuroda*

#### **Abstract**

Mixture models have become increasingly popular due to their modeling flexibility and are applied to the clustering and classification of heterogeneous data. The EM algorithm is widely used for the maximum likelihood estimation of mixture models because it is stable in convergence and simple to implement. Despite these advantages, the EM algorithm converges only locally and its convergence can be slow, which is its main drawback. To avoid local convergence of the EM algorithm, multiple runs from several different initial values are usually used. The algorithm may then take a large number of iterations and a long computation time to find the maximum likelihood estimates. Speeding up the computation of the EM algorithm addresses these problems. We give algorithms to accelerate the convergence of the EM algorithm and apply them to mixture model estimation. Numerical experiments examine the performance of the acceleration algorithms in terms of the number of iterations and computation time.

**Keywords:** the EM algorithm, normal mixture models, acceleration of convergence, the vector *ε* algorithm, restarting procedure, initial value selection, the emEM algorithm

#### **1. Introduction**

Mixture models have become increasingly popular due to their modeling flexibility and are applied to the clustering and classification of heterogeneous data; see [1–3]. The EM algorithm [4] is widely used for the maximum likelihood estimation of mixture models because it is stable in convergence and simple to implement. Despite these advantages, the main drawback of the EM algorithm is that it converges only locally and often slowly.

To circumvent the problem of slow convergence of the EM algorithm, various acceleration algorithms incorporating optimization methods have been proposed. The optimization methods include the multivariate Aitken method [5], the conjugate gradient method [6], and the quasi-Newton method [7, 8]. However, these methods require matrix computations, such as matrix inversion or the evaluation of Hessian and Jacobian matrices, and a line search for step-length optimization. Therefore, the resulting acceleration algorithms tend to lack one or more of the nice properties of the EM algorithm, although they may converge faster than the EM algorithm.

As another approach, the *ε*-accelerated EM algorithm [9] accelerates the convergence of the EM algorithm by using the vector *ε* (v*ε*) algorithm [10], a vector extrapolation method [11, 12]. The v*ε* algorithm can accelerate the convergence of the sequence of estimates from the EM algorithm, and therefore the *ε*-accelerated EM algorithm does not require any modification of the E- and M-steps of the EM algorithm. This is its advantage over the acceleration algorithms based on optimization methods. To reduce the number of iterations and the computation time of the *ε*-accelerated EM algorithm, the *ε*R-accelerated EM algorithm [13] was developed. It improves the computation speed of the *ε*-accelerated EM algorithm by embedding a restarting procedure, which finds a value for restarting the EM iterations such that a newly generated sequence of EM iterations from that value moves quickly into a neighborhood of a stationary point. We use the *ε*-accelerated EM and *ε*R-accelerated EM algorithms for parameter estimation.

When the EM algorithm is applied to mixture models, it is sensitive to the choice of the initial value and may find estimates at a local maximum of the log-likelihood function. Several strategies have been proposed to efficiently initiate the EM algorithm so as to reach the global maximum of the log-likelihood function; see [14–17]. We use the emEM algorithm [14] for the mixture model estimation and improve its computation speed with the *ε*-accelerated EM and *ε*R-accelerated EM algorithms.

The chapter is organized as follows. Section 2 describes the EM algorithm for normal mixture models. In Section 3, we introduce the *ε*-accelerated EM and *ε*R-accelerated EM algorithms. Section 4 presents numerical experiments to evaluate the performance of these acceleration algorithms. In Section 5, we provide an acceleration algorithm that applies the *ε*-accelerated EM and *ε*R-accelerated EM algorithms to the emEM algorithm. Numerical experiments in Section 6 study the effects of these acceleration algorithms on the emEM algorithm. In Section 7, we present our concluding remarks.

#### **2. The EM algorithm for normal mixture models**

Let $\mathbf{Y}\_1, \ldots, \mathbf{Y}\_n$ be *p*-dimensional random vectors. Assume that an observed data vector $\mathbf{y}\_i$ of $\mathbf{Y}\_i$ arises from a mixture distribution of *G* components with density

$$f\left(\mathbf{y}\_i|\boldsymbol{\theta}\right) = \sum\_{k=1}^{G} \lambda\_k \phi\left(\mathbf{y}\_i|\boldsymbol{\mu}\_k, \boldsymbol{\Sigma}\_k\right),\tag{1}$$

where $\phi\left(\mathbf{y}\_i|\boldsymbol{\mu}\_k, \boldsymbol{\Sigma}\_k\right)$ is the *k*-th component density of a *p*-variate normal distribution $N\_p\left(\boldsymbol{\mu}\_k, \boldsymbol{\Sigma}\_k\right)$ with mean vector $\boldsymbol{\mu}\_k$ and variance–covariance matrix $\boldsymbol{\Sigma}\_k$, $\lambda\_k$ is the *k*-th mixing proportion such that $0 < \lambda\_k < 1$ and $\sum\_{k=1}^{G}\lambda\_k = 1$, and $\boldsymbol{\theta} = \left(\lambda\_1, \ldots, \lambda\_G, \boldsymbol{\mu}\_1^{\top}, \ldots, \boldsymbol{\mu}\_G^{\top}, \mathrm{vec}\,\boldsymbol{\Sigma}\_1^{\top}, \ldots, \mathrm{vec}\,\boldsymbol{\Sigma}\_G^{\top}\right)^{\top}$. Here $\mathrm{vec}\,\boldsymbol{\Sigma}\_k$ is the vectorization of $\boldsymbol{\Sigma}\_k$. The log-likelihood function of $\boldsymbol{\theta}$ for $\mathbf{y} = \left(\mathbf{y}\_1, \ldots, \mathbf{y}\_n\right)$ is

$$\ell\_o(\boldsymbol{\theta}) = \sum\_{i=1}^n \log f\left(\mathbf{y}\_i|\boldsymbol{\theta}\right) = \sum\_{i=1}^n \log \left\{ \sum\_{k=1}^G \lambda\_k \phi\left(\mathbf{y}\_i|\boldsymbol{\mu}\_k, \boldsymbol{\Sigma}\_k\right) \right\}.\tag{2}$$

Direct maximization of the function (2) is complicated, and the maximum likelihood estimate (MLE) of *θ* is therefore usually found via the EM algorithm [4].
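
To make the notation concrete, the following is a minimal R sketch that evaluates the mixture density (1) and the observed-data log-likelihood (2) for a univariate mixture, so that $\phi$ is the $N(\mu\_k, \sigma\_k^2)$ density; the function names mix_density and loglik_obs are our own and not part of any package.

```r
# Mixture density of Eq. (1), univariate case:
# f(y_i | theta) = sum_k lambda_k * phi(y_i | mu_k, sigma_k^2)
mix_density <- function(y, lambda, mu, sigma) {
  sapply(y, function(yi) sum(lambda * dnorm(yi, mean = mu, sd = sigma)))
}

# Observed-data log-likelihood of Eq. (2)
loglik_obs <- function(y, lambda, mu, sigma) {
  sum(log(mix_density(y, lambda, mu, sigma)))
}

# Example with a two-component mixture
set.seed(1)
y <- c(rnorm(100, mean = 0), rnorm(100, mean = 4))
loglik_obs(y, lambda = c(0.5, 0.5), mu = c(0, 4), sigma = c(1, 1))
```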

In the setting of the EM algorithm, we regard $\mathbf{y}\_i$ as incomplete data and introduce the component-label vector $\mathbf{Z}\_i = \left[Z\_{i1}, \ldots, Z\_{iG}\right]^{\top}$ of zero–one indicator variables such that $Z\_{ik} = 1$ indicates that $\mathbf{y}\_i$ arises from the *k*-th component of the mixture model and $Z\_{ik} = 0$ otherwise. Assume that $\mathbf{Z}\_i$ has a multinomial distribution $Mu(1, \boldsymbol{\lambda})$ with parameter $\boldsymbol{\lambda} = \left[\lambda\_1, \ldots, \lambda\_G\right]^{\top}$. In the mixture model, the complete data vector is $\mathbf{x}\_i = \left(\mathbf{y}\_i^{\top}, \mathbf{z}\_i^{\top}\right)^{\top}$, where $\mathbf{y}\_i$ is the observed vector and $\mathbf{z}\_i$ is the unobserved vector of $\mathbf{Z}\_i$. Then $\mathbf{x}\_i$ has a mixture distribution with density

$$f(\mathbf{x}\_i|\boldsymbol{\theta}) = \prod\_{k=1}^{G} \left\{ \lambda\_k \phi \left( \mathbf{y}\_i | \boldsymbol{\mu}\_k, \boldsymbol{\Sigma}\_k \right) \right\}^{z\_{ik}}.\tag{3}$$

Given $\mathbf{x} = \left[\mathbf{x}\_1, \ldots, \mathbf{x}\_n\right]$, the log-likelihood function of $\boldsymbol{\theta}$ is

$$\ell\_c(\boldsymbol{\theta}) = \sum\_{i=1}^{n} \sum\_{k=1}^{G} z\_{ik} \log \left\{ \lambda\_{k} \phi \left( \mathbf{y}\_{i} | \boldsymbol{\mu}\_{k}, \boldsymbol{\Sigma}\_{k} \right) \right\}, \tag{4}$$

and the MLE $\hat{\boldsymbol{\theta}}$ of the function (4) is obtained from

$$
\hat{\lambda}\_k = \sum\_{i=1}^n z\_{ik}/n,\tag{5}
$$

$$
\hat{\boldsymbol{\mu}}\_k = \sum\_{i=1}^n z\_{ik} \mathbf{y}\_i / \sum\_{i=1}^n z\_{ik},
\tag{6}
$$

$$\hat{\boldsymbol{\Sigma}}\_{k} = \sum\_{i=1}^{n} z\_{ik} (\mathbf{y}\_{i} - \hat{\boldsymbol{\mu}}\_{k}) (\mathbf{y}\_{i} - \hat{\boldsymbol{\mu}}\_{k})^{\top} / \sum\_{i=1}^{n} z\_{ik} \tag{7}$$

for *<sup>k</sup>* <sup>¼</sup> 1, … , *<sup>G</sup>*. The EM algorithm finds ^*<sup>θ</sup>* by iterating the expectation step (E-step) and the maximization step (M-step). Let *θ*ð Þ*<sup>t</sup>* be the *t*-th estimate of *θ* in parameter space Θ. The E-step calculates the *Q* function that is the conditional expectation of <sup>ℓ</sup>*c*ð Þ*<sup>θ</sup>* given **<sup>y</sup>** and *<sup>θ</sup>*ð Þ*<sup>t</sup>* and is written as

$$Q\left(\theta|\theta^{(t)}\right) = E\left[\ell\_c(\theta)|\mathbf{y}, \theta^{(t)}\right].\tag{8}$$

Mixture models treat $\mathbf{z} = \left[\mathbf{z}\_1, \ldots, \mathbf{z}\_n\right]$ as missing data. The E-step calculates the conditional expectation of $Z\_{ik}$ given $\mathbf{y}$ and $\boldsymbol{\theta}^{(t)}$:

$$\begin{split} \tau\_{ik}^{(t+1)} &= E\left[Z\_{ik}|\mathbf{y}, \boldsymbol{\theta}^{(t)}\right] = \Pr\left(Z\_{ik}=1|\mathbf{y}, \boldsymbol{\theta}^{(t)}\right) \\ &= \lambda\_k^{(t)} \phi\left(\mathbf{y}\_i|\boldsymbol{\mu}\_k^{(t)}, \boldsymbol{\Sigma}\_k^{(t)}\right) \Big/ \sum\_{l=1}^{G} \lambda\_l^{(t)} \phi\left(\mathbf{y}\_i|\boldsymbol{\mu}\_l^{(t)}, \boldsymbol{\Sigma}\_l^{(t)}\right) . \end{split} \tag{9}$$

The quantity $\tau\_{ik}^{(t+1)}$ is the posterior probability that $\mathbf{y}\_i$ belongs to the *k*-th component of the mixture. From Eq. (9), the *Q* function (8) is given by

$$Q\left(\boldsymbol{\theta}|\boldsymbol{\theta}^{(t)}\right) = \sum\_{i=1}^{n} \sum\_{k=1}^{G} \tau\_{ik}^{(t+1)} \log \left\{ \lambda\_k \phi\left(\mathbf{y}\_i|\boldsymbol{\mu}\_k, \boldsymbol{\Sigma}\_k\right) \right\}. \tag{10}$$

The M-step finds $\boldsymbol{\theta}^{(t+1)}$ maximizing the *Q* function (10) with respect to $\boldsymbol{\theta}$ over $\Theta$ given $\boldsymbol{\theta}^{(t)}$:

$$\boldsymbol{\theta}^{(t+1)} = \underset{\boldsymbol{\theta} \in \Theta}{\text{arg}\, \max} \, Q\left(\boldsymbol{\theta}|\boldsymbol{\theta}^{(t)}\right). \tag{11}$$

When replacing $z\_{ik}$ in Eq. (5) with $\tau\_{ik}^{(t+1)}$ from the E-step, we obtain

$$
\lambda\_k^{(t+1)} = \frac{1}{n} \sum\_{i=1}^n \tau\_{ik}^{(t+1)}.\tag{12}
$$

From Eqs. (6) and (7), we also have

$$\boldsymbol{\mu}\_{k}^{(t+1)} = \sum\_{i=1}^{n} \tau\_{ik}^{(t+1)} \mathbf{y}\_{i} \bigg/ \sum\_{i=1}^{n} \tau\_{ik}^{(t+1)},\tag{13}$$

$$\boldsymbol{\Sigma}\_{k}^{(t+1)} = \sum\_{i=1}^{n} \tau\_{ik}^{(t+1)} \left(\mathbf{y}\_{i} - \boldsymbol{\mu}\_{k}^{(t+1)}\right) \left(\mathbf{y}\_{i} - \boldsymbol{\mu}\_{k}^{(t+1)}\right)^{\top} \bigg/ \sum\_{i=1}^{n} \tau\_{ik}^{(t+1)}.\tag{14}$$

We describe the EM algorithm for the normal mixture model in Algorithm 1.

#### **Algorithm 1**: The EM algorithm.

**E-step:** Calculate $\tau\_i^{(t+1)} = \left[\tau\_{i1}^{(t+1)}, \ldots, \tau\_{iG}^{(t+1)}\right]^{\top}$ for $i = 1, \ldots, n$ using Eq. (9) and update $\tau^{(t+1)} = \left[\tau\_1^{(t+1)}, \ldots, \tau\_n^{(t+1)}\right]$.
**M-step:** Estimate $\boldsymbol{\theta}^{(t+1)}$ from Eqs. (12)–(14).
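
For concreteness, one iteration of Algorithm 1 might be coded in R as follows for a univariate *G*-component mixture (the multivariate case replaces dnorm with a multivariate normal density and Eq. (14) with a covariance update); em_step and the list representation of $\boldsymbol{\theta}$ are our own conventions, not part of any package.

```r
# One EM iteration of Algorithm 1 for a univariate normal mixture.
# theta is a list with elements lambda, mu, sigma, each of length G.
em_step <- function(y, theta) {
  G <- length(theta$lambda)
  # E-step, Eq. (9): n x G matrix of posterior probabilities tau_ik
  dens <- sapply(1:G, function(k)
    theta$lambda[k] * dnorm(y, mean = theta$mu[k], sd = theta$sigma[k]))
  tau <- dens / rowSums(dens)
  # M-step, Eqs. (12)-(14)
  nk     <- colSums(tau)
  lambda <- nk / length(y)
  mu     <- colSums(tau * y) / nk
  sq     <- outer(y, mu, "-")^2          # (y_i - mu_k)^2 for the updated mu
  sigma  <- sqrt(colSums(tau * sq) / nk)
  list(lambda = lambda, mu = mu, sigma = sigma)
}

# Iterating em_step to (near) convergence
set.seed(1)
y <- c(rnorm(100, mean = 0), rnorm(100, mean = 4))
theta <- list(lambda = c(0.5, 0.5), mu = c(-1, 1), sigma = c(1, 1))
for (t in 1:200) theta <- em_step(y, theta)
```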

#### **3. Acceleration of the EM algorithm**

In order to accelerate the convergence of the EM algorithm, we can use the *ε*-accelerated EM algorithm [9] and the *ε*R-accelerated EM algorithm [13]. The *ε*-accelerated EM algorithm incorporates the vector *ε* (v*ε*) algorithm [10] in the EM algorithm. The *ε*R-accelerated EM algorithm improves the computation speed of the *ε*-accelerated EM algorithm by adding a restarting procedure.

We briefly introduce the v*ε* algorithm. Let $\left\{\boldsymbol{\theta}^{(t)}\right\}\_{t \geq 0}$ be a linearly convergent vector sequence generated by an iterative computational procedure that converges to a stationary point $\hat{\boldsymbol{\theta}}$ as $t \to \infty$. Then the v*ε* algorithm generates a sequence $\left\{\boldsymbol{\psi}^{(t)}\right\}\_{t \geq 0}$ that converges to $\hat{\boldsymbol{\theta}}$ faster than $\left\{\boldsymbol{\theta}^{(t)}\right\}\_{t \geq 0}$ by using

$$\boldsymbol{\psi}^{(t-1)} = \boldsymbol{\theta}^{(t)} + \left[ \left[ \Delta \boldsymbol{\theta}^{(t)} \right]^{-1} - \left[ \Delta \boldsymbol{\theta}^{(t-1)} \right]^{-1} \right]^{-1},\tag{15}$$

where $\Delta \boldsymbol{\theta}^{(t)} = \boldsymbol{\theta}^{(t+1)} - \boldsymbol{\theta}^{(t)}$ and $\left[\boldsymbol{\theta}\right]^{-1} = \boldsymbol{\theta}/\|\boldsymbol{\theta}\|^2 = \boldsymbol{\theta}/\boldsymbol{\theta}^{\top}\boldsymbol{\theta}$; see the Appendix for details. The algorithm enables accelerating the convergence of a slowly convergent vector sequence and is very effective for linearly convergent sequences.
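
As an illustration, Eq. (15) translates directly into R; vec_inv and veps_extrapolate below are our own names, and the function maps three consecutive iterates $\boldsymbol{\theta}^{(t-1)}, \boldsymbol{\theta}^{(t)}, \boldsymbol{\theta}^{(t+1)}$, stored as numeric vectors, to $\boldsymbol{\psi}^{(t-1)}$.

```r
# Samelson inverse: [theta]^{-1} = theta / ||theta||^2
vec_inv <- function(v) v / sum(v^2)

# v-epsilon extrapolation of Eq. (15):
# psi^(t-1) from theta^(t-1), theta^(t), theta^(t+1)
veps_extrapolate <- function(th_prev, th_cur, th_next) {
  d_prev <- th_cur  - th_prev      # Delta theta^(t-1)
  d_cur  <- th_next - th_cur       # Delta theta^(t)
  th_cur + vec_inv(vec_inv(d_cur) - vec_inv(d_prev))
}
```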

We define the EM algorithm as a mapping $\boldsymbol{\theta} \mapsto M(\boldsymbol{\theta})$ from $\Theta$ to $\Theta$ such that each iteration $\boldsymbol{\theta}^{(t)} \to \boldsymbol{\theta}^{(t+1)}$ is denoted by

$$\theta^{(t+1)} = M\left(\theta^{(t)}\right). \tag{16}$$

**Algorithm 2**: The *ε*-accelerated EM algorithm.

**EM step:** Estimate $\boldsymbol{\theta}^{(t+1)}$ from Eq. (16).
**ε acceleration step:** Calculate $\boldsymbol{\psi}^{(t-1)}$ from $\left\{\boldsymbol{\theta}^{(t+1)}, \boldsymbol{\theta}^{(t)}, \boldsymbol{\theta}^{(t-1)}\right\}$ using Eq. (15).

The *ε*-accelerated EM algorithm is shown in Algorithm 2. Given a convergence criterion *δ*, the *ε*-accelerated EM algorithm iterates until

$$\|\psi^{(t-1)} - \psi^{(t-2)}\|^2 < \delta. \tag{17}$$
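
A rough R sketch of Algorithm 2 is given below; it assumes that the EM mapping M of Eq. (16) is available as a function acting on a parameter vector (for the mixture model, the parameters can be packed into one vector, for example with unlist), and eps_accel_em is our own name.

```r
# epsilon-accelerated EM (Algorithm 2).
# M: the EM mapping of Eq. (16), a function from a parameter vector to the next iterate.
eps_accel_em <- function(M, theta0, delta = 1e-12, max_iter = 100000) {
  vec_inv <- function(v) v / sum(v^2)
  th_prev <- theta0
  th_cur  <- M(th_prev)
  th_next <- M(th_cur)
  psi_old <- NULL
  for (t in 1:max_iter) {
    # epsilon acceleration step, Eq. (15)
    psi <- th_cur + vec_inv(vec_inv(th_next - th_cur) - vec_inv(th_cur - th_prev))
    # stopping rule, Eq. (17)
    if (!is.null(psi_old) && sum((psi - psi_old)^2) < delta) return(psi)
    psi_old <- psi
    # EM step: advance the theta-sequence by one EM iteration
    th_prev <- th_cur; th_cur <- th_next; th_next <- M(th_cur)
  }
  psi
}
```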

Assume that the sequence $\left\{\boldsymbol{\theta}^{(t)}\right\}\_{t \geq 0}$ from the EM algorithm converges to a stationary point $\hat{\boldsymbol{\theta}}$. The *ε*-accelerated EM algorithm then generates $\left\{\boldsymbol{\psi}^{(t)}\right\}\_{t \geq 0}$ converging to $\hat{\boldsymbol{\theta}}$ faster than $\left\{\boldsymbol{\theta}^{(t)}\right\}\_{t \geq 0}$ and provides $\hat{\boldsymbol{\theta}}$ as the final value of $\left\{\boldsymbol{\psi}^{(t)}\right\}\_{t \geq 0}$ when the algorithm terminates.

Theorems on the convergence and acceleration of the algorithm are given in [18].

As shown in Algorithm 2, the *ε*-accelerated EM algorithm generates two parallel sequences: $\left\{\boldsymbol{\psi}^{(t)}\right\}\_{t \geq 0}$ in the *ε* acceleration step and $\left\{\boldsymbol{\theta}^{(t)}\right\}\_{t \geq 0}$ in the EM step. At the *ε* acceleration step, the EM estimate $M\left(\boldsymbol{\psi}^{(t-1)}\right)$ obtained from $\boldsymbol{\psi}^{(t-1)}$ may have a larger log-likelihood than the current EM estimate $\boldsymbol{\theta}^{(t+1)}$, that is,

$$
\ell\_o \left( M \left( \psi^{(t-1)} \right) \right) > \ell\_o \left( \theta^{(t+1)} \right). \tag{18}
$$

When this occurs, the EM step is restarted with $M\left(\boldsymbol{\psi}^{(t-1)}\right)$ as the initial value, and the *ε* acceleration step obtains $\boldsymbol{\psi}^{(t)}$ from $\left\{\boldsymbol{\psi}^{(t-1)}, M\left(\boldsymbol{\psi}^{(t-1)}\right), M\left(M\left(\boldsymbol{\psi}^{(t-1)}\right)\right)\right\}$. Notice that at the restarting point we still generate the EM sequence using three estimates obtained from the same initial value $\boldsymbol{\psi}^{(t-1)}$. In this manner, the *ε* acceleration is always applied to a sequence obtained by the EM mapping $M$ from the same initial value.

In our experiments, the restarting procedure is performed at almost every iteration when only the condition $\ell\_o\left(M\left(\boldsymbol{\psi}^{(t-1)}\right)\right) > \ell\_o\left(\boldsymbol{\theta}^{(t+1)}\right)$ is used, and it then takes much computation time. As an additional condition for restarting the EM step, we require $\|\boldsymbol{\psi}^{(t-1)} - \boldsymbol{\psi}^{(t-2)}\|^2 < \delta\_{Re}$, where $\delta\_{Re} > \delta$, and reset $\delta\_{Re} = \delta\_{Re}/10^{k}$ at each restart, where $k$ is a positive integer, such as one. By adding this condition, we can control the restarting frequency. For example, set $\delta = 10^{-12}$ and initialize $\delta\_{Re} = 1$ and $k = 1$. Then the restarting procedure is performed at most 12 times.

The restarting conditions are summarized as follows:

$$\begin{aligned} \text{i. } \ell\_o \left( M \left( \boldsymbol{\psi}^{(t-1)} \right) \right) &> \ell\_o \left( \boldsymbol{\theta}^{(t+1)} \right) \text{, and} \\\\ \text{ii. } \left\| \boldsymbol{\psi}^{(t-1)} - \boldsymbol{\psi}^{(t-2)} \right\|^2 &< \delta\_{Re}. \end{aligned}$$

Condition (i) means that the log-likelihood function can be increased by restarting. Condition (ii) is used to reduce the frequency of restarting. This is the key idea of the restarting procedure. The *ε*R-accelerated EM algorithm is the *ε*-accelerated EM algorithm with the restarting procedure using conditions (i) and (ii) and is given in Algorithm 3.

**Algorithm 3**: The *ε*R-accelerated EM algorithm.

**EM step:** Estimate $\boldsymbol{\theta}^{(t+1)}$ from Eq. (16).
**ε acceleration step:** Calculate $\boldsymbol{\psi}^{(t-1)}$ from $\left\{\boldsymbol{\theta}^{(t+1)}, \boldsymbol{\theta}^{(t)}, \boldsymbol{\theta}^{(t-1)}\right\}$ using Eq. (15).
**Restarting step:** If $\ell\_o\left(M\left(\boldsymbol{\psi}^{(t-1)}\right)\right) > \ell\_o\left(\boldsymbol{\theta}^{(t+1)}\right)$ and $\|\boldsymbol{\psi}^{(t-1)} - \boldsymbol{\psi}^{(t-2)}\|^2 < \delta\_{Re}$, then set

$$\boldsymbol{\theta}^{(t)} = \boldsymbol{\psi}^{(t-1)},\tag{19}$$

update

$$\boldsymbol{\theta}^{(t+1)} = M\left(\boldsymbol{\psi}^{(t-1)}\right),\tag{20}$$

and reset

$$
\delta\_{Re} = \delta\_{Re} / 10^k. \tag{21}
$$

The *ε*R-accelerated EM algorithm also gives $\hat{\boldsymbol{\theta}}$ as the final value of $\left\{\boldsymbol{\psi}^{(t)}\right\}\_{t \geq 0}$. When the restarting step effectively finds values for restarting the EM step, the *ε*R-accelerated EM algorithm greatly reduces the number of iterations and the computation time for convergence. The advantage of the *ε*R-accelerated EM algorithm over the *ε*-accelerated EM algorithm is that it restarts the EM step at a better current estimate and also ensures that the log-likelihood increases over the iterations.

Theoretical results of convergence and speed of convergence of the *ε*R-accelerated EM algorithm are given in [13].
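
Under the same conventions as the previous sketch (M is the EM mapping acting on a parameter vector and loglik evaluates $\ell\_o$ at such a vector), the *ε*R-accelerated EM algorithm might be sketched in R as follows; this is illustrative only and omits safeguards against degenerate differences in the extrapolation.

```r
# epsilon-R-accelerated EM (Algorithm 3): Algorithm 2 plus the restarting step.
epsR_accel_em <- function(M, loglik, theta0, delta = 1e-12, delta_re = 1, k = 1,
                          max_iter = 100000) {
  vec_inv <- function(v) v / sum(v^2)
  veps <- function(a, b, c) b + vec_inv(vec_inv(c - b) - vec_inv(b - a))  # Eq. (15)
  th_prev <- theta0; th_cur <- M(th_prev); th_next <- M(th_cur)
  psi_old <- NULL
  for (t in 1:max_iter) {
    psi <- veps(th_prev, th_cur, th_next)
    if (!is.null(psi_old)) {
      if (sum((psi - psi_old)^2) < delta) return(psi)      # stopping rule, Eq. (17)
      # Restarting step: check the cheap condition (ii) first, then condition (i)
      if (sum((psi - psi_old)^2) < delta_re) {
        m_psi <- M(psi)
        if (loglik(m_psi) > loglik(th_next)) {
          # restart the EM sequence from psi: {psi, M(psi), M(M(psi))}, Eqs. (19)-(20)
          th_prev <- psi; th_cur <- m_psi; th_next <- M(m_psi)
          delta_re <- delta_re / 10^k                       # Eq. (21)
          psi_old <- psi
          next
        }
      }
    }
    psi_old <- psi
    th_prev <- th_cur; th_cur <- th_next; th_next <- M(th_cur)  # EM step, Eq. (16)
  }
  psi
}
```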

#### **4. Numerical experiments for the acceleration of the EM algorithm**

We investigate how much faster the *ε*-accelerated EM and *ε*R-accelerated EM algorithms converge than the EM algorithm. All computations are performed with the statistical package R [19] running on Windows on an Intel Core i5 (3.00 GHz) with 8 GB of memory.

The R package MixSim [17, 20] is used to simulate a random data matrix $\mathbf{y}$ from a *p*-variate normal mixture distribution of *G* components. We generate $\mathbf{y} = \left(\mathbf{y}\_1, \ldots, \mathbf{y}\_{1000}\right)$ and find the MLE of $\boldsymbol{\theta}$ using the EM, *ε*-accelerated EM, and *ε*R-accelerated EM algorithms. The procedure is replicated 100 times. Here, we consider $p = 2, 3, 4, 5, 6$ and $G = 4$. For all experiments, we set $\delta = 10^{-12}$ for convergence of the algorithms and $\delta\_{Re} = 1$ and $k = 1$ for the restarting condition of the *ε*R-accelerated EM algorithm. Initial values of the algorithms are obtained from the k-means method using the R function kmeans.
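
For reference, the data generation and initialization just described might be coded roughly as follows; the MixSim arguments shown (such as BarOmega) are illustrative choices rather than the settings used in the chapter, and kmeans_init is our own helper for turning a k-means partition into starting values.

```r
library(MixSim)

set.seed(1)
G <- 4; p <- 3; n <- 1000

# Draw a random p-variate normal mixture of G components and simulate n observations
mix <- MixSim(BarOmega = 0.01, K = G, p = p)   # the average overlap 0.01 is illustrative
dat <- simdataset(n = n, Pi = mix$Pi, Mu = mix$Mu, S = mix$S)
y   <- dat$X

# Initial values from k-means, as used for all algorithms in the experiments
kmeans_init <- function(y, G) {
  cl <- kmeans(y, centers = G, nstart = 10)
  list(lambda = as.numeric(table(cl$cluster)) / nrow(y),
       mu     = cl$centers,
       Sigma  = lapply(1:G, function(k) cov(y[cl$cluster == k, , drop = FALSE])))
}
theta0 <- kmeans_init(y, G)
```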

**Tables 1** and **2** report the number of iterations and the CPU time of these algorithms for each *p*. The CPU times (in seconds) are measured with the R function proc.time, whose resolution is typically 10 milliseconds. For all computations, the acceleration algorithms found the same MLEs as the EM algorithm. The tables show that the EM algorithm requires a large number of iterations for convergence, whereas the two acceleration algorithms converge in a smaller number of iterations. In particular, the *ε*R-accelerated EM algorithm greatly reduces both the number of iterations and the CPU time.

To measure the speed of convergence of the two acceleration algorithms relative to the EM algorithm, we calculate iteration and CPU time speedups. The iteration speedup of an acceleration algorithm over the EM algorithm is defined by

$$\frac{\text{the number of iterations of the EM algorithm}}{\text{the number of iterations of the acceleration algorithm}}.$$



#### **Table 1.**

*Summary statistics of the number of iterations of the EM, ε-accelerated EM (ε), and εR-accelerated EM (εR) algorithms for 100 simulated datasets. Each dataset is generated from a p-variate normal mixture distribution of four components.*


#### **Table 2.**

*Summary statistics of CPU time of the EM, ε-accelerated EM (ε), and εR-accelerated EM (εR) algorithms for 100 simulated datasets. Each dataset is generated from a p-variate normal mixture distribution of four components.*

The CPU time speedup is calculated similarly to the iteration speedup. **Tables 3** and **4** show the iteration and CPU time speedups of the two acceleration algorithms. We compare the mean values of the iteration and CPU time speedups of the algorithms. The *ε*-accelerated EM algorithm is about 1.5 times and 1.4 times faster than the EM algorithm in the number of iterations and CPU time, respectively, while the *ε*R-accelerated EM algorithm is more than twice as fast as the EM algorithm in both the number of iterations and CPU time. The boxplots in **Figures 1** and **2** also show that the *ε*R-accelerated EM algorithm is clearly much faster than the *ε*-accelerated EM algorithm. **Table 3** and **Figure 1** indicate that in 75 out of 100 replications, the number of iterations of the *ε*R-accelerated EM algorithm is less than half that of the EM algorithm. For the CPU time in **Table 4** and **Figure 2**, the *ε*R-accelerated EM algorithm is more than twice as fast as the EM algorithm in 50 out of 100 replications.

**Figure 3** shows the boxplots of the iteration and CPU time speedups of the *ε*R-accelerated EM algorithm for $p = 6$. Here, "more" ("less") means that the number of iterations of the EM algorithm is more (less) than the median in **Tables 1** and **2**.


#### **Table 3.**

*Summary statistics of the iteration speedup of the ε-accelerated EM (ε) and εR-accelerated EM (εR) algorithms for 100 random datasets. Each dataset is generated from a p-variate normal mixture distribution of four components.*


#### **Table 4.**

*Summary statistics of the CPU time speedup of the ε-accelerated EM (ε) and εR-accelerated EM (εR) algorithms for 100 random datasets. Each dataset is generated from a p-variate normal mixture distribution of four components.*


**Figure 1.**

*Boxplots of the iteration speedup of the ε-accelerated EM (ε) and εR-accelerated EM (εR) algorithms for 100 random datasets generated from a p-variate normal mixture distribution of four components.*

**Figure 2.**

*Boxplots of the CPU time speedup of the ε-accelerated EM (ε) and εR-accelerated EM (εR) algorithms for 100 random datasets. Each dataset is generated from a p-variate normal mixture distribution of four components.*

We can see from the figure that, for the larger numbers of iterations of the EM algorithm ("more"), the *ε*R-accelerated EM algorithm works well to speed up the convergence of $\left\{\boldsymbol{\psi}^{(t)}\right\}\_{t \geq 0}$. We observed similar results for the other values of *p*. Therefore, the algorithm is more powerful when the EM algorithm takes a larger number of iterations.

The results from the tables and figures demonstrate that the restarting step in the *ε*R-accelerated EM algorithm enables a significant increase in the computation speed with less computational effort.

**Figure 3.**

*Boxplots of the iteration and CPU time speedups of the εR-accelerated EM algorithm for 100 random datasets. Each dataset is generated from a six-variate normal mixture distribution of four components. The label "less" ("more") means that the number of iterations of the EM algorithm is less (more) than the median in Tables 1 and 2.*

#### **5. Initial value selection for normal mixture models**

It is well known that the log-likelihood function (2) may have numerous local maxima. Due to its local convergence, the EM algorithm is not guaranteed to attain the global maximum of the log-likelihood function, and its performance depends strongly on the initial value of $\boldsymbol{\theta}$. Several methods for selecting the initial value have been proposed; for example, see [14–17]. These methods are based on multiple runs of the EM algorithm from different initial values and aim to find the $\hat{\boldsymbol{\theta}}$ attaining the global maximum of the log-likelihood function.

We apply the emEM algorithm [14] to the mixture model estimation. The algorithm is popular and usually provides excellent results when the number of components is not large [21]. The emEM algorithm selects an initial value in the em step, which consists of several short runs of the EM algorithm from different initial values with a lax convergence criterion, and then obtains $\hat{\boldsymbol{\theta}}$ in the EM step, which runs the EM algorithm from the selected initial value with a strict convergence criterion.

The em step consists of three steps. The first step generates *J* initial values of $\boldsymbol{\theta}$. The second step runs the EM algorithm from these initial values with a lax convergence criterion; that is, the iterations are stopped without waiting for full convergence of the EM algorithm. The third step selects the value giving the largest log-likelihood among the *J* trials.

Let $\delta\_{ini}$ be a convergence criterion and $T\_{max}$ the maximum number of iterations. We present the emEM algorithm in Algorithm 4.


**Algorithm 4**: The emEM algorithm.

**em step**: Select $\boldsymbol{\theta}^{(0)}$ for the EM step.

**Random initialization step:** Draw *J* initial values $\left\{\boldsymbol{\theta}^{(0,j)}\right\}\_{j=1, \ldots, J}$.
**Short running step:** Repeat the following computation for $j = 1, \ldots, J$: Generate $\left\{\boldsymbol{\theta}^{(t\_j,j)}\right\}\_{t\_j \geq 0}$ by iterating the EM algorithm from $\boldsymbol{\theta}^{(0,j)}$ and stop the iterations at the $t\_j$-th iteration if

$$\frac{\ell\_o\left(\boldsymbol{\theta}^{(t\_j,j)}\right) - \ell\_o\left(\boldsymbol{\theta}^{(t\_j-1,j)}\right)}{\ell\_o\left(\boldsymbol{\theta}^{(t\_j,j)}\right) - \ell\_o\left(\boldsymbol{\theta}^{(0,j)}\right)} < \delta\_{ini}, \quad \text{or} \quad t\_j > T\_{max}.\tag{22}$$

Obtain $\boldsymbol{\theta}^{(\*,j)} = \boldsymbol{\theta}^{(t\_j,j)}$.

**Selection step:** From the *J* candidate initial values $\left\{\boldsymbol{\theta}^{(\*,j)}\right\}\_{j=1, \ldots, J}$, find

$$\boldsymbol{\theta}^{(0)} = \underset{\left\{\boldsymbol{\theta}^{(\*,j)}\right\}\_{j=1,\dots,J}}{\text{arg}\,\max}\ \ell\_o\left(\boldsymbol{\theta}^{(\*,j)}\right).\tag{23}$$

**EM step:** Given $\boldsymbol{\theta}^{(0)}$ from the em step, find $\hat{\boldsymbol{\theta}}$ using the EM algorithm.
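
As a rough sketch, the em step could be coded as below; em_step (one EM iteration), loglik (the observed-data log-likelihood $\ell\_o$), and random_theta (a routine drawing a starting value, for example from kmeans) are placeholders assumed to be available rather than functions defined in the chapter.

```r
# em step of the emEM algorithm: J short runs with the lax rule of Eq. (22),
# followed by the selection step of Eq. (23).
em_short_runs <- function(y, G, J = 50, delta_ini = 1e-3, Tmax = 1000,
                          em_step, loglik, random_theta) {
  best <- NULL
  best_ll <- -Inf
  for (j in 1:J) {
    theta <- random_theta(y, G)
    ll0 <- ll_prev <- loglik(y, theta)
    for (t in 1:Tmax) {
      theta <- em_step(y, theta)
      ll <- loglik(y, theta)
      # lax stopping rule of Eq. (22): relative gain in log-likelihood below delta_ini
      if (ll - ll0 > 0 && (ll - ll_prev) / (ll - ll0) < delta_ini) break
      ll_prev <- ll
    }
    # selection step, Eq. (23): keep the candidate with the largest log-likelihood
    if (ll > best_ll) { best <- theta; best_ll <- ll }
  }
  best   # theta^(0), the initial value for the (long) EM step
}
```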

The em step performs multiple runs of the EM algorithm, and its computation may therefore be time-consuming. We replace the EM algorithm with the *ε*-accelerated EM algorithm in the em step and use the *ε*R-accelerated EM algorithm to obtain $\hat{\boldsymbol{\theta}}$ in the EM step. By applying these acceleration algorithms to the emEM algorithm, it is possible to reduce the number of iterations and CPU time. The accelerated emEM algorithm is referred to as the *ε*em-*ε*REM algorithm and is shown in Algorithm 5.

**Algorithm 5**: the *ε*em-*ε*REM algorithm.

*ε***-em step**: Select $\boldsymbol{\theta}^{(0)}$ for the *ε*R-EM step.

**Random initialization step:** Draw *J* initial values $\left\{\boldsymbol{\theta}^{(0,j)}\right\}\_{j=1, \ldots, J}$.
**Short running step:** Repeat the following computation for $j = 1, \ldots, J$: Generate $\left\{\boldsymbol{\psi}^{(t\_j,j)}\right\}\_{t\_j \geq 0}$ by iterating the *ε*-accelerated EM algorithm from $\boldsymbol{\theta}^{(0,j)}$ and stop the iterations at the $t\_j$-th iteration if

$$\frac{\ell\_o\left(\boldsymbol{\psi}^{(t\_j,j)}\right) - \ell\_o\left(\boldsymbol{\psi}^{(t\_j-1,j)}\right)}{\ell\_o\left(\boldsymbol{\psi}^{(t\_j,j)}\right) - \ell\_o\left(\boldsymbol{\psi}^{(0,j)}\right)} < \delta\_{ini}, \quad \text{or} \quad t\_j > T\_{max}.\tag{24}$$

Obtain $\boldsymbol{\theta}^{(\*,j)} = \boldsymbol{\psi}^{(t\_j,j)}$.

**Selection step:** From the *J* candidate initial values $\left\{\boldsymbol{\theta}^{(\*,j)}\right\}\_{j=1, \ldots, J}$, find

$$\boldsymbol{\theta}^{(0)} = \underset{\left\{\boldsymbol{\theta}^{(\*,j)}\right\}\_{j=1,\dots,J}}{\text{arg}\,\max}\ \ell\_o\left(\boldsymbol{\theta}^{(\*,j)}\right).\tag{25}$$

*ε***R-EM step**: Given $\boldsymbol{\theta}^{(0)}$ from the *ε*-em step, find $\hat{\boldsymbol{\theta}}$ using the *ε*R-accelerated EM algorithm.

#### **6. Numerical experiments for the initial value selection**

We evaluate the performance of the *ε*-accelerated EM and *ε*R-accelerated EM algorithms when applied to the emEM algorithm.


#### **Table 5.**

*The numbers of iterations of the emEM and εem-εREM algorithms. The em and ε-em steps generate 50 random initial values.*


#### **Table 6.**

*CPU times of the emEM and εem-εREM algorithms. The em and ε-em steps generate 50 random initial values.*

Using MixSim, we simulate $\mathbf{y} = \left(\mathbf{y}\_1, \ldots, \mathbf{y}\_{1000}\right)$ from a *p*-variate normal mixture distribution of six components for $p = 2, 3, 4, 5, 6$. The values of $\delta$, $\delta\_{Re}$, and $k$ are the same as in the experiments of Section 4. Assume, conservatively, that the probability of not finding the global maximum of the log-likelihood function in a single run is 0.80. Then the probability of finding the global maximum at least once in 50 runs is $1 - 0.80^{50} > 0.9999$. In the em and *ε*-em steps, we therefore draw 50 initial values $\left\{\boldsymbol{\theta}^{(0,j)}\right\}\_{j=1, \ldots, 50}$ from kmeans and set $\delta\_{ini} = 0.001$ and $T\_{max} = 1000$.

**Tables 5** and **6** present the number of iterations and CPU time for each *p*. **Table 5** shows that the number of iterations of the *ε*-em step is much smaller than that of the em step; the *ε*-accelerated EM algorithm effectively improves the computation speed of the em step. Comparing the number of iterations and CPU time of the *ε*em-*ε*REM algorithm with those of the emEM algorithm, the values for the *ε*em-*ε*REM algorithm are roughly half or less of those for the emEM algorithm. The results illustrate that the *ε*-accelerated EM and *ε*R-accelerated EM algorithms can substantially accelerate the convergence of the emEM algorithm.

#### **7. Concluding remarks**

In this chapter, we introduced the *ε*-accelerated EM and *ε*R-accelerated EM algorithms. Both algorithms are given by very simple computational procedures, require only a small amount of extra computation per iteration, and effectively accelerate the convergence of the EM algorithm.

When the EM algorithm is applied to normal mixture models, the algorithm may converge slowly and depend heavily on the initial value. The first problem is addressed by accelerating the EM algorithm; the numerical experiments demonstrated the effectiveness of the *ε*-accelerated EM and *ε*R-accelerated EM algorithms. For the second problem, initial value selection is useful to initiate the EM algorithm. We applied the emEM algorithm to normal mixture model estimation and developed the *ε*em-*ε*REM algorithm to speed up the computation of the emEM algorithm; the *ε*-accelerated EM algorithm is used in the em step, and the *ε*R-accelerated EM algorithm in the EM step. Numerical experiments showed that the *ε*em-*ε*REM algorithm converges in a smaller number of iterations and a shorter CPU time than the emEM algorithm.

The *ε*-accelerated EM and *ε*R-accelerated EM algorithms accelerate the convergence of the EM algorithm without any modification of its E- and M-steps. This means that these algorithms do not require deriving an acceleration formula for each statistical model. Thus, they can be applied to several mixture models, such as mixtures of factor analyzers, mixtures of multivariate *t*-distributions, mixtures of generalized hyperbolic distributions, and parsimonious Gaussian mixture models, for which we expect the EM algorithm to converge slowly. The results from the experiments show that the *ε*-accelerated EM and *ε*R-accelerated EM algorithms are useful due to their fast convergence and ease of use.

#### **Appendix: the vector** *ε* **algorithm**

Let $\boldsymbol{\theta}^{(t)}$ denote a *d*-dimensional vector that converges to a vector $\hat{\boldsymbol{\theta}}$ as $t \to \infty$, and define $\left[\boldsymbol{\theta}\right]^{-1} = \boldsymbol{\theta}/\|\boldsymbol{\theta}\|^2 = \boldsymbol{\theta}/\boldsymbol{\theta}^{\top}\boldsymbol{\theta}$. In general, the v*ε* algorithm for a sequence $\left\{\boldsymbol{\theta}^{(t)}\right\}\_{t \geq 0}$ starts with

$$
\varepsilon^{(t,-1)} = \mathbf{0}, \qquad \varepsilon^{(t,0)} = \theta^{(t)} \tag{26}
$$

and then generates a vector $\varepsilon^{(t,k+1)}$ by

$$\begin{aligned} \varepsilon^{(t,k+1)} &= \varepsilon^{(t+1,k-1)} + \left[\varepsilon^{(t+1,k)} - \varepsilon^{(t,k)}\right]^{-1} \\ &= \varepsilon^{(t+1,k-1)} + \left[\Delta \varepsilon^{(t,k)}\right]^{-1}, \qquad k = 0, 1, 2, \dots \end{aligned} \tag{27}$$

For practical implementation, we apply the v*ε* algorithm with $k = 1$ to accelerate the convergence of $\left\{\boldsymbol{\theta}^{(t)}\right\}\_{t \geq 0}$. From the above equation, we have

$$
\varepsilon^{(t,2)} = \varepsilon^{(t+1,0)} + \left[\Delta\varepsilon^{(t,1)}\right]^{-1} \quad \text{for} \ k = \mathbf{1},\tag{28}
$$

$$\varepsilon^{(t,1)} = \varepsilon^{(t+1,-1)} + \left[\Delta\varepsilon^{(t,0)}\right]^{-1} = \left[\Delta\varepsilon^{(t,0)}\right]^{-1} \quad \text{for } k = 0. \tag{29}$$

Then the vector $\varepsilon^{(t,2)}$ becomes

$$\begin{split} \varepsilon^{(t,2)} &= \varepsilon^{(t+1,0)} + \left[ \left[ \Delta \varepsilon^{(t+1,0)} \right]^{-1} - \left[ \Delta \varepsilon^{(t,0)} \right]^{-1} \right]^{-1} \\ &= \boldsymbol{\theta}^{(t+1)} + \left[ \left[ \Delta \boldsymbol{\theta}^{(t+1)} \right]^{-1} - \left[ \Delta \boldsymbol{\theta}^{(t)} \right]^{-1} \right]^{-1} . \end{split} \tag{30}$$

When setting $\boldsymbol{\psi}^{(t)} = \varepsilon^{(t,2)}$, we obtain Eq. (15).


### **Author details**

Masahiro Kuroda Okayama University of Science, Okayama City, Japan

\*Address all correspondence to: kuroda@mgt.ous.ac.jp

© 2021 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/ by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


#### **References**

[1] Bouveyron C, Celeux G, Murphy TB, Raftery AE. Model-Based Clustering and Classification for Data Science with Applications in R. Cambridge: Cambridge University Press; 2019

[2] McLachlan G, Peel D. Finite Mixture Models. New York: Wiley; 2000

[3] McNicholas PD. Mixture Model-Based Classification. Boca Raton: Chapman & Hall/CRC Press; 2016

[4] Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. With discussion. Journal of the Royal Statistical Society Series B. 1977;**39**:1-38

[5] Louis TA. Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society, Series B. 1982;**44**: 226-233

[6] Jamshidian M, Jennrich RI. Conjugate gradient acceleration of the EM algorithm. Journal of the American Statistical Association. 1993; **88**:221-228

[7] Jamshidian M, Jennrich RI. Acceleration of the EM algorithm by using quasi-Newton methods. Journal of the Royal Statistical Society, Series B. 1997;**59**:569-587

[8] Lange K. A quasi Newton acceleration of the EM algorithm. Statistica Sinica. 1995;**5**:1-18

[9] Kuroda M, Sakakihara M. Accelerating the convergence of the EM algorithm using the vector *ε* algorithm. Computational Statistics & Data Analysis. 2006;**51**:1549-1561

[10] Wynn P. Acceleration techniques for iterated vector and matrix problems. Mathematics of Computation. 1962;**16**: 301-322

[11] Brezinski C, Redivo-Zaglia M. Extrapolation Methods: Theory and Practice. Amsterdam: North-Holland; 1991

[12] Smith DA, Ford F, Sidi A. Extrapolation methods for vector sequences. SIAM Review. 1987;**29**: 199-233

[13] Kuroda M, Geng Z, Sakakihara M. Improving the vector *ε* acceleration for the EM algorithm using a re-starting procedure. Computational Statistics. 2015;**30**:1051-1077

[14] Biernacki C, Celeux G, Govaert G. Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Computational Statistics & Data Analysis. 2003;**41**: 561-575

[15] Kwedlo W. A new random approach for initialization of the multiple restart EM algorithm for Gaussian model-based clustering. Pattern Analysis and Applications. 2015;**18**:757-770

[16] Maitra R. Initializing optimization partitioning algorithms. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2009;**6**: 144-157

[17] Melnykov V, Chen W, Maitra R. MixSim: An R package for simulating data to study performance of clustering algorithms. Journal of Statistical Software. 2012;**51**:1

[18] Wang M, Kuroda M, Sakakihara M, Geng Z. Acceleration of the EM algorithm using the vector epsilon algorithm. Computational Statistics. 2008;**23**:469-486

[19] R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2021. Available from: https://www.R-project.org/

[20] Maitra R, Melnykov V. Simulating data to study performance of finite mixture modeling and clustering algorithms. Journal of Computational and Graphical Statistics. 2010;**19**: 354-376

[21] Michael S, Melnykov V. An effective strategy for initializing the EM algorithm in finite mixture models. Advances in Data Analysis and Classification. 2016;**10**:563-583
