96 Recent Applications in Data Clustering

The semiparametric estimator of the dependence parameter is then obtained by maximizing the copula log-likelihood evaluated at the pseudo-observations:

$$\widehat{\theta} = \arg\max\_{\theta} \sum\_{i=1}^{n} \log c\left[\widehat{U}\_{1i}, \dots, \widehat{U}\_{ki}, \dots, \widehat{U}\_{Ki}; \theta\right] \tag{7}$$

Note that the scaling factor $n/(n+1)$ in Eq. (6) is typically introduced in the nonparametric computation of the margins to avoid numerical problems at the boundary of $[0,1]^K$.

#### 2.1. Copula models

While many different copula models are available in the literature (see [18, 19] for details), the Elliptical and Archimedean families have proven to be the most useful in empirical modeling. The Elliptical family includes the Gaussian copula and the t-copula: both are symmetric, exhibit the strongest dependence in the middle of the distribution, and can take into account both positive and negative dependence, since $-1 \le \theta \le 1$. The Archimedean family enables describing left asymmetry, right asymmetry, and weak symmetry among the margins using the Clayton, Gumbel, and Frank models, respectively. Clayton's copula has parameter $\theta \in (0, \infty)$ and, as $\theta$ approaches zero, the margins become independent. The dependence parameter $\theta$ of a Gumbel model is restricted to the interval $[1, +\infty)$, where the value 1 means independence. Finally, the dependence parameter $\theta$ of a Frank copula may assume any real value and, as $\theta$ approaches zero, the marginal distributions become independent. According to the type of copula model, the value of $\theta$ has a specific meaning, and the magnitudes of the dependence parameter are not comparable across copulas. It is always true that the greater the value of the dependence parameter, the stronger the association among the margins; moreover, since the relationship between $\theta$ and the concordance measures is well known, it is standard to convert $\theta$ to one of these, for example, Kendall's $\tau$ correlation coefficient. The families of copula models considered here are described in Table 1 and shown in Figure 1 in their bivariate version. Note that here only single parameter copula models are considered.

The copula model selection task is still an open research field. Although various statistical tests enable evaluating whether a specific model is plausible or not, no tool has thus far been recognized as the best. In the copula-based clustering context, this issue can be overcome by selecting the copula model a posteriori through an information criterion (see Section 3.2).

| Copula | $C(u\_1, u\_2; \theta)$ | Parameter range | Kendall's $\tau$ |
|---|---|---|---|
| Gaussian | $\Phi\_G\left(\Phi^{-1}(u\_1), \Phi^{-1}(u\_2); \theta\right)$ | $\theta \in (-1, 1)$ | $\frac{2}{\pi}\arcsin(\theta)$ |
| Student-t | $t\_{2,\nu}\left(t\_{\nu}^{-1}(u\_1), t\_{\nu}^{-1}(u\_2); \theta\right)$ | $\theta \in [-1, 1]$, $\nu \in (2, \infty)$ | $\frac{2}{\pi}\arcsin(\theta)$ |
| Clayton | $\left(u\_1^{-\theta} + u\_2^{-\theta} - 1\right)^{-1/\theta}$ | $\theta \in (0, \infty)$ | $\frac{\theta}{\theta + 2}$ |
| Frank | $-\frac{1}{\theta}\log\left[1 + \frac{(e^{-\theta u\_1} - 1)(e^{-\theta u\_2} - 1)}{e^{-\theta} - 1}\right]$ | $\theta \in (-\infty, \infty)$ | $1 - \frac{4}{\theta}\left[1 - D\_1(\theta)\right]$ |
| Gumbel | $\exp\left\{-\left[(-\log u\_1)^{\theta} + (-\log u\_2)^{\theta}\right]^{1/\theta}\right\}$ | $\theta \in [1, \infty)$ | $1 - \theta^{-1}$ |

Table 1. Some standard single parameter bivariate copulas with the range of the dependence parameter $\theta$ and its relation with Kendall's $\tau$. $u\_k$ with $k = 1, 2$ are uniformly distributed variates so that $x\_k = F\_k^{-1}(u\_k)$. $\Phi$ is the cumulative distribution function (cdf) of the standard normal distribution, $\Phi\_G(u\_1, u\_2)$ is the standard bivariate normal distribution, $t\_{2,\nu}(\cdot, \cdot; \theta)$ denotes the standard bivariate Student-t distribution with $\nu$ degrees of freedom, and $t\_{\nu}^{-1}$ the inverse univariate Student-t distribution function. $D\_1(x)$ denotes the Debye function $\frac{1}{x}\int\_0^x t/(\exp(t) - 1)\,dt$.

3. CoClust algorithm

Di Lascio and Giannerini [4] proposed a clustering algorithm called CoClust that is able to cluster multivariate observations with a complex dependence structure. The basic underlying concept of CoClust is clustering multivariate dependent observations based on the likelihood copula fit estimated on the previously allocated observations. To do so, the CoClust assumes that the data are generated by a multivariate copula function whose arguments represent the clusters, and each cluster is thus generated by a (marginal) univariate density function. The type and strength of multivariate dependence across clusters are modeled through a copula function and its dependence parameter, respectively. Being copula-based, CoClust inherits all the advantages of copula theory, and the multivariate complex dependence structure of the DGP can be taken into account to perform the cluster analysis. However, CoClust in its first version had some significant limitations. For example, it automatically allocated all the observations to the clusters without discarding potentially irrelevant observations, implying a high computational burden. Di Lascio and Giannerini [5] proposed a new version of the CoClust algorithm that satisfactorily overcomes these limitations. This section hereafter describes the latest version of the CoClust algorithm, which is implemented in the R package CoClust.

The starting point of the CoClust algorithm is the standard $(n \times p)$ data matrix $\mathbf{X}$:

$$\mathbf{X} = \begin{bmatrix} x\_{11} & \dots & x\_{1j} & \dots & x\_{1p} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ x\_{i1} & \dots & x\_{ij} & \dots & x\_{ip} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ x\_{i'1} & \dots & x\_{i'j} & \dots & x\_{i'p} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ x\_{n1} & \dots & x\_{nj} & \dots & x\_{np} \end{bmatrix} = \begin{bmatrix} \mathbf{x}\_{1} \\ \vdots \\ \mathbf{x}\_{i} \\ \vdots \\ \mathbf{x}\_{i'} \\ \vdots \\ \mathbf{x}\_{n} \end{bmatrix} \tag{8}$$

in which n p-dimensional objects have to be grouped into K groups. CoClust can be applied either to the rows or to the columns of the data matrix according to the purpose of the analysis. In both cases, CoClust works with vectors and treats each row (column) of the data matrix $\mathbf{X}$ as a single element to be allocated to a cluster. The values within a row (or column) vector are treated as independent realizations of the same density function, so that the observations in each cluster are drawn from the same distribution. Hereafter, CoClust is described as applied to the rows of the data matrix.

CoClust: An R Package for Copula-Based Cluster Analysis
http://dx.doi.org/10.5772/intechopen.74865

#### 3.1. The basic procedure and selection of the number of clusters

The basic idea behind the CoClust consists in a forward procedure that allocates a K-plet of rows of the data matrix at a time, that is, one p-dimensional vector per cluster at a time; the decision on the allocation of each K-plet of rows is based on the value of the log-likelihood of the copula fit. This likelihood is computed by using the K-plets already allocated and one allocation candidate, say $\mathbf{x}\_{i'} = \left(x\_{i'1}, \dots, x\_{i'j}, \dots, x\_{i'p}\right)$, by varying the permutations of the observations in $\mathbf{x}\_{i'}$ in order to find, if it exists, the combination that maximizes the copula fit. If the log-likelihood of the copula fitted on the observations already allocated plus the permutation of the selected K-plet increases, then the candidate K-plet is allocated to the clusters; otherwise, it is discarded, since theoretically it could either be independent from the identified DGP or derive from another DGP.

Before describing the clustering algorithm procedure, two aspects of the CoClust merit discussion: the construction of the K-plet candidate for allocation and the selection of the number of clusters K. The K-plet of rows candidate for allocation is constructed on the basis of the following function $H(\cdot)$, which is a sort of multivariate measure of association based on the pairwise Spearman's $\rho$ correlation coefficient:


$$H(\Lambda\_2|\Lambda\_1) = \max\_{i' \in \Lambda\_2} \left\{ \psi\_{i \in \Lambda\_1} \left( \rho(\mathbf{x}\_i, \mathbf{x}\_{i'}) \right) \right\} \tag{9}$$

where $\Lambda\_1$ is the subset of row index vectors already selected to compose a K-plet, $\Lambda\_2$ is the subset of the remaining candidate row index vectors to complete it, and $\psi$ is an aggregation function, for instance, the mean, median, or maximum.
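As a concrete sketch, the measure in Eq. (9) can be computed with base R. The function name `H_measure` and the data layout (one row per vector) are illustrative assumptions, not part of the CoClust package:

```r
# Illustrative base-R sketch of H(Lambda_2 | Lambda_1) in Eq. (9).
# rows_in:   matrix whose rows are already selected for the K-plet (Lambda_1)
# rows_cand: matrix whose rows are the remaining candidates (Lambda_2)
# psi:       aggregation function over the pairwise Spearman's rho values
H_measure <- function(rows_in, rows_cand, psi = mean) {
  scores <- apply(rows_cand, 1, function(xc) {
    # Spearman's rho between the candidate row and each already-selected row
    rhos <- apply(rows_in, 1, function(xi) cor(xi, xc, method = "spearman"))
    psi(rhos)  # aggregate over i in Lambda_1
  })
  which.max(scores)  # index (in rows_cand) of the candidate maximizing H
}

# The perfectly concordant candidate (third row) is selected
H_measure(rbind(1:5, 2:6), rbind(5:1, c(1, 3, 2, 5, 4), 1:5))
```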

As for the selection of K, one of the advantages of CoClust with respect to classic clustering techniques is its ability to automatically choose the number of clusters. Indeed, CoClust explores all the possibilities among those given by the user and selects the K on the basis of the log-likelihood of the copula estimated on the subsets of k-plets allocated up to the user's predefined step. The technical details are given below.

The main steps of the CoClust algorithm to cluster the n rows of the data matrix are described in the following:

1. by varying the number of clusters k in the set of possibilities defined by the user, with $2 \le k \le n$:
	- a. select a subset of $n\_k$ k-plets of rows of the data matrix in Eq. (8) on the basis of the measure in Eq. (9);
	- b. fit the copula model on the $n\_k$ k-plets of rows through the semiparametric estimation method described in Section 2;
2. select the subset of $n\_k$ k-plets of rows, say $n\_K$ K-plets, that maximizes the log-likelihood of the copula; hence, the number of clusters K, that is, the dimension of the copula, is automatically chosen and $n\_K$ K-plets are already allocated;
3. select a K-plet of rows among those remaining by using the measure in Eq. (9) and estimate K! copulas by using the observations already clustered and a permutation of the candidate to the allocation;
4. allocate the permutation of the selected K-plet to the clustering by assigning each row to the corresponding cluster only if it increases the log-likelihood of the copula fit; otherwise, drop the entire K-plet of rows;
5. repeat steps 3 and 4 until all the observations are evaluated, that is, either allocated or discarded.
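The accept/reject rule in step 4 can be sketched with the copula package (its `pobs()` and `fitCopula()` functions). The helper `accept_kplet` and its data layout are illustrative assumptions, and it tests a single permutation, whereas the actual algorithm compares all K! permutations of the candidate and retains, if any, the best one:

```r
# Sketch of the allocation test in step 4 (assumed helper, not the CoClust API).
# cop_data: matrix of already-allocated observations, one column per cluster
# kplet:    candidate K-plet, i.e., K rows of the data matrix (a K x p matrix);
#           row k contributes p observations to cluster k, hence the transpose
library(copula)

accept_kplet <- function(cop_data, kplet,
                         cop = frankCopula(dim = ncol(cop_data))) {
  old_fit <- fitCopula(cop, pobs(cop_data), method = "mpl")
  new_fit <- fitCopula(cop, pobs(rbind(cop_data, t(kplet))), method = "mpl")
  # allocate only if the candidate increases the copula log-likelihood
  as.numeric(logLik(new_fit)) > as.numeric(logLik(old_fit))
}
```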

At the end of the procedure, we obtain a clustering of K clusters, each containing at most $(n/K)p$ independent observations, such that the multivariate dependence relationship across clusters can be revealed. Hence, the attention in recovering the multivariate relationships does not rely on the within-cluster relationships typical of classic clustering methods. A picture of the final CoClust clustering is given in Figure 2. Each cluster is a set of independent and identically distributed realizations from the same marginal distribution, while observations across clusters share the same multivariate dependence structure.

Note that since in each step of the procedure non-nested models are compared, that is, copula models with a single dependence parameter, the described log-likelihood based criterion is equivalent to the well-known Bayesian information criterion and Akaike information criterion. Finally, note that in the current CoClust version, the selection of the number of clusters K is based on a representative subset of $n\_k$ observations. Hence, the algorithm chooses the number of clusters K by estimating the $\sum\_{k=K\_{\min}}^{K\_{\max}} \binom{n\_k}{k}$ fits required, where $[K\_{\min}, K\_{\max}]$ is the range of the number of clusters predefined by the user with $K\_{\min} \ge 2$, and $n\_k$ is chosen by the user with $n\_k \ll p$. This allows keeping the computational complexity under control since it does not depend on the sample size.
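The number of fits behind this selection step is easy to compute directly; a base-R check (the function name `n_fits` is illustrative, and the same $n\_k$ is assumed for every k):

```r
# Total copula fits needed to select K: sum over k of choose(n_k, k),
# assuming the same n_k candidate rows for every k in [Kmin, Kmax]
n_fits <- function(nk, Kmin, Kmax) sum(choose(nk, Kmin:Kmax))

n_fits(nk = 10, Kmin = 2, Kmax = 4)  # choose(10,2) + choose(10,3) + choose(10,4)
```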

#### 3.2. Selecting the copula model

The CoClust algorithm does not perform the copula model selection task automatically; the copula model has to be chosen a posteriori by employing an information criterion. The Bayesian information criterion (henceforth BIC) is expressed as follows for a K-dimensional copula model m:

Figure 2. The basic concept underlying the CoClust algorithm. Each element in a cluster is a row of the data matrix with p elements.

$$\text{BIC}\_{K,m} = -2\log \Pi\_{i=1}^{n} c\_m \left\{ \widehat{F}\_1(X\_{1i}), \dots, \widehat{F}\_k(X\_{ki}), \dots, \widehat{F}\_K(X\_{Ki}); \widehat{\theta} \right\} + s \log\left((n/K)\,p\right) \tag{10}$$

where $\widehat{\theta}$ is as in Eq. (5) or Eq. (7), with the summation over the number of allocated observations, which equals at most $(n/K)p$ (i.e., $n/K$ p-dimensional vectors), and s is the number of parameters. According to [23], we select the copula model that minimizes the BIC. Similarly, the Akaike information criterion (henceforth AIC) results in:

$$\text{AIC}\_{K,m} = -2\log \Pi\_{i=1}^{n} c\_m \left\{ \widehat{F}\_1(X\_{1i}), \dots, \widehat{F}\_k(X\_{ki}), \dots, \widehat{F}\_K(X\_{Ki}); \widehat{\theta} \right\} + 2s \tag{11}$$

and can also be used to select the copula model.
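Given the maximized copula log-likelihood, Eqs. (10) and (11) reduce to simple arithmetic; the helper below (`ic_copula`, an illustrative name) assumes the log-likelihood has already been computed:

```r
# BIC and AIC of Eqs. (10)-(11) for a copula model with s parameters,
# fitted on n/K p-dimensional vectors per cluster, i.e. (n/K)*p observations
ic_copula <- function(loglik, s, n, K, p) {
  nobs <- (n / K) * p  # number of allocated observations
  c(BIC = -2 * loglik + s * log(nobs),
    AIC = -2 * loglik + 2 * s)
}

ic_copula(loglik = -100, s = 1, n = 90, K = 3, p = 4)
```

The model minimizing the chosen criterion is selected.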

#### 3.3. Assessing the CoClust performance


The goodness of the CoClust algorithm in finding the true multivariate clustering structure underlying the data has been extensively investigated. Specifically, the first version of CoClust [4] was tested on simulated data under different scenarios and compared with model-based clustering [1, 2]. This shows that, both when the DGP is a copula and when it is misspecified, CoClust is able to identify both the true number of clusters and their size in most situations. Moreover, in comparison with model-based clustering, CoClust appears better suited to clustering dependent data. In [5], a more sophisticated Monte Carlo study was carried out to investigate the new features of the current version of the CoClust algorithm. There, the current version of the algorithm clearly outperforms the previous version of [4], and CoClust's ability to find the correct number of clusters and to reconstruct the true K-plets by varying the dimension of the copula, the aggregation function $\psi$, and the copula model appears to be very satisfactory. Furthermore, [5] also obtained good results in assessing CoClust's ability to drop from the clustering observations that are independent of the true DGP, as well as in distinguishing two different DGPs in the same dataset.

As for real data applications, the CoClust algorithm has been successfully applied to several datasets. In relation to biomedical applications, [4] apply the CoClust to microarray data to formulate hypotheses on the possible co-regulation and functional relations between genes, [5] use the copula-based clustering method to identify biologically and clinically relevant groups of tumor samples, and [24] attempt to identify the organ type from cancer cell lines derived from tumors. Applications in other fields include [25], where the purpose of the analysis is to investigate changes in EU country diets in accordance with common European policies and guidelines on healthy diets, and [24], where CoClust is used to investigate the geographic distribution of (annual maxima) rainfall measurements.
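4. The R implementation of the CoClust algorithm

The copula-based clustering algorithm procedure is implemented in the R package CoClust [26]. It must be installed in the usual way, that is:

```r
# install the CoClust package from CRAN and load it
install.packages("CoClust")
library(CoClust)
```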
