In the first example, the data generating process (DGP) is based on a copula model, whereas in the second example, a misspecified DGP is used. In these examples, we focus only on the semi-parametric approach described in Section 2, due to its theoretical and computational advantages with respect to the fully parametric approach; moreover, the latter has been implemented only for Gaussian margins.

#### Example 1

In this example, we build a 3-variate joint density function through a 3-dimensional Frank copula with dependence parameter such that Kendall's τ = 0.7 and three different margins: a Gaussian with parameters μ = 7, σ = 2, a Gamma with shape and rate set to 3 and 4, respectively, and a Beta with parameters α = 2, β = 1. To do so, we employ the function mvdc of the copula package [27–30]. Next, we generate a data matrix X with 15 rows and 21 columns and build the matrix of the true cluster indexes. Finally, we apply the function CoClust to the rows of X, recover the multivariate dependence structure of the data, and compare the obtained clustering with the true one.

Code to generate the example dataset is given below. We first define the DGP:

```
R> n.marg <- 3
R> theta <- iTau(frankCopula(), 0.7)
R> copula <- frankCopula(theta, dim = n.marg)
R> mymvdc <- mvdc(copula, c("norm", "gamma", "beta"),list(list(mean=7, sd=2),
+ list(shape=3, rate=4), list(shape1=2, shape2=1)))
```
and then generate the data and save the true clustering in index.true:

```
R> set.seed(11)
R> n.col <- 21
R> n.row <- 15
R> n <- n.row*n.col/n.marg
R> x.samp <- rMvdc(n, mymvdc)
R> X <- matrix(x.samp, nrow = n.row, ncol = n.col, byrow=TRUE)
R> index.true <- matrix(1:n.row, n.row/n.marg, n.marg)
R> colnames(index.true) <- c("Cluster 1", "Cluster 2", "Cluster 3")
R> index.true
```
Note that n is the number of observations for each margin.
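The parameter returned by iTau above can be cross-checked without the copula package. For the Frank copula, Kendall's τ and the parameter θ are linked by τ(θ) = 1 − (4/θ)[1 − D₁(θ)], where D₁ is the first Debye function; the following base-R sketch (the search interval and tolerances are our own choices) inverts this relation for τ = 0.7:

```r
# Kendall's tau for the Frank copula: tau(theta) = 1 - (4/theta) * (1 - D1(theta)),
# where D1 is the first Debye function, D1(theta) = (1/theta) * int_0^theta t/(e^t - 1) dt.
frank_tau <- function(theta) {
  D1 <- integrate(function(t) t / expm1(t), lower = 0, upper = theta)$value / theta
  1 - (4 / theta) * (1 - D1)
}
# Invert the relation numerically to recover the parameter implied by tau = 0.7
theta <- uniroot(function(th) frank_tau(th) - 0.7, interval = c(0.1, 50), tol = 1e-9)$root
round(frank_tau(theta), 4)  # 0.7
```

The recovered value (approximately 11.41) is the true dependence parameter of the DGP; the value 11.95576 appearing in the output below is an estimate obtained from the simulated data and therefore differs slightly from it.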

We apply the CoClust to the 15 × 21 data matrix X using the maximum likelihood estimation method for the copula, the empirical cumulative distribution function for the three margins, and leaving the remaining arguments at their default values:

```
R> clust <- CoClust(X, dimset = 2:4, noc=2, copula="frank", method.ma="empirical",
+ method.c="ml", writeout=1)
```
The output is as follows:

```
R> clust
An object of class "CoClust"
Slot "Number.of.Clusters":
[1] 3
Slot "Index.Matrix":
     Cluster 1 Cluster 2 Cluster 3    LogLik
[1,]        11         1         6  34.15693
[2,]        13         3         8  69.87149
[3,]        12         2         7 103.67653
[4,]        14         4         9 136.31506
[5,]        15         5        10 170.36557
Slot "Data.Clusters":
     Cluster 1 Cluster 2 Cluster 3
  [1,] 0.35776965 4.566417 0.1634203
  [2,] 0.36621352 5.532188 0.1470511
  [3,] 0.99290268 11.191092 1.4006169
  [4,] 0.60411081 6.613533 0.5457595
  [5,] 0.13946354 2.658381 0.3489739
  [6,] 0.80523424 9.526025 0.6908222
  [7,] 0.79477600 8.899494 0.7864765
  [..] ....... ....... .......
[102,] 0.36904520 3.629722 0.2763105
[103,] 0.82647042 10.809628 1.3899675
[104,] 0.48283666 5.185873 0.2763133
[105,] 0.90394435 10.583053 0.9962130
Slot "Dependence":
$Copula
[1] "frank"
$Param
[1] 11.95576
$Std.Err
[1] 0.8261832
$P.value
[1] 0
Slot "LogLik":
[1] 170.3656
```
104 Recent Applications in Data Clustering
```
Slot "Est.Method":
[1] "maximum likelihood"
Slot "Opt.Method":
[1] "ml"
Slot "LLC":
         2 34
Slot "Index.dimset":
$`2`
     1 2 LogLik
[1,] 11 1 19.75591
[2,] 8 3 37.33473
$`3`
     1 2 3 LogLik
[1,] 11 1 6 34.15693
[2,] 13 3 8 69.87149
$`4`
     1 2 3 4 LogLik
[1,] 11 1 6 12 3.653809
[2,] 7 3 13 8 22.548352
```
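To make the structure of the printed object concrete: each column of Index.Matrix lists the rows of X allocated to one cluster, and Data.Clusters stacks, for every cluster, the 5 selected rows of 21 observations into a single column of 5 × 21 = 105 values. A minimal base-R sketch of this bookkeeping on a toy matrix of the same shape (the within-column ordering is our assumption, not something guaranteed by the package):

```r
set.seed(1)
X <- matrix(rnorm(15 * 21), nrow = 15, ncol = 21)  # toy data with the same shape as above
idx <- cbind(c(11, 13, 12, 14, 15),   # rows of X allocated to Cluster 1
             c( 1,  3,  2,  4,  5),   # rows of X allocated to Cluster 2
             c( 6,  8,  7,  9, 10))   # rows of X allocated to Cluster 3
# stack the 5 selected rows (21 observations each) into one column of 105 values per cluster
data.clusters <- apply(idx, 2, function(rows) as.vector(t(X[rows, ])))
dim(data.clusters)  # 105 3
```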
To look at specific objects of the resulting list, it is possible to select, among others, the following information:

```
R> clust@"Number.of.Clusters"   # Selected number of clusters
R> clust@"Dependence"$Param     # Estimated copula parameter
R> clust@"Data.Clusters"        # Clustered data
```

CoClust: An R Package for Copula-Based Cluster Analysis

http://dx.doi.org/10.5772/intechopen.74865

To compare the obtained clustering with the true clustering we can input:

```
R> index.clust <- clust@"Index.Matrix"
R> index.clust
R> index.true
```

to obtain as follows:

```
> index.clust
     Cluster 1 Cluster 2 Cluster 3    LogLik
[1,]        11         1         6  34.15693
[2,]        13         3         8  69.87149
[3,]        12         2         7 103.67653
[4,]        14         4         9 136.31506
[5,]        15         5        10 170.36557
> index.true
     Cluster 1 Cluster 2 Cluster 3
[1,]         1         6        11
[2,]         2         7        12
[3,]         3         8        13
[4,]         4         9        14
[5,]         5        10        15
```
The obtained clustering is perfect: CoClust is able to recognize the exact structure underlying the data. Note that the label of each cluster, that is, the order of the margins, is not relevant. The only important aspect is the composition of each cluster, that is, the row indexes in each column of index.clust, whose order has to be such that it reconstructs the exact 3-plets across the columns.

To apply the CoClust to the 15 × 21 data matrix X previously generated while changing the argument fun to max, the range of numbers of clusters to be tried, or the copula model, we can input, respectively:

```
R> clust <- CoClust(X, dimset = 2:4, noc=2, copula="frank", fun=max,
+ method.ma="empirical", method.c="ml", writeout=1)
R> clust <- CoClust(X, dimset = 3:5, noc=2, copula="frank",
+ method.ma="empirical", method.c="ml", writeout=1)
R> clust <- CoClust(X, dimset = 2:4, noc=2, copula="clayton",
+ method.ma="empirical", method.c="ml", writeout=1)
```
#### Example 2


In this example, we use a DGP that does not derive from a copula model, thus showing the use of CoClust in the misspecification case. Specifically, a 60 × 21 data matrix is drawn from a three-dimensional skew-normal distribution through the R package sn [31]; we then apply CoClust to cluster the row data matrix:

```
R> library(sn)
R> n.marg <- 3
R> rho <- 0.7
R> mu <- c(4,6,7)
R> v1 <- 1
R> v2 <- 1
R> v3 <- 1
R> omega <- matrix(c(v1, rho*sqrt(v1*v2), rho*sqrt(v1*v3), rho*sqrt(v1*v2), v2,
+ rho*sqrt(v3*v2), rho*sqrt(v1*v3), rho*sqrt(v3*v2), v3), n.marg)
R> alpha <- c(-1,1,1)
R> n.col <- 21
R> n.row <- 60
R> n.k <- n.row/n.marg
R> n <- n.row*n.col/n.marg
R> set.seed(11)
R> x.samp <- rmsn(n, xi=mu, Omega=omega, alpha=alpha)
R> X <- matrix(x.samp, nrow=n.row, ncol=n.col, byrow=TRUE)
R> clust <- CoClust(X, dimset=2:5, noc=4, copula="clayton", method.ma="empirical",
+ method.c="ml", penalty = "BICk", writeout=1)
```
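As an aside, the scale matrix omega built element by element above is simply an equicorrelation matrix rescaled by the marginal standard deviations, Ω = D R D with D = diag(√v); a compact base-R equivalent (the names R and omega2 are ours):

```r
rho <- 0.7
v <- c(1, 1, 1)                        # the marginal variances v1, v2, v3
R <- matrix(rho, 3, 3); diag(R) <- 1   # 3 x 3 equicorrelation matrix
omega2 <- diag(sqrt(v)) %*% R %*% diag(sqrt(v))  # equals omega above
```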
On the console, it is possible to monitor the number of observations already allocated (argument writeout). Indeed, while CoClust runs, the following information appears on the console:

```
Number of clusters selected: 3
Allocated observations: 5
Allocated observations: 10
Allocated observations: 15
```

To look at the obtained clustering and its details, one has to input:

```
R> clust
R> index.clust <- clust@"Index.Matrix"
R> index.true <- matrix(1:n.row, n.row/n.marg, n.marg)
R> index.clust; index.true
```

Note that when the number of K-plets to be allocated is not small, the goodness of the obtained clustering is difficult to determine. Hence, two functions can be exploited to assess the quality of the final clustering: pca.coclust, which computes the percentage of K-plets of the true DGP that have been correctly allocated in the final clustering, and pcc.coclust, which computes the percentage of K-plets of the obtained clustering that have been correctly allocated. In Appendix A, the R code of these two functions is shown. Here, we compute the true clustered index matrix as follows:

```
R> ind.t <- apply(matrix(1:n.row,n.k), FUN=paste, MARGIN=1, collapse="-")
```

and apply the two functions pca.coclust and pcc.coclust after loading the required package gtools:

```
R> library(gtools)
R> pca.coclust(clust, ind.t, n.marg)
[1] 65
R> pcc.coclust(clust, ind.t, n.marg)
[1] 86.66667
```

The obtained values inform us that 65% of the 3-plets deriving from the true DGP are correctly allocated and 86.7% of the 3-plets in the final clustering are correctly allocated.

#### 5. Application to wine dataset

In this section, an application of the CoClust package to a real dataset is shown. [32] analyze a set of Italian wines by observing the chemical properties of 178 specimens of three types of wines (Barolo, Grignolino, and Barbera) produced in the Piedmont region in Italy. The data are available in the package sn under the name wines.

A subset of randomly selected wines has been analyzed through CoClust by varying the number of clusters from 2 to 7 and the copula model among the three models of the Archimedean family. Since Grignolino is a type of wine with characteristics between those of Barolo and Barbera, we work with a sample of only these last two types of wines. Code is as follows:

```
R> data(wines)
R> n <- 6
R> set.seed(11)
R> ind.sample <- c(sample(1:59,n,replace=FALSE),sample(131:178,n,replace=FALSE))
R> X <- wines[ind.sample,-1]
R> Type.wine <- wines[ind.sample,1]
R> Type.wine
R> clustF <- CoClust(X, dimset = 2:7, noc=1, copula="frank", method.ma="empirical",
+ method.c="ml", writeout=1)
R> clustC <- CoClust(X, dimset = 2:7, noc=1, copula="clayton", method.ma="empirical",
+ method.c="ml", writeout=1)
R> clustG <- CoClust(X, dimset = 2:7, noc=1, copula="gumbel", method.ma="empirical",
+ method.c = "ml", writeout = 1)
```

To evaluate the final clustering obtained with a specific copula model, say the Frank model, and to compare it with the true classification of the 12 selected wines, the code is as follows:

```
R> K <- clustF@"Number.of.Clusters"
R> index.clust <- clustF@"Index.Matrix"
R> index.fin <- matrix(Type.wine[index.clust[,1:K]],nrow=nrow(index.clust),
+ ncol=(ncol(index.clust)-1))
R> index.clust
R> index.fin
```
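The K-plet matching that underlies pca.coclust and pcc.coclust can be illustrated with a short base-R sketch (a hypothetical simplification, not the Appendix A code): encode every K-plet as a label sorted within the K-plet, so that cluster labels do not matter, and then compare the obtained set with the true one. With Example 1's perfect clustering, the percentage is 100:

```r
# True 3-plets of Example 1: rows (1,6,11), (2,7,12), ..., (5,10,15)
true.kplets <- apply(matrix(1:15, 5), MARGIN = 1, FUN = paste, collapse = "-")
# Rows of the obtained Index.Matrix; the cluster labels themselves are irrelevant
obt <- rbind(c(11, 1, 6), c(13, 3, 8), c(12, 2, 7), c(14, 4, 9), c(15, 5, 10))
# Sort within each 3-plet so that the ordering of clusters does not matter
obt.kplets <- apply(obt, MARGIN = 1, FUN = function(r) paste(sort(r), collapse = "-"))
mean(obt.kplets %in% true.kplets) * 100  # 100
```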
