Recent Applications in Data Clustering

    R> n.row <- 60
    R> n <- n.row*n.col/n.marg
    R> set.seed(11)
    R> x.samp <- rmsn(n, xi=mu, Omega=omega, alpha=alpha)
    R> X <- matrix(x.samp, nrow=n.row, ncol=n.col, byrow=TRUE)
    R> clust <- CoClust(X, dimset=2:5, noc=4, copula="clayton", method.ma="empirical",
    +    method.c="ml", penalty="BICk", writeout=1)

On the console, it is possible to monitor the number of observations already allocated (argument writeout). Indeed, while CoClust runs, the following information appears on the console:

    Number of clusters selected: 3
    Allocated observations: 5
    Allocated observations: 10
    Allocated observations: 15

To look at the obtained clustering and its details, one has to input:

    R> clust
    R> index.clust <- clust@"Index.Matrix"
    R> index.true <- matrix(1:n.row, n.row/n.marg, n.marg)
    R> index.clust; index.true

Note that when the number of K-plets to be allocated is not small, the goodness of the obtained clustering is difficult to determine. Hence, for example, two functions can be exploited to assess the quality of the final clustering: pca.coclust, which counts how many K-plets of the true DGP have been correctly allocated in the final clustering, and pcc.coclust, which counts how many K-plets of the obtained clustering have been correctly allocated. The R code of these two functions is shown in Appendix A. Here, we compute the true clustered index matrix as follows:

    R> n.k <- n.row/n.marg
    R> ind.t <- apply(matrix(1:n.row,n.k), FUN=paste, MARGIN=1, collapse="-")

and the two functions pca.coclust and pcc.coclust after loading the required package gtools:

    R> library(gtools)
    R> pca.coclust(clust, ind.t, n.marg)
    [1] 65
    R> pcc.coclust(clust, ind.t, n.marg)
    [1] 86.66667

The obtained values inform us that 65% of the 3-plets deriving from the true DGP are correctly allocated and 86.7% of the 3-plets in the final clustering are correctly allocated.

5. Application to wine dataset

In this section, an application of the CoClust package to a real dataset is shown. [32] analyze a set of Italian wines by observing the chemical properties of 178 specimens of three types of wines (Barolo, Grignolino, and Barbera) produced in the Piedmont region in Italy. The data are available in the package sn under the name wines.

A subset of randomly selected wines has been analyzed through CoClust by varying the number of clusters from 2 to 7 and by fitting, in turn, the three copula models of the Archimedean family (Frank, Clayton, and Gumbel). Since Grignolino is a type of wine with characteristics between those of Barolo and Barbera, we work with a sample of only the latter two types. The code is as follows:

```
R> library("CoClust")
R> library("sn")
R> data(wines)
R> n <- 6
R> set.seed(11)
R> ind.sample <- c(sample(1:59, n, replace=FALSE), sample(131:178, n, replace=FALSE))
R> X <- wines[ind.sample,-1]
R> clustF <- CoClust(X, dimset=2:7, noc=1, copula="frank", method.ma="empirical",
+    method.c="ml", writeout=1)
R> clustC <- CoClust(X, dimset=2:7, noc=1, copula="clayton", method.ma="empirical",
+    method.c="ml", writeout=1)
R> clustG <- CoClust(X, dimset=2:7, noc=1, copula="gumbel", method.ma="empirical",
+    method.c="ml", writeout=1)
```
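The sampling step can be checked without the real data. The following is a minimal sketch that assumes the rows of wines are ordered as the index ranges above imply (rows 1 to 59 Barolo, rows 131 to 178 Barbera, with the Grignolino rows in between). The vector type.toy is a hypothetical stand-in for the first column of wines, not the dataset itself.

```r
# Hypothetical stand-in for the wine-type column of `wines`.
# Assumption: 59 Barolo rows, then 71 Grignolino rows, then 48 Barbera rows,
# as implied by the index ranges 1:59 and 131:178 used in the text.
type.toy <- factor(rep(c("Barolo", "Grignolino", "Barbera"),
                       times = c(59, 71, 48)))

n <- 6
set.seed(11)
# Draw n row indices from the Barolo block and n from the Barbera block
ind.sample <- c(sample(1:59, n, replace = FALSE),
                sample(131:178, n, replace = FALSE))

# Each selected type appears exactly n times; Grignolino is never drawn
table(droplevels(type.toy[ind.sample]))
```

Because the two index ranges do not overlap the Grignolino block, the resulting sample is balanced by construction, whatever the seed.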
To evaluate the final clustering obtained with a specific copula model, say the Frank model, and to compare it with the true classification of the 12 selected wines, the code is as follows:

```
R> Type.wine <- wines[ind.sample,1]
R> Type.wine
R> K <- clustF@"Number.of.Clusters"
R> index.clust <- clustF@"Index.Matrix"
R> index.clust
R> index.fin <- matrix(Type.wine[index.clust[,1:K]], nrow=nrow(index.clust),
+    ncol=(ncol(index.clust)-1))
R> index.fin
     [,1]      [,2]      [,3]      [,4]      [,5]      [,6]
[1,] "Barolo"  "Barolo"  "Barolo"  "Barolo"  "Barolo"  "Barolo"
[2,] "Barbera" "Barbera" "Barbera" "Barbera" "Barbera" "Barbera"
```
CoClust selects 6 clusters and allocates one wine of each type to every cluster. Thus, across clusters, we can perfectly recognize the two types of Italian wines (each row of index.fin contains a single type), and within each cluster we have different wines with different (i.e., independent) chemical characteristics.
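The relabelling step can be illustrated on toy input. In this sketch the index matrix is hypothetical (it is not the CoClust output reported above). It assumes, as the code above does, that the first K columns of the index matrix hold row indices of X, one column per cluster, followed by one extra column.

```r
# Types of the 12 sampled wines: first six Barolo, last six Barbera,
# matching the order produced by the sampling code in the text
Type.wine <- factor(rep(c("Barolo", "Barbera"), each = 6))

K <- 6
# Hypothetical index matrix: K columns of row indices plus one extra column
# (filled with NA here, since its content is not needed for the relabelling)
index.clust <- cbind(rbind(c(3, 1, 6, 2, 5, 4),
                           c(9, 12, 7, 10, 8, 11)),
                     NA)

# Replace every row index by the type of the corresponding wine
index.fin <- matrix(Type.wine[index.clust[, 1:K]],
                    nrow = nrow(index.clust),
                    ncol = ncol(index.clust) - 1)
index.fin

# A row is homogeneous when all of its wines share one type
apply(index.fin, 1, function(r) length(unique(r)) == 1)
```

Checking the rows for homogeneity in this way scales better than reading the printed matrix when the number of K-plets is large.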

Similarly, the other two copula models can be used as in clustC and clustG above. The results do not appear to be affected by the type of model used, although, based on the log-likelihood of the copula fitted on the final clustering, the most appropriate model appears to be the Gumbel model, with a log-likelihood of 527.3022 (compared with 500.8835 for the Frank copula and 429.184 for the Clayton copula).

Analysis (AIDA)" and Professor Simone Giannerini, University of Bologna, with whom the first version of the package was developed.

CoClust: An R Package for Copula-Based Cluster Analysis

http://dx.doi.org/10.5772/intechopen.74865

A. Appendix

The following two functions are useful to evaluate the goodness of the final clustering obtained through the CoClust algorithm when the true clustering or a benchmark clustering is available. The arguments of these two functions are ccfit, which is the CoClust object as returned by the corresponding R function; ind.t, which is the true clustering expressed through the clustered index matrix with clusters by columns and the row index of the matrix in Eq. (8) by rows; and nmarg, which is the dimension of the copula model, that is, the selected number of clusters.

For an example of the use of these two functions, see Section 4.3, "Example 2".

    R> library("gtools")

    pca.coclust <- function(ccfit, ind.t, nmarg){
      n.marg   <- ccfit@"Number.of.Clusters"
      ind.perm <- permutations(n.marg,n.marg)   # all orderings of the clusters
      n.comb   <- nrow(ind.perm)
      if(n.marg==nmarg){
        ind.cc <- ccfit@"Index.Matrix"[,1:n.marg]
        n.kp   <- nrow(ind.cc)                  # number of allocated K-plets
        res    <- rep(NA,n.kp)
        for(i in 1:n.kp){
          dum  <- ind.cc[i,]
          res0 <- rep(NA,n.comb)
          for(j in 1:n.comb){
            # compare the j-th ordering of the i-th K-plet with the true K-plets
            ind.ccs <- dum[ind.perm[j,]]
            ind.ccs <- paste(ind.ccs, collapse="-")
            res0[j] <- as.integer(ind.ccs%in%ind.t)
          }
          res[i] <- any(res0)
        }
        pca.k <- sum(res)/length(ind.t)*100
      }
      return(pca.k=pca.k)
    }

    pcc.coclust <- function(ccfit, ind.t, nmarg){
      n.marg   <- ccfit@"Number.of.Clusters"
      ind.perm <- permutations(n.marg,n.marg)
      n.comb   <- nrow(ind.perm)
      if(n.marg==nmarg){
        ind.cc <- ccfit@"Index.Matrix"[,1:n.marg]
        n.kp   <- nrow(ind.cc)
        res    <- rep(NA,n.kp)
        for(i in 1:n.kp){
          dum  <- ind.cc[i,]
          res0 <- rep(NA,n.comb)
          for(j in 1:n.comb){
            ind.ccs <- dum[ind.perm[j,]]
            ind.ccs <- paste(ind.ccs, collapse="-")
            res0[j] <- as.integer(ind.ccs%in%ind.t)
          }
          res[i] <- any(res0)
        }
        # Closing lines inferred from the definition in Section 4.3:
        # share of the K-plets in the final clustering that are correct
        pcc.k <- sum(res)/n.kp*100
      }
      return(pcc.k=pcc.k)
    }
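The order-insensitive matching behind pca.coclust and pcc.coclust can be sketched in base R without a fitted CoClust object. The example below is an illustration under stated assumptions, not the package code. The index matrix is hypothetical and the helper name match.kplets is ours. Sorting each row before pasting replaces the explicit loop over permutations, and the two approaches agree here because every true K-plet in ind.t is written in increasing order. The denominator used for pcc (the number of allocated K-plets) is inferred from the description of that function in the text.

```r
n.row  <- 9
n.marg <- 3
n.k    <- n.row / n.marg

# True K-plets written as strings: "1-4-7", "2-5-8", "3-6-9"
ind.t <- apply(matrix(1:n.row, n.k), FUN = paste, MARGIN = 1, collapse = "-")

# Hypothetical clustered index matrix: one allocated K-plet per row
ind.cc <- rbind(c(7, 1, 4),   # a permutation of the true K-plet 1-4-7
                c(2, 5, 9))   # not a true K-plet

# Hypothetical helper: TRUE for rows that match a true K-plet in any order
match.kplets <- function(ind.cc, ind.t) {
  apply(ind.cc, 1, function(r) paste(sort(r), collapse = "-") %in% ind.t)
}

res <- match.kplets(ind.cc, ind.t)
pca <- sum(res) / length(ind.t) * 100   # share of true K-plets recovered
pcc <- sum(res) / nrow(ind.cc) * 100    # share of allocated K-plets that are correct
c(pca = pca, pcc = pcc)
```

The two percentages differ whenever the final clustering allocates fewer K-plets than the true DGP contains, which is why both are reported.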
