Recent Applications in Data Clustering

4.1. List of functions and subroutines

The code of the CoClust package is entirely written in R, to enable using an easily accessible open source system and the input/output facilities.

The main R function is CoClust(), which performs the copula-based clustering, while the following auxiliary R functions

fit.margin(), fit.margin2(), fit.margin3(), fcond.mod(), CoClust\_perm(), stima\_cop()

are intended for internal use only and are not documented in the package.

The package must first be installed:

R> install.packages("CoClust")

and then it must be loaded through the usual code:

R> library("CoClust")

4.2. The CoClust function

The main function of the package CoClust is the R function CoClust(), which performs copula-based clustering as described in Section 3. Some options are present, which mainly allow us to:

• fit a variety of copula models (by setting the argument copula) with different types of estimation procedures for margins and for copulas (arguments method.ma and method.c, respectively); specifically, all the copula models belonging to the Elliptical and the Archimedean family described in Section 2 can be estimated through the estimation methods implemented in the R package copula [27–30], namely maximum pseudo-likelihood estimators based on two different variance estimators, the inversion of Kendall's τ estimator, and the inversion of Spearman's ρ estimator; as for the margins, two different estimation methods have been implemented, one parametric and one nonparametric: the maximum likelihood method as in Eq. (4) and the empirical cumulative distribution function in Eq. (6);

• set the range or set of dimensions for the copula model, that is, number of clusters, for which the function tries the clustering (argument dimset);

• set the dimension of the sample units used for selecting the number of clusters (argument noc);

• select the combination function of the pairwise Spearman's ρ used to select the k-plets among the mean, the median, or the maximum (argument fun) as defined in Eq. (9);

• specify the likelihood criterion used for selecting the number of clusters among the AIC, the BIC (as defined in Eqs. (10) and (11)), and the log-likelihood without penalty terms (argument penalty).

The argument copula allows specifying a copula model among those described in Section 2.1. As for the selection of the "best" model, CoClust can be run by varying the type of models of interest and selecting the one that fits best a posteriori using one of the criteria introduced in Section 3.2.

The typical use of the function CoClust is as follows:

```
CoClust(m, dimset = 2:5, noc = 4, copula = "frank", fun = median,
        method.ma = c("empirical", "pseudo"), method.c = c("ml", "mpl",
        "irho", "itau"),
        dfree = NULL, writeout = 5, penalty = c("BICk", "AICk", "LL"), …)
```
where m is the input data matrix and the writeout argument allows monitoring the allocation process, since it reports on each newly allocated observation. Further details on the input arguments are given in the package help files.
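To make the role of the fun argument concrete, the following is a minimal base-R sketch, not the package's internal code. It assumes that the rows of the data matrix are the observations to be grouped; the helper combine_rho() and the toy matrix m_toy are hypothetical names introduced only for illustration. Whether Eq. (9) applies the combination to raw or absolute coefficients is not restated here, so the sketch uses the raw values.

```r
# Toy data (simulated, for illustration only):
# 6 observations in rows, 50 replicates in columns.
set.seed(1)
m_toy <- matrix(rnorm(6 * 50), nrow = 6)

# Hypothetical helper: combine the pairwise Spearman's rho
# of the rows listed in `idx` with a summary function `fun`.
combine_rho <- function(m, idx, fun = median) {
  # cor() works column-wise, so the selected rows are transposed
  rho <- cor(t(m[idx, , drop = FALSE]), method = "spearman")
  # keep each pair once (upper triangle, diagonal excluded)
  fun(rho[upper.tri(rho)])
}

# Compare the three combination functions on one candidate 3-plet
candidate <- c(1, 3, 5)
res <- sapply(list(mean = mean, median = median, max = max),
              function(f) combine_rho(m_toy, candidate, f))
print(res)
```

With fun = max a single strongly dependent pair is enough to give a candidate a high score, whereas the mean and the median summarize the dependence of the whole k-plet.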

The main output of the function CoClust is an object of S4 class "CoClust" which is a list with the following elements:

	- a. Model: the copula model used for the clustering;
	- b. Param: the estimated dependence parameter between/among clusters;
	- c. Std.Err: the standard error of Param;
	- d. P.val: the p-value associated with the null hypothesis H<sub>0</sub>: θ = 0;
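How Param, Std.Err, and P.val relate to one another can be illustrated with a short sketch. It assumes a two-sided Wald-type z test of H<sub>0</sub>: θ = 0, which is an assumption made here and not a statement of how the package computes the p-value. The two numbers are the Param and Std.Err values from the printed output in Section 4.3.

```r
# Values from the printed "Dependence" slot of the example output
param   <- 11.95576
std_err <- 0.8261832

# Assumption: Wald-type statistic, asymptotically standard normal under H0
z <- param / std_err

# Two-sided p-value for H0: theta = 0
p_value <- 2 * pnorm(-abs(z))

print(z)
print(p_value)
```

Under this assumption the statistic is about 14.5, so the p-value is numerically indistinguishable from zero, which is in line with the value 0 shown in the output.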

#### 4.3. Simulated examples

This section shows how to use the CoClust package on data simulated from different DGPs. In the first example, the data are drawn from a joint density function with different margins, whereas in the second example, a misspecified DGP is used. In these examples, we focus only on the semi-parametric approach described in Section 2 due to its theoretical and computational advantages with respect to the full parametric approach. Moreover, the latter has only been implemented for Gaussian margins.
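In the semi-parametric approach, the margins are handled through the empirical cumulative distribution function before the copula is fitted. The base-R sketch below shows this transformation on simulated data. The rescaling by n + 1 is a common convention (it is what pobs() in the copula package does) and is included as an assumption; which scaling CoClust applies internally is not stated here.

```r
# Simulated margin (illustrative data only)
set.seed(42)
x <- rexp(10, rate = 2)
n <- length(x)

# Empirical CDF evaluated at the data: values in {1/n, ..., 1}
u_ecdf <- ecdf(x)(x)

# Rescaled pseudo-observations: strictly inside (0, 1), which avoids
# evaluating a copula density on the boundary of the unit cube
u_resc <- rank(x) / (n + 1)

print(cbind(x = x, u_ecdf = u_ecdf, u_resc = u_resc))
```

Both versions depend on the data only through their ranks, so any strictly increasing transformation of a margin leaves them unchanged.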

```
An object of class "CoClust"
Slot "Number.of.Clusters":
[1] 3

Slot "Index.Matrix":
     Cluster 1 Cluster 2 Cluster 3    LogLik
[1,]        11         1         6  34.15693
[2,]        13         3         8  69.87149
[3,]        12         2         7 103.67653
[4,]        14         4         9 136.31506
[5,]        15         5        10 170.36557

Slot "Data.Clusters":
        Cluster 1 Cluster 2 Cluster 3
  [1,] 0.35776965  4.566417 0.1634203
  [2,] 0.36621352  5.532188 0.1470511
  [3,] 0.99290268 11.191092 1.4006169
  [4,] 0.60411081  6.613533 0.5457595
  [5,] 0.13946354  2.658381 0.3489739
  [6,] 0.80523424  9.526025 0.6908222
  [7,] 0.79477600  8.899494 0.7864765
  [..]    .......   .......   .......
[102,] 0.36904520  3.629722 0.2763105
[103,] 0.82647042 10.809628 1.3899675
[104,] 0.48283666  5.185873 0.2763133
[105,] 0.90394435 10.583053 0.9962130

Slot "Dependence":
$Copula
[1] "frank"

$Param
[1] 11.95576

$Std.Err
[1] 0.8261832

$P.value
[1] 0

Slot "LogLik":
[1] 170.3656
```

CoClust: An R Package for Copula-Based Cluster Analysis

http://dx.doi.org/10.5772/intechopen.74865
