model. In this chapter, we consider model-based clustering methods, which assume that the data matrix is generated according to a specific data generating process (henceforth DGP). The classic model-based clustering method in [1, 2] is based on a mixture of multivariate probability distributions, such as the multivariate normal. However, this approach accounts only for the linear dependence between objects, so it inherits all the limitations of the linear correlation coefficient. Hence, we focus here on clustering methods that assume the data matrix is generated by a K-dimensional copula [3], such that each of the K clusters is represented by a (continuous) univariate density function and the complex multivariate relationship among clusters is expressed by the copula and its dependence parameter. Specifically, this chapter aims to describe the copula-based clustering algorithm first introduced by [4] and improved by [5], presenting in detail its implementation in the R package CoClust. The CoClust approach inspired the work of [6], while different copula-based clustering approaches can be found in [7–9], [10–13], [14], and [15, 16]. To the best of our knowledge, none of these methods has been implemented in software available to the scientific community.

Most clustering algorithms take as input some parameters, such as the number of clusters, the distance between or density of clusters, or the number of points in a cluster, together with a starting classification. Some important benefits of the R function CoClust, which implements the copula-based clustering algorithm in [5], are that (i) the user can test several candidate numbers of clusters in a single function call, (ii) no starting classification is needed, and (iii) the algorithm can estimate the density of the clusters nonparametrically, that is, distribution free. The package is available from the Comprehensive R Archive Network (CRAN) at https://cran.r-project.org/web/packages/CoClust/index.html.

The chapter is organized as follows. Section 2 presents the theoretical tools of copula theory essential to understanding the copula-based clustering algorithm introduced and described in Section 3. Section 4 describes in detail the R implementation of the CoClust algorithm and illustrates its use on simulated DGPs. In Section 5, an application to a real dataset is presented, while a brief conclusion follows in Section 6.

94 Recent Applications in Data Clustering

2. Copula theory

The copula function [17–20] was born in the context of probabilistic metric spaces with Sklar's theorem [3], which states that every joint distribution function F(·) can be expressed in terms of the K marginal distribution functions F_k and the copula distribution function C as follows:

$$F(\mathbf{x}\_1, \dots, \mathbf{x}\_k, \dots, \mathbf{x}\_K) = C(F\_1(\mathbf{x}\_1), \dots, F\_k(\mathbf{x}\_k), \dots, F\_K(\mathbf{x}\_K)) \tag{1}$$

for all (x_1, …, x_k, …, x_K) ∈ R^K (where R denotes the extended real line). According to this theorem, any joint probability density function f(·) can be split into the margins f_k(·) and a copula density c(·), so that the latter represents the association among the variables, that is, the multivariate dependence structure of the joint density function [17–20]:

CoClust: An R Package for Copula-Based Cluster Analysis http://dx.doi.org/10.5772/intechopen.74865 95

$$f(\mathbf{x}\_1, \dots, \mathbf{x}\_k, \dots, \mathbf{x}\_K) = c(F\_1(\mathbf{x}\_1), \dots, F\_k(\mathbf{x}\_k), \dots, F\_K(\mathbf{x}\_K)) \prod\_{k=1}^K f\_k(\mathbf{x}\_k). \tag{2}$$

Such a separation determines the modeling flexibility of copulas, since it enables (i) freely choosing the distribution of the margins and, separately, that of the copula, (ii) decomposing the estimation problem into two steps, with the margins estimated in the first step and the copula model in the second, and (iii) combining different estimation methods or approaches.

The log-likelihood function of f(·) is composed of two terms as follows:

$$l(\Theta) = \sum\_{i=1}^{n} \log c\{F\_1(X\_{1i}; \boldsymbol{\beta}\_1), \dots, F\_k(X\_{ki}; \boldsymbol{\beta}\_k), \dots, F\_K(X\_{Ki}; \boldsymbol{\beta}\_K); \theta\} + \sum\_{i=1}^{n} \sum\_{k=1}^{K} \log f\_k(X\_{ki}; \boldsymbol{\beta}\_k) \tag{3}$$

where the first term involves the copula density c(·) and its parameter θ, the second involves the marginal densities f_k(·) and their parameters β_k, and the whole set of parameters to be estimated is Θ = (β_1, …, β_k, …, β_K, θ). Thus, it is possible to estimate f(·) by exploiting the two-term decomposition in Eq. (2) [20, Chapter 4] through a sequential two-step maximum likelihood method, called inference for margins (henceforth IFM) [21]. This method estimates the marginal parameters in the first step and uses them to estimate the parameter of the copula function in the second step, in either a fully parametric or a semi-parametric approach.
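As an aside, the factorization in Eq. (2) can be verified numerically in a toy case. The sketch below (plain Python, standard library only; the Gaussian-copula density formula is a standard textbook result, not code from the CoClust package) checks that a bivariate standard normal density with correlation 0.6 equals the Gaussian copula density times the two marginal normal densities:

```python
import math
from statistics import NormalDist

rho = 0.6
nd = NormalDist()  # standard normal margin

def biv_normal_pdf(z1, z2, rho):
    # standard bivariate normal density with correlation rho
    q = (z1 * z1 - 2 * rho * z1 * z2 + z2 * z2) / (1 - rho * rho)
    return math.exp(-q / 2) / (2 * math.pi * math.sqrt(1 - rho * rho))

def gaussian_copula_density(u1, u2, rho):
    # c(u1, u2; rho) for the Gaussian copula, via normal quantiles
    z1, z2 = nd.inv_cdf(u1), nd.inv_cdf(u2)
    num = 2 * rho * z1 * z2 - rho * rho * (z1 * z1 + z2 * z2)
    return math.exp(num / (2 * (1 - rho * rho))) / math.sqrt(1 - rho * rho)

# Eq. (2): f(x1, x2) = c(F1(x1), F2(x2)) * f1(x1) * f2(x2)
x1, x2 = 0.7, -1.2
lhs = biv_normal_pdf(x1, x2, rho)
rhs = gaussian_copula_density(nd.cdf(x1), nd.cdf(x2), rho) * nd.pdf(x1) * nd.pdf(x2)
print(abs(lhs - rhs) < 1e-9)  # True
```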

A fully parametric approach for the IFM method estimates the marginal parameters (β_1, …, β_k, …, β_K) in the first step by maximum likelihood for each margin:

$$\widehat{\boldsymbol{\beta}}\_{k} = \arg\max\_{\boldsymbol{\beta}\_{k}} \sum\_{i=1}^{n} \log f\_{k}(X\_{ki}; \boldsymbol{\beta}\_{k}) \tag{4}$$

where each marginal distribution f_k has its own parameters β_k. In the second step, the dependence parameter θ, given β̂_k for k = 1, …, K, is estimated by:

$$\widehat{\theta} = \arg\max\_{\theta} \sum\_{i=1}^{n} \log c\left[F\_{1}\left(X\_{1i}; \widehat{\boldsymbol{\beta}}\_{1}\right), \dots, F\_{k}\left(X\_{ki}; \widehat{\boldsymbol{\beta}}\_{k}\right), \dots, F\_{K}\left(X\_{Ki}; \widehat{\boldsymbol{\beta}}\_{K}\right); \theta\right] \tag{5}$$

using the maximum likelihood estimation method.
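To make the two IFM steps of Eqs. (4)–(5) concrete, here is a minimal illustrative sketch in plain Python (not the CoClust implementation; the simulated DGP, margin choice, and grid search are all hypothetical simplifications): step one fits exponential margins by their closed-form MLE, step two maximizes the Clayton copula log-likelihood over θ by grid search.

```python
import math
import random

random.seed(1)

# --- simulate bivariate data: Clayton(theta = 2) dependence, exponential margins ---
# standard conditional-inversion sampler for the Clayton copula (theta > 0)
theta_true, n = 2.0, 2000
u = [random.random() for _ in range(n)]
w = [random.random() for _ in range(n)]
v = [((wi ** (-theta_true / (1 + theta_true)) - 1) * ui ** (-theta_true) + 1) ** (-1 / theta_true)
     for ui, wi in zip(u, w)]
x1 = [-math.log(1 - ui) / 1.0 for ui in u]   # Exp(rate = 1.0) via inverse cdf
x2 = [-math.log(1 - vi) / 0.5 for vi in v]   # Exp(rate = 0.5)

# --- IFM step 1 (Eq. 4): ML estimates of the marginal parameters ---
# for an exponential margin the MLE of the rate is 1 / sample mean
rate1 = 1.0 / (sum(x1) / n)
rate2 = 1.0 / (sum(x2) / n)

def exp_cdf(x, rate):
    return 1.0 - math.exp(-rate * x)

def log_clayton_density(u, v, theta):
    # log c(u, v; theta) for the bivariate Clayton copula, theta > 0
    s = u ** (-theta) + v ** (-theta) - 1.0
    return (math.log(1 + theta)
            - (theta + 1) * (math.log(u) + math.log(v))
            - (2 + 1 / theta) * math.log(s))

# --- IFM step 2 (Eq. 5): maximize the copula log-likelihood given the margins ---
U = [exp_cdf(xi, rate1) for xi in x1]
V = [exp_cdf(xi, rate2) for xi in x2]

def loglik(theta):
    return sum(log_clayton_density(ui, vi, theta) for ui, vi in zip(U, V))

grid = [0.1 * j for j in range(1, 61)]        # crude search over theta in (0, 6]
theta_hat = max(grid, key=loglik)
print(round(theta_hat, 1))
```

With n = 2000 the grid-search estimate lands close to the true value θ = 2; in practice a numerical optimizer would replace the grid.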

The IFM method can also be used in a semi-parametric approach [22], where the margins are modeled without assumptions on their parametric form, that is, through the empirical cumulative distribution function F̂_k, which yields the pseudo-observations:

$$
\widehat{U}\_{ki} = \frac{n\widehat{F}\_k(X\_{ki})}{n+1} \tag{6}
$$

where F̂_k(X_{ki}) is computed from (X_{k1}, …, X_{ki}, …, X_{kn}) with k = 1, 2, …, K and n is the sample size, while the copula parameter θ is estimated by maximizing the following log-likelihood function:

$$\widehat{\theta} = \arg\max\_{\theta} \sum\_{i=1}^{n} \log c\left[\widehat{U}\_{1i}, \dots, \widehat{U}\_{ki}, \dots, \widehat{U}\_{Ki}; \theta\right]. \tag{7}$$

Note that the scaling factor n/(n + 1) in Eq. (6) is typically introduced in the nonparametric computation of the margins to avoid numerical problems at the boundary of [0, 1]^K.
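Eq. (6) amounts to ranking each margin and rescaling the ranks by n + 1. A tiny plain-Python sketch of this computation (illustrative only; ties are ignored for simplicity):

```python
def pseudo_obs(x):
    # Eq. (6): n * Fhat(x_i) is the rank of x_i, then divide by n + 1
    # so every pseudo-observation stays strictly inside (0, 1)
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    ranks = [0] * n
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return [r / (n + 1) for r in ranks]

print(pseudo_obs([2.3, 0.1, 5.7, 1.4]))  # [0.6, 0.2, 0.8, 0.4]
```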

#### 2.1. Copula models

While many different copula models are available in the literature (see [18, 19] for details), the Elliptical and Archimedean families have proven the most useful in empirical modeling. The Elliptical family includes the Gaussian copula and the t-copula: both are symmetric, exhibit the strongest dependence in the middle of the distribution, and can account for both positive and negative dependence, since −1 ≤ θ ≤ 1. The Archimedean family enables describing left asymmetry, right asymmetry, and weak symmetry among the margins through the Clayton, Gumbel, and Frank models, respectively. The Clayton copula has parameter θ ∈ (0, ∞), and as θ approaches zero the margins become independent. The dependence parameter θ of the Gumbel model is restricted to the interval [1, +∞), where the value 1 means independence. Finally, the dependence parameter θ of the Frank copula may assume any real value, and as θ approaches zero the marginal distributions become independent. The value of θ thus has a specific meaning for each type of copula model, and the magnitudes of the dependence parameter are not comparable across copulas. It is always true that the greater the value of the dependence parameter, the stronger the association among the margins; moreover, since the relationship between θ and the concordance measures is well known, it is standard to convert θ into one of these, for example, Kendall's τ correlation coefficient. The families of copula models considered here are described in Table 1 and shown in Figure 1 in their bivariate version. Note that only single-parameter copula models are considered here.
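For the single-parameter families in Table 1, the θ-to-τ conversions are available in closed form, except for the Frank copula, whose τ involves the Debye function D_1. A small sketch of these standard relations (plain Python, illustrative only) shows why raw θ values are not comparable across families:

```python
import math

# closed-form Kendall's tau for some single-parameter copulas (standard results)
def tau_gaussian(theta):   # -1 <= theta <= 1
    return 2.0 / math.pi * math.asin(theta)

def tau_clayton(theta):    # theta > 0
    return theta / (theta + 2.0)

def tau_gumbel(theta):     # theta >= 1
    return 1.0 - 1.0 / theta

# similar theta magnitudes can mean quite different strengths of association
print(round(tau_gaussian(0.5), 3),
      round(tau_clayton(0.5), 3),
      round(tau_gumbel(1.5), 3))  # 0.333 0.2 0.333
```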

The copula model selection task is still an open research field. Although various statistical tests enable evaluating whether a specific model is plausible or not, no tool has thus far been recognized as the best. In the copula-based clustering context, this issue can be overcome, since the choice of the type of model appears less important for the goodness of the final clustering (see [5], Section 4.3), and a classic information criterion can be used, such as the Bayesian or the Akaike information criterion. This topic is discussed in detail in Section 3.
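As a toy illustration of using an information criterion for copula choice (a sketch only, not the CoClust selection procedure; the simulated data and grid search are hypothetical): simulate Clayton-dependent pseudo-observations and compare the fitted Clayton copula against the independence copula via AIC.

```python
import math
import random

random.seed(7)

# simulate a Clayton(theta = 3) pseudo-sample via conditional inversion
theta0, n = 3.0, 500
U, V = [], []
for _ in range(n):
    u, w = random.random(), random.random()
    v = ((w ** (-theta0 / (1 + theta0)) - 1) * u ** (-theta0) + 1) ** (-1 / theta0)
    U.append(u)
    V.append(v)

def clayton_loglik(theta):
    # copula log-likelihood for the bivariate Clayton density, theta > 0
    ll = 0.0
    for u, v in zip(U, V):
        s = u ** (-theta) + v ** (-theta) - 1.0
        ll += (math.log(1 + theta)
               - (theta + 1) * (math.log(u) + math.log(v))
               - (2 + 1 / theta) * math.log(s))
    return ll

best_ll = max(clayton_loglik(0.1 * j) for j in range(1, 81))  # crude grid on theta
aic_clayton = 2 * 1 - 2 * best_ll   # one dependence parameter
aic_indep = 0.0                     # independence copula: c = 1, no parameters
print(aic_clayton < aic_indep)  # True: the dependent model is preferred
```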

Figure 1. Contour plots of bivariate copula models with standard normal margins and dependence parameter θ such that Kendall's correlation coefficient is τ = 0.7; upper panel: Gaussian and Student-t copula models with 2 and 4 degrees of freedom; lower panel: Clayton, Gumbel, and Frank copula models.

Table 1. Some standard single-parameter bivariate copulas with the range of the dependence parameter θ and its relation with Kendall's τ. u_k with k = 1, 2 are uniformly distributed variates so that x_k = F_k^{−1}(u_k). Φ is the cumulative distribution function (cdf) of the standard normal distribution, Φ_G(u_1, u_2) is the standard bivariate normal distribution, t_{2,ν}(·, ·; θ) denotes the standard bivariate Student-t distribution with ν degrees of freedom, and t_ν^{−1} the inverse univariate Student-t distribution function. D_1(x) denotes the Debye function (1/x) ∫_0^x t/(exp(t) − 1) dt.

3. CoClust algorithm

Di Lascio and Giannerini [4] proposed a clustering algorithm, called CoClust, that is able to cluster multivariate observations with a complex dependence structure. The basic concept underlying CoClust is to cluster multivariate dependent observations on the basis of the copula likelihood fit estimated on the previously allocated observations. To do so, the CoClust assumes
