Preface

Clustering is the process of grouping or classifying data points into several different groups or classes where several similar data points are organized into the same cluster or group according to some criterion such as similar features or characteristics. That is, a cluster or group is a collection of several or many data points that have some similar features or characteristics and have different features or characteristics from data points in other clusters or groups. Clustering can be used to obtain the distribution of data, the features of each cluster or group, and analyze some special clusters or groups. Data clustering has many important practical applications in exploratory pattern analysis, grouping, and decision-making, such as classifying sales data to reflect consumer buying behavior or classifying network data to explore communication patterns or grouping student data to reveal their sex characteristics or some field's specialty or special skill or knowledge. The similarity between data points plays an important role in clustering. To this end, many statistics have been developed to measure similarity, for example, Mahalanobis distance, K-means, K-medoids, Wasserstein distance, Kullback-Leibler divergence, and others. In addition, many algorithms have been developed for data clustering, for example, partitional clustering, hierarchical clustering, density-based clustering, and model-based clustering. However, these existing methods cannot directly be applied to more complicated data clustering such as nonlinear separable patterns, heterogeneity data, jump-diffusion models, and Brazilian legal documents for natural language processing. As such, this book introduces some novel approaches to deal with these complicated data or models. Also, to stimulate readers' interest, this book introduces the application of data clustering methods developed recently for insurance, psychology, and pattern recognition, and survey data.

This book includes three sections and seven chapters.

Section I includes one chapter that discusses the development of data clustering, including measures of similarity or dissimilarity for data clustering, data clustering algorithms and assessment of clustering algorithms.

Section II introduces clustering methods and includes three chapters. In Chapter 2, Dr. Lakshmi, and Dr. Veeranjaneyulu scientifically review the widely used clustering algorithm. In Chapter 3, Professor Nascimento and Dr. Souza de Oliveira introduce a data clustering method based on the similarity of Brazilian legal documents using natural language processing approaches. In Chapter 4, Dr. Xia, Dr. Zhu, and Dr. Gou present a Bayesian model-based clustering technique for assessing the heterogeneity of a two-part model and investigate its application to cocaine use data.

Section III includes three chapters that focus on the application of recently developed clustering methods. In Chapter 5, B.Sc. Mushunje, Mrs. Mashiri, Dr. Chandiwana, and Mr. Mashasha discuss the application of jump-diffusion models to insurance claim estimation. In Chapter 6, Dr. Duggirala studies fuzzy perceptron learning for

non-linearly separable patterns. Finally, in Chapter 7, Professor Chadjipadelis and Dr. Panagiotidou present a semantic map including bringing together groups and discourses.

I was invited to edit this book after the publication of *Bayesian Analysis for Hidden Markov Factor Analysis Models*, which I co-wrote with Yemao Xia, Xiaoqian Zeng, and my previously edited book, *Bayesian Inference on Complicated Data*. I am very grateful to Dr. Maja Bozicevic for his kind invitation to edit this book and for providing me the chance to work with my aforementioned coauthors. I would also like to thank all the chapter authors for their contributions. I hope this book will be of great interest to statisticians, engineers, decision-makers, data analysts, biologists, ecologists, and AI and machine learning researchers.

> **Niansheng Tang** Department of Statistics, Yunnan University, Kunming, China

Section 1 Introduction

### **Chapter 1**

## Introductory Chapter: Development of Data Clustering

*Niansheng Tang and Ying Wu*

### **1. Introduction**

Data clustering is a popular method in statistics and machine learning and is widely used to make decisions and predictions in various fields such as life science (e.g., biology, botany, zoology), medical sciences (e.g., psychiatry, pathology), behavioral and social sciences (e.g., psychology, sociology, education), earth sciences (e.g., geology, geography), engineering sciences (e.g, pattern recognition, artificial intelligence, cybernetics, electrical engineering), and information and decision sciences (e.g., information retrieval, political science, economics, marketing research, operational research) [1]. Clustering analysis aims to group individuals into a number of classes or clusters using some measure such that the individuals within classes or clusters are similar in some characteristics, and the individuals in different classes or clusters are quite distinct in some features.

### **2. Measures of similarity or dissimilarity**

There are a lot of measures of similarity or dissimilarity for data clustering. Generally, assessing the similarity of individuals in terms of the number of characteristics, which can be regarded as the points in space (e.g., a plane, the surface of a sphere, three-dimensional space, or higher-dimensional space) directly relates to the concept of distance from a geometrical viewpoint [1]. The widely used measures include Euclidean distance, Manhattan distance (also called city-block distance), and Mahalanobis distance for measuring the similarity of two data points. Euclidean distance depends on the rectangular coordinate system, Manhattan distance depends on the rotation of the coordinate system, but Euclidean and Manhattan distances do not consider the correlation between data variables and data dimensions. Mahalanobis distance can be regarded as a correction of Euclidean distance, the dependence of the data points is described by covariance matrix, which can be used to deal with the problem of non-independent and identically distributed data. In addition, there are other distances such as chebychev distance, power distance, and sup distance.

In many applications, different types of data are related to different distances. For example, the simple matching distance is used to measure the similarity of two categorical data points; a general similarity coefficient is adopted to measure the distance of two mixed-type data points or the means of two clusters; probabilistic model, landmark models, and time series transformation distance are used to measure the similarity of two time-series data points. In particular, Wu et al. [2] considered

spectral clustering for high-dimensional data exploiting sparse representation vectors, Kalra et al. [3] presented online variational learning for medical image data clustering, Prasad et al. [4] discussed leveraging variational autoencoders for image clustering, Soleymani et al. [5] proposed a deep variational clustering framework for self-labeling of large-scale medical image data.

Although the aforementioned similarity and dissimilarity measures can be applied to various types of data, other types of similarity and dissimilarity measures such as cosine similarity measure and a link-based similarity measure have also been developed for specific types of data. Also, one may require computing the distance between an individual and a cluster, and the distance between two clusters based on various central data points or representative data points. In these cases, the widely used distances include the mean-based distance, the nearest neighbor distance, the farthest neighbor distance, the average neighbor distance, which are extensions of data point distances. Particularly, the Lance-Williams formula can be used to compute the distances between the old clusters and a new cluster formed by two clusters. Again, to assess the similarity among probability density distributions of random variables, one can use Kullback-Leibler (K-L) distance (relative entropy) and Wasserstein distance. K-L distance does not satisfy three properties of the distance and is asymmetric, while Wasserstein distance possesses three properties of the distance and is symmetric. More importantly, Wasserstein distance can be used to deal with the mixture of discrete and continuous data. To this end, data clustering based on the Wasserstein distance has received a lot of attention over the past years. For example, see [6–9] for dynamic clustering of interval data, complex data clustering, variational clustering, geometric clustering, respectively.

### **3. Data clustering algorithms**

Many useful data clustering algorithms have been developed to cluster individuals into different clusters over the past years. For example, hierarchical clustering algorithm, partitioning algorithm, fuzzy clustering algorithm [10], center-based clustering algorithm, search-based clustering algorithm, graph-based clustering algorithm, grid-based clustering algorithm, density-based clustering algorithm, model-based clustering algorithm, and subspace clustering [11]. Hierarchical clustering algorithm, which divides individuals into a sequence of nested partitions has two key algorithms: agglomerative algorithm and divisive algorithm, and partitioning algorithm are two important clustering algorithms. Fuzzy clustering algorithm allows an individual to belong to two or more clusters with different probabilities, has three major algorithms: fuzzy k-means, fuzzy k-modes, and c-means. Center-based clustering algorithm is more used to cluster large scales and high-dimensional data sets has two major algorithms: k-means and k-modes in which k-means is the most widely used clustering algorithm, and is a non-hierarchical clustering method. Search-based clustering algorithm is usually used to find the globally optimal clustering for fitting the data set in a solution space, its main algorithms include genetic algorithm, tabu search algorithm, and simulated annealing algorithm. Graph-based clustering algorithm is suitable for clustering graphs or hypergraphs via the dissimilarity matrix of the data set. Grid-based clustering algorithm is sequentially implemented by partitioning the data space into a finite number of cells, estimating the cell density for each cell, sorting the cells with their densities, determining cluster centers, and traversal of neighbor cells, it can largely reduce the computational complexity. Density-based clustering

### *Introductory Chapter: Development of Data Clustering DOI: http://dx.doi.org/10.5772/intechopen.104505*

algorithm is clustered according to dense regions separated by low-density regions, can be used to cluster any shaped clusters but is not suitable for high-dimensional data sets. The commonly used density-based clustering algorithms include DBSCAN (Density-based spatial clustering of application with noise), which cannot deal with clustering for data sets with different densities, OPTICS (Ordering points to identify the clustering structure) which can solve the clustering problem for data sets with different densities and outliers, BRIDGE, DBCLASD, DENCLUE, and CUBN algorithms. Recently Ma et al. [12] developed a density-based radar scanning clustering algorithm that can discover and accurately extract individual clusters by employing the radar scanning strategy.

Model-based clustering algorithm becomes an increasingly popular tool and is conducted by assuming that data sets under consideration come from a finite mixture of probability distributions, and each component of the mixture represents a different cluster, which indicates that it requires knowing the number of components in the mixture including finite mixture model (a parametric method) and infinite mixture model (a nonparametric method), the clustering kernel including multivariate Gaussian mixture models, the hidden Markov mixture models, Dirichlet mixture models, and non-Gaussian distributions-based mixture models. Also, model-based clustering algorithms can be divided into non-Bayesian and Bayesian methods, its implementation is challenging [13]. Recently Goren and Maitra [14] developed a clustering methodology using the marginal density for the observed values assuming a finite mixture model of multivariate t distributions for partially recorded data. For clustering problems with missing data, the most common treatment is deletion or imputation. Deletion methods may lead to poor clustering performance when the missing data mechanism is not missing completely at random. In contrast, imputation method using a predicted value to impute each missing value may lead to a better clustering performance when the missing data mechanism is missing at random. But it is rather difficult to impute a suitable value for each missing value for missing not at random. The defects of deletion and imputation do not consider the missing data structure. Model-based clustering via the finite mixture of the multivariate Gaussian or t distributions has been applied to many fields, for example, see [15, 16].

Subspace clustering is conducted by identifying different clusters embedded in different subspaces of the high-dimensional data, whose clustering has several difficulties: distinguishing similar data points from dissimilar ones due to the same distance between any two data points, and different clusters lying in different subspaces. In this case, dimension reduction techniques such as principal component analysis or feature selection techniques [17, 18]. It is rather difficult to tell readers which algorithm should be used for some settings considered and how to compare novel ideas with the existing results because of its unsupervised learning process. But Gan, Ma and Wu [11] gave a comprehensive review of the applications of the aforementioned clustering algorithms.

### **4. Assessment of clustering algorithm**

Since data clustering is a non-supervised method, the assessment of clustering algorithm is rather important. In the data clustering, there are no pre-specified clusters, it is rather challenging to find an appropriate index for measuring whether the obtained cluster result is acceptable or not. The process of assessing the results of a clustering algorithm is usually referred to as clustering validity evaluation.

Generally, clustering validity assessment includes judging the quality of clustering results, the degree to which the clustering algorithm is suitable for a special data set, and finding the best number of clusters. There are two criteria for clustering validity, for example, compactness that the individual within each cluster should be as close to each other as possible and the common measure of compactness is the variance, and separation that the clusters themselves should be separated and the commonly used methods for measuring the distance between two different clusters are the distance between the closest individual of the clusters, distance between the most distant individuals and distance between the centers of the clusters. There are three indices for assessing the results of the clustering algorithm, for example, internal indices measuring the inter-cluster validity, external indices measuring the intra-cluster validity, and relative indices.

Both internal and external indices are based on statistical methods and involve intensive computation. The comprehensive review can refer to [19]. With the increase in the dimension of data points and variables, the cluster analysis method needs to be combined with the corresponding dimension reduction technology. Extracting features through dimension reduction technology and using features to realize clustering is a method of cluster analysis of high-dimensional data.

### **5. Future interesting topics**

Some interesting research fields in the future include model-based clustering with missing not at random data and skew-normal or skew-t distribution, modelbased tensor clustering, which is a challenging topic due to the correlation structure, ultrahigh-dimension and sparsity of tensor data, and the dimension of each mode of the tensors growing at an exponential rate of the sample size, and high-dimensional and ultrahigh-dimensional data clustering that is also challenging due to sparsity of data. In these cases, data clustering needs to incorporate the dimension reduction technique and imputation technique of missing data. Also, variational and distributed techniques for data clustering may be important and challenging research with the development of computing techniques.

### **Acknowledgements**

We would like to cordially thank Maja Bozicevic for polishing this chapter and fruitful remarks regarding chapter structure. This work was funded by NSFC grant 11731011.

*Introductory Chapter: Development of Data Clustering DOI: http://dx.doi.org/10.5772/intechopen.104505*

### **Author details**

Niansheng Tang\* and Ying Wu Department of Statistics, Yunnan University, Kunming, P.R. China

\*Address all correspondence to: nstang@ynu.edu.cn

© 2022 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

### **References**

[1] King RS. Clustering Analysis and Data Mining: An Introduction. Dulles: Mercury Learning and Information; 2015

[2] Wu S, Feng X, Zhou W. Spectral clustering of high-dimensional data exploiting sparse representation vectors. Neurocomputing. 2014;**135**:229-239

[3] Kalra M, Osadebey M, Bouguila N, Pedersen M, Fan W. Online variational learning for medical image data clustering. In: Bouguila N, Fan W, editors. Mixture Models and Applications. Unsupervised and Semi-Supervised Learning. Cham: Springer; 2020

[4] Prasad V, Das D, Bhowmick B. Variational clustering: Leveraging variational autoencoders for image clustering. In: 2020 International Joint Conference on Neural Networks (IJCNN); 19-24 July 2020; Glasgow, UK. Washington, US: IEEE; 2020. pp. 1-10

[5] Soleymain F, Eslami M, Elze T, Bischl B, Rezaei M. Deep variational clustering framework for self-labeling of large-scale medical images. In: Proc. SPIE 12032, Medical Imaging 2022: Image Processing; 4 April 2022; San Diego, California, US. 2022. pp. 68-76. DOI: 10.1117/12.2613331

[6] Irpino A, Verde R. Dynamic clustering of interval data using a Wassersteinbased distance. Pattern Recognition Letters. 2008;**29**:1648-1658

[7] Irpino A. Clustering linear models using Wasserstein distance. In: Palumbo F, Lauro C, Greenacre M, editors. Data Analysis and Classification. Studies in Classification, Data Analysis, and Knowledge Organization. Berlin: Springer; 2009

[8] Mi L, Zhang W, Gu X, Wang Y. Variational Wasserstein clustering. Computer Vis ECCV. 2018;**11219**: 336-352

[9] Mi L, Yu T, Bento J, Zhang W, Li B, Wang Y. Variational Wasserstein Barycenters for geometric clustering. 2020. DOI: 10.48550/arXiv.2002.10543

[10] Abonyi J, Feil B. Cluster Analysis for Data Mining and System Identification. Berlin: Birkhauser Verlag AG; 2007

[11] Gan G, Ma C, Wu J. Data Clustering: Theory, Algorithms, and Applications. Pennsylvania: SIAM; 2007

[12] Ma L, Zhang Y, Leiva V, Liu SZ, Ma TF. A new clustering algorithm based on a radar scanning strategy with applications to machine learning. Expert System with Applications. 2022;**191**:116143

[13] Melnykov V. Challenges in modelbased clustering. WIREs Computational Statistics. 2013;**5**:135-148

[14] Goren EM, Maitra R. Fast modelbased clustering of partial records. Statistics. 2022;**11**:e416

[15] Lin TI. Learning from incomplete data via parameterized t mixture models through eigenvalue decomposition. Computational Statistics and Data Analysis. 2014;**71**:183-195

[16] Wang WL, Lin T. Robust modelbased clustering via mixtures of skew-t distributions with missing information. Advances in Data Analysis and Classification. 2015;**9**:423-445

[17] Yan X, Tang N, Xie J, Ding X, Wang Z. Fused mean-variance filter *Introductory Chapter: Development of Data Clustering DOI: http://dx.doi.org/10.5772/intechopen.104505*

for feature screening. Computational Statistics and Data Analysis. 2018;**122**: 18-32

[18] Xie J, Lin Y, Yan X, Tang N. Categoryadaptive variable screening for ultrahigh dimensional heterogeneous categorical data. Journal of the American Statistical Association. 2020;**115**:747-760

[19] Lazzerini B, Jain LC, Dumitrescu D. Cluster validity. In: Fuzzy Sets & Their Application to Clustering & Training. Boca Raton: CRC Press; 2020. pp. 479-516

Section 2 Clustering Methods

### **Chapter 2**

## Clustering Algorithms: An Exploratory Review

*R.S.M. Lakshmi Patibandla and Veeranjaneyulu N*

### **Abstract**

A process of similar data items into groups is called data clustering. Partitioning a Data Set into some groups based on the resemblance within a group by using various algorithms. Partition Based algorithms key idea is to split the data points into partitions and each one replicates one cluster. The performance of partition depends on certain objective functions. Evolutionary algorithms are used for the evolution of social aspects and to provide optimum solutions for huge optimization problems. In this paper, a survey of various partitioning and evolutionary algorithms can be implemented on a benchmark dataset and proposed to apply some validation criteria methods such as Root-Mean-Square Standard Deviation, R-square and SSD, etc., on some algorithms like Leader, ISODATA, SGO and PSO, and so on.

**Keywords:** partition, evolutionary, algorithms, clustering

### **1. Introduction**

Clustering is unique to the utmost essential methods in data mining. Clustering is one of the major tasks of grouping the objects which have more attributes from different classes and the objects that belong to the same class are similar. Clustering is an eminent research field that has been used in various areas like Big Data Analytics, Statistics, Machine Learning, Artificial Intelligence, Data Mining, Deep Learning, and so on. Diverse algorithms have been anticipated for assorted applications in clustering [1]. The evaluation of these algorithms is most essential in unsupervised learning. There are no predefined classes in clustering thus it is complicated to measure suitable metrics. For this, a variety of validation criteria have been implemented [2, 3]. The major disadvantage of these validation criteria is cannot evaluate the arbitrary shaped clusters. As it normally selects a particular point from every cluster and computes the distance of particular points based on some other parameters. Suppose variance is computed based on these parameters.

Data Clustering is appropriated among the dataset dividing into different bunches with the end goal that the examination in the gathering is better than different groups. The dataset is to be apportioned to some degree if the information is similarly conveyed, attempt to distinguish the information of certain groups will fall flat or will prompt acquainted a few segments that are with being fake. Another issue is that the covering of information gatherings. These gatherings are at times diminishing the

bunching strategies proficiency. This decline the effectiveness is corresponding to the amount of coverage between the groups. Another issue of bunching calculations is their ability to be created in the method of on the web or disconnected. Web-based grouping is a technique for which an input vector is utilized to reconsider the bunch places according to the situation of the vector. Right now, a process where the focuses of groups are to be presented new information every single time. In disconnected mode, the technique is applied on a preparation informational collection, used to locate the focal point of bunches by examining all the information vectors in the preparation set. The bunch communities are found once they are fixed and used to characterize input vectors later. The systems are introduced right now.

Right now, strategies, transformative techniques for bunching, and group approval criteria are presented in Section 2. The complete investigation of the fundamentals much of the time utilized approval techniques in Section 3. The proposed work has been presented in Section 4.

### **2. Related work**

The issue is to recognize the comparative information things and structure as bunches. There are a few calculations and can be delegated Partitioning bunching, Hierarchical Clustering, Density-based Clustering, and Grid-put together Clustering. Here mostly concentrate concerning Partitioning calculations and developmental calculations on seat mark datasets. Dividing calculations legitimately decays an informational index into a lot of disjoint bunches and to decide various parcels have been utilized sure paradigm capacities. Transformative calculations are gotten from the hard bunching calculations for getting the ideal outcomes. The aftereffects of a bunching calculation are not comparable starting with one then onto the next applied with a few information parameters on the same informational index. To assess the groups some approval measures have been proposed. Smallness and Separation approaches are utilized to quantify the separation between groups. Outside criteria, interior criteria, and relative criteria are the three strategies to assess the consequences of grouping. Outer and inside criteria both can have a high computational interest and are dependent on factual methodologies. The significant downside of these two methodologies is the multifaceted nature of calculations. The relative criteria are the assessment of different groups. Many grouping calculations are executed on more occasions on the same informational index with various information parameters. The fundamental goal of the relative criteria is to choose the best grouping calculation from various outcomes based on approval criteria. These distinctive approval criteria have been actualized [4–9].

### **2.1 Partitioning methods**

These strategies are classified into two different ways, the centroid and medoid calculations. The centroid calculations are the calculations to speak to each bunch with the assistance of the greatness of the focus of the cases [10, 11]. The medoid calculations are the calculations that speak to each group of the examples storage room to the size place. K-implies calculation is the generally utilized centroid calculation [12]. The k-implies calculation isolates the informational index into k subsets as each point in a given subset is nearest to a similar focus. Ordinarily, the k-implies have

### *Clustering Algorithms: An Exploratory Review DOI: http://dx.doi.org/10.5772/intechopen.100376*

some helpful properties, for example, handling on enormous informational collections is productive, over and again stops at neighborhood ideal, having circular shape bunches and touchy to clamor. This calculation goes under the bunching technique since it requires the information ahead of time. The fundamental k-implies calculations principle objective is choosing the exact starting centroids. The most as of late utilized calculation for clear-cut traits is k-modes calculation. Both k-means and k-modes calculations permit cases of bunching by utilizing blended characteristics in the k-models calculation. The disentanglement of normal k-implies has been introduced most as of late. This can be utilized on ball and circle formed information groups with no issue and performs definite bunching without pre-deciding the exact group number. Some conventional grouping calculations produce allotments. In a parcel, all examples have a place with just one single bunch. Along these lines, each bunch in a hard grouping is disjoint.

Fluffy-based grouping stretches out the view to relate each example among each bunch through enrollment work. Generally utilized calculation for this is Fuzzy C-implies calculation, which depends on k-implies. Fluffy C-implies calculation is utilized to locate the run-of-the-mill point in each group. It tends to be viewed as the focal point of the bunch and enrollment of each case in the group. Other delicate bunching calculations have been actualized, based on the Expectation– Maximization calculation [13]. This calculation accepts an easygoing probabilistic model with specific parameters that depict the probabilistic cases of that bunch. The arrangement of FM calculation starts with essential speculations for the Mixture Model parameters. These qualities are utilized to assess the probabilities of bunches for each example. This procedure is rehashed to re-gauge the parameters of those probabilities. The drawback of this calculation is computationally progressively costly. Over-fitting is the issue in the previously mentioned strategy. This issue emerges for two reasons. The initial one is a tremendous number of bunches might be exact. The second one is the likelihood dispersions have more parameters. Completely Bayesian methodology is one of the plausible arrangements right now every parameter has a previous likelihood conveyance. ISODATA is one of the generally utilized solos characterization calculations. It is an iterative calculation and like k-implies. ISODATA calculation split and consolidated the bunches for future refinements. The primary contrast between ISODATA and k-implies is ISODATA permits various bunches while the k-implies expect that the groups are known as apriori. Gradual bunching calculation which is utilized on enormous informational indexes is Leader Algorithm. Pioneer is structure-based calculation and structure different bunch relies upon the request for the informational index which is accommodated calculation.

As indicated by Ashish Goel [14], while looking at k-implies, Fuzzy k-means and k-medoids rather than centroid have been utilized in the middle or Partition Around Medoids. In this way, k-implies utilize the centroid for speaking to the bunch not manage the anomalies. That is, an information object with the most noteworthy estimation of information can be conveyed. This technique handles this with the medoids' portrayal of the bunch as an incredible centroid. Rather than centroid, the predominantly set information object of the group on the inside is called a Medoid. Right now, several information objects have favored discretionarily equivalent to medoids for speaking to k number of bunches. And all other leftover information objects are in the group have a medoid which is like that information object. After consummation of all the procedure of information questions, another medoid is

presented in the spot of centroid to speak to bunches in a most ideal manner and once more the entire procedure is persistent. All the information objects have limited to the bunches relies upon the most up-to-date medoids. Medoids correct their position consistently for every cycle. This nonstop procedure is till the remaining medoids sit tight for a move. Inevitably, k groups to speak to a lot of information items can be found. Examination of K-Means, Fuzzy K-Means, and K-Medoids are investigated in the accompanying **Table 1**.

On the other hand, several Evolutionary algorithms have been implemented for optimization. Some of the Evolutionary Algorithms have been explained below.

### **2.2 Evolutionary algorithms**

A Genetic Algorithm is a factual advancement approach. The Genetic Algorithm is a notable calculation that is applied to different ideal plan issues. Also, it decides worldwide ideal arrangements by a consistent variable savvy calculation. Differential Evaluation is additionally like Genetic calculation.

Clonal Selection Algorithm is the developmental calculation for the natural resistant framework. There are two components determination and transformation. These two systems are finished by a record of invulnerable properties. Then again, the blast rate is corresponding to the proclivity, and the transformation rate is conversely relative to liking. The connection among lock and key must fit with one another and afterward, the reaction will work.

Particle Swarm Optimization is a transformative bunching calculation and reenacts the properties of running winged creatures. It follows some situations used to take care of the enhancement issues. Right now, the single arrangement is a winged creature in search, call it a Particle. Each Particle is considered as a point in dimensional space. **Figure 1** shows the process flow of the PSO algorithm.


### **Table 1.**

*K-means, fuzzy K-means, and K-medoids algorithm comparison details.*

*Clustering Algorithms: An Exploratory Review DOI: http://dx.doi.org/10.5772/intechopen.100376*

#### **Figure 1.**

*Flow chart for particle swarm optimization.*

Teaching Learning Based Optimization [10] is one of them as of late actualized advancement calculation. In designing applications, it impacts the impact of an instructor on the yield of students in a class is investigated by scientists for taking care of various streamlining issues.

Suresh Satapathy et al. [8] proposed a novel enhancement calculation named Social Group Optimization that relies upon the conduct of people to learn and take care of complex issues. They executed and examine the exhibition of SGO advancement calculation on a few benchmark capacities. Right now, dissected the different human characteristics of life, for example, resilience, fearlessness, dread, and deceitfulness, etc.

Social Group Optimization calculation can be partitioned into two different ways improving stage and securing stage. Every individual's information level in the gathering has been tried and upgraded by the impact of the best one in the gathering in the improving stage. The best individual in the gathering having the information for taking care of issues. Everybody in the gathering improves information with communications to each other in the gathering and best one in the gathering around then.

As per Wen-Jye Shyr [15], to compute and verify the improvement calculations execution estimated two elements of numerical destinations. The exhibitions of these techniques can be depicted for certain perspectives that are demonstrated as follows. The initial one is the ideal point union, which is the key executive for this calculation. The second one is the ideal incentive for exactness. The third one is the absolute number of target calculations. For the most part, there are a lot of issues where assembly speed is dependent on the absolute number of target calculations. The last one is the time taken for the calculation to locate the ideal worth. Even though this is the simplicity of calculation can be unforeseen. Notwithstanding these, a few parameters are made, tried to ensure that the outcomes are set in **Table 2**.


### **Table 2.**

*Genetic algorithm, clonal selection algorithm, and particle swarm optimization algorithm parameters.*

### **3. Parameters**

The most widely used validity criteria are introduced in the following section.

### **4. Motivations**

### **4.1 Validity criteria**

These validity criteria have been utilized for estimating the bunches. Root-Mean-Square Standard Deviation (RMSSTD), R-square, Sum of Squared Error (SSE), Internal and External legitimacy criteria applied to the previously mentioned calculations to investigate the best calculations. Bunching Algorithms utilize these approval measures to assess the outcomes. The RMSSTD is the technique to assess the change of the bunches and it gauges the group's homogeneity. According to these outcomes, to perceive homogeneous gatherings as the most minimal RMSSTD esteem implies great bunching. To gauge the divergence of bunches R-squared record is utilized. R-square estimates the level of homogeneity between the gatherings. The scope of these qualities is 0 and 1. Here, 0 methods have no distinction between the bunches and 1 method there is a huge contrast between the groups. The Sum of Squared Error is a fundamental calculation for factual methodologies and handles another estimation of information. It recognizes how those qualities are firmly related. Once figure the estimation of SSE for a dataset than just ascertain the estimations of change and standard deviation. Inner Validity is the legitimacy measure for the level of traits of free factor and others. Outer Validity is the legitimacy measure to the degree the aftereffects of a summed-up study [16]. The informational collections have been taken from different assets and the subtleties of informational collections and calculations as demonstrated as follows. Sack of words informational collection have taken from UCI Machine Repository site. This informational collection is content sort, 8lakhs of occurrences, and 1 lakh of information traits. Right now, every assortment of content contains the Number of archives spoke to by D; the Number of words spoken to by W, and the Total number of words spoken to by N in the assortment.

### **5. Proposed work**

The results of the above exploratory survey proposed to pick k-means, Leader, and ISODATA from parceling calculations and actualized on seat mark dataset with the previously mentioned legitimacy criteria for dissecting the presentation. By utilizing some developmental calculations, for example, Genetic Algorithms, Particle Swarm Optimization, and Social Group Optimization to be assessed the presentation with some legitimacy capacities. The accompanying table speaks to the subtleties of grouping strategies. Different clustering methods details with various parameters as shown in **Table 3**.


**Table 3.** *Clustering methods details.*

### **6. Conclusion**

The paper titled " Clustering Algorithms: An Exploratory Review" outlined a few dividing calculations and Evolutionary Algorithms. Apportioning Algorithms, for example, k-implies, k-medoids, Fuzzy k-means, and Expectation Maximization, etc., are considered. According to the correlation of k-implies, Fuzzy k-means, and k-medoids: The primary expert of k-implies is less expense of calculation, albeit con is empathy to Noisy information and Outliers than Fuzzy k-means and k-medoids. In Evolutionary Algorithms: GA, PSO, SGO, CSA, and TLBO are read, and for certain calculations like GA, CSA, and PSO what are the potential parameters utilized for correlations. The legitimacy criteria like RMSSTD, R-square, SSE, interior, and outside criteria have been utilized for the execution of the benchmark informational index. These legitimacy measures have been assessed for different info datasets and look at the effectiveness of the legitimacy measures.

The previously mentioned calculations actualized on seat mark informational collection with legitimacy measures to assess the presentation. In the future, by utilizing this to be evaluated execution present some new developmental calculation which can be utilized for huge and semi-organized information.

*Data Clustering*

### **Author details**

R.S.M. Lakshmi Patibandla\* and Veeranjaneyulu N Department of IT, Vignan's Foundation for Science Technology and Research, Vadlamudi, Guntur, Andhra Pradesh, India

\*Address all correspondence to: patibandla.lakshmi@gmail.com

© 2021 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

### **References**

[1] Yujie Zheng, "Clustering Methods in Data Mining with its Applications in High Education," International Conference on Education Technology and Computer, 2012.

[2] Prabhdip Kaur, Shruti Aggrwal, "Comparative Study of Clustering Techniques," international journal for advance research in engineering and technology, April 2013.

[3] H. Men'endez and D. Camacho, "A genetic graph-based clustering algorithm," in Intelligent Data Engineering and Automated Learning -IDEAL 2012, ser. Lecture Notes in Computer Science, H. Yin, J. Costa,and G. Barreto, Eds. Springer Berlin / Heidelberg, vol. 7435,pp: 216-225, 2012.

[4] Patibandla, R.S.M.L., Veeranjaneyulu, N. (2018), "Performance Analysis of Partition and Evolutionary Clustering Methods on Various Cluster Validation Criteria", Arab J Sci Eng,Vol.43, pp.4379-4390.

[5] Y. Li, J. Chen, R. Liu, and J. Wu, "A spectral clustering-based adaptive hybrid multi-objective harmony search algorithm for community detection," in Evolutionary Computation (CEC), IEEE Congress on. IEEE2012, pp. 1-8,2012.

[6] H. Men'endez, D. F. Barrero, and D. Camacho, "A multi-objective genetic graph-based clustering algorithm with memory optimization," in 2013 IEEE Conference on Evolutionary Computation, vol. 1, pp: 3174-3181, June2013.

[7] J. Liu, W. Zhong, H. A. Abbass, and D. G. Green, "Separated and overlapping community detection in complex

networks using multiobjective evolutionary algorithms," in Evolutionary Computation (CEC), 2010 IEEE Congress on. IEEE, pp: 1-7, 2010.

[8] Suresh Satapathy and Anima Naik "Social Group Optimization (SGO): a new population evolutionary optimization technique", Journal of complex intelligent systems, Springer, Vol 2, Issue 4, pp: 173-203, 2016.

[9] R S M Lakshmi Patibandla and N. Veeranjaneyulu, (2018), "Explanatory & Complex Analysis of Structured Data to Enrich Data in Analytical Appliance", International Journal for Modern Trends in Science and Technology, Vol. 04, Special Issue 01, pp. 147-151.

[10] Rao RV, "Teaching-learning-based optimization: a novel method for constrained mechanical design optimization problems," Elsevier Comput Aided Des 43, pp: 303-315,2011.

[11] R S M Lakshmi Patibandla, Santhi Sri Kurra, Ande Prasad and N.Veeranjaneyulu, (2015), "Unstructured Data: Qualitative Analysis", J. of Computation In Biosciences And Engineering, Vol. 2,No.3,pp.1-4.

[12] Wen-JyeShyr, "Introduction and Comparison of Three Evolutionary-Based Intelligent Algorithms for Optimal Design," Third International Conference on Convergence and Hybrid Information Technology, 2008.

[13] Patibandla R.S.M.L.,

Veeranjaneyulu N. (2018), "Survey on Clustering Algorithms for Unstructured Data". In: Bhateja V., Coello Coello C., Satapathy S., Pattnaik P. (eds) Intelligent Engineering Informatics. Advances in Intelligent Systems and Computing, vol 695. Springer, Singapore

[14] Ashish Goel, "A Study of Different Partitioning Clustering Technique," IJSRD - International Journal for Scientific Research & Development, Vol. 2, Issue 08, ISSN (online): 2321-0613, 2014.

[15] Wen-Jye Shyr, "Introduction and Comparison of Three Evolutionary-Based Intelligent Algorithms for Optimal Design," Third International Conference on Convergence and Hybrid Information Technology, 2008.

[16] R S M Lakshmi Patibandla, Veeranjaneyulu,N.(2020), "A SimRank based Ensemble Method for Resolving Challenges of Partition Clustering Methods", Journal of Scientific & Industrial Research,Vol. 79, pp. 323-327.

### **Chapter 3**

## Clustering by Similarity of Brazilian Legal Documents Using Natural Language Processing Approaches

*Raphael Souza de Oliveira and Erick Giovani Sperandio Nascimento*

### **Abstract**

The Brazilian legal system postulates the expeditious resolution of judicial proceedings. However, legal courts are working under budgetary constraints and with reduced staff. As a way to face these restrictions, artificial intelligence (AI) has been tackling many complex problems in natural language processing (NLP). This work aims to detect the degree of similarity between judicial documents that can be achieved in the inference group using unsupervised learning, by applying three NLP techniques, namely term frequency-inverse document frequency (TF-IDF), Word2Vec CBoW, and Word2Vec Skip-gram, the last two being specialized with a Brazilian language corpus. We developed a template for grouping lawsuits, which is calculated based on the cosine distance between the elements of the group to its centroid. The Ordinary Appeal was chosen as a reference file since it triggers legal proceedings to follow to the higher court and because of the existence of a relevant contingent of lawsuits awaiting judgment. After the data-processing steps, documents had their content transformed into a vector representation, using the three NLP techniques. We notice that specialized word-embedding models—like Word2Vec present better performance, making it possible to advance in the current state of the art in the area of NLP applied to the legal sector.

**Keywords:** legal, natural language processing, clustering, TF-IDF, Word2Vec

### **1. Introduction**

In recent years, the Brazilian Judiciary has been advancing toward turning all its acts digital. Following this direction, the Brazilian Labour Court implemented in 2012 the Electronic Judicial Process (acronym in Portuguese for "*Processo Judicial Eletrônico*"—PJe), and from this date, all new legal proceedings have already been born electronic. According to the Annual Analytical Report of Justice in Numbers 2020 (base year 2019) [1], produced by the National Council of Justice (acronym in

Portuguese for "*Conselho Nacional de Justiça*"—CNJ), more than 99% of the ongoing cases are already on this platform.

Knowing that human beings cannot promptly analyze a large set of data, especially when such data do not appear to correlate, a way to assist in the patternrecognition process is through statistical, computational, and data analysis methods. From the perspective that an exponential increase in textual data exists, the analysis of patterns in legal documents has become increasingly challenging.

Currently, one of the major challenges in the legal area is to respond quickly to the growing judicial demand. The Brazilian legal system provides for ways to ensure the swift handling of judicial proceedings, such as the principle of the reasonable duration of a case, the principle of speed, the procedural economy, and due process to optimize the procedural progress [2]. Therefore, with the aid of some clustering mechanism, that is, the grouping of processes, with a good rate of similarity between the documents to be analyzed, it was possible to help in the distribution of work among the advisors of the office for which the process was drawn. In addition, it contributed to the search for case law1 for the judgment of the cases in point, to ensure a speedy trial, upholding the principle of legal certainty. According to Gomes Canotilho [3]:

*"The general principle of legal certainty in a broad sense (thus encompassing the idea of trust protection) can be formulated as follows: the individual has the right to be able to rely on the law that his acts or public decisions involved in his rights, positions or legal relations based on existing legal norms and valid for those legal acts left by the authorities on the basis of those rules if the legal effects laid down and prescribed in the planning are connected to the legal effects laid down and prescribed in the legal order" (2003, p. 257).*

Thus, this legal management tool created positive impacts such as the decrease of the operational costs of a legal proceeding, as a result of reducing its duration, meaning lower expenses on the allocation of the necessary resources for its judgment.

Recently, machine learning algorithms have demonstrated through research that they are powerful tools capable of solving high-complexity problems using natural language processing (NLP) [4]. In this sense, it is possible to highlight the works of [5–9], which apply the techniques of word-embedding generation, a form of vector representation of terms, and consequently of documents, taking into account their context. The use of these word embeddings is essential when analyzing a set of unstructured data presented in the form of large-volume documents in court.

Nowadays, a specialist screens the documents and distributes among the team members the legal proceedings to be judged, setting up a deviation from the main activity of this specialist, which is the production of draft decisions. This contributed to an increase in the congestion rate (an indicator that measures the percentage of cases that remain pending solution at the end of the base year) and to the decrease in the meeting of demand index (acronym in Portuguese for "*Índice de Atendimento à Demanda*"—IAD—an indicator that measures the percentage of proceedings in downtime, compared to the number of new cases). It becomes evident in the consolidated data of the Labor Justice contained in **Table 1**, with data extracted from the Annual Analytical Report of Justice in Numbers 2020 (base year 2019) [1] produced by the National Council of Justice (CNJ).

<sup>1</sup> A legal term meaning a set of previous judicial decisions following the same line of understanding.


*Clustering by Similarity of Brazilian Legal Documents Using Natural Language Processing… DOI: http://dx.doi.org/10.5772/intechopen.99875*

### **Table 1.**

*Report of indicators of Brazilian labor justice.*

This work aims, therefore, to present the degree of similarity between the judicial documents that was achieved in the inferred groups through unsupervised learning *via* the application of three techniques of NLP, namely: (i) term frequency-inverse document frequency (TF-IDF); (ii) Word2Vec with CBoW (continuous bag of words) trained for general purposes for the Portuguese language in Brazil (Word2Vec CBoW pt-BR); and (iii) Word2Vec with Skip-gram trained for general purposes for the Portuguese language in Brazil (Word2Vec Skip-gram pt-BR).

This degree of congruence signals the model's performance and is set from the average similarity measure of the grouped files, based on the similarity cosine between the elements of the group to its centroid and, comparatively, by the average cosine similarity among all the documents of the group.

Aiming to delimit the scope of this research, a dataset containing information from documents of the Ordinary Appeal Interposed (acronym in Portuguese for "*Recurso Ordinário Interposto"*—ROI) type was extracted from approximately 210,000 legal proceedings. The Ordinary Appeal Interposed was used as a reference, as this is usually the type of document that induces the legal proceedings for judgment in the higher instance (2nd degree), thus instituting the Ordinary Appeal (acronym in Portuguese for "*Recurso Ordinário*"—RO). That is a free plea, an appropriate appeal against definitive and final judgments proclaimed at first instance, seeking a review of the judicial decision drawn up by a hierarchically superior body [10].

For the present work, a literature review on unsupervised machine learning algorithms applied to the legal area was performed, using NLP, and an overview of recent techniques that use artificial intelligence (AI) algorithms in word-embedding generation. Then, we applied some methods until the results were obtained, comparing and discussing them, and finally, conclusions and future challenges were presented.

### **2. State-of-the-art review**

Machine learning algorithms have in the most recent research demonstrated a great potential to solve high-complexity problems, which follow the categories into (i) supervised machine learning algorithms; (ii) unsupervised; (iii) semi-supervised; and (iv) by reinforcement [11]. In the context of this chapter, the literature review focused on the search for the most recent research on unsupervised machine learning or clustering algorithms applied to the legal area using NLP.

The investigation revealed that there are not many works dealing with the highlighted topic, which proves its complexity. Thus, we sought to expand the research by removing the restriction to the legal area bringing light to other publications. In [12], we discussed the content recommendation system approaches based on grouping for similar articles that used TF-IDF to perform vector transformation of the document contents and, through cosine similarity, applied k-means [13] for clustering them. In [14], the authors automatically summarized texts using TF-IDF and k-means to determine the document's textual groups used to create the abstract. Then, TF-IDF is considered the primary technique for vectorizing textual content and k-means the most used algorithm for unsupervised machine learning.

Therefore, we can assume that choosing the best technique of generating word embeddings requires investigation, experimentation, and comparison of models. Several recent pieces of research have demonstrated the feasibility of using word embeddings to improve the quality of AI algorithm results for pattern detection, classification, among other uses.

In 2013, Mikolov et al. [6] proposed two new architectures to calculate vector representations of words calling them Word2Vec, which was considered, at the time, as a reference in the subject. Subsequently, techniques of word embeddings based on the use of the long short-term memory network (LSTM) [15] became widely used for speech recognition, language modeling, sentiment analysis, and text prediction, and that, unlike the recurrent neural network (RNN) they can forget, remember and update the information thus taking a step forward from the RNNs [16]. Therefore, LSTM-based libraries, such as Embeddings from Language Models (Elmo) [17], Flair [18], and context2vec [19] created a different word embedding for each occurrence of the word, related to the context, that allowed to capture the meaning of the word.

*Clustering by Similarity of Brazilian Legal Documents Using Natural Language Processing… DOI: http://dx.doi.org/10.5772/intechopen.99875*

In more recent years, new techniques of word embeddings have emerged, with emphasis on (i) Bidirectional Encoder Representations from Transformers (BERT) [9], context-sensitive model with architecture based on a transformer model [20]; (ii) Sentence BERT (SBERT) [21], a "Siamese" BERT model that was proposed to improve BERT's performance when seeking to obtain the similarity of sentences; and (iii) Text-to-Text Transfer Transformer (T5) [22], a framework for treating NLP issues as a text-to-text problem, that is, input to the template as text and template output as text.

From this analysis, it was possible to advance in the current state of the art in the area of NLP applied to the legal sector, by conducting a comparative study and application of the techniques TF-IDF, Word2Vec CBoW, and Word2Vec Skip-gram to perform the grouping of labor legal processes in Brazil using the k-means algorithm and the cosine similarity.

### **3. Methodology**

This section presents each step necessary to achieve the results and to make it possible to analyze them comparatively. To perform all the implementations of the routines necessary for this study, the Python programming language (version 3.6.9) was used and, among other libraries, (i) Numpy (version 1.19.2) was used; (ii) Pandas (version 1.1.3); (iii) Sklearn (version 0.21.3); (iv) Spacy (version 2.3.2); and (v) Nltk (version 3.5).

Every processing flow (pipeline) consists of the phases: (i) data extraction; (ii) data cleansing; (iii) generation of word-embedding templates; (iv) calculation of the vector representation of the document; (v) unsupervised learning; and (vi) calculation of the similarity measure, as detailed in the following subsections.

### **3.1 Data extraction**

The dataset used for these studies belongs to the Regional Labour Court of the 5th Region (acronym in Portuguese for "Tribunal Regional do Trabalho da 5ª Região"— TRT5). There are approximately 210 (two hundred and ten) thousand documents of the Ordinary Appeal Interposed type, incorporated into the Electronic Judicial Process (PJe) system, originally added to the PJe in portable document format (PDF) or hypertext markup language (HTML). As the PJe has a tool for extracting and storing the contents of documents, there was no need for further processing in obtaining the text of such files.

In addition to the content of the documents, the following information was extracted: (i) the name of the parts of the proceedings to which such documents belonged; (ii) the list of labor justice issues from the Unified Procedural Table2 (acronym in Portuguese for "*Tabela Processual Unificada*"—TPU) of the Labour Justice branch (made available by the National Council of Justice [CNJ] and consolidated by the Superior Labour Court [acronym in Portuguese for "*Tribunal Superior do Trabalho*"—TST]); and (iii) list of abbreviations (acronyms) with their full translation according to tables made available by the Supreme Court (acronym in Portuguese for "Supremo Tribunal Federal"—STF).3

<sup>2</sup> Labour Justice Unified Procedural Table. Available at: https://www.tst.jus.br/web/corregedoria/ tabelas-processuais

<sup>3</sup> Table of abbreviations (and acronyms) made available by the Supreme Court. Available at: https://www. stf.jus.br/arquivo/cms/publicacaoLegislacaoAnotada/anexo/siglas\_cf.pdf

### **3.2 Data cleaning**

Preprocessing is a fundamental step for the application of artificial intelligence techniques and involves the following: (i) data standardization (when there is a large discrepancy between the values presented to the technique); (ii) the withdrawal of null values; and (iii) the reorganization and adequacy of the structure of the dataset. In this case, it is usually necessary for experts to conduct an exploratory analysis of the data used in advance to determine the direction of preprocessing.

For this phase, this study uses two forms of preprocessing: (i) detection of the subjects of the Unified Procedural Table (contained in the extracted documents) and (ii) cleaning the contents of the documents.

For the detection of the subjects of the TPU present in the extracted documents, regular expression matching was used as the search technique to measure the occurrences of these words in the files marking them with "tags" referring to the subject found.

For cleaning the contents of documents, usually using a regular expression, the steps were as follows:


*Clustering by Similarity of Brazilian Legal Documents Using Natural Language Processing… DOI: http://dx.doi.org/10.5772/intechopen.99875*

	- TF-IDF: removed all stopwords from the Portuguese language, such as "*de*" (from), "*da*" (of), "*a*" (the), "*o*" (the), "*esta*" (this) etc.;
	- Other techniques: removed only the non-adverbs of the Portuguese language, for example, the words "*não*" (no), "*mais*" (more), "*quando*" (when), "*muito*" (very), "*também*" (also), and "*depois*" (after) remain in the document;
	- TF-IDF: removed all the punctuation marks contained in the documents;
	- Other techniques: removed the punctuation marks except dot (.), comma (,), exclamation (!), and interrogation (?);
	- TF-IDF: applied the technique to replace words with its root, for example, words such as "*tenho*" (have), "*tinha*" (had), and "*tem*" (have) had belong of the same root "*ter*" (have);
	- Other techniques: lemmatization has not been applied;

In addition to the preprocessing detailed above, when the technique used was TF-IDF, the tags inserted in the text during this phase were removed.

### **3.3 Generation of word-embedding templates**

An essential technique in solving machine learning problems, involving NLP, is the use of vector representation of words, in which numerical values indicate some correlation of words in the text. This chapter uses word embeddings generated and shared for the Portuguese language, such as Word2Vec CBoW and Word2Vec template with Skip-gram. These templates were created based on more than 1 billion and 300,000 tokens, with results published in the article "Portuguese Word Embeddings: Evaluating on Word Analogies and Natural Language Tasks" presented at the Symposium in Information and Human Language Technology - STIL 2017 [23].

### **3.4 Calculation of the vector representation of the document**

Different from the TF-IDF technique, which has the vector representation of the document based on the statistical measurement of each term of the document in relation to all known corpus, and whose vector dimension is equal to the size of the vocabulary of the corpus, the other techniques (i) Word2Vec CBoW ptBR and (ii) Word2Vec Skip-gram pt-BR need to go through a change to calculate the vector representation of the document (document embeddings). This happens because for these techniques what you can get is the vector representation of the word (word embeddings).

Thus, to calculate the vector representation for the documents some alternatives are suggested, such as (i) average of the word embeddings of the words of the document; (ii) sum of the word embeddings of the words in the document by pondering them with the TF-IDF and then dividing by the sum of the TF-IDF of the words of the document; and (iii) weighted average with the TF-IDF of the word embeddings of the words of the document, the latter being the technique chosen for presenting the best result.

### **3.5 Unsupervised learning**

The use of unsupervised learning techniques is relevant when the intention is to detect patterns among court documents. The k-means algorithm, whose basic concepts were proposed by MacQueen [13], is the technique adopted in this study. In general, this technique seeks to recognize patterns from the random choice of K initial focal points (centroid), where K is the number of groups that one wishes to obtain and, iteratively, position the elements whose Euclidean distance is the minimum possible concerning the centroid of the group.

Since one does not have an ideal K to offer the algorithm, an approach usually used to support such a decision is to calculate the inertia, based on how well the dataset was grouped through k-means.

The inertia calculation is based on the sum of the square of the Euclidean distance from each point to its centroid and seeks to obtain the lowest K with the lowest inertia. However, the higher the K value reaches, the tendency is that inertia will be lower, and then, the elbow method was used to find the point where the reduction in inertia begins to decrease.

Hence, 31 values for K were used within the range from 30 to 61, considering an interval for each unit, selecting the K that generated the best grouping. In addition, the strategy of creating submodels, limited to two, was used for the documents of the groups whose average similarity rate did not reach a value greater than 0.5.

*Clustering by Similarity of Brazilian Legal Documents Using Natural Language Processing… DOI: http://dx.doi.org/10.5772/intechopen.99875*

### **3.6 Similarity measure calculation**

The similarity measure is an important tool for the measurement of the quality of inferred groups. In this study, the cosine similarity measure is adopted, which is a measure that calculates the cosine of the angle between two vectors projected in the multidimensional plane, the result of which is between 0 and 1, in which 1 represents that the two vectors are totally similar, and 0 represents that they are totally different. Given two vectors, X and Y, the cosine similarity is presented using a scalar product according to Eq. (1).

$$\text{similarity} = \cos(\theta) = \frac{\mathbf{X} \cdot \mathbf{Y}}{|\mathbf{X}| \cdot |\mathbf{Y}|} \tag{1}$$

Consequently, to decide whether, after the clustering of the chief model, it was necessary to generate up to two more submodels, using the average cosine similarity among all elements of the group. Although the computational cost of calculating the similarity between all files in the group is relevant, we sought to reduce the distance between documents that were part of the same group, although they were located near the centroid. To assess the final efficiency of the technique, another form of calculation was adopted, computing for each group the average cosine similarity between the group elements and its centroid. Thus, as a measure of global similarity of each approach, we calculated the average of the average of the groups, so that the one that reached a value closer to 1 (one) was considered the best technique.

### **4. Results and discussions**

This research shows, as per the methodology presented in the previous sections, how machine learning algorithms associated with NLP techniques are important allies in optimizing the operational costs of the judicial process. It is evidenced from the result, for example, of document screenings and procedural distribution, which allows an expert to devote oneself to their chief activity optimizing working time.

While using the k-means unsupervised learning algorithm, it was necessary to choose the best K for each NLP technique studied. In this scenario, the elbow method was applied based on the calculated inertia of each of the 31 K tested, as shown in **Figure 1**, thus achieving a better result for each technique.

From the attainment of the best K, the k-means model was trained and, from the grouping performed by this technique, we could reach the average similarity between the documents of each group. Those groups that did not make the cutting line of at least 0.5 of average had the group files submitted for creating up to two submodels. As expected, only for TF-IDF technique groupings is there a need to generate submodels to improve performance.

**Table 2** shows the average similarity of the groups obtained using the TF-IDF technique, as well as the result of the Word2Vec CBoW pt-BR technique. It achieved a little better measure of similarity than the Word2Vec Skip-gram pt-BR technique; however, the latter achieved its result with a smaller number of groups, which places it, in general, as the best technique.

### **Figure 1.**

*Inertia charts constructed by using the elbow method for determining the best number of clusters for each approach.*


### **Table 2.**

*Mean cosine similarity between all elements of the group. The best results are highlighted in bold.*

After the groups were formed, the statistical data resulting from each approach were calculated, as shown in **Table 3** and in the comparative graph of distributions between the techniques (**Figure 2**). The cosine similarity of the group elements to its centroid was used as a metric, showing the proximity of the results between the techniques with Word2Vec and highlighting the technique Word2Vec Skip-gram ptBR for the smaller amount of generated groups.

When comparing the values presented in **Tables 2** and **3**, it is noteworthy that the results presented in **Table 2** are worse in all cases. It is inferable from this observation *Clustering by Similarity of Brazilian Legal Documents Using Natural Language Processing… DOI: http://dx.doi.org/10.5772/intechopen.99875*


### **Table 3.**

*Statistics of the cosine similarity of the group elements to the centroids. The best results are highlighted in bold.*

### **Figure 2.**

*Boxplots showing the distributions of the clusters calculated by each technique. The more cohesive the boxes and the less number of outliers, the better.*

that the similarity measure calculations shown in **Table 2** can reduce the similarity rates since there may be elements in the group positioned on completely opposite sides. From **Figure 2**, it is also possible to verify that the groupings generated by the Word2Vec technique were more cohesive than those generated by the TF-IDF technique, especially the Word2Vec Skip-gram technique, which created fewer groupings in the range of outliers than Word2Vec CBoW, demonstrating its superiority by allowing fewer groups but maintaining consistent quality and cohesion.

Given the aforesaid, among all the techniques evaluated, the Word2Vec Skip-gram pt-BR technique presented itself as the best option for word embeddings for clustering legal documents of the Ordinary Appeal Interposed type. Although the Word2Vec CBoW pt-BR technique achieves slightly better rates, it stands out from the previous one for reaching a much smaller number of groups.

The result achieved by each approach can be visualized by projecting in two dimensions of the groups formed from the three techniques: (i) TF-IDF; (ii) Word2Vec CBoW pt-BR; and (iii) Word2Vec Skip-gram pt-BR, respectively, presented in **Figures 3**–**5**. It is evident in the figures that the groups formed from Word2Vec are much better defined, especially skip-gram, which confirms the findings previously explained in this work.

### **Figure 3.**

*2D projection of the entire test dataset, showing for each document its corresponding group formed by TF-IDF.*

### **Figure 4.**

*2D projection of the entire test dataset, showing for each document its corresponding group formed by Word2Vec CBoW ptBR.*

*Clustering by Similarity of Brazilian Legal Documents Using Natural Language Processing… DOI: http://dx.doi.org/10.5772/intechopen.99875*

**Figure 5.**

*2D projection of the entire test dataset, showing for each document its corresponding group formed by Word2Vec skip-gram ptBR.*

### **5. Conclusion and future work**

The use of AI as a standard detection tool based on documents from the judiciary has generally proved to be a viable and helpful solution in the scientific, technological, and practice of legal work. In this chapter, it was possible to present the results considered very promising due to the improvement in the average similarity rate. Thus, we demonstrate the possibility of using word-embedding generation techniques applied on clustering of Ordinary Appeal Interposed using AI algorithms.

Of all the techniques evaluated, the Word2Vec Skip-gram pt-BR technique presented itself as the best option for word embeddings for clustering legal documents of the Ordinary Appeal Interposed type.

We believe that specialized word embeddings have great potential in improving the results. Therefore, comes the suggestion for future study of Word2Vec specialized for the judiciary, in addition to evaluating whether the new embeddings generated provide an opportunity to improve the overall performance of clustering. In addition, using transformer-based techniques, such as BERT, can achieve promising results, using both the Portuguese language word-embedding model and training a specialized BERT model for the judiciary.

Moreover, new possibilities arise for using the techniques discussed in this chapter, such as the draft generation of decisions and classification of documents and processes.

### **Acknowledgements**

The authors thank the Regional Labour Court of the 5th Region for making datasets available to the scientific community and contributing to research and technological development. The authors also thank the Artificial Intelligence Reference Centre and the Supercomputing Centre for Industrial Innovation, both from SENAI CIMATEC.

### **Author details**

Raphael Souza de Oliveira1 and Erick Giovani Sperandio Nascimento2 \*

1 TRT5—Regional Labor Court of the 5th Region, Salvador, BA, Brazil

2 SENAI CIMATEC—Manufacturing and Technology Integrated Campus, Salvador, BA, Brazil

\*Address all correspondence to: ericksperandio@gmail.com

© 2021 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

*Clustering by Similarity of Brazilian Legal Documents Using Natural Language Processing… DOI: http://dx.doi.org/10.5772/intechopen.99875*

### **References**

[1] CNJ—Conselho Nacional de Justiça. Relatório Analítico Anual da Justiça em Números 2020. 2020. Available from: https://www.cnj.jus.br/pesquisasjudiciarias/justica-em-numeros/ [Accessed: June 07, 2021]

[2] da Costa Salum G. A duração dos processos no judiciário: aplicação dos princípios inerentes e sua eficácia no processo judicial [Internet], Âmbito Jurídico, Rio Grande. Vol. XIX(145). 2016. Avaliable from: https:// ambitojuridico.com.br/cadernos/direitoprocessual-civil/a-duracao-dosprocessos-no-judiciario-aplicacao-dosprincipios-inerentes-e-sua-eficacia-noprocesso-judicial/ [Accessed: September 01, 2021]

[3] Canotilho JJG. Direito constitucional e teoria da constituição. 7th ed. Coimbra: Almedina; 2003

[4] Khan W, Daud A, Nasir J, Amjad T. A survey on machine learning models for Natural Language Processing (NLP). Computer Science and Engineering. 2016;**43**:95-113

[5] Wang Y, Cui L, Zhang Y. Using Dynamic Embeddings to Improve Static Embeddings. In: arXiv Preprint. arXiv:1911.02929v1. 2019

[6] Mikolov, T, Chen, K, Corrado, G, Dean, J. Efficient Estimation of Word Representations in Vector Space. In: ICLR: Proceeding of the International Conference on Learning Representations Workshop Track, Arizona, USA. 2013.

[7] Pennington J, Socher R, Manning C. Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing

(EMNLP). Doha, Qatar: Association for Computational Linguistics; 2014. pp. 1532-1543

[8] Bojanowski P, Grave E, Joulin A, Mikolov T. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics. 2017;**5**:135-146

[9] Devlin J, Chang M-W, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Minneapolis, Minnesota: Association for Computational Linguistics. 2019; 1:4171-4186. DOI: 10.18653/v1/N19-1423

[10] Oliveira FJV. Os recursos na Justiça do Trabalho [Internet]. Available from: http://www.conteudojuridico.com.br/ consulta/Artigos/24853/os-recursos-najustica-do-trabalho [Accessed: June 10, 2021]

[11] Sil R, Roy A, Bhushan B, Mazumdar AK. Artificial Intelligence and Machine Learning based Legal Application: The State-of-the-Art and Future Research Trends. In: 2019 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS); 18-19 October 2019; Greater Noida, India: IEEE; 2019. p. 57-62. DOI: 10.1109/ICCCIS48478.2019.8974479

[12] Renuka S, Raj Kiran GSS, Rohit P. An unsupervised content-based article recommendation system using natural language processing. In: Jeena Jacob I, Kolandapalayam Shanmugam S, Piramuthu S, Falkowski-Gilski P, editors. Data Intelligence and Cognitive

Informatics (Algorithms for Intelligent Systems). Singapore: Springer; 2021. pp. 165-180. DOI: 10.1007/978-981-15- 8530-2\_13

[13] MacQueen J. Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability; Berkeley, CA: University of California Press; Vol. 1. 1967. pp. 281-297.

[14] D'Silva J, Sharma U. Unsupervised automatic text summarization of Konkani texts using K-means with Elbow method. International Journal of Engineering Research and Technology. 2020;**13**:2380. DOI: 10.37624/ IJERT/13.9.2020.2380-2384

[15] Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation. 1997;**9**:1735-1780. DOI: 10.1162/neco.1997.9.8.1735

[16] Sherstinsky A. Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network. Physica D: Nonlinear Phenomena. 2020;**404**:132306

[17] Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, et al. Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Vol. 1 (Long Papers). New Orleans, Louisiana: Association for Computational Linguistics; 2018. pp. 2227-2237. DOI: 10.18653/v1/N18-1202

[18] Akbik A, Blythe D, Vollgraf R. Contextual string embeddings for sequence labeling. In: Proceedings of the 27th International Conference on Computational Linguistics. Santa Fe, New Mexico, USA: Association for

Computational Linguistics; 2018. pp. 1638-1649

[19] Melamud O, Goldberger J, Dagan I. context2vec: Learning Generic Context Embedding with Bidirectional LSTM. In: Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning; Berlin, Germany: Association for Computational Linguistics; 2016;. p. 51-61. DOI: 10.18653/v1/K16-1006

[20] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY, USA: Curran Associates Inc.; 2017. p. 6000-10. (NIPS'17).

[21] Reimers N, Gurevych I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics; 2019. p. 3982-92. DOI: 10.18653/v1/D19-1410

[22] Roberts A, Raffel C, Lee K, Matena M, Shazeer N, Liu PJ, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. In: arXiv Preprint. arXiv:1910.10683. 2019

[23] Hartmann NS, Fonseca ER, Shulby CD, Treviso MV, Rodrigues JS, Aluísio SM. Portuguese Word Embeddings: Evaluating on Word Analogies and Natural Language Tasks. In: Proceedings of the 11th Brazilian Symposium on Information and Human Language Technology (STIL). Uberlândia, Minas Gerais, Brazil: Brazilian Computing Society - SBC; 2017. p. 122-31.

### **Chapter 4**

## Assessing Heterogeneity of Two-Part Model via Bayesian Model-Based Clustering with Its Application to Cocaine Use Data

*Ye-Mao Xia, Qi-Hang Zhu and Jian-Wei Gou*

### **Abstract**

The purpose of this chapter is to provide an introduction to the model-based clustering within the Bayesian framework and apply it to asses the heterogeneity of fractional data via finite mixture two-part regression model. The problems related to the number of clusters and the configuration of observations are addressed via Markov Chains Monte Carlo (MCMC) sampling method. Gibbs sampler is implemented to draw observations from the related full conditionals. As a concrete example, the cocaine use data are analyzed to illustrate the merits of the proposed methodology.

**Keywords:** model-based clustering, finite mixture model, two-part model, Markov Chain Monte Carlo sampling, cocaine use data

### **1. Introduction**

A recurring theme in the statistical analysis is to separate the unstructured data into groups to detect the similarity or discrepancy within or between groups. This is especially true in the fields, e.g., discriminant analysis [1–3], pattern recognition [4, 5], gene expression [6–8], machine learning [9], and artificial intelligence [10]. In the literature, the clustering problem is often formulated within the *cluster analysis* framework, which is generally categorized into two classes: the non-probabilistic framework and the probabilistic framework. The non-probabilistic clustering method, including the *K*-means method [9, 11, 12] and the hierarchical/agglomerative clustering algorithms [13–15], is based on the *distance* between any two observations or groups. It clusters data by merging or removing observations according to the "closeness" specified by the distance. This method is more general since it does not impose any distributional assumptions on data, hence having greater flexibility in the real applications. Instead, the non-probabilistic clustering algorithm, also termed the *model-based clustering*, groups data by positing a probability model on data and then clustering data via configuration function related to the model. Compared with the non-probabilistic framework, the model-based methods enable us to assess the statistical properties of the solutions, e.g., how many clusters are there, how well the configuration function works, and how robust the method is against the model

deviation and so on. There is rich literature on this issue. Among them, finite mixture model (FMM, [16–18]) perhaps is the most popular choice and has often been proposed and studied in the context of clustering (see a short review in Fraley and Raftery [2]). FMM assumes that each cluster is identified with a probability distribution indexed by the cluster-specific parameter(s), and each observation is related to clusters via configuration or membership function. The statistical task is the inference about the number of clusters, the estimation of the unknown parameters, and the allocation of observations.

In this chapter, we pursue a Bayesian model-based method to address the heterogeneity of fraction data. Fractional data are very common in the social and economical surveys. A distinguished feature of fractional responses is that its measurements are responded on a scale in the unity interval [0,1] but suffer from excessive zeros and unities on the boundaries. In understanding such type of data, the commonly used method is to separate the whole data into three parts: two corresponding to the zeros and unities respectively, and one corresponding to the continuously positive values. Two separative logistic models are suggested to model two discrete value parts respectively while single normal linear regression model is formulated for the continuous value part. This method, though more appealing, ignores the instinct association across different parts and readily leads to inconsistence of the occurrence probabilities on each part. Instead, we propose a three-category multinomial model for the occurrence variable, in which the usual separated models can be considered as the marginal models of our proposal. Such modeling always ensures the probabilities on each part to be proper, thus avoiding parameter constraints, see for example, [19]. To assess the heterogeneity underlying data, we formulate the problem into a finite mixture analysis of which each component is specified by two-part regression model. In view of the model complexity, we implement Markov Chains Monte Carlo sampling method to implement posterior analysis. Block Gibbs sampler is implemented to draw observations from the target distributions. The posterior inference including parameters estimates, model selection, and the configuration determination of observations are obtained based on the simulated observations.

The chapter is organized as follows. Section 2 introduces a general model-based clustering method to address the heterogeneity of regression model within the Bayesian framework. In Section 3, we apply the proposed method to the fractional data. Section 4 presents a cocaine use study. And Section 5 concludes the chapter.

### **2. Method description**

### **2.1 General framework**

Suppose that for *i* ¼ 1, 2, ⋯, *n*, *yi* is an observed response, each associated with an *m* dimensional fixed covariates **x***<sup>i</sup>* ¼ ð Þ *xi*1, ⋯, *xim* . In the context of regression analysis, the interest mainly focuses on exploring the pattern of the influence of **x***<sup>i</sup>* on *yi* and predicting the mean of a future response *y* in terms of a new **x**. This is usually achieved by formulating **x***i*, *yi* as *yi* j**x***i* <sup>¼</sup> *<sup>m</sup>*ð Þ **<sup>x</sup>***<sup>i</sup>* for some mean function *<sup>m</sup>*ð Þ� . In the parametric fitting framework, the function *m*ð Þ **x** is assumed to be related to **x** via linking function as the form of

$$m(\mathbf{x}) = h\left(\mathbf{x}^T \boldsymbol{\mathfrak{f}}\right) \tag{1}$$

*Assessing Heterogeneity of Two-Part Model via Bayesian Model-Based Clustering… DOI: http://dx.doi.org/10.5772/intechopen.103089*

which induces the so-called generalized linear model [20] for **x***i*, *yi* � �, where *β* is the regression coefficients used to quantify the uncertainty about *m*, and *h*ð Þ� is the known linking function used to link the mean and the predictors.

More often, the single relationship such as Eq. (1) may not be sufficient when the patterns among the subjects take on the heterogeneity such as clustering. The heterogeneous data occur when the observations are generated from the different populations of which the number of populations and the membership of each observation to the population are unknown. The main objective is to separate data into different clusters to detect the possible similarity within clusters or the discrepancy between clusters. This is generally accomplished by defining a cluster's membership/ configuration function K : **x**1, *y*<sup>1</sup> � �, <sup>⋯</sup>, **<sup>x</sup>***n*, *yn* � � � � <sup>↦</sup>f g 1, <sup>⋯</sup>, *<sup>K</sup>* such that *Ki* <sup>¼</sup> K **x***i*, *yi* � � � � <sup>¼</sup> *<sup>k</sup>* if **<sup>x</sup>***i*, *yi* � � belongs to the cluster *k*, where *K* is assumed to be less than *n*. The discrepancy between any two clusters is characterized by the cluster-specific parameters such as intercepters, regression coefficients, and/or disperse parameters.

The model-based clustering assumes that given the clusters membership *Ki*, **x***i*, *yi* � � within the cluster *k* has the following sampling density

$$\left(\boldsymbol{y}\_{i}|\boldsymbol{K}\_{i}=\boldsymbol{k},\mathbf{x}\_{i}\right)\stackrel{ind.}{\sim}f\_{\boldsymbol{k}}\left(\boldsymbol{y}\_{i}|\mathbf{x}\_{i}^{T}\boldsymbol{\mathcal{B}}\_{k},\ \ \boldsymbol{\tau}\_{k}\right)\tag{2}$$

while *Ki* is specified by

$$\mathbb{P}(K\_i = k) = \pi\_k \tag{3}$$

where *f <sup>k</sup>*, maybe independent of *k*, is the probability density function, *β<sup>k</sup>* and *τ<sup>k</sup>* are the cluster-specific regression coefficients and the disperse parameters, respectively, and *π<sup>k</sup>* is the mixing proportion identifying the proportion of the component *k* over the entire population. It is assumed that *π<sup>k</sup>* ≥0 and P*<sup>K</sup> <sup>k</sup>*¼<sup>1</sup>*π<sup>k</sup>* <sup>¼</sup> <sup>1</sup>*:*0.

Two important issues arise when formulating data clustering problem as Eqs. (2) and (3). One is related to the number of clusters, and the other is pertained to the determination of configurations. Within the Bayesian framework, several methods have been proposed for the first issue. One can, for example, follow [21] and treat *K* to be random and assign a prior to it. The reversible jump MCMC method (RJMCMC, [21, 22]) can be implemented to conduct the joint analysis of *K* with other random quantities. Another method is along the lines with the hypothesis test procedure and routinely to estimate *K* via model comparison/selection procedure. This perhaps is the most popular choice in the model-based clustering context, in which various measures such as the Akaike information criterion (AIC) [23], the corrected AIC (AICc) [24, 25], the Bayesian information criterion (BIC) [26], the integrated completed likelihood (ICL) [27], and Bayes factor (BF, [28, 29]) can be adopted to select a suitable model. It is worth pointing out that the deviance information criterion (DIC) [30] may not be appropriate for the mixture model comparison. The well-known software WinBUGS® [31] for Bayesian analysis does not provide DIC results for mixture analysis. In addition, many authors suggested modeling heterogeneous data into the mixture of Dirichlet process (MDP, [32, 33]). However, as discussed in Ishwaran and James [34], DP fitting often overestimates the number of clusters and readily leads to model over fitting.

For the second issue, the complexity of problem depends on the methods adopted in the analysis. In the frequency framework, for example, the configuration of observation *<sup>i</sup>* is often achieved by maximizing *Ki* <sup>¼</sup> *<sup>k</sup>*j**Y**, *<sup>π</sup>*^ , **<sup>Ξ</sup>**^ � � over *<sup>k</sup>* <sup>¼</sup> 1, <sup>⋯</sup>, *<sup>K</sup>*, where *<sup>π</sup>*^ and **<sup>Ξ</sup>**^

are the maximum likelihood estimates (MLE) obtained via, e.g., the expectationmaximization algorithm (EM, [35]). In the next section, we will present a Bayesian procedure for determining K. Compared with the frequency approach, the nice feature of the Bayesian approach is its flexibility to utilize prior information for achieving better results. Also, the sampling-based Bayesian methods depend less on the asymptotic theory and hence have the potential to produce reliable results even with small sample size.

Let **Y** be the set of all observed responses and **X** be the set of fixed covariates; Write **Ξ** as the collection of *β<sup>k</sup>* and *τk*. Integrating over *Ki* produces a *K*-component mixture model for *yi* , which is given by

$$p\left(y\_i|\boldsymbol{\pi}, \boldsymbol{\Xi}, \mathbf{x}\_i\right) = \sum\_{k=1}^{K} \pi\_k f\_k\left(y\_i|\mathbf{x}\_i^T \boldsymbol{\theta}\_k, \mathbf{r}\_k\right). \tag{4}$$

The log-likelihood of the observed data conditional on *K* is given by

$$\mathcal{L}(\boldsymbol{\pi}, \boldsymbol{\Xi} | \mathbf{Y}, \mathbf{X}) = \sum\_{k=1}^{n} \log \left( \sum\_{k=1}^{K} \pi\_{k} f\_{k} (\boldsymbol{\varphi}\_{i} | \mathbf{x}\_{i}^{T} \boldsymbol{\theta}\_{k}, \mathbf{r}\_{k}) \right). \tag{5}$$

As an illustration, **Figure 1** presents a three-component normal linear mixture regression model with one covariate. It can be seen clearly that the density function illustrates strong heterogeneity. The regression line is obviously different from those of components, which indicates that single model is unappreciate in fitting such data. In what follows, we suppress **X** for notational simplicity.

### **2.2 Bayesian model-based clustering via MCMC**

Bayesian analysis for analyzing Eqs. (2) and (3) especially K requires the specification of a prior distribution *p*ð Þ *π*, **Ξ** for the parameters of the mixture model. By model convention, it is naturally to assume that *π* and **Ξ** are independent, and the components among **Ξ** are also independent. In particular,

### **Figure 1.**

*Plot of the three-component normal mixture model* 0*:*3*N*ð Þþ �4 � 2*x*, 1 0*:*5*N*ð0*:*5 þ 0*:*5*x*, 1Þ þ 0*:*2*N*ð Þ 4*:*5 þ 3*x*, 1 *. Left panel: Plot of the density functions of the mixture as well as their three weighted components ; right panel: plots of regression lines. Mixture model: solid line "*�*" component one: dotted lines "*⋯*" component two: dashed lines "*��*" and component three: dotted-dashed lines "*��*"*

*Assessing Heterogeneity of Two-Part Model via Bayesian Model-Based Clustering… DOI: http://dx.doi.org/10.5772/intechopen.103089*

$$\boldsymbol{\mathfrak{g}}\_{k} \stackrel{\text{iid.:}}{\sim} \mathbf{N}\_{m}(\boldsymbol{\mathfrak{g}}\_{0}, \boldsymbol{\Sigma}\_{0}), \quad \boldsymbol{\mathfrak{r}}\_{k}^{-1} \stackrel{\text{iid.:}}{\sim} \mathbf{W}(\boldsymbol{\rho}\_{0}, \mathbf{R}\_{0}) \tag{6}$$

in which *W ρ*0,*R*�<sup>1</sup> 0 is the Wishart distribution with the degrees of freedom *ρ*<sup>0</sup> and the scale matrix *R*0, and reduces to the scaled Chi-square distribution when *τ<sup>k</sup>* is a univariate; *β*0, **Σ**0, *ρ*<sup>0</sup> and *R*<sup>0</sup> are the hyper-parameters, which are treated to fixed and known. In the real applications, if no extra information can be available, the values of these hyper-parameters are often taken to ensure *β<sup>k</sup>* and *τ<sup>k</sup>* to be dispersed enough. For example, one can set **Σ**<sup>0</sup> ¼ *λ*0**I** with large *λ*<sup>0</sup> (Throughout, we use **I** to signify an identify matrix). In this case, the values of *β*<sup>0</sup> are not really important and can be set to any values, e.g., zeros. Note that for the mixture models, Diebolt and Robert [36] (see also, for example, [37]) showed that using fully non-informative prior distributions may lead to improper posterior distributions and hence is strictly prohibitive.

We assign a symmetric Dirichlet distribution to *π* as follows

$$
\mathfrak{a} \vert a \sim D\_K(a, \cdots, a) \tag{7}
$$

in which *α*ð Þ >0 is the hyper-parameter, which is treated to fixed and unknown. In the applications, we can take sensitive analysis by setting smaller and larger values for *α*. See section 4 for more details.

Let **K** ¼ f g *K*1, ⋯,*Kn* be the collection of all configurations. A Bayesian procedure for model-based clustering mainly focuses on exploring the behavior of the posterior of **K** given data, which is given by

$$p(\mathbf{K}|\mathbf{Y}) \propto p(\mathbf{Y}|\mathbf{K})p(\mathbf{K})\tag{8}$$

where *p*ð Þ **Y**j**K** is the marginal distribution of *p*ð Þ **Y**, *π*, **Ξ**j**K** with *π* and **Ξ** being integrated out. Generally, no closed form can be available for this target distribution. Markov Chain Monte Carlo [38, 39] sampling method can be used to conduct posterior analysis. In particular, one can follow the routine in Tanner and Wong [40] and treat the latent quantities f g *π*, **K**, **Ξ** as the missing data and augment them with the observed data. Posterior analysis is carried out based on the joint distribution *p*ð Þ *π*, **K**, **Ξ**j**Y** . In this case, block Gibbs sampler [41, 42] can be implemented to draw observations from such target distribution. The Gibbs sampler is iteratively implemented by drawing: (i) Ξ from *p*ð Þ **Ξ** j*π*, **K**, **Y** ; (ii) *π* from *p*ð Þ *π* j**K**, **Ξ**, **Y** and **K** from *p*ð Þ **K**j *π*, **Ξ**, **Y** till convergence. The convergence can be monitored by the "estimated potential scale reduction" (EPSR) values [43] or by plotting the traces of estimates against iterations under different starting values. Note that except for (i), all full conditionals involved in the Gibbs sampler are standard. However, drawing **Ξ** in (i) depends on the specific form of the density function *f <sup>k</sup>* and sometimes requires implementing Metropolis-Hastings algorithm (MH, [44, 45]) or rejection sampling [46].

### **2.3 Label switching**

Formulating the model-based clustering problem into mixture model Eq. (2) faces the model identification. A statistical model is said to be identified if the observed likelihood is uniquely determined by unknown parameters. A less identified model may be problematic and will distort the estimates of unknown parameters. It is easily showed that the observed likelihood of data is only determined up to the permutation of the component labels. As a matter of fact, suppose that there are the pair *π*ð Þ<sup>1</sup> , **Ξ**ð Þ<sup>1</sup> and *π*ð Þ<sup>2</sup> , **Ξ**ð Þ<sup>2</sup> such that

$$p\left(\boldsymbol{y}|\boldsymbol{\pi}^{(1)},\ \boldsymbol{\Xi}^{(1)}\right) = p\left(\boldsymbol{y}|\boldsymbol{\pi}^{(2)},\ \boldsymbol{\Xi}^{(2)}\right) \tag{9}$$

then there exists a permutation *<sup>ν</sup>* : f g 1, 2, <sup>⋯</sup>, *<sup>K</sup>* <sup>↦</sup>f g 1, 2, <sup>⋯</sup>, *<sup>K</sup>* such that *<sup>π</sup>*ð Þ<sup>1</sup> *<sup>k</sup>* <sup>¼</sup> *<sup>π</sup>*ð Þ<sup>2</sup> *<sup>ν</sup>*ð Þ*<sup>k</sup>* , *β* ð Þ1 *<sup>k</sup>* ¼ *β* ð Þ2 *<sup>ν</sup>*ð Þ*<sup>k</sup>* and *<sup>τ</sup>* ð Þ1 *<sup>k</sup>* ¼ *τ* ð Þ2 *<sup>ν</sup>*ð Þ*<sup>k</sup>* . In this setting, we can not distinguish <sup>K</sup> and *<sup>ν</sup>*∘<sup>K</sup> in terms of data ("∘" denotes the operator of function composition). With this in mind, any exchangeable priors on *π* and **Ξ** like Eqs. (6) and (7) produces symmetric and multimodal posterior distributions with up to *K*! copies of each "genuine" mode, which induces the so-called label switching problem on Bayesian estimate. Traditional approaches to eliminating such exchangeability is to impose identifiability constraints on the parameter space. However, as pointed out by Frühwirth-Schnatter [18], an unappropriate identifiability constraint may not be able to eliminate label switching. Many efforts have been devoted to coping with this issue, see Chapter 11 in Lee [47] for a review. Among them, the relabeling algorithm [48] is more appealing due to its simplicity and flexibility. The relabeling sampling procedure takes a decisiontheoretical approach and requires specifying an appropriate loss function to measure the loss in terms of the classification probability. The model identification problem is addressed via postprocessing the MCMC output to minimize the posterior expected loss. Specifically, let *<sup>θ</sup>* be the collection of **<sup>Ξ</sup>** and *<sup>π</sup>*, and write *<sup>Q</sup>* <sup>¼</sup> *qik*ð Þ *<sup>θ</sup>* � as the matrix of allocation probabilities of order *n* � *K* with *qik*ð Þ¼ *θ* ð Þ *Ki* ¼ *k*j**Y**, *θ* . In the context of clustering, the loss function can be defined on the cluster label K as follows

$$\mathcal{L}\_0(\mathcal{K}; \ \theta) = -\sum\_{i=1}^n \log q\_{i\mathcal{K}\_i}(\theta). \tag{10}$$

Given that *θ*ð Þ<sup>1</sup> , ⋯, *θ*ð Þ *<sup>M</sup>* are the sampled parameters and let *ν*1, ⋯, *ν<sup>M</sup>* be the permutation applied to them. The relabeling algorithm proceeds by selecting initial values for the *νm*s, which are generally taken to be the identity permutations, then iterating the following steps until a fixed point is reached.

\*\*a.\*\* Choose  $\hat{\mathcal{K}}$  to minimize  $\sum\_{m=1}^{M} \mathcal{L}\_{0}\left(\boldsymbol{\chi}, \nu\_{m}\left(\boldsymbol{\theta}^{(m)}\right)\right)$ .

\*\*b.\*\* For  $m = 1, 2, \cdots, M$ , choose  $\nu\_{m}$  to minimize  $\mathcal{L}\_{0}\left(\hat{\mathcal{K}}, \nu\_{m}\left(\boldsymbol{\theta}^{(m)}\right)\right)$ .

### **2.4 Posterior inference**

Once the label switching is taken care of, the MCMC samples can be used to draw posterior inference. For example, the joint Bayesian estimate of *θ* can be obtained easily via the corresponding sample means of the generated observations via ergodic average as follows:

$$\hat{\boldsymbol{\theta}}\_{k} = \boldsymbol{M}^{-1} \sum\_{m=1}^{M} \boldsymbol{\theta}\_{k}^{(m)}, \hat{\boldsymbol{\pi}}\_{k} = \boldsymbol{M}^{-1} \sum\_{m=1}^{M} \quad \boldsymbol{\pi}\_{k}^{(m)}, \text{ and } \hat{\boldsymbol{\pi}}\_{k} = \boldsymbol{M}^{-1} \sum\_{m=1}^{M} \boldsymbol{\pi}\_{k}^{(m)} \tag{11}$$

The consistent estimates of the covariance matrix of estimates can be obtained via sample covariance matrix.

*Assessing Heterogeneity of Two-Part Model via Bayesian Model-Based Clustering… DOI: http://dx.doi.org/10.5772/intechopen.103089*

Given the observations **<sup>K</sup>**ð Þ *<sup>m</sup>* : *<sup>m</sup>* <sup>¼</sup> 1, 2, <sup>⋯</sup>, *<sup>M</sup>* � � drawn from the posterior *<sup>p</sup>*ð Þ **<sup>K</sup>**j**<sup>Y</sup>** via MCMC sampling, serval methods can be available for arriving at a point estimate of the clustering using draws from the posterior clustering distribution. The simplest method, known as the maximum a posteriori (MAP) clustering, is to select the observed clustering that maximizes the density of the posterior clustering distribution, i.e.,

$$
\hat{\mathcal{K}} : \hat{\mathcal{K}}\_i = \mathbf{argmax}\_{k=1,\cdots,K} \mathbb{P}(K\_i = k | \mathbf{Y}) \tag{12}
$$

in which ð Þ *Ki* ¼ *k*j**Y** can be approximated by

$$\mathbb{P}(K\_i = k | \mathbf{Y}) \approx \mathbf{M}^{-1} \sum\_{m=1}^{M} I\left\{ K\_i^{(m)} = k \right\}. \tag{13}$$

A more appreciate alternative to MAP is based on the pairwise probability matrix, an *n* � *n* association matrix *δ*ð Þ K with the ð Þ *i*, *j* th element formed by the indicator of whether the subject *i* is clustered with subject *j*. Element-wise averaging of these association matrices yields the pairwise probability matrix of clustering, denoted *ψ*^. Medvedovic and Sivaganesan [49] and Medvedovic et al. [50] suggested a clustering estimate of K by using the pairwise probability matrix *ψ*^ as a distance matrix in hierarchical/agglomerative clustering. However, as augured by Dahl [51], such routine seems counterintuitive to apply an ad hoc clustering method on top of a model which itself produces clusterings. In the context of Dirichlet process mixture-based clustering, Dahl [51] proposed a least-squares model-based clustering method by using draws from a posterior clustering distribution. Specifically, the least-squares clustering K*LS* is the observed clustering K*LS*, which minimizes the sum of squared deviations of its association matrix ð Þ K from the pairwise probability matrix:

$$\hat{\mathcal{K}}\_{LS} = \operatorname{argmin}\_{\mathbb{K}} \mathbf{m}\_{\mathbb{K}} \in \left\{ \mathbf{K}^{(1)}, \dots, \mathbf{K}^{(n)} \right\} \sum\_{i=1}^{n} \sum\_{j=1}^{n} \left( \delta(i,j)(\mathbb{K}) - \hat{\boldsymbol{\nu}}(i,j) \right)^{2}. \tag{14}$$

Dahl [51] showed that the least-squares clustering has the advantage over those in Medvedovic and Sivaganesan [49] since it utilizes the information from all the clusterings and is intuitively appealing for the "average" clustering instead of forming a clustering via an external, ad hoc clustering algorithm.

### **3. Assessing heterogeneity of two-part model**

In this section, we first proposed a two-part regression model for the fractional data especially for the U shaped fractional data and then extend the method discussed above to the current situation to address the possible heterogeneity of the population underlying data.

### **3.1 Two-part model for U shaped fractional data**

Suppose that for subject/individual *i*ð Þ ¼ 1, ⋯, *n* , *yi* is an univariate fractional response taking values in 0, 1 ½ �; **x***<sup>i</sup>* is an *m* � 1 fixed covariate vector denoting various explanatory factors under consideration. Usually, *yi* suffers from excess zeros and ones on the boundaries, and the whole data set takes on the U shape. In modeling such data, we introduce a three-category indicator variable *di* and a continuous intensity variable *zi* such that

$$d\_i = \begin{cases} 1 & \text{if } \quad y\_i = 0\\ 2 & \text{if } \quad y\_i = 1\\ 3 & \text{if } \quad 0 < y\_i < 1 \end{cases} \quad \text{and} \quad z\_i = \begin{cases} \quad h(y\_i) & \text{if } \quad 0 < y\_i < 1\\ \text{irrelevant} & \text{if } \quad y\_i = 0, 1 \end{cases} \tag{15}$$

where *h*ð Þ� is any monotone increasing function such that *zi* ∈ ð Þ �∞, þ∞ . That is, we break the data set into three parts: two parts corresponding to zeros and ones respectively and one part corresponding to the continuous values between 0 and 1. We formulate a two-part model for *yi* by first specifying a baseline-category logits model [52] for *di* and then a conditional continuous model for *zi*. The baselinecategory logits model is assumed that conditional upon **x***i*, *di*s are independent satisfying the following logits models simultaneously: for *j* ¼ 1, 2,

$$\log \frac{\mathbb{P}(d\_i = j | \mathbf{x}\_i)}{\mathbb{P}(d\_i = \mathfrak{Z} | \mathbf{x}\_i)} = \mathbf{x}\_i^T \mathbf{a}\_j \tag{16}$$

where *α <sup>j</sup>* is an *m* � 1 regression coefficients vector. We use category *di* ¼ 3 as the reference for the ease of parameters interpretation. For example, the magnitude of *α <sup>j</sup>*<sup>ℓ</sup> in *α <sup>j</sup>* indicates that the increase of one unit in *xi*<sup>ℓ</sup> will increase *e<sup>α</sup> <sup>j</sup>*<sup>ℓ</sup> times chance of *di* ¼ *j* over that of *di* ¼ 3.

The conditional continuous model for *zi* is given by

$$p(z\_i|d\_i=\mathbf{3}, \mathbf{x}\_i) = p^x(z\_i|\mathbf{x}\_i^T\mathbf{y}, \mathbf{r})\tag{17}$$

or equivalently

$$p\left(y\_i|0$$

where \_ *h s*ðÞ¼ *dh=ds*, *pz* ð Þ *u*j*a*, *τ* is the normal density with mean *a* and variance *τ* >0, and *γ* like that in Eq. (16), is the regression coefficient vector. Although the identical covariates are taken in Eqs. (16) and (17), this is not necessary in practice. Each equation can own their covariates. This can be achieved by imposing particular structure on the regression coefficients. For example, we can exclude *xi*<sup>1</sup> from Eq. (17) by restricting *γ*<sup>1</sup> in *γ* to be zero.

It follows from Eqs. (16) and (17) that marginal distribution of *yi* is given by

$$p\left(y\_i|\mathbf{x}\_i, \boldsymbol{\theta}, \tau\right) = q\_{i1}\delta\_0 + q\_{i2}\delta\_1 + \left(\mathbf{1} - q\_{i1} - q\_{i2}\right)p\left(y\_i|\mathbf{0} < y\_i < \mathbf{1}, \mathbf{x}\_i, \ \mathbf{y}, \tau\right) \tag{19}$$

where *qij* ¼ *di* ¼ *j*j**x***i*, *α <sup>j</sup>* � �ð Þ *<sup>j</sup>* <sup>¼</sup> 1, 2 is the response probability specified by Eq. (16) and *β* is the regression parameters constituted by *α*1, *α*<sup>2</sup> and *γ*.

### **3.2 Assessing heterogeneity of two-part model**

To detect the possible heterogeneity among *yi* , we extend the model Eq. (18) to the mixture case by assuming that conditional upon *Ki* ¼ *k*, *di* and *zi* satisfy Eqs. (16) and (17) with *α <sup>j</sup>* replaced by *αjk* and ð Þ *γ*, *τ* by *γ<sup>k</sup>* ð Þ , *τ<sup>k</sup>* respectively. This indicates that the mixture component *f <sup>k</sup>* in Eq. (1) in Section 2 is given by Eq. (19) with *β* ¼ *β<sup>k</sup>* and *τ* ¼ *τk*.

*Assessing Heterogeneity of Two-Part Model via Bayesian Model-Based Clustering… DOI: http://dx.doi.org/10.5772/intechopen.103089*

For the Bayesian analysis, the general forms of full conditionals involved in the model-based clustering have been given in Section 2. We here only focus on the technical details of the conditional distribution of **Ξ** in (i) in the Gibbs sampler.

We assume that the prior of *τ<sup>k</sup>* is the same as that in Eq. (6), while the priors of *β<sup>k</sup>* are taken as *p β<sup>k</sup>* ð Þ¼ *p*ð Þ *αk*<sup>1</sup> *p*ð Þ *αk*<sup>2</sup> *p γ<sup>k</sup>* ð Þ, in which

$$p(\mathfrak{a}\_{k\ell}) \stackrel{D}{=} N\_m(\mathfrak{a}\_{\ell 0}, \mathfrak{L}\_{a\ell 0})(\ell = 1, 2), \quad p(\boldsymbol{\eta}\_k) \stackrel{D}{=} N\_m(\boldsymbol{\eta}\_0, \mathfrak{L}\_{\boldsymbol{\tau} 0}). \tag{20}$$

where *α*ℓ0, *γ*0, **Σ***<sup>α</sup>*ℓ<sup>0</sup> and **Σ***<sup>γ</sup>*<sup>0</sup> are the hyper-parameters treated to be known.

Gibbs sampling **Ξ** now becomes drawing *αk*, *γ<sup>k</sup>* and *τ<sup>k</sup>* alternatively from the full conditional distributions *p*ð Þ *αk*j**K**, **Y** , *p γ<sup>k</sup>* ð Þ j*τk*, **K**, **Y** and *p τk*j *γ<sup>k</sup>* ð Þ , **K**, **Y** respectively. By some algebras, it can be shown that

$$\begin{aligned} p(\boldsymbol{a}\_{k}|\mathbf{K},\mathbf{Y}) &\propto p(\boldsymbol{a}\_{k}) \prod\_{K\_{i}=k} p(d\_{i}|\mathbf{x}\_{i}, \ \mathbf{a}\_{k}),\\ p(\boldsymbol{\gamma}\_{k}|\boldsymbol{\tau}\_{k}, \mathbf{K}, \mathbf{Y}) &\propto p(\boldsymbol{\gamma}\_{k}) \prod\_{K\_{i}=k} p(\boldsymbol{z}\_{i}|d\_{i} = \mathbf{3}, \mathbf{x}\_{i}^{T}\boldsymbol{\gamma}\_{k}, \boldsymbol{\tau}\_{k}),\\ p(\boldsymbol{\tau}\_{k}|\boldsymbol{\gamma}\_{k}, \mathbf{K}, \mathbf{Y}) &\propto p(\boldsymbol{\tau}\_{k}) \prod\_{K\_{i}=k} p(\boldsymbol{z}\_{i}|d\_{i} = \mathbf{3}, \mathbf{x}\_{i}^{T}\boldsymbol{\gamma}\_{k}, \boldsymbol{\tau}\_{k})\end{aligned} \tag{21}$$

in which the full conditionals of *γ<sup>k</sup>* and *τ<sup>k</sup>* are easily obtained and given by

$$p(\mathbf{y}\_k|\tau\_k, \mathbf{K}, \mathbf{Y}) \stackrel{D}{=} N(\hat{\mathbf{y}}\_k, \hat{\mathbf{E}}\_{jk}) \tag{22}$$

$$p\left(\tau\_k^{-1}|\mathbf{y}\_k, \mathbf{K}, \mathbf{Y}\right) \stackrel{D}{=}\\Gamma\left(\hat{\alpha}\_k, \hat{\beta}\_k\right) \tag{23}$$

in which

$$\begin{aligned} \hat{\boldsymbol{\Sigma}}\_{\mathcal{I}^k} &= \left( \sum\_{K\_i = k: d\_i = 2} \mathbf{x}\_i \mathbf{x}\_i^T / \boldsymbol{\tau}\_k + \boldsymbol{\Sigma}\_{\mathcal{I}^0}^{-1} \right)^{-1}, \\ \hat{\boldsymbol{\gamma}}\_k &= \hat{\boldsymbol{\Sigma}}\_k \left( \boldsymbol{\Sigma}\_{\mathcal{I}^0}^{-1} \ \boldsymbol{\gamma}\_0 + \sum\_{K\_i = k, d\_i = 3} \mathbf{x}\_i \boldsymbol{\tau}\_i / \boldsymbol{\tau}\_k \right), \\ \hat{\boldsymbol{\alpha}}\_k &= \boldsymbol{\alpha}\_0 + \boldsymbol{n}\_k / 2, \\ \hat{\boldsymbol{\beta}}\_k &= \boldsymbol{\beta}\_0 + \sum\_{K\_i = k, d\_i = 3} \left( \boldsymbol{z}\_i - \mathbf{x}\_i^T \boldsymbol{\gamma}\_k \right)^2 / 2 \end{aligned} \tag{24}$$

and *nk* ¼ #f g *Ki* ¼ *k*, *di* ¼ 3 .

However, drawing *αk*<sup>ℓ</sup> is more tedious since its distribution loses the standard form. We first note that

$$p(\mathbf{a}\_{k\ell} \mid \mathbf{a}\_{k,-\ell}, \mathbf{K}, \mathbf{Y}) \propto p(\mathbf{a}\_{k\ell}) \prod\_{K\_i=k}^{n} \frac{\exp\left(\mathbf{\bar{d}}\_{i\ell} (\mathbf{x}\_i^T \mathbf{a}\_{k\ell} - \mathbf{C}\_{ik\ell})\right)}{1 + \exp\left(\mathbf{x}\_i^T \mathbf{a}\_{k\ell} - \mathbf{C}\_{ik\ell}\right)} \tag{25}$$

where ~ *di*<sup>ℓ</sup> <sup>¼</sup> *I d*f g *<sup>i</sup>* <sup>¼</sup> <sup>ℓ</sup> and *Cik*<sup>ℓ</sup> <sup>¼</sup> log 1*:*<sup>0</sup> <sup>þ</sup> exp **<sup>x</sup>***<sup>T</sup> <sup>i</sup> α<sup>k</sup>*,�<sup>ℓ</sup> � � � � ; *<sup>α</sup><sup>k</sup>*,�<sup>ℓ</sup> denotes the set *α<sup>k</sup>* with *αk*<sup>ℓ</sup> removed. Following the similar routine in Polson, Scott, and Windle [53], we recast the logistic function Eq. (25) as follows

$$\begin{split} \frac{\exp\left(\tilde{d}\_{i\ell}\left(\mathbf{x}\_{i}^{T}\mathbf{a}\_{k\ell}-\mathbf{C}\_{ik\ell}\right)\right)}{\mathbf{1}+\exp\left(\mathbf{x}\_{i}^{T}\mathbf{a}\_{k1}-\mathbf{C}\_{ik\ell}\right)} &= 2^{-1}\exp\left\{\kappa\_{i1}\left(\mathbf{x}\_{i}^{T}\mathbf{a}\_{k\ell}-\mathbf{C}\_{ik\ell}\right)\right\} \\ \times \int\_{0}^{\infty}\exp\left\{-\frac{1}{2}\alpha\_{i\ell}\left(\mathbf{x}\_{i}^{T}\mathbf{a}\_{k\ell}-\mathbf{C}\_{ik\ell}\right)^{2}\right\}p\_{\text{PG}}(\alpha\_{i\ell})\,\mathrm{d}\alpha\_{i\ell} \end{split} \tag{26}$$

in which *<sup>κ</sup>i*<sup>ℓ</sup> <sup>¼</sup> <sup>~</sup> *di*<sup>ℓ</sup> � 1*=*2 and *p*PG is the well-known PG 1, 0 ð Þ density function [53]. If one introduces *n* independent Pólya-Gamma variables *ωi*<sup>ℓ</sup> into the current analysis, then,

$$\operatorname{PG}(o\iota\_{\ell\ell} \mid \mathbf{a}\_{k\ell}, \mathbf{a}\_{k,-\ell}, \mathbf{K}, \mathbf{Y}) \stackrel{D}{=} \operatorname{PG}\left(\mathbf{1}, \left(\mathbf{x}\_{i}^{T}\mathbf{a}\_{k\ell} - \mathbf{C}\_{ik\ell}\right)\right) \tag{27}$$

$$p(\mathbf{a}\_{k\ell} \mid \mathbf{a}\_{k,-\ell}, \ \mathbf{Q}, \mathbf{K}, \mathbf{Y}) \stackrel{D}{=} N(\hat{\mathbf{a}}\_{k\ell}, \hat{\mathbf{z}}\_{ak\ell}) \tag{28}$$

where

$$\hat{\boldsymbol{\Sigma}}\_{ak\ell} = \left(\sum\_{K\_i=k} \mathbf{x}\_i \mathbf{x}\_i^T \boldsymbol{\alpha}\_{i\ell} + \boldsymbol{\Sigma}\_{a\ell 0}^{-1}\right)^{-1}, \quad \hat{\mathbf{a}}\_{k\ell} = \hat{\mathbf{E}}\_{ak\ell} \left(\boldsymbol{\Sigma}\_{a\ell 0}^{-1} \mathbf{a}\_{\ell 0} + \sum\_{K\_i=k} \mathbf{x}\_i \boldsymbol{\eta}\_{ik\ell}\right) \tag{29}$$

with *ηik*<sup>ℓ</sup> ¼ *κi*<sup>ℓ</sup> þ *Cik*ℓ*ωi*<sup>ℓ</sup>. Consequently, drawing *αk*<sup>ℓ</sup> is accomplished by first drawing *ωi*<sup>ℓ</sup> from the Pólya gamma distribution and then drawing *αk*<sup>ℓ</sup> from the normal distribution. The draw of *ωi*<sup>ℓ</sup> is a little intractable since its density function involves the infinite sum. By taking advantage of series sampling method [54], Polson et al. [53] devised a rejection algorithm for generating observations from such type of distribution. Their method can be adapted to draw *ωi*<sup>ℓ</sup>, see also [55].

### **4. A real example**

In this section, a small portion of cocaine use data is analyzed to illustrate the practical value of the proposed methodology. The data are obtained from the 322 cocaine use patients who were admitted in 1988–89 to the West Los Angeles Veterans Affairs Medical Center. The original data set is made up of 68 measurements in which 17 items were assessed at four unequally spanned time points. In this study, we mainly focus on the measurements 1 year after treatment and ignore the initial effects at the baseline. The measurements cover the information on the cocaine use, treatment received, psychological problems, social status, employments, and so on. Among them, the measurement "cocaine use per month" (denoted by CC) plays a critical role since it identifies the severity of cocaine use of patients and therefore is treated as the dependent response. The CC is originally measured by 0–30 points but suffered from small portion of fractions. We identify CC/30 as the fraction response in [0,1]. In view of that the missing data are presented, we delete the subjects with missing values. The total sample size is 228. A primary analysis shows that CC/30 has excessive zeros and ones. **Figure 2** gives the histograms of CC/30 and their fractional values in (0,1) via logistic transformation. It can be seen clearly that there is a large number of zeros and unities accumulated on the boundaries. The proportions of zeros and unities are about 15 and 4%, respectively. Moreover, panel (b) in **Figure 2** indicates that single parametric model may be unappreciate for fitting the continuous valued variable.

*Assessing Heterogeneity of Two-Part Model via Bayesian Model-Based Clustering… DOI: http://dx.doi.org/10.5772/intechopen.103089*

**Figure 2.** *Plots of CC in cocaine use data: (a) Histograms of CC/30; and (b) histograms of CC/30 on logistic transformation conditional on CC/30 in (0,1).*

To explore the effects of exogenous factors on the cocaine use, the following measurements are selected as the explanatory variables: the occupational status of a patient (*x*1). This is a binary indicator: 1 for employment and 0 for non-employment; the level of technical proficiency of patients engaged in work (*x*2): scaled on 0–4 points and the patient's lifestyle (*x*3) with five-point scale. To unify the scales, all covariates are standardized. However, a preliminary analysis shows that there exists strong multiple collinearity among these covariates. The minimum eigenvalue of sample covariance matrix equals to 0.06284, which approaches zero. We remove such collinearity by implementing principle component analysis (PCA) and treat the scores of the first two components (still denoted by *x*<sup>1</sup> and *x*2) as our explanatory variables. These two principle components can be interpreted as the levels related to the patients' occupation and their live life.

To formulate a two-part model for the observed responses, we identify $\mathrm{CC}_i/30$ with $d_i$ and $z_i$, where $d_i$ is the three-category indicator of the state of cocaine use after one year of treatment: quitting cocaine successfully (state 1), using cocaine every day of the month (state 2), and using cocaine occasionally (state 3); $z_i$ is the intensity variable representing the number of days of cocaine use in a month. We assess the effects of the exogenous factors $x_1$ and $x_2$ on cocaine use via Eqs. (16) and (17), respectively.
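As a sketch of this coding (with hypothetical variable names, and assuming the monthly counts are stored in `cc` on the 0–30 scale), the decomposition into $(d_i, z_i)$ might look as follows, with the continuous part taken on the logit scale as in **Figure 2(b)**:

```python
import numpy as np

def two_part_coding(cc):
    """Split CC/30 into the state indicator d (1, 2, 3) and the
    logit-scale intensity z, defined only for values strictly in (0, 1)."""
    y = np.asarray(cc, dtype=float) / 30.0
    d = np.where(y == 0, 1, np.where(y == 1, 2, 3))      # states 1, 2, 3
    z = np.full(y.shape, np.nan)
    inside = (y > 0) & (y < 1)
    z[inside] = np.log(y[inside] / (1.0 - y[inside]))    # logistic transformation
    return d, z
```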

We proceed with the data analysis by first fitting the data to the $K$-component mixture two-part models with $K = 1, 2, \cdots, 6$. The model fits are assessed via AIC, AICc, and BIC, which are defined as $-2\log p(\mathbf{Y}\mid\hat{\boldsymbol{\theta}}_K)$ penalized by $2d_K$, $2n(d_K+1)/(n-d_K-2)$, and $d_K\log n$, respectively, where $\hat{\boldsymbol{\theta}}_K$ is the MLE of $\boldsymbol{\theta}_K$ and $d_K$ is the dimension of the unknown parameters under model $K$. Since the Bayesian estimates and the ML estimates are close to each other, we replace the ML estimates with their Bayesian counterparts in evaluating AIC, AICc, and BIC. For computation, we take $\alpha = n^{-1}$, $n^{0}$, $n^{1}$, and $n^{2}$ in Eq. (7), which represents our knowledge about $\boldsymbol{\pi}$ *a priori*. Note that for large values of $\alpha$, the Dirichlet distribution places most of its mass on its center and the prior in Eq. (7) tends to be informative. However, for small $\alpha$, the Dirichlet distribution concentrates its mass on the boundaries of the sampling space and the distribution tends to be degenerate and sparse; as a result, some components of $\boldsymbol{\pi}$ reduce to zero. When $\alpha = 1$, $D_K(\alpha, \cdots, \alpha)$ becomes the uniform distribution on the $K$-simplex.
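As a small illustration, the three criteria above can be computed as follows; this is a sketch with our own names, assuming the log-likelihood evaluated at the (Bayesian) estimate is available as `loglik`.

```python
import numpy as np

def information_criteria(loglik, d_K, n):
    """AIC, AICc, and BIC as defined above: -2 log p(Y | theta_hat_K)
    penalized by 2*d_K, 2n(d_K + 1)/(n - d_K - 2), and d_K*log(n)."""
    dev = -2.0 * loglik
    return (dev + 2.0 * d_K,
            dev + 2.0 * n * (d_K + 1) / (n - d_K - 2),
            dev + d_K * np.log(n))
```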

For the inputs of the hyper-parameters involved in the priors in Eq. (20), we take $\boldsymbol{\alpha}_{0\ell} = \boldsymbol{\gamma}_{0} = \mathbf{0}_3$, $\boldsymbol{\Sigma}_{\alpha\ell 0} = \boldsymbol{\Sigma}_{\gamma 0} = 100\,\mathbf{I}_3$, $\alpha_{\gamma 0} = 2.0$, and $\beta_{\gamma 0} = 2.0$. These values make the priors in Eq. (20) diffuse enough to represent weak prior information on the parameters.

The relabeling MCMC algorithm described in Section 2 is implemented to draw observations from the posterior. The convergence of the algorithm is monitored by plotting the traces of the estimates against iterations under three starting values. **Figure 3** presents the values of the EPSR of the unknown parameters against the number of iterations under three different starting values with $K = 2$. It shows that the proposed algorithm converges quickly: the values of the EPSR fall below 1.2 within 1000 iterations. Hence, after discarding the initial 2000 iterations as burn-in, 3000 observations are collected for calculating AIC, AICc, and BIC. The resulting summary is given in **Table 1**.
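For completeness, the EPSR statistic of Gelman and Rubin [43] used here can be sketched as below, assuming `chains` holds the draws of one scalar parameter from $m$ parallel chains of length $T$ (array name ours).

```python
import numpy as np

def epsr(chains):
    """Estimated potential scale reduction for one parameter.
    chains: array of shape (m, T), one row per parallel chain."""
    m, T = chains.shape
    B = T * chains.mean(axis=1).var(ddof=1)      # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()        # within-chain variance
    var_hat = (T - 1) / T * W + B / T            # pooled posterior variance estimate
    return np.sqrt(var_hat / W)                  # values below 1.2 suggest convergence
```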

Examination of **Table 1** shows that all measures favor the model with $K = 2$. This indicates that the proposed model with two groups gives a better fit to the data. It also indicates that a large $\alpha$ favors the model fit. Furthermore, we calculate the posterior predictive density estimate of $z_i$ under the selected model. The results (not presented here to save space) show that our method successfully captures the skewness and the modes of the data. We also follow [56] in plotting the estimated residuals $\hat{\delta}_i = z_i - \mathbf{x}_i^T\hat{\boldsymbol{\gamma}}$ and find that these plots lie within two parallel horizontal lines centered at zero, with no nonlinear or quadratic trends detected. This roughly indicates that the proposed linear model in Eq. (18) is adequate.

**Table 2** presents the estimates of the unknown parameters together with the corresponding standard deviation (SD) estimates under $K = 2$. Based on **Table 2**, we find the following: (i) for Part one, except for $\hat{\alpha}_{23}$, the Bayesian estimates of the unknown parameters within the two clusters have the same signs, but their magnitudes are quite different. For example, the estimate of $\alpha_{11}$ is $-1.540$ with SD 0.587 within Cluster one, while it equals $-0.732$ with SD 0.481 within Cluster two. This indicates that the baselines of the logits in Eq. (16) differ markedly between the two clusters.

**Figure 3.**

*Plots of values of EPSR of estimates of unknown parameters against the number of iterations under three different starting values in the cocaine use example: K = 2.*



### **Table 1.**

*Summary statistics of AIC, AICc, and BIC for model selection in the cocaine use data analysis.*


### **Table 2.**

*Summary statistics for the Bayesian estimates of unknown parameters in the cocaine use data.*

For $\alpha_{23}$, the estimates in the two clusters have opposite signs. Recall that $\alpha_{23}$ quantifies the magnitude of the effect of lifestyle on the probability of $(d_i = 2)$ over $(d_i = 3)$ on the log scale. This shows that increasing the level of lifestyle leads to opposite effects in the two clusters; (ii) for Part two, although all the estimates within the two clusters have the same signs, the levels of the effects differ markedly. The estimate of $\gamma_1$ is $-2.779$ with SD 0.144 in Cluster one and $-0.490$ with SD 0.215 in Cluster two, which indicates that the baseline of cocaine use in Cluster one is about 50 times that in Cluster two; and (iii) investigation of the estimate of $\tau$ also indicates that the amount of fluctuation differs between the two clusters.

### **5. Discussion**

This chapter introduced a general Bayesian model-based clustering procedure for the regression model and proposed a Bayesian method for assessing the heterogeneity of fractional data within the mixture of two-part regression models framework. Heterogeneous fractional data arise mainly from two sources: one is that excessive zeros and ones accumulate on the boundaries, and the other is that the underlying population may consist of more than one component. For the first issue, we propose a novel two-part model in which a three-category multinomial regression models the occurrence probabilities of each part and a conditional normal linear regression fits the continuous positive values on the logit scale. Such a formulation is appealing since it ensures that the probabilities of each part are consistent while maintaining a coherent association across parts. For the second issue, we resort to the finite mixture model in which the cluster-specific components are specified via the two-part model. An MCMC sampling method is adopted to carry out the posterior analysis. The number of clusters and the configuration of observations are addressed based on the observations simulated from the posterior. We illustrate the proposed methodology with the analysis of cocaine use data.

When interest centers on the estimates, model identification is an important issue since it determines whether the estimates of component-specific quantities are meaningful. For a finite mixture model, the identification problem mainly stems from label switching, whereby the likelihood and the posterior are invariant under label permutation. Many efforts have been devoted to alleviating this indeterminacy. Among them, parameter constraints may be the most popular choice. However, a poorly chosen constraint fails to deal with label switching. In this case, one can follow the routine in Frühwirth-Schnatter [18] and implement random permutation sampling to find suitable identifiability constraints. The random permutation sampler is similar to unconstrained MCMC sampling except that at each sweep the labels $\{1, \cdots, K\}$ are randomly permuted. The permutation aims to deliver a sample that explores the whole unconstrained parameter space and jumps between the various labeling subspaces in a balanced fashion. The output of such a balanced sample can help us find a suitable identifiability constraint. More detailed discussions on model identification in the mixture context can be found in, for example, [18, 57]. Instead, we resort to the relabeling algorithm for simplicity. Compared with random permutation sampling, the relabeling method requires implementing MCMC sampling only once, thus saving computational cost.
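To fix ideas, the random permutation step can be sketched as below (a toy illustration with our own names, not the chapter's implementation): after each MCMC sweep, a uniformly random relabeling is applied to the weights, the component-specific parameters, and the allocations.

```python
import numpy as np

def permute_labels(pi, theta, labels, rng):
    """Randomly relabel a K-component mixture state.
    pi: (K,) weights; theta: (K, ...) component parameters;
    labels: (n,) allocations in {0, ..., K-1}."""
    perm = rng.permutation(len(pi))   # new position j takes old component perm[j]
    inv = np.argsort(perm)            # old component c receives new label inv[c]
    return pi[perm], theta[perm], inv[labels]
```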

The methodology developed in this chapter can be extended to the case where latent factors are included to identify unobserved heterogeneity arising from the absence of some fixed covariates. Another possible extension is to establish a dynamic LVM wherein model parameters vary over time. These issues may raise theoretical and computational challenges and therefore require further investigation.


### **Acknowledgements**

The work presented here was fully supported by a grant from the National Natural Science Foundation of China (NNSF 11471161). The authors are grateful to Professor Xin-Yuan Song of the Chinese University of Hong Kong for allowing us to use her cocaine use data in the real example.

### **Conflict of interest**

The authors have no conflicts of interest to disclose.

### **Author details**

Ye-Mao Xia<sup>1</sup>, Qi-Hang Zhu<sup>2</sup> and Jian-Wei Gou<sup>1</sup>\*

1 Department of Applied Mathematics, Nanjing Forestry University, Nanjing, China

2 College of Economics and Management, Nanjing Forestry University, Nanjing, China

\*Address all correspondence to: gjw1983@139.com

© 2022 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

### **References**

[1] McLachlan GJ. Discriminant Analysis and Statistical Pattern Recognition. New York: John Wiley; 1992. DOI: 10.1002/0471725293.ch3

[2] Fraley C, Raftery AE. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association. 2002; **97**(458):611-631. DOI: 10.2307/3085676

[3] Andrews JL, McNicholas PD. Model-based clustering, classification, and discriminant analysis via mixtures of multivariate t-distributions: The tEIGEN family. Statistics and Computing. 2012;**22**(5):1021-1029. DOI: 10.1007/s11222-011-9272-x

[4] Ripley BD. Pattern Recognition and Neural Networks. Cambridge, UK: Cambridge University Press; 1996. DOI: 10.1080/00401706.1997.10485099

[5] Paalanen P, Kamarainen JK, Ilonen J, Kälviäinen H. Feature representation and discrimination based on Gaussian mixture model probability densities - Practices and algorithms. Pattern Recognition. 2006;**39**(7):1346-1358. DOI: 10.1016/j.patcog.2006.01.005

[6] Qin LX, Self SG. The clustering of regression models method with applications in gene expression data. Biometrics. 2006;**62**:526-533

[7] McNicholas PD, Murphy TB. Model-based clustering of microarray expression data via latent Gaussian mixture models. Bioinformatics. 2010;**26**(21):2705-2712. DOI: 10.1093/bioinformatics/btq498

[8] Yuan M, Kendziorski C. A unified approach for simultaneous gene clustering and differential expression identification. Biometrics. 2006;**62**:1089-1098

[9] Kanungo T, Mount DM, Netanyahu NS, Piatko CD, Silverman R, Wu AY. An efficient k-means clustering algorithm: Analysis and implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2002;**24**(7):881-892. DOI: 10.1109/TPAMI.2002.1017616

[10] Mahmoudi MR, Akbarzadeh H, Parvin H, Nejatian S, Alinejad-Rokny H. Consensus function based on clusterwise two level clustering. Artificial Intelligence Review. 2021;**54**:639-665. DOI: 10.1007/s10462-020-09862-1

[11] MacQueen J. Some methods for classification and analysis of multivariate observations. In: Le Cam LM, Neyman J, editors. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. Vol. 1. Berkeley, CA: University of California Press; 1967. pp. 281-297

[12] Hartigan JA, Wong MA. Algorithm AS 136: A K-means clustering algorithm. Journal of the Royal Statistical Society, Series C. 1979;**28**(1):100-108. DOI: 10.2307/2346830

[13] Anderberg MR. Cluster Analysis for Applications. New York: Academic Press; 1973

[14] Everitt BS, Landau S, Leese M. Cluster Analysis. 4th ed. London: Hodder Arnold; 2001

[15] Johnson RA, Wichern DW. Applied Multivariate Statistical Analysis. 2nd ed. New Jersey: Prentice Hall; 1988

[16] Titterington DM, Smith AFM, Makov UE. Statistical Analysis of Finite Mixture Distributions. Chichester: John Wiley and Sons; 1985. DOI: 10.2307/2531224


[17] McLachlan GJ, Peel D. Finite Mixture Models. New York: John Wiley; 2000. DOI: 10.1002/0471721182

[18] Frühwirth-Schnatter S. Markov chain Monte Carlo estimation of classical and dynamic switching and mixture models. Journal of the American Statistical Association. 2001;**96**(453):194-209. DOI: 10.1198/016214501750333063

[19] Fang KN, Ma SG. Three-part model for fractional response variables with application to Chinese household health insurance coverage. Journal of Applied Statistics. 2013;**40**(5):925-940. DOI: 10.1080/02664763.2012.758246

[20] McCullagh P, Nelder JA. Generalized Linear Models. London: Chapman and Hall; 1989. DOI: 10.1007/978-1- 4899-3242-6

[21] Green PJ. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika. 1995;**82**(4):711-732. DOI: 10.1093/biomet/82.4.711

[22] Richardson S, Green PJ. On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society, Series B. 1997;**59**:731-792. DOI: 10.1111/1467-9868.00095

[23] Akaike H. Information theory and an extension of the maximum likelihood principle. In: Petrov BN, Csáki F, editors. Second International Symposium on Information Theory. Budapest, Hungary: Akadémiai Kiadó; 1973. pp. 267-281. DOI: 10.1007/978-1-4612-1694-0_15

[24] Sugiura N. Further analysis of the data by Akaike's information criterion and the finite corrections. Communications in Statistics - Theory and Methods. 1978;**A7**:13-26

[25] Hurvich CM, Tsai C-L. Regression and time series model selection in small samples. Biometrika. 1989;**76**:297-307. DOI: 10.1093/biomet/76.2.297

[26] Schwarz G. Estimating the dimension of a model. The Annals of Statistics. 1978;**6**:461-464. DOI: 10.1214/aos/1176344136

[27] Biernacki C, Celeux G, Govaert G. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2000;**22**(7): 719-725. DOI: 10.1109/34.865189

[28] Berger JO. Statistical Decision Theory and Bayesian Analysis. New York: Springer-Verlag; 1985. DOI: 10.1007/978-1-4757-4286-2

[29] Kass RE, Raftery AE. Bayes factors. Journal of the American Statistical Association. 1995;**90**:773-795. DOI: 10.1080/01621459.1995.10476572

[30] Spiegelhalter DJ, Best N, Carlin B, van der Linde A. Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society, Series B. 2002;**64**:583-640. DOI: 10.1111/1467-9868.00353

[31] Spiegelhalter DJ, Thomas A, Best NG, Lunn D. WinBUGS User Manual. Version 1.4. Cambridge, England: MRC Biostatistics Unit; 2003

[32] Ferguson TS. A Bayesian analysis of some nonparametric problems. The Annals of Statistics. 1973;**1**(2):209-230. DOI: 10.1214/aos/1176342360

[33] Antoniak CE. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics. 1974;**2**:1152-1174. DOI: 10.1214/aos/1176342871

[34] Ishwaran H, James LF. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association. 2001;**96**:161-173. DOI: 10.1198/016214501750332758

[35] Dempster A, Laird N, Rubin D. Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B. 1977;**39**:1-38

[36] Diebolt J, Robert CP. Estimation of finite mixture distributions through Bayesian sampling. Journal of the Royal Statistical Society, Series B. 1994;**56**:363-375. DOI: 10.1111/j.2517-6161.1994.tb01985.x

[37] Roeder K, Wasserman L. Practical Bayesian density estimation using mixtures of normals. Journal of the American Statistical Association. 1997;**92**:894-902. DOI: 10.1080/01621459.1997.10474044

[38] Geman S, Geman D. Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1984;**6**:721-741. DOI: 10.1109/TPAMI.1984.4767596

[39] Geyer CJ. Practical Markov chain Monte Carlo. Statistical Science. 1992;**7**: 473-511. DOI: 10.1214/ss/1177011137

[40] Tanner MA, Wong WH. The calculation of posterior distributions by data augmentation (with discussion). Journal of the American Statistical Association. 1987;**82**:528-550. DOI: 10.2307/2289463

[41] Gelfand AE, Smith AFM. Samplingbased approaches to calculating marginal densities. Journal of the American Statistical Association. 1990;**85**:398-409. DOI: 10.1080/01621459.1990.10476213

[42] Ishwaran H, Zarepour M. Markov chain Monte Carlo in approximate Dirichlet and beta two-parameter process hierarchical models. Biometrika. 2000;**87**:371-390

[43] Gelman A, Rubin DB. Inference from iterative simulation using multiple sequences. Statistical Science. 1992;**7**: 457-472. DOI: 10.2307/2246093

[44] Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E. Equation of state calculations by fast computing machines. Journal of Chemical Physics. 1953;**21**:1087-1092. DOI: 10.1063/1.1699114

[45] Hastings WK. Monte Carlo sampling methods using Markov chains and their applications. Biometrika. 1970;**57**(1): 97-109. DOI: 10.1093/biomet/57.1.97

[46] Gilks WR, Wild P. Adaptive rejection sampling for Gibbs sampling. Journal of the Royal Statistical Society, Series C (Applied Statistics). 1992;**41**(2):337-348. DOI: 10.2307/2347565

[47] Lee SY. Structural Equation Modeling: A Bayesian Approach. New York: John Wiley & Sons; 2007

[48] Stephens M. Dealing with label switching in mixture models. Journal of the Royal Statistical Society, Series B. 2000;**62**:795-809. DOI: 10.1111/1467-9868.00265

[49] Medvedovic M, Sivaganesan S. Bayesian infinite mixture model based clustering of gene expression profiles. Bioinformatics. 2002;**18**(9):1194-1206. DOI: 10.1093/bioinformatics/18.9.1194

[50] Medvedovic M, Yeung KY, Bumgarner RE. Bayesian mixture model based clustering of replicated microarray data. Bioinformatics. 2004;**20**(8):1222-1232. DOI: 10.1093/bioinformatics/bth068

[51] Dahl DB. Model-based clustering for expression data via a Dirichlet process mixture model. In: Do KA, Müller P, Vannucci M, editors. Bayesian Inference for Gene Expression and Proteomics. Cambridge University Press; 2006. DOI: 10.1017/CBO9780511584589.011

[52] Agresti A. Categorical Data Analysis. 2nd ed. New York: John Wiley & Sons; 2003

[53] Polson NG, Scott JG, Windle J. Bayesian inference for logistic models using Pólya-Gamma latent variables. Journal of the American Statistical Association. 2013;**108**(504):1339-1349. DOI: 10.1080/01621459.2013.829001

[54] Devroye L. The series method in random variate generation and its application to the Kolmogorov-Smirnov distribution. American Journal of Mathematical and Management Sciences. 1981;**1**:359-379. DOI: 10.1080/01966324.1981.10737080

[55] Gou JW, Xia YM, Jiang DP. Bayesian analysis of two-part nonlinear latent variable model: Semiparametric method. Statistical Modelling. Published online 2021. DOI: 10.1177/1471082X211059233

[56] Xia YM, Tang NS, Gou JW. Generalized linear latent models for multivariate longitudinal measurements mixed with hidden Markov models. Journal of Multivariate Analysis. 2016;**152**: 259-275. DOI: 10.1016/j.jmva.2016.09.001

[57] Jasra A, Holmes CC, Stephens DA. Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling. Statistical Science. 2005;**20**(1):50-67. DOI: 10.1214/088342305000000016

Section 3
