**3.1 Data transformation methods**

Traditional clustering algorithms such as hierarchical clustering and k-means cannot be applied directly to RNA-seq count data, which tend to follow an over-dispersed Poisson or negative binomial distribution. To use these methods for cluster analysis of RNA-seq data, the data must first be transformed so that their distribution is closer to the normal distribution. The following list presents popular methods for data transformation:

• Logarithmic transformation: a widely used method for dealing with skewed data in many research domains, often applied to reduce the variability of the data and make them conform more closely to the normal distribution. However, it was demonstrated in [11] that in most circumstances the log transformation does not help make data less variable or more normal and may, in some circumstances, make data more variable and more skewed.

• Variance stabilizing transformation: this method was used to transform microarray data to stabilize the asymptotic variance over the full range of the data [12].

• Eight data transformations (r, r2, rv, rv2, l, l2, lv, and lv2) for RNA-seq data analysis were proposed in [13]. These methods address two common properties of the count matrix generated in the quantification step: sparsity and skewness. Sparsity means that many counts in the count matrix are zero. Skewness means that the histogram of all counts in the count matrix is usually skewed.

*Current State-of-the-Art of Clustering Methods for Gene Expression Data with RNA-Seq*

*DOI: http://dx.doi.org/10.5772/intechopen.94069*

**3.2 Clustering methods based on normal distribution**

*3.2.1 Hierarchical methods*

Hierarchical clustering is the most popular method for gene expression data analysis. In hierarchical clustering, genes with similar expression patterns are grouped together and connected by a series of branches (a clustering tree, or dendrogram). Experiments with similar expression profiles can also be grouped together using the same method. This clustering technique is divided into two types: agglomerative and divisive. In an agglomerative, or bottom-up, clustering method, each observation is initially assigned to its own cluster, and the most similar clusters are then merged step by step. In a comparative study on cancer data [14], three variants of hierarchical clustering algorithms (HCAs), single-linkage (SL), average-linkage (AL), and complete-linkage (CL), were used with 12 distance measures to cluster RNA-seq samples. The same methods will be used in our study, along with hierarchical clustering with Poisson distribution [15].

*3.2.2 k-medoids*

K-medoids is a partitional clustering algorithm proposed in 1987 by Kaufman and Rousseeuw. It is a variant of the k-means algorithm that is less sensitive to noise and outliers because it uses medoids as cluster centers instead of means, which are easily influenced by extreme values. Medoids are the most centrally located objects of the clusters, those with the minimum sum of distances to the other points. After searching for k representative objects in the data set, the algorithm, called Partitioning Around Medoids (PAM), assigns each object to the closest medoid in order to create clusters. As in k-means, the number of clusters to be generated needs to be specified.

**3.3 Model-based clustering**

Yaqing Si et al. described a number of model-based clustering methods for RNA-seq data in [16]. These methods assume that the data are generated by a mixture of probability distributions: the Poisson distribution when only technical replicates are used, and the negative binomial distribution when working with biological replicates. The first method they proposed is a model-based clustering method with the expectation-maximization algorithm (MB-EM) for clustering RNA-seq gene expression profiles. The expectation-maximization algorithm is widely used in many computational biology applications; the authors of [17] explain how this algorithm works and when it is used. The second method is an initialization algorithm for
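As a rough illustration of the model-based idea in Section 3.3, the following Python sketch fits a mixture of independent Poisson distributions to count profiles with the expectation-maximization algorithm. It is a simplified stand-in and not the MB-EM method of [16]: it omits library-size normalization and gene-specific effects. The function names `poisson_log_pmf` and `em_poisson_mixture` and the toy counts are hypothetical, and the rank-based initialization is an assumption made only to keep the example deterministic.

```python
import math


def poisson_log_pmf(count, rate):
    """Log of the Poisson probability of observing `count` given `rate`."""
    return count * math.log(rate) - rate - math.lgamma(count + 1)


def em_poisson_mixture(profiles, k, n_iter=50):
    """Fit a k-component Poisson mixture to count profiles with EM.

    profiles: list of equal-length lists of non-negative integer counts.
    Returns (labels, weights, rates). Assumes 1 <= k <= len(profiles).
    """
    n, d = len(profiles), len(profiles[0])
    # Assumed initialization (not taken from [16]): pick k profiles spread
    # evenly over the ranking by total count; +0.5 keeps rates positive.
    order = sorted(range(n), key=lambda i: sum(profiles[i]))
    picks = [order[round(c * (n - 1) / (k - 1))] if k > 1 else order[n // 2]
             for c in range(k)]
    rates = [[x + 0.5 for x in profiles[i]] for i in picks]
    weights = [1.0 / k] * k
    resp = [[1.0 / k] * k for _ in range(n)]
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each profile.
        for i, x in enumerate(profiles):
            logp = [math.log(weights[c])
                    + sum(poisson_log_pmf(x[j], rates[c][j]) for j in range(d))
                    for c in range(k)]
            top = max(logp)  # subtract the maximum for numerical stability
            unnorm = [math.exp(v - top) for v in logp]
            total = sum(unnorm)
            resp[i] = [u / total for u in unnorm]
        # M-step: update mixing weights and per-sample Poisson rates.
        for c in range(k):
            mass = max(sum(r[c] for r in resp), 1e-12)
            weights[c] = mass / n
            rates[c] = [
                max(sum(resp[i][c] * profiles[i][j] for i in range(n)) / mass,
                    1e-6)
                for j in range(d)
            ]
    # Hard assignment: the most probable component for each profile.
    labels = [max(range(k), key=lambda c: r[c]) for r in resp]
    return labels, weights, rates


# Toy data: three low-count and three high-count profiles over four samples.
counts = [[2, 3, 1, 2], [1, 2, 2, 3], [3, 1, 2, 1],
          [50, 60, 55, 52], [48, 58, 61, 50], [52, 55, 49, 57]]
labels, weights, rates = em_poisson_mixture(counts, k=2)
print(labels)
```

The returned labels assign each profile to its most probable component. With real RNA-seq data, the number of components and the starting values would need more care than this sketch takes.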

