**3.5 Clustering with deep learning**

Deep learning can also be used to learn better representations of high-dimensional data. Two recently published surveys [19, 20] present a taxonomy of existing deep clustering algorithms, describing the neural network architectures used for feature representation, the clustering loss functions, and the performance evaluation metrics for deep clustering. In [20], the authors group current deep clustering models into the following three categories:

• Auto-Encoders Based Deep Clustering

• Generative Adversarial Network (GAN)

• CDNN-Based Deep Clustering (feed-forward networks trained only by a specific clustering loss)

These approaches are already used in the analysis of RNA-seq data. For example, an unsupervised deep embedding algorithm that clusters single-cell RNA-seq (scRNA-seq) data was proposed in [21], and another study used a Lasso model and a multilayer feed-forward artificial neural network to analyze RNA-seq gene expression data [22]. In [23], the authors used a deep neural network model from the R package h2o for cancer data classification, and in [24], ladder networks were used for gene expression classification.

**4. Clustering algorithms and software packages/tools corresponding to the algorithms**

Clustering algorithms and software packages corresponding to the algorithms are shown in **Table 1**.

| Methods | Implementation in R |
|---|---|
| Hierarchical clustering | hclust() function in "stats" |
| k-means | "cluster", "factoextra" |
| k-medoids | "cluster", "factoextra" |
| Model-based clustering with the expectation-maximization algorithm (MB-EM); stochastic version of the expectation-maximization algorithm (deterministic annealing (DA) algorithm); classification expectation-maximization (CEM) algorithm with simulated annealing (SA) | "MBCluster.Seq" |
| SOM | "kohonen" |
| Machine learning algorithms | "MLSeq" |

**Table 1.** *Clustering algorithms and software packages corresponding to the algorithms.*

**5. Clustering of public RNA-seq data from recount2**

Recount2 is a multi-experiment resource of analysis-ready RNA-seq gene and exon count datasets. It contains 2041 different studies and over 70,000 human RNA-seq samples [25]. We selected four datasets for our study based on the number of samples and the number of classes. We then performed sample-based clustering on each dataset and compared the results with the classes in the recount2 phenotype table to evaluate the performance of each method. The methods used to cluster the data are three subtypes of hierarchical clustering with the Euclidean distance, hierarchical clustering with a Poisson model and k-medoids.
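The sample-based evaluation described for the recount2 datasets (cluster the samples, then compare the partition with the phenotype classes) can be sketched as follows. This is only an illustration under stated assumptions: the chapter does not say which agreement measure was used, so the adjusted Rand index (ARI) is assumed here; average linkage stands in for one of the hierarchical clustering variants; and the expression values, class names and function names are hypothetical toy choices, not recount2 data or the authors' code.

```python
from collections import Counter
from itertools import combinations
from math import comb, dist


def average_linkage(samples, n_clusters):
    """Agglomerative clustering with Euclidean distance and average linkage.

    Returns one integer cluster label per sample.
    """
    clusters = [[i] for i in range(len(samples))]  # start with one cluster per sample

    def linkage(a, b):
        # Average pairwise Euclidean distance between the members of two clusters
        total = sum(dist(samples[i], samples[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters (p < q) and merge q into p
        p, q = min(combinations(range(len(clusters)), 2),
                   key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]))
        merged = clusters.pop(q)
        clusters[p].extend(merged)

    labels = [0] * len(samples)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def adjusted_rand_index(labels_true, labels_pred):
    """Chance-corrected agreement between two partitions (needs >= 2 samples)."""
    n = len(labels_true)
    same_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = same_true * same_pred / comb(n, 2)
    maximum = (same_true + same_pred) / 2
    if maximum == expected:  # degenerate case, e.g. a single cluster in both partitions
        return 1.0
    return (same_both - expected) / (maximum - expected)


# Hypothetical toy data: 6 samples x 3 genes (log-scale expression), 2 known classes
samples = [[1.0, 1.2, 0.9], [1.1, 0.9, 1.0], [0.8, 1.0, 1.1],
           [5.0, 5.2, 4.9], [5.1, 4.8, 5.0], [4.9, 5.1, 5.2]]
phenotype = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]

predicted = average_linkage(samples, n_clusters=2)
print(adjusted_rand_index(phenotype, predicted))  # prints 1.0
```

An ARI of 1 means the clustering reproduces the known classes exactly, regardless of how the clusters are named, while values near 0 indicate agreement no better than chance.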

*Current State-of-the-Art of Clustering Methods for Gene Expression Data with RNA-Seq DOI: http://dx.doi.org/10.5772/intechopen.94069*


*Applications of Pattern Recognition*

cluster centers. The idea behind this method is to randomly choose one cluster center and then gradually add centers by selecting genes based on the distance between each gene and each of the selected centers. Two other stochastic algorithms are proposed in this paper: a stochastic version of the expectation-maximization algorithm and a classification expectation-maximization algorithm with simulated annealing. The last method in this paper is a model-based hybrid-hierarchical clustering algorithm which, unlike the previous methods, does not require the number of clusters to be specified in advance. To speed up the calculation, the authors propose to start agglomerative clustering from K0 initial clusters and then to repeatedly identify and merge the two clusters that are closest together. The method is called hybrid because it combines two steps: obtaining the initial K0 clusters using one of the previously described algorithms, then agglomerative clustering to build the hierarchical tree.

**3.4 Classification and clustering algorithms of machine learning for RNA-seq data**

Classification in machine learning is a supervised learning approach in which the algorithm learns from labeled training data and then applies what it has learned to classify new observations. Clustering, on the other hand, is an unsupervised learning problem that groups unlabeled data. The learning algorithm that learns the model from the training data and maps the input data to a specific class is called a classifier. In the following section, we briefly present three widely used classifiers for grouping RNA-seq data.

• Random forests (RF): an ensemble method that trains a large number of individual decision trees. Each tree gives a class prediction, and the class that wins the majority of votes is used as the final decision of the random forest model. The algorithm can perform both classification and regression tasks and achieves high accuracy compared with other current algorithms.

• Support Vector Machine (SVM): one of the most popular supervised learning models, used for both classification and regression. The data points are separated by an optimal hyperplane, or a set of hyperplanes, in a multidimensional space, with the largest possible margin between the support vectors of the different classes.

• Poisson linear discriminant analysis: an approach used for the classification and clustering of RNA-seq data using a Poisson log-linear model [15].

To test these algorithms, we used MLSeq (Machine learning interface for RNA-sequencing data), an R package that includes more than 80 machine learning algorithms and a pipeline to classify RNA-seq data, including normalization, filtering and transformation steps [18].
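The majority-vote idea behind random forests can be illustrated with a small, self-contained sketch. It is not the MLSeq implementation: for brevity, each tree is assumed to be a one-split decision stump trained on a bootstrap sample and on one randomly chosen gene, whereas real random forests grow deeper trees and draw a random subset of features at every split. The count matrix, the class labels and the function names (`train_stump`, `train_forest`, `predict_forest`) are hypothetical.

```python
import random
from collections import Counter


def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(X, y, feature):
    """Fit a one-split decision tree (a stump) on a single feature."""
    values = sorted({row[feature] for row in X})
    fallback = majority(y)
    best_errors, best_rule = len(y) + 1, None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2  # candidate split between two observed values
        left = [lab for row, lab in zip(X, y) if row[feature] <= threshold]
        right = [lab for row, lab in zip(X, y) if row[feature] > threshold]
        left_label, right_label = majority(left), majority(right)
        errors = (sum(lab != left_label for lab in left)
                  + sum(lab != right_label for lab in right))
        if errors < best_errors:
            best_errors, best_rule = errors, (threshold, left_label, right_label)

    def predict(row):
        if best_rule is None:  # feature is constant in this bootstrap sample
            return fallback
        threshold, left_label, right_label = best_rule
        return left_label if row[feature] <= threshold else right_label

    return predict


def train_forest(X, y, n_trees=25, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        feature = rng.randrange(n_features)         # one random gene per tree
        forest.append(train_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest


def predict_forest(forest, row):
    """Each tree votes for a class; the class with the most votes wins."""
    votes = Counter(tree(row) for tree in forest)
    return votes.most_common(1)[0][0]


# Hypothetical toy counts: 10 samples x 2 genes, two classes "A" and "B"
X = [[2, 3], [3, 2], [1, 4], [2, 2], [4, 1],
     [20, 25], [22, 21], [25, 24], [21, 23], [24, 20]]
y = ["A"] * 5 + ["B"] * 5

forest = train_forest(X, y)
print(predict_forest(forest, [3, 3]), predict_forest(forest, [22, 22]))
```

Because every tree sees a different bootstrap sample and gene, individual trees may disagree, and the vote count in `predict_forest` is what turns them into a single decision.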

