Instantiate a logistic regression model, and fit it with X and y.

$$\ln(\text{odds}) = \ln\big(p/(1-p)\big)\tag{2}$$

#### **2.3 Polynomial regression**

Polynomial regression is a type of regression in which the power of the independent variable is greater than 1. Example:

$$\mathbf{Y} = \mathbf{a} + \mathbf{b}\left(\mathbf{X}^{2} + \mathbf{X}^{3} + \cdots + \mathbf{X}^{n}\right)\tag{3}$$

The plotted graph is usually a curve, as shown in **Figure 4**.

If the degree of the equation is 2, it is called quadratic; if 3, cubic; and if 4, quartic. Polynomial regressions are fit with the method of least squares, since least squares minimizes the variance of the unbiased estimators of the coefficients under the conditions of the Gauss-Markov theorem. Although we may be tempted to fit a higher-degree polynomial to obtain a lower error, this may cause over-fitting [9].

Some guidelines which are to be followed are:

• The model is more accurate when it is fed a large number of observations. It is not a good idea to extrapolate beyond the limits of the observed values.

• Values of the predictor shouldn't be too large, else they will cause overflow with higher degrees.

Usage of polynomial regression in python:

```
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

#expand X into polynomial features of the given degree
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

#fit a linear model on the expanded features and score it
reg = LinearRegression()
reg.fit(X_poly, y)
reg.score(X_poly, y)
```
#### **2.4 Step-wise regression**

This type of regression is used when we have multiple independent variables. To select the independent variables, an automatic process is used. If used in the right way, it is powerful and presents us with a great deal of information. It can be used when the number of variables is large. However, if it is used haphazardly it may affect the model's performance.

**Figure 4.** *The plotted graph is a curve in nature.*

We make use of the following scores to help us find out the independent variables which contribute significantly to the output variable—R-squared, Adj. R-squared, F-statistic, Prob (F-statistic), Log-Likelihood, AIC, BIC and many more.

It can be performed in any of the following ways:

• Forward selection—we start by adding variables to the set one by one and check how each addition affects the scores.

• Backward selection—we start by taking all the variables into the set and eliminate them one by one, looking at the score after each elimination.

• Bidirectional selection—a combination of both the methods mentioned above.


The greatest limitation of using step-wise regression is that each instance or sample must have at least five attributes; below this, it has been observed that the algorithm doesn't perform well [10].

Code to implement Backward Elimination algorithm:

Assume that the dataset consists of 5 columns and 30 rows, which are present in the variable 'X', and let the expected results be contained in the variable 'y'. Let 'X\_opt' contain the independent variables which are used to determine the value of 'y'.

We are making use of a package called statsmodels, which is used to estimate the model and to perform statistical tests.

```
#import the statsmodels package (and numpy for the column of ones)
import numpy as np
import statsmodels.api as sm

#add a column of 1s on the left to act as the intercept term
X = np.append(arr = np.ones([30, 1]).astype(int), values = X, axis = 1)

#let X_opt contain the independent variables only and let y contain the output variable
X_opt = X[:, [0, 1, 2, 3, 4, 5]]

#assign y to endog and X_opt to exog
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
```

The above code outputs the summary, based on which the variable to be eliminated should be decided—typically the predictor whose p-value is highest and above the chosen significance level. Once decided, remove that variable from 'X\_opt' and fit again.
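The elimination loop itself is not shown in the chapter; as a rough sketch (an assumption-laden illustration using a 0.05 significance level, with the hypothetical helper name `backward_elimination`), it could be automated like this:

```
import numpy as np
import statsmodels.api as sm

def backward_elimination(X, y, significance_level = 0.05):
    #start with every candidate column and drop the least significant one per pass
    cols = list(range(X.shape[1]))
    model = sm.OLS(endog = y, exog = X[:, cols]).fit()
    while len(cols) > 1:
        worst = int(np.argmax(model.pvalues))
        if model.pvalues[worst] <= significance_level:
            break  #all remaining predictors are significant
        cols.pop(worst)
        model = sm.OLS(endog = y, exog = X[:, cols]).fit()
    return cols, model
```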

Step-wise regression is thus used to handle high dimensionality of the dataset.

#### **2.5 Ridge regression**

It can be used to analyze the data in detail. It is a technique used to get rid of multicollinearity, that is, independent variables that are highly correlated. It adds a degree of bias, due to which it reduces the standard errors.

The multicollinearity of the data can be inspected with a correlation matrix: the higher the values, the more the multicollinearity. Ridge regression can also be used when the number of predictor variables in the dataset exceeds the number of instances or observations [11].
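As a quick illustration of inspecting multicollinearity with a correlation matrix (a sketch; the DataFrame `df` and its column values are made up):

```
import pandas as pd

#high absolute correlations between predictors suggest multicollinearity
df = pd.DataFrame({"x1": [1, 2, 3, 4, 5],
                   "x2": [2.1, 3.9, 6.0, 8.1, 9.9],   #nearly 2 * x1
                   "x3": [5, 1, 4, 2, 3]})
print(df.corr())
```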

The equation for linear regression is

$$\mathbf{Y} = \mathbf{A} + \mathbf{b}\mathbf{X} \tag{4}$$

This equation also contains an error term, that is, it can be expressed as

$$\mathbf{Y} = \mathbf{A} + \mathbf{b}\mathbf{X} + \text{error}$$

where the error has mean zero and known variance.


**Figure 5.** *Ridge and OLS.*


Ridge regression is known to shrink the coefficients by imposing a penalty on their size. It is also used to control the variance.

**Figure 5** shows how ridge regression looks geometrically.

Usage of ridge regression in python:

```
from sklearn import linear_model
reg = linear_model.Ridge(alpha = .5)
reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])
#the fitted estimator is echoed as:
#Ridge(alpha=0.5, copy_X=True, fit_intercept=True, max_iter=None,
#      normalize=False, random_state=None, solver='auto', tol=0.001)
#to return the co-efficient and intercept
reg.coef_
reg.intercept_
```

#### **2.6 Lasso regression**

Least absolute shrinkage and selection operator is also known as LASSO. Lasso is a linear regression that makes use of shrinkage: it shrinks the data values toward the mean or a central point. This is used when there are high levels of multicollinearity [12].

It is similar to ridge regression; in addition, it can reduce the variability and improve the accuracy of linear regression models.

It is used for prostate cancer data analysis and other cancer data analysis. Important points about LASSO regression:

• It makes use of L1 regularization.

• If the predictors in the data have high correlation, the algorithm selects only one of the predictors and discards the rest.

• It helps in feature selection by shrinking the coefficients to zero.


Code to implement in python:

```
from sklearn import linear_model
clf = linear_model.Lasso(alpha = 0.1)
clf.fit([[0, 0], [1, 1]], [0, 1])
#the fitted estimator is echoed as:
#Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000,
#      normalize=False, positive=False, precompute=False, random_state=None,
#      selection='cyclic', tol=0.0001, warm_start=False)
#to return the co-efficient and intercept
print(clf.coef_)
print(clf.intercept_)
```


## **3. Classification**

A classification task is when the output is of the type "category", such as segregating data with respect to some property. In machine learning and statistics, classification consists of categorizing new data into the particular category where it fits, on the basis of the data which has been used to train the model. Examples of tasks which make use of classification techniques are classifying emails as spam or not, detecting a disease on plants, predicting whether it will rain on some particular day, and predicting house prices based on the area in which a house is located.

In terms of machine learning classification techniques fall under supervised learning [13].

The categories may be either:

• categorical (example: blood groups of humans—A, B, O)

• ordinal (example: high, medium or low)

• integer valued (example: occurrence of a letter in a sentence)

• real valued


The algorithms which make use of this concept in machine learning and classify the new data are called "classifiers." Algorithms always return a probability score of belonging to the class of interest. Consider an example where we are required to classify a gold ornament. When we input the image to the machine learning model, the algorithm returns a probability value for each category; for example, if it is a ring, the probability value may be higher than 0.8, and if it is not a necklace, it may return less than 0.2, etc.

The higher the value, the more likely it is to belong to the particular group. We make use of the following approach to build a machine learning classifier (a small sketch follows the list):

1. Pick a cut-off probability above which we consider a record to belong to that class.

2. Estimate the probability that a new observation belongs to a class.

3. If the obtained probability is above the cut-off probability, assign the new observation to that class.
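A small sketch of the cut-off idea in these steps (assuming an already fitted scikit-learn classifier `clf` and new observations `X_new`; the 0.5 cut-off is arbitrary):

```
cutoff = 0.5  #step 1: pick a cut-off probability
#step 2: estimate the probability that each new observation belongs to the class of interest
proba = clf.predict_proba(X_new)[:, 1]
#step 3: assign the observation to that class when the probability is above the cut-off
assigned = (proba >= cutoff).astype(int)
```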


Classifiers are of two types: linear and nonlinear classifiers. We now take a look at various classifiers which are also statistical techniques:

1. Naive Bayes

2. Stochastic gradient descent (SGD)

3. K-nearest neighbors

4. Decision trees

5. Random forest

6. Support vector machine


#### **3.1 Naive Bayes**

In machine learning, these classifiers belong to "probabilistic classifiers." This algorithm makes use of Bayes' theorem with strong independence assumptions between the features. Although Naive Bayes was introduced in the early 1950s, it is still being used today [14].

Given a problem instance to be classified, represented by a vector

$$\mathbf{x} = (x_1, x_2, x_3, \ldots, x_n)$$

which represents 'n' features, the classifier assigns to this instance the conditional probabilities

$$P(C_k \mid x_1, x_2, \ldots, x_n)$$

for each possible class $C_k$. We can observe from the above formula that if the number of features is large, or if a feature can take a large number of values, then this becomes infeasible. Therefore we rewrite the formula based on Bayes' theorem as:

$$p(C_k \mid \mathbf{x}) = \frac{p(C_k)\,p(\mathbf{x} \mid C_k)}{p(\mathbf{x})}\tag{5}$$
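A toy numeric illustration of Eq. (5), combined with the independence assumption discussed next (all probability values here are invented for the example):

```
import numpy as np

prior = {"spam": 0.4, "ham": 0.6}                          #p(C_k)
likelihood = {"spam": {"offer": 0.30, "meeting": 0.05},    #p(x_i | C_k)
              "ham":  {"offer": 0.02, "meeting": 0.20}}
x = ["offer", "meeting"]                                   #observed features

#p(C_k | x) is proportional to p(C_k) times the product of p(x_i | C_k)
scores = {c: prior[c] * np.prod([likelihood[c][w] for w in x]) for c in prior}
total = sum(scores.values())                               #plays the role of p(x)
posterior = {c: s / total for c, s in scores.items()}
print(posterior)
```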

Makes two "naïve" assumptions over attributes:


This classifier makes two assumptions:

• All attributes are equally important

• All attributes are not related to another attribute


There are three types of naive Bayes algorithms, which can be used: GaussianNB, BernoulliNB, and MultinomialNB.

Usage of naive Bayes in python:

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB

reg = GaussianNB()
reg.fit(X, y)
reg.predict(X_test)
reg.score(X_test, y_test)
```

#### **3.2 Stochastic gradient descent (SGD)**

An example of a linear classifier which implements a regularized linear model (**Figure 6**) with stochastic gradient descent. Stochastic gradient descent (often shortened to SGD), also known as incremental gradient descent, is an iterative method to optimize a differentiable objective function, a stochastic approximation of gradient descent optimization [15]. Although SGD has been a part of machine learning for ages, it wasn't extensively used until recently.


In the linear regression algorithm, we make use of least squares to fit the line. To ensure that the error is low we use gradient descent. Although gradient descent does the job, it can't handle big tasks, hence we use the stochastic gradient classifier. SGD calculates the derivative for each training example and computes the update within no time.
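As an illustrative sketch of that per-example update for a squared-error linear model (names such as `sgd_epoch` and the learning rate are assumptions, not the chapter's code):

```
import numpy as np

def sgd_epoch(X, y, w, b, lr = 0.01):
    #one pass over the data: the parameters are updated after every single example
    for xi, yi in zip(X, y):
        pred = np.dot(w, xi) + b
        grad = pred - yi          #derivative of 0.5 * (pred - yi) ** 2 w.r.t. pred
        w = w - lr * grad * xi    #per-sample parameter update
        b = b - lr * grad
    return w, b
```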

The advantages of using SGD classifier are that they are efficient and they are easy to implement.

However it is sensitive to feature scaling. Usage of SGD classifier:

```
from sklearn.linear_model import SGDClassifier
X = [[0., 0.], [1., 1.]]
y = [0, 1]
clf = SGDClassifier(loss="hinge", penalty="l2")
clf.fit(X, y)
#to predict the values
clf.predict(X_test)
```
#### **3.3 K-nearest neighbors**

Also known as k-NN, this is a method used for classification as well as for regression. The input consists of the k closest training examples. It is also referred to as lazy learning since the training phase doesn't require a lot of effort.

In k-NN an object's classification is solely dependent on the majority vote of the object's neighbors; that is, the outcome is based on the presence of the neighbors. The object is assigned to the class most common among its k nearest neighbors. If the value of k is equal to 1, then it is assigned to the class of its nearest neighbor.

**Figure 6.** *Feature scaling classifier.*


Simply put, the k-NN algorithm is entirely dependent on the neighbors of the object to be classified: the greater the influence of a neighbor, the more likely the object is to be assigned to that neighbor's class. It is termed the simplest machine learning algorithm among all the algorithms [16].

Let us consider an example where the green circle is the object which is to be classified as shown in **Figure 7**. Let us assume that there are two circles—the solid circle and the dotted circle.

As we know, there are two classes, class 1 (red circles) and class 2 (blue squares). If we consider only the inner, solid circle, there are two red circles, which outnumber the blue squares, due to which the new object is classified to class 1. But if we consider the dotted circle, the blue squares dominate since there are more of them, due to which the object is classified to class 2 [17].

However, the cost of learning process is zero.

The algorithm may suffer from the curse of dimensionality, since the number of dimensions greatly affects its performance. When the dataset is very large, the computation becomes very complex since the algorithm takes time to look for its neighbors. If there are many dimensions, then a sample's nearest neighbors can be far away. To avoid the curse of dimensionality, dimension reduction is usually performed before applying the k-NN algorithm to the data.
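An illustrative sketch of performing dimension reduction before k-NN with scikit-learn (the pipeline and the choice of two components are assumptions for the example, not the chapter's own code):

```
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

#reduce the feature space first, then classify by the nearest neighbors
knn_pipeline = make_pipeline(PCA(n_components = 2), KNeighborsClassifier(n_neighbors = 5))
knn_pipeline.fit(X_train, y_train)
knn_pipeline.predict(X_test)
```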

Also the algorithm may not perform well with categorical data since it is difficult to find the distance between the categorical features.

Usage in python:


```
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)
```

#### **3.4 Decision trees**

Decision trees are considered to be among the most popular classification algorithms for classifying data. Decision trees are a type of supervised algorithm where the data is split based on certain parameters. The trees consist of decision nodes and leaves [18].

The decision tree consists of a root node from which the tree generates; this root node doesn't have any inputs. It is the point from which the tree originates. All the other nodes except the root node have exactly one incoming edge, and nodes without outgoing edges are called leaves. Below is an example of a decision tree, an illustration of how a decision tree looks, as shown in **Figure 8**.

**Figure 7.** *K-Neighbors.*


"Is sex male" is the root node from where the tree originates. Depending on the condition the tree further bifurcates into subsequent leaf nodes. Few more conditions like "is Age >9.5?" are applied by which the depth of the node goes on increasing. As the number of leaf nodes increase the depth of the tree goes on increasing. The leaf can also hold a probability vector.

Decision tree algorithms implicitly construct a decision tree for any dataset.

The goal is to construct an optimal decision tree by minimizing the generalization error. Any tree algorithm can be tuned by making changes to parameters such as "depth of the tree," "number of nodes," and "max features." However, construction of a tree by the algorithm can get complex for large problems, since the number of nodes as well as the depth of the tree increases.
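For instance, these tuning knobs map roughly onto scikit-learn arguments as follows (a sketch; the particular values are arbitrary):

```
from sklearn import tree

#limit the depth of the tree, the growth of nodes and the features tried per split
clf = tree.DecisionTreeClassifier(max_depth = 4, min_samples_split = 10, max_features = "sqrt")
```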

Advantages of these trees are that they are simple to understand and can be easily interpreted. They also require little data preparation. The tree can handle both numerical and categorical data, unlike many other algorithms. It is also easy to validate the decision tree model using statistical tests. However, disadvantages of the trees are that they can be complex in nature for some cases, which won't generalize the data well. They are unstable in nature, since small variations in the data may change the structure of the tree completely.

Usage in python:

```
from sklearn import tree
classifier = tree.DecisionTreeClassifier()
classifier.fit(X_train, y_train)
classifier.predict(X_test)
```

#### **3.5 Random forest**

These are often referred to as ensemble algorithms, since these algorithms combine the use of two or more algorithms. They are an improved version of bagged decision trees. They are used for classification, regression, etc.

**Figure 8.** *Typical decision tree.*


Random forest creates n number of decision trees from a subset of the data. On creating the trees it aggregates the votes from the different trees and then decides the final class of the sample object. Random forest is used in recommendation engines, image classification and feature selection [19].

The process consists of four steps (a toy sketch follows the list):

1. Select random samples from the dataset.

2. Construct a decision tree for every sample and obtain a prediction from every decision tree.

3. Perform a vote for every predicted result.

4. Select the prediction which has the highest number of votes.
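A toy sketch of these four steps (purely illustrative; it is not how scikit-learn implements random forests, `toy_random_forest_predict` is a hypothetical helper, and X and y are assumed to be NumPy arrays with integer class labels):

```
import numpy as np
from sklearn import tree

def toy_random_forest_predict(X, y, X_new, n_trees = 10, seed = 0):
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_trees):
        #1. select a random (bootstrap) sample from the dataset
        idx = rng.integers(0, len(X), size = len(X))
        #2. construct a decision tree on that sample and predict from it
        clf = tree.DecisionTreeClassifier().fit(X[idx], y[idx])
        votes.append(clf.predict(X_new))
    #3. and 4. vote over the per-tree predictions and keep the majority class
    votes = np.array(votes, dtype = int)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```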



Random forest's default parameters often produce a good result in most of the cases. Additionally, one can make changes to achieve desired results. The parameters in Random Forest which can be used to tune the algorithm for better and more efficient results are (see the sketch after this list):

1. Increasing the predictive power: "n\_estimators" controls the number of trees which will be built. The "max\_features" parameter can also be adjusted, which is the number of features that are used to train the algorithm. Another parameter which can be adjusted is "min\_samples\_leaf", the minimum number of samples required at a leaf when splitting an internal node.

2. To increase the model's speed, the "n\_jobs" parameter can be adjusted, which is the number of processors it can use. To use as many as needed, "-1" can be specified, which signifies that there is no limit.
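A sketch of how these parameters appear on scikit-learn's RandomForestClassifier (the values are arbitrary):

```
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators = 200,     #more trees, more predictive power
                             max_features = "sqrt",  #features considered at each split
                             min_samples_leaf = 5,   #minimum samples required at a leaf
                             n_jobs = -1)            #use all available processors
```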


Due to the large number of decision trees, random forest is highly accurate. Since it takes the average of all the predictions which are computed, the algorithm doesn't suffer from over-fitting. It also handles missing values in the dataset. However, the algorithm takes time to compute, since it takes time to build the trees, take the average of the predictions, and so on.

One of the real time examples where random forest algorithm can be used is predicting a person's systolic blood pressure based on the person's height, age, weight, gender, etc.

Random forests require very little tuning when compared to other algorithms. The main disadvantage of the random forest algorithm is that an increased number of trees can make the process computationally expensive and lead to inaccurate results.

Usage in python:

```
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X, y)
clf.predict(X_test)
```

#### **3.6 Support vector machine**

Support vector machines, also known as SVMs or support vector networks, fall under supervised learning. They are used for classification as well as regression purposes. Support vectors are the data points which lie close to the hyper plane. When the data is fed to the algorithm, the algorithm builds a classifier which can be used to assign new examples to one class or the other [20]. An SVM maps the examples as points in space separated by a gap which is as wide as possible. When a new sample is encountered, it maps it to the corresponding category.


Perhaps when the data is unlabeled it becomes difficult for the supervised SVM to perform, and this is where an unsupervised method of classifying is required.

A SVM constructs a hyper plane which can be used for classification, regression and many other purposes. A good separation can be achieved when the hyper plane has the largest distance to the nearest training point of a class.

In (**Figure 9**) the H1 line doesn't separate the classes, H2 separates them but the margin is very small, whereas H3 separates them such that the distance between the margin and the nearest point is maximum when compared to H1 and H2.

The advantages of SVM are:

1. Effective in high dimensional data

2. It is memory efficient

3. It is versatile

SVMs can be used in a variety of applications such as:

They are used to categorize text, to classify images, handwritten images can be recognized, and they are also used in the field of biology.

SVMs can be used with the following kernels:

1. Polynomial kernel SVM

2. Linear kernel SVM

3. Gaussian kernel SVM

4. Gaussian radial basis function SVM (RBF)


**Figure 9.** *Hyper plane construction and H1, H2 and H3 line separation.*


It may be difficult for an SVM to classify at times, in which case the decision boundary is not optimal. For example, consider points randomly distributed on a number line.

It is almost impossible to separate them. So in such cases we transform the dataset by applying 2D or 3D transformations by using a polynomial function or any other appropriate function. By doing so it becomes easier to draw a hyper plane.
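A brief sketch of that idea with scikit-learn kernels (an illustration under the assumption that `X_train`, `y_train` and `X_test` exist; it is not the chapter's own code):

```
from sklearn import svm

#a linear kernel may fail when the classes are not linearly separable;
#a polynomial or RBF kernel implicitly maps the data to a higher dimension
linear_clf = svm.SVC(kernel = "linear")
rbf_clf = svm.SVC(kernel = "rbf", gamma = "scale")
rbf_clf.fit(X_train, y_train)
rbf_clf.predict(X_test)
```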

When the number of features is much greater than number of samples it doesn't perform well with the default parameters.

Usage of SVM in python:

```
from sklearn import svm
clf = svm.SVC()
clf.fit(X, y)
clf.predict(X_test)
```

#### **4. Conclusion**


It is evident from the above that regression and classification techniques are strongly influenced by statistics. The methods have been derived from statistical methods which have existed for a long time. Statistical methods also consist of building models, which consist of parameters, and then fitting them. However, not all the methods which are being used derive their nature from statistics, and not all statistical methods are being used in machine learning. Extensive research in the field of statistical methods may give out a new set of methods which can be used in machine learning apart from the existing statistical methods which are being used today. It can also be stated that machine learning to some extent is a form of 'Applied Statistics.'

#### **Author details**

Pramod Kumar<sup>1</sup>\*, Sameer Ambekar<sup>1</sup>†, Manish Kumar<sup>2</sup> and Subarna Roy<sup>1</sup>

1 Department of Health Research, Biomedical Informatics Centre, ICMR-National Institute of Traditional Medicine, Belagavi, Karnataka, India

2 Department of Electrical Engineering, College of Engineering, Bharti Vidyapeeth, Pune, Maharashtra, India

\*Address all correspondence to: pramodbiotech@gmail.com

† Sameer Ambekar shares first authorship.

© 2020 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/ by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

### **References**

[1] Hawkins DM. On the investigation of alternative regressions by principal component analysis. Journal of the Royal Statistical Society Series. 1973;**22**:275-286. https://www.jstor.org/stable/i316057

[2] Kourou K, Exarchos TP, Exarchos KP, Karamouzis MV, Fotiadis DI. Machine learning applications in cancer prognosis and prediction. Computational and Structural Biotechnology Journal. 2015;**13**:8-17. DOI: 10.1016/j.csbj.2014.11.005

[3] Machine Learning [Internet]. Available from: https://en.wikipedia.org/wiki/Machine\_learning

[4] Trevor H, Robert T. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer; 2009. pp. 485-586. DOI: 10.1007/978-0-387-84858-7\_14

[5] Aho K, Derryberry DW, Peterson T. Model selection for ecologists: The worldviews of AIC and BIC. Ecology. 2014;**95**(3):631-636. DOI: 10.1890/13-1452.1

[6] Freedman DA. Statistical Models: Theory and Practice. USA: Cambridge University Press; 2005. ISBN: 978-0-521-85483-2

[7] sklearn.linear\_model.LinearRegression—scikit-learn 0.19.2 documentation [Internet]. Available from: http://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.LinearRegression.html

[8] Linear Regression—Wikipedia [Internet]. Available from: https://en.wikipedia.org/wiki/Linear\_regression

[9] Shaw P et al. Gergonne's 1815 paper on the design and analysis of polynomial regression experiments. Historia Mathematica. 2006;**1**(4):431-439. DOI: 10.1016/0315-0860(74)90033-0


[10] Stepwise Regression—Wikipedia [Internet]. Available from: https://en.wikipedia.org/wiki/Stepwise\_regression

[11] Tikhonov Regularization—Wikipedia [Internet]. Available from: https://en.wikipedia.org/wiki/Tikhonov\_regularization

[12] sklearn.linear\_model.LogisticRegression—scikit-learn 0.19.2 documentation [Internet]. Available from: http://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.LogisticRegression.html

[13] Statistical Classification—Wikipedia [Internet]. Available from: https://en.wikipedia.org/wiki/Statistical\_classification

[14] Naive Bayes—scikit-learn 0.19.2 documentation [Internet]. Available from: http://scikit-learn.org/stable/modules/naive\_bayes.html

[15] Stochastic Gradient Descent—scikit-learn 0.19.2 documentation [Internet]. Available from: http://scikit-learn.org/stable/modules/sgd.html

[16] k-Nearest Neighbors Algorithm—Wikipedia [Internet]. Available from: https://en.wikipedia.org/wiki/K-nearest\_neighbors\_algorithm

[17] Kamiński B, Jakubczyk M, Szufel P. A framework for sensitivity analysis of decision trees. Central European Journal of Operations Research. 2017;**26**:135-159. DOI: 10.1007/s10100-017-0479-6

[18] Lasso (Statistics)—Wikipedia [Internet]. Available from: https://en.wikipedia.org/wiki/Lasso\_(statistics)


[19] sklearn.linear\_model.Lasso—scikit-learn 0.19.2 documentation [Internet]. Available from: http://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.Lasso.html


[20] Corinna C, Vapnik Vladimir N. Support-vector networks. Machine Learning. 1995;**20**(3):273-297. DOI: 10.1007/BF00994018


## **Chapter 6**

## Clustering of Time-Series Data

*Esma Ergüner Özkoç*

#### **Abstract**

The process of separating groups according to similarities of data is called "clustering." There are two basic principles: (i) the similarity is the highest within a cluster and (ii) similarity between the clusters is the least. Time-series data are unlabeled data obtained from different periods of a process or from more than one process. These data can be gathered from many different areas that include engineering, science, business, finance, health care, government, and so on. Given the unlabeled time-series data, it usually results in the grouping of the series with similar characteristics. Time-series clustering methods are examined in three main sections: data representation, similarity measure, and clustering algorithm. The scope of this chapter includes the taxonomy of time-series data clustering and the clustering of gene expression data as a case study.

**Keywords:** time-series data, data mining, data representation, similarity measure, clustering algorithms, gene expression data clustering

#### **1. Introduction**

The rapid development of technology has led to the registration of many processes in an electronic environment, the storage of these records, and the accessibility of these records when requested. With the evolving technology such as cloud computing, big data, the accumulation of a large amount of data stored in databases, and the process of parsing and screening useful information made data mining necessary.

It is possible to examine the data which are kept in databases and reach huge sizes every second in two parts, according to their changes in time: static and temporal. Data is called static data when its feature values do not change with time; if the feature values change with time, then it is called temporal or time-series data.

Today, with the increase in processor speed and the development of storage technologies, real-world applications can easily record changing data over time.

Time-series analysis is a trend study subject because of its prevalence in various fields ranging from science, engineering, bioinformatics, finance, and government to health-care applications [1–3]. Data analysts are looking for the answers of such questions: Why does the data change this way? Are there any patterns? Which series show similar patterns? etc. Subsequence matching, indexing, anomaly detection, motif discovery, and clustering of the data are the answers of some questions [4]. Clustering, which is one of the most important concepts of data mining, defines its structure by separating unlabeled data sets into homogeneous groups. Many general-purpose clustering algorithms are used for the clustering of time-series

data, either by directly or by evolving. Algorithm selection depends entirely on the purpose of the application and on the properties of the data such as sales data, exchange rates in finance, gene expression data, image data for face recognition, etc. iii. Acceptable computational cost,


In the age of informatics, the analysis of multidimensional data that has emerged as part of the digital transformation in every field has gained considerable importance. These data can be from data received at different times from one or more sensors, stock data, or call records to a call center. This type of data, that is, observing the movement of a variable over time, where the results of the observation are distributed according to time, is called time-series data. Time-series analysis is used for many purposes such as future forecasts, anomaly detection, subsequence matching, clustering, motif discovery, indexing, etc. Within the scope of this study, the methods developed for time-series data clustering, which are important for every field of digital life, are examined in three main sections. In the first section, the proposed methods for the preparation of multidimensional data for clustering (dimension reduction) in the literature are categorized. In the second section, the similarity criteria to be used when deciding on the objects to be assigned to the related cluster are classified. In the third section, clustering algorithms of time-series data are examined under five main headings according to the method used. In the last part of the study, the use of time-series clustering in bioinformatics, which is one of the favorite areas, is included.

#### **2. Time-series clustering approaches**

There are many different categorizations of time-series clustering approaches. For example, time-series clustering approaches can be examined in three main sections according to the characteristics of the data used: whether they process directly on raw data, indirectly with features extracted from the raw data, or indirectly with models built from the raw data [5]. Another categorization is according to the clustering method: shape-based, feature-based, and model-based [6]. But whatever the categorization is, for any time-series clustering approach the main points to be considered are: how to measure the similarity between time series, how to compress the series or reduce dimension, and what algorithm to use for clustering. Therefore, this chapter examines time-series clustering approaches according to three main building blocks: data representation methods, distance measurements, and clustering algorithms (**Figure 1**).

**Figure 1.** *Time-series clustering.*

#### **2.1 Data representation**

Data representation is one of the main challenging issues for time-series clustering. Time-series data are often much larger than the available memory [7, 8], which increases the need for processing power, and the time required for clustering grows rapidly with data size. In addition, time-series data are multidimensional, which is difficult for many clustering algorithms to handle and slows down the calculation of the similarity measure. Consequently, it is very important to represent time-series data without slowing down the algorithm execution time and without significant data loss. Therefore, some requirements can be listed for any data representation method [9]:


i. Significantly reduce the data size/dimensionality,

ii. Maintain the local and global shape characteristics of the time series,

iii. Acceptable computational cost,

iv. Reasonable level of reconstruction from the reduced representation,

v. Insensitivity to noise or implicit noise handling.

Dimension reduction is one of the most frequently used methods in the literature [7, 10–12] for the data representation.

#### *Definition:*


The representation of a time series T with length n is a model T̅ with reduced dimensions, such that T̅ approximates T [13]. Dimension reduction or feature extraction is a very useful method for reducing the number of variables/attributes or units in multivariate statistical analyses, so that the number of attributes is reduced to a number that can be handled.

**Figure 1.** *Time-series clustering.*


Due to the noisy and high-dimensional features of many time-series data, data representations have been studied and are generally examined in four main sections: data adaptive, non-data adaptive, model-based, and data dictated [6].

• **Data adaptive methods** have parameters that change according to the time-series data being processed. Methods in this category try to minimize the global reconstruction error by using segments of unequal length. Although this makes it difficult to compare several time series, each series is approximated better. Some of the popular data adaptive representation methods are: Symbolic Aggregate Approximation (SAX) [14], Adaptive Piecewise Constant Approximation (APCA) [15], Piecewise Linear Approximation (PLA) [16], Singular Value Decomposition (SVD) [17, 18], and Symbolic Natural Language (NLG) [19].

• **Non-data adaptive methods** use fixed-size parameters for representing time-series data. The following methods are among the non-data adaptive representation methods: Discrete Fourier Transform (DFT) [18], Discrete Wavelet Transform (DWT) [20–22], Discrete Cosine Transformation (DCT) [17], Perceptually Important Point (PIP) [23], Piecewise Aggregate Approximation (PAA) [24], Chebyshev Polynomials (CHEB) [25], Random Mapping [26], and Indexable Piecewise Linear Approximation (IPLA) [27].

• **Model-based methods** assume that the observed time series was produced by an underlying model. The real issue here is to find the parameters that produce this model. Two time series produced by the same set of parameters using the underlying model are considered similar. Some of the model-based methods can be listed as: Auto-regressive Moving Average (ARMA) [28, 29], Time-Series Bitmaps [30], and Hidden Markov Model (HMM) [31–33].

• **Data dictated methods** determine the dimension reduction rate automatically, whereas in the three categories mentioned above the reduction rate is set by the user. The most common example of a data dictated method is clipped data [34–36].
Many representation methods for time-series data have been proposed, each offering a different trade-off between the aforementioned requirements. The correct selection of the representation method plays a major role in the effectiveness and usability of the application to be performed.
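
As a rough illustration of what such a representation does, the sketch below implements the basic idea of Piecewise Aggregate Approximation (PAA), one of the non-data adaptive methods listed above; it is a minimal numpy version written for this chapter, not the reference implementation of [24].

```
import numpy as np

def paa(series, n_segments):
    """Piecewise Aggregate Approximation: represent a series by its segment means."""
    series = np.asarray(series, dtype=float)
    # split the index range into n_segments (nearly) equal parts and average each part
    segments = np.array_split(series, n_segments)
    return np.array([seg.mean() for seg in segments])

# toy example: reduce a 100-point series to 10 segment means
t = np.linspace(0, 4 * np.pi, 100)
x = np.sin(t) + 0.1 * np.random.randn(100)
x_reduced = paa(x, 10)
print(x_reduced.shape)  # (10,)
```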

#### **2.2 Similarity/distance measure**

In particular, the similarity measure is the most essential ingredient of time-series clustering.

The similarity or distance for time-series clustering is calculated approximately, not based on an exact match as in traditional clustering methods. It requires a distance function to compare two time series. In other words, the similarity of the time series is not calculated, it is estimated. If the estimated distance is large, the similarity between the time series is small, and vice versa.

#### *Definition:*


The similarity between two n-sized time series T = {t1, t2, …, tn} and U = {u1, u2, …, un} is the length of the path connecting the pairs of points [11]. This distance is the measure of similarity. D(T, U) is a function that takes two time series (T, U) as input and calculates their distance d.

Metrics to be used in clustering must cope with the problems caused by common features of time-series data such as noise, temporal drift, longitudinal scaling, offset translation, linear drift, discontinuities, and amplitude scaling. Various methods have been developed for similarity measure, and the method to choose is problem specific. These methods can be grouped under three main headings: similarity in time, similarity in shape, and similarity in change.

#### *2.2.1 Similarity in time*

Series are similar in time when they are highly time dependent, that is, when their values correspond closely at the same time points. Such a measure is costly on the raw time series, so a preprocessing or transformation is required beforehand [34, 36].
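
As an illustration only (the chapter does not prescribe a specific formula here), the sketch below uses z-normalization, a common preprocessing step, to remove offset and amplitude scaling, and then compares the series point by point with the Euclidean distance.

```
import numpy as np

def z_normalize(series):
    """Remove offset and amplitude scaling so only the temporal profile remains."""
    series = np.asarray(series, dtype=float)
    std = series.std()
    return (series - series.mean()) / std if std > 0 else series - series.mean()

def euclidean_similarity_in_time(t_series, u_series):
    """Point-wise (time-aligned) distance between two equal-length series."""
    return float(np.linalg.norm(z_normalize(t_series) - z_normalize(u_series)))

# two series with the same temporal profile but different offset and scale
a = np.array([1.0, 2.0, 3.0, 2.0, 1.0])
b = 10 + 3 * a
print(euclidean_similarity_in_time(a, b))  # close to 0: similar in time after normalization
```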

#### *2.2.2 Similarity in shape*

Clustering algorithms that use a similarity-in-shape measure assign time series containing similar patterns to the same cluster. The measure is independent of time and does not care how many times, or when, the pattern occurs [37, 38].
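
Dynamic time warping (DTW) is a widely used example of such a shape-oriented measure, although this section does not name a specific one; the sketch below is a plain quadratic-time DTW written for illustration.

```
import numpy as np

def dtw_distance(t_series, u_series):
    """Dynamic time warping: aligns similar shapes even when they are shifted in time."""
    t_series, u_series = np.asarray(t_series, float), np.asarray(u_series, float)
    n, m = len(t_series), len(u_series)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(t_series[i - 1] - u_series[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

# the same peak occurring at different times: DTW stays small while a point-wise distance grows
a = np.array([0, 0, 1, 2, 1, 0, 0, 0], dtype=float)
b = np.array([0, 0, 0, 0, 1, 2, 1, 0], dtype=float)
print(dtw_distance(a, b))
```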

#### *2.2.3 Similarity in change*

Using this metric results in time-series clusters that have a similar autocorrelation structure. However, it is not a suitable metric for short time series [29, 39, 40].
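
One simple way to realize such a measure, offered here only as an illustrative assumption rather than the specific metrics of [29, 39, 40], is to compare the estimated autocorrelation functions of the two series.

```
import numpy as np

def acf(series, n_lags):
    """Sample autocorrelation function up to n_lags."""
    x = np.asarray(series, float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / denom for k in range(1, n_lags + 1)])

def acf_distance(t_series, u_series, n_lags=10):
    """Series with a similar autocorrelation structure get a small distance."""
    return float(np.linalg.norm(acf(t_series, n_lags) - acf(u_series, n_lags)))

rng = np.random.default_rng(0)
a = np.cumsum(rng.standard_normal(200))   # strongly autocorrelated random walk
b = np.cumsum(rng.standard_normal(200))   # another strongly autocorrelated random walk
c = rng.standard_normal(200)              # white noise, almost no autocorrelation
print(acf_distance(a, b), acf_distance(a, c))  # typically: small vs. larger
```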

#### **2.3 Clustering algorithms**

The process of separating groups according to similarities of the data is called "clustering." There are two basic principles: the similarity within a cluster should be the highest, and the similarity between clusters should be the least. Clustering is done on the basis of the characteristics of the data, using multivariate statistical methods. When dividing data into clusters, the similarities/distances of the data to each other are measured according to the specification of the data (discrete, continuous, nominal, ordinal, etc.).

Han and Kamber [41] classify the general-purpose clustering algorithms, which are actually designed for static data, into five main sections: partition-based, hierarchical-based, density-based, grid-based, and model-based. Besides these, a wide variety of algorithms has been developed for time-series data. However, some of these algorithms (ignoring minor differences) intend to use the methods developed for static data directly, without changing the algorithm, by transforming the temporal data into a static form. Some approaches apply a preprocessing step on the data to be clustered before using the clustering algorithm. This preprocessing step converts the raw time-series data into feature vectors using dimension reduction techniques, or converts them into parameters of a specified model [42].

#### *Definition:*

Given a dataset of n time series T = {t1, t2, …, tn}, time-series clustering is the process of partitioning T into C = {C1, C2, …, Ck} according to a certain similarity criterion. Ci is called a "cluster," where

$$T = \bigcup_{i=1}^{k} \mathbf{C}_i \quad \text{and} \quad \mathbf{C}_i \cap \mathbf{C}_j = \emptyset \ \text{ for } i \neq j \tag{1}$$


In this section, previously developed clustering algorithms will be categorized. Some of these algorithms work directly with raw time-series data, while others use the data representation techniques mentioned previously.

Clustering algorithms are generally classified as: partitioning, hierarchical, graph-based, model-based, and density-based clustering.

#### *2.3.1 Partitioning clustering*

The K-means [43] algorithm is a typical partition-based clustering algorithm: the data are divided into a number of predefined sets by optimizing a predefined criterion. Its most important advantages are simplicity and speed, so it can be applied to large data sets. However, the algorithm may not produce the same result in each run and cannot handle outliers. The self-organizing map [44] is more robust to noisy data than K-means, but the user is required to enter the cluster number and grid sets, and it is difficult to determine the number of clusters for time-series data. Other examples of partition-based clustering are CLARANS [45] and K-medoids [46]. In general, the partitioning approach is suitable for low-dimensional, well-separated data, whereas time-series data are multidimensional and often contain intersecting, embedded clusters.

In essence, these algorithms treat time-series data as n-dimensional vectors and apply distance or correlation functions to determine the amount of similarity between two series. The Euclidean distance, the Manhattan distance, and the Pearson correlation coefficient are the most commonly used functions.
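
A minimal sketch of this vector view, assuming scikit-learn is available and using synthetic equal-length series, could look as follows; K must be chosen in advance, which is one of the drawbacks noted above.

```
import numpy as np
from sklearn.cluster import KMeans

# treat each equal-length series as an n-dimensional vector (rows = series)
rng = np.random.default_rng(42)
t = np.linspace(0, 2 * np.pi, 50)
group1 = np.sin(t) + 0.2 * rng.standard_normal((20, 50))   # 20 sine-like series
group2 = np.cos(t) + 0.2 * rng.standard_normal((20, 50))   # 20 cosine-like series
X = np.vstack([group1, group2])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)
```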

#### *2.3.2 Hierarchical clustering*

Contrary to the partitioning approach, which aims at segmenting the data into non-intersecting groups, the hierarchical approach produces a hierarchy of nested clusters that can be represented graphically (a dendrogram, a tree-like diagram). The branches of the dendrogram show the similarity between the clusters as well as how the clusters were formed. A desired number of clusters can be obtained by cutting the dendrogram at a certain level.

Hierarchical clustering methods [47–49] are based either on splitting clusters step by step into subgroups or on the stepwise merging of individual clusters into a larger cluster [50]. Accordingly, hierarchical clustering methods are divided into two types, agglomerative and divisive, depending on how the dendrogram is created.

In agglomerative hierarchical clustering methods, each observation is initially treated as an independent cluster; the closest clusters are then repeatedly merged until a single cluster containing all observations is obtained.

In divisive hierarchical clustering methods, all observations are initially evaluated as a single cluster, which is then repeatedly split so that the observations farthest from each other form new clusters. This process continues until each observation constitutes a cluster of its own.


Hierarchical clustering not only forms a group of similar series but also provides a graphical representation of the data. Graphical presentation allows the user to have an overall view of the data and an idea of data distribution. However, a small change in the data set leads to large changes in the hierarchical dendrogram. Another drawback is high computational complexity.
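
A minimal agglomerative example with SciPy, assuming equal-length series and Euclidean distances; the linkage matrix is the dendrogram, which can be cut at a desired number of clusters.

```
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 50)
X = np.vstack([np.sin(t) + 0.2 * rng.standard_normal((10, 50)),
               np.cos(t) + 0.2 * rng.standard_normal((10, 50))])

# agglomerative clustering: pairwise distances -> linkage matrix (the dendrogram)
distances = pdist(X, metric="euclidean")
Z = linkage(distances, method="average")

# cut the dendrogram to obtain a desired number of clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
# dendrogram(Z)  # with matplotlib, draws the tree for visual inspection
```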

#### *2.3.3 Density-based clustering*


The density-based clustering approach is based on the concepts of density and attraction of objects. The idea is to create clusters in dense multi-dimensional areas where objects attract each other. In the core of dense areas, objects are very close together and crowded; objects near the boundary of a cluster are scattered less densely than in its core. In other words, density-based clustering determines dense areas of the object space; the clusters are dense areas separated by sparser areas. DBSCAN [51] and OPTICS [52] are the best-known examples of density-based clustering.

The density-based approach is robust in noisy environments and also deals with outliers when defining embedded clusters. However, density-based clustering techniques suffer from high computational complexity and input parameter dependency when a dimensional index structure is not used.
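
A small DBSCAN sketch with scikit-learn on synthetic series is given below; eps and min_samples are exactly the input parameters the paragraph above warns about, and they typically need tuning for real data.

```
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
t = np.linspace(0, 2 * np.pi, 50)
X = np.vstack([np.sin(t) + 0.1 * rng.standard_normal((15, 50)),
               np.cos(t) + 0.1 * rng.standard_normal((15, 50)),
               3 * rng.standard_normal((2, 50))])            # two outlier series

# eps (neighborhood radius) and min_samples control what counts as a dense region
db = DBSCAN(eps=2.0, min_samples=3).fit(X)
print(db.labels_)  # -1 marks series treated as noise/outliers
```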

#### *2.3.4 Model-based clustering*

The model-based approach [53–55] uses a statistical infrastructure to model the cluster structure of the time-series data. It is assumed that the underlying probability distributions of the data come from a finite mixture. Model-based algorithms usually try to estimate the likelihood of the model parameters by applying statistical techniques such as Expectation Maximization (EM). The EM algorithm iterates between an "E-step," which computes a matrix z such that zik is an estimate of the conditional probability that observation i belongs to group k given the current parameter estimates, and an "M-step," which computes maximum likelihood parameter estimates given z. When the EM algorithm converges, each data object is assigned to the cluster with the highest probability, so as to maximize the likelihood over the whole data set.

The most important advantage of the model-based approach is that it estimates the probability that observation i belongs to cluster k. In some cases, a time series is likely to belong to more than one cluster; for such time-series data, this probabilistic output is the reason the approach is preferred. The approach assumes that the data set follows a certain distribution, but this assumption is not always correct.
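
scikit-learn's GaussianMixture is used below purely as a generic EM-fitted mixture example (not the specific methods of [53–55]) to show the soft, probabilistic assignments discussed above.

```
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
t = np.linspace(0, 2 * np.pi, 20)
X = np.vstack([np.sin(t) + 0.2 * rng.standard_normal((30, 20)),
               np.cos(t) + 0.2 * rng.standard_normal((30, 20))])

# EM-fitted mixture: each series gets a posterior probability for every cluster
gmm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0).fit(X)
posterior = gmm.predict_proba(X)       # soft assignments (rows sum to 1)
labels = posterior.argmax(axis=1)      # hard assignment = most probable cluster
print(posterior[:3].round(3), labels[:5])
```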

#### *2.3.5 Grid-based clustering*

In this approach, grids made up of square cells are used to examine the data space. Owing to the grid structure used, it is independent of the number of objects in the database. The most typical example is STING [56], which uses quadrilateral cells at different levels of resolution. It precalculates and records statistical information about the properties of each cell. The query process usually begins at a high level of the hierarchical structure. For each cell at the current level, the confidence interval, which reflects the cell's relevance to the query, is computed. Unrelated cells are excluded from the next steps, and the query process continues for the relevant cells at the lower level until the lowest layer is reached.
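
STING itself maintains hierarchical statistical summaries of the cells; the sketch below only illustrates the basic grid idea on 2-D points (count per cell, keep dense cells, merge adjacent dense cells) and is not an implementation of STING.

```
import numpy as np
from scipy.ndimage import label

rng = np.random.default_rng(4)
# two dense blobs plus background noise in a 2-D feature space
points = np.vstack([rng.normal([2, 2], 0.3, size=(200, 2)),
                    rng.normal([7, 7], 0.3, size=(200, 2)),
                    rng.uniform(0, 10, size=(50, 2))])

# 1) lay a grid of square cells over the space and count the points per cell
counts, xedges, yedges = np.histogram2d(points[:, 0], points[:, 1], bins=20)

# 2) keep only "dense" cells and 3) merge adjacent dense cells into clusters
dense = counts > 5
clusters, n_clusters = label(dense)
print(n_clusters)  # typically 2 dense regions for this synthetic data
```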

After analyzing the data set and obtaining the clustering solution, there is no guarantee of the significance and reliability of the results. The data will be clustered even if there is no natural grouping. Therefore, whether the clustering solution obtained is different from the random solution should be determined by applying some tests. Some methods developed to test the quality of clustering solutions are classified into two types: external index and internal index.

• The external index is the most commonly used clustering evaluation method, also known as external validation or external criterion. The ground truth is the set of goal clusters, usually created by experts. This index measures how well the target clusters and the resulting clusters overlap. Entropy, Adjusted Rand Index (ARI), F-measure, Jaccard Score, Fowlkes and Mallows Index (FM), and Cluster Similarity Measure (CSM) are the best-known external indexes.

• The internal indexes evaluate clustering results using the features of the data sets and meta-data, without any external information. They are often used in cases where the correct solutions are not known. The sum of squared error is one of the most used internal methods, in which the distance to the nearest cluster determines the error, so clusters with similar time series are expected to give lower error values. The distance between two clusters (CD) index, root-mean-square standard deviation (RMSSTD), Silhouette index, R-squared index, Hubert-Levin index, semi-partial R-squared (SPR) index, weighted inter-intra index, homogeneity index, and separation index are the common internal indexes.
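
As a small illustration, and assuming ground-truth labels are available for the external index, the snippet below computes one external index (ARI) and one internal index (Silhouette) with scikit-learn on synthetic series.

```
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score

rng = np.random.default_rng(3)
t = np.linspace(0, 2 * np.pi, 30)
X = np.vstack([np.sin(t) + 0.2 * rng.standard_normal((25, 30)),
               np.cos(t) + 0.2 * rng.standard_normal((25, 30))])
ground_truth = np.array([0] * 25 + [1] * 25)   # expert labels, needed only for the external index

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print("ARI (external):", adjusted_rand_score(ground_truth, labels))
print("Silhouette (internal):", silhouette_score(X, labels))
```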


#### *2.3.6 Clustering algorithm example: FunFEM*

The funFEM algorithm [55, 57] allows the clustering of time series or, more generally, functional data. FunFEM is based on a discriminative functional mixture model (DFM), which allows the clustering of the curves (data) in a functional subspace. If the observed curves are {x1, x2, …, xn}, FunFEM aims to cluster them into K homogeneous groups. It assumes that there exists an unobserved random variable Z = {z1, z2, …, zn} ∈ {0, 1}^K; if x belongs to group k, zk is defined as 1, and otherwise 0. The goal of the clustering task is to predict the value zi = (zi1, …, ziK) of Z for each observed curve xi, for i = 1…n. The FunFEM algorithm alternates over the three steps of the Fisher-EM algorithm [57] ("F-step," "E-step," and "M-step") to decide the group memberships of Z = {z1, z2, …, zn}. In other words, from 12 defined discriminative functional mixture (DFM) models, Fisher-EM decides which one fits the data best. The Fisher-EM algorithm alternates between three steps:


• an F step that estimates the orientation matrix U of the discriminative latent space conditionally on the posterior probabilities,

• an E step in which the posterior probabilities that observations belong to the K groups are computed,

• an M step in which the parameters of the mixture model are estimated in the latent subspace by maximizing the conditional expectation of the complete likelihood.

The Fisher-EM algorithm updates the parameters repeatedly until the Aitken criterion is satisfied. The Aitken criterion estimates the asymptotic maximum of the log-likelihood in order to detect the convergence of the algorithm in advance [57]. In model-based clustering, a model is defined by its number of components/clusters K and its parameterization. In the model selection task, several models are reviewed while selecting the most appropriate model for the considered data.

FunFEM allows choosing between AIC (Akaike Information Criterion) [58], BIC (Bayesian Information Criterion) [59], and ICL (Integrated Completed Likelihood) [60] when deciding the number of clusters. The penalty terms are $\frac{\gamma(M)}{2}\log(n)$ in the BIC criterion, $\gamma(M)$ in the AIC criterion, and $\sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}\log(t_{ik})$ in the ICL criterion. Here, $M$ indicates the number of parameters in the model, $n$ is the number of observations, $K$ is the number of clusters, and $t_{ik}$ is the probability of the ith observation belonging to the kth cluster.
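
funFEM itself is an R function that scores its DFM models with these criteria. Purely as an analogous Python sketch of "choose K by an information criterion," the snippet below compares BIC values of Gaussian mixtures for several candidate K; the data and the mixture model family are illustrative assumptions, not part of funFEM.

```
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)
t = np.linspace(0, 2 * np.pi, 6)            # six time points, as in the microarray example
X = np.vstack([np.sin(t) + 0.2 * rng.standard_normal((40, 6)),
               np.cos(t) + 0.2 * rng.standard_normal((40, 6)),
               0.5 * t + 0.2 * rng.standard_normal((40, 6))])

# fit a mixture for each candidate K and keep the one with the best (lowest) BIC
candidates = range(1, 7)
bics = [GaussianMixture(n_components=k, covariance_type="diag", random_state=0)
        .fit(X).bic(X) for k in candidates]
best_k = list(candidates)[int(np.argmin(bics))]
print(dict(zip(candidates, np.round(bics, 1))), "-> chosen K:", best_k)
```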

FunFEM is implemented in the R programming language and serves as a function [61]. The algorithm is applied to time-series gene expression data in the following section. The input of the algorithm is the gene expression data given in **Table 1**. The table shows the gene expression values measured as a result of the microarray experiment; the measurement was performed at six different times for each gene. The data were taken from the GEO database (GSE2241) [62]. The FunFEM method decided that the best model is DkBk with K = 4 (BIC = −152654.5) for the input data. As a result, the method assigned each gene to the appropriate cluster determined by the algorithm; **Table 2** shows the gene symbol and the cluster number.


| Gene symbol | TP1 | TP2 | TP3 | TP4 | TP5 | TP6 |
|---|---|---|---|---|---|---|
| AADAC | 18.4 | 29.7 | 30 | 79.7 | 86.7 | 163.2 |
| AAK1 | 253.2 | 141.8 | 49.2 | 118.7 | 145.2 | 126.7 |
| AAMP | 490 | 340.9 | 109.1 | 198.4 | 210.5 | 212 |
| AANAT | 5.6 | 1.4 | 3.7 | 3.1 | 1.6 | 4.9 |
| AARS | 1770 | 793.6 | 226.5 | 1008.9 | 713.3 | 1253.7 |
| AASDHPPT | 940.1 | 570.5 | 167.2 | 268.6 | 683 | 263.5 |
| AASS | 10.9 | 1.9 | 1.5 | 4.1 | 19.7 | 25.5 |
| AATF | 543.4 | 520.1 | 114.5 | 305.7 | 354.2 | 384.9 |
| AATK | 124.5 | 74.5 | 17 | 25.6 | 64.6 | 13.6 |
| … | … | … | … | … | … | … |
| ZP2 | 4.1 | 1.4 | 0.8 | 1.4 | 1.4 | 3 |
| ZPBP | 23.4 | 13.7 | 7 | 7.8 | 22.3 | 26.9 |
| ZW10 | 517.1 | 374.5 | 72.6 | 240.8 | 345.7 | 333.1 |
| ZWINT | 1245.4 | 983.4 | 495.3 | 597.4 | 1074.3 | 620.7 |
| ZYX | 721.6 | 554.9 | 135.5 | 631.5 | 330.9 | 706.8 |
| ZZEF1 | 90.5 | 49.3 | 18.6 | 66.7 | 10.4 | 52.2 |
| ZZZ3 | 457.3 | 317.1 | 93 | 243.2 | 657.5 | 443 |

#### **Table 1.**

*Input data of the FunFEM algorithm.*


| Gene symbol | Cluster number |
|---|---|
| AADAC | 2 |
| AAK1 | 3 |
| AAMP | 3 |
| AANAT | 1 |
| AARS | 4 |
| AASDHPPT | 3 |
| AASS | 1 |
| AATF | 3 |
| AATK | 2 |
| … | … |
| ZP2 | 1 |
| ZPBP | 1 |
| ZW10 | 3 |
| ZWINT | 4 |
| ZYX | 4 |
| ZZEF1 | 2 |
| ZZZ3 | 3 |

#### **Table 2.**

*Output data of the FunFEM algorithm.*

#### **3. Clustering approaches for gene expression data clustering**

The approach to be taken depends on the application area and the characteristics of the data. For this reason, as a case study, the clustering of gene expression data, which is a special area of clustering of time-series data, will be examined in this section. Microarray is the technology which measures the expression levels of large numbers of genes simultaneously. DNA microarray technology overcomes traditional approaches in the identification of gene copies in a genome, in the identification of nucleotide polymorphisms and mutations, and in the discovery and development of new drugs. It is used as a diagnostic tool for diseases. DNA microarrays are widely used to classify gene expression changes in cancer cells.

The gene expression time series (gene profile) is a set of data generated by measuring expression levels at different cases/times in a single sample. Gene expression time series have two main characteristics: they are short and unevenly sampled. In the Stanford Microarray database, more than 80% of the time-series experiments contain fewer than 9 time points [63], and fewer than 50 observations are considered quite short for statistical analysis. These characteristics separate gene expression time-series data from other time-series data (business, finance, etc.). In addition, three basic similarity requirements can be identified for the gene expression time series: scaling and shifting, unevenly distributed sampling points, and shape (internal structure) [64]. Scaling and shifting problems arise for two reasons: (i) the expression of genes with a common sequence is similar, but the genes need not have the same level of expression at the same time; and (ii) variability introduced by the microarray technology itself, which is often corrected by normalization.


The scaling and shifting factor in the expression level may hide similar expressions and should not be taken into account when measuring the similarity between the two expression profiles. Sampling interval length is informative and cannot be ignored in similarity comparisons. In microarray experiments, the density change characterizes the shape of the expression profile rather than the density of the gene expression. The internal structure can be represented by deterministic function, symbols describing the series, or statistical models.

There are many popular clustering techniques for gene expression data. The common goal of all is to explain the different functional roles of the genes that play a key biological process. Genes expressed in a similar way may have a similar functional role in the process [65].

In addition to all these approaches, it is possible to examine the clustering of gene expression data in three different classes: gene-based clustering, sample-based clustering, and subspace clustering (**Figure 2**) [66]. In gene-based clustering, genes are treated as objects and instances (time points/patients) as features. Sample-based clustering is exactly the opposite: samples are treated as objects and genes as features. The distinction between these two clustering approaches is based on the basic characterization of the clustering process used for gene expression data. Some clustering algorithms, such as K-means and the hierarchical approach, can be used to cluster both genes and samples. In molecular biology, "any function in the cell is carried out with the participation of a small subset of genes, and the cellular function only occurs on a small sample subset." With this idea, genes and samples are handled symmetrically in subspace clustering: a gene or a sample can be either an object or a feature.

In **gene-based clustering**, the aim is to group the co-expressed genes together. However, due to the complex nature of microarray experiments, gene expression data often contain high amounts of noise, the characterizing features are often linked to each other (clusters often have a high intersection ratio), and there are problems arising from constraints of the biological domain.

**Figure 2.** *Gene expression data clustering approaches.*


Also, for biologists who will use microarray data, the relationships between genes within a cluster, or between related clusters, are often of greater interest than the gene clusters themselves. That is, it is also important for the algorithm to produce graphical presentations, not just clusters. K-means, self-organizing maps (SOM), hierarchical clustering, the graph-theoretic approach, model-based clustering, and the density-based approach (DHC) are examples of gene-based clustering algorithms.



The goal of the **sample-based approach** is to find the phenotype structure or the sub-structure of the sample. The phenotypes of the samples studied [67] can only be distinguished by small gene subsets whose expression levels are highly correlated with cluster discrimination. These genes are called informative genes. Other genes in the expression matrix have no role in the decomposition of the samples and are considered noise in the database. Traditional clustering algorithms, such as K-means, SOM, and hierarchical clustering, can be applied directly to clustering samples taking all genes as features. The ratio of the informative genes to the non-related genes (noise ratio) is usually 1:10, which hinders the reliability of the clustering algorithm, so methods are needed to identify the informative genes. Selection of the informative genes is examined in two different categories: supervised and unsupervised. The supervised approach is used in cases where phenotype information such as "patient" and "healthy" is available; in this case, a classifier containing only the informative genes is constructed using this information. The supervised approach is often used by biologists to identify informative genes. In the unsupervised approach, no label specifying the phenotype of the samples is provided. The lack of labeling, and therefore the fact that the informative genes do not guide clustering, makes the unsupervised approach more complicated. There are two problems that need to be addressed in the unsupervised approach: (i) the high number of genes versus the limited number of samples and (ii) the fact that the vast majority of collected genes are irrelevant. Two strategies can be mentioned for these problems in the unsupervised approach: unsupervised gene selection and associated clustering. In unsupervised gene selection, gene selection and sample clustering are treated as two separate processes: first the gene dimension is reduced, and then classical clustering algorithms are applied. Since there is no training set, the choice of genes is based solely on statistical models that analyze the variance of the gene expression data. Associated clustering dynamically combines repetitive clustering and gene selection processes by using the relationship between genes and samples. After many repetitions, the sample fragments converge to the real sample structure, and the selected genes are likely candidates for the informative gene cluster.

When **subspace clustering** is applied to gene expression vectors, it is treated as a "block" consisting of clusters of genes and subclasses of experimental conditions. The expression pattern of the genes in the same block is consistent under the condition in that block. Different greedy heuristic approaches have been adapted to approximate optimal solution.

Subspace clustering was first described by Agrawal et al. in 1998 for general data mining [68]. In subspace clustering, two subspace sets may share the same objects and properties, while some objects may not belong to any subspace set. Subspace clustering methods usually define a model to determine the target block and then search the gene-sample space. Some examples of subspace clustering methods proposed for gene expression are biclustering [69], coupled two-way clustering (CTWC) [70], and the plaid model [71].

According to different clustering criteria, the data can be clustered into co-expressed gene groups, samples belonging to the same phenotype, or genes from the same biological process. However, even if the same criteria are used in different clustering algorithms, the data can be clustered in different forms. For this reason, it is necessary to select the algorithm most suitable for the data distribution.

#### **4. Conclusions**

Also, among biologists who will use microarray data, the relationship between genes or clusters that are usually related to each other within the cluster, rather than the clusters of genes, is a more favorite subject. That is, it is also important for the

organizing maps (SOM), hierarchical clustering, graph-theoretic approach, modelbased clustering, and density-based approach (DHC) are the examples of gene-

The goal of the **sample-based approach** is to find the phenotype structure or the sub-structure of the sample. The phenotypes of the samples studied [67] can only be distinguished by small gene subsets whose expression levels are highly correlated with cluster discrimination. These genes are called informative genes. Other genes in the expression matrix have no role in the decomposition of the samples and are considered noise in the database. Traditional clustering algorithms, such as Kmeans, SOM, and hierarchical clustering, can be applied directly to clustering samples taking all genes as features. The ratio of the promoter genes to the nonrelated genes (noise ratio) is usually 1:10. This also hinders the reliability of the clustering algorithm. These methods are used to identify the informative genes. Selection of the informative genes is examined in two different categories as supervised and unsupervised. The supervised approach is used in cases where phenotype information such as "patient" and "healthy" is added. In this example, the classifier containing only the informative genes is constructed using this information. The supervised approach is often used by biologists to identify informative genes. In the unsupervised approach, no label specifying the phenotype of the samples is placed. The lack of labeling and therefore the fact that the informative genes do not guide clustering makes the unsupervised approach more complicated. There are two problems that need to be addressed in the unsupervised approach: (i) the high number of genes versus the limited number of samples and (ii) the vast majority of collected genes are irrelevant. Two strategies can be mentioned for these problems in the unsupervised approach: unsupervised gene selection and clustering. In unsupervised gene selection, gene selection and sample clustering are treated as two separate processes. First, the gene size is reduced, and then classical clustering algorithms are applied. Since there is no training set, the choice of gene is based solely on statistical models that analyze the variance of gene expression data. Associated clustering dynamically supports the combination of repetitive clustering and gene selection processes by the use of the relationship between genes and samples. After many repetitions, the sample fragments converge to the real sample structure and the selected genes are likely candidates for the informative gene cluster.

When **subspace clustering** is applied to gene expression data, a cluster is treated as a "block" consisting of a subset of genes and a subset of experimental conditions. The expression pattern of the genes in the same block is consistent under the conditions in that block. Different greedy heuristic approaches have been adapted to approximate the optimal solution.

Subspace clustering was first described by Agrawal et al. in 1998 for general data mining [68]. In subspace clustering, two subspace sets may share the same objects and properties, while some objects may not belong to any subspace set. Subspace clustering methods usually define a model to determine the target block and then search the gene-sample space. Some examples of subspace clustering methods proposed for gene expression data are biclustering [69], coupled two-way clustering (CTWC) [70], and the plaid model [71].
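As a rough illustration of this block idea, the sketch below uses scikit-learn's `SpectralBiclustering`. This is not the Cheng-Church algorithm cited as [69]; it is simply one readily available method that, like the approaches above, clusters rows (genes) and columns (conditions) at the same time. The matrix size and the 3 x 2 block grid are arbitrary assumptions.

```
from sklearn.datasets import make_checkerboard
from sklearn.cluster import SpectralBiclustering

# Synthetic (genes x conditions) matrix with a planted block structure.
data, rows, cols = make_checkerboard(shape=(200, 40), n_clusters=(3, 2),
                                     noise=5, random_state=0)

# Search for blocks: subsets of genes whose expression is consistent
# over subsets of conditions.
model = SpectralBiclustering(n_clusters=(3, 2), random_state=0)
model.fit(data)

print(model.row_labels_[:10])   # block assignment of the first ten genes
print(model.column_labels_)     # block assignment of each condition
```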

According to different clustering criteria, the data can be clustered into, for example, co-expressed gene groups, samples belonging to the same phenotype, or genes from the same biological process. However, even if the same criteria are used in different clustering algorithms, the data can be clustered in different forms. For this reason, it is necessary to select the algorithm best suited to the data distribution.

#### **4. Conclusions**

Clustering is used as an effective method for analyzing time-series data in many areas, from social media usage and financial data to bioinformatics. Various methods have been introduced for time-series data, and the choice of approach is specific to the application, driven by needs such as time, speed, reliability, and storage. When determining the approach to clustering, three basic issues need to be decided: the data representation, the similarity measure, and the clustering algorithm.

Data representation involves transforming the multi-dimensional and noisy structure of the time-series data into a lower-dimensional form that best expresses the whole data. The most commonly used methods for this purpose are dimension reduction and feature extraction.
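As a small, hedged example of such a representation, the sketch below implements a piecewise aggregate (segment-mean) reduction in plain NumPy; the series, its length, and the number of segments are illustrative, and this is only one of many possible representations.

```
import numpy as np

def segment_means(series, n_segments):
    """Reduce a series to n_segments values by averaging equal-length
    segments (a piecewise-aggregate style representation)."""
    series = np.asarray(series, dtype=float)
    if series.size % n_segments != 0:
        raise ValueError("series length must be divisible by n_segments")
    return series.reshape(n_segments, -1).mean(axis=1)

# A noisy sine of length 128 reduced to an 8-value representation.
rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 128)
series = np.sin(t) + 0.1 * rng.normal(size=t.size)
print(segment_means(series, 8))
```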

It is challenging to measure the similarity of two time series. This chapter has examined similarity measures in three groups: similarity in shape, similarity in time, and similarity in change.
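The difference between similarity in time and similarity in shape can be illustrated with the short sketch below; the unoptimized dynamic time warping implementation and the phase-shifted sine waves are assumptions made purely for illustration.

```
import numpy as np

def euclidean(a, b):
    """Lock-step distance: compares values at identical time indices."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

def dtw(a, b):
    """Dynamic time warping: allows elastic alignment in time, so series of
    similar shape that are shifted or stretched in time remain close."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

# The same wave shifted in phase: the lock-step (time) distance grows,
# while the warped (shape) distance stays comparatively small.
t = np.linspace(0, 2 * np.pi, 100)
x, y = np.sin(t), np.sin(t + 0.5)
print("Euclidean:", euclidean(x, y), "DTW:", dtw(x, y))
```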

For time-series clustering algorithms, it is not wrong to say that they are an evolution of conventional clustering algorithms. Therefore, the classification of traditional clustering algorithms (developed for static data) has been included; they are classified as partitioning, hierarchical, model-based, grid-based, and density-based. Partitioning algorithms initially require prototypes, and their accuracy depends on the defined prototype and the update method; however, they are successful in finding similar series and in clustering time series of equal length. The fact that the number of clusters does not have to be given as an initial parameter is a prominent and well-known feature of hierarchical algorithms. At the same time, their ability to work on time series of unequal length puts them one step ahead of the other algorithms. However, hierarchical algorithms are not suitable for large data sets because of their computational complexity and scalability problems. Model-based algorithms suffer from problems such as parameter initialization based on user guesses and slow processing times on large databases. Density-based algorithms are generally not preferred for time-series data because of their high computational complexity. Each approach has pros and cons compared to the others, and the choice of algorithm for time-series clustering depends entirely on the characteristics of the data and the needs of the application. Therefore, in the last section, a study on the clustering of gene expression data, which is a specific field of application, has been discussed.
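The remark about hierarchical methods and unequal-length series can also be sketched in code. The example below is only an illustration: it reuses the same dynamic time warping idea as the previous sketch to build a pairwise distance matrix over five toy series of different lengths and then applies average-linkage hierarchical clustering from SciPy.

```
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def dtw(a, b):
    """Dynamic time warping distance; works for series of unequal length."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

# Toy data: sine-like and cosine-like series with different lengths.
series = [np.sin(np.linspace(0, 2 * np.pi, n)) for n in (40, 55, 70)]
series += [np.cos(np.linspace(0, 2 * np.pi, n)) for n in (45, 60)]

# Pairwise DTW distances, then average-linkage hierarchical clustering.
k = len(series)
dist = np.zeros((k, k))
for i in range(k):
    for j in range(i + 1, k):
        dist[i, j] = dist[j, i] = dtw(series[i], series[j])

Z = linkage(squareform(dist), method="average")
print(fcluster(Z, t=2, criterion="maxclust"))   # cluster label for each series
```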

In time-series data clustering, there is a need for algorithms that run fast, accurately, and with little memory on large data sets, so that they can meet today's needs.



### **Author details**

Esma Ergüner Özkoç, Başkent University, Ankara, Turkey

\*Address all correspondence to: eeozkoc@baskent.edu.tr

© 2020 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/ by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

### **References**

[1] Ratanamahatana C. Multimedia retrieval using time series representation and relevance feedback. In: Proceedings of 8th International Conference on Asian Digital Libraries (ICADL2005); 2005. pp. 400-405

[2] Özkoç EE, Oğul H. Content-based search on time-series microarray databases using clustering-based fingerprints. Current Bioinformatics. 2017;**12**(5):398-405. ISSN: 2212-392X

[3] Lin J, Keogh E, Lonardi S, Lankford J, Nystrom D. Visually mining and monitoring massive time series. In: Proceedings of 2004 ACM SIGKDD International Conference on Knowledge Discovery and data Mining–KDD '04; 2004. p. 460

[4] Bornemann L, Bleifuß T, Kalashnikov D, Naumann F, Srivastava D. Data change exploration using time series clustering. Datenbank-Spektrum. 2018;**18**(2):79-87

[5] Rani S, Sikka G. Recent techniques of clustering of time series data: A survey. International Journal of Computers and Applications. 2012;**52**(15):1

[6] Aghabozorgi S, Shirkhorshidi AS, Wah TY. Time-series clustering–A decade review. Information Systems. 2015;**53**:16-38

[7] Lin J, Keogh E, Lonardi S, Chiu B. A symbolic representation of time series, with implications for streaming algorithms. In: Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery; 13 June 2003; ACM; pp. 2-11

[8] Keogh EJ, Pazzani MJ. A simple dimensionality reduction technique for fast similarity search in large time series databases. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining; 18 April 2000; Springer, Berlin, Heidelberg. pp. 122-133

[9] Esling P, Agon C. Time-series data mining. ACM Computing Surveys (CSUR). 2012;**45**(1):12

[10] Keogh E, Lin J, Fu A. Hot sax: Efficiently finding the most unusual time series subsequence. In: Fifth IEEE International Conference on Data Mining (ICDM'05); 27 November 2005; IEEE. pp. 226-233

[11] Ghysels E, Santa-Clara P, Valkanov R. Predicting volatility: Getting the most out of return data sampled at different frequencies. Journal of Econometrics. 2006;**131**(1-2):59-95

[12] Kawagoe GD. Grid Representation of Time Series Data for Similarity Search. In: Data Engineering Workshop; 2006

[13] Agronomischer Zeitreihen CA. Time Series Clustering in the Field of Agronomy. Technische Universitat Darmstadt (Master-Thesis); 2013

[14] Keogh E, Lonardi S, Ratanamahatana C. Towards parameter-free data mining. In: Proceedings of Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2004, Vol. 22, No. 25. pp. 206-215

[15] Keogh E, Chakrabarti K, Pazzani M, Mehrotra S. Locally adaptive dimensionality reduction for indexing large time series databases. ACM SIGMOD Record. 2001;**27**(2):151-162

[16] Keogh E, Pazzani M. An enhanced representation of time series which allows fast and accurate classification, clustering and relevance feedback. In: Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining; 1998. pp. 239-241


[17] Korn F, Jagadish HV, Faloutsos C. Efficiently supporting ad hoc queries in large datasets of time sequences. ACM SIGMOD Record. 1997;**26**:289-300

[18] Faloutsos C, Ranganathan M, Manolopoulos Y. Fast subsequence matching in time-series databases. ACM SIGMOD Record. 1994;**23**(2):419-429

[19] Portet F, Reiter E, Gatt A, Hunter J, Sripada S, Freer Y, et al. Automatic generation of textual summaries from neonatal intensive care data. Artificial Intelligence. 2009;**173**(7):789-816

[20] Chan K, Fu AW. Efficient time series matching by wavelets. In: Proceedings of 1999 15th International Conference on Data Engineering; 1999, Vol. 15, no. 3. pp. 126-133

[21] Agrawal R, Faloutsos C, Swami A. Efficient similarity search in sequence databases. Foundations of Data Organization and Algorithms. 1993;**46**: 69-84

[22] Kawagoe K, Ueda T. A similarity search method of time series data with combination of Fourier and wavelet transforms. In: Proceedings Ninth International Symposium on Temporal Representation and Reasoning; 2002. pp. 86-92

[23] Chung FL, Fu TC, Luk R. Flexible time series pattern matching based on perceptually important points. In: Jt. Conference on Artificial Intelligence Workshop. 2001. pp. 1-7

[24] Keogh E, Pazzani M, Chakrabarti K, Mehrotra S. A simple dimensionality reduction technique for fast similarity search in large time series databases. Knowledge and Information Systems. 2000;**1805**(1):122-133


[25] Cai Y, Ng R. Indexing spatio-temporal trajectories with Chebyshev polynomials. In: Proceedings of 2004 ACM SIGMOD International; 2004. p. 599

[26] Bingham E. Random projection in dimensionality reduction: Applications to image and text data. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2001. pp. 245-250

[27] Chen Q, Chen L, Lian X, Liu Y. Indexable PLA for efficient similarity search. In: Proceedings of the 33rd International Conference on Very large Data Bases; 2007. pp. 435-446

[28] Corduas M, Piccolo D. Time-series clustering and classification by the autoregressive metric. Computational Statistics & Data Analysis. 2008;**52**(4):1860-1872

[29] Kalpakis K, Gada D, Puttagunta V. Distance measures for effective clustering of ARIMA time-series. In: Proceedings 2001 IEEE International Conference on Data Mining; 2001. pp. 273-280

[30] Kumar N, Lolla N, Keogh E, Lonardi S. Time-series bitmaps: A practical visualization tool for working with large time series databases. In: Proceedings of the 2005 SIAM International Conference on Data Mining; 2005. pp. 531-535

[31] Minnen D, Starner T, Essa M, Isbell C. Discovering characteristic actions from on body sensor data. In: Proceedings of 10th IEEE International Symposium on Wearable Computers; 2006. pp. 11-18

[32] Minnen D, Isbell CL, Essa I, Starner T. Discovering multivariate motifs using subsequence density estimation and greedy mixture learning. In: Proceedings of the National Conference on Artificial Intelligence; 2007, Vol. 22, No. 1. p. 615


[33] Panuccio A, Bicego M, Murino V. A hidden Markov model-based approach to sequential data clustering. In: Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR). Berlin, Heidelberg: Springer; 2002, pp. 734-743

[34] Bagnall A, Ratanamahatana C, Keogh E, Lonardi S, Janacek G. A bit level representation for time series data mining with shape based similarity. Data Mining and Knowledge Discovery. 2006;**13**(1):11-40

[35] Ratanamahatana C, Keogh E, Bagnall AJ, Lonardi S. A novel bit level time series representation with implications for similarity search and clustering. In: Proceedings of 9th Pacific-Asian International Conference on Knowledge Discovery and Data Mining (PAKDD'05); 2005. pp. 771-777

[36] Bagnall AJ, Janacek G. Clustering time series with clipped data. Machine Learning. 2005;**58**(2):151-178

[37] Sakoe H, Chiba S. A dynamic programming approach to continuous speech recognition. In: Proceedings of the Seventh International Congress on Acoustics; 1971, Vol. 3. pp. 65-69

[38] Sakoe H, Chiba S. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing. 1978;**26**(1):43-49

[39] Smyth P. Clustering sequences with hidden Markov models. Advances in Neural Information Processing Systems. 1997;**9**:648-654

[40] Xiong Y, Yeung DY. Mixtures of ARMA models for model-based time series clustering. In: Data Mining, 2002. ICDM 2003; 2002. pp. 717-720

[41] Han J, Kamber M. Data Mining: Concepts and Techniques. San Francisco: Morgan Kaufmann; 2001. pp. 346-389

[42] Liao TW. Clustering of time series data—a survey. Pattern Recognition. 2005;**38**(11):1857-1874

[43] MacQueen J. Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability; 21 June 1967, Vol. 1, No. 14. pp. 281-297

[44] Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, et al. Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proceedings of the National Academy of Sciences. 1999;**96**(6):2907-2912

[45] Ng RT, Han J. Efficient and effective clustering methods for spatial data mining. In: Proceedings of the International Conference on Very Large Data Bases; 1994. pp. 144-144

[46] Kaufman L, Rousseeuw PJ, Corporation E. Finding Groups in Data: An Introduction to Cluster Analysis, Vol. 39. Hoboken, New Jersey: Wiley Online Library; 1990

[47] Guha S, Rastogi R, Shim K. CURE: An efficient clustering algorithm for large databases. ACM SIGMOD Record. 1998;**27**(2):73-84

[48] Zhang T, Ramakrishnan R, Livny M. BIRCH: An efficient data clustering method for very large databases. ACM SIGMOD Record. 1996;**25**(2): 103-114

[49] Karypis G, Han EH, Kumar V. Chameleon: Hierarchical clustering using dynamic modeling. Computer. 1999;**32**(8):68-75

[50] Beal M, Krishnamurthy P. Gene expression time course clustering with countably infinite hidden Markov models. arXiv preprint arXiv:1206.6824; 2012

[51] Ester M, Kriegel HP, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial data bases with noise. In: Knowledge Discovery and Data Mining. Vol. 96, No. 34; August 1996. pp. 226-231

[52] Ankerst M, Breunig M, Kriegel H. OPTICS: Ordering points to identify the clustering structure. ACM SIGMOD Record. 1999;**28**(2):40-60

[53] Fisher DH. Knowledge acquisition via incremental conceptual clustering. Machine Learning. 1987;**2**(2):139-172

[54] Carpenter GA, Grossberg S. A massively parallel architecture for a selforganizing neural pattern recognition machine. Computer Vision Graphics Image Process. 1987;**37**(1):54-115

[55] Bouveyron C, Côme E, Jacques J. The discriminative functional mixture model for the analysis of bike sharing systems. The Annals of Applied Statistics. 2015;**9**(4):1726-1760

[56] Wang W, Yang J, Muntz R. STING: A statistical information grid approach to spatial data mining. In: Proceedings of the International Conference on Very Large Data Bases; 1997. pp. 186-195

[57] Bouveyron C, Brunet C. Simultaneous model-based clustering and visualization in the fisher discriminative subspace. Statistics and Computing. 2012;**22**:301-324

[58] Akaike H. A new look at the statistical model identification. IEEE Transactions on Automatic Control. 1974;**19**:716-723


[59] Kass RE, Raftery AE. Bayes factors. Journal of the American Statistical Association. 1995;**90**(430):773-795

[60] Biernacki C, Celeux G, Govaert G. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2000;**22**:719-725

[61] Bouveyron C. funFEM: Clustering in the Discriminative Functional Subspace. R package version. 2015;1

[62] Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, et al. NCBI GEO: Archive for highthroughput functional genomic data. Nucleic Acids Research. 2009;**37** (Database):D885-D890

[63] Kuenzel L. Gene clustering methods for time series microarray data. Biochemistry. 2010;**218**

[64] Moller-Levet CS, Cho KH, Yin H, Wolkenhauer O. Clustering of gene expression time-series data. Technical report. Department of Computer Science, University of Rostock, Germany; 2003

[65] Beal M, Krishneamurthy P. Gene expression time course clustering with countably infinite hidden Markov models. arXiv preprint arXiv:1206.6824; 2012

[66] Jiang D, Tang C, Zhang A. Cluster analysis for gene expression data: A survey. IEEE Transactions on Knowledge and Data Engineering. 2004; **16**(11):1370-1386

[67] Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, et al. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science. 1999;**286**(5439):531-537


[68] Agrawal R, Gehrke J, Gunopulos D, Raghavan P. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. ACM; 1998; **27**(2):94-105

[69] Cheng Y, Church GM. Biclustering of expression data. In: ISMB; 2000, Vol. 8, No. 2000. pp. 93-103

[70] Getz G, Levine E, Domany E. Coupled two-way clustering analysis of gene microarray data. Proceedings of the National Academy of Sciences. 2000;**97**(22):12079-12084

[71] Lazzeroni L, Owen A. Plaid models for gene expression data. Statistica Sinica. 2002;**1**:61-86


#### **Chapter 7**

**Weather Nowcasting Using Deep Learning Techniques**

*Makhamisa Senekane, Mhlambululi Mafu and Molibeli Benedict Taele*

**Abstract**

Weather variations play a significant role in people's short-term, medium-term or long-term planning. Therefore, understanding weather patterns has become very important in decision making. Short-term weather forecasting (nowcasting) involves the prediction of weather over a short period of time, typically a few hours. Different techniques have been proposed for short-term weather forecasting. Traditional techniques used for nowcasting are highly parametric, and hence complex. Recently, there has been a shift towards the use of artificial intelligence techniques for weather nowcasting, including machine learning techniques such as artificial neural networks. In this chapter, we report the use of deep learning techniques for weather nowcasting. Deep learning techniques were tested on meteorological data. Three deep learning techniques, namely multilayer perceptrons, Elman recurrent neural networks and Jordan recurrent neural networks, were used in this work. Multilayer perceptron models achieved 91 and 75% accuracies for sunshine forecasting and precipitation forecasting respectively, Elman recurrent neural network models achieved accuracies of 96 and 97% for sunshine and precipitation forecasting respectively, while Jordan recurrent neural network models achieved accuracies of 97 and 97% for sunshine and precipitation nowcasting respectively. The results obtained underline the utility of using deep learning for weather nowcasting.

**Keywords:** nowcasting, deep learning, artificial neural network, Elman network, Jordan network, precipitation, rainfall

#### **1. Introduction**

Weather changes play a significant role in people's short-term, medium-term or long-term planning. Therefore, the understanding of weather patterns has become very important in decision making. This further raises the need for tools for accurate prediction of weather. This need is even more pronounced if the prediction is intended for short-term weather forecasting, conventionally known as nowcasting.

To date, different weather nowcasting models have been proposed [1]. These models are mainly based on different variants of artificial neural networks and fuzzy logic. As will be discussed later in this chapter, these techniques have some limitations which need to be addressed. In this chapter, we report the use of multilayer perceptron (MLP) neural networks, Elman neural networks (ENN) and Jordan
