**Similarity Measures and Dimensionality Reduction Techniques for Time Series Data Mining**

Carmelo Cassisi, Placido Montalto, Marco Aliotta, Andrea Cannata and Alfredo Pulvirenti

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/49941

## **1. Introduction**


A time series is *"a sequence X =* (*x*1*, x*2*, …, xm*) *of observed data over time*", where *m* is the number of observations. Tracking the behavior of a specific phenomenon/data in time can produce important information. A large variety of real world applications, such as meteorology, geophysics and astrophysics, collect observations that can be represented as time series.

A collection of time series can be defined as a Time Series Database (*TSDB*). Given a *TSDB*, most time series mining efforts address the similarity matching problem. Time series data mining can also be exploited by research areas dealing with signals, such as image processing. For example, image data can be converted to time series: from image color histograms (Fig. 2), where image matching can be applied, to object perimeters for the characterization of shapes [39].

Time series are essentially high-dimensional data [17]. Mining high-dimensional data involves addressing a range of challenges, among them: i) the curse of dimensionality [1], and ii) the meaningfulness of the similarity measure in the high-dimensional space. An important task to enhance performance on time series is the reduction of their dimensionality, which must preserve the main characteristics and reflect the original (dis)similarity of the data (this effect will be referred to as *lower bounding* [11]). When treating time series, the similarity between two sequences of the same length can be calculated by summing the ordered point-to-point distances between them (Fig. 3). In this sense, the most used distance function is the *Euclidean Distance* [13], corresponding to the second degree of the general *Lp-norm* [41]. This distance measure is cataloged as a metric distance function, since it obeys the three fundamental metric properties: *non-negativity*, *symmetry* and *triangle inequality* [29].

© 2012 Cassisi et al., licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.



**Figure 1.** Examples of time series data relative to a) monsoon, b) sunspots, c) ECG (ElectroCardioGram), d) seismic signal.

In most cases, a metric function is desired, because the triangle inequality can then be used to prune the index during search, allowing speed-up of exact matching [28]. Nevertheless, the Euclidean distance and its variants present several drawbacks that make their use inappropriate in certain applications. For these reasons, other distance measures were proposed to make the similarity computation more robust. In this sense, it is worth citing the well known *Dynamic Time Warping* (*DTW*) [22], which makes distance comparisons less sensitive to signal transformations such as shifting, uniform amplitude scaling or uniform time scaling. The literature offers other distance measures that overcome signal transformation problems, such as the *Landmarks similarity*, which does not follow traditional similarity models relying on point-wise Euclidean distance [31] but, in agreement with human intuition and episodic memory, relies on the similarity of those points (times, events) of "greatest importance" (for example local maxima, local minima, inflection points).

In conjunction with this branch of research, a wide range of techniques for dimensionality reduction has been proposed. Among them, we will treat only representations that have the desirable property of allowing *lower bounding*. By this property, after establishing a true distance measure for the raw data (in this case the Euclidean distance), the distance between two time series in the reduced space is always less than or equal to the true distance. Such a property ensures exact indexing of data (i.e. with no false negatives [13]). The following representations describe the state of the art in this field: spectral decomposition through the *Discrete Fourier Transform* (*DFT*) [1]; *Discrete Wavelet Transform* (*DWT*) [7]; *Singular Value Decomposition* (*SVD*) [24]; *Piecewise Aggregate Approximation* (*PAA*) [19]; *Adaptive Piecewise Constant Approximation* (*APCA*) [6]; *Piecewise Linear Approximation* (*PLA*) [20]; and *Chebyshev Polynomials* (*CHEB*) [29]. Many researchers have also introduced symbolic representations of time series, which transform time series measurements into a collection of discretized symbols; among them we cite the *Symbolic Aggregate approXimation* (*SAX*) [26], based on *PAA*, and its evolved multi-resolution version *iSAX 2.0* [35]. Symbolic representations can take advantage of efforts conducted by the text-processing and bioinformatics communities, who made available several data structures and algorithms for efficient pattern discovery on symbolic encodings [2],[25],[36].

The chapter is organized as follows. Section 2 will introduce the similarity matching problem on time series. We will note the importance of using efficient data structures to perform search, and of choosing an adequate distance measure. Section 3 will present some of the most used distance measures for time series data mining. Section 4 will review the above mentioned dimensionality reduction techniques.

**Figure 2.** An example of conversion from the RGB (Red, Green, Blue) image color histograms to a time series.

## **2. Similarity matching problem**


A *TSDB* with *m* objects, each of length *n*, is denoted by *TSDB* = {*x*1, *x*2, . . . , *xm*}, where *xi* = (*xi*(1), *xi*(2), . . . , *xi*(*n*)) is a vector denoting the *i*th time series and *xi*(*j*) denotes the *j*th value of *xi* with respect to time. *n* indicates the *dimensionality* of the data set.

The general representation of a *TSDB* is an *m* × *n* matrix:

$$A = \begin{pmatrix} \mathbf{x}\_1^{(1)} & \mathbf{x}\_1^{(2)} & \cdots & \mathbf{x}\_1^{(j)} & \cdots & \mathbf{x}\_1^{(n)} \\ \mathbf{x}\_2^{(1)} & \mathbf{x}\_2^{(2)} & \cdots & \mathbf{x}\_2^{(j)} & \cdots & \mathbf{x}\_2^{(n)} \\ \vdots & \vdots & \cdots & \vdots & \cdots & \vdots \\ \mathbf{x}\_i^{(1)} & \mathbf{x}\_i^{(2)} & \cdots & \mathbf{x}\_i^{(j)} & \cdots & \mathbf{x}\_i^{(n)} \\ \vdots & \vdots & \cdots & \vdots & \cdots & \vdots \\ \mathbf{x}\_m^{(1)} & \mathbf{x}\_m^{(2)} & \cdots & \mathbf{x}\_m^{(j)} & \cdots & \mathbf{x}\_m^{(n)} \end{pmatrix} = \begin{pmatrix} \mathbf{x}\_1 \\ \mathbf{x}\_2 \\ \vdots \\ \mathbf{x}\_m \end{pmatrix} \tag{1}$$
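
In practice, such a matrix maps naturally onto a two-dimensional array. The following minimal Python/NumPy sketch (with hypothetical values and names) illustrates this representation:

```python
import numpy as np

# A TSDB as an m-by-n matrix: one row per time series, one column per time step.
m, n = 4, 100                      # m time series, each of dimensionality n
rng = np.random.default_rng(0)
A = rng.standard_normal((m, n))    # hypothetical data; A[i, j] holds x_{i+1}^{(j+1)}

x2 = A[1]                          # the 2nd time series (a row vector of length n)
x2_10 = A[1, 9]                    # its 10th value
```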

In many cases, datasets are supported by special data structures, especially when the dataset gets larger, that are referred to as *indexing* structures. Indexing consists of building a data structure *I* that enables efficient searching within the database [29]. Usually, it is designed to face two principal similarity queries: the (i) *k-nearest neighbors* (*knn*) and the (ii) *range query* problem. Given a time series *Q* in *TSDB*, and a similarity/dissimilarity measure *d*(*T*,*S*) defined for each pair *T*, *S* in *TSDB*, the former query deals with the search of the set of the first *k* time series in *TSDB* most similar to *Q*. The latter query finds the set *R* of time series that are within distance *r* from *Q*. Given an indexing structure *I*, there are two ways to post a similarity query in time series databases [29]:

- *whole matching*: given a *TSDB* of time series, each of length *n*, whole matching relates to the computation of similarity matching among time series along their whole length.
- *subsequence matching*: given a *TSDB* of *m* time series *S*1, *S*2, …, *Sm*, each of length *ni*, and a short query time series *Q* of length *nq* < *ni*, with 0 < *i* < *m*, subsequence matching relates to finding matches of *Q* into subsequences of every *Si*, starting at every position.


Indexing is crucial for achieving efficiency on data mining tasks, such as *clustering* or *classification*, especially for huge databases such as *TSDB*s. Clustering is the unsupervised division of data into groups (clusters) of similar objects under some similarity or dissimilarity measure. In the time series domain, a problem similar to clustering is the *motif discovery* problem [28], consisting of searching for the main cluster (or motif) in a *TSDB*; the search for clusters is unsupervised. Classification assigns unlabeled time series to predefined classes after a supervised learning process. Both tasks make massive use of distance computations.

Distance measures play an important role in similarity problems for data mining tasks. Concerning a distance measure, it is important to understand whether it can be considered *metric*. A metric function on a *TSDB* is a function *f* : *TSDB* × *TSDB* → R (where R is the set of real numbers). For all *x*, *y*, *z* in *TSDB*, such a function obeys four fundamental properties:

$$f(\mathbf{x}, \mathbf{y}) \ge 0 \quad \text{(non-negativity)}\tag{2}$$

$$f(\mathbf{x}, \mathbf{y}) = 0 \quad \text{if and only if} \quad \mathbf{x} = \mathbf{y} \quad \text{(identity)}\tag{3}$$

$$f(\mathbf{x}, \mathbf{y}) = f(\mathbf{y}, \mathbf{x}) \quad (symmetry) \tag{4}$$

$$f(\mathbf{x}, \mathbf{z}) \le f(\mathbf{x}, \mathbf{y}) + f(\mathbf{y}, \mathbf{z}) \quad \text{(triangle inequality)}\tag{5}$$

If any of these is not obeyed, then the distance is non-metric. Using a metric function is desirable, because the triangle inequality property (Eq. 5) can be used to index the space for speeding up search. A well known framework for indexing time series is *GEMINI* (*GEneric Multimedia INdexIng*) [13], which designs fast search algorithms for locating time series that match, in an exact or approximate way, a query time series *Q*.

It was introduced to accommodate any dimensionality reduction method for time series, thus allowing indexing on the new representation [29]. *GEMINI* guarantees no false negatives on index search if two conditions are satisfied: (i) for the raw time series, a metric distance measure must be established; (ii) the reduced representation must guarantee the lower bounding property.
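
To make the role of the triangle inequality concrete, here is a minimal sketch (not *GEMINI* itself) of how a metric distance allows pruning during a nearest-neighbour scan; the function names and the single-pivot scheme are illustrative assumptions:

```python
def nn_search_with_pruning(query, db, pivot, d):
    """Nearest-neighbour scan over db, where d is assumed to be metric.
    By the triangle inequality, |d(q, p) - d(p, x)| <= d(q, x), so a
    precomputed d(p, x) can rule x out without computing d(q, x)."""
    d_qp = d(query, pivot)
    best, best_dist = None, float("inf")
    for x in db:
        d_px = d(pivot, x)                 # in a real index, precomputed offline
        if abs(d_qp - d_px) >= best_dist:  # lower bound already too large: prune
            continue
        dist = d(query, x)                 # full distance only when needed
        if dist < best_dist:
            best, best_dist = x, dist
    return best, best_dist
```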

## **3. Similarity measures**


A common data mining task is the estimation of similarity among objects. A similarity measure is a relation between a pair of objects and a scalar number. Common intervals used to map the similarity are [-1, 1] or [0, 1], where 1 indicates maximum similarity.

Considering the similarity between two numbers *x* and *y* as:

$$numSim(x, y) = 1 - \frac{\left| x - y \right|}{\left| x \right| + \left| y \right|} \tag{6}$$

Let *X* = (*x*1, …, *xn*) and *Y* = (*y*1, …, *yn*) be two time series. Some similarity measures are:


$$tsim(X, Y) = \frac{1}{n} \sum\_{i=1}^{n} numSim(x\_i, y\_i) \tag{7}$$


$$rtsim(X, Y) = \sqrt{\frac{1}{n} \sum\_{i=1}^{n} numSim(x\_i, y\_i)^2} \tag{8}$$


$$psim(X, Y) = \frac{1}{n} \sum\_{i=1}^{n} \left[ 1 - \frac{\left|x\_i - y\_i\right|}{2\max\left( \left|x\_i\right|, \left|y\_i\right| \right)} \right] \tag{9}$$
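
A direct Python/NumPy transcription of Eqs. 6–9 might look as follows (a sketch; it assumes the denominators are non-zero, i.e. no pair of compared values is 0, 0):

```python
import numpy as np

def num_sim(x, y):
    """Similarity between two numbers (Eq. 6)."""
    return 1.0 - abs(x - y) / (abs(x) + abs(y))

def tsim(X, Y):
    """Mean of the point-wise numSim values (Eq. 7)."""
    return np.mean([num_sim(x, y) for x, y in zip(X, Y)])

def rtsim(X, Y):
    """Root mean square of the point-wise numSim values (Eq. 8)."""
    return np.sqrt(np.mean([num_sim(x, y) ** 2 for x, y in zip(X, Y)]))

def psim(X, Y):
    """Mean of 1 - |x - y| / (2 max(|x|, |y|)) (Eq. 9)."""
    return np.mean([1.0 - abs(x - y) / (2.0 * max(abs(x), abs(y)))
                    for x, y in zip(X, Y)])
```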

Other common similarity functions used to perform complete or partial matching between time series are the *cross-correlation* function (or *Pearson's correlation* function) [38] and the cosine of the angle between the two vectors. The cross-correlation between two time series *X* and *Y* of length *n*, allowing a shifted comparison of *l* positions, is defined as:

$$r\_{XY} = \frac{\sum\_{i=1}^{n} \left(\mathbf{x}\_i - \overline{\mathbf{X}}\right) \left(y\_{i-l} - \overline{Y}\right)}{\sqrt{\sum\_{i=1}^{n} \left(\mathbf{x}\_i - \overline{\mathbf{X}}\right)^2} \sqrt{\sum\_{i=1}^{n} \left(y\_{i-l} - \overline{Y}\right)^2}} \tag{10}$$

where $\overline{X}$ and $\overline{Y}$ are the means of *X* and *Y*. The correlation *rXY* provides the degree of linear dependence between the two vectors *X* and *Y*, from perfect linear relationship (*rXY* = 1) to perfect negative linear relation (*rXY* = -1).

Another way to evaluate similarity between two vectors is the estimation of cosine of the angle between the two vectors *X* and *Y*, defined as:

$$\cos(\theta) = \frac{X \cdot Y}{\|X\| \|Y\|} = \frac{\sum\_{i=1}^{n} x\_i y\_i}{\sqrt{\sum\_{i=1}^{n} x\_i^2} \sqrt{\sum\_{i=1}^{n} y\_i^2}} \tag{11}$$


This measure provides values in the range [-1, 1]. The lower boundary indicates that the *X* and *Y* vectors are exactly opposite, the upper boundary indicates that the vectors are exactly the same, and the value 0 indicates independence.
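
Both functions are straightforward to compute with NumPy. In the sketch below, the lag *l* is handled with a circular shift for brevity, which is a simplifying assumption rather than the exact boundary treatment of Eq. 10:

```python
import numpy as np

def cross_correlation(X, Y, l=0):
    """Pearson correlation between X and Y shifted by l positions (cf. Eq. 10)."""
    x = np.asarray(X, dtype=float)
    y = np.roll(np.asarray(Y, dtype=float), l)  # circular shift: a simplification
    x, y = x - x.mean(), y - y.mean()
    return (x @ y) / np.sqrt((x @ x) * (y @ y))

def cosine_similarity(X, Y):
    """Cosine of the angle between X and Y (Eq. 11)."""
    x, y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
```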

#### **3.1. Metric distances**

Another way to compare time series involves the concept of distance measures. Let *T* and *S* be two time series of length *n*, and *Ti* and *Si* the *i*th values of *T* and *S*, respectively. This subsection presents a list of distance functions used in Euclidean space.

1. **Euclidean Distance**. The most used distance function in many applications. It is defined as (Fig. 3):

$$d(T, S) = \sqrt{\sum\_{i=1}^{n} (T\_i - S\_i)^2} \tag{12}$$

2. **Manhattan Distance**. Also called "city block distance". It is defined as:

$$d(T, \mathcal{S}) = \sum\_{i=1}^{n} \left| T\_i - \mathcal{S}\_i \right| \tag{13}$$

3. **Maximum Distance**. It is defined to be the maximum value of the distances of the attributes:

$$d(T, S) = \max\_{0 < i \le n} \left| T\_i - S\_i \right| \tag{14}$$

4. **Minkowski Distance**. The Euclidean distance, Manhattan distance, and Maximum distance are particular instances of the Minkowski distance, also called Lp-norm. It is defined as:

$$d(T, S) = \sqrt[p]{\sum\_{i=1}^{n} (T\_i - S\_i)^p} \tag{15}$$

where *p* is called the order of the *Minkowski* distance. In fact, for the Manhattan distance *p* = 1, for the Euclidean distance *p* = 2, while for the Maximum distance *p* = ∞.

5. **Mahalanobis Distance**. The Mahalanobis distance is defined as:

$$d(T, S) = \sqrt{(T - S)\mathcal{W}^{-1}\left(T - S\right)^{T}}\tag{16}$$

where *W* is the *covariance* matrix [12].
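
The listed distances translate compactly into Python/NumPy. This is a plain sketch of Eqs. 14–16 (for Eq. 15 the absolute value is taken so that odd orders *p* behave sensibly):

```python
import numpy as np

def minkowski(T, S, p=2):
    """Lp-norm distance (Eq. 15): p=1 Manhattan, p=2 Euclidean."""
    return np.sum(np.abs(np.asarray(T) - np.asarray(S)) ** p) ** (1.0 / p)

def maximum_distance(T, S):
    """Maximum (L-infinity) distance (Eq. 14)."""
    return np.max(np.abs(np.asarray(T) - np.asarray(S)))

def mahalanobis(T, S, W):
    """Mahalanobis distance (Eq. 16), with W the covariance matrix."""
    diff = np.asarray(T) - np.asarray(S)
    return np.sqrt(diff @ np.linalg.inv(W) @ diff)
```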

**Figure 3.** *T* and *S* are two time series of a particular variable *v,* along the time axis *t*. The Euclidean distance results in the sum of the point-to-point distances (gray lines), along all the time series.

#### **3.2. Dynamic time warping**


Euclidean distance is cataloged as a metric distance function, since it obeys the metric properties: *non-negativity*, *identity*, *symmetry* and *triangle inequality* (Section 2, Eq. 2~5). It is surprisingly competitive with other more complex approaches, especially when dataset size gets larger [35]. Nevertheless, the Euclidean distance and its variants present several drawbacks that make their use inappropriate in certain applications:

- It compares only time series of the same length.
- It doesn't handle outliers or noise.
- It is very sensitive with respect to six signal transformations: shifting, uniform amplitude scaling, uniform time scaling, uniform bi-scaling, time warping and non-uniform amplitude scaling [31].

*Dynamic Time Warping* (*DTW*) [22] gives more robustness to the similarity computation. With this method, time series of different lengths can also be compared, because it replaces the one-to-one point comparison, used in Euclidean distance, with a many-to-one (and vice versa) comparison. The main feature of this distance measure is that it allows recognizing similar shapes, even if they present signal transformations, such as shifting and/or scaling (Fig. 4).

Given two time series *T* = {*t*1*, t*2*, . . . , tn*} and *S* = {*s*1*, s*2*, . . . , sm*} of length *n* and *m*, respectively, an alignment by *DTW* method exploits information contained in a *n* × *m* distance matrix:

$$distMatrix = \begin{pmatrix} d(T\_1, S\_1) & d(T\_1, S\_2) & \cdots & d(T\_1, S\_m) \\ d(T\_2, S\_1) & d(T\_2, S\_2) & \cdots & d(T\_2, S\_m) \\ \vdots & \vdots & \ddots & \vdots \\ d(T\_n, S\_1) & d(T\_n, S\_2) & \cdots & d(T\_n, S\_m) \end{pmatrix} \tag{17}$$

where *distMatrix*(*i*, *j*) corresponds to the distance *d*(*Ti*, *Sj*) between the *i*th point of *T* and the *j*th point of *S*, with 1 ≤ *i* ≤ *n* and 1 ≤ *j* ≤ *m*.

The *DTW* objective is to find the *warping path W* = {*w*1*, w*2*, . . . ,wk, . . ., wK*} of contiguous elements on *distMatrix* (with max(*n*, *m*) < *K* < *m + n -*1, and *wk = distMatrix*(*i, j*)), such that it minimizes the following function:

$$DTW(T, S) = \min\left(\sqrt{\sum\_{k=1}^{K} w\_k}\right) \tag{18}$$

**Figure 4.** Difference between the *DTW* distance and the Euclidean distance (green lines represent the mapping between points of time series *T* and *S*). The former allows many-to-one point comparisons, while the Euclidean distance allows only point-to-point (one-to-one) comparisons.


The warping path is subject to several constraints [22]. Given *wk* = (*i*, *j*) and *wk*-1 = (*i'*, *j'*) with *i*, *i'* ≤ *n* and *j*, *j'* ≤ *m*:

1. **Boundary conditions**. *w*1 = (1, 1) and *wK* = (*n*, *m*).
2. **Continuity**. *i* – *i'* ≤ 1 and *j* – *j'* ≤ 1.
3. **Monotonicity**. *i* – *i'* ≥ 0 and *j* – *j'* ≥ 0.


The warping path can be efficiently computed using dynamic programming [9]. By this method, a cumulative distance matrix *γ* of the same dimension as *distMatrix* is created to store in the cell (*i*, *j*) the following value (Fig. 5):

$$\gamma(i,j) = d(T\_i, S\_j) + \min\left\{\gamma(i-1,j-1), \gamma(i-1,j), \gamma(i,j-1)\right\} \tag{19}$$

The overall complexity of the method is relative to the computation of all distances in *distMatrix*, that is *O*(*nm*). The last element of the warping path, *wK*, corresponds to the distance calculated with the *DTW* method.

In many cases, this method can lead to undesired effects. An example is when a large number of points of a time series *T* are mapped to a single point of another time series *S* (Fig. 6a, 7a). A common way to overcome this problem is to restrict the warping path in such a way that it has to follow a direction along the diagonal (Fig. 6b, 7b). To do this, we can restrict the path by enforcing the recursion to stop at a certain depth, represented by a threshold *δ*. Then, the cumulative distance matrix *γ* will be calculated as follows:

$$\gamma(i,j) = \begin{cases} d(T\_i, S\_j) + \min\left\{\gamma(i-1, j-1), \gamma(i-1, j), \gamma(i, j-1)\right\} & \text{if } \left|i-j\right| < \delta, \\ \infty & \text{otherwise} \end{cases} \tag{20}$$
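
A compact implementation of this recursion follows. This sketch uses the squared point difference as *d*(*Ti*, *Sj*) (an assumption; any point distance works) and an optional *δ* implementing the banded recursion of Eq. 20:

```python
import numpy as np

def dtw(T, S, delta=None):
    """DTW distance via the cumulative matrix of Eq. 19; when delta is given,
    cells with |i - j| >= delta are left at infinity (Eq. 20)."""
    n, m = len(T), len(S)
    gamma = np.full((n + 1, m + 1), np.inf)
    gamma[0, 0] = 0.0
    for i in range(1, n + 1):
        lo = 1 if delta is None else max(1, i - delta + 1)
        hi = m if delta is None else min(m, i + delta - 1)
        for j in range(lo, hi + 1):
            cost = (T[i - 1] - S[j - 1]) ** 2   # d(Ti, Sj): squared difference
            gamma[i, j] = cost + min(gamma[i - 1, j - 1],
                                     gamma[i - 1, j],
                                     gamma[i, j - 1])
    return np.sqrt(gamma[n, m])                 # square root, as in Eq. 18
```

Note that with a band of width *δ* the cell (*n*, *m*) is reachable only when |*n* − *m*| < *δ*, so *δ* must be chosen accordingly when the series have different lengths.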


**Figure 5.** Warping path computation using dynamic programming. The lavender cells correspond to the warping path. The red arrow indicates its direction. The warping distance at the (*i*, *j*) cell will consider, besides the distance between *Ti* and *Sj*, the minimum value among the adjacent cells at positions (*i*-1, *j*-1), (*i*-1, *j*) and (*i*, *j*-1). The Euclidean distance between two time series can be seen as a special case of *DTW*, where the path's elements belong to the *γ* matrix diagonal.

**Figure 6.** Different mappings obtained with the classic implementation of *DTW* (a), and with the restricted path version using a threshold *δ* = 10 (b). Green lines represent mapping between points of time series *T* and *S*.



**Figure 7.** (a) Classic implementation of DTW. (b) Restricted path, using a threshold *δ* = 10. For each plot (a) and (b): in the center, the warping path calculated on matrix *γ*; on the top, the alignment of time series *T* and *S*, represented by the green lines; on the left, the time series *T*; on the bottom, the time series *S*; on the right, the color bar relative to the distance values in matrix *γ*.

Figure 7a shows the computation of a restricted warping path, using a threshold *δ* = 10. This constraint, besides limiting extreme or degenerate mappings, allows speeding up the *DTW* distance calculation, because we need to store only distances which are at most *δ* positions away (in horizontal and vertical direction) from the *distMatrix* diagonal. This reduces the computational complexity to *O*((*n* + *m*)*δ*). The above proposed constraint is also known as the *Sakoe-Chiba band* (Fig. 8a) [34], and it is classified as a global constraint. Another common global constraint is the *Itakura parallelogram* (Fig. 8b) [18].

Local constraints are a subject of research and differ from global constraints [22], because they provide local restrictions on the set of alternative depth steps of the recurrence function (Eq. 19). For example, we can replace Eq. 19 with:

$$\gamma(i,j) = d(T\_i, S\_j) + \min\left\{\gamma(i-1, j-1), \gamma(i-1, j-2), \gamma(i-2, j-1)\right\} \tag{21}$$

to define a new local constraint.


**Figure 8.** Examples of global constraints: (a) Sakoe-Chiba band; (b) Itakura parallelogram.

#### **3.3. Longest Common SubSequence**

Another well known method that takes advantage of dynamic programming, allowing one-to-many point comparison, is the *Longest Common SubSequence* (*LCSS*) similarity measure [37]. An interesting feature of this method is that it is more resilient to noise than *DTW*, because it allows some elements of the time series to be unmatched (Fig. 9). This solution builds a matrix *LCSS* similar to *γ*, but considering similarity instead of distance. Given the time series *T* and *S* of length *n* and *m*, respectively, the recurrence function is expressed as follows:

$$LCSS(i,j) = \begin{cases} 0 & \text{if } i = 0, \\ 0 & \text{if } j = 0, \\ 1 + LCSS[i-1, j-1] & \text{if } T\_i = S\_j, \\ \max(LCSS[i-1, j], LCSS[i, j-1]) & \text{otherwise} \end{cases} \tag{22}$$

with 1 ≤ *i* ≤ *n* and 1 ≤ *j* ≤ *m*. Since exact matching between *Ti* and *Sj* can be too strict for numerical values (Eq. 22 is best suited to string distance computation, such as the edit distance), a common way to relax this definition is to apply the following recurrence function:

$$\text{LCSS}(i,j) = \begin{cases} 0 & i = 0, \\ 0 & j = 0, \\ 1 + \text{LCSS}[i-1, j-1] & \text{if } \left| T\_i - S\_j \right| < \varepsilon, \\ \max(\text{LCSS}[i-1, j], \text{LCSS}[i, j-1]) & \text{otherwise} \end{cases} \tag{23}$$

The cell *LCSS*(*n*, *m*) contains the similarity between *T* and *S*, because it corresponds to the length *l* of the longest common subsequence of elements between time series *T* and *S*. To define a distance measure, we can compute [32]:

$$LCSSdist(T, S) = \frac{n + m - 2l}{n + m} \tag{24}$$
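
The recurrence of Eq. 23 and the distance of Eq. 24 translate directly into a small dynamic program; the match threshold ε below is an arbitrary example value:

```python
import numpy as np

def lcss_distance(T, S, eps=0.5):
    """LCSS length via Eq. 23 (match when |Ti - Sj| < eps),
    turned into a distance with Eq. 24."""
    n, m = len(T), len(S)
    L = np.zeros((n + 1, m + 1), dtype=int)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(T[i - 1] - S[j - 1]) < eps:
                L[i, j] = 1 + L[i - 1, j - 1]
            else:
                L[i, j] = max(L[i - 1, j], L[i, j - 1])
    l = L[n, m]                    # length of the longest common subsequence
    return (n + m - 2 * l) / (n + m)
```

As a quick sanity check, `lcss_distance(T, T)` returns 0 for any series *T*, as expected of a distance.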


Also for LCSS the time complexity is *O*(*nm*), but it can be improved to *O*((*n + m*)*δ*) if a restriction is used (i.e. when *|i - j| < δ*).

**Figure 9.** Alignment using *LCSS*. Time series *T* (red line) is obtained from *S* (blue line), by adding a fixed value = 5, and further "noise" at positions starting from 20 to 30. In these positions there is no mapping (green lines).

## **4. Dimensionality reduction techniques**

We have already introduced that a key aspect to achieve efficiency, when mining time series data, is to work with a data representation that is lighter than the raw data. This can be done by reducing the dimensionality of data, still maintaining its main properties. An important feature to be considered, when choosing a representation, is the *lower bounding* property.

Given the raw representations of two time series *T* and *S*, this property states that, after establishing a true distance measure *dtrue* for the raw data (such as the Euclidean distance), the distance *dfeature* between the reduced representations *R*(*T*) and *R*(*S*) must always be less than or equal to *dtrue*:

$$d\_{feature}(R(T), R(S)) \le d\_{true}(T, S) \tag{25}$$

If a dimensionality reduction technique ensures that the reduced representation of a time series satisfies such a property, we can assume that the similarity matching in the reduced space maintains its meaning. Moreover, we can take advantage of indexing frameworks such as *GEMINI* (Section 2) [13] to perform speed-up search while avoiding false negative results. In the following subsections, we will review the main dimensionality reduction techniques that preserve the *lower bounding* property.

#### **4.1. DFT**


The dimensionality reduction based on the *Discrete Fourier Transform* (*DFT*) [1] was the first to be proposed for time series. The DFT decomposes a signal into a sum of sine and cosine waves, called the *Fourier Series*. Each wave is represented by a complex number known as a *Fourier coefficient* (Fig. 10) [29],[32]. The most important feature of this method is data compression, because the original signal can be reconstructed by means of the information carried by the waves with the highest Fourier coefficients; the rest can be discarded with no significant loss.

**Figure 10.** The raw data is in the top-left plot. In the first row, the central plot ("Fourier coefficients" plot) shows the magnitude for each wave (Fourier coefficient). Yellow points are drawn for the top ten highest values. The remaining plots (in order from first row to last, and from left to right) represent the waves corresponding to the top ten highest coefficients in decreasing order, respectively of index {2, 100, 3, 99, 98, 4, 93, 9, 1, 97}, in the "Fourier coefficients" plot.

More formally, given a signal *x* = {*x*1*, x*2*, . . . , xn*}, the *n*-point *Discrete Fourier Transform* of *x* is a sequence *X* = {*X*1*, X*2*, . . . , Xn*} of complex numbers. *X* is the representation of *x* in the frequency domain. Each wave/frequency *XF* is calculated as:

$$X\_F = \frac{1}{\sqrt{n}} \sum\_{i=1}^{n} x\_i e^{-\frac{2\pi j i F}{n}} \quad (j = \sqrt{-1}) \quad F = 1, \ldots, n \tag{26}$$


The original representation of *x*, in the time domain, can be recovered by the inverse function:

$$x\_i = \frac{1}{\sqrt{n}} \sum\_{F=1}^{n} X\_F e^{\frac{2\pi j i F}{n}} \quad (j = \sqrt{-1}) \quad i = 1, \ldots, n \tag{27}$$

The energy *E*(*x*) of a signal *x* is given by:

$$E(\mathbf{x}) = \left\|\mathbf{x}\right\|^2 = \sum\_{i=1}^{n} \left|\mathbf{x}\_i\right|^2 \tag{28}$$

A fundamental property of the *DFT* is guaranteed by *Parseval's Theorem*, which asserts that the energy calculated in the time domain for a signal *x* is preserved in the frequency domain, and then:

$$E(x) = \sum\_{i=1}^{n} \left|x\_i\right|^2 = \sum\_{F=1}^{n} \left|X\_F\right|^2 = E(X) \tag{29}$$

If we use the Euclidean distance (Eq. 12), by this property the distance *d*(*x*,*y*) between two signals *x* and *y* in the time domain is the same as that calculated in the frequency domain, *d*(*X*,*Y*), where *X* and *Y* are the respective transforms of *x* and *y*. The reduced representation *X'* = {*X*1, *X*2, . . . , *Xk*} is built by keeping only the first *k* coefficients of *X* to reconstruct the signal *x* (Fig. 11).

**Figure 11.** The raw data is in the top-left plot. In the first row, the central plot ("Fourier coefficients" plot) shows the magnitude (Fourier coefficient) for each wave. Yellow points are drawn for the top ten highest values. The remaining plots (in order from first row to last, and from left to right) represent the reconstruction of the raw data using the wave with the highest value (of index 2) first, then adding the wave relative to the second highest coefficient (of index 100), and so on.

By Parseval's Theorem, we can be sure that the distance calculated in the reduced space is always less than or equal to the distance calculated in the original space, because *k* ≤ *n*, and then the distance measured using Eq. 12 will produce:

$$d(X',Y') \le d(X,Y) = d(\mathfrak{x},\mathfrak{y})\tag{30}$$

that satisfies the lower bounding property defined in Eq. 25.

The computational complexity of the *DFT* is *O*(*n*²), but it can be reduced by means of the *FFT* algorithm [8], which computes the *DFT* in *O*(*n* log *n*) time. The main drawback of the *DFT* reduction technique is the choice of the best number of coefficients to keep for a faithful reconstruction of the original signal.
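
The whole scheme — orthonormal DFT, truncation to *k* coefficients, lower bounding — can be checked in a few lines of NumPy (the series here are random stand-ins):

```python
import numpy as np

def dft_reduce(x, k):
    """First k coefficients of the orthonormal DFT (the 1/sqrt(n) scaling
    of Eq. 26, so that Parseval's Theorem holds as in Eq. 29)."""
    return np.fft.fft(x, norm="ortho")[:k]

rng = np.random.default_rng(1)
x, y = rng.standard_normal(128), rng.standard_normal(128)

d_true = np.linalg.norm(x - y)      # Euclidean distance in the time domain
d_feat = np.linalg.norm(dft_reduce(x, 16) - dft_reduce(y, 16))
assert d_feat <= d_true             # the lower bounding property of Eq. 30
```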

#### **4.2. DWT**


Another technique for decomposing signals is the *Wavelet Transform* (*WT*). The basic idea of *WT* is data representation in terms of sums and differences of prototype functions, called *wavelets*. The discrete version of *WT* is the *Discrete Wavelet Transform* (*DWT*). Differently from *DFT*, wavelet coefficients give local contributions to the reconstruction of the signal, while Fourier coefficients always represent global contributions to the signal over all the time [32].

The *Haar* wavelet is the simplest possible wavelet. Its formal definition is shown in [7]. An example of *DWT* based on Haar wavelet is shown in Table 1.

The general *Haar* transform *HL*(*T*) of a time series *T* of length *n* can be formalized as follows:

$$A_{L'+1}(i) = \frac{A_{L'}(2i-1) + A_{L'}(2i)}{2} \tag{31}$$

$$D_{L'+1}(i) = \frac{A_{L'}(2i-1) - A_{L'}(2i)}{2} \tag{32}$$

$$H_L(T) = \{A_L, D_L, D_{L-1}, \dots, D_2\} \tag{33}$$

where 0 < *L'* < *L*, *A*1 = *T* is the raw series, and, at level *L'*+1, 1 ≤ *i* ≤ *n*/2^*L'*.


| Level *L'* | Averages *A*_*L'* | Wavelet coefficients *D*_*L'* |
|---|---|---|
| 1 | 10, 4, 8, 6 | {} |
| 2 | 7, 7 | 3, 1 |
| 3 | 7 | 0 |

**Table 1.** The *Haar* transform of *T* = {10, 4, 8, 6} depends on the chosen level, and corresponds to merging the *Averages coefficients* (column 2) at the chosen level and all *Wavelet coefficients* (column 3) in decreasing order from the chosen level. At level 1 the representation coincides with the time series: *H*1(*T*) = {10, 4, 8, 6} = *T*. At level 2 it is *H*2(*T*) = {7, 7} + {3, 1} = {7, 7, 3, 1}. At level 3 it is *H*3(*T*) = {7} + {0} + {3, 1} = {7, 0, 3, 1}.
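The following minimal sketch (plain Python; the helper name `haar_transform` is illustrative) computes this averaging Haar transform of Eqs. 31-32, using 0-based indexing:

```python
def haar_transform(T):
    """Averaging Haar transform (Eqs. 31-32). len(T) must be a power of 2;
    otherwise, zero-pad to the next power of 2 (see Section 4.5)."""
    A, details = list(T), []
    while len(A) > 1:
        pairs = [(A[2 * i], A[2 * i + 1]) for i in range(len(A) // 2)]
        # prepend, so that coarser detail levels come first, as in Eq. 33
        details = [[(a - b) / 2 for a, b in pairs]] + details
        A = [(a + b) / 2 for a, b in pairs]
    return A + [c for level in details for c in level]

print(haar_transform([10, 4, 8, 6]))  # [7.0, 0.0, 3.0, 1.0] = {A_3} + {D_3} + {D_2}
```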

The main drawback of this method is that it is well defined only for time series whose length *n* is a power of 2 (*n* = 2^*m*). The computational complexity of the *DWT* using the *Haar* wavelet is *O*(*n*). Chan and Fu [7] demonstrated that the Euclidean distance between two time series *T* and *S*, *d*(*T*,*S*), can be calculated in terms of their *Haar* transforms as *d*(*H*(*T*), *H*(*S*)), preserving the lower bounding property in Eq. 25, because:

$$d(H(T), H(S)) \le \frac{\sqrt{2}}{2}\, d(T, S) < d(T, S) \tag{34}$$


**Figure 12.** *DWT* using *Haar* Wavelet with *MATLAB Wavelet Toolbox™ GUI tools*. *T* is a time series of length *n* = 256 and it is shown on the top-left plot (*Original Signal*). On the bottom-left plot (*Original Coefficients*) there are all the *AL*, represented by blue stems, and *DL'* coefficients (*L'* < *L* = 7), represented by green stems (stems' length is proportional to coefficients value). On the top-right plot, the *Synthesized Signal* by selecting only the 64 biggest coefficients, as reported on the bottom-right plot (*Selected Coefficients*): black points represent unselected coefficients.

#### **4.3. SVD**

As we have just seen in Section 2, a *TSDB* with *m* time series, each of length *n*, can be represented by a *m* × *n* matrix *A* (Eq. 1). An important result from linear algebra is that *A* can always be written in the form [16]:

$$A = UWV^T \tag{35}$$

where *U* is an *m* × *n* matrix, *W* and *V* are *n* × *n* matrices. This is called the *Singular Value Decomposition* (*SVD*) of the matrix *A*, and the elements of the *n* × *n* diagonal matrix *W* are the *singular values wi*:

$$\mathcal{W} = \begin{pmatrix} w\_1 & 0 & \cdots & 0 \\ 0 & w\_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & w\_n \end{pmatrix} \tag{36}$$

*V* is orthonormal, because *VV^T* = *V^TV* = *I_n*, where *I_n* is the identity matrix of size *n*. So, we can multiply both sides of Eq. 35 by *V* to get:


$$AV = UWV^TV \quad \rightarrow \quad AV = UW \tag{37}$$

*UW* represents a set of *n*-dimensional vectors *AV* ={*X*1*, X*2*, . . . , Xm*} which are rotated from the original vectors *A*={*x*1*, x*2*, . . . , xm*} [29]:

$$\begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_m \end{pmatrix} = \begin{pmatrix} U_1 \\ U_2 \\ \vdots \\ U_m \end{pmatrix} \begin{pmatrix} w_1 & 0 & \cdots & 0 \\ 0 & w_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & w_n \end{pmatrix} \tag{38}$$

**Figure 13.** *SVD* for a *TSDB* of *m*=7 time series of length *n*=50. It is possible to note in the *transformed data* plot how only the first *k* < 10 singular values are significant. In this example we heuristically choose to store only the first *k*=5 diagonal elements of *W* (the singular values), together with their relative entries in *U* and *V*, because they represent about 95% of the total variance. This permits reducing the space complexity from *n* to *k*, while keeping the information almost unchanged (see the reconstruction in the bottom-left plot).

Similarly to the sine/cosine waves for *DFT* (Section 4.1) and to the wavelets for *DWT* (Section 4.2), the *U* vectors represent a basis for *AV*, and their linear combination with *W* (which holds their coefficients) can reconstruct *AV*.

We can perform dimensionality reduction by selecting the *k* biggest singular values, and their relative entries in *U*, *W* and *V*, to obtain a new *k*-dimensional dataset that best fits the original data.

*SVD* is an optimal transform if we aim to reconstruct data, because it minimizes the reconstruction error, but it has two important drawbacks: (i) it needs a collection of time series to perform dimensionality reduction (it cannot operate on a single time series), because it examines the whole dataset prior to the transform; moreover, its computational complexity is *O*(min(*m*²*n*, *mn*²)); (ii) the transformation is not incremental, because a new data insertion requires a new global computation.
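A minimal sketch of this truncation (assuming NumPy; the variable names and the toy data are illustrative) follows the decomposition of Eqs. 35-38:

```python
import numpy as np

m, n, k = 7, 50, 5
A = np.random.randn(m, n).cumsum(axis=1)          # toy TSDB: m random-walk series

U, w, Vt = np.linalg.svd(A, full_matrices=False)  # A = U * diag(w) * V^T (Eq. 35)
A_reduced = A @ Vt[:k].T                          # k-dimensional dataset (= U[:, :k] * w[:k])
A_approx = A_reduced @ Vt[:k]                     # reconstruction from k singular values

explained = (w[:k] ** 2).sum() / (w ** 2).sum()   # fraction of total variance retained
print(f"retained variance: {explained:.1%}")
```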

#### **4.4. Dimensionality reduction via PAA**

Given a time series *T* of length *n*, *PAA* divides it into *w* equal-sized segments *ti* (1 ≤ *i* ≤ *w*) and records the values corresponding to the mean of each segment, *mean*(*ti*) (Fig. 14), into a vector *PAA*(*T*) = {*mean*(*t1*), *mean*(*t2*), …, *mean*(*tw*)}, using the following formula:

$$mean(t_i) = \frac{w}{n} \sum_{j=\frac{n}{w}(i-1)+1}^{\frac{n}{w}i} t_j \tag{39}$$

When *n* is a power of 2, each *mean*(*ti*) essentially represents an *Averages coefficient AL*(*i*), defined in Section 4.2, and *w* corresponds in this case to:

$$w = \frac{n}{2^L} \tag{40}$$


**Figure 14.** An approximation via *PAA* technique of a time series *T* of length *n* = 256, with *w* = 8 segments.

The time complexity of calculating the mean values of Eq. 39 is *O*(*n*). The *PAA* method is very simple and intuitive; moreover, it is strongly competitive with other more sophisticated transforms such as *DFT* and *DWT* [21],[41]. Much data mining research makes use of *PAA* reduction for its simplicity. It is simple to demonstrate that the distance on the raw representation is bounded below by the distance calculated on the *PAA* representation (using the Euclidean distance as reference point), satisfying Eq. 25. A limitation of such a reduction, in some contexts, can be the fixed size of the obtained segments.
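A minimal PAA sketch (assuming NumPy and that *n* is divisible by *w*; `paa_dist` is the usual lower-bounding distance on the reduced representation, as in [19]):

```python
import numpy as np

def paa(T, w):
    """PAA of Eq. 39: the mean of each of the w equal-sized segments."""
    return np.asarray(T).reshape(w, -1).mean(axis=1)

def paa_dist(P, Q, n, w):
    """Distance on the PAA representation; it lower-bounds the Euclidean
    distance on the raw series (Eq. 25)."""
    return np.sqrt(n / w) * np.linalg.norm(P - Q)

n, w = 256, 8
T = np.sin(np.linspace(0, 6, n))
S = np.cos(np.linspace(0, 6, n))
assert paa_dist(paa(T, w), paa(S, w), n, w) <= np.linalg.norm(T - S) + 1e-9
```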

#### **4.5. APCA**


In Section 4.2 we noticed that not all *Haar* coefficients in *DWT* are important for the time series reconstruction. The same holds for *PAA* in Section 4.4, where not all segment means are equally important for the reconstruction; sometimes we need an approximation with segments of non-fixed size. *APCA* is an adaptive model that, differently from *PAA*, allows defining segments of variable size. This can be useful when a time series contains areas of low variance and areas of high variance, for which we want, respectively, few segments for the former and many segments for the latter.

Given a time series *T* = {*t*1*, t*2*, . . . , tn*} of length *n*, the *APCA* representation of *T* is defined as [6]:

$$C = \left\{ \left\langle cv_1, cr_1 \right\rangle, \dots, \left\langle cv_M, cr_M \right\rangle \right\}, \quad cr_0 = 0 \tag{41}$$

where *cri* is the last index of the *i*th segment, and

$$c\upsilon\_i = mean\left(t\_{cr\_{i-1}+1}, \dots, t\_{cr\_i}\right) \tag{42}$$

To find an optimal representation through the *APCA* technique, dynamic programming can be used [14,30]. This solution requires *O*(*Mn*²) time. A better solution was proposed by Chakrabarti et al. [6], which finds the *APCA* representation in *O*(*n* log *n*) time, and defines a distance measure for this representation satisfying the lower bounding property defined in Eq. 25. The proposed method is based on the *Haar* wavelet transform. As we have just seen in Section 4.2, the original signal can be reconstructed by selecting only the biggest coefficients and truncating the rest. The segments in the reconstructed signal may have approximate mean values (due to the truncation) [6], so these values are replaced by the exact mean values of the original signal. Two aspects to consider before performing APCA:


1. Since the *Haar* transformation deals only with time series of length *n* = 2^*p*, we need to add zeros to the end of the time series until it reaches the desired length.
2. If we keep the biggest *M* Haar coefficients, we cannot be sure that the reconstruction will return an *APCA* representation of length *M*; we only know that the number of segments will vary between *M* and 3*M* [6]. If the number of segments exceeds *M*, we iteratively merge the most similar adjacent pairs of segments until we reach *M* segments.

The algorithm to compute the APCA representation can be found in [6].

#### **4.6. Time series segmentation using PLA**

As with most computer science problems, representation of data is the key to efficient and effective solutions. A suitable representation of a time series may be *Piecewise Linear Approximation* (*PLA*), where the original points are reduced to a set of segments.


*PLA* refers to the approximation of a time series *T*, of length *n*, using *K* consecutive segments, with *K* much smaller than *n* (Fig. 15). This representation makes the storage, transmission and computation of the data more efficient [23]. In light of this, *PLA* may be used to support clustering, classification, indexing and association rule mining of time series data (e.g. [10]).

The process of time series approximation using *PLA* is known as segmentation and is related to clustering process where each segment can be considered as a cluster [33].

There are several techniques to segment a time series and they can be distinguished into offline and on-line approaches. In the former approach an error threshold is fixed by the user, while the latter uses a dynamic error threshold that changes, according to specific criteria, during the execution of the algorithm.

**Figure 15.** The trend approximation (red line) of the original time series (black line) obtained by *PLA*.

Although off-line algorithms are simple to realize, they are less effective than the on-line ones. The classic approaches to time series segmentation are the sliding window, bottom-up and top-down algorithms. Sliding window is an on-line algorithm that works by growing a segment until the error for the potential segment is greater than the user-specified threshold; the subsequence is then transformed into a segment, and the process repeats until the entire time series has been approximated by its *PLA* [23] (a minimal sketch is given below). A way to estimate the error is to take the mean of the sum of the squares of the vertical differences between the best-fit line and the actual data points. Another commonly used measure of goodness of fit is the distance between the best-fit line and the data point furthest away in the vertical direction [23].
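The following sketch (assuming NumPy; the function name and the error threshold are illustrative) grows segments with a least-squares line fit and closes a segment when the mean squared residual exceeds the threshold:

```python
import numpy as np

def sliding_window_pla(T, max_error):
    """Sliding-window segmentation sketch [23]: grow the current segment while
    the mean squared residual of its best-fit line stays within max_error."""
    segments, left = [], 0
    right = left + 2                                  # a line needs at least 2 points
    while right <= len(T):
        x = np.arange(left, right)
        line = np.polyfit(x, T[left:right], 1)        # least-squares fit
        err = np.mean((T[left:right] - np.polyval(line, x)) ** 2)
        if err > max_error and right - left > 2:
            segments.append((left, right - 1))        # close segment before the overflow
            left = right - 1
            right = left + 2
        else:
            right += 1
    segments.append((left, len(T)))                   # final segment
    return segments

T = np.concatenate([np.linspace(0, 5, 50), np.linspace(5, 1, 30)])
print(sliding_window_pla(T, max_error=0.01))
```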

In the top-down approach, a segment that represents the entire time series is recursively split until the desired number of segments or an error threshold is reached. Dually, the bottom-up algorithm starts from the finest approximation of the time series, using *n/2* segments, and merges the two most similar adjacent segments until the desired number of segments or an error threshold is reached.

However, an open question is the choice of the best number *k* of segments. This problem involves a trade-off between compression and accuracy of the time series representation. As suggested by Salvador et al. [33], the appropriate number of segments may be estimated using an evaluation graph. It is defined as a two-dimensional plot where the x-axis is the number of segments, while the y-axis is a measure of the segmentation error. The best number of segments is provided by the point of maximum curvature, also called the "knee", of the evaluation graph (Fig. 16; a simple way to locate it is sketched after the figure).

**Figure 16.** Evaluation graph. The best number of segments is provided by the knee of the curvature.
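One simple heuristic for locating the knee (a sketch under the assumption that the point farthest from the chord joining the curve's endpoints approximates the point of maximum curvature; Salvador et al. [33] discuss more refined criteria):

```python
import numpy as np

def knee_point(ks, errors):
    """Return the k at the 'knee': the point of the evaluation graph farthest
    from the straight line joining its two endpoints."""
    p1 = np.array([ks[0], errors[0]], dtype=float)
    p2 = np.array([ks[-1], errors[-1]], dtype=float)
    d = (p2 - p1) / np.linalg.norm(p2 - p1)             # unit chord direction
    pts = np.stack([ks, errors], axis=1) - p1
    dist = np.abs(pts[:, 0] * d[1] - pts[:, 1] * d[0])  # perpendicular distance
    return ks[int(np.argmax(dist))]

ks = np.arange(1, 21)
errors = 10.0 / ks + 0.05 * ks                          # toy L-shaped evaluation curve
print(knee_point(ks, errors))
```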

#### **4.7. Chebyshev Polynomials approximation**


By this technique, the reduction problem is resolved by considering the values of the time series *T* as values of a function *f*, and approximating it with a polynomial function of degree *n* that fits *f* well:

$$P\_n(\mathbf{x}) = \sum\_{i=0}^n a\_i \mathbf{x}^i \tag{43}$$

where each *ai* corresponds to a coefficient and *x^i* to the variable of degree *i*.

There are many possible ways to choose the polynomial: Fourier transforms (Section 4.1), splines, non-linear regressions, etc. Ng and Cai [29] hypothesized that one of the best approaches is the polynomial that minimizes the maximum deviation from the true value, which is called the *minimax* polynomial. It has been shown that the *Chebyshev* approximation is almost identical to the optimal minimax polynomial, and is easy to compute [27]. Thus, Ng and Cai [29] explored how to use the *Chebyshev polynomials* (of the first kind) as a basis for approximating and indexing *n*-dimensional (*n ≥* 1) time series. The Chebyshev polynomial *CPm*(*x*) of the first kind is a polynomial of degree *m* (*m* = 0, 1, …), defined as:

$$\text{CP}\_m(\mathbf{x}) = \cos(m \arccos(\mathbf{x})) \quad \mathbf{x} \in [-1, 1] \tag{44}$$


It is possible to compute every *CPm*(*x*) using the following recurrence function [29]:

$$\text{C}P\_m(\mathbf{x}) = \text{2x}\text{C}P\_{m-1}(\mathbf{x}) - \text{C}P\_{m-2}(\mathbf{x}) \tag{45}$$

for all *m* ≥ 2 with *CP*0(*x*) = 1 and *CP*1(*x*) = *x*. Since Chebyshev polynomials form a family of orthogonal functions, a function *f*(*x*) can be approximated by the following Chebyshev series expansion:

$$S_k^{CP}(f(x)) = \sum_{i=0}^{k} c_i \, CP_i(x) \tag{46}$$

where *ci* refer to the *Chebyshev coefficients*. We refer the reader to the paper [29] for the conversion of a time series, which represents a discrete function, to an interval function required for the computation of Chebyshev coefficients. Given two time series *T* and *S*, and their corresponding vectors of Chebyshev coefficients, *C*1 and *C*2, the key feature of their work is the definition of a distance function *dCheb* between the two vectors, that guarantees the lower bounding property defined in Eq. 25. Since it results:

$$d_{cheb}(C_1, C_2) \le d_{true}(T, S) \tag{47}$$

the indexing with Chebyshev coefficients admits no false negatives. The computational complexity of the Chebyshev approximation is *O*(*n*), where *n* is the length of the approximated time series.
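As a minimal sketch (assuming NumPy; the degree and the toy signal are illustrative), a truncated Chebyshev series as in Eq. 46 can be fitted with NumPy's least-squares Chebyshev routines, once the time axis is rescaled onto [−1, 1]:

```python
import numpy as np

n, k = 1000, 20
t = np.linspace(-1, 1, n)                          # time axis rescaled to [-1, 1]
T = np.sin(5 * t) + 0.3 * np.cos(23 * t)           # toy "time series" seen as a function f

c = np.polynomial.chebyshev.chebfit(t, T, deg=k)   # coefficients c_i of Eq. 46
T_approx = np.polynomial.chebyshev.chebval(t, c)   # truncated series evaluated on t

print(np.max(np.abs(T - T_approx)))                # maximum deviation of the fit
```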

**Figure 17.** An example of approximation of a time series *T* of length *n* = 10000 with a Chebyshev series expansion (Eq. 46) where *i* is from 0 to *k* = 100, using the *chebfun* toolbox for MATLAB (http://www2.maths.ox.ac.uk/chebfun/)

#### **4.8. SAX**

Many symbolic representations of time series have been introduced over the past decades. The challenge in this field is to create a real correlation between the distance measure defined on the symbolic representation and that defined on the original time series. *SAX* is the best-known symbolic representation technique in time series data mining; it ensures both a considerable dimensionality reduction and the lower bounding property, enhancing the time performance of most data mining algorithms.


Given a time series *T* of length *n*, and an alphabet of arbitrary size *a*, *SAX* returns a string of arbitrary length *w* (typically *w* << *n*). The alphabet size *a* is an integer, where *a* > 2. The *SAX* method is *PAA*-based, since it transforms the *PAA* means into symbols, according to a defined transformation function.

To give significance to the symbolic transformation, it is necessary to deal with a system producing symbols with equal probability, or with a Gaussian distribution. This can be achieved by normalizing the time series, since normalized time series generally have a Gaussian distribution [26]. This is the first assumption to consider about this technique; for data not obeying this property, the efficiency of the reduction slightly deteriorates. Given the Gaussian distribution, it is simple to determine the "breakpoints" that will produce *a* equal-sized areas of probability under the Gaussian curve. What follows gives the principal definitions needed to understand the *SAX* representation.

**Definition 1**. *Breakpoints*: A sorted list of numbers *B = β1, . . . , βa−1* such that the area under a *N(*0, 1*)* Gaussian curve from *β<sup>i</sup>* to *βi*+1 = 1*/a* (*β*0 and *β<sup>a</sup>* are defined as −∞ and ∞, respectively) (Table 2). For example, if we want to obtain breakpoints for an alphabet of size *a = 4*, we have to compute the first (*q1*), the second (*q2*), and the third (*q3*) quartiles of the inverse cumulative Gaussian distribution, corresponding to the 25%, 50% and 75% of the cumulative frequency: *β*1 = *q*1, *β*2 = *q*2, *β*3 = *q*3.

**Definition 2**. *Alphabet*: A collection of symbols *alpha = α*1*, α*2*,…, αa* of size *a* used to transform mean frames into symbols.


| *βi* \ *a* | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|
| *β1* | -0.43 | -0.67 | -0.84 | -0.97 | -1.07 | -1.15 |
| *β2* | 0.43 | 0 | -0.25 | -0.43 | -0.57 | -0.67 |
| *β3* | | 0.67 | 0.25 | 0 | -0.18 | -0.32 |
| *β4* | | | 0.84 | 0.43 | 0.18 | 0 |
| *β5* | | | | 0.97 | 0.57 | 0.32 |
| *β6* | | | | | 1.07 | 0.67 |
| *β7* | | | | | | 1.15 |

**Table 2.** A look-up table of breakpoints used for alphabets of size 2 < *a* < 9.

**Definition 3**. *Word*: A PAA approximation *PAA*(*T*) = {*mean*(*t1*), *mean*(*t2*), …, *mean*(*tw*)} of length *w* can be represented as a word *SAX*(*T*) = {*sax*(*t*1), *sax*(*t*2), …, *sax*(*tw*)}, with respect to the following mapping function:

$$sax(t_i) = \alpha_j \quad \text{iff} \quad \beta_{j-1} \le mean(t_i) < \beta_j \quad (0 < i \le w, \; 1 \le j \le a) \tag{48}$$
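A minimal end-to-end SAX sketch (assuming NumPy and SciPy for the inverse Gaussian CDF; the function `sax` is illustrative, and *n* is assumed divisible by *w*):

```python
import numpy as np
from scipy.stats import norm

def sax(T, w, a):
    """SAX sketch: z-normalize, compute the w PAA means (Eq. 39), then map
    each mean to a symbol via the N(0,1) breakpoints of Definition 1 / Table 2."""
    T = (np.asarray(T, dtype=float) - np.mean(T)) / np.std(T)   # normalization [26]
    means = T.reshape(w, -1).mean(axis=1)                        # PAA means
    breakpoints = norm.ppf(np.arange(1, a) / a)                  # beta_1 ... beta_{a-1}
    j = np.searchsorted(breakpoints, means, side='right')        # symbol index of Eq. 48
    return ''.join(chr(ord('a') + idx) for idx in j)

T = np.sin(np.linspace(0, 4 * np.pi, 128))
print(sax(T, w=8, a=8))   # a word of length 8 over the alphabet {a, ..., h}
```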

Lin et al. [26] defined a distance measure for this representation such that the real distance calculated on the original representation is bounded below by it. An extension of the *SAX* technique, *iSAX*, was proposed by Shieh and Keogh [35], which allows getting different resolutions for the same word by using several combinations of the parameters *a* and *w*.


**Figure 18.** An example of conversion of a time series *T* (blue line) of length *n =* 128, into a word of length *w =* 8*,* using an alphabet *alpha =* {*a,b,c,d,e,f,g,h*} of size *a =* 8. The left plot refers to the Gaussian distribution divided into equal areas of size 1*/a. PAA* mean frames falling into two consecutive cutlines (gray lines) will be mapped into the corresponding plotted symbol (colored segments). The *PAA* plot shows the *PAA* representation (red line), while *SAX* plot shows the conversion of *PAA*(*T*) into the word *SAX*(*T*) = {*c,g,e,g,f,g,a,a*}. Images generated by MATLAB and code provided by *SAX* authors [26].

## **Author details**

Placido Montalto, Marco Aliotta and Andrea Cannata

*Istituto Nazionale di Geofisica e Vulcanologia, Osservatorio Etneo, Sezione di Catania, Catania, Italy* 

Carmelo Cassisi\* and Alfredo Pulvirenti

*Università degli studi di Catania, Dipartimento di Matematica e Informatica, Catania, Italy*

\* Corresponding Author

## **5. References**



[1] Agrawal, Faloutsos, Swami (1993). "Efficient similarity search in sequence databases". *Proc. of the 4th Conference on Foundations of Data Organization and Algorithms*, pp. 69-84.

[2] Bailey, Elkan (1995). "Unsupervised learning of multiple motifs in biopolymers using expectation maximization". *Machine Learning Journal* 21, pp. 51-80.

[3] Beckmann, Kriegel, Schneider, Seeger (1990). "The R\*-tree: An Efficient and Robust Access Method for Points and Rectangles". *Proc. ACM SIGMOD Int. Conf. on Management of Data*, Atlantic City, NJ, pp. 322-331.

[4] Bentley (1975). "Multidimensional binary search trees used for associative searching". *Communications of ACM*, Vol. 18, pp. 509-517.

[5] Cantone, Ferro, Pulvirenti, Reforgiato (2005). "Antipole tree indexing to support range search and k-nearest neighbour search in metric spaces". *IEEE Transactions on Knowledge and Data Engineering*, Vol. 17, pp. 535-550.

[6] Chakrabarti, Keogh, Mehrotra, Pazzani (2002). "Locally adaptive dimensionality reduction for indexing large time series databases". *ACM Trans. Database Syst.* 27, 2, pp. 188-228.

[7] Chan, Fu (1999). "Efficient time series matching by wavelets". In *Proceedings of the 15th IEEE Int'l Conference on Data Engineering*, Sydney, Australia, Mar 23-26, pp. 126-133.

[8] Cooley, Tukey (1965). "An algorithm for the machine calculation of complex Fourier series". *Math. Comput.* 19, pp. 297-301.

[9] Cormen, Leiserson, Rivest (1990). *Introduction to Algorithms*. The MIT Press, McGraw-Hill Book Company.

[10] Di Salvo, Montalto, Nunnari, Neri, Puglisi (2012). "Multivariate time series clustering on geophysical data recorded at Mt. Etna from 1996 to 2003". *Journal of Volcanology and Geothermal Research*.

[11] Ding, Trajcevski, Scheuermann, Wang, Keogh (2008). "Querying and Mining of Time Series Data: Experimental Comparison of Representations and Distance Measures". *VLDB*.

[12] Duda, Hart, Stork (2001). *Pattern Classification*. John Wiley & Sons.

[13] Faloutsos, Ranganthan, Manolopoulos (1994). "Fast subsequence Matching in Time-Series Databases". *SIGMOD Conference*.

[14] Faloutsos, Jagadish, Mendelzon, Milo (1997). "A signature technique for similarity-based queries". In *Proceedings of the SEQUENCES 97* (Positano-Salerno, Italy).

[15] Fink, Pratt (2004). "Indexing of compressed time series". In Mark Last, Abraham Kandel, and Horst Bunke, editors, *Data Mining in Time Series Databases*, pp. 43-65. World Scientific, Singapore.

[16] Golub, Van Loan (1996). *Matrix Computations*, 3rd edition. Baltimore, MD: Johns Hopkins University Press.

[17] Han, Kamber (2005). *Data Mining: Concepts and Techniques*. Morgan Kaufmann Publishers, CA.

[18] Itakura (1975). "Minimum prediction residual principle applied to speech recognition". *IEEE Trans. Acoustics Speech Signal Process.*, ASSP 23, pp. 52-72.

[19] Keogh, Chakrabarti, Pazzani, Mehrotra (2000). "Dimensionality reduction for fast similarity search in large time series databases". *Journal of Knowledge and Information Systems*.

[20] Keogh, Chu, Hart, Pazzani (2001a). "An Online Algorithm for Segmenting Time Series". In *Proc. IEEE Intl. Conf. on Data Mining*, pp. 289-296.

[21] Keogh, Kasetty (2002). "On the need for time series data mining benchmarks: a survey and empirical demonstration". *Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*, pp. 102-111.

[22] Keogh, Ratanamahatana (2002). "Exact indexing of dynamic time warping". In *Proceedings of the 26th Int'l Conference on Very Large Data Bases*, Hong Kong, pp. 406-417.

[23] Keogh, Chu, Hart, Pazzani (2004). "Segmenting time series: a survey and novel approach". In: Last, M., Kandel, A., Bunke, H. (Eds.), *Data mining in time series database*. World Scientific Publishing Company, pp. 1-21.
