### 3.2.1 Methodology

The advantage of the ST over the FFT is its ability to adaptively capture spectral changes over time without windowing of data, resulting in a better time-frequency resolution for non-stationary signals [32]. To illustrate the ST method, let $h[l] = h(l \cdot T)$, $l = 0, 1, \ldots, N-1$, be the samples of the continuous signal $h(t)$, where $T$ is the sampling interval (i.e., the sampling interval of our sensor or measuring device). The discrete Fourier transform (DFT) can be written as,

$$H[m] = \sum_{l=0}^{N-1} h[l]\, e^{-\frac{i2\pi ml}{N}} \tag{1}$$

where $m$ is the discrete frequency index, $m = 0, 1, \ldots, N-1$. The discrete Stockwell transform (DST) is given by,

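$$S[k,n] = \sum_{m=0}^{N-1} W[m]\, H[m+n]\, e^{\frac{i2\pi mk}{N}} \tag{2}$$

where $k$ is the index for time translation and $n$ is the index for frequency shift. The function $W[m] = e^{-\frac{2\pi^2 m^2}{n^2}}$ is the Gaussian window in the frequency domain.

Given an N-length signal, the DST coefficients are computed using the following steps:

1. Apply an N-point DFT to calculate the Fourier spectrum of the signal, $H[m]$;

2. Multiply $H[m+n]$ with the Gaussian window function $W[m] = e^{-\frac{2\pi^2 m^2}{n^2}}$;

3. For each fixed frequency shift $n = 0, 1, \ldots, \tau - 1$ (where $\tau$ is the number of frequency steps desired), apply an N-point inverse DFT to $W[m]H[m+n]$ in order to calculate the DST coefficients $S[k,n]$, where $k = 0, 1, \ldots, N-1$.

Note that there is a DST coefficient calculated for every pair $\langle k, n \rangle$ in the time-frequency domain. Therefore, the result of the DST is a complex matrix of size $\tau \times N$, where the rows represent the frequencies for every frequency shift index $n$ and the columns are the time values of every time translation index $k$, i.e., each column is the "local spectrum" for that point in time. This results in $N$ instances, each represented as a $\tau \times 1 \times 1$ matrix. If the time series is multivariate with $D$ channels, $D$ DST matrices are generated, and each instance is represented as a $\tau \times 1 \times D$ matrix.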
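The DST computation (a forward DFT, a Gaussian window applied to the shifted spectrum, and one inverse DFT per frequency index) can be sketched in plain Python as below. This is a minimal illustration rather than the authors' implementation. The function names `dft` and `dst` are hypothetical, and the sums are written out in O(N²) form for transparency, whereas an FFT routine would normally be used. Two further assumptions are not stated in the text: the spectrum $H$ is treated as N-periodic so that $H[m+n]$ is defined for $m+n \geq N$, and the $n = 0$ row, for which $W[m]$ is undefined, is filled with the signal mean.

```python
import cmath
import math


def dft(h):
    """Eq. (1): H[m] = sum_l h[l] * exp(-i*2*pi*m*l/N)."""
    N = len(h)
    return [
        sum(h[l] * cmath.exp(-2j * math.pi * m * l / N) for l in range(N))
        for m in range(N)
    ]


def dst(h, tau):
    """Eq. (2): return tau rows (frequency index n), each with N columns (time index k)."""
    N = len(h)
    H = dft(h)  # step 1: N-point DFT of the signal

    # Assumption: W[m] is undefined for n = 0, so that row holds the signal mean.
    mean = sum(h) / N
    S = [[complex(mean)] * N]

    for n in range(1, tau):
        # Step 2: Gaussian window times the shifted spectrum.
        # Assumption: H is N-periodic, hence the modulo on the index.
        G = [
            math.exp(-2.0 * math.pi ** 2 * m ** 2 / n ** 2) * H[(m + n) % N]
            for m in range(N)
        ]
        # Step 3: inverse DFT over m, without a 1/N factor, as Eq. (2) is written.
        S.append([
            sum(G[m] * cmath.exp(2j * math.pi * m * k / N) for m in range(N))
            for k in range(N)
        ])
    return S
```

For a pure cosine at frequency index 3, the row `S[3]` has constant magnitude $N/2$ at every time index, because only the term $W[0]H[3]$ contributes appreciably. Column `k` of the result is the local spectrum at time index `k`.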

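### 3.2.2 Experiments

#### 3.2.2.1 Stereotypical motor movement (SMM) recognition task [4]

An SMM is a repetitive movement regarded as one of the most apparent and relevant atypical behaviors in children on the autism spectrum. Detecting SMM behaviors can therefore play a major role in the screening and therapy of autism spectrum disorder (ASD), potentially improving the lives of children on the spectrum.

Dataset. The SMM dataset used for training the CNN is derived from [33] and consists of raw time series of acceleration signals collected by three-axis wireless accelerometers (located at the torso, left wrist and right wrist) from six atypical (e.g., autistic) subjects in a longitudinal study. The subjects engaged in activities including SMMs (body rocking, hand flapping, or simultaneous body rocking and hand flapping) and non-SMMs, which were labeled (annotated) by an expert as SMM or non-SMM. Two to three sessions (9 to 39 min long) were recorded per participant, except subject 6, who was observed only once in Study 2.

Pre-processing. The data collections called "Study1" and "Study2" were recorded at sampling frequencies of 60 and 90 Hz, respectively. To equalize the data, the 60 Hz signals are resampled and interpolated to 90 Hz. Next, data of both sensors go through a high-pass filter with a cut-off frequency of 0.1 Hz in order to remove noise.

Afterwards, the data are turned into fixed-length vector samples either in the time domain (using a sliding window) or in the frequency domain (using the ST). Time-domain samples are obtained by segmenting raw data using a one-second window (L = 90, see Algorithm 1) with 88.9% overlap between consecutive data segments (s = 10, see Algorithm 1), resulting in samples of 90 time points. Because each of the three accelerometers measures acceleration along three axes, an input sample is a 90 × 1 × 9 matrix, where 9 denotes the number of channels (9 = 3 accelerometers × 3 coordinates).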
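As a rough illustration of the time-domain segmentation with L = 90 and s = 10 (this is not the chapter's Algorithm 1, which is not reproduced here), the sketch below cuts a multichannel recording into overlapping windows. The names `sliding_windows` and `overlap_fraction` are hypothetical, and the recording is assumed to be a list of per-time-step channel vectors. The singleton middle dimension of the 90 × 1 × 9 layout is omitted in this nested-list form.

```python
def sliding_windows(signal, L, s):
    """Cut `signal` (one channel vector per time step) into windows of L points,
    moving the window start forward by s points each time."""
    return [signal[start:start + L] for start in range(0, len(signal) - L + 1, s)]


def overlap_fraction(L, s):
    """Fraction of a window shared with the next one when the step is s."""
    return (L - s) / L


# Example: 10 s of 9-channel data at 90 Hz (synthetic values, for shape checking only).
recording = [[float(t)] * 9 for t in range(900)]
windows = sliding_windows(recording, L=90, s=10)
```

With these settings each window holds 90 time points of 9 channels, consecutive windows start 10 samples apart, and `overlap_fraction(90, 10)` evaluates to 80/90, i.e., the 88.9% overlap quoted in the text.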

Frequency-domain samples, on the other hand, are obtained by computing the ST at every 10th sample and selecting an appropriate frequency range. Since 98% of the FFT amplitude of human activity frequencies is contained below 10 Hz, the ST frequencies were first chosen between 0 and 10 Hz, which yielded poor CNN classification performance. Following Goodwin's observation that almost all SMMs are contained within a frequency range of 1–3 Hz, we then chose a frequency range of 0–3 Hz, which produced higher CNN classification performance. Computing the ST thus generates multiple input samples (vectors of length 50), each containing the power of 50 frequencies (τ = 50) in the range of 0–3 Hz, so each extracted frequency-domain sample is a 50 × 1 × 9 matrix.

CNN training. The purpose is to analyze the intersession variability of different SMMs by training one CNN per domain (time or frequency) per subject per study. In other words, the feature learning is specific to one domain (time or frequency), one subject i and one study j. The goal of this experiment is to build deep networks capable of recognizing SMMs across multiple sessions within the same atypical subject i. Training is conducted using k-fold cross-validation for an atypical subject i and a study j, where k is the number of sessions for which the participant was observed within the study, and every fold consists of data from one specific session. The time-domain and frequency-domain CNN architectures are composed of three and two sets of convolution, ReLU and pooling layers, respectively, with the number of filters set to {96, 192, 300} and {96, 192}, respectively, followed by a fully connected layer with 500 neurons. Training is performed for 10 to 40 epochs with the following hyper-parameters: a dropout of 0.5, a momentum of 0.9, a learning rate of 0.01, a weight decay of 0.0005, and a mini-batch size of 150.
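The session-wise cross-validation described above, with one fold per recorded session, can be sketched as follows. This is only an illustration of the splitting logic under the usual assumption that the held-out fold is used for testing and the remaining sessions for training. The names `session_folds` and `HYPERPARAMS` are hypothetical, and the dictionary simply restates the values quoted in the text without tying them to any particular deep learning library.

```python
# Hyper-parameter values as quoted in the text (key names are illustrative).
HYPERPARAMS = {
    "dropout": 0.5,
    "momentum": 0.9,
    "learning_rate": 0.01,
    "weight_decay": 0.0005,
    "batch_size": 150,
    "filters_time_domain": [96, 192, 300],
    "filters_frequency_domain": [96, 192],
    "fully_connected_units": 500,
}


def session_folds(session_ids):
    """Yield (held_out_session, train_indices, test_indices), one fold per session.

    `session_ids[i]` is the session that sample i was recorded in.
    """
    for held_out in sorted(set(session_ids)):
        test_idx = [i for i, s in enumerate(session_ids) if s == held_out]
        train_idx = [i for i, s in enumerate(session_ids) if s != held_out]
        yield held_out, train_idx, test_idx


# Example: six samples of one subject in one study, recorded over three sessions.
folds = list(session_folds(["s1", "s1", "s2", "s2", "s2", "s3"]))
```

Here k equals the number of distinct sessions (three in the example), so a subject with two sessions yields a 2-fold split and no sample from the held-out session ever appears in the training indices.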

#### 3.2.2.2 Human activity recognition (HAR) task [3]

Dataset. The dataset used for HAR is the PUC dataset [34], which consists of 8 hours of human activities collected at a sampling frequency of 8 Hz by 4 tri-axial ADXL335 accelerometers located at the waist, left thigh, right ankle, and right arm. The activities are: sitting, standing, sitting down, standing up, and walking.

Pre-processing. The PUC data are further converted into time-domain and frequency-domain signals. In the time domain, a 1 s window (L = 8) advanced by 125 ms at a time (s = 1) is employed to generate 8 × 1 time-domain samples. However, since an 8 × 1 input is too short a vector for training a CNN, the signals are resampled from 8 to 50 points per window using an antialiasing FIR low-pass filter, compensating for the delay introduced by the filter. The resulting time-domain input samples are 50 × 1 × 12 matrices, where 12 is the number of channels (12 = 4 accelerometers × 3 coordinates). In the frequency domain, raw signals are resampled from 8 to 16 Hz; then, the ST is computed to obtain, for each input sample, the power of 50 frequencies in the range of 0–8 Hz, resulting in frequency-domain input samples of size 50 × 1 × 12.

CNN training. In this experiment, one CNN is trained for each domain (time and frequency), with 10-fold cross-validation. The CNN architecture and parameters are set the same as in the SMM recognition task.

### 3.2.3 Results

Table 1 summarizes accuracy and F1-score results of CNN (in both time and frequency domains) for both the SMM recognition and HAR tasks. For SMM

CNN Approaches for Time Series Classification DOI: http://dx.doi.org/10.5772/intechopen.81170



Time Series Analysis - Data, Methods, and Applications

