where $w_h^1 \in \mathbb{R}^{1 \times k \times 1}$ and $a^1 \in \mathbb{R}^{1 \times (N-k+1) \times M^1}$, $i$ is the index of the feature map at the second dimension ($i = 1, \ldots, N - k + 1$) and $h$ is the index of the feature map at the third dimension ($h = 1, \ldots, M^1$). Note that since the number of input channels in this case is one, the weight matrix also has only one channel. Similar to the feedforward neural network, this output is then passed through the non-linearity $h(\cdot)$ to give $f^1 = h(a^1)$.

In each subsequent layer $l = 2, \ldots, L$, the input feature map $f^{l-1} \in \mathbb{R}^{1 \times N^{l-1} \times M^{l-1}}$, where $1 \times N^{l-1} \times M^{l-1}$ is the size of the output filter map from the previous convolution with $N^{l-1} = N^{l-2} - k + 1$, is convolved with a set of $M^l$ filters $w_h^l \in \mathbb{R}^{1 \times k \times M^{l-1}}$, $h = 1, \ldots, M^l$, to create a feature map $a^l \in \mathbb{R}^{1 \times N^l \times M^l}$:

$$a^l(i, h) = \left(w_h^l * f^{l-1}\right)(i) = \sum_{j=-\infty}^{\infty} \sum_{m=1}^{M^{l-1}} w_h^l(j, m)\, f^{l-1}(i - j, m) \tag{4}$$

This output is then passed through the non-linearity to give $f^l = h(a^l)$. The filter size parameter $k$ thus controls the receptive field of each output node. Without zero padding, the convolution output in every layer has width $N^l = N^{l-1} - k + 1$ for $l = 1, \ldots, L$. Since all the elements in a feature map share the same weights, features can be detected in a time-invariant manner, while at the same time the number of trainable parameters is reduced.

The output is then fed into a pooling layer (usually a max-pooling layer), which acts as a subsampling layer. The output map $p^l$ of a feature map $h$ is obtained by computing maximum values over nearby inputs of the feature map as follows:

$$p^l(i, h) = \max_{r \in R} f^l(i \cdot T + r, h) \tag{5}$$

where $R$ is the pooling size, $T$ is the pooling stride, and $i$ is the index of the resultant feature map at the second dimension.

Multiple convolution, ReLU and pooling layers can be stacked on top of one another to form a deep CNN architecture. The output of these layers is then fed into a fully connected layer and an activation layer. The output of the network after $L$ layers is thus the matrix $f^L$. Depending on what we want our model to learn, the weights of the model are trained to minimize the error between the network output $f^L$ and the true output we are interested in; this error measure is often denoted as the objective function (loss function). For instance, a softmax layer can be applied on top, followed by the cross-entropy cost function, an objective function computed from the true labels of the training instances and the probabilistic outputs of the softmax function.
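To make Eqs. (4) and (5) concrete, here is a minimal NumPy sketch of one convolution-ReLU-pooling pass. The shapes and layer sizes are illustrative assumptions, and the convolution is written as cross-correlation (as in most deep learning libraries), which corresponds to Eq. (4) with the filter taps flipped.

```python
import numpy as np

def conv_layer(f_prev, w):
    """Valid 1D convolution over a (width, channels) feature map, cf. Eq. (4).

    f_prev: (N_prev, M_prev) input feature map.
    w:      (M_l, k, M_prev) filter bank (one k-tap filter per output map).
    Returns a: (N_prev - k + 1, M_l).
    """
    M_l, k, _ = w.shape
    N = f_prev.shape[0] - k + 1        # output width without zero padding
    a = np.zeros((N, M_l))
    for h in range(M_l):               # index of the output feature map
        for i in range(N):             # position along the time axis
            # sum over the k filter taps and the M_prev input channels
            a[i, h] = np.sum(w[h] * f_prev[i:i + k, :])
    return a

def max_pool(f, R, T):
    """Max pooling, Eq. (5): p(i, h) = max_{r in [0, R)} f(i*T + r, h)."""
    N_out = (f.shape[0] - R) // T + 1
    return np.stack([f[i * T:i * T + R].max(axis=0) for i in range(N_out)])

# One pass over a single-channel input signal (sizes are examples only).
rng = np.random.default_rng(0)
f0 = rng.normal(size=(90, 1))          # input of width N = 90, one channel
w1 = rng.normal(size=(8, 9, 1))        # M^1 = 8 filters of size k = 9
a1 = conv_layer(f0, w1)                # shape (82, 8): N - k + 1 = 82
f1 = np.maximum(a1, 0)                 # non-linearity h(.) taken as ReLU
p1 = max_pool(f1, R=2, T=2)            # shape (41, 8)
```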

## 4.2 CNN with the adaptive convolutional filter approach

In previous CNN works, several attempts have been made to extract the most relevant/meaningful features using different CNN architectures. The works [17, 24] transformed the time series signals (by applying down-sampling, slicing, or warping) so as to help the convolutional filters (especially the 1st convolutional layer filters) capture entire peaks and fluctuations within the signals, while the work of [18] proposed to keep the time series data unchanged and instead feed it into three branches, each with a different 1st convolutional filter size, in order to capture the whole fluctuations within the signals. An alternative is to find an adaptive 1st convolutional layer filter which has the optimal size and is able to convolve entire signal peaks within the input signals.

### 4.2.1 Methodology

In CNNs, multiple hyper-parameters must be chosen carefully in order to achieve the best classification rate, including model hyper-parameters, which define the CNN architecture, and optimization hyper-parameters such as the loss function, learning rate, etc. Model hyper-parameter values are generally chosen based on the literature and on trial and error (by running experiments with multiple values). A conventional approach is to start with a CNN architecture which has already been adopted in a similar domain, and then to update the hyper-parameters by experimentation.
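For a concrete picture of the two families, one might organize a configuration as follows; the values are placeholders, not the settings adopted in this chapter.

```python
# Illustrative split of the two hyper-parameter families; the values are
# placeholders, not the settings adopted in this chapter.
model_hparams = {
    "n_conv_layers": 2,            # define the CNN architecture
    "filters_per_layer": [16, 32],
    "first_kernel_size": 9,        # the focus of this section
    "pooling_size": 2,
}
optim_hparams = {
    "loss": "cross_entropy",       # drive the optimization
    "learning_rate": 1e-3,
    "batch_size": 32,
}
```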

In our study, we focus on the convolutional layer filter (also known as the "receptive field"). Conventionally, the 1st convolutional layer filter has one of the following sizes: 3 × 3, 5 × 5, 7 × 7 or 10 × 10, where small filter sizes capture very fine details of the input while large ones leave out minute details. Ultimately, the goal is to set the receptive field size such that the filters detect complete features (e.g., entire variations) within an object, which could be edges, colors or textures within an image, or peaks within a signal. Choosing a small 1st layer filter is good for capturing a short variation (e.g., a signal peak) within a time series input, but it may capture only a slice of this variation by convolving only part of it, thus failing to detect the whole fluctuation within the signal. Conversely, a relatively large filter may convolve multiple signal peaks at once, so that no single fluctuation is detected either. The choice of a proper 1st layer receptive field size is therefore crucial for finding good CNN features and for maximizing recognition performance. In that sense, we need a 1st layer filter that suits the input signals and the variations present within them; in other words, we need to find the filter size which convolves most of an entire signal peak within the input signals. To this end, it is necessary to estimate the typical length of the signal peaks present within the input signals. To do so, we apply sampling, a statistical procedure that uses characteristics of a sample (i.e., peak lengths taken from randomly selected signals) to estimate the characteristics of the population (i.e., the typical peak length over all signals). The sample statistic can be the mean, median or mode.

Given a population of signals with mean $\mu$ and $n$ random values $x_i$ ($n$ being sufficiently large: $n \geq 30$) sampled from the population with replacement, the sample mean $E(x) = \frac{1}{n}\sum_{i=1}^{n} x_i$ is an unbiased point estimate of $\mu$. However, using the sample mean $E(x)$ as the optimal length of peaks within time series signals (in the time and frequency domains), and hence as the size of the 1st convolutional layer filter, gives poor CNN performance, since the mean is influenced by outliers or extreme values (i.e., by signal peaks whose lengths are either too small or too big). Instead, we use the sample median.
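A quick illustration of this sensitivity, with made-up peak lengths containing two extreme values:

```python
import numpy as np

# Hypothetical sample of 30 peak lengths with two extreme values (2 and 45);
# the numbers are made up for illustration, not taken from the dataset.
peak_lengths = np.array([8, 9, 9, 10, 8, 9, 10, 9, 8, 9,
                         10, 9, 8, 9, 9, 10, 8, 9, 9, 10,
                         9, 8, 9, 10, 9, 9, 8, 9, 45, 2])

print(f"mean:   {np.mean(peak_lengths):.2f}")    # 9.93, pulled by the extremes
print(f"median: {np.median(peak_lengths):.2f}")  # 9.00, robust to them
```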

For the sample median $Me(x)$ to be a point estimate of the population median $Me$ (with a small bias), the distribution of the sample values $x$ should be normal. Nonetheless, the asymptotic distribution of the sample median under the classical definition is well known to be normal for absolutely continuous distributions only, and not for discrete distributions (such as time series). A solution to this problem is to employ the definition of the sample median based on mid-distribution functions [35], under which the sample median has an asymptotically normal distribution, meaning that $Me(x) \approx Me$. Computing the sample median of the signal peak lengths then gives us the optimal size of the 1st convolutional layer filter.
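A sketch of this estimation step is given below; `scipy.signal.find_peaks`/`peak_widths` is one plausible way to measure peak lengths, and the signal-loading helper is a hypothetical placeholder for the dataset access described in Section 4.2.2.

```python
import numpy as np
from scipy.signal import find_peaks, peak_widths

def sample_peak_lengths(signals, n_peaks=30, seed=0):
    """Collect peak lengths from 1D signals and sample n_peaks of them."""
    rng = np.random.default_rng(seed)
    lengths = []
    for s in signals:
        peaks, _ = find_peaks(s)
        if len(peaks) == 0:
            continue                  # keep only signals containing a peak
        # width of each peak measured at its base, in samples
        lengths.extend(peak_widths(s, peaks, rel_height=1.0)[0])
    return rng.choice(np.asarray(lengths), size=n_peaks, replace=False)

# `load_random_signals` is a hypothetical loader for 30 random signals.
# signals = load_random_signals(n=30, domain="time")
# k1 = int(round(np.median(sample_peak_lengths(signals))))  # filter size
```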

### 4.2.2 Experiments

Experiments are conducted on the SMM recognition task. The dataset and experimental setup are the same as in Section 3.2.2. The inputs are either time-domain acceleration signals of size 90 × 1 × 9 for time-domain CNN training or frequency-domain signals of size 50 × 1 × 9 for frequency-domain CNN training. The goal is to find the optimal size of the 1st convolutional layer filter for both the time-domain and frequency-domain CNNs.
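To show where the estimated filter size enters the model, here is a minimal PyTorch sketch of a time-domain CNN whose first layer takes the median peak length as its kernel size; the filter counts, depth and class count are assumptions, not the exact architecture used in the experiments.

```python
import torch
import torch.nn as nn

class TimeDomainCNN(nn.Module):
    """Sketch of a 1D CNN for 90 x 1 x 9 time-domain inputs (width 90,
    9 channels). k1 is the adaptive 1st-layer filter size."""

    def __init__(self, k1=9, in_channels=9, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 16, kernel_size=k1),  # adaptive receptive field
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=3),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.classifier = nn.LazyLinear(n_classes)  # infers the flattened size

    def forward(self, x):              # x: (batch, 9, 90)
        return self.classifier(self.features(x).flatten(1))

model = TimeDomainCNN(k1=9)            # k1 = 10 for the frequency-domain CNN
logits = model(torch.randn(4, 9, 90))  # -> shape (4, 2)
```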

As explained in the methodology, the first step in determining the size of the 1st convolutional layer filter is to collect 30 random signals (for each of the time- and frequency-domain SMM signals) that contain at least one peak and to randomly pick 30 peaks from these signals. Histograms (a) and (b) of Figure 1 show the frequency distributions of the 30 peak lengths for time- and frequency-domain signals, respectively. The computed time- and frequency-domain medians (9 and 10, respectively) are then used as the optimal sizes of the 1st convolutional layer filter for the time- and frequency-domain CNNs, respectively.
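Figure 1 can be reproduced schematically as follows; the peak-length arrays are synthetic stand-ins for the 30 sampled values per domain.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Synthetic stand-ins for the 30 sampled peak lengths per domain; the real
# values come from the sampling procedure of Section 4.2.1.
time_peaks = rng.integers(6, 13, size=30)
freq_peaks = rng.integers(7, 14, size=30)

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
for ax, peaks, title in [(axes[0], time_peaks, "Time domain"),
                         (axes[1], freq_peaks, "Frequency domain")]:
    ax.hist(peaks, bins=range(5, 16))          # frequency distribution
    ax.axvline(np.median(peaks), color="r")    # median = chosen filter size
    ax.set(title=title, xlabel="Peak length (samples)")
plt.tight_layout()
plt.show()
```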


### 4.2.3 Results

In order to demonstrate the efficiency of this adaptive 1st convolutional layer filter approach, we run experiments on different time- and frequency-domain CNN architectures, varying the size of the 1st convolutional layer filter between 7 and 11 in both architectures. Performance rates in terms of the F1-score metric are displayed in Figure 2. In the time domain, increasing the size of the 1st convolutional layer filter from 7 (≈ a time span of 0.078 s) to 9 (≈ a time span of 0.1 s) results in an increase of 3.26%, while increasing the filter size from 9 to 10 (≈ a time span of 0.11 s) or 11 (≈ a time span of 0.12 s) diminishes the performance of the network. Therefore, the optimal size of the 1st convolutional filter is equal to the sample median of the signal peak lengths, suggesting that 0.1 s is the best time span for the 1st convolutional layer to capture whole acceleration peaks and acceleration changes. Similarly, in the frequency domain, the 1st convolutional layer kernel yielding the highest F1-score is the one of size 10, which is exactly the sample median ($Me(x) = 10$). These results thus confirm the superiority of the adaptive 1st convolutional layer filter approach for both time- and frequency-domain signals.
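The sweep behind Figure 2 has roughly the following shape; `train_and_eval_f1` is a hypothetical stand-in for the full training/evaluation loop (here it returns a random score so the sketch runs), and the ~90 Hz rate is inferred from the reported time spans (9 samples ≈ 0.1 s).

```python
import numpy as np

rng = np.random.default_rng(0)

def train_and_eval_f1(k1: int) -> float:
    """Hypothetical stand-in: train a CNN with 1st-layer filter size k1
    and return its F1-score. Replaced by a random value in this sketch."""
    return float(rng.uniform(60, 90))

# Sweep the 1st convolutional layer filter size, as in Figure 2.
scores = {k1: train_and_eval_f1(k1) for k1 in range(7, 12)}
for k1, f1 in scores.items():
    # at ~90 Hz, a k1-tap filter spans k1 / 90 seconds of signal
    print(f"k1={k1:2d}  span={k1 / 90:.3f} s  F1={f1:.2f}")
best_k1 = max(scores, key=scores.get)  # expected to land on the sample median
```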

#### Figure 1.

(a) and (b) Histograms and box plots of the frequency distribution of 30 peak lengths present within 30 randomly selected time- and frequency-domain signals, respectively.


#### Figure 2.


Effect of the size of the 1st convolutional layer kernel on SMM recognition performance.


#### Table 2.

Comparative results (F1-scores) between the CNN using the adaptive 1st convolutional filter approach ("Time-domain CNN") and the CNN of Rad et al. [30] ("CNN-Rad"). Columns S1-S6 are the subjects of Study 1; the following columns S1-S5 are the subjects of Study 2.

| Model | S1 | S2 | S3 | S4 | S5 | S6 | S1 | S2 | S3 | S4 | S5 | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Time-domain CNN | 91.23 | 76.76 | 84.95 | 93.38 | 86.41 | 95.11 | 95.97 | 75.67 | 60.17 | 91.68 | 82.55 | 84.90 |
| CNN-Rad [30] | 71 | 73 | 70 | 92 | 68 | 94 | 68 | 22 | 2 | 77 | 75 | 64.73 |


Furthermore, another way to show the efficiency of the adaptive 1st convolutional layer filter approach is to compare the performance of our time-domain CNN with the CNN of Rad et al. [30], which was trained on the same dataset as ours (in the time domain). Table 2 displays the F1-scores of CNNs trained per subject and per study using the optimal 1st convolutional layer filter size (denoted "Time-domain CNN") and using the architecture of [30] (referred to as CNN-Rad). These results show that our time-domain CNN scores 20.17% higher overall (mean F1-score) than the CNN of [30], confirming the efficiency of the adaptive convolutional layer filter.
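The overall figures can be verified directly from the per-subject scores in Table 2:

```python
import numpy as np

# Per-subject F1-scores from Table 2 (Study 1: S1-S6, then Study 2: S1-S5).
time_domain_cnn = [91.23, 76.76, 84.95, 93.38, 86.41, 95.11,
                   95.97, 75.67, 60.17, 91.68, 82.55]
cnn_rad = [71, 73, 70, 92, 68, 94, 68, 22, 2, 77, 75]

print(f"{np.mean(time_domain_cnn):.2f}")                     # 84.90
print(f"{np.mean(cnn_rad):.2f}")                             # 64.73
print(f"{np.mean(time_domain_cnn) - np.mean(cnn_rad):.2f}")  # 20.17
```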

## 4.3 CNN approach for tasks with limited annotated data

CNNs have so far yielded outstanding performance in several time series applications. However, this deep learning technique is a data-driven approach, i.e., a supervised machine learning algorithm that requires a large amount of labeled (i.e., annotated) data for proper training and for good convergence of its parameters. Although several labeled datasets have become available in recent years, some fields such as medicine suffer from a lack of annotated data, since manually annotating a large dataset requires human expertise and is time-consuming. For instance, labeling acceleration signals of autistic children as SMMs or non-SMMs requires the knowledge of a specialist. The conventional approach to this kind of problem is to perform data augmentation by applying transformations to the existing data, as shown in Section 3.1.2. Data augmentation achieves slightly better time series classification rates, but the CNN remains prone to overfitting. In this section, we present another solution to this problem: a "knowledge transfer" framework, a global, fast and lightweight framework that combines the transfer learning technique with an SVM classifier. This technique is then further applied to another type of SMM recognition task, which consists of recognizing SMMs across different atypical subjects rather than across multiple sessions within one subject (as performed in the experiments of Sections 3.2.2 and 4.2.2).
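As a preview of the pattern, here is a minimal sketch of combining a pretrained CNN, used as a frozen feature extractor, with an SVM; the pretrained model, its `features` attribute and the data variables are placeholder assumptions, not the framework's exact components.

```python
import numpy as np
import torch
from sklearn.svm import SVC

def extract_features(cnn: torch.nn.Module, x: torch.Tensor) -> np.ndarray:
    """Run inputs through the frozen convolutional stack of a pretrained
    CNN (assumed to expose a `features` sub-module) and flatten them."""
    cnn.eval()
    with torch.no_grad():              # transfer learning: no fine-tuning
        z = cnn.features(x)
    return z.flatten(1).numpy()

# Hypothetical usage on a small annotated target set:
# svm = SVC(kernel="rbf")
# svm.fit(extract_features(pretrained_cnn, X_train), y_train)
# preds = svm.predict(extract_features(pretrained_cnn, X_test))
```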
