4.1.2 CNN structure

The input to a convolutional layer is usually taken to be three-dimensional: the height, width and number of channels. In the first layer this input is convolved with a set of $M_1$ three-dimensional filters applied over all the input channels. In our case, we consider a one-dimensional time series $x = (x_t)_{t=0}^{N-1}$. Given a classification task and a model with parameter values $w$, the task for a classifier is to output the predicted class $\hat{y}$ based on the input time series $x(0), \ldots, x(t)$.

The output feature map from the first convolutional layer is then given by convolving each filter $w_h^1$ for $h = 1, \ldots, M_1$ with the input:

$$a^1(i,h) = \left(w_h^1 \ast x\right)(i) = \sum_{j=-\infty}^{\infty} w_h^1(j)\, x(i-j) \tag{3}$$

where $w_h^1 \in \mathbb{R}^{1 \times k \times 1}$ and $a^1 \in \mathbb{R}^{1 \times (N-k+1) \times M_1}$, $i$ is the index of the feature map along the second dimension ($i = 1, \ldots, N-k+1$) and $h$ is the index of the feature map along the third dimension ($h = 1, \ldots, M_1$). Note that since the number of input channels in this case is one, the weight matrix also has only one channel. As in the feedforward neural network, this output is then passed through the non-linearity $h(\cdot)$ to give $f^1 = h(a^1)$.
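As a small sketch (with illustrative sizes, not the chapter's actual configuration), the first-layer valid-width convolution of Eq. (3) followed by a non-linearity can be written with NumPy, where `np.convolve(..., mode="valid")` computes exactly the flipped-kernel sum over fully-overlapping positions and ReLU stands in for $h(\cdot)$:

```python
import numpy as np

rng = np.random.default_rng(0)

N, k, M1 = 128, 9, 4              # series length, filter size, number of filters (illustrative)
x = rng.standard_normal(N)        # one-channel input time series
w = rng.standard_normal((M1, k))  # the M1 first-layer filters w^1_h

# Eq. (3): a^1(., h) = (w^1_h * x), restricted to fully-overlapping positions,
# so each feature map has width N - k + 1.
a1 = np.stack([np.convolve(x, w[h], mode="valid") for h in range(M1)], axis=1)

# Non-linearity h(.): ReLU chosen here as an example.
f1 = np.maximum(a1, 0.0)

print(a1.shape)  # (120, 4), i.e. (N - k + 1, M1)
```

The stacking along `axis=1` reproduces the layout of $a^1 \in \mathbb{R}^{(N-k+1) \times M_1}$, with position as the first index and filter as the second.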

In each subsequent layer $l = 2, \ldots, L$, the input feature map $f^{l-1} \in \mathbb{R}^{1 \times N_{l-1} \times M_{l-1}}$, where $1 \times N_{l-1} \times M_{l-1}$ is the size of the output feature map from the previous convolution with $N_{l-1} = N_{l-2} - k + 1$, is convolved with a set of $M_l$ filters $w_h^l \in \mathbb{R}^{1 \times k \times M_{l-1}}$, $h = 1, \ldots, M_l$, to create a feature map $a^l \in \mathbb{R}^{1 \times N_l \times M_l}$,

$$a^l(i,h) = \left(w_h^l \ast f^{l-1}\right)(i) = \sum_{j=-\infty}^{\infty} \sum_{m=1}^{M_{l-1}} w_h^l(j,m)\, f^{l-1}(i-j,m) \tag{4}$$
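Written out directly, the nested sums of Eq. (4) take the following form (a sketch with illustrative shapes only; the inner loop over `m` is the sum over input channels):

```python
import numpy as np

rng = np.random.default_rng(1)

N_prev, k, M_prev, M_l = 20, 5, 3, 6             # illustrative sizes only
f_prev = rng.standard_normal((N_prev, M_prev))   # f^{l-1}: width x input channels
w = rng.standard_normal((M_l, k, M_prev))        # the M_l filters w^l_h

N_l = N_prev - k + 1                             # 'valid' output width
a_l = np.zeros((N_l, M_l))
for h in range(M_l):                             # output channel
    for i in range(N_l):                         # output position
        for j in range(k):                       # filter tap
            for m in range(M_prev):              # input channel (sum over m in Eq. 4)
                # flipped-kernel (true convolution) indexing of f^{l-1}(i - j, m)
                a_l[i, h] += w[h, j, m] * f_prev[i + k - 1 - j, m]

print(a_l.shape)  # (16, 6), i.e. (N_{l-1} - k + 1, M_l)
```

Each output channel is thus the sum, over the $M_{l-1}$ input channels, of one single-channel convolution, which is what ties all input channels into every filter response.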

This output is then passed through the non-linearity to give $f^l = h(a^l)$. The filter size parameter $k$ thus controls the receptive field of each output node. Without zero padding, in every layer the convolution output has width $N_l = N_{l-1} - k + 1$ for $l = 1, \ldots, L$. Since all the elements in the feature map share the same weights, features can be detected in a time-invariant manner, while at the same time the number of trainable parameters is reduced.

The output is then fed into a pooling layer (usually a max-pooling layer), which acts as a subsampling layer. The output map $p^l(h)$ of a feature map $h$ is obtained by computing maximum values over nearby inputs of the feature map as follows:

$$p^l(i,h) = \max_{r \in R} \left( f^l(i \times T + r,\, h) \right) \tag{5}$$

where $R$ is the pooling size, $T$ is the pooling stride, and $i$ is the index of the resulting feature map along the second dimension.

Multiple convolution, ReLU and pooling layers can be stacked on top of one another to form a deep CNN architecture. The output of these layers is then fed into a fully connected layer and an activation layer, so the output of the network after $L$ layers is the matrix $f^L$. Depending on what we want the model to learn, the weights are trained to minimize the error between the network output $f^L$ and the true output we are interested in, as measured by an objective function (loss function). For instance, a softmax layer can be applied on top, followed by the cross-entropy cost function, an objective function computed from the true labels of the training instances and the probabilistic outputs of the softmax function.

#### 4.2 CNN with the adaptive convolutional filter approach

In previous CNN works, several attempts have been made to extract the most relevant and meaningful features using different CNN architectures. While the works [17, 24] transformed the time series signals (by applying down-sampling, slicing, or warping) so as to help the convolutional filters (especially the 1st convolutional layer filters) capture entire peaks (i.e., whole peaks) and fluctuations within the signals, the work of [18] proposed to keep the time series data unchanged and instead feed them into three branches, each with a different 1st convolutional layer filter size, in order to capture the whole fluctuations within the signals. An alternative is to find an adaptive 1st convolutional layer filter which has the optimal size and is able to capture most of the entire peaks present in the input signals. With the most appropriate 1st convolutional layer filter, there is no need for multiple branches with different 1st convolutional layer filter sizes, and no need for transformations such as down-sampling, slicing and warping, so fewer computational resources are required. The question of how to compute this adaptive 1st convolutional layer filter is addressed in [4]. In this section, we discuss the approach based on the adaptive 1st convolutional layer filter. Then, to demonstrate the efficiency of this approach, an application to SMM recognition is conducted and the results are analyzed.

4.2.1 Methodology

In CNNs, multiple hyper-parameters must be chosen carefully in order to achieve the best classification rate, including model hyper-parameters, which define the CNN architecture, and optimization hyper-parameters such as the loss function, learning rate, etc. Model hyper-parameter values are generally chosen based on the literature and on a trial-and-error process (running experiments with multiple values). A conventional approach is to start with a CNN architecture which has already been adopted in a domain similar to ours, and then update the hyper-parameters by experimentation.

In our study, we focus on the convolutional layer filter (also known as the "receptive field"). Conventionally, the 1st convolutional layer filter has one of the following sizes: $3 \times 3$, $5 \times 5$, $7 \times 7$ and $10 \times 10$, where small filter sizes capture very fine details of the input while large ones leave out minute details. Ultimately, the goal is to set the receptive field size such that the filters detect complete features (e.g., entire variations) within an object, which could be edges, colors or textures within an image, or peaks within a signal. Choosing a small 1st layer filter is good for capturing a short variation (e.g., a signal peak) within a time series input, but it may capture only a slice of this variation by convolving only part of it, thus failing to detect the whole fluctuation within the signal. Conversely, a relatively large filter may convolve multiple signal peaks at once, so no individual fluctuation is detected either. The choice of the proper 1st layer receptive field size is therefore crucial for finding good CNN features and for maximizing recognition performance. In that sense, we need a 1st layer filter that suits the input signals and the variations present within them. In other words, we need to find the filter size which convolves most of each entire signal peak within the input signals. To this end, it is necessary to find the optimal length of the signal peaks present within the input signals. To do so, we apply sampling, a statistical procedure that uses characteristics of a sample (i.e., sample peak lengths taken from randomly selected signals) to estimate the characteristics of the population (i.e., the optimal peak length over all signals). The sample statistic can be the mean, median or mode.

Given a population of signals with mean $\mu$ and $n$ random values $x$ ($n$ being sufficiently large: $n \geq 30$) sampled from the population with replacement, the mean $E(x)$ is a point estimate of $\mu$, with $E(x) = \mu$ where $E(x) = \frac{1}{n}\sum_{i=1}^{n} x_i$. However, using the sample mean $E(x)$ as the optimal length of peaks within the time series signals (in the time and frequency domains) and as the size of the 1st convolutional layer filter gives poor CNN performance, since it is influenced by outliers or extreme values (i.e., by signal peaks whose lengths are either too small or too big). In this case, we use the sample median instead.

For the sample median $Me(x)$ to be a point estimate of the population median $Me$ (with a small bias), the distribution of the sample values $x$ should be normally distributed. Nonetheless, the asymptotic distribution of the sample median in the

CNN Approaches for Time Series Classification DOI: http://dx.doi.org/10.5772/intechopen.81170

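The median-based choice of the 1st-layer filter size discussed in this section can be illustrated with a small sketch. All numbers below are invented for illustration (they are not the chapter's measured peak lengths): a sample of peak lengths drawn from a normal distribution is contaminated with a few very short and very long peaks, and the median resists the outliers that pull the mean away.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative sample of peak lengths (in samples) measured from
# randomly selected signals; the outliers mimic abnormally short/long peaks.
peak_lengths = np.concatenate([
    rng.normal(loc=15, scale=2, size=40).round(),   # typical peaks
    np.array([2.0, 3.0, 60.0, 75.0]),               # outlier peak lengths
])

mean_len = peak_lengths.mean()        # pulled upward by the long outliers
median_len = np.median(peak_lengths)  # robust to the outliers

# Use the (rounded) sample median as the 1st convolutional layer filter size.
filter_size = int(round(median_len))
print(mean_len, median_len, filter_size)
```

With such contaminated samples the mean overshoots the typical peak length, whereas the median stays near it, which is why the section selects the sample median as the point estimate for the receptive field size.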