**Bayesian Sequential Learning for EEG-Based BCI Classification Problems**

S. Shigezumi, H. Hara, H. Namba, C. Serizawa,

Y. Dobashi, A. Takemoto, K. Nakamura and

T. Matsumoto

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/56146

### **1. Introduction**

Non-invasive Brain-Computer Interfaces (BCIs) have been an active research area in which several different methods have been developed, including Electroencephalography (EEG), Near-infrared Spectroscopy (NIRS), and functional MRI (fMRI), among others [1]. Of these, EEG is one of the most studied methods, mainly due to its fine temporal resolution, ease of use, and relatively low set-up cost. Each BCI method naturally has its own advantages and disadvantages, and EEG is no exception.

One of the main disadvantages of an EEG-based BCI is its susceptibility to noise. This has motivated the development of a variety of machine-learning algorithms for decoding EEG signals, and there have been significant advances in the area [2].

One way of categorizing machine-learning algorithms for BCI is **batch** mode versus **sequential (online)** mode. In batch mode learning, the EEG data collectively acquired from a subject are divided into two subsets: training data and test data. The former is used for training the machine-learning algorithm, whereas the latter is used to evaluate the algorithm's capability to predict the subject's intention [2]-[4]. There are several facets of batch mode learning which call for improvement:

**1.** First, it is non-trivial to decide how much data should be used for training and how much should be left for testing. It should also be noted that the amount of training data necessary may depend on the subject.

© 2013 Shigezumi et al.; licensee InTech. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


**2.** Second, with batch mode learning, by definition, one cannot perform sequential evaluations of predictive performance as time evolves.

**3.** Third, batch mode learning presumes that the data is stationary, i.e., that the subject's physical condition and/or the environment around the subject does not change over the period of the experiment.

In contrast, the sequential learning algorithm considered in this study starts learning with the very first single-trial datum and proceeds with the learning each time a single-trial datum arrives, within a Bayesian framework.

This paper proposes a Bayesian sequential learning algorithm for steady-state visual evoked potential (SSVEP) classification problems in a principled manner. In particular, the paper performs the following:

**a.** Evaluation of the *sequential posterior distribution* of unknown parameters each time a trial is performed.

**b.** Computation of the *sequential predictive distribution* of the class label at each trial based on the posterior distribution obtained above.

**c.** *Automatic hyperparameter learning*, where the hyperparameter in this study corresponds to the search region volume in the unknown parameter space.

**d.** Sequential evaluation of the error between the true label and the predicted label.

**e.** Sequential evaluation of the *marginal likelihood*, which quantifies the reliability of the prediction at each trial.

**f.** Experiments are performed on a four-class problem in addition to a two-class problem, where the extension from the latter to the former is nontrivial.

**g.** Formulation of the problem using a nonlinear model to capture potential nonlinearities, which can easily be extended to more difficult problems.

#### **2. Related work**

There are three ingredients in this study: (i) SSVEP, (ii) sequential (online) learning, and (iii) Sequential Monte Carlo implementations. The descriptions that follow will be given in terms of these keywords. For the batch mode learning, we cite the survey paper [2] instead of citing individual papers.

Allison et al. [5] performed a demographic study of several different BCI methods and showed that an SSVEP-based BCI spelling system was competitive for different age groups, as well as different gender groups, with little experience, under noisy environments. It is also reported that most subjects stated that they did not consider the flickering stimuli annoying and would use or recommend such a BCI system. In [6], and also in [7], an SSVEP-based orthosis control system with an LED light source is proposed. The flickering frequencies were 6 Hz, 7 Hz, 8 Hz, and 13 Hz in the former, and 8 Hz and 13 Hz in the latter. Classification was performed using the second- and third-order harmonics in addition to the fundamental frequency component. In [8], an SSVEP-based speller is proposed. After Principal Component Analysis (PCA), the probability of each frequency component is estimated using a particular information matrix. The speller introduces a selection based on a decision tree and an undo command for correcting errors. In [9], EEG signals are represented in the spatial-spectral-temporal domain by a wavelet transform. It also uses a multi-linear discriminative subspace obtained by general tensor discriminant analysis (GTDA); the classification is conducted by a support vector machine (SVM). Reference [10] proposes a biphasic stimulation technique to solve the issue of phase drifts in phase-tagged SSVEP systems. The Kalman filter is used in [11] to decode neural activity and estimate the desired movement kinematics, where the filter gain is approximated by its steady-state form, computed offline before real-time decoding commences. Canonical correlation analysis (CCA) is used in [12] to analyze the frequency components of SSVEP, where the correlations between the target oscillation waveforms, as well as their higher-order harmonics, and the acquired SSVEP waveforms are calculated. It is demonstrated that the scheme performed better than a fast Fourier transform-based spectrum estimation method. CCA is used in an online manner by updating the parameters each time data arrives. An online learning scheme called Stochastic Meta-Descent (SMD), which is a generalization of the gradient descent algorithm, is proposed in [13]. The paper also discusses various aspects of errors incurred in online learning algorithms.

The Subject Specific Classification Model is discussed in [14], where the model's Gaussian parameters are updated online after an initial learning of the Subject Independent Classification Model from a pool of subjects. The data were taken from the P300, which is one component of the EEG.

Martinez et al. [4] propose an SSVEP-based online BCI system with visual neurofeedback. The algorithm is different from the one proposed in this paper. There is a report on Bayesian sequential learning for EEG signals [6], where Sequential Monte Carlo is used for implementation. In addition to the task differences between [6] and this study, there are several algorithmic distinctions between the two. First, the basis function used in the former is linear with respect to the associated parameters, whereas the latter uses a basis function in which the parameters appear in a nonlinear manner. Second, the parameter that controls the size of the parameter search region is fixed in the former, whereas it is learned automatically in the latter. These two differences, at least within our experience, are important for achieving better performance.

In addition, no extension to multi-class classification problems was performed in the former, whereas both the algorithmic extension and a learning experiment using multi-class real EEG data are performed in the latter. Our proposed algorithm is based on part of earlier work [16] by the authors' research group for a different application. Preliminary results on a two-class classification problem were reported by the present group in [17]. The current paper gives a full account of the results by expanding several parts of that conference paper. First, a four-class classification problem was formulated and tested experimentally. Second, the algorithm was further improved by incorporating an Effective Sample Size and Rao-Blackwellisation. Third, more detailed discussions are added.

#### **3. Subjects and data acquisition**

Of the Event Related Potentials used in BCI, the target quantities considered in this study are SSVEPs, which are natural responses to visual stimulation at specific frequencies. These frequency components and their higher-order harmonics can be observed in the occipital region [4]. It is known that SSVEPs are often useful in research because of the reasonable signal-to-noise ratio and relative immunity to artifacts [5].

In an attempt to perform two-class and four-class classification problems, we gathered two sets of SSVEP data. The settings that we describe below for the two experiments are the same except for the number of stimuli and their frequencies. EEG data were recorded by a Polymate (Nihon Koden, Tokyo) with six active electrodes (O1, OZ, O2, O9, IZ, O10) according to the international 10-10 system and referenced to the left earlobe with a digitization rate of 500 Hz. Even though the highest flickering frequency was 10 Hz, we considered second and third order harmonics in one of the experiments reported below. Our original intention was to examine the harmonics higher than three even though they were not reported in this paper. In order to have a wide margin for the Nyquist frequency we chose 500 Hz. Five volunteers (aged 21-23 years) participated in the present study. All subjects were healthy, with no past history of psychiatric or neurological disorders.

Written informed consent was obtained from each subject on forms approved by the ethical committee of Waseda University.

Each subject was seated in a comfortable chair 60 cm in front of the monitor in an electrically shielded and dimmed room. The flow of task events is shown in Figure 1. The stimulus for the two-class problem is illustrated in Figure 2, whereas that of the four-class problem is illustrated in Figure 3.

**Figure 1.** Task flow for the four-class classification problem

**Figure 2.** Monitor display for the two class classification problem


**Figure 3.** Monitor display of the four-class classification problem

In the two-class problem, two flickering checkerboard stimuli (left and right) were presented on the monitor, whereas in the four-class problem, four flickering checkerboard stimuli (left, right, top, and bottom) were presented. In addition, a fixation cross was placed at the center, on which the subject was usually asked to fixate.

In the two-class problem, the left stimulus was a checkerboard flickering at a frequency of 4.29 Hz, whereas the right stimulus flickered at a frequency of 6.00 Hz. In the four-class problem, there were additional stimuli, one at the top with a frequency of 7.50 Hz, and one at the bottom with a frequency of 10.0 Hz. There were three reasons for selecting these particular frequencies. First, it is known that SSVEPs are discernible in approximately the 4.0 Hz - 50 Hz band [18][19]. Second, higher harmonics of a particular frequency component should not overlap with another fundamental frequency component. Such overlap could give rise to a problem when one considers multi-class classification problems where multiple frequencies are involved, as described in section **6**. Third, since the monitor refresh rate is 60 Hz, the choice of flickering frequencies was restricted to 60 divided by a positive integer. We chose 60/14 = 4.2857…, which we approximated as 4.29.
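The refresh-rate constraint can be checked numerically. The following sketch is our own illustration (not from the chapter): it lists the frequencies of the form 60/n that fall in the reported 4-50 Hz SSVEP band and verifies that, for the four chosen frequencies, no second- or third-order harmonic collides with another stimulus's fundamental.

```python
# Candidate flicker frequencies on a 60 Hz monitor: 60 / n for integer n,
# restricted to the ~4-50 Hz band where SSVEPs are discernible.
candidates = [60.0 / n for n in range(2, 15) if 4.0 <= 60.0 / n <= 50.0]

chosen = [60 / 14, 60 / 10, 60 / 8, 60 / 6]  # ~4.29, 6.00, 7.50, 10.0 Hz

def harmonic_overlap(freqs, max_harmonic=3, tol=0.01):
    """Return (f, g, m) triples where the m-th harmonic of f collides
    with another stimulus's fundamental frequency g."""
    clashes = []
    for f in freqs:
        for g in freqs:
            if f == g:
                continue
            for m in range(2, max_harmonic + 1):
                if abs(m * f - g) < tol:
                    clashes.append((f, g, m))
    return clashes

print(round(60 / 14, 2))          # 4.29, as approximated in the text
print(harmonic_overlap(chosen))   # [] -- no fundamental collisions
```

The empty result confirms that the second and third harmonics of each chosen frequency miss all the other fundamentals, which is the overlap criterion stated above.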

The subject usually fixated at a central fixation cross. When an arrow replaced the cross, the subject should move his or her eyes to the checkerboard indicated by the arrow for 3.0 s, after which a red circle is shown so that the subject would know when to rest for 5.0 s. This sequence was one trial, and trials were repeated twenty times, constituting one session. The direction of the arrow was selected at random. Each subject completed 600 trials, or 30 sessions. The measurements were performed with a Polymate AP1124, a multi-purpose portable bioamplifier recording device, manufactured by TEAC Corporation, Tokyo, Japan. The device is equipped with 24 channels with a maximum sampling frequency of 1 kHz. In addition to electroencephalograms (EEGs), eyeball movement and other external signals can be measured. The dimensions are W90 mm x H 44 mm x D 158 mm, the weight is 300 g, and the device is powered by battery.

#### **4. Algorithm**

This section gives a description of the proposed sequential learning algorithm. It consists of several aspects: (i) the basis function to fit the data, (ii) the likelihood function, (iii) sequential parameter learning, and (iv) sequential hyperparameter learning. The actual predictive values are given by the predictive distribution of the target class labels, which will be described in **4.3**. In order to improve the learning capabilities, Rao-Blackwellisation will also be described. We begin with a two-class classification problem followed by a multi-class classification problem. A schematic diagram of the proposed algorithm is given in Figure 4.
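Before the formal development, the per-trial loop that these ingredients form can be sketched on a toy problem. The following is our own minimal structural illustration (a single scalar parameter, a logistic likelihood, and helper names such as `step` are ours), not the chapter's full algorithm:

```python
import math
import random

random.seed(0)

def sigmoid(a):
    """Numerically stable logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    ea = math.exp(a)
    return ea / (1.0 + ea)

# Particle approximation of the posterior over a single scalar weight w.
N = 500
particles = [random.gauss(0.0, 1.0) for _ in range(N)]
weights = [1.0 / N] * N

def step(x, y, gamma=10.0):
    """One trial: random-walk propagation, label prediction, posterior
    reweighting, and resampling. Returns the predictive P(y=1 | x) and
    the per-trial marginal likelihood."""
    global particles, weights
    # stochastic parameter search: Gaussian random walk with precision gamma
    particles = [w + random.gauss(0.0, 1.0 / math.sqrt(gamma)) for w in particles]
    # sequential predictive distribution of the class label
    p1 = sum(wt * sigmoid(w * x) for w, wt in zip(particles, weights))
    # sequential posterior: reweight particles by the Bernoulli likelihood
    lik = [sigmoid(w * x) if y == 1 else 1.0 - sigmoid(w * x) for w in particles]
    marginal = sum(wt * l for wt, l in zip(weights, lik))
    weights = [wt * l / marginal for wt, l in zip(weights, lik)]
    # multinomial resampling to fight weight degeneracy
    particles = random.choices(particles, weights=weights, k=N)
    weights = [1.0 / N] * N
    return p1, marginal

# synthetic trials generated from a "true" weight of +2
for _ in range(60):
    x = random.gauss(0.0, 1.0)
    y = 1 if random.random() < sigmoid(2.0 * x) else 0
    step(x, y)

p1, _ = step(1.0, 1)
print(p1 > 0.5)  # the learned predictive favours the correct label
```

Each call to `step` performs, in miniature, the predict-then-update cycle of the sequential scheme: propagate the parameter samples, predict the label, reweight by the likelihood, and resample.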

#### **4.1. Two-class classification problem**

Let $x_k \in \mathbb{R}^d$ be the feature vector at the $k$-th trial, where $d$ is the dimension of $x_k$; in our paper, $x_k$ is the DFT spectrum of a single-trial EEG. Let $y_k \in \{0,1\}$ be the binary class label of each trial, where 0 corresponds to the right flickering image and 1 corresponds to the left flickering image. Our purpose here is to learn the parameters associated with the basis function, to be defined shortly, and to predict the subject's intention each time an SSVEP datum arrives.
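Since the features are single-trial DFT spectra, the extraction step can be sketched concretely. This is our own minimal illustration (the function names, the 3 s window, and the restriction to stimulus frequencies and their harmonics are our assumptions, not the chapter's exact pipeline):

```python
import math

FS = 500.0          # sampling rate used in the experiments (Hz)
DURATION = 3.0      # cue period per trial (s)
N = int(FS * DURATION)

def dft_amplitude(signal, freq, fs=FS):
    """Single-bin DFT amplitude at `freq` (a Goertzel-style correlation)."""
    n = len(signal)
    re = sum(s * math.cos(2 * math.pi * freq * k / fs) for k, s in enumerate(signal))
    im = sum(s * math.sin(2 * math.pi * freq * k / fs) for k, s in enumerate(signal))
    return 2.0 * math.hypot(re, im) / n

def feature_vector(signal, stim_freqs=(60 / 14, 6.0, 7.5, 10.0), harmonics=2):
    """Amplitudes at each stimulus frequency and its harmonics -> x_k."""
    return [dft_amplitude(signal, m * f)
            for f in stim_freqs for m in range(1, harmonics + 1)]

# Synthetic check: a pure 6 Hz tone should dominate the 6 Hz feature,
# which sits at index 2 of the feature vector.
t = [k / FS for k in range(N)]
sig = [math.sin(2 * math.pi * 6.0 * tk) for tk in t]
x = feature_vector(sig)
print(x.index(max(x)))  # 2
```

With a 3 s window, the 6 Hz component completes an integer number of cycles, so its amplitude estimate is essentially exact while the other bins see only spectral leakage.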

#### *4.1.1. Basis function and classifier*

Consider the parameterized family of nonlinear basis functions *f* (∙ ) defined by:


**Figure 4.** Schematic diagram of the proposed algorithm

$$f\left(\mathbf{x}_{k};\omega_{k}\right)=\sum_{j=1}^{h} v_{k,j}\,\sigma\!\left(\sum_{i=1}^{d} u_{k,i,j}\,x_{k,i}+u_{k,0,j}\right)+v_{k,0}, \tag{1}$$

where $\omega_k := (u_k, v_k)$, $u_k := (u_{k,0}, \dots, u_{k,d})^T \in \mathbb{R}^{h(d+1)}$ with $u_{k,i} := (u_{k,i,1}, \dots, u_{k,i,h})^T \in \mathbb{R}^{h}$, and $v_k := (v_{k,0}, \dots, v_{k,h})^T \in \mathbb{R}^{h+1}$. The function $\sigma(\cdot)$ is a sigmoidal function defined by $\sigma(a) = \frac{1}{1 + \exp(-a)}$, where $h$ represents the number of hidden units.

Other popular basis functions often work as well. It should be noted that this basis function is nonlinear with respect to $u_k$ as well as $x_k$, which enables capturing of potential nonlinearities involved.

In order to associate the quantity defined by (1) with the class label, consider:

$$P\left(y_k \mid \mathbf{x}_{k}, \omega_k\right) := \mathrm{Be}\!\left(y_{k};\, \Phi\!\left(f\left(\mathbf{x}_{k}; \omega_k\right)\right)\right), \tag{2}$$

where Φ is a function which monotonically maps the real numbers onto [0,1]. Several choices of Φ are possible. One is:

$$\Phi(u) := \frac{1}{1+\exp(-u)}, \tag{3}$$

while another is:

$$\Phi(u) := \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{u} \exp\left(-a^2/2\right) da. \tag{4}$$

We tested both functions and found them to work equally well for our SSVEP learning. In what follows, we will report our results with (4) by introducing a latent variable *zk* and considering

$$P\left(y_{k} \mid \mathbf{x}_{k}, \omega_k\right) = \int P\left(y_{k}, z_k \mid \mathbf{x}_{k}, \omega_k\right) dz_{k}, \tag{5}$$

$$P\left(y_{k}, z_{k} \mid \mathbf{x}_{k}, \omega_k\right) = P\left(y_k \mid z_k\right) P\left(z_k \mid \mathbf{x}_{k}, \omega_k\right), \tag{6}$$

$$P\left(y_k \mid z_k\right) := \mathrm{Be}\!\left(I\left(z_k \ge 0\right)\right), \tag{7}$$

$$P\left(z_{k} \mid \mathbf{x}_{k}, \omega_k\right) := \mathcal{N}\!\left(z_{k};\, f\left(\mathbf{x}_{k}; \omega_k\right), 1.0\right), \tag{8}$$

where *I*(*A*) represents an indicator function defined as 1 when *A* is *true* and 0 when *A* is *false*.
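The model (1)-(8) can be written compactly in code. The following is our own minimal sketch under small illustrative dimensions (two hidden units, three features); `norm_cdf` plays the role of the probit choice (4), and all function names are ours:

```python
import math
import random

def sigma(a):
    """Hidden-unit sigmoid: sigma(a) = 1 / (1 + exp(-a)), computed stably."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    ea = math.exp(a)
    return ea / (1.0 + ea)

def f(x, u, v):
    """Basis function (1): h hidden units, d-dimensional input x.
    u[j] holds (u_{k,1,j}, ..., u_{k,d,j}, u_{k,0,j}); v = (v_1, ..., v_h, v_0)."""
    h = len(u)
    out = v[h]  # output bias v_{k,0}
    for j in range(h):
        pre = u[j][len(x)] + sum(u[j][i] * x[i] for i in range(len(x)))
        out += v[j] * sigma(pre)
    return out

def norm_cdf(a):
    """Probit link (4): the standard normal CDF."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

def likelihood(y, x, u, v):
    """Bernoulli likelihood (2): P(y | x, omega) = Be(y; Phi(f(x; omega)))."""
    p1 = norm_cdf(f(x, u, v))
    return p1 if y == 1 else 1.0 - p1

# two hidden units, three-dimensional feature vector
random.seed(1)
u = [[random.gauss(0, 1) for _ in range(4)] for _ in range(2)]  # d+1 weights per unit
v = [random.gauss(0, 1) for _ in range(3)]                      # h+1 output weights
p = likelihood(1, [0.2, -0.1, 0.5], u, v)
print(0.0 < p < 1.0)  # True: a valid probability
```

Note that the two label probabilities sum to one by construction, matching the Bernoulli form of (2).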

#### *4.1.2. Parameter search stochastic dynamics*

In order to perform sequential learning, we perform a sequential stochastic search of the parameter *ωk* each time trial data is acquired:

$$P\left(\omega_k \mid \omega_{k-1}, \gamma_k\right) := \frac{1}{Z_{\omega}(\gamma_k)} \exp\!\left(-\frac{\gamma_k \left\|\omega_k - \omega_{k-1}\right\|^2}{2}\right), \tag{9}$$

where *Zω* represents the normalization constant. This amounts to searching for a new value *ωk* based on the previous value *ω<sup>k</sup>* -1, but in a random walk manner. This is a first-order Markov process, so that the parameters of the distant past are naturally forgotten because of the noise, whereas the parameters of the immediate past tend to be taken into account with higher weights. This stochastic parameter search is reflected in the posterior distributions (20) given sequential data. Since this transition probability is Gaussian, it involves *γ<sup>k</sup>* , which is the reciprocal of the variance parameter. More specifically, if *γk* is small, the parameter search region for *ωk* will be large, whereas if *γk* is large, the search region will be small.
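The random-walk search (9) amounts to adding zero-mean Gaussian noise with per-component variance $1/\gamma_k$ to every component of $\omega_{k-1}$. A minimal sketch (our own illustration) makes the role of $\gamma_k$ explicit:

```python
import math
import random

def propose_omega(omega_prev, gamma):
    """Draw omega_k ~ P(. | omega_{k-1}, gamma_k): an isotropic Gaussian
    random walk whose per-component variance is 1/gamma (eq. 9)."""
    std = 1.0 / math.sqrt(gamma)
    return [w + random.gauss(0.0, std) for w in omega_prev]

random.seed(0)
prev = [0.0] * 1000
small_gamma = propose_omega(prev, 0.1)    # wide search region
large_gamma = propose_omega(prev, 100.0)  # narrow search region

spread = lambda ws: sum(w * w for w in ws) / len(ws)  # empirical variance
print(spread(small_gamma) > spread(large_gamma))      # True
```

The empirical variances come out near 10 and 0.01 respectively, illustrating the statement that a small $\gamma_k$ widens the search region for $\omega_k$ and a large $\gamma_k$ narrows it.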

#### *4.1.3. Automatic hyperparameter search stochastic dynamics*

Other popular basis functions often work as well. It should be noted that this basis function is nonlinear with respect to *uk* as well as *xk* , which enables capturing of potential nonlinearities

whereΦ is a function which monotonically maps the real numbers onto [0,1]. Several choices

We tested both functions and found them to work equally well for our SSVEP learning. In what follows, we will report our results with (4) by introducing a latent variable *zk* and considering

where *I*(*A*) represents an indicator function defined as 1 when *A* is *true* and 0 when *A* is *false*.

In order to perform sequential learning, we perform a sequential stochastic search of the

*<sup>Z</sup>ω*(*γ<sup>k</sup>* ) exp(-

where *Zω* represents the normalization constant. This amounts to searching for a new value *ωk* based on the previous value *ω<sup>k</sup>* -1, but in a random walk manner. This is a first-order Markov process, so that the parameters of the distant past are naturally forgotten because of the noise,

*P*(*yk* | *xk* , *ω<sup>k</sup>* ) ∶ = *Be*(*yk* ;Φ( *f* (*xk* ;*ω<sup>k</sup>* ))), (2)

*P*(*yk* | *xk* , *ω<sup>k</sup>* ) =*∫P*(*yk* , *zk* | *xk* , *ω<sup>k</sup>* )*d zk* , (5)

*P*(*yk* , *zk* | *xk* , *ω<sup>k</sup>* ) =*P*(*yk* | *zk* )*P*(*zk* | *xk* , *ω<sup>k</sup>* ), (6)

*P*(*zk* | *xk* , *ω<sup>k</sup>* ) ∶ = *N* (*zk* ; *f* (*xk* ;*ω<sup>k</sup>* ), 1.0), (8)

*γ<sup>k</sup> ω<sup>k</sup>* - *ω<sup>k</sup>* -1

2

<sup>2</sup> ) (9)

*P*(*yk* | *zk* ) ∶ = *Be*(*I*(*zk* ≥0)), (7)

<sup>1</sup> <sup>+</sup> exp(-*u*) , (3)

*<sup>u</sup>* exp(-*a* <sup>2</sup> / 2)*da*. (4)

In order to associate the quantity defined by (1) with the class label, consider:

68 Brain-Computer Interface Systems – Recent Progress and Future Prospects

<sup>Φ</sup>(*u*) <sup>∶</sup> <sup>=</sup> <sup>1</sup>

2*π ∫* -∞

<sup>Φ</sup>(*u*) <sup>∶</sup> <sup>=</sup> <sup>1</sup>

involved.

of Φ are possible. One is:

*4.1.2. Parameter search stochastic dynamics*

parameter *ωk* each time trial data is acquired:

*<sup>P</sup>*(*ω<sup>k</sup>* <sup>|</sup> *<sup>ω</sup><sup>k</sup>* -1, *<sup>γ</sup><sup>k</sup>* ) <sup>∶</sup> <sup>=</sup> <sup>1</sup>

while another is:

Our experiences tell us that automatic adjustment of *γ<sup>k</sup>* is often important in order to achieve better performance. *γk* is often called a hyperparameter since it controls the behavior of the target parameter *ω<sup>k</sup>* . We perform the following automatic stochastic search of *γk* :

$$P\left(\gamma_k \mid \gamma_{k-1}\right) := \frac{1}{\sqrt{2\pi}\,\gamma_k\,\sigma_h} \exp\!\left(-\frac{(\log\gamma_k - \log\gamma_{k-1})^2}{2\sigma_h^2}\right), \tag{10}$$

There are at least two reasons for this transition probability to be log-normal. One is that $\gamma_k$ needs to be positive, and another is to cover a large range of values in the hyperparameter space. It should be noted that (10) is also a first-order Markov process, so that the hyperparameters of the immediate past are taken into account, whereas the hyperparameters of the distant past tend to be forgotten. In order to explain the importance of such hyperparameter learning, consider Figure 5. Letting $\theta_k := (\omega_k, \gamma_k)$, the blue region represents the likelihood function landscape in the $\theta$-space, where the darker the blue color, the higher the likelihood function value. The white diamonds represent samples $\theta_{t-1}^{(i)} \sim P(\theta_{t-1} \mid x_{1:t-1}, y_{1:t-1})$, the yellow diamonds $\theta_t^{*(i)} \sim P(\theta_t^{(i)} \mid \theta_{t-1}, \gamma_t^{(i)})$ with $\gamma_t^{(i)}$ large, and the light-green diamonds $\theta_t^{*(i)} \sim P(\theta_t^{(i)} \mid \theta_{t-1}, \gamma_t^{(i)})$ with $\gamma_t^{(i)}$ small. Now suppose that the likelihood function landscape in the $\theta$-space changed by a relatively large amount, as shown by the pink region, where the darker the color, the higher the likelihood. The yellow diamonds are scarce in the pink region, so that it is difficult to find $\theta$ samples that give rise to meaningful likelihood function values. This is due to the fact that $\gamma_t^{(i)}$ is large, so that the search region is restricted. If $\gamma_t^{(i)}$ is relatively small, on the other hand, then the green $\theta$ samples might capture at least a part of the pink region where the likelihood function values are meaningful. The proposed hyperparameter learning scheme automatically learns appropriate $\gamma_t^{(i)}$ values from the sequential data and lets the algorithm find reasonable $\theta$ samples.
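The log-normal dynamics (10) can be simulated directly; this sketch (our own illustration, with an assumed $\sigma_h = 0.5$) shows the two properties argued for above: positivity by construction and movement across a wide range of scales.

```python
import math
import random

def propose_gamma(gamma_prev, sigma_h=0.5):
    """One draw from the log-normal dynamics (10): a Gaussian random walk
    on log(gamma), so gamma stays positive and can move across orders of
    magnitude."""
    return math.exp(math.log(gamma_prev) + random.gauss(0.0, sigma_h))

random.seed(0)
gammas = [10.0]
for _ in range(200):
    gammas.append(propose_gamma(gammas[-1]))

print(min(gammas) > 0.0)                  # True: positivity by construction
print(max(gammas) / min(gammas) > 10.0)   # True: the walk spans a wide range
```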

#### *4.1.4. Rao-Blackwellised SMC*

In this paper, we implemented not only the standard SMC but also Rao-Blackwellised SMC (RBSMC) for the purpose of performance improvement. Rao-Blackwellisation is a statistical variance reduction strategy for the Monte Carlo method [25], [26]. It is a combination of analytical integration (marginalization) and the Monte Carlo method. In order to explain this,

**Figure 5.** The proposed hyperparameter learning scheme automatically finds the appropriate region in the *θ*-space. The blue region indicates the likelihood function landscape at time *t* - 1, whereas the pink region indicates the likelihood function landscape at time *t*. The darker the color, the higher the likelihood function value. The white diamonds represent samples *θ*<sub>*t*-1</sub><sup>(*i*)</sup> ~ *P*(*θ*<sub>*t*-1</sub> | *x*<sub>1:*t*-1</sub>, *y*<sub>1:*t*-1</sub>); the yellow diamonds, *θ<sub>t</sub>*<sup>\*(*i*)</sup> ~ *P*(*θ<sub>t</sub>*<sup>(*i*)</sup> | *θ*<sub>*t*-1</sub>, *γ<sub>t</sub>*<sup>(*i*)</sup>) with *γ<sub>t</sub>*<sup>(*i*)</sup> large; and the light-green diamonds, the same with *γ<sub>t</sub>*<sup>(*i*)</sup> small. The proposed scheme automatically learns appropriate *γ* values so that it can capture appropriate *θ* samples in relatively high-likelihood regions in the *θ*-space.

recall the parameters associated with the basis function (1), write *ω<sub>k</sub>* := (*u<sub>k</sub>*, *v<sub>k</sub>*), and decompose the stochastic search dynamics (9) into two parts:

$$P(u_k \mid u_{k-1}, \gamma_k) := \frac{1}{Z_{u_k}(\gamma_k)} \exp\left(-\frac{\gamma_k \left\| u_k - u_{k-1} \right\|^2}{2}\right) \tag{11}$$

$$P(v_k \mid v_{k-1}, \delta_k) := \frac{1}{Z_{v_k}(\delta_k)} \exp\left(-\frac{\delta_k \left\| v_k - v_{k-1} \right\|^2}{2}\right) \tag{12}$$

where there are two hyperparameters *γk* and *δ<sup>k</sup>* . The corresponding hyperparameter stochastic search dynamics will be given by:

$$P(\gamma_k \mid \gamma_{k-1}) := \frac{1}{\sqrt{2\pi}\,\gamma_k \sigma_h} \exp\left(-\frac{(\log \gamma_k - \log \gamma_{k-1})^2}{2\sigma_h^2}\right) \tag{13}$$

$$P(\delta_k \mid \delta_{k-1}) := \frac{1}{\sqrt{2\pi}\,\delta_k \sigma_h} \exp\left(-\frac{(\log \delta_k - \log \delta_{k-1})^2}{2\sigma_h^2}\right). \tag{14}$$

Since the basis function is linear with respect to *v<sub>k</sub>*, Rao-Blackwellisation can be conducted with the data augmentation of *Z<sub>k</sub>* [26], where the likelihood function *P*(*y<sub>k</sub>* | *x<sub>k</sub>*, *ω<sub>k</sub>*) is integrated out with respect to *v<sub>k</sub>*, which, in turn, gives rise to smaller variances. A specific implementation of this particular Rao-Blackwellisation will be given in subsection **5.2**.
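A minimal sketch of the decomposed search dynamics (11)-(14), under the reading that *γ<sub>k</sub>* and *δ<sub>k</sub>* act as precisions of the Gaussian random walks (the function name and the flat-list parameterization are our illustration):

```python
import math
import random

def step_dynamics(u_prev, v_prev, gamma_prev, delta_prev, sigma_h, rng=random):
    """One joint proposal: the hyperparameters move by the log-normal walks
    (13)-(14); the parameters move by the Gaussian random walks (11)-(12),
    whose step size shrinks as the precision (gamma_k or delta_k) grows."""
    gamma_k = math.exp(math.log(gamma_prev) + sigma_h * rng.gauss(0.0, 1.0))
    delta_k = math.exp(math.log(delta_prev) + sigma_h * rng.gauss(0.0, 1.0))
    u_k = [u + rng.gauss(0.0, 1.0) / math.sqrt(gamma_k) for u in u_prev]
    v_k = [v + rng.gauss(0.0, 1.0) / math.sqrt(delta_k) for v in v_prev]
    return u_k, v_k, gamma_k, delta_k
```

With a large precision the particle barely moves (a narrow search region), which is exactly the behavior illustrated in Figure 5.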

#### **4.2. Multi-class classification problem**

This section attempts to generalize the results of the previous section to multi-class problems. Although we will restrict ourselves to a four-class problem, the method can, in principle, be applied to cases with more than four classes.

Let *x<sub>k</sub>* ∈ *R<sup>d</sup>* be the feature vector at the *k*-th trial, which is the power spectrum obtained through DFT over the trial period, where *d* stands for the dimension of *x<sub>k</sub>*. Let *y<sub>k</sub>* ∈ {1, 2, 3, 4} denote the class label at each trial, where left corresponds to label 1; right, to label 2; top, to label 3; and bottom, to label 4. Our goal is to learn the parameters associated with the basis function described below in an attempt to predict the subject's intention.

#### *4.2.1. Basis function*

Consider the basis function *f<sub>q</sub>*(∙) defined by (15), which is nonlinear with respect to not only *x<sub>k</sub>* but also the parameter vector *ω<sub>k</sub>*. There are *Q* outputs associated with the basis function, where *Q* is the number of class labels, which is four in this paper. We have:

$$f_q(x_k; \omega_k) = \sum_{j=1}^{h} v_{k,jq}\, \sigma\left(\sum_{i=1}^{d} u_{k,ij} x_{k,i} + u_{k,0j}\right) + v_{k,0q}, \tag{15}$$

where *u<sub>k</sub>* := (*u*<sub>*k*,0</sub>, ⋯, *u*<sub>*k*,*d*</sub>)<sup>*T*</sup> ∈ *R*<sup>*h*(*d*+1)</sup>, *u*<sub>*k*,*i*</sub> := (*u*<sub>*k*,*i*1</sub>, ⋯, *u*<sub>*k*,*ih*</sub>)<sup>*T*</sup> ∈ *R<sup>h</sup>*, *v<sub>k</sub>* := (*v*<sub>*k*,0</sub>, ⋯, *v*<sub>*k*,*h*</sub>)<sup>*T*</sup> ∈ *R*<sup>*Q*(*h*+1)</sup>, and *ω<sub>k</sub>* = (*u<sub>k</sub>*, *v<sub>k</sub>*).
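Under the indexing of (15), the *Q* outputs can be sketched in plain Python, taking σ to be the logistic sigmoid (the function name and the nested-list weight layout are our illustration):

```python
import math

def basis_outputs(x, U, b_u, V, b_v):
    """Compute f_q(x; omega) for q = 1..Q, as in (15).
    U[i][j]: input-to-hidden weight u_{k,ij} (d x h); b_u[j]: bias u_{k,0j}.
    V[j][q]: hidden-to-output weight v_{k,jq} (h x Q); b_v[q]: bias v_{k,0q}."""
    sigmoid = lambda a: 1.0 / (1.0 + math.exp(-a))
    hidden = [sigmoid(sum(U[i][j] * x[i] for i in range(len(x))) + b_u[j])
              for j in range(len(b_u))]
    return [sum(V[j][q] * hidden[j] for j in range(len(hidden))) + b_v[q]
            for q in range(len(b_v))]
```

This is an ordinary single-hidden-layer perceptron with *h* hidden units and *Q* linear outputs; the nonlinearity in both *x<sub>k</sub>* and *ω<sub>k</sub>* comes entirely from the hidden-layer sigmoids.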

#### *4.2.2. Multinomial logistic model*

This paper assumes the multinomial logistic model for the target problems, where the error *ε*<sub>*k*,*q*</sub> in each term is assumed to follow an independently, identically distributed logistic distribution. By introducing a latent variable *z*<sub>*k*,*q*</sub>, we write:

$$z_{k,q} := f_q(x_k; \omega_k) + C_{k,q} + \varepsilon_{k,q}, \tag{16}$$

$$C_{k,q} := -\log \sum_{i \neq q}^{Q} \exp\left(f_i(x_k; \omega_k)\right), \tag{17}$$

where *C*<sub>*k*,*q*</sub> represents the score of the term controlled by the outputs of the other class labels. It follows from (15) that the probability of *y<sub>k</sub>* belonging to class *q* is described by:

$$P(y_k = q \mid z_{k,q}) = \frac{\exp\left(f_q(x_k; \omega_k)\right)}{\sum_{i=1}^{Q} \exp\left(f_i(x_k; \omega_k)\right)} = \sigma\left(f_q(x_k; \omega_k) + C_{k,q}\right). \tag{18}$$

The predicted label *ypred* is the label *qmax* that has the maximum value of *P*(*yk* =*q* | *zk* ,*<sup>q</sup>*). Using (18), the likelihood function is described by:

$$P(y_k \mid x_k, \omega_k) := \prod_{q=1}^{Q} P(y_k = q \mid z_{k,q})^{I(y_k = q)} \left(1 - P(y_k = q \mid z_{k,q})\right)^{I(y_k \neq q)}, \tag{19}$$

The function *I*(∙) is again the indicator described in **4.1**. The generalization to the multi-class problem (18)-(20) appears straightforward; however, our experience tells us that multi-class problems are much more difficult than the equations suggest. Experimental results are reported in **6.4**.
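Note that σ(*f<sub>q</sub>* + *C*<sub>*k*,*q*</sub>) in (18) is exactly the softmax of the *Q* outputs, which is also how one would evaluate it numerically; a brief sketch (function names are ours):

```python
import math

def class_probs(f):
    """Equation (18) in softmax form: P(y=q) = exp(f_q) / sum_i exp(f_i).
    Subtracting max(f) before exponentiating avoids overflow without
    changing the ratios."""
    m = max(f)
    exps = [math.exp(v - m) for v in f]
    s = sum(exps)
    return [e / s for e in exps]

def predict_label(f):
    """y_pred = q_max, the 1-based label with the largest probability."""
    p = class_probs(f)
    return 1 + p.index(max(p))
```

With the four-class labels {1, 2, 3, 4} used here, `predict_label` returns the paper's *q<sub>max</sub>* directly.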

#### *4.2.3. Parameter/hyperparameter search stochastic dynamics*

We use the same standard Sequential Monte Carlo (SMC) used in **4.1**.

#### **4.3. Bayesian Sequential Learning**

Letting *θ<sub>k</sub>* := (*ω<sub>k</sub>*, *γ<sub>k</sub>*, *δ<sub>k</sub>*), *x*<sub>1:*k*</sub> := (*x*<sub>1</sub>, ⋯, *x<sub>k</sub>*), and *y*<sub>1:*k*</sub> := (*y*<sub>1</sub>, ⋯, *y<sub>k</sub>*), one can derive the sequential posterior distribution at trial *k*:

$$P(\theta_k \mid x_{1:k}, y_{1:k}) = \frac{P(y_k \mid x_k, \theta_k)\, P(\theta_k \mid x_{1:k-1}, y_{1:k-1})}{\int P(y_k \mid x_k, \theta_k)\, P(\theta_k \mid x_{1:k-1}, y_{1:k-1})\, d\theta_k}. \tag{20}$$

The second factor in the numerator is the predictive probability for parameter *θ<sup>k</sup>* , which is given by:

$$P(\theta_k \mid x_{1:k-1}, y_{1:k-1}) = \int P(\theta_k \mid \theta_{k-1})\, P(\theta_{k-1} \mid x_{1:k-1}, y_{1:k-1})\, d\theta_{k-1}, \tag{21}$$

$$P(\theta_k \mid \theta_{k-1}) = P(\omega_k \mid \omega_{k-1}, \gamma_k)\, P(\gamma_k \mid \gamma_{k-1}). \tag{22}$$

At the (*k* + 1)-st trial, let the EEG data *x*<sub>*k*+1</sub> be given. The prediction at that trial then amounts to computing the predictive probability of the label:

$$P(y_{k+1} \mid x_{1:k+1}, y_{1:k}) = \int P(y_{k+1} \mid x_{k+1}, \theta_{k+1})\, P(\theta_{k+1} \mid x_{1:k}, y_{1:k})\, d\theta_{k+1}. \tag{23}$$
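The recursion (20)-(23) is analytically intractable for the nonlinear model here, which is why the paper turns to SMC in Section 5. On a finite grid of θ values, however, one Bayes step can be written out exactly, which may help fix ideas (the grid discretization is purely our illustration, not part of the proposed method):

```python
def bayes_step(prior, transition, likelihood):
    """One sequential update on a finite grid of theta values.
    prior[i]        : P(theta_{k-1} = theta_i | x_{1:k-1}, y_{1:k-1})
    transition[j][i]: P(theta_k = theta_j | theta_{k-1} = theta_i)
    likelihood[j]   : P(y_k | x_k, theta_j)"""
    n = len(prior)
    # (21): predictive distribution for theta_k
    pred = [sum(transition[j][i] * prior[i] for i in range(n)) for j in range(n)]
    # (20): Bayes update, with the integral replaced by a finite sum
    post = [likelihood[j] * pred[j] for j in range(n)]
    z = sum(post)
    return [p / z for p in post], z  # z approximates the marginal likelihood
```

The normalizing constant `z` is the grid analogue of the sequential marginal likelihood that reappears as equation (28) in Section 6.3.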

#### **5. Implementation**

The Sequential Monte Carlo (SMC) method is a powerful means of evaluating the posterior or predictive probabilities of Bayesian nonlinear or non-Gaussian models in a sequential manner. This study uses the SMC to evaluate equations (20) and (23) in an attempt to evaluate the SER. The SMC first approximates the posterior distribution by an empirical distribution (delta mass) weighted by normalized importance weights (importance sampling). In order to avoid depletion of samples, caused by an increase in the variance of the weights, the SMC replaces the weighted empirical distribution by unweighted delta masses (resampling).

There are several different methods of determining when resampling should be done. We tried two of them. One method is to resample at every step; the other is to perform resampling only when the Effective Sample Size (ESS, [22]-[24]) becomes smaller than a threshold value:

$$ESS = \frac{1}{\sum_{i=1}^{n} \left(\tilde{\Omega}_k^{(i)}\right)^2}, \tag{24}$$

where *n* is the number of samples, *i* is the index of a sample, and *Ω̃<sub>k</sub>*<sup>(*i*)</sup> is the normalized importance weight. The threshold value of the ESS is often set at *N* / 2 ([23],[24]), which we adopted.
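The ESS test of (24) is cheap to compute; a minimal sketch (helper names are ours):

```python
def effective_sample_size(weights):
    """ESS of (24): 1 / sum_i (w_i)^2 for normalized importance weights.
    Equal weights give ESS = n; one dominant weight drives ESS toward 1."""
    return 1.0 / sum(w * w for w in weights)

def should_resample(weights):
    """Resample when the ESS falls below the usual N/2 threshold ([23],[24])."""
    return effective_sample_size(weights) < len(weights) / 2.0
```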

Implementation of the standard SMC:

**(a) Importance sampling step**

(i) For *i* = 1, …, *n*, draw samples of *θ<sub>k</sub>*<sup>\*</sup> from the proposal density *Q*: (*θ<sub>k</sub>*<sup>\*(*i*)</sup>)<sub>*i*=1</sub><sup>*n*</sup> ~ *Q*(*θ<sub>k</sub>*<sup>\*</sup> | *x*<sub>1:*k*</sub>, *y*<sub>1:*k*</sub>).

(ii) For *i* = 1, …, *n*, compute the importance weight Ω<sub>*k*</sub><sup>(*i*)</sup>:

$$\Omega_k^{(i)} \propto \frac{P(y_k \mid x_k, \omega_k^{(i)})\, P(\theta_k^{(i)} \mid x_{1:k-1}, y_{1:k-1})}{Q(\theta_k^{(i)} \mid x_{1:k}, y_{1:k})}.$$

(iii) For *i* = 1, …, *n*, compute the normalized importance weight Ω̃<sub>*k*</sub><sup>(*i*)</sup>:

$$\tilde{\Omega}_k^{(i)} = \frac{\Omega_k^{(i)}}{\sum_{j=1}^{n} \Omega_k^{(j)}}, \qquad \sum_{i=1}^{n} \tilde{\Omega}_k^{(i)} = 1.$$

**(b) Resampling step**

(iv) Calculate the ESS (24); if ESS < *N* / 2, go to (v); otherwise, proceed to the next trial.

(v) Resample with replacement according to the weights Ω̃<sub>*k*</sub><sup>(*i*)</sup> and reset the weights to 1 / *n*.


**Figure 6.** Implementation of standard SMC.


#### **5.1. Standard SMC**

Figure 6 gives an overview of the standard SMC, where *θ<sub>k</sub>*<sup>\*</sup> := (*ω<sub>k</sub>*, *γ<sub>k</sub>*) and *Q*(∙) denotes the proposal distribution, which in this paper is set to *P*(*θ<sub>k</sub>*<sup>\*</sup> | *x*<sub>1:*k*-1</sub>, *y*<sub>1:*k*-1</sub>). This choice is due to its simplicity of implementation. Figure 6 demonstrates the case where resampling is done with the ESS. If resampling is performed at every step, the ESS step is simply ignored.
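With the predictive prior used as the proposal, the weight ratio in Figure 6 cancels down to the likelihood itself. A compact Python sketch of one such step (our naming; the real implementation works on (ω, γ) particles propagated through the search dynamics):

```python
import math
import random

def smc_step(particles, log_lik, rng=random):
    """One SMC step as in Figure 6, with proposal = predictive prior so the
    unnormalized weight is just the likelihood. particles: theta samples
    already propagated through the search dynamics; log_lik(theta) returns
    log P(y_k | x_k, theta)."""
    logs = [log_lik(p) for p in particles]
    m = max(logs)
    w = [math.exp(l - m) for l in logs]              # step (ii), up to a constant
    s = sum(w)
    w = [x / s for x in w]                           # step (iii): normalize
    if 1.0 / sum(x * x for x in w) < len(w) / 2.0:   # step (iv): ESS < N/2 ?
        particles = rng.choices(particles, weights=w, k=len(particles))
        w = [1.0 / len(particles)] * len(particles)  # step (v): uniform weights
    return particles, w
```

Working with log-likelihoods and subtracting the maximum keeps the exponentials from underflowing when one particle dominates.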

#### **5.2. Rao-Blackwellised SMC**

In the Rao-Blackwellised SMC implementation, the marginal likelihood *P*(*y<sub>k</sub>* | *z*<sub>1:*k*</sub>, *u*<sub>1:*k*</sub>) is used instead of the likelihood function *P*(*y<sub>k</sub>* | *x<sub>k</sub>*, *ω<sub>k</sub>*), where *Θ<sub>k</sub>* := (*ω<sub>k</sub>*, *γ<sub>k</sub>*, *δ<sub>k</sub>*). This implementation is described in Figure 7.

**Figure 7.** Implementation of Rao-Blackwellised SMC.

Here, the marginal likelihood *P*(*y<sub>k</sub>* | *z*<sub>1:*k*</sub>, *u*<sub>1:*k*</sub>) can be written as:


$$P(y_k \mid z_{1:k}, u_{1:k}) = \Phi\left(\frac{z_{k|k-1}}{\sqrt{S_k}}\right)^{y_k} \left(1 - \Phi\left(\frac{z_{k|k-1}}{\sqrt{S_k}}\right)\right)^{1-y_k}, \tag{25}$$


Details of updating *z*<sub>*k*|*k*-1</sub> and *S<sub>k</sub>* are given in the Appendix.
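Equation (25) only requires the standard normal CDF Φ, which is available through the error function; a hedged sketch with hypothetical names (the per-particle updates of *z*<sub>*k*|*k*-1</sub> and *S<sub>k</sub>* themselves are in the Appendix and are not reproduced here):

```python
import math

def normal_cdf(z):
    """Standard normal CDF Phi(z), computed via math.erf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def rb_marginal_likelihood(y, z_pred, s_pred):
    """Equation (25) for a 0/1 label y:
    Phi(z_{k|k-1} / sqrt(S_k))^y * (1 - Phi(.))^(1 - y)."""
    p = normal_cdf(z_pred / math.sqrt(s_pred))
    return p if y == 1 else 1.0 - p
```

In the Rao-Blackwellised weight computation, this scalar replaces the likelihood for each particle.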

#### **6. Results**


This section reports the results of learning experiments using the algorithms proposed in the previous sections.

#### **6.1. Observation data**

As explained in **3**, the six channels (*O1, OZ, O2, O9, IZ, O10*) located in the occipital eye field were used for our classification problems. Data were taken in 600 trials from each of five subjects (*A, B, C, D, E*). One trial lasted for 3.0 seconds. Data in the first 0.0-1.0 s were deleted in order to eliminate the effect of eyeball movements on the EEG. The raw data were filtered by a 50.0 Hz notch filter. After Hanning windowing, DFT was performed in *MATLAB\_R2008a* to obtain the feature vector *x* consisting of the power spectrum.

Figure 8 demonstrates the power spectrum of subject *D* taken from one trial when a stimulus was presented at 6.0 Hz. The particular frequency component is relatively clear. Figure 9 is from another trial of the same subject, where the target frequency component is not clearly discernible.

**Figure 8.** Frequency spectrum of Subject D. The target frequency of 6.0 Hz is reasonably discernible.

The vertical lines in the two figures indicate (from the left) 4.29 Hz, 6.0 Hz, 8.58 (4.29 × 2) Hz, 12.0 (6.0 × 2) Hz, 12.87 (4.29 × 3) Hz, and 18.0 (6.0 × 3) Hz, respectively. It should be noted that even with SSVEP, the observed frequency components are not always identifiable by inspection. It should also be noted that SSVEP can contain higher harmonics of the target frequency [4] and that the classification accuracy may be improved by taking the higher harmonics into account [4]. In order to examine the effectiveness of the higher order harmonics for our

**Figure 9.** Frequency spectrum of the same subject as in Figure 8. The target frequency component is difficult to observe.

classification problem, this section considers the following three settings: (i) the fundamental frequency only, (ii) the second order harmonics, in addition to the fundamental frequency, and (iii) second and third order harmonics, in addition to the fundamental frequency. Since the number of channels is 6, the dimensions of our feature vectors are (i) 12, (ii) 24, and (iii) 36, respectively.

It should be noted that while more frequency components give more information, the number of parameters to be learned increases, so that learning becomes more difficult.

#### **6.2. Experimental settings**

This study examines several different versions of Sequential Monte Carlo for implementing the target sequential learning, as displayed in Table 1, where "standard SMC" means no hyperparameter learning and resampling performed at every step. The abbreviated notation will be used throughout the rest of the paper. Various experimental settings are summarized in Table 2, where *n* denotes the number of samples; *γ*<sub>0</sub> and *δ*<sub>0</sub>, the initial conditions for hyperparameters *γ* and *δ*; *σ<sub>h</sub>*, the hyper-hyperparameter; and *h*, the number of perceptron hidden units.

#### **6.3. Performance evaluation criteria**

We will propose three performance evaluation criteria. One is the Sequential Error Rate (*SER<sub>k</sub>*), defined by

$$SER_k := \frac{1}{M} \sum_{k'=k-M+1}^{k} I\left(y_{k'} \neq y_{k',pred}\right), \tag{26}$$


**Table 1.** Algorithm names and their abbreviations


**Table 2.** Experimental Settings. *n* denotes the number of samples; *γ*<sub>0</sub> and *δ*<sub>0</sub>, the initial conditions for hyperparameters *γ* and *δ*; *σ<sub>h</sub>*, the hyper-hyperparameter; and *h*, the number of perceptron hidden units.

where *y<sub>k</sub>* (*y*<sub>*k'*</sub>) is the true class, and *y*<sub>*k*,*pred*</sub> (*y*<sub>*k'*,*pred*</sub>) is the predicted class defined by (23). The notation *I*(∙) stands for the indicator described in **4**. This is the moving average of the prediction error over a window of size *M*. We will also compute the Cumulative Error (*CE*):

$$CE := \sum_{k=1}^{K} I\left(y_k \neq y_{k,pred}\right), \tag{27}$$

in order to make performance comparisons with the existing methods. Another quantity we will be evaluating is the sequential marginal likelihood:

$$P(y_k \mid x_{1:k}, y_{1:k-1}) = \int P(y_k \mid x_k, \theta_k)\, P(\theta_k \mid x_{1:k-1}, y_{1:k-1})\, d\theta_k, \tag{28}$$

which is the marginalization of the likelihood with respect to the current predictive distribution. This quantifies the reliability of the prediction *y<sub>k</sub>* with respect to (*x*<sub>1:*k*</sub>, *y*<sub>1:*k*-1</sub>). In order to explain a *rationale* behind this, recall that given data *y*, the likelihood *P*(*y* | *z*) can be interpreted as the degree of appropriateness of *z* in explaining *y*. This, in turn, can be interpreted as the appropriateness of *y* in terms of *z*.
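The two error criteria (26)-(27) are a moving average and a cumulative count, respectively; a minimal sketch with 1-based trial indices (helper names are ours):

```python
def ser(true_labels, pred_labels, k, M):
    """SER_k of (26): error fraction over the window of trials
    k - M + 1, ..., k (1-based indexing, as in the paper)."""
    return sum(true_labels[i - 1] != pred_labels[i - 1]
               for i in range(k - M + 1, k + 1)) / M

def cumulative_error(true_labels, pred_labels):
    """CE of (27): total number of misclassified trials."""
    return sum(t != p for t, p in zip(true_labels, pred_labels))
```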

#### **6.4. Experimental results**

#### *6.4.1. Two-class classification problem*

**a.** Sequential Error Rate

Figure 10 shows the Sequential Error Rate of subject D over one session consisting of 600 trials. The algorithm was implemented by Sequential Monte Carlo together with the proposed hyperparameter learning (HP+SMC). Table 3 summarizes the Sequential Error Rates of subjects A-E, which were averages over ten learning trials.

**Figure 10.** Sequential Error Rate of subject D with (HP+SMC), Sequential Monte Carlo together with the proposed hyperparameter learning. The dotted line at 1/2 corresponds to a random classification.

|  | **A** | **B** | **C** | **D** | **E** |
|---|---|---|---|---|---|
| minimum error rate | 0.00 | 0.010 | 0.00 | 0.00 | 0.00 |
| maximum error rate | 0.75 | 0.75 | 0.71 | 0.75 | 0.80 |
| average over 600 trials | 0.16 | 0.26 | 0.19 | 0.077 | 0.13 |

**Table 3.** Sequential Error Rate of the subjects (M=20, HP+SMC)

**b.** Trajectory of hyperparameter

In Figure 11, the *γ<sub>t</sub>*-trajectory (blue) is superimposed on the Sequential Error Rates (red) of Figure 10. The value of *γ<sub>t</sub>* is the posterior mean. Note that there was a significant dip in *γ<sub>t</sub>* around trials 380 and 480, due to the fact that the algorithm detected a sudden change in the data, so that it automatically widened the search region in the parameter space. Eventually, the algorithm re-started learning the parameters. This phenomenon was also discernible at around trial 475. The hyperparameter learning appeared functional.

**Figure 11.** Trajectory of hyperparameter *γ<sub>k</sub>* for Subject D with the SER in Fig.10 (HP+SMC) superimposed. The value of *γ<sub>k</sub>* was its posterior mean.

**c.** Sequential Marginal Likelihood (Reliability of the Predictions)

Figure 12 shows the negative log-Sequential Marginal Likelihood of subject D averaged over the window *M* = 20, as in Figure 10. Even though Figure 10 and Figure 12 are similar, the latter comes from the Bayesian concept and appears slightly less abrupt. This particular quantity can be applied to the change detection problem, as is done in [27].

Figure 11.Trajectory of hyperparameter ߛ for Subject D with the SER in Fig.10 (HP+SMC) superimposed. The value of ߛ was its posterior mean

Figure 12 shows the negative log-Sequential Marginal Likelihood of subject D averaged over the window *M*=20 as was in Figure 10. Even though Figure 10 and Figure 12 are similar, the latter comes from the Bayesian concept where the latter appears slightly less

Figure 10.Sequential Error Rate of subject D with (HP+SMC), Sequential Monte Carlo together with the proposed hyperparameter learning. The Figure 12 shows the negative log-Sequential Marginal Likelihood of subject D averaged over the window *M*=20 as was in Figure 10. Even though Figure 10 and Figure 12 are similar, the latter comes from the Bayesian concept where the latter appears slightly less abrupt. This particular quantity can be applied to the change detection problem as is done in [27]. Figure 12 shows the negative log-Sequential Marginal Likelihood of subject D averaged over the window *M*=20 as was in Figure 10. Even though Figure 10 and Figure 12 are similar, the latter comes from the Bayesian concept where the latter appears slightly less abrupt. This particular quantity can be applied to the change detection problem as is done in [27].

**Figure 12.** Negative log-Sequential Marginal Likelihood of subject D with moving average *M*=20.
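Equation (28) has a direct particle approximation: with posterior samples and normalized importance weights standing in for $P(\theta_k \mid x_{1:k-1}, y_{1:k-1})$, the integral becomes a weighted average of per-particle likelihoods, and its negative log is the reliability curve of the kind plotted in Figure 12. A minimal sketch, where the logistic two-class likelihood and the particle arrays are illustrative assumptions rather than the chapter's exact model:

```python
import numpy as np

def sequential_marginal_likelihood(x_k, y_k, particles, weights, likelihood):
    """Particle estimate of P(y_k | x_1:k, y_1:k-1), Eq. (28):
    a weighted average of P(y_k | x_k, theta) over posterior particles."""
    per_particle = np.array([likelihood(y_k, x_k, theta) for theta in particles])
    return float(np.dot(weights, per_particle))

def logistic_likelihood(y, x, theta):
    """Illustrative two-class likelihood (an assumption, not the chapter's model)."""
    p = 1.0 / (1.0 + np.exp(-np.dot(theta, x)))   # P(y = 1 | x, theta)
    return p if y == 1 else 1.0 - p

rng = np.random.default_rng(0)
particles = rng.normal(size=(500, 3))    # 500 posterior samples of theta_k
weights = np.full(500, 1.0 / 500)        # normalized importance weights
x_k = np.array([1.0, 0.5, -0.2])
sml = sequential_marginal_likelihood(x_k, 1, particles, weights, logistic_likelihood)
reliability = -np.log(sml)               # the quantity plotted in Figure 12
```

A low value of `reliability` means the prediction is well supported by the current posterior; a sustained rise is what makes the quantity usable for change detection.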

#### **d.** Cumulative Error

Figure 13 shows the Cumulative Error of subject D with different algorithms, and Table 4 gives the final Cumulative Errors of subjects A-E, that is, the Cumulative Errors at the last trial. These values were the averages over ten experiments.

**Figure 13.** Cumulative Error (CE) of subject D. Different colors indicate different versions of the algorithms, as shown in Table 1. The dotted straight line at 1/2 corresponds to a random classification.

| | **A** | **B** | **C** | **D** | **E** |
|---|---|---|---|---|---|
| SMC | 95.50 | 159.2 | 109.3 | 45.90 | 80.10 |
| HP+SMC | 90.70 | 151.4 | 113.7 | 42.50 | 75.20 |
| SMCESS | 96.20 | 170.5 | 108.3 | 44.60 | 80.30 |
| HP+SMCESS | 92.10 | 155.9 | 110.0 | 42.10 | 71.70 |
| RBSMC | 91.90 | 155.8 | 109.1 | 44.80 | 75.20 |
| HP+RBSMC | 88.80 | 158.4 | 109.3 | 41.20 | 73.90 |
| RBSMCESS | 91.00 | 157.1 | 107.6 | 43.50 | 76.00 |
| HP+RBSMCESS | 89.40 | 154.6 | 106.3 | 41.70 | 72.70 |

**Table 4.** Final Cumulative Error

#### **e.** Effective Sample Size

Figure 14 shows the ESS trajectories (moving average over 20 trials) of subject D with several different methods.

**Figure 14.** Trajectory of ESS of subject D.

#### **f.** Computation Time

Table 5 summarizes the computation time of the various methods averaged over ten experiments. The middle column shows the time per trial, whereas the right-most column shows the time needed for all trials. The per-trial time is not listed for the variants where resampling is conditioned on the ESS.

| | **1 step (s)** | **whole data (s)** |
|---|---|---|
| SMC | 0.120 | 72.0 |
| HP+SMC | 0.130 | 77.7 |
| SMCESS | - | 57.4 |
| HP+SMCESS | - | 63.9 |
| RBSMC | 0.320 | 192 |
| HP+RBSMC | 0.328 | 197 |
| RBSMCESS | - | 145 |
| HP+RBSMCESS | - | 121 |

**Table 5.** Computation time

#### **g.** Harmonic Frequency Components

Effects of the higher-order harmonics were examined and are summarized in Table 6 for three cases: (i) the fundamental frequency only, (ii) the second-order harmonic in addition to the fundamental frequency, and (iii) the second- and third-order harmonics in addition to the fundamental frequency. The numbers in the table indicate the final Cumulative Errors. The results were the averages over ten experiments.

| | **A** | **B** | **C** | **D** | **E** |
|---|---|---|---|---|---|
| (i) fundamental | 90.70 | 151.4 | 113.7 | 42.50 | 75.20 |
| (ii) fundamental+2nd | 98.20 | 141.1 | 46.50 | 37.60 | 92.80 |
| (iii) fundamental+2nd+3rd | 78.20 | 131.3 | 44.30 | 45.70 | 104.0 |

**Table 6.** Effect of Harmonics (Cumulative Error)

#### *6.4.2. Multi-class classification problem*

**a.** Sequential Error Rate: Figure 15 shows the Sequential Error Rates (moving average window size 20) of subject D for the four-class problem with the HP+SMC algorithm, and Table 7 gives various values related to the Sequential Error Rates of subjects A-E averaged over ten experiments.

**Figure 15.** Sequential Error Rate of four-class classification for subject D, where the hyperparameter is learned together with SMC (HP+SMCmulti). The dotted line at 3/4 corresponds to random classification.

| | **A** | **B** | **C** | **D** | **E** |
|---|---|---|---|---|---|
| minimum error rate | 0.23 | 0.39 | 0.25 | 0.15 | 0.58 |
| maximum error rate | 0.74 | 0.82 | 0.60 | 0.72 | 0.85 |
| average over 600 trials | 0.46 | 0.60 | 0.47 | 0.36 | 0.72 |

**Table 7.** Sequential Error Rate of the subjects (M=20, HP+SMCmulti)

**b.** Cumulative Error: Figure 16 shows the Cumulative Errors of subject D for the four-class problem with two different algorithms. One is a standard SMC without hyperparameter learning (SMCmulti), and the other is the proposed SMC with hyperparameter learning (HP+SMCmulti). The dotted line indicates a random classification. Table 8 summarizes the CEs for subjects A-E. These are the values averaged over ten experiments.

**Figure 16.** Cumulative Error of four-class classification for subject D with two different algorithms. One is the standard SMC without hyperparameter learning, and the other is with the proposed hyperparameter learning. The dotted straight line indicates a random classification.

| | **A** | **B** | **C** | **D** | **E** |
|---|---|---|---|---|---|
| SMCmulti | 283.2 | 363.1 | 285.1 | 224.0 | 434.1 |
| HP+SMCmulti | 280.3 | 361.7 | 281.6 | 214.5 | 434.8 |

**Table 8.** Final Cumulative Error of four class classification
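Both evaluation curves reported above are simple functionals of the per-trial 0/1 error sequence: the Sequential Error Rate is a moving average over a window of M trials (M=20 in the figures), and the Cumulative Error is a running sum. A small self-contained sketch, with a synthetic error sequence for illustration:

```python
import numpy as np

def sequential_error_rate(errors, M=20):
    """Moving average of the 0/1 error sequence over a window of (up to) M trials."""
    errors = np.asarray(errors, dtype=float)
    ser = np.empty_like(errors)
    for k in range(len(errors)):
        lo = max(0, k - M + 1)
        ser[k] = errors[lo:k + 1].mean()
    return ser

def cumulative_error(errors):
    """Running total of errors; for a two-class problem, a slope of 1/2
    corresponds to random guessing (the dotted lines in Figures 13 and 16)."""
    return np.cumsum(errors)

errors = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]   # synthetic per-trial mistakes
ser = sequential_error_rate(errors, M=4)
ce = cumulative_error(errors)
```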

#### **7. Discussion**

#### **7.1. Sequential error rate**

We first observe that there are two errors involved in brain-computer interfaces in general, and in this study in particular. One is the error made by the brain (subject), and the other is the error made by the computer (algorithm), provided that the hardware behind the experiments is functional. Let us look at the Sequential Error Rate in Figure 10. It started decreasing immediately after the experiment began, and it had already dropped to about 0.1 at around the 20th trial. At around the 80th trial, the Sequential Error Rate became almost 0. One possible interpretation of this is that, if we can assume that the subject does not make an error during these 80 trials, then the SER trajectory represents the process of how the computer learns the classification problem. Recall that there are h(d+2)+1 parameters in (1), which in this case is 141. In addition, the hyperparameter γ needs to be learned. This means that the parameter/hyperparameter landscape is vague at the beginning in a high-dimensional space, so that the computer searches for posterior samples in an attempt to find appropriate parameter values for better classification. It should be noted that the hyperparameter γ*t* is relatively flat up to trial 80 but slightly increases. Since γ*t* represents the reciprocal of the size of the parameter search region, this period can be interpreted as the computer's early effort to search for parameters by slightly narrowing down the parameter space search region.

At around trial 80, the SER dropped to almost zero, so that if the subject's EEG signals were consistent with the previous ones, the computer algorithm did not need to seek different samples in the parameter space. Therefore, the ups and downs after trial 80 could be interpreted as indicating that the subject's EEG signals became slightly inconsistent with the previous ones. During this period, the computer naturally needed to search for slightly different posterior samples, so that some errors were incurred.

This was followed by several ups and downs between 0.2 and almost 0 until approximately trial 310. The subject seemed to have acquired a reasonable amount of skill for the task, and the Sequential Error Rate in the trials between 310 and 330, as well as between 350 and 380, was almost zero. However, the subject's Sequential Error Rate increased again at around trial 380, and a sharp dip was observed in the hyperparameter trajectory, as demonstrated in Figure 11. One possible interpretation of this is that by trial 380, the computer had found fairly good posterior samples for predictions, so that the parameter search region had been narrowed down; however, a sudden change was observed and the computer needed to quickly widen its stochastic parameter search region, which was indicated by the sudden drop of the hyperparameter γ*t* at around k=380. With this, the algorithm tried to learn parameter values different from the previous ones and eventually found better ones. The Sequential Error Rate again dropped to almost 0 at around trial 420, which lasted for approximately 40 trials. A similar phenomenon was discernible after around trial 480. It is important to notice that, in addition to the learning mechanism, a "forgetting" mechanism is naturally built in. Namely, (9) and (10) are first-order Markov stochastic dynamical systems, so that memories of the distant past are forgotten, whereas the more recent data are taken into account with higher weights. Note, however, that the Sequential Monte Carlo algorithm took into account several hundred parameter values instead of a single parameter value, which endowed the predictions with robustness.

A question might arise as to why the drop in γ*t* at around k=380 was much more significant than, for instance, that at around k=270, even though the SER increase was larger at around k=270. Our interpretation is that at around k=270, γ*t* was not too large, so that the parameter search region was still reasonably large, whereas at around k=380, the parameter search region was already sharp, and a more sudden change of the search region size was needed.
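The mechanism discussed above reads γ*t* as the reciprocal of the parameter search-region size: the first-order Markov dynamics perturb each parameter particle with noise whose scale shrinks as γ*t* grows, so a dip in γ*t* widens the stochastic search. The following schematic illustrates only that relationship; the Gaussian random-walk form and the specific γ values are illustrative assumptions, not the chapter's equations (9) and (10):

```python
import numpy as np

rng = np.random.default_rng(1)

def propagate(particles, gamma):
    """First-order Markov dynamics: theta_k = theta_{k-1} + noise, where the
    noise variance is 1/gamma, so a larger gamma means a narrower search region."""
    noise = rng.normal(scale=1.0 / np.sqrt(gamma), size=particles.shape)
    return particles + noise

particles = rng.normal(size=(500, 3))      # current parameter particles
wide = propagate(particles, gamma=0.5)     # small gamma: wide exploration (post-dip)
narrow = propagate(particles, gamma=50.0)  # large gamma: fine local search
```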

#### **7.2. Sequential marginal likelihood**


Since the Sequential Marginal Likelihood can be interpreted as the reliability of each prediction for the subject, this quantity can also be used to evaluate the subject's performance with probabilistic justification. Another potential application of this quantity is change detection: a significant change in this quantity would indicate a change in the subject's signal quality and/or in the environment. It should be noted that the sequential marginal likelihood is a well-defined Bayesian quantity, whereas such a reliability index is not available in maximum likelihood methods.

#### **7.3. Rao-blackwellisation**

Note that in Figure 13, the best performance was achieved by HP+RBSMCESS, where the Rao-Blackwellisation and the Effective Sample Size were taken into account, in addition to the hyperparameter learning. The proposed scheme appeared functional.

Figure 13 shows the Cumulative Error of subject D, where the black dotted line shows the Cumulative Error corresponding to random classification. Since the Cumulative Error with the proposed algorithms grows slower than the random classification, the results appeared to indicate that the algorithms were functional. Figure 13 appears to indicate that the proposed *γt*-learning, as well as the Rao-Blackwellised SMC, was functional.

#### **7.4. Effective sample size**

From the trajectory of the Effective Sample Size (ESS), we observed that the ESS was generally large if resampling was performed at each step. This could be attributable to the fact that the purpose of resampling is to avoid degeneracy of the samples, i.e., to bring more diversity into the samples. Table 5 appears to indicate that the computation time with the ESS criterion was significantly reduced, since resampling was performed only when the ESS fell below a threshold value.
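The ESS referred to above is the standard importance-sampling diagnostic, ESS = 1/Σᵢwᵢ², and the "…ESS" variants resample only when it drops below a threshold. A minimal sketch of this conditional-resampling scheme (the threshold and multinomial resampler are generic choices, not necessarily the chapter's exact implementation):

```python
import numpy as np

def effective_sample_size(weights):
    """ESS = 1 / sum(w_i^2) for normalized importance weights; it ranges
    from 1 (all mass on one particle) to N (uniform weights)."""
    w = np.asarray(weights, dtype=float)
    return 1.0 / np.sum(w ** 2)

def maybe_resample(particles, weights, threshold, rng):
    """Multinomial resampling, performed only when the ESS falls below the
    threshold -- skipping it when weights are healthy saves computation."""
    if effective_sample_size(weights) >= threshold:
        return particles, weights
    n = len(weights)
    idx = rng.choice(n, size=n, p=weights)
    return particles[idx], np.full(n, 1.0 / n)

rng = np.random.default_rng(0)
particles = rng.normal(size=(100, 2))
uniform = np.full(100, 1.0 / 100)               # healthy particle set
degenerate = np.zeros(100); degenerate[0] = 1.0 # collapsed particle set
kept, kept_w = maybe_resample(particles, uniform, threshold=50.0, rng=rng)
```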

#### **7.5. Higher-order harmonics**

Taking the higher-order harmonics into account generally improved the prediction capabilities, except for subject E, as was seen in Table 6. One future research project could be to develop an algorithm that chooses appropriate frequency components automatically. The number of frequency components is also related to the overfitting problem in machine learning, where the number of parameters is large compared with the number of available data, which sometimes results in performance degradation.

#### **7.6. Multi-class classification problem**

The extension of the two-class problem to the four-class classification problem discussed in Section 4.2 was nontrivial. One of the difficulties can be seen from the term *C*<sub>k,q</sub> in (17)-(19), where the values of equation (18) must be well-separated from each other for the four classes. The experimental results reported in subsection 6.4, however, appeared reasonable. The Cumulative Errors shown in Figure 16 appeared to indicate that the learning was functional.

#### **8. Conclusion**

This paper proposed Bayesian sequential learning algorithms for SSVEP sequential classification problems in a principled manner. Two experiments were conducted: one involving a two-class problem, and the other involving a four-class problem. The stimuli consisted of a flickering checkerboard at frequencies ranging from 4.29 to 10.0 Hz. The algorithms were implemented by Sequential Monte Carlo. One of the points of the proposed algorithms was their hyperparameter learning, enabling them to automatically adjust to environmental changes, including changes in the subjects' physical conditions as well as their environments. Computation costs were also measured, which appeared to indicate that the algorithms could be implemented in real time. The proposed algorithms appeared functional. The proposed sequential algorithms are applicable to other brain signals besides EEG.

In the experiments performed in this study, the subjects were asked to look at the stimuli. A future research project is to examine sequential classification problems with covert selective attention [20], [21] where subjects are asked to pay attention to stimuli without eyeball movements. This project is in progress and will be reported in a future paper.

### **Appendix**


#### **Update equations of $z_{k|k-1}$ and $S_k$**

The update equations of $z_{k|k-1}$ and $S_k$ can be summarized as follows:

$$\begin{aligned}
\nu_{k|k-1} &= \nu_{k-1|k-1} \\
V_{k|k-1} &= V_{k-1|k-1} + \delta_k^{-1} \\
S_k &= \Psi^{T}(\mathbf{x}_k; u_k)\, V_{k|k-1}\, \Psi(\mathbf{x}_k; u_k) + 1 \\
z_{k|k-1} &= \Psi^{T}(\mathbf{x}_k; u_k)\, \nu_{k|k-1} \\
K_k &= V_{k|k-1}\, \Psi(\mathbf{x}_k; u_k)\, S_k^{-1} \\
\nu_{k|k} &= \nu_{k|k-1} + K_k \left( z_k - z_{k|k-1} \right) \\
V_{k|k} &= V_{k|k-1} - K_k\, \Psi^{T}(\mathbf{x}_k; u_k)\, V_{k|k-1}
\end{aligned}$$

where $\nu_{k|k-1}$ and $V_{k|k-1}$ denote the predictive mean and covariance of the state given the observations up to step $k-1$, and $\nu_{k|k}$ and $V_{k|k}$ denote the corresponding filtered (posterior) mean and covariance at step $k$.
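For concreteness, these updates can be sketched as a standard Kalman-style recursion. The sketch below is illustrative rather than the chapter's implementation: the function name, the use of a scalar observation $z_k$, and the isotropic process-noise term `delta_inv` are our assumptions.

```python
import numpy as np

def sequential_update(nu, V, psi, z_obs, delta_inv):
    """One Kalman-style prediction/correction step for the state mean `nu`
    and covariance `V`, mirroring the update equations above. `psi` plays
    the role of Psi(x_k; u_k), `z_obs` is the scalar observation z_k, and
    `delta_inv` is the process-noise term delta_k^{-1}."""
    # Prediction: the mean is carried over, the covariance is inflated.
    nu_pred = nu
    V_pred = V + delta_inv * np.eye(len(nu))
    # Innovation variance S_k = Psi^T V_{k|k-1} Psi + 1.
    S = psi @ V_pred @ psi + 1.0
    # Predicted observation z_{k|k-1} = Psi^T nu_{k|k-1}.
    z_pred = psi @ nu_pred
    # Kalman gain K_k = V_{k|k-1} Psi S_k^{-1}.
    K = V_pred @ psi / S
    # Correction of mean and covariance.
    nu_post = nu_pred + K * (z_obs - z_pred)
    V_post = V_pred - np.outer(K, psi @ V_pred)
    return nu_post, V_post
```

A single observation shrinks the posterior covariance along the direction of `psi`, while the process-noise term keeps the recursion adaptive to slow environmental change.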

#### **Acknowledgements**

The authors thank A. Doucet for valuable comments.

#### **Author details**

S. Shigezumi<sup>1</sup>, H. Hara<sup>1</sup>, H. Namba<sup>1</sup>, C. Serizawa<sup>1</sup>, Y. Dobashi<sup>1</sup>, A. Takemoto<sup>2</sup>, K. Nakamura<sup>2</sup> and T. Matsumoto<sup>1</sup>

1 Department of Electrical Engineering and Bioscience, Waseda University, Tokyo, Japan

2 Primate Research Institute, Kyoto University, Aichi, Japan

#### **References**

[1] N. Birbaumer, "Breaking the silence: Brain-computer interfaces (BCI) for communication and motor control," *Psychophysiology*, vol. 43, no. 6, pp. 517-532, 2006.

[2] A. Bashashati, M. Fatourechi, R. K. Ward, and G. E. Birch, "A survey of signal processing algorithms in brain-computer interfaces based on electrical brain signals," *J. Neural Eng.*, vol. 4, no. 2, pp. R32-R57, 2007.

[3] G. R. Müller-Putz, R. Scherer, C. Brauneis, and G. Pfurtscheller, "Steady-state visual evoked potential (SSVEP)-based communication: impact of harmonic frequency components," *J. Neural Eng.*, vol. 2, pp. 123-130, 2005.

[4] P. Martinez, H. Bakardjian, and A. Cichocki, "Fully Online Multicommand Brain-Computer Interface with Visual Neurofeedback Using SSVEP Paradigm," *Comput. Intell. Neurosci.*, vol. 2007, pp. 1-9, 2007.

[5] B. Allison, T. Lüth, D. Valbuena, A. Teymourian, I. Volosyak, and A. Gräser, "BCI Demographics: How Many (and What Kinds of) People Can Use an SSVEP BCI?," *IEEE Trans. Neural Syst. Rehabil. Eng.*, vol. 18, no. 2, pp. 107-115, 2010.

[6] G. R. Müller-Putz and G. Pfurtscheller, "Control of an Electrical Prosthesis With an SSVEP-Based BCI," *IEEE Trans. Biomed. Eng.*, vol. 55, no. 1, pp. 361-364, 2008.

[7] R. Ortner, B. Z. Allison, G. Korisek, H. Gaggl, and G. Pfurtscheller, "An SSVEP BCI to Control a Hand Orthosis for Persons With Tetraplegia," *IEEE Trans. Neural Syst. Rehabil. Eng.*, vol. 19, no. 1, pp. 1-5, 2011.

[8] H. Cecotti, "A Self-paced and Calibration-Less SSVEP-Based Brain-Computer Interface Speller," *IEEE Trans. Neural Syst. Rehabil. Eng.*, vol. 18, no. 2, pp. 127-133, 2010.

[9] J. Li, L. Zhang, D. Tao, H. Sun, and Q. Zhao, "A Prior Neurophysiologic Knowledge Free Tensor-Based Scheme for Single Trial EEG Classification," *IEEE Trans. Neural Syst. Rehabil. Eng.*, vol. 17, no. 2, pp. 107-115, 2009.

[10] H. Y. Wu, P. L. Lee, H. C. Chang, and J. C. Hsieh, "Accounting for Phase Drifts in SSVEP-Based BCIs by Means of Biphasic Stimulation," *IEEE Trans. Biomed. Eng.*, vol. 58, no. 5, pp. 1394-1402, 2011.

[11] W. Q. Malik, W. Truccolo, E. N. Brown, and L. R. Hochberg, "Efficient Decoding With Steady-State Kalman Filter in Neural Interface Systems," *IEEE Trans. Neural Syst. Rehabil. Eng.*, vol. 19, no. 1, pp. 25-34, 2011.

[12] Z. Lin, C. Zhang, W. Wu, and X. Gao, "Frequency Recognition Based on Canonical Correlation Analysis for SSVEP-Based BCIs," *IEEE Trans. Biomed. Eng.*, vol. 54, no. 6, pp. 1172-1176, 2007.

[13] A. Buttfield, P. W. Ferrez, and J. R. Millán, "Towards a Robust BCI: Error Potentials and Online Learning," *IEEE Trans. Neural Syst. Rehabil. Eng.*, vol. 14, no. 2, pp. 164-168, 2006.

[14] S. Lu, C. Guan, and H. Zhang, "Unsupervised Brain Computer Interface Based on Intersubject Information and Online Adaptation," *IEEE Trans. Neural Syst. Rehabil. Eng.*, vol. 17, no. 2, pp. 135-145, 2009.

[15] J. W. Yoon, S. J. Roberts, M. Dyson, and J. Q. Gan, "Adaptive classification for Brain Computer Interface systems using Sequential Monte Carlo sampling," *Neural Netw.*, vol. 22, pp. 1286-1294, 2009.


## **Optimal Fractal Feature and Neural Network: EEG Based BCI Applications**

Montri Phothisonothai and Katsumi Watanabe

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/55801

#### **1. Introduction**

This chapter presents the theoretical principles of fractal dimension analysis applied to a brain-computer interface (BCI). Fractal geometry is a mathematical tool for dealing with complex systems, and methods of estimating fractal dimension have been widely used to describe objects in space and have proven useful for the analysis of biological data. The fractal dimension (FD) is one of the most popular fractal features. The term waveform refers to the shape of a wave, usually drawn as instantaneous values of a periodic quantity versus time; any waveform is an infinite series of points. Aside from classical methods such as moment statistics and regression analysis, properties such as the Kolmogorov-Sinai entropy [1], approximate entropy [2], and the FD [3] have been proposed to deal with the problem of pattern analysis of waveforms. The FD may convey information on spatial extent, self-similarity, and self-affinity [4]. Unfortunately, although precise methods to determine the FD have already been proposed [5,6], their usefulness is severely limited because they are computationally intensive and time consuming to evaluate. The FD is relatively insensitive to data scaling and shows a strong correlation with human movement in EEG data [7-9]. Time series with a fractal nature can be described by fractional Brownian motion (fBm), for which the FD can easily be set. Waveform FD values indicate the complexity of a pattern, or the quantity of information embodied in a waveform pattern, in terms of morphology, spectra, and variance. In this chapter, we investigate the six most widely used algorithms, covering both time- and frequency-domain approaches, as feature extractors for a BCI. Because the existing algorithms present different characteristics, the optimal choice with respect to the requirements of a BCI application should be considered carefully.

© 2013 Phothisonothai and Watanabe; licensee InTech. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### **2. Experimental paradigm1**

In this study, we tested ten subjects (seven male, three female), aged 21-32 years (mean age 25.3 years; SD ±3.6), with 180 trials per subject (30 trials per task). At the beginning of each EEG recording session, the subjects were instructed to relax with their eyes open for 30 s; the EEG data from this period served as the baseline for the tasks. Each trial lasted 6 s throughout the experiment and was divided into two periods: a relaxing period (0-3 s) and an imaging period (3-6 s). During the relaxing period, the subjects kept their eyes open on a blank screen and remained in a relaxed state of mind. During the imaging period, the subjects performed the imagery with eyes open, as cued by arrows. This study focuses on the motor-movement functions of the brain; therefore, imagination of motor movements was used as the task throughout the experiment. To help the subjects imagine each task without difficulty, we provided acting stimuli corresponding to the given tasks. Explicit indicators were also shown on a monitor to inform the subjects during the experiment; all selected tasks are listed in Table 1.


| Task | Imagined movement |
|------|-------------------|
| Left-hand movement (LH-MI) | Opening and grasping the left hand |
| Right-hand movement (RH-MI) | Opening and grasping the right hand |
| Feet movement (FT-MI) | Up-down lifting |
| Tongue movement (TG-MI) | Up-down movement |

**Table 1.** Selected tasks in this study

#### **3. Fractal analysis and methods**

In this chapter, the six most widely used algorithms for time- and frequency-domain analysis are addressed: the box-counting method (BCM), Higuchi's method (HM), the variance fractal dimension (VFD), detrended fluctuation analysis (DFA), power spectral density analysis (PSDA), and the critical exponent method (CEM). To measure the fractality of short intervals of time-sequential data, moving from one end of the waveform to the other, we observed the dynamical changes in the FDs of the time series, i.e., the time-dependent fractal dimensions (TDFD). For the classification process, two new feature parameters based on the fractal dimension were presented: the Kullback-Leibler (K-L) divergence and the difference of expected values. In the experiments, DFA was selected to evaluate fractal dimensions on the basis of TDFDs [11, 36], because we found that DFA provides fast computation times together with reasonable accuracy and variability when compared with the other methods.

<sup>1</sup> By NLAB, under the supervision of Prof. Masahiro Nakagawa, Department of Electrical Engineering, Nagaoka University of Technology, 1603-1 Kamitomioka, Nagaoka-shi 940-2188, Japan. E-mail: masanaka@vos.nagaokaut.ac.jp, http://pelican.nagaokaut.ac.jp/

#### **3.1. Time series-based analysis**


#### *3.1.1. Box-Counting Method (BCM)*

One of the most common methods for calculating the FD of a self-similar fractal is the box-counting method (BCM). For a bounded, self-similar set in Euclidean $n$-space [25], the number of covering boxes scales as $N_r \propto (1/r)^D$; by covering the structure with boxes of radius $r$, the FD can be determined by

$$D\_{\text{BCM}} = \lim\_{r \to 0} \frac{\log\_2 \left( N\_r \right)}{\log\_2 \left( 1/r \right)} \tag{1}$$

We then repeat this process for several different radii. To implement this algorithm, the number of contained boxes, $n_r$, is computed from the difference between the maximum and minimum amplitudes of the data divided by the current radius, as

$$n_r(i) = \left\lfloor \frac{\max\left(\mathbf{x}_r\right) - \min\left(\mathbf{x}_r\right)}{r(i)} \right\rfloor \quad \text{for } \left\{ r = 2^k \mid k = 1, 2, \dots, \log_2(L) - 1 \right\} \tag{2}$$

$$N\_r = \sum\_i n\_r(i) \tag{3}$$

where $N_r$ is the total number of contained boxes, $\mathbf{x}_r$ represents the EEG time series of length $L$, $r(i)$ is the radius obtained by changing the step $k$ within the $i$-th subdivision window, and $\lfloor \cdot \rfloor$ denotes the integer-part (floor) function. To obtain the FD, a least-squares line is fitted to the plot of $\log_2(N_r)$ versus $\log_2(1/r)$, and its slope gives $D_{BCM}$.
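As a sketch, Eqs. (1)-(3) can be implemented as below. The function name is ours, and we assume unit-spaced samples so that a box of radius $r$ spans $r$ samples in time and $r$ amplitude units; the `max(1, ...)` guard (at least one box per window) is also our assumption.

```python
import numpy as np

def box_counting_fd(x):
    """Box-counting method, Eqs. (1)-(3): for each radius r = 2^k the signal
    is split into windows r samples wide, the number of amplitude boxes of
    side r needed in each window is counted (Eq. 2) and summed (Eq. 3);
    the FD is the slope of log2(N_r) versus log2(1/r) (Eq. 1)."""
    L = len(x)
    log_inv_r, log_N = [], []
    for k in range(1, int(np.log2(L))):    # k = 1, ..., log2(L) - 1
        r = 2 ** k
        N_r = 0
        for i in range(L // r):
            w = x[i * r:(i + 1) * r]
            # At least one box always covers the window's amplitude range.
            N_r += max(1, int(np.floor((w.max() - w.min()) / r)))
        log_inv_r.append(np.log2(1.0 / r))
        log_N.append(np.log2(N_r))
    # Least-squares fit of log2(N_r) against log2(1/r) gives D_BCM.
    slope, _ = np.polyfit(log_inv_r, log_N, 1)
    return slope
```

A smooth ramp of unit slope should give an estimate near $D = 1$, the expected dimension of a line.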

#### *3.1.2. Variance Fractal Dimension (VFD)*

This method determines the Hurst exponent, $H$, from the properties of fBm data. The basic idea of the calculation is the power-law relationship between the variance of the amplitude increments of the input time series and the time interval over which the increments are produced by the underlying dynamic process. The main advantage of the VFD is its support for real-time computation [17], and we therefore also selected the VFD for estimating the FD of EEG data. The amplitude increments over a time interval $\Delta t$ obey the power-law relationship $\operatorname{Var}[x(t_2) - x(t_1)] \propto |t_2 - t_1|^{2H}$, so the Hurst exponent can be calculated using a log-log plot, given by

$$H = \lim\_{\Delta t \to 0} \left( \frac{1}{2} \frac{\log\_2 \left( \operatorname{Var} \left[ \left( \Delta x \right)\_{\Delta t} \right] \right)}{\log\_2 \left( \Delta t \right)} \right) \tag{4}$$

The variance in each window per stage *k* can be calculated as follows

$$\operatorname{Var}\left[(\Delta x)_{\Delta t}\right] = \frac{1}{N_k - 1}\left[\sum_{j=1}^{N_k} (\Delta x)_{jk}^{2} - \frac{1}{N_k}\left(\sum_{j=1}^{N_k} (\Delta x)_{jk}\right)^{2}\right] \tag{5}$$

$$(\Delta x)_{jk} = x(j n_k) - x((j-1) n_k), \quad \text{for } j = 1, 2, \dots, N_k \tag{6}$$

A least-squares line is fitted to the plot of $\log_2(n_k)$ versus $\log_2(\operatorname{Var}[(\Delta x)_k])$; the Hurst exponent is computed as $H = 0.5\,s$, where $s$ is the obtained slope, and the FD can then be estimated as

$$D_{\mathrm{VFD}} = 2 - H \tag{7}$$

The process of calculating the FD essentially involves segmenting the entire input time series into numerous subsequences (windows). The value $k$ ranges over the integers chosen such that each window of size $N_T$ contains $n_k = 2^k$ smaller windows of size $N_k = N_T / n_k$.
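A minimal sketch of the VFD estimate of Eqs. (4)-(7), using dyadic lags directly as the increment spacing (a simplification of the nested-window bookkeeping described above; the function name and the choice of lag range are ours):

```python
import numpy as np

def variance_fd(x):
    """Variance fractal dimension, Eqs. (4)-(7): the variance of the
    amplitude increments at dyadic lags scales as lag^(2H); H is half the
    log-log slope s, and the FD is D = 2 - H."""
    L = len(x)
    log_lag, log_var = [], []
    # Dyadic lags 2^k, stopping short of the largest (poorly sampled) scales.
    for k in range(2, int(np.log2(L)) - 2):
        lag = 2 ** k
        inc = x[lag:] - x[:-lag]                   # amplitude increments, Eq. (6)
        log_lag.append(np.log2(lag))
        log_var.append(np.log2(inc.var(ddof=1)))   # sample variance, Eq. (5)
    s, _ = np.polyfit(log_lag, log_var, 1)
    H = 0.5 * s                                    # Hurst exponent
    return 2.0 - H                                 # Eq. (7)
```

On ordinary Brownian motion ($H = 0.5$) the estimate should come out near $D = 1.5$.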

#### *3.1.3. Higuchi's Method (HM)*

The FD is another measure of data complexity, generally evaluated in phase space by means of the correlation dimension. Higuchi proposed an algorithm for estimating the FD directly in the time domain, without reconstructing the strange attractor [18]. The HM also gives reasonable estimates of the FD for short time segments. Given a finite time series $y = \{y(1), y(2), \dots, y(N)\}$, the HM constructs new time series $y_m^k$ by the following equation

$$y_m^k = \left\{ y(m),\; y(m+k),\; y(m+2k),\; \dots,\; y\!\left(m + \left\lfloor \frac{N-m}{k} \right\rfloor \cdot k\right) \right\} \quad \text{for } m = 1, 2, \dots, k \tag{8}$$

where both $m$ and $k$ are integers, indicating the initial time and the time interval, respectively. The length, $L_m(k)$, of each curve is computed as

$$L_m(k) = \frac{1}{k} \left[ \left( \sum_{i=1}^{\lfloor (N-m)/k \rfloor} \left| y(m+ik) - y(m+(i-1)k) \right| \right) \cdot \frac{N-1}{\left\lfloor \frac{N-m}{k} \right\rfloor \cdot k} \right] \tag{9}$$

The length of the curve for time interval $k$, $L(k)$, is computed as the average of $L_m(k)$ over the $m$ curves. This algorithm obeys the relationship $L(k) \propto k^{-D_{HM}}$; we therefore apply least-squares fitting to $\log_2(L(k))$ versus $\log_2(k)$, and the negative of the slope of the obtained line gives the estimate of the FD, $D_{HM}$.
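Eqs. (8)-(9) translate fairly directly into code; the sketch below uses 0-based indexing internally (the function name and the default `k_max` are our choices, not the chapter's):

```python
import numpy as np

def higuchi_fd(y, k_max=16):
    """Higuchi's method, Eqs. (8)-(9): for each interval k, build k decimated
    curves y_m^k, compute their normalized lengths, average them to L(k), and
    estimate the FD as the negative log-log slope of L(k) versus k."""
    N = len(y)
    log_k, log_L = [], []
    for k in range(1, k_max + 1):
        lengths = []
        for m in range(1, k + 1):                     # initial times, Eq. (8)
            n_max = (N - m) // k
            if n_max < 1:
                continue
            idx = (m - 1) + np.arange(n_max + 1) * k  # y(m), y(m+k), y(m+2k), ...
            dist = np.abs(np.diff(y[idx])).sum()
            norm = (N - 1) / (n_max * k)              # normalization factor, Eq. (9)
            lengths.append(dist * norm / k)
        log_k.append(np.log2(k))
        log_L.append(np.log2(np.mean(lengths)))
    slope, _ = np.polyfit(log_k, log_L, 1)
    return -slope
```

A straight line gives $D \approx 1$ and Gaussian white noise gives $D \approx 2$, the two theoretical extremes for a waveform.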

#### *3.1.4. Detrended Fluctuation Analysis (DFA)*


The idea of DFA was originally invented to investigate long-range dependence in coding and noncoding DNA nucleotide sequences [19]. Due to its simplicity of implementation, DFA has become one of the most important methods in the field of fractal analysis [20]; therefore, the DFA method was also applied to FD estimation in this study. The series is first integrated after subtracting its mean:

$$X(k) = \sum_{i=1}^{k} \left[ x(i) - \bar{x} \right] \tag{10}$$

where $\bar{x}$ is the mean of the series $x$.

This integrated series is divided into non-overlapping intervals of length *n*. In each interval, a least squares line is fit to the data (representing the trend in the interval). The series *X* (*k*) is then locally detrended by subtracting the theoretical values *Xn*(*k*) given by the regression. For a given interval length *n*, the characteristic size of fluctuation for this integrated and detrended series is calculated by

$$F(n) = \sqrt{\frac{1}{N} \sum_{k=1}^{N} \left[ X(k) - X_n(k) \right]^2 } \tag{11}$$

This computation is repeated over all possible interval lengths (in practice, the minimum length is around 10 and the maximum is half the length of the input data, giving two adjacent intervals). In this experiment we use power-of-two lengths for the input EEG data, so we can take $n = 2^k$ for $k = 4, 5, \dots, \log_2(L) - 1$. A power-law relationship $F(n) \propto n^{\alpha}$ is expected, where $\alpha$ is the slope of the double-logarithmic plot of $\log_2(F(n))$ versus $\log_2(n)$. As with the PSD-based methods, $\alpha$ can then be converted into the Hurst exponent, $H = \alpha - 1$, and the FD estimated as $D_{DFA} = 3 - \alpha$.
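The procedure of Eqs. (10)-(11) with the dyadic interval lengths above can be sketched as follows (the function name is ours):

```python
import numpy as np

def dfa_fd(x):
    """DFA, Eqs. (10)-(11): integrate the mean-subtracted series, remove a
    local least-squares linear trend in non-overlapping windows of dyadic
    length n, and read alpha off the log-log slope of F(n); D = 3 - alpha."""
    L = len(x)
    X = np.cumsum(x - x.mean())                   # integration, Eq. (10)
    log_n, log_F = [], []
    for k in range(4, int(np.log2(L))):           # n = 2^k, k = 4, ..., log2(L)-1
        n = 2 ** k
        n_win = L // n
        t = np.arange(n)
        sq = 0.0
        for i in range(n_win):
            seg = X[i * n:(i + 1) * n]
            trend = np.polyval(np.polyfit(t, seg, 1), t)  # local trend X_n(k)
            sq += np.sum((seg - trend) ** 2)
        log_n.append(np.log2(n))
        log_F.append(np.log2(np.sqrt(sq / (n_win * n))))  # Eq. (11)
    alpha, _ = np.polyfit(log_n, log_F, 1)
    return 3.0 - alpha
```

On ordinary Brownian motion, $\alpha \approx 1.5$, so the estimate should come out near $D = 1.5$, matching the other estimators on the same signal class.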

#### **3.2. Frequency-based analysis**

#### *3.2.1. Power Spectral Density Analysis (PSDA)*

This method is widely used for assessing the fractal properties of time series. It is based on frequency-domain analysis and works on the basis of the $1/f$-like power scaling, which can be obtained with the fast Fourier transform (FFT) algorithm. The power-law relationship of Mandelbrot and van Ness [25] can be expressed as follows

$$f(s) \propto \frac{1}{s^{\beta}} \tag{12}$$

where $s$ is the frequency, $f(s)$ is the corresponding power at a given frequency, and $\beta$ is the scaling exponent, estimated by calculating the negative slope of the linear fit relating $\log(f(s))$ to $\log(s)$. The PSDA was applied to raw EEG data to classify the EEG characteristics. This study used an enhanced version of the PSDA, initially proposed by Fougère [26] and modified by Eke [27]. To improve the stability of the spectral estimates, the raw EEG data of each individual trial were first detrended by subtracting the mean of the data series from each value, and then each value was multiplied by the parabolic window given below

$$\psi(i) = 1 \cdot \left(\frac{2i}{M+1} - 1\right)^2, \qquad \text{for } i = 1, 2, \dots, M \tag{13}$$

Finally, the scaling exponent was estimated by least-squares fitting in the log-log domain for frequencies lower than 1/8 of the sampling rate, i.e., less than 64 Hz in this study. The Hurst exponent can be determined as $H = (\beta - 1)/2$, where $1 < \beta < 3$, and the estimated FD can be computed by the following equation

$$D\_{PSDA} = 2 + \frac{(1 - \beta)}{2} \tag{14}$$
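As a sketch, Eqs. (12)-(14) map onto code as follows (the function name is ours; we assume a plain FFT periodogram and fit over the band below one eighth of the sampling rate, as described above):

```python
import numpy as np

def psda_fd(x, fs):
    """PSDA, Eqs. (12)-(14): mean-detrend, apply the parabolic window of
    Eq. (13), fit the log-log slope of the FFT power spectrum over the low
    (<= fs/8) band to get beta, then H = (beta - 1)/2 and D = 2 - H."""
    M = len(x)
    i = np.arange(1, M + 1)
    window = 1.0 - (2.0 * i / (M + 1) - 1.0) ** 2      # parabolic window, Eq. (13)
    xw = (x - x.mean()) * window
    power = np.abs(np.fft.rfft(xw)) ** 2
    freq = np.fft.rfftfreq(M, d=1.0 / fs)
    sel = (freq > 0) & (freq <= fs / 8.0)              # lowest 1/8 of the band
    slope, _ = np.polyfit(np.log2(freq[sel]), np.log2(power[sel]), 1)
    beta = -slope                                      # Eq. (12)
    H = (beta - 1.0) / 2.0
    return 2.0 - H                                     # equals 2 + (1 - beta)/2, Eq. (14)
```

For Brownian motion the spectrum falls as $f^{-2}$, so $\beta \approx 2$ and the estimate should land near $D = 1.5$.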

#### *3.2.2. Critical Exponent Method (CEM)*

This method was initially proposed by Nakagawa [21]. The CEM has been applied to physiological data analysis and feature extraction [22-24]. The PSD of observed fBm data, $P_H(\nu)$, in the frequency domain is determined as

$$P_H(\nu) \propto \nu^{-(2H+1)} = \nu^{-\beta} \tag{15}$$

The moment, *Iα*, and the moment exponent, *α*, of the PSD is determined as

$$I_{\alpha} = \int_{1}^{\Omega} P_H(\nu)\, \nu^{\alpha}\, d\nu, \qquad \left( -\infty < \alpha < +\infty \right) \tag{16}$$

We consider the frequency band as finite and substitute Eq. (15) into Eq. (16); the moment is then given by

$$I_{\alpha} \propto \int_{1}^{\Omega} \nu^{\alpha-\beta}\, d\nu = \int_{1}^{\Omega} \nu^{A-1}\, d\nu = \frac{1}{A}\left(\Omega^{A} - 1\right) = \frac{2}{A} \exp\!\left(\frac{uA}{2}\right) \sinh\!\left(\frac{uA}{2}\right) \tag{17}$$

where $A = \alpha - \beta + 1$, $u = \log(\Omega)$, and $\Omega$ is the upper frequency, normalized so that the lower bound of the integration region is 1. Hence, taking the logarithm of the moment, its third-order derivative can be written as

$$\frac{d^{3} \log I_{\alpha}}{d\alpha^{3}} = \frac{u^{3}}{4}\, \operatorname{cosech}^{3}\!\left(\frac{uA}{2}\right) \cosh\!\left(\frac{uA}{2}\right) - \frac{2}{A^{3}} \tag{18}$$

We then determine the critical value $\alpha = \alpha_c$ at which Eq. (18) equals zero. Finally, the FD can be estimated by

$$D_{CEM} = 2 - \frac{\alpha_c}{2} \tag{19}$$

The PSDA and CEM used the FFT algorithm for transforming the EEG time series data into frequency components.

#### **4. Performance assessment**


Since short computation time and low variability are requirements of a BCI system, the optimal algorithm for evaluating the fractal dimension should be examined carefully. In this experiment, we selected two test signals whose fractal dimension values can easily be set. The results help us determine which algorithm is best suited to serve as the feature extractor.

#### **4.1. Assessing with artificially generated signals**

**1.** Fractional Brownian motion (fBm) signal<sup>2</sup>. A general model of the fBm signal can be written as the following fractional differential equation [32]

$$\frac{d^{H+1/2} B_H(t)}{dt^{H+1/2}} \propto w(t) \tag{20}$$

The mathematical model for solving the fBm signal in Eq.(20) was proposed by Mandelbrot and Van Ness [25], which is defined by

$$B_H(t) = \frac{1}{\Gamma(H + 1/2)} \left\{ \int_{-\infty}^{0} \left[ (t-s)^{H-1/2} - (-s)^{H-1/2} \right] dB(s) + \int_{0}^{t} (t-s)^{H-1/2}\, dB(s) \right\} \tag{21}$$

where $\Gamma(\cdot)$ is the Gamma function, $\Gamma(a) = \int_{0}^{+\infty} x^{a-1} e^{-x}\, dx$, $H$ is the Hurst exponent with $0 < H < 1$, and $B$ is a stochastic (Brownian) process. To simulate the fBm signal, we can implement the above definition for a discrete time series at the $n$-th index by Riemann-type sums as follows:

$$B_H(n) = \frac{1}{\Gamma(H + 1/2)} \left( \sum_{k=-b}^{0} \left[ (n-k)^{H-\frac{1}{2}} - (-k)^{H-\frac{1}{2}} \right] G_1(k) + \sum_{k=0}^{n} (n-k)^{H-\frac{1}{2}}\, G_2(k) \right) \tag{22}$$

In the case $H = 1/2$, Eq. (22) reduces to

$$B_{1/2}(n) = \sum_{k=0}^{n} G(k) \tag{23}$$

<sup>2</sup> fBm data were simulated by NLAB, Department of Electrical Engineering, Nagaoka University of Technology, 1603-1 Kamitomioka, Nagaoka-shi 940-2188, Japan.

for $n = 1, 2, \dots, N$ and $b = N^3$, where $G$ denotes samples drawn from a normal distribution.

**2.** Synthetic (Syn) signal. This signal was produced using the deterministic Weierstrass cosine function [28]. To simulate the Syn signal, we can implement the above definition for a discrete time series at the $n$-th index by [33]

$$W_H(n) = \sum_{i=0}^{M} \gamma^{-iH} \cos\left(2\pi \gamma^{i} n\right) \tag{24}$$

where *γ* >1, and we set *M* =26 and *γ* =5. The fractal dimension of this signal is given by *D* =2 - *H* .
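The Syn signal of Eq. (24) makes a convenient ground truth, since $D = 2 - H$ is known exactly. A sketch with the parameters above (the unit-interval sampling of $n$ and the function name are our assumptions):

```python
import numpy as np

def weierstrass(n_samples, H, M=26, gamma=5.0):
    """Deterministic Weierstrass cosine signal of Eq. (24), sampled on the
    unit interval; its theoretical fractal dimension is D = 2 - H."""
    n = np.arange(n_samples) / n_samples
    i = np.arange(M + 1).reshape(-1, 1)
    # Sum of damped cosines: amplitude gamma^(-iH) at frequency gamma^i.
    return (gamma ** (-i * H) * np.cos(2.0 * np.pi * gamma ** i * n)).sum(axis=0)
```

For $H = 0.5$ the generated waveform is bounded by the geometric sum $\sum_i \gamma^{-iH} \approx 1.81$ and should exhibit $D \approx 1.5$ under a consistent FD estimator.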

#### **4.2. Conditions for assessing**

A summary of the selected parameter values in this experiment is shown in Table 2. The selection of signal lengths that are powers of two was motivated by the requirements of frequency-based algorithms (PSDA and CEM). Output performances of each algorithm were shown in Fig. 1.

#### **4.3. Results of performance assessing**

To test the effect of signal length on the estimation of $H$, each algorithm was applied to the entire series, and the estimated fractal dimension values were averaged over the changes of $H$. Before using these algorithms, the following aspects should be considered:

**1.** *Computation time*. Short computation time is a requirement of BCI applications. The fBm and Syn signals were processed in their entirety on a laptop with a 1.6 GHz Pentium processor and 512 MB of memory. Table 3 shows the average computation times.

**2.** *Accuracy*. To compare against the theoretical FD value (the true value), we used the mean-squared error (MSE), defined by

$$E_D = \frac{1}{L} \sum_{i=1}^{L} \left( D_i - \hat{D}_i \right)^2 \tag{25}$$

where $D$ is the theoretical FD value (true value), $\hat{D}$ is the estimated FD value at the $i$-th step of $H$, and $L$ is the signal length. Results are shown in Tables 4 and 5.

**3.** *Variability (or robustness)*. The standard deviation (SD) is computed to indicate the variability of each algorithm. Results are shown in Tables 6 and 7.



| Algorithm | Parameter | Value |
|-----------|-----------|-------|
| BCM | Step size of radius | $k = 1, 2, \dots, \log_2(p) - 1$ |
| VFD | Step size of window | $k = 2, 3, \dots, \log_2(p) - 1$ |

**Table 2.** Selected parameters in the experiment.

For *n* =1,2, …, *N* and *b* = *N* <sup>3</sup>

**4.2. Conditions for assessing**

**4.3. Results of performance assessing**

*D* =2 - *H* .

shown in Fig. 1.

time.

. *G* is a normal distribution space. **2.** Synthetic (Syn) signal. This signal was produced using the deterministic Weierstrauss cosine function [28]. To simulate the Syn signal, we can simply implement the above

*γ* -*iH cos* (2*πγ <sup>i</sup>*

where *γ* >1, and we set *M* =26 and *γ* =5. The fractal dimension of this signal is given by

A summary of the selected parameter values in this experiment is shown in Table 2. The selection of signal lengths that are powers of two was motivated by the requirements of frequency-based algorithms (PSDA and CEM). Output performances of each algorithm were

In order to test the effect of signal lengths on *H* estimation and each algorithm was applied on the entire series. The estimated fractal dimension values are then averaged at the change

**1.** *Computation time*. Short computation time is the requirement of the BCI applications. The fBm and Syn signals are performed to complete the entire series on a laptop 1.6 GHz Pentium with memory of 512 MB. Table 3 shows the results of the average computation

**2.** *Accuracy*. To compare with the theoretical FD value (true FD value), we used a mean-

*<sup>L</sup>* (*Di* - *<sup>D</sup>* ^ *i*

**1.** *Variability (or robustness)*. The standard deviation (SD) value is computed to indicate that

of *H* . Before utilizing those of algorithms, we should regard as the following things:

squared error (MSE) value, which can be defined by

and *L* is a signal length. Results are shown in Tables 4 and 5.

where *D* is a theoretical FD value (true value), *D*

<sup>E</sup><sup>D</sup> <sup>=</sup> <sup>1</sup>

that variability of algorithms. Results are shown in Tables 6 and 7.

**Algorithm Parameter Value** BCM Step size of radius *k* =1,2, …, log2 (*p*) - 1 VFD Step size of window *k* =2,3, …, log2 (*p*) - 1

*<sup>L</sup>* ∑*i*=1

*n*) (24)

)<sup>2</sup> (25)

^ is an estimated FD value at *i*th step of *<sup>H</sup>* ,

definition to discrete time series at *n*-th index by [33]
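The accuracy criterion above is simply a mean-squared error between the theoretical and estimated FD series, as in Eq. (25). A minimal sketch (`fd_mse` is a name chosen here, not from the chapter):

```python
def fd_mse(true_fd, est_fd):
    """Mean-squared error E_D between theoretical FDs D_i and estimates D^_i, Eq. (25)."""
    assert len(true_fd) == len(est_fd), "series must have equal length"
    # Average the squared deviation over all L points of the series
    return sum((d - d_hat) ** 2 for d, d_hat in zip(true_fd, est_fd)) / len(true_fd)
```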

98 Brain-Computer Interface Systems – Recent Progress and Future Prospects


**Figure 1.** The results for fractal analysis, performed on a series of test EEG data. (a) BCM algorithm: a log-log plot of a total number of contained boxes versus radius. (b) VFD algorithm: a log-log plot of variance versus window length. (c) HM algorithm: a log-log plot of a curve's length versus interval length. (d) DFA algorithm: a log-log plot of a total length F(n) versus interval length. (e) PSDA algorithm: a log-log plot of power spectra versus frequency. (f) CEM algorithm: a log-log plot of moment function versus moment exponent value [16].

| Signal length | BCM | VFD | HM | DFA | PSDA | CEM |
|---|---|---|---|---|---|---|
| *L* = 2<sup>7</sup> | 2.60 | **0.77** | 14.90 | 10.05 | 2.07 | 140.40 |
| *L* = 2<sup>8</sup> | 3.50 | **1.10** | 148.60 | 20.20 | 2.90 | 227.50 |
| *L* = 2<sup>9</sup> | 6.80 | **2.80** | 385.20 | 40.20 | 4.90 | 1,767.80 |
| *L* = 2<sup>10</sup> | 12.60 | **4.30** | 697.90 | 82.50 | 10.80 | 3,351.20 |
| *L* = 2<sup>11</sup> | 26.00 | **7.10** | 1,400.70 | 168.80 | 20.30 | 6,538.50 |
| *L* = 2<sup>12</sup> | 54.00 | **12.40** | 2,490.50 | 329.70 | 46.40 | 12,964.80 |
| *L* = 2<sup>13</sup> | 114.90 | **61.10** | 4,160.70 | 895.10 | 137.80 | 25,318.10 |
| *L* = 2<sup>14</sup> | 265.50 | **164.70** | 14,630.80 | 2,971.50 | 478.70 | 50,157.60 |
| *L* = 2<sup>15</sup> | 676.10 | **607.90** | 47,082.50 | 11,210.20 | 1,882.30 | 101,809.70 |
| Average | 129.11 | **95.79** | 7,890.20 | 1,747.53 | 287.35 | 22,475.06 |

**Table 3.** Average computation time *t<sub>avr</sub>* in milliseconds (best value is marked in bold).

| Signal length | BCM | VFD | HM | DFA | PSDA | CEM |
|---|---|---|---|---|---|---|
| *L* = 2<sup>7</sup> | 0.091 | 0.018 | **0.003** | 0.004 | 0.027 | 0.039 |
| *L* = 2<sup>8</sup> | 0.069 | 0.013 | 0.031 | **0.002** | 0.031 | 0.047 |
| *L* = 2<sup>9</sup> | 0.054 | 0.008 | 0.033 | **0.001** | 0.096 | 0.040 |
| *L* = 2<sup>10</sup> | 0.044 | 0.007 | 0.001 | **0.001** | 0.099 | 0.045 |
| *L* = 2<sup>11</sup> | 0.038 | 0.005 | 0.019 | **0.002** | 0.102 | 0.042 |
| *L* = 2<sup>12</sup> | 0.032 | 0.003 | 0.015 | **0.001** | 0.102 | 0.041 |
| *L* = 2<sup>13</sup> | 0.023 | **0.001** | **0.001** | 0.004 | 0.101 | 0.046 |
| *L* = 2<sup>14</sup> | 0.022 | 0.004 | **0.001** | **0.001** | 0.105 | 0.046 |
| *L* = 2<sup>15</sup> | 0.019 | 0.007 | 0.002 | **0.001** | 0.103 | 0.053 |
| Average | 0.044 | 0.007 | 0.012 | **0.002** | 0.085 | 0.044 |

**Table 4.** MSE value *E<sub>D</sub>* performed with the fBm signal (best value is marked in bold).

**Table 5.** MSE value performed with the Syn signal (best value is marked in bold).

| Signal length | BCM | VFD | HM | DFA | PSDA | CEM |
|---|---|---|---|---|---|---|
| *L* = 2<sup>7</sup> | 0.083 | 0.009 | 0.054 | **0.003** | 0.059 | 0.038 |
| *L* = 2<sup>8</sup> | 0.067 | 0.023 | 0.004 | **0.002** | 0.048 | 0.048 |
| *L* = 2<sup>9</sup> | 0.044 | **0.002** | 0.003 | 0.009 | 0.173 | 0.026 |

**Table 6.** Deviation performed with the fBm signal (best value is marked in bold).

#### **4.4. Time-Dependent Fractal Dimensions (TDFD)**

A common measure of irregularity in fluctuating time-sequential data is the fractal dimension, which estimates the degrees of freedom of the fluctuations in a signal. By measuring the fractality over short time intervals, moving sequentially from one end of the waveform to the other, we can observe the dynamical changes of the fractal dimension along the time series. These are the time-dependent fractal dimensions (TDFD) [22].

In this study, the EEG signal was sampled at 512 Hz with a duration of 6 seconds (*L* = 3,072 points). We used a rectangular window function with a window size of 512 points (1 s) and moved the window in intervals of ∆*t* = 10 points (19.5 ms), because the temporal resolution of EEG is a millisecond or even better [29]. The number of obtained points can be computed from *N<sub>FD</sub>* = (*L* - *L<sub>w</sub>*) / ∆*t* + 1, where *L* is the signal length, *L<sub>w</sub>* is the window length (*L<sub>w</sub>* ≤ *L*), and ∆*t* is the interval; we thus obtained 257 points of fractal dimension. The time-series-based algorithms of fractal dimension proved more effective than the frequency-based algorithms [16].

**Figure 2.** Computation process of TDFDs.
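The TDFD computation of Fig. 2 amounts to sliding a rectangular window along the signal and estimating one FD per window position. The sketch below uses Higuchi's method (HM) as the per-window estimator purely for illustration; the function names and the `k_max` value are assumptions of this sketch, not the chapter's exact implementation.

```python
import math

def higuchi_fd(x, k_max=8):
    """Estimate the fractal dimension of a 1-D signal by Higuchi's method (HM)."""
    n = len(x)
    log_inv_k, log_l = [], []
    for k in range(1, k_max + 1):
        lengths = []
        for m in range(k):  # initial time offsets m = 0 .. k-1
            n_k = (n - 1 - m) // k
            if n_k < 1:
                continue
            dist = sum(abs(x[m + i * k] - x[m + (i - 1) * k]) for i in range(1, n_k + 1))
            # Higuchi's normalisation of the curve length at scale k
            lengths.append(dist * (n - 1) / (n_k * k) / k)
        log_inv_k.append(math.log(1.0 / k))
        log_l.append(math.log(sum(lengths) / len(lengths)))
    # Least-squares slope of log L(k) against log(1/k) gives the FD
    mk = sum(log_inv_k) / len(log_inv_k)
    ml = sum(log_l) / len(log_l)
    num = sum((a - mk) * (b - ml) for a, b in zip(log_inv_k, log_l))
    den = sum((a - mk) ** 2 for a in log_inv_k)
    return num / den

def tdfd(signal, window=512, step=10, k_max=8):
    """Slide a rectangular window over the signal, one FD estimate per position."""
    n_fd = (len(signal) - window) // step + 1  # N_FD = (L - L_w) / dt + 1
    return [higuchi_fd(signal[i * step: i * step + window], k_max) for i in range(n_fd)]
```

For a monotone ramp the estimate is 1.0, as expected for a smooth curve, and a 3,072-point signal with a 512-point window and 10-point step yields (3,072 - 512)/10 + 1 = 257 FD values, matching the count stated above.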

#### **4.5. Evaluation of features by Kullback-Leibler divergence**

In probability theory and information theory, the Kullback–Leibler divergence [34] (also called the K-L divergence) is a measure of the difference between two probability distributions, from an actual probability distribution *P* to an arbitrary probability distribution *Q*. For continuous distributions it is given by

$$D_{KL}\left(P \parallel Q\right) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx \tag{26}$$

For probability distributions *P* and *Q* of a discrete random variable, it is defined to be

$$D\_{KL}\text{ ( $P \parallel Q$ )} = \sum\_{i} P(i) \log \frac{P(i)}{Q(i)}\tag{27}$$

Important properties of the K-L divergence are


| Signal length | BCM | VFD | HM | DFA | PSDA | CEM |
|---|---|---|---|---|---|---|
| *L* = 2<sup>14</sup> | 0.019 | 0.049 | 0.015 | **0.014** | 0.121 | 0.017 |
| *L* = 2<sup>15</sup> | **0.008** | 0.011 | 0.009 | 0.012 | 0.119 | 0.012 |
| Average | **0.022** | 0.077 | **0.022** | 0.027 | 0.168 | 0.027 |

**Table 7.** Deviation performed with the Syn signal (best value is marked in bold).

**1.** The K-L divergence is always non-negative: *D<sub>KL</sub>*(*P* ∥ *Q*) ≥ 0.

**2.** *D<sub>KL</sub>*(*P* ∥ *Q*) = 0 if and only if *P* = *Q*.

**3.** The K-L divergence is not symmetric: *D<sub>KL</sub>*(*P* ∥ *Q*) ≠ *D<sub>KL</sub>*(*Q* ∥ *P*).

**Figure 3.** Feature extraction concepts on the basis of the K-L divergence.

The difference of expected values between the imaging and relaxing periods is also proposed as a feature parameter, together with the K-L divergence. In general, if *X* is a random variable defined on a probability space, then the difference of the expected values under two distributions can be defined by

$$
\Delta E\{X\}_{PQ} = \sum_i p_i x_i - \sum_j q_j x_j \tag{28}
$$

In this study, we applied the K-L divergence and the difference of expected values to find the features between the relaxing and imaging periods. According to Eq. (28), *P*(*i*) is regarded as the relaxing period and *Q*(*i*) as the imaging period. The probability distributions can be approximated by normalizing a histogram of the data. The output patterns of the K-L divergence versus the difference of expected values are shown in Fig. 4. These results can be classified by the neural network; this classification process is explained further in Section 5.
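The steps above (normalize histograms into probability distributions, then compute Eq. (27) and Eq. (28)) can be sketched as follows; the `eps` guard against empty histogram bins and the function names are assumptions of this sketch, not details from the chapter.

```python
import math

def histogram_pmf(samples, bins, lo, hi):
    """Approximate a probability distribution by a normalised histogram."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for s in samples:
        idx = min(int((s - lo) / width), bins - 1)  # clamp top edge into last bin
        counts[idx] += 1
    return [c / len(samples) for c in counts]

def kl_divergence(p, q, eps=1e-12):
    """Discrete K-L divergence D_KL(P || Q) of Eq. (27); eps guards empty bins."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def delta_expectation(p, q, centers):
    """Difference of expected values, Eq. (28), over shared bin centres."""
    return sum(pi * x for pi, x in zip(p, centers)) - sum(qj * x for qj, x in zip(q, centers))
```

Note that swapping the arguments of `kl_divergence` generally changes the result, which is exactly the asymmetry listed as a property of the K-L divergence above.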

**Figure 4.** Left-hand movement imagination (LH-MI). (a) TDFDs of IC on L-area. (b) TDFDs of IC on R-area. (c) Probability distributions and their K-L divergence on L-area (*D<sub>KL</sub>*(*P*∥*Q*) = 6.909). (d) Probability distributions and their K-L divergence on R-area (*D<sub>KL</sub>*(*P*∥*Q*) = 8.508).

**5. Neural network classifier** 

This study is concerned with the most common class of neural networks (NNs), trained with the backpropagation algorithm. Backpropagation is an abbreviation for the backwards propagation of error. The learning process is based on the Levenberg–Marquardt (LM) backpropagation method [30], in which the Hessian matrix can be approximated as

$$\mathbf{H} \approx 2\mathbf{J}^T \mathbf{J} \tag{29}$$

and the gradient can be determined as

$$\nabla \mathbf{J} = 2\mathbf{J}^T \mathbf{e} \tag{30}$$

where **J** is the Jacobian matrix that contains the first derivatives of the network errors with respect to the weights and biases, and **e** is a vector of network errors.

Optimal Fractal Feature and Neural Network: EEG Based BCI Applications (http://dx.doi.org/10.5772/55801)

$$\mathbf{J}(\mathbf{w}) = \begin{pmatrix} \frac{\partial e_1(\mathbf{w})}{\partial w_1} & \cdots & \frac{\partial e_1(\mathbf{w})}{\partial w_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial e_N(\mathbf{w})}{\partial w_1} & \cdots & \frac{\partial e_N(\mathbf{w})}{\partial w_n} \end{pmatrix} \tag{31}$$

The LM algorithm can be used to update the vector of weights, **w**, by



$$\mathbf{w}_{n+1} = \mathbf{w}_n - \left(\mathbf{J}^T \mathbf{J} + \mu \mathbf{I}\right)^{-1} \mathbf{J}^T \mathbf{e} \tag{32}$$

$$
\Delta \mathbf{w} = -\left(\mathbf{J}^T \mathbf{J} + \mu \mathbf{I}\right)^{-1} \mathbf{J}^T \mathbf{e} \tag{33}
$$

where **I** is an identity matrix, *μ* is a learning rate, and *n* is an iteration step. This algorithm is particularly well suited for training NNs with an MSE cost. Moreover, the LM algorithm has been shown to be much faster than plain backpropagation in a variety of applications [31].
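As a concrete illustration of Eqs. (31)-(33), the sketch below applies the LM update to a hypothetical two-parameter linear model *y* = *w*<sub>0</sub>*x* + *w*<sub>1</sub> with a fixed *μ*; a full LM implementation adapts *μ* between iterations, which is omitted here, and `lm_step` is a name chosen for this sketch.

```python
def lm_step(w, xs, ys, mu=0.01):
    """One Levenberg-Marquardt update, Eq. (32), for the model y = w[0]*x + w[1]."""
    # Network errors e_i = y_i - model(x_i); Jacobian rows J_i = de_i/dw  (Eq. 31)
    e = [y - (w[0] * x + w[1]) for x, y in zip(xs, ys)]
    jac = [(-x, -1.0) for x in xs]
    # A = J^T J + mu * I  (2x2) and g = J^T e
    a00 = sum(jx * jx for jx, _ in jac) + mu
    a01 = sum(jx * jb for jx, jb in jac)
    a11 = sum(jb * jb for _, jb in jac) + mu
    g0 = sum(jx * ei for (jx, _), ei in zip(jac, e))
    g1 = sum(jb * ei for (_, jb), ei in zip(jac, e))
    # Delta w = -(J^T J + mu I)^{-1} J^T e via the closed-form 2x2 inverse (Eq. 33)
    det = a00 * a11 - a01 * a01
    dw0 = -(a11 * g0 - a01 * g1) / det
    dw1 = -(a00 * g1 - a01 * g0) / det
    return [w[0] + dw0, w[1] + dw1]
```

Because the model here is linear in its weights, a few iterations with a small *μ* already drive the residuals essentially to zero; for a real multilayer NN the Jacobian is recomputed at each step by backpropagation.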

**Figure 5.** Classification diagram of the proposed method.



#### **5.1. Cross-validation procedure**

The feature parameters are expressed in terms of fractal properties. An NN was selected as the classifier; it adapts through self-learning and was trained to produce the correct classifications from a set of training examples via a cross-validation process. A ten-fold cross-validation method was used to increase the reliability of the results [13]. For each experiment, we divided all the data into three sets: the training set (50% of the entire data), the cross-validation set (20%), and the testing set (30%). The desired network outputs for each task are shown in Table 8. The MSE achieved versus the iteration step is shown in Fig. 6. In Table 9, we compare the proposed method with a conventional feature, namely band power estimation (BPE) [14]. The training process makes the network define its decision boundaries; Figs. 7 and 8 show a tight decision boundary after training the network.
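The 50/20/30 partition described above can be sketched as follows; `split_dataset` and the fixed shuffle seed are assumptions of this sketch for reproducibility, not details from the chapter.

```python
import random

def split_dataset(data, ratios=(0.5, 0.2, 0.3), seed=0):
    """Shuffle and split data into training / cross-validation / testing sets."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)  # local RNG keeps the split reproducible
    n_train = int(len(data) * ratios[0])
    n_cv = int(len(data) * ratios[1])
    train = [data[i] for i in indices[:n_train]]
    cv = [data[i] for i in indices[n_train:n_train + n_cv]]
    test = [data[i] for i in indices[n_train + n_cv:]]
    return train, cv, test
```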

#### **6. Discussions**

This chapter discusses the major contributions of this study. Twelve electrode channels were used to measure the EEG signal over the sensorimotor area of the cortex. The accuracy rates of the motor imagery tasks are satisfactory. The classification errors are due to ambiguous data between the relaxing and imaging periods, particularly while the subjects imagined the movements. Note that in this experiment the raw EEG signals were recorded without subject training and without rejecting bad trials. Moreover, EEG signals can differ between individuals, which makes the signal dependent on the measured subject: even if different persons perform the same tasks and the measurements are done under identical conditions, the resulting signals can differ [35].

The proposed method provides 2,880 features for each subject, whereas the BPE method provides 21,600 features. The features obtained from the proposed method present an easily practicable and reliable basis for training the classification algorithms. Nevertheless, the number of features from the BPE method can be reduced by changing the number of electrodes. We used the NN to classify the features processed by the BPE method; after training, the learning curves and decision boundaries of the BPE method are shown in Fig. 9. As a result, the output decision functions of the features based on the BPE method do not perform well and cannot produce clearly separated boundaries.

**Figure 6.** Log-log plot of learning curve.

| Task | Desired vector |
|---|---|
| LH-MI | [1 0 0 0]<sup>T</sup> |
| RH-MI | [0 1 0 0]<sup>T</sup> |
| FT-MI | [0 0 1 0]<sup>T</sup> |
| TG-MI | [0 0 0 1]<sup>T</sup> |

**Table 8.** Desired vectors from the proposed network.



| Method | LH-MI | RH-MI | FT-MI | TG-MI |
|---|---|---|---|---|
| BPE | 77.6 | 79.5 | 76.4 | 77.3 |
| Proposed | 82.5 | 84.3 | 83.6 | 82.6 |

**Table 9.** A comparison result of average accuracy rates.

**Figure 7.** Decision boundary of the trained network.

**Figure 8.** Corresponding 3-D surface of decision boundaries for each task.

**Figure 9.** Updated weight values of the NN after learning; (Top panel) Proposed method, (Bottom panel) BPE Method.

The six algorithms present specific properties in terms of capability and variability. Some algorithms appeared inapplicable to the EEG time-series data in this study, for example the PSDA and CEM algorithms. The main reason is that these two algorithms are based on frequency analysis: the Hurst exponent is estimated by means of the FFT, which is substantially applicable to long time series but easily declines when the EEG data is analyzed as a short time series. However, we can enhance the variability and avoid such drawbacks by increasing the sampling rate over the same experimental period. In terms of variability, the HM algorithm presents the lowest average values of SD. The DFA provides fast computation time and also presents reasonable values of accuracy and variability in comparison with the other algorithms. The VFD gives extremely fast computation time, but it has drawbacks in terms of accuracy and robustness; since its computation time was less than 10 ms, the VFD algorithm is appropriate for real-time applications. Overall, we suggest that the most appropriate algorithm depends on the application purpose; for instance, the HM algorithm is suited for evaluating the FD of EEG data with high precision, but it consumed extreme processing time in comparison with the other algorithms. To show the performance of all algorithms, we compared all estimated FDs with the theoretical values, especially those computed over short intervals, i.e., 2<sup>7</sup>, 2<sup>8</sup> and 2<sup>9</sup> points; these results are shown in Figs. 10 to 15.


**Figure 10.** Fractal dimension evaluated by BCM. (Right) fBm signal; (left) Syn signal.


**Figure 11.** Fractal dimension evaluated by VFD. (Right) fBm signal; (left) Syn signal.


**Figure 12.** Fractal dimension evaluated by HM. (Right) fBm signal; (left) Syn signal.

**Figure 13.** Fractal dimension evaluated by DFA. (Right) fBm signal; (left) Syn signal.

**Figure 14.** Fractal dimension evaluated by PSDA. (Right) fBm signal; (left) Syn signal.

**Figure 15.** Fractal dimension evaluated by CEM. (Right) fBm signal; (left) Syn signal.


#### **7. Conclusion**


By selecting the best-performing fractal algorithm, we can clearly improve the classification rate and develop novel methods based on fractal properties. The fractal features and the experimental framework can be applied not only to binary-command BCIs but also to multi-command BCIs. The measurement of the FD from EEG data is applicable not only to the particular recordings considered here but also to publicly available data. The FD characterizes the self-affine property of EEG data and relates directly to the different tasks.

#### **Author details**


Montri Phothisonothai and Katsumi Watanabe

Research Center for Advanced Science and Technology, The University of Tokyo; Japan Society for the Promotion of Science, Tokyo, Japan

#### **References**


[13] Setiono, R. Feedforward Neural Network Construction using Cross Validation. Neural Comput. (2005), 13, 2783-2786.

[14] Palaniappan, R. Utilizing Gamma Band to Improve Mental Task based Brain-Computer Interface Design. IEEE Trans. Neural Net. Rehabil. Eng. (2006), 14, 299-303.

[15] Mandelbrot, B. B. The Fractal Geometry of Nature. W.H. Freeman, New York; (1983).

[16] Phothisonothai, M, & Nakagawa, M. Fractal-based EEG Data Analysis of Body Parts Movement Imagery Tasks. J. Physiol. Sci. (2007), 57(4), 217-226.

[17] Kinsner, W. Batch and Real-time Computation of a Fractal Dimension based on Variance of a Time Series. Univ. Manitoba Canada. Tech. Report (1994), DEL94-6.

[18] Higuchi, T. Approach to an Irregular Time Series on the Basis of the Fractal Theory. Physica D (1988), 31, 277-283.

[19] Peng, C. K, Mietus, J, Hausdorff, J. M, Havlin, S, Stanley, H. E, & Goldberger, A. L. Long Range Anticorrelations and Non-Gaussian Behavior of the Heartbeat. Phys. Rev. Lett. (1993), 70, 1343-1346.

[20] Peng, C. K, Havlin, S, Stanley, H. E, & Goldberger, A. L. Quantification of Scaling Exponents and Crossover Phenomena in Nonstationary Heartbeat Time Series. Chaos (1995), 5, 82-87.

[21] Nakagawa, M. A Critical Exponent Method to Evaluate Fractal Dimension of Self-affine Data. J. Phys. Soc. Jpn. (1993), 62, 4233-4239.

[22] Sabanal, S, & Nakagawa, M. The Fractal Properties of Vocal Sound and their Application in the Speech Recognition Model. Chaos Soli. Frac. (1996), 7, 1825-1843.

[23] Petry, A, Augusto, D, & Barone, C. Speaker Identification using Nonlinear Dynamical Features. Chaos Soli. Frac. (2002), 13, 221-231.

[24] Phothisonothai, M, & Nakagawa, M. EEG-based Fractal Analysis of Different Motor Imagery Tasks using Critical Exponent Method. Int. J. Bio. Life Sci. (2006), 1(3), 175-180.

[25] Mandelbrot, B. B, & Van Ness, J. W. Fractional Brownian Motions, Fractional Noises and Applications. SIAM Rev. (1968), 10, 422-437.

[26] Fougère, P. F. On the Accuracy of Spectrum Analysis of Red Noise Processes using Maximum Entropy and Periodogram Methods: Simulation Studies and Application to Geographical Data. J. Geograp. Res. (1985), 95, 4355-4366.

[27] Eke, A, Hermann, P, Kocsis, L, & Kozak, L. R. Fractal Characterization of Complexity in Temporal Physiological Signals. Physiol. Meas. (2002), R1-R38.

[28] Tricot, C. Curves and Fractal Dimension. New York: Springer-Verlag; (1995).

[29] Koenig, T, Studer, D, Hubl, D, Melie, L, & Strik, W. K. Brain connectivity at different time-scales measured with EEG. Philos. Trans. R. Soc. Lond. B Biol. Sci. (2005), 360, 1015-1024.


## **Using Autoregressive Models of Wavelet Bases in the Design of Mental Task-Based BCIs**

Farshad Faradji, Farhad Faradji, Rabab K. Ward and Gary E. Birch

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/55803

#### **1. Introduction**

#### **1.1. Wavelet packet analysis**

Wavelet analysis is a powerful tool for analyzing the characteristics of a signal in the frequency domain as well as in the time domain. To analyze the signal, it can be decomposed successively into several levels by the wavelet transform. At each level, decomposition yields two types of components: the approximations component, which is the low-frequency, high-scale portion of the signal, and the details component, which is the high-frequency, low-scale portion. After each level, the resultant approximations component is decomposed again. This is usually referred to as wavelet decomposition. Fig. 1.a shows a signal decomposed into three levels by wavelet decomposition.

There is another approach to signal decomposition in which, at each level, not only the approximations component but also the details component is decomposed. This is called wavelet packet analysis, and each of the components at the different levels is referred to as a node or wavelet packet. Wavelet packet analysis is more flexible but more complicated than wavelet decomposition. Please see Fig. 1.b for more details. In this figure, the signal is decomposed into three levels using wavelet packet analysis.
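The difference between the two schemes can be sketched with the Haar wavelet, whose approximations and details are simply pairwise averages and differences. The sketch below is a minimal illustration only (the function names and the Haar filter are our choices, not necessarily those used later in the chapter); it builds the full three-level packet tree of Fig. 1.b:

```python
def haar_step(signal):
    """One Haar decomposition step: approximations are pairwise averages
    (low-frequency half), details are pairwise differences (high-frequency half)."""
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal) - 1, 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal) - 1, 2)]
    return approx, detail

def wavelet_packet_tree(signal, levels):
    """Full wavelet packet analysis: at every level, BOTH the approximations
    and the details components are decomposed. Node paths use 'A' and 'D',
    with '' denoting the root (the initial signal)."""
    tree = {"": list(signal)}
    frontier = [""]
    for _ in range(levels):
        nxt = []
        for path in frontier:
            a, d = haar_step(tree[path])
            tree[path + "A"], tree[path + "D"] = a, d
            nxt += [path + "A", path + "D"]
        frontier = nxt
    return tree

x = [4.0, 2.0, 6.0, 8.0, 5.0, 1.0, 3.0, 7.0]
tree = wavelet_packet_tree(x, 3)   # 1 + 2 + 4 + 8 = 15 nodes in total
# Plain wavelet decomposition keeps only the nodes {AAA, AAD, AD, D}: the
# details are never decomposed further -- a special case of the packet tree.
```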

Let us assume that the maximum frequency of the signal is *fm*. At each level of decomposition, the approximations component covers the lower half of the frequency spectrum of the signal being decomposed, and the details component covers the upper half. The frequency spectra pertaining to the different packets in three-level wavelet packet analysis are shown in Table 1.

© 2013 Faradji et al.; licensee InTech. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


| Level  | Frequency bands |
|--------|-----------------|
| Signal | 0 – *fm* |
| 1      | 0 – 0.5*fm*, 0.5*fm* – *fm* |
| 2      | 0 – 0.25*fm*, 0.25*fm* – 0.5*fm*, 0.5*fm* – 0.75*fm*, 0.75*fm* – *fm* |
| 3      | 0 – 0.125*fm*, 0.125*fm* – 0.25*fm*, 0.25*fm* – 0.375*fm*, 0.375*fm* – 0.5*fm*, 0.5*fm* – 0.625*fm*, 0.625*fm* – 0.75*fm*, 0.75*fm* – 0.875*fm*, 0.875*fm* – *fm* |

**Table 1.** Frequency Spectra in Wavelet Packet Analysis

The packets produced by wavelet packet analysis form an upside-down "tree", whose root is the initial signal and whose branches come down. If we cut some branches of this tree (i.e., we exclude some packets from it), what remains is called a "sub-tree". The final nodes (terminal nodes or leaves) of each sub-tree form a "wavelet basis" for the initial signal. In other words, a wavelet basis is by definition a set of packets containing all non-overlapping frequency components of the initial signal. The packets of a basis can be selected from different levels. Each wavelet basis can serve as a representative of the initial signal in the wavelet domain. Here are some examples of bases with respect to Fig. 1.b: *b1* = {*A1, D2*}, *b2* = {*A1, A21, D22*}, *b3* = {*A111, D112, D12, D2*}, *b4* = {*A11, A121, D122, A211, D212, D22*}. It can be seen that wavelet decomposition is a special case of wavelet packet analysis (*b3*).

**Figure 1.** Decomposition of a signal: (a) wavelet decomposition, (b) wavelet packet analysis

The best sub-tree (or basis) is a sub-tree (or basis) which minimizes a specific cost function. Cost functions are usually defined based on an entropy function (e.g., Shannon's entropy function) [1]. One can also define his or her own cost function. To find the best sub-tree, the costs of all nodes in the main tree are measured. Then, in the direction from leaf to root and at each level, the cost of the node and its resultant child nodes are compared to each other. If the sum of the child nodes' costs is higher than or equal to the parent node's cost, the child nodes are cut off from the tree; otherwise, the child nodes remain in the tree, and the cost of the parent's node will be updated (replaced) with the sum of the child nodes' costs. This parent node is in turn one of the child nodes for the upper level. The same comparison (with updated costs) is done for the upper levels until the root level is reached. What remains at the end is the best sub-tree and its leaves are the best basis for the initial signal.
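The leaf-to-root pruning just described can be sketched as a recursion: a pair of children is cut off whenever their combined (best) cost is higher than or equal to the parent's cost; otherwise the parent's cost is replaced by that combined cost. The sketch below (all names are ours) uses a Haar packet tree and a non-normalized Shannon entropy cost, one common choice of additive entropy-based cost function; other cost functions could be substituted:

```python
import math

def haar_step(s):
    """One Haar step: pairwise averages (approximations) and differences (details)."""
    a = [(s[i] + s[i + 1]) / 2 for i in range(0, len(s) - 1, 2)]
    d = [(s[i] - s[i + 1]) / 2 for i in range(0, len(s) - 1, 2)]
    return a, d

def build_tree(signal, levels):
    """Full wavelet packet tree; node paths use 'A'/'D', '' is the root."""
    tree, frontier = {"": list(signal)}, [""]
    for _ in range(levels):
        nxt = []
        for path in frontier:
            tree[path + "A"], tree[path + "D"] = haar_step(tree[path])
            nxt += [path + "A", path + "D"]
        frontier = nxt
    return tree

def entropy_cost(coeffs):
    # Additive (non-normalized) Shannon entropy cost: -sum c^2 ln c^2, 0 ln 0 := 0.
    return -sum(c * c * math.log(c * c) for c in coeffs if c != 0.0)

def best_basis(tree, path, max_level):
    """Leaf-to-root search for the best sub-tree. Returns (best cost, leaves)."""
    node_cost = entropy_cost(tree[path])
    if len(path) == max_level:
        return node_cost, [path]
    ca, la = best_basis(tree, path + "A", max_level)
    cd, ld = best_basis(tree, path + "D", max_level)
    if ca + cd >= node_cost:
        return node_cost, [path]    # children cut off: parent becomes a leaf
    return ca + cd, la + ld         # children kept; parent's cost updated

tree = build_tree([1.0] * 8, 3)          # constant signal: all details vanish
cost, leaves = best_basis(tree, "", 3)   # the root alone is the best basis here
```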

#### **1.2. Brain–computer interfaces**


Brain–Computer Interfaces (BCIs), which provide an alternative channel for communication and are intended to be used by motor-disabled individuals, can be categorized into two main classes. Synchronized BCIs are the first and earliest class, in which the user is able to activate the system only during pre-defined time periods. The second class is referred to as asynchronous or self-paced BCIs. These systems are more useful, since the user can activate them whenever he or she wishes. At each time instant, a self-paced BCI is either in the intentional-control (IC) state or in the no-control (NC) state. IC is the active state, and NC is the inactive state (i.e., the state during which the output of the system is inactive).

The performance of a self-paced BCI is usually evaluated by two rates: the true positive rate (TPR) and the false positive rate (FPR). TPR is considered as the rate of correctly classifying intentional-control states, while FPR is defined as the rate of misclassifying no-control states.
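Under these definitions, both rates are simple ratios over the labeled IC and NC periods. A minimal sketch (the labels and classifier outputs below are hypothetical, not taken from the dataset):

```python
def self_paced_rates(predictions, labels):
    """TPR = correctly detected IC periods / all IC periods;
       FPR = NC periods misclassified as IC / all NC periods."""
    tp = sum(p == "IC" and y == "IC" for p, y in zip(predictions, labels))
    fp = sum(p == "IC" and y == "NC" for p, y in zip(predictions, labels))
    n_ic = sum(y == "IC" for y in labels)
    n_nc = sum(y == "NC" for y in labels)
    return tp / n_ic, fp / n_nc

# Hypothetical true states and classifier outputs for eight periods:
labels = ["IC", "IC", "NC", "NC", "NC", "IC", "NC", "NC"]
preds  = ["IC", "NC", "NC", "IC", "NC", "IC", "NC", "NC"]
tpr, fpr = self_paced_rates(preds, labels)   # tpr = 2/3, fpr = 1/5
```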

A neurological phenomenon is a set of specific features or patterns in signals produced by the brain. They arise due to brain activities to which these phenomena are time-locked. Different types of neurological phenomena include the activity of neural cells (ANCs), P300, brain rhythms such as the Mu, Beta, and Gamma rhythms, movement-related potentials (MRPs), slow cortical potentials (SCPs), visual evoked potentials (VEPs), steady-state visual evoked potentials (SSVEPs), and mental tasks (MTs). For a review of the field of BCIs, please refer to [2]-[7].

The dataset of Keirn and Aunon [8], containing the EEG signals of five mental tasks, is used in this paper, as it has been used in a variety of studies such as [9]-[55]. Although most of these studies can classify mental tasks to some extent, only a few of them consider and report false positives or confusion matrices [32],[45]-[48]. The classification error is reported in [21],[31],[49] as well. In the remaining studies, the correct classification rate is the only measure given serious attention. False positives are of great importance in BCI applications; they are therefore explicitly considered in this study.

#### **1.3. Autoregressive modeling**

The autoregressive (AR) model of order *K* of a signal is defined as follows:

$$x[m] = \sum\_{k=1}^{K} a\_k x[m-k] + e[m] \tag{1}$$

where *x*[*m*] is the one-dimensional signal at time instant *m*, and *ak* represents the AR coefficients. It is assumed that the error signal, *e*[*m*], is a stochastic process independent of the previous values of the signal *x*. It is also postulated that *e*[*m*] has zero mean and finite variance. The autoregressive coefficients, *ak*, should be estimated from the finite samples of the signal *x*[*m*].

The most popular method to find the AR coefficients is the Burg algorithm [56] which computes coefficients at successive orders in the forward direction as well as in the backward direction. This method is used in this paper to estimate the AR coefficients.

Figuring out the optimal AR model order is not straightforward, even though several techniques have been introduced, including the Reflection Coefficient [18], the Information Theoretic criterion or Akaike Information Criterion (AIC), the Autoregressive Transfer Function criterion, and the Final Prediction Error (FPE) criterion [57]. If the order of the model is too low, the signal cannot be fully captured by the model. On the other hand, the higher the order, the more of the noise is captured. Since there is no guarantee that the above-mentioned techniques work well in every application, and since the FPR is of great importance in our application, we do not use these techniques to find the optimal AR model order. Instead, to be on the safe side, we vary the order over a reasonably large range and select the best order based on the performance of the system, evaluated via nested five-fold cross-validation.
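As a concrete illustration of Eq. (1), the sketch below fits a second-order model by ordinary least squares rather than by the Burg algorithm used in the chapter (the closed-form 2×2 normal equations keep the example short); the function name and test signal are our own:

```python
def fit_ar2_least_squares(x):
    """Fit x[m] = a1*x[m-1] + a2*x[m-2] + e[m] by ordinary least squares,
    solving the 2x2 normal equations with Cramer's rule.
    (A simple stand-in for the Burg algorithm, for illustration only.)"""
    s11 = sum(x[m - 1] * x[m - 1] for m in range(2, len(x)))
    s12 = sum(x[m - 1] * x[m - 2] for m in range(2, len(x)))
    s22 = sum(x[m - 2] * x[m - 2] for m in range(2, len(x)))
    t1  = sum(x[m] * x[m - 1]     for m in range(2, len(x)))
    t2  = sum(x[m] * x[m - 2]     for m in range(2, len(x)))
    det = s11 * s22 - s12 * s12
    a1 = (t1 * s22 - t2 * s12) / det
    a2 = (s11 * t2 - s12 * t1) / det
    return a1, a2

# Noise-free AR(2) sequence with known coefficients a1 = 0.5, a2 = 0.3;
# with e[m] = 0 the least-squares fit recovers them exactly.
x = [1.0, 1.0]
for m in range(2, 50):
    x.append(0.5 * x[m - 1] + 0.3 * x[m - 2])
a1, a2 = fit_ar2_least_squares(x)
```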

#### **1.4. Quadratic discriminant analysis**

The Quadratic Discriminant Analysis (QDA) classifier [58] assumes that the classes have normal distributions. In the QDA method of classification, unlike linear discriminant analysis, the covariance matrices of the classes can be different. For a two-class problem, the quadratic discriminant function is by definition as follows:

$$qdf(\mathbf{x}) = -\frac{1}{2}\mathbf{x}^{T}(\hat{\Sigma}\_{1}^{-1} - \hat{\Sigma}\_{2}^{-1})\mathbf{x} + (\hat{\mu}\_{1}^{T}\hat{\Sigma}\_{1}^{-1} - \hat{\mu}\_{2}^{T}\hat{\Sigma}\_{2}^{-1})\mathbf{x} - \frac{1}{2}\ln\left(\frac{|\hat{\Sigma}\_{1}|}{|\hat{\Sigma}\_{2}|}\right) - \frac{1}{2}(\hat{\mu}\_{1}^{T}\hat{\Sigma}\_{1}^{-1}\hat{\mu}\_{1} - \hat{\mu}\_{2}^{T}\hat{\Sigma}\_{2}^{-1}\hat{\mu}\_{2}) - \ln\left(\frac{C\_{21}}{C\_{12}}\frac{\pi\_{2}}{\pi\_{1}}\right)\tag{2}$$

where $\mathbf{x}$ is the vector to be classified, $\hat{\mu}\_1$ and $\hat{\mu}\_2$ are the estimated mean vectors of classes 1 and 2, $\hat{\Sigma}\_1$ and $\hat{\Sigma}\_2$ are the estimated covariance matrices of class 1 and class 2, *π*1 and *π*2 are the prior probabilities of the two classes, *C*12 is the cost of misclassifying a member of class 1 as class 2, and *C*21 is the cost of misclassifying a member of class 2 as class 1.

The decision rule for classification is:

$$\mathbf{x}\_0 \in \begin{cases} \omega\_1 & \text{if } qdf(\mathbf{x}\_0) \ge 0 \\ \omega\_2 & \text{if } qdf(\mathbf{x}\_0) < 0 \end{cases} \tag{3}$$

where *ω*1, *ω*2 represent class 1 and class 2, respectively.

In this paper, the same value is used for the cost of a false negative, *C*21, and the cost of a false positive, *C*12. It is also assumed that the a-priori probabilities of the two classes are equal.
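Equations (2) and (3) can be sketched in two dimensions; with equal priors and equal misclassification costs, as assumed above, the final log term vanishes. All helper names below are our own, and the means/covariances are made-up test values:

```python
import math

def det2(m):
    """Determinant of a 2x2 matrix given as [[a, b], [c, d]]."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def inv2(m):
    """Inverse of a 2x2 matrix."""
    d = det2(m)
    return [[ m[1][1] / d, -m[0][1] / d],
            [-m[1][0] / d,  m[0][0] / d]]

def mat_vec(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def qdf(x, mu1, sig1, mu2, sig2):
    """Quadratic discriminant function of Eq. (2), with pi1 = pi2 and
    C12 = C21 so the final log term is zero."""
    s1i, s2i = inv2(sig1), inv2(sig2)
    quad  = -0.5 * (dot(x, mat_vec(s1i, x)) - dot(x, mat_vec(s2i, x)))
    # mu^T Sigma^-1 x computed as (Sigma^-1 mu) . x, valid since Sigma^-1 is symmetric
    lin   = dot(mat_vec(s1i, mu1), x) - dot(mat_vec(s2i, mu2), x)
    logdt = -0.5 * math.log(det2(sig1) / det2(sig2))
    const = -0.5 * (dot(mu1, mat_vec(s1i, mu1)) - dot(mu2, mat_vec(s2i, mu2)))
    return quad + lin + logdt + const

def classify(x, mu1, sig1, mu2, sig2):
    # Decision rule (3): class 1 iff qdf(x) >= 0.
    return 1 if qdf(x, mu1, sig1, mu2, sig2) >= 0 else 2

I2 = [[1.0, 0.0], [0.0, 1.0]]           # identity covariances for both classes
mu1, mu2 = [0.0, 0.0], [4.0, 4.0]
c_near1 = classify([1.0, 1.0], mu1, I2, mu2, I2)   # -> 1 (closer to mu1)
c_near2 = classify([3.5, 3.5], mu1, I2, mu2, I2)   # -> 2 (closer to mu2)
```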

#### **2. Methods**

#### **2.1. Data**


As mentioned in Section 1.2, we used the EEG data collected previously by Keirn and Aunon [8]. The EEG signals of this dataset belong to seven subjects, each performing five different mental tasks. The mental tasks are: baseline, mentally computing a nontrivial multiplication, mentally composing a letter to a friend, mentally rotating a three-dimensional object, and visualizing writing a sequence of numbers on a blackboard. The subjects did not vocalize or gesture in any way while their signals were being recorded. Each recording session comprises five trials of each mental task; therefore, there are a total of twenty-five trials in a session. Only one session was performed on a single day. The length of each trial is ten seconds. Two of the subjects (Subjects 2 and 7) completed only one session, while one of them (Subject 5) completed three sessions. In this study, we used the data of the subjects who completed at least 10 trials. Since the signals of Subject 4 were missing some data, we did not use them. New numbers were assigned to the subjects whose signals have been used in this study (see Table 2). Table 2 also shows the number of trials completed by each subject.


**Table 2.** The Number of Completed Trials for Each Subject

EEG signals were recorded from six channels (electrodes) while the subjects were seated in a sound-controlled room with dim lighting. The electrodes were placed at positions C3, C4, P3, P4, O1, and O2 (based on the International 10-20 System) on the scalp. The reference electrodes were two electrically linked mastoids, A1 and A2. Fig. 2 shows the electrode locations. During recordings, the impedances between each electrode and the reference electrodes were kept below 5 kΩ. The signals were sampled at 250 Hz with an A/D converter (twelve-bit Lab Master) and a bank of amplifiers (Grass 7P511, with the band-pass filters set at 0.1-100 Hz). The system was calibrated at the beginning of each session with a known voltage.

Two EOG electrodes were placed below and at the outside corner of the left eye for detecting ocular artifacts. Since in this study, we did not remove any segment of the EEG signals due to ocular artifacts, the signals of EOG electrodes were not used.

**Figure 2.** Electrode positions based on the International 10-20 System.

Even though this dataset has not been collected in a self-paced paradigm, we are using it as an introductory exploration. The brain activities themselves do not change in a self-paced paradigm; however, since the pacing information (the exact start and end times of the mental tasks) is not known, training the BCI system would be more complicated.

#### **2.2. Procedure**

#### *2.2.1. The design of BCI systems*

In this paper, we apply wavelet packet analysis to the design of a two-state self-paced mental task-based BCI. We develop and custom design five different BCIs for each subject based on the five mental tasks. In each BCI, one mental task is considered as the intentional-control task and the other four mental tasks are considered as the no-control tasks. Unlike the no-control tasks, the intentional-control task should activate the BCI system. Even though a BCI in which the baseline is the intentional-control task is practically useless, we consider it here for comparison purposes. We then determine the two most discriminatory mental tasks for each subject by comparing the performance of the five BCI systems of that subject. The overview of the proposed BCI system is illustrated in Fig. 3.

We customize the BCIs for every subject and mental task, since it has been proven that customized BCIs yield better results than general BCIs [59]-[60].

The EEG signals of four subjects are exploited. We use the first ten trials for Subject 3, and all ten trials for the other three subjects. The sampling rate is 250 Hz and each trial is 10 seconds long; therefore, there are 2500 samples in every trial of each mental task.

Each trial is divided into 45 overlapping segments of 256 samples each. Each segment overlaps with the adjacent segment by 206 samples (about 80%). Hence, for each subject, the total number of segments for each mental task is 450. Segments longer than 1 second are sufficiently long to capture good characteristics of the signal [61].
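The segmentation above (256-sample windows with a 206-sample overlap, i.e., a 50-sample hop over a 2500-sample trial) can be sketched as:

```python
import numpy as np

def segment_trial(trial, seg_len=256, overlap=206):
    """Split one 10 s, 250 Hz trial (2500 samples) into overlapping segments."""
    step = seg_len - overlap                          # 50-sample hop (~80% overlap)
    starts = range(0, len(trial) - seg_len + 1, step)
    return np.stack([trial[s:s + seg_len] for s in starts])

segments = segment_trial(np.arange(2500.0))
print(segments.shape)  # (45, 256): 45 segments per trial, so 450 per task over 10 trials
```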

For each BCI, we have six different EEG channels. Each channel has its own feature vector and classifier. For each channel, the EEG segment is decomposed using wavelet packet analysis and the AR models of the resultant wavelet packets are estimated using the Burg algorithm. The AR coefficients of the packets belonging to a given wavelet basis are concatenated into a vector to form the channel's feature vector. The classifier of each channel is QDA. Each QDA classifies the input EEG segment as an intentional-control (IC) or no-control (NC) segment. The second-stage classifier, a simple majority-voting classifier, determines whether the final output of the system is IC or NC.

Using Autoregressive Models of Wavelet Bases in the Design of Mental Task-Based BCIs http://dx.doi.org/10.5772/55803 121

**Figure 3.** Overview of the proposed BCI system.
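As an illustration of the per-channel feature pipeline, the sketch below uses a hand-rolled Haar analysis step in place of the per-subject selected wavelet, and a textbook Burg recursion; it is a minimal stand-in, not the authors' implementation. With basis 1 (packets A1 and D2) and a level-1 AR order of 12, each 256-sample segment yields a 24-dimensional feature vector per channel:

```python
import numpy as np

def haar_level1(x):
    """One level of Haar analysis: approximation (A1) and detail (D2) packets.
    A stand-in for the wavelet selected per subject/task in the chapter."""
    even, odd = x[0::2], x[1::2]
    return (even + odd) / np.sqrt(2.0), (even - odd) / np.sqrt(2.0)

def burg_ar(x, order):
    """AR coefficients via Burg's method (textbook forward/backward recursion).
    Sign convention: x[t] + a[0]*x[t-1] + ... + a[order-1]*x[t-order] = e[t]."""
    f = np.asarray(x, dtype=float).copy()   # forward prediction errors
    b = f.copy()                            # backward prediction errors
    a = np.array([1.0])                     # prediction-error filter [1, a1, ...]
    for _ in range(order):
        fp, bp = f[1:], b[:-1]
        k = -2.0 * np.dot(fp, bp) / (np.dot(fp, fp) + np.dot(bp, bp))
        f, b = fp + k * bp, bp + k * fp     # update error sequences
        a = np.concatenate([a, [0.0]]) + k * np.concatenate([[0.0], a[::-1]])
    return a[1:]

def channel_features(segment, order_l1=12):
    """AR coefficients of the basis-1 packets (A1, D2), concatenated."""
    A1, D2 = haar_level1(segment)
    return np.concatenate([burg_ar(A1, order_l1), burg_ar(D2, order_l1)])

def majority_vote(channel_decisions):
    """Second-stage classifier: IC (1) iff most per-channel QDAs say IC."""
    return int(np.sum(channel_decisions) > len(channel_decisions) / 2)

rng = np.random.default_rng(1)
feat = channel_features(rng.normal(size=256))
print(feat.shape)                        # (24,): 12 + 12 AR coefficients
print(majority_vote([1, 1, 0, 1, 0, 1]))  # 1: four of six channels say IC
```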

#### *2.2.2. Training, cross-validation, and testing*

For each subject, the 5×450 segments pertaining to the five mental tasks are divided into a training set, a validation set, and a test set. The training set is used to train the system. The validation set is used to select the best wavelet, the best wavelet basis, and the best AR model order. The test set is used to evaluate the final performance of the system. The performance of a system evaluated on a single fixed split of the data into training, validation, and test sets is neither accurate nor robust; therefore, we perform nested five-fold (or 5×5) cross-validation. While the model selection is done during the inner cross-validation process, the system performance is estimated in the outer cross-validation.

The data are split into five outer folds; for each of these, 80% of the data is used for training and validation and 20% is used for testing. The portion of the data assigned for training and validation is further divided into five inner folds. For each inner fold, 80% of the data is used for training and the rest is used for validation. Hence, what we report as the cross-validation and testing results are averages over 25 and 5 different cases, respectively.
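The fold bookkeeping above can be sketched as follows; this is an illustrative index-splitting scheme (contiguous folds, no shuffling), not the authors' exact procedure:

```python
import numpy as np

def nested_folds(n, k=5):
    """Yield (train, val, test) index arrays for k x k nested cross-validation."""
    outer = np.array_split(np.arange(n), k)
    for i in range(k):
        test = outer[i]
        rest = np.concatenate([outer[j] for j in range(k) if j != i])
        inner = np.array_split(rest, k)
        for m in range(k):
            val = inner[m]
            train = np.concatenate([inner[j] for j in range(k) if j != m])
            yield train, val, test

# 5 tasks x 450 segments = 2250 segments per subject
splits = list(nested_folds(2250))
print(len(splits))                 # 25: the inner cross-validation cases
tr, va, te = splits[0]
print(len(tr), len(va), len(te))   # 1440 360 450
```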

#### *2.2.3. Different wavelet bases*

Each of these segments is decomposed by wavelet packet analysis into three levels. In three-level wavelet packet decomposition, there exist 14 packets, which can be seen in Fig. 1.b. We have 25 different wavelet bases representing the initial segment in the wavelet domain. These bases are listed in Table 3.
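The count of 25 admissible bases follows from the recursion B(d) = 1 + B(d−1)², where B(d) counts the bases of a depth-d binary wavelet-packet tree including the undecomposed root; excluding the trivial root-only case leaves the 25 bases of Table 3. A short sketch:

```python
def num_bases(depth):
    """Number of admissible bases of a binary wavelet-packet tree of the given
    depth, counting the undecomposed root node itself as one basis."""
    if depth == 0:
        return 1                               # a leaf can only represent itself
    return 1 + num_bases(depth - 1) ** 2       # keep node, or split into two subtrees

# excluding the trivial root-only basis leaves the 25 bases of Table 3
print(num_bases(3) - 1)  # 25
```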


**Table 3.** Different Wavelet Bases

| Basis | Packets |
|-------|---------|
| 1 | A1 D2 |
| 2 | A1 A21 D22 |
| 3 | A1 A21 A221 D222 |
| 4 | A1 A211 D212 D22 |
| 5 | A1 A211 D212 A221 D222 |
| 6 | A11 D12 D2 |
| 7 | A11 D12 A21 D22 |
| 8 | A11 D12 A21 A221 D222 |
| 9 | A11 D12 A211 D212 D22 |
| 10 | A11 D12 A211 D212 A221 D222 |
| 11 | A11 A121 D122 D2 |
| 12 | A11 A121 D122 A21 D22 |
| 13 | A11 A121 D122 A21 A221 D222 |
| 14 | A11 A121 D122 A211 D212 D22 |
| 15 | A11 A121 D122 A211 D212 A221 D222 |
| 16 | A111 D112 D12 D2 |
| 17 | A111 D112 D12 A21 D22 |
| 18 | A111 D112 D12 A21 A221 D222 |
| 19 | A111 D112 D12 A211 D212 D22 |
| 20 | A111 D112 D12 A211 D212 A221 D222 |
| 21 | A111 D112 A121 D122 D2 |
| 22 | A111 D112 A121 D122 A21 D22 |
| 23 | A111 D112 A121 D122 A21 A221 D222 |
| 24 | A111 D112 A121 D122 A211 D212 D22 |
| 25 | A111 D112 A121 D122 A211 D212 A221 D222 |

#### *2.2.4. The relationship between the packet length and the AR order*

The length of a child packet is almost half of its parent packet's length. Since the initial decomposed segment is 256 samples long, the packets at levels 1, 2, and 3 contain approximately 128, 64, and 32 samples, respectively. When we estimate the packets' AR models, we consider their lengths as a factor in selecting the appropriate AR order. In other words, we try to keep the same ratios between the orders of the packets at different levels. Therefore, if we set the AR order of a first-level packet as *K*, the AR orders for the packets at levels 2 and 3 would be close to *K*/2 and *K*/4, respectively. Table 4 provides the information on the sets of AR orders used for different levels of decomposition.

**Table 4.** The Sets of AR Orders Used for Different Levels of Decomposition

#### *2.2.5. Different wavelets*

Wavelets from various families are used. For each subject and each mental task, the wavelet with the best performance during nested five-fold cross-validation is selected. The 36 wavelets tested are from the Haar, Daubechies, Biorthogonal, Coiflets, and Symlets families. We assign a number to each of the wavelets and list them in Table 5.


**Table 5.** Wavelets Used from Different Families

#### *2.2.6. The proposed method to find the best wavelet basis*

As mentioned earlier in the Introduction, different cost functions, which are usually different types of entropy functions, have been proposed to find the best wavelet basis. Unfortunately, the cost-function-based methodology does not work here for two main reasons. First, since the segments used for training the system are not fully stationary, different wavelet bases are chosen for different segments; hence, we cannot come up with a single basis. Second, even if we could find the best basis according to a defined cost function, there is no guarantee that the selected basis has the best performance for our system. Therefore, we propose a method to find the best wavelet basis which does not have the above problems. The idea is very simple: we select as the best basis the one which shows the best performance during a cross-validation process. There are two measures for the performance evaluation of our system, TPR and FPR. The ratio of TPR to FPR is calculated to compare the performance with different wavelet bases. This method is applicable not only to the BCI system but also to any system which is based on classification.
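The selection rule above (maximize the cross-validated TPR/FPR ratio across candidate bases) can be sketched as follows; the numbers are hypothetical:

```python
import numpy as np

def best_basis(tpr, fpr, eps=1e-12):
    """Return the index of the basis maximizing the TPR/FPR ratio.
    eps guards against division by zero when a basis attains FPR = 0."""
    ratio = np.asarray(tpr) / (np.asarray(fpr) + eps)
    return int(np.argmax(ratio))

# hypothetical cross-validation results for three candidate bases
tpr = [61.4, 59.2, 63.0]   # true positive rates (%)
fpr = [0.27, 0.38, 0.17]   # false positive rates (%)
print(best_basis(tpr, fpr))  # 2: the third basis has the highest TPR/FPR
```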

#### *2.2.7. Selecting the best wavelet and the best wavelet basis*

For each subject and each mental task, we run a nested five-fold cross-validation process with the first set of AR model orders in Table 4 (i.e., 12, 6, and 3 for levels 1, 2, and 3, respectively) in order to find the best wavelet and the best wavelet basis. For each of the 36 wavelets, we test all 25 possible wavelet bases. The best wavelet basis is first determined for every wavelet based on the TPR/FPR ratio. Second, we compare the performance of the system with the best bases of the different wavelets, and identify the wavelet whose best basis yields the best system performance (in terms of the TPR/FPR ratio). It is noteworthy that the results would be the same if we first found the best wavelet for each of the 25 wavelet bases and then selected the basis with the best performance.

#### *2.2.8. Optimizing the AR model order*

Having selected the best wavelet and wavelet basis for the BCI of each subject and each mental task, we find the optimal AR order via another nested five-fold cross-validation process. To this end, we test different sets of AR orders (i.e., sets 2, 3, …, 13 in Table 4) using the best wavelet and the best basis as previously determined, and select the set with the highest TPR/FPR ratio.

#### *2.2.9. Testing the system*

We test the system via the outer fold of nested five-fold cross-validation with the selected wavelet, the best wavelet basis, and the optimal AR order set for every BCI belonging to each subject and mental task. The results are given in the next section.

#### **3. Results**

The results of the cross-validation process (at AR order set 1 and the optimal AR order set) and the results of testing (at the optimal AR order set) are summarized in Table 6. This table also shows the selected wavelet and the best wavelet basis for each BCI. The performance of a system which is based on three-level wavelet decomposition (instead of wavelet packet analysis) is furthermore given in Table 6 for comparison purposes.



**Table 6.** Performance of BCIs for Different Subjects and Tasks during Cross-Validation and Testing

Table 6 is divided into 20 sections. Each section contains the information about the BCI belonging to a specific subject and mental task. The first line in every section shows the result of the first cross-validation process at AR order set 1. As mentioned before, this cross-validation is performed in order to select the best wavelet and the best wavelet basis for each BCI. The results include the mean and the standard deviation of the TPR and FPR values for the selected case (i.e., the best wavelet and basis). The best wavelet and the best basis are given on lines four and five of the table, respectively.

The optimal AR model order set is determined based on the results of the second cross-validation process, done with the selected wavelet and basis. The results of the second cross-validation with the optimal AR order appear on the second line of each section. It is noteworthy that the optimal AR order set for all BCIs is the last set, which has the largest numbers.

The performance of the BCI systems with the best configuration (i.e., using the best wavelet, best basis, and optimal AR order set) is also given on the third line of each section. The last line of each section presents the performance of the BCI based on three-level wavelet decomposition as an example to be compared with the performance of the system based on the best basis. As previously discussed, three-level wavelet decomposition is one of the existing bases in three-level wavelet packet analysis (basis 16 according to Table 3).

The most discriminatory task for each subject is determined by comparing the performance of five BCIs pertaining to the subject. The most discriminatory mental task is in a section with solid black borders. For each subject, the second most discriminatory task is also chosen based on the results and shown in bold in the table.

Table 7 presents the average system performance of the BCIs for each subject (over tasks) and for each task (over subjects).


The preliminary results have been published in [62].

**Table 7.** Average Performance of BCI Systems

| | TPR Mean | TPR SD | FPR Mean | FPR SD |
|---|---|---|---|---|
| **Average over Tasks** | | | | |
| Subject 1 | 61.38 | 5.61 | 0.27 | 0.24 |
| Subject 2 | 59.24 | 5.33 | 0.38 | 0.39 |
| Subject 3 | 57.51 | 5.69 | 0.08 | 0.10 |
| Subject 4 | 68.49 | 4.90 | 0.19 | 0.24 |
| **Average over Subjects** | | | | |
| Baseline | 63.34 | 5.50 | 0.27 | 0.35 |
| Multiplication | 59.39 | 5.00 | 0.20 | 0.19 |
| Letter Composing | 63.50 | 4.77 | 0.21 | 0.17 |
| Rotation | 59.06 | 4.84 | 0.31 | 0.33 |
| Counting | 63.00 | 6.81 | 0.17 | 0.18 |
| **Total Average** | **61.66** | **5.38** | **0.23** | **0.24** |

#### **4. Discussion**

#### **4.1. Best wavelet bases**

As seen from Table 6, the best wavelet basis for all subjects and mental tasks is surprisingly the first basis, which is made of the approximation and detail components of the first level of wavelet decomposition. This implies that for the proposed BCIs, it is enough to decompose the signal into the first level, and further decomposition degrades the system performance.

#### **4.2. Most discriminatory tasks**

To determine the most discriminatory tasks for each subject, we consider the FPR values during testing and cross-validation at the optimal AR order set. We put the main weight on the FPR of testing since not only is the portion of the dataset used for testing larger than the portion used for cross-validation, but also the testing portion is completely separate from the portions exploited during the training and cross-validation processes. The TPR values during testing and cross-validation are then considered, if necessary, to find the most discriminatory tasks.

For Subject 1, the testing FPR for the baseline and the rotation tasks are the same and the lowest. The cross-validation FPR is lower for the baseline. Hence, the most discriminatory task must be the baseline. Since a BCI based on the baseline is activated when the subject wishes to relax and think of nothing, it is practically useless; therefore, we do not consider the baseline as the most discriminatory task. The rotation task is then selected as the most discriminatory task. The second most discriminatory task for Subject 1 is the multiplication.

For Subject 2, the most discriminatory task is the multiplication since it has the lowest testing and cross-validation FPRs. The counting is the second most discriminatory task for this subject.

For Subjects 3 and 4, the most and the second most discriminatory tasks are the counting and the letter composing, respectively. For Subject 3, the FPR values for these two tasks interestingly reach zero during testing.

#### **4.3. Selected wavelets**

Unlike the basis, the selected wavelets are not the same for all subjects and mental tasks. In eleven BCIs (out of the twenty BCIs designed for different subjects and tasks), a wavelet from the Daubechies family has been chosen as the best wavelet. Wavelet 'db2', the most frequently selected wavelet, is the best wavelet for five BCIs. The BCIs for the most and the second most discriminatory tasks of Subjects 2 and 3 have the best performance with this wavelet among all other wavelets. In five BCIs, a wavelet from the Biorthogonal family is the best. Wavelets from the Coiflets and Symlets families are each selected in two BCIs.

#### **4.4. Average system performance**

According to Table 7, the BCIs of Subject 3 generally have the best performance, with an average FPR of 0.08% and an average TPR of 57.51%. Moreover, the BCIs based on the counting task have the best performance overall; their average FPR and TPR values are 0.17% and 63.00%, respectively.

#### **4.5. Comparison of different wavelet bases**

For each BCI, as further analysis, we consider and evaluate the system performance with different wavelet bases. We then sort and rank the bases based on the ratio of TPR to FPR during testing. Considering the performance of each wavelet basis for different subjects and mental tasks, we count the number of cases in which each basis ranks *n*-th among the other bases. The results are summarized in Table 8. The corresponding three-dimensional histograms are shown in Fig. 4. In this figure, the horizontal axes correspond to the different wavelet bases and their ranks. The vertical axis shows the number of times a basis attains a specific rank amongst the other bases. It can be seen from this bar diagram that as the basis number increases, the rank of the basis worsens, i.e., the first basis has the highest ranks and the last basis has the worst rank. The results are largely in line with the cross-validation results (with the first set of AR orders) and show that the first basis is the best basis for all BCIs except for three of them, belonging to the multiplication task of Subject 1 and the rotation task of Subjects 2 and 3. For the multiplication task of Subject 1, the best basis is basis 2. Basis 6 ranks first for the rotation task of Subjects 2 and 3. The rank of the first basis for the multiplication task of Subject 1 and the rotation task of Subject 2 is two. The first basis is ranked third for the rotation task of Subject 3. In all other 17 BCIs, the first basis is the best.

**Figure 4.** Comparison of different wavelet bases.


**Table 8.** Comparing Different Wavelet Bases during Testing at AR Order Set 13

#### **5. Conclusion**


In this paper, we presented a method to select the best wavelet basis in the design of a two-state self-paced mental task-based BCI. The use of the proposed methodology is not limited to BCI systems; it can also be used in other applications. The previously introduced methods (based on cost functions) to find the best wavelet basis generally have two major drawbacks. First, they are not practical in classification problems where we have different training signal segments, each of which may result in a different basis. In this case, we cannot come up with a unique basis for all of the training signal segments; therefore, we are not able to finalize our design of the classification system. The second drawback results from the fact that, even supposing we can find a unique best wavelet basis for all training segments, there is no guarantee that this wavelet basis yields the best classification accuracy. For these two reasons, we decided to propose a method based on the classification accuracy to find the best basis. Since our BCI systems are evaluated by true positive and false positive rates, we used these two measures in the process of finding the best wavelet basis. It is worth noting that any other kind of classification accuracy measure can potentially be exploited in our proposed method to select the best basis.

We have tested our proposed method in the design of mental task-based BCIs. The output of the BCI should be activated when the subject performs a specific mental task. The aim is to minimize the false activation rate and maximize the true activation rate.

The scalar autoregressive model coefficients of the components of the best wavelet basis were used as features. The classifiers studied were based on QDA and majority voting. We performed nested five-fold cross-validation two times to choose the best wavelet, the best wavelet basis, and the best autoregressive model orders. Results have shown that the most discriminatory tasks are different amongst the subjects, confirming the findings of previous studies ([20], [30], [33], and [42]) based on the same dataset.

For each subject and each mental task, the best configuration (i.e., the best wavelet, the best wavelet basis, and the optimum AR order) is found during offline analysis of the data. During online analysis, the best configuration is used. Therefore, the system is applicable to real-time applications.

#### **Author details**

come up with a unique basis for all of the training signal segments; therefore, we are not able to finalize the design of the classification system. The second drawback results from the fact that, even supposing we can find a unique best wavelet basis for all training segments, there is no guarantee that this basis yields the best classification accuracy. For these two reasons, we decided to propose a method based on classification accuracy to find the best basis. Since our BCI systems are evaluated by true positive and false positive rates, we used these two measures in the process of finding the best wavelet basis. It is worth noting that any other classification accuracy measure can potentially be exploited in our proposed method to select the best basis.

We have tested our proposed method in the design of mental task-based BCIs. The output of the BCI should be activated when the subject performs a specific mental task. The aim is to minimize the false activation rate and to maximize the true activation rate.

The scalar autoregressive model coefficients of the components of the best wavelet basis were used as features. The classifiers studied were based on QDA and majority voting. We performed nested five-fold cross-validation twice to choose the best wavelet, the best wavelet basis, and the best autoregressive model orders. Results have shown that the most discriminatory tasks differ amongst subjects, confirming the findings of previous studies ([20], [30], [33], and [42]) based on the same dataset.

For each subject and each mental task, the best configuration (i.e., the best wavelet, the best wavelet basis, and the optimum AR order) is found during offline analysis of the data. During online analysis, the best configuration is used. Therefore, the system is applicable to real-time applications.

**Author details**

Farshad Faradji1, Farhad Faradji2,3\*, Rabab K. Ward3 and Gary E. Birch3

1 Qom University, Qom, Iran

2 Amirkabir University of Technology, Tehran, Iran

3 University of British Columbia, Vancouver, British Columbia, Canada

#### **References**

[1] R.R. Coifman and M.V. Wickerhauser, "Entropy-based algorithms for best basis selection," *IEEE Trans. on Inf. Theory*, vol. 38, no. 2, pp. 713–718, Mar. 1992.

[2] M.A. Nicolelis, "Brain-machine interfaces to restore motor function and probe neural circuits," *Nat. Rev. Neurosci.*, vol. 4, no. 5, pp. 417–422, 2003.

[3] S.G. Mason and G.E. Birch, "A general framework for brain-computer interface design," *IEEE Trans. Neural Syst. Rehabil. Eng.*, vol. 11, pp. 70–85, Mar. 2003.


[17] R. Palaniappan and P. Raveendran, "A new mode of EEG based communication," *In Proc. IEEE Int. Joint Conf. Neural Net.*, vol. 4, pp. 2679–2682, Jul. 2001.

[18] R. Palaniappan, P. Raveendran, S. Nishida, and N. Saiwaki, "Autoregressive spectral analysis and model order selection criteria for EEG signals," *In Proc. IEEE Region 10 Conf.*, pp. II-126–II-129, Sep. 2000.

[19] M.I. Bhatti, A. Pervaiz, and M.H. Baig, "EEG signal decomposition and improved spectral analysis using wavelet transform," *In Proc. 23rd IEEE EMBS*, pp. 1862–1864, Oct. 2001.

[20] R. Palaniappan, R. Paramesran, S. Nishida, and N. Saiwaki, "A new brain-computer interface design using fuzzy ARTMAP," *IEEE Trans. Neural Syst. Rehabil. Eng.*, vol. 10, no. 3, pp. 140–148, Sep. 2002.

[21] V.A. Maiorescu, M. Serban, and A.M. Lazar, "Classification of EEG signals represented by AR models for cognitive tasks - a neural network based method," *In Proc. Int. Symp. Signals Circuits Syst.*, vol. 2, pp. 441–444, 2003.

[22] D. Garrett, D.A. Peterson, C.W. Anderson, and M.H. Thaut, "Comparison of linear, nonlinear, and feature selection methods for EEG signal classification," *IEEE Trans. Neural Syst. Rehabil. Eng.*, vol. 11, no. 2, pp. 141–144, Jun. 2003.

[23] D. Liu, Z. Jiang, W. Cong, and H. Feng, "Detect determinism of spontaneous EEG with a multi-channel reconstruction method," *In Proc. IEEE Int. Conf. Neural Net. Signal Process.*, pp. 708–711, Dec. 2003.

[24] X. Wu and X. Guo, "Mental EEG Analysis based on Independent Component Analysis," *In Proc. 3rd Int. Symp. Image Signal Process. Anal.*, pp. 327–331, Sep. 2003.

[25] D. Liu, Z. Jian, and H. Feng, "Separating the different components of spontaneous EEG by optimized ICA," *In Proc. IEEE Int. Conf. Neural Net. Signal Process.*, pp. 1334–1337, Dec. 2003.

[26] J.Z. Xue, H. Zhang, C.X. Zheng, and X.G. Yan, "Wavelet packet transform for feature extraction of EEG during mental tasks," *In Proc. 2nd Int. Conf. Machine Learn. Cybern.*, pp. 360–363, Nov. 2003.

[27] G.A. Barreto, R.A. Frota, and F.N.S. de Medeiros, "On the classification of mental tasks: A performance comparison of neural and statistical approaches," *In Proc. IEEE Workshop Machine Learn. Signal Process.*, pp. 529–538, 2004.

[28] M.S. Daud and J. Yunus, "Classification of mental tasks using de-noised EEG signals," *In Proc. 7th Int. Conf. Signal Process.*, pp. 2206–2209, 2004.

[29] K. Tavakolian, S. Rezaei, and S.K. Setarehdan, "Choosing optimal mental tasks for classification in brain computer interfaces," *In Proc. Int. Conf. Artificial Intell. and App.*, pp. 396–399, Feb. 2004.

[30] K. Tavakolian and S. Rezaei, "Classification of mental tasks using Gaussian mixture Bayesian network classifiers," *In Proc. IEEE Int. Workshop Biomed. Circuits Syst.*, pp. S3.6-9–S3.6-11, Dec. 2004.

[43] G. Yan, X. Guo, R. Yan, and B. Yang, "Nonlinear quadratic phase coupling on EEG based on 1*1/2*-dimension spectrum," *In Proc. 3rd Int. Conf. Advances Med. Signal Infor. Process.*, pp. 1–4, Jul. 2006.

[44] F. Abdollahi and A. Motie-Nasrabadi, "Combination of frequency bands in EEG for feature reduction in mental task classification," *In Proc. 28th IEEE EMBS*, pp. 1146–1149, Sep. 2006.

[45] K. Nakayama and K. Inagaki, "A brain computer interface based on neural network with efficient pre-processing," *In Proc. Int. Symp. Intell. Signal Process. Commun. Syst.*, pp. 673–676, Dec. 2006.

[46] C.W. Anderson, J.N. Knight, T. O'Connor, M.J. Kirby, and A. Sokolov, "Geometric subspace methods and time-delay embedding for EEG artifact removal and classification," *IEEE Trans. Neural Syst. Rehabil. Eng.*, vol. 14, no. 2, pp. 142–146, 2006.

[47] D.-M. Dobrea and M.-C. Dobrea, "An EEG (bio) technological system for assisting the disabled people," *In Proc. 5th IEEE Int. Conf. Comput. Cybern.*, pp. 191–196, Oct. 2007.

[48] D.-M. Dobrea, M.-C. Dobrea, and M. Costin, "An EEG coherence based method used for mental tasks classification," *In Proc. 5th IEEE Int. Conf. Comput. Cybern.*, pp. 185–190, Oct. 2007.

[49] K. Nakayama, Y. Kaneda, and A. Hirano, "A brain computer interface based on FFT and multilayer neural network - feature extraction and generalization," *In Proc. Int. Symp. Intell. Signal Process. Commun. Syst.*, pp. 826–829, Nov. 2007.

[50] L. Zhiwei and S. Minfen, "Classification of mental task EEG signals using wavelet packet entropy and SVM," *In Proc. 8th Int. Conf. Elec. Measure. Instr.*, pp. 3-906–3-909, Aug. 2007.

[51] B.T. Skinner, H.T. Nguyen, and D.K. Liu, "Classification of EEG signals using a genetic-based machine learning classifier," *In Proc. 29th IEEE EMBS*, pp. 3120–3123, Aug. 2007.

[52] F. Abdollahi, S.K. Setarehdan, and A.M. Nasrabadi, "Locating information maximization time in EEG signals recorded during mental tasks," *In Proc. 5th Int. Symp. Image Signal Process. Anal.*, pp. 238–241, Sep. 2007.

[53] C.R. Hema, M.P. Paulraj, R. Nagarajan, S. Yaacob, and A.H. Adom, "Fuzzy based classification of EEG mental tasks for a brain machine interface," *In Proc. 3rd Int. Conf. Intell. Infor. Hiding Multimed. Signal Process.*, vol. 1, pp. 53–56, Nov. 2007.

[54] S.M. Hosni, M.E. Gadallah, S.F. Bahgat, and M.S. AbdelWahab, "Classification of EEG signals using different feature extraction techniques for mental-task BCI," *In Proc. Int. Conf. Computer Eng. Syst.*, pp. 220–226, Nov. 2007.

[55] M.P. Paulraj, C.R. Hema, R. Nagarajan, S. Yaacob, and A.H. Adom, "EEG classification using radial basis PSO neural network for brain machine interfaces," *In Proc. 5th Student Conf. Research Develop.*, pp. 1–5, Dec. 2007.


**Provisional chapter**

#### **Client-Centred Music Imagery Classification Based on Hidden Markov Models of Baseline Prefrontal Hemodynamic Responses**

Tiago H. Falk, Kelly M. Paton and Tom Chau

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/55804

#### **1. Introduction**

A prevalent avenue in rehabilitation engineering is the development of new access technologies which enable individuals with severe and/or multiple disabilities to access or interact with their environment. Despite continued efforts to develop such technologies, many individuals still possess no means to communicate or control external devices [1]. Brain-computer interfaces (BCIs) have surfaced as promising access solutions for those with severe motor disabilities such as late-stage amyotrophic lateral sclerosis (ALS), brainstem stroke, or severe cerebral palsy [2]. Electroencephalography (EEG), a noninvasive method used in traditional BCI technologies, is safe and inexpensive, but long-term electrode fixation is difficult and the signal is inherently noisy [1, 3]. Other common noninvasive BCI technologies such as functional magnetic resonance imaging (fMRI), magnetoencephalography (MEG), and positron emission tomography (PET) are theoretically feasible, but are expensive and technically demanding for online applications [3]. More recently, the possibility of using near infrared spectroscopy (NIRS) as a tool in this field has been investigated due to its affordability, portability, high temporal resolution (similar to that of fMRI), and higher signal-to-noise ratio than that of EEG due to the absence of artifacts [3–9].

Near-infrared spectroscopy is an analysis technique that operates by transmitting near-infrared (650 nm – 950 nm wavelengths) electromagnetic radiation through the skull and comparing the intensities of the returning and incident light. When functional mental activity occurs, metabolic demands cause an increase in the blood flow to pre-defined regions of the brain (e.g., motor cortex during motor imagery tasks [3]). The increase in blood flow causes changes to the regional concentrations of oxygenated and deoxygenated hemoglobin, hence altering the optical properties of the brain tissue [10, 11]. As a consequence, by measuring the intensity of the reflected light, NIRS technologies can be used to examine regional cortical activity with depths of up to 2 cm [10]. Previous research has shown that NIRS can be used to assess hemodynamic responses in the motor cortex using activities such as motor imagery [3, 7], as well as in the prefrontal cortex using higher cognitive tasks such as mental arithmetic [6, 11, 12], working memory [13, 14], and emotion induction [15, 16], both in silent (e.g., [17, 18]) and noisy environments [19].

© 2013 Falk et al.; licensee InTech. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

A major drawback of measuring hemodynamic responses in the motor cortex is the interference of hair [8]. Additionally, for individuals who have congenital motor impairments or who are many years post-traumatic injury, motor imagery may be a difficult, or even impossible, task due to the absence of a somatosensory map of motor activities [20–22]. These problems can be avoided by using more intuitive tasks which activate the prefrontal cortex (PFC). In this study, we proposed to classify hemodynamic responses in the PFC resultant from an intuitive music imagery task. More specifically, we were interested in exploiting the emotional response that music imagery can have on an individual. We were motivated by previous studies which have shown that music can elicit [23] and enhance [24] intense emotional responses that activate brain regions believed to be associated with emotional behaviours, such as the PFC [25] and the orbitofrontal and frontopolar areas [26, 27]. More importantly, prefrontal hemodynamic responses due to subject-selected music imagery have been recently observed using fMRI [28]. An additional motivation for using the music imagery task was related to the simplicity of the cognitive task (relative to mental arithmetic, for example), which would allow younger children to use this classification system. Recent experiments have suggested that NIRS responses in the PFC resultant from music imagery are detectable and sufficiently different from those resultant from mental arithmetic tasks and baseline resting states [17, 18, 29].

The goal of this study was to investigate the potential of classifying rest and music imagery on the basis of non-invasively acquired NIRS signals. Automated detection of music imagery events could then be used to control a system-paced two-state BCI (e.g., a binary switch). To develop a music imagery classifier that is applicable to users with different characteristics (e.g., age, gender, disability) requires a significant BCI–user coordination effort [2]. For the task at hand, we have incorporated a client-centred paradigm where subject-specific physical and physiological factors such as movement, heartbeat, and respiration were taken into account during classifier development [10]. Additional motivations for pursuing a client-driven paradigm included *i)* dependency of the hemodynamic responses on the intensity and valence (i.e., happy or sad) of the emotions evoked [30], *ii)* the mode (major or minor) and tempo (fast or slow) of the imagined songs [31], and *iii)* the differences in age, race, gender, skull thickness and skull shapes which may lead to inter-subject variations in penetration depths of NIR light [7, 32]. In this paper, hidden Markov models (HMMs), which have been previously used for motor imagery classification [3, 33], were used to characterize the individuals' hemodynamic responses during rest (baseline). Client-centredness was achieved by optimizing HMM model parameters and classifier input parameters on a per-client basis.

#### **2. Experimental methods**

This section describes the experiment setup, execution, and data pre-processing.

#### **2.1. Participants**


Thirteen able-bodied adults (three male, mean age of 33.5 ± 12.8 years) were recruited from the University of Toronto and Bloorview Kids Rehab (Toronto). Exclusion criteria were metabolic, cardiovascular, respiratory, psychiatric, or drug- or alcohol-related conditions that could affect either the measurements or the participant's ability to follow the experimental protocol. Participants were required to have normal hearing. Ethical approval was obtained from the relevant institutions, and all participants provided informed signed consent.

#### **2.2. Instrumentation**

The hemodynamic response was recorded from each subject using the Imagent Function Brain Imaging System from ISS Inc. Two photomultiplier tube detectors were employed along with sixteen light sources — eight at 690 nm and eight at 830 nm. The sources delivered 110 MHz-modulated light to the forehead via 400 *µ*m-diameter optical fibres, and the detectors received the returning light via 3 mm-diameter optical fibres. Light returned to the detectors was demodulated at a cross-correlation frequency (CCF) of 5 kHz. To avoid cross-signal contamination, the light sources were cyclically switched such that no two sources were on simultaneously. For one data collection cycle — defined as one complete sequence through all sixteen sources — each source remained on for eight periods of the CCF (i.e., 1.6 ms), separated by a two-period break (0.4 ms) to ensure no overlap. This resulted in an effective sampling rate of 31.25 Hz per full data collection cycle. For each cycle, the eight waveforms from each source-detector pair were averaged, and a fast Fourier transform (FFT) was applied to the result in order to obtain three output data components: AC (relative amplitude at the CCF), DC (relative amplitude at 0 Hz), and phase. In this study, only the AC and DC data were used.
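As a sanity check on the source-switching arithmetic above (8 CCF periods on and a 2-period break per source, 16 sources), the cycle length and effective sampling rate can be derived in a few lines. The constant names are ours; all values are taken from the text:

```python
# Timing of one data-collection cycle for the NIRS instrument described
# above; all numeric values come from the text.
CCF_HZ = 5000        # cross-correlation frequency (5 kHz)
N_SOURCES = 16       # light sources, cyclically switched (no overlap)
ON_PERIODS = 8       # CCF periods each source stays on (1.6 ms)
BREAK_PERIODS = 2    # CCF periods of break after each source (0.4 ms)

period_s = 1.0 / CCF_HZ                                 # 0.2 ms per period
per_source_s = (ON_PERIODS + BREAK_PERIODS) * period_s  # 2.0 ms per source
cycle_s = N_SOURCES * per_source_s                      # 32 ms per full cycle
effective_rate_hz = 1.0 / cycle_s                       # 31.25 Hz
```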

The sixteen source fibres were grouped in pairs comprised of one source at each of the two wavelengths; this allowed a given location to be probed by both wavelengths simultaneously. Positioning four source pairs around one detector on each side of the participant's forehead allowed both the right and left prefrontal cortices to be probed, as shown in Figure 1(a). Emitters in position 3 on each side were located roughly at the left and right prefrontal cortices (FP1 and FP2 in the 10-20 system). The raw AC and DC signals obtained from each source can be represented as $x_{p,i}^{\lambda}(\gamma)$, where the subscript $i$ indexes the source position ($i = 1, \ldots, 4$), subscript $p$ indicates the left or right side by $L$ or $R$, and superscript $\lambda$ indicates the wavelength ($\lambda = 690, 830$ nm). The signal type (AC or DC) is indicated by the parameter $\gamma$.

On each side, the four source pairs were positioned 2.12 cm away from the detector (see Figure 1(a)). It has been shown (in the occipital region) that, although a larger source–detector distance (3.37 cm to 4.55 cm) gives a greater magnitude of response, the same features can be seen in a signal obtained at a distance of 2.25 cm [34]. While the chosen distance of 2.12 cm is slightly smaller, the region of interest is the prefrontal cortex, which previous NIRS studies have shown to have attenuation 1000 times smaller than the occipital region [34]. Such insight suggests hemodynamic responses should be observable with a source–detector separation of 2.12 cm. The promising results reported in [4], as well as the more recent results reported in [35], corroborate this assumption.


**Figure 1.** (a) Positioning of source pairs (circles) and detectors (diamonds). Each source pair represents one *λ* =690 nm source and one *λ* =830 nm source. The signal from each source is represented as *x<sup>λ</sup> p*,*i* , with *i* = 1, . . . 4 and *p* = *L* (left) or *R* (right). Source pairs at position 3 on each side are located, anatomically, at the prefrontal cortices FP1 and FP2. (b) Image of headband position on the forehead with only the right side detector and sources in place.

Sources and detectors were secured to the forehead with a specially designed headband constructed from 1.6 mm low density polyethylene, as shown in Figure 1(b). The area probed using this arrangement corresponds approximately to the frontopolar cortex, the superior portion of the orbitofrontal cortex, and the more medial sections of the dorsolateral prefrontal cortex. Subjects wore a blindfold and industrial earmuffs (AOSafety Professional Hearing Protector, noise rating of 30 dB) for the entire data collection duration in order to eliminate any visual and/or auditory distractions external to the experiment.

#### **2.3. Protocol**

NIRS signals were collected from each participant while they were outfitted with the headband, blindfold, and earmuffs and seated comfortably in front of a computer in a quiet room. Each subject performed four trials spread over two separate sessions (i.e., two trials were completed per session); each session lasted about 40 minutes, including the time for experiment preparation (e.g., headband placement) and NIRS sensor calibration. Each session was performed on a separate day to ensure maximum concentration and focus and to reduce any effects of discomfort due to the headband. Participants were asked to remain as still as possible and to refrain from speaking during all experimental trials. Halfway through each session, the participant was given a short break to alleviate fatigue. While the blindfold and earmuffs were removed during the break, the NIRS sensors remained in place so as to preserve calibration and positioning.

**Figure 2.** Stimulus pattern for the imagery trials. Imagery periods are shaded and rest intervals are not shaded.

Two of the trials consisted of the participant in a static state of rest with no music imagery (referred to as 'baseline' trials). Participants were instructed to clear their minds as much as possible, and to focus on their breathing for approximately two minutes. For the remaining two trials, participants were informed that they were required to sing *one* of their chosen songs vividly in their head when cued by the experimenter via a light tap on the shoulder (referred to as 'imagery' trials). Prior to each session, participants were asked to choose several songs of their own preference that were of the same emotional valence (i.e., happy or sad), which they felt elicited a strong emotional reaction. Participants were informed that the purpose of the music imagery task was to elicit an intense emotional reaction. As a consequence, they were instructed to switch to a new song once they began to feel emotionally habituated to the currently-used song.

Figure 2 shows the stimulus pattern used for the 220-second imagery trials. As can be seen, each imagery trial consisted of eleven 20-second intervals alternating between rest (six intervals) and music imagery (five intervals). An activation window of 20 seconds was chosen to take into account the task initiation delay, which we have found in previous experiments to be as long as 5 seconds, as well as the inherent physiological latency of the hemodynamic responses, which can be on the order of 5-10 seconds [3]. For similar reasons, twenty-second rest intervals were chosen. Such a choice of stimulus pattern, however, can reinforce slow 3rd-order blood pressure waves and their subharmonics (circa 0.05-0.1 Hz); thus, a wavelet-based denoising algorithm was used, as described in Section 2.4.
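For illustration, the stimulus pattern of Figure 2 can be expressed as a per-sample label track at the 31.25 Hz effective sampling rate from Section 2.2. The helper name and the rest/imagery encoding (0/1) are our own, not the chapter's:

```python
def trial_labels(n_intervals=11, interval_s=20.0, fs=31.25):
    """Per-sample labels for one imagery trial: 20 s intervals
    alternate rest (0) and imagery (1), starting and ending with rest."""
    samples_per_interval = int(round(interval_s * fs))  # 625 samples
    labels = []
    for k in range(n_intervals):
        labels += [k % 2] * samples_per_interval  # even k -> rest
    return labels

labels = trial_labels()  # 11 x 20 s = 220 s: 6 rest, 5 imagery intervals
```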

#### **2.4. Pre-processing**


The raw signals (AC and DC) were filtered to mitigate physiological noise due, primarily, to cardiac signals (0.5–2 Hz), respiration (0.2–0.4 Hz) and the Mayer wave (approximately 0.1 Hz) [10], as well as the low-frequency artifacts (e.g., 3rd-order blood pressure waves) that may have been reinforced by the selected stimulus timing pattern. Since the frequencies of physiological noise and the desired hemodynamic response may vary slightly among participants, wavelet-based filters that would remove low-powered noise without relying on a specific cutoff frequency were used. Wavelet denoising has been shown to be effective for functional NIRS hemodynamics-driven signals [10], hence three wavelet filters were investigated which performed a 12-level decomposition using a Daubechies-12 wavelet. The filtered signals resulted from the reconstruction of the approximation wavelet coefficients and either the last three, four or five detail coefficients (henceforth referred to as 3-, 4-, and 5-detail wavelet filtered signals, respectively). A visual representation of the raw and filtered signals can be seen in Figure 3. The filtered signals are denoted as $x_{p,i}^{\lambda}(\gamma, d)$, with $d$ indicating the 'detail' of the filtered signal ($d = 3, 4, 5$) and $\lambda$, $p$, $i$, and $\gamma$ as before.
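As an illustrative sketch of this filtering step, the following uses the PyWavelets package (the chapter does not name its software, so `pywt` and the function name are our assumptions): a 12-level `db12` decomposition is taken, the finer detail levels are zeroed, and the signal is reconstructed from the approximation plus the `d` coarsest detail levels.

```python
import numpy as np
import pywt  # PyWavelets; an assumed implementation choice

def wavelet_filter(signal, wavelet="db12", level=12, d=4):
    """Illustrative d-detail wavelet filter: decompose to `level`
    levels, keep the approximation plus the `d` coarsest detail
    levels, zero the finer details, and reconstruct."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    # coeffs = [cA_L, cD_L, cD_{L-1}, ..., cD_1]; entries 1..d are the
    # coarsest (lowest-frequency) detail levels.
    for k in range(1 + d, len(coeffs)):
        coeffs[k] = np.zeros_like(coeffs[k])
    return pywt.waverec(coeffs, wavelet)[: len(signal)]
```

With an effective rate of about 31.25 Hz, keeping the four coarsest of twelve detail levels retains content roughly below 31.25/2⁹ ≈ 0.06 Hz, which suppresses the cardiac (~1 Hz), respiratory (~0.3 Hz), and Mayer-wave (~0.1 Hz) bands named above.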

#### **2.5. Feature extraction**

The estimated changes in the concentration of oxygenated hemoglobin (∆[*HbO*2]) and deoxygenated hemoglobin (∆[*HHb*]) were also computed, using the modified Beer-Lambert law [3] as shown in Eq. 1 and Eq. 2:

$$
\Delta[HbO_2] = \frac{\left(\varepsilon_{Hb}^{690nm}\cdot\frac{\Delta OD^{830nm}}{DPF^{830nm}}\right)-\left(\varepsilon_{Hb}^{830nm}\cdot\frac{\Delta OD^{690nm}}{DPF^{690nm}}\right)}{r\cdot\left(\varepsilon_{Hb}^{690nm}\,\varepsilon_{HbO}^{830nm}-\varepsilon_{Hb}^{830nm}\,\varepsilon_{HbO}^{690nm}\right)}, \qquad
\Delta[HHb] = \frac{\left(\varepsilon_{HbO}^{830nm}\cdot\frac{\Delta OD^{690nm}}{DPF^{690nm}}\right)-\left(\varepsilon_{HbO}^{690nm}\cdot\frac{\Delta OD^{830nm}}{DPF^{830nm}}\right)}{r\cdot\left(\varepsilon_{Hb}^{690nm}\,\varepsilon_{HbO}^{830nm}-\varepsilon_{Hb}^{830nm}\,\varepsilon_{HbO}^{690nm}\right)} \tag{1}
$$

where

$$
\Delta OD^{\lambda} = \log \frac{DC_0^{\lambda}}{DC^{\lambda}(t)}. \tag{2}
$$

Parameter $\Delta OD^{\lambda}$ corresponds to the change in optical density of light at wavelength $\lambda$; $DC_0^{\lambda}$ to the mean DC intensity; $DC^{\lambda}(t)$ to the DC intensity at time $t$; and $r$ to the source-detector separation distance. Constant $DPF^{\lambda}$ corresponds to the differential pathlength factor in the human adult head at wavelength $\lambda$, and $\varepsilon_{Hb}^{\lambda}$ and $\varepsilon_{HbO}^{\lambda}$ correspond to extinction coefficients for the *HHb* and *HbO*2 chromophores, respectively. Table 1 reports the constant values for each wavelength used in the experiments described herein. Concentration signals can be represented as $c_{p,i}^{h}(d)$, where the superscript $h$ indexes oxygenated (*HbO*) or deoxygenated hemoglobin (*Hb*), and $p = L, R$, $i = 1, \ldots, 4$, and $d = 3, 4, 5$ as before.
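To make the inversion concrete, here is a minimal sketch of the modified Beer-Lambert computation described above. The source-detector distance is the 2.12 cm stated earlier, but the DPF and extinction-coefficient values below are illustrative placeholders standing in for Table 1 (whose actual values are not reproduced here), and a base-10 logarithm is assumed for optical density:

```python
import numpy as np

# Placeholder constants standing in for Table 1 (hypothetical values).
DPF = {690: 6.5, 830: 5.9}       # differential pathlength factors
EPS_HB = {690: 2.1, 830: 0.8}    # extinction coefficients, HHb
EPS_HBO = {690: 0.3, 830: 1.0}   # extinction coefficients, HbO2
R = 2.12                         # source-detector distance (cm)

def delta_od(dc0, dc_t):
    """Eq. (2): change in optical density from mean and current DC."""
    return np.log10(dc0 / dc_t)

def mbll(dod690, dod830, r=R):
    """Eq. (1): invert the modified Beer-Lambert law for the
    concentration changes of HbO2 and HHb."""
    a = dod690 / DPF[690]
    b = dod830 / DPF[830]
    den = r * (EPS_HB[690] * EPS_HBO[830] - EPS_HB[830] * EPS_HBO[690])
    d_hbo2 = (EPS_HB[690] * b - EPS_HB[830] * a) / den
    d_hhb = (EPS_HBO[830] * a - EPS_HBO[690] * b) / den
    return d_hbo2, d_hhb
```

A quick consistency check is to pick known concentration changes, run them through the forward model $\Delta OD^{\lambda}/DPF^{\lambda} = r(\varepsilon_{HbO}^{\lambda}\,\Delta[HbO_2] + \varepsilon_{Hb}^{\lambda}\,\Delta[HHb])$, and confirm that `mbll` recovers them.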

In addition to each individual channel, the mean left and right responses were also computed by averaging like-wavelength raw signals or like-chromophore concentration signals over the four measurement positions on each side, resulting in:

$$
\bar{x}_p^{\lambda}(\gamma, d) = \frac{1}{4}\sum_{i=1}^{4} x_{p,i}^{\lambda}(\gamma, d), \tag{3}
$$

$$
\bar{c}_p^{h}(d) = \frac{1}{4}\sum_{i=1}^{4} c_{p,i}^{h}(d). \tag{4}
$$


**Table 1.** Differential pathlength factors (DPF) and extinction coefficients (*ε*) used in hemoglobin concentration calculations. Values are taken from [36] and [37].

**Figure 3.** Comparison of raw DC data obtained during an imagery trial with its three different wavelet-filtered signals. From the top down, the signals are: raw, 5-, 4-, and 3-detail filtered.

#### **3. Data analysis**


This section discusses basic HMM theory and the application of HMMs to characterize baseline signals and detect music imagery events.

#### **3.1. Hidden Markov models**

A hidden Markov model is a statistical model of a Markov process whose states are not directly observable; the observable outputs depend on the hidden states. Here, only a brief description of HMMs is given, and the reader is referred to [38] for further details. An HMM representing the feature vector *u* (see 3.2.1) is completely characterized by the number *Q* of discrete states, a transition matrix *A* = {*a<sub>ij</sub>*} of transition probabilities between states *i* and *j*, an initial state distribution *π*, and a state-dependent observation probability distribution *B* = {*b<sub>j</sub>*(*u*)} for state *j*. Commonly, Gaussian mixture models (GMMs) are employed as observation probability distributions. A GMM is given by a weighted sum of *M* component densities,

$$P(\vec{u}) = \sum\_{k=1}^{M} \alpha\_k p\_k(\vec{u}) \tag{5}$$

where *α<sub>k</sub>* ≥ 0 are the mixture weights with ∑<sup>*M*</sup><sub>*k*=1</sub> *α<sub>k</sub>* = 1, and *p<sub>k</sub>*(*u*) are *N*-dimensional Gaussian densities with mean *µ<sub>k</sub>* and covariance matrix Σ<sub>*k*</sub>, given by:

$$p\_k(\vec{u}) = \frac{1}{(2\pi)^{N/2}|\Sigma\_k|^{1/2}} \exp\left(-\frac{1}{2}(\vec{u} - \vec{\mu\_k})^\top \Sigma\_k^{-1} (\vec{u} - \vec{\mu\_k})\right). \tag{6}$$

A GMM with *M* = 1 indicates a conventional Gaussian density.
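Eqs. (5) and (6) translate directly into NumPy (an illustrative sketch; the function names are ours):

```python
import numpy as np

def gaussian_pdf(u, mu, cov):
    """Eq. (6): N-dimensional Gaussian density with full covariance."""
    n = len(mu)
    diff = u - mu
    # solve() avoids forming the explicit inverse of the covariance matrix
    return (np.exp(-0.5 * diff @ np.linalg.solve(cov, diff))
            / np.sqrt((2 * np.pi) ** n * np.linalg.det(cov)))

def gmm_pdf(u, weights, means, covs):
    """Eq. (5): mixture of M Gaussian components; weights must sum to 1."""
    return sum(a * gaussian_pdf(u, m, S) for a, m, S in zip(weights, means, covs))
```

With *M* = 1, an identity covariance, and *u* at the mean, the density reduces to 1/(2π)<sup>N/2</sup>, matching the conventional Gaussian case noted above.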

In this study, model parameters, such as state transition probabilities, initial state probabilities, and output distribution parameters, were computed using a recursive greedy-EM algorithm in which model parameters and the number of Gaussian components were estimated simultaneously, using a Bayesian information criterion to avoid overfitting [39]. Mean left and right concentration and raw features extracted from the imagery data were assessed against the baseline reference models via the log likelihood measure (see 3.2.1). Log likelihood values were computed using the so-called forward-backward procedure described in [38]. Normalization was performed based on the number of data points in the sampled signal under test. As mentioned previously, client-centredness was achieved by optimizing HMM parameters for each individual participant. More details about HMM training and testing are described in 3.2.1 and 3.2.2, respectively. The publicly available HMM Toolbox for Matlab was used for the simulations [40].

#### **3.2. Modeling process**

#### *3.2.1. Training stage*

For each participant, HMMs were trained for different combinations of AC, DC, or concentration features using the signals obtained during the baseline trials. More specifically, the four-dimensional feature vectors included:

$$\vec{u}\_{DC,d} = [\bar{x}\_L^{690}(DC, d), \bar{x}\_L^{830}(DC, d), \bar{x}\_R^{690}(DC, d), \bar{x}\_R^{830}(DC, d)],\tag{7}$$

$$\vec{u}\_{AC,d} = [\bar{x}\_L^{690}(AC, d), \bar{x}\_L^{830}(AC, d), \bar{x}\_R^{690}(AC, d), \bar{x}\_R^{830}(AC, d)],\tag{8}$$

$$\vec{u}\_{conc,d} = [\bar{c}\_L^{Hb}(d), \bar{c}\_L^{HbO}(d), \bar{c}\_R^{Hb}(d), \bar{c}\_R^{HbO}(d)],\tag{9}$$

where *d* indexes 3-, 4-, or 5-detail wavelet filters.

During hidden Markov model training, the number of parameters that need to be estimated depends on the number of HMM states *Q*, type of HMM (e.g., fully connected, left-right), number of Gaussian components *M*, data dimensionality *K*, and GMM covariance matrix type (i.e., diagonal or full). In this study, fully-connected HMMs and full covariance Gaussian components were used to explore correlations between NIRS signals measured from neighboring channels. Preliminary experiments with the recursive EM algorithm showed that the following HMM-GMM configurations were used most often:

• Q = 4, M = 1,

• Q = 2, M = 2,

• Q = 2, M = 1.
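The dependence of model size on these choices can be made concrete with a small free-parameter count. The helper below is illustrative (not from the chapter) and assumes that each row-stochastic distribution loses one free parameter to its sum-to-one constraint:

```python
def hmm_param_count(Q, M, K, full_cov=True):
    """Illustrative count of free parameters in a fully-connected HMM
    with M-component GMM emissions over K-dimensional features."""
    trans = Q * (Q - 1)                        # each transition-matrix row sums to 1
    init = Q - 1                               # initial distribution sums to 1
    cov = K * (K + 1) // 2 if full_cov else K  # full vs. diagonal covariance per component
    emis = Q * (M * (K + cov) + (M - 1))       # means, covariances, mixture weights
    return trans + init + emis
```

For example, the Q = 4, M = 1 configuration with the four-dimensional feature vectors and full covariances above has 71 free parameters, which illustrates why the smaller Q = 2 configurations are attractive for limited calibration data.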

Such configurations were consistent with those reported in previous HMM-based BCI studies (e.g., [3, 33]). Participant-specific models were trained using the feature vectors *u<sub>DC,d</sub>*, *u<sub>AC,d</sub>*, and *u<sub>conc,d</sub>*, for all three types of filtered data (*d* = 3, 4, and 5) and all three combinations of HMM parameters listed above. This resulted in 27 (3 data types × 3 filters × 3 model parameter sets) different models for each of the thirteen participants, or 9 models per data type per participant.

#### *3.2.2. Testing stage*

For testing, consecutive, overlapping data samples from the imagery trials were scored against the per-subject baseline reference HMMs via the log likelihood measure. Sliding-window samples of various lengths (*l* = 1, 3, 5, 7, 10, and 15 seconds) were investigated with 0.5-second overlap, and the log likelihood for a window was normalized by the length of the window. Window lengths of different sizes were investigated for two reasons. First, NIRS signal propagations have varied delay times, ranging from 5–8 seconds [3]. Second, reaction times differ between subjects due to various external factors such as mental alertness, innate reflex reaction time, and familiarity with the procedure.
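The windowing step can be sketched as follows, assuming consecutive windows shifted by 0.5 s (one reading of the 0.5-second overlap above); the function name and sampling rate are illustrative, not from the chapter:

```python
def sliding_windows(signal, fs, win_s, shift_s=0.5):
    """Consecutive, overlapping samples: windows of win_s seconds,
    each shifted by shift_s seconds. fs is the sampling rate in Hz."""
    w, s = int(win_s * fs), int(shift_s * fs)
    return [signal[i:i + w] for i in range(0, len(signal) - w + 1, s)]

# Each window would then be scored against a baseline HMM and normalized
# by its length, e.g.:  ll = forward_log_likelihood(window) / len(window)
```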

Higher log likelihood values suggest hemodynamic responses akin to those observed during the baseline trials (i.e., rest). Lower log likelihood values, in turn, indicate responses different from rest, i.e., music imagery events. Given this interpretation of the log likelihoods, a decrease in the log likelihood function was expected during imagery periods and either an increase or a constant value during the rest periods. The plot depicted in Figure 4 shows one subject's representative log likelihood temporal series clearly illustrating the expected increases and decreases during rest (un-shaded intervals) and imagery (shaded intervals), respectively.

#### **3.3. Music imagery detection**

Given the nature of the log likelihood functions, complex classifier training was not needed in order to detect music imagery events. Instead, changes from positive to negative slope in the log likelihood function were used for classification. In order to remove possible artifacts due to, e.g., head movement, which may produce a momentary positive-to-negative slope change, music imagery events were only detected if a decrease in the log likelihood function persisted for at least five seconds after a slope change was detected.
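This decision rule can be sketched as below. The code is our illustration: `fs` reflects two log likelihood values per second (a 0.5 s window shift), and the persistence check demands a strictly decreasing run after the slope change:

```python
def detect_imagery_onsets(ll, fs=2.0, persist_s=5.0):
    """Flag positive-to-negative slope changes in the normalized log
    likelihood series ll that are followed by a decrease sustained
    for at least persist_s seconds."""
    k = int(persist_s * fs)  # number of samples the decrease must persist
    onsets = []
    for t in range(1, len(ll) - k):
        rising_before = ll[t] > ll[t - 1]                           # positive slope up to t
        falling_after = all(ll[j + 1] < ll[j] for j in range(t, t + k))  # sustained decrease
        if rising_before and falling_after:
            onsets.append(t)
    return onsets
```

On a synthetic series that rises and then falls, only the peak survives the five-second persistence requirement; a momentary dip, as from a head-movement artifact, would not.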

**Figure 4.** Normalized log likelihood function for a single subject, clearly showing the anticipated pattern for an imagery trial. Computed using *u*<sub>*DC*,3</sub> data and tested against a 4-state 1-Gaussian HMM using *l*=3-second sliding windows. Imagery intervals are shaded. The X axis is time [sec] and the Y axis is normalized log likelihood [arbitrary units].

An expected imagery event (true positive) was identified with the occurrence of an activation within the first thirteen seconds of a twenty-second imagery interval. This time period allowed for both the delay in the change of the hemodynamics-driven signal (up to eight seconds [3]) and the delay on the part of the subject to start music imagery once tapped on the shoulder, which, in turn, was found to be between 1 and 5 seconds in pilot experiments. If no activation occurred during this first portion of an imagery interval, the interval was deemed a false negative, i.e., incorrectly classified as a rest interval. Intervals were also labeled false negatives if an activation occurred more than thirteen seconds after the beginning of the imagery interval. A rest interval with no activations was considered a true negative, as was a rest interval with an activation within the first eight seconds after the previous imagery interval — this margin was motivated again by the delay in the NIRS signal and the observed few seconds of delay in the participant's cessation of the music imagery [3, 10]. Rest intervals with an activation after the first eight seconds were deemed false positives, i.e., incorrectly classified as imagery intervals. The numbers of true/false positives (TP/FP) and true/false negatives (TN/FN) were computed and summed over the two imagery trials, for a total of 12 rest intervals and 10 imagery intervals, and were used to compute the performance metrics sensitivity and specificity, given by:

$$Sensitivity = \frac{TP}{TP + FN}, \tag{10}$$

$$Specificity = \frac{TN}{TN + FP}. \tag{11}$$

Sensitivity is a measure of how many actual activations were correctly identified, while the specificity indicates how many rest intervals were correctly identified as baseline.
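In code, Eqs. (10) and (11) are one-liners (a sketch; function names are ours):

```python
def sensitivity(tp, fn):
    """Eq. (10): fraction of actual imagery intervals correctly detected."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Eq. (11): fraction of rest intervals correctly identified as baseline."""
    return tn / (tn + fp)
```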

#### **4. Experimental results**

In this section, calculated statistics and optimal parameters are reported.

Client-Centred Music Imagery Classification Based on Hidden Markov Models of Baseline Prefrontal Hemodynamic Responses. http://dx.doi.org/10.5772/55804


**Table 2.** Optimal performance results for all three data types, by subject. Columns labeled 'Sens' and 'Spec' indicate sensitivity and specificity, respectively. The HMM parameters, namely, number of states and number of Gaussian mixtures, are shown by Q and M, respectively. Parameter *d* indicates the 'detail' of the filter and *l* the window size [seconds]. SD is standard deviation.

#### **4.1. Quantitative results**


Table 2 reports sensitivity (column labeled "Sens" in the table) and specificity ("Spec") for the HMM–filter–window combinations that resulted in the best per-user performance (defined as the highest average of sensitivity and specificity) out of the 27 possible combinations described in 3.2.1. The average over all participants is recorded in the last row, along with the standard deviation (SD). As can be seen, average sensitivities and specificities were all greater than 82%, and in many individual instances all imagery trials were correctly detected (100% sensitivity). These results are promising for the prefrontal cortical task of music imagery as they are comparable to previously reported results from a group of 3 subjects using motor imagery [8], but were obtained without having to compensate for the interference of hair. Additionally, as observed in the table, all three data types exhibited comparable average performances over all subjects. Although a slight improvement was observed with the hemoglobin concentration data, performance gains were not significant (*p*-values ≥ 0.38, *t*-test). This suggests that the additional computational effort required to calculate the concentrations may not be justified. Nonetheless, if computational power can be afforded, some subjects were shown to benefit greatly from using the concentration data (e.g., subjects 1 and 6).

#### **4.2. Client-centred design**

Table 2 further indicates the per-user and per-data type optimal HMM–filter–window combinations. The variation between parameters for all participants supports a client-centred design. For raw data (AC and DC), HMMs with observation probabilities represented by only one Gaussian (*M* = 1) were selected most often, whereas for all three data types, window sizes of *l* ≤ 10 seconds were selected.


| Subject | Sens (1 s) | Spec (1 s) | Sens (3 s) | Spec (3 s) | Sens (5 s) | Spec (5 s) | Sens (7 s) | Spec (7 s) | Sens (10 s) | Spec (10 s) | Sens (15 s) | Spec (15 s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.50 | 1.00 | 0.67 | 1.00 | 0.67 | 1.00 | 0.83 | 0.67 | 1.00 | 1.00 | 0.50 | 1.00 |
| 2 | 0.90 | 0.67 | 0.80 | 0.75 | 0.90 | 0.67 | 0.70 | 1.00 | 0.80 | 0.75 | 0.60 | 1.00 |
| 3 | 0.78 | 0.55 | 0.78 | 0.55 | 0.33 | 0.88 | 0.40 | 0.58 | 0.60 | 0.92 | 0.50 | 1.00 |
| 4 | 1.00 | 0.67 | 1.00 | 0.67 | 1.00 | 0.58 | 1.00 | 0.70 | 0.67 | 0.90 | 0.67 | 0.86 |
| 5 | 0.67 | 0.63 | 0.83 | 0.57 | 0.75 | 0.50 | 0.80 | 0.83 | 0.40 | 1.00 | 0.60 | 0.83 |
| 6 | 1.00 | 0.83 | 1.00 | 0.83 | 1.00 | 0.83 | 1.00 | 0.67 | 0.67 | 1.00 | 0.60 | 0.83 |
| 7 | 0.80 | 0.86 | 0.80 | 0.86 | 0.80 | 0.86 | 0.80 | 0.86 | 0.50 | 1.00 | 0.67 | 0.83 |
| 8 | 0.60 | 0.83 | 0.70 | 0.92 | 0.60 | 0.92 | 0.70 | 0.75 | 0.70 | 0.58 | 0.80 | 0.75 |
| 9 | 0.70 | 0.83 | 0.80 | 0.92 | 0.70 | 0.83 | 0.60 | 0.92 | 0.70 | 0.83 | 0.50 | 0.75 |
| 10 | 0.63 | 0.91 | 0.63 | 0.91 | 1.00 | 0.64 | 0.70 | 0.75 | 0.44 | 0.82 | 0.50 | 0.83 |
| 11 | 0.80 | 0.75 | 0.63 | 0.90 | 0.63 | 0.90 | 0.71 | 0.89 | 0.90 | 0.75 | 0.70 | 0.83 |
| 12 | 1.00 | 0.83 | 1.00 | 1.00 | 1.00 | 0.83 | 0.60 | 1.00 | 0.83 | 0.83 | 0.60 | 1.00 |
| 13 | 0.90 | 1.00 | 0.80 | 1.00 | 0.80 | 0.92 | 0.70 | 0.92 | 1.00 | 0.75 | 0.80 | 0.92 |
| **Mean** | **0.79** | **0.80** | **0.80** | **0.84** | **0.78** | **0.80** | **0.73** | **0.81** | **0.71** | **0.86** | **0.62** | **0.88** |
| **(SD)** | (0.16) | (0.14) | (0.13) | (0.16) | (0.20) | (0.15) | (0.16) | (0.13) | (0.19) | (0.13) | (0.11) | (0.10) |

**Table 3.** Time-dependent performance by subject for hemoglobin concentration data. Columns labeled 'Sens' and 'Spec' are sensitivity and specificity, respectively. SD is standard deviation.

Table 3 reports the effects of window size on system performance. For brevity, only the concentration data is included. As can be seen in the table, the best average performance occurred at a window size of 3 seconds, and sensitivity decreased as the window size increased. However, it is noted that the optimal window size varied between subjects. For example, subject 1 experienced the best performance at a window size of 10 seconds, while subject 12 displayed the best results for a 3-second window. This variability is likely due to one or more of the following factors: differences in reaction time, intensity of emotions evoked by the music imagery task, ability to concentrate on the imagery, probe location, and/or anatomical differences, all of which could have affected the propagation delay of the hemodynamic signal. This per-user effect of window size is taken into consideration via a client-centred approach.

#### **5. Discussion**

This section provides a comparison with previous work, discusses the use of music imagery for BCI control and the advantages of a client-centred approach, and considers the study's limitations.

#### **5.1. Comparison to past work**

The proposed mental activity classifier was exclusively based on statistical models of *baseline* temporal hemodynamic responses. This is a distinct advantage over other HMM-based classification systems (e.g., [3, 17, 33]) which require models to be trained on both mental states in order to perform classification. Such an approach requires the collected data to be separated for training and testing, hence limiting the amount of information available. The proposed method, in comparison, allows the use of all collected baseline data to train HMMs and all collected imagery data to serve as test data, thus reducing possible model overfitting artifacts.

The majority of past work has focused on converting light intensity signals into hemoglobin concentrations, with a few studies focusing on the AC or DC components directly (e.g., [6, 41]). Results observed here suggest that neither method is more discriminating than the other for the task of music imagery classification. As a consequence, the computationally intense intensity-to-concentration conversion may be avoided, making the approach more amenable to future adaptation into a binary BCI.

#### **5.2. Music imagery classification for BCI control**

Music imagery classification may be used to control a binary BCI operating in a system-paced mode, akin to single-switch scanning found in conventional electronic augmentative and alternative communication systems. In system-paced mode, options would be highlighted sequentially at a fixed rate. The user would make a selection by performing music imagery, which in turn would be used to activate the BCI. As mentioned in 3.3, music imagery classification was performed once a sustained decrease in the log-likelihood function was observed for five consecutive seconds. This decision rule limits the information transfer rate (*ITR*) of the BCI to a maximum of 12 bits/min and the system scan rate (*SR*) to a minimum of 5 seconds. Such values are somewhat lower than those achieved with existing EEG-based BCIs which can be in the order of *ITR* = 25 bits/min and *SR* = 2.4 s [4]. However, it is emphasized that participants had no prior training in using the technology; user training may lead to improvements in system response times [2], thus allowing for a faster decision rule to be implemented.
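The quoted bound follows from a simple best-case calculation, assuming one binary selection per scan period and perfect accuracy (a general ITR formula would also account for error rate and number of classes):

```python
SR = 5.0                  # minimum scan rate [s] imposed by the 5 s persistence rule
bits_per_selection = 1.0  # binary switch, best case: 1 bit per selection
ITR = 60.0 / SR * bits_per_selection  # maximum information transfer rate [bits/min]
```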

#### **5.3. Client-centred paradigm**

In this study, a client-driven approach was used and HMM model parameters were optimized on a per-participant basis. The optimal number of HMM states *Q* and Gaussian components *M* varied between subjects and also between data types (i.e., AC, DC, and hemoglobin concentrations), suggesting that throughout the course of the trials participants did not transition directly from rest to an active mental state. The intermediate activity between rest and music imagery likely varied significantly between participants due to irregularities in level, temporal and spatial distribution of the activity within the task interval. Such irregularities could be present due to participant-specific factors such as mind wandering, mental alertness, innate reflex reaction time, familiarity with the procedure, and NIRS signal levels and spatial distributions caused by varying emotional intensities of user-selected songs [25]. It is also observed that for the majority of the participants, the temporal light intensity signals after the intensity-to-concentration mapping are smoother than their original counterparts. This smoothing behaviour is likely the reason that a lower number of states were required, in general, for hemoglobin concentration data relative to AC or DC data.

In order to customize the proposed NIRS-driven classifier for an individual user for BCI control, a short calibration session can be undertaken with known imagery and rest intervals — much like the imagery trials of this experiment — in order to choose the optimal parameters for the data (data type, filter), the HMM (states, Gaussians) and the response time of the switch (window size). The optimization process can be hardcoded into a calibration program and executed with minimal intervention by an outside party, and thus is unlikely to pose a burden on the user. This would be akin to the training phase required by widely-used speech recognition engines.
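Such a calibration could amount to a plain grid search over the model combinations and window sizes considered in this chapter. The sketch below assumes a hypothetical `score_fn` that wraps HMM training/testing and returns (sensitivity, specificity) for one combination on the calibration data:

```python
from itertools import product

def calibrate(score_fn,
              data_types=("DC", "AC", "conc"),
              details=(3, 4, 5),                 # wavelet filter 'detail' d
              hmm_cfgs=((4, 1), (2, 2), (2, 1)), # (Q, M) pairs
              windows=(1, 3, 5, 7, 10, 15)):     # window sizes l [s]
    """Return the combination maximizing the mean of sensitivity and
    specificity, mirroring the per-user selection criterion of Table 2."""
    return max(product(data_types, details, hmm_cfgs, windows),
               key=lambda c: sum(score_fn(c[0], c[1], *c[2], c[3])) / 2.0)
```

Because the grid has only 162 cells (3 × 3 × 3 × 6), an exhaustive search is cheap relative to the HMM training it wraps.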

#### **5.4. Study limitations**

Experimental conditions were restrictive in the sense that participants were required to remain quiet and immobile for the duration of the data recording sessions; all focus was placed on the given task. All auditory and visual distractions were suppressed as much as possible by the earmuffs and blindfold. If BCI technologies are to be used in everyday settings, however, movement artifacts [10], auditory and visual distractions [42], and mind-wandering still have to be detected and accounted for (e.g., as in [19]). Additionally, experimental sessions lasted less than one hour. For BCIs, however, systems may be used for multiple consecutive hours. As a consequence, the effects of fatigue on classifier performance also need to be quantified. However, multiple calibration sessions may be performed throughout the day such that the detrimental effects of fatigue on the harnessed signals can be incorporated directly into the reference baseline hidden Markov models. Lastly, further work must be done to investigate the feasibility of music imagery classification for individuals in the target population. The prospects are promising, however, as hemodynamic responses due to mental activity have been shown to be detectable by NIRS for individuals with ALS [6].

#### **6. Conclusions**

This study proposed a client-centred NIRS-driven system that uses HMM as a classification tool to differentiate between music imagery and baseline (rest). The classifier performance, reported both in terms of classifier sensitivity and specificity, averaged over thirteen subjects was greater than 82%, and in many cases exhibited successful detection of all imagery events (100% sensitivity). To allow for a client-centred design, three different wavelet filters (3-, 4-, and 5-detail), three HMM parameter sets (Q–M values of 4–1, 2–2, and 2–1), and six different sliding window sizes (1, 3, 5, 7, 10 and 15 seconds) were considered. The resultant wide range of optimal parameters suggests that parameters be selected for each individual via a short, simple calibration session wherein the user alternates between rest and music imagery in a predetermined pattern. Moreover, since concentration calculations require additional computational power, the comparable results obtained here between all data types suggest that NIRS raw data (AC, DC) should be considered in future studies as an alternative to hemoglobin concentrations. The promising performance of the system suggests that a client-centred NIRS-based BCI is a concept that deserves further feasibility studies, as does the use of HMM as a classifier of cognitive activity versus rest.

#### **Acknowledgements**

The authors would like to acknowledge the financial support from the WB Family Foundation and the Natural Sciences and Engineering Research Council of Canada, as well as Ka Lun Tam and Pierre Duez for their hardware/software support.

#### **Author details**

14 Brain-Computer Interface

speech recognition engines.

for individuals with ALS [6].

**6. Conclusions**

**Acknowledgements**

**5.4. Study limitations**

— much like the imagery trials of this experiment — in order to choose the optimal parameters for the data (data type, filter), the HMM (states, Gaussians) and the response time of the switch (window size). The optimization process can be hardcoded into a calibration program, and executed with minimal intervention by an outside party, thus likely will not pose a burden on the user. This would be akin to the training phase required by widely-used

Experimental conditions were restrictive in the sense that participants were required to remain quiet and immobile for the duration of the data recording sessions; all focus was placed on the given task. All auditory and visual distractions were suppressed as much as possible by the earmuffs and blindfold. If BCI technologies are to be used in everyday settings, however, movement artifacts [10], auditory and visual distractions [42], and mind-wandering still have to be detected and accounted for (e.g., as in [19]). Additionally, experimental sessions lasted less than one hour. For BCIs, however, systems may be used for multiple consecutive hours. As a consequence, the effects of fatigue on classifier performance also need to be quantified. However, multiple calibration sessions may be performed throughout the day such that the detrimental effects of fatigue on the harnessed signals can be incorporated directly into the reference baseline hidden Markov models. Lastly, further work must be done to investigate the feasibility of music imagery classification for individuals in the target population. The prospects are promising, however, as hemodynamic responses due to mental activity have been shown to be detectable by NIRS

This study proposed a client-centred NIRS-driven system that uses HMM as a classification tool to differentiate between music imagery and baseline (rest). The classifier performance, reported in terms of both classifier sensitivity and specificity and averaged over thirteen subjects, was greater than 82%, and in many cases the system exhibited successful detection of all imagery events (100% sensitivity). To allow for a client-centred design, three different wavelet filters (3-, 4-, and 5-detail), three HMM parameter sets (Q–M values of 4–1, 2–2, and 2–1), and six different sliding window sizes (1, 3, 5, 7, 10, and 15 seconds) were considered. The resultant wide range of optimal parameters suggests that parameters be selected for each individual via a short, simple calibration session wherein the user alternates between rest and music imagery in a predetermined pattern. Moreover, since concentration calculations require additional computational power, the comparable results obtained here across all data types suggest that NIRS raw data (AC, DC) should be considered in future studies as an alternative to hemoglobin concentrations. The promising performance of the system suggests that a client-centred NIRS-based BCI is a concept that deserves further feasibility studies, as does the use of HMM as a classifier of cognitive activity versus rest.

#### **Acknowledgements**

The authors would like to acknowledge the financial support from the WB Family Foundation and the Natural Sciences and Engineering Research Council of Canada, as well as Ka Lun Tam and Pierre Duez for their hardware/software support.
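As an illustration of the performance measures reported above, the following minimal sketch computes sensitivity and specificity from per-window binary classifier decisions. The function name and the toy labels are illustrative only and are not drawn from the chapter; the convention assumed here is 1 = music imagery window, 0 = baseline (rest) window.

```python
import math


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary decisions.

    Labels: 1 = music imagery window, 0 = baseline (rest) window.
    Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # Guard against empty classes to avoid division by zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else math.nan
    specificity = tn / (tn + fp) if (tn + fp) else math.nan
    return sensitivity, specificity
```

Under this convention, a calibration session alternating rest and imagery in a known pattern supplies `y_true`, while the HMM decisions over sliding windows supply `y_pred`; 100% sensitivity corresponds to every imagery window being detected.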

Tiago H. Falk<sup>1,⋆</sup>, Kelly M. Paton<sup>2</sup> and Tom Chau<sup>3</sup>

<sup>⋆</sup> Address all correspondence to: tiago.falk@ieee.org

1 Institut National de la Recherche Scientifique (INRS-EMT), University of Quebec, Montreal, Canada

2 Institute of Applied Mathematics, University of British Columbia, British Columbia, Canada

3 Bloorview Research Institute, University of Toronto, Toronto, Canada

#### **References**


[10] Fiachra Matthews, Barak A. Pearlmutter, Tomas E. Ward, Christopher Soraghan, and Charles Markham. Hemodynamics for Brain-Computer Interfaces. *IEEE Signal Processing Magazine*, 25(1):87–94, January 2008.

[11] Arno Villringer and Britton Chance. Non-invasive optical spectroscopy and imaging of human brain function. *Trends in Neurosciences*, 20(10):431–433, October 1997.

[12] Gert Pfurtscheller, Günther Bauernfeind, Selina Christin Wriessnegger, and Christa Neuper. Focal frontal (de)oxyhemoglobin responses during simple arithmetic. *International Journal of Psychophysiology*, 76(3):186–192, 2010.

[13] H. Ogata, T. Mukai, and T. Yagi. A study on the frontal cortex in cognitive tasks using near-infrared spectroscopy. In *Proc. IEEE Conf. of the Engineering in Medicine and Biology Society*, pages 4731–4734, 2007.

[14] H. Ayaz, M. Izzetoglu, S. Bunce, T. Heiman-Patterson, and B. Onaral. Detecting cognitive activity related hemodynamic signal for brain-computer interface using functional near-infrared spectroscopy. In *Proceedings of the IEEE Conference of the Engineering in Medicine and Biology Society*, pages 342–345, 2007.

[15] M. J. Herrmann, A. C. Ehlis, and A. J. Fallgatter. Prefrontal activation through task requirements of emotional induction measured with NIRS. *Biological Psychology*, 64(3):255–263, 2003.

[16] J. Leon-Carrion, J. Damas, K. Izzetoglu, K. Pourrezai, J. F. Martín-Rodríguez, J. M. B. Martin, and M. R. Dominguez-Morales. Differential time course and intensity of PFC activation for men and women in response to emotional stimuli: A functional near-infrared spectroscopy (fNIRS) study. *Neuroscience Letters*, 403(1-2):90–95, 2006.

[17] S. Power, T. Falk, and T. Chau. Classification of prefrontal activity due to mental arithmetic and music imagery using hidden Markov models and frequency domain near-infrared spectroscopy. *Journal of Neural Engineering*, 7:026002, 2010.

[18] Sarah D. Power, Azadeh Kushki, and Tom Chau. Automatic single-trial discrimination of mental arithmetic, mental singing and the no-control state from prefrontal activity: toward a three-state NIRS-BCI. *BMC Research Notes*, 5(141), March 2012.

[19] Tiago Falk, M. Guirgis, S. Power, and T. Chau. Taking NIRS-BCIs outside the lab: Achieving robustness against ambient noise. *IEEE Trans Neural Syst Rehab Eng*, 19(2):136–146, 2011.

[20] Boris Kotchoubey, Anna Dubischar, Herbert Mack, Jochen Kaiser, and Niels Birbaumer. Electrocortical and behavioral effects of chronic immobility on word processing. *Cognitive Brain Research*, 17(1):188–199, June 2003.

[21] Toshiki Endo, Teiji Tominaga, and Lars Olson. Cortical Changes Following Spinal Cord Injury with Emphasis on the Nogo Signaling System. *The Neuroscientist*, 15(3):291–299, 2009.

[22] S. B. Frost, S. Barbay, K. M. Friel, E. J. Plautz, and R. J. Nudo. Reorganization of Remote Cortical Regions After Ischemic Brain Injury: A Potential Substrate for Stroke Recovery. *Journal of Neurophysiology*, 89:3205–3214, 2003.
