**A Robust and Flexible Control System to Reduce Environmental Effects of Thermal Power Plants**

Toru Eguchi, Takaaki Sekiai, Naohiro Kusumi, Akihiro Yamada, Satoru Shimizu and Masayuki Fukai *Hitachi Ltd. Japan* 

## **1. Introduction**

Regulations on environmental effects, such as nitrogen oxide (NOx) and carbon monoxide (CO) emissions from thermal power plants, have become stricter[1]; hence, the need for compliance with these regulations has been increasing. To meet this need, several technologies for fuel combustion, exhaust gas treatment and operational control have been developed[2-4]. The technologies for fuel combustion and exhaust gas treatment include the low NOx burner and the air quality control system, which reduce the impact on the environment by physical and chemical means. The operational control technology for thermal power plants, in contrast, must constantly cope with changes in operational conditions, and it is difficult to realize operational control which responds to the resulting changes in combustion properties.

To overcome this issue, the operational control must be able to reduce NOx and CO emissions flexibly in accordance with such changes. Robustness is also required in such control because the measured NOx and CO data often include noise. Therefore, a robust and flexible plant control system is strongly desired to reduce environmental effects from thermal power plants efficiently.

Several studies have proposed plant control technologies to reduce the environmental effects[4-10]. These technologies are classified into two types: model-based and non-model-based methods. The former include an optimization algorithm and a numerical model that estimates plant properties using neural networks (NNs)[11,12] and multivariable model predictive control[13]. The optimization algorithm searches for optimal control signals to reduce NOx and CO emissions using the numerical model. The latter have no models and generate the optimal control signals by fuzzy logic[14]. A fuzzy logic controller outputs the optimal control signals for multivariable inputs using fuzzy rule bases. The fuzzy rule bases are built on *a priori* knowledge of plant control, and they can be tuned by parameters.

These technologies require the measured plant data for initial tuning of the model properties and the rule parameters when the technologies are installed in plants. It usually takes some time to collect enough plant data. In addition, the search for control signals can only be made within the past operating range, so it is difficult to find the optimal control signals if they lie outside that range.


The authors have proposed a new plant control system for reducing environmental effects that utilizes numerical calculation technology[15] to shorten the time for initial tuning and to search for the global optima. The system has one or more calculation databases (DBs) with respect to the NOx and CO properties obtained by numerical calculation. Since the model can be tuned using the calculation DBs in advance, it is not necessary to take time for initial tuning when the control system is installed. Moreover, the proposed system obtains better control signals than the conventional technologies because it can model the NOx and CO properties both inside and outside the operating range by numerical calculation, which facilitates the search for the optimal control signals.

After installation, the proposed control system is capable of tuning its model using the data measured in real time to reduce the model errors. In plant control, the shortest interval for changing operations is every 20 minutes because it often takes about 20 minutes for the plant to settle after an operation. The proposed system must be able to calculate the control signals during this interval; hence, model tuning and the search for control signals should terminate within 20 minutes.

The proposed system employs a radial basis function (RBF) network[16,17] and reinforcement learning (RL)[18]. The RBF network represents the NOx and CO properties to estimate their concentrations according to the control signals. The RL derives the optimal control signals to achieve the control goal, which is to reduce the estimated NOx and CO concentrations. The RBF network is an NN having Gaussian basis functions. It usually learns the NOx and CO properties faster than ordinary NNs because its learning algorithm can be converted into matrix calculations without iterations. The RL is a machine learning method[19] that optimizes the action rules of an agent by trial and error. The RL is well suited to a control system which requires real-time computing because it is a single-point search method and its computational cost is relatively small. It is also preferable because the control history, which can be utilized to improve the control logic, can be traced in the RL control system. The proposed control system with the above features is expected to realize robustness, flexibility in control and real-time computing.

However, there are two practical problems in realizing these advantages efficiently in the proposed control system. The first is ensuring that the model can achieve sufficient estimation accuracy within a practical computational time. Conventional methods to improve the estimation accuracy of the model[11] adjust the radii parameters of the Gaussian basis functions in RBF networks by calculating the estimation error through regression. However, with these conventional radius adjustment methods it might be difficult to adjust the radii within the time restriction because the adjustment by regression requires many iterations. On the other hand, a radius adjustment method without calculation of the estimation error has also been proposed[20]. This method determines the radii using an equation that considers learning data properties such as size and dimension. Its computation is fast, but its estimation accuracy is worse than that of the method with regression. Therefore, a new radius adjustment method for plant control is desired that achieves both higher estimation accuracy and faster computation.

The authors propose a novel radius adjustment method to overcome this first problem[21]. The proposed method focuses on the importance of properly covering the input space, where the model simulates the NOx and CO properties, with the Gaussian basis functions to improve estimation accuracy. This method adjusts the radii considering the distances among the learning data. Consequently, the Gaussian basis functions can cover the input space properly, and both high estimation accuracy and practical computational speed are achieved.

The second problem is to improve the flexibility of the learning algorithm. Performance of the RL depends on the definition of a reward function equivalent to an evaluation function. The reward function has to be defined so that the RL algorithm can obtain the desired goal for the problem. In applying the RL to thermal power plant control, the properties of the model change in accordance with operational changes, so the reward function has to be changed flexibly as well. However, it is quite difficult to prepare reward functions for all patterns of operational conditions in advance.

To overcome this second problem, the authors introduce a reward function with variable parameters and propose an automatic reward adjustment method[22]. The proposed method adjusts the variable parameters of the reward function automatically based on the NOx and CO emissions obtained in the learning process. As a result, it can obtain proper reward functions for all kinds of operational conditions.

The following sections outline the proposed control system and its newly proposed methods. Simulations clarify the advantages of the proposed system with respect to the following points: estimation accuracy and computational time of the RBF network, flexibility of the control logic, and robustness of control against noise in the data.

## **2. Proposed plant control system for reducing environmental effects**

## **2.1 Basic structure**


Figure 1 shows the basic structure of the proposed control system. This system consists of a plant property estimation part and an operation optimizing part. The plant property estimation part includes a statistical model together with measurement and numerical calculation DBs. The statistical model estimates the NOx and CO emission properties of thermal power plants. It is difficult to express these properties as mathematical equations because they have strong nonlinearities. The proposed system employs the RBF network as the statistical model, which can estimate NOx and CO emissions for control variables using data stored in the DBs. The measurement DB stores the measured NOx and CO data for some control variables, and the numerical calculation DB stores data consisting of NOx and CO values for control variables calculated by the combustion analysis[15]. The control variables correspond to the input of the statistical model, and the estimated NOx and CO emissions correspond to its output. The statistical model can be modified by measured data obtained during the plant operations.

Conventional studies have addressed model-based control technology to reduce environmental effects from thermal power plants[4,6-8], but none of them considered employing not only the measured DB but also the numerical calculation DBs. As the model can be tuned using the calculation DBs in advance, it is not necessary to take time for initial tuning at the time of installation. In addition, it is possible to tune the model after the installation by the data in the measured DB.

The operation optimizing part includes an RL agent, a reward calculation module, a reward adjustment module and a learning result DB. The learning procedure is as follows. First, the statistical model calculates and outputs the model outputs for the model inputs changed by the RL agent. Secondly, the reward calculation module calculates a reward using the model outputs and gives it to the RL agent. Thirdly, the RL agent learns its control logic. Learning results are stored in the learning result DB, and they are converted into modification signals. The control signals of the plant are generated by adding the modification signals to the original control signals of the basic controller. The reward adjustment module adjusts the reward parameters using the model outputs and the calculated reward. The normalized Gaussian network (NGnet)[23] has been employed as the structure of the RL agent. The learning algorithm of the NGnet is an actor-critic learning method[18], and it is appropriate for learning in a continuous environment.


Fig. 1. Basic Structure of the Proposed Plant Control System

## **2.2 RBF network**

The basic structure of the RBF network is shown in Fig. 2. The RBF network has three layers: an input layer, a hidden layer with Gaussian functions, and an output layer. First, the $J$-dimensional vector is input in the input layer. Secondly, Gaussian function values are calculated from the input in the hidden layer. Finally, the $P$-dimensional vector is calculated from the Gaussian function values and weight parameters in the output layer. The RBF network is preferred for constructing a response surface due to the following properties.

- The RBF network avoids overfitting by the weight decay parameter[16], which reduces the influence of noise included in the learning data.
- The RBF network does not need iterative calculations for learning of the weight parameters like back propagation does[12].

Here, the input and output of the RBF network are denoted as $\mathbf{x} = (x_1, \dots, x_j, \dots, x_J)^T$ $(1 \le j \le J)$ and $\mathbf{y} = (y_1, \dots, y_p, \dots, y_P)^T$ $(1 \le p \le P)$; then the $p$-th output $y_p$ is calculated by Eqs. (1) and (2).

$$h_l(\mathbf{x}) = \exp\left(-\frac{\left(\mathbf{x} - \mathbf{c}_l\right)^T \left(\mathbf{x} - \mathbf{c}_l\right)}{r_l^2}\right) \tag{1}$$


$$y_p(\mathbf{x}) = \sum_{l=1}^{N_M} u_{lp}\, h_l(\mathbf{x}) \tag{2}$$

Here, $h_l(\mathbf{x})$ is the Gaussian function value of the $l$-th basis function, $N_M$ is the number of basis functions, $u_{lp}$ is the weight parameter between the hidden layer and the output layer, and $\mathbf{c}_l$, $r_l$ are the center coordinates and radius of the $l$-th basis function, respectively. The parameters $\mathbf{c}_l$ and $r_l$ should be determined appropriately because they have much influence on estimation accuracy. In this chapter, the center coordinates are set to the learning data, and the radii are adjusted by the proposed radius adjusting method described later.

Fig. 2. Basic Structure of RBF Network

Learning of the RBF network corresponds to the determination of the weight parameters $u_{lp}$ that minimize the energy function $E_p$ given by Eq. (3), where the teaching data paired with the learning data $\mathbf{x}_q$ are denoted as $y_{pq}$.

$$E_p = \sum_{q=1}^{N_D} \left( y_{pq} - y_p(\mathbf{x}_q) \right)^2 + \lambda \sum_{l=1}^{N_M} u_{lp}^2 \tag{3}$$

Here, $N_D$ is the number of learning data and $\lambda$ is a weight decay parameter reducing the influence of noise included in the learning data. The proposed control system can realize robust control by tuning this parameter in accordance with the learning data. Then, the following matrices are defined.

$$\mathbf{U} = \begin{bmatrix} u_{11} & u_{12} & \cdots & u_{1P} \\ u_{21} & u_{22} & \cdots & u_{2P} \\ \vdots & \vdots & \ddots & \vdots \\ u_{N_M 1} & u_{N_M 2} & \cdots & u_{N_M P} \end{bmatrix} \tag{4}$$


$$\mathbf{H} = \begin{bmatrix} h_1(\mathbf{x}_1) & h_2(\mathbf{x}_1) & \cdots & h_{N_M}(\mathbf{x}_1) \\ h_1(\mathbf{x}_2) & h_2(\mathbf{x}_2) & \cdots & h_{N_M}(\mathbf{x}_2) \\ \vdots & \vdots & \ddots & \vdots \\ h_1(\mathbf{x}_{N_D}) & h_2(\mathbf{x}_{N_D}) & \cdots & h_{N_M}(\mathbf{x}_{N_D}) \end{bmatrix} \tag{5}$$

$$\mathbf{Y} = \begin{bmatrix} y_{11} & y_{12} & \cdots & y_{1P} \\ y_{21} & y_{22} & \cdots & y_{2P} \\ \vdots & \vdots & \ddots & \vdots \\ y_{N_D 1} & y_{N_D 2} & \cdots & y_{N_D P} \end{bmatrix} \tag{6}$$

$$\mathbf{\Lambda} = \lambda \mathbf{I} \tag{7}$$

In Eq. (3), both sides are partially differentiated with respect to $u_{lp}$ and Eqs. (4)-(7) are substituted; then Eq. (8) is obtained[16]. The learning of the RBF network can thus be described as the calculation of the weight matrix $\mathbf{U}$ given by Eq. (8).

$$\mathbf{U} = \left(\mathbf{H}^T \mathbf{H} + \boldsymbol{\Lambda}\right)^{-1} \mathbf{H}^T \mathbf{Y} \tag{8}$$
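
As a concrete illustration of this closed-form learning, the following is a minimal NumPy sketch of Eqs. (1), (2), (5), (7) and (8), assuming, as in this chapter, that the centers are fixed at the learning data. The function names and the toy data are ours, not part of the original system.

```python
import numpy as np

def rbf_design_matrix(X, centers, radii):
    """Gaussian design matrix H of Eq. (5); H[q, l] = h_l(x_q) from Eq. (1)."""
    # squared distances between every input row and every center
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / radii[None, :] ** 2)

def train_rbf(X, Y, radii, lam=0.01):
    """Closed-form weight matrix U of Eq. (8): U = (H^T H + lam*I)^-1 H^T Y."""
    H = rbf_design_matrix(X, X, radii)       # centers are set to the learning data
    A = H.T @ H + lam * np.eye(H.shape[1])   # H^T H + Lambda, with Lambda = lam*I (Eq. (7))
    return np.linalg.solve(A, H.T @ Y)       # solve instead of an explicit inverse

def rbf_predict(Xq, centers, radii, U):
    """Model output of Eq. (2): y_p(x) = sum_l u_lp * h_l(x)."""
    return rbf_design_matrix(Xq, centers, radii) @ U

# Toy usage: 300 learning data in a 5-dimensional input space, one output
rng = np.random.default_rng(0)
X = rng.uniform(-5.0, 5.0, size=(300, 5))
Y = (X ** 2).sum(axis=1, keepdims=True)
U = train_rbf(X, Y, radii=np.full(300, 3.0))
```

Note that the weight decay `lam` is the robustness knob discussed above: a larger value suppresses weights fitted to noisy teaching data.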

## **2.3 Reinforcement learning**

#### **2.3.1 Basic algorithm**

The NGnet for learning of the RL agent learns its action, *i.e.*, control logic, and state value by putting Gaussian basis functions on its state space. Here, the state space is a mapping space to identify its status in the learning environment. The state value is a degree to evaluate how desirable the agent's current state is. NGnet is known to be able to learn faster than other RL algorithms such as tile coding[18] because of the following features.

- NGnet can reduce the necessary basis function size by normalization.
- NGnet can add/delete the basis functions and tune their parameters.
- NGnet can learn locally by the Gaussian basis functions.

Figure 3 shows the basic structure of NGnet. First, NGnet calculates the activations of its Gaussian basis functions $a_i$ and the normalized activations $b_i$ for the input $\mathbf{x}$ by Eqs. (9)-(11). Next, the outputs of the actor $\mathbf{m}(\mathbf{x}) = (m_1, \dots, m_k, \dots, m_K)^T$ $(1 \le k \le K)$, *i.e.*, the action, and of the critic $V(\mathbf{x})$, *i.e.*, the state value, are calculated by Eqs. (12)-(14).

$$a_i = \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^T \mathbf{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i)\right) \tag{9}$$

$$\mathbf{\Sigma}_i = \mathrm{diag}(\boldsymbol{\sigma}_i^2) \tag{10}$$

$$b_i = \frac{a_i}{\sum_{t=1}^{N_L} a_t} \tag{11}$$

$$m\_k(\mathbf{x}) = m\_k^{\max} f\_{sig} \left( \sum\_{i=1}^{N\_L} w\_{ki} b\_i + \beta n\_k \right) \tag{12}$$


$$f_{sig}(z) = \frac{2}{1 + \exp(-z)} - 1 \tag{13}$$

$$V(\mathbf{x}) = \sum\_{i=1}^{N\_L} v\_i b\_i \tag{14}$$

Here, $i$, $j$ and $k$ denote the subscripts of the basis functions of the agent, the inputs and the actor outputs, respectively, and $J$, $K$ denote the dimensions of the inputs and actor outputs. In this chapter, the input of the statistical model is defined to be equal to that of the RL agent. In other words, the RL agent outputs the control bias to the input condition $\mathbf{x}$. The reward is calculated based on the results of control, *i.e.*, the outputs of the statistical model obtained after the control.

Fig. 3. Basic Structure of NGnet

Here, $\mathbf{\Sigma}_i$ is the covariance matrix of the Gaussian basis function. $\boldsymbol{\mu}_i = (\mu_{i1}, \dots, \mu_{ij}, \dots, \mu_{iJ})^T$ and $\boldsymbol{\sigma}_i^2 = (\sigma_{i1}^2, \dots, \sigma_{ij}^2, \dots, \sigma_{iJ}^2)^T$ are the center and radii vectors, respectively. $N_L$ is the basis function size. $w_{ki}$ and $v_i$ are the weight parameters of the actor and critic, respectively.

The procedure to calculate the actor outputs $m_k$ is as follows. First, the weighted sum of the normalized activations $b_i$ is added to a noise component to search for optimal actions. Next, the result is converted to the region $[-1, 1]$ by a sigmoid function. Finally, the actor outputs $m_k$ are calculated by multiplying the maximum values of the actor outputs $m_k^{\max}$ and the converted value. Here, $n_k$ is normalized noise whose average is 0 and variance is 1, and $\beta$ is a noise ratio.
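
The forward pass of Eqs. (9)-(14) can be sketched as follows in NumPy, assuming the diagonal covariance of Eq. (10); all names and the toy setup are ours.

```python
import numpy as np

def ngnet_forward(x, mu, sigma2, W, v, m_max, beta, rng):
    """Actor output m(x) and critic output V(x) of NGnet (Eqs. (9)-(14))."""
    # Eqs. (9)-(10): activations with diagonal covariance Sigma_i = diag(sigma_i^2)
    a = np.exp(-0.5 * ((x - mu) ** 2 / sigma2).sum(axis=1))
    # Eq. (11): normalized activations
    b = a / a.sum()
    # Eqs. (12)-(13): actor output with exploration noise n_k ~ N(0, 1)
    n = rng.standard_normal(W.shape[0])
    z = W @ b + beta * n
    m = m_max * (2.0 / (1.0 + np.exp(-z)) - 1.0)   # f_sig maps into [-1, 1]
    # Eq. (14): critic, i.e., the state value
    V = v @ b
    return m, V

# Toy usage: 4 basis functions on a 2-dimensional state, 1-dimensional action
rng = np.random.default_rng(0)
mu = rng.uniform(-1.0, 1.0, size=(4, 2))   # centers mu_i
sigma2 = np.full((4, 2), 0.25)             # radii vectors sigma_i^2
W = np.zeros((1, 4))                       # actor weights w_ki
v = np.zeros(4)                            # critic weights v_i
m, V = ngnet_forward(np.array([0.1, -0.2]), mu, sigma2, W, v,
                     m_max=1.0, beta=0.1, rng=rng)
```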

## **2.3.2 Learning algorithm**

Learning of NGnet is executed by the following procedures: updating the weight parameters $w_{ki}$, $v_i$, adding/deleting the Gaussian basis functions, and tuning $\boldsymbol{\mu}_i$, $\boldsymbol{\sigma}_i^2$. TD learning[17] is employed to update $w_{ki}$ and $v_i$.

The agent updates its input ($\mathbf{x} \rightarrow \mathbf{x}'$) by its actor outputs $m_k$, and the model outputs are then calculated for the updated input. Eq. (15) calculates the TD error $\delta$ from the *reward*, which is calculated from the model outputs, and the state value $V(\mathbf{x})$ calculated for the input $\mathbf{x}$.

$$\delta = reward + \gamma V(\mathbf{x}') - V(\mathbf{x}) \tag{15}$$

Here, $\gamma$ is a discount ratio for the future reward. The actor of NGnet learns its actions to improve $V(\mathbf{x})$, and the critic of NGnet also learns to estimate $V(\mathbf{x})$ appropriately. $w_{ki}$ and $v_i$ are updated by Eqs. (16) and (17) using $\delta$.

$$w_{ki} = w_{ki} + \alpha_A b_i \delta n_k \tag{16}$$

$$v_i = v_i + \alpha_C b_i \delta \tag{17}$$

Here, $\alpha_A$ and $\alpha_C$ denote the learning rates of $w_{ki}$ and $v_i$, respectively.
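
A minimal sketch of the updates of Eqs. (15)-(17), reusing the activations $b_i$ and the noise $n_k$ from the forward pass above; the function name and signature are ours.

```python
import numpy as np

def td_update(reward, V_x, V_x_next, b, n, W, v, gamma, alpha_A, alpha_C):
    """TD error of Eq. (15) and the in-place updates of Eqs. (16)-(17)."""
    delta = reward + gamma * V_x_next - V_x       # Eq. (15)
    W += alpha_A * delta * np.outer(n, b)         # Eq. (16): w_ki += alpha_A * b_i * delta * n_k
    v += alpha_C * delta * b                      # Eq. (17): v_i  += alpha_C * b_i * delta
    return delta
```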

The other learning procedures add/delete the Gaussian basis functions and tune $\boldsymbol{\mu}_i$, $\boldsymbol{\sigma}_i^2$ so that the NGnet can obtain enough resolution to learn its state space. The proposed control system employs the following algorithm: the basis function size of the NGnet is initialized to 0, and new basis functions are added adaptively during learning.

#### **Basis Addition Algorithm**

**Step 1.** If the current basis function size $N_L$ satisfies $N_L < N_L^{\max}$, then the algorithm goes to **Step 2**. Otherwise, it terminates.
**Step 2.** The activations of the agent's current basis functions $a_i$ are calculated for the input $\mathbf{x}$ during its learning.
**Step 3.** If there is no basis function $i$ which meets $a_i \ge a_{\min}$, then the algorithm goes to **Step 4**. Otherwise, it terminates.
**Step 4.** If $\delta_{\min} \le |\delta|$ is satisfied, the algorithm goes to **Step 5**. Otherwise, it terminates.
**Step 5.** A basis function whose center and radii are set to $\mathbf{x}$ and $\boldsymbol{\sigma}_i$ is added to NGnet, then the algorithm terminates.

Here, $N_L^{\max}$, $a_{\min}$ and $\delta_{\min}$ denote the maximum basis function size, the threshold value of activation and the threshold value of TD error, respectively. This algorithm adds new basis functions in the regions of the state space which are not sufficiently covered with learned basis functions. In addition, the maximum basis function size $N_L^{\max}$ is set because unnecessary basis functions might otherwise be added owing to the increased variation of the TD error caused by the proposed automatic reward adjustment method described later. Therefore, the agent can put only the necessary basis functions in its state space.
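
A sketch of the basis addition check above, under the assumption that new actor/critic weights start at zero and that new radii are given by an initial vector `sigma_init2`; all names are ours.

```python
import numpy as np

def maybe_add_basis(x, delta, mu, sigma2, W, v, sigma_init2,
                    NL_max, a_min, delta_min):
    """Basis addition check (Steps 1-5); returns the possibly grown parameters."""
    if len(mu) >= NL_max:                          # Step 1: respect the size limit
        return mu, sigma2, W, v
    if len(mu) > 0:                                # Step 2: activations for input x
        a = np.exp(-0.5 * ((x - mu) ** 2 / sigma2).sum(axis=1))
        if a.max() >= a_min:                       # Step 3: region already covered
            return mu, sigma2, W, v
    if abs(delta) < delta_min:                     # Step 4: TD error too small
        return mu, sigma2, W, v
    # Step 5: add a basis function centered at x with initial radii sigma_init2
    mu = np.vstack([mu, x[None, :]]) if len(mu) else x[None, :]
    sigma2 = np.vstack([sigma2, sigma_init2[None, :]]) if len(sigma2) else sigma_init2[None, :]
    W = np.hstack([W, np.zeros((W.shape[0], 1))])  # new actor weights start at 0
    v = np.append(v, 0.0)                          # new critic weight starts at 0
    return mu, sigma2, W, v
```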

#### **2.4 Learning flow of the proposed control system**

The learning algorithm flow of the proposed control system consists of the following steps.

#### **Learning Algorithm of the Proposed Control System**

**Step 1.** Initialize learning parameters of the RBF network and RL.
**Step 2.** Adjust radii of the RBF network.
**Step 3.** Calculate weight parameters of the RBF network.
**Step 4.** Determine initial control variables.
**Step 5.** Change control variables by the RL agent.
**Step 6.** Calculate model outputs by the RBF network.
**Step 7.** Calculate reward.
**Step 8.** Calculate TD error.
**Step 9.** Update weight parameters of the RL agent.
**Step 10.** Add new basis functions of the RL agent.
**Step 11.** If the terminal condition of the episode is reached, go to **Step 12**. Otherwise, return to **Step 5**.
**Step 12.** Adjust the reward parameters.
**Step 13.** If the terminal condition of learning is reached, terminate the algorithm. Otherwise, return to **Step 4**.

In the above algorithm, an episode terminates after executing the processes between **Step 5** and **Step 10** for *S* times, and a trial of learning terminates after executing the processes between **Step 4** and **Step 12** for *T* times.

## **3. Adaptive radius adjustment method**

## **3.1 Basic concepts**

In the proposed control system, the outputs of the RBF network are calculated by the Gaussian basis functions placed in the input space. To obtain high estimation accuracy, the radii should be adjusted so that the basis functions can cover the space sufficiently.

The proposed method focuses on the covering rate of the basis functions over the input space. It adjusts the radii based on the distances between a randomly generated input and the centers of the basis functions selected to surround it, which are located at the learning data. As a result, the radii of basis functions whose distances to other data are short become small, and *vice versa*.

## **3.2 Algorithm of the proposed method**

The algorithm of the proposed method consists of the following steps.

### **Algorithm of the Radius Adjustment Method**

**Step 1.** Initialize the radii and adjusting parameters.
**Step 2.** Generate an input randomly.
**Step 3.** Select pairs of learning data by the *k*-SN (*k*-surrounded neighbor) method[24].
**Step 4.** Exclude the selected data from the data candidates for selection.
**Step 5.** If there are no data candidates, go to **Step 6**. Otherwise, return to **Step 3**.
**Step 6.** If there are no selected data, go to **Step 8**. Otherwise, go to **Step 7**.
**Step 7.** Update radii of the selected data.
**Step 8.** If $n$ reaches $N$, terminate the algorithm. Otherwise, increment $n$ and return to **Step 2**.

In **Step 1**, the radii are initialized to a small value. In **Step 2**, an input condition $\mathbf{x}_n$ ($n$: suffix showing the number of iterations) is generated randomly. In **Step 3**, the pairs of learning data ($\mathbf{x}_m^1$, $\mathbf{x}_m^2$) ($m$: suffix showing the number of pairs) for which the radii are to be adjusted are selected for the generated $\mathbf{x}_n$ using the *k*-SN method. The *k*-SN method is a data extraction method to satisfy the condition of interpolation. It selects the pair of data ($\mathbf{x}_m^1$, $\mathbf{x}_m^2$) so that $\mathbf{x}_n$ is surrounded by them.

Fig. 4. Mechanism of k-SN method

Figure 4 shows the mechanism of the *k*-SN method in a 2-dimensional input space. The nearest datum $\mathbf{x}_m^1$ to $\mathbf{x}_n$ is selected from the data candidates available for selection, *i.e.*, the learning data excluding the formerly selected data. Then the datum $\mathbf{x}_m^2$ paired with $\mathbf{x}_m^1$ is selected according to Eq. (18).

$$\mathbf{x}_m^2 = \arg\min_{z \in Z} d(\mathbf{x}_z, \mathbf{x}_n) \quad \text{subject to} \quad d(\mathbf{x}_z, \mathbf{x}_n) < d(\mathbf{x}_m^1, \mathbf{x}_z) \tag{18}$$

Here, $z$ denotes the suffix of the data candidates available for selection ($z \in Z$) and $d(\mathbf{x}_z, \mathbf{x}_n)$ denotes the distance between $\mathbf{x}_z$ and $\mathbf{x}_n$. In **Step 4**, the selected data ($\mathbf{x}_m^1$, $\mathbf{x}_m^2$) are excluded from the data candidates. If there is no $\mathbf{x}_z$ satisfying Eq. (18), only $\mathbf{x}_m^1$ is excluded. In this way, the radii of basis functions in an interpolative relation with inputs are adjusted, so the basis functions can cover the input space sufficiently. This selection continues until all the data candidates have been selected.

In **Step 7**, the radii ($r_m^1$, $r_m^2$) set at the selected data are adjusted by Eqs. (19) and (20).

$$r\_m^1 = r\_m^1 + \alpha\_{rad} \tau^n \left( d(\mathbf{x}\_m^1, \mathbf{x}\_n) - r\_m^1 \right) \tag{19}$$

$$r_m^2 = r_m^2 + \alpha_{rad} \tau^n \left( d(\mathbf{x}_m^2, \mathbf{x}_n) - r_m^2 \right) \tag{20}$$

Here, $\alpha_{rad}$ is an initial step size parameter of the radius, and $\tau$ is a decay rate of the step size parameter ($0 < \tau < 1$). The second term on the right side of both Eqs. (19) and (20) decays as the iteration $n$ increases, so the radii finally converge to certain values. These steps are iterated until $n$ reaches $N$, whereby the radii are adjusted according to the distribution of the learning data.
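
A compact, unoptimized sketch of **Steps 1-8** with Eqs. (18)-(20), assuming the random inputs are drawn uniformly from the data range ($[-5, 5]$ as in the simulations below); all names are ours.

```python
import numpy as np

def adjust_radii(X, N=3000, alpha_rad=0.01, tau=0.999, r_init=1e-3,
                 low=-5.0, high=5.0, rng=None):
    """Adaptive radius adjustment (Steps 1-8 with Eqs. (18)-(20))."""
    rng = rng or np.random.default_rng()
    r = np.full(len(X), r_init)                    # Step 1: small initial radii
    for n in range(N):
        x_n = rng.uniform(low, high, X.shape[1])   # Step 2: random input x_n
        cand = set(range(len(X)))                  # data candidates for selection
        step = alpha_rad * tau ** n                # decaying step size of Eqs. (19)-(20)
        while cand:                                # Steps 3-5: select pairs via k-SN
            i1 = min(cand, key=lambda z: np.linalg.norm(X[z] - x_n))
            # Eq. (18): x_m^2 is the candidate nearest to x_n among those z with
            # d(x_z, x_n) < d(x_m^1, x_z), so that x_n lies between the pair
            ok = [z for z in cand if z != i1 and
                  np.linalg.norm(X[z] - x_n) < np.linalg.norm(X[i1] - X[z])]
            sel = [i1] + ([min(ok, key=lambda z: np.linalg.norm(X[z] - x_n))] if ok else [])
            for i in sel:                          # Steps 4 and 7
                r[i] += step * (np.linalg.norm(X[i] - x_n) - r[i])  # Eqs. (19)-(20)
                cand.remove(i)
    return r
```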

## **3.3 Simulations**


In this section, simulations are executed in order to evaluate the performance of the proposed radius adjustment method. The proposed method is compared with two conventional radius adjustment methods with respect to estimation accuracy and computational time using the test function data.

## **3.3.1 Simulation conditions**

Simulations are executed in the following steps: a) determination of radii, b) calculation of weight parameters, and c) evaluation of estimation error. In step a), the proposed method, the Cross Validation (CV) method[11] and the radius equation method[20] are used to determine radii. The CV method adjusts radii with regression, and the radius equation method adjusts radii without regression. (See appendix). In step b), the weight parameters of the RBF network are calculated by Eq. (8). In step c), the estimation errors between the outputs of the RBF network and the test data are evaluated.

In the case of plant control, the shape of the response surface changes according to the plant properties, input dimensions and numbers of learning data. In order to simulate various response surfaces, the learning data are created for different test functions, input dimensions and numbers of data. The test functions $F_1(\mathbf{x})$ and $F_2(\mathbf{x})$ ($x_j \in [-5, 5]$) described as Eqs. (21) and (22) are used in the simulations. These functions are often used as benchmark problems of RBF networks[20].

$$F_1(\mathbf{x}) = \sum_{j=1}^{J} \left( x_j^4 - 16 x_j^2 + 5 x_j + 100 \right) \tag{21}$$

$$F\_2(\mathbf{x}) = \sum\_{j=1}^{J} \left(\sum\_{k=1}^{j} x\_k\right)^2 \tag{22}$$

Table 1 shows the settings of the learning data and test data of the RBF network in the simulations. Here, the numbers of learning data and test data are denoted as $N_D$ and $N_{Test}$, respectively. In the simulations, the output dimension $P$ is fixed to 1, while the input dimension $J$ is varied from 2 to 10. The parameters $\alpha_{rad}$, $\tau$ and $N$ are set to 0.01, 0.999 and 3000, respectively. They are set appropriately based on prior experimental results. The parameters $r_{\min}$, $r_{\max}$ and $\Delta r$ used in the CV method are shown in Table 2. The common parameter $\lambda$ is set to 0.01. Each simulation is executed for 25 random sequences using a Linux machine (CPU clock: 2.8 [GHz]).
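
For reference, the test functions of Eqs. (21) and (22) and the error measure of Eq. (23) below can be written as follows; the vectorized forms and names are ours.

```python
import numpy as np

def f1(X):
    """Test function F1 of Eq. (21), applied row-wise to inputs X."""
    return (X ** 4 - 16 * X ** 2 + 5 * X + 100).sum(axis=1)

def f2(X):
    """Test function F2 of Eq. (22): sum of squared prefix sums."""
    return (np.cumsum(X, axis=1) ** 2).sum(axis=1)

def rmse(y_model, y_true):
    """Root mean square error of Eq. (23)."""
    return np.sqrt(np.mean((y_model - y_true) ** 2))
```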

Table 1. Specifications of learning data and test data in the simulation cases


| Case | $r_{\min}$ | $r_{\max}$ | $\Delta r$ |
|------|-----------|-----------|------------|
| 1, 2, 3, 10, 11, 12 | 0.1 | 10 | 0.1 |
| 4, 13 | 5 | 15 | 0.1 |
| 5, 14 | 5 | 15 | 0.5 |
| 6, 15 | 5 | 15 | 1 |
| 7, 16 | 5 | 20 | 0.1 |
| 8, 9, 17, 18 | 5 | 20 | 1 |

Table 2. Parameter conditions in CV method

## **3.3.2 Results and discussions**

In order to evaluate the estimation accuracy of the proposed method, the root mean square error $RMSE_{cn}$ calculated by Eq. (23) is used.

$$RMSE_{cn} = \sqrt{\sum_{t=1}^{N_{Test}} \frac{\left( y_{cn}(\mathbf{x}_t) - F_{cn}(\mathbf{x}_t) \right)^2}{N_{Test}}} \tag{23}$$

Here, $y_{cn}(\mathbf{x}_t)$ and $F_{cn}(\mathbf{x}_t)$ are an output value of the RBF network and the test data for the input $\mathbf{x}_t$ ($t$: suffix of the test data) in case $cn$ ($cn = 1, 2, \dots, 18$), respectively.

First, the convergence performance of the proposed method is studied using the RMSE and the adjusted radii parameters. Figs. 5 and 6 show the RMSE and several radii parameters over the $N$ iterations in case 5 of Table 1. Case 5 is the most suitable condition for real plants with respect to input dimensions and numbers of learning data. The other cases also show results similar to those of case 5. From Fig. 5, it is confirmed that the RMSE decreases and converges to a certain value with iteration.


Fig. 5. RMSE curve obtained by the proposed method

Figure 6 shows the adjustment history of 10 typical radii selected from those of the 300 Gaussian basis functions, corresponding to the number of learning data in case 5. In this figure, the radii increase quickly with iteration but converge to different values. The adjusted radii converge to different values because the proposed method adjusts the radii based on the distribution of the learning data. For data whose distances to other data are short, the distances between the learning data and $\mathbf{x}_n$ become short; consequently, the radii of the data in that region become shorter than those in regions where the distances are long. Comparing Figs. 5 and 6 also confirms that the convergence of the radii, driven by the decay of the update coefficient, contributes to the convergence of the RMSE.

Fig. 6. Adjustment history of typical radial values by the proposed method

Next, Fig. 7 shows the radial values plotted against the crowding index $ci$ of their basis functions, calculated by Eqs. (24) and (25). The crowding index $ci$ represents how well the center coordinate $\mathbf{c}_i$ of basis function $i$ is covered by all the basis functions having uniform radii; thus, this index usually becomes large for data whose distances to other data are short.

$$h_t(\mathbf{c}_i) = \exp\left(-\frac{(\mathbf{c}_i - \mathbf{c}_t)^T (\mathbf{c}_i - \mathbf{c}_t)}{r_{ci}^2}\right) \tag{24}$$


$$ci(\mathbf{c}_i) = \frac{1}{N_D} \sum_{t=1}^{N_D} h_t(\mathbf{c}_i) \tag{25}$$

Here, $h_t(\mathbf{c}_i)$ is the value of the Gaussian basis function whose center is $\mathbf{c}_t$, and $r_{ci}$ is a radius set to a certain constant value. In this simulation, $r_{ci}$ is set to 3.89 considering the range of input values.
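A short sketch of Eqs. (24) and (25), assuming the basis-function centers are stacked in an $(N_D, J)$ array; the function name is illustrative:

```python
import numpy as np

def crowding_index(centers: np.ndarray, r_ci: float = 3.89) -> np.ndarray:
    """Crowding index ci(c_i) of Eqs. (24)-(25) for every basis center.

    centers : (N_D, J) array of basis-function centers c_i
    r_ci    : uniform radius used only for this index (3.89 in the text)
    """
    # Squared distances ||c_i - c_t||^2 between all pairs of centers
    diff = centers[:, None, :] - centers[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    # Eq. (24): Gaussian value of basis t evaluated at center c_i
    h = np.exp(-sq_dist / r_ci ** 2)
    # Eq. (25): average over all N_D basis functions
    return h.mean(axis=1)
```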

Fig. 7. Relation between the crowding index and radial values

In Fig. 7, the radial values at low crowding index are larger than those at high crowding index because the basis functions in regions where the distances to data are long need to cover a wider input space. This result indicates that the proposed radius adjustment algorithm works properly.


| Case | Proposed Method | CV Method | Radius Equation |
|---|---|---|---|
| 1 | 76.0 | 83.5 | 96.5 |
| 2 | 71.4 | 70.9 | 99.7 |
| 3 | 29.5 | 36.1 | 63.3 |
| 4 | 138.3 | 130.3 | 186.5 |
| 5 | 116.6 | 115.4 | 170.1 |
| 6 | 107.2 | 113.3 | 153.1 |
| 7 | 201.0 | 234.1 | 272.7 |
| 8 | 174.4 | 166.8 | 230.7 |
| 9 | 164.0 | 158.4 | 239.3 |
| 10 | 7.9 | 6.1 | 14.2 |
| 11 | 2.7 | 2.7 | 10.5 |
| 12 | 1.5 | 2.3 | 9.3 |
| 13 | 39.8 | 35.3 | 79.1 |
| 14 | 16.4 | 12.9 | 58.5 |
| 15 | 8.9 | 10.9 | 48.4 |
| 16 | 268.1 | 292.2 | 294.7 |
| 17 | 82.0 | 58.8 | 173.5 |
| 18 | 63.9 | 38.2 | 175.6 |
| Ave. | 87.3 | 87.1 | 132.0 |

Table 3. Comparisons of the RMSEs obtained by the proposed and conventional methods

Table 3 compares the RMSEs of the proposed method and the conventional methods. The case values in the table are averages over 25 simulation runs. The RMSEs of the proposed method are smaller than those of the radius equation in every case. The radius equation is usually applied to learning data having a uniform crowding index[20]; therefore, it is difficult to apply it to plant control, where the learning data usually show deviations of the crowding index as in Fig. 7. The proposed method can adjust the radii considering the distribution of the learning data; thus its RMSEs are on average 33.9[%] better than those of the radius equation. The proposed method also performs comparably to the CV method.

Table 4 compares the computational times of the proposed and conventional methods; these results are also averages over 25 simulation runs. The computational times of the radius equation are extremely short because it spends time only on the calculation of Eq. (34) to adjust the radii. For the CV method, the computational times increase rapidly with the number of data because error evaluations are needed for all learning data, and in some cases they are well beyond the limit of practical use (20 minutes); therefore, it is difficult to apply the CV method to plant control. On the other hand, the computational times of the proposed method are within 20 minutes in every case. These computational times are practical for plant control, and it is confirmed that the proposed method is the most suitable for plant control.

These simulation results show that the proposed plant control system can construct a flexible statistical model with high estimation accuracy for various operational conditions of thermal power plants within a practical computational time. It is expected to improve the effectiveness of reducing NOx and CO by learning with such a statistical model.


| Case | Proposed Method | CV Method | Radius Equation |
|---|---|---|---|
| 1 | 2.8E-02 | 6.5E-01 | 7.6E-06 |
| 2 | 9.9E-02 | 9.2E+00 | 2.8E-05 |
| 3 | 3.7E-01 | 1.5E+02 | 1.1E-04 |
| 4 | 4.6E-01 | 1.4E+02 | 1.4E-04 |
| 5 | 3.9E+00 | 2.6E+03 | 1.3E-03 |
| 6 | 1.1E+01 | 1.7E+04 | 3.6E-03 |
| 7 | 6.6E-01 | 2.2E+02 | 2.8E-04 |
| 8 | 1.6E+01 | 2.3E+04 | 6.9E-03 |
| 9 | 6.4E+02 | 6.5E+05 | 3.1E-02 |
| 10 | 2.7E-02 | 6.5E-01 | 7.6E-06 |
| 11 | 9.8E-02 | 9.2E+00 | 2.7E-05 |
| 12 | 3.7E-01 | 1.5E+02 | 1.1E-04 |
| 13 | 4.6E-01 | 1.4E+02 | 1.4E-04 |
| 14 | 3.9E+00 | 2.6E+03 | 1.3E-03 |
| 15 | 1.1E+01 | 1.6E+04 | 3.6E-03 |
| 16 | 6.6E-01 | 2.2E+02 | 2.8E-04 |
| 17 | 1.6E+01 | 2.3E+04 | 6.9E-03 |
| 18 | 6.4E+02 | 6.5E+05 | 3.1E-02 |

Table 4. Comparisons of the computational times [s] for the proposed and conventional methods


## **4. Automatic reward adjustment method**

## **4.1 Basic concepts**

When the RL is applied to thermal power plant control, the reward must be designed so that it can be given to the agent instantly, in order to adapt to plant properties that change from hour to hour. Studies on reward design for the RL have reported[25,26] that high flexibility can be realized by switching or adjusting the reward in accordance with changes of the agent's objectives and situations. However, it would be difficult to apply this to thermal power plant control, which needs instant reward design for changes of plant properties, because the reward design and its switching or adjusting depend on *a priori* knowledge.

The proposed control system defines a reward function that does not depend on the learning object and proposes an automatic reward adjustment method that adjusts the parameters of the reward function adaptively based on the plant property information obtained in the learning. With this method it is possible to use the same reward function for different operating conditions and control objectives, and the reward function is adjusted in accordance with the learning progress. Therefore, it is expected that a flexible plant control system can be constructed without manual reward design.

## **4.2 Definition of reward**

The statistical model in the proposed control system has a unique characteristic depending on the specifications of the applied plant, the kinds of environmental effects, and the operating conditions. When such a model is used for learning, the reward function should be generalized, because it is difficult to design unique reward functions for various plant properties in real time. Thus the authors have defined the reward function as Eq. (26).

$$reward = \begin{cases} reward_{\max} \exp\left(\dfrac{\rho - f}{\phi}\right) & (f \ge \rho) \\ reward_{\max} & (f < \rho) \end{cases} \tag{26}$$

Here, $reward_{\max}$ and $f$ are the maximum reward value and the sum of weighted model outputs calculated by Eq. (27), respectively. $\rho$ and $\phi$ are the parameters that determine the shape of the reward function.

$$f = \sum_{p=1}^{P} C_p y_p \tag{27}$$

Here, $C_p$ is the weight of the model output $y_p$, and $p$ is a suffix for the model outputs. In Eq. (26), the conditions $\rho > 0$ and $\phi > 0$ are satisfied. If $\rho$ and $\phi$ become larger, a larger reward is obtained for a given $f$. In addition, $f$ can weight each $y_p$ by $C_p$ in accordance with the control goals. Fig. 8 shows the shape of the reward function where $reward_{\max} = 1$, $\rho = 10$, and $\phi = 20$ are set in Eq. (26).

The reward function defined as Eq. (26) can be applied to various kinds of statistical models whose operating conditions and control goals differ, because the reward can be defined only by $\rho$, $\phi$, and $C_p$. $C_p$ is set in accordance with the control goals, and $\rho$, $\phi$ are adjusted automatically by the proposed automatic reward adjustment method.
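A minimal sketch of Eqs. (26) and (27); the function and variable names are illustrative assumptions, not part of the original text:

```python
import numpy as np

REWARD_MAX = 1.0  # reward_max in Eq. (26)

def weighted_output(y: np.ndarray, c: np.ndarray) -> float:
    """Eq. (27): f = sum_p C_p * y_p over the P model outputs."""
    return float(np.dot(c, y))

def reward(f: float, rho: float, phi: float) -> float:
    """Eq. (26): maximum reward below the threshold rho,
    exponential decay controlled by phi above it."""
    if f < rho:
        return REWARD_MAX
    return REWARD_MAX * np.exp((rho - f) / phi)

# Example with hypothetical CO/NOx outputs and weights C = [0.1, 0.9]:
# reward(weighted_output(np.array([30.0, 150.0]), np.array([0.1, 0.9])),
#        rho=10.0, phi=20.0)
```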


Fig. 8. Schematic of reward function

## **4.3 Algorithm of the proposed reward adjustment method**

The proposed reward adjustment method adjusts the reward parameters $\rho$, $\phi$ using the model outputs obtained during the learning so that the agent can get a proper reward for (1) the characteristics of the learning object and (2) the progress of learning. Here, (1) means that this method can adjust the reward properly for statistical models whose optimal control conditions and NOx/CO properties differ, by adjusting $\rho$, $\phi$. (2) means that this method makes it easier for the agent to get the reward and accelerates learning at the early stage, while later making the conditions to get the reward stricter and improving the agent's learning accuracy.

The reward parameters are updated based on the sum of weighted model outputs $f$ obtained in each episode and the best $f$ value obtained during the past episodes. Hereafter, the sum of weighted model outputs and the reward parameters at episode $t$ are denoted as $f_t$ and $\rho_t$, $\phi_t$, $\theta_t$, respectively.

The algorithm of the proposed method is as follows. First, $f_t$ is calculated by Eq. (27); then its moving average $\overline{f_t}$ is calculated by Eq. (28).

$$\overline{f_t} = \varepsilon f_t + (1 - \varepsilon)\overline{f_{t-1}} \tag{28}$$

Here, $\varepsilon$ is a smoothing parameter of the moving average. The parameter $\phi_t$ is updated by Eqs. (29) and (30) when $\overline{f_t} > \rho_t$ is satisfied.

$$\phi_{t+1} = \phi_t + \alpha_\phi (\phi'_t - \phi_t) \tag{29}$$

$$\phi'_t = \frac{\rho_t - \overline{f_t}}{\ln(\theta_t / reward_{\max})} \tag{30}$$

Here, $\phi'_t$ is an updating index of $\phi_t$, $\theta_t$ is a threshold parameter that determines the updating direction (positive/negative), and $\alpha_\phi$ is a step size parameter of $\phi_t$. As shown in Fig. 9, $\phi'_t$ corresponds to the shape parameter at which the reward value for $\overline{f_t}$ becomes $\theta_t$. The updating direction of $\phi_t$ becomes positive where $\theta'_t$ calculated by Eq. (31) is smaller than $\theta_t$, and *vice versa.*


$$\theta'_t = reward_{\max} \exp\left(\frac{\rho_t - \overline{f_t}}{\phi_t}\right) \tag{31}$$

$\theta_t$ is updated by Eq. (32) so that it becomes closer to $\theta'_t$.

$$\theta_{t+1} = \theta_t + \alpha_\theta (\theta'_t - \theta_t) \tag{32}$$

Fig. 9. Mechanism of the proposed method

Here, $\alpha_\theta$ is a step size parameter of $\theta_t$. $\theta_t$ is initialized to a small value. As a result of updating $\theta_t$ by Eq. (32), $\theta_t$ finally becomes equal to $\theta'_t$. This means that the reward is given to the agent appropriately for the current $\overline{f_t}$. The value of $\theta_t$ depends on the learning object and the learning progress; hence it is preferable to acquire it empirically in the learning process. That is because $\phi'_t$, the updating index of $\phi_t$, is defined according to $\theta_t$, the reward value for $\overline{f_t}$.

The parameter $\rho_t$ is updated by Eq. (33) to approach $f^*_t$, the best value of $f$ obtained during past learning.

$$\rho_{t+1} = \rho_t + \alpha_\rho (f^*_t - \rho_t) \tag{33}$$

Here, $\alpha_\rho$ is a step size parameter of $\rho_t$.

The above algorithm is summarized as the following steps.

#### **Reward Automatic Adjustment Algorithm**

**Step 1.** Calculate $\overline{f_t}$ by Eq. (28).
**Step 2.** If $\overline{f_t} > \rho_t$ is satisfied, go to **Step 3**. Otherwise, go to **Step 5**.
**Step 3.** Update $\phi_t$ by Eqs. (29) and (30).
**Step 4.** Update $\theta_t$ by Eqs. (31) and (32).
**Step 5.** Update $\rho_t$ by Eq. (33) and terminate the algorithm.
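To make the update flow concrete, the following is a minimal Python sketch of Steps 1-5 under Eqs. (28)-(33). The class name, the initial values, the handling of the first episode, and the treatment of $f^*_t$ (taken here as the smallest $f$ observed, since $f$ is being minimized) are illustrative assumptions rather than the authors' implementation.

```python
import math

class RewardAdjuster:
    """Sketch of the automatic reward adjustment algorithm, Eqs. (28)-(33).

    rho is the threshold, phi the shape parameter, theta the reward-value
    threshold; eps and the alphas are the smoothing and step-size
    parameters. All initial values below are illustrative assumptions.
    """

    def __init__(self, reward_max=1.0, eps=0.1, alpha_phi=0.05,
                 alpha_theta=0.05, alpha_rho=0.05,
                 rho0=0.0, phi0=0.001, theta0=0.001):
        self.reward_max = reward_max
        self.eps = eps
        self.alpha_phi, self.alpha_theta, self.alpha_rho = (
            alpha_phi, alpha_theta, alpha_rho)
        self.rho, self.phi, self.theta = rho0, phi0, theta0
        self.f_bar = None   # moving average of f, Eq. (28)
        self.f_best = None  # best (smallest) f over past episodes

    def update(self, f_t: float) -> None:
        # Step 1 / Eq. (28): moving average of the weighted model output
        self.f_bar = (f_t if self.f_bar is None
                      else self.eps * f_t + (1 - self.eps) * self.f_bar)
        # Track the best f value seen so far (f is minimized here)
        self.f_best = f_t if self.f_best is None else min(self.f_best, f_t)

        # Step 2: adjust phi and theta only while f_bar exceeds rho
        if self.f_bar > self.rho:
            # Step 3 / Eqs. (29)-(30): phi' is the shape for which the
            # reward at f_bar would equal the threshold theta
            phi_idx = ((self.rho - self.f_bar)
                       / math.log(self.theta / self.reward_max))
            self.phi += self.alpha_phi * (phi_idx - self.phi)
            # Step 4 / Eqs. (31)-(32): move theta toward the reward value
            # actually obtained for f_bar
            theta_idx = self.reward_max * math.exp(
                (self.rho - self.f_bar) / self.phi)
            self.theta += self.alpha_theta * (theta_idx - self.theta)
        # Step 5 / Eq. (33): move rho toward the best f found so far
        self.rho += self.alpha_rho * (self.f_best - self.rho)
```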


## **4.4 Simulations**


In this section, simulations are described to evaluate the performance of the proposed control system with the automatic reward adjustment method when it is applied to virtual plant models configured on the basis of experimental data. The simulations incorporate several changes of the plant operations, together with the corresponding learning data for the RBF network. The evaluations focus on the flexibility in control of the proposed reward adjustment method against changes of the operational conditions. In addition, the robustness of control for a statistical model including noise, achieved by tuning the weight decay parameter of the RBF network, is also studied.

## **4.4.1 Simulation conditions**

Figure 10 shows the basic structure of the simulation. The objective of the simulation is to reduce NOx and CO emissions from a virtual coal-fired boiler model (statistical model) constructed with three numerical calculation DBs. The RL agent learns how to control three operational parameters with respect to the air mass flow supplied to the boiler. Therefore, the input and output dimensions ($J$, $P$) of the control system are 3 and 2, respectively. The input values are normalized into the range $[0, 1]$. The three numerical calculation DBs have different operational conditions, and each DB has 63 data whose input-output conditions are different. These data include some noise, similar to actual plant data.

Fig. 10. Basic structure of thermal power plant control simulation

In this simulation, the robustness and flexibility of the proposed control system are verified by implementing the RL agent so that it learns and controls the statistical model, which changes in time series. Two kinds of boiler operational simulations are executed according to Table 5. Each simulation case covers six hours (0:00-6:00) of operation, and the statistical model is changed at 0:00, 2:00 and 4:00. One of the simulations considers three kinds of operational conditions ($A$, $B$, $C$) where coal types and power outputs differ, and the other considers three kinds of control goals defined by Eq. (27), where the weight coefficients $C_1$, $C_2$ of CO and NOx in that equation differ.


The simulations are executed with two reward settings: the variable reward of the proposed reward adjustment method (proposed method) and a fixed reward (conventional method). Both reward settings are run under two conditions where the weight decay of the RBF network is set to 0 and 0.01, to evaluate the robustness of control under both settings. The RL agent learns at the times when the operational conditions or control goals change (0:00, 2:00 and 4:00), and the control interval is 10 minutes; hence it is possible to control the boiler 11 times in each period.

Parameter conditions of learning are shown in Table 6. These conditions are set using prior experimental results. The parameter conditions of the reward are shown in Table 7. The parameters ($\varepsilon$, $\alpha_\phi$, $\alpha_\theta$, $\alpha_\rho$) of the proposed method are also set using prior experiments. In the conventional method, the values of $\rho$, $\phi$ are fixed to their initial values, which are optimal for the first operational condition in Table 5, because the step size parameters ($\alpha_\phi$, $\alpha_\theta$, $\alpha_\rho$) are set to 0.


| Time | Ope. Cond. (Change of Operational Conditions) | $C_1$ | $C_2$ | Ope. Cond. (Change of Goals) | $C_1$ | $C_2$ |
|---|---|---|---|---|---|---|
| 0:00 - 2:00 | $A$ | 0.1 | 0.9 | $A$ | 0.1 | 0.9 |
| 2:00 - 4:00 | $B$ | 0.1 | 0.9 | $A$ | 0.9 | 0.1 |
| 4:00 - 6:00 | $C$ | 0.1 | 0.9 | $A$ | 0.001 | 0.999 |

Table 5. Time table of plant operation simulation


| Parameter | Value |
|---|---|
| Radius of Gaussian basis | 0.2 |
| Max. output of NGnet | 0.2 |
| Noise ratio | 0.2 |
| Discount rate | 0.9 |
| Learning rate for actor | 0.1 |
| Learning rate for critic | 0.02 |
| Max. basis num. of agent | 100 |
| Min. for basis addition | 0.368 |
| Min. for basis addition | 0.01 |
| Max. iterations in 1 episode | 30 |
| Max. episodes | 10000 |

Table 6. Parameter conditions of learning


| Parameter | Prop. Method | Conv. Method |
|---|---|---|
| Max. reward $reward_{\max}$ | 1 | 1 |
| Smoothing parameter $\varepsilon$ | 0.1 | 0.1 |
| Step size parameter $\alpha_\phi$ | 0.05 | 0 |
| Step size parameter $\alpha_\theta$ | 0.05 | 0 |
| Step size parameter $\alpha_\rho$ | 0.05 | 0 |
| Initial value of $\phi$ | 0.001 | 3 |
| Initial value of $\theta$ | 0.001 | 0 |
| Initial value of $\rho$ | 0 | 186 |

Table 7. Reward conditions of each method

## **4.4.2 Results and discussion**


Figure 11 shows the time series of the normalized $f$ resulting from control by the two methods, where the initial value at 0:00 is taken as the base. There are four graphs in Fig. 11, combining the two simulation objectives and the two weight decay settings. The optimal $f$ value in each period is shown as well. The computational time of learning in each case was 23[s].

Fig. 11. Time series of normalized *f* in the boiler operation simulations

To begin with, the time series of the normalized $f$ values of the proposed and conventional methods in the case where the weight decay is 0.01 are discussed. The initial $f$ values of these methods at 0:00 have offsets from the optimal values, but they decrease as control proceeds and finally converge near the optimal values. This is because the reward functions used in each method are appropriate for learning the optimal control logic. The RL agent relearns its control logic when the statistical model and its optimal $f$ values are changed at 2:00 by the change of operational conditions or control goals. However, the $f$ values of the conventional method still have offsets from the optimal values after the 11 control steps, while the proposed method reaches the optimal values within them. The initial reward setting of the conventional method would be inappropriate for the next operational condition. Similar control results are obtained for the same reason after the statistical model changes at 4:00. As discussed above, a plant control system based on the conventional method may deteriorate the control performance in thermal power plants whose operational conditions and control goals change frequently. Therefore, the proposed reward adjustment method, which can adjust the reward function flexibly for such changes, is effective for plant control.


Next, the robustness of the proposed control system achieved by tuning the weight decay is discussed. In Fig. 11, every $f$ value of the proposed method reaches nearly the optimal value when the weight decay is 0.01, whereas $f$ converges to values larger than the optimal ones when it is 0, for 2:00-6:00 in (a) and 2:00-4:00 in (b). With a weight decay of 0, the RBF network cannot learn while accounting for the influence of the noise included in the learning data[16]. The response surface is created to fit the noisy data closely, and many local minima are generated in it compared with the response surface obtained with a weight decay of 0.01; consequently, the learned control logic converges to one of these local minima. The above results show that the RBF network can avoid overfitting by tuning the weight decay properly, and thus the proposed control system can control thermal power plants robustly.
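The weight decay discussed here corresponds to ridge regularization of the RBF output weights, as described in [16]. A minimal sketch of that solve, assuming the basis-function activations are already collected in a design matrix; the function name is illustrative:

```python
import numpy as np

def fit_rbf_weights(phi: np.ndarray, y: np.ndarray,
                    weight_decay: float) -> np.ndarray:
    """Least-squares fit of RBF output weights with weight decay
    (ridge) regularization: w = (Phi^T Phi + lambda I)^-1 Phi^T y.

    phi          : (N_D, M) design matrix of basis-function activations
    y            : (N_D,) learning-data outputs
    weight_decay : the regularization constant discussed above
                   (0 reproduces the overfitted response surface)
    """
    m = phi.shape[1]
    a = phi.T @ phi + weight_decay * np.eye(m)
    return np.linalg.solve(a, phi.T @ y)
```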

Fig. 12. Learning processes of $f$ and the reward parameters ($\phi$, $\theta$, $\rho$) of the proposed method

Finally, the learning processes of $f$ and the reward parameters of the proposed method are studied. Fig. 12 shows the $f$, $\phi$, $\theta$, $\rho$ values over the episodes of learning at the operational changes at 0:00 and 2:00 when the weight decay is 0.01. In the early stage of learning (episodes 1-500), $\phi$ in each case increases to nearly 0.9 (relative value) because the $f$ value does not decrease due to insufficient learning of the RL agent. In the next 1000 episodes, $\theta$ increases and $\phi$ decreases simultaneously as the learning progresses. This behavior is explained by Eqs. (29)-(32), the updating algorithms of $\phi$, $\theta$. On the other hand, the $\rho$ value in each case converges to a certain value by the 2000th episode, which indicates that the optimal $f$ values are found in the learning process. The parameters of each case then remain stable during the middle stage of learning (episodes 2000-6000), but $\phi$, $\theta$ change suddenly at the 6000th episode only in the case of operation B. This is because the RL agent learns control logic that obtains a better $f$ value, and $\phi$, $\theta$ are then adjusted flexibly in accordance with the change of $f$ used in Eqs. (29) and (30). As a result, these parameters converge to different values.

These adjustment results of the reward parameters for different statistical models can be discussed as follows. From the characteristics of these statistical models, the gradient of $f$ in operation $A$ appears larger than that of operation $B$, because operation $A$ has a larger difference between the maximum and minimum values of $f$ than operation $B$. When the gradient of $f$ is larger, $f$ varies significantly with each control action; thus it is necessary to set $\phi$ larger so that the agent can get the reward easily. On the other hand, it is useless to set $\phi$ larger in the statistical model of operation $B$, for which the gradient of $f$ is small. As for the adjustment results of $\phi$, $\theta$ in Fig. 12, the reward function of operation $A$ indeed gives the reward more easily, due to a larger $\phi$ than that of operation $B$. Therefore, the above results show that the proposed method can obtain an appropriate reward function flexibly in accordance with the properties of the statistical models.

## **5. Conclusions**

This chapter presented a plant control system to reduce NOx and CO emissions exhausted by thermal power plants. The proposed control system generates optimal control signals by an RL agent that learns optimal control logic using a statistical model estimating the NOx and CO properties. The proposed control system requires flexibility against changes of plant operating conditions and robustness against noise in the measured data. In addition, the statistical model should be tunable with the measured data within a practical computational time. To overcome these problems, the authors proposed two novel methods: the adaptive radius adjustment method of the RBF network and the automatic reward adjustment method.

The simulations clarified that the proposed methods provide high estimation accuracy of the statistical model within a practical computational time, flexible control by RL for various changes of plant properties, and robustness against noisy plant data. These advantages lead to the conclusion that the proposed plant control system would be effective for reducing environmental effects.

## **6. Appendix A. Conventional radius adjustment method**

## **A.1 Cross Validation (CV) method**

The cross validation (CV) method is one of the conventional radius adjustment methods for the RBF network with regression, and it adjusts the radii by error evaluations. In this method, a datum is excluded from the learning data and the estimation error at the excluded datum is evaluated. Iterations are repeated until all data have been selected as the excluded datum, to calculate the RMSE. After the RMSE has been calculated for several radius conditions, the best condition is determined as the radius to use. The algorithm is shown as follows.

#### **Algorithm of Cross Validation Method**


**Step 1.** Initialize the radius to $r_{min}$.
**Step 2.** Select an excluded datum.
**Step 3.** Learn the weight parameters of the RBF network using all data except the excluded datum.
**Step 4.** Calculate the output of the RBF network at the point of the excluded datum.
**Step 5.** Calculate the error between the output and the excluded datum.
**Step 6.** Go to **Step 7** if all data have been selected. Otherwise, return to **Step 2**.
**Step 7.** Calculate the RMSE from the estimation errors.
**Step 8.** Increment the radius by $\Delta r$.
**Step 9.** Select the radius with the best RMSE if the radius is over $r_{max}$ and terminate the algorithm. Otherwise, return to **Step 2**.
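A sketch of this leave-one-out sweep in Python, where the hypothetical `fit` and `predict` callbacks stand in for the RBF-network learning and evaluation routines described above:

```python
import numpy as np

def cv_select_radius(x, y, r_min, r_max, dr, fit, predict):
    """Leave-one-out radius selection of the CV method (Steps 1-9).

    fit(x_train, y_train, r) -> model and predict(model, x_point) -> float
    are placeholders for the RBF-network routines.
    """
    best_r, best_rmse = None, np.inf
    r = r_min                                   # Step 1
    while r <= r_max:                           # Step 9 loop condition
        errors = []
        for i in range(len(x)):                 # Steps 2-6: leave one out
            mask = np.arange(len(x)) != i
            model = fit(x[mask], y[mask], r)    # Step 3
            y_hat = predict(model, x[i])        # Step 4
            errors.append(y_hat - y[i])         # Step 5
        rmse = float(np.sqrt(np.mean(np.square(errors))))  # Step 7
        if rmse < best_rmse:
            best_r, best_rmse = r, rmse
        r += dr                                 # Step 8
    return best_r                               # Step 9: best-RMSE radius
```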

#### **A.2 Radius equation**

This method is one of the non-regression methods and it adjusts the radius *r* by Eq. (34).

$$r = \frac{d_{\max}}{\sqrt{J}\left(\sqrt[J]{N_D - 1}\right)} \tag{34}$$

Here, $d_{\max}$ is the maximum distance among the learning data.
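A small sketch of Eq. (34), assuming the learning data are given as an $(N_D, J)$ array; the function name is illustrative:

```python
import numpy as np

def radius_equation(x: np.ndarray) -> float:
    """Eq. (34): r = d_max / (sqrt(J) * (N_D - 1)^(1/J)) for learning
    data x of shape (N_D, J); d_max is the largest pairwise distance."""
    n_d, j = x.shape
    diff = x[:, None, :] - x[None, :, :]
    d_max = np.sqrt((diff ** 2).sum(axis=-1)).max()
    return d_max / (np.sqrt(j) * (n_d - 1) ** (1.0 / j))
```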

## **7. References**


[1] U.S. Environmental Protection Agency, Available from http://www.epa.gov/air/oaq_caa.html/
[2] Ochi, K.; Kiyama, K.; Yoshizako, H.; Okazaki, H. & Taniguchi, M. (2009), Latest Low-NOx Combustion Technology for Pulverized-coal-fired Boilers, *Hitachi Review*, Vol. 58, No. 5, pp. 187-193.
[3] Jorgensen, K. L.; Dudek, S. A. & Hopkins, M. W. (2008), Use of Combustion Modeling in the Design and Development of Coal-Fired Furnaces and Boilers, *Proceedings of ASME International Mechanical Engineering Congress and Exposition*, Boston.
[4] EPRI (2005), *Power Plant Optimization Industry Experience*, 2005 Update. EPRI, Palo Alto.
[5] Rangaswamy, T. R.; Shanmugam, J. & Mohammed, K. P. (2005), Adaptive Fuzzy Tuned PID Controller for Combustion of Utility Boiler, *Control and Intelligent Systems*, Vol. 33, No. 1, pp. 63-71.
[6] Booth, R. C. & Roland, W. B. (1998), Neural Network-Based Combustion Optimization Reduces NOx Emissions While Improving Performance, *Proceedings of Dynamic Modeling Control Applications for Industry Workshop*, pp. 1-6.
[7] Radl, B. J. (1999), Neural networks improve performance of coal-fired boilers, *CADDET Energy Efficiency Newsletter*, No. 1, pp. 4-6.
[8] Winn, H. R. & Bolos, H. R. (2008), Optimizing the Boiler Combustion Process in Tampa Electric Coal Fired Power Plants Utilizing Fuzzy Neural Model Technology, *Proceedings of Power-Gen International 2008*, Orlando, FL.
[9] Vesel, R. (2008), The Million Dollar Annual Payback: Realtime Combustion Optimization with Advanced Multi-Variable Control at PPL Colstrip, *Proceedings of Power-Gen International 2008*, Orlando, FL.
[10] Airikka, P. & Nieminen, V. (2010), Optimized Combustion through Collaboration of Boiler and Automation Suppliers, *Proceedings of Power-Gen International 2008*.
[11] Wasserman, P. D. (1993), *Advanced Methods in Neural Computing*, Van Nostrand Reinhold.
[12] Rumelhart, D. E.; Hinton, G. E. & Williams, R. J. (1986), Learning Representations by Back-propagating Errors, *Nature*, Vol. 323, pp. 533-536.
[13] Camacho, E. F. & Bordons, C. (1999), *Model Predictive Control*, Springer.
[14] Jamshidi, M.; Titli, A.; Zadeh, L. & Boverie, S. (1997), *Applications of Fuzzy Logic*, Prentice Hall.
[15] Yamamoto, K.; Fukuchi, T.; Chaki, M.; Shimogori, Y. & Matsuda, J. (2000), Development of Computer Program for Combustion Analysis in Pulverized Coal-fired Boilers, *Hitachi Review*, Vol. 49, No. 2, pp. 76-80.
[16] Orr, M. J. L., Introduction to Radial Basis Function Networks. Available from http://anc.ed.ac.uk/mjo/rbf.html
[17] Maruyama, M. (1992), Learning Networks Using Radial Basis Function - New Approach for the Neural Computing, *Trans. of ISCIE*, Vol. 36, No. 5, pp. 322-329. (in Japanese)
[18] Sutton, R. S. & Barto, A. G. (1998), *Reinforcement Learning: An Introduction*, MIT Press.
[19] Bishop, C. M. (2006), *Pattern Recognition and Machine Learning*, Springer-Verlag.
[20] Kitayama, S.; Yasuda, K. & Yamazaki, K. (2008), The Integrative Optimization by RBF Network and Particle Swarm Optimization, *IEEJ Trans. on EIS*, Vol. 128, No. 4, pp. 636-645. (in Japanese)
[21] Eguchi, T.; Sekiai, T.; Yamada, A.; Shimizu, S. & Fukai, M. (2009), A Plant Control Technology Using Reinforcement Learning Method with Automatic Reward Adjustment, *IEEJ Trans. on EIS*, Vol. 129, No. 7, pp. 1253-1263. (in Japanese)
[22] Eguchi, T.; Sekiai, T.; Yamada, A.; Shimizu, S. & Fukai, M. (2009), An Adaptive Radius Adjusting Method for RBF Networks Considering Data Densities and Its Application to Plant Control Technology, *Proceedings of ICCAS-SICE2009*, pp. 4188-4194, Fukuoka, Japan, August 18-21.
[23] Moody, J. & Darken, C. J. (1989), Fast learning in networks of locally-tuned processing units, *Neural Computation*, Vol. 1, pp. 281-294.
[24] Zhang, J.; Yim, Y. & Yang, J. (1997), Intelligent Selection of Instances for Prediction Function in Lazy Learning Algorithms, *Artificial Intelligence Review*, Vol. 11, pp. 175-191.
[25] Ng, A.; Harada, D. & Russell, S. (1999), Policy invariance under reward transformations: Theory and application to reward shaping, *Proceedings of the 16th International Conference on Machine Learning*, pp. 278-287.



[26] Li, J. & Chan, L. (2006), Reward Adjustment Reinforcement Learning for Risk-averse Asset Allocation, *Proceedings of the International Joint Conference on Neural Networks 2006 (IJCNN06)*, pp. 534-541.
