**3.2. Learning conditional probability tables**

In the case of PdG-L3 presented in this chapter, the Bayesian Networks were built in the Hugin™ software environment. This software offers three approaches to learning the entries of conditional probability tables of the kind in eq. (2): from subjective knowledge, from analytical equations, or from datasets. Since the procedure presented in paragraph 6 learned the conditional probability tables from datasets assembled through numerical simulations, this is the case considered in this sub-section. In particular, the algorithm used by Hugin™ is called "EM learning". Fig. 4 depicts two examples of the conditional probability tables that Hugin™ requires in order to perform the inference propagations explained in sub-section 4.1. The left-hand table (Fig. 4-a) defines how node *Y* of the network in Fig. 2-a depends conditionally on node *X*; the right-hand table (Fig. 4-b) defines the likelihood of each state of node *X* with respect to the joint combinations of the states of nodes *U1*, *U2* and *U3* (i.e. *X*'s parents). It is worth remarking that in this second case the number of columns equals the number of combinations of states of the parents of *X*.

**Figure 4.** Conditional probability tables relative to a node having just one parent with four states (a) and to a node having three parents with two states each (b).
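To make the shape of these tables concrete, the following Python sketch builds two arrays with the layout of Fig. 4 (this is a minimal illustration, not the Hugin™ API; the random entries and the number of states of *Y*, which the caption does not specify, are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_cpt(n_child_states, n_parent_combinations):
    """Return a table with one column per joint parent state;
    each column is a probability distribution over the child's states."""
    table = rng.random((n_child_states, n_parent_combinations))
    return table / table.sum(axis=0)   # normalize every column to sum to 1

# Fig. 4-a: node Y conditioned on its single parent X, which has four states
# (Y's own number of states is not given in the caption; two is assumed here).
cpt_Y = random_cpt(2, 4)

# Fig. 4-b: node X conditioned on three binary parents U1, U2, U3,
# hence 2 * 2 * 2 = 8 columns, one per joint combination of parent states.
cpt_X = random_cpt(2, 2 * 2 * 2)

print(cpt_Y.shape, cpt_X.shape)   # (2, 4) (2, 8)
```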

The case we are considering might be described as a set of discrete variables *U* = {*X*1, …, *X*n}, whose multivariate joint probability distribution can be encoded in some particular Bayesian Network structure *B*s. The structure of *B*s for our case study was derived from expert knowledge, hence only the probabilities must be inferred from the additional knowledge provided by a random sample. Also, according to the references in this field, it is assumed that the random sample *D* = {*C*1, …, *C*m} contains no missing data [23], which means that each case *C*l consists of observations of all the variables in *U*; we can therefore say that *D* is a random sample from *B*s. Indeed, *B*s may be thought of as a directed acyclic graph that encodes assertions of conditional independence. In fact, it orders the variables in domain *U* such that the joint probability distribution can be estimated by means of the chain rule in eq. (1). Now, for every *X*i there will be some subset *Π*i ⊆ {*X*1, …, *X*i−1} such that *X*i and {*X*1, …, *X*i−1} are conditionally independent given *Π*i. That is:

$$P(X_i \mid X_1, \dots, X_{i-1}) = P(X_i \mid \Pi_i) \tag{7}$$

As a consequence, a Bayesian Network is made up of a set of local conditional probability distributions and a set of assertions of conditional independence. These conditional independences can be expressed as in eq. (7), where the parents of *X*i are grouped in the set *Π*i, and are useful to "d-separate" any variable (i.e., to make it conditionally independent) from the rest of *B*s.
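As a concrete instance, consider the small network formed by the nodes of Fig. 4, assuming (as suggested by the captions) that *X* is the child of *U1*, *U2*, *U3* and that *Y* is the child of *X* alone. The chain rule of eq. (1), simplified through eq. (7), factorizes the joint distribution as:

$$P(U_1, U_2, U_3, X, Y) = P(U_1) \cdot P(U_2) \cdot P(U_3) \cdot P(X \mid U_1, U_2, U_3) \cdot P(Y \mid X)$$

where each factor on the right-hand side is precisely one of the conditional probability tables of the kind shown in Fig. 4 (the root nodes *U1*, *U2*, *U3* only require marginal tables).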

Starting from these concepts, further and more complex computations can be carried out so as to propagate inference throughout networks with any kind of admissible connection; all of them, however, require the prior definition of the conditional probability tables between linked variables.

Given *B*s, let *r*i be the number of states of variable *X*i, and let *q*i = ∏*X*l∈*Π*i *r*l be the number of states (i.e., joint configurations) of *Π*i. Let *θ*ijk denote the physical probability of *X*i = *k* given *Π*i = *j*, for *i* = 1, .., *n*, *j* = 1, .., *q*i, *k* = 1, .., *r*i. Adopting this notation, the following equivalences are valid [23]:

$$\theta_{ij} \equiv \bigcup_{k=1}^{r_i} \left\{ \theta_{ijk} \right\}, \qquad \theta_{B_s} \equiv \bigcup_{i=1}^{n} \bigcup_{j=1}^{q_i} \left\{ \theta_{ij} \right\} \tag{8}$$

In other words, eq. (8a) states that the two notations are equivalent and represent the case where all the physical probabilities of *X*i = *k* are grouped, once any *X*i in domain *U* is selected and the state of its parents is fixed at any *Π*i = *j*; here "⋃" stands for the union of all the states represented by the expression. As a consequence, eq. (8b) represents all the physical probabilities of the joint space *B*s (i.e. the Bayesian Network structure), because it encompasses all the states of *Π*i (i.e., the parents of *X*i) and all the variables *X*i in domain *U*. For instance, for the node *X* of Fig. 4-b, which has two states and three binary parents, *r*X = 2 and *q*X = 2·2·2 = 8: the set *θ*ij relative to *X* collects the two physical probabilities in column *j* of its table, and *θ*Bs collects all such columns for every node of the network. The distribution of each child node must be described by a probability distribution, which may then be updated according to the evidence acquired through its parents. The Hugin™ software uses the EM algorithm, which defines a Dirichlet distribution for each variable *θ*ij (i.e. one distribution for each variable of *B*s given that its parents are in state *j*) [23]:

$$p\left(\theta_{ij} \mid B_s, \xi\right) = c \cdot \prod_{k=1}^{r_i} \theta_{ijk}^{N'_{ijk}-1} \tag{9}$$

where *c* is a normalization constant (expressed in terms of gamma functions), the *N'*ijk are the multinomial parameters of the distribution, and *ξ* is the observed evidence; each probability *θ*ijk is bounded between 0 and 1. The Dirichlet distribution describes the probability of one of the variables *X*i as it varies over all its states. One of the advantages of using a Dirichlet distribution is that it is completely defined by its multinomial parameters. In addition, its shape can easily be adapted to fit various probability density functions. Finally, and most importantly, it can be demonstrated that the learning process is made easier in this way [23]. In fact, the values *N'*ijk represent expert knowledge introduced into the network by the developers themselves. Then, if *N*ijk is the number of observations in the database *D* in which *X*i = *k* and *Π*i = *j*, we are able to update the distribution by adding the empirical information conveyed by the parameter *N*ijk:

$$p\left(\theta_{ij} \mid D, B_s, \xi\right) = c \cdot \prod_{k=1}^{r_i} \theta_{ijk}^{N'_{ijk}+N_{ijk}-1} \tag{10}$$

Thanks to this algorithm, which is easily implemented in computer programs, the distribution re-adjusts its shape according to the available sample data. It exploits the notion of experience, that is, a quantitative memory which can be based both on quantitative expert judgment and on past cases. The prior parameters in the first version of the network are set with a particular equivalent sample size based on expert knowledge (by tuning the values *N'*ijk). Subsequent data are then added, starting from the given equivalent sample size, so that the two sets of Dirichlet parameters reshape the probability density function according to the observations included in the dataset, which determine the values of the parameters *N*ijk. Furthermore, it can be shown that the probability that *X*i = *k* and *Π*i = *j* in the next case *C*m+1 to be seen in the database (i.e. its expected value) is computed by means of the equation:

$$P\left(C_{m+1} \mid D, B_s, \xi\right) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{N'_{ijk} + N_{ijk}}{N'_{ij} + N_{ij}} \tag{11}$$

where *N'*ij = ∑k *N'*ijk and *N*ij = ∑k *N*ijk, with both sums taken over *k* = 1, …, *r*i.
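The following Python sketch reproduces this update numerically for the node *X* of Fig. 4-b (a minimal illustration, not the actual Hugin™ implementation; the uniform prior and the random counts are placeholders):

```python
import numpy as np

r_i, q_i = 2, 8   # node X of Fig. 4-b: 2 states, 8 joint parent configurations

# Prior pseudo-counts N'_ijk (expert knowledge, i.e. the equivalent sample size);
# a uniform prior is assumed here for illustration.
n_prior = np.ones((q_i, r_i))

# Observed counts N_ijk from the dataset D (random placeholders).
rng = np.random.default_rng(0)
n_data = rng.integers(0, 50, size=(q_i, r_i)).astype(float)

# Posterior Dirichlet parameters N'_ijk + N_ijk (eq. (10) uses them minus 1
# as exponents). One row per parent configuration j, one column per state k.
alpha = n_prior + n_data

# Expected probability of X = k given parents in state j, as in eq. (11):
# (N'_ijk + N_ijk) / (N'_ij + N_ij). Each row of the learned table sums to 1.
learned_cpt = alpha / alpha.sum(axis=1, keepdims=True)

print(learned_cpt)
```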

Over the whole validation dataset, made up of *K* samples, these instantaneous indices must be combined into global indices. Therefore, the mean absolute error is defined as:

$$MAE \triangleq \frac{1}{K} \sum_{i=1}^{K} \left| \hat{X}_i - X_i \right| \tag{12}$$

and the root mean square error is:

$$RMSE \triangleq \sqrt{\frac{1}{K} \sum_{i=1}^{K} \left( \hat{X}_i - X_i \right)^2}. \tag{13}$$

The indices of the predicted variables are related to different physical quantities with different units. For this reason, these indices must be normalized with respect to their typical range of variation. Two more indices are therefore defined and practically used for evaluating prediction models:

$$NMAE \triangleq \frac{\dfrac{1}{K} \sum_{i=1}^{K} \left| \hat{X}_i - X_i \right|}{\left| X_{max} - X_{min} \right|} \tag{14}$$

$$NRMSE \triangleq \frac{\sqrt{\dfrac{1}{K} \sum_{i=1}^{K} \left( \hat{X}_i - X_i \right)^2}}{\left| X_{max} - X_{min} \right|}. \tag{15}$$

ASHRAE Guideline 14-2002 [24] establishes that, for calibrated simulations, the *CVRMSE* and *NMBE* of energy models shall be determined for each calibration parameter by comparing simulation-predicted data to the utility data used for calibration. The proposed indices are the coefficient of variation of the root mean square error (*CVRMSE*) and the normalized mean bias error (*NMBE*); both are normalized with respect to the arithmetic mean of the variable. Following this guideline, the *RMSE* has been selected as the main performance index for evaluating the accuracy of a BN. However, the mean value of a variable is a good normalization factor only if the variable is always positive (or always negative). For this reason, the range of the considered variable has been taken as the normalization factor, and the *NRMSE* has been selected as the final index for the design process of the BN: it includes information about both bias and variance of the error.
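As a compact illustration of eqs. (12)-(15), the following Python sketch computes the four indices for one predicted variable (the function name and the sample values are illustrative; taking the range |*X*max − *X*min| over the observed data is an assumption):

```python
import numpy as np

def global_indices(x_pred, x_true):
    """Compute MAE, RMSE (eqs. 12-13) and their range-normalized
    counterparts NMAE, NRMSE (eqs. 14-15) over K validation samples."""
    err = np.asarray(x_pred) - np.asarray(x_true)
    mae = np.mean(np.abs(err))                        # eq. (12)
    rmse = np.sqrt(np.mean(err ** 2))                 # eq. (13)
    x_range = abs(np.max(x_true) - np.min(x_true))    # |X_max - X_min|
    return mae, rmse, mae / x_range, rmse / x_range   # eqs. (14)-(15)

# Example: indoor temperature predicted by a BN vs. simulated "truth".
x_true = np.array([20.1, 20.4, 21.0, 21.6, 22.0])
x_pred = np.array([20.0, 20.6, 20.8, 21.9, 22.1])
print(global_indices(x_pred, x_true))
```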

**4.2. Instantiation of BNs for MPC**

The control block in the framework depicted in Fig. 1 needs to evaluate a cost function for each candidate control policy. However, since discrete nodes are often used in BNs for describing nonlinearities, the resulting cost function may also present significant nonlinearities. One of the main problems in MPC consists in guaranteeing closed-loop stability for the controlled system. As stated in [25], the stability of MPC requires some continuity assumptions. The usual approach to ensuring stability is to consider the value function of the MPC cost as a candidate Lyapunov function. Even if closed-loop stability is guaranteed in some way without the continuity assumption, the absence of a continuous Lyapunov function may result in a closed-loop system that has no robustness. This implies that, whenever possible, a continuous cost function should be adopted.
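For reference, the standard argument takes the finite-horizon value function (a generic textbook form, not the specific cost adopted in this chapter)

$$V(x) = \min_{u_0, \dots, u_{N-1}} \; \sum_{k=0}^{N-1} \ell(x_k, u_k) + V_f(x_N), \qquad x_0 = x$$

as the candidate Lyapunov function: if *V* is continuous and decreases along the closed-loop trajectories, asymptotic stability follows. Discontinuities in the stage cost ℓ, such as those induced by discrete BN nodes, can invalidate this argument, which motivates the preference for a continuous cost function.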
