**2. Literature review**

#### **2.1 Airborne pollution**

Air pollution has been increasing in recent decades, mainly in large cities, bringing with it respiratory diseases [2, 3]. This contamination is carried by pollutant particles whose atmospheric lifetime depends on physical parameters (size and shape) and on their chemical composition. Mexico City has been studied for decades [4] due to its high levels of pollution, which affect more than 20 million people. The sites used in this work are the following: Northeast (Gustavo A. Madero—GAM, FES Aragón—FAR, Xalostoc—XAL), Northwest (Tlalnepantla—TLA), Center (Hospital General de México—HGM, Merced—MER), Southeast (Nezahualcóyotl—NEZ, Santiago Acahualtepec—SAC) and Southwest (Ajusco Medio—AJM, Pedregal—PED, Santa Fe—SFE). The map of the monitoring sites is shown in **Figure 1**.

#### *2.1.1 Multiple imputation by chained equations (MICE)*

Dealing with the missing data problem often leads to two general approaches for imputing multivariate data: Joint modeling (JM) and fully conditional specification

**Figure 1.** *Location of the monitoring sites in Mexico City.*


(FCS), also known as multivariate imputation by chained equations (MICE) [5]. The problem is of the JM type when we specify a multivariate distribution for the missing data and draw the imputations from its conditional distributions through Markov chain Monte Carlo (MCMC) techniques. FCS, on the other hand, specifies the multivariate imputation model on a variable-by-variable basis using a set of conditional densities, one for each incomplete variable. The imputation starts by iterating over the conditional densities; usually, a low number of iterations is enough. In order to explain the model, let us use the following notation [5]: Let $Y_j$ with $j = 1, \dots, p$ be one of $p$ incomplete variables and $Y = (Y_1, \dots, Y_p)$. The observed and missing parts of $Y_j$ are denoted by $Y_j^{obs}$ and $Y_j^{mis}$, respectively; then $Y^{obs} = (Y_1^{obs}, \dots, Y_p^{obs})$ and $Y^{mis} = (Y_1^{mis}, \dots, Y_p^{mis})$ are the observed and missing data in $Y$. The number of imputations is $m \ge 1$. The $h$-th imputed data set is denoted $Y^{(h)}$, where $h = 1, \dots, m$. Now let $Y_{-j} = (Y_1, \dots, Y_{j-1}, Y_{j+1}, \dots, Y_p)$ denote the collection of the $p - 1$ variables in $Y$ except $Y_j$. Finally, let $Q$ denote the quantity of scientific interest. The MICE algorithm has three main steps: imputation, analysis, and pooling. The procedure starts with an incomplete data set $Y^{obs}$ and produces the imputed data sets $Y^{(1)}, \dots, Y^{(m)}$. The second step is to compute $Q$ on each imputed data set; here the analysis model is applied to $Y^{(1)}, \dots, Y^{(m)}$ in an identical manner. Finally, the third step is to pool the $m$ estimates $\hat{Q}^{(1)}, \dots, \hat{Q}^{(m)}$ into one estimate $\bar{Q}$ and estimate its variance.
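As an illustration of these three steps, the following minimal sketch uses scikit-learn's `IterativeImputer`, which implements an FCS/MICE-style algorithm; the station columns and the choice of $Q$ (a per-station mean) are hypothetical examples, not the chapter's actual pipeline.

```python
# MICE-style multiple imputation with scikit-learn's IterativeImputer (FCS).
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical hourly PM2.5 readings with gaps at three stations.
Y = pd.DataFrame({
    "PED": [12.0, np.nan, 15.0, 14.0, np.nan],
    "XAL": [30.0, 28.0, np.nan, 33.0, 31.0],
    "MER": [20.0, 22.0, 21.0, np.nan, 23.0],
})

m = 5  # number of imputations, m >= 1
imputed_sets = []
for h in range(m):
    # sample_posterior=True draws from each conditional density, giving
    # distinct imputed data sets Y^(1), ..., Y^(m) (step 1: imputation).
    imputer = IterativeImputer(sample_posterior=True, random_state=h)
    imputed_sets.append(imputer.fit_transform(Y))

# Step 2 (analysis): compute Q, e.g. each station's mean, identically on every Y^(h).
Q_hat = np.array([Y_h.mean(axis=0) for Y_h in imputed_sets])

# Step 3 (pooling): combine the m estimates into one; Rubin's rules pool the
# mean of the estimates with the within- and between-imputation variance.
Q_bar = Q_hat.mean(axis=0)
print(Q_bar)
```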

#### *2.1.2 Recurrent neural networks*

Recurrent neural networks (better known as RNNs) can be used for any type of sequential data, although in practical applications the use of symbolic values (such as text) is more common. In a recurrent neural network, there is a one-to-one correspondence between the layers of the network and specific positions in the sequence. The position in the sequence is also known as its timestamp. Finally, RNNs are Turing complete, which means that this type of network can simulate any algorithm given sufficient data and computational resources [6]. A representation of this kind of network is shown in **Figure 2**.
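A minimal NumPy sketch of this recurrence is shown below: the same weights are shared across every timestamp, and the hidden state carries information forward. The weight matrices are random placeholders with assumed names, not values from the chapter.

```python
# One recurrent layer unrolled over a sequence, timestamp by timestamp.
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b_h):
    """Return the hidden state after feeding the sequence xs one step at a time."""
    h = np.zeros(W_hh.shape[0])
    for x_t in xs:  # one-to-one correspondence: position t <-> network layer
        # New state combines the current input with the previous state.
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    return h

rng = np.random.default_rng(0)
p, d = 4, 3  # hidden size, input size (illustrative)
h_T = rnn_forward(rng.normal(size=(10, d)),
                  rng.normal(size=(p, d)), rng.normal(size=(p, p)), np.zeros(p))
print(h_T)
```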

#### *2.1.3 Long short-term memory (LSTM)*

To represent the hidden states of the $k$-th hidden layer, the notation $\bar{h}_t^{(k)}$ is used; to simplify the notation, it will be assumed that the input layer $\bar{x}_t$ can be denoted by $\bar{h}_t^{(0)}$ (this layer is not hidden) [7]. To obtain good results, a hidden vector of dimension $p$ must also be included, which will be denoted by $\bar{c}_t^{(k)}$ and refers to the cell state. The cell state can be seen as the long-term memory of the network. The matrix that updates the values is denoted by $W^{(k)}$ and is used to premultiply the column vector $[\bar{h}_t^{(k-1)}, \bar{h}_{t-1}^{(k)}]^T$. This matrix always has dimensions $4p \times 2p$: a vector of size $2p$ is premultiplied by the $W^{(k)}$ matrix, resulting in a vector of size $4p$. Now, to find the updates we have the following: Eq. (1) sets up the intermediate variables, Eq. (2) selectively forgets and adds to long-term memory, and Eq. (3) selectively leaks long-term memory into the hidden state.

**Figure 2.** *Graphic representation of a recurrent neural network.*

$$\begin{array}{l} \text{Input gate:} \\ \text{Forget gate:} \\ \text{Output gate:} \\ \text{New cell state:} \end{array} \quad \begin{bmatrix} \bar{i} \\ \bar{f} \\ \bar{o} \\ \bar{c} \end{bmatrix} = \begin{pmatrix} \text{sigm} \\ \text{sigm} \\ \text{sigm} \\ \tanh \end{pmatrix} W^{(k)} \begin{bmatrix} \bar{h}_t^{(k-1)} \\ \bar{h}_{t-1}^{(k)} \end{bmatrix} \tag{1}$$

$$\bar{c}_t^{(k)} = \bar{f} \odot \bar{c}_{t-1}^{(k)} + \bar{i} \odot \bar{c} \tag{2}$$

$$\bar{h}_t^{(k)} = \bar{o} \odot \tanh\left(\bar{c}_t^{(k)}\right) \tag{3}$$

It should be clarified that LSTM is an algorithm that belongs to the family of recurrent neural networks, or RNNs [8]. RNNs are neural networks that take their previous state as input; this means that the network has two inputs, the new information entering the network and its previous state, as shown in **Figure 2**. With this model we can have short-term memory in the neural network [9]. These neural networks have applications in sequential predictions, that is, predictions that depend on a temporal variable.
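As a concrete illustration, the following NumPy sketch implements one timestamp of Eqs. (1)-(3) for a single layer; the random weight matrix and the helper names are assumptions for the example, not trained values.

```python
# One LSTM update at timestamp t for layer k, following Eqs. (1)-(3).
import numpy as np

def sigm(z):
    """Logistic sigmoid, the 'sigm' activation in Eq. (1)."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_layer_below, h_prev, c_prev, W):
    """All vectors have size p; W has shape (4p, 2p) as stated in the text."""
    p = h_prev.shape[0]
    z = W @ np.concatenate([h_layer_below, h_prev])          # (4p x 2p)(2p) -> 4p
    i, f, o = sigm(z[:p]), sigm(z[p:2*p]), sigm(z[2*p:3*p])  # gates, Eq. (1)
    c_bar = np.tanh(z[3*p:])                                 # new cell state, Eq. (1)
    c = f * c_prev + i * c_bar   # Eq. (2): forget and add to long-term memory
    h = o * np.tanh(c)           # Eq. (3): leak long-term memory to hidden state
    return h, c

rng = np.random.default_rng(0)
p = 4
h, c = lstm_step(rng.normal(size=p), np.zeros(p), np.zeros(p),
                 rng.normal(size=(4 * p, 2 * p)))
print(h, c)
```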

#### **2.2 Bayesian optimization using Gaussian processes**

#### *2.2.1 Multidimensional Gaussian distribution*

To talk about Gaussian processes, we must first define the multivariate Gaussian distribution in $D$ dimensions. Formally, this distribution is expressed as in Eq. (4):

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{D/2}\,|\boldsymbol{\Sigma}|^{1/2}}\, \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right) \tag{4}$$


**Figure 3.** *Graphic example of a multidimensional Gaussian distribution.*

where $D$ is the number of dimensions, $\mathbf{x}$ is the vector of variables, $\boldsymbol{\mu}$ is the mean vector, and $\boldsymbol{\Sigma}$ is the covariance matrix. Gaussian processes try to model a function $f$ given a set of points [10]. Traditional nonlinear regression machine learning methods usually return the single function considered to best fit these observations. However, there may be more than one function that fits the observations equally well. When we obtain more observation points, we use our current posterior as the prior and use the new observations to update the posterior. This is the Gaussian process. A Gaussian process is a probability distribution over possible functions that fit a set of points. Because we have the probability distribution over all possible functions, we can calculate the mean as the regression function and calculate the variance to show how confident we are when we make predictions using that function (**Figure 3**).
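As a quick numerical check of Eq. (4), the density evaluated by hand matches SciPy's `multivariate_normal`; the 2-D mean and covariance below are made-up values for illustration only.

```python
# Verify Eq. (4) against SciPy's reference implementation.
import numpy as np
from scipy.stats import multivariate_normal

D = 2
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
x = np.array([0.5, 0.5])

diff = x - mu
by_hand = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / \
          ((2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma)))
print(by_hand, multivariate_normal(mu, Sigma).pdf(x))  # the two values agree
```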

#### *2.2.2 Gaussian process*

Because we have the probability distribution over all possible functions, we can calculate the mean as the regression function and calculate the variance to show how confident we are when predictions are made using that function, as demonstrated by Wang [10]. We must take into account that:

I. The (posterior) functions are updated with new observations.

II. The mean calculated from the posterior distribution of the possible functions is the function used for the regression.

The function is modeled by a multivariable Gaussian of the form shown in Eq. 5:

$$P(f|X) = N(f|\mu, K) \tag{5}$$

where $X = [x_1, \dots, x_n]$, $f = [f(x_1), \dots, f(x_n)]$, $\mu = [m(x_1), \dots, m(x_n)]$, and $K_{i,j} = k(x_i, x_j)$. Here $X$ denotes the observed data points, $m$ is the mean function, and $K$ is a positive definite kernel.
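The sketch below builds the distribution of Eq. (5) with a zero mean function and then conditions it on a few observed points, yielding the posterior mean (the regression function) and variance (the confidence) discussed above; the squared-exponential kernel, length scale, and data are illustrative assumptions.

```python
# Gaussian-process regression: prior N(mu, K) from Eq. (5), then the posterior.
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Positive definite squared-exponential kernel: K_ij = k(a_i, b_j)."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length_scale**2)

# Observed points (X, y) and query points X_star; mean function m(x) = 0.
X = np.array([-2.0, 0.0, 1.5])
y = np.sin(X)
X_star = np.linspace(-3.0, 3.0, 7)

K = rbf_kernel(X, X) + 1e-8 * np.eye(len(X))   # jitter for numerical stability
K_s = rbf_kernel(X, X_star)
K_ss = rbf_kernel(X_star, X_star)

# Condition the prior on the observations: the posterior mean is used for the
# regression, and the posterior variance quantifies prediction confidence.
mu_post = K_s.T @ np.linalg.solve(K, y)
cov_post = K_ss - K_s.T @ np.linalg.solve(K, K_s)
std_post = np.sqrt(np.clip(np.diag(cov_post), 0.0, None))
print(mu_post, std_post)
```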
