**3. Materials and methods**

#### **3.1 Materials**

The data used to train the model were obtained from the Atmospheric Monitoring System (SIMAT, for its acronym in Spanish) database, which comprises four subsystems: RAMA, REDMA, REDMET, and REDDA. All of them are available through its website, but this work focuses only on the Automatic Environmental Monitoring Network (RAMA, for its acronym in Spanish).

#### **3.2 Methodology**

#### *3.2.1 Data acquisition and preprocessing*

First, the air quality database, RAMA (SEDEMA, 2021) [11], which is public, must be downloaded. This file (dataset) contains the values captured by every station capable of monitoring *PM*2.5. If the dataset contains variables in addition to the one already mentioned, a preprocessing step must be carried out: the values of interest (*PM*2.5 and wind direction) are selected and any others are excluded.

Once the data have been selected, it is necessary to determine whether there are missing values and in which cases it is convenient to impute them. If the amount of data to be imputed exceeds 40% of the total, it is advisable not to use that station, since the data loss is very large and could lead to overfitting or to data that do not reflect reality. The MICE algorithm is used to impute the missing data, and once it has been applied, a new, complete dataset is generated from the imputed values.

Because the RNN-LSTM takes a tensor as input, the now-complete dataset must be divided into three subsets, used for training, testing, and validation. To conclude the preprocessing, the data are normalized and then converted into tensors. Min-max normalization is used, which maps the minimum value of the data to "zero" and the maximum value to "one", and normalizes the remaining values on that basis. When creating the tensors, the batch size must be taken into account, which takes values of 2*<sup>n</sup>* with *n* ∈ ℕ; the sequence length used as a parameter should also be considered.
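The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the chapter's implementation: the column names, the 70/15/15 chronological split, and the window length are assumptions, and scikit-learn's `IterativeImputer` (modeled after MICE) stands in for the MICE algorithm.

```python
import numpy as np
import pandas as pd

# IterativeImputer is a MICE-style imputer; the experimental flag
# must be imported before it can be used.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer


def preprocess(df, train_frac=0.7, test_frac=0.15, seq_len=24):
    """Impute, split, min-max normalize, and window a PM2.5 dataset."""
    # Discard stations (columns) with more than 40% missing values.
    missing_ratio = df.isna().mean()
    df = df.loc[:, missing_ratio <= 0.40]

    # Impute the remaining gaps with chained equations (MICE-style).
    imputed = IterativeImputer(max_iter=10, random_state=0).fit_transform(df)

    # Chronological split into training, testing, and validation sets.
    n = len(imputed)
    n_train, n_test = int(n * train_frac), int(n * test_frac)
    train = imputed[:n_train]
    test = imputed[n_train:n_train + n_test]
    val = imputed[n_train + n_test:]

    # Min-max normalization: training minimum -> 0, maximum -> 1.
    lo, hi = train.min(axis=0), train.max(axis=0)
    scale = lambda x: (x - lo) / (hi - lo)

    # Sliding windows of length seq_len -> (samples, seq_len, features),
    # the tensor shape an LSTM expects.
    def window(x):
        return np.stack([x[i:i + seq_len] for i in range(len(x) - seq_len)])

    return window(scale(train)), window(scale(test)), window(scale(val))
```

Fitting the min and max on the training portion only avoids leaking information from the test and validation sets into the normalization.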

#### *3.2.2 Instantiate LSTM and optimize the model*

To begin the training and optimization of the network, we initialize it with hyperparameter values selected at random. This is only to get the network started, since the value of each selected hyperparameter will later be rewritten during optimization until the optimal point is found within the established search space. With the data imputed, divided, normalized, and transformed into tensors, we use the training dataset to begin the process of training the network, which goes hand in hand with the number of iterations (points) assigned to the optimization process. In each iteration, an adjustment is made to some of the hyperparameters, and the score of each model is saved so as to end up with the best one.

*Perspective Chapter: Airborne Pollution (PM2.5) Forecasting Using Long Short-Term Memory… DOI: http://dx.doi.org/10.5772/intechopen.108543*
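The iterate-adjust-keep-the-best loop can be sketched as below. The hyperparameter names and ranges are illustrative assumptions, and a simple random search stands in for whichever optimizer is actually used; the `train_and_score` callable represents building and fitting the LSTM and returning its validation score.

```python
import random

# Illustrative search space; the concrete ranges are assumptions.
SEARCH_SPACE = {
    "hidden_units": [16, 32, 64, 128],
    "num_layers": [1, 2, 3],
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "batch_size": [2 ** n for n in range(4, 9)],  # batch = 2^n
}


def sample_hyperparams(rng):
    """Random starting point, as the text suggests."""
    return {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}


def optimize(train_and_score, n_iterations=20, seed=0):
    """Keep the best-scoring hyperparameters over a fixed number of iterations."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_iterations):
        params = sample_hyperparams(rng)    # adjust some hyperparameters
        score = train_and_score(params)     # e.g. validation RMSE (lower is better)
        if score < best_score:              # save the best model so far
            best_params, best_score = params, score
    return best_params, best_score
```

Each iteration corresponds to one point evaluated in the search space; the returned pair is the best configuration found and its score.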

#### *3.2.3 Model evaluation*

Having a trained and optimized model, we can now determine its efficiency by calculating the RMSE of the model, comparing the predicted data against the real data. To generate the predictions, we use the evaluation set (**Figure 4**).
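The evaluation metric itself is straightforward; a minimal sketch, where `y_true` would hold the real *PM*2.5 values and `y_pred` the model's predictions on the evaluation set:

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root-mean-square error between real and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```

If the data were min-max normalized, the predictions should be mapped back to the original scale before computing the RMSE so the error is expressed in the pollutant's physical units.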

**Figure 4.** *Proposed methodology.*
