**RMSE = 9.13**

The model is still built on the same core of features: minutes, %orange, RH, temperature, and wind. The only new parameter that appears as a predictive feature is precipitation. This can be explained by the fact that rain events in Quito usually occur at midday. This factor has a negative coefficient because precipitation has a cleaning effect on the concentration of fine particulate matter [19]. The performance of the model remains at a similar level of accuracy (r = 0.56).

#### *4.3.3. Afternoon model*

The linear regression model obtained after running the algorithm is as follows:


The prediction accuracy of the model is evaluated as

$$\begin{array}{rcl} \textbf{r} & = \textbf{0.56} \\\\ \textbf{RMSE} & = \textbf{6.61} \end{array}$$
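As a reminder of what these two figures measure, the sketch below computes them from paired observed and predicted PM2.5 values. It assumes that r denotes the Pearson correlation between observations and predictions and that RMSE is the usual root mean squared error; the function names and the sample values are hypothetical and are not taken from the study's dataset.

```python
import math

def pearson_r(observed, predicted):
    """Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)

def rmse(observed, predicted):
    """Root mean squared error, in the same unit as the target."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

# Hypothetical PM2.5 values, for illustration only.
observed = [12.0, 18.0, 9.0, 15.0, 20.0]
predicted = [14.0, 16.0, 10.0, 13.0, 19.0]

print("r    =", round(pearson_r(observed, predicted), 2))
print("RMSE =", round(rmse(observed, predicted), 2))
```

Because RMSE is expressed in the unit of the target, a series with a narrower range of values tends to yield a lower RMSE even when r is unchanged.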

This model is simpler (only four features) and does not use exactly the same attributes as the two previous models (pressure is used, and %red is preferred to %orange). In Section 3.3.3, differences were already noted in the afternoon model with respect to the morning and midday models. The explanation seemed to be related to the difficulty of obtaining a reliable predictive model of PM2.5 when the particulates are strongly diluted in the atmosphere. In such a situation, the fair performance of the model (r = 0.56; RMSE = 6.61) would be due more to the reduced fluctuation of the PM2.5 values (**Figure 3** shows a maximum peak at around 20 μg/m³, against 30 μg/m³ in the morning) than to the reliability of the prediction per se.

#### *4.3.4. Interpretation of the results*

Eq. (11) presents the average prediction accuracy by modeling the air pollution through the three daily models.

$$\overline{r} = \frac{0.58 + 0.56 + 0.56}{3} = 0.57\tag{11}$$
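The arithmetic of Eq. (11) can be checked with a few lines of Python. The r values are those reported for the three daily models; the dictionary keys are only labels chosen here for readability.

```python
from statistics import mean

# r values reported for the three daily models.
daily_r = {"morning": 0.58, "midday": 0.56, "afternoon": 0.56}

mean_r = mean(daily_r.values())
spread = max(daily_r.values()) - min(daily_r.values())

print("mean r =", round(mean_r, 2))
print("spread =", round(spread, 2))
```

The spread between the best and the worst daily model is small, which makes the averaged value a fair summary of the three.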

Although the morning model is slightly more accurate than the other two, the mean value of r is no better than the r of the single model, especially if the single model is obtained by a model tree algorithm.

Thus, when meteorological factors are taken into account, it does not seem advantageous to consider three regression models per day. This can be explained by the fact that weather conditions have a very strong effect on the levels of PM2.5 (e.g., rain and wind tend to clean the atmosphere). Including these factors as model features therefore reduces the relative influence of traffic on the value of PM2.5. Since the impact of this human activity is more significant in the morning than in the rest of the day, because of the low dilution of the vehicle emissions in the atmosphere, adding meteorological parameters to the model decreases the performance differences between the three daily models.

Regression Models to Predict Air Pollution from Affordable Data Collections

http://dx.doi.org/10.5772/intechopen.71848

**5. Adding trace gas concentrations**

This part intends to verify the prediction accuracy of the methods described in the previous sections. To do so, the precision of the prediction based on low-cost data collection is compared with that of pollution monitoring that makes use of costlier technologies (i.e., EPA-approved chemical sensors). Then, a hybrid model is proposed from a selection of the most relevant features to minimize the prediction error.

**5.1. Prediction from chemical monitoring**

The concentrations of PM2.5 are commonly correlated with those of other air pollutants, such as SO2, NO2, CO, etc. [20]. However, monitoring these substances involves more specialized equipment than traffic or weather monitoring. The performance of the models built in this section is used as a reference to assess the quality of the previous models and to investigate whether a selection of the most affordable chemical records can significantly improve the overall prediction accuracy. Four additional criteria pollutants were measured (CO, NO2, SO2, and O3). For SO2 concentrations, a ThermoFisher Scientific 43i high-level SO2 analyzer based on ultraviolet fluorescence was used (EPA No. EQSA-0486-060). For O3 concentration data collection, a ThermoFisher Scientific 49i ozone analyzer based on ultraviolet absorption was used (EPA No. EQOA-0880-047). For NOx concentration data collection, a ThermoFisher Scientific 42i NOx analyzer based on the chemiluminescence method was used (EPA No. RFNA-1289-074). Finally, for CO concentration data collection, a ThermoFisher Scientific 48i based on infrared absorption was used (EPA No. RFCA-0981-054). The dataset is composed of 1118 observations and 5 features: CO, NO2, O3, SO2, and PM2.5 (= feature to predict).

The prediction accuracy of the model is evaluated as

**r = 0.75**

**RMSE = 5.89**

The evaluation of this model demonstrates that the chemical factors alone are very strong predictors of the level of fine particulate matter. A model built with these parameters provides a significantly lower RMSE and a higher r than the traffic- and meteorology-based models. This outcome was expected, as the levels of anthropogenic PM2.5 are directly related to the emission of other air pollutants, since a number of different contaminants come from the same sources. It can be concluded from this analysis that selecting some low-cost chemical recordings should improve the prediction accuracy of the affordable models.
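To illustrate the kind of model evaluated in Section 5.1, the sketch below fits a linear regression of PM2.5 on the four trace gas concentrations and reports r and RMSE on held-out rows. It is only a sketch under several assumptions: the fit is an ordinary least squares solved through the normal equations, the data are synthetic placeholders generated with arbitrary coefficients (not the 1118 observations of the study), and the simple train/test split stands in for an evaluation protocol that is not specified here. All function names are hypothetical.

```python
import math
import random

FEATURES = ["CO", "NO2", "O3", "SO2"]  # predictors; PM2.5 is the target

def fit_ols(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.
    An intercept column is added; returns [intercept, b1, b2, ...]."""
    rows = [[1.0] + [float(v) for v in r] for r in X]
    n = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(n):
        p = max(range(c, n), key=lambda k: abs(A[k][c]))
        A[c], A[p] = A[p], A[c]
        for k in range(n):
            if k != c:
                f = A[k][c] / A[c][c]
                A[k] = [a - f * b for a, b in zip(A[k], A[c])]
    return [A[i][n] / A[i][i] for i in range(n)]

def predict(coef, X):
    return [coef[0] + sum(c * v for c, v in zip(coef[1:], r)) for r in X]

def rmse(y, yhat):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))

def pearson_r(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (z - mb) for x, z in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((z - mb) ** 2 for z in b)
    return cov / math.sqrt(va * vb)

# SYNTHETIC placeholder data: arbitrary coefficients and noise level.
rng = random.Random(0)
X = [[rng.uniform(0.0, 1.0) for _ in FEATURES] for _ in range(300)]
y = [5 + 8 * r[0] + 4 * r[1] - 2 * r[2] + 3 * r[3] + rng.gauss(0.0, 1.0)
     for r in X]

# Fit on the first 200 rows, evaluate on the remaining 100.
coef = fit_ols(X[:200], y[:200])
pred = predict(coef, X[200:])
print("r    =", round(pearson_r(y[200:], pred), 2))
print("RMSE =", round(rmse(y[200:], pred), 2))
```

With real measurements, the same two figures computed for the traffic- and meteorology-based models would give the comparison discussed in the text.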
