26 Machine Learning - Advanced Techniques and Emerging Applications

**Figure 2.** Representation of the value of PM2.5 against the ratio of medium traffic (each dot is an observation) and the respective simple linear regression between these two features (line). The higher the amount of medium traffic (%orange), the larger the concentration of fine particulate matter (PM2.5).

**3.3. Multiple models**

In the city of Quito, as in most cities worldwide, there are two peaks of PM2.5 pollution during the day. The first peak is in the morning (around 10 am) and the second is in the evening (around 7 pm). **Figure 3** is a graphical representation of the two daily peaks of fine particulate contamination averaged over 10 years (2007–2016) for the district of Belisario (these peaks occur at approximately the same time in every district of Quito). During the morning hours, the rush hour actually lasts longer than the visible PM2.5 concentration peak, but a sudden decline can be observed due to the deepening of the planetary boundary layer (PBL).

**Figure 3.** Typical profile of the PM2.5 concentrations during the day in the Belisario district of Quito (2007–2016 data). Although a slight reduction in the level of pollution was observed over the years, the air contamination peaks always occur at the same times of day (around 10 am and 7 pm).

*3.3.1. Morning model*

The morning model is defined between 6 am (360th minute) and 10 am (600th minute). **Figure 3** shows that there is a constant increase in the PM2.5 concentration during this period. The two main factors that should explain this increase are traffic intensification and the low morning PBL. If this assumption is correct, then the predictive accuracy of a regression model that uses traffic data as features should improve in comparison with the single models. The dataset used has the following characteristics: 110 instances and 4 features (minutes, %red, %orange, and PM2.5).
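As an illustration of how such a time window can be cut out of a full day of observations, the following minimal sketch converts clock times to minutes of the day and keeps only the rows inside the morning window. The sample rows, the function names, and the choice of inclusive bounds are assumptions made for the example; the chapter does not state how the boundary minutes were handled.

```python
def minute_of_day(hour, minute=0):
    """Convert a 24-hour clock time to minutes elapsed since midnight."""
    return hour * 60 + minute


def in_window(minutes, start, end):
    """True if `minutes` lies in [start, end]; inclusive bounds are an assumption."""
    return start <= minutes <= end


# Morning window from the text: 6 am (360th minute) to 10 am (600th minute).
MORNING = (minute_of_day(6), minute_of_day(10))

# Hypothetical observations: (minutes, %red, %orange, PM2.5). Not real data.
observations = [
    (300, 0.02, 0.10, 9.0),
    (360, 0.05, 0.20, 12.0),
    (480, 0.15, 0.35, 18.5),
    (600, 0.10, 0.30, 22.0),
    (720, 0.08, 0.25, 16.0),
]

# Keep only the rows whose time stamp falls inside the morning window.
morning_rows = [row for row in observations if in_window(row[0], *MORNING)]
print(MORNING, len(morning_rows))
```

The same helper can be reused for any other window by passing a different pair of start and end minutes.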

Regression Models to Predict Air Pollution from Affordable Data Collections

http://dx.doi.org/10.5772/intechopen.71848

*3.3.3. Afternoon model*

The afternoon model is defined between 2 pm (840th minute) and 7 pm (1140th minute). **Figure 3** shows that there is a constant increase in the PM2.5 concentration, although the evening peak is lower than the morning peak because the PBL has reached its maximum height and no longer changes at this time of day, until a nocturnal boundary layer starts forming in the absence of surface heating. Despite the elevated PBL, air pollution increases because traffic grows at the end of the day. Again, the strong dilution of pollutants in the atmosphere should reduce the correlation between traffic and PM2.5 concentrations. The dataset used to build the model consists of 145 instances and 4 features (minutes, %red, %orange, and PM2.5).

The linear regression model obtained after running the algorithm is as follows:

 **PM2.5 = 0.0242 \* minutes + 20.7938 \* %orange + −14.6845**

The prediction accuracy of the model is evaluated as

 **r = 0.28 RMSE = 7.65**

The feature with the maximum weight in the afternoon model is still %orange, although its coefficient continues to decrease. The time coefficient is extremely low, and %red is filtered out by the M5 attribute selection method. As expected, the model accuracy assessed by the correlation coefficient is relatively low (r ≈ 0.3). This means that the traffic input is not a good predictor of the level of PM2.5 in the afternoon. The strong dilution of air contaminants in the atmosphere would explain this result. Surprisingly, the RMSE (<8) is lower than in the two previous models (>10). This reduced prediction error can be explained by the lower standard deviation (SD) of the PM2.5 values in the afternoon (SD = 8) than in the morning (SD = 11.6) and at midday (SD = 10.8). In other words, the lower error is not due to the reliability of the model per se (essentially based on the traffic), but to the limited variation in the PM2.5 concentrations in the afternoon.

*3.3.4. Interpretation of the results*

There is a significant improvement in the prediction of PM2.5 in the morning (r ≈ 0.5). This performance can be explained by the fact that the PBL is relatively low in the morning. Thus, pollutant dilution is reduced and, consequently, the level of PM2.5 becomes strongly

The linear regression model obtained after running the algorithm is as follows:


The prediction accuracy of the model is evaluated as

 **r = 0.49 RMSE = 10.13**

As observed in the single model approach, the weights of the traffic attributes are significantly larger than the coefficient of time. The most representative feature, %orange, shows that the higher the amount of medium traffic, the higher the value of PM2.5. In terms of performance, the correlation coefficient is around 0.5 and the RMSE is around 10, against an average PM2.5 value of 17.4 μg/m3. As hypothesized, restricting the analysis to a morning window yields a more accurate regression model than the models based on the full day.
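For readers who want to reproduce the two evaluation measures, the sketch below implements the standard textbook definitions of the Pearson correlation coefficient (r) and the root mean squared error (RMSE). The chapter does not describe its exact evaluation procedure, so this is an assumption about the definitions rather than the authors' code, and the sample values are made up. The last lines relate the reported morning RMSE to the reported mean PM2.5.

```python
import math


def pearson_r(observed, predicted):
    """Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)


def rmse(observed, predicted):
    """Root mean squared error, in the same units as the target (here μg/m3)."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up observed and predicted PM2.5 values, for illustration only.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
print(round(pearson_r(observed, predicted), 3), round(rmse(observed, predicted), 3))

# Reported morning figures: RMSE = 10.13 against a mean PM2.5 of 17.4 μg/m3.
relative_rmse = 10.13 / 17.4
print(round(relative_rmse, 2))  # the error is a bit more than half of the mean level
```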

#### *3.3.2. Midday model*

The midday model is defined between 10 am (600th minute) and 2 pm (840th minute). **Figure 3** shows that there is a constant decrease in the PM2.5 concentration during this period. The two main factors that should explain this drop are the reduction in traffic and the elevation of the PBL, which increases the dilution of air contaminants. In such a situation, the correlation between traffic and PM2.5 should decrease. Here, the regression algorithm is applied to a dataset composed of 116 instances and 4 features (minutes, %red, %orange, and PM2.5).

The linear regression model obtained after running the algorithm is as follows:

 **PM2.5 = −0.0354 \* minutes + −68.1378 \* %red + 55.4262 \* %orange + 35.2107**

The prediction accuracy of the model is evaluated as

 **r = 0.29 RMSE = 10.36**
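To make the midday equation concrete, the following sketch evaluates it for one hypothetical observation. It assumes that %red and %orange are expressed as fractions between 0 and 1 (the Figure 2 caption describes them as ratios); the input values themselves are invented for the example.

```python
def predict_midday_pm25(minutes, red, orange):
    """Midday linear model from the text.

    minutes: minute of the day (the model is fitted for 600 to 840)
    red, orange: traffic ratios, assumed here to be fractions in [0, 1]
    """
    return -0.0354 * minutes + -68.1378 * red + 55.4262 * orange + 35.2107


# Hypothetical input: noon (720th minute), 10% red and 30% orange traffic.
prediction = predict_midday_pm25(minutes=720, red=0.10, orange=0.30)
print(round(prediction, 2))

# Contribution of the time term alone across the 4-hour window (240 minutes).
time_effect = -0.0354 * (840 - 600)
print(round(time_effect, 3))
```

With traffic held fixed, the time term alone lowers the prediction by about 8.5 μg/m3 between 10 am and 2 pm, which matches the decreasing trend described for this window.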

The coefficients of the resulting model are lower than in the morning model for all the features. This suggests that traffic data carry less weight in predicting PM2.5 at midday than in the morning, as hypothesized. This is confirmed by the performance evaluation of the model, whose accuracy is similar to that obtained from the single models (r ≈ 0.3).
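Because the RMSE depends on how much PM2.5 varies within each window, the three models are easier to compare once the error is put on a common scale. The sketch below divides each reported RMSE by the standard deviation (SD) of PM2.5 reported for the same window. The RMSE/SD ratio is an illustrative normalization chosen for this example, not a metric used in the chapter; all numbers are taken from the text.

```python
# (r, RMSE, SD of PM2.5) as reported in the text for each time window.
models = {
    "morning": (0.49, 10.13, 11.6),
    "midday": (0.29, 10.36, 10.8),
    "afternoon": (0.28, 7.65, 8.0),
}


def rmse_over_sd(rmse, sd):
    """Error relative to the spread of the target; lower means more skill."""
    return rmse / sd


ratios = {name: rmse_over_sd(rmse, sd) for name, (_, rmse, sd) in models.items()}

for name, (r, rmse, sd) in models.items():
    print(f"{name:9s}  r={r:.2f}  RMSE={rmse:5.2f}  SD={sd:4.1f}  RMSE/SD={ratios[name]:.2f}")
```

On this scale the morning model has the lowest ratio, while the midday and afternoon models are nearly identical. This is consistent with the reading that the low afternoon RMSE reflects the limited variation of PM2.5 rather than a better model.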
