*3.2.2. Time and traffic*

The result of the previous model suggests that additional information, such as traffic data, is needed to improve the prediction accuracy of the regression model. To this end, the present analysis takes into account the traffic information provided by Google Maps and processed as described in Section 3.1.1. The dataset used is thus composed of five parameters: Xminutes, Yminutes, %red, %orange, and PM2.5.
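A regression over the four predictors named above can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the text does not say which algorithm or tool produced the model, so ordinary least squares with NumPy is an assumption, the helper `fit_linear_regression` is hypothetical, and the data are synthetic and noise-free.

```python
import numpy as np

# Predictor names taken from the text; PM2.5 is the target.
FEATURES = ["Xminutes", "Yminutes", "%red", "%orange"]


def fit_linear_regression(X, y):
    """Ordinary least squares fit (assumed method); returns (weights, intercept)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Append a column of ones so the intercept is estimated with the weights.
    design = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[:-1], coef[-1]


# Synthetic data for illustration only, NOT the measurements from the study.
rng = np.random.default_rng(0)
X_demo = rng.random((200, len(FEATURES)))
true_weights = np.array([1.0, -2.0, 0.5, 3.0])  # arbitrary illustrative values
y_demo = X_demo @ true_weights + 4.0

weights, intercept = fit_linear_regression(X_demo, y_demo)
for name, w in zip(FEATURES, weights):
    print(f"{name}: {w:.4f}")
print(f"intercept: {intercept:.4f}")
```

Comparing the fitted weights across features, as the text does, is only meaningful when the features are on comparable scales; whether the original data were normalized is not stated here.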

The linear regression model obtained after running the algorithm is as follows:

The prediction accuracy of the model is evaluated as

**r = 0.32, RMSE = 8.48**

The model shows that the parameter with the highest weight is %orange. This means that the quantification of the medium amount of traffic is an important feature for estimating the level of PM2.5. Note that this model, which includes data on human activity (i.e., transportation), provides a higher prediction accuracy than a model based on temporal information only.

Again, the model shows that the weight of the %orange parameter is the largest: the higher the medium amount of traffic, the higher the level of PM2.5. In terms of performance, this model based on two predictive features has an accuracy similar to that of the previous model with four features (r ≈ 0.3 in both cases).

*3.2.4. Simple regression model*

Since the %orange parameter is the attribute with the highest weight, it would be possible to build a predictive model of PM2.5 based on a simple regression. The advantage of such a model is its simplicity and the fact that it can be interpreted visually from a two-dimensional graph (see **Figure 2**). Thus, the dataset used for this analysis has only two features: %orange and concentrations of PM2.5.

The linear regression model obtained after running the algorithm is as follows:

**PM2.5 = 20.1012 \* %orange + 9.6609**

The prediction accuracy of the model is evaluated as

**r = 0.31, RMSE = 8.53**

The simple regression model and **Figure 2** show that the level of PM2.5 tends to grow as the %orange parameter increases. This very elementary model (a single predictive feature) achieves a prediction performance quite comparable with that of the two preceding models (r ≈ 0.3), which are more complex (four and two predictive features, respectively).

*3.2.5. Interpretation of the results*

The accuracy of the traffic-based models, measured by the correlation coefficient and the RMSE, is slightly above 0.3 and around 8.5, respectively. The models that consider traffic monitoring provide a higher accuracy than a model based on time only. This result indicates that traffic is a more reliable predictor of air quality than time. This difference could be reduced if weekends (when air pollution levels are usually low) were excluded, since traffic follows a quite stereotypical pattern on workdays. Also, the accuracy of a model based on traffic monitoring is not significantly improved by adding the time of day, because this information is mostly redundant with the traffic data.

Overall, it seems that Google Maps Traffic can provide fair information for predicting the level of PM2.5. From this data source, the number of orange pixels (medium amount of traffic) would be the most relevant feature. This could be explained by the fact that medium traffic has the largest amplitude of variation throughout the day and is thus the category that best represents the traffic density in the city. Nevertheless, the accuracy of the model could be improved by modeling air pollution with several daily models, defined by the variation of air pollution levels throughout the day (two peaks a day), instead of a single one.
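To make the reported figures concrete, the sketch below applies the simple regression coefficients given in Section 3.2.4 (20.1012 and 9.6609) and shows how the two evaluation metrics, the correlation coefficient r and the RMSE, can be computed. The function names are hypothetical, the input values are invented for illustration, and the scale of %orange (a fraction between 0 and 1 is assumed here) is not confirmed by the text.

```python
import numpy as np

# Coefficients as reported for the simple regression model.
SLOPE = 20.1012
INTERCEPT = 9.6609


def predict_pm25(orange):
    """Apply PM2.5 = 20.1012 * %orange + 9.6609 to one value or an array."""
    return SLOPE * np.asarray(orange, dtype=float) + INTERCEPT


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between observed and predicted values."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


# Hypothetical inputs: %orange assumed to be a fraction in [0, 1].
orange_demo = np.array([0.0, 0.1, 0.25, 0.5])
pred_demo = predict_pm25(orange_demo)

# Hypothetical observations, NOT data from the study.
observed_demo = np.array([12.0, 10.5, 16.0, 18.0])
print("r =", round(pearson_r(observed_demo, pred_demo), 2))
print("RMSE =", round(rmse(observed_demo, pred_demo), 2))
```

With a single predictor, r between observations and predictions equals the correlation between observations and %orange itself (for a positive slope), which is why such a model can be read directly from a two-dimensional scatter plot.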
