*3.2.2. Time and traffic*

The result of the previous model suggests that additional information, such as traffic data, is needed to improve the prediction accuracy of the regression model. The present analysis therefore takes into account the traffic information provided by Google Maps and processed as described in Section 3.1.1. The dataset thus comprises five parameters: Xminutes, Yminutes, %red, %orange, and PM2.5.

The linear regression model obtained after running the algorithm is as follows:

 **PM2.5 = 1.2093 \* Xminutes + 2.0369 \* Yminutes + −23.3875 \* %red + 40.6166 \* %orange + 7.0578**

The prediction accuracy of the model is evaluated as

 **r = 0.32 RMSE = 8.48**

The model shows that the parameter with the highest weight is %orange. This means that the quantification of the medium amount of traffic is an important feature for estimating the level of PM2.5. Note that this model, which includes data on human activity (i.e., transportation), provides a higher prediction accuracy than a model based on temporal information only.

*3.2.3. Traffic only*

One of the main objectives of a machine-learning approach is to produce the most accurate prediction with a model that is as simple as possible. Since the temporal features seem to have a lower weight than the traffic features, we propose to build a model based on traffic only and to assess its reliability. Here, the number of attributes is three: %orange, %red, and PM2.5.

The linear regression model obtained after running the algorithm is as follows:

 **PM2.5 = −18.8914 \* %red + 28.618 \* %orange + 9.2185**

The prediction accuracy of the model is evaluated as

 **r = 0.31 RMSE = 8.51**

*3.2.4. Simple regression model*

Since the %orange parameter is the attribute with the highest weight, it would be possible to build a predictive model of PM2.5 based on a simple regression. The advantage of such a model is its simplicity and the fact that it can be interpreted visually from a two-dimensional graph (see **Figure 2**). The dataset used for this analysis thus has only two features: %orange and the concentration of PM2.5.

The linear regression model obtained after running the algorithm is as follows:

 **PM2.5 = 20.1012 \* %orange + 9.6609**

The prediction accuracy of the model is evaluated as

 **r = 0.31 RMSE = 8.53**
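As a minimal sketch of how this fitted line and the two reported metrics could be computed, the Python snippet below applies the equation above and evaluates a correlation coefficient and an RMSE. It assumes that r denotes the Pearson correlation between observed and predicted values and that %orange is expressed as a fraction between 0 and 1. The function names and the sample values are hypothetical and serve only as an illustration; they are not the measurements used in this chapter.

```python
import math

# Coefficients of the simple regression model reported above.
SLOPE = 20.1012      # weight of %orange
INTERCEPT = 9.6609   # constant term


def predict_pm25(orange):
    """Predicted PM2.5 for a %orange value (assumed to be a fraction in [0, 1])."""
    return SLOPE * orange + INTERCEPT


def pearson_r(observed, predicted):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)


def rmse(observed, predicted):
    """Root mean squared error between observations and predictions."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


# Hypothetical sample, for illustration only (not the chapter's data).
orange_values = [0.0, 0.1, 0.2, 0.4]
observed_pm25 = [8.0, 14.0, 11.0, 20.0]

predicted_pm25 = [predict_pm25(x) for x in orange_values]
print("r    =", round(pearson_r(observed_pm25, predicted_pm25), 2))
print("RMSE =", round(rmse(observed_pm25, predicted_pm25), 2))
```

With a single predictor and a positive slope, the correlation between observations and predictions equals the correlation between observations and %orange itself, which is why a scatter plot such as **Figure 2** conveys the same information.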

The simple regression model and **Figure 2** show that the level of PM2.5 tends to increase with the %orange parameter. This very elementary model (a single predictive feature) achieves a prediction performance comparable to that of the two preceding, more complex models (r ≈ 0.3, with four and two predictive features, respectively).
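To make the comparison concrete, the sketch below encodes the three fitted equations reported in Sections 3.2.2 to 3.2.4 as Python functions and evaluates them on the same input. The function names and the input values are hypothetical. Xminutes and Yminutes are taken as already-computed temporal features, and %red and %orange are assumed to be fractions between 0 and 1.

```python
def time_and_traffic(xminutes, yminutes, red, orange):
    """Four-feature model (Section 3.2.2): temporal and traffic features."""
    return (1.2093 * xminutes + 2.0369 * yminutes
            - 23.3875 * red + 40.6166 * orange + 7.0578)


def traffic_only(red, orange):
    """Two-feature model (Section 3.2.3): traffic features only."""
    return -18.8914 * red + 28.618 * orange + 9.2185


def orange_only(orange):
    """Single-feature model (Section 3.2.4): %orange only."""
    return 20.1012 * orange + 9.6609


# Hypothetical input values, chosen only to illustrate the calls.
sample = {"xminutes": 0.5, "yminutes": -0.5, "red": 0.05, "orange": 0.20}

predictions = {
    "time and traffic": time_and_traffic(**sample),
    "traffic only": traffic_only(sample["red"], sample["orange"]),
    "%orange only": orange_only(sample["orange"]),
}

for name, value in predictions.items():
    print(f"{name:>16}: PM2.5 = {value:.2f}")
```

Evaluating the models on one input does not measure accuracy; it only illustrates how the simpler models drop terms while keeping %orange, the feature with the highest weight in each equation.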
