data as features should be improved in comparison with the single models. The dataset used has the following characteristics: 110 instances and 4 features (minutes, %red, %orange, and PM2.5).

The linear regression model obtained after running the algorithm is as follows:

**PM2.5 = 0.0444 \* minutes + −123.0175 \* %red + 89.1856 \* %orange + −15.4187**

The prediction accuracy of the model is evaluated as

**r = 0.49 RMSE = 10.13**

As observed in the single model approach, the weights of the traffic attributes are significantly larger than the coefficient of time. The most representative feature, %orange, shows that the larger the amount of medium traffic, the higher the value of PM2.5. In terms of performance, the prediction accuracy is around 0.5 for the correlation coefficient and around 10, out of an average value of PM2.5 = 17.4 μg/m³, for the RMSE. As hypothesized, this limited analysis on a morning window provides a more accurate regression model than the models based on the full day.

28 Machine Learning - Advanced Techniques and Emerging Applications

*3.3.2. Midday model*

The midday model is defined between 10 am (600th minute) and 2 pm (840th minute). **Figure 3** shows that there is a constant decrease in the PM2.5 concentration during this period. The two main factors that should explain this drop are the reduction in traffic and the elevation of the PBL, which increases the dilution of air contaminants. In such a situation, the correlation between traffic and PM2.5 should decrease. Here, the regression algorithm is applied to a dataset composed of 116 instances and 4 features (minutes, %red, %orange, and PM2.5).

The linear regression model obtained after running the algorithm is as follows:

**PM2.5 = −0.0354 \* minutes + −68.1378 \* %red + 55.4262 \* %orange + 35.2107**

The prediction accuracy of the model is evaluated as

**r = 0.29 RMSE = 10.36**

*3.3.3. Afternoon model*

The afternoon model is defined between 2 pm (840th minute) and 7 pm (1140th minute). **Figure 3** shows that there is a constant increase in the PM2.5 concentration, although the evening peak is lower than the morning peak because the PBL has reached its maximum height and does not change at this time of day, until a nocturnal boundary layer starts forming in the absence of surface heating. Despite the elevated PBL, the air pollution increases because of the traffic growth at the end of the day. Again, the strong dilution of pollutants in the atmosphere should reduce the correlation between traffic and PM2.5 concentrations. The dataset used to build the model consists of 145 instances and 4 features (minutes, %red, %orange, and PM2.5).

The linear regression model obtained after running the algorithm is as follows:

**PM2.5 = 0.0242 \* minutes + 20.7938 \* %orange + −14.6845**

The prediction accuracy of the model is evaluated as

**r = 0.28 RMSE = 7.65**
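As an illustration, the afternoon equation above can be evaluated directly. The following is a minimal sketch, not the authors' code: the function name is hypothetical, and it assumes that %orange is expressed as a fraction between 0 and 1, which the text does not state explicitly. Only the coefficients and the 840 to 1140 minute window are taken from the chapter.

```python
# Coefficients copied from the afternoon linear regression model in the text.
MINUTES_COEF = 0.0242
ORANGE_COEF = 20.7938
INTERCEPT = -14.6845

# Afternoon window as defined in the text: 2 pm (840) to 7 pm (1140).
WINDOW_START, WINDOW_END = 840, 1140


def predict_pm25_afternoon(minutes: float, orange: float) -> float:
    """Evaluate the afternoon model for one observation.

    `minutes` is the minute of the day; `orange` is the share of medium
    traffic, ASSUMED here to be a fraction in [0, 1] (not confirmed by
    the source). %red does not appear because the model dropped it.
    """
    if not WINDOW_START <= minutes <= WINDOW_END:
        raise ValueError("minutes is outside the afternoon window (840-1140)")
    return MINUTES_COEF * minutes + ORANGE_COEF * orange + INTERCEPT


if __name__ == "__main__":
    # Hypothetical input: 4:40 pm (minute 1000) with 30% medium traffic.
    print(round(predict_pm25_afternoon(1000, 0.3), 2))
```

Because the model is fitted only on afternoon data, the sketch rejects inputs outside that window rather than extrapolating.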

The feature with the largest weight in the afternoon model is still %orange, although its coefficient continues to decrease. The time coefficient is extremely low, and %red is filtered out by the M5 attribute selection method. As expected, the model accuracy assessed by the correlation coefficient is relatively low (r ≈ 0.3). This means that traffic is not a good predictor of the PM2.5 level in the afternoon, which the strong dilution of air contaminants in the atmosphere would explain. Surprisingly, the RMSE (<8) is lower than in the two previous models (>10). This reduced prediction error can be explained by the lower standard deviation (SD) of the PM2.5 values in the afternoon (SD = 8) than in the morning (SD = 11.6) and at midday (SD = 10.8). In other words, the lower error is not due to the reliability of the model per se (essentially based on the traffic), but to the limited variation in the PM2.5 concentrations in the afternoon.
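The link between RMSE and SD can be checked numerically. For a least-squares linear model evaluated on its own training data, RMSE = SD × √(1 − r²). The sketch below applies this identity to the r, RMSE, and SD values reported for the three time windows. The identity is only approximate here, since the evaluation protocol (for example, cross-validation) is an assumption not stated in this section, and the helper name is hypothetical.

```python
import math

# r, RMSE and SD values as reported in the text for each time window.
WINDOWS = {
    "morning": {"r": 0.49, "sd": 11.6, "rmse": 10.13},
    "midday": {"r": 0.29, "sd": 10.8, "rmse": 10.36},
    "afternoon": {"r": 0.28, "sd": 8.0, "rmse": 7.65},
}


def approx_rmse(sd: float, r: float) -> float:
    """RMSE implied by the target SD and the correlation coefficient.

    Exact only for an in-sample least-squares fit with an intercept;
    used here as an approximation.
    """
    return sd * math.sqrt(1.0 - r * r)


if __name__ == "__main__":
    for name, w in WINDOWS.items():
        implied = approx_rmse(w["sd"], w["r"])
        print(f"{name}: reported RMSE = {w['rmse']}, implied = {implied:.2f}")
```

The implied values (roughly 10.1, 10.3, and 7.7) are close to the reported ones. With r ≈ 0.3, the factor √(1 − r²) is about 0.96, so the afternoon RMSE is almost entirely determined by the SD of PM2.5 rather than by the traffic features.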
