Machine Learning - Advanced Techniques and Emerging Applications

$$\dots + 0.3632 \cdot \mathrm{NO}_2 + 0.796 \cdot \mathrm{SO}_2 + 0.2348 \cdot \mathrm{O}_3 + 835.1936$$

The prediction accuracy of the model is evaluated as

**r = 0.87**

**RMSE = 5.33**

*5.2.2.3. Afternoon model*

The linear regression model obtained after running the algorithm is as follows:

$$\mathrm{PM}_{2.5} = 21.026 \cdot \%\mathrm{red} - 14.9417 \cdot \%\mathrm{orange} + 0.3291 \cdot \mathrm{RH} + 0.8285 \cdot \mathrm{temperature} + 1.2914 \cdot \mathrm{WS} - 1.1325 \cdot \mathrm{pressure} - 0.0109 \cdot \mathrm{SR} + 0.3909 \cdot \mathrm{NO}_2 + 0.6993 \cdot \mathrm{SO}_2 + 0.2503 \cdot \mathrm{O}_3 + 790.3383$$

The prediction accuracy of the model is evaluated as

**r = 0.66**

**RMSE = 6.29**

*5.2.2.4. Interpretation of the results*

The result of Eq. (12) shows that the average prediction accuracy (evaluated by the r metric) obtained by modeling the air pollution through three models is

$$\bar{r} = \frac{0.85 + 0.87 + 0.66}{3} = 0.79 \qquad (12)$$

## **6.1. The simplest best model**

Since the full feature model (Section 5.2) is quite complex, the present stage consists of removing insignificant and/or redundant features in order to simplify the modeling. The goal is to find a simple model that still provides a reliable estimation of PM2.5 concentrations. The simplest best model is defined as a model that maintains a high accuracy (r ≥ 0.8) with at most eight features. This model is obtained with the ranker search method, a technique that sorts the attributes according to their individual evaluation and allows the number of attributes to retain to be specified.
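The ranker-style selection described above can be sketched as follows. This is only an illustration under stated assumptions, not the procedure actually used in the study: each attribute is scored by the absolute value of its Pearson correlation with the target (the text does not say which attribute evaluator was used), and the function name `rank_features`, the feature names, and the synthetic data are all hypothetical.

```python
import numpy as np


def rank_features(X, y, names, k):
    """Rank features by |Pearson r| with the target and keep the top k.

    Assumption: absolute correlation stands in for the attribute
    evaluator, which the text does not specify.
    """
    scores = []
    for j, name in enumerate(names):
        r = np.corrcoef(X[:, j], y)[0, 1]
        scores.append((name, abs(r)))
    # Sort in descending order of individual score, as a ranker does.
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:k]


# Synthetic data for illustration only (not the study's measurements).
rng = np.random.default_rng(0)
n = 500
names = ["SO2", "CO", "RH", "noise"]
X = rng.normal(size=(n, len(names)))
# Hypothetical target: strong SO2 effect, weaker CO and RH effects.
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 1.0 * X[:, 2] + rng.normal(scale=0.5, size=n)

selected = rank_features(X, y, names, k=3)
for name, score in selected:
    print(f"{name}: |r| = {score:.2f}")
```

In this sketch, `k` plays the role of the number of attributes to retain (eight in the text). A linear regression would then be fitted on the retained columns only.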

The linear regression model obtained after running the algorithm is as follows:

$$\mathrm{PM}_{2.5} = \dots - 23.9476$$

The prediction accuracy of the model is evaluated as

**r = 0.8**

**RMSE = 5.34**
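For reference, the two metrics reported above can be computed as in the sketch below. It assumes that r is the Pearson correlation between observed and predicted PM2.5 values and that RMSE is the root mean squared error. The helper names and the sample values are hypothetical and are not the study's data.

```python
import math


def pearson_r(observed, predicted):
    """Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


# Made-up PM2.5 values for illustration only.
observed = [12.0, 18.5, 25.0, 30.5, 22.0]
predicted = [14.0, 17.0, 27.5, 28.0, 20.5]

print(f"r = {pearson_r(observed, predicted):.2f}")
print(f"RMSE = {rmse(observed, predicted):.2f}")
```

A high r with a large RMSE would indicate predictions that follow the trend but are offset or mis-scaled, which is why the text reports both values.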

**Table 1** presents the ranked attributes, sorted in descending order of their individual performance in predicting the output value.


**Table 1.** Ranked attributes.

The simplest best model is composed of all the chemical parameters and a selection of meteorological factors (RH, SR, Xwind, and T). As suggested by the previous analyses, the individual performance in estimating PM2.5 is generally higher for the chemical features (first, second, fourth, and fifth positions) than for the meteorological features (third, sixth, seventh, and eighth positions). In other words, PM2.5 is correlated first with the emission of chemical substances (especially SO<sub>2</sub> and CO) and second with the weather conditions (especially relative humidity and solar radiation). Note the negative correlation between SR and the concentration of PM2.5. This result can be explained by the fact that the larger the SR, the deeper the PBL and, consequently, the greater the dilution of fine particulate matter in the boundary layer. The other factors are positively correlated with PM2.5. Besides its simplicity (eight features only), the model predicts the level of fine particulate matter with the same accuracy as a model using all the features (r = 0.8 and RMSE = 5.3 in both cases).

## **6.2. Recommendations based on model performances**

The final objective of this study is to find the best predictive model that relies on the least costly recording of relevant features. As previously mentioned, the accurate measurement of trace gases requires expensive equipment. Thus, the best affordable model can be defined as the model that achieves the best performance with no more than two trace gases. The performance of models built with all the affordable attributes and only one or two trace gases is presented in **Table 2**. Model accuracy is assessed by the value of r. The main diagonal gives the performance with a single trace gas, whereas the other cells correspond to pairs of gases.

| | SO<sub>2</sub> | CO | NO<sub>2</sub> | O<sub>3</sub> |
|---|---|---|---|---|
| SO<sub>2</sub> | 0.7 | | | |
| CO | 0.77 | 0.73 | | |
| NO<sub>2</sub> | 0.78 | 0.76 | 0.73 | |
| O<sub>3</sub> | 0.7 | 0.75 | 0.73 | 0.58 |

**Table 2.** Model performance (r value) with all the affordable attributes (e.g., time, traffic, and meteorology) and only one (main diagonal) or two (other cells) trace gases.

The results show that it is still possible to build a model with high prediction accuracy using only two trace gases. The best performance is obtained with SO<sub>2</sub> and NO<sub>2</sub> (r = 0.78), which can be explained by the fact that these two trace gases are strongly correlated with PM2.5 (see **Table 1**). If only one trace gas sensor is affordable, it should be a device that measures CO or NO<sub>2</sub> (r = 0.73). Note that O<sub>3</sub> can be automatically discarded, since its predictive power is the lowest (Section 4 shows that models without O<sub>3</sub> get a better r). This finding could be expected, as there is no direct relationship between the level of O<sub>3</sub> (a secondary pollutant) and the concentrations of PM2.5.

Regression Models to Predict Air Pollution from Affordable Data Collections

http://dx.doi.org/10.5772/intechopen.71848

**7. Conclusions and perspectives**

This study demonstrates that the PM2.5 prediction performance depends on the available input information. The first finding is that a reasonable prediction of PM2.5 concentrations can be obtained using only publicly accessible traffic data. Ambient PM2.5 pollution prediction based on traffic can be significantly improved by using three models a day instead of a single one, especially for the morning hours. During the morning rush hour, the planetary boundary layer (PBL) is shallow, so traffic emissions build up continuously and PM2.5 concentrations grow cumulatively. The concentrations start to decrease through dilution as the PBL deepens, owing to surface heating, rising temperatures, and the ventilating effect of wind. Thus, using data from an affordable meteorological station further improves the prediction accuracy. In this case, a regression model tree gives a better prediction than a linear regression model. As expected, the best model is obtained by including hybrid data sources as features (time, traffic, meteorology, and the concentrations of atmospheric criteria pollutants). The complexity of the resulting model can be reduced from seventeen features to the eight most relevant ones without reducing the performance (r ≈ 0.8 and RMSE ≈ 5.3).

These eight selected attributes are composed of criteria pollutants (CO, NO<sub>2</sub>, O<sub>3</sub>, SO<sub>2</sub>) and meteorological factors (humidity, solar radiation, temperature, wind speed, and direction). Thus, our results suggest selecting chemical sensors according to the best prediction-to-cost ratio. For example, if only one trace gas sensor is affordable, the best performance can be reached with CO or NO<sub>2</sub> concentrations, while two trace gases (SO<sub>2</sub> and NO<sub>2</sub>) are sufficient to get very close to the best possible accuracy. In contrast, O<sub>3</sub> is a secondary pollutant that can be excluded from the models with no significant consequences for the prediction of PM2.5, suggesting a low impact of the photochemical component on PM2.5 formation.

The proposed approach is easily generalizable to other cities worldwide. The storage and regression analysis of 2 months of data was sufficient to build models that are able to predict fine particulate matter with high accuracy. The main limitation of the present method is to
