Machine Learning - Advanced Techniques and Emerging Applications

The simplest of the best-performing models combines all the chemical parameters with a selection of meteorological factors (RH, SR, Xwind, and T). As suggested by the previous analyses, the individual ability to estimate PM2.5 accurately is generally higher for the chemical features (first, second, fourth, and fifth positions) than for the meteorological features (third, sixth, seventh, and eighth positions). In other words, PM2.5 is correlated first with the emission of chemical substances (especially SO2 and CO) and second with the weather conditions (especially relative humidity and solar radiation). Note the negative correlation between SR and the concentration of PM2.5. This result can be explained by the fact that the larger the SR, the deeper the PBL and, consequently, the greater the dilution of fine particulate matter in the boundary layer. The other factors are positively correlated with PM2.5. Besides its simplicity (eight features only), the model predicts the level of fine particulate matter with the same accuracy as a model using all the features (r = 0.8 and RMSE = 5.3 in both cases).

| Ranking | Performance | Feature |
|---|---|---|
| 1 | 0.0311 | SO2 |
| 2 | 0.0256 | CO |
| 3 | 0.0193 | Relative humidity |
| 4 | 0.0172 | NO2 |
| 5 | 0.0133 | O3 |
| 6 | 0.0125 | Solar radiation |
| 7 | 0.0109 | Xwind |
| 8 | 0.0065 | Temperature |

**Table 1.** Ranked attributes.

**6.2. Recommendations based on model performances**

The final objective of this study is to find the best predictive model that relies on the least costly recording of relevant features. As previously mentioned, the accurate measurement of trace gases requires expensive equipment. Thus, the best affordable model can be defined as the model that achieves the best performance with no more than two trace gases. The performances of the models built with all the affordable attributes and only one or two trace gases are presented in **Table 2**. The model accuracy is assessed according to the value of r. The main diagonal gives the performance obtained with a single trace gas, whereas the other cells take into account two gases.

The results show that it is still possible to build a model with high prediction accuracy with only two trace gases. The best performance is obtained by considering SO2 and NO2 (r = 0.78). This can be explained by the fact that these two trace gases are strongly correlated with the values of PM2.5 (see **Table 1**). If only one trace gas sensor is affordable, it has to be a device that measures the levels of CO or NO2 (r = 0.73). It is to note that O3 is a gas that can

**7. Conclusions and perspectives**

This study demonstrates that the PM2.5 prediction performance depends on the available input information. The first finding is that a reasonable prediction of PM2.5 concentrations can be obtained using only publicly accessible traffic data. Traffic-based prediction of ambient PM2.5 pollution can be significantly improved by using three models a day instead of a single one, especially for the morning hours. During the morning rush hour, the planetary boundary layer (PBL) is shallow, so traffic emissions build up continuously and PM2.5 concentrations grow cumulatively. These concentrations start decreasing with the dilution effect of the PBL deepening, which is due to surface heating, rising temperatures, and the ventilating effect of wind. Thus, adding data from an affordable meteorological station further improves the prediction accuracy. In this case, a regression model tree gives a better prediction than a linear regression model. As expected, the best model is obtained by including hybrid data sources as features (time, traffic, meteorological factors, and the concentrations of atmospheric criteria pollutants). The complexity of the resulting model can be reduced from seventeen features to the eight most relevant ones without reducing the performance (r ≈ 0.8 and RMSE ≈ 5.3). These eight selected attributes are composed of criteria pollutants (CO, NO2, O3, SO2) and meteorological factors (humidity, solar radiation, temperature, and wind speed and direction). Thus, our results suggest selecting chemical sensors based on the best ratio of prediction accuracy to cost. For example, if only one trace gas sensor is affordable, the best performance can be reached with CO or NO2 concentrations, while the use of two trace gases (SO2 and NO2) is sufficient to get very close to the best possible accuracy. In contrast, O3 is a secondary pollutant that can be excluded from the models with no significant consequences for the prediction of PM2.5, suggesting a low impact of the photochemical component on PM2.5 formation.
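The two scores used throughout this comparison, r and RMSE, and the choice of trace gas sensors under a budget can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `pearson_r`, `rmse`, and `best_gas_subset` are hypothetical, and the score table contains only the r values quoted in the text (CO alone, NO2 alone, and the SO2 and NO2 pair). Combinations whose scores are not quoted are left out rather than guessed.

```python
import math


def pearson_r(observed, predicted):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)


def rmse(observed, predicted):
    """Root-mean-square error, in the units of the observations."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def best_gas_subset(scores, max_gases):
    """Return (gases, r) for the highest-r subset using at most max_gases sensors.

    scores maps a frozenset of gas names to the r of the model that uses those
    gases together with the affordable features. Ties keep the first entry.
    """
    candidates = [(g, r) for g, r in scores.items() if len(g) <= max_gases]
    return max(candidates, key=lambda item: item[1])


# r values quoted in the text (assumption: no other combinations are filled in).
reported_r = {
    frozenset({"CO"}): 0.73,
    frozenset({"NO2"}): 0.73,
    frozenset({"SO2", "NO2"}): 0.78,
}

if __name__ == "__main__":
    gases, r = best_gas_subset(reported_r, max_gases=2)
    print(sorted(gases), r)
```

With a budget of one sensor the function returns one of the two tied single-gas options, which mirrors the recommendation that either CO or NO2 is an acceptable choice in that case.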

The proposed approach is easily generalizable to other cities worldwide. Storing two months of data and analyzing them by regression was sufficient to build models that predict fine particulate matter with high accuracy. The main limitation of the present method is the prediction of PM2.5 when the PBL is deep. Nevertheless, this is often less of an issue in terms of air quality, since an elevated PBL enhances dilution and, consequently, reduces the concentration of atmospheric contaminants. Further work will focus on improving the model performance at evening rush hours. More refined models are expected to be obtained by including additional observations and features in the dataset. For example, additional studies are anticipated to investigate the impact of PBL depth on the dilution of PM2.5 pollution.

Furthermore, it would be worthwhile to investigate the performance of the current model with data acquired by lower-tier equipment. In this study, air pollution and meteorology were measured with USEPA-approved equipment, which is not affordable for a large fraction of cities in developing countries, thus limiting air pollution studies and awareness to the main cities. It has been shown, however, that small cities are often more polluted than big agglomerations, which calls for a wide set of options to promote awareness of air quality [21].

Regression Models to Predict Air Pollution from Affordable Data Collections
http://dx.doi.org/10.5772/intechopen.71848

[5] Karagulian F, Belis CA, Dora CFC, Prüss-Ustün AM, Bonjour S, Adair-Rohani H, Amann M. Contributions to cities' ambient particulate matter (PM): A systematic review of local source contributions at global level. Atmospheric Environment. 2015;**120**:475-483

[6] Castell N, Dauge FR, Schneider P, Vogt M, Lerner U, Fishbain B, Broday D, Bartonova A. Can commercial low-cost sensor platforms contribute to air quality monitoring and exposure estimates? Environment International. 2017;**99**:293-302

[7] Borrego C, Costa AM, Ginja J, Amorim M, Coutinho M, Karatzas K, Sioumis T, Katsifarakis N, Konstantinidis K, De Vito S, Esposito E, Smith P, André N, Gérard P, Francis LA, Castell N, Schneider P, Viana M, Minguillón MC, Reimringer W, Otjes RP, von Sicard O, Pohle R, Elen B, Suriano D, Pfister V, Prato M, Dipinto S, Penza M. Assessment of air quality microsensors versus reference methods: The EuNetAir joint exercise. Atmospheric Environment. 2016;**147**:246-263

[8] USEPA. Evaluation of emerging air pollution sensor performance. 2017. Available from: https://www.epa.gov/air-sensor-toolbox/evaluation-emerging-air-pollution-sensor-performance [Accessed: 28 August 2017]

[9] Kleine Deters J, Zalakeviciute R, Gonzalez M, Rybarczyk Y. Modeling PM2.5 urban pollution using machine learning and selected meteorological parameters. Journal of Electrical and Computer Engineering. 2017. 14 pages. Article ID 5106045. DOI: 10.1155/2017/5106045

[10] Brokamp C, Jandarov R, Rao MB, LeMasters G, Ryan P. Exposure assessment models for elemental components of particulate matter in an urban environment: A comparison of regression and random forest approaches. Atmospheric Environment. 2017;**151**:1-11

[11] Quinlan RJ. Learning with continuous classes. In: Proceedings of the 5th Australian Joint Conference on Artificial Intelligence. Singapore; 1992. pp. 343-348. Retrieved from: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.34.885&rep=rep1&type=pdf

[12] Rybarczyk Y. 3D markerless motion capture: A low-cost approach. In: Rocha A, Correia AM, Adeli H, Reis LP, Teixeira MM, editors. New Advances in Information Systems and Technologies; Recife, Brazil. Switzerland: Springer; 2016. pp. 731-738. DOI: 10.1007/978-3-319-31232-3

[13] Mierswa I, Wurst M, Klinkenberg R, Scholz M, Euler T. YALE: Rapid prototyping for complex data mining tasks. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Philadelphia, USA; 2006. pp. 935-940. Retrieved from: https://pdfs.semanticscholar.org/5722/e63d03edba571262ba258fe5aaffed4147c9.pdf

[14] Stull RB. An Introduction to Boundary Layer Meteorology. Boston, Massachusetts: Kluwer Academic Publishers; 1988

[15] Cazorla M. Air quality over a populated Andean region: Insights from measurements of ozone, NO, and boundary layer depths. Atmospheric Pollution Research. 2016;**7**:66-74
