Machine Learning - Advanced Techniques and Emerging Applications

correlated with the pollution produced by the vehicles. The higher the traffic activity, the higher the concentration of fine particulate matter (see the high weight of the %orange parameter).

For the two other models, the accuracy is about the same as for a global model (r ≈ 0.3). Their predictive performance seems reduced because the depth of the PBL increases with solar radiation (maximal around noon). The poor predictive power of these two models would be caused by the reduced influence of the traffic on the level of PM2.5, since the weight of the %orange parameter drops at midday and in the afternoon.

Nevertheless, the average performance of an approach based on three models per day is slightly better than that of the single model (see Eq. (8)). It suggests that the best prediction of PM2.5 from traffic monitoring is obtained by analyzing the typical daily fluctuation of PM2.5 concentration and applying a specific model according to the occurrence of the pollution peaks, especially in the morning.

$$\bar{r} = \frac{0.49 + 0.29 + 0.28}{3} = 0.35 \tag{8}$$

This performance could be further improved by analyzing a reduced image of the traffic map that closely matches the footprint of PM2.5 concentrations measured by the monitoring station. In this study, the picture used represents an area of 22.4 km<sup>2</sup>, whereas the footprint area for Belisario station (monitoring station height 10 m) would be only around 3 km<sup>2</sup> [16]. However, we chose a bigger traffic map area to have a more representative traffic situation of the city.

**4. Adding meteorological factors**

The ambient air pollution levels are mainly modulated by meteorological conditions [9, 17]. Consequently, considering these parameters in a model should improve the prediction of the concentration of fine particulate matter. Since the equipment required to record these data is significantly cheaper than air quality sensors, we present models that predict the level of PM2.5 using the following meteorological features: solar radiation (SR), temperature (T), pressure (P), precipitation (rain), relative humidity (RH), wind speed (WS), and wind direction (WD).

**4.1. Dataset**

*4.1.1. Data acquisition*

Seven meteorological parameters (wind speed and direction, temperature, relative humidity, atmospheric pressure, precipitation, and solar radiation) were measured using Vaisala WXT536 instrumentation, with the exception of solar radiation, which was measured with a Kipp & Zonen net radiometer. To get the hourly value of SR, T, P, rain, RH, and WS, we simply average the six records per hour in the dataset (one record every 10 minutes). However, the calculation of the WD is more complex. Computing the arithmetic mean of the direction per hour can give a completely wrong result. For instance, if the wind blows four times from around the east (90°) and the other two times from around the west (270°), the mean WD would be roughly south-southeast (150°), even though the wind never came from that direction. To tackle this issue, a dedicated procedure is used to calculate the most representative WD for each hour.

**Figure 4** shows an example of how the WD is obtained.

*4.1.2. Data transformation*

**Figure 4.** Representation of the calculation of the WD for a specific hour. The plot shows the WD angles, in degrees (x-axis), and their respective ratio of occurrences (y-axis). The black curve represents the normal distribution fitted to the data. Here, the value of the hourly WD is μ ≈ 191°.

Another data preparation step is required before running the machine-learning algorithms. The polar coordinates of the WD (0–360°) are transformed into Cartesian coordinates by considering both WD and WS in the same formula (see Eqs. (9) and (10)). This transformation gives a more accurate feature representation of the WD around the north axis. Otherwise, it would be impossible to find a correlation between WD and PM2.5, since similar directions pointing north can have completely different values in polar coordinates (slightly higher than 0° or slightly lower than 360°). This transformation is particularly relevant for machine-learning algorithms based on linear regression, because this modeling relies on a continuous relationship between parameters [9].

$$\text{Xwind} = \cos\left(\frac{\text{WD} \cdot \pi}{180^\circ}\right) \cdot \text{WS} \tag{9}$$

$$\text{Ywind} = \sin\left(\frac{\text{WD} \cdot \pi}{180^\circ}\right) \cdot \text{WS} \tag{10}$$
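The following minimal sketch illustrates why the wind direction needs special treatment. It first reproduces the east/west averaging example of Section 4.1.1, then applies Eqs. (9) and (10) to two nearly identical northerly winds. The function names and the wind speed of 2.0 are illustrative assumptions and do not come from the original study.

```python
import math


def naive_mean_direction(angles_deg):
    """Arithmetic mean of angles in degrees (misleading for circular data)."""
    return sum(angles_deg) / len(angles_deg)


def wind_to_cartesian(wd_deg, ws):
    """Apply Eqs. (9) and (10): return (Xwind, Ywind) for a direction in degrees."""
    theta = math.radians(wd_deg)  # WD * pi / 180
    return math.cos(theta) * ws, math.sin(theta) * ws


# Four records around the east and two around the west, as in Section 4.1.1.
records = [90, 90, 90, 90, 270, 270]
print(naive_mean_direction(records))  # 150.0, a direction the wind never had

# Two almost identical northerly winds (hypothetical wind speed of 2.0).
x1, y1 = wind_to_cartesian(1.0, 2.0)
x2, y2 = wind_to_cartesian(359.0, 2.0)
print(abs(1.0 - 359.0))                 # large gap in polar coordinates
print(math.hypot(x1 - x2, y1 - y2))     # small gap in Cartesian coordinates
```

In polar coordinates the two northerly winds differ by 358°, while their Cartesian representations are close to each other, which is the continuity that a linear regression needs.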


Regression Models to Predict Air Pollution from Affordable Data Collections

http://dx.doi.org/10.5772/intechopen.71848


Thus, the final dataset is composed of 13 features: Xminutes, Yminutes, %orange, %red, SR, T, P, rain, RH, WS, Xwind, Ywind, and PM2.5 (the feature to predict).

**4.2. Single models**

Two models are proposed. The first one is based on a multiple regression algorithm as described in Section 2.1. The second one implements a model tree, which allows more flexibility (but also more complexity) than a linear regression for modeling the data.

*4.2.1. Multiple regression model*

The linear regression model obtained after running the algorithm is as follows:

**PM2.5 = 2.199 \* Yminutes − 18.0966 \* %red + 39.7399 \* %orange + 0.2636 \* RH + 1.0088 \* pressure + 0.8186 \* temperature + 1.3403 \* Xwind − 753.8078**

The prediction accuracy of the model is evaluated as

**r = 0.58**

**RMSE = 7.32**

The result shows that the regression model considers all three classes of parameters (time, traffic, and weather) to predict the value of PM2.5. Nevertheless, in terms of meteorological factors, the solar radiation and the precipitation are filtered out by the M5 method (see Section 2.1 for more details). The rain attribute is certainly removed because it occurs only 71 times, which represents 6.4% of the total instances. The SR is also excluded from the model because it is mostly redundant with some of the other meteorological factors, and the filtering method is essentially based on eliminating redundant information. As hypothesized, including the weather conditions in the model significantly improves the prediction accuracy. The correlation coefficient is almost twice as high as that of a model that does not consider meteorological data.

*4.2.2. Regression model tree*

**Figure 5.** Graphical representation of the model tree and its respective decision rules to invoke the best regression models (LM 1–4) to predict the value of PM2.5.

A model tree is a more complex and flexible model of the data, since it is composed of several rules, each of which is associated with a regression model [18]. In other words, in such a tree representation, each leaf holds a different linear regression model that predicts the response of the instances reaching that leaf. In the present modeling, we use a pruned tree in which the minimum number of instances allowed at a leaf node is nine.

**Figure 5** represents the resulting model tree. It is composed of four rules as follows:

1: if Ywind ≤ −0.66 and RH ≤ 70.245, model = LM 1

2: else if Ywind ≤ −0.66 and RH > 70.245, model = LM 2

3: else if Ywind > −0.66 and Xminutes ≤ −0.538, model = LM 3

4: else if Ywind > −0.66 and Xminutes > −0.538, model = LM 4
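The four rules can be expressed as a small routing function. This is only a sketch of the decision logic listed above. The function name `select_model` and its argument names are hypothetical, and the thresholds are copied from the rules.

```python
def select_model(ywind, rh, xminutes):
    """Return the label of the leaf model chosen by the four rules of the tree."""
    if ywind <= -0.66:
        # First branch: the second split is on relative humidity.
        return "LM 1" if rh <= 70.245 else "LM 2"
    # Second branch: the second split is on the Xminutes time feature.
    return "LM 3" if xminutes <= -0.538 else "LM 4"


# Hypothetical feature values, for illustration only.
print(select_model(ywind=-1.2, rh=65.0, xminutes=0.3))   # LM 1
print(select_model(ywind=0.4, rh=65.0, xminutes=0.3))    # LM 4
```

Since the root split is on Ywind, the relative humidity is only consulted on the first branch and Xminutes only on the second one.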

The linear regression models associated with each rule are:

• LM 1

**PM2.5 = 3.3209 \* Xminutes + 0.1278 \* Yminutes − 1.0521 \* %red + 22.0077 \* %orange + 0.1359 \* RH + 0.0587 \* pressure + 0.0101 \* SR − 0.3479 \* temperature + 0.8434 \* Xwind − 41.7637**

• LM 2

**PM2.5 = 0.5269 \* Xminutes + 0.1278 \* Yminutes − 1.0521 \* %red + 6.595 \* %orange + 0.3362 \* RH + 0.0587 \* pressure − 0.0346 \* SR + 1.5505 \* temperature + 0.2383 \* Xwind − 72.0163**

• LM 3

**PM2.5 = −0.0904 \* Xminutes + 10.1893 \* Yminutes − 41.9183 \* %red + 51.2883 \* %orange + 0.3139 \* RH + 1.6439 \* pressure − 0.0056 \* SR + 1.7683 \* temperature + 2.2056 \* WS + 3.1792 \* Xwind − 0.0401 \* Ywind − 1233.2713**

• LM 4

**PM2.5 = −0.0474 \* Xminutes − 2.2031 \* Yminutes + 8.5034 \* %red + 14.6847 \* %orange + 0.2603 \* RH − 0.9338 \* pressure − 0.0001 \* SR + 0.048 \* temperature + 0.6414 \* WS + 0.3914 \* Xwind − 1.3052 \* Ywind + 669.5642**
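As a sketch of how a leaf model is applied once a rule has selected it, the code below evaluates LM 1 as a weighted sum of the features plus an intercept. The coefficients are transcribed from this section. The pairing of the intercept −41.7637 with LM 1 is an assumption inferred from the order of the extracted lines, and the names `LM1_COEFFICIENTS`, `LM1_INTERCEPT`, and `predict_pm25` as well as the sample feature values are hypothetical.

```python
# Coefficients of LM 1 as they appear in this section.
# Assumption: the intercept -41.7637 belongs to LM 1 (inferred from line order).
LM1_COEFFICIENTS = {
    "Xminutes": 3.3209,
    "Yminutes": 0.1278,
    "%red": -1.0521,
    "%orange": 22.0077,
    "RH": 0.1359,
    "pressure": 0.0587,
    "SR": 0.0101,
    "temperature": -0.3479,
    "Xwind": 0.8434,
}
LM1_INTERCEPT = -41.7637


def predict_pm25(coefficients, intercept, features):
    """Evaluate one linear leaf model: intercept + sum(weight * feature value)."""
    return intercept + sum(
        weight * features[name] for name, weight in coefficients.items()
    )


# Hypothetical feature values, for illustration only (not taken from the dataset).
sample = {
    "Xminutes": 0.5, "Yminutes": -0.5, "%red": 0.02, "%orange": 0.10,
    "RH": 60.0, "pressure": 730.0, "SR": 200.0, "temperature": 15.0,
    "Xwind": -0.5,
}
print(round(predict_pm25(LM1_COEFFICIENTS, LM1_INTERCEPT, sample), 2))
```

Keeping the coefficients in a dictionary keyed by feature name means the same function can evaluate any of the four leaf models, and a feature missing from the input raises a `KeyError` rather than being silently ignored.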


The prediction accuracy of the model is evaluated as

**r = 0.63**

**RMSE = 6.95**

The root node of the tree is Ywind. It means that wind direction and wind speed are the fundamental factors for selecting one regression model or another. The second level of discrimination is based on two other important parameters, relative humidity and Xminutes. The regression models that depend on the RH threshold (nine features) are slightly simpler than the models that depend on the Xminutes threshold (11 features). Note that when the tree algorithm is applied, the SR is included in the model, even though its weight is quite low. As expected, the model tree (four rules and an average of 10 features per rule) is more complex than the linear regression model (seven features). Nevertheless, the model tree is still easy to interpret and provides a slightly better prediction performance than the linear regression (+0.05 for the correlation coefficient of the tree).
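The two accuracy measures used to compare the models can be computed as in the following sketch. It assumes that r is the Pearson correlation coefficient between observed and predicted PM2.5 values and that RMSE is the root mean squared error. The evaluation protocol of the study (for example, how the data were split) is not stated in this section, and the function names and sample values are hypothetical.

```python
import math


def pearson_r(observed, predicted):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)


def rmse(observed, predicted):
    """Root mean squared error, in the same unit as the target variable."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical observed and predicted values, for illustration only.
observed = [12.0, 18.5, 25.0, 9.5, 30.0]
predicted = [14.0, 17.0, 22.5, 12.0, 26.0]
print(round(pearson_r(observed, predicted), 3))
print(round(rmse(observed, predicted), 3))
```

The two measures are complementary: r only reflects how well the predictions follow the variations of the observations, whereas RMSE also penalizes a systematic offset.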

**8.5432 \* Ywind +**

**38.6386**

The prediction accuracy of the model is evaluated as

**r = 0.58**

**RMSE = 9.56**

The model has only six features. It means that many attributes are filtered out, especially among the meteorological factors (SR, pressure, rain, and WS are removed). This can be explained by the fact that the level of PM2.5 in the morning would be mainly correlated with the density of the traffic (see Section 3.3). However, the morning model does not seem to be significantly different from the single multiple regression, either in terms of features (five identical attributes) or in terms of performance (r = 0.58 in both cases).

*4.3.2. Midday model*

The linear regression model obtained after running the algorithm is as follows:

**PM2.5 = −0.0636 \* minutes + 28.7942 \* %orange + 0.4791 \* RH − 10.0519 \* rain − 0.0141 \* SR + 2.5065 \* temperature + 3.8358 \* Xwind − 2.4909**

The prediction accuracy of the model is evaluated as

**r = 0.56**

**RMSE = 9.13**

The model is still composed of the same nucleus of features: minutes, %orange, RH, temperature, and wind. The only new parameter that appears as a predictive feature is precipitation. This can be explained by the fact that rain events in Quito usually occur at midday. This factor has a negative coefficient because precipitation has a cleaning effect on the concentration of fine particulate matter [19]. The performance of the model is maintained at a nearly constant accuracy (r = 0.56).
