*3.1.2. Data transformation*

A final data preparation step is necessary before running the machine-learning algorithms. The polar coordinates of time (think of time as an analog clock of 24 × 60 = 1440 minutes, in which the minute hand describes an angle) are transformed into Cartesian coordinates (Eqs. (6) and (7)). This transformation gives a more faithful feature representation of the data with respect to the traffic density at night. Otherwise, it would be impossible to find a correlation between time and traffic around midnight, since similar traffic levels would correspond to completely different numbers of minutes (≈ 1440 minutes just before midnight and ≈ 0 minutes just after). The transformation is particularly relevant for machine-learning algorithms based on linear regression, because such algorithms rely on a continuous relationship between parameters [13].
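The transformation of Eqs. (6) and (7), that is, the cosine and sine of minutes · π / 720, can be sketched in a few lines of Python. This is a minimal illustration only; the function names `encode_time` and `circle_distance` are hypothetical and do not come from the original study.

```python
import math

MINUTES_PER_DAY = 24 * 60  # 1440 minutes on the 24-hour "clock"


def encode_time(minutes: float) -> tuple[float, float]:
    """Map minutes since midnight to (Xminutes, Yminutes) on the unit circle."""
    angle = minutes * math.pi / 720  # one full turn (2*pi) per 1440 minutes
    return math.cos(angle), math.sin(angle)


def circle_distance(a: float, b: float) -> float:
    """Euclidean distance between two times of day after encoding."""
    xa, ya = encode_time(a)
    xb, yb = encode_time(b)
    return math.hypot(xa - xb, ya - yb)


if __name__ == "__main__":
    before, after = 1439, 1  # 23:59 and 00:01
    print("raw gap in minutes:", abs(before - after))
    print("gap after encoding:", circle_distance(before, after))
    print("gap between 00:00 and 12:00:", circle_distance(0, 720))
```

In the raw representation, 23:59 and 00:01 are 1438 minutes apart, whereas after encoding they are nearly identical points. Midnight and noon end up at opposite sides of the circle.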

$$\text{Xminutes} = \cos\left(\frac{\text{minutes}\cdot\pi}{720}\right) \tag{6}$$

$$\text{Yminutes} = \sin\left(\frac{\text{minutes}\cdot\pi}{720}\right) \tag{7}$$

Thus, the final dataset is composed of five features: Xminutes, Yminutes, %orange, %red, and PM2.5 (the feature to predict). The %green feature can be discarded, because it is redundant with the information provided by %orange and %red.

Regression Models to Predict Air Pollution from Affordable Data Collections

http://dx.doi.org/10.5772/intechopen.71848

**3.2. Single models**

Two possible approaches can be considered to predict the level of PM2.5 from the other attributes. The first is to build a single model for the whole day. The other is to consider several successive models, since human activity and atmospheric conditions change during the day. This section presents the former method.

A machine-learning algorithm based on linear regression, as described in Section 2.1, is applied to the dataset. The models are trained and tested with a 10-fold cross-validation technique. The performance of the models is then assessed by two metrics: the correlation coefficient and the root-mean-squared error (RMSE). The correlation coefficient (r) measures the strength of the linear relationship between two or more variables. The advantage of r over other metrics is that it is based on a scale with a maximum (±1) and a minimum (0) that quantifies the strength of the relationship. The closer the absolute value of r is to 1, the better the correlation. The RMSE is the square root of the mean squared error (MSE), that is, the squared prediction error averaged over all predictions. RMSE is an intuitive and frequently used evaluation metric, because it expresses performance in the same unit as the predicted attribute itself. The lower the RMSE, the more accurate the model prediction.

*3.2.1. Time only*

Since transportation is the main source of pollution in Quito, and this human activity is relatively stereotypical all day long, the simplest approach is to build a predictive model of PM2.5 based on time parameters only. In this case, the number of features is limited to three, which are Xminutes, Yminutes, and PM2.5.

The linear regression model obtained after running the algorithm is as follows:

**PM2.5 = −2.2242 \* Xminutes + −1.7366 \* Yminutes + 13.8294**

The prediction accuracy of the model is evaluated as

**r = 0.21 RMSE = 8.76**

In the present model, the coefficients attributed to both features are negative. This means that the higher the two temporal attributes, the lower the concentration of fine particulate matter. However, the performance of this first model is quite low (r ≈ 0.2). This is confirmed by the RMSE, which is around 9 μg/m3 relative to an average PM2.5 level of 13.8 μg/m3 for the studied period.
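As a rough illustration of the time-only model and the two metrics used in this section, the sketch below evaluates the reported regression equation at a given time of day and computes r and RMSE in plain Python. The coefficients are those reported in the text. The helper names and the sample observations are invented for illustration, and the sketch does not reproduce the 10-fold cross-validation of the study.

```python
import math

# Coefficients as reported in the text for the time-only model.
COEF_X = -2.2242
COEF_Y = -1.7366
INTERCEPT = 13.8294


def predict_pm25(minutes: float) -> float:
    """Predicted PM2.5 (ug/m3) from minutes since midnight."""
    angle = minutes * math.pi / 720
    return COEF_X * math.cos(angle) + COEF_Y * math.sin(angle) + INTERCEPT


def rmse(observed, predicted) -> float:
    """Root-mean-squared error, in the unit of the predicted attribute."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_r(a, b) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


if __name__ == "__main__":
    # Hypothetical observations, NOT from the study:
    # (minutes since midnight, measured PM2.5 in ug/m3)
    samples = [(0, 9.0), (180, 12.5), (480, 22.0),
               (720, 15.0), (1080, 18.5), (1320, 8.0)]
    observed = [pm for _, pm in samples]
    predicted = [predict_pm25(m) for m, _ in samples]
    print("r    =", round(pearson_r(observed, predicted), 2))
    print("RMSE =", round(rmse(observed, predicted), 2))
```

Because the cosine and sine terms each lie between −1 and 1, the predictions of this model can only deviate from the intercept by a few μg/m3, which is consistent with the limited accuracy reported for it.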
