
22 Machine Learning - Advanced Techniques and Emerging Applications

```java
image(frame[i], 0, 0);
for (int x = 0; x < width; x++) {
  for (int y = 0; y < height; y++) {
    color c = get(x, y);
    if (red(c) < 200 || green(c) < 200 || blue(c) < 200) {
      if (red(c) > green(c) && abs(green(c) - blue(c)) < 20) {
        red++;
      } else if (red(c) > green(c) && green(c) > blue(c)) {
        orange++;
      } else if (green(c) > red(c) && red(c) > blue(c)) {
        green++;
      }
    }
  }
}
```

*3.1.1.4. Hourly averaging*

Since the machine-learning models are based on hourly data analysis, the trend of the six 10-minute recordings must be determined for each hour. To do so, the average of the six percentages per hour is calculated for each color. These values are then added to the final dataset.
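The hourly averaging step can be sketched as follows. The percentages below are invented for illustration; the real values come from the color-counting step applied to six consecutive traffic-map frames.

```python
# Sketch of the hourly averaging step: six 10-minute recordings of
# (%green, %orange, %red) are averaged into one value per color per hour.
# The numbers are synthetic, not from the original dataset.
from statistics import mean

# One hour of recordings: six (%green, %orange, %red) tuples, 10 minutes apart.
recordings = [
    (80.0, 15.0, 5.0),
    (78.0, 16.0, 6.0),
    (75.0, 18.0, 7.0),
    (70.0, 20.0, 10.0),
    (68.0, 22.0, 10.0),
    (65.0, 25.0, 10.0),
]

# Average each color over the six 10-minute recordings of the hour.
pct_green = mean(r[0] for r in recordings)
pct_orange = mean(r[1] for r in recordings)
pct_red = mean(r[2] for r in recordings)
```

The three averaged values are the hourly entries added to the final dataset.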

*3.1.2. Data transformation*

One last data preparation step is necessary before running the machine-learning algorithms. The polar coordinates of time (think of time as an analog clock of 24 × 60 minutes, in which the minute hand describes an angle) are transformed into Cartesian coordinates (Eqs. (6) and (7)). This transformation permits a more accurate feature representation of the data with respect to the traffic density at night. Otherwise, it would be impossible to find a correlation between time and traffic around midnight, since a similar traffic would correspond to completely different numbers of minutes (before midnight ≈ 1440 minutes, after midnight ≈ 0 minutes). This transformation is particularly relevant for machine-learning algorithms based on linear regression, because linear regression relies on a continuous relationship between parameters [13].

*Xminutes* = *cos*(*minutes* · *π* / 720) (6)

*Yminutes* = *sin*(*minutes* · *π* / 720) (7)

Thus, the final dataset is composed of five features: Xminutes, Yminutes, %orange, %red, and PM2.5 (the feature to predict). The %green feature can be discarded, because it is redundant with the information brought by %orange and %red.
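The transformation of Eqs. (6) and (7) can be sketched as below. The function name is illustrative; the point of the example is that minutes just before and just after midnight, although numerically far apart, become neighbouring points on the unit circle.

```python
# Sketch of the polar-to-Cartesian time transformation of Eqs. (6) and (7):
# the minute of the day (0..1439) is mapped onto the unit circle, so a full
# day corresponds to an angle of 2*pi radians.
from math import cos, sin, pi, hypot

def time_features(minutes):
    """Return (Xminutes, Yminutes) for a given minute of the day."""
    x = cos(minutes * pi / 720)  # Eq. (6)
    y = sin(minutes * pi / 720)  # Eq. (7)
    return x, y

# 23:59 and 00:01 are 1438 minutes apart numerically,
# but adjacent on the circle.
before = time_features(1439)
after = time_features(1)
gap = hypot(before[0] - after[0], before[1] - after[1])
```

With the raw minute count, a regression would see 1439 and 1 as opposite extremes; with the circular encoding, `gap` is close to zero.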

**3.2. Single models**

Two possible approaches can be considered to predict the level of PM2.5 from the other attributes. The first one is to build a single model for the whole day. The other is to build several successive models, since human activity and atmospheric conditions change during the day. This section presents the former method.

A machine-learning algorithm based on linear regression, as described in Section 2.1, is applied to the dataset. The models are trained and tested according to a 10-fold cross-validation technique. The performance of the models is then assessed by two metrics: the correlation coefficient and the root-mean-squared error (RMSE). The correlation coefficient (r) measures the strength of the linear relationship between two or more variables. Its advantage over other metrics is that it quantifies the strength of the relationship on a bounded scale, from 0 (no correlation) to ±1 (perfect correlation). The closer the absolute value of r is to 1, the better the correlation. The root-mean-squared error is the square root of the mean squared error per prediction (MSE). RMSE is an intuitive and frequently used evaluation metric, because it is expressed in the same unit as the predicted attribute itself. The lower the RMSE, the more accurate the model prediction.
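The evaluation protocol can be sketched as follows. This is a minimal illustration with synthetic one-feature data and a hand-rolled least-squares fit, not the authors' implementation; only the protocol (10-fold cross-validation scored by r and RMSE) mirrors the text.

```python
# Sketch of 10-fold cross-validation of a linear regression, scored by the
# correlation coefficient (r) and RMSE. Data and helpers are illustrative.
from math import sqrt

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def pearson_r(ys, ps):
    """Correlation coefficient between observed and predicted values."""
    n = len(ys)
    my, mp = sum(ys) / n, sum(ps) / n
    cov = sum((y - my) * (p - mp) for y, p in zip(ys, ps))
    return cov / sqrt(sum((y - my) ** 2 for y in ys)
                      * sum((p - mp) ** 2 for p in ps))

# Synthetic data: a target roughly linear in one feature, plus small noise.
xs = [i / 10 for i in range(100)]
ys = [3.0 + 2.0 * x + ((i * 37) % 11 - 5) * 0.1 for i, x in enumerate(xs)]

# 10-fold cross-validation: each fold is held out once for testing.
preds = [0.0] * len(xs)
for k in range(10):
    test = set(range(k, len(xs), 10))
    train_x = [x for i, x in enumerate(xs) if i not in test]
    train_y = [y for i, y in enumerate(ys) if i not in test]
    a, b = fit_line(train_x, train_y)
    for i in test:
        preds[i] = a + b * xs[i]

rmse = sqrt(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
r = pearson_r(ys, preds)
```

Because every instance is predicted exactly once by a model that never saw it during training, r and RMSE computed over `preds` estimate out-of-sample performance.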
