
to exceed the recommendations for air quality [2]. Apart from economic losses, reduced visibility, and climate change, ambient air pollution causes millions of premature deaths annually, mostly due to anthropogenic fine particulate matter (PM2.5, particles with an aerodynamic diameter of less than 2.5 μm) [3]. In a business-as-usual scenario, global atmospheric chemistry models suggest that the contribution of outdoor air pollution to premature mortality could double by 2050 [4].

Even though the concentrations of PM2.5 are 2–5 times higher in developing countries, most air quality studies and measurements are concentrated in developed countries [2, 5]. This is often due to the investments required to launch and support a reliable air quality monitoring station or network. Equipment for high-accuracy, standard air quality reference methods can cost from \$6000 to \$36,000 per sensor [6], excluding the costs for maintenance, calibration, and accessories, resulting in a price well over \$100,000 for a functional air quality monitoring station. Meteorological equipment is also essential for the evaluation of air quality, as high UV radiation, high winds, precipitation, or extreme temperatures can cause serious health concerns. A meteorological station, depending on accuracy requirements, can cost from \$1000 to over \$7000, although the accuracy differences between the tiers (excluding the lowest level equipment) are not too great. Dynamic and nonhomogeneous urban systems contain different pollution sources, infrastructures, and varying terrains, requiring more than one station for a comprehensive evaluation of air pollution conditions and consequently excluding poorer cities.

The question of economic limitations has recently been brought to attention, resulting in the introduction to the market of lower cost sensors (<\$500) and bundled platforms (\$5000–\$10,000). Based on comparative studies evaluating sensor performance (fitness for air quality monitoring), some criteria air pollutants compare quite well with the standard air quality reference methods, while others show lower correlation [6–8]. In addition, in some cases, adding a PM sensor to the platform increases costs significantly.

Recently, a different approach has aimed at using machine learning to estimate particulate pollution [9, 10]. This study evaluates the reliability of predicting air quality through a machine-learning approach and from data sources with different levels of affordability. It focuses on the case study of Quito, the capital city of Ecuador, because it is a model example of the rapidly growing mid-size cities of the developing world, with complex terrain, air pollution issues, and economic limitations (e.g., poor-quality fuel). In addition, Quito has many years of environmental data collection that can be used for data mining.



Machine Learning - Advanced Techniques and Emerging Applications




**2. Machine-learning approach**

**2.1. Prediction by multiple regression**

In regression, features derived from a dataset are used as inputs to a regression model to predict a continuous-valued output. This kind of prediction is obtained by learning the relationship between the input x and the output y. The simplest case of a regression model is simple regression, in which a single feature is used to estimate the value of the output. This relationship is acquired by fitting a linear or nonlinear curve to the data. In order to fit the curve correctly, it is necessary to define a goodness-of-fit metric, which allows us to identify the curve that best fits the data. The optimization technique used in regression, and in several other machine-learning methods, is the gradient descent algorithm. In the case of a simple linear regression, the objective is to find the values of the slope and the intercept of the line that minimize the goodness-of-fit metric. The residual sum of squares (*RSS*), also called the sum of squared errors of prediction, is used to calculate this cost. The *RSS* adds up the squared differences between the values estimated by the regression model and the actual values of y (*y<sub>i</sub>*), as described in Eq. (1)

$$\mathrm{RSS}(w\_0, w\_1) = \sum\_{i=1}^{N} \left(y\_i - \left[w\_0 + w\_1 x\_i\right]\right)^2 \tag{1}$$

where *N* is the number of observations, *x<sub>i</sub>* are the input values, and the coefficients *w*<sub>0</sub> and *w*<sub>1</sub> are the intercept and slope of the linear regression, respectively. For simplification, Eq. (1) is commonly rewritten as follows:

$$\mathrm{RSS}(w\_0, w\_1) = \sum\_{i=1}^{N} \left(y\_i - \hat{y}\_i(w\_0, w\_1)\right)^2\tag{2}$$

where *ŷ<sub>i</sub>*(*w*<sub>0</sub>, *w*<sub>1</sub>) is the predicted value of observation *y<sub>i</sub>* if a linear regression defined by *w*<sub>0</sub> and *w*<sub>1</sub> is used. In the case of a multiple regression model, more than one input (or feature) is considered to predict the output. The generic equation of such a model can be written as follows:

$$y\_i = \sum\_{j=0}^{D} w\_j h\_j(\vec{x}\_i) + \varepsilon\_i \tag{3}$$

where *D* is the number of features, *h<sub>j</sub>*(*x⃗<sub>i</sub>*) are functions of the inputs (represented as a vector *x⃗<sub>i</sub>*) that are weighted by the coefficients *w<sub>j</sub>*, and *ε<sub>i</sub>* is the error. Thus, the RSS is generically defined by Eq. (4) as

$$\mathrm{RSS}(\vec{w}) = \sum\_{i=1}^{N} \left( y\_i - \hat{y}\_i(\vec{w}) \right)^2 \tag{4}$$

where *w⃗* is the vector of the weights (or coefficients) of all the parameters of the fit. The best regression model is the function that provides the smallest RSS. The model is obtained after splitting the dataset into two independent sets: a training set and a test set. The training set is used to build the model, and the calculation of the RSS is performed over the test set only.

Gradient descent is an iterative method that minimizes the RSS metric. It takes multiple steps to eventually reach the optimal solution, as described in Algorithm 1. At first, all the parameters are initialized to zero at the first iteration (t = 1). Then, the algorithm repeats while the magnitude of the gradient of the RSS has not converged. The inner part of the loop calculates the partial derivative (partial[j]) for each feature of the multiple regression model; the gradient step then takes the jth coefficient at time t and subtracts the step size (*η*) times that partial derivative. Once the algorithm has cycled through all the features of the model, the t counter is incremented and the convergence condition is tested to decide whether the loop must run again. When the minimum is reached (‖∇RSS(*w⃗*)‖ ≤ *ε*), the respective values of the regression coefficients are used as the model parameters to form the predictions.

Algorithm 1. Gradient descent algorithm for multiple regression.

```
1: init w⃗(1) = 0, t = 1
2: while ‖∇RSS(w⃗(t))‖ > ε
3:   for j = 0, …, D
4:     partial[j] = −2 ∑_{i=1}^{N} h_j(x⃗_i) (y_i − ŷ_i(w⃗(t)))
5:     w_j(t+1) ← w_j(t) − η · partial[j]
6:   t ← t + 1
```
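Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration, not the chapter's actual implementation: it assumes the feature functions *h<sub>j</sub>* have been precomputed into a matrix H whose first column is 1 (the intercept term), and the data and step size are hypothetical.

```python
import numpy as np

def gradient_descent(H, y, eta=0.01, eps=1e-6, max_iter=100_000):
    """Minimize the RSS of a multiple regression by gradient descent (Algorithm 1).

    H is the N x (D+1) matrix of feature values h_j(x_i); its first
    column is 1 so that w[0] plays the role of the intercept w_0.
    """
    w = np.zeros(H.shape[1])                # step 1: init w(1) = 0
    for _ in range(max_iter):
        partial = -2.0 * H.T @ (y - H @ w)  # step 4: partial derivatives of the RSS
        if np.linalg.norm(partial) <= eps:  # step 2: convergence test on the gradient
            break
        w -= eta * partial                  # step 5: gradient step with step size eta
    return w

# Hypothetical noiseless data generated by y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
H = np.column_stack([np.ones_like(x), x])   # h_0 = 1 (intercept), h_1 = x
y = 1.0 + 2.0 * x
w = gradient_descent(H, y)                  # w approaches [1.0, 2.0]
```

In practice, the step size *η* must be small enough for the iteration to converge; too large a value makes the RSS diverge.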
In addition, the final regression models of this study are obtained after an attribute selection using the M5 method, which steps through the attributes, removing the one with the smallest standardized coefficient, until no improvement is observed in the estimate of the error given by the Akaike information criterion (AIC) [11]:


$$AIC = N \ln \left( \frac{RSS}{N-D} \right) + 2D \tag{5}$$

where *N* is the number of observations (or instances), and *D* is the number of features (or attributes). The selected model is the one with the lowest AIC.

All the models presented in the manuscript are obtained after a normalization of the values of the variables, in order to avoid dominance by the variables with the highest intrinsic values. The method used to evaluate model accuracy is a 10-fold cross-validation. The regression modeling is performed with the Pandas and scikit-learn machine-learning libraries for Python.
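The AIC-guided backward elimination can be sketched as follows. This is a simplified illustration of the idea, not Weka's exact M5 routine: at each step, the attribute with the smallest standardized coefficient is dropped, and the reduced model is kept only if its AIC, computed as in Eq. (5), does not get worse. The data are hypothetical.

```python
import numpy as np

def aic(rss, n, d):
    """Akaike information criterion of Eq. (5)."""
    return n * np.log(rss / (n - d)) + 2 * d

def m5_select(X, y):
    """Backward attribute elimination guided by the AIC (M5-style sketch)."""
    features = list(range(X.shape[1]))

    def fit(cols):
        A = np.column_stack([np.ones(len(y)), X[:, cols]])  # intercept + attributes
        w = np.linalg.lstsq(A, y, rcond=None)[0]
        rss = np.sum((y - A @ w) ** 2)
        return w, rss

    w, rss = fit(features)
    best = aic(rss, len(y), len(features))
    while len(features) > 1:
        # Standardized coefficients: weight magnitude times attribute spread
        std_coef = np.abs(w[1:]) * X[:, features].std(axis=0)
        candidate = list(features)
        candidate.pop(int(np.argmin(std_coef)))
        w_c, rss_c = fit(candidate)
        score = aic(rss_c, len(y), len(candidate))
        if score >= best:          # no AIC improvement: stop removing
            break
        features, w, best = candidate, w_c, score
    return features

# Hypothetical data: y depends on attribute 0 only; attribute 1 is irrelevant
x0 = np.arange(10.0)
x1 = np.array([0.2, -0.1, 0.4, 0.3, -0.5, 0.1, -0.3, 0.2, 0.0, -0.2])
noise = np.array([0.05, -0.02, 0.03, -0.04, 0.01, 0.02, -0.05, 0.04, -0.01, 0.03])
X = np.column_stack([x0, x1])
y = 3.0 * x0 + noise
selected = m5_select(X, y)
```

Here the irrelevant attribute is removed because dropping it barely changes the RSS while the 2*D* penalty of Eq. (5) shrinks, so the AIC improves.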
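The normalization and the 10-fold cross-validation can be combined with scikit-learn as in the following sketch; the dataset is hypothetical, and placing the scaler inside the pipeline ensures the scaling is refitted on each training fold.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical dataset: 100 observations, 3 features on very different scales
rng = np.random.RandomState(0)
X = np.column_stack([
    rng.rand(100) * 1000.0,   # e.g., traffic counts
    rng.rand(100) * 10.0,     # e.g., wind speed
    rng.rand(100),            # e.g., normalized humidity
])
y = 0.002 * X[:, 0] - 0.1 * X[:, 1] + 0.05 * rng.randn(100)

# StandardScaler normalizes the variables so that no feature dominates;
# cross_val_score evaluates the regression with 10-fold cross-validation.
model = make_pipeline(StandardScaler(), LinearRegression())
scores = cross_val_score(model, X, y, cv=10, scoring="r2")
```

The mean of `scores` is the cross-validated estimate of the model's accuracy.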

factors on top of the traffic data. Most of the meteorological equipment is not as costly as air quality sensors, thus still presenting a viable option for the prediction of PM2.5 concentrations. Subsequently, Section 5 describes a prediction that includes traffic data, meteorological factors, and trace gas concentrations. In this way, we build from the simplest to the most complex model, increasing the equipment costs with every step and improving the prediction performance. Finally, we finish our study by proposing the best simple model, based on a feature selection method, which lets us reduce the costs significantly while still producing high performance.

Regression Models to Predict Air Pollution from Affordable Data Collections

http://dx.doi.org/10.5772/intechopen.71848

**3. Prediction from real-time traffic monitoring**

We propose a method to extract data from Google Maps Traffic, in which a simple request to the website enables us to build a database of the traffic in the city and, consequently, of the level of urban air pollution.

**3.1. Dataset**

*3.1.1. Data acquisition*

*3.1.1.1. Screenshot*

A request to Google Maps Traffic is performed using the selenium library for Python. A screenshot is taken every 10 minutes of a specific zone of Quito, centered on the neighborhood of Belisario. The exact coordinates of the geographic area of interest are −0.181661, −78.4987077, which is 1.2 km southwest from the center of the traffic map. Two kinds of images are stored: one with traffic (**Figure 1a**) and another without traffic (**Figure 1b**). It is necessary to save these two different types of pictures in order to proceed with the next step, which consists of isolating the traffic information only (**Figure 1c**).

*3.1.1.2. Background subtraction*

A technique of background subtraction is used to eliminate picture information that is not related to traffic (**Figure 1**). The background removal is carried out through the following process [12]:

• Memorize the background image (the picture without traffic).

• Check every pixel in the frame. If it is different from the corresponding pixel in the background image, it is a foreground pixel (traffic information); if not, it is a background pixel.

To get a clean image of the traffic, it is necessary to define a distance threshold of brightness when comparing the background image to the traffic + background images (see Algorithm 2). For every pixel, if the absolute difference of brightness between the image with traffic and the background image is lower than the threshold (empirically set at 30), then the corresponding pixels are considered identical. In this case, the pixels are colored white
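The thresholded comparison can be sketched with NumPy as follows. The two images here are tiny hypothetical grayscale brightness arrays; with real screenshots, the same expression would be applied per color channel after loading the images into arrays.

```python
import numpy as np

THRESHOLD = 30  # empirically defined brightness distance

# Hypothetical grayscale brightness images of the same zone (values 0-255)
background = np.array([[10, 10], [200, 200]], dtype=np.int16)   # without traffic
with_traffic = np.array([[12, 90], [205, 20]], dtype=np.int16)  # with traffic

# Pixels whose brightness differs from the background by less than the
# threshold are considered identical and colored white (255); the other
# pixels keep the traffic information of the foreground.
diff = np.abs(with_traffic - background)
traffic_only = np.where(diff < THRESHOLD, 255, with_traffic).astype(np.uint8)
```

The signed integer dtype avoids wrap-around when subtracting the two unsigned 8-bit images.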

