**1. Introduction**

Over the last century, the global human population has more than quadrupled. Most of the recent growth is attributed to urban areas in the less developed parts of the world [1]. As a result, 80% of cities worldwide and 98% of cities in low- and middle-income countries exceed the recommended air quality limits [2]. Apart from economic losses, reduced visibility, and climate change, ambient air pollution causes millions of premature deaths annually, mostly due to anthropogenic fine particulate matter (PM2.5, particles with an aerodynamic diameter of less than 2.5 μm) [3]. Under a business-as-usual scenario, global atmospheric chemistry models suggest that the contribution of outdoor air pollution to premature mortality could double by 2050 [4].

© 2018 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Regression Models to Predict Air Pollution from Affordable Data Collections

http://dx.doi.org/10.5772/intechopen.71848

simple regression, in which a single feature is used to estimate the value of the output. This relationship is acquired by fitting a linear or nonlinear curve to the data. In order to correctly fit the curve, it is necessary to define the goodness-of-fit metric, which allows us to identify the curve that fits better than the other ones. The optimization technique used in regression, and in several other machine-learning methods, is the gradient descent algorithm. In the case of a simple linear regression, the objective is to find the value of the slope and the intercept of the line that minimizes the goodness-of-fit metric. The residual sum of squares (*RSS*), also called sum of squared errors of prediction, is used to calculate this cost. The *RSS* adds up the squared difference between the estimated relationship between x and y (regression model) and the actual values of y ($y_i$), as described in Eq. (1)

$$RSS(w_0, w_1) = \sum_{i=1}^{N} \left( y_i - [w_0 + w_1 x_i] \right)^2 \qquad (1)$$

where $N$ is the number of observations, $x_i$ are the input values, and the coefficients $w_0$ and $w_1$ are the intercept and slope of the linear regression, respectively. For simplification, Eq. (1) is commonly rewritten as follows:

$$RSS(w_0, w_1) = \sum_{i=1}^{N} \left( y_i - \hat{y}_i(w_0, w_1) \right)^2 \qquad (2)$$

where $\hat{y}_i(w_0, w_1)$ is the predictive value of observation $y_i$, if a linear regression defined by $w_0$ and $w_1$ is used. In the case of a multiple regression model, more than one input (or feature) is considered to predict the output. The generic equation of such a model can be written as follows:

$$y_i = \sum_{j=0}^{D} w_j h_j(\vec{x}_i) + \varepsilon_i \qquad (3)$$

where $D$ is the number of features, $h_j(\vec{x}_i)$ are functions of the inputs (represented as a vector) that are weighted by different coefficients $w_j$, and $\varepsilon_i$ is the error. Thus, the RSS is generically defined by Eq. (4) as

$$RSS(\vec{w}) = \sum_{i=1}^{N} \left( y_i - \hat{y}_i(\vec{w}) \right)^2 \qquad (4)$$

where $\vec{w}$ is a vector of the weights (or coefficients) of the whole parameters of the fit. The best regression model is the function that provides the smallest RSS. The model is obtained after a split of the dataset into two independent sets: a training set and a test set. The training set is used to build the model, and the calculation of the RSS is performed over the test set, only. The gradient descent is an iterative method that minimizes the RSS metric. It takes multiple steps to eventually provide the optimal solution as described in Algorithm 1. At first, all the parameters are initialized to be zero at the first iteration (t = 1). Then, the algorithm repeats while the magnitude of the RSS does not converge. The internal part of the loop calculates the partial derivative (partial[j]) for each feature of the multiple regression model, and then, the gradient step takes the jth coefficient at time t and subtracts the step size ($\eta$) times that partial
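The gradient descent loop described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reproduction of the chapter's Algorithm 1: the function names (`predict`, `rss`, `gradient_descent`), the step size, the tolerance, the synthetic noise-free data, and the every-fourth-point hold-out are all hypothetical choices made for the example. Convergence is judged here by the change in RSS between two iterations, which is one possible reading of "the RSS does not converge".

```python
def predict(weights, features):
    """Linear model sum_j w_j * h_j(x); h_0(x) = 1 carries the intercept."""
    return sum(w * h for w, h in zip(weights, features))


def rss(weights, rows, targets):
    """Residual sum of squares over the given observations, as in Eq. (4)."""
    return sum((y - predict(weights, h)) ** 2 for h, y in zip(rows, targets))


def gradient_descent(rows, targets, step_size=0.01, tolerance=1e-12,
                     max_iter=100_000):
    """Minimize the RSS by gradient steps, starting from all-zero weights."""
    weights = [0.0] * len(rows[0])
    previous = rss(weights, rows, targets)
    for _ in range(max_iter):
        residuals = [y - predict(weights, h) for h, y in zip(rows, targets)]
        # partial[j] = dRSS/dw_j = -2 * sum_i h_j(x_i) * (y_i - y_hat_i)
        partial = [-2.0 * sum(h[j] * r for h, r in zip(rows, residuals))
                   for j in range(len(weights))]
        # Gradient step: subtract the step size times each partial derivative.
        weights = [w - step_size * p for w, p in zip(weights, partial)]
        current = rss(weights, rows, targets)
        if abs(previous - current) < tolerance:  # RSS has stopped changing
            break
        previous = current
    return weights


# Synthetic, noise-free data drawn from y = 1 + 2x (illustrative only).
xs = [i / 10 for i in range(20)]
rows = [[1.0, x] for x in xs]            # h_0(x) = 1, h_1(x) = x
targets = [1.0 + 2.0 * x for x in xs]

# Assumed split: every fourth observation is held out as the test set.
train = [i for i in range(len(xs)) if i % 4 != 0]
test = [i for i in range(len(xs)) if i % 4 == 0]

weights = gradient_descent([rows[i] for i in train],
                           [targets[i] for i in train])
test_rss = rss(weights, [rows[i] for i in test], [targets[i] for i in test])
print("intercept, slope:", weights)
print("RSS on the test set:", test_rss)
```

The step size matters: for this data, values much above roughly 0.03 make the iteration diverge, which is why a small fixed value is used here. On real, noisy measurements the test-set RSS would not approach zero as it does for this synthetic line.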

Even though PM2.5 concentrations are 2–5 times higher in developing countries, most air quality studies and measurements are concentrated in developed countries [2, 5]. This is often due to the investment required to launch and support a reliable air quality monitoring station or network. High-accuracy equipment based on standard air quality reference methods can cost from \$6000 to \$36,000 per sensor [6], excluding maintenance, calibration, and accessories, which brings the price of a functional air quality monitoring station to well over \$100,000. Meteorological equipment is also essential for the evaluation of air quality, as high UV radiation, high winds, precipitation, or extreme temperatures can cause serious health concerns. A meteorological station, depending on accuracy requirements, can cost from \$1000 to over \$7000, although the accuracy differences between the tiers are modest (excluding the lowest-level equipment). Dynamic and nonhomogeneous urban systems contain different pollution sources, infrastructures, and varying terrain, so a comprehensive evaluation of air pollution conditions requires more than one station, which puts such monitoring out of reach for poorer cities.

The question of economic limitations has recently received attention, leading to the introduction of lower-cost sensors (<\$500) and bundled platforms (\$5000–10,000) to the market. Comparative studies evaluating sensor performance (fitness for air quality monitoring) show that measurements of some criteria air pollutants compare quite well with the standard air quality reference methods, while others show lower correlation [6–8]. In addition, in some cases, adding a PM sensor to the platform increases costs significantly.

Recently, a different approach has emerged that uses machine learning to estimate particulate pollution [9, 10]. This study evaluates the reliability of predicting air quality with a machine-learning approach, using data sources of differing affordability. It focuses on the case study of Quito, the capital city of Ecuador, because it is a model example of a rapidly growing mid-size city in the developing world with complex terrain, air pollution issues, and economic limitations (e.g., poor quality fuel). In addition, Quito has many years of environmental data collection that can be used for data mining.
