352 Current Air Quality Issues

**1.3. Modelling approach**

The observed images are the product of natural processes that are very complex. From a statistical standpoint, a sequence of remote sensing images covering a particular region of the earth is a spatiotemporal data set with statistical dependence both within and between images. Physically, the presence of smoke in a particular region at a particular time is surely dependent on the characteristics of a particular fire, as well as on meteorological and topographical variables that vary over the region of interest and over time. There is thus ample scope for mathematical complexity in a model used for classification. Some decisions must be made at the outset about which aspects of the problem to include in our classifiers, and which to ignore.

As the research is still in its early stages, three simplifying decisions have been made.

First, classification will be conducted based only on the spectral information in the images themselves; no ancillary information (for example, about wind, fire locations, or topography) will be used to aid prediction. This decision was made partly to limit model complexity, but also to ensure that our methods are wholly independent of any physics-based deterministic models (which they might eventually be used to validate). Using only the hyperspectral data also maximizes the applicability of the methods to other image processing tasks.

Second, the focus is on detecting only the presence or absence of smoke. A successful system will be able to classify images on a pixel-by-pixel basis into one of two categories, "smoke" or "nonsmoke."

Third, all pixels and all images are assumed to be independent of one another. While ignoring temporal dependence from image to image does not throw away much information—with images collected at a frequency of once per day, there is little correlation between smoke locations from one image to the next—ignoring spatial dependence within images is clearly making a compromise. Smoke appears in spatially contiguous regions, so knowledge that a certain pixel contains smoke should influence adjacent pixels' probability of being smoke. Nevertheless, spatial association between the outcomes introduces many technical difficulties, so it was not included at this stage of our study.

With these decisions, the smoke detection task becomes a typical *binary classification* or *binary image segmentation* problem, using the data in the 35 spectral bands as predictors. Simplifying the problem in this way is justified in a preliminary analysis. Our goal is to evaluate whether the spectral data contain enough information to allow the smoke and nonsmoke pixels to be distinguished from one another with reasonably high probability. If they do not, there is little to be gained from the added complexity of more sophisticated models; if they do, the simple independent-pixel smoke/nonsmoke model can be extended in a variety of ways to obtain further improvements. Furthermore, it will be seen that despite retreating to a simple model for classification, the problem is still high dimensional, computationally intensive, and challenging.

With these considerations in mind, we use logistic regression for building our classifiers. Logistic regression has convenient extensions for accommodating spatial associations, for handling multiple levels of smoke abundance, and for including additional predictor variables. We anticipate that a final, useful future system will be based on such an extended model.

**2. Binary classification concepts**

Classification is the process of assigning a category (a class label) to an item, using available information about the item. We are interested in binary classification, where there are only two class labels. In our case, the labels are nonsmoke (class 0) and smoke (class 1), the items to be classified are image pixels, and the available information is the content of the hyperspectral image. We say we have "built a classifier" when we have established a rule that tells us how any given pixel in a new image should be classified.

**Figure 1.** The data used for the example. Top: an RGB image of the study region, with regions of smoke outlined in red. The blue rectangle encloses the pixels that are used for the example. Bottom, from left to right: the RGB sub-image of the region of interest, the green channel, the blue channel, and the mask showing the true smoke (white) and true nonsmoke (black) regions.

Classifier building requires the availability of *training data*—a set of items where the true class labels are known. The reliance on training data is one reason classification is also known as *supervised learning*. One may think of an all-knowing supervisor who tells us the class membership of a subset of our items, but then goes home for the day, leaving us to learn for ourselves how to classify the remaining items. To prevent confusion, note that the alternative problem of *unsupervised learning* (where the wise supervisor never shows up, leaving all class labels unknown) is also known as *clustering*, and—although important in its own right—is not presently relevant.

Classification is a large topic. It is, in fact, the dominant activity in the field of machine learning. Consequently, no attempt is made here to provide a thorough review of the subject. Rather, a single classifier based on logistic regression will be discussed as a means of introducing common themes in classification. The logistic classifier is naturally suited to binary classification problems, and has a relatively simple form with strong connections to linear and nonlinear regression. This classifier will be used throughout the chapter.

Readers interested in further background on classification, and alternative classifiers, have many resources to turn to. The books [1, 8, 9, 10] provide accessible introductions to the topic, and [1] in particular discusses classification and many related topics in the context of remote sensing imagery. Note that while alternative classification methods may have better or worse performance in different situations, most of the important aspects of setting up and solving a classification problem remain the same regardless of the particular method chosen.

#### **2.1. A small example**

As an illustrative example, we restrict our attention to a small subset of the study data—a portion of a single image—and work with only the RGB image rather than the full hyperspectral data. The large image in Figure 1 shows the entire study region on the chosen date (and also provides an example of what the color images look like on a clear day). The picture contains two areas outlined in red. These are the areas that were deemed to contain smoke during the masking process. The blue rectangle in the image outlines the set of pixels used for this example. The four smaller images at the bottom of the figure show the example data in more detail: the RGB image, the information in the green channel, the information in the blue channel, and the corresponding mask showing the true classes.

The sub-image used for the example is 150 by 165 pixels (24750 pixels in all) and is centered on a smoke plume. To allow the problem to be visualized in two dimensions, we will consider only the green channel (G) and the blue channel (B) as predictors in our classifier.

#### *2.1.1. Logistic classifier with two predictors*

The logistic classifier is based on logistic regression, which is set up as follows. Let the true class (the response variable) of the *i*th pixel be *Y<sub>i</sub>*, with *Y<sub>i</sub>* = 1 corresponding to smoke and *Y<sub>i</sub>* = 0 corresponding to nonsmoke. The true class is modelled as a Bernoulli random variable with *π<sub>i</sub>* = *P*(*Y<sub>i</sub>* = 1) being the probability of the smoke outcome. All pixels are assumed to be statistically independent.

Logistic regression models the log-odds of pixel *i* being smoke (the event *Y<sub>i</sub>* = 1) as a linear combination of predictor variables (the green and blue brightness values, in this case):

*supervised learning*. One may think of an all-knowing supervisor who tells us the class mem‐ bership of a subset of our items, but then goes home for the day, leaving us to learn for ourselves how to classify the remaining items. To prevent confusion, note that the alternative problem of *unsupervised learning* (where the wise supervisor never shows up, leaving all class labels unknown) is also known as *clustering*, and—although important in its own right—is not

Classification is a large topic. It is, in fact, the dominant activity in the field of machine learning. Consequently, no attempt is made here to provide a thorough review of the subject. Rather, a single classifier based on logistic regression will be discussed as a means of introducing common themes in classification. The logistic classifier is naturally suited to binary classifica‐ tion problems, and has a relatively simple form with strong connections to linear and nonlinear

Readers interested in further background on classification, and alternative classifiers, have many resources to turn to. The books [1, 8, 9, 10] provide accessible introductions to the topic, and [1] in particular discusses classification and many related topics in the context of remote sensing imagery. Note that while alternative classification methods may have better or worse performance in different situations, most of the important aspects of setting up and solving a

As an illustrative example, we restrict our attention to a small subset of the study data—a portion of a single image—and work with only the RGB image rather than the full hyperspec‐ tral data. The large image in Figure 1 shows the entire study region on the chosen date (and also provides an example of what the color images look like on a clear day). The picture contains two areas outlined in red. These are the areas that were deemed to contain smoke during the masking process. The blue rectangle in the image outlines the set of pixels used for this example. The four smaller images at the bottom of the figure show the example data in more detail: the RGB image, the information in the green channel, the information in the blue

The sub-image used for the example is 150 by 165 pixels (24750 pixels in all) and is centered on a smoke plume. To allow the problem to be visualized in two dimensions, we will consider

The logistic classifier is based on logistic regression, which is set up as follows. Let the true

corresponding to nonsmoke. The true class is modelled as a Bernoulli random variable with *π<sup>i</sup>* = *P*(*Yi* =1) being the probability of the smoke outcome. All pixels are assumed to be statisti‐

, with *Yi* =1 corresponding to smoke and *Yi* =0

only the green channel (G) and the blue channel (B) as predictors in our classifier.

classification problem remain the same regardless of the particular method chosen.

regression. This classifier will be used throughout the chapter.

channel, and the corresponding mask showing the true classes.

*2.1.1. Logistic classifier with two predictors*

cally independent.

class (the response variable) of the *i* th pixel be *Yi*

presently relevant.

354 Current Air Quality Issues

**2.1. A small example**

$$
\log\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 G_i + \beta_2 B_i, \tag{1}
$$

where *G<sub>i</sub>* and *B<sub>i</sub>* are the green and blue values of the *i*th pixel, and {*β*<sub>0</sub>, *β*<sub>1</sub>, *β*<sub>2</sub>} are the model coefficients. These three coefficients are to be estimated from a set of pixels for which both the responses and the predictors are known. Estimation is done using an iteratively reweighted least squares (equivalently, maximum likelihood) approach. The process is called *model fitting* or *training*, and software for performing the estimation is readily available.
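In practice the fitting would be done with standard statistical software (for example, `glm` in R, or `statsmodels`/`scikit-learn` in Python). As a minimal sketch of what fitting model (1) involves, the following code estimates the coefficients by gradient ascent on the Bernoulli log-likelihood; the pixel values and class labels here are synthetic, invented purely for illustration, not taken from the study:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for training data (invented for illustration):
# smoke pixels (y = 1) tend to have higher green (G) and blue (B) values.
n = 1000
y = rng.integers(0, 2, n)                 # true classes: 0 = nonsmoke, 1 = smoke
G = rng.normal(0.4 + 0.3 * y, 0.1, n)     # green channel brightness
B = rng.normal(0.4 + 0.3 * y, 0.1, n)     # blue channel brightness
X = np.column_stack([np.ones(n), G, B])   # design matrix with intercept column

def fit_logistic(X, y, lr=0.5, iters=5000):
    """Estimate (beta0, beta1, beta2) of model (1) by gradient ascent
    on the Bernoulli log-likelihood (a stand-in for IRLS/ML software)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # current fitted probabilities
        beta += lr * X.T @ (y - p) / len(y)   # log-likelihood gradient step
    return beta

beta = fit_logistic(X, y)
print(beta)  # beta[1] and beta[2] should be positive: brighter pixels favour smoke
```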

Once the parameters are estimated, the fitted model can be used to generate predictions for any given pixel, whether or not the response has been observed. Let *x<sub>j</sub>* represent such a pixel, with predictor values *G<sub>j</sub>* and *B<sub>j</sub>*. Plugging *G<sub>j</sub>*, *B<sub>j</sub>*, and the fitted coefficients into the right hand side of (1), the equation can be solved for *π̂<sub>j</sub>*, the *fitted probability*. This quantity is the estimated probability that pixel *j* belongs to the smoke class.

The logistic regression model gives us fitted probabilities on a continuous scale from zero to one. To convert the model into a binary classifier, one need only specify a cutoff probability, *c*. If *π̂<sub>j</sub>* is less than *c*, pixel *j* will be put into class 0 (nonsmoke), and if *π̂<sub>j</sub>* is greater than *c*, it will be put into class 1 (smoke). We choose *c* = 0.5, so that each pixel is put into the class that is more probable under the model.
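The prediction and cutoff steps can be sketched as follows; the coefficient values here are hypothetical, chosen only to make the arithmetic easy to follow:

```python
import numpy as np

def predict_prob(beta, Gj, Bj):
    """Invert the log-odds of model (1) to obtain the fitted probability."""
    eta = beta[0] + beta[1] * Gj + beta[2] * Bj   # linear predictor (log-odds)
    return 1.0 / (1.0 + np.exp(-eta))

def classify(beta, Gj, Bj, c=0.5):
    """Cutoff rule: class 1 (smoke) when the fitted probability exceeds c."""
    return int(predict_prob(beta, Gj, Bj) > c)

beta = np.array([-6.0, 5.0, 5.0])    # hypothetical coefficients, for illustration
print(predict_prob(beta, 0.5, 0.5))  # log-odds -1, probability about 0.27 -> class 0
print(classify(beta, 0.8, 0.8))      # log-odds +2, probability about 0.88 -> 1
```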

Returning to the example data, the above procedure was followed using the 24750 chosen pixels and their true class labels as training data to fit model (1). The nature of the resulting fitted model is shown in Figure 2. The figure plots each pixel as a point in the (green, blue) plane. In machine learning, predictor variables are often called *features*, and so this plot considers each pixel in the model's *feature space*. We see that the smoke pixels generally occur at higher values of both blue and green, but that there is overlap between the two classes; the two classes are not completely separable. The fitted logistic regression model allows us to calculate a probability of being smoke for any point in the feature space. The thick line on the plot is the probability 0.5 contour of this probability surface; it is the decision boundary for our classifier with *c* =0.5. The model will classify any pixel above this line as smoke, and any pixel below the line as nonsmoke.
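The thick contour described above need not be found numerically: with *c* = 0.5, the decision boundary is exactly the set of points where the log-odds in (1) equal zero, so it is a straight line that can be written in closed form. A sketch, again with hypothetical coefficients:

```python
import numpy as np

def boundary_B(beta, G):
    """Blue value on the c = 0.5 decision boundary at a given green value.

    At the cutoff 0.5 the log-odds of model (1) are zero, so
    beta0 + beta1*G + beta2*B = 0: a straight line in the (G, B) plane.
    """
    return -(beta[0] + beta[1] * G) / beta[2]

beta = np.array([-6.0, 5.0, 5.0])    # hypothetical coefficients
G = np.linspace(0.0, 1.0, 5)
B_line = boundary_B(beta, G)         # pixels with B above this line -> smoke
print(B_line)                        # for these coefficients, B = 1.2 - G
```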

The inset image in the figure shows the classifier's predictions. White pixels in this image indicate pixels estimated to have greater than 50% chance of being smoke. The red outline indicates the boundary of the true smoke region. While most of the pixels are classified correctly, many are not.

**Figure 2.** Results of fitting the two-predictor model (G, B) to the example image. Blue points are smoke pixels and red points are nonsmoke. The line on the plot gives the 50% probability line that can be used to discriminate one class from the other. The inset image shows the predicted classes using this model; the red outline in the inset is the boundary of the true smoke region.

#### *2.1.2. Logistic classifier with expanded feature space*

The mathematical structure of the previous model ensured that the decision boundary in Figure 2 had to be a straight line. This limited the ability of the classifier to discriminate between the two classes. To make the model more flexible, we can expand the size of the feature space by adding nonlinear functions of the original predictors G and B. For example, we can consider the model

$$
\log\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 G_i + \beta_2 B_i + \beta_3 G_i^2 + \beta_4 B_i^2 + \beta_5 G_i B_i + \beta_6 G_i^3 + \beta_7 G_i B_i^2 + \beta_8 B_i G_i^2 + \beta_9 B_i^3 + \beta_{10} G_i^2 B_i^2, \tag{2}
$$

which includes the original variables *G<sub>i</sub>* and *B<sub>i</sub>*, along with squared and cubed terms (like *G<sub>i</sub>*<sup>2</sup> and *G<sub>i</sub>*<sup>3</sup>) as well as products between the original variables taken to various powers (as in *G<sub>i</sub>B<sub>i</sub>* and *B<sub>i</sub>G<sub>i</sub>*<sup>2</sup>). Borrowing terminology from industrial experimentation, we call the original variables *main effects* and any terms involving products of variables *interactions*.

The right hand side of model (2) is still a linear combination of various predictor variables, but we have expanded the feature space to ten dimensions. Considered as a function of G and B, the model is able to handle nonlinear relationships between these main effects. In Figure 3 we see the results of fitting this model to the example data. The figure shows the same scatter plot of the data, but now with the 50% contour line for this more flexible model. By adding extra features we can define a decision boundary with more complex shape. The additional shape flexibility of this boundary allows the classifier to correctly assign classes to a greater proportion of the pixels, as seen in the inset prediction image.
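The ten-dimensional feature space of model (2) can be constructed explicitly; a minimal sketch:

```python
import numpy as np

def expand_features(G, B):
    """Ten features of model (2): main effects, second- and third-order
    terms, and the fourth-order interaction G^2 B^2."""
    return np.column_stack([
        G, B,                       # main effects
        G**2, B**2, G * B,          # second-order terms
        G**3, G * B**2, B * G**2,   # third-order terms
        B**3,
        G**2 * B**2,                # fourth-order interaction
    ])

G = np.array([0.2, 0.8])   # two illustrative pixels
B = np.array([0.5, 0.6])
X = expand_features(G, B)
print(X.shape)  # (2, 10): one row per pixel, one column per feature
```

The fitting procedure is unchanged: prepend an intercept column and estimate the eleven coefficients exactly as for model (1).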


**Figure 3.** Results of fitting the 10-predictor model (G, B, G<sup>2</sup>, B<sup>2</sup>, GB, G<sup>3</sup>, GB<sup>2</sup>, BG<sup>2</sup>, B<sup>3</sup>, G<sup>2</sup>B<sup>2</sup>) to the example data. The plot is constructed in the same way as the previous figure. In this case the class decision boundary can take a complex nonlinear shape.

#### **2.2. Other important concepts**

The preceding example might tempt one to believe that simply adding more predictors to the model will always yield a better classifier. This is not true, however, for two reasons.

The first problem with arbitrarily growing the feature space is purely computational. In most problems (and certainly in the present study), the measured main effects are correlated with each other to varying degrees. When expanding the feature space, the variables in the model will increasingly suffer from a form of redundancy known as *multicollinearity*: certain predictors can (almost) be written as linear combinations of the other predictors. When the degree of multicollinearity is mild, model fitting will still be possible, but the coefficient estimates can be grossly inaccurate (and can vary greatly from sample to sample). As the problem gets worse, fitting will fail due to the occurrence of numerically singular matrices in the estimation routine.
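A small numerical illustration of this effect, using synthetic, deliberately correlated channels (an assumption for illustration, not study data): the condition number of the design matrix, which measures how close its columns are to linear dependence, grows by orders of magnitude when the polynomial terms are added.

```python
import numpy as np

rng = np.random.default_rng(1)
G = rng.uniform(0.3, 0.7, 500)
B = G + rng.normal(0.0, 0.05, 500)   # blue nearly a linear function of green

X_small = np.column_stack([G, B])
X_big = np.column_stack([G, B, G**2, B**2, G * B,
                         G**3, G * B**2, B * G**2, B**3, G**2 * B**2])

# Large condition numbers signal multicollinearity: near-singular fitting.
print(np.linalg.cond(X_small))   # modest for the two main effects
print(np.linalg.cond(X_big))     # far larger for the expanded feature space
```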

The multicollinearity problem does not preclude us from considering a large feature space, but it means we cannot include *all* variables from a large feature space in the model. This leads to the problem of *model (feature) selection*: when the number of potential predictors is large, we seek to choose a subset of them that produces a good classifier that is numerically tractable.

When selecting a model from a large collection of correlated predictors, it is important to remember that the coefficient estimate of a particular variable will vary depending on which other variables are included in the model. Further, the best-fitting models of two different sizes need not have any variables in common (the variables selected in the best five-variable model, for example, might not be present in the best ten-variable model). For these reasons it is best to consider the performance of a model as a whole, rather than paying undue attention to coefficient values, statistical significance tests, and the like.

The second problem is more fundamental, and can arise even when multicollinearity is not present. The predictions shown in the previous figures were predictions made on the training data itself; the same data were used both for model fitting and for evaluating performance. This circumstance leads to *overfitting* and poor *generalization* ability: the model fits the training data very well but, because the training data is only a sample from the population, the model's predictive power on new data suffers. When considering increasingly complex models, a point is reached at which additional complexity only detracts from out-of-sample prediction accuracy.

The remedy for overfitting again involves model selection. Because of overfitting, larger models are not necessarily better, so the challenge is to select a model of intermediate size that is best at what is really important, out-of-sample prediction. To do this, one must use different samples of the data for different parts of the procedure. Ideally, one portion of the data (a training set) is used for fitting, another portion (a *validation set*) for model selection, and a third portion (a *test set*) for final evaluation of predictive performance ([9], p. 222).
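One simple way to implement the three-way partition is a random split of the pixel indices; the 50/25/25 proportions below are an arbitrary illustrative choice, not a recommendation from the study:

```python
import numpy as np

def three_way_split(n, train=0.5, val=0.25, seed=0):
    """Randomly partition n item indices into training, validation,
    and test sets (proportions are illustrative)."""
    idx = np.random.default_rng(seed).permutation(n)
    n_train = int(train * n)
    n_val = int(val * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# Split the example sub-image's 24750 pixels.
train_idx, val_idx, test_idx = three_way_split(24750)
print(len(train_idx), len(val_idx), len(test_idx))
```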

A final important consideration is the particular measure used for evaluating classifier performance. Any item processed by a binary classifier falls into one of four groups, defined by its true class (0 or 1) and its predicted class (0 or 1). The rates of these four outcomes can be displayed in a so-called confusion matrix, as shown in Table 1. The values *a*, *b*, *c*, *d* in the table are the rates (relative frequencies) of the four possible outcomes. They must sum to 1. The values *b* and *c* (shown in bold) are the rates of the two types of errors: nonsmoke classified as smoke, and smoke classified as nonsmoke. The row sums *f*<sub>0</sub> and *f*<sub>1</sub> are the true proportions of items in each class.

Three error rates derived from the confusion matrix are considered subsequently. The *overall error rate* (OER = *b* + *c*) is simply the global proportion of pixels misclassified. The *classwise error rates* are the rates of misclassification in each class considered separately. We denote these by CER<sub>0</sub> = *b* / *f*<sub>0</sub> for the nonsmoke class, and CER<sub>1</sub> = *c* / *f*<sub>1</sub> for the smoke class.
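These quantities are straightforward to compute from vectors of true and predicted labels; a sketch with a tiny invented example:

```python
import numpy as np

def confusion_rates(y_true, y_pred):
    """Rates a, b, c, d of the confusion matrix, plus OER, CER0, and CER1."""
    a = np.mean((y_true == 0) & (y_pred == 0))   # true nonsmoke, called nonsmoke
    b = np.mean((y_true == 0) & (y_pred == 1))   # true nonsmoke, called smoke
    c = np.mean((y_true == 1) & (y_pred == 0))   # true smoke, called nonsmoke
    d = np.mean((y_true == 1) & (y_pred == 1))   # true smoke, called smoke
    f0, f1 = a + b, c + d                        # true class proportions
    return {"OER": b + c, "CER0": b / f0, "CER1": c / f1}

y_true = np.array([0, 0, 0, 0, 1, 1])   # invented labels for illustration
y_pred = np.array([0, 0, 0, 1, 1, 0])
print(confusion_rates(y_true, y_pred))  # OER 1/3, CER0 1/4, CER1 1/2
```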

Minimizing the OER will be taken as the primary goal of classifier construction. Note, however, that our data set consists of 90% nonsmoke pixels (*f*<sub>0</sub> = 0.9), so focusing on overall prediction performance implicitly puts more weight on prediction accuracy in the nonsmoke class. Because the data are so unbalanced, even the naïve classification rule "assign all pixels to class 0" can achieve an error rate of only 10% (OER = 0.1), but with the highly unsatisfactory classwise rates CER<sub>0</sub> = 0 and CER<sub>1</sub> = 1. More will be said about the trade-off between OER and CER in later discussion.
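The imbalance argument can be checked numerically with a synthetic label vector mirroring the 90/10 split:

```python
import numpy as np

# With 90% nonsmoke pixels, the rule "assign every pixel to class 0" already
# achieves OER = 0.1, while misclassifying every smoke pixel (CER1 = 1).
y_true = np.zeros(1000, dtype=int)
y_true[:100] = 1                       # 10% smoke, mirroring f1 = 0.1
y_naive = np.zeros(1000, dtype=int)    # naive all-nonsmoke classifier

oer = np.mean(y_naive != y_true)
cer1 = np.mean(y_naive[y_true == 1] != 1)
print(oer, cer1)  # 0.1 1.0
```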


| | Classified nonsmoke (0) | Classified smoke (1) | Row sum |
|---|---|---|---|
| True nonsmoke (0) | *a* | **b** | *f*<sub>0</sub> |
| True smoke (1) | **c** | *d* | *f*<sub>1</sub> |

**Table 1.** A confusion matrix. Values in bold represent errors.
