**4. Results**

The data splitting, sampling, and model selection procedures just described were carried out on the study data, with the net result of producing one best classifier from each of the four scenarios. These four best classifiers were subsequently used to generate predictions for every pixel in the 37 test images. The results of these tasks are presented below, beginning with model selection, and then moving on to the quantitative assessment of prediction performance. The qualitative assessment of performance is reviewed in Section 5.

#### **4.1. Model selection results**

The results of model selection are shown in Table 2 and Table 3. The first table lists all of the models considered, along with their deviance and their error rates on the validation data. The error rate estimates in the table are preliminary only, because they are measured on the same validation sample that was used for variable selection. The final and most accurate measures of out-of-sample predictive performance (the error rates on the test images) are reported in the next section.

The four models selected as best in the four groups are shown in bold in Table 2. For scenario 1 (RGB), there was only one candidate model, which was selected by default. For scenarios 2 and 3 (the main effects and all effects models), the best models had *k* = 20 and *k* = 50, respectively. For scenario 4 (the LASSO), the minimum-deviance approach chose a model with 109 variables.


| **Scenario/Model** | **Deviance** | **OER (%)** | **CER0 (%)** | **CER1 (%)** |
|---|---|---|---|---|
| **1. RGB** | **58549** | **10.2** | **4.3** | **98.4** |
| 2. Main effects, *k* variables | | | | |
| *k* = 3 | 57245 | 10.0 | 0.0 | 100.0 |
| *k* = 5 | 53162 | 9.8 | 0.4 | 94.3 |
| *k* = 10 | 50399 | 9.3 | 0.5 | 87.9 |
| *k* = 15 | 48521 | 8.7 | 0.5 | 83.0 |
| ***k* = 20** | **48483** | **8.6** | **0.5** | **82.1** |
| *k* = 25 | 48704 | 8.8 | 0.6 | 82.8 |
| *k* = 30 | 50144 | 8.8 | 0.7 | 81.7 |
| 3. All effects, *k* variables | | | | |
| *k* = 3 | 51262 | 9.6 | 0.4 | 92.9 |
| *k* = 10 | 42442 | 7.6 | 1.1 | 65.9 |
| *k* = 20 | 40180 | 7.2 | 1.1 | 62.2 |
| *k* = 30 | 39785 | 7.1 | 1.2 | 60.0 |
| *k* = 40 | 38600 | 6.8 | 1.3 | 55.7 |
| ***k* = 50** | **38174** | **6.8** | **1.4** | **56.0** |
| *k* = 60 | 38424 | 6.9 | 1.6 | 54.5 |
| *k* = 70 | 38475 | 6.8 | 1.6 | 53.7 |
| **4. All effects, LASSO\*** | **47711** | **8.1** | **1.6** | **66.6** |

\*The LASSO model shown is the minimum-deviance one, which had 109 nonzero coefficients.

**Table 2.** List of models considered, with results for the validation set.
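For concreteness, the minimum-deviance rule used for the LASSO (scenario 4) could be implemented along the lines sketched below. This is an illustration only: the fitting software is not specified in the text, so the scikit-learn implementation and the variable names (`X_train`, `y_train`, `X_val`, `y_val`) are our assumptions.

```python
# Illustrative sketch of minimum-deviance LASSO selection (assumes scikit-learn;
# the study's actual fitting software is not specified).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def select_lasso_by_deviance(X_train, y_train, X_val, y_val,
                             Cs=np.logspace(-3, 2, 30)):
    """Fit an L1-penalized logistic model along a penalty path and return
    the fit minimizing binomial deviance on the validation sample."""
    best = None
    for C in Cs:  # C is the inverse regularization strength
        fit = LogisticRegression(penalty="l1", solver="liblinear",
                                 C=C, max_iter=1000).fit(X_train, y_train)
        p = fit.predict_proba(X_val)[:, 1]
        dev = 2.0 * len(y_val) * log_loss(y_val, p)  # binomial deviance
        if best is None or dev < best[0]:
            best = (dev, fit)
    dev, fit = best
    return fit, dev, int(np.count_nonzero(fit.coef_))  # model, deviance, # nonzero
```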

Table 3 shows the particular combinations of variables that were chosen in the best models from each of the four groups. The main-effects-only model had 20 variables, the all-effects model had 50 variables, and the LASSO model had 109 variables (of which only 50 are shown). When regression models become this large, it is very difficult to glean any useful information from lists of included variables. Nevertheless, the table is presented for the sake of completeness.


**Table 3.** Chosen variables for the best model in each category. Variables are listed in descending order of coefficient magnitude. See the text for a description of the notation.

A compact notation is used in the table to reduce the space consumed by long lists of variables. In this notation, each of the 35 spectral bands in the original images (the main effects) is represented by its band number. Squared terms are written with a bar over the band number, and square root terms are written with a bar underneath. Interactions between two terms are indicated by a colon. So, for example, the notation $\underline{9}$ refers to the square root of band 9, and $11{:}\overline{17}$ refers to the interaction between band 11 and the square of band 17.
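To make the expanded feature space concrete, the sketch below builds the kind of factorial feature set this notation describes: each band, its square and its square root, plus pairwise interactions between terms. The exact composition of the paper's "all effects" set is not restated here, so this construction and the `bands` array are our assumptions.

```python
# Sketch of the factorial feature expansion implied by the notation
# (our construction; the paper's exact feature set may differ).
import itertools
import numpy as np

def build_features(bands):
    """bands: (n_pixels, 35) array of reflectances; column j is band j+1.
    Returns the expanded feature matrix and names echoing the paper's
    notation, writing over(b) for the square and under(b) for the root."""
    n_pixels, n_bands = bands.shape
    cols, names = [], []
    for j in range(n_bands):
        b = bands[:, j]
        cols += [b, b ** 2, np.sqrt(np.clip(b, 0.0, None))]
        names += [f"{j + 1}", f"over({j + 1})", f"under({j + 1})"]
    main = np.column_stack(cols)
    # Interactions between two terms are written "a:b" in the notation
    for a, b in itertools.combinations(range(len(names)), 2):
        cols.append(main[:, a] * main[:, b])
        names.append(f"{names[a]}:{names[b]}")
    return np.column_stack(cols), names
```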

#### **4.2. Predictive performance**


The final estimate of the performance of the four selected models is based on those models' predictions on the complete set of 37 test images. Together these images contain over 43 million pixels that were not used in any way during the model fitting and variable selection processes. Because these pixels were previously unused, they provide a more accurate approximation of the predictive power of the models (better than the validation data, which was not used for parameter estimation but was used repeatedly for variable selection). The results are shown in Table 4.


| **Model** | **OER (%)** | **CER0 (%)** | **CER1 (%)** |
|---|---|---|---|
| Model 1: RGB image | 10.4 | 0.5 | 98.6 |
| Model 2: main effects, 20 variables | 8.6 | 0.5 | 82.1 |
| Model 3: all effects, 50 variables | 8.1 | 1.9 | 63.5 |
| Model 4: all effects, LASSO (109 variables) | 7.8 | 1.2 | 66.0 |

**Table 4.** Summary of the selected models and their predictive performance on the test images.
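For reference, the three error measures reported in Table 2 and Table 4 could be computed as in the short sketch below (variable names are hypothetical; class 1 denotes smoke and class 0 denotes nonsmoke).

```python
# Sketch of the error measures used throughout: OER (overall error rate) and
# CER0/CER1 (classwise error rates for the nonsmoke/smoke classes).
import numpy as np

def error_rates(y, p, cutoff=0.5):
    """y: true labels (1 = smoke); p: predicted probabilities of smoke."""
    yhat = (p >= cutoff).astype(int)
    oer = float(np.mean(yhat != y))            # all pixels
    cer0 = float(np.mean(yhat[y == 0] == 1))   # nonsmoke pixels called smoke
    cer1 = float(np.mean(yhat[y == 1] == 0))   # smoke pixels called nonsmoke
    return oer, cer0, cer1
```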

Figure 4 illustrates the trade-off between the different error types as the cutoff *c* is varied, for the 50-variable all effects model. The plot shows OER, CER0, and CER1 as functions of the cutoff. We can see that the overall error rate is in fact minimized at the original cutoff of 0.5, so changing the cutoff to improve performance on the smoke class will unfortunately come at the cost of worse overall performance. This notwithstanding, both OER and CER0 are relatively flat over the cutoff range (0.3, 0.5). So, for example, setting the cutoff to 0.4 will reduce the classwise error rate of smoke pixels to 50%, while increasing the OER only slightly.

**Figure 4.** The effect of the decision cutoff on the overall and classwise error rates, for model 3.
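The curves in Figure 4 can be reproduced in outline by sweeping the cutoff and recomputing the three error rates at each value, as in this sketch. It reuses the hypothetical `error_rates` helper above, with random placeholders standing in for model 3's validation predictions.

```python
# Sketch of the cutoff sweep behind Figure 4, using placeholder predictions.
import numpy as np

rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, size=100_000)   # placeholder true labels
p_val = rng.uniform(size=100_000)          # placeholder predicted probabilities

cutoffs = np.linspace(0.01, 0.99, 99)
curves = np.array([error_rates(y_val, p_val, c) for c in cutoffs])
oer, cer0, cer1 = curves.T                 # one curve per error measure
print("OER-minimizing cutoff:", cutoffs[np.argmin(oer)])  # ~0.5 in the paper
```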

#### **5. Discussion**

The experimental results are interpreted and discussed below, beginning with several remarks about model selection and performance evaluation, and followed by a qualitative evaluation of the classification results. Afterwards, a variety of suggestions for further improvement are provided.


#### **5.1. Remarks on the selected models**


The classification error rates were reported in Table 2 (for all models, on the validation set) and Table 4 (for the best models in each group, on the test set). Considering these tables, we see that our concern about the dominance of the nonsmoke class (class 0) in the data set was justified. All of the models had overall error rates less than about 10%, which seems good at first glance. However, in all cases this low error rate was achieved by having a very low error rate in the nonsmoke class (CER0) and a high error rate in the smoke class (CER1). This problem is particularly severe for smaller models and smaller sets of candidate variables, but even the best model in group 3 (the 50-variable model) had 56% misclassification of the smoke pixels.

Comparing the best models from each group, the only two models that can be considered even moderately successful are the two largest ones, the 50-variable all effects model (model 3) and the 109-variable LASSO model (model 4). There is little to separate these two classifiers: both have overall error rates of about 8% on the test set, with model 4 having a slight advantage; but model 3 has better performance on the smoke class.

Interestingly, these two models share only one variable in common (it happens to be $11{:}\underline{6}$). This is a consequence of the huge feature space and of the correlations among predictors. Two different models containing disjoint sets of variables can both have similar predictive power. This observation is related to the following two remarks.

> *Remark 1: physical interpretability of selected variables.* It is desirable from a scientific and intellectual standpoint to be able to interpret the structure of a predictive model in terms of physical principles, but this is not always straightforward in a machine learning context. In the case of the spectral signature of smoke, a few general characteristics have been observed. Smoke scatters visible light [20], a component of it (organic carbon) is strongly absorbing below about 0.6 *μm* [21], and it is largely transparent in the middle infrared [22, 23]. We endeavored to interpret our models in light of these observations, but were unable to find any simple and unambiguous relationships based on the patterns of variables included in the models. This is often the price to pay for focusing on out-of-sample predictive accuracy: the classifier becomes a "black box" with internal structure that defies simple interpretation.

> *Remark 2: interpretability of model coefficients.* Noticeably absent from the discussion so far has been the actual values of the regression coefficients in the fitted models. This has been deliberate, because in a pure classification problem like this one the predictive performance of the model as a whole is the overriding concern. Interpretability of model coefficients is desirable, but is likely not achievable when we have models with dozens of predictors that are all interactions. Assessment of the statistical significance of particular predictors also adds nothing to our understanding of the model as a classifier, and is best avoided.

**Figure 5.** Results on a test image. Left: the RGB image. Right: the predicted probability map using the 50-variable model. The red contour delineates the true smoke region.

#### **5.2. Qualitative performance analysis**

Based purely on the observed numerical measures of prediction accuracy, it seems clear that none of the classifiers considered have performance good enough for real-world application, primarily because the majority of smoke pixels are misclassified in all cases. Visual inspection of the predictions on the test images can yield further insight into the nature of the problem, and possible causes of difficulty. Figure 5 and Figure 6 provide prototypical examples drawn from the test images. Our qualitative conclusions about predictive performance, based on the full set of 37 images, are listed below.


**1.** *Smoke-free images are generally classified well.* The classifier does have *some* ability to detect smoke, so it is still encouraging to observe that smoke-free images, or large regions that are smoke-free, are generally classified accurately. This can be observed in the bottom and left portions of Figure 6, which are assigned low probabilities throughout, despite the presence of clouds, water, and various types of terrain.

**2.** *Clouds and smoke can be distinguished well from one another.* It was observed that throughout the 37 test images, there were very few instances where cloud was erroneously identified as smoke. This provides at least some encouragement that the use of hyperspectral data holds benefits, because distinguishing clouds from smoke visually using the RGB images can be quite difficult.

**3.** *Snow and ice can be distinguished from smoke, but with greater difficulty.* A similar comment can be made about snow and ice, but less emphatically. The classifier generally performed well in separating smoke from snow and ice, but performance was less consistent. In certain images this task seemed to pose no problem, while in other images significant numbers of snow or ice pixels were incorrectly labelled smoke. Both Figure 5 and Figure 6 provide some evidence of this, with moderate probabilities being mapped over the Coast Mountains in the upper left of either image.

**Figure 6.** Another example prediction. Top: RGB image. Bottom: predicted probability map.

**6.** *The quality of the training data is a major impediment to classifier construction.* Perhaps the most significant problem inherent in this study is uncertainty about the assigned classes in the original images themselves. Various portions of the images proved extremely difficult to assign to one class or the other with high confidence during the masking step. The aforementioned regions of mixed smoke and cloud provide one example. Regions where smoke becomes less concentrated provide another example (see Figure 5): where does the smoke end and the nonsmoke begin? In the same figure, we see a third example. A large number of pixels in a region over the mountains are "erroneously" assigned a high probability of being smoke. Is this a classification error, or an error in masking the original RGB image? The RGB image has a hazy appearance in this region, but it was not assigned to the smoke class due to the absence of a local fire and the general uncertainty about the nature of this hazy appearance. After the fact, it seems plausible that the classifier is detecting smoke that was erroneously labelled nonsmoke in the data set.

#### **5.3. Opportunities for improvement**

While the classification results were mixed, we feel there were enough positive elements to warrant further investigation, and that the overall approach can still be successful with appropriate modifications and extensions.

Probably the clearest opportunity for improvement is to alleviate the uncertainty in the true class labels that exists throughout the data set, and was illustrated in Figure 5 and Figure 6. The ambiguity in distinguishing smoke from nonsmoke at various places in the RGB images is a fundamental limitation. Simple approaches to solving this problem include considering only smoke plumes or "thick" smoke; excluding pixels that the photointerpreter finds ambiguous or that contain both cloud and smoke; or labelling images with more than two classes. More involved approaches include modelling each pixel as a mixture of different components, or modelling some continuous measure of smoke concentration rather than a binary presence/absence response. An unsupervised learning (clustering) approach or a semisupervised method (where only some pixels are labelled) could also be considered, though such methods make quantitative performance assessment more difficult.

Another avenue for potential improvement of classification performance is to modify the feature space in the logistic model in the hopes of improving the separability of smoke and nonsmoke. While this could be done by adding even more factorial terms (cubic terms, higher-order interactions, and so on), it is unlikely that the benefit of doing so would outweigh the increase in computational burden. Instead, more focused modifications of the model could be considered. To reduce the effect of highly heterogeneous surface terrain in the nonsmoke class, for instance, a baseline spectrum (perhaps taken as an average of observations over recent clear-sky days) could be included as predictors in the model. Or each pixel could be assigned to a known ground-cover class at the outset, and these classes could be included in the model as categorical variables. Another option is to replace the fixed powers of reflectance we used (squared and square root terms) with spline functions, allowing data-adaptive nonlinear transformations of the variables to be used in the model. We anticipate exploring some of these alternatives in future work with these data.
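As a sketch of the spline idea, scikit-learn's `SplineTransformer` could replace the fixed squared and square-root transforms with a data-adaptive basis expansion of each band. The basis type, knot count, and penalty settings below are our assumptions, not a prescription from the text.

```python
# Sketch: spline basis expansion of each band in place of fixed power terms
# (assumes scikit-learn >= 1.0; settings are illustrative only).
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

spline_model = make_pipeline(
    SplineTransformer(degree=3, n_knots=5, include_bias=False),  # per-band cubic B-splines
    LogisticRegression(penalty="l1", solver="saga", C=1.0, max_iter=2000),
)
# spline_model.fit(X_train, y_train)  # X_train: (n_pixels, 35) band reflectances
```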

Additional possibilities for improvement can be found by moving farther from the logistic regression framework. Under the assumption of independent pixels, for example, any of the many existing classification tools could be applied to the data. The support vector machine (see, e.g., [24], Ch. 11) in particular is a state-of-the-art method that has performed well across a variety of tasks and is worthy of consideration. If the independence assumption is dropped, the autologistic regression model [25], a model for spatially-correlated binary responses, is a natural fit for these data. This model would alleviate the problem of noise in the predicted probabilities, producing smoother and more accurate prediction maps. It is a natural extension of logistic regression to spatially-associated data. Finally, it may also be possible to incorporate relevant ancillary information (for example, prior knowledge of fire locations and wind directions) into a classification model to improve predictive power. Again, consideration of these alternatives and extensions is planned in future work.
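As one concrete instance of the first suggestion, a linear support vector machine could be tried on the same pixel features under the independence assumption. The sketch below (scikit-learn, with illustrative settings of our choosing) also shows one way to counter the class imbalance noted earlier.

```python
# Sketch: a linear SVM on independent pixels, with class weighting to offset
# the dominance of the nonsmoke class (illustrative settings only).
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

svm = make_pipeline(
    StandardScaler(),                           # SVMs are scale-sensitive
    LinearSVC(C=1.0, class_weight="balanced"),  # upweight the rare smoke class
)
# svm.fit(X_train, y_train)
```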
