**Toward a Better Quality Control of Weather Data**

Kenneth Hubbard, Jinsheng You and Martha Shulski

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/51632

## **1. Introduction**

Previous studies have documented various QC tools for use with weather data (26; 4; 6; 25; 9; 3; 10; 16; 18). As a result, there has been good progress in the automated QC of weather indices, especially the daily maximum/minimum air temperature. The QC of precipitation is more difficult than that of temperature because the confidence with which outliers can be identified depends on the spatial and temporal variability of the variable (2), and precipitation is highly variable in both respects. Another approach to maintaining data quality is to conduct intercomparisons of redundant measurements taken at a site. For example, the designers of the United States Climate Reference Network (USCRN) made it possible to compare redundant measurements by specifying a rain gauge with multiple vibrating wires in order to avoid a single point of failure in the measurement process. In this case the three vibrating wires can be compared to determine whether or not the outputs agree, and any outlying value can result in a site visit. The USCRN also includes three temperature sensors at each site for the purpose of comparison.
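
The triple-redundancy comparison described above can be sketched in a few lines. This is only an illustration, not USCRN's actual procedure; the function name and the tolerance value are our assumptions:

```python
def flag_outlying_sensor(readings, tolerance):
    """Compare redundant sensor readings (e.g., three vibrating wires)
    against their median; return the indices of sensors whose reading
    deviates from the median by more than `tolerance`.

    `tolerance` is an assumed absolute threshold, not a USCRN value.
    """
    ordered = sorted(readings)
    median = ordered[len(ordered) // 2]  # middle value for a triplet
    return [i for i, r in enumerate(readings)
            if abs(r - median) > tolerance]

# Example: the third wire disagrees with its two companions,
# which would prompt a closer look (e.g., a site visit).
print(flag_outlying_sensor([12.4, 12.5, 9.8], tolerance=1.0))  # [2]
```
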

Generally, identifying outliers involves tests designed to work on data from a single site (9) or tests designed to compare a station's data against the data from neighboring stations (16). Statistical decisions play a large role in quality control efforts, but increasingly there are rules introduced which depend upon the physical system involved. Examples are the testing of hourly solar radiation against the clear-sky envelope (Allen, 1996; Geiger et al., 2002) and the use of soil heat diffusion theory to determine soil temperature validity (Hu et al., 2002). It is now realized that quality assurance (QA) works best as a seamless process between the staff operating the quality control software at a centralized location where data are ingested and the technicians responsible for maintenance of sensors in the field (16; 10).
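
A physical-rule check such as the clear-sky envelope test reduces, in code, to comparing each measurement against a modeled upper bound. The sketch below assumes the envelope values are supplied by a clear-sky model (such as that of Allen, 1996); the function name and the small slack factor are our assumptions, not part of the cited methods:

```python
def flag_above_envelope(measured, clear_sky, slack=1.02):
    """Return indices of hourly solar radiation values that exceed
    the clear-sky envelope.

    `clear_sky` must come from a clear-sky model for the same site
    and hours; `slack` allows a small instrument tolerance above the
    envelope (an assumed value).
    """
    return [i for i, (m, c) in enumerate(zip(measured, clear_sky))
            if m > slack * c]

# Hour 1 reports 900 W/m^2 against an 800 W/m^2 envelope -> flagged.
print(flag_above_envelope([300, 900, 450], [600, 800, 500]))  # [1]
```
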

© 2012 Hubbard et al.; licensee InTech. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Quality assurance software consists of procedures or rules against which data are tested. Each procedure will either accept the data as being true or reject the data and label it as an outlier. This hypothesis (Ho) testing of the data and the statistical decision to accept the data or to note it as an outlier can have the outcomes shown in Table 1:

| **Statistical Decision** | **True situation: Ho true** | **True situation: Ho false** |
|---|---|---|
| Accept Ho | No error | Type II error |
| Reject Ho | Type I error | No error |


**Table 1.** The classification of possible outcomes in testing of a quality assurance hypothesis.

Take the simple case of testing a variable against limits. If we take as our hypothesis that the data for a measured variable are valid only if they lie within ±3σ of the mean (X), then assuming a normal distribution we expect to accept Ho 99.73% of the time in the absence of errors. The values that lie beyond X±3σ will be rejected, and we will make a Type I error whenever we encounter valid values beyond these limits. In these cases we are rejecting Ho when the value is actually valid, so we expect to make a Type I error 0.27% of the time, assuming for this discussion that the data contain no errant values. If we encounter a bad value inside the limits X±3σ, we will accept it when it is actually false (the value is not valid), and this leads to a Type II error. In this simple example, reducing the limits against which the data values are tested will produce more Type I errors and fewer Type II errors, while increasing the limits leads to fewer Type I errors and more Type II errors. For quality assurance software, study is necessary to achieve a balance wherein one reduces the Type II errors (marks more "errant" data as having failed the test) while not increasing Type I errors to the point where valid extremes are brought into question. Because Type I errors cannot be avoided, it is prudent for data managers to always keep the original measured values regardless of the quality testing results, and to offer users an input for specifying the limits ±*f*σ beyond which the data will be marked as potential outliers.
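
The 0.27% Type I rate for a ±3σ limits test can be confirmed with a small simulation. This is an illustrative sketch (function name ours), not part of the chapter's methodology:

```python
import random

random.seed(42)

def limit_test_rejections(n, f):
    """Fraction of error-free N(0, 1) values rejected by a +/- f*sigma
    limits test.  Because the simulated data contain no errant values,
    every rejection counted here is a Type I error."""
    values = [random.gauss(0.0, 1.0) for _ in range(n)]
    rejected = sum(1 for v in values if abs(v) > f)
    return rejected / n

# With f = 3 the expected Type I rate is about 0.0027 (0.27%).
# Narrowing the limits (smaller f) raises it; widening lowers it,
# at the cost of letting more bad values through (Type II errors).
print(limit_test_rejections(200_000, 3.0))
```
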

In this chapter we point to three major contributions. The first is the explicit treatment of Type I and Type II errors in the evaluation of the performance of quality control procedures, to provide a basis for comparison of procedures. The second is to illustrate how the selection of parameters in the quality control process can be tailored to individual needs in regions or sub-regions of a wide-spread network. Finally, we introduce a new spatial regression test (SRT) which uses a subset of the neighboring stations to provide the "best fit" to the target station. This spatial regression weighted procedure produces non-biased estimates with characteristics which make it possible to specify statistical confidence intervals for testing data at the target station.

## **2. A dataset with seeded errors**

A dataset consisting of original data and seeded errors (18) is used to evaluate the performance of the different QC approaches for temperature and precipitation. The QC procedures can be tracked to determine the number of seeded errors that are identified. The ratio of errors identified by a QC procedure to the total number of errors seeded is a metric that can be compared across the range of error magnitudes introduced. The data used to create the seeded-error dataset were from the U.S. Cooperative Observer Network as archived in the National Climatic Data Center (NCDC). We used the Applied Climate Information System (ACIS) to access stations with daily data available for all months from 1971–2000 (see 24). The data have been assessed using NCDC procedures and are referred to as "clean" data. Note, however, that "clean" does not necessarily imply that the data are true values but means instead that the largest outliers have been removed.


About 2% of all observations were selected on a random basis to be seeded with an error. The magnitude of the error was also determined in a random manner. A random number, *r*, was selected using a random number generator operating on a uniform distribution with a mean of zero and a range of ±3.5. This number was then multiplied by the standard deviation (σ<sub>x</sub>) of the variable in question to obtain the error magnitude *E*<sub>x</sub> for the randomly selected observation *x*:

$$E_x = \sigma_x\, r \tag{1}$$

The value of *r* is not used when the error would produce negative precipitation, i.e., when (*E*<sub>x</sub> + *x*) < 0. Thus the seeded errors are skewed in distribution for *r* < 0 but roughly uniformly distributed for *r* > 0. The selection of 3.5 for the range is arbitrary but does serve to produce a large range of errors (±3.5σ<sub>x</sub>). This approach to producing a seeded dataset is used below in some of the comparisons.
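
The seeding procedure of Eq. (1) can be sketched as follows. The function name is ours, and the source does not specify whether a rejected draw of *r* is redrawn or simply skipped; this sketch assumes the observation is left unseeded:

```python
import random

def seed_errors(values, sigma, seed_fraction=0.02, r_range=3.5,
                nonnegative=False):
    """Return a copy of `values` with errors seeded per Eq. (1):
    roughly `seed_fraction` of observations receive x + E, where
    E = sigma * r and r is uniform on [-r_range, +r_range].
    For variables such as precipitation (`nonnegative=True`), a draw
    of r that would make the seeded value negative is not used."""
    seeded = list(values)
    for i in range(len(seeded)):
        if random.random() >= seed_fraction:
            continue  # ~98% of observations are left unchanged
        r = random.uniform(-r_range, r_range)
        error = sigma * r
        if nonnegative and seeded[i] + error < 0:
            continue  # r "not used": would give negative precipitation
        seeded[i] += error
    return seeded
```

Skipping the negative-precipitation draws is what skews the distribution of seeded errors for *r* < 0, as described above.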

## **3. The spatial regression test (SRT) and inverse distance weighted estimates (IDW)**

When checking data from a site, missing values are sometimes present. For modeling and other purposes where continuous data are required, an estimate is needed for the missing value. We will refer to the station which is missing the data as the target station. The IDW method has been used to make estimates (*x*′) at the target station from the surrounding observations (*x*<sub>i</sub>):

$$x' = \sum_{i=1}^{N} x_i\, f(d_i) \Big/ \sum_{i=1}^{N} f(d_i) \tag{2}$$

where *d*<sub>i</sub> is the distance from the target station to each of the nearby stations and *f*(*d*<sub>i</sub>) is a weighting function relying on *d*<sub>i</sub> (in our case we took *f*(*d*<sub>i</sub>) = 1/*d*<sub>i</sub>). This approach assumes that the nearest stations will be most representative of the target site.

*Spatial Regression (SRT)* is a new method that provides an estimate for the target station and can be used to check that the observation (when not missing) falls inside the confidence in‐