### **8. Issues relating QC to gridded datasets**

stations with a length of at least 40 years of observations for all three variables: precipitation (PRCP), maximum (Tmax), and minimum (Tmin) temperatures. Paper records were scrutinized to identify reported but previously non-digitized data, to reduce, to the extent possible, the number of missing data. A list of 2144 stations was compiled for the sites that met the criterion of at least 40 years of data with no continuous gap longer than two months for at least one of the three variables. The remaining missing data in the dataset were supplemented by estimates obtained from measurements made at nearby stations. The spatial regression test (SRT) and the inverse distance weighted (IDW) method were adopted in a dynamic data filling procedure to provide these estimates. The replacement of missing values follows a reproducible process that uses robust estimation procedures and results in a serially complete data set (SCD) for 2144 stations that provides a firm basis for climate analysis. Scientists who have used more qualitative or less sophisticated quantitative QC techniques may wish to use this data set so that direct comparisons to other studies that used this SCD can be made without worry about how differences in missing-data procedures would influence the results. A drought atlas based on data from the SCD will provide decision

After identifying stations with a long-term (at least 40 years) continuous (no data gaps longer than two months) dataset of Tmax, Tmin, and/or PRCP, for a total of 2144 stations, the missing values in the original dataset retrieved from ACIS were filled to the extent possible with the keyed data from paper records and with estimates from the SRT and IDW methods. Two implementations of SRT were applied in this study. The short-window (60 days) implementation provides the best estimates based on the most recent information available for constructing the regression. The second implementation of SRT fills the long gaps, e.g., gaps longer than one month, using the data available on a yearly basis. The IDW method was

This is the first serially complete data set where a statement of confidence can be associated with many of the estimates, i.e., the SRT estimates. The RMSE is less than 1°F in most cases, and thus we are 95% confident that the value, if available, would lie within ±2°F of the estimate.
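The confidence statement above follows from assuming roughly normal, unbiased regression residuals, so that a ±1.96·RMSE band covers about 95% of cases. A minimal sketch of that arithmetic (the helper name `confidence_interval` is ours, not from the SCD code):

```python
# Sketch of the confidence statement attached to SRT estimates: if the
# regression residuals are unbiased and roughly normal, an RMSE below
# 1 degree F implies ~95% confidence in a band of about +/- 2*RMSE.
# Illustrative helper; names are ours, not from the SCD implementation.

def confidence_interval(estimate_f, rmse_f, z=1.96):
    """Return (low, high) bounds of the ~95% interval for an SRT estimate."""
    half_width = z * rmse_f
    return (estimate_f - half_width, estimate_f + half_width)

low, high = confidence_interval(72.5, 1.0)
print(round(low, 2), round(high, 2))  # 70.54 74.46
```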

of severe heat, cold, and dryness. Probabilities related to extreme rainfall for flooding and erosion potential can be derived, along with indices to reflect the impact on livestock production. The data set is offered as an option to distributing raw data to users who need this level of spatial and temporal coverage but are not well positioned to spend time and resour‐

Analysis based on the long-term dataset will best reveal the regional and large-scale climatic variability in the continental U.S., making this an ideal data set for the development of a new drought atlas and associated drought index calculations. Future data observations can be easily appended to this SCD with the dynamic data filling procedures described herein.

to interested parties and can be used in crop models, assessment

adopted to fill any remaining missing data after the two implementations of SRT.
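The two-stage filling order just described (an SRT regression estimate first, IDW as the fallback) can be sketched as follows. This is a simplified, single-neighbor version under our own naming; the actual SCD procedure uses dynamically selected windows and multiple neighboring stations:

```python
# Hedged sketch of the two-stage gap-filling order: try a spatial
# regression (SRT) estimate first, and fall back to inverse distance
# weighting (IDW) when no usable regression is available.
# Function and variable names are illustrative, not from the SCD code.
import math

def idw_estimate(target_xy, neighbors):
    """neighbors: list of ((x, y), value); classic 1/d**2 weighting."""
    num = den = 0.0
    for (x, y), value in neighbors:
        d = math.hypot(x - target_xy[0], y - target_xy[1])
        if d == 0.0:
            return value  # collocated station: use its value directly
        w = 1.0 / d**2
        num += w * value
        den += w
    return num / den

def srt_estimate(neighbor_series, target_series, neighbor_value, min_overlap=30):
    """Regress target on one neighbor over a short window (e.g. 60 days);
    return the regression estimate, or None if the overlap is too short."""
    pairs = [(n, t) for n, t in zip(neighbor_series, target_series)
             if n is not None and t is not None]
    if len(pairs) < min_overlap:
        return None
    n_mean = sum(n for n, _ in pairs) / len(pairs)
    t_mean = sum(t for _, t in pairs) / len(pairs)
    sxx = sum((n - n_mean) ** 2 for n, _ in pairs)
    if sxx == 0.0:
        return None
    slope = sum((n - n_mean) * (t - t_mean) for n, t in pairs) / sxx
    return t_mean + slope * (neighbor_value - n_mean)
```

In use, a missing value would first be passed to `srt_estimate`, and only when that returns `None` (too little overlapping record) would `idw_estimate` supply the fill value.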

makers more support in their risk management needs.

This data set is available¹

24 Practical Concepts of Quality Control

ces to fill gaps with acceptable estimates.

¹ Contact the High Plains Regional Climate Center at 402-472-6709

Gridded datasets are sometimes used in QC, but we caution against this for the following reasons. New datasets created from inverse distance weighted methods or kriging suffer from uncertainties. The values at a grid point are usually not "true" measurements but are interpolated from the measurements at nearby stations in the weather network. Thus, the values at the grid points are susceptible to bias. When further interpolation is made to a given location within the grid, bias will again exist at that specific location between the gridded values. Fig. 10 provides an example of potential bias. Outside of a gridded data set, the target location would give a large weight to the value at station 5. However, if the radius used for the gridded data is as shown in Fig. 10, then the closest station to the target location (station 5) will not be included in the grid-based estimation.

**Figure 10.** An example of station distribution used in the grid method.
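The situation in Fig. 10 can be reproduced numerically. In this sketch the coordinates, radius, and helper name are all invented for illustration; the point is only that a fixed search radius around the grid point can exclude the station nearest to the target:

```python
# Illustrative sketch of the Fig. 10 situation: a gridded product that
# only uses stations within a fixed radius of each grid point can exclude
# the station closest to the target location. All coordinates are made up.
import math

def stations_in_radius(center, stations, radius):
    """Stations a gridded scheme would use for the grid point at `center`."""
    return [s for s in stations
            if math.hypot(s[0] - center[0], s[1] - center[1]) <= radius]

stations = [(0.0, 9.0), (9.0, 9.0), (0.0, 0.0), (9.0, 0.0), (5.0, 4.2)]
station_5 = stations[-1]
grid_point = (4.5, 9.0)      # grid node nearest the target
target = (4.8, 4.5)          # location we actually want an estimate for

used_by_grid = stations_in_radius(grid_point, stations, radius=4.6)
nearest_to_target = min(stations,
                        key=lambda s: math.hypot(s[0] - target[0], s[1] - target[1]))
print(station_5 in used_by_grid, nearest_to_target == station_5)  # False True
```

Here the grid point's estimate is built from two distant stations while station 5, only a fraction of a unit from the target, contributes nothing — the bias the text warns about.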

### **9. Quality control of high temporal resolution datasets**

The Oklahoma Mesonet (http://www.mesonet.org/) measures and archives weather conditions at 5-minute intervals (Shafer et al., 2000). The quality control system used in the network starts from the raw data of the measurements for the high temporal resolution data. A set of QC tools was developed to routinely maintain the data of the Mesonet. These tools depend on the status of hardware and on measurement flag sets built into the climate data system. The Climate Reference Network (CRN, Baker et al. 2004) is another example of QC of high frequency data; it installs multiple sensors for each variable to guarantee the continuous operation of the weather station, and thus the quality control can also rely on the multiple measurements of a single variable. This method is efficient at detecting instrumental failures or other disturbances; however, the cost of such a network may be prohibitive for non-research or operational networks. The authors of this chapter also carried out QC on a high temporal resolution dataset in the Beaufort and Chukchi Sea regions. Surface meteorological data from more than 200 stations in a variety of observing networks and various stand-alone projects were obtained for the MMS Beaufort and Chukchi Seas Modeling Study (Phase II). Many stations have a relatively short period of record (i.e., less than 10 years). The traditional basic QC procedures were developed and tested for daily data and found in need of improvement for the high temporal resolution data. In the modification, the time series of the maximum and the minimum were calculated from the high resolution data. The mean and standard deviation of the maximum and the minimum (e.g., of max and min temperatures) can then be calculated from these time series as (u_x, s_x) and (u_n, s_n), respectively. Equation (6), using (u_x + f·s_x) and (u_n − f·s_n), then forms the upper limit for the maximum and the lower limit for the minimum. A value falling outside the limits will be flagged as an outlier for further manual checking. Similarly, the diurnal change of a variable (e.g. temperature) was calculated from the high resolution (hourly or sub-hourly) data, and the mean and standard deviation calculated from the diurnal changes form the corresponding limits.
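A minimal sketch of this limit test follows, with our own illustrative names and an assumed factor f = 3; the chapter's equation (6) has the same u ± f·s form:

```python
# Minimal sketch of the modified limit test described above: compute the
# mean and standard deviation of the series of maxima (u_x, s_x) and of
# minima (u_n, s_n), then flag values outside [u_n - f*s_n, u_x + f*s_x].
# The factor f and all names here are illustrative choices of ours.
import statistics

def limit_bounds(maxima, minima, f=3.0):
    """Upper limit from the series of maxima, lower limit from the minima."""
    ux, sx = statistics.mean(maxima), statistics.pstdev(maxima)
    un, sn = statistics.mean(minima), statistics.pstdev(minima)
    return un - f * sn, ux + f * sx

def flag_outliers(observations, lower, upper):
    """Values outside the limits, kept aside for further manual checking."""
    return [obs for obs in observations if not lower <= obs <= upper]

lower, upper = limit_bounds([20, 22, 21, 23, 20], [5, 6, 4, 5, 5])
print(flag_outliers([10, 30, -5, 20], lower, upper))  # [30, -5]
```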


Toward a Better Quality Control of Weather Data

http://dx.doi.org/10.5772/51632



**Author details**

Kenneth Hubbard\*, Jinsheng You and Martha Shulski

\*Address all correspondence to: khubbard1@unl.edu

High Plains Regional Climate Center, University of Nebraska, Lincoln, NE, USA

**References**

[1] Barnett, V., & Lewis, T. (1994). *Outliers in Statistical Data* (3rd ed.). J. Wiley and Sons, 604 pp.

[2] Camargo, M. B. P., & Hubbard, K. G. (1999). Spatial and temporal variability of daily weather variables in sub-humid and semi-arid areas of the U.S. High Plains. *Ag. and Forest Meteor.*, 93, 141-148.

The traditional quality control methods were improved for examining the high temporal resolution data, to avoid intensive manual reviewing, which is neither timely nor cost efficient. The problems identified in the dataset demonstrate that the improved methods did find considerable errors in the raw data, including time errors (e.g., a month value greater than 12). These new tools offer a dataset that, after manual checking of the flagged data, can be given a statement of confidence. The level of confidence can be selected by the user prior to QC.
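The kind of timestamp sanity check that catches a month value greater than 12 can be as simple as the following sketch; the record layout and function name are our assumptions, not the project's code:

```python
# Sketch of the timestamp sanity check that caught the "month greater
# than 12" errors mentioned above (record layout is an assumed example).

def time_errors(record):
    """Return human-readable problems with a (year, month, day, hour) tuple."""
    year, month, day, hour = record
    problems = []
    if not 1 <= month <= 12:
        problems.append(f"month out of range: {month}")
    if not 1 <= day <= 31:
        problems.append(f"day out of range: {day}")
    if not 0 <= hour <= 23:
        problems.append(f"hour out of range: {hour}")
    return problems

print(time_errors((1998, 14, 3, 7)))  # ['month out of range: 14']
```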

The applied in-station limit tests can successfully identify outliers in the dataset. However, spatial tests based on information from the neighboring stations are more robust in many cases and identify errors or outliers in the dataset when strong correlation exists. The good relationship between the measurements at station pairs demonstrates that there is a potential opportunity to successfully apply the spatial regression test (SRT, 18) to stations which measure the same variables (i.e., air temperature or wind speed). The short-term measurements at some stations may not be efficiently QC'ed with only the three methods described in this work. One example is the dew point measurements at the first-order station Iultin in Chukotka. More than 90 percent of the dew point measurements were flagged because the parameters for QC'ing the variable were the state-wide parameters, which cannot reflect the microclimate of each station.

### **10. Summary and Conclusions**

Quality control (QC) methods can never provide total proof that a data point is good or bad. Type I errors (false positives) or Type II errors (false negatives) can occur and result in the labeling of good data as bad and of bad data as good, respectively. Decreasing the number of Type I and Type II errors is difficult because a push to decrease Type I errors will often result in an unintended increase in Type II errors, and vice versa. We have derived a spatial technique to introduce thresholds associated with user-selected probabilities (e.g., select 99.7% as the level of confidence that a data value is an outlier before labeling it as bad and/or replacing it with an estimate). We base this technique on statistical regression in the neighborhood of the data in question and call it the Spatial Regression Test (SRT). Observations taken in a network are often affected by the same factors. In weather applications, individual stations in a network are generally exposed to air masses in much the same way as neighboring stations. Thus, temperatures in the vicinity move up and down together, and the correlation between data in the same neighborhood is very high. Similarly, the seasonal forcings on this neighborhood (e.g. the day-to-day and seasonal solar irradiance) are essentially the same. We have defined a neighborhood for a station as those nearby stations that are best correlated with it. We found that the SRT method is an improvement over conventional inverse distance weighted (IDW) estimates. A major benefit of the SRT method is its ability to remove systematic biases in the data estimation process. Additionally, the method allows a user-selected threshold on the probability, in contrast to IDW. Although the SRT estimates are similar to IDW estimates over smooth terrain, SRT estimates are notably superior over complex terrain (mountains) and in the vicinity of other climate forcings (e.g. ocean/land boundaries).
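A rough single-neighbor sketch of the SRT decision rule follows, with f = 3 standing in for the ~99.7% confidence level under normality. The names and the one-neighbor simplification are ours; the full method combines estimates from several best-correlated neighbors:

```python
# Hedged sketch of the SRT decision rule summarized above: regress the
# target station on a well-correlated neighbor, form an estimate, and
# flag the observation only when it falls outside f standard errors of
# that estimate (f = 3 corresponds to ~99.7% under normality).
# Simplified single-neighbor variant; names are illustrative.
import math

def srt_flag(neighbor_obs, target_obs, neighbor_today, target_today, f=3.0):
    """True when today's target value lies outside the SRT confidence band."""
    n = len(neighbor_obs)
    nm = sum(neighbor_obs) / n
    tm = sum(target_obs) / n
    sxx = sum((x - nm) ** 2 for x in neighbor_obs)
    slope = sum((x - nm) * (y - tm) for x, y in zip(neighbor_obs, target_obs)) / sxx
    estimate = tm + slope * (neighbor_today - nm)
    # root-mean-square error of the regression over the fitting window
    rmse = math.sqrt(sum((tm + slope * (x - nm) - y) ** 2
                         for x, y in zip(neighbor_obs, target_obs)) / n)
    return abs(target_today - estimate) > f * max(rmse, 1e-9)
```

Because the band width scales with the regression's own RMSE, the same f yields tight thresholds for well-correlated station pairs and looser ones where the relationship is noisy, which is what distinguishes this rule from a fixed IDW tolerance.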
Gridded data sets that result from IDW, kriging, or most other interpolation schemes do not provide unbiased estimates. Even when the grid spacing is decreased to a point where the complexity of the land surface is well represented, there remain two problems: what is the microclimate of the nearest observation points, and what is the transfer function between points? This is a future challenge for increasing the quality of data sets and the estimation of data between observation sites.
