**6. Multiple interval methods based on measurements from reference stations for precipitation.**

One QC approach involved developing threshold quantification methods to identify a subset of data consisting of potential outliers in the precipitation observations, with the aim of reducing the manual checking workload. This QC method for precipitation was developed based on the empirical statistical distributions underlying the observations.

The search for precipitation quality control (QC) methods has proven difficult. The high spatial and temporal variability associated with precipitation data causes high uncertainty and edge creep when regression-based approaches are applied. Precipitation frequency distributions are generally skewed rather than normally distributed. The commonly assumed normal distribution in QC methods is not a good representation of the actual distribution of precipitation and is inefficient in identifying the outliers.

The SRT method is able to identify many of the errant data values, but the ratio of errant values found to type I errors made is conservatively 1:6. This is not acceptable because it would take excessive manpower to check all the flagged values that are generated in a nationwide network. For example, the number of precipitation observations from the cooperative network in a typical day is 4000. Using an error rate of 2% and considering the type I error rate indicates that several hundred values may be flagged, requiring substantial personnel resources for assessment.
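As a rough worked figure, assuming (for illustration only) that every errant value is detected and that the 1:6 ratio of detected errors to type I errors holds exactly:

$$4000 \times 0.02 = 80 \ \text{errant values}, \qquad 80 \times 6 = 480 \ \text{type I flags}, \qquad 80 + 480 = 560 \ \text{flagged values per day}$$

This is in line with the "several hundred" flagged values quoted above; the actual count depends on the detection rate, which is not stated here.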

(29) found the use of a single gamma distribution fit to all precipitation data was ineffective. A second test, the multiple intervals gamma distribution (MIGD) method, was introduced. It assumed that meteorological conditions that produce a certain range in average precipitation at surrounding stations will produce a predictable range at the target station. The MIGD method sorts data into bins according to the average of precipitation at neighboring stations; then, for the events in a specific bin, an associated gamma distribution is derived by fit to the same events at the target station. The new gamma distributions can then be used to establish the threshold for QC according to the user-selected probability of exceedance. We also employed the *Q* test for precipitation (20) using a metric based on comparisons with neighboring stations. The performance of the three approaches was evaluated by assessing the fraction of "known" errors that can be identified in a seeded error dataset (18). The single gamma distribution and *Q*-test approach were found to be relatively efficient at identifying extreme precipitation values as potential outliers. However, the MIGD method outperforms the other two QC methods. This method identifies more seeded errors and results in fewer Type I errors than the other methods.

#### **6.1. Estimation of parameters for distribution of precipitation and thresholds from the Gamma distribution**

The Gamma distribution was employed to represent the distribution of precipitation. While other functions may provide a better overall fit to precipitation data, our goal is to establish a reasonable threshold on values beyond which further checking will be required to determine if the value is an outlier or simply an extreme precipitation event. The precipitation events are fit to a Gamma distribution, *G*(*γ*, *β*). The shape and scale parameters *γ*, *β* can be estimated from the precipitation events following (21) and (13),

$$\gamma = \frac{\overline{X}^2}{s^2} \tag{11}$$


Toward a Better Quality Control of Weather Data

http://dx.doi.org/10.5772/51632


$$
\beta = \frac{s^2}{\overline{X}} \tag{12}
$$

where *X̄* and *s* are the sample mean and the sample standard deviation, respectively.
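As a minimal sketch of Eqs. (11) and (12), the moment-based estimates can be computed as follows. The function name `fit_gamma_moments`, the sample values, and the choice to drop zero totals before fitting are illustrative assumptions, not details taken from the study.

```python
import statistics

def fit_gamma_moments(precip):
    """Moment estimates: shape = mean^2 / s^2 (Eq. 11), scale = s^2 / mean (Eq. 12)."""
    events = [x for x in precip if x > 0]    # keep precipitation events only (assumption)
    if len(events) < 2:
        raise ValueError("need at least two precipitation events")
    mean = statistics.fmean(events)
    var = statistics.variance(events)        # sample variance s^2 (n - 1 denominator)
    if var == 0:
        raise ValueError("zero variance: gamma parameters are undefined")
    return mean * mean / var, var / mean

# Hypothetical daily totals in inches, for illustration only.
sample = [0.0, 0.12, 0.05, 0.0, 0.40, 0.08, 1.10, 0.0, 0.22, 0.03]
shape, scale = fit_gamma_moments(sample)
```

By construction the product `shape * scale` returns the sample mean of the events, which is a quick sanity check on any fit of this kind.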

The data for each station in the Gamma distribution test include all precipitation events on a daily basis for a year. The parameters for left-censored (0 values excluded) Gamma distributions, on a monthly basis, are also calculated, based on the precipitation events for individual months in the historical record. To ascertain the representativeness of the Gamma distribution, the precipitation value for the corresponding percentiles (*P*): 99, 99.9, 99.99, and 99.999% were computed from the Gamma distribution and compared with the precipitation values for given percentiles based on ranking (original data).

The criterion for a threshold test approach can be written as,

$$x(j, t) < I(p) \tag{13}$$

where *x*(*j*,*t*) is the observed daily precipitation on day *t* at station *j* and *I*(*p*) is the threshold daily precipitation for a given probability, *p* (= *P*/100), calculated using the Gamma distribution. A value not meeting this criterion is noted as a potential outlier (the shaded area to the right of the *p* = 0.995 value for the distribution for all precipitation events in Fig. 9). The test function uses the one-sided test for precipitation, a non-negative variable.
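The one-sided criterion in Eq. (13) can be illustrated with a short sketch. It is not the authors' R implementation: the helper names, the series evaluation of the gamma CDF, and the bisection search for *I*(*p*) are assumptions chosen to keep the example self-contained. In practice a statistics library's gamma quantile function would normally be used instead.

```python
import math

def gamma_cdf(x, shape, scale):
    """Gamma CDF from the power series of the regularized incomplete gamma
    function; intended for moderate values of x / scale."""
    if x <= 0:
        return 0.0
    z = x / scale
    term = 1.0 / shape
    total = term
    k = 1
    while term > 1e-16 * total:
        term *= z / (shape + k)
        total += term
        k += 1
    log_front = shape * math.log(z) - z - math.lgamma(shape)
    return min(1.0, total * math.exp(log_front))

def gamma_quantile(p, shape, scale):
    """Threshold I(p): the value whose cumulative probability is p (bisection)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    lo, hi = 0.0, shape * scale              # start the bracket at the mean
    while gamma_cdf(hi, shape, scale) < p:   # widen until the bracket holds p
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if gamma_cdf(mid, shape, scale) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def is_potential_outlier(x, shape, scale, p=0.995):
    """Eq. (13): a value passes when x < I(p); otherwise it is flagged."""
    return x >= gamma_quantile(p, shape, scale)
```

The default `p=0.995` mirrors the value shown in Fig. 9; the appropriate probability is left open in the text and should be treated as a tunable choice.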

#### **6.2. Multiple interval range limit gamma distribution test for precipitation (MIGD)**

Analysis has shown that precipitation data at a station can be fit to a Gamma distribution, which can then be applied to a threshold test approach. With this method only the most extreme precipitation events will be flagged as potential outliers, so errant data at other points in the distribution are not identified.


20 Practical Concepts of Quality Control

**Figure 9.** Schematic of gamma distribution for all daily precipitation events and for the ith interval of the MIGD approach.

The MIGD was developed to address these non-extreme points along the distribution. It assumes that meteorological conditions that produce a certain range in average precipitation at surrounding stations will produce a predictable range at the target station. Our concept is to develop a family of Gamma distributions for the station of interest and to selectively apply the distributions based on specific criteria. The average precipitation for each day is calculated for neighboring stations during a time period (e.g., 30 years). These values are ranked and placed into *n* bins with an equal number of values in each. The range for the *n* intervals can be obtained from the cumulative probabilities of the neighboring average time series, {0, 1/*n*, 2/*n*, …, (*n*-1)/*n*, 1}. For the *i*th interval, all corresponding precipitation values at the station of interest (target station) are gathered and the parameters for the gamma distribution are estimated. This process is repeated for each of the *n* intervals, resulting in a family of Gamma curves (*G<sub>i</sub>*). The operational QC involves the application of the threshold test, where the gamma distribution for a given day is selected from the family of curves based on the average precipitation for the neighboring stations. Each interval can be defined as (*ξ̄*(*p*(*i*/*n*)), *ξ̄*(*p*((*i*+1)/*n*))), where *p*(*i*/*n*) is the cumulative probability associated with *i*/*n*, *i* = 0 to *n*-1, and *ξ̄*(*p*(*i*/*n*)) is the neighboring stations' average for a given cumulative probability.

Now for each precipitation event, *x*, at the station of interest, the neighboring stations' average is calculated. If the average precipitation falls in the interval (*ξ̄*(*p*(*i*/*n*)), *ξ̄*(*p*((*i*+1)/*n*))), then *G<sub>i</sub>* is used to form a test:

$$G_i(1-p) < x(j,t) < G_i(p) \tag{14}$$

where *p* is a probability in the range (0.5, 1), and *G<sub>i</sub>*(*p*) is the precipitation value for the given probability *p* in the gamma distribution associated with the *i*th interval. This equation forms a two-sided test. Any value that does not satisfy this test will be treated as an outlier for further manual checking. The intervals and the estimation of this method were implemented using *R* statistical software (19).
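The binning and interval-selection steps of the MIGD can be sketched as follows. This is an illustrative outline in Python, not the R code used by the authors: the function names, the moment-based fit within each bin, the exclusion of zero totals, and the treatment of values on a bin edge are all assumptions.

```python
import bisect
import statistics

def build_migd_family(neighbor_avg, target, n_bins):
    """Return (edges, params): the upper limit of each equal-count bin of the
    neighbor average, and the (shape, scale) fitted to target values in that bin."""
    pairs = sorted(zip(neighbor_avg, target))       # rank days by neighbor average
    size = len(pairs) // n_bins
    if size < 2:
        raise ValueError("too few days for the requested number of bins")
    edges, params = [], []
    for i in range(n_bins):
        stop = (i + 1) * size if i < n_bins - 1 else len(pairs)
        chunk = pairs[i * size:stop]
        # Assumes each bin holds at least two distinct positive target values.
        events = [t for _, t in chunk if t > 0]
        mean = statistics.fmean(events)
        var = statistics.variance(events)
        params.append((mean * mean / var, var / mean))  # Eqs. (11) and (12)
        edges.append(chunk[-1][0])                      # upper limit of bin i
    return edges, params

def select_interval(edges, avg):
    """Index i of the bin whose range contains the day's neighbor average."""
    return min(bisect.bisect_left(edges, avg), len(edges) - 1)
```

Given the index from `select_interval`, the matching `(shape, scale)` pair supplies the two quantiles *G<sub>i</sub>*(1-*p*) and *G<sub>i</sub>*(*p*) needed for Eq. (14). Averages above the largest historical value are assigned to the last bin here, which is a design choice rather than something the text specifies.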

The results indicate that the Gamma distribution is well suited for deriving appropriate thresholds for a particular precipitation event. The calculated extreme values provide a good basis for identifying extreme outliers in the precipitation observations. The inclusion of all precipitation events reduces the data requirements for the quantification of extreme events, which generally requires a long time series of observations (e.g., using the Gumbel distribution). Using the approach based on the Gamma distribution, a suitable representation of the distribution of precipitation can be obtained with only a few years of observation, as is the case with newly established automatic weather stations, e.g., the Climate Reference Network. Further study is required for probability selection in the Gamma distribution approach.


**7. Quality control of the NCDC dataset to create a serially complete dataset.**

Development of continuous and high-quality climate datasets is essential to populate Web-distributed databases (17) and to serve as input to Decision Support Systems (e.g., 27).

Serially complete data are necessary as input to many risk assessments related to human endeavor, including the frequency analysis associated with heavy rains, severe heat, severe cold, and drought. Continuous data are also needed to understand the climate impacts on crop yield and ecosystem production. The National Drought Mitigation Center (NDMC) and the High Plains Regional Climate Center (HPRCC) at the University of Nebraska are developing a new drought atlas. The last drought atlas (1994) was produced with the data from 1119 stations ending in 1992. The forthcoming drought atlas will include additional stations and will update the analyses, maps, and figures through the period 1994 to the present time. A list was compiled from the Applied Climate Information System (ACIS) for


**Table 5.** Multiple gamma distributions (n=5) for the Multiple Interval Gamma Distribution (MIGD) method at Tucson, AZ. Lower and upper represent the lower and upper limits of each bin for surrounding station averages. The precipitation threshold for the target station can be selected from q999, q995, q99, q975, q95, q9, q1, q05, q025, q01, q005, and q001, as these are associated with the gamma distribution for the station of interest.

A simple gamma distribution can be fit to the daily precipitation values at a station. Upper thresholds can be set based on the cumulative probability of the precipitation distribution. This single gamma distribution (SGD) test will address the most extreme values of precipitation and flag them for further testing. However, to address non-extreme values of precipitation that are not out on the tail of the SGD, another approach is needed. We have formulated the multiple interval gamma distribution test (MIGD) for this purpose. The main assumption is that the meteorological conditions that produce a certain range in average precipitation at surrounding stations will produce a predictable range of precipitation at the target station. It does not estimate the precipitation at the target station but estimates the range into which the precipitation should fit.

The average precipitation for each day is calculated for neighboring stations during a historical period, say 30 years. These values are then ranked and placed into n bins with an equal number of values in each. For all the values in a given bin, the daily precipitation values at the target station are gathered and a gamma distribution is formed. The process is repeated n times, once for each bin, resulting in a family of gamma distribution curves. A separate family of curves can be derived for each month or each season. In operation, the daily average of the precipitation at surrounding stations is calculated and used to select the matching gamma distribution from the family, which in turn provides thresholds against which to test for that day. For instance, the upper threshold can be selected to correspond with a cumulative probability for the selected gamma distribution. The user is able to specify the threshold according to the cumulative probability. For example, we can be 99.5% confident that values will not exceed the corresponding value on the cumulative probability curve. Values that exceed this are not necessarily wrong but are flagged for further review. The MIGD will find more precipitation values that need to be reviewed than the single gamma distribution test.

Table 5 provides an example of the MIGD for n=5 at Tucson, AZ, USA. We update this type of information on an annual basis. If the precipitation value falls outside the q value of a selected confidence level, we mark the value as an outlier. For example, suppose we select q999 for our confidence. The precipitation on August 2, 1987 was 1.3 inches while the average of neighboring stations had a value of 0.06 inches. The average falls between lower and upper in the second row of the table (the second interval, 0.05 to 0.11). The rainfall value (1.3 inches) is larger than the q999 threshold (1.15 inches); thus we can say we are 99.9% confident that the rainfall is an outlier, and it should be flagged for further manual examination. Note that 1.3 inches is in no way an extreme precipitation value, but its validity can be challenged on the basis of the MIGD test.


| sname | range | lower | upper | q999 | q995 | q99 | q975 | q95 | q9 | q1 | q05 | q025 | q01 | q005 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 028820 | 0~0.0510 | 0 | 0.0510 | 0.8441 | 0.6332 | 0.5431 | 0.4251 | 0.3369 | 0.2502 | 0.0055 | 0.0021 | 0.0008 | 0.0002 | 9.39E-05 |
| 028820 | 0.0510~0.1111 | 0.0510 | 0.1111 | 1.1533 | 0.8704 | 0.7493 | 0.5903 | 0.4711 | 0.3532 | 0.0097 | 0.0040 | 0.0017 | 0.0005 | 0.0002 |
| 028820 | 0.1111~0.1899 | 0.1111 | 0.1899 | 1.2989 | 0.9896 | 0.8567 | 0.6816 | 0.5495 | 0.4181 | 0.0156 | 0.0071 | 0.0033 | 0.0012 | 0.0006 |
| 028820 | 0.1899~0.3326 | 0.1899 | 0.3326 | 1.8631 | 1.4162 | 1.2243 | 0.9717 | 0.7815 | 0.5925 | 0.0206 | 0.0092 | 0.0042 | 0.0015 | 0.0007 |
| 028820 | 0.3326~2.1216 | 0.3326 | 2.1216 | 2.8514 | 2.1861 | 1.8996 | 1.5210 | 1.2346 | 0.9484 | 0.0429 | 0.0208 | 0.0102 | 0.0040 | 0.0020 |
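The Table 5 lookup used in the August 2, 1987 example can be sketched as below. Only three of the tabulated quantile columns are copied, values are taken to be in inches as in the text, and the function names and the handling of an average that sits exactly on a bin edge are assumptions made for illustration.

```python
# Rows copied from Table 5 (Tucson, AZ): (lower, upper, {quantile label: value}).
TABLE5 = [
    (0.0,    0.0510, {"q999": 0.8441, "q995": 0.6332, "q005": 9.39e-05}),
    (0.0510, 0.1111, {"q999": 1.1533, "q995": 0.8704, "q005": 0.0002}),
    (0.1111, 0.1899, {"q999": 1.2989, "q995": 0.9896, "q005": 0.0006}),
    (0.1899, 0.3326, {"q999": 1.8631, "q995": 1.4162, "q005": 0.0007}),
    (0.3326, 2.1216, {"q999": 2.8514, "q995": 2.1861, "q005": 0.0020}),
]

def find_row(neighbor_avg):
    """Return the first row whose [lower, upper] range holds the neighbor average."""
    for lower, upper, quantiles in TABLE5:
        if lower <= neighbor_avg <= upper:
            return lower, upper, quantiles
    raise ValueError("neighbor average outside the tabulated range")

def exceeds_upper(x, neighbor_avg, label="q999"):
    """True when the target-station value is above the selected upper threshold."""
    return x > find_row(neighbor_avg)[2][label]

# Worked example from the text: 1.3 at the target station, 0.06 neighbor average.
flagged = exceeds_upper(1.3, 0.06)
```

A two-sided check in the sense of Eq. (14) would additionally compare the value against a lower quantile such as `q005`.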


Another QC method for precipitation is the Q-test (20). The Q-test approach serves as a tool to discriminate between extreme precipitation and outliers, and it has proven to minimize the manual examination of precipitation by choice of parameters that identify the most likely outliers (20). The performance of both the Gamma distribution test and the Q-test is relatively weak with respect to identifying the seeded errors. The Q-test is different from the Gamma distribution method because the Q-test uses both the historical data and measurements from neighboring stations, while the simple implementation of the Gamma distribution method only uses the data from the station of interest.

The MIGD method is a more complex implementation of the Gamma distribution that uses historical data and measurements from neighboring stations to partition a station's precipitation values into separate populations. The MIGD method shows promise and outperforms other QC methods for precipitation. This method identifies more seeded errors and creates fewer Type I errors than the other methods. MIGD will be used as an operational tool in identifying the outliers for precipitation in ACIS. However, the fraction of errors identified by the MIGD method varies for different probabilities and among the different stations. Network operators, data managers, and scientists who plan to use MIGD to identify potential precipitation outliers can perform a similar analysis (sort the data into bins and derive the gamma distribution coefficients for each interval) over their geographic region to choose an optimum probability level.
