**3. The spatial regression test (estimates) and Inverse Distance Weighted Estimates (IDW)**

When checking data from a site, missing values are sometimes present. For modeling and other purposes where continuous data are required, an estimate is needed for the missing value. We will refer to the station which is missing the data as the target station. The IDW method has been used to make estimates (x') at the target station from surrounding observations (xi).

$$\mathbf{x}' = \sum\_{i=1}^{N} \left(\mathbf{x}\_i \;/\; f(d\_i)\right) \Big/ \sum\_{i=1}^{N} 1 \;/\; f(d\_i) \tag{2}$$

where di is the distance from the target station to each of the nearby stations and f(di) is a function of di (in our case we took f(di) = 1/di). This approach assumes that the nearest stations will be most representative of the target site.
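The weighted average in equation 2 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each neighbor receives weight 1/di, which is the weighting implied by the statement that the nearest stations are most representative. The function name `idw_estimate` and the sample temperatures and distances are hypothetical.

```python
from typing import Sequence


def idw_estimate(values: Sequence[float], distances: Sequence[float]) -> float:
    """Inverse distance weighted estimate, assuming weights of 1/d_i."""
    if len(values) != len(distances) or not values:
        raise ValueError("values and distances must be non-empty and of equal length")
    if any(d <= 0 for d in distances):
        raise ValueError("distances must be positive")
    weights = [1.0 / d for d in distances]
    # Weighted mean: closer stations (small d) contribute more.
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


# Hypothetical neighbors: 70.0 F at distance 1 and 74.0 F at distance 3.
estimate = idw_estimate([70.0, 74.0], [1.0, 3.0])
```

The nearer station dominates the result, which is the behavior the method relies on and also its weakness when the nearest station is not the most similar one.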

*Spatial Regression (SRT)* is a new method that provides an estimate for the target station and can be used to check that the observation (when not missing) falls inside the confidence interval formed from N estimates based on N "best fits" between the target station and neighboring stations during a time period of length *n*. The surrounding stations are selected by specifying a radius around the station and finding those stations with the closest statistical agreement to the target station. Additional requirements for station selection are that the variable to be tested is one of the variables measured at the target site and that the data for that variable span the data period to be tested. A station that otherwise qualifies could also be eliminated from consideration if more than half of the data are missing for the time span (e.g. more than 12 missing days where n=24). First, non-biased preliminary estimates *x'i* are derived from the coefficients of a linear regression, so that for any time *t* and for each surrounding station (*yi*) an estimate is formed:

$$\mathbf{x}'\_i = \mathbf{a}\_i + \mathbf{b}\_i \mathbf{y}\_i \tag{3}$$
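The regression behind equation 3 can be sketched as an ordinary least-squares fit of the target series on one neighbor series. This is an illustrative sketch: the function name `fit_neighbor` is hypothetical, and the standard error of estimate is assumed here to use n - 2 degrees of freedom, which the text does not specify. The sample values are the first five days of x and y1 from Table 2.

```python
import math
from typing import Sequence, Tuple


def fit_neighbor(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Least-squares fit x ~ a + b*y; returns (a, b, s), s = standard error of estimate."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least three paired values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    syy = sum((yi - mean_y) ** 2 for yi in y)
    if syy == 0:
        raise ValueError("neighbor values are constant")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / syy                      # slope
    a = mean_x - b * mean_y            # intercept
    sse = sum((xi - (a + b * yi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))       # assumed: n - 2 degrees of freedom
    return a, b, s


# First five days of Table 2 (target x and neighbor y1), in F.
x = [85.1, 86.2, 91.9, 84.1, 96.3]
y1 = [85.5, 86.2, 89.5, 85.9, 94.9]
a1, b1, s1 = fit_neighbor(x, y1)
x1_prime = [a1 + b1 * v for v in y1]   # preliminary estimates from this neighbor
```

Because the fit includes an intercept, the residuals of each regression sum to zero over the fitting period, which is the sense in which the preliminary estimates are un-biased.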


Table 2 data: daily maximum temperature (F), June 2011. Stations: x = 20E 35S, y1 = Havelock, y2 = 82E 20S, y3 = 12W 55N, y4 = 51E 13S.

| A254739 | A254749 | days | x | y1 | y2 | y3 | y4 | x'1 | x'2 | x'3 | x'4 | x'1/s1'^2 | x'2/s2'^2 | x'3/s3'^2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 83.696 | 85.586 | 6/1/2011 | 85.1 | 85.5 | 83.4 | 83.7 | 85.6 | 85.51 | 84.30 | 84.82 | 84.62 | 47.016 | 92.680 | 170.315 |
| 85.604 | 87.584 | 6/2/2011 | 86.2 | 86.2 | 85.3 | 85.6 | 87.6 | 86.28 | 86.33 | 86.78 | 86.62 | 47.438 | 94.906 | 174.255 |
| 89.942 | 92.282 | 6/3/2011 | 91.9 | 89.5 | 90.0 | 89.9 | 92.3 | 89.73 | 91.33 | 91.24 | 91.30 | 49.338 | 100.408 | 183.214 |
| 85.478 | 85.1 | 6/4/2011 | 84.1 | 85.9 | 83.5 | 85.5 | 85.1 | 85.91 | 84.42 | 86.65 | 84.14 | 47.238 | 92.806 | 173.995 |
| 94.46 | 97.286 | 6/5/2011 | 96.3 | 94.9 | 94.1 | 94.5 | 97.3 | 95.49 | 95.67 | 95.89 | 96.29 | 52.504 | 105.175 | 192.545 |
| 97.574 | 100.994 | 6/6/2011 | 99.8 | 98.0 | 97.7 | 97.6 | 101.0 | 98.83 | 99.51 | 99.09 | 99.99 | 54.341 | 109.395 | 198.977 |
| 95.918 | 98.726 | 6/7/2011 | 97.2 | 96.3 | 96.4 | 95.9 | 98.7 | 97.03 | 98.10 | 97.39 | 97.73 | 53.349 | 107.841 | 195.557 |
| 83.066 | 86.288 | 6/8/2011 | 83.5 | 86.4 | 84.8 | 83.1 | 86.3 | 86.41 | 85.81 | 84.17 | 85.32 | 47.512 | 94.339 | 169.014 |
| 69.674 | 72.878 | 6/9/2011 | 71.0 | 71.8 | 71.9 | 69.7 | 72.9 | 70.92 | 72.18 | 70.40 | 71.95 | 38.994 | 79.345 | 141.355 |
| 66.2 | 67.766 | 6/10/2011 | 66.2 | 69.8 | 67.6 | 66.2 | 67.8 | 68.77 | 67.59 | 66.82 | 66.86 | 37.812 | 74.306 | 134.181 |
| 75.758 | 76.694 | 6/11/2011 | 76.2 | 76.2 | 74.8 | 75.8 | 76.7 | 75.53 | 75.19 | 76.65 | 75.76 | 41.527 | 82.663 | 153.921 |
| 77.324 | 78.98 | 6/12/2011 | 78.8 | 77.9 | 77.7 | 77.3 | 79.0 | 77.43 | 78.29 | 78.26 | 78.04 | 42.572 | 86.065 | 157.155 |
| 69.314 | 70.97 | 6/13/2011 | 69.2 | 70.3 | 69.9 | 69.3 | 71.0 | 69.23 | 69.98 | 70.03 | 70.05 | 38.066 | 76.930 | 140.612 |
| 76.028 | 78.728 | 6/14/2011 | 78.1 | 79.5 | 78.1 | 76.0 | 78.7 | 79.12 | 78.67 | 76.93 | 77.79 | 43.501 | 86.485 | 154.478 |
| 84.632 | 86.396 | 6/15/2011 | 86.4 | 85.0 | 85.3 | 84.6 | 86.4 | 84.97 | 86.35 | 85.78 | 85.43 | 46.720 | 94.927 | 172.248 |
| 85.118 | 86.27 | 6/16/2011 | 86.8 | 85.3 | 84.0 | 85.1 | 86.3 | 85.24 | 84.94 | 86.28 | 85.31 | 46.868 | 93.373 | 173.252 |
| 90.266 | 92.732 | 6/17/2011 | 91.3 | 92.5 | 90.9 | 90.3 | 92.7 | 92.92 | 92.33 | 91.58 | 91.75 | 51.090 | 101.500 | 183.884 |
| 80.312 | 82.904 | 6/18/2011 | 81.5 | 82.9 | 81.4 | 80.3 | 82.9 | 82.71 | 82.22 | 81.34 | 81.95 | 45.475 | 90.391 | 163.326 |
| 85.118 | 87.458 | 6/19/2011 | 85.6 | 86.6 | 85.5 | 85.1 | 87.5 | 86.66 | 86.60 | 86.28 | 86.49 | 47.649 | 95.200 | 173.252 |
| 86.81 | 88.448 | 6/20/2011 | 87.9 | 88.2 | 86.7 | 86.8 | 88.4 | 88.35 | 87.88 | 88.02 | 87.48 | 48.578 | 96.607 | 176.746 |
| 71.258 | 72.788 | 6/21/2011 | 72.0 | 72.9 | 71.9 | 71.3 | 72.8 | 72.07 | 72.16 | 72.03 | 71.87 | 39.628 | 79.324 | 144.627 |
| 74.948 | 76.586 | 6/22/2011 | 76.7 | 75.0 | 74.4 | 74.9 | 76.6 | 74.26 | 74.83 | 75.82 | 75.65 | 40.831 | 82.264 | 152.248 |
| 76.604 | 78.62 | 6/23/2011 | 77.1 | 78.9 | 76.4 | 76.6 | 78.6 | 78.45 | 76.87 | 77.52 | 77.68 | 43.132 | 84.511 | 155.668 |
| 78.17 | 80.168 | 6/24/2011 | 79.4 | 79.4 | 78.3 | 78.2 | 80.2 | 78.96 | 78.92 | 79.13 | 79.22 | 43.417 | 86.758 | 158.902 |
| 80.564 | 82.544 | 6/25/2011 | 82.0 | 80.8 | 80.6 | 80.6 | 82.5 | 80.52 | 81.33 | 81.60 | 81.59 | 44.272 | 89.404 | 163.846 |
| 81.302 | 82.814 | 6/26/2011 | 82.1 | 82.3 | 82.1 | 81.3 | 82.8 | 82.09 | 82.91 | 82.36 | 81.86 | 45.137 | 91.147 | 165.370 |
| 78.044 | 80.06 | 6/27/2011 | 79.1 | 79.8 | 77.9 | 78.0 | 80.1 | 79.37 | 78.54 | 79.00 | 79.12 | 43.638 | 86.338 | 158.642 |
| 79.61 | 81.716 | 6/28/2011 | 81.1 | 80.2 | 79.1 | 79.6 | 81.7 | 79.87 | 79.80 | 80.62 | 80.77 | 43.913 | 87.724 | 161.876 |
| 89.78 | 91.76 | 6/29/2011 | 91.3 | 89.7 | 89.3 | 89.8 | 91.8 | 89.96 | 90.55 | 91.08 | 90.78 | 49.465 | 99.547 | 182.880 |
| 98.78 | 101.48 | 6/30/2011 | 100.0 | 100.3 | 98.4 | 98.8 | 101.5 | 101.25 | 100.29 | 100.33 | 100.47 | 55.671 | 110.256 | 201.467 |

| Linear regression parameters | y1 | y2 | y3 | y4 | sum(1/si^2) | s' |
|---|---|---|---|---|---|---|
| Slope | 1.066 | 1.061 | 1.029 | 0.997 | | |
| Intercept | -5.687 | -4.170 | -1.265 | -0.705 | | |
| Si(x,yi) | 1.349 | 0.954 | 0.706 | 0.694 | 5.73249 | 0.83533 |
| yi | 1.1 | 0.4 | -0.7 | -0.6 | 1.43312 | 0.83533 |

Random values generator, generating yi based on x

6.273045 3.18E-05

0.48731 0.408523

One example for day 30 (i=1 to 4 for four reference stations):

0.069812 0.208934 0.382169 0.206952

Toward a Better Quality Control of Weather Data

http://dx.doi.org/10.5772/51632

The approach obtains an un-biased estimate (*x'*) by utilizing the standard error of estimate (*s*) for each of the linear regressions in the weighting process. The surrounding stations are ranked according to the magnitude of the standard error of estimate and the N stations with the lowest *s* values are used in the weighting process:

$$\mathbf{x}' = \sum\_{i=1}^{N} \left(\mathbf{x}'\_i / \mathbf{s}\_i^2\right) \Big/ \sum\_{i=1}^{N} 1 / \mathbf{s}\_i^2 \tag{4}$$

$$N / \mathbf{s}'^2 = \sum\_{i=1}^{N} 1 / \mathbf{s}\_i^2 \tag{5}$$

This approach provides more weight to the stations that are the best estimators of the target station. Because the stations used in (4) are a subset of the neighboring stations, the estimate is not an areal average but a spatial regression weighted estimate.
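Equations 4 and 5 amount to an inverse-variance weighted mean and a pooled standard error. The sketch below is illustrative; the function name `srt_combine` is hypothetical. The inputs are the four preliminary estimates for 6/30/2011 and the rounded Si(x,yi) values printed with Table 2.

```python
import math
from typing import Sequence, Tuple


def srt_combine(estimates: Sequence[float], std_errors: Sequence[float]) -> Tuple[float, float]:
    """Return (x', s') from equations 4 and 5 for N neighbor estimates."""
    if len(estimates) != len(std_errors) or not estimates:
        raise ValueError("estimates and std_errors must be non-empty and of equal length")
    if any(s <= 0 for s in std_errors):
        raise ValueError("standard errors must be positive")
    inv_var = [1.0 / s ** 2 for s in std_errors]        # weights 1/s_i^2
    total = sum(inv_var)
    x_prime = sum(e * w for e, w in zip(estimates, inv_var)) / total   # equation 4
    s_prime = math.sqrt(len(estimates) / total)                         # equation 5
    return x_prime, s_prime


# 6/30/2011 in Table 2: x'_1..x'_4 and the rounded standard errors of estimate.
x_est, s_est = srt_combine([101.25, 100.29, 100.33, 100.47],
                           [1.349, 0.954, 0.706, 0.694])
```

With these rounded inputs the result is roughly 100.46 F with s' near 0.835 F, in line with the s' of 0.83533 reported with the table. The two neighbors with the smallest standard errors (y3 and y4) carry most of the weight.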

The approach differs from inverse distance weighting in that the standard error of estimate has a statistical distribution; therefore, confidence intervals can be calculated on the basis of *s'*, and the station value (*x*) can be tested to determine whether or not it falls within the confidence intervals:

$$\mathbf{x}' - f\mathbf{s}' \le \mathbf{x} \le \mathbf{x}' + f\mathbf{s}' \tag{6}$$

If the above relationship holds, then the datum passes the spatial test. This relationship indicates that with successively larger values of *f*, the number of potential Type I errors decreases. Unlike distance weighting techniques, this approach does not assume that the best station to compare against is the closest station but instead looks to the relationships between the actual station data to settle which stations should be used to make the estimates and what weighting these stations should receive. An example of the estimates obtained from the SRT is given in Table 2.
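The test in equation 6 is a simple interval check. A minimal sketch follows; the function name `passes_spatial_test` is hypothetical, and the values of x', s', and f in the usage lines are illustrative numbers chosen for the example.

```python
def passes_spatial_test(x: float, x_prime: float, s_prime: float, f: float) -> bool:
    """True when the observation x lies within x' +/- f*s' (equation 6)."""
    if s_prime < 0 or f < 0:
        raise ValueError("s_prime and f must be non-negative")
    return x_prime - f * s_prime <= x <= x_prime + f * s_prime


# Illustrative values: estimate 100.46 F, s' = 0.835 F, f = 3.
ok = passes_spatial_test(100.0, 100.46, 0.835, f=3.0)           # inside the interval
flagged = not passes_spatial_test(104.0, 100.46, 0.835, f=3.0)  # outside, so flagged
```

Raising f widens the interval, so fewer good observations are flagged (fewer Type I errors) at the cost of letting more erroneous values pass.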


6 Practical Concepts of Quality Control

**Table 2** An example of QC using the Spatial Regression Test (SRT) method for daily maximum temperature estimation (unit: F). Stations are from the Automated Weather Data Network and locations are on an East-West by North-South street naming convention. The original station (Lincoln 20E 35S) is labeled x while the four neighboring stations are y1, y2, y3, and y4. Equation 3 is used to derive the unbiased estimates *x'*<sub>1</sub>, *x'*<sub>2</sub>, etc. for n=30. The final estimate x(est) is determined from the unbiased estimates using equations 4 and 5.

Using the above methodology, the rate of error detection can be pre-selected. The reader should note that the results are presented in terms of the fraction of data flagged against the range of *f* values (defined above) rather than selecting one *f* value on an arbitrary basis. This type of analysis makes it possible to select the specific *f* values for stations in differing climate regimes that would keep the Type I error rate uniform across the country. For the sake of illustration, suppose the goal is to select *f* values which keep the potential Type I errors to about two percent. A representative set of stations and years can be pre-analyzed prior to QC to determine the *f* values appropriate to achieve this goal. The SRT method implicitly resolves the bias between variables at different stations induced by elevation difference or other attributes.
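One way to pre-select *f* from a representative sample is to take an empirical quantile of the scaled departures |x - x'| / s'. This is an assumption about how the pre-analysis could be carried out, not a procedure given in the text. The function name `choose_f`, its parameters, and the synthetic sample are hypothetical.

```python
from typing import Sequence


def choose_f(observed: Sequence[float], estimated: Sequence[float],
             s_primes: Sequence[float], target_rate: float = 0.02) -> float:
    """Smallest f taken from the sample that flags at most target_rate of the data."""
    if not (len(observed) == len(estimated) == len(s_primes)) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    if not 0.0 <= target_rate < 1.0:
        raise ValueError("target_rate must be in [0, 1)")
    if any(s <= 0 for s in s_primes):
        raise ValueError("s' values must be positive")
    # A datum is flagged when |x - x'| / s' exceeds f (equation 6 fails).
    ratios = sorted(abs(x - e) / s for x, e, s in zip(observed, estimated, s_primes))
    allowed = int(target_rate * len(ratios) + 1e-9)   # how many values may be flagged
    return ratios[len(ratios) - allowed - 1]


# Synthetic sample: departures of 1, 2, ..., 100 with s' = 1 everywhere.
obs = [float(k) for k in range(1, 101)]
f_value = choose_f(obs, [0.0] * 100, [1.0] * 100, target_rate=0.02)
n_flagged = sum(1 for v in obs if v > f_value)
```

Running the same selection separately for stations in different climate regimes would give each regime its own *f* while holding the flagged fraction near the chosen target.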

Tables 2 and 3 show the use of SRT (equations 3, 4 and 5 above). The data in the example are retrieved from the AWDN stations for the month of June 2011. Only one month was used in this example. The stations are located in the city of Lincoln, NE, USA. The station being tested is Lincoln 20E 35S and is labeled x, while the neighboring stations are labeled y1, y2, y3, and y4. The intercept (ai), slope (bi), and standard error of the linear regression between x and each yi are computed. The non-biased estimates of x from data at the neighboring stations (yi) are shown as x'1, x'2, x'3, and x'4. The values normalized by the squared standard errors (x'i/si^2) are used in equation 4 to create the estimate x(est). The last column shows the bias between the true x value and the estimated value (x(est)) from the four stations. We see that the sum of the bias over the 30 days has a value of 0.00, which is expected because the estimates using the SRT method are un-biased. The standard error of this regression estimation is 0.83 F. Here, for instance, where f was chosen as 3, any value of x-x(est) that is smaller than -2.5 F or larger than 2.5 F (3 × 0.83 F) will be treated as an outlier. In this example no value of x-x(est) was marked as an outlier.
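As a check on the arithmetic, equation 5 can be evaluated with the rounded standard errors of estimate printed with Table 2 (1.349, 0.954, 0.706, and 0.694). The sum comes out slightly below the reported 5.73249 because the inputs are rounded:

$$\sum\_{i=1}^{4} \frac{1}{s\_i^2} = \frac{1}{1.349^2} + \frac{1}{0.954^2} + \frac{1}{0.706^2} + \frac{1}{0.694^2} \approx 0.550 + 1.099 + 2.006 + 2.076 = 5.731$$

$$s' = \sqrt{4 \,/\, 5.731} \approx 0.835 \ \text{F}, \qquad f s' = 3 \times 0.835 \approx 2.5 \ \text{F}$$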

If one value or several values at station x are missing, x(est) provides an estimate for the missing data entry (see Table 3). The example in Table 3 shows that when the value of x is missing on June 10 and June 17, 2011, the SRT method yields estimates of 67.4 F and 91.9 F for the two days, independent of the true values of 66.2 F and 91.3 F, with a bias of 1.2 F and 0.6 F, respectively. Here we note that the estimated values for the two days are slightly different from those estimated in Table 2 because there are two fewer values to include in the regression.

Columns x to y4: original data at stations, Lincoln NE, USA (x = 20E 35S, y1 = Havelock, y2 = 82E 20S, y3 = 12W 55N, y4 = 51E 13S). Columns x'1 to x'4: estimated x from y. Columns x'i/si'^2: normalized by s'.

| days | x | y1 | y2 | y3 | y4 | x'1 | x'2 | x'3 | x'4 | x'1/s1'^2 | x'2/s2'^2 | x'3/s3'^2 | x'4/s4'^2 | X(est) | x-x(est) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6/1/2011 | 85.1 | 85.5 | 83.4 | 83.7 | 85.6 | 85.64 | 84.39 | 84.84 | 84.66 | 54.055 | 98.238 | 164.200 | 171.671 | 84.8 | -0.31 |
| 6/2/2011 | 86.2 | 86.2 | 85.3 | 85.6 | 87.6 | 86.39 | 86.39 | 86.80 | 86.64 | 54.533 | 100.577 | 167.980 | 175.693 | 86.6 | 0.46 |
| 6/3/2011 | 91.9 | 89.5 | 90.0 | 89.9 | 92.3 | 89.80 | 91.36 | 91.24 | 91.31 | 56.686 | 106.360 | 176.575 | 185.152 | 91.1 | -0.81 |
| 6/4/2011 | 84.1 | 85.9 | 83.5 | 85.5 | 85.1 | 86.03 | 84.50 | 86.67 | 84.18 | 54.306 | 98.370 | 167.731 | 170.692 | 85.3 | 1.14 |
| 6/5/2011 | 96.3 | 94.9 | 94.1 | 94.5 | 97.3 | 95.49 | 95.67 | 95.86 | 96.28 | 60.274 | 111.370 | 185.527 | 195.227 | 95.9 | -0.35 |
| 6/6/2011 | 99.8 | 98.0 | 97.7 | 97.6 | 101.0 | 98.79 | 99.48 | 99.05 | 99.96 | 62.356 | 115.806 | 191.697 | 202.692 | 99.4 | -0.32 |
| 6/7/2011 | 97.2 | 96.3 | 96.4 | 95.9 | 98.7 | 97.00 | 98.07 | 97.36 | 97.71 | 61.231 | 114.173 | 188.416 | 198.126 | 97.6 | 0.35 |
| 6/8/2011 | 83.5 | 86.4 | 84.8 | 83.1 | 86.3 | 86.53 | 85.88 | 84.20 | 85.36 | 54.617 | 99.981 | 162.951 | 173.084 | 85.2 | 1.69 |
| 6/9/2011 | 71.0 | 71.8 | 71.9 | 69.7 | 72.9 | 71.23 | 72.35 | 70.49 | 72.04 | 44.964 | 84.223 | 136.417 | 146.085 | 71.5 | 0.47 |
| 6/10/2011 | | 69.8 | 67.6 | 66.2 | 67.8 | 69.11 | 67.80 | 66.93 | 66.97 | 43.624 | 78.926 | 129.534 | 135.792 | **67.4** | |
| 6/11/2011 | 76.2 | 76.2 | 74.8 | 75.8 | 76.7 | 75.78 | 75.34 | 76.72 | 75.83 | 47.835 | 87.710 | 148.472 | 153.768 | 76.0 | -0.15 |
| 6/12/2011 | 78.8 | 77.9 | 77.7 | 77.3 | 79.0 | 77.66 | 78.41 | 78.32 | 78.10 | 49.019 | 91.285 | 151.575 | 158.370 | 78.2 | -0.65 |
| 6/13/2011 | 69.2 | 70.3 | 69.9 | 69.3 | 71.0 | 69.57 | 70.17 | 70.12 | 70.15 | 43.911 | 81.685 | 135.704 | 142.243 | 70.1 | 0.85 |
| 6/14/2011 | 78.1 | 79.5 | 78.1 | 76.0 | 78.7 | 79.32 | 78.79 | 76.99 | 77.85 | 50.071 | 91.727 | 149.007 | 157.863 | 77.9 | -0.20 |
| 6/15/2011 | 86.4 | 85.0 | 85.3 | 84.6 | 86.4 | 85.10 | 86.41 | 85.80 | 85.46 | 53.720 | 100.599 | 166.054 | 173.301 | 85.7 | -0.67 |
| 6/16/2011 | 86.8 | 85.3 | 84.0 | 85.1 | 86.3 | 85.37 | 85.01 | 86.30 | 85.34 | 53.887 | 98.966 | 167.017 | 173.048 | 85.6 | -1.18 |
| 6/17/2011 | | 92.5 | 90.9 | 90.3 | 92.7 | 92.95 | 92.35 | 91.57 | 91.75 | 58.672 | 107.508 | 177.217 | 186.058 | **91.9** | |
| 6/18/2011 | 81.5 | 82.9 | 81.4 | 80.3 | 82.9 | 82.87 | 82.32 | 81.38 | 82.00 | 52.308 | 95.832 | 157.495 | 166.271 | 81.9 | 0.47 |
| 6/19/2011 | 85.6 | 86.6 | 85.5 | 85.1 | 87.5 | 86.77 | 86.66 | 86.30 | 86.52 | 54.772 | 100.886 | 167.017 | 175.440 | 86.5 | 0.93 |
| 6/20/2011 | 87.9 | 88.2 | 86.7 | 86.8 | 88.4 | 88.44 | 87.93 | 88.03 | 87.50 | 55.825 | 102.365 | 170.370 | 177.433 | 87.9 | 0.01 |
| 6/21/2011 | 72.0 | 72.9 | 71.9 | 71.3 | 72.8 | 72.37 | 72.33 | 72.11 | 71.95 | 45.682 | 84.201 | 139.556 | 145.904 | 72.1 | 0.13 |
| 6/22/2011 | 76.7 | 75.0 | 74.4 | 74.9 | 76.6 | 74.53 | 74.98 | 75.89 | 75.72 | 47.045 | 87.291 | 146.867 | 153.550 | 75.5 | -1.18 |
| 6/23/2011 | 77.1 | 78.9 | 76.4 | 76.6 | 78.6 | 78.66 | 77.01 | 77.58 | 77.74 | 49.653 | 89.652 | 150.148 | 157.645 | 77.6 | 0.50 |
| 6/24/2011 | 79.4 | 79.4 | 78.3 | 78.2 | 80.2 | 79.17 | 79.04 | 79.19 | 79.28 | 49.976 | 92.014 | 153.251 | 160.762 | 79.2 | -0.24 |
| 6/25/2011 | 82.0 | 80.8 | 80.6 | 80.6 | 82.5 | 80.71 | 81.43 | 81.64 | 81.64 | 50.945 | 94.795 | 157.994 | 165.546 | 81.5 | -0.46 |
| 6/26/2011 | 82.1 | 82.3 | 82.1 | 81.3 | 82.8 | 82.26 | 83.00 | 82.39 | 81.91 | 51.925 | 96.627 | 159.456 | 166.089 | 82.3 | 0.24 |
| 6/27/2011 | 79.1 | 79.8 | 77.9 | 78.0 | 80.1 | 79.57 | 78.66 | 79.06 | 79.17 | 50.227 | 91.572 | 153.001 | 160.545 | 79.1 | 0.00 |
| 6/28/2011 | 81.1 | 80.2 | 79.1 | 79.6 | 81.7 | 80.06 | 79.91 | 80.66 | 80.82 | 50.538 | 93.029 | 156.104 | 163.879 | 80.5 | -0.61 |
| 6/29/2011 | 91.3 | 89.7 | 89.3 | 89.8 | 91.8 | 90.03 | 90.58 | 91.07 | 90.79 | 56.830 | 105.455 | 176.254 | 184.101 | 90.8 | -0.57 |
| 6/30/2011 | 100.0 | 100.3 | 98.4 | 98.8 | 101.5 | 101.17 | 100.25 | 100.29 | 100.44 | 63.863 | 116.711 | 194.087 | 203.671 | 100.4 | 0.45 |

Slope (y1 to y4): 1.053, 1.053, 1.024, 0.993. Sum of x-x(est): 0.00.

**Table 3** An example of estimating missing data using the Spatial Regression Test (SRT) method for daily maximum temperature estimation (unit: F). In this example, two days were assumed missing, 6/10 and 6/17, and were estimated using equations 3, 4, and 5 (see highlighted values in the X(est) column). Stations are from the Automated Weather Data Network and locations are on an East-West by North-South naming convention. The original station (Lincoln 20E 35S) is labeled x while the four neighboring stations are y1, y2, y3, and y4. Equation 3 is used to derive the unbiased estimates *x'*<sub>1</sub>, *x'*<sub>2</sub>, etc. for n=28. The final estimate x(est) is determined from the unbiased estimates using equations 4 and 5.

**4. Providing estimates: robustness of SRT method and weakness of IDW method**

The SRT method was tested against the Inverse Distance Weighted (IDW) method to determine the representativeness of estimates obtained (29). The SRT method outperformed the IDW method in complex terrain and complex microclimates. To illustrate this we have taken the data from a national cooperative observer site at Silver Lake Brighton, UT. The elevation at Silver Lake Brighton is 8740 ft. The nearest neighboring station is located at Soldier Summit at an elevation of 7486 ft. These data are for the year 2002. Daily estimates for maximum and minimum temperature were obtained for each day by temporarily removing the observation from that day and applying both the IDW (eq. 2) and the SRT (eqs. 3 to 5) methods against 15 neighboring stations. The estimations for the SRT method were derived by applying the method (deriving the un-biased estimates) every 24 data.

**Figure 1.** The results of estimating maximum temperature at Silver Lake Brighton, UT for both the IDW and the SRT methods.
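The missing-value procedure of Table 3 and the remove-one-day evaluation used in section 4 can be sketched together: fit each neighbor regression only on days where the target is present, then form the weighted estimate for every day. This is a simplified illustration, not the authors' code. The function names `srt_estimates` and `leave_one_out` are hypothetical, the standard error is assumed to use n - 2 degrees of freedom, and all supplied neighbors are used rather than a ranked subset. The sample is an excerpt of x, y1, and y2 for 6/1 to 6/6 from the tables, with 6/4 masked purely for demonstration.

```python
import math
from typing import List, Optional, Sequence, Tuple


def srt_estimates(target: Sequence[Optional[float]],
                  neighbors: Sequence[Sequence[float]]) -> List[float]:
    """Weighted estimate for every day; None marks a missing target value."""
    present = [t for t, x in enumerate(target) if x is not None]
    n = len(present)
    if n < 3:
        raise ValueError("need at least three days with a target observation")
    xs = [target[t] for t in present]
    mean_x = sum(xs) / n
    fits = []  # one (intercept, slope, standard error) triple per neighbor
    for y in neighbors:
        ys = [y[t] for t in present]
        mean_y = sum(ys) / n
        syy = sum((v - mean_y) ** 2 for v in ys)
        if syy == 0:
            raise ValueError("a neighbor series is constant over the fitting days")
        slope = sum((u - mean_x) * (v - mean_y) for u, v in zip(xs, ys)) / syy
        intercept = mean_x - slope * mean_y
        sse = sum((u - (intercept + slope * v)) ** 2 for u, v in zip(xs, ys))
        s = math.sqrt(sse / (n - 2))  # assumed: n - 2 degrees of freedom
        if s == 0:
            raise ValueError("perfect fit: inverse-variance weight is undefined")
        fits.append((intercept, slope, s))
    total_weight = sum(1.0 / s ** 2 for _, _, s in fits)
    estimates = []
    for t in range(len(target)):
        weighted = sum((a + b * y[t]) / s ** 2 for (a, b, s), y in zip(fits, neighbors))
        estimates.append(weighted / total_weight)
    return estimates


def leave_one_out(target: Sequence[Optional[float]],
                  neighbors: Sequence[Sequence[float]]) -> List[Tuple[float, float]]:
    """(observed, estimate) pairs, each estimate made with that day's value removed."""
    pairs = []
    for t, x in enumerate(target):
        if x is None:
            continue
        masked = list(target)
        masked[t] = None  # temporarily remove the observation for this day
        pairs.append((x, srt_estimates(masked, neighbors)[t]))
    return pairs


# Excerpt for 6/1 to 6/6 (F); the 6/4 target value is masked for illustration.
x = [85.1, 86.2, 91.9, None, 96.3, 99.8]
y1 = [85.5, 86.2, 89.5, 85.9, 94.9, 98.0]
y2 = [83.4, 85.3, 90.0, 83.5, 94.1, 97.7]
filled = srt_estimates(x, [y1, y2])
pairs = leave_one_out(x, [y1, y2])
```

Comparing the pairs returned by `leave_one_out` for different estimators (for example an IDW estimate against this one) gives the kind of side-by-side evaluation described for Silver Lake Brighton.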
