A popular extension of this idea is that GCM-based forecasts should be adjusted by this skill assessment. This motivates 'Model Output Statistics' methods [25] and 'model calibration', and has been widely adopted in medium range weather forecasting [26] and seasonal forecasting [27] [28] [29].

As a simple example of calibration, consider the true positive ratio calculated above. While crude and subject to sampling error, this represents the conditional probability of the event given the model forecast category. The true positive ratio, proposed above as the best estimate of the event probability for the POAMA MDB seasonal outlooks, conditions the probabilistic forecasts derived from the GCM ensemble on probabilities obtained from comparison of the hindcast set with observations. These conditional probabilities are needed for users to make optimal decisions [30]. Skill for coupled models is commonly presented as correlation plots, mean error plots and sometimes more esoteric scores for probabilistic forecasts. While these scores are useful for model diagnostics, and can quantify potential forecast value, it is not obvious how users who need to make decisions based on forecasts should convert these measures into new estimates of probability. We note that some effort has been spent on developing verification measures that do have a direct relationship to economic value, such as the ROC (Receiver Operating Characteristic) score and the logarithmic score based on the information content of a forecast.

In order to make rational decisions based on quantifiable costs, losses and probabilities, the end user needs the calibrated forecast probabilities, and needs to know what their costs and losses are for each contingency. Given the calibrated forecast probabilities, with reliable confidence intervals, they are in a position to use these probabilities to determine the optimum course of action to follow for their unique cost function. Given information about climatology, a model and its verification, the calibrated model probability *p*(*E*|*F*) is this best estimate, subject to the assumptions made in determining the calibrated probability.

Resolution can be degraded by calibration, and it is expected that the application of calibration techniques will involve some trade-off in which resolution is traded for reliability. It is also the case that cross-validation methods, used in the application of calibration in order to avoid 'artificial skill', can themselves result in an artificial reduction in skill scores, and thus in the assessment of such methods it can be difficult to disentangle cross-validation artefacts from true reduction of model skill due to calibration.

This simple calibration framework can be extended: similar methods can be applied to parametric probability density functions [28]. Below we discuss different calibration methods, but first we turn to more sophisticated decision models.

| Event | Action | No Action |
|-------|--------|-----------|
| Yes   | C      | L         |
| No    | C      | 0         |

**Table 5.** Simple binary cost-loss model.

#### **5.2. Extending the binary cost-loss model**

In the simple cost-loss model the cost of taking protective action is the same whether the event occurs or does not occur. While this may be true for many economic decisions, when social and political dimensions are considered there is a clear penalty, in terms of confidence in the forecasting system and reduced possibility of action in the future, for false alarms. The binary cost-loss model can be developed further to include such a false alarm or 'cry wolf' effect. Such an extension is effectively an adjustment for the deviation from perfect rationality of forecast users.

The above model can also be extended to more sophisticated decisions based on event probability thresholds, with different actions to be taken at different probability thresholds, depending on the user's attitude to risk. We present a hypothetical example of an agriculturalist deciding whether to apply additional fertilizer, at a cost, with a potential payoff depending on the probability of rainfall being above the median. In this example a 20% rainfall probability is the threshold at which the cost of applying fertilizer is less than the expected payoff (Table 6). The decision thresholds in Table 6 provide a way of mapping from a given forecast to an action, again in relation to a binary yes/no event. Such tables depend on the details of individual enterprises and must be determined with regard to their operating costs and potential losses. The premise for Table 6 is the decision by wheat farmers to apply top-dressed fertiliser in order to benefit from expected rainfall [12]; however, the numbers selected are arbitrary and shown for illustration. Another management decision that could be studied using this methodology is the choice of cultivar, for example whether to plant a drought tolerant strain of wheat or one with a higher potential yield in the event of good rains.
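As a concrete illustration of the cost-loss arithmetic behind such thresholds (a sketch only; the cost, loss and probability values below are hypothetical and are not the Table 6 figures), protective action under the Table 5 model is worthwhile when the event probability exceeds the cost/loss ratio C/L:

```python
def expected_costs(p_event, cost, loss):
    """Expected outlay under the binary cost-loss model of Table 5."""
    act = cost                    # protective action costs C whether or not the event occurs
    do_nothing = p_event * loss   # without action, the loss L is incurred only if the event occurs
    return act, do_nothing

def decide(p_event, cost, loss):
    """Take protective action when the event probability exceeds C/L."""
    act, do_nothing = expected_costs(p_event, cost, loss)
    return "act" if act < do_nothing else "no action"

# hypothetical numbers: protection costs 20 units, an unprotected event loses 100 units,
# so the break-even probability is C/L = 0.2, echoing the 20% threshold discussed above
print(decide(0.35, cost=20, loss=100))   # -> "act"
print(decide(0.10, cost=20, loss=100))   # -> "no action"
```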



Using the true positive ratio we calculated for our sample rainfall forecasts in Table 3, the farmer would find that the calibrated 'low probability' forecasts from POAMA are not sufficient to justify the 'no fertilizer' action, because the observed frequency of above median rainfall events is above the 20% threshold. In other words, while the forecasts have skill, they do not have value for this particular decision.

Another simple decision model is the theory of Kelly betting, which deals with cost-loss scenarios in the context of gambling. In this theory a gambler bets a fraction $b_i$ of their wealth on each outcome $i$, where $\sum_i b_i = 1$. The expected logarithm of the ratio of the gambler's post-bet wealth to their pre-bet wealth is $G = \sum_i p_i \log_2(b_i o_i)$, where $o_i$ is the wealth multiplier, or odds, assigned to the outcome and $p_i$ is the event probability [31].
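To make this arithmetic concrete, the following sketch (illustrative only; the probabilities and odds are hypothetical) evaluates the growth rate $G$ for different betting fractions. Under the constraint that the fractions sum to one, betting in proportion to the event probabilities maximises $G$:

```python
import numpy as np

def growth_rate(p, b, o):
    """Expected log2 wealth growth rate G = sum_i p_i * log2(b_i * o_i)."""
    return np.sum(p * np.log2(b * o))

# hypothetical two-outcome example: the event occurs with probability 0.6,
# and the odds pay 1.8x on the event and 2.2x on the non-event
p = np.array([0.6, 0.4])
o = np.array([1.8, 2.2])

print(growth_rate(p, p, o))                     # Kelly: bet fractions equal to the probabilities
print(growth_rate(p, np.array([0.5, 0.5]), o))  # an even split gives a lower growth rate
```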

### **5.3. Presenting probability outlooks**

**Figure 6.** Posterior probability of above median seasonal rainfall for all forecasts (right) and June-July-August forecasts (left) initialized at the end of March.

We now turn our attention to ways of presenting information about forecasts and their skill-based calibration. The actual contingency table (Table 2) has the advantage of containing almost all the usable information (assuming the stationarity of the marginal distributions), but the disadvantage of requiring knowledge of verification methods to translate it into usable probabilities. A plot of the actual ensemble of past forecasts (Figure 4) allows users to eyeball the agreement and spread between forecasts and observations. However, it provides no quantitative information about how much credibility to assign to a particular forecast. The reliability diagram (Figure 5) provides this information, but it is not intuitive for most users to interpret. A simple pie chart can also be used to present relative probabilities. Figure 6 shows visually how the model forecast adjusts the model-estimated probabilities, and what the credible intervals based on the size of the sample are. It shows the prior climatological probability of the event and the updated probabilities, with 90% credible intervals for each forecast category. This plot is designed to communicate to end users how much the forecast ought to affect their estimate of the event's probability, based on the rate of event occurrence for previous forecasts.

Coupled model skill varies strongly by month, but with the simple binning calibration method this information is difficult to resolve. Table 4 shows the contingency table and true positive ratio for June-July-August seasonal forecasts. The true positive ratio suggests that the forecasts have reasonable skill and that we ought to take a forecast of a high probability of above median rainfall as shifting the odds from the 50:50 climatological odds to 9:2 in favour of the event. Unfortunately the small sample size in each probability bin results in very large probability intervals, as shown in Figure 6 (left). The wide probability intervals around our estimate of skill by month are troubling, because we know that skill varies strongly by month but are unable to quantify this adequately for these forecasts. Pooling forecast-verification pairs, by aggregating forecasts at different locations and times, is one way to increase the sample size. Both procedures will reduce the size of our credible intervals, but risk increasing the autocorrelation of the forecast data. A similar sample size problem affects the statistical significance of attempts to calibrate forecasts for individual grid points.
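One way credible intervals of the kind shown in Figure 6 can be computed is sketched below, assuming a Beta-Binomial model with a uniform prior for the event probability in a forecast bin; the counts used are hypothetical, not the POAMA values:

```python
from scipy.stats import beta

def event_probability_interval(k, n, cred=0.90):
    """Posterior mean and credible interval for the event probability,
    given k event occurrences in n forecasts and a uniform Beta(1, 1) prior."""
    post = beta(k + 1, n - k + 1)
    lo, hi = post.ppf((1 - cred) / 2), post.ppf(1 - (1 - cred) / 2)
    return post.mean(), (lo, hi)

# hypothetical bin: the event followed 7 of 9 'high probability' JJA forecasts
print(event_probability_interval(7, 9))
```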

A question for forecast users is how the probability range should affect the decision. The wider the interval, the less evidence exists that the forecast probability corresponds to a repeatable relationship between model and reality. Decision makers may prefer to assume climatological probabilities until this information can be sharpened. Theoretical work or modeling could determine optimum forecasts for selected decision making cost functions.

#### **5.4. Adjusting for Model Error in Continuous Forecasts**


Calibration methods can be considered to adjust the probability distribution produced by the model by using information about its past performance, with the aim of providing unbiased and reliable forecasts. A straightforward approach to the generation of probability outlooks is to build a linear regression model for predictand $x$ using the GCM ensemble mean $\bar{f}$ as a predictor:

$$x = a\bar{f} + b + \epsilon$$

where $a$ and $b$ are regression coefficients which may be computed by the least-squares method such that $a = \rho\,\sigma_x / \sigma_{\bar{f}}$ and $b = \frac{1}{n}\sum_i (x_i - a\bar{f}_i)$, with correlation coefficient $\rho$. The random errors $\epsilon$ are typically assumed to be normally distributed with a variance equal to the mean of the squared regression residuals, $\sigma_\epsilon^2 = \frac{1}{n}\sum_i (x_i - a\bar{f}_i - b)^2$. When applied to synthetic data it can be shown that this scheme produces probability forecasts that are reliable in the sense discussed above. A limitation of this method is that it assumes the data are normally distributed; significant deviations from normality may require that the data be transformed or that different methods be chosen to compute the regression coefficients. Analytical methods can be used to assess the errors in the regression parameters, but again these are usually based on distributional assumptions.
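A minimal sketch of this scheme follows. The hindcast data are synthetic, and the choice of the observed median as the event threshold and all numerical values are illustrative assumptions rather than the operational configuration:

```python
import numpy as np
from scipy.stats import norm

# synthetic hindcast ensemble means (predictor) and verifying observations (predictand)
rng = np.random.default_rng(0)
f_bar = rng.normal(size=25)
x = 0.6 * f_bar + rng.normal(scale=0.8, size=25)

rho = np.corrcoef(f_bar, x)[0, 1]
a = rho * x.std() / f_bar.std()        # a = rho * sigma_x / sigma_fbar
b = x.mean() - a * f_bar.mean()        # equivalent to (1/n) * sum(x_i - a * fbar_i)
sigma_e = (x - (a * f_bar + b)).std()  # spread of the regression residuals

# calibrated probability of exceeding the observed median for a new ensemble-mean forecast
f_new = 1.2                            # hypothetical new forecast value
p_above_median = 1.0 - norm.cdf(np.median(x), loc=a * f_new + b, scale=sigma_e)
print(p_above_median)
```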

Such regression-based approaches can be made more robust to small sample size using Bayesian methods in which model parameters *θ* are given by Bayes theorem as

$$p(\theta|H,O) = \frac{p(H,O|\theta)}{p(H,O)} \ p(\theta)$$

with hindcast data *H* and observations *O*. The likelihood function *p*(*H*, *O*|*θ*) estimates the probability of observing the hindcast-observation series given a set of model parameters. *p*(*θ*) is the prior probability for the model parameters. Probability density functions for model parameters *θ* can be determined using Markov Chain Monte-Carlo sampling [32].
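A minimal sketch of how such a posterior could be sampled is given below, using a random-walk Metropolis sampler with flat (improper) priors on the regression parameters *a*, *b* and log *σ*; the step size, chain length and variable names are assumptions for illustration, and a practical application would more likely use an established MCMC library:

```python
import numpy as np

def log_likelihood(theta, f_bar, x):
    """Gaussian log likelihood of the regression x = a*f_bar + b + eps (up to a constant)."""
    a, b, log_sigma = theta
    sigma = np.exp(log_sigma)
    resid = x - (a * f_bar + b)
    return -0.5 * np.sum(resid**2) / sigma**2 - len(x) * np.log(sigma)

def metropolis(f_bar, x, n_steps=20000, step=0.1, seed=1):
    """Random-walk Metropolis sampling of (a, b, log sigma) with flat priors."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(3)
    logp = log_likelihood(theta, f_bar, x)
    samples = []
    for _ in range(n_steps):
        proposal = theta + rng.normal(scale=step, size=3)
        logp_new = log_likelihood(proposal, f_bar, x)
        if np.log(rng.uniform()) < logp_new - logp:   # accept or reject the proposal
            theta, logp = proposal, logp_new
        samples.append(theta.copy())
    return np.array(samples[n_steps // 2:])           # discard the first half as burn-in
```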

#### **5.5. Variance inflation**

Johnson and Bowler (2009) outline a variance inflation technique which adjusts the ensemble forecast to meet two conditions: a) that ensemble members have the same variance as observations, and b) that the root-mean-square error of the ensemble mean be equal to the spread of the ensemble. A major difference between this and the previous method of linear regression with residual errors is that the ensemble spread remains a major determinant of forecast uncertainty. The first condition is designed to achieve the statistical indistinguishability of the first two moments between ensemble members and observations. The second condition is designed to ensure that the ensemble spread accounts for the expected model error. These conditions are achieved by increasing (or decreasing) the perturbations of the ensemble members from the mean while keeping the correlation between model and truth unchanged (except in the case of a negative correlation between model and truth, in which case the sign of the correlation is reversed).


**Figure 7.** Effect of calibration procedures on model time series. Left: a model grid point with high hindcast correlation. Right: a model grid point with low hindcast correlation. Black line: observations, blue solid line: model mean, blue dashed line: 10% and 90% model probability intervals.

Given ensemble mean $\bar{f}$ and ensemble member perturbations $\epsilon_l$, adjusted ensemble members $g_l$ are constructed by

$$g_l = \alpha \bar{f} + \beta \epsilon_l$$

Coefficients *α* and *β* are computed as

$$
\alpha = \rho \frac{\sigma_x}{\sigma_{\bar{f}}}
$$

and


$$
\beta^2 = \left(1 - \rho^2\right) \frac{\sigma_x^2}{\sigma_\epsilon^2}
$$

with observed variance $\sigma_x^2$, ensemble mean variance $\sigma_{\bar{f}}^2$, correlation between observations and ensemble mean *ρ*, and time average of the ensemble variance $\sigma_\epsilon^2$. Leave-one-out cross validation is used for the calculation of correlation and standard deviation when constructing a calibrated hindcast set. Typically the time series for each GCM grid point is calibrated independently. Johnson and Bowler show that, under the assumption of normally distributed model predictions and observations, this procedure minimises the root-mean-square error.
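A minimal sketch of this adjustment is given below, assuming the hindcast ensemble and observations are supplied as anomalies and using hypothetical array shapes; it is an illustration of the two conditions above, not the operational implementation:

```python
import numpy as np

def inflate_ensemble(ens, obs):
    """Variance inflation of a hindcast ensemble (anomalies assumed).

    ens: array of shape (n_years, n_members), raw ensemble anomalies
    obs: array of shape (n_years,), verifying observed anomalies
    """
    f_bar = ens.mean(axis=1)                   # ensemble mean, one value per year
    eps = ens - f_bar[:, None]                 # member perturbations about the mean
    rho = np.corrcoef(f_bar, obs)[0, 1]        # correlation of ensemble mean with observations
    alpha = rho * obs.std() / f_bar.std()      # alpha = rho * sigma_x / sigma_fbar
    sigma_eps2 = ens.var(axis=1).mean()        # time average of the ensemble variance
    beta = np.sqrt((1.0 - rho**2) * obs.var() / sigma_eps2)
    return alpha * f_bar[:, None] + beta * eps  # g_l = alpha * fbar + beta * eps_l
```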

### **5.6. Regression estimate of event probability**

Another method of adjusting probability forecasts is to regress the forecast probabilities directly against the observed event/non-event frequencies. While having the drawback that it is computed directly on ensemble-derived probabilities, it has the advantage that it makes no distributional assumptions, and estimates only one parameter.
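The chapter does not specify the functional form of this regression; one plausible single-parameter choice, shown here purely as an assumption, is to regress the departures of the observed binary outcomes from climatology on the departures of the forecast probabilities from climatology:

```python
import numpy as np

def fit_probability_regression(p_fcst, occurred, clim=0.5):
    """Fit a single slope k in  p_cal = clim + k * (p_fcst - clim)  by least squares.

    p_fcst:   ensemble-derived forecast probabilities for the event
    occurred: 1 if the event was observed, 0 otherwise
    clim:     climatological event frequency (0.5 for an above-median event)
    """
    x = np.asarray(p_fcst) - clim
    y = np.asarray(occurred) - clim
    return np.sum(x * y) / np.sum(x * x)   # least-squares slope through the origin

def calibrate(p_fcst, k, clim=0.5):
    """Apply the fitted slope and keep the result in [0, 1]."""
    return np.clip(clim + k * (np.asarray(p_fcst) - clim), 0.0, 1.0)
```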

#### **5.7. General remarks on calibration**

In the case of overconfident forecasts, calibration procedures reduce the amplitude of the probabilities, adjusting for this overconfidence by reducing the resolution. Conceptually, this calibration step can be considered the application of a statistical model to the direct model output in which the forecasts are corrected for mean state bias and over-confidence in the ensemble distribution. Figure 7 presents the application of two calibration methods to model time series which exhibit a high and low correlation with the verifying observations respectively. The central panel demonstrates the effect of the variance inflation adjustment described above, while the lower panel shows a regression adjustment with Bayesian parameter estimates.

It could be argued that such procedures degrade, or corrupt model outputs, because they make use of only limited information from the model reforecast set and available observations. This may be the case, but if such information can be specified it can be included in the calculation of calibration factors. If it cannot be specified and measured, then we are hardly in a position to use it to inform our estimates of future probabilities!

In seasonal forecasting, calibration is complicated by the short length of the hindcast verification data set, typically 15 to 30 years, which imposes hard limits on how much information we can reliably say we have about the model. This paucity of data makes model skill assessments and model adjustment difficult because parameters calculated from the verification dataset will necessarily have large sampling error. For this reason it is desirable that calibration models have a small number of parameters.


The problem is even thornier because some circulation regimes, such as strong El Niño events, are thought to be more predictable than others, so there is every chance that other variables may be strongly related to the expected accuracy of a given forecast. Indeed the practice of ensemble forecasting is designed to reflect such changes in potential predictability. For example, the influence of strong El Niño and La Niña events leads to greater predictability of climate anomalies in affected regions during such events. Given this knowledge, information about the state of ENSO should in theory be used to estimate the certainty of seasonal outlooks. Empirical outlooks do just this, but schemes using this information for dynamical model based outlooks are not yet common.

In analogue downscaling, a seasonal timescale GCM is used to generate forecasts of the large-scale fields. The analogue method has been shown to produce good results for Twentieth Century South Eastern Australian rainfall in the context of downscaling for climate change projections [33]. As with most statistical downscaling techniques, analogue downscaling is computationally cheap, in contrast to resource-intensive dynamical downscaling using nested atmospheric models.

Figure 8 shows the topography resolved by a high-resolution numerical weather prediction model, and the topography resolved by a coarse resolution seasonal prediction model.

**Figure 8.** Left: topography resolved in a high resolution weather model. Right: topography resolved in a coarse resolution seasonal prediction GCM.
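A generic sketch of the analogue step is given below; it is not the implementation used for the results cited above, and the array shapes, the RMS distance metric and the number of analogues are assumptions. It selects the historical large-scale fields closest to the forecast field and returns the corresponding local observations:

```python
import numpy as np

def analogue_downscale(forecast_field, library_fields, library_local_obs, k=5):
    """Analogue downscaling: match a forecast large-scale field to its k nearest
    historical analogues and return the matching local observations.

    forecast_field:    (ny, nx) large-scale field from the seasonal GCM
    library_fields:    (n_days, ny, nx) historical large-scale fields
    library_local_obs: (n_days, n_stations) local observations paired with the library
    """
    dists = np.sqrt(((library_fields - forecast_field) ** 2).mean(axis=(1, 2)))
    nearest = np.argsort(dists)[:k]            # indices of the k closest historical fields
    return library_local_obs[nearest]          # downscaled sample of local conditions
```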

**6. Software architecture: From models to systems**

The design of systems for the generation and distribution of GCM based outlooks is architecturally complex. It is here that the interdisciplinary nature of the seasonal forecasting activity becomes clear. In addition to the more traditional earth system science involved in understanding coupled ocean-atmosphere processes, the tasks of data processing, data modelling and information system architecture require advanced computing skills. We now outline a general pattern for the design of systems for the delivery of seasonal forecasts to end users, which is a generalisation of the implementation described above and in [34].

Four distinct layers can be defined as components of the overall process of turning the outputs of GCMs into seasonal outlooks suitable for use by decision-makers.

The **model layer** comprises the GCM simulating the evolution of the coupled ocean-atmosphere system. This component is a complex software system in itself, integrating the ingestion of data analyses (themselves based on multiple networks of observations), the assimilation of these observations into the model integration cycle, and the output of variables of interest. This layer is the domain of earth system scientists and experts in numerical computation. GCMs are typically the result of the combined efforts of a large number of such scientists and engineers working over a long period of time.
