Health Management – Different Approaches and Solutions

**1. Introduction**

Prediction of an adverse health event (AHE) from objective data is of great importance in clinical practice. A health event is inherently dichotomous as it either happens or does not happen, and in the latter case, it is a favourable health event (FHE).

In many clinical applications it is relevant not only to predict whether an AHE will happen (diagnostic ability) but also to estimate in advance its individual risk of occurrence on an ordered multinomial or quantitative scale, such as probability (prognostic ability). An estimated probability of a patient's outcome is usually preferred to a simpler binary decision rule. However, models cannot be designed by optimising their fit to true individual risk probabilities, because the latter are not intrinsically known, nor can they be easily and accurately associated with an individual's data. Classification models are therefore usually trained on binary outcomes to provide an orderable or quantitative output, which can be dichotomised using a suitable cut-off value.
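As a minimal sketch of this last step (probabilities and cut-off are hypothetical), dichotomising a model's quantitative output reduces to comparing each prediction with the chosen cut-off:

```python
def dichotomise(probabilities, cutoff=0.5):
    """Convert predicted risk probabilities into binary AHE (1) / FHE (0) decisions."""
    return [1 if p > cutoff else 0 for p in probabilities]

# Hypothetical model outputs for five patients
risk = [0.05, 0.40, 0.55, 0.80, 0.95]
print(dichotomise(risk, cutoff=0.5))  # [0, 0, 1, 1, 1]
```

The cut-off itself is a design choice, discussed below in terms of clinical costs and ROC analysis.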

Model discrimination refers to the accurate identification of actual outcomes. Model calibration, or goodness of fit, refers to the agreement between predicted probabilities and observed proportions, and is an important aspect to consider in evaluating the prognostic capacity of a risk model (Cook, 2008). Calibration is largely independent of discrimination: there are risk models with good discrimination but poor calibration. A well-calibrated model gives probability values that can be reliably associated with the true individual risk of outcomes.
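To make this independence concrete, here is a toy sketch (all numbers hypothetical). Discrimination can be summarised by the probability that a randomly chosen AHE patient is ranked above a randomly chosen FHE patient (the Mann-Whitney form of the AUC): two models that rank patients identically share this value exactly, yet can predict very different absolute risks:

```python
def auc(scores, outcomes):
    """AUC via the Mann-Whitney statistic: probability that a random AHE
    patient (outcome 1) scores higher than a random FHE patient (outcome 0)."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

outcomes = [0, 0, 1, 0, 1]
model_a = [0.10, 0.20, 0.60, 0.30, 0.90]  # close to observed event rate
model_b = [0.55, 0.60, 0.80, 0.70, 0.95]  # same ranking, inflated risks

print(auc(model_a, outcomes), auc(model_b, outcomes))  # 1.0 1.0 (identical discrimination)
print(sum(model_a) / 5, sum(model_b) / 5)              # mean predicted risks differ sharply
```

Both models discriminate perfectly here, but model_b's average predicted risk (0.72) is far from the observed event proportion (0.40), i.e. it is poorly calibrated.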

Many models have recently been proposed for diagnostic purposes in a wide range of medical applications, and they also provide reliable estimates of individual risk probabilities. Two different approaches have been used to predict patient risk. The first estimates risk probability with sophisticated mathematical and statistical methods, such as logistic regression, Bayes' rule and artificial neural networks (Dreiseitl & Ohno-Machado, 2002; Fukunaga, 1990; Marshall et al., 1994). Despite their accuracy, these models are unfortunately not widely used: they are hard to design, involve calculations that often require dedicated software and computing skills clinicians may not welcome, and are difficult to incorporate into clinical practice. The second approach creates scoring systems, in which the predictor variables are usually selected and scored subjectively by expert consensus or objectively using statistical methods (den Boer et al., 2005; Higgins et al., 1997; Vincent & Moreno, 2010).

Design of Scoring Models for Trustworthy Risk Prediction in Critical Patients

Whatever the approach, two types of risk model can be distinguished:

1. Probability models estimate risk on a continuous probability scale (Bishop, 1995; Dreiseitl & Ohno-Machado, 2002; Fukunaga, 1990; Lee, 2004). A probability threshold value, Pt, is identified for classification, above which an AHE is recognized to occur, that is, when P(AHE|x) > Pt; the choice of Pt depends on the clinical cost of a wrong decision and influences model classification performance (E. Barbini et al., 2007).

2. Score models evaluate risk on a discrete scale of integer values si (i = 0, 1, 2, ..., n), including zero to represent null risk, but rarely provide a threshold value for classification purposes (Cevenini & P. Barbini, 2010; Vincent & Moreno, 2010).

**2.1 Discrimination and calibration**

Whatever the risk model, its predictive power is generally expressed by discrimination and calibration (Cook, 2008; Diamond, 1992).

Discrimination is the capacity of a classification model to correctly distinguish patients who will develop an adverse outcome from patients who will not. It must be optimized during model design by ensuring that the model learns the discrimination properties valid for the population correctly from the training sample, and therefore shows similar performance in different samples (generalisation ability) (Dreiseitl & Ohno-Machado, 2002; Vapnik, 1999). Though many criteria exist for evaluating model discrimination capacity (Fukunaga, 1990), sensitivity (SE) and specificity (SP), which measure the fractions of correctly classified sick and healthy patients, respectively, are commonly used for statistical evaluation of binary diagnostic test performance. SE and SP are combined in the receiver operating characteristic (ROC) curve, a graphic representation of the relationship between the true-positive fraction (TPF = SE) and the false-positive fraction (FPF = 1 - SP) obtained for all possible choices of Pt. The area under the ROC curve (AUC) is the most widely used index of total discrimination capacity in medical applications (Lasko et al., 2005).

Calibration, or goodness of fit, represents the agreement between model-predicted and true probabilities of developing the adverse outcome (Hosmer & Lemeshow, 2000). Retrospective training data only provide dichotomous responses, that is, presence or absence of the AHE, so true individual risk probabilities cannot intrinsically be known. The only way to derive them directly from sample data is to calculate the proportion of AHEs in groups of patients, but this obviously becomes less accurate as group size decreases. Nevertheless, from a health or clinical point of view, it is often useful to estimate the level at which each event happens on a continuous scale, such as probability. For probability models with dichotomous outcomes, calibration capacity can be evaluated by the Hosmer-Lemeshow (HL) goodness-of-fit test, based on two alternative chi-squared statistics, Ĥ and Ĉ (Hosmer & Lemeshow, 2000). The first statistic compares model-predicted and observed outcome frequencies within fixed deciles of predicted risk probability; the second partitions observations into ten groups of equal size (the last group can have a slightly different number of cases) and calculates model-predicted frequencies from average group probabilities. The Ĉ-statistic is generally preferred because it avoids empty groups, although it depends heavily on sample size and grouping criterion (den Boer et al., 2005). The HL test cannot properly be applied to models with discrete outputs, such as score systems, because the groups would have to be formed on the basis of the discrete output values (Finazzi et al., 2011).
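The equal-size grouping of the Ĉ-type statistic can be sketched as follows (a simplified illustration on hypothetical data; a full test would also compare the statistic with a chi-squared distribution to obtain a p-value):

```python
def hosmer_lemeshow_C(probs, outcomes, n_groups=10):
    """Simplified HL C-hat: equal-size groups ordered by predicted risk.
    Each group contributes (O - E)^2 / (E * (1 - E/n))."""
    order = sorted(range(len(probs)), key=lambda i: probs[i])
    size = len(probs) // n_groups
    stat = 0.0
    for g in range(n_groups):
        # last group absorbs the remainder, as described in the text
        idx = order[g * size:(g + 1) * size] if g < n_groups - 1 else order[(n_groups - 1) * size:]
        observed = sum(outcomes[i] for i in idx)
        expected = sum(probs[i] for i in idx)  # group size * mean group probability
        n = len(idx)
        if 0 < expected < n:  # guard against degenerate groups
            stat += (observed - expected) ** 2 / (expected * (1 - expected / n))
    return stat

# Hypothetical example: two groups with slight miscalibration
probs = [0.1] * 5 + [0.9] * 5
outcomes = [0] * 5 + [1] * 5
print(hosmer_lemeshow_C(probs, outcomes, n_groups=2))  # ≈ 1.11
```

A large statistic relative to the chi-squared reference indicates poor calibration; with ten groups the Ĉ-statistic is conventionally referred to a chi-squared distribution with eight degrees of freedom.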

Despite their lower accuracy, scoring models are usually preferred to probability models by clinicians and health operators because they allow immediate calculation of individual patient scores as a simple sum of integer values associated with binary risk factors, without the need for any data-processing system. It has also been shown that, in most cases where a considerable amount of clinical information is available, their diagnostic accuracy is similar to that of probability models (Cevenini & P. Barbini, 2010, as cited in Cevenini et al., 2007). The computational simplicity of score models should nevertheless be weighed carefully against their predictive performance: too simple a model can produce misleading estimates of a patient's clinical risk, which can be useless, counterproductive or even dangerous.
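The score computation itself is trivial by design. A hypothetical sketch (the risk factors and point values here are invented for illustration, not taken from any published score):

```python
# Hypothetical binary risk factors and their integer points (illustrative only).
POINTS = {
    "emergency_surgery": 3,
    "age_over_75": 2,
    "diabetes": 1,
    "low_ejection_fraction": 2,
}

def patient_score(factors):
    """Sum the points of the risk factors present in a patient."""
    return sum(POINTS[f] for f in factors if factors[f])

print(patient_score({"emergency_surgery": True, "age_over_75": False,
                     "diabetes": True, "low_ejection_fraction": True}))  # 6
```

This is exactly the bedside appeal of scoring systems: a sum of small integers that needs no software at all.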

Any risk model, however sophisticated and accurate in the specific local conditions in which it was designed, loses much of its predictive power when exported to different clinical scenarios. Locally customized scoring models generally perform better than exported probability models. This reinforces the clinical success and effectiveness of scoring systems, whose design and customisation to local conditions and/or institutions are usually much easier.

A limit of many scoring systems is their complex, involuted and even arbitrary design procedure, which often involves contrivances such as rounding the parameters of more sophisticated probability models to integer values. This can make their design even more complicated than that of probability models. Scoring often involves dichotomising continuous clinical variables into binary risk factors by identifying cut-off values from subjective clinical criteria rather than from suitable optimisation techniques. Whatever the design procedure, however, the main weakness of scoring models concerns the interpretation of individual scores in terms of prognostic probabilities (model calibration), whose reliability depends on the availability of a sufficient proportion of adverse outcomes and on a design procedure that provides precise individual risk estimation (Cevenini & P. Barbini, 2010). The Hosmer-Lemeshow test is commonly used to assess the calibration of probability models and therefore to guide their learning, but its results are unreliable when applied to models with discrete outputs, such as scoring systems (Finazzi et al., 2011).
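One common objective alternative to a subjectively chosen cut-off (not necessarily the technique used in this chapter) is to scan the candidate cut-offs of a continuous variable and pick the one maximising the Youden index, SE + SP - 1. A sketch on hypothetical data:

```python
def best_cutoff(values, outcomes):
    """Pick the dichotomisation cut-off maximising the Youden index (SE + SP - 1)."""
    best, best_j = None, -1.0
    for c in sorted(set(values)):
        tp = sum(1 for v, y in zip(values, outcomes) if y == 1 and v > c)
        fn = sum(1 for v, y in zip(values, outcomes) if y == 1 and v <= c)
        tn = sum(1 for v, y in zip(values, outcomes) if y == 0 and v <= c)
        fp = sum(1 for v, y in zip(values, outcomes) if y == 0 and v > c)
        se = tp / (tp + fn)  # sensitivity at this cut-off
        sp = tn / (tn + fp)  # specificity at this cut-off
        if se + sp - 1 > best_j:
            best, best_j = c, se + sp - 1
    return best

# Hypothetical lab values; AHE patients (outcome 1) tend to have higher values.
values = [1.0, 1.2, 1.5, 2.0, 2.2, 2.8, 3.0, 3.5]
outcomes = [0, 0, 0, 0, 1, 0, 1, 1]
print(best_cutoff(values, outcomes))  # 2.0
```

The scan is an optimisation over the ROC operating points; other objectives (e.g. cost-weighted errors) fit the same loop.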

This chapter provides an initial brief overview of general issues for the correct design of predictive models with binary outcomes. It broadly describes the main modelling approaches, then illustrates in more detail a method for creating score models for predicting the risk of an AHE. The method tackles and overcomes many of the above-mentioned limits. It uses a well-founded numerical bootstrap technique for appropriate statistical interpretation of simple scoring systems, and provides useful and reliable diagnostic and prognostic information (Carpenter & Bithell, 2000; DiCiccio & Efron, 1996). The whole design procedure is set out and validated by a simulation approach that mimics realistic clinical conditions. Finally, the method is applied to an actual clinical example, to predict the risk of morbidity of heart surgery patients in intensive care.
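As a sketch of the kind of bootstrap interpretation referred to above (percentile intervals as in Carpenter & Bithell, 2000; the data, resample count and helper name here are hypothetical, and the chapter's actual procedure is more elaborate), one can attach a confidence interval to the observed AHE proportion among patients sharing a given score:

```python
import random

def bootstrap_risk_ci(outcomes, n_boot=2000, alpha=0.05, seed=1):
    """Percentile-bootstrap CI for the AHE proportion in one score group.
    `outcomes` holds the 0/1 outcomes of patients with the same score."""
    rng = random.Random(seed)
    props = []
    for _ in range(n_boot):
        resample = [rng.choice(outcomes) for _ in outcomes]
        props.append(sum(resample) / len(resample))
    props.sort()
    lo = props[int(alpha / 2 * n_boot)]
    hi = props[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical: 40 patients with the same score, 10 of whom developed the AHE.
lo, hi = bootstrap_risk_ci([1] * 10 + [0] * 30)
print(lo, hi)  # an interval around the observed proportion 0.25
```

The width of such intervals makes explicit how the reliability of per-score risk estimates degrades when few patients (or few adverse outcomes) fall in a score group.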
