342 Health Management – Different Approaches and Solutions

larger values of k generally reduce the effect of noise on classification at the expense of distinction between classes.

Heuristic techniques are used to obtain the optimal value of k. A common choice is to take k equal to the square root of the total number of training cases, but cross-validation methods, such as bootstrap, are often preferred.

Although k-nearest neighbour is not strictly a probability method, it has been demonstrated that the fraction of the k neighbourhood training cases falling in the AHE region is a good estimate of class-conditional risk probability (Beyer et al., 1999).

**3.3 Logistic regression**

Logistic regression is perhaps the most popular method for estimating risk probabilities in the medical field (Hosmer & Lemeshow, 2000). Logistic regression is a variation of ordinary regression: it belongs to the family of methods called generalized linear models, which combine a linear part with an associated link function. It can be considered a predictive model to use when the dependent response variable is dichotomous and the independent predictor variables are of any type, i.e. continuous, categorical, or both. In d-dimensional feature space, the form of the model is:

$$\log\frac{P(AHE \mid \mathbf{x})}{1-P(AHE \mid \mathbf{x})} = c\_0 + c\_1 x\_1 + c\_2 x\_2 + ... + c\_d x\_d \tag{2}$$

where "log" is the natural logarithm function, xk (k = 1, 2, …, d) are the observed predictor values and ck (k = 0, 1, 2, ..., d) the regression coefficients estimated from training data using maximum likelihood criteria.

The inverse of eq. 2 allows the posterior probability of AHE risk, P(AHE|x), to be modelled by a continuous S-shaped curve, even if all predictor variables are categorical. The argument of the logarithm in eq. 2 is the probability of the outcome event occurring divided by the probability of the event not occurring, known as the odds. When it is specifically associated with dichotomous predictor variables (risk factors), it yields a useful measure of the relative risk due to single risk factors (the odds ratio). The reliability of logistic regression results is affected by linear correlations and interaction effects between predictor variables, dependence between error terms, and especially outliers.

**3.4 Artificial neural networks**

Artificial neural networks (or simply neural networks) are mathematical models mimicking the physiological learning functions of the human brain. They can be designed and trained to create optimal input-output maps of any physical or statistical phenomenon, the relationships of which may even be complex or unknown. They do not require sophisticated statistical hypotheses and account for all possible interrelations between predictor variables in a natural way. In this sense, neural networks can be considered universal approximators (Bishop, 1995).

A preliminary definition of network architecture is needed and should include number of neurons, number of layers, number and type of connections among neurons, type of neuronal activation functions and so on. Learning is the trickiest phase of neural networks: it consists of estimating network parameters (connection weights and activation thresholds) iteratively from training data, to minimize error between actual and model-estimated outputs. Feed-forward neural networks can be designed to directly estimate class-conditional risk probabilities.

**4. Direct score model**

A scoring model is a formula that assigns points based on known information, in order to predict an unknown future outcome. Many integer score systems have been designed for clinical application to critical patients. The most popular were derived from simplification of any of the above probability models by rounding their parameters to integer values. In particular, many approximate the coefficients of logistic regression models to the nearest integer values (Higgins et al., 1997). We do not dwell on the methodology of these score models here, directing readers to the specialised literature (Vincent & Moreno, 2010). Our main interest is to identify score values that give reliable probabilities of individual risk for prognostic purposes. We discuss on the design of a very simple score system that we call a "direct score model". We also provide a correct and useful statistical interpretation of model prognostic capacity, which can easily be extended to any other score model, even more sophisticated ones (Cevenini & P. Barbini, 2010).

#### **4.1 Model design**

Only binary predictor variables (risk factors) are used in this score model. The automatic computer procedure and model training is described by the following steps:


Design of Scoring Models for Trustworthy Risk Prediction in Critical Patients 345

- A forward iterative procedure is applied to a data sample (training set) which sums selected binary variables stepwise.
- At each step the risk factor providing the highest increment to AUC is included.
- All binary factors are reconsidered at each step, so that multiple selection of one factor gives rise to a multiple integer contribution to the score.
- Training is stopped when the cumulative increment in AUC obtained in five consecutive steps is less than 1%. This rather soft stopping criterion is used instead of well-established statistical methods (Zhou et al., 2008) to avoid selecting too few predictors, which reduces the possibility of associating an effective probability of AHE with each integer score.
- A testing dataset of the same size as the training set is used to evaluate model generalisation and to guide conclusive selection of the optimal predictor set.

Backward sessions and cross-validation trials cannot be applied because the model is nonparametric. Optimal model selection is carried out by a step-by-step analysis of model prognostic and diagnostic power. At each step w, the conditional probability of the adverse outcome (prognostic risk probability), Pw(AHE|Sk), associated with each kth integer score value Sk, is estimated from sample data as the ratio of adverse events to the total number of events determining a model score Sk.

The bias-corrected and accelerated bootstrap method is applied to estimate 95% CIs of Pw(AHE|Sk) using 2000 bootstrapped samples. This method makes it possible to infer complex statistics that are difficult or even impossible to represent mathematically, and it has proven to be theoretically and practically more accurate than other bootstrap methods (Cevenini & P. Barbini, 2010; DiCiccio & Efron, 1996). By graphic inspection of results, the convenience of grouping close scores having large 95% CIs because of excessively low data frequencies is considered. The model is chosen to correspond to the iteration providing the largest number of score values or classes having sufficiently narrow and separate 95% CIs with respect to the training data, and at the same time giving testing-data probabilities falling within their 95% CIs.

Once the model is created, the score, S, associated with a generic patient is simply given by:

$$\mathbf{S} = \sum\_{i=1}^{d} \mathbf{p}\_i \mathbf{s}\_i \tag{4}$$

where d is the number of predictors in the model, pi the binary value of the ith predictor, and si its model-identified associated score. Finally, model discrimination and calibration performance are compared with a logistic regression model designed on the same training data.

All statistical procedures are evaluated at a significance level of 95%.

#### **4.2 Simulation**

Many realistic simulation experiments are carried out to validate and optimise model design. Predictor variables are all taken in binary form, skipping the dichotomisation of continuous variables. In particular, we consider d dichotomised binary predictors, obtaining n = 2^d different combinations of these predictors. Each combination identifies one value of a discrete variable xj = j/n (j = 0, 1, 2, ..., n-1) ranging from 0 to 1. In this way two different beta probability density functions can be associated with adverse and favourable outcomes.
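To make the forward procedure of Section 4.1 concrete, the sketch below builds integer factor scores by repeatedly adding the binary factor that most increases AUC, with the soft five-step/1% stopping rule. This is an illustrative Python sketch, not the authors' MATLAB implementation; all function and variable names are our own.

```python
def auc(scores, labels):
    """Rank-based AUC (Mann-Whitney): probability that a random adverse
    case (label 1) scores higher than a random favourable one, ties 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return 0.5
    wins = ties = 0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1
            elif p == q:
                ties += 1
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

def build_score_model(X, y, max_steps=30, window=5, min_gain=0.01):
    """Forward stepwise score construction: at each step add the binary
    factor whose (possibly repeated) inclusion most increases AUC; stop
    when the AUC gain cumulated over `window` steps falls below `min_gain`."""
    d = len(X[0])
    s = [0] * d                  # integer score s_i of each factor (eq. 4)
    history = []                 # AUC reached after each step
    for _ in range(max_steps):
        best_auc, best_j = -1.0, None
        for j in range(d):       # every factor is reconsidered at each step
            trial = [sum((s[i] + (i == j)) * x[i] for i in range(d)) for x in X]
            a = auc(trial, y)
            if a > best_auc:
                best_auc, best_j = a, j
        s[best_j] += 1           # repeated selection -> multiple contribution
        history.append(best_auc)
        if len(history) > window and history[-1] - history[-1 - window] < min_gain:
            break
    return s, history
```

A patient's score is then S = Σ p_i·s_i exactly as in eq. 4, using the returned integer weights.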

Beta distribution is particularly suitable for representing multinomial phenomena, such as that described by the above n discrete values. In detail, we refer to the discrete probability distribution of a multinomial variable x, the probability values of which are calculated using a beta probability density function.

Figure 1 shows an example with two different choices of the beta probability density function shape parameters, α and β, to simulate healthy and sick subjects. When the class-conditional probability density functions of a two-class classification problem are known, the highest achievable discrimination level is related to the areas of overlap. The lowest error probability of classification, ε, is given by:

$$\epsilon = \int \min \left\{ \mathbf{P}(\mathbf{C}\_1)\mathbf{p}(\mathbf{x} \mid \mathbf{C}\_1),\ \mathbf{P}(\mathbf{C}\_2)\mathbf{p}(\mathbf{x} \mid \mathbf{C}\_2) \right\} d\mathbf{x} \tag{5}$$

where P(Ch) and p(x|Ch) are the prior probability and the class-conditional probability density function for class Ch (h = 1, 2), respectively. Prior probability of an adverse outcome, P(AHE), is also known as prevalence, π, and prior probability of favourable outcome, P(FHE), is 1-π. Because of the discrete nature of variable x, in our simulation study, eq. 5 can be approximated as:

$$\begin{aligned} \epsilon &= \frac{1}{n} \sum\_{j=0}^{n-1} \min \left[ \pi \times p(x\_j \mid AHE),\ (1-\pi) \times p(x\_j \mid FHE) \right] \\ p(x\_j \mid AHE) &= \mathrm{B}\left( x\_j, \alpha\_{AHE}, \beta\_{AHE} \right) \\ p(x\_j \mid FHE) &= \mathrm{B}\left( x\_j, \alpha\_{FHE}, \beta\_{FHE} \right) \end{aligned} \tag{6}$$

where αAHE, βAHE, αFHE and βFHE are the shape parameters of the corresponding beta functions, ΒAHE = Β(xj, αAHE, βAHE) and ΒFHE = Β(xj, αFHE, βFHE), related to adverse and favourable outcomes, respectively.
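Eq. 6 can be evaluated numerically from the beta shape parameters alone. The following Python sketch (illustrative; the chapter's computations were done in MATLAB, and the function names are our own) sums the pointwise minimum of the two prevalence-weighted class densities over the n grid points:

```python
from math import gamma

def beta_pdf(x, a, b):
    """Beta probability density function B(x; a, b); returned as 0 at the
    endpoints to avoid the divergence/indeterminacy when a or b < 1."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return gamma(a + b) / (gamma(a) * gamma(b)) * x**(a - 1) * (1 - x)**(b - 1)

def bayes_error(n, prev, a_ahe, b_ahe, a_fhe, b_fhe):
    """Eq. 6: lowest achievable classification error for the discrete
    variable x_j = j/n (j = 0..n-1) under the two beta class densities,
    with AHE prevalence `prev` (the chapter's pi)."""
    eps = 0.0
    for j in range(n):
        x = j / n
        eps += min(prev * beta_pdf(x, a_ahe, b_ahe),
                   (1 - prev) * beta_pdf(x, a_fhe, b_fhe))
    return eps / n
```

With the chapter's medium-separation setting (αFHE = 1, αAHE = 3, βAHE = βFHE = 5, π = 20%, n = 2^12), `bayes_error` gives an ε strictly below the prevalence, and ε shrinks as αAHE (the class separation) grows.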

Eq. 6 shows that ε depends on prevalence and beta parameters. At any iteration w of the above-mentioned stepwise procedure, for any kth integer value of score Sk, the simulated "true" conditional risk probability, $P\_w^t(AHE \mid S\_k)$, can be calculated using the Bayes theorem, considering AHE prevalence, π, and the class-conditional score probabilities, $P\_w^t(S\_k \mid AHE)$ and $P\_w^t(S\_k \mid FHE)$, of adverse and favourable outcomes, respectively:

$$P\_w^t(AHE \mid S\_k) = \frac{\pi\, P\_w^t(S\_k \mid AHE)}{\pi\, P\_w^t(S\_k \mid AHE) + (1-\pi)\, P\_w^t(S\_k \mid FHE)} \tag{7}$$

By assuming mutually exclusive xj events, the true class-conditional probabilities are simply obtained from the two simulated beta distributions as the sum of all the discrete probabilities corresponding to the xj values giving the score Sk, that is:

$$\begin{aligned} P\_w^t(S\_k \mid AHE) &= \frac{1}{n} \sum\_{x\_j \to S\_k} \mathrm{B}\left( x\_j, \alpha\_{AHE}, \beta\_{AHE} \right) \\ P\_w^t(S\_k \mid FHE) &= \frac{1}{n} \sum\_{x\_j \to S\_k} \mathrm{B}\left( x\_j, \alpha\_{FHE}, \beta\_{FHE} \right) \end{aligned} \tag{8}$$
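Eqs. 7 and 8 can be combined in a few lines of code. The Python sketch below is illustrative only: in particular, the correspondence assumed here between the index j of x_j = j/n and a specific bit pattern of the d binary factors is our own convention, not taken from the chapter.

```python
from math import gamma
from collections import defaultdict

def beta_pdf(x, a, b):
    """Beta density B(x; a, b), taken as 0 at the interval endpoints."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return gamma(a + b) / (gamma(a) * gamma(b)) * x**(a - 1) * (1 - x)**(b - 1)

def true_risk_by_score(s, prev, a_ahe, b_ahe, a_fhe, b_fhe):
    """'True' conditional AHE risk per integer score (eqs. 7-8).
    `s` lists the per-factor integer scores; combination j of the d binary
    factors maps to x_j = j/n with n = 2**d (bit i of j = factor i), and
    its score is the sum of s_i over the set bits."""
    d = len(s)
    n = 2 ** d
    p_ahe = defaultdict(float)   # P_t(S_k | AHE), eq. 8
    p_fhe = defaultdict(float)   # P_t(S_k | FHE), eq. 8
    for j in range(n):
        x = j / n
        k = sum(s[i] for i in range(d) if j >> i & 1)
        p_ahe[k] += beta_pdf(x, a_ahe, b_ahe) / n
        p_fhe[k] += beta_pdf(x, a_fhe, b_fhe) / n
    risk = {}
    for k in sorted(set(p_ahe) | set(p_fhe)):
        num = prev * p_ahe[k]
        den = num + (1 - prev) * p_fhe[k]
        risk[k] = num / den if den > 0 else float('nan')   # eq. 7 (Bayes)
    return risk
```

Because the adverse-outcome density is shifted towards large x, the computed risk grows with the score, which is the calibration behaviour the chapter checks against in Fig. 3.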


Fig. 1. Simulated probability density functions, ΒFHE and ΒAHE, for favourable and adverse outcomes, respectively: example with beta parameters αFHE = 1, αAHE = 3, βAHE = βFHE = 5

#### **4.2.1 Simulation experiments**

Simulation experiments are performed by randomly extracting N = NAHE + NFHE data items from beta distributions of adverse and favourable outcomes, ΒAHE and ΒFHE, respectively, to form two samples of size NAHE = π·N and NFHE = (1-π)·N. Each extracted item xj (j = 1, 2, ..., N) is represented as a d-dimensional point in the discrete space of binary variables.

We use d = 12 binary variables and simulate nine different conditions corresponding to the combinations of three prevalence values and three levels of separation between event classes, obtained by changing the parameters of the beta distributions. Low, medium and high separation between AHEs and FHEs are reproduced by increasing only the value of parameter αAHE, specifically equal to 2, 3 and 5, respectively. The other three beta parameters are kept constant at αFHE = 1, βAHE = βFHE = 5. Prevalence values of 5%, 20% and 40% are tried. For each condition, six samples with progressively doubled sizes, namely N = 250, 500, 1000, 2000, 4000 and 8000, are extracted for a total of 54 simulation experiments covering a wide range of actual clinical situations (see also Table 1). Training data is not used because the simulation process enables the true probabilities, described above, to be evaluated exactly.

All computations are performed using MATLAB code.
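A sample-generation step consistent with the setup above might look as follows. This is a Python illustration (the authors used MATLAB); the rounding of each continuous beta draw onto the nearest grid value x_j = j/n is our own simplifying assumption.

```python
import random

def simulate_sample(N, prev, a_ahe, b_ahe, a_fhe, b_fhe, d=12, seed=1):
    """Draw N_AHE = prev*N adverse and N_FHE = (1-prev)*N favourable items
    from the two beta distributions and map each draw onto one of the
    n = 2**d discrete values x_j = j/n (i.e. one binary-predictor pattern).
    Returns a list of (j, label) pairs, label 1 = AHE, 0 = FHE."""
    rng = random.Random(seed)
    n = 2 ** d
    n_ahe = round(prev * N)
    sample = []
    for _ in range(n_ahe):                       # adverse outcomes
        x = rng.betavariate(a_ahe, b_ahe)
        sample.append((min(int(x * n), n - 1), 1))
    for _ in range(N - n_ahe):                   # favourable outcomes
        x = rng.betavariate(a_fhe, b_fhe)
        sample.append((min(int(x * n), n - 1), 0))
    return sample
```

For the medium-separation condition (αAHE = 3 vs αFHE = 1), the adverse cases concentrate at larger x_j, so the two classes are partially, but not completely, separated, mimicking a realistic clinical sample.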

#### **4.2.2 Simulation results**


The method is illustrated in detail by describing the results of one of the 54 simulation experiments performed. The experiment corresponding to N = 1000, π = 20% and αAHE = 3 is illustrated, because it is similar to an actual clinical condition that will be shown below.

Figure 2 shows the AUC values obtained using the forward selection of model features from simulated training data described above. The stopping criterion arrested the stepwise algorithm at the eleventh step, after 5 out of 12 predictor variables had been selected. In fact, the cumulative increment in AUC was about 0.8% in the last five steps (nos. 7-11). The variables are numbered in order of decreasing discrimination power. The most discriminating variable, no. 1, was entered five times (s1 = 5) in the model, variable no. 2 three times (s2 = 3) and variables nos. 3-5 only once each (s3-5 = 1).

Figure 3 shows the 95% confidence interval of score-associated risk probabilities identified by the bias-corrected and accelerated bootstrap method applied to simulated sample data, from step no. 2 to step no. 9. For each integer score value, the estimated 95% CI is plotted together with the corresponding true probability of AHE (calculated from the beta distribution) and the percentage of cases. The discrimination capacity of the model can be detected at every step by observing the growth of estimated AHE probability with the score, whereas calibration is demonstrated by true risk probabilities (stars), which fall in the corresponding 95% confidence interval of the training data, with the sole exception of certain high scores, where there may be too few cases.
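The per-score CIs plotted in Fig. 3 come from resampling within each score class. The sketch below uses a plain percentile bootstrap for brevity; the chapter actually uses the bias-corrected and accelerated (BCa) variant, so this Python code is a simplified stand-in with hypothetical names.

```python
import random

def bootstrap_ci(events, B=2000, alpha=0.05, seed=7):
    """Percentile-bootstrap (1-alpha) CI for the AHE proportion within one
    score class. `events` is the list of 0/1 outcomes of the cases having
    that score; B resamples of the same size are drawn with replacement."""
    rng = random.Random(seed)
    m = len(events)
    props = sorted(sum(rng.choice(events) for _ in range(m)) / m
                   for _ in range(B))
    lo = props[int((alpha / 2) * B)]
    hi = props[min(int((1 - alpha / 2) * B), B - 1)]
    return lo, hi
```

As in the chapter's Fig. 3, the interval widens rapidly as the number of cases in a score class falls, which is what motivates pooling sparsely populated scores.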

Fig. 2. Area under the ROC curve (AUC) during stepwise selection of model features from simulated data. The predictor variables entered are also indicated


Fig. 3. 95% confidence intervals of AHE score probability, estimated from simulated training data, percentages of score cases and true probabilities (stars)


Now it is necessary to identify a model that reconciles calibration and discrimination. Excessively simple score models (few steps) have low discrimination power (low AUC) and give inopportunely separated 95% CIs. This can be observed at steps nos. 2 and 3 of Fig. 3 where only three score values (0, 1 and 2) are obtained: CIs between scores have very large gaps, suggesting that finer partitioning of the score axis can be achieved with a larger number of steps. Figure 2 indicates poor discrimination capacity of the scoring system at these initial steps.

On the contrary, if too many scores are used, as in steps nos. 7-9, the CIs are either too wide or overlap, worsening calibration accuracy. The width of score CIs increases significantly with decreasing observed frequency. For example, at step no. 4, score of 3 has only 16 cases (1.6%) and the corresponding 95% CI is so large that it completely overlaps with the previous score of 2. When the number of cases is even lower, as in step no. 9, where the highest scores of 6 and 7 have four and one cases, respectively, the bootstrap method fails to correctly estimate the CIs and the corresponding scores are totally unreliable in prognostic terms. Hence the need to combine neighbouring scores with too few cases. It is particularly convenient to pool the highest scores, which often have few cases, into a single class having a sufficient data frequency to significantly narrow the 95% CIs. For example, at step no. 6 it is useful to pool the last two scores of 4 and 5 into a single class. The pooling of adjacent scores with small data frequency enhances model prognostic reliability, usually with an insignificant reduction in discrimination capacity.

From the simulated experiment of Fig. 3, five score classes were identified as a suitable compromise between calibration and discrimination. At any step from no. 6 to no. 9, it is worthwhile combining scores greater than or equal to 4 and leaving the lower scores of 0-3 ungrouped, so as to form five score classes: 0, 1, 2, 3 and ≥ 4.
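The pooling rule just described, merging the sparsely populated highest scores into a single class, can be sketched in a few lines. This Python fragment is illustrative; the minimum-cases threshold is a hypothetical parameter, since the chapter chooses the pooling by graphic inspection of the CIs rather than by a fixed cutoff.

```python
def pool_high_scores(counts, min_cases):
    """Pool the highest adjacent score values into one class until the top
    class holds at least `min_cases` cases. `counts` maps score -> number
    of cases; returns the score classes as lists of pooled scores."""
    classes = [[k] for k in sorted(counts)]
    while len(classes) > 1 and sum(counts[k] for k in classes[-1]) < min_cases:
        top = classes.pop()          # merge sparse top class downwards
        classes[-1].extend(top)
    return classes
```

With case counts resembling step no. 8 of the simulated example, this yields exactly the five classes 0, 1, 2, 3 and ≥ 4.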

Figure 4 shows the results of pooling the three highest scores of step no. 8, which is preferred to the previous steps no. 6 and no. 7, because besides having higher discrimination capacity, the pooled class contains a greater number of cases, which narrows the related 95% CI to a greater extent. Just a small gap and a slight overlap can be observed in Fig. 4 between scores of 1 and 2, and between scores of 3 and the class of scores 4, respectively. Step no. 9 and subsequent steps not reported in Fig. 3 are discarded because no improvement can be obtained with respect to step no. 8 and CI overlap increases. Indeed, to improve the accuracy of estimates of individual probability of AHE, it could be worthwhile increasing the number of classes, tolerating a greater CI overlap. This can be done by analysing and selecting a step beyond the eighth, where the observed frequency in each class is of course significantly reduced, especially for high scores.

Comparison of the results of the three-step model with those of the eight-step pooled model shown in Fig. 4 indicates that the scoring system with five classes effectively fills the gaps between adjacent CIs of the simpler score model. At step no. 8, pooling of the highest scores does not significantly influence the discrimination capacity of the scoring system: the estimated AUC decreases slightly from 0.838 (95% CI, 0.781-0.885) to 0.827 (95% CI, 0.777-0.869).

Stepwise logistic regression applied to the training data used for the simulation example, set at statistical significance levels of 95% and 90% to enter and remove variables, respectively, selected the first five binary variables. Figure 5 compares ROC curves of the logistic model and the score model of Fig. 4. The ROC curve of true probability values, calculated from training data using beta distributions and the Bayes theorem, is also plotted (dashed line). AUCs of true data and the logistic model were 0.845 and 0.849, respectively.

Design of Scoring Models for Trustworthy Risk Prediction in Critical Patients 351


In Table 1, the three column groups correspond, from left to right, to low separation (AHE = 2), medium separation (AHE = 3) and high separation (AHE = 5) between event classes; ε is the lowest error probability of classification for each condition.

| Π (%) | 5 | 20 | 40 | 5 | 20 | 40 | 5 | 20 | 40 |
|---|---|---|---|---|---|---|---|---|---|
| ε (%) | 5.0 | 20.0 | 32.9 | 5.0 | 17.7 | 23.9 | 4.6 | 11.4 | 13.7 |
| N = 250 | 2 | 3 | 3 | 3 | 4 | 4 | 3 | 4 | 4 |
| N = 500 | 3 | 4 | 4 | 4 | 5 | 5 | 4 | 5 | 5 |
| N = 1000 | 4 | 4 | 4 | 4 | 5 | 5 | 4 | 5 | 6 |
| N = 2000 | 4 | 5 | 5 | 5 | 5 | 6 | 5 | 6 | 6 |
| N = 4000 | 5 | 5 | 5 | 5 | 6 | 6 | 5 | 6 | 6 |
| N = 8000 | 5 | 6 | 6 | 6 | 6 | 7 | 6 | 7 | 7 |

Table 1. Simulation experiments: largest number of score classes having sufficiently narrow and separate 95% confidence intervals of prognostic probability. AHE = shape parameter of the AHE beta distribution; Π = prevalence; ε = lowest error probability of classification; N = sample size

Fig. 4. 95% confidence intervals of AHE score probabilities estimated from simulated training data, percentages of score cases and true probabilities (stars) for the model identified at step no. 8

Fig. 5. ROC curves from simulated sample data. 95% CI refers to score model


When comparing model discrimination power by AUC, we have to consider that the ROC curve of the score model (continuous line) is drawn by connecting only 5 discrete points (score classes), whereas the logistic model curve (gray line) is based on many more probability values. Figure 5 shows that the score model is close enough to the logistic and true ROC curves. Clearly, discretisation leads to a lower AUC, resulting in underestimation of score-model discrimination capacity. In addition, the true and logistic curves lie to a large extent within the 95% CI of the score curve. Finally, in real clinical applications, logistic regression often includes continuous variables that may improve discrimination performance.
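The effect of having only a few ROC points can be reproduced with a short calculation. The class-level counts below are invented for illustration; the AUC is obtained by connecting the five score-class points and applying the trapezoidal rule, as for the score-model curve in Fig. 5.

```python
# Hypothetical (non-events, events) counts for score classes 0, 1, 2, 3 and >= 4
classes = [(291, 9), (198, 22), (120, 30), (52, 28), (20, 27)]

total_neg = sum(n for n, _ in classes)
total_pos = sum(e for _, e in classes)

# One ROC point per threshold "predict AHE if class index >= t"
points = []
for t in range(len(classes) + 1):
    fp = sum(n for n, _ in classes[t:])
    tp = sum(e for _, e in classes[t:])
    points.append((fp / total_neg, tp / total_pos))

# Trapezoidal rule over the few discrete ROC points
pts = sorted(points)
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
print(f"score-model AUC from 5 classes: {auc:.3f}")
```

Because the curve is a polygon through a handful of points, the trapezoidal area slightly undercuts the smooth curve a continuous predictor would trace, which is the underestimation effect noted in the text.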

The HL goodness-of-fit test (Hosmer & Lemeshow, 2000) showed good calibration performance of the logistic model (p = 0.751). However, 95% of the training-data errors between model-estimated and true percentage risk probabilities ranged from about -10.5% (underestimation) to +12.0% (overestimation), revealing uncertainty similar to that of the score model.

Table 1 gives the number of score values or classes identified by the same procedure for each of the 54 simulation experiments. It shows that the number of score classes increases with increasing sample size, prevalence and separation between event classes (decreasing error ε). The importance of estimating uncertainty suggests keeping the 95% CIs of between-class probabilities separate, or only slightly overlapping. This limits the identifiable number of score classes and provides reliable probability estimates. Enlargement and overlapping of 95% CIs, and the consequent loss of prognostic probability information, depend heavily on the data frequency of score values or classes and on their rate of AHEs, which is influenced by prevalence. Small samples and/or low prevalence make it necessary to pool neighbouring scores to form classes with a sufficient number of cases to ensure a reliable estimate (narrow CI) of class probabilities.




| Variable description | Acronym | Type | Cut-off | Steps |
|---|---|---|---|---|
| Inotropic heart drugs | IHD | Binary | | 1, 4, 10 (LR) |
| O2 delivery index | DO2I | Continuous | < 280 ml/min/m2 | 2 (LR) |
| Peripheral vascular disease | PVD | Binary | | 3, 9 (LR) |
| O2 extraction ratio | O2ER | Continuous | 38% | 5 (LR) |
| Emergency | EM | Binary | | 6 |
| CO2 production | VCO2 | Continuous | 180 ml/min | 7 |
| Pulmonary artery hypertension | PAH | Binary | | 8 (LR) |
| Cardio-pulmonary bypass time | CPB | Continuous | 2 hours | 11 (LR) |
| Intra aortic balloon pump | IABP | Binary | | 12 (LR) |
| Creatinine | Cr | Continuous | 1 mg/l | NE (LR) |
| Potassium | K | Continuous | 4.1 mEq/l | NE (LR) |
| Cardiac index | CI | Continuous | < 2.4 l/min/m2 | NS (LR) |
| Haemoglobin | Hb | Continuous | < 9.6 g/dl | NE |
| Mean arterial pressure | MAP | Continuous | > 95 mmHg | NS |
| Previous heart surgery | Re-do | Binary | | NS |

Table 2. Clinical variables, cut-off values for the dichotomisation of continuous variables and score-model entry steps. NE = not entered; NS = not statistically significant; LR = variable selected by stepwise logistic regression

Simulation experiments suggest grouping scores into classes when frequencies are less than about 3% and 10% of the whole sample for N = 8000 and N = 250, respectively. Only two classes are recognised in the worst condition of minimum sample size (N = 250), minimum prevalence (Π = 5%) and low separation between health events (AHE = 2). A maximum of seven score classes is identified in conditions of large sample size (N = 8000), high prevalence and high separation between event classes. Although more score classes could be achieved with greater CI overlap, the cost would be unreliable estimates.

The discrimination of the different simulation experiments is assessed by AUC of true simulated probability calculated using beta functions. Conditions of large overlap between areas of beta functions (AHE = 2) lead to values of true AUC ranging from 0.72 to 0.75; medium overlap (AHE = 3) gives AUC values in the range 0.82-0.85 and the conditions of greatest separation (AHE = 5) produce AUCs between 0.92 and 0.95.
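The "true" AUC implied by a pair of beta distributions can be approximated by Monte Carlo sampling. The shape parameters below are illustrative only, not the chapter's exact settings; a well-separated pair such as this one should give an AUC above 0.9.

```python
import random

random.seed(7)

# Illustrative shape parameters: non-events ~ Beta(2, 5), AHE events ~ Beta(5, 2)
non_events = [random.betavariate(2, 5) for _ in range(20000)]
events = [random.betavariate(5, 2) for _ in range(20000)]

# True AUC = P(event value > non-event value), estimated over random pairs
auc = sum((a > b) + 0.5 * (a == b)
          for a, b in zip(events, non_events)) / 20000
print(f"Monte Carlo AUC estimate: {auc:.3f}")
```

The probabilistic identity AUC = P(event score > non-event score) is what makes this estimate possible without tracing the ROC curve at all.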

#### **4.3 Clinical example**

The approach was applied to actual clinical data of critical patients in the intensive care unit to evaluate their risk of morbidity after heart surgery.

We used a sample of 1040 adult patients younger than 80 years, who underwent coronary artery bypass grafting and were admitted to the intensive care unit of the Department of Surgery and Bioengineering of Siena University. 212 patients developed at least one serious postoperative complication (cardiovascular, respiratory, neurological, renal, infectious or hemorrhagic), corresponding to a morbidity of 20.4% (Cevenini & Barbini, 2010; Cevenini et al., 2007). The data were split randomly into a training and a testing set of the same size (520 cases), with the same number of patients with morbid conditions in each set (106 cases) to avoid misleading bias in the results.
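The 50/50 split with equal numbers of morbid cases in each half is a stratified split. A minimal sketch, using synthetic placeholder records rather than the study data:

```python
import random

random.seed(1)

# Synthetic stand-in for the cohort: 1040 patients, 212 with morbidity
patients = [{"id": i, "morbid": i < 212} for i in range(1040)]

morbid = [p for p in patients if p["morbid"]]
normal = [p for p in patients if not p["morbid"]]
random.shuffle(morbid)
random.shuffle(normal)

# Stratified halves: 106 morbid + 414 normal cases in each 520-case set
train = morbid[:106] + normal[:414]
test = morbid[106:] + normal[414:]
print(len(train), len(test))
```

Shuffling within each outcome stratum before slicing is what guarantees both halves see the same morbidity rate, so neither set is biased towards easier or harder cases.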

Table 2 describes the 15 clinical variables used for score model design, six of which were binary in origin. The other nine continuous variables were dichotomised using cut-off values associated with the point of equal sensitivity and specificity on the respective ROC curves. Three of the resulting 15 binary variables were discarded because their odds ratios of morbidity were not significantly greater than 1. This left a total of 12 variables for training the score model, as in the simulation experiments.
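Dichotomisation at the point of equal sensitivity and specificity can be found by a direct search over candidate thresholds. The DO2I-like values below are simulated under assumed distributions; only the search procedure mirrors the text.

```python
import random

random.seed(0)

# Simulated DO2I-like values (ml/min/m2): morbid patients tend to be lower
normal = [random.gauss(310, 40) for _ in range(200)]
morbid = [random.gauss(250, 40) for _ in range(60)]

def sens_spec(c):
    sens = sum(x < c for x in morbid) / len(morbid)   # events flagged by x < c
    spec = sum(x >= c for x in normal) / len(normal)  # non-events passed
    return sens, spec

# Cut-off at the ROC point where sensitivity and specificity are (nearly) equal
cut = min(sorted(normal + morbid),
          key=lambda c: abs(sens_spec(c)[0] - sens_spec(c)[1]))
s, p = sens_spec(cut)
print(f"cut-off ~ {cut:.0f} ml/min/m2: sensitivity {s:.2f}, specificity {p:.2f}")
```

Scanning only the observed values is sufficient because sensitivity and specificity change exclusively at data points; the chosen threshold is the ROC point closest to the diagonal of equal error rates.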

This real clinical situation was similar to the simulation experiment with N = 500 and Π = 20% (see Table 1). Consulting Table 1, we expected to develop a score model with 4 or 5 classes, depending on the level of data separation between normal and morbid patients.

Figure 6 shows the stepwise procedure used to select the model variables. After step no. 8, AUC values of testing data (dashed line with stars) decreased and diverged from training-data AUCs (continuous line with dots). This indicated overfitting, which was possible because the criterion used to stop the training procedure was deliberately soft, to allow inclusion of more steps than needed for generalisation. In fact, as previously illustrated in the simulation results, investigation of extra steps can be useful to optimise model prognostic power through score pooling. Steps nos. 6, 7 and 8 gave similar prognostic performance, so we chose step no. 8, thus obtaining higher discrimination (greater AUC). A convenient class was formed by pooling scores greater than 3, as shown in Fig. 7. All 95% CIs of adjacent scores or classes were well separated and all testing score probabilities (stars) fell within their corresponding CIs, thereby ensuring high prognostic reliability of the model. The pooling of the highest scores of the eight-step model led to a slight but not statistically significant reduction in discrimination performance: the estimated training and testing AUCs decreased from 0.851 (95% CI, 0.781-0.909) to 0.835 (95% CI, 0.764-0.895) and from 0.841 (95% CI, 0.775-0.900) to 0.816 (95% CI, 0.743-0.879), respectively.
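A greedy stepwise selection like the one traced in Fig. 6 can be sketched as follows: at each step, add the binary variable that most increases the AUC of the running unit-weight score, and stop when AUC no longer improves. All data here are synthetic; the soft stopping rule and variable set are illustrative assumptions.

```python
import random

random.seed(3)

def auc(scores, labels):
    """AUC via the Mann-Whitney statistic (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

n = 300
labels = [random.random() < 0.2 for _ in range(n)]
# four informative binary variables (70% agreement with outcome) + three noise
X = [[y if random.random() < 0.7 else not y for y in labels] for _ in range(4)]
X += [[random.random() < 0.3 for _ in range(n)] for _ in range(3)]

def score(sel):
    """Unit-weight score: number of selected risk factors present."""
    return [sum(X[i][k] for i in sel) for k in range(n)]

selected, remaining = [], list(range(len(X)))
while remaining:
    best = max(remaining, key=lambda j: auc(score(selected + [j]), labels))
    if selected and auc(score(selected + [best]), labels) <= auc(score(selected), labels):
        break  # soft stopping rule: no AUC improvement
    selected.append(best)
    remaining.remove(best)

print("entry order:", selected, "AUC:", round(auc(score(selected), labels), 3))
```

On synthetic data the informative variables enter first; monitoring the same AUC on a held-out set, as in Fig. 6, is what reveals the step at which further additions start to overfit.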





Two logistic regression models were designed to compare the score-model results on the same training data with the 15 clinical variables of Table 2. The first model, named LogCV, used the original continuous variables and the second (LogBV) dichotomised them (see Table 2). The stepwise regression procedure selected ten clinical variables (see Table 2) and provided training-data AUC values of 0.906 (HL test, p = 0.135) and 0.871 (HL test, p = 0.557) for LogCV and LogBV, respectively. Figure 8 compares the ROC curves. The LogCV ROC curve (continuous gray line) showed the greatest discrimination performance, mainly because the model selected many continuous variables (6 out of 10). Except for the highest specificity values, where the discretisation effect of scoring was more evident, the score-model ROC curve (continuous black line) did not differ significantly from that of LogBV (dashed gray line), which was inside the respective 95% CI and close enough to the score-model points. Model scores computed using the testing data gave a ROC curve (dashed black line) not significantly different from the training-data curve. Finally, it should be noted that the discrimination performance of the logistic models decreased considerably when applied to testing data (ROC curves not reported in Fig. 8): AUCs of LogCV and LogBV were reduced to 0.879 and 0.826, respectively, suggesting possible overfitting.

Fig. 6. Area under the ROC curve (AUC) during the stepwise selection of model features from clinical data. The predictor variables entered are also indicated


Fig. 7. Estimated 95% confidence intervals of AHE score probabilities from clinical training data, percentages of score cases and testing-data probabilities (stars) for the eight-step model chosen

Fig. 8. ROC curves from clinical data. 95% CI refers to the score model. CV = also with continuous variables; BV = with binary variables only


Unfortunately, the statistics of the sampling error are not simple to derive. We therefore preferred to use bootstrap resampling, a method commonly used in statistical inference to estimate confidence intervals (Carpenter & Bithell, 2000; DiCiccio & Efron, 1996). The bootstrap method is simpler and more general than conventional approaches; it requires no great expertise in mathematics or probability theory and is based on assumptions that are less restrictive and easier to control. The method can be used to evaluate statistics that are difficult or impossible to determine by conventional methods. We used an elaboration of the simplest bootstrap method of percentile intervals, known as bias-corrected and accelerated intervals, which avoids estimate bias and offers substantial advantages over other bootstrap methods, both in theory and practice (Chernick, 2007). Our simulation experiments confirmed the method's accuracy in estimating 95% CIs of prognostic probabilities: when true probabilities were related to score values, or classes, with a sufficient number of sampled training data, they always fell within the bootstrap-estimated 95% CIs (see Fig. 3). Bootstrap techniques are not too demanding to apply in a clinical environment, since many available data-processing packages now include them for calculating confidence intervals. In any case, they are used exclusively during model design.

As shown in Fig. 3, step-by-step graphical inspection of probability CIs made it possible to choose the model offering the best compromise between calibration and discrimination, also suggesting convenient pooling of adjacent scores that gave large and overlapping CIs due to an insufficient number of cases or adverse events. The controlled simulation experiments showed that good calibration was achieved with a limited number of score classes, up to a maximum of seven in experiments with the biggest sample size and high prevalence and separation between event classes (see Table 1). More classes could be identified if greater overlap of close scores were allowed, but when the number of classes became excessive, there were problems of overfitting. We also saw that a logistic model designed on the same training data provided nearly continuous probability estimates, the uncertainty of which was similar to that achieved by the score model. Significant improvement of discrimination performance could only be appreciated when continuous variables were also included in the logistic model, as in the clinical example described. This analysis can enable medical staff to select the best scoring system for any specific clinical context.

In critical care medicine, scoring systems are often designed exclusively on the basis of discrimination and generalisation characteristics (diagnostic capacity), at the expense of reliable individual probabilities (prognostic capacity). Our proposed approach, which weighs both these capacities, is validated by suitable simulation experiments, which also allow design conditions and application limits of scoring systems to be investigated for correct prediction of critical patient risk in a real clinical context.

The bias-corrected and accelerated bootstrap method for evaluating the 95% confidence interval, CI, of individual prognostic probabilities provides reliable estimates of true simulated probabilities. CIs are calculated for each score and at each step of scoring-system design. By increasing the number of steps, model discrimination power (greater AUC) and prognostic information (greater number of different score values) increase, but widening and overlap of 95% CIs soon occur, so that it becomes convenient to pool adjacent scores into score classes. The maximum number of different score classes giving distinct prognostic
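A percentile-interval bootstrap, the simple variant that the bias-corrected and accelerated method refines, can be sketched in a few lines; the event counts here are invented for illustration.

```python
import random

random.seed(42)

# Invented score-class data: 28 AHE events out of 80 cases
cases = [1] * 28 + [0] * 52

def percentile_ci(data, stat, n_boot=2000, alpha=0.05):
    """Plain percentile bootstrap CI (the chapter uses the BCa refinement)."""
    stats = sorted(stat([random.choice(data) for _ in data])
                   for _ in range(n_boot))
    return stats[int(n_boot * alpha / 2)], stats[int(n_boot * (1 - alpha / 2)) - 1]

def mean(xs):
    return sum(xs) / len(xs)

lo, hi = percentile_ci(cases, mean)
print(f"P(AHE) = {mean(cases):.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```

Each bootstrap replicate resamples the class's cases with replacement and recomputes the event proportion; the 2.5th and 97.5th percentiles of the replicates bound the interval, with no distributional assumptions beyond the sample itself.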

**6. Conclusion** 
