Calibration can be improved, without changing discrimination capacity, by suitable monotonic mathematical transformations of model predicted probabilities (Harrell et al., 1996). The mean squared error between model predicted probability and observed binary outcomes is sometimes calculated as a global index of model accuracy, and has been demonstrated to incorporate both discrimination and calibration capacities (Murphy, 1973).

**2.2 Generalisation, cross-validation and variable selection**

Generalisation is defined as the capacity of the model to maintain the same predictive performance on data not used for training, but belonging to the same population. A high generalisation power is of primary importance for predictive models designed on a sample data set of correctly classified cases (training set). Many different procedures, which involve different correctly classified data sets for testing model performance (testing sets), have been used to control model generalisation (Bishop, 1995; Fukunaga, 1990; Vapnik, 1999). A model generalises when differences between errors of testing and training sets are not statistically significant.

Theoretically, the optimal model is the simplest possible model designed on training data that has the highest possible performance on any other equally representative set of testing data. Excessively complex models tend to overfit, i.e. give significantly lower errors on the training data than on the testing data. Overfitting produces data storage rather than learning of prediction rules. Models must be designed to avoid overfitting and improve generalisation through efficient control of the training process. This control often includes suitable techniques for the selection of predictor variables (Guyon & Elisseeff, 2003).

Computer algorithms for properly controlling overfitting are known as cross-validation or rotation techniques and make efficient use of all available data to train and test the model (Vapnik, 1999). The most common type of cross-validation procedure is k-fold, where the original sample is randomly partitioned into k subsamples, one of which is used as the testing set and the other k–1 as the training set. The process is then repeated k times, changing the testing set each time so that all subsamples are used for testing. A convenient variant, more appropriate in dichotomous classification, selects each subsample to contain approximately the same proportion of cases in the two classes. When k is equal to the sample size, n, the procedure is called leave-one-out: one case is tested at a time in each of the n training sessions, using the other n–1 cases for training. Resampling methods also exist, and include bootstrap methods that produce different data samples by randomly extracting cases with replacement from the original dataset (Chernick, 2007).

Cross-validation can be used to compare the performance of different predictive modelling procedures and, specifically, to select different sets of predictor variables with the same model. In fact, it is convenient to select the best minimum subset of predictor variables to control generalisation and to avoid information overlap due to correlation between variables. Computer-aided stepwise techniques are usually used to obtain optimal nested subsets of variables for this purpose. At each step of model training, a variable is entered into or removed from the predictor subset on the basis of its contribution to a significant increase in discrimination performance (typically the AUC for dichotomous classification). The stepwise process stops when no variable satisfies the statistical criterion for inclusion or removal (Guyon & Elisseeff, 2003).

**3. Probability models**
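The stratified k-fold variant described above can be sketched in a few lines of Python (a minimal illustration with hypothetical helper names; outcomes are coded 1 for AHE and 0 for FHE):

```python
import numpy as np

def stratified_kfold_indices(y, k, seed=0):
    """Split sample indices into k folds, each holding approximately the
    same proportion of the two outcome classes (AHE = 1, FHE = 0)."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for label in (0, 1):
        idx = np.flatnonzero(y == label)
        rng.shuffle(idx)
        for i, j in enumerate(idx):       # deal each class out round-robin
            folds[i % k].append(int(j))
    for i in range(k):
        test = np.array(sorted(folds[i]))
        train = np.array(sorted(set(range(len(y))) - set(folds[i])))
        yield train, test                 # train on k-1 folds, test on one

# toy outcome vector: 4 adverse events among 12 patients
y = np.array([1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0])
folds = list(stratified_kfold_indices(y, k=4))
# every testing fold contains 3 patients, exactly one of them an AHE case
```

Setting k equal to the number of cases reduces this scheme to the leave-one-out procedure.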

We now provide an overview of four approaches for estimating AHE risk probability: the Bayesian classification rule (Lee, 2004), k-nearest neighbour discrimination (Beyer et al., 1999), logistic regression (Dreiseitl & Ohno-Machado, 2002; Hosmer & Lemeshow, 2000), and artificial neural networks (Bishop, 1995; Dreiseitl & Ohno-Machado, 2002). Linear and quadratic discriminant analyses and related Fisher discriminant functions were not considered because they are strictly classification methods, and although they also enable easy derivation of prediction probabilities, they have been demonstrated to be equivalent to Bayesian methods (Fukunaga, 1990).

#### **3.1 Bayesian classifiers**

Bayes's rule allows the posterior conditional probability of AHEs to be predicted as follows (Lee, 2004):

$$P(\text{AHE} \mid x) = \frac{P(\text{AHE})\, p(x \mid \text{AHE})}{P(\text{AHE})\, p(x \mid \text{AHE}) + P(\text{FHE})\, p(x \mid \text{FHE})} \tag{1}$$

where P(AHE) and P(FHE) = 1–P(AHE) are the prior probabilities of the adverse and favourable health events, respectively, and p(x|AHE) and p(x|FHE) are the corresponding class-conditional probability density functions (CPDFs) of the selected features x. The posterior probability of class FHE is simply P(FHE|x) = 1–P(AHE|x).

With the posterior class-conditional probability threshold Pt set at 0.5, the Bayes decision rule gives the minimum error: each patient is assigned to the class with the larger posterior probability. Higher values of Pt classify fewer patients as at risk, and lower values classify more.

Lack of knowledge about the prior probability P(AHE), i.e. the prevalence of AHE, does not affect the discrimination performance of the Bayesian classifier, since it can be counterbalanced by a different choice of Pt. A reliable estimate of the prognostic probability P(AHE|x), however, can be obtained only if all prior probabilities and CPDFs are correctly known.

Statistical assumptions are usually made about whether the CPDFs have a parametric or non-parametric structure. In many cases they are assumed to be of the parametric Gaussian type, because this has been proven to provide good discrimination performance, especially if a subset of predictors can be optimally selected from a large set of clinically available variables (E. Barbini et al., 2007; Fukunaga, 1990).
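For a single Gaussian feature, eq. 1 reduces to a few lines of Python. The sketch below uses made-up parameters purely for illustration, not values from the chapter:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian class-conditional density p(x | class)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def posterior_ahe(x, prior_ahe, mu_ahe, sd_ahe, mu_fhe, sd_fhe):
    """Eq. 1: posterior probability P(AHE | x) for one Gaussian feature."""
    num = prior_ahe * gaussian_pdf(x, mu_ahe, sd_ahe)
    den = num + (1.0 - prior_ahe) * gaussian_pdf(x, mu_fhe, sd_fhe)
    return num / den

# made-up parameters: AHE patients tend towards higher feature values
p = posterior_ahe(x=4.0, prior_ahe=0.1,
                  mu_ahe=5.0, sd_ahe=1.5, mu_fhe=2.0, sd_fhe=1.0)
at_risk = p > 0.5          # Bayes decision rule with threshold Pt = 0.5
```

Note how the low prevalence (prior of 0.1) pulls the posterior down even for a feature value closer to the AHE mean.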

#### **3.2 K-nearest neighbour algorithms**

The k-nearest neighbour algorithm is among the simplest non-parametric methods for classifying patients on the basis of the closest training examples in the space of features x (Beyer et al., 1999). Euclidean distance is usually used to measure between-point nearness, but other metrics must be introduced if non-continuous variables are considered.

In our binary classification scheme, the training phase simply consists of partitioning the feature space into the two regions or classes, AHE and FHE, based on the positions of the training cases. Each new patient is assigned to the region in which the greatest number of its k neighbours occurs, where k is a positive integer.

With two classes, it is convenient to choose an odd k to avoid ties. Typically, the choice of neighbourhood size depends on the type and size of the training set; larger values of k generally reduce the effect of noise on classification at the expense of distinction between classes.







Heuristic techniques are used to obtain the optimal value of k. A common choice is to take k equal to the square root of the total number of training cases, but data-driven methods, such as cross-validation or bootstrap resampling, are often preferred.

Although k-nearest neighbour is not strictly a probability method, it has been demonstrated that the fraction of k neighbourhood training cases falling in the AHE region is a good estimate of class-conditional risk probability (Beyer et al., 1999).
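This neighbourhood fraction is easy to compute. The sketch below (toy data and a hypothetical function name) estimates P(AHE|x) for a new patient in a two-dimensional feature space:

```python
import numpy as np

def knn_risk_probability(x_new, X_train, y_train, k):
    """Estimate P(AHE | x) as the fraction of the k nearest training
    cases (by Euclidean distance) belonging to the AHE class (label 1)."""
    dist = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dist)[:k]        # indices of the k closest cases
    return float(y_train[nearest].mean())

# toy 2-D feature space: AHE cases cluster near (1, 1), FHE near (0, 0)
X = np.array([[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [0.1, 0.2],
              [0.9, 1.0], [1.0, 0.9], [1.1, 1.0], [1.0, 1.1]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
p = knn_risk_probability(np.array([0.95, 0.95]), X, y, k=5)   # odd k
```

With k = 5, four of the five nearest neighbours of the query point belong to the AHE cluster, giving an estimated risk of 0.8.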

#### **3.3 Logistic regression**

Logistic regression is perhaps the most popular method for estimating risk probabilities in the medical field (Hosmer & Lemeshow, 2000). It is a variation of ordinary regression that belongs to the family of generalised linear models, which combine a linear predictor with a link function. It can be considered the predictive model of choice when the dependent response variable is dichotomous and the independent predictor variables are of any type, i.e. continuous, categorical, or both. In d-dimensional feature space, the form of the model is:

$$\log \frac{P(\text{AHE} \mid x)}{1 - P(\text{AHE} \mid x)} = c_0 + c_1 x_1 + c_2 x_2 + \dots + c_d x_d \tag{2}$$

where "log" is the natural logarithm function, xk (k = 1, 2, …, d) are the observed predictor values, and ck (k = 0, 1, 2, ..., d) are regression coefficients estimated from the training data by the maximum likelihood criterion.

The inverse of eq. 2 allows the posterior probability of AHE risk, P(AHE|x), to be modelled by a continuous S-shaped curve, even if all predictor variables are categorical. The argument of the logarithm in eq. 2, i.e. the probability of the outcome event occurring divided by the probability of the event not occurring, is known as the odds. For a dichotomous predictor variable (risk factor), the exponential of its coefficient is the corresponding odds ratio, a useful measure of the relative risk due to that single risk factor. The reliability of logistic regression results is affected by linear correlations and interaction effects between predictor variables, dependence between error terms, and especially outliers.
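Inverting eq. 2 takes one line of Python. The coefficients below are illustrative values, not fitted to any data set:

```python
import math

def logistic_risk(x, coef, intercept):
    """Invert eq. 2: P(AHE | x) = 1 / (1 + exp(-(c0 + c1*x1 + ... + cd*xd)))."""
    f = intercept + sum(c * xi for c, xi in zip(coef, x))
    return 1.0 / (1.0 + math.exp(-f))

# illustrative coefficients for two dichotomised risk factors (not fitted)
coef, intercept = [0.8, 1.2], -2.0
p_baseline = logistic_risk([0, 0], coef, intercept)   # no risk factors
p_both = logistic_risk([1, 1], coef, intercept)       # both factors present
odds_ratio_x1 = math.exp(coef[0])   # odds multiplier when x1 flips 0 -> 1
```

With these values the linear predictor for a patient carrying both risk factors is exactly zero, so the modelled risk is 0.5; the exponential of each coefficient gives the odds ratio of the corresponding factor.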

#### **3.4 Artificial neural networks**

Artificial neural networks (or simply neural networks) are mathematical models mimicking the physiological learning functions of the human brain. They can be designed and trained to create optimal input-output maps of any physical or statistical phenomenon, even when the underlying relationships are complex or unknown. They do not require sophisticated statistical hypotheses and account for all possible interrelations between predictor variables in a natural way. In this sense, neural networks can be considered universal approximators (Bishop, 1995).

A preliminary definition of the network architecture is needed and should include the number of neurons, number of layers, number and type of connections among neurons, type of neuronal activation functions and so on. Learning is the trickiest phase of neural network design: it consists of estimating the network parameters (connection weights and activation thresholds) iteratively from training data, to minimise the error between actual and model-estimated outputs. Feed-forward neural networks can be designed to directly estimate class-conditional posterior probabilities from predictor variables, without requiring sophisticated statistical hypotheses. Their architecture can be variably complex, but should provide one output neuron with a logistic sigmoid activation function, generating an output between 0 and 1. Neural networks have been demonstrated to provide reliable estimates of class-conditional posterior probabilities, such as the AHE risk probability, P(AHE|x), that is (Bishop, 1995):

$$\begin{aligned} P(\text{AHE} \mid x) &= \frac{1}{1 + \exp(-f)} \\ f &= b + w_1 u_1 + w_2 u_2 + \dots + w_n u_n \end{aligned} \tag{3}$$

where f is a linear function of the n neuron inputs uk (k = 1, 2, ..., n), originating from the outputs of the n preceding connected neurons; its parameters are the connection weights, wk, and the neuron activation bias, b.
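A toy feed-forward network with one hidden layer and the sigmoid output neuron of eq. 3 can be sketched as follows (hand-picked illustrative weights; a real network would learn them during training):

```python
import math

def sigmoid(f):
    """Logistic sigmoid of eq. 3, mapping any f to a probability."""
    return 1.0 / (1.0 + math.exp(-f))

def mlp_risk(x, W_hidden, b_hidden, w_out, b_out):
    """Tiny feed-forward network: one tanh hidden layer feeding a single
    logistic-sigmoid output neuron, giving P(AHE | x)."""
    u = [math.tanh(b + sum(w * xj for w, xj in zip(row, x)))
         for row, b in zip(W_hidden, b_hidden)]
    return sigmoid(b_out + sum(wk * uk for wk, uk in zip(w_out, u)))

# hand-picked illustrative weights; training would estimate these values
W_hidden = [[1.0, -0.5], [0.3, 0.8]]    # hidden-layer weights
b_hidden = [0.0, -0.2]                  # hidden-layer biases
w_out, b_out = [1.2, -0.7], -0.1        # output weights wk and bias b
p = mlp_risk([0.4, 1.0], W_hidden, b_hidden, w_out, b_out)
```

Whatever the hidden-layer weights, the sigmoid output neuron guarantees a risk estimate between 0 and 1.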

Under-learning can lead to high prediction errors, whereas over-learning can cause overfitting, which produces loss of generalisation. Artificial neural network design is therefore anything but simple. Experience is necessary to manipulate heuristic procedures for suitable definition of the network architecture and to correctly use iterative numerical training techniques that stop learning when the network begins to overfit.

**4. Direct score model**

A scoring model is a formula that assigns points based on known information, in order to predict an unknown future outcome. Many integer score systems have been designed for clinical application to critical patients. The most popular were derived by simplifying one of the above probability models, rounding its parameters to integer values. In particular, many approximate the coefficients of logistic regression models to the nearest integer values (Higgins et al., 1997). We do not dwell on the methodology of these score models here, directing readers to the specialised literature (Vincent & Moreno, 2010). Our main interest is to identify score values that give reliable probabilities of individual risk for prognostic purposes. We discuss the design of a very simple score system that we call a "direct score model". We also provide a correct and useful statistical interpretation of model prognostic capacity, which can easily be extended to any other score model, even more sophisticated ones (Cevenini & P. Barbini, 2010).

**4.1 Model design**

Only binary predictor variables (risk factors) are used in this score model. The automatic computer procedure and model training is described by the following steps:

- All quantitative predictor variables are dichotomised by ROC curve analysis, identifying cut-off values giving equal sensitivity and specificity in relation to adverse outcomes.
- Risk factors over or under the cut-off value are coded 0 or 1, depending on whether the risk of AHE decreases or increases, respectively.
- The odds ratio of each binary variable is evaluated on the basis of the corresponding confidence interval (CI) (Agresti, 1999): variables with odds ratios not significantly greater than 1 are discarded.
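The dichotomisation and odds-ratio screening steps of the direct score model can be sketched as follows. This is a minimal illustration: the helper names are hypothetical, the cut-off search is a simple scan rather than a full ROC analysis, and the Woolf log-method CI stands in for the exact formula of Agresti (1999):

```python
import math

def equal_sens_spec_cutoff(values, outcome):
    """Scan candidate cut-offs and return the one where sensitivity and
    specificity for the adverse outcome (coded 1) are closest to equal."""
    best, best_gap = None, float("inf")
    for c in sorted(set(values)):
        tp = sum(v >= c and o == 1 for v, o in zip(values, outcome))
        tn = sum(v < c and o == 0 for v, o in zip(values, outcome))
        sens = tp / sum(outcome)
        spec = tn / (len(outcome) - sum(outcome))
        if abs(sens - spec) < best_gap:
            best, best_gap = c, abs(sens - spec)
    return best

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio (a*d)/(b*c) with a 95% Woolf (log-method) confidence
    interval from a 2x2 table: a, b = AHE cases with/without the factor;
    c, d = FHE cases with/without the factor."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi
```

A quantitative variable would be dichotomised at the returned cut-off, and retained as a risk factor only when the lower confidence limit of its odds ratio exceeds 1.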
