**3. Experimental evaluation**

### **3.1. Exploratory analyses**

The MIMIC-III database consists of 58,976 hospital admissions from 46,520 patients, all of whom had at least one ICU admission.

Guided by the CRoss-Industry Standard Process for Data Mining (CRISP-DM), the ETL process started by retrieving data from the MIMIC-III tables of interest (d\_labitems, admissions, patients, and diagnosis\_icd) [26]. Next, patients with renal failure were selected by the Elixhauser score, leaving 1477 patients (3.15%) satisfying our inclusion criteria and 20,068 patient days (examples) in total. In a consecutive step, admissions were joined on the hospital admission id (hadm\_id) with all laboratory tests (from the 755 item ids in d\_labitems), aggregated at a daily level; the mean, the standard deviation, and the number of tests per day (*len*) were defined as aggregation functions. As the output feature (*label*), this study focused on hospital mortality (hospital\_expire\_flag); of all renal failure patients in this study, 399 (27.0%) did not survive the hospital admission. Finally, the data were split per day in order to examine feature selection and weight changes over time. We limited our computations to an admission duration of 7 days, for each of which the number of patients was >1000; beyond that period, the number of patients remaining in the ICU declined.
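As an illustration of this aggregation step (the chapter's actual ETL was implemented in RapidMiner; the file names and pandas code below are hypothetical, while the column names follow the MIMIC-III schema), building one example per patient day might look like:

```python
import pandas as pd

# Hypothetical extracts of the MIMIC-III labevents and admissions tables
labs = pd.read_csv("labevents.csv", parse_dates=["charttime"])
adm = pd.read_csv("admissions.csv", parse_dates=["admittime"])

# Join laboratory tests to admissions on hadm_id and compute the admission day
df = labs.merge(adm[["hadm_id", "admittime", "hospital_expire_flag"]], on="hadm_id")
df["day"] = (df["charttime"] - df["admittime"]).dt.days + 1
df = df[df["day"].between(1, 7)]  # limit to the first 7 admission days

# One example per patient day: mean, standard deviation, and number of tests
# (len) per lab item id, plus the hospital mortality label per admission
daily = (df.groupby(["hadm_id", "day", "itemid"])["valuenum"]
           .agg(mean="mean", std="std", len="count")
           .unstack("itemid"))
labels = adm.set_index("hadm_id")["hospital_expire_flag"]
```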

Patients who survived the hospital stay were significantly older (69.3 ± 12.4 years vs. 65.9 ± 14.1 years; p < 0.05) and suffered more frequently from deficiency anemia (15.5 vs. 9.8%; p = 0.01) and depression (8.3 vs. 3.8%; p = 0.00). The survivors suffered less frequently from congestive heart failure (40.2 vs. 46.9%; p = 0.02), valvular disease (9.8 vs. 14.3%; p = 0.01), lymphoma (1.6 vs. 3.8%; p = 0.01), and metastatic cancer (1.7 vs. 4.5%; p = 0.00). **Table 1** displays the basic characteristics of the baseline dataset. Binary variables are reported as counts with prevalence percentages, and continuous variables as mean ± standard deviation.

| Characteristics | ICU patients with renal failure (n = 1477) | Survival during hospital admission (n = 1078, 73.0%) | Death during hospital admission (n = 399, 27.0%) | p |
|---|---|---|---|---|
| Age (years) | 66.8 ± 13.8 | 69.3 ± 12.4 | 65.9 ± 14.1 | *<0.05* |
| Sex (male, %) | 60.1 | 59.4 | 61.9 | |
| Congestive heart failure | 620 (42.0) | 433 (40.2) | 187 (46.9) | *0.02* |
| Cardiac arrhythmias | 438 (29.7) | 307 (28.5) | 131 (32.8) | 0.10 |
| Valvular disease | 163 (11.0) | 106 (9.8) | 57 (14.3) | *0.01* |
| Pulmonary circulation | 95 (6.4) | 68 (6.3) | 27 (6.8) | 0.75 |
| Peripheral vascular disease | 274 (18.6) | 199 (18.5) | 75 (18.8) | 0.88 |
| Hypertension | 13 (0.9) | 8 (0.7) | 5 (1.3) | 0.35 |
| Paralysis | 19 (1.3) | 13 (1.2) | 6 (1.5) | 0.65 |
| Other neurological | 70 (4.7) | 46 (4.3) | 24 (6.0) | 0.16 |
| Chronic pulmonary disease | 253 (17.1) | 188 (17.4) | 65 (16.3) | 0.60 |
| Diabetes uncomplicated | 322 (21.8) | 225 (20.9) | 97 (24.3) | 0.15 |
| Diabetes complicated | 435 (28.5) | 326 (30.2) | 109 (27.3) | 0.27 |
| Hypothyroidism | 155 (10.5) | 120 (11.1) | 35 (8.8) | 0.19 |
| Renal failure | 1477 (100) | 1078 (100.0) | 399 (100.0) | |
| Liver disease | 76 (5.1) | 51 (4.7) | 25 (6.3) | 0.24 |
| Peptic ulcer | 12 (0.8) | 10 (0.9) | 2 (0.5) | 0.42 |
| Aids | 19 (1.3) | 14 (1.3) | 5 (1.3) | 0.94 |
| Lymphoma | 32 (2.2) | 17 (1.6) | 15 (3.8) | *0.01* |
| Metastatic cancer | 36 (2.4) | 18 (1.7) | 18 (4.5) | *0.00* |
| Solid tumor | 68 (4.6) | 46 (4.3) | 22 (5.5) | 0.31 |
| Rheumatoid arthritis | 44 (3.0) | 31 (2.9) | 13 (3.3) | 0.70 |
| Coagulopathy | 149 (10.1) | 106 (9.8) | 43 (10.8) | 0.59 |
| Obesity | 34 (2.3) | 28 (2.6) | 6 (1.5) | 0.21 |
| Weight loss | 60 (4.1) | 37 (3.4) | 23 (5.8) | *0.04* |
| Fluid electrolyte | 572 (38.7) | 414 (38.4) | 158 (39.6) | 0.68 |
| Blood loss anemia | 0 (0.0) | 0 | 0 | |
| Deficiency anemias | 206 (13.9) | 167 (15.5) | 39 (9.8) | *0.01* |
| Alcohol abuse | 41 (2.8) | 29 (2.7) | 12 (3.0) | 0.74 |
| Drug abuse | 25 (1.7) | 19 (1.8) | 6 (1.5) | 0.73 |
| Psychoses | 42 (2.8) | 32 (3.0) | 10 (2.5) | 0.63 |
| Depression | 105 (7.1) | 90 (8.3) | 15 (3.8) | *0.00* |

**Table 1.** Patient characteristics of the baseline dataset.

In **Figure 1**, box plots describe the distribution of the number of laboratory tests per admission day, demonstrating a decline in the number of different laboratory tests requested from day 1 to day 4. For the following days, the number of requested laboratory tests was stable.

**Figure 1.** Distribution of the number of laboratory tests per patient by days.

### **3.2. Automatic model building, feature selection and evaluation**

**Ensemble learning methods**



Ensemble (meta-learning) methods combine multiple models with the aim of providing more accurate or more stable predictions. An ensemble can aggregate the same model built on different sub-samples of the data, different models built on the same sample, or a combination of the two. Ensemble methods are often used to improve the individual performance of the algorithms that constitute them by exploiting the diversity among the models produced [21]. The ensemble methods implemented in this chapter are Random Forest [22], Boosting [23], and Bootstrap Aggregating (Bagging) [24]. In our experiments, Boosting and Bagging used J4.8 and Logistic regression as base learners.

Random Forest (RF) is an ensemble classifier that builds multiple decision trees and aggregates their results by majority voting in order to classify an example [22]. There are two levels of randomization in building these models: first, each tree is trained on a bootstrap sample of the training data; second, in each recursive iteration of building a decision tree (splitting the data based on the information potential of features), a random subset of features is selected for evaluation. In this research, we grew and evaluated Random Forests with 10 trees.

Boosting is an ensemble meta-algorithm developed to improve the supervised learning performance of weak learners (models whose predictive performance is only slightly better than random guessing). In this study, the adaptive boosting (AdaBoost) algorithm was used [23].

The Bagging algorithm builds a series of models (e.g., CHAID decision trees) on different data subsamples drawn with replacement [24]. For new examples, each model is applied and the predictions are aggregated (e.g., by majority voting for classification or by averaging for regression).
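For illustration only, the seven algorithm configurations could be sketched with scikit-learn as follows. This is not the chapter's RapidMiner implementation, and scikit-learn ships neither J4.8 (C4.5) nor CHAID, so its CART `DecisionTreeClassifier` stands in for the tree learners:

```python
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

models = {
    # Single models; a CART tree stands in for J4.8
    "Tree (J4.8-like)": DecisionTreeClassifier(),
    "Logistic": LogisticRegression(max_iter=1000),
    # Random Forest with 10 trees, as grown in this research
    "Random Forest": RandomForestClassifier(n_estimators=10, random_state=0),
    # Boosting (AdaBoost) and Bagging with tree and logistic base learners
    "AdaBoost (tree)": AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=3), random_state=0),
    "AdaBoost (logistic)": AdaBoostClassifier(
        estimator=LogisticRegression(max_iter=1000), random_state=0),
    "Bagging (tree)": BaggingClassifier(
        estimator=DecisionTreeClassifier(), random_state=0),
    "Bagging (logistic)": BaggingClassifier(
        estimator=LogisticRegression(max_iter=1000), random_state=0),
}
```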

**Feature weighting and selection**

Several filter feature selection schemes were evaluated. Filter selection (FS) methods rely on evaluating the information potential of each input feature in relation to the label (hospital mortality). For each predictive model, a threshold search selected the features providing the most predictive power. The first weighting scheme is based on the Pearson correlation, returning the absolute or squared value of the correlation as the attribute weight. Furthermore, we applied the Information Gain Ratio and the Gini Index, two weighting schemes based on information-theoretic measures frequently used with decision trees to evaluate potential splits [17]. The T-test scheme calculated, for each attribute, a p-value for a two-sided, two-sample t-test. Finally, ReliefF evaluated the impact of an attribute by repeatedly sampling an instance and considering the value of the given attribute for the nearest instances of the same and of a different class [25].
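A minimal sketch of two of these weighting schemes (Pearson correlation and the two-sample t-test; the `filter_weights` helper is hypothetical, and Gini, Information Gain Ratio, and ReliefF would be added analogously):

```python
import numpy as np
import pandas as pd
from scipy import stats

def filter_weights(X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
    """Per-feature filter weights against a binary label (illustrative sketch)."""
    # Absolute Pearson correlation between each feature and the label
    corr = X.apply(lambda col: abs(np.corrcoef(col.fillna(col.mean()), y)[0, 1]))

    def t_weight(col: pd.Series) -> float:
        a, b = col[y == 0].dropna(), col[y == 1].dropna()
        if len(a) < 2 or len(b) < 2:
            return 0.0
        # Two-sided, two-sample t-test; 1 - p so that larger means more informative
        return 1.0 - stats.ttest_ind(a, b, equal_var=False).pvalue

    return pd.DataFrame({"correlation": corr, "t_test": X.apply(t_weight)})
```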


A more detailed technical description of the use of RapidMiner for scalable predictive analytics of medical data, as well as templates of generic processes, can be found in [8] and its supplementary materials.

Initially, all features were weighted by each of five feature weighting and selection methods (Information Gain Ratio, Gini Index, Correlation, ReliefF, and T-test) for each day. In order to find the adequate number of features to be used by each predictive model for each day (and to identify the optimal feature selection methods for our data), we conducted the following procedure. First, we sorted the features by their weights in descending order (for each feature weighting method). Then we trained each predictive model (a decision tree, Logistic regression, Random Forest, and Bagging and Boosting, each of the latter two with both J4.8 and Logistic regression base learners) on the subsets of features with the highest weights, starting from 10 features and increasing in steps of 10 (nine feature sets) [27, 28]. Even though this yielded a large number of experiments (315 experiments: 7 algorithms × 5 feature selection schemes × 9 thresholds), this method, as previously described [8], allowed the whole experimental setup to be implemented within a single RapidMiner process execution and with complete reproducibility of the results.
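A sketch of this experiment grid, reusing the hypothetical `models` and `filter_weights` from the sketches above and assuming a `data_by_day` dict with one (features, label) pair per admission day:

```python
from sklearn.metrics import average_precision_score
from sklearn.model_selection import cross_val_predict

results = []
for day, (X, y) in data_by_day.items():
    weights = filter_weights(X, y)                    # one column per scheme
    for scheme in weights.columns:
        ranked = weights[scheme].sort_values(ascending=False).index
        for k in range(10, 100, 10):                  # nine feature-set sizes
            for name, model in models.items():
                proba = cross_val_predict(model, X[ranked[:k]], y, cv=5,
                                          method="predict_proba")[:, 1]
                results.append((day, scheme, k, name,
                                average_precision_score(y, proba)))
```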


Evaluation and comparison of all predictive models was performed by the AUPRC (area under the precision-recall curve) because of the unbalanced nature of the data [27, 28]. The frequently used area under the receiver operating characteristic curve (AUROC) is calculated from the true positive rate and the false positive rate; because the false positive rate stays small when a predictor rarely predicts the positive class, AUROC can look high even when precision is low, and it is therefore often misleading in the case of imbalanced data.
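A toy illustration of the difference (synthetic data, not the study's results):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
y = (rng.random(2000) < 0.05).astype(int)              # 5% positive class
score = 0.5 * y * rng.random(2000) + rng.random(2000)  # weakly informative score

print(roc_auc_score(y, score))            # AUROC can still look respectable
print(average_precision_score(y, score))  # AUPRC reveals the poor precision
```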


Because of the relatively small number of samples (between 1000 and 1500 for each day), the predictive performance of the models on unseen data was estimated by fivefold cross-validation with stratified sampling, preserving the initial distribution of the positive and negative classes of the target attribute. This validation on relatively small samples avoids the risk of misleading interpretation of the results caused by a biased selection of a single test set.
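In scikit-learn terms this corresponds to a stratified fivefold split (sketch; `model`, `X`, and `y` as in the sketches above):

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stratification preserves the ~27% positive (death) rate in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auprc = cross_val_score(model, X, y, cv=cv, scoring="average_precision")
print(auprc.mean(), auprc.std())
```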


### **3.3. Performance and feature selection**


First, we present a comparison between the feature selection methods, based on the maximal predictive performance (in terms of AUPRC) over all algorithms. Next, we restrict further analyses to experiments with the overall best feature selection technique. The values in **Table 2** show the maximal predictive performance for each day and each feature selection technique; maximum values by day (rows) are shown in bold. It can be seen that Gini and ReliefF achieved the maximum values on all days except the first day of ICU admission.
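Such a summary can be pivoted out of the experiment grid sketched earlier (assuming the `results` list from that sketch):

```python
import pandas as pd

res = pd.DataFrame(results, columns=["day", "scheme", "k", "model", "auprc"])
# Maximum AUPRC per day and feature selection scheme, over all algorithms
# and feature-set sizes (the layout of Table 2)
table2 = res.pivot_table(index="day", columns="scheme",
                         values="auprc", aggfunc="max")
```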


| Day/FS method | Info Gain Ratio | Gini | Correlation | ReliefF | T-test |
|---|---|---|---|---|---|
| 1 | 0.44 | 0.47 | **0.48** | 0.45 | 0.33 |
| 2 | 0.79 | **0.84** | 0.82 | **0.84** | 0.77 |
| 3 | 0.80 | **0.85** | **0.85** | **0.85** | 0.78 |
| 4 | 0.83 | **0.86** | **0.86** | **0.86** | 0.78 |
| 5 | 0.84 | **0.86** | **0.86** | **0.86** | 0.80 |
| 6 | 0.83 | **0.86** | 0.85 | **0.86** | 0.77 |
| 7 | 0.84 | **0.86** | **0.86** | **0.86** | 0.77 |

Bold values represent the best results per row; multiple bold values in a row mean that more than one result was equally good.

**Table 2.** Maximum area under the precision-recall curve (AUPRC) performance of algorithms per day (rows) and feature selection measure (columns).

On the first day, the maximal performance was achieved with correlation (0.48). A more detailed inspection of the model performance and of the number of features selected for each admission day showed that Random Forest yielded the best predictive performance on all days except the first. Logistic regression also often achieved a good performance. J4.8 achieved the worst AUPRC performance over all days, but in synergy with the AdaBoost ensemble scheme it provided a performance competitive with Random Forest and Logistic regression.

Further, **Table 2** illustrates that predictive performance increases over the days and stabilizes from day 4 to day 7 at AUPRC = 0.86. The AUPRC values for days 2 and 3 are also high (0.83 and 0.85, respectively), illustrating that the risk of hospital mortality can be predicted with high confidence from the second day of admission onward.

Values of the area under the precision-recall curve describe the general performance of the predictive models but say nothing about the actual decision threshold that should be selected for making predictions. Therefore, we analyzed possible thresholds by inspecting the precision-recall (PR) trade-off. High recall means that most of the positive examples (in this case, hospital death) are predicted correctly; high precision means that there are few false alarms (mortality is predicted, but the patient survived). **Figure 2** shows the PR curves for the first 4 days, generated from the predictions of the best performing models (**Table 3**) built on features from the Gini selection.

**Figure 2.** PR curves for the first 4 days, generated from the predictions of the best performing models built on features from the Gini selection.

| Day | J4.8 | Logistic | Random Forest | AdaBoost (J4.8) | AdaBoost (Logistic) | Bagging (J4.8) | Bagging (Logistic) |
|---|---|---|---|---|---|---|---|
| 1 | 0.35 (10) | **0.47 (60)** | 0.42 (30) | 0.37 (10) | 0.42 (30) | 0.36 (20) | 0.46 (30) |
| 2 | 0.76 (60) | **0.83 (50)** | **0.83 (70)** | 0.81 (30) | 0.82 (60) | 0.78 (50) | 0.82 (60) |
| 3 | 0.78 (30) | 0.84 (20) | **0.85 (40)** | 0.84 (60) | 0.82 (90) | 0.76 (10) | 0.83 (20) |
| 4 | 0.79 (20) | 0.85 (20) | **0.86 (40)** | **0.86 (10)** | 0.83 (20) | 0.78 (10) | 0.84 (10) |
| 5 | 0.81 (20) | 0.85 (30) | **0.86 (80)** | 0.85 (20) | 0.83 (20) | 0.80 (10) | 0.84 (10) |
| 6 | 0.82 (20) | 0.83 (40) | **0.86 (70)** | **0.86 (20)** | 0.82 (90) | 0.80 (20) | 0.83 (10) |
| 7 | 0.80 (10) | 0.84 (30) | **0.86 (80)** | 0.85 (10) | 0.81 (30) | 0.78 (10) | 0.84 (30) |

Values are AUPRC with the number of selected features in parentheses. Bold values represent the best results per row; multiple bold values in a row mean that more than one result was equally good.

**Table 3.** Algorithm performance per admission day based on the Gini index feature selection measure.

On the first day, all models performed poorly, as seen in the upper left PR curve in **Figure 2**: the highest recall is achieved at a precision of 0.3 (70% false alarms), so this model is not useful regardless of the threshold selected. On days 2–4, maximal recall can be achieved with around 20% false alarms. Considering the cost of false-negative predictions (low recall), we argue that the optimal threshold for the day 2–4 models should be chosen for a recall between 0.8 and 1.
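A sketch of this threshold choice (with `y` and cross-validated probabilities `proba` as in the sketches above):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y, proba)
# recall[i] belongs to thresholds[i] and decreases as the threshold grows,
# so the last index still satisfying the floor gives the highest threshold
idx = np.where(recall[:-1] >= 0.8)[0].max()
print(f"threshold={thresholds[idx]:.2f}, "
      f"precision={precision[idx]:.2f}, recall={recall[idx]:.2f}")
```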

Further, the results for the Gini feature selection method in **Table 3** demonstrate that different algorithms achieved their best predictive accuracy with different numbers of selected features, and that the number of features selected varied over the days.



Finally, we analyzed whether the features selected on the first day, where the predictive performance was consistently poor, differed from those selected on the other days, where the predictive performance was acceptable. Rank means and standard deviations were calculated for two groups: all days (including day 1) and days 2–7 (excluding day 1) (**Figure 3**). The standard deviations of the ranks are much higher when day 1 is included (right part of **Figure 3**). The average ranks change (middle part of the figure), but similar laboratory tests occupy the first 15 ranks in both cases.

**Figure 3.** Pyramid chart of feature ranking (laboratory tests) for predictive value by mean and standard deviation, in- and excluding day 1.
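This comparison can be sketched as follows (assuming a hypothetical `ranks` DataFrame of per-day Gini ranks, with laboratory tests as rows and days 1–7 as columns):

```python
# Mean and standard deviation of each test's rank, with and without day 1
with_day1 = ranks.agg(["mean", "std"], axis=1)
without_day1 = ranks.drop(columns=[1]).agg(["mean", "std"], axis=1)

# Rank std shrinks when day 1 is excluded; the top-15 tests stay similar
print(with_day1.sort_values("mean").head(15))
print(without_day1.sort_values("mean").head(15))
```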

### **4. Discussion**

This study using ensemble methods demonstrated an improvement in predictive accuracy compared to prediction based on single models. Random Forests seem to provide the best predictive accuracy (**Table 3**), complying with our previous research [8]. Random Forest also resulted in a high predictive accuracy for mortality risk prediction in another study [29]; that study, however, did not analyze different ensemble and feature selection methods and was conducted on a different population. In addition, the feature selection techniques Gini Index and ReliefF scored best in the majority of the cases.

The laboratory tests ranked per day based on Random Forest and the Gini Index (*mean*; *std.*: standard deviation; *len*: number of tests/day) indicated the importance of mean lactate values (ranked first on days 1, 2, and 8) and of the mean white blood cell count (on days 3, 4, and 6) in the prediction of hospital mortality. In addition, it is remarkable that the predictive features for hospital mortality, calculated from 755 laboratory-related parameters without any additional patient-related information or medical knowledge, correlated well with the laboratory tests used on a daily basis (sodium, anion gap, chloride, bicarbonate, creatinine, urea nitrogen, potassium, glucose, INR, hemoglobin, phosphate, total bilirubin, and base excess). Laboratory-based clinical decision support may improve physician adherence to guidelines with respect to timely monitoring of chronic kidney disease [30, 31].

As demonstrated in this study, parameters reflecting shock (lactate), sepsis (white blood cell count), and multi-organ failure carry the greatest predictive importance [32].

The ensemble models in this study were able to generate a high predictive accuracy (AUPRC values) from day 4, with acceptable results on the second and third day. On the first day of admission, however, AUPRC values were very low, which corresponds well with the diagnostic uncertainty typical of the first day of admission.

Surprisingly, patients who survived the hospital stay were significantly older but suffered less frequently from lymphoma and metastatic cancer. These findings might indicate some admission bias for certain comorbidities, or indicate a constitutional superiority of older people admitted to the ICU despite renal failure.

The hospital mortality in this renal failure ICU population was 27.0% (399/1477) [33]. Laboratory testing alone is only a part of the daily assessment of ICU patients. Further research could extend the predictive analysis of laboratory tests with other patient-related data and to different patient populations.

### **5. Conclusions**

Predictive analytics using ensemble methods are able to predict the hospital or ICU outcome of renal patients with high accuracy, and predictive accuracy changes with the length of stay. Feature ranking enables a quantitative assessment of patient data (e.g., laboratory tests) for predictive power; lactate and white blood cell count best predicted hospital mortality in this population. From the second day of ICU admission, predictive accuracy based on laboratory tests was >80%. This generates opportunities for efficacy and efficiency analysis of other data recorded during the ICU stay.

### **Acknowledgements**

Licenses for RapidMiner/Radoop were provided by RapidMiner (Cambridge, MA, USA).
