**2. Methods**

An overview of the proposed methodology from raw data to explainable decision framework is shown in **Figure 1**.

#### **2.1 Dataset and study population**

The publicly available training set consists of data from two cohorts [18]. Cohort A contains 790,215 records from 20,336 patients; cohort B contains 761,995 records from 20,000 patients. Each patient record comprises 40 clinical covariates: 8 vital signs, 26 laboratory values, and 6 demographic values. The patient data were labeled according to the Sepsis-3 clinical criteria. **Table 1** presents the clinical covariates used in this study together with their percentages of missing values [18, 19].

#### **2.2 Feature extraction**

Feature extraction is performed on the imputed version of the clinical data, generating features sample-wise on an hourly time grid. Two types of features were generated:

*An Explainable Machine Learning Model for Early Prediction of Sepsis Using ICU Data DOI: http://dx.doi.org/10.5772/intechopen.98957*

*Physiological features*: In the literature, inter-relations among clinical values have been shown to enhance anomaly detection [7, 20]. After reviewing studies that establish the clinical significance of well-known physiological inter-relations among the given clinical signs, 10 such relations are derived from the covariates: three shock indices, namely the well-defined Shock Index (SIndex) based on systolic BP and two modified versions proposed in this study based on diastolic BP (DBPSIndex) [21] and Mean

#### **Figure 1.**

*Graphical overview: from given raw clinical data to an explainable decision framework. (a & b) Clinical data from two ICU cohorts are imputed. (c) Physiological inter-relations and time-lag differences are computed as features. (d) An optimal sepsis onset prediction architecture is developed using LightGBM models via Bayesian optimization. (e) The predictions are rendered as explanations, and their predictive power is potentially increased by refining the threshold that drove the prediction at every time point. (f) Final decision.*





**Table 1.**

*Details of the various clinical variables used under study along with missing values information in percentage.*


#### **Table 2.**

*Detailed definitions of the physiological features.*

Arterial Pressure (MAPSIndex) [22], followed by the ratios BUN/Creatinine (BUNCr) [7], Bilirubin-total/Creatinine (BILTcr), SaO2/FiO2 [23], PaO2/FiO2 [24], and Platelets/Age (PlaAge), the difference between SBP and DBP known as Pulse Pressure (PP) [25], and lastly Cardiac Output (CO) [26]. **Table 2** gives a detailed description of the physiological features.
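A minimal sketch of how such physiological inter-relations could be computed from the covariates. The exact definitions live in **Table 2**; the formulas below (the shock indices as heart rate divided by the respective pressure, pulse pressure as SBP − DBP) are common textbook forms used here for illustration only.

```python
# Illustrative computation of a subset of the physiological features.
# Formulas are assumptions based on standard clinical definitions,
# not a verbatim transcription of Table 2.

def physiological_features(hr, sbp, dbp, map_, bun, creatinine, pao2, fio2):
    """Return a dict of illustrative physiological inter-relations."""
    return {
        "SIndex": hr / sbp,          # Shock Index: HR / systolic BP
        "DBPSIndex": hr / dbp,       # modified shock index using diastolic BP
        "MAPSIndex": hr / map_,      # modified shock index using mean arterial pressure
        "BUNCr": bun / creatinine,   # BUN-to-creatinine ratio
        "PaO2FiO2": pao2 / fio2,     # oxygenation ratio
        "PP": sbp - dbp,             # pulse pressure
    }

feats = physiological_features(hr=90, sbp=120, dbp=80, map_=93,
                               bun=18, creatinine=1.2, pao2=95, fio2=0.4)
print(feats["SIndex"], feats["PP"])  # 0.75 40
```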

*Time-lag difference features*: A set of 35 time-lag features is computed as 6-hour time-lag differences of the vital signs and laboratory values among the given 40 clinical variables, excluding the last 5 demographic values.
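The time-lag computation can be sketched as follows: for each variable on the hourly grid, the feature at hour t is the value at t minus the value at t − 6. How the first six hours (where no 6-hour-old value exists) are handled is an assumption; they are left as NaN here.

```python
import numpy as np

# Sketch of the 6-hour time-lag difference features on the hourly grid.
def time_lag_features(series, lag=6):
    series = np.asarray(series, dtype=float)
    out = np.full_like(series, np.nan)   # first `lag` hours have no reference value
    out[lag:] = series[lag:] - series[:-lag]
    return out

hr = [80, 82, 84, 85, 88, 90, 95, 100]  # toy hourly heart-rate values
diffs = time_lag_features(hr)
print(diffs)                             # six NaNs, then 15.0, 18.0
```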

Finally, the 45 derived features are combined with the given 40 clinical signs, bringing the final feature count to 85. The resultant feature set is then used to train the proposed xMLEPS framework.

#### **2.3 Implementation of xMLEPS**

xMLEPS, an optimal method for detecting sepsis onset six hours in advance, is developed using Bayesian optimization together with refinement of the prediction risk threshold. As shown in **Figure 1**, the given clinical sepsis data contain a large amount of missing information (approximately 20%), so filling these missing values is carried out as a preprocessing step at the start of the algorithm. Missing values in the given EHR clinical data are imputed by forward filling. In a real-time scenario, currently missing values must be filled with previously available measurements; thus only the previous clinical values of the given EHR data are fetched to impute the current observation.

In this study, imputation is carried out in two rounds: first local imputation for each individual record, then global imputation over all records combined. In local imputation, trailing missing values in the row for a particular clinical covariate (or feature vector) are forward filled with the nearest past non-missing value within that record. If a record begins with 'NaN' values, i.e. for the very first measurement at t = 0, they are initially retained as-is and later replaced with the 'global mean' of that covariate computed over all records [19].
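The two imputation rounds can be sketched as below, on toy data: per-record forward fill first, then any still-missing leading values replaced with the global mean of the covariate over all records. Variable names are ours, not from the study.

```python
import numpy as np

# Round 1: local forward fill within one record's covariate row.
def forward_fill(row):
    row = row.copy()
    for t in range(1, len(row)):
        if np.isnan(row[t]):
            row[t] = row[t - 1]
    return row

# Round 2: replace remaining (leading) NaNs with the global mean over all records.
def impute(records):
    """records: list of 1-D arrays, one covariate row per patient record."""
    filled = [forward_fill(np.asarray(r, dtype=float)) for r in records]
    global_mean = np.nanmean(np.concatenate(filled))
    return [np.where(np.isnan(r), global_mean, r) for r in filled]

recs = impute([[np.nan, 2.0, np.nan, 4.0],    # leading NaN -> global mean
               [1.0, np.nan, np.nan, 3.0]])   # interior NaNs -> forward fill
print(recs[0], recs[1])
```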

During model development, a ten-fold cross-validation scheme is employed, wherein 10 LightGBM classifiers with the same hyper-parameter complexity, obtained during Bayesian optimization, are developed for the corresponding folds. The feature set used to develop these models comprises the 85 features described in *sub-Section 2.2*. In general, hyper-parameter optimization seeks the hyper-parameter values that minimize the objective loss function. The hyper-parameter settings maximizing the custom-defined challenge metric, the utility score, on a subset of the training data during the Bayesian optimization phase are later used to build the models. These models generate predictions on the held-out 10% of validation data in each fold. The training process in each fold stops when the utility score on the validation set shows no further improvement over 32 consecutive iterations, i.e. early stopping resets the model to its best iteration and thereby avoids over-fitting.
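The patient-wise fold construction that underlies this scheme can be sketched as follows: records of one patient must never be split between training and validation folds. The round-robin grouping below is an illustration; the study's exact stratification follows the challenge setup [18].

```python
import random

# Sketch of a patient-wise ten-fold split: folds partition *patients*,
# so all records of a patient land in exactly one fold.
def patient_folds(patient_ids, n_folds=10, seed=0):
    unique = sorted(set(patient_ids))
    random.Random(seed).shuffle(unique)
    return [unique[i::n_folds] for i in range(n_folds)]  # round-robin assignment

ids = [f"p{i}" for i in range(100)]
folds = patient_folds(ids)
print(len(folds), len(folds[0]))  # 10 folds of 10 patients each
```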

The initial predictions generated by each optimal model on the corresponding validation data of each fold then undergo refinement of the prediction risk threshold to enhance the utility score. The threshold search space lies in the range 0 to 1, varied in steps of 0.05, giving 20 candidate values. The initial predictions on the validation data of each fold are compared against each of these 20 values, and the threshold yielding the maximum utility score for that fold's predictions is taken as optimal. The 10 resulting optimum thresholds are later used by the corresponding models to refine the predictive power, in terms of utility score, of the labels generated in each fold.
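The threshold search can be sketched as a scan over the 20 candidate values, keeping the one that maximizes the scoring function. The challenge utility score is defined in [18]; we substitute F1 here purely so the sketch is self-contained.

```python
import numpy as np

# Stand-in score: F1 instead of the challenge utility score (assumption).
def f1(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Scan 20 thresholds (0.05 to 1.00 in steps of 0.05) and keep the best.
def refine_threshold(y_true, risk, score=f1):
    best_t, best_s = 0.5, -np.inf
    for t in np.arange(0.05, 1.05, 0.05):
        s = score(y_true, (risk >= t).astype(int))
        if s > best_s:
            best_t, best_s = round(float(t), 2), s
    return best_t, best_s

y = np.array([0, 0, 1, 1, 1])
risk = np.array([0.1, 0.4, 0.35, 0.8, 0.9])
t, s = refine_threshold(y, risk)
print(t, s)  # 0.15 is optimal on this toy fold
```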

The LightGBM-based gradient boosting framework provides a dedicated processing method for sparse data, which is important in our classification task with its class imbalance problem [27]. For interpretability of the proposed framework, LightGBM's feature importance attribute is used to quantify each variable, and the explainability component is addressed by employing SHAP summary and dependency plots, wherein the distribution of variable importance is illustrated [28, 29].

#### **3. Results**

The proposed framework performs prediction from the given patient records to determine the risk of sepsis onset developing in the next 6 hours. This is assessed using a continuous-valued utility score, as defined by the challenge organizers, for each prediction [18]. The utility function rewards or penalizes classifiers for their predictions within 12 hours before and 3 hours after sepsis onset time and was normalized as described in [18]. Using a ten-fold cross-validation scheme, 10 LightGBM models are designed based on patient-wise stratified ten folds, each containing a unique 10% of the entire training set. The hyper-parameters of these models that minimize the cross-validation loss are obtained using the automatic hyper-parameter optimization utility 'bayesopt' in Python [30, 31]. The underlying

objective function formulated for the optimization is intended to maximize the AUROC. The software utility finds the optimal parameters automatically using Bayesian optimization. The optimized models use: *num_leaves* of 60, *min_data_in_leaf* of 120, *max_depth* of 2, *learning_rate* of 0.01, *scale_pos_weight* of 20, and *min_samples_split* of 4.
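The reported settings, written as a LightGBM-style parameter dictionary. Note that *min_samples_split* is a scikit-learn style name reproduced from the text, which LightGBM itself would not recognize, and the `objective` entry is our assumption for a binary sepsis-onset task.

```python
# Tuned hyper-parameters as reported above (configuration sketch).
params = {
    "num_leaves": 60,
    "min_data_in_leaf": 120,
    "max_depth": 2,
    "learning_rate": 0.01,
    "scale_pos_weight": 20,
    "min_samples_split": 4,      # scikit-learn style name, as in the text
    "objective": "binary",       # assumption: binary sepsis-onset classification
}
```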

**Table 3** summarizes the results of the proposed framework on the entire training data under the ten-fold cross-validation scheme, together with the performances of the inter-cohort and baseline studies. To ensure that the trained models learn dependencies not only between patient records but also across cohorts, we considered an inter-cohort training and testing scheme, i.e. the model trained on cohort A data was scored on cohort B data and vice versa. This mitigates concerns of over-fitting and thereby increases the robustness of the framework. The inter-cohort scores for A and B were 0.3191 and 0.3284, respectively.

#### **3.1 Comparison of xMLEPS with baseline**

Further, to emphasize the clinical relevance of the derived features in the proposed method, a comparative analysis is carried out against three baseline studies, as shown in **Figure 2**.

As part of the comparative analysis, three well-tuned baseline studies are performed. First, the proposed method with the 85-feature set is tested without optimal threshold refinement (the no-skill default threshold of 0.5 is used) in a 10-fold cross-validation scheme. In the second and third studies, only the given 40 clinical variables are fed directly to the LightGBM models, with and without refinement of the optimal threshold respectively, in a 10-fold cross-validation scheme. **Table 3** presents the results of these three baseline studies. As expected, the proposed xMLEPS method outperforms all three. The third study, carried out without derived features and without optimal threshold refinement, shows


#### **Table 3.**

*Results summary of the proposed framework.*

the worst performance. Even in the first baseline study, results are significantly lower, by 3% in terms of the utility score, compared to the proposed method.

#### **3.2 Explanation and visualization of feature importance**

The cumulative feature importance of the top 50 features is shown in **Figure 3**, using the feature importance attribute of the LightGBM gradient boosting framework. One approach is to count the number of times a feature is used to split the dataset across all trees; its shortcoming is that it ignores the differing impacts of different splits. The next best approach is to attribute to each feature its gain, i.e. the reduction in average training loss achieved when using that feature for splitting. This "Gain" measure of feature importance recovers the correct mutual information between feature inputs and label outputs [32]. Its limitation is that it is easily biased when greedy trees are built in finite ensembles, so other methods have been designed to compensate for the bias in feature selection with the gain approach [33, 34].
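The distinction between the two importance measures can be made concrete with a toy example: given the splits of a hypothetical tree ensemble as (feature, loss-reduction) pairs, "split" importance counts uses while "gain" importance sums the loss reduction, so one feature can rank first by count and another by gain.

```python
from collections import Counter

# Hypothetical ensemble splits: (feature used, training-loss reduction).
splits = [("HR", 0.02), ("ICULOS", 0.30), ("HR", 0.01),
          ("HR", 0.03), ("ICULOS", 0.25)]

# "split" importance: how often a feature is used to split.
split_importance = Counter(f for f, _ in splits)

# "gain" importance: total loss reduction attributed to the feature.
gain_importance = Counter()
for f, gain in splits:
    gain_importance[f] += gain

print(split_importance.most_common(1))  # HR is used most often
print(gain_importance.most_common(1))   # ICULOS contributes most gain
```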

The SHAP summary plot with the 20 most important clinical features driving sepsis onset, as identified by the xMLEPS framework, is shown in **Figure 4(a)**. Here, feature importance is obtained by sorting the relevance scores across the entire population in decreasing order of mean relevance, computed locally but considering only individuals positive for sepsis. The mean relevance is displayed as blue horizontal bars in **Figure 4(a)**, while the local explanation summary is shown in **Figure 4(b)**, wherein the individual data points are placed by mean relevance for sepsis and colored by feature value. From **Figure 4(b)** we can infer that an increased length of stay (ICULOS) and higher values of clinical ratios such as PaO2/FiO2 and the shock indices DBPSIndex and SIndex are associated with the development of sepsis, as are lower platelet, DBP, and magnesium levels. These findings are consistent with previous studies [7, 21, 35, 36].

Further, the impact of each feature and the interactions among features in sepsis development can be illustrated using SHAP dependency plots. As an example, **Figure 4(c)** depicts the dependency plot showing the interaction of heart rate with ICULOS. The xMLEPS model appears to associate high heart rate values in the range 120–180 with increased ICULOS, and hence with sepsis.

**Figure 2.** *Comparison of results by xMLEPS with the three base-line studies. US: Utility score, F1: F1 score.*


**Figure 3.** *Cumulative feature importance of the top 50 features using the feature importance attribute of LightGBM.*

#### **Figure 4.**

*Results from the SHAP explanation module showing the global feature importance together with local explanation summary. PaO2: partial pressure of oxygen. FiO2: fraction of inspired oxygen. HR: Heart Rate. DBP: Diastolic Blood Pressure. SBP: Systolic Blood Pressure. SIndex: Shock Index. DBPIndex: Diastolic Blood Pressure Shock Index. PaCO2: Partial pressure of carbon dioxide. PTT: Partial thromboplastin time. WBC: Leukocyte count. BUN: Blood Urea Nitrogen.*

#### **Figure 5.**

*Summary plot of a SHAP interaction value matrix.*

Further, **Figure 4(d)** shows that lower values of SBP (approximately below 90) are associated with increased ICULOS, and hence with sepsis. A summary plot of a SHAP interaction value matrix is shown in **Figure 5**, wherein the diagonal reflects the main effects while the off-diagonal entries show interaction effects. The explainable model produces a high probability when it is confident about a decision, resulting in larger relevance scores because more relevance is available to distribute backward. Conversely, the model outputs a lower probability when it is less confident that the patient will develop sepsis and consequently yields lower relevance scores. This distribution of scores gives clinicians hints about what to expect from the designed model in clinical practice.

#### **4. Discussion**

This study establishes the clinical significance of the derived physiological inter-relations among the clinical signs via feature importance and SHAP plots for visualized interpretation. Although SHAP values cannot serve as a generalized approach for early prediction of sepsis, they certainly help in generating relevant clinical hypotheses for events of interest. The SHAP illustrations indeed mitigate the black-box concerns associated with prediction models and may give clinicians a better understanding of the important features of the xMLEPS framework. The proposed framework can establish the significance of the individual features contributing to an enhanced utility score, thus ensuring the interpretability of the framework for its clinical users. Furthermore, the proposed prediction framework, deploying clinical ICU data from routine practice, can potentially be integrated into a computerized clinical decision support system instead of relying on advanced molecular biomarkers.

The recent research literature relevant to early diagnosis of sepsis comes from the various submission entries to the PhysioNet Challenge 2019 [18]. This challenge aimed at the design and development of algorithms for early, automated prediction of sepsis onset with an optimal window of six hours before the actual clinical recognition of disease onset. The machine learning algorithms' predictions were rewarded if they correctly detected true positives up to 12 hours before disease onset and were slightly penalized if



**Table 4.**

*Summary of the results obtained by our previous works and the submitted solutions to PhysioNet 2019 challenge under 5/10-fold cross-validation scheme using training data.*

they were false positives; however, predictions were strongly penalized if they were incorrect near disease onset. The choice of a six-hour optimal prediction window follows from the clinical observation that the median time to antimicrobial therapy is 6 hours [37]. Furthermore, each hour of delay in treatment results in an average decrease in survival rate of 7.6% [37].

The comparative analysis of the results obtained by the proposed method against our previous works [19, 38] and the submission approaches [39–46] that reported the best results in the PhysioNet 2019 Challenge [18] is listed in **Table 4**. Most of these approaches utilized a 5- or 10-fold cross-validation scheme and yielded utility scores in the range of 0.36–0.45.

This study supports the use of the utility score as an effective metric on ICU data for sepsis onset. However, experiments showed that even the F1 score gave reliable results aligned with the utility score, i.e. increases and decreases in the F1 score track the utility score accordingly. Note, though, that the utility score is bounded between −2 and 1, whereas the F1 score is bounded between 0 and 1. The other conventional metrics, namely AUROC, AUPRC, and accuracy, are not meaningful on such a highly unbalanced dataset and are misleading for sepsis onset. Further, as noted by Roussel et al. [47], interpreting these results together with the utility score is quite difficult.

A limitation of this study is that it is constrained to a two-center cohort design from the available training data, which raises the concern that the trained models may over-fit to the particular cohort data and its patient records. However, the analyzed ICU admissions originate from a diverse population covering the entire spectrum of ICU patients, and the validation via the inter-cohort train-test approach together with optimum threshold refinement supports the deployment of our framework in other ICUs.

#### **5. Conclusion**

This study presents xMLEPS, an explainable machine learning framework for the early prediction of sepsis using clinical data in the ICU setting. Its predictive explanations establish the clinical significance of physiological inter-relations among the given clinical signs via visualized interpretation, and thus assist clinicians in diagnostic decision making and in recommending future actions to improve the quality of predictions. This demonstrates that data-driven automated ML models have the potential to shift the paradigm from conventional detection and treatment to automated early prediction that prevents organ system failure due to sepsis.
