### **3.4 Clinical variables**


### **3.5 Data wrangling**

There are numerous EHR systems in the United States, but training a machine learning model with reasonable predictive power requires a sufficiently large volume of data and a wide variety of features. Epic is one of the largest EHR systems in the United States, and its back-end Caboodle warehouse offered the most available data, making it an ideal choice as the data source for our model. Our model was developed at Cedars Sinai, which uses Epic as its EHR. The model was trained on patient data from Cedars Sinai, but our methodology could be adopted by other health systems using Epic-derived data, provided features are defined in a similar fashion.

Most machine learning algorithms require data to be converted into numerical values before entry into the model. Clinical data, particularly lab values, can be extremely noisy, with values documented in non-standardized formats in flowsheets. For example, a white blood cell count in a urine sample might be reported as 0, 1+, 2+, 3+, or 4+, or as "none," "some," "few," or "many" white blood cells, with variations in how each technician enters the text. A data analyst must review each data element entering the algorithm and use code to map text or string values to numerical values. This is a painstaking process requiring meticulous data review. Afterwards, a clinician, preferably a clinical informaticist, should comb through the data with a clinical understanding of its meaning to identify outliers or mis-entered values that do not fit the dataset.
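The string-to-numeric mapping step described above can be sketched as follows. The mapping table and column names here are hypothetical illustrations; an actual mapping must be built by reviewing every distinct string that appears in the data.

```python
import pandas as pd

# Hypothetical mapping from free-text urine WBC results to numeric codes.
# A real mapping requires enumerating every distinct string in the source data.
WBC_MAP = {
    "none": 0, "0": 0,
    "few": 1, "1+": 1,
    "some": 2, "2+": 2,
    "many": 3, "3+": 3,
    "4+": 4,
}

def clean_wbc(value: str) -> float:
    """Normalize a free-text urine WBC result to a numeric value.

    Unmapped strings become NaN so they surface during clinical review.
    """
    key = str(value).strip().lower()
    return float(WBC_MAP[key]) if key in WBC_MAP else float("nan")

# Example: apply the cleaning function to a column of raw flowsheet entries.
labs = pd.DataFrame({"urine_wbc_raw": ["2+", "Few", "none", "many WBCs"]})
labs["urine_wbc"] = labs["urine_wbc_raw"].map(clean_wbc)
```

Values that do not match any known pattern (such as `"many WBCs"` above) are deliberately left as NaN rather than guessed, so that a clinician can review them during the outlier-checking pass.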

Once individual data elements have been cleansed, data elements from different tables must be converted to a format that allows the tables to be joined. When extracted from the Epic Caboodle Data Warehouse, data is often stored as one row per measurement per encounter. To merge with data from another table, such as vital signs, each lab value needs to be converted to a column for interpretation by the machine learning model. We utilized a pivot function to reformat this data from rows into columns using the common identifiers of medical record number, encounter identifier, measurement value, measurement time, and measurement unit. We then used a merge function to combine data elements from different tables into a usable dataset. Please refer to the following link for details on our code:

https://github.com/rohith-mohan/caboodledatacleanse/commit/7d17c05fc3eeb043d22cb97e454701d2fbe81075
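As an illustration of the pivot-and-merge step (a minimal sketch, not the linked code), the following uses pandas with hypothetical table and column names:

```python
import pandas as pd

# Hypothetical lab extract: one row per measurement per encounter.
labs = pd.DataFrame({
    "mrn": [101, 101, 102],
    "encounter_id": [1, 1, 2],
    "measurement_name": ["wbc", "creatinine", "wbc"],
    "measurement_value": [7.2, 1.1, 11.4],
})

# Hypothetical vitals extract: one row per encounter.
vitals = pd.DataFrame({
    "mrn": [101, 102],
    "encounter_id": [1, 2],
    "heart_rate": [88, 104],
})

# Pivot lab rows into one column per lab, so each encounter becomes one row.
labs_wide = labs.pivot_table(
    index=["mrn", "encounter_id"],
    columns="measurement_name",
    values="measurement_value",
).reset_index()

# Merge labs with vitals on the shared identifiers to build the model dataset.
dataset = labs_wide.merge(vitals, on=["mrn", "encounter_id"], how="inner")
```

After the pivot, each encounter is a single row with `wbc` and `creatinine` as columns, which the merge then aligns with the vitals table on medical record number and encounter identifier.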
