**4.3 Unstructured data in ML models**

Recent studies have shown that incorporating non-numerical data, including keywords from clinical documentation and diagnostic imaging, can increase model accuracy [25]. These data must first be converted to a format usable by machine learning algorithms through natural language processing (NLP), which applies text-analytics methods to turn text into numerical features [26]. Goh et al. use a text-analytics method known as latent Dirichlet allocation (LDA) to group similar texts into topics. They identified 100 common text topics, each assigned to one of the following seven categories: (1) clinical status, (2) communication, (3) laboratory tests, (4) non-clinical status, (5) social relationships, (6) symptom, and (7) treatment. The numerical values derived from these text data were combined with structured numerical data like those used in the numerical regression models such
