**2.3 Data extraction**

The next step in the analysis process was to pull the patients' medical history from the available information in the US healthcare claims patient-level database [21]. An event date was established for each individual in the target cohort so that only healthcare information preceding the first condition event was extracted. For the control cohort, the first activity in 2019 was used as the event date [21].

The approach to data extraction and to the target and control cohort setup was the same as presented in the earlier version of the chapter available on Research Square. Using the medical event date, which for the target cohort represented the first date of endometriosis diagnosis, as the index date, 36 months of medical history were extracted for each patient. The historical data covered all available medical events in the patients' healthcare history before the condition diagnosis, including diagnoses for comorbid conditions, medical and surgical procedures, therapeutics, healthcare provider specialties, and treatments prescribed to patients. A transaction-level dataset restricted to the top 1000 diagnosis codes, top 800 medical and surgical procedures, and top 500 prescribed drugs was used, since these top codes constituted more than 80% of the dataset [21].
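
The lookback and top-code selection described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the study's actual pipeline: the tuple layout `(patient_id, event_date, code)`, the function names, and the placeholder codes such as `DX_A` are all assumptions made for the example. Events dated on the index date itself are excluded here, on the assumption that "before the condition diagnosis" is meant strictly.

```python
from collections import Counter
from datetime import date


def lookback_start(index_date: date, months: int = 36) -> date:
    """Return the date `months` before index_date, clamping the day if needed."""
    total = index_date.year * 12 + (index_date.month - 1) - months
    year, month_zero_based = divmod(total, 12)
    day = index_date.day
    while True:
        try:
            return date(year, month_zero_based + 1, day)
        except ValueError:
            day -= 1  # e.g. 29 Feb has no counterpart in a non-leap year


def extract_history(transactions, index_dates, months=36):
    """Keep transactions inside the lookback window before each index date.

    transactions: iterable of (patient_id, event_date, code) tuples (assumed layout).
    index_dates:  dict mapping patient_id to that patient's index date.
    """
    kept = []
    for patient_id, event_date, code in transactions:
        index_date = index_dates[patient_id]
        if lookback_start(index_date, months) <= event_date < index_date:
            kept.append((patient_id, event_date, code))
    return kept


def top_codes(transactions, n):
    """Return the n most frequent codes across all transactions."""
    counts = Counter(code for _, _, code in transactions)
    return [code for code, _ in counts.most_common(n)]


def coverage(transactions, codes):
    """Fraction of transactions whose code is in `codes`."""
    codes = set(codes)
    if not transactions:
        return 0.0
    return sum(1 for _, _, code in transactions if code in codes) / len(transactions)
```

Calling `coverage(history, top_codes(history, n))` for increasing `n` is one way to check a claim such as "the top codes constitute more than 80% of the dataset" on one's own data.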

A pivot table was built at the transaction level and aggregated at the patient level. Each row of the dataset represented an individual patient, and the values within the row represented the counts of transactions generated during the patient's journey for the respective medical events. The columns of the table were the medical events, such as diagnosis and procedure codes, drugs prescribed, and physician specialties. The aggregated data table had more than 6 million rows and 2600 columns, with missing values for some patients and data elements because not all records had complete medical information captured in the study period. Any medical event absent from a patient's history was represented with the value of *zero* (*0*), which implied that no such event was observed in the individual's medical history. The final aggregated dataset served as the analytical dataset for the remaining parts of the chapter [21].
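
As a small illustration of this aggregation, the sketch below pivots transaction records into one row of counts per patient and fills absent events with zero. The function name `build_patient_table`, the `(patient_id, event)` tuple layout, and the placeholder event labels are assumptions for the example; at the scale reported here (millions of rows), a dataframe or sparse-matrix library would be the more realistic choice.

```python
from collections import Counter, defaultdict


def build_patient_table(transactions, patient_ids, event_columns):
    """Aggregate transactions into one row of event counts per patient.

    transactions:  iterable of (patient_id, event) pairs (assumed layout).
    patient_ids:   every patient who should get a row, even with no events.
    event_columns: ordered list of medical events used as table columns.
    """
    counts = defaultdict(Counter)
    for patient_id, event in transactions:
        counts[patient_id][event] += 1
    # A Counter returns 0 for unseen keys, so absent events become zeros.
    return {
        patient_id: [counts[patient_id][column] for column in event_columns]
        for patient_id in patient_ids
    }


transactions = [("p1", "DX_A"), ("p1", "DX_A"), ("p1", "RX_B"), ("p2", "PX_C")]
table = build_patient_table(
    transactions,
    patient_ids=["p1", "p2", "p3"],
    event_columns=["DX_A", "PX_C", "RX_B"],
)
```

Here `p3` has no transactions at all and therefore receives a row of zeros, matching the convention that a zero means no such event was observed.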

The analytical dataset was further normalized and divided into a training set and a test set using a 70:30 ratio [28]. The training set was employed to identify the key data elements driving endometriosis diagnoses, while the test set was used to confirm whether these elements would predict the condition occurrence accurately [29]. Splitting the data into training and test sets aided the assessment of the model's performance and its ability to generalize beyond the training data [21, 30].
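
A minimal sketch of the normalization and 70:30 split follows. The text does not state which normalization was applied, so min-max scaling is used here purely as an assumption, and the function names, random seed, and shuffling strategy are likewise hypothetical. The sketch normalizes before splitting, following the order in which the steps are described.

```python
import random


def min_max_normalize(rows):
    """Scale each column to [0, 1]; constant columns are mapped to 0.0.

    Min-max scaling is an assumed choice, not the method confirmed by the study.
    """
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        [
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ]
        for row in rows
    ]


def split_train_test(rows, labels, train_fraction=0.7, seed=0):
    """Shuffle row indices reproducibly and split them by train_fraction."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(round(train_fraction * len(rows)))
    train_idx, test_idx = indices[:cut], indices[cut:]
    train = ([rows[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([rows[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test


# Toy data: 10 patients, 2 event-count columns, binary cohort label.
rows = [[i, 5] for i in range(10)]
labels = [i % 2 for i in range(10)]
(train_X, train_y), (test_X, test_y) = split_train_test(min_max_normalize(rows), labels)
```

Fixing the seed keeps the split reproducible, so the same patients land in the test set on every run.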
