**Real-World Treatment Patterns and Outcomes among Elderly Acute Myeloid Leukemia Patients in the United States**

Sacha Satram-Hoang, Carolina Reyes, Deborah Hurst, Khang Q. Hoang and Bruno C. Medeiros

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/63758

#### **Abstract**


Over half of patients diagnosed with acute myeloid leukemia (AML) are 65 years or older. Using the linked SEER-Medicare database, we conducted a retrospective cohort analysis to examine patient characteristics, treatment patterns, and survival among elderly AML patients in routine clinical practice. Out of 29,857 patients with AML in the database, 8336 were eligible for inclusion in the study. The inclusion criteria were a diagnosis of first primary AML between January 1, 2000 and December 31, 2009, age ≥66 years, and continuous enrollment in Medicare Parts A and B in the year before diagnosis. Forty percent (*N* = 3327) of the cohort received chemotherapy within 3 months after diagnosis. The multivariable overall survival analyses showed a lower risk of death among those receiving intensive and hypomethylating agent therapies compared with no therapy. Among the younger cohort, significantly lower mortality was also noted with receipt of allogeneic hematopoietic stem cell transplantation. Over the past decade, about 60% of elderly AML patients remained untreated in routine clinical practice. Use of antileukemic therapy was associated with a significant survival benefit, which provides further support that age alone should not deter the use of guideline-recommended therapies, particularly given the large disparities in outcomes between treatment receipt and palliative care in this elderly cohort.

**Keywords:** acute myeloid leukemia, immunotherapy, chemotherapy, elderly patients, survival

© 2016 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### **1. Introduction**

The American Cancer Society estimates that about 20,830 new cases of acute myeloid leukemia (AML) will be diagnosed in the United States in 2015 and 10,460 people will die of the disease [1]. Incidence of AML increases with age, with a median age at diagnosis of 66 years making it primarily a disease of the elderly [2]. Survival rates decline with age and AML is the leading cause of mortality from leukemia in the United States [3, 4].

The management of older adults with AML poses a difficult clinical challenge as they are more likely to have comorbidities and poorer performance status which can limit treatment options and tolerability. Treatment efficacy and tolerability have been shown to deteriorate markedly with age [5]. Although intensive combination chemotherapy is frequently chosen to achieve complete remission and long-term survival, fewer than half of elderly patients receive treatment and their outcomes remain dismal [5–7]. Conventional chemotherapy treatments are highly toxic and may increase early death rates in patients 65 and older and these patients are alternatively given low intensity treatment or palliation only [7, 8]. However, without treatment, patients succumb to their illness within weeks to months of diagnosis [9].

For medically fit older patients (>60 years), the National Comprehensive Cancer Network (NCCN) recommends treatment with a combination of an anthracycline and standard-dose cytarabine, while for medically unfit older adults with poor physical function or unfavorable-risk disease, the NCCN recommends less intensive chemotherapy with DNA hypomethylating agents, low-dose cytarabine, or supportive care alone [10]. Allogeneic hematopoietic stem cell transplantation (HSCT) is rarely used in older patients due to significant comorbidities and higher risk of transplant-related morbidity and mortality [11, 12]. Even so, data from the Swedish Acute Leukemia Registry have demonstrated that the majority of patients <80 years are able to tolerate intensive treatment and have shown benefits in spite of deteriorating organ function [8, 13].

Elderly, Medicare-aged patients constitute the majority of patients with cancer in the United States, but only 1–2% of them are enrolled in randomized clinical trials (RCTs), providing a limited evidence base on which to evaluate treatment efficacy and safety in this population [14–16]. Advanced age and the presence of significant comorbidity were the most frequently cited factors for clinical trial ineligibility [17]. The incidence of AML is expected to increase due to the aging population, and given the limited treatment options and clinical trial participation among the elderly, we examined Medicare beneficiaries diagnosed with AML from a large population-based cancer registry. The objectives of this analysis were to describe treatment patterns during the study time period, to examine factors predictive of receiving therapy, and to identify factors associated with prognosis among older AML patients in real-world clinical practice.

#### **2. Methods**

#### **2.1. Data sources**

**Figure 1.** Schematic of inclusion/exclusion criteria.

This study utilized linked data from two large population-based data sources of Medicare beneficiaries with incident cancer identified in the Surveillance, Epidemiology, and End Results (SEER) program tumor registries. The database contains more than 3.3 million persons with cancer. Details of the linked SEER-Medicare database have been published elsewhere [18]. Briefly, the database combines clinical, demographic, cancer diagnosis, survival, and cause of death information with medical claims (hospital, physician, outpatient, home health, and hospice bills) for adults ≥65 diagnosed with cancer and enrolled in Medicare Part A (inpatient care, skilled nursing, home healthcare, and hospice care) and Part B (outpatient and physician services). The SEER is a nationally representative collection of 18 population-based registries of all incident cancers from diverse geographic areas covering approximately 26% of the US population. The registries monitor cancer trends, and provide continuous information on cancer incidence, extent of disease at diagnosis, therapy, and patient survival. A 98% case ascertainment is mandated with annual quality-assurance studies. The majority of persons aged 65 years and older in the SEER are successfully matched to their Medicare enrollment files [18]. All Medicare beneficiaries receive Part A coverage and approximately 95% of beneficiaries subscribe to Part B. The SEER-Medicare linkage used in this study included all Medicare eligible cancer patients appearing in the SEER data through 2009 and their Medicare claims for Part A and Part B through 2010. Institutional review board approval was waived because the SEER-Medicare data lack personal identifiers.

#### **2.2. Study cohort**

The SEER-Medicare dataset contained 29,857 patients with AML. All patients had microscopically confirmed AML diagnosis based on the International Classification of Diseases for Oncology (3rd edition, ICD-O-3) histology codes in the SEER. For inclusion in the study, patients were restricted to those with a first primary AML in order to exclude therapy-related AML, diagnosed within the time period from January 1, 2000 to December 31, 2009, ≥66 years of age, and enrolled in Medicare Parts A and B for a full 12 months before diagnosis date. Study exclusion criteria were as follows: (1) diagnosis at death, (2) enrollment in a health maintenance organization (HMO) any time within the 12 months before diagnosis since HMO claims are unavailable, and (3) receipt of chemotherapy before diagnosis. See **Figure 1** for the schematic of inclusion/exclusion process.
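The inclusion and exclusion criteria above can be sketched as a simple filter. This is only an illustration: the `Patient` record and all of its field names are hypothetical and do not correspond to actual SEER-Medicare variable names, and the original analysis was performed in SAS.

```python
from dataclasses import dataclass
from datetime import date

# Diagnosis window stated in the study cohort description.
STUDY_START = date(2000, 1, 1)
STUDY_END = date(2009, 12, 31)


@dataclass
class Patient:
    # Hypothetical fields for illustration; not SEER-Medicare variable names.
    diagnosis_date: date
    age_at_diagnosis: int
    first_primary_aml: bool
    months_parts_a_b_before_dx: int  # continuous Part A and B enrollment before diagnosis
    diagnosed_at_death: bool
    hmo_in_prior_12_months: bool
    chemo_before_diagnosis: bool


def is_eligible(p: Patient) -> bool:
    """Apply the stated inclusion criteria, then the three exclusion criteria."""
    included = (
        p.first_primary_aml
        and STUDY_START <= p.diagnosis_date <= STUDY_END
        and p.age_at_diagnosis >= 66
        and p.months_parts_a_b_before_dx >= 12
    )
    excluded = (
        p.diagnosed_at_death
        or p.hmo_in_prior_12_months
        or p.chemo_before_diagnosis
    )
    return included and not excluded
```

Filtering a list of such records with `is_eligible` would mirror the flow in the inclusion/exclusion schematic, under the assumption that each criterion has already been derived as a per-patient flag.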

#### **2.3. Study variables**

Key study measures included patient demographics (age, race/ethnicity, gender, income, and education level) and clinical characteristics (AML diagnosis, tumor characteristics, risk status, comorbidity burden, treatment, and survival time). In the absence of cytogenetic data and molecular abnormalities in the SEER data, prior myelodysplastic syndrome (MDS) or myeloproliferative neoplasm (MPN) was used as a proxy for high-risk patients and was identified using diagnosis codes in Medicare Parts A and B claim files. MDS or MPN that transforms into AML is a poor prognostic feature of the disease and occurs more commonly among elderly patients [19]. Performance status, such as Eastern Cooperative Oncology Group (ECOG) status, is not available in the dataset, so Medicare claims were used to identify poor performance indicators (PPI), which include oxygen and related respiratory supplies, wheelchair and supplies, home health agency services, and skilled nursing facility services occurring in the 12 months before diagnosis [20]. The National Cancer Institute (NCI) comorbidity index [21] is the gold standard in SEER-Medicare for capturing comorbidity burden. It uses diagnosis and procedure codes to identify the 15 noncancer comorbidities from the Charlson Comorbidity Index [22] that occurred in the 12 months before cancer diagnosis.

In the Medicare claims files, International Classification of Disease (9th revision) Clinical Modification (ICD-9-CM) procedure codes were used to identify chemotherapy administration, while the Healthcare Common Procedural Coding System (HCPCS) "J" codes were used to identify the specific intravenous chemotherapy administered [23]. The first claim for chemotherapy had to appear within 3 months of the AML diagnosis date, and patients were classified into one of three treatment groups using all chemotherapies received during the first 60 days after the date of chemotherapy initiation. Those receiving low intensity therapy with a DNA methyltransferase (DNMT) inhibitor such as Azacitidine or Decitabine were classified into the hypomethylating agents or "HMA Therapy" group; and those receiving aggressive induction therapy with Cytarabine + Anthracycline were classified into the "Intensive Therapy" group. Given that chemotherapy for AML is usually administered during inpatient stays, specific chemotherapy agent identification using J codes was not possible in about 70% of treated patients because inpatient stays are paid according to ICD-9 diagnosis or procedure codes only. Allogeneic HSCT occurring during the study follow-up period was also identified using ICD-9-CM and HCPCS codes in the patient's Medicare claim files.
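A minimal sketch of the classification rule described above follows. Several details are assumptions made only for illustration: claims are represented by agent name rather than by actual HCPCS J codes, "3 months" is approximated as 90 days, the anthracycline examples are not taken from the chapter, the label for treated patients whose agent could not be identified is hypothetical, and Intensive Therapy is given precedence when both patterns appear.

```python
from datetime import date, timedelta

HMA_AGENTS = {"azacitidine", "decitabine"}
# Assumed example anthracyclines; the chapter does not list specific agents.
ANTHRACYCLINES = {"daunorubicin", "idarubicin"}


def classify_treatment(diagnosis_date, claims):
    """Classify a patient from chemotherapy claims.

    claims: list of (service_date, agent_name) tuples, where agent_name is a
    lowercase string or None when the specific agent could not be identified.
    """
    # The first chemotherapy claim must fall within ~3 months of diagnosis.
    window_end = diagnosis_date + timedelta(days=90)
    chemo_dates = [d for d, _ in claims if diagnosis_date <= d <= window_end]
    if not chemo_dates:
        return "Not Treated"

    # Use all agents received in the first 60 days after chemotherapy initiation.
    start = min(chemo_dates)
    agents = {
        a for d, a in claims
        if a is not None and start <= d <= start + timedelta(days=60)
    }
    if "cytarabine" in agents and agents & ANTHRACYCLINES:
        return "Intensive Therapy"
    if agents & HMA_AGENTS:
        return "HMA Therapy"
    return "Unspecified chemotherapy"  # hypothetical label for unidentified agents
```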

#### **2.4. Outcome measures**

The primary endpoint was overall survival after the AML diagnosis. Overall survival was measured from date of diagnosis to date of death. To assess the risk of early death (30-day mortality and 60-day mortality) after diagnosis, the "treated" group was limited to patients who received treatment within 30 days after diagnosis to minimize the introduction of immortal time bias into the analysis (period of follow-up time during which death cannot occur) [24]. All patients who were still alive at the end of the follow-up period (December 31, 2010) were censored.
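The endpoint definition above can be illustrated with a short sketch that turns a diagnosis date and an optional death date into a follow-up time and an event indicator, censoring at December 31, 2010. The function names are hypothetical, and the handling of deaths recorded after the end of follow-up (treated here as censored) is an assumption.

```python
from datetime import date
from typing import Optional, Tuple

END_OF_FOLLOW_UP = date(2010, 12, 31)


def overall_survival(diagnosis: date, death: Optional[date]) -> Tuple[int, bool]:
    """Return (days from diagnosis, died) with censoring at end of follow-up."""
    if death is not None and death <= END_OF_FOLLOW_UP:
        return (death - diagnosis).days, True
    # Alive at end of follow-up (or death recorded later): censored observation.
    return (END_OF_FOLLOW_UP - diagnosis).days, False


def died_within(diagnosis: date, death: Optional[date], days: int) -> bool:
    """Early-death indicator, e.g. days=30 or days=60 after diagnosis."""
    follow_up, died = overall_survival(diagnosis, death)
    return died and follow_up <= days
```

Restricting the "treated" group to patients who started therapy within 30 days, as the text describes, would be applied before calling `died_within`, so that treated patients are not credited with survival time that they had to accrue in order to receive treatment.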

#### **2.5. Statistical analysis**

Patient characteristics were compared by treatment status and treatment type using the chi-square test for categorical variables and ANOVA or *t* test for continuous variables. We considered a *p*-value <0.05 to be statistically significant. Multivariate logistic regression was used to assess factors associated with receipt of treatment.
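As an illustration of the categorical comparison, the following sketch computes a Pearson chi-square statistic and its *p*-value for a 2 × 2 table using only the standard library. It assumes no continuity correction, and the example counts in the usage note are invented for illustration and are not study data.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int):
    """Pearson chi-square test (1 degree of freedom) for the table [[a, b], [c, d]].

    Returns (statistic, p_value). No continuity correction is applied.
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value
```

For example, `chi_square_2x2(60, 40, 45, 55)` compares a made-up characteristic (rows) against treated versus not treated (columns); a resulting *p*-value below 0.05 would be called statistically significant under the threshold stated above.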

In the survival analyses, we made comparisons between the treated and Not Treated patients; between treated patients receiving HSCT and those who did not; and between HMA Therapy, Intensive Therapy, and No Treatment groups. The Kaplan-Meier survival analysis was used to plot survival curves. A time-varying Cox regression model with treatment as a time-dependent factor was used to account for variation in treatment initiation between groups. Other independent variables included in the Cox model were selected demographic and clinical characteristics. All statistical analyses were performed using SAS software, version 9.1.3 (SAS Institute Inc., Cary, NC, USA).
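Treating therapy as a time-dependent factor is commonly implemented by splitting each patient's follow-up into an untreated interval and a treated interval (the counting-process layout). The sketch below shows that data preparation step only. It is an assumed illustration with hypothetical names, not the authors' SAS code, and it does not fit the Cox model itself.

```python
from typing import List, NamedTuple, Optional


class Interval(NamedTuple):
    start: int    # days since diagnosis at which the interval opens
    stop: int     # days since diagnosis at which the interval closes
    treated: int  # 0 before chemotherapy starts, 1 afterwards
    died: int     # 1 only if death closes this interval


def split_follow_up(follow_up_days: int, died: bool,
                    treatment_day: Optional[int]) -> List[Interval]:
    """Split one patient's follow-up at the day treatment started."""
    if treatment_day is None or treatment_day >= follow_up_days:
        # Never treated during follow-up: a single untreated interval.
        return [Interval(0, follow_up_days, 0, int(died))]
    intervals = []
    if treatment_day > 0:
        # Time before treatment counts as untreated and cannot end in death.
        intervals.append(Interval(0, treatment_day, 0, 0))
    intervals.append(Interval(treatment_day, follow_up_days, 1, int(died)))
    return intervals
```

With this layout, the days between diagnosis and the first chemotherapy claim contribute to the untreated risk set, which is how a time-dependent treatment term avoids crediting the treated group with time during which death could not have occurred.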

#### **3. Results**

#### **3.1. Treatment patterns**

Treatment rates increased over the study time period from 35% in 2000 to 50% in 2009 (**Figure 2**). Of the 8336 patients who met all study criteria, 3327 (40%) received treatment with chemotherapy within 3 months of diagnosis and 5009 (60%) did not. As age and comorbidity burden increased, likelihood of treatment was found to decrease (**Figures 3** and **4**).

**Figure 2.** Treatment status by year of diagnosis.

**Figure 3.** Treatment status by age.

**Figure 4.** Treatment status by comorbidity burden.

#### **3.2. Cohort characteristics and the odds of treatment receipt**

**Table 1** shows the baseline patient characteristics of the cohort. Overall, the majority of patients were over 75 years of age (63%), male, white, and married. In the logistic regression model of factors associated with the odds of not receiving treatment with chemotherapy or HSCT, increasing age and increasing comorbidity score were confirmed to significantly decrease the likelihood of receiving treatment. Patients of black or African ancestry had 30% higher odds of not receiving treatment than white patients (OR 1.30; 95% CI 1.04–1.62). Being widowed or separated/divorced, having a history of MDS, or the presence of PPI also significantly decreased the likelihood of receiving treatment.
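To make the direction of these odds ratios concrete, the sketch below computes a crude (unadjusted) odds ratio of not receiving treatment with a Wald 95% confidence interval from a 2 × 2 table. It is illustrative only: the ORs in Table 1 come from a multivariable logistic regression, so a crude calculation like this would not reproduce them, and the function and parameter names are hypothetical.

```python
import math


def odds_ratio_ci(group_untreated: int, group_treated: int,
                  ref_untreated: int, ref_treated: int, z: float = 1.96):
    """Crude odds ratio of NOT being treated (group vs. reference) with a Wald CI."""
    odds_ratio = (group_untreated * ref_treated) / (group_treated * ref_untreated)
    # Standard error of log(OR) from the four cell counts.
    se = math.sqrt(1 / group_untreated + 1 / group_treated
                   + 1 / ref_untreated + 1 / ref_treated)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper
```

An OR above 1 means higher odds of going untreated than the reference category, and an interval that excludes 1 corresponds to statistical significance at the 0.05 level.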

**Table 2** shows the baseline patient characteristics by the type of treatment received. Compared with other treatment groups, patients receiving Intensive Therapy were younger, more likely to be male and married, less likely to have secondary AML (prior MDS) or PPIs, and had a lower comorbidity score. Similarities in age, comorbidity burden, and proportion with high-risk disease were noted between HMA Therapy and Not Treated patients.

Among treated patients, 276 (8%) underwent HSCT and 3051 (92%) did not (**Table 2**). The HSCT patients were younger at diagnosis, with a mean age of 73 years compared with 75 years in the non-HSCT group (*p* <0.0001), and were more likely to be male.


| Characteristic | Total (*N* = 8336), *n* | % | Odds of no treatment, OR<sup>a</sup> | 95% CI | *p*-value |
|---|---|---|---|---|---|
| **Age at diagnosis** | | | | | |
| 66–70 | 1322 | 15.9 | ref | | |
| 71–75 | 1774 | 21.3 | 1.64 | 1.41–1.91 | <0.0001 |
| 76–80 | 1971 | 23.6 | 2.86 | 2.46–3.32 | <0.0001 |
| >80 | 3269 | 39.2 | 7.40 | 6.36–8.61 | <0.0001 |
| **Sex** | | | | | |
| Male | 4331 | 52.0 | ref | | |
| Female | 4005 | 48.0 | 0.97 | 0.87–1.07 | 0.5193 |
| **Race/ethnicity** | | | | | |
| White | 7285 | 87.4 | ref | | |
| Black | 502 | 6.0 | 1.30 | 1.04–1.62 | 0.0045 |
| Other/unknown | 549 | 6.6 | 0.87 | 0.71–1.05 | 0.4119 |
| **Marital status** | | | | | |
| Married | 4373 | 52.5 | ref | | |
| Widowed | 2492 | 29.9 | 1.29 | 1.13–1.46 | 0.0036 |
| Separated/divorced | 543 | 6.5 | 1.34 | 1.10–1.64 | 0.0128 |
| Single | 535 | 6.4 | 1.21 | 0.99–1.48 | 0.0796 |
| Unknown | 393 | 4.7 | 1.31 | 1.04–1.66 | 0.0359 |
| **Prior MDS** | | | | | |
| No | 6896 | 82.7 | ref | | |
| Yes | 1440 | 17.3 | 1.18 | 1.03–1.34 | 0.0151 |
| **PPI** | | | | | |
| No | 7280 | 87.3 | ref | | |
| Yes | 1056 | 12.7 | 2.02 | 1.69–2.41 | <0.0001 |
| **NCI comorbidity score** | | | | | |
| 0 | 4266 | 51.2 | ref | | |
| 1 | 2104 | 25.2 | 1.07 | 0.95–1.21 | 0.1017 |
| 2 | 1018 | 12.2 | 1.41 | 1.20–1.66 | 0.0326 |
| ≥3 | 948 | 11.4 | 1.56 | 1.31–1.86 | 0.0004 |

<sup>a</sup> *Model also includes geographic region, income, and year of diagnosis.*

**Table 1.** Factors associated with the odds of NOT receiving chemotherapy or HSCT.


<sup>a</sup> Cells with counts of less than 11 are combined in compliance with the National Cancer Institute data use agreement for small cell sizes.

**Table 2.** Baseline patient characteristics by type of treatment received.

#### **3.3. Overall survival by chemotherapy type**


| Covariates | *N* | Total<sup>a</sup> (*N* = 5478) HR | 95% CI | ≤75 years<sup>a</sup> (*N* = 1457) HR | 95% CI | >75 years<sup>a</sup> (*N* = 4021) HR | 95% CI |
|---|---|---|---|---|---|---|---|
| **Treatment** | | | | | | | |
| Not treated (ref) | 5009 | | | | | | |
| HMA therapy | 345 | 0.52 | 0.47–0.59 | 0.54 | 0.45–0.66 | 0.50 | 0.44–0.58 |
| Intensive therapy | 124 | 0.33 | 0.27–0.41 | 0.30 | 0.23–0.39 | 0.38 | 0.26–0.54 |
| **Age at diagnosis** | | | | | | | |
| 66–70 (ref) | 537 | | | | | | |
| 71–75 | 920 | 1.31 | 1.17–1.46 | | | | |
| 76–80 | 1276 | 1.42 | 1.27–1.58 | | | | |
| >80 | 2745 | 1.68 | 1.52–1.86 | | | | |
| **Sex** | | | | | | | |
| Male (ref) | 2780 | | | | | | |
| Female | 2698 | 1.01 | 0.96–1.08 | 0.99 | 0.88–1.11 | 1.03 | 0.96–1.10 |
| **Race/ethnicity** | | | | | | | |
| White (ref) | 4788 | | | | | | |
| Black | 350 | 0.96 | 0.86–1.07 | 1.04 | 0.85–1.28 | 0.92 | 0.80–1.05 |
| Other/unknown | 340 | 0.89 | 0.79–1.00 | 0.93 | 0.74–1.16 | 0.86 | 0.75–0.98 |
| **Marital status** | | | | | | | |
| Married (ref) | 2644 | | | | | | |
| Widowed | 1859 | 1.12 | 1.05–1.20 | 1.33 | 1.14–1.56 | 1.10 | 1.02–1.19 |
| Separated/divorced | 349 | 1.11 | 0.99–1.25 | 1.09 | 0.89–1.33 | 1.07 | 0.93–1.23 |
| Single | 349 | 1.18 | 1.05–1.32 | 1.31 | 1.07–1.59 | 1.12 | 0.97–1.28 |
| Unknown | 277 | 1.00 | 0.88–1.13 | 0.94 | 0.73–1.20 | 1.01 | 0.87–1.18 |
| **Prior MDS** | | | | | | | |
| No (ref) | 4445 | | | | | | |
| Yes | 1033 | 0.97 | 0.91–1.04 | 1.03 | 0.89–1.19 | 0.95 | 0.88–1.03 |
| **PPI** | | | | | | | |
| No (ref) | 4605 | | | | | | |
| Yes | 873 | 1.30 | 1.20–1.40 | 1.58 | 1.32–1.90 | 1.26 | 1.16–1.38 |
| **NCI comorbidity score** | | | | | | | |
| 0 (ref) | 2611 | | | | | | |
| 1 | 1383 | 1.18 | 1.10–1.26 | 1.35 | 1.18–1.54 | 1.12 | 1.03–1.21 |

<sup>a</sup> Model also includes geographic region, income, and year of diagnosis.

**Table 3.** Adjusted overall survival by treatment type.

Patients receiving Intensive Therapy had longer unadjusted median overall survival (18.9 months) compared with patients receiving HMA Therapy (6.6 months) and those Not Treated (1.5 months; log rank *p* <0.0001). In the multivariable survival analysis (**Table 3**), significantly lower risks of death were noted among patients treated with Intensive Therapy and HMA Therapy compared with Not Treated, with similar mortality risk reductions maintained in the younger (≤75) and older (>75) cohorts. Other factors found to be predictive of mortality include increasing age, increasing comorbidity score, and presence of PPIs.
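The unadjusted median survival figures above come from Kaplan-Meier curves. A minimal product-limit estimator, written from the textbook definition rather than from the authors' SAS procedures, can be sketched as follows; the function names are hypothetical and the data in the usage note are invented toy values.

```python
from typing import List, Optional, Sequence, Tuple


def kaplan_meier(times: Sequence[float],
                 events: Sequence[bool]) -> List[Tuple[float, float]]:
    """Return [(t, S(t))] at each distinct time where at least one death occurred."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += int(data[i][1])
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set after time t.
        at_risk -= leaving
    return curve


def median_survival(curve: List[Tuple[float, float]]) -> Optional[float]:
    """First time at which estimated survival falls to 50% or below, if reached."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None
```

For toy follow-up times `[1, 2, 3, 4, 5]` with the third patient censored, `median_survival(kaplan_meier([1, 2, 3, 4, 5], [True, True, False, True, True]))` gives the first time at which the estimated curve drops to one half or lower.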

#### **3.4. Overall survival by HSCT**

The unadjusted median overall survival was significantly longer for the HSCT group (9.7 months) than for the non-HSCT group (4.7 months; log rank *p* < 0.0001). This survival benefit was supported in the multivariable survival analysis (**Table 4**), where the HSCT group had a statistically significant 21% lower risk of death than the non-HSCT group. After stratifying by age, the lower risk of death in the HSCT group was supported only in the younger cohort (≤75 years old).
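Percentage reductions in risk of this kind are conventionally read off a hazard ratio (HR). As a worked illustration, and assuming the usual convention applies here, the relation is:

$$
\text{relative reduction in hazard} = (1 - \mathrm{HR}) \times 100\%
$$

Under that convention, a 21% lower risk of death corresponds to an adjusted HR of about 0.79. Applied to the Table 3 estimates for the total cohort, an HR of 0.52 for HMA therapy corresponds to a 48% lower hazard of death, and an HR of 0.33 for intensive therapy to a 67% lower hazard, each relative to no treatment.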


a Adjusted for age, sex, race, marital status, geographic region, income, year of diagnosis, prior MDS, PPI, and NCI comorbidity index.

**Table 4.** Adjusted overall survival among treated patients with and without HSCT.

#### **4. Discussion**

Treatment for elderly patients diagnosed with AML has increased over time from the 34% reported by Lang et al. between 1991 and 2001 [7] to the 40% reported in our study between 2000 and 2010. However, the 60% of elderly AML patients who remain untreated following diagnosis represents a large unmet need in this patient population. We observed a significant survival benefit with receiving antileukemic therapy even among the HMA Therapy group who had similar characteristics to the untreated group. Our multivariate analysis demonstrated a greater reduction in mortality among patients receiving Intensive Therapy compared with HMA Therapy, but both therapeutic options appeared to be equally better than supportive measures when the cohorts were properly matched for relevant confounders. Results from prior RCTs also support our findings and have demonstrated not only an improvement in complete remission rate, but also an improvement in overall survival for AML patients aged 65 years or older treated with intensive chemotherapy [25] and HMA Therapy [26] compared with supportive measures only.

The current results also draw attention to the perception that elderly AML is an untreatable disease and conventional chemotherapy is usually withheld due to toxicity and high early death rates. Our results, however, confirm findings from other registry-based analyses that showed elderly AML patients who received treatment exhibited a lower early death rate compared with untreated patients or palliation after adjustment for confounding factors [8, 13, 27]. Despite the overall improvement in early death rates in the treated versus untreated groups, subsets of patients older than 80 years or those with poor performance or higher comorbidity burden did experience higher risks of early death suggesting caution in use of therapy within these subgroups.

HSCT was associated with a significantly lower risk of death compared with chemotherapy only, and the survival benefit was even more pronounced in the younger cohort (≤75 years), with no benefit in the >75 years subset. Although our observations are at best hypothesis generating, they raise the question of whether allogeneic HSCT provides therapeutic benefit to AML patients older than 75 years of age. Although use of myeloablative allogeneic HSCT is rare among older unfit patients, reduced-intensity conditioning (RIC) allogeneic HSCT has shown encouraging results in the postremission setting [11, 12, 28] and is considered an additional treatment option after complete response from induction therapy among older patients ≥60 years [10]. In fact, a recent uncontrolled study demonstrated that RIC HSCT as postremission therapy was well tolerated in selected older patients with AML, and survival compared favorably to historical patients treated without HSCT [29]. However, in the "real world," chronologic age remains a driving factor in receiving HSCT, as only 8% of patients in the current study who received chemotherapy underwent subsequent HSCT. Randomized clinical trials are needed to define the role of allogeneic HSCT as postremission therapy in this cohort of patients.

The results show that patients receiving Intensive Therapy were younger, had less secondary AML, were less likely to have indicators of poor performance, and had lower comorbidity burden compared with patients receiving HMA Therapy and No Treatment. This may be related to physician beliefs that elderly patients are less able to tolerate more aggressive treatments [5, 30–32]. Undertreatment because of age, independent of comorbidities, occurs in other oncology studies, and may be due to patient preferences, physicians' tendencies to treat patients according to their chronologic age, and a lack of evidence-based guidelines for treating older patients [33, 34]. In two prior RCTs where preselection of conventional care regimens was performed before subjects were randomized, those assigned to aggressive therapies were a median of 5–8 years younger than their counterparts assigned to less intensive regimens [35, 36]. These age disparities in treatment patterns are associated with higher mortality in older AML patients [5, 6], and our results provide further support that demographic factors such as age should not discourage the use of guideline-recommended therapies.

Treatment receipt also varied by gender, socioeconomic factors, geographic region, and marital status, similar to patterns observed in prior oncology research [37–39]. Even after adjustment for known confounders, married patients were more likely to receive treatment and had better outcomes compared with unmarried patients [39], which may indicate that marital status is a surrogate for social and economic support in this patient population. Reducing disparities in receipt of cancer therapy that arise from nonclinical factors such as income and geographic region may reduce the adverse impact on outcomes among these patients. Further research is warranted to better quantify how nonclinical factors contribute to receipt of cancer therapy and outcomes.

#### **4.1. Strengths and limitations**

This unique dataset allowed us to examine all AML patients, both treated and untreated, and provided insight into treatment decisions and effectiveness of therapies in routine oncology practice among this underrepresented elderly patient population. Our analysis has several strengths including the large sample size from a population-based cancer registry, the diverse geographic representation of AML patients in the United States, and comprehensive, longitudinal data with medical claims from the time a person is eligible for Medicare until death regardless of residence or service area.

However, there are some limitations to the analysis that deserve mention. The SEER registry does not collect baseline molecular and cytogenetic information or performance status, and these factors influence clinicians' decisions to treat or the specific regimen to administer. Our proxies for stage (including claims for prior MDS as a marker of disease severity) and performance status (including claims to identify indicators of poor performance) may not adequately assess stage or performance status in all patients and may be subject to bias.

The results of the comparative effectiveness analysis should be interpreted with caution due to the large amount of missing data and the resulting small sample size of the treatment groups. Conventional chemotherapy treatments for AML are highly toxic [9] and generally require inpatient treatment. Inpatient stays are paid based on ICD-9 diagnosis or procedure codes only and not on the specific chemotherapy J code administered. Therefore, without the specific J code, we were unable to define the type of chemotherapy received for 70% of the treated cohort. Given that induction chemotherapy with curative intent in the outpatient setting is applied to very select elderly AML patients, our findings may not be representative of the general patient population receiving intensive induction therapy.

Finally, this analysis does not contain information regarding treatment patterns and outcomes of patients enrolled in HMO plans as these claims are not submitted to Medicare. Prior solid tumor studies found that HMO enrollees were diagnosed earlier and had better overall survival compared with fee-for-service (FFS) plan members [40, 41]. An investigation of how patient characteristics, treatment patterns, and prognosis may differ between these alternative healthcare plans and Medicare enrollees would be a productive area for additional evaluation.

In conclusion, our findings provide an important context for the therapeutic selection that occurs in older patients with AML and suggest that age alone should not discourage the use of guideline-recommended therapies, particularly because of the high disparities in outcomes between treatment receipt and palliative care. But even with treatment, outcomes remain dismal, and given this important unmet medical need, many new agents are currently in development for older patients with AML [42–45]. Moving forward, it will be important to identify patients less likely to be treated at diagnosis and design clinical trials to address the therapeutic challenges that exist in this cohort of patients.

#### **Acknowledgements**

Funding for this study was provided by Genentech, Inc. The authors would like to acknowledge Faiyaz Momin, MS, for programming support and Dr. Michelle Byrtek for her invaluable input on the statistical analyses. This study used the linked SEER-Medicare database. We acknowledge the efforts of the Applied Research Program, NCI (Bethesda, MD), the Office of Information Services, and the Office of Strategic Planning, Health Care Financing Administration (Baltimore, MD), Information Management Services, Inc. (Silver Spring, MD), and the Surveillance, Epidemiology, and End Results (SEER) Program tumor registries in the creation of the SEER-Medicare database. The interpretation and reporting of these data are the sole responsibility of the authors.

#### **Author details**

Sacha Satram-Hoang1\*, Carolina Reyes2,3, Deborah Hurst2, Khang Q. Hoang1 and Bruno C. Medeiros4

\*Address all correspondence to: sacha@qdresearch.com

1 Q.D. Research, Inc., Granite Bay, CA, USA

2 Genentech, Inc., South San Francisco, CA, USA

3 University of California San Francisco, San Francisco, CA, USA

4 Stanford University, Stanford, CA, USA

#### **References**

[14] Murthy, V.H., H.M. Krumholz, and C.P. Gross, *Participation in cancer clinical trials: race-, sex-, and age-based disparities*. JAMA, 2004. 291(22): pp. 2720–6.

[15] Hutchins, L.F., et al., *Underrepresentation of patients 65 years of age or older in cancer-treatment trials*. N Engl J Med, 1999. 341(27): pp. 2061–7.

[16] Gross, C.P., et al., *Cancer trial enrollment after state-mandated reimbursement*. J Natl Cancer Inst, 2004. 96(14): pp. 1063–9.

[17] Mengis, C., et al., *Assessment of differences in patient populations selected for or excluded from participation in clinical phase III acute myelogenous leukemia trials*. J Clin Oncol, 2003. 21(21): pp. 3933–9.

[18] Warren, J.L., et al., *Overview of the SEER-Medicare data: content, research applications, and generalizability to the United States elderly population*. Med Care, 2002. 40(8 Suppl): p. IV-3-18.

[19] Vardiman, J.W., et al., *The 2008 revision of the World Health Organization (WHO) classification of myeloid neoplasms and acute leukemia: rationale and important changes*. Blood, 2009. 114(5): pp. 937–51.

[20] Davidoff, A.J., et al., *Chemotherapy and survival benefit in elderly patients with advanced non-small-cell lung cancer*. J Clin Oncol, 2010. 28(13): pp. 2191–7.

[21] Klabunde, C.N., et al., *A refined comorbidity measurement algorithm for claims-based studies of breast, prostate, colorectal, and lung cancer patients*. Ann Epidemiol, 2007. 17(8): pp. 584–90.

[22] Charlson, M.E., et al., *A new method of classifying prognostic comorbidity in longitudinal studies: development and validation*. J Chronic Dis, 1987. 40(5): pp. 373–83.

[23] Warren, J.L., et al., *Utility of the SEER-Medicare data to identify chemotherapy use*. Med Care, 2002. 40(8 Suppl): p. IV-55-61.

[24] Suissa, S., *Immortal time bias in pharmaco-epidemiology*. Am J Epidemiol, 2008. 167(4): pp. 492–9.

[25] Lowenberg, B., et al., *On the value of intensive remission-induction chemotherapy in elderly patients of 65+ years with acute myeloid leukemia: a randomized phase III study of the European Organization for Research and Treatment of Cancer Leukemia Group*. J Clin Oncol, 1989. 7(9): pp. 1268–74.

[26] Kantarjian, H.M., et al., *Multicenter, randomized, open-label, phase III trial of decitabine versus patient choice, with physician advice, of either supportive care or low-dose cytarabine for the treatment of older patients with newly diagnosed acute myeloid leukemia*. J Clin Oncol, 2012. 30(21): pp. 2670–7.

[27] Oran, B. and D.J. Weisdorf, *Survival for older patients with acute myeloid leukemia: a population-based study*. Haematologica, 2012. 97(12): pp. 1916–24.

[28] Estey, E., et al., *Prospective feasibility analysis of reduced-intensity conditioning (RIC) regimens for hematopoietic stem cell transplantation (HSCT) in elderly patients with acute myeloid leukemia (AML) and high-risk myelodysplastic syndrome (MDS)*. Blood, 2007. 109(4): pp. 1395–400.

[41] Merrill, R.M., et al., *Survival and treatment for colorectal cancer Medicare patients in two group/staff health maintenance organizations and the fee-for-service setting*. Med Care Res Rev, 1999. 56(2): pp. 177–96.

[42] Dohner, H., et al., *Randomized, phase 2 trial of low-dose cytarabine with or without volasertib in AML patients not suitable for induction therapy*. Blood, 2014. 124(9): pp. 1426–33.

[43] Lancet, J.E., et al., *Phase 2 trial of CPX-351, a fixed 5:1 molar ratio of cytarabine/daunorubicin, vs cytarabine/daunorubicin in older adults with untreated AML*. Blood, 2014. 123(21): pp. 3239–46.

[44] Burnett, A.K., et al., *Addition of gemtuzumab ozogamicin to induction chemotherapy improves survival in older patients with acute myeloid leukemia*. J Clin Oncol, 2012. 30(32): pp. 3924–31.

[45] Castaigne, S., et al., *Effect of gemtuzumab ozogamicin on survival of adult patients with denovo acute myeloid leukaemia (ALFA-0701): a randomised, open-label, phase 3 study*. Lancet, 2012. 379(9825): pp. 1508–16.


## **Introduction to Big Data in Education and Its Contribution to the Quality Improvement Processes**

Christos Vaitsis, Vasilis Hervatis and Nabil Zary

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/63896

#### **Abstract**

In this chapter, we introduce the reader to the field of big educational data and to how big educational data can be analysed to provide insights to different stakeholders and thereby foster data-driven actions concerning quality improvement in education. For the analysis and exploitation of big educational data, we present different techniques and popular applied scientific methods for data analysis and manipulation, such as analytics, and different analytical approaches, such as learning, academic and visual analytics, providing examples of how these techniques and methods could be used. The concept of quality improvement in education is presented in relation to two factors: (a) improvement science and its impact on different processes in education, such as the learning, educational and academic processes, and (b) the practical application and realization of the presented analytical concepts. The context of health professions education is used to exemplify the different concepts.

**Keywords:** big data, big educational data, analytics, health education, quality improvement

#### **1. Introduction**

Higher and professional education is a domain that constantly needs to be evaluated and transformed to follow the fast pace of changing trends in different sectors of the market, which in turn creates a variety of workforce needs. A major factor that has radically altered the way education is conducted is technology. Examples of the technologies used in education are mobile devices and apparatuses, teleconference and remote access systems, educational platforms and services, and others that students, teachers, academic faculty, evaluation specialists, researchers and decision-makers in education interact with and use in an effort to impact and improve teaching and learning, but also to realistically reflect in the learning stage the modern technologies used in real settings. The interaction with these technologies generates large amounts of data that range from an individual access log file to institutional-level activity. Still, educational systems are not yet fully prepared to cope with and exploit these data for continuous quality improvement purposes. In particular, health professions education, or health education, is a context in which these technologies are predominantly used, producing a wide range of educational data. In addition, health education is in constant need of reflecting the growing body of medical knowledge and evidence in order to practically embed it in education and prepare future health professionals to meet the future challenges of healthcare systems. The need to govern these challenges within health education is now more timely than ever, and therefore attention has been paid to approaches such as big data and analytics that could also be useful in investigating and exploiting educational data.

© 2016 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### **2. Big data and education**

#### **2.1. Big data**

Big data is extensively used as a term today to describe and define the recent emergence and existence of data sets of high magnitude. It can be found in many sectors. The public, commercial and social sectors ceaselessly receive and produce vast amounts of data from different sources and in different formats. In some cases, the data reach extremely large sizes, such as petabytes, exceeding the hardware or human capacity to warehouse, manipulate and process them, and are therefore characterized as big data. Nevertheless, this term has been readily given to large-sized data, although the size can vary from sector to sector or, more specifically, between services within a sector [1]. Big data is in fact termed as such because of its large size. However, big data is also defined by additional characteristics, such as the disparate types and formats of the data, the different sources they are collected from, the speed at which they are produced and, most importantly, the frequency with which they are processed: in real time, frequently or occasionally. All these characteristics are summarized as volume (size), variety (sources, formats and types) and velocity (speed and frequency), and they add complexity to the data, which is in fact another attribute of concern [2]. Data held in a system or a specific domain are considered big data when the volume, the variety and the velocity are simultaneously high, irrespective of whether these three characteristics would be considered "small" in another domain. In this case, this is enough to constrain how the data can be manipulated and analysed for different purposes. Depending on the domain, the size of data can vary from megabytes to petabytes. Thus, big data is context-specific and may refer to different sizes and types from domain to domain, but the common challenge that all these domains must cope with is being able to make sense of the data by processing them at a high analytical level to enable data-driven improvement of processes and procedures [3].

Big data and analytics have added value to data held in different contexts and consequently have proven to be an extremely useful approach for investigating their possible impact, either in industry in the form of business intelligence and analytics [4] or in academia with educational data mining techniques and learning analytics [5]. Given the limited research on the usage of big data and analytics in the context of health education, we introduce the reader to the new field of big educational data, which places big data in education, and to how educational data can be treated in different dimensions and from different perspectives to bring to light insights for different stakeholders, such as decision-makers, academic faculty, evaluation specialists, researchers and students in computer science, engineering and informatics courses, and accordingly encourage data-driven activities concerning quality improvement in education.

#### **2.2. Big educational data**

One of the domains in which volume, variety and velocity coexist in the data is higher education. Large amounts of educational data are captured and generated on a daily basis from different sources and in different formats in the higher educational ecosystem. The educational data vary from those produced by students' usage of and interaction with learning management systems (LMSs) and platforms, to information about the learning activities and courses that constitute a curriculum, such as learning objectives, syllabuses, learning material and activities, examination results and course evaluations, to other kinds of data related to administrative, educational and quality improvement processes and procedures. The limited exploitation of big educational data, together with the size and type of these data within the context of higher education, signifies the need for special techniques to be applied in order to discover new beneficial knowledge that is currently hidden within the data [6]. Such techniques can be derived and adapted from other domains characterized by big data and successfully used to manipulate big educational data. These techniques could be used to enable the development of insights "regarding student performance and learning approaches" and exemplify areas within big educational data, such as students' actual performance according to the taught curriculum, that can be positively impacted [7]. Recently, big data and analytics together have shown promise in promoting different actions in higher education. These actions concern "administrative decision-making and organizational resource allocation", early identification of students at risk of failing, development of effective instructional techniques, and transformation of the traditional view of the curriculum into a network of relations and connections between the different entities of data gathered and regularly produced from LMSs, social networks, learning activities and the curriculum [8]. More specifically, one of the identified areas in which big data and analytics are appropriately applicable for investigation and improvement in higher education is the curriculum and its contents, as a major part of big educational data [9, 10].

#### **2.3. Big educational data in health education**

Health education is an interesting context because it is complex. Its complexity lies in the constantly increasing body of medical knowledge and evidence that continuously needs to be reflected in educational activities in order to meet the need for competent health professionals who satisfy the demands of the healthcare system and of society as its stakeholder. It produces an enormous amount of educational data that can be considered big. More specifically, the variety of data arising from teaching, learning and assessment activities makes it an area in which big data and analytics can be very useful for exploiting the data and sorting out the complex information to be found in large, diverse data sets [11]. In two different cases, using big data and analytics techniques to make sense of the data representing a health education curriculum, and of the associations between them, revealed the curriculum's underlying complexity and the power that these techniques offer.

In the first case [12], an attempt was made to analyze and visualize the connections between the overall intended learning outcomes (ILOs, in red) given in the different courses of an undergraduate medical curriculum and the desired competencies, from both the medical programme (in blue) and the higher education board (in dark and light green), that a medical student should have acquired after graduation from the medical programme. This is considered an attempt to make sense of these data on a small scale, yet even in this case the visualizations (**Figures 1** and **2**) reveal and confirm the high level of complexity of the data. Further, considering, as mentioned before, the continuously growing medical evidence that needs to be realistically reflected in the educational activities, the nature of these data is not static, and they represent only a snapshot of a long-term changeable network at the time it was captured. Yet, meaningful conclusions can be derived at a glance from these visualizations, such as which competency is addressed the most by ILOs (connections between light green and red in **Figure 1**), or, for example, which clusters of ILOs are used to address either knowledge or skills while addressing a common competency of the medical programme (connections between red nonclustered and clustered in **Figure 2**), and more.

**Figure 1.** Competencies and ILOs map.

**Figure 2.** Clusters of competencies and ILOs.
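
As a minimal sketch of how a conclusion such as "which competency is addressed the most by ILOs" could be read from the underlying data rather than from the picture, the mapping can be treated as a list of ILO-to-competency links and counted. The identifiers below are hypothetical and do not come from the curriculum analysed in [12]; the counting approach is an illustrative assumption, not the method used in that study.

```python
from collections import Counter

# Hypothetical links between intended learning outcomes (ILOs) and
# competencies. These identifiers are invented for illustration only.
edges = [
    ("ILO-1", "C-communication"),
    ("ILO-2", "C-communication"),
    ("ILO-2", "C-diagnosis"),
    ("ILO-3", "C-diagnosis"),
    ("ILO-4", "C-diagnosis"),
    ("ILO-5", "C-ethics"),
]


def competency_coverage(links):
    """Count how many distinct ILOs are linked to each competency."""
    unique_links = set(links)  # a repeated link is counted once
    return Counter(competency for _ilo, competency in unique_links)


coverage = competency_coverage(edges)
most_addressed, n_ilos = coverage.most_common(1)[0]
print(f"{most_addressed} is addressed by {n_ilos} ILOs")
```

The same counts correspond to the number of connections arriving at each competency node in a map like **Figure 1**; a real curriculum would simply supply a much longer list of links.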

In the second case [13], an attempt was made to visualize in a global association map the connections created by the practical incorporation of MeSH terminology in one particular section of a medical curriculum (**Figure 3**). Again, despite the obvious complexity of the MeSH map, conclusions can be derived quickly concerning, for example, the less frequently used MeSH terms, depicted here in small clusters located outside the main large cluster. Of course, representations of this kind require considerable time to be processed by humans due to their high complexity, but they can certainly promote an overview of the situation and facilitate high-level reporting of large bodies of information.

**Figure 3.** MeSH terms association map of a particular section of a medical curriculum.
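
The kind of association map described above can be approximated, under simplifying assumptions, by counting how often each term is used and how often pairs of terms appear on the same curriculum item. The sketch below uses hypothetical item names and term assignments; it is not the procedure used in [13], only an illustration of how rarely used terms and term-to-term links could be derived.

```python
from collections import Counter
from itertools import combinations

# Hypothetical curriculum items, each tagged with MeSH terms.
# The item names and tag assignments are invented for illustration.
items = {
    "lecture-1": ["Heart Failure", "Hypertension", "Diuretics"],
    "lecture-2": ["Heart Failure", "Hypertension"],
    "seminar-1": ["Hypertension", "Diuretics"],
    "lab-1": ["Hemochromatosis"],
}


def term_usage(tagged_items):
    """Count the number of items on which each term appears."""
    counts = Counter()
    for terms in tagged_items.values():
        counts.update(set(terms))
    return counts


def co_occurrence(tagged_items):
    """Count how often two terms are attached to the same item."""
    pairs = Counter()
    for terms in tagged_items.values():
        for a, b in combinations(sorted(set(terms)), 2):
            pairs[(a, b)] += 1
    return pairs


usage = term_usage(items)
links = co_occurrence(items)

# Terms used only once are candidates for the small peripheral clusters.
rare_terms = sorted(term for term, n in usage.items() if n == 1)
print(rare_terms)
```

The pair counts in `links` are the weighted edges of the association map, and terms with no pairs at all would appear as isolated nodes outside the main cluster.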

#### **3. Analytics**

#### **3.1. Dimensions and objectives**

From a broad perspective, the development of analytics models has shown promise in transforming big educational data in health education into an analytics-driven quality management tool. In the world of academic and learning analytics, the sources from which big educational data are derived are distinguished at different levels. This gives a multidisciplinary character to the field of analytics in general, involving the various techniques, methods and approaches frequently used in the field. The range of actions that can be taken within the analytics area is wide, and these actions are frequently classified into different levels and dimensions. For instance, some practitioners divide the actions taken in the field into three dimensions: time, level and stakeholder. Specific analytical approaches are applied to address the respective questions for each dimension. Descriptive analytics, for instance, produces reports, summaries and models in the dimension of time to answer what happened, how and why. It also monitors processes to provide alerts in real time and to recommend answers to questions such as: What is happening now? In predictive analytics, past actions are evaluated to estimate the outcomes of future actions by answering: What are the trends, and what is likely to happen? It also simulates the outcomes of alternative actions to support decisions. Using analytics, choices are based on evidence rather than assumptions [14].

Analytics has also been classified into five levels: course, department, institution, region and national/international [8]. Other terms can be applied to define the different levels more specifically: the "nanolevel" indicates activities in a course; the "microlevel" refers to an entire course in an education programme; the "mesolevel" includes many courses in a specific academic year; and finally, the "macrolevel" concerns many study programmes in an educational institution [15]. **Figure 4** shows these four levels and the relation between them.

**Figure 4.** Overlapping of Analytics levels in higher education.

When the focus is on decision-making concerning the achievement of specific learning outcomes, all included actions are governed by "learning analytics", which refers to operations at the microlevel and nanolevel. When the focus is on decision-making regarding procedures, management and matters of an operational nature, it is governed by "academic analytics", which applies to the other two levels, macro and meso [16]. **Figure 4** illustrates how the different levels of analytics in education overlap and complement each other. For example, the results of actions taken at the nanolevel can serve as input to the micro, meso and macro levels, while the nanolevel is in turn controlled and monitored by them. The application of analytics in this classification can also be oriented toward different stakeholders, including students, teachers, administrators, institutions and researchers. They may have different objectives, such as mentoring, monitoring, analysis, prediction, assessment, feedback, personalization, recommendation and decision support. Despite the categorization of analytics actions into different levels, the data that these levels generate enter the same analytics loop, which is defined in five steps in **Table 1** [17].


**Table 1.** Steps in analytics loop.
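The division of the four levels between learning analytics and academic analytics can be summarised as a simple lookup. The following is only a minimal sketch in Python; the names `LEVEL_SCOPE`, `ANALYTICS_TYPE` and `analytics_type_for` are hypothetical and introduced purely for illustration.

```python
# Hypothetical lookup tables summarising the four levels described in the text.
LEVEL_SCOPE = {
    "nano": "activities in a course",
    "micro": "an entire course in an education programme",
    "meso": "many courses in a specific academic year",
    "macro": "many study programmes in an educational institution",
}

# Learning analytics governs the nano and micro levels,
# academic analytics the meso and macro levels.
ANALYTICS_TYPE = {
    "nano": "learning analytics",
    "micro": "learning analytics",
    "meso": "academic analytics",
    "macro": "academic analytics",
}


def analytics_type_for(level: str) -> str:
    """Return the analytics type that governs decisions at the given level."""
    return ANALYTICS_TYPE[level.lower()]


print(analytics_type_for("micro"))  # learning analytics
print(analytics_type_for("meso"))   # academic analytics
```

Such a lookup only records the classification; it does not capture the overlap between levels illustrated in **Figure 4**.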


Another type of classification was proposed in [18] and provides a division into different dimensions: the environment (what data is available?), the stakeholders (who is targeted?), the objectives (why do the analysis?) and the method (how has the analysis been performed?). Finally, analytics can team up with other scientific areas for the analysis and high-level communication of actions, such as scientific information visualization and data analysis techniques (e.g. data mining and network analysis), elaborated upon later in Section 3.2.4 in the chapter.

#### **3.2. Analytical approaches**

As we saw, analytics actions need different components in order to be effective. These components are the data (type and source) and the context of interest. If these components are in place, we are able to create different analytics models, which can thrive and grow into an analytics engine capable of harnessing big educational data to ultimately contribute to the quality management and improvement of health education. Depending on the needs of the health educational ecosystem in question, different approaches can result in analytical models built from multiple viewpoints. The analytics approaches presented below are not tied to any particular classification into dimensions or levels, but rather can work with any type of analytics model that comprises all the necessary components.

#### *3.2.1. Data-driven analytics approach*

Reading from left to right, **Figure 5** describes the common and traditional data-driven analytics approach, which is quite meaningful to experts in the data analysis area. It starts from the data and ends in the decision. The main focus is on the data and the techniques necessary to collect, store, clean, secure, transfer and process them. According to this approach, the loop starts in the first step by capturing as much data as possible, and then the data are pushed through the different steps. In the reporting step, the high volume of data is an asset: the more data we add, the better the results we will receive. However, processing massive data sets involves challenges, such as the demand for high-level mining techniques and more robust computers, applications, software and skills. Making sense of all these data, estimating the trends and examining all possible associations is a challenging task. The data analysis techniques necessary to process the data in this step require expertise usually found among data analysts, most commonly within the educational data mining area. Based on the evidence from previous steps, the engine predicts the trends and suggests actions that might be accurate and precise, but still remain suggestions. Often the decision makers, frequently because of unknown circumstances, underestimate the recommendations and act differently. The loop finishes with the last step, which is either to end the loop or to feed the engine with more data in step 1 and run the engine again.

**Figure 5.** Data-driven Analytics Approach.

#### *3.2.2. Context- or need-driven analytics approach*


The model can also be read backwards (steps 1–8 in **Figure 6**). Read in this way, it describes a new analytics approach called context- or need-driven analytics. This approach is more suitable for groups of users less qualified in data analysis techniques, such as educators and decision-makers. The approach starts from the need for a decision and proceeds through the analysis of relevant data which could support that decision. Quality improvements, decisions and actions must be crystal clear. Every detail is important: the stakeholders, the circumstances, particular needs, economic boundaries, accessibility of resources, organizational atmosphere, policies, technological ecosystem, timing and other factors which could influence the decisions. The result of this investigation is a demand for specific information to support a judgment or micro-decisions. This important and particular information emerges from the integration of carefully picked and explicit data. These data are selected, prepared, assessed, compared and produced by analytics tools utilizing particular mining methods. The analytics engine includes additional mechanisms and specific operators to recognize the systems which generate the data or the containers which carry the data. This time, we extract only the data we need. Finally, the analytics loop either filters the data and provides an answer to the primary question, or re-enters a new, more precise question and restarts the analytics process [19].

**Figure 6.** Context- or need-driven Analytics Approach.

#### **3.3. Learning analytics**

The term "learning analytics (LA)" is defined as "the measurement, collection, analysis and reporting of data about learners and their contexts, for purposes of understanding and optimising learning and the environments in which it occurs" [20] and affects actions and operations at the microlevel and nanolevel in **Figure 4**. Through LA, we can detect similarities in behaviours (e.g. user satisfaction) or detect anomalous patterns (e.g. cheating). It can function as a bridge between past and future operations: data concerning past events are fed into an LA engine and analysed to determine the probable future outcomes. It can thus synthesize big educational data and create a set of predictions that suggest different decision options, revealing each time the implications of each option. LA can be further enhanced through visuals to amplify insight, increase understanding and impact decision-making, as we explain later in the chapter.

Teachers, usually based on their experience, use their own "gut feeling" to interpret students' behaviour and to suspect whether a student might drop out of a course or even abandon their studies. This can prove to be either true or false, but without evidence there is a low level of certainty in decisions that are based only on experience. One example demonstrates the capacity of LA to use evidence and add confidence to this type of decision [21]. Here, data mining techniques were applied to big educational data and utilized as part of an analytics engine to detect students performing at high, middle and low levels and to notify them accordingly with different types of feedback. Thus, students at risk were identified very early, when the institution still had time to react and take preventive action.

#### **3.4. Academic analytics**

The term "academic analytics" is defined as "the intersection of technology, information, management culture and the application of information to manage the academic enterprise" [22] and affects actions and operations at the macrolevel and mesolevel, as we saw before in **Figure 4**. The focus of academic analytics includes reporting, modelling, analysis and decision support concerning university and campus services. Examples of such services include, but are not limited to, admission, advising, financing, academic counselling, enrolment and administration. In one practical use of academic analytics [23], librarians applied analytics to library usage data, as part of the big educational data ecosystem, to predict students' grades, demonstrating the value that the data produced and processed in the library can provide to the hosting institution. Another case [24] demonstrates how, within the context of health education, academic analytics reports extracted from a mapped medical curriculum using data mining techniques can add transparency to the big educational data that make up the medical curriculum and can help stakeholders with decisions concerning different kinds of services, such as managerial and financial ones.

#### **3.5. Visual analytics**

Methods and techniques have been developed in recent years that can be used to manipulate complicated data in many different disciplines [25, 26]. Visual analytics (VA) is the science of analytical reasoning supported by interactive visual interfaces, an outgrowth of the fields of information visualization and scientific visualization [27]. VA combines different techniques: information visualization, data analysis and the power of human visual perception (**Figure 7**) [28].


**Figure 7.** Big educational data are modelled by information visualization and data analysis techniques and represented in visual interfaces with which the human visual perception interacts to impact the analytical reasoning process.

VA has the potential to support the process of manipulating big data and exploiting them by creating a holistic view of the data while revealing underlying complex information, to the extent possible, in order to positively impact analytical reasoning and decision-making [29–31]. A review of the literature identified variables [32, 33] that are able to support analytical reasoning and decision-making through VA and through the interaction between human visual perception and visual interfaces, as listed below:

**•** Increased cognitive resources (V1)



The potential offered by VA makes it a promising tool for also exploring how big educational data could contribute to the quality improvement of higher education. Different approaches demonstrate the potential of VA to impact quality improvement specifically within the context of health education. One report [34] describes how the analysis and a simple visualization of the educational data of a medical programme enabled the stakeholders involved to instantly review and preview the effects of implemented changes in a medical curriculum. We will now examine another case in which VA was used in practice to explore its impact on analytical reasoning and decision-making using big educational data from a medical programme [35, 36].

In **Figure 8**, we see how the learning outcomes (LOs) and the teaching methods (TMs) of one course were modelled to visually represent the hidden underlying network of connections and relations between them. The TMs are depicted in red with percentages that show to what extent each TM is used in the course, out of 100%. Each TM addresses a number of LOs, and these are depicted in light blue. The percentages between an individual TM and its LOs depict the extent to which each TM's content is used to address the specific LO. A number of non-addressed LOs are depicted in the top-right corner to complete the set of predefined LOs (16 in total) that the medical programme should address within the different courses. Here, the LOs and TMs are mapped and represented hierarchically, from the 100% of TMs down to the corresponding percentage of each TM, showing to what extent each TM is used in the course. Going further, the percentages between TMs and LOs reveal how much of the learning content of the TM is used to address the specific LO. For instance, the "clinical training" TM fully addresses LO7 with its learning content while using only 10% (5% out of 50%) of the learning content to address LO8. Thus, a comparison of learning content usage can instantly show which LO is addressed the most and reveal the tendency of the TM, or even of the whole course when we compare all TMs, towards specific LOs and, further, towards the competencies built through the LOs. This approach provides a way of analysing the teaching part of the course in relation to the LOs addressed, to support the process of analytical reasoning. Given a series of similar comparisons, an instructor can make informed decisions about the right percentage to devote to an LO and, if necessary, reform and redesign the course accordingly so that it is better tailored to the LO's importance. In this way, an instructor evaluates and confirms the correct usage of TMs to address the LOs even if redesigning is not necessary. In parallel, a comparison between addressed and non-addressed LOs and between used and non-used TMs can be performed at any moment, revealing the whole course's map.

**Figure 8.** Learning outcomes and teaching methods.
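The "clinical training" figures quoted above (5% out of 50%, giving 10%) can be reproduced with a short calculation. The sketch below assumes that 50% is the share of the course covered by the TM and that 5% is the part of the course linking that TM to LO8; this reading, and the function name `share_of_tm_content`, are assumptions made for illustration only.

```python
def share_of_tm_content(lo_share_of_course: float, tm_share_of_course: float) -> float:
    """Percentage of a TM's learning content that is used to address one LO.

    Both arguments are percentages of the whole course (assumed reading).
    """
    if tm_share_of_course <= 0:
        raise ValueError("tm_share_of_course must be positive")
    return 100.0 * lo_share_of_course / tm_share_of_course


# Example from the text: 5% out of the 50% covered by "clinical training"
# is used to address LO8.
print(share_of_tm_content(5, 50))  # 10.0
```

Repeating the calculation for every TM and LO pair gives the figures an instructor would compare when judging whether an LO receives an appropriate share of the teaching.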

In **Figure 9**, we see how the LOs of the same course were modelled, this time against the assessment part and more specifically against one part of the assessment: the questions used in the written examination, 34 in total. The percentages on the connections between yellow and red circles depict the proportion (out of 100%) of exam questions used to address the specific LO in red. For instance, eleven questions are used to assess LO5, which corresponds to 32%. Groups of LOs correspond to main outcomes (knowledge, skills and attitude), which are depicted in green. Where multiple main outcomes are assessed in a group of questions, the total percentage is divided among the single main outcomes, as in the case where 30% of the questions are used to assess skills and knowledge, corresponding to 15% skills and 15% knowledge. An immediate observation is that 83% of the questions in the written examination are used to assess skills, while 16% are used to assess knowledge and 1% attitude. Also, the percentage of questions that assess each of the LOs reveals how the written examination is built around them and which LOs are most heavily assessed. Some LOs are assessed in more than one group of questions, like LO5 in five different cases with corresponding red circles, or in combination with other learning outcomes, like LO7 in two cases. The analytical process is supported in this case by instantly evaluating how the LOs of the course are assessed in the written examination. The percentage of questions can be examined against the importance of the assessed LO, and thus suggest whether it is the correct percentage compared with the percentages of questions used to assess other LOs. Thus, an instructor can decide whether these percentages should be adjusted according to the importance of the LOs and redesign the questions of the examination, or even whether it is more appropriate to address these LOs in other types of examination. Finally, this approach can be used to construct a more outcome-oriented written examination by redesigning it to cover identified gaps in addressing important LOs and instantly evaluating it with the updated visual model of the assessment activity.

**Figure 9.** Examination and learning outcomes.
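The percentages discussed for the written examination follow from simple arithmetic. The sketch below reproduces the two examples given in the text: eleven questions out of 34, and a 30% group shared by two main outcomes. The function names are hypothetical, and the equal split between outcomes is an assumption inferred from the 15%/15% example rather than a rule stated by the authors.

```python
from typing import Dict, Sequence


def question_share(n_questions: int, total_questions: int) -> int:
    """Share of exam questions as a percentage, rounded to a whole percent."""
    return round(100 * n_questions / total_questions)


def split_evenly(percentage: float, outcomes: Sequence[str]) -> Dict[str, float]:
    """Divide a group percentage equally among its main outcomes (assumed rule)."""
    return {outcome: percentage / len(outcomes) for outcome in outcomes}


# Eleven of the 34 written-examination questions assess LO5.
print(question_share(11, 34))  # 32

# A group covering 30% of the questions assesses both skills and knowledge.
print(split_evenly(30, ["skills", "knowledge"]))  # {'skills': 15.0, 'knowledge': 15.0}
```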


In **Figure 10**, we see an overview of the whole course. The TMs are depicted in red, main outcomes in yellow and LOs in light blue. The total points a student can get from each exam question are depicted inside the orange circles, and the percentages on the connections between these circles and the LOs show the average success rate across all student answers to the particular question. The three light blue circles bordered in black (LO4,5, LO4,8 and LO4,10,14) and LO4 in the bottom-right corner depict the different cases where LO4 is assessed by exam questions but is not taught in any of the TMs. This visualization sums up all the information from **Figures 8** and **9**, and additionally provides more information about the course in one place. Here, we can observe and analyse the entire course from different perspectives but also as a whole. Examining this figure from left to right and vice versa, different paths are created that disclose the underlying network in the examined educational data. The most focused and most assessed LOs can be observed instantly, showing the trend of the course towards skills, knowledge and attitude, to what extent these are addressed and whether there are any gaps in the form of taught but non-assessed LOs. Finally, the existence or absence of constructive alignment [37] in the course can be verified as a synthesis of the identified gaps and the utilization of learning activities and LOs in one place, presenting the course as a structured network.

**Figure 10.** Overview of a course.

The analytical reasoning process is further enhanced here. The entire course can be instantly evaluated for gaps between taught and assessed LOs. For example, the identified gap for LO4 simply means that the written exam questions assess LO4, but it was never actually taught in any of the TMs. This approach can be used as a tool in the hands of the course stakeholders to analyse the course for this type of inconsistency, possibly redesign it to establish a connection between what is taught and what is assessed, and then verify it again. After the redesign, a comparison can take place in which the different versions of the course are depicted in the same way before the desired changes are applied in reality, thus creating a more concrete and aligned course, without gaps, that meets the desired LOs appropriately.
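The gap check described above amounts to comparing the set of taught LOs with the set of assessed LOs. The following minimal sketch illustrates the idea with set differences; the function name `alignment_gaps` and the sample sets are hypothetical. Only the LO4 case (assessed but not taught) is taken from the text, and the remaining entries are invented to make the example complete.

```python
from typing import Dict, List, Set


def alignment_gaps(taught: Set[str], assessed: Set[str]) -> Dict[str, List[str]]:
    """List the LOs that break the alignment between teaching and assessment."""
    return {
        "assessed_not_taught": sorted(assessed - taught),
        "taught_not_assessed": sorted(taught - assessed),
    }


# Illustrative data: LO4 is assessed but never taught, as in the text;
# the other entries are made up for this sketch.
taught_los = {"LO5", "LO7", "LO8"}
assessed_los = {"LO4", "LO5", "LO7"}

print(alignment_gaps(taught_los, assessed_los))
# {'assessed_not_taught': ['LO4'], 'taught_not_assessed': ['LO8']}
```

Running the same check on a redesigned version of the course gives a quick way to confirm that a gap has been closed before the changes are applied in reality.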

The three presented approaches to using VA on big educational data within the context of health education demonstrate its potential to impact analytical reasoning and decision-making in connection with the previously identified variables (V1–V5). Specifically, the information depicted is easily recognizable to the stakeholders of interest, while the different patterns and relations in the data are made perceptible (V1, V3 and V4). Searching for information relevant to the course structure is facilitated to a high extent (V2). The course can be readily analysed for gaps of different kinds while, at any time, the constructive alignment of the course can be verified (V3–V5). Finally, **Figure 10** has been further investigated with the use of augmented reality (AR) technology in an attempt to increase the interactivity between the user and the visual and to enrich it with additional information while keeping complexity low, showing promising results for investigating big educational data by combining VA and AR [38].

### **4. Quality improvement (QI)**

#### **4.1. Quality improvement as an implication of improvement science in education**

Quality improvement is defined as "the combined and unceasing efforts of everyone to make the changes that will lead to better outcomes, better system performance and better professional development" [39]. This definition covers all the aspects of health care that are inextricably affected by efforts targeting change. Improvement science brings together the ingredients and components needed to realize the kind of effort that a successful quality improvement process requires. It has been applied in many disciplines, such as automobile manufacturing and health care, as an alternative approach to bringing new knowledge into practice, and projects rooted in improvement science have begun to show success in education as well. Improvement science is characterized by a holistic view of the examined context. The key step is to identify the context (e.g. the organization, the actors and stakeholders, the routines and the workflow) and to consider it as a system; deep knowledge of how small changes in one part of a system can affect its other parts is very important.

Traditionally, improvement science was based on the "plan-do-study-act" cycle [40], attempting to answer fundamental questions such as:


Today, the use of analytics on big educational data can be a "game changer" and can play a significant role in orchestrating the components of improvement science to design changes that successfully lead to improvement in the quality of education. Below is a formula that uses big educational data and combines the necessary components with analytics, within the context of education, to make a desired change that produces improvement.

#### **4.2. The formula and its elements**

The formula illustrates how the different components come together as building blocks to produce improvement, and it can be used as a guide to design the change.
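Read off the five elements described in Sections 4.2.1–4.2.5 (context, the "+" symbol, actionable intelligence, the → symbol and improvement), the formula can be sketched as follows. The exact wording and typesetting are an assumption based on those element names:

$$
\underbrace{\text{Context}}_{\#1}\;\underbrace{+}_{\#2}\;\underbrace{\text{Actionable intelligence}}_{\#3}\;\underbrace{\longrightarrow}_{\#4}\;\underbrace{\text{Improvement}}_{\#5}
$$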


Each of the five elements is driven by a different knowledge area and has its own characteristics and settings.

#### *4.2.1. Element #1: context*

Deep knowledge of the particular context is the starting point. Differences in who, when, why, where and what can affect the choices we have or the selections we make. Different stakeholders perceive and use terms and concepts differently on different occasions, but there are predominantly two ways to describe the context of education and define its quality. Some describe education as the personal development of people, focusing on the outcome. They talk about "learning" and regard students as collaborators or participants. Others describe education as the service of educating people, focusing on the process. This group talks about "teaching" and regards students as stakeholders, receivers, a target group or customers/clients. Depending on how we describe what education is, we use different indicators to define its quality [41].

#### *4.2.2. Element #2: the "+" symbol*

This element represents the knowledge required about the different modalities for appropriate management of big educational data (analytics and data processing techniques) to properly connect and transform the context knowledge into the next element, the actionable intelligence.

#### *4.2.3. Element #3: actionable intelligence*

Through analytics, we can transform data into actionable insights and support decisions. As we have demonstrated, different types, approaches and techniques of analytics are available (learning analytics, visual analytics, academic analytics, sense-making or predictive analytics, data-driven or need-driven analytics, etc.). Decisions based on big educational data collected from complex learning environments may run into the limits of human cognitive capability. It is therefore necessary to expand this field and to investigate further how different processes, such as cognitive artefacts that model sub-processes of human thinking (e.g. accommodation, conclusions and categorization), could facilitate the flow of human reasoning and thereby enhance human cognitive ability [42, 43]. Multiple analytics reports derived from the same data set, each providing a lens that adds contextual insight, will enable course developers, for example, to look for patterns [44, 45]. In our case, the final set of analytical reports, as well as the choice between the mass univariate and the multidimensional approach, will depend mostly on the available data sources and on the technical and ethical possibilities of fusing them. Very often, the measures or parameters presented to the course developers will have to be extracted from the raw data with techniques such as natural language processing, social network analysis, process mining and others.

#### *4.2.4. Element #4: the → symbol*

This element represents knowledge about the execution and management of the change. The knowledge area is based on implementation science and focuses on the methods and techniques required to "make things happen" and drive the successful implementation of an intervention.

#### *4.2.5. Element #5: improvement*

Improvement is about change, but not all changes are improvements. This element represents knowledge about the types and methods required to evaluate special types of measurements in order to show whether improvement has happened and to calculate its impact. There are five different approaches, depending on how we consider or view quality [44]; they are summarized in **Table 2**.


**Table 2.** The different approaches we follow for each one of the views.

#### **4.3. Quality improvement of learning process**

Operations at the microlevel and nanolevel (**Figure 4**), such as the teaching or learning activities in a course, are the domain of LA. These operations are performed by, for example, teachers, course designers, and study and programme directors. The following scenario demonstrates the practical use of LA in the quality improvement cycle of a course.

In the preparation phase of a course, the instructors can use curriculum mapping tools to locate actual gaps precisely. They can thus recognize which learning objectives are not properly addressed by teaching or learning activities. They need recommendations for new, more suitable and more motivating teaching activities to include in their schedule. With the available analytics tools, they can analyse the class further and predict its needs, using data such as student demographics, performance, different learning approaches, the technology used and the group dynamics. This type of data is processed by a number of algorithms and predictive models that can derive the characteristics of the class [32]. In the following round, visualization tools can offer alternative proposals for designing activities that suit this particular class and can illustrate the effects of each option. The course director can control the activities and observe the students' progress during the ongoing course. They can zoom in and out from the whole class to one working group or one individual student. They can additionally track the flow of the social networks that form. They can judge the overall commitment and identify students at risk. On an extensively used platform, they can also compare particular indicators against other classes, against other anonymized data sets within the same programme or from a different department, or even against data from related programmes at other universities [46]. The results and the experience gained can be used to build up an evidence-based knowledge base on different pedagogical interventions. This can support the formation of new policies across the organization and be an important element of quality development and academic research.
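As an illustration of how students at risk might be identified from activity data, the sketch below applies two simple threshold rules. It is not the predictive modelling referred to above: the indicators (weekly logins, assignment submission rate), the threshold values and all names are hypothetical assumptions chosen for readability.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StudentActivity:
    """Hypothetical per-student indicators collected from a learning platform."""
    student_id: str
    logins_last_week: int
    assignments_submitted: int
    assignments_due: int


def flag_at_risk(
    records: List[StudentActivity],
    min_logins: int = 2,                # assumed threshold, not from the source
    min_submission_rate: float = 0.5,   # assumed threshold, not from the source
) -> List[str]:
    """Return the IDs of students whose indicators fall below either threshold."""
    flagged = []
    for record in records:
        if record.assignments_due:
            rate = record.assignments_submitted / record.assignments_due
        else:
            rate = 1.0  # nothing due yet, so no submission shortfall
        if record.logins_last_week < min_logins or rate < min_submission_rate:
            flagged.append(record.student_id)
    return flagged


records = [
    StudentActivity("s1", logins_last_week=5, assignments_submitted=4, assignments_due=4),
    StudentActivity("s2", logins_last_week=1, assignments_submitted=3, assignments_due=4),
    StudentActivity("s3", logins_last_week=4, assignments_submitted=1, assignments_due=4),
]
at_risk = flag_at_risk(records)
print(at_risk)
```

In practice, such fixed rules would likely be replaced by a model trained on historical course data, and the thresholds would need to be validated against the actual outcomes of earlier classes.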

#### **4.4. Quality improvement of educational process**

We have presented how VA can be used to support the analytical reasoning and decision making of stakeholders involved in the quality improvement of the educational process. This is achieved when the visual and the analytics factors function as instruments of a harmonized engine, complementing and supporting each other. The analytics factor, applied to the big educational data, aims to reduce its complexity without losing vital information and critical characteristics; these are kept at the top level of the presented visuals. The other factor, visualization, brings pathways and relations to light by taking advantage of the human ability to process and understand visual information more easily. Neither factor can stand alone, and neither can be applied to data with an incoherent structure, which makes analytics an essential component for building a strong base for a meaningful VA result. The data analysis that precedes visualization helps to shape the inchoate big educational data that the visuals are then responsible for representing. An important point is the effort needed to apply each factor. The effort required for the visual and the analytics parts is not comparable, and their roles are entirely different. Analytics requires significant effort to shape the data in question and to compile all the discrete elements so that the data are represented adequately. Visuals, by contrast, require less effort, since the network of connections and relations is already assembled. However, selecting and gradually building the appropriate visuals requires expertise in order to emphasize, in one big picture, the essential information in the network produced by the data analysis and to add scientific value to it while going beyond simple statistics-based visuals. Of course, human visual perception is irreplaceable in this chain of actions for perceiving and interacting with the visual interfaces and performing high-level analysis.

In summary, VA allows the different stakeholders to easily perceive the structure of the examined data, to define how each part coexists as part of a network and to reason about its use and importance in the data. It also helps stakeholders to better understand their individual role in the educational process and the consequences of delivering their parts without being able to determine how these can be harmonized with other parts in the data. It further supports stakeholders in deciding how to cope with discrepancies and structural anomalies revealed by gap analysis and by the presence or absence of constructive alignment in the data. Finally, VA can display the changes currently needed for an improved overall picture in the future, in order to deliver health education in pace with healthcare demands [47, 48]. Revealing the underlying network of information in the examined data, identifying gaps, discrepancies and anomalies in the data and being able to verify the appropriateness of the given educational activities promote the process of analytical reasoning and decision making and transform big educational data into an instrument for planning and applying changes in a constant effort to improve the quality of health education.

#### **4.5. Quality improvement of academic functions and campus services**

Academic analytics has been compared to business intelligence and refers to operations at the macrolevel and mesolevel, as we saw in **Figure 4**, including decision support for university and campus services. In most cases, academic analytics has been used to provide actionable insights and support single or isolated decisions [49]. As we have demonstrated, academic analytics is a main part of the quality improvement process and can be beneficial in multiple ways at each step of the improvement cycle. In the early steps of the cycle (the data-driven approach, **Figure 5**), it can help decision makers to identify the gaps and needs, that is, what is possible or necessary to improve. In the following steps, academic analytics can support decisions about choosing appropriate actions through predictions and by providing "what if" scenarios using the need-driven approach in **Figure 6**. Through dashboards and reports, academic analytics can be used to monitor ongoing processes and support decisions about any adjustments. At the end of the quality improvement cycle, academic analytics can support the evaluation of the intervention's impact by demonstrating the hidden connections between actions and events.

#### **5. Conclusion**

The goal of this chapter was to introduce the reader to the concept of big educational data and to the different forms of analytics as applied scientific areas, and to go deeper into popular techniques for data manipulation, showing how they can be transferred to the health education system and used to exploit the big educational data that such systems produce. Apart from the techniques themselves, the benefits and potential of using them for quality improvement in health education have been presented and discussed in detail.

In an era in which technology inevitably affects health education systems, such approaches prove useful in supporting the quality improvement of education and ultimately contribute to health care by producing highly skilled health professionals.

#### **Acknowledgements**

We wish to thank all the staff at Karolinska Institutet, Sweden, who provided the authors of this chapter with assistance, comments and encouragement.

#### **Author details**

Christos Vaitsis1\*, Vasilis Hervatis1 and Nabil Zary1,2

\*Address all correspondence to: christos.vaitsis@ki.se

1 Department of Learning, Informatics, Management and Ethics, Karolinska Institutet, Stockholm, Sweden

2 Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore

#### **References**


[1] Manyika J, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH. "Big Data: The Next Frontier for Innovation, Competition, and Productivity," McKinsey Global Institute; San Francisco. 2011.

[2] Zaslavsky A, Perera C, Georgakopoulos D. Sensing as a service and big data. arXiv preprint. 2013;1301.0159.

[3] Zikopoulos P, Eaton C. Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. McGraw-Hill Osborne Media; New York. 2011.

[4] Chen H, Chiang RH, Storey VC. Business intelligence and analytics: from Big Data to Big Impact. MIS Quarterly. 2012;36(4):1165–88.

[5] Baker RS, Inventado PS. Educational data mining and learning analytics. In: Larusson A. J., White B. editors. Learning Analytics, Springer, New York. 2014; 61–75.

[6] Romero C, Ventura S. Educational data mining: a survey from 1995 to 2005. Expert Systems with Applications. 2007;33(1):135–46.

[7] West DM. Big data for education: data mining, data analytics, and web dashboards. Governance Studies at Brookings. 2012;4:1–0.

[8] Siemens G, Long P. Penetrating the Fog: Analytics in Learning and Education. EDUCAUSE Review. 2011;46(5):30.

[9] Picciano AG. The Evolution of Big Data and Learning Analytics in American Higher Education. Journal of Asynchronous Learning Networks. 2012;16(3):9–20.

[10] Komenda M, Schwarz D, Vaitsis C, Zary N, Štěrba J, Dušek L. OPTIMED Platform: curriculum harmonisation system for medical and healthcare education. Studies in Health Technology and Informatics. 2015;210:511.

[11] Ellaway RH, Pusic MV, Galbraith RM, Cameron T. Developing the role of big data and analytics in health professional education. Medical Teacher. 2014;36(3):216–22.

[25] Perer A. Finding Beautiful Insights in the Chaos of Social Network Visualization. In: Steele J, Iliinsky N, editors. Beautiful Visualization. Looking at Data Through the Eyes of Experts. O'Reilly Media; Beijing; 2010; pp. 157–73.

[26] Witten I, Frank EH, Hall MA. Data Mining. Practical Machine Learning Tools and Techniques. 3rd ed. Morgan Kaufmann Series in Data Management Systems; Burlington; 2011; pp. 375–97.

[27] Thomas J., Cook K. Illuminating the Path: Research and Development Agenda for Visual Analytics. IEEE-Press. 2005.

[28] Visual Analytics portal [Internet]. Available from: http://www.visual-Analytics.eu/ [Accessed: 2016-03-17]

[29] Keim DA, Mansmann F, Thomas J. Visual analytics: how much visualization and how much analytics? ACM SIGKDD Explorations Newsletter. 2010;11(2):5–8.

[30] Steed C, Potok T, Patton R, Goodall J, Maness C, Senter J. Interactive Visual Analysis of High Throughput Text Streams. In: Proceedings of The 2nd Workshop on Interactive Visual Text Analytics, Oct 15, 2012, Seattle, WA, USA [Internet]. 2012. Available from: http://aser.ornl.gov/publications\_2013/Publication 39367.pdf [Accessed 2016-05-26]

[31] Keim DA, Mansmann F, Stoffel A, Ziegler H. Visual Analytics. Encyclopedia of Database Systems. Springer, New York. 2009; pp. 3341–3346.

[32] Mazza R. Visualization in educational environments. In: Romero C, Ventura S, Pechenizkiy M, Baker RSJD. editors. Handbook of Educational Data Mining. 1st ed. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series, London. 2010. pp. 9–27

[33] Card SK, Mackinlay JD, Shneiderman B. Readings in information visualization: using vision to think. Morgan Kaufmann, Burlington. 1999.

[34] Olmos M, Corrin L. Academic analytics in a medical curriculum: Enabling educational excellence. Australasian Journal of Educational Technology. 2012;28(1):1–5.

[35] Vaitsis C, Nilsson G, Zary N. Visual analytics in healthcare education: exploring novel ways to analyze and represent big data in undergraduate medical education. PeerJ. 2014;2:e683.

[36] Vaitsis C, Nilsson G, Zary N. Visual Analytics in Medical Education: Impacting Analytical Reasoning and Decision Making for Quality Improvement. Studies in Health Technology and Informatics. 2015;210:95.

[37] Biggs JB. Teaching for Quality Learning at University: What the Student Does. McGraw-Hill Education, New York. 2011.

[38] Nifakos S, Vaitsis C, Zary N. AUVA-augmented reality empowers visual analytics to explore medical curriculum data. Studies in Health Technology and Informatics. 2015;210:494.

[39] Batalden PB, Davidoff F. What is "quality improvement" and how can it transform healthcare? Quality and Safety in Health Care. 2007;16(1):2–3.

