**Abstract**

Treatment and disease registries have played a vital role in understanding the heterogeneous nature of cystic fibrosis (CF) disease progression. The maturity of many patient registries and a recent national focus on their potential to improve patient-centered outcomes have led to the establishment of guidelines for the conduct of registry data analyses. Despite the insights gained from CF patient registries, the analyses are plagued with methodological challenges, such as confounding, missing data, time-varying treatment and/or covariates, and treatment-by-selection bias. Nonetheless, these registry studies have been essential for CF clinical effectiveness research. They reflect real-world clinical practice and allow for evaluating patient outcomes in a realistic clinical environment. In this chapter, we reflect on these advancements in registries and study results, both broadly and specifically in CF. We identify the key statistical challenges in the analysis of CF registry data from start to finish, including design considerations, quality assurance, selection bias, covariate effects, sample size justification, and missing data. We describe how approaches to these challenges are implemented to answer clinical effectiveness questions and present an illustrative example on tobramycin effectiveness and lung function decline.

**Keywords:** confounding-by-indication bias, instrumental variables, lung function decline, propensity scores, treatment-selection bias

#### **1. Introduction**

A registry is "an organized system that uses observational study methods to collect uniform data (clinical or otherwise) to evaluate specified outcomes for a population defined by a particular disease, condition or exposure, and that serves a predetermined scientific, clinical or policy purpose(s)" [1]. Registries and other non-intervention studies are often referred to as *real-world* data to distinguish them from clinical trials or experimental studies.

Treatment and disease registries play a vital role in the advancement of patient-centered outcomes research. These patient registries often include data arising from patient surveillance in observational settings. Numerous epidemiologic studies have used patient registries to characterize disease progression. In more recent years, patient registries have been used for a variety of health-related inquiries, ranging from comparative effectiveness studies to informing clinical decision making at the point of care (see [2] for an example). The maturity of many patient registries and a recent national focus on their potential to improve patient-centered outcomes have led to the establishment of guidelines for the conduct of registry data analyses [1].

Although these guidelines are recent, the statistical challenges posed in these observational settings were noted decades ago in epidemiology and public health research [3]. Indeed, registry analyses are plagued with methodological challenges, such as confounding, missing data, time-varying treatment and/or covariates, and treatment-by-selection bias.


*Evaluating Clinical Effectiveness with CF Registries DOI: http://dx.doi.org/10.5772/intechopen.84269*


Despite these challenges, registry studies are essential for clinical effectiveness research. They reflect real-world clinical practice and allow for evaluating patient outcomes in a realistic clinical environment. A registry encompasses the general patient population, including those who are severely ill or less likely to adhere to assigned treatment. These patients are commonly excluded from randomized controlled trials and are likely to have very different treatment responses. Further, registry studies offer the opportunity to examine important factors such as physicians' practice behavior, prescription preference, and other covariates pertaining to quality of care, which are impossible to assess in an experimental study. Registry studies commonly include long-term observation and can therefore reflect changes in treatment practices, providing a timely assessment of emerging research questions. The use of registry data to evaluate outcomes benefits both patients and clinicians, and it facilitates management of patient care, thereby improving the health care system.

#### **1.1 Evaluating the effectiveness of tobramycin on lung function decline**

Throughout the chapter, we will refer to an example from a retrospective longitudinal cohort study, which used the Cystic Fibrosis Foundation Patient Registry (CFFPR) to evaluate the clinical effectiveness of a treatment with respect to lung function decline [4]. Cystic fibrosis (CF) is a lethal autosomal recessive disease in which respiratory failure is the primary cause of death. *Pseudomonas aeruginosa* (*Pa*) is a common cause of chronic pulmonary infection in CF patients. Inhaled tobramycin (hereafter, Tobi) has been shown to improve lung function in CF patients with *Pa* in the clinical trial setting. In this example, our objective is to evaluate the clinical effectiveness, as opposed to efficacy, of Tobi using the CFFPR. We refer to this case study to illustrate statistical methods for registry data analysis. The Appendix includes analysis implementation using SAS 9.3 (SAS Institute, Cary, NC).

In this chapter, we focus on the design and statistical analysis of patient registry studies. We begin in Section 2 by describing the process of designing a study involving registry data, in accordance with the aforementioned guidelines from Gliklich and colleagues. In Section 3, we give an overview of inferential methods that can be used in registry studies to combat selection bias, missing data, and time-varying treatment or covariates. In Section 4, we describe details of the application to the aforementioned patient registry. We discuss the utility of existing methods and remaining analytic challenges in Section 5. Finally, Appendix A provides an implementation of the statistical analyses in our illustrative application.

#### **2. Design considerations for registry studies**

Registries may be organized around conditions or exposures (e.g., a cystic fibrosis registry, stroke registry); a healthcare service (e.g., procedure); or a product (drug or device) and can address questions ranging from treatment effectiveness and safety to the quality of care delivered. Registries vary in complexity from simply recording product use as a requirement for reimbursement to more systematic efforts to collect prospective data on many types of treatment, risk factors, and clinical events in a defined population. Follow-up can be retrospective, prospective, or a combination of both. The mode and duration of follow-up can range from days (e.g., hospital admission registry) to decades (e.g., orthopedic implant registry). Constructing and maintaining a large registry requires substantial resources and collaborative effort, and often requires a multi-center or inter-institutional agreement and a governing body that oversees and coordinates all activities. Typically, there are standard guidelines or written procedures in place that help researchers gain familiarity with and/or access to the registry.

Before utilizing data from any registry, it is imperative to define the research question and develop a study protocol. Clinical or public health questions of interest should be stated as research questions. Each research question should correspond to a testable hypothesis, which may be assessed using an approach fully described in the statistical considerations (this is particularly important for comparative effectiveness studies).
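As an illustration of how a clinical question maps to a testable hypothesis, the tobramycin question from Section 1.1 could be written as below. The notation is an assumption made for this sketch: $\mu_{\text{Tobi}}$ and $\mu_{\text{non-Tobi}}$ denote the mean change in FEV1% predicted over a fixed follow-up window in treated and untreated patients, respectively. It is not necessarily the formulation used in the case study.

$$
H_0:\ \mu_{\text{Tobi}} = \mu_{\text{non-Tobi}}
\qquad \text{versus} \qquad
H_1:\ \mu_{\text{Tobi}} \neq \mu_{\text{non-Tobi}}
$$

Writing the hypothesis in this form makes explicit which outcome, comparison groups, and time window the statistical considerations must address.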

#### **2.1 Selecting a registry and target population**

Finding a registry that is appropriate to answer the research question of interest will require us to review preliminary information about each of the prospective registries, particularly regarding the data elements. For example, consider the following two studies. In each study, it is of interest to determine treatment effectiveness for cystic fibrosis (CF) lung disease. The first study utilized the Cystic Fibrosis Foundation Patient Registry (hereafter, CFFPR) [5] to examine the association between ibuprofen and lung function decline [6, 7]. In a subsequent study, Konstan et al. [8] assessed the relationship between a different treatment, dornase alfa, and lung function decline using registry data from the Epidemiologic Study of Cystic Fibrosis (ESCF) [9]. Although both studies examined treatment effectiveness on the same outcome (lung function decline), each study required distinct data elements to answer the research questions regarding treatment effectiveness. The CFFPR includes data collected on ibuprofen usage; however, the ESCF does not include information for this treatment, eliminating this database as an option for the first study. On the other hand, the ESCF has detailed information on pulmonary symptoms (e.g., coughing), which are known predictors of more rapid lung function decline [7] and therefore need to be considered as potential confounders when assessing treatment effectiveness. Although both registries include data elements to measure dornase alfa usage, which are necessary to answer the research question in the second study, the ESCF enabled the authors to consider detailed pulmonary symptoms as potential confounders. If our research question involves a newly diagnosed condition or rare disorder, we may be limited to a single patient registry. In those instances, the research question may need additional refinement.
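The comparison of data elements described above can be sketched as a simple set check. This is a minimal illustration in Python (the chapter's own appendix uses SAS). The element names and the contents of each set are hypothetical simplifications of the description above, not the registries' actual data dictionaries.

```python
# Hypothetical, simplified data elements per registry (assumption for
# illustration only; consult each registry's data dictionary in practice).
REGISTRY_ELEMENTS = {
    "CFFPR": {"fev1_pct_predicted", "ibuprofen_use", "dornase_alfa_use"},
    "ESCF": {"fev1_pct_predicted", "dornase_alfa_use", "pulmonary_symptoms"},
}


def missing_elements(required, registries=REGISTRY_ELEMENTS):
    """Map each registry name to the required elements it does not record."""
    return {name: set(required) - available
            for name, available in registries.items()}


def usable_registries(required, registries=REGISTRY_ELEMENTS):
    """Return the registries that record every required element."""
    gaps = missing_elements(required, registries)
    return sorted(name for name, missing in gaps.items() if not missing)


# Elements each study needs: outcome, exposure, and key confounders.
ibuprofen_study = {"fev1_pct_predicted", "ibuprofen_use"}
dornase_study = {"fev1_pct_predicted", "dornase_alfa_use", "pulmonary_symptoms"}

print(usable_registries(ibuprofen_study))
print(usable_registries(dornase_study))
print(missing_elements(ibuprofen_study)["ESCF"])
```

Listing the missing elements, rather than returning only a yes/no answer, shows whether a research question could be refined to fit a registry that is otherwise suitable.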

In the study protocol, we will need to state the specific objectives. The objective of our CF study is to evaluate the effect of tobramycin on lung function decline. Once the objectives are clarified, we consider the most appropriate study design. In registry analyses, the selection of our study design often depends on how the registry was structured. Registries constructed to capture natural histories are often amenable to studies with longitudinal cohort designs. We can identify the population of interest at this point in the study protocol. Acquiring the subset of data that best reflects the population of interest, exposure variables, and primary and secondary endpoints may require some manipulation of the original registry data files. In our CF example, it is of interest to limit our cohort to individuals chronically infected with *Pseudomonas aeruginosa* (*Pa*). We target this population because our research question concerns the effectiveness of tobramycin, a drug recommended for treating chronic *Pa* in patients with CF. In our example, we determine chronic *Pa* status for each patient by examining the number of recorded *Pa* infections throughout the calendar year. Our primary endpoint is the mean change in FEV1% predicted over a 2-year period. We selected additional exposure variables of interest, which are known predictors of change in FEV1% (see **Table 1**).

#### *Cystic Fibrosis - Heterogeneity and Personalized Treatment*

*Abbreviations: CF, cystic fibrosis; FEV1, percentage predicted of forced expiratory volume in 1 s.*

*<sup>a</sup> P-values from Wilcoxon Mann-Whitney or chi-square test.*

*<sup>b</sup> Number of hospitalizations in the year before baseline.*

#### **Table 1.**

*Descriptive analysis of CF registry variables.*

#### **2.2 Data elements and quality assurance**

For many different types of research, particularly comparative effectiveness research or research involving children and/or rare disease conditions, no single institution has a large enough patient population to perform a proper study. This, along with the growing infrastructure of electronic medical records, has led to an increased effort to create distributed research networks. The widespread adoption of electronic health records (EHRs) has enabled them to become a main source for registry data, capable of capturing the necessary elements as part of routine clinical care and of reflecting ever-changing clinical practices.

The number of data elements and scope of collection often increase over the life of the registry. Well-maintained registries typically include data dictionaries, but verifying data quality specific to our study is essential. In our CF example, we had to calculate specific variables for analysis. Understanding how the data have been collected over time and to what extent (e.g., every clinical encounter) will help determine the appropriate subset of data to extract from the registry. For example, the CFFPR data are collected at every clinical encounter and hospitalization, as well as on an annual basis, for each patient and are provided to the CF Foundation. Descriptive statistics, such as the 5-number summary, mean, and standard deviation for each variable, together with histograms or boxplots, can highlight data discrepancies in continuous variables. Similarly, computing the frequency and percentage of each category in a nominal or ordinal variable may identify variables with questionable entries. Furthermore, summary statistics stratified by calendar year can inform selection of an optimal time frame from natural history registries. In our example, CF-related diabetes, a known predictor of lung function decline that should be included in the analysis, was not collected in earlier calendar years in the CFFPR.

Access to most registries requires approval by a local institutional review board (IRB) prior to data release, and this approval is often necessary to have results of the study peer-reviewed and published. In our experience, developing a protocol that is in accordance with the aforementioned guidelines is sufficient for the IRB review. Although registries rarely contain patient names or medical record numbers, they often include clinical encounter and/or discharge dates. Having this type of protected health information in the data often requires IRB approval.

#### **3. Statistical considerations for comparative effectiveness using registry studies**

Statistical analyses in the registry data setting are subject to the statistical challenges previously described for analyses of observational studies [10]. Registries are often established for the purpose of evaluating the effects of interventions. The statistical analysis plan should include appropriate methods to test each hypothesis, methods to address biases and confounding arising from various sources, and sample size/power considerations.

#### **3.1 Selection bias**

Regardless of the research question, a registry study will likely be plagued with numerous sources of bias. Selection bias, although inevitable, is typically the most concerning. This type of bias distorts the estimated association of interest and may yield misleading conclusions. Failure to sample from the correct target population and loss to follow-up due to death or some other event are types of selection bias. A pervasive type of selection bias is confounding by indication, arising from nonrandomized treatment assignment that is often related to the patient's risk of experiencing poor outcomes. This treatment-by-selection bias creates distinctions between the risk profiles of treated and comparator groups and may violate statistical assumptions in our analyses. In our CF example, treatment selection bias may be more pronounced because the drug in question should only be prescribed to individuals with CF who have a specific chronic infection. Narrowing the cohort to "sicker" individuals can intensify the aforementioned risk profile imbalance between Tobi and non-Tobi groups.

Statistical methods to combat treatment selection bias have been applied in previous studies. Approaches to adjust for treatment selection bias include multivariable regression, propensity score methods, matching, and instrumental variables analysis. Stukel et al. [11] applied each of these four approaches to examine the association between cardiac catheterization and long-term acute myocardial infarction mortality. The authors found that the results differed according to the choice of statistical approach. Next, we describe and outline each approach in the context of our CF example.

#### **3.2 Statistical analyses of comparative effectiveness utilized for registry data analysis**

*3.2.1 Multivariable regression*

In the absence of randomization, intervention and comparator groups may exhibit large differences with respect to observed covariates recorded in the
