**Meta-Analysis of Genome-Wide Association Studies to Understand Disease Relatedness**

Stephanie N. Lewis, Elaine O. Nsoesie, Charles Weeks, Dan Qiao and Liqing Zhang *Virginia Tech, Blacksburg, VA USA* 

#### **1. Introduction**

Genome-wide association studies (GWAS) have become a popular method of surveying haplotype variations within populations. The recent explosion and success of these studies have allowed for identification of multiple gene variations and non-genetic risk factors that are often involved in the pathogenesis of many diseases (Xavier & Rioux, 2008). Efforts to archive the associated single nucleotide polymorphisms (SNPs) and make the information publicly available have been made possible by the International Haplotype Map Project (HapMap) (The International HapMap Consortium, 2005; The International HapMap Consortium, 2007) and the development of GWAS databases (Johnson & O'Donnell, 2009) such as Genomes.gov (Hindorff et al., 2009). The HapMap database of genetic variants and the ever-progressing technology for identifying genetic disease susceptibility markers have allowed for identification of shared genetic associations that were undetectable with previous methods, which identified the deleterious effects of mutations in individual genes (Xavier & Rioux, 2008). We are now capable of detecting common susceptibility markers between previously unassociated diseases and of assessing combined association signals shared by biological pathways (Wang et al., 2011).

Research on immune-mediated disease susceptibility has benefited from the discovery of shared haplotypes. GWAS with a focus on autoimmune diseases, including celiac disease, Crohn's disease, multiple sclerosis, rheumatoid arthritis, systemic lupus erythematosus, and type 1 diabetes (Lettre & Rioux, 2008), have shed light on shared genetic markers. Such markers can be exploited to identify biomedical traits that translate to improved diagnostic and treatment techniques (McCarthy et al., 2008). Under the common disease/common variant hypothesis (Wang et al., 2005), one would assume that shared variants result in shared disease phenotypes, and this commonality could serve as a global target for effective treatment options. It is under this assumption that many disease association studies are conducted. The Wellcome Trust Case Control Consortium (WTCCC) conducted a study in which about 2000 individuals per disease were examined for susceptibility to coronary artery disease (CAD), hypertension, type II diabetes (T2D), rheumatoid arthritis (RA), Crohn's disease (CD), type I diabetes (T1D), and bipolar disorder (BD) against a shared set of about 3000 controls (The Wellcome Trust Case Control Consortium, 2007). The study revealed several association loci for the seven diseases, some of which indicated risk for more than one of the studied diseases (The Wellcome Trust Case Control Consortium, 2007).

Huang et al. used the data from the WTCCC study to see if associations could be made between the seven diseases given the loci and collections of other data regarding disease susceptibility (Huang et al., 2009). Huang et al. performed analyses at four levels (nucleotide, gene, protein, and phenotype) to determine the existence of overlap across SNPs associated with the seven diseases and constructed protein-protein interaction networks to visualize similarities between diseases (Huang et al., 2009). The group found strong associations across all four levels of analysis for the autoimmune group (CD, RA, and T1D), while no genetic associations were found at any level within the metabolic/cardiovascular group (CAD, hypertension, and T2D) (Huang et al., 2009). These results reaffirmed some expectations derived from clinical literature in the case of the autoimmune group, and suggested inappropriate disease grouping in the case of the metabolic/cardiovascular group (Huang et al., 2009).

For this study, we proposed a large-scale disease and phenotype comparison based on the WTCCC and Huang et al. studies. To this end, we have combined data from GWAS with expression pattern data to determine if genetic and expression similarities exist between diseases. A total of 61 human diseases and phenotypes were assessed. Disease relatedness networks (DRNs) were constructed to visually assess associations on a larger scale. We also took advantage of high-throughput molecular assay technologies to incorporate mRNA expression profiles of diseases, and thus added another dimension of analysis toward assessing disease relationships. Gene expression is an indicator of cellular state, and gene expression profiles can be considered as quantitative traits that are highly heritable. The link between organismal complex traits, such as disease-related phenotype, and gene expression variation has been theoretically accepted (Goring et al., 2007; Moffatt et al., 2007; Chen et al., 2008; Emilsson et al., 2008). With the declining per-sample costs of high-throughput microarray experiments, the amount of gene expression data in international repositories has grown exponentially. The availability of these datasets for many different diseases provides an opportunity to use data-driven approaches to improve our understanding of disease relationships. Hu and Agarwal (Hu & Agarwal, 2009) determined disease-disease and disease-drug networks using large-scale gene expression data. Very recently, Suthram et al. (Suthram et al., 2010) presented a quantitative framework to compare and contrast diseases by combining both disease-related mRNA expression data and human protein interaction data. Although GWAS provide comprehensive views of disease interrelationships at the DNA level, insights from gene expression, which reflects cellular phenotype, will further advance and strengthen the understanding of this issue. A large-scale disease comparison study such as this has the potential to uncover relationships between diseases and phenotypes that are often overlooked in single-disease SNP data analysis.

maintained for each succeeding level of analysis. SNPs were divided into blocks based on an *r*² greater than or equal to 0.1. Gene names from Ensembl (Birney et al., 2004) were assigned to blocks if the genetic location was within 2 kilobases up- or downstream of the gene of interest or within the start and end bases for the gene. Gene data were cross-referenced against pathway-specific gene lists generated from the KEGG database (Kanehisa & Goto, 2000; Kanehisa et al., 2006; Kanehisa et al., 2010) in order to assign genes to identified pathways. Pairwise comparisons for each level were conducted to see if diseases and phenotypes shared SNPs, blocks, genes, or pathway designations. Jaccard index values were calculated for each comparison at each level to assess similarity. Using the Jaccard indexes, DRNs were constructed to visualize the strength of relatedness between diseases. DRNs were visually inspected to identify the strongest relationships. Suggested associations were verified by principal components analysis (PCA) and minor data mining for clinical relevance. Complete details of these methods were previously described by Lewis et al. (Lewis et al., 2011).

**2.2 Gene expression dataset**

The gene expression data used in this analysis were obtained from the NCBI Gene Expression Omnibus (GEO) (Barrett et al., 2009). Not all of the 61 diseases were represented by expression data on the GEO site. Data for a subset of diseases were found by scanning the experimental context of collections of GEO data (GEO Series, or GSEs) for microarrays that were assigned to human disease conditions. Only those microarrays that were curated and reported in the GEO DataSets (GDS) were used in our analysis. The data set was also restricted to those GSEs in which both the disease and the corresponding control condition (from healthy tissue samples) were measured in the same tissue. For consistency, we further restricted the GSEs to only those datasets that used the Affymetrix GeneChip Human Genome U133 Array Set HG-U133A (GPL96), HG-U133B (GPL97), and HG-U133plus2 (GPL570), which are among the most commonly used platforms. Probes for these platforms were mapped to the current gene identifiers (Chen et al., 2007). This process yielded nineteen diseases for the final GEO analysis.

**2.3 Expression measurement**

To quantitatively compare expression data, we first normalized the data in each microarray sample using the Z-score transformation to make the expression values across various microarray samples and diseases comparable. Next, we performed an unpaired two-sample Student t-test to compute the t-test statistic and *p*-value of each gene between the disease and control groups. We only used the most appropriate Affymetrix probe set, in which a single probe was representative of each gene. The most appropriate Affymetrix probe set was adopted from the work of Hu and Agarwal (Hu & Agarwal, 2009), as many genes were represented by multiple probe sets in Affymetrix U133 microarray chips. This modification avoided correlation and scoring biases brought on by over-representation of those genes. The 18,600 most appropriate probes/genes were identified for each of the nineteen diseases. Genes with statistically significant high t-test statistics (*p* < 0.05) were grouped as "up-regulated genes" and those with statistically significant low t-test statistics (*p* < 0.05) as "down-regulated genes". Instead of using a *p*-value threshold as a cutoff to identify significantly changed genes, the 200 and 1000 most changed genes were designated as the disease-associated significantly changed genes for each disease state. The lowest p-values in each category
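The gene-to-block assignment rule stated in the methods (a gene is assigned when the block lies within the gene or within 2 kilobases up- or downstream of it) can be illustrated with a minimal Python sketch. The function name, the `(name, start, end)` tuple layout, and the choice to test every SNP position in a block are assumptions made for illustration; the chapter does not specify how block location was represented, and chromosome matching is assumed to have been done beforehand.

```python
FLANK_BP = 2_000  # 2 kilobases up- or downstream, as stated in the methods


def genes_for_block(snp_positions, genes, flank=FLANK_BP):
    """Return the names of genes whose flank-padded span contains any block SNP.

    snp_positions: base-pair positions of the SNPs in one block (assumed input shape).
    genes: iterable of (name, start, end) tuples on the same chromosome as the block.
    """
    hits = []
    for name, start, end in genes:
        # A SNP counts if it lies inside the gene or within `flank` bases of either end.
        if any(start - flank <= pos <= end + flank for pos in snp_positions):
            hits.append(name)
    return hits


# Hypothetical coordinates, for illustration only (not taken from the study data).
example_genes = [("GENE_A", 10_000, 15_000), ("GENE_B", 30_000, 32_000)]
print(genes_for_block([16_500, 20_000], example_genes))  # ['GENE_A']
```

Here the SNP at position 16,500 falls 1,500 bases downstream of the hypothetical GENE_A, so the block is assigned to that gene only.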
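The pairwise Jaccard comparison described in the methods can be sketched in a few lines of Python. The Jaccard index of two sets is the size of their intersection divided by the size of their union. The function names and the placeholder disease and gene labels below are hypothetical, and the same code would apply unchanged to sets of SNPs, blocks, or pathways. This is a sketch of the general technique, not the authors' implementation.

```python
from itertools import combinations


def jaccard_index(a, b):
    """Return |A ∩ B| / |A ∪ B| for two collections; 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def pairwise_jaccard(features_by_disease):
    """Map each unordered disease pair to the Jaccard index of its feature sets."""
    return {
        (d1, d2): jaccard_index(features_by_disease[d1], features_by_disease[d2])
        for d1, d2 in combinations(sorted(features_by_disease), 2)
    }


# Hypothetical gene sets, for illustration only (not taken from the study data).
genes_by_disease = {
    "disease_A": {"GENE1", "GENE2", "GENE3"},
    "disease_B": {"GENE2", "GENE3", "GENE4"},
    "disease_C": {"GENE5"},
}
edges = pairwise_jaccard(genes_by_disease)
```

The resulting dictionary can be read as a weighted edge list for a relatedness network, with pairs scoring 0.0 left unconnected; how edges were thresholded or drawn in the actual DRNs is not specified in this section.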
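The expression workflow described in the methods (Z-score normalization within each microarray sample, an unpaired two-sample Student t-test per gene, then selection of the most changed genes) can be sketched with the Python standard library alone. Several details are assumptions: the sample standard deviation is used for the Z-score, the t statistic uses the pooled equal-variance form, genes are ranked by absolute t statistic, and all names and expression values are hypothetical. The *p*-value is omitted because it requires a t-distribution function that the standard library does not provide.

```python
import math
import statistics


def zscore(values):
    """Standardize one sample's expression values to mean 0 and unit sample SD."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # assumes the values are not all identical
    return [(v - mean) / sd for v in values]


def student_t(disease, control):
    """Unpaired two-sample Student t statistic with pooled (equal) variance."""
    n1, n2 = len(disease), len(control)
    pooled_var = (
        (n1 - 1) * statistics.variance(disease)
        + (n2 - 1) * statistics.variance(control)
    ) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (statistics.fmean(disease) - statistics.fmean(control)) / se


def most_changed(t_by_gene, n):
    """Return the n genes with the largest |t|, most changed first."""
    return sorted(t_by_gene, key=lambda g: abs(t_by_gene[g]), reverse=True)[:n]


# Hypothetical expression values: one list per array, one value per gene.
genes = ["G1", "G2", "G3", "G4"]
disease_arrays = [[9.0, 5.0, 2.0, 4.0], [8.5, 5.2, 2.5, 4.1], [9.4, 4.9, 1.8, 3.9]]
control_arrays = [[5.0, 5.1, 6.0, 4.0], [5.3, 4.8, 6.4, 4.2], [4.8, 5.0, 5.9, 3.8]]

# Normalize within each array, then compare disease and control values gene by gene.
disease_z = [zscore(a) for a in disease_arrays]
control_z = [zscore(a) for a in control_arrays]
t_by_gene = {
    gene: student_t([a[i] for a in disease_z], [a[i] for a in control_z])
    for i, gene in enumerate(genes)
}
top_two = most_changed(t_by_gene, 2)
```

A positive t statistic corresponds to higher expression in the disease group and a negative one to lower expression. In the study the cutoffs were the 200 and 1000 most changed genes rather than the 2 used in this toy example.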
