**2. Methods**

#### **2.1 SNP-based genetic analysis**

Five populations were considered for this expansion study: Han Chinese (CHB), Japanese (JPT), a combined CHB and JPT population (CHB+JPT), Yoruba (YRI), and U.S. residents with northern and western European ancestry (CEU). SNP dataset 2009-02\_rel24 (The International HapMap Consortium, 2005; The International HapMap Consortium, 2007) was downloaded from the HapMap site, and the SNP set was expanded by means of linkage disequilibrium (LD). SNPs with an r² greater than or equal to 0.5 were included. SNPs were divided by associated disease or phenotype (listed in Table 1), and the divisions were maintained for each succeeding level of analysis. SNPs were divided into blocks based on an r² greater than or equal to 0.1. Gene names from Ensembl (Birney et al., 2004) were assigned to blocks if the genetic location was within 2 kilobases up- or downstream of the gene of interest or between the start and end bases of the gene. Gene data were cross-referenced against pathway-specific gene lists generated from the KEGG database (Kanehisa & Goto, 2000; Kanehisa et al., 2006; Kanehisa et al., 2010) in order to assign genes to identified pathways. Pairwise comparisons were conducted at each level to determine whether diseases and phenotypes shared SNPs, blocks, genes, or pathway designations. Jaccard index values were calculated for each comparison at each level to assess similarity. Using the Jaccard indexes, disease relatedness networks (DRNs) were constructed to visualize the strength of relatedness between diseases. DRNs were visually inspected to identify the strongest relationships. Suggested associations were verified by principal components analysis (PCA) and minor data mining for clinical relevance. Complete details of these methods were previously described by Lewis et al. (2011).
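The gene assignment rule and the pairwise Jaccard comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function names, the record layout for genes, and the toy gene names and coordinates are all hypothetical.

```python
from itertools import combinations

WINDOW_BP = 2_000  # 2 kilobases up- or downstream, as stated in the text


def genes_for_position(chrom, pos, genes, window=WINDOW_BP):
    """Return names of genes whose span, padded by `window`, contains `pos`."""
    return {
        g["name"]
        for g in genes
        if g["chrom"] == chrom and g["start"] - window <= pos <= g["end"] + window
    }


def jaccard(a, b):
    """Jaccard index |A n B| / |A u B|; taken as 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_jaccard(items_by_disease):
    """Jaccard index for every unordered pair of diseases or phenotypes."""
    return {
        (d1, d2): jaccard(items_by_disease[d1], items_by_disease[d2])
        for d1, d2 in combinations(sorted(items_by_disease), 2)
    }


# Toy data: hypothetical gene coordinates and gene sets, not study results.
genes = [
    {"name": "GENE_A", "chrom": "1", "start": 10_000, "end": 20_000},
    {"name": "GENE_B", "chrom": "1", "start": 50_000, "end": 60_000},
]
genes_by_disease = {
    "T1D": {"GENE_A", "GENE_B", "GENE_C"},
    "RA": {"GENE_B", "GENE_C", "GENE_D"},
    "T2D": {"GENE_E"},
}
scores = pairwise_jaccard(genes_by_disease)
```

The same `pairwise_jaccard` call applies unchanged at the SNP, block, gene, and pathway levels, since each level reduces to a set of identifiers per disease.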

Huang et al. used the data from the WTCCC study to see if associations could be made between the seven diseases given the loci and collections of other data regarding disease susceptibility (Huang et al., 2009). Huang et al. performed analyses at four levels (nucleotide, gene, protein, and phenotype) to determine the existence of overlap across SNPs associated with the seven diseases and constructed protein-protein interaction networks to visualize similarities between diseases (Huang et al., 2009). The group found strong associations across all four levels of analysis for the autoimmune group (CD, RA, and T1D), while no genetic associations were found at any level within the metabolic/cardiovascular group (CAD, hypertension, and T2D) (Huang et al., 2009). These results reasserted some expectations derived from clinical literature in the case of the autoimmune group, and suggested inappropriate disease grouping in the case of the metabolic/cardiovascular group (Huang et al., 2009).

For this study, we proposed a large-scale disease and phenotype comparison based on the WTCCC and Huang et al. studies. To this end, we have combined data from GWAS with expression pattern data to determine if genetic and expression similarities exist between diseases. A total of 61 human diseases and phenotypes were assessed. Disease relatedness networks (DRNs) were constructed to visually assess associations on a larger scale. We also took advantage of high-throughput molecular assay technologies to incorporate mRNA expression profiles of diseases, and thus added another dimension of analysis toward assessing disease relationships. Gene expression is an indicator of cellular state, and gene expression profiles can be considered as quantitative traits that are highly heritable. The link between organismal complex traits, such as disease-related phenotype, and gene expression variation has been theoretically accepted (Goring et al., 2007; Moffatt et al., 2007; Chen et al., 2008; Emilsson et al., 2008). With the declining per-sample costs of high-throughput microarray experiments, the amount of gene expression data in international repositories has grown exponentially. The availability of these datasets for many different diseases provides an opportunity to use data-driven approaches to improve our understanding of disease relationships. Hu and Agarwal (2009) determined disease-disease and disease-drug networks using large-scale gene expression data. More recently, Suthram et al. (2010) presented a quantitative framework to compare and contrast diseases by combining both disease-related mRNA expression data and human protein interaction data. Although GWAS provide comprehensive views of disease interrelationships at the DNA level, insights from gene expression, which reflects cellular phenotype, will further advance and strengthen the understanding of this issue. A large-scale disease comparison study such as this has the potential to uncover relationships between diseases and phenotypes that are often overlooked in single-disease SNP data analysis.

#### **2.2 Gene expression dataset**

The gene expression data used in this analysis were obtained from the NCBI Gene Expression Omnibus (GEO) (Barrett et al., 2009). Not all of the 61 diseases were represented by expression data on the GEO site. Data for a subset of diseases were found by scanning the experimental context of a collection of GEO data (GEO Series, or GSE) for microarrays that were assigned to human disease conditions. Only those microarrays that were curated and reported in the GEO Datasets (GDS) were used in our analysis. The data set was also restricted to those GSEs in which both the disease and the corresponding control condition (from healthy tissue samples) were measured in the same tissue. For consistency, we further restricted the GSEs to those that used the Affymetrix GeneChip Human Genome U133 arrays HG-U133A (GPL96), HG-U133B (GPL97), or HG-U133 Plus 2.0 (GPL570), which are among the most commonly used platforms. Probes for these platforms were mapped to the current gene identifiers (Chen et al., 2007). This process yielded nineteen diseases for the final GEO analysis.
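The three restrictions applied to the GEO series (curated in GDS, disease and control measured in the same tissue, and an allowed Affymetrix platform) amount to a simple filter. The sketch below assumes a hypothetical record layout and hypothetical accession labels; it does not query GEO and is only meant to make the selection logic explicit.

```python
# Platform accessions named in the text.
ALLOWED_PLATFORMS = {"GPL96", "GPL97", "GPL570"}


def eligible_series(series_records):
    """Return the series identifiers that pass all three restrictions."""
    kept = []
    for rec in series_records:
        on_allowed_platform = rec["platform"] in ALLOWED_PLATFORMS
        curated = rec["in_gds"]
        # Disease and control samples must share at least one tissue.
        shared_tissue = rec["disease_tissues"] & rec["control_tissues"]
        if on_allowed_platform and curated and shared_tissue:
            kept.append(rec["gse"])
    return kept


# Hypothetical records; the identifiers are placeholders, not real accessions.
records = [
    {"gse": "GSE_X1", "platform": "GPL570", "in_gds": True,
     "disease_tissues": {"blood"}, "control_tissues": {"blood"}},
    {"gse": "GSE_X2", "platform": "GPL570", "in_gds": True,
     "disease_tissues": {"muscle"}, "control_tissues": {"blood"}},
    {"gse": "GSE_X3", "platform": "GPL_OTHER", "in_gds": True,
     "disease_tissues": {"blood"}, "control_tissues": {"blood"}},
    {"gse": "GSE_X4", "platform": "GPL96", "in_gds": False,
     "disease_tissues": {"liver"}, "control_tissues": {"liver"}},
]
selected = eligible_series(records)
```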

#### **2.3 Expression measurement**

To quantitatively compare expression data, we first normalized the data in each microarray sample using the Z-score transformation to make the expression values across various microarray samples and diseases comparable. Next, we performed an unpaired two-sample Student's t-test to compute the t-statistic and *p*-value of each gene between the disease and control groups. Because many genes are represented by multiple probe sets on the Affymetrix U133 microarray chips, we used only the most appropriate probe set for each gene, so that a single probe set was representative of each gene. The most appropriate probe sets were adopted from the work of Hu and Agarwal (2009). This modification avoided correlation and scoring biases brought on by over-representation of those genes. In total, 18,600 most appropriate probes/genes were identified for each of the nineteen diseases. Genes with statistically significant high t-statistics (*p* < 0.05) were grouped as "up-regulated genes" and genes with statistically significant low t-statistics (*p* < 0.05) as "down-regulated genes". Instead of using a *p*-value threshold as a cutoff to identify significantly changed genes, the 200 and 1000 most changed genes were designated as the disease-associated significantly changed genes for each disease state. The genes with the lowest *p*-values in each category (up-regulated, down-regulated, and combined) for the top 200 or 1000 genes were pooled for each disease. All of the genes with significant expression changes were grouped together and Jaccard index values were calculated. Gene lists for each disease were compared pairwise for each of the three expression categories. Here, a high Jaccard index implied a high degree of commonality between diseases/phenotypes. The Jaccard indexes were normalized to produce Z-scores, which were then used as a measure of disease relatedness. The significantly changed genes shared by two diseases were also subjected to Gene Ontology (GO) term enrichment analysis using the web-based Gene Ontology enrichment analysis and visualization (GOrilla) tool (Eden et al., 2007; Eden et al., 2009).
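The expression steps in this section (per-sample Z-score normalization, an unpaired Student t statistic per gene, selection of the most changed genes, and conversion of Jaccard indexes to Z-scores) can be sketched with the Python standard library. Several choices here are assumptions rather than details from the text: the population standard deviation is used for the Z-scores, genes are ranked by t statistic instead of by *p*-value (the two rankings agree when every gene has the same group sizes), and all function names and toy values are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev, variance


def zscore(values):
    """Z-score transform the expression values of one microarray sample."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def student_t(disease, control):
    """Unpaired two-sample Student t statistic using the pooled variance."""
    n1, n2 = len(disease), len(control)
    pooled = ((n1 - 1) * variance(disease) + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    return (mean(disease) - mean(control)) / sqrt(pooled * (1 / n1 + 1 / n2))


def top_changed(t_by_gene, n):
    """Return (up, down): the n highest positive and n lowest negative t statistics."""
    ranked = sorted(t_by_gene, key=t_by_gene.get)
    down = {g for g in ranked[:n] if t_by_gene[g] < 0}
    up = {g for g in ranked[-n:] if t_by_gene[g] > 0}
    return up, down


def standardize(scores):
    """Convert Jaccard indexes for all disease pairs into Z-scores."""
    values = list(scores.values())
    mu, sigma = mean(values), pstdev(values)
    return {pair: (s - mu) / sigma for pair, s in scores.items()}


# Toy values only; none of these numbers come from the study.
sample = zscore([1.0, 2.0, 3.0])
t_g1 = student_t([2.0, 2.2, 2.4], [1.0, 1.2, 1.4])
up, down = top_changed({"G1": t_g1, "G2": -4.0, "G3": 0.5, "G4": -0.2}, n=1)
relatedness = standardize({("A", "B"): 0.5, ("A", "C"): 0.1, ("B", "C"): 0.3})
```

In the study, `n` would be 200 or 1000, and the "combined" category would correspond to the union of the up- and down-regulated sets before the pairwise Jaccard comparison.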

#### **2.4 Medical subject headings (MeSH) term mapping**

MeSH is the National Library of Medicine's controlled vocabulary thesaurus (Bodenreider et al., 1998). It consists of sets of terms associated with descriptors in a hierarchical structure. For the nineteen GEO validation diseases (Table 2), the MeSH trees were downloaded and the first level of each tree was considered as a disease category. The category that could best indicate the cause of the disease was taken as the disease category.

Table 1. List of diseases and phenotypes considered for this study and the previous study (Lewis et al., 2011) with corresponding abbreviations.

**3. Results**

#### **3.1 Summary of significant disease associations for screening of 61 diseases and phenotypes**

Jaccard index values were used to assess similarity between diseases and phenotypes within each level of analysis. Correlation between the levels was also assessed using the Spearman correlation method. High correlation was seen between the SNP and block data sets, while low correlation was seen between the pathway data and the other three levels of analysis. The progression from SNP to block, block to gene, and gene to pathway levels resulted in a grouping of susceptibility markers. Visualization of the associations by means of DRNs suggested the grouping translated to an increase in the strength of associations between diseases. This was also reflected in the distribution of Jaccard indexes for each level. Figure 1 shows a slight distribution shift to the right from SNP level to pathway level.

The DRNs suggested consistent association between several diseases for the SNP, block, and gene levels. The strongest associations seen for all populations were observed between (multiple sclerosis [MS], T1D, and RA), with noticeable association between (haematological traits [HT] and adult fetal hemoglobin levels [HBF]) and (serum low-density lipoprotein cholesterol levels [SLCL] and lipid measurements [LM]). Several other less significant associations were suggested by the DRNs as well, but these associations were not consistent in significance for all populations. The qualitative assessments made by examining the DRNs were verified using PCA, which allowed for quantitative isolation of the strongest relationships. The PCA results matched the visual assessment for all levels and suggested additional strong associations unique to specific populations. For example, the PCA suggested an association between (LM and triglyceride levels [TG]) in the JPT population that was not apparent from visual inspection of the DRNs. This association was also found in the combined CHB+JPT population, but not in the CHB population. JPT was also missing the (HBF and HT) association that was observed in the other populations. Further details regarding the results of this portion of the study were previously submitted for publication (Lewis et al., 2011).
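The Spearman comparison between analysis levels described in Section 3.1 can be illustrated as a rank correlation over the same ordered list of disease pairs at two levels. The sketch below is an assumption-laden illustration: the helper names are hypothetical, ties receive average ranks, and the Jaccard values are invented to mimic the reported pattern (high agreement between SNP and block levels, low agreement with the pathway level); they are not study results.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented Jaccard indexes for the same four disease pairs at three levels.
snp_level = [0.10, 0.02, 0.30, 0.00]
block_level = [0.15, 0.05, 0.40, 0.01]
pathway_level = [0.20, 0.50, 0.10, 0.40]

rho_snp_block = spearman(snp_level, block_level)
rho_snp_pathway = spearman(snp_level, pathway_level)
```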
