**3.4 Comparison of the SNP and expression data for nineteen diseases**

206 Type 1 Diabetes – Complications, Pathogenesis, and Alternative Treatments

[ND]), (ischemic stroke [IS], CC, and CD), and (CAD and childhood asthma [CA]) were observed under all three cutoff scenarios for all three expression categories of analysis. Of these, the (CAD and CA) pair showed the most variation in association strength for all the variables considered.

Fig. 2. Clustering dendrogram for 61 disease/phenotype comparisons at the (A) SNP, (B) block, (C) gene, and (D) pathway levels. Colored boxes indicate the clusters derived from Rand Index analysis. Results for the CHB+JPT population are shown as a representative data set for all populations.

Links between disease classifications were also seen. Connections between nervous system diseases and disorders of environmental origin (i.e., (PSP and ND) and (PD and ND)) were seen in all three expression categories and cutoff types. Associations between nervous system and mental disorders (e.g., AD and BD) were seen for the top 200 and top 1000 groups, but this association was masked in the *p*-value-derived group. For the *p*-value group, predominant associations between metabolic, cardiovascular, digestive, and immune system diseases were found. One unexpected classification association was the nervous system-metabolic disease link exemplified by (PSP and OBE) and (PD and OBE) for the down-regulation and subsequently combined expression groups with the top 1000 and *p*-value cutoffs.

As expected, the number of significant associations increased as the threshold criteria increased given that the quantity of data available for comparison was greater. Seemingly strong associations observed at the top 200 cutoff, such as the (AD and BD) and (BD and SP) associations, were masked in the *p*-value cutoff data as other stronger associations were present. The increase in maximum Jaccard index for the combined expression data set from 0.44 to 0.81 agreed with this observation. Though we saw an increase in relationship strength with less stringent cutoff thresholds, the additional comparison data resulted in a reduction in significant associations. Therefore, the expression categories for the *p*-value cutoff group were used to compare with the SNP-based data in order to avoid assigning an arbitrary cutoff for the expression data and to ensure enough data was available for the nineteen-disease comparison.
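To illustrate how a Jaccard index between two disease gene lists can change with the cutoff, a minimal Python sketch follows. It assumes that each disease is represented by a set of genes selected by the size of their expression change. The gene names, fold-change values, and the helper `top_n_genes` are hypothetical and are not taken from the study.

```python
def jaccard_index(set_a, set_b):
    """Return |A intersect B| / |A union B|, or 0.0 when both sets are empty."""
    a, b = set(set_a), set(set_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def top_n_genes(scores, n):
    """Keep the n genes with the largest absolute expression change (assumed rule)."""
    ranked = sorted(scores, key=lambda gene: abs(scores[gene]), reverse=True)
    return set(ranked[:n])


# Hypothetical fold-change values for two diseases (illustrative only).
disease_x = {"G1": 2.5, "G2": -1.8, "G3": 0.4, "G4": 1.1}
disease_y = {"G1": 1.9, "G2": 0.2, "G4": -2.2, "G5": 3.0}

# A stricter cutoff (top 2) versus a more relaxed cutoff (top 3).
strict = jaccard_index(top_n_genes(disease_x, 2), top_n_genes(disease_y, 2))
relaxed = jaccard_index(top_n_genes(disease_x, 3), top_n_genes(disease_y, 3))
print(strict, relaxed)
```

In this toy case the relaxed cutoff admits more shared genes, so the index rises, which is the qualitative behavior described above for less stringent thresholds.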

Correlation between data sets may have been influenced by the data sources. Both the SNP and block levels encompassed data from the HapMap site. The gene level data was obtained by cross referencing the HapMap data against the Ensembl database of gene names. The pathway data was obtained by cross referencing the Ensembl-derived data against the KEGG database. Given that the amount of data available through each of these sources is not consistent, there was loss of data in the transition from blocks to genes and genes to pathways. Of the reduced set of nineteen diseases and phenotypes compared, only atrial fibrillation/atrial flutter (AF) did not contain gene data for the SNP-based comparisons. The number of missing diseases/phenotypes increased to four at the pathway level (i.e., AF, CA, psoriasis [PR], and PSP). Despite the missing disease associations for AF, the gene level of analysis was used for comparison to the expression data. The range of Z-scores for this dataset was closest to the range seen for the expression data, and intuitively, the gene data should show some correlation to gene expression.
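The loss of data between levels described above can be sketched in Python as two successive lookups. This is only an illustration under stated assumptions: the block, gene, and pathway identifiers, the disease labels, and the helper names `map_level` and `missing` are hypothetical, and the dictionaries stand in for the cross-references against Ensembl and KEGG.

```python
# Hypothetical block -> gene and gene -> pathway lookups (illustrative IDs only).
block_to_gene = {"blk1": "GENE_A", "blk2": "GENE_B", "blk3": "GENE_C"}
gene_to_pathway = {"GENE_A": "pathway_1", "GENE_B": "pathway_1"}

# Hypothetical disease -> associated blocks.
disease_blocks = {
    "disease_1": ["blk1", "blk2"],
    "disease_2": ["blk3"],
    "disease_3": ["blk9"],  # this block has no gene annotation
}


def map_level(items_by_disease, lookup):
    """Translate each disease's items to the next level, dropping unmapped items."""
    return {
        disease: sorted({lookup[item] for item in items if item in lookup})
        for disease, items in items_by_disease.items()
    }


def missing(level):
    """List diseases left with no data at a given level."""
    return sorted(disease for disease, items in level.items() if not items)


disease_genes = map_level(disease_blocks, block_to_gene)
disease_pathways = map_level(disease_genes, gene_to_pathway)

print(missing(disease_genes))     # diseases without gene-level data
print(missing(disease_pathways))  # diseases without pathway-level data
```

Because each lookup can only drop items, the set of diseases without data can grow but never shrink from one level to the next, consistent with the increase from one to four missing diseases reported above.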

Fig. 3. DRNs for expression data for the three cutoff levels (top 200, top 1000, and significance with *p*-value < 0.05) and three expression categories (up-regulated, down-regulated, and combined). Disease nodes are color coded to show grouping of diseases based on MeSH classification. Edges are color coded according to increasing strength of disease association. Values for the color scale are listed in the inserted table.

Meta-Analysis of Genome-Wide Association Studies to Understand Disease Relatedness 209

DRNs comparing the gene level of analysis for the CEU, CHB+JPT, and YRI populations to the expression data are shown in Figure 4. The JPT and CHB populations are not shown since the CHB+JPT population is highly representative of the individual populations. A Spearman correlation was calculated between each population for the SNP-based data set and the expression data (Table 3). A weak negative correlation was observed between the genetic and expression data, suggesting no significant relationships were shared between the two data sets. A qualitative analysis of the networks and clustering from the SNP-based data analysis suggested a high degree of similarity between the predicted associations for all populations. However, the strong associations observed in the genetic analysis were not seen in the expression data. Rather, a seemingly reciprocal relationship appeared between the genetic and expression DRNs. The strongest expression-based association was between ALS and obesity-related traits (OBE), which was in the weakest associations group for the SNP-based associations. An examination of the genetic DRNs suggested the strongest associations between (ALS and PD), (AD and T2D), and (T1D and SLE). These associations were weak for the expression data. Some associations near the middle of the Z-score range appeared more common between the data sets, such as the (IS and CC), (AD and BD), and (OBE and CC) pairs.
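The Spearman comparison of ranked Z-scores mentioned above can be sketched in Python as a Pearson correlation computed on ranks, with tied values sharing their average rank. This is a minimal illustration, not the authors' implementation. The Z-score values are hypothetical and were chosen so that the two rankings are exactly reversed.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions start..end
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation between two equally long score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical Z-scores for the same five disease pairs in two data sets.
snp_z = [3.1, 2.4, 0.5, -0.2, 1.0]
expr_z = [-0.4, 0.1, 1.9, 2.2, 0.8]
rho = spearman(snp_z, expr_z)
print(round(rho, 4))
```

A coefficient near zero, such as the values reported for the GEO comparison in Table 3, would indicate that the ordering of disease pairs in one data set says little about their ordering in the other.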


| 61 diseases | CEU | CHB | JPT | CHB+JPT | YRI |
|---|---|---|---|---|---|
| CEU | 1 | 0.9599 | 0.9595 | 0.9574 | 0.9447 |
| CHB | | 1 | 0.9779 | 0.9925 | 0.9726 |
| JPT | | | 1 | 0.9858 | 0.9556 |
| CHB+JPT | | | | 1 | 0.9686 |
| YRI | | | | | 1 |
| **19 diseases** | | | | | |
| GEO | -0.1367 | -0.1228 | -0.1278 | -0.1254 | -0.1176 |

Table 3. Spearman correlation coefficients between populations and between each population and the GEO data. The Spearman correlation is a comparison of the ranked Z-scores for each data set.

Despite the overall lack of correlation between the genetic and expression analyses, several unexpected links between neurological and cardiovascular/metabolic diseases were observed in both data sets (i.e., (AD and T2D) and (PD and OBE)). These potentially novel disease relationships may primarily rely on genetic similarity or genomic expression similarity instead of phenotypic classification, but this idea would need to be further explored.

Fig. 4. DRNs based on Z-scores for three populations and expression data. DRNs for (A) combined expression data for significantly changed genes (*p*<0.05), (B) CEU gene level, (C) CHB+JPT gene level, and (D) YRI gene level are shown. Edge line color and width correspond to strength of association between disease pairs. The gradient and corresponding values are listed in the inserted table.

## **4. Discussion**

The results from this study suggested it is possible to elucidate genetic similarities that can be overlooked during single-disease GWAS. Several expected associations supported by literature were found (e.g., associations between (SLE and RA) and (EO and SLCL)), while some unexpected associations were also observed. The unexpected neurological-cardiovascular/metabolic disease associations were observed for both the genetic analysis and the expression profile analysis. Though the origin and symptoms associated with diseases in each category may be different, the results suggest genetic similarities. Possible explanations for these associations cannot be elucidated solely from this study given the broad nature of the comparison. A detailed SNP-by-SNP and gene-by-gene examination may indicate the reason behind the neurological-cardiovascular/metabolic relatedness. Those relationships are particularly interesting and may indicate some common underlying molecular mechanism among these disease groups that has not yet been widely studied.

Clinical evidence supports the strongest relationships identified from the expression data. PSP and PD share some common symptoms, such as stiffness and movement difficulties, which could explain the common expression pattern indicating some degree of relatedness between the two. On the other hand, explaining the relationship between PSP and ND is more difficult. Several studies have shown that smokers have a lower risk of developing Parkinson's disease (Soto-Otero et al., 1998; Hernan et al., 2001; Quik, 2004). One recently published paper showed that smoking for a greater number of years may reduce the risk of the disease (Chen et al., 2010). An earlier study suggested that younger patients with CD might be at an increased risk of IS (Andersohn et al., 2010). Extensive studies have demonstrated a strong association between CD and CC (Gillen et al., 1994). The relationship between (IS and CC) and (CAD and CA) is also unclear, but shared immune-dependent responses may be the common link.

Similarities and differences were observed between the three categories (up-regulated, down-regulated, and combined) of gene expression analysis (see Figure 3). The different association patterns may be due to the use of a single rule to identify disease-associated genes for all kinds of diseases, which oversimplifies the problem. Theoretically, variance of gene expression can be considered as a quantitative trait inherited from genetic variation. It is possible that a combined DNA variant and expression phenotype can better explain genetic architecture with reduced environmental and biological noise (Dermitzakis, 2008). However, the precise and reliable estimation of the molecular link between functional genomic effects and complex organism phenotypes depends on a large amount of pooled variant and gene expression data from corresponding tissues or cell types, since tissue-specific differences can be found widely (Dermitzakis, 2008). A combined genetic and gene expression profile study, as presented here, can shed light on disease relatedness from different perspectives. Parikh et al. performed a more direct comparison of GWAS and expression data in an effort to prioritize T2D susceptibility genes (Parikh et al., 2009). The group isolated SNPs from GWAS, searched for associated genes, and then found corresponding tissue-specific expression profiles for a subset of all the SNP-associated genes (Parikh et al., 2009). Parikh et al. were able to identify five genes common to individuals with T2D and twelve genes with differentiating expression patterns in individuals with versus without the disease (Parikh et al., 2009). Rather than focusing on a single disease to identify targets, we strove for a more global comparison of genetic and expression data.

Even though discrepancies between our data sets were observed, it is possible that the reduction in data between the gene and pathway level could have excluded some genes common to multiple diseases. With the increased density of GWAS and gene expression studies, the discrepancies and anomalies observed in this study might be better understood.

We set out to support the idea that diseases potentially share phenotype similarity as a result of genetic factors, pathway associations, expression regulation, or some combination of these three ideas. Within the autoimmune disease group, we observed diseases that possessed some genetic similarity. We saw expected strong associations between T1D, MS, and RA, as well as less expected associations between AD and T2D. It would appear that systemic inflammation responses may be the key to shared susceptibility among many of the diseases and phenotypes for which we observed relatedness. Clinical studies suggested individuals with one immune-mediated disease, such as T1D, may be more susceptible to pathogenesis of another (Dorman et al., 2003; Nielson et al., 2006; Toussirot et al., 2006; Doran, 2007). It has also been clinically suggested that inflammation plays a role in neurological diseases like AD (Akiyama et al., 2000; Perry, 2004) and PD (Perry, 2004). We also know that cardiovascular and metabolic diseases, such as atherosclerosis, T2D, and OBE, have links to chronic inflammatory responses (Stienstra et al., 2006; Tontonoz & Spiegelman, 2008). In all of these cases, our results suggest the clinical manifestations may have genetic relevance and the unexpected cardiovascular/neurological links may be important.

Given the broad scope of this study, the conclusions made here are suggestions for where genetic commonality could be found without specific identification of the related targets. A more detailed disease-by-disease analysis similar to the study conducted by Parikh et al. (2009) would need to be conducted to identify specific genes of interest shared by diseases. The methods used in the Parikh et al. study can be specifically applied to the study of T1D by performing a detailed step-by-step comparison between this disease and other possibly related diseases in order to elucidate genetic commonalities to T1D. The results from our study and from one tailored specifically for T1D could influence current treatment options and suggest new approaches for managing and treating the disease. We feel our study is a strong example of how GWAS and expression data can be used conjunctively to predict significant disease associations relevant to improving and unifying diagnoses and treatment options for multiple immune-mediated diseases.

## **5. Acknowledgements**

The authors would like to acknowledge the faculty and students of the Spring 2010 Genetics, Bioinformatics, and Computational Biology Problem Solving course for feedback regarding the progress of this study.

Birney, E., T. D. Andrews, et al. (2004). "An overview of Ensembl." Genome Research 14(5): 925-928.

Bodenreider, O., S. Nelson, et al. (1998). Beyond synonymy: exploiting the UMLS semantics in mapping vocabularies. Annual Symposium of the American Medical Informatics Association, Orlando, FL, Hanley & Belfus, Inc.

Chen, H., X. Huang, et al. (2010). "Smoking, duration, intensity, and risk of Parkinson disease." Neurology 74(11): 878-884.

Chen, R., L. Li, et al. (2007). "AILUN: reannotating gene expression data automatically." Nature Methods 4(11): 879.

Chen, Y., J. Zhu, et al. (2008). "Variations in DNA elucidate molecular networks that cause disease." Nature 452(7186): 429-435.

Dermitzakis, E. T. (2008). "From gene expression to disease risk." Nature Genetics 40(5): 492-493.

Doran, M. (2007). "Rheumatoid arthritis and diabetes mellitus: evidence for an association?" The Journal of Rheumatology 34(3): 460-462.

Dorman, J. S., A. R. Steenkiste, et al. (2003). "Type 1 Diabetes and Multiple Sclerosis." Diabetes Care 26(11): 3192-3193.

Eden, E., D. Lipson, et al. (2007). "Discovering Motifs in Ranked Lists of DNA Sequences." PLoS Computational Biology 3(3): e39.

Eden, E., R. Navon, et al. (2009). "GOrilla: A Tool for Discovery and Visualization of Enriched GO Terms in Ranked Gene Lists." BMC Bioinformatics 10: 48.

Emilsson, V., G. Thorleifsson, et al. (2008). "Genetics of gene expression and its effect on disease." Nature 452(7186): 423-428.

Gillen, C. D., H. A. Andrews, et al. (1994). "Crohn's disease and colorectal cancer." Gut 35(5): 651-655.

Goring, H. H. H., J. E. Curran, et al. (2007). "Discovery of expression QTLs using large-scale transcriptional profiling in human lymphocytes." Nature Genetics 39: 1208-1216.

Hernan, M. A., S. M. Zhang, et al. (2001). "Cigarette Smoking and the Incidence of Parkinson's Disease in Two Prospective Studies." Annals of Neurology 50(6): 780-786.

Hindorff, L. A., P. Sethupathy, et al. (2009). "Potential etiologic and functional implications of genome-wide association loci for human diseases and traits." PNAS 106(23): 9362-9367.

Hu, G. and P. Agarwal (2009). "Human disease-drug network based on genomic expression profiles." PLoS One 4(8): e6536.

Huang, W., P. Wang, et al. (2009). "Identifying disease associations via genome-wide association studies." BMC Bioinformatics 10: 1-11.

Johnson, A. D. and C. J. O'Donnell (2009). "An Open Access Database of Genome-wide Association Results." BMC Medical Genetics 10: 1-6.

Kanehisa, M. and S. Goto (2000). "KEGG: Kyoto Encyclopedia of Genes and Genomes." Nucleic Acids Research 28(1): 27-30.

Kanehisa, M., S. Goto, et al. (2010). "KEGG for representation and analysis of molecular networks involving diseases and drugs." Nucleic Acids Research 38: D355-D360.

Kanehisa, M., S. Goto, et al. (2006). "From genomics to chemical genomics: new developments in KEGG." Nucleic Acids Research 34: D354-D357.

Lettre, G. and J. D. Rioux (2008). "Autoimmune diseases: insights from genome-wide association studies." Human Molecular Genetics 17(2): R116-R121.

Lewis, S. N., E. Nsoesie, et al. (2011). "Prediction of Disease and Phenotype Associations from Genome-Wide Association Studies." PLoS One. Submitted for review.
