**3. Results**

202 Type 1 Diabetes – Complications, Pathogenesis, and Alternative Treatments

| Abbreviation | Disease/Phenotype |
| --- | --- |
| AD | Alzheimer's disease |
| AF | Atrial Fibrillation/Atrial Flutter |
| ALS | Amyotrophic Lateral Sclerosis |
| BA | Brain aging |
| BC | Breast cancer |
| BD | Bipolar disorder |
| BL | Blood lipids |
| BMG | Bone mass and geometry |
| BPAS | Blood pressure and arterial stiffness |
| CA | Childhood asthma |
| CAD | Coronary Artery Disease |
| CC | Colorectal cancer |
| CD | Crohn's disease |
| CDI | Celiac disease |
| CS | Coronary spasm |
| CVD | Cardiovascular Disease outcomes |
| EO | Early onset extreme obesity |
| GCA | General cognitive ability |
| GD | Gallstone disease |
| GLA | Glaucoma |
| HAE | Hepatic adverse events with thrombin inhibitor ximelagatran |
| HBF | Adult fetal hemoglobin levels (HbF) by F cell levels |
| HEI | Height |
| HEM | Human episodic memory |
| HIV1 | HIV-1 disease progression |
| HT | Haematological (blood) traits |
| HYP | Hypertension |
| IC | Iris color |
| IMAN | Immunoglobulin A nephropathy |
| IS | Ischemic stroke |
| KFET | Kidney function and endocrine traits |
| LM | Lipid measurements |
| LOAD | Late-onset Alzheimer's disease |
| LONG | Longevity and age-related phenotypes |
| MHA | Minor histocompatibility antigenicity |
| MI | Myocardial infarction |
| MS | Multiple sclerosis |
| ND | Nicotine dependence |
| NEU | Neuroticism |
| OBE | Obesity-related traits |
| PA | Polysubstance addiction |
| PC | Prostate cancer |
| PD | Parkinson's disease |
| PF | Pulmonary function phenotypes |
| PR | Psoriasis |
| PSP | Progressive Supranuclear Palsy |
| QT | Cardiac repolarization (QT interval) |
| RA | Rheumatoid Arthritis |
| RLS | Restless Leg Syndrome |
| SA | Subclinical atherosclerosis |
| SALS | Sporadic Amyotrophic Lateral Sclerosis |
| SCP | Sleep and circadian phenotypes |
| SLCL | Serum LDL cholesterol levels |
| SLE | Systemic Lupus Erythematosus |
| SP | Schizophrenia |
| SPBC | Sporadic postmenopausal breast cancer |
| SPM | Skin pigmentation |
| STR | Stroke |
| T1D | Type I Diabetes |
| T2D | Type II Diabetes |
| TG | Triglycerides |

Table 1. List of diseases and phenotypes considered for this study and the previous study (Lewis et al., 2011) with corresponding abbreviations.

#### **3.1 Summary of significant disease associations for screening of 61 diseases and phenotypes**

Jaccard index values were used to assess similarity between diseases and phenotypes within each level of analysis. Correlation between the levels was also assessed using the Spearman correlation method. High correlation was seen between the SNP and block data sets, while low correlation was seen between the pathway data and the other three levels of analysis. The progression from SNP to block, block to gene, and gene to pathway levels resulted in a grouping of susceptibility markers. Visualization of the associations by means of DRNs suggested the grouping translated to an increase in the strength of associations between diseases. This was also reflected in the distribution of Jaccard indexes for each level. Figure 1 shows a slight distribution shift to the right from SNP level to pathway level.
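The two similarity measures named above can be illustrated with a minimal sketch. It assumes each disease is represented by a set of associated markers at a given level; the marker identifiers, the toy data, and the function names are hypothetical and are not taken from the study, and the pure-Python Spearman routine stands in for whatever statistical software was actually used.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B|; taken as 0.0 here when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_jaccard(markers):
    """Return {(d1, d2): index} for every unordered disease pair (names sorted)."""
    return {
        (d1, d2): jaccard(markers[d1], markers[d2])
        for d1, d2 in combinations(sorted(markers), 2)
    }


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the rank vectors.

    Assumes neither input is constant (otherwise the denominator is zero).
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical marker sets; the identifiers are made up for illustration.
snp_level = {
    "T1D": {"rs1", "rs2", "rs3"},
    "RA": {"rs2", "rs3", "rs4"},
    "MS": {"rs3", "rs5"},
}
gene_level = {
    "T1D": {"G1", "G2"},
    "RA": {"G1", "G2", "G3"},
    "MS": {"G2", "G4"},
}

snp_index = pairwise_jaccard(snp_level)
gene_index = pairwise_jaccard(gene_level)

# Compare two levels of analysis over the same ordered list of disease pairs.
pairs = sorted(snp_index)
rho = spearman([snp_index[p] for p in pairs], [gene_index[p] for p in pairs])
print(snp_index[("RA", "T1D")], round(rho, 3))
```

In this toy example the (RA, T1D) pair shares two of four SNP markers, giving an index of 0.5, and the correlation between levels is computed over the vector of all pairwise indexes rather than over the raw marker sets.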

The DRNs suggested consistent associations between several diseases at the SNP, block, and gene levels. The strongest associations, seen in all populations, were among (multiple sclerosis [MS], T1D, and RA), with noticeable associations between (haematological traits [HT] and adult fetal hemoglobin levels [HBF]) and between (serum low-density lipoprotein cholesterol levels [SLCL] and lipid measurements [LM]). Several other, less significant associations were suggested by the DRNs as well, but their significance was not consistent across populations. The qualitative assessments made by examining the DRNs were verified using PCA, which allowed quantitative isolation of the strongest relationships. The PCA results matched the visual assessment at all levels and suggested additional strong associations unique to specific populations. For example, PCA suggested an association between (LM and triglyceride levels [TG]) in the JPT population that was not apparent from visual inspection of the DRNs. This association was found in the CHB+JPT population, but not in the CHB population. JPT was also missing the (HBF and HT) association that was observed in the other populations. Further details regarding the results of this portion of the study were previously submitted for publication (Lewis et al., 2011).

Meta-Analysis of Genome-Wide Association Studies to Understand Disease Relatedness 205

Fig. 1. Graphical representation of histogram data showing distribution of all Jaccard indexes for all populations at each level of analysis. Index values were grouped and then divided into twenty bins across the range zero to one. (N = 9150 for each analysis level)
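The sample size quoted in the caption is consistent with one Jaccard index per unordered pair of the 61 diseases/phenotypes in each population. The following worked count is a sketch that assumes five populations (CEU, CHB, JPT, CHB+JPT, and YRI, as named in the text); the source does not state this breakdown explicitly.

$$
\binom{61}{2} = \frac{61 \times 60}{2} = 1830, \qquad N = 5 \times 1830 = 9150
$$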


#### **3.2 Clustering of genetic associations**

Based on the observations made using the DRNs, agglomerative hierarchical clustering was used to find groups of diseases. At each level, the 61 diseases/phenotypes were clustered into ten groups. The number of clusters was set to ten based on visual inspection of the hierarchical branching of the trees. Representative clustering results are shown for the CHB+JPT population in Figure 2. The Rand Index for similarity was used to compare the clustering across populations at each level, and by this measure the CHB+JPT population showed a high correlation to most populations at all levels of analysis. The diseases within each cluster were least similar at the SNP level for all populations and most similar at the gene level across most of the populations. At the SNP level, groupings of (MS, RA, and T1D), (HBF and HT), and (breast cancer [BC] and sporadic post-menopausal breast cancer [SPBC]) were found for all populations (Figure 2A). The groupings of (RA and T1D), (BC and SPBC), (HBF and HT), (Amyotrophic Lateral Sclerosis [ALS] and Parkinson's disease [PD]), and (colorectal cancer [CC] and prostate cancer [PC]) were consistent at the block level for all populations (Figure 2B). At the gene level, the number of diseases/phenotypes included in each cluster increased, with consistent groups again observed for all populations. These groups included (MS, RA, and T1D), (ALS, PD, CAD, Alzheimer's disease [AD], and T2D), and (neuroticism [NEU], brain aging [BA], and sleep and circadian phenotypes [SCP]) (Figure 2C). Clusters at the pathway level were also much larger than at the other levels. No consistent relationships were seen for the clusters containing a larger number of diseases, but the smaller groupings consistently showed relationships between (longevity and age-related phenotypes [LONG] and early onset extreme obesity [EO]), (cardiovascular disease outcomes [CVD], CD, and NEU), and (blood lipids [BL], LM, and Restless Leg Syndrome [RLS]) (Figure 2D). Four populations suggested clustering of (LONG, EO, and T1D), while one, YRI, showed a relationship between (LONG, EO, and SLCL).
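Comparing cluster assignments across populations with the Rand Index can be sketched as follows. This is a minimal pair-counting illustration under stated assumptions: the cluster labels and the function name are hypothetical, and the source does not say whether the plain or the chance-adjusted form of the index was used; the plain form is shown.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings agree.

    A pair agrees when both clusterings put the two items together,
    or both put them apart. Inputs map item name -> cluster label.
    """
    if labels_a.keys() != labels_b.keys():
        raise ValueError("both clusterings must cover the same items")
    agree = total = 0
    for x, y in combinations(sorted(labels_a), 2):
        same_a = labels_a[x] == labels_a[y]
        same_b = labels_b[x] == labels_b[y]
        agree += same_a == same_b
        total += 1
    return agree / total


# Hypothetical cluster assignments for two populations (labels are arbitrary).
population_1 = {"MS": 0, "RA": 0, "T1D": 0, "HBF": 1, "HT": 1, "BC": 2}
population_2 = {"MS": 0, "RA": 0, "T1D": 0, "HBF": 1, "HT": 2, "BC": 2}

score = rand_index(population_1, population_2)
print(round(score, 3))
```

Because the index only looks at whether pairs are co-clustered, it does not depend on how the ten groups are numbered in each population, which is what makes it usable for comparing independently derived clusterings.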

#### **3.3 Gene expression analysis**

| Disease | Platform | GEO record | Sample size (disease) | Sample size (control) | MeSH category |
| --- | --- | --- | --- | --- | --- |
| AD | GPL96 | GSE1297 | 22 | 9 | Nervous System Diseases [C10]; Mental Disorders [F03] |
| AF | GPL96 and 97 | GSE2240 | 10 | 5 | Cardiovascular Diseases [C14] |
| ALS | GPL96 and 97 | GSE3307 | 9 | 16 | Nervous System Diseases [C10]; Nutritional and Metabolic Diseases [C18] |
| BC | GPL96 and 97 | GSE6883 | 6 | 3 | Neoplasms [C04]; Skin and Connective Tissue Diseases [C17] |
| BD | GPL96 | GSE5388 | 30 | 31 | Mental Disorders [F03] |
| CA | GPL570 | GSE8052 | 268 | 136 | Respiratory Tract Diseases [C08]; Immune System Diseases [C20] |
| CAD | GPL96 | GSE12288 | 110 | 120 | Cardiovascular Diseases [C14] |
| CC | GPL570 | GSE9348 | 70 | 12 | Neoplasms [C04]; Digestive System Diseases [C06] |
| CD | GPL96 | GSE3365 | 59 | 42 | Digestive System Diseases [C06] |
| IS | GPL96 | GSE1869 | 6 | 10 | Cardiovascular Diseases [C14] |
| ND | GPL570 | GSE11208 | 6 | 5 | Mental Disorders [F03]; Disorders of Environmental Origin [C21] |
| OBE | GPL96 | GSE474 | 16 | 8 | Nutritional and Metabolic Diseases [C18] |
| PD | GPL96 | GSE6613 | 50 | 22 | Nervous System Diseases [C10] |
| PR | GPL96 | GSE6710 | 13 | 13 | Skin and Connective Tissue Diseases [C17] |
| PSP | GPL96 | GSE6613 | 6 | 22 | Nervous System Diseases [C10]; Eye Diseases [C11] |
| SLE | GPL96 and 97 | GSE11909 | 103 | 12 | Skin and Connective Tissue Diseases [C17]; Immune System Diseases [C20] |
| SP | GPL570 | GSE4036 | 14 | 14 | Mental Disorders [F03] |
| T1D | GPL570 | GSE10586 | 12 | 15 | Nutritional and Metabolic Diseases [C18]; Endocrine System Diseases [C19]; Immune System Diseases [C20] |
| T2D | GPL96 and 97 | GSE9006 | 12 | 24 | Nutritional and Metabolic Diseases [C18]; Endocrine System Diseases [C19] |

Table 2. List of nineteen diseases in gene expression analysis and their MeSH classification.

The gene expression profiles showed some patterns for the three expression categories (up-regulated, down-regulated, and combined), with the number of strong associations increasing with cutoff type (top 200 most changed genes, top 1000 most changed genes, and changes with a *p*-value less than 0.05). Jaccard indexes for each disease/phenotype pair were calculated and used to construct DRNs, which are shown in Figure 3. Strong associations between (PD, Progressive Supranuclear Palsy [PSP], and nicotine dependence [ND]), (ischemic stroke [IS], CC, and CD), and (CAD and childhood asthma [CA]) were observed under all three cutoff scenarios for all three expression categories of analysis. Of these, the (CAD and CA) pair showed the most variation in association strength for all the variables considered.

Fig. 2. Clustering dendrogram for 61 disease/phenotype comparisons at the (A) SNP, (B) block, (C) gene, and (D) pathway levels. Colored boxes indicate the clusters derived from Rand Index analysis. Results for the CHB+JPT population are shown as a representative data set for all populations.

Links between disease classifications were also seen. Connections between nervous system diseases and disorders of environmental origin (i.e., (PSP and ND) and (PD and ND)) were seen in all three expression categories and cutoff types. Associations between nervous system and mental disorders (e.g., AD and BD) were seen for the top 200 and top 1000 groups, but this association was masked in the *p*-value-derived group. For the *p*-value group, predominant associations between metabolic, cardiovascular, digestive, and immune system diseases were found. One unexpected classification association was the nervous system-metabolic disease link exemplified by (PSP and OBE) and (PD and OBE) for the down-regulated, and consequently the combined, expression groups with the top 1000 and *p*-value cutoffs.

As expected, the number of significant associations increased as the cutoff was relaxed, given that the quantity of data available for comparison was greater. Seemingly strong associations observed at the top 200 cutoff, such as the (AD and BD) and (BD and SP) associations, were masked in the *p*-value cutoff data because other, stronger associations were present. The increase in the maximum Jaccard index for the combined expression data set from 0.44 to 0.81 agreed with this observation. Though we saw an increase in relationship strength with less stringent cutoff thresholds, the additional comparison data resulted in a reduction in significant associations. Therefore, the expression categories for the *p*-value cutoff group were used for comparison with the SNP-based data, in order to avoid assigning an arbitrary cutoff for the expression data and to ensure enough data was available for the nineteen-disease comparison.

Fig. 3. DRNs for expression data for the three cutoff levels (top 200, top 1000, and significance with *p*-value < 0.05) and three expression categories (up-regulated, down-regulated, and combined). Disease nodes are color coded to show grouping of diseases based on MeSH classification. Edges are color coded according to increasing strength of disease association. Values for the color scale are listed in the inserted table.

#### **3.4 Comparison of the SNP and expression data for nineteen diseases**

Correlation between data sets may have been influenced by the data sources. Both the SNP and block levels encompassed data from the HapMap site. The gene level data was obtained by cross-referencing the HapMap data against the Ensembl database of gene names. The pathway data was obtained by cross-referencing the Ensembl-derived data against the KEGG database. Given that the amount of data available through each of these sources is not consistent, there was loss of data in the transition from blocks to genes and from genes to pathways. Of the reduced set of nineteen diseases and phenotypes compared, only atrial fibrillation/atrial flutter (AF) did not contain gene data for the SNP-based comparisons. The number of missing diseases/phenotypes increased to four at the pathway level (i.e., AF, CA, psoriasis [PR], and PSP). Despite the missing disease associations for AF, the gene level of analysis was used for comparison to the expression data. The range of Z-scores for this dataset was closest to the range seen for the expression data, and intuitively, the gene data should show some correlation to gene expression.

DRNs comparing the gene level of analysis for the CEU, CHB+JPT, and YRI populations to the expression data are shown in Figure 4. The JPT and CHB populations are not shown since the CHB+JPT population is highly representative of the individual populations. A Spearman correlation was calculated between each population for the SNP-based data set and the expression data (Table 3). A weak negative correlation was observed between the genetic and expression data, suggesting no significant relationships were shared between the two data sets. A qualitative analysis of the networks and clustering from the SNP-based data analysis suggested a high degree of similarity between the predicted associations for all populations. However, the strong associations observed in the genetic analysis were not seen in the expression data. Rather, a seemingly reciprocal relationship appeared between the
