**Genome-Wide Gene Expression Analysis to Identify Epistatic Gene-Pairs Associated with Prognosis of Breast Cancer**

I-Hsuan Lin and Ming-Ta Hsu

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/59462

## **1. Introduction**

2011; doi: 10.1158/1535-7163.MCT-11-0458 Molecular Cancer Therapeutics 2012;11;

[139] Rhee I, Bachman KE, Park BH, Jair KW, Yen RW, et al. DNMT1 and DNMT3b coop‐

[140] Sharma D, Blum J, Yang X, Beaulieu N, Macleod AR, et al. Release of methyl CpG binding proteins and histone deacetylase 1 from the Estrogen receptor alpha (ER) promoter upon reactivation in ER-negative human breast cancer cells. Molecular En‐

[141] Yang X, Ferguson AT, Nass SJ, Phillips DL, Butash KA, et al. Transcriptional activa‐ tion of estrogen receptor alpha in human breast cancer cells by histone deacetylase

[142] Ben-Kasus T, Ben-Zvi Z, Marquez VE, Kelley JA, AgbariaR.Metabolic activation of zebularine, a novel DNA methylation inhibitor, in human bladder carcinoma cells.

[143] Yoo CB, Chuang JC, Byun HM, Egger G, Yang AS, et al. Long-term epigenetic thera‐ py with oral zebularine has minimal side effects and prevents intestinal tumors in

[144] Juergens RA, Wrangle J, Vendetti FP, Murphy SC, Zhao M, et al. Combination epige‐ netic therapy has efficacy in patients with refractory advanced non–small cell lung

[145] Stathis A, Hotte SJ, Chen EX, Hirte HW, Oza AM, et al Phase I study of decitabine in combination with vorinostat in patients with advanced solid tumors and non-Hodg‐

[146] Eliopoulos N, Cournoyer D, Momparler RL. Drug resistance to 5-aza-2′-deoxycyti‐ dine, 2′, 2′-difluorodeoxycytidine, and cytosine arabinoside conferred by retroviralmediated transfer of human cytidinedeaminasecDNA into murine cells.Cancer

[147] Lemaire M, Momparler LF, Bernstein ML, Marquez VE, MomparlerRL.Enhancement of antineoplastic action of 5-aza-2′-deoxycytidine by zebularine on L1210 leukemia.

[148] Cheishvili D, Chik F, Li CC, Bhattacharyya B, Suderman M et al. Synergistic effects of combined DNA methyltransferase inhibition and MBD2 depletion on breast cancer cells; MBD2 depletion blocks 5-aza-2'-deoxycytidine triggered invasiveness. Carcino‐

[149] Fu S, Hu W, Iyer R, Kavanagh JJ, Coleman RL, et al. Phase 1b-2a study to reverse platinum resistance through use of a hypomethylating agent, azacitidine, in patients with platinum-resistant or platinum-refractory epithelial ovarian cancer. Cancer

kin's lymphomas. Clinical Cancer Research 2011;17:1582–1590.

erate to silence genes in human cancer cells. Nature 2002;416:552–556.

370.

docrinology 2005;19(7):1740-1751.

56 A Concise Review of Molecular Pathology of Breast Cancer

inhibition. Cancer Research 2000;60(24):6890-6894.

Biochemical Pharmacology 2005;70(1):121-133.

cancer. Cancer Discovery 2011;1(7):598–607.

mice.Cancer Prevention Research 2008;1(4):233-240.

Chemotherapy and Pharmacology 1998;42:373–378.

Anticancer Drugs 2005;16:301–308.

genesis 2014 Sep 1.

2011;117:1661–1669.

Breast cancer is the most common cancer in women in the world [1]. According to the most recent estimates from GLOBOCAN published by the International Agency for Research on Cancer (IARC) [2], there were nearly 1.7 million new breast cancer cases diagnosed in 2012 (25.2% of all cancers in women) and 6.3 millions have been diagnosed with breast cancer in 2007-2012. Breast cancer incidence has been increasing by more than 20% and mortality increased by 14% since 2008 and is the most common cause of cancer death in women in less developed regions (324,000 deaths, 14.3% of total). Breast cancer is less favorable in the underdeveloped countries due to less advanced medical diagnosis and treatments. Therefore a good diagnosis/prognosis would help to prevent as well as provide effective clinical treatments.

Biomarker testing is an essential step in the evaluation of breast cancer and help medical doctors and patients in deciding the best treatment strategy. There are several commercial products or services developed towards this purpose. The Oncotype DX (Genomic Health) measures the expression levels of 21 genes and is most helpful for patients of early stage breast cancer with estrogen receptor (ER) positivity and no cancer cells in the lymph node. The HERmark assay (Monogram Biosciences) can quantitatively measure the HER2 total proteins with greater sensitivity than immunohistochemistry (IHC) which is an important indicator of predicting response of HER2-positive breast cancer patients to trastuzumab therapy. There are also tests for *BRCA1* and *BRCA2* mutations for the hereditary breast cancer patients. The targeted sequencing-based breast cancer panels such as BreastNext (Ambry Genetics) and BROCA (University of Washington) can be used to screen for mutations and copy number variants in genes implicated in breast cancer, including *BRCA1* and *BRCA2*.

© 2015 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and eproduction in any medium, provided the original work is properly cited.

Despite the relative success of these tests, there is a need for more efficient biomarkers in specific groups of breast cancer, such as lobular carcinoma [3, 4], triple-negative breast cancer [5] and early-onset breast cancers [6, 7] for diagnostic and/or prognostic application. We believe the discovery of more useful markers using the wealth of gene expression data available publicly nowadays would help medical doctors in the decision of the best way to help breast cancer patients, especially if the markers are correlated with specific therapeutic interventions. In the past decade, the high throughput microarray technique has been widely used to identify potential biomarkers for various cancers [8-12]. Recent years, the employment of RNA sequencing (RNA-seq) allows researchers to obtain transcriptome information and differential gene expression profiling at a much higher resolution. With the huge amount of data generated by these technologies, we are able to study the association of genes with cancer survival and identify novel potentially prognostic biomarkers for cancers with improved estimation. Traditionally, genetic search identifies genes that correlate with poor or good prognosis of patients. However, it is important to consider the epistatic gene-gene interactions underlying gene expression in complex diseases such as cancer. The epistatic (second or higher order) information would allow more refined prognostic evaluation that may help clinical treatments. Furthermore, epistatic analysis could be useful for identifying hub genes involved in prognosis and help to identify the major genetic risk factors and pathways in breast cancer.

In this work, we utilized the breast tumor RNA-seq data from The Cancer Genome Atlas (TCGA) as well as microarray-based expression datasets from Gene Expression Omnibus (GEO) to detect differentially expressed genes in various subsets of breast cancer patients, to identify genes whose expression profile is associated with survival of breast cancer patients and to examine the influence of co-expression of a second gene in the survival of patients. This analysis identifies specific gene groups differentially expressed between early-onset vs. lateonset breast cancer, between ductal vs. lobular carcinoma, between early vs. advanced stage breast tumors and tumor of various receptor status. Furthermore, epistatic interactions among these genes demonstrate the gene-gene interactions in patient survival and identify several hub genes that may be important determinants of breast cancer.

## **2. Statistical analysis of gene expression data**

A global change in gene expression is a common theme in many human cancers. Highthroughput techniques such as microarrays and next generation sequencing allow investiga‐ tors to observe and compare the transcriptional landscapes of tumor cells in different biological states [13-18]. In this work, we integrated multiple gene expression data from several largescale breast cancer studies to improve the assessment of differential gene expression in breast tumor cells and to effectively increase statistical power.

We collected 3,188 breast cancer related Affymetrix expression microarray data from GEO (http://www.ncbi.nlm.nih.gov/geo) from the following 16 series: GSE2603, GSE4922, GSE2990, GSE3494, GSE6532, GSE9195, GSE7390, GSE20194, GSE20271, GSE20685, GSE25066, GSE16391, GSE19615, GSE42568, GSE45255 and GSE50948. We also obtained 1,172 breast invasive carcinoma (BRCA) RNA-seq Level 3 data from TCGA Data Portal (http://tcgadata.nci.nih.gov/). The demographic and clinicopathological characteristics of the breast cancer patients from each study were also retrieved.

#### **2.1. Processing of gene expression data and differential gene expression analysis**

Despite the relative success of these tests, there is a need for more efficient biomarkers in specific groups of breast cancer, such as lobular carcinoma [3, 4], triple-negative breast cancer [5] and early-onset breast cancers [6, 7] for diagnostic and/or prognostic application. We believe the discovery of more useful markers using the wealth of gene expression data available publicly nowadays would help medical doctors in the decision of the best way to help breast cancer patients, especially if the markers are correlated with specific therapeutic interventions. In the past decade, the high throughput microarray technique has been widely used to identify potential biomarkers for various cancers [8-12]. Recent years, the employment of RNA sequencing (RNA-seq) allows researchers to obtain transcriptome information and differential gene expression profiling at a much higher resolution. With the huge amount of data generated by these technologies, we are able to study the association of genes with cancer survival and identify novel potentially prognostic biomarkers for cancers with improved estimation. Traditionally, genetic search identifies genes that correlate with poor or good prognosis of patients. However, it is important to consider the epistatic gene-gene interactions underlying gene expression in complex diseases such as cancer. The epistatic (second or higher order) information would allow more refined prognostic evaluation that may help clinical treatments. Furthermore, epistatic analysis could be useful for identifying hub genes involved in prognosis

58 A Concise Review of Molecular Pathology of Breast Cancer

and help to identify the major genetic risk factors and pathways in breast cancer.

hub genes that may be important determinants of breast cancer.

**2. Statistical analysis of gene expression data**

tumor cells and to effectively increase statistical power.

In this work, we utilized the breast tumor RNA-seq data from The Cancer Genome Atlas (TCGA) as well as microarray-based expression datasets from Gene Expression Omnibus (GEO) to detect differentially expressed genes in various subsets of breast cancer patients, to identify genes whose expression profile is associated with survival of breast cancer patients and to examine the influence of co-expression of a second gene in the survival of patients. This analysis identifies specific gene groups differentially expressed between early-onset vs. lateonset breast cancer, between ductal vs. lobular carcinoma, between early vs. advanced stage breast tumors and tumor of various receptor status. Furthermore, epistatic interactions among these genes demonstrate the gene-gene interactions in patient survival and identify several

A global change in gene expression is a common theme in many human cancers. Highthroughput techniques such as microarrays and next generation sequencing allow investiga‐ tors to observe and compare the transcriptional landscapes of tumor cells in different biological states [13-18]. In this work, we integrated multiple gene expression data from several largescale breast cancer studies to improve the assessment of differential gene expression in breast

We collected 3,188 breast cancer related Affymetrix expression microarray data from GEO (http://www.ncbi.nlm.nih.gov/geo) from the following 16 series: GSE2603, GSE4922, GSE2990, GSE3494, GSE6532, GSE9195, GSE7390, GSE20194, GSE20271, GSE20685, GSE25066, GSE16391, GSE19615, GSE42568, GSE45255 and GSE50948. We also obtained 1,172 breast The CEL files obtained from microarray experiments were pre-processed by subjecting to quality check using Bioconductor in the R environment to ensure comparability between the different series and microarray platforms. The following quality measurements from the *simpleaffy* and *affy* packages were performed: average background (avbg), scale factor (sfs), percent present (percent.present), and possible RNA Degradation (AffyRNAdeg) of the array. Additionally, the relative log expression (RLE) and normalized unscaled standard error (NUSE) was also estimated using the *affyPLM* package. 466 arrays that did not pass the quality control tests were removed. For the 2,722 arrays that had sufficient quality, the quantile normalization and background correction were performed using the justRMA (robust multiarray average) algorithm of the *affy* package and the gene (probe set)-level log2-transformed expression values were summarized with Custom CDF file annotations (version 18.0.0. ENSG) [19]. Lastly, the COMBAT method available in the *inSilicoMerging* package was used to remove batch effect when combining the final expression data from the HG-U133A and HG-U133 Plus 2.0 arrays [20]. The RMA-normalized expression values from microarrays and the raw count data from RNA-seq datasets were then analyzed using the *edgeR* package [21]. The differen‐ tially expressed genes were selected with a threshold of FDR adjusted *P*-value < 0.05.

#### **2.2. Chinicopathological characteristics of breast cancer patients**

We include 2,722 breast cancer patients from various microarray-based studies (referred as GEO cohort) and 1,052 breast cancer patients from the TCGA project (referred as TCGA cohort) following differential gene expression analyses (Table 1). All patients were women in the GEO cohort with a median age of 53 years. The patients from the TCGA cohort were older with a median age of 58 years and approximately 96% of patients were women. There was a signifi‐ cant amount of clinicopathological data not available from the GEO cohort as noted in Table 1. In both cohorts, there were more stage I/II breast cancer cases than advanced stage cases, and invasive ductal carcinoma (IDC) being the major histological subtype diagnosed. The data also contained status of tumor receptors such as the estrogen receptor (ER), progesterone receptor (PR) and HER2 which are frequently used prognostic factors to aid therapeutic decisions. Many patients were positive for the ER and/or PR, and/or negative for the HER2 receptor.



**Table 1.** Patient characteristics of the GEO and TCGA cohorts.

#### **2.3. Differentially expressed genes in patients from different age groups**

**Characteristic**

60 A Concise Review of Molecular Pathology of Breast Cancer

Stage

Histologic Subtype

ER Status

PR Status

HER2 Status

Female patients with a least one type of survival data

**Table 1.** Patient characteristics of the GEO and TCGA cohorts.

**No. of Patients Microarray (n = 2722), % RNA-seq (n = 1052), %**

Female 2722 85.4% 1005 95.5% Missing data 0 0.0% 36 3.4%

Median Age (range) 53 (24-93) 58 (26-90) Younger than 40 300 11.0% 71 6.7% 40 to 55 1184 43.5% 365 34.7% Older than 55 1224 45.0% 580 55.1% Missing data 14 0.5% 36 3.4%

Early (Stage I and II) 1236 45.4% 751 71.4% Late (Stage III and IV) 370 13.6% 246 23.4% Missing data 1116 41.0% 55 5.2%

IDC 500 18.4% 754 71.7% ILC 32 1.2% 168 16.0% Mixed 47 1.7% 29 2.8% Others 6 0.2% 64 6.1% Missing data 2137 78.5% 37 3.5%

ER positive 1710 62.8% 749 71.2% ER negative 647 23.8% 222 21.1% Missing data 365 13.4% 81 7.7%

PR positive 1017 37.4% 650 61.8% PR negative 684 25.1% 318 30.2% Missing data 1021 37.5% 84 8.0%

HER2 positive 202 7.4% 150 14.3% HER2 negative 946 34.8% 524 49.8% Missing data 1574 57.8% 378 35.9%

2294 84.3% 999 36.7%

Differential gene expression analysis was performed to identify over- and under-expressed genes specific to tumors derived from young, middle-aged and elderly breast cancer patients. As presented in Figure 1, there were very few middle-aged-specific expression signatures, indicating that the gene expression pattern of middle-aged patients was not significantly different from the young adults and/or elderly patients. In contrast, the elderly breast cancer patients possessed a high number of differentially expressed genes specific to this age group. IPA analysis of the differentially expressed genes from tumor cells obtained from older patients have decreased cell proliferation, movement, migration and cell cycle progression (activation z-score between -2.677 and -1.611) and increased cell death (activation z-score = 1.321). On the contrary, tumor cells from young patients were predicted to have increased proliferation of cells and DNA synthesis (activation z-score between 2.000 and 2.117) and decreased cell death and apoptosis (activation z-score between -0.586 and -0.299).

**Figure 1.** Number of significantly over- and under-expressed genes in the three age groups presented with the jvenn Venn diagram viewer [22].

It is interesting to note that 14 genes that were over-expressed in young patients were underexpressed in elderly patients, and conversely, 15 genes under-expressed in young patients were over-expressed in elderly patients (Table 2). Several of these genes such as *BIRC5* (survivin), *KPNA2*, *PLAC8* (onzin), *TFPI2*, *CITED2*, *NKX3-1*, *PIP* and *ZNF395* have been found to play a role in cancer cell proliferation and cancer progression [23-28].



**Table 2.** Concordant differentially expressed genes identified in the young and elderly breast cancer patients.

#### **2.4. Differentially expressed genes in patients with early stage versus advanced stage breast cancer**

We compared the gene expression profile of patients diagnosed with early stage (stage I and II) breast cancer with those with advanced stage (stage III and IV) breast cancer. We identified 79 over-expressed and 140 under-expressed genes in early stage breast cancer. IPA analysis showed 121 of the total 219 differentially expressed genes were associated with cancer (*P*-value = 4.81E-02), and 24 were specifically associated with breast cancer (*P*-value = 3.25E-03, Table 3). Also, there were 17 under-expressed genes in early stage breast cancer (i.e. over-expressed in advanced stage tumors) that were found to be cancer recurrence-associated (*ADORA3*, *FLT4*, *GSR*, *HSP90AA1*, *TEK* and *TXNRD1*) and metastasis-associated (*ACP5*, *FLT4*, *FTL*, *GSR*, *HSP90AA1*, *MAPK11*, *MMP9*, *NRAS*, *PGF*, *SCD* and *TEK*). Interestingly, we detected overexpression of the DNA methyltransferase *DNMT1* in early stage tumors. In cancer cells, the over-expression of this gene can lead to hypermethylation of CpG islands and epigenetically silences multiple tumor suppressor genes and hence promotes tumorigenesis in early stage cancers [29-31].

**Type Symbol Entrez Gene Name Location Type(s)**

*POLR3G* polymerase (RNA) III (DNA directed)

*TFPI2* tissue factor pathway inhibitor 2

*ABCC6* ATP-binding cassette, sub-family C, member

polypeptide G

6

62 A Concise Review of Molecular Pathology of Breast Cancer

*GPC4* glypican 4

*PIP* prolactin-induced protein

Young-Dn Elderly-Up

**cancer**

*HN1* hematological and neurological expressed 1 Nucleus Other *KPNA2* karyopherin alpha 2 Nucleus Transporter *NOL11* nucleolar protein 11 Nucleus Other *NUP85* nucleoporin 85kDa Cytoplasm Other *PLAC8* placenta-specific 8 Nucleus Other

*PSMA4* proteasome alpha 4 subunit isoform 1 Cytoplasm Peptidase *RAPGEFL1* Rap guanine nucleotide exchange factor Other Other

*UCHL1* ubiquitin carboxyl-terminal esterase L1 Cytoplasm Peptidase *XDH* xanthine dehydrogenase Cytoplasm Enzyme

*ACAA1* acetyl-CoA acyltransferase 1 Cytoplasm Enzyme *CCDC28A* coiled-coil domain containing 28A Other Other *CITED2* Cbp/p300-interacting transactivator Nucleus Transcription

*CLMN* calmin (calponin-like, transmembrane) Cytoplasm Other *CTDSPL* small CTD phosphatase 3 isoform 1 Nucleus Other *CTSF* cathepsin F Cytoplasm Peptidase *FMO5* flavin containing monooxygenase 5 Cytoplasm Enzyme

*KIF13B* kinesin family member 13B Cytoplasm Other *NDST1* N-deacetylase/N-sulfotransferase (heparan) Cytoplasm Enzyme *NKX3*-1 NK3 homeobox 1 Nucleus Transcription

*ZNF385D* zinc finger protein 385D Nucleus Other *ZNF395* zinc finger protein 395 Cytoplasm Other

**Table 2.** Concordant differentially expressed genes identified in the young and elderly breast cancer patients.

**2.4. Differentially expressed genes in patients with early stage versus advanced stage breast**

We compared the gene expression profile of patients diagnosed with early stage (stage I and II) breast cancer with those with advanced stage (stage III and IV) breast cancer. We identified 79 over-expressed and 140 under-expressed genes in early stage breast cancer. IPA analysis

Nucleus Enzyme

Membrane Transporter

Other

regulator

Transmembrane receptor

regulator

Space Peptidase

Extracellular Space

Plasma

Plasma Membrane

Extracellular


**Table 3.** The 25 differentially expressed genes associated with breast cancer in the early versus advanced stage analysis.

#### **2.5. Differentially expressed genes in patients with Invasive Ductal Carcinoma (IDC) versus Invasive Lobular Carcinoma (ILC)**

The invasive ductal carcinoma (IDC) and invasive lobular carcinoma (ILC) are the two major histological subtypes of breast cancer. They also represented the main types of breast cancer cases gathered in this study. We compared the gene expression profiles of patients with IDC and ILC and identified 216 over-expressed and 126 under-expressed genes in IDC as compared to patients with ILC. IPA analysis showed 66 genes were related to breast cancer (Table 4), including 12 transcription regulators (*ATF3*, *BTG2*, *EGR1*, *EZH2*, *FOS*, *FOSB*, *JUN*, *JUNB*, *MTDH*, *STAT1*, *ZFP36* and *ZNF423*) and a translation regulator (*EIF4EBP1*). There were also 12 genes annotated as tumor suppressor genes in the TSGene database [32], where *CDH1*, *DKK1* and *S100A2* were over-expressed in IDC and *BTG2*, *CAV1*, *EGR1*, *GPC3*, *MUC1*, *NR4A1*, *SLIT2*, *TGFBR2* and *ZFP36* were under-expressed in IDC. IPA predicted common upstream regulators *KDM5B*, *STUB1*, *CDKN1A*, *HIF1A* and *TGFB1* to be inhibited whereas *FOXM1*, *IFNB1*, *IFNG* and *PPARG* were in activated states. Additionally, the activities of several disease functions such as cell proliferation, invasion and DNA replication were predicted to be increased in IDC (activation z-score between 1.342 and 3.092).


Genome-Wide Gene Expression Analysis to Identify Epistatic Gene-Pairs Associated with Prognosis of Breast Cancer http://dx.doi.org/10.5772/59462 65

**2.5. Differentially expressed genes in patients with Invasive Ductal Carcinoma (IDC) versus**

The invasive ductal carcinoma (IDC) and invasive lobular carcinoma (ILC) are the two major histological subtypes of breast cancer. They also represented the main types of breast cancer cases gathered in this study. We compared the gene expression profiles of patients with IDC and ILC and identified 216 over-expressed and 126 under-expressed genes in IDC as compared to patients with ILC. IPA analysis showed 66 genes were related to breast cancer (Table 4), including 12 transcription regulators (*ATF3*, *BTG2*, *EGR1*, *EZH2*, *FOS*, *FOSB*, *JUN*, *JUNB*, *MTDH*, *STAT1*, *ZFP36* and *ZNF423*) and a translation regulator (*EIF4EBP1*). There were also 12 genes annotated as tumor suppressor genes in the TSGene database [32], where *CDH1*, *DKK1* and *S100A2* were over-expressed in IDC and *BTG2*, *CAV1*, *EGR1*, *GPC3*, *MUC1*, *NR4A1*, *SLIT2*, *TGFBR2* and *ZFP36* were under-expressed in IDC. IPA predicted common upstream regulators *KDM5B*, *STUB1*, *CDKN1A*, *HIF1A* and *TGFB1* to be inhibited whereas *FOXM1*, *IFNB1*, *IFNG* and *PPARG* were in activated states. Additionally, the activities of several disease functions such as cell proliferation, invasion and DNA replication were predicted to be

**Symbol Entrez Gene Name Location Type(s) DE Status** *ACACB* acetyl-CoA carboxylase beta Cytoplasm enzyme Down

*ATF3* activating transcription factor 3 Nucleus transcription regulator Down *BIRC5* baculoviral IAP repeat containing 5 Cytoplasm other Up *BTG2* BTG family, member 2 Nucleus transcription regulator Down

*CDC20* cell division cycle 20 Nucleus other Up

*CDK1* cyclin-dependent kinase 1 Nucleus kinase Up

member A1 Cytoplasm enzyme Down

catalytic Cytoplasm enzyme Up

Plasma

Extracellular

Plasma Membrane

Plasma

Plasma Membrane

Plasma Membrane

Plasma Membrane

Membrane transmembrane receptor Down

Space cytokine Down

Membrane transmembrane receptor Down

other Down

other Up

other Up

other Down

**Invasive Lobular Carcinoma (ILC)**

64 A Concise Review of Molecular Pathology of Breast Cancer

increased in IDC (activation z-score between 1.342 and 3.092).

*ALDH1A1* aldehyde dehydrogenase 1 family,

*APOBEC3B* apolipoprotein B mRNA editing enzyme,

*CAV1* caveolin 1, caveolae protein, 22kDa

*CCL21* chemokine (C-C motif) ligand 21

*CDH1* cadherin 1, type 1, E-cadherin (epithelial)

*CDH3* cadherin 3, type 1, P-cadherin (placental)

*CDH5* cadherin 5, type 2 (vascular endothelium)

*CD34* CD34 molecule

*CD69* CD69 molecule



**Table 4.** The 66 differentially expressed genes associated with breast cancer in the IDC versus ILC analysis.

#### **2.6. Differentially expressed genes in patients with different receptor status**

In the last part of the differential gene expression analysis, we sought to examine the differ‐ entially expressed genes of breast cancer patients of different receptor status: (1) estrogen receptor positive (ER+) versus ER negative (ER–), (2) progesterone receptor positive (PR+) versus PR negative (PR–), (3) HER2 receptor positive (HER2+) versus HER2 negative (HER2–), and (4) triple-negative breast cancer (TNBC, also known as basal-like breast cancer) versus non-TNBC. The Venn diagram in Figure 2 summarized the intersections between the differ‐ entially expressed genes identified in the four assays. There were 57% and 65% of breast cancer patients that were both ER+ and PR+ in the GEO and TCGA cohorts respectively; hence the patient pools divided by the ER positivity for gene expression assays are similar to that divided by the PR positivity. Because of this fact, it is not surprising to observe genes that were found over- or under-expressed in the ER assay were also differentially expressed in the same direction in the PR assay. Likewise, genes that were over-expressed in TNBC were underexpressed in the ER and PR assays (n = 74) and ER, PR and HER2 assays (n = 35), and vice versa for the under-expressed genes in TNBC (n = 87 and 2 respectively).

**Symbol Entrez Gene Name Location Type(s) DE Status**

*PCNA* proliferating cell nuclear antigen Nucleus enzyme Up *PDK4* pyruvate dehydrogenase kinase, isozyme 4 Cytoplasm kinase Down *PGK1* phosphoglycerate kinase 1 Cytoplasm kinase Up *RFC4* replication factor C (activator 1) 4, 37kDa Nucleus other Up *RRM2* ribonucleotide reductase M2 Nucleus enzyme Up *S100A2* S100 calcium binding protein A2 Nucleus other Up

*SQLE* squalene epoxidase Cytoplasm enzyme Up

*TCP1* t-complex 1 Cytoplasm other Up

*TOP2A* topoisomerase (DNA) II alpha 170kDa Nucleus enzyme Up *TPD52* tumor protein D52 Cytoplasm other Up *TYMS* thymidylate synthetase Nucleus enzyme Up *UBE2C* ubiquitin-conjugating enzyme E2C Cytoplasm enzyme Up

*ZFP36* ZFP36 ring finger protein Nucleus transcription regulator Down *ZNF423* zinc finger protein 423 Nucleus transcription regulator Down

In the last part of the differential gene expression analysis, we sought to examine the differ‐ entially expressed genes of breast cancer patients of different receptor status: (1) estrogen receptor positive (ER+) versus ER negative (ER–), (2) progesterone receptor positive (PR+) versus PR negative (PR–), (3) HER2 receptor positive (HER2+) versus HER2 negative (HER2–), and (4) triple-negative breast cancer (TNBC, also known as basal-like breast cancer) versus

**Table 4.** The 66 differentially expressed genes associated with breast cancer in the IDC versus ILC analysis.

**2.6. Differentially expressed genes in patients with different receptor status**

Plasma Membrane

Extracellular Space

Extracellular Space

Extracellular

Plasma Membrane

Extracellular Space

Extracellular

Nucleus ligand-dependent nuclear

receptor

Space cytokine Up

Nucleus transcription regulator Up

Space growth factor Up

other Down

other Up

other Down

kinase Down

other Down

Down

*MUC1* mucin 1, cell surface associated

66 A Concise Review of Molecular Pathology of Breast Cancer

member 1

*SLIT2* slit homolog 2 (Drosophila)

*SPP1* secreted phosphoprotein 1

II

*STAT1* signal transducer and activator of transcription

*TIMP4* TIMP metallopeptidase inhibitor 4

*VEGFA* vascular endothelial growth factor A

*TGFBR2* transforming growth factor, beta receptor

*ORM1* orosomucoid 1

*NR4A1* nuclear receptor subfamily 4, group A,

**Figure 2.** Number of differentially up- and down-regulated genes in the ER, PR, HER2 or TNBC receptor status assays.

The *GALNT6* (polypeptide N-acetylgalactosaminyltransferase) and *SCGB2A2* (secretoglobin) are the two genes consistently over-expressed in ER+, PR+, HER2+ breast tumors but underexpressed in TNBC. There were also 87 genes over-expressed in ER+ and PR+ breast tumors and under-expressed in TNBC, including seven transcription regulators (*EGR3*, *FOXA1*, *GATA3*, *INSM1*, *NRIP1*, *TBX3* and *XBP1*) and 11 breast cancer associated genes (*ABAT*, *AGR2*, *CXCL14*, *GSTM3*, *HSPB8*, *MUC1*, *NR4A2*, *PGR*, *PIP*, *PLAT* and *PSD3*). On the other hand, there were more under-expressed genes (n = 35) shared among ER+, PR+ and HER2+ breast tumors that were over-expressed in TNBC. Among these are four transcription regulators (*ELF5*, *EN1*, *FOXC1* and *ZIC1*) and 12 extracellular proteins (*CHI3L1*, *CHI3L2*, *COL2A1*, *COL9A3*, *CRLF1*, *KLK6*, *KLK7*, *MMP12*, *MMP7*, *PTX3*, *SERPINB5* and *SOSTDC1*), and some of these genes are known TNBC-associated markers [33-37].

## **3. Identifying survival-related genes of patients with breast cancer**

In cancer survival analysis, survival time is often defined as the period of time from the beginning of the medical process (treatment, surgery, etc.) until the death (or some other events such as development of a particular symptom or to relapse after the remission of disease) of the observed patient or until the end of observation. The goal of such analysis is to link the time to event (i.e. survival time) to certain explanatory variables. New methodologies were developed for calculating the survival probabilities using gene expression profiles when genome-wide expression data becomes increasingly available in the past two decades [38-42].

In this work, we analyzed associations between breast cancer patient survival and gene expression of breast tumors from published microarray and the RNA-seq datasets, denoted as the GEO and TCGA cohorts respectively. Survival analysis was performed separately for each cohort and the median times from diagnosis to death or last follow-up were 99.5 and 21.4 months in the GEO and TCGA cohorts respectively. We transform the expression values into gene expression status (i.e. 0 for low expression and 1 for high expression) using the modified auto\_cutoff function of the R script available from the Kaplan Meier-plotter website (http:// kmplot.com/). The survival probability is calculated using the "survival" package and modified *kmplot* function (http://biostat.mc.vanderbilt.edu/wiki/Main/TatsukiRcode#kmplot) is used to plot Kaplan-Meier curves. The hazard ratio with 95% confidence intervals and logrank *P*-value are estimated using the Cox proportional hazards model. All analyses were conducted within the R statistical environment.

#### **3.1. Univariate gene selection and survival analysis**

We extract the gene expression profiles of 1,694 genes that were found differentially expressed (consistently in microarray and RNA-seq datasets) in any one type of the assays discussed in section 2.3 to 2.6. We calculated the overall survival (OS), relapse-free survival (RFS) and the distant metastasis-free survival (DMFS) of breast cancer patients with respect to the expression status. DMFS is not calculated for the TCGA cohort due to unavailability of the time to distant metastases information from patients in this cohort. The log-rank *P*-values were adjusted for multiple comparisons using the Benjamini-Hochberg false discovery rate (FDR) method and were used to select genes expression profiles significantly associated with survival.

We summarized the survival statistics, including the hazard ratios (an estimate of the ratio of the hazard rate in the highly versus the lowly expressed patient group) and the estimated 2 and 10-year survival rates in Table 5. There were about 24% OS-associated, 48% RFS-associated and 51% DMFS-associated genes that have adjusted log-rank *P*-value < 0.01 in the GEO cohort. There were 23% OS-associated genes but only 1.3% RFS-associated genes in the TCGA cohort, due to much fewer relapse/recurrence information in this cohort (adjusted log-rank *P*-value < 0.05). The breast cancer patients in the TCGA cohorts have lower overall survival (10-year) than those from the GEO and both cohorts have similar RFS rates. In the DMFS analysis, 51% of the differentially expressed genes were significant predictor of DMFS.

Genome-Wide Gene Expression Analysis to Identify Epistatic Gene-Pairs Associated with Prognosis of Breast Cancer http://dx.doi.org/10.5772/59462 69

**3. Identifying survival-related genes of patients with breast cancer**

In cancer survival analysis, survival time is often defined as the period of time from the beginning of the medical process (treatment, surgery, etc.) until the death (or some other events such as development of a particular symptom or to relapse after the remission of disease) of the observed patient or until the end of observation. The goal of such analysis is to link the time to event (i.e. survival time) to certain explanatory variables. New methodologies were developed for calculating the survival probabilities using gene expression profiles when genome-wide expression data becomes increasingly available in the past two decades [38-42].

In this work, we analyzed associations between breast cancer patient survival and gene expression of breast tumors from published microarray and the RNA-seq datasets, denoted as the GEO and TCGA cohorts respectively. Survival analysis was performed separately for each cohort and the median times from diagnosis to death or last follow-up were 99.5 and 21.4 months in the GEO and TCGA cohorts respectively. We transform the expression values into gene expression status (i.e. 0 for low expression and 1 for high expression) using the modified auto\_cutoff function of the R script available from the Kaplan Meier-plotter website (http:// kmplot.com/). The survival probability is calculated using the "survival" package and modified *kmplot* function (http://biostat.mc.vanderbilt.edu/wiki/Main/TatsukiRcode#kmplot) is used to plot Kaplan-Meier curves. The hazard ratio with 95% confidence intervals and logrank *P*-value are estimated using the Cox proportional hazards model. All analyses were

We extract the gene expression profiles of 1,694 genes that were found differentially expressed (consistently in microarray and RNA-seq datasets) in any one type of the assays discussed in section 2.3 to 2.6. We calculated the overall survival (OS), relapse-free survival (RFS) and the distant metastasis-free survival (DMFS) of breast cancer patients with respect to the expression status. DMFS is not calculated for the TCGA cohort due to unavailability of the time to distant metastases information from patients in this cohort. The log-rank *P*-values were adjusted for multiple comparisons using the Benjamini-Hochberg false discovery rate (FDR) method and

We summarized the survival statistics, including the hazard ratios (an estimate of the ratio of the hazard rate in the highly versus the lowly expressed patient group) and the estimated 2 and 10-year survival rates in Table 5. There were about 24% OS-associated, 48% RFS-associated and 51% DMFS-associated genes that have adjusted log-rank *P*-value < 0.01 in the GEO cohort. There were 23% OS-associated genes but only 1.3% RFS-associated genes in the TCGA cohort, due to much fewer relapse/recurrence information in this cohort (adjusted log-rank *P*-value < 0.05). The breast cancer patients in the TCGA cohorts have lower overall survival (10-year) than those from the GEO and both cohorts have similar RFS rates. In the DMFS analysis, 51%

were used to select genes expression profiles significantly associated with survival.

of the differentially expressed genes were significant predictor of DMFS.

conducted within the R statistical environment.

68 A Concise Review of Molecular Pathology of Breast Cancer

**3.1. Univariate gene selection and survival analysis**


**Table 5.** Survival statistics according to gene expression profiles of breast cancer patients.

#### **3.2. Cox regression analysis using the expression profiles of two genes**

From the three survival data, i.e. OS, RFS and DMFS, we selected the top 500 most significantly survival associated gene expression profiles consistent in both cohorts to generate 124,750 twogene combinations and perform Cox regression analysis with two covariates (i.e. using the expression status of each gene as a covariate).

There were 81,902 (65.7%) and 78,136 (62.6%) pairs whose expression signatures of both genes remained predictors of OS in the GEO and TCGA cohorts respectively (*P*-value of the coeffi‐ cient estimates < 0.05). Of these, 31,189 pairs were mutual in the two cohorts and 234 genepairs (consisting of 131 genes) had survival probability patterns greatly shifted compared with the previous single-gene model. The strongest predictor pairs were *COL16A1*-*ARHGEF3*, *IGF1R*-*LTB*, *IGF1R*-*PTGDS*, *NPY1R*-*ARHGEF3* and *SERPINA1*-*ACADSB*, where the lower expression of both genes in each pair was associated with lowest survival probabilities in all five cases. The results were presented as Kaplan Meier plots in Figure 3A to E.

**Figure 3.** The Kaplan Meier plots of five OS-associated gene-pairs that also gained most changes in survival probabili‐ ties compared to the matching univariate approach.

The sameanalysiswasalsoperformedforRFSintheGOEandTCGAcohorts,and85,244(68.3%) and64,049(51.3%)pairsweresignificantpredictorsofRFSrespectively(*P*-valueofthecoefficient estimates < 0.05). We found 22,165 significant pairs common in the two cohorts and 1,130 genepairs (consisting of 276 genes) whose survival probability patterns had greatly shifted com‐ pared with the single-gene model. The most significant RFS-associated pairs were *ADM*-*MYBPC1*, *DIRAS3*-*TANC2*, *KIFC1*-*ADORA3*, *PDSS1*-*DIRAS3*, *STMN1*-*ADORA3* (Figure 4).

cient estimates < 0.05). Of these, 31,189 pairs were mutual in the two cohorts and 234 genepairs (consisting of 131 genes) had survival probability patterns greatly shifted compared with the previous single-gene model. The strongest predictor pairs were *COL16A1*-*ARHGEF3*, *IGF1R*-*LTB*, *IGF1R*-*PTGDS*, *NPY1R*-*ARHGEF3* and *SERPINA1*-*ACADSB*, where the lower expression of both genes in each pair was associated with lowest survival probabilities in all

**Figure 3.** The Kaplan Meier plots of five OS-associated gene-pairs that also gained most changes in survival probabili‐

ties compared to the matching univariate approach.

five cases. The results were presented as Kaplan Meier plots in Figure 3A to E.

70 A Concise Review of Molecular Pathology of Breast Cancer

**Figure 4.** The Kaplan Meier plots of five RFS-associated gene-pairs that also gained most changes in survival probabili‐ ties compared to the matching univariate approach.

Lastly, we also make use of the DMFS data available from the GEO cohort to demonstrate the improvement of epistatic gene-pair approach in predicting survival probabilities. Of the 124,750 two-gene combinations, 122,751 (98.4%) were significant predictors of DMFS in breast cancer patients. The high percentage of strong two-gene predictors derived from the DMFS analysis was most likely due to the already high numbers of strong single-gene predictors as shown in Table 5. We further distinguished 228 gene-pairs (consisting of 138 genes) whose survival probability patterns had greatly shifted compared with the single-gene model. Six most significantly improved DMFS-associated pairs were *MMP15*-*SPDEF*, *TRIB3*-*ETV1*, *TRIB3*-*PLD1*, *TRIB3*-*TRIM2*, *TRIM2*-*KRT14* and *XBP1*-*TRIB3* (Figure 5).

**Figure 5.** The Kaplan Meier plots of six DMFS-associated gene-pairs that also gained most changes in survival proba‐ bilities compared to the matching univariate approach.

#### **3.3. Differentially expressed survival-associated hub genes and gene-pair candidates**

As mentioned in the section 3.2, we have identified 234, 1,130 and 228 OS-, RFS- and DMFSassociated gene-pairs (consisting of 131, 276 and 138 genes respectively) that showed improved predictive performance. Some of these genes may be paired with many partners while remaining highly significant. In Table 6, we list five genes that have high number of pairing possibilities and also common in OS, RFS and DMFS analysis.


Lastly, we also make use of the DMFS data available from the GEO cohort to demonstrate the improvement of epistatic gene-pair approach in predicting survival probabilities. Of the 124,750 two-gene combinations, 122,751 (98.4%) were significant predictors of DMFS in breast cancer patients. The high percentage of strong two-gene predictors derived from the DMFS analysis was most likely due to the already high numbers of strong single-gene predictors as shown in Table 5. We further distinguished 228 gene-pairs (consisting of 138 genes) whose survival probability patterns had greatly shifted compared with the single-gene model. Six most significantly improved DMFS-associated pairs were *MMP15*-*SPDEF*, *TRIB3*-*ETV1*,

**Figure 5.** The Kaplan Meier plots of six DMFS-associated gene-pairs that also gained most changes in survival proba‐

bilities compared to the matching univariate approach.

*TRIB3*-*PLD1*, *TRIB3*-*TRIM2*, *TRIM2*-*KRT14* and *XBP1*-*TRIB3* (Figure 5).

72 A Concise Review of Molecular Pathology of Breast Cancer


**Table 7.** The gene-pairs that associated with more than one type of survival data.

There were also seven gene-pairs that were significantly associated with more than one type of survival data (Table 7). By incorporating the differential expression information we derived previously, we may observe that the TNBC patients were noticeably having worse survival outcomes than non-TNBC patients as TNBC is known to be an aggressive breast cancer subtype [43, 44]. For example, both *GATA3* [45, 46] and *SERPINA1* were found significantly underexpressed in TNBC cases and the low expressions of both genes were correlated with poor OS and DMFS. Additionally, the over-expression of *PSAT1* and the under-expression of *SERPI‐ NA1* in TNBC patients also correlated with poor OS and DMFS. Moreover, the over-expression of *MMP15* relating to advanced stage breast cancer and the under-expression of *SLC44A4* associated with TNBC are predictors of cancer recurrence as well as distant metastases.

## **4. Conclusion and perspectives**

In section 2 of this chapter, we identified 1,694 genes that were differentially expressed in breast cancer patients of three age groups, early versus advanced stage breast cancers, invasive ductal versus invasive lobular breast cancers, and patients of various receptor status. While some of these genes are known to participate in the biological and genetic pathways that lead to breast cancer and many are novel findings. In section 3, we showed that more than 20% the differ‐ entially expressed genes were associated with at least one type of survival data. Our data indicated improved predictive performance when using a multivariate approach of combining the expression of two genes in the assessment of survival data. Perceivably, the gene pairs found in the epistatic analysis could provide useful pictures in gene interactions in breast carcinogenesis.

Breast cancer is a heterogeneous and complex disease where researchers and doctors have implemented different classifications (be it molecular, pathological, genetic or prognostic) to aid disease diagnosis and treatment decision. In the future, we hope to use the gene expression profiles of multiple survival-associated biomarkers to sub-classify patients of different types of breast cancer, and ultimately allow medical practitioners to derive better disease assessment and treatment decision.

## **Acknowledgements**

The authors acknowledge financial support from the Ministry of Science and Technology, Taiwan (MOST 103-2811-B-010-020), and from the Ministry of Education, Taiwan (Aim for the Top University Plan; 103AC-T903). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

## **Author details**

I-Hsuan Lin and Ming-Ta Hsu\*

\*Address all correspondence to: mth@ym.edu.tw

Institute of Biochemistry and Molecular Biology, School of Life Science, National Yang-Ming University, Taipei, Taiwan

## **References**

expressed in TNBC cases and the low expressions of both genes were correlated with poor OS and DMFS. Additionally, the over-expression of *PSAT1* and the under-expression of *SERPI‐ NA1* in TNBC patients also correlated with poor OS and DMFS. Moreover, the over-expression of *MMP15* relating to advanced stage breast cancer and the under-expression of *SLC44A4* associated with TNBC are predictors of cancer recurrence as well as distant metastases.

In section 2 of this chapter, we identified 1,694 genes that were differentially expressed in breast cancer patients of three age groups, early versus advanced stage breast cancers, invasive ductal versus invasive lobular breast cancers, and patients of various receptor status. While some of these genes are known to participate in the biological and genetic pathways that lead to breast cancer and many are novel findings. In section 3, we showed that more than 20% the differ‐ entially expressed genes were associated with at least one type of survival data. Our data indicated improved predictive performance when using a multivariate approach of combining the expression of two genes in the assessment of survival data. Perceivably, the gene pairs found in the epistatic analysis could provide useful pictures in gene interactions in breast

Breast cancer is a heterogeneous and complex disease where researchers and doctors have implemented different classifications (be it molecular, pathological, genetic or prognostic) to aid disease diagnosis and treatment decision. In the future, we hope to use the gene expression profiles of multiple survival-associated biomarkers to sub-classify patients of different types of breast cancer, and ultimately allow medical practitioners to derive better disease assessment

The authors acknowledge financial support from the Ministry of Science and Technology, Taiwan (MOST 103-2811-B-010-020), and from the Ministry of Education, Taiwan (Aim for the Top University Plan; 103AC-T903). The funders had no role in study design, data collection

Institute of Biochemistry and Molecular Biology, School of Life Science, National Yang-Ming

and analysis, decision to publish, or preparation of the manuscript.

**4. Conclusion and perspectives**

74 A Concise Review of Molecular Pathology of Breast Cancer

carcinogenesis.

and treatment decision.

**Acknowledgements**

**Author details**

I-Hsuan Lin and Ming-Ta Hsu\*

University, Taipei, Taiwan

\*Address all correspondence to: mth@ym.edu.tw


[28] Li C, Ma H, Wang Y, Cao Z, Graves-Deal R, Powell AE, Starchenko A, Ayers GD, Washington MK, Kamath V, et al: Excess PLAC8 promotes an unconventional ERK2 dependent EMT in colon cancer. *The Journal of clinical investigation* 2014, 124:2172-2187.

[14] Shih Ie M, Wang TL: Apply innovative technologies to explore cancer genome. *Cur‐*

[15] Dupuy A, Simon RM: Critical review of published microarray studies for cancer out‐ come and guidelines on statistical analysis and reporting. *Journal of the National Can‐*

[16] Morozova O, Hirst M, Marra MA: Applications of new sequencing technologies for transcriptome analysis. *Annual review of genomics and human genetics* 2009, 10:135-151.

[17] Marguerat S, Bahler J: RNA-seq: from technology to biology. *Cellular and molecular life*

[18] Costa V, Aprile M, Esposito R, Ciccodicola A: RNA-Seq and human complex diseas‐ es: recent accomplishments and future perspectives. *European journal of human genet‐*

[19] Dai M, Wang P, Boyd AD, Kostov G, Athey B, Jones EG, Bunney WE, Myers RM, Speed TP, Akil H, et al: Evolving gene/transcript definitions significantly alter the in‐

[20] Johnson WE, Li C, Rabinovic A: Adjusting batch effects in microarray expression da‐

[21] Robinson MD, McCarthy DJ, Smyth GK: edgeR: a Bioconductor package for differen‐ tial expression analysis of digital gene expression data. *Bioinformatics* 2010,

[22] Bardou P, Mariette J, Escudie F, Djemiel C, Klopp C: jvenn: an interactive Venn dia‐

[23] Chou YT, Wang H, Chen Y, Danielpour D, Yang YC: Cited2 modulates TGF-beta-

[24] Sohn DM, Kim SY, Baek MJ, Lim CW, Lee MH, Cho MS, Kim TY: Expression of sur‐ vivin and clinical correlation in patients with breast cancer. *Biomedicine & pharmaco‐*

[25] Naderi A, Meyer M: Prolactin-induced protein mediates cell invasion and regulates integrin signaling in estrogen receptor-negative breast cancer. *Breast cancer research :*

[26] Noetzel E, Rose M, Bornemann J, Gajewski M, Knuchel R, Dahl E: Nuclear transport receptor karyopherin-alpha2 promotes malignant breast cancer phenotypes in vitro.

[27] Xu C, Wang H, He H, Zheng F, Chen Y, Zhang J, Lin X, Ma D, Zhang H: Low expres‐ sion of TFPI-2 associated with poor survival outcome in patients with breast cancer.

terpretation of GeneChip data. *Nucleic Acids Res* 2005, 33:e175.

ta using empirical Bayes methods. *Biostatistics* 2007, 8:118-127.

mediated upregulation of MMP9. *Oncogene* 2006, 25:5547-5560.

*therapy = Biomedecine & pharmacotherapie* 2006, 60:289-292.

gram viewer. *BMC Bioinformatics* 2014, 15:293.

*rent opinion in oncology* 2005, 17:33-38.

*cer Institute* 2007, 99:147-157.

76 A Concise Review of Molecular Pathology of Breast Cancer

*sciences : CMLS* 2010, 67:569-579.

*ics : EJHG* 2013, 21:134-142.

26:139-140.

*BCR* 2012, 14:R111.

*Oncogene* 2012, 31:2101-2114.

*BMC cancer* 2013, 13:118.

