**4.1.3 Network motif discovery**

230 Reverse Engineering – Recent Advances and Applications

profiles of the transcription factor to the gene modules.

3. Repeat Steps 1 and 2 for each transcription factor.

transcription factors and their regulated gene modules.

**4. Results** 

interactions.

**4.1 Yeast cell cycle dataset** 

gene was standardized between 0 and 1.

**4.1.2 Gene module Identification** 

**4.1.1 Data sources** 

and measured expression profiles for the target gene modules.

determine the optimized parameters of the NN that maps the measured expression

c. For each chromosome, calculate the RMSE between the predicted output of the NN

d. Apply GA operators (reproduction, cross-over, mutation) based on the RMSE calculated in Step 2.3 as a fitness value. This will generate new vectors/chromosomes altering the choice of output gene module combinations. e. Repeat steps 2.1 – 2.4 until stop criteria are met. The stopping criteria are numbers of generations or minimum RMSE, depending on which one is met first. f. Repeat Steps 2.1 – 2.5 for each NM the transcription factor is assigned to.

When the process is completed, NM regulatory modules are constructed between

**Gene expression dataset**: the yeast cell cycle dataset presented in (Spellman et al. 1998) consists of six time series (*cln3*, *clb2*, *alpha*, *cdc15*, *cdc28*, and *elu*) expression measurements of the mRNA levels of *S. cerevisiae* genes. 800 genes were identified as cell cycle regulated based on cluster analysis in (Spellman et al. 1998). We used the *cdc15* time course data of the 800 genes since it has the largest number of time points (24). Missing values in the data were imputed using KNN imputation (Troyanskaya et al. 2001). The expression pattern of each

**Molecular interaction data**: data of transcription factors and their target genes were extracted from the SCPD database (Zhu and Zhang 1999), from the YPD database (Costanzo et al. 2001), and from recent publications on genome-wide experiments that locate binding sites of given transcription factors (Ren et al. 2000; Iyer et al. 2001; Simon et al. 2001; Lee et al. 2002). For data extraction from the latter we used the same experimental thresholds as in the original papers. Protein-protein interaction data was extracted from the DIP database (Przulj et al. 2004), from the BIND database (Bader et al. 2003), and from the MIPS database (Mewes et al. 2002). In total the molecular interaction dataset consisted of 8184 protein pairs connected by protein-protein interactions and 5976 protein pairs connected by protein-DNA

We grouped 800 cell cycle-regulated genes into clusters by Fuzzy c-means method, where genes with similar expression profiles are represented by a gene module. The optimal cluster number was determined by the proposed method in (Zhang et al. 2009). The highest *z* score was obtained when the number of clusters was 34 by Fuzzy c-means clustering with optimal parameter m = 1.1573. We evaluated the resulting clusters through the gene set enrichement analysis method. All clusters except 10, 18, 21, 22, 25 and 26 are enriched in some gene ontology categories (data not shown). We used these clusters as candidate gene modules in our subsequent analyses to reduce the search space for gene regulatory module inference.

Among the 800 cell cycle related genes, 85 have been identified as DNA-binding transcription factors (Zhang et al. 2008). Four NMs were considered significant: autoregulatory motif, feed-forward loop, single input module, and multi-input module (Figure 2). These NMs were used to build NN models for corresponding transcription factors.

## **4.1.4 Reverse engineering gene regulatory networks**

Neural network models that mimic the topology of the NMs were constructed to identify the relationships between transcription factors and putative gene modules. The NN models

Fig. 8. Predicted gene regulatory modules from eight known cell cycle dependent transcription factors in yeast cell cycle dataset. The left panel presents the four gene regulatory modules, and the right panel depicts inferred gene regulatory modules for eight known cell cycle dependent transcription factors.

Reverse Engineering Gene Regulatory Networks by Integrating Multi-Source Biological Data 233

total, 846 genes were used for GRN inference. We chose the Thy-Thy3 time course gene expression pattern for 846 genes, since it has the largest number of time points (47). Missing values in the data were imputed using KNN imputation (Troyanskaya et al. 2001). The

**Molecular interaction data**: Protein-protein interactions were extracted from twelve publicly available large-scale protein interaction maps, seven of which are based on information from scientific literature literature-based, three on orthology information, and two on results of

**C4**

**BRCA1**

**BRCA1 E2F1**

**C34 C5**

**PCNA E2F2**

**C34**

**PCNA**

**C20**

Fig. 9. Predicted gene regulatory modules from known human cell cycle dependent genes. The left panel presents the four gene regulatory modules, and the right panel depicts inferred transcription factor-target gene relationships for eight cell cycle dependent

**E2F2**

**E2F2**

**C2**

**E2F1**

transcription factors.

**RBPSUH HMGB2**

**C5**

**PCNA**

**PCNA SP1**

**C12 C11**

**E2F1 BRCA1**

**C12**

**E2F1**

**C18**

**E2F1**

**C12**

**E2F2**

**E2F2**

**C30**

**C34**

**STAT1**

**STAT1**

**C36**

**BRCA1**

**E2F1 SP1**

**E2F1 E2F2**

expression pattern of each gene was standardized between 0 and 1.

were trained to select for all 85 transcription factors the downstream targets from the 34 gene modules. The parameter settings of GA-PSO algorithm are set as described in our previous work (Ressom et al. 2006). Our inference method mapped all 85 transcription factors to the target gene modules and inferred the most likely NMs.

We evaluated the predicted gene regulatory modules for the following eight well known cell cycle related transcription factors: *SWI4*, *SWI5*, *FKH1*, *NDD1*, *ACE2*, *KAR4*, *MET28* and *RAP1*. Since the "true" GRN is not available, the accuracy of putative regulatory relationship was determined by searching known gene connections in databases. Based on the results of the NM module prediction, we collected literature evidences from SGD (Cherry et al. 1997) and BIND (Bader et al. 2003) databases. We examined the inferred relationships for each of the eight transcription factors. An inferred relationship is assumed to be biologically significant if the transcription factors are correlated with the biological functions associated with the critical downstream cluster(s). Figure 8 lists the significant relationships; the eight transcription factors yielded an average precision of 82.9%. NMs for four of these transcription factors were identified in Chiang et al. (Chiang and Chao 2007) together with other four transcription factors. The eight transcription factors in Chiang et al. yielded an average precision of 80.1%.

The regulatory relationships inferred by the proposed method are expected to correspond more closely to biologically meaningful regulatory systems and naturally lead themselves to optimum experimental design methods. The results presented in Figure 8 are verified from previous biological evidences. For example, *FKH1* is a gene whose protein product is a fork head family protein with a role in the expression of G2/M phase genes. It negatively regulates transcriptional elongation, and regulates donor preference during switching. To further investigate the possibilities that the predicted downstream gene clusters are truly regulated by *FKH1*, we applied the motif discovery tool, WebMOTIFS (Romer et al. 2007) to find shared motifs in these gene clusters (data not shown). The results revealed that a motif called Fork\_head, GTAAACAA, is identified as the most significant motif among these gene clusters (Weigel and Jackle 1990). This finding strongly supports our NM inference results. Another example is the FFL involving *SWI5*, *GAT3* and Gene Cluster 10. *SWI5* has been identified as the upstream regulator of *GAT3* (Simon et al. 2001; Lee et al. 2002; Harbison et al. 2004). Genes in cluster 10 are mostly involved in DNA helicase activity and mitotic recombination, both of which are important biological steps in the regulation of cell cycle. Although no biological evidences have shown that *SWI5* and *GAT3* are involved in these processes, there are significant numbers of genes in cluster 10 which are characterized (according to yeastract.com) as genes regulated by both transcription factors (24 for *GAT3* and 23 for *SWI5* out of 44 genes in cluster 10, respectively).

#### **4.2 Human hela cell cycle dataset**

#### **4.2.1 Data sources**

**Gene expression dataset**: the human Hela cell cycle dataset (Whitfield et al. 2002) consists of five time courses (114 total arrays). RNA samples were collected for points (typically every 1-2 h) for 30 h (Thy- Thy1), 44 h (Thy-Thy2), 46 h (Thy-Thy3), 36 h (Thy-Noc), or 14 h (shake) after the synchronous arrest. The cell-cycle related gene set contains 1,134 clones corresponding to 874 UNIGENE clusters (UNIGENE build 143). Of these, 1,072 have corresponding Entrez gene IDs, among which 226 have more than one mapping to clones. In 232 Reverse Engineering – Recent Advances and Applications

were trained to select for all 85 transcription factors the downstream targets from the 34 gene modules. The parameter settings of GA-PSO algorithm are set as described in our previous work (Ressom et al. 2006). Our inference method mapped all 85 transcription

We evaluated the predicted gene regulatory modules for the following eight well known cell cycle related transcription factors: *SWI4*, *SWI5*, *FKH1*, *NDD1*, *ACE2*, *KAR4*, *MET28* and *RAP1*. Since the "true" GRN is not available, the accuracy of putative regulatory relationship was determined by searching known gene connections in databases. Based on the results of the NM module prediction, we collected literature evidences from SGD (Cherry et al. 1997) and BIND (Bader et al. 2003) databases. We examined the inferred relationships for each of the eight transcription factors. An inferred relationship is assumed to be biologically significant if the transcription factors are correlated with the biological functions associated with the critical downstream cluster(s). Figure 8 lists the significant relationships; the eight transcription factors yielded an average precision of 82.9%. NMs for four of these transcription factors were identified in Chiang et al. (Chiang and Chao 2007) together with other four transcription factors. The eight transcription factors in Chiang et al.

The regulatory relationships inferred by the proposed method are expected to correspond more closely to biologically meaningful regulatory systems and naturally lead themselves to optimum experimental design methods. The results presented in Figure 8 are verified from previous biological evidences. For example, *FKH1* is a gene whose protein product is a fork head family protein with a role in the expression of G2/M phase genes. It negatively regulates transcriptional elongation, and regulates donor preference during switching. To further investigate the possibilities that the predicted downstream gene clusters are truly regulated by *FKH1*, we applied the motif discovery tool, WebMOTIFS (Romer et al. 2007) to find shared motifs in these gene clusters (data not shown). The results revealed that a motif called Fork\_head, GTAAACAA, is identified as the most significant motif among these gene clusters (Weigel and Jackle 1990). This finding strongly supports our NM inference results. Another example is the FFL involving *SWI5*, *GAT3* and Gene Cluster 10. *SWI5* has been identified as the upstream regulator of *GAT3* (Simon et al. 2001; Lee et al. 2002; Harbison et al. 2004). Genes in cluster 10 are mostly involved in DNA helicase activity and mitotic recombination, both of which are important biological steps in the regulation of cell cycle. Although no biological evidences have shown that *SWI5* and *GAT3* are involved in these processes, there are significant numbers of genes in cluster 10 which are characterized (according to yeastract.com) as genes regulated by both transcription factors (24 for *GAT3*

**Gene expression dataset**: the human Hela cell cycle dataset (Whitfield et al. 2002) consists of five time courses (114 total arrays). RNA samples were collected for points (typically every 1-2 h) for 30 h (Thy- Thy1), 44 h (Thy-Thy2), 46 h (Thy-Thy3), 36 h (Thy-Noc), or 14 h (shake) after the synchronous arrest. The cell-cycle related gene set contains 1,134 clones corresponding to 874 UNIGENE clusters (UNIGENE build 143). Of these, 1,072 have corresponding Entrez gene IDs, among which 226 have more than one mapping to clones. In

factors to the target gene modules and inferred the most likely NMs.

yielded an average precision of 80.1%.

and 23 for *SWI5* out of 44 genes in cluster 10, respectively).

**4.2 Human hela cell cycle dataset** 

**4.2.1 Data sources** 

total, 846 genes were used for GRN inference. We chose the Thy-Thy3 time course gene expression pattern for 846 genes, since it has the largest number of time points (47). Missing values in the data were imputed using KNN imputation (Troyanskaya et al. 2001). The expression pattern of each gene was standardized between 0 and 1.

**Molecular interaction data**: Protein-protein interactions were extracted from twelve publicly available large-scale protein interaction maps, seven of which are based on information from scientific literature literature-based, three on orthology information, and two on results of

Fig. 9. Predicted gene regulatory modules from known human cell cycle dependent genes. The left panel presents the four gene regulatory modules, and the right panel depicts inferred transcription factor-target gene relationships for eight cell cycle dependent transcription factors.

Reverse Engineering Gene Regulatory Networks by Integrating Multi-Source Biological Data 235

genes that remained after filtering other gene function categories are those that are assigned to the following putative functions: transcription factor activity (GO: 0003700), regulation of transcription (GO: 0061019), and transcription factor complex (GO: 0005667). Since gene ontology information alone may not be sufficient to identify the genes with bona fide roles as transcription factors, we further filtered our list of candidate transcription factors by adding another layer of confirmatory information based on the results of PubMed database searches. This additional annotation allowed us to validate the gene ontology classification of our candidate genes. Among the 846 cell cycle related genes, 46 were annotated with functions related to transcriptional regulation based on both gene ontology and PubMed databases. These genes were considered as putative

In the microarray gene expression data, genes are often represented by multiple oligonucleotide probes. Genes represented by probe sets with larger variance were further considered in this study. We decomposed the collected human molecular interaction network into several NMs, with each NM potentially associated with a given transcription factor(s). A total of four NMs were found to be significant in the combined molecular interaction network (Figure 3), thus each transcription factor was assigned to at least one of

The relationships between transcription factors and gene modules were determined based on NN models. For each of the four NMs (Figure 3), a suitable NN was built as we previously described (Zhang et al. 2008). The NN models were trained using the GA-PSO algorithm to find the downstream gene clusters for all 46 putative transcription factors. Associations between each transcription factor and 39 gene modules was determined by training the NN model that mimics the specific NM for a given transcription factor. Due to a reduction in the computational complexity (mapping between 46 transcription factors and 39 gene clusters instead of 846 genes), the numbers of GA and PSO generations needed to reach the pre-specified minimum RMSE was significantly reduced. The proposed inference method successfully assigned all 46 putative transcription factors to their target gene modules and inferred the most likely gene regulatory modules (see Figure 9 for

The validity and accuracy of the network depicted by the gene regulatory modules are assessed by comparison with a network model constructed based on actual biological data. In the absence of such information, we performed an initial validation of the network by searching for known gene connections in databases. Based on the prediction results, we collected literature evidence from the NCBI and TRANSFAC databases. We reviewed each predicted NM and examined the relationships between the transcription factor and its target gene module(s). Subsequent analysis was performed under the basic assumption that the inferred NM is more likely to be biologically meaningful if the transcription factors therein

Significant NMs resulting from the survey of available literature cell cycle dependent genes such as *E2F1*, *E2F2*, *SP1*, *BRCA1*, *STAT1*, *PCNA*, *RBPSUH*, and *HMGB2* are listed in Figure 9. Based on the combined information, the biological implication of the network is further explained. For instance, *E2F* is a transcription factor that plays a crucial role in cell-cycle

are correlated with the enriched biological functions in the downstream clusters.

transcription factors (Zhang et al. 2010).

representative gene regulatory modules).

**4.2.4 Reverse engineering gene regulatory networks** 

these NMs.

previous yeast two-hybrid (Y2H) analyses (Zhang et al. 2010). The analysis was restricted to binary interactions in order to make consistent between Y2H-based interactions and the remaining maps. To merge twelve interaction maps into one combination map, all proteins were mapped to their corresponding Entrez gene IDs. The protein-DNA interaction data was extracted from the TRANSFAC database (Wingender et al. 2001). In total the molecular interaction data consisted of 20,473 protein pairs connected by protein-protein interactions and 2,546 protein pairs connected by protein-DNA interactions.

## **4.2.2 Gene module identification**

A total of 846 genes associated with the control of cell cycle have been identified previously in Hela cells. We further partitioned these genes into more specific functional groups by Fuzzy c-means. The optimal value of m for the dataset used in this study was 1.1548. The highest *z* score was obtained with 39 clusters, indicating an optimal condition to reduce the search space for GRN inference. To evaluate the optimal clusters selected based on gene ontology, gene set enrichment analysis was applied using the optimal value. The total set of genes involved in cell cycle regulation was further subdivided into 39 clusters. Of these clusters, 31 were clearly associated with gene ontology categories that imply a more specific function that unifies the members of one but not other clusters, thereby establishing more direct relationships among certain smaller sub-groups of genes. For example, clusters 8 and 29 are both associated with pre-mitotic, mitotic and post-mitotic events (M-phase). However, members of cluster 8 are distinguished from the members of cluster 29 by virtue of their specific roles in chromosome doubling (DNA replication) and cytokinesis. Conversely, members of cluster 29 are distinguished from the members of cluster 8 by virtue of their specific roles in spindle fiber assembly and disassembly.

Biological significance of these highly specific functional relationships, established by our clustering scheme, can further be extended in terms of relationships within the regulatory context. For instance, members of both gene modules 8 and 29 have been identified previously as direct downstream targets of *E2F* factors (Ren et al. 2002). Similar relationships are established with other clusters such as gene module 32, which is comprised of genes with biochemical roles of a DNA ligase. Thus, the genes in gene module 32 are involved in processes associated with gap repair or Okazaki fragment processing during DNA replication and chromosome doubling. Previous studies have established that genes associated with this function are under the regulatory control of *E2F1* and *PCNA* (Shibutani et al. 2008).

Based on all these relationships, we demonstrated that one specific strength of the proposed method is its ability to distinguish genes that are related by function in a broad sense and subcategorize them into highly specific (narrow) functional categories, resulting in the prediction of regulatory relationships that are consistent with biologically validated relationships.

#### **4.2.3 Network motif discovery**

All genes with either direct or indirect roles in the regulation of transcription were first identified from the total set of 846 cell cycle associated genes according to gene ontology categories that denote possible roles in transcription (Ashburner et al. 2000). Candidate genes that remained after filtering other gene function categories are those that are assigned to the following putative functions: transcription factor activity (GO: 0003700), regulation of transcription (GO: 0061019), and transcription factor complex (GO: 0005667). Since gene ontology information alone may not be sufficient to identify the genes with bona fide roles as transcription factors, we further filtered our list of candidate transcription factors by adding another layer of confirmatory information based on the results of PubMed database searches. This additional annotation allowed us to validate the gene ontology classification of our candidate genes. Among the 846 cell cycle related genes, 46 were annotated with functions related to transcriptional regulation based on both gene ontology and PubMed databases. These genes were considered as putative transcription factors (Zhang et al. 2010).

In the microarray gene expression data, genes are often represented by multiple oligonucleotide probes. Genes represented by probe sets with larger variance were further considered in this study. We decomposed the collected human molecular interaction network into several NMs, with each NM potentially associated with a given transcription factor(s). A total of four NMs were found to be significant in the combined molecular interaction network (Figure 3), thus each transcription factor was assigned to at least one of these NMs.

#### **4.2.4 Reverse engineering gene regulatory networks**

234 Reverse Engineering – Recent Advances and Applications

previous yeast two-hybrid (Y2H) analyses (Zhang et al. 2010). The analysis was restricted to binary interactions in order to make consistent between Y2H-based interactions and the remaining maps. To merge twelve interaction maps into one combination map, all proteins were mapped to their corresponding Entrez gene IDs. The protein-DNA interaction data was extracted from the TRANSFAC database (Wingender et al. 2001). In total the molecular interaction data consisted of 20,473 protein pairs connected by protein-protein interactions

A total of 846 genes associated with the control of cell cycle have been identified previously in Hela cells. We further partitioned these genes into more specific functional groups by Fuzzy c-means. The optimal value of m for the dataset used in this study was 1.1548. The highest *z* score was obtained with 39 clusters, indicating an optimal condition to reduce the search space for GRN inference. To evaluate the optimal clusters selected based on gene ontology, gene set enrichment analysis was applied using the optimal value. The total set of genes involved in cell cycle regulation was further subdivided into 39 clusters. Of these clusters, 31 were clearly associated with gene ontology categories that imply a more specific function that unifies the members of one but not other clusters, thereby establishing more direct relationships among certain smaller sub-groups of genes. For example, clusters 8 and 29 are both associated with pre-mitotic, mitotic and post-mitotic events (M-phase). However, members of cluster 8 are distinguished from the members of cluster 29 by virtue of their specific roles in chromosome doubling (DNA replication) and cytokinesis. Conversely, members of cluster 29 are distinguished from the members of cluster 8 by virtue

Biological significance of these highly specific functional relationships, established by our clustering scheme, can further be extended in terms of relationships within the regulatory context. For instance, members of both gene modules 8 and 29 have been identified previously as direct downstream targets of *E2F* factors (Ren et al. 2002). Similar relationships are established with other clusters such as gene module 32, which is comprised of genes with biochemical roles of a DNA ligase. Thus, the genes in gene module 32 are involved in processes associated with gap repair or Okazaki fragment processing during DNA replication and chromosome doubling. Previous studies have established that genes associated with this function are under the regulatory control of *E2F1* and *PCNA* (Shibutani

Based on all these relationships, we demonstrated that one specific strength of the proposed method is its ability to distinguish genes that are related by function in a broad sense and subcategorize them into highly specific (narrow) functional categories, resulting in the prediction

All genes with either direct or indirect roles in the regulation of transcription were first identified from the total set of 846 cell cycle associated genes according to gene ontology categories that denote possible roles in transcription (Ashburner et al. 2000). Candidate

of regulatory relationships that are consistent with biologically validated relationships.

and 2,546 protein pairs connected by protein-DNA interactions.

of their specific roles in spindle fiber assembly and disassembly.

**4.2.2 Gene module identification** 

et al. 2008).

**4.2.3 Network motif discovery** 

The relationships between transcription factors and gene modules were determined based on NN models. For each of the four NMs (Figure 3), a suitable NN was built as we previously described (Zhang et al. 2008). The NN models were trained using the GA-PSO algorithm to find the downstream gene clusters for all 46 putative transcription factors. Associations between each transcription factor and 39 gene modules was determined by training the NN model that mimics the specific NM for a given transcription factor. Due to a reduction in the computational complexity (mapping between 46 transcription factors and 39 gene clusters instead of 846 genes), the numbers of GA and PSO generations needed to reach the pre-specified minimum RMSE was significantly reduced. The proposed inference method successfully assigned all 46 putative transcription factors to their target gene modules and inferred the most likely gene regulatory modules (see Figure 9 for representative gene regulatory modules).

The validity and accuracy of the network depicted by the gene regulatory modules are assessed by comparison with a network model constructed based on actual biological data. In the absence of such information, we performed an initial validation of the network by searching for known gene connections in databases. Based on the prediction results, we collected literature evidence from the NCBI and TRANSFAC databases. We reviewed each predicted NM and examined the relationships between the transcription factor and its target gene module(s). Subsequent analysis was performed under the basic assumption that the inferred NM is more likely to be biologically meaningful if the transcription factors therein are correlated with the enriched biological functions in the downstream clusters.

Significant NMs resulting from the survey of available literature cell cycle dependent genes such as *E2F1*, *E2F2*, *SP1*, *BRCA1*, *STAT1*, *PCNA*, *RBPSUH*, and *HMGB2* are listed in Figure 9. Based on the combined information, the biological implication of the network is further explained. For instance, *E2F* is a transcription factor that plays a crucial role in cell-cycle

Reverse Engineering Gene Regulatory Networks by Integrating Multi-Source Biological Data 237

too small for WebMOTIFS analysis. All these gene modules are predicted to the downstream targets of E2F. For instance, 43 out of 52 genes in gene module 2 have putative *E2F* binding sites in their upstream regions. The detailed information of binding site enrichment analysis results is shown in Table 2. For those gene regulatory networks where two transcription factors are involved in, the downstream gene modules are found to be enriched in both the binding site sequence motifs. For instance, gene module 32 is enriched in both E2F\_TDP and MH1 motifs, corresponding to the two transcription factors in the gene regulatory module: *E2F1* and *SP1*. These binding site enrichment analysis results

> **Binding domain (Pfam ID)**

E2F\_TDP (PF02319)

E2F\_TDP (PF02319)

zf-C4

E2F\_TDP (PF02319)

HMG\_box

MH1

E2F\_TDP (PF02319)

E2F\_TDP (PF02319)

zf-C4

E2F\_TDP (PF02319)

Table 2. Binding site enrichment analysis for gene modules identified in human Hela cell cycle dataset. aSequence logos represent the motif significantly overrepresented in individual gene cluster associated with their predicted upstream transcription factors, according to the WebMOTIFS discovery algorithm. Individual base letter height indicates level of conservation within each binding site position; bConserved binding motifs are the

conserved binding sequences used in the WebMOTIFS discovery algorithm.

**Corresponding transcription factor** 

E2F1

E2F1

(PF00105) BRCA1 TGACCTTTG

(PF00505) HMGB2 AACAAwRr

(PF03165) SP1 TGGc.....gCCA

E2F1

E2F1

(PF00105) BRCA1 TGACCTTTG

E2F1

E2F1

**Conserved binding motifb** 

ACCyy

E2F2 GCGssAAa

E2F2 GCGssAAa

E2F2 GCGssAAa

E2F2 GCGssAAa

E2F2 GCGssAAa

E2F2 GCGssAAa

ACCyy

strongly support our inference results.

**# Sequence logoa** 

**Cluster** 

Cluster 2

Cluster 8

Cluster 29

Cluster 31

Cluster 32

Cluster 34

Cluster 38

progression in mammalian cells (Takahashi et al. 2000). *E2F1*, which contains two overlapping *E2F*-binding sites in its promoter region, is activated at the G1/S transition in an *E2F*-dependent manner. *E2F2* interacts with certain elements in the *E2F1* promoter and both genes are involved in DNA replication and repair (Ishida et al. 2001), cytokinesis, and tumor development (Zhu et al. 2001). According to the gene set enrichment analysis results, gene module 8 is enriched with genes involved in mitosis and cytokinesis, and gene module 34 is enriched with genes involved in several functional categories associated with tumor development. As shown in Figure 9, both gene module 8 and 34 are predicted to be regulated by *E2F1* and *E2F2*, and these results are in agreement with previous reports based on biological data.

Our analysis predicts that *E2F1* and *PCNA* are components of the same network. Both of these genes are involved in the regulation of gene modules 32 and 34. The best understood molecular function of the *PCNA* protein is its role in the regulation of eukaryotic DNA polymerase delta processivity, which ensures the fidelity of DNA synthesis and repair (Essers et al. 2005). However, recent studies have provided evidence that the *PCNA* protein also functions as a direct repressor of the transcriptional coactivator p300 (Hong and Chakravarti 2003). Another study shows that *PCNA* represses the transcriptional activity of retinoic acid receptors (RARs) (Martin et al. 2005). Thus, the involvement of these genes in the same network, as predicted by our network inference algorithm, is strongly supported by knowledge of regulatory relationships already established in experimental data. The results of our prediction are in agreement with these reports since both gene modules 8 and 32 are enriched with genes involved in DNA synthesis and regulatory processes.

We proposed three approaches to investigate further whether the genes predicted to be regulated by *E2F* genes in gene modules 8, 32 and 34 are validated in classical non-genome wide methods. First, we investigated how many "known" *E2F1* and *E2F2* targets are predicted by our proposed method. According to Bracken et al. (Bracken et al. 2004), 130 genes were reviewed as *E2F* targets, 44 of which were originally identified by classical, nongenome-wide approaches. Since we restricted our analysis to the 846 cell cycle related genes, 45 genes matched the *E2F* target genes listed in Bracken et al., 21 of which were known from studies using classical molecular biology analyses. The gene targets predicted by our method match 15 of 45 genes, all 15 of which are among those found originally using standard molecular biology experiments. One possible reason is that genome-wide approaches are usually highly noisy and inconsistent across different studies.

Second, we wanted to see whether our predicted gene target clusters are enriched in the corresponding binding sites for the transcription factors in their upstream region. For both *E2F1* and *E2F2*, 7 out of 17 genes in gene module 8 contain binding sites in their upstream regions as confirmed by data in the SABiosciences database1.

Finally, we determined how many genes in the gene clusters have *E2F* binding sites. We applied WebMOTIFS to find shared motifs in the gene modules predicted to the *E2F* targets using binding site enrichment analysis. The results revealed that a motif called E2F\_TDP, GCGSSAAA, is identified as the most significant motif amon1g gene modules 2, 8, 29, 31, 32 and 34. Unfortunately, for gene modules 30 and 36 the number of genes in these clusters is

<sup>1</sup> http://www.sabiosciences.com/

236 Reverse Engineering – Recent Advances and Applications

progression in mammalian cells (Takahashi et al. 2000). *E2F1*, which contains two overlapping *E2F*-binding sites in its promoter region, is activated at the G1/S transition in an *E2F*-dependent manner. *E2F2* interacts with certain elements in the *E2F1* promoter and both genes are involved in DNA replication and repair (Ishida et al. 2001), cytokinesis, and tumor development (Zhu et al. 2001). According to the gene set enrichment analysis results, gene module 8 is enriched with genes involved in mitosis and cytokinesis, and gene module 34 is enriched with genes involved in several functional categories associated with tumor development. As shown in Figure 9, both gene module 8 and 34 are predicted to be regulated by *E2F1* and *E2F2*, and these results are in agreement with previous reports based

Our analysis predicts that *E2F1* and *PCNA* are components of the same network. Both of these genes are involved in the regulation of gene modules 32 and 34. The best understood molecular function of the *PCNA* protein is its role in the regulation of eukaryotic DNA polymerase delta processivity, which ensures the fidelity of DNA synthesis and repair (Essers et al. 2005). However, recent studies have provided evidence that the *PCNA* protein also functions as a direct repressor of the transcriptional coactivator p300 (Hong and Chakravarti 2003). Another study shows that *PCNA* represses the transcriptional activity of retinoic acid receptors (RARs) (Martin et al. 2005). Thus, the involvement of these genes in the same network, as predicted by our network inference algorithm, is strongly supported by knowledge of regulatory relationships already established in experimental data. The results of our prediction are in agreement with these reports since both gene modules 8 and

32 are enriched with genes involved in DNA synthesis and regulatory processes.

approaches are usually highly noisy and inconsistent across different studies.

regions as confirmed by data in the SABiosciences database1.

We proposed three approaches to investigate further whether the genes predicted to be regulated by *E2F* genes in gene modules 8, 32 and 34 are validated in classical non-genome wide methods. First, we investigated how many "known" *E2F1* and *E2F2* targets are predicted by our proposed method. According to Bracken et al. (Bracken et al. 2004), 130 genes were reviewed as *E2F* targets, 44 of which were originally identified by classical, nongenome-wide approaches. Since we restricted our analysis to the 846 cell cycle related genes, 45 genes matched the *E2F* target genes listed in Bracken et al., 21 of which were known from studies using classical molecular biology analyses. The gene targets predicted by our method match 15 of 45 genes, all 15 of which are among those found originally using standard molecular biology experiments. One possible reason is that genome-wide

Second, we wanted to see whether our predicted gene target clusters are enriched in the corresponding binding sites for the transcription factors in their upstream region. For both *E2F1* and *E2F2*, 7 out of 17 genes in gene module 8 contain binding sites in their upstream

Finally, we determined how many genes in the gene clusters have *E2F* binding sites. We applied WebMOTIFS to find shared motifs in the gene modules predicted to the *E2F* targets using binding site enrichment analysis. The results revealed that a motif called E2F\_TDP, GCGSSAAA, is identified as the most significant motif amon1g gene modules 2, 8, 29, 31, 32 and 34. Unfortunately, for gene modules 30 and 36 the number of genes in these clusters is

on biological data.

1 http://www.sabiosciences.com/

too small for WebMOTIFS analysis. All these gene modules are predicted to the downstream targets of E2F. For instance, 43 out of 52 genes in gene module 2 have putative *E2F* binding sites in their upstream regions. The detailed information of binding site enrichment analysis results is shown in Table 2. For those gene regulatory networks where two transcription factors are involved in, the downstream gene modules are found to be enriched in both the binding site sequence motifs. For instance, gene module 32 is enriched in both E2F\_TDP and MH1 motifs, corresponding to the two transcription factors in the gene regulatory module: *E2F1* and *SP1*. These binding site enrichment analysis results strongly support our inference results.


Table 2. Binding site enrichment analysis for gene modules identified in human Hela cell cycle dataset. aSequence logos represent the motif significantly overrepresented in individual gene cluster associated with their predicted upstream transcription factors, according to the WebMOTIFS discovery algorithm. Individual base letter height indicates level of conservation within each binding site position; bConserved binding motifs are the conserved binding sequences used in the WebMOTIFS discovery algorithm.

Reverse Engineering Gene Regulatory Networks by Integrating Multi-Source Biological Data 239

predicting the components of the GRN in which their candidate gene is involved, followed

Alon, U. (2007). "Network motifs: theory and experimental approaches." Nat Rev Genet 8(6):

Ashburner, M., C. A. Ball, et al. (2000). "Gene ontology: tool for the unification of biology.

Bader, G. D., D. Betel, et al. (2003). "BIND: the Biomolecular Interaction Network Database."

Berriz, G. F., O. D. King, et al. (2003). "Characterizing gene sets with FuncAssociate."

Birge, B. (2003). PSOt - a particle swarm optimization toolbox for use with Matlab. Swarm Intelligence Symposium, 2003. SIS '03. Proceedings of the 2003 IEEE. Blais, A. and B. D. Dynlacht (2005). "Constructing transcriptional regulatory networks."

Boyer, L. A., T. I. Lee, et al. (2005). "Core transcriptional regulatory circuitry in human

Bracken, A. P., M. Ciro, et al. (2004). "E2F target genes: unraveling the biology." Trends

Bubitzky, W., M. Granzow, et al. (2007). Fundamentals of Data Mining in Genomics and

Chen, T., H. L. He, et al. (1999). "Modeling gene expression with differential equations." Pac

Cherry, J. M., C. Ball, et al. (1997). "Genetic and physical maps of Saccharomyces cerevisiae."

Chiang, J. H. and S. Y. Chao (2007). "Modeling human cancer-related regulatory modules by

Clarke, R., H. W. Ressom, et al. (2008). "The properties of high-dimensional data spaces:

Costanzo, M. C., M. E. Crawford, et al. (2001). "YPD, PombePD and WormPD: model

D'Haeseleer, P., X. Wen, et al. (1999). "Linear modeling of mRNA expression levels during

De Hoon, M. J., S. Imoto, et al. (2002). "Statistical analysis of a small set of time-ordered gene expression data using linear splines." Bioinformatics 18(11): 1477-1485. Essers, J., A. F. Theil, et al. (2005). "Nuclear dynamics of PCNA in DNA replication and

Friedman, N., M. Linial, et al. (2000). "Using Bayesian networks to analyze expression data."

Harbison, C. T., D. B. Gordon, et al. (2004). "Transcriptional regulatory code of a eukaryotic

implications for exploring gene and protein expression data." Nat Rev Cancer 8(1):

organism volumes of the BioKnowledge library, an integrated resource for protein

GA-RNN hybrid algorithms." BMC Bioinformatics 8: 91.

CNS development and injury." Pac Symp Biocomput: 41-52.

information." Nucleic Acids Res 29(1): 75-79.

repair." Mol Cell Biol 25(21): 9350-9359.

genome." Nature 431(7004): 99-104.

J Comput Biol. 7: 601-620.

by designing a more streamlined experiment for biological validation.

The Gene Ontology Consortium." Nat Genet 25(1): 25-29.

Nucleic Acids Res 31(1): 248-250.

Bioinformatics 19(18): 2502-2504.

Genes Dev 19(13): 1499-1511.

Biochem Sci 29(8): 409-417.

Symp Biocomput: 29-40.

37-49.

Proteomics. New York, Springer.

Nature 387(6632 Suppl): 67-73.

embryonic stem cells." Cell 122(6): 947-956.

**6. References** 

450-461.
