**3.1 Assessment of individual sites**

70 DNA Repair

Fig. 1. Co-expression clustering analysis of 10 DNA repair genes finds intersecting nodes.

| Transcription Factor | Description from GeneCards |
|---|---|
| *Sp1* | Transcription factor that can activate or repress transcription in response to physiological and pathological stimuli. Regulates the expression of a large number of genes involved in a variety of processes such as cell growth, apoptosis, differentiation and immune responses. May have a role in modulating the cellular response to DNA damage. |
| *NFAT* | The nuclear factor of activated T-cells family of transcription factors. |
| *EGR-1* | The protein encoded by this gene belongs to the EGR family of *C2H2*-type zinc-finger proteins. It is a nuclear protein and functions as a transcriptional regulator. Studies suggest this is a cancer suppressor gene. |
| *PAX4* | This gene is a member of the paired box (*PAX*) family of transcription factors. These genes play critical roles during fetal development and cancer growth. |
| *ELK1* | *ELK1* is a member of *ETS* oncogene family. The protein encoded by this gene is a nuclear target for the ras-raf-MAPK signaling cascade. |

Table 2. Transcription factor binding sites in the promoters of the B/O cancer genes.

Each group of promoters was analyzed to discover putative transcription factor binding sites. The analysis was performed with the WordSeeker motif discovery software (Lichtenberg et al. 2010), which employs high-performance, supercomputer-based algorithms to perform motif enumeration and to construct Markov models. Our analysis revealed that the average G+C content of the bidirectional promoters was higher than that of the unidirectional promoters (59.87% versus 50.84%). This difference in base composition was controlled for by the Markov model, which accounts for the background frequency of each nucleotide in the collection of sequences. Unique sets of binding sites were identified for each group, some of which represent novel binding sites.
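The role of base composition can be illustrated with a minimal sketch. It assumes the simplest possible background, an order-0 model in which each nucleotide is drawn independently; the order of the Markov model actually used by WordSeeker is not stated here, and the function names and toy sequences below are hypothetical.

```python
from collections import Counter
from math import prod


def base_counts(seqs):
    """Count A, C, G and T over a collection of sequences (other symbols are ignored)."""
    return Counter(b for s in seqs for b in s.upper() if b in "ACGT")


def gc_content(seqs):
    """Fraction of G or C among all counted bases."""
    counts = base_counts(seqs)
    total = sum(counts.values())
    return (counts["G"] + counts["C"]) / total if total else 0.0


def background_frequencies(seqs):
    """Order-0 background: relative frequency of each nucleotide (an assumed model)."""
    counts = base_counts(seqs)
    total = sum(counts.values())
    return {b: counts[b] / total for b in "ACGT"}


def word_probability(word, freqs):
    """Probability of seeing `word` at a given position under the order-0 background."""
    return prod(freqs[b] for b in word.upper())


# Toy sequences for illustration only; these are not real promoters.
promoters = ["GCGCGCAT", "ATATGCGC"]
freqs = background_frequencies(promoters)
print(gc_content(promoters))
print(word_probability("GC", freqs))
```

Under such a background, a G+C-rich word is expected more often in a G+C-rich promoter set, so a raw count alone would overstate its significance; comparing observed counts with the background expectation removes that bias.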

A statistical analysis of the promoters of the DNA repair genes revealed a number of significant DNA binding site motifs. Some of the discovered motifs correspond to recognition sequences of known proteins. These are listed in Table 3, along with their *p*-values and the transcription factors known to bind to the motifs, as determined from the TRANSFAC database (Wingender et al. 2000) and the JASPAR database (Bryne et al. 2008). In addition, novel motifs, representing uncharacterized transcription factor binding sites, were discovered in the bidirectional and unidirectional promoters from DNA repair pathway genes (see Table 4 for the motifs and their *p*-values).


Table 3. Enriched motifs matching characterized transcription factor binding sites discovered in the bidirectional promoters (columns 1 and 2) and in the unidirectional promoters (columns 3 and 4).

Shared Regulatory Motifs in Promoters of Human DNA Repair Genes 73

| Motif (bidirectional promoters) | P-Value | Motif (unidirectional promoters) | P-Value |
|---|---|---|---|
| ACTCCAGC | 0.06212 | AGCCGGCT | 0.05007 |
| AGAAAAGA | 0.02756 | ATTCCCAG | 0.05599 |
| AGGGAGGG | 0.07159 | CCTCTTTA | 0.03381 |
| CAGCAGCC | 0.10540 | CGCCCCTT | 0.11386 |
| CGACTCCG | 0.02756 | CGGCGGCG | 0.04742 |
| CGCGGCCG | 0.03377 | CTCCCGCT | 0.05998 |
| CGGGCCGA | 0.06548 | CTTCTTTC | 0.03773 |
| GCCCCTCC | 0.07021 | GCGCCGCG | 0.09760 |
| GCCGGCGA | 0.03662 | GGGCGCCC | 0.08390 |
| GGCAGGGA | 0.10334 | GTGCGTTT | 0.06286 |
| GGGCCAGG | 0.09632 | TCCGCCGG | 0.05794 |
| GGGGCCGG | 0.05265 | TCTCCCCT | 0.07881 |
| TCTGGGAT | 0.01466 | TCTTCTTC | 0.04649 |
| TGAAGCCA | 0.05699 | TGCGCCGA | 0.04148 |
| TGCCCGCG | 0.08277 | TTGGTCTC | 0.08543 |
| TGCGGAAT | 0.02132 | TTTCTCCA | 0.06840 |
| TGCTGAGA | 0.03377 | TTTTTTGA | 0.04742 |

Table 4. Uncharacterized motifs discovered in the promoters of DNA repair genes. Words are ordered alphabetically.

#### **3.2 Assessment of paired binding sites**

To identify putative regulatory modules (co-acting regulatory elements), we identified statistically overrepresented pairs of DNA motifs in each set of promoters. Motif pairs are shown in Table 5. The motif pair scores are computed as the product of (1) the number of sequences, *S*, in which the pair occurs and (2) the natural log of the ratio of *S* to its expected value, *E<sub>S</sub>*; i.e., the score is *S*·ln(*S*/*E<sub>S</sub>*). The genomic signatures (significant DNA motifs and motif pairs) of the bidirectional promoters were virtually non-overlapping with the signatures of unidirectional promoters. This provides strong support for the hypothesis that the regulatory mechanisms of bidirectional promoters are unique. Additionally, this work contributes a significant enhancement to the available knowledge about transcriptional regulation of genes involved in DNA repair pathways, and suggests the presence of a regulatory network.

| Co-Occurring Motif Pair (Bidirectional Promoters) | Score | Co-Occurring Motif Pair (Unidirectional Promoters) | Score |
|---|---|---|---|
| TCTGAGGA TCGCGCCA | 12.1158 | GTTCATTC TCCGCCGG | 11.2184 |
| ACTCCAGC TCGCGCCA | 11.8387 | CTGTGTGC TGCGCCGA | 11.1966 |
| GCCCAGCC TCCGCCGC | 11.1827 | TGACGCGA CTCCCGCT | 10.9997 |
| GCCCAGCC CGGAGCGC | 10.8711 | AGCCGGCT GGGGAGTA | 10.0590 |
| TGCCCGCG TCCCGGGA | 10.7404 | ATTGCAGG ATTCTCTC | 9.5459 |
| GGCAGGGA GGGCCAGG | 9.8609 | GGGGAGTA AGGAAACA | 9.3177 |
| TCCCGGGA TCGCGCCA | 9.8112 | CTGGGAGC GTTCATTC | 9.0337 |
| AGCCTGTC TCCCGGGA | 9.7646 | CCTTCCGA CTGGGAGC | 8.8439 |
| GGAGGCTG TCGCGCCA | 9.7250 | TGGGCGGA ACCCGCCT | 8.7895 |
| TCCGCCGC GCCCCTCC | 9.6830 | TTTCTCCA CGGAAACC | 8.6446 |
| AGAAAAGA TCGCGCCA | 9.4042 | CCCCCGCG ACCCGCCT | 8.5339 |
| GCCCAGCC GCCCCTCC | 9.2808 | TCCGCCGG GGGGCTGC | 7.7522 |
| TGCCAAAA GCCGGCGA | 9.2604 | AGCTGGCT CCAGGCTG | 7.7192 |
| CAGCAGCC TGCGGAAT | 9.1297 | TTGGTCTC AGGAAACA | 7.6068 |
| AGGGCCGT TCCCGGCT | 9.1249 | CTGGGAGC TCCGCCGG | 7.3021 |

Table 5. Putative transcription factor binding modules discovered in promoters of DNA repair genes.

#### **4. Unbiased assessment of transcription factor binding sites of checkpoint factor genes from DNA repair pathways**

We have performed a focused, detailed characterization of the checkpoint factors in DNA repair pathways (Elnitski et al. 2010). The checkpoint factors (Kanehisa et al. 2008, Wood 2005, Helleday et al. 2008) are activated upon detection of DNA damage and halt the cell cycle so that subsequent DNA repair pathways can mend the damage. In addition to examining the most recognized promoter in each gene (the 5' end of the full-length transcription unit), we assessed alternative start sites for each checkpoint factor gene as independent regulatory units, to discover putative transcription factor binding sites. In this section we report the DNA motifs that were discovered, along with several clusters of related genes and promoters. We hypothesize that these similar components implicate regulatory networks responsible for co-regulation of the checkpoint factor genes.

We studied fourteen checkpoint factor genes, which are listed in Table 6. The number of alternative promoters, shown in parentheses, varied from gene to gene. Because most of the genes have alternative promoters, we analyzed a total of thirty promoters. The complete set of alternative promoters is shown in Table 7. Alternative promoters were identified using annotations of genes in the UCSC Human Genome Browser. Transcription start sites of transcript isoforms served as the coordinates around which 900 bp upstream and 100 bp downstream were defined as the putative promoter region. Alternative promoters with significant overlap were truncated or removed from the analysis. DNA sequences were obtained for the forward and reverse strands of the genome to ensure coverage of words that might have biased nucleotide content and be subject to omission during the Markov model analysis stage.

| Gene | Description from GeneCards (Safran 2010) |
|---|---|
| *ATM* (5) | The protein encoded by this gene (ataxia telangiectasia mutated) belongs to the PI3/PI4-kinase family. This protein functions as a regulator of a wide variety of downstream proteins, including *p53*, *BRCA1*, *CHK2*, *RAD17*, *RAD9*, and *NBS1*. This protein and the closely related kinase ATR are thought to be master controllers of cell cycle checkpoint signaling pathways, required for cell response to DNA damage and for genome stability. |
| *ATR* (2) | The protein encoded by this gene (ataxia telangiectasia and Rad3 related) belongs to the PI3/PI4-kinase family, and is most closely related to ATM. Both proteins share similarity with *Schizosaccharomyces pombe* rad3, a cell cycle checkpoint gene required for cell cycle arrest and DNA damage repair in response to DNA damage. This kinase has been shown to phosphorylate *CHK1*, *RAD17*, *RAD9*, and *BRCA1*. Transcript variants utilizing alternative polyA sites exist. |
| *ATRIP* (1) | The product of this gene (ATR interacting protein) is an essential component of the DNA damage checkpoint, and binds to single-stranded DNA coated with replication protein A that accumulates at sites of DNA damage. The encoded protein interacts with the ataxia telangiectasia and Rad3 related protein, a checkpoint kinase, resulting in accumulation of the kinase at intranuclear foci induced by DNA damage. Multiple transcript variants encoding different isoforms have been found for this gene. |
| *CHEK1* (3) | Required for checkpoint mediated cell cycle arrest in response to DNA damage or the presence of unreplicated DNA. May also negatively regulate cell cycle progression during unperturbed cell cycles. Binds to and phosphorylates *CDC25A*, *CDC25B* and *CDC25C*. Binds to and phosphorylates *RAD51*. Binds to and phosphorylates *TLK1*. May also phosphorylate multiple sites within the C-terminus of *TP53*, which promotes activation of *TP53* by acetylation and enhances suppression of cellular proliferation. |
| *CHEK2* (2) | The protein encoded by this gene is a cell cycle checkpoint regulator and putative tumor suppressor. It contains a forkhead-associated protein interaction domain essential for activation in response to DNA damage and is rapidly phosphorylated in response to replication blocks and DNA damage. This protein interacts with and phosphorylates *BRCA1*, allowing *BRCA1* to restore survival after DNA damage. Three transcript variants encoding different isoforms have been found for this gene. |
| *CLK2* (2) | This gene encodes a member of the *CLK* family of dual specificity protein kinases. *CLK* family members have been shown to interact with, and phosphorylate, serine- and arginine-rich (SR) proteins of the spliceosomal complex, which is a part of the regulatory mechanism that enables the SR proteins to control RNA splicing. |
| *HUS1* (1) | The protein encoded by this gene is a component of an evolutionarily conserved, genotoxin-activated checkpoint complex that is involved in the cell cycle arrest in response to DNA damage. This protein forms a heterotrimeric complex with checkpoint proteins *RAD9* and *RAD1*. DNA damage induced chromatin binding has been shown to depend on the activation of the checkpoint kinase ATM, and is thought to be an early checkpoint signaling event. |
| *MDC1* (2) | The protein encoded by this gene (mediator of DNA-damage checkpoint) is required to activate the intra-S phase and G2/M phase cell cycle checkpoints in response to DNA damage. This nuclear protein interacts with phosphorylated histone H2AX near sites of DNA double-strand breaks through its *BRCT* motifs, and facilitates recruitment of the ATM kinase and meiotic recombination 11 protein complex to DNA damage foci. |
| *NBS1* (1) | The encoded protein is a member of the *MRE11/RAD50* double-strand break repair complex which consists of 5 proteins. This gene product is thought to be involved in DNA double-strand break repair and DNA damage-induced checkpoint activation. |
| *P53/TP53* (3) | This gene encodes tumor protein *p53*, which responds to diverse cellular stresses to regulate target genes that induce cell cycle arrest, apoptosis, senescence, DNA repair, or changes in metabolism. |
| *PER1* (1) | This gene is a member of the Period family of genes and is expressed in a circadian pattern in the suprachiasmatic nucleus, the primary circadian pacemaker in the mammalian brain. Genes in this family encode components of the circadian rhythms of locomotor activity, metabolism, and behavior. The specific function of this gene is not yet known. Alternative splicing has been observed in this gene; however, these variants have not been fully described. |
| *RAD1* (2) | This gene encodes a component of a heterotrimeric cell cycle checkpoint complex, known as the 9-1-1 complex, that is activated to stop cell cycle progression in response to DNA damage or incomplete DNA replication. The 9-1-1 complex is recruited by *RAD17* to affected sites where it may attract specialized DNA polymerases and other DNA repair effectors. Alternatively spliced transcript variants of this gene have been described. |
| *RAD17* (3) | The protein encoded by this gene is highly similar to the gene product of *Schizosaccharomyces pombe* rad17, a cell cycle checkpoint gene required for cell cycle arrest and DNA damage repair in response to DNA damage. This protein recruits the *RAD1-RAD9-HUS1* checkpoint protein complex onto chromatin after DNA damage. The phosphorylation of this protein is required for the DNA-damage-induced cell cycle G2 arrest, and is thought to be a critical early event during checkpoint signaling in DNA-damaged cells. Eight alternatively spliced transcript variants of this gene, which encode four distinct proteins, have been reported. |
| *RAD9A* (2) | This gene product is highly similar to *Schizosaccharomyces pombe* rad9, a cell cycle checkpoint protein required for cell cycle arrest and DNA damage repair in response to DNA damage. This protein is found to possess 3' to 5' exonuclease activity, which may contribute to its role in sensing and repairing DNA damage. It forms a checkpoint protein complex with *RAD1* and *HUS1*. This complex is recruited by checkpoint protein *RAD17* to the sites of DNA damage, which is thought to be important for triggering the checkpoint-signaling cascade. Use of alternative polyA sites has been noted for this gene. |

Table 6. The checkpoint factor genes that were studied. The number of alternative promoters is shown in parentheses next to each gene name.
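The promoter definition used in this section (900 bp upstream and 100 bp downstream of each transcription start site) can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: coordinates are taken to be 1-based and inclusive, the start site itself is counted as the first downstream base, and the names `promoter_window` and `overlap_bp` are hypothetical. The text does not say how much overlap counts as significant, so no threshold is applied here.

```python
from typing import NamedTuple


class Window(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive (assumed convention)
    end: int    # 1-based, inclusive (assumed convention)


def promoter_window(chrom, tss, strand, upstream=900, downstream=100):
    """Return the putative promoter region around a transcription start site.

    Upstream and downstream are measured relative to the direction of
    transcription, so the window is mirrored for genes on the minus strand.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream - 1
    elif strand == "-":
        start, end = tss - downstream + 1, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return Window(chrom, max(1, start), end)


def overlap_bp(a, b):
    """Number of bases shared by two windows (0 if on different chromosomes)."""
    if a.chrom != b.chrom:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


# Hypothetical start sites, chosen only to show the arithmetic.
plus = promoter_window("chr1", 1000, "+")
minus = promoter_window("chr1", 1000, "-")
print(plus, minus, overlap_bp(plus, minus))
```

Treating each alternative start site as its own window is what allows alternative promoters of one gene to be analyzed as independent regulatory units; the overlap count gives a simple basis for deciding which windows to truncate or drop.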


| Checkpoint Factor | Promoter | Alternative promoters (hg18 coordinates) |
|---|---|---|
| *ATM* | *ATM*<sub>5</sub> | chr11:107662328-107663378\_+, chr11:107662328-107663378\_- |
| *ATM* | *ATM*<sub>2</sub> | chr11:107643346-107644396\_+, chr11:107643346-107644396\_- |
| *ATM* | *ATM*<sub>3</sub> | chr11:107597768-107598818\_+, chr11:107597768-107598818\_- |
| *ATM* | *ATM*<sub>4</sub> | chr11:107671910-107672960\_+, chr11:107671910-107672960\_- |
| *ATM* | *ATM*<sub>5</sub> | chr11:107679611-107680661\_+, chr11:107679611-107680661\_- |
| *ATR* | *ATR*<sub>1</sub> | chr3:143780308-143781358\_+, chr3:143780308-143781358\_- |
| *ATR* | *ATR*<sub>2</sub> | chr3:143671051-143672101\_+, chr3:143671051-143672101\_- |
| *ATRIP* | | chr3:48462221-48463271\_+, chr3:48462221-48463271\_- |
| *CHEK1* | *CHEK1*<sub>3</sub> | chr11:125000333-125001383\_+, chr11:125000333-125001383\_- |
| *CHEK1* | *CHEK1*<sub>2</sub> | chr11:125018185-125019235\_+, chr11:125018185-125019235\_- |
| *CHEK1* | *CHEK1*<sub>3</sub> | chr11:124999245-125000295\_+, chr11:124999245-125000295\_- |
| *CHEK2* | *CHEK2*<sub>2</sub> | chr22:27467772-27468822\_+, chr22:27467772-27468822\_- |
| *CHEK2* | *CHEK2*<sub>2</sub> | chr22:27460665-27461715\_+, chr22:27460665-27461715\_- |
| *CLK2* | *CLK2*<sub>2</sub> | chr1:153509855-153510905\_+, chr1:153509855-153510905\_- |
| *CLK2* | *CLK2*<sub>2</sub> | chr1:153514075-153515125\_+, chr1:153514075-153515125\_- |
| *HUS1* | | chr7:47985721-47986771\_+, chr7:47985721-47986771\_- |
| *MDC1* | *MDC1*<sub>2</sub> | chr6:30792781-30793831\_+, chr6:30792781-30793831\_- |
| *MDC1* | *MDC1*<sub>2</sub> | chr6:30789060-30790110\_+, chr6:30789060-30790110\_- |
| *NBS1* | | chr8:91066025-91067075\_+, chr8:91066025-91067075\_- |
| *P53 (TP53)* | *TP53*<sub>3</sub> | chr17:7519486-7520536\_+, chr17:7519486-7520536\_- |
| *P53 (TP53)* | *TP53*<sub>2</sub> | chr17:7531538-7532588\_+, chr17:7531538-7532588\_- |
| *P53 (TP53)* | *TP53*<sub>3</sub> | chr17:7520612-7521662\_+, chr17:7520612-7521662\_- |
| *PER1* | | chr17:7996377-7997427\_+, chr17:7996377-7997427\_- |
| *RAD1* | *RAD1*<sub>2</sub> | chr5:34954089-34955139\_+, chr5:34954089-34955139\_- |
| *RAD1* | *RAD1*<sub>2</sub> | chr5:34951438-34952488\_+, chr5:34951438-34952488\_- |
| *RAD17* | *RAD17*<sub>3</sub> | chr5:68699879-68700929\_+, chr5:68699879-68700929\_- |
| *RAD17* | *RAD17*<sub>2</sub> | chr5:68723716-68724766\_+, chr5:68723716-68724766\_- |
| *RAD17* | *RAD17*<sub>3</sub> | chr5:68701287-68702337\_+, chr5:68701287-68702337\_- |
| *RAD9A* | *RAD9A*<sub>2</sub> | chr11:66918716-66919766\_+, chr11:66918716-66919766\_- |
| *RAD9A* | *RAD9A*<sub>2</sub> | chr11:66914998-66916048\_+, chr11:66914998-66916048\_- |

Table 7. Alternative promoters, indicated by their genomic coordinates, of genes involved in cell-cycle checkpoint factor pathways.

Statistical analysis of thirty promoters found several interesting DNA words, which predict DNA elements that participate in the regulation of the DNA repair checkpoint factors. The most significant words discovered are listed in Table 8. Words that are shared among the gene sets identify regulatory relationships. Reverse complement words are reported separately, as internal verification on the process. Words without a reverse complement example indicate a particular bias in the nucleotide content.
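Two pieces of this word analysis lend themselves to a short sketch: the *S*·ln(*S*/*E<sub>S</sub>*) overrepresentation score and the reverse-complement check. The word list is transcribed from Table 8; the function names are hypothetical, and the expected value passed to the score in the last line is an arbitrary illustrative number, since the background expectations are not reported here.

```python
from math import log

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Reverse complement of a DNA word written with A, C, G and T."""
    return word.upper().translate(COMPLEMENT)[::-1]


def overrepresentation_score(s, expected_s):
    """S * ln(S / E_S): S sequences contain the word, E_S are expected to."""
    if s <= 0 or expected_s <= 0:
        raise ValueError("S and E_S must be positive")
    return s * log(s / expected_s)


def split_by_partner(words):
    """Separate words whose reverse complement is also listed from those without one."""
    present = set(words)
    paired = [w for w in words if reverse_complement(w) in present]
    unpaired = [w for w in words if reverse_complement(w) not in present]
    return paired, unpaired


# The fifteen words of Table 8, in table order.
TABLE8_WORDS = [
    "ACAGCCAT", "ATGGCTGT", "GCCTGGGA", "TCCCAGGC", "ACTCCCTA",
    "TAGGGAGT", "AGCGGCCA", "TGGCCGCT", "GAAATGAA", "TTCATTTC",
    "AATGCAGG", "CCTGCATT", "ATCCCTGA", "TCAGGGAT", "GTATTTTA",
]

paired, unpaired = split_by_partner(TABLE8_WORDS)
print(len(paired), unpaired)
print(overrepresentation_score(3, 0.5))  # E_S = 0.5 is a made-up value
```

Because both strands of every promoter are analyzed, a word and its reverse complement should normally be found together, which is why the pairs act as an internal check; a word that appears without its partner points to a strand-specific composition bias.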


| Word | Promoters | *S*·ln(*S*/*E<sub>S</sub>*) |
|---|---|---|
| ACAGCCAT | *ATM*<sub>2</sub>, *CHEK2*<sub>2</sub>, *CLK2*<sub>1</sub> | 5.41 |
| ATGGCTGT | *ATM*<sub>2</sub>, *CHEK2*<sub>2</sub>, *CLK2*<sub>1</sub> | 5.41 |
| GCCTGGGA | *ATR*<sub>1</sub>, *CHEK2*<sub>1</sub>, *CLK2*<sub>2</sub>, *MDC1*<sub>1</sub>, *MDC1*<sub>2</sub>, *RAD1*<sub>2</sub> | 5.40 |
| TCCCAGGC | *ATR*<sub>1</sub>, *CHEK2*<sub>1</sub>, *CLK2*<sub>2</sub>, *MDC1*<sub>1</sub>, *MDC1*<sub>2</sub>, *RAD1*<sub>2</sub> | 5.40 |
| ACTCCCTA | *ATM*<sub>3</sub>, *CHEK2*<sub>1</sub>, *RAD17*<sub>2</sub> | 5.29 |
| TAGGGAGT | *ATM*<sub>3</sub>, *CHEK2*<sub>1</sub>, *RAD17*<sub>2</sub> | 5.29 |
| AGCGGCCA | *ATR*<sub>1</sub>, *ATR*<sub>2</sub>, *CHEK1*<sub>1</sub> | 5.24 |
| TGGCCGCT | *ATR*<sub>1</sub>, *ATR*<sub>2</sub>, *CHEK1*<sub>1</sub> | 5.24 |
| GAAATGAA | *ATM*<sub>2</sub>, *ATM*<sub>3</sub>, *ATR*<sub>2</sub>, *CLK2*<sub>2</sub>, *HUS1*, *MDC1*<sub>1</sub> | 5.24 |
| TTCATTTC | *ATM*<sub>2</sub>, *ATM*<sub>3</sub>, *ATR*<sub>2</sub>, *CLK2*<sub>2</sub>, *HUS1*, *MDC1*<sub>1</sub> | 5.24 |
| AATGCAGG | *RAD1*<sub>1</sub>, *TP53*<sub>1</sub>, *TP53*<sub>2</sub>, *TP53*<sub>3</sub> | 4.97 |
| CCTGCATT | *RAD1*<sub>1</sub>, *TP53*<sub>1</sub>, *TP53*<sub>2</sub>, *TP53*<sub>3</sub> | 4.97 |
| ATCCCTGA | *ATRIP*, *CHEK1*<sub>3</sub>, *RAD1*<sub>2</sub>, *RAD17*<sub>1</sub> | 4.73 |
| TCAGGGAT | *ATRIP*, *CHEK1*<sub>3</sub>, *RAD1*<sub>2</sub>, *RAD17*<sub>1</sub> | 4.73 |
| GTATTTTA | *ATM*<sub>4</sub>, *CHEK1*<sub>2</sub>, *NBS1*, *RAD17*<sub>1</sub>, *ATR*<sub>2</sub>, *TP53*<sub>1</sub>, *HUS1*, *RAD17*<sub>1</sub> | 4.58 |

Table 8. Top 15 enumerated DNA words, based on the *S*·ln(*S*/*E<sub>S</sub>*) overrepresentation score, and the alternative promoters, identified by subscript.

**5. Visualization and interpretation of data**

Shared words among the checkpoint factor genes suggested the presence of regulatory networks. We assessed the relationships by generating network depictions in the form of interaction networks (Figure 2) and a circos diagram (Figure 3) constructed from the summary data in Table 9. To derive Figure 2, a metric multidimensional scaling (MDS) was conducted on the affiliation network defined in Table 9. The resulting graph was then spring-embedded, with node repulsion, to facilitate visualization (Borgatti, 2002). The interaction network depicts the distribution of the DNA words among the genes (note that each gene appears once, representing all alternative promoters as a single node). Genes are denoted by blue squares and words are represented with red circles. Bold lines indicate multiple occurrences of a word. Reverse complement words are shown independently.

The circos diagram represents the information in a closed circular space, wherein connections between words on one side of the diagram extend to genes on the other side. The putative nodes of the regulatory networks are defined by multiple edges, representing a characterized transcription factor or a novel DNA binding site, or a checkpoint factor gene.

Some of the discovered words correspond to known binding sites for transcription factors, reported in the JASPAR and TRANSFAC databases of transcription factors (see Table 10). The relationships between the top fifteen words and the transcription factors are depicted in the circos diagram in Figure 4. Note that multiple binding site motifs were discovered for many of the transcription factors, and that several of the sites match the binding patterns of more than one transcription factor.
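The word-to-gene affiliation network behind the interaction diagram can be sketched directly from the rows of Table 8. The example below is a minimal illustration, not the procedure used to draw Figure 2: it collapses alternative promoters into one node per gene and records how many promoters of that gene contain each word, which corresponds to the bold (multiple-occurrence) edges. Only two rows of Table 8 are used, the function names are hypothetical, and the MDS and spring-embedding layout steps are not reproduced.

```python
from collections import Counter, defaultdict

# Two rows of Table 8, written as word -> list of (gene, promoter index).
word_promoters = {
    "GCCTGGGA": [("ATR", 1), ("CHEK2", 1), ("CLK2", 2),
                 ("MDC1", 1), ("MDC1", 2), ("RAD1", 2)],
    "AATGCAGG": [("RAD1", 1), ("TP53", 1), ("TP53", 2), ("TP53", 3)],
}


def affiliation_edges(word_promoters):
    """Collapse promoters to genes; weight = promoters of that gene containing the word."""
    edges = {}
    for word, promoters in word_promoters.items():
        per_gene = Counter(gene for gene, _ in promoters)
        for gene, weight in per_gene.items():
            edges[(word, gene)] = weight
    return edges


def genes_sharing_words(edges):
    """Genes connected to more than one word, with the words they share."""
    words_by_gene = defaultdict(set)
    for word, gene in edges:
        words_by_gene[gene].add(word)
    return {g: sorted(ws) for g, ws in words_by_gene.items() if len(ws) > 1}


edges = affiliation_edges(word_promoters)
bold_edges = {edge: w for edge, w in edges.items() if w > 1}
print(bold_edges)
print(genes_sharing_words(edges))
```

Genes that end up connected to several words, and words connected to several genes, are the candidates for the shared regulatory nodes that the network and circos diagrams are meant to expose.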
