**4.1 Genome-wide identification of TF binding regions, ChIP-Seq**

ChIP is a technique for assaying protein-DNA interactions *in vivo* (Weinmann et al., 2002). It allows the identification of genomic regions bound directly by ERs or PRs, as well as regions bound indirectly via other TFs or co-regulators. During the procedure, proteins are cross-linked to DNA and the chromatin is then sonicated into small fragments of around 150-1000 bp, depending on the downstream application. After immunoprecipitation of the protein-DNA complexes, the cross-links are reversed and the DNA fragments are purified. The extracted DNA can be analyzed by PCR, ChIP-on-chip or direct sequencing. Regions significantly overrepresented in the immunoprecipitated DNA relative to control DNA are regarded as epigenetically modified or protein-bound, depending on the antibody used (Bock et al., 2008). Computational algorithms are used to infer this information from the array data or sequencing output. ChIP has two main drawbacks. The first and main problem is the specificity of the antibodies used. The second is aggregation of chromatin, which contaminates the purified specific chromatin fraction and raises the non-specific background of isolated DNA.

In ChIP-on-chip, ChIP-enriched DNA is hybridized to glass-slide microarrays (chips) to study how regulatory proteins interact with the genome of living cells (Lin Z et al., 2007; Liu et al., 2008). ChIP-on-chip has several modifications, such as ChIP-linked target site cloning (Lin Z et al., 2007) and ChIP coupled with a DNA selection and ligation (ChIP-DSL) strategy for identifying direct target genes, which permits analysis of fewer cells than the conventional ChIP-on-chip method requires (Kwon et al., 2007). ChIP-DSL differs from conventional ChIP-on-chip: besides being more specific and sensitive, it uses the immunoprecipitated DNA as a template for oligonucleotide ligation instead of amplifying it directly for hybridization, which makes it possible to bypass incomplete decrosslinking. There is also the paired-end ditag (PET) approach, which directly links the 5' terminal tags of genomic sequences with their corresponding 3' terminal tags to form PET ditags and concatenates them for efficient sequencing (Bock et al., 2008).

The Tissue Specific Role

of Estrogen and Progesterone in Human Endometrium and Mammary Gland 49

Fig. 6. **ChIP followed by high-throughput sequencing.** The ChIP process enriches the cross-linked proteins or modified nucleosomes using an antibody specific to the protein or the histone modification of interest. Purified DNA can be sequenced using different next-generation sequencing platforms. On the Illumina Solexa Genome Analyzer (bottom left), clusters of clonal sequences are generated by bridge PCR, and sequencing is performed by sequencing-by-synthesis. On the Roche 454 and Applied Biosystems (ABI) SOLiD platforms (bottom middle), clonal sequencing features are generated by emulsion PCR and amplicons are captured on the surface of micrometre-scale beads. Beads with amplicons are then recovered and immobilized on a planar substrate to be sequenced by pyrosequencing (for the 454 platform) or by DNA ligase-driven synthesis (for the SOLiD platform). On single-molecule sequencing platforms such as the HeliScope by Helicos (bottom right), fluorescent nucleotides incorporated into templates can be imaged at the level of single molecules, which makes clonal amplification unnecessary (adapted from Nature Reviews, Park 2009).

ChIP-Seq is emerging as the method of choice for genome-wide identification of TF binding sites. It involves immunoselecting an enriched population of transcription factor-bound chromatin fragments, which are purified and resolved by next-generation sequencing. Several DNA sequencing technologies are available today. The ABI SOLiD platform uses oligonucleotide ligation and detection (Dietz and Carroll, 2008), whereas the sequencing-by-synthesis methods of 454 Life Sciences and Solexa/Illumina use emulsion-based PCR followed by pyrosequencing and reversible terminator sequencing, respectively. It is also possible to sequence on single-molecule platforms such as the HeliScope by Helicos, where fluorescent nucleotides incorporated into templates can be imaged at the level of single molecules (Figure 6). A typical dataset from the Illumina Genome Analyzer comprises several million short sequence reads, typically 36-75 bp long. These are aligned to a reference genome, and the resulting read placements are used to infer the locations of transcription factor binding genome-wide. ChIP-Seq provides clearly interpretable binding information. Moreover, compared to ChIP-on-chip, data normalization is less of an issue because sequencing yields absolute read counts (Barski et al., 2007). In addition, the repetitive portion of the genome is less of a hindrance. One limitation, however, is that mapping tags to the reference genome can bias the analysis toward genomic regions with unique and complex sequence patterns. This is because short sequencing reads that overlap low-complexity regions or interspersed repeats stand a higher chance of being discarded for lack of a unique genomic alignment (Bock et al., 2008). Even though ChIP-Seq shares ChIP-on-chip's dependence on high-quality antibodies, its unparalleled throughput makes ChIP-Seq superior for whole-genome mapping of DNA-protein interactions.
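
The step from aligned reads to candidate binding regions can be illustrated with a minimal sketch. It assumes the reads are already aligned and reduced to start positions on a single chromosome. It bins them into fixed windows and flags windows where the ChIP count exceeds the depth-scaled control count under a Poisson model. The function names, window size, pseudocount and threshold are illustrative assumptions, not the procedure of any particular peak caller.

```python
import math
from collections import Counter


def window_counts(positions, window):
    """Count read start positions per fixed-size window (keyed by window index)."""
    return Counter(pos // window for pos in positions)


def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam), by direct summation of the CDF."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def enriched_windows(chip, control, window=200, alpha=1e-5, pseudocount=1.0):
    """Flag windows whose ChIP read count is unlikely given the control.

    chip, control: lists of read start positions (assumed: one chromosome).
    The control count is scaled to the ChIP sequencing depth; the pseudocount
    (an assumption) avoids a zero expectation where the control has no reads.
    """
    chip_counts = window_counts(chip, window)
    ctrl_counts = window_counts(control, window)
    scale = len(chip) / max(len(control), 1)
    hits = []
    for idx, k in sorted(chip_counts.items()):
        lam = (ctrl_counts.get(idx, 0) + pseudocount) * scale
        p = poisson_sf(k, lam)
        if p < alpha:
            hits.append((idx * window, (idx + 1) * window, k, p))
    return hits


# Toy data: 50 ChIP reads piled up near position 1000, plus background reads.
chip = [1000 + i for i in range(50)] + [5000, 9000]
control = [100, 5000, 9000, 12000]
hits = enriched_windows(chip, control)
```

Each hit is a `(start, end, read_count, p_value)` tuple. Real analyses additionally model fragment length, strand shift and local background, which this sketch deliberately leaves out.
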
Recent results show that ChIP-Seq can detect more than 10,000 binding regions for ERα in MCF7 cells (Carroll et al., 2006; Hurtado et al., 2011). Nevertheless, linking the binding regions to their target genes has been an ongoing struggle, as the majority of binding regions can be separated from their targets by hundreds of kilobases and in some cases megabases. In many cases the biological function of the TF binding is still unknown.
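One simple, admittedly crude, way to link binding regions to candidate target genes is to assign each region to the nearest transcription start site (TSS) within a distance cut-off. The sketch below assumes hypothetical inputs (region midpoints and a gene-to-TSS map on a single chromosome, with placeholder gene names) and a hypothetical 50 kb cut-off. It also shows why distal sites, as described above, stay unassigned under such a rule.

```python
import bisect


def assign_nearest_gene(region_midpoints, tss_by_gene, max_distance=50_000):
    """Map each binding-region midpoint to (gene, distance) or None.

    tss_by_gene: dict of gene name -> TSS coordinate (assumed: one chromosome).
    max_distance: illustrative cut-off in bp; regions farther than this from
    every TSS are left unassigned.
    """
    tss_sorted = sorted((pos, gene) for gene, pos in tss_by_gene.items())
    positions = [pos for pos, _ in tss_sorted]
    assignments = {}
    for mid in region_midpoints:
        i = bisect.bisect_left(positions, mid)
        # Only the TSSs immediately left and right of the midpoint can be nearest.
        candidates = tss_sorted[max(i - 1, 0): i + 1]
        best = min(candidates, key=lambda c: abs(c[0] - mid), default=None)
        if best is not None and abs(best[0] - mid) <= max_distance:
            assignments[mid] = (best[1], abs(best[0] - mid))
        else:
            assignments[mid] = None
    return assignments


# Placeholder gene names and coordinates, for illustration only.
tss = {"GENE_A": 100_000, "GENE_B": 900_000}
links = assign_nearest_gene([110_000, 500_000, 880_000], tss)
```

In this toy input the region at 500,000 lies 400 kb from both TSSs and is therefore left without a target, which is exactly the situation that motivates interaction-based methods.
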

Fullwood and colleagues developed a technique called ChIA-PET (chromatin interaction analysis using paired-end tag sequencing) (Fullwood et al., 2009), which couples chromosome conformation capture (Dekker et al., 2002), a method for identifying interacting chromatin regions, with high-throughput sequencing. The authors found 689 ER-associated chromatin interaction complexes, made up of duplexes and more complex interactions. These tend to involve stronger ER-binding events, which are biased toward specific histone marks and toward other transcriptional regulators important for ER function.
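The idea of calling interactions from paired-end tags can be sketched as follows. The sketch assumes each tag pair has been reduced to two coordinates on one chromosome and that binding regions (anchors) are given as half-open intervals. Pairs of distinct anchors joined by at least `min_pets` tag pairs are reported. All names and the threshold are illustrative assumptions and do not reproduce the published ChIA-PET pipeline.

```python
from collections import Counter


def find_anchor(pos, anchors):
    """Return the index of the anchor interval [start, end) containing pos, or None."""
    for i, (start, end) in enumerate(anchors):
        if start <= pos < end:
            return i
    return None


def count_interactions(pets, anchors, min_pets=2):
    """Count tag pairs linking two different anchors.

    pets: list of (left_pos, right_pos) tuples, one per paired-end tag.
    anchors: list of (start, end) binding regions.
    Returns {(anchor_i, anchor_j): n_pets} for pairs with n_pets >= min_pets.
    """
    counts = Counter()
    for left, right in pets:
        a, b = find_anchor(left, anchors), find_anchor(right, anchors)
        # Skip tags outside any anchor and tags whose ends fall in the same anchor.
        if a is None or b is None or a == b:
            continue
        counts[tuple(sorted((a, b)))] += 1
    return {pair: n for pair, n in counts.items() if n >= min_pets}


anchors = [(1000, 2000), (50_000, 51_000), (200_000, 201_000)]
pets = [(1500, 50_500), (1200, 50_900), (1900, 200_500), (1500, 1600), (300, 50_500)]
interactions = count_interactions(pets, anchors)
```

Requiring more than one supporting tag pair is a simple guard against random ligation products; the appropriate threshold depends on sequencing depth.
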

Endometrial cell lines seem to be less hormone-responsive than MCF7 cells. In our previous study we used ChIP-qPCR to identify ER and PR targets in two endometrial cell lines. Among 382 preselected genes, we found 137 target genes for ERs in HEC1A and 83 target genes for PRs in RL95-2. The results confirmed the *in vitro* model of non-receptive (HEC1A) and receptive (RL95-2) endometrium in terms of steroid hormone response (Tamm et al., 2009).

**4.2 Expression analysis, RNA-Seq**

The transcriptome is the complete set of transcripts in a cell or tissue at a specific developmental stage or physiological condition. Expression microarrays are currently the most widely used methodology for transcriptome analysis. The breast cancer cell line MCF7 is the most extensively used cell line for studying E2 responsiveness and ERα localization. The number of genes that could be regulated by E2 has grown during the last decade from ~100 to ~1500 genes (Frasor et al., 2003; Carroll and Brown, 2006; Kininis et al., 2007; Levenson et al., 2002; Lin et al., 2004; Lin et al., 2007). It is likely that in the near future RNA-Seq, a more sensitive technique, will reveal even more genes that show a significant change in activity after E2 or P4 treatment. Gene expression studies investigating endometrial receptivity in human biopsy samples have searched for genes differentially expressed between the follicular and luteal phases (Kao et al., 2002; Carson et al., 2002; Riesewijk et al., 2003; Mirkin et al., 2005). The highest number of regulated genes was reported in Carson's study, with 323 up-regulated and 370 down-regulated genes when comparing the follicular phase to the luteal phase. As mentioned before, the overlap of genes identified in different publications is relatively low. The difference could be due to variations in study design and to the limitations of microarray analysis. Microarray analysis is a hybridization-based approach, which involves incubating fluorescently labelled cDNA with custom-made microarrays. Prominent limitations of this method include hybridization and cross-hybridization artefacts, differences in data analysis and low coverage of all possible genes in large genomes (Casneuf et al., 2007). Comparing expression levels across different experiments is often difficult and requires complicated normalization methods.
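
To give a flavour of what such normalization involves, the sketch below implements quantile normalization, one commonly used approach for making intensity distributions comparable across arrays. It is offered as an illustrative assumption about the kind of method meant here, not as the procedure used in the cited studies. Ties are not treated specially in this minimal version.

```python
def quantile_normalize(samples):
    """Force every sample to share the same intensity distribution.

    samples: list of equal-length lists, one list of intensities per array.
    Returns new lists in which the k-th smallest value of every sample is
    replaced by the mean of the k-th smallest values across all samples.
    """
    n = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = [
        sum(s[k] for s in sorted_samples) / len(samples) for k in range(n)
    ]
    normalized = []
    for s in samples:
        order = sorted(range(n), key=lambda i: s[i])  # indices, smallest first
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(arrays)
```

After normalization every array contains the same set of values, so only the rank of each gene within its array carries information, which is one reason such methods can mask genuine global shifts in expression.
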
A newer and potentially more comprehensive way to measure the whole active transcriptome is direct ultra-high-throughput sequencing, termed RNA-Seq. The resulting sequence reads are individually mapped to the source genome and counted to obtain the number and density of reads corresponding to RNA from each known exon, splice event or new candidate gene (Mortazavi et al., 2008). RNA-Seq uses recently developed deep-sequencing technologies in which RNA is converted to a library of cDNA fragments with adaptors attached to one or both ends. Each molecule is sequenced from one end (single-end) or from both ends (paired-end). The reads are typically 30-400 bp long, depending on the DNA-sequencing technology used. As with ChIP, the extracted RNA can be used with the Illumina, Applied Biosystems SOLiD and Roche 454 Life Science platforms (Wang et al., 2009). RNA-Seq has very low, if any, background signal because cDNA sequences can be mapped to unique regions of the genome. Unlike DNA microarrays, which lack sensitivity for genes expressed at either low or very high levels, it has no upper limit of quantification. Like other HTP sequencing technologies, RNA-Seq faces several bioinformatics challenges in data processing.

The analysis of RNA-Seq data starts from raw cDNA sequences, usually 40-70 bp long, depending on the platform. The general goal is to find the sites in the human genome from which the RNA was transcribed and to determine the expression levels of these transcripts. Additionally, RNA-Seq data can be used to study the expression levels of alternative splicing isoforms.

The usual analysis consists of three main steps:

A more detailed explanation of the analysis is given in Figure 7.

Fig. 7. RNA-Seq method. a) Paired cDNA fragments are mapped to the genome using TopHat software. b) Each pair of fragments is treated as a single alignment and the abundances of the assembled transcripts are estimated (b-e). First the fragments from distinct spliced mRNA isoforms are identified (b). Isoforms are then assembled from the overlap graph (c) and transcript abundance is estimated (d). Cufflinks estimates transcript abundances using a statistical model in which the probability of observing each fragment is a linear function of the abundances of the transcripts from which it could have originated. The program numerically maximizes a function that assigns a likelihood to all possible sets of relative abundances of the different isoforms (e), producing the abundances that best explain the observed fragments (adapted from Trapnell et al., 2010).

analysis is still complicated in a way and needs excellent computational skills but the data collected today will become the knowledge of tomorrow.

**6. Acknowledgment**

This work has been supported by the Tallinn University of Technology (Targeted project no. B611), the Estonian Ministry of Education and Science (Targeted project no. SF0180044s09) and Enterprise Estonia (Grant no. EU30200).

**7. References**

Altmäe, S., Martínez-Conejero, J.A., Salumets, A., Simón, C., Horcajadas, J.A. & Stavreus-Evers, A. (2010). *Endometrial gene expression analysis at the time of embryo implantation in women with unexplained infertility.* Mol Hum Reprod. 2010 Mar;16(3):178-87. Epub 2009 Nov 22.

Arici, A., Engin, O., Attar, E. & Olive, D.L. (1995). *Modulation of leukemia inhibitory factor gene expression and protein biosynthesis in human endometrium.* J Clin Endocrinol Metab. 1995 Jun;80(6):1908-15.

Arnett-Mansfield, R.L., DeFazio, A., Mote, P.A. & Clarke, C.L. (2004). *Subnuclear distribution of progesterone receptors A and B in normal and malignant endometrium.* J Clin Endocrinol Metab. 2004 Mar;89(3):1429-42.

Attia, G.R., Zeitoun, K., Edwards, D., Johns, A., Carr, B.R. & Bulun, S.E. (2000). *Progesterone receptor isoform A but not B is expressed in endometriosis.* J. Clin. Endocrinol. Metab., 85, 2897–2902.

Baart, E.B., Martini, E., Eijkemans, M.J., Van Opstal, D., Beckers, N.G., Verhoeff, A., Macklon, N.S. & Fauser, B.C. (2007). *Milder ovarian stimulation for in-vitro fertilization reduces aneuploidy in the human preimplantation embryo: a randomized controlled trial.* Hum. Reprod., 22, 980-988.

Bain, D.L., Heneghan, A.F., Connaghan-Jones, K.D. & Miura, M.T. (2007). *Nuclear receptor structure: implications for function.* Annu Rev Physiol. 2007;69:201-20. Review.

Barski, A., Cuddapah, S., Cui, K., Roh, T.Y., Schones, D.E., Wang, Z., Wei, G., Chepelev, I. & Zhao, K. (2007). *High-resolution profiling of histone methylations in the human genome.* Cell. 2007 May 18;129(4):823-37.

Beckers, N.G., Platteau, P., Eijkemans, M.J., Macklon, N.S., de Jong, F.H., Devroey, P. & Fauser, B.C. (2006). *The early luteal phase administration of oestrogen and progesterone does not induce premature luteolysis in normo-ovulatory women.* Eur. J. Endocrinol., 155, 355-363.

Bergman, L., Beelen, M.L., Gallee, M.P., Hollema, H., Benraadt, J. & van Leeuwen, F.E. (2000). *Risk and prognosis of endometrial cancer after tamoxifen for breast cancer.* Comprehensive Cancer Centres' ALERT Group. Assessment of Liver and Endometrial cancer Risk following Tamoxifen. Lancet 2000;356: 881–887.

Bernstein, L., Hanisch, R., Sullivan-Halley, J. & Ross, R.K. (1995). *Treatment with human chorionic gonadotropin and risk of breast cancer.* Cancer Epidemiol Biomarkers Prev. 1995 Jul-Aug;4(5):437-40.
