**4. TCGA: a genomic hub of cancer**

The Cancer Genome Atlas (TCGA) is a joint effort of the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI), each of which invested \$50 million to improve the understanding of the molecular basis of cancer using advanced genome analysis technology. The overall aim of the project was to improve the ability to diagnose, treat, and prevent cancer. The first phase, which started in 2005 and focused on brain, lung, and ovarian cancers, was designed to develop and test the infrastructure for further research. The second phase, which started in 2009, covered around 30 different types of cancer, with analysis completed by 2014.

The first phase of the study proved that a cancer-specific atlas can be created by a worldwide network of research teams working on different cancers, and that the resulting data can be pooled on a single platform and made publicly accessible. The publicly available data from TCGA also enable researchers around the world to make and validate important discoveries. TCGA is supported by the Genomic Data Commons (GDC), one of several programs at the NCI's Center for Cancer Genomics, along with another program, Therapeutically Applicable Research to Generate Effective Treatments (TARGET). The GDC now hosts genomic alterations for 39 projects, combining TCGA and TARGET.

Data availability is categorized by primary site of study: kidney, adrenal gland, brain, colorectal, lung, uterus, bile duct, bladder, bone marrow, breast, cervix, esophagus, eye, head and neck, liver, lymph nodes, ovary, pancreas, pleura, prostate, skin, soft tissue, stomach, testis, thymus, and thyroid. Some primary sites are further divided into separate projects. For example, kidney is divided into three projects: kidney renal clear cell carcinoma, kidney renal papillary cell carcinoma, and kidney chromophobe. Adrenal gland, brain, colorectal, lung, and uterus are each divided into two projects: pheochromocytoma and paraganglioma, and adrenocortical carcinoma (adrenal gland); glioblastoma multiforme and brain lower grade glioma (brain); colon adenocarcinoma and rectum adenocarcinoma (colorectal); lung adenocarcinoma and lung squamous cell carcinoma (lung); and uterine corpus endometrial carcinoma and uterine carcinosarcoma (uterus).

Applications of RNA-Seq and Omics Strategies - From Microorganisms to Human Health

Transcriptome Sequencing for Precise and Accurate Measurement of Transcripts and...

http://dx.doi.org/10.5772/intechopen.70026

#### **4.1. TCGA data and file formats**

The main categories of data available in TCGA are:

• Clinical

• Biospecimen

• Raw sequencing data

• Transcriptome profiling

• Simple nucleotide variation

• Copy number variation

• DNA methylation

The main data types are:

• Gene expression quantification

• Isoform expression quantification

• miRNA expression quantification

• Annotated somatic mutation

• Raw simple somatic mutation

• Masked somatic mutation

• Aggregated somatic mutation

• Copy number segment

• Masked copy number segment

• Methylation beta value

• Aligned reads

• Clinical supplement

• Biospecimen supplement

These data were generated with different experimental strategies. Whole-exome sequencing (WXS), RNA-Seq, and miRNA-Seq were performed on Illumina platforms; the Illumina Human Methylation 450 and Illumina Human Methylation 27 platforms were used for methylation arrays; and genotyping arrays were run on Affymetrix SNP 6.0.

#### **4.2. miRNA analysis**

TCGA provides tissue-specific miRNA expression profiles, miRNA isoforms, their associations with disease, and a basis for discovering unreported miRNAs. The first step in the miRNA pipeline is alignment of the reads with BWA-aln. The input can be in either FASTQ or BAM format, and the output of the alignment is a BAM file. The alignment is followed by the expression workflow, which produces raw read counts and counts normalized to reads per million mapped reads (RPM). Files fall into two access types, controlled and open: the aligned reads file is under controlled access, whereas the quantification files are openly accessible (**Table 2**). The quantification results come in two separate files, "mirnas.quantification.txt" and "isoforms.quantification.txt". The mirnas.quantification.txt file describes the summed expression for each miRNA and contains the following information:

• miRNA name

• raw read count

• reads per million miRNA reads

• cross-mapped to other miRNA forms (Y or N)

whereas the isoforms.quantification.txt file lists every individual sequence isoform observed, as follows:

• miRNA name

• alignment coordinates as <version>:<Chromosome>:<Start position>-<End position>:<Strand>

• raw read count

• reads per million miRNA reads

• cross-mapped to other miRNA forms (Y or N)

• region within miRNA
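As a minimal sketch of how the reads-per-million value relates to the raw read count, the Python snippet below parses a small, made-up excerpt laid out like a miRNA quantification file and recomputes RPM as the raw count scaled by the total number of miRNA-mapped reads. The column names (`miRNA_ID`, `read_count`, `cross-mapped`) and the example values are assumptions chosen for illustration, and the normalization shown is the generic RPM definition rather than a confirmed description of the TCGA workflow.

```python
import csv
import io

# Hypothetical excerpt in the tab-delimited layout described above.
# Column names and values are illustrative assumptions, not real TCGA data.
EXAMPLE = (
    "miRNA_ID\tread_count\tcross-mapped\n"
    "hsa-let-7a-1\t8000\tN\n"
    "hsa-mir-21\t1500\tN\n"
    "hsa-mir-100\t500\tY\n"
)


def reads_per_million(text):
    """Return {miRNA name: RPM}, where RPM = count * 1e6 / total mapped count."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    total = sum(int(row["read_count"]) for row in rows)
    return {
        row["miRNA_ID"]: int(row["read_count"]) * 1_000_000 / total
        for row in rows
    }


rpm = reads_per_million(EXAMPLE)
for name, value in rpm.items():
    print(f"{name}\t{value:.1f}")
```

Because every count is divided by the same total, the RPM values of one sample always sum to one million, which makes samples with different sequencing depths comparable.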


#### **4.3. RNA-Seq analysis**

TCGA uses an Illumina system as its basic RNA-Seq platform and provides both nucleotide sequence and gene expression information. After sequence alignment, the available information includes RNA sequence coverage, sequence variants (e.g., fusion genes), and the expression of genes, exons, or junctions. The NCBI dbGaP database is the official repository for the actual sequence data [45]. After the reads are aligned to the reference genome, the TCGA mRNA quantification pipeline reports gene expression levels in several forms: HT-Seq raw mapping counts, fragments per kilobase of transcript per million mapped reads (FPKM), and FPKM-UQ (upper quartile normalization) (**Table 3**). The data access rules are the same as for miRNA analysis: access to the aligned reads file is controlled, whereas access to the remaining files is open.
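The relationship between raw counts, FPKM, and FPKM-UQ can be illustrated with a short Python sketch. It assumes that FPKM divides each gene's read count by the gene length and by the total number of reads mapped to protein-coding genes, and that FPKM-UQ replaces that total with the 75th-percentile read count; the second formula is an assumption about the exact definition, and the gene names, counts, and lengths are hypothetical.

```python
def upper_quartile(values):
    """75th percentile, using linear interpolation between sorted values."""
    ordered = sorted(values)
    position = 0.75 * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def fpkm(counts, lengths):
    """FPKM = count * 1e9 / (gene length in bp * total reads over all genes)."""
    total = sum(counts.values())
    return {g: counts[g] * 1e9 / (lengths[g] * total) for g in counts}


def fpkm_uq(counts, lengths):
    """As FPKM, but scaled by the 75th-percentile count instead of the total."""
    q75 = upper_quartile(counts.values())
    return {g: counts[g] * 1e9 / (lengths[g] * q75) for g in counts}


# Hypothetical raw counts and gene lengths (bp) for three protein-coding genes.
counts = {"GENE_A": 100, "GENE_B": 300, "GENE_C": 600}
lengths = {"GENE_A": 1000, "GENE_B": 2000, "GENE_C": 1000}

fpkm_values = fpkm(counts, lengths)
fpkm_uq_values = fpkm_uq(counts, lengths)
print(fpkm_values)
print(fpkm_uq_values)
```

Scaling by an upper-quartile count rather than the total makes the normalized values less sensitive to a few very highly expressed genes that dominate the read total.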

| Type | Description | Format |
| --- | --- | --- |
| Aligned reads | miRNA-Seq reads that have been aligned to the GRCh38 build. Reads that were not aligned are included to facilitate the availability of raw read sets | BAM |
| miRNA expression quantification | A table that associates miRNA IDs with read count and a normalized count in reads per million miRNA mapped | TXT |
| Isoform expression quantification | A table with the same information as the miRNA Expression Quantification files with the addition of isoform information such as the coordinates of the isoform and the type of region it constitutes within the full miRNA transcript | TXT |

**Table 2.** Data types and file formats.

| Type | Description | Format |
| --- | --- | --- |
| RNA-Seq alignment | RNA-Seq reads that have been aligned to the GRCh38 build. Reads that were not aligned are included to facilitate the availability of raw read sets | BAM |
| Raw read counts | The number of reads aligned to each protein-coding gene, calculated by HT-Seq | TXT |
| FPKM | A normalized expression value that takes into account each protein-coding gene length and the number of reads mappable to all protein-coding genes | TXT |
| FPKM-UQ | A normalized raw read count in which gene expression values, in FPKM, are divided by the 75th percentile value | TXT |

**Table 3.** Gene quantification data formats.

#### **4.4. DNA-Seq analysis**

Genomic diversity across different cancer types has been characterized by utilizing DNA sequencing systems based on Sanger Sequencing at different Genome Sequencing Centers.


Somatic variants from whole-genome sequencing are identified with this pipeline by comparing allele frequencies in tumor samples with those in matched normal samples. Somatic mutations are called through four separate pipelines, and the identified variants are then annotated. After annotation, the files from multiple cases in a project are combined, giving a single file for each pipeline. Mutations are listed in a tab-delimited format, the Mutation Annotation Format (MAF). Two types of MAF file are produced for each variant-calling pipeline in a project: the protected MAF and the somatic (public) MAF. These MAF files are derived from annotated Variant Call Format (VCF) files, which report each variant against multiple transcripts. Only the most critical annotation is reported in the protected MAF file, whereas public MAF files are further processed to remove low-quality and potential germline variants, thereby withholding confidential information. VCF files are of two types: raw, unannotated simple somatic mutation files and annotated somatic mutation files.
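Since a MAF is a plain tab-delimited table, it can be read with standard tools. The following Python sketch counts mutations per gene in a made-up, heavily shortened MAF excerpt. The three column names follow the usual MAF layout, but the rows, the barcodes, and the `#version` comment line are illustrative assumptions rather than real TCGA records.

```python
import csv
from collections import Counter

# Hypothetical MAF excerpt; a real MAF file has many more columns.
EXAMPLE_MAF = (
    "#version 2.4\n"
    "Hugo_Symbol\tVariant_Classification\tTumor_Sample_Barcode\n"
    "TP53\tMissense_Mutation\tTCGA-02-0001-01B-02D-0182-06\n"
    "TP53\tNonsense_Mutation\tTCGA-02-0002-01A-01D-0182-06\n"
    "EGFR\tMissense_Mutation\tTCGA-02-0001-01B-02D-0182-06\n"
)


def mutations_per_gene(maf_text):
    """Count rows per gene symbol, skipping blank and '#' comment lines."""
    lines = [
        line for line in maf_text.splitlines()
        if line and not line.startswith("#")
    ]
    reader = csv.DictReader(lines, delimiter="\t")
    return Counter(row["Hugo_Symbol"] for row in reader)


gene_counts = mutations_per_gene(EXAMPLE_MAF)
for gene, n in gene_counts.most_common():
    print(gene, n)
```

The same reader works on a file object opened from disk, because `csv.DictReader` accepts any iterable of text lines.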

#### **4.5. Single-nucleotide polymorphism**

TCGA used SNP-based technology to analyze genome-wide variation. It also includes platforms to define copy number variation (CNV) and loss of heterozygosity (LOH) across multiple samples.

#### **4.6. DNA methylation sequencing**

TCGA uses the Illumina platform for DNA methylation studies, which offers single-base-pair resolution, high accuracy, easy workflows, and low DNA input requirements. DNA methylation data files (**Table 4**) contain signal intensities (raw and normalized), detection confidence, and beta values calculated from methylated (M) and unmethylated (U) probes.
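The beta value summarizes the two probe intensities as the fraction of signal coming from the methylated probe. The Python sketch below assumes the common definition beta = M / (M + U + offset); the `offset` parameter is an assumption added for illustration (some tools add a small constant to stabilize low-intensity probes), and the intensities are hypothetical.

```python
def beta_value(methylated, unmethylated, offset=0.0):
    """Return M / (M + U + offset), or None when the denominator is zero."""
    total = methylated + unmethylated + offset
    if total == 0:
        return None  # no signal at all: beta is undefined
    return methylated / total


# Hypothetical probe intensities.
print(beta_value(300, 100))              # mostly methylated
print(beta_value(300, 100, offset=100))  # same probe with a stabilizing offset
print(beta_value(0, 0))                  # undefined
```

Beta values lie between 0 (unmethylated) and 1 (fully methylated), so they can be compared directly across probes and samples.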

#### **4.7. Reverse-phase protein array (RPPA)**

Reverse-phase protein array (RPPA) is a high-throughput, functional, and quantitative proteomic method for large-scale protein expression profiling, which ultimately helps in biomarker discovery and cancer diagnostics. Protein array data represent protein expression and concentration. These data archives are deposited at the TCGA Data Coordinating Center (DCC) and include original images of the protein arrays, calculated raw signals, relative protein concentrations, and normalized protein signals (**Table 5**).

| Platform code | File type | Description |
| --- | --- | --- |
| IlluminaDNAMethylation\_OMA002\_CPI | Tab-delimited, ASCII text (.txt) | Cy3 and Cy5 signals and detection confidence of methylated probes |
| IlluminaDNAMethylation\_OMA002\_CPI | Tab-delimited, ASCII text (.txt) | Calculated beta values |
| IlluminaDNAMethylation\_OMA003\_CPI | Tab-delimited, ASCII text (.txt) | Cy3 and Cy5 signals and detection confidence of methylated probes |
| IlluminaDNAMethylation\_OMA003\_CPI | Tab-delimited, ASCII text (.txt) | Calculated beta values |
| HumanMethylation27 | Binary (.idat) | Intensity data file with statistics for each bead type in terms of bead count, mean and standard deviation per dye |
| HumanMethylation27 | Tab-delimited, ASCII text (.txt) | Calculated beta values and mean signal intensities for replicate methylated and unmethylated probes |
| HumanMethylation27 | Tab-delimited, ASCII text (.txt) | Calculated beta values, gene symbols, chromosomes and genomic coordinates (build 36). Some data have been masked (including known SNPs) |
| HumanMethylation450 | Binary (.idat) | Intensity data file with statistics for each bead type in terms of bead count, mean and standard deviation per dye |
| HumanMethylation450 | Tab-delimited, ASCII text (.txt) | Background-corrected methylated (M) and unmethylated (U) summary intensities as extracted by the methylumi package |
| HumanMethylation450 | Tab-delimited, ASCII text (.txt) | Calculated beta values, gene symbols, chromosomes and genomic coordinates (hg18). Some data have been masked (including known SNPs) |

**Table 4.** DNA methylation data files format.

| File type | Description |
| --- | --- |
| Array Slide Image (tiff) | Black and white, high-resolution image of protein array |
| RPPA Slide Image Measurements (txt) | Raw signals from a black and white, high-resolution image of protein array |
| Super Curve Results (tab-delimited, txt) | Supercurve results, use dilution to calculate relative concentration |
| Normalized Protein Expression (MAGE-TAB data matrix, txt) | Signals for genes |

**Table 5.** Protein data file format.

#### **4.8. Data processing workflow**

TCGA has a well-organized structure, from sample collection to bioinformatics analysis, involving several centers (**Table 6**).

| Project | Details | Source |
| --- | --- | --- |
| Tissue Source Sites (TSSs) | Collection of the samples (blood and tissue from tumour and normal controls) and clinical metadata from patients (donors). Shipment of the annotated biospecimens to Biospecimen Core Resources (BCR) | https://tcga-data.nci.nih.gov/datareports/codeTablesReport.htm?codeTable=tissue%20source%20site; https://wiki.nci.nih.gov/display/TCGA/Tissue+Source+Site |
| Biospecimen Core Resource (BCR) | Coordination of sample delivery and data collection, cataloguing, processing, and verifying the quality and quantity. Isolation and distribution of RNA and DNA from biospecimens to other institutions for genomic characterization and high-throughput sequencing | Research Institute at Nationwide Children's Hospital in Columbus, Ohio; http://cancergenome.nih.gov/abouttcga/overview/howitworks/bcr; http://www.nationwidechildrens.org/biospecimen-core-resource-about-us |
| Genome Sequencing Centers (GSCs) | High-throughput sequencing (data are available in TCGA Data Portal or at NIH's database of Genotype and Phenotype). Identification of the DNA alterations | Broad Institute Sequencing Platform in Cambridge; Human Genome Sequencing Center, Baylor College of Medicine in Houston; The Genome Institute at Washington University; http://cancergenome.nih.gov/abouttcga/overview/howitworks/sequencingcenters |
| Cancer Genome Characterization Centers (GCCs) | Utilization of novel technologies and multiple platforms. Comprehensive description of the genomic changes: alterations in miRNA and gene expression, SNP, CNV, and others | Copy Number Alteration (Brigham and Women's Hospital and Harvard Medical School in Boston, The Broad Institute in Cambridge); Epigenomics (University of Southern California in Los Angeles, Johns Hopkins University in Baltimore); Gene (mRNA) Expression (University of North Carolina at Chapel Hill); miRNA Analysis (British Columbia Cancer Agency in Vancouver); Targeted Sequencing Center (Baylor College of Medicine in Houston); Functional Proteomics (MD Anderson Cancer Center); http://cancergenome.nih.gov/abouttcga/overview/howitworks/characterizationcenters |
| Proteome Characterization Centers (PCCs) | Identification of cancer-specific proteins | Center for Application of Advanced Clinical Proteomic Technologies for Cancer; Proteo-Genomic Discovery, Prioritization and Verification of Cancer Biomarkers; Proteome Characterization Centre and Vanderbilt Proteome Characterization Center; Cancer Proteomic Center; http://cancergenome.nih.gov/abouttcga/overview/howitworks/proteomecharacterization |
| Data Coordinating Center (DCC) | Management of all generated data and transfer of them to public databases (TCGA Data Portal and Cancer Genomics Hub) | University of California Santa Cruz; http://cancergenome.nih.gov/abouttcga/overview/howitworks/datasharingmanagement |

**Table 6.** TCGA centers and data processing.


Eligible patient samples (blood and tissue) are collected by different Tissue Source Sites (TSSs) and delivered to the Biospecimen Core Resource (BCR). The BCR catalogues and processes these samples, verifies their quality and quantity, and then submits clinical data and metadata to the Data Coordinating Center (DCC). Once the BCR has provided molecular analytes, the Genome Characterization Centers (GCCs) and Genome Sequencing Centers (GSCs) carry out genomic characterization and high-throughput sequencing. After sequencing, the DCC receives the sequence-related data from the GSCs. Trace files, sequences, and alignment mappings from the Genome Characterization Centers are also submitted to the NCI's secure repository, the Cancer Genomics Hub (CGHub). These data are made available to the research community and to the Genome Data Analysis Centers (GDACs). Information managed by the DCC is stored in free-access public databases (TCGA portal, NCBI's Trace Archive, CGHub), which allows researchers to access the data and thereby helps to advance cancer studies.

#### **4.9. TCGA data identifiers**

Barcodes were initially used as the primary identifier for biospecimen data in TCGA. The Tissue Source Site delivers the patient sample and its metadata to the Biospecimen Core Resource (BCR). Once the BCR receives the sample, it assigns a human-readable TCGA barcode. The barcode ties together the results that different data-generating centers produce for one particular sample, and sections of the barcode also carry metadata about the sample. The BCR now also assigns universally unique identifiers (UUIDs) to samples along with the TCGA barcode, and UUIDs have replaced barcodes as the primary identifier.
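Because both kinds of identifier can appear in TCGA metadata, it is often useful to tell them apart programmatically. The Python sketch below is one possible way to do so, using the standard `uuid` module to recognize UUIDs (32 hexadecimal digits, see Section 4.9.2) and a simple pattern for barcodes. The function name `identifier_kind` and the barcode pattern are assumptions made for illustration, not part of any TCGA tool.

```python
import re
import uuid

# Loose pattern for a TCGA barcode: "TCGA" followed by 1 to 6 hyphenated fields.
BARCODE_PATTERN = re.compile(r"^TCGA(-[A-Za-z0-9]+){1,6}$")


def identifier_kind(identifier):
    """Classify a string as 'uuid', 'barcode', or 'unknown'."""
    try:
        uuid.UUID(identifier)  # raises ValueError if not a valid UUID string
        return "uuid"
    except ValueError:
        pass
    if BARCODE_PATTERN.match(identifier):
        return "barcode"
    return "unknown"


example_uuid = str(uuid.uuid4())  # randomly generated UUID
print(example_uuid, identifier_kind(example_uuid))
print(identifier_kind("TCGA-02-0001-01B-02D-0182-06"))
print(len(uuid.uuid4().hex))  # number of hexadecimal digits in a UUID
```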

#### *4.9.1. Barcodes*

The BCR generates the barcode for each sample received from a TSS. The first fields after the program code identify the TSS and the participant from whom the tissue sample was received, giving the barcodes TCGA-02 and TCGA-02-0001, respectively. The next number in the barcode stands for the sample, whose tissue type is identified by a code (**Table 7**), followed by the vial into which the sample was split: TCGA-02-0001-01 and TCGA-02-0001-01B. Each vial is again divided into portions, e.g., TCGA-02-0001-01B-02. Analytes extracted from a portion, e.g., TCGA-02-0001-01B-02D, are distributed across one or more plates, e.g., TCGA-02-0001-01B-02D-0182. Each well, e.g., TCGA-02-0001-01B-02D-0182-06, is identified as an aliquot. These plates are later given to the various characterization and sequencing centers.
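The hierarchical layout of the barcode lends itself to simple parsing. The following Python sketch splits a barcode into the fields described above and also accepts shorter barcodes (for example, one that stops at the sample). The field names are labels chosen for this illustration; the last field is called `center` here following the TCGA barcode documentation, which is an assumption relative to the description above.

```python
FIELDS = ("project", "tss", "participant", "sample", "vial",
          "portion", "analyte", "plate", "center")


def parse_barcode(barcode):
    """Split a TCGA barcode into named fields; missing fields are None."""
    parts = barcode.split("-")
    if parts[0] != "TCGA" or not 2 <= len(parts) <= 7:
        raise ValueError(f"not a TCGA barcode: {barcode!r}")
    parsed = dict.fromkeys(FIELDS)
    parsed["project"] = parts[0]
    parsed["tss"] = parts[1]
    if len(parts) > 2:
        parsed["participant"] = parts[2]
    if len(parts) > 3:
        # Two-digit sample (tissue) code, optionally followed by a vial letter.
        parsed["sample"], parsed["vial"] = parts[3][:2], parts[3][2:] or None
    if len(parts) > 4:
        # Two-digit portion number, optionally followed by an analyte letter.
        parsed["portion"], parsed["analyte"] = parts[4][:2], parts[4][2:] or None
    if len(parts) > 5:
        parsed["plate"] = parts[5]
    if len(parts) > 6:
        parsed["center"] = parts[6]
    return parsed


aliquot = parse_barcode("TCGA-02-0001-01B-02D-0182-06")
for field in FIELDS:
    print(f"{field:12s}{aliquot[field]}")
```

Returning `None` for absent fields lets the same function handle participant-level, sample-level, and aliquot-level barcodes without separate code paths.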

The generated data are not only categorized based on the type but also the level at which these data can be accessed. In addition to the analyzed tumor data, TCGA also collects nontumor samples aimed to analyze every patients germ line DNA to identify which alteration found in tumor sample responsible for the oncogenic process. For most of the tumors, TCGA collects and analyzes normal blood samples. In the absence of a matching normal blood sample, a normal tissue sample from the same patient is used as the germ line control for DNA assays. But in the case of RNA assays, using a normal blood sample as a control is not logically correct. Because RNA profile of blood sample is expected to be different from the RNA profile of tissues from other organs such as brain, breast, and lungs or ovary. Because of this reason, TCGA attempts to collect normal tissue matched to the anatomic site of the tumor not

40 TRB Recurrent Blood Derived Cancer—Peripheral Blood

Access to the data is strictly controlled. There are two levels of data access:

• Open access data tier [raw, non-normalized data (Level I), processed data (Level II)].

• Controlled access data tier [segmented/interpreted data (Level III) apply to individual sam-

matched to the patient.

**Table 7.** Tissue code.

**4.10. Accessibility of data**

ples, while summarized data (Level IV)].

**Tissue code Letter code Definition**

6 TM Metastatic

50 CELL Cell Lines

60 XP Primary Xenograft Tissue

61 XCL Cell Line Derived Xenograft Tissue

1 TP Primary Solid Tumor 2 TR Recurrent Solid Tumor

5 TAP Additional—New Primary

7 TAM Additional Metastatic 8 THOC Human Tumor Original Cells

 NB Blood Derived Normal NT Solid Tissue Normal NBC Buccal Cell Normal NEBV EBV Immortalized Normal NBM Bone Marrow Normal CELLC Control Analyte

3 TB Primary Blood Derived Cancer—Peripheral Blood 4 TRBM Recurrent Blood Derived Cancer—Bone Marrow

Transcriptome Sequencing for Precise and Accurate Measurement of Transcripts and...

http://dx.doi.org/10.5772/intechopen.70026

163

9 TBM Primary Blood Derived Cancer—Bone Marrow

#### *4.9.2. Universally unique identifier (UUID)*

UUIDs are randomly generated 32-digit hexadecimal value. TCGA became more complex, and the barcode was not enough to handle the generated data because there was not enough barcode combinations to represent the data. Also, flexibility in altering the barcode was also less when the associated metadata changes with a barcode. Considering all these factors, TCGA changed from using barcode for biospecimen and clinical data.


**Table 7.** Tissue code.

Eligible patient samples (blood and tissue) are collected by different Tissue Source Sites (TSSs) and delivered to the Biospecimen Core Resource (BCR). BCR catalogue, process, and verify the quality and quantity of these samples and then submit clinical data and metadata to the Data Coordinating Center (DCC). Genome Characterization Centers (GCCs) and Genome Sequencing Centers (GSCs) then do the genomic characterization and highthroughput sequencing once the DCC provide molecular analytes. After sequencing, DCC again receives the sequence-related data from GSS. Trace files, sequences, and alignment mappings from Genome Characterization Centers are also submitted to the NCI's secure repository Cancer Genomic Hub (CGHub). Access to research community for these data is made available along with Genome Data Analysis Centers (GDACs). Information managed by DCC that has stored into public free-access databases (TCGA portal, NCBI's Trace Archive, CGHub), allows researchers to access the data and hence helps to advance in cancer

162 Applications of RNA-Seq and Omics Strategies - From Microorganisms to Human Health

Barcodes were initially used as the primary identifier for biospecimen data in TCGA during the beginning of the data. Tissue source site delivers the patient sample and the metadata to Biospecimen Core Resource (BCR). Once the sample is received by BCR, a human readable TCGA barcode was assigned. TCGA barcode was assigned to keep the navigation of the various results produced by the different data-generating centers for one particular sample connected. Sections of barcode also provide metadata information about the sample. Nowadays, BCR is also assigning universally unique identifiers (UUIDs) along with TCGA barcode to

BCR generates the barcode for each sample received from TSS. Barcode initial numbers after the program code are assigned according to the TSS and the participant from which the tissue sample was received. The barcodes TCGA-02 and TCGA-02-0001 are assigned, respectively. Types of tissue are also differentiated with codes (**Table 7**). Next number in the barcode stands for the sample followed by the vial number; the sample was split into TCGA-02-0001-01 and TCGA-02-0001-01B. This vial number is again divided into different portions—TCGA-02- 0001-01B-02. Analytes represented with barcode, e.g., TCGA-02-0001-01B-02D was extracted and distributed across one or more than one plates TCGA-02-0001-01B-02D-0182. Each well represented as, e.g., TCGA-02-0001-01B-02D-0182-06 is identified as an aliquot. These plates

UUIDs are randomly generated 32-digit hexadecimal value. TCGA became more complex, and the barcode was not enough to handle the generated data because there was not enough barcode combinations to represent the data. Also, flexibility in altering the barcode was also less when the associated metadata changes with a barcode. Considering all these factors,

samples keeping UUIDs as the primary identifier instead of barcodes.

are later given to various characterize and sequencing centers.

TCGA changed from using barcode for biospecimen and clinical data.

*4.9.2. Universally unique identifier (UUID)*

studies.

*4.9.1. Barcodes*

**4.9. TCGA data identifiers**

The generated data are not only categorized based on the type but also the level at which these data can be accessed. In addition to the analyzed tumor data, TCGA also collects nontumor samples aimed to analyze every patients germ line DNA to identify which alteration found in tumor sample responsible for the oncogenic process. For most of the tumors, TCGA collects and analyzes normal blood samples. In the absence of a matching normal blood sample, a normal tissue sample from the same patient is used as the germ line control for DNA assays. But in the case of RNA assays, using a normal blood sample as a control is not logically correct. Because RNA profile of blood sample is expected to be different from the RNA profile of tissues from other organs such as brain, breast, and lungs or ovary. Because of this reason, TCGA attempts to collect normal tissue matched to the anatomic site of the tumor not matched to the patient.

#### **4.10. Accessibility of data**

Access to the data is strictly controlled. There are two levels of data access:

• Open access data tier

• Controlled access data tier


#### *4.10.1. Open access data tier*

The open access data level is composed of public data not unique to a patient. The open access data tier does not require any user certification [45].

| Tool | Application |
| --- | --- |
| The Broad GDAC Firehose (http://firebrowse.org/) | A powerful tool for exploring cancer data. FireBrowse helps researchers easily find any of the thousands of data archives generated by Firehose. A RESTful API is provided, with bindings to the UNIX command line, Python, and R for programmers. For easy access, graphical interfaces are provided, such as viewGene to explore expression levels and iCoMut to explore the mutation information of each TCGA disease study in an interactive figure |
| The cBioPortal for Cancer Genomics (http://cbioportal.org) | An interactive open-access resource for the exploration of multidimensional cancer genomics data sets, which has rapidly lowered the barriers between genomic data and researchers. The database stores DNA copy-number data (deep deletions or amplifications), non-synonymous mutations, mRNA and microRNA expression data, protein-level and phosphoprotein-level (RPPA) data, limited de-identified clinical data, and DNA methylation data |
| Regulome Explorer (http://explorer.cancerregulome.org/) | Explores the associations between clinical and molecular features of TCGA data. The data can be filtered according to user-specified parameters for search and visualization |
| The Cancer Imaging Archive, TCIA (http://www.cancerimagingarchive.net) | Hosts a large archive of medical images of cancer accessible for public download. Information on patients' treatment details, outcomes, pathology, and genomics is also provided as supporting information, based on availability |
| Berkeley Morphometric Data (http://tcga.lbl.gov:9999/biosig/tcgadownload.do) | Characterizes tumour histopathology through the delineation of nuclear regions from hematoxylin and eosin (H&E) stained tissue sections. The advantage of such a database is that other samples can be cross-referenced for personalized therapy and precision medicine, as it contains information on responses to therapies, molecular correlates, and morphometric subtypes |
| The Cancer Digital Slide Archive, CDSA (http://cancer.digitalslidearchive.net/) | An integrated web-based platform supporting whole-slide pathology image visualization and data integration of the TCGA data |
| The MD Anderson GDAC's MBatch (http://bioinformatics.mdanderson.org/tcgabatcheffects) | Designed to help researchers assess, diagnose, and correct for batch effects in TCGA data. It first allows the user to assess and quantify the presence of batch effects through Principal Component Analysis and Hierarchical Clustering algorithms. The results from these algorithms are presented graphically as diagrams |
| Cancer Genome Workbench, CGWB (https://cgwb.nci.nih.gov/) | An NCI-developed application that integrates and displays genomic and transcription alterations across various cancers. Integrated tracks view, Heatmap view, and Bambino are the major viewers |
| UCSC Cancer Genomics Browser (https://genome-cancer.soe.ucsc.edu/) | An open-access suite to integrate, visualize, and analyze cancer genomic data along with clinical data |
| Integrative Genomics Viewer, IGV (http://www.broadinstitute.org/igv) | A freely available genome visualization tool developed by the Broad Institute |

**Table 8.** Visualization and data analysis tools.

Type of data accessible at open tier:

• Transcriptomic profiling

• Copy number variations

• Single-nucleotide variation

• DNA methylation

• Biospecimen

• Clinical

#### *4.10.2. Controlled access data tier*

Information unique to a patient falls into the controlled access tier. Each data type has unique identifiers. To gain access to these data, the user needs certification.

Type of data accessible at controlled level:

• BAM and FASTQ files

• Variant Call Format files

• Certain data within MAFs

• Level 1 and level 2 SNP6 array data

• Level 1 and level 2 exon array data

To obtain access to these data, researchers must:

• Complete the Data Access Request (DAR) form, which is available electronically through the Database of Genotypes and Phenotypes (dbGaP).

Once the submitted request is evaluated and approved, researchers must:

• Agree to restrict their use of the information to biomedical research purposes only

• Agree with the statements within TCGA Data Use Certification (DUC)

• Have their institutions certifiably agree to the statements within TCGA DUC

All patient samples are committed to use by TCGA, and there is no provision for sharing the material with a third party. In any case, 95% of the material is used up in the various characterizations. Even so, there remains a chance of obtaining samples from the TSS centers: one can contact the TSS center directly, and the decision lies with them.

#### **4.11. TCGA data: visualization and data analysis**

The huge amount of accumulated data demands advanced visualization technology, and a number of tools are available (**Table 8**). Visualization is essential for understanding the data with ease.