**3. Estimating telomere length from sequencing reads**

#### **3.1 Telomeres in the era of next-generation sequencing**

Telomeres are repetitive (TTAGGG)n sequences at chromosome ends. In humans, telomere length typically ranges from 10 to 15 kilobases [19]. With advances in genome sequencing technology, it is now possible to measure telomere length by applying computational algorithms to sequencing reads, because these reads carry information about telomeres just as they do about other regions of the genome.

DNA sequencing is the process of determining the order of nucleotide bases (T, C, A, G). To sequence a genome, the DNA is typically broken down into small fragments and sequenced in parallel [20]. The resulting sequence reads are then assembled to reconstruct the genome. High-throughput next-generation sequencing (NGS) allows for rapid and scalable sequencing of millions of DNA fragments. NGS can be used to sequence specific regions of the genome, such as protein-coding regions, or the entire genome to identify DNA sequences and somatic mutations. It can also be applied to RNA to measure gene expression.

Whole-genome sequencing (WGS) is a type of next-generation sequencing (NGS) that provides a comprehensive view of the entire genome, including the ends of chromosomes where telomeres are located. Several large-scale sequencing initiatives have been undertaken. These include the Trans-Omics for Precision Medicine (TOPMed) program, which includes over 130,000 whole-genome sequences from more than 80 studies [21], and the Pan-Cancer Analysis of Whole Genomes (PCAWG) project, an international effort to analyze over 2600 cancer whole genomes from the International Cancer Genome Consortium [22]. These projects have generated extensive WGS data integrated with clinical health outcomes on an individual level. Although telomere length was not the primary focus of these initiatives, the wealth of biomarker and health outcome data available makes these studies valuable resources for investigating the potential health impacts of telomere length variation on a population scale.

#### **3.2 Computational tools for estimating telomere length and content**

Numerous bioinformatic tools are currently available for estimating telomere length from WGS data. In 2010, Castle et al. first described their approach for estimating telomere content as a proxy for telomere length by counting sequencing reads that contained the repetitive (TTAGGG)4 motif [23]. Since then, several algorithms have been developed and extended to determine telomere content, including Motif\_counter [24], TelSeq [25], Computel [26], Telomerecat [27], TelomereHunter [28], Telogator [29], and Qmotif [30]. Each tool uses a different algorithm to estimate telomere length or content from WGS data. In this section, we introduce each tool and discuss its validation and application.

*Current Technologies for Measuring or Predicting Telomere Length from Genomic Datasets DOI: http://dx.doi.org/10.5772/intechopen.113048*

**Motif\_counter** (https://sourceforge.net/projects/motifcounter) is a bash script that quantifies telomere content by counting sequencing reads that contain the telomere motif. It takes Binary Alignment/Map (BAM) formatted files as input and thus requires sequence reads to be aligned to a reference genome prior to running. The user must specify the motif to search for, i.e., the character string of the telomere repetitive region, and the threshold for classifying a read as telomeric. Despite its simplicity and user-friendly command interface, this tool is unable to control for variations in genome coverage or sequencing depth, so additional normalization is needed. After normalization to genome coverage, the telomeric read counts estimated by Motif\_counter correlated well with telomere length measured by TRF (Pearson's r = 0.855) [31].
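The motif-counting idea can be illustrated with a minimal Python sketch. It is not the Motif\_counter script itself: the function names are hypothetical, reads are passed as plain strings rather than parsed from a BAM file, and the count is normalized to the total number of reads as a simple stand-in for genome coverage.

```python
from typing import Iterable

TELOMERE_MOTIF = "TTAGGG"  # canonical human telomere hexamer


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def is_telomeric(read: str, motif: str = TELOMERE_MOTIF, min_repeats: int = 4) -> bool:
    """True if the read holds `min_repeats` consecutive motif copies on either strand."""
    read = read.upper()
    target = motif * min_repeats
    return target in read or reverse_complement(target) in read


def telomeric_reads_per_million(reads: Iterable[str], min_repeats: int = 4) -> float:
    """Telomeric reads per million total reads (assumed proxy for coverage normalization)."""
    total = 0
    telomeric = 0
    for read in reads:
        total += 1
        if is_telomeric(read, min_repeats=min_repeats):
            telomeric += 1
    if total == 0:
        raise ValueError("no reads supplied")
    return telomeric / total * 1_000_000
```

Both the motif and the repeat threshold are parameters here, mirroring the two settings the user must supply. The default of four repeats follows the (TTAGGG)4 criterion described by Castle et al.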

**TelSeq** (https://github.com/zd1/telseq) is a C++ program that takes BAM files as input. It is specifically designed for estimating telomere length from WGS data and could be extended to whole-exome sequencing (WES) data. It searches for telomeric reads containing TTAGGG repeats and calculates the mean telomere length from telomeric content. One advantage of TelSeq is that it controls for sequencing biases by normalizing telomere counts based on the percentage of total reads that have a GC content similar to that of telomere sequences (48–52%). This is important because a high GC value favors more amplification during PCR, potentially leading to biased estimates. TelSeq has been validated using 260 leukocyte samples from the TwinsUK cohort, where its estimated mean telomere length was compared to TRF estimates [25]. Although TelSeq estimates were consistently shorter than TRF estimates (mean 5.63 kb compared to 6.97 kb), their correlation remained stable across a range of pre-defined numbers of telomeric repeats (Spearman's ρ = 0.6). Additionally, TelSeq and TRF estimates had a correlation of 0.78 on exome data, providing promising evidence for its extended application. TelSeq is the most widely used tool and has been applied to several large WGS datasets [32–34].
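The GC-based normalization can be sketched as follows. This is an illustration under stated assumptions rather than TelSeq's actual code: the function names are hypothetical, reads are plain strings, and the defaults (a threshold of 7 repeats, about 332.7 Mb of genome within the 48–52% GC window, and 46 chromosome ends) are assumed values that should be checked against the TelSeq documentation before use.

```python
from typing import Iterable


def gc_fraction(read: str) -> float:
    """Fraction of G and C bases in a read."""
    read = read.upper()
    if not read:
        return 0.0
    return (read.count("G") + read.count("C")) / len(read)


def count_repeats(read: str) -> int:
    """Non-overlapping TTAGGG (or CCCTAA) occurrences, whichever strand gives more."""
    read = read.upper()
    return max(read.count("TTAGGG"), read.count("CCCTAA"))


def telseq_style_length(
    reads: Iterable[str],
    k: int = 7,                            # assumed repeat threshold
    gc_window_bp: float = 332_720_800,     # assumed genome length at 48-52% GC
    chromosome_ends: int = 46,             # assumed 23 chromosome pairs x 2 ends
) -> float:
    """Mean telomere length in bases: telomeric reads scaled by GC-matched reads."""
    telomeric = 0
    gc_matched = 0
    for read in reads:
        if count_repeats(read) >= k:
            telomeric += 1
        if 0.48 <= gc_fraction(read) <= 0.52:
            gc_matched += 1
    if gc_matched == 0:
        raise ValueError("no reads fall in the 48-52% GC window")
    return telomeric / gc_matched * gc_window_bp / chromosome_ends
```

Because the denominator counts only reads whose GC content resembles that of telomeric sequence, a sample-wide shift in GC-dependent amplification affects the numerator and denominator in a similar way.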

**Computel** (https://github.com/lilit-nersisyan/computel) is an R program that operates in the Linux environment and takes FASTQ files as input. It identifies telomeric reads by aligning raw sequencing reads to a specially designed telomeric reference, distinguishing it from previous methods that were based on pattern matching. Mean telomere length is then calculated based on the ratio of coverage at the telomeric reference and genomic reference, read length, telomeric pattern length, and the number of chromosomes in a haploid genome. It also allows for telomeric repeat variant analysis to estimate the relative abundance of canonical and variant telomeric repeat patterns. This is important because telomeres may not always contain the canonical repeat pattern (TTAGGG)n if there are variants within telomeric regions. Variant analysis provides necessary information about the distribution of telomeric repeat variants in samples. Computel has been validated with simulated data, where a strong linear correlation was observed between actual and estimated telomere length, and the results suggested that Computel could outperform TelSeq but consistently generated lower telomere length estimates [26].
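One plausible way to combine these four quantities is sketched below. The expression is an assumption for illustration and may differ from the one implemented in Computel: it treats the telomeric reference as having length (read length + pattern length − 1) and divides the total telomeric sequence per genome copy by two telomeres per chromosome. The function and parameter names are hypothetical.

```python
def coverage_based_mean_length(
    tel_coverage: float,          # mean depth over the telomeric reference
    base_coverage: float,         # mean depth over the genomic reference
    read_length: int,
    pattern_length: int = 6,      # TTAGGG
    haploid_chromosomes: int = 23,
) -> float:
    """Mean telomere length in bases from a telomeric-to-genomic coverage ratio."""
    if base_coverage <= 0:
        raise ValueError("base_coverage must be positive")
    # Assumed length of the telomeric reference that reads were aligned to.
    reference_length = read_length + pattern_length - 1
    # Each chromosome carries one telomere at each end.
    telomeres_per_haploid_genome = 2 * haploid_chromosomes
    relative_coverage = tel_coverage / base_coverage
    return relative_coverage * reference_length / telomeres_per_haploid_genome
```

Dividing by the genomic coverage makes the estimate independent of sequencing depth, which is the role the coverage ratio plays in the description above.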

**Telomerecat** (https://github.com/cancerit/telomerecat) is written in Python and operates in both the Linux and macOS environments. It is specifically designed to operate independently of the number of telomeres present in a cell by normalizing telomeric content against subtelomeric regions instead of the entire genome, making it applicable to WGS data from cancer cells. Telomerecat takes BAM files as input and extracts read pairs with at least two instances of the telomeric hexamer. The software then classifies read pairs based on their sequence composition and orientation. Telomere length is calculated using the ratio of complete to boundary read pairs, along with the insert length distribution. Telomerecat has been validated in 260 adult females from the TwinsUK10K study, showing a significant correlation with TRF estimates (Spearman's ρ = 0.618) and TelSeq (ρ = 0.631) [27].

**TelomereHunter** (https://pypi.org/project/telomerehunter) is a Python program designed to estimate telomeric content from WGS data of matched tumor and normal tissue control pairs within the same individual. The program accepts BAM files as input, selects reads with a high number of telomere repeats, and sorts them by mapping position into categories such as intrachromosomal, subtelomeric, junction-spanning, and intratelomeric reads. Telomere content is then calculated from the intratelomeric reads. TelomereHunter has been validated by strong correlations between its estimated telomere content and telomere length measures from qPCR (Pearson's r = 0.94) and TRF (Pearson's r = 0.72) after GC correction [28].

**Telogator** (https://github.com/zstephens/telogator) is a Python tool designed to estimate chromosome-specific telomere length from long reads. This recently developed tool is built on long-read telomere analysis and the newest human reference genome from the Telomere-to-Telomere (T2T) consortium [35]. Telogator takes long reads in FASTA or FASTQ format, performs alignment, and extracts reads mapped to subtelomeres and telomeres. It then identifies telomere regions, clusters reads by their telomere-subtelomere boundaries, and reports telomere length for specific chromosome arms. A high correlation (Pearson's r = 0.91) has been reported between Telogator and TelomereHunter when comparing the averaged chromosome-specific telomere lengths against the telomere content from TelomereHunter [29].

**Qmotif** (https://github.com/AdamaJava/adamajava) is Java-based software designed to estimate telomere content from WGS data in a fast and efficient manner. It takes BAM files as input and searches for user-defined motifs using a two-stage matching system. In the first stage, quick string matching filters reads, and only those that pass go on to the second stage for regular expression (regex) matching. For telomere quantification, the first stage matches 3 consecutive repeats of the canonical telomere motif (TTAGGG), while the second stage uses regex to match any 2 adjacent repeats of the motif with variation allowed in the first 3 base pairs. The runtime can be further reduced by instructing the algorithm to search for telomere repeats only in regions of the genome most likely to contain them. Qmotif has been validated by comparison with qPCR (Spearman's ρ = 0.69) and other computational tools such as TelSeq (Spearman's ρ = 0.99) and TelomereHunter (Spearman's ρ = 0.85), with a much faster runtime of under 1 minute on the same set of samples (compared to 1–7 hours for TelSeq and 4–19 hours for TelomereHunter) [30].
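The two-stage matching scheme can be illustrated with a short Python sketch. It is not Qmotif's Java implementation: the function names are hypothetical, reads are plain strings, and the regular expression is an assumed rendering of "two adjacent repeats with variation allowed in the first three bases", extended here to cover the reverse strand and longer runs.

```python
import re
from typing import Iterable, List, Tuple

# Stage 1: cheap substring test for three consecutive canonical repeats.
STAGE1_FORWARD = "TTAGGG" * 3
STAGE1_REVERSE = "CCCTAA" * 3

# Stage 2 (assumed pattern): runs of two or more adjacent hexamers in which
# the first three bases may vary but the GGG (or CCC on the reverse strand) is kept.
STAGE2_PATTERN = re.compile(r"(?:[ACGT]{3}GGG){2,}|(?:CCC[ACGT]{3}){2,}")


def passes_stage1(read: str) -> bool:
    """Fast filter: does the read contain three consecutive canonical repeats?"""
    return STAGE1_FORWARD in read or STAGE1_REVERSE in read


def stage2_matches(read: str) -> List[str]:
    """Regex matches for runs of adjacent, possibly variant, telomeric hexamers."""
    return [m.group(0) for m in STAGE2_PATTERN.finditer(read)]


def count_telomeric_motifs(reads: Iterable[str]) -> Tuple[int, int]:
    """Return (reads passing stage 1, total stage-2 motif runs in those reads)."""
    reads_kept = 0
    motif_runs = 0
    for read in reads:
        read = read.upper()
        if not passes_stage1(read):
            continue  # most reads are rejected here without running the regex
        reads_kept += 1
        motif_runs += len(stage2_matches(read))
    return reads_kept, motif_runs
```

Running the inexpensive substring test first means the slower regex is applied only to the small fraction of reads that are likely to be telomeric, which is the design choice behind the short runtime reported for this approach.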
