**Xianquan Zhan, MD, PhD.**


**Chapter 1**

Introductory Chapter: Proteoforms

*Xianquan Zhan*

**1. Introduction**

The completion of the human genome sequence has shifted the focus of research from structural genomics to functional genomics. Transcriptomics and proteomics are two main components of the functional genomics era. The human genome contains about 20,300 genes [1]. However, RNA splicing and other factors in the transcription process from a gene to RNA produce multiple transcripts from a single gene. Thus, the human transcriptome is estimated to contain at least 100,000 transcripts, far more than the number of human genes. Each transcript guides the ribosome to synthesize the amino acid sequence of a protein. The protein synthesized at the ribosome must be translocated and redistributed to the appropriate location, where it adopts a specific conformation and interacts with surrounding molecules to form a complex, in order to exert its biological functions. In addition, a protein is modified by many posttranslational modifications (PTMs), and even by unknown factors, during translocation and redistribution. An estimated 400–600 PTMs in the human body are the main cause of the complexity and diversity of proteins, namely, protein species [2, 3] or proteoforms [4, 5]. Thus, multiple proteoforms are often derived from the same transcript, and the human proteome is estimated to contain at least 1,000,000 proteoforms [6]. A proteoform is the basic unit of a proteome; it is defined by its amino acid sequence + PTMs + spatial conformation + localization + cofactors + binding partners + function (**Figure 1**), and it is the final functional performer of a gene [6]. A protein is an umbrella term for all proteoforms coded by the same gene. Moreover, different proteoforms derived from the same gene might have different conformations and functions. Each proteoform has its own copy number or abundance, which can be quantified between given conditions [4].
Studies on proteoforms will offer much deeper insights into a proteome, which will directly lead to the discovery of reliable biomarkers for understanding accurate molecular mechanisms, to the discovery of effective therapeutic targets, and to effective prediction, diagnosis, and prognostic assessment.
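The proteoform definition above can be read as a record with one field per component. The following is a minimal Python sketch of that reading, not a standard schema: the class names `Proteoform` and `Protein`, all field names, the gene label `GENE1`, the sequence, and the copy numbers are hypothetical values invented for illustration. It shows that two entries with the same gene and amino acid sequence but different PTMs count as distinct proteoforms, each with its own abundance, under one protein.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Proteoform:
    """One proteoform, with one field per component of the definition.

    Field names are illustrative assumptions, not a standard schema.
    """
    gene: str
    sequence: str                          # amino acid sequence
    ptms: frozenset = frozenset()          # (position, modification) pairs
    conformation: str = "unknown"
    localization: str = "unknown"
    cofactors: frozenset = frozenset()
    binding_partners: frozenset = frozenset()
    function: str = "unknown"


@dataclass
class Protein:
    """Umbrella term: all proteoforms coded by the same gene."""
    gene: str
    abundance: dict = field(default_factory=dict)  # proteoform -> copy number

    def add(self, proteoform, copies):
        # A proteoform belongs to a protein only if it shares the gene.
        if proteoform.gene != self.gene:
            raise ValueError("proteoform is coded by a different gene")
        self.abundance[proteoform] = self.abundance.get(proteoform, 0) + copies


# Made-up example: same gene and sequence, different PTMs.
unmodified = Proteoform(gene="GENE1", sequence="MKTSAY")
phospho = Proteoform(gene="GENE1", sequence="MKTSAY",
                     ptms=frozenset({(4, "phospho")}))

protein = Protein(gene="GENE1")
protein.add(unmodified, 120)
protein.add(phospho, 30)

# The two entries differ only in PTMs, yet are stored as separate proteoforms.
print(len(protein.abundance))
```

Making `Proteoform` immutable (`frozen=True`) lets it serve as a dictionary key, so a per-proteoform abundance can be looked up and compared between conditions.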

Studying the more than one million human proteoforms is a major methodological challenge [1, 6]. The common bottom-up mass spectrometry (MS)-based strategies cannot identify proteoforms; in fact, they identify only the protein-coding gene, that is, a protein group. These methods include stable isotope-labeled two-dimensional liquid chromatography-tandem mass spectrometry (2DLC-MS/MS) and stable isotope-free 2DLC-MS/MS, which identify only peptides and PTMs (**Figure 2**) [6]. Top-down MS-based strategies have been developed to identify proteoforms [7–9]. They obtain proteoform information that includes the amino acid sequence and PTMs. However, this is only part of the information in the proteoform definition given above. Also, the protein must be purified prior to MS analysis, using protein isolation techniques such as capillary zone electrophoresis (CZE) and liquid

Professor of Cancer Proteomics and PPPM, University Creative Research Initiatives Center, Shandong First Medical University, Shandong, China

