Introductory Chapter: Proteoforms

*Xianquan Zhan*

### **1. Introduction**

The completion of the human genome sequence has shifted the research focus from structural genomics to functional genomics. Transcriptomics and proteomics are the two main fields of the functional genomics era. The human genome contains about 20,300 genes [1]. However, RNA splicing and other factors in the transcription of a gene into RNA yield multiple transcripts from a single gene. Thus, the human transcriptome is estimated to contain at least 100,000 transcripts, far more than the number of human genes. Each transcript guides the ribosome to synthesize the amino acid sequence of a protein. The protein synthesized at the ribosome must be translocated and redistributed to the appropriate location, where it adopts a specific conformation and interacts with surrounding molecules to form a complex and exert its biological functions. During translocation and redistribution, the protein is also modified by many posttranslational modifications (PTMs) and possibly by unknown factors. An estimated 400–600 PTMs in the human body are the main cause of the complexity and diversity of proteins, that is, of protein species [2, 3] or proteoforms [4, 5]. Thus, multiple proteoforms are often derived from the same transcript, and the human proteome is estimated to contain at least 1,000,000 proteoforms [6]. A proteoform is the basic unit of a proteome. It is defined by its amino acid sequence + PTMs + spatial conformation + localization + cofactors + binding partners + function (**Figure 1**), and it is the final functional performer of a gene [6]. "Protein" is an umbrella term for all proteoforms encoded by the same gene. Moreover, different proteoforms derived from the same gene may have different conformations and functions. Each proteoform has its own copy number or abundance, which can be quantified between given conditions [4]. Studies of proteoforms will offer much deeper insight into a proteome, which will directly lead to the discovery of reliable biomarkers, a more accurate understanding of molecular mechanisms, the discovery of effective therapeutic targets, and more effective prediction, diagnosis, and prognostic assessment.
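The definition above can be read as a record with one field per component. The following Python sketch is only an illustration of that reading: the class name `Proteoform`, its field names, the helper `group_by_gene`, and the example values (`GENE_X`, the short sequence) are assumptions made here, not an established data model or measured data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Proteoform:
    """One proteoform, with one field per component of the definition in the text."""
    gene: str                      # coding gene
    sequence: str                  # amino acid sequence
    ptms: tuple = ()               # PTMs as (residue position, modification) pairs
    conformation: str = "unknown"  # spatial conformation
    localization: str = "unknown"
    cofactors: tuple = ()
    binding_partners: tuple = ()
    function: str = "unknown"
    abundance: float = 0.0         # copy number or relative abundance


def group_by_gene(proteoforms):
    """Group proteoforms under their coding gene ("protein" as an umbrella term)."""
    groups = defaultdict(list)
    for p in proteoforms:
        groups[p.gene].append(p)
    return dict(groups)


# Placeholder records: same hypothetical gene and sequence, different PTMs.
unmodified = Proteoform(gene="GENE_X", sequence="MKTAYIAK")
phosphorylated = Proteoform(gene="GENE_X", sequence="MKTAYIAK",
                            ptms=((3, "phospho"),))

proteins = group_by_gene([unmodified, phosphorylated])
print(len(proteins["GENE_X"]))       # two proteoforms under one gene
print(unmodified == phosphorylated)  # False: the PTM sets differ
```

In this sketch, two records with the same sequence but different PTMs compare as different proteoforms and are still grouped under one gene, which mirrors the distinction the text draws between a proteoform and a protein.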

Studying the more than one million human proteoforms is a major methodological challenge [1, 6]. The common bottom-up mass spectrometry (MS)-based strategies cannot identify proteoforms; in fact, they identify only the protein-coding gene, that is, a protein group. These methods include stable isotope-labeled two-dimensional liquid chromatography-tandem mass spectrometry (2DLC-MS/MS) and stable isotope-free 2DLC-MS/MS, which identify only peptides and PTMs (**Figure 2**) [6]. Top-down MS-based strategies have been developed to identify proteoforms [7–9]. They provide proteoform information that includes the amino acid sequence and PTMs. However, this is only part of the information in the proteoform definition given above. In addition, the protein must be purified before MS analysis with protein separation techniques such as capillary zone electrophoresis (CZE) and liquid chromatography (LC) [10, 11]. Another drawback is the low signal-to-noise ratio (S/N) in the MS analysis. Together, these factors result in a relatively low throughput in the identification of human proteoforms. Currently, the maximum throughput of top-down MS is up to 5700 proteoforms corresponding to 860 proteins (**Figure 2**) [6].

*Proteoforms - Concept and Applications in Medical Sciences*

*DOI: http://dx.doi.org/10.5772/intechopen.91403*

**Figure 1.**

*The concept and formation model of proteoform. (Reproduced from Zhan et al. [1, 6], copyright permission with open access policy.)*

**Figure 2.**

*The methods to study proteoforms. (Reproduced from Zhan et al. [1, 4, 6], copyright permission with open access policy.)*

The two-dimensional gel electrophoresis (2DE)-liquid chromatography-MS (2DE-LC/MS) strategy combines a top-down technique (2DE) with a bottom-up technique (LC/MS) and is currently a superhigh-throughput method for identifying proteoforms on a large scale [1, 6, 12, 13]. With innovations in the concept and practice of 2DE, 2DE is a true prefractionation method that effectively resolves the isoelectric point (pI) and the relative mass (Mr), two essential parameters of a proteoform. Each 2D gel spot contains from over 50 to several hundred proteoforms, most of which are of low abundance. Currently, the largest 2D gel is 30 cm × 40 cm and can separate 10,000 2D gel spots; thus, at least 500,000 to 1,000,000 proteoforms can be identified. LC/MS can identify protein sequences and some PTMs (**Figure 2**) [1, 13]. 2DE-LC/MS therefore has great potential for the large-scale analysis of proteoforms. 2DE-LC/MS and top-down MS are complementary in achieving maximum coverage of the human proteoforms in a proteome.
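The capacity estimate for a single large 2D gel is a simple multiplication of spots by proteoforms per spot. The Python sketch below reproduces it with the 10,000 spots quoted in the text. The per-spot counts are assumptions: 50 is the lower bound given in the text, and 100 is inferred here from the 1,000,000 figure rather than stated by the source. The function and constant names are hypothetical.

```python
def estimated_proteoforms(spots: int, proteoforms_per_spot: int) -> int:
    """Rough capacity of one 2D gel: spots times proteoforms per spot."""
    return spots * proteoforms_per_spot


SPOTS_PER_GEL = 10_000  # largest gel quoted in the text (30 cm x 40 cm)

# Assumed per-spot counts: 50 is the text's lower bound, 100 is inferred.
for per_spot in (50, 100):
    print(per_spot, estimated_proteoforms(SPOTS_PER_GEL, per_spot))
```

The calculation treats every spot as holding the same number of proteoforms, so it gives an order-of-magnitude figure only.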

The proteoform is the final functional form of a protein encoded by a gene. It has important scientific merit in the life sciences and medical sciences, and it is a research hot spot at the international scientific frontier. In the past 1–2 years, proteoform studies have gradually attracted more attention. A PubMed search with the keyword "proteoform or proteoforms" returns a total of 532 publications. For example, 24 growth hormone (GH) proteoforms were identified with 2DE-LC/MS in human pituitary tissues [14], and the 20 and 22 kDa GH proteoforms differed in their signaling profiles. Six prolactin (PRL) proteoforms were identified with 2DE-LC/MS and 2DE-Western blot in human pituitary tissues, and the proportional ratios of the six PRL proteoforms differed significantly among different subtypes of nonfunctional pituitary adenoma relative to control pituitary tissues [15]. The six PRL proteoforms bind to different long or short PRL receptors to exert their functions. In seminal plasma, a total of 3090 proteoforms were identified with liquid chromatography-MS (LC/MS) and 417 proteoforms with sheathless CZE-MS [10]. A total of 3028 proteoforms corresponding to 387 proteins from *E. coli* cells were identified by coupling size exclusion chromatography (SEC) to CZE-activated ion electron transfer dissociation (CZE-AI-ETD) [16]. Human sperm protamine proteoforms were identified with a combination of top-down and bottom-up MS [17]. Glioblastoma [12, 13] and pituitary adenoma [13–15] tissue proteoforms were investigated with 2DE-LC-MS/MS. Proteoforms were identified from several cell lines (HepG2, glioblastoma, LEH) with 2DE-LC/MS [18]. Proteoform dynamics underlying the senescence-associated secretory phenotype have also been investigated [19].

In summary, the development of the concept of proteoforms, or protein species, significantly enriches the concept of the proteome and represents the next-generation research direction in proteomics. 2DE-LC/MS and top-down MS are complementary methods for studying proteoforms on a large scale. In-depth investigation of the proteoforms in a proteome under different pathophysiological conditions will directly help to understand the molecular mechanisms of disease in depth, discover reliable and effective therapeutic targets, and identify effective predictive, diagnostic, and prognostic biomarkers. Further, each proteoform is involved in a molecular network system and has multiple PTMs. How different PTMs competitively or synergistically affect proteoform structure and function, and the molecular network systems in which proteoforms are involved, is a research hot spot [20–24]. Molecular network-based proteoform pattern biomarkers will have greater scientific merit.

Proteoforms are involved in all of the life sciences and medical sciences. This book covers only a fraction of the important frontier of "proteoforms," and it is meant as a spur to stimulate and encourage researchers who study proteoforms to bring their scientific merit into research and clinical practice. This book focuses on the concept of the proteoform, technologies for studying proteoforms, and applications of proteoforms.
