Molecular & Cellular Proteomics 4:1002-1008, 2005.
| ABSTRACT |
|---|
Δm values). Such Δm values found from the precise identification of 45 protein forms from HeLa cells reveal 34 coding single nucleotide polymorphisms, two protein forms from alternative splicing, and 12 diverse modifications (not including simple N-terminal processing), including a previously unknown phosphorylation at 10% occupancy. Automated protein identification was achieved with a median expectation value of 10⁻¹³ and often occurred simultaneously with dissection of diverse sources of protein variability as they occur in combination. Top down MS therefore has a bright future for enabling precise annotation of gene products expressed from the human genome by non-mass spectrometrists.
High throughput platforms based on MALDI (5) and ESI use MS/MS engines capable of spectral acquisition at a rate of >10⁴/week (6, 7). Recent studies indicate significant inefficiencies associated with such large scale "bottom up" analyses in mammalian systems, including imperfect enzymatic cleavage (8, 9) and some MS/MS spectra requiring manual interpretation/validation for identification. Despite these lingering difficulties, peptide analysis provides the best and most general method for large scale protein identification today, although information on nonsynonymous coding single nucleotide polymorphisms (cSNPs), alternative splicing (10), and PTMs remains challenging to obtain (2).
Recent developments by MacCoss et al. (11), Wu et al. (12), and Zhu et al. (13) use three proteases and multidimensional protein identification technology ("MudPIT") or isoelectric focusing, reversed-phase chromatography, and three mass spectrometers (13), respectively, to obtain mass information on ~70–99% of the primary protein structure. Combining intact protein measurement with near exhaustive peptide analysis of five proteins from human cells allowed detection of N-terminal modifications and one alternatively spliced transcript (13). Although cSNP analysis of abundant blood proteins is possible (14), a general informatic strategy has yet to systematically integrate DNA and RNA level data with the MS-based interrogation of the human proteome. This is accomplished here using a data base of human proteins tailored for the "top down" MS approach by combinatorial consideration of protein variability during a search (i.e. "shotgun annotation") (15). Although nucleic acid-based approaches represent the highest throughput and best overall methods for capturing information about SNPs, proteomics-based approaches allow cSNP genotyping concurrent with modification and splice variant identification.
The direct fragmentation of intact protein ions using FTMS now provides expectation values (Pscores) that are orders of magnitude better than searches based on tryptic peptides (16–18), a far more efficient and robust reconstruction process for the primary structure of the mature protein, and detection of more diverse mass discrepancies (Δm values) than targeted analysis approaches (e.g. for phosphopeptides). Major limitations for top down MS are that proteins >50 kDa are difficult to handle routinely, that low percent occupancy and multivalent PTMs (such as glycosylations) are difficult to detect, and that only medium scale projects (<200 proteins from microorganisms) have been achieved (19). The top down MS/MS approach using standard fragmentation methods or electron capture dissociation (ECD) has provided 100% coverage with localization of basic PTMs for proteins in Bacteria (17, 20), Archaea (16, 17), yeast (19, 21, 22), and a plant (23).
Here we demonstrate unparalleled characterization of human (nuclear) proteins revealing seven different types of modifications in regulation and maturation including a novel phosphoprotein. This was achieved by extending the data base concept of shotgun annotation from a single human histone (15) to a proteomic scale and required the integration of diverse DNA, RNA, and protein level information. This work establishes the basis for routine application of top down MS to capture coding haplotypes within a gene and allele-specific splicing and modification patterns on a far greater number of human proteins.
| EXPERIMENTAL PROCEDURES |
|---|
For ~25% of identified proteins, including barrier-to-autointegration factor (BAF) (see Fig. 4), the PF two-dimensional (2D) system (Beckman Coulter) was used to separate proteins by pI, and RPLC was then carried out as outlined in the PF 2D manual.
The instrument used in this study was a custom 8.5-tesla Q-FTMS of the Marshall design (25). In the case of CAD external to the magnet bore, ions were selected using the quadrupole and fragmented using electrostatic acceleration (10–45 V) into an octopole pressurized to ~10 millitorr with nitrogen gas. In the case of IRMPD or ECD, a SWIFT window 7 m/z wide was used. The isolated charge state was then dissociated using infrared laser radiation for 0.25–0.45 s (with a beam expander mounted in front of the laser, 40 watts, 75% power). After threshold dissociation, the quad-enhanced and SWIFT-isolated species was dissociated using ECD. Electrons were introduced to the cell for 100–200 ms using a dispenser cathode 35 inches from the center of the magnet. The kinetic energy of the electrons was controlled by placing a 12-V bias potential on the filament of the dispenser cathode.
Automated Data Acquisition
A custom TCL automation script first acquired 5–10 broadband scans followed by a quadrupole marching experiment, and upon completion a modified THRASH algorithm (26) automatically determined Mr values, resulting in a peak list that was then used to select proteins for MS/MS analysis. The most abundant charge state of each protein was selectively accumulated using a notch-filtering quadrupole window 10 m/z wide, automatically acquiring 5–10 scans. For targeted proteins, 25 or 50 scans of axial CAD or IRMPD were recorded to yield protein identifications. Automatically acquired ECD spectra were the sum of 100 scans.
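As a rough illustration of the isolation step described above, the m/z of a chosen charge state can be estimated from a neutral Mr value, and a 10 m/z window centered on it. This is only a sketch: the function names are hypothetical, the protein is assumed to be protonated (positive-mode ESI), and the isotopic envelope of a real protein is ignored.

```python
# Sketch only: function names are hypothetical, not part of the authors' TCL script.
PROTON_MASS = 1.007276  # Da, mass of a proton


def mz(neutral_mass: float, charge: int) -> float:
    """m/z of an [M + zH]^z+ ion for a given neutral mass and charge."""
    if charge <= 0:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge


def isolation_window(neutral_mass: float, charge: int, width: float = 10.0):
    """Lower and upper m/z bounds of a window centered on the charge state."""
    center = mz(neutral_mass, charge)
    return center - width / 2, center + width / 2


# Example: the 10,191.1-Da protein discussed later, assuming a 10+ charge state.
low, high = isolation_window(10191.1, 10)
print(f"{mz(10191.1, 10):.3f} m/z, window {low:.3f} to {high:.3f}")
```

The charge state in the example is an assumption chosen for illustration; the text does not state which charge state was isolated.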
Construction of the Custom Human Data Base
A highly annotated data base of human protein forms was created within ProSight Warehouse (27) using conflict sequences, splicing data, PTMs from UniProt (28), SNP information from dbSNP, and a variety of manually entered data, such as new PTMs found in the primary literature. UniProt data bases were transformed from Swiss-Prot format by a custom data base loader created using Perl scripts and BioPerl libraries. To populate the data base with SNP information, dbSNP was queried for nonsynonymous, coding polymorphisms with an available corresponding protein accession number. The resultant information was populated to a local data base. Using a portion of dbSNP running locally, protein sequence information and function/description were obtained. Using custom Perl scripts, the results were converted to the necessary ProSight Warehouse format. A data base loader application then extracted the protein information and populated ProSight Warehouse with all possible protein forms based on combinations of known variations for each gene product (15). The current number of protein forms in the human data base is 2,823,267 yielding a structured query language data base of 3.5 gigabytes with 17,333 proteins containing 110 cSNPs for subsequent searching using ProSight Retriever (29).
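The combinatorial expansion behind shotgun annotation can be sketched in a few lines. The example below is not the ProSight Warehouse loader; it is a minimal illustration, using a made-up nine-residue sequence and two made-up cSNPs, of how every combination of alleles, initiator Met removal, and N-terminal acetylation yields a distinct protein form.

```python
from itertools import product


def enumerate_forms(sequence, snps):
    """Return (mature_sequence, n_terminally_acetylated) for every combination.

    sequence -- precursor sequence including the initiator Met
    snps     -- list of (position, reference_residue, alternate_residue),
                with 1-based positions in the precursor sequence
    """
    forms = []
    # Each cSNP contributes two choices: reference allele or alternate allele.
    for alleles in product((False, True), repeat=len(snps)):
        residues = list(sequence)
        for use_alt, (pos, ref, alt) in zip(alleles, snps):
            if residues[pos - 1] != ref:
                raise ValueError(f"expected {ref} at position {pos}")
            if use_alt:
                residues[pos - 1] = alt
        variant = "".join(residues)
        # Met on/off and acetyl on/off, as in the default search options.
        for met_removed, acetylated in product((False, True), repeat=2):
            if met_removed and not variant.startswith("M"):
                continue  # nothing to remove
            mature = variant[1:] if met_removed else variant
            forms.append((mature, acetylated))
    return forms


# Hypothetical toy input: 2 cSNPs x Met on/off x acetyl on/off = 16 forms.
toy_forms = enumerate_forms("MSTKAIHEL", [(6, "I", "V"), (8, "E", "K")])
print(len(toy_forms))
```

The count grows as 2 to the power of the number of cSNPs, multiplied by the number of processing and modification states, which is one reason a data base of 17,333 proteins can expand to millions of forms.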
Data Analysis and Data Base Searching
Intact protein MS and MS/MS data were analyzed by THRASH (26), resulting in a protein list and fragment ion list that were uploaded onto the ProSight PTM (27) web server for data base searching (prosightptm.scs.uiuc.edu). The criteria for data base searching were generally a ±2000 Mr window and 5–20-ppm tolerance for fragment ions with default search options selected as follows: Met, on/off; acetyl, on/off; and SNPs, on. Pscores reported in this study were calculated as reported previously (16), and those <10⁻³ required no manual validation of the identification result. Unless noted otherwise, Mr and fragment ion mass values reported are for neutral, monoisotopic peaks (using external calibration), and protein identification numbers are UniProt primary accession numbers.
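The two search criteria named above, an intact mass window and a ppm tolerance on fragment ions, can be illustrated with a short sketch. This is not the ProSight PTM scoring code and does not compute a Pscore; the function names and the toy data base are hypothetical, and the sketch only shows how the window and tolerance filter candidates and count fragment matches.

```python
def ppm_error(observed: float, theoretical: float) -> float:
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def candidates_in_window(observed_mr, database, window_da=2000.0):
    """Names of protein forms whose Mr lies within +/- window_da of observed_mr."""
    return [name for name, mr in database.items()
            if abs(mr - observed_mr) <= window_da]


def count_matching_fragments(observed, theoretical, tol_ppm=20.0):
    """Number of observed fragment masses within tol_ppm of any theoretical one."""
    return sum(
        any(abs(ppm_error(obs, theo)) <= tol_ppm for theo in theoretical)
        for obs in observed
    )


# Hypothetical toy data base keyed by made-up form names.
toy_db = {"form_a": 6657.7, "form_b": 11644.8, "form_c": 10191.1}
print(candidates_in_window(6657.71, toy_db))
print(count_matching_fragments([1000.01, 2000.10], [1000.0, 2000.0]))
```

A wide intact mass window keeps forms carrying unanticipated Δm values in the candidate list, while the tight fragment tolerance is what discriminates among them.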
| RESULTS AND DISCUSSION |
|---|
~1 kb in the human genome and 50,973 cSNPs currently known in dbSNP alone, well over half of human genes contain cSNPs, and top down MS/MS should enable robust genotyping even in the presence of PTMs. Fractions generated from a previously reported 2D separation of intact proteins (21) typically contain multiple proteins of varying abundance as in the ESI/Q-FTMS spectrum of Fig. 1a. Of the seven components, proteins of 6657.71 Da and 11,644.8 Da were selectively accumulated and fragmented by CAD and separately using ECD (spectra not shown). The CAD fragmentation data of Fig. 1b identified the 6.7-kDa component as a mitochondrial proteolipid (Pscore, 4 × 10⁻⁷) containing a known cSNP encoding an I9V residue change (Δm = 14.02 Da). Only the Ile-9 allele was observed with an intact mass error of 18 ppm. The 11.6-kDa component was identified from the Fig. 1c MS/MS data to be calgizzarin S100C (Pscore, 1 × 10⁻¹²). The calgizzarin gene contains a cSNP translating to a 1-Da variability (E36K), readily resolved for the Glu-36 allele observed in the background of N-terminal methionine loss/acetylation (overall 0.6-ppm error). This illustrates the efficiency of intact protein MS/MS for genotyping cSNPs, a feat not often possible using digestion-based approaches. Determination of minihaplotypes in coding regions (i.e. the co-occurrence of multiple alleles in a coding sequence) should also be possible using endogenous material itself instead of in vitro produced/artificial peptides from PCR products (30).
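The mass differences quoted for the two cSNPs follow directly from monoisotopic residue masses. The sketch below uses standard residue masses for the four amino acids involved; the function names are hypothetical. It also expresses the E36K difference in ppm of the 11,644.8-Da intact mass, which shows why a sub-ppm measurement resolves the two alleles.

```python
# Standard monoisotopic residue masses (Da) for the residues discussed above.
RESIDUE_MASS = {
    "I": 113.08406,  # isoleucine
    "V": 99.06841,   # valine
    "E": 129.04259,  # glutamic acid
    "K": 128.09496,  # lysine
}


def substitution_delta(reference: str, alternate: str) -> float:
    """Mass change (Da) when `reference` is replaced by `alternate`."""
    return RESIDUE_MASS[alternate] - RESIDUE_MASS[reference]


def delta_in_ppm(delta_da: float, intact_mass: float) -> float:
    """Size of a mass difference relative to the intact mass, in ppm."""
    return abs(delta_da) / intact_mass * 1e6


i9v = substitution_delta("I", "V")   # about -14.02 Da
e36k = substitution_delta("E", "K")  # about -0.95 Da, the "1-Da variability"
print(f"I->V: {i9v:.3f} Da, E->K: {e36k:.3f} Da")
print(f"E->K is {delta_in_ppm(e36k, 11644.8):.0f} ppm of 11,644.8 Da")
```

Under these assumptions the E36K alleles differ by roughly 80 ppm at the intact mass level, far larger than the 0.6-ppm error reported for the Glu-36 form.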
Tyr change (Δm = 26.00 Da). Only the His-124 form was observed, indicating that these cells are homozygous at this locus. The observed intact mass contained a Δm of 42.01 ± 0.02 Da localized to the first five N-terminal residues (Fig. 2d). This Δm is most likely acetylation of the N terminus, although this same modification at Lys-5 is formally possible. Thus, an automated data flow can now differentiate between posttranslationally modified and cSNP-containing isoforms even in highly conserved gene families.
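A measured Δm with its uncertainty can be compared against a table of known modification masses to see which assignments remain possible. The sketch below is illustrative only: the table holds a handful of standard monoisotopic shifts chosen for this example (it is not the modification list used by ProSight PTM), and the function name is hypothetical.

```python
# Standard monoisotopic mass shifts (Da); a small illustrative subset.
KNOWN_SHIFTS = {
    "methylation": 14.015650,      # CH2
    "acetylation": 42.010565,      # C2H2O
    "trimethylation": 42.046950,   # C3H6
    "phosphorylation": 79.966331,  # HPO3
}


def consistent_modifications(delta_m: float, uncertainty: float):
    """Names of modifications whose shift lies within delta_m +/- uncertainty."""
    return sorted(
        name for name, shift in KNOWN_SHIFTS.items()
        if abs(delta_m - shift) <= uncertainty
    )


# The 42.01 +/- 0.02 Da discrepancy reported above.
print(consistent_modifications(42.01, 0.02))
```

With this table, a Δm of 42.01 ± 0.02 Da is compatible with acetylation but not with the nominally isobaric trimethylation, which lies about 0.036 Da higher. Mass alone does not decide between the N terminus and Lys-5; that requires the fragment ion localization described in the text.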
(ProTα; Fig. 3d). ProTα is encoded by six family members with high sequence homology (31, 32). The family member observed contains four introns and from EST data is known to be alternatively spliced due to a rare GAGGAG motif that creates adjacent AG acceptor sites at the intron 2/exon 3 boundary (Fig. 3e) (33). In most tissues, ~10% of this mRNA contains an extra GAG codon (encoding an extra Glu) versus 90% of ProTα transcripts where the more 5' acceptor site is used, producing a form with one less residue (33). Upon examination of the broadband spectrum, both species were observed in a ~10:1 ratio of light versus heavy protein (Fig. 3a). The minor species was subsequently fragmented (Fig. 3c), and the extra Glu residue was precisely localized (Fig. 3d, right).
with only ~150 matching the longer form, consistent with an earlier finding that the ~9:1 ratio of short:long is not tissue-specific (33). Also using BLAST, the GAGGAG motif at this locus was found only in primates. Neither rat nor mouse has the extra splice acceptor site, and both have evolved only the long form of the protein, which is actually the less favorable form in humans.
Identification of a Novel Phosphoprotein
As a last illustration of new advantages provided by the top down MS approach, the 10,191.1-Da BAF protein was identified in a nuclear extract and exhibited a +79.95 ± 0.05-Da satellite peak at ~10% occupancy consistent with phosphorylation (Fig. 4a). The data from automated MS/MS localized the phosphorylation to the 11 N-terminal residues. Manual MS/MS using electrons further confirmed a Met off/acetylated N terminus and narrowed the region of phosphorylation to Thr-2 or Ser-3 (Fig. 4b). This well studied protein directly binds to chromatin, is thought to be involved in attachment of chromatin to the inner nuclear membrane (34), and is not known to be modified. No other forms of this protein have been observed in adjacent fractions, and the pI change caused by phosphorylation is small enough to allow coelution of both forms in identical fractions during chromatofocusing and RPLC. With the two-dimensional fractionation behavior of this modified protein now known, detection of this protein from nuclear extracts was reproduced twice more. This now allows targeted studies on this protein from synchronized HeLa cells in a straightforward manner. Such a platform for biochemical interrogation of targeted proteins after RNA interference, chemical perturbation, or cell synchronization will be highly valuable for capturing a more detailed picture of functional regulation mechanisms involving PTM dynamics.
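Occupancy figures such as the one above are typically estimated from the relative abundances of the modified and unmodified forms in the intact mass spectrum. The sketch below shows that arithmetic under a loudly stated assumption: both forms are taken to ionize and be detected with equal efficiency. The function names and the example intensities are hypothetical, not values from Fig. 4a.

```python
PHOSPHO_SHIFT = 79.966331  # Da, monoisotopic mass of HPO3


def expected_modified_mass(unmodified_mr: float,
                           shift: float = PHOSPHO_SHIFT) -> float:
    """Expected Mr of the singly modified form."""
    return unmodified_mr + shift


def occupancy(modified_intensity: float, unmodified_intensity: float) -> float:
    """Fraction of the protein population carrying the modification.

    Assumes equal ionization and detection efficiency for both forms.
    """
    total = modified_intensity + unmodified_intensity
    if total <= 0:
        raise ValueError("intensities must sum to a positive value")
    return modified_intensity / total


# Made-up relative intensities for illustration only.
print(f"expected phospho-form Mr: {expected_modified_mass(10191.1):.2f} Da")
print(f"occupancy: {occupancy(1.0, 9.0):.0%}")
```

Because both forms are measured in the same spectrum of the intact protein, this ratio does not depend on the recovery of any particular peptide.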
Summary of Findings and Outlook
Using a dual ion fragmentation approach to automatically analyze two to three small human proteins per fraction by top down MS/MS, 45 proteins were identified with a median probability score of 10⁻¹³ (Table I). A main advantage of the top down strategy is that information on the entire primary structure of the mature protein is obtained, allowing reliable dissection and abundance measurements of highly related gene products from genetic or transcriptional variation and enzymatic modification. Due to the complementary nature of ion fragmentation using electrons and collisions with gas, precise localization of PTMs, polymorphisms, and amino acids at splice junctions is indeed possible. For the identified proteins, 45% were found in forms not present in UniProt's Human Proteome Initiative, and ~40% contained SNPs for which only single alleles were observed. Over 85% of the identifications required no manual validation of the data base retrieval result; Δm localization sometimes improved upon inspection of the raw data. Characterization of closely related protein forms (e.g. different PTM isomers or SNP forms) sometimes required manual scrutiny of the output from ProSight PTM, with the correct form yielding the highest score in the retrieval list in ~90% of cases.
Δm values. This strategy represents a major shift in curation philosophy for protein data bases (35), is well suited for a top down approach using FTMS, and recognizes that detailed information on SNPs, mutations (36), splice variants (37), and PTMs (38) will be increasingly known and even somewhat predictable (36). By embedding such variability tightly within a MS retrieval engine, the current study drastically improves identification metrics, enables known biological events to be characterized as they occur in combination, and allows unknown biology to be uncovered more efficiently. Shotgun annotation actually increases the quality of most retrievals by allowing more absolute mass values of fragment ions observed in a top down MS/MS experiment to match those values generated from protein forms housed in a data base. The examples highlighted here illustrate an overall process that can simply be called "proteotyping." The term proteotyping is akin to genotyping at the DNA level but captures all the variability of proteins as they occur in populations and change over time. Fragmentation of intact proteins represents an emergent method for "reverse annotation" of the human genome, and top down MS can now be embraced by organizations such as the Human Proteome Organization.
| ACKNOWLEDGMENTS |
|---|
| FOOTNOTES |
|---|
Published, MCP Papers in Press, April 28, 2005, DOI 10.1074/mcp.M500064-MCP200
1 The abbreviations used are: PTM, posttranslational modification; Δm, mass discrepancy; SNP, single nucleotide polymorphism; cSNP, nonsynonymous coding single nucleotide polymorphism; ECD, electron capture dissociation; CAD, collisionally activated dissociation; IRMPD, infrared multiphoton dissociation; BAF, barrier-to-autointegration factor; SWIFT, stored waveform inverse FT; THRASH, thorough high resolution analysis of spectra by Horn; RP, reversed-phase; PF, protein fractionation; 2D, two-dimensional; Q, quadrupole; ProTα, prothymosin α; EST, expressed sequence tag; db, data base.
* This work was supported by National Science Foundation Career Award CH 0134953, National Institutes of Health Grant GM 067193, the Sloan Foundation, the University of Illinois Urbana-Champaign Center of Neuroproteomics (p30_DAO18310), the Research Corporation (Cottrell Scholars Program), and the Henry and Lucille Packard Foundation.
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
S The on-line version of this manuscript (available at http://www.mcponline.org) contains supplemental material.
To whom correspondence should be addressed: Dept. of Chemistry, University of Illinois Urbana-Champaign, 39 RAL, 600 S. Matthews, Urbana, IL 61801. E-mail: Kelleher{at}scs.uiuc.edu
| REFERENCES |
|---|
α: contradictory past and new horizons. Peptides 21, 1433–1446
α gene family. Evidence against export of the gene products. J. Biol. Chem. 264, 7546–7555
α pre-mRNA. J. Mol. Biol. 234, 281–288