Molecular & Cellular Proteomics 6:1589-1598, 2007.
© 2007 by The American Society for Biochemistry and Molecular Biology, Inc.

From the Institute of Molecular Systems Biology, ETH Zürich, CH-8093 Zürich, Switzerland, Faculty of Sciences, University of Zürich, CH-8006 Zürich, Switzerland, and ¶ Institute for Molecular Systems Biology, Seattle, Washington 98103
| ABSTRACT |
|---|
Shotgun proteomics is founded on the assumption that each protein present in a sample reproducibly generates a relatively small number of peptides, the boundaries of which conform to the cleavage specificity of the protease used. Trypsin cleaves at the C termini of arginine and lysine residues, and based on the occurrence of these two amino acids in proteins, an average of 10 peptides is expected for a stretch of a hundred residues. The validity of this assumption is critical to the success of proteomics experiments for several reasons. First, the number of peptides present in a protein digest determines the number of MS/MS cycles minimally required to fully analyze the sample. Because the MS-MS/MS duty cycle is fixed for a particular mass spectrometer (better than 1 Hz for modern instruments), in theory the number of peptides to be analyzed in the sample determines the minimal duration of a proteomics experiment. Second, the sample complexity determines the practical dynamic range of proteome analyses. The nominal dynamic range of a mass spectrometer (at best 3–4 orders of magnitude) can in practice be significantly reduced by the fact that automated precursor ion selection (data-dependent acquisition (DDA)1) primarily focuses on the most intense MS signals. In the case of very complex samples, only the highest intensity ions are fragmented, whereas ions of lower intensity pass through the system unselected even though their signal is well within the nominal dynamic range of the instrument. Third, database searches are often constrained to fully tryptic peptides to restrict the searching time. Unspecific cleavages will thus produce peptides not anticipated by the search parameters, leading to misassignments or missed identifications.
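The expectation described above can be illustrated with a minimal sketch of an in silico tryptic digestion. It assumes strict cleavage after every Lys or Arg, with no missed cleavages and no proline exception; the function names, the toy sequence, and the one-MS/MS-event-per-peptide simplification are illustrative assumptions, not part of the original study.

```python
def tryptic_peptides(sequence):
    """Cut a protein sequence after every Lys (K) or Arg (R).

    Assumption: ideal specificity, no missed cleavages, no proline rule.
    """
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        if residue in "KR":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # C-terminal peptide of the protein (may not end in K or R)
        peptides.append(sequence[start:])
    return peptides


def minimal_msms_seconds(n_peptides, msms_per_second=1.0):
    """Lower bound on acquisition time, assuming one MS/MS event per peptide."""
    return n_peptides / msms_per_second


toy_sequence = "GASKLMNRPQW"  # arbitrary toy string, not a real protein
predicted = tryptic_peptides(toy_sequence)
print(predicted)
print(minimal_msms_seconds(len(predicted)))
```

Under this idealized model the number of predicted peptides equals the number of Lys and Arg residues (plus one if the C-terminal residue is neither), which is the basis of the estimate of roughly 10 peptides per hundred residues.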
Despite the development of peptide separation systems with higher peak capacity (5–7) and of tandem mass spectrometers with faster acquisition rate, the proteome of any species has yet to be fully mapped. The complete analysis of even moderately complex samples such as isolated organelles (8) or macromolecular complexes (9) has required enormous efforts. All these considerations suggest that the comprehensive analysis of a complex sample is more difficult than initially anticipated, and one of the reasons for this could be a degree of complexity resulting from proteolysis of the protein sample that is higher than expected.
To test this hypothesis and to assess the number of peptides actually generated by proteolysis of a protein, an in-depth characterization of the products of tryptic digestion of well defined proteins was carried out. Five pure bovine standard proteins, β-lactoglobulin, carbonic anhydrase, serum albumin, transferrin, and β-casein, were subjected individually to tryptic digestion, and the resulting peptide mixtures were extensively characterized by LC/MS/MS. To maximize the number of peptides identified in the proteolytic digests, a targeted MS/MS sequencing strategy (10–12) was applied and was compared with the intensity-driven data-dependent acquisition method. The targeted approach is based on inclusion lists of precursor ions to trigger collision-induced dissociation. In this approach, the samples were initially analyzed in a high accuracy mass spectrometer in full scan mode. Data were then extensively processed off-line to extract and inventory the monoisotopic ions of all the peptide ions observed. The samples were then subjected to MS/MS sequencing multiple times using inclusion lists to trigger fragmentation of the ions of interest present in the full MS scan, regardless of their intensity, with retention time and charge state as the only constraints. The approach was shown to be a robust and effective means to sequence low abundance ions and revealed a number of peptides produced from proteolysis of a protein that is at least 10 times higher than previously assumed.
| MATERIALS AND METHODS |
|---|
Protein Digestion—
Each protein was solubilized in 0.1 M ammonium bicarbonate buffer containing 8 M urea at a final concentration of 3 mg/ml. After tris(2-carboxyethyl)phosphine reduction and iodoacetamide alkylation of cysteine residues, the solution was diluted to a final concentration of 1 M urea, the pH was adjusted to 8.0, and the proteins were digested with trypsin at 37 °C. The digestion was repeated on the same set of proteins, after HPLC purification, to eliminate protein degradation products and other contaminants potentially present in the commercial protein preparations. Briefly, proteins were loaded onto a macroporous reverse-phase C18 column (mRP-C18, 4.6 × 50 mm, Agilent Technologies, Waldbronn, Germany). Elution was carried out with a linear gradient of water/acetonitrile containing 0.1 and 0.08% (v/v) TFA, respectively, from 5 to 65% in 60 min at a flow rate of 0.75 ml/min. The eluate was monitored by absorption measurements at 226 and 280 nm. Fractions containing the protein species were collected at the peak apex, lyophilized, and then subjected to trypsin digestion with the protocol described above. A range of different enzyme to substrate ratios (1:10 to 1:500) and different incubation times (from 2 to 24 h) were used for each protein digestion. In the case of β-lactoglobulin, digestion was also conducted with trypsin immobilized on agarose beads (Pierce), with recombinant trypsin from Pichia pastoris, and with the enzyme Lys-C from Lysobacter enzymogenes (Roche Diagnostics). The digestion was stopped by acidification with formic acid to a final pH of 4.0. The peptide mixtures were cleaned by OASIS HLB or Sep-Pak tC18 cartridges (Waters, Milford, MA) eluted with 80% acetonitrile. Alternatively, a protocol based on ammonium bicarbonate buffer exchange by gel filtration desalting columns (cutoff, 7000 Da; Pierce) prior to trypsin addition was used. Cleaned peptide samples were evaporated on a vacuum centrifuge to dryness, resolubilized in 0.1% formic acid, and immediately analyzed.
Control samples were also prepared, resulting from the same protocol steps, without addition of the protein substrate.
Mass Spectrometry Analysis—
Samples were analyzed on a hybrid LTQ-FT mass spectrometer (Thermo Electron, San Jose, CA) equipped with a nanoelectrospray ion source. Chromatographic separations of peptides were performed on an Agilent 1100 micro-HPLC system (Waldbronn, Germany) equipped with a 10-cm fused silica emitter (150-µm diameter) packed with a Magic C18 AQ 5-µm resin (Michrom BioResources, Auburn, CA). Peptides were loaded on the column from a cooled (4 °C) Agilent autosampler and separated with a linear gradient of acetonitrile/water containing 0.1% formic acid at a flow rate of 1.2 µl/min. A gradient from 5 to 50% acetonitrile in 50 min was used. For each peptide sample a standard DDA on the three most intense ions per MS scan was first performed. Three MS/MS spectra were acquired in the linear ion trap per FT-MS scan, the latter acquired at 100,000 full-width half-maximum nominal resolution, resulting in an overall cycle time of 1 s. Charge state screening was used, allowing fragmentation of singly and multiply charged ions and rejecting ions of unknown charge state. A threshold of 200 ion counts was set to trigger an MS/MS attempt. A manual extraction of all the m/z features (monoisotopic peaks, 12C) was performed on the first FT-MS file resulting from the analysis of single protein tryptic digests down to a signal-to-noise ratio of 5.
The extracted ions were incorporated into an inclusion list for subsequent analysis. The features were sorted by intensity and divided into multiple lists containing up to 150 ions each. Each digest was reanalyzed under targeted, inclusion list-driven selection of precursor ions for MS/MS analysis, a number of times corresponding to the number of sublists generated. Previously derived attributes of peptide ions, such as charge state and retention time, were used as constraints for triggering MS/MS attempts. For low intensity signals (ion counts <1000) the acquisition time was increased to improve the quality of MS/MS spectra (up to five averaged MS/MS microscans; trapping time, up to 500 ms).
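The construction of the sublists can be sketched as follows. This is only an illustration under stated assumptions: the `Feature` record, the function names, the descending sort order, and the default settings for higher intensity ions are hypothetical; only the list size of 150, the 1000 ion count cutoff, and the upper limits of five microscans and 500 ms come from the text.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    mz: float         # monoisotopic m/z from the FT-MS survey scan
    charge: int       # charge state assigned during feature extraction
    rt_min: float     # retention time in minutes
    intensity: float  # ion counts


def build_inclusion_lists(features, max_per_list=150):
    """Sort features by intensity and split them into sublists of up to 150 ions.

    Assumption: descending order; the text does not state the sort direction.
    """
    ranked = sorted(features, key=lambda f: f.intensity, reverse=True)
    return [ranked[i:i + max_per_list] for i in range(0, len(ranked), max_per_list)]


def acquisition_settings(feature, low_intensity_cutoff=1000.0):
    """Give low intensity precursors more microscans and a longer trapping time."""
    if feature.intensity < low_intensity_cutoff:
        return {"microscans": 5, "max_trap_ms": 500}  # upper limits quoted in the text
    return {"microscans": 1, "max_trap_ms": 100}      # hypothetical defaults


# Synthetic features, for illustration only.
features = [Feature(400.0 + i, 2, 10.0 + 0.1 * i, float(i)) for i in range(320)]
sublists = build_inclusion_lists(features)
print([len(sublist) for sublist in sublists])
```

Each sublist then corresponds to one additional LC/MS/MS run of the same digest, so the number of reanalyses equals the number of sublists.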
Data Analysis—
Raw MS/MS data were searched against the bovine National Center for Biotechnology Information (NCBI) non-redundant database using BioWorks 3.2 (Sequest, Thermo Electron). Human keratins and porcine trypsin were added to the bovine protein database. Precursor ion tolerance was set to 0.55, and fragment tolerance was set to 0.5 Da. MS/MS data acquired on the tryptic digest of human Pax-8 were searched against the human NCBI non-redundant database. Data were searched allowing oxidation of methionine as a variable modification with carboxyamidomethylation of cysteine residues as a static modification. In the standard searches, two missed cleavages and one non-tryptic terminus were allowed. Data were also searched with no enzyme constraints against the sequence of the protein of interest. Each peptide assignment (BioWorks Peptide Probability score 1) was subjected to manual validation, and only high quality matches were accepted. Additionally, data were searched for urea-induced carbamylation as a variable modification, at either the N terminus or Lys or Arg residues, with carboxyamidomethylation of cysteine residues as a static modification, allowing in this case only full tryptic cleavage and two missed cleavages. For each peptide assignment, the difference in ppm between the experimental and the theoretical monoisotopic values was calculated. The high accuracy of FT measurements was used as an additional and independent data filtering criterion.
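The mass accuracy filter amounts to a simple relative difference. A minimal sketch follows; the function names are illustrative, and the default of 3.5 ppm is the threshold quoted in the Results section.

```python
def ppm_error(observed_mz, theoretical_mz):
    """Signed difference between observed and theoretical m/z, in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def passes_mass_filter(observed_mz, theoretical_mz, max_ppm=3.5):
    """Accept an assignment only if the absolute precursor error is below max_ppm."""
    return abs(ppm_error(observed_mz, theoretical_mz)) < max_ppm


# Illustrative values only, not taken from the dataset.
print(ppm_error(500.001, 500.0))
print(passes_mass_filter(500.001, 500.0))
```

Because the filter uses the precursor mass from the FT-MS scan rather than the fragment spectrum, it is independent of the database search score.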
| RESULTS |
|---|
To maximize the number of peptides identified in the proteolytic digests, a targeted MS/MS sequencing strategy was applied and compared with the intensity-dependent DDA approach. The targeted approach is based on an inclusion list of precursor ions that are selected within a specified chromatographic elution window and trigger collision-induced dissociation. The method is schematically illustrated in Fig. 1. Initially the sample of interest is analyzed in LC/MS mode (full scan) in a high mass accuracy mass spectrometer. Ideally the LC/MS analysis is replicated to determine the ion features reproducibly associated with the sample. The raw data are processed and analyzed in depth off-line to extract all observed monoisotopic ions related to peptides with the pertaining attributes (m/z, charge state, retention time, and intensity). The sample is subsequently analyzed in MS/MS mode (several times if necessary), using inclusion lists containing the ions of interest to trigger collision-induced dissociation, with retention time and charge state as constraints. Sequencing parameters (MS/MS sampling time and triggering threshold) are adjusted according to the attributes of the features included. The number and length of the inclusion lists are chosen based on the inclusion capacity of the acquisition software and on the instrument sequencing rate. Fragment ion spectra acquired by targeted sequencing analyses are then assigned to peptide sequences by database searching, and the assignments are pooled.
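The triggering logic of the targeted approach can be sketched as a lookup against the inclusion list. This is a hypothetical illustration, not the instrument's acquisition software: the `Target` record, the function name, and the m/z tolerance and retention time window are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    mz: float      # monoisotopic m/z taken from the survey run
    charge: int    # charge state taken from the survey run
    rt_min: float  # expected retention time in minutes


def matching_target(observed_mz, observed_charge, current_rt_min, targets,
                    mz_tol_ppm=10.0, rt_window_min=2.0):
    """Return the first inclusion list entry matched by an observed precursor.

    A match requires the m/z to agree within mz_tol_ppm, the same charge state,
    and a current retention time within rt_window_min of the expected value.
    Intensity is deliberately not part of the decision. Both tolerances are
    hypothetical values, not taken from the study.
    """
    for target in targets:
        ppm = abs(observed_mz - target.mz) / target.mz * 1e6
        if (ppm <= mz_tol_ppm
                and observed_charge == target.charge
                and abs(current_rt_min - target.rt_min) <= rt_window_min):
            return target
    return None


inclusion_list = [Target(500.25, 2, 20.0)]
print(matching_target(500.251, 2, 20.5, inclusion_list))
```

In contrast to intensity-driven selection, a precursor triggers fragmentation here whenever it matches an entry, however weak its signal is relative to co-eluting ions.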
1) was subjected to manual inspection, and only high quality matches were accepted. In this way numerous good quality MS/MS spectra, with a valid sequence assignment, that would not have been retained using conventional filtering criteria were added to the dataset. Assignments were further corroborated by a difference in the precursor ion mass lower than 3.5 ppm using the accuracy of FT mass measurements as an independent filtering criterion. Examples of peptide assignments included in the dataset with a low score are illustrated in Fig. 2.
| DISCUSSION |
|---|
The stringency of trypsin specificity was recently emphasized by Mann and co-workers (13) in an analysis of a mouse liver proteome fraction. The study provides a good representation of products normally identified in standard proteomics experiments on complex digests. However, the sample used in that study was complex, and it was analyzed by automated selection of precursor ions, thus selecting and fragmenting predominantly the high intensity precursors. Therefore, the claim of strict specificity of trypsin for Lys and Arg residues is consistent with the data in this study inasmuch as the high intensity precursor ions are concerned. The present work demonstrates that the low intensity precursor ions that very likely were not analyzed in the prior study (13) are enriched for peptides with partially tryptic or non-tryptic ends. Furthermore, a lack of trypsin specificity has been discussed previously and widely documented in the LYSIS (14, 15) database created by collecting literature data on proteolytic cleavages (see the supplemental materials for a discussion on the underlying reasons for trypsin unspecificity).
Partly tryptic peptides have occasionally been reported in the proteomics literature to occur as low abundance by-products of tryptic digestion. In addition, recent comprehensive studies have highlighted the presence of numerous unexpected species in tryptic digests, including partly tryptic, modified, and short peptides (16). In many cases, full trypsin specificity was chosen as a database search criterion, precluding the identification of partly tryptic peptides. However, the possibility of identifying partly tryptic peptides, even when they are allowed by the search parameters, might be affected by database search scoring or filtering algorithms, many of which tend to favor fully tryptic peptides, attributing them a higher probability (see the supplemental materials for a detailed discussion on data analysis issues). For these reasons peptides with partially tryptic or non-tryptic ends are substantially underrepresented in the current proteomics literature.
In this study, the number of peptides identified in the β-lactoglobulin digest (117 peptides) using a targeted sequencing method was almost one order of magnitude greater than predicted (12 peptides). The actual complexity of such digests might be even greater if one considers that the peptides identified were only those present in sufficient amount to be detected by the mass spectrometer (i.e. ionizable and within the dynamic range of the instrument). Furthermore, additional peptides could potentially be found in the digests by searching MS/MS data against single nucleotide polymorphism databases or by using de novo sequencing tools (17). Taking into account also a wide range of potential polypeptide modifications occurring naturally or as a consequence of sample handling (18), the actual complexity of tryptic digests is likely to increase exponentially. As an example, a database search was performed on the raw MS/MS data of the bovine albumin digest in which peptide carbamylation (at the N terminus or Arg or Lys residues) was allowed on fully tryptic peptides (Supplemental Table 4). Forty different carbamylated peptides were identified as a result of the reaction with isocyanic acid derived from urea decomposition. This number is likely to substantially increase if half-tryptic cleavage is allowed.
These findings have deep implications for the widely used shotgun strategy and for proteomics strategies based on reproducible LC/MS feature maps such as the accurate mass and time tag method (19). The latter approach relies on accurate mass measurements and normalized chromatographic retention times to identify peptides as a surrogate for MS/MS analysis. Preliminary bioinformatics analysis indicated that the number of peptides of a given mass within a window of 1–2 ppm is at least 10–20 times higher if half-tryptic peptides are included compared with fully tryptic peptides only. The results of this study, therefore, indicate not only that predictions based on strict trypsin specificity seriously underestimate the actual sample complexity but also that the assumptions on which the shotgun and accurate mass tag strategies are founded are greatly oversimplified.
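The growth of the candidate space when half-tryptic peptides are admitted can be illustrated by simple enumeration. The sketch below is not the bioinformatics analysis mentioned above and does not reproduce the 10–20-fold figure; it only counts candidates for one toy sequence, assuming no missed cleavages and a hypothetical minimum peptide length of five residues.

```python
def tryptic_spans(sequence):
    """Return (start, end) index pairs of fully tryptic peptides (end exclusive)."""
    spans, start = [], 0
    for i, residue in enumerate(sequence):
        if residue in "KR":
            spans.append((start, i + 1))
            start = i + 1
    if start < len(sequence):
        spans.append((start, len(sequence)))
    return spans


def half_tryptic_spans(sequence, min_len=5):
    """Enumerate peptides that keep one tryptic terminus and have one ragged end.

    Assumptions: no missed cleavages; min_len is a hypothetical lower length bound.
    """
    spans = set()
    for start, end in tryptic_spans(sequence):
        for k in range(start + 1, end - min_len + 1):
            spans.add((k, end))    # ragged N terminus, tryptic C terminus
        for k in range(start + min_len, end):
            spans.add((start, k))  # tryptic N terminus, ragged C terminus
    return spans


toy_sequence = "AAAAAAAKGGGGGGGGGR"  # arbitrary toy string, not a real protein
print(len(tryptic_spans(toy_sequence)), len(half_tryptic_spans(toy_sequence)))
```

Each fully tryptic peptide of length n contributes up to 2(n - min_len) half-tryptic candidates, so the candidate list grows roughly in proportion to peptide length rather than to the number of cleavage sites.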
Unspecifically cleaved peptides produced during trypsinization of single proteins were demonstrated to be generally of relatively low abundance (orders of magnitude lower) in the digestion mixture compared with the expected, specific peptides. Although this is not a significant problem for the analysis of single proteins or simple mixtures, for a complex biological sample, an extremely large number of low abundance peptides is expected to create a dense proteolytic background in mass spectra. Considering the wide range of protein concentrations in biological samples (20), the proteolytic background resulting from abundant components will tend to hide the signal of true tryptic peptides originating from lower abundance proteins. This hypothesis was confirmed by analyzing a fraction of a human serum digest enriched for glycopeptides (21) in which the β-lactoglobulin digest previously characterized was spiked in at the concentration of 1% by weight (see the supplemental materials). Several β-lactoglobulin tryptic peptides known to be present in the sample from the previous analysis were not detected in full-scan mass spectra, whereas several half-tryptic peptides from highly abundant serum proteins were clearly identified.
In addition, to evaluate the abundance of partly tryptic peptides in already existing proteomics datasets, a large publicly available proteomics data repository, PeptideAtlas, was searched with respect to the yeast and plasma builds (www.peptideatlas.org (22–24)). A major fraction of the overall number of peptides identified, 23.3 and 40.1% for yeast and plasma databases, respectively (Fig. 6), corresponds to partly tryptic sequences. The much higher value obtained for serum might reflect the presence of a high number of serum proteases with different specificities and the number of spectra represented in the respective databases (4 million versus 14 million). However, both figures are extremely high considering that database search constraints and filtering systems tend to penalize and underrepresent unexpected digestion products. This is explained by the fact that PeptideAtlas is a collection of many different proteomics experiments (conducted in different laboratories with different fractionation/enrichment protocols). As a consequence, highly abundant tryptic peptides will tend to be observed consistently in different samples and replicated in the database, whereas each low abundance half-tryptic peptide is associated with a much smaller number of observations as illustrated in the plot shown in Fig. 6. For the same reasons, one might suspect a potentially high false positive rate of half-tryptic peptide identifications in PeptideAtlas. However, in compiling the database, extreme care has been taken to address this issue by using statistical models and reverse database searches (22–24). This was further evaluated by manually inspecting a subset of MS/MS spectra stored in PeptideAtlas corresponding to partly tryptic peptides, most of which were reliable assignments. Therefore, the presence in the database of such a high number of partly tryptic peptides is a clear indication of their existence in tryptic digests.
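A tally of this kind rests on classifying each identified peptide by its termini relative to the parent protein. The following is a minimal sketch under stated assumptions: it is not the PeptideAtlas pipeline, the function names and toy sequence are illustrative, and protein termini are counted as valid tryptic ends.

```python
def classify_termini(protein, start, end):
    """Classify protein[start:end] by how many of its ends are tryptic.

    Assumptions: an N terminus is tryptic if preceded by K or R (or at the
    protein start); a C terminus is tryptic if the peptide ends in K or R
    (or at the protein end). No proline rule is applied.
    """
    n_term_ok = start == 0 or protein[start - 1] in "KR"
    c_term_ok = end == len(protein) or protein[end - 1] in "KR"
    labels = {2: "fully tryptic", 1: "partly tryptic", 0: "non-tryptic"}
    return labels[int(n_term_ok) + int(c_term_ok)]


def partly_tryptic_fraction(protein, spans):
    """Fraction of identified peptides that have exactly one tryptic terminus."""
    labels = [classify_termini(protein, start, end) for start, end in spans]
    return labels.count("partly tryptic") / len(labels)


toy_protein = "GASKLMNRPQW"                        # arbitrary toy string
identified = [(0, 4), (4, 8), (5, 8), (5, 7)]      # hypothetical identifications
print([classify_termini(toy_protein, s, e) for s, e in identified])
print(partly_tryptic_fraction(toy_protein, identified))
```

Counting distinct sequences, as here, weights a peptide seen once the same as one seen thousands of times, which is why the fraction of partly tryptic sequences can be large even when they account for few observations.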
Collectively, these data show that the observations made with purified single proteins also apply to the analysis of more complex protein mixtures. This is a serious limitation that might partly explain the failure of shotgun proteomics approaches to detect low abundance proteins in very complex biological samples. For instance, despite considerable efforts (millions of MS/MS spectra collected in conjunction with extensive fractionation) (25), the comprehensive mapping of complex proteomes such as the human serum proteome by a shotgun approach remains a major challenge. Even relatively small proteomes such as the yeast proteome have not yet been fully covered (26).
| FOOTNOTES |
|---|
Published, MCP Papers in Press, May 28, 2007, DOI 10.1074/mcp.M700029-MCP200
1 The abbreviation used is: DDA, data-dependent acquisition.
* This work was supported in part by federal funds from the NHLBI of the National Institutes of Health under Contract N01-HV-28179 and by the Swiss National Science Foundation under Contract 3100A0–107679 and the Competence Center for Systems Physiology and Metabolic Diseases, Zurich. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
S The on-line version of this article (available at http://www.mcponline.org) contains supplemental material.
|| To whom correspondence should be addressed: Inst. of Molecular Systems Biology, ETH Zürich, Wolfgang-Pauli-Str. 16, HPT D 74, 8093 Zürich, Switzerland. Tel.: 41-44-633-20-88; E-mail: domon{at}imsb.biol.ethz.ch