
From the Laboratory for Biological and Medical Mass Spectrometry, Uppsala University, S-75123 Uppsala, Sweden
| ABSTRACT |
|---|
The reliability problem is effectively addressed by a variety of different search engines, e.g. Mascot (4), Sequest (5), etc., through the use of a scoring technique that evaluates the probability of a false positive identification. Although the evaluation methods might be sophisticated, they have some limitations. One limitation is that for the same MS/MS data the score may significantly change if the content of the protein data base is altered (protein addition to and deletion from the data base are everyday phenomena); thus the score is dependent on whether the data base accurately represents all occurring peptides. Another important limitation is that some poor quality data that should not be trusted can give by pure chance a nearly perfect match, and thus corresponding peptides are wrongly identified with a very high score. Thus, even above threshold identifications call for confirmation by search engine-independent techniques (6–8). Conversely, a low score may arise for two very different reasons: poor matching and poor MS/MS data quality. Thus when an extremely high quality MS/MS spectrum, informative in terms of fragmentation, returns zero or a below threshold score, it is most often discarded, whereas the actual reason for the poor score is that the peptide in question is not present in the data base. Alternatively an MS/MS spectrum of a peptide definitely present in the data base will receive a low score and be discarded because of the poor quality resulting from the presence of noise spikes, distorted isotopic distribution, missing fragments, etc. For instance, the Mascot search engine uses the M-score (9), which takes into account the number of mismatched fragments and their relative abundances. The user does not usually know what caused the poor M-score unless a time-consuming manual inspection of the spectrum is performed.
Thus the automatic routine cannot make an intelligent decision, which should be in the first case to search the data in question allowing for modifications/mutations (10, 11). In the second case, the decision should be to repeat the analysis or look for supporting information, e.g. other peptides belonging to the same protein, or to invoke the retention time (12). Thus a large fraction, perhaps more than 50%, of potentially useful MS/MS data is currently discarded; this aggravates the efficiency problem.
Complementing the M-score with another data base-independent score (13) that would evaluate primarily the quality of the MS/MS data provides a means for distinguishing between the reasons for the poor M-score and for making the above intelligent decision, rendering the tedious and time-consuming manual data inspection superfluous. This has recently been realized by several groups who have designed data base-independent scoring principles. For instance, Bern et al. (13) assessed the quality of tandem MS data obtained on a low resolution instrument and managed to filter out 75% of the unidentifiable spectra while losing only 10% of the identifiable spectra. They found that the number of peaks and their abundances (the guidelines for manual data quality inspection) had in fact little classification power compared with the number of peak pairs separated by an amino acid mass. In this study, we introduce and evaluate a new scoring principle (S-score) that differs from the scoring suggested by Bern et al. (13) in several aspects. First, the S-score is based on just one parameter, the maximum length of peptide sequence tag, which simplifies the interpretation of the S-score value. Second, it utilizes high mass accuracy afforded by FTMS (14). Finally S-score uses the MS/MS information that comes not only from the traditional collision activated dissociation (CAD)1 but also from electron capture dissociation (ECD) (15, 16) performed on the same peptide.
The application of the S-score goes beyond the mere filtering out of the "bad" spectra and includes salvaging some of the below threshold data. Moreover a further verified sequence tag used for the S-score is compared with the search engine sequence assignment, revealing cases of false positive identifications. In some cases, the sequence tag immediately reveals the presence of a modified sequence, removing the need for a separate de novo sequencing program.
An important issue also pertaining to this discussion is the instrument performance, including also the performance of a given fragmentation technique. Traditionally fragmentation efficiency has been measured as a ratio of the total abundance of fragmentation products and the abundance of the precursor ion before fragmentation (17). However, this is a general approach that is silent about the information quality of the data. For instance, the same efficiency could be assigned to two different MS/MS peptide spectra, one containing a single but very intense peak corresponding to the NH3 loss from the precursor ion and the other containing low intensity but extensive backbone fragmentation. Clearly the information content of these two MS/MS spectra would be different. We demonstrate the applicability of the S-score for quantitative assessment of the information content in peptide MS/MS spectra.
| EXPERIMENTAL PROCEDURES |
|---|
In the previous analysis (18) the numbers of uniquely identified proteins for each gel band (sample) were calculated and then added together. Here instead the number of unique proteins from the combined gel strips was derived. Thus the number of proteins identified by using the complementary pairs approach was 224 in this case compared with 256 reported earlier.
S-score Description
The S-score was calculated from the so called dta files that contain the mass and the charge state of the precursor as well as the m/z values and intensities of all the fragment ion peaks in the spectrum above a certain cutoff intensity value. For every precursor, two dta files are present, one representing CAD and the other representing ECD fragmentation.
To build a sequence tag that serves as the basis for the S-score, the program takes deisotoped, neutral CAD fragment masses (potentially true fragments (PTFs)), adds the molecular mass and the mass of a water molecule to the peak list, and builds a sequence ladder (tag) between them, fitting masses of amino acid combinations.
To create a PTF list the data in the CAD and ECD dta files were deisotoped and charge-deconvoluted (20) to the neutral state. Ion fragments of monoisotopic mass <800 Da appear often without their heavier isotopes due to the noise cutoff. In these cases, the neutral masses were derived assuming that the charge state of the peak cannot exceed the charge state of the precursor ion if the peak originates from a CAD dta file and will be less than the charge state of the precursor if the peak originates from an ECD dta file (due to charge reduction in ECD). Thus a peak without isotopes located at m/z = 500.2 in a CAD spectrum of a 2+ precursor will be allowed to have a neutral mass of 500.2 and 1000.4 Da. Both these masses are considered for the construction of complementary pairs.
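The charge-state rule for peaks without visible isotopes can be sketched as follows. This is only an illustration under stated assumptions: the function name `candidate_neutral_masses` and the `proton_mass` parameter are hypothetical, and the conversion z × (m/z − proton mass) is the usual one rather than something the text specifies. The rounded values quoted above (500.2 and 1000.4 Da) correspond to simply multiplying m/z by the charge, which the sketch reproduces when `proton_mass` is set to zero.

```python
PROTON_MASS = 1.00728  # Da; standard value, used here as an assumption


def candidate_neutral_masses(mz, precursor_charge, fragmentation,
                             proton_mass=PROTON_MASS):
    """Return every neutral mass allowed for a peak that has no isotope cluster.

    CAD: the fragment charge may be anything from 1 up to the precursor charge.
    ECD: the fragment charge must be lower than the precursor charge
         (charge reduction), so it runs from 1 to precursor_charge - 1.
    """
    if fragmentation == "CAD":
        max_charge = precursor_charge
    elif fragmentation == "ECD":
        max_charge = precursor_charge - 1
    else:
        raise ValueError("fragmentation must be 'CAD' or 'ECD'")
    return [z * (mz - proton_mass) for z in range(1, max_charge + 1)]


# Peak at m/z 500.2 from a 2+ precursor: two candidates in CAD, one in ECD.
cad_candidates = candidate_neutral_masses(500.2, 2, "CAD")
ecd_candidates = candidate_neutral_masses(500.2, 2, "ECD")
```

All returned candidates would then be passed on to the pair-finding step, as the text describes for the construction of complementary pairs.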
A fragment ion Mc, Ic with neutral mass Mc and intensity Ic in a CAD dta file of a peptide with neutral mass M will qualify for the PTF set if it passes at least one of the following tests. (a) It is an element of a "golden pair" (21). That means there is a corresponding fragment Me, Ie in the ECD dta file that together with the CAD fragment ion satisfies one of the following equations.
![]() |
![]() |
![]() |
![]() |
To reduce the computational redundancy and uncertainty, the algorithm primarily targets y-ion sequence tags. Thus if Mc is identified as a b-ion (satisfies Equation 1 or Equation 4), the mass of the complementary y-ion is calculated using the peptide mass and added to the list instead of Mc. Note that the total amount of information in the fragment list remains unaltered. (b) It is an element of a "complementary pair." That means there is a corresponding fragment ion Mc', Ic' in the same CAD dta file satisfying the following equation.
![]() |
Here of course y-ions are indistinguishable from b-ions. (c) It is the first peak in an isotopic cluster. Furthermore all masses of fragment ions have to pass through the peptide mass window (18) to qualify.
After adding to the PTF set of qualified masses the mass of H2O and the neutral mass of the peptide M (the two additional masses are added to improve the length of the y-ion sequence tag), the algorithm proceeds to find the longest possible amino acid sequence tag that can be constructed from these masses. The number of masses used for the construction of this tag minus one is the S-score value. The maximum allowed mass difference between adjacent masses was ΔMs. The choice of 575 Da for the ΔMs value will be explained in the next section.
The masses in the PTF set are sorted in an increasing order {m1, ..., mN}. The algorithm assigns to each value mi, 1 ≤ i ≤ N, a value of a running tag length TLi, which is initially zero for all i. Following that, the masses mi are selected in an increasing order (i = 1, 2, ..., N), and the corresponding values TLi are determined for each mass mi according to the following recipe. First the differences Δj = mi − mj are calculated for all j < i. Then the program goes through all Δj values and, if the current Δj is equal to any combination of one or several amino acids and does not exceed the maximum allowed mass difference ΔMs, then TLi accepts the value of TLj + 1 unless its current value is higher. After determining all TLi values, the largest TLi is selected, and the value of this TLi is the S-score for this dta file.
Reliable Tag
Besides the sequence tag that gives rise to the S-score, a "reliable" sequence tag (RST) was constructed for each spectrum. The objective was to design the most reliable sequence tag available in mass spectrometry to date and to use this tag to produce the most reliable sequence identification. The RST is constructed as follows. Only fragment ions satisfying conditions a and b simultaneously are selected for the construction of this tag. The PTF masses of the tag are thus doubly confirmed. The maximum step difference for the reliable tag was chosen to be 398 Da. The justification for this will be given in the following section.
| RESULTS |
|---|
ΔMs
The ΔMs value was chosen by using two independent criteria. Criterion 1 measured the extent of bimodality (B) of the part of the analyzed data for 2+ peptides (Fig. 1, a–d). The following simple formula was used for the figure of merit,
![]() |
where Max1 and Max2 were the two local maxima of the histogram, and Min was the height of the lowest point between them. The step difference that yielded the highest B was 600 Da (Fig. 1f, solid line). Criterion 2 measured the difference between the mean values of the S-score distributions of assumed "good" and bad spectra. The criterion for good spectra was the presence of complementary pairs or golden pairs; the spectra that did not meet this criterion were deemed bad. We should mention that this was a rather crude separation but one thought to be sufficiently good for optimizing the ΔMs value. This approach gave a maximum at 550 Da (Fig. 1f, dashed line). A compromise between the two criteria was chosen, ΔMs = 575 Da. In Fig. 1, a–d, the evolution of bimodality is shown along with an abnormality in the abundance of the S = 2 peak that originates at ΔMs = 500 Da and increases with higher masses. This behavior is explained by the fact that, given a sufficiently small peptide mass, e.g. 900 Da, a single cleavage site located in the middle of the peptide can give S = 2, an increment of two from the previous value S = 0. When the peptide mass is removed from consideration, this abnormality disappears as seen in Fig. 1e (compare with Fig. 1d).
Classification of MS/MS Spectra
The prime purpose of introducing the S-score was to partition the acquired MS/MS data into three classes: A, B, and C. Class A was reserved for data that has been identified by Mascot and whose credibility was proven either by the significance of the Mowse score, by the RST, or by both. Elements of class B have not fulfilled the criteria for qualifying as class A data, and in some cases Mascot has not even suggested a sequence for them, but according to the S-score they are most likely peptides with decent MS/MS spectra and should be worth pursuing further identification. Finally class C consists of MS/MS data that according to the S-score either belongs to non-peptides or peptides with such poor MS/MS spectra that reliable identification is impossible, and thus an attempt of identification would be counterproductive.
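The three-class partition can be written down as a small decision function. This is a simplified sketch of the class definitions only, not of the full workflow in Fig. 2, which applies four filters. The function name and argument names are hypothetical; the thresholds (S = 2 and M = 34) are the values quoted elsewhere in this paper, and treating S ≥ 2 as passing and M > 34 as significant is an assumption about how the cutoffs are applied.

```python
def classify_spectrum(s_score, m_score=None, rst_confirms=False,
                      s_threshold=2, m_threshold=34):
    """Assign an MS/MS spectrum to class A, B, or C.

    s_score      -- data base-independent S-score of the spectrum
    m_score      -- Mascot score, or None if Mascot suggested no sequence
    rst_confirms -- True if a reliable sequence tag agrees with the
                    Mascot-suggested sequence
    """
    if s_score < s_threshold:
        return "C"  # non-peptide or spectrum too poor to identify
    has_mascot_id = m_score is not None
    if has_mascot_id and (m_score > m_threshold or rst_confirms):
        return "A"  # identified, and credibility supported by score and/or RST
    return "B"      # decent spectrum that is worth further identification effort
```

For instance, a spectrum with S = 7 and a below threshold M-score lands in class B unless an RST confirms the suggested sequence, in which case it is promoted to class A.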
Ranking of A-data
In the conventional approach to proteomics, the peptide identification quality is synonymous to the M-score value. With the introduction of parallel S-scoring, this is no longer so. With this in mind, the following ranking procedure was devised. The acquired MS/MS data were processed according to the scheme depicted in Fig. 2.
The average number of peptides per protein Npep increases each time when the peptide set is extended by adding the peptides of the lower rank. This is an indication of the validity of peptide IDs in the lower ranks because if the added peptides were false positives, they would likely be distributed among unrelated proteins, and Npep would decrease.
Validation of S-score Threshold
As can be seen from the flowchart (Fig. 2), a total of four different filters were applied at various stages. The first one is simply a requirement for the S-score to have an above threshold value of 2. The justification for this value is deduced from Fig. 3. In Fig. 3, the S-score distribution is presented for MS/MS files of charge state 2+. The distribution (black columns) is clearly bimodal and has a distinct valley at S = 1. The second distribution (light columns) in Fig. 3 is that of MS/MS files for which complementary pairs exist. Note the relatively low abundances of S = 0 and S = 1 that support the choice of S = 2 as a threshold value. This distribution has a mean of 6.7. In Fig. 4, the distribution of Mascot-identified (meaning Mascot suggested a sequence with any score) peptides (light columns) is plotted against the complementary pair distribution (black columns). The mean of the Mascot-identified distribution is clearly shifted toward higher values (mean of 7.4) and also supports the choice of S = 2 as a threshold. No peptides with scores above M = 34 were lost using the S = 2 cutoff. The highest scoring Mascot identification that was discarded had an M-score of 21. We should note here that Mascot searches were made against the non-redundant data base and with oxidation of methionine chosen as the only viable modification, and so peptides with other modifications have no chance of being correctly identified. These peptides could account for the difference between the "complementary pairs exist" and the "Mascot found" distributions. The fact that this difference (data not shown) has a significantly lower mean (4.7) could be due to the generally inferior fragmentation of modified peptides.
Fig. 5, a and b, shows the same two distributions plotted differently. The first is the total distribution of peptides for which Mascot suggested a sequence (black columns), and the second is the distribution of peptides for which a sequence was suggested and an RST existed (light columns). Here the RST existed in 40% of the cases. Note that although the average S-scores were similar for both distributions (Fig. 5a), the average M-score for the RST-backed distribution was much higher (38 versus 27). Only 41% of the total distribution is above the Mascot-suggested threshold of 34, whereas 69% of RST-containing spectra gave hits above this threshold. Thus the mere presence of RST increased the probability of positive ID by Mascot by more than 50%.
1.4%), their absence in the Mascot-suggested sequence is an almost sure sign of a false positive. In summary, we found an RST for 40% of the Mascot-identified MS/MS data (17% of all MS/MS data), and the reliability of the identifications supported by these tags is estimated to be 98.6%.
Revealing Modifications with RST
As an example, the peptide LFSVVADDR was identified by Mascot with a below threshold M-score of 13. The S-score, however, was rather high (S = 7), and the spectra contained 12 complementary pairs of fragments. Moreover RST existed and gave possible sequences ([AY]||[HP]||[FS]||[SMo])(I)(V)(A)(D)(D)([RV]||[AAI]||[GVV]) where Mo stands for oxidized methionine. In this notation, the amino acids in square brackets do not have a defined order, e.g. [AY] could in reality be both AY and YA. The || sign means "or," i.e. ([RV]||[AAI]||[GVV]) means that at the C-terminal sequence could be [RV], [AAI], or [GVV] with I being either leucine or isoleucine. Thus this RST is consistent with many possible sequences, for instance PHLVADDAIA or YAIVADDVR. However, the Mascot-suggested sequence was not among them, thus conflicting with the RST.
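The RST notation lends itself to a mechanical consistency check. The sketch below parses the bracket notation and tests whether a candidate sequence fits the tag. The function names are hypothetical, and two simplifications are assumed: the tag spans the whole peptide (as it does in this example), and leucine is treated as I, following the convention that I stands for either residue.

```python
import re

# A residue token is either "Mo" (oxidized methionine) or a single capital letter.
TOKEN = re.compile(r"Mo|[A-Z]")


def tokenize(text):
    """Split a sequence or bracket group into residue tokens, mapping L to I."""
    return ["I" if t == "L" else t for t in TOKEN.findall(text)]


def parse_rst(tag):
    """Turn '([AY]||[HP])(I)...' into a list of groups of sorted alternatives."""
    groups = []
    for group in re.findall(r"\(([^()]*)\)", tag):
        groups.append([sorted(tokenize(alt)) for alt in group.split("||")])
    return groups


def is_consistent(sequence, tag):
    """True if the sequence can be read as one alternative per tag group."""
    residues = tokenize(sequence)
    groups = parse_rst(tag)

    def match(pos, g):
        if g == len(groups):
            return pos == len(residues)
        for alt in groups[g]:
            end = pos + len(alt)
            # Bracketed residues have no defined order, so compare as multisets.
            if end <= len(residues) and sorted(residues[pos:end]) == alt:
                if match(end, g + 1):
                    return True
        return False

    return match(0, 0)


RST = "([AY]||[HP]||[FS]||[SMo])(I)(V)(A)(D)(D)([RV]||[AAI]||[GVV])"
fits_first = is_consistent("PHLVADDAIA", RST)
fits_second = is_consistent("YAIVADDVR", RST)
fits_mascot = is_consistent("LFSVVADDR", RST)
```

The check reproduces the statement above: the two example sequences fit the tag, whereas the Mascot-suggested LFSVVADDR does not. Because the comparison here is on residue letters rather than masses, a modified residue whose mass equals that of another residue (such as deamidated asparagine, which has the mass of aspartic acid) would have to be written with the letter of its mass equivalent.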
In the conflict between RST and Mascot, RST is likely to be more trustworthy, and thus such spectra must be searched in an extended data base allowing for modifications/mutations. Because the RST suggests the presence of at least two aspartic acids, it is logical to assume the possibility of deamidation (24). Indeed a Mascot search allowing for this modification gave a positive identification with an M = 72, and the sequence YAIVANDVR successfully fitted all 12 complementary fragment masses. This peptide is present in an E. coli protein that was additionally identified by at least two unmodified peptides. As additional evidence of modification/mutation, the unmodified sequence was also identified from a different dta file with M = 62. This example shows the potential of using the RST to detect modifications and mutations.
Connection between S-score and M-score
The S-score is data base-independent and reflects how successful the peptide backbone fragmentation was. Mascot tries to fit the fragmentation pattern to a tryptic peptide in the data base, and the score reflects the probability of achieving a match of a given quality at random when searching the data base.
The S-score and M-score, however, are not completely independent because both increase when the number of true backbone fragments in an MS/MS spectrum increases. Thus one can expect a statistical correlation between M- and S-scores, which is interesting to derive as it could be used for a priori prediction of the M-score for individual spectra before a data base search. A straightforward correlation between the S-score and M-score is low with r = 0.37 (25) (Fig. 7a). However, this correlation is significant given the large number of data points (>14,000) (26). There appears to be a bifurcation of the data that is visibly detectable for high S-values. In Fig. 7b, the dta files with S-score above 7 were selected, and their M-score distribution was plotted (black columns). Clearly, the distribution is bimodal. The distribution with the lower mean (A) could arise if Mascot fits data to the wrong peptide. This can happen if the true peptide is either not present in the data base or modified; in either case the fit Mascot makes would most likely be a poor one utilizing only part of the available data and thus resulting in a low M-score. To test this hypothesis we looked at the distribution of dta files for which an RST was found (Fig. 7b, light columns) and then identified the files (Fig. 7c) for which the RST affirmed the Mascot-suggested sequence (black columns) and for which it conflicted with the suggested sequence (light columns). As predicted, the low M-score distribution consisted mostly of false positives, whereas the high M-score distribution consisted mostly of correct identifications. Here false positive IDs were likely due to mutated and/or modified peptides. The ratio between the two distributions in Fig. 7c is 1:3. We believe that this ratio reflects the relative content of mutated and/or modified peptides in the test mixture. The relatively large content (24.7%) is not surprising given the high rate of mutations in E. coli.
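The statement that a correlation of r = 0.37 is significant for more than 14,000 data points can be illustrated with the standard t-statistic for a Pearson correlation coefficient, t = r·sqrt((n − 2)/(1 − r²)). The text does not say which test was used, so this is an assumed check rather than the authors' calculation, and n = 14000 is taken as a lower bound on the number of points.

```python
import math


def pearson_t_statistic(r, n):
    """t-statistic for testing a Pearson correlation r from n points against zero."""
    if n < 3 or not -1.0 < r < 1.0:
        raise ValueError("need n >= 3 and -1 < r < 1")
    return r * math.sqrt((n - 2) / (1.0 - r * r))


t_value = pearson_t_statistic(0.37, 14000)
```

The resulting value is roughly 47, far beyond the value of about 2 that corresponds to the conventional 5% two-sided level for large samples, so a modest r can still be highly significant when n is this large.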
Connection between S-score and Precursor Abundance
Naturally one would expect that the extent of detectable fragmentation of a peptide should relate to its precursor abundance. Indeed for an abundant precursor, more fragment ions will have a chance of making it over the noise level. Thus for higher S-scores, a higher average precursor abundance should be expected. Fig. 8a shows the distribution of precursor abundances in our data. The distribution scales to 1/abundance, although the automatic gain control (AGC) mode was used (this was supposed to accumulate the same number of precursors in each scan). Without AGC, the slope would be much steeper (with an ideal AGC, the distribution would be flat). Fig. 8b shows how the average precursor abundance increases with the S-score. The increase from S = 0 to S = 2 is comparatively slow followed by a steep almost linear increase from S = 2 to S = 5. At S = 5 a saturation occurs. The curve again suggests that S = 2 is a logical threshold value. At S = 2, the average intensity of precursor ions is 30,000. It is tempting to suggest simplifying the analysis by using this intensity value as a threshold instead of S = 2. This, however, is not a good idea. In Fig. 8, c and d, the distribution of S-values is plotted for the abundance intervals of precursor ions between 20,000 and 30,000 as well as 30,000 and 40,000. The bimodality of these distributions implies that a simple threshold for intensity values is not enough to discriminate between good and bad spectra. The bimodality is likely due to the presence of non-peptide precursors for which the abundances do not have to be low to produce low S-scores.
| DISCUSSION |
|---|
| ACKNOWLEDGMENTS |
|---|
| FOOTNOTES |
|---|
Published, MCP Papers in Press, May 22, 2005, DOI 10.1074/mcp.T500009-MCP200
1 The abbreviations used are: CAD, collisionally activated dissociation; AGC, automated gain control; ECD, electron capture dissociation; ID, identification; LTQ, linear trap quadrupole; M-score, Mascot score; PTF, potentially true fragment; RST, reliable sequence tag.
* This work was supported by Wallenberg Consortium North Grant WCN2003-UU/SLU-009 (to R. A. Z.). The purchase of the LTQ FT instrument was supported by a Knut och Alice Wallenbergs Stiftelse grant (to R. A. Z. and Carol Nilsson). The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
To whom correspondence should be addressed: Laboratory for Biological and Medical Mass Spectrometry, Uppsala University, Box 583, S-75123 Uppsala, Sweden. Tel.: 46-18-471-5729; Fax: 46-18-471-5729; E-mail: Mikhail.Savitski{at}bmms.uu.se
| REFERENCES |
|---|