Rapid Validation of Mascot Search Results via Stable Isotope Labeling, Pair Picking, and Deconvolution of Fragmentation Patterns*

Conventional LC-MS/MS data analysis matches each precursor ion and fragmentation pattern to their best fit within databases of theoretical spectra, yielding a peptide identification. Confidence is estimated by a score but can be validated by statistics, false discovery rates, and/or manual validation. A weakness is that each ion is evaluated independently, discarding potentially useful cross-correlations. In a classical approach to de novo sequence analysis, mixtures of peptides differing only in a carboxyl-terminal isotopic label yield fragmentation spectra with single, unlabeled b-type ions but pairs of isotope-labeled y-type ions, facilitating confident assignments. To apply this principle to identification by fragmentation pattern matching, we developed Validator, software that recognizes isotopic peptide pairs and compares their identifications and fragmentation patterns. Testing Validator 1 on a Mascot results file from FT-ICR LC-MS/MS of 16O/18O-labeled yeast cell lysate peptides yielded 2,775 peptide pairs sharing a common identification but differing in carboxyl-terminal label. Comparing observed b- and y-ions with the predicted fragmentation pattern improved the threshold Mascot score for 5% false discovery from 36 to 22, significantly increasing both sensitivity and specificity. Validator 2, which identifies pairs by precursor mass difference alone before comparing observed fragmentation with that predicted by Mascot, found 2,021 isotopic pairs, similarly achieving improved sensitivity and specificity. Finally Validator 3, which finds pairs based on mass difference alone and then deconvolutes fragmentation patterns independently of Mascot, found 964 predicted peptides. Validator 3 allowed raw mass spectrometry data to be mined not only to validate Mascot results but also to discover peptides missed by Mascot. 
Using standard desktop hardware, the Validator 1–3 software processed the 11,536 spectra in the 93-MB Mascot .DAT file in less than 6 min (32 spectra/s), revealing high confidence peptide identifications without regard to Mascot score, far faster than manual or other independent validation methods.

MS/MS combined with informatics analysis is now a uniquely powerful approach for identifying the components of complex protein samples (1-3). Although new technologies have dramatically enhanced the speed, sensitivity, and precision of LC-MS/MS instrumentation (4), data analysis has neither kept pace with nor taken full advantage of these advances. Determining peptide sequences from fragment ion spectra remains a difficult problem, and three main strategies have matured (5). In de novo sequencing, the peptide sequence is inferred directly from the fragment ion spectra, and many algorithms have been developed to automate this process, including Lutefisk (6), PepNovo (7), NovoHMM (8), Peptide Identification via Integer linear Optimization (PILOT) (9), and others (10-13). Incomplete fragmentation patterns and low signal to noise (10) make this method difficult to implement as an exclusive means of peptide identification.
The most commonly used method involves comparing experimental MS/MS spectra to theoretical peptide fragmentation patterns derived from protein sequence databases (4) and reporting the best peptide match, which is then propagated forward through the process of determining likely protein components. Several programs are commonly used, including SEQUEST (14,15), Mascot (16), and X! Tandem (17,18). What these algorithms share is the determination of a score for a spectrum-peptide match and subsequently a protein identification, and it is the way in which these scores are assigned and interpreted that distinguishes them (19).
The third method for spectrum-peptide matching is a hybrid of de novo and database searching (5) in which small lengths of sequence are generated directly from the fragment ion spectra, and these "sequence tags" (20) are used to corroborate spectrum-database matches. Popular implementations of this strategy include DirecTag (21), GutenTag (22), and MultiTag (23). The limitations to this method include the requirement for consecutive fragmentation ions and the reliance on de novo algorithms to identify sequence tags.
Database search is highly susceptible to both overreporting false positives (low specificity) and underreporting true positives (low sensitivity). The search engines provide different scoring systems that cannot be directly compared, as the rankings of spectral quality are often based on arbitrary cutoff values. Recent research has focused less on the sequence matching algorithms themselves and more on the statistics used to evaluate the resulting match scores (24). PeptideProphet was one of the first algorithms developed to evaluate match scores and assign probabilities by evaluating each match with respect to all other peptide assignments. By using machine learning techniques (an expectation-maximization algorithm), PeptideProphet was shown to have high discriminating power for database search results (25). Initially developed for SEQUEST search results, PeptideProphet has been subsequently adapted for use with database search results from Mascot and X! Tandem. These components are combined in Scaffold, a commercial software suite developed by Proteome Software. An alternative approach is to filter the primary data to exclude poor quality MS/MS scans prior to the database search (26), thereby enhancing the likely significance of each reported match.
Using a false discovery rate instead of a false-positive rate is now the standard statistical measure for reporting error rates in data sets with large numbers of features (e.g. proteomics or genomics data) (5,27). Target-decoy searching as an estimate of false discovery rate (FDR) involves first constructing a database of decoy peptides (28,29), and this strategy is being incorporated into PeptideProphet (30,31). For each peptide-spectrum match, the target spectrum is queried against a second (decoy) database with characteristics similar to those of the first (e.g. a database of reversed or random peptides). Matches to the decoy database are considered false discoveries, and the number of matches above a particular cutoff score threshold is reported. The target-decoy search option is now available in the newest version (version 2.2) of the database search engine Mascot (Matrix Science).
Despite these advances in mass spectrometry, database searching, and statistical approaches to validating matches, the process of analyzing mass spectrometry data remains time-consuming and computer processor-intensive, often requiring several steps and various data transformations (19). To overcome these limitations, we developed a fast and efficient method for peptide identification validation that minimizes the false discovery rate. Our algorithm relies on data from stable isotopic labeling, which is a standard method for quantifying relative protein abundance in complex mixtures (see Ref. 32 and references therein). Carboxyl-terminal labeling methods, including trypsin-catalyzed 18O exchange (33), result in a mixture of pairs of chemically identical but isotopically distinct peptides. The "light" and "heavy" peptides coelute from HPLC but are readily distinguished by precursor mass (Fig. 1A). Each peptide also has an isotopic envelope composed of isotopologues, molecules that are identical in chemical composition but differ in the number of heavy isotopes they contain.
In the case of trypsin-catalyzed 18O exchange, two 18O atoms are substituted for the two carboxyl-terminal 16O atoms. Comparison of CID fragmentation patterns of carboxyl terminus-labeled light and heavy precursors (or isotopologues) distinguishes b-type and y-type ions (34,35). The carboxyl-terminal fragments (y-ions) appear as light (16O) and heavy (18O-substituted) forms, but the amino-terminal fragments (b-ions) display a single shared mass (Fig. 1, B-D).
The technique of using isotopic pairs to enhance peptide identification is not new, and several authors have recognized that isotopic labeling could be used to differentiate carboxyl-terminal from amino-terminal peptide fragments to facilitate peptide sequence analysis (2, 33, 35-38). This method has been productively applied to de novo analysis (12, 39-45) and peptide mass fingerprinting (46). In addition, analogous techniques have been applied to the analysis of mixtures of modified and unmodified peptides by probing for peptide mass differences that match known post-translational modifications (47); other groups have used MS/MS spectra information to corroborate these matches and remove noise (48,49). Finally, isotopic labeling with 18O has been used for manual validation of peptide identifications by observing the predicted mass shift of y-ions (50). Nevertheless, this strategy has yet to be harnessed as a means for automated data analysis and peptide search validation.
The goal of this study was to develop a set of software tools designed to provide rapid and automatic validation of peptide assignments by Mascot and to determine the relative benefit of reducing false discovery and the magnitude of loss of bona fide identifications. We hypothesized that the characteristic shifting of y-type ions between fragmentation spectra of light and heavy precursors might provide a robust check for validity of peptide assignment by database search. Here we demonstrate the feasibility of quickly and efficiently analyzing searched mass spectrometry data, determining within minutes which peptide and protein assignments are likely valid. In its simplest form, Validator 1 identified isotopic pairs in a Mascot results file and improved the 5% FDR cutoff from a Mascot score of 36 to 22, thereby capturing many true identifications that would otherwise have been discarded. A more advanced algorithm, Validator 3, which considers only precursor ion mass, charge, and fragmentation spectral data to identify isotopic pairs independently of any peptide identifications, not only rapidly validated the Mascot results but also discovered peptides that Mascot had failed to match. Our software suite, Validator 1-3, provides new and robust tools for rapid validation of searched LC-MS/MS data obtained in stable isotope experiments, offering improved sensitivity and specificity over database searching alone.

EXPERIMENTAL PROCEDURES
Standardized and Normalized Data Sets-To provide normalized data for our analysis, we prepared a complex soluble protein sample from budding yeast cell lysate. The sample was subjected to proteolysis by trypsin. In detail, the proteins were mixed with 6 µl of Rapigest (Waters) and 10 mM tris(2-carboxyethyl)phosphine HCl, denatured at 37°C for 30 min, alkylated with 10 µl of 50 mM iodoacetamide at room temperature in the dark for 40 min, and digested with 1:50 (w/w) trypsin in 50 mM ammonium bicarbonate, pH 8.9, at 37°C overnight. The Rapigest was removed by adding 5 µl of 1% TFA. The sample was split and was exchanged in 100% [18O]water or 100% [16O]water using the 18O Proteome Profiler kit (Sigma-Aldrich). MALDI-TOF analysis was used to follow the reaction. Finally this sample was mixed in equal amounts to create a 1:1 16O:18O reference sample. The resulting peptide mixture was then subjected to reverse phase nanoelectrospray ionization LC-MS/MS on the LTQ-FT instrument (Thermo) using a standard gradient (Zorbax 300SB-C18 column, 150 mm × 75 µm; 0.1% formic acid in water with 5-60% acetonitrile; 0.5%/min gradient). The LTQ-FT instrument was run in positive ion mode at 50,000-ppm resolution MS for ICR. Parent ions were selected for fragmentation by data-dependent analysis using a cycle of one MS scan for ICR (m/z 400-2000) and up to five MS/MS scans in the LTQ (m/z 50-2000) of the most abundant ions using 120-s dynamic exclusion. A normalized collision energy of 35 was used for low energy CID MS/MS of peptide ions. Under these conditions, a high fraction of the most abundant peptides had both the 16O and 18O monoisotopic species subjected to CID based on our preliminary data. The data set was analyzed by Mascot (version 2.2, Matrix Science) and X! Tandem (version 2007.01.01.1, Global Proteome Machine Organization) to identify peptides and proteins from the MS/MS spectra.
Mascot was set up to search the NCBInr_20060910 database (selected for Saccharomyces cerevisiae, 11,101 entries) assuming the digestion enzyme trypsin, a fragment ion mass tolerance of 1.0 Da, and a parent ion tolerance of 0.2 Da. Double 18O modification of carboxyl-terminal lysine or arginine, oxidation of methionine, N-formylation of the amino terminus, and iodoacetic acid derivative of cysteine were specified as variable modifications. X! Tandem was set to search the scd.fasta.pro database (selected for S. cerevisiae, 6,794 entries) also assuming trypsin with a fragment ion mass tolerance of 0.60 Da and a parent ion tolerance of 10.0 ppm. Iodoacetamide derivative of cysteine was specified as a fixed modification. Double 18O modification, deamidation of asparagine and glutamine, oxidation of methionine and tryptophan, sulfone of methionine, tryptophan oxidation to formyl, and acetylation of lysine and the amino terminus were specified as variable modifications. Scaffold (version Scaffold-01_06_00, Proteome Software) was used to validate MS/MS-based peptide and protein identifications. Peptide identifications are accepted if they can be established at greater than 90.0% probability as specified by the PeptideProphet algorithm (51). Protein identifications are accepted at greater than 95.0% probability and contain at least one identified peptide with probabilities assigned by the ProteinProphet algorithm. Proteins that contain similar peptides and cannot be differentiated based on MS/MS analysis alone are grouped to satisfy the principles of parsimony.

FIG. 1. ... 18O, indicated by *) will affect the y-ions exclusively. C, idealized sample MS/MS spectra from the peptide and ions in B. The spectra from the 16O- and 18O-peptide forms have similar patterns, although the peak heights may be different. D, top, the two spectra from C are overlaid to demonstrate that the b-ions will have a nearly identical mass-to-charge ratio, whereas the y-ions will have a shift reflective of the stable isotope substitution. In the example given, peaks "a" and "k" from C are both b-ions and therefore overlap, whereas peaks "b" and "l" are y-ions with l being shifted due to the substitution of two 18O atoms.

Software Development-All software analysis was performed on searched Mascot data (e.g. ".DAT files"). Custom software was written in Python 2.6. Statistical analysis was performed using both Python scripting and Microsoft Excel. Charts and graphs were generated using both Python's Matplotlib library (SourceForge, Inc.) and GraphPad Prism. Software was run on standard desktop and laptop computers running both Windows XP (service pack 3) and Macintosh OS 10.5. Details about software development and implementation are included under "Results."

RESULTS
The aim of this study is to describe a fast and efficient means for validating peptide identifications obtained by searching 18 O-labeled MS/MS data with Mascot. Our approach is to mine the Mascot .DAT file to extract information not utilized by Mascot but potentially useful for automated validation. For the purposes of this study, we refer to a "query" as any precursor ion and its associated fragmentation ions, regardless of whether Mascot assigned a match, and to a "peptide" as any query to which Mascot assigned a match, regardless of Mascot score and without external validation. For each query, up to 10 possible peptides are assigned by Mascot, each with a probability score. For this study, we examined all query-peptide identifications as well as only the top scoring match suggested by Mascot. Using a 16 (Table I). The FDR of 5% was achieved at a threshold Mascot peptide score of 36, and 2% was achieved at a cutoff score of 42.
The majority of peptides have low Mascot scores (Fig. 2A). As expected, peptides with the highest Mascot scores tend to have a low precursor mass error (PME) (Fig. 3A). In fact, the search results represent two populations: peptides with high Mascot score/low PME and peptides with low Mascot score/high PME. A plot of the Mascot score versus the variance of the PME for all peptide matches above that score illustrates a steep fall in the variance, plateauing close to a Mascot score of 35 (supplemental Fig. 1), providing an approximate cutoff threshold separating the two populations. Of the 17,200 peptides identified by Mascot, 2,308 have scores greater than 35. The width of precursor mass error range that encompasses 95% of these peptides with high Mascot scores is 0.048 Da, whereas the interval that covers 95% of all peptides is 0.386 Da (Fig. 3).

FIG. 2. Distribution of Mascot scores.
A, the raw Mascot data file was parsed, and the number of peptides in each score group was tallied. The vast majority of scores were less than 30. Note that the y axis has a break at 2,000. See the inset for the full-scale graph with identical x axis but no break in the y axis. B, Validator 1 finds 16O/18O pairs in the searched Mascot data file. The distribution of Validator 1-derived peptide scores (black) is seen against the raw distribution (gray) from A. Again, note the broken y axis and the inset showing the full y axis scale. At the low end of the scores, Validator 1 rejects most of the peptides while retaining most of the high scoring peptides. C, the Validator 2e-identified peptides with fragment ion tallies greater than 10 (black) are shown compared with the Validator 2 results (gray). At low scores, Validator 2e rejects most low scoring peptides while retaining most peptides with high Mascot scores. D, Validator 3e (black) performs similarly to Validator 2e (gray) despite not utilizing any Mascot search information.

TABLE I Validator data
For each version of Validator, the number of pairs, queries, and queries with peptides is shown. In addition, data are displayed after filtering the raw Mascot data for only those peptides with scores greater than 35. The precursor mass error range corresponds to the dotted ("all") and solid (">35") lines in Fig. 3.

... (Table I). This analysis required ~10 s of calculation on a laptop computer. The precursor mass range width that encloses 95% of the peptides with Mascot scores greater than 35 was 0.034 Da, whereas the width of the range that encompasses 95% of all peptides decreased by 89% compared with Mascot alone, to 0.044 Da (Fig. 3, A versus B).
There were 223 unique peptides with Mascot scores over 35 that Validator 1 failed to discover as a member of a 16O/18O pair. Manual examination of the raw spectra for 10 of the highest scoring of these peptides revealed three scenarios. For six peptides, the 16O form was fragmented and yielded a high Mascot score, but the 18O form was not selected for MS/MS. In one case, the 18O form subjected to MS/MS was an isotopologue not accounted for by the Mascot search and thus was not correctly identified. In three cases, a candidate pair was flagged by Validator 1, but the data turned out to correspond to two peaks within the isotopic envelope of a single peptide.
On the other hand, Validator 1 did not reject all low scoring peptides, particularly where the Mascot identifications yielded low precursor mass errors. As seen in Fig. 3B, these peptides represent a "comet tail" in the data, stretching all the way down to Mascot scores as low as 10. A closer inspection of these peptides (data not shown) reveals that most were also found in other queries with high Mascot scores. Nevertheless, of the low scoring peptides found by Validator 1, there were 21 proteins represented that would not be identified if only high Mascot scoring peptides were being retained. Therefore, Validator 1 was able to rapidly identify 16O/18O pairs within searched Mascot data. Using 16O/18O pairs as a criterion rather than a simple Mascot threshold retained most high scoring peptides and rejected most low scoring peptides but also rescued several low scoring but likely correct identifications.
Validator 2-Validator 1 relies on Mascot to identify both the 16O- and 18O-labeled peptides. We reasoned that additional 16O/18O pairs might be found in the Mascot .DAT file by searching for pairs of queries where the precursor masses were separated by a difference of 4.008491 Da without regard to any features of the MS/MS data or whether Mascot had assigned the same, different, or even any identifications. Thus, the Validator program was modified to start with a query identified as a 16O- or 18O-peptide and search the Mascot .DAT file for queries within a range of 200 scan units (2.25 min) with a precursor mass difference of 4.008491 Da and with a mass error limit of 3 ppm. Using these criteria, Validator 2 found 3,209 pairs representing 1,564 unique peptides and 1,150 unique proteins.
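The pair-picking step described above can be sketched as follows. This is a minimal illustration, not the Validator source: the `Query` record, the `find_pairs` name, the requirement that both pair members share a charge state, and the choice to express the 3-ppm limit relative to the expected heavy mass are all assumptions made for the example.

```python
from dataclasses import dataclass

SHIFT_DA = 4.008491  # mass difference for two 18O atoms, as quoted in the text


@dataclass
class Query:
    index: int     # hypothetical query number
    scan: int      # scan number of the MS/MS event
    mass: float    # neutral precursor mass in Da
    charge: int


def find_pairs(queries, shift=SHIFT_DA, ppm=3.0, max_scan_gap=200):
    """Return (light_index, heavy_index) tuples for candidate 16O/18O pairs."""
    ordered = sorted(queries, key=lambda q: q.mass)
    pairs = []
    for i, light in enumerate(ordered):
        target = light.mass + shift
        tol = target * ppm * 1e-6  # assumption: ppm taken relative to the heavy mass
        for heavy in ordered[i + 1:]:
            if heavy.mass > target + tol:
                break  # list is mass-sorted, so no later query can match
            if (abs(heavy.mass - target) <= tol
                    and heavy.charge == light.charge  # assumption: same charge state
                    and abs(heavy.scan - light.scan) <= max_scan_gap):
                pairs.append((light.index, heavy.index))
    return pairs
```

Sorting by mass lets the inner loop stop as soon as the candidates exceed the tolerance window, which keeps the search fast even when every query is compared against every other.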
The most significant distinction between Validator 1 and 2 was the retention of considerably more low scoring peptides. Notably, of the 3,177 peptides retained by Validator 2, 1,696 had Mascot scores below 35, and many also displayed a high mass error, suggesting a low likelihood of correct identification. These results raised the question of whether using additional criteria based on the MS/MS data embedded in the Mascot data file might help reveal potentially correct peptide matches with low Mascot peptide scores while filtering out incorrect identifications.

FIG. 3. Precursor mass error versus Mascot score.
Low Mascot peptide scores, defined as a score less than 35, are shown in the shaded gray area. A, the raw data are separated into two distinct zones: the high Mascot score peptides, most with low precursor mass error, and the low Mascot score peptides, most with high precursor mass error. As the Mascot score increases from 0 to 35, the variance of the precursor mass errors of all peptide matches above this score falls dramatically (see also supplemental Fig. 1). We determined cutoffs for precursor mass error that would encompass 95% of all peptides (dashed lines) and 95% of peptides with Mascot peptide scores over 35 (solid lines). B, Validator 1 successfully removes most of the peptides with low Mascot peptide scores. Note the narrower 95% range for all peptides (dashed lines) compared with A as well as the much tighter 95% interval for peptides with Mascot peptide scores greater than 35 (solid lines). C, Validator 2e-identified peptides with a fragment ion tally of 10 or more are shown. Note that although the interval encompassing 95% of the peptides (dashed lines) is wider than for Validator 1 it is much narrower than for the raw data. In addition, the 95% interval for peptides with Mascot peptide scores greater than 35 (solid lines) is narrower than for Validator 1-identified peptides. D, Validator 3e-identified peptides with a fragment ion tally of at least 10 are shown. Again the intervals encompassing 95% of the peptides (dashed lines) and 95% of peptides with Mascot scores greater than 35 (solid lines) are shown.

Validator 2e-Given that fragmentation spectra are available for each member of a candidate 16O/18O-peptide pair identified by Validator 1 or 2, we hypothesized that these data could be mined to distinguish false identifications. As noted above, comparing the MS/MS fragmentation of the light and heavy forms will reveal identical sets of b-ions but distinct y-ions with pairs of fragments shifted by 4.008491 Da, reflecting the exchange of two 18O atoms for 16O at the carboxyl terminus (Fig. 1). We therefore extended our program, dubbed Validator 2e, to take advantage of the embedded carboxyl-terminal labeling information to distinguish the b-type and y-type ions, facilitating peptide validation.
As a first step, we confirmed that the MS/MS ions in each query correspond with a theoretical fragmentation table based on the sequence of the peptide match provided by Mascot. For each peptide identification in the Mascot data file, we calculated the fragmentation table and counted the number of observed ions that fell within a window of 2000 ppm from a predicted b- or y-ion. As expected, there is a positive correlation between the number of b- and y-ion matches and Mascot peptide score (r = 0.596, p < 0.0001; supplemental Fig. 2A). To validate Mascot identifications for 16O/18O pairs, we tested whether the following held true: when pairs of ions matched predicted b-type ions, they should be identical (non-shifting), whereas those matching y-ions should differ by 4.008491 Da (shifting). The numbers of matching pairs of non-shifting b-ions and shifting y-ions were thus tallied to generate a "fragment ion tally." We hypothesized that a high fragment ion tally would characterize a correct peptide identification for a query member of a 16O/18O pair. For each pair identified by Validator 2, we calculated the fragment ion tally for each query member based on comparison with predicted fragmentation tables for the highest scoring peptide match provided by Mascot. Fragment ion tally correlates with a high Mascot peptide score (r = 0.639, p < 0.0001; supplemental Fig. 2B) with a fragment ion tally of 10 corresponding to a Mascot score of 35. We therefore filtered the list generated by Validator 2 to retain only pairs that yielded a fragment ion tally of at least 10 with at least two matching shifting (y-type) ions. The requirement of two y-ion (shifting) matches will reject pairs of ions derived from the same isotopic envelope that are predicted to yield many matching b-ions but no matching y-ions. Calculating fragment ion tallies for the 3,209 pairs of queries found by Validator 2 yielded 1,782 queries with counts greater than or equal to 10 (Table I).
These queries represent 481 unique peptides and 234 proteins. Notably, of the query-peptide matches with fragment ion tallies of 10 or greater, only 442 (24.8%) had Mascot scores less than 35. Compared with Validator 2, Validator 2e eliminates many of the low scoring/high mass error peptides but retains most of the high scoring/low mass error peptides (Fig. 2C). Limiting the plot to peptides evaluated with Validator 2e that yield a fragment ion tally of 10 or greater, 95% of high scoring peptides fell within a precursor mass error range of 0.022 Da versus a range of 0.084 Da for all peptides (Fig. 3C). Compared with Validator 1, Validator 2e found 219 queries, 163 peptides, and 135 proteins not found by Validator 1 (supplemental Table 1).
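The fragment ion tally used by Validator 2e can be illustrated with a short sketch. It assumes singly charged fragments, unmodified residues, and standard monoisotopic residue masses; the function names (`fragment_table`, `fragment_ion_tally`, `passes_filter`) are hypothetical and are not taken from the Validator code.

```python
# Standard monoisotopic residue masses in Da (unmodified residues only).
RESIDUE = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565
SHIFT_DA = 4.008491  # two 18O atoms at the carboxyl terminus


def fragment_table(peptide):
    """Predicted singly charged b- and y-ion m/z values for the light peptide."""
    masses = [RESIDUE[aa] for aa in peptide]
    b = [sum(masses[:i]) + PROTON for i in range(1, len(masses))]
    y = [sum(masses[-i:]) + WATER + PROTON for i in range(1, len(masses))]
    return b, y


def has_peak(peaks, mz, tol_ppm):
    """True if any observed m/z lies within tol_ppm of the predicted m/z."""
    return any(abs(p - mz) <= mz * tol_ppm * 1e-6 for p in peaks)


def fragment_ion_tally(peptide, light_peaks, heavy_peaks, tol_ppm=2000.0):
    """Count non-shifting b-ions and shifting y-ions shared by both spectra."""
    b, y = fragment_table(peptide)
    n_b = sum(1 for mz in b
              if has_peak(light_peaks, mz, tol_ppm)
              and has_peak(heavy_peaks, mz, tol_ppm))
    n_y = sum(1 for mz in y
              if has_peak(light_peaks, mz, tol_ppm)
              and has_peak(heavy_peaks, mz + SHIFT_DA, tol_ppm))
    return n_b, n_y


def passes_filter(n_b, n_y, min_tally=10, min_shifting=2):
    """Tally of at least 10 with at least two shifting (y-type) matches."""
    return n_b + n_y >= min_tally and n_y >= min_shifting
```

Comparing a spectrum against itself mimics two peaks drawn from the same isotopic envelope: every b-ion matches, but no y-ion shifts, so the two-y-ion requirement rejects the pair.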
Validator 3/3e-As a next logical step, we sought to find candidate pairs based solely on their mass difference and ion lists from raw data without regard to any peptide sequence information provided by Mascot in the .DAT file. Validator 3 identifies pairs much like Validator 2 except for not requiring that one member of the pair be a Mascot-identified 16O- or 18O-peptide. The program iterates through all queries and searches for another query with the predicted 4.008491-Da mass difference, allowing an error of 3 ppm. From the reference data set, the program identified 3,779 pairs, representing 3,615 unique queries, of which 3,545 have Mascot-assigned peptide identifications. Examination of the data revealed that some Validator 1 pairs remained unidentified, as their difference in precursor mass lies outside the 3-ppm tolerance limit imposed by Validator 3 (data not shown). Validator 3 found 1,875 queries, 1,540 peptides, and 1,279 proteins not found by Validator 1 (supplemental Table 1).
As with Validator 2e, we extended Validator 3 to 3e by utilizing the expectation of non-shifting b-ions and shifting y-ions to perform an internal validation of the proposed pairs, without relying on the peptide identification(s) provided by Mascot. Therefore Validator 3 was modified to find pairs of shifting and non-shifting fragment ions for each pair based on comparing the two lists of MS/MS ions and finding non-shifting b-ions and shifting y-ions within a mass tolerance of 2,000 ppm. To decrease the influence of noise, only fragment ions with a peak height of at least 0.5% of the intensity of the strongest ion were evaluated. To be considered a shifting or non-shifting pair, the difference in intensity between the heavy and light forms of the candidate could be no more than 25%. Again a fragment ion tally was determined from the number of pairs of candidate b- (non-shifting) and y- (shifting) ions while requiring at least two y-ions. To validate the scoring scheme, the fragment ion tally and Mascot peptide scores were compared, and as with Validator 2e, we found a significant positive correlation (r = 0.395, p < 0.0001; supplemental Fig. 2C).
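A sequence-independent comparison of two MS/MS ion lists, in the spirit of Validator 3e, might look like the sketch below. Several details are assumptions made for illustration and are not stated in the text: fragments are treated as singly charged, intensities are scaled to the base peak of each spectrum before comparison, the 25% limit is taken relative to the larger of the two intensities, and a peak is tested as non-shifting before it is tested as shifting.

```python
SHIFT_DA = 4.008491  # expected y-ion shift for two 18O atoms (singly charged)


def normalize(peaks, min_rel=0.005):
    """Scale intensities to the base peak and drop peaks below 0.5% of it."""
    base = max(intensity for _, intensity in peaks)
    return [(mz, intensity / base) for mz, intensity in peaks
            if intensity / base >= min_rel]


def has_partner(heavy, target_mz, intensity, tol_ppm, max_int_diff):
    """True if a heavy-spectrum peak matches in both m/z and relative intensity."""
    tol = target_mz * tol_ppm * 1e-6
    return any(
        abs(h_mz - target_mz) <= tol
        and abs(h_int - intensity) <= max_int_diff * max(h_int, intensity)
        for h_mz, h_int in heavy
    )


def count_fragment_pairs(light, heavy, tol_ppm=2000.0, max_int_diff=0.25):
    """Return (non_shifting, shifting) counts for two lists of (m/z, intensity)."""
    light, heavy = normalize(light), normalize(heavy)
    non_shifting = shifting = 0
    for mz, intensity in light:
        if has_partner(heavy, mz, intensity, tol_ppm, max_int_diff):
            non_shifting += 1  # candidate b-ion
        elif has_partner(heavy, mz + SHIFT_DA, intensity, tol_ppm, max_int_diff):
            shifting += 1      # candidate y-ion
    return non_shifting, shifting
```

The sum of the two counts plays the role of the fragment ion tally, and the shifting count supports the requirement of at least two y-ions. Note that at high m/z a 2,000-ppm window approaches the 4-Da shift itself, so the order in which the two tests are applied matters there.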
Because two complete sets of MS/MS ions are being compared without regard to a predicted fragmentation pattern, we expected to identify more pairs with higher fragment ion tallies. To facilitate comparison with Validator 2e, we filtered based on a fragment ion tally cutoff of 10, yielding 2,310 queries (Table I). These correspond to 964 peptides and 696 proteins identified. As expected, Validator 3e was less selective than Validator 2e in rejecting low scoring peptides (Fig. 2D) while retaining a higher proportion of high mass error peptides (Fig. 3D). The precursor mass error range containing 95% of peptides with scores greater than 35 was quite similar to that of Validator 2e, 0.026 versus 0.022 Da, but considerably wider for all peptides, 0.258 versus 0.084 Da. These data show that a strategy agnostic to Mascot-specific peptide information can be used to identify peptides highly likely to represent bona fide 16O/18O pairs, providing independent validation for Mascot identifications.
Comparison with Scaffold-The commercial proteomics software suite Scaffold (Proteome Software) uses the PeptideProphet algorithm (25) to generate lists of peptides and proteins with an associated probability. Many groups use Scaffold for downstream data analysis, and we feel that it is important to compare the performance of our software with that of this commonly used analysis tool. Using the same Mascot .DAT file, the data were analyzed in Scaffold using probability cutoffs for peptides and proteins of 90 and 95%, respectively. The list of proteins meeting these criteria along with the constituent peptides was compared with the peptide and protein lists generated by Validator versions 1-3e (Table II). Using the top scoring Mascot peptide identifications only, Validator 1 found 69.5% of the peptides and 91.9% of the proteins found by Scaffold. The performance of Validator 2e was similar, identifying 62.6 and 84.9% of the peptides and proteins, respectively. Validator 3e found 59.1% of the peptides and 88.4% of the proteins found by Scaffold. The seven proteins identified by Scaffold but not identified by Validator 1 were examined. Four proteins had peptide pairs with the MS mass difference outside of the Validator 3e tolerance of 3 ppm. One protein had a fragment ion tally below the cutoff limit of 10. Two proteins were identified solely from 16 ...

... sought to corroborate the pairs by analysis of shifting and non-shifting fragment ions. The Validator 3e program was extended to analyze all Validator 1-identified pairs, first by finding all shifting and non-shifting ions between the two MS/MS ion lists. Then the list of matches was compared with the predicted fragmentation table for the Mascot-identified peptide to calculate a fragment ion tally.
To determine the significance of each potential match, the following algorithm was used: for each potential peptide pair, we randomly permuted the peptide sequence 30 times, each time computing the fragmentation table for the random peptide and determining a fragment ion tally. Based on the distribution of fragment ion tallies for the randomly permuted peptides, a 95% confidence interval was determined. Using a criterion that the fragment ion tally for the Mascot-identified peptide must fall outside this range, the fragment ion tallies for 2,626 (94.6%) of the 2,775 Validator 1-identified peptides were found to be significant. In other words, using internal pair validation based on matching shifting and non-shifting MS/MS ions, we were able to corroborate almost every 16O/18O pair found by Validator 1. This is highly significant as it both demonstrates the strength of using 16O/18O pair finding as a route to high confidence peptides and validates our method of peptide validation by matching MS/MS ions.
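The permutation test described above can be sketched as follows. The text does not say how the 95% interval was derived from the 30 permuted tallies, so this example assumes a normal approximation (mean ± 1.96 standard deviations); the `tally_fn` callback and the function name are hypothetical stand-ins for whatever routine computes a fragment ion tally from a sequence.

```python
import random
import statistics


def tally_is_significant(peptide, tally_fn, n_perm=30, z=1.96, seed=None):
    """Test whether the observed tally falls outside the permuted-sequence range.

    tally_fn: callable taking a peptide string and returning its fragment ion
    tally against the observed pair of spectra (supplied by the caller).
    """
    rng = random.Random(seed)
    observed = tally_fn(peptide)
    null_tallies = []
    for _ in range(n_perm):
        residues = list(peptide)
        rng.shuffle(residues)  # random permutation of the full sequence
        null_tallies.append(tally_fn("".join(residues)))
    mean = statistics.mean(null_tallies)
    sd = statistics.stdev(null_tallies)
    lower, upper = mean - z * sd, mean + z * sd  # assumed normal approximation
    return not (lower <= observed <= upper)
```

Passing the scoring routine in as a callback keeps the test independent of how the tally itself is computed, so the same check could be applied to either a sequence-based or a sequence-independent tally.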
TABLE II. Results are shown comparing the performance of Validator versions 1-3 with the peptide and protein output from the commercial software package Scaffold. In addition, data are displayed after filtering the raw Mascot data for only those peptides with scores greater than 35. The Scaffold filtering criteria were to include only peptides with a 90% confidence, proteins with a 95% confidence, and only those for which there were at least two unique peptides identified. For instance, using only the top peptide match from Mascot for each query, Validator 1 captured 69.5% of the peptides and 91.9% of the proteins as identified by Scaffold. Also shown are results when using all possible peptide and protein guesses by Mascot. ID'd, identified.
Statistical Analyses-We next sought to analyze our results by applying a conventional validation method of false discovery rate (FDR) determination and receiver operating characteristic (ROC) curve plotting. Whenever a protein sequence from the target database is tested, a random sequence of equal length and similar amino acid composition is generated and tested (Matrix Science and Refs. 29 and 52). Any matches to the decoy database are assumed to be false positives, and this approach assumes that matches to the decoy peptides have the same distribution as false-positive matches to the original target data (5). For calculation of FDR at a given threshold score, we used the method described by Käll et al. (27,29) of dividing the number of decoy peptides identified (with scores over the threshold) by the number of target peptides identified (with scores over the threshold score). In general, the identified decoy peptides have low Mascot peptide scores and high precursor mass errors (supplemental Fig. 3). Searching the data set with Mascot against the reference proteomes of 17,200 target peptides and 17,687 decoy peptides yielded an FDR of 5% at a Mascot peptide score of 36 (Fig. 4A).
At this cutoff score, Mascot retains 2,250 target peptides and 106 decoy peptides. We were interested in comparing the features of decoy peptides as an independent means of estimating the ability of Validator to decrease FDR. We therefore applied this test to analyze the filtering ability of Validator versions 1-3 (Table I). As an example, recall that Validator 2e identifies pairs by first finding a pair member that Mascot has identified as having either a carboxyl-terminal 16O or 18O and then finding the other pair member by searching for a peptide with the appropriate difference in m/z. Using this Mascot-identified peptide for each pair member, the program identifies the b- and y-ions from the list of MS/MS ions. This list is searched against the list of MS/MS ions from the isotopic partner to determine the number of non-shifting (b-type) and shifting (y-type) ions, and the sum of these is the fragment ion tally. Peptide-spectrum matches with a fragment ion tally of 10 or greater are retained. Validator 2e retains 1,782 target but only 650 decoy peptides. The majority of decoy peptides have a low Mascot score so that an FDR of 5% is achieved at a cutoff score of 29 (Fig. 4B). At that score, the algorithm retains 1,457 target peptides and 62 decoy peptides.
Receiver operating characteristic curves are a useful way to visualize the relationship between the sensitivity and specificity of a test. We used ROC analysis to probe the relationship between sensitivity and specificity for Mascot peptide scores over all data, prefiltered data, and Validator-filtered data. For a typical mass spectrometry experiment, a true ROC curve cannot be plotted because the true-positive rate is unknown. Typically the search results from the target and decoy data sets are used to approximate the sensitivity and specificity of the search engine filter (Matrix Science).
Sensitivity is approximated by the ratio of the number of queries with peptide scores above a given value to the total number of queries. Likewise, specificity is approximated by the ratio of the number of decoy queries with assigned peptides above a given score to the total number of decoy peptides. ROC analysis of the full set of Mascot-searched data demonstrates poor sensitivity and specificity throughout most of the range of score thresholds (Fig. 5A, stars). It is only at a very low threshold score that the sensitivity approaches 100% (capturing all correct identifications) while the specificity is close to zero (capturing all incorrect identifications). As expected, restricting the ROC analysis to peptides with Mascot scores above 10 or above 35 (Fig. 5A, solid and open squares) improves sensitivity and specificity. When the Validator 1 filtering algorithm is applied to the data (Fig. 5A, triangles), the ROC curve demonstrates a stronger relationship between sensitivity and specificity with a sensitivity of 80% and specificity of 89% at a threshold score of 35 (Fig. 5A, arrow). The performance of Validator versions 2, 2e, and 3e is similarly compared in Fig. 5B. Note that Validator 2e has the best ROC curve with a sensitivity of 80% and a specificity of 94% at a Mascot peptide score threshold of 32 (Fig. 5B, arrow).
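The target-decoy calculations in this section reduce to simple counting, as the following Python sketch illustrates. The function names and the toy score lists used to exercise them are hypothetical; only the definitions (decoy count divided by target count above a threshold, and the fractions of target and decoy queries above a threshold) come from the text. The threshold search simply returns the lowest observed score that meets the FDR limit, which is an assumption about how a cutoff would be chosen.

```python
def fdr_at(threshold, target_scores, decoy_scores):
    """FDR estimate: decoy matches over target matches, both scoring above threshold."""
    n_target = sum(1 for s in target_scores if s > threshold)
    n_decoy = sum(1 for s in decoy_scores if s > threshold)
    # With no target matches left the ratio is undefined; treat it as unacceptable.
    return n_decoy / n_target if n_target else float("inf")


def lowest_threshold_for_fdr(target_scores, decoy_scores, max_fdr=0.05):
    """Lowest observed score at which the estimated FDR is at or below max_fdr."""
    for t in sorted(set(target_scores) | set(decoy_scores)):
        if fdr_at(t, target_scores, decoy_scores) <= max_fdr:
            return t
    return None


def roc_point(threshold, target_scores, decoy_scores):
    """Fractions of target and of decoy queries scoring above threshold.

    The text uses the first ratio to approximate sensitivity and the
    second as its decoy-based approximation for the specificity axis.
    """
    target_fraction = sum(1 for s in target_scores if s > threshold) / len(target_scores)
    decoy_fraction = sum(1 for s in decoy_scores if s > threshold) / len(decoy_scores)
    return target_fraction, decoy_fraction
```

Applied to the counts reported above, the same ratio gives 106/2,250 ≈ 4.7% for Mascot alone at a score of 36 and 62/1,457 ≈ 4.3% for Validator 2e at a score of 29, both just under the 5% limit.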
Corroboration of Validator 3-identified Peptide Pairs-A schema for corroboration of Validator 3-identified peptide pairs is shown in Fig. 6. For the pairs identified by Validator 3e, we utilized the Mascot information, where available, to determine the significance of the match. If the Mascot identification was the same for both members of the pair, we determined the significance of the match using the corroboration strategy of determining fragment ion tallies after randomization of the candidate peptide. Of the 1,270 pairs where the peptide identifications were the same, the score was found to be significant in 1,258 pairs. For the 741 cases where the Mascot identifications were to different sequences, or only one member of a pair had an identification, the same technique was applied to determine the significance. In 621 cases, the corroboration score was significant for at least one matched peptide. For the 130 pairs where there was no corroboration or where neither peptide had a Mascot identification, 31 could be identified using X! Tandem. Of these, we were able to corroborate 19 using the randomization strategy. This left only 133 pairs that passed the fragment ion tally threshold of 10 but lacked any peptide identification to validate. Overall we were able to corroborate 1,898 of 2,021 Validator 3e pairs (93.9%).
FIG. 4. Analysis of FDRs. A, number of Mascot peptide-spectrum matches for target (solid) and decoy data (dotted). The total number of matches with peptide scores over the given Mascot cutoff score is shown, and the score threshold for an FDR of 5% is indicated. B, number of Validator 2e matches for target data (solid) and decoy data (dotted). Note the different y axis scale compared with A. C and D, false discovery rate for raw Mascot and data filtered by Validator versions 1, 2e, and 3e. False discovery rate is the number of decoy peptides divided by the number of target peptides with scores exceeding a given threshold. In D, the black lines mark the Mascot peptide score cutoffs to achieve an FDR of 5% for Mascot (35.6) and Validator 1 (22), 2e (29), and 3e (37).
Performance-All versions of Validator are written in Python version 2.6 running on desktop and laptop hardware. Versions were tested both in Windows XP and Mac OS X environments. Our reference Mascot .DAT data file is 92.8 MB and 1.24 million lines, consisting of 11,536 scans, 20,759 queries, and their analysis. On standard hardware (e.g. Intel Core-2 Duo processors with 2-4 GB of RAM), Validator versions 1-3 run in sequence in less than 6 min (~32 spectra/s), including a complete parsing of the .DAT file, pair finding, and corroboration and full FDR analysis. Validator 1 by itself runs from start to finish in 70 s. Most of this time is spent building the query dictionaries, and once loaded, Validator 1 is able to find all 16O/18O pairs in about 10 s, including decoy search and false discovery rate determination. This corresponds to processing >1,000 spectra/s. Once optimized and compiled, it is expected that Validator should be able to run several times faster. To facilitate further development, software will be available freely both as stand-alone code and as a Web-based tool (www.msvalidator.org).
FIG. 6 legend (partial). Of the remaining pairs, if at least one had a Mascot identification, the shifting and non-shifting ions were compared with the theoretical fragmentation table, and if one or both had a valid fragment ion tally, it was assumed correct. This was true for 621 pairs. Of the remaining pairs, a search was performed using X! Tandem, an alternate search engine, and if a peptide was identified, the corroboration was repeated. For 31 peptides, an identification was made using X! Tandem, and for 19 of these, the match was corroborated with the identified ions. For the remaining pairs (133 in this case), a manual review will need to be performed to determine the identity of the peptide and the validity of the match.
DISCUSSION
We have developed Validator, a novel proteomics database search validation software that provides a direct and independent means to validate peptide identifications provided by Mascot analysis of tandem mass spectrometry data. Our algorithm is based on LC-MS/MS analysis of a mixture of carboxyl-terminal stable isotope-labeled and non-labeled peptides, a common sample in quantitative mass spectrometry (32, 53-57). We exploit the characteristic fragmentation of isotopically labeled peptides to enhance their identification, a well established principle that goes back to the period preceding the modern era of ESI and LC-MS/MS (36,37) and has since been applied effectively by a number of investigators (e.g. Refs. 2, 5, 12, 14, 33, 35, 38-48, and 50). Where both the light (unlabeled) and heavy (labeled) forms of a peptide are selected for fragmentation, the resulting spectra can be compared, thereby distinguishing pairs of non-shifting b-ions from pairs of y-ions that display a shift determined by the isotopic label. These data are then used to test the validity of Mascot peptide identifications, comparing observed with predicted fragmentation patterns. We found that this approach allows rapid and efficient automated filtering of Mascot analysis of LC-MS/MS data to improve both the sensitivity and specificity of peptide identification while salvaging potentially useful low scoring peptides not captured by conventional validation strategies.
Our naive, first approach was to rapidly identify all Mascot-derived 16O/18O pairs from a Mascot .DAT file where both peptides received the same identification. Our data show that a majority of the highest scoring peptides are validated by this simple strategy, and this method was not only able to find 91% of the proteins identified by the commercial analysis package Scaffold but also to capture peptides where the Mascot scores would have fallen below any standard significance threshold. This analysis takes less than 10 s and results in a list of very high confidence peptide and protein identifications. The surprising performance of this simple approach probably reflects the high bar required for Mascot to independently match each of the fragmentation spectra to the 16O and 18O forms of the same peptide, even when the resulting scores fall below normal significance thresholds. In turn, this single criterion efficiently rejects most false identifications as from decoy data.
Validator 2 relaxes the requirement for Mascot to make the same identification for both spectra in a pair and simply seeks a partner for each 16O- or 18O-labeled peptide based on the expected difference in precursor mass. We have shown that this is also a fast and reliable way of identifying pairs, and we found many 16O/18O-labeled potential matches not identified by Validator 1. With Validator 2e, we extracted the b-type (non-shifting) and y-type (shifting) fragment ions from the MS/MS spectra of each pair and then compared these data with the theoretical peptide fragmentation table calculated from the Mascot peptide identifications. Validator 2e confirmed both low and high scoring Mascot identifications but also rejected many others, including nearly all high scoring matches to the decoy database. Thus, Validator 2e was able to achieve an FDR of 5% at a score of 29 versus 36 for Mascot alone. These data suggest that for any arbitrary level of significance running Validator can significantly increase confidence in peptide identifications independently of the Mascot score.
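Pair finding by precursor mass difference can be illustrated with a short Python sketch. It is not the Validator code: the query tuple layout, the function names, the requirement that both pair members share a charge state, and the default of two incorporated 18O atoms (a shift of about 4.0085 Da) are assumptions made for the example. The 3-ppm tolerance is the value quoted for Validator 3e in the Scaffold comparison.

```python
from bisect import bisect_left, bisect_right

O18_SHIFT = 2.004245   # mass difference between 18O and 16O, in Da
PROTON = 1.007276


def neutral_mass(mz, charge):
    """Convert a precursor m/z and charge to a neutral monoisotopic mass."""
    return (mz - PROTON) * charge


def find_pairs(queries, n_labels=2, tol_ppm=3.0):
    """Return (light_id, heavy_id) for queries separated by the label mass.

    queries: iterable of (query_id, precursor_mz, charge).
    n_labels: assumed number of 18O atoms incorporated per peptide.
    """
    delta = n_labels * O18_SHIFT
    by_charge = {}
    for query_id, mz, charge in queries:
        by_charge.setdefault(charge, []).append((neutral_mass(mz, charge), query_id))

    pairs = []
    for entries in by_charge.values():
        entries.sort()                       # sort by mass for binary search
        masses = [mass for mass, _ in entries]
        for light_mass, light_id in entries:
            window = light_mass * tol_ppm * 1e-6
            lo = bisect_left(masses, light_mass + delta - window)
            hi = bisect_right(masses, light_mass + delta + window)
            pairs.extend((light_id, entries[k][1]) for k in range(lo, hi))
    return pairs
```

Sorting once per charge state and using a binary search for each candidate keeps the search close to linear in the number of queries, in keeping with the short run times reported for pair finding.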
To develop a validation scheme agnostic to Mascot-derived information, we reasoned that peptide pairs could be found based only on the difference in precursor mass. Validator 3 was able to quickly find all Validator 2-identified pairs as well as many others. Here, even though in many pairs neither the light nor heavy forms were matched by Mascot, we again wanted to corroborate the peptides by matching shifting and non-shifting ions. By comparing the two MS/MS ion series directly, shifting and non-shifting ions were rapidly identified by Validator 3e, and we were able to confirm the majority of high Mascot scoring peptides by tallying the number of shifting and non-shifting ions and again efficiently reject Mascot decoy matches. In addition, Validator 3e validated many pairs that had received low Mascot scores and even determined fragmentation patterns for pairs of queries for which Mascot had made no assignments at all.
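The direct comparison of two MS/MS ion lists can be sketched as follows. This is an illustrative outline rather than the Validator 3e implementation: the function names, the 0.01-Da fragment tolerance, the assumption of singly charged fragments, and the default shift of about 4.0085 Da (two 18O atoms) are assumptions. The cutoff of 10 for the fragment ion tally is the value given in the text.

```python
from bisect import bisect_left


def _has_ion(sorted_ions, mz, tol):
    """True if sorted_ions contains a value within tol of mz."""
    i = bisect_left(sorted_ions, mz - tol)
    return i < len(sorted_ions) and sorted_ions[i] <= mz + tol


def compare_fragments(light_ions, heavy_ions, shift=4.00849, tol=0.01):
    """Split light-spectrum ions into non-shifting and shifting lists.

    An ion is non-shifting (b-type candidate) if the heavy spectrum has an
    ion at the same m/z, and shifting (y-type candidate) if the heavy
    spectrum has an ion at m/z + shift. An ion may appear in both lists.
    """
    heavy = sorted(heavy_ions)
    non_shifting = [mz for mz in light_ions if _has_ion(heavy, mz, tol)]
    shifting = [mz for mz in light_ions if _has_ion(heavy, mz + shift, tol)]
    return non_shifting, shifting


def passes_tally(light_ions, heavy_ions, min_tally=10, **kwargs):
    """Apply the fragment ion tally cutoff to a candidate pair."""
    non_shifting, shifting = compare_fragments(light_ions, heavy_ions, **kwargs)
    return len(non_shifting) + len(shifting) >= min_tally
```

Because this comparison uses only the two observed ion lists, it needs no peptide identification, which is what allows fragmentation patterns to be recovered for queries that Mascot left unassigned.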
Using this fragment ion matching scheme, we were able to corroborate most of the 2,775 pairs found by Validator 1. To study Validator 3-identified peptides, we applied a more complicated but systematic approach and corroborated 94% of peptide pairs by combining multiple analysis methods including X! Tandem and manual validation. These results demonstrate that we can quickly (<5 min) parse a Mascot results file, returning a list of high confidence peptide pairs, many of which would be missed using conventional score cutoff techniques.
Because our software is designed to analyze data from samples that are a mixture of peptides labeled at the carboxyl terminus with either 16O or 18O, there is some concern that MS analysis of the mixture will result in fewer protein identifications than for an unlabeled sample due to an increase in fragmentation of "redundant" isotopologues at the expense of other peptides. Indeed when we analyzed 16O and 18O samples separately, we found that Mascot identified about 30% more peptides in either singly labeled sample than when the MS was performed on the 1:1 mixture. Thus, we modified Validator to allow for separate 16O and 18O fractions to be combined and analyzed as a single data set, and as expected, analysis of the combined fractions rescues the lost identifications (data not shown). Whether analyzed separately (requiring more MS time) or together (and potentially losing some protein identifications) Validator can accommodate the data analysis.
We intend to provide Validator versions 1-3 both as a downloadable, open source program and as a Web-based tool for parsing and analyzing searched Mascot data. In addition, this approach is readily applied to other labeling schemes used for quantitative analysis, such as stable isotope labeling by amino acids in cell culture (SILAC) or ICAT. Thus, we intend to adapt the software to accommodate other stable isotope tags. Analysis will also be extended to other search platforms such as SEQUEST or X! Tandem.
This study raises the possibility of implementing a new approach to proteomics data acquisition and analysis to speed up and enhance protein identification based on identifying peptides "on the fly" during the LC-MS/MS run. Our data suggest that peptides might be readily identified, even in a complex sample, based on detecting pairs of precursor ions with a characteristic mass difference. Then MS/MS could be performed on both the heavy and light forms followed by comparison to detect shifting and non-shifting fragment ions. The lists of precursor ion masses and b- and y-ions determined from such a match could be used to generate sequence tags as done by Mann and Wilm (20) to directly identify each peptide and thus the protein. With such a strategy, protein identification in real time during the LC-MS/MS run is entirely feasible from a computational perspective. Toward these ends, we anticipate pursuing rapid recognition of 16O/18O pairs in raw LC-MS/MS data and interrogating pairs of fragmentation patterns to search for matching shifting and non-shifting ions.
In its current incarnation, our Validator software offers a simple and powerful tool to filter searched tandem mass spectrometry proteomics data. By applying the techniques outlined above, a list of high confidence peptide and protein identifications can be obtained within minutes, thus reducing the complexity of downstream proteomics analyses.