







From the Department of Computer Science and Engineering, University of Colorado at Denver and Health Sciences Center, Denver, Colorado 80217-3364, Departments of Chemistry and Biochemistry and ¶ Computer Sciences and ** Howard Hughes Medical Institute, University of Colorado, Boulder, Colorado 80309-0215, and || Department of Preventive Medicine and Biometrics, University of Colorado at Denver and Health Sciences Center, Denver, Colorado 80262
| ABSTRACT |
|---|

To minimize false negatives, investigators often reduce the acceptance threshold to capture more information. Several methods have been developed to filter the resulting false positives based on agreement between sequence composition of the peptides and their behavior on ion exchange or reverse phase chromatography (7, 8), probability of missed cleavages (9), exact mass measurements (8), or differences in scores between the top ranking peptides and lower ranked candidates (10, 11). Methods have also utilized intensity information in statistical or machine learning approaches for validation (12–15). Nevertheless, manual analysis of MS/MS spectra by an experienced annotator, who examines each spectrum for chemical plausibility, is regarded by many as the best method for validating borderline cases (16–18). In particular, manual analysis considers other fragment ion types not evaluated by the search program, evaluates fragment ion intensities for chemical plausibility, and assesses whether a peptide assignment is based primarily on noise peaks. However, manual analysis lacks uniform criteria, can be error prone (17), and is impractical with large datasets.
Accurate methods of predicting fragment ion intensity would allow the evaluation of chemical plausibility to be automated. Recently the known gas phase chemistry mechanisms of peptides (12, 19, 20) were incorporated into a kinetic model for peptide fragmentation by Zhang (21, 22) and implemented in the program MassAnalyzer, which simulates MS/MS spectra including relative fragment ion intensities. Similarity scoring between observed MS/MS spectra and the theoretical spectra generated by MassAnalyzer showed excellent discrimination between correct versus incorrect assignments for LCQ and LTQ MS/MS spectra of standard peptides and digests of purified proteins. This suggests that simulated spectra can be used to automate the evaluation of chemical plausibility for validating search results in complex samples. Such an approach complements methods that match MS/MS spectra to libraries of previously observed spectra (23, 24) because it can assess any predicted peptide sequence and can be rapidly adapted to different mass spectrometers or sample preparation methods. Therefore, one of our goals was to test the performance of MassAnalyzer-generated spectra on multidimensional LC/MS/MS ("MudPIT" (25)) datasets collected on tryptic digests.
Here we report a "Manual Analysis Emulator" ("MAE") program, developed to automate key aspects of manual analysis, minimize subjective decisions, and enable high throughput processing. We evaluated MAE performance with datasets of varying complexity (Table I) and found substantial discriminating power of a similarity (Sim) score using the theoretical spectra to evaluate the chemical plausibility of Sequest and Mascot search results. In addition, we developed a score to measure the proportion of the MS/MS spectral ion current accounted for by the peptide sequence (proportion of ion current, or PIC). A commonly used test for manual analysis is to account for most of the ion intensity in the MS/MS spectrum (16–18). However, MAE analyses revealed that MS/MS spectra with borderline PIC scores were often correctly assigned but included more than one peptide ion in the mass spectrometer isolation window (chimera spectra). MAE also provides functions for MS/MS spectra data mining that were effective for identifying unexpected fragmentation chemistries. Thus, MAE provides a useful platform for assessing the validity of search program results and for mining information about gas phase chemistry from large proteomics datasets.

| MATERIALS AND METHODS |
|---|

Raw data were centroided during data collection using vendor default parameters. DTA file summaries for the MS/MS spectra were generated from the Raw files using TurboSequest 3.0 (Extract_MSN module) with intensity threshold = 10,000, peptide mass tolerance = 2.5 Da (average mass), grouping of one to five scans, and minimum ion count = 35. An in-house script concatenated DTA files into a Mascot Generic file format for Mascot searches. Mascot (version 1.9) and TurboSequest searches were carried out against the IPI protein database (version 3.1) or a database where each protein sequence in the same IPI database was inverted to read from C to N terminus. Search parameters allowed 2.5-Da (average) peptide mass tolerance and 1.0-Da (average) fragment ion mass tolerance, as previously optimized to maximize the number of peptides identified (7), allowing only fully tryptic products with up to two missed cleavages. In-house parsers were used to extract search results into an Oracle 9i database. In addition to MAE, we utilized in-house MSPlus and IsoformResolver programs (7) to analyze the search results. The MSPlus program evaluates consensus between the first Sequest and the top two Mascot search results, filtering out cases with unlikely physicochemical properties (in this study, validation required that the number of basic residues was consistent with observed charge on the parent ion and the SCX elution properties (7) and excluded unlikely trypsin missed cleavage products (9)). IsoformResolver generates protein profiles (either based on score thresholds or MSPlus evaluation), by resolving ambiguous protein assignments due to peptide isoforms that the MS/MS spectrum cannot differentiate, and then reports the minimum number of proteins that account for the identified peptides.
The XCorr, Mowse, and MAE scores for the DTA files for Sample 2 are given in Supplemental Table 1 (most of the analyses in this study were done on this dataset), and the protein profile is given in Supplemental Table 2. When evaluating search results where the number of identifiable MS/MS spectra is not known, we report false discovery rates (FDR = FP/(TP + FP) where FP = false positive and TP = true positive), either estimating the number of FPs using the inverted database search as recommended by Weatherly et al. (6) or by directly determining the FPs by manual analysis. When evaluating results with the sequence-inverted protein database, we report false positive rates (FPR = FP/(TN + FP) where TN = true negative) because in this analysis FDR is meaningless as it equals one under every condition (TP = 0). In some analyses, a high confidence subset of a dataset was used; this subset required that both Sequest and Mascot agreed on a sequence, that either XCorr or Mowse was above score thresholds yielding zero FPs from an inverted sequence database search, that Sequest RSP score = 1, and that the sequence satisfied charge state, SCX, and missed cleavage rules of MSPlus.
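The two error rates defined above can be written out directly. The following is a minimal sketch; the function names and the example counts are illustrative assumptions and are not taken from the study.

```python
def false_discovery_rate(tp: int, fp: int) -> float:
    """FDR = FP / (TP + FP): fraction of accepted assignments that are wrong."""
    return fp / (tp + fp)


def false_positive_rate(fp: int, tn: int) -> float:
    """FPR = FP / (TN + FP): fraction of incorrect candidates that were accepted."""
    return fp / (tn + fp)


# Illustrative counts only; they are not taken from the study.
print(false_discovery_rate(tp=95, fp=5))
print(false_positive_rate(fp=5, tn=995))

# Inverted-database search: every accepted hit is false (TP = 0),
# so FDR equals 1 at any threshold, which is why FPR is reported instead.
print(false_discovery_rate(tp=0, fp=7))
```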
MAE Program
MAE was built using C++ to carry out three major steps: 1) simplify each low resolution MS/MS spectrum by eliminating noise and combining isotope clusters into single peaks, 2) calculate a Sim score that evaluates chemical plausibility based on relative fragment ion intensities when comparing an observed MS/MS spectrum with a theoretical spectrum, and 3) identify all possible fragment ions, including those not considered by conventional search programs, based on heuristic rules used in manual analysis and calculate a PIC score for each MS/MS spectrum that accounts for fragment ion assignments. MAE inputs and outputs are described in Fig. 1. Theoretical spectra were generated using MassAnalyzer 1.03 (21, 22) except that average parent and fragment ion masses are reported. To achieve rapid access to these spectra, we constructed a database of the +1 to +4 charge forms for all tryptic peptides allowed by the search strategy. The theoretical MS/MS spectral database consisted of 8,332,846 spectra formatted in a three-level tree by the first three amino acids in the sequence. MAE required 77.74 ms on a Pentium 4 computer (CPU 2.6G) to evaluate a candidate sequence when compiled with C++ 6, student version.
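As an illustration of the three-level tree keyed by the first three amino acids, the sketch below uses nested dictionaries. It is only one plausible layout, written in Python rather than the C++ used for MAE; the function names, the (sequence, charge) key, and the example sequences and peak lists are all hypothetical.

```python
def build_index(spectra):
    """Nest theoretical spectra under the first, second, and third residue.

    spectra maps (sequence, charge) -> peak list; sequences are assumed
    to contain at least three residues.
    """
    tree = {}
    for (seq, charge), peaks in spectra.items():
        bucket = tree.setdefault(seq[0], {}).setdefault(seq[1], {}).setdefault(seq[2], {})
        bucket[(seq, charge)] = peaks
    return tree


def lookup(tree, seq, charge):
    """Walk three levels, then search only the small final bucket."""
    bucket = tree.get(seq[0], {}).get(seq[1], {}).get(seq[2], {})
    return bucket.get((seq, charge))


# Hypothetical entries: arbitrary sequences with made-up peak lists.
example = {
    ("LVNELTEFAK", 2): [(200.1, 1.0)],
    ("LVTDLTK", 1): [(300.2, 0.5)],
}
index = build_index(example)
print(lookup(index, "LVNELTEFAK", 2))
print(lookup(index, "AAAK", 1))  # not stored, so None
```

The design choice is that each lookup touches three small dictionaries and one bucket instead of scanning the full collection of spectra.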
Simplifying the Information in DTA Files
To generate the sDTA files, noise peaks from each MS/MS spectrum are evaluated and removed. (Here we describe sDTA processing for the LCQ data. Similar issues exist for the LTQ-Orbitrap, although analysis of the LTQ-Orbitrap dataset, Sample 6, utilized simpler methods for processing as described in the legend to Table I.) The first step is to create an "ion list" of the most intense DTA ions (a DTA ion is one line in the DTA file); the length of the list equals 7 or 14 times the number of amino acids in the candidate peptide for singly or multiply charged cases, respectively. This division rule encompasses 98% of the obvious ions in a survey of the high intensity, richly fragmenting MS/MS spectra in Samples 2 and 5. The remaining DTA ions are categorized as "bulk noise ions."
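The division into an ion list and bulk noise ions can be sketched as a simple ranking by intensity. The function name and the (m/z, intensity) tuple layout are assumptions made for this example.

```python
def split_ion_list(dta_ions, n_residues, charge):
    """Split DTA ions into the ion list and the bulk noise ions.

    dta_ions: list of (mz, intensity) tuples, one per DTA line.
    The ion list keeps the 7 * n_residues most intense ions for singly
    charged parents and 14 * n_residues for multiply charged parents.
    """
    per_residue = 7 if charge == 1 else 14
    keep = per_residue * n_residues
    ranked = sorted(dta_ions, key=lambda ion: ion[1], reverse=True)
    return ranked[:keep], ranked[keep:]


# Made-up spectrum with 20 ions of increasing intensity.
ions = [(100.0 + k, float(k)) for k in range(20)]
ion_list, bulk_noise = split_ion_list(ions, n_residues=2, charge=1)
print(len(ion_list), len(bulk_noise))
```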
To simplify the spectrum, the lines remaining in the ion list after removing the bulk noise are grouped into "clusters" of DTA ions and combined as described in the legend for Fig. 2. The resulting single peak is referred to as an "Average Mass Ion" with intensity equal to the sum of all the included DTA ions. The intensity and m/z of the Average Mass Ions are calculated using Equation 1 and Equation 2, shown here for a cluster with three fragment ions. To distinguish the processed Average Mass Ions from the original data in the DTA file, we refer to the weighted average m/z of each reprocessed peak as "M/z,"

$$I = I_1 + I_2 + I_3 \qquad \text{(Eq. 1)}$$

$$M/z = \frac{I_1 M_1 + I_2 M_2 + I_3 M_3}{I_1 + I_2 + I_3} \qquad \text{(Eq. 2)}$$

where Ik and Mk are the intensity and m/z of the DTA ions within each cluster, respectively, which are combined during processing of the DTA file.
The resulting I and M/z values for each Average Mass Ion are written to the sDTA. An error correction of 1.0002 × M/z was applied to each M/z value after analyzing the high quality assignments in the list (e.g. Supplemental Fig. 2). A second noise threshold was then determined where Average Mass Ions with I greater than the mean + 2.8 × standard deviation (S.D.) of the bulk noise ions were classified as "major ions" for the next step of processing. This threshold was chosen to eliminate >95% of noise ions in spectra where an MS noise peak was sequenced. Such cases were identified when spectra were very weak, no MS/MS spectra were observed for ions with similar m/z and reverse phase retention time but with higher signal, and no peptide tag sequence could be identified. A third noise threshold was used in calculating the PIC score as described below, excluding any Average Mass Ion below 3,000 counts. Supplemental Fig. 3 summarizes the method used to set this threshold. The effect of simplifying clusters by combining ions with similar m/z values is evaluated in Supplemental Fig. 4.
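The mass correction and the second and third noise thresholds can be sketched as follows. The constants come from the text; the function names are hypothetical, and the use of the sample standard deviation is an assumption because the text does not say which estimator was used.

```python
from statistics import mean, stdev

MASS_CORRECTION = 1.0002   # multiplicative correction applied to each M/z
SD_MULTIPLIER = 2.8        # major ions exceed mean + 2.8 * S.D. of bulk noise
PIC_FLOOR = 3000.0         # ions below 3,000 counts are excluded from PIC


def correct_mz(mz):
    """Apply the multiplicative mass correction to one M/z value."""
    return MASS_CORRECTION * mz


def major_ions(average_mass_ions, bulk_noise_intensities):
    """Keep Average Mass Ions whose intensity exceeds the noise threshold.

    Assumption: stdev() is the sample standard deviation.
    """
    threshold = mean(bulk_noise_intensities) + SD_MULTIPLIER * stdev(bulk_noise_intensities)
    return [(mz, i) for mz, i in average_mass_ions if i > threshold]


def pic_eligible(ions):
    """Drop ions below the fixed count floor used for PIC scoring."""
    return [(mz, i) for mz, i in ions if i >= PIC_FLOOR]


# Made-up values: noise mean 200, S.D. 100, so the threshold is about 480.
noise = [100.0, 200.0, 300.0]
peaks = [(500.0, 481.0), (600.0, 400.0), (700.0, 5000.0)]
majors = major_ions(peaks, noise)
print(majors)
print(pic_eligible(majors))
```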
[Equation 3]

where IA is the summed intensity of the observed ions that make up each Average Mass Ion and IB is the intensity of each ion in the theoretical spectrum. To reduce the impact of the remaining noise ions, any ion in the sDTA that was not in the theoretical DTA and that had intensity less than 0.5% of the total observed ion intensity was not considered in the Sim scoring. This filter was the only noise processing used for the preliminary analysis of the LTQ-Orbitrap dataset. When comparing theoretical and experimental spectra for Sim scoring, a mass tolerance of ±(0.45 Da + 0.00085 × M/z) was allowed for the 3D traps (LCQ), determined from error analysis of the M/z values for fragment ions (Supplemental Fig. 2), and ±0.2 Da was allowed for the LTQ-Orbitrap dataset.
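The matching tolerance and the 0.5% filter described above can be sketched as below. The scoring formula in `sim_score` is an assumption: the text here defines IA and IB but does not reproduce the Sim equation, so a square-root-weighted normalized dot product is used as a stand-in. The greedy first-match pairing and all function names are also illustrative choices.

```python
import math


def lcq_tolerance(mz):
    """Mass tolerance for 3D-trap data: 0.45 Da + 0.00085 * M/z."""
    return 0.45 + 0.00085 * mz


def sim_score(observed, theoretical, tolerance=lcq_tolerance, min_fraction=0.005):
    """Compare observed and theoretical peak lists of (mz, intensity).

    ASSUMED formula: sum(sqrt(IA * IB)) / sqrt(sum(IA) * sum(IB)).
    """
    total_observed = sum(i for _, i in observed)
    pairs = []      # (IA, IB) intensity pairs
    used = set()    # indices of theoretical ions already matched
    for mz_a, ia in observed:
        match = next((k for k, (mz_b, _) in enumerate(theoretical)
                      if k not in used and abs(mz_a - mz_b) <= tolerance(mz_a)), None)
        if match is not None:
            used.add(match)
            pairs.append((ia, theoretical[match][1]))
        elif ia >= min_fraction * total_observed:
            pairs.append((ia, 0.0))   # unmatched but too intense to ignore
        # weaker unmatched ions are dropped as residual noise
    pairs.extend((0.0, ib) for k, (_, ib) in enumerate(theoretical) if k not in used)
    sum_a = sum(a for a, _ in pairs)
    sum_b = sum(b for _, b in pairs)
    if sum_a == 0 or sum_b == 0:
        return 0.0
    return sum(math.sqrt(a * b) for a, b in pairs) / math.sqrt(sum_a * sum_b)


# Made-up spectra: identical peak lists give the maximum score.
spectrum = [(300.0, 10.0), (500.0, 40.0)]
print(sim_score(spectrum, spectrum))
```

Under this assumed formula the score ranges from 0 (no shared intensity) to 1 (identical relative intensities).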
Annotating the Fragment Ions
In the next step, the Average Mass Ions are annotated by comparison with average m/z values of possible theoretical fragment ions, including sequence-specific a, b, and y ions, multiply dehydrated b ions (up to four), internal fragmentation products, and bn−1 + 18-Da ions produced by C-terminal rearrangements (20). Multiple charges are considered for all except internal fragment ions (produced from cleavage of two peptide bonds). MAE assigns fragment ions in the sDTA in two stages based on how closely the observed M/z values match theoretical average masses. First, ions within a narrow mass tolerance of ±(0.25 Da + 0.00045 × M/z) are assigned in the order listed below; this mass tolerance was chosen to include >95% of the fragment ion assignments in MS/MS spectra of standard peptides where parent ion counts were […] 80% of the target values to minimize inaccuracy due to space charging. Second, ions within a greater mass tolerance of ±(0.45 Da + 0.00085 × M/z) are assigned to accommodate cases with higher mass error. This larger window was determined from error measurements for replicate MS/MS spectra of the same peptide observed in several datasets and captures cases where there are Asp/Asn or Gln/Glu isoforms, where the second or third isotopic peak was sequenced, or where the isotopic envelope of an intense fragment ion is split into two Average Mass Ions by MAE processing.
Only one fragment ion classification is made for each Average Mass Ion even when several are possible (for example, analysis of the second, fourth, and fifth peptide bond in DTA files from Sample 5 indicated that 8% of doubly charged ions are isomeric with a singly charged ion). Rules commonly utilized in manual analysis are used to assign the most likely annotation, but all other possibilities are listed in the summary output. Fragment ions are classified in the following order: 1) parent ion and ions generated by neutral losses of water/ammonia, the guanidinium side chain (only allowed when Arg is present), or H2CO3; 2) singly charged "canonical" bn and yn ions produced by cleavage at one peptide bond; 3) singly charged canonical an ions; 4) singly charged dehydrated/deammoniated canonical an, bn, and yn ions and C-terminal rearrangements (bn−1 + 18); 5) singly charged multiple dehydrated/deammoniated canonical bn ions; 6) doubly charged canonical bn and yn ions; 7) doubly charged canonical an ions; 8) doubly charged dehydrated/deammoniated canonical an, bn, and yn ions and C-terminal rearrangements (bn−1 + 18); 9) doubly charged multiple dehydrated/deammoniated canonical bn ions; 10) triply charged canonical bn and yn ions; 11) triply charged canonical an ions; 12) triply charged dehydrated/deammoniated canonical an, bn, and yn ions and C-terminal rearrangements (bn−1 + 18); 13) triply charged multiple dehydrated/deammoniated canonical bn ions; 14) b ions generated by internal fragmentation; 15) a ions from internal fragmentation; and 16) dehydrated/deammoniated a and b ions from internal fragmentation.
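The two-stage, priority-ordered assignment can be sketched as follows. This is an illustration under stated assumptions rather than the MAE implementation: candidates are given as hypothetical (priority, label, theoretical m/z) tuples, where the priority is the class number from the order above, each label is used at most once, and the nearest unassigned ion within tolerance is chosen.

```python
def narrow_tolerance(mz):
    """Stage 1 tolerance: 0.25 Da + 0.00045 * M/z."""
    return 0.25 + 0.00045 * mz


def wide_tolerance(mz):
    """Stage 2 tolerance: 0.45 Da + 0.00085 * M/z."""
    return 0.45 + 0.00085 * mz


def annotate(observed_mz, candidates):
    """Assign one label per observed Average Mass Ion.

    Candidates are tried in priority order within the narrow window first,
    then again within the wide window for ions still unassigned.
    """
    labels = {}
    used = set()
    for tolerance in (narrow_tolerance, wide_tolerance):
        for _priority, label, theoretical_mz in sorted(candidates):
            if label in used:
                continue
            free = [mz for mz in observed_mz
                    if mz not in labels and abs(mz - theoretical_mz) <= tolerance(mz)]
            if free:
                best = min(free, key=lambda mz: abs(mz - theoretical_mz))
                labels[best] = label
                used.add(label)
    return labels


# Made-up example: a singly charged b ion (class 2) outranks a doubly
# charged y ion (class 6) that falls within the same window.
result = annotate(
    [500.1, 250.9],
    [(6, "y8 2+", 500.2), (2, "b4", 500.0), (2, "y2", 251.5)],
)
print(result)
```

In this example the ion at 250.9 is 0.6 Da from the theoretical y2 mass, so it is missed in the narrow stage and picked up in the wide stage.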
Additional Scores Generated by MAE
Several additional scores are generated by MAE; only two of these were used in this study. The PIC score is applied to each set of major ions (excluding those with IM/z <3,000 counts) and is defined by Equation 4,
$$\mathrm{PIC} = \frac{\sum I_{\mathrm{matched}}}{\sum I_{A}} \qquad \text{(Eq. 4)}$$
where Imatched is the intensity of each assigned Average Mass Ion that matches to a theoretical fragment ion and IA is the intensity of each observed Average Mass Ion in the sDTA file.
Internal fragment ions have a high probability of generating combinatorial redundancies; therefore, a score to assess internal fragments was added to the MAE output. The "IntFrag" score evaluates the percentage of observed fragments accounted for by internal cleavages and is defined by Equation 5,
$$\mathrm{IntFrag} = \frac{\sum I_{\mathrm{int}}}{\sum I_{A}} \qquad \text{(Eq. 5)}$$
where Iint is the intensity of each observed ion identified as an internal fragmentation product.
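Following the definitions of Imatched, IA, and Iint given in the text, both scores can be sketched as intensity ratios. The tuple layout and the "sequence"/"internal" tags are hypothetical, and expressing IntFrag as a fraction of the total observed intensity (rather than a percentage) is an assumption.

```python
def pic_score(ions):
    """Proportion of ion current explained by the assigned peptide.

    ions: list of (intensity, kind), where kind is "sequence", "internal",
    or None for an Average Mass Ion with no annotation.
    """
    total = sum(i for i, _ in ions)
    matched = sum(i for i, kind in ions if kind is not None)
    return matched / total if total else 0.0


def intfrag_score(ions):
    """Fraction of observed intensity assigned to internal fragments."""
    total = sum(i for i, _ in ions)
    internal = sum(i for i, kind in ions if kind == "internal")
    return internal / total if total else 0.0


# Made-up major ions, all above the 3,000-count floor except the last,
# which is kept here only to show an unassigned contribution.
example_ions = [(6000.0, "sequence"), (3000.0, "internal"), (1000.0, None)]
print(pic_score(example_ions))
print(intfrag_score(example_ions))
```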
Rules for Fragment Ions Generated by Secondary Cleavages
To decide between alternative ion types and to minimize chance assignments due to the large number of combinatorial possibilities, heuristic rules for fragment ion annotations based on simple chemical rules for multistep cleavages (indicated by […]) or parallel reactions that are expected to be independent of each other (indicated by ||) are applied as follows.

[…] or deammoniated ion ([…]) if the corresponding unmodified form is absent (using the rule that "unmodified […]/[…]"). We use the standard […] symbol for dehydration, but in some cases, the ion shows mass closer to a deammoniated ion. Stochastic variation in intensity of different isotope peaks produces ambiguity between dehydrated and deammoniated ions (see Fig. 2, expanded view panels that show individual ions, keeping in mind that dehydrated and deammoniated ions are 1 Da apart and have overlapping isotope peaks). Therefore, we utilize a […] symbol for potentially deammoniated ions, instead of the more common NH3 nomenclature, to emphasize that the evidence for the observed deammoniated ions is inadequate to distinguish from dehydration.

[…] a."

[…] ([…]/[…]) a ion if the ratio of intensities between the […]/[…] a ion and its corresponding a ion (Ra) is significantly different from the ratio of intensities between the […]/[…] b ion and its corresponding b ion (Rb) generated by cleavage of the same peptide bond. This assumes that the two reactions are independent: "(a […]/[…] a) || (b […]/[…] b)". This rule is utilized when |Ra − Rb| < 0.15(Ra + Rb) to avoid misclassifying cases where small changes could be due to stochastic differences in ion counting.

[Equation 6]

where Is and It are the intensities of each observed ion identified as sequence-specific a, b, or y ions representing cleavage at the s and t peptide bonds and I is the intensity of each observed ion of any type except the internal fragment ions.

[…] […]/[…]-modified a or b ions (for example […] an). This rule assumes that the loss of H2O/NH3 and CO are independent: "(b […] a) || (unmodified b […]/[…] b)."

[…] […]/[…] derivatives of a b ion are present (e.g. bn, […]bn, […]bn, and […]bn), the related ions must follow intensity patterns that assume sequential reactions: "unmodified […]." For example, a set of three related ions should show intensity patterns bn […] bn […] bn, bn […] bn […] bn or bn […] bn […] bn and exclude the pattern bn […] bn […] bn. If the set fails this test, the […] form should not be assigned. Similar patterns should be assessed for sets of four ions; if the set fails the test, the […] ion should not be assigned, and the first three ions should be reconsidered. This annotation test considers alternative assignments and low intensity ions in the sDTA files when testing for the unmodified element in this series and requires that the peptide sequence has sufficient Ser/Thr to account for the multiple dehydration events (number of Ser/Thr […] number of […] allowed).

| RESULTS AND DISCUSSION |
|---|

A solution to these problems is to simplify the clusters into one peak. Ideally we would like to use the monoisotopic peak, but computationally it is easiest to identify a cluster by locating its most intense DTA ion. The peak corresponding to the monoisotopic mass may not be the most intense DTA ion, or it may be absent in low intensity, high m/z ions or in cases where the centroid processing distorts the distribution of the ion intensity. Thus, the average m/z of the original fragment ions is more reliably reconstructed from the information in the DTA files than is the monoisotopic m/z. MAE processing of the DTA information to remove noise and simplify ion clusters into single peaks is described in the legend to Fig. 2. Briefly, MAE separates DTA ions into two classes: major ions and bulk noise ions. Clusters in the major ion list are then combined to produce "Average Mass Ions" by calculating the weighted average mass of the DTA ions within a −2.0 to +2.5-Da window of the most intense DTA ion in the cluster (Equation 2) but terminating at the point that DTA ions are >1.0 Da apart. We utilize the unit designation M/z for these averaged ions to distinguish them from the unprocessed data. An example of the resulting sDTA file is illustrated in Fig. 2C.
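The clustering step can be sketched as below. The window (−2.0 to +2.5 Da around the most intense ion) and the 1.0-Da gap rule come from the text; seeding clusters greedily from the most intense remaining ion, the function name, and the tuple layout are assumptions made for this illustration.

```python
def cluster_ions(dta_ions, low=-2.0, high=2.5, max_gap=1.0):
    """Combine DTA ions into Average Mass Ions.

    dta_ions: list of (mz, intensity) with positive intensities.
    Returns a list of (M/z, I): intensity-weighted mean m/z and summed intensity.
    """
    remaining = sorted(dta_ions)          # ascending m/z
    averaged = []
    while remaining:
        # Seed each cluster at the most intense ion still unprocessed.
        seed = max(range(len(remaining)), key=lambda k: remaining[k][1])
        seed_mz = remaining[seed][0]
        lo = hi = seed
        # Extend downward and upward inside the window, stopping at gaps > max_gap.
        while (lo > 0 and remaining[lo - 1][0] >= seed_mz + low
               and remaining[lo][0] - remaining[lo - 1][0] <= max_gap):
            lo -= 1
        while (hi < len(remaining) - 1 and remaining[hi + 1][0] <= seed_mz + high
               and remaining[hi + 1][0] - remaining[hi][0] <= max_gap):
            hi += 1
        members = remaining[lo:hi + 1]
        total = sum(i for _, i in members)
        weighted_mz = sum(mz * i for mz, i in members) / total
        averaged.append((weighted_mz, total))
        del remaining[lo:hi + 1]
    return sorted(averaged)


# Made-up isotope cluster at 500-502 plus an isolated ion at 510.
clustered = cluster_ions([(500.0, 100.0), (501.0, 50.0), (502.0, 25.0), (510.0, 10.0)])
print(clustered)
```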
Characterizing the sDTA Information
Accuracy of the DTA signal processing by MAE was assessed by examining mass errors of fragment ion assignments (theoretical m/z − observed M/z) in the cases where we were confident of the peptide identification. The mass error distribution of all b and y fragment ions showed an offset of the mean from zero (mean = 0.1535, S.D. = 0.492), indicating a systematic error in mass determination (Supplemental Fig. 2B), which varied with fragment ion M/z (Supplemental Fig. 2A). This was not an artifact due to combining cluster lines because the same systematic error was observed in unprocessed clusters where we could identify the monoisotopic peaks. It appears to be an aspect of the mass inaccuracy of ThermoElectron 3D ion traps and has been observed in other laboratories (18, 26, 27) but was not observed with the LTQ-Orbitrap when MS/MS spectra were collected in the LTQ. The error distribution obtained after applying a correction factor to the LCQ datasets is shown in Supplemental Fig. 2, C and D (mean = 0.035, S.D. = 0.430). Applying this correction factor to a larger dataset collected a year later (Supplemental Fig. 2, E and F) showed similar results (mean = +0.088, S.D. = 0.347). The range of observed errors is well within the range expected for 3D ion traps (26, 27) and indicates that ions within 1.5 Da of each other cannot be accurately distinguished.
In combining DTA ions in a cluster, two difficult cases were seen when very intense singly charged fragment ions produced clusters up to 5 Da wide or when two different fragment ions appeared with monoisotopic m/z values within 3 Da of each other. For the first case, clustering of the high intensity DTA ions often produced two adjacent Average Mass Ions differing by 1.5–2.0 Da, one with high signal intensity and one with low intensity, where the most intense Average Mass Ion was within 0.5 Da of the expected mass in 100% of 23 randomly chosen cases. Thus, the Average Mass Ion provided a reasonable representation of the major fragment ion that was well within the error distribution for fragment ions in high confidence cases. The second, weaker peaks were as much as 2 Da larger than the main peak; they accounted for most of the outlier values in the error analysis in Supplemental Fig. 2.
For the second case, we evaluated spectra where it was likely that corresponding b and y ions were present and within 4.5 Da of each other (Supplemental Fig. 4). We surveyed 48 randomly chosen spectra of MH+ and MH2+2 parent ions containing this type of potential ambiguity. Thus, the method of combining clusters into Average Mass Ions combined only those ions separated by less than ∼2 Da, rather than the 4.5 Da that might be expected from the window size, most likely due to termination of ion summation when gaps >1.0 Da were encountered. This resolution was similar to that expected given the distribution of errors for all Average Mass Ions (Supplemental Fig. 2 shows 3 × S.D. = 1.25 Da).
After the signal processing, the resulting ion list of Average Mass Ions is written to an sDTA file that is then used as input to other MAE functions (Fig. 1). To test whether processing into sDTA files removed any critical information, sDTA files in a small MudPIT dataset of K562 cell proteins (Sample 2, Table I) were tested in searches with Mascot and Sequest, and the results were compared against those obtained when searching using the unprocessed DTA files. Overall Mascot Mowse scores for correct assignments remained the same or increased by 15% due to better matching of the high mass ions, and the scores for incorrect hits decreased slightly most likely due to the removal of isotopic peaks. The net effect was a small increase in discrimination. On the other hand, >90% of Sequest XCorr scores decreased with the sDTA files, sometimes by as much as 25%, which we attributed to the reduction in noise. Sequest SP and ion scores increased because these scores are sensitive to the more accurate matching between observed M/z and predicted m/z values. (SP evaluates the presence of continuous "runs" of b and y ions, whereas ion score evaluates the percentage of predicted b and y ions that are observed.) Importantly no correct Sequest or Mascot assignments made with DTA files were lost by searching using the sDTA files. These results are consistent with other studies reporting similar improvements upon deisotoping and removing noise from DTA files (28). Because the purpose of our study was to evaluate search results on DTA files, all other searches in this study were done using DTA files. However, this test demonstrated that the sDTA files lose no significant information after processing by MAE, and in fact the processing increases the likelihood that predicted and observed fragment ions can be matched.
Sim Scoring against Theoretical MS/MS Spectra Generated by MassAnalyzer
We next tested the performance of DTA files versus sDTA files in Sim scoring of the Sample 2 dataset. To do this, we generated two sets of theoretical spectra: 1) those generated by a version of the MassAnalyzer program that calculates isotope peaks of the fragment ions, which were compared with DTA ion m/z values, or 2) those generated by a modified version of MassAnalyzer that calculates average m/z values of the fragment ions, which were compared with sDTA Average Mass Ion M/z values. We then evaluated Sim scores comparing theoretical spectra with either sDTA or DTA files. Analysis of the Sample 2 dataset showed a bimodal distribution for the Sim scores with sDTA files (Fig. 3A, open symbols). The higher scoring class in the complex test sample (sDTA, Fig. 3A) aligned well with the score distribution of 175 validated MS/MS spectra from a dataset collected on standard proteins (Fig. 3C; Sample 1 in Table I), whereas the lower scoring class aligned with false positives generated by searching against inverted protein sequences (Fig. 3B). In addition, the ratio of areas of the two classes in the complex test sample was similar to our previous estimate of the ratio between normal tryptic MS/MS spectra and other MS/MS spectra in this dataset (approximately 1:3 when MH2+2/MH3+3 duplicated DTA files are included). Thus, the two peaks roughly corresponded to the expected ratio of correct versus incorrect assignments.
Several tests were carried out to assess whether the high versus low scoring peaks in the biphasic distribution corresponded to correct versus incorrect assignments. (1) In earlier studies, 70% of the 1,838 MS/MS spectra in Sample 2 (after removing spectra with low signal intensity) were evaluated manually. Manual analysis was carried out on all MS/MS spectra where the Sequest ΔCN score was ≥0.08 or where Sequest and Mascot agreed on the peptide assignment (7, 9). In addition, any of the top five Mascot assignments were examined whenever Sim was ≥0.45, which validated lower ranked sequences. Together these confirmed 925 MS/MS spectral assignments as correct, including those where the correct assignment was a lower ranked "hit" or where only Sequest or Mascot alone could correctly identify the spectrum. In all cases, those MS/MS spectra that were manually validated as correct showed Sim >0.47 (Fig. 3D). (2) In contrast, the remaining 30% of cases that were not evaluated manually showed very low Mowse or XCorr scores and were most likely incorrect. All distributed to the low range of Sim score, less than 0.47. A random sampling of 45 cases was then further evaluated; manual analysis showed that all of their top ranked Sequest and top two ranked Mascot assignments were incorrect. (3) Finally, the two alternative charge forms of each multiply charged DTA, generated by the LCQ, were examined for incorrect charge forms (referred to as "decoys" in Supplemental Table 1, column 5) in those cases where the correct form could be identified. All of these incorrect assignments showed a Sim score less than 0.5. Taken together, this extensive analysis of the Sample 2 dataset showed that the range of Sim scores for the true positive class was nearly identical to that observed for the standard peptides (Fig. 3, D versus C). This indicates that the two peaks in the bimodal distribution of Sim scores effectively distinguish true positive from false positive assignments.
Receiver Operating Characteristic (ROC) Analyses
To further evaluate the discriminating power of Sim scoring with complex samples, we carried out ROC analyses; ROC curves compare sensitivity (true positives identified) versus specificity (1 − FPR) where the area under the curve correlates positively with discriminating power, i.e. the ability to achieve high sensitivity with low numbers of false positives. The 925 manually validated MS/MS spectra of Sample 2 provided an ideal dataset for this analysis because the spectra adequately sampled the full range of scores, not just the scores that were above an acceptance threshold. Thus, the manually validated cases showed a wide range of Mowse and XCorr scores, ranging between 17 and 196 and between 1.14 and 7.85, respectively (Supplemental Table 1, considering only cases where Mascot or Sequest correctly identified the spectrum). This was similar to the range seen with standard peptides (7). Furthermore, the validated cases accounted for most of the high scoring peak in the bimodal distribution of Sim scores (Fig. 3D), and their number was similar to our previous estimate of 920–930 MS/MS spectra that would be identifiable if all correct sequences (fully tryptic, allowing up to two missed cleavages) with Mowse or XCorr below acceptance thresholds could be captured (7). Thus, the manually validated dataset from Sample 2 satisfied our criteria for ROC analyses of XCorr, Mowse, and Sim scoring.
Using this dataset, ROC analyses were carried out for the MH+, MH2+2, and MH3+3 charge states comparing Sim, Mowse, and XCorr scores of the manually validated subset (Fig. 4). Sensitivity was plotted as the number of DTA files with score greater than an acceptance threshold, expressed as the true positive rate (TP/(TP + FN) where FN includes both low scoring cases and those where the search program did not identify the correct assignment as the top ranked hit). Specificity was evaluated from the number of FPs in an inverted database search that passed the acceptance thresholds, expressed as the false positive rate (FPR = FP/(FP + TN)). From the ROC curves, it was clear that Sim scoring showed significant improvement in discrimination compared with Mowse and XCorr when applied to MH+ and MH3+3 ions as well as modest improvement with MH2+2 ions. For many cases with low XCorr or Mowse scores, Sim performed dramatically better than XCorr or Mowse at capturing correct assignments; this could be seen by plotting Sim against either XCorr or Mowse (Fig. 3, E and F). Similar results were seen when specificity was normalized to only those peptides top ranked by Mascot (see Supplemental Fig. 5 and Supplemental Table 3). Thus, the ROC analysis revealed significant improvement in discriminating power by using relative fragment ion intensity information and Sim rescoring to evaluate search results.
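The construction of such a ROC curve can be sketched as follows, with the true positive rate taken from the scores of validated assignments and the false positive rate from the scores of inverted-database hits. The function names and the example scores are illustrative assumptions; thresholds are applied as "at or above" for simplicity, and the area is computed with the trapezoid rule.

```python
def roc_points(validated_scores, inverted_scores):
    """Return (FPR, TPR) pairs for every distinct score threshold.

    validated_scores: scores of manually validated (correct) assignments.
    inverted_scores: scores of hits from the inverted-database search.
    """
    thresholds = sorted(set(validated_scores) | set(inverted_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(s >= t for s in validated_scores) / len(validated_scores)
        fpr = sum(s >= t for s in inverted_scores) / len(inverted_scores)
        points.append((fpr, tpr))
    return points


def area_under_curve(points):
    """Trapezoid-rule area under the (FPR, TPR) curve."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up scores: perfectly separated classes give an area of 1.
print(area_under_curve(roc_points([0.9, 0.8, 0.7], [0.3, 0.2, 0.1])))
# Indistinguishable classes give an area of 0.5.
print(area_under_curve(roc_points([0.5], [0.5])))
```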
The ΔCN parameter is often used to improve sensitivity and specificity by allowing use of lower XCorr thresholds without an increase in FPR (10). We found in a previous study that commonly accepted thresholds for XCorr + ΔCN validated 720 assignments in Sample 2 (9). In addition, we validated 833 cases with in-house MSPlus software, which evaluates consensus between Mascot and Sequest and then filters out FPs based on physicochemical properties (7) as described under "Materials and Methods." Because MSPlus validated cases that included most of the peptides identified by XCorr + ΔCN, our further analyses compared results between Mowse, MSPlus, and Sim. Fig. 5A summarizes the overlap between Mowse-, MSPlus-, and Sim-validated peptide assignments along with the FPs identified in the manual analysis. There were 670 MS/MS spectra validated by all three methods with no FPs detected. MAE-Sim rescoring of Mascot results identified 159 more cases, of which 11 were FPs (seven could be filtered out by applying physicochemical tests). Of the 159, 80 were not validated by MSPlus because they were low scoring or there was no Sequest/Mascot consensus. On the other hand, Mascot or MSPlus validated 44 or 84 additional cases, respectively (with two or 10 FPs); most of these had Sim scores just below the acceptance threshold as discussed later.
We first compared the results of Sim rescoring of Mascot results with those from Mowse (summarized in Fig. 5B and Supplemental Table 4). Using Mowse acceptance thresholds, 228 proteins were identified based on 395 unique peptide sequences. MAE-Sim increased support for 44 of the proteins identified by Mowse (19%), including 32 proteins previously supported by only one or two peptides, and added 33 new protein identifications (14%). Among the 13 cases that were unique to Mascot (all supported by only one peptide), two were found to be false positive assignments. MAE-Sim added nine manually confirmed FPs (all supported by only one peptide), but seven of these were eliminated upon application of physicochemical filters, so there was no net effect on the number of FPs. After removing the nine FPs from the MAE-Sim protein list, only 24 new proteins were identified compared with 44 proteins identified by Mascot that had additional support by MAE-Sim. These results are consistent with the population sampling nature of MudPIT where additional sampling will more likely identify peptides from proteins that were previously identified.
MAE Data Mining Functions Provide Detailed Analyses of MS/MS Spectral Anomalies
We observed two anomalies in comparing standard protein digests with complex MudPIT samples. First, the percentage of the expected tryptic peptides identified in the Sample 2 dataset (88%) was lower than that in the standard peptide dataset (97%); the missed peptides were cases with moderate Sim scores that were caught only by Mascot and MSPlus. Second, the Sim score distribution of manually validated cases extended farther into the lower range for the complex Sample 2 MudPIT dataset than for the standard protein dataset (Fig. 3, C versus D), with proportionately more validated MS/MS spectra between 0.35 and 0.60. These effects revealed that a larger proportion of correct assignments with borderline Sim scores appear in datasets of complex protein mixtures compared with simple protein samples. To further characterize these borderline cases, we turned to other scoring functions of MAE, the PIC and IntFrag scores. PIC is used in manual analysis to assess how well a peptide sequence accounts for the total fragment ion current in the MS/MS spectrum; because it accounts for more ion types than considered by search engines, it captures orthogonal information about how well the assigned sequence fits the spectrum. The IntFrag score assesses the amount of the ion current present as internal fragment ions generated by cleavage of two peptide bonds. A high IntFrag score occurs when many fragment ions are matched by random chance to internal fragment ions and can be used together with PIC to evaluate plausibility.
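As a rough illustration of the two scores just described, the sketch below treats both as fractions of the total ion current. The exact definitions are those given under "Materials and Methods"; the formulas here are assumptions made for illustration. In particular, the sketch assumes that PIC counts every annotated peak (internal fragments included) and that IntFrag counts only peaks labeled as internal fragments. The function name and the peak representation are hypothetical.

```python
def pic_and_intfrag(peaks):
    """Return (PIC, IntFrag) as fractions of total ion current.

    peaks: list of (intensity, ion_type) tuples, where ion_type is None for
    an unassigned peak, "internal" for an internal fragment ion, or any
    other label (e.g. "a", "b", "y") for other annotated ion types.
    Assumed definitions, for illustration only.
    """
    total = sum(intensity for intensity, _ in peaks)
    if total == 0:
        return 0.0, 0.0
    assigned = sum(i for i, ion_type in peaks if ion_type is not None)
    internal = sum(i for i, ion_type in peaks if ion_type == "internal")
    return assigned / total, internal / total


# Toy spectrum for illustration only.
toy_peaks = [(50, "y"), (20, "b"), (10, "internal"), (20, None)]
pic, intfrag = pic_and_intfrag(toy_peaks)
```

Under these assumed definitions, a spectrum whose ion current is explained mostly by internal fragments would show a high IntFrag score alongside a modest PIC score, which is the pattern examined in the borderline cases below.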
The PIC and IntFrag scores require the major Average Mass Ions to be classified by fragment ion type, including canonical a, b, or y ions generated by peptide bond cleavage as well as noncanonical ions from other cleavages, as described under "Materials and Methods." Such annotations revealed fragmentations that were not modeled well by MassAnalyzer. Of interest were the problems encountered with 1) internal fragment ions generated by cleavage of two peptide bonds or 2) fragment ions generated by multiple dehydration of b ions. Fig. 6A shows an example of internal fragmentation where one cleavage is generated N-terminally of Pro, and the other cleavage is generated at hydrophobic amino acids that also yield intense y ions ( . . . FIQNV . . . ). Most internal fragment ions were predicted by the theoretical spectra but often with lower intensities than observed. The multiply dehydrated b ions were observed in peptides with multiple Ser/Thr residues often containing acidic residues and often in singly charged peptides. Up to four dehydration events could be observed in bn ions depending on the number of Ser and Thr residues. In some cases, the multiply dehydrated fragment ions showed greater intensity than their corresponding undehydrated forms (Fig. 6B, e.g. b11, b12, and b13) particularly when the parent ions were singly charged. Such fragment ion types were poorly predicted by the theoretical spectra and may account for part of the lower performance of Sim scoring with MH+ parent ions compared with MH2+2 or MH3+3 parents (Fig. 4).
The accuracy of the fragment ion assignment process carried out by MAE was assessed using the high confidence MS/MS spectra subset of the Sample 2 dataset (Table I), where the automated annotations for 52 randomly chosen spectra were compared with annotations assigned manually by an experienced annotator. We evaluated 1,357 fragment ions (not including internal fragment ions) assigned by manual analysis versus MAE and found only 53 MAE annotations that differed from manual analysis. Of these, six represented cases where multiple ions were possible and different alternatives were chosen by MAE versus the experienced annotator. Twenty-eight were plausible fragment ions that were not annotated during manual analysis of the original DTA file but were annotated by MAE after sDTA processing led to intensity enhancement of a weak ion cluster above the noise threshold. Only 19 MAE annotations (1.4%) were considered wrong or unlikely by the experienced annotator. This rate was comparable to the frequency (1.7%) at which two experienced annotators differed in their ion assignments after examining the same spectra. Taken together, the analysis shows that MAE accurately recapitulated the annotations of manual analysis.
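The breakdown of the 53 differing annotations can be checked with a few lines of arithmetic. The counts below are taken from the paragraph above; the variable names and category labels are hypothetical.

```python
total_ions = 1357  # fragment ions compared (internal fragments excluded)

# Breakdown of MAE annotations that differed from manual analysis.
differing = {
    "alternative_choice": 6,             # several ions possible, different pick
    "found_after_sdta_enhancement": 28,  # plausible ions missed manually
    "wrong_or_unlikely": 19,             # judged incorrect by the annotator
}

n_differing = sum(differing.values())
error_rate = differing["wrong_or_unlikely"] / total_ions
differing_rate = n_differing / total_ions

print(n_differing)              # 53
print(f"{error_rate:.1%}")      # 1.4%
print(f"{differing_rate:.1%}")  # 3.9%
```

Only the third category counts against MAE, which is why the error rate compared with the 1.7% inter-annotator disagreement is 1.4% and not the 3.9% of annotations that differed for any reason.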
Impact of Spectral Chimera in the Borderline Scoring Cases
Using the PIC and IntFrag scores, we explored MS/MS spectra with borderline Sim scores that were more common in complex samples like the Sample 2 dataset (Fig. 3D) compared with simpler samples like the standard protein dataset (Fig. 3C). We first focused on cases where more internal fragments were annotated than were plausible (IntFrag score >0.20) and the PIC score was moderate (0.45–0.60). In many cases, the internal fragment ions and unidentified fragment ions were absent in other DTA files representing the same peptide. This suggested that such ions represented chemical contamination due to commingling of fragment ions from two parent ions with similar m/z that eluted simultaneously on reverse phase HPLC ("spectral chimera"). The analysis of a spectral chimera is illustrated in Fig. 7 where two peptides with the same m/z were partially resolved on SCX and observed in fractions 6 and 8 but coeluted in SCX fraction 7. The MS/MS spectra in fractions 6 and 8 reported two different peptides with very high Mowse, XCorr, Sim, and PIC scores. The chimera MS/MS spectrum in fraction 7 contained all major ions seen in both peptides individually. This example is an ideal situation for analyzing a spectral chimera because the peptides in this case had the same charge, and the first and second highest Mascot "hits" for the chimera MS/MS spectrum represented the two peptides identified in SCX fractions 6 and 8. Furthermore the two peptides contributed approximately equal intensity to the chimera, also making it easier to deconvolute the two peptide sequences.
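The screening rule used to pick out these cases can be written as a simple predicate. The sketch below uses the thresholds quoted above (IntFrag score >0.20, PIC between 0.45 and 0.60); the function name is hypothetical, and treating the PIC bounds as inclusive is an assumption.

```python
def is_chimera_candidate(pic, intfrag, intfrag_min=0.20, pic_range=(0.45, 0.60)):
    """Flag a spectrum whose scores match the chimera-like pattern.

    Thresholds are those quoted in the text; inclusive PIC bounds are an
    assumption made for this sketch.
    """
    pic_low, pic_high = pic_range
    return intfrag > intfrag_min and pic_low <= pic <= pic_high


# Toy score pairs for illustration only.
flagged = is_chimera_candidate(pic=0.52, intfrag=0.31)      # True
clean = is_chimera_candidate(pic=0.85, intfrag=0.05)        # False
low_intfrag = is_chimera_candidate(pic=0.52, intfrag=0.10)  # False
```

A flag of this kind only nominates candidates; confirming a chimera still requires evidence such as the absence of the extra ions in other DTA files for the same peptide, as in the Fig. 7 example.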
The presence of spectral chimeras may explain why many cases identified by Mowse or MSPlus were not validated by MAE-Sim scores (Fig. 5A). Of the 84 cases identified by Mascot and/or MSPlus but rejected by MAE, all showed Sim scores in the borderline scoring region just below acceptance thresholds. All showed moderate PIC and high IntFrag scores, strongly suggesting they were spectral chimeras, and included the 32 spectral chimera candidates identified from tag sequences described above. On the other hand, MAE uniquely identified 80 peptides that were rejected by MSPlus because of low combined Mowse/XCorr scores, lack of consensus between Sequest and Mascot, or Sequest score RSP ≠ 1. Together MAE and MSPlus rescoring of Mascot results validated the search results for 95% (881 of the 920–930 expected) of the DTA files that we estimated could be identified as tryptic peptides. In a preliminary test to optimize data capture, we lowered the Sim score threshold to 0.45 and used PIC scores along with physicochemical filters to remove FPs. The results showed that 94% of validated search results could be captured by this simple search strategy.
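The relaxed acceptance strategy from the preliminary test can be sketched as a three-part filter. Only the lowered Sim threshold of 0.45 comes from the text. The PIC cutoff is not stated, so the sketch makes it a required argument, and the value used in the example is hypothetical. The function name and the use of inclusive comparisons are also assumptions.

```python
def accept_relaxed(sim, pic, passes_physchem, pic_min, sim_min=0.45):
    """Accept an assignment under a lowered Sim threshold plus two filters.

    sim_min=0.45 is the lowered Sim threshold quoted in the text. pic_min
    has no default because the text does not state the PIC cutoff used.
    passes_physchem is the outcome of the physicochemical filters.
    """
    return sim >= sim_min and pic >= pic_min and passes_physchem


# Hypothetical PIC cutoff and toy scores, for illustration only.
PIC_MIN_EXAMPLE = 0.6
kept = accept_relaxed(sim=0.48, pic=0.72, passes_physchem=True,
                      pic_min=PIC_MIN_EXAMPLE)
dropped = accept_relaxed(sim=0.48, pic=0.40, passes_physchem=True,
                         pic_min=PIC_MIN_EXAMPLE)
```

The design point is that lowering a single score threshold admits more FPs, so the extra assignments are accepted only when independent evidence (PIC and physicochemical plausibility) also supports them.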
Analysis of Other Datasets
Finally we tested the performance of MAE-Sim with additional MudPIT datasets (Samples 3 and 4, Table I, respectively collected on LCQ Classic and LCQ Deca XP instruments) where methods were utilized to achieve higher sampling of low abundance ions. A larger proportion of the DTA files were in the low scoring class (Sim 0.0–0.5) than seen in Sample 2 because higher sampling depth resulted in more sequencing of source-generated fragment ions. The result was a very large number of high quality MS/MS spectra that were not identifiable as fully tryptic peptides; this type of MS/MS spectra often shows FP assignments in Sequest and Mascot searches. Nevertheless the Sim scores for these datasets yielded bimodal distributions (Supplement