From Applied Biosystems/MDS Sciex, Foster City, California 94404
| ABSTRACT |
|---|
Protein identification for the analysis of MS/MS fragmentation data in the bottom-up approach can be thought of as having four main stages: 1) preprocessing, 2) selection of peptide hypotheses, 3) scoring peptide hypotheses, and 4) protein inference. The preprocessing stage 1 can include conversion of raw data to simplified peak lists, averaging of spectra deemed sufficiently similar, filtering of spectra considered unlikely to yield a good identification, etc. Most tools fall into one of two main categories differing in how hypotheses are selected: sequence approaches use some de novo estimation of sequence information from the observed MS/MS fragmentation (11–17), whereas precursor approaches rely on the precursor mass as the main filter (17–25). The goal of both approaches is to gain efficiency and discrimination by constraining the universe of all possible peptides and modifications to a much smaller search space that is tractable for scoring or manual inspection.
In sequence methods, an amino acid sequence(s) from manual or automated de novo sequencing of full or partial peptide sequences is used as an initial search space constraint. In the earliest example of this type of method by Mann and Wilm (11), a small section of sequence, referred to as a "sequence tag," would be manually interpreted and then provided to their algorithm along with the masses of the unsequenced regions flanking the sequence tag. They referred to all three pieces, preceding mass tag, sequence tag, and following mass tag, as a "peptide sequence tag." The database was subsequently scanned to find the matches to the three elements of the peptide sequence tag. In "error-tolerant" mode, all three elements of the peptide sequence tag are not required to match, allowing successful identification even in the presence of unsuspected modifications. At the same time, Pappin and co-workers were developing similar software (12), which now exists as the Mascot "Sequence query" search (26). This sequence-based approach has now been implemented in several forms, including MS-Seq in Protein Prospector (17). More recently, there have been sequence category approaches that use automatic de novo sequencing and attempt to call larger stretches of sequence particularly as a solution to the so-called "homology searching" problem where it is expected that the proteins from the species of interest are poorly represented in the database (27–29). Sequence tags have also been used to derive metrics of spectral quality and as part of the scoring step with precursor-type searches (30).
In the precursor category of algorithms, no MS/MS-derived sequence information is used, and peptide hypotheses are selected solely on the basis of conformance of the observed precursor mass to the mass of the theoretical peptide. The theoretical masses of all possible peptides are exhaustively enumerated given the database and search space constraints such as allowed modifications and digestion cleavage rules, and then all hypotheses that match the observed precursor mass within a prescribed tolerance are selected for scoring. Although this is a brute force approach, it is the dominant approach in current use, eclipsing approaches that use sequence tags. The two most common search engines, the "MS/MS ions" mode of the Mascot search engine (19) and the SEQUEST search engine (18), are of this type. The main reason for this is almost certainly the ease of automated analysis relative to sequence methods, which often require some manual sequencing.
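The precursor-type selection step described above can be illustrated with a short Python sketch. It enumerates tryptic peptides from a protein sequence and keeps those whose theoretical mass matches the observed precursor mass within a tolerance. The function names, the one-missed-cleavage default, and the 0.2-Da tolerance are assumptions made for this example and are not taken from Mascot or SEQUEST; modifications are ignored for brevity.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added to the residue sum


def peptide_mass(seq):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


def tryptic_pieces(protein):
    """Cut after Lys or Arg unless the next residue is Pro."""
    pieces, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            pieces.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        pieces.append(protein[start:])
    return pieces


def tryptic_peptides(protein, max_missed=1):
    """Yield peptides spanning up to max_missed uncut cleavage sites."""
    pieces = tryptic_pieces(protein)
    for i in range(len(pieces)):
        for j in range(i, min(i + max_missed + 1, len(pieces))):
            yield "".join(pieces[i:j + 1])


def select_candidates(proteins, observed_mass, tol_da=0.2):
    """Keep every enumerated peptide within tol_da of the observed mass."""
    hits = []
    for protein in proteins:
        for pep in tryptic_peptides(protein):
            if abs(peptide_mass(pep) - observed_mass) <= tol_da:
                hits.append(pep)
    return hits
```

Every peptide returned by `select_candidates` would then be passed to the scoring stage, which is why the size of the enumerated space directly drives search time.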
Although less widely used, sequence tag algorithms should, in theory, be more powerful: greater selectivity during hypothesis selection also gives this type of algorithm the potential to be faster. However, in addition to being less practical for high throughput applications, sequence tags also come with a significant risk: an incorrect sequence tag call may exclude the right answer from consideration. Initially most tag-based methods relied on a single interpreted tag per spectrum where the assumption is made that the interpretation is correct. That is, the sequence information is used as a hard filter; portions of the database without this sequence are not considered. Newer tag-based approaches such as GutenTag (13) and InsPecT (14) have offered improvements by automatically determining sets of many smaller tags that are used to restrict the search to sequences in the database that contain at least one of the tags.
Although precursor methods are broadly used, they do have significant limitations. Unlike sequence methods, the presence of a feature on a peptide that is not allowed in the search will prevent it from ever being identified. For example, if a peptide is N-terminally acetylated, but this feature is not allowed in the search, only wrong answers can be returned for a mass spectrum of this peptide. It might seem that the solution is simply to allow for a large number of variations in the search. This is not feasible, however, because it would bring with it a combinatorial explosion in additional wrong answers that would also need to be scored, yielding unacceptable search times and poor discrimination in scoring. In current practice, the upper limit of what is tractable with precursor-type search engines is around 6–10 modifications. Partly because of the challenges of large search space, current analyses typically only identify a fraction of the total MS/MS spectra acquired, roughly 5–20% for low resolution ion trap type instruments (3, 31) and 15–70% for quadrupole time-of-flight instruments (24, 32). In some cases, there may be 2–3-fold more spectra with sufficient fragmentation quality that go unidentified because of unexpected cleavages, incorrect monoisotopic peak assignments, incorrect charge state determinations, modifications and substitutions not considered, etc. Although the frequency of any single feature might be relatively small, collectively allowance for many less frequent features can account for a significant number of additional spectra, and thus it is desirable to find ways to improve exploration of large search space.
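The combinatorial growth mentioned above can be made concrete with a small Python sketch that counts how many modified forms a single peptide can take as variable modifications are added. It assumes that each residue independently carries either no modification or exactly one applicable modification, and it ignores terminal modifications; the function name and the modification sets are illustrative only.

```python
def count_modified_forms(peptide, variable_mods):
    """Count modification states of a peptide.

    variable_mods maps a modification name to the set of residues it can
    modify. Each residue contributes (1 + number of applicable mods)
    choices, and the choices multiply across residues.
    """
    total = 1
    for aa in peptide:
        applicable = sum(1 for targets in variable_mods.values() if aa in targets)
        total *= 1 + applicable
    return total


peptide = "MSTK"
mods = {"Oxidation": {"M"}}
print(count_modified_forms(peptide, mods))

mods["Phospho"] = {"S", "T", "Y"}
print(count_modified_forms(peptide, mods))

mods["Methyl"] = {"K"}
print(count_modified_forms(peptide, mods))
```

For this four-residue peptide the count doubles with each modification that finds a new target residue, and longer peptides with several candidate sites grow far faster, which is the explosion in wrong answers that limits precursor-type searches.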
The Paragon Algorithm presents a new approach to protein identification. In contrast to recent advances in peptide identification, the algorithm relies on three key innovations that have nothing to do with the scoring stage. Our efforts have focused on the hypothesis selection stage, driven by the belief that there is greater potential for improvement from advances in determining what to score, not how to score it. First, the likely relevance of each sequence segment of a database to the MS/MS spectrum is quantified on a continuum using many weighted de novo sequence tags to compute a Sequence Temperature Value (STV). Second, feature probabilities are formally used to model the frequencies of peptide features such as modifications, digestion events, and substitutions, allowing the estimation of a net probability of any peptide hypothesis. The use of feature probabilities has also allowed a great reduction in the algorithmic complexity of the user interface through the implementation of a translation layer between what the user describes and what the engine understands. Third, an overall threshold is applied to the net effect of STV and feature probabilities, yielding a highly selective triage of which peptide hypotheses are worth scoring. The assessment of both tag evidence and feature probabilities on a continuum allows the efficient balancing of scoring effort to be commensurate with the likelihood that a candidate is worth scoring. Sequence regions more likely to be related to the correct answer for a spectrum are "searched more extensively" in the sense that peptide hypotheses with lower combined feature probabilities will be scored, whereas weakly implicated sequence segments are "searched less," only scoring precursor matches for peptides that have highly probable features.
The Paragon Algorithm offers significant advances in performance in searching very large search space and removes much of the informatics expertise barrier to doing quality protein identification by tandem mass spectrometry while maintaining the automation of conventional precursor-type search engines. The focus of this study was the fundamental description and validation of this new technology.
| EXPERIMENTAL PROCEDURES |
|---|
130, and the true dynamic range of concentrations is likely to be over 3 orders of magnitude due to the additional contaminant proteins detectable in the purchased stocks.
Mass Spectrometry—
The resulting peptide mixture was separated by reverse phase chromatography (Tempo™ nano-LC system, Applied Biosystems) using a 75-µm inner diameter × 150-mm PepMap C18 column (Dionex) and a 30-min linear gradient from 5 to 30% acetonitrile in 0.1% formic acid with a total flow rate of 300 nl/min. The eluting peptides were ionized by electrospray ionization and analyzed by a QSTAR® Elite QqTOF system (Applied Biosystems/MDS Sciex). Peptide MS/MS spectra were acquired in an information-dependent manner utilizing the Analyst QS software 2.0 acquisition features (Smart Exit, rolling collision energy, and dynamic exclusion). The raw data file is included in the supplemental data.
Peak List Creation—
Reduction of raw data in the *.wiff format to searchable MS/MS peak lists was conducted without any merging of putatively like spectra. No restriction of mass range for precursors was applied beyond the constraints used during acquisition. Spectra containing fewer than three fragment peaks were not searched. For the file examined in this study, no spectra were rejected. Peak lists are created automatically at the beginning of a search in ProteinPilot Software using this protocol. Mascot Generic Format peak list files (.mgf) generated from the raw data file in this study using both the 1.0 and 2.0 versions of ProteinPilot Software have been included in the supplemental data.
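For readers working with the supplied .mgf files, the peak-count filter described above could be reproduced with a sketch like the following. This is not the ProteinPilot implementation; the parser handles only the basic `BEGIN IONS` / `END IONS` block layout of Mascot Generic Format, and the function names and dictionary layout are assumptions of the example.

```python
def read_mgf(lines):
    """Parse MGF text lines into a list of spectra.

    Each spectrum is a dict with 'params' (KEY=value header lines) and
    'peaks' (a list of (m/z, intensity) tuples).
    """
    spectra, current = [], None
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None and line:
            if "=" in line and not line[0].isdigit():
                key, value = line.split("=", 1)
                current["params"][key] = value
            else:
                parts = line.split()
                mz = float(parts[0])
                intensity = float(parts[1]) if len(parts) > 1 else 0.0
                current["peaks"].append((mz, intensity))
    return spectra


def searchable(spectra, min_peaks=3):
    """Drop spectra with fewer than min_peaks fragment peaks."""
    return [s for s in spectra if len(s["peaks"]) >= min_peaks]
```

Calling `searchable(read_mgf(open(path)))` with a hypothetical `path` would return only the spectra that meet the three-peak minimum.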
Mascot Searches—
Mascot searches were performed from ProteinPilot Software version 1.0 to ensure that exactly the same peak list was searched by both Mascot and the Paragon Algorithm. The Mascot server was version 2.1 and was run on a Dell Precision 340 computer with a Pentium IV 2.4-GHz processor, 1.0 gigabyte of RAM, and Windows XP SP2.
Paragon Searches—
All Paragon searches were run using ProteinPilot Software version 1.0 on a Dell Latitude D810 laptop computer with a Pentium M 1.86-GHz processor, 2.0 gigabytes of RAM, and Windows XP SP2. To allow better comparison with Mascot and to avoid the issue of modification identification, custom modification sets, depleted relative to those used in normal operation of the software, were created so that the search space more closely matched that of Mascot. Repetition of several of the same Paragon searches on the desktop computer used to run Mascot searches showed that the two hardware configurations were roughly equivalent. Small search space Paragon searches ran 15% faster on the desktop configuration, whereas large search space searches ran about 17% slower on the desktop. These differences were relatively small, and the point of emphasis in the results is on the relative trends, not absolute speed measurement.
Annotation of Spectra for Performance Evaluation—
An annotation was created for the reference file where the correct sequence was explicitly determined for a subset of the spectra in the whole file. The orthogonal nature of the protein information was leveraged to avoid bias toward either search engine while still allowing advantages to be detected. That is, a consensus set of confident proteins was determined from Mascot and Paragon-Pro Group analyses, and then only peptide IDs to these very confident proteins were included in the annotation. This approach allowed a natural distribution of fragmentation qualities to be included in the annotation, which thus contains a realistic distribution of low confidence to high confidence peptides. The goal was not to annotate every spectrum in the file, nor was it a goal to precisely determine the exact modification location; the aim was to identify only the correct sequence for each spectrum, accepting that this method is not perfect.
To accomplish this, protein identification analyses were conducted with both Mascot 2.1 and Paragon-Pro Group with the same search types later used for comparison, and the best peptide answers for each spectrum according to the best set of proteins were manually aligned for all searches. The only difference between the Paragon searches run for annotation and the searches used for comparison was that the normal set of 35 workup modifications was used for searches for annotation rather than the depleted sets. This yielded 1228 of the total 1987 spectra (62%) with an answer in at least one of the searches. Note that these were not necessarily the top ranked peptides for each spectrum. Each spectrum was manually validated for the presence of an answer with sufficient orthogonal evidence to be included in the annotation without risk of bias toward the engine that produced it if it was found by only one of the engines. Then the intended grading protocol was run on all of these searches, and all cases where either engine reported a high confidence answer that was graded as incorrect were inspected manually. There were few of these cases, and the majority of them were due to Lys/Gln differences or absence of one of these forms in the searched database. Because of this, we decided to allow Lys/Gln difference during grading.
Each peptide answer in the annotation had to be associated with a multi-hit protein or have a clear consensus peptide identification between the two engines, and the vast majority had both conditions. This reduced the set of 1228 spectra down to 902 of which an additional 12 spectra were excluded because spectra with ambiguous charge state assignments were not handled properly in submitting peak lists to Mascot in the first version of the software. Ultimately this left 890 spectra that were included in the annotation of which 708 (80% of 890) had correct answers that were sequences found by both search engines, not necessarily in the same spectrum. The other 182 (20%) of the annotated spectra were sequences from Paragon only, but they were from proteins clearly found by Mascot and had at least 50% confidence in one of the Paragon Algorithm searches. Because the full workup modification sets were used for annotation but not the main series of searches in this study, 96 of these 182 were out of search space for both search engines because the right answers had modifications that were not allowed. Most of these additional modifications were from minor side reactions of iodoacetamide such as modification of peptide N termini and reaction with methionine followed by dethiomethylation. Of the spectra in the annotation, 90% were associated with the top 32 proteins in the Paragon Thorough search of the CDS Combined database, meaning the vast majority of annotated spectra were connected to extremely solid protein identifications. The peptide set generally had few missed cleavages with 91% having none, 8% having one, and 1% having more than one missed cleavage. 
Because the file was a relatively deep characterization in terms of the number of spectra per proteins detectable in the sample and because the annotation set is enriched for peptides from multi-hit proteins, the frequency of cleavages at sites other than tryptic specificity was moderately high with 70% fully tryptic, 29% semitryptic, and less than 1% fully non-tryptic. The annotation and additional statistics are included in the supplemental data.
Grading Searches against the Annotation—
All search results were graded against the annotation for only the subset of 890 spectra for which the right answers were known. The grading protocol compared the peptide sequence of each answer against the known correct sequence(s) for the spectrum allowing for bidirectional Ile/Leu and Lys/Gln substitution and unidirectional Asn → Asp and Gln → Glu to allow for equivalence via deamidation. It was determined that more than one correct sequence should be allowed for 12 spectra (1.3% of 890) because manual inspection of the spectra showed they lacked fragmentation information that could favor a single answer. Virtually all of these cases had a pair of or several shuffled residues. The exact modification state, including name and location, was not considered as part of the grading procedure, consistent with the effort to remove the issue of modification finding throughout this study.
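The grading rule can be sketched in Python as a residue-by-residue comparison. This is an illustrative reading of the protocol, not the script used in the study: it assumes that the unidirectional substitutions run from the annotated residue to the reported residue (an annotated Asn may be reported as Asp, but not the reverse), that sequences must have equal length, and that modifications are ignored. All names are hypothetical.

```python
# Residue pairs treated as interchangeable in either direction.
BIDIRECTIONAL = [{"I", "L"}, {"K", "Q"}]
# (annotated residue, reported residue) pairs accepted in one direction only,
# reflecting deamidation of Asn to Asp and of Gln to Glu.
UNIDIRECTIONAL = {("N", "D"), ("Q", "E")}


def residues_equivalent(annotated, reported):
    """True if a reported residue is an acceptable match for the annotated one."""
    if annotated == reported:
        return True
    if any(annotated in pair and reported in pair for pair in BIDIRECTIONAL):
        return True
    return (annotated, reported) in UNIDIRECTIONAL


def is_correct(reported_seq, accepted_seqs):
    """Grade one reported peptide against the accepted sequences for a spectrum."""
    for ref in accepted_seqs:
        if len(ref) == len(reported_seq) and all(
            residues_equivalent(a, r) for a, r in zip(ref, reported_seq)
        ):
            return True
    return False
```

Passing a list of accepted sequences accommodates the 12 spectra for which more than one answer was allowed. Because equivalence is checked per position rather than by chaining substitutions, an annotated Lys is not matched by a reported Glu.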
Receiver Operating Characteristic (ROC) Analysis—
ROC data were generated for a search by taking all first reported peptide answers for each spectrum, sorting the list by the peptide discriminating variable of the search engine, and tallying the cumulative sum of correct and incorrect first answers according to grading against the annotation, moving from highest to lowest confidence. The discriminating variable for the Paragon Algorithm is the peptide Confidence value, which is a 0–99.0 scaled real number. The peptide E-value was used as the discriminating variable for Mascot. This was chosen over the ion score because it takes advantage of spectrum-specific significance thresholds. Note that only the first answer was considered; other degenerate top ranked answers were not considered. This is necessary because engines may vary in their granularity of binning in ranking answers.
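The tally described above can be written as a few lines of Python. The sketch assumes each spectrum contributes one tuple holding the discriminating value of its first reported answer and the grading outcome; the function name and argument layout are choices made for this example.

```python
def roc_points(first_answers, higher_is_better=True):
    """Cumulative (incorrect, correct) counts from most to least confident.

    first_answers: iterable of (discriminating_value, is_correct) tuples,
    one per spectrum. Use higher_is_better=True for a confidence-like
    value and False for an E-value, where smaller means more confident.
    """
    ordered = sorted(first_answers, key=lambda a: a[0], reverse=higher_is_better)
    correct = incorrect = 0
    points = []
    for _, ok in ordered:
        if ok:
            correct += 1
        else:
            incorrect += 1
        points.append((incorrect, correct))
    return points
```

Plotting the second element of each point against the first gives a curve that hugs the y axis for as long as the most confident answers are all correct, which is the behavior discussed in the results.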
| RESULTS |
|---|
![]() |
The summation in the denominator includes only one member for each set of highly identical peptides. This allows a set of very similar high quality matches to all have high confidence (where generally only one among the ambiguous set is actually right) while it brings a beneficial competitive element that dilutes the confidences in cases with many dissimilar marginal matches. The probability of a peptide hypothesis, $p_{\text{hypothesis}}$, is determined by information independent of MS/MS fragmentation,
$$p_{\text{hypothesis}} = \prod_{f} p_f \qquad \text{(Eq. 2)}$$
where $p_f$ are probability factors for various features of the peptide hypothesis such as modifications or lack of expected modifications, conformance of peptide termini to expected digestion patterns, and consistency of the observed precursor ion to the theoretical mass to charge ratio. For example, a tryptic peptide with the expected cysteine alkylation modification would have a much higher hypothesis probability than a peptide with neither end conforming to tryptic digestion and missing an expected modification. We estimate the $p_f$ factors by empirically measuring the fraction of occurrences of a feature. For example, the probability of cleavage between lysine and proline could be estimated as follows.
![]() |
Clearly the frequencies of features will vary even among data sets that are putatively treated and acquired the same way. We have found that for the various ways feature probabilities are used by Paragon, estimating average values by looking at many samples of the same type captures enough of this variation, and more importantly, Paragon has proven to be quite robust such that rough estimates are sufficient.
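A minimal Python sketch of this bookkeeping follows. It assumes that a feature probability is estimated as the fraction of opportunities in which the feature was observed and that the factors for one hypothesis combine by multiplication; the counts and factor values are invented for illustration and are not measurements from this study.

```python
from math import prod  # available in Python 3.8 and later


def estimate_feature_probability(times_observed, times_possible):
    """Fraction of opportunities in which a feature was actually observed."""
    if times_possible <= 0:
        raise ValueError("need at least one opportunity to observe the feature")
    return times_observed / times_possible


def hypothesis_probability(feature_factors):
    """Combine per-feature factors by multiplication (assumes independence)."""
    return prod(feature_factors)


# Hypothetical counts: 3 cleavages seen among 200 Lys-Pro sites.
p_lys_pro = estimate_feature_probability(3, 200)

# Hypothetical factors for one peptide hypothesis: expected alkylation
# present, one nonconforming terminus, and a Lys-Pro cleavage.
p_hypothesis = hypothesis_probability([0.9, 0.5, p_lys_pro])
print(p_lys_pro, p_hypothesis)
```

Because the factors multiply, a single improbable feature lowers the hypothesis probability sharply, and rough estimates of each factor mostly shift hypotheses up or down together rather than reordering them.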
Taglet Search Component—
For a given spectrum, a substantial number of sequence "taglets" two and three amino acids long are called. Each tag is rated on a quality scale to indicate how likely it is that the tag call is correct. In this case, the tags are generated by doing de novo sequencing and breaking the results into continuous or nearly continuous sequence sections, although there are clearly other methods for automated derivation of sequence tags that could be used. Similarly the approach could be used with variable or longer length tags than have been used here. Modified amino acids are used in tag calling where the set of allowed modifications is determined automatically by applying a threshold to the estimated modification feature probabilities. Each called tag is then matched against the database to find all locations where the sequence occurs. The sequences in the database are divided into segments seven residues in length. Longer, shorter, or even variable length segments could be used instead. The degree to which a segment is implicated by the set of tags is evaluated as the STV for that segment,
$$\mathrm{STV}_i = T_i + c\,T_{i-1} + c\,T_{i+1} \qquad \text{(Eq. 4)}$$
where $T_i$ is the net evidence or score from all taglets mapping to segment $i$ calculated as
$$T_i = \sum_{j=1}^{n} t_j \qquad \text{(Eq. 5)}$$
the sum of tag quality scores $t_j$ for all $n$ tags that map to segment $i$. Of course, there are many ways to determine a "net effect" of the evidence of many tags of differing qualities. The second and third terms in Equation 4 allow the neighboring segments of a segment in the database to influence the STV of the segment by including their $T$ scores diminished by some fractional coefficient, $c$.
The calculation of STV for all sequence segments in the database allows them to be ranked, producing a full range of the degree to which each segment is implicated by the set of tags. Segments that are closer to the true sequence of the correct peptide for the spectrum should be ranked higher because more and higher quality tags hit these segments, whereas segments that are unlikely to be related to the correct peptide should be ranked lower because fewer and poorer tags match these segments.
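The segment scoring and ranking described above can be sketched in Python. The sketch assumes that a tag is credited to the segment containing its first residue, that both neighbors are weighted by the same coefficient c, and that c = 0.5; none of these details are specified in the text, and the function names are hypothetical.

```python
def segment_scores(sequence, tags, seg_len=7):
    """T_i for each fixed-length segment of a database sequence.

    tags is a list of (tag_sequence, quality) pairs. Every occurrence of a
    tag adds its quality to the segment holding the tag's first residue.
    """
    n_segments = (len(sequence) + seg_len - 1) // seg_len
    scores = [0.0] * n_segments
    for tag, quality in tags:
        if not tag:
            continue  # ignore empty tags
        start = sequence.find(tag)
        while start != -1:
            scores[start // seg_len] += quality
            start = sequence.find(tag, start + 1)
    return scores


def sequence_temperature(scores, c=0.5):
    """STV_i = T_i plus c times the T score of each neighboring segment."""
    stv = []
    for i, t in enumerate(scores):
        left = scores[i - 1] if i > 0 else 0.0
        right = scores[i + 1] if i + 1 < len(scores) else 0.0
        stv.append(t + c * left + c * right)
    return stv


def rank_segments(stv):
    """Segment indices ordered from hottest to coolest."""
    return sorted(range(len(stv)), key=lambda i: stv[i], reverse=True)
```

With a 23-residue sequence and three taglets, `rank_segments(sequence_temperature(segment_scores(seq, tags)))` returns the segments in the order in which they would deserve search effort. The neighbor term lets a segment adjacent to a strongly implicated one inherit some temperature, which matters when a peptide spans a segment boundary.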
Peptide hypotheses are then generated from each segment using all allowed features regardless of probability, and then the overall probability for each hypothesis being the correct answer for the spectrum is calculated using the fundamental Paragon equation.
$$p = p_{\text{segment}} \cdot p_{\text{hypothesis}} \cdot p_{\text{protein}} \qquad \text{(Eq. 6)}$$
The probability that the segment used to generate a peptide hypothesis is associated with the correct answer, $p_{\text{segment}}$, is determined based on the STV ranking of the segment among all segments. The probability $p_{\text{protein}}$ that the protein corresponding to the peptide hypothesis is detected in the sample as estimated by the initial search can also be factored into the decision to score or not to score a peptide.
By applying a threshold to the fundamental equation, Equation 6,
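$$p_{\text{segment}} \cdot p_{\text{hypothesis}} \cdot p_{\text{protein}} > \text{threshold} \qquad \text{(Eq. 7)}$$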
scoring can efficiently be limited to only those peptides that have an overall probability that assures a minimum level of believability while at the same time letting search space be very large for those segments very likely to contain the true sequence. This is the central innovation in the Paragon Algorithm. Sequence segments with very "hot" STVs are searched addressing very large search space such that peptide hypotheses containing lower probability features such as unlikely modifications and unexpected cleavages will be considered. Segments at the other limit of "cool" STVs are searched within very small search space such that only peptide hypotheses with the most probable features will be considered. These ideas are illustrated in Fig. 1B.
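The triage can be illustrated with a short Python sketch. It assumes the overall probability is the product of the segment, hypothesis, and protein probabilities and that a hypothesis is scored when this product reaches a fixed threshold; the threshold of 0.001 and the example probabilities are invented for illustration.

```python
def worth_scoring(p_segment, p_hypothesis, p_protein=1.0, threshold=1e-3):
    """True if the combined probability justifies scoring the hypothesis."""
    return p_segment * p_hypothesis * p_protein >= threshold


def triage(hypotheses, threshold=1e-3):
    """Filter (label, p_segment, p_hypothesis) tuples down to those worth scoring."""
    return [
        label
        for label, p_seg, p_hyp in hypotheses
        if worth_scoring(p_seg, p_hyp, threshold=threshold)
    ]


candidates = [
    ("hot segment, unusual features", 0.5, 0.01),
    ("cool segment, unusual features", 0.01, 0.01),
    ("cool segment, expected features", 0.01, 0.9),
]
print(triage(candidates))
```

In this example the hypothesis with unusual features survives only on the hot segment, whereas the cool segment is still searched for peptides with highly probable features. That is the sense in which search space expands or contracts with STV.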
Note that because precursor mass delta is a factor, this approach is able to consider hypotheses that differ greatly from the expected mass. Robustness to inaccurate precursor mass information is one of the conventional advantages of sequence-based approaches over precursor-based methods, and this approach preserves that benefit. As long as the net effect of STV and other factors is favorable, large delta hypotheses can be considered in scoring. This allows identifications that would be lost by other approaches to be recovered, for example, in cases where multiple peptides pass the first mass analyzer.
Because the algorithm uses many tags and considers their qualities, identifications can also be recovered for some cases where the exact sequence is not in the database or the appropriate modifications were not considered. Identifications of this type will appear with multiple improbable features. The modifications, their locations, and the digestion information should not be considered reliable in these cases, but there will generally be a significant portion of the sequence that is correct, if the confidence is high, which often allows connection to the correct protein or a close homolog. For example, in the annotation in the supplemental data there is a peptide reported as AQCHTVEK with N-terminal carbamylation, deamidation of Gln, carbamidomethylcysteine, and a semitryptic Cys-Ala cleavage at the N terminus. However, a better interpretation of the spectrum would probably be the corresponding tryptic peptide CAQCHTVEK with an internal disulfide, a modification that was not allowed. Most of the sequence is correct, allowing connection to the right protein, but the exact details of the answer are not reliable. Note that this peptide was out of search space for all searches analyzed in the study so this inaccuracy in the annotation had no impact on the results presented here.
Protein Inference Analysis—
The peptide ID search results are passed automatically to a third component, the Pro Group Algorithm, which is also part of ProteinPilot Software. This algorithm takes the top 10 peptide hypotheses for each spectrum as input, regardless of maximal confidence, and rigorously distills this set into the set of proteins that can be reported as having been detected with a specified level of confidence in a way consistent with established publication guidelines (33, 34). The Pro Group Algorithm is described elsewhere.1–3
User Interface Control and Parameterization—
In conventional search engines, all method settings explicitly control how to do the search. In an effort to remove algorithmic complexity and reduce the risk of incorrect parameterization, a user interface was developed that hides virtually all of these direct algorithmic controls. This was achieved by implementing a business logic layer containing a "translation" framework whereby user input can be in the language of the experimental scientist as description of 1) the sample and treatment (cysteine alkylation, digestion, labeling scheme, acquisition instrument, and species) and 2) what is desired from the search in terms of the compromise between speed and the quality of the result. This simple input is then translated into the optimal set of algorithmic settings. For example, selecting trypsin as the digestion agent is translated to a set of digest feature probabilities to capture major and minor specificities of trypsin as well as a background rate for all other potential cleavage sites. This obviates the need to do "semitrypsin" or "no enzyme" searches on a tryptic sample because search space is made large enough for segments with very hot STV that many peptides of these less common types can be identified. The same concept of the translation of workflow factors into more complex feature probability descriptions applies to the rest of the method input. Selecting iodoacetamide as the cysteine alkylation would be translated into a set of feature probabilities that includes the major modification on Cys and also known less frequent side reactions from the reagent. A field called Special factors captures additional workflow steps that would impact how a search should be parameterized. For example, an option in this list called Gel-based ID translates into the increased frequency of oxidation artifact modifications and modifications due to acrylamide. 
For all sample types, there is a translation that describes the background rate of general workup (artifact) modifications like pyroglutamic acid formation, oxidations of methionine, and deamidation. The need to set mass tolerances is obviated by inferring expected variances of MS and MS/MS data for the specified instrument. There are only a few actual options about how to do a search. There are ID focus options, which allow the additional consideration of large sets of biological post-translational modifications and/or substitutions. The desired tradeoff between speed and quality of result is indicated by selecting a Rapid or Thorough search. The former only runs the simple Fraglet search component, whereas the latter invokes the Taglet component as well. A sequence database to be searched is selected, and a species constraint function is applied as with other search engines. A table containing all the modification translations is included in the help function within the software, and missing modifications can be defined. A figure showing the method definition screen is included in the supplemental data.
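The translation idea can be sketched as a lookup from workflow descriptions to feature probabilities. The sketch below is purely illustrative: the table contents, the modification labels, and every numeric value are hypothetical and are not the values used by ProteinPilot Software. Only the overall shape, in which a user-level choice expands into several feature probabilities, follows the description above.

```python
# Hypothetical translation tables; all probabilities are made-up examples.
DIGEST_TRANSLATION = {
    "Trypsin": {
        "cleave_after_K_or_R": 0.95,   # major specificity
        "cleave_K_or_R_before_P": 0.02,  # minor specificity
        "cleave_other_site": 0.01,     # background rate
    },
}
ALKYLATION_TRANSLATION = {
    "Iodoacetamide": {
        "Carbamidomethyl(C)": 0.97,       # major modification on Cys
        "Carbamidomethyl(N-term)": 0.02,  # known side reaction
        "Carbamidomethyl(M)": 0.01,       # known side reaction
    },
}
SPECIAL_FACTORS = {
    "Gel-based ID": {"Oxidation(M)": 0.10, "Propionamide(C)": 0.03},
}


def translate(digestion, alkylation, special_factors=()):
    """Expand user-level workflow choices into feature probabilities."""
    settings = {
        "digest": dict(DIGEST_TRANSLATION[digestion]),
        "modifications": dict(ALKYLATION_TRANSLATION[alkylation]),
    }
    for factor in special_factors:
        for mod, p in SPECIAL_FACTORS[factor].items():
            # Keep the larger probability if a modification is listed twice.
            current = settings["modifications"].get(mod, 0.0)
            settings["modifications"][mod] = max(p, current)
    return settings


settings = translate("Trypsin", "Iodoacetamide", ["Gel-based ID"])
```

A design of this kind keeps the engine's probability model out of the user's view: adding a new workflow option means adding a table entry rather than exposing another algorithmic control.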
A fully functional trial version of the software is available for download (ProteinPilot, Applied Biosystems). This software is completely independent of the instrument acquisition software so any modern Windows-based computer can run it.
Searches for Comparative Assessment Versus Mascot
A series of five Paragon Algorithm searches and five Mascot searches were run to assess relative performance both between the two search engines and also among the different searches with each engine. Table I shows a summary of the searches run, the parameters used, and measurements from the analysis of each search. Searches were run on two different FASTA format protein databases, the UniProtKB/Swiss-Prot database, which has about 200,000 proteins, and the CDS Combined file (35, 36), which is essentially a version of National Center for Biotechnology Information (NCBI) NR.fas that has been made truly non-redundant for public proteins, and then Celera proteins are added to this set, yielding a total size of about two million proteins. These two were chosen because they differ in size by an order of magnitude, but both carry a full diversity of proteins from different species as was necessary to search the test mixture of proteins from multiple species. Each row in the table represents a type of search where an effort has been made to align similar Mascot and Paragon searches in terms of what peptides can be found by each search. The first row for each database is referred to as the Small Search Space (Small SS) search type. Because the Paragon Rapid search effort setting is like a conventional precursor-type search engine, it is possible to achieve nearly identical search space between the two engines for this search type as indicated by the listed parameters. Custom Paragon modification sets were created, removing the majority of features in the much larger set normally used, to allow exact alignment with Mascot in modifications. All modifications were variable modifications in all of these 10 searches. The only point of difference in the Small SS searches is the mass tolerances as will be discussed later. 
The second search type will be referred to as Large Search Space (Large SS) search type for which Mascot is run with the semitrypsin digest setting and the Paragon Algorithm is run in its Thorough search effort setting with trypsin specified as the digestion agent. These two searches have been aligned because they both enable finding peptides that only conform to tryptic specificity on one end. There is a substantial amount of these "semitryptic" peptides in the annotation. A few additional highly specific modifications have been added as well. The third search type, referred to as the No Digest Search Space (No Digest SS) search type, has the same parameters as the Large SS type except that all digest conformance requirements were removed, meaning any sequence in the database could be returned. The No Digest SS search was not run on the CDS Combined database because the Mascot search would have taken over a week to run on the hardware used for the comparison, and this size exceeds the RAM per process limitations with 32-bit processing for Paragon, a limitation that will be addressed in future work. The resulting Mascot Significance scores for each search are also listed in Table I.
The most striking aspects of the results in Fig. 2 are in the comparison of the effects of increasing search space on each search engine. Fig. 2, A and B, and Table I show that the Mascot Large SS searches yield more correct answers than the Small SS searches; however, this comes with some cost in discrimination. In Fig. 2, A and B, the green lines start to break from the y axis sooner than the red lines for Mascot. This same tradeoff, increased sensitivity at the cost of decreased discrimination, is not observed with Paragon in going from the Small SS searches to its Large SS searches. Comparison of the heavy green lines to the heavy red lines in Fig. 2, A and B, indicates much greater detection with larger search space without any apparent loss of discrimination. The green lines simultaneously go much higher and clearly break from the y axis later than the heavy red lines. Although these differences may be difficult to discern in Fig. 2, A and B, the differences are stark in Fig. 2, D and E. There is a clear loss of discrimination between the red and green lines for Mascot, whereas there is almost no detectable difference between the red and green lines for the Paragon Algorithm despite a huge increase in the effective search space.
Although the sample actually was digested with trypsin, the No Digest SS searches are another important test case for the differences in handling large search space, representing the upper limit in the digestion variable of search space. Table I shows that both Mascot and Paragon lose a few right answers relative to the Large SS searches, 742 down to 724 for Paragon and 680 down to 671 for Mascot. The answers are "lost" because the drastically enlarged search space layers so much statistical noise on top of the signal that the correct answers no longer fall within the top five and top 10 reported answers for Paragon and Mascot, respectively. The increase in noise also causes the percentage of right answers in search space that are ranked first to drop from 97.3 to 93.0% for Paragon and from 92.5 to 87.0% for Mascot. Fig. 2C indicates the same trend when the whole curves are considered rather than just the end points in Table I. Although the Paragon Algorithm does lose some performance when digest specificity is removed, its No Digest SS search is strikingly better than that of Mascot. Nearly twice as many right first answers are reported by Paragon before the lines begin to break away from the y axis. This means the yield of highly confident identifications is approximately double with Paragon for no digest searching. The Paragon No Digest SS search even outperforms the Large SS search of Mascot in Fig. 2C and appears to be equal or better in discrimination in Fig. 2D (heavy blue line versus light green line).
Although we eliminated the modification variable as a source of differences in search space, there are still real differences in total right answers for the two larger search space types listed in the Correct Full 890 columns. To understand what these differences were, we did a detailed examination of the CDS Combined Large SS searches where Paragon found right answers in any rank for 784 spectra compared with 681 for Mascot. A Venn analysis of these searches determined that both search engines found right answers for 677 of the spectra, whereas only Mascot found right answers for four spectra, two where the correct answer was ranked first and two where it was not, and only Paragon found right answers for 107 spectra, 82 of which had correct first answers. The four spectra where only Mascot reported a correct answer all had low confidences (E-values of 130, 12, 4, and 4300) and were all semitryptic peptides. For the Paragon-specific spectra, we focused on the 82 spectra where the first answer was correct. One way search space is larger for Paragon in its Thorough mode is that observed versus theoretical peptide delta masses much larger than the tolerances normally used in precursor-type database searches can be considered. This allows good identifications to be recovered when the wrong peak was called as the monoisotopic peak or when a secondary peptide species present within the precursor isolation window contributes to or even dominates the observed fragmentation. To our surprise, the large delta peptides did not account for the majority of additional detections relative to Mascot. 15 of the 82 spectra had delta masses off by close to 1 Da, whereas another 15 had delta masses off by 2 Da or more. About 40% of these 30 correct "large delta" cases had confidences greater than 95%. There were two other differences in search space that could account for some of the remaining 52 spectra in this set of 82. 
First, the number of missed cleavages considered by Paragon is not limited to a fixed value. One peptide had five missed cleavages, accounting for an additional three spectra. Another difference is that the Paragon Thorough mode search with trypsin set as the digestion agent can actually find peptides that do not conform to expected tryptic cleavage on either end, often referred to as "non-tryptic" peptides. Because these are rare, we did not expect this to account for many of the spectra, and accordingly, only six spectra were explained as fully non-tryptic peptides. All of these were verified manually, belonged to the top 11 proteins, and had cohort peptides with overlapping sequence, including semitryptic peptides with common cleavages. In total, differences in search space accounted for only 39 of the 82 spectra (30 large delta, three with many missed cleavages, and six fully non-tryptic), meaning the remaining 43 should be in search space for Mascot. Furthermore, of the 82 spectra, 74.4% of the correct answers were actually tryptic, 18.3% were semitryptic, and 7.3% were fully non-tryptic. Other than an expected enrichment for non-tryptic peptides, this is essentially the same breakdown as was observed in the whole set of annotated spectra. We manually inspected a sampling of the 43 spectra to see what answers Mascot did report. In many cases, there were so many close alternative sequences that the 10 peptides Mascot saved per spectrum had very little sequence diversity. The right answer was effectively being "pushed below the surface" by the huge amount of wrong answer noise from large search space. To further test this theory, we checked the Small SS Mascot search on Swiss-Prot for these spectra to see whether correct answers could be found and observed that more than half, 24 spectra, did have right answers present, and 16 of these were even ranked first.
In other words, for these spectra the right answers went undetected in the very large search space because of poor discrimination, not because of a difference in sensitivity arising from a different allowed search space.
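The bookkeeping behind this Venn analysis can be checked with a short sketch. The spectrum identifiers below are synthetic placeholders chosen only to reproduce the totals quoted in the text, and the function name `venn_counts` is a hypothetical helper, not part of either search engine.

```python
def venn_counts(found_a, found_b):
    """Return (both, only_a, only_b) for two sets of spectrum identifiers."""
    return len(found_a & found_b), len(found_a - found_b), len(found_b - found_a)

# Synthetic IDs (assumption): 784 spectra with a right answer for Paragon,
# 681 for Mascot, overlapping on 677.
paragon_found = set(range(784))
mascot_found = set(range(677)) | set(range(784, 788))

both, paragon_only, mascot_only = venn_counts(paragon_found, mascot_found)

# Breakdown of the 82 Paragon-only spectra whose first answer was correct,
# using the counts given in the text.
large_delta = 15 + 15        # about 1 Da off, plus 2 Da or more off
many_missed_cleavages = 3    # one peptide with five missed cleavages
fully_non_tryptic = 6
explained = large_delta + many_missed_cleavages + fully_non_tryptic
within_mascot_search_space = 82 - explained

print(both, paragon_only, mascot_only)        # 677 107 4
print(explained, within_mascot_search_space)  # 39 43
```

The second printed line is the point of the argument: only 39 of the 82 spectra are explained by search space differences, so the remaining 43 reflect discrimination rather than sensitivity.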
Analysis of Performance on the 397 Consensus Spectra—
To more rigorously interrogate the relative discrimination performance of the two engines in different search modes, we decided to focus on the subset of spectra where the right answer was within search space for all 10 searches. This means all these spectra had a simple tryptic peptide as the right answer. For these spectra, the greater sensitivity that a larger search space provides (as shown in the study of the full 890 annotated spectra) offers no benefit; a larger search space can only be detrimental to discrimination. As described under "Experimental Procedures," the annotation did contain more answers derived from the Paragon Algorithm than from the Mascot algorithm. By focusing on only spectra where both engines can find the right answer in all modes of search, any negative effects from unintended bias should be removed. For this subset of spectra, the right answer is present for all searches, and thus, comparative analyses report purely on differences in discrimination, directly measuring the impact of increasing "noise" going to larger search space. Of the 890 annotated spectra, 805 had a right answer that was found in at least one of the 10 searches examined, 681 (85% of 805) had right answers in at least six of the 10 searches, and 397 (49% of 805) had right answers in all 10 searches.
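The selection of the consensus subset amounts to a set intersection over the searches. The following is a minimal sketch with toy data; the search names, spectrum identifiers, and function names are illustrative assumptions, not the data or tooling used in the study.

```python
from collections import Counter

def consensus_spectra(right_by_search):
    """Spectra with a right answer in every search (empty set if no searches)."""
    found_sets = list(right_by_search.values())
    return set.intersection(*found_sets) if found_sets else set()

def searches_per_spectrum(right_by_search):
    """Count, for each spectrum, how many searches reported a right answer."""
    return Counter(s for found in right_by_search.values() for s in found)

# Toy example with three searches instead of the 10 in the study.
toy = {
    "mascot_small": {1, 2, 3},
    "paragon_small": {2, 3, 4},
    "paragon_large": {2, 3, 4, 5},
}

consensus = consensus_spectra(toy)
coverage = searches_per_spectrum(toy)
at_least_two = {s for s, n in coverage.items() if n >= 2}

print(sorted(consensus))     # [2, 3]
print(sorted(at_least_two))  # [2, 3, 4]
```

Applied to the real data, the same intersection over all 10 searches would yield the 397 consensus spectra, and the coverage counter would yield the "at least one" and "at least six" tallies.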
We repeated ROC curve analyses using only the 397 consensus spectra. Fig. 3, A and B, shows the numerical ROC curves for Swiss-Prot and CDS Combined searches, respectively. As would be expected, the discrimination is weaker for any given search on the larger CDS Combined versus the same search on Swiss-Prot for all searches with both engines. The First in Shared 397 columns in Table I give the data for the end points of these lines for each engine, listing the number of first answers that are right and wrong and the percentage of the 397 that are right. This percentage is one measure of discrimination.
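For readers who want to reproduce this kind of plot, the sketch below shows one way to build a numerical ROC curve. It assumes the curve is formed by walking down the first answers in descending score order and accumulating right answers (y axis) against wrong answers (x axis), which is consistent with the description of lines "breaking from the y axis"; the scores and function names are invented for illustration.

```python
def numerical_roc(first_answers):
    """first_answers: iterable of (score, is_correct) for each spectrum's top hit.

    Returns a list of (wrong_so_far, right_so_far) points, best score first.
    """
    points = [(0, 0)]
    right = wrong = 0
    for _score, is_correct in sorted(first_answers, key=lambda a: a[0], reverse=True):
        if is_correct:
            right += 1
        else:
            wrong += 1
        points.append((wrong, right))
    return points

def right_before_first_wrong(points):
    """Number of right answers accumulated while the curve is still on the y axis."""
    return max(right for wrong, right in points if wrong == 0)

# Toy data (assumption): five spectra with made-up scores.
toy = [(0.99, True), (0.95, True), (0.90, False), (0.80, True), (0.40, False)]
curve = numerical_roc(toy)
final_wrong, final_right = curve[-1]
percent_right = 100.0 * final_right / (final_right + final_wrong)

print(curve)                            # [(0, 0), (0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
print(right_before_first_wrong(curve))  # 2
print(percent_right)                    # 60.0
```

The last point of the curve corresponds to the end point data reported in the table, and the percentage of first answers that are right is the discrimination measure described above.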
The most important feature of the ROC results in Fig. 3, A and B, is the differential impact of increasing search space for each engine. The loss of discrimination going from Small SS to Large SS to No Digest SS (red to green to blue lines) for Mascot is strikingly larger than it is for the same series with the Paragon Algorithm. There is almost no loss of discrimination for Paragon between Small SS and Large SS. As was observed in Fig. 2, D and E, Fig. 3A shows that the Paragon No Digest SS actually discriminates equally if not better compared with Mascot Large SS.
One of the most striking differences between the two engines in analogous cases is the Large SS searches on CDS Combined seen clearly in both the differences between the green lines in Fig. 3B and the end point data in Table I. Because this was the largest difference between the engines and because it was the largest Mascot search space (having the highest significance threshold), this pair of searches was examined in more detail. In the Mascot Large SS search on CDS Combined, a correct answer that was present in its top 10 hypotheses was not successfully ranked as the first answer for 55 of the 397 spectra (13.9%). For the analogous search with Paragon, the failure rate was only 7 in 397 (1.8%). Both engines failed on five of the same spectra, whereas only Paragon failed on an additional two spectra, and only Mascot failed on an additional 50 spectra. Believing that the main difference in performance between the engines should be because the Paragon Algorithm leverages the additional information from STVs and feature probabilities to score far fewer peptides, we theorized that if we took the reported first answer from the cases where Mascot failed to rank a correct answer first we should find that Paragon did not even score this hypothesis for that same spectrum in many cases. This is exactly what was observed. In 48 of 55 cases, the incorrect answer Mascot ranked as its first answer was not even among the top five hypotheses for the same spectrum for Paragon, meaning it is very likely Paragon did not even score the peptide.
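The cross-engine check described here can be sketched as follows. The peptide strings and spectrum names are placeholders, and `missing_from_other_engine` is a hypothetical helper; only the counts 55, 7, 48, and 397 come from the text.

```python
def missing_from_other_engine(wrong_first, other_top_hits):
    """Spectra whose wrong first answer from one engine is absent from the
    other engine's reported hypotheses for the same spectrum."""
    return [spectrum for spectrum, peptide in wrong_first.items()
            if peptide not in other_top_hits.get(spectrum, [])]

def percent(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)

# Toy data (assumption): Mascot's incorrect first answers and Paragon's
# reported top hypotheses for the same three spectra.
mascot_wrong_first = {"spec_01": "AAAK", "spec_02": "CCCR", "spec_03": "DDDK"}
paragon_top_five = {
    "spec_01": ["EEEK", "FFFR"],
    "spec_02": ["CCCR", "GGGK"],
    "spec_03": ["HHHR"],
}
missing = missing_from_other_engine(mascot_wrong_first, paragon_top_five)
print(missing)  # ['spec_01', 'spec_03']

# Rates quoted in the text.
print(percent(55, 397))  # 13.9  Mascot failures to rank a present right answer first
print(percent(7, 397))   # 1.8   Paragon failures
print(percent(48, 55))   # 87.3  Mascot wrong first answers absent from Paragon's top five
```

Absence from the reported top five is only indirect evidence that the hypothesis was never scored, which is why the text says "very likely" rather than stating it outright.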
Searches Using Full Modification Sets—
Custom depleted modification sets were used for Paragon searches to eliminate the modification variable with respect to Mascot. In normal operation, the Paragon Algorithm actually uses a base-line level of 35 workup modifications in all searches with its Thorough search effort setting and also has user-controllable options to additionally consider a set of 94 biological modifications and/or 376 amino acid substitutions. For clarity, we chose to eliminate the modification variable in the validation of the fundamental new ideas in the Paragon Algorithm. However, as a quick check, Fig. 3C shows the same type of analysis on the 397 common set to demonstrate that, even when much larger numbers of modifications are considered, the discrimination still holds up. The figure shows that when considering 35, 129, or even 502 modifications (or substitutions) the discrimination barely decreases. Note, however, that it does change slightly. If the algorithm were using an iterative, filtering approach that removed spectra from further search like many second pass approaches, there would be zero change. However, the separate pass approach in the Paragon Algorithm is considering additional hypotheses for these spectra.
Trends in Numbers of Peptide Hypotheses Scored and Search Times—
To quantitatively assess differences in the number of peptides that are scored in each search type, we added a counter to the Paragon Algorithm scoring function and exported these data for each spectrum. The median number of hypotheses scored among all spectra was determined for each of the five searches as a measure of the actual search space scored. For Mascot, we determined relative -fold changes in the number of hypotheses scored using the changes in significance threshold among the searches. These results are summarized in Table II. To emphasize the trends more than the absolute numbers, we normalized all search space measures to be described as a relative change over the Swiss-Prot Small SS search for the same engine. Because of this, attention should generally be focused on the relative trends between the searches within the same engine rather than comparing the absolute -fold changes across engines.
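The normalization described above can be sketched in a few lines. The per-spectrum counts below are invented and are not the Table II values. The Mascot conversion additionally assumes that the significance threshold grows as 10·log10 of the number of candidate peptides; the text does not state the conversion, so that function is an assumption to verify against the Mascot documentation.

```python
from statistics import median

def fold_changes(counts_by_search, baseline):
    """Median hypotheses scored per search, expressed relative to a baseline search."""
    medians = {name: median(counts) for name, counts in counts_by_search.items()}
    base = medians[baseline]
    return {name: m / base for name, m in medians.items()}

def fold_from_threshold_change(delta_threshold):
    """ASSUMPTION: threshold = 10*log10(N) + constant, so a change in threshold
    maps to a fold change of 10**(delta/10) in the number of candidates N."""
    return 10 ** (delta_threshold / 10)

# Made-up per-spectrum counts of hypotheses scored (assumption).
toy_counts = {
    "swissprot_small": [100, 200, 300],
    "swissprot_large": [400, 500, 900],
}
relative = fold_changes(toy_counts, baseline="swissprot_small")

print(relative)                          # {'swissprot_small': 1.0, 'swissprot_large': 2.5}
print(fold_from_threshold_change(10.0))  # 10.0
```

Because each engine is normalized to its own baseline, the resulting values support comparison of trends within an engine, not of absolute search space size across engines.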
Fig. 4A estimates the difference between the scored and effective search space size for each of the search types. By plotting the -fold increases for each engine on equivalent searches separately for each search type, we can estimate this difference from the slopes. The red line for the Small SS search types yields a slope very close to unity, meaning the effective and scored search spaces for Paragon are the same because they follow the same trend as Mascot. This is what should be observed because the Paragon Rapid search mode is essentially the same simple precursor-type search that makes no use of sequence tags. The green line comparing the Large SS searches shows a dramatic difference with a slope of about 24. This means the Paragon Large SS search has a scored search space that is about 24 times smaller than the effective search space. This assumes that the effective search space of Paragon is the same as the scored search space of Mascot. Because the Paragon search can find fully non-tryptic peptides, very large mass delta peptides, and many more missed cleavages than were allowed in the Mascot search, the Paragon effective search space is actually much larger than this. Thus, this 24-fold estimate is a very conservative lower bound. The blue line for the No Digest SS searches has a shallower slope of around 10, still much greater than unity, also indicating a large difference between scored and effective search space. The actual scored search space size for the No Digest SS search cannot be reduced as much as for the Large SS search because, without any expected digest specificity, all cleavages are treated as equally likely. By contrast, in the Large SS search, telling the Paragon Algorithm that the sample was digested with trypsin invokes a very complete description of the probabilities of cleavages between pairs of residues.
To a very rough approximation, the gain from the red to blue lines is due mostly to the use of tags and STV, whereas the gain from the blue to green line is due mostly to the use of digest feature probabilities. That is, the difference between Small SS and Large SS for the Paragon Algorithm is due to both the use of STV and digest probabilities, whereas the change from its Large SS to No Digest SS is due to the removal of digest feature probabilities. Because the modifications are constant in the two larger search space search types, the difference between the blue and green lines reflects the value of having the digest information over not having it.
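The slope estimate can be illustrated with an ordinary least-squares fit. This sketch assumes that the Paragon -fold increase is plotted on the x axis and the Mascot -fold increase on the y axis, so that a slope above unity means Paragon scores proportionally fewer hypotheses; the axis assignment, the fitting method, and the data points are assumptions made for illustration and are not taken from the figure.

```python
def least_squares_slope(x_values, y_values):
    """Ordinary least-squares slope of y against x."""
    n = len(x_values)
    mean_x = sum(x_values) / n
    mean_y = sum(y_values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_values, y_values))
    denominator = sum((x - mean_x) ** 2 for x in x_values)
    return numerator / denominator

# Invented -fold increases (assumption), not values read from the figure.
paragon_fold = [1.0, 2.0, 3.0]
mascot_fold_small = [1.0, 2.0, 3.0]    # same trend as Paragon
mascot_fold_large = [1.0, 25.0, 49.0]  # grows 24 times faster than Paragon

print(least_squares_slope(paragon_fold, mascot_fold_small))  # 1.0
print(least_squares_slope(paragon_fold, mascot_fold_large))  # 24.0
```

A slope of 1 corresponds to the Small SS case, where scored and effective search space coincide, and a slope of 24 corresponds to the Large SS estimate discussed above.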
| DISCUSSION |
|---|
The Paragon Algorithm assesses this degree of implication on a continuous scale that is conceptually referred to as a Sequence Temperature Value. This value is derived by calling many small sequence tags for an MS/MS spectrum with associated estimates of correctness and determining their net effect for each region of the database. The corresponding modulation of search space is accomplished using feature probabilities. Thus, for a segment in the database that is "hot" for a particular spectrum, i.e. strongly implicated by the tag set called for that spectrum, the algorithm will consider peptides with rare modifications, unexpected cleavages, less likely substitutions, and large delta masses. At the other limit, for a search of the same spectrum with the same set of tags, a different segment in the database may be very "cold," i.e. not at all implicated by the tags, and the algorithm will consider only the most likely features or lack of features, for example, only tryptic peptides and the expected cysteine alkylation modification but not its absence or any side reactions.
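The interplay between temperature and feature probabilities can be illustrated with a toy sketch. This is not the Paragon scoring function, which is not specified here. The combination rule (treating tag confidences as independent and taking one minus the product of miss probabilities), the linear cutoff, the sequences, and all probabilities are assumptions chosen only to show the hot versus cold behavior.

```python
def segment_temperature(segment, tags):
    """Toy temperature in [0, 1]: 0.0 if no tag matches, near 1.0 if strongly implicated.

    ASSUMPTION: tag confidences are combined as if independent.
    """
    miss = 1.0
    for sequence, confidence in tags:
        if sequence in segment:
            miss *= 1.0 - confidence
    return 1.0 - miss

def allowed_features(temperature, feature_probabilities, cold_cutoff=0.5):
    """Features considered for a segment: the hotter the segment, the lower
    the probability a feature needs in order to be considered (assumed rule)."""
    cutoff = cold_cutoff * (1.0 - temperature)
    return sorted(name for name, p in feature_probabilities.items() if p >= cutoff)

# Hypothetical tags called for one spectrum, with estimated correctness.
tags = [("LVNE", 0.9), ("VNEL", 0.8), ("GGSK", 0.3)]
# Hypothetical feature probabilities.
features = {
    "tryptic_cleavage": 0.9,
    "cys_alkylation": 0.95,
    "semitryptic_cleavage": 0.1,
    "rare_modification": 0.02,
}

hot = segment_temperature("AEFVEVTKLVNELTEFAK", tags)   # two tags match
cold = segment_temperature("MKWVTFISLLLLFSSAYSR", tags)  # no tag matches

print(allowed_features(hot, features))   # all four features
print(allowed_features(cold, features))  # ['cys_alkylation', 'tryptic_cleavage']
```

In the hot segment even the rare modification is considered, whereas in the cold segment only the most likely features survive, mirroring the two limits described in the text.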
An alternate way to state the fundamental Paragon concept would be to say that peptide features should be considered such that unlikely peptides are only considered when there is a compensating amount of fragmentation evidence that could substantiate an otherwise improbable answer. One limitation of the Paragon Algorithm is that it may fail to find some peptides that are both low frequency types of peptides (atypical) and have poor fragmentation. This is a deliberate sacrifice to gain the speed and discrimination that has bee