|
|
||||||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Molecular & Cellular Proteomics 5:497-509, 2006.
© 2006 by The American Society for Biochemistry and Molecular Biology, Inc.
,
,¶
,**
,
From the
National Resource for Proteomics and Pathways,
Bioinformatics Program, || Department of Statistics, and ** Department of Biostatistics, School of Public Health, University of Michigan, Ann Arbor, Michigan 48109
| ABSTRACT |
|---|
|
|
|---|
Software and heuristics for automated and accurate spectral identification (17) and discrimination of correct and incorrect hits (816) are thus an ongoing effort in the proteomics community with the ultimate goal being completely automated MS data interpretation. The most straightforward approach to automated analysis is to define specific score-based filtering thresholds as discriminators of correctness, e.g. accepting SEQUEST scores of doubly charged fully tryptic peptides with XCorr > 2.2 and
Cn values of at least 0.1 (17); these thresholds are typically published as the criteria for which correctness is defined. Other efforts have focused on establishing statistical methods for inferring the likelihood that a given hit is a random event. A well known example of this is the significance threshold calculated by the Mascot search algorithm, which by default displays a threshold indicating the predicted probability of an assignment being greater than 5% likely to be a false positive based on the size of the database. Use of a reverse database search to provide a measure of false positive rate is another method frequently used (8, 18). More formally, Sadygov and Yates (12) model the frequency of fragment ion matches from a peptide sequence database matching a spectrum as a hypergeometric distribution, a model also incorporated into the openly available X!Tandem algorithm (6, 13); whereas Geer et al. (7) model this distribution as a Poisson distribution.
Several of these approaches have been implemented directly in the scoring calculation of new search algorithms (6, 7, 12, 15). Alternatively external algorithms may be developed that process the output of the more standard search platforms such as Mascot or SEQUEST, classifying results as either correct or incorrect with an associated probability. Examples of the latter type include PeptideProphet (11) and QScore (8). These tools have the advantage of being able to accommodate results from these existing, well established search engines that may already be in place in a production laboratory; conversely approaches in which the quality measures are built into the search algorithm scoring are arguably more user-friendly in that they eliminate the extra postsearch processing step of having to run a second algorithm.
Keller et al. (11) were among the first to implement a generic tool for classifying the results of common search algorithms as either correct or incorrect. Their PeptideProphet tool represents arguably the most well known openly available tool implementing a probabilistic approach to assess the validity of peptide assignments generated by MS database search algorithms. Their approach contains elements of both supervised and unsupervised learning, achieving a much higher sensitivity than conventional methods based on scoring thresholds. One concern with PeptideProphet, however, is the degree to which the supervised component of the model can be generalized to new types of data and the ease with which new potentially useful information can be added to the algorithm.
This work attempts to address these difficulties by applying a set of simple "over the counter" methods to the challenging peptide identification problem. Anderson et al. (14) demonstrated that support vector machines could perform well on ion trap spectra searched using the SEQUEST algorithm. In this study, we demonstrated that the latest machine learning techniques for classification, namely tree-based ensemble methods such as boosting and random forest, are more suitable for the peptide classification problem and provide improved classification accuracy. The rationale for the improvements lies in their ability to efficiently combine information from multiple easy-to-get but dependent and weakly discriminatory attributes. Such work will hopefully result in development of software tools that are easily installed in a production laboratory setting that would allow convenient filtering of false identifications with an acceptably high accuracy either as new tools or as a complement to currently existing software. The problem of classification of mass spectrometry-based peptide identification seems well suited to these algorithms and could lead to more readily usable software for automated analysis of the results of mass spectrometry experiments.
| EXPERIMENTAL PROCEDURES |
|---|
|
|
|---|
A crucial part of the above approach is the choice of discriminant score function F. In (11), the ci values are derived to maximize the between- versus within-class variation under the multivariate normal assumption using training data. To make this method work, one has to assume that the training data and the test data are generated from the same source. When a new set of discriminant scores is generated and needs to be classified, one has to retrain the ci weight parameters using a new corresponding training set; in other words, the discriminant function F is data-dependent. In an area such as proteomics in which there is a good amount of heterogeneity in instrumentation, protocol, database, and database searching software, it is fairly common to come across data that display significant differences. It is unclear to what degree the results of a classification algorithm are sensitive to these differences, hence it is desirable to automate the discriminant function training step. Another potential issue is the Normal and Gamma distribution used to model the two types of discriminant scores. There is no theoretical explanation why the discriminant scores should follow these two distributions; in fact, a Gamma distribution rather than a Normal distribution may be appropriate for both positive and negative scores when using the Mascot algorithm.1 It is possible that for a new set of data generated by different mass spectrometers and/or different search algorithms, the two distributions may not fit the discriminant scores well. Also certain types of data attributes or scores may be more difficult to accommodate into such a model if those attributes significantly alter the shape of the discriminant score distribution. For example, qualitative or discrete attributes may be more difficult to model. As a result, higher classification errors may be produced using this model-based approach.
Machine Learning Techniques
Distinguishing correct from incorrect peptide assignments can be regarded as a classification problem or supervised learning, a major topic in the statistical learning field. Many powerful methods have been developed such as classification and regression tree analysis, support vector machine (SVM),2 random forest, boosting, and bagging (21). Each of these approaches has some unique features that enable them to perform well in certain scenarios; the SVM, for example, is a good tool for small sample size, large feature space situations. On the other hand, all approaches are quite flexible and have been applied to an array of biomedical problems. In this study, we applied state-of-the-art machine learning approaches to the peptide assignment problem.
Boosting
The boosting idea, first introduced by Freund and Schapire (22) with their AdaBoost algorithm, is one of the most powerful learning techniques introduced during the past decade. It is a procedure that combines many "weak" classifiers to achieve a final powerful classifier. Here we give a concise description of boosting in the two-class classification setting. Suppose we have a set of training samples, where xi is a vector of input variables (in this case, various scores and attributes of an individual MS database search result produced from an algorithm such as SEQUEST) and yi is the output variable coded as 1 or 1, indicating whether the sample is an incorrect or correct assignment of a database peptide to a spectrum. Assume we have an algorithm that can build a classifier T(x) using weighted training samples so that, when given a new input x, T(x) produces a prediction taking one of the two values {1, 1}; the classifier T(x) is typically a decision tree. Then boosting proceeds as follows: start with equal weighted training samples and build a classifier T1(x). If a training sample is misclassified, e.g. an incorrect peptide is assigned to the spectrum, the weight of that sample is increased (boosted). A second classifier T2(x) is then built with the training samples but, using the new weights, is no longer equal. Again misclassified samples have their weights boosted, and the procedure is repeated M times. Typically one may build hundreds or thousands of classifiers this way. A final score is then assigned to any input x, defined to be a linear (weighted) combination of the classifiers. A high score indicates that the sample is most likely a correctly assigned protein with a low score indicating that it is most likely an incorrect hit. By choosing a particular value of the score as a threshold, one can select a desired specificity or a desired ratio of correct to incorrect assignments.
Random Forests
Similar to boosting, the random forest (23) is also an ensemble method that combines many decision trees. However, there are three primary differences in how the trees are grown. 1) Instead of assigning different weights to the training samples, the method randomly selects, with replacement, n samples from the original training data. 2) Instead of considering all input variables at each split of the decision tree, a small group of input variables on which to split are randomly selected. 3) Each tree is grown to the largest extent possible. To classify a new sample from an input, one runs the input down each of the trees in the forest. Each tree gives a classification (vote). The forest chooses the classification having the most votes over all the trees in the forest. The random forest enjoys several nice features: like boosting, it is robust with respect to input variable noise and overfitting, and it gives estimates of what variables are important in the classification. A discussion of the relative importance of the different attributes used in our analysis of MS search results is given under "Results."
Support Vector Machines
The SVM is another successful learning technique (24). It typically produces a non-linear classification boundary in the original input space by constructing a linear boundary in a transformed version of the original input space. The dimension of the transformed space can be very large, even infinite in some cases. This seemingly prohibitive computation is achieved through a positive definite reproducing kernel, which gives the inner product in the transformed space. The SVM also has a nice geometrical interpretation in the finding of a hyperplane in the transformed space that separates two classes by the biggest margin in the training samples, although this is usually only an approximate statement due to a cost parameter. The SVM has been successfully applied to diverse scientific and engineering problems, including the life sciences (2527). Anderson et al. (14) introduced the SVM to MS/MS spectra analysis, classifying SEQUEST results as correct and incorrect peptide assignments. Their result indicates that the SVM yields less false positives and false negatives compared with other cutoff approaches.
However, one weakness of the SVM is that it only estimates the category of the classification, whereas the assignment probability p(x) may be of interest itself where p(x) = P(Y = 1|X = x) is the posterior probability of a sample being in class 1 (i.e. a correctly identified peptide). Another problem with the SVM is that it is not trivial to select the best tuning parameters for the kernel and the cost. Often a grid search scheme has to be used; this can be time-consuming. In comparison, boosting and the random forest are very robust, and the amount of tuning needed is rather modest compared with the SVM.
Reference Datasets
Two collections of mass spectrometry data were used in this study, representing the two most common protein MS ionization approaches: ESI and MALDI. We benchmarked the performance of boosting and random forest methods in comparison with other approaches using a known, published ESI dataset and then further applied these methods to newer in-house generated MALDI data from an Applied Biosystems Inc. (Foster City, CA) TOF/TOF mass spectrometer (28).
ESI-SEQUEST Dataset
The electrospray dataset was kindly provided by Andy Keller as described in Refs. 11and 29. These data are combined MS/MS spectra generated from 22 different LC/MS/MS runs on a control sample of 18 known (non-human) proteins mixed in varying concentrations. A ThermoFinnigan ion trap mass spectrometer was used to generate the dataset. In total, the data consist of 37,044 spectra of three parent ion charge states: [M + H]+, [M + 2H]2+, and [M + 3H]3+. Each spectrum was searched by SEQUEST against a human protein database with the known protein sequences appended. The top scoring peptide hit was retained for each spectrum; top hits against the known 18 proteins were labeled as correct and manually verified by Keller et al. (11). All peptide assignments corresponding to proteins other than the 18 in the standard sample mixture and common contaminants were labeled as incorrect. In all, 2757 (7.44%) peptide assignments were determined to be correct. The distribution of hits as used to train and test the methods used in this study are indicated in Table I
|
To generate testing and training datasets for this analysis, all MALDI spectra were searched using the Agilent (Palo Alto, CA) Spectrum Mill platform (specifying trypsin as the proteolytic enzyme and accommodating oxidized methionine and pyro-Glu modifications) against a version of the National Center for Biotechnology Information non-redundant (NCBInr) human dataset downloaded March 10, 2005. The NCBInr database was modified by replacing all entries corresponding to the Genway protein standards with annotated entries that include the appropriate N-term affinity tag (either T7 or His6) and appending the E. coli K12 sequences because E. coli was the host organism in which the recombinants were produced. Tolerances of 0.7 Da for precursor ion selection and 0.3 Da for fragment ion selection were used in the search with a minimum matched peak intensity of 40%. "Correct/incorrect" labels were assigned to each spectrum in a semiautomated manner by correlating the accession number of the search result with that of the protein digest known to be spotted at the plate location from which each spectrum was derived. The dataset consists of 11,764 search results from 4340 spectra (up to the top five ranking hits for each spectrum). 1044 are "true positive" hits; of these, 111 are non-top ranking hits. The precise distribution of hits as used to test the algorithms are shown in Table I. Note the relatively low fraction of true positives as per what would be expected from this instrument is primarily due to the fact that the top five ranking hits, not just the top hits, were selected from every search result. We wished to include non-top ranking hits to examine whether the machine learning tools could be used to distinguish correct hits among these lower ranked results, a frequent occurrence in MS/MS search data. The lower ranked hits contain potentially valuable protein identifications that would be discarded using many normal approaches.
The complete set of annotated MALDI protein standard data (Aurum) is available for download on www.proteomecommons.org. To our knowledge, this represents the first publicly available annotated MALDI dataset useful for development of these types of algorithms.
Attributes Extracted from Each Dataset
Both *.out search result files from SEQUEST and *.spo result files from Spectrum Mill were parsed into a simple text row/column format suitable for use by pattern classification algorithms using custom modules written in Python (available upon request). For the SEQUEST results, only the top hit for each spectrum was parsed; again, for the MALDI dataset, the top five ranking hits were retained.
SEQUEST
The attributes extracted from SEQUEST assignments are listed in Table II. Attributes include typical scores generated by the SEQUEST algorithm (Sp, Sp rank,
Cn, and XCorr) as well as other statistics included in a SEQUEST report (total intensity, number of matching peaks, and fragment ion ratio). We include length among the PeptideProphet attributes because PeptideProphet normalizes the XCorr attribute using the length of the peptide. Number of tryptic termini (NTT) is a useful measure for search results obtained by specifying no proteolytic enzyme and is used extensively in Ref. 11. Other attributes include features readily obtainable from the candidate peptide sequence: C-term residue (Lys = "1," Arg = "2," and others = "0"), number of prolines, and number of arginines. A new statistic, the mobile proton factor (MPF), is calculated as follows.
![]() |
MPF attempts to provide a simple measure of the mobility of protons in a peptide, a theoretical measure of the ease of which a peptide may be fragmented in the gas phase (3032). A smaller value for MPF is indicative of higher protein mobility, whereas peptides with MPF
1 can be considered "nonmobile." R, K, and H refer to the number of Arg, Lys, and His residues present in the sequence, reflecting the overall basicity of the peptide. The coefficients for these factors reflect the relative basicity of the three residues normalized to the dissociation constant of arginine (the pKa values of Arg, Lys, and His are 12.0, 10.0, and 5.9, respectively). Charge indicates the charge on the parent peptide, reflecting the number of free protons potentially available for charge-directed fragmentation. We included MPF to demonstrate the ease of accommodation of additional information into the classification algorithms, amounting to simply adding an additional data column to the dataset.
|
|
The PeptideProphet standalone application used in this analysis was downloaded from peptideprophet.sourceforge.net. PeptideProphet is also available as part of the Trans-Proteomics Pipeline being developed at the Seattle Proteome Center (tools.proteomecenter.org/TPP.php). All SEQUEST *.out result files corresponding to each test set were placed in a separate directory and processed using the out2summary.c script to generate the PeptideProphet html input file. PeptideProphet was run by executing the runPeptideProphet script using default parameters. PeptideProphet was not run on the MALDI-Spectrum Mill dataset.
For the boosting and random forest approaches, we used contributed packages for the R programming language. We use an implementation of the AdaBoost algorithm (22) implemented in the R statistical programming language by Greg Ridgway (rweb.stat.umn.edu/R/library/gbm/html/gbm.html) and the randomForest version 4.5-8 package, an R port by Andy Liaw and Matthew Wiener of the original Fortran algorithm developed by Leo Brieman and Adele Cutler. In general, we did not fine tune the parameters (i.e. tree size, number of trees, etc.) of the random forest and boosting implementations for two reasons: classification performances of both the random forest and boosting are fairly robust to these parameters and also because our ultimate goal is to provide a software tool that can be easily used in a production laboratory setting without a significant tuning requirement. For the AdaBoost analysis, we used decision trees with 40 leaves for the weak classifier and fixed the number of boosting iterations (M) equal to 1000. For random forests, the default number of attributes for each tree (one-third of the total number of attributes) was used except for the five-variable case in which the number of attributes was fixed at two. The default number of trees in the forest was 500, and each tree in the forest was grown until the leaf was either pure or had only five samples.
With the support vector machine, we chose a radial kernel to classify the samples as implemented in the libSVM package version 2.7 (www.csie.ntu.edu.tw/
cjlin/libsvm/). The radial kernel is flexible and performed well in preliminary studies. To select the optimal set of tuning parameters for radial kernel, a grid search scheme was adopted using a modified version of the grid.py python script distributed with the libSVM package. Optimal parameters are sensitive to specific training sets: for the precise results presented in this study, the optimal parameters were cost = 32768.0 and
= 3.052e 05.
| RESULTS AND DISCUSSION |
|---|
|
|
|---|
The ESI-SEQUEST dataset allowed us to compare all four classification approaches: boosting, random forests, PeptideProphet, and the SVM. ROC plots showing the results of classifying correct versus incorrect peptide assignments of the ESI-SEQUEST dataset using these methods are shown in Fig. 1A. All methods performed well on the data. As can be seen, the boosting and random forest methods provided a slight performance improvement over PeptideProphet and the SVM classification using the same six attributes. At a false positive rate of roughly 0.05%, the boosting and random forest achieved a sensitivity of 99%, whereas PeptideProphet and SVM provided a 9798% sensitivity. We note that, although a systematic difference of 12% can be seen in these results, this corresponds to a relatively small number of total spectra. Also indicated in Fig. 1 are points corresponding to well known attribute thresholds from several literature citations. Each point shows the sensitivity and specificity that would be obtained on the test dataset by applying these published thresholds to the SEQUEST attributes charge, Xcorr,
Cn, and NTT.
|
Cn values; PeptideProphet appears to have some difficulty with instances of +1 spectra in which the second highest hit has a score very close to the top hit. The other three spectra represent an interesting case. Because the LCQ instrument on which the spectra were generated lacks the resolution to discriminate between doubly and triply charged parent ions, typical peak list extraction protocols produce two identical peak lists for non-singly charged precursor masses, one each for the doubly and triply charged case. The database search for only one of these two peak lists should produce a correct result under normal circumstances. The cases that PeptideProphet "missed" here are an exception to this rule. They are cases in which two peptides containing identical sequence from the same protein (a larger peptide with a +3 charge and one or more missed tryptic cleavage sites and a smaller peptide (a subset of the first) with a +2 charge) have the same apparent mass. For example, in one instance a parent precursor with a m/z of 1109 Da was selected, corresponding to a peptide from the CAH2_BOVINE protein. The CAH2_BOVINE peptide SSQQMLKFRTLNFNAEGEPELLMLANWR has a mass of 3324.82 Da. This peptide has two missed trypsin cleavage sites, a Lys at position 7 and an Arg at position 9. The peptide resulting from cleaving after position 9 is TLNFNAEGEPELLMLANWR with a mass of 2220 Da. The longer peptide was hit by SEQUEST for the +3 peak list, and the shorter one was hit for the +2 peak list: 3326/3
2219/2
1109 Da. Following the rule that only one of the two identical peak lists generated from the 1109-Da precursor mass spectrum can be correct, the annotation provided with the ESI-SEQUEST dataset assigned the +3 spectrum as correct and the +2 spectrum as incorrect. The incorrect annotation to the search result of the +2 peak lists is misleading in these cases, however. PeptideProphet distinguished these cases, accurately providing a correct classification for the +2 spectra. The other three completely supervised algorithms learned to classify these cases as incorrect based on similar examples in the training dataset.
Fig. 1, B and C, compares the performance of the boosting and random forest methods using different sets of input attributes as shown in Table II. The panels contain the results of these algorithms using three combinations of features: 1) attribute groups I and II: the six attributes used by the PeptideProphet algorithm (SEQUEST XCorr,
Cn, Sp rank, delta parent mass, length, and NTT), 2) attribute groups I, III, and IV (all attributes except NTT), and 3) attribute groups IIV (all 15 variables shown in Table II). Overall it can be seen that both machine learning approaches provided improvement over the scoring thresholds described in the literature. The best performance was obtained by including all 15 variables, indicating that accommodation of additional information is beneficial. The random forest appeared to be slightly more sensitive to the presence of the NTT variable than boosting. Of note is the fact that effective classification was attained by the boosting and random forest tools even in the explicit absence of the NTT variable as demonstrated by feature combination 2 despite the fact that the ESI dataset was generated using the "no enzyme" feature of SEQUEST. No enzyme specificity in the database search is often time-prohibitive in routine production work; it is much more common to restrict searches to tryptic peptides (or any other proteolytic enzyme used to digest the protein sample). Restricting to trypsin restricts results to having an NTT = 2, rendering the attribute non-discriminatory. It must be noted, however, that in this analysis the C-term residue attribute was not completely independent of NTT in that it contains residue information on one of the termini. If trypsin specificity is turned on in a search, in addition to distinguishing between Lys- and Arg-terminated peptides, C-term residue will discriminate tryptic and semitryptic peptides, the latter being possible if the peptide is the C-terminal peptide of a protein. If trypsin specificity is not used in the search (as in these data), although the C-term residue variable cannot predict the NTT value of a peptide, it can discriminate between cases. If the C terminus of the peptide is tryptic, the peptide may be either fully or partially tryptic; if it is not, it can either be partially tryptic or non-tryptic.
The results of classifying the MALDI-Spectrum Mill data using boosting and random forest are shown in Fig. 2. The functionality necessary to run PeptideProphet on Spectrum Mill data would require customization of the tool and was therefore not used. We chose not to run the SVM on the MALDI-Spectrum Mill dataset because the superior performance of the boosting and random forest methods were already demonstrated on the ESI-SEQUEST dataset. Overall the two classifiers performed similarly with boosting outperforming random forests slightly when the false positive rate was above 15%. Both methods showed dramatic improvement over thresholding combinations based on default recommended combinations of Spectrum Mill score and SPI values. Spectrum Mill documentation suggests three guideline threshold combinations for "outstanding", "good," and "modest" hits, indicated in Fig. 2. The Outstanding threshold effectively discriminates results with a low false positive rate but discards a majority of true positives. The learning approaches are able to pull a much greater number of true positives at a similar false positive rate: at a false positive rate of 5%, the learners classify roughly 70% of the true positives. The MALDI-Spectrum Mill dataset was generated by selecting the top five hits from search results because 11% of the correct results were non-top ranking hits. It does not appear to be the case that the machine learning algorithms were able to effectively select these cases out, however (data not shown).
|
Fig. 3 displays the relative importance of each attribute from SEQUEST and Spectrum Mill search results using the boosting and random forest methods. Results for classification of the ESI-SEQUEST dataset incorporating the six attributes used by PeptideProphet are shown in Fig. 3A. All six attributes show a contribution to the discrimination with the most important contribution from the NTT variable. PeptideProphet incorporates only the first five attributes in the calculation of the discriminant function, introducing NTT distributions using a separate joint probability calculation. The coefficients for their discriminant score weight XCorr highest followed by
Cn with much lower contributions due to delta M + H and Sp rank. Again length is used by PeptideProphet to correct for the well known peptide length dependence of the XCorr variable. Our results indicate a roughly equivalent contribution from
Cn and XCorr with a significant contribution from Sp rank. Delta M + H and length showed a much more moderate contribution. The six attributes display a high importance when used in conjunction with the other nine attributes from groups III and IV as indicated in Fig. 3B. Of these additional nine, Sp score showed a surprising contribution as this scoring measure is rarely used for discrimination in popular usage. Also significant were the PeakMatchRatio measure and the MPF. For those attributes that are in common, these results are in agreement with Fishers discriminant scores calculated by Anderson et al. (14) with the exception of delta M + H, which showed very little contribution using their SVM approach. The number of arginine and proline measures as well as parent charge, length, and number of matched peptides appear to provide very little discriminative value.
|
Cn measure provides the most important contribution. A comparison of Fig. 3, B and C, suggests that in the absence of NTT, the C-term residue variable contributes much more significantly to the discrimination. As discussed above, although not as discriminatory as NTT, C-term residue does contain some of the same information and may be useful as a partial replacement for NTT in situations in which NTT is prohibitively time-consuming to obtain. The relative importance of different attributes for the Spectrum Mill data are shown in Fig. 3D. Not surprisingly, rank and score features are very important for discrimination. It is interesting, however, that rank and score do not duplicate each other because the ranking of results is primarily based on score. One surprising feature is the importance of delta MH+ in the final discrimination. This attribute is relatively unimportant in the ESI-SEQUEST results; this may demonstrate a difference between the Spectrum Mill and SEQUEST scoring schemes but can more likely be explained by the greater mass accuracy of the TOF/TOF instrument compared with the LCQ. The percent SPI attribute is analogous to the "fraction matched MSMS total ion count" variable published by Anderson et al. (14) and is one of the primary criteria used for judging the correctness of a hit in the Spectrum Mill package. Its low importance measure in the context of the other attributes is of interest. The backbone cleavage score and unused ion ratio attributes calculated by Spectrum Mill appear relatively less important in this context as well. Note that NTT was not calculated for the MALDI-Spectrum Mill dataset due to difficulties in performing a no-enzyme specificity search against the NCBInr human dataset using Spectrum Mill.
Unsupervised Learning and Generalization/Comparison with PeptideProphet
In general, machine learning defines two primary types of models to address the classification problem, generative approaches and discriminative approaches. Algorithms such as the boosting and random forests methods discussed in this study are discriminative, non-parametric approaches in that they do not rely on an explicit distribution of the data and model the posterior probability p(y|x) directly (see Ref. 24 for a general description). A generative method assumes a model for the distributions of the attributes given the class label and learns the distribution p(x|y) and then uses Bayes rule to infer the posterior probability and the classification rule. In effect, these models incorporate assumed prior information in the form of probability distributions for each of the classes being modeled and use these distributions to calculate the probability that a data point belongs to each class. PeptideProphet is a generative, parametric method, modeling correct identifications as a Gaussian distribution and incorrect identifications as a Gamma distribution. If this model fits the observed data well, i.e. the distributions describing the different classes in the problem accurately reflect the physical processes by which the data are generated, the generative approach works well even for a small amount of training data. On the other hand, if the data diverge from the modeled distributions in a significant way, classification errors proportional to the degree of divergence result. Therefore, although straightforward, the performance of the parameter-based generative approaches tends to be sensitive to the assumed model. For the peptide classification problem discussed in this study, there is little scientific evidence that supports a particular distribution assumption; one has to rely on past experience to make the decision. Discriminative approaches, on the other hand, are a less risky option in that they do not rely on knowledge of the distributions of classes of the data. They become increasingly safe, approaching optimality, as data size increases. Keller et al. (11, 35) demonstrate that, for their data, the distributions described in their mixture model fit the data well. Whether these distributions are appropriate for all types of instruments and MS search engines and whether they are optimal are research questions. It may be the case that the addition of new attributes, which alter the discriminant score and thus the shape of the score distribution, may be problematic for a generative tool. Due to their flexibility, our approaches are expected to generalize well to other types of data obtained from various instruments, search engines, and experimental conditions. We believe this is a particularly attractive feature.
The boosting/random forest methods are supervised approaches, relying on training data for their functionality. We note that an unsupervised learning component has been added to PeptideProphet. PeptideProphet uses training data to learn coefficients in the calculation of the discriminate score; it subsequently uses these scores to establish the basic shape of the probability distributions modeling correct and incorrect search hits as a function of parent peptide charge. For each unique dataset, when additional test data arrive, the distribution parameters are refined using an expectation maximization algorithm such that a refined classification procedure can be performed. This unsupervised component can function to compensate for a less-than-optimal fit of observed data to the model distributions. How to combine the unsupervised learning techniques such as clustering with out-of-the-box classification tools such as the ensemble tree methods discussed here is a challenge that needs to be addressed. Our approach provides a framework for performing the supervised aspect of the problem in a more general way using established out-of-the-box functionality. This approach can be coupled with an unsupervised component to provide more flexible functionality, assuming appropriate training datasets are available that match the input data. Here we have described one such dataset, potentially useful as a distributed resource for training purposes. The degree to which an individual training dataset provides adequate parameterization for a particular test set is an open question. Certainly training sets will need to be search algorithm-specific, and we intend to extend this work to other algorithms such as Mascot and X!Tandem in future work, but whether instrument-specific datasets are necessary is an area of investigation.
Having a tool that generates a truly accurate probability of correctness is an attractive feature of all algorithms in this domain. The fitness scores generated by the random forest method are probability estimates, and the boosting fitness scores can be directly converted into probability estimates via a logit transformation. True probability estimation is a difficult problem, however. It must be clearly noted that, as with other tools in this domain, these estimates must be considered approximate. Various methods have been developed for converting these types of scores into accurate probability estimates (33, 34). These methods are referred to as probability calibration methods in the literature. On the other hand, accurate classification of results does not require accurate probability estimates. For example, in a simple two-class classification setting such as ours, one may only need to know whether the probability is bigger or smaller than some selected value to achieve an accurate classification rule.
As a final note, the work described here addresses the problem of generating rankings and confidence measures for identification of peptides using mass spectrometry database search algorithms. This step is typically not the end goal of data analysis: the peptide measures can be combined to generate confidence measures for the presence of a protein in a sample. This problem of combining peptide results to generate accurate protein identifications has been addressed in other algorithms, such as ProteinProphet (35). Improvement in peptide identification will increase the sensitivity and specificity of downstream protein identifications, and the results from the algorithms described in this work can be used directly as inputs into protein level calculations.
Conclusions
In a production proteomics laboratory, researchers are often faced with the challenge of curating large lists of protein identifications based on various confidence statistics generated by the search engines. The common methodology for selecting true hits from false positives is based on thresholding. These approaches can lead to a large number of false positives (using a more promiscuous threshold) or a large number of false negatives (using a relatively stringent threshold). Machine learning approaches such as boosting and random forest methods provide a more accurate method for classification of the results of MS/MS search engines as either correct or incorrect. Additionally newer scoring criteria continue to be published that could improve the ability of automated tools to better discriminate true search results and can complement the standard scoring measures generated by popular search engines. Flexible methods that allow for accommodation of these new scoring measures are necessary to allow them to be easily incorporated into production use. Modern machine learning approaches such as the ensemble methods described here can perform very well out of the box with very little tuning. Improved results could very likely be obtained by tuning these tools to particular data sets, i.e. by making use of class prior probabilities to accommodate the imbalanced sizes of the correct and incorrect datasets. These approaches can additionally be used to generate measures of relative importance of scoring variables and may be useful in the development of new scoring approaches.
| ACKNOWLEDGMENTS |
|---|
| FOOTNOTES |
|---|
Published, MCP Papers in Press, November 30, 2005, DOI 10.1074/mcp.M500233-MCP200
1 Liu, J., Beaudrie, C. E. H., Yanofsky, C., Carrillo, B., Boismenu, D., Morales, F. R., Bell, A., and Kearney, R. E. (2005) A Statistical Model for Estimating Reliability of Peptide Identifications Using Mascot, Poster WP21389 presented at the 53rd American Society for Mass Spectrometry Conference on Mass Spectrometry and Allied Topics, San Antonio, Texas (June 59, 2005). ![]()
2 The abbreviations used are: SVM, support vector machine; N-term, N-terminal; C-term, C-terminal; NTT, number of tryptic termini; MPF, mobile proton factor; SPI, scored peak intensity; BCS, backbone cleavage score; Sp, preliminary score; ROC, receiver operating characteristic. ![]()
* This work was supported in part by the National Resource for Proteomics and Pathways funded by National Center for Research Resources Grant P41 RR 18627-01 (to P. C. A). The costs of publication of this article were defrayed in part by the payment of page charges. The article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
We dedicate this work to the memory of Leo Breiman for outstanding contributions to the study of statistics and machine learning. ![]()
¶ To whom correspondence should be addressed: University of Michigan, 300 North Ingalls Bldg., Rm. 1196, Ann Arbor, MI 48109. Tel. and Fax: 734647-0951; E-mail: pulintz{at}umich.edu
| REFERENCES |
|---|
|
|
|---|
This article has been cited by other articles:
![]() |
M. Brosch, S. Swamy, T. Hubbard, and J. Choudhary Comparison of Mascot and X!Tandem Performance for Low and High Accuracy Mass Spectrometry and the Development of an Adjusted Mascot Threshold Mol. Cell. Proteomics, May 1, 2008; 7(5): 962 - 970. [Abstract] [Full Text] [PDF] |
||||