Abstract
Tandem mass spectrometry combined with database searching allows high throughput identification of peptides in shotgun proteomics. However, validation of database search results, a problem for which many solutions have been proposed, can still be improved in the sensitivity, specificity, and generalizability of the validation algorithms. Here we developed a Bayesian nonparametric (BNP) model for the validation of database search results that incorporates several popular techniques in statistical learning: compression of the feature space with a linear discriminant function, flexible nonparametric probability density function estimation to accommodate the variable probability structure of complex problems, and the Bayesian method to calculate the posterior probability. Importantly, the BNP model is naturally compatible with the popular target-decoy database search strategy. We tested the BNP model on standard protein data sets and real, complex sample data sets from multiple MS platforms and compared it with PeptideProphet, the cutoff-based method, and a simple nonparametric method that we proposed previously. The BNP model was superior in sensitivity and generalizability for all data sets searched. Some high quality matches that had been filtered out by other methods were detected and assigned high probabilities by the BNP model. Thus, the BNP model can validate database search results effectively and extract more information from MS/MS data.
Proteomics has become one of the most active areas of life science research in the postgenomic era. MS is an analytical technique widely used in proteomics research and provides information on protein identification, characterization, and quantification (1). MS/MS can analyze protein mixtures in a high throughput manner and provide sequence information for peptides and proteins (2, 3). Currently MS/MS data are usually processed by the so-called database search method or by de novo sequencing (4, 5). Automated database search software, such as SEQUEST (6), Mascot (7), Phenyx (8), and X!Tandem (9), can quickly assign mass spectra to peptides from a protein sequence database and provide scores to measure the quality of these matches. Generally the search engines select the best matches according to their scoring models but do not guarantee the accuracy of the matches. Consequently validation of database search results has been the focus of much attention (10–13). Recently the challenge of simultaneously improving the specificity and the sensitivity of database search result validation was addressed by Domon and Aebersold (11). Nesvizhskii et al. (12) also addressed this issue in their review on MS/MS data processing.
Research on database search result validation has focused on finding new features that distinguish correct from incorrect matches; improving sensitivity, specificity, and generalizability is the main concern. As a robust search engine, SEQUEST is used in many studies, and many algorithms, scoring models, and statistical models have been developed to validate SEQUEST database search results (14–17). Among them, the target-decoy database search method is favored in practical data processing because it is simple to apply and is robust to the effects of database size, sample quality, experimental conditions, and instrument type (12, 18, 19). Probability frameworks that can incorporate decoy database searching and multiple result validation features have been touched upon (20, 21), but more comprehensive discussion is needed of issues such as the estimation of the false discovery rate (FDR)^{1} of the data set as a whole and the assignment of a correct probability to each match (19).
In this study, we propose a Bayesian nonparametric (BNP) model to incorporate a probability framework into the randomized database searching method. A similar idea was also proposed by Nesvizhskii et al. (12). The BNP model integrates an extended set of features to validate database search results; these features were selected from the literature and cover many characteristics of the spectrum, including SEQUEST scores, empirical parameters, peptide fragmentation knowledge, and chemical or physical properties of peptides. To compress the feature space and reduce the computational burden, a linear discriminant function (LDF) was constructed based on the “typical labeled data set” from decoy database matches. Then a set of component Gaussian density functions (GDFs) was used to fit the LDF score distribution of random matches; the LDF score distribution of correct matches was fitted with GDFs estimated from the normal database matches, which comprise observations of both correct and random results. In the latter step, the contribution of the incorrect matches remained unchanged. Thus, we call our approach “restricted nonparametric probability density function (PDF) estimation.” Finally the correct probability of each assignment was calculated using a Bayesian formula, and the error rates for different cutoff values of the probability score were estimated. This method can also estimate the total number of correct matches and the false negative rate of the filtered data set.
The basic idea behind the BNP model is that, based on the decoy database matches, a degraded filtering model can be used to initialize an iterative process to refine the model and improve the sensitivity. The principle underlying our model is that what constitutes a high quality spectrum can be learned from the analyzed data itself (22). In this way, the BNP model automatically develops a statistical classifier for each data set. By using a nonparametric approach, our model can flexibly adapt to variable score distributions, which are a frequent occurrence in database search result validation in proteomics. Based on randomized database searching, the model is sufficiently robust to analyze data sets derived from different samples, experimental conditions, and mass spectrometry platforms.
The BNP model was evaluated using three MS/MS spectra data sets from standard proteins, and the results indicate that our model performs well for peptide identification validation. We also demonstrate that the new model is suited to different MS instruments and databases, and it identifies more confident peptides than three other commonly used algorithms. Furthermore we applied the BNP model to data sets derived from real, complex samples analyzed by LCQ, LTQ, and LTQ/FT mass spectrometers and obtained results consistent with those from control data sets. When the confidence level is fixed, the BNP model can increase the number of confirmed identified peptides, including those with ambiguous mass difference. Importantly the calculated probability detects some high quality matches that other algorithms may filter out. In summary, the BNP model provides a tool with the potential to extract more information from MS/MS data.
EXPERIMENTAL PROCEDURES
Experimental Data Sets
To conduct a comprehensive evaluation of the Bayesian nonparametric model, we applied our method to three control sample data sets (named D1–D3) and five complex sample data sets (named D4–D8) and compared the results with those generated by three other methods. The data sets were from different samples analyzed with different mass spectrometers in different laboratories. These data sets cover most of the variable factors that have been shown to significantly impact the generalizability of database search result validation models. Among the complex data sets, D4–D6 were used in our previous work (23); D7 and D8 were added to check that the BNP model was not overfitting.
Control Sample Data Sets—
Three control data sets from the LCQ, LTQ, and LTQ/FT instruments were used to investigate the FDR and estimate error rates as well as some other parameters of the BNP model. 1) The LCQ control data set (D1), published by the BIATECH Institute (Bothell, WA) (24), was generated by analyzing a standard mixture of 23 peptides and 12 proteins using an LCQ Deca XP^{PLUS} platform (Thermo Finnigan, San Jose, CA). Additional details about this data set can be found in Purvine et al. (24). 2) The LTQ control data set (D2), published by the Proteomics Standards Research Group (sPRG), was derived from six LC-MS/MS runs on the LTQ (Thermo Finnigan) platform. The sample was designed to contain 49 purified human proteins, but ∼200 proteins have been shown to be present in the sample as announced by the research poster of sPRG 2006. 3) The LTQ/FT control data set (D3), published by the Institute for Systems Biology (25), was generated by analyzing the peptides in a tryptic digest of a mixture of 18 proteins, the “ISB standard protein mix,” on the LTQ/FT platform (raw data of Mixture 4).
Complex Sample Data Sets—
We applied our model to five biological sample data sets analyzed by the LCQ, LTQ, and LTQ/FT mass spectrometers. 1) The LCQ-shotgun data set (D4) was generated from the K562 cell line sample and was downloaded from the Open Proteomics Database. This data set had been used by Resing et al. (26) to illustrate the use of multisource information to improve reproducibility and sensitivity in identifying human proteins by shotgun proteomics. 2) The LTQ-shotgun data set (D5) was generated from MS/MS analyses of a human liver tissue sample (27). Peptides generated by tryptic digestion were analyzed by reversed-phase LC-MS/MS using a Thermo Finnigan linear ion trap mass spectrometer (LTQ) with an ESI source. 3) The LTQ/FT-shotgun data set (D6) was also generated from the human liver tissue sample. Strong cation exchange chromatography was performed on the treated protein mixtures, and each of the 43 fractions collected was analyzed on the LTQ/FT platform. This data set was produced by the Beijing Proteome Research Center and was described previously (23). 4) The LTQ-shotgun data set (D7) was generated from yeast proteins analyzed by nano-LC-MS/MS using a nanoflow HPLC system connected to a linear ion trap mass spectrometer (LTQ) (28), and the raw data were downloaded from PeptideAtlas (PAe000324). 5) The LTQ/FT-shotgun data set (D8)^{2} was produced by the Beijing Proteome Research Center by 10 replicate MS/MS analyses of yeast samples.
Protein Sequence Database
All data were searched using SEQUEST against the modified randomized sequence database by BioWorks 3.2 (29). For control data sets the standard sequences, which included the purified proteins and peptides as well as possible contaminants provided with the data sets, were combined with the protein sequences from Methanosarcina acetivorans C2A (4520 sequences in total; downloaded from the National Center for Biotechnology Information (NCBI)) to construct the target database. The human International Protein Index (IPI) version 3.19 database, containing 60,397 protein sequences, was the searched database for complex data sets D4, D5, and D6. The Saccharomyces cerevisiae ORF protein sequences (downloaded from the Saccharomyces Genome Database (SGD) at Stanford University) were used as the searched database for D7 and D8. The random permutation amino acid sequence of digested peptide (RSDP) method (30) was used to construct a randomized database for each normal database, and the normal and randomized databases were then combined and used for the database search.
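The idea of permuting the residues of each digested peptide can be illustrated with a minimal sketch. It is not the RSDP implementation of Ref. 30: the function names are hypothetical, cleavage is assumed to occur after every Lys or Arg with no missed cleavages, and keeping the C-terminal Lys/Arg of each peptide in place is an assumption made here so that the decoy keeps the same cleavage sites.

```python
import random

def tryptic_peptides(protein):
    """Split a sequence after every K or R (assumption: no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides

def shuffle_peptide(peptide, rng):
    """Randomly permute the residues of one peptide.

    Assumption: a C-terminal K/R stays in place, so peptide boundaries
    are the same in the target and the decoy sequence.
    """
    if peptide and peptide[-1] in "KR":
        body, tail = list(peptide[:-1]), peptide[-1]
    else:
        body, tail = list(peptide), ""
    rng.shuffle(body)
    return "".join(body) + tail

def decoy_protein(protein, seed=0):
    """Build a decoy protein by shuffling each digested peptide independently."""
    rng = random.Random(seed)
    return "".join(shuffle_peptide(p, rng) for p in tryptic_peptides(protein))

target = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # made-up sequence
decoy = decoy_protein(target)
```

Because each peptide is only permuted, the decoy has the same length, amino acid composition, and peptide masses as the target, which is what makes the two halves of the combined database comparable.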
Database Searching
The raw files were searched against the combined database using a local Turbo SEQUEST v.27 server with the same database search parameters for all data sets. The SEQUEST parameters were as follows. The monoisotopic mass was used for both peptide and fragment ions with fixed modification (carbamidomethyl, +57 Da) on Cys and variable modification (oxidation, +16 Da) on Met. The mass tolerance for precursors of all data sets was 3.0 Da, and the fragment ion mass tolerance was 1.0 Da for D7 and D8 and 0.6 Da for the others. Tryptic cleavage at only Lys or Arg was selected, and up to two missed cleavage sites were allowed. Only b and y fragment ions were taken into account. The peptide mass ranged from 400 to 5000 Da when creating *.dta files, and the threshold of the total ion intensity for the LTQ and the LCQ was 100 and 10,000, respectively.
BNP Model Work Flow
The work flow of the BNP model is shown in Fig. 1 and contains three main steps. After the MS/MS spectra are searched against the combined database, the first step begins with the construction of two typical labeled subsets. The first subset includes all decoy matches, which are taken as negatives and designated y = 0. The second subset consists of matches validated (positives) by the cutoff-based method with FDR_{Est} = 0.01, designated y = 1. Based on these subsets, the coefficients of the LDF score (Equation 1) are estimated by multivariate linear regression, and the LDF score of every match is then calculated.
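The regression step can be sketched as an ordinary least-squares fit of the 0/1 labels on the feature values. This is a minimal illustration on synthetic data, assuming an intercept term and three made-up features; `fit_ldf` and `ldf_score` are hypothetical names and not part of any published tool.

```python
import numpy as np

def fit_ldf(features, labels):
    """Least-squares fit of y ~ a0 + sum_k a_k * f_k; returns [a0, a1, ..., aK]."""
    X = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(X, labels, rcond=None)
    return coef

def ldf_score(features, coef):
    """Apply the fitted linear function to any set of matches."""
    return coef[0] + np.asarray(features) @ coef[1:]

# Synthetic stand-ins for the two labeled subsets (assumption: 3 features).
rng = np.random.default_rng(0)
decoy_feats = rng.normal(0.0, 1.0, size=(500, 3))   # negatives, y = 0
valid_feats = rng.normal(2.0, 1.0, size=(300, 3))   # cutoff-validated, y = 1
X = np.vstack([decoy_feats, valid_feats])
y = np.concatenate([np.zeros(500), np.ones(300)])

coef = fit_ldf(X, y)
scores = ldf_score(X, coef)   # one scalar LDF score per match
```

Once the coefficients are fitted on the labeled subsets, the same `ldf_score` call would be applied to all matches, labeled or not, which is what reduces the feature space to a single score axis.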
In the second step, the LDF score distribution of decoy matches is fitted by a nonparametric PDF with maximum likelihood parameter estimation. By holding the decoy-derived part constant and applying the expectation-maximization (EM) algorithm, the LDF score PDF of correct matches can be estimated from the normal database matches, which consist of the combined observations of correct and incorrect assignments. Consequently the correct probability of each assignment can be calculated using the Bayesian formula and the conditional distributions of correct and incorrect matches.
In the last step, a decision is made according to the cost function, which is presented here as the FDR; the FDR can be replaced by estimated error rates in the probability framework. The percentages of correct and incorrect matches are also estimated at this step. Therefore, we can calculate the total number of correct assignments and provide the model specificity at each probability score (PScore) cutoff value. When applied to large data sets, the BNP model can reduce the computational burden by randomly resampling 30,000 observations from the whole data set for the model-building process.
Features Involved in the BNP Model and LDF
Many parameters (referred to as “features” in this study) have been used to validate SEQUEST database search results; these include 1) database search scores, including Xcorr, ΔCn, Sp, RSp, and Ions; 2) physical and chemical properties of the peptide and the basic properties of the experimental MS/MS spectra, such as peptide length (PLen), predicted peptide chromatographic retention time (tRT), peptide molecular weight, and number of peaks in the MS/MS spectrum (PNum); and 3) the empirical parameters used in previous studies, such as RScore (17), Cont (16), and SimScore (31). We used a total of 28 features to improve the discriminant power of the BNP model because a large amount of information was being extracted from the MS data; these are listed, along with the corresponding transformations, in Table I, and additional details are briefly summarized in supplemental File S1. Table I lists the transformation of these features to reduce the variance and improve their discriminant power (32).
Therefore, the LDF score can be defined as

$$\mathrm{LDF} = a_{0} + \sum_{k=1}^{26} a_{k} f_{k} \qquad \text{(Eq. 1)}$$

where f_{k}, k = 1, 2, …, 26, denote the transformed features and a_{0}–a_{26} are the coefficients that can be derived by regression from the “typical labeled data sets.” We constructed the LDF model for each charge state (Ch) individually and used peptide length PLen ≥ 6 as a prefilter in applying the model.
BNP Model and EM Algorithm
Based on the theory that the random matches and the correct matches can be grouped into subclasses and that the LDF score of each subclass should have a simple distribution (e.g. normal distribution; some detailed discussion can be found in the supplemental File S1), we used Gaussian component distributions to model the mixture distribution of the observations. The hypothesized mixture PDF has the form

$$p(x) = P_{neg}\, p(x \mid \mathrm{neg}) + P_{pos}\, p(x \mid \mathrm{pos})$$

where

$$p(x \mid \mathrm{neg}) = \sum_{j=1}^{m} P_{j}^{neg}\, N(x;\ \mu_{j}, \Sigma_{j})$$

and

$$p(x \mid \mathrm{pos}) = \sum_{i=1}^{n} P_{i}^{pos}\, N(x;\ \mu_{i}, \Sigma_{i})$$

Here N(x; μ, Σ) denotes a Gaussian density; μ_{j}, j = 1, 2, …, m, and μ_{i}, i = 1, 2, …, n, are the means of the component Gaussian distributions; and Σ_{j}, j = 1, 2, …, m, and Σ_{i}, i = 1, 2, …, n, are their covariance matrices. These parameters satisfy Σ_{j = 1}^{m}P_{j}^{neg} = 1 and Σ_{i = 1}^{n}P_{i}^{pos} = 1; P_{pos} ≥ 0, P_{neg} ≥ 0; P_{i}^{pos} ≥ 0, P_{j}^{neg} ≥ 0; and P_{pos} + P_{neg} = 1.
First the negative component contributing to random matches can be estimated from the decoy matches by the fully nonparametric probability density function estimation procedure proposed by Archambeau and Verleysen (33) and Duda et al. (34), which is implemented by maximum likelihood estimation with the EM algorithm. Then the positive component contributing to correct matches can be estimated from the mixture observations of the normal database matches by a restricted fully nonparametric probability density function estimate. This iterative EM procedure follows that described previously (23) except that P_{j}^{neg}, j = 1, 2, …, m, are kept unchanged when the parameters are updated in the M-step. Here x is the LDF score, a scalar.
By trial and error, we found that five component GDFs can provide an accurate PDF fitting. We initialized the parameters in the EM procedure by partitioning the observations into five intervals on the LDF score axis and keeping the number of observations in each interval equal.
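The two-stage fit can be sketched in a few lines for scalar scores. This is a simplified illustration on synthetic data, not the authors' code: the negative mixture is fitted to decoy scores by plain EM, then held fixed while only the positive components are updated on the target scores. Initialization by five equal-count intervals follows the text. Holding the prior P_neg at the decoy-to-target count ratio is an assumption made here to keep the sketch identifiable; the original procedure may estimate the priors differently.

```python
import math
import random

def gauss(x, mu, var):
    """Density of a one-dimensional normal distribution."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def init_components(xs, k=5):
    """Split sorted scores into k equal-count intervals: one [weight, mean, var] each."""
    xs = sorted(xs)
    size = len(xs) // k
    comps = []
    for c in range(k):
        chunk = xs[c * size:] if c == k - 1 else xs[c * size:(c + 1) * size]
        mu = sum(chunk) / len(chunk)
        var = sum((v - mu) ** 2 for v in chunk) / len(chunk)
        comps.append([len(chunk) / len(xs), mu, max(var, 1e-6)])
    return comps

def em(xs, free, fixed=(), p_free=1.0, n_iter=60):
    """EM for p(x) = p_free * free_mixture + (1 - p_free) * fixed_mixture.

    Only the free components are updated; `fixed` and `p_free` never change.
    With no fixed part and p_free = 1 this is ordinary mixture EM.
    """
    for _ in range(n_iter):
        # E-step: responsibility of each free component for each score.
        resp = []
        for x in xs:
            f = [p_free * w * gauss(x, m, v) for w, m, v in free]
            g = (1.0 - p_free) * sum(w * gauss(x, m, v) for w, m, v in fixed)
            total = max(sum(f) + g, 1e-300)
            resp.append([r / total for r in f])
        # M-step: weights, means, and variances of the free components only.
        mass = [sum(r[c] for r in resp) for c in range(len(free))]
        total_mass = sum(mass)
        for c, comp in enumerate(free):
            comp[0] = mass[c] / total_mass
            if mass[c] > 1e-12:
                mu = sum(r[c] * x for r, x in zip(resp, xs)) / mass[c]
                var = sum(r[c] * (x - mu) ** 2 for r, x in zip(resp, xs)) / mass[c]
                comp[1], comp[2] = mu, max(var, 1e-6)
    return free

# Synthetic LDF scores: incorrect ~ N(0, 1), correct ~ N(4, 1).
rng = random.Random(1)
decoy = [rng.gauss(0.0, 1.0) for _ in range(600)]
target = ([rng.gauss(0.0, 1.0) for _ in range(600)]
          + [rng.gauss(4.0, 1.0) for _ in range(400)])

neg = em(decoy, init_components(decoy))      # stage 1: unrestricted fit on decoys
p_neg = len(decoy) / len(target)             # assumption: prior from decoy count
pos = em(target, init_components(target), fixed=neg, p_free=1.0 - p_neg)
pos_mean = sum(w * m for w, m, _ in pos)     # mean of the fitted positive mixture
```

In this toy setting the positive components that start in the low-score region lose weight over the iterations, because the fixed negative mixture already explains those scores, and the fitted positive mixture concentrates near the simulated correct-match scores.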
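After estimation of the conditional PDFs, the correct probability of a match with LDF score x is given by the Bayesian formula

$$P(\mathrm{pos} \mid x) = \frac{P_{pos}\, p(x \mid \mathrm{pos})}{P_{pos}\, p(x \mid \mathrm{pos}) + P_{neg}\, p(x \mid \mathrm{neg})}$$

where p(x | pos) and p(x | neg) are the conditional PDFs of correct and incorrect matches and P_{pos} and P_{neg} are the corresponding prior probabilities.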
The estimated number of correct matches is N_{pos} = KP_{pos} and the number of incorrect matches is N_{neg} = KP_{neg} where K is the total number of observations. The FDR and false negative rate (FNR) under different LDF score cutoff values can be estimated by the conditional distribution and the prior probability as follows.
Assuming the expected FDR is α, we can determine the filtration threshold of the LDF score x_{α} according to Equation 6. At the same time, the estimated sensitivity and discriminating power can be estimated as follows.
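A numerical sketch of such tail-area estimates is given here under stated assumptions: the FDR at cutoff x is taken as the negative share of the mixture mass above x, the sensitivity as the positive mixture mass above x, and the threshold is found by a simple grid scan. These definitions, the function names, and the single-component example mixtures are illustrative choices, not the expressions of the original equations.

```python
import math

def tail(x, mu, var):
    """P(X >= x) for a normal distribution with mean mu and variance var."""
    return 0.5 * math.erfc((x - mu) / math.sqrt(2.0 * var))

def mix_tail(x, comps):
    """Upper-tail mass of a Gaussian mixture given as (weight, mean, var) triples."""
    return sum(w * tail(x, m, v) for w, m, v in comps)

def fdr_at(x, p_pos, pos, neg):
    """Assumed FDR estimate: incorrect mass above x over total mass above x."""
    fp = (1.0 - p_pos) * mix_tail(x, neg)
    tp = p_pos * mix_tail(x, pos)
    return fp / (fp + tp) if fp + tp > 0 else 0.0

def sensitivity_at(x, pos):
    """Assumed sensitivity estimate: fraction of correct matches above x."""
    return mix_tail(x, pos)

def cutoff_for(alpha, p_pos, pos, neg, lo=-10.0, hi=10.0, steps=2000):
    """First grid cutoff (scanning upward) whose estimated FDR is <= alpha."""
    for i in range(steps + 1):
        x = lo + (hi - lo) * i / steps
        if fdr_at(x, p_pos, pos, neg) <= alpha:
            return x
    return hi

# Toy mixtures: incorrect ~ N(0, 1), correct ~ N(4, 1), P_pos = 0.4.
neg = [(1.0, 0.0, 1.0)]
pos = [(1.0, 4.0, 1.0)]
x01 = cutoff_for(0.01, 0.4, pos, neg)
sens01 = sensitivity_at(x01, pos)
```

The grid scan is a convenience; any one-dimensional root finder would serve equally well when the estimated FDR decreases monotonically with the cutoff.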
Finally we can calculate the estimated error rate (Err) under different PScore thresholds P_{α} based on the correct probability P_{i} of every identified peptide using

$$Err_{Est}(P_{\alpha}) = \frac{\sum_{P_{i} \ge P_{\alpha}} (1 - P_{i})}{\left\| \{P_{i} \mid P_{i} \ge P_{\alpha}\} \right\|}$$

where ‖{P_{i} | P_{i} ≥ P_{α}}‖ denotes the number of elements in the set {P_{i} | P_{i} ≥ P_{α}}. In practice, we found that Err_{Est} was close to the actual FPR. Therefore, in the following sections, Err_{Est} is used as the estimate of the FDR for the BNP model.
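As a small worked example, assuming Err is the mean of (1 − P_i) over the matches whose probability P_i is at least the threshold, the calculation reduces to a few lines. The function name and the probability values are made up for illustration.

```python
def estimated_error(pscores, p_alpha):
    """Mean of (1 - P_i) over the matches with P_i >= p_alpha."""
    kept = [p for p in pscores if p >= p_alpha]
    if not kept:
        return 0.0
    return sum(1.0 - p for p in kept) / len(kept)

pscores = [0.99, 0.95, 0.90, 0.60, 0.20]   # hypothetical PScores
err = estimated_error(pscores, 0.90)       # (0.01 + 0.05 + 0.10) / 3
```

Lowering the threshold admits matches with smaller probabilities, so the estimated error of the accepted set can only stay the same or grow.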
RESULTS
Estimation of the FDR—
The control data sets (D1–D3) were used to verify the accuracy of the estimated FDR of the BNP model. When the PScore cutoff is small, the FDR (Err) estimated by the BNP model is larger than the actual FPR. For high quality filtration, Err is close to the actual FPR (Fig. 2). Table II compares the performance of the BNP model (M3), the cutoff-based method (M1), PeptideProphet (M2; contained in the Trans-Proteomic Pipeline version 4.0.1), and our previously published nonparametric model (M4) (23) under two typical FDRs. In the cutoff-based method, an exhaustive search procedure was used to identify the optimal threshold value of the Xcorr/ΔCn pair by maximizing the number of validated normal database matches while keeping the estimated FDR lower than the expected FDR. PeptideProphet provides an estimated error rate for different probability score cutoffs. For all three data sets, the sensitivity of the BNP model surpassed that of the three other filtering methods when the estimated errors/FDRs were the same. The traditional cutoff method produced high quality results with quite a low actual FPR at the cost of some sensitivity. Thus, the total numbers of correct matches validated by the cutoff method were much lower than those validated by the BNP model.
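The exhaustive threshold search of the cutoff-based method can be sketched as a grid scan over Xcorr/ΔCn pairs. The sketch assumes the FDR is estimated as the number of accepted decoy matches divided by the number of accepted normal database matches; the exact estimator used in the study is not stated here, and the grids and toy matches are invented for illustration.

```python
def best_cutoffs(matches, max_fdr, xcorr_grid, dcn_grid):
    """Scan all (Xcorr, dCn) threshold pairs.

    matches: list of (xcorr, delta_cn, is_decoy) tuples.
    Returns (n_target, xcorr_cut, dcn_cut) for the pair that keeps the most
    normal database matches while the estimated FDR stays <= max_fdr.
    """
    best = (0, None, None)
    for xc in xcorr_grid:
        for dc in dcn_grid:
            kept = [is_decoy for x, c, is_decoy in matches if x >= xc and c >= dc]
            n_decoy = sum(kept)
            n_target = len(kept) - n_decoy
            if n_target == 0:
                continue
            fdr = n_decoy / n_target   # assumption: decoy/target ratio
            if fdr <= max_fdr and n_target > best[0]:
                best = (n_target, xc, dc)
    return best

matches = [
    (3.5, 0.30, False), (3.0, 0.25, False), (2.8, 0.05, True),
    (2.5, 0.20, False), (2.0, 0.12, True), (1.8, 0.15, False),
    (1.5, 0.02, True),
]
best = best_cutoffs(matches, 0.01, [1.5, 2.0, 2.5, 3.0], [0.0, 0.1, 0.2])
```

With only two scores in play, a single decoy with a moderate Xcorr and ΔCn is enough to force a stricter threshold pair, which illustrates the loss of sensitivity described for this method.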
Although we used a relatively large parent ion mass tolerance setting (3.0 Da) for the LTQ/FT database search, the actual mass accuracy of the FT mass spectrometer is in the range of a few ppm. Mass accuracy filtering was proposed for such high accuracy data (35, 36). As the statistical mass error for D3 ranged from −2 to 6 ppm (23), validated results with a mass error larger than 10 ppm were taken as false positives and were excluded from the output lists of all filter methods. The peptide assignment lists of the three control data sets are provided in supplemental Tables S1_LCQ, S1_LTQ, and S1_FT, and the corresponding filter criteria can be found in supplemental File S2.
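The ppm filter amounts to a relative mass comparison. The following sketch assumes that the 10 ppm limit applies to the absolute value of the precursor mass error; the function names and masses are illustrative.

```python
def ppm_error(observed_mass, theoretical_mass):
    """Signed mass error in parts per million."""
    return (observed_mass - theoretical_mass) / theoretical_mass * 1e6

def within_tolerance(observed_mass, theoretical_mass, max_ppm=10.0):
    """True if the absolute mass error does not exceed max_ppm (assumption)."""
    return abs(ppm_error(observed_mass, theoretical_mass)) <= max_ppm

ok = within_tolerance(1500.006, 1500.0)    # about 4 ppm, kept
bad = within_tolerance(1500.030, 1500.0)   # about 20 ppm, discarded
```

For a 1500 Da peptide, 10 ppm corresponds to 0.015 Da, far tighter than the 3.0 Da search tolerance, so the filter acts as an independent check on matches that passed the wide search window.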
Searching a Larger Database—
It is generally acknowledged that search algorithms lose sensitivity as the search space is increased because more peptides are queried (37). Larger databases increase the number of candidate peptides for each MS/MS spectrum, and the probability of randomized matches increases as well. We constructed a large combined database containing 13,936 protein sequences from four different Archaea species (M. acetivorans C2A, Archaeoglobus fulgidus DSM 4304, Methanosarcina barkeri strain fusaro chromosome 1, and Methanosarcina mazei Go1; all downloaded from NCBI) and repeated the database search to test the performance of the BNP model on different searched databases. As the search space was expanded, fewer matches were identified. When the estimated FDR was set at 0.05 and 0.01, the BNP model confirmed 804 and 708 matches, respectively (supplemental Table S1_LCQ). The actual FPR was 2.86 and 0.71%, and the sensitivity was 91.24 and 82.13%, respectively. These values are nearly identical to those observed when searching a smaller database. The results indicate that the BNP model is reliable and accurate on different searched databases.
Among the validated matches of the large and small database search results, 778 matches were the same. Only 15 of the 40 matches validated by large database searching alone were from the control sequences. On the other hand, 43 MS/MS spectra matched with the control sequences were confirmed only in the small database search. These matches may be false positives, the MS/MS spectra of which would be matched with a more appropriate peptide in a different database. These observations indicate that not all matches assigned to control sequences are correct because some spectra matched different peptides in the large and small search spaces, and some were randomly matched with control peptides. Thus, we used four empirical rules to refine these possible correct matches for the sensitivity calculation: 1) RSp ≤ 50, 2) PLen ≥ 6, 3) PNum ≥ 20, and 4) Max_Tag_Len ≥ 4.
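The four rules can be applied as a simple conjunction of thresholds. In this sketch each match is represented as a dictionary keyed by the feature names used in the text; that representation and the example values are assumptions made for illustration.

```python
# Thresholds taken from the four empirical rules in the text.
RULES = {
    "RSp": lambda v: v <= 50,
    "PLen": lambda v: v >= 6,
    "PNum": lambda v: v >= 20,
    "Max_Tag_Len": lambda v: v >= 4,
}

def passes_rules(match):
    """True only if the match satisfies all four rules."""
    return all(rule(match[name]) for name, rule in RULES.items())

good = {"RSp": 3, "PLen": 12, "PNum": 85, "Max_Tag_Len": 6}
poor = {"RSp": 3, "PLen": 12, "PNum": 85, "Max_Tag_Len": 2}
kept = [m for m in (good, poor) if passes_rules(m)]
```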
Quality of the Results Confirmed by the BNP Model—
We also validated the confirmed matches identified by the BNP model (M3) in the real, complex sample data sets using the empirical rules (Table III). These empirical rules came from different sources in the literature (16, 26, 38). In Table V, MTL is the abbreviation for Max_Tag_Len, and other parameters are introduced under “Features Involved in the BNP Model and LDF.” Most of the matches confirmed by the BNP model are of high quality in view of these empirical rules, and the quality of the results improves as the accuracy of the data increases. As a comparison, we calculated these percentages for results obtained with the cutoff-based method (M1; without the rule of RSp ≤ 50). In some cases, the cutoff-based method seemed to generate slightly better results, but the difference was negligible, especially on the LTQ/FT data set (D6).
Comparison among Different Methods on Complex Data Sets—
We compared the performance of the BNP model (M3) with the cutoff-based method and PeptideProphet on complex data sets (D4–D8; Table IV). The BNP model confirmed about 14–39% more results, both total and unique peptides, than the cutoff-based method. PeptideProphet appeared to be influenced by data quantity and quality. For complex data sets D5 and D7, which contain more than 10^{6} matches, we had to separate the *.pep.xml files (60 files in total) into several runs because PeptideProphet required too much memory, generated a temporary file too large (greater than 4 gigabytes) to be accommodated by Windows, and required an unacceptable amount of time to complete the modeling. For complex data set D6 (data from 46 LC runs in total), the MS/MS data quality of some LC runs was so poor that PeptideProphet was not able to finish the modeling and validated very few peptides (only 18,446 correct with 1% FDR). Therefore, we used the results for D6 from an earlier version (PeptideProphet 1.9), which were superior to those derived using Trans-Proteomic Pipeline version 4.0.1.
The list of peptide assignments of real, complex sample data sets by M1, M2, M3, and M4 and the corresponding proteins are provided in supplemental Tables S2_D4, S2_D5, S2_D6, and S3.
The Venn diagram in Fig. 3 shows the classification of confirmed peptides using these four methods. The peptides confirmed by the BNP model represented more than 92% of the merged results of the cutoff-based method and PeptideProphet (M1 ∪ M2) and more than 91% of the results of the nonparametric model; the BNP model also confirmed many additional results, indicating that the sensitivity of the BNP model is much higher than that of the other methods. By manually checking the records that were discarded by the BNP model but confirmed by the other methods, we found that some records had relatively large Xcorr and ΔCn; careful inspection of these records showed that some other feature scores, such as Ions, iIons, Cont, and nIons, were small, indicating that they were potentially incorrect matches.
Conversion to Protein Identifications—
In analyzing complex samples, the most important criterion is the number of proteins identified with confidence as output. The unique protein counts and the numbers of high confidence protein identifications (two or more, or three or more, peptide hits) for FDRs of 1 and 5% in each experimental data set are shown in Table V. The minimal protein lists were assembled according to the parsimony principle applied by the DBParser algorithm (39), and in-house software written in C++ was developed to support our file format. The percentages of proteins with two or three peptide hits provided by the four methods appear to be close. However, the BNP model can generate a larger protein list with a greater number of high confidence proteins. It is interesting that the percentage of high confidence proteins cannot be improved by improving the confidence level of the resulting matches if only one method (M1, M2, M3, or M4) is used.
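The parsimony idea can be illustrated with a greedy set-cover heuristic. This is a stand-in technique chosen for brevity and is not the DBParser algorithm or the in-house C++ software; the protein and peptide names are invented.

```python
def minimal_protein_list(protein_to_peptides):
    """Greedy set cover: repeatedly pick the protein explaining the most
    still-unexplained peptides (ties broken by protein name)."""
    uncovered = set().union(*protein_to_peptides.values())
    chosen = []
    while uncovered:
        best = max(sorted(protein_to_peptides),
                   key=lambda p: len(protein_to_peptides[p] & uncovered))
        chosen.append(best)
        uncovered -= protein_to_peptides[best]
    return chosen

def high_confidence(protein_to_peptides, proteins, min_hits=2):
    """Proteins in the list supported by at least min_hits distinct peptides."""
    return [p for p in proteins if len(protein_to_peptides[p]) >= min_hits]

hits = {
    "P1": {"pepA", "pepB", "pepC"},
    "P2": {"pepB", "pepC"},          # subsumed by P1
    "P3": {"pepD"},
    "P4": {"pepC", "pepD"},
}
proteins = minimal_protein_list(hits)
confident = high_confidence(hits, proteins)
```

A greedy cover is not guaranteed to be the smallest possible list, so it should be read only as an illustration of how shared peptides are assigned to as few proteins as possible.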
DISCUSSION
Proteomics research has generated vast amounts of MS/MS data. SEQUEST is a robust algorithm that is appropriate for processing low accuracy ion trap MS/MS data. Using external tools to separate correct from incorrect SEQUEST database search results has been the focus of much attention. We developed the BNP model to filter the false-positive matches in shotgun proteomics database searching. This strategy is based on a randomized database method and nonparametric density distribution functions. By applying this model to control protein data sets and complex data sets from real samples, we demonstrate that the BNP model has greater power to discriminate between correct and incorrect assignments and can effectively control the false-positive ratio of peptide identifications. Furthermore the BNP model can greatly increase the number of confirmed peptides and proteins, and it is suited for use with several MS platforms.
BNP Model Versus PeptideProphet—
Recently Choi et al. (21) presented a variable component mixture model and a semiparametric mixture model to remove the restrictive parametric assumptions in the mixture modeling approach of PeptideProphet. The most recent version of PeptideProphet provides an option to use nonparametric modeling with target-decoy database searches. The process works well on D3, which was the test data set in Ref. 19. The number of validated peptides in D1 increased (964 for 5% FDR and 798 for 1% FDR), but the actual FPR increased as well (10.06 and 3.26%, respectively); on the LTQ control data set, neither the numbers of identified peptides (7080 for 5% FDR and 6072 for 1% FDR) nor the actual FPRs (5.25 and 1.15%, respectively) improved.
BNP Model Versus Nonparametric Model—
Both the BNP model and our previously published nonparametric model (23) are based on the target-decoy database search strategy. The nonparametric model, using the nonparametric density estimation technique, aims to estimate the multivariate PDF of the database search scores directly and takes the contour lines as the candidate discriminant functions to filter out false-positive results. Based on the hypothesis that what constitutes a high quality match can be learned from the treated data itself (22), the BNP model was able to model the probability structure from the target-decoy search results and then automatically classify the results. The nonparametric PDF estimation in the BNP model provided a flexible framework for the probability structure.
The primary parameters in the nonparametric model are three commonly used database scores: Xcorr, ΔCn, and SimScore; incorporation of an additional feature dictates an additional dimension of the feature space, and the complexity of the model increases accordingly. The BNP model incorporates 28 features into a linear discriminant function and permits convenient incorporation of more features as required. Furthermore the BNP model can provide a correct probability of each assignment that facilitates subsequent processes, such as application of the EBP (Empirical Bayes Protein identifier) model for protein inference (40).
Extension of the BNP Model to a Higher Dimension—
In this study all 28 features were integrated by an LDF. This process reduced the computational complexity but may result in the loss of some information contained in the raw features. From the view of principal component analysis, only the first main principal component was used by the LDF model. It is possible to use more principal components and extend the BNP model to a higher dimension space. The model building procedure will not require modification, but new techniques will be needed to compress the feature space. Partial least squares regression (41) can utilize the target classification information and complete the principal component analysis and regression at the same time; this is a useful tool to use the typical labeled data set to compress the dimension of the feature space. But when more principal components are taken, the initial procedure of the EM algorithm will have to be adjusted, and more computational time will be required.
Why Did the BNP Model Validate More Matches?—
The BNP model is able to identify more high confidence proteins (with at least two peptide hits) from an MS/MS data set under the same estimated FDR compared with PeptideProphet and the cutoff-based method. Within 1% peptide FDR in the D4 data set, more than 90% of the proteins with two or more peptide hits that were identified using PeptideProphet and the cutoff-based method were also identified by the BNP model.
Confirming a higher number of confident peptides is the greatest strength of the BNP model; thus, the BNP model could offer a larger high confidence protein list under the established peptide identification FDR, and it can provide more information for downstream biological analysis. The capacity of the BNP model to confirm more peptides is due to its ability to detect high quality matches that other algorithms might filter out based on only a few features of these matches. There are some conditions that would result in high quality assignments being filtered out by other methods. The masses of some amino acid pairs (e.g. Lys/Gln and Leu/Ile) as well as several amino acid combinations (Table VI) are indistinguishable when the resolution of the instrument is low. Those may cause the ΔCn score to be small and, in some cases, as low as zero. There are also some conditions for which the theoretical spectra of rank 1 and rank 2 identified peptides are similar in SEQUEST outputs, which would also make the ΔCn score smaller than the commonly acceptable value of other methods.
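The mass ambiguity can be checked numerically. The sketch below uses standard monoisotopic residue masses and compares residue pairs against a tolerance; the 0.6 Da value is the fragment ion tolerance used for most data sets in this study, whereas the 0.01 Da value is only an illustrative stand-in for a high resolution measurement.

```python
# Monoisotopic residue masses in Da (standard values).
MONO = {
    "K": 128.094963,  # Lys
    "Q": 128.058578,  # Gln
    "L": 113.084064,  # Leu
    "I": 113.084064,  # Ile
}

def mass_difference(a, b):
    """Absolute difference between two residue masses."""
    return abs(MONO[a] - MONO[b])

def indistinguishable(a, b, tolerance_da):
    """True if the two residues cannot be separated at the given tolerance."""
    return mass_difference(a, b) <= tolerance_da

kq_low_res = indistinguishable("K", "Q", 0.6)     # ion trap fragment tolerance
kq_high_res = indistinguishable("K", "Q", 0.01)   # hypothetical tighter tolerance
li_high_res = indistinguishable("L", "I", 0.01)   # isomers: identical mass
```

Lys and Gln differ by about 0.036 Da, so they merge at ion trap tolerances but separate at high resolution, whereas Leu and Ile are isomers and remain indistinguishable by mass at any resolution.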
We investigated the LCQ complex data set and found 118 indistinguishable assignment cases whose ΔCn was less than 0.05. Some examples are listed in Table VI. The BNP model assigned high confidence probabilities (PScore) to those matches, which were filtered out by both PeptideProphet and the cutoff-based method. In practice we might not be able to confirm which was the true hit when we do not know which proteins are present. To some extent, the BNP model may provide a more objective judgment. In its present form the BNP model cannot accommodate the similarity of theoretical spectra systematically, and introduction of a new parameter to measure this characteristic would improve the performance of the model in the future. The BNP model algorithm tool as well as other scripts used for the SEQUEST search process will be made publicly available.
Acknowledgments
We thank Dr. Jianqi Li at the Beijing Proteome Research Center for thoughtful discussion. The LCQ control data set was kindly provided by the BIATECH Institute. Dr. Zhongqi Zhang at Amgen Inc. kindly provided a program for predicting the MS/MS spectrum.
Footnotes

Published, MCP Papers in Press, November 12, 2008, DOI 10.1074/mcp.M700558-MCP200

↵ ^{1} The abbreviations used are: FDR, false discovery rate; LDF, linear discriminant function; GDF, Gaussian density function; PDF, probability density function; EM, expectation-maximization; Err, estimated error rate; BNP, Bayesian nonparametric; FPR, false-positive rate; PScore, probability score; PLen, peptide length; PNum, number of peaks in the MS/MS spectrum; VEMS, Virtual Expert Mass Spectrometrist; MPF, mobile proton factor; HPM, hypergeometric probability model.

↵ ^{2} K. Liu, J. Zhang, J. Wang, L. Zhao, X. Peng, W. Jia, W. Ying, Y. Zhu, H. Xie, F. He, and X. Qian, Anal.Chem., in press.

↵* This work was supported by the Chinese Ministry of Science and Technology (Grants 2006CB910803, 2006CB910700, 2006AA02A312, and 2006AA02Z334), National Natural Science Foundation of China (Grant 30621063), and Beijing Municipal Science and Technology Project (Grant H030230280590).

↵S The online version of this article (available at http://www.mcponline.org) contains supplemental material.

↵¶ Both authors contributed equally to this work and are regarded as joint first authors.
 Received November 26, 2007.
 Revision received November 11, 2008.
 © 2009 The American Society for Biochemistry and Molecular Biology