Generalized Method for Probability-based Peptide and Protein Identification from Tandem Mass Spectrometry Data and Sequence Database Searching*

Tandem mass spectrometry-based proteomics currently has a pressing need for computational methods that facilitate the elimination of likely false positives in peptide and protein identification. In the last few years, a number of new peptide identification programs have been described, but the scores or other significance measures reported by these programs cannot always be directly translated into an easily interpretable error rate measurement such as the false discovery rate. In this work we used generalized lambda distributions to model frequency distributions of database search scores computed by MASCOT, X!TANDEM with k-score plug-in, OMSSA, and InsPecT. From these distributions, we could successfully estimate p values and false discovery rates with high accuracy. From the set of peptide assignments reported by any of these engines, we also defined a generic protein scoring scheme that enabled accurate estimation of protein-level p values by simulation of random score distributions and that was also found to yield good estimates of the protein-level false discovery rate. The performance of these methods was evaluated by searching four freely available data sets ranging from 40,000 to 285,000 MS/MS spectra.

Present day mass spectrometry-based proteomics research involves the generation of very large data sets containing thousands of tandem mass spectra, which are assigned to putative peptide sequences in databases by means of computer programs called database search engines. Given the number of MS/MS spectra involved, manual validation of spectrum to peptide assignments quickly became unfeasible, and user-unattended procedures for discarding incorrect matches were developed. In the earliest days of multidimensional chromatography coupled to tandem mass spectrometry, plain score cutoffs for each charge state were arbitrarily established by highly experienced mass spectrometrists (1,2) or determined by searching MS/MS spectra against reversed protein sequence databases (3). For instance, it was quite common to filter SEQUEST data by accepting all matches with ΔCn ≥ 0.1 and Xcorr ≥ 1.5, ≥ 2, and ≥ 3 for singly, doubly, and triply charged peptides, respectively. However, the relative frequency associated with a given score threshold was shown to be highly dependent on overall data set quality, database size, and database search parameters (4,5). This finding implied that significance thresholds needed to be established in an experiment-specific manner and that score thresholds established for trial data sets should never be extrapolated to other data sets expecting that the error rate would be an experiment-independent variable uniquely associated with score values. Such concerns led to the development of mathematical models for describing the probability distributions of database search scores of commonly used search engines such as SEQUEST. Other researchers aimed at developing probability-based search engines attempting to directly provide a significance measure for each peptide assignment, such as X!TANDEM (6) or OMSSA (7).
Finally, others decided to estimate error rates by comparing the frequencies of scores of peptide assignments with those of assignments to false protein sequences generated either by reversing or randomizing real protein sequences (8). Among these strategies, the recently described composite target/decoy sequence database search strategy is gaining increasing acceptance (9).
It is important to point out that warnings have been raised to encourage journals to increase the documentation of proteomics experiments, placing special emphasis on peptide and protein identification procedures, but current algorithmic diversity makes standardization a challenging task (for a detailed description of the current situation see a recent review by Nesvizhskii et al. (10)). Because of the number of database search engines available and the disparity of spectrum matching and scoring schemes that these programs implement, a generalized search engine-independent method to model score distributions and establish adequate statistical significance thresholds would be highly desirable. Furthermore statistical significance should be expressed in an easily interpretable form while allowing a good trade-off between stringency and power, such as the false discovery rate (FDR) (11,12), which measures the expected proportion of truly null features that will pass a given p value threshold (i.e. the expected fraction of false positives).
In this work we used generalized lambda distributions (GLDs) to model MS/MS assignment score distributions. The GLD is an extremely flexible four-parameter function that can mimic with high accuracy the most important families of continuous probability distribution functions used in mathematical modeling. This distribution has been successfully used to model biological, physical, and economic processes because it can provide a valid mathematical model from observed data histograms of virtually any shape (13). By searching several collections of tandem mass spectra with a set of popular database search engines, either commercially or freely available, we demonstrate that this modeling strategy can be used in a search engine-independent manner to compute p values and peptide identification error rates. Finally we also provide a simple but powerful method for computing protein-level p values that are not biased by protein length or number of peptide hits. Estimates of the associated protein-level identification error rates are also provided.

EXPERIMENTAL PROCEDURES
MS/MS Data Sets-All the data sets used in this work are freely available and contain MS/MS spectra recorded using ion trap mass spectrometers. The data set "RaftFlow," containing approximately 40,000 dta files, was downloaded from the Sashimi documentation site (hosted by SourceForge). This data set corresponds to the analysis of the ICAT flow-through of lipid rafts purified from Jurkat T cells. The data set "PAe000038-39" was obtained by merging data sets PAe000038 and PAe000039 downloaded from the PeptideAtlas Web site, which were obtained from proteome digests of the human cancer cell lines SiHa and SqCC. MS/MS scans in mzXML files were converted to mgf file format as singly, doubly, and triply charged ions, yielding 53,666 spectra. The data set "PAe000114," obtained from a digest of the human erythroleukemia K562 cell line, was also downloaded from PeptideAtlas. MS/MS scans in mzXML files were converted to mgf file format as singly, doubly, and triply charged ions, yielding 284,045 spectra. The data set "iPRG2008," containing 42,235 MS/MS spectra, was obtained from the Association of Biomolecular Resource Facilities (ABRF) Proteome Informatics Research Group. These spectra were obtained from iTRAQ-labeled proteome digests of mouse liver cells.
MS/MS Database Searches-MS/MS database searches were carried out using MASCOT version 2.0.05 (available from Matrix Science under license), OMSSA 1.1.3.win32 (freely available from the National Center for Biotechnology Information (NCBI)), InsPecT "20070905" (freely available from the University of California San Diego computational mass spectrometry group), and X!TANDEM "2007.07.01.2" with k-score plug-in (freely available from LabKey). Peak lists in mgf format were used as input. Database search parameters were kept as similar as possible across search engines. Briefly, precursor mass tolerance was set to 2 Da, fragment ion mass tolerance was set to 0.8 Da, and cleavage specificity was set to "trypsin," allowing for a maximum of two missed cleavages. Instrument-specific scoring was used when available. Cysteine alkylation due to iodoacetamide (+57.022) treatment was set as a fixed modification for all data sets except the iPRG2008 data set, for which cysteines were treated with methylmethanethiosulfonate (+45.98). For this data set we also set as fixed modifications the iTRAQ reagent (+144.102) at lysine side chains and the peptide N terminus (except for InsPecT, which does not allow fixed modifications at the peptide N terminus). Databases used were target/decoy sequence databases built from International Protein Index (IPI) human v3.23 or the mouse UniProt database distributed along with the iPRG2008 data set. The use of random or reversed sequences as decoy proteins was decided arbitrarily.
Obtaining Database Search Engine Scores for Modeling-Matches to peptide sequences shorter than 10 residues were always discarded. For MASCOT, the main score was used; for X!TANDEM with k-score plug-in, the k-score (14) (an implementation of the COMET score (15)) was used; for InsPecT, we took the MQScore, which is a linear combination of several match quality scores computed by the program (16). Scores were used without modification except for OMSSA; in this case, to modify the scoring scale so that correctly identified peptides attain high scores, we took as score the negative logarithm of the reported "E value." GLD models were built for every charge state independently, and only assignments to reversed/random peptide sequences were used for this purpose. The number of data points was arbitrarily limited to the top 1500 scores of each charge state. This truncation forces the GLD model to minimize squared error in the right tail of the distribution, the high score region where experimentally relevant statistical significance thresholds are usually computed. Fitting GLDs to truncated observed distributions is not a problem and does not affect model accuracy.
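As an illustration, the score rescaling and truncation steps can be sketched in a few lines of Python (a minimal sketch; the function names are ours, and the base-10 logarithm is an assumption, as the text specifies only "the negative logarithm"):

```python
import math

def omssa_score(e_value):
    # Rescale OMSSA output so that better matches get higher scores:
    # negative logarithm of the reported E value (base 10 assumed).
    return -math.log10(e_value)

def truncate_for_fitting(scores, top_n=1500):
    # Keep only the top-scoring observations so the GLD fit
    # concentrates on the right tail, where significance
    # thresholds are usually established.
    return sorted(scores, reverse=True)[:top_n]
```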
GLD Fitting-The generalized lambda distribution is best defined from its percentile function,

Q(y) = λ1 + (y^λ3 − (1 − y)^λ4)/λ2, 0 ≤ y ≤ 1

The parameters λ1 and λ2 are, respectively, the location and scale parameters, and λ3 and λ4 determine the skewness and kurtosis of the distribution. In the same way that the normal probability distribution has the restriction that the standard deviation must be positive, not every set of (λ1, λ2, λ3, λ4) parameters yields a valid distribution. An accurate description of the restrictions on these parameters that yield a valid GLD may be found elsewhere (13). From the percentile function described above, the probability density at x = Q(y) is computed as

f(x) = λ2/(λ3·y^(λ3 − 1) + λ4·(1 − y)^(λ4 − 1))
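The percentile function and the density it implies, under the Ramberg-Schmeiser parameterization used by Karian and Dudewicz, can be written directly in code (a minimal sketch; function names are ours):

```python
def gld_quantile(y, lam1, lam2, lam3, lam4):
    # GLD percentile function: Q(y) = lam1 + (y**lam3 - (1 - y)**lam4) / lam2
    return lam1 + (y ** lam3 - (1.0 - y) ** lam4) / lam2

def gld_density_at(y, lam2, lam3, lam4):
    # Probability density evaluated at x = Q(y):
    # f(Q(y)) = lam2 / (lam3 * y**(lam3-1) + lam4 * (1-y)**(lam4-1))
    return lam2 / (lam3 * y ** (lam3 - 1.0) + lam4 * (1.0 - y) ** (lam4 - 1.0))
```

For example, the parameters (0, 1, 1, 1) yield the uniform distribution on [-1, 1], whose density is 0.5 everywhere.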
Because y is defined as the probability of x ≤ Q(y), building data histograms for fitting a GLD involves converting data points to a relative frequency scale, computing Q(y) for all points, and binning data points according to this amount. For fitting GLDs to histograms from observed data we used the method of percentiles described by Karian and Dudewicz (13) with minor modifications. Briefly, this method involves the computation of four sample statistics that are used as estimators of distribution parameters,

ρ̂1 = π̂(0.5)
ρ̂2 = π̂(1 − u) − π̂(u)
ρ̂3 = (π̂(0.5) − π̂(u))/(π̂(1 − u) − π̂(0.5))
ρ̂4 = (π̂(0.75) − π̂(0.25))/ρ̂2

where π̂(f) denotes the sample percentile limiting the fraction f of ranked observations, and u is an arbitrary number between 0 and 0.25, which we set to 0.005. ρ̂1 is the sample median, ρ̂2 is the interdecile range, ρ̂3 is the left-right tail weight ratio, and ρ̂4 is the tail weight factor. Valid pairs of (λ3, λ4) parameters corresponding to the estimated amounts (ρ̂3, ρ̂4), or (1/ρ̂3, ρ̂4) if ρ̂3 > 1, are then obtained from previously tabulated values of (ρ3, ρ4). From these amounts, the estimated values for λ2 and λ1 are obtained consecutively as

λ̂2 = ((1 − u)^λ̂3 − u^λ̂3 + (1 − u)^λ̂4 − u^λ̂4)/ρ̂2
λ̂1 = ρ̂1 − (0.5^λ̂3 − 0.5^λ̂4)/λ̂2
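The four sample statistics of the method of percentiles might be computed as follows (a sketch using a simple nearest-rank percentile; the paper does not specify which percentile estimator was used):

```python
def percentile(xs, f):
    # Nearest-rank sample percentile: the value below which a
    # fraction f of the ranked observations fall.
    s = sorted(xs)
    return s[min(len(s) - 1, max(0, int(round(f * (len(s) - 1)))))]

def percentile_estimators(xs, u=0.005):
    # The four statistics of the Karian-Dudewicz method of percentiles.
    med = percentile(xs, 0.5)
    rho1 = med                                        # sample median
    rho2 = percentile(xs, 1 - u) - percentile(xs, u)  # interdecile-type range
    rho3 = (med - percentile(xs, u)) / (percentile(xs, 1 - u) - med)  # tail weight ratio
    rho4 = (percentile(xs, 0.75) - percentile(xs, 0.25)) / rho2       # tail weight factor
    return rho1, rho2, rho3, rho4
```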
Among all sets of (λ1, λ2, λ3, λ4) parameters compatible with the set of estimators (ρ̂1, ρ̂2, ρ̂3, ρ̂4) obtained from each data histogram, the GLD(λ1, λ2, λ3, λ4) that best fits the observed data is selected as the one that minimizes a given error indicator. An absolute error threshold of 0.35 in the sample estimators (ρ̂3, ρ̂4) was used, and to bias the model toward better accuracy in the right tail, we selected as best model the one that minimized the amount

E = Σi wi(yi − fi)²/fi

computed over every non-zero value of fi, where yi and fi denote, respectively, the value at the ith bin of the observed and model score histograms. The weight wi was defined as 1 minus the relative rank of the ith bin raised to the power of 4. The optimal bin size for building histograms was computed from the data set size, N, and the interquartile range (a measure of data dispersion) as

bin size = 2·IQR/N^(1/3)
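A sketch of the weighted fit error together with a standard bin-width choice (the Freedman-Diaconis rule, which matches the stated inputs, data set size and interquartile range, but is our assumption rather than the paper's stated formula):

```python
def weighted_tail_error(observed, model, weights):
    # Weighted normalized squared error between observed (y_i) and
    # model (f_i) histogram bins; bins with zero model frequency
    # are skipped.
    return sum(w * (y - f) ** 2 / f
               for y, f, w in zip(observed, model, weights) if f != 0)

def fd_bin_width(n, iqr):
    # Freedman-Diaconis bin width: 2 * IQR / N**(1/3) (assumed rule).
    return 2.0 * iqr / n ** (1.0 / 3.0)
```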
Because a closed expression for the probability function of the form y = F(x) does not exist, this distribution must be computed numerically to assign a p value to each data point.

Estimation of Error Rates in Peptide Identification-Given a set of p values assigned either at the peptide or protein level and ranked in ascending order, the expected proportion of incorrect observations among those passing a given p value threshold pi depends on the data set size, N, and on the number, i, of data points with an equal or better p value,

FDRi = N·pi/i

This expected error rate is called the FDR.
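Both steps can be sketched in Python: the p value by bisection on the monotone percentile function Q, and the FDR by the standard rank-based estimate FDR_i = N·p_i/i (a minimal sketch; function names are ours):

```python
def gld_p_value(x, quantile, tol=1e-10):
    # p(x) = 1 - F(x), where F is obtained by numerically inverting
    # the monotone percentile function Q with bisection.
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if quantile(mid) < x:
            lo = mid
        else:
            hi = mid
    return 1.0 - 0.5 * (lo + hi)

def fdr(p_values):
    # For the i-th smallest of N p values, FDR_i = N * p_i / i.
    n = len(p_values)
    return [n * p / i for i, p in enumerate(sorted(p_values), start=1)]
```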
Error rates may also be estimated from composite target/decoy sequence database searches by counting the number of decoy hits passing a given p value threshold. Because the FDR is a mathematical expectation, we decided to use a different name for the error rate estimated by counting decoy hits and called it the decoy hit rate (DHR). This amount may be defined as

DHRi = 2·Di/i

where Di is the number of decoy hits with a p value better than pi. Because this method is independent of the exact values of the estimated probabilities, it was used to assess the accuracy of the FDR computed from the models. Composite target/decoy databases were built by concatenating a regular protein database with a fake version of it containing either random or reversed protein sequences. Random protein databases were built using the same distribution of protein lengths and residue frequencies observed in the original database.
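The DHR computation might look like this (a minimal sketch; following the usual composite target/decoy estimate, each decoy hit is taken to represent one additional false target hit, hence the factor 2, which is our assumption):

```python
def decoy_hit_rate(is_decoy_ranked):
    # Hits are ranked by ascending p value; True marks a decoy hit.
    # DHR_i = 2 * D_i / i, with D_i the decoy hits among the top i.
    rates, decoys = [], 0
    for i, d in enumerate(is_decoy_ranked, start=1):
        decoys += d
        rates.append(2.0 * decoys / i)
    return rates
```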

Computation of Protein-level p Values and Error Rates-Peptide matches were grouped by parent protein sequence. From the p values of the h peptide ions assigned to a given protein, the protein score was defined as

S_Pj = −Σi log pi, i = 1, …, hj (Eq. 13)

where pi are the peptide ion p values computed from the corresponding GLD models. Because the null hypothesis states that the hj peptides were assigned to the jth protein by chance, the probability of finding a score equal to or better than S_Pj may be estimated by generating a sufficiently large set of random protein scores and calculating the relative frequency of random scores as high as or higher than S_Pj. These random protein scores were computed by sampling hj decoy peptide p values from the whole data set at random and applying Equation 13 to obtain a score. In this work we computed 10^6 random protein scores for each category of h in the data set, so protein p values as low as 10^-6 could be assigned; lower p values may be assigned by increasing the number of random scores computed. After assigning p values, proteins were clustered into sequence similarity groups by defining a protein similarity cluster as the set of all proteins in the data set that share at least one identified peptide. The FDR and DHR were calculated as described above for each similarity cluster, taking as the cluster p value the smallest protein p value in the cluster.
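The protein score and its Monte Carlo null distribution can be sketched as follows (a minimal sketch; `n_sim` would be 10^6 to reach p values of 10^-6 as in the text, and names are ours):

```python
import math
import random

def protein_score(peptide_p_values):
    # S_P = sum of -log(p_i) over the peptide ions assigned to a protein.
    return sum(-math.log(p) for p in peptide_p_values)

def protein_p_value(score, h, decoy_p_pool, n_sim=10**6, rng=None):
    # Monte Carlo null: repeatedly sample h decoy peptide p values,
    # score them, and count how often the random score reaches the
    # observed one.
    rng = rng or random.Random(0)
    hits = sum(
        protein_score(rng.choices(decoy_p_pool, k=h)) >= score
        for _ in range(n_sim)
    )
    return hits / n_sim
```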

RESULTS
After database searches, MS/MS assignments to false peptide sequences were classified by charge state, redundancy was eliminated by taking the best scoring spectrum of a given charge state among those assigned to the same peptide sequence, and the resulting non-redundant set of scores was used to build a histogram of observed score frequencies.
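The redundancy elimination step described above might be sketched as follows (function and variable names are ours):

```python
def nonredundant_scores(assignments):
    # assignments: iterable of (peptide, charge, score) tuples for
    # matches to decoy sequences. Keep the best-scoring spectrum for
    # each (peptide, charge) pair, then group scores by charge state.
    best = {}
    for peptide, charge, score in assignments:
        key = (peptide, charge)
        if key not in best or score > best[key]:
            best[key] = score
    by_charge = {}
    for (peptide, charge), score in best.items():
        by_charge.setdefault(charge, []).append(score)
    return by_charge
```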
From every charge state-specific score histogram a model probability function was obtained by fitting a GLD as described under "Experimental Procedures." These models were then used to compute the expected relative frequency of scores as extreme as or more extreme than a given value, i.e. to assign p values. Fig. 1 shows GLD models obtained from database search scores provided by X!TANDEM with k-score plug-in, OMSSA, MASCOT, and InsPecT (from the data set RaftFlow). At first sight, the probability density functions seem to provide a good description of score frequencies in all cases given the overlap between the observed and model data series. To check the validity of the p values computed from the models, we pooled assignments from all charge states, ranked them by p value, and plotted the computed p value against its relative rank. As shown, both estimators converge to very similar values across several orders of magnitude, suggesting that the GLD fits may be used to estimate expected relative frequencies of raw scores with high accuracy. Furthermore we used the number of decoy hits passing a given p value threshold to estimate its associated frequency of random matches (DHR) and compared these amounts with FDR estimates obtained directly from the p values. The results obtained from searching the four data sets with the four search engines as well as the parameters of the charge state-specific GLD models are shown in Tables I-IV. For a predicted fraction of incorrect matches of 5%, we observed frequencies of incorrect matches ranging from 3.7 to 5.5% in 15 of 16 cases and of 1.8% in one case. These results account for a total of 25,986 peptide ion matches. Afterward, peptide matches were grouped by parent protein sequence, and a protein score was computed for every protein from the p values of putatively identified peptide ions.
Given h peptide ions assigned to any one protein, by repeatedly sampling h decoy peptide p values at random and computing a random protein score, we built protein score distributions that reflect the relative frequencies of the range of scores that may be obtained by combining h peptide ion p values under the null hypothesis. Such random protein score distributions are shown in Fig. 2A. As observed, score frequencies estimated from the simulated distributions were very close to the observed frequencies of decoy protein matches in the data set, especially in the right tails. To check the validity of p values computed from these distributions, we pooled protein matches across all categories of h higher than 1, ranked them by ascending protein p value, and plotted the estimated protein p value against its relative rank. As observed in Fig. 2A, computed p values were very close to observed relative frequencies across several orders of magnitude. Because the absolute number of MS/MS hits at random to a given protein, h, depends on sequence length (larger proteins yield more theoretical peptides upon in silico digestion) and a random score distribution was generated for every value of h, we expected that p values assigned by this method would not show any bias toward parent protein length. This desirable behavior of protein p values was found to be true and is illustrated in Fig. 2B. Protein-level error rates for the GLD modeling experiments described above are shown in Tables I-IV. As observed, for a predicted 5% of incorrect protein matches, observed frequencies ranged from 3.3 to 6.7% in 14 of 16 cases and exceeded this range in only two cases with values of 1.2 and 9.9%. These results account for a total of 7569 protein clusters.

DISCUSSION
In previous research, other investigators developed statistical models to assign p values to peptide match scores provided by database search engines. These statistical models incorporate experiment-specific information that is reflected in the frequency distributions of such scores. In this work, we developed a generalized approach that relies on the use of generalized lambda distributions, avoiding the need to search for a known probability density function that may approximate the frequency distribution of scores from a given search engine. To obtain models that minimize the squared error in the right tail, where meaningful significance thresholds are usually established, we truncated the score distributions by taking the 1500 highest scores of each charge state and introduced a weight factor in every normalized squared error term that emphasized the contribution of the right tail to the total error. Because GLD models are extremely flexible, it is critical that scores from correctly assigned spectra be completely absent, as the model would otherwise also fit the tail of positive match scores. This is why we used only matches to false peptide sequences to estimate the model parameters. Given the increasing popularity of the target/decoy composite database method, we preferred this strategy over separate real and false protein sequence database searches. In addition, this method allowed us to simultaneously obtain a widely accepted estimate of the misidentification rate to compare with that predicted from the statistical model. As described under "Results," the accuracy of these models proved very good and yielded expected error rates very close to their real values in almost all cases. The fact that MS/MS search scores from any search engine may be expressed as p values using this method might facilitate a detailed comparative analysis of algorithm performance in the future.
Moreover a common probability scale for all search engines seems much more suitable for the development of algorithms to obtain extra significance by matching MS/MS spectra to peptides using multiple database search programs.
Because identifying individual peptides is not the goal of most present day proteomics experiments, peptides must be grouped according to parent protein sequence. For these groups of peptide matches, a new scoring scheme must be devised that represents each parent protein as a single experimental feature, and statistical significance thresholds must be established at this new experimental level. To this end, we defined a protein score that reflects the quality of the individual peptide ion matches by summing the negative logarithms of their p values. As described under "Experimental Procedures," this definition of protein score permits modeling the frequencies of scores obtained by chance in a straightforward manner and thus assigning protein-level p values without the need to develop an analytical formulation of such probability. Estimation of protein p values and error rates using this method yielded values that were, in almost all cases, in very good agreement with the amounts estimated by counting decoy protein matches, suggesting that the method may be considered robust despite its simplicity. In addition, these protein p values are independent of the absolute number of MS/MS spectra matching peptides of a given protein, which is expected to be largely dependent on protein sequence length.
Recall that the ability to rank protein matches by decreasing significance is not intrinsic to the original target/decoy database strategy, which was initially defined at the peptide level. This added protein-level layer overcomes the fact that a low misidentification rate at the peptide level is not necessarily associated with a low misidentification rate at the protein level, a well known paradox of bottom-up proteomics. The other drawback of these proteomics technologies, known as the protein inference problem (17), was partially overcome by grouping proteins that share identical identified peptide sequences into protein clusters. Because of sequence similarity among proteins, peptides released upon enzymatic digestion of a given protein may end up contributing to the putative identification of many other proteins not necessarily present in the sample. Although there is no perfect solution, a maximum parsimony assumption may be made to reduce the list of identified proteins to the minimum set of protein sequences capable of explaining the presence of all observed peptide matches. By taking the most significant protein match in each cluster as a representative match, protein clusters may be filtered under controlled error rate conditions as demonstrated in the tables. Attempting more sophisticated protein inference, for instance using non-degenerate peptides, is outside the scope of this work.
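Grouping proteins that share identified peptides is a transitive-closure operation, which can be sketched with a union-find structure (a minimal sketch; names are ours):

```python
def cluster_by_shared_peptides(protein_peptides):
    # protein_peptides: dict mapping protein id -> set of identified
    # peptide sequences. Proteins sharing at least one peptide end up
    # in the same similarity cluster.
    parent = {p: p for p in protein_peptides}

    def find(p):
        # Find the cluster representative, compressing paths as we go.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    owner = {}  # peptide -> first protein seen carrying it
    for prot, peps in protein_peptides.items():
        for pep in peps:
            if pep in owner:
                parent[find(prot)] = find(owner[pep])  # merge clusters
            else:
                owner[pep] = prot

    clusters = {}
    for prot in protein_peptides:
        clusters.setdefault(find(prot), set()).add(prot)
    return list(clusters.values())
```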
In conclusion, we think that the flexible statistical procedures presented might be applied to analyze MS/MS assignment scores from a variety of search engines as well as to obtain a non-redundant list of putatively identified proteins without significantly exceeding a maximum tolerated percentage of incorrectly identified proteins. The suitability of these procedures may be judged from the results obtained for the approximately 400,000 MS/MS spectra that were used in this work.