Molecular & Cellular Proteomics 6:1599-1608, 2007.
| ABSTRACT |
|---|
The first use of such an algorithm was reported by Eng et al. in 1994 (1). The algorithm, to be named Sequest, enabled the high throughput and automated interpretation of tandem mass spectra against a protein sequence library. The Mascot search algorithm, an outgrowth of the MOWSE (molecular weight search) project (2), was released in 1998 (3). More recently two open source search algorithms have become available: the Open Mass Spectrometry Search Algorithm (OMSSA)1 (4) and X! Tandem (5), distributed by the National Center for Biotechnology Information (NCBI) and the Global Proteome Machine Organization, respectively.
Although each implementation is different, these search algorithms operate under the same general principles. Namely they first find peptides whose theoretical mass approximates the experimental mass determined by the mass spectrometer for the precursor ion within an in silico digest of the sequence library being used. The search space is typically limited by parameters including mass tolerance, enzyme specificity, numbers of missed cleavages, and amino acid modifications. The algorithms next consider the fragment ions generated within the tandem mass spectrometer. The fragment ion masses are similarly compared with an in silico fragmentation of the peptides that passed the first criteria. The fragment ions must approximate the masses of the theoretical fragment ion masses within a defined mass tolerance. Typically only a subset of possible fragmentation ions are considered depending on the type of mass spectrometer and the fragmentation mechanism used. For example, only b and y ions are generally considered in the case of ion trap mass spectrometers using collision-induced dissociation. The details of this implementation differ among the algorithms and are not documented in all cases. In addition, the methods used to assign scores are very different as described in a recent review (6).
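The two-stage principle described above can be illustrated with a minimal Python sketch. It is not the implementation of any of the four search engines; the function names are hypothetical, the cleavage rule (after K or R, but not before P) is the common textbook rule for trypsin, and modifications such as cysteine carbamidomethylation are deliberately left out to keep the example short.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # H2O added to the residue sum of an intact peptide
PROTON = 1.00728   # charge carrier for singly charged fragment ions


def tryptic_peptides(protein, missed_cleavages=1):
    """In silico digest: cleave after K or R unless the next residue is P."""
    sites = [0]
    for i, aa in enumerate(protein[:-1]):
        if aa in "KR" and protein[i + 1] != "P":
            sites.append(i + 1)
    sites.append(len(protein))
    peptides = []
    for i in range(len(sites) - 1):
        # Join up to `missed_cleavages` extra neighbouring fragments.
        for j in range(i + 1, min(i + 2 + missed_cleavages, len(sites))):
            peptides.append(protein[sites[i]:sites[j]])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass; non-standard residues raise KeyError."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def precursor_candidates(proteins, observed_mass, tolerance=1.5,
                         missed_cleavages=1):
    """Stage 1: peptides whose theoretical mass is within the tolerance."""
    matches = set()
    for protein in proteins:
        for pep in tryptic_peptides(protein, missed_cleavages):
            if abs(peptide_mass(pep) - observed_mass) <= tolerance:
                matches.add(pep)
    return sorted(matches)


def b_y_ions(peptide):
    """Stage 2: singly charged b and y fragment m/z values of a candidate."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b = [sum(masses[:i]) + PROTON for i in range(1, len(masses))]
    y = [sum(masses[-i:]) + WATER + PROTON for i in range(1, len(masses))]
    return b, y
```

A real engine would then compare the experimental fragment peaks with the `b_y_ions` output within the fragment ion tolerance and convert the matches into a score; that scoring step is where the algorithms differ most and is not sketched here.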
As noted, each algorithm outputs one or more scores in response to a query for each tandem mass spectrum. In some cases these scores are dependent on the search parameter inputs and/or sequence library used in addition to the quality of the tandem mass spectrum. This capability is intended to allow the search engine to output a probability consistent with these parameters. However, not all search algorithms consider all parameters in their score assignment. A combination of a lack of documentation together with a general lack of understanding of the underlying algorithms contributes to inconsistent application of any given search algorithm within and across research groups. Ideally a search algorithm would perform an initial search of a relatively small portion of the submitted dataset to determine the appropriate search parameters such as precursor and fragment ion mass accuracy, number of missed cleavages, monoisotopic or average masses, etc. to remove this aspect of user bias. This is now becoming increasingly important as search results make their way into public repositories such as PeptideAtlas (7, 8) and the Global Proteome Machine database (9).
Another source of variability in the search output is in the generation of the list of tandem mass spectra to be searched, typically referred to as a peak list. Peak lists are generated by proprietary programs compatible with the proprietary data files of the manufacturer of the mass spectrometer on which the data were recorded. Open source programs for converting proprietary data files into standardized XML formats such as mzXML and mzData have recently become available as has an open source program to convert mzXML files to peak list files, mzXML2other. There are often as many if not more parameters available to define the creation of a peak list as there are parameters that define the search itself. Again a lack of documentation and understanding of the peak listing algorithms leads to inconsistent use. For example, spectral counting (10) is emerging as a robust and sensitive quantitation method; however, peak listing parameters that have a dramatic impact on the outcome of such a measure are rarely reported. One such parameter available to users of the Thermo peak listing program extract_msn.exe (embedded within Bioworks) determines how many scans of the same precursor mass, within a variable scan window, will be averaged together to create a single query. Certainly choosing a wider window will tend to lessen the number of potential identifications, influencing the quantitative measure relative to a narrower window or no window at all. This parameter may also affect the outcome of a search as many scans averaged together may vary significantly from any single scan of that grouping. For example, low resolution mass spectrometers routinely used for proteomics studies use a relatively wide isolation window. For any given scan there may be another analyte of similar mass that is isolated for fragmentation together with the analyte ion of interest, resulting in a "distracted" fragmentation of this mixture of analytes, confounding interpretation by the search algorithm.
The final hurdle to realizing a usable dataset is the interpretation of the scores output by the search algorithms. A wide variety of scoring procedures have been implemented by different groups, each with its own method for determining false positive rates. This is especially true of the Sequest algorithm because its scores, unlike those of Mascot, OMSSA, and X! Tandem, do not reflect the probability of the assignment. Many scoring thresholds are used for Sequest, two of which were used by the Human Proteome Organization (HUPO) to interpret results from their Plasma Proteome Project (11) and are valuable as a reference point for that reason. The scoring procedures are referred to as "high confidence" and "low confidence" criteria. Sequest has recently started reporting a probability-based score when used from within the Bioworks Browser (Thermo). Mascot has always provided two criteria for search result evaluation, an identity score and a homology score. The identity score is to be used as a threshold for identification, and the homology score is to be used as a threshold for extensive homology. Mascot has recently started reporting an expectation value in addition to the other scores. OMSSA and X! Tandem both report expectation values as the primary score for result evaluation.
The trend toward probability-based scores is welcome. However, a user must still decide what is an appropriate cutoff criterion. A user might reasonably decide that an expectation or probability value of 0.05 or 0.01, each intended to correspond to a 5% or 1% likelihood of a random match, would be appropriate. However, we have found that these scores can vary substantially with differing search parameters, sequence libraries, and samples as has been reported previously (12, 13). Many groups outside of the search algorithm distributors themselves have made efforts to improve data evaluation methods (14–18), yet the use of these methods has not been widely adopted and certainly not standardized.
One approach that is able to simultaneously deal with variations introduced by samples, analytical techniques, data processing, and search parameters is the target-decoy sequence library search (19–21). In this method the protein sequence library to be searched is copied and reversed, and the reversed copy is appended to the end of the original sequence library. This creates a decoy sequence library within the search set with exactly the same number of proteins, sequence lengths, and amino acid composition as the real, or target, sequence library. Therefore, any variability that would affect a search will be mirrored in the search of the decoy library. The target-decoy search strategy permits an impartial initial assessment of search results and application of cutoffs based on the estimated false discovery rates (FDRs) determined from the search. It is important to note that the FDRs are only estimates in that a hit to a decoy sequence is only a proxy for a "true" false positive and is not itself a false positive, although here we refer to them as such. This initial assessment should be followed by manual inspection of spectra identifying proteins to which biological meaning is attributed. Those proteins should then be further validated by other methods.
Kapp et al. (22) have recently evaluated and compared a number of tandem MS search algorithms. However, peak list generation was performed using differing parameters for input into each of the search algorithms, and differing search parameters were applied, such as precursor and fragment ion mass tolerances. Also a validated dataset was established for comparison that may have been biased toward the search algorithms with which it was created. Finally comparisons were made based on a limited set of tandem mass spectra (<4,000).
In this study we evaluated four search algorithms: Mascot, OMSSA, Sequest, and X! Tandem. Sample, data acquisition, data processing (peak listing), search algorithm parameter, and sequence library variables were removed by using a single processed dataset, identical search parameters, and the same target-decoy sequence library in all searches. Results obtained using commonly applied scoring procedures were compared with those obtained using a consistent cutoff as determined by the target-decoy-derived FDR. The sensitivity and specificity of each algorithm were evaluated as well as the overlap between algorithms of inferred protein identifications. Finally an alternative, simpler method for evaluating appropriate scoring criteria is proposed.
| EXPERIMENTAL PROCEDURES |
|---|
100,000 cells were procured. Proteins associated with cell pellets were extracted using sodium dodecyl sulfate detergent followed by denaturation, reduction, alkylation, digestion with trypsin, purification on a reversed-phase trap column, and lyophilization to dryness. Approximately 10 µg of the resulting digest was separated into discrete fractions by capillary IEF (CIEF) followed by nanoflow reversed-phase LC (nano-RPLC) coupled with ESI-MS/MS using a linear ion trap (LTQ, ThermoFinnigan, San Jose, CA). Precursor ions were scanned at a mass range of 400–1,400 m/z. Data-dependent scanning was enabled with the five most intense ions of the precursor ion scan selected for tandem MS with dynamic exclusion enabled and set to 18 s. MS and tandem MS scan times were set to a maximum of 100 ms. Automatic gain control target settings were 30,000 for full MS scans and 10,000 for MS2 scans.
Data Analysis—
Raw search data files were peak-listed using the version of extract_msn.exe distributed with Bioworks 3.3 (ThermoFinnigan). The command line argument used to run extract_msn was: > -F1 -L0 -B500 -T3500 -M1.00 -C2 -S1 -I10 -C0 -G1.
A detailed description of the command line arguments from the manufacturer may be found in the supplemental methods section. The 154,973 dta files produced by extract_msn were directly used as the input to Sequest. Additionally the dta files were concatenated and converted to mgf file format using an in-house script (convert.pl). This single 1-gigabyte mgf file was used as the input file to the Mascot, OMSSA, and X! Tandem search engines.
The human Swiss-Prot sequence library (November 2004 build) obtained from the European Bioinformatics Institute (ftp.ebi.ac.uk/pub/databases/SPproteomes/fasta/proteomes) was used to create a target-decoy sequence library for searching. An in-house script, rev_swiss_decoyed.pl, was used to copy the sequence library, append the string "_REVERSE" to each protein accession in the copy, reverse every protein sequence string in the copy, and concatenate the now decoyed copy to the original.
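The in-house Perl script itself is not reproduced in the article. The steps it is described as performing (copy, tag each accession with "_REVERSE", reverse each sequence, concatenate) can be sketched in Python as follows; the function name and the in-memory `(accession, sequence)` representation are assumptions made for illustration, and FASTA parsing is omitted.

```python
def make_target_decoy(entries, suffix="_REVERSE"):
    """Return the target entries followed by their reversed decoy copies.

    `entries` is a list of (accession, sequence) tuples. Each decoy keeps
    the length and amino acid composition of its target because the
    sequence is only reversed, never shuffled or mutated.
    """
    decoys = [(accession + suffix, sequence[::-1])
              for accession, sequence in entries]
    return list(entries) + decoys


def is_decoy(accession, suffix="_REVERSE"):
    """Classify a search hit by the accession tag added above."""
    return accession.endswith(suffix)
```

Because the decoys are appended to the same library, every search engine sees targets and decoys under identical parameters, which is what allows decoy hits to serve as a proxy for false positives.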
Four search engines were evaluated: Mascot 2.0, OMSSA 1.1.0, Sequest/Bioworks 3.3, and X! Tandem 2006.6.1.1. All searches were run locally on a Dell Optiplex GX620 with a 2.8-GHz Pentium D processor and 2 gigabytes of RAM. All searches used the following search parameters: precursor ion mass tolerance, ±1.5 Da; fragment ion mass tolerance, ±0.5 Da; fully tryptic enzyme specificity; one missed cleavage; monoisotopic precursor mass (Mascot, OMSSA, and Sequest); monoisotopic fragment ion mass (OMSSA and Sequest); a fixed modification of cysteine carbamidomethylation; and a variable modification of methionine oxidation. An additional parameter unique to OMSSA was used that requires that one of a variable number of the most intense fragment ions match those of the theoretical peptide. In this case the parameter was set to 5. It should also be noted that X! Tandem applies the 3-Da precursor mass window differently than the other algorithms. The window is set such that the search is defined by a window from –0.5 to +2.5 Da around the precursor ion rather than ±1.5 Da around the precursor in the case of the other algorithms.
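The practical effect of the two window definitions can be seen in a short Python sketch. The function names are hypothetical, and the sign convention (theoretical minus observed mass) is an assumption: the text states only that the X! Tandem window runs from –0.5 to +2.5 Da around the precursor.

```python
def in_symmetric_window(theoretical, observed, tolerance=1.5):
    """Window of +/- tolerance Da, as applied by Mascot, OMSSA, and Sequest."""
    return abs(theoretical - observed) <= tolerance


def in_offset_window(theoretical, observed, low=-0.5, high=2.5):
    """Asymmetric 3-Da window as described for X! Tandem.

    Assumption: the offsets apply to (theoretical - observed).
    """
    return low <= theoretical - observed <= high
```

Both windows are 3 Da wide, but a candidate 1.8 Da heavier than the observed precursor passes only the offset window, whereas a candidate 1.0 Da lighter passes only the symmetric one. The sets of candidate peptides scored by the engines are therefore not strictly identical.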
Result files were parsed to extract dta name, query number, peptide sequence, protein accession, experimental and theoretical masses, charge, and algorithm-specific scores. All parsers were written in Perl. OMSSA XML result files, omx, were converted to csv format using omx2csv, available from Computational Systems Biology at the University of Virginia. Mascot dat result files were converted to PepXML using the converter available with the TransProteomic Pipeline (TPP) distribution, dat2xml. The TPP is available from the Seattle Proteome Center. PepXML files were converted to csv format using pepxml2csv, also available from Computational Systems Biology at the University of Virginia. Sequest results were exported to Excel format using the built-in feature within Bioworks. X! Tandem XML result files were converted to PepXML using the TPP converter tandem2xml. The parsed data were loaded into a Microsoft Access (2003) database for subsequent analysis and comparison. False discovery rates were calculated according to the method of Elias et al. (20): decoy hits were multiplied by 2, and the product was divided by the sum of the target and decoy hits. A table listing all queries that meet the 1% MS2 FDR threshold of the algorithms discussed is included in the form of an Excel spreadsheet as Supplemental Table 1.
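The false discovery rate calculation described above (twice the decoy hits divided by the sum of target and decoy hits) is simple enough to state as a few lines of Python. The function name is hypothetical; the analysis in this study was carried out in Microsoft Access, not with this code.

```python
def estimated_fdr(target_hits, decoy_hits):
    """Estimated FDR = 2 * decoy hits / (target hits + decoy hits).

    The factor of 2 reflects the assumption that each decoy hit implies
    one equally likely false hit among the target sequences.
    """
    total = target_hits + decoy_hits
    if total == 0:
        return 0.0  # no hits passed the threshold, so nothing to estimate
    return 2 * decoy_hits / total
```

For example, 50 decoy hits among 10,000 accepted hits correspond to an estimated FDR of 1%.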
Target and decoy peptide hits were redundantly assigned to all target and decoy Swiss-Prot entries to which they mapped. MS2 hits indicate a search query (peak list) that results in a peptide assignment at some score. Distinct peptides indicate peptides with differing sequences. Distinct proteins indicate differing Swiss-Prot entries (identifiers). A protein (Swiss-Prot entry) was counted if a single peptide mapped to it whether or not that peptide also mapped to other entries.
| RESULTS |
|---|
Mascot provides the user with two scores: an ion score and an E-value. The ion score is compared with two thresholds that are calculated independently for each peptide, the homology score and identity score. Traditionally the identity score is the reported threshold used in most laboratories. The homology score is typically lower than the identity score in the case of longer peptides and higher in the case of shorter peptides. Table I shows the results of a Mascot search using the identity score as the threshold criterion (Mascot identity), using both the identity and homology scores as criteria (Mascot identity and homology), and using the E-value at a threshold found to return a 1% MS2 FDR. The second scoring procedure, using both the identity and homology scores, is shown because this is an effective way to remove short peptides, which have a higher tendency to be false positives than longer peptides. This results in a higher confidence dataset with an increase in the estimated number of total proteins discovered.
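How a threshold "found to return a 1% MS2 FDR" might be located is not spelled out in the text. One plausible procedure, sketched below in Python, is to rank all hits by score and keep the most permissive cutoff at which the target-decoy estimate (2 × decoys / total) stays at or below the desired rate. The function name and the `(score, is_decoy)` input format are assumptions; for scores where lower is better, such as E-values, a transformed score such as –log10(E) would be passed instead.

```python
def score_cutoff_for_fdr(hits, max_fdr=0.01):
    """Return the lowest score cutoff whose estimated FDR is <= max_fdr.

    `hits` is an iterable of (score, is_decoy) pairs where a higher score
    means a better match. Returns None if no cutoff satisfies the limit.
    """
    ordered = sorted(hits, key=lambda hit: hit[0], reverse=True)
    targets = decoys = 0
    cutoff = None
    for i, (score, is_decoy) in enumerate(ordered):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Evaluate only after all hits sharing this score are counted.
        if i + 1 < len(ordered) and ordered[i + 1][0] == score:
            continue
        if 2 * decoys / (targets + decoys) <= max_fdr:
            cutoff = score
    return cutoff
```

For Sequest Xcorr, where a separate threshold is used for each charge state, the same function would simply be applied to the hits of each charge state in turn.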
Sequest presents a special case because many empirically determined threshold scores are used in research laboratories throughout the world. Displayed in Table I are some commonly used thresholds put forth by HUPO, including both a low confidence (Sequest HUPO low) and high confidence (Sequest HUPO high) set of thresholds. Sequest results are also shown using thresholds empirically determined to return a 1% MS2 FDR for each charge state (Sequest Xcorr FDR). Sequest offers two separate scores. The first score is actually a composite of several scores. The cross-correlation score (Xcorr) is a cross-correlation measure of the theoretical and experimental spectra. Typically a different Xcorr threshold is used for each charge state to reflect the difference in probability of a random match to each. A score measuring the difference between the top two candidate spectra is also given in the form of the delta correlation value (ΔCn), where the scores are normalized such that the top score is set equal to 1 and the difference is taken. Less frequently used are the preliminary score (Sp) and the preliminary score ranking (RSp), which are calculated first as a filter to restrict the number of spectra to subject to cross-correlation scoring. The second Sequest score is a probability value score offered in the most recent version of the Bioworks software (Sequest p).
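The delta correlation value as described here (normalize so that the top score equals 1, then take the difference) amounts to the relative drop from the best to the second-best Xcorr. A minimal Python sketch follows; the function name is hypothetical, and this is a reading of the description above, not code taken from Sequest.

```python
def delta_cn(xcorr_top, xcorr_second):
    """Normalized difference between the two best cross-correlation scores.

    Dividing by the top score sets it to 1, so the result is
    1 - (second / top), i.e. (top - second) / top.
    """
    if xcorr_top <= 0:
        raise ValueError("top Xcorr must be positive to normalize")
    return (xcorr_top - xcorr_second) / xcorr_top
```

A value near 0 means that the second candidate scored almost as well as the first, so the top assignment is poorly discriminated; larger values indicate a more clearly separated best match.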
Encouragingly the results produced by using a 1% MS2 FDR are very consistent for each of the algorithms. Mascot, X! Tandem, Sequest p, and Sequest Xcorr FDR yielded 9,942–10,762 distinct peptides and ~2,000 proteins, whereas OMSSA identified 13,512 distinct peptides and over 2,700 proteins. The protein FDR for each algorithm is consistent as well, yielding values between 8.9 and 10.8% at a controlled 1% MS2 FDR. The trend across a range of MS2 FDR values is shown in Fig. 2. Even across a very wide range there is little difference between search algorithms in the ratio of MS2 to protein false discovery rates. The figure highlights the importance of strictly controlling the MS2 FDR when reporting results as small changes in this value have a large impact on protein FDR. For example, a 0.1% MS2 FDR results in protein FDRs of about 1% for all algorithms. A 1% MS2 FDR increases the protein FDR to 8–11%. A 5% MS2 FDR yields a 35–40% protein FDR, and a 10% MS2 FDR results in protein FDRs in excess of 50%, meaning that half the proteins reported are likely false positives. This has been shown previously in the context of different Sequest Xcorr thresholds (27).
13 and 18%, respectively, from the values obtained using the cross-correlation-based cutoffs at a 1% MS2 FDR. However, protein identifications increased substantially to over 2,900 proteins in the case of the high confidence settings and over 4,500 proteins for the low confidence settings, increases of ~50 and 125%, respectively. Correspondingly the number of decoy protein identifications increased ~500 and almost 19,000%, respectively. In this case the low confidence thresholds are clearly set too low, significantly and negatively impacting the quality of search results and consequently the reproducibility of the analysis. The high confidence settings are an improvement; however, a protein FDR of 36% will also make meaningful data interpretation difficult and will confound efforts to obtain reproducible identifications across multiple runs. In contrast, the widely used Mascot identity score is a relatively conservative threshold, yielding an MS2 FDR of 0.6% and a protein FDR of 5.0%. Combined with the homology score as an additional threshold Mascot returned even more conservative results with a 0.2% MS2 FDR and 1.9% protein FDR. However, as can be seen from the estimated number of true proteins identified (equal to target proteins – 2 × decoy proteins), the Mascot identity and homology score yielded more estimated true proteins than the Sequest Xcorr-based scoring methods.
In all cases, combining a 1% MS2 FDR with the requirement for two distinct peptides per protein lowered the protein FDR to an effective rate of 0%. Mascot E, OMSSA, and X! Tandem each discovered only one decoy protein with two distinct peptides, whereas Sequest p discovered two. This method enables production of datasets with very high confidence identifications. However, the method also greatly increases the number of false negatives. For example, OMSSA discovered 127 decoy proteins, indicating that twice as many, 254, were predicted in the set to be false positives. Given that OMSSA discovered 899 proteins with a single distinct peptide, requiring two distinct peptides per protein would eliminate 645 putative true identifications, not an insignificant number.
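The arithmetic behind the OMSSA example can be checked with a few lines of Python. The function names are hypothetical; the counts (127 decoy proteins, 899 single-peptide proteins) are those quoted in the paragraph above.

```python
def predicted_false_positives(decoy_proteins):
    """Each decoy protein hit is taken to predict two false positives."""
    return 2 * decoy_proteins


def putative_true_lost(single_peptide_proteins, decoy_proteins):
    """Single-peptide proteins that are probably real but would be
    discarded by a two-distinct-peptide requirement."""
    return single_peptide_proteins - predicted_false_positives(decoy_proteins)


omssa_false = predicted_false_positives(127)   # 2 * 127
omssa_lost = putative_true_lost(899, 127)      # 899 - 2 * 127
```

Here `omssa_false` evaluates to 254 and `omssa_lost` to 645, matching the figures in the text: the two-peptide rule removes nearly all estimated false positives, but at the cost of roughly 2.5 putative true identifications for every false one removed.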
Other measures of data quality are the average number of MS2 hits and distinct peptide identifications per protein (Table II). An increase in either of these measures is interpreted as an increase in the confidence of a protein identification (16). In particular, distinct peptide identifications are valued as they lead to increased confidence in and sequence coverage of the predicted protein. By this measure OMSSA and X! Tandem outperform Sequest and Mascot with respect to MS2 identifications per protein. OMSSA identified 17.8 MS2 events per protein on average, X! Tandem identified 15.0 MS2 events, Mascot E identified 14.7 MS2 events, and Sequest p identified 13.2 MS2 events. The number of MS2 identifications becomes very important when implementing spectral counting-based quantification because the expression levels of each protein are determined by the number of MS2 identifications. The algorithms are much more comparable with regard to the number of distinct peptides identified per protein with X! Tandem and Sequest Xcorr FDR yielding 5.2 distinct peptides per protein on average, Mascot E and OMSSA yielding 5.0, and Sequest p resulting in 4.7. The lower averages returned by the HUPO scoring thresholds are clearly a result of the high protein false positive rates resulting from use of these scoring procedures. As false positive protein identifications occur randomly throughout the sequence library, false positives do not accumulate at any given protein or subset of proteins, resulting in a large number of proteins identified with a single distinct peptide identification. This leads to lower averages of MS2 hits per protein and distinct peptides per protein. As shown in Table I, the number of decoy distinct peptide identifications barely exceeds the number of decoy protein identifications.
10%, are too high. This depends greatly on the intended use of the results. When the results are being used to construct a protein reference library for a sample it is probably best to err on the side of caution and to use strict filtering criteria such as a 1% MS2 FDR coupled with a requirement for two distinct peptides per protein. In cases where the results are being used in conjunction with other, complementary experimental methods, it may be better to use those methods as a filter rather than being too strict at the initial, bioinformatics level, which could lead to a large false negative rate.
| DISCUSSION |
|---|
The open source OMSSA algorithm from NCBI is exceptional in that its performance significantly exceeded that of each of the other algorithms by almost every measure. Furthermore the large number of MS2 hits achieved by OMSSA may facilitate the implementation of spectral counting-based quantification approaches. The proteomics community would benefit were OMSSA scoring to be added as a pluggable score (29) in X! Tandem searches or if OMSSA were to be integrated as a search option into the Institute for Systems Biology's TransProteomic Pipeline project.
Regardless of the search algorithm a group may choose to use, it is important that the community arrive at some common standard for data evaluation and reporting. A target-decoy search strategy is one procedure that, as shown, produces relatively reproducible results from different algorithms for the same input data. Evaluating data at the maximum ratio of MS2 hits per protein is another possibility that would be relatively simple for each of the search algorithm makers to implement in their data viewing and reporting applications. Both of these implementations have the benefit of being sensitive to false positives. That is, the data reported will be impacted by all of the variables leading up to the search from the sample used, to the preparation and analysis methods, to the search parameters entered. This is useful both because false positive rates vary sample to sample and also because some algorithms do not consider all of the search variable parameters in their probability or expectation scoring functions. In this era of high throughput global proteomics studies and cross-platform evaluations it is critical that common data evaluation and reporting procedures are used and that their use is as transparent and standardized as possible.
| FOOTNOTES |
|---|
Published, MCP Papers in Press, May 28, 2007, DOI 10.1074/mcp.M600469-MCP200
1 The abbreviations used are: OMSSA, Open Mass Spectrometry Search Algorithm; CIEF, capillary IEF; ΔCn, Sequest delta correlation number; E, expectation; FDR, false discovery rate; HUPO, Human Proteome Organization; nano-RPLC, nanoflow reversed-phase LC; RSp, Sequest preliminary score ranking; Sp, Sequest preliminary score; Xcorr, Sequest cross-correlation score; TPP, TransProteomic Pipeline; XML, extensible markup language.
* Portions of this work were supported by NCI, National Institutes of Health Grants CA107988 and CA103086 and National Center for Research Resources Grants RR021862 and RR021239. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
S The on-line version of this article (available at http://www.mcponline.org) contains supplemental material.
To whom correspondence should be addressed: Calibrant Biosystems, 910 Clopper Rd., Suite 220N, Gaithersburg, MD 20878. Tel.: 301-977-7900 (ext. 14); Fax: 301-977-7981; E-mail: brian.balgley{at}calibrant.com
| REFERENCES |
|---|