Molecular & Cellular Proteomics 7:2364-2372, 2008.
| ABSTRACT |
|---|
As a first and crucial step of data interpretation, coupling of a fragment ion spectrum to a peptide sequence has attracted much effort aimed at optimizing this process. A review of the variety of methods and tools available for this purpose was published recently (3). The most commonly applied method is based on sequence database searching by database search engines such as SEQUEST (4), Mascot (5), X!Tandem (6), Virtual Expert Mass Spectrometrist (7), or Open Mass Spectrometry Search Algorithm (8). The overall concept behind these algorithms is similar and consists of the generation of theoretical fragment ion spectra from sequence database entries against which experimental fragment ion spectra are matched. The difference between the algorithms is usually found in the spectral comparison method and scoring scheme (9). The most difficult part of this analysis is not necessarily finding the best match from the sequence database but finding out whether this best match is actually valid. Indeed an experimental spectrum cannot always be compared with the actual theoretical spectrum of its original precursor because this precursor may be absent from the database or because the precursor peptide carried one or more unanticipated modifications. Even so, this experimental spectrum may still be matched with a considerable score to a theoretical fragmentation spectrum derived from an unrelated precursor. To filter out such background matches, several search engines include probability-based scoring algorithms (9) in which the score of a proposed peptide identification can be compared against a threshold score for a given confidence level. In addition, postprocessing tools have been developed that analyze the detailed output of a search engine to obtain a revised score that should further optimize sensitivity and specificity (10–13). Typically such algorithms rely on certain assumptions about the identifications to model true positive and true negative score distributions. 
PeptideProphet (13) for example relies on a mixture model approach that models SEQUEST score distributions according to fixed assumptions such as tryptic correctness of the identified peptides. Ultimately a revised probabilistic score is calculated that should allow discrimination between true and false positives with increased accuracy.
In certain cases, however, only a subset of all peptide identifications obtained are of relevance to the biological system under study. In these cases, expert manual validation of the identifications is a more commonplace strategy for quality control. Protein modification studies for example often find biological relevance in a small subset of all experimentally obtained data (14, 15). In addition, the so-called "single hit wonders", which often populate the majority of identified peptides or proteins in gel-free proteomics, should not be simply discarded but must be treated intelligently as they potentially contain valuable biological information (16, 17). The manual validation required to assure the reliability of the biological conclusions drawn from such peptide identifications can be performed by using the visualization tools included with the search engine or by specialized applications such as CHOMPER (18), DTASelect (19), or myProMS (20). These tools present a specific set of details on a peptide identification and its associated spectrum for user validation. Finally a semimanual option was recently added to PeptideProphet by allowing the user to enable or disable certain of the modeling assumptions from which the overall score is derived (21).
An important side effect of the evolution of proteomics technologies toward more specialized and targeted approaches (22, 23), however, relates to the corresponding changes in the actual assumptions that can be made about the identifications. These changes effectively introduce new parameters that can be used to further enhance the separation of false and true positives, yet are necessarily largely ignored by tools built upon fixed, generalized assumptions. To allow this expanding array of technologies and associated identification parameters to be used effectively in the postprocessing and validation of proteomics data, here we present the Peptizer tool. Built upon a dynamic profiling framework that operates on pluggable assumptions, Peptizer can be quickly and efficiently configured with any a priori knowledge that is available to the user. Each assumption or parameter is coded in an autonomous agent, which is allowed to cast a vote on each peptide identification. In a second layer, the votes of these agents are aggregated using a pluggable algorithm, which outputs a final score that is used to judge whether an identification represents a potential false positive. We show that elimination of these suspicious identifications increases specificity albeit at the cost of a limited loss in sensitivity through removal of certain true positives. A sophisticated and highly efficient manual validation interface is also included that can be used to compensate in part for this loss in sensitivity.
| EXPERIMENTAL PROCEDURES |
|---|
α-amino-blocked peptides in the strong cation exchange non-binding fraction. The sample was then acidified to oxidize methionines before the primary N-terminal COFRADIC1 separation (25). Fractions 4 min wide were collected and treated with 2,4,6-trinitrobenzenesulfonic acid. Such modified primary fractions were then loaded for the secondary COFRADIC run wherein the α-amino-blocked peptides, which show no altered chromatographic properties, are collected. The secondary fractions were analyzed by LC-MS/MS using a microfluidic interface (Agilent Chip Cube) on an Agilent XCT-Ultra ion trap mass spectrometer operated as described previously (26).
Peptide Identification, False Positive Estimations, and Peptizer Development—
The MS/MS spectra were searched by Mascot version 2.2 against the human subset of the UniProtKB/Swiss-Prot sequence database, release 53.2 (June 26, 2007), concatenated with a shuffled version of this database generated by DBToolkit (27). The following parameters were used in the Mascot searches: peptide mass tolerance and peptide fragment tolerance were set at ±0.5 Da, and allowed precursor charges were set to 1+, 2+, and 3+. Fixed modifications were oxidation of methionine to its sulfoxide derivative, trideuteroacetylation of lysine, and carbamidomethylation of cysteine. Pyroglutamate formation (N-terminal Gln), pyrocarbamidomethylcysteine formation (N-terminal carbamidomethylated cysteines), acetylation and trideuteroacetylation of the α-N terminus, and deamidation (Gln and Asn) were considered as variable modifications. Endoproteinase Arg-C/P was set as the proteolytic enzyme, and at most one missed cleavage was allowed. The Mascot instrument setting parameter was set to ESI-TRAP. Only MS/MS spectra receiving an ion score equal to or exceeding the Mascot identity threshold score at the 95% confidence level were retained for further inspection by Peptizer. All experimental fragmentation spectra (32,403), peptide identifications (2,739) made in the "forward" protein database, and corresponding experimental details will be made publicly available via the proteomics identifications (PRIDE) database (28) under experiment accession number 3261. To estimate the false positive distribution we performed Mascot searches against a concatenated decoy database as described previously (29).
Peptizer was developed as an open source project under the Apache2 license in Java 1.5. Peptizer relies on Mascotdatfile (30) to process Mascot result files and can also interface with the ms_lims software package (31).
Manual Validation—
Manual validation was performed by an experienced mass spectrometrist. The scientist was blinded to the origin of the peptide identifications (i.e. from the decoy or target set proteins). The scientist was told to apply stringent criteria during the validation. The net effect of the manual validation was obtained by inspecting the unblinded results after completion of the validation.
Peptizer Configuration—
Peptizer was configured to use the agents listed in Table I for detecting potential false positive identifications in this data set. The agent configuration text file, which can be loaded in Peptizer, can be found at the project website. The "best hit" agent aggregator was used to combine the individual agent votes. This aggregator simply summed all votes and marked the peptide identification as suspicious if the sum was equal to or greater than 2 or if an agent with veto rights declined the identification.
| RESULTS |
|---|
Construction of a Peptizer Profile—
The peptide identifications are tested by evaluating a series of user-selectable and extensible properties. The result of this evaluation can be to decline, reserve, or recommend the identification based on that property. The results across all considered properties are then combined in an overall score for identification reliability that can ultimately be used as a filter.
In Peptizer, a property is inspected by an Agent, and the combination of multiple Agent scores is performed by an Aggregator. These two components are shown in Fig. 1 and are discussed in detail in the following sections.
Furthermore apart from being readily included in or excluded from a profile, each Agent can be parameterized as well. The Agent that inspects peptide length, for instance, can be provided with a cutoff length below which to decline an identification. Another example is the Agent that inspects for sequence coverage by b-ions, which also takes a threshold level of coverage below which identifications are declined by the Agent. As a final example, consider the Agent that inspects identifications for missed cleavages; in this case, both the cleavage specificity of the protease as well as the number of tolerated missed cleavages are Agent parameters. Cleavage specificity is therefore easily adapted when evaluating data from an experimental protocol that uses a different protease.
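The parameterized Agents described above can be pictured with a minimal Java sketch. It is an illustration only and not Peptizer's actual interface: the names `Vote`, `Agent`, `LengthAgent`, and `MissedCleavageAgent` are hypothetical, and the numeric weights attached to the three votes (decline = +1, reserve = 0, recommend = -1) are an assumption chosen so that declining votes add up toward a "suspicious" score.

```java
final class AgentSketch {
    /** Three-valued vote; the numeric weights are an assumption of this sketch. */
    enum Vote {
        DECLINE(1), RESERVE(0), RECOMMEND(-1);
        final int score;
        Vote(int score) { this.score = score; }
    }

    /** Hypothetical contract: inspect one property of an identified peptide. */
    interface Agent {
        Vote inspect(String peptideSequence);
    }

    /** Declines peptides shorter than a configurable cutoff length. */
    static final class LengthAgent implements Agent {
        private final int minLength;
        LengthAgent(int minLength) { this.minLength = minLength; }
        public Vote inspect(String peptideSequence) {
            return peptideSequence.length() < minLength ? Vote.DECLINE : Vote.RESERVE;
        }
    }

    /** Declines peptides with more missed cleavages than tolerated. */
    static final class MissedCleavageAgent implements Agent {
        private final String cleavageResidues; // e.g. "R" for Arg-C, "KR" for trypsin
        private final int maxMissed;
        MissedCleavageAgent(String cleavageResidues, int maxMissed) {
            this.cleavageResidues = cleavageResidues;
            this.maxMissed = maxMissed;
        }
        public Vote inspect(String peptideSequence) {
            int missed = 0;
            // A cleavable residue anywhere but the C terminus is an uncleaved site.
            for (int i = 0; i < peptideSequence.length() - 1; i++) {
                if (cleavageResidues.indexOf(peptideSequence.charAt(i)) >= 0) missed++;
            }
            return missed > maxMissed ? Vote.DECLINE : Vote.RESERVE;
        }
    }

    public static void main(String[] args) {
        Agent length = new LengthAgent(8);
        Agent cleavage = new MissedCleavageAgent("R", 1);
        System.out.println(length.inspect("ACDEK"));       // DECLINE (5 residues < 8)
        System.out.println(cleavage.inspect("AARGGRLLR")); // DECLINE (2 internal R)
        System.out.println(cleavage.inspect("AAGGRLLR"));  // RESERVE (1 internal R)
    }
}
```

Switching protease in this sketch only means passing a different residue string to the constructor, which mirrors the point made above about cleavage specificity being an Agent parameter. The sketch deliberately ignores refinements such as proline-blocked cleavage.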
Aggregators for Combining Agent Votes into an Overall Score—
As outlined above, all peptide identifications are inspected by a voting panel composed of user-selected Agents that each decline, reserve, or recommend an identification by casting a vote. These individual votes must then be aggregated into an overall score for the identification on which recommendation or rejection is ultimately based (see Fig. 1). A first method in which Agent votes can be combined is by simple summation of the Agent scores. If the end result is above a preset threshold (e.g. 0), the identification is rejected. A more pessimistic approach counts only the number of Agents that decline the peptide identification. If that number is higher than a preset cutoff, the peptide identification is considered bad. Obviously an Aggregator can also be much more sophisticated than these simple examples, utilizing a support vector machine, neural network, or other learning algorithm for instance. Interestingly Peptizer also supports pluggable Aggregators, thus allowing complete flexibility at both the Agent and Aggregator level. It is worth noting that the Peptizer framework can therefore provide an extremely convenient infrastructure basis for the development and implementation of novel computational strategies for discovering false positive identification profiles.
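The two simple aggregation strategies described above can be sketched as follows. The class and method names are hypothetical rather than Peptizer's API, the vote weights (decline = +1, reserve = 0, recommend = -1) are an assumption, and the "equal to or greater than" comparison follows the configuration reported under "Experimental Procedures".

```java
import java.util.Arrays;
import java.util.List;

final class AggregatorSketch {
    /** Assumed weights: a declining vote counts toward suspicion. */
    enum Vote {
        DECLINE(1), RESERVE(0), RECOMMEND(-1);
        final int score;
        Vote(int score) { this.score = score; }
    }

    /** One Agent's vote plus whether that Agent holds veto rights. */
    static final class Ballot {
        final Vote vote;
        final boolean hasVeto;
        Ballot(Vote vote, boolean hasVeto) { this.vote = vote; this.hasVeto = hasVeto; }
    }

    interface Aggregator {
        boolean isSuspicious(List<Ballot> ballots);
    }

    /** Sums all vote weights; a declining Agent with veto rights decides alone. */
    static final class SumAggregator implements Aggregator {
        private final int threshold;
        SumAggregator(int threshold) { this.threshold = threshold; }
        public boolean isSuspicious(List<Ballot> ballots) {
            int sum = 0;
            for (Ballot b : ballots) {
                if (b.hasVeto && b.vote == Vote.DECLINE) return true;
                sum += b.vote.score;
            }
            return sum >= threshold;
        }
    }

    /** Pessimistic variant: only declining votes count, recommendations cannot offset them. */
    static final class DeclineCountAggregator implements Aggregator {
        private final int minDeclines;
        DeclineCountAggregator(int minDeclines) { this.minDeclines = minDeclines; }
        public boolean isSuspicious(List<Ballot> ballots) {
            int declines = 0;
            for (Ballot b : ballots) {
                if (b.vote == Vote.DECLINE) declines++;
            }
            return declines >= minDeclines;
        }
    }

    public static void main(String[] args) {
        List<Ballot> ballots = Arrays.asList(
                new Ballot(Vote.DECLINE, false),
                new Ballot(Vote.DECLINE, false),
                new Ballot(Vote.RECOMMEND, false));
        System.out.println(new SumAggregator(2).isSuspicious(ballots));          // false: sum is 1
        System.out.println(new DeclineCountAggregator(2).isSuspicious(ballots)); // true: 2 declines
    }
}
```

The example panel shows why the counting variant is the more pessimistic one: a recommendation lowers the summed score below the threshold but leaves the number of declining Agents unchanged.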
Availability of Peptizer and Providing Extensions to the Framework—
Peptizer is released as open source under the Apache2 software license, and binaries as well as source code can be downloaded. Although it is made freely available under a permissive license, the source code is not required to build extensions to Peptizer, nor is a recompilation of the application necessary to include novel Agents or Aggregators. A typical Agent is only about 20 lines of code, whereas a typical, simple Aggregator is about twice that size. Peptizer loads its Agents and Aggregators from a simple eXtensible Markup Language (XML)-based configuration file upon application start-up, so simply adding a newly developed Agent into this configuration file will make it available for inclusion in the voting panel of the application, and the same holds true for Aggregators. The effort required to provide Peptizer with new Agents or Aggregators is thus minimized by design, allowing rapid adoption of novel experimental methodologies and their corresponding a priori information through custom-developed Agents and Aggregators.
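One common way to make components available without recompiling an application is to list their class names in a configuration file and instantiate them by reflection at start-up. The sketch below illustrates that general technique under stated assumptions: the XML element names `config` and `agent`, the `Agent` interface, and the `LengthAgent` class are all hypothetical, and the actual layout of Peptizer's configuration file is not given in the text.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

final class PluginSketch {
    /** Hypothetical plug-in contract. */
    interface Agent {
        String name();
    }

    /** Hypothetical plug-in; it only needs a no-argument constructor. */
    static final class LengthAgent implements Agent {
        public String name() { return "Peptide length"; }
    }

    /** Reads every <agent> element (assumed layout) and instantiates the named class. */
    static List<Agent> loadAgents(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("agent");
            List<Agent> agents = new ArrayList<Agent>();
            for (int i = 0; i < nodes.getLength(); i++) {
                String className = nodes.item(i).getTextContent().trim();
                // Reflection: the application never references the class at compile time.
                Object instance = Class.forName(className).getDeclaredConstructor().newInstance();
                agents.add((Agent) instance);
            }
            return agents;
        } catch (Exception e) {
            throw new IllegalStateException("Could not load agents from configuration", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<config><agent>" + LengthAgent.class.getName() + "</agent></config>";
        for (Agent agent : loadAgents(xml)) {
            System.out.println(agent.name()); // Peptide length
        }
    }
}
```

With this pattern, adding a component amounts to placing its compiled class on the classpath and adding one line to the configuration file, which is consistent with the statement above that no recompilation of the application is needed.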
Although Peptizer currently only accepts Mascot ".dat" result files as input, the source of peptide identifications can also be modified. However, to extend the reach of Peptizer to other search engine output files, a basic understanding of programming in Java is required as parsing of these more complex files can be more involved. All these extensions to Peptizer can be achieved by implementing well documented interfaces, thus providing a clean and efficient develop-by-contract approach.
Operation Modes of Peptizer and the Manual Validation Interface—
Peptizer can be used in one of two modes: fully automatic command line execution, or semiautomatic operation by means of a user-friendly graphical user interface (GUI). Both modes address a distinct group of users: although the average user will work most comfortably in GUI mode, more experienced users will benefit from the automated and scriptable command line execution. An important difference between the two modes is that, in automatic mode, all suspicious identifications will be considered incorrect, whereas the GUI mode will simply flag these for further manual validation. The GUI mode thus effectively uses the user as the final arbiter, whereas the command line mode does not include this final evaluation step.
The Peptizer GUI is designed for optimal efficiency as it guides the user through the process of choosing a data source, creating an Agent profile, and choosing an Aggregator as shown in Fig. 2. The top panel takes the source of the peptide identifications, and the center panel is subsequently used to construct the voting panel. Note that Agent parameterization as well as assignment of veto privileges is also taken care of at this stage. The lower panel presents the available Aggregators to the user, and the bottom panel can be used to define the confidence level below which identifications will not even be considered. When the profile configuration is complete, a new task can be started by clicking the appropriate button.
Upon submitting a task, the software starts to analyze each proposed identification using the user-configured Agent profile and Aggregator and then forwards the results to the manual validation application shown in Fig. 3. The screen is divided into three major parts: a tree with spectra and identifications on the left, the identification detail view on the right and in the center, and a status panel at the very bottom. Each of these parts can be resized or even collapsed according to the needs of the user. The tree structure fulfills several functions. First, it provides an overview of the work done by color-coding identifications based on their status (unresolved, user-declined, or user-accepted). Second, it also allows the user to quickly browse the entire set of suspect identifications. Third, it can be filtered to reveal specific subsets of these suspect identifications. Each tree node holds a single fragmentation spectrum with all of its suggested confident peptide identifications, the number of which is indicated between parentheses after the spectrum number. Unfolding the tree node shows these confidently assigned peptide sequences. Applying filters to the tree enhances navigation through the peptide identifications, for example by hiding all identifications that have already been validated. By double clicking a node, a new tab is opened in the detailed view on the right. In this view, three different perspectives are given for the user to explore. Topmost is the annotated modified peptide sequence, consisting of all identified b- and y-ions annotated as bars on the sequence. The height of the bars indicates the intensity of the corresponding peaks in the spectrum relative to the most intense identified fragment ion. The middle section of the detailed view sports an interactive display of the annotated fragmentation spectrum, whereas the bottom of the view is taken up by a table. 
The columns in this table correspond to the significant peptide hits obtained for this spectrum (three in the example given in Fig. 3) apart from the leftmost column that always serves as a legend. By default, the most confident peptide identification is selected when a new tab is opened, but the user can modify this selection by clicking on another column in the table (column selection is indicated by a darker color tone). When the selection changes, the experimental fragmentation spectrum shown in the center of the screen is updated with the fragment ion annotations of the selected peptide. Additionally the annotated sequence in the top panel of the detailed view is also adapted to the newly selected peptide.
After careful inspection of a spectrum and its peptide identifications, the user may either choose to accept or to reject an identification by clicking on the corresponding buttons in the lower right corner (see Fig. 3). The red "STOP" icon rejects an identification, whereas the green "OK" icon accepts an identification. Note that accepting one peptide candidate when multiple peptide candidates are given for a spectrum automatically rejects these other possibilities. For each of these icons, an alternative is given that takes a validation comment to go with the decision (see dialog on Fig. 3). Once the decision is communicated by a click on the appropriate button, the application will automatically close the freshly validated tab and open up the next available, unresolved identification.
The set of peptide identifications can be saved to the hard drive at any time and can be reloaded in another session, enabling discontinuous manual validation. Validation data in this form can also be distributed to other users or archived as training material.
The end result of the manual validation can be saved into delimited text files, allowing the user to choose the table data that are included as well as optionally including the confident peptide identifications that were recommended by the voting panel (and therefore automatically catalogued as good). Moreover Peptizer can also output its data to a file format that is directly readable by the open source Weka machine learning library (32) for further analysis.
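Weka reads the plain-text ARFF format, which consists of a `@relation` line, one `@attribute` declaration per column, and comma-separated rows after `@data`. The sketch below writes validation results in that layout. The choice of columns (peptide length, ion score, validation outcome) and all Java names are assumptions made for illustration; the text does not specify which columns Peptizer exports.

```java
import java.util.Arrays;
import java.util.List;

final class ArffSketch {
    /** Hypothetical row: two numeric features plus the validation outcome. */
    static final class Row {
        final int peptideLength;
        final double ionScore;
        final boolean accepted;
        Row(int peptideLength, double ionScore, boolean accepted) {
            this.peptideLength = peptideLength;
            this.ionScore = ionScore;
            this.accepted = accepted;
        }
    }

    /** Renders the rows as an ARFF document with a nominal class attribute. */
    static String toArff(String relation, List<Row> rows) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append('\n');
        sb.append("@attribute peptide_length numeric\n");
        sb.append("@attribute ion_score numeric\n");
        sb.append("@attribute outcome {accepted,rejected}\n");
        sb.append("@data\n");
        for (Row r : rows) {
            sb.append(r.peptideLength).append(',')
              .append(r.ionScore).append(',')
              .append(r.accepted ? "accepted" : "rejected").append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(new Row(12, 45.2, true), new Row(6, 31.0, false));
        System.out.print(toArff("validation", rows));
    }
}
```

A file in this form can be loaded into Weka so that the user's accept/reject decisions serve as class labels for training a classifier.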
Comparison between Fully Automatic and Manual Peptizer Validation—
Peptizer is a postprocessing tool aimed at identifying false positive peptide identifications. It has been shown that false positives can be simulated by performing decoy searches with experimental spectra (29). By integrating properties of such decoy-derived false positive matches as well as the experimental knowledge of mass spectrometry scientists, a Peptizer voting panel was configured to select potential false positive identifications (see Table I, which details the agents extracted from the N-terminal COFRADIC data set reported in Ref. 24). A pessimistic Aggregator was chosen and configured to label a peptide identification as suspicious if two or more agents declined the peptide identification.
Because the assignment of an MS/MS spectrum to a peptide sequence is the first and most important step in the identification of proteins and because the interference from protein inference has not yet been introduced at this level (2), we decided to evaluate Peptizer on results obtained at the level of peptide identification. Improving the quality of peptide identifications will in turn affect protein identification because more reliable peptides are instrumental in obtaining reliable protein identifications (33).
We assessed the efficacy of the above Peptizer profile in labeling suspicious peptide identifications by applying it to a blinded set of 2,795 peptide identifications obtained by searching a concatenated normal/decoy database. Of these, 56 peptide identifications were derived from the decoy database, which predicts that the whole set of 2,795 peptide identifications is composed of 112 false positives (about 4%) and 2,683 true positive identifications (calculations based on the work of Elias and Gygi (29)). The detailed results of applying the Peptizer profile to this data set are shown in Table II. In total, 193 peptide identifications were labeled suspicious by Peptizer. Among these, 47 peptide identifications originated from decoy sequences, and we therefore estimate that this selection contains 83.9% of all false positives (or 94 of the expected 112) in this data set, a very considerable enrichment.
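The estimates in this paragraph follow from a simple doubling rule: in a concatenated search against equally sized target and decoy halves, each decoy hit is taken to imply one equally likely false hit among the target sequences. The sketch below reproduces the reported numbers; the class and method names are illustrative only.

```java
import java.util.Locale;

final class DecoyEstimate {
    /** Each decoy hit is assumed to be mirrored by one false hit in the target half. */
    static int estimatedFalsePositives(int decoyHits) {
        return 2 * decoyHits;
    }

    /** Formats part/whole as a percentage with one decimal, independent of locale. */
    static String percent(int part, int whole) {
        return String.format(Locale.ROOT, "%.1f%%", 100.0 * part / whole);
    }

    public static void main(String[] args) {
        int total = 2795;       // all identifications (target + decoy)
        int decoyTotal = 56;    // identifications from decoy sequences
        int flagged = 193;      // identifications labeled suspicious
        int decoyFlagged = 47;  // decoy identifications among the flagged ones

        int fpTotal = estimatedFalsePositives(decoyTotal);     // 112
        int fpFlagged = estimatedFalsePositives(decoyFlagged); // 94
        int tpTotal = total - fpTotal;                         // 2683
        int tpFlagged = flagged - fpFlagged;                   // 99

        System.out.println("False positives overall: " + fpTotal + " (" + percent(fpTotal, total) + ")");
        System.out.println("False positives flagged: " + fpFlagged + " (" + percent(fpFlagged, fpTotal) + ")");
        System.out.println("True positives flagged:  " + tpFlagged + " (" + percent(tpFlagged, tpTotal) + ")");
    }
}
```

The last line gives the price of fully automatic filtering: 99 of the 2,683 estimated true positives, or 3.7%, fall in the suspicious set.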
The loss in sensitivity (removal of 3.7% true positives) in full automatic mode can be partially offset by performing manual validation. Indeed in this semiautomatic mode the user discarded 158 peptide identifications containing 45 of the identifications made in the decoy database. It is thus estimated that this rejected set of identifications contained 90 false positives and 68 true positives. Because the user also accepted two decoy peptide identifications, we could estimate that a total of 31 predicted true positive identifications were accepted by the user, indicating that about 30% of the true positive identifications rejected by Peptizer in full automatic mode were now "rescued" by the user (Table II). However, the user also made mistakes albeit of minor influence: two of the 47 decoy peptide identifications detected by Peptizer slipped through the user's scrutiny, representing a negligible increase in total false positives in the final data set. The time cost for validating these 193 peptide identifications by an experienced user was 2 working days. Despite all efforts at making the manual validation process as efficient as possible, it does incur a certain time cost. The choice between full automatic or semiautomatic mode must therefore be made based on the importance of sensitivity in the actual experiment. Overall, however, usage of Peptizer resulted in a major increase of specificity at a reasonable cost in sensitivity.
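The figures in this paragraph can be retraced step by step, assuming (as in the decoy strategy of Ref. 29) that every decoy identification stands for two false positives in the concatenated result set. FP and TP denote estimated false and true positives.

```latex
\begin{align*}
\text{Automatic mode:}\quad & \mathrm{FP} = 2 \times 47 = 94, \qquad \mathrm{TP} = 193 - 94 = 99, \qquad 99 / 2683 \approx 3.7\% \\
\text{Rejected by user:}\quad & \mathrm{FP} = 2 \times 45 = 90, \qquad \mathrm{TP} = 158 - 90 = 68 \\
\text{Accepted by user:}\quad & 193 - 158 = 35, \qquad \mathrm{FP} = 2 \times 2 = 4, \qquad \mathrm{TP} = 35 - 4 = 31 \\
\text{Rescued true positives:}\quad & 31 / 99 \approx 31\% \\
\text{Net effect of manual mode:}\quad & 90 / 112 \approx 80\% \text{ of FP removed}, \qquad 68 / 2683 \approx 2.5\% \text{ of TP lost}
\end{align*}
```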
| DISCUSSION |
|---|
Ultimately decisions on sensitivity and specificity typically made by bioinformaticians should match requirements set by experimentalists. As such, the quality of the peptide identifications usually is the highest priority, although specific endeavors such as biomarker discovery will also benefit from maximum sensitivity. The extremely configurable nature of Peptizer readily accommodates these varying circumstances through a custom aggregation of voting results.
It is also important to note that this extreme customizability of Peptizer at various levels is what sets it apart from any other existing tool. A statistical evaluation of the validation efficiency of Peptizer compared with other tools was omitted here because determination of the variance specific to the tools was impractical. However, compared with other semimanual postprocessing applications such as CHOMPER (18), DTASelect (19), or myProMS (20), Peptizer stands out by being fully configurable. Although these existing tools may allow the configuration of a fixed set of criteria, Peptizer has no fixed set of criteria. Indeed Peptizer allows any combination of criteria to be used through its fully configurable and extensible Agent profile. Obviously as with the existing applications, once an Agent profile is created in Peptizer, each Agent can be configured in detail through parameters. The configurability of Peptizer goes even further, however, because even the actual score calculation module can be fully configured by the user through pluggable Aggregators. Importantly the versatility of Peptizer is functionally connected to both the full automatic and manual modes of operation. Indeed the information table in the GUI is directly fed by information from the Agents that were selected in the profile, and the nature of each Agent's vote is indicated in the typeface of its detailed report. The manual validation interface thus seamlessly adapts to any user-configured Agent profile even when it includes custom-written Agents contributed by the user. Full automatic mode supports plugging in advanced, custom-built Aggregators that can connect to machine learning libraries (32).
The agents presented in Table I are mainly based on protein chemistry and peptide identification principles. Even if more or other agents were created, e.g. based on peptide fragmentation patterns and rules extracted from large scale studies on MS/MS spectra (34, 35), the results presented here show that false positive identifications are already highly enriched in the identifications selected by applying an appropriate Peptizer profile, thus ensuring substantially increased stringency at only limited cost in sensitivity. Furthermore careful manual validation of the selected subset of peptides using the Peptizer validation GUI has been shown to maintain specificity while providing a large sensitivity bonus compared with full automatic processing.
To our knowledge, we also present the first experimental data on the cost of manual validation, which is often only hinted at in reports. In the rich and user-oriented manual validation environment that Peptizer presents, peptide identifications were validated at a rate of about 100 a day. Additionally instead of having to validate all 2,795 original peptide identifications, only a Peptizer-selected subset of 193 suspicious peptide identifications needed validation. The total cost amounted to 2 working days of validation time, and 80% of the false positives were successfully removed with only 2.5% of the true positives lost. In the context of a complete proteomics experiment, 2 days of validation time should be well within acceptable bounds when optimal identification stringency at high sensitivity is desired. Finally because Peptizer is an open source project and because Agents, Aggregators, and profile configurations can be easily shared and implemented, we hope to establish an active user community at our purpose-built community portal that will continue to enhance the reach and power of the tool by adding Agents and progressively refined Aggregators as well as by expanding its applicable scope to the output of many other search engines available today.
| ACKNOWLEDGMENTS |
|---|
| FOOTNOTES |
|---|
Published, MCP Papers in Press, July 30, 2008, DOI 10.1074/mcp.M800082-MCP200
1 The abbreviations used are: COFRADIC, combined fractional diagonal chromatography; GUI, graphical user interface.
* The work in the laboratory in Ghent was supported by research grants from the Fund for Scientific Research-Flanders (Belgium) (Projects G.0156.05, G.0077.06, and G.0042.07), the Concerted Research Actions (Project BOF07/GOA/012) from the Ghent University, the Interuniversity Attraction Poles (IAP-Phase VI, Research Project P6/28), and the European Union Interaction Proteome (6th Framework Program). The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
S The on-line version of this article (available at http://www.mcponline.org) contains supplemental material.
¶ Supported by a Ph.D. grant from the Institute for the Promotion of Innovation through Science and Technology in Flanders (IWT-Vlaanderen).
Supported by "ProDaC" Grant LSHG-CT-2006-036814 from the European Union.
|| To whom correspondence should be addressed: Dept. of Biochemistry, Faculty of Medicine and Health Sciences, Ghent University, A. Baertsoenkaai 3, B-9000 Ghent, Belgium. Tel.: 32-92649274; Fax: 32-92649496; E-mail: kris.gevaert{at}ugent.be
| REFERENCES |
|---|