From the Institute of Molecular Systems Biology, ETH Zurich and Faculty of Sciences, University of Zurich, CH-8049 Zurich, Switzerland and Institute for Molecular Systems Biology, Seattle, Washington 98103
| ABSTRACT |
|---|
Historically proteomics analyses have focused on the identification of proteins in the context of a specific experiment typically in a single laboratory. More recently, the need for more global, quantitative, and comparative studies has been recognized, and the value of comparing proteomics data across studies and laboratories has been highlighted especially in biomarker studies affecting different disease sites. However, the meaningful comparison, sharing, and exchange of data or analysis results obtained on different platforms or by different laboratories remain cumbersome mainly due to the lack of standards for data formats, data processing parameters, and data quality assessment. The necessity of an integrated pipeline for processing and analysis of complex proteomics data sets has therefore become critical. Here we briefly describe current art in proteomics data analysis and its integration into a continuous linear pipeline while underscoring current issues and pointing out opportunities for the near future.
| AN OPEN SOURCE PROTEOMICS DATA ANALYSIS PIPELINE |
|---|
| DATA PROCESSING |
|---|
Signal Processing
Instruments are often operated as a black box and not used to the maximum of their performance, and data preprocessing is often performed in default mode. For instance, data quality increases significantly (at the expense of data volume) when data are acquired in profile mode and more elaborate algorithms are subsequently (postacquisition) used to determine signal and noise and derive more accurate measurements. Peak detection (peak picking) is a key and often neglected element of the data analysis. It is usually part of the instrument software, and users have limited control over it. This step is often performed automatically during data acquisition, and the critical parameters are not always explicitly documented. However, high quality data combined with effective preprocessing tools (i.e. algorithms for noise reduction, peak detection, and monoisotopic peak determination) are the basis of a reliable data analysis. The maxim "garbage in-garbage out" retains its full meaning in this context. High quality raw data (e.g. profile mode) together with refined peak detection algorithms allow reliable determination of charge state, monoisotopic m/z values, and signal intensities of peptide ions. In practice, much better results are obtained by collecting the data first and then reprocessing them off line to take full advantage of the capabilities of modern instrumentation, which can drastically improve both identification and quantification. These capabilities include higher sensitivity, high resolution, and high mass accuracy, which should be fully retained and exploited during the downstream data analysis. High quality data combined with advanced data processing tools are critical for deeper insight into proteomics samples in general and into serum or plasma samples in particular.
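As a minimal sketch of the postacquisition peak detection described above, the following Python functions pick local maxima in a profile-mode spectrum that rise above a noise threshold, report an intensity-weighted centroid for each, and infer a charge state from the spacing of isotope peaks. The noise model (median plus a multiple of the median absolute deviation), the three-point centroid, and all function and parameter names are illustrative assumptions, not the algorithm of any instrument software.

```python
from statistics import median

# Approximate mass difference between 13C and 12C, in Da.
ISOTOPE_SPACING = 1.003355


def pick_peaks(mz, intensity, snr=3.0):
    """Return (centroid_mz, apex_intensity) for local maxima above the noise.

    Noise is estimated as median + snr * MAD of all intensities, which is an
    assumption made for this sketch only.
    """
    baseline = median(intensity)
    mad = median([abs(i - baseline) for i in intensity])
    threshold = baseline + snr * max(mad, 1e-9)
    peaks = []
    for k in range(1, len(intensity) - 1):
        is_apex = intensity[k] >= intensity[k - 1] and intensity[k] > intensity[k + 1]
        if intensity[k] > threshold and is_apex:
            # Intensity-weighted centroid over the apex and its two neighbours.
            window_mz = mz[k - 1:k + 2]
            window_int = intensity[k - 1:k + 2]
            centroid = sum(m * i for m, i in zip(window_mz, window_int)) / sum(window_int)
            peaks.append((centroid, intensity[k]))
    return peaks


def charge_from_isotope_spacing(spacing):
    """Infer the charge state from the m/z spacing of adjacent isotope peaks."""
    return round(ISOTOPE_SPACING / spacing)


mz_axis = [500.00 + 0.01 * k for k in range(11)]
signal = [1, 2, 1, 2, 50, 100, 50, 1, 2, 1, 2]
detected = pick_peaks(mz_axis, signal)
```

With centroided data only the apex values survive, which is why acquiring in profile mode leaves room for this kind of reprocessing.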
Data Format
A large variety of instrument platforms (ion traps, quadrupole/time-of-flight, ion cyclotron resonance, time-of-flight/time-of-flight, etc.) from various manufacturers are available for proteomics studies. Each instrument type will generate spectra with its own characteristics (signal-to-noise ratio, resolution, accuracy, etc.) usually in a proprietary data format. Data processing algorithms are not fully documented and usually are restricted to one instrument platform, thus limiting portability to other data processing tools and comparison of results.
The definition of a generic mass spectrometric data format such as mzXML¹ (5) and the Human Proteome Organization's Proteomics Standards Initiative (6) have been first steps toward overcoming this problem. A standardized file format allows data to be analyzed within a pipeline that is independent of the instrument platform. Although conversion into the mzXML format requires additional computing resources and may increase the file size, a generic format broadly accepted by the community, including the manufacturers, will foster the sharing and exchange of data in the future. In this context, the concrete plan to merge the mzXML and mzData formats into a single unified file format is encouraging.
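To illustrate why an open format makes data portable, the sketch below decodes and encodes a peak list the way mzXML stores it in a `peaks` element: base64 text wrapping big-endian (network byte order) floating-point m/z-intensity pairs. It assumes uncompressed data and 32- or 64-bit precision; the function names are hypothetical, and a real reader would take precision and compression from the element's attributes.

```python
import base64
import struct


def decode_peaks(b64_text, precision=32):
    """Decode base64 text into a list of (mz, intensity) pairs."""
    code = "f" if precision == 32 else "d"
    raw = base64.b64decode(b64_text)
    count = len(raw) // struct.calcsize("!" + code)
    # "!" selects network (big-endian) byte order.
    values = struct.unpack("!%d%s" % (count, code), raw)
    return list(zip(values[0::2], values[1::2]))


def encode_peaks(pairs, precision=32):
    """Encode (mz, intensity) pairs into base64 text."""
    code = "f" if precision == 32 else "d"
    flat = [value for pair in pairs for value in pair]
    raw = struct.pack("!%d%s" % (len(flat), code), *flat)
    return base64.b64encode(raw).decode("ascii")


# Values chosen to be exactly representable as 32-bit floats.
spectrum = [(500.25, 1000.0), (750.5, 250.0)]
encoded = encode_peaks(spectrum)
decoded = decode_peaks(encoded)
```

The base64 layer is one reason a converted file can be larger than the proprietary original, as noted above.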
| PEPTIDE IDENTIFICATION AND VALIDATION |
|---|
Once the initial output of the database search engine has been obtained, it is essential that the reliability of the assignments of spectra to peptide sequences is statistically validated. Such analyses generate reliable estimates of the false positive and false negative error rates, values that are critical to meaningfully compare results from multiple experiments or platforms. The PeptideProphet algorithm (9) has been designed to achieve this goal.
The error rates in a data set can also be estimated by performing a search using a "reversed database" (i.e. a database in which the sequences were scrambled to produce only false positive identifications and thus ascertain the false positive error rate (10)). In contrast to more specialized tools, reversed database search results do not estimate the false negative error rate of a data set.
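The reversed database idea can be sketched in a few lines: every hit to a reversed (decoy) sequence is false by construction, so the number of decoy hits above a score threshold estimates the number of false hits among the target matches. The estimator below (decoys divided by targets) is one common convention and an assumption here, as are the function names and the example scores.

```python
def false_positive_rate(psms, threshold):
    """Estimate the false positive rate among target hits at a score threshold.

    psms is a list of (score, is_decoy) tuples from target and decoy searches.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


def lowest_threshold_for_rate(psms, max_rate):
    """Return the lowest observed score whose estimated rate is within max_rate."""
    passing = [score for score, _ in psms if false_positive_rate(psms, score) <= max_rate]
    return min(passing) if passing else None


example_psms = [
    (5.0, False), (4.5, False), (4.0, False), (3.5, True),
    (3.0, False), (2.5, True), (2.0, False), (1.5, True),
]
rate_at_3 = false_positive_rate(example_psms, 3.0)
```

Nothing in this calculation says how many correct identifications fell below the threshold, which is the false negative rate that such searches do not provide.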
An alternate strategy consists of storing MS/MS data in a library at an earlier stage in the identification process. Identification then occurs by comparing experimental spectra with those previously measured and stored in a database using a spectra-matching algorithm. Such an approach has proven very effective for several decades in the small molecule area. Stein et al. (11) have explored that route by building a library of consensus peptide spectra (i.e. a set of consistent spectra derived from multiple experimental data sets measured on quadrupole ion traps and quadrupole time-of-flight instruments). An extension of the library together with the ability to produce easily comparable results requires some normalization of the parameters used for the data generation (in particular collision energy). Despite this limitation, identification of already known (observed) peptides is much easier and faster than conventional database searches. With adequate search speed, one could even envision on-the-fly deployment of such a tool on current instruments with fast data acquisition to make data-dependent decisions and exclude ions based on their MS/MS signature rather than a single parameter such as the mass of the precursor.
A spectral matching approach is likely to be less biased by the search engine or the protein database used. The limitations of such an approach are the number and quality of the spectra included in the library, the level of confidence in the peptide sequence assignments (emphasizing the need for a curation mechanism), and the performance of the spectral matching algorithm in minimizing false positive and false negative calls. Such libraries will also be valuable resources for multiple reaction monitoring approaches as they will simplify the selection of the transitions, i.e. precursor ion/fragment ion pairs (see below).
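A spectral matching step can be illustrated with a normalized dot product (cosine similarity) between binned spectra, a scoring scheme widely used for library searching. This is a generic sketch and not the algorithm of the library cited above; the 1-Da bin width, the function names, and the toy library entries are assumptions made for illustration.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum intensities of (mz, intensity) peaks into fixed-width m/z bins."""
    binned = defaultdict(float)
    for mz, inten in peaks:
        binned[int(mz / bin_width)] += inten
    return binned


def cosine_score(spectrum_a, spectrum_b, bin_width=1.0):
    """Normalized dot product between two spectra (0 = disjoint, 1 = identical)."""
    a = bin_spectrum(spectrum_a, bin_width)
    b = bin_spectrum(spectrum_b, bin_width)
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def best_library_match(query, library):
    """Return (name, score) of the library spectrum most similar to the query."""
    return max(((name, cosine_score(query, ref)) for name, ref in library.items()),
               key=lambda item: item[1])


# Hypothetical library entries with made-up peak lists.
toy_library = {
    "entry_1": [(175.1, 30.0), (262.2, 100.0), (375.2, 60.0)],
    "entry_2": [(175.1, 80.0), (304.2, 100.0), (417.3, 40.0)],
}
query_spectrum = [(175.12, 28.0), (262.18, 95.0), (375.25, 65.0)]
match_name, match_score = best_library_match(query_spectrum, toy_library)
```

Because the score depends on relative fragment intensities, spectra acquired at very different collision energies would compare poorly, which is the normalization issue raised above.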
| PROTEIN IDENTIFICATION AND VALIDATION |
|---|
ProteinProphet weights peptides that have a reliable score/probability and from all corresponding proteins derives the simplest list of proteins that explains the observed peptides (13). Obviously proteins with multiple peptide matches have a much greater confidence in their assignment than proteins identified by one single peptide. In fact, there is a massive amplification of the false positive error rate at the protein level compared with the peptide level. This emphasizes the importance of this step and the need for a tool that is able to predict peptide and protein association (13).
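The two ideas in this paragraph, a simplest protein list and the greater confidence of multi-peptide proteins, can be sketched as follows. The greedy set-cover heuristic and the independence-based probability formula are simplified illustrations and should not be read as the ProteinProphet model, which applies additional weighting; all names and example values are hypothetical.

```python
def minimal_protein_list(peptide_to_proteins):
    """Greedily choose proteins until every observed peptide is explained."""
    protein_to_peptides = {}
    for peptide, proteins in peptide_to_proteins.items():
        for protein in proteins:
            protein_to_peptides.setdefault(protein, set()).add(peptide)
    uncovered = set(peptide_to_proteins)
    chosen = []
    while uncovered:
        # Sorting first makes tie-breaking deterministic (alphabetical).
        best = max(sorted(protein_to_peptides),
                   key=lambda p: len(protein_to_peptides[p] & uncovered),
                   default=None)
        if best is None or not protein_to_peptides[best] & uncovered:
            break  # remaining peptides map to no protein
        chosen.append(best)
        uncovered -= protein_to_peptides[best]
    return chosen


def protein_probability(peptide_probabilities):
    """Probability that at least one peptide assignment is correct,
    assuming the assignments are independent."""
    all_wrong = 1.0
    for p in peptide_probabilities:
        all_wrong *= 1.0 - p
    return 1.0 - all_wrong


observed = {
    "pepA": ["P1"],
    "pepB": ["P1", "P2"],
    "pepC": ["P3"],
    "pepD": ["P2", "P1"],
}
protein_list = minimal_protein_list(observed)
```

Here P2 is dropped because every peptide that could support it is already explained by P1, and a protein backed by two peptides with probabilities 0.9 and 0.8 scores higher than either peptide alone.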
| QUANTIFICATION |
|---|
Basically two main approaches have been applied. The first is based on stable isotope labeling and requires derivatization of the peptides from the various sample sets with reagents that differ in isotopic composition. The resulting products are then pooled together and analyzed in one single LC/MS/MS experiment. The relative quantity of a specific analyte is then determined from the relative intensities of its differently labeled signals in the full spectrum. Subsequently the analyte in question is identified by database searching of the corresponding MS/MS signal.
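The ratio calculation for labeled pairs can be sketched as below. Summarizing a protein by the median of its peptide log2 ratios is an assumption chosen for robustness to outliers, not a method prescribed in the text, and the function names and intensities are illustrative.

```python
import math
from statistics import median


def log2_ratio(light_intensity, heavy_intensity):
    """log2 of the light/heavy signal ratio for one labeled peptide pair."""
    return math.log2(light_intensity / heavy_intensity)


def protein_log2_ratio(pairs):
    """Median log2 ratio over the (light, heavy) pairs assigned to a protein.

    Pairs with a missing (zero) partner are skipped.
    """
    ratios = [log2_ratio(light, heavy) for light, heavy in pairs if light > 0 and heavy > 0]
    return median(ratios) if ratios else None


# Made-up intensities for three peptide pairs of one protein.
pairs_for_protein = [(200.0, 100.0), (400.0, 100.0), (100.0, 100.0)]
summary_ratio = protein_log2_ratio(pairs_for_protein)
```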
The second approach, which is more relevant to larger biomarker studies (i.e. analysis of larger sets of samples from normal (control) and disease (or treated) patients), analyzes each sample individually and then compares the multiple LC/MS runs. Performing data acquisition under rigorously controlled conditions and in an unbiased manner is essential for this method. The processing of the multiple data sets raises several important issues, including control of instrumental drifts (mass calibration and elution times) over longer periods of time, correction for shifts in elution times, and normalization of ion abundances to adjust for variation in sample amounts or instrument (ionization or detection) performance. None of these are trivial to solve. In essence, a series of LC/MS patterns are acquired, peaks (or more precisely peak clusters) are detected, and data are merged together. The main step consists of matching the ions observed (i.e. within specific m/z and retention time tolerances) across all experiments. It is a critical task as the high density of features within a window might result in mismatches that jeopardize any downstream analysis. Thus high quality data (i.e. high mass accuracy and reproducible elution times) are critical to this process. Poor matching is typically reflected by a high number of "isolated" features, which are observed in only one or a few experiments. If features are properly matched, the quantitative analysis can be performed in a relatively straightforward manner; a number of tools have been described to perform such analyses (14, 15).
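The matching step can be sketched as a search for the closest feature within m/z and retention time tolerances, followed by a simple abundance normalization. The greedy one-to-one strategy, the default tolerances (10 ppm, 30 s), the median-ratio normalization, and all names are assumptions made for illustration; published tools use more elaborate alignment.

```python
from statistics import median


def match_features(run_a, run_b, ppm_tol=10.0, rt_tol=30.0):
    """Greedy one-to-one matching of (mz, rt_seconds, intensity) features.

    Returns a list of (index_in_a, index_in_b) pairs.
    """
    matches, used = [], set()
    for i, (mz_a, rt_a, _) in enumerate(run_a):
        best, best_cost = None, None
        for j, (mz_b, rt_b, _) in enumerate(run_b):
            if j in used:
                continue
            ppm = abs(mz_a - mz_b) / mz_a * 1e6
            delta_rt = abs(rt_a - rt_b)
            if ppm <= ppm_tol and delta_rt <= rt_tol:
                # Combine both deviations, each scaled by its tolerance.
                cost = ppm / ppm_tol + delta_rt / rt_tol
                if best is None or cost < best_cost:
                    best, best_cost = j, cost
        if best is not None:
            used.add(best)
            matches.append((i, best))
    return matches


def isolated_features(run_a, matches):
    """Indices of features in run_a that found no partner."""
    matched = {i for i, _ in matches}
    return [i for i in range(len(run_a)) if i not in matched]


def normalization_factor(run_a, run_b, matches):
    """Median intensity ratio (b over a) across matched features."""
    return median([run_b[j][2] / run_a[i][2] for i, j in matches])


run_1 = [(500.2500, 1200.0, 1e5), (650.3300, 1500.0, 2e5), (720.4000, 1800.0, 5e4)]
run_2 = [(500.2520, 1210.0, 2e5), (650.3310, 1495.0, 4e5), (900.5000, 2000.0, 1e5)]
pairs = match_features(run_1, run_2)
```

Tightening `ppm_tol` is only possible with high mass accuracy data, which is how data quality translates directly into fewer mismatches in dense regions.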
| DATA REPOSITORIES |
|---|
Results were usually reported as a set of identified proteins (i.e. list of peptides identified and associated proteins) with minimal supporting data. Obviously the large volume of such data sets has made publication of detailed results using classical mechanisms very challenging. Sharing and exchange of data and results requires the definition of standard formats for the data at all levels (including raw mass spectrometric data, processed data, and search results) as well as a better definition (and/or standardization) of the parameters used for the data processing or the database searches. In this way, a broad range of data and results generated by a variety of tools can be captured into widely accessible repositories, including results of database search engines, de novo sequencing algorithms, and statistical validation tools (e.g. PeptideProphet and ProteinProphet). In all cases metadata describing the parameters used for the analysis are essential and have to be included. Additional data, including annotation and clinical information about the samples, ought to be incorporated in the database as well. Data and results repositories such as PeptideAtlas (Institute for Systems Biology) (16), Global Proteome Machine Database (Beavis Informatics) (17), or Proteomics Identifications (PRIDE) Database (European Bioinformatics Institute) (18) facilitate exchange of results and information. Initially limited to peptide sequences and proteins with a minimum of parameters (i.e. retention/elution time, mass, charge state, and signal intensity) these databases are rapidly expanding to include actual spectra, links to the associated proteins, and genomics data.
One can envision expanding the repertoire of parameters captured as long as robust normalization methods are defined. For instance, elution times have limited value unless they are translated into normalized values that can be generalized across the entire database. Defining broad standards that the community can agree to and developing integrated platforms (or at least platforms that can integrate the different modules/tools) are currently two of the main challenges in integrating results from multiple platforms and laboratories. Despite major advances toward the standardization and sharing of data already achieved, issues regarding the protein annotation and accession numbers remain to be addressed.
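One simple way to translate elution times into normalized values is a linear mapping fitted on landmark peptides that have known positions on a reference scale. The sketch below uses an ordinary least-squares line; the linear model, the 0 to 100 reference scale, and the names are assumptions, and real gradients may call for a nonlinear mapping.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def normalize_elution_time(observed, slope, intercept):
    """Map an observed elution time onto the reference scale."""
    return slope * observed + intercept


# Hypothetical landmark peptides: observed seconds vs. reference scale units.
landmark_observed = [600.0, 1200.0, 1800.0]
landmark_reference = [0.0, 50.0, 100.0]
slope, intercept = fit_line(landmark_observed, landmark_reference)
normalized = normalize_elution_time(1500.0, slope, intercept)
```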
| NEW STRATEGIES |
|---|
Proteomics-integrated data analysis pipelines were initially designed to automatically identify peptides and proteins in large data sets. More recently, the interest in important protein functional information has prompted the development of analytical strategies that also focus on the detection and identification of post-translational modifications (i.e. nature of the modification and its localization on the peptidic backbone). Currently most approaches rely on extended database searches that incorporate predefined modifications in the search space. More refined approaches based on de novo sequencing or integrating other experimental designs (neutral loss scans and MS3 experiments) are gaining importance. Sample preparation to specifically isolate the peptides of interest is also usually a critical step in the analysis of post-translational modifications. It will be critical to expand the current proteomics data repositories and databases to also include high confidence information on modified peptides.
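The cost of adding predefined modifications to the search space can be made concrete by enumerating the candidate forms of a single peptide. The sketch below considers phosphorylation (+79.96633 Da, monoisotopic) as a variable modification on S, T, and Y; the function name, the `max_mods` cap, and the example sequence are illustrative assumptions.

```python
from itertools import combinations

PHOSPHO_MASS = 79.96633  # monoisotopic mass shift of phosphorylation, in Da


def modified_forms(peptide, residues="STY", max_mods=None):
    """List (modified_positions, total_mass_shift) for every candidate form."""
    sites = [i for i, aa in enumerate(peptide) if aa in residues]
    limit = len(sites) if max_mods is None else min(max_mods, len(sites))
    forms = []
    for k in range(limit + 1):
        for positions in combinations(sites, k):
            forms.append((positions, k * PHOSPHO_MASS))
    return forms


all_forms = modified_forms("ASPTYK")
capped_forms = modified_forms("ASPTYK", max_mods=1)
```

With three candidate sites the single sequence expands to eight forms, and the growth is exponential in the number of sites, which is why search engines usually cap the number of modifications per peptide.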
To overcome some of the limitations of current proteomics strategies in regard to the dynamic range of peptides detected and the undersampling of MS/MS spectra that restrict the ability to comprehensively analyze proteomes, alternative mass spectrometry-based approaches are being explored. Conventional data-dependent acquisition is limited by the need to detect a signal in full scan mode to trigger a product ion spectrum. In contrast, targeted strategies exemplified by multiple reaction monitoring (MRM) detect, quantify, and possibly collect a product ion spectrum to confirm the identity of a peptide with much greater sensitivity because they do not depend on detection of the precursor ion in the full mass spectrum. On a triple quadrupole instrument (or hybrid quadrupole/linear ion trap instrument), the data are acquired by setting both mass analyzers to predefined m/z values corresponding to the multiply protonated ion and one specific fragment ion of the peptide of interest. The two-level mass filtering drastically increases the selectivity while the non-scanning nature of this experiment accounts for an increased sensitivity. It thus allows detection and quantification of low abundance analytes in complex biological matrices. MRM experiments differ from conventional approaches in that they are hypothesis-driven (i.e. screening for known or putative entities) and are primarily quantitative experiments or possibly confirmatory of identity through matching to already known information (e.g. elution time or observed or predicted MS/MS spectra).
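The predefined m/z values for the two mass analyzers can be computed from the peptide sequence alone. The sketch below derives the precursor m/z for a chosen charge state and the singly charged y-ion series from standard monoisotopic residue masses, then pairs them as candidate transitions. It handles unmodified peptides only, and the function names and the choice of y ions as fragments are assumptions for illustration.

```python
# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def peptide_mass(sequence):
    """Monoisotopic mass of the neutral, unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def precursor_mz(sequence, charge):
    """m/z of the multiply protonated peptide ion."""
    return (peptide_mass(sequence) + charge * PROTON) / charge


def y_ion_mz(sequence, n):
    """m/z of the singly charged y ion containing the last n residues."""
    return sum(RESIDUE_MASS[aa] for aa in sequence[-n:]) + WATER + PROTON


def candidate_transitions(sequence, charge=2):
    """All (precursor m/z, y-ion m/z) pairs for one peptide."""
    q1 = precursor_mz(sequence, charge)
    return [(q1, y_ion_mz(sequence, n)) for n in range(1, len(sequence))]


transitions = candidate_transitions("PEPTIDE")
```

In practice the fragments to monitor would be chosen from observed spectra, for example from a spectral library, rather than taken as the full theoretical series.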
Multiple ion monitoring has been demonstrated to be a valuable approach for quantifying peptides with high sensitivity and selectivity in complex mixtures such as tryptic digests of plasma samples (19) or a specific subproteome, e.g. glycopeptides (20). Recent advances in instrument control and data acquisition software now make it possible to analyze a larger number of peptides (>500 transitions) in a single LC/MS run (17). This truly opens new paths toward in-depth analysis of proteomes at dramatically reduced redundancy using a hypothesis-driven approach as illustrated in Fig. 2. The strategy integrates high sensitivity mass spectrometric measurements in the MRM mode with the proteomics knowledge already acquired and available in databases (e.g. PeptideAtlas).
| CONCLUSION |
|---|
The data processing and analysis bottleneck can nowadays be overcome through integration of the entire suite of tools into one linear pipeline. This allows processing of data from different instrument platforms in a reliable way through the different steps while maintaining consistency of results. It eases comparison between multiple platforms or laboratories. The organization, annotation, and sharing of data in public repositories will greatly facilitate the data exchange within the community. It should also prevent the replication of experiments that have already been carried out. In this way, ambitious programs such as the creation of a full proteome map of tissues or cell types will be achieved more readily.
The focus of this account has been on the integration of proteomics data. It thus represents the first step for the incorporation of the results in a more global, systems biology framework that includes data from various platforms, including genomics, metabolomics, and physiology. New quantitative proteomics strategies that leverage recent mass spectrometry technologies combined with advanced data analysis tools are anticipated to play a crucial role in the global analysis of biological systems.
| FOOTNOTES |
|---|
Published, MCP Papers in Press, August 9, 2006, DOI 10.1074/mcp.R600012-MCP200
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
1 The abbreviations used are: XML, extensible markup language; MRM, multiple reaction monitoring.
* This study was supported in part with federal funds from the NHLBI, National Institutes of Health, under Contract N01-HV-28179.
To whom correspondence should be addressed: ETH Zurich, Hoenggerberg HPT E73, CH-8093 Zurich, Switzerland. Tel.: 41-44-633-2088; Fax: 41-44-633-1051; E-mail: domon{at}imsb.biol.ethz.ch