MCP Agilent Technologies
HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS
 QUICK SEARCH:   [advanced]


     


Originally published In Press as doi:10.1074/mcp.R500005-MCP200 on March 1, 2005.
This Article
Right arrow Abstract Freely available
Right arrow Full Text (PDF)
Right arrow All Versions of this Article:
R500005-MCP200v1
4/4/419    most recent
Right arrow Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when eLetters are posted
Right arrow Alert me if a correction is posted
Right arrow Citation Map
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow Glossary
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via HighWire
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Listgarten, J.
Right arrow Articles by Emili, A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Listgarten, J.
Right arrow Articles by Emili, A.
Social Bookmarking
 Add to CiteULike   Add to Complore   Add to Connotea   Add to Del.icio.us   Add to Digg   Add to Reddit   Add to Technorati  
What's this?

Molecular & Cellular Proteomics 4:419-434, 2005.
© 2005 by The American Society for Biochemistry and Molecular Biology, Inc.


Review

Statistical and Computational Methods for Comparative Proteomic Profiling Using Liquid Chromatography-Tandem Mass Spectrometry

Jennifer Listgarten{ddagger} and Andrew Emili§,||

From the {ddagger} Department of Computer Science, University of Toronto, Toronto, Ontario M5S 3G4, Canada; § Program in Proteomics and Bioinformatics and Banting and Best Department of Medical Research, University of Toronto, Toronto, Ontario M5G 1L6, Canada


    ABSTRACT
 TOP
 ABSTRACT
 LC-MS-BASED PROTEOMIC PROFILING
 PARADIGMS FOR INFORMATION...
 LC-MS DATA PROCESSING
 LOW-LEVEL...
 SIGNAL FILTERING AND BACKGROUND...
 MID-LEVEL PROCESSING—PEAK...
 DE-ISOTOPING, CHARGE...
 DATASET ALIGNMENT AND COMPARISON
 DATA NORMALIZATION
 DATA TRANSFORMATIONS AND ERROR...
 HIGH-LEVEL PROCESSING
 STATISTICAL ISSUES
 MACHINE LEARNING AS APPLIED...
 EVALUATING CURRENT METHODS
 FUTURE PROSPECTS
 REFERENCES
 
The combined method of LC-MS/MS is increasingly being used to explore differences in the proteomic composition of complex biological systems. The reliability and utility of such comparative protein expression profiling studies is critically dependent on an accurate and rigorous assessment of quantitative changes in the relative abundance of the myriad of proteins typically present in a biological sample such as blood or tissue. In this review, we provide an overview of key statistical and computational issues relevant to bottom-up shotgun global proteomic analysis, with an emphasis on methods that can be applied to improve the dependability of biological inferences drawn from large proteomic datasets. Focusing on a start-to-finish approach, we address the following topics: 1) low-level data processing steps, such as formation of a data matrix, filtering, and baseline subtraction to minimize noise, 2) mid-level processing steps, such as data normalization, alignment in time, peak detection, peak quantification, peak matching, and error models, to facilitate profile comparisons; and, 3) high-level processing steps such as sample classification and biomarker discovery, and related topics such as significance testing, multiple testing, and choice of feature space. We report on approaches that have recently been developed for these steps, discussing their merits and limitations, and propose areas deserving of further research.


With the sequencing of the human genome largely complete and publicly available, emphasis in molecular biology is shifting away from DNA sequencing and related problems toward a systematic evaluation of how the myriad of encoded gene products operate together to mediate the biological mechanisms that sustain life, and how these processes become perturbed in response to disease. Comprehensive systems-wide biological studies have been greatly facilitated by the advent of large-scale genomic, proteomic, and informatic technologies, such as DNA microarrays, ultra-sensitive high-throughput protein MS, and robust statistical and machine-learning methods developed for very large datasets. Evaluation, interpretation, and integration of data produced by these respective platforms represent major ongoing challenges and areas of active research.

The field of expression proteomics seeks to answer the following questions: 1) which proteins and variant isoforms are expressed during the lifecycle of an organism; 2) which post-translational modifications occur in each of these proteins; 3) how do these patterns differ in different cell types and tissues and under different developmental, physiological, and disease conditions; and 4) how can biologists make use of this information to better understand the molecular basis for fundamental biological processes as well as for monitoring the course of disease so as to improve clinical diagnosis and treatment (1\N3). These questions are made all the more difficult by the complexity of most biological systems, which increases exponentially as one goes downstream from DNA sequence to mRNA intermediates to the protein end-products. While it appears there are likely far fewer genes coded for by the human genome than first anticipated, it is estimated that >60% of the ~25,000 putative ORFs encode more than one splice variant (often tens to hundreds), and these in turn are frequently subject to post-translational modification (4, 5). Moreover, because proteins typically function together as components of dynamic multisubunit macromolecular complexes, the final complexity of a biological system is enormous; hence the search for interesting and relevant proteomic patterns remains a challenging task.

Capillary-scale HPLC-MS/MS (LC-MS) is rapidly emerging as a method of choice for large-scale proteomic analysis (1, 2, 6\N12). State-of-the-art LC-MS systems can be used to identify and track the relative abundance of thousands of molecules (6, 7). For standard bottom-up profiling experiments, the molecules in question are peptides derived by proteolysis of intact proteins. For very complex protein samples, such as blood, the peptide mixtures are physically resolved by chromatographic separation prior to injection into the mass spectrometer so as to generate a more-richly informative map, consisting of both the unique elution characteristics (column retention times) as well as m/z (mass-over-charge) ratios of individual peptides. Discrete peptides of interest are subject to collision-induced fragmentation followed by database matching for the purpose of sequence identification, while the recorded pattern of precursor ion peak intensities can be used to infer the relative quantities of the various proteins between samples. Nevertheless, comparisons of large-scale multivariate proteomic datasets are subject to a number of challenging analytical issues, such as experimental noise, systematic variations between experimental runs, the extreme overall range and dynamic nature of protein levels, and the huge number of measured features (e.g. protein levels) of which many are uncorrelated (or spuriously correlated) to the variables of interest.

In this review, we discuss important computational and statistical concepts that should be considered when performing comparative proteomic analyses and outline procedures for processing MS data to achieve more reliable quantitative analyses. As the broader domains of statistical and machine-learning methods are beyond the scope of this review, we limit our examination to studies applying these methods to shotgun profiling datasets. For more general information about statistical learning models, a good reference is Ref. 13. For an in-depth discussion of the technical nature of LC-MS, and the relative merits of different MS platforms, the reader is directed to other reviews present in this issue.


    LC-MS-BASED PROTEOMIC PROFILING
 TOP
 ABSTRACT
 LC-MS-BASED PROTEOMIC PROFILING
 PARADIGMS FOR INFORMATION...
 LC-MS DATA PROCESSING
 LOW-LEVEL...
 SIGNAL FILTERING AND BACKGROUND...
 MID-LEVEL PROCESSING—PEAK...
 DE-ISOTOPING, CHARGE...
 DATASET ALIGNMENT AND COMPARISON
 DATA NORMALIZATION
 DATA TRANSFORMATIONS AND ERROR...
 HIGH-LEVEL PROCESSING
 STATISTICAL ISSUES
 MACHINE LEARNING AS APPLIED...
 EVALUATING CURRENT METHODS
 FUTURE PROSPECTS
 REFERENCES
 
LC-MS systems consists of: 1) a chromatography column, which separates peptide mixtures based on one or more physicochemical properties prior to MS; 2) an ionization source, which converts eluting peptides into gas phase ions; 3) one or more mass analyzers, which separate ions on the basis of m/z ratios; and 4) a detector, which registers the relative abundance of ions at discrete m/z. In MS/MS, precursor ions are recorded in full-scan mode (all m/z values), followed by selective ion isolation and fragmentation for sequence identification. MS/MS instruments are operated in an automated alternating scan mode. Two main ionization technologies used are ESI and MALDI (6). Because ESI generates ions directly from solution, it is readily coupled to LC or capillary electrophoresis. In a standard reverse-phase HPLC setup, the column media differentially retards the migration of peptides based on selective hydrophobic interaction affinities. Peptides are then eluted with a gradient of organic solvent and ionized just prior to introduction into the mass spectrometer. LC is well-suited to examining complex biological samples because: 1) peptides with the same nominal m/z are less likely to be introduced at the same time, reducing ambiguity; and 2) with fewer competing ion species, fewer artifacts arise due to ion suppression or ion-ion interference.

A non-LC-based MS platform commonly used in profiling studies is SELDI (a refinement of MALDI), in which subsets of proteins in samples are selectively pre-adsorbed onto various proteomic chips (consisting of differing binding surfaces, such as hydrophobic or metal-binding surfaces) (8, 9).

The quality of profiling studies is determined by the overall sensitivity, detection coverage, dynamic range, fragmentation efficiency, mass resolution, and accuracy (6). Femptomole or better detection limits are commonly attained with LC-MS, even with mixtures exceeding several thousand components (7). With SELDI, much of the sample is discarded (not selectively bound to the matrix), whereas LC-MS enables full sample throughput, while providing the added opportunity for protein identification and mapping of post-translation modifications.


    PARADIGMS FOR INFORMATION EXTRACTION FROM LC-MS DATASETS
 TOP
 ABSTRACT
 LC-MS-BASED PROTEOMIC PROFILING
 PARADIGMS FOR INFORMATION...
 LC-MS DATA PROCESSING
 LOW-LEVEL...
 SIGNAL FILTERING AND BACKGROUND...
 MID-LEVEL PROCESSING—PEAK...
 DE-ISOTOPING, CHARGE...
 DATASET ALIGNMENT AND COMPARISON
 DATA NORMALIZATION
 DATA TRANSFORMATIONS AND ERROR...
 HIGH-LEVEL PROCESSING
 STATISTICAL ISSUES
 MACHINE LEARNING AS APPLIED...
 EVALUATING CURRENT METHODS
 FUTURE PROSPECTS
 REFERENCES
 
High-throughput MS/MS offers a powerful means of analyzing biological samples. However, MS/MS is time consuming, adding an order of magnitude in time to a profiling experiment over full-scan precursor MS profiling. Much time is wasted repeatedly resequencing the same peptides (even with data-directed experiments) across different samples. Additionally, due to the biased under-sampling nature of MS/MS ion selection (15), it is impossible to sequence every peptide in a sample in a single-pass analysis. In the context of a comparative LC-MS profiling analysis of multiple samples, such as serum profiling for disease biomarker discovery, one can imagine avoiding some of the less-desirable properties associated with MS/MS. For example, a reference database of previously sequenced peptides ion species could be built up. Given a new sample, one could imagine running LC-MS alone (i.e. without performing CID), obtaining a global signature pattern of peptide ion abundance, and then scanning the database for discrete pairs of time, m/z values to predict peptide identity. In this way, the same peptides need not be resequenced, experimental run-time is reduced, and differential expression can be more readily uncovered (1).

However, such an approach is complicated by a number of factors: 1) the LC time axis needs to be corrected to account for spurious deviations in peptide elution times across different experiments; 2) there may be confounding overlap of peptides across the (time, m/z) space; 3) LC-MS systems are subject to considerable noise and variability that is not fully characterized or accounted for; and 4) differences in overall sample composition, leading to differential ion context, may affect the apparent signal intensities recorded for peptides common to multiple samples.

To address these issues, the following tasks need to be resolved: 1) corrective alignment along the experimental LC time axis so that times are comparable across experiments; 2) combining replicate experimental LC-MS datasets so as to improve the overall signal-to-noise ratio; 3) developing statistically sound methods for distinguishing signal from background; 4) systematically studying sources of variability and signal characteristics inherent to LC-MS data and modeling them so as to take them into account (for example, systematic variation in peak amplitude and width, signal linearity, and background artifacts should be investigated); 5) developing algorithms to detect and quantify peptide ion peaks in LC-MS data; and 6) developing probabilistic algorithms to uncover interesting peptide patterns and to match new patterns to previously discovered ones.

In the aforementioned approach, peptide peaks are extracted and then identified, and only after these steps are performed does one pose queries related to the particular experiment at hand. An alternative approach to exploiting LC-MS data is to take a signal processing perspective. For instance, rather than starting with peak detection and peptide identification to find disease biomarkers using LC-MS patterns, the data could be treated as a signal matrix, allowing the application of established methods in signal processing, statistics, and machine learning (13, 16, 17) to tease out interesting and relevant patterns from the data. Nevertheless, many of the previous tasks, such as time alignment, background detection and correction, and accounting for systematic variations in signal amplitude, are relevant to this latter approach.

Several groups (1, 10, 18) have developed working systems to address the main problems described above. The system of Radulovic et al. (18) performs filtering, normalization, peak detection, quantification and alignment, and then classification; they also establish linearity in LC-MS peak signal with peptide concentration. Although no algorithm details are provided, Kearney and Thibault (1) report a system for peptide detection and data alignment, which identifies the m/z, retention time, charge state, and maximum intensity of the principle peptide isotope. Wang et al. (10) performed baseline subtraction, peak detection, isotope detection, alignment, and quantification and demonstrated a linear LC-MS peak signal response to increasing peptide concentration in a complex mixture over a fairly large dynamic range.

Several studies of LC-MS have focused on particular low/mid-level data processing steps such as noise reduction, error models (which model the variance of the peptide abundance), or alignment in time, while other studies have been devoted strictly to the use of MS signal for sample classification. For studies attempting to cover the entire set of tasks, a typical approach for information extraction from LC-MS datasets typically involves a sequence of operations as listed in Table I, each treated in a largely independent manner from the others. The optimal order in which these should be performed so as to obtain the most information possible remains unclear. For instance, one could align the signals before or after peak detection, one could normalize before or after alignment, etc. There are several sequences of operations that make intuitive sense, but that differ significantly from one another. The optimal ordering depends in part on our ability to perform each task. For example, if alignment can be performed optimally on the raw data, without normalization or peak detection, then it may be desirable to do this first, potentially allowing for better normalization and peak detection. Alternatively, if one can not align the raw signals well, then it may be desirable to normalize and find peaks before alignment. Lastly, it may be desirable to approach some of these subtasks together, for example as in the continuous profile model (CPM),1 which aligns and normalizes abundance simultaneously, removing the necessity to choose one before the other (19).


View this table:
[in this window]
[in a new window]
 
TABLE I Sequence of operations for processing LC-MS datasets

 

    LC-MS DATA PROCESSING
 TOP
 ABSTRACT
 LC-MS-BASED PROTEOMIC PROFILING
 PARADIGMS FOR INFORMATION...
 LC-MS DATA PROCESSING
 LOW-LEVEL...
 SIGNAL FILTERING AND BACKGROUND...
 MID-LEVEL PROCESSING—PEAK...
 DE-ISOTOPING, CHARGE...
 DATASET ALIGNMENT AND COMPARISON
 DATA NORMALIZATION
 DATA TRANSFORMATIONS AND ERROR...
 HIGH-LEVEL PROCESSING
 STATISTICAL ISSUES
 MACHINE LEARNING AS APPLIED...
 EVALUATING CURRENT METHODS
 FUTURE PROSPECTS
 REFERENCES
 
Operationally, low-level processing involves raw LC-MS signal, while mid-level processing occurs after some basic preprocessing, such as m/z quantization, filtering and background subtraction (i.e. low-level processing). In contrast, high-level processing is applied to data that has been fully massaged for use in conjunction with machine-learning techniques or more traditional statistical techniques such as significance testing of individual features (e.g. peptide abundance). The substantial collection of methods developed for processing nonchromatographic MS data (e.g. MALDI and SELDI studies) is in many cases transferable to LC-MS data (which can be viewed as a time series of static MS spectra). Most of the low- and mid-level processing methods reported to date, however, have been performed parenthetically as a means to the larger goal of sample classification or biomarker discovery and hence have not been rigorously studied.


    LOW-LEVEL PROCESSING—QUANTIZATION OF m/z VALUES
 TOP
 ABSTRACT
 LC-MS-BASED PROTEOMIC PROFILING
 PARADIGMS FOR INFORMATION...
 LC-MS DATA PROCESSING
 LOW-LEVEL...
 SIGNAL FILTERING AND BACKGROUND...
 MID-LEVEL PROCESSING—PEAK...
 DE-ISOTOPING, CHARGE...
 DATASET ALIGNMENT AND COMPARISON
 DATA NORMALIZATION
 DATA TRANSFORMATIONS AND ERROR...
 HIGH-LEVEL PROCESSING
 STATISTICAL ISSUES
 MACHINE LEARNING AS APPLIED...
 EVALUATING CURRENT METHODS
 FUTURE PROSPECTS
 REFERENCES
 
In its most raw form, the full-scan MS spectra obtained from an LC-MS experiment generally consists of a table of values consisting of: i) scan (spectra) number; ii) LC retention time; iii) precursor ion m/z value; and iv) ion abundance. There is a one-to-one mapping such that the scan number is simply an enumeration of the ordered time points. For more easy data manipulation, LC-MS datasets can be converted into a matrix format, with columns representing m/z values and rows representing time. Entries in the matrix represent the relative abundance values at each combination of time and m/z bin. This matrix formation involves binning nominal m/z values, because retaining all possible values would lead to a huge, sparsely populated matrix, while time values can normally be left unchanged because these are usually not too numerous and because many m/z values typically correspond to a given time point.

An issue to consider here is how to bin the m/z values. For example, one could opt for evenly spaced bins in either native or log m/z space. An optimal bin width would be large enough to keep the matrix tractable and not too sparse, but small enough that individual m/z values remain informative (i.e. not collapsing the information too much); this trade-off depends on the MS instrumentation used. For instance, Radulovic et al. (18) decided to round m/z values to the nearest integer, consistent with the nominal mass accuracy of the ion trap instruments used in their study, while Wang et al. and Anderle et al. (10, 20) used evenly spaced bins of width 0.2Th to exploit the increased accuracy of their instruments. In a study of gas-phase chromatography (GC)-MS data by Stein (21), bin widths were chosen as a function of the measured mass accuracy and resolution and increased linearly with m/z. In Ref. 22, LC-MS peak-widths (ignoring time) were reported to be reasonably constant in log m/z space by Tibshirani and co-workers. No methods have been reported for evaluating optimal bin width, nor for determining the sensitivity of further calculations to this parameter.


    SIGNAL FILTERING AND BACKGROUND SUBTRACTION
 TOP
 ABSTRACT
 LC-MS-BASED PROTEOMIC PROFILING
 PARADIGMS FOR INFORMATION...
 LC-MS DATA PROCESSING
 LOW-LEVEL...
 SIGNAL FILTERING AND BACKGROUND...
 MID-LEVEL PROCESSING—PEAK...
 DE-ISOTOPING, CHARGE...
 DATASET ALIGNMENT AND COMPARISON
 DATA NORMALIZATION
 DATA TRANSFORMATIONS AND ERROR...
 HIGH-LEVEL PROCESSING
 STATISTICAL ISSUES
 MACHINE LEARNING AS APPLIED...
 EVALUATING CURRENT METHODS
 FUTURE PROSPECTS
 REFERENCES
 
Because LC-MS is subject to background chemical and electronic noise, together with systemic contaminants in the LC mobile phase (column solvent) (23, 24), methods for noise reduction and signal enhancement are commonly used. Signal filtering is a mature field from which a variety of techniques are applicable. The theory of digital signal processing is based on the assumption that data were sampled at regular time intervals, which is not necessarily the case for many LC-MS experiments. Filtering may nevertheless be useful provided extra care is taken to account for this. Conceptually, signal filtering and baseline subtraction can each be performed in both the time (scan header) and m/z domains. Two approaches applied to date include: i) subtracting a fitted, additive baseline model; and ii) using digital filters to smooth and enhance the MS signal. In certain cases, filtering is performed first, and then a modeled baseline is subtracted (10).

Various filters for data smoothing along the LC time axis have been implemented (25). These include simple "moving average," median, and moving geometric mean filters, and the Savitzky-Golay filter, which preserves high-frequency content by fitting a higher-order polynomial to the data over a local window (26). For example, Wang et al. observed well-defined peaks and baseline after applying the Savitsky-Golay filter to LC-MS data (10). Data points belonging to the baseline were then hand-picked, fit with a low-order polynomial, and subtracted from the original data, together with a second application of the Savitsky-Golay filter for added peak smoothing. Nevertheless, manual delineation of background is a subjective, tedious, and error-prone process, and inconsistent with high-throughput analysis.

Bylund et al. have noted that taking the second derivative of raw MS signal, followed by matched filtering, whereby signal (in this case the second derivative of the signal) is cross-correlated with a Gaussian template, reduces background noise and enhances peaks. This smoothing is similar to application of a low-pass "top hat" filter, as is done on SELDI-MS data along the m/z axis in (27) in which signal frequencies above some threshold are thereby completely removed from the data. Note that by taking the second derivative of the signal, nonlinear noise is actually amplified (and hence the need to filter afterward), while with matched filtering, noise (specifically white noise) adjusts to the template frequency, making it harder to identify. On the other hand, Radulovic et al. (18) reported a two-step procedure for noise reduction/binarization of LC-MS signal. First, a "moving average" filter (five-scan header width) is used to smooth the dataset across discrete m/z bins. Then peak intensities exceeding a pre-defined threshold, T (related to the trimmed mean or median intensity of one m/z bin across all scan headers), for N consecutive scans are selected as being signal, with the rest of the intensities deemed to be noise. Processing of negative control spectra acquired during LC-MS analysis of solvent alone revealed few false-positive peaks. However, the number of missed genuine peptide peaks (false-negatives) is quite sensitive to the selected processing parameters (values of T, N, or M), and peak detection efficiency has not been fully optimized.2

Wagner et al. (10) have made use of an iterative, nonparametric, local regression smoother (a robust loess smoother) to model the baseline in MALDI-MS datasets. Because distinct regions of m/z were different in nature, empirical selection of the size of the smoothing window for each region was necessary and ranged from 1% of the total number of m/z values for regions with small m/z to 70% for regions with larger m/z values. In contrast, Baggerly et al. found a single sinusoidal baseline noise component in their MALDI dataset, which they speculate was produced by use of an alternating current power source (28). The noise frequency was estimated by Fourier transform of a hand-selected data region, and then eliminated by subtracting out a sinusoid of this frequency. Following this procedure, the residual baseline was modeled with a modified local minimum at each m/z value and them removing the modeled baseline.

Bylund et al. developed a unique "orthogonal background subtraction" approach for LC-MS baseline correction (25), wherein principal component analysis (PCA) is applied to time vectors (one vector per time point, over all m/z) from a region in the dataset expected to consist solely of background. The noise subspace was then characterized by taking the top principal components, and noise is removed by subtracting the components of the data that lie in this subspace. While such a model may prove useful, it is not clear whether it is appropriate for LC-MS analysis; PCA operates in a global way (across all time points in this case) and therefore does not allow for noise characteristics to change over time.

Concerns have been raised about the suitability of data filtering followed by parametric fitting (e.g. based on a peak model) (29), and thus one must carefully consider the ultimate goal of a LC-MS-profiling experiment when performing tasks in sequence. To our knowledge, no systematic comparison of the effects of various filtering/background subtraction techniques on MS data integrity has yet been reported, and it is advisable to investigate the consequences on downstream analysis (e.g. classification, peak detection). Because the filtering efforts published to date were performed on only one data dimension (m/z), it will be interesting to see if filtering independently in both axes (time and m/z) or simultaneously is more beneficial.


    MID-LEVEL PROCESSING—PEAK DETECTION AND QUANTIFICATION
 TOP
 ABSTRACT
 LC-MS-BASED PROTEOMIC PROFILING
 PARADIGMS FOR INFORMATION...
 LC-MS DATA PROCESSING
 LOW-LEVEL...
 SIGNAL FILTERING AND BACKGROUND...
 MID-LEVEL PROCESSING—PEAK...
 DE-ISOTOPING, CHARGE...
 DATASET ALIGNMENT AND COMPARISON
 DATA NORMALIZATION
 DATA TRANSFORMATIONS AND ERROR...
 HIGH-LEVEL PROCESSING
 STATISTICAL ISSUES
 MACHINE LEARNING AS APPLIED...
 EVALUATING CURRENT METHODS
 FUTURE PROSPECTS
 REFERENCES
 
In certain situations, it may be desirable to detect and quantify two-dimensional peptide peaks in LC-MS signal for use as input to classification algorithms, biomarker discovery, or global proteomic comparisons using a unified reporting schema. Extracting peaks from LC-MS signal both reduces the dimensionality of the data, which can simplify downstream analysis, and assigns intuitive meaning to data features. (Possible drawbacks of this approach are discussed further below.)

Radulovic et al. used an iterative coarse-to-fine strategy to extract two-dimensional (in time and m/z) peaks from LC-MS data (18). Neighboring points in the data matrix deemed to be signal (rather than noise—see previous section on filtering) were combined to form peaks at the coarsest level, and then iteratively through each of the more refined levels, with a bisection method used to avoid spurious peak mergers. Peaks were quantified by summing individual grouped feature intensities. On the other hand, Wang et al. detected LC-MS peaks based on coinciding local signal maxima, in time and m/z; local maxima are defined as an increase in ion abundance greater than a prespecified threshold over a predefined range (10). Peaks were then quantified either by summing intensity over the component elution time or based on the maximum peak height. Unfortunately, the authors do not describe how the component elution time was determined. Similar techniques were used in Refs. 20 and 30.

Yasui et al. (31) defined peaks (in SELDI-MS data) as m/z elements exhibiting higher intensity than the N nearest neighbors, with N chosen empirically, with an added constraint that peaks have a higher intensity than the "broader" neighborhood as calculated by a local linear super-smoother method. In an effort to reduce dataset misalignment problems, m/z values within ±0.2% of these peaks were further selected as additional peaks. Although peak finding was not the emphasis of the study of Tibshirani et al. (22), these authors used a similar routine to scan for m/z peaks in SELDI/MALDI datasets exhibiting a higher intensity than a prespecified number of closest neighbors. In contrast, Randolf et al. made use of multiscale wavelet decomposition to detect peaks in MALDI-MS data (32), trying to avoid ad hoc decisions pertaining to thresholds and filter parameters. The wavelet decomposition provides a breakdown of the signal according to scale (and location), and by taking the derivative at each scale these authors detected scale-specific peaks. In a histogram of peak locations extracted in this way from many samples, locations with high counts were considered as evidence of true peaks, although the authors did not offer a method for choosing an optimal scale.

Instead of attempting to find peaks, Idborg et al. applied a curve resolution approach developed by the Chemometrics community for processing GC-MS profiles (see Ref. 33 for an overview of Chemometrics analysis techniques) to extract the major spectral components in LC-MS profiling of urine (34, 35), a notably simpler mixture than tissue or serum. In this manner, a data matrix, X, defining a single LC-MS experiment, with rows corresponding to time points and columns to m/z values, was resolved into a set of "pure" spectral profiles. These profiles were stored as column vectors of length m in the matrix, S, with the complete set of profiles approximately spanning the entire space of all m/z spectra generated across the various time points. The relative amount of each spectral profile present at a given time point was then estimated in the "pure" concentration profiles, stored as t row vectors of length a in the matrix C. Unexplained variation was represented by matrix E, and X = CST + E. An iterative method was used to solve for C and S, starting from an initial set of "key" spectra and using constraints related to the non-negativity of the concentrations (i.e. related to the fact that one can only add in physical components, not remove them), to update the spectral profiles. However, Idborg et al. do not describe how to select the number of spectral profiles, although various approaches to this problem are provided in Ref. 33, as well as other algorithms used for spectral resolution. Because use of these techniques has been largely restricted to relatively simple mixtures, it is not clear how well these methods will scale to more complex proteomic samples.

Overall, peak detection has generally been performed in a rather ad hoc manner, with little evaluation of the effectiveness of the various methods or parameter choices. The algorithms employed to date make no use of a priori or learned information with regards to peak shape, along either the time or m/z dimensions, and in some cases ion intensity values are only exploited very indirectly. Rather than retaining abundance information, peaks are frequently binarized (18, 31). Radulovic et al. do not motivate this decision, while Yasui et al. state that it helps to overcome noise in the signal. Such a step is lossy and is likely suboptimal for downstream analysis. Incorporating richer information would likely improve analytical performance, albeit at the cost of more computation. The underlying methodologies of machine learning and statistical techniques are intended to account for random variation caused by noise, and their performance is likely deteriorated by using them with binarized MS data. A less-extreme approach, and one retaining more information, would be to apply filtering and/or baseline subtraction as well as intensity normalization rather than binarization. It has been suggested that at present LC-MS is still not generally a quantitative science (36). Peak detection and quantification, even if done optimally, does not guarantee linearity of peak signal relative to analyte concentration due to possible ion suppression effects, although compelling evidence of linearity of extracted LC-MS peak intensities, at least for spiked reference proteins, has been established using certain data-processing methods and technological platforms (10, 18).


    DE-ISOTOPING, CHARGE DECONVOLUTION, AND PEAK MATCHING
 TOP
 ABSTRACT
 LC-MS-BASED PROTEOMIC PROFILING
 PARADIGMS FOR INFORMATION...
 LC-MS DATA PROCESSING
 LOW-LEVEL...
 SIGNAL FILTERING AND BACKGROUND...
 MID-LEVEL PROCESSING—PEAK...
 DE-ISOTOPING, CHARGE...
 DATASET ALIGNMENT AND COMPARISON
 DATA NORMALIZATION
 DATA TRANSFORMATIONS AND ERROR...
 HIGH-LEVEL PROCESSING
 STATISTICAL ISSUES
 MACHINE LEARNING AS APPLIED...
 EVALUATING CURRENT METHODS
 FUTURE PROSPECTS
 REFERENCES
 
Additional considerations relevant to LC-MS proteomic analysis are the charge spread on the ions, and also stable isotope variants. ESI results in variable net positive (or negative) charge. Multiply charged molecules are observed, with signal spread out along the m/z axis in a charge state envelope. Under typical conditions, mixtures of singly, doubly, and even triply protonated precursor ions are commonly observed with tryptic peptides. If the sample analyzed is simple, it is possible to deduce related m/z peaks. Charge deconvolution relies on determination of isotope shoulders, resulting from the presence of heavy isotope variants of carbon and other atoms. Roughly ~1% of naturally occurring carbon exists with seven (as opposed to the more common six) neutrons (13C versus 12C); this signature dominates other isotopes because ~50% of the mass of a typical peptide is carbon (37). Although isotope variants are chemically identical, the heavier isotope sister ion peaks exhibit greater apparent m/z than the predominant mono-isotopic peak (all 12C). Based on the mass differential, one can deduce the charge state of multiply charged ions.

Wang et al. and Anderle et al. applied a de-isotoping step (to assign isotope patterns) before peak quantification (10, 20). Though few details are provided, the algorithms were apparently based on cross-correlating the observed peak envelopes to reference isotopic tables, with the highest-scoring match identifying the most probable isotope shoulders. One can imagine that probabilistic modeling techniques, such as hidden Markov models (HMMs), may significantly improve upon this template-matching scheme. In contrast, Tibshirani et al. opted to smooth MALDI/SELDI data along the m/z axis as a simpler alternative to deconvolution (22). Extremely complex samples may prove to be less amenable to full de-isotoping.

Peak matching is another related topic relevant to quantitative proteomic comparisons. To measure reproducibility of peptide signal, experimental peaks must be matched across LC-MS datasets. Naive methods, based on simple proximity (in time or m/z space), are reported to be effective (10, 20, 39). For instance, Radulovic et al. used MS/MS-derived sequence identities to verify the correct matching of ~200 putative peptides across multiple samples (18). However, given that MS/MS targets prominent peaks, this assessment is likely biased. Anderle et al. (20), on the other hand, found it necessary to remove ~2% of data points—presumed outliers thought to be attributable to mismatching (20). Incorporation of prior knowledge of peak shape, instrument m/z drift, and a more-probabilistic formulation might significantly improve the effectiveness of peak detection, quantification, and matching.


    DATASET ALIGNMENT AND COMPARISON
 TOP
 ABSTRACT
 LC-MS-BASED PROTEOMIC PROFILING
 PARADIGMS FOR INFORMATION...
 LC-MS DATA PROCESSING
 LOW-LEVEL...
 SIGNAL FILTERING AND BACKGROUND...
 MID-LEVEL PROCESSING—PEAK...
 DE-ISOTOPING, CHARGE...
 DATASET ALIGNMENT AND COMPARISON
 DATA NORMALIZATION
 DATA TRANSFORMATIONS AND ERROR...
 HIGH-LEVEL PROCESSING
 STATISTICAL ISSUES
 MACHINE LEARNING AS APPLIED...
 EVALUATING CURRENT METHODS
 FUTURE PROSPECTS
 REFERENCES
 
While the accuracy of m/z measurements obtained with properly calibrated instruments is typically very good, LC fractionation is inherently variable (i.e. considerable dispersion in peptide retention times). Elution patterns can become distorted (locally compressed and/or expanded in complex, nonlinear ways) by differences in chromatography performance due to changes in ambient pressure and temperature. Even under ideal conditions, MS duty cycles are finite and sampling is not necessarily constant, resulting in spectral capture at different time points across an eluting peak even between repeat analyses. This variation can affect peak discrimination and global proteomic comparisons. Thus, to maximize the benefits of LC-MS, one needs to deal with the inherent variability in the time axis (i.e. recorded retention time or scan headers). In certain cases drift may occur along the m/z axis as well, although this is far less of a problem than variations in time. Time and m/z axes can be aligned independently or simultaneously, though the latter has not been reported in the literature and would be more easily applied after peak detection. Furthermore, if aligning in time only, one may wish to use scalar time series rather than the vector time series most readily available from the data (e.g. total ion current (TIC) as scalar time series versus a vector of all m/z values at each time point), or even more general representation schemes, such as a reduced-dimensionality vector time series as obtainable for example by PCA.

Alignment algorithms typically involve either: i) maximizing some objective function over a parametric set of transformations (usually linear); or ii) nonparametric alignment, by way of dynamic programming (DP); or iii) some combination of these methods (e.g. piecewise linear transformations). Some of the main differences among algorithms are: i) whether alignment is performed before or after peak detection (i.e. using either continuous signal or peaks); ii) whether or not signal amplitude is used; and iii) whether or not changes in scale are corrected for (i.e. allowing for interplay between these two types of corrections). Most algorithms used to date require a template, specified a priori, to which all time series are pre-aligned. Because suboptimal template choice could result in poor alignments, it may be desirable to avoid this.

Nielsen et al.’s correlated optimized warping (COW) algorithm (40) and modifications of it were used by Bylund et al. (41) to align chromatographic data by dividing the time axis into segments, and then performing a linear warp within each segment to optimize overlap while constraining segment boundaries to maintain agreement. An objective function defined the optimal set of transformations based on the sum of correlation coefficients or covariance between data segments in pairs of samples, and it was maximized by way of DP. COW can be applied to both scalar and vector time series by defining the correlation or covariance appropriately. Use of more than one data vector component (e.g. multiple m/z bins rather than the TIC) produced more stable alignments with respect to variation of free parameters such as maximum allowable warp (40). Nielsen et al. established visually using artificial chromatograms that COW is robust to varying peak numbers, heights, and widths, and is superior to a global linear warp (40). An interesting evaluation is provided in Ref. 41, where PCA was performed on the base peak chromatograms (BPC). The amount of variance explained by the top two principal components was 70% before alignment and 98% afterward. Similarly, explained variance went from 60 to 97% with a seven-component parallel factor analysis (PARAFAC—a generalization of PCA to three-way data), indicating a reduction in the major sources of sample variation.

Radulovic et al. (18) approach to alignment in time used binarized data and one alignment per data block. The data matrix was divided into five equal m/z partitions and six equal time partitions, creating 30 blocks. One experimental dataset was warped to match the other by applying a linear transformation with offset to each block and constraining neighboring blocks to have similar transformations. Monte Carlo maximization was used to find the optimal set of linear transformations as defined by a "data overlap" objective function. Finally, a post hoc "wobble" of peaks in time was applied to compensate for residual peak drift. Empirical assessment of feature-wise overlap of the datasets implied a considerably improved alignment (18), although the authors did not use an objective measure because their evaluation was based on the function being optimized. Another drawback of the proposed method is that it works only with binarized data, and is therefore sensitive to choice of binarization threshold, and does not exploit signal amplitude, which ultimately may be more informative.

Randolf et al. used coarse scale-specific peaks, extracted by multiscale wavelet decomposition, to align MALDI data along the m/z axis (32). Dominant peaks (above some threshold) were used to compute a single optimal shift for all peaks, and thus the alignment is not very flexible as it does not allow for even a simple linear stretch or compression. It is also not clear whether or not features detected at different scales should be aligned differently or could leverage one another in alignment. In contrast, Idborg et al. do not explicitly align their datasets, but compare detected components derived by curve resolution, using cross-correlation, and shifting individual components to account for constant time shifts between experiments. Components correlating above some threshold are said to be identical (34, 35).

We have recently developed the CPM, an HMM-based model to do multiple alignments of time series (for continuous-valued output, such as the abundances in an MS experiment) using LC-MS TIC traces (19). In the context of the CPM, one can think of the HMM (42, 43) as containing a series of hidden states, each of which represents some underlying "true"’ or canonical time, to which each scan header in the TIC is ultimately assigned. The alignment in time is dictated by which scan header gets (probabilistically) mapped to which hidden state. The states are called "hidden" because until the algorithm is run on the data (i.e. until the model is trained), we do not know which scan headers map to which states. In addition to the hidden time states mentioned, hidden states are also augmented by "scale" states, which allow scaling of the TIC amplitudes locally in time. Use of the model after training provides a mapping to both time and scale states, thereby performing alignment and normalization concurrently. Training, whereby the best parameters for the HMM are found, is performed by maximum likelihood (i.e. the objective function is defined to be the likelihood of the data under the HMM probabilistic model framework) by way of Expectation-Maximization (44). Both training and later use of the model (i.e. deduction of which scan headers map to which hidden states) are performed efficiently in HMMs by use of DP. The CPM has the advantages that no template is required, all experimental TICs are aligned simultaneously (leveraging the information across experiments), normalization local in time is concurrent with alignment, and the model is probabilistically formulated. As pointed out in Ref. 19, the algorithm can be extended to use vector time series rather than TICs and also to allow alignment of nonreplicate data (see end of this section).

The classical algorithm for aligning time series is dynamic time warping (DTW), a DP-based approach that originated in the speech recognition community as a robust distance metric between pairs of time series (45). DTW aligns one time series to a specified reference series. It is closely related to COW, except instead of moving only nodes (time segment boundaries) around, every data point can be moved; thus, transformations are not restricted to piecewise linear. It is likely that Bylund, Nielsen et al. avoided use of this less-restrictive model in order to reduce computational costs and avoid overfitting (40, 41). In contrast, Wang et al. used DTW on LC-MS data, but constrained the analysis to no more than 200 m/z bins so as to make it computationally practicable (10).

Hierarchical clustering was used in a novel way by Tibshirani et al. (22) to align MALDI/SELDI spectra along (log) m/z space after peak detection. Input to the hierarchical clustering algorithm is a list of putative candidate extracted peaks as well as the Euclidean distance between peaks in log m/z space. After clustering was completed, the dendogram was cut off at an empirically determined level, with the mean m/z of each cluster defining an individual peak.

A brief description is given for an alignment in m/z space for SELDI data by Sauve et al. (27) who used a tree, built for example by hierarchical clustering of samples, to guide the progressive warping of related experiments together. Although few details were reported, presumably the algorithm starts by aligning the two closest samples, forming a single pseudo-sample, to which the next closest sample is aligned, etc. Such a method removes the need for a prespecified template, but is likely adversely affected by the fact that the distances between spectra are measured before alignment, and hence are largely meaningless. It is also not clear that starting with the two closest spectra is any less arbitrary or more effective than other approaches.

As one moves from a local alignment perspective to a global one, a bias-variance trade-off comes into effect. That is, with more data constraining the transformation, more stable alignments result. The optimal trade off is determined by the informativeness of the LC-MS signal used and the type of misalignments present. In terms of what data are best for use in alignment, there are a number of choices. To date, these have largely been limited to the TIC, BPC, and to individual (or sets of) extracted m/z ion chromatograms. In practice, one could select ion chromatograms based on the highest ion count, or highest sum of second derivatives along the m/z axis, or rather use some dynamic binning of m/z such that the ion signal is evenly distributed to attempt to extract a smaller number of informative m/z bins rather than simply choosing evenly spaced ones. Alternatively, one could apply a dimensionality reduction technique, such as PCA, on the m/z space, with the aim of using a smaller number of features that are still informative (in this case, the features would be "eigen-m/z," that is, pseudo-m/z bins made up of linear combinations of the original m/z bins).

Whether one uses piecewise linear transformations in small regions, such as in Refs. 18 and 41, or more flexible alignment schemes, such as in Refs. 10 and 19, ultimately may have little importance, so long as the overall transformation is not restricted to a global linear warp. Incorporating local scaling simultaneously with alignment may also prove to be advantageous, as reported in Ref. 19. Another issue to consider is whether to align m/z bins individually, together, or somewhere in between (i.e. in a smoothly varying way). The issue of whether one should do alignments before or after peak detection has not been clearly answered. Assuming it were possible to correct the LC time axis before peak detection, one could better leverage the information encoded across aligned datasets to achieve more reliable and sensitive peak detection. Historically, with LC-MS data, researchers have concentrated on correcting the time axis, ignoring the m/z axis. However, corrections are commonly performed along the m/z axis in SELDI/MALDI experiments, suggesting it may be desirable to do so with LC-MS data (though this may be instrument-dependent). Alignment algorithms are typically formulated to work on datasets that are very similar to each other. However, if one knows a prior that the samples may differ significantly in a few (unknown) locations (by an unknown amount), for example in comparisons of cancer and noncancer serum, then this should be incorporated into the model, as suggested in Ref. 19. This should improve the overall performance of alignment algorithms and may be a fruitful direction to pursue.


    DATA NORMALIZATION
 TOP
 ABSTRACT
 LC-MS-BASED PROTEOMIC PROFILING
 PARADIGMS FOR INFORMATION...
 LC-MS DATA PROCESSING
 LOW-LEVEL...
 SIGNAL FILTERING AND BACKGROUND...
 MID-LEVEL PROCESSING—PEAK...
 DE-ISOTOPING, CHARGE...
 DATASET ALIGNMENT AND COMPARISON
 DATA NORMALIZATION
 DATA TRANSFORMATIONS AND ERROR...
 HIGH-LEVEL PROCESSING
 STATISTICAL ISSUES
 MACHINE LEARNING AS APPLIED...
 EVALUATING CURRENT METHODS
 FUTURE PROSPECTS
 REFERENCES
 
MS signals are frequently corrupted by either systematic or sporadic changes in abundance measurements. That is, overall peak amplitudes measured in one replicate may be elevated with respect to another, and may also have systematic changes within an experiment, across time, due to a change in column or ESI performance. In such cases, the data need to be normalized to make the experiments comparable. The simplest approach would be to multiply all abundance values in one experiment by some constant factor, but in general it may be necessary to perform more detailed corrections. Normalization of MS data can be performed either by coercing m/z intensity values to be comparable across experiments (low-level processing), or by altering peak abundance to be comparable (mid-level processing). In general, one aims to normalize not only replicates, but also experimental data of distinct biological origin, such as serum profiles from cancer patients and healthy case controls. The underlying assumption behind normalization is that the overall MS abundance of either all features (peaks or time-m/z pairs), or subset(s) of these, should be equal across different experiments. Given this assumption, one can determine the ratio of overall abundance of a chosen set of features between two experiments for use as a multiplicative correction factor, and then normalize an entire set of experiments by arbitrarily choosing one of them as a reference to which all others are normalized.

Global normalization refers to cases where all features are simultaneously used to determine a single normalization factor between two experiments, while local normalization refers to cases where a subset of features are used at a time (different subsets for different parts of the data). Locality can be defined by, say, similarity in m/z values, time (scan headers), or abundance (peak intensity) levels. For example, in an abundance-dependent, local normalization, peaks of similar abundance within the same MS experiment would be scaled in a similar way, while peaks of different abundance are scaled in a different way. If the mean of all features is made to agree across all experiments, it is referred to as a global mean normalization, as for example used by Sauve et al. (27). By plotting the point-wise log ratio of matched features between datasets versus either m/z or abundance, Sauve et al. (27) visually established that no trend existed along either the m/z or intensity axes (27) and hence that the normalization method need not take these factors into account. While several groups have opted for global abundance normalization, in the case of LC-MS data it may be necessary to normalize locally in time (19), because chromatography can produce irregular fluctuations in signal.

Many of the normalization techniques applicable to LC-MS data have also been applied to the results of microarray experiments (46). With gene expression profiles, the genes used for normalization have sometimes been restricted to so-called "housekeeping" genes presumed to remain constant across the experimental conditions. An analogous concept was applied to LC-MS data by Wang et al. (10), whereby a constant intensity ratio between pairs of experiments was computed based on reference peaks. These authors noted, however, that the use of all detected peaks provided similar results. Baggerly et al. likewise considered using "housekeeping" peaks to normalize, but stated that they were unable to find stable peaks across experiments (28).

Anderle et al. (20) opted to normalize multiple LC-MS samples to a single reference dataset, using median global normalization over pairs of matched peaks, while Wagner et al. and Baggerly et al. (28, 39) chose global mean normalization for processing MALDI data. Radulovic et al. also used global normalization, with each dataset multiplied by K, such that the total number of intensity values exceeding some predefined threshold was set to equal the somewhat arbitrarily chosen value 100 (18). In contrast, Tibshirani et al. (22) normalized their MALDI/SELDI spectra by mapping the 10th and 90th percentiles to 0 and 1, respectively (linearly interpolated between). In our own recent study (19), TIC traces were normalized collectively in conjunction with dataset alignment, leveraging information contained across all experiments simultaneously. Normalization was done locally at each scan header (but globally across m/z), with the constraint that neighboring scan headers have similar scaling factors.

Normalization is often evaluated by calculating the coefficient of variation (CV) between peaks across different experiments after normalization. While reasonable CVs (e.g. <30%) are commonly reported, a comparison to CVs from prenormalized data is often not provided. Moreover, because no systematic comparison of these various normalization techniques has been reported, it is difficult to assess their relative merits. While Sauve et al. reported no abundance-dependent artifacts with SELDI data (27), it will be interesting to see if this holds more generally across data sets, and also for LC-MS data.


    DATA TRANSFORMATIONS AND ERROR MODELS
 TOP
 ABSTRACT
 LC-MS-BASED PROTEOMIC PROFILING
 PARADIGMS FOR INFORMATION...
 LC-MS DATA PROCESSING
 LOW-LEVEL...
 SIGNAL FILTERING AND BACKGROUND...
 MID-LEVEL PROCESSING—PEAK...
 DE-ISOTOPING, CHARGE...
 DATASET ALIGNMENT AND COMPARISON
 DATA NORMALIZATION
 DATA TRANSFORMATIONS AND ERROR...
 HIGH-LEVEL PROCESSING
 STATISTICAL ISSUES
 MACHINE LEARNING AS APPLIED...
 EVALUATING CURRENT METHODS
 FUTURE PROSPECTS
 REFERENCES
 
One ultimate goal of many profiling studies is to find proteomic patterns that can discriminate between different biological conditions. In order to properly assign statistical significance to candidate biomarkers, or any changes in apparent protein abundance, it is important to understand the patterns of variability, such as how the variance of measurements correlates with abundance level. When using traditional statistical approaches, such as a t test or related measure, one needs to estimate the variance, which can be difficult to do if few replicates are available (as is typical of many LC-MS studies). Error models can be devised that leverage the variance estimates across abundance levels (20) and can be incorporated in the development of machine-learning algorithms for sample classification, possibly leading to improved performance. The log transformation of peak intensities is frequently taken, with some underlying, unarticulated consensus that the log transformation is a good thing (27, 28, 34, 47). Error models can help us to understand why this may (or may not) be the case. Indeed, log transformation converts multiplicative error into additive error and therefore stabilizes variance (20). Nevertheless, Satten et al. (48) offered that taking the log of MALDI spectra makes it almost impossible to visually discern peaks and harder to tease apart noise from background, albeit without a statistical basis for this assertion.

To examine patterns of variation and to deduce the variation attributable to sample preparation, Anderle et al. conducted a well-controlled LC-MS study, borrowing established parametric models of heteroscedasticity (i.e. unequal variance, in this case, across peaks with different abundance levels) from the microarray community. Human serum was fractionated into 40 samples (after removal of the most-abundant proteins and following tryptic digestion), with half of these analyzed directly by LC-MS, and the other half recombined and again resplit before analysis. The variation in the amplitude of matched peak intensities formed the basis of their study. A variance model, {sigma}2x {approx} {alpha}µ2 + ßµ, was fit to the observed intensity values (with true mean abundance µ and constants {alpha} and ß) and observed visually to be appropriate. The quadratic term dominates at higher-intensity levels resulting in a constant CV. Fitting the pooled and individual datasets to this model, it was shown that the pooled samples exhibited a CV of 11%, while the individually processed samples had a CV of 20%, suggesting much variation is attributable to sample preparation. Anderle et al. also report that application of this error model to two randomly divided subsets of individually processed data, in conjunction with t tests on each of the matched peaks between the groups, resulted in far fewer false-positives as compared with no data transformation. The number of false-negatives was not assessed, however, and could be high. Moreover, it would be beneficial to evaluate this model with other LC-MS datasets to ensure that the patterns of variability are generally valid.

Satten et al. (48), on the other hand, take the view that "all identification should be made using only features that well exceed a noise threshold, to ensure that the resulting classification algorithm has scientific validity." While these concerns are warranted, the emphasis appears to be in the wrong place. Classification algorithms, when used properly, for example in the context of Bayesian methods, or using cross-validation (with feature selection inside the cross-validation loop (49)), will not assign any importance to random structure in the data. Conversely, the approach offered by Satten et al. (48) can cause loss of small, but significant signals, reducing the efficacy of profiling. They also state that "the goal of analysis [is] to show that standardization and de-noising algorithms retain sufficient information to allow categorization." We counter that data processing should improve information availability. Otherwise, faced with more challenging tasks, unnecessarily conservative processing steps may obscure the answers we seek.

Stein et al. mention that event-counting detectors, such as electron multipliers, generate signals with "average random deviation" proportional to the square root of signal intensity (23). The proportionality constant was determined empirically (for GC-MS) and was invariant to m/z across a variety of MS platforms except at lower signal strengths where background noise becomes dominant. Assuming "average random deviation" refers to mean standard deviation, these results match closely to those of Anderle et al. (20).

We would conclude, largely on the basis of the Anderle study (20), that log transformation of MS data is appropriate, and that the error model they presented can be used to properly stabilize the variance of low-intensity peaks, or to obtain robust estimates of variance. It would also be interesting to find out whether these transformations could be applied directly to a raw data matrix rather than to extracted peak abundances.


    HIGH-LEVEL PROCESSING
 TOP
 ABSTRACT
 LC-MS-BASED PROTEOMIC PROFILING
 PARADIGMS FOR INFORMATION...
 LC-MS DATA PROCESSING
 LOW-LEVEL...
 SIGNAL FILTERING AND BACKGROUND...
 MID-LEVEL PROCESSING—PEAK...
 DE-ISOTOPING, CHARGE...
 DATASET ALIGNMENT AND COMPARISON
 DATA NORMALIZATION
 DATA TRANSFORMATIONS AND ERROR...
 HIGH-LEVEL PROCESSING
 STATISTICAL ISSUES
 MACHINE LEARNING AS APPLIED...
 EVALUATING CURRENT METHODS
 FUTURE PROSPECTS
 REFERENCES
 
With high-level processing, MS data is harnessed to tackle the ultimate goals of profiling, such as:

  1. Sample classification (e.g. using proteomic profiles to disti