|
|
||||||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Molecular & Cellular Proteomics 4:419-434, 2005.
© 2005 by The American Society for Biochemistry and Molecular Biology, Inc.

,¶,||
From the
Department of Computer Science, University of Toronto, Toronto, Ontario M5S 3G4, Canada;
Program in Proteomics and Bioinformatics and ¶ Banting and Best Department of Medical Research, University of Toronto, Toronto, Ontario M5G 1L6, Canada
| ABSTRACT |
|---|
The field of expression proteomics seeks to answer the following questions: 1) which proteins and variant isoforms are expressed during the lifecycle of an organism; 2) which post-translational modifications occur in each of these proteins; 3) how do these patterns differ in different cell types and tissues and under different developmental, physiological, and disease conditions; and 4) how can biologists make use of this information to better understand the molecular basis for fundamental biological processes as well as for monitoring the course of disease so as to improve clinical diagnosis and treatment (1\N3). These questions are made all the more difficult by the complexity of most biological systems, which increases exponentially as one goes downstream from DNA sequence to mRNA intermediates to the protein end-products. While it appears there are likely far fewer genes coded for by the human genome than first anticipated, it is estimated that >60% of the
25,000 putative ORFs encode more than one splice variant (often tens to hundreds), and these in turn are frequently subject to post-translational modification (4, 5). Moreover, because proteins typically function together as components of dynamic multisubunit macromolecular complexes, the final complexity of a biological system is enormous; hence the search for interesting and relevant proteomic patterns remains a challenging task.
Capillary-scale HPLC-MS/MS (LC-MS) is rapidly emerging as a method of choice for large-scale proteomic analysis (1, 2, 6\N12). State-of-the-art LC-MS systems can be used to identify and track the relative abundance of thousands of molecules (6, 7). For standard bottom-up profiling experiments, the molecules in question are peptides derived by proteolysis of intact proteins. For very complex protein samples, such as blood, the peptide mixtures are physically resolved by chromatographic separation prior to injection into the mass spectrometer so as to generate a more-richly informative map, consisting of both the unique elution characteristics (column retention times) as well as m/z (mass-over-charge) ratios of individual peptides. Discrete peptides of interest are subject to collision-induced fragmentation followed by database matching for the purpose of sequence identification, while the recorded pattern of precursor ion peak intensities can be used to infer the relative quantities of the various proteins between samples. Nevertheless, comparisons of large-scale multivariate proteomic datasets are subject to a number of challenging analytical issues, such as experimental noise, systematic variations between experimental runs, the extreme overall range and dynamic nature of protein levels, and the huge number of measured features (e.g. protein levels) of which many are uncorrelated (or spuriously correlated) to the variables of interest.
In this review, we discuss important computational and statistical concepts that should be considered when performing comparative proteomic analyses and outline procedures for processing MS data to achieve more reliable quantitative analyses. As the broader domains of statistical and machine-learning methods are beyond the scope of this review, we limit our examination to studies applying these methods to shotgun profiling datasets. For more general information about statistical learning models, a good reference is Ref. 13. For an in-depth discussion of the technical nature of LC-MS, and the relative merits of different MS platforms, the reader is directed to other reviews present in this issue.
| LC-MS-BASED PROTEOMIC PROFILING |
|---|
A non-LC-based MS platform commonly used in profiling studies is SELDI (a refinement of MALDI), in which subsets of proteins in samples are selectively pre-adsorbed onto various proteomic chips (consisting of differing binding surfaces, such as hydrophobic or metal-binding surfaces) (8, 9).
The quality of profiling studies is determined by the overall sensitivity, detection coverage, dynamic range, fragmentation efficiency, mass resolution, and accuracy (6). Femptomole or better detection limits are commonly attained with LC-MS, even with mixtures exceeding several thousand components (7). With SELDI, much of the sample is discarded (not selectively bound to the matrix), whereas LC-MS enables full sample throughput, while providing the added opportunity for protein identification and mapping of post-translation modifications.
| PARADIGMS FOR INFORMATION EXTRACTION FROM LC-MS DATASETS |
|---|
However, such an approach is complicated by a number of factors: 1) the LC time axis needs to be corrected to account for spurious deviations in peptide elution times across different experiments; 2) there may be confounding overlap of peptides across the (time, m/z) space; 3) LC-MS systems are subject to considerable noise and variability that is not fully characterized or accounted for; and 4) differences in overall sample composition, leading to differential ion context, may affect the apparent signal intensities recorded for peptides common to multiple samples.
To address these issues, the following tasks need to be resolved: 1) corrective alignment along the experimental LC time axis so that times are comparable across experiments; 2) combining replicate experimental LC-MS datasets so as to improve the overall signal-to-noise ratio; 3) developing statistically sound methods for distinguishing signal from background; 4) systematically studying sources of variability and signal characteristics inherent to LC-MS data and modeling them so as to take them into account (for example, systematic variation in peak amplitude and width, signal linearity, and background artifacts should be investigated); 5) developing algorithms to detect and quantify peptide ion peaks in LC-MS data; and 6) developing probabilistic algorithms to uncover interesting peptide patterns and to match new patterns to previously discovered ones.
In the aforementioned approach, peptide peaks are extracted and then identified, and only after these steps are performed does one pose queries related to the particular experiment at hand. An alternative approach to exploiting LC-MS data is to take a signal processing perspective. For instance, rather than starting with peak detection and peptide identification to find disease biomarkers using LC-MS patterns, the data could be treated as a signal matrix, allowing the application of established methods in signal processing, statistics, and machine learning (13, 16, 17) to tease out interesting and relevant patterns from the data. Nevertheless, many of the previous tasks, such as time alignment, background detection and correction, and accounting for systematic variations in signal amplitude, are relevant to this latter approach.
Several groups (1, 10, 18) have developed working systems to address the main problems described above. The system of Radulovic et al. (18) performs filtering, normalization, peak detection, quantification and alignment, and then classification; they also establish linearity in LC-MS peak signal with peptide concentration. Although no algorithm details are provided, Kearney and Thibault (1) report a system for peptide detection and data alignment, which identifies the m/z, retention time, charge state, and maximum intensity of the principle peptide isotope. Wang et al. (10) performed baseline subtraction, peak detection, isotope detection, alignment, and quantification and demonstrated a linear LC-MS peak signal response to increasing peptide concentration in a complex mixture over a fairly large dynamic range.
Several studies of LC-MS have focused on particular low/mid-level data processing steps such as noise reduction, error models (which model the variance of the peptide abundance), or alignment in time, while other studies have been devoted strictly to the use of MS signal for sample classification. For studies attempting to cover the entire set of tasks, a typical approach for information extraction from LC-MS datasets typically involves a sequence of operations as listed in Table I, each treated in a largely independent manner from the others. The optimal order in which these should be performed so as to obtain the most information possible remains unclear. For instance, one could align the signals before or after peak detection, one could normalize before or after alignment, etc. There are several sequences of operations that make intuitive sense, but that differ significantly from one another. The optimal ordering depends in part on our ability to perform each task. For example, if alignment can be performed optimally on the raw data, without normalization or peak detection, then it may be desirable to do this first, potentially allowing for better normalization and peak detection. Alternatively, if one can not align the raw signals well, then it may be desirable to normalize and find peaks before alignment. Lastly, it may be desirable to approach some of these subtasks together, for example as in the continuous profile model (CPM),1 which aligns and normalizes abundance simultaneously, removing the necessity to choose one before the other (19).
|
| LC-MS DATA PROCESSING |
|---|
| LOW-LEVEL PROCESSINGQUANTIZATION OF m/z VALUES |
|---|
An issue to consider here is how to bin the m/z values. For example, one could opt for evenly spaced bins in either native or log m/z space. An optimal bin width would be large enough to keep the matrix tractable and not too sparse, but small enough that individual m/z values remain informative (i.e. not collapsing the information too much); this trade-off depends on the MS instrumentation used. For instance, Radulovic et al. (18) decided to round m/z values to the nearest integer, consistent with the nominal mass accuracy of the ion trap instruments used in their study, while Wang et al. and Anderle et al. (10, 20) used evenly spaced bins of width 0.2Th to exploit the increased accuracy of their instruments. In a study of gas-phase chromatography (GC)-MS data by Stein (21), bin widths were chosen as a function of the measured mass accuracy and resolution and increased linearly with m/z. In Ref. 22, LC-MS peak-widths (ignoring time) were reported to be reasonably constant in log m/z space by Tibshirani and co-workers. No methods have been reported for evaluating optimal bin width, nor for determining the sensitivity of further calculations to this parameter.
| SIGNAL FILTERING AND BACKGROUND SUBTRACTION |
|---|
Various filters for data smoothing along the LC time axis have been implemented (25). These include simple "moving average," median, and moving geometric mean filters, and the Savitzky-Golay filter, which preserves high-frequency content by fitting a higher-order polynomial to the data over a local window (26). For example, Wang et al. observed well-defined peaks and baseline after applying the Savitsky-Golay filter to LC-MS data (10). Data points belonging to the baseline were then hand-picked, fit with a low-order polynomial, and subtracted from the original data, together with a second application of the Savitsky-Golay filter for added peak smoothing. Nevertheless, manual delineation of background is a subjective, tedious, and error-prone process, and inconsistent with high-throughput analysis.
Bylund et al. have noted that taking the second derivative of raw MS signal, followed by matched filtering, whereby signal (in this case the second derivative of the signal) is cross-correlated with a Gaussian template, reduces background noise and enhances peaks. This smoothing is similar to application of a low-pass "top hat" filter, as is done on SELDI-MS data along the m/z axis in (27) in which signal frequencies above some threshold are thereby completely removed from the data. Note that by taking the second derivative of the signal, nonlinear noise is actually amplified (and hence the need to filter afterward), while with matched filtering, noise (specifically white noise) adjusts to the template frequency, making it harder to identify. On the other hand, Radulovic et al. (18) reported a two-step procedure for noise reduction/binarization of LC-MS signal. First, a "moving average" filter (five-scan header width) is used to smooth the dataset across discrete m/z bins. Then peak intensities exceeding a pre-defined threshold, T (related to the trimmed mean or median intensity of one m/z bin across all scan headers), for N consecutive scans are selected as being signal, with the rest of the intensities deemed to be noise. Processing of negative control spectra acquired during LC-MS analysis of solvent alone revealed few false-positive peaks. However, the number of missed genuine peptide peaks (false-negatives) is quite sensitive to the selected processing parameters (values of T, N, or M), and peak detection efficiency has not been fully optimized.2
Wagner et al. (10) have made use of an iterative, nonparametric, local regression smoother (a robust loess smoother) to model the baseline in MALDI-MS datasets. Because distinct regions of m/z were different in nature, empirical selection of the size of the smoothing window for each region was necessary and ranged from 1% of the total number of m/z values for regions with small m/z to 70% for regions with larger m/z values. In contrast, Baggerly et al. found a single sinusoidal baseline noise component in their MALDI dataset, which they speculate was produced by use of an alternating current power source (28). The noise frequency was estimated by Fourier transform of a hand-selected data region, and then eliminated by subtracting out a sinusoid of this frequency. Following this procedure, the residual baseline was modeled with a modified local minimum at each m/z value and them removing the modeled baseline.
Bylund et al. developed a unique "orthogonal background subtraction" approach for LC-MS baseline correction (25), wherein principal component analysis (PCA) is applied to time vectors (one vector per time point, over all m/z) from a region in the dataset expected to consist solely of background. The noise subspace was then characterized by taking the top principal components, and noise is removed by subtracting the components of the data that lie in this subspace. While such a model may prove useful, it is not clear whether it is appropriate for LC-MS analysis; PCA operates in a global way (across all time points in this case) and therefore does not allow for noise characteristics to change over time.
Concerns have been raised about the suitability of data filtering followed by parametric fitting (e.g. based on a peak model) (29), and thus one must carefully consider the ultimate goal of a LC-MS-profiling experiment when performing tasks in sequence. To our knowledge, no systematic comparison of the effects of various filtering/background subtraction techniques on MS data integrity has yet been reported, and it is advisable to investigate the consequences on downstream analysis (e.g. classification, peak detection). Because the filtering efforts published to date were performed on only one data dimension (m/z), it will be interesting to see if filtering independently in both axes (time and m/z) or simultaneously is more beneficial.
| MID-LEVEL PROCESSINGPEAK DETECTION AND QUANTIFICATION |
|---|
Radulovic et al. used an iterative coarse-to-fine strategy to extract two-dimensional (in time and m/z) peaks from LC-MS data (18). Neighboring points in the data matrix deemed to be signal (rather than noisesee previous section on filtering) were combined to form peaks at the coarsest level, and then iteratively through each of the more refined levels, with a bisection method used to avoid spurious peak mergers. Peaks were quantified by summing individual grouped feature intensities. On the other hand, Wang et al. detected LC-MS peaks based on coinciding local signal maxima, in time and m/z; local maxima are defined as an increase in ion abundance greater than a prespecified threshold over a predefined range (10). Peaks were then quantified either by summing intensity over the component elution time or based on the maximum peak height. Unfortunately, the authors do not describe how the component elution time was determined. Similar techniques were used in Refs. 20 and 30.
Yasui et al. (31) defined peaks (in SELDI-MS data) as m/z elements exhibiting higher intensity than the N nearest neighbors, with N chosen empirically, with an added constraint that peaks have a higher intensity than the "broader" neighborhood as calculated by a local linear super-smoother method. In an effort to reduce dataset misalignment problems, m/z values within ±0.2% of these peaks were further selected as additional peaks. Although peak finding was not the emphasis of the study of Tibshirani et al. (22), these authors used a similar routine to scan for m/z peaks in SELDI/MALDI datasets exhibiting a higher intensity than a prespecified number of closest neighbors. In contrast, Randolf et al. made use of multiscale wavelet decomposition to detect peaks in MALDI-MS data (32), trying to avoid ad hoc decisions pertaining to thresholds and filter parameters. The wavelet decomposition provides a breakdown of the signal according to scale (and location), and by taking the derivative at each scale these authors detected scale-specific peaks. In a histogram of peak locations extracted in this way from many samples, locations with high counts were considered as evidence of true peaks, although the authors did not offer a method for choosing an optimal scale.
Instead of attempting to find peaks, Idborg et al. applied a curve resolution approach developed by the Chemometrics community for processing GC-MS profiles (see Ref. 33 for an overview of Chemometrics analysis techniques) to extract the major spectral components in LC-MS profiling of urine (34, 35), a notably simpler mixture than tissue or serum. In this manner, a data matrix, X, defining a single LC-MS experiment, with rows corresponding to time points and columns to m/z values, was resolved into a set of "pure" spectral profiles. These profiles were stored as column vectors of length m in the matrix, S, with the complete set of profiles approximately spanning the entire space of all m/z spectra generated across the various time points. The relative amount of each spectral profile present at a given time point was then estimated in the "pure" concentration profiles, stored as t row vectors of length a in the matrix C. Unexplained variation was represented by matrix E, and X = CST + E. An iterative method was used to solve for C and S, starting from an initial set of "key" spectra and using constraints related to the non-negativity of the concentrations (i.e. related to the fact that one can only add in physical components, not remove them), to update the spectral profiles. However, Idborg et al. do not describe how to select the number of spectral profiles, although various approaches to this problem are provided in Ref. 33, as well as other algorithms used for spectral resolution. Because use of these techniques has been largely restricted to relatively simple mixtures, it is not clear how well these methods will scale to more complex proteomic samples.
Overall, peak detection has generally been performed in a rather ad hoc manner, with little evaluation of the effectiveness of the various methods or parameter choices. The algorithms employed to date make no use of a priori or learned information with regards to peak shape, along either the time or m/z dimensions, and in some cases ion intensity values are only exploited very indirectly. Rather than retaining abundance information, peaks are frequently binarized (18, 31). Radulovic et al. do not motivate this decision, while Yasui et al. state that it helps to overcome noise in the signal. Such a step is lossy and is likely suboptimal for downstream analysis. Incorporating richer information would likely improve analytical performance, albeit at the cost of more computation. The underlying methodologies of machine learning and statistical techniques are intended to account for random variation caused by noise, and their performance is likely deteriorated by using them with binarized MS data. A less-extreme approach, and one retaining more information, would be to apply filtering and/or baseline subtraction as well as intensity normalization rather than binarization. It has been suggested that at present LC-MS is still not generally a quantitative science (36). Peak detection and quantification, even if done optimally, does not guarantee linearity of peak signal relative to analyte concentration due to possible ion suppression effects, although compelling evidence of linearity of extracted LC-MS peak intensities, at least for spiked reference proteins, has been established using certain data-processing methods and technological platforms (10, 18).
| DE-ISOTOPING, CHARGE DECONVOLUTION, AND PEAK MATCHING |
|---|
1% of naturally occurring carbon exists with seven (as opposed to the more common six) neutrons (13C versus 12C); this signature dominates other isotopes because
50% of the mass of a typical peptide is carbon (37). Although isotope variants are chemically identical, the heavier isotope sister ion peaks exhibit greater apparent m/z than the predominant mono-isotopic peak (all 12C). Based on the mass differential, one can deduce the charge state of multiply charged ions. Wang et al. and Anderle et al. applied a de-isotoping step (to assign isotope patterns) before peak quantification (10, 20). Though few details are provided, the algorithms were apparently based on cross-correlating the observed peak envelopes to reference isotopic tables, with the highest-scoring match identifying the most probable isotope shoulders. One can imagine that probabilistic modeling techniques, such as hidden Markov models (HMMs), may significantly improve upon this template-matching scheme. In contrast, Tibshirani et al. opted to smooth MALDI/SELDI data along the m/z axis as a simpler alternative to deconvolution (22). Extremely complex samples may prove to be less amenable to full de-isotoping.
Peak matching is another related topic relevant to quantitative proteomic comparisons. To measure reproducibility of peptide signal, experimental peaks must be matched across LC-MS datasets. Naive methods, based on simple proximity (in time or m/z space), are reported to be effective (10, 20, 39). For instance, Radulovic et al. used MS/MS-derived sequence identities to verify the correct matching of
200 putative peptides across multiple samples (18). However, given that MS/MS targets prominent peaks, this assessment is likely biased. Anderle et al. (20), on the other hand, found it necessary to remove
2% of data pointspresumed outliers thought to be attributable to mismatching (20). Incorporation of prior knowledge of peak shape, instrument m/z drift, and a more-probabilistic formulation might significantly improve the effectiveness of peak detection, quantification, and matching.
| DATASET ALIGNMENT AND COMPARISON |
|---|
Alignment algorithms typically involve either: i) maximizing some objective function over a parametric set of transformations (usually linear); or ii) nonparametric alignment, by way of dynamic programming (DP); or iii) some combination of these methods (e.g. piecewise linear transformations). Some of the main differences among algorithms are: i) whether alignment is performed before or after peak detection (i.e. using either continuous signal or peaks); ii) whether or not signal amplitude is used; and iii) whether or not changes in scale are corrected for (i.e. allowing for interplay between these two types of corrections). Most algorithms used to date require a template, specified a priori, to which all time series are pre-aligned. Because suboptimal template choice could result in poor alignments, it may be desirable to avoid this.
Nielsen et al.s correlated optimized warping (COW) algorithm (40) and modifications of it were used by Bylund et al. (41) to align chromatographic data by dividing the time axis into segments, and then performing a linear warp within each segment to optimize overlap while constraining segment boundaries to maintain agreement. An objective function defined the optimal set of transformations based on the sum of correlation coefficients or covariance between data segments in pairs of samples, and it was maximized by way of DP. COW can be applied to both scalar and vector time series by defining the correlation or covariance appropriately. Use of more than one data vector component (e.g. multiple m/z bins rather than the TIC) produced more stable alignments with respect to variation of free parameters such as maximum allowable warp (40). Nielsen et al. established visually using artificial chromatograms that COW is robust to varying peak numbers, heights, and widths, and is superior to a global linear warp (40). An interesting evaluation is provided in Ref. 41, where PCA was performed on the base peak chromatograms (BPC). The amount of variance explained by the top two principal components was 70% before alignment and 98% afterward. Similarly, explained variance went from 60 to 97% with a seven-component parallel factor analysis (PARAFACa generalization of PCA to three-way data), indicating a reduction in the major sources of sample variation.
Radulovic et al. (18) approach to alignment in time used binarized data and one alignment per data block. The data matrix was divided into five equal m/z partitions and six equal time partitions, creating 30 blocks. One experimental dataset was warped to match the other by applying a linear transformation with offset to each block and constraining neighboring blocks to have similar transformations. Monte Carlo maximization was used to find the optimal set of linear transformations as defined by a "data overlap" objective function. Finally, a post hoc "wobble" of peaks in time was applied to compensate for residual peak drift. Empirical assessment of feature-wise overlap of the datasets implied a considerably improved alignment (18), although the authors did not use an objective measure because their evaluation was based on the function being optimized. Another drawback of the proposed method is that it works only with binarized data, and is therefore sensitive to choice of binarization threshold, and does not exploit signal amplitude, which ultimately may be more informative.
Randolf et al. used coarse scale-specific peaks, extracted by multiscale wavelet decomposition, to align MALDI data along the m/z axis (32). Dominant peaks (above some threshold) were used to compute a single optimal shift for all peaks, and thus the alignment is not very flexible as it does not allow for even a simple linear stretch or compression. It is also not clear whether or not features detected at different scales should be aligned differently or could leverage one another in alignment. In contrast, Idborg et al. do not explicitly align their datasets, but compare detected components derived by curve resolution, using cross-correlation, and shifting individual components to account for constant time shifts between experiments. Components correlating above some threshold are said to be identical (34, 35).
We have recently developed the CPM, an HMM-based model to do multiple alignments of time series (for continuous-valued output, such as the abundances in an MS experiment) using LC-MS TIC traces (19). In the context of the CPM, one can think of the HMM (42, 43) as containing a series of hidden states, each of which represents some underlying "true" or canonical time, to which each scan header in the TIC is ultimately assigned. The alignment in time is dictated by which scan header gets (probabilistically) mapped to which hidden state. The states are called "hidden" because until the algorithm is run on the data (i.e. until the model is trained), we do not know which scan headers map to which states. In addition to the hidden time states mentioned, hidden states are also augmented by "scale" states, which allow scaling of the TIC amplitudes locally in time. Use of the model after training provides a mapping to both time and scale states, thereby performing alignment and normalization concurrently. Training, whereby the best parameters for the HMM are found, is performed by maximum likelihood (i.e. the objective function is defined to be the likelihood of the data under the HMM probabilistic model framework) by way of Expectation-Maximization (44). Both training and later use of the model (i.e. deduction of which scan headers map to which hidden states) are performed efficiently in HMMs by use of DP. The CPM has the advantages that no template is required, all experimental TICs are aligned simultaneously (leveraging the information across experiments), normalization local in time is concurrent with alignment, and the model is probabilistically formulated. As pointed out in Ref. 19, the algorithm can be extended to use vector time series rather than TICs and also to allow alignment of nonreplicate data (see end of this section).
The classical algorithm for aligning time series is dynamic time warping (DTW), a DP-based approach that originated in the speech recognition community as a robust distance metric between pairs of time series (45). DTW aligns one time series to a specified reference series. It is closely related to COW, except instead of moving only nodes (time segment boundaries) around, every data point can be moved; thus, transformations are not restricted to piecewise linear. It is likely that Bylund, Nielsen et al. avoided use of this less-restrictive model in order to reduce computational costs and avoid overfitting (40, 41). In contrast, Wang et al. used DTW on LC-MS data, but constrained the analysis to no more than 200 m/z bins so as to make it computationally practicable (10).
Hierarchical clustering was used in a novel way by Tibshirani et al. (22) to align MALDI/SELDI spectra along (log) m/z space after peak detection. Input to the hierarchical clustering algorithm is a list of putative candidate extracted peaks as well as the Euclidean distance between peaks in log m/z space. After clustering was completed, the dendogram was cut off at an empirically determined level, with the mean m/z of each cluster defining an individual peak.
A brief description is given for an alignment in m/z space for SELDI data by Sauve et al. (27) who used a tree, built for example by hierarchical clustering of samples, to guide the progressive warping of related experiments together. Although few details were reported, presumably the algorithm starts by aligning the two closest samples, forming a single pseudo-sample, to which the next closest sample is aligned, etc. Such a method removes the need for a prespecified template, but is likely adversely affected by the fact that the distances between spectra are measured before alignment, and hence are largely meaningless. It is also not clear that starting with the two closest spectra is any less arbitrary or more effective than other approaches.
As one moves from a local alignment perspective to a global one, a bias-variance trade-off comes into effect. That is, with more data constraining the transformation, more stable alignments result. The optimal trade off is determined by the informativeness of the LC-MS signal used and the type of misalignments present. In terms of what data are best for use in alignment, there are a number of choices. To date, these have largely been limited to the TIC, BPC, and to individual (or sets of) extracted m/z ion chromatograms. In practice, one could select ion chromatograms based on the highest ion count, or highest sum of second derivatives along the m/z axis, or rather use some dynamic binning of m/z such that the ion signal is evenly distributed to attempt to extract a smaller number of informative m/z bins rather than simply choosing evenly spaced ones. Alternatively, one could apply a dimensionality reduction technique, such as PCA, on the m/z space, with the aim of using a smaller number of features that are still informative (in this case, the features would be "eigen-m/z," that is, pseudo-m/z bins made up of linear combinations of the original m/z bins).
Whether one uses piecewise linear transformations in small regions, such as in Refs. 18 and 41, or more flexible alignment schemes, such as in Refs. 10 and 19, ultimately may have little importance, so long as the overall transformation is not restricted to a global linear warp. Incorporating local scaling simultaneously with alignment may also prove to be advantageous, as reported in Ref. 19. Another issue to consider is whether to align m/z bins individually, together, or somewhere in between (i.e. in a smoothly varying way). The issue of whether one should do alignments before or after peak detection has not been clearly answered. Assuming it were possible to correct the LC time axis before peak detection, one could better leverage the information encoded across aligned datasets to achieve more reliable and sensitive peak detection. Historically, with LC-MS data, researchers have concentrated on correcting the time axis, ignoring the m/z axis. However, corrections are commonly performed along the m/z axis in SELDI/MALDI experiments, suggesting it may be desirable to do so with LC-MS data (though this may be instrument-dependent). Alignment algorithms are typically formulated to work on datasets that are very similar to each other. However, if one knows a prior that the samples may differ significantly in a few (unknown) locations (by an unknown amount), for example in comparisons of cancer and noncancer serum, then this should be incorporated into the model, as suggested in Ref. 19. This should improve the overall performance of alignment algorithms and may be a fruitful direction to pursue.
| DATA NORMALIZATION |
|---|
Global normalization refers to cases where all features are simultaneously used to determine a single normalization factor between two experiments, while local normalization refers to cases where a subset of features are used at a time (different subsets for different parts of the data). Locality can be defined by, say, similarity in m/z values, time (scan headers), or abundance (peak intensity) levels. For example, in an abundance-dependent, local normalization, peaks of similar abundance within the same MS experiment would be scaled in a similar way, while peaks of different abundance are scaled in a different way. If the mean of all features is made to agree across all experiments, it is referred to as a global mean normalization, as for example used by Sauve et al. (27). By plotting the point-wise log ratio of matched features between datasets versus either m/z or abundance, Sauve et al. (27) visually established that no trend existed along either the m/z or intensity axes (27) and hence that the normalization method need not take these factors into account. While several groups have opted for global abundance normalization, in the case of LC-MS data it may be necessary to normalize locally in time (19), because chromatography can produce irregular fluctuations in signal.
Many of the normalization techniques applicable to LC-MS data have also been applied to the results of microarray experiments (46). With gene expression profiles, the genes used for normalization have sometimes been restricted to so-called "housekeeping" genes presumed to remain constant across the experimental conditions. An analogous concept was applied to LC-MS data by Wang et al. (10), whereby a constant intensity ratio between pairs of experiments was computed based on reference peaks. These authors noted, however, that the use of all detected peaks provided similar results. Baggerly et al. likewise considered using "housekeeping" peaks to normalize, but stated that they were unable to find stable peaks across experiments (28).
Anderle et al. (20) opted to normalize multiple LC-MS samples to a single reference dataset, using median global normalization over pairs of matched peaks, while Wagner et al. and Baggerly et al. (28, 39) chose global mean normalization for processing MALDI data. Radulovic et al. also used global normalization, with each dataset multiplied by K, such that the total number of intensity values exceeding some predefined threshold was set to equal the somewhat arbitrarily chosen value 100 (18). In contrast, Tibshirani et al. (22) normalized their MALDI/SELDI spectra by mapping the 10th and 90th percentiles to 0 and 1, respectively (linearly interpolated between). In our own recent study (19), TIC traces were normalized collectively in conjunction with dataset alignment, leveraging information contained across all experiments simultaneously. Normalization was done locally at each scan header (but globally across m/z), with the constraint that neighboring scan headers have similar scaling factors.
Normalization is often evaluated by calculating the coefficient of variation (CV) between peaks across different experiments after normalization. While reasonable CVs (e.g. <30%) are commonly reported, a comparison to CVs from prenormalized data is often not provided. Moreover, because no systematic comparison of these various normalization techniques has been reported, it is difficult to assess their relative merits. While Sauve et al. reported no abundance-dependent artifacts with SELDI data (27), it will be interesting to see if this holds more generally across data sets, and also for LC-MS data.
| DATA TRANSFORMATIONS AND ERROR MODELS |
|---|
To examine patterns of variation and to deduce the variation attributable to sample preparation, Anderle et al. conducted a well-controlled LC-MS study, borrowing established parametric models of heteroscedasticity (i.e. unequal variance, in this case, across peaks with different abundance levels) from the microarray community. Human serum was fractionated into 40 samples (after removal of the most-abundant proteins and following tryptic digestion), with half of these analyzed directly by LC-MS, and the other half recombined and again resplit before analysis. The variation in the amplitude of matched peak intensities formed the basis of their study. A variance model,
2x
µ2 + ßµ, was fit to the observed intensity values (with true mean abundance µ and constants
and ß) and observed visually to be appropriate. The quadratic term dominates at higher-intensity levels resulting in a constant CV. Fitting the pooled and individual datasets to this model, it was shown that the pooled samples exhibited a CV of 11%, while the individually processed samples had a CV of 20%, suggesting much variation is attributable to sample preparation. Anderle et al. also report that application of this error model to two randomly divided subsets of individually processed data, in conjunction with t tests on each of the matched peaks between the groups, resulted in far fewer false-positives as compared with no data transformation. The number of false-negatives was not assessed, however, and could be high. Moreover, it would be beneficial to evaluate this model with other LC-MS datasets to ensure that the patterns of variability are generally valid.
Satten et al. (48), on the other hand, take the view that "all identification should be made using only features that well exceed a noise threshold, to ensure that the resulting classification algorithm has scientific validity." While these concerns are warranted, the emphasis appears to be in the wrong place. Classification algorithms, when used properly, for example in the context of Bayesian methods, or using cross-validation (with feature selection inside the cross-validation loop (49)), will not assign any importance to random structure in the data. Conversely, the approach offered by Satten et al. (48) can cause loss of small, but significant signals, reducing the efficacy of profiling. They also state that "the goal of analysis [is] to show that standardization and de-noising algorithms retain sufficient information to allow categorization." We counter that data processing should improve information availability. Otherwise, faced with more challenging tasks, unnecessarily conservative processing steps may obscure the answers we seek.
Stein et al. mention that event-counting detectors, such as electron multipliers, generate signals with "average random deviation" proportional to the square root of signal intensity (23). The proportionality constant was determined empirically (for GC-MS) and was invariant to m/z across a variety of MS platforms except at lower signal strengths where background noise becomes dominant. Assuming "average random deviation" refers to mean standard deviation, these results match closely to those of Anderle et al. (20).
We would conclude, largely on the basis of the Anderle study (20), that log transformation of MS data is appropriate, and that the error model they presented can be used to properly stabilize the variance of low-intensity peaks, or to obtain robust estimates of variance. It would also be interesting to find out whether these transformations could be applied directly to a raw data matrix rather than to extracted peak abundances.
| HIGH-LEVEL PROCESSING |
|---|