From the Departments of Computer Science and Engineering and Medicinal Chemistry and ** Molecular and Cellular Biology Program, University of Washington, Seattle, Washington 98195, Fred Hutchinson Cancer Research Center, Seattle, Washington 98109, Institute for Systems Biology, Seattle, Washington 98103, ¶ Systems Biology Group, Institut Pasteur, 75015 Paris Cedex 15, France, |||| Institute of Molecular Systems Biology, 8093 Zurich, Switzerland, and ¶¶ Department of Molecular Biology and Biochemistry, Wesleyan University, Middletown, Connecticut 06459
| ABSTRACT |
|---|
Although mass spectrometers are capable of both selective and sensitive measurements, mass analyzers are limited in their dynamic range. The consequence of this is a limited capability to detect very low abundance analytes in biological samples with large dynamic range. Concomitantly mass spectrometer duty cycles limit the number of CID events per unit of time and often lead to a significant undersampling of more complex proteomes (6). Furthermore the subset of peptides being sampled for CID can vary from one experiment to the next, hindering both interpretation and confidence in quantification.
Many approaches therefore go beyond the straightforward use of CID for large scale protein identification. These include the accurate mass and time tag approach (7), clustering (8), complete workflow solutions for LC-MS data sets (9–12), alignment algorithms for LC-MS (13, 14), and feature detection approaches for SELDI platforms (15–17). These approaches rely heavily on high quality LC-MS profiles for peak alignment, peptide identification, and quantitation and thus require a high degree of reproducibility in the sample collection, processing, and analytical run conditions (12).
Reproducibility in LC-MS experiments has become increasingly essential with the popularization of large scale quantitative proteomics (19). Sources of variation that greatly affect this reproducibility can vary depending on the experimental platform and, in particular, the choice of instrument. Critical sources of variation common to all LC-MS experiments include variation in signal intensity, mass accuracy (20–24), and elution profile. The latter can be further described by variations in peptide elution time, elution order, and peak width (25). Sample collection (e.g. harvesting conditions), preparation protocols (e.g. freeze/thaw cycles), experimental design (e.g. run order), platform stability (e.g. column and spray), and sample stability can affect results, thereby leading to putative biomarkers that have no biological basis or to biomarker discovery with high false positive rates. Further experimental variation is often observed when comparing results across laboratories (26, 27) and instruments.
Previous work highlighting these problems (19–24) has suggested that careful experimental design can minimize variation. Recently focused studies have assessed reproducibility of MS/MS acquisition (28) and ICAT labeling (29). Thus, to improve reproducibility of various protocol steps, biologists and chemists are exploring various techniques and ideas, for example different numbers of washes, different columns, different sample processing strategies, randomizing run order, etc. However, the lack of proper measures of experimental reproducibility prevents investigators from having a complete understanding of the impacts of their protocol modifications, thus impeding progress.
A common strategy to study reproducibility in sample and platform quality control is to spike a set of standard peptides/proteins into a sample and then describe variance in the measurement through coefficient of variation scores (30, 31). This strategy is problematic in two respects. First, using 5–10 peptides to represent the complexity of the entire sample is a classic case of statistical under-representation, especially for complex samples like human serum, which may contain a few hundred thousand peptides (32). Second, each coefficient of variation score captures only one particular reproducibility factor, e.g. mass accuracy, intensity, etc., ignoring all other factors. In addition, there are no well characterized coefficient of variation scores for factors such as signal-to-noise ratio, elution profile, elution time, etc. Comparative measures for experimental data have proven to be key enablers of progress in other scientific domains. Examples include the BLAST (Basic Local Alignment Search Tool) (33) E-value score for the degree of homology between proteins, and the Phred (34) score, a key ingredient in enabling quality-assessed, large scale, automated DNA sequencing.
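To make the limitation concrete, the following minimal Python sketch computes the score in question for one spiked standard. The function name `coefficient_of_variation` and the intensity and elution time values are hypothetical illustrations, not data from this study. Each call summarizes a single factor of a single peptide, which is exactly why a handful of such scores cannot describe a whole run.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean of the measurements."""
    m = mean(values)
    if m == 0:
        raise ValueError("mean is zero; the coefficient of variation is undefined")
    return stdev(values) / m


# Hypothetical measurements of ONE spiked standard peptide across four runs.
intensities = [100.0, 110.0, 90.0, 100.0]      # arbitrary intensity units
elution_times = [30.0, 30.5, 29.5, 30.0]       # minutes

# One score per factor: neither number says anything about the other factor,
# nor about the many peptides that were not spiked in.
cv_intensity = coefficient_of_variation(intensities)
cv_elution = coefficient_of_variation(elution_times)
print(f"intensity CV = {cv_intensity:.4f}, elution time CV = {cv_elution:.4f}")
```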
An ideal measure of reproducibility in LC-MS-based proteomics would build on a qualitative and quantitative measurement of every peptide or protein present in a sample and compare experiments based on this knowledge. However, our inability to attain identification for the majority of peaks in an experiment makes a corresponding definition of reproducibility impractical.
| RESULTS |
|---|
Building upon an existing alignment algorithm to compute this measure of similarity between two LC-MS experiments (9), we present a tool, Chaorder (Footnote 1), that produces a visual representation of the similarity relationships in a set of experiments. Chaorder is highly efficient, scalable, and parallelizable and thus can handle the gigabytes of data that current state-of-the-art instruments generate. It can handle data generated from a variety of instruments and of varying sample complexities (shown under "Results"). It takes into account all features (mass accuracy, elution profile, intensity, elution times, signal-to-noise, etc.) of all signals present in the data. Building on this tool, we propose a methodology for quality control of LC-MS data that is flexible enough to be applied to any LC-MS proteomics platform.
The application of our methodology to data from different laboratories revealed significant and consistent biases in large scale LC-MS experiments despite careful experimental design. Biases detected in this study are caused by experimental protocols, like HPLC washing, sample freeze/thaw cycles, run order, and date. The fact that these are standard elements of any sample protocol suggests that these biases may be occurring in many proteomics experiments today. The systematic exploration of these biases may thus be an important first step on the way to the design of unbiased LC-MS proteomics experiments.
Chaorder takes as input a list of experiments and computes the bounded alignment score between each pair of experiments (9). It then represents each experiment as a point in two dimensions such that the Euclidean distance between a pair of points approximates the inverse of the alignment score between the corresponding two experiments (details are described under "Workflow"). A set of experiments that are expected to be similar (e.g. almost perfect technical repeat experiments) correspond to high pairwise alignment scores, and the corresponding points tend to appear in close proximity in the two-dimensional image. Conversely distant points correspond to dissimilar experiments.
In our experience, approximate ranges of pairwise alignment scores (Euclidean distances) are empirically correlated with different qualitative classes of similarity. A distance between two points close to 0 represents ideal levels of similarity; a distance between 0 and 0.2 represents achievable (and tolerable, depending on the experimental setup) levels of reproducibility usually seen in repeat experiments with the exact same experimental setup. A distance between 0.2 and 0.5 represents biases in the experimental setup if experiments were expected to be similar (e.g. technical replicates). The experimental setup could then be studied further for more in-depth analysis. Distances greater than 0.5 usually represent similarities caused by the same sample being analyzed under very different perturbations, different platforms, etc. Distances greater than 0.7–0.8 represent unrelated experiments, e.g. comparisons between yeast cell lysate with human serum. Beyond interpretations of individual pairwise distances, the two-dimensional image can be used to understand systematic effects that may occur over time due to possible sample degradation or changing experimental conditions (presented in detail below).
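The empirical ranges above can be written down as a small lookup, sketched here in Python. The function name `similarity_class` and the class labels are hypothetical, and because the text gives the last boundary only as 0.7–0.8, the default `unrelated_cutoff=0.7` is an assumption that should be adjusted per platform.

```python
def similarity_class(distance, unrelated_cutoff=0.7):
    """Map a pairwise distance to the qualitative class described in the text.

    unrelated_cutoff is an assumed value inside the 0.7-0.8 range quoted above.
    """
    if distance < 0:
        raise ValueError("distances are non-negative")
    if distance <= 0.2:
        return "reproducible"            # achievable for true repeat experiments
    if distance <= 0.5:
        return "possible bias"           # suspicious if replicates were expected
    if distance <= unrelated_cutoff:
        return "same sample, very different conditions"
    return "unrelated"                   # e.g. yeast lysate versus human serum


for d in (0.05, 0.35, 0.6, 0.9):
    print(d, "->", similarity_class(d))
```

These cutoffs are rules of thumb drawn from experience, so a value near a boundary is better read as a prompt to inspect the pair of experiments than as a verdict.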
We applied Chaorder to a variety of data sets, ranging from simple quality control studies over simulated biomarker experiments to real biomarker studies. Most of these experiments were LC-MS runs with no MS/MS acquisition. Detailed experimental protocols of these data sets are provided in the Supplemental Appendix.
Instrument Comparison for Reproducibility: Instrument Variability and Run Order Effects—
Ten repeat LC-MS experiments using human serum were performed on two instruments, a Q-TOF and a QSTAR with an electrospray source. Both instruments are based on the time-of-flight principle but come from different manufacturers.
Fig. 1 shows the Chaorder image of this data set. Each point represents a complete LC-MS run of the unfractionated human serum. The colors correspond to the different instruments (QSTAR = blue, Q-TOF = red), and the data labels represent the order in which the experiments were run. A number of observations can be made. First, the data from the two instruments form two distinct clusters. Second, the QSTAR instrument shows much more variation than the Q-TOF instrument. Third, run order effects are revealed: successive experiments tend to move in one direction. Sample carryover is one possible explanation for this effect. These data also show the problems faced by the community with different choices of instruments available. There is little hope that results generated on the QSTAR instrument can be reproduced on a Q-TOF instrument, and comparing data between the two platforms would present a significant challenge.
Fig. 2 shows the Chaorder image for this data set. Again the image reveals strong run order effects as the run order is inferred without any prior information. Our own manual follow-up analysis revealed that, globally, the total ion intensity was reduced over this series of experiments. Many peptides maintain their intensity levels, but many other intensity levels are significantly reduced. This suggests that the sample is degrading over time. Another possibility is a systematic change of digestion efficiency. Overall Fig. 2 represents an example where sample preparation steps appear to introduce biases that are not well characterized.
Fig. 3 shows the Chaorder image for this data set. Each data point represents one Angiotensin II LC-MS experiment. Squares and triangles correspond to different LC columns, colors represent the different dates on which experiments were run, and the data label corresponds to the run order. Without having been informed by any prior knowledge but the raw data, Fig. 3 reveals a clear split between the two columns. Surprisingly the variation between colors (dates) appears to be larger. Beyond this, the image reveals a strong association by date, and within each date, strong run order effects are revealed. Upon further examination, we found that although the sample contained only Angiotensin there was significant carryover, and many non-Angiotensin features were observed in the resulting output that resemble peptide peaks. Feature identification using msInspect (35) resulted in the detection of more than 100 features. Because these experiments were performed between the yeast experiments, the carryover is expected, but the strong association of these experiments by run order and dates reveals the potential problems with the variation of the analytical platform.
Fig. 4 shows the Chaorder image for this data set. The four-protein experiments are shown in blue, the five-protein experiments are shown in red, and the data label corresponds to the date on which experiments were run. A first observation is that, even after randomizing the run order, the experiments cluster by the dates they were run on, especially on the 19th and 21st. Second, the four- and five-protein data seem to be distinguishable, but each group clusters only weakly; this weak clustering could make it difficult to differentially identify the spiked-in fifth protein. This test experiment reveals a fundamental problem for biomarker discovery caused by low reproducibility: any actual biomarkers may remain hidden behind all the less interesting experimental variations.
Fig. 5 shows the Chaorder plot for time points 0 min (blue) and 30 min (red) of this data set. The data labels correspond to the SCX fraction, and multiple data points with the same label are the repeat experiments of that particular SCX fraction after additional freeze/thaw cycles. A first observation is that the horizontal axis of the image approximately reflects the different time points, and the vertical axis reflects the SCX chromatography. This is remarkable because Chaorder generated the plot without any additional prior knowledge. As one would expect, successive SCX fractions differ significantly but also share a set of common proteins, which then occur in related RPLC fractions. Fig. 5 captures this similarity of certain RPLC fractions as well. Another interesting point to note is that the repeat experiments do not cluster as tightly as one might expect. Our detailed analysis revealed many unexpected artifacts presumably generated by the freeze/thaw cycle, e.g. change in noise levels, changes in the intensity levels of many peptides, MS/MS undersampling, etc.
Fig. 6 shows the Chaorder plot for the 3-month data. This data set has three wild type mice (squares) and five in the homozygous/heterozygous class (triangles). Triplicate experiments for a single mouse are shown in a single color. The data label represents the run order. A first observation to be made from the plot is that the triplicate experiments are not as tightly clustered as one could have expected. Because there are multiple sources of variation (mice-to-mice, homozygous/heterozygous/wild type, replicates, etc.), one does not expect to see a very simple cluster structure, but one might hope to see at least that mice in one disease state would cluster together. Looking at Fig. 6, one can see that this is not the case. Instead an added effect of run order creates a more complex pattern of associations. For example, some experiments close in run order (0-1-2-3, 5-6-7, 14-15, etc.) are close by. Note that Chaorder identified these clusters without any added prior knowledge. The clustering suggests that run order is creating artifacts in the data that reduce the statistical power of 1) clustering replicates and 2) distinguishing wild type from disease type. In particular, the 3-month data have to be treated with caution. Similar effects are suggested in the 6- and 9-month data, although the effects of run order are much weaker.
| WORKFLOW |
|---|
Prakash et al. (9) presented the ChAMS (Footnote 2) method to align LC-MS experiments based on raw MS1 signals following the principle described above. The alignment algorithm is based on a score that measures the similarity between pairs of mass spectra. On this basis, ChAMS produces a mapping between related spectra, i.e. the spectra that contain peaks generated by the same peptide. The alignment score is the average spectra similarity score between all spectra paired by the alignment, and the alignment is entirely based on the MS1 spectra. The algorithm is capable of handling data from different mass analyzers (e.g. FT, TOF, LTQ, etc.) by tuning the mass resolution parameter (defined in Ref. 9). The Supplemental Appendix describes the alignment algorithm in more detail.
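The averaging step described above can be sketched in Python as follows. This is an illustration only: the spectrum similarity that ChAMS actually uses is given in Ref. 9 and is not reproduced in this article, so the cosine similarity on binned spectra below is a stand-in assumption, and the names `spectrum_similarity`, `alignment_score`, and `pairing` are hypothetical. The pairing itself is taken as given; finding it is the job of the alignment algorithm.

```python
from math import sqrt


def spectrum_similarity(s, t):
    """Cosine similarity of two binned spectra (dict: m/z bin -> intensity).

    Stand-in for the similarity score of Ref. 9, which is not shown here.
    """
    dot = sum(v * t.get(k, 0.0) for k, v in s.items())
    norm_s = sqrt(sum(v * v for v in s.values()))
    norm_t = sqrt(sum(v * v for v in t.values()))
    if norm_s == 0.0 or norm_t == 0.0:
        return 0.0
    return dot / (norm_s * norm_t)


def alignment_score(run_a, run_b, pairing):
    """Average spectrum similarity over all (index_a, index_b) pairs."""
    sims = [spectrum_similarity(run_a[i], run_b[j]) for i, j in pairing]
    return sum(sims) / len(sims) if sims else 0.0


# Two toy runs of two MS1 spectra each; bins are integer m/z values.
run_a = [{500: 3.0, 501: 4.0}, {620: 1.0}]
run_b = [{500: 3.0, 501: 4.0}, {700: 2.0}]
score = alignment_score(run_a, run_b, pairing=[(0, 0), (1, 1)])
print(score)  # first pair is identical, second pair shares no peaks
```

In this sketch the bin width plays the role of a mass resolution setting: coarser bins suit lower resolution analyzers, finer bins suit higher resolution ones.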
Given a list containing N LC-MS experiments (possibly from different instruments), Chaorder computes the alignment score between every pair of experiments. If A and B are two LC-MS experiments, their distance d(A,B) (9) is as follows.
[Equation image not recoverable from the source; d(A,B) is defined in Ref. 9.]
This results in an N x N matrix of distances. Multidimensional scaling (38) then identifies each experiment with a point in two dimensions such that the distances between any two experiments A and B approximately represent the pairwise distances d(A,B). Such an embedding in two dimensions necessitates a certain distortion of some of the distances d(A,B), but the present study suggests that the global embedding in two dimensions still reveals major global effects. Furthermore as the embedding can be rotated without changing any of the embedding distances, the axes do not have a particular significance; the embedding just illustrates the relative distance relations between all experiments. The Supplemental Appendix describes multidimensional scaling in more detail. Chaorder provides other views of the data as well, e.g. a pair of experiments can be analyzed for their differences and similarities using ChAMS (9).
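The embedding step can be illustrated with classical (Torgerson) multidimensional scaling, sketched below in pure Python. This is an assumption about the variant: the article cites Ref. 38 without stating which form of multidimensional scaling Chaorder uses. The function name `classical_mds_2d`, the power-iteration eigen solver, and the toy distance matrix are all hypothetical, and the sketch assumes the distances are close enough to Euclidean that the two dominant eigenvalues are positive.

```python
import random
from math import sqrt


def classical_mds_2d(dist, iterations=500, seed=0):
    """Embed a symmetric N x N distance matrix as N points in two dimensions."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double centering turns squared distances into an inner-product matrix.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    coords = [[0.0, 0.0] for _ in range(n)]
    for axis in range(2):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for _ in range(iterations):          # power iteration
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        vv = sum(x * x for x in v)
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigenvalue = sum(v[i] * bv[i] for i in range(n)) / vv   # Rayleigh quotient
        if eigenvalue <= 0.0:
            continue                         # no usable variance on this axis
        unit = [x / sqrt(vv) for x in v]
        for i in range(n):
            coords[i][axis] = sqrt(eigenvalue) * unit[i]
        # Deflate so the next pass finds the second eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigenvalue * unit[i] * unit[j]
    return coords


# Toy distances between four experiments (corners of a 0.3 x 0.4 rectangle).
d = [[0.0, 0.3, 0.5, 0.4],
     [0.3, 0.0, 0.4, 0.5],
     [0.5, 0.4, 0.0, 0.3],
     [0.4, 0.5, 0.3, 0.0]]
points = classical_mds_2d(d)
```

Because only the inter-point distances are fitted, the returned coordinates are determined up to rotation and reflection, which is why the axes of a Chaorder image carry no fixed meaning.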
Most of the analysis under "Results" could also have been obtained by manual analysis of the data. In fact, many of our conclusions about run order effects, etc. were validated by manual analysis. However, just as with the manual analysis of tandem mass spectra, manual analysis alone is not feasible for the current, let alone expected future, large scale experimentation. Chaorder performed all of the above analyses in a matter of a few minutes or hours, depending on the size of the data, on a single Linux desktop computer. We are not aware of any other method that can perform the above similarity analysis with such high efficiency and quality. The software Chaorder is available on request.
| DISCUSSION |
|---|
The other case studies illustrate the need to systematically address the biases of each step of an experimental setup, for example freeze/thaw cycles, number of washes, chemicals used, trypsin digestion efficiency, instrument choices, column choices, the day the experiments were run, the scientist performing each step, etc. Judicious decisions are required about each of these to achieve the highest levels of reproducibility at each step, thus yielding an experiment design that allows statistically valid conclusions about the underlying biological phenomena.
To this end, the software Chaorder is robust and efficient and is, to our knowledge, the first tool to assess global LC-MS experimental reproducibility and similarity. We present results from various studies completed in a number of different laboratories that show experimental reproducibility being significantly affected by sample processing steps, experimental protocol, and instrument choices. The low degree of reproducibility indicated by Chaorder in all case studies suggests a widespread need for quality control experiments and the use of quality assessment tools before any kind of comparative data analysis (e.g. by MS or MS2 with Sequest (39)).
Chaorder can also be used to identify outlier experiments, which can either be analyzed manually or be removed from downstream analysis if feasible. When deciding between different types of columns, Chaorder allows an assessment of their reproducibility. Measuring the experimental quality of the wash runs, Chaorder can suggest how many wash runs lead to a clean column. Chaorder can help tune the experimental setup, e.g. indicate whether a longer column is required for higher reproducibility, whether the column is degraded beyond tolerable limits, etc. It can also help tune the sample processing steps, e.g. the choice of parameters for freeze/thaw cycles, choice of chemicals, etc.
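One simple way to turn a pairwise distance matrix into outlier candidates is sketched below in Python. This is not a description of how Chaorder flags outliers, which the article does not specify; the function name `flag_outliers`, the use of the median distance to all other runs, and the default `threshold=0.5` are assumptions chosen for illustration.

```python
from statistics import median


def flag_outliers(dist, labels, threshold=0.5):
    """Return labels of runs whose median distance to all other runs is large.

    The median keeps a single distant run from dragging its neighbours
    over the threshold with it.
    """
    flagged = []
    for i, label in enumerate(labels):
        others = [dist[i][j] for j in range(len(labels)) if j != i]
        if others and median(others) > threshold:
            flagged.append(label)
    return flagged


# Hypothetical distances: three tight replicates and one distant run.
labels = ["run1", "run2", "run3", "run4"]
dist = [[0.0, 0.1, 0.1, 0.8],
        [0.1, 0.0, 0.1, 0.8],
        [0.1, 0.1, 0.0, 0.8],
        [0.8, 0.8, 0.8, 0.0]]
outliers = flag_outliers(dist, labels)
print(outliers)
```

A flagged run is a candidate for manual inspection rather than automatic removal, in line with the options described above.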
The problems uncovered here may be tolerable for studies in which qualitative aspects matter most but appear critical to render mass spectrometry useful as a quantitative survey and discovery tool. Because we used data from multiple mass spectrometry laboratories, these issues do not appear to be unique to any one of them; instead they seem to represent general challenges facing the mass spectrometry community. We found that one of the strongest parameters to affect the experimental reproducibility is the order in which experiments are run. There are multiple possible causes, e.g. change in instrument calibration over time, sample degradation, LC column degradation, etc. Because results from different laboratories have similar characteristics, it is likely that a combination of several causes is the source of the observed biases. Significant efforts will need to be focused in this direction to understand and eliminate these biases to strengthen the downstream analysis.
Beyond the potential to improve reproducibility across different instruments and sample processing protocols within a single laboratory, Chaorder is also a first step toward the comparison and standardization of experimental platforms and conditions across laboratories, another important step to make mass spectrometry more reliable, trustworthy, and relevant for biomedical research.
| ACKNOWLEDGMENTS |
|---|
| FOOTNOTES |
|---|
Published, MCP Papers in Press, July 24, 2007, DOI 10.1074/mcp.M600470-MCP200
1 Taken from the book Birth of the Chaordic Age by D. W. Hock and Visa International (18).
2 The abbreviations used are: ChAMS, chromatography aligner using mass spectra; SCX, strong cation exchange; RPLC, reversed phase LC.
* This work was supported in part by the National Institutes of Health through the University of Washington NIEHS-sponsored Center for Ecogenetics and Environmental Health (NIEHS, National Institutes of Health Grant P30ES07033) and the Northwest Regional Center of Excellence for Biodefense and Emerging Infectious Diseases (Grant 1U54 AI57141-01). The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
S The on-line version of this article (available at http://www.mcponline.org) contains supplemental material.
|| To whom correspondence and requests for the software Chaorder should be addressed. Fax: 206-543-2969; E-mail: amol{at}cs.washington.edu
| REFERENCES |
|---|