Originally published In Press as doi:10.1074/mcp.M600274-MCP200 on May 17, 2007.

Molecular & Cellular Proteomics 6:1354-1364, 2007.
© 2007 by The American Society for Biochemistry and Molecular Biology, Inc.


Research

Experimental and Statistical Considerations to Avoid False Conclusions in Proteomics Studies Using Differential In-gel Electrophoresis*,S

Natasha A. Karp‡,§, Paul S. McCormick,||, Matthew R. Russell‡,** and Kathryn S. Lilley‡,‡‡

From the ‡ Department of Biochemistry, University of Cambridge, Building O, Downing Site, Cambridge CB2 1QW, United Kingdom and Department of Chemistry, University Chemical Laboratory, Lensfield Road, Cambridge CB2 1EW, United Kingdom


    ABSTRACT
 
In quantitative proteomics, the false discovery rate (FDR) can be defined as the number of false positives within statistically significant changes in expression. False positives accumulate during the simultaneous testing of expression changes across hundreds or thousands of protein or peptide species when univariate tests such as the Student's t test are used. Currently most researchers rely solely on the estimation of p values and a significance threshold, but this approach may result in false positives because it does not account for the multiple testing effect. For each species, a measure of significance in terms of the FDR can be calculated, producing individual q values. The q value maintains power by allowing the investigator to achieve an acceptable level of true or false positives within the calls of significance. The q value approach relies on the use of the correct statistical test for the experimental design. In this situation, a uniform p value frequency distribution when there are no differences in expression between two samples should be obtained. Here we report a bias in p value distribution in the case of a three-dye DIGE experiment where no changes in expression are occurring. The bias was shown to arise from correlation in the data from the use of a common internal standard. With a two-dye schema, where each sample has its own internal standard, such bias was removed, enabling the application of the q value to two different proteomics studies. In the case of the first study, we demonstrate that 80% of calls of significance by the more traditional method are false positives. In the second, we show that calculating the q value gives the user control over the FDR. These studies demonstrate the power and ease of use of the q value in correcting for multiple testing. This work also highlights the need for robust experimental design that includes the appropriate application of statistical procedures.


Quantitative proteomics, the study of global changes in protein expression, is a rapidly growing field where two-dimensional (2D)1 gel electrophoresis (1, 2), differential labeling of proteins and peptides with stable isotopes (3), and label-free mass spectrometric peak intensity measurements (4) are pivotal approaches to measuring changes in the expression level of proteins. The output of large scale genomic sequencing and gene expression studies has driven the need to globally assess protein behavior, which is central to our understanding of cellular function and disease processes. In quantitative studies, the quantitation of the protein or peptide signal allows the comparison of protein levels from one state with another. Regardless of the technique used, the data generated can be analyzed with a variety of statistical approaches. In a simple comparison of two samples, changes in protein expression are considered to be significant when above a specified threshold (5). More rigorous studies incorporating replicates use univariate (1, 6) or multivariate statistical methods (1, 7) to analyze the data. Univariate methods, such as the Student's t test, analyze the data from each individual protein species to determine whether the differences between the samples are significant. In multivariate methods, such as principal component analysis, all the data are simultaneously analyzed to look for patterns in expression. Each of these statistical methods involves a number of mathematical assumptions; however, these assumptions are often ignored, potentially leading to erroneous results (8). Regardless of the analytical method used to collect the proteomics data, subtle issues with statistical analysis must be overcome to ensure that valid conclusions are made.

To date, the majority of published quantitative proteomics studies have used the established 2D gel electrophoresis approach for quantitation and applied a univariate test to identify protein species with significant changes in expression (8), although an increasing number of studies are now using non-gel-based approaches. These tests calculate the probability (p) of observing a test statistic more extreme than the one calculated from the data given that the samples are from the same population (i.e. that any apparent change in expression occurs by chance alone). Typically an expression change is considered significant if the calculated p value falls below a prescribed significance threshold, for example 0.01 (the per comparison error rate (PCER)). Two types of errors are possible: a false positive (type I error) occurs when a protein species is erroneously declared to be differentially expressed; a false negative (type II error) occurs when the test fails to detect a differentially expressed species. In expression studies, many thousands of statistical tests are conducted, one for each species. A substantial number of false positives may accumulate at the 0.01 significance threshold because, on average, 1% of the tests will appear significant even if no changes in protein expression exist between the two samples. This accumulation of false positives is termed the multiple testing problem and is a general property of a confidence-based statistical test when applied across multiple features (where in DIGE each feature is a gel spot). Although the issue of multiple testing in quantitative proteomics has been discussed (9), the extent of this problem has rarely been tested, and it has the potential to lead to a substantial number of false leads.
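
The scale of the multiple testing problem can be illustrated with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and the probability of at least one false positive assumes that the tests are independent, which real spot data need not satisfy.

```python
def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of false positives when no feature truly changes."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests: int, alpha: float) -> float:
    """Chance of one or more false positives, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    # Compare a few hypothetical study sizes at the 0.01 threshold.
    for n_tests in (100, 800, 2000):
        print(
            n_tests,
            expected_false_positives(n_tests, 0.01),
            round(prob_at_least_one_false_positive(n_tests, 0.01), 4),
        )
```

The expected count grows linearly with the number of features tested, which is why a fixed p value threshold becomes less informative as more spots are matched.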

In the microarray field, the multiple testing problem has been the subject of detailed discussion. Early methods used the familywise error rate (FWER), which controls the probability of one or more false rejections (type I error) among all tests conducted. The simplest and most conservative approach is the Bonferroni correction, which adjusts the threshold of significance by dividing the PCER by the number of comparisons being completed (10). For example, when testing 1000 protein species, the PCER threshold of 0.05 would be divided by the number of tests, leading to a stringent per-test significance threshold of 0.00005, which controls the FWER at 0.05. This approach fails to consider between-feature dependence within the data and does not take into account the risk of false positives associated with the proportion of tests for which no change occurs. In proteomics data, between-feature dependence can arise because proteins influence each other in complex interactions within and across biochemical pathways and when a protein is represented multiple times; for example in 2D gel electrophoresis, a single protein can be represented in multiple positions in a charge train. Within the microarray community, controlling the FWER has been found to be too conservative and has led to many missed features of interest. It has also been argued that in the context of exploratory experiments, where later confirmatory investigations are utilized, allowing a few false leads would not present a serious problem if the majority of significant species were correctly chosen (11–13). Subsequently the concept of allowing false positives among the genuine changes has been widely accepted in the large scale data analysis of microarray data (14). This has led to the application of methodologies to control the false discovery rate (FDR) where the focus is on achieving an acceptable ratio of true and false positives.
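
The Bonferroni adjustment described above amounts to a single division. The following is a minimal sketch with hypothetical function names, not code taken from any of the cited studies.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold that controls the FWER at alpha."""
    return alpha / n_tests


def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p value that survives the Bonferroni-adjusted threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p <= threshold for p in p_values]


if __name__ == "__main__":
    # Testing 1000 species at an overall level of 0.05.
    print(bonferroni_threshold(0.05, 1000))
```

Because the threshold shrinks in proportion to the number of tests, a study with many features needs very small p values before anything is called significant, which is the loss of power noted in the text.
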
Benjamini and Hochberg (15) originally defined the false discovery rate as the proportion of changes identified as significant that are false. For example, an FDR of 10% means that on average 10% of changes identified as significant would be expected to have arisen from type I errors. The original FDR methodology was also considered to be too conservative for discovery experiments because it does not take into account that in these experiments a proportion of features genuinely change (16, 17). Consequently more recent methods focus on optimizing procedures to control and eliminate false discoveries by considering the behavior of the data.
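
For orientation, the step-up procedure usually attributed to Benjamini and Hochberg can be sketched as follows. This is a generic textbook form written for illustration; the function name and the default FDR level are assumptions, and it is not the tool used in this study.

```python
def benjamini_hochberg(p_values, fdr=0.10):
    """Return a True/False flag per p value using the step-up FDR procedure."""
    m = len(p_values)
    # Indices of the p values, ordered from smallest to largest p.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k such that p_(k) <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank

    # Every feature ranked at or below the cutoff is called significant.
    significant = [False] * m
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant


if __name__ == "__main__":
    print(benjamini_hochberg([0.5, 0.001, 0.03, 0.01], fdr=0.05))
```

The procedure treats every feature as potentially unchanging, which is the source of the conservativeness that the q value approach aims to reduce.
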

An extension to the FDR was developed by Storey and calculates a q value for each tested feature (16, 17). The q value is the expected proportion of false positives incurred when making a call that this feature (i.e. protein species) has a significant change in expression between the two samples. Although the p value is a measure of significance in terms of the false positive rate of the test, the q value is a measure in terms of the false discovery rate. Consequently using a threshold focused solely on p values (e.g. methods that manage the FWER) controls the rate that unchanging features are called significant, whereas a focus on the FDR controls the rate of significant features being false. Thus, the calculation of FDR focuses attention on the content of the features called significant. The q values are calculated from the p values obtained for all features within a study with an easy to use point-and-click tool developed by Storey and Tibshirani (16). The frequency distribution of the p values is used to estimate the proportion of features that are unchanging; this is then used to estimate the false discovery rate (Fig. 1). The estimation process relies on the assumption that a statistical test appropriate for the experimental design and data characteristics was used. The use of an appropriate test in an unchanging situation results in a uniform distribution of p values being obtained. The q values are calculated using the p value distribution; this allows the experimenter to select an FDR threshold that is appropriate to their experiment. Restriction of analysis to features with a q value ≤x results in an overall FDR ≤x. An advantage of the q value approach is its tolerance of weak dependence between features (16). As argued by Storey and Tibshirani (16) for microarray data, we also reason that the dependence in proteomics data between protein species can be considered weak because as the number of measurements increases, the dependence becomes negligible. 
Studies comparing various FDR controlling methods on microarray expression data have found that the q value approach gives the highest apparent power (14). Several FDR controlling procedures exist; however, the q value is easy to use, maintains power, is tolerant of between-feature dependence, and leads to an output that allows the experimenter to select an appropriate FDR. This makes the q value approach ideal for use in quantitative proteomics.


FIG. 1. A frequency histogram illustrating the distribution of p values typically expected from Student's t tests of proteomics expression data when 23% of the features are changing. Proteins with no change in expression will contribute to a uniform frequency distribution, whereas proteins that change in expression will tend to have a low p value. The FDR process is searching for a non-background event. To assess this, the process needs to separate the background distribution from the changing distribution that together contribute to the p value distribution. The first step is to estimate the uniform frequency distribution arising from the background unchanging population using the flat portion of the frequency histogram assuming that most p values near 1 will be background events (dashed line). The q value is calculated for each feature by using the p value of that feature as a significance threshold and estimating the proportion of false positive within the total number of features called significant.
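
The two steps described in the legend can be sketched in a few lines. This is a simplified illustration, not the point-and-click tool itself: it estimates the unchanging proportion from the p values above a single fixed cut-off (the hypothetical parameter `lam`, set here to 0.5), whereas the published method smooths the estimate over a range of cut-offs.

```python
def estimate_pi0(p_values, lam=0.5):
    """Estimate the proportion of unchanging features from p values above lam."""
    m = len(p_values)
    n_above = sum(1 for p in p_values if p > lam)
    # Under a uniform background, a fraction (1 - lam) of null p values exceed lam.
    return min(1.0, n_above / (m * (1.0 - lam)))


def q_values(p_values, lam=0.5):
    """Return one q value per p value, in the original order."""
    m = len(p_values)
    pi0 = estimate_pi0(p_values, lam)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so that each q value is the smallest
    # estimated FDR over all thresholds at or above that p value.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        fdr_at_threshold = pi0 * m * p_values[idx] / rank
        running_min = min(running_min, fdr_at_threshold)
        q[idx] = running_min
    return q


if __name__ == "__main__":
    print(q_values([0.001, 0.002, 0.6, 0.8]))
```

Because the background estimate relies on the flat portion of the histogram, any distortion of the p value distribution in an unchanging situation feeds directly into the q values.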

 
Some studies applying microarray methods of analysis to protein expression data have incorporated multiple testing controlling procedures and found that the number of protein species selected as significant was reduced (8, 18–20). The multiple testing controlling procedures were not the main focus of these publications, and thus little attention was paid to the underlying assumptions utilized in the process.

Two-dimensional polyacrylamide gel electrophoresis allows the resolution of thousands of proteins, resulting in a global view of the proteome. The resulting spot patterns can be visualized by labeling the sample prior to the separation, e.g. with radioactivity (21) or fluorescence (1, 6, 22), or after the separation with total protein stains such as colloidal Coomassie (2), SYPRO Ruby, or Deep Purple (23, 24). The spot volumes can be compared from one sample to another, and this is an established technique in expression studies. The fluorescence DIGE approach has a high sensitivity and consequently is a powerful tool in expression studies. The high sensitivity arises from the use of an internal standard, which has been shown previously to substantially increase the accuracy of quantification (25). DIGE involves labeling samples with spectrally resolvable fluorescent CyDyes™ (Cy2, Cy3, and Cy5; GE Healthcare). The labeled samples are then mixed prior to isoelectric focusing and resolved on the same 2D gel. For a multigel approach, one of the CyDyes™, typically Cy2, is used to label an internal standard, which consists of an average sample. This standard sample is used to match the spot patterns across a gel series and to calculate a standardized abundance value for each spot that can be compared across many gels. Typically the Cy3 and Cy5 CyDyes are used to label two samples from two independent groups to ensure a dye and gel balance (7, 26). A typical 13-cm DIGE gel contains 1000–1500 spots; of these, 600–1000 spots are generally matched across a six-gel series, depending on the quality of the gels (25). At a simplistic level, in a comparison of two groups across 800 spots, eight false positives could arise with the PCER significance threshold currently used in the field (p < 0.01).

This study focused on the issue of multiple testing in the context of 2D gel electrophoresis in conjunction with DIGE. To build on established approaches, this study used the Student's t test because the majority of published DIGE studies utilize this methodology to identify significant changes in expression.2 The Student's t test is a simple test that assumes the data are randomly sampled from normal distributions and show homogeneity of variance. Regardless of the statistical approach used, all tests have underlying assumptions that need to be considered, and all univariate tests suffer from the issue of multiple testing. The q value procedure relies on a uniform distribution of p values when no changes in expression occur. To investigate this, we obtained data where the sample was identical. These same-same data were found to give a bias in p values toward higher values. Following these empirical observations, data simulations were used to understand the structure found with the DIGE same-same data. Based on these observations, we make recommendations about the design of DIGE experiments that allow the application of the FDR correction procedures. Using the new experimental design, the q value approach was applied to two biological problems that represent typical questions being addressed in proteomics laboratories. The impact of the new experimental design on the calls of significance is discussed and demonstrates how this methodology avoids a flood of false positive results.


    EXPERIMENTAL PROCEDURES
 
Datasets—
To assess the distribution of p values when no changes in expression are occurring, same-same datasets were obtained where the same sample was run across a six-gel set using the three-dye system resulting in up to 12 data points per spot. In a same-same gel, three 50-µg aliquots of the sample were labeled individually with Cy2, Cy3, and Cy5; mixed; and separated by 2D gel electrophoresis as detailed by Karp and Lilley (25). Three independent same-same datasets were obtained using an Erwinia carotovora (ECA) wild type sample to investigate reproducibility. To assess the behavior across sample types, same-same datasets were also obtained for murine brain, liver, and heart tissue. For a dataset based on the two-dye system (Cy3 and Cy5), two of the Erwinia same-same datasets were combined.

To demonstrate the application of the FDR procedure when the two-dye DIGE system was utilized, two biological questions were studied. In biological study 1, the effect of a per2 knock-out in mouse liver samples was assessed; study 2 looked at the effect of an ECA0020 mutation in Erwinia. In these studies, four biological replicates for each group being compared were used. In the two-dye system, 50 µg of each sample was labeled with Cy3 and combined with a 50-µg Cy5-labeled mixed sample. The pooled samples were separated with 2D gel electrophoresis, and the fluorescent images were visualized following standard methodology (25).

Sample Preparation—
For the same-same studies, the bacterial samples were grown in liquid broth medium (10 g/liter Bacto tryptone, 5 g/liter Bacto yeast extract, and 5 g/liter sodium chloride) at 30 °C with agitation at 300 rpm overnight and harvested by centrifugation for 10 min at 4 °C at 5000 rpm. Cells were resuspended in lysis buffer (8 M urea, 2% (w/w) CHAPS, 5 mM magnesium acetate, 10 mM Tris, pH 8.0, and protease inhibitor mixture set I at 1× concentration (Calbiochem)) and lysed by sonication (3 × 10-s pulses on ice). Brain, liver, and heart tissue were harvested from a Wistar rat 24 h postbirth. Cells were homogenized in lysis buffer (8 M urea, 2% (w/w) amidosulfobetaine-14, 5 mM magnesium acetate, 10 mM Tris, pH 8.0, and protease inhibitor mixture set I at 1× concentration (Calbiochem)) using a motorized pestle, and cells were lysed by three cycles of freeze thawing and sonication.

In biological study 1, to investigate the effect of a per2 knock-out mutation in mice, liver samples were harvested from per2 knock-out and wild type mice with synchronized circadian clocks and extracted 6 h after the onset of activity (circadian time 18) and then frozen at –70 °C. A tissue block was taken from each sample and homogenized in CHAPS lysis buffer. Cell debris was removed by discarding the pellet formed after centrifugation for 10 min at 4 °C at 4500 rpm. To harvest the soluble protein fraction, the sample was centrifuged at 13,000 rpm for 10 min at 4 °C, and the pellet was discarded. Material from the livers of four genetically identical mice was extracted for each sample type, giving biological replicates.

In biological study 2, to investigate the effect of a mutation in E. carotovora subspecies atroseptica SCRI1043 gene ECA0020, the strain was generated by allelic exchange as described in Coulthurst et al. (27). For each sample type, four independent cultures were grown in pectate lyase minimal medium (27) for 24 h at 25 °C with shaking at 300 rpm and harvested by centrifugation for 10 min at 4 °C at 5000 rpm. Cells were resuspended in lysis buffer (8 M urea, 2% (w/w) CHAPS, 5 mM magnesium acetate, 10 mM Tris, pH 8.0, and protease inhibitor mixture set I at 1× concentration (Calbiochem)) and lysed by sonication (3 × 10-s pulses on ice). To harvest the soluble protein fraction, a low speed centrifugation was used to remove cell debris (15 min at 8000 rpm at 4 °C); this was then followed by a high speed centrifugation to remove insoluble material (10 min at 13,000 rpm at 4 °C). For all samples the protein concentrations were determined using the Bio-Rad DC protein assay as described by the manufacturer.

Data Analysis—
Gel analysis was performed using DeCyder™ Biological Variation Analysis Version 5.02 (GE Healthcare), a software package designed specifically to be used for DIGE, following the manufacturer's recommendations. Data were normalized within the software using a ratiometric approach, and a log10 transformation was used on the standardized abundance to stabilize variance. The estimated number of spots for each co-detection was set to 2500. Studies focused on spots that were matched across the gel series. The q value was calculated using the p values calculated in DeCyder via a point-and-click tool provided by Storey and Tibshirani (16). Statistical power was calculated as detailed in Karp and Lilley (25).
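
As a rough illustration of the quantity being analyzed, the sketch below forms a standardized abundance as a simple ratio of a sample spot volume to the internal standard spot volume and then applies the log10 transformation. It deliberately omits the normalization performed inside the software, and the function name and the example volumes are hypothetical.

```python
import math


def log_standardized_abundance(sample_volume: float, standard_volume: float) -> float:
    """log10 of the ratio of a sample spot volume to its internal standard."""
    if sample_volume <= 0 or standard_volume <= 0:
        raise ValueError("spot volumes must be positive")
    return math.log10(sample_volume / standard_volume)


if __name__ == "__main__":
    # A spot with twice the volume of the standard, and the reverse case.
    print(log_standardized_abundance(200.0, 100.0))
    print(log_standardized_abundance(100.0, 200.0))
```

On the log scale a doubling and a halving are symmetric about zero, which is one reason ratio data are usually transformed before a t test is applied.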

Data Simulation—
Data simulations were completed using the free software R (28). The scripts used are available in Supplemental Appendix 1.


    RESULTS
 
Investigating the Underlying Assumption with the Three-dye System
Accurate application of the q value procedure assumes the correct calculation of p values, which is dependent on the use of an appropriate statistical test. A uniform distribution of p values in a situation where no differences exist between groups can be used to test whether the correct statistical test is being utilized. From the same-same data, utilizing a three-dye system, the log-standardized abundance values from each gel were randomly assigned to either group 1 or group 2 ensuring a dye balance (for an example, see Fig. 2A). Groups 1 and 2 were compared with a Student's t test. With the three-dye system, the same-same datasets resulted in a bias in p value toward higher values. This indicates that the data were more similar than expected from random sampling, suggesting that a Student's t test is not suitable (Fig. 3A). For this to occur, an underlying assumption of the Student's t test is not being met, leading to a distortion in the p values obtained. The Student's t test assumes independent sampling, normality, and homogeneity of variance of which the latter two have been assessed previously for the DIGE technique by examining same-same data and found to be valid (25). This suggests that in the traditional three-dye approach, the final standardized abundance (SA) data for a spot are not truly independent. This leads to a similarity in the data that results in the groups being more alike than expected by random chance, giving a bias toward a p value of 1.


FIG. 2. Diagrammatic representation of the three- and two-dye schema utilized in DIGE.

 

FIG. 3. Examples of the p value distributions seen in a Student's t test comparison of same-sample data. A, an example profile obtained when a three-dye system is utilized while ensuring a dye balance. B, an example profile obtained when a two-dye system is utilized. These results were taken from an Erwinia same-same dataset.

 
Investigating the Underlying Assumption with Data Simulations
The bias in p values observed with the three-dye system when analyzing same-same data with a Student's t test suggests that the assumption of random sampling is not true leading to within-spot correlation. This effect was hypothesized to arise from the use of a common internal standard spot volume in the calculation of the two standardized abundance values obtained from a gel for a given spot. Any error in the internal standard spot volume would be common between the two SA values obtained from the same gel, leading to a common distortion in the final SA value for those values. To investigate whether this design could lead to bias in the p value distribution, data were simulated based on a variety of experimental designs using a straightforward system where all spots were assumed to have the same mean and S.D.

First the observation that a uniform frequency distribution of the p values is obtained when no changes are occurring provided the assumptions of the Student's t test are upheld was confirmed by data simulations. Data were randomly sampled from two normally distributed populations with the same mean and S.D. giving two groups with 10 data points, and the groups were compared using a Student's t test. This was repeated a thousand times. When the assumptions of the Student's t test are upheld the resulting p values indeed gave a uniform frequency distribution (Fig. 4A). This was tested for a variety of different mean and S.D. combinations (data not shown).


FIG. 4. The p value frequency distribution obtained from data simulations of a variety of experimental approaches to give 10 data points per group that were then compared with a Student's t test. A, the data have been independently sampled from groups with the same mean and S.D. This is equivalent to the utilization of the spot volume with no internal standard. B, the data have a standardized abundance approach but are obtained by use of a common internal standard for each pair of values obtained from a gel equivalent to a three-dye schema. C, the data have a standardized abundance approach but are obtained by the use of independent internal standards equivalent to a two-dye schema.

 
Data were then generated to mimic those obtained from same-same data with the typical DIGE experiment schema. The Cy3, Cy5, and Cy2 modeled values were obtained by randomly sampling data from normally distributed populations with the same mean and S.D. to give 10 data points per group. The standardized abundance values were calculated as a ratio with each gel pair (Cy3 and Cy5 values) being divided by a common internal standard (Cy2) value. The SA values were assigned to either group 1 or 2 following the typical DIGE experimental design (Fig. 2A), and the generated data were then compared with a Student's t test. This was repeated a thousand times, and the resulting p values gave a bias toward 1 in a frequency distribution (Fig. 4B). This was tested for a variety of different mean and S.D. combinations (data not shown).
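
The within-spot correlation introduced by a shared denominator can be seen directly, without running a t test. The sketch below is written in Python for illustration and is not the R script from the supplemental appendix; the mean of 100, S.D. of 10, number of simulated gels, and all function names are assumptions.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_sa_pairs(n_gels, shared_standard, mean=100.0, sd=10.0, seed=0):
    """Simulate the two SA values per gel for same-same data."""
    rng = random.Random(seed)
    sa_cy3, sa_cy5 = [], []
    for _ in range(n_gels):
        cy3, cy5 = rng.gauss(mean, sd), rng.gauss(mean, sd)
        standard_a = rng.gauss(mean, sd)
        # With a shared standard, both ratios use the same denominator.
        standard_b = standard_a if shared_standard else rng.gauss(mean, sd)
        sa_cy3.append(cy3 / standard_a)
        sa_cy5.append(cy5 / standard_b)
    return sa_cy3, sa_cy5


if __name__ == "__main__":
    print("shared standard:", pearson(*simulate_sa_pairs(5000, True)))
    print("separate standards:", pearson(*simulate_sa_pairs(5000, False)))
```

Under these assumptions the two SA values from one gel are clearly positively correlated when they share a standard and essentially uncorrelated when each has its own.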

To ensure that the observed bias is arising from the common internal standard approach rather than the use of the ratio, the process was repeated but with the use of just two dyes where one dye was used as the internal standard and the other was used for the sample, and the resulting SA value was compared across the gels series (Fig. 2B). This two-dye approach is comparable to the Cy3 being used to label the sample and Cy5 being used to label the internal standard. The Cy3 and Cy5 same-same values were obtained by randomly sampling data from populations with the same mean and S.D. to give 10 data points per group. The SA values were obtained by randomly assigning the sampled values to either group 1 or 2, and the groups were compared with a Student's t test. This was repeated a thousand times, and in this situation the p values gave a uniform frequency distribution (Fig. 4C). The data simulations confirm that the bias observed in the same-same data from the traditional DIGE schema was arising from the utilization of the common internal standard value leading to within-spot correlation. This would be avoided by the use of the two-dye schema.
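
The three simulated designs can be compared side by side. The following is a stdlib-only Python sketch written for illustration rather than the authors' R scripts; the mean, S.D., seed, and function names are assumptions, and the two-sided p value is obtained by numerically integrating the t density with Simpson's rule instead of calling a statistics library.

```python
import math
import random
import statistics


def t_two_sided_p(t, df, steps=400):
    """Two-sided p value for a t statistic via Simpson integration of the density."""
    t = abs(t)
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x):
        return c * (1 + x * x / df) ** (-(df + 1) / 2)

    h = t / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3  # probability mass between 0 and |t|
    return max(0.0, 1.0 - 2.0 * area)


def student_t_p(a, b):
    """Equal-variance (pooled) two-sample Student's t test p value."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * statistics.variance(a)
              + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t_two_sided_p(t, na + nb - 2)


def simulate_p_values(design, n_sims=1000, n=10, mean=100.0, sd=10.0, seed=1):
    """Simulate same-same comparisons with n data points per group."""
    rng = random.Random(seed)

    def draw():
        return rng.gauss(mean, sd)

    p_values = []
    for _ in range(n_sims):
        if design == "independent":      # raw values, no internal standard
            g1 = [draw() for _ in range(n)]
            g2 = [draw() for _ in range(n)]
        elif design == "three_dye":      # one shared standard per gel
            g1, g2 = [], []
            for _ in range(n):
                standard = draw()
                g1.append(draw() / standard)
                g2.append(draw() / standard)
        elif design == "two_dye":        # every value has its own standard
            g1 = [draw() / draw() for _ in range(n)]
            g2 = [draw() / draw() for _ in range(n)]
        else:
            raise ValueError(f"unknown design: {design}")
        p_values.append(student_t_p(g1, g2))
    return p_values


def fraction_above(p_values, cutoff=0.5):
    """Share of p values above the cutoff; about 0.5 if they are uniform."""
    return sum(p > cutoff for p in p_values) / len(p_values)


if __name__ == "__main__":
    for design in ("independent", "three_dye", "two_dye"):
        print(design, fraction_above(simulate_p_values(design)))
```

For same-same data the dye swap used to balance a real experiment makes no difference, so the sketch simply assigns one value per gel to each group in the three-dye case. A fraction well above one half for that design, and close to one half for the other two, corresponds to the bias toward 1 described above.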

Investigating the Underlying Assumption with the Two-dye System
The non-uniform p value distribution seen with the three-dye system arises as the data points are correlated (within-spot correlation) and hence violate the assumption of independence. The correlation was shown to arise from the use of the Cy2 as a common denominator for the two samples from each gel. Mathematically this leads to the variance of the difference between the groups being a composite of the variance of each sample and the co-variance (see Equation 1) (29). Without consideration of the co-variance, the statistical test overestimates the true variance leading to the bias in the p values toward 1. To address this, the experimenter could utilize a significance test that incorporated a term to account for the co-variance or alter the experimental design such that only one data point is obtained on a gel (two-dye approach; Fig. 2B). Against the use of a more complex model, we have shown previously that the three-dye system gives significantly higher variance than the two-dye system when the data were analyzed assuming independent sampling (25). However, it can be argued that when the three-dye system is analyzed with a more complex model it could potentially be more powerful. Assessing the use of a more complex model with the three-dye system was beyond the scope of this study. Furthermore there are risks in overfitting with the use of a more complex model. Consequently the use of a two-dye system was investigated further. In a previous publication the Cy3 and Cy5 dye combination gave the lowest noise; hence this would be the recommended dye pair with one dye labeling the sample and the other dye labeling the standard (25).

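$$\sigma^2_{x-y} = \sigma^2_x + \sigma^2_y - 2\,\sigma_{xy} \qquad \text{(Eq. 1)}$$

where $\sigma^2_x$ and $\sigma^2_y$ are the variances of the two samples and $\sigma_{xy}$ is their co-variance.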
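
The relationship described above, in which the variance of a difference equals the sum of the two variances minus twice the co-variance, can be checked numerically. The data values below are invented for illustration and the function names are hypothetical.

```python
def sample_cov(x, y):
    """Sample co-variance of two equal-length sequences (n - 1 denominator)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)


def variance_of_difference(x, y):
    """Return (direct, composite, naive) estimates of the variance of x - y."""
    diff = [a - b for a, b in zip(x, y)]
    direct = sample_cov(diff, diff)  # variance computed on the differences
    composite = sample_cov(x, x) + sample_cov(y, y) - 2 * sample_cov(x, y)
    naive = sample_cov(x, x) + sample_cov(y, y)  # ignores the co-variance
    return direct, composite, naive


if __name__ == "__main__":
    # Hypothetical paired SA values from five gels sharing a standard.
    x = [1.02, 0.95, 1.10, 0.98, 1.05]
    y = [1.00, 0.97, 1.08, 0.96, 1.06]
    direct, composite, naive = variance_of_difference(x, y)
    print(direct, composite, naive)
```

When the co-variance is positive, the naive sum exceeds the true variance of the difference, which is how a test that assumes independence ends up with p values pushed toward 1.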

Same-same ECA data from a two-dye design were randomly assigned to either group 1 or group 2, and the groups were compared with a Student's t test. This was repeated four times by altering the group assignments, and the distribution was examined (Fig. 3B). Ideally with random sampling effects, a uniform distribution of p values should be obtained. For one of the assignments, some bias toward the higher p values was observed (see supplemental information). These distributions were all obtained from the same dataset but with alteration in the group assignment. The bias may have arisen from similarity in data from the gels that were run in pairs being assigned to opposite groups. This indicates the importance of removing any sources of systematic bias where possible, for example by running the gels in large batches to ensure conditions are as similar as possible. Overall the move to a two-dye system in this study ensured that a uniform distribution of p values would be met in an unchanging situation where no expression differences are expected provided that the assumptions of normality and homogeneity of variance are sustained with the use of biological replicates.

Dataset Selection via Filtering
In the microarray community, the issue of what should be considered as the dataset is quite simple. In 2D gel electrophoresis, however, the issue is more complex because spot detection can lead to many false features being included as potential spots, for example dust particles or smears. Artifact spots could be hypothesized to be unchanging features, thus contributing to the p value background, which will mask the features changing in expression in a multiple testing situation. Removing non-real spots that are contributing to this background could potentially increase the sensitivity. Thus, alternative filters could be utilized in the selection of the dataset provided the p value distribution is independent of such a filter and the filters are chosen in advance of data analysis. Consequently the use of filters in the selection of the dataset was considered.

So far, the analysis has focused on spots matched across the gel series because this should filter for "real" protein spots with the idea that only real spots would consistently be present in that position across the gel series. An alternative approach of using a volume filter was considered. However, spot volume is dependent on a variety of technical issues, e.g. the scanning settings (data not shown), and what is deemed low volume will depend significantly on the downstream processing of samples and sensitivity of instrumentation used for protein and peptide identification. Spots could also be filtered on "realness" as judged by the experimenter; however, this approach would be highly subjective and labor-intensive. Consequently after consideration of the issue our recommendation is to focus on well matched spots.

Correcting for Multiple Testing in Expression Studies
Biological Study 1: Expression Study on the per2 Knock-out—
Utilizing the two-dye DIGE schema, wild type liver samples of mice were compared with per2 knock-out samples utilizing four biological replicates, and the q value methodology was applied. A total of 823 protein spots were detected and matched across the dataset. Using the PCER threshold of 0.01 commonly utilized within the field, eight spots had significant p values. Of these, six would have been picked because they would have been considered suitably abundant for downstream processing in our laboratory.

During analysis, the p value frequency distribution did not give a large increase at low p values suggesting that little was detected as significantly changing (Fig. 5A). The estimated proportion of spots not changing was 0.796, which led to q values for the spots varying between 0.7004 and 0.795. Thus for the six spots identified as significant by the current methodology, 80% are expected to be false positives (Table I).
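As a rough guide to how q values of this size arise, the calculation can be sketched as follows. This is a minimal sketch assuming the estimate of the unchanging proportion is already available and is passed in as `pi0`; the estimation step itself is not shown, and the function name `q_values` is an assumption for the example rather than the interface of any particular package.

```python
def q_values(p_values, pi0):
    """q value for each p value, given pi0, the estimated proportion of unchanging features."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so that each q value is the smallest
    # estimated FDR over all thresholds at or above that p value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        estimated_fdr = pi0 * m * p_values[i] / rank
        running_min = min(running_min, estimated_fdr)
        q[i] = running_min
    return q

if __name__ == "__main__":
    # Evenly spread p values mimic a dataset with no detectable change:
    # every q value then equals pi0.
    m = 823
    flat = [(i + 1) / m for i in range(m)]
    print(min(q_values(flat, 0.796)), max(q_values(flat, 0.796)))
```

When the p values are spread evenly, every q value collapses to the estimated unchanging proportion, which is the behavior seen in this study. Put another way, with 823 spots and an unchanging proportion of 0.796, about 0.796 × 823 × 0.01 ≈ 6.6 unchanging spots are expected to fall below p = 0.01 by chance alone, against the eight observed.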



 
FIG. 5. Assessment of the p value distribution obtained from the DIGE expression study on the mouse liver per2 knock-out. A, frequency histogram of the p values where the dashed line illustrates the estimated uniform frequency distribution arising from the unchanging population. B, a uniform Q-Q plot of the p value distribution.

 


 
TABLE I The six spots that would have been selected in the circadian study because their p values fell below the PCER significance threshold (p < 0.01) with the Student's t test p value and the calculated q value

The expression ratio is the measured ratio change value reported by DeCyder and is calculated from the average ratio to give a value greater than 1 or less than –1 (5).

 
The high q value arose because there is no clear signal of low scoring spots in the p value distribution above the background. An alternative method of assessing the p value distribution is to plot a uniform Q-Q plot, which provides clear representation of deviations from a uniform distribution (Fig. 5B). The Q-Q plot is a graphical technique for determining whether the sample comes from a specified, in this case a uniform, population. The quantile of the target population (y axis) is plotted against the respective sample quantile (x axis) where quantile is the fraction (or percentage) of points below the given value. This graphical approach clearly shows that the p value has no significant deviation from a uniform distribution and hence no difference from the same-same study (Fig. 6). By considering the FDR, the conclusion would be that no spots are significantly changing to warrant downstream processing.
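The construction of a uniform Q-Q plot can be sketched as below. This is an illustrative sketch only: the plotting positions (i + 0.5)/m are one common convention and an assumption here, and the helper names `uniform_qq` and `max_deviation` are hypothetical. Each pair holds the sample quantile (x axis) and the matching quantile of the uniform distribution (y axis), following the axes described above.

```python
def uniform_qq(p_values):
    """Return (sample quantile, uniform quantile) pairs for a uniform Q-Q plot."""
    m = len(p_values)
    ordered = sorted(p_values)
    # For a uniform distribution on [0, 1] the quantile equals the fraction
    # of points below it, so the theoretical value is the plotting position.
    return [(p, (i + 0.5) / m) for i, p in enumerate(ordered)]

def max_deviation(points):
    """Largest vertical distance of any point from the line y = x."""
    return max(abs(x - y) for x, y in points)

if __name__ == "__main__":
    m = 100
    uniform_like = [(i + 0.5) / m for i in range(m)]
    skewed_low = [u * u for u in uniform_like]  # excess of small p values
    print(max_deviation(uniform_qq(uniform_like)))
    print(max_deviation(uniform_qq(skewed_low)))
```

Points that lie on the diagonal indicate p values consistent with a uniform distribution; an excess of small p values, as expected when many features change, bends the points away from the diagonal.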



 
FIG. 6. Standard deviation versus the percentile position for various datasets. The circles with a dotted line indicate the technical noise experienced for the Cy3 and Cy5 two-dye approach to DIGE. The squares with a solid line show the average S.D. calculated from the wild type liver sampled in biological study 1, and triangles with a solid line show the average S.D. for the wild type Erwinia sample in biological study 2. These values are presented as estimates of noise to assist others in the planning of experiments.

 
The mutation studied is expected to have biologically significant changes (30). Failure to detect any changes in the experiment suggests that the design had too little power for the size of expression changes occurring. The noise that encompasses 75% of the spots was found to be 4 times higher in this study compared with the technical noise published by Karp and Lilley (25) (Fig. 6). Completing a power study using the Lenth (31) power tool clearly demonstrates that with only four replicates the power in detecting change was low due to the high biological noise (Fig. 7).



 
FIG. 7. The relationship between power and number of replicates in detecting a 2-fold change when the variance encompasses the noise seen with 75% of spots. The squares with a solid line show the relationship obtained for biological study 1, and triangles with a solid line show the relationship obtained for biological study 2. The noise in biological system 1, the mouse liver, is higher leading to the power dropping below the recommended 0.8 with the four replicates utilized in this study.
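The dependence of power on replicate number and noise can be illustrated with a simple normal approximation. This is not the Lenth power tool used for Fig. 7: it replaces the t distribution with a normal distribution, which overstates power when replicates are few, and the function names, the default significance level, and the example values of `effect` and `sd` are assumptions chosen only to show the trend.

```python
import math
from statistics import NormalDist

def approx_power(effect, sd, n_per_group, alpha=0.01):
    """Approximate power of a two-sided, two-group comparison (normal approximation).

    effect and sd must be on the same scale, e.g. log-transformed abundance.
    """
    z = NormalDist()
    se = sd * math.sqrt(2.0 / n_per_group)  # standard error of the difference in means
    return z.cdf(abs(effect) / se - z.inv_cdf(1 - alpha / 2))

def replicates_needed(effect, sd, target=0.8, alpha=0.01, max_n=200):
    """Smallest number of replicates per group reaching the target power, or None."""
    for n in range(2, max_n + 1):
        if approx_power(effect, sd, n, alpha) >= target:
            return n
    return None

if __name__ == "__main__":
    effect = math.log10(2)        # a 2-fold change on a log10 scale
    for sd in (0.1, 0.2, 0.4):    # hypothetical noise levels
        print(sd, [round(approx_power(effect, sd, n), 2) for n in (4, 8, 12)])
```

Doubling the noise at a fixed number of replicates lowers the power sharply, which is the pattern separating the two biological studies in Fig. 7.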

 
Biological Study 2: Expression Study on the Erwinia Mutant—
Utilizing the two-dye DIGE schema, E. carotovora wild type samples were compared with a mutant utilizing four biological replicates, and the q value methodology was applied. A total of 575 protein spots were detected and matched across the dataset. Using the current PCER significance threshold utilized within the field (p < 0.01), 104 spots had significant p values. Of these, 86 would have been picked because they would have been considered suitable for downstream processing. With the use of the PCER threshold in this multiple testing situation, the extent of the false discovery rate is unknown.

To confirm the conservative nature of the approaches that control the FWER, the Bonferroni correction method was applied. Controlling the FWER to 0.05 led to an adjusted significance threshold (p' < 0.000087) resulting in the selection of only 11 spots as statistically significant. This method controls the chance of any one type I error but assumes independent tests. This method provides strong control of false positives and leads to the strongest statistical inference and high confidence in the selected spots of significance but has little power.
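The adjusted threshold quoted above follows directly from dividing the familywise error rate by the number of tests. A minimal sketch, with hypothetical function names and made-up p values for the selection step:

```python
def bonferroni_threshold(fwer, n_tests):
    """Per-test significance threshold that keeps the familywise error rate at or below fwer."""
    return fwer / n_tests

def select_significant(p_values, fwer=0.05):
    """Indices of tests whose p value falls below the Bonferroni threshold."""
    threshold = bonferroni_threshold(fwer, len(p_values))
    return [i for i, p in enumerate(p_values) if p < threshold]

if __name__ == "__main__":
    # 575 matched spots and a familywise error rate of 0.05, as in this study.
    print(bonferroni_threshold(0.05, 575))
```

For 575 spots the threshold is 0.05/575, approximately 0.000087, so a spot must reach a p value more than 100 times smaller than the PCER threshold of 0.01 to be selected.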

With the application of the Storey q value approach, the proportion of spots that were unchanging was estimated at 0.473, and the q value for each spot was calculated. The frequency distribution of p values, with its preponderance toward low values, shows that a high proportion of spots have a significant change in expression, giving low p values and consequently low q values (Fig. 8).



 
FIG. 8. Assessment of p value distribution. A, a frequency histogram of the p value obtained from the DIGE study in the Erwinia study. The dotted line indicates the estimated uniform frequency distribution arising from the unchanging population. B, a uniform Q-Q plot of the p value distribution.

 
The calculated q value allows an estimation of false discoveries to be calculated for various false discovery rate thresholds (Table II). The results highlight that by increasing the proportion of the false calls the power of the experiment is increased, and a sizable number of significant spots are detected.
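The kind of summary described above can be produced from a list of q values as sketched below. The q values in the example are invented purely for illustration and are not taken from the Erwinia study; the function name `fdr_summary` and the chosen thresholds are likewise assumptions.

```python
def fdr_summary(q_vals, thresholds=(0.01, 0.05, 0.10)):
    """For each FDR threshold: (threshold, spots selected, expected false discoveries)."""
    rows = []
    for t in thresholds:
        selected = sum(1 for q in q_vals if q <= t)
        # Selecting all spots with q <= t keeps the expected proportion of
        # false discoveries among them at or below t.
        rows.append((t, selected, t * selected))
    return rows

if __name__ == "__main__":
    toy_q = [0.005, 0.02, 0.04, 0.08, 0.3]  # made-up values
    for threshold, selected, expected_false in fdr_summary(toy_q):
        print(threshold, selected, round(expected_false, 2))
```

Relaxing the threshold selects more spots at the cost of a larger expected number of false leads, which lets the experimenter match the error rate to the resources available for follow-up.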



 
TABLE II Utilizing the q value approach, the number of spots selected as significant with various rates of false positives estimated for the Erwinia study

Real is the filter that ensures a spot is suitable for downstream processing.

 

    DISCUSSION
 
The multiple testing problem has had little attention in the field of quantitative proteomics; however, the accumulation of false positives can lead to a significant waste of resources in follow-up studies. FDR methodologies, which focus on the balance of false and true positives, address this issue and maintain the power of the experiment in detecting changes in expression. The q value, an extension of the FDR, provides a measure of the significance of each feature while taking into account the fact that thousands of features are simultaneously tested. The q value approach is easy to interpret and implement. The strength of the q value is that it allows the experimenter to choose an error rate that is acceptable to them and their subsequent studies, for example orthogonal validation techniques. In cases where validation of changes in protein expression is facile, the investigator may choose to accept a higher FDR. It is essential to consider the issue of multiple testing for all methods of quantitative proteomics. Here we concentrated on the DIGE method, but the use of the q value approach could be applied equally well to the other quantitative methods.

In the first biological study, looking at the effect of a per2 knock-out in mouse liver, the importance of assessing the false discovery rate was highlighted as this study demonstrated the risk of obtaining false leads if the multiple testing issue was not considered. With the current methodology of a stringent PCER (p < 0.01), six spots would have been chosen as significant, and yet five of these were estimated to be false discoveries with the q value approach. With this approach, the high false discovery rate would lead to no protein species being selected for downstream processing. As per2 is a key negative regulator of circadian rhythms, significant changes in expression are expected upon its knock-out (30). Thus the inability to detect statistically significant changes in expression arises from a lack of statistical power. The power of the experiment was low due to high biological noise and the low number of replicates for the size of changes occurring.

In the second biological study, looking at the effect of an Erwinia mutation, the q value approach was shown to be a revealing system for assessing the false discovery rate. This allows the experimenter to consider the error rate that is acceptable for the downstream studies and resources available. After the expression study is completed, the expected error rate forms a caveat that should influence the interpretation of the results.

The comparison of noise across these biological studies agrees with the microarray studies where biological noise in cell cultures is less than that found in inbred mouse populations (32). It is easy to anticipate that biological variability in human population studies will be larger still. Consideration of the expected variation is essential to ensure that experiments have sufficient power to answer the biological questions being posed.

The study described here highlights the need for researchers to verify assumptions of statistical tests and procedures to achieve proper behavior of the computed tests ensuring valid conclusions. For each technique it can be anticipated that the issues will vary and need individual solutions. The traditional three-dye schema used with DIGE was found to give correlated data, thus violating the assumption of independence in the Student's t test and preventing the meaningful application of the q value approach. The correlation was shown to arise from the use of the Cy2 as a common internal standard for the two samples from each gel, leading to within-spot correlation. This could be accounted for with a more complex statistical test; however, with the high noise on the three-dye approach, it is far simpler to use the two-dye system combined with a more user-friendly statistical analysis. To make valid inferences and allow the application of the q value approach when using DIGE combined with a Student's t test, a two-dye schema is recommended where the Cy3 dye labels the sample and the Cy5 labels the internal standard. The issue of co-variance could also arise for other quantitative techniques that utilize multiplexing and internal standards, such as the iTRAQ tagging system (the amine-modifying labeling reagents for multiplexed relative and absolute protein quantitation) where one of the four or eight possible tags is utilized to label an internal standard common to several separate labeling experiments.
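Why a shared internal standard induces correlation can be shown with a small simulation. This is a sketch under simplifying assumptions that are not taken from the study: the log abundances of the two samples and of the standard are modeled as independent normal noise with equal variance, and the names `shared_standard_correlation` and `pearson` are hypothetical. On a log scale each standardized abundance is a difference, so two values that subtract the same standard share its noise.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def shared_standard_correlation(n_spots=5000, sd=1.0, shared=True, seed=7):
    """Correlation between two standardized log abundances across simulated spots."""
    rng = random.Random(seed)
    ratio_a, ratio_b = [], []
    for _ in range(n_spots):
        sample_a = rng.gauss(0.0, sd)
        sample_b = rng.gauss(0.0, sd)
        standard_a = rng.gauss(0.0, sd)
        # Same-gel design reuses one standard; otherwise each sample has its own.
        standard_b = standard_a if shared else rng.gauss(0.0, sd)
        ratio_a.append(sample_a - standard_a)
        ratio_b.append(sample_b - standard_b)
    return pearson(ratio_a, ratio_b)

if __name__ == "__main__":
    print(shared_standard_correlation(shared=True))
    print(shared_standard_correlation(shared=False))
```

With equal variances the shared-standard correlation is expected to be close to 0.5, whereas separate standards give a correlation close to zero, consistent with the recommendation to keep each sample and its internal standard on its own gel.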

In conclusion, the biological studies presented here, which are typical of the proteomics community, highlight the need for robust experimental design that encompasses the appropriate application of statistical procedures. Such planning should include validation of the assumptions of the tests and procedures to ensure that the conclusions drawn from the study are valid. In designing the experiment, the expected technical and biological variability, the expected size of the expression changes, and the number of replicates will all need to be considered to ensure sufficient power to allow the detection of changes in expression.


    ACKNOWLEDGMENTS
 
We thank Dr. J. Byers, Dr. S. Coulthurst, and Dr. M. Deery for provision of samples. We especially thank Renata Feret for running the gels for the Erwinia biological study.


   FOOTNOTES
 
Received, July 26, 2006, and in revised form, May 14, 2007.

Published, MCP Papers in Press, May 17, 2007, DOI 10.1074/mcp.M600274-MCP200

1 The abbreviations used are: 2D, two-dimensional; FDR, false discovery rate; SA, standardized abundance; ECA, E. carotovora; PCER, per comparison error rate; FWER, familywise error rate; Q, quantile.

2 N. A. Karp, P. S. McCormick, M. R. Russell, and K. S. Lilley, unpublished observation.

* This work was supported in part by Biotechnology and Biological Sciences Research Council (BBSRC) Grant BB/C50694/1. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

S The on-line version of this article (available at http://www.mcponline.org) contains supplemental material.

§ A BBSRC research associate supported by BBSRC Grant BB/C50694/1.

|| Supported by Unilever.

** Supported by a BBSRC Strategic Studentship BBS/Q/Q/2004/05630.

‡‡ To whom correspondence should be addressed: Dept. of Biochemistry, University of Cambridge, Bldg. O, Downing Site, Cambridge, CB2 1QW UK. Tel.: 44-1223-765-255; Fax: 44-1223-333-345; E-mail: k.s.lilley@bioc.cam.ac.uk


    REFERENCES
 

1. Kleno, T. G., Leonardsen, L. R., Kjeldal, H. O., Laursen, S. M., Jensen, O. N., and Baunsgaard, D. (2004) Mechanisms of hydrazine toxicity in rat liver investigated by proteomics and multivariate data analysis. Proteomics 4, 868–880

2. Fievet, J., Dillmann, C., Lagniel, G., Davanture, M., Negroni, L., Labarre, J., and de Vienne, D. (2004) Assessing factors for reliable quantitative proteomics based on two-dimensional gel electrophoresis. Proteomics 4, 1939–1949

3. Brancia, F. (2006) Mass spectrometry based strategies in quantitative proteomics. Curr. Anal. Chem. 2, 1–7

4. Old, W. M., Meyer-Arendt, K., Aveline-Wolf, L., Pierce, K. G., Mendoza, A., Sevinsky, J. R., Resing, K. A., and Ahn, N. G. (2005) Comparison of label-free methods for quantifying human proteins by shotgun proteomics. Mol. Cell. Proteomics 4, 1487–1502

5. Karp, N., Kreil, D., and Lilley, K. (2004) Determining a significant change in protein expression with DeCyder™ during a pair-wise comparison using two-dimensional difference gel electrophoresis. Proteomics 4, 1421–1432

6. Yan, J. X., Devenish, A. T., Wait, R., Stone, T., Lewis, S., and Fowler, S. (2002) Fluorescence two-dimensional difference gel electrophoresis and mass spectrometry based proteomic analysis of Escherichia coli. Proteomics 2, 1682–1698

7. Karp, N. A., Griffin, J. L., and Lilley, K. S. (2005) Application of partial least squares discriminant analysis to two dimensional difference gel studies in expression proteomics. Proteomics 5, 81–90

8. Meunier, B., Bouley, J., Piec, I., Bernard, C., Picard, B., and Hocquette, J. F. (2005) Data analysis methods for detection of differential protein expression in two-dimensional gel electrophoresis. Anal. Biochem. 340, 226–230

9. Listgarten, J., and Emili, A. (2005) Statistical and computational methods for comparative proteomic profiling using liquid chromatography-tandem mass spectrometry. Mol. Cell. Proteomics 4, 419–434

10. Bland, J. M., and Altman, D. G. (1995) Multiple significance tests: the Bonferroni method. BMJ Br. Med. J. (Clin. Res. Ed.) 310, 170

11. Smyth, G. K., Yang, Y. H., and Speed, T. (2003) Statistical issues in cDNA microarray data analysis. Methods Mol. Biol. 224, 111–136

12. Draghici, S. (2002) Statistical intelligence: effective analysis of high-density microarray data. Drug Discov. Today 7, S55–S63

13. Cui, X., and Churchill, G. A. (2003) Statistical tests for differential expression in cDNA microarray experiments. Genome Biol. 4, 210

14. Qian, H. R., and Huang, S. (2005) Comparison of false discovery rate methods in identifying genes with differential expression. Genomics 86, 495–503

15. Benjamini, Y., and Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B 57, 289–300

16. Storey, J. D., and Tibshirani, R. (2003) Statistical significance for genomewide studies. Proc. Natl. Acad. Sci. U. S. A. 100, 9440–9445

17. Storey, J. D. (2002) A direct approach to false discovery rates. J. R. Stat. Soc. B 64, 479–498

18. Fodor, I. K., Nelson, D. O., Alegria-Hartman, M., Robbins, K., Langlois, R. G., Turteltaub, K. W., Corzett, T. H., and McCutchen-Maloney, S. L. (2005) Statistical challenges in the analysis of two-dimensional difference gel electrophoresis experiments using DeCyder. Bioinformatics 21, 3733–3740

19. Chang, J., Van Remmen, H., Ward, W. F., Regnier, F. E., Richardson, A., and Cornell, J. (2004) Processing of data generated by 2-dimensional gel electrophoresis for statistical analysis: missing data, normalization, and statistics. J. Proteome Res. 3, 1210–1218

20. Wang, G., Wu, W. W., Zeng, W., Chou, C. L., and Shen, R. F. (2006) Label-free protein quantification using LC-coupled ion trap or FT mass spectrometry: reproducibility, linearity, and application with complex proteomes. J. Proteome Res. 5, 1214–1223

21. Norbeck, J., and Blomberg, A. (1997) Two-dimensional electrophoretic separation of yeast proteins using a non-linear wide range (pH 3–10) immobilized pH gradient in the first dimension; reproducibility and evidence for isoelectric focusing of alkaline (pI > 7) proteins. Yeast 13, 1519–1534

22. Hu, Y., Wang, G., Chen, G. Y., Fu, X., and Yao, S. Q. (2003) Proteome analysis of Saccharomyces cerevisiae under metal stress by two-dimensional differential gel electrophoresis. Electrophoresis 24, 1458–1470

23. Chevalier, F., Rofidal, V., Vanova, P., Bergoin, A., and Rossignol, M. (2004) Proteomic capacity of recent fluorescent dyes for protein staining. Phytochemistry 65, 1499–1506

24. Smejkal, G. B., Robinson, M. H., and Lazarev, A. (2004) Comparison of fluorescent stains: relative photostability and differential staining of proteins in two-dimensional gels. Electrophoresis 25, 2511–2519

25. Karp, N. A., and Lilley, K. S. (2005) Maximizing sensitivity for detecting changes in protein expression: experimental design using minimal CyDyes. Proteomics 5, 3105–3115

26. Marouga, R., David, S., and Hawkins, E. (2005) The development of the DIGE system: 2D fluorescence difference gel analysis technology. Anal. Bioanal. Chem. 382, 669–678

27. Coulthurst, S. J., Lilley, K. S., and Salmond, G. P. (2006) Genetic and proteomic analysis of the role of luxS in the enteric phytopathogen, Erwinia carotovora. Mol. Plant Pathol. 7, 31–46

28. R Development Core Team (2006) R: a Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria

29. Casella, G., and Berger, R. (1990) Statistical Inference, p. 250, Duxbury Press, Belmont, CA

30. Gallego, M., Kang, H., and Virshup, D. M. (2006) Protein phosphatase 1 regulates the stability of the circadian protein PER2. Biochem. J. 399, 169–175

31. Lenth, R. (2001) Some practical guidelines for effective sample size determination. Am. Statistician 55, 187–193

32. Novak, J. P., Sladek, R., and Hudson, T. J. (2002) Characterization of variability in large-scale gene expression data: implications for study design. Genomics 79, 104–113

