Molecular & Cellular Proteomics 6:1354-1364, 2007.
© 2007 by The American Society for Biochemistry and Molecular Biology, Inc.

From the Department of Biochemistry, University of Cambridge, Building O, Downing Site, Cambridge CB2 1QW, United Kingdom and ¶ Department of Chemistry, University Chemical Laboratory, Lensfield Road, Cambridge CB2 1EW, United Kingdom
| ABSTRACT |
|---|
To date, the majority of published quantitative proteomics studies have used the established 2D gel electrophoresis approach for quantitation and applied a univariate test to identify protein species with significant changes in expression (8), although an increasing number of studies are now using non-gel-based approaches. These tests calculate the probability (p) of observing a test statistic at least as extreme as the one calculated from the data, given that the samples are from the same population (i.e. that any apparent change in expression occurs by chance alone). Typically an expression change is considered significant if the calculated p value falls below a prescribed significance threshold, for example 0.01 (the per comparison error rate (PCER)). Two types of errors are possible: a false positive (type I error) occurs when a protein species is erroneously declared to be differentially expressed; a false negative (type II error) occurs when the test fails to detect a differentially expressed species. In expression studies, many thousands of statistical tests are conducted, one for each species. A substantial number of false positives may accumulate at the 0.01 significance level because on average 1% of unchanging features will be called significant even if no changes in protein expression exist between the two samples. This accumulation of false positives is termed the multiple testing problem and is a general property of a confidence-based statistical test when applied across multiple features (where in DIGE each feature is a gel spot). Although the issue of multiple testing in quantitative proteomics has been discussed (9), the extent of this problem has rarely been tested, yet it has the potential to lead to a substantial number of false leads.
In the microarray field, the multiple testing problem has been the subject of detailed discussion. Early methods controlled the familywise error rate, the probability of one or more false rejections (type I errors) among all tests conducted. The simplest and most conservative approach is the Bonferroni correction, which adjusts the threshold of significance by dividing the PCER by the number of comparisons being completed (10). For example, when testing 1000 protein species, the PCER threshold of 0.05 would be divided by the number of tests, giving a stringent per comparison threshold of 0.00005 that controls the familywise error rate (FWER) at 0.05. This approach fails to consider between-feature dependence within the data and does not take into account the risk of false positives associated with the proportion of tests for which no change occurs. In proteomics data, between-feature dependence can arise because proteins influence each other in complex interactions within and across biochemical pathways and when a protein is represented multiple times; for example in 2D gel electrophoresis, a single protein can be represented in multiple positions in a charge train. Within the microarray community, controlling the FWER has been found to be too conservative and has led to many missed features of interest. It has also been argued that in the context of exploratory experiments, where later confirmatory investigations are utilized, allowing a few false leads would not present a serious problem if the majority of significant species were correctly chosen (11–13). Subsequently the concept of allowing false positives among the genuine changes has been widely accepted in the large scale analysis of microarray data (14). This has led to the application of methodologies to control the false discovery rate (FDR), where the focus is on achieving an acceptable ratio of true and false positives.
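As a minimal sketch of the Bonferroni arithmetic in the preceding paragraph (the function names are illustrative and are not taken from any software used in this study), the per comparison threshold is the desired FWER divided by the number of tests:

```python
def bonferroni_threshold(fwer, n_tests):
    """Per comparison p value threshold that keeps the familywise error rate at `fwer`."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return fwer / n_tests


def bonferroni_significant(p_values, fwer=0.05):
    """Mark each p value that falls below the Bonferroni-adjusted threshold."""
    threshold = bonferroni_threshold(fwer, len(p_values))
    return [p < threshold for p in p_values]


if __name__ == "__main__":
    # Example from the text: 1000 protein species tested with an FWER of 0.05.
    print(bonferroni_threshold(0.05, 1000))  # approximately 0.00005
```

The threshold shrinks in direct proportion to the number of spots tested, which is why the correction becomes so stringent for gel series with hundreds of matched spots.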
Benjamini and Hochberg (15) originally defined the false discovery rate as the expected proportion of changes identified as significant that are false. For example, an FDR of 10% means that on average 10% of changes identified as significant would be expected to have arisen from type I errors. The original FDR methodology was also considered to be too conservative for discovery experiments because it does not take into account that in these experiments a proportion of features genuinely change (16, 17). Consequently more recent methods focus on optimizing procedures to control and eliminate false discoveries by considering the behavior of the data.
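For orientation, the step-up procedure usually associated with Benjamini and Hochberg can be sketched as follows. This is an illustration of the commonly stated rule (reject the k smallest p values, where k is the largest rank whose p value does not exceed rank/m times the target FDR), not code used in this study, and the function name is an assumption.

```python
def benjamini_hochberg(p_values, fdr=0.10):
    """Return a list of booleans marking features called significant at the target FDR."""
    m = len(p_values)
    # Indices of the features ordered from smallest to largest p value.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p value sits under the step-up line.
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    significant = [False] * m
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant


if __name__ == "__main__":
    example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(example, fdr=0.05))
```

Because the rule implicitly treats every feature as potentially unchanging, it does not use any estimate of the proportion of features that genuinely change, which is the conservatism noted above.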
An extension to the FDR was developed by Storey and calculates a q value for each tested feature (16, 17). The q value is the expected proportion of false positives incurred when making a call that this feature (i.e. protein species) has a significant change in expression between the two samples. Whereas the p value is a measure of significance in terms of the false positive rate of the test, the q value is a measure in terms of the false discovery rate. Consequently using a threshold focused solely on p values (e.g. methods that manage the FWER) controls the rate at which unchanging features are called significant, whereas a focus on the FDR controls the rate at which significant features are false. Thus, the calculation of FDR focuses attention on the content of the features called significant. The q values are calculated from the p values obtained for all features within a study with an easy-to-use point-and-click tool developed by Storey and Tibshirani (16). The frequency distribution of the p values is used to estimate the proportion of features that are unchanging; this is then used to estimate the false discovery rate (Fig. 1). The estimation process relies on the assumption that a statistical test appropriate for the experimental design and data characteristics was used. The use of an appropriate test in an unchanging situation results in a uniform distribution of p values. Because the q values are calculated from the p value distribution, the experimenter can select an FDR threshold that is appropriate to their experiment. Restriction of analysis to features with a q value ≤ x results in an overall FDR ≤ x. An advantage of the q value approach is its tolerance of weak dependence between features (16). As argued by Storey and Tibshirani (16) for microarray data, we also reason that the dependence in proteomics data between protein species can be considered weak because as the number of measurements increases, the dependence becomes negligible. Studies comparing various FDR controlling methods on microarray expression data have found that the q value approach gives the highest apparent power (14). Several FDR controlling procedures exist; however, the q value is easy to use, maintains power, is tolerant of between-feature dependence, and leads to an output that allows the experimenter to select an appropriate FDR. This makes the q value approach ideal for use in quantitative proteomics.
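The logic of the q value calculation can be illustrated with a simplified sketch. It assumes a single fixed tuning value (here called `lam`, set to 0.5) for estimating the proportion of unchanging features from the flat part of the p value distribution; the published tool smooths this estimate over a range of values, so the code below is an approximation for illustration only, with hypothetical function names.

```python
def estimate_pi0(p_values, lam=0.5):
    """Estimate the proportion of unchanging features from p values above `lam`.

    Under the null hypothesis p values are uniform, so the density of p values
    in (lam, 1] reflects mostly unchanging features.
    """
    m = len(p_values)
    above = sum(1 for p in p_values if p > lam)
    return min(1.0, above / (m * (1.0 - lam)))


def q_values(p_values, lam=0.5):
    """Return (pi0, q) where q[i] is the q value of feature i (simplified estimator)."""
    m = len(p_values)
    pi0 = estimate_pi0(p_values, lam)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so each q value is the minimum FDR
    # at which that feature would still be called significant.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pi0 * m * p_values[idx] / rank)
        q[idx] = running_min
    return pi0, q


if __name__ == "__main__":
    pi0, q = q_values([0.001, 0.01, 0.02, 0.03, 0.04, 0.2, 0.7, 0.9])
    print(pi0, q)
```

Selecting all features with a q value at or below a chosen x then gives a list whose estimated FDR is at most x. Note that if no p value exceeds `lam`, this simplified estimator returns a proportion of zero and all q values collapse to zero, so it should not be used on very small feature sets.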
Two-dimensional polyacrylamide gel electrophoresis allows the resolution of thousands of proteins, resulting in a global view of the proteome. The resulting spot patterns can be visualized by labeling the sample prior to the separation, e.g. with radioactivity (21) or fluorescence (1, 6, 22), or after the separation with total protein stains such as colloidal Coomassie (2), SYPRO Ruby, or Deep Purple (23, 24). Spot volumes can be compared from one sample to another, and this is an established technique in expression studies. The fluorescence DIGE approach has a high sensitivity and consequently is a powerful tool in expression studies. The high sensitivity arises from the use of an internal standard, which has been shown previously to substantially increase the accuracy of quantification (25). DIGE involves labeling samples with spectrally resolvable fluorescent CyDyes™ (Cy2, Cy3, and Cy5; GE Healthcare). The labeled samples are then mixed prior to isoelectric focusing and resolved on the same 2D gel. For a multigel approach, one of the CyDyes™, typically Cy2, is used to label an internal standard, which consists of an average sample. This standard sample is used to match the spot patterns across a gel series and to calculate a standardized abundance value for each spot that can be compared across many gels. Typically the Cy3 and Cy5 CyDyes are used to label two samples from two independent groups to ensure a dye and gel balance (7, 26). A typical 13-cm DIGE gel contains 1000–1500 spots; of these, 600–1000 spots are generally matched across a six-gel series, depending on the quality of the gels (25). At a simplistic level, in a comparison of two groups across 800 spots, eight false positives could arise with the PCER significance threshold currently used in the field (p < 0.01).
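The figure of eight false positives follows from multiplying the number of spots by the PCER. The short sketch below reproduces that arithmetic and, under the additional simplifying assumption that the tests are independent (which the text does not claim), shows how likely at least one false positive becomes. The function names are illustrative.

```python
def expected_false_positives(n_tests, pcer):
    """Expected number of false positives when no feature truly changes."""
    return n_tests * pcer


def prob_at_least_one_false_positive(n_tests, pcer):
    """Chance of one or more false positives, assuming independent tests."""
    return 1.0 - (1.0 - pcer) ** n_tests


if __name__ == "__main__":
    # 800 matched spots tested at p < 0.01, as in the example in the text.
    print(expected_false_positives(800, 0.01))          # about 8
    print(prob_at_least_one_false_positive(800, 0.01))  # very close to 1
```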
This study focused on the issue of multiple testing in the context of 2D gel electrophoresis in conjunction with DIGE. To build on established approaches, this study used the Student's t test because the majority of published DIGE studies utilize this methodology to identify significant changes in expression (footnote 2). The Student's t test is a simple test that assumes the data are randomly sampled from normal distributions and show homogeneity of variance. Regardless of the statistical approach used, all tests have underlying assumptions that need to be considered, and all univariate tests suffer from the issue of multiple testing. The q value procedure relies on a uniform distribution of p values when no changes in expression occur. To investigate this, we obtained data where the sample was identical. These same-same data were found to give a bias in p values toward higher values. Following these empirical observations, data simulations were used to understand the structure found with the DIGE same-same data. Based on these observations, we make recommendations about the design of DIGE experiments that allow the application of the FDR correction procedures. Using the new experimental design, the q value approach was applied to two biological problems that represent typical questions being addressed in proteomics laboratories. The impact of the new experimental design on the calls of significance is discussed and demonstrates how this methodology avoids a flood of false positive results.
| EXPERIMENTAL PROCEDURES |
|---|
To demonstrate the application of the FDR procedure when the two-dye DIGE system was utilized, two biological questions were studied. In biological study 1, the effect of a per2 knock-out in mouse liver samples was assessed; study 2 looked at the effect of an ECA0020 mutation in Erwinia. In these studies, four biological replicates for each group being compared were used. In the two-dye system, 50 µg of each sample was labeled with Cy3 and combined with a 50-µg Cy5-labeled mixed sample. The pooled samples were separated with 2D gel electrophoresis, and the fluorescent images were visualized following standard methodology (25).
Sample Preparation—
For the same-same studies, the bacterial samples were grown in liquid broth medium (10 g/liter Bacto tryptone, 5 g/liter Bacto yeast extract, and 5 g/liter sodium chloride) at 30 °C with agitation at 300 rpm overnight and harvested by centrifugation for 10 min at 4 °C at 5000 rpm. Cells were resuspended in lysis buffer (8 M urea, 2% (w/w) CHAPS, 5 mM magnesium acetate, 10 mM Tris, pH 8.0, and protease inhibitor mixture set I at 1× concentration (Calbiochem)) and lysed by sonication (3 × 10-s pulses on ice). Brain, liver, and heart tissues were harvested from a Wistar rat 24 h after birth. Tissues were homogenized in lysis buffer (8 M urea, 2% (w/w) amidosulfobetaine-14, 5 mM magnesium acetate, 10 mM Tris, pH 8.0, and protease inhibitor mixture set I at 1× concentration (Calbiochem)) using a motorized pestle, and cells were lysed by three cycles of freeze thawing and sonication.
In biological study 1, to investigate the effect of a per2 knock-out mutation in mice, liver samples were harvested from per2 knock-out and wild type mice with synchronized circadian clocks, extracted 6 h after the onset of activity (circadian time 18), and then frozen at –70 °C. A tissue block was taken from each sample and homogenized in CHAPS lysis buffer. Cell debris was removed by discarding the pellet formed after centrifugation for 10 min at 4 °C at 4500 rpm. To harvest the soluble protein fraction, the sample was centrifuged at 13,000 rpm for 10 min at 4 °C, and the pellet was discarded. Material from the livers of four genetically identical mice was extracted for each sample type, giving biological replicates.
In biological study 2, to investigate the effect of a mutation in the E. carotovora subspecies atroseptica SCRI1043 gene ECA0020, the strain was generated by allelic exchange as described in Coulthurst et al. (27). For each sample type, four independent cultures were grown in pectate lyase minimal medium (27) for 24 h at 25 °C with shaking at 300 rpm and harvested by centrifugation for 10 min at 4 °C at 5000 rpm. Cells were resuspended in lysis buffer (8 M urea, 2% (w/w) CHAPS, 5 mM magnesium acetate, 10 mM Tris, pH 8.0, and protease inhibitor mixture set I at 1× concentration (Calbiochem)) and lysed by sonication (3 × 10-s pulses on ice). To harvest the soluble protein fraction, a low speed centrifugation was used to remove cell debris (15 min at 8000 rpm at 4 °C); this was followed by a high speed centrifugation to remove insoluble material (10 min at 13,000 rpm at 4 °C). For all samples the protein concentrations were determined using the Bio-Rad DC protein assay as described by the manufacturer.
Data Analysis—
Gel analysis was performed using DeCyder™ Biological Variation Analysis Version 5.02 (GE Healthcare), a software package designed specifically to be used for DIGE, following the manufacturer's recommendations. Data were normalized within the software using a ratiometric approach, and a log10 transformation was used on the standardized abundance to stabilize variance. The estimated number of spots for each co-detection was set to 2500. Studies focused on spots that were matched across the gel series. The q value was calculated using the p values calculated in DeCyder via a point-and-click tool provided by Storey and Tibshirani (16). Statistical power was calculated as detailed in Karp and Lilley (25).
Data Simulation—
Data simulations were completed using the free software R (28). The scripts used are available in Supplemental Appendix 1.
| RESULTS |
|---|
First, data simulations were used to confirm that a uniform frequency distribution of p values is obtained when no changes are occurring, provided the assumptions of the Student's t test are upheld. Data were randomly sampled from two normally distributed populations with the same mean and S.D., giving two groups with 10 data points each, and the groups were compared using a Student's t test. This was repeated a thousand times. When the assumptions of the Student's t test were upheld, the resulting p values indeed gave a uniform frequency distribution (Fig. 4A). This was tested for a variety of different mean and S.D. combinations (data not shown).
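The simulation described in this paragraph can be sketched in a few lines. The authors' own scripts were written in R (Supplemental Appendix 1); the Python below is an independent illustration that uses only the standard library, with the p value obtained by numerically integrating the Student's t density. All function names, the seed, and the default mean and S.D. are assumptions made for the example.

```python
import math
import random
import statistics


def t_pdf(x, df):
    """Density of Student's t distribution with `df` degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=400):
    """Two-sided p value via Simpson integration of the density on [0, |t|]."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # `steps` must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)


def student_t_test(a, b):
    """Pooled-variance Student's t test; returns the two-sided p value."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)) / df
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return two_sided_p(t, df)


def simulate_null_p_values(n_tests=1000, n_per_group=10, mean=0.0, sd=1.0, seed=1):
    """Compare two groups drawn from the same normal population, n_tests times."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_tests):
        a = [rng.gauss(mean, sd) for _ in range(n_per_group)]
        b = [rng.gauss(mean, sd) for _ in range(n_per_group)]
        p_values.append(student_t_test(a, b))
    return p_values


def histogram(p_values, bins=10):
    """Count p values in equal-width bins across [0, 1]."""
    counts = [0] * bins
    for p in p_values:
        counts[min(int(p * bins), bins - 1)] += 1
    return counts


if __name__ == "__main__":
    print(histogram(simulate_null_p_values()))
```

With a thousand tests and ten bins, each bin should hold roughly one hundred p values when the distribution is uniform; a marked excess in the highest bins would indicate the kind of bias seen in the same-same data.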
To ensure that the observed bias arose from the common internal standard approach rather than the use of the ratio, the process was repeated but with the use of just two dyes, where one dye was used as the internal standard and the other was used for the sample, and the resulting SA value was compared across the gel series (Fig. 2B). This two-dye approach is comparable to Cy3 being used to label the sample and Cy5 being used to label the internal standard. The Cy3 and Cy5 same-same values were obtained by randomly sampling data from populations with the same mean and S.D. to give 10 data points per group. The SA values were obtained by randomly assigning the sampled values to either group 1 or 2, and the groups were compared with a Student's t test. This was repeated a thousand times, and in this situation the p values gave a uniform frequency distribution (Fig. 4C). The data simulations confirm that the bias observed in the same-same data from the traditional DIGE schema arose from the utilization of the common internal standard value, leading to within-spot correlation. This would be avoided by the use of the two-dye schema.
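The contrast between the two schemas can be illustrated with a hedged sketch that works on log-scale intensities, so that a ratio becomes a difference. It assumes every dye signal is drawn independently from a standard normal distribution and that, in the three-dye layout, each gel contributes one value to each group with both values sharing the same Cy2 measurement; these modeling choices and all names are assumptions for illustration, not the authors' simulation scripts. Instead of computing p values, the code counts how often the t statistic exceeds the two-sided 5% critical value for 18 degrees of freedom (2.101).

```python
import math
import random
import statistics

T_CRIT_DF18 = 2.101  # two-sided 5% critical value, Student's t, 18 degrees of freedom


def t_statistic(a, b):
    """Pooled-variance Student's t statistic, which assumes independent groups."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))


def three_dye_groups(rng, n_gels=10):
    """Each gel yields two SA values that share one Cy2 standard measurement."""
    group1, group2 = [], []
    for _ in range(n_gels):
        cy2, cy3, cy5 = (rng.gauss(0, 1) for _ in range(3))
        group1.append(cy3 - cy2)  # log(Cy3/Cy2)
        group2.append(cy5 - cy2)  # log(Cy5/Cy2), correlated with the value above
    return group1, group2


def two_dye_groups(rng, n_gels=20):
    """Each gel yields one SA value, so values in the two groups are independent."""
    sa = [rng.gauss(0, 1) - rng.gauss(0, 1) for _ in range(n_gels)]  # log(Cy3/Cy5)
    rng.shuffle(sa)  # random assignment of gels to group 1 or group 2
    half = n_gels // 2
    return sa[:half], sa[half:]


def false_positive_rate(make_groups, n_sim=5000, seed=7):
    """Fraction of same-same comparisons called significant at the 5% level."""
    rng = random.Random(seed)
    hits = sum(abs(t_statistic(*make_groups(rng))) > T_CRIT_DF18 for _ in range(n_sim))
    return hits / n_sim


if __name__ == "__main__":
    print("three-dye:", false_positive_rate(three_dye_groups))
    print("two-dye:  ", false_positive_rate(two_dye_groups))
```

Under these assumptions the two-dye layout rejects at close to the nominal 5%, whereas the three-dye layout rejects far less often, which is the same shift of p values toward 1 described for the same-same data.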
Investigating the Underlying Assumption with the Two-dye System
The non-uniform p value distribution seen with the three-dye system arises because the data points are correlated (within-spot correlation) and hence violate the assumption of independence. The correlation was shown to arise from the use of the Cy2 as a common denominator for the two samples from each gel. Mathematically this leads to the variance of the difference between the groups being a composite of the variance of each sample and the co-variance (see Equation 1) (29). Without consideration of the co-variance, the statistical test overestimates the true variance, leading to the bias in the p values toward 1. To address this, the experimenter could utilize a significance test that incorporates a term to account for the co-variance or alter the experimental design such that only one data point is obtained per gel (two-dye approach; Fig. 2B). As an argument against the use of a more complex model, we have shown previously that the three-dye system gives significantly higher variance than the two-dye system when the data were analyzed assuming independent sampling (25). However, it can be argued that when the three-dye system is analyzed with a more complex model it could potentially be more powerful. Assessing the use of a more complex model with the three-dye system was beyond the scope of this study. Furthermore there is a risk of overfitting with the use of a more complex model. Consequently the use of a two-dye system was investigated further. In a previous publication the Cy3 and Cy5 dye combination gave the lowest noise; hence this would be the recommended dye pair, with one dye labeling the sample and the other dye labeling the standard (25).
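Equation 1 is not reproduced in this text. The relationship described in the paragraph above corresponds to the standard identity for the variance of a difference, written here with assumed symbols (the bar denotes the mean standardized abundance of groups 1 and 2):

```latex
\operatorname{Var}\left(\bar{X}_1 - \bar{X}_2\right)
  = \operatorname{Var}\left(\bar{X}_1\right)
  + \operatorname{Var}\left(\bar{X}_2\right)
  - 2\,\operatorname{Cov}\left(\bar{X}_1, \bar{X}_2\right)
```

When the two groups share the Cy2 measurement from each gel the co-variance term is positive, so a test that sets it to zero uses a variance that is too large and returns p values that are too high.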
Same-same ECA data from a two-dye design were randomly assigned to either group 1 or group 2, and the groups were compared with a Student's t test. This was repeated four times by altering the group assignments, and the distribution was examined (Fig. 3B). Ideally with random sampling effects, a uniform distribution of p values should be obtained. For one of the assignments, some bias toward the higher p values was observed (see supplemental information). These distributions were all obtained from the same dataset but with alteration in the group assignment. The bias may have arisen from similarity in data from gels that were run in pairs being assigned to opposite groups. This indicates the importance of removing any sources of systematic bias where possible, for example by running the gels in large batches to ensure conditions are as similar as possible. Overall the move to a two-dye system in this study ensured that a uniform distribution of p values would be obtained in an unchanging situation where no expression differences are expected, provided that the assumptions of normality and homogeneity of variance are sustained with the use of biological replicates.
Dataset Selection via Filtering
In the microarray community, the issue of what should be considered as the dataset is quite simple. In 2D gel electrophoresis, however, the issue is more complex because spot detection can lead to many false features being included as potential spots, for example dust particles or smears. Artifact spots could be hypothesized to be unchanging features, thus contributing to the p value background, which will mask the features changing in expression in a multiple testing situation. Removing non-real spots that are contributing to this background could potentially increase the sensitivity. Thus, alternative filters could be utilized in the selection of the dataset provided the p value distribution is independent of such a filter and the filters are chosen in advance of data analysis. Consequently the use of filters in the selection of the dataset was considered.
So far, the analysis has focused on spots matched across the gel series because this should filter for "real" protein spots with the idea that only real spots would consistently be present in that position across the gel series. An alternative approach of using a volume filter was considered. However, spot volume is dependent on a variety of technical issues, e.g. the scanning settings (data not shown), and what is deemed low volume will depend significantly on the downstream processing of samples and sensitivity of instrumentation used for protein and peptide identification. Spots could also be filtered on "realness" as judged by the experimenter; however, this approach would be highly subjective and labor-intensive. Consequently after consideration of the issue our recommendation is to focus on well matched spots.
Correcting for Multiple Testing in Expression Studies
Biological Study 1: Expression Study on the per2 Knock-out—
Utilizing the two-dye DIGE schema, wild type liver samples of mice were compared with per2 knock-out samples utilizing four biological replicates, and the q value methodology was applied. A total of 823 protein spots were detected and matched across the dataset. Using the PCER threshold of 0.01 commonly utilized within the field, eight spots had significant p values. Of these, six would have been picked because they would have been considered suitably abundant for downstream processing in our laboratory.
During analysis, the p value frequency distribution did not give a large increase at low p values suggesting that little was detected as significantly changing (Fig. 5A). The estimated proportion of spots not changing was 0.796, which led to q values for the spots varying between 0.7004 and 0.795. Thus for the six spots identified as significant by the current methodology, 80% are expected to be false positives (Table I).
To confirm the conservative nature of the approaches that control the FWER, the Bonferroni correction method was applied. Controlling the FWER to 0.05 led to an adjusted significance threshold (p' < 0.000087) resulting in the selection of only 11 spots as statistically significant. This method controls the chance of any one type I error but assumes independent tests. This method provides strong control of false positives and leads to the strongest statistical inference and high confidence in the selected spots of significance but has little power.
With the application of the Storey q value approach, the proportion of spots that were unchanging was estimated at 0.473, and the q value for each spot was calculated. The frequency distribution of p values, with its preponderance toward low values, shows that a high proportion of spots have a significant change in expression, giving low p values and consequently low q values (Fig. 8).
| DISCUSSION |
|---|
In the first biological study, looking at the effect of a per2 knock-out in mouse liver, the importance of assessing the false discovery rate was highlighted as this study demonstrated the risk of obtaining false leads if the multiple testing issue was not considered. With the current methodology of a stringent PCER (p < 0.01), six spots would have been chosen as significant, and yet five of these were estimated to be false discoveries with the q value approach. With this approach, the high false discovery rate would lead to no protein species being selected for downstream processing. As per2 is a key negative regulator of circadian rhythms, significant changes in expression are expected upon its knock-out (30). Thus the inability to detect statistically significant changes in expression arises from a lack of statistical power. The power of the experiment was low due to high biological noise and the low number of replicates for the size of changes occurring.
In the second biological study, looking at the effect of an Erwinia mutation, the q value approach was shown to be a revealing system for assessing the false discovery rate. This allows the experimenter to consider the error rate that is acceptable for the downstream studies and resources available. After the expression study is completed, the expected error rate forms a caveat that should influence the interpretation of the results.
The comparison of noise across these biological studies agrees with the microarray studies where biological noise in cell cultures is less than that found in inbred mouse populations (32). It is easy to anticipate that biological variability in human population studies will be larger still. Consideration of the expected variation is essential to ensure that experiments will have sufficient power to answer the biological questions being asked.
The study described here highlights the need for researchers to verify the assumptions of statistical tests and procedures to achieve proper behavior of the computed tests, ensuring valid conclusions. For each technique it can be anticipated that the issues will vary and need individual solutions. The traditional three-dye schema used with DIGE was found to give correlated data, thus violating the assumption of independence in the Student's t test and preventing the meaningful application of the q value approach. The correlation was shown to arise from the use of the Cy2 as a common internal standard for the two samples from each gel, leading to within-spot correlation. This could be accounted for with a more complex statistical test; however, with the high noise in the three-dye approach, it is far simpler to use the two-dye system combined with a more user-friendly statistical analysis. To make valid inferences and allow the application of the q value approach when using DIGE combined with a Student's t test, a two-dye schema is recommended where the Cy3 dye labels the sample and the Cy5 labels the internal standard. The issue of co-variance could also arise for other quantitative techniques that utilize multiplexing and internal standards, such as the iTRAQ tagging system (the amine-modifying labeling reagents for multiplexed relative and absolute protein quantitation), where one of the four or eight possible tags is utilized to label an internal standard common to several separate labeling experiments.
In conclusion, the biological studies presented here, which are typical of the proteome community, highlight the need for robust experimental design that encompasses the appropriate application of statistical procedures. Such planning should include validation of the assumptions of the tests and procedures to ensure that the conclusions drawn from the study are valid. In designing the experiment, the expected technical and biological variability, the expected size of the expression changes, and the number of replicates will all need to be considered to ensure sufficient power to allow the detection of changes in expression.
| ACKNOWLEDGMENTS |
|---|
| FOOTNOTES |
|---|
Published, MCP Papers in Press, May 17, 2007, DOI 10.1074/mcp.M600274-MCP200
1 The abbreviations used are: 2D, two-dimensional; FDR, false discovery rate; SA, standardized abundance; ECA, E. carotovora; PCER, per comparison error rate; FWER, familywise error rate; Q, quantile.
2 N. A. Karp, P. S. McCormick, M. R. Russell, and K. S. Lilley, unpublished observation.
* This work was supported in part by Biotechnology and Biological Sciences Research Council (BBSRC) Grant BB/C50694/1. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
S The on-line version of this article (available at http://www.mcponline.org) contains supplemental material.
A BBSRC research associate supported by BBSRC Grant BB/C50694/1.
** Supported by a BBSRC Strategic Studentship BBS/Q/Q/2004/05630.

To whom correspondence should be addressed. Dept. of Biochemistry, University of Cambridge, Bldg. O, Downing Site, Cambridge, CB2 1QW UK. Tel.: 44-1223-765-255; Fax: 44-1223-333-345; E-mail: k.s.lilley{at}bioc.cam.ac.uk