Abstract
In quantitative proteomics, the false discovery rate (FDR) can be defined as the proportion of false positives within statistically significant changes in expression. False positives accumulate during the simultaneous testing of expression changes across hundreds or thousands of protein or peptide species when univariate tests such as the Student's t test are used. Currently most researchers rely solely on the estimation of p values and a significance threshold, but this approach may result in false positives because it does not account for the multiple testing effect. For each species, a measure of significance in terms of the FDR can be calculated, producing individual q values. The q value maintains power by allowing the investigator to achieve an acceptable level of true or false positives within the calls of significance. The q value approach relies on the use of the correct statistical test for the experimental design. In this situation, a uniform p value frequency distribution should be obtained when there are no differences in expression between two samples. Here we report a bias in p value distribution in the case of a three-dye DIGE experiment where no changes in expression are occurring. The bias was shown to arise from correlation in the data from the use of a common internal standard. With a two-dye schema, where each sample has its own internal standard, such bias was removed, enabling the application of the q value to two different proteomics studies. In the case of the first study, we demonstrate that 80% of calls of significance by the more traditional method are false positives. In the second, we show that calculating the q value gives the user control over the FDR. These studies demonstrate the power and ease of use of the q value in correcting for multiple testing. This work also highlights the need for robust experimental design that includes the appropriate application of statistical procedures.
Quantitative proteomics, the study of global changes in protein expression, is a rapidly growing field where two-dimensional (2D)^{1} gel electrophoresis (1, 2), differential labeling of proteins and peptides with stable isotopes (3), and label-free mass spectrometric peak intensity measurements (4) are pivotal approaches to measuring changes in the expression level of proteins. The output of large scale genomic sequencing and gene expression studies has driven the need to globally assess protein behavior, which is central to our understanding of cellular function and disease processes. In quantitative studies, the quantitation of the protein or peptide signal allows the comparison of protein levels from one state with another. Regardless of the technique used, the data generated can be analyzed with a variety of statistical approaches. In a simple comparison of two samples, changes in protein expression are considered to be significant when above a specified threshold (5). More rigorous studies incorporating replicates use univariate (1, 6) or multivariate statistical methods (1, 7) to analyze the data. Univariate methods, such as the Student's t test, analyze the data from each individual protein species to determine whether the differences between the samples are significant. In multivariate methods, such as principal component analysis, all the data are simultaneously analyzed to look for patterns in expression. Each of these statistical methods involves a number of mathematical assumptions; however, these assumptions are often ignored, potentially leading to erroneous results (8). Regardless of the analytical method used to collect the proteomics data, subtle issues with statistical analysis must be overcome to ensure that valid conclusions are made.
To date, the majority of published quantitative proteomics studies have used the established 2D gel electrophoresis approach for quantitation and applied a univariate test to identify protein species with significant changes in expression (8), although an increased number of studies are now using non-gel-based approaches. These tests calculate the probability (p) of observing a test statistic more extreme than the one calculated from the data given that the samples are from the same population (i.e. that any apparent change in expression occurs by chance alone). Typically an expression change is considered significant if the calculated p value falls below a prescribed significance threshold, for example 0.01 (the per comparison error rate (PCER)). Two types of errors are possible: a false positive (type I error) occurs when a protein species is declared to be differentially expressed erroneously; a false negative (type II error) occurs when the test fails to detect a differentially expressed species. In expression studies, many thousands of statistical tests are conducted, one for each species. A substantial number of false positives may accumulate at the 0.01 significance level because about 1% of comparisons are expected to be called significant even if no changes in protein expression exist between the two samples. This accumulation of false positives is termed the multiple testing problem and is a general property of a confidence-based statistical test when applied across multiple features (where in DIGE each feature is a gel spot). Although the issue of multiple testing in quantitative proteomics has been discussed (9), the extent of this problem has rarely been tested but has the potential to lead to a substantial number of false leads.
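The accumulation described above can be illustrated with a few lines of arithmetic. The following Python sketch is only an illustration: the function names are hypothetical, and the second function additionally assumes that the tests are independent, which the text does not claim for real proteomics data.

```python
def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected count of false positives when no feature truly changes."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests: int, alpha: float) -> float:
    """Chance of one or more false positives, ASSUMING independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    # Compare a single test with studies of 100 and 800 features at p < 0.01.
    for n in (1, 100, 800):
        print(n,
              expected_false_positives(n, 0.01),
              round(prob_at_least_one_false_positive(n, 0.01), 4))
```

Under these assumptions a single test at the 0.01 threshold rarely misleads, whereas a study of several hundred unchanging features is almost certain to produce at least one false call.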
In the microarray field, the multiple testing problem has been the subject of detailed discussion. Early methods used the familywise error rate (FWER), which controls the probability of one or more false rejections (type I error) among all tests conducted. The simplest and most conservative approach is the Bonferroni correction, which adjusts the threshold of significance by dividing the PCER by the number of comparisons being completed (10). For example, when testing 1000 protein species, the PCER threshold of 0.05 would be divided by the number of tests, giving a stringent per-test threshold of 0.00005 that holds the FWER at 0.05. This approach fails to consider between-feature dependence within the data and does not take into account the risk of false positives associated with the proportion of tests for which no change occurs. In proteomics data, between-feature dependence can arise because proteins influence each other in complex interactions within and across biochemical pathways and when a protein is represented multiple times; for example in 2D gel electrophoresis, a single protein can be represented in multiple positions in a charge train. Within the microarray community, controlling the FWER has been found to be too conservative and has led to many missed features of interest. It has also been argued that in the context of exploratory experiments, where later confirmatory investigations are utilized, allowing a few false leads would not present a serious problem if the majority of significant species were correctly chosen (11–13). Subsequently the concept of allowing false positives among the genuine changes has been widely accepted in the large scale data analysis of microarray data (14). This has led to the application of methodologies to control the false discovery rate (FDR) where the focus is on achieving an acceptable ratio of true and false positives.
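The Bonferroni adjustment described above amounts to a single division. A minimal Python sketch follows; the function names are illustrative and do not come from any particular package.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold that holds the FWER at alpha."""
    return alpha / n_tests


def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p value that falls below the Bonferroni-adjusted threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p < threshold for p in p_values]


if __name__ == "__main__":
    # 1000 protein species tested with an FWER target of 0.05.
    print(bonferroni_threshold(0.05, 1000))
```

Because the threshold shrinks in proportion to the number of tests, even moderately small p values fail the criterion in a large study, which is the conservatism discussed in the text.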
Benjamini and Hochberg (15) originally defined the false discovery rate as the expected proportion of changes identified as significant that are false. For example, an FDR of 10% means that on average 10% of changes identified as significant would be expected to have arisen from type I errors. The original FDR methodology was also considered to be too conservative for discovery experiments because it does not take into account that in these experiments a proportion of features genuinely change (16, 17). Consequently more recent methods focus on optimizing procedures to control and eliminate false discoveries by considering the behavior of the data.
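For orientation, the step-up procedure commonly attributed to Benjamini and Hochberg can be sketched as follows. This is a generic textbook formulation written for illustration, not code taken from reference 15; the function name and the default FDR level are assumptions.

```python
def benjamini_hochberg(p_values, fdr=0.10):
    """Return a list of booleans marking p values rejected by the step-up rule.

    With m tests and sorted p values p_(1) <= ... <= p_(m), find the largest
    rank k such that p_(k) <= (k / m) * fdr and reject ranks 1..k.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            max_rank = rank  # keep the largest rank that passes
    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected
```

Note that every p value ranked below the largest passing rank is rejected, even if it individually exceeds its own rank-specific threshold; this is what distinguishes a step-up rule from a simple per-rank comparison.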
An extension to the FDR was developed by Storey and calculates a q value for each tested feature (16, 17). The q value is the expected proportion of false positives incurred when making a call that this feature (i.e. protein species) has a significant change in expression between the two samples. Although the p value is a measure of significance in terms of the false positive rate of the test, the q value is a measure in terms of the false discovery rate. Consequently using a threshold focused solely on p values (e.g. methods that manage the FWER) controls the rate that unchanging features are called significant, whereas a focus on the FDR controls the rate of significant features being false. Thus, the calculation of FDR focuses attention on the content of the features called significant. The q values are calculated from the p values obtained for all features within a study with an easy to use point-and-click tool developed by Storey and Tibshirani (16). The frequency distribution of the p values is used to estimate the proportion of features that are unchanging; this is then used to estimate the false discovery rate (Fig. 1). The estimation process relies on the assumption that a statistical test appropriate for the experimental design and data characteristics was used. The use of an appropriate test in an unchanging situation results in a uniform distribution of p values being obtained. The q values are calculated using the p value distribution; this allows the experimenter to select an FDR threshold that is appropriate to their experiment. Restriction of analysis to features with a q value ≤x results in an overall FDR ≤x. An advantage of the q value approach is its tolerance of weak dependence between features (16). As argued by Storey and Tibshirani (16) for microarray data, we also reason that the dependence in proteomics data between protein species can be considered weak because as the number of measurements increases, the dependence becomes negligible.
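The two estimation steps described above (the proportion of unchanging features, then a q value per feature) can be sketched in Python. This is a simplified illustration and not the point-and-click tool itself: it assumes a single fixed tuning value `lam` (0.5 here) for estimating the unchanging proportion, whereas published implementations typically smooth over a range of values. All names are hypothetical.

```python
def estimate_pi0(p_values, lam=0.5):
    """Estimate the proportion of unchanging features.

    Under a uniform null, p values above lam come almost entirely from
    unchanging features, so their density estimates that proportion.
    """
    m = len(p_values)
    n_above = sum(1 for p in p_values if p > lam)
    return min(1.0, n_above / (m * (1.0 - lam)))


def q_values(p_values, lam=0.5):
    """Return one q value per p value, in the original order."""
    m = len(p_values)
    pi0 = estimate_pi0(p_values, lam)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so each q value is the minimum
    # estimated FDR over all thresholds at or above that p value.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pi0 * m * p_values[idx] / rank)
        q[idx] = running_min
    return q
```

Calling all features with a q value at or below a chosen x then corresponds to the restriction described in the text, provided the p values were obtained from a test whose null distribution is uniform.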
Studies comparing various FDR controlling methods on microarray expression data have found that the q value approach gives the highest apparent power (14). Several FDR controlling procedures exist; however, the q value is easy to use, maintains power, is tolerant of between-feature dependence, and leads to an output that allows the experimenter to select an appropriate FDR. This makes the q value approach ideal for use in quantitative proteomics.
Some studies applying microarray methods of analysis to protein expression data have incorporated multiple testing controlling procedures and found that the number of protein species selected as significant was reduced (8, 18–20). The multiple testing controlling procedures were not the main focus of these publications, and thus little attention was paid to the underlying assumptions utilized in the process.
Two-dimensional polyacrylamide gel electrophoresis allows the resolution of thousands of proteins, resulting in a global view of the proteome. The resulting spot patterns can be visualized by labeling the sample prior to the separation, e.g. with radioactivity (21) or fluorescence (1, 6, 22), or after the separation with total protein stains such as colloidal Coomassie (2), SYPRO Ruby, or Deep Purple (23, 24). The spot volumes can be compared from one sample to another, and it is an established technique in expression studies. The fluorescence DIGE approach has a high sensitivity and consequently is a powerful tool in expression studies. The high sensitivity arises from the use of an internal standard, which has been shown previously to substantially increase the accuracy of quantification (25). DIGE involves labeling samples with spectrally resolvable fluorescent CyDyes™ (Cy2, Cy3, and Cy5; GE Healthcare). The labeled samples are then mixed prior to isoelectric focusing and resolved on the same 2D gel. For a multigel approach, one of the CyDyes™, typically Cy2, is used to label an internal standard, which consists of an average sample. This standard sample is used to match the spot patterns across a gel series and to calculate a standardized abundance value for each spot that can be compared across many gels. Typically the Cy3 and Cy5 CyDyes are used to label two samples from two independent groups to ensure a dye and gel balance (7, 26). A typical 13-cm DIGE gel contains 1000–1500 spots; of these, 600–1000 spots are generally matched across a six-gel series, depending on the quality of the gels (25). At a simplistic level, in a comparison of two groups across 800 spots, eight false positives could arise with the PCER significance threshold currently used in the field (p < 0.01).
This study focused on the issue of multiple testing in the context of 2D gel electrophoresis in conjunction with DIGE. To build on established approaches, this study used the Student's t test because the majority of published DIGE studies utilize this methodology to identify significant changes in expression.^{2} The Student's t test is a simple test that assumes the data are randomly sampled from normal distributions and show homogeneity of variance. Regardless of the statistical approach used, all tests have underlying assumptions that need to be considered, and all univariate tests suffer from the issue of multiple testing. The q value procedure relies on a uniform distribution of p values when no changes in expression occur. To investigate this, we obtained data where the sample was identical. These same-same data were found to give a bias in p values toward higher values. Following these empirical observations, data simulations were used to understand the structure found with the DIGE same-same data. Based on these observations, we therefore make recommendations about the design of DIGE experiments that allow the application of the FDR correction procedures. Using the new experimental design, the q value approach was applied to two biological problems that represent typical questions being addressed in proteomics laboratories. The impact of the new experimental design on the calls of significance is discussed and demonstrates how this methodology avoids a flood of false positive results.
EXPERIMENTAL PROCEDURES
Datasets—
To assess the distribution of p values when no changes in expression are occurring, same-same datasets were obtained where the same sample was run across a six-gel set using the three-dye system resulting in up to 12 data points per spot. In a same-same gel, three 50-μg aliquots of the sample were labeled individually with Cy2, Cy3, and Cy5; mixed; and separated by 2D gel electrophoresis as detailed by Karp and Lilley (25). Three independent same-same datasets were obtained using an Erwinia carotovora (ECA) wild type sample to investigate reproducibility. To assess the behavior across sample types, same-same datasets were also obtained for murine brain, liver, and heart tissue. For a dataset based on the two-dye system (Cy3 and Cy5), two of the Erwinia same-same datasets were combined.
To demonstrate the application of the FDR procedure when the two-dye DIGE system was utilized, two biological questions were studied. In biological study 1, the effect of a per2 knockout in mouse liver samples was assessed; study 2 looked at the effect of an ECA0020 mutation in Erwinia. In these studies, four biological replicates for each group being compared were used. In the two-dye system, 50 μg of each sample was labeled with Cy3 and combined with a 50-μg Cy5-labeled mixed sample. The pooled samples were separated with 2D gel electrophoresis, and the fluorescent images were visualized following standard methodology (25).
Sample Preparation—
For the same-same studies, the bacterial samples were grown in liquid broth medium (10 g/liter Bacto tryptone, 5 g/liter Bacto yeast extract, and 5 g/liter sodium chloride) at 30 °C with agitation at 300 rpm overnight and harvested by centrifugation for 10 min at 4 °C at 5000 rpm. Cells were resuspended in lysis buffer (8 M urea, 2% (w/w) CHAPS, 5 mM magnesium acetate, 10 mM Tris, pH 8.0, and protease inhibitor mixture set I at 1× concentration (Calbiochem)) and lysed by sonication (3 × 10-s pulses on ice). From a Wistar rat the brain, liver, and heart tissue were harvested 24 h postbirth. Cells were homogenized in lysis buffer (8 M urea, 2% (w/w) amidosulfobetaine-14, 5 mM magnesium acetate, 10 mM Tris, pH 8.0, and protease inhibitor mixture set I at 1× concentration (Calbiochem)) using a motorized pestle, and cells were lysed by three cycles of freeze thawing and sonication.
In biological study 1, to investigate the effect of a per2 knockout mutation in mice, liver samples were harvested from per2 knockout and wild type mice with synchronized circadian clocks and extracted 6 h after the onset of activity (circadian time 18) and then frozen at −70 °C. A tissue block was taken from each sample and homogenized in CHAPS lysis buffer. Cell debris was removed by discarding the pellet formed after centrifugation for 10 min at 4 °C at 4500 rpm. To harvest the soluble protein fraction, the sample was centrifuged at 13,000 rpm for 10 min at 4 °C, and the pellet was discarded. Material from the livers of four genetically identical mice was extracted for each sample type giving biological replicates.
In biological study 2, to investigate the effect of a mutation in E. carotovora subspecies atroseptica SCRI1043 gene ECA0020 the strain was generated by allelic exchange as described in Coulthurst et al. (27). For each sample type, four independent cultures were grown in pectate lyase minimal medium (27) for 24 h at 25 °C with shaking at 300 rpm and harvested by centrifugation for 10 min at 4 °C at 5000 rpm. Cells were resuspended in lysis buffer (8 M urea, 2% (w/w) CHAPS, 5 mM magnesium acetate, 10 mM Tris, pH 8.0, and protease inhibitor mixture set I at 1× concentration (Calbiochem)) and lysed by sonication (3 × 10-s pulses on ice). To harvest the soluble protein fraction a low speed centrifugation was used to remove cell debris (15 min at 8000 rpm at 4 °C); this was then followed by a high speed centrifugation to remove insoluble material (10 min at 13,000 rpm at 4 °C). For all samples the protein concentrations were determined using the Bio-Rad DC protein assay as described by the manufacturer.
Data Analysis—
Gel analysis was performed using DeCyder™ Biological Variation Analysis Version 5.02 (GE Healthcare), a software package designed specifically to be used for DIGE, following the manufacturer's recommendations. Data were normalized within the software using a ratiometric approach, and a log_{10} transformation was used on the standardized abundance to stabilize variance. The estimated number of spots for each codetection was set to 2500. Studies focused on spots that were matched across the gel series. The q value was calculated using the p values calculated in DeCyder via a point-and-click tool provided by Storey and Tibshirani (16). Statistical power was calculated as detailed in Karp and Lilley (25).
Data Simulation—
Data simulations were completed using the free software R (28). The scripts used are available in Supplemental Appendix 1.
RESULTS
Investigating the Underlying Assumption with the Three-dye System
Accurate application of the q value procedure assumes the correct calculation of p values, which is dependent on the use of an appropriate statistical test. A uniform distribution of p values in a situation where no differences exist between groups can be used to test whether the correct statistical test is being utilized. From the same-same data, utilizing a three-dye system, the log standardized abundance values from each gel were randomly assigned to either group 1 or group 2 ensuring a dye balance (for an example, see Fig. 2A). Groups 1 and 2 were compared with a Student's t test. With the three-dye system, the same-same datasets resulted in a bias in p value toward higher values. This indicates that the data were more similar than expected from random sampling, suggesting that a Student's t test is not suitable (Fig. 3A). For this to occur, an underlying assumption of the Student's t test is not being met, leading to a distortion in the p values obtained. The Student's t test assumes independent sampling, normality, and homogeneity of variance of which the latter two have been assessed previously for the DIGE technique by examining same-same data and found to be valid (25). This suggests that in the traditional three-dye approach, the final standardized abundance (SA) data for a spot are not truly independent. This leads to a similarity in the data that results in the groups being more alike than expected by random chance, giving a bias toward a p value of 1.
Investigating the Underlying Assumption with Data Simulations
The bias in p values observed with the three-dye system when analyzing same-same data with a Student's t test suggests that the assumption of random sampling is not true leading to within-spot correlation. This effect was hypothesized to arise from the use of a common internal standard spot volume in the calculation of the two standardized abundance values obtained from a gel for a given spot. Any error in the internal standard spot volume would be common between the two SA values obtained from the same gel, leading to a common distortion in the final SA value for those values. To investigate whether this design could lead to bias in the p value distribution, data were simulated based on a variety of experimental designs using a straightforward system where all spots were assumed to have the same mean and S.D.
First the observation that a uniform frequency distribution of the p values is obtained when no changes are occurring provided the assumptions of the Student's t test are upheld was confirmed by data simulations. Data were randomly sampled from two normally distributed populations with the same mean and S.D. giving two groups with 10 data points, and the groups were compared using a Student's t test. This was repeated a thousand times. When the assumptions of the Student's t test are upheld the resulting p values indeed gave a uniform frequency distribution (Fig. 4A). This was tested for a variety of different mean and S.D. combinations (data not shown).
Data were then generated to mimic those obtained from same-same data with the typical DIGE experiment schema. The Cy3, Cy5, and Cy2 modeled values were obtained by randomly sampling data from normally distributed populations with the same mean and S.D. to give 10 data points per group. The standardized abundance values were calculated as a ratio with each gel pair (Cy3 and Cy5 values) being divided by a common internal standard (Cy2) value. The SA values were assigned to either group 1 or 2 following the typical DIGE experimental design (Fig. 1), and the generated data were then compared with a Student's t test. This was repeated a thousand times, and the resulting p values gave a bias toward 1 in a frequency distribution (Fig. 4B). This was tested for a variety of different mean and S.D. combinations (data not shown).
To ensure that the observed bias is arising from the common internal standard approach rather than the use of the ratio, the process was repeated but with the use of just two dyes where one dye was used as the internal standard and the other was used for the sample, and the resulting SA value was compared across the gel series (Fig. 2B). This two-dye approach is comparable to the Cy3 being used to label the sample and Cy5 being used to label the internal standard. The Cy3 and Cy5 same-same values were obtained by randomly sampling data from populations with the same mean and S.D. to give 10 data points per group. The SA values were obtained by randomly assigning the sampled values to either group 1 or 2, and the groups were compared with a Student's t test. This was repeated a thousand times, and in this situation the p values gave a uniform frequency distribution (Fig. 4C). The data simulations confirm that the bias observed in the same-same data from the traditional DIGE schema was arising from the utilization of the common internal standard value leading to within-spot correlation. This would be avoided by the use of the two-dye schema.
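The three simulated designs can be re-created in outline as follows. This Python sketch is not the authors' R script from the supplemental appendix; the population mean and S.D., the alternating dye assignment, the use of the raw ratio, and all function names are assumptions made for illustration. The Student's t test p value is obtained by integrating the t density numerically so that only the standard library is needed.

```python
import math
import random


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))


def two_sided_p(t, df, steps=200):
    """Two-sided p value: 1 - 2 * integral of the density over [0, |t|]."""
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    h = upper / steps  # Simpson's rule needs an even number of steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)


def student_t_test(a, b):
    """Pooled-variance Student's t test; returns the two-sided p value."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    ss = sum((x - mean_a) ** 2 for x in a) + sum((x - mean_b) ** 2 for x in b)
    df = na + nb - 2
    t = (mean_a - mean_b) / math.sqrt(ss / df * (1 / na + 1 / nb))
    return two_sided_p(t, df)


def independent_design(rng, n=10, mean=100.0, sd=10.0):
    """Two groups sampled independently from the same population."""
    return ([rng.gauss(mean, sd) for _ in range(n)],
            [rng.gauss(mean, sd) for _ in range(n)])


def three_dye_design(rng, n=10, mean=100.0, sd=10.0):
    """Each gel yields two SA values that share one Cy2 denominator."""
    g1, g2 = [], []
    for gel in range(n):
        cy2, cy3, cy5 = (rng.gauss(mean, sd) for _ in range(3))
        sa3, sa5 = cy3 / cy2, cy5 / cy2
        first, second = (sa3, sa5) if gel % 2 == 0 else (sa5, sa3)  # dye balance
        g1.append(first)
        g2.append(second)
    return g1, g2


def two_dye_design(rng, n=10, mean=100.0, sd=10.0):
    """Each gel yields one SA value with its own internal standard."""
    sa = [rng.gauss(mean, sd) / rng.gauss(mean, sd) for _ in range(2 * n)]
    return sa[:n], sa[n:]


def fraction_above_half(design, n_sim=1000, seed=1):
    """Share of simulated p values above 0.5 (about 0.5 if uniform)."""
    rng = random.Random(seed)
    return sum(student_t_test(*design(rng)) > 0.5 for _ in range(n_sim)) / n_sim


if __name__ == "__main__":
    for design in (independent_design, three_dye_design, two_dye_design):
        print(design.__name__, fraction_above_half(design))
```

Under these assumptions the independent and two-dye designs should place roughly half of the p values above 0.5, whereas the shared denominator in the three-dye design should shift the distribution toward 1, in line with the behavior described for Fig. 4.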
Investigating the Underlying Assumption with the Two-dye System
The nonuniform p value distribution seen with the three-dye system arises as the data points are correlated (within-spot correlation) and hence violate the assumption of independence. The correlation was shown to arise from the use of the Cy2 as a common denominator for the two samples from each gel. Mathematically this leads to the variance of the difference between the groups being a composite of the variance of each sample and the covariance (see Equation 1) (29). Without consideration of the covariance, the statistical test overestimates the true variance leading to the bias in the p values toward 1. To address this, the experimenter could utilize a significance test that incorporated a term to account for the covariance or alter the experimental design such that only one data point is obtained on a gel (two-dye approach; Fig. 2B). Against the use of a more complex model, we have shown previously that the three-dye system gives significantly higher variance than the two-dye system when the data were analyzed assuming independent sampling (25). However, it can be argued that when the three-dye system is analyzed with a more complex model it could potentially be more powerful. Assessing the use of a more complex model with the three-dye system was beyond the scope of this study. Furthermore there are risks in overfitting with the use of a more complex model. Consequently the use of a two-dye system was investigated further. In a previous publication the Cy3 and Cy5 dye combination gave the lowest noise; hence this would be the recommended dye pair with one dye labeling the sample and the other dye labeling the standard (25).
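Equation 1 is not reproduced in this text. The relationship described is consistent with the standard identity for the variance of a difference between two correlated quantities, written here with assumed symbols (group means \bar{X}_1 and \bar{X}_2) that may differ from the notation of the original equation:

```latex
\operatorname{Var}(\bar{X}_1 - \bar{X}_2)
  = \operatorname{Var}(\bar{X}_1) + \operatorname{Var}(\bar{X}_2)
  - 2\,\operatorname{Cov}(\bar{X}_1, \bar{X}_2)
```

The Student's t test effectively sets the covariance term to zero. When the shared Cy2 denominator makes the covariance positive, the true variance of the difference is smaller than the value the test assumes, which is the overestimation referred to above.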
Same-same ECA data from a two-dye design were randomly assigned to either group 1 or group 2, and the groups were compared with a Student's t test. This was repeated four times by altering the group assignments, and the distribution was examined (Fig. 3B). Ideally with random sampling effects, a uniform distribution of p values should be obtained. For one of the assignments, some bias toward the higher p values was observed (see supplemental information). These distributions were all obtained from the same dataset but with alteration in the group assignment. The bias may have arisen from similarity in data from the gels that were run in pairs being assigned in opposite groups. This indicates the importance of removing any sources of systematic bias where possible, for example by running the gels in large batches to ensure conditions are as similar as possible. Overall the move to a two-dye system in this study ensured that a uniform distribution of p values would be met in an unchanging situation where no expression differences are expected provided that the assumptions of normality and homogeneity of variance are sustained with the use of biological replicates.
Dataset Selection via Filtering
In the microarray community, the issue of what should be considered as the dataset is quite simple. In 2D gel electrophoresis, however, the issue is more complex because spot detection can lead to many false features being included as potential spots, for example dust particles or smears. Artifact spots could be hypothesized to be unchanging features, thus contributing to the p value background, which will mask the features changing in expression in a multiple testing situation. Removing non-real spots that are contributing to this background could potentially increase the sensitivity. Thus, alternative filters could be utilized in the selection of the dataset provided the p value distribution is independent of such a filter and the filters are chosen in advance of data analysis. Consequently the use of filters in the selection of the dataset was considered.
So far, the analysis has focused on spots matched across the gel series because this should filter for “real” protein spots with the idea that only real spots would consistently be present in that position across the gel series. An alternative approach of using a volume filter was considered. However, spot volume is dependent on a variety of technical issues, e.g. the scanning settings (data not shown), and what is deemed low volume will depend significantly on the downstream processing of samples and sensitivity of instrumentation used for protein and peptide identification. Spots could also be filtered on “realness” as judged by the experimenter; however, this approach would be highly subjective and labor-intensive. Consequently after consideration of the issue our recommendation is to focus on well-matched spots.
Correcting for Multiple Testing in Expression Studies
Biological Study 1: Expression Study on the per2 Knockout—
Utilizing the two-dye DIGE schema, wild type liver samples of mice were compared with per2 knockout samples utilizing four biological replicates, and the q value methodology was applied. A total of 823 protein spots were detected and matched across the dataset. Using the PCER threshold of 0.01 commonly utilized within the field, eight spots had significant p values. Of these, six would have been picked because they would have been considered suitably abundant for downstream processing in our laboratory.
During analysis, the p value frequency distribution did not give a large increase at low p values suggesting that little was detected as significantly changing (Fig. 5A). The estimated proportion of spots not changing was 0.796, which led to q values for the spots varying between 0.7004 and 0.795. Thus for the six spots identified as significant by the current methodology, 80% are expected to be false positives (Table I).
The high q value arose because there is no clear signal of low scoring spots in the p value distribution above the background. An alternative method of assessing the p value distribution is to plot a uniform Q-Q plot, which provides a clear representation of deviations from a uniform distribution (Fig. 5B). The Q-Q plot is a graphical technique for determining whether the sample comes from a specified, in this case a uniform, population. The quantile of the target population (y axis) is plotted against the respective sample quantile (x axis) where quantile is the fraction (or percentage) of points below the given value. This graphical approach clearly shows that the p value distribution has no significant deviation from a uniform distribution and hence no difference from the same-same study (Fig. 6). By considering the FDR, the conclusion would be that no spots are significantly changing to warrant downstream processing.
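The coordinates of such a uniform quantile plot can be computed with a short Python sketch. The plotting position (rank − 0.5)/m used for the expected uniform quantile is one common convention and is an assumption here, as are the function names; the plotting itself is left to whatever graphics tool is at hand.

```python
def uniform_qq_points(p_values):
    """Pair each sorted p value (x) with its expected uniform quantile (y)."""
    m = len(p_values)
    ordered = sorted(p_values)
    return [(p, (rank - 0.5) / m) for rank, p in enumerate(ordered, start=1)]


def max_deviation(points):
    """Largest vertical distance between the points and the line y = x."""
    return max(abs(x - y) for x, y in points)
```

Points lying close to the diagonal indicate p values compatible with a uniform distribution; a cluster of points above the diagonal at small x values would indicate an excess of low p values.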
The mutation studied is expected to have biologically significant changes (30). Failure to detect any changes in the experiment suggests that the design had too little power for the size of expression changes occurring. The noise that encompasses 75% of the spots was found to be 4 times higher in this study compared with the technical noise published by Karp and Lilley (25) (Fig. 6). Completing a power study using the Lenth (31) power tool clearly demonstrates that with only four replicates the power in detecting change was low due to the high biological noise (Fig. 7).
Biological Study 2: Expression Study on the Erwinia Mutant—
Utilizing the two-dye DIGE schema, E. carotovora wild type samples were compared with a mutant utilizing four biological replicates, and the q value methodology was applied. A total of 575 protein spots were detected and matched across the dataset. Using the current PCER significance threshold utilized within the field (p < 0.01), 104 spots had significant p values. Of these, 86 would have been picked because they would have been considered suitable for downstream processing. With the use of the PCER threshold in this multiple testing situation, the extent of the false discovery rate is unknown.
To confirm the conservative nature of the approaches that control the FWER, the Bonferroni correction method was applied. Controlling the FWER to 0.05 led to an adjusted significance threshold (p′ < 0.000087) resulting in the selection of only 11 spots as statistically significant. This method controls the chance of any one type I error but assumes independent tests. This method provides strong control of false positives and leads to the strongest statistical inference and high confidence in the selected spots of significance but has little power.
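The adjusted threshold quoted above follows from dividing the desired FWER by the number of tests. A minimal Python check, using the 575 matched spots reported for this study (the function name is hypothetical):

```python
def bonferroni_threshold(fwer, n_tests):
    """Per-test significance threshold that keeps the FWER at `fwer`."""
    return fwer / n_tests


threshold = bonferroni_threshold(0.05, 575)
print(f"{threshold:.6f}")  # 0.000087
```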
With the application of the Storey q value approach, the proportion of spots that were unchanging was estimated at 0.473, and the q value for each spot was calculated. The frequency distribution of p values, with its preponderance toward low values, shows that a high proportion of spots have a significant change in expression, giving low p values and consequently low q values (Fig. 8).
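The two steps named above, estimating the proportion of unchanging spots and converting p values to q values, can be sketched as follows. This is a simplified illustration, not the software used in the study: it assumes a single fixed tuning value `lam` for the proportion estimate, whereas the published Storey procedure smooths over a range of values. The function names and the p values are hypothetical.

```python
def estimate_pi0(pvalues, lam=0.5):
    """Estimate the proportion of unchanging features.

    p values above `lam` are assumed to come mostly from true nulls,
    which are uniformly distributed, so their density estimates pi0.
    """
    m = len(pvalues)
    above = sum(1 for p in pvalues if p > lam)
    return min(1.0, above / (m * (1.0 - lam)))


def q_values(pvalues, pi0):
    """Convert p values to q values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so each q value is the minimum
    # estimated FDR over all thresholds that would still call the feature.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvalues[idx] / rank)
        q[idx] = running_min
    return q


# Hypothetical p values, for illustration only.
pvals = [0.001, 0.01, 0.02, 0.5, 0.9]
pi0 = estimate_pi0(pvals)
qvals = q_values(pvals, pi0)
print(pi0, qvals)
```

Each q value is the estimated FDR incurred if that feature and all features with smaller p values are called significant.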
The calculated q value allows an estimation of false discoveries to be calculated for various false discovery rate thresholds (Table II). The results highlight that by increasing the proportion of the false calls the power of the experiment is increased, and a sizable number of significant spots are detected.
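The relationship between the chosen threshold and the expected number of false discoveries can be illustrated with a short Python sketch. The q values below are hypothetical and are not the data behind Table II. The expected number of false calls is taken here as the threshold multiplied by the number of calls, which is an upper estimate because every called feature has a q value at or below the threshold.

```python
def summarize_thresholds(qvalues, thresholds):
    """For each FDR threshold, return (threshold, calls, expected false calls)."""
    rows = []
    for t in thresholds:
        called = sum(1 for q in qvalues if q <= t)
        rows.append((t, called, t * called))
    return rows


# Hypothetical q values, for illustration only.
example_q = [0.002, 0.01, 0.013, 0.04, 0.08, 0.25, 0.36]
rows = summarize_thresholds(example_q, [0.01, 0.05, 0.10])
for threshold, called, expected_false in rows:
    print(threshold, called, round(expected_false, 3))
```

Raising the threshold admits more calls of significance at the cost of a larger expected number of false leads.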
DISCUSSION
The multiple testing problem has had little attention in the field of quantitative proteomics; however, the accumulation of false positives can lead to a significant waste of resources in follow-up studies. FDR methodologies, which focus on the balance of false and true positives, address this issue and maintain the power of the experiment in detecting changes in expression. The q value, an extension of the FDR, provides a measure of the significance of each feature while taking into account the fact that thousands of features are simultaneously tested. The q value approach is easy to interpret and implement. The strength of the q value is that it allows the experimenter to choose an error rate that is acceptable to them and their subsequent studies, for example orthogonal validation techniques. In cases where validation of changes in protein expression is facile, the investigator may choose to accept a higher FDR. It is essential to consider the issue of multiple testing for all methods of quantitative proteomics. Here we concentrated on the DIGE method, but the use of the q value approach could be applied equally well to the other quantitative methods.
In the first biological study, looking at the effect of a per2 knockout in mouse liver, the importance of assessing the false discovery rate was highlighted as this study demonstrated the risk of obtaining false leads if the multiple testing issue was not considered. With the current methodology of a stringent PCER (p < 0.01), six spots would have been chosen as significant, and yet five of these were estimated to be false discoveries with the q value approach. With this approach, the high false discovery rate would lead to no protein species being selected for downstream processing. As per2 is a key negative regulator of circadian rhythms, significant changes in expression are expected upon its knockout (30). Thus the inability to detect statistically significant changes in expression arises from a lack of statistical power. The power of the experiment was low due to high biological noise and the low number of replicates for the size of changes occurring.
In the second biological study, looking at the effect of an Erwinia mutation, the q value approach was shown to be a revealing system for assessing the false discovery rate. This allows the experimenter to consider the error rate that is acceptable for the downstream studies and resources available. After the expression study is completed, the expected error rate forms a caveat that should influence the interpretation of the results.
The comparison of noise across these biological studies agrees with the microarray studies where biological noise in cell cultures is less than that found in inbred mouse populations (32). It is easy to anticipate that biological variability in human population studies will be larger still. Consideration of the expected variation is essential to ensure that experiments will have sufficient power to answer the biological questions being asked.
The study described here highlights the need for researchers to verify the assumptions of statistical tests and procedures so that the computed tests behave properly and the conclusions are valid. For each technique it can be anticipated that the issues will vary and need individual solutions. The traditional three-dye schema used with DIGE was found to give correlated data, thus violating the assumption of independence in the Student's t test and preventing the meaningful application of the q value approach. The correlation was shown to arise from the use of Cy2 as a common internal standard for the two samples from each gel, leading to within-spot correlation. This could be accounted for with a more complex statistical test; however, with the high noise on the three-dye approach, it is far simpler to use the two-dye system combined with a more user-friendly statistical analysis. To make valid inferences and allow the application of the q value approach when using DIGE combined with a Student's t test, a two-dye schema is recommended where the Cy3 dye labels the sample and the Cy5 labels the internal standard. The issue of covariance could also arise for other quantitative techniques that utilize multiplexing and internal standards, such as the iTRAQ tagging system (the amine-modifying labeling reagents for multiplexed relative and absolute protein quantitation) where one of the four or eight possible tags is utilized to label an internal standard common to several separate labeling experiments.
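Why a shared internal standard induces correlation can be seen in a small simulation. The Python sketch below assumes a simple additive noise model on the log scale, in which each standardized abundance is a sample signal minus a standard signal, all with equal independent noise. This model and every name in the code are assumptions for illustration, not the analysis performed in the study.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate_ratio_correlation(n_pairs, shared_standard, seed=0):
    """Correlation between two log ratios measured against a standard.

    With shared_standard=True both samples are divided by the same
    standard measurement (as on one three-dye gel); otherwise each
    sample has its own independent standard (as in a two-dye schema).
    """
    rng = random.Random(seed)
    ratios_a, ratios_b = [], []
    for _ in range(n_pairs):
        sample_a = rng.gauss(0.0, 1.0)
        sample_b = rng.gauss(0.0, 1.0)
        std_a = rng.gauss(0.0, 1.0)
        std_b = std_a if shared_standard else rng.gauss(0.0, 1.0)
        ratios_a.append(sample_a - std_a)  # log(sample / standard)
        ratios_b.append(sample_b - std_b)
    return pearson(ratios_a, ratios_b)


shared = simulate_ratio_correlation(20000, shared_standard=True)
separate = simulate_ratio_correlation(20000, shared_standard=False)
print(shared, separate)
```

Under this model the noise of the shared standard enters both ratios, so they are positively correlated, whereas separate standards leave the two ratios uncorrelated.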
In conclusion, the biological studies presented here, which are typical of the proteome community, highlight the need for robust experimental design that encompasses the appropriate application of statistical procedures. Such planning should include validation of the assumptions of the tests and procedures to ensure that the conclusions drawn from the study are valid. In designing the experiment, the expected technical and biological variability, the expected size of the expression changes, and the number of replicates will all need to be considered to ensure sufficient power to allow the detection of changes in expression.
Acknowledgments
We thank Dr. J. Byers, Dr. S. Coulthurst, and Dr. M. Deery for provision of samples. We especially thank Renata Feret for running the gels for the Erwinia biological study.
Footnotes

Published, MCP Papers in Press, May 17, 2007, DOI 10.1074/mcp.M600274-MCP200

1 The abbreviations used are: 2D, two-dimensional; FDR, false discovery rate; SA, standardized abundance; ECA, E. carotovora; PCER, per comparison error rate; FWER, familywise error rate; Q, quantile.

2 N. A. Karp, P. S. McCormick, M. R. Russell, and K. S. Lilley, unpublished observation.

* This work was supported in part by Biotechnology and Biological Sciences Research Council (BBSRC) Grant BB/C50694/1. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

S The online version of this article (available at http://www.mcponline.org) contains supplemental material.

§ A BBSRC research associate supported by BBSRC Grant BB/C50694/1.

‖ Supported by Unilever.

** Supported by a BBSRC Strategic Studentship BBS/Q/Q/2004/05630.
 Received July 26, 2006.
 Revision received May 14, 2007.
 © 2007 The American Society for Biochemistry and Molecular Biology