Abstract
Spectral counting has become a commonly used approach for measuring protein abundance in labelfree shotgun proteomics. At the same time, the development of data analysis methods has lagged behind. Currently most studies utilizing spectral counts rely on simple data transforms and posthoc corrections of conventional signaltonoise ratio statistics. However, these adjustments can neither handle the bias toward high abundance proteins nor deal with the drawbacks due to the limited number of replicates. We present a novel statistical framework (QSpec) for the significance analysis of differential expression with extensions to a variety of experimental design factors and adjustments for protein properties. Using synthetic and real experimental data sets, we show that the proposed method outperforms conventional statistical methods that search for differential expression for individual proteins. We illustrate the flexibility of the model by analyzing a data set with a complicated experimental design involving cellular localization and time course.
MSbased shotgun proteomics is currently the most commonly used approach for the identification and quantification of proteins in large scale studies (1, 2). A variety of mass spectrometrydriven protein quantification methods have been proposed involving stable isotope labeling of proteins or peptides coupled with MS/MS sequencing, e.g. ICAT (3), stable isotope labeling by amino acids in cell culture (SILAC) (4), and multiplexed quantitation using isobaric tags for relative and absolute quantitation (iTRAQ) (5) (for reviews, see Refs. 6 and 7). The well known limitations of label basedmethods include requirements for higher amounts of starting biological material, increased complexity of the experimental protocols, and high costs of reagents (7).
As a result, in recent years, socalled labelfree methods have received increasing attention as promising alternatives that automatically waive some of the disadvantages of using stable isotope labeling methods. Popular methods in this area have focused on the analysis of twodimensional images of ion intensities in the span of retention time and m/z from a LCMS or LCMS/MS run where peak intensities are used as the abundance measure (8–11). Despite the rich information contained in the LCMS data, daunting computational effort needs to be spent on processing the data, including background filtering, peak detection, and alignment (8, 11).
A viable labelfree quantitative strategy is spectral counting where the number of spectra matched to peptides from a protein is used as a surrogate measure of protein abundance. Although conceptually simple, recent studies have demonstrated that spectral counting can be as sensitive as ion peak intensities in terms of detection range while retaining linearity (12–20). A number of groups have proposed various types of normalized scores based on transformed spectral counts, including methods that explore weighted scoring by peptide match score (16), normalization by the number of potential peptide matches (17), peptide sequence length, overall experimentwide abundance (18), or incorporation of the probability of identification into counting (19). Standard statistical tests could also be applied on the raw/transformed counts to analyze the protein expression data (20–22).
Despite published examples of using spectral counting in proteomics, there is a lack of computational and statistical methods for analyzing this type of data that are as well established as the counterparts in gene expression data. These include differential expression analysis such as significance analysis of microarray data (SAM) (23), clustering and classification, and network analysis (24–26). Most studies demonstrating the use of spectral counts have resorted to datadriven corrections of conventional signaltonoise ratio statistics such as meanvariance model adjustment (27) and detection rate adjustment (20). These adjustments are primarily used to correct the bias in the statistic that favors large differences in highly abundant proteins. However, the technical challenges for modeling quantitative proteomics data are distinct in their own right. First neither ion peak intensity extraction nor spectral counting generates data that can easily be modeled with standard distributional assumptions as with gene expression data sets. This increases the burden of finding the appropriate statistical model and estimation methods. Second because of the limited amount of sample material available or MS instrument availability considerations, comparative profiling of two or more distinct biological conditions is rarely performed in sufficient number of replicates or samples. Lacking the opportunity to observe consistent evidence over multiple samples in homogeneous biological condition makes it difficult to perform robust estimation and inference on model parameters. Unless there are more than four or five replicates generated for each condition permutationbased methods for generating reference distributions will not work well.
Here we propose a general statistical framework for analyzing spectral count data. This method addresses the issue of the appropriate probability distribution for count data as well as tackles the paucity of information due to the absence of replicate samples. The model is based on the use of hierarchical Bayes estimation of generalized linear mixed effects model (GLMM)^{1} (28) where the spectral counts are considered to be random numbers from a large population of proteins, and hence the model parameters are directly shared within replicates and across proteins. This comprehensive modeling strategy is bound to be more powerful than calculating the signaltonoise ratio type of differential expression test statistics. These are performed on a per protein basis and referenced to an approximate null distribution especially when the number of replicates is limited.
This report is organized as follows. First the overall modeling framework and its applicability to a wide variety of experimental designs is explained, and its advantages are discussed. Then the performance of the proposed method using synthetic data sets is illustrated with a comparison with methods using signaltonoise ratio statistics. The comparison focuses on the power to detect differentially expressed proteins at fixed error rates and the property of the detected proteins such as abundance. For a real data analysis example, the experimental data set taken from Pavelka et al. (27) comparing proteome profiles of a yeast strain at two different phases in cell growth is reanalyzed. The enrichment analysis compares the biological functions highlighted by the protein signature detected by the proposed method with the conventional signaltonoise method, and related computational and statistical issues are discussed. Finally using the published data set of a systemwide survey of the mouse proteome in congestive heart failure (29) the proposed methodology is demonstrated in the presence of experimental design factors. Further discussion of potential improvements on the model and possible extensions concludes the report.
EXPERIMENTAL PROCEDURES
Experimental Data Sets
Three data sets were obtained from two published studies (27, 29). In all cases, no reanalysis of the raw MS data was performed in this work, i.e. the spectral count data were taken as reported in the supplemental materials provided in those publications. A brief description of the data sets is given below.
Yeast Control Data Set—
First Pavelka et al. (27) provided a data set containing four biological replicates of BY4741 strain of yeast grown in media enriched with different nitrogen isotopes (^{14}N and ^{15}N). Growing yeast in these two different media is not expected to result in differences in protein expression between these two samples. For each growth condition and each replicate, the LCMS/MS analysis was performed on 500 μg of protein extract. Proteins were TCAprecipitated, ureadenatured, reduced, alkylated, and digested with LysC followed by trypsin digestion. The resulting peptide mixtures were separated using a 12step multidimensional protein identification technology analysis. MS/MS spectra were collected on an LTQ linear ion trap mass spectrometer (ThermoFinnigan) equipped with a nanoLC electrospray ionization source. Each full MS scan was followed by five MS/MS scans using datadependent acquisition with the dynamic exclusion option specified as follows: repeat count, 1; repeat duration, 30 s; exclusion duration, 300 s. The peak lists were extracted from RAW files using the extract_ms.exe program with default parameter settings (consecutive scans acquired on the same peptide ion grouped into a single .dta file). The resulting peak lists were searched using SEQUEST against a yeast protein sequence database appended with decoy sequences. Protein level summaries were generated using DTASelect with SEQUEST score thresholds set to achieve a less than 1% false protein identification error rate based on decoy counts. The spectral count for each protein was calculated as the number of .dta files assigned a peptide from that protein with high SEQUEST scores passing DTASelect filtering criteria. In total, the data set contains four technical replicates for each of the two growth conditions (light and heavy isotope media) with 1307 proteins identified at least once in the eight analyses. This data set was used as a control data set for simulation studies.
Yeast Comparative Growth Data Set—
The second data set was also taken from Pavelka et al. (27) and represents four biological replicates of the same BY4741 yeast strain grown up to two different stages of cell growth, namely log and stationary phases. MS/MS spectra were collected and processed as described above. This data set contains 1856 unique proteins identified in any of the eight experiments (four replicates for each of the two growth phases). This data set exhibits a significant difference in protein expression levels between the log and stationary phase and was used in this study for the comparison of functional annotation in a real data analysis scenario.
Mouse Data Set—
The third data set was taken from a published mouse study on the causative effect of impaired calcium ion handling that leads to dilated cardiomyopathy and eventual death (29). Organellar protein fractions (mitochondrial, microsomal, and cytosolic) were extracted from pooled ventricle tissue and separated using centrifugation. 100 μg of protein extract was used in each LCMS/MS experiment, TCAprecipitated, denatured, reduced, alkylated, and digested sequentially with LysC and trypsin. The peptide mixtures were separated using a 12step multidimensional protein identification technology analysis followed by MS/MS sequencing on an LTQ mass spectrometer equipped with an electrospray ion source. Precursor ions were subjected to datadependent sequencing with dynamic exclusion enabled (one scan, no repeats, exclude for 90 s). The peak lists (.dta files) were extracted from RAW instrument files using extract_ms.exe with default parameter settings. The peak lists were searched using SEQUEST, and the search results were processed using the STATQUEST analysis program. The spectral count for each protein was calculated as the number of .dta files assigned a peptide from that protein with high confidence. The spectral count profiles of 6190 proteins in phospholamban mutant PLN R9C and wild type mice were compared at three time points in cytosol, microsome, and mitochondria. For each combination of time point and organelle, the spectral count profiles of the mutant and the wild type were paired, adding up to 18 spectral count profiles in total.
Simulation Data Sets
Using the first yeast data set described above, two groups of synthetic data sets were generated. Because the original cell cultures were grown in ^{14}N and ^{15}Nmedia and then mixed into four pools at a 1:1 ratio before LCMS/MS analysis, in effect these data had no real signals between the two groups in all proteins. To create synthetic data sets with nontrivial differential expression, the rows of the data matrix (proteins) were shuffled to ensure that the distribution of high and low abundance proteins is uniform across the rows. Then the first 200 proteins in the matrix were selected, and 2fold changes were inserted to the selected proteins, generating the first synthetic data set (F2). The second synthetic data set (F4) was generated by inserting 4fold changes to the selected proteins. Inserting a fixed fold change has been achieved on a proteinbyprotein basis. Counts in the replicates grown in ^{14}Nmedium were multiplied by the fold change if the mean count in the four replicates in ^{14}Nmedium was greater than the mean count in ^{15}Nmedium and vice versa. If a count in the group with smaller mean was 0, a randomly generated Poisson random count was inserted with mean equal to the fold change itself on the opposite group to bypass the null effect of multiplying 0 by the fold change.
To investigate the effect of the number of replicates on the power of detecting differentially expressed proteins, additional variants of the two data sets described above were derived by varying the number of replicates used: F21rep (taking first replicate for each condition), F22rep (taking the first two replicates), and F23rep (taking the first three replicates). The same was performed with the 4fold change data set to form subsets F41rep, F42rep, and F43rep, respectively. For the sake of consistency, the original data with all four replicates for each condition (growth media) were named F24rep and F44rep, respectively (data provided in supplemental Table 1). In addition, the aggregated counts across the four replicates within each condition were computed and saved as two columns of count sums (F2sum and F4sum). This last variant was generated to understand whether generating replicates helps by adding more signals to the total signal or by providing any direct information on the variability across replicates.
Functional Annotation
Interpretation of data was assisted by two annotation tools, FATIGO+ (30) and DAVID (31). These tools were used to assign significantly enriched functional categories to a selected set of proteins. FATIGO+ takes the set of target proteins and the set of background proteins, compares the enrichment of each functional category in the two sets, and reports the statistical significance of enriched functions in the former list. DAVID performs essentially the same operation with the option of specifying the background proteins as the complement of the target protein list among all proteins identified in the particular experiment or as the complement in the population of all known proteins in the public databases. FATIGO+ was utilized wherever the “background” list was well defined, and DAVID was used when it was otherwise.
Bayes Factors
A quantity called Bayes factor (32) was used as an indicator of statistical significance of the model parameters, e.g. regression coefficients for differential expression. Bayes factors are essentially likelihood ratios of two competing statistical models where the likelihood of each competing model is averaged over all possible parameter values by numerical integration. Suppose that we observe data X, and we have two models M_{1} and M_{2} that can describe the observation of X. For each model, we have parameters θ_{1} and θ_{2}, respectively. Then for i = 1 and i = 2, one can calculate the averaged likelihood. The Bayes factor is now defined as the ratio of the two averaged likelihoods. A large Bayes factor supports the second model M_{2} over M_{1} for describing the data X. If M_{2} is a model with a differential expression coefficient and M_{1} is a model without it, a large B indicates statistically significant differential expression.
False Discovery Rate (FDR) Estimation
The rate of false positives in the selection of differentially expressed proteins based on Bayes factors can be estimated using a mixture modelbased method of local FDR control (33, 34). Given a log transformed Bayes factor B, the local FDR (denoted as fdr) can be calculated according to Equation 3, where p_{0}(B) and p_{1}(B) are the proteomewide distribution of Bayes factor for proteins with trivial and significant differential expression, respectively, and π_{0} and π_{1} are the corresponding proportion of proteins. Using this method, one can choose a minimum threshold Bayes factor B* that controls the global FDR at a target rate of ∼5% as follows.
RESULTS
Statistical Model for Spectral Counts—
For a data set with n samples and p proteins, a modelbased method is proposed to select proteins whose absolute abundance changes by a statistically significant amount under different biological conditions. The MS/MS spectral counts of a protein are modeled as observations from the Poisson distribution. This represents a natural choice reflecting the stochastic nature of the peptide sampling process by the mass spectrometer. Similar assumptions are often made in related applications, e.g. in the serial analysis of gene expression (SAGE) approach (35). The expected counts are modeled as a linear function of normalizing factors, treatment or disease status, and other experimental information. Unlike in gene expression data sets, typical proteomics data sets have data over only a few replicates or samples, and as a result, fitting a Poisson regression model for individual proteins separately is often not feasible. The limited sample size can be as restrictive as the example in Fig. 1A, which shows the partial spectral count table for a data set with n = 2.
To address the challenge of small numbers of replicates, we utilize a statistical methodology called hierarchical Bayes that pools the statistical information on the regression models across proteins. Considering each protein as a member of the population of all identified proteins, we model the regression parameters for each protein as random effects. The random effect terms are the coefficients shared by the replicates within the same protein, allowing one to account for the intrasubject correlation of the data. The random effect terms for the baseline abundance of a particular protein are shared by every sample, and those for the treatment or disease status are shared by the replicates within the same condition.
More specifically, the analysis starts with the observed spectral count data matrix X = [X_{ij}]. Assuming that X_{ij} are observations from a Poisson distribution with expected count μ_{ij} for i = 1, 2, …, p, the expected count matrix is expressed as a GLMM, where μ_{ij} is the expected count for protein i in replicate j, L_{i} is the sequence length of protein i, N_{j} is the normalizing constant of replicate j, c_{0} is the baseline abundance, and b_{0i} and b_{1i} are the proteinspecific abundance and differential expression parameters for protein i. Most importantly, the treatment effect is defined as follows: T_{j} = 1 if replicate j is in treatment and T_{j} = 0 otherwise. The first term on the righthand side of Equation 5 is a fixed normalizing term often referred to as the “offset” in regression analysis. The protein sequence length L_{i} adjusts for the bias in the count for longer proteins, and the normalizing constant N_{j} of replicates adjusts for the overall abundance of each replicate or sample (18). For N_{j}, we use the average count across all identified proteins in sample j to reflect the total abundance of all proteins identified in each MS/MS experiment.
If the treatment effect b_{1i} is not a statistically significant term, then the model in Equation 5 reduces to the following. The full model (M_{F}) is denoted in Equation 5, and the reduced model (M_{R}) is denoted in Equation 6. If the evidence from the spectral count data supports M_{F} over M_{R}, the protein is considered as differentially expressed. If the protein is indeed differentially expressed, comparing the goodness of fit by M_{F} and M_{R} leads to the selection of differentially expressed proteins. This is because the model with the differential expression parameter fits the data better than the model without it. The exact protein selection method will be described in the next section more precisely.
Given the model setup, the probability distribution for the model parameters are specified as follows. Because M_{R} is a nested model of M_{F}, it suffices to write the model specification for M_{F}. Although the expected spectral counts are expressed in the form of a GLMM, the connection across the model parameters in different proteins has yet to be established. To this end, assume the following. This framework is called hierarchical Bayes because the set of parameters {b_{0i}} and {b_{1i}} for all proteins i = 1, 2, …, p are specified as random variables from the Gaussian distribution with inverse γdistributed variance parameters. Inverse γ distribution refers to the distribution of the reciprocal of the γ random variable with certain shape and scale parameters. Also it is a “conjugate” prior distribution in the sense that the posterior distribution is also an inverse γ distribution. The hierarchical structure in the model specification is known to result in shrinkage estimates that have better statistical properties. This provides the basis for more robust statistical estimation and inference procedures by pooling statistical information across all identified proteins (24), which tends to be helpful in small sample problems. The good property of information pooling has been a well established practice in gene expression data analysis (24, 36). Model parameters are estimated by sample averages of the posterior output from Markov chain Monte Carlo (26, 37) (see supplemental methods for more specific details on the estimation procedure).
Tests for Differential Expression and Multiple Testing Correction—
The strategy for determining whether each protein is differentially expressed between the two conditions is straightforward. For each protein, the Bayes factor (32) was calculated as follows. In Equation 8, the numerator and the denominator are essentially the likelihoods of observing the counts under M_{F} and M_{R}, respectively. Thus if this ratio is large, the data support the model with the differential expression parameters over the model without, providing probabilistic evidence that the protein is differentially expressed (see “Experimental Procedures” for details).
Conventionally a Bayes factor greater than 10 suggests a strong evidence for the model in the numerator, and a Bayes factor greater than 30 suggests a very strong evidence for the same model (32). However, these conventional cutoffs may not work efficiently in the high throughput data sets, and appropriate cutoffs have to be chosen in a way that the overall global error is controlled to a desired level. In this work, the distribution of Bayes factors with significant differential expression is discriminated from that without by mixture modeling (see “Experimental Procedures” for details).
Solely applying the Bayes factor threshold, however, does have its own potential drawbacks when there are low quality replicates. Empirically the Bayes factor can be overestimated because of the heterogeneity of counts across replicates rather than the real differential expression. This is especially true for extremely high abundance proteins. In this case, the averaged likelihood in the model without the differential expression parameter (M_{R}) tends to be penalized more than the model with the parameter (M_{F}). To address this issue, the selected proteins were required to have a fold change of no less than 50%. In the subsequent data analysis, of the proteins filtered by this step a small number were found to be in the high abundance range.
Simulation Study—
To assess the performance of the proposed method, it was compared with the conventional signaltonoise ratio statistics coupled with FDR control. Particularly the variance adjustment of tstatistics by the power law global error model (PLGEM) was reported to have improved the detection of differentially expressed proteins in Pavelka et al. (27); hence that method (PLGEMStN) was used in place of the conventional tstatistic.
The analysis was performed using a set of synthetic data sets containing 200 proteins with either 2 or 4fold change embedded in the much large list of proteins identified in the comparative analysis of two biological replicates of yeast (grown in ^{14}N and ^{15}Nmedia) with no differential expression expected between the replicates (see “Experimental Procedures” for details). The analysis was repeated for data sets containing a varying number of replicates for each of the two conditions (between one and four replicates). The raw spectral counts were converted into normalized spectral abundance factors (27), and the PLGEMStN model of Pavelka et al. (27) was used to calculate moderated tstatistics and their associated FDRadjusted p values. The proteins were selected using various cutoffs to examine the power over a wide range of FDRs.
Using the outputs from the two methods, PLGEMStN and the hierarchical Bayes method (referred to as QSpec from here on) presented here, the comparisons were made based on the power of detection at a fixed FDR. Importantly because the signaltonoise ratio statistics require the calculation of variance, methods like PLGEMStN (27) cannot be applied to data sets that have less than three replicates. Therefore the PLGEMStN analysis was performed for 2 and 4fold data sets with three or four replicates only, F23rep, F24rep, F43rep, and F44rep, respectively. Because the QSpec model does not have this limitation, it was applied on all data sets.
Fig. 2 illustrates the comparison using the synthetic data sets with 2fold change (Fig. 2, A and B) and 4fold change (Fig. 2, C and D), respectively. Several trends are apparent. With both methods, PLGEMStN and QSpec, increasing the number of replicates leads to the selection of a higher number of differentially expressed proteins. Also with the number of replicates fixed, both models are more successful at detecting proteins having higher fold change (compare corresponding 4 versus 2fold curves for QSpec and PLGEMStN models, Fig. 2, A versus C and B versus D, respectively). Comparing the two methods with each other when applied to the same data set, QSpec outperforms the PLGEMStN across the entire range of FDR values (compare Fig. 2, A versus B and C versus D, for 2 and 4fold data, respectively). For example, in the F24rep data set, QSpec selects 50 proteins (25%) at an FDR of 10%, whereas PLGEMStN selects only 24 proteins (12.5%). In the F44rep data set, QSpec collects 193 proteins (96.5%) at the same FDR level, whereas the other method selects 167 (83.5%). Furthermore the QSpec protein selection from the single replicate data sets, F21rep, performs no worse than the PLGEMStN selection from the threereplicate F23rep data set. Similarly the QSpec results in the tworeplicate F42rep data set are equivalent to the PLGEMStN results in the threereplicate F43rep data.
As an important feature of the model, QSpec performed equally well when applied to the aggregate data sets, F2sum and F4sum (with spectral counts from all replicates summed and represented as a single number), as with the original fourreplicate F24rep and F44rep data sets (data not shown). This can be explained by the fact that all the information used to fit the Poisson model is summarized in the count sum (sufficient statistic). More precisely, the Poisson model assumes that the expected count is equal to the variability of the counts (variance) due to its parameterization, so the model does not have a separate variance parameter. The consequences of this inherent assumption are further discussed later.
Because it was found that the real signals accumulate through replicates, the properties of proteins that could not be detected as differentially expressed in the synthetic data sets with only one or two replicates were investigated. One would expect that the area where the gain in statistical power due to replicates is most significant is among low abundance proteins. The ratio of biological signal to the technical noise in low abundance proteins tends to be too low to be accurately characterized using spectral counts in a single shotgun proteomics experiment. Fig. 3 shows the proportion of proteins selected by QSpec among the 200 proteins with spiked signals in the synthetic data sets. Indeed the protein selection achieves greater power in the low abundance range as more replicates are collected and is especially noticeable in the case of proteins with 2fold change (Fig. 3A).
Comparative Growth Analysis—
With the evidence that QSpec outperforms the method of conventional signaltonoise ratio statistics in simulated settings, the data set from the comparative growth phase analysis (27) was reanalyzed. Before applying QSpec, the distributions of spectral counts in each replicate were analyzed for homogeneity. The spectral counts for one of the replicates (LP3) were vastly higher in many midabundance proteins and lower in low and high abundance proteins relative to other replicates (see supplemental Fig. 1). The degree to which this replicate differed from the others was deemed more than what can be corrected by normalization procedures. For this reason, this replicate was expected to introduce unnecessary heterogeneity in the group logarithmic phase. Therefore, it was removed, and the analysis proceeded with the total of seven replicates. The final data set contained 1508 proteins including 10 contaminants that were excluded from the subsequent analysis.
Analysis with QSpec resulted in the selection of 298 proteins with a Bayes factor above 9.8 (see supplemental Table 2 for details of the analysis). Considering all proteins satisfying this criterion as differentially expressed would introduce on average a 5% or less FDR according to the mixture modelbased error estimation (see supplemental Fig. 2). Of the 298 proteins, 121 were overexpressed in the stationary phase, and 177 were overexpressed in the log phase. The GO annotations and their significance measures were given by FATIGO+, and the most significant terms (FDRcorrected p value less than 0.05) located in a reasonably high hierarchy of the GO are shown in Fig. 4A (also see supplemental Table 3 for the entire enrichment analysis results). For comparison, Fig. 4B shows the results of the original analysis presented in Pavelka et al. (27).
It should be noted that in that work the PLGEMStN method was applied not to the entire data set of all protein identifications but to the selected subset of 511 proteins that were identified in both the logarithmic and stationary phases. Subsequently the set of 100 proteins with the highest signaltonoise ratios was selected and categorized using gene ontology of which 34 were overexpressed in the stationary phase and the remaining 66 were overexpressed in the log phase. Among those 100 proteins, 82 were also in the list of 298 proteins selected by QSpec; this implies that the top list from Pavelka et al. (27) was almost completely recovered by QSpec.
Fig. 4 shows that almost all statistically enriched functions reported in the original study were also highlighted in this analysis. Biological processes such as translation and cellular biosynthetic process are the common top significant terms in both QSpec and PLGEMStN lists of proteins overexpressed in the log phase. Notably the multiple testingcorrected p values were much lower in the QSpec annotation table (higher significance), giving a high confidence explanation for the slowdown of biosynthesis machinery in the stationary phase of cell growth. Furthermore a large number of terms selected only in the QSpec annotation were found to be enriched in the list of proteins overexpressed in the stationary phase, including (especially glycolysisrelated) catabolism, cellular respiration, and oxidoreductase activity. This finding extends the biological interpretation beyond what was given in Pavelka et al. (27): as the cell growth process slows down in the stationary phase, the focus of molecular activities shifts to breaking down large molecules into smaller units and releasing energy, potentially creating energy required for chemical reactions in anabolism or more generally the maintenance of the cell.
The overlap of the lists of top differentially expressed proteins selected by QSpec and by PLGEMStN analysis of the entire list of 1508 proteins was also examined. Using the R implementation by Pavelka et al. (27) (plgem package available from BioConductor), PLGEMStN selected 319 proteins with the cutoff FDRadjusted p value less than 0.05. The overlap with the QSpecselected list (298 proteins) was only 172 proteins with most of the discrepancies appearing in the mid and low abundance proteins. However, subsequent enrichment analysis of these 319 proteins demonstrated diluted significance of relevant functional terms compared with those selected by QSpec and also by PLGEMStN with 511 proteins (see supplemental Table 3).
Accounting for Experimental Design—
Thus far the discussion has been limited to twogroup comparisons. However, the model can be extended to more flexible study designs, including designs studying subcellular localization (29). Fig. 1B shows a part of an example data matrix of spectral counts of the proteins identified in different cellular compartments at two different time points. The focus here was to investigate whether differentially expressed proteins were identified in specific cellular compartments or at particular time points.
These extra factors (localization and time course) can be coded into the model as additive main effect predictors and additive interaction predictors of expected count as seen in Equation 9, where b⃗_{2i} and b⃗_{3i} are the coefficient vectors for the main effect and interaction effect terms corresponding to the design factors. Assuming that the study design factors have a finite number of levels, e.g. two cellular compartments or two time points as in Fig. 1B, a total of K factors can be coded in the standard analysis of variance form as follows. Let M_{F} be the full model with all three sets of terms, i.e. 1) differential expression (treatment effect) b_{1i}; 2) main effects of design factors b⃗_{2i}; and 3) interaction effects of design factors and differential expression b⃗_{3i}. For k = 1, 2, …, K, also let M_{Rk} be the reduced model with the interaction effects between the kth factor and differential expression excluded from the full model, i.e. with every term above preserved but the coefficients for kth factor in b⃗_{3i}.
Testing for the differential expression specific to some levels in design factors can be done equivalently using Bayes factors. With K models now to compare with the full model, K Bayes factors are calculated using Equation 8 with the denominator replaced by those averaged likelihoods of reduced models. That is, for k = 1, 2, …, K, Bayes factor can be calculated according to Equation 10. By comparing the averaged likelihoods in the models with and without the interaction effects between the differential expression and the design factors, one can apply the same minimal threshold filter by Bayes factor. This leads to the selection of proteins whose differential expression is specific to certain cellular compartments or time points in the experiment.
Differential Expression with Time Course and Subcellular Localization Factors—
To demonstrate the use of the proposed methodology in the presence of experimental design factors, it was applied to the mouse data generated for the aforementioned PLN R9C mutant model. To identify proteins over or underexpressed to varying degrees over time in different organelles, the full model and three reduced models have been fitted with the time factors nested within each organelle as seen in Equations 11–14, where P_{j} and C_{j} are indicators for organelle and time course. Then the Bayes factors comparing M_{F} against the reduced models M_{R(CytoTime)}, M_{R(MicroTime)}, and M_{R(MitoTime)} effectively test the significance of differential expression specific to time course effects within each organelle (cytosol, microsome, and mitochondria), respectively.
Applying the same criterion of Bayes factor greater than or equal to 10, which gives an approximate FDR of 5% according to the mixture modelbased assessment, 444 differentially regulated proteins between mutant and wild type mice in specific time points within any of the three organelles were identified. Subsets of the 444 proteins pertaining to a particular change in expression (up/downregulation), time point (8, 16, and 24 weeks), and organelle (cytosol, microsome, and mitochondria) were subjected to the functional annotation tool DAVID. Fig. 5A shows the clusters of proteins that were up and downregulated in the mutants at specific time points in each organelle. Fig. 5B shows a heat map of differential expression between each pair of mutant and wild type. In this figure, overexpression in PLN R9C mutant is highlighted in yellow, and underexpression is highlighted in blue.
Overall mitochondrial proteins concerned with muscle development and calcium ion binding showed the most drastic changes with up and downregulation in the earlier two time points (8 and 16 weeks). A good number of proteins involved with antioxidant activity and fatty acid metabolism were underexpressed in week 8 consistent with biological interpretation in the original study (29). At the organelle level, cytoskeleton organization and actin cytoskeletonrelated proteins were consistently overexpressed across all time points in cytosol. A cluster composed of endoplasmic reticulum targeting sequence, response to protein stimulus, and protein unfolding protein was highlighted in the upregulated protein list in mitochondria at week 8. Many of the proteins in this list also appeared in the upregulated list at week 16 with functions such as muscle protein, contractile fiber, muscle contraction, actin filament depolymerization, and negative regulation of cell organization. These same functions remained significant at week 24 in mitochondria. In the microsome, glucose metabolism was enriched in the downregulated protein list consistent with antioxidant activities. Oxidative phosphorylation and ion transporter activity remained enriched across the time points in the upregulated list in this organelle as well as in others. In summary, upregulated calcium ion binding, cytoskeleton organization, and response to intracellular stress seem to have a strong association with the functional impairment on the cardiac ventricular muscle (see supplemental Table 4 for the selected proteins and their functional annotations).
DISCUSSION AND FUTURE WORK
At present, many studies that utilize spectral counting for relative quantification still rely on simple data analysis methods such as filtering based on fold change ratios. Such an approach selects proteins based solely on the effect size without incorporating the variability. Therefore it may introduce a number of false positive calls in low abundance proteins where a small difference may result in artificially large fold change ratios. More recently, a method has been described that improves the conventional signaltonoise ratio statistics by adjusting the variance terms based on the analysis of the spectral counts across multiple replicates (27). The limiting factor of this method is that it requires a sufficient number of replicates. Because the variability is estimated separately for each protein, the estimates are likely to be coarse when the source of the variance calculation is merely a few data points. Moreover the limited number of or total absence of replicates makes it difficult to find a robust method to assign significance to these statistics and reasonably control global false discovery rates. For example, in the popular method of referencing observed statistics to the permutation distribution, the number of possible permutations is 70 at most when there are four replicates in each comparison group, which gives a low resolution permutation distribution vulnerable to outlying observations.
The method presented here has several advantages. It can be applied to a variety of situations including the comparative experiments that feature no or a few numbers of replicates within each biological condition. In contrast to other methods, by assuming the equal mean and variance relationship, the Poisson model of QSpec faces no issues with the absence of replicates. Because the proteinspecific parameters are modeled as random numbers from a common population distribution, the method effectively pools statistical information needed for robust estimation (24) and provides a simple way to filter proteins based on a well established quantity known as Bayes factor (32) with an option of modelbased FDR control.
The method can also be extended to more complex experimental designs where proteins are first separated into many fractions. In this instance, one can insert protein fractionspecific parameters in the model to account for the initial separation. In any case, hierarchical Bayes estimation will effectively pool the statistical information across the proteins from different fractions for more robust parameter estimations and attempt to overcome the paucity of information because of the small sample sizes. Another advantage of the method is the flexibility for possible extensions to more complicated data structures. It was demonstrated in this work that the Poisson model can easily incorporate design factors in analysis of variance form, including time course and subcellular localization factors. This class of GLMMs with hierarchical Bayes estimation can be applied to even more general data analysis scenarios. These include longitudinal profiling study without comparative design (no differential expression), replicate analysis where the reproducibility of quantitation is studied by comparing the within and between replicate variability, and proteinprotein interaction study with a large number of pulldown experiments where the strength of interaction between pairs of proteins is validated based on the number of spectra corresponding to the interaction partners.
Yet there remain a number of areas for improvements in this modeling strategy. One well known problem with Poisson models is the potential violation of the assumption of the equal meanvariance relationship also termed the overdispersion problem. In data sets with many replicates, for instance, the observed data can include heterogeneous counts across replicates even within the same biological condition. In that case, the Poisson model with conventional assumptions may not work as efficiently. Furthermore using this model aggregating counts over replicates in a data set will produce largely identical results as in the case of applying it to the same data set but with replicates represented in it as separate experiments. In effect, this observation shows the drawback of the plain Poisson model from a different angle in that the model does not make full use of the variability observed in the data efficiently. Several extensions of the model are now being investigated. In addition to using the overdispersed Poisson model, another possibility is to use alternative distributions such as negative binomial models replacing the Poisson model used here. The latter model has a natural connection to Bayesian modeling through mixture model specification.
The discussion in this work was limited to spectral counts that were defined as the number of MS/MS spectra identified for each protein. However, related metrics such as the number of unique peptides are likely to contain additional useful information. Future work should involve detailed analysis of these different protein abundance parameters and their relative performance in different applications. To this end, future efforts should focus on designing multivariate statistical approaches that can effectively combine different abundance metrics leading to improved statistical power of detecting differential proteins. Furthermore such work should examine the effects of various instrument control settings on the accuracy of spectrum countingbased quantification.
Finally the protein inference problem of shotgun proteomics should not be overlooked because it also affects quantification (38). Peptides whose sequence is present in multiple proteins often cannot be unambiguously assigned to a particular protein or protein group in the protein summary file. The spectral counts for peptides shared among multiple proteins or protein isoforms should be appropriately weighted when computing the spectral count for each protein in a method similar to apportioning the probability of a peptide among all its corresponding proteins via peptideprotein weights when computing protein probabilities in ProteinProphet (39). For example, for a peptide identified from n MS/MS spectra and shared between two distinguishable proteins, A and B, its contribution to the spectral count of protein A could be taken as n × N_{A}^{d}/(N_{A}^{d} + N_{B}^{d}) where N_{A}^{d} and N_{B}^{d} are the spectral counts of proteins A and B, respectively, determined based on distinct (nonshared) peptides only. Note that the analysis presented in this work utilized spectral counts as provided in the original publications. Although less of an issue with yeast, in the mouse data set apportioning spectral counts of shared peptides as described above should provide more accurate protein abundance measures and thus more accurate results of the protein function enrichment analysis.
CONCLUSION
A statistical framework was presented for the significance analysis of differential expression in labelfree shotgun proteomics using spectral counts. The statistical methodology developed in this work is a proteomewide modelbased assessment of differential expression using GLMM equipped with a hierarchical Bayes estimation procedure that borrows statistical strengths across all proteins. Unlike the conventional methods using ad hoc data transformation, signaltonoise ratio, and posthoc datadriven adjustments, the proposed method is more powerful in finding differentially expressed proteins and robust to the variation because of the limited number of biological replicates at individual protein levels. The model showed superior performance in terms of its sensitivity of detection over existing methods. The real data analysis examples also illustrated the important advantages of handling the challenges because of the limited number of replicates and providing flexibility of extension of the model to more complicated study designs. It is expected that the computational framework presented in this work will be useful in a wide range of applications in labelfree shotgun proteomics.
Acknowledgments
We thank Andrew Emili and members of his laboratory for critical reading of the manuscript and useful discussions and Mike Washburn for drawing attention to the data set provided in Ref. 27. We also thank both groups for providing additional information regarding the data sets.
Footnotes

Published, MCP Papers in Press, July 20, 2008, DOI 10.1074/mcp.M800203MCP200

↵ ^{1} The abbreviations used are: GLMM, generalized linear mixed effects model; FDR, false discovery rates; PLGEM, power law global error model; PLN, phospholamban; DAVID, database for annotation, visualization, and integrated discovery; GO, gene ontology; StN, signal to noise.

↵* This work was supported, in whole or in part, by National Institutes of Health Grant R01 CA126239 from the NCI. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

↵S The online version of this article (available at http://www.mcponline.org) contains supplemental material.
 Received May 6, 2008.
 Revision received July 14, 2008.
 © 2008 The American Society for Biochemistry and Molecular Biology