Advertisement

Significance Analysis of Spectral Count Data in Label-free Shotgun Proteomics

  • Hyungwon Choi
    Affiliations
    Department of Pathology, University of Michigan, Ann Arbor, Michigan 48109

    Department of Biostatistics, University of Michigan, Ann Arbor, Michigan 48109
    Search for articles by this author
  • Damian Fermin
    Affiliations
    Department of Pathology, University of Michigan, Ann Arbor, Michigan 48109
    Search for articles by this author
  • Alexey I. Nesvizhskii
    Correspondence
    To whom correspondence should be addressed: Dept. of Pathology, University of Michigan, 1301 Catherine, 4237 MS1, Ann Arbor, Michigan 48109
    Affiliations
    Department of Pathology, University of Michigan, Ann Arbor, Michigan 48109

    Center for Computational Medicine and Biology, University of Michigan, Ann Arbor, Michigan 48109
    Search for articles by this author
Open AccessPublished:July 20, 2008DOI:https://doi.org/10.1074/mcp.M800203-MCP200
      Spectral counting has become a commonly used approach for measuring protein abundance in label-free shotgun proteomics. At the same time, the development of data analysis methods has lagged behind. Currently most studies utilizing spectral counts rely on simple data transforms and posthoc corrections of conventional signal-to-noise ratio statistics. However, these adjustments can neither handle the bias toward high abundance proteins nor deal with the drawbacks due to the limited number of replicates. We present a novel statistical framework (QSpec) for the significance analysis of differential expression with extensions to a variety of experimental design factors and adjustments for protein properties. Using synthetic and real experimental data sets, we show that the proposed method outperforms conventional statistical methods that search for differential expression for individual proteins. We illustrate the flexibility of the model by analyzing a data set with a complicated experimental design involving cellular localization and time course.
      MS-based shotgun proteomics is currently the most commonly used approach for the identification and quantification of proteins in large scale studies (
      • Domon B.
      • Aebersold R.
      Mass spectrometry and protein analysis.
      ,
      • Nesvizhskii A.I.
      • Vitek O.
      • Aebersold R.
      Analysis and validation of proteomic data generated by tandem mass spectrometry.
      ). A variety of mass spectrometry-driven protein quantification methods have been proposed involving stable isotope labeling of proteins or peptides coupled with MS/MS sequencing, e.g. ICAT (
      • Gygi S.P.
      • Rist B.
      • Gerber S.A.
      • Turecek F.
      • Gelb M.H.
      • Aebersold R.
      Quantitative analysis of complex protein mixtures using isotope-coded affinity tags.
      ), stable isotope labeling by amino acids in cell culture (SILAC) (
      • Ong S.E.
      • Blagoev B.
      • Kratchmarova I.
      • Kristensen D.B.
      • Steen H.
      • Pandey A.
      • Mann M.
      Stable isotope labeling by amino acids in cell culture, SILAC, as a simple and accurate approach to expression proteomics.
      ), and multiplexed quantitation using isobaric tags for relative and absolute quantitation (iTRAQ) (
      • Ross P.L.
      • Huang Y.N.
      • Marchese J.N.
      • Williamson B.
      • Parker K.
      • Hattan S.
      • Khainovski N.
      • Pillai S.
      • Dey S.
      • Daniels S.
      • Purkayastha S.
      • Juhasz P.
      • Martin S.
      • Bartlet-Jones M.
      • He F.
      • Jacobson A.
      • Pappin D.J.
      Multiplexed protein quantitation in Saccharomyces cerevisiae using amine-reactive isobaric tagging reagents.
      ) (for reviews, see Refs.
      • Goshe M.B.
      • Smith R.D.
      Stable isotope-coded proteomic mass spectrometry.
      and
      • Bantscheff M.
      • Schirle M.
      • Sweetman G.
      • Rick J.
      • Kuster B.
      Quantitative mass spectrometry in proteomics: a critical review.
      ). The well known limitations of label based-methods include requirements for higher amounts of starting biological material, increased complexity of the experimental protocols, and high costs of reagents (
      • Bantscheff M.
      • Schirle M.
      • Sweetman G.
      • Rick J.
      • Kuster B.
      Quantitative mass spectrometry in proteomics: a critical review.
      ).
      As a result, in recent years, so-called label-free methods have received increasing attention as promising alternatives that automatically waive some of the disadvantages of using stable isotope labeling methods. Popular methods in this area have focused on the analysis of two-dimensional images of ion intensities in the span of retention time and m/z from a LC-MS or LC-MS/MS run where peak intensities are used as the abundance measure (
      • Qian W.J.
      • Jacobs J.M.
      • Liu T.
      • Camp D.G.
      • Smith R.D.
      Advances and challenges in liquid chromatography-mass spectrometry-based proteomics profiling for clinical applications.
      • Li X.
      • Yi E.C.
      • Kemp C.J.
      • Zhang H.
      • Aebersold R.
      A software suite for the generation and comparison of peptide arrays from sets of data collected by liquid chromatography-mass spectrometry.
      • Jaffe J.D.
      • Mani D.R.
      • Leptos K.C.
      • Church G.M.
      • Gillette M.A.
      • Carr S.A.
      PEPPeR, a platform for experimental proteomic pattern recognition.
      • Listgarten J.
      • Emili A.
      Statistical and computational methods for comparative proteomic profiling using liquid chromatography-tandem mass spectrometry.
      ). Despite the rich information contained in the LC-MS data, daunting computational effort needs to be spent on processing the data, including background filtering, peak detection, and alignment (
      • Qian W.J.
      • Jacobs J.M.
      • Liu T.
      • Camp D.G.
      • Smith R.D.
      Advances and challenges in liquid chromatography-mass spectrometry-based proteomics profiling for clinical applications.
      ,
      • Listgarten J.
      • Emili A.
      Statistical and computational methods for comparative proteomic profiling using liquid chromatography-tandem mass spectrometry.
      ).
      A viable label-free quantitative strategy is spectral counting where the number of spectra matched to peptides from a protein is used as a surrogate measure of protein abundance. Although conceptually simple, recent studies have demonstrated that spectral counting can be as sensitive as ion peak intensities in terms of detection range while retaining linearity (
      • Liu H.
      • Sadygov R.G.
      • Yates III, J.R.
      A model for random sampling and estimation of relative protein abundance in shotgun proteomics.
      • Blondeau F.
      • Ritter B.
      • Allaire P.D.
      • Wasiak S.
      • Girard M.
      • Hussain N.K.
      • Angers A.
      • Legendre-Guillemin V.
      • Roy L.
      • Boismenu D.
      • Kearney R.E.
      • Bell A.W.
      • Bergeron J.J.
      • McPherson P.S.
      Tandem MS analysis of brain clathrin-coated vesicles reveals their critical involvement in synaptic vesicle recycling.
      • McAfee K.J.
      • Duncan D.T.
      • Assink M.
      • Link A.J.
      Analyzing proteomes and protein function using graphical comparative analysis of tandem mass spectrometry results.
      • Old W.M.
      • Meyer-Arendt K.
      • Aveline-Wolf L.
      • Pierce K.G.
      • Mendoza A.
      • Sevinsky J.R.
      • Resing K.A.
      • Ahn N.G.
      Comparison of label-free methods for quantifying human proteins by shotgun proteomics.
      • Ishihama Y.
      • Oda Y.
      • Tabata T.
      • Sato T.
      • Nagasu T.
      • Rappsilber J.
      • Mann M.
      Exponentially modified protein abundance index for estimation of absolute protein amount in proteomics by the number of sequenced peptides per protein.
      • Colinge J.
      • Chiappe D.
      • Lagache S.
      • Moniatte M.
      • Bougueleret L.
      Differential proteomics via probabilistic peptide identification scores.
      • Zybailov B.
      • Mosley A.I.
      • Sardiu M.E.
      • Coleman M.K.
      • Florens L.
      • Washburn M.P.
      Statistical analysis of membrane proteome expression changes in Saccharomyces cerevisiae..
      • Lu P.
      • Vogel C.
      • Wang R.
      • Yao X.
      • Marcotte E.M.
      Absolute protein expression profiling estimates the relative contributions of transcriptional and translational regulation.
      • Fu X.
      • Gharib S.A.
      • Green P.S.
      • Aitken M.L.
      • Frazer D.A.
      • Park D.R.
      • Vaisar T.
      • Heinecke J.W.
      Spectral index for assessment of differential protein expression in shotgun proteomics.
      ). A number of groups have proposed various types of normalized scores based on transformed spectral counts, including methods that explore weighted scoring by peptide match score (
      • Ishihama Y.
      • Oda Y.
      • Tabata T.
      • Sato T.
      • Nagasu T.
      • Rappsilber J.
      • Mann M.
      Exponentially modified protein abundance index for estimation of absolute protein amount in proteomics by the number of sequenced peptides per protein.
      ), normalization by the number of potential peptide matches (
      • Colinge J.
      • Chiappe D.
      • Lagache S.
      • Moniatte M.
      • Bougueleret L.
      Differential proteomics via probabilistic peptide identification scores.
      ), peptide sequence length, overall experiment-wide abundance (
      • Zybailov B.
      • Mosley A.I.
      • Sardiu M.E.
      • Coleman M.K.
      • Florens L.
      • Washburn M.P.
      Statistical analysis of membrane proteome expression changes in Saccharomyces cerevisiae..
      ), or incorporation of the probability of identification into counting (
      • Lu P.
      • Vogel C.
      • Wang R.
      • Yao X.
      • Marcotte E.M.
      Absolute protein expression profiling estimates the relative contributions of transcriptional and translational regulation.
      ). Standard statistical tests could also be applied on the raw/transformed counts to analyze the protein expression data (
      • Fu X.
      • Gharib S.A.
      • Green P.S.
      • Aitken M.L.
      • Frazer D.A.
      • Park D.R.
      • Vaisar T.
      • Heinecke J.W.
      Spectral index for assessment of differential protein expression in shotgun proteomics.
      • Zhang B.
      • VerBerkmoes N.C.
      • Langston M.A.
      • Uberbacher E.
      • Hettich R.I.
      • Samatova N.F.
      Detecting differential and correlated protein expression in label-free shotgun proteomics.
      • Xia Q.
      • Wang T.
      • Park Y.
      • Lamont R.J.
      • Hackett M.
      Differential quantitative proteomics of Porphyromonas gingivalis by linear ion trap mass spectrometry: non-label methods comparison, q-values and LOWESS curve fitting.
      ).
      Despite published examples of using spectral counting in proteomics, there is a lack of computational and statistical methods for analyzing this type of data that are as well established as the counterparts in gene expression data. These include differential expression analysis such as significance analysis of microarray data (SAM) (
      • Tusher V.G.
      • Tibshriani R.
      • Chu G.
      Significance analysis of microarrays applied to the ionizing radiation response.
      ), clustering and classification, and network analysis (
      • Parmigiani G.
      • Garrett E.S.
      • Irizarry R.A.
      • Zeger S.L.
      The Analysis of Gene Expression Data.
      • Do K.A.
      • Muller P.
      • Vannucci M.
      Bayesian Inference for Gene Expression and Proteomics.
      • Segal E.
      • Friedman N.
      • Kaminski N.
      • Regev A.
      • Koller D.
      From signatures to models: understanding cancer using microarrays.
      ). Most studies demonstrating the use of spectral counts have resorted to data-driven corrections of conventional signal-to-noise ratio statistics such as mean-variance model adjustment (
      • Pavelka N.M.
      • Fournier M.L.
      • Swanson S.K.
      • Pelizzola M.
      • Ricciardi-Castagnoli P.
      • Florens L.
      • Washburn M.P.
      Statistical similarities between transcriptomics and quantitative shotgun proteomics data.
      ) and detection rate adjustment (
      • Fu X.
      • Gharib S.A.
      • Green P.S.
      • Aitken M.L.
      • Frazer D.A.
      • Park D.R.
      • Vaisar T.
      • Heinecke J.W.
      Spectral index for assessment of differential protein expression in shotgun proteomics.
      ). These adjustments are primarily used to correct the bias in the statistic that favors large differences in highly abundant proteins. However, the technical challenges for modeling quantitative proteomics data are distinct in their own right. First neither ion peak intensity extraction nor spectral counting generates data that can easily be modeled with standard distributional assumptions as with gene expression data sets. This increases the burden of finding the appropriate statistical model and estimation methods. Second because of the limited amount of sample material available or MS instrument availability considerations, comparative profiling of two or more distinct biological conditions is rarely performed in sufficient number of replicates or samples. Lacking the opportunity to observe consistent evidence over multiple samples in homogeneous biological condition makes it difficult to perform robust estimation and inference on model parameters. Unless there are more than four or five replicates generated for each condition permutation-based methods for generating reference distributions will not work well.
      Here we propose a general statistical framework for analyzing spectral count data. This method addresses the issue of the appropriate probability distribution for count data as well as tackles the paucity of information due to the absence of replicate samples. The model is based on the use of hierarchical Bayes estimation of generalized linear mixed effects model (GLMM)
      The abbreviations used are: GLMM, generalized linear mixed effects model; FDR, false discovery rates; PLGEM, power law global error model; PLN, phospholamban; DAVID, database for annotation, visualization, and integrated discovery; GO, gene ontology; StN, signal to noise.
      1The abbreviations used are: GLMM, generalized linear mixed effects model; FDR, false discovery rates; PLGEM, power law global error model; PLN, phospholamban; DAVID, database for annotation, visualization, and integrated discovery; GO, gene ontology; StN, signal to noise.
      (
      • Zeger S.L.
      • Karim M.R.
      Generalized linear models with random effects; a Gibbs sampling approach.
      ) where the spectral counts are considered to be random numbers from a large population of proteins, and hence the model parameters are directly shared within replicates and across proteins. This comprehensive modeling strategy is bound to be more powerful than calculating the signal-to-noise ratio type of differential expression test statistics. These are performed on a per protein basis and referenced to an approximate null distribution especially when the number of replicates is limited.
      This report is organized as follows. First the overall modeling framework and its applicability to a wide variety of experimental designs is explained, and its advantages are discussed. Then the performance of the proposed method using synthetic data sets is illustrated with a comparison with methods using signal-to-noise ratio statistics. The comparison focuses on the power to detect differentially expressed proteins at fixed error rates and the property of the detected proteins such as abundance. For a real data analysis example, the experimental data set taken from Pavelka et al. (
      • Pavelka N.M.
      • Fournier M.L.
      • Swanson S.K.
      • Pelizzola M.
      • Ricciardi-Castagnoli P.
      • Florens L.
      • Washburn M.P.
      Statistical similarities between transcriptomics and quantitative shotgun proteomics data.
      ) comparing proteome profiles of a yeast strain at two different phases in cell growth is reanalyzed. The enrichment analysis compares the biological functions highlighted by the protein signature detected by the proposed method with the conventional signal-to-noise method, and related computational and statistical issues are discussed. Finally using the published data set of a system-wide survey of the mouse proteome in congestive heart failure (
      • Gramolini A.O.
      • Kislinger T.
      • Alikhani-Koopaei R.
      • Fong V.
      • Thompson N.J.
      • Isserlin R.
      • Sharma P.
      • Oudit G.Y.
      • Tivieri M.G.
      • Fagan A.
      • Kanna A.
      • Higgins D.
      • Huedig H.
      • Hess G.
      • Arab S.
      • Seidman J.G.
      • Seidman C.E.
      • Frey B.
      • Perry M.
      • Backx P.H.
      • Liu P.P.
      • MacLennan D.H.
      • Emili A.
      Comparative proteomic profiling of a phospholamban mutant mouse model of dilated cardiomyopathy reveals progressive intracellular stress responses.
      ) the proposed methodology is demonstrated in the presence of experimental design factors. Further discussion of potential improvements on the model and possible extensions concludes the report.

      EXPERIMENTAL PROCEDURES

      Experimental Data Sets

      Three data sets were obtained from two published studies (
      • Pavelka N.M.
      • Fournier M.L.
      • Swanson S.K.
      • Pelizzola M.
      • Ricciardi-Castagnoli P.
      • Florens L.
      • Washburn M.P.
      Statistical similarities between transcriptomics and quantitative shotgun proteomics data.
      ,
      • Gramolini A.O.
      • Kislinger T.
      • Alikhani-Koopaei R.
      • Fong V.
      • Thompson N.J.
      • Isserlin R.
      • Sharma P.
      • Oudit G.Y.
      • Tivieri M.G.
      • Fagan A.
      • Kanna A.
      • Higgins D.
      • Huedig H.
      • Hess G.
      • Arab S.
      • Seidman J.G.
      • Seidman C.E.
      • Frey B.
      • Perry M.
      • Backx P.H.
      • Liu P.P.
      • MacLennan D.H.
      • Emili A.
      Comparative proteomic profiling of a phospholamban mutant mouse model of dilated cardiomyopathy reveals progressive intracellular stress responses.
      ). In all cases, no reanalysis of the raw MS data was performed in this work, i.e. the spectral count data were taken as reported in the supplemental materials provided in those publications. A brief description of the data sets is given below.

      Yeast Control Data Set—

      First Pavelka et al. (
      • Pavelka N.M.
      • Fournier M.L.
      • Swanson S.K.
      • Pelizzola M.
      • Ricciardi-Castagnoli P.
      • Florens L.
      • Washburn M.P.
      Statistical similarities between transcriptomics and quantitative shotgun proteomics data.
      ) provided a data set containing four biological replicates of BY4741 strain of yeast grown in media enriched with different nitrogen isotopes (14N and 15N). Growing yeast in these two different media is not expected to result in differences in protein expression between these two samples. For each growth condition and each replicate, the LC-MS/MS analysis was performed on 500 μg of protein extract. Proteins were TCA-precipitated, urea-denatured, reduced, alkylated, and digested with Lys-C followed by trypsin digestion. The resulting peptide mixtures were separated using a 12-step multidimensional protein identification technology analysis. MS/MS spectra were collected on an LTQ linear ion trap mass spectrometer (ThermoFinnigan) equipped with a nano-LC electrospray ionization source. Each full MS scan was followed by five MS/MS scans using data-dependent acquisition with the dynamic exclusion option specified as follows: repeat count, 1; repeat duration, 30 s; exclusion duration, 300 s. The peak lists were extracted from RAW files using the extract_ms.exe program with default parameter settings (consecutive scans acquired on the same peptide ion grouped into a single .dta file). The resulting peak lists were searched using SEQUEST against a yeast protein sequence database appended with decoy sequences. Protein level summaries were generated using DTASelect with SEQUEST score thresholds set to achieve a less than 1% false protein identification error rate based on decoy counts. The spectral count for each protein was calculated as the number of .dta files assigned a peptide from that protein with high SEQUEST scores passing DTASelect filtering criteria. In total, the data set contains four technical replicates for each of the two growth conditions (light and heavy isotope media) with 1307 proteins identified at least once in the eight analyses. This data set was used as a control data set for simulation studies.

      Yeast Comparative Growth Data Set—

      The second data set was also taken from Pavelka et al. (
      • Pavelka N.M.
      • Fournier M.L.
      • Swanson S.K.
      • Pelizzola M.
      • Ricciardi-Castagnoli P.
      • Florens L.
      • Washburn M.P.
      Statistical similarities between transcriptomics and quantitative shotgun proteomics data.
      ) and represents four biological replicates of the same BY4741 yeast strain grown up to two different stages of cell growth, namely log and stationary phases. MS/MS spectra were collected and processed as described above. This data set contains 1856 unique proteins identified in any of the eight experiments (four replicates for each of the two growth phases). This data set exhibits a significant difference in protein expression levels between the log and stationary phase and was used in this study for the comparison of functional annotation in a real data analysis scenario.

      Mouse Data Set—

      The third data set was taken from a published mouse study on the causative effect of impaired calcium ion handling that leads to dilated cardiomyopathy and eventual death (
      • Gramolini A.O.
      • Kislinger T.
      • Alikhani-Koopaei R.
      • Fong V.
      • Thompson N.J.
      • Isserlin R.
      • Sharma P.
      • Oudit G.Y.
      • Tivieri M.G.
      • Fagan A.
      • Kanna A.
      • Higgins D.
      • Huedig H.
      • Hess G.
      • Arab S.
      • Seidman J.G.
      • Seidman C.E.
      • Frey B.
      • Perry M.
      • Backx P.H.
      • Liu P.P.
      • MacLennan D.H.
      • Emili A.
      Comparative proteomic profiling of a phospholamban mutant mouse model of dilated cardiomyopathy reveals progressive intracellular stress responses.
      ). Organellar protein fractions (mitochondrial, microsomal, and cytosolic) were extracted from pooled ventricle tissue and separated using centrifugation. 100 μg of protein extract was used in each LC-MS/MS experiment, TCA-precipitated, denatured, reduced, alkylated, and digested sequentially with Lys-C and trypsin. The peptide mixtures were separated using a 12-step multidimensional protein identification technology analysis followed by MS/MS sequencing on an LTQ mass spectrometer equipped with an electrospray ion source. Precursor ions were subjected to data-dependent sequencing with dynamic exclusion enabled (one scan, no repeats, exclude for 90 s). The peak lists (.dta files) were extracted from RAW instrument files using extract_ms.exe with default parameter settings. The peak lists were searched using SEQUEST, and the search results were processed using the STATQUEST analysis program. The spectral count for each protein was calculated as the number of .dta files assigned a peptide from that protein with high confidence. The spectral count profiles of 6190 proteins in phospholamban mutant PLN R9C and wild type mice were compared at three time points in cytosol, microsome, and mitochondria. For each combination of time point and organelle, the spectral count profiles of the mutant and the wild type were paired, adding up to 18 spectral count profiles in total.

      Simulation Data Sets

      Using the first yeast data set described above, two groups of synthetic data sets were generated. Because the original cell cultures were grown in 14N- and 15N-media and then mixed into four pools at a 1:1 ratio before LC-MS/MS analysis, in effect these data had no real signals between the two groups in all proteins. To create synthetic data sets with non-trivial differential expression, the rows of the data matrix (proteins) were shuffled to ensure that the distribution of high and low abundance proteins is uniform across the rows. Then the first 200 proteins in the matrix were selected, and 2-fold changes were inserted to the selected proteins, generating the first synthetic data set (F2). The second synthetic data set (F4) was generated by inserting 4-fold changes to the selected proteins. Inserting a fixed -fold change has been achieved on a protein-by-protein basis. Counts in the replicates grown in 14N-medium were multiplied by the -fold change if the mean count in the four replicates in 14N-medium was greater than the mean count in 15N-medium and vice versa. If a count in the group with smaller mean was 0, a randomly generated Poisson random count was inserted with mean equal to the -fold change itself on the opposite group to bypass the null effect of multiplying 0 by the -fold change.
      To investigate the effect of the number of replicates on the power of detecting differentially expressed proteins, additional variants of the two data sets described above were derived by varying the number of replicates used: F2-1rep (taking first replicate for each condition), F2-2rep (taking the first two replicates), and F2-3rep (taking the first three replicates). The same was performed with the 4-fold change data set to form subsets F4-1rep, F4-2rep, and F4-3rep, respectively. For the sake of consistency, the original data with all four replicates for each condition (growth media) were named F2-4rep and F4-4rep, respectively (data provided in supplemental Table 1). In addition, the aggregated counts across the four replicates within each condition were computed and saved as two columns of count sums (F2-sum and F4-sum). This last variant was generated to understand whether generating replicates helps by adding more signals to the total signal or by providing any direct information on the variability across replicates.

      Functional Annotation

      Interpretation of data was assisted by two annotation tools, FATIGO+ (
      • Al-Shahrour F.
      • Minguez P.
      • Tarraga J.
      • Montaner D.
      • Alloza E.
      • Vazquerizas J.M.M.
      • Conde L.
      • Blaschke C.
      • Vera J.
      • Dopazo J.
      BABELOMICS: a systems biology perspective in the functional annotation of genome-scale experiments.
      ) and DAVID (
      • Dennis Jr., G.
      • Sherman B.T.
      • Hosack D.A.
      • Yang J.
      • Gao W.
      • Lane H.C.
      • Lempicki R.A.
      DAVID: database for annotation, visualization, and integrated discovery.
      ). These tools were used to assign significantly enriched functional categories to a selected set of proteins. FATIGO+ takes the set of target proteins and the set of background proteins, compares the enrichment of each functional category in the two sets, and reports the statistical significance of enriched functions in the former list. DAVID performs essentially the same operation with the option of specifying the background proteins as the complement of the target protein list among all proteins identified in the particular experiment or as the complement in the population of all known proteins in the public databases. FATIGO+ was utilized wherever the “background” list was well defined, and DAVID was used when it was otherwise.

      Bayes Factors

      A quantity called Bayes factor (
      • Jeffreys H.
      The Theory of Probability.
      ) was used as an indicator of statistical significance of the model parameters, e.g. regression coefficients for differential expression. Bayes factors are essentially likelihood ratios of two competing statistical models where the likelihood of each competing model is averaged over all possible parameter values by numerical integration. Suppose that we observe data X, and we have two models M1 and M2 that can describe the observation of X. For each model, we have parameters θ1 and θ2, respectively. Then for i = 1 and i = 2, one can calculate the averaged likelihood.
      p(X|Mi)=p(X,Θi|Mi)dΘi=p(Θi|Mi)p(X|Θi,Mi)dΘi
      (Eq. 1)


      The Bayes factor is now defined as the ratio of the two averaged likelihoods.
      B(X)=p(X|M2)p(X|M1)=p(Θ2|M2)p(X|Θ2,M2)dΘ2p(Θ1|M1)p(X|Θ1,M1)dΘ1
      (Eq. 2)


      A large Bayes factor supports the second model M2 over M1 for describing the data X. If M2 is a model with a differential expression coefficient and M1 is a model without it, a large B indicates statistically significant differential expression.

      False Discovery Rate (FDR) Estimation

      The rate of false positives in the selection of differentially expressed proteins based on Bayes factors can be estimated using a mixture model-based method of local FDR control (
      • Efron B.
      Large-scale simultaneous hypothesis testing: the choice of a null hypothesis.
      ,
      • Efron B.
      Size, power and false discovery rates.
      ). Given a log transformed Bayes factor B, the local FDR (denoted as fdr) can be calculated according to Equation 3,
      fdr(B)=π0p0(B)π0p0(B)+π1p1(B)
      (Eq. 3)


      where p0(B) and p1(B) are the proteome-wide distribution of Bayes factor for proteins with trivial and significant differential expression, respectively, and π0 and π1 are the corresponding proportion of proteins. Using this method, one can choose a minimum threshold Bayes factor B* that controls the global FDR at a target rate of ∼5% as follows.
      FDR(B*)=B>Bπ0p0(B)dBB>B*π0p0(B)+π1p1(B)dB
      (Eq. 4)


      DISCUSSION AND FUTURE WORK

      At present, many studies that utilize spectral counting for relative quantification still rely on simple data analysis methods such as filtering based on -fold change ratios. Such an approach selects proteins based solely on the effect size without incorporating the variability. Therefore it may introduce a number of false positive calls in low abundance proteins where a small difference may result in artificially large -fold change ratios. More recently, a method has been described that improves the conventional signal-to-noise ratio statistics by adjusting the variance terms based on the analysis of the spectral counts across multiple replicates (
      • Pavelka N.M.
      • Fournier M.L.
      • Swanson S.K.
      • Pelizzola M.
      • Ricciardi-Castagnoli P.
      • Florens L.
      • Washburn M.P.
      Statistical similarities between transcriptomics and quantitative shotgun proteomics data.
      ). The limiting factor of this method is that it requires a sufficient number of replicates. Because the variability is estimated separately for each protein, the estimates are likely to be coarse when the source of the variance calculation is merely a few data points. Moreover the limited number of or total absence of replicates makes it difficult to find a robust method to assign significance to these statistics and reasonably control global false discovery rates. For example, in the popular method of referencing observed statistics to the permutation distribution, the number of possible permutations is 70 at most when there are four replicates in each comparison group, which gives a low resolution permutation distribution vulnerable to outlying observations.
      The method presented here has several advantages. It can be applied to a variety of situations including the comparative experiments that feature no or a few numbers of replicates within each biological condition. In contrast to other methods, by assuming the equal mean and variance relationship, the Poisson model of QSpec faces no issues with the absence of replicates. Because the protein-specific parameters are modeled as random numbers from a common population distribution, the method effectively pools statistical information needed for robust estimation (
      • Parmigiani G.
      • Garrett E.S.
      • Irizarry R.A.
      • Zeger S.L.
      The Analysis of Gene Expression Data.
      ) and provides a simple way to filter proteins based on a well established quantity known as Bayes factor (
      • Jeffreys H.
      The Theory of Probability.
      ) with an option of model-based FDR control.
      The method can also be extended to more complex experimental designs where proteins are first separated into many fractions. In this instance, one can insert protein fraction-specific parameters in the model to account for the initial separation. In any case, hierarchical Bayes estimation will effectively pool the statistical information across the proteins from different fractions for more robust parameter estimations and attempt to overcome the paucity of information because of the small sample sizes. Another advantage of the method is the flexibility for possible extensions to more complicated data structures. It was demonstrated in this work that the Poisson model can easily incorporate design factors in analysis of variance form, including time course and subcellular localization factors. This class of GLMMs with hierarchical Bayes estimation can be applied to even more general data analysis scenarios. These include longitudinal profiling study without comparative design (no differential expression), replicate analysis where the reproducibility of quantitation is studied by comparing the within and between replicate variability, and protein-protein interaction study with a large number of pulldown experiments where the strength of interaction between pairs of proteins is validated based on the number of spectra corresponding to the interaction partners.
      Yet there remain a number of areas for improvements in this modeling strategy. One well known problem with Poisson models is the potential violation of the assumption of the equal mean-variance relationship also termed the overdispersion problem. In data sets with many replicates, for instance, the observed data can include heterogeneous counts across replicates even within the same biological condition. In that case, the Poisson model with conventional assumptions may not work as efficiently. Furthermore using this model aggregating counts over replicates in a data set will produce largely identical results as in the case of applying it to the same data set but with replicates represented in it as separate experiments. In effect, this observation shows the drawback of the plain Poisson model from a different angle in that the model does not make full use of the variability observed in the data efficiently. Several extensions of the model are now being investigated. In addition to using the overdispersed Poisson model, another possibility is to use alternative distributions such as negative binomial models replacing the Poisson model used here. The latter model has a natural connection to Bayesian modeling through mixture model specification.
      The discussion in this work was limited to spectral counts that were defined as the number of MS/MS spectra identified for each protein. However, related metrics such as the number of unique peptides are likely to contain additional useful information. Future work should involve detailed analysis of these different protein abundance parameters and their relative performance in different applications. To this end, future efforts should focus on designing multivariate statistical approaches that can effectively combine different abundance metrics leading to improved statistical power of detecting differential proteins. Furthermore such work should examine the effects of various instrument control settings on the accuracy of spectrum counting-based quantification.
      Finally the protein inference problem of shotgun proteomics should not be overlooked because it also affects quantification (
      • Nesvizhskii A.I.
      • Aebersold R.
      Interpretation of shotgun proteomic data.
      ). Peptides whose sequence is present in multiple proteins often cannot be unambiguously assigned to a particular protein or protein group in the protein summary file. The spectral counts for peptides shared among multiple proteins or protein isoforms should be appropriately weighted when computing the spectral count for each protein in a method similar to apportioning the probability of a peptide among all its corresponding proteins via peptide-protein weights when computing protein probabilities in ProteinProphet (
      • Nesvizhskii A.I.
      • Keller A.
      • Kolker E.
      • Aebersold R.
      A statistical model for identifying proteins by tandem mass spectrometry.
      ). For example, for a peptide identified from n MS/MS spectra and shared between two distinguishable proteins, A and B, its contribution to the spectral count of protein A could be taken as n × NAd/(NAd + NBd) where NAd and NBd are the spectral counts of proteins A and B, respectively, determined based on distinct (non-shared) peptides only. Note that the analysis presented in this work utilized spectral counts as provided in the original publications. Although less of an issue with yeast, in the mouse data set apportioning spectral counts of shared peptides as described above should provide more accurate protein abundance measures and thus more accurate results of the protein function enrichment analysis.

      CONCLUSION

      A statistical framework was presented for the significance analysis of differential expression in label-free shotgun proteomics using spectral counts. The statistical methodology developed in this work is a proteome-wide model-based assessment of differential expression using GLMM equipped with a hierarchical Bayes estimation procedure that borrows statistical strengths across all proteins. Unlike the conventional methods using ad hoc data transformation, signal-to-noise ratio, and posthoc data-driven adjustments, the proposed method is more powerful in finding differentially expressed proteins and robust to the variation because of the limited number of biological replicates at individual protein levels. The model showed superior performance in terms of its sensitivity of detection over existing methods. The real data analysis examples also illustrated the important advantages of handling the challenges because of the limited number of replicates and providing flexibility of extension of the model to more complicated study designs. It is expected that the computational framework presented in this work will be useful in a wide range of applications in label-free shotgun proteomics.

      Acknowledgments

      We thank Andrew Emili and members of his laboratory for critical reading of the manuscript and useful discussions and Mike Washburn for drawing attention to the data set provided in Ref.
      • Pavelka N.M.
      • Fournier M.L.
      • Swanson S.K.
      • Pelizzola M.
      • Ricciardi-Castagnoli P.
      • Florens L.
      • Washburn M.P.
      Statistical similarities between transcriptomics and quantitative shotgun proteomics data.
      . We also thank both groups for providing additional information regarding the data sets.

      REFERENCES

        • Domon B.
        • Aebersold R.
        Mass spectrometry and protein analysis.
        Science. 2006; 312: 212-217
        • Nesvizhskii A.I.
        • Vitek O.
        • Aebersold R.
        Analysis and validation of proteomic data generated by tandem mass spectrometry.
        Nat. Methods. 2007; 4: 787-797
        • Gygi S.P.
        • Rist B.
        • Gerber S.A.
        • Turecek F.
        • Gelb M.H.
        • Aebersold R.
        Quantitative analysis of complex protein mixtures using isotope-coded affinity tags.
        Nat. Biotechnol. 1999; 17: 994-999
        • Ong S.E.
        • Blagoev B.
        • Kratchmarova I.
        • Kristensen D.B.
        • Steen H.
        • Pandey A.
        • Mann M.
        Stable isotope labeling by amino acids in cell culture, SILAC, as a simple and accurate approach to expression proteomics.
        Mol. Cell. Proteomics. 2002; 1: 376-386
        • Ross P.L.
        • Huang Y.N.
        • Marchese J.N.
        • Williamson B.
        • Parker K.
        • Hattan S.
        • Khainovski N.
        • Pillai S.
        • Dey S.
        • Daniels S.
        • Purkayastha S.
        • Juhasz P.
        • Martin S.
        • Bartlet-Jones M.
        • He F.
        • Jacobson A.
        • Pappin D.J.
        Multiplexed protein quantitation in Saccharomyces cerevisiae using amine-reactive isobaric tagging reagents.
        Mol. Cell. Proteomics. 2004; 3: 1154-1169
        • Goshe M.B.
        • Smith R.D.
        Stable isotope-coded proteomic mass spectrometry.
        Curr. Opin. Biotechnol. 2003; 14: 101-109
        • Bantscheff M.
        • Schirle M.
        • Sweetman G.
        • Rick J.
        • Kuster B.
        Quantitative mass spectrometry in proteomics: a critical review.
        Anal. Bioanal. Chem. 2007; 389: 1017-1031
        • Qian W.J.
        • Jacobs J.M.
        • Liu T.
        • Camp D.G.
        • Smith R.D.
        Advances and challenges in liquid chromatography-mass spectrometry-based proteomics profiling for clinical applications.
        Mol. Cell. Proteomics. 2006; 5: 1727-1744
        • Li X.
        • Yi E.C.
        • Kemp C.J.
        • Zhang H.
        • Aebersold R.
        A software suite for the generation and comparison of peptide arrays from sets of data collected by liquid chromatography-mass spectrometry.
        Mol. Cell. Proteomics. 2005; 4: 1328-1340
        • Jaffe J.D.
        • Mani D.R.
        • Leptos K.C.
        • Church G.M.
        • Gillette M.A.
        • Carr S.A.
        PEPPeR, a platform for experimental proteomic pattern recognition.
        Mol. Cell. Proteomics. 2006; 5: 1927-1941
        • Listgarten J.
        • Emili A.
        Statistical and computational methods for comparative proteomic profiling using liquid chromatography-tandem mass spectrometry.
        Mol. Cell. Proteomics. 2005; 4: 419-434
        • Liu H.
        • Sadygov R.G.
        • Yates III, J.R.
        A model for random sampling and estimation of relative protein abundance in shotgun proteomics.
        Anal. Chem. 2004; 76: 4193-4201
        • Blondeau F.
        • Ritter B.
        • Allaire P.D.
        • Wasiak S.
        • Girard M.
        • Hussain N.K.
        • Angers A.
        • Legendre-Guillemin V.
        • Roy L.
        • Boismenu D.
        • Kearney R.E.
        • Bell A.W.
        • Bergeron J.J.
        • McPherson P.S.
        Tandem MS analysis of brain clathrin-coated vesicles reveals their critical involvement in synaptic vesicle recycling.
        Proc. Natl. Acad. Sci. U. S. A. 2004; 101: 3833-3838
        • McAfee K.J.
        • Duncan D.T.
        • Assink M.
        • Link A.J.
        Analyzing proteomes and protein function using graphical comparative analysis of tandem mass spectrometry results.
        Mol. Cell. Proteomics. 2006; 5: 1497-1513
        • Old W.M.
        • Meyer-Arendt K.
        • Aveline-Wolf L.
        • Pierce K.G.
        • Mendoza A.
        • Sevinsky J.R.
        • Resing K.A.
        • Ahn N.G.
        Comparison of label-free methods for quantifying human proteins by shotgun proteomics.
        Mol. Cell. Proteomics. 2005; 4: 1487-1502
        • Ishihama Y.
        • Oda Y.
        • Tabata T.
        • Sato T.
        • Nagasu T.
        • Rappsilber J.
        • Mann M.
        Exponentially modified protein abundance index for estimation of absolute protein amount in proteomics by the number of sequenced peptides per protein.
        Mol. Cell. Proteomics. 2005; 4: 1265-1272
        • Colinge J.
        • Chiappe D.
        • Lagache S.
        • Moniatte M.
        • Bougueleret L.
        Differential proteomics via probabilistic peptide identification scores.
        Anal. Chem. 2005; 77: 596-606
        • Zybailov B.
        • Mosley A.I.
        • Sardiu M.E.
        • Coleman M.K.
        • Florens L.
        • Washburn M.P.
        Statistical analysis of membrane proteome expression changes in Saccharomyces cerevisiae..
        J. Proteome Res. 2006; 5: 2339-2347
        • Lu P.
        • Vogel C.
        • Wang R.
        • Yao X.
        • Marcotte E.M.
        Absolute protein expression profiling estimates the relative contributions of transcriptional and translational regulation.
        Nat. Biotechnol. 2007; 25: 117-124
        • Fu X.
        • Gharib S.A.
        • Green P.S.
        • Aitken M.L.
        • Frazer D.A.
        • Park D.R.
        • Vaisar T.
        • Heinecke J.W.
        Spectral index for assessment of differential protein expression in shotgun proteomics.
        J. Proteome Res. 2008; 7: 845-854
        • Zhang B.
        • VerBerkmoes N.C.
        • Langston M.A.
        • Uberbacher E.
        • Hettich R.I.
        • Samatova N.F.
        Detecting differential and correlated protein expression in label-free shotgun proteomics.
        J. Proteome Res. 2006; 5: 2909-2918
        • Xia Q.
        • Wang T.
        • Park Y.
        • Lamont R.J.
        • Hackett M.
        Differential quantitative proteomics of Porphyromonas gingivalis by linear ion trap mass spectrometry: non-label methods comparison, q-values and LOWESS curve fitting.
        Int. J. Mass Spectrom. 2007; 259: 105-116
        • Tusher V.G.
        • Tibshriani R.
        • Chu G.
        Significance analysis of microarrays applied to the ionizing radiation response.
        Proc. Natl. Acad. Sci. U. S. A. 2001; 98: 5116-5121
        • Parmigiani G.
        • Garrett E.S.
        • Irizarry R.A.
        • Zeger S.L.
        The Analysis of Gene Expression Data.
        Springer-Verlag, New York2003
        • Do K.A.
        • Muller P.
        • Vannucci M.
        Bayesian Inference for Gene Expression and Proteomics.
        Cambridge University Press, New York2006
        • Segal E.
        • Friedman N.
        • Kaminski N.
        • Regev A.
        • Koller D.
        From signatures to models: understanding cancer using microarrays.
        Nat. Genet. 2005; 37: S38-S45
        • Pavelka N.M.
        • Fournier M.L.
        • Swanson S.K.
        • Pelizzola M.
        • Ricciardi-Castagnoli P.
        • Florens L.
        • Washburn M.P.
        Statistical similarities between transcriptomics and quantitative shotgun proteomics data.
        Mol. Cell. Proteomics. 2008; 7: 631-644
        • Zeger S.L.
        • Karim M.R.
        Generalized linear models with random effects; a Gibbs sampling approach.
        J. Am. Stat. Assoc. 1991; 86: 79-86
        • Gramolini A.O.
        • Kislinger T.
        • Alikhani-Koopaei R.
        • Fong V.
        • Thompson N.J.
        • Isserlin R.
        • Sharma P.
        • Oudit G.Y.
        • Tivieri M.G.
        • Fagan A.
        • Kanna A.
        • Higgins D.
        • Huedig H.
        • Hess G.
        • Arab S.
        • Seidman J.G.
        • Seidman C.E.
        • Frey B.
        • Perry M.
        • Backx P.H.
        • Liu P.P.
        • MacLennan D.H.
        • Emili A.
        Comparative proteomic profiling of a phospholamban mutant mouse model of dilated cardiomyopathy reveals progressive intracellular stress responses.
        Mol. Cell. Proteomics. 2008; 7: 519-533
        • Al-Shahrour F.
        • Minguez P.
        • Tarraga J.
        • Montaner D.
        • Alloza E.
        • Vazquerizas J.M.M.
        • Conde L.
        • Blaschke C.
        • Vera J.
        • Dopazo J.
        BABELOMICS: a systems biology perspective in the functional annotation of genome-scale experiments.
        Nucleic Acids Res. 2006; 34: W472-W476
        • Dennis Jr., G.
        • Sherman B.T.
        • Hosack D.A.
        • Yang J.
        • Gao W.
        • Lane H.C.
        • Lempicki R.A.
        DAVID: database for annotation, visualization, and integrated discovery.
        Genome Biol. 2003; 4: P3
        • Jeffreys H.
        The Theory of Probability.
        Oxford University Press, Oxford1961
        • Efron B.
        Large-scale simultaneous hypothesis testing: the choice of a null hypothesis.
        J. Amer. Stat. Assoc. 2004; 99: 96-104
        • Efron B.
        Size, power and false discovery rates.
        Ann. Stat. 2007; 35: 1351-1377
        • Cai L.
        • Huang H.
        • Blackshaw S.
        • Liu J.S.
        • Cepko C.
        • Wong W.H.
        Clustering analysis of SAGE data using a Poisson approach.
        Genome Biol. 2004; 5: R51
        • Smyth G.K.
        Linear models and empirical Bayes methods for assessing differential expression in microarray experiments.
        Stat. Appl. Genet. Mol. Biol. 2004; 3: 3
        • Robert C.P.
        • Casella G.
        Monte Carlo Statistical Methods.
        Springer, New York2004
        • Nesvizhskii A.I.
        • Aebersold R.
        Interpretation of shotgun proteomic data.
        Mol. Cell. Proteomics. 2005; 4: 1419-1440
        • Nesvizhskii A.I.
        • Keller A.
        • Kolker E.
        • Aebersold R.
        A statistical model for identifying proteins by tandem mass spectrometry.
        Anal. Chem. 2003; 75: 4646-4658