|
Advertisement | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Molecular & Cellular Proteomics 7:631-644, 2008.
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ABSTRACT |
|---|
|
|
|---|
To this end, the first layer of complexity that can be addressed by such technologies is exemplified by the following question. Which transcripts or proteins change their abundance in a given cell as a result of a normal biological process, in response to a specific perturbation, or as a consequence of disease? Although not conclusive, answering this type of question has already proven to help pinpoint the major players in several biological systems (9–12). In contrast to microarray-based transcriptomics, mass spectrometry-based proteomics (13) has unfortunately received fewer contributions from statistics and bioinformatics in terms of specific algorithms and software that are designed to answer the types of questions described above. Therefore, if the wealth of microarray-specific statistical tools could be directly applied to analyze proteomics data, this would most likely represent an enormous benefit for the rapid advancement of systems biology.
Conceptually there are significant similarities between MS-based proteomics data and microarray-based gene expression data. Primarily both technologies are believed to measure abundances of biological entities in a largely unbiased way (6, 14), which allows the use of a common mathematical representation of the data. Both types of datasets are typically represented as a matrix of numeric values where rows represent different transcripts or proteins in a cell, columns represent distinct microarray hybridizations or MS runs, and each entry represents the measured abundance level. Microarray data analysts have recognized long ago that standard statistical tools are not appropriate to analyze these data matrices because of the "many-genes-few-replicates" problem (15, 16). More precisely, all standard statistical methods rely on judging whether the difference in means between two series of values (here representing abundances of biological entities in two experimental conditions of interest) is significantly higher than the variation expected by chance. Classically statistical tests estimate this random variation by measuring the variability between the replicated measures within each series of values. But when the number of available replicates is 100 or 1000 times smaller than the number of analyzed transcripts (or proteins), the chance of occasionally measuring artificially small or artificially large standard deviations becomes dominant, potentially leading to an increase in both false positive and false negative identifications. To address this issue, several microarray-specific tools have been developed (16–20). It would therefore be of particular interest to test whether these methods are applicable to the analysis of proteomics data as well.
One hindrance to the direct transfer of expertise between these two approaches has been the widespread belief that, due to the different chemistry of nucleic acids and polypeptides and the different technologies used to analyze them, transcriptome data and proteome data had to be analyzed with distinct sets of tools. Until very recently, for example, LC-MS/MS (also known as "shotgun") proteomics was not even granted the definition of being a quantitative technique unless it was coupled with specific labeling methods that would make it suitable for relative quantification of proteins in an equimolar mixture of two samples of interest (21, 22). But sampling statistics, such as spectral counts, obtained by shotgun proteomics either with labeling or label-free have proven to allow quantification of proteins in single samples (23–25). For instance, we have recently used normalized spectral abundance factor (NSAF)1 values obtained by multidimensional protein identification technology (MudPIT) to determine the relative protein abundances inside the human Mediator complex (26) or for identifying abundance changes of yeast transmembrane proteins upon shift from a minimal to a rich culture medium (27). One feature of spectral counting-based approaches, like NSAF, is that they provide measures of protein abundances between different proteins in datasets and are applicable to any sample type. In our view, these represent important steps forward that render shotgun proteomics data conceptually more similar to microarray gene expression data.
Besides conceptual similarities, applicability of microarray-specific statistical methods to the analysis of shotgun proteomics data will ultimately depend also on more substantial similarities. At the least, numeric values representing transcript or protein abundance levels should have similar statistical properties, such as dynamic range or overall shape of the distribution of values. Furthermore it would be important if proteomics datasets and microarray datasets obeyed a similar global error model. Several authors, for example, have reported that variability of gene expression data is dependent on the average expression level of the gene itself and have termed this phenomenon "variance-versus-mean dependence" (28, 29). Taking this relationship explicitly into account has shown to partially solve the "many-gene-few-replicates" problem and to significantly improve the performance of the identification of differentially expressed genes (20, 30, 31). More specifically, we have reported previously that standard deviations from replicated Affymetrix GeneChip data could be modeled via a power law global error model (PLGEM); and use of PLGEM-derived standard deviations allowed the detection of a higher number of truly differentially expressed genes without increasing the false positive rate (20). The PLGEM-based method was then implemented into a freely available Bioconductor (32) package, called "plgem," as well as in an automated microarray data analysis pipeline (33). These implementations have already been applied, both by us (34) as well as by other authors (35), to successfully analyze real microarray data addressing real biological questions. Another study reported the successful application of a quadratic model to explain the dependence between noise variances and mean peak intensities in LC-MS proteomics datasets; and application of this error model resulted in a false positive rate that was closer to the expectation value compared with the false positive rate obtained by a standard Welch's t test (36). To the best of our knowledge, there is to date in the scientific literature no equally detailed error modeling study of shotgun proteomics data. If it was proven that NSAF data also obeyed a global error model, this could improve our ability to distinguish true protein abundance changes from random fluctuations.
The scope of the present work was therefore to compare general statistical properties of protein abundance values represented by NSAF values with those of transcript profiling data obtained by GeneChip experiments. Using two large MudPIT datasets (one containing eight biological replicates of the soluble fraction of a yeast whole-cell lysate and one containing nine technical replicates of a human protein complex preparation), we compared global distributions of major statistical parameters and tested whether NSAF datasets are characterized by a variance-versus-mean dependence similar to that governing GeneChip data. This work shows that there are indeed substantial similarities between the quantitative values obtained by these two apparently dissimilar technologies and provides the basis for applying PLGEM-based methods, and possibly other microarray-specific tools, to NSAF datasets for the identification of differentially abundant proteins.
| EXPERIMENTAL PROCEDURES |
|---|
|
|
|---|
For the comparative growth phase proteomics analysis, S. cerevisiae strain BY4741 was grown in 14N as described before. The logarithmic phase (LP) and stationary phase (SP) proteomics analyses were performed on cells collected, respectively, at an averaged A at 600 nm of 0.96 ± 0.06 and 4.5 ± 0.15 over four replicated experiments. Cells were collected and washed as described before and stored at –80 °C before protein extraction. For protein extraction, cell pellets were resuspended in lysis buffer (310 mm sodium fluoride, 3.45 mm sodium orthovanadate, 12 mm ethylenediamine tetraacetic acid, 250 mm sodium chloride, and 100 mm sodium carbonate) and broken using silica glass beads by 12 cycles consisting of 30 s of bead beating, using bead beater model 1107900 (BioSpec Products Inc.), followed by 1-min incubation at 4 °C. The beads and cells debris were removed by centrifugation for 30 min at 4000 x g at 4 °C. The supernatant was collected and centrifuged for 1.5 h at 45,000 x g at 4 °C. The supernatant containing the whole cells extract was collected and stored at –80 °C. Protein concentration was determined by BCA assay (Pierce). For each replicated experiment and growth condition, MudPIT analysis was performed on 500 µg of protein extract desalted by TCA precipitation.
Protein Extraction for the Mediator Complex—
The mammalian Mediator of RNA polymerase II transcription is a multiprotein complex composed of over 30 subunits. Stably transfected HeLa cell lines, each expressing a different FLAG-tagged Mediator subunit, i.e. human MED9, MED10, MED19, MED26, MED28, and MED29 or the mouse orthologs of MED9 or MED19, were constructed. Nuclear proteins from these cell lines were extracted and purified by anti-FLAG-agarose immunoaffinity chromatography as described previously (38). The third elutions of all preparations involving a FLAG-tagged Mediator subunit were pooled, TCA-precipitated, and quantified by BCA assay (Pierce). The pooled mixture was split into identical aliquots of 10 µg each, nine of which were independently analyzed in the present study.
MudPIT Analysis—
Protein mixtures were TCA-precipitated, urea-denatured, reduced, alkylated, and digested with endoproteinase Lys-C followed by modified trypsin digestion (both from Roche Applied Science) as described previously (6). Peptide mixtures from the yeast proteins or the Mediator complex were loaded, respectively, onto split phase or three-phase 100-µm fused silica microcapillary columns both packed with 5-µm C18 reverse phase (Aqua, Phenomenex), strong cation exchange particles (Partisphere SCX, Whatman), and reverse phase (39). Loaded microcapillary columns were placed in line with a Quaternary Agilent 1100 series HPLC pump and an LTQ linear ion trap ion trap mass spectrometer equipped with a nano-LC electrospray ionization source (ThermoFinnigan). Fully automated seven-step MudPIT runs were carried out on the electrosprayed peptides for the Mediator samples as described previously (40), whereas a 12-step MudPIT run was performed for the yeast proteome analyses as described previously (27). Each full MS scan (from 400 to 1600 m/z range) was followed by five MS/MS events using data-dependent acquisition where the five most intense ions from a given MS scan were subjected to CID.
MS/MS Data Processing—
Proteins were identified by database searching using SEQUEST software (41). The list of parameters used for the yeast and human datasets searches are available in Supplemental Tables 1, A–D, and 2A, respectively. Briefly no enzyme specificity was imposed during searches, setting a mass tolerance of 3 amu for precursor ions and 0 amu for fragment ions. In all searches, cysteine residues were considered to be fully carboxamidomethylated (+57 Da statically added). No variable modifications were searched. For the yeast proteome, tandem mass spectra were searched against a database containing 14,176 protein sequences combining 6911 S. cerevisiae proteins (from the National Center of Biotechnology Information (NCBI) March 3, 2006 release), 177 common contaminants, such as keratin and immunoglobulins, and their corresponding 7088 randomized amino acid sequences. Each MS/MS dataset was searched four times following these criteria: 1) 14N-amino acids, 2) 14N-amino acids and +16 Da statically added to methionine (referred to as methionine oxidation), 3) 15N-amino acids for which the appropriate number of nitrogen atoms were statically added to their masses, and 4) 15N-amino acids and methionine oxidation (Supplemental Table 1, A–D). The sqt files generated from the four independent searches were merged in the final dataset as described before (27). For the yeast log phase versus stationary phase comparative analyses, no 15N was used so each dataset was searched using 14N-specific parameters found in Supplemental Table 1, A and B. Each MS/MS dataset was searched two times following these criteria: 1) 14N-amino acids and 2) 14N-amino acids and +16 Da statically added to methionine. The sqt files generated from the two independent searches were merged in the final dataset as described before (27). For the Mediator samples, MS/MS spectra were searched against a database of 60,234 amino acid sequences consisting of 29,890 human proteins (non-redundant entries from NCBI November 11, 2006 release), 160 usual contaminants (such as human keratins, IgGs, and proteolytic enzymes), 67 epitope-tagged proteins (including mouse orthologs of MED9 and MED19), and 30,117 randomized amino acid sequences derived from each non-redundant protein entry. Peptide/spectrum matches, including precursor ion m/z values and charge states, for the yeast control, human, and yeast log phase versus stationary phase datasets are provided, respectively, as Supplemental Tables 1E, 2B, and 3A. The lists of detected peptides and proteins were sorted and selected using DTASelect (42) with the following criteria set: spectra/peptide matches were only retained if they had a
Cn of at least 0.1; minimum XCorr of 1.5 for singly, 2.5 for doubly, and 3.0 for triply charged spectra; and maximum Sp rank of 10. In addition, peptides had to be fully tryptic and at least seven amino acids long. Peptide hits from multiple runs were compared (Supplemental Tables 1F, 2C, and 3B) using CONTRAST (42) and contrast-report (43). Proteins that were subsets of others were removed using the parsimony option in DTASelect (42). The false discovery rate (FDR) was calculated as the number of spectra matching randomized peptides multiplied by 2 and divided by the total number of spectra, as described before (44), and ranged between 0 and 0.465% for all MudPIT runs (Supplemental Tables 1F, 2C, and 3B).
Protein abundances were estimated using NSAF values calculated from the spectral counts of each identified protein (27). Briefly to account for the fact that larger proteins tend to contribute more peptide/spectra, spectral counts were divided by protein length to provide a spectral abundance factor (SAF). SAF values were then normalized against the sum of all SAF values in the corresponding run, allowing the comparison of protein levels across different runs. No particular thresholds or outlier removal steps were applied prior to NSAF calculation. The NSAF values of each detected protein from the yeast, Mediator, and yeast log phase versus stationary phase MudPIT datasets are provided as Supplemental Tables 1F, 2C, and 3B, respectively. For subsequent statistical analysis, all datasets were further processed to retain only proteins that were identified at least in three replicated experiments. Finally contaminant proteins were removed.
GeneChip Datasets—
The mouse GeneChip dataset used in the present study is a subset of a previously published dataset (45). This subset contains 11 replicates of the transcriptome of untreated mouse dendritic cells measured by MG-U74Av2 GeneChip arrays (Affymetrix, Santa Clara, CA) following standard procedures. All experimental details can be found in the original publication (45). All remaining microarray datasets were downloaded on February 22, 2007 from the Gene Expression Omnibus database (46) using the following search criteria. (i) The microarray platform had to be an Affymetrix GeneChip. (ii) Absolute signal intensities had to be obtained by standard image processing, background correction, and summarization methods as implemented either in the MicroArray Suite 5.0 or in the GeneChip Operating System software application (both from Affymetrix). (iii) Datasets had to contain at least one experimental condition with a minimum of three replicates. The combination of these selection criteria yielded 26 distinct studies across seven distinct platforms and five species (Homo sapiens: HG-U133Plus2.0 and HG-U133A; Mus musculus: MOE-430A and MG-U74Av2; Rattus norvegicus: RG-U34A; Arabidopsis thaliana: ATH1; S. cerevisiae: YG-S98) with a total of 336 samples grouped in 101 sets of replicates. Each set of replicates represented either a unique experimental condition or a unique combination of experimental factors (in case that more than one experimental factor was annotated in the database for a particular dataset) and contained between three and five, either biological or technical, replicates. All accession numbers of the downloaded data can be found in Supplemental Table 4.
Statistical Analysis—
NSAF datasets and GeneChip datasets were imported into the R environment for statistical computing (47) and parsed into individual "exprSet" objects to allow recognition by specific Bioconductor packages (32). Missing values were replaced with zeros, and data were normalized by dividing each value by the mean value of the corresponding column. The Bioconductor package plgem (20) was used to fit a PLGEM to the individual datasets, evaluate the goodness of fit of the model to the data, and detect differentially abundant transcripts or proteins. Relevant algorithmic details of the PLGEM method will be explained under "Results." All R scripts written to dynamically generate all the figures and tables in the present work are available from the authors upon request.
| RESULTS |
|---|
|
|
|---|
15 and
42 times smaller than the number of transcripts present in the mouse GeneChip dataset (Table I). For the same reason, an abundance value equal to zero (hereafter referred to as a zero value) was extremely unlikely in the mouse GeneChip dataset (representing only
0.02% of the total values), whereas it accounted for
29 and
35% of all values present in the yeast and the Mediator NSAF datasets, respectively (Table I). Interestingly the percentage of transcripts associated with an absent call in the GeneChip dataset (
50%) was similar to the percentage of zero values in the two NSAF datasets, suggesting a possible semantic equivalence between these two types of information. Most probably as a third consequence of the phenomenon described above, the dynamic range of measured abundance values in the yeast and the Mediator NSAF datasets was
3.6–3.8 orders of magnitude, whereas those in the mouse GeneChip dataset reached almost 4.7 orders of magnitude (Table I). Nonetheless these data confirm that, despite important differences in the overall size and in the presence of zero values, microarray datasets and proteomics datasets are both able to measure abundances of biological entities over several orders of magnitude.
|
|
To identify the possible underlying relationship between data variability and average abundance levels, we drew two types of scatter plots for each dataset (Fig. 2). In the first case, we analyzed rowS.D. values, which can be seen as an absolute measure of data variability, as a function of the corresponding rowMean values in a log-log space (Fig. 2, A–C). These plots revealed a striking linear relationship over the whole dynamic range in all three analyzed datasets with highly abundant transcripts or proteins showing a higher S.D. compared with lowly abundant ones. Although the S.D. is considered an absolute measure of data variability, the coefficient of variation (CV) can be seen as a relative measure of data variability. The CV is defined as follows.
![]() |
|
Goodness of Fit of PLGEM on NSAF Datasets—
The simplest model able to explain a linear relationship in a log-log space is a power law relationship in the linear-linear space. In mathematical terms, if
![]() |
where k, c, and
respectively represent the slope, the intercept, and a normally distributed residual error of a linear regression, then
![]() |
![]() |
![]() |
According to this model, if k = 1, then the rowS.D. would be directly proportional to the rowMean, whereas the rowCV would be constant over the whole dynamic range of rowMean values. Values of k > 1 would cause both the rowS.D. and the rowCV to increase as a function of the rowMean, whereas values of k < 0 would lead to a decrease of both the rowS.D. and the rowCV. Hence there is a critical range 0 < k < 1 in which the absolute variability increases with increasing average abundance (because of the positive power coefficient k in Equation 3), whereas the relative variability decreases (because of the negative power coefficient (k – 1) in Equation 5). An error model with parameter k within this critical range would therefore fully explain the observations made in Fig. 2. In addition, such a model would also be consistent with the fact that the dynamic range of rowS.D. values was significantly smaller than the dynamic range of the rowMean values in the same dataset (Table I).
We have previously described the above variance-versus-mean dependence to be at the basis of GeneChip data, and we modeled this relationship via a PLGEM (20). Here we tested whether and how PLGEM would be able to explain the variability present in a typical NSAF dataset as well. Using the Bioconductor package plgem we fitted a PLGEM either to a simulated dataset (forced to obey a PLGEM) or to the GeneChip and the two NSAF datasets under investigation in the current study (Fig. 3). Details about the robust PLGEM fitting method implemented in the plgem package can be found in the original publication (20). Briefly the dynamic range of rowMean values is partitioned into equally sized bins, and a modeling point is determined in each partition so that it captures the local median variation (20). Then a linear regression is performed through the set of modeling points in the log-log space to obtain the slope k and the intercept c of the PLGEM. As quality controls, a Pearson's correlation coefficient was calculated between all available ln(rowS.D.) values and the corresponding ln(rowMean) values, and an adjusted r2 value was calculated between the fitted PLGEM and the modeling points. In general, PLGEM fitted equally well on all analyzed datasets (Fig. 3, A–D) with correlation coefficients >0.96 and adjusted r2 values >0.99. An additional evaluation of the goodness of fit of PLGEM was performed through an analysis of the residuals of the model. Residuals were calculated as differences between the modeled and the measured ln(rowS.D.). As expected from a good fit, in all analyzed datasets the residuals were relatively constant across the whole dynamic range (Fig. 3, E–H) and were approximately normally distributed (Fig. 3, I–P).
|
0.8, which was very close to the average PLGEM slope generally found in GeneChip datasets (Fig. 4C).
|
|
To test the added value provided by the use of PLGEM in the analysis of NSAF-based proteomics datasets, we performed a MudPIT experiment designed to detect proteins that show differential abundance in different yeast growth phases. Whole-cell extracts from four biological replicates of a yeast cell culture grown in rich medium and harvested either in LP or in SP were analyzed by a total of eight independent MudPIT runs and quantified using the NSAF approach to search for proteins up- or down-regulated during the growth phase shift (Supplemental Table 3). A total of 783 proteins were consistently identified in at least three of four replicates in either the LP or the SP samples. Of these, 108 were identified only in the SP samples, and 164 were identified only in the LP samples. These two subsets, respectively, represent proteins induced or repressed in different growth phases and are consistent with prior knowledge on the biology of stationary phase in yeast (data not shown and Ref. 50). Although these proteins provide insights into the global changes occurring in response to this physiological transition, they represent only a minor fraction of the total identified proteins. In addition, their behavior can be modeled as an on/off response and are therefore less challenging to detect. The identification of differential abundance among the remaining majority of proteins (511 of 783, i.e.
65%), which were consistently identified in most of the samples, represents instead a much more challenging task. It is in this type of analysis that a model-based statistical analysis might prove its benefits.
A standard procedure in quantitative proteomics data analysis makes use of the -fold change (FC) as a measure of differential abundance of proteins across two groups of replicated samples. It is implicitly assumed that the higher the FC, the more the protein abundance level varies between the two experimental conditions of interest. A more rigorous procedure would take the within-group variability into account as well to tell whether the signal we are interested in (the difference in abundance of the protein) is higher than the noise (the background variability caused by a combination of biological and technical variation). In such an analysis it becomes important to obtain accurate estimates of the standard deviation of NSAF measurements across different replicates of a same experimental condition to not over- or underestimate the background noise and thus under- or overestimate the signal-to-noise (STN) ratio. We therefore tested the performance of PLGEM in providing more accurate estimates of standard deviation by incorporating PLGEM-derived standard deviation into the following STN statistic.
![]() |
Because they were independently analyzed, two distinct sets of PLGEM parameters were fit to the SP NSAF dataset and the LP NSAF dataset (Supplemental Fig. 1). It has to be noted that although the above statistic has successfully proven to provide excellent results in the analysis of GeneChip data (9, 20, 34, 35, 45), it has not yet been used for the analysis of NSAF-based proteomics data.
We first compared the results obtained by analyzing the above mentioned 511 yeast proteins either with the simple FC method or with the STN statistic incorporating classical data-derived estimates of standard deviation (Standard-STN). The FC statistic was implemented here as the log ratio of the average NSAF value in the SP samples over the average NSAF value in the LP samples. The 511 proteins were ranked based on the absolute value of either of the two statistics, and the top 100 with the most extreme values were selected as the most significantly changing (Supplemental Table 5). Whereas the FC method was biased toward detection of the most lowly abundant proteins because these are the ones expected to vary most, the Standard-STN method selected several proteins with very low -fold changes and missed other proteins with very high -fold changes (Fig. 6). Among the proteins with low FC values that were nonetheless selected by the Standard-STN method, some were identified with extremely small spectral counts like the transcriptional elongation protein Spt6, identified by zero, one, two, and three spectra in the four LP replicates and by two, three, three, and three spectra in the SP samples (Fig. 6). Proteins with very small spectral counts have been ranked among the 100 most differentially abundant ones by Standard-STN only because they happened to have reproducibly small NSAF values, but due to the variability of such low spectral counts they should likely be regarded as false positives. Among the proteins with large changes that were not ranked among the most significant ones using the Standard-STN method, many are well known to be down-regulated during a shift from LP to SP in yeast and should therefore be regarded as false negatives. An example of such a protein is the ribosomal protein Rpl8a, identified by 5, 11, 20, and 76 spectra in the LP samples and by zero, two, three, and six spectra in the SP samples (Fig. 6). The most likely cause, for which these proteins were missed by the Standard-STN method, was their relatively high standard deviations.
|
Ranking of proteomics hits based on some significance criterion is a common procedure to prioritize the follow-up of candidate proteins potentially involved in the biological phenomenon under investigation. We therefore tested the biological significance of the proteins identified with the FC, the Standard-STN, or the PLGEM-STN method. To this end, significant enrichment of Gene Ontology (GO) annotation terms or Swiss-Prot keywords among the top ranking 100 proteins was evaluated. We submitted the three different lists of 100 RefSeq IDs corresponding to the proteins selected by each method to FatiGO+ (51) to test whether any functional annotation terms were significantly over-represented in the query list in comparison with the background list of 411 non-selected proteins. This website provides p values from a Fisher's exact test that are adjusted for multiple testing by an FDR-based method. Whereas no statistically significant hits were returned for the 100 proteins with the highest FC values or the highest Standard-STN values, FatiGO+ detected a significant enrichment of GO Biological Process annotation terms biosynthetic process (FDR-adjusted p value = 2.3 x 10–3), cellular biosynthetic process (FDR-adjusted p value = 2.1 x 10–3), macromolecule biosynthetic process (FDR-adjusted p value = 3.7 x 10–4), and translation (FDR-adjusted p value = 5.7 x 10–4) and for the Swiss-Prot keyword ribosomal protein (FDR-adjusted p value = 2.6 x 10–6) among the 100 proteins with the highest PLGEM-STN values. It has to be noted that from a biological perspective the shift from LP to SP is well known in yeast to be accompanied by a progressive slow down of the whole biosynthetic machinery and especially of translation (52), and only by using PLGEM in this analysis did we capture this information.
| DISCUSSION |
|---|
|
|
|---|
Similarities between NSAF and GeneChip Data—
Here we have provided evidence that NSAF datasets share with GeneChip data substantial statistical similarities. Not only were the dynamic range and the distribution of values qualitatively very similar between the two technologies, but also, and perhaps more importantly, these two types of data have proven to obey the same global error model with surprisingly similar parameters (see PLGEM as an Error Model for Shotgun Proteomics for a more detailed discussion of the latter point). These similarities offer the exciting opportunity to take advantage of the multitude of statistical tools that have been designed to specifically deal with open issues in microarray data analysis and to test whether they perform as well in proteomics data analysis. There is for instance a wealth of literature, algorithms, and software that has been devoted to solve microarray data analysis problems related to missing values (53–55), multiple testing (56, 57), variance-versus-mean dependence (20, 29, 30), etc. We foresee that most of these issues will be recapitulated also in shotgun proteomics data. Therefore, if these microarray-specific tools were directly applicable to the analysis of proteomics data, this would represent a significant shortcut in the advancement of proteomics research. Other authors have already successfully applied specific microarray tools in the analysis of proteomics data (25) without providing a more general demonstration of the underlying assumption that proteomics data are substantially similar to transcriptomics data. The substantial similarities shown here between NSAF data and GeneChip data suggest instead that most GeneChip-specific statistical tools should be applicable to the analysis of NSAF datasets as well.
PLGEM as an Error Model for Shotgun Proteomics—
The most important similarity between NSAF and GeneChip datasets was that not only both types of datasets obeyed a PLGEM, but the most critical parameter of the model, i.e. the power coefficient k, was surprisingly conserved. The fact that this parameter was always inside the critical range 0 < k < 1 for more than 100 distinct GeneChip datasets from five distinct species as well as for four NSAF datasets, three from yeast and one from human samples, indicates that this global error model might really be a general model of GeneChip and NSAF data regardless of the specific nature of the analyzed samples. The main consequence of such a model with such constraints would be that transcript or protein abundance levels of more highly expressed genes would be intrinsically more stable than those of more lowly expressed genes. This observation raises the question about the reason for this skew. A possible explanation for this is that cells might have skewed their gene expression control system by concentrating their efforts in more precisely controlling the expression of genes with a potentially higher impact on cellular functions rather than dissipating energy in controlling the expression of genes the products of which would be expressed at low levels anyway. What argues against this explanation is that it assumes a direct relationship between the expression level of a gene and the biological impact of the encoded protein, which might not always be the case. Investigation of the real reason behind this peculiar phenomenon goes well beyond the scope of the present work.
It has to be noted that there is nothing radically distinct about PLGEM as compared with previously proposed error models for these types of measurements. In fact, PLGEM could be seen as a generalization of these models. For instance, two-component error models have been proposed previously for atomic absorption spectroscopy (58), gas chromatography-MS (58), LC-MS (36), or microarray data (29). These models assume a constant rowS.D. for very low abundances and a constant rowCV for higher abundances. A constant rowCV model would in fact be able to explain an increase of the rowS.D. as the function of the rowMean but would not account for the progressive decay of rowCV that we have observed in both GeneChip data (20) and NSAF data (Fig. 2) for increasingly higher values of the rowMean. PLGEM, conversely, by not assuming any particular value of the power coefficient k, relies on more relaxed assumptions. Notably a PLGEM with k
1 would result in an approximately constant rowCV model. Thus, a PLGEM with k
1 would be difficult to distinguish from a constant-CV model especially if the analyzed dynamic range was not sufficiently large. The wide dynamic range of abundance levels that can be measured with NSAF and the GeneChip technology instead allows a clear distinction between these two models. The fact that we observed the power coefficient k to be in the range 0.7–0.8 for most analyzed GeneChip and NSAF datasets (Fig. 4) might therefore explain why in the past the constant-CV assumption has been often taken for granted.
Identification of Differentially Abundant Proteins—
The unbiased sampling nature of shotgun proteomics approaches theoretically allows the detection of virtually any protein in a sample regardless of its concentration provided that the experiment is replicated a sufficiently large number of times (23). However, these extremely lowly abundant proteins pose numerous challenges in their statistical analysis because of the presence of several zero values and the intrinsically low reproducibility described above. To increase the confidence of downstream statistical analyses, it is therefore common practice to discard proteins identified only in a minority of the analyzed replicates of a MudPIT experiment or transcripts flagged as absent in the majority of replicates in a GeneChip experiment. But in a comparative analysis, where significant differences between two experimental conditions are sought, a transcript or a protein that passed the above criteria in one experimental condition but was virtually absent in the other condition would represent a valuable candidate for follow-up studies. Statistical methods able to deal with these lowly abundant transcripts or proteins and to detect a significant difference between a virtual absence and a modest presence will certainly expand the coverage by which we can interpret the outcomes of these experiments.
We have shown here that PLGEM fits equally well over the whole dynamic range of average NSAF values even to proteins identified in a minor fraction of all available replicates, i.e. three of eight in the case of the yeast dataset and three of nine in the case of the Mediator dataset. In addition, we observed that PLGEM fitted equally well also on NSAF datasets where
50% of the proteins were identified in only one or two replicates (data not shown). This suggests that PLGEM has the potential to improve our ability to cope with these lowly abundant proteins because it provides a reasonable estimate of the expected standard deviation despite the presence of only a small number of non-zero NSAF values.
The performance of a PLGEM-based method for the analysis of GeneChip experiments has already been thoroughly investigated and compared with the behavior of other commonly used statistical methods (20). In the present work, we have shown that the use of PLGEM-based standard deviations to calculate STN ratios in an NSAF dataset improves our ability to determine protein expression changes between yeast sampled at LP and SP (Fig. 6 and Supplemental Table 3). Although determining which proteins were present in one growth condition and absent in another is relatively straightforward, determining changes in abundance of proteins found in both LP and SP is challenging. The PLGEM-STN statistic outperformed both FC and Standard-STN by being more conservative with proteins of low abundance than proteins with high abundance. In conclusion, we envision a broad range of applications of PLGEM in the analysis of NSAF data. PLGEM might assist in prioritizing the follow-up analysis of candidate proteins that show significant abundance changes between any two samples of interest, i.e. in the comparison of a wild-type versus a knock-out cell line, a diseased versus a normal tissue, or a treated versus an untreated patient.
| ACKNOWLEDGMENTS |
|---|
| FOOTNOTES |
|---|
Published, MCP Papers in Press, November 19, 2007, DOI 10.1074/mcp.M700240-MCP200
1 The abbreviations used are: NSAF, normalized spectral abundance factor; MudPIT, multidimensional protein identification technology; PLGEM, power law global error model; FDR, false discovery rate; S.D., standard deviation; CV, coefficient of variation; LP, logarithmic phase; SP, stationary phase; GO, Gene Ontology; BCA, bicinchoninic acid; SAF, spectral abundance factor; FC, -fold change; STN, signal-to-noise. ![]()
* This work was supported by the Stowers Institute for Medical Research. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact. ![]()
S The on-line version of this article (available at http://www.mcponline.org) contains supplemental material. ![]()
¶ Present address: Division of Biostatistics Epidemiology and Public Health, Yale University School of Medicine, New Haven, CT 06520. ![]()
** To whom correspondence should be addressed: Stowers Inst. for Medical Research, 1000 E. 50th St., Kansas City, MO 64110. Tel.: 816-926-4457; Fax: 816-926-4694; E-mail: mpw{at}stowers-institute.org
| REFERENCES |
|---|
|
|
|---|
This article has been cited by other articles:
![]() |
H. Choi, D. Fermin, and A. I. Nesvizhskii Significance Analysis of Spectral Count Data in Label-free Shotgun Proteomics Mol. Cell. Proteomics, December 1, 2008; 7(12): 2373 - 2385. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
| All ASBMB Journals | Journal of Biological Chemistry |
| Journal of Lipid Research | ASBMB Today |