|
|
||||||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Molecular & Cellular Proteomics 5:868-881, 2006.
© 2006 by The American Society for Biochemistry and Molecular Biology, Inc.
,
,¶
,¶
,
,||
,


,


From the Departments of
Chemical Engineering and
Microbiology, University of Washington, Seattle, Washington 98195 and ** Department of Microbiology, University of Georgia, Athens, Georgia 30602
| ABSTRACT |
|---|
|
|
|---|
| MATERIALS AND METHODS |
|---|
|
|
|---|
Differential 15N:14N Labeling of Proteins
To prepare differentially 15N-labeled proteins, strains S2 and S40 were grown as described previously (6) in Methanococcus minimal medium plus 10 mm sodium acetate containing either 4.6 mm [15N]ammonium sulfate (instead of ammonium chloride) or unlabeled ammonium sulfate in side-arm bottles. The bottles were constructed from 160-ml serum bottles by fusing a 28-ml culture tube to the side. Thus, it was possible to follow the culture absorbance in a spectrophotometer without removing samples from the cultures. Before use, glassware was immersed in 1 m HCl overnight, rinsed with water, autoclaved in water with soap, rinsed in deionized water, and dried. To remove O2 adsorbed to the glassware, the bottles were stored in the anaerobic chamber at least a day before use. Stoppers were autoclaved for 20 min in 0.1 m NaOH, rinsed with deionized water, dried, and stored in the anaerobic chamber at least a day before use. Four bottles, each containing 10 ml of culture, were prepared for each growth condition. Cultures were initially pressurized to 276 kilopascals with H2/CO2 gas (80:20, v/v) and repressurized twice daily. When the cultures reached an average A600 of 0.56, 40 ml of the S2 culture grown with 14N was mixed with 40 ml of the S40 culture grown with 15N to form sample 1. The reciprocal mixture, called sample 2, was also prepared with cultures of S2 grown with 15N and cultures of S40 grown with 14N. Because of its slower growth, the mutant was harvested after 21 h, whereas S2 was harvested after 13 h. The cell mixtures were centrifuged at 10,000 x g for 30 min at 4 °C. After resuspending the cells in 2 ml of buffer B (100 mm ammonium bicarbonate (pH 9.0) and 2 mm DTT), the cells were lysed by freezing at 20 °C. The resuspended frozen cells were stored at 20 °C. Upon thawing, 10 units of DNase I were added, and the suspension was incubated for 15 min at 37 °C. After microscopic inspection to ensure lysis the cell debris and the unbroken cells were pelleted by centrifugation at 8000 x g for 30 min at 4 °C. Supernatants and pellets (resuspended in 0.5 ml of buffer B) were used separately for protein preparation as follows. Four protein samples were obtained from the two mixtures, two supernatants (CE1 and CE2) and two pellets (PT1 and PT2). For each sample, 200 µl of centrifuged cell lysate containing
0.5 mg of total protein was brought to a final volume of 370 µl containing 40 µl of DNase/RNase solution (1 mg/ml DNase I, 500 µg/ml RNase A), 0.15% RapiGest (Waters), 5 mm DTT, and 50 mm Tris-HCl, at pH 8.0. The RapiGest is an acid-labile bifunctional molecule designed to serve as a substitute for conventional detergents that is less likely to interfere with the mass spectrometry. The samples were immediately placed in boiling water for 1.52.0 min until the solution ceased to show obvious viscosity due to residual DNA. Each sample was then transferred onto ice. The proteins were further reduced by the addition of 5 mm DTT at 37 °C for 30 min and then alkylated with 10 mm iodoacetamide at 30 °C for 30 min in the dark. Each sample was then adjusted to give a solution containing 50 mm NH4HCO3 and 5 mm CaCl2. 15 µg of sequencing grade trypsin (Promega, Madison, WI), was added, and the mixture was incubated at 37 °C for 4 h. The sample was acidified with TFA to quench the digestion and concentrated to 200 µl using a vacuum centrifuge (RC10-22, Jouan Inc., Winchester, VA).
Liquid Chromatography and Tandem Mass Spectrometry
Off-line HPLC fractionation, 2D microcapillary HPLC, and LCQ ion trap mass spectrometry were essentially as described previously (7) with minor modifications. Samples were vortexed and centrifuged at 14,000 rpm in a microcentrifuge (Eppendorf 5415C) for 8 min to remove insoluble material. 30 µl of 50% acetonitrile was added followed by another vortex and centrifugation at 14,000 rpm for 45 min to remove any remaining insoluble material. Samples were fractionated using a YMC-PackTM PolymerC18TM, S-6, 2.0 x 150-mm reversed-phase HPLC column (Waters). The mobile phase was H2O and acetonitrile with 0.1% TFA. Peptides were eluted with increasing acetonitrile (260% in 60 min) at 0.4 ml/min. For samples CE1 and CE2, eluant containing peptides was detected by UV absorption at 214 nm and was fed 1 ml at a time into each of five tubes successively, rotating through all five tubes several times until all peptide-containing eluant was collected, resulting in five combined fractions. For samples PT1 and PT2 only two combined fractions were similarly collected. Each combined fraction was concentrated to 100 µl using a vacuum microcentrifuge, and acetic acid was added to a final concentration of 0.5% (v/v). Approximately 2.5 µl from each combined fraction was analyzed using a 2D microcapillary HPLC system (7, 8) combined with a Thermo Finnigan LCQ Classic or LTQ mass spectrometer in a semiautomated, data-dependent manner as described previously (7). Data acquired for the qualitative analysis of unlabeled strain S2 using the LTQ linear ion trap were similar and differed mainly in the number of usable mass spectra acquired per unit time. In our experiments the LTQ could collect data about 35 times faster, consistent with the experience of other laboratories (9).
Proteomic Data Collection and Analysis
Raw mass spectral data were collected, consisting of full scan (MS1) data and, for selected parent ions, CID scan (MS2) data. The SEQUEST algorithm (10) was used to identify MS2 spectra that were candidates for peptides encoded in the M. maripaludis strain S2 ORF database (2) (GenBankTM accession number BX950229). Our M. maripaludis (Mmp) FASTA database (version dated January 21, 2005) was enlarged by adding
25% of the nrdb human subset to increase the size of the database to about 21 megabytes to avoid statistical problems associated with searching very small databases. Two separate SEQUEST parameter files, one for 14N-peptides and one for 15N-peptides, were used (4, 5, 10). DTASelect (11) was then used to filter the results based on the following criteria. Peptides were required to be fully tryptic (beginning and ending at adjacent predicted trypsin digestion sites). The minimum peptide length was four amino acids.
Cn/Xcorr values for different peptide charge states were 0.08/1.9 for +1, 0.08/2.0 for + 2, and 0.08/3.3 for +3. This process generated two DTASelect-filter.txt files, one for 14N-peptides and one for 15N-peptides.
A program calculating the protein expression ratios for S40:S2 was written under the FileMaker 7.0 environment (Filemaker, Inc., Santa Clara, CA). Briefly peptide pairs were found for each identified peptide that was a unique peptide (sequence present in only one predicted ORF in the Mmp database) in both DTASelect-filter.txt files as follows. For the parent ion selected for each CID scan, the m/z value for the corresponding isotopic "flip" ion was calculated. The raw MS1 data were then searched for an ion with that m/z value in the full scan from which the original CID scan was derived during the parent ion selection process. If no flip ion was found, then the noise value near the predicted m/z for the flip ion was used instead. A non-0 value for the missing member of the 15N/14N peptide pair was essential to avoid prematurely rejecting data for potentially highly regulated genes. The raw data were then similarly searched for "heavy" and "light" peptide pairs in the MS1 scans immediately prior to and immediately after the scan used to select the original parent ion. Of the three MS1 scans, data were retained for the scan with the highest signal intensity. This process was repeated for as many heavy 15N and light 14N peptide pairs as the program could find for each ORF.
An average ratio was calculated for each protein from all the 15N:14N and 14N:15N peptide pairs associated with that protein. Normalization of the average ratios was done in a manner similar to methods that are used for microarrays (12). Briefly a frequency distribution histogram of the log2-transformed ratios was constructed (data not shown), and then the ratios were normalized by a factor that made the distribution center at 0 (12). The number of 15N/14N peptide pairs per ORF (n1) was used as a measure of reliability and analytical precision. Differential protein levels were regarded as significant if n1 was equal to or greater than 3 and the ratio (S40:S2) ± S.D. was above or below 1.0. These calculations were done separately for each of the four samples, CE1, CE2, PT1, and PT2. Finally data from all four samples were combined and used to make a composite analysis (summary) of the expression pattern for S40:S2 (see Supplemental Table S2). For the composite analysis, peptide ratios determined to be outliers were eliminated. The detection of outliers was done in two steps for each list of peptide ratios associated with an ORF. For the first stage, Dixons Q-test (13) was used; in the second stage, a median of the absolute deviation (MAD) modified z-score test (1416) with a cutoff value of 3.5 was used.
Confirmation by Peptide Synthesis
The following five peptides were synthesized commercially (Anaspec, San Jose, CA) and used for comparison of the original CID mass spectrum (MS2) with the CID spectrum acquired for the synthetic peptide standard. The Mmp number is given for each ORF followed by the sequence tag: Mmp0512, ELQKYAK; Mmp0512, NACKMGAAKFR; Mmp1456, IGQIPFR; Mmp1517, HFGAWSEFR; and Mmp1517, AWYEGLALAIFLK.
Preparation of RNA
Cells were grown as for the differential 15N:14N labeling of proteins except that only 14N was used. When the average A600 reached 0.55, the cultures were rapidly cooled in an ice-water bath. For these experiments, cultures of S2 and S40 reached this optical density after 12 and 36 h, respectively. The cells were collected by centrifugation at 8,000 x g for 30 min at 4 °C, washed in 1 ml of Methanococcus minimal medium (6) without sodium bicarbonate and stored at 70 °C. The cells were then resuspended in 200 µl of minimal nitrogen-free growth medium (excluding sulfide (17)) and lysed by the addition of 250 µl of RLT buffer (Qiagen, Valencia, CA). 350 µl of 100% ethanol (Aaper, Shelbyville, KY) was then added, and the RNA was purified on RNeasy columns (Qiagen). Aliquots of RNA were then treated with DNase using DNA-Free (Ambion, Austin, TX). Reactions contained 0.4 units of DNase I/µl of reaction mixture and were terminated with 10 µl of DNase inactivation reagent (Ambion).
The quality of the RNA samples was checked using an Agilent 2100 Bioanalyzer (Agilent Technologies Inc., Palo Alto, CA). The absence of DNA was confirmed by a PCR using a Biometra T gradient thermocycler (Whatman Biometra, Goettingen, Germany) and primers QRNAr and QRNAf (Supplemental Table S1) for ORF Mmp0608.
Array Labeling and Hybridization
Arrays constructed by generating PCR products of
500 base pairs in length for each of the predicted ORFs were purchased from Integrated Genomics (Chicago, IL) and spotted in duplicate on mirrored slides at the University of Washington Center for Expression Arrays. Labeling reactions were conducted with the amino allyl cDNA labeling kit (Ambion) using a modification of the published procedure. For each reaction, 510 µg of RNA and 4 µg of random decamers (Integrated DNA Technologies, Coralville, IA) were incubated at 75 °C for 7 min and then briefly centrifuged at room temperature. First strand cDNA synthesis was carried out in 1x RT buffer, 10 units of RNase inhibitor, 0.5 mm dATP, 0.5 mm dCTP, 0.5 mm dGTP, 0.15 mm dTTP, 0.15 mm amino allyl-dUTP, and 400 units of reverse transcriptase for 10 min at room temperature followed by 2 h at 42 °C. The RNA template was then hydrolyzed by adding 4 µl of 1 m NaOH and incubating at 65 °C for 15 min. The reaction was neutralized by the addition of 10 µl of 1 m HEPES; cDNA was precipitated by the addition of 3.4 µl of 3 m sodium acetate, 1 µl of 5 mg/ml glycogen, and 100 µl of 100% ethanol (Aaper), incubation for 15 min at 20 °C, and centrifugation at maximum speed in a microcentrifuge for 30 min at 4 °C. The samples were then washed with 75% ethanol, and the supernatants were discarded. For NHS-Cy dye coupling, the samples were resuspended in 4.5 µl of coupling buffer (Ambion), 2.5 µl of RNase-free H2O (Ambion), and 3 µl of either NHS-Cy3 or NHS-Cy5 (Amersham Biosciences) and incubated at room temperature in the dark for 1 h. The reaction was quenched with the addition of 6 µl of 4 m hydroxylamine (Ambion) followed by a 15-min incubation at room temperature in the dark. Unincorporated dye was removed by adding 100 µl of PB buffer (Qiagen) and 750 µl of 100% ethanol (Aaper) and applying the sample to a MinElute reaction clean-up column (Qiagen). Samples were eluted from the column with 60 ml of RNase-free H2O (Ambion). The whole, undiluted sample was analyzed using a spectrophotometer to determine DNA yields and dye concentrations. Only samples with at least 35 pmol of dye/mg of DNA were used for hybridization.
Samples were prepared for hybridization by mixing 100120 pmol each of the Cy3- and Cy5-labeled cDNA and centrifuging them to dryness in a SpeedVac (Jouan Inc.). The samples were then resuspended in 60 ml of hybridization buffer (Agilent Technologies Inc.), boiled for 3 min, and incubated on ice for 30 s.
Slides were washed in sterilized dH2O and dried with pressurized nitrogen gas. Resuspended samples were then applied along one edge of the slide, and a clean glass 24 x 60-mm coverslip (VWR, West Chester, PA) was lowered slowly over the sample starting with that edge. The slides were then sealed in a plastic container with reservoirs of dH2O to maintain humidity, and the hybridization was carried out at 42 °C for 1416 h in the dark.
All posthybridization washes were performed in 50-ml conical tubes (Greiner, Longwood, FL). A brief wash in 1x SSC + 0.2% SDS was used to remove the coverslip followed by a 10-min wash in 1x SSC + 0.2% SDS and two 10-min washes in 0.1x SSC + 0.2% SDS, all at 54 °C, with rotation in a hybridization oven covered with aluminum foil. Finally slides were subjected to two 1-min washes at room temperature in 0.1x SSC, rinsed briefly with dH2O, and dried with pressurized nitrogen gas.
Array Data Collection and Analysis
Slides were scanned at the University of Washington Center for Expression Arrays on an Amersham Biosciences Gen III array scanner. Images were collected separately for each duplicate half of the slide. Images taken for the Cy3 and Cy5 channels were stored as paired gel files. Spots were identified, and the local background was subtracted using the SpotFinder software from The Institute for Genomic Research.2 Four biological replicates were analyzed, each with duplicate arrays on each slide and with flip-dye hybridizations, making a total of 16 measurements for each gene. Data were Loess-normalized using the S+ array analyzer (Insightful, Seattle, WA). Average expression ratios and unadjusted p values were calculated for each gene across all replicates using the S+ array analyzer using a Wilcoxon non-parametric significance test for the p value calculations (18). Genes with p values less than 0.01 were regarded as differentially expressed.
Real Time RT-PCR
RNA was purified as described above. Real time RT-PCR was performed on a RotorGene 2000 instrument from Corbett Research (Westborough, MA). Software version 4.6 was used during the run, and data were analyzed on software version 6.0. Real time RT-PCR was carried out using SuperScript III Platinum One-Step Quantitative real time RT-PCR system by Invitrogen according to the manufacturers protocol. Primer and TaqMan probe concentrations were 200 nm, MgSO4 was adjusted to 3.5 mm, and RNA was added to 50 ng/20-µl reaction. Cycling parameters consisted of the RT reaction at 55 °C for 35 min, initial denaturation at 95 °C for 2 min, and 35 cycles of 95 °C for 15 s and 60 °C for 30 s. All samples were performed in triplicate with appropriate controls (no template control and no amplification control). 16 S rRNA was used as a standard (details of TaqMan probes are in Supplemental Table S1).
Bioinformatics
The quantitative combined array/proteomic dataset (Supplemental Table S2), the qualitative proteomic dataset (Supplemental Table S3), color images of the M. maripaludis proteome and transcriptome (Fig. 2), the signal intensity/peptide number calculations (Fig. 4), and the codon bias calculations (Fig. 5) were generated using a Filemaker Pro 7.0 (Filemaker, Inc.) application developed in house.
|
|
|
| RESULTS |
|---|
|
|
|---|
|
|
|
This pattern is apparent in Fig. 2 where colored images of the proteome (Fig. 2A) and transcriptome (Fig. 2B) present the ORFs by Mmp number following the physical map of the genome (2). In these plots green indicates statistically supported higher expression in the mutant compared with wild type, red indicates lower expression, yellow indicates no statistical support for differential expression, and white indicates no data for the reasons described in the figure legend. For the sake of simplicity and because the errors associated with the expression ratios are sufficiently large in both the proteome and the transcriptome, we indicate only the direction of change in green and red, not the magnitude of change. This operon-based corroboration was nearly always found in those 15 cases where both the arrays and the proteome showed similar regulation. It was frequently the case also where only the arrays showed regulation: of the 56 ORFs that were regulated in the array only, 29 were in one of eight multicistronic operons with other ORFs that were coregulated. Similarly for the 72 ORFs regulated only in the proteome, 13 ORFs were in one of six operons that were coregulated. Hence there were good reasons to believe that many of the differences in expression observed by only one method were biologically significant. Third, the results from the proteome and the transcriptome were positively correlated overall. This relationship is shown in Fig. 3 where the expression ratios are plotted for the proteome versus the transcriptome for 652 ORFs that passed quantitative quality control criteria for both protein and mRNA measurements in both the mutant and the wild-type strains. Certain ORFs that were probed for but that did not yield signals associated with significant hybridization in the arrays were eliminated from the data used to plot Fig. 3 as were those not detected at the protein level.
|
|
We use the number of "redundant peptides" per ORF (n2) to refer to all peptides that can be associated with a particular ORF, including multiple measurements of the same peptide. The total signal intensity in MS1 summed over all peptides associated with a given protein also correlated well with the total number of redundant peptides (Fig. 4B). Most importantly, both the range of relative standard deviation values (Fig. 4C) and the range of observed ratios (Fig. 4D) were widest at low numbers of peptide pairs per ORF and narrowed markedly at high numbers of peptide pairs. These data support the contention that it is difficult to sort out true regulatory changes based on quantitative proteomic data for low numbers of peptide pairs. On the other hand, true changes in protein expression tended to stand out dramatically as the number of peptide pairs per protein exceeded
10 as indicated by the ORFs coded green in Fig. 4D for which up-regulation was supported by both proteomics and transcriptome arrays. These ORFs were generally supported by the strongest corroborative evidence for regulation as described above. In contrast, the two ORFs coded red in Fig. 4D did not stand out and may represent false positives in the proteome, the transcriptome, or both.
Proteome Coverage
Mathematical simulation, based on sampling theory and the known data acquisition characteristics of the LCQ mass spectrometer (data not shown), suggest that the true level of measurable protein expression is somewhat higher than the 55% of the annotated protein-encoding ORFs measured for S40:S2, around 70%. This number is based on our estimate of how many parent ions (MS1) that exceed our signal-to-noise threshold can actually be isolated per unit time for purposes of generating a collision spectrum (MS2). For the LCQ spectrometer, only about 3% of the potential parent ions generate daughter ion spectra under our conditions. Indeed proteome coverage was markedly greater in the larger dataset summarized in Supplemental Table S3 in which we have accumulated qualitative data for several samples of M. maripaludis. For the cumulative dataset we detected 1289 proteins or 75% of the predicted protein-encoding ORFs. The 1289 proteins detected in the M. maripaludis proteome are broken down by Oak Ridge Functional Class in Table IV. Preliminary data (not shown) using the LTQ mass spectrometer suggests that even higher coverage is possible, and we have measured protein expression for more than 80% of the protein-encoding ORFs for strain S2
hpt (1) grown under continuous flow (chemostat) conditions.3
|
We suspect that global coverage for the S40 versus S2 comparison was better on the mRNA side than it was with the proteomics because trends were generally more consistently observed across known operons in the arrays (Fig. 2). However, the difference in coverage may not be as great as might be concluded from a superficial comparison of the numbers of colored (non-white) circles in Fig. 2, A and B. This is because yellow circles in the microarray data (Fig. 2B) do not imply detection of mRNA in any absolute sense, only that the array contained a probe for the gene and that there was no statistically significant difference in mRNA level between S40 and S2. In contrast, yellow circles in Fig. 2A do indicate absolute detection.
Peptide Number as a Predictor of Protein Abundance
As has been observed by other laboratories working with different model systems (4, 5), our data tended to contain the highest numbers of peptides for those proteins that are generally known to have high abundance. Thus, the ORFs with the highest numbers of redundant peptides in our cumulative M. maripaludis database (Supplemental Table S3) included the S-layer protein that comprises the cell wall, subunits of the methyl-coenzyme M reductase that catalyzes a step in methanogenesis, the chaperonin Hsp60, translation elongation factors, and ribosomal proteins. These proteins are all known to be abundant in M. maripaludis.
To test the hypothesis that high peptide numbers (n, n1, and n2) are a general predictor of protein abundance, we plotted peptide numbers (normalized to the predicted molecular weight of the protein) versus Karlin indices (19), which are predictors of protein expression levels based on codon biases (Fig. 5). For one plot (Fig. 5A) we based our calculations on bioinformatic categories for highly expressed genes, including ribosomal proteins. Ribosomal proteins of M. maripaludis are 100% highly expressed by Karlins criteria, similar to bacteria but unlike most Archaea sequenced to date in which about 60% of the ribosomal proteins are highly expressed (20). In two other plots (Fig. 5, B and C), we based our calculations on our own measured peptide numbers. Either way, the plots were essentially the same and showed a moderate correlation between peptide numbers and Karlin indices. To achieve even a moderate correlation it was necessary to divide n1 or n2 by the predicted molecular weight of the protein because the higher the molecular weight, the higher the probability of detection. In general the correlations between codon bias and measured protein expression by peptide numbers are likely to remain moderate as long as mass spectrometer duty cycle, the number of mass spectra that can be acquired per unit time, is a limiting factor.
In addition to Karlin indices, we also calculated codon adaptation indices (CAI (21)), which have also been used to predict protein abundances. Both indices are recorded in Supplemental Table S3. However, unlike Karlin indices, CAI did not correlate well with our peptide numbers. Grocock and Sharp (22) have argued against the predictive value of CAI for organisms with strongly biased GC content; the mol % G + C of M. maripaludis is quite low at 33% (2).
| DISCUSSION |
|---|
|
|
|---|
|
Our work agrees with other published accounts that the results with LC-MS-MS are vastly more informative as a readout of gene expression than 2D PAGE technology (25, 26) at least for microorganisms. Indeed in the current comparison of the S40 and S2 strains of M. maripaludis, we identified peptides belonging to 939 proteins, covering about 55% of the predicted protein-encoding ORFs. This is at least an order of magnitude improvement over what we have been able to achieve with 2D gel technology, and the MudPIT type of approach is far less labor-intensive. Although less labor-intensive, the large amount of relatively expensive mass spectrometry time, 24 weeks for a complete proteome analysis of a microbe with on the order of 2000 protein-encoding ORFs, limits wider use of the technique at present. Interestingly as we observed with P. gingivalis (8), there was no obvious analytical discrimination based on hydrophobicity or isoelectric point (data not shown). In contrast, 2D PAGE methods are generally limited by those factors. It is recognized that in certain situations where the goal is to get as much information as possible about the mature form(s) of the protein, including all posttranslational modifications and splice variants, spatial resolution of the protein prior to digestion and mass spectral analysis, as in 2D PAGE, can have certain advantages. However, these advantages would apply only to the relatively small numbers of proteins that are amenable to gel-based methods. For a small genome microbial organism, like M. maripaludis, in which there is little splicing and the primary structure of mature protein is generally the product of a single gene, the MudPIT type of approach seems to work well and is superior to the other approaches we have investigated for measuring protein expression.
Our data reported here and from other microbial expression studies suggest that we are discriminating against smaller gene products, proteins below
710 kDa. Because very small ORFs are both difficult to detect in the genome during the annotation process and difficult to probe for in an array experiment, it is also difficult to assess the magnitude of such discrimination at this time. In general, for a method that is limited by the duty cycle of the mass spectrometer, the probability of a protein being detected in such a complex analytical scheme is directly proportional to its molecular weight and the degree to which it is expressed. The inability to reliably detect protein expression by short ORFs is a limitation shared by all global approaches of which we are familiar.
Differing Trends in Arrays and Proteomics
Those cases where the results of arrays and proteomics differed deserve further comment. In many cases expression changes in the same direction probably occurred at both the mRNA and proteome levels, but statistical support was found only for one method. In other cases the array or the proteomics data may contain false positives. On the proteomics side false positives become more likely when the number of peptide pairs measured is low as discussed previously. Hence in some cases differences in the arrays and the proteomics may be a statistical issue, and additional experiments are needed when the ORF in question is of particular interest. In other cases posttranscriptional regulation could have occurred, resulting in regulation in the proteome where there was none in the transcriptome. Finally differences could occur due to culture history and the relative stability of proteins. These experiments were performed with batch-grown cultures harvested during the linear growth phase. The linear growth phase is commonly observed during growth of hydrogenotrophic methanogens and represents the point in which the growth rate is limited by the transfer of H2 from the gas phase to the medium (27). In linear growth, the specific activity for methanogenesis is decreasing, and the cellular composition is changing. The cells for transcription analysis and proteomics were harvested at the same point in the growth cycle. However, because proteome analysis measures an analyte that is in general more stable with a much longer half-life in the cell, the proteome will reflect previous mRNA expression. It would include protein expression that occurred at low cell densities prior to harvest. In contrast, transcriptional arrays reflect the levels of mRNA at the time of harvesting. More detailed and useful correlation analysis would require explicit knowledge of the kinetics of biosynthesis and degradation for each mRNA and protein in the dataset.
Data Compression
Real time RT-PCR often shows larger -fold differences between conditions than transcriptome arrays (28, 29). The difference between the two techniques seems to correlate with the degree of regulation with larger -fold changes producing greater differences between the array and real time RT-PCR results (29). The results of the M. maripaludis array presented here are consistent with these observations. In general only modest differences were seen between S40 and S2 on the arrays, and these were in general agreement with the results of the real time RT-PCR assays. The consistency of the results is illustrated by the ORF Mmp1503, which encodes PorE, a subunit of pyruvate oxidoreductase (see Table III). For this ORF real time RT-PCR indicated only a slightly higher expression ratio than did arrays, proteomics, and enzyme activity measurements.
The moderate nature of the expression differences between S40 and S2 precluded a full analysis of data compression in the proteome. However, the linear dynamic range of the 2D LC-MS-MS approach in our experiments is only about 2 orders of magnitude, whereas the range of protein expression may be upward of 68 orders of magnitude (25, 26). Hence we expect that, similar to arrays, data compression will increase with increasing -fold differences.
Outlier Detection and Removal
Low signal-to-noise is inherent in global analyses by quantitative 2D LC-MS-MS, necessitating the measurement of large numbers of 15N/14N peptide pairs to narrow the relative standard deviations of the ratios as shown in Fig. 4C. In some cases we used more than 300 peptide pairs to calculate the mean ratio for a single protein. This situation complicates the detection of outliers. Dixons Q-test (13) has been popular in the proteomics community. However, Dixons Q-test is not recommended when the number of data points is greater than about 30 (13) and handles groups of randomly clustered outliers characteristic of large datasets poorly, tending to include them as points to be retained. For this reason, we added a second stage of outlier detection based on a MAD modified z-score test (1416). The MAD modified z-score was better at removing random clusters of multiple outliers for ORFs with large values for n and n1. By Dixons Q-test, 300 records were detected as outliers from 21,524 records, or 0.72%. Then by the MAD modified z-score test, 1520 records were detected as outliers from the remaining 21,224 records, or 7.2%. The z-score cutoff of 3.5 used in this study was derived empirically and should be viewed as appropriate for these data, not as a generally useful value.
Statistical Accuracy for Highly Regulated Genes
Our statistical screening appears to have accurately determined many of the proteins that are present at different levels in S40 versus S2 (see ORFs indicated in green in Fig. 4D). The statistical analysis appears to have worked well for most ORFs whose expression ratios using either proteomics or microarrays were within a factor of 4 of a mean normalized value of 1.0 on a linear scale (see Supplemental Table S2). On the proteomics side, ratios outside a factor of four tended to derive from measurements in which true signal, confirmed by a collision spectrum, was measured for one state, whereas noise or very low abundance signals of large standard deviation were measured in the other state. Several of the data points in the upper left portion of Fig. 4, C and D, were typical. Those ORFs were supported by low n1 values and may not reflect real changes in protein expression. However, genes showing an "all on" or "all off" type of regulatory pattern would in principle be especially prone to generating these types of data. This could lead to the paradoxical situation where a gene that may in fact have had a dramatic expression change could yield protein data that would fail to pass our statistical tests, due to highly variable but low signal intensity in the state where little expression was detected, that in turn would influence the standard deviation of the ratio as a whole. Such data would also run a higher risk of being rejected by our outlier detection algorithm. This problem has now been corrected by the use of a weighting scheme 4 in which high signal-to-noise data contribute proportionately more to the standard deviation calculation for the ratio as a whole. The p values calculated for the spotted cDNA arrays using the Wilcoxon test were not subject to this problem.
Prospects for M. maripaludis Proteomics
The study presented here illustrates the feasibility of quantitative 2D LC-MS-MS-based proteomics for M. maripaludis. Metabolic labeling allowed the simultaneous comparative analysis of peptides from two conditions, analogous to "two-state" microarray experiments. Coverage was high, and quantitation was sufficiently accurate and precise to detect changes in protein expression that were subsequently confirmed by multiple lines of evidence. The measurement of large n1 values for a given ORF was essential for accurate and precise relative quantitation. We remain skeptical of the idea that biologically meaningful expression ratios can be generated from small numbers of peptide pairs regardless of the labeling or quantitation approach used. The relationship between number of heavy and light peptide pairs and confirmed expression changes shown in color in Fig. 4D is typical of what we see in all of our quantitative proteomic work with microbes and should be understood as a general phenomenon. True, validated, and biologically meaningful expression changes at the protein level in a two-state experiment require a high number of peptide pairs recovered per protein under our conditions. Our thinking about proteomic expression ratios parallels recent thinking in the microarray field where in general larger numbers of replicate measurements are now used to truly differentiate biological regulation from all the various sources of noise and error that can become an issue with such global assessments of gene expression (12). To return to the question asked in the Introduction, can we do as good a job with global measurements of the proteome using MudPIT as we can with the transcriptome using microarrays? The selected data presented in Tables I andTable II and the complete quantitative dataset presented in Supplemental Table S2 suggest an answer of "no, we are not quite there yet but getting close." These data were collected with an LCQ instrument that is no longer state-of-the-art. Coverage and quantitation have improved markedly with advances in mass spectrometric technology, and our most recent data3 indicates that by using an LTQ mass spectrometer with a much faster scanning speed we can greatly minimize the duty cycle issue mentioned repeatedly in this report as a fundamental limitation of the MudPIT approach with respect to qualitative coverage and quantitative reliability issues. Thus, for a tractable model organism with fewer than
2000 protein-encoding ORFs, such as M. maripaludis, it should now be possible to make global comparisons for all known protein-encoding genes under the assumption that the information will be as reasonably complete for the proteome as for the transcriptome.
| ACKNOWLEDGMENTS |
|---|
| FOOTNOTES |
|---|
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
Published, MCP Papers in Press, February 17, 2006, DOI 10.1074/mcp.M500369-MCP200
1 The abbreviations used are: MudPIT, multidimensional protein identification technology; CAI, codon adaptation indices; E (g), Karlin index or general expression measure; HEG, highly expressed gene; MAD, median of the absolute deviation; Mmp, M. maripaludis; n, number of unique peptides per ORF; n1, number of 15N/14N peptide pairs per ORF; n2, number of redundant peptides per ORF; POR, pyruvate oxidoreductase; SSC, sodium chloride/sodium citrate buffer; 2D, two-dimensional; dH2O, distilled H2O; NHS, N-hydroxysuccinimide. ![]()
2 www.tm4.org/spotfinder.html. ![]()
3 Q. Xia, T. Wang, M. Hackett, and J. A. Leigh, unpublished data. ![]()
4 T. Wang, Q. Xia, M. Meila, and M. Hackett, manuscript in preparation. ![]()
* This work was supported by Microbial Cell Project DE-FG03-01ER15252 from the Department of Energy and National Institutes of Health Grant GM60403 (to J. A. L.). Funding for portions of the proteomics software and infrastructure development were provided by National Institutes of Health Grant DE014372 and the University of Washington College of Engineering (to M. H.). ![]()
The data discussed in this publication have been deposited in the NCBI Gene Expression Omnibus (GEO) and are accessible through GEO Series accession numbers GSE2744 for the proteomics and GSE2745 for the spotted cDNA arrays.
S The on-line version of this article (available at http://www.mcponline.org) contains supplemental material. ![]()
¶ Both authors contributed equally to this work. ![]()
|| Present address: Dept. of Biological Engineering, Massachusetts Inst. of Technology, 77 Massachusetts Ave., Cambridge, MA 02139-4307. ![]()

To whom correspondence should be addressed: Dept. of Chemical Engineering, Box 351750, University of Washington, Seattle, WA 98195. Tel.: 206-616-8071; Fax: 206-543-8297; E-mail: mhackett{at}u.washington.edu
| REFERENCES |
|---|
|
|
|---|