Proteomic Analysis of an Extreme Halophilic Archaeon, Halobacterium sp. NRC-1*

Halobacterium sp. NRC-1 insoluble membrane and soluble cytoplasmic proteins were isolated by ultracentrifugation of whole cell lysate. Using an ion trap mass spectrometer equipped with a C18 trap electrospray ionization emitter/micro-liquid chromatography column, a number of trypsin-generated peptide tags from 426 unique proteins were identified. This represents approximately one-fifth of the theoretical proteome of Halobacterium. Of these, 232 proteins were found only in the soluble fraction, 165 were found only in the insoluble membrane fraction, and 29 were found in both fractions. Previously annotated proteins accounted for 72 and 61% of the proteins identified in the soluble and membrane protein fractions, respectively. Interestingly, 57 previously unannotated proteins found only in Halobacterium NRC-1 were identified. Such proteins could be interesting targets for understanding the unique physiology of Halobacterium NRC-1. A group of proteins involved in various metabolic pathways was identified among the expressed proteins, suggesting that these pathways were active at the time the cells were collected. These data, comprising a list of expressed proteins, their cellular locations, and their biological functions, could be used in future studies to investigate the interaction of the genes and proteins in relation to genetic or environmental perturbations.

Since the completion of the first bacterial genome, Haemophilus influenzae (1), more than 100 microbial genome sequences including Halobacterium sp. NRC-1 (2, 3) have been determined (www.ncbi.nlm.nih.gov/PMGifs/Genomes/micr.html). These sequences constitute the primary digital information for global understanding of physiology, pathogenicity, and molecular machineries essential for the survival or adaptation of the organisms in different environmental conditions. The Halobacterium genome consists of a 2,014-kb large chromosome and two smaller chromosomes (191 kb and 365 kb) encoding ~2,630 putative protein genes (2, 3). Among these, 41% matched to genes of known function in public databases. The predicted proteome is highly acidic with a median isoelectric point of 4.9 (4). The high negative surface charge of predicted proteins provides stability to the proteins in nearly saturated intracellular salinity where other conventional proteins would become denatured (4, 5).
The archaeon Halobacterium sp. NRC-1 provides a relatively simple model for understanding a complex system of how cells adjust to various environmental stimuli. Halobacterium flourishes in extremely saline environments (>4 M salts), and its metabolism is subject to fluctuations in sunlight, oxygen, temperature, nutrients, and salinity. Halobacterium thrives in this harsh environment by appropriately tuning its extraordinary physiology in response to different environmental stimuli. For example, it can relocate, in search of favorable environments, using sensors that can discriminate beneficial and detrimental spectra of light (6-8), an aerotaxis transducer (HtrVIII) (9), and buoyant gas-filled vesicles (10, 11). The Halobacterium transducer, HtrVIII, combines subunit I core structures of eukaryotic cytochrome c oxidase and eubacterial methyl-accepting chemotaxis proteins to mediate aerotaxis (9). One of the interesting features of Halobacterium is its ability to survive aerobically as a chemoheterotroph and anaerobically using light and/or arginine as energy sources. Halobacterium sp. derives energy from light by its retinal-containing light-driven ion transporters, bacteriorhodopsin and halorhodopsin (12-16). Halobacterium can also ferment arginine via the arginine deiminase pathway to yield 1 mol of ATP for each mole of fermented arginine (17, 18).
Its intriguing physiology together with the availability of a complete genome sequence led us to catalogue via a simplified shotgun proteomic methodology (19, 20) the proteins expressed by Halobacterium sp. NRC-1 in membrane and cytoplasmic compartments. In addition, the proteins involved in metabolic pathways in Halobacterium sp. NRC-1 under standard culture conditions were investigated. Herein we present the results of our initial investigation of the Halobacterium proteome using a simple shotgun proteomic approach that involves bulk digestion of copurified proteins with trypsin followed by a single stage of microcapillary high-pressure liquid chromatography (LC) electrospray ionization (ESI) tandem mass spectrometry (MS/MS) analysis of peptides using an ion trap mass spectrometer. The data contain a list of expressed proteins derived by searching peptide tandem mass spectra against the theoretical protein database of Halobacterium sp. NRC-1 (2, 3), their cellular locations (i.e. membrane, cytoplasm, or both) deduced from subcellular fractionation prior to proteome analysis, and the putative biological functions of the proteins.
Protein Preparation-Membrane and soluble-cytoplasmic proteins were isolated using a protocol modified from a halophiles laboratory manual and Oesterhelt (21, 22). One liter of Halobacterium sp. NRC-1 culture was grown to OD600 = ~2.0 and pelleted by centrifugation at 7,500 rpm at 4°C for 10 min. Pellets were resuspended in 20 ml basal salt solution containing 0.5 mg each of DNaseI and RNaseA and 1 mM of proteinase inhibitor, phenylmethylsulfonyl fluoride (PMSF). Cells were lysed by osmotic shock against a 40× excess of deionized water within a dialysis tubing bag (Spectra/Por® membrane MWCO: 3,500; Spectrum, Rancho Dominguez, CA). Cell debris was removed by centrifugation at 10,000 × g for 30 min. The remaining cell lysates were then separated into the soluble and membrane fractions by ultracentrifugation at 53,000 × g for 2 h. The membrane fraction, a pellet at the bottom of the tube, and the soluble fraction, the aqueous supernatant portion, were then collected. The membrane was loaded on top of a 30% sucrose cushion and ultracentrifuged at 53,000 × g at 10°C overnight. The membrane fraction was collected and washed three times in 10 ml basal salt solution using a hand-held electrical homogenizer (Tissue-Tearor; Fisher, Pittsburgh, PA). Membrane proteins were then collected by centrifugation at 53,000 × g for 2 h at 10°C. The pellet was resuspended in residual basal salt solution and then transferred to a microcentrifuge tube. The residual aqueous basal salt solution was removed by a brief spin at 14,000 rpm. The soluble protein fraction was dialyzed against five changes of 100× volume of deionized water at 4°C to reduce the salt concentration, which in excess might inhibit the protease reaction and mass spectrometry analysis.
Protease Digestion-One hundred micrograms of proteins were digested with 2 μg of trypsin (Promega, Madison, WI) in 50 mM sodium bicarbonate (pH 8.3) at 37°C overnight. Soluble proteins were lyophilized after digestion. Membrane proteins were digested in the presence of 0.5% SDS to aid solubilization. After the protease reaction, SDS was removed by precipitating proteins with 70% acetone or by chromatography using a cation exchange cartridge (OASIS MCX; Waters, Milford, MA) according to the manufacturer's procedure. The proteins were lyophilized and stored at −80°C and resuspended in 100 μl of 0.4% acetic acid solution prior to mass spectrometer analysis.
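The digestion step can also be mirrored in silico when reasoning about which peptides a database search might match. The sketch below is a minimal illustration, not the procedure used in this study: it applies the commonly cited trypsin rule (cleavage C-terminal to K or R, except when the next residue is P). The function name, the `missed_cleavages` parameter, and the example sequence are hypothetical.

```python
def tryptic_peptides(sequence, missed_cleavages=0):
    """Return peptides from an in silico trypsin digest of `sequence`.

    Assumed rule: cleave after K or R unless the next residue is P.
    `missed_cleavages` also yields peptides spanning up to that many
    uncleaved sites.
    """
    if not sequence:
        return []
    # Cleavage positions are indices just after a qualifying K/R.
    sites = [i + 1 for i, aa in enumerate(sequence[:-1])
             if aa in "KR" and sequence[i + 1] != "P"]
    bounds = [0] + sites + [len(sequence)]
    peptides = []
    for start in range(len(bounds) - 1):
        last_end = min(start + 2 + missed_cleavages, len(bounds))
        for end in range(start + 1, last_end):
            peptides.append(sequence[bounds[start]:bounds[end]])
    return peptides


# Hypothetical sequence, chosen only to exercise the K/R-before-P exception.
example = "MKTAYRPLKGGR"
fully_cleaved = tryptic_peptides(example)
with_one_missed = tryptic_peptides(example, missed_cleavages=1)
```

In the example, the R at position 6 is followed by P, so the sketch leaves that bond intact and "TAYRPLK" is produced as a single peptide.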
Tandem Mass Spectrometry-Trypsin-digested peptides were analyzed by LC-ESI-MS/MS using an LCQ-DECA mass spectrometer (Thermo Finnigan, San Jose, CA) equipped with a C18 trap ESI emitter/micro-LC column. Trypsin-digested peptides (2 μg) were loaded onto a Hewlett Packard/Agilent 1100 Series high-pressure LC system using a Famos Autosampler (Dionex, San Francisco, CA). The peptides bound to the C18 matrix were eluted by an acetonitrile gradient (5% to 35%) produced by mixing acetonitrile with 0.4% acetic acid in water. The eluted peptides were injected into the mass spectrometer by nano-ESI (19, 23). Mass spectra were acquired by data-dependent ion selection from a full range as well as discrete and narrow survey scan m/z ranges to increase the number of identifications. Proteins were identified from tandem mass spectra using the SEQUEST (24) database search engine to search against the Halobacterium NRC-1 predicted protein database (3).
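As a rough illustration of the discrete survey scan ranges, the following sketch splits the full 400-2000 m/z range into 1, 4, or 16 equal-width windows, which are the segmentations reported in this study. The function name and its parameters are hypothetical, and the sketch only generates window boundaries; it does not control an instrument.

```python
def survey_windows(low=400, high=2000, n_windows=1):
    """Split [low, high] m/z into n_windows contiguous, equal-width ranges."""
    width, remainder = divmod(high - low, n_windows)
    if remainder:
        raise ValueError("range does not divide evenly into n_windows")
    return [(low + i * width, low + (i + 1) * width) for i in range(n_windows)]


full_range = survey_windows(n_windows=1)    # one 400-2000 window
four_ranges = survey_windows(n_windows=4)   # 400-wide windows
sixteen_ranges = survey_windows(n_windows=16)  # 100-wide windows
```

Narrower windows restrict data-dependent selection to ions within one slice of the m/z range per run, which is why several runs are needed to cover the full range.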
Membrane Domain Prediction-Halobacterium putative proteins were analyzed for the presence of transmembrane domains using the TMpred (25) and TMHMM programs (26, 27). The TMpred program predicts membrane-spanning regions (MSRs) and their orientation based on the statistical analysis of TMbase, a database of transmembrane proteins and their helical membrane-spanning domains. The prediction is based on an algorithm using a combination of several weight-matrices for investigating the local properties of amino acid sequences. The program TMHMM, on the other hand, takes a global approach to determine the topology of an entire protein based on hidden Markov models. The stand-alone TMpred program was installed on a Sun Microsystems Enterprise 420R server. TMHMM (v. 2.0) was run through the web interface (www.cbs.dtu.dk/services/TMHMM).

Membrane Protein Prediction-The Halobacterium NRC-1 genome encodes 2682 putative protein-coding genes (3). Among these, 2413 genes are unique. The TMHMM program predicted 544 membrane proteins containing 1 to 24 MSR(s), among which 163 were annotated proteins, 122 were conserved hypothetical proteins (CHP), and 259 were hypothetical proteins (HP). On the other hand, TMpred detected 929 membrane proteins containing 1 to 22 MSR(s) with a total score of more than 1000, among which 377 were annotated proteins, 202 were CHP, and 350 were HP. A score >500 is considered to be statistically significant in TMpred prediction (25). TMpred also detected all 544 membrane proteins predicted by the TMHMM program, with a minimal TMpred total score of 1194.
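The two membrane-protein criteria and the category tallies above can be sketched as follows. The counts are those reported in the text; the function name, the argument names, and the idea of holding per-protein predictions as plain numbers are assumptions made for illustration, and neither program is actually invoked.

```python
TMPRED_MIN_TOTAL_SCORE = 1000  # cutoff applied to TMpred total scores in the text


def membrane_calls(tmhmm_msr_count, tmpred_total_score):
    """Apply the two criteria to one protein's (hypothetical) predictor output."""
    return {
        "tmhmm": tmhmm_msr_count >= 1,
        "tmpred": tmpred_total_score > TMPRED_MIN_TOTAL_SCORE,
    }


# Category counts as reported in the text.
reported = {
    "TMHMM": {"annotated": 163, "CHP": 122, "HP": 259},
    "TMpred": {"annotated": 377, "CHP": 202, "HP": 350},
}
totals = {program: sum(counts.values()) for program, counts in reported.items()}
# totals reproduces the stated 544 (TMHMM) and 929 (TMpred) membrane proteins.
```

Since the lowest TMpred total score among the TMHMM-predicted proteins was 1194, every TMHMM call also passes the TMpred cutoff of 1000 in this scheme.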
TABLE I Proteins identified in the membrane fraction
The functional category, the gene identification number, the ProteinProphet™ probability, the abbreviated protein name, the putative function, and the number of TMHMM- and TMpred-predicted transmembrane domains for each protein are listed.

Mass Spectrometry Peptide Analysis-Two micrograms of trypsin-digested peptide mixtures from the membrane and soluble proteins were analyzed by LC-ESI-MS/MS with the following different m/z ranges from which ions were selected for collision-induced dissociation (CID): 1 (400-2000 m/z), 4 (400-800, 800-1200, 1200-1600, and 1600-2000 m/z), or 16 (400-500, 500-600, 600-700, . . . , and 1900-2000 m/z). Using different m/z ranges has been shown to increase the number of novel peptides selected for CID (28, 29). The tandem mass spectra were analyzed using the SEQUEST database search program (24) with the Halobacterium NRC-1 protein database. Search results were processed using the INTERACT web interface, a software tool that allows internet-based data display, data filtering, and data sorting (30). Recently developed statistical modeling algorithms that compute probabilities associated with peptide (PeptideProphet™) (31) and protein (ProteinProphet™) (32) sequence assignments, and thereby distinguish correct from incorrect database search results, were used to validate the search results. These tools assign probabilities to all identifications and offer a standardized interpretation of results by reducing the need for manual verification. In particular, they enabled rapid and objective evaluation of large proteomic datasets. A detailed application of these tools has been recently published (33). The tools are open source, and more information on them can be found on the Proteomics pages at http://www.systemsbiology.org/. In this study, we report proteins with a probability of at least 0.5. A probability of 0.5 means that, according to the statistical model, the given sequence match is 50% likely to be correct. This threshold resulted in the identification of 426 proteins with a false-positive rate of 3.7% (Fig. 1).
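To make the relationship between a probability threshold and a false-positive rate concrete, the sketch below filters identifications at a minimum probability and estimates the expected fraction of incorrect entries as the mean of (1 - p) over the accepted set. This is one common way to derive an error rate from well-calibrated probabilities and is offered as an assumption about the general approach, not as the exact computation behind the 3.7% figure. The function name and the example probabilities are hypothetical.

```python
def filter_and_error_rate(probabilities, threshold=0.5):
    """Keep identifications with probability >= threshold and estimate
    the expected false-positive rate among them as mean(1 - p)."""
    accepted = [p for p in probabilities if p >= threshold]
    if not accepted:
        return accepted, 0.0
    expected_false = sum(1.0 - p for p in accepted)
    return accepted, expected_false / len(accepted)


# Hypothetical probabilities for four candidate proteins.
accepted, rate = filter_and_error_rate([1.0, 0.9, 0.5, 0.3], threshold=0.5)
# Three entries pass; expected false positives = 0.0 + 0.1 + 0.5 = 0.6,
# so the estimated rate is 0.6 / 3 = 0.2.
```

Under this reading, a low overall false-positive rate at a 0.5 cutoff implies that most accepted identifications carry probabilities close to 1.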
Proteins in Membrane Fraction-In the MS/MS analysis, 165 proteins were identified only in the membrane fraction but not in the soluble protein fraction (Table I).
Proteins in Soluble Fraction-A total of 232 proteins were identified only in the soluble fraction but not in the membrane fraction. Of these, 168 (72.4%) were annotated proteins, 45 (19.4%) were CHP, and 19 (8.2%) were HP (Table II). TMHMM detected only three membrane proteins, and each contained only a single putative membrane domain. TMpred detected 40 (17.2%) putative membrane proteins with scores greater than 1000.
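The assignment of proteins to "membrane only", "soluble only", or "both" amounts to set operations on the identifiers found in each fraction, and the reported percentages follow directly from the counts. The sketch below is a minimal illustration; the function names and the protein identifiers are hypothetical placeholders, while the counts checked at the end are those stated in the text.

```python
def partition_by_fraction(membrane_ids, soluble_ids):
    """Split identified proteins by the fraction(s) in which they were found."""
    membrane_ids, soluble_ids = set(membrane_ids), set(soluble_ids)
    return {
        "membrane_only": membrane_ids - soluble_ids,
        "soluble_only": soluble_ids - membrane_ids,
        "both": membrane_ids & soluble_ids,
    }


def percent(count, total):
    """Percentage rounded to one decimal place, as reported in the text."""
    return round(100.0 * count / total, 1)


# Hypothetical identifiers, for illustration only.
groups = partition_by_fraction({"protA", "protB", "protC"}, {"protC", "protD"})

# Counts from the text: the three groups sum to the 426 identified proteins,
# and the soluble-only categories reproduce the stated percentages.
total_identified = 232 + 165 + 29
soluble_only_percentages = [percent(n, 232) for n in (168, 45, 19)]
```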
Proteins in Both Membrane and Soluble Fractions-A total of 29 proteins were identified in both soluble and membrane fractions. These included protein components involved in large complex structures such as the ribosome, flagella, and gas vesicle (Table III). There were 27 annotated proteins and 2 HP. TMHMM predicted that three of the proteins contain one membrane domain, and TMpred predicted that eight of the proteins contain one or two membrane domains with a score >1000. No membrane domain was predicted in 20 (69%) proteins.
DISCUSSION
In this study, we applied a simplified shotgun proteomic approach using LC-ESI-MS/MS and computational analysis to characterize the peptides in complex mixtures of trypsin-digested membrane and soluble proteins. While this is a powerful technique for rapidly screening the peptide components and by inference the parent proteins in a sample, there are certain limitations to this approach, which include: 1) peptide ion selection for CID during LC introduction is "top-down" and to some degree random (29), meaning that peptides that ionize well and that are from the more abundant proteins in the original mixture are the most likely to be selected; 2) for a protein to be identified, the peptide tandem mass spectrum used in the database search must be of sufficient "quality" (which in part is related to the abundance of the peptide) to match a sequence in the database; 3) the absence of a sequence in the database for which a high quality peptide tandem mass spectrum is generated may lead to a false-positive because the software can generate a best-fit to a highly similar sequence that is present, although the probability scoring routine used minimizes this; 4) high versus low protein sequence coverage lends more weight to a protein identification and may be an indication of its relative abundance among proteins present in the original mixture; and 5) our search results were based solely on matching predicted protein sequences (i.e. post-translational modifications were not considered). The simplified shotgun proteomic approach was chosen instead of two-dimensional PAGE-MS methods (34, 35) primarily because LC-MS/MS allows direct analysis of hydrophobic membrane proteins and also because it serves as a relatively rapid screen of expressed proteins.
A total of 401 chromosome proteins and 25 minichromosome pNRC100 and pNRC200 proteins were identified. In order to obtain functional and physiological information regarding these expressed proteins, 295 of the expressed proteins with putative functions were searched against the KEGG Enzymes/Compounds/Genes Pathway Database (http://www.genome.ad.jp/kegg-bin/mk_point_html). A number of metabolic pathways had more than 50% of their members present in the group of expressed proteins, suggesting either that such pathways were active at the time the cells were collected or that the identified proteins are constitutively expressed (Table IV).
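The "more than 50% of members present" criterion can be expressed as a simple coverage filter. The sketch below is illustrative only: the function name is hypothetical, and the pathway names and counts are made-up placeholders rather than values from Table IV.

```python
def well_covered_pathways(pathway_counts, min_fraction=0.5):
    """Return pathways whose identified/total ratio exceeds min_fraction.

    pathway_counts maps a pathway name to (identified, total) protein counts.
    """
    covered = {}
    for name, (identified, total) in pathway_counts.items():
        if total > 0 and identified / total > min_fraction:
            covered[name] = identified / total
    return covered


# Hypothetical counts, for illustration only.
example_counts = {
    "pathway A": (6, 10),   # 60% of members identified
    "pathway B": (2, 9),    # about 22%
    "pathway C": (5, 10),   # exactly 50%, so not "more than 50%"
}
covered = well_covered_pathways(example_counts)
```

A strict inequality is used so that a pathway with exactly half of its members identified is not counted, matching the wording "more than 50%".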
There were 29 proteins with tryptic peptides identified from both membrane and soluble protein fractions. These included    (36) where they are connected to the cytoplasmic membrane, or alternatively as unassembled precursor proteins. In Halobacterium, the formation of gas vesicles is induced under low oxygen conditions, enabling cells to float to the surface and grow phototrophically (10, 11). Interestingly, the major gas vesicle protein, GvpA, was identified only in the membrane fraction. The GvpC peptides detected in the soluble protein mixture might have come from GvpC molecules detached from gas vesicle surfaces. Ribosomal proteins were found in soluble, membrane, or both fractions. Because ribosomal protein complexes are large structures, some ribosomes might have copurified with the membrane during ultracentrifugation while most of them remained in the soluble fraction. The cell surface glycoprotein (Csg) was one of the most frequently identified proteins in our analysis. Of the ~800 tandem mass spectra matched to Csg, 43% were from the membrane fractions. Csg may be observed in the soluble fraction if it detached from the cell membrane surface, or possibly as very small membrane fragments suspended in the supernatant after ultracentrifugation.
We next evaluated the extent to which our biochemical fractionation was successful in segregating membrane from soluble proteins. To do this we examined the peptides from the top three membrane and soluble proteins separately. Top protein candidates were those proteins with the highest number of matched tandem mass spectra and for which all tandem mass spectra had a ≥0.9 probability of being correctly matched. The top three soluble proteins were CctB, CctA, and ArcB. Of these three proteins, peptides from CctB were selected for CID 715 times in the soluble fraction but only 5 times in the membrane fraction, indicating that very little of this soluble protein was found in the membrane fraction. By this same measure, CctA and ArcB were also well segregated to the soluble fraction, as demonstrated by finding that 638 out of 646 tandem mass spectra were selected only in the soluble fraction for CctA and 436 out of 444 for ArcB. Likewise, the top three membrane proteins, YqgG, DppD, and Vng1802H, also appeared to be well segregated to the membrane fraction because 605 out of 681, 224 out of 244, and 110 out of 124 of their tandem mass spectral matches were found only in the membrane fraction. This test suggested that the ultracentrifuge purification procedure effectively partitioned most of the proteins to either the soluble or the membrane fraction, reflecting their original cellular location.
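The segregation test reduces to the share of each protein's matched spectra that fall in its expected fraction. The sketch below recomputes those shares from the spectral counts given above; the function name and the dictionary layout are illustrative assumptions, and the CctB total is taken as 715 + 5 = 720 spectra.

```python
def segregation_percent(in_expected_fraction, total_spectra):
    """Percentage of a protein's matched spectra found in its expected fraction."""
    return round(100.0 * in_expected_fraction / total_spectra, 1)


# (spectra in expected fraction, total spectra), taken from the text.
spectral_counts = {
    "CctB": (715, 715 + 5),   # soluble
    "CctA": (638, 646),       # soluble
    "ArcB": (436, 444),       # soluble
    "YqgG": (605, 681),       # membrane
    "DppD": (224, 244),       # membrane
    "Vng1802H": (110, 124),   # membrane
}
segregation = {name: segregation_percent(hit, total)
               for name, (hit, total) in spectral_counts.items()}
```

By this measure the three soluble proteins are each above 98%, while the three membrane proteins fall between roughly 88% and 92%, which is consistent with the conclusion that most, but not all, of each protein partitioned to its expected fraction.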
There are several sets of proteins of interest in Halobacterium sp. NRC-1 that are important to its physiology. They include the 6 TATA-box binding proteins (TBP), the 7 transcription factor B (TFB) proteins, the 17 members of the Htr signal transducer family, and the gas vesicle proteins (2, 3). In this study, only the expression of TbpE and TfbG was detected among the multiple putative transcription factor proteins. This raises the question of whether the other transcription factors would be expressed under different physiological conditions or growth phases. This observation may also be due to the other proteins being expressed at levels too low for detection. Interestingly, TbpE is located on the chromosome, while all the other Tbps are found on the pNRC100 or pNRC200 minichromosomes. Another possibility is that TbpE is the major TBP controlling transcription with TfbG. At least 11 (Htr1, Htr2, Htr3, Htr4, Htr5, Htr6, Htr8, Htr13, Htr14, Htr15, and Htr16) of the 17 signal transducers were detected. Expression of multiple transducer proteins suggests dynamic cellular functions in response to rapidly changing environmental conditions. The gas vesicle gene clusters, gvpACNO and gvpDEFGHIJKLM, on pNRC100 have been extensively studied by genetic approaches to identify the essential genes required for gas vesicle biogenesis (11, 37-42). Through analyses of spontaneous gas vesicle-deficient mutants caused by transposition of insertion sequence (ISH) elements, site-specific linker insertion, and deletion mutants, at least 10 of the 14 gvp genes were determined to be necessary for gas vesicle synthesis. Our MS/MS analysis identified GvpA, GvpH, and GvpN in the membrane fraction, GvpO in the soluble fraction, and GvpC in both fractions. Interestingly, GvpH and GvpO were not identified from purified gas vesicles by LC-ESI-MS/MS analysis in our recent study. This suggests that GvpH and GvpO may not be gas vesicle structural proteins.
Our study compared the results of two transmembrane helix prediction programs used to predict the presence of membrane domains in the expressed proteins. In most cases, TMpred predicted a larger number of MSRs than TMHMM. A previous comparison of 14 membrane protein prediction programs on 883 defined MSRs of 188 well-characterized proteins indicated that TMHMM is currently the best performing transmembrane prediction program (44). Thus, in our study, TMHMM was primarily used to predict the membrane domains while TMpred was used as a supplemental program to support the results generated by TMHMM.
The computationally predicted membrane domains were somewhat in agreement with the results of the mass spectrometry. Nearly 99% (229 of 232) of the soluble fraction proteins do not contain TMHMM-predicted MSRs, and 82.8% do not contain TMpred-predicted MSRs with a total score greater than 1000. Accordingly, 74.5% of the membrane fraction proteins contained one or more TMpred-predicted MSR(s) with scores greater than 1000, and 54.5% contained at least one TMHMM-predicted MSR. The low percentage may be due to the presence of nonintegral or peripheral membrane proteins without MSRs, which are unlikely to be detected by either prediction program. Analysis of a genome sequence can provide a list of all predicted genes, yet the expression of the corresponding transcripts or proteins is difficult to measure. It is unlikely that all of the predicted putative proteins are expressed. Among the expressed proteins, post-translational or chemical modifications could lead to the formation of new or functionally different proteins. Our study was based on matching predicted proteins of genome sequences, and post-translational modifications were not considered. The failure to detect a protein in this study thus does not mean that it is absent. Sophisticated developments in mass spectrometers and enhanced sample preparation protocols will enable more expressed proteins to be identified in the future.

TABLE III Proteins identified in both membrane and soluble fractions
The functional category, the gene identification number, the ProteinProphet™ probability, the abbreviated protein name, the putative function, and the number of TMHMM- and TMpred-predicted transmembrane domains for each protein are listed.

TABLE IV Expressed proteins in metabolic pathways
Expressed proteins were analyzed for their pathway involvement by searching the KEGG Enzymes/Compounds/Genes pathway database. The pathway, the total number of proteins involved in a putative pathway in Halobacterium sp. NRC-1, the number of proteins identified in this study, the gene identification number, and the abbreviated protein name with the putative function are listed. Some proteins are involved in more than one pathway.
The integrated computational and mass spectrometry data analysis in this study gives a better understanding of the expressed proteins and cellular locations of membrane and cytoplasmic proteins in Halobacterium sp. NRC-1. The information obtained from this study will be useful for subsequent work, such as expression of proteins of interest to investigate certain aspects of Halobacterium biology. It can also be used with integrated microarray and isotope-coded affinity tag data (45) for systems approaches (46) to study novel biological processes in halophiles.
* The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.