Proteome Analysis of an Aerobic Hyperthermophilic Crenarchaeon, Aeropyrum pernix K1

We analyzed the proteome of a crenarchaeon, Aeropyrum pernix K1, by using the following four methods: (i) two-dimensional PAGE followed by MALDI-TOF MS, (ii) one-dimensional SDS-PAGE in combination with two-dimensional LC-MS/MS, (iii) multidimensional LC-MS/MS, and (iv) two-dimensional PAGE followed by amino-terminal amino acid sequencing. These methods were found to be complementary to each other, and biases in the data obtained in one method could largely be compensated for by the data obtained in the other methods. Consequently a total of 704 proteins were successfully identified, 134 of which were unique to A. pernix K1, and 19 were not described previously in the genomic annotation. We found that the original annotation of the genomic data of this archaeon was not adequate, in particular with respect to proteins of 10–20 kDa in size, many of which were described as hypothetical. Furthermore the amino-terminal amino acid sequence analysis indicated that, surprisingly, the translation of 52% of their genes starts with TTG in contrast to ATG (28%) and GTG (20%). Thus, A. pernix K1 is the first example of an organism in which TTG is the most predominant translational initiation codon.

There are few existing industrial applications in which either archaeal biomass or archaeal enzymes are used. This is partly due to the lack of data on the expression of individual genes predicted from genome analysis. Such data will be best obtained by proteome analysis.
Aeropyrum pernix K1 is an aerobic hyperthermophilic crenarchaeon isolated in 1993 from a coastal solfataric thermal vent in Kodakara-jima Island of Kagoshima, Japan. It grows optimally at 90–95°C (1). Many of the thermostable enzymes of this archaeon are expected to be useful for a variety of industrial applications. The complete genomic sequence of A. pernix K1 was established in 1999, and ~2,700 ORFs were predicted from the sequence of nearly 1.67 Mb in size. The data were made available to the public through DDBJ/EMBL/GenBank™ as well as the "Database of the Genomes Analyzed at NITE" (DOGAN). About 1,600 of the predicted 2,700 ORFs were hypothetical (2). Moreover the number of predicted ORFs is much larger than those of other Archaea and bacteria with similar genome sizes, casting doubt over the authenticity of the predicted ORFs. Natale et al. (3) reannotated the A. pernix K1 genome using the Clusters of Orthologous Groups of Proteins database and reported the total number of its protein-coding genes to be 1,871. Similarly the current RefSeq contains an annotation reported by Pruitt et al. (4) in which 1,841 proteins were predicted in the A. pernix genome, and Guo et al. (5) re-evaluated the A. pernix K1 annotation and inferred a total of 1,610 ORFs as potential protein-coding genes. The confusion concerning the annotation of the A. pernix K1 genome is one of the factors that might have hindered widespread utilization of A. pernix K1 enzymes, many of which are expected to possess excellent thermostability.
There is an additional problem: from the genomic and proteomic analyses performed to date, ATG is the most common initiation codon, and GTG and TTG are used in less than 10% of bacterial genes (6,7). In contrast, of some 2,700 ORFs predicted in the genome of A. pernix K1, 43% were deduced to be initiated with ATG and 57% were deduced to be initiated with GTG, which differs greatly from other species. Furthermore in A. pernix K1, genes initiated with TTG were reported (8) even though TTG has not been reported as an initiation codon in other organisms.
The problems described above can only be experimentally clarified by performing proteome analysis. For this purpose, we adopted four methods to maximize the number of detected proteins. Consequently we were able to identify 704 proteins, including 19 that were derived from the genomic regions in which no ORFs were predicted previously (2). The results suggest at the same time that the number of predicted ORFs in the current version of DOGAN is largely overestimated due to the inclusion of ORFs for non-conserved hypothetical proteins with molecular mass of 10–20 kDa. Furthermore amino-terminal amino acid sequences of 134 proteins were determined, from which we were able to establish that, surprisingly, TTG is the most predominant initiation codon in A. pernix K1.

Strain and Culture
A. pernix strain K1 deposited at NITE Biological Resource Center (NBRC 100138) was cultured at 90°C in 400 ml of jamarine-yeast extract-trypticase peptone medium (1) for 20 h, cooled down on ice, and harvested by centrifugation at 5,000 × g at 4°C for 10 min. Cellular pellets were resuspended in 3.5% NaCl and recentrifuged.

Protein Preparation
Two-dimensional (2D)-PAGE and 1D-SDS-PAGE-LC-MS/MS-A. pernix K1 cells were suspended in an extraction buffer (67% acetic acid containing 33 mM MgCl2) and disrupted by sonication at 4°C. Cell debris was removed by centrifugation, and 4 volumes of 20 mM DTT in acetone were added to the supernatant. The mixture was stored at −20°C, and the protein precipitates were collected by centrifugation and dried.
MD-LC-MS/MS-A. pernix K1 cells were suspended in distilled water and lysed by homogenization in S-203 (AS ONE, Osaka, Japan) for 30 s on ice.

2D-PAGE
Protein Separation by 2D-PAGE-IEF was performed on either 180-mm IPG strips with the pH range of 3–10 (Amersham Biosciences) or IPG ReadyStrips with the pH range of 3–6 or 5–8 (Bio-Rad). Protein samples were dissolved in a lysis buffer containing 7 M urea, 2 M thiourea, 4% CHAPS, 50 mM DTT, 40 mM Tris, and 0.2% carrier ampholyte and incubated at room temperature for 1 h. The first dimensional separation was performed on an IPGphor IEF apparatus (Amersham Biosciences). IPG strips loaded with 100 μg of protein were electrofocused first at 200 V for 1 h, then at a linear gradient of 200–4,000 V for 6 h, and finally at 8,000 V to achieve a total of 60 kV-h. After IEF, the strips were equilibrated with an equilibration buffer containing 6 M urea, 30% glycerol, 2% SDS, 50 mM Tris-HCl (pH 6.8), and 1% DTT for 30 min. SDS-PAGE was then carried out on 12 or 16% polyacrylamide gels (20 × 20 × 0.1 cm). Proteins were visualized by staining with Coomassie Brilliant Blue R-250 (CBB) (Nacalai Tesque, Kyoto, Japan).

Radical-free and Highly Reducing (RFHR)-2D-PAGE-
The method of Wada (9) was mainly followed. Protein samples were dissolved in a lysis buffer containing 8 M urea and 0.2 M mercaptoethanol and incubated at 40°C for 30 min. Sample charging electrophoresis was carried out with 100 μg of protein on an 8% polyacrylamide gel containing 8 M urea, 40 mM KOH, and 0.37% acetic acid at 100 V for 30 min on an NA1450 apparatus (Nihon Eido, Tokyo, Japan). Subsequently the first dimensional separation was performed on an 8% polyacrylamide gel containing 8 M urea, 400 mM Tris, 500 mM boric acid, and 21.5 mM EDTA-2Na at 100 V for 15 h on an NA1460 apparatus (Nihon Eido). The second dimensional separation was then carried out on an 18% polyacrylamide gel containing 8 M urea, 50 mM KOH, and 5% acetic acid (16 × 16 × 0.2 cm) at 100 V for 30 h. Proteins were visualized with CBB as described above.
Enzymatic Digestion for 2D-PAGE-MALDI-TOF MS-In-gel digestion with modified trypsin (sequencing grade, Promega, Madison, WI) and sample spotting for MALDI-TOF MS were performed with the Investigator ProPrep automatic digestion and spotting system (Genomic Solutions, Huntingdon, UK) according to the manufacturer's protocols with some modifications. The CBB-stained protein spots were excised from the gel and washed with 25 mM NH4HCO3 and acetonitrile at room temperature. The proteins were reduced with 10 mM DTT in 25 mM NH4HCO3 at 60°C for 10 min and alkylated with 40 mM iodoacetamide in 25 mM NH4HCO3 at room temperature for 35 min. The dried gel pieces were rehydrated and incubated in 25 mM NH4HCO3 containing modified trypsin at 37°C for 4 h. Formic acid (3%) was added to stop the enzymatic reaction, and the resultant peptides were concentrated, desalted by passing through a μ-C18 ZipTip (Millipore, Billerica, MA), mixed with a matrix solution of 50% acetonitrile saturated with α-cyano-4-hydroxycinnamic acid (Sigma), and air-dried on the target plate.
Mass Spectrometry of 2D-PAGE-MALDI-TOF MS-The resulting peptide mixture was subjected to analysis on an Auto-Flex instrument (Bruker Daltonics, Bremen, Germany) with α-cyano-4-hydroxycinnamic acid as the matrix and operated in the reflector mode. Calibration was performed in the external mode using a peptide calibration standard kit (Bruker Daltonics). For peptide assignment the mass spectrum data were analyzed using the MASCOT database search program (Matrix Science Ltd., London, UK) in the peptide mass fingerprinting mode against the database of putative proteins of A. pernix K1 containing the data for 2,694 ORFs as well as against the translation of the entire genomic sequence in all phases.
Enzymatic Digestion for 1D-SDS-PAGE-LC-MS/MS-After CBB staining, the gels were sliced into 5-mm-thick pieces from the top band (>116 kDa) to the bottom line (4.4 kDa). In-gel digestion with modified trypsin was performed using Investigator ProPrep for 6 h and stopped with 3% formic acid. The resulting peptide mixtures were eluted from the gel and dried by evaporation. The peptides were diluted with 0.02% formic acid containing 0.005% heptafluorobutyric acid (HFBA) and 2% acetonitrile.
The mass spectrometer was operated in data-dependent MS/MS mode with dynamic exclusion in the m/z range of 450–2,000, and the ions were selected for CID with automatic data-dependent settings. The MS/MS spectra were converted into peak list files with SEQUEST™ Browser (ThermoElectron), which were then searched with the MASCOT database search program in the MS/MS mode against the A. pernix K1 genomic data. The criteria adopted for protein identification were either 1) that at least three peptides with an ion score of 20 or higher matched or 2) that at least one peptide with an ion score of 40 or higher matched.
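The two acceptance criteria stated above can be written as a small filter. The following is a minimal sketch only: the function name and the idea of passing a plain list of MASCOT ion scores per protein are assumptions made for illustration, not part of the original analysis pipeline.

```python
def protein_identified(ion_scores):
    """Apply the two criteria from the text to the ion scores of the
    peptides matched to a single protein.

    Criterion 1: at least three peptides with an ion score of 20 or higher.
    Criterion 2: at least one peptide with an ion score of 40 or higher.
    """
    scores = list(ion_scores)
    three_moderate = sum(score >= 20 for score in scores) >= 3
    one_strong = any(score >= 40 for score in scores)
    return three_moderate or one_strong


# Hypothetical score lists, for illustration only.
print(protein_identified([22, 25, 31]))  # three peptides at or above 20
print(protein_identified([45]))          # one peptide at or above 40
print(protein_identified([21, 39]))      # neither criterion is met
```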

MD-LC-MS/MS Analysis
Protein Separation by MD-LC-Proteins of a whole cell lysate were separated by off-line 2D-LC and 2D-LC-nano-ESI-MS/MS. The first dimensional chromatography was performed with a self-packed strong anion exchange (SAX) column prepared in a glass chromatography tube of 8-mm inner diameter and 100-mm length. Trimethylaminopropyl-bonded silica gel (BONDESIL-SAX, 40 μm, Varian, Palo Alto, CA) was used to fill the column. Buffers used were 20 mM Tris-HCl, pH 7.0 (Buffer C), and 20 mM Tris-HCl, pH 7.0, with 1 M NaCl (Buffer D). Proteins were eluted with Buffer C from 0 to 5 min, with a linear gradient of Buffer C to Buffer D from 6 to 25 min, and with Buffer D from 26 to 32 min. The flow rate was 2 ml/min, and the eluate was collected into eight 8-ml fractions. The fractions were concentrated to 0.5 ml by partial lyophilization (EYELA FD-81, Tokyo Rikakikai, Tokyo, Japan). The second dimensional chromatography was performed with a gel permeation chromatography (GPC) column (Bioassist G2SWXL, TOSOH, Tokyo, Japan). Aliquots (200 μl) of each of the concentrated SAX fractions were successively injected into a column system consisting of a guard column (TOSOH) and two GPC columns. Elution was performed with Buffer E (0.1 M sodium phosphate, pH 7.0) at a flow rate of 0.5 ml/min, and fractions were collected every 3.5 min starting from 18 min until 60 min (12 fractions). In this way a total of 96 fractions (8 × 12) were obtained.
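The fraction counts quoted above follow directly from the flow rate, run time, and collection interval. The short sketch below reproduces that arithmetic; it assumes that the entire 32-min SAX eluate was collected, which the text implies but does not state explicitly, and the function names are illustrative.

```python
def sax_fraction_count(flow_ml_per_min=2.0, run_min=32.0, fraction_ml=8.0):
    """Number of SAX fractions if the whole eluate is collected (assumption)."""
    return int(flow_ml_per_min * run_min / fraction_ml)


def gpc_fraction_count(start_min=18.0, end_min=60.0, interval_min=3.5):
    """Number of GPC fractions collected at a fixed time interval."""
    return int((end_min - start_min) / interval_min)


sax = sax_fraction_count()   # 2 ml/min x 32 min = 64 ml, split into 8-ml fractions
gpc = gpc_fraction_count()   # 42 min of collection in 3.5-min steps
print(sax, gpc, sax * gpc)   # 8 12 96
```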
Enzymatic Digestion for MD-LC-MS/MS-Samples in each fraction were reduced by incubating in 5 mM DTT at 60°C for 30 min, alkylated with 15 mM iodoacetamide in the dark and at room temperature for 30 min, and digested with modified trypsin at 37°C for 1 h. The samples were adjusted to pH 4 with trifluoroacetic acid, desalted with a C18 reverse phase column, and evaporated. The digests were dissolved in a mixture of 0.02% formic acid, 0.005% HFBA, 2% acetonitrile, and 98% water prior to MS analysis.
In-column Enzymatic Digestion-Proteins retained on the SAX column were treated with 0.1% RapiGest (Waters, Milford, MA) in 5 mM DTT and incubated at 60°C for 30 min. The proteins were alkylated and digested with modified trypsin at 37°C for 15 h. The digests were eluted with a mixture of 0.1% trifluoroacetic acid, 5% methanol, and 94.9% water; desalted with a C18 reverse phase column; and evaporated. The digests were dissolved in 0.02% formic acid with 0.005% HFBA, 2% acetonitrile, and 98% water prior to MS analysis.

Amino-terminal Amino Acid Sequence Analysis
Protein spots on 2D-PAGE were electroblotted onto a PVDF membrane (Sequi-Blot PVDF membrane, Bio-Rad) with a semidry blotting apparatus (Bio Craft, Tokyo, Japan). The blotted membrane was stained with CBB. Singly stained spots were excised from the PVDF membrane and applied to a protein sequencer (model Procise 491cLC, Applied Biosystems, Foster City, CA) if the staining intensity appeared to be strong enough for sequencing. For weakly stained spots, two to six excised spots were combined by repeating 2D-PAGE and then applied to the protein sequencer. Edman reactions were performed according to the manufacturer's instructions. To identify each protein, the amino acid sequences obtained were compared with the predicted amino acid sequence data translated from the genomic sequence of A. pernix K1.
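Comparing an Edman-derived sequence with the translated genome amounts to a search across all six reading frames. The sketch below illustrates this under simplifying assumptions: it uses the standard genetic code, exact string matching, and a made-up toy sequence; the function names are hypothetical and this is not the software used in the study. Note that an initiator GTG or TTG appears as V or L in such a translation even though it encodes Met at the start of a protein.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built in TCAG order (third base varies fastest).
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna, frame):
    """Translate one reading frame; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def find_peptide(genome, peptide):
    """Return (strand, frame, nucleotide offset on that strand) for each hit."""
    hits = []
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            protein = translate(seq, frame)
            pos = protein.find(peptide)
            while pos != -1:
                hits.append((strand, frame, frame + 3 * pos))
                pos = protein.find(peptide, pos + 1)
    return hits


# Toy sequence (not from A. pernix): TTG GCT AAA GGT TAA preceded by "CC".
print(find_peptide("CCTTGGCTAAAGGTTAA", "AKG"))
```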

Miscellaneous
The genomic sequence data of A. pernix K1 along with the data for 2,694 annotated ORFs were downloaded from DOGAN (www.bio.nite.go.jp/dogan/Top). Additional A. pernix K1 data for 1,610 annotated ORFs were downloaded from the home page of Tianjin University BioInformatics Centre (TUBIC) (tubic.tju.edu.cn/Aper/).

FIG. 1. A typical 2D-PAGE pattern. 100 μg of proteins were loaded and separated either by IEF or on a 180-mm-wide 3–10 non-linear immobilized pH gradient strip in the first dimension (horizontal) that was subsequently placed onto a polyacrylamide gel and electrophoresed vertically in the second dimension. The example shown indicates that the first dimensional separation was carried out by IEF with the pI range of 3–10, and the second dimensional separation was by 12% SDS-PAGE in the molecular mass range shown at right. The identified protein spots are indicated by red plus (+) signs.

TABLE I Proteins whose amino-terminal amino acid sequence was determined
Difference column shows the difference between the previously assigned start position (shown in DOGAN) and the newly identified start position. A blank indicates that no difference was detected, and N.D. indicates that the start position could not be assigned. A dash (-) indicates no corresponding data because it is a newly identified ORF that was not predicted in DOGAN. ORFs newly identified were named by adding "a" to the upstream ORF names of Ref. 2.
Columns: ORF, Direction, Start, Stop, Difference, Initiation codon

2D-PAGE Followed by TOF MS and Amino-terminal Analysis-About 500 protein spots were well separated from others by 2D-PAGE with IEF in the first dimension (Fig. 1). Also about 70 basic protein spots, many of which were ribosomal proteins, were observed on RFHR gels. They were individually cut out and digested with trypsin as described under "Experimental Procedures," and the resultant peptide mixtures were analyzed by MALDI-TOF MS in the m/z range of 400–3,000. The mass spectra obtained were then examined by using the MASCOT database search program. In addition, the protein spots were transferred to a PVDF membrane and subjected to amino-terminal amino acid sequencing. Consequently a total of 300 proteins were identified (Supplemental Table 1). Of them, 187 (62.3%) corresponded to those annotated in other organisms, 80 (26.7%) corresponded to conserved hypothetical proteins, and 33 (11.0%) corresponded to non-conserved hypothetical proteins. In addition, six of the identified proteins matched the genomic regions of A. pernix K1 in which no ORF had been assigned previously. Of these 300 proteins, 134 proteins were found to have unblocked amino termini, and their 6–21 amino-terminal amino acid residues were successfully sequenced, enabling determination of the corresponding genomic regions encoding them. These included the six proteins mentioned above that were derived from the regions without assigned ORFs (Table I). Of the 134 proteins, 73 (54.5%) corresponded to annotated, 45 (33.6%) corresponded to conserved hypothetical, and 16 (11.9%) corresponded to non-conserved hypothetical proteins.
1D-SDS-PAGE-LC-MS/MS Analysis-In this method, proteins prepared from A. pernix K1 cells were first resolved by SDS-PAGE, and the gel was sliced and analyzed as described under "Experimental Procedures" (Fig. 2). A total of 630 proteins were identified accordingly (Supplemental Table 1). Of them, 357 (56.7%) corresponded to annotated, 161 (25.6%) corresponded to conserved hypothetical, and 112 (17.8%) corresponded to non-conserved hypothetical proteins. Also 14 proteins matched the genomic regions without previously assigned ORFs.
MD-LC-MS/MS Analysis-In this method, the whole cell lysate was applied to a SAX column for the first dimensional separation and then to a GPC column for the second dimensional separation. In the first dimension, proteins were separated according to their pI values into eight fractions, whereas in the second dimension, proteins in each of the SAX fractions were separated by their molecular mass into 12 fractions. As a consequence, a total of 96 fractions were obtained as exemplified in Fig. 3. Proteins in the resultant fractions were then digested with trypsin and analyzed by 2D-LC-MS/MS. In addition, proteins retained on the SAX column were treated similarly. A total of 404 proteins were thus identified (Supplemental Table 1), 235 (58.2%) of which corresponded to annotated, 94 (23.3%) of which corresponded to conserved hypothetical, and 75 (18.6%) of which corresponded to non-conserved hypothetical proteins. Ten proteins corresponded to the genomic regions without previously assigned ORFs.
By combining the results obtained with the four methods mentioned above, a total of 704 proteins were successfully identified in A. pernix K1. The proteins identified by each method are listed in Supplemental Table 1. Of them, 382 (54.3%) were found to correspond to proteins annotated in other organisms, 188 (26.7%) were found to correspond to conserved hypothetical proteins, and 134 (19.0%) were found to correspond to non-conserved hypothetical proteins. Also 19 proteins were found to correspond to the genomic regions in which no ORFs were previously assigned. Of the 300, 630, and 404 proteins identified, respectively, by 2D-PAGE, 1D-SDS-PAGE-LC-MS/MS, and MD-LC-MS/MS, 204 proteins were common to all methods (Fig. 4). Conversely, proteins uniquely identified by each method were also recognized, as shown. Typical images of 2D- and 1D-PAGE as well as multidimensional chromatograms along with a list of all proteins identified in these studies will be made available on the DOGAN web site (www.bio.nite.go.jp/dogan/Top).
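The way the per-method lists are merged, and how the class percentages quoted above arise, can be illustrated with simple set operations. This is only a sketch: the ORF identifiers in the toy data are invented placeholders, and only the three class counts (382, 188, and 134) are taken from the text.

```python
def summarize_methods(methods):
    """methods maps a method name to the set of ORF identifiers it detected.

    Returns the union of all sets, the ORFs common to every method, and
    the ORFs detected by one method only.
    """
    all_sets = list(methods.values())
    union = set().union(*all_sets)
    common = set.intersection(*all_sets)
    unique = {}
    for name, ids in methods.items():
        others = set().union(*(v for k, v in methods.items() if k != name))
        unique[name] = ids - others
    return union, common, unique


def percentages(counts):
    """Express each count as a percentage of the total, to one decimal."""
    total = sum(counts.values())
    return {key: round(100 * n / total, 1) for key, n in counts.items()}


# Placeholder identifiers, for illustration only.
toy = {
    "2D-PAGE": {"orf1", "orf2", "orf3"},
    "1D-SDS-PAGE-LC-MS/MS": {"orf2", "orf3", "orf4", "orf5"},
    "MD-LC-MS/MS": {"orf3", "orf5", "orf6"},
}
union, common, unique = summarize_methods(toy)
print(len(union), common, unique)

# Class counts reported in the text for the 704 identified proteins.
print(percentages({"annotated": 382,
                   "conserved hypothetical": 188,
                   "non-conserved hypothetical": 134}))
```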
Several features of the identified proteins such as molecular mass, pI, hydropathy, protein class, and codon usage were then compared with those of the ORFs predicted in the genome of A. pernix K1. If the statistical distribution of these values is not similar between the observed and predicted proteins, then the annotation of the genomic data needs to be appropriately corrected.
Molecular Mass Distribution-The molecular mass distributions of proteins predicted from the genomic sequence of A. pernix K1 and of those experimentally identified by 2D-PAGE, 1D-SDS-PAGE-LC-MS/MS, and MD-LC-MS/MS were compared in groups of 10 kDa up to and higher than 120 kDa (Fig. 5A). Although 49.4% of the proteins were predicted in the molecular mass range of 10–20 kDa in the genome analysis, only 22.9% were actually observed in the same range. Consequently the number of predicted proteins in this molecular mass range in the current version of DOGAN appears to be overrepresented. The average molecular mass of the proteins identified by 2D-PAGE was 32.4 kDa, whereas it was 34.8 kDa by 1D-SDS-PAGE-LC-MS/MS or MD-LC-MS/MS. With the latter two methods, it is possible to identify proteins with a larger molecular mass, whereas such is not the case with 2D-PAGE as the separation of larger proteins becomes poorer. In any event, it appears that about half of the proteins predicted from the genomic data in the 10–20-kDa range are incorrectly assigned.
This could be due in part to incorrect prediction of ORFs in the genome analysis of A. pernix K1 because of the poorer quality computer software used at that time to assign ORFs. The percentages of ORFs identified by our proteome analysis were 35, 47, 51, 57, and 55%, respectively, in the molecular mass ranges of 20–40, 40–60, 60–80, 80–100, and >100 kDa. Therefore, proteins of larger sizes could in general be more frequently identified in our analysis. Of the proteins of 80 kDa or higher, 14% of those identified by our analysis were predicted to possess a transmembrane domain, whereas the value was much higher for those that could not be identified in our analysis, 61% of which were predicted to possess a transmembrane domain. Likewise of the proteins smaller than 80 kDa, 9% of identified and 32% of unidentified were predicted to possess a transmembrane domain. From these data, it appears that many of the proteins not identified in our proteome analysis, in particular those with higher molecular mass, are likely to be membrane proteins.
The five largest protein-coding ORFs of A. pernix K1 are APE0620, APE0609, APE0057, APE1340, and APE1213. Of them, the products of APE1340 and APE0609 were identified in our proteome analysis. The former has homology to the reverse gyrase of Pyrococcus furiosus that was recently experimentally proven to be necessary for the growth of this organism at high temperature (11). The APE0609 protein is similar to a surface layer protein of Staphylothermus marinus. The surface layer protein of S. marinus forms a complex with a protease that is likely to play a role in taking up external peptides and proteins. A protein similar to the protease of this complex is encoded by APE0607 that was identified in our analysis. This protein is likely to have a similar function in A. pernix.
FIG. 2. A typical 1D-SDS electropherogram. 10 μg of proteins were loaded and separated on a 10% polyacrylamide gel, which was then sliced into 5-mm-thick pieces as schematically shown at right with their median molecular mass values calculated from the molecular markers. Subsequently proteins were extracted and analyzed.
The remaining proteins were not identified in our analysis most likely because they are membrane proteins as they appear to possess a transmembrane domain. APE1213 is a paralogue of APE0607, but its expression might be different from the latter. The function of APE0620 and APE0057 remains to be investigated.
Isoelectric Point Distribution-The pI values of the identified and predicted proteins were compared with each other in the pI range from 3 to 13 (Fig. 5B). The average pI values of identified proteins were 7.25 (2D-PAGE), 7.55 (1D-SDS-PAGE-LC-MS/MS), and 7.70 (MD-LC-MS/MS), whereas the value for the predicted proteins was calculated to be 8.68. In the pI range between 5 and 7, the proteins predicted from the genomic data are much fewer than those identified by proteome analysis, whereas proteins in the high pI range (>10) show an opposite distribution pattern. Also proteins identified by 2D-PAGE were much more likely to be distributed in the pI range of 5–7 than those identified in other methods, although the reason for this is not clear.
Hydropathy Distribution-The GRAVY score indicates the hydrophilicity or hydrophobicity of a protein (12); it is calculated as the arithmetic mean of the hydropathy indices of all amino acids in a protein. About 70% of the predicted proteins were concentrated in the neutral range (−0.4 to 0.4). On the other hand, 85% of experimentally identified proteins were found to be distributed in the same range regardless of the identification methods used. The results indicated, therefore, that a large portion of proteins of A. pernix are in the neutral GRAVY score range (Fig. 5C). The averages of the GRAVY score for identified proteins were −0.15 ... (Fig. 5D) and compared. More experimentally identified proteins were found to be categorized in "metabolism" and "genetic information processing" than those predicted from the genome analysis, whereas a distinctly large proportion of predicted proteins were categorized in "non-conserved hypothetical proteins." With respect to the distribution pattern of proteins in the six protein classes, differences among the methods used were marginal.
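The GRAVY score described in the hydropathy paragraph above is straightforward to compute. The sketch below uses the Kyte-Doolittle hydropathy scale, on the assumption that this is the scale meant by Ref. 12; the function names and the handling of the neutral range are illustrative choices rather than the authors' implementation.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Arithmetic mean of the hydropathy indices of all residues."""
    if not sequence:
        raise ValueError("empty sequence")
    return sum(KYTE_DOOLITTLE[residue] for residue in sequence) / len(sequence)


def in_neutral_range(score, low=-0.4, high=0.4):
    """True if the score falls in the neutral range used in the text."""
    return low <= score <= high


# Made-up peptide, for illustration only.
score = gravy("MKTAYIAKQR")
print(round(score, 2), in_neutral_range(score))
```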
Codon Usage Pattern-It is known that a characteristic bias in codon usage exists in each species of organisms (13). To examine whether and to what extent differences in codon usages exist between the experimentally identified ORFs and the ORFs predicted from the A. pernix K1 genomic sequence, codon usages in individual ORFs were plotted against the categories of proteins described above, namely molecular mass, pI, hydropathy, and protein class.
An example is shown in Fig. 6: in A, the codon usage patterns of proteins categorized by their molecular mass are shown, and in B, similar patterns of proteins categorized by their protein class are shown. As described above, a large proportion of predicted proteins were classified in the molecular mass range of 10–20 kDa. Indeed many of them were found to deviate from the average use of TCC, whereas predicted proteins larger than 40 kDa appear to match well with those of experimentally confirmed proteins. Therefore, it seems that the usage patterns of various codons will serve as good tools to evaluate whether a particular ORF predicted from the genomic sequence is likely to be a true gene or not. Indeed this is one of the bases on which algorithms for the prediction of genes/ORFs in the genomic sequence data rely. A similar analysis was performed with respect to protein classes as shown in Fig. 6B. The patterns of experimentally identified versus predicted ORFs were found to be quite different when proteins categorized as "non-conserved hypothetical proteins" were analyzed. Interestingly such a clear difference shown in Fig. 6, A and B, was not observed when a similar analysis was performed with proteins categorized by their pI and hydropathy values.
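A per-ORF codon usage value such as the TCC usage discussed above can be obtained by counting in-frame codons. The sketch below assumes that usage is expressed as the fraction of all codons in an ORF; the text does not specify the exact normalization used for Fig. 6, so this definition, the function names, and the toy sequences are assumptions.

```python
from collections import Counter


def codon_usage(orf):
    """Fraction of each codon among all complete in-frame codons of an ORF."""
    usable = len(orf) - len(orf) % 3
    codons = [orf[i:i + 3] for i in range(0, usable, 3)]
    if not codons:
        raise ValueError("ORF shorter than one codon")
    counts = Counter(codons)
    return {codon: n / len(codons) for codon, n in counts.items()}


def usage_deviation(orf, reference_usage, codon="TCC"):
    """Difference between an ORF's usage of one codon and a reference value."""
    return codon_usage(orf).get(codon, 0.0) - reference_usage.get(codon, 0.0)


# Toy sequences, for illustration only.
reference = codon_usage("ATGTCCGCTTCCAAATAA")
print(usage_deviation("TTGGCTGCTAAATAA", reference))
```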
Complementarity of the Methods Used-For the proteome analysis of A. pernix K1, we adopted the high resolving power of 2D-PAGE (14) including RFHR-2D-PAGE (9) and combined it with MALDI-TOF MS for "peptide mass fingerprinting." However, 2D-PAGE-MALDI-TOF MS has limitations in the detection of less abundant or hydrophobic proteins as well as proteins of extremely large sizes. Introduction of improved chaotropes and development of novel zwitterionic detergents (15) were found to improve the situation to some extent.
An alternative approach was to omit the first dimensional separation and apply an enriched membrane protein fraction directly to 1D-SDS-PAGE-LC-MS/MS. Most of the bands on the 1D-SDS-PAGE gel consisted of multiple proteins, but the ability of HPLC in connection with ESI tandem mass spectrometry is powerful enough to analyze a mixture of derived peptides so that conventional tryptic digestion of proteins followed by mass spectrometric analysis led to the identification of each protein in the mixture. This method is called the "shotgun method" (16), and it gives a considerable advantage in the characterization of membrane proteins because separation was targeted at the peptides rather than proteins so that solubility problems that are often encountered with hydrophobic proteins could largely be alleviated. A disadvantage of the shotgun method is that intensive computational analysis of the entire data set is always required, and no information regarding the charge of the intact protein could be obtained.
MD-LC-MS/MS is a third alternative method we adopted in which proteins were separated by MD-LC (10,17,18), digested with a specific enzyme, and ionized with ESI, and then their mass spectra were measured. In regular MD-LC systems proteins and peptides are separated according to a variety of their properties, such as pI, relative molecular mass, and hydrophobicity (17). However, a disadvantage of the MD-LC-MS/MS method is that not all of the peptide fragments could be detected, and their quantity is low. Also the sensitivity of detection will progressively decrease as the number of fractions increases.
Comparison with the Results Obtained by Other Researchers-Guo et al. (5) examined the genomic data of A. pernix K1 and reported that 1,610 ORFs can be recognized as such (tubic.tju.edu.cn/Aper/). Therefore, their data were compared with the proteins experimentally identified. Of the 704 identified proteins, 692 were included as ORFs predicted by Guo et al. (5), but the remaining 12 were not. The molecular mass distribution of the proteins derived from TUBIC ORFs is very similar to the identified proteins as shown in Fig. 5A. However, their other characteristics slightly but significantly deviate from those of the proteins we experimentally characterized (Fig. 5, B-D).
Assignment of the Codons for Translation Initiation-Of the 134 proteins whose amino-terminal sequences were experimentally determined, 50 were found to possess Met at their amino terminus. By comparing the nucleotide sequences corresponding to the amino-terminal Met, seven of them were found to possess ATG, and 14 others contained GTG. In addition, to our surprise, 29 others were found to contain TTG at the position of the amino-terminal Met. Subsequently we looked for candidate initiation codons based on the amino-terminal amino acid sequence data of the remaining 84 proteins that did not possess Met at their amino terminus. With 80 of them, a putative initiation codon was found immediately upstream, i.e. 39 of them were with TTG, 29 were with ATG, and 12 were with GTG. Because A. pernix K1 possesses an ORF encoding a protein homologous to methionine aminopeptidase, we interpreted the results to indicate that the amino-terminal Met of these 80 proteins was removed post-translationally by the putative methionine aminopeptidase.
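The assignment logic described above can be summarized as follows: if the sequenced amino terminus begins with Met, the codon at that position is examined; otherwise the codon immediately upstream is examined, on the interpretation that the initiator Met was removed. The sketch below is an illustrative reading of that procedure; the function name, the coordinate convention (offset on the coding strand), and the toy sequences are assumptions.

```python
START_CODONS = ("ATG", "GTG", "TTG")


def assign_initiation_codon(coding_strand, first_codon_offset, first_residue):
    """Return the candidate initiation codon, or None if none is found.

    coding_strand      -- nucleotide sequence read 5' to 3' on the coding strand
    first_codon_offset -- offset of the codon encoding the first sequenced residue
    first_residue      -- first residue observed by Edman sequencing
    """
    if first_residue == "M":
        candidate_offset = first_codon_offset        # Met retained
    else:
        candidate_offset = first_codon_offset - 3    # Met presumably removed
    if candidate_offset < 0:
        return None
    codon = coding_strand[candidate_offset:candidate_offset + 3]
    return codon if codon in START_CODONS else None


# Toy examples, not taken from the A. pernix genome.
print(assign_initiation_codon("CCTTGGCTAAAGGT", 5, "A"))  # codon upstream of GCT
print(assign_initiation_codon("GGGTGAAAGCT", 2, "M"))     # Met encoded by GTG
print(assign_initiation_codon("CCATAGCTAAA", 5, "A"))     # ATA is not accepted
```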
With the remaining four proteins, however, candidate initiation codons were not found immediately upstream. Of these, APE0079 has homology to S-adenosylmethionine decarboxylase proenzyme 2 that is known to be post-translationally processed into an α and a β chain. Similarly APE0521 has homology to a protease subunit of the proteasome of Methanococcus jannaschii, the amino-terminal region of which is likely to be processed. APE2072 has homology to the thermosome β subunit of Thermoplasma acidophilum. By a "shotgun" mass spectrometry analysis, a peptide containing the detected amino terminus of APE2072 as well as 13 others matching the upstream region were detected (data not shown). Therefore, the detected amino terminus of APE2072 is likely to be that of a processed protein, although the nature of the processing remains to be clarified further. The genomic nucleotide sequence present immediately upstream of APE2493 is ATA. However, because ATA is not likely to serve as a translational initiator, it may be that the amino-terminal amino acid sequence corresponding to APE2493 was similarly processed, although the processing has not been elucidated yet.
To summarize the data for translational initiation codons corresponding to the 130 sequences other than the four proteins mentioned above, TTG was found to be most frequent (52%), whereas ATG and GTG, respectively, were found in 28 and 20% of the cases. Of the 130 ORFs, six proteins were found to be derived from the region in which no ORFs were previously assigned. Of the remaining 124 ORFs, the initiation codons of 89 (72%) were different from the positions that were assigned previously (2).
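The percentages summarized above can be reproduced from the counts given in the preceding paragraphs (proteins retaining Met plus proteins whose Met was presumably removed). The following is a short arithmetic check; the variable names are illustrative.

```python
# Counts reported in the text for proteins with an amino-terminal Met.
with_met = {"ATG": 7, "GTG": 14, "TTG": 29}
# Counts reported for proteins whose initiation codon lies immediately upstream.
met_removed = {"ATG": 29, "GTG": 12, "TTG": 39}

totals = {codon: with_met[codon] + met_removed[codon] for codon in with_met}
n = sum(totals.values())
shares = {codon: round(100 * count / n) for codon, count in totals.items()}

print(n)       # 130
print(totals)  # {'ATG': 36, 'GTG': 26, 'TTG': 68}
print(shares)  # {'ATG': 28, 'GTG': 20, 'TTG': 52}

# Share of previously annotated ORFs whose start position was revised.
revised_share = round(100 * 89 / 124)
print(revised_share)  # 72
```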
Characteristics of the Region Upstream of the Putative Initiation Codons-The mechanism of transcriptional initiation in Archaea has been speculated to be more closely related to that of eukaryotes (19,20). However, three groups of transcription-associated proteins have been identified in Archaea: one group more similar to prokaryotes, another group more similar to eukaryotes, and a third group more similar to both prokaryotes and eukaryotes. Several homologues of bacterial transcriptional factors (21-24) have been identified in Archaea, and Tolstrup et al. (25) have shown that the translation process of internal genes of operons in Archaea was similar to that in bacteria.
To characterize the genomic regions likely to function in translational initiation in A. pernix K1, the nucleotide frequencies of the sequences surrounding the ORFs for the 130 proteins mentioned above were analyzed according to the method of Xiu-Feng et al. (26). The ORFs were categorized into two groups: in Group 1 the ORFs in question are not well separated from their immediate upstream neighbor, whereas in Group 2 they are more than 50 bp away from each other. In the region preceding the 130 ORFs, there is a G box at position −10 upstream of the initiation codon with a typical sequence of GGTG regardless of the ORF category, whereas ORFs of Group 1 harbor in addition an AT box at position −42 upstream of the initiation codon and a weak C box at position −35 (Fig. 7, A and B).
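A per-position nucleotide frequency profile of the kind plotted in Fig. 7 can be sketched as follows. This is a generic illustration and not the method of Ref. 26; the function name `upstream_nucleotide_frequencies` is hypothetical, and the sketch assumes that every input sequence has the same length and ends immediately before the initiation codon.

```python
from collections import Counter


def upstream_nucleotide_frequencies(upstream_seqs):
    """Return {position: {nucleotide: frequency}} for aligned upstream regions.

    Positions are numbered relative to the initiation codon, so the last base
    of each sequence is position -1 and the first is -len(sequence).
    """
    if not upstream_seqs:
        return {}
    length = len(upstream_seqs[0])
    if any(len(seq) != length for seq in upstream_seqs):
        raise ValueError("all upstream sequences must have the same length")

    n_seqs = len(upstream_seqs)
    freqs = {}
    for i in range(length):
        # Count the bases observed at column i across all sequences.
        counts = Counter(seq[i].upper() for seq in upstream_seqs)
        position = i - length  # runs from -length up to -1
        freqs[position] = {nt: counts.get(nt, 0) / n_seqs for nt in "ACGT"}
    return freqs
```

A column in which G dominates over a few consecutive positions would appear as a G box in such a profile; computing the profile separately for Group 1 and Group 2 ORFs gives the two panels that are compared in the text.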
Mechanisms of Translational Initiation-In view of the finding mentioned above, all the identified ORFs of A. pernix were reassigned by taking the presence of three initiation codons and of a G box into consideration. Consequently TTG was found to be the most predominant initiation codon (38% of all ORFs) followed by ATG (33%) and GTG (29%). After the reassignment, the frequency of occurrence of each nucleotide in the upstream region was plotted. There is a distinct G box at the region surrounding the −10 position upstream of the initiation codon harboring GGTG as a typical sequence as in the case of the 130 genes for the experimentally characterized proteins mentioned above. In addition, Group 1 ORFs possess a weak AT box at the −42 position and a weak C box at the −35 position, whereas the weak C boxes were not so clear in Group 2 ORFs (data not shown).

FIG. 7. Frequency of occurrence of each nucleotide in the region preceding the initiation codon. A, frequency of each of the four nucleotides occurring in Group 1 ORFs whose protein products were identified by amino-terminal amino acid analysis was calculated as shown. B, the value in Group 2 ORFs identified by amino-terminal amino acid analysis was similarly calculated.
For the experimental identification of translational initiation codons, a large scale amino-terminal sequencing of Synechocystis sp. strain PCC6803 was performed by Sazuka and Ohara (27,28) in which amino-terminal sequences of 234 protein spots were analyzed. The initiation codons in Synechocystis sp. were thus identified, suggesting that ATG was most predominant (88%) followed by GTG (7%) and TTG (3%).

It has been reported that TTG is the most plausible initiation codon for many mitochondrial protein genes in two nematodes, Ascaris suum and Caenorhabditis elegans (30). In addition, ACG was reported as an initiation codon in two eukaryotic viral genes (31,32). In E. coli, initiation at ATG was more efficient than at GTG or TTG (33). It has been generally believed that ATG is the most predominant and efficient translational initiation codon in many other organisms as well. The results shown here, however, are different: TTG is most predominant in A. pernix K1, although it is not clear whether it is also the most efficient initiation codon in this organism.

Archaea are known to possess a eukaryote-like positive regulator in the transcription apparatus that consists of a cognate bacterial-type regulator facilitating recruitment of the TATA-binding protein for transcriptional activation (34). Furthermore two types of translational initiation mechanisms have been reported in Archaea, namely leadered and leaderless translation. The former has been shown to occur in internal genes (25), which possess a G-rich region in their 5′ flanking region that is likely to play a role in ribosomal binding, whereas the latter involves scanning for the first initiation codon along the transcripts (35). N-Formylmethionyl tRNA, deformylase, and methionyl aminopeptidase have been proven to play roles in translational initiation in E. coli (36-38), in particular in combination with ATG. A. pernix K1 possesses no homologues for N-formylmethionyl tRNA and deformylase, and the Met-tRNA genes are present in triplicate, although there is no sequence similarity between them, and two of them have an intron (2). These features might be related, at least to some extent, to the less frequent translational initiation at ATG in A. pernix K1.

* The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
S The on-line version of this article (available at http://www.mcponline.org) contains supplemental material.