Proteogenomic Analysis of Mycobacterium tuberculosis By High Resolution Mass Spectrometry*

The genome sequencing of H37Rv strain of Mycobacterium tuberculosis was completed in 1998 followed by the whole genome sequencing of a clinical isolate, CDC1551 in 2002. Since then, the genomic sequences of a number of other strains have become available making it one of the better studied pathogenic bacterial species at the genomic level. However, annotation of its genome remains challenging because of high GC content and dissimilarity to other model prokaryotes. To this end, we carried out an in-depth proteogenomic analysis of the M. tuberculosis H37Rv strain using Fourier transform mass spectrometry with high resolution at both MS and tandem MS levels. In all, we identified 3176 proteins from Mycobacterium tuberculosis representing ∼80% of its total predicted gene count. In addition to protein database search, we carried out a genome database search, which led to identification of ∼250 novel peptides. Based on these novel genome search-specific peptides, we discovered 41 novel protein coding genes in the H37Rv genome. Using peptide evidence and alternative gene prediction tools, we also corrected 79 gene models. Finally, mass spectrometric data from N terminus-derived peptides confirmed 727 existing annotations for translational start sites while correcting those for 33 proteins. We report creation of a high confidence set of protein coding regions in Mycobacterium tuberculosis genome obtained by high resolution tandem mass-spectrometry at both precursor and fragment detection steps for the first time. This proteogenomic approach should be generally applicable to other organisms whose genomes have already been sequenced for obtaining a more accurate catalogue of protein-coding genes.

spectrometric data from N terminus-derived peptides confirmed 727 existing annotations for translational start sites while correcting those for 33 proteins. We report creation of a high confidence set of protein coding regions in Mycobacterium tuberculosis genome obtained by high resolution tandem mass-spectrometry at both precursor and fragment detection steps for the first time. This proteogenomic approach should be generally applicable to other organisms whose genomes have already been sequenced for obtaining a more accurate catalogue of protein-coding genes. Molecular & Cellular Proteomics 10 Mycobacterium tuberculosis continues to be a significant health burden, especially in the developing countries. Emergence of drug-resistant strains and a higher incidence of tuberculosis in people with HIV/AIDS have further worsened the situation. In the past, researchers have used proteomics for investigating the biology of this pathogen (1)(2)(3)(4)(5)(6)(7). There are a number of published studies pertaining to annotation of the Mycobacterium tuberculosis genome. The whole genome sequence of M. tuberculosis first became available for H37Rv strain in 1998, which was followed by that of CDC1551 and several other strains (8,9). Accurate annotation of protein coding genes from any genome is a continuously evolving process. This is highly evident in the case of M. tuberculosis. Cole and colleagues reported the presence of 3924 open reading frames (ORFs) in H37Rv genome (8). In a re-annotation effort by the same authors, the gene number was revised to 3995 (10). As of March 2011, the TubercuList database contains 4012 annotated protein coding genes in the M. tuberculosis genome (11). de Souza et al. have carried out a comparison of two different gene annotation sets for M. tuberculosis H37Rv strain (Sanger and TIGR annotations) and reported that ϳ50% of the genes have different translation start sites (12). In the same study, using proteomic data for 449 culture filtrate proteins, the authors were able to correct annotations of 24 genes. Finally, the possibility of existence of many CDSs, which are not yet annotated in H37Rv genome, has also been suggested (13).
A direct evidence of translational potential of a genomic region can be obtained from peptide data from mass spectrometry-based proteomics (14 -16). Other information such as N-terminal acetylation of peptides can be used for translational start site assignment. That the annotations in H37Rv genome are still not final is indicated by a recent analysis by de Souza et al., where they used clustered database of annotated CDSs and flanking regions from five M. tuberculosis strains and three M. bovis strains to search mass spectrometric data to identify missing proteins from H37Rv genome. These investigators found peptide evidence for 24 genes incorrectly annotated in H37Rv genome (17). In the present study, we have carried out an in-depth proteomic analysis of M. tuberculosis using high resolution Fourier transform mass spectrometry. Cell lysates and culture filtrates were fractionated using various methodologies followed by tandem MS (MS/MS) 1 analysis on an LTQ-Orbitrap Velos ETD mass spectrometer. The mass spectrometry-derived data were analyzed using a six-frame translation of genome sequences in addition to searches of a protein database of M. tuberculosis H37Rv. We used two gene prediction programs (FgeneSB and GeneMark) to obtain alternative gene models (18,19). In addition to gene predictions, we also used a comparative proteomic approach to validate alternative gene structures. From this analysis, 3176 proteins were identified representing ϳ80% of the total proteome of M. tuberculosis. A total of ϳ250 peptides that did not match existing annotations were identified. On the basis of these high confidence peptide identifications, we were able to delineate 41 novel genes in the H37Rv genome and correct 79 gene models. We were also able to identify alternative translational start sites for 33 proteins in addition to confirming translational start sites of 727 proteins using N terminus-derived peptides.

EXPERIMENTAL PROCEDURES
Culturing and Protein Extraction from Mycobacterium tuberculosis-M. tuberculosis H37Rv strain was grown in Middlebrook 7H9 media with OADC supplement. Colonies from Lowenstein-Jensen media slants were used to inoculate 1 liter of Middlebrook 7H9 media. Cultures were grown at 37°C in a stationary condition for 5 weeks. At the end of 5 weeks, the cells were pelleted by spinning after washing 3 times using chilled phosphate buffer saline. Cell lysis was carried out by bead beating (0.1 mm zirconia beads) in presence of lysis buffer. Three percent SDS was used as lysis buffer for SDS-PAGE fractionation and 9 M urea was used as lysis buffer for in-solution digestion. For preparation of culture filtrate proteins, the M. tuberculosis H37Rv strain was grown in Proskauer-Beck media at 37°C for 6 weeks. Proskauer-Beck media was chosen as it does not have added protein supplements. The cells were removed by filtering through 0.22-m membrane filter. Filtrate was concentrated using 3 kDa cutoff filters (Millipore Corporation, Billerica, MA).
Trypsin Digestion and Fractionation-To obtain maximum proteome coverage, cell lysates were fractionated using three different methods. Protein level fractionation was carried out by SDS-PAGE. Two hundred micrograms of protein was loaded onto a 10% SDS-PAGE gel and stained using colloidal Coomassie blue stain. After removing excess stain, the lane was cut into 35 bands and subjected to in-gel tryptic digestion as described previously (20). In-gel reduction was carried out using 10 mM dithiothreitol followed by alkylation using 20 mM iodoacetamide. In-gel digestion was carried out using trypsin (Promega, Madison, WI) at 37°C for 12 h. Peptides were extracted from the gel and dried using vacuum drying process as explained earlier (21).
Peptide level fractionation was carried out using strong cation exchange chromatography (SCX) and isoelectric focusing (offgel). In-solution trypsin digestion was carried out for ϳ1 mg of cell lysate protein following reduction and alkylation. Digestion was carried out at 37°C for 12 h using trypsin (Promega) at the concentration of 1:20. Peptides were cleaned using C 18 Sep-Pak columns. Half the amount of the peptides was used for SCX fractionation on polysulfoethyl A column (PolyLC, 200 ϫ 2.1 mm; 5 m, 200A) and the other half was used for fractionation by IEF using Agilent's 3100 OFFGEL fractionator. Twenty-four fractions were collected by offgel fractionation method over the pH range of 3 to 12. Two hundred micrograms of culture filtrate protein was fractionated by 10% SDS-PAGE and 36 bands were cut and in-gel digestion was done as described above for cell lysates.
LC-MS/MS Analysis-We carried out total of 123 (LC)-MS/MS runs (Cell lysate in-gel, 32; SCX fractions, 31; offgel fractions, 24; and culture filtrate ingel fractions 36). All of the mass spectrometry analyses were carried out on an LTQ-Orbitrap Velos ETD mass spectrometer (Thermo Scientific) interfaced with an Agilent 1200 series highperformance liquid chromatography system. The peptides from each fraction were analyzed using reversed phase nano scale liquid chromatography coupled to tandem mass spectrometry. The reversed phase nano scale liquid chromatography system consisted of a desalting column (75 m ϫ 2 cm, C 18 material 5-10 m, 120Å) and an analytical column (75 m ϫ 10 cm, C 18 material 5 m, 120Å) with an electrospray emitter tip 8 m (New Objective Woburn, MA, USA) maintained at 2.0 kV ion spray voltage. The mass spectrometry analysis on the LTQ-Orbitrap Velos was carried out in a data-dependent manner with survey scan resolution r ϭ 60,000 at m/z 400, scan range of m/z 350 to 1800. Following every survey scan, up to 15 most abundant precursor ions were picked for MS/MS fragmentation by collision induced dissociation (higher energy collision induced dissociation mode was used for analysis of SCX and offgel, collision induced dissociation mode was used for in-gel fractions). Fragment ions were detected in Orbitrap with resolution r ϭ 15,000 at m/z 400. In the case of culture filtrate samples, MS/MS scans were acquired in the LTQ mass analyzer. Lock mass option was enabled, which helps maintaining high mass accuracy by real time calibration using polysiloxane ions from air. Further, ions picked for MS/MS were dynamically excluded for next 30 s. Normalized collision energy for MS/MS was set to 35%, and the transfer tube temperature was maintained at 220°C.
Database Searches for Peptide and Protein Identification-Raw data files were processed to generate peak list files using Proteome Discoverer software version 1.2 (Thermo scientific). Filtering parameters used were: (1) Allowed precursor mass range was 500 Da to 5000 Da, (2) Precursor charge state was allowed from 1 to 5, (3) Minimum number of peaks in a spectra was chosen to be 5, (4) Signal to noise ratio was set as 3, (5) For precursors with unrecognized charge state, default charge states of 2 and 3 were allowed. The protein database used for MS/MS searches was downloaded from NCBI for M. tuberculosis H37Rv strain (updated April 12, 2009). The genome sequence for H37Rv strain was downloaded from NCBI ftp site (NCBI Reference Sequence: NC_000962.2). Using in-house Python scripts, a six-frame translated database was created containing translated sequences from stop codon to stop codon. As Mycobacterium is known to use GTG and TTG as initiator methionine codons, wherever GTG and TTG codon was encountered it was translated as initiator methionine in addition to valine and leucine, respectively (8). The variant peptides thus obtained were appended to the genome translation databases as separate entries. Commonly encountered contaminants like BSA, trypsin, and keratins were added to the databases that were used for the MS/MS ion search. Total number of sequences in protein database, including contaminants, was 4015 whereas in genome translation database 320,958 sequences were present. Three different search algorithms, Mascot (version 2.2), Sequest (SCM build 59), and MassWiz (version 1.6.4.3-A) were used to analyze the data. Search parameters used were as follows: (a) Trypsin as a proteolytic enzyme allowing up to one missed cleavage, for MS/MS ion search against genome database, semitryptic cleavage was allowed; (b) Peptide mass error tolerance of 20 ppm; (c) Fragment mass error tolerance of 0.1 Da (for culture filtrate data 0.8 Da mass error was allowed as it was acquired in LTQ mass analyzer); (d) Fixed post-translational modifications was carbamidomethylation of cysteine residues. Variable modifications allowed were oxidation of methionine, acetylation of peptide N terminus and formylation of methionine. PSMs were filtered based on score threshold for 1% false discovery rate (FDR) established using decoy database search. A reverse sequence database was searched separately in addition to forward (target) database and FDR was calculated at every Peptide spectrum match (PSM) score value as % FDR ϭ (Number of hits in reverse database at or above the score/Total number of hits in target and reverse database at or above the score) ϫ 100 A 1% FDR threshold was applied to search results from individual data files. Because six different types of searches were performed with the same data set (genome and protein database searches using Mascot, Sequest, and MassWiz), peptide sequences assigned to a single spectrum in each search were compared. Spectra that were assigned different sequences in different searches were omitted from further analysis. The protein identification list was generated by grouping proteins based on shared peptides. Proteins that have at least one unique peptide were selected from the group. From a group in which no protein could be distinctly selected above others, that is, all proteins had equal evidence, it was represented by only one in the final list marked with an asterisk and remaining equivalent protein IDs were reported in parentheses. Proteins with no unique peptide evidence (subset proteins) were reported in a separate list. Peptides identified with PTMs which were used for protein N terminus analysis were manually evaluated for spectral assignments.

Analysis Using the MassWiz MS/MS Search Algorithm-MassWiz
is a recently developed algorithm for peptide and protein identification from mass spectrometry data (22). MassWiz uses empirically deter-mined abundance weights given to different ion series, their continuity, ion intensities, and supporting ions according to different instrument types. These weights help discriminate the good peptide spectrum matches from poor ones. MassWiz incorporates an integrated filtering module that picks high intensity peaks from dynamic mass bins determined from precursor neutral masses. This removes poor spectra prior to database search. This scoring scheme plugged with a simple spectral filtering approach enables MassWiz to maximize correct peptide identifications while lowering false identifications. MassWiz is open-source and is well suited for high resolution mass spectrometry data because the scoring function takes the mass accuracies of the matching fragment peaks into account.
Workflow for Genome Annotation-Peptide identifications from mass spectrometric data obtained in high resolution mode at both MS and MS/MS levels were used for proteogenomic analysis. This was done because we wanted to base our novel findings on high-confident peptide identification. Peptides obtained after applying the 1% FDR cutoff were selected for genome annotation analysis. Genome coordinates of all the peptides were found out using the tblastn program. Peptides mapping to multiple places in the H37Rv genome were not considered for further proteogenomic analysis. Genome search specific peptides were identified by excluding those peptides which mapped to known proteins from translated genome database search results. Genome search specific peptides were categorized as (1) mapping to intergenic region, (2) partially overlapping annotated genes, and (3) completely mapping to annotated genes. Alternative gene models were searched using two different gene prediction programs-FgeneSB, GeneMark 2.5 for prokaryotes and orfind tool, for peptides that did not agree with the gene model that they were mapped to (categories 2 and 3) and those that mapped to intergenic region (category 1). Novel genes and gene model modifications thus obtained using peptide evidence and gene prediction tools were checked for their conservation across Mycobacterium spp. (in some cases Corynebacterium strains) using protein blast. MS/MS spectra used for proposing novel ORFs or changes in gene structure were manually validated from raw files for the sequence assignments.

LC-MS/MS Analysis of Intracellular and Culture
Filtrate Proteins of M. tuberculosis H37Rv-Protein identification was carried out from both cell lysates and culture filtrates of M. tuberculosis H37Rv. Cell lysates were fractionated by three different fractionation methods: (1) SDS-PAGE followed by in-gel trypsin digestion, (2) Strong cation exchange chromatography, and (3) offgel electrophoresis (IEF), the latter two being at the peptide level. Culture filtrate proteins were fractionated by SDS-PAGE alone. In all, 123 LC-MS/MS runs were carried out. Approximately 1,800,000 MS/MS spectra were obtained, out of which 394,952 were assigned to peptide sequences using three different MS/MS search algorithms. PSMs were filtered for first rank assignments that passed a 1% FDR threshold. The total number of unique peptide sequences obtained was 35,562. The number of peptides identified exclusively from different fractionation methods was, 5788 from in-gel digestion, 5149 from strong cation exchange chromatography and 5291 from offgel fractionation. Figs. 1A and 1B show distribution of peptides and proteins identified from the various fractionation methods applied for analysis of cell lysate proteins. The complete list of peptides identified in our study along with PSM scores, charge, m/z value and post-translational modifications is provided in supplementary Data file 1. The complete set of raw mass spectrometry data (.raw files and peak list files) generated from this study has been made available through the Tranche server (http://proteomecommons.org/tranche, hash values for downloading the data are given at the end of the article). The raw data has also been deposited to PeptideAtlas (www.peptideatlas.org) where the raw data along with peptide identifications can be accessed using the accession number PAe001767 (23). The peptide and protein identification data have also been submitted to Open Source Drug Discovery portal (http://sysborg2.osdd.net/).

Confirmation of Annotated Protein Coding Genes in M.
tuberculosis Genome-NCBI RefSeq database has 3988 protein sequences for H37Rv strain, whereas TubercuList database lists 4012 protein coding genes in the H37Rv genome (release R21, March 2010) (11). We have identified a total of 3176 mycobacterial proteins with at least one unique peptide identified, comprising ϳ80% of the total proteome of the M. tuberculosis. Proteins identified based on shared peptide evidence are listed separately (supplementary Data file 2). As an illustration of the high coverage of many of the proteins, Interestingly, we identified a peptide YLTWGLR mapping to gene Rv0157A, which is annotated as a probable pseudogene (11).
Identification of Hypothetical Proteins-Mycobacterium represents an evolutionarily distinct group of microorganisms. Many of the genes in Mycobacteria are functionally uncharacterized because of lack of sequence similarity to any of the known genes from model prokaryotes (8). In the TubercuList database, 1081 such proteins are present, which are categorized as conserved hypothetical proteins as these are conserved across related organisms (MTB complex, M. smegmatis, M. marinum) (13). We identified 829 proteins that are annotated as conserved hypothetical proteins of which 233 have not been shown to be translated by any earlier proteomic study.

Identification of Novel Protein Coding Genes in H37Rv Genome Using Genome Search Specific Peptides (GSSPs)-
After excluding peptides which map to currently annotated proteins (from RefSeq database), the results from the genome database search provided a list of GSSPs. On the basis of these GSSPs, novel genes and gene structure refinements were proposed by proteogenomic analysis. Fig. 3 gives a schematic of our proteogenomic annotation strategy, which is described in detail in the methods section. Out of a total 36,785 unique peptides that were identified in this study, 243 peptides either mapped to regions of genome where no gene was present or did not agree with the gene model they were mapping to. Two gene prediction programs, FgeneSB and GeneMark as well as the Orfind tool from NCBI were used to find ORFs in the region to which these GSSPs were mapped (18,19). Based on this, we were able to predict the presence of 41 novel protein coding genes. Further, we also checked the conservation of these predicted genes across mycobacterial or related species using protein blast algorithm. Forty of the novel ORFs were already annotated in other strains of

M. tuberculosis.
It is interesting to note that the length of these novel proteins was short (an average length of 86.5 amino acids) and it is likely that some of these were missed from genome annotation of the reference strain owing to their small size. As mentioned earlier, TubercuList contains an additional 24 genes as compared with the RefSeq data set. We identified seven of these proteins, which are implicated in virulence, detoxification, and adaptation. Fig. 4 depicts identification of a novel 69 amino acid long ORF in the H37Rv genome along with the corresponding MS/MS spectra. Table I lists the novel genes found in this study along with details of the supporting peptide evidence and genome co-ordinates of the novel genes.
Changes in Current Gene Structures Using Peptide Data-Apart from identification of novel ORFs, we corrected the gene coordinates using genome search specific peptide data.
Using this strategy, we corrected 79 gene models of which, 78 were N-terminal extensions and one was merging of two genes into one. TubercuList has annotated corrected coordinates (in agreement with our analysis) for six of these genes. Supplementary data file 3 gives the list of genes where modification in the structure is suggested in our analysis. It also contains information about old co-ordinates, modified coordinates, and corresponding peptide evidence. Fig. 5 depicts an example of correction of a gene model by extension of the N terminus (Rv2241). The current coordinates of the gene Rv2241 (AceE), which codes for pyruvate dehydrogenase E1 component are 2,512,452 to 2,515,244. We found peptides HDLAQNSNSASEPDR and HDLAQNSNSASEPDRVR mapping 66 bp upstream to the gene. FgeneSB and GeneMark predicted a longer gene model, which includes the peptide identified by us. In case of gene Rv1181 (PKS4), a peptide TVVTASSFDEL-SAALR was found to partially overlap the N terminus of the gene. This finding suggested that actual start site of the protein was located further upstream of the annotated start site. Using a comparative genomic approach, a similar region in closely related strains was represented by one long protein whereas in H37Rv two separate CDSs, Rv1180 and Rv1181, were reported in the genome as a stop codon. However, a single nucleotide variation (A-ϾC) has been reported at position 1,315,191 in the H37Rv genome, which converts the stop codon to a tyrosine codon supporting the presence of a longer protein instead of two independent coding regions (11).
In another example of N-terminal extension, peptides MIIDLHVQR and VLTIHGVTEHGR were found to map upstream of a gene Rv3203 (genome coordinates 3,580,638 to 3,580,638). However, these peptides are in ϩ2 frame and Rv3203 is translated in ϩ3 frame. We also identified seven unique peptides mapping to Rv3203 in ϩ3 frame. A closer examination revealed that orthologous proteins in other mycobacterial strains including H37Ra and CDC1551 are longer by 37 amino acids, which includes both of the novel peptides that were identified. Ioerger et al. have reported an indel (T-Ͼ-) at position 3,580,637 in the H37Rv genome, which is a probable sequencing error in the H37Rv reference genome (24). Our findings indicate that there is indeed a sequencing error and that the coordinates of the gene Rv3203 should be corrected to be 3,580,526 to 3,581,309.
Correction of gene coordinates was also carried out using protein N-terminal peptides as described in the following section.

Identification of N-terminal Peptides and Confirmation/Correction of Translational Start Sites (TSS)-Approximately 60%
of the genes in H37Rv genome use ATG as start codon, ϳ33% use GTG as start codon and ϳ5% use TTG as a start codon. As described in the methods section, while creating six frame translated database from genome we added variant peptides in which GTG and TTG codons were translated into methionine residues instead of valine or leucine, respectively. Protein N-terminal peptides can be indicated either by post-translational modifications like acetylation or nontryptic nature at the N terminus of the peptide. We used these criteria to assign correct translational start sites. We permitted semitryptic peptides in MS/MS search of the genome database for the purpose of identifying such peptides. We looked for fMet containing peptides, deformylated protein N-terminal peptides, peptides with initiator methionine cleaved, and peptides with acetylated N terminus after initiator methionine was deformylated or cleaved to identify protein N-terminal peptides. N-terminal acetylation after removal of initiator methionine is known to occur in prokaryotes, however, at lower frequency than in eukaryotes. Rison et al. have shown that fraction of proteins that are acetylated at the N-terminal in M. tuberculosis is more than that found in E. coli (25). We found 874 protein N-terminal peptides out of which, 253 peptides were acetylated at the FIG. 3. Schematic representation of proteogenomic annotation strategy. Peptide qualifying 1% FDR threshold from both protein and genome search were considered for proteogenomic analysis. From six frame translated genome database search results, peptides which mapped to known proteins from protein database were eliminated to obtain a set of genome search specific peptides (GSSPs). GSSPs were classified as mapping to intergenic region partially overlapping with current gene annotation and completely mapping within current gene annotations which were further used to modify the present genome annotation. N-terminal peptides identified from protein database searches were used to either confirm or propose change in the annotated translational start sites. N termini, 237 peptides had initiator methionine deformylated, and 384 peptides had cleaved initiator methionine. Thus, we were able to confirm the TSS of 727 proteins (supplementary Data file 5) and reannotate TSS of 33 other proteins (supplementary Data file 4). We also identified Nterminal peptides which confirmed the translational start site of four Novel ORFs (Table I).
We found 21 examples of N-terminal peptides mapping upstream to the currently annotated translational start site of the gene. On the other hand, we corrected TSSs of 12 genes based on a protein N-terminal peptide mapping downstream of the annotated TSS. One such example is illustrated in Fig.  6 where a peptide with nontryptic N terminus, M.GDASLT-TELGR.V, was mapped downstream of the annotated TSS of gene Rv1106c, which can be used as evidence for short-ening the CDS length of the gene model. Interestingly, in the case of eight proteins (Rv2175c, Rv2847c, Rv3133c, Rv1683, Rv1612, Rv3131, Rv2986c, and Rv3001c) we found peptide evidence indicating translation initiation at two different sites. DISCUSSION M. tuberculosis H37Rv genome sequence has been available for more than 10 years and has been characterized previously by a number of groups. Thus, it was surprising to identify many novel regions with protein coding potential. Even more surprising was the fact that most of these novel ORFs were already annotated in other strains of M. tuberculosis but were missing from the primary genome se- FIG. 4. Identification of a novel gene based on peptides mapping to an intergenic region. A, Six peptides mapped to an intergenic region between genes Rv3202c and Rv3203. Gene prediction algorithms FgeneSB and GeneMark supported the presence of this additional gene. In addition, a protein corresponding to this novel gene has been annotated in M. tuberculosis CDC1551 genome. B, Protein sequence of a novel gene. Identified region is indicated by taller text. C, A representative MS/MS spectrum for identification of genome search specific peptide AVDFDDEAGLDTAYLSGGAGDR is shown.  quence of M. tuberculosis. We have demonstrated the power of the proteogenomic approach to annotate a genome and refine the annotation of a well studied genome. Though relatively new, this approach of using mass spectrometry-based proteomic data to identify protein coding regions in the genome can prove to be an essential complementary method along with computational methods for annotating newly sequenced genomes in the future. The