If you don't remember your password, you can reset it by entering your email address and clicking the Reset Password button. You will then receive an email that contains a secure link for resetting your password
If the address matches a valid account an email will be sent to __email__ with instructions for resetting your password
To whom correspondence should be addressed: G.N. Ramachandran Knowledge Center for Genome Informatics, CSIR-Institute of Genomics and Integrative Biology, South Campus, Sukhdev Vihar, Mathura Road, Delhi 110025, India. Tel.: +91-11-29879301; Fax: +91-11-29879301;
* DK is supported by Department of Science and Technology-INSPIRE Senior Research Fellowship. AKY is supported by Council of Scientific and Industrial Research (India), Senior Research Fellowship. SHN is supported by UQ Postdoctoral Research Fellowship. SMG is supported by National Health and Medical Research Council (NHMRC) Principal Research Fellowship. DD acknowledges CSIR In silico Biology project (CMM-0017) and Genesis project (BSC-0121) for compute infrastructure and publication charges. This article contains supplemental Figs. S1 to S2, Tables S1 to S2 and Files S1 to S8.
We present GenoSuite, an integrated proteogenomic pipeline to validate, refine and discover protein coding genes using high-throughput mass spectrometry (MS) data from prokaryotes. To demonstrate the effectiveness of GenoSuite, we analyzed proteomics data of Bradyrhizobium japonicum (USDA110), a model organism to study agriculturally important rhizobium-legume symbiosis. Our analysis confirmed 31% of known genes, refined 49 gene models for their translation initiation site (TIS) and discovered 59 novel protein coding genes. Notably, a novel protein which redefined the boundary of a crucial cytochrome P450 system related operon was discovered, known to be highly expressed in the anaerobic symbiotic bacteroids. A focused analysis on N-terminally acetylated peptides indicated downstream TIS for gene blr0594. Finally, ortho-proteogenomic analysis revealed three novel genes in recently sequenced B. japonicum USDA6T genome. The discovery of large number of missing genes and correction of gene models have expanded the proteomic landscape of B. japonicum and presents an unparalleled utility of proteogenomic analyses and versatility of GenoSuite for annotating prokaryotic genomes including pathogens.
Rapid advances in massively parallel sequencing technologies have enabled the sequencing of thousands of prokaryote genomes. However, to understand the functional elements of the genome, annotation of the protein coding genes is the first and usually the most challenging task. Protein coding genes in a genome are annotated by in silico predictions and homology searches against known proteins. In silico gene annotations are prone to errors such as missing genes, incorrect translation initiation sites (TIS)
) but are difficult to carry out in a proteogenomic context. This is because of the lack of automated software for proteogenomic analyses that incorporates multiple search engines without compromising on the statistical robustness of individual algorithms.
To fill this gap, we developed an automated pipeline, GenoSuite to carry out genome translations, database searches using multiple search engines, result integration based on statistical significance of PSMs, FDR calculations, coordinate mapping, and finding completely novel genes. To demonstrate the effectiveness of GenoSuite, we analyzed proteomics data of Bradyrhizobium japonicum (USDA110), a model organism to study agriculturally important rhizobium–legume symbiosis. We selected B. japonicum for proteogenomic re-annotation because of its large genome size, high GC content and nonavailability of many closely related genome sequences. B. japonicum is a symbiont to legumes and is important for nitrogen fixation in root nodules which helps these plants to grow without any nitrogenous fertilizers. Its primary host is Soybean, an economically important crop and model system to study rhizobia-legume symbiosis. This bacterium has a 9.1 Mb genome, one of the longest among bacteria, with 64.1% GC content (
) gene predictions and sequence similarity with known proteins. A large number of transcriptomics and proteomics studies have also been performed to understand the mechanisms and genes involved in symbiosis and nitrogen fixing process. However, a high quality annotation of protein coding genes is still not achieved for this class of bacteria. To improve on the existing annotation of B. japonicum, we carried out a comprehensive proteogenomic analysis using publicly available proteomics data generated from bacteroids of three host systems (
). We used GenoSuite to search spectral data against genome translated database of B. japonicum and selected peptide identifications at ≤1%FDR to re-annotate B. japonicum genome. We identified 59 novel protein coding regions (NPCRs) and corrected annotations for 49 genes.
The mass spectral data for proteogenomic analysis on B. japonicum were obtained from PRIDE repository (
). A total of nine data sets with PRIDE accessions 10099–10104 and 10114–10116 were used. These mass spectra represent B. japonicum proteomes from three different host systems in triplicates for each host. In total, these data sets had 621,176 MS/MS spectra.
Development of GenoSuite, An Automated Proteogenomic Pipeline
We developed GenoSuite, a standalone pipeline for automated proteogenomic analysis. GenoSuite is a suite of three tools: PPT (Prokaryotic Proteogenomic Tool), ORFmapper, and PSMplotter. GenoSuite is an easily configurable tool and is ready for use after downloading and unzipping the archive. Integration with the search algorithms OMSSA, X!Tandem, and InsPecT is also easy because only the paths need to be added to GenoSuite configuration file. MassWiz comes integrated as a part of the standard distribution.
PPT searches spectral data against a genome database with four peptide identification algorithms namely MassWiz (
). It is highly configurable and any combination of these algorithms can be used for search. All inputs for PPT are defined in a configuration file. GenoSuite uses common file formats e.g. FASTA, Genbank, MGF, GFF etc., so that the pipeline can be easily integrated with existing frameworks. We employed a Combined FDRScore (
) based approach to integrate results from multiple algorithms, each of which has a noncomparable scoring metric. In brief, GenoSuite calculates FDRScore for all employed algorithms' results. Average FDRScore is calculated for each PSM by calculating geometric mean of FDRScores from individual algorithms identifying peptide spectrum pair. All PSMs are divided into subsets based on the combination of algorithms identifying PSMs. Combined FDRScores are again calculated from Average FDRScores separately for each subset of PSMs and significant PSMs below a user defined FDR threshold are selected for further analyses. False Discovery Rate (FDR) (in %) is estimated by Kall method (
Filtered peptides are then mapped onto the genome and also onto the known proteins. Identified peptides mapping exclusively to the genome translated database are classified as novel peptides. A complete outline of the analysis pipeline GenoSuite is shown in Fig. 1. List of novel peptides are exported as GFF files which can be integrated with any DAS annotation server. For spectral quality assessment, XML files are created for peptide identifications from each algorithm and for novel peptides.
ORFmapper compares novel peptides to existing annotations and to ab initio predictions. It requires genbank file, ORF prediction file (GeneMark or GFF format) and novel peptide GFF as inputs. Novel peptide coordinates in input GFF file are compared with the gene coordinates in genbank file and peptides are further classified into (1) novel proteins coding region (NPCR) or (2) gene model changes. ORFmapper creates separate files for peptides leading to novel proteins, peptides suggesting gene model changes and ORFs mapped to novel peptides. It also creates genomic map for each peptide to provide a visual of its genomic context. These images are linked to an HTML file for ease of analysis.
PSMplotter program is a utility to visualize peptide spectrum matches. It takes the XML file created by PPT as its Input and generates an HTML file where all spectrum matches from the XML file are hyperlinked with their PSM images. This can be used for manual validation of peptide spectral matches.
Spectral data were searched by GenoSuite against a six frame translated database of B. japonicum USDA110 genome (NC_004463.1). GenoSuite translates a genome from stop to stop codon in all six reading frames and translation products of length 50 aa or more are kept in database. In total, the B. japonicum database had 98,716 sequence entries. All four algorithms in the GenoSuite were employed for spectral data search with 20 ppm precursor ion tolerance, 0.6 Da product ion tolerance, trypsin as the protease with one missed cleavage, carbamidomethylation of cysteine as a fixed modification and methionine oxidation as variable modification. For parameters not common across algorithms, we used the algorithm defaults. Separate target-decoy database searches were carried out. Stringent FDR threshold of ≤1% was applied to the resulting PSMs. Leucine and Isoleucine were considered identical during FDR calculations and mapping onto the genome or proteome. Peptides mapping to two different genomic loci were not considered for further analysis.
Translation Initiation Site Search
A different search for estimating translation initiation sites for the gene models was performed using OMSSA and X!Tandem. Search parameters were same as above except semitryptic enzyme specificity, peptide N-terminal acetylation and peptide N-terminal formylated methionine as variable modifications. Only N-terminally modified peptides identified at ≤ 1% FDR were selected for further analysis.
Gene Prediction and ORF Comparison
To define boundaries of maximum number of novel translations in the B. japonicum USDA110 genome we used ORF predictions by four different algorithms namely GeneMark.hmm prokaryotic (Version 2.6r) (
). GeneMark, Glimmer, and Prodigal predictions were downloaded from NCBI (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Bradyrhizobium_japonicum_USDA_110_uid57599/). FGENESB gene predictions were downloaded from (http://linux1.softberry.com/data/annotation/bact/NC_004463.fna.ann.gz). ORF comparison across algorithms was performed using in-house scripts.
Protein Functional Annotation
To annotate proteins for their probable functions, we used domain assignment method Pfam (
). Pfam-A profile database and HMMER-3.0 were used to assign profiles. GOANNA tool was used for gene ontology assignment and only molecular function (F) category was used for annotation.
Multi-algorithmic Proteogenomic Analysis Pipeline Development
GenoSuite was developed to automate spectral searches, statistical integration of PSMs and downstream analysis. It is configured with four open source peptide identification algorithms namely OMSSA, X!Tandem, InsPecT and MassWiz. The choice of algorithms was empirical and these algorithms have been benchmarked in our previous studies (
), a diverse set of algorithms provides an increased sensitivity for peptide identifications than any single algorithm. GenoSuite incorporates a reversed protein decoy database strategy to calculate FDR. For result integration from these algorithms, GenoSuite calculates Combined FDRScore which is reported to give ≈35% more identifications than any single algorithm (
) from OMSSA and X!Tandem search results. We get highly correlated CFS values from these two tools (supplemental Fig. S1). FDRapp implements the same approach on OMSSA, X!Tandem, and Mascot algorithms. We extended this approach to Inspect and MassWiz and implemented in GenoSuite (supplemental Fig. S2). Peptides from FDR filtered PSMs are then mapped onto the genome to report NPCR and gene model changes. Fig. 1 depicts a schematic chart of GenoSuite. Additional functionality for ORF prediction mappings and peptide spectral match visualization is also added to the pipeline.
Genome Search Specific Peptide Identification
A total of 621,176 spectra from nine different datasets were searched against B. japonicum genome. A complete list of significant PSM identifications from each of nine datasets is provided in supplementary File S1. Total 26,592 peptides were identified of which 511 were shared across multiple genomic loci. As the primary aim of this study was to annotate genomic loci, shared peptides were not considered for further analysis. Fig. 2A depicts all unique peptides (26,081) identified in this study mapped to their genomic loci. After mapping unique peptides onto the genome translated database, GenoSuite reports proteins identified by at least two unique peptides or single peptide with ≥5 significant PSMs above desired FDR threshold. A total of 2654 proteins were identified in an automated manner at a global protein level FDR of 0.0018. Supplementary File S2 provides a full list of proteins identified in this analysis and their sample wise identification. Four out of ten proteins, which were identified with only a single unique peptide, had dubious peptide spectral matches and hence were not considered for further analysis. Supplementary File S3 provides annotated peptide spectral matches for six accepted single peptide hits. GenoSuite further maps all peptides from significant protein identifications to the annotated proteome and lists novel peptides. A total of 283 peptides were reported as novel. GenoSuite then categorizes novel peptides into (1) NPCR or (2) changes in the annotated gene models. A total of 59 new protein coding regions and 49 wrongly annotated genes were reported in this automated analysis by GenoSuite. A potential contamination of host proteins or any other contamination may also result in observation of novel peptides. To eliminate this possibility, we mapped all the novel peptides on a sequence database of Glycine max and common contaminants (cRAP) allowing one amino acid mismatch. Eighteen of these novel peptides, all of length 8 aa or below, were found to be similar with host or contaminant proteins. This did not affect the results at protein level as novel proteins had other unique peptides. One gene model change suggested exclusively by a peptide having similarity with host or contaminant proteins was not considered further.
Novel Protein Coding Regions in B. japonicum Genome
We discovered 59 confident NPCRs in B. japonicum USDA110 genome. Supplementary File S4 lists all NPCRs identified in this study with information about their function, sample-wise and host-wise expression. Ab initio gene predictions were used to determine the boundary of un-annotated genes. ORF predictions provide information about conservation of transcription and translational features beyond coding region and add confidence to the peptide identifications. Fifty-one out of 59 NPCRs are supported by ORF predictions on the same frame and strand. Based on gene predictions, we determined the TIS of these proteins. We also analyzed the length distribution of these 51 NPCRs. The average length of these proteins is 113 aa. Interestingly, 92% of these 51 novel proteins were below 200 aa length suggesting that most novel proteins are encoded by short ORFs (Fig. 2B). Supplemental Table S1 lists all NPCRs with their genomic coordinates and identified peptides. As our proteogenomic analysis is based on the proteomic data from different hosts of B. japonicum, we analyzed the NPCRs for their host-specific expression. Following the criterion of Koch et al. for this analysis, we considered only those NPCRs which are identified in at least two replicates for a host. Collectively, 34 NPCRs are identified in at least two replicates. 13 of these are specifically expressed in bacteroids from soybean, two from siratro and three are specific for cowpea bacteroids (Fig. 2C). Host specific expression of NPCRs suggests their putative role in host adaptation and survival.
We also probed the genomic context of the NPCRs. Among all NPCRs, 36 were intergenic, 21 were on the opposite strand to existing gene annotation and two were on different frame. Fig. 3A shows a NPCR identified with five peptides mapping to an intergenic region of the genome. The protein is supported by ORF prediction by all four gene prediction algorithms and is identified in five out of nine samples. Function prediction suggests that this protein is a putative cytochrome P450 hydroxylase. Interestingly, Inclusion of this NPCR suggests a continuous operon in this genomic region (Fig. 3B). Earlier this operon was considered with only three members (blr2143-blr2145) but now with NPCR bridging the gap, this operon can be extended to six members. FGENESB also predicts an operon of six proteins, which includes the NPCR identified in this genomic locus. Sequence similarity shows the NPCR and operon to be specific to the rhizobia. The operon has been shown to be under regulation of oxygen-responsive transcriptional activator protein NifA (
). Four proteins of the operon are annotated as cytochrome P450 proteins and remaining two have cytochrome P450 hydroxylase domain. In our analysis, all six proteins of this operon are identified with multiple peptide hits. Identification of all proteins of this operon and their related functions further strengthens coordinated gene expression as in operonic structures.
Novel Proteins and Comparative Genomics
We extended newly identified annotations of B. japonicum in related rhizobial genomes based on orthology of NPCRs. Ortho-proteogenomic strategy was first applied on genus Mycobacterium by Gallien et al. (
). For this analysis we included all the sequenced genomes in Bradyrhizobiaceae family along with Mesorhizobium loti, Sinorhizobium meliloti, and Rhizobium leguminosarum because these are widely studied rhizobial genomes. These genomes are of diverse sizes ranging from ≈3 Mb to ≈9 Mb with 3122 to 8826 protein coding genes. Genomic level identity among these genomes and B. japonicum USDA110 genome ranges from 1% to 75% (supplemental Table S2). A full list of NPCRs with presence/absence of orthologs in related genomes is tabulated in supplementary File S5. Orthologous sequences were found for 43 NPCRs. Interestingly, orthologs of 35 NPCRs identified in this study in B. japonicum USDA110 are annotated as protein coding genes in the recently sequenced and annotated B. japonicum USDA6T (
) genomes. After comparing the orthologous loci with genome annotations and ORF predictions of the respective genomes, we added three novel proteins in B. japonicum USDA6T genome and found three protein coding genes with N-terminal changes.
N-terminal Changes in the Annotated Genes
Sixty two novel peptides mapped upstream of annotated gene models on the same frame and strand. Because there was no stop codon between identified peptide and annotated gene, these peptides suggested change in the currently annotated translation initiation site (TIS). We excluded one peptide from the list because of its similarity with host proteins. The remaining 61 novel peptides indicated changes in 48 annotated gene models. A detailed list of peptides suggesting gene model changes is provided as supplementary File S6. The ORF predictions were checked for any upstream TIS. 34 of these 48 gene model changes are also supported by gene predictions. We could assign new TIS to 34 genes, 21 of which had TTG as the start codon. We considered maximum voted gene start of the ORF predictions and the longest ORF in cases where votes tied. Fig. 4 highlights a case of gene model change. Four peptides map upstream to the currently annotated gene model for locus bll2380, a gene coding for a glycosyl transferase enzyme. ORF prediction by GeneMark also agrees with the upstream TIS. A longer bll2380 protein is also annotated in B. japonicum USDA6T and B. sp.23321.
In cases where ORF predictions did not support gene model changes, we applied sequence similarity searches. One striking case is identified for NolA (bll2019), a transcriptional regulator protein which is a key member involved in nodulation in the host plant during symbiosis. The NolA protein is identified with six peptides. One of the peptides, IGELAEATGVTVR is identified in all nine samples and it uniquely maps upstream to the current annotation of the NolA TIS. All four gene predictions support the existing annotation of translational start site but do not cover the novel peptide. However, a longer protein is reported for some close strains by sequence similarity analysis (Fig. 5). Even for B. japonicum USDA110, a longer NolA protein is suggested in few studies (
). The peptide is covered in the longest protein isoform from NolA locus and is a part of DNA binding domain in the protein's N terminus. The observation of the novel peptide in all nine samples suggests the longest protein isoform from NolA locus to be expressed during symbiosis with legume plants.
We analyzed protein translation initiation site by probing N-terminal specific modifications. Formylation happens on the initiator methionine and N-terminal acetylation on second amino acid after initiator methionine is cleaved. These modifications directly mark the TIS for a protein coding gene. Based on peptides identified at ≤1% FDR, and their upstream codon in the genome we could confirm the annotated TIS for 15 genes. Annotated peptide spectrum matches of selected N-acetylated peptides are provided in supplementary File S7. Additionally, we corrected TIS for one gene where N-terminally modified peptides did not agree with the currently annotated start site. Fig. 6 depicts an N-terminally acetylated peptide and a representative PSM, which helped in correcting TIS for the locus blr0594 (NP_767234.1), a thioredoxin protein. Peptide TIIDQGNGAAGPAAADLIK is identified with N-terminal acetylation. The codon preceding the N-terminally acetylated Threonine is GTG which is known to code for initiator methionine in genomes with high GC content. Based on the acetylated peptide we corrected the TIS for blr0594 locus. The newly assigned TIS is downstream to the annotated TIS for this locus. The orthologous protein sequences in related Bradyrhizobium genomes also agree with the newly assigned TIS for blr0594.
Proteogenomic Analysis of Shigella flexneri
We also applied GenoSuite on Shigella flexneri 2a str. 2457T data (Pride Accession 18992–18999). S. flexneri is a human pathogen which causes dysentery and diarrhea. We discovered 28 previously un-annotated protein coding genes and corrected annotation for 22 genes, which shows rapid proteogenomic analyses of another prokaryotic genome (supplementary File S8). No proteogenomic study for this bacterium has been reported so far. Novel and better annotations can lead to better understanding of pathogenesis by Shigella flexneri.
GenoSuite tool described here provides a simple and effective informatics framework for proteogenomic analyses from prokaryotes. To the best of our knowledge, there are no readily available tools dedicated for prokaryotic proteogenomic analyses. For instance, GAPP (
) uses a de novo tag based approach to detect peptides in a genome, which is usually more error prone than database searches. Integration of multiple algorithms in GenoSuite improves both sensitivity and specificity. Implementing FDScore method on different algorithms before integrating their outputs and calculating FDR at PSM and protein levels makes GenoSuite a robust statistical pipeline. It is currently designed to discover simple gene models found in prokaryotes.
As a proof of principle, we have used GenoSuite to analyze Bradyrhizobium japonicum USDA110 genome for discovering novel protein coding genes. B. japonicum USDA110 is an agriculturally important model organism to study symbiotic nitrogen fixation in legumes. A comprehensive annotation is a key prerequisite for such studies but its genome annotation is far from complete. Because only a handful of related genomes are sequenced, the annotation of protein coding genes could not take advantage of comparative genomics. GenoSuite identified large number of new proteins coding genes in B. japonicum USDA110. A significant number of these NPCRs are unique to this bacterium whereas others have orthology in two closely related genomes, namely B. japonicum USDA6T and B. sp.S23321. The exclusivity of these genes to one organism or to a small group of organisms makes them difficult to annotate by methods other than proteogenomics. Most of the NPCRs in our analysis are short in length (<200 aa). Previous proteogenomic studies in other organisms have also identified mostly short novel proteins (
) considered all ORF predictions of length 150 bp or above for gene prediction, many short genes within the acceptable length range (>150 bp) were left un-annotated in B. japonicum USDA110 genome. It is generally believed that genomes with high GC% suffer with over-prediction of short proteins (
); our data suggests the opposite for B. japonicum genome, which also has high GC (64.1%). As NPCRs identified in proteogenomics studies are experimentally discovered, these can be a valuable resource for improving the existing gene prediction algorithms or development of new algorithms for predicting short length proteins. Additionally, by applying ortho-proteogenomics, we identified three new genes in the B. japonicum USDA6T genome. This indicates that although the computational gene prediction methods have improved, they still miss some protein coding genes and it emphasizes the use of proteogenomics as a powerful solution to such issues in genome annotation.
We also used novel peptides in the discovery of novel operons or correction of known operons. An operon is predicted when genes are on the same strand and the gap between them is not more than 50–60 bp. Many novel genes or N-terminal extension to existing genes reduce this gap and can correct operon models. In addition, it is a prerequisite that all proteins in new operon are identified. We discovered 11 operon models which could not be annotated based on initial annotations of protein coding genes. An example is an operon with six proteins related to Cytochrome P450 system. This operon is highly active in anaerobic conditions within symbiotic bacteroids and is crucial for normal nitrogen reduction within legume host. Christie-Oleza et al. also reported operons in Ruegeria pomeroyi by proteogenomic analysis (
). Correction and unveiling of operonic structures can aid in quick annotation as well as provide functional insights into the underlying biology of the organism.
Incorrect assignment of TIS is probably the most frequent error of genome annotation. We used fully tryptic peptides to correct such errors. We found changes in 48 existing gene models in B. japonicum genome. Most of the errors in TIS assignment were associated with TTG start codon. ATG and GTG start codons are preferred over TTG by gene prediction algorithms which may have resulted in incorrect assignment of TIS in observed cases. Protein sequences from these loci are significantly changed and may contribute to newer structural and functional features as shown in NolA gene where N-terminal extension provides additional DNA binding domain probably involved in a regulatory role. Protein N-terminal modifications can also be used to probe TIS. All TIS confirmations in our study are based on N-terminally acetylated peptides. N-Acetylation for blr0594 gene revealed that TIS is located downstream to the currently annotated site. N-terminal acetylation is believed to be rare in bacteria and only few proteins, mostly ribosomal, are known to be acetylated (
), which showed N-acetylation as a widespread protein modification among bacteria.
To establish GenoSuite's versatility and applicability in discovering novel translations, it was also applied on a recent Shigella flexneri 2a str. 2457T data where we discovered previously un-annotated genes and corrected annotation for several genes. This demonstrates rapid and automated proteogenomic analyses of prokaryotic genomes using GenoSuite. GenoSuite is a highly effective, easily configurable analysis suite for prokaryotic proteogenomics.
We thank Anupam Kumar Mondal for helpful discussions and insightful comments while proof-reading the manuscript.