
A Meta-proteogenomic Approach to Peptide Identification Incorporating Assembly Uncertainty and Genomic Variation*

Open Access. Published: May 29, 2019. DOI: https://doi.org/10.1074/mcp.TIR118.001233
      Matching metagenomic and/or metatranscriptomic data, currently often under-used, can be a useful reference for the analysis of metaproteomic tandem mass spectra (MS/MS) data. Here we developed a software pipeline for identifying peptides and proteins from metaproteomic MS/MS data, using proteins derived from matching metagenomic (and metatranscriptomic) data as the search database. The pipeline is based on two novel approaches, Graph2Pro (published) and Var2Pep (new). Graph2Pro retains and uses the uncertainties of metagenome assembly for reference-based MS/MS data analysis. Var2Pep considers the variations found in metagenomic/metatranscriptomic sequencing reads that are not retained in the assemblies (contigs). The new software pipeline provides a one-stop application of both tools, and it supports metagenome assemblies from commonly used assemblers including MegaHit and metaSPAdes. When tested on two collections of multi-omic microbiome data sets, our pipeline significantly improved the identification rate of the metaproteomic MS/MS spectra, by about twofold compared with conventional contig- or read-based approaches (Var2Pep alone identified 5.6% to 24.1% more unique peptides, depending on the data set). We also showed that the identified variant peptides are important for functional profiling of microbiomes. All results suggest that it is important to take assembly uncertainties and genomic variants into consideration to facilitate metaproteomic MS/MS data interpretation.


      Microbiome research has been applied to various studies of microbial organisms associated with different ecosystems, habitats and hosts (
      • Crump B.C.
      • Armbrust E.V.
      • Baross J.A.
      Phylogenetic analysis of particle-attached and free-living bacterial communities in the columbia river, its estuary, and the adjacent coastal ocean.
      ,
      • Santelli C.M.
      • Orcutt B.N.
      • Banning E.
      • Bach W.
      • Moyer C.L.
      • Sogin M.L.
      • Staudigel H.
      • Edwards K.J.
      Abundance and diversity of microbial life in ocean crust.
      ,
      • Fierer N.
      • Lauber C.L.
      • Ramirez K.S.
      • Zaneveld J.
      • Bradford M.A.
      • Knight R.
      Comparative metagenomic, phylogenetic and physiological analyses of soil microbial communities across nitrogen gradients.
      ,
      • Fierer N.
      • Leff J.W.
      • Adams B.J.
      • Nielsen U.N.
      • Bates S.T.
      • Lauber C.L.
      • Owens S.
      • Gilbert J.A.
      • Wall D.H.
      • Caporaso J.G.
      Cross-biome metagenomic analyses of soil microbial communities and their functional attributes.
      ,
      • Qin J.
      • Li R.
      • Raes J.
      • Arumugam M.
      • Burgdorf K.S.
      • Manichanh C.
      • Nielsen T.
      • Pons N.
      • Levenez F.
      • Yamada T.
      A human gut microbial gene catalogue established by metagenomic sequencing.
      ,
      • Gill S.R.
      • Pop M.
      • DeBoy R.T.
      • Eckburg P.B.
      • Turnbaugh P.J.
      • Samuel B.S.
      • Gordon J.I.
      • Relman D.A.
      • Fraser-Liggett C.M.
      • Nelson K.E.
      Metagenomic analysis of the human distal gut microbiome.
      ,
      • Ley R.E.
      • Turnbaugh P.J.
      • Klein S.
      • Gordon J.I.
      Microbial ecology: human gut microbes associated with obesity.
      ,
      • Wilmes P.
      • Andersson A.F.
      • Lefsrud M.G.
      • Wexler M.
      • Shah M.
      • Zhang B.
      • Hettich R.L.
      • Bond P.L.
      • VerBerkmoes N.C.
      • Banfield J.F.
      Community proteogenomics highlights microbial strain-variant protein expression within activated sludge performing enhanced biological phosphorus removal.
      ). Recent studies have further shown the broader impacts of microbiota on human health and diseases, including the influence of microbiota on the efficacy of cancer immunotherapy (
      • Routy B.
      • Le Chatelier E.
      • Derosa L.
      • Duong C.P.M.
      • Alou M.T.
      • Daillere R.
      • Fluckiger A.
      • Messaoudene M.
      • Rauber C.
      • Roberti M.P.
      • Fidelle M.
      • Flament C.
      • Poirier-Colame V.
      • Opolon P.
      • Klein C.
      • Iribarren K.
      • Mondragon L.
      • Jacquelot N.
      • Qu B.
      • Ferrere G.
      • Clemenson C.
      • Mezquita L.
      • Masip J.R.
      • Naltet C.
      • Brosseau S.
      • Kaderbhai C.
      • Richard C.
      • Rizvi H.
      • Levenez F.
      • Galleron N.
      • Quinquis B.
      • Pons N.
      • Ryffel B.
      • Minard-Colin V.
      • Gonin P.
      • Soria J.C.
      • Deutsch E.
      • Loriot Y.
      • Ghiringhelli F.
      • Zalcman G.
      • Goldwasser F.
      • Escudier B.
      • Hellmann M.D.
      • Eggermont A.
      • Raoult D.
      • Albiges L.
      • Kroemer G.
      • Zitvogel L.
      Gut microbiome influences efficacy of PD-1-based immunotherapy against epithelial tumors.
      ). Researchers have started to investigate the therapeutic applications of the microbiome, which are very promising, as demonstrated in recent case studies: Zhu and colleagues edited the gut microbiota to ameliorate colitis (
      • Zhu W.
      • Winter M.G.
      • Byndloss M.X.
      • Spiga L.
      • Duerkop B.A.
      • Hughes E.R.
      • Büttner L.
      • de Lima Romão E.
      • Behrendt C.L.
      • Lopez C.A.
      • Sifuentes-Dominguez L.
      • Huff-Hardy K.
      • Wilson R.P.
      • Gillis C.C.
      • Tükel Ç.
      • Koh A.Y.
      • Burstein E.
      • Hooper L.V.
      • Bäumler A.J.
      • Winter S.E.
      Precision editing of the gut microbiota ameliorates colitis.
      ), and in another case, engineered commensal microbiomes were used for diet-mediated colorectal cancer chemoprevention (
      • Ho C.L.
      • Tan H.Q.
      • Chua K.J.
      • Kang A.
      • Lim K.H.
      • Ling K.L.
      • Yew W.S.
      • Lee Y.S.
      • Thiery J.P.
      • Chang M.W.
      Engineered commensal microbes for diet-mediated colorectal-cancer chemoprevention.
      ). In microbiome research, metagenomic and metatranscriptomic (
      • Shi Y.
      • Tyson G.W.
      • DeLong E.F.
      Metatranscriptomics reveals unique microbial small RNAs in the ocean's water column.
      ,
      • Stewart F.J.
      • Ulloa O.
      • DeLong E.F.
      Microbial metatranscriptomics in a permanent marine oxygen minimum zone.
      ) data often reveal the taxonomic distribution and potential functions, whereas metaproteomic data provide more direct information about the functionality of microbial communities (
      • Verberkmoes N.C.
      • Russell A.L.
      • Shah M.
      • Godzik A.
      • Rosenquist M.
      • Halfvarson J.
      • Lefsrud M.G.
      • Apajalahti J.
      • Tysk C.
      • Hettich R.L.
      Shotgun metaproteomics of the human distal gut microbiota.
      ,
      • Wilmes P.
      • Bond P.L.
      Metaproteomics: studying functional gene expression in microbial ecosystems.
      ,
      • Maron P.-A.
      • Ranjard L.
      • Mougel C.
      • Lemanceau P.
      Metaproteomics: a new approach for studying functional microbial ecology.
      ,
      • Erickson A.R.
      • Cantarel B.L.
      • Lamendella R.
      • Darzi Y.
      • Mongodin E.F.
      • Pan C.
      • Shah M.
      • Halfvarson J.
      • Tysk C.
      • Henrissat B.
      • Raes J.
      • Verberkmoes N.C.
      • Fraser C.M.
      • Hettich R.L.
      • Jansson J.K.
      Integrated metagenomics/metaproteomics reveals human host-microbiota signatures of Crohn's disease.
      ). In a recent study of ocean microbiome (
      • Pachiadaki M.G.
      • Sintes E.
      • Bergauer K.
      • Brown J.M.
      • Record N.R.
      • Swan B.K.
      • Mathyer M.E.
      • Hallam S.J.
      • Lopez-Garcia P.
      • Takaki Y.
      • Nunoura T.
      • Woyke T.
      • Herndl G.J.
      • Stepanauskas R.
      Major role of nitrite-oxidizing bacteria in dark ocean carbon fixation.
      ), single-cell genomics and community metagenomics revealed that Nitrospinae are the most abundant and globally distributed nitrite-oxidizing bacteria in the ocean, whereas metaproteomic and metatranscriptomic analyses suggest that nitrite oxidation is the main pathway of energy production in Nitrospinae. And in another work, Kleiner and colleagues (
      • Kleiner M.
      • Thorson E.
      • Sharp C.E.
      • Dong X.
      • Liu D.
      • Li C.
      • Strous M.
      Assessing species biomass contributions in microbial communities via metaproteomics.
      ) showed that metaproteomics can be used to assess species biomass contributions in microbial communities, an approach that is less prone to the biases of sequencing-based methods. Given the demonstrated impact of microbiomes on human health and disease, metaproteomics provides a new opportunity to study the functionality of microbial communities.
      Although metaproteomics reveals information complementary to metagenomic and metatranscriptomic studies, it often relies on conventional proteomics techniques based on mass spectrometry (MS). Metaproteomic data analysis is challenging (
      • Heyer R.
      • Schallert K.
      • Zoun R.
      • Becher B.
      • Saake G.
      • Benndorf D.
      Challenges and perspectives of metaproteomic data analysis.
      ), which has motivated the development of recent algorithms and tools including MetaLab (
      • Cheng K.
      • Ning Z.
      • Zhang X.
      • Li L.
      • Liao B.
      • Mayne J.
      • Stintzi A.
      • Figeys D.
      MetaLab: an automated pipeline for metaproteomic data analysis.
      ), cascaded search (
      • Kertesz-Farkas A.
      • Keich U.
      • Noble W.S.
      Tandem mass spectrum identification via cascaded search.
      ), Unipept (
      • Mesuere B.
      • Debyser G.
      • Aerts M.
      • Devreese B.
      • Vandamme P.
      • Dawyndt P.
      The unipept metaproteomics analysis pipeline.
      ), two-step search (
      • Jagtap P.
      • Goslinga J.
      • Kooren J.A.
      • McGowan T.
      • Wroblewski M.S.
      • Seymour S.L.
      • Griffin T.J.
      A two-step database search method improves sensitivity in peptide sequence matches for metaproteomics and proteogenomics studies.
      ), MetaProteomeAnalyzer (
      • Muth T.
      • Behne A.
      • Heyer R.
      • Kohrs F.
      • Benndorf D.
      • Hoffmann M.
      • Lehtevä M.
      • Reichl U.
      • Martens L.
      • Rapp E.
      The metaproteomeanalyzer: a powerful open-source software suite for metaproteomics data analysis and interpretation.
      ) and ProteoStorm (
      • Beyter D.
      • Lin M.S.
      • Yu Y.
      • Pieper R.
      • Bafna V.
      Proteostorm: An ultrafast metaproteomics database search framework.
      ). Successful proteomic database search largely relies on the completeness and specificity of the target protein database. There are two broad categories of approaches for metaproteomic database search: approaches relying on a generic collection of reference proteins (for example, the human/mouse gut bacterial genes (
      • Zhang X.
      • Ning Z.
      • Mayne J.
      • Moore J.I.
      • Li J.
      • Butcher J.
      • Deeke S.A.
      • Chen R.
      • Chiang C.K.
      • Wen M.
      • Mack D.
      • Stintzi A.
      • Figeys D.
      MetaPro-IQ: a universal metaproteomic approach to studying human and mouse gut microbiota.
      ,
      • Cheng K.
      • Ning Z.
      • Zhang X.
      • Li L.
      • Liao B.
      • Mayne J.
      • Stintzi A.
      • Figeys D.
      MetaLab: an automated pipeline for metaproteomic data analysis.
      )), and approaches that use specific proteins/peptides inferred from matching metagenomic and metatranscriptomic data sets of the same microbial community. Unlike typical proteomics data analysis, peptide/protein identification from metaproteomics data generally lacks a good reference protein database for database searches, except for well-studied microbiomes. On the other hand, although matching metagenomic or metatranscriptomic data sets can provide a more adequate reference database, they are often under-used for metaproteomic data analysis. Microbiomes are normally composed of many species, many of which are closely related, as demonstrated in studies of microbial communities empowered by long sequencing reads (
      • Kuleshov V.
      • Jiang C.
      • Zhou W.
      • Jahanbani F.
      • Batzoglou S.
      • Snyder M.
      Synthetic long-read sequencing reveals intraspecies diversity in the human microbiome.
      ,
      • Sharon I.
      • Kertesz M.
      • Hug L.A.
      • Pushkarev D.
      • Blauwkamp T.A.
      • Castelle C.J.
      • Amirebrahimi M.
      • Thomas B.C.
      • Burstein D.
      • Tringe S.G.
      • Williams K.H.
      • Banfield J.F.
      Accurate, multi-kb reads resolve complex populations and detect rare microorganisms.
      ). Despite recent advances in assembly algorithms for metagenomes, it remains a challenge to assemble metagenomes from shotgun metagenomic sequences because of the nature of microbial communities: a large number of species co-exist in a sample; these species are of various abundances; and many species are closely related, with similar genome sequences whose variations assemblers cannot untangle. As a result, metagenome assemblies often provide limited references for peptide/protein identification from metaproteomic data sets.
      Approaches have been proposed to address the above-mentioned limitation of using metagenome assemblies as reference for peptide/protein identification from MS/MS data. One successful effort is from May and colleagues (
      • May D.H.
      • Timmins-Schiffman E.
      • Mikan M.P.
      • Harvey H.R.
      • Borenstein E.
      • Nunn B.L.
      • Noble W.S.
      An alignment-free “metapeptide” strategy for metaproteomic characterization of microbiome samples using shotgun metagenomic sequencing.
      ), who developed a method to derive metapeptides (short amino acid sequences that may be encoded by multiple organisms) from shotgun metagenomic sequencing reads, and then used the metapeptides as the reference for peptide identification from metaproteomic MS/MS data. They showed that by constructing site-specific metapeptide databases, they were able to detect more than one and a half times as many peptides as by searching against predicted genes from an assembled metagenome and roughly three times as many peptides as by searching against the NCBI environmental proteome database. Some benchmarking experiments have also been designed to explore better experimental and computational strategies for metaproteomics (
      • Cantarel B.L.
      • Erickson A.R.
      • VerBerkmoes N.C.
      • Erickson B.K.
      • Carey P.A.
      • Pan C.
      • Shah M.
      • Mongodin E.F.
      • Jansson J.K.
      • Fraser-Liggett C.M.
      • Hettich R.L.
      Strategies for metagenomic-guided whole-community proteomics of complex microbial environments.
      ,
      • Rooijers K.
      • Kolmeder C.
      • Juste C.
      • Doré J.
      • De Been M.
      • Boeren S.
      • Galan P.
      • Beauvallet C.
      • de Vos W.M.
      • Schaap P.J.
      An iterative workflow for mining the human intestinal metaproteome.
      ) as well. We have also recognized the limitation of the conventional approach of using assembled metagenomes (e.g. using only the contigs) for peptide/protein identification from MS/MS data and developed a graph-centric approach (Graph2Pro) (
      • Tang H.
      • Li S.
      • Ye Y.
      A graph-centric approach for metagenome-guided peptide and protein identification in metaproteomics.
      ) that uses the assembly graph to take the assembly uncertainties into consideration. Our approach achieved significant improvement in peptide/protein identification from MS/MS data for microbiome research.
      Compared with (meta)genomic sequencing, metaproteomics (and proteomics in general) typically has relatively limited throughput. It is therefore important to develop methods that can extensively use the sequencing reads (in addition to assemblies) for peptide/protein identification from MS/MS data (sequencing reads from low-abundance species are unlikely to be assembled into contigs). Further, microbial communities often contain closely related species and strains with genomic variations, many of which are of low abundance (
      • Kuleshov V.
      • Jiang C.
      • Zhou W.
      • Jahanbani F.
      • Batzoglou S.
      • Snyder M.
      Synthetic long-read sequencing reveals intraspecies diversity in the human microbiome.
      ,
      • Sharon I.
      • Kertesz M.
      • Hug L.A.
      • Pushkarev D.
      • Blauwkamp T.A.
      • Castelle C.J.
      • Amirebrahimi M.
      • Thomas B.C.
      • Burstein D.
      • Tringe S.G.
      • Williams K.H.
      • Banfield J.F.
      Accurate, multi-kb reads resolve complex populations and detect rare microorganisms.
      ). Short sequencing reads containing variations are often “ignored” by the assemblers; as a result, our graph-based approach, which uses assembly graphs for MS/MS identification, is unable to use these reads. The open database search approach (
      • Kong A.T.
      • Leprevost F.V.
      • Avtonomov D.M.
      • Mellacheruvu D.
      • Nesvizhskii A.I.
      MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry–based proteomics.
      ) can in principle be applied to detect variations; it is, however, impractical for metaproteomic identification because of the unknown and extremely large search space. The goal of our new variant-aware approach is to incorporate the potential peptides encoded by the short sequencing reads that are not even assembled into the assembly graph, to improve the identification from MS/MS data. Instead of using variant callers (such as MIDAS (
      • Nayfach S.
      • Rodriguez-Mueller B.
      • Garud N.
      • Pollard K.S.
      An integrated metagenomics pipeline for strain profiling reveals novel patterns of bacterial transmission and biogeography.
      ) and metaSNV (
      • Costea P.I.
      • Munch R.
      • Coelho L.P.
      • Paoli L.
      • Sunagawa S.
      • Bork P.
      metaSNV: A tool for metagenomic strain level analysis.
      )) to detect potential variants from metagenomic and/or metatranscriptomic sequencing data (
      • Wilmes P.
      • Andersson A.F.
      • Lefsrud M.G.
      • Wexler M.
      • Shah M.
      • Zhang B.
      • Hettich R.L.
      • Bond P.L.
      • VerBerkmoes N.C.
      • Banfield J.F.
      Community proteogenomics highlights microbial strain-variant protein expression within activated sludge performing enhanced biological phosphorus removal.
      ), our variant-aware approach retains all potential variant-containing peptides and uses them as the reference for spectral database search. The rationale is that the variants are often of low abundance, so typical variant detection tools, which rely on sequencing coverage to distinguish variants from sequencing errors, may not work. If a potential variant is supported by only a few reads but is nevertheless supported by the MS/MS data, we consider it a confident variant. We also note that, different from the previous method (
      • May D.H.
      • Timmins-Schiffman E.
      • Mikan M.P.
      • Harvey H.R.
      • Borenstein E.
      • Nunn B.L.
      • Noble W.S.
      An alignment-free “metapeptide” strategy for metaproteomic characterization of microbiome samples using shotgun metagenomic sequencing.
      ), which includes all potential short peptides from short sequencing reads, our variant-aware approach only includes the peptides that are similar to the proteins that have already been identified from MS/MS spectra. We showed that our approach effectively reduced the search space for MS/MS spectral database searching, and identified more peptides from metaproteomic MS/MS data.
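      The filtering idea described above can be illustrated with a minimal sketch. It is not the Var2Pep implementation: the function names (`six_frame_peptides`, `keep_variant_candidates`), the one-mismatch similarity rule, the minimum fragment length, and the toy sequences are all assumptions made for illustration. The sketch translates each read in six frames and keeps only those read-derived peptide windows that differ by a small number of residues from a peptide that has already been identified.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate a DNA string codon by codon; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_peptides(read, min_len=6):
    """Yield stop-free protein fragments from all six reading frames."""
    reverse = read.translate(COMPLEMENT)[::-1]
    for strand in (read, reverse):
        for frame in range(3):
            for fragment in translate(strand[frame:]).split("*"):
                if len(fragment) >= min_len:
                    yield fragment


def mismatches(a, b):
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def keep_variant_candidates(reads, identified_peptides, max_mismatch=1):
    """Keep read-derived windows that differ from an already identified
    peptide by 1..max_mismatch residues (hypothetical similarity rule)."""
    kept = set()
    for read in reads:
        for fragment in six_frame_peptides(read):
            for peptide in identified_peptides:
                k = len(peptide)
                for i in range(len(fragment) - k + 1):
                    window = fragment[i:i + k]
                    if 0 < mismatches(window, peptide) <= max_mismatch:
                        kept.add(window)
    return kept


# Toy example: the read encodes MKTAYIVK, one residue away from MKTAYIAK.
identified = ["MKTAYIAK"]
candidates = keep_variant_candidates(["ATGAAAACCGCGTATATTGTGAAA"], identified)
print(candidates)
```

      Restricting the database to windows near already identified peptides is what keeps the search space small in this sketch; a read-wide database would instead keep every fragment yielded by `six_frame_peptides`.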

      RESULTS

      We tested our integrated pipeline for peptide/protein identification using the ocean water and wastewater multi-omics data sets. We compared our approach with the contig-based approach (in which proteins predicted from contigs are used as the reference), the read-based approach (using metapeptides derived from reads as the reference), and a simple combination of the read- and contig-based approaches (in which peptides were identified using the read- and contig-based approaches separately, and then merged as the final results). Our results showed that by considering variations and uncertainties of assembly, we can significantly improve the MS/MS spectral identification rate, and therefore identify more peptides (and proteins) for downstream analyses including functional profiling. The Var2Pep approach alone identified 5.6% to 24.1% more unique peptides, and our pipeline integrating the Graph2Pro and Var2Pep approaches improved the overall identification rate of the metaproteomic MS/MS spectra by about twofold, as compared with conventional contig- or read-based approaches.

      Performance of the Meta-proteogenomic Approach on the Seawater Microbiome Data Set

      We first tested the pipeline for peptide and protein identification from metaproteomic MS/MS data using the ocean data set. Table I lists the sizes of the CS and BSt metapeptide databases, as well as the sizes of the Var2Pep databases for comparison. The Var2Pep database contains far fewer read-derived peptides for MS/MS identification. For instance, for the BSt data set, there are about 15.9 million metapeptides, whereas the Var2Pep database contains only about 2.7 million peptides derived from 148 million mismatched or unmapped reads.
      Table I. Summary of the read-based MS/MS search databases for the ocean data sets

                                                      BSt            CS
      Number of peptides in metapeptides database     15,911,893     19,194,693
      Number of mismatched/unmatched reads            148,502,311    221,702,454
      Number of peptides in Var2Pep database          2,702,655      4,891,690

      Note: BSt, Bering Strait; CS, Chukchi Sea. The metapeptides databases were obtained from (
      • May D.H.
      • Timmins-Schiffman E.
      • Mikan M.P.
      • Harvey H.R.
      • Borenstein E.
      • Nunn B.L.
      • Noble W.S.
      An alignment-free “metapeptide” strategy for metaproteomic characterization of microbiome samples using shotgun metagenomic sequencing.
      ).
      Table II summarizes the peptide identification for selected data sets (BSt45 and CS51; see results for all data sets in the supplementary Data File S1) using different approaches. Fig. 2 shows the comparison results for all ocean water data sets. Briefly, our approach approximately doubled the number of identified peptides as compared with the contig-based and read-based approaches, when the same FDR (≤ 0.01, at spectrum level) estimated using the target-decoy search approach was applied. When FDR ≤ 0.01 at peptide level was applied, all methods unsurprisingly identified fewer peptides; however, we observed similar trends of improvement by Graph2Pro and Var2Pep: supplemental Fig. S1 shows the comparison of the different approaches on all ocean water data sets, using FDR ≤ 0.01 at peptide level, and supplemental Fig. S2 shows the side-by-side comparison of the results using FDR at spectrum or peptide level for the CS51 data set. It was shown in reference (
      • May D.H.
      • Timmins-Schiffman E.
      • Mikan M.P.
      • Harvey H.R.
      • Borenstein E.
      • Nunn B.L.
      • Noble W.S.
      An alignment-free “metapeptide” strategy for metaproteomic characterization of microbiome samples using shotgun metagenomic sequencing.
      ) that read-based approach using short amino acid sequences predicted from shotgun sequencing reads as the search database identified significantly more peptides as compared with the contig-based approach using genes predicted from contigs assembled by SOAPdenovo version 1.06 (
      • May D.H.
      • Timmins-Schiffman E.
      • Mikan M.P.
      • Harvey H.R.
      • Borenstein E.
      • Nunn B.L.
      • Noble W.S.
      An alignment-free “metapeptide” strategy for metaproteomic characterization of microbiome samples using shotgun metagenomic sequencing.
      ). Our results confirmed the observation that when SOAPdenovo2 was used as the assembler, the contig-based approach resulted in worse MS/MS identification (see Table II) than the read-based approach. However, when MegaHit was used as the assembler, the trend reversed and the contig-based approach identified more peptides than the read-based approach (see Fig. 2), whereas in both cases adding the Graph2Pro and Var2Pep steps further improved the peptide identification.
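      The difference between spectrum-level and peptide-level FDR filtering can be made concrete with a small sketch. This is a generic illustration of target-decoy filtering, not the scoring or filtering code of the pipeline: the function names, the tuple layout `(peptide, score, is_decoy)`, the simple estimator (number of decoys divided by number of targets above a cutoff), and the toy scores are assumptions.

```python
def best_psm_per_peptide(psms):
    """Collapse PSMs to the best-scoring PSM of each peptide,
    which is the unit used for peptide-level FDR."""
    best = {}
    for peptide, score, is_decoy in psms:
        if peptide not in best or score > best[peptide][1]:
            best[peptide] = (peptide, score, is_decoy)
    return list(best.values())


def accept_at_fdr(entries, max_fdr=0.01):
    """Return target entries above the most permissive score cutoff whose
    estimated FDR (#decoys / #targets) does not exceed max_fdr."""
    ranked = sorted(entries, key=lambda e: e[1], reverse=True)
    targets = decoys = cutoff = 0
    for rank, (_, _, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            cutoff = rank
    return [e for e in ranked[:cutoff] if not e[2]]


# Toy PSM list: (peptide, score, is_decoy); higher scores are better.
psms = [
    ("PEPA", 10.0, False),
    ("PEPA", 9.0, False),
    ("PEPB", 8.0, False),
    ("DECOY1", 7.0, True),
    ("PEPC", 6.0, False),
]

spectrum_level = accept_at_fdr(psms, max_fdr=0.01)
peptide_level = accept_at_fdr(best_psm_per_peptide(psms), max_fdr=0.01)
print(len(spectrum_level), "PSMs accepted at spectrum level")
print(len(peptide_level), "peptides accepted at peptide level")
```

      Because repeated PSMs of the same peptide are collapsed before the decoy ratio is computed, the peptide-level filter has fewer targets to offset each decoy, which is one reason it tends to be the stricter of the two.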
      Table II. Summary of peptide identification from MS/MS spectra for selected ocean data sets

                                  BSt45 (90,072 spectra)             CS51 (100,588 spectra)
      Approach                    PSMs (%)          Unique peptides  PSMs (%)          Unique peptides
      Reads based                 7795 (8.6%)       2958             15,167 (15.1%)    5652
      Contigs based (S*)          1892 (2.10%)      817              6526 (6.49%)      2442
      Graph2Pro (S*)              12,728 (14.1%)    4542             26,576 (26.4%)    9932
      Contigs based (M*)          8631 (9.6%)       3367             17,427 (17.3%)    6388
      Graph2Pro (M*)              15,172 (16.8%)    5857             29,072 (28.9%)    11,463
      Graph2Pro (M*) + Var2Pep    15,913 (17.7%)    6240             31,145 (31.0%)    12,380

      Note: S* represents SOAPdenovo2, M* represents MegaHit, PSM stands for peptide spectrum match. BSt stands for Bering Strait, and CS stands for Chukchi Sea. Graph2Pro (S*) and Graph2Pro (M*) represent using the assembly graph from SOAPdenovo2 (S*) and MegaHit (M*) as the reference in Graph2Pro, respectively. In all cases, FDR (false discovery rate) was estimated using a target-decoy search approach, and a cutoff of 1% at spectrum level was applied. This table only shows the results for two data sets. See Fig. 2 and supplementary Data File S1 for results of all data sets.
      Fig. 2. Comparison of peptide identification by the different approaches on the ocean data sets. The barplot shows the total number of unique peptides identified from six ocean metaproteomic MS/MS data sets, by different approaches.
      Our pipeline significantly outperformed the simple combination of contig-based and read-based approaches (red bars versus blue bars in Fig. 2). For instance, for the BSt45 data set, the reads-based method identified 7795 spectra (8.6% of the total), and the contig-based approach (using MegaHit) identified 8631 spectra (9.6% of the total). Combining the contig-based and reads-based searches resulted in the characterization of a total of 16,426 spectra/3813 peptides. By comparison, our pipeline identified a comparable number of spectra (15,913), but many more unique peptides (6240).

      Comparison of Our Approach with Grouped and Cascaded Searches

      Our pipeline uses separate database searches on the Graph2Pro and Var2Pep databases, and then combines the search results (we call this the separate search for comparison purposes). We compared this approach with two other approaches: a single spectral search on a grouped database combining both databases (called the grouped approach), and a cascaded search that uses only the spectra rejected by the search against the Graph2Pro database for the subsequent search against the Var2Pep database (i.e. the cascaded approach, similar to the approach developed by Kertesz-Farkas and colleagues (
      • Kertesz-Farkas A.
      • Keich U.
      • Noble W.S.
      Tandem mass spectrum identification via cascaded search.
      )). Table III summarizes the comparison of separate and cascaded searches at the same FDR level. For example, for CS51, the separate search against the Var2Pep database identified an additional 2073 spectra (917 unique peptides); by comparison, the cascaded search against the Var2Pep database identified fewer: 1598 additional spectra (629 unique peptides). Notably, this result differs from the previous report that cascaded search identified more spectra and unique peptides (
      • Kertesz-Farkas A.
      • Keich U.
      • Noble W.S.
      Tandem mass spectrum identification via cascaded search.
      ) than the separate search. One possible reason is that removing some high-quality spectra in the cascaded search may alter the overall score distribution, and thus the estimated false discovery rate of the search results. Our approach also outperformed (marginally) the search using the combined (grouped) database. For example, for the CS51 data set, the combined database search identified only 20,046 spectra and 7662 unique peptides; by contrast, our approach based on separate searches identified 31,145 spectra and 12,380 peptides, and the cascaded search identified 30,670 spectra and 12,092 peptides from the same spectra data set. Based on our tests, we recommend using either the separate or the cascaded search, but not the combined database, for spectrum search. Our pipeline supports both separate search (default) and cascaded search, and the results reported below are based on separate searches.
      Table III. Comparison of additional spectra and unique peptides identified by searching against the Var2Pep database using the separate search approach (our approach) versus cascaded search

                  Separate (our approach)    Cascaded
      Data set    PSMs      Peptides         PSMs      Peptides
      CS51        2073      917              1598      629
      CS52        2282      816              1692      533
      CS53        1969      625              1466      392
      BSt45       741       383              495       221
      BSt46       651       318              390       173
      BSt47       640       322              418       192

      Note: PSM stands for peptide spectrum match. BSt, Bering Strait; CS, Chukchi Sea.
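      The control flow of the three search strategies compared in this section can be sketched as follows. This is only an outline under stated assumptions: `search` stands for any function that takes spectra and a database and returns the accepted identifications at a fixed FDR, and the toy `toy_search`, the spectrum identifiers, and the peptide names are hypothetical. The toy search ignores score distributions, so it cannot reproduce the differences in Table III; it only shows which spectra are searched against which database.

```python
def separate_search(spectra, databases, search):
    """Search all spectra against each database independently, then merge.
    If a spectrum is identified in several databases, the earlier one wins."""
    identified = {}
    for database in databases:
        for spectrum_id, peptide in search(spectra, database).items():
            identified.setdefault(spectrum_id, peptide)
    return identified


def cascaded_search(spectra, databases, search):
    """Search each database in turn, passing on only unidentified spectra."""
    identified = {}
    remaining = list(spectra)
    for database in databases:
        hits = search(remaining, database)
        identified.update(hits)
        remaining = [s for s in remaining if s not in hits]
    return identified


def grouped_search(spectra, databases, search):
    """Search all spectra once against the union of all databases."""
    merged = set().union(*databases)
    return search(spectra, merged)


# Toy stand-in for a database search engine (hypothetical).
TRUE_PEPTIDE = {"s1": "PEPA", "s2": "PEPB", "s3": "PEPV", "s4": "PEPX"}


def toy_search(spectra, database):
    """Identify a spectrum when its true peptide is present in the database."""
    return {s: TRUE_PEPTIDE[s] for s in spectra if TRUE_PEPTIDE[s] in database}


graph2pro_db = {"PEPA", "PEPB"}
var2pep_db = {"PEPV"}
spectra = ["s1", "s2", "s3", "s4"]

for strategy in (separate_search, cascaded_search, grouped_search):
    result = strategy(spectra, [graph2pro_db, var2pep_db], toy_search)
    print(strategy.__name__, sorted(result.items()))
```

      In a real search the three strategies differ because each search call estimates its own FDR from its own score distribution: the cascaded search sees only the rejected spectra in its second stage, and the grouped search faces a larger combined database.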

      Performance of the Meta-proteogenomic Approach on the Wastewater Microbiome Data Set

      Next we tested our approach using the wastewater data sets. Fig. 3 illustrates the results of peptide identification using different approaches on this data set using FDR ≤ 0.01 at spectrum level (see supplementary Data File S2 for details; also see supplemental Fig. S3 for the comparison of the different approaches using FDR ≤ 0.01 at peptide level, and supplemental Fig. S4 for a side-by-side comparison of the results using FDR at spectrum versus peptide level for the SD3-MG data set). For the reads-based approach, we used the Sixgill software (https://github.com/dhmay/sixgill) (
      • May D.H.
      • Timmins-Schiffman E.
      • Mikan M.P.
      • Harvey H.R.
      • Borenstein E.
      • Nunn B.L.
      • Noble W.S.
      An alignment-free “metapeptide” strategy for metaproteomic characterization of microbiome samples using shotgun metagenomic sequencing.
      ) to predict putative peptides (i.e. metapeptides) from reads. The numbers of metapeptides predicted from the different data sets are: 266,430 (SD3-MG), 1,640,859 (SD3-MGMT), 2,614,012 (SD6-MG), 4,782,170 (SD6-MGMT), 1,800,892 (SD7-MG), and 3,977,520 (SD7-MGMT) (sequences of the metapeptides can be found at our supplementary website at http://omics.soic.indiana.edu/MP). In all cases, the contig-based approach achieved significantly better performance than the read-based approach, and our approach integrating Graph2Pro and Var2Pep significantly outperformed the other approaches.
      Fig. 3. Comparison of peptide identification results by the different approaches on the wastewater data sets. The barplot shows the total number of unique peptides identified from three wastewater samples (SD3, SD6, and SD7), using either matching metagenomic data alone (MG), or both metagenomic and metatranscriptomic data (MGMT) as the reference.
      As shown in Fig. 3, a significant improvement in peptide identification was achieved by using both metatranscriptomic and metagenomic data sets as the reference, as compared with using only metagenomic sequencing data (see supplementary Data File S2 for details). This result is expected, because metatranscriptomic data may provide sequence information for otherwise low-abundance species that may have been missed by metagenomic sequencing. We note that among the wastewater data sets, the Var2Pep approach resulted in the most significant improvement in peptide identification for the SD3 data set, with 24.1% and 19.5% more identified unique peptides when using the matched metagenomic data set alone (SD3-MG) and both the metagenomic and metatranscriptomic data sets (SD3-MGMT), respectively.
      Considering the increasing popularity of the metagenome assembler MetaSPAdes, we also used the wastewater data sets to test whether more contigs assembled from the metagenome result in better identification from metaproteomic data. Table IV shows the comparison of the results for the SD3-MG data set using assemblies from MegaHit and MetaSPAdes as the reference in our pipeline. Strikingly, MetaSPAdes produced about three times more contigs (and more than double the total bases in the contigs) as compared with MegaHit. However, the increase in contigs did not lead to a clear gain in identifications: MetaSPAdes yielded fewer PSMs in all three settings and at most about 2% more unique peptides (Table IV).
      Table IV. Comparison of the performance using metagenome assembly from MegaHit and MetaSPAdes as the reference on SD3-MG dataset
                                             MegaHit          MetaSPAdes
      Number of contigs                      111,739          446,773
      Total bases (MB)                       78               166
      Contig only (PSM/peptide)              43650 (11173)    40161 (10337)
      Graph2Pro (PSM/peptide)                63178 (17452)    57760 (17651)
      Graph2Pro + Var2Pep (PSM/peptide)      74317 (21662)    71366 (22043)
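
      As a quick arithmetic check on Table IV, the relative gain in unique peptides contributed by Var2Pep can be computed from the Graph2Pro and Graph2Pro + Var2Pep rows. The sketch below uses only the peptide counts listed in the table; the helper name `relative_gain` and the dictionary layout are our own illustrative choices and are not part of the pipeline.

```python
# Unique peptide counts for the SD3-MG data set, copied from Table IV.
peptides = {
    "MegaHit": {"graph2pro": 17452, "graph2pro_var2pep": 21662},
    "MetaSPAdes": {"graph2pro": 17651, "graph2pro_var2pep": 22043},
}


def relative_gain(before: int, after: int) -> float:
    """Return the percentage increase when going from `before` to `after`."""
    return 100.0 * (after - before) / before


# Percentage of additional unique peptides obtained by adding Var2Pep.
gains = {
    assembler: relative_gain(counts["graph2pro"], counts["graph2pro_var2pep"])
    for assembler, counts in peptides.items()
}

for assembler, gain in gains.items():
    print(f"{assembler}: {gain:.1f}% more unique peptides with Var2Pep")
```

      For the MegaHit assembly this gives roughly 24.1%, which appears consistent with the improvement quoted for SD3-MG; the MetaSPAdes assembly gives a similar relative gain.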

      Abundant Known (and Unknown) Proteins Revealed by Metaproteomics

      We examined the proteins that were identified using our pipeline. For example, for the CS51 data set, the most abundant protein was supported by 270 spectra (0.8% of the total 31,145 identified spectra for this data set). A total of 663 proteins or protein fragments (with a minimal length of 60 aa) were each supported by at least 10 spectra (see supplementary Data File S3 for details). Functional annotation of these proteins by myRAST (http://blog.theseed.org/downloads/myRAST-Intel.dmg) (
      • Overbeek R.
      • Olson R.
      • Pusch G.D.
      • Olsen G.J.
      • Davis J.J.
      • Disz T.
      • Edwards R.A.
      • Gerdes S.
      • Parrello B.
      • Shukla M.
      • Vonstein V.
      • Wattam A.R.
      • Xia F.
      • Stevens R.
      The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST).
      ) revealed 90 ribosomal proteins, 61 heat shock proteins (HSP60), 59 elongation factors (GTP_EFTU), 38 chaperone proteins (DnaK), 30 glutamate aspartate periplasmic binding protein precursors, 26 branched-chain amino acid ABC transporters, 10 TonB-dependent receptors, among other functions (see supplementary Data File S3 for details of the annotations). Our result is consistent with the fact that many of these functions (e.g. ribosomal proteins) are typically the most abundant and highly conserved in prokaryotic genomes.
      Among the 663 CS51 proteins each supported by at least 10 spectra, a significant fraction (93, 14%) have no functional annotation based on myRAST, including a protein (ID: NODE_17410_length_722_cov_188.0000_ID_34819_1_456_+) that has the third highest spectral support (257 spectra) and many others with strong spectral support. See supplementary Data File S3 for the complete list of these highly expressed proteins that lack any functional annotation (their sequences can be found at our supplementary website). We argue that these are "most wanted" proteins whose functions await further experimental study. Fig. 4 illustrates two examples of proteins supported by a significant number of variant peptides (details of the variant peptides are available at our supplementary website). Protein366 was predicted from the assembly graph by Graph2Pro and was shown to be supported by a total of 27 peptides, including six variant peptides. The myRAST annotation did not reveal any significant hits for this protein. We also performed a sequence similarity search of this protein using NCBI BLAST (http://www.ncbi.nlm.nih.gov/blast) against the nr database (as of Nov 18th, 2018), which returned no hits either. Although no significant sequence similarity was detected between Protein366 and proteins in the public databases, a structural comparison (by FATCAT (
      • Ye Y.
      • Godzik A.
      Flexible structure alignment by chaining aligned fragment pairs allowing twists.
      )) of this protein's predicted structure (using QUARK (
      • Xu D.
      • Zhang Y.
      Ab initio protein structure assembly using continuous structure fragments and optimized knowledge-based force field.
      )) revealed that it contains a Cystatin-like fold composed of helical segments packed against coiled antiparallel beta-sheet. Protein2811 is another example, which was annotated as a putative peptide/opine/nickel uptake transporter. This protein has a total of 17 supporting peptides, among which nine are variant peptides.
      Fig. 4. Selected examples of identified proteins and variants supported by metaproteomic MS/MS data (ocean water sample CS51 and wastewater sample SD3). The plots depict the regions in the proteins that are supported by identified peptides. The black lines on the top represent the proteins, with boxes in different colors showing predicted PFAM domains in these proteins (the orange box represents the SBP_bac_5 domain, the gray box represents DUF2815, and the blue boxes represent the SLH domains). The green lines below the protein lines represent MS/MS supported peptides from each protein (no mismatches), and the red lines represent peptide variants that share similarity with the protein, with the number of mismatches indicated by the bar on the left.
      For the wastewater data sets, an interesting example is a protein (Protein264) identified from the SD6 data set. This protein has the most spectral support (2266 spectra), and the domain prediction shows that it contains three SLH (S-layer homology) domains and another larger domain of unknown function (DUF2815). The presence of SLH domains in this protein suggests that it is likely a cell-surface protein, as studies have shown the use of SLH domains as vehicles for surface localization of functional proteins (
      • Misra C.S.
      • Basu B.
      • Apte S.K.
      Surface (S)-layer proteins of Deinococcus radiodurans and their utility as vehicles for surface localization of functional proteins.
      ,
      • Sychantha D.
      • Chapman R.N.
      • Bamford N.C.
      • Boons G.J.
      • Howell P.L.
      • Clarke A.J.
      Molecular basis for the attachment of S-layer proteins to the cell wall of Bacillus anthracis.
      ); however, the precise function of this protein remains to be determined. Fig. 4 (bottom) shows the locations of the domains and the distribution of identified peptides and variant peptides supporting this protein.

      Low Abundant Proteins with Supports Augmented by Variant Peptides

      As shown above, metaproteomics is likely limited to revealing highly abundant proteins expressed by highly abundant bacterial species, so metaproteomics-based functional profiling of microbiomes will only provide a shallow glimpse into the functionality of the underlying microbial community. One way to alleviate this limitation is to augment the identification of proteins using variant peptides. Here we use the wastewater data set (SD3-MG) as an example to illustrate this approach. We used BlastKOALA (https://www.kegg.jp/blastkoala/) (
      • Kanehisa M.
      • Sato Y.
      • Morishima K.
      BlastKOALA and GhostKOALA: KEGG tools for functional characterization of genome and metagenome sequences.
      ) to annotate the functions of identified proteins from this data set. A total of 1401 KEGG families (i.e. the KO families) were identified using proteins with at least one peptide support; however, this number increased to 1938 if variant peptides were also considered (a 38% improvement). We acknowledge that the number of families is still small, because of the shallow functional profiling of microbiomes based on metaproteomics.
      We note that the variant peptides identified by our pipeline have support at both the nucleotide level (from metagenomic and/or metatranscriptomic sequences) and the mass spectral level. Here, we examined the types of substitutions found in the variant peptides. Not surprisingly, we found that most variations are substitutions likely to be found in homologous proteins, with positive BLOSUM scores (BLOSUM matrices contain log-odds scores of all possible substitution pairs of amino acids, where positive scores represent favorable substitutions observed in homologous proteins, such as Asp to Asn, and negative scores represent unfavorable substitutions, such as Arg to Asp) (
      • Henikoff S.
      • Henikoff J.G.
      Performance evaluation of amino acid substitution matrices.
      ). As shown in Fig. 5, the average BLOSUM (BLOSUM62) score between two amino acids is −2 (see “all” box), whereas the average BLOSUM score of substitutions in variant peptides is positive for Protein366. This result indicates that Var2Pep likely detected biologically meaningful variations.
      Fig. 5. Var2Pep detected variations that are preferred substitutions among homologous proteins. The y axis shows the BLOSUM score of pairs of amino acids. The “all” box is for all pairs of possible amino acids (excluding pairs of identical residues), whereas the other box is for variations found by Var2Pep in Protein366.
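
      The kind of substitution scoring summarized in Fig. 5 can be sketched as follows. This is a minimal illustration, not the analysis code used in this work: the peptide sequences are hypothetical, the function names are ours, and only a small hand-copied subset of BLOSUM62 entries is included so that the example stays self-contained (a full analysis would load the complete matrix).

```python
from statistics import mean

# Small subset of BLOSUM62 scores, keyed by alphabetically sorted residue pairs.
# Only the pairs needed for this illustration are listed.
BLOSUM62_SUBSET = {
    ("D", "N"): 1,   # Asp <-> Asn, favorable
    ("D", "E"): 2,
    ("K", "R"): 2,
    ("I", "V"): 3,
    ("S", "T"): 1,
    ("D", "R"): -2,  # Arg <-> Asp, unfavorable
}


def blosum_score(a: str, b: str) -> int:
    """Look up the score of a substitution; the matrix is symmetric."""
    return BLOSUM62_SUBSET[tuple(sorted((a, b)))]


def substitutions(reference: str, variant: str) -> list:
    """List the mismatched residue pairs between two equal-length peptides."""
    if len(reference) != len(variant):
        raise ValueError("peptides must have the same length")
    return [(r, v) for r, v in zip(reference, variant) if r != v]


def mean_substitution_score(reference: str, variant: str) -> float:
    """Average BLOSUM score over all mismatched positions."""
    return mean(blosum_score(r, v) for r, v in substitutions(reference, variant))


# Hypothetical reference peptide and variant peptide (not from the data sets).
reference_peptide = "ADKIS"
variant_peptide = "ANRVT"
score = mean_substitution_score(reference_peptide, variant_peptide)
print(substitutions(reference_peptide, variant_peptide), score)
```

      A positive mean score, as in this toy case, corresponds to substitutions that are commonly observed among homologous proteins.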

      DISCUSSION AND CONCLUSIONS

      We developed the Var2Pep algorithm, which, combined with Graph2Pro, achieved a drastic improvement in peptide identification from metaproteomics data. Var2Pep makes use of shotgun sequencing reads for MS/MS identification, and it reduces the search space imposed by the large number of sequencing reads by keeping only peptides that share similarity with proteins already identified by Graph2Pro. As shown in Table I, there were 15.9 million putative peptides in the BSt data set from the mismatched/unmatched reads, whereas the Var2Pep database contained only 2.7 million peptides for the MS/MS database search. This reduction significantly decreased the number of possible decoys in the database, which makes the database search feasible and contributes significantly to the increased identification rate of metaproteomic MS/MS data. Similar ideas have been applied to improve metaproteomics data identification: for example, MetaLab (
      • Cheng K.
      • Ning Z.
      • Zhang X.
      • Li L.
      • Liao B.
      • Mayne J.
      • Stintzi A.
      • Figeys D.
      MetaLab: an automated pipeline for metaproteomic data analysis.
      ) uses a reduced gut reference gene catalogue to improve MS/MS identification. We note that Var2Pep may still be used when the assembly graph is not available (i.e. only the contigs are available). In such a case, peptides encoded in short sequencing reads that share similarity with MS/MS supported peptides identified from the contigs can be included in the database for the MS/MS search.
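
      The search-space reduction described above can be illustrated with a simplified filter. The sketch below is not the Var2Pep implementation: it assumes, purely for illustration, that similarity means an equal-length peptide within a small Hamming distance of an already identified peptide. The function names, the `max_mismatches` parameter, and the toy sequences are all hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def select_variant_candidates(read_peptides, identified_peptides, max_mismatches=2):
    """Keep read-derived peptides that are near, but not identical to,
    a peptide that was already identified."""
    # Group identified peptides by length so only comparable pairs are checked.
    by_length = {}
    for peptide in identified_peptides:
        by_length.setdefault(len(peptide), []).append(peptide)
    identified = set(identified_peptides)

    kept = []
    for candidate in read_peptides:
        if candidate in identified:
            continue  # exact matches are already covered by the first search
        for reference in by_length.get(len(candidate), []):
            if hamming(candidate, reference) <= max_mismatches:
                kept.append(candidate)
                break
    return kept


# Toy data: two identified peptides and five read-derived putative peptides.
identified_peptides = ["LDKISR", "AVGTPEK"]
read_peptides = ["LNKISR", "LDKISR", "AVGSPEK", "WWWWWWK", "MQQRPK"]
candidates = select_variant_candidates(read_peptides, identified_peptides)
print(candidates)
```

      Only the two peptides that differ from an identified peptide at a single position survive the filter, so the second database search sees a much smaller set of sequences than the full collection of read-derived peptides.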
      The main goal of our approaches is to optimize the use of matching metagenomic and/or metatranscriptomic data for metaproteomics data analysis. Our integrated pipeline results in two databases for MS/MS identification, one from Graph2Pro (Graph2Pro DB), and the other from Var2Pep (Var2Pep DB). Our pipeline provides one way of using these databases, as shown in Fig. 1, but users can make use of these databases in different ways. To obtain high-confidence peptide identifications, including variant peptides, we highly recommend that users apply strict false discovery rate (FDR) control at either the spectral level or the peptide level. A reasonable concern about our approach of combining the results from two separate searches, each at 1% FDR, is that the actual FDR would be higher. We re-calculated the FDRs for the combined results, based on the unique sets of identified spectra for all the data sets we tested. The results showed that the FDRs estimated using this approach were slightly higher, but still reasonably low (at about 1.5%).
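
      The re-calculation of the FDR on combined results can be sketched with a standard target-decoy estimate (number of decoy identifications divided by number of target identifications) over the unique set of identified spectra. The code below is a minimal sketch under several assumptions of ours: each search result is a mapping from spectrum ID to a (score, is_decoy) pair, higher scores are better, and a spectrum identified in both searches keeps its higher-scoring match. The counts are invented toy numbers, not results from this study.

```python
def estimate_fdr(psms: dict) -> float:
    """Target-decoy FDR estimate: decoy hits divided by target hits."""
    decoys = sum(1 for _, is_decoy in psms.values() if is_decoy)
    targets = len(psms) - decoys
    return decoys / targets if targets else 0.0


def combine_searches(first: dict, second: dict) -> dict:
    """Merge two search results, keeping one match per spectrum
    (the higher-scoring one when a spectrum appears in both)."""
    combined = dict(first)
    for spectrum_id, (score, is_decoy) in second.items():
        if spectrum_id not in combined or score > combined[spectrum_id][0]:
            combined[spectrum_id] = (score, is_decoy)
    return combined


# Toy search 1: 200 identified spectra (s0..s199), two of them decoy hits.
first = {f"s{i}": (10.0, i < 2) for i in range(200)}
# Toy search 2: 100 identified spectra (s150..s249), one decoy hit;
# 50 of these spectra were already identified in search 1.
second = {f"s{i}": (9.0, i == 249) for i in range(150, 250)}

combined = combine_searches(first, second)
print(f"search 1 FDR: {estimate_fdr(first):.4f}")
print(f"search 2 FDR: {estimate_fdr(second):.4f}")
print(f"combined FDR: {estimate_fdr(combined):.4f}")
```

      In this toy case each search is at about 1% FDR on its own, while the combined estimate is slightly higher because the overlapping spectra are mostly target hits that are counted only once, whereas the decoy hits from both searches accumulate.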
      It is possible to combine our approaches with those that are recently developed, including the utilization of spectral clustering to speed up the search as shown in (
      • Griss J.
      • Foster J.M.
      • Hermjakob H.
      • Vizcaino J.A.
      PRIDE Cluster: building a consensus of proteomics data.
      ). We used the MS-GF+ search engine (
      • Kim S.
      • Pevzner P.A.
      Ms-gf+ makes progress towards a universal database search tool for proteomics.
      ), which is one of the fastest MS/MS search engines. More recently developed tools for the fast database searching of metaproteomic MS/MS data such as ProteoStorm (
      • Beyter D.
      • Lin M.S.
      • Yu Y.
      • Pieper R.
      • Bafna V.
      Proteostorm: An ultrafast metaproteomics database search framework.
      ) can also be incorporated into the pipeline. Our significantly improved results indicate that there is still substantial room for algorithmic improvement in the field of metaproteomics, and in proteomics in general.
      The goal of metaproteomics is not only to identify proteins expressed in the microbial community, but also to estimate their abundances (i.e. their expression levels) under different conditions. Nevertheless, a protein can be quantified only if it can be identified by using the metaproteomic data. Therefore, the methods presented here that increase the coverage of protein identification will also help the subsequent steps for protein quantification. Furthermore, identification of peptide variants may contribute to accurate quantification of the proteins from which these peptides are derived. Tools such as MetaGOmics (
      • Riffle M.
      • May D.H.
      • Timmins-Schiffman E.
      • Mikan M.P.
      • Jaschob D.
      • Noble W.S.
      • Nunn B.L.
      A web-based tool for peptide-centric functional and taxonomic analysis of metaproteomics data.
      ) can be applied to infer functional and taxonomic contents of the microbial communities based on identified proteins/peptides. Better identification of peptides and/or proteins from metaproteomic MS/MS data can improve the functional and pathway annotation of the corresponding microbial communities, as we have shown in (
      • Tang H.
      • Li S.
      • Ye Y.
      A graph-centric approach for metagenome-guided peptide and protein identification in metaproteomics.
      ).
      We showed that metaproteomics (and matching metagenomics/metatranscriptomics) data can be used to identify hypothetical proteins that are sometimes very abundant in the microbial communities. These MS/MS supported hypothetical proteins provide interesting candidates for further studies of their functions using computational and experimental approaches. However, our analyses of the identified proteins also suggested that unless we significantly increase the depth of the metaproteomic experiments, what we identify is limited to highly abundant (often well-studied) proteins.
      In conclusion, we developed a pipeline for metaproteomic MS/MS data analysis using matching metagenomic and/or metatranscriptomic sequencing data as the reference. Tests of our pipeline using publicly available meta-omics data sets showed that it is important to consider the assembly uncertainties (captured in assembly graph) and genomic variants to maximize the utilization of metagenomes and/or metatranscriptomes as the reference for metaproteomic data interpretation.

      Data Availability

      Our pipeline for peptide/protein identification from metaproteomic data using matching metagenomic and/or metatranscriptomic data as the reference can be downloaded from https://github.com/COL-IU/graph2pro-var. We also made available the results (including the reference databases and search results) from analyzing the two collections of microbiome data sets at a supplementary website at http://omics.soic.indiana.edu/MP/ and on Zenodo at http://doi.org/10.5281/zenodo.2691363.

      REFERENCES

        • Crump B.C.
        • Armbrust E.V.
        • Baross J.A.
        Phylogenetic analysis of particle-attached and free-living bacterial communities in the columbia river, its estuary, and the adjacent coastal ocean.
        Appl. Environ. Microbiol. 1999; 65: 3192-3204
        • Santelli C.M.
        • Orcutt B.N.
        • Banning E.
        • Bach W.
        • Moyer C.L.
        • Sogin M.L.
        • Staudigel H.
        • Edwards K.J.
        Abundance and diversity of microbial life in ocean crust.
        Nature. 2008; 453: 653-656
        • Fierer N.
        • Lauber C.L.
        • Ramirez K.S.
        • Zaneveld J.
        • Bradford M.A.
        • Knight R.
        Comparative metagenomic, phylogenetic and physiological analyses of soil microbial communities across nitrogen gradients.
        ISME J. 2012; 6: 1007-1017
        • Fierer N.
        • Leff J.W.
        • Adams B.J.
        • Nielsen U.N.
        • Bates S.T.
        • Lauber C.L.
        • Owens S.
        • Gilbert J.A.
        • Wall D.H.
        • Caporaso J.G.
        Cross-biome metagenomic analyses of soil microbial communities and their functional attributes.
        Proc. of the Natl. Acad. Sci. USA. 2012; 109: 21390-21395
        • Qin J.
        • Li R.
        • Raes J.
        • Arumugam M.
        • Burgdorf K.S.
        • Manichanh C.
        • Nielsen T.
        • Pons N.
        • Levenez F.
        • Yamada T.
        A human gut microbial gene catalogue established by metagenomic sequencing.
        Nature. 2010; 464: 59-65
        • Gill S.R.
        • Pop M.
        • DeBoy R.T.
        • Eckburg P.B.
        • Turnbaugh P.J.
        • Samuel B.S.
        • Gordon J.I.
        • Relman D.A.
        • Fraser-Liggett C.M.
        • Nelson K.E.
        Metagenomic analysis of the human distal gut microbiome.
        Science. 2006; 312: 1355-1359
        • Ley R.E.
        • Turnbaugh P.J.
        • Klein S.
        • Gordon J.I.
        Microbial ecology: human gut microbes associated with obesity.
        Nature. 2006; 444: 1022-1023
        • Wilmes P.
        • Andersson A.F.
        • Lefsrud M.G.
        • Wexler M.
        • Shah M.
        • Zhang B.
        • Hettich R.L.
        • Bond P.L.
        • VerBerkmoes N.C.
        • Banfield J.F.
        Community proteogenomics highlights microbial strain-variant protein expression within activated sludge performing enhanced biological phosphorus removal.
        ISME J. 2008; 2: 853
        • Routy B.
        • Le Chatelier E.
        • Derosa L.
        • Duong C.P.M.
        • Alou M.T.
        • Daillere R.
        • Fluckiger A.
        • Messaoudene M.
        • Rauber C.
        • Roberti M.P.
        • Fidelle M.
        • Flament C.
        • Poirier-Colame V.
        • Opolon P.
        • Klein C.
        • Iribarren K.
        • Mondragon L.
        • Jacquelot N.
        • Qu B.
        • Ferrere G.
        • Clemenson C.
        • Mezquita L.
        • Masip J.R.
        • Naltet C.
        • Brosseau S.
        • Kaderbhai C.
        • Richard C.
        • Rizvi H.
        • Levenez F.
        • Galleron N.
        • Quinquis B.
        • Pons N.
        • Ryffel B.
        • Minard-Colin V.
        • Gonin P.
        • Soria J.C.
        • Deutsch E.
        • Loriot Y.
        • Ghiringhelli F.
        • Zalcman G.
        • Goldwasser F.
        • Escudier B.
        • Hellmann M.D.
        • Eggermont A.
        • Raoult D.
        • Albiges L.
        • Kroemer G.
        • Zitvogel L.
        Gut microbiome influences efficacy of PD-1-based immunotherapy against epithelial tumors.
        Science. 2018; 359: 91-97
        • Zhu W.
        • Winter M.G.
        • Byndloss M.X.
        • Spiga L.
        • Duerkop B.A.
        • Hughes E.R.
        • Büttner L.
        • de Lima Romão E.
        • Behrendt C.L.
        • Lopez C.A.
        • Sifuentes-Dominguez L.
        • Huff-Hardy K.
        • Wilson R.P.
        • Gillis C.C.
        • Tükel Ç.
        • Koh A.Y.
        • Burstein E.
        • Hooper L.V.
        • Bäumler A.J.
        • Winter S.E.
        Precision editing of the gut microbiota ameliorates colitis.
        Nature. 2018; 553: 208
        • Ho C.L.
        • Tan H.Q.
        • Chua K.J.
        • Kang A.
        • Lim K.H.
        • Ling K.L.
        • Yew W.S.
        • Lee Y.S.
        • Thiery J.P.
        • Chang M.W.
        Engineered commensal microbes for diet-mediated colorectal-cancer chemoprevention.
        Nat. Biomed. Eng. 2018; 2: 27-37
        • Shi Y.
        • Tyson G.W.
        • DeLong E.F.
        Metatranscriptomics reveals unique microbial small RNAs in the ocean's water column.
        Nature. 2009; 459: 266-269
        • Stewart F.J.
        • Ulloa O.
        • DeLong E.F.
        Microbial metatranscriptomics in a permanent marine oxygen minimum zone.
        Environ. Microbiol. 2012; 14: 23-40
        • Verberkmoes N.C.
        • Russell A.L.
        • Shah M.
        • Godzik A.
        • Rosenquist M.
        • Halfvarson J.
        • Lefsrud M.G.
        • Apajalahti J.
        • Tysk C.
        • Hettich R.L.
        Shotgun metaproteomics of the human distal gut microbiota.
        ISME J. 2009; 3: 179-189
        • Wilmes P.
        • Bond P.L.
        Metaproteomics: studying functional gene expression in microbial ecosystems.
        Trends Microbiol. 2006; 14: 92-97
        • Maron P.-A.
        • Ranjard L.
        • Mougel C.
        • Lemanceau P.
        Metaproteomics: a new approach for studying functional microbial ecology.
        Microbial Ecol. 2007; 53: 486-493
        • Erickson A.R.
        • Cantarel B.L.
        • Lamendella R.
        • Darzi Y.
        • Mongodin E.F.
        • Pan C.
        • Shah M.
        • Halfvarson J.
        • Tysk C.
        • Henrissat B.
        • Raes J.
        • Verberkmoes N.C.
        • Fraser C.M.
        • Hettich R.L.
        • Jansson J.K.
        Integrated metagenomics/metaproteomics reveals human host-microbiota signatures of Crohn's disease.
        PloS One. 2012; 7: 49138
        • Pachiadaki M.G.
        • Sintes E.
        • Bergauer K.
        • Brown J.M.
        • Record N.R.
        • Swan B.K.
        • Mathyer M.E.
        • Hallam S.J.
        • Lopez-Garcia P.
        • Takaki Y.
        • Nunoura T.
        • Woyke T.
        • Herndl G.J.
        • Stepanauskas R.
        Major role of nitrite-oxidizing bacteria in dark ocean carbon fixation.
        Science. 2017; 358: 1046-1051
        • Kleiner M.
        • Thorson E.
        • Sharp C.E.
        • Dong X.
        • Liu D.
        • Li C.
        • Strous M.
        Assessing species biomass contributions in microbial communities via metaproteomics.
        Nat. Commun. 2017; 8: 1558
        • Heyer R.
        • Schallert K.
        • Zoun R.
        • Becher B.
        • Saake G.
        • Benndorf D.
        Challenges and perspectives of metaproteomic data analysis.
        J. Biotechnol. 2017; 261: 24-36
        • Cheng K.
        • Ning Z.
        • Zhang X.
        • Li L.
        • Liao B.
        • Mayne J.
        • Stintzi A.
        • Figeys D.
        MetaLab: an automated pipeline for metaproteomic data analysis.
        Microbiome. 2017; 5: 157
        • Kertesz-Farkas A.
        • Keich U.
        • Noble W.S.
        Tandem mass spectrum identification via cascaded search.
        J. Proteome Res. 2015; 14: 3027-3038
        • Mesuere B.
        • Debyser G.
        • Aerts M.
        • Devreese B.
        • Vandamme P.
        • Dawyndt P.
        The unipept metaproteomics analysis pipeline.
        Proteomics. 2015; 15: 1437-1442
        • Jagtap P.
        • Goslinga J.
        • Kooren J.A.
        • McGowan T.
        • Wroblewski M.S.
        • Seymour S.L.
        • Griffin T.J.
        A two-step database search method improves sensitivity in peptide sequence matches for metaproteomics and proteogenomics studies.
        Proteomics. 2013; 13: 1352-1357
        • Muth T.
        • Behne A.
        • Heyer R.
        • Kohrs F.
        • Benndorf D.
        • Hoffmann M.
        • Lehtevä M.
        • Reichl U.
        • Martens L.
        • Rapp E.
        The metaproteomeanalyzer: a powerful open-source software suite for metaproteomics data analysis and interpretation.
        J. Proteome Res. 2015; 14: 1557-1565
        • Beyter D.
        • Lin M.S.
        • Yu Y.
        • Pieper R.
        • Bafna V.
        Proteostorm: An ultrafast metaproteomics database search framework.
        Cell Syst. 2018; 7: 463-467
        • Zhang X.
        • Ning Z.
        • Mayne J.
        • Moore J.I.
        • Li J.
        • Butcher J.
        • Deeke S.A.
        • Chen R.
        • Chiang C.K.
        • Wen M.
        • Mack D.
        • Stintzi A.
        • Figeys D.
        MetaPro-IQ: a universal metaproteomic approach to studying human and mouse gut microbiota.
        Microbiome. 2016; 4: 31
        • Kuleshov V.
        • Jiang C.
        • Zhou W.
        • Jahanbani F.
        • Batzoglou S.
        • Snyder M.
        Synthetic long-read sequencing reveals intraspecies diversity in the human microbiome.
        Nat. Biotechnol. 2016; 34: 64-69
        • Sharon I.
        • Kertesz M.
        • Hug L.A.
        • Pushkarev D.
        • Blauwkamp T.A.
        • Castelle C.J.
        • Amirebrahimi M.
        • Thomas B.C.
        • Burstein D.
        • Tringe S.G.
        • Williams K.H.
        • Banfield J.F.
        Accurate, multi-kb reads resolve complex populations and detect rare microorganisms.
        Genome Res. 2015; 25: 534-543
        • May D.H.
        • Timmins-Schiffman E.
        • Mikan M.P.
        • Harvey H.R.
        • Borenstein E.
        • Nunn B.L.
        • Noble W.S.
        An alignment-free “metapeptide” strategy for metaproteomic characterization of microbiome samples using shotgun metagenomic sequencing.
        J. Proteome Res. 2016; 15: 2697-2705
        • Cantarel B.L.
        • Erickson A.R.
        • VerBerkmoes N.C.
        • Erickson B.K.
        • Carey P.A.
        • Pan C.
        • Shah M.
        • Mongodin E.F.
        • Jansson J.K.
        • Fraser-Liggett C.M.
        • Hettich R.L.
        Strategies for metagenomic-guided whole-community proteomics of complex microbial environments.
        PloS One. 2011; 6: 27173
        • Rooijers K.
        • Kolmeder C.
        • Juste C.
        • Doré J.
        • De Been M.
        • Boeren S.
        • Galan P.
        • Beauvallet C.
        • de Vos W.M.
        • Schaap P.J.
        An iterative workflow for mining the human intestinal metaproteome.
        BMC Genomics. 2011; 12: 6
        • Tang H.
        • Li S.
        • Ye Y.
        A graph-centric approach for metagenome-guided peptide and protein identification in metaproteomics.
        PLoS Computational Biol. 2016; 12: 1005224
        • Kong A.T.
        • Leprevost F.V.
        • Avtonomov D.M.
        • Mellacheruvu D.
        • Nesvizhskii A.I.
        Msfragger: ultrafast and comprehensive peptide identification in mass spectrometry–based proteomics.
        Nat. Methods. 2017; 14: 513
        • Nayfach S.
        • Rodriguez-Mueller B.
        • Garud N.
        • Pollard K.S.
        An integrated metagenomics pipeline for strain profiling reveals novel patterns of bacterial transmission and biogeography.
        Genome Res. 2016; 26: 1612-1625
        • Costea P.I.
        • Munch R.
        • Coelho L.P.
        • Paoli L.
        • Sunagawa S.
        • Bork P.
        metaSNV: A tool for metagenomic strain level analysis.
        PLoS ONE. 2017; 12 (0182392)
        • Langmead B.
        • Trapnell C.
        • Pop M.
        • Salzberg S.L.
        Ultrafast and memory-efficient alignment of short dna sequences to the human genome.
        Genome Biol. 2009; 10: 25
        • Rho M.
        • Tang H.
        • Ye Y.
        Fraggenescan: predicting genes in short and error-prone reads.
        Nucleic Acids Res. 2010; 38: e191
        • Zhao Y.
        • Tang H.
        • Ye Y.
        Rapsearch2: a fast and memory-efficient protein similarity search tool for next-generation sequencing data.
        Bioinformatics. 2012; 28: 125-126
        • Muller E.E.
        • Pinel N.
        • Laczny C.C.
        • Hoopmann M.R.
        • Narayanasamy S.
        • Lebrun L.A.
        • Roume H.
        • Lin J.
        • May P.
        • Hicks N.D.
        • Heintz-Buschart A.
        • Wampach L.
        • Liu C.M.
        • Price L.B.
        • Gillece J.D.
        • Guignard C.
        • Schupp J.M.
        • Vlassis N.
        • Baliga N.S.
        • Moritz R.L.
        • Keim P.S.
        • Wilmes P.
        Community-integrated omics links dominance of a microbial generalist to fine-tuned resource usage.
        Nature Commun. 2014; 5: 5603
        • Deutsch E.W.
        • Lam H.
        • Aebersold R.
        Peptideatlas: a resource for target selection for emerging targeted proteomics workflows.
        EMBO Reports. 2008; 9: 429-434
        • Desiere F.
        • Deutsch E.W.
        • King N.L.
        • Nesvizhskii A.I.
        • Mallick P.
        • Eng J.
        • Chen S.
        • Eddes J.
        • Loevenich S.N.
        • Aebersold R.
        The peptideatlas project.
        Nucleic Acids Res. 2006; 34: 655-658
        • Bolger A.M.
        • Lohse M.
        • Usadel B.
        Trimmomatic: a flexible trimmer for illumina sequence data.
        Bioinformatics. 2014; 30: 2114-2120
        • Luo R.
        • Liu B.
        • Xie Y.
        • Li Z.
        • Huang W.
        • Yuan J.
        • He G.
        • Chen Y.
        • Pan Q.
        • Liu Y.
        • Tang J.
        • Wu G.
        • Zhang H.
        • Shi Y.
        • Liu Y.
        • Yu C.
        • Wang B.
        • Lu Y.
        • Han C.
        • Cheung D.W.
        • Yiu S.M.
        • Peng S.
        • Xiaoqian Z.
        • Liu G.
        • Liao X.
        • Li Y.
        • Yang H.
        • Wang J.
        • Lam T.W.
        • Wang J.
        Soapdenovo2: an empirically improved memory-efficient short-read de novo assembler.
        Gigascience. 2012; 1: 18
        • Li D.
        • Liu C.M.
        • Luo R.
        • Sadakane K.
        • Lam T.W.
        Megahit: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de bruijn graph.
        Bioinformatics. 2015; 31: 1674
        • Nurk S.
        • Meleshko D.
        • Korobeynikov A.
        • Pevzner P.A.
        metaspades: a new versatile metagenomic assembler.
        Genome Res. 2017; 27: 824-834
        • Vollmers J.
        • Wiegand S.
        • Kaster A.K.
        Comparing and Evaluating Metagenome Assembly Tools from a Microbiologist's Perspective - Not Only Size Matters!.
        PLoS ONE. 2017; 12 (0169662)
        • Kim S.
        • Pevzner P.A.
        Ms-gf+ makes progress towards a universal database search tool for proteomics.
        Nat. Communications. 2014; 5: 5277
        • Elias J.E.
        • Gygi S.P.
        Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry.
        Nat. Methods. 2007; 4: 207-214
        • Overbeek R.
        • Olson R.
        • Pusch G.D.
        • Olsen G.J.
        • Davis J.J.
        • Disz T.
        • Edwards R.A.
        • Gerdes S.
        • Parrello B.
        • Shukla M.
        • Vonstein V.
        • Wattam A.R.
        • Xia F.
        • Stevens R.
        The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST).
        Nucleic Acids Res. 2014; 42: 206-214
        • Ye Y.
        • Godzik A.
        Flexible structure alignment by chaining aligned fragment pairs allowing twists.
        Bioinformatics. 2003; 19: 246-255
        • Xu D.
        • Zhang Y.
        Ab initio protein structure assembly using continuous structure fragments and optimized knowledge-based force field.
        Proteins. 2012; 80: 1715-1735
        • Misra C.S.
        • Basu B.
        • Apte S.K.
        Surface (S)-layer proteins of Deinococcus radiodurans and their utility as vehicles for surface localization of functional proteins.
        Biochim. Biophys. Acta. 2015; 1848: 3181-3187
        • Sychantha D.
        • Chapman R.N.
        • Bamford N.C.
        • Boons G.J.
        • Howell P.L.
        • Clarke A.J.
        Molecular basis for the attachment of S-layer proteins to the cell wall of Bacillus anthracis.
        Biochemistry. 2018; 57: 1949-1953
        • Kanehisa M.
        • Sato Y.
        • Morishima K.
        BlastKOALA and GhostKOALA: KEGG tools for functional characterization of genome and metagenome sequences.
        J. Mol. Biol. 2016; 428: 726-731
        • Henikoff S.
        • Henikoff J.G.
        Performance evaluation of amino acid substitution matrices.
        Proteins. 1993; 17: 49-61
        • Griss J.
        • Foster J.M.
        • Hermjakob H.
        • Vizcaino J.A.
        PRIDE Cluster: building a consensus of proteomics data.
        Nat. Methods. 2013; 10: 95-96
        • Riffle M.
        • May D.H.
        • Timmins-Schiffman E.
        • Mikan M.P.
        • Jaschob D.
        • Noble W.S.
        • Nunn B.L.
        A web-based tool for peptide-centric functional and taxonomic analysis of metaproteomics data.
        Proteomes. 2017; 6: 2