
Methods, Tools and Current Perspectives in Proteogenomics*

  • Kelly V. Ruggles
    Affiliations
    Department of Medicine, New York University School of Medicine, New York, New York 10016;
  • Karsten Krug
    Affiliations
    The Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142;
  • Xiaojing Wang
    Affiliations
    Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, Texas 77030;

    Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030;
  • Karl R. Clauser
    Affiliations
    The Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142;
  • Jing Wang
    Affiliations
    Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, Texas 77030;

    Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030;
  • Samuel H. Payne
    Affiliations
    Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99354;
  • David Fenyö
    Correspondence
    To whom correspondence may be addressed: The Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA 02142, Tel.:+1-617-714-7483, E-mail: ; Tel.:+1-713-798-1443, E-mail: ; or Tel.:+1-212-263-2216, E-mail:.
    Affiliations
    Department of Biochemistry and Molecular Pharmacology, New York University School of Medicine, New York, New York 10016;

    §Institute for Systems Genetics, New York University School of Medicine, New York, New York 10016
  • Bing Zhang
    Correspondence
    To whom correspondence may be addressed: The Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA 02142, Tel.:+1-617-714-7483, E-mail: ; Tel.:+1-713-798-1443, E-mail: ; or Tel.:+1-212-263-2216, E-mail:.
    Affiliations
    Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, Texas 77030;

    Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030;
  • D.R. Mani
    Correspondence
    To whom correspondence may be addressed: The Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA 02142, Tel.:+1-617-714-7483, E-mail: ; Tel.:+1-713-798-1443, E-mail: ; or Tel.:+1-212-263-2216, E-mail:.
    Affiliations
    The Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142;
  • Author Footnotes
    * KVR, SHP, and DF were funded by National Cancer Institute (NCI) CPTAC award U24 CA210972. KVR and DF were funded by contract 13XS068 from Leidos Biomedical Research, Inc. SHP was funded by an Early Career Award from the U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research. XW, JW, and BZ were funded by National Cancer Institute (NCI) CPTAC awards U24CA159988 and U24CA210954 and by contract 13XS029 from Leidos Biomedical Research, Inc. BZ is Cancer Prevention & Research Institutes of Texas (CPRIT RR160027) Scholar and McNair Medical Institute Scholar. KK, KRC and DRM were funded by National Cancer Institute (NCI) CPTAC awards U24CA160034, U24CA210986 and U24CA210979. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
    ¶¶ Cofirst authors.
Open Access. Published: April 29, 2017. DOI: https://doi.org/10.1074/mcp.MR117.000024
      With combined technological advancements in high-throughput next-generation sequencing and deep mass spectrometry-based proteomics, proteogenomics, i.e. the integrative analysis of proteomic and genomic data, has emerged as a new research field. Early efforts in the field focused on improving protein identification using sample-specific genomic and transcriptomic sequencing data. More recently, integrative analysis of quantitative measurements from genomic and proteomic studies has yielded novel insights into gene expression regulation, cell signaling, and disease. Many methods and tools have been developed or adapted to enable an array of integrative proteogenomic approaches. In this article, we systematically classify published methods and tools into four major categories: (1) sequence-centric proteogenomics; (2) analysis of proteogenomic relationships; (3) integrative modeling of proteogenomic data; and (4) data sharing and visualization. We provide a comprehensive review of methods and available tools in each category and highlight their typical applications.
      The last decade has witnessed the rapid emergence of proteogenomics, a new research field at the interface of genomics and proteomics. The term proteogenomics came into use following a publication by George Church's group in 2004 describing a proteogenomic mapping technique that harnessed proteomic data to improve genome annotation of Mycoplasma pneumoniae (
      • Jaffe J.D.
      • Berg H.C.
      • Church G.M.
      Proteogenomic mapping as a complementary method to perform genome annotation.
      ). The reach of proteogenomics has since expanded with technological advancements enabling rapid and economical high-throughput DNA and RNA sequencing and deep mass spectrometry (MS)-based proteomics. These advancements have proved particularly useful for integrating nucleotide sequencing and MS data from the same sample, where genomic sequencing data can be used to improve protein identification through comprehensive protein sequence database construction. Proteomic data can then be used to demonstrate the validity and functional relevance of novel findings based on large-scale RNA and DNA sequencing projects, including coding sequence variants and novel coding transcripts. In addition to sequence-centric proteogenomic data integration, combined quantitative analysis of genomic and proteomic data has also been used to provide novel insights into multilevel gene expression regulation (
      • Liu Y.
      • Beyer A.
      • Aebersold R.
      On the dependency of cellular protein levels on mRNA abundance.
      ,
      • Vogel C.
      • Marcotte E.M.
      Insights into the regulation of protein abundance from proteomic and transcriptomic analyses.
      ,
      • Battle A.
      • Khan Z.
      • Wang S.H.
      • Mitrano A.
      • Ford M.J.
      • Pritchard J.K.
      • Gilad Y.
      Genomic variation. Impact of regulatory variation from RNA to protein.
      ,
      • Foss E.J.
      • Radulovic D.
      • Shaffer S.A.
      • Goodlett D.R.
      • Kruglyak L.
      • Bedalov A.
      Genetic variation shapes protein networks mainly through non-transcriptional mechanisms.
      ,
      • Foss E.J.
      • Radulovic D.
      • Shaffer S.A.
      • Ruderfer D.M.
      • Bedalov A.
      • Goodlett D.R.
      • Kruglyak L.
      Genetic basis of proteome variation in yeast.
      ,
      • Fu J.
      • Keurentjes J.J.B.
      • Bouwmeester H.
      • America T.
      • Verstappen F.W.A.
      • Ward J.L.
      • Beale M.H.
      • de Vos R.C.H.
      • Dijkstra M.
      • Scheltema R.A.
      • Johannes F.
      • Koornneef M.
      • Vreugdenhil D.
      • Breitling R.
      • Jansen R.C.
      System-wide molecular evidence for phenotypic buffering in Arabidopsis.
      ,
      • Ghazalpour A.
      • Bennett B.
      • Petyuk V.A.
      • Orozco L.
      • Hagopian R.
      • Mungrue I.N.
      • Farber C.R.
      • Sinsheimer J.
      • Kang H.M.
      • Furlotte N.
      • Park C.C.
      • Wen P.-Z.
      • Brewer H.
      • Weitz K.
      • Camp D.G.
      • Pan C.
      • Yordanova R.
      • Neuhaus I.
      • Tilford C.
      • Siemers N.
      • Gargalovic P.
      • Eskin E.
      • Kirchgessner T.
      • Smith D.J.
      • Smith R.D.
      • Lusis A.J.
      Comparative analysis of proteome and transcriptome variation in mouse.
      ,
      • Lappalainen T.
      • Sammeth M.
      • Friedländer M.R.
      • 't Hoen P.A.C.
      • Monlong J.
      • Rivas M.A.
      • Gonzàlez-Porta M.
      • Kurbatova N.
      • Griebel T.
      • Ferreira P.G.
      • Barann M.
      • Wieland T.
      • Greger L.
      • van Iterson M.
      • Almlöf J.
      • Ribeca P.
      • Pulyakhina I.
      • Esser D.
      • Giger T.
      • Tikhonov A.
      • Sultan M.
      • Bertier G.
      • MacArthur D.G.
      • Lek M.
      • Lizano E.
      • Buermans H.P.J.
      • Padioleau I.
      • Schwarzmayr T.
      • Karlberg O.
      • Ongen H.
      • Kilpinen H.
      • Beltran S.
      • Gut M.
      • Kahlem K.
      • Amstislavskiy V.
      • Stegle O.
      • Pirinen M.
      • Montgomery S.B.
      • Donnelly P.
      • McCarthy M.I.
      • Flicek P.
      • Strom T.M.
      • Geuvadis Consortium
      • Lehrach H.
      • Schreiber S.
      • Sudbrak R.
      • Carracedo A.
      • Antonarakis S.E.
      • Häsler R.
      • Syvänen A.-C.
      • van Ommen G.-J.
      • Brazma A.
      • Meitinger T.
      • Rosenstiel P.
      • Guigó R.
      • Gut I.G.
      • Estivill X.
      • Dermitzakis E.T.
      Transcriptome and genome sequencing uncovers functional variation in humans.
      ,
      • Zhang B.
      • Wang J.
      • Wang X.
      • Zhu J.
      • Liu Q.
      • Shi Z.
      • Chambers M.C.
      • Zimmerman L.J.
      • Shaddox K.F.
      • Kim S.
      • Davies S.R.
      • Wang S.
      • Wang P.
      • Kinsinger C.R.
      • Rivers R.C.
      • Rodriguez H.
      • Townsend R.R.
      • Ellis M.J.C.
      • Carr S.A.
      • Tabb D.L.
      • Coffey R.J.
      • Slebos R.J.C.
      • Liebler D.C.
      • NCI CPTAC
      Proteogenomic characterization of human colon and rectal cancer.
      ,
      • Liu Q.
      • Halvey P.J.
      • Shyr Y.
      • Slebos R.J.C.
      • Liebler D.C.
      • Zhang B.
      Integrative omics analysis reveals the importance and scope of translational repression in microRNA-mediated regulation.
      ,
      • Mertins P.
      • Mani D.R.
      • Ruggles K.V.
      • Gillette M.A.
      • Clauser K.R.
      • Wang P.
      • Wang X.
      • Qiao J.W.
      • Cao S.
      • Petralia F.
      • Kawaler E.
      • Mundt F.
      • Krug K.
      • Tu Z.
      • Lei J.T.
      • Gatza M.L.
      • Wilkerson M.
      • Perou C.M.
      • Yellapantula V.
      • Huang K.
      • Lin C.
      • McLellan M.D.
      • Yan P.
      • Davies S.R.
      • Townsend R.R.
      • Skates S.J.
      • Wang J.
      • Zhang B.
      • Kinsinger C.R.
      • Mesri M.
      • Rodriguez H.
      • Ding L.
      • Paulovich A.G.
      • Fenyö D.
      • Ellis M.J.
      • Carr S.A.
      • NCI CPTAC
      Proteogenomics connects somatic mutations to signalling in breast cancer.
      ,
      • Zhang H.
      • Liu T.
      • Zhang Z.
      • Payne S.H.
      • Zhang B.
      • McDermott J.E.
      • Zhou J.-Y.
      • Petyuk V.A.
      • Chen L.
      • Ray D.
      • Sun S.
      • Yang F.
      • Chen L.
      • Wang J.
      • Shah P.
      • Cha S.W.
      • Aiyetan P.
      • Woo S.
      • Tian Y.
      • Gritsenko M.A.
      • Clauss T.R.
      • Choi C.
      • Monroe M.E.
      • Thomas S.
      • Nie S.
      • Wu C.
      • Moore R.J.
      • Yu K.-H.
      • Tabb D.L.
      • Fenyö D.
      • Bafna V.
      • Wang Y.
      • Rodriguez H.
      • Boja E.S.
      • Hiltke T.
      • Rivers R.C.
      • Sokoll L.
      • Zhu H.
      • Shih I.-M.
      • Cope L.
      • Pandey A.
      • Zhang B.
      • Snyder M.P.
      • Levine D.A.
      • Smith R.D.
      • Chan D.W.
      • Rodland K.D.
      • Investigators CPTAC
      Integrated proteogenomic characterization of human high-grade serous ovarian cancer.
      ), signaling networks (
      • Ryu G.-M.
      • Song P.
      • Kim K.-W.
      • Oh K.-S.
      • Park K.-J.
      • Kim J.H.
      Genome-wide analysis to predict protein sequence variations that change phosphorylation sites or their corresponding kinases.
      ,
      • Ren J.
      • Jiang C.
      • Gao X.
      • Liu Z.
      • Yuan Z.
      • Jin C.
      • Wen L.
      • Zhang Z.
      • Xue Y.
      • Yao X.
      PhosSNP for systematic analysis of genetic polymorphisms that influence protein phosphorylation.
      ,
      • Creixell P.
      • Schoof E.M.
      • Simpson C.D.
      • Longden J.
      • Miller C.J.
      • Lou H.J.
      • Perryman L.
      • Cox T.R.
      • Zivanovic N.
      • Palmeri A.
      • Wesolowska-Andersen A.
      • Helmer-Citterich M.
      • Ferkinghoff-Borg J.
      • Itamochi H.
      • Bodenmiller B.
      • Erler J.T.
      • Turk B.E.
      • Linding R.
      Kinome-wide decoding of network-attacking mutations rewiring cancer signaling.
      ,
      • Reimand J.
      • Wagih O.
      • Bader G.D.
      The mutational landscape of phosphorylation signaling in cancer.
      ), disease subtypes (
      • Zhang B.
      • Wang J.
      • Wang X.
      • Zhu J.
      • Liu Q.
      • Shi Z.
      • Chambers M.C.
      • Zimmerman L.J.
      • Shaddox K.F.
      • Kim S.
      • Davies S.R.
      • Wang S.
      • Wang P.
      • Kinsinger C.R.
      • Rivers R.C.
      • Rodriguez H.
      • Townsend R.R.
      • Ellis M.J.C.
      • Carr S.A.
      • Tabb D.L.
      • Coffey R.J.
      • Slebos R.J.C.
      • Liebler D.C.
      • NCI CPTAC
      Proteogenomic characterization of human colon and rectal cancer.
      ,
      • Mertins P.
      • Mani D.R.
      • Ruggles K.V.
      • Gillette M.A.
      • Clauser K.R.
      • Wang P.
      • Wang X.
      • Qiao J.W.
      • Cao S.
      • Petralia F.
      • Kawaler E.
      • Mundt F.
      • Krug K.
      • Tu Z.
      • Lei J.T.
      • Gatza M.L.
      • Wilkerson M.
      • Perou C.M.
      • Yellapantula V.
      • Huang K.
      • Lin C.
      • McLellan M.D.
      • Yan P.
      • Davies S.R.
      • Townsend R.R.
      • Skates S.J.
      • Wang J.
      • Zhang B.
      • Kinsinger C.R.
      • Mesri M.
      • Rodriguez H.
      • Ding L.
      • Paulovich A.G.
      • Fenyö D.
      • Ellis M.J.
      • Carr S.A.
      • NCI CPTAC
      Proteogenomics connects somatic mutations to signalling in breast cancer.
      ,
      • Zhang H.
      • Liu T.
      • Zhang Z.
      • Payne S.H.
      • Zhang B.
      • McDermott J.E.
      • Zhou J.-Y.
      • Petyuk V.A.
      • Chen L.
      • Ray D.
      • Sun S.
      • Yang F.
      • Chen L.
      • Wang J.
      • Shah P.
      • Cha S.W.
      • Aiyetan P.
      • Woo S.
      • Tian Y.
      • Gritsenko M.A.
      • Clauss T.R.
      • Choi C.
      • Monroe M.E.
      • Thomas S.
      • Nie S.
      • Wu C.
      • Moore R.J.
      • Yu K.-H.
      • Tabb D.L.
      • Fenyö D.
      • Bafna V.
      • Wang Y.
      • Rodriguez H.
      • Boja E.S.
      • Hiltke T.
      • Rivers R.C.
      • Sokoll L.
      • Zhu H.
      • Shih I.-M.
      • Cope L.
      • Pandey A.
      • Zhang B.
      • Snyder M.P.
      • Levine D.A.
      • Smith R.D.
      • Chan D.W.
      • Rodland K.D.
      • Investigators CPTAC
      Integrated proteogenomic characterization of human high-grade serous ovarian cancer.
      ), and clinical prediction (
      • Barretina J.
      • Caponigro G.
      • Stransky N.
      • Venkatesan K.
      • Margolin A.A.
      • Kim S.
      • Wilson C.J.
      • Lehár J.
      • Kryukov G.V.
      • Sonkin D.
      • Reddy A.
      • Liu M.
      • Murray L.
      • Berger M.F.
      • Monahan J.E.
      • Morais P.
      • Meltzer J.
      • Korejwa A.
      • Jané-Valbuena J.
      • Mapa F.A.
      • Thibault J.
      • Bric-Furlong E.
      • Raman P.
      • Shipway A.
      • Engels I.H.
      • Cheng J.
      • Yu G.K.
      • Yu J.
      • Aspesi P.
      • de Silva M.
      • Jagtap K.
      • Jones M.D.
      • Wang L.
      • Hatton C.
      • Palescandolo E.
      • Gupta S.
      • Mahan S.
      • Sougnez C.
      • Onofrio R.C.
      • Liefeld T.
      • MacConaill L.
      • Winckler W.
      • Reich M.
      • Li N.
      • Mesirov J.P.
      • Gabriel S.B.
      • Getz G.
      • Ardlie K.
      • Chan V.
      • Myer V.E.
      • Weber B.L.
      • Porter J.
      • Warmuth M.
      • Finan P.
      • Harris J.L.
      • Meyerson M.
      • Golub T.R.
      • Morrissey M.P.
      • Sellers W.R.
      • Schlegel R.
      • Garraway L.A.
      The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity.
      ,
      • Ray B.
      • Henaff M.
      • Ma S.
      • Efstathiadis E.
      • Peskin E.R.
      • Picone M.
      • Poli T.
      • Aliferis C.F.
      • Statnikov A.
      Information content and analysis methods for multi-modal high-throughput biomedical data.
      ,
      • Ma S.
      • Ren J.
      • Fenyö D.
      Breast Cancer Prognostics Using Multi-Omics Data.
      ). In this review, we subscribe to an expansive view of proteogenomics, encompassing all areas of integrative proteomic and genomic data analysis, and we cover the range of tools developed to tackle the associated challenges.
      To complement already published review papers that focus on specific sub-domains of the broad proteogenomics research area (
      • Menschaert G.
      • Fenyö D.
      Proteogenomics from a bioinformatics angle: A growing field.
      ,
      • Nesvizhskii A.I.
      Proteogenomics: concepts, applications and computational strategies.
      ,
      • Wang X.
      • Liu Q.
      • Zhang B.
      Leveraging the complementary nature of RNA-Seq and shotgun proteomics data.
      ,
      • Wang X.
      • Zhang B.
      Integrating genomic, transcriptomic, and interactome data to improve Peptide and protein identification in shotgun proteomics.
      ), we systematically classified existing methods and tools for various types of integrative proteogenomic studies into four major sections. “Sequence-centric Proteogenomics” describes the combined use of genomic and proteomic data to augment gene or protein annotation (Fig. 1). “Analysis of Proteogenomic Relationships” explores relationships between genomic and proteomic data using correlation, with application to deciphering the effect of mutations on signaling (Fig. 2). “Integrative Modeling of Proteogenomic Data” summarizes integrative modeling and analysis of proteogenomic data using statistical and machine learning approaches (Fig. 3). “Data Sharing and Visualization” discusses genome (Fig. 4) and network visualization (Fig. 5), along with challenges in data sharing. All four sections of the review assume tandem MS (MS/MS) as the core proteomics technology for generating peptide sequence data.
      Fig. 1. Sequence-centric proteogenomics. Technologies for sequencing DNA (whole genome sequencing, WGS; whole exome sequencing, WXS) and RNA (RNA-seq) generate millions of short sequencing reads that are assembled into genomes, exomes or transcriptomes, either de novo or by template-based alignment to a reference sequence. Sample-specific sequence aberrations are determined and nucleotide sequences are transformed into personalized, amino acid-centric sequence databases. Peptide mass spectra derived by LC-MS/MS analysis from a matching sample are then scored and validated against the personalized database, enabling the detection of sample-specific peptide sequences. Depending on the scope of the proteogenomic project, these peptides can then be used to (1) aid genome annotation by detection of peptides in unannotated genome regions; (2) identify tumor-specific mutations translated into the proteome as well as novel protein splice variants; and (3) detect species-specific peptides in microbial communities.
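      The step in Fig. 1 that turns a sample-specific variant into a searchable database entry can be sketched as follows. This is a minimal illustration rather than any published tool: the function names are hypothetical, the digestion rule is a simplified trypsin rule with no missed cleavages, and the 16-residue sequence is a toy fragment intended to resemble the N-terminus of KRAS (it should be checked against a reference database before being relied on).

```python
import re

def apply_saav(protein: str, position: int, ref: str, alt: str) -> str:
    """Apply a single amino acid variant; position is 1-based."""
    found = protein[position - 1]
    if found != ref:
        raise ValueError(f"expected {ref} at position {position}, found {found}")
    return protein[:position - 1] + alt + protein[position:]

def tryptic_peptides(protein: str) -> list[str]:
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]

def variant_specific_peptides(reference: str, variant: str) -> list[str]:
    """Peptides produced by the variant protein but not by the reference."""
    ref_peptides = set(tryptic_peptides(reference))
    return sorted(p for p in set(tryptic_peptides(variant)) if p not in ref_peptides)

# Toy reference fragment (assumption: illustrative only).
REFERENCE = "MTEYKLVVVGAGGVGK"
# A glycine-to-valine substitution at position 12, as a variant caller might report.
VARIANT = apply_saav(REFERENCE, 12, "G", "V")

if __name__ == "__main__":
    # Only the peptide covering the substituted residue is new to the database.
    print(variant_specific_peptides(REFERENCE, VARIANT))
```

      Appending only the variant-specific peptides (or the full variant protein) to the reference database keeps the search space small while still allowing the variant peptide to be matched.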
      Fig. 2. Proteogenomic relationships. A, Correlation analysis of mRNA and protein pairs across samples enables the assessment of global correlation structure, which typically centers between correlation coefficients of 0.3 and 0.5. B, Regulatory effects on RNA and protein expression levels caused by copy number aberrations (CNA), genetic variants (eQTL) and microRNAs (miRNAs) can be studied by different correlation-based approaches. CNA cis and trans effects on RNA, protein and PTM expression can be determined by correlating each gene copy number at a given locus to all quantified features in RNA, protein or PTM space across all samples. Expression quantitative trait loci (eQTL) analysis can be used to identify DNA sequence variants affecting RNA/protein expression levels in the sample population being studied. Global miRNA analysis accompanied by mRNA or protein profiling enables the assessment of miRNA-mediated regulation of mRNA and protein expression. C, Integrative analysis of genetic variants and PTM sites such as phosphorylation sites can identify functional consequences of genetic variants at the molecular level. Mutations that directly affect serine, threonine and tyrosine residues can result in destruction or genesis of phosphosites (I); mutations adjacent to phosphosites can result in removal or addition of phosphosites (II) or change the kinase that recognizes the phosphorylation site (III).
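      The gene-wise mRNA-protein correlation described in Fig. 2A can be sketched with the standard library alone. The choice of Spearman rank correlation, the function names, and the toy abundance values are assumptions made for illustration; published studies differ in the coefficient and preprocessing they use.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return sxy / sqrt(sxx * syy)

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def genewise_correlation(mrna, protein):
    """Correlate each gene quantified at both levels across matched samples.

    Both arguments map gene -> list of abundances, samples in the same order.
    """
    return {g: spearman(mrna[g], protein[g]) for g in mrna.keys() & protein.keys()}

# Toy data (assumption): five matched samples, hypothetical gene names.
MRNA = {"GENE_A": [1, 2, 3, 4, 5], "GENE_B": [1, 2, 3, 4, 5], "GENE_C": [2, 2, 3, 1, 4]}
PROTEIN = {"GENE_A": [2, 4, 5, 8, 10], "GENE_B": [5, 3, 4, 1, 2]}

if __name__ == "__main__":
    for gene, rho in sorted(genewise_correlation(MRNA, PROTEIN).items()):
        print(gene, round(rho, 2))
```

      The distribution of the resulting per-gene coefficients is what panel A summarizes; genes quantified at only one level (GENE_C here) drop out of the comparison.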
      Fig. 3. Integrative modeling. Overview of sub-topics in integrative modeling of proteogenomic data. A, Clustering techniques, illustrated by a schematic of multi-omic hierarchical clustering analysis that identifies two subtypes; B, predictive modeling for disease diagnosis, prognosis, drug response and drug toxicity using multiple data modalities; and C, proteogenomic pathway and network modeling, including informing network composition and pathway and GO term enrichment.
      Fig. 4. Genome-based visualization, using proBAM as an example. proBAM is a data format to integrate mass spectrometry data with the genome. In this example, we show the visualization of 10 colorectal cancer cell lines in proBAM format. The PSMs result from a search against a customized database built from matched RNA-Seq data, which are also incorporated into the visualization. A, Integrative Genomics Viewer (IGV) snapshot visualizes peptides and RNA-Seq reads mapped to KRAS in one window. The upper panel shows proteomic data from 10 colon cancer cell lines indicated by different colors. The bottom three panels illustrate RNA-Seq data from cell lines HCT15, Caco-2 and SW480, respectively. B, Zoomed-in view of an exon region in KRAS. Similar to RNA-Seq reads (three bottom panels), peptides mapped to the genome can be classified into within-exon peptides and junction peptides in the proBAM file (upper panel). C, The upper panel shows a zoomed-in view of mutations confirmed by both RNA-Seq and proteomic data in KRAS. A G13D mutation in HCT15 and a G12V mutation in SW480 are observed in both transcriptomic (second and fourth panel) and proteomic (first panel) data, whereas wild-type peptide is observed in Caco-2 (third panel).
      Fig. 5. NetGestalt-based analysis of colorectal cancer proteomics data. A, NetGestalt created a one-dimensional (1-D), linear order of all 12,112 genes in a protein-protein interaction network based on the hierarchical modular organization of the network. The ruler indicates coordinates of the genes in the resulting linear order. B, Each bar represents a module identified from the network and alternating bar colors (green and orange) are used to distinguish neighboring modules. C, Colorectal cancer proteomics data visualized as a heat map with each row representing a sample. Red and blue colors in the heat map represent relative over- and under-expression, respectively. All 12,112 genes in the network are visualized in the heat map, and missing values in the proteomic data are indicated in gray color. The samples (rows) of the data are ordered based on the five subtypes visualized beside the heat map. D, Signed minus log10 transformed p values of the difference between subtype 3 and other subtypes visualized by the bar plot. Missing values are indicated by yellow bars in the bar plot. E, Significantly over-expressed genes in subtype 3 (FDR<0.05). F–H, Zoomed-in view of a region corresponding to one of the over-represented modules. I, Genes in (H) visualized in a node-link diagram. The edges in the diagram represent protein-protein interactions.

      Sequence-centric Proteogenomics

      In this section we review several areas of sequence-centric proteogenomics. This includes the integrative analysis of genomic and proteomic data for exome annotation in the form of gene discovery and gene model refinement (Proteomics Aiding Genome Annotation); protein level detection of single amino acid variants (SAAVs), insertions, deletions, alternative splice junctions and novel gene fusions in relation to a reference genome sequence (Personalized Protein Sequence Databases); the application of proteomic sequencing to characterize antibodies (Sequencing of Antibodies); studying the effects of viral infections and transposons on gene expression in eukaryotic organisms (Viral Infections and Activations of Transposable Elements); and applications of proteogenomics to metaproteomic investigation (Metaproteogenomics).
      The abbreviations used are: SAAV, single amino acid variant; DNA, deoxyribonucleic acid; RNA, ribonucleic acid; MS/MS, tandem mass spectrometry; EST, expressed sequence tag; ORF, open reading frame; TCGA, The Cancer Genome Atlas; CPTAC, Clinical Proteomic Tumor Analysis Consortium; FDR, false discovery rate; SNP, single nucleotide polymorphism; PTM, post-translational modification; PSM, peptide spectrum match; RPPA, reverse-phase protein array; GSEA, gene set enrichment analysis; ssGSEA, single-sample gene set enrichment analysis; SNV, single nucleotide variant; nsSNV, nonsynonymous single nucleotide variant; CNA, copy number aberration.
      A concept integral to all five topics in this section is the importance of an inclusive and high-quality protein sequence database for peptide identification. In a typical proteomic experiment, peptide MS/MS spectra are interpreted using a database search algorithm that matches each experimental spectrum against model spectra constructed from the peptide sequences in a user-supplied protein sequence database and scores their similarity (
      • Aebersold R.
      • Mann M.
      Mass-spectrometric exploration of proteome structure and function.
      ). This strategy is used, in part, because current MS/MS instrumentation does not fragment peptides efficiently enough to consistently yield spectra from which complete, unambiguous sequences can be interpreted de novo (the current state of automated de novo peptide interpretation has been reviewed elsewhere (
      • Medzihradszky K.F.
      • Chalkley R.J.
      Lessons in de novo peptide sequencing by tandem mass spectrometry.
      ,
      • Yan Y.
      • Kusalik A.J.
      • Wu F.-X.
      Recent developments in computational methods for de novo peptide sequencing from tandem mass spectrometry (MS/MS).
      )). To address this, researchers use a protein sequence database, ideally containing all protein sequences one expects to be present in the sample, with minimal irrelevant sequences to reduce false spectral matches and search time. Limiting the number of candidate sequences in the form of a sequence database enables the sequence ambiguity present in a spectrum to be overcome, resulting in high confidence peptide spectrum matches (PSMs) (
      • Clauser K.R.
      • Baker P.
      • Burlingame A.L.
      Role of accurate mass measurement (+/- 10 ppm) in protein identification strategies employing MS or MS/MS and database searching.
      ).
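      The role of the sequence database as a filter on candidate peptides can be illustrated with a small sketch. It covers only the first stage of a database search, selecting candidate peptides whose calculated mass agrees with a measured precursor mass within a ppm tolerance; fragment-ion scoring is omitted. The function names, the toy protein entries, and the default 10 ppm tolerance are assumptions for illustration, and the residue masses are standard monoisotopic values rounded to five decimals.

```python
import re

# Monoisotopic residue masses in daltons (rounded).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056

def peptide_mass(peptide: str) -> float:
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

def digest(protein: str) -> list[str]:
    """Simplified trypsin rule: cleave after K/R unless followed by P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]

def candidates(observed_mass: float, database: dict[str, str], tol_ppm: float = 10.0):
    """Return (protein name, peptide) pairs whose mass is within tol_ppm."""
    hits = []
    for name, sequence in database.items():
        for peptide in digest(sequence):
            mass = peptide_mass(peptide)
            if abs(mass - observed_mass) / mass * 1e6 <= tol_ppm:
                hits.append((name, peptide))
    return hits

# Toy database entries (assumption: illustrative sequences, hypothetical names).
REFERENCE_DB = {"toy_ref": "MTEYKLVVVGAGGVGK"}
PERSONALIZED_DB = {**REFERENCE_DB, "toy_variant": "MTEYKLVVVGAVGVGK"}

if __name__ == "__main__":
    observed = peptide_mass("LVVVGAVGVGK")  # stand-in for a measured precursor mass
    print(candidates(observed, REFERENCE_DB))     # no candidate in the reference
    print(candidates(observed, PERSONALIZED_DB))  # variant entry supplies one
```

      The same spectrum yields no candidate against the reference-only database and exactly one against the database that includes the variant entry, which is why database completeness matters for the approaches reviewed in this section.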

      Proteomics Aiding Genome Annotation

      The use of MS-based proteome characterization to aid genome annotation has been widely exploited in various organisms and has been previously reviewed by several groups (
      • Nesvizhskii A.I.
      Proteogenomics: concepts, applications and computational strategies.
      ,
      • Castellana N.
      • Bafna V.
      Proteogenomics to discover the full coding content of genomes: a computational perspective.
      ,
      • Sheynkman G.M.
      • Shortreed M.R.
      • Cesnik A.J.
      • Smith L.M.
      Proteogenomics: integrating next-generation sequencing and mass spectrometry to characterize human proteomic variation.
      ). Here, we highlight some of the pioneering studies using integrative analysis of genomic and proteomic data for genome (re)annotation (see related review (
      • Krug K.
      • Nahnsen S.
      • Macek B.
      Mass spectrometry at the interface of proteomics and genomics.
      )).
      Early studies integrating proteomic and genomic data predate the genome sequencing revolution of the early 21st century. Motivated by the lack of comprehensive and complete protein sequence databases and the emerging availability of nucleotide sequence data in the form of expressed sequence tags (ESTs), researchers interrogated peptide mass spectra using databases obtained by in silico translation of unassembled ESTs. In 1995 Yates et al. (
      • Yates J.R.
      • Eng J.K.
      • McCormack A.L.
      Mining genomes: correlating tandem mass spectra of modified and unmodified peptides to sequences in nucleotide databases.
      ) demonstrated the use of nucleotide sequences translated into all six reading frames of amino acid sequence (six-frame translation) to identify mass spectra of unmodified and phosphorylated peptides from human, bovine, E. coli, and S. cerevisiae proteins. The intrinsic ability of searches of peptide mass spectra against genomic sequence databases to identify novel, unannotated genes was exploited shortly afterward by several groups (
      • Link A.J.
      • Hays L.G.
      • Carmack E.B.
      • Yates J.R.
      Identifying the major proteome components of Haemophilus influenzae type-strain NCTC 8143.
      ,
      • Neubauer G.
      • King A.
      • Rappsilber J.
      • Calvio C.
      • Watson M.
      • Ajuh P.
      • Sleeman J.
      • Lamond A.
      • Mann M.
      Mass spectrometry and EST-database searching allows characterization of the multi-protein spliceosome complex.
      ,
      • Jungblut P.R.
      • Müller E.C.
      • Mattow J.
      • Kaufmann S.H.
      Proteomics reveals open reading frames in Mycobacterium tuberculosis H37Rv not predicted by genomics.
      ). Choudhary et al. (
      • Choudhary J.S.
      • Blackstock W.P.
      • Creasy D.M.
      • Cottrell J.S.
      Interrogating the human genome using uninterpreted mass spectrometry data.
      ) used information from the then recently released human genome draft in 2001 to query all 23 human chromosomes with tandem mass spectra and compare the results to EST database searches. From this, they concluded that MS/MS searching of genomic DNA databases was of limited utility, as the presence of introns in the database precluded matching exon-spanning peptides and prevented identification of roughly one quarter of the spectra. Additionally, the consensus sequence chosen for the reference genome prevented the identification of individual SAAVs with EST evidence. These limitations have since been addressed through the incorporation of sample-specific or species-specific alternative splicing and SNPs into the protein sequence database (see Personalized Protein Sequence Databases).
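      Six-frame translation, and the intron limitation noted above, can be sketched in a few lines. This is an illustrative example rather than the procedure of any cited study: it uses the standard genetic code, the function names are hypothetical, and the exon and intron sequences are invented toy sequences chosen so that one codon is split across the splice junction.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order; "*" marks a stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def reverse_complement(dna: str) -> str:
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna: str) -> str:
    """Translate complete codons from the start of dna; trailing bases are ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def six_frame(dna: str) -> dict[str, str]:
    """Translate three offsets on the forward strand and three on the reverse."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

def found_in_any_frame(peptide: str, dna: str) -> bool:
    return any(peptide in protein for protein in six_frame(dna).values())

# Toy gene (assumption): two exons separated by an intron with GT...AG boundaries.
EXON1 = "ATGGCTGA"
INTRON = "GTAAGTCCCTTTTTAG"
EXON2 = "ATGGCTTAAA"
GENOMIC = EXON1 + INTRON + EXON2
SPLICED = EXON1 + EXON2  # what an EST or RNA-seq derived transcript would contain

if __name__ == "__main__":
    junction_peptide = "AEWL"  # spans the exon-exon junction in the spliced frame
    print(found_in_any_frame(junction_peptide, SPLICED))
    print(found_in_any_frame(junction_peptide, GENOMIC))
```

      The junction-spanning peptide is recovered from the spliced sequence but from none of the six genomic frames, which mirrors the limitation reported for searches against unspliced genomic DNA.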
      In 2004, Jaffe et al. introduced the concept of a proteogenomic map as a complementary method for genome annotation, which used evidence of protein expression to predict ORFs in Mycoplasma pneumoniae (
      • Jaffe J.D.
      • Berg H.C.
      • Church G.M.
      Proteogenomic mapping as a complementary method to perform genome annotation.
      ). Since then, a proteomics-based approach to gene annotation model refinement has been successfully applied in both model and nonmodel organisms (
      • Merrihew G.E.
      • Davis C.
      • Ewing B.
      • Williams G.
      • Käll L.
      • Frewen B.E.
      • Noble W.S.
      • Green P.
      • Thomas J.H.
      • MacCoss M.J.
      Use of shotgun proteomics for the identification, confirmation, and correction of C. elegans gene annotations.
      ,
      • Castellana N.E.
      • Payne S.H.
      • Shen Z.
      • Stanke M.
      • Bafna V.
      • Briggs S.P.
      Discovery and revision of Arabidopsis genes by proteogenomics.
      ,
      • Fermin D.
      • Allen B.B.
      • Blackwell T.W.
      • Menon R.
      • Adamski M.
      • Xu Y.
      • Ulintz P.
      • Omenn G.S.
      • States D.J.
      Novel gene and gene model detection using a whole genome open reading frame analysis in proteomics.
      ,
      • Gupta N.
      • Tanner S.
      • Jaitly N.
      • Adkins J.N.
      • Lipton M.
      • Edwards R.
      • Romine M.
      • Osterman A.
      • Bafna V.
      • Smith R.D.
      • Pevzner P.A.
      Whole proteome analysis of post-translational modifications: applications of mass-spectrometry for proteogenomic annotation.
      ,
      • Potgieter M.G.
      • Nakedi K.C.
      • Ambler J.M.
      • Nel A.J.M.
      • Garnett S.
      • Soares N.C.
      • Mulder N.
      • Blackburn J.M.
      Proteogenomic Analysis of Mycobacterium smegmatis using high resolution mass spectrometry.
      ,
      • Krug K.
      • Carpy A.
      • Behrends G.
      • Matic K.
      • Soares N.C.
      • Macek B.
      Deep coverage of the Escherichia coli proteome enables the assessment of false discovery rates in simple proteogenomic experiments.
      ,
      • Borchert N.
      • Dieterich C.
      • Krug K.
      • Schütz W.
      • Jung S.
      • Nordheim A.
      • Sommer R.J.
      • Macek B.
      Proteogenomics of Pristionchus pacificus reveals distinct proteome structure of nematode models.
      ). Proteogenomic gene annotation is most often carried out by searching peptide mass spectra against a six-frame translation of an associated reference genome sequence database. Peptides identified by this search are then mapped to the existing gene annotation model (Fig. 1). The detection and identification of these peptides provide direct and valuable evidence of protein translation, and the identified peptides have been used to train algorithms for gene model prediction (
      • Borchert N.
      • Dieterich C.
      • Krug K.
      • Schütz W.
      • Jung S.
      • Nordheim A.
      • Sommer R.J.
      • Macek B.
      Proteogenomics of Pristionchus pacificus reveals distinct proteome structure of nematode models.
      ).
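      The six-frame translation step described above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used by any of the cited studies: the function names, the `min_len` cutoff and the use of the standard genetic code are assumptions made for the example (some organisms use a different code, which would require a different codon table).

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order (assumption:
# the organism uses the standard code).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_orfs(dna, min_len=6):
    """Yield stop-to-stop segments from all six reading frames.

    Returns a list of (strand, frame_offset, protein_segment) tuples for
    segments of at least min_len residues.
    """
    dna = dna.upper()
    segments = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            for segment in translate(seq[offset:]).split("*"):
                if len(segment) >= min_len:
                    segments.append((strand, offset, segment))
    return segments
```

      Each returned segment would become one entry in the search database; peptides identified against these entries can then be mapped back to genomic coordinates using the strand and frame offset.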
      A crucial prerequisite for genome refinement using MS-based proteomics is sufficient proteome coverage, which became feasible following major improvements in MS instrumentation (ion traps) as well as sample preparation protocols (multidimensional fractionation at the peptide level). In fact, liquid chromatography (LC)-MS/MS-based proteomics using sensitive and fast-scanning ion-trap mass analyzers dominated the field of proteogenomics for several years despite the low resolution and mass accuracy of the acquired spectra (
      • Merrihew G.E.
      • Davis C.
      • Ewing B.
      • Williams G.
      • Käll L.
      • Frewen B.E.
      • Noble W.S.
      • Green P.
      • Thomas J.H.
      • MacCoss M.J.
      Use of shotgun proteomics for the identification, confirmation, and correction of C. elegans gene annotations.
      ,
      • Castellana N.E.
      • Payne S.H.
      • Shen Z.
      • Stanke M.
      • Bafna V.
      • Briggs S.P.
      Discovery and revision of Arabidopsis genes by proteogenomics.
      ,
      • Gupta N.
      • Tanner S.
      • Jaitly N.
      • Adkins J.N.
      • Lipton M.
      • Edwards R.
      • Romine M.
      • Osterman A.
      • Bafna V.
      • Smith R.D.
      • Pevzner P.A.
      Whole proteome analysis of post-translational modifications: applications of mass-spectrometry for proteogenomic annotation.
      ,
      • Baerenfaller K.
      • Grossmann J.
      • Grobei M.A.
      • Hull R.
      • Hirsch-Hoffmann M.
      • Yalovsky S.
      • Zimmermann P.
      • Grossniklaus U.
      • Gruissem W.
      • Baginsky S.
      Genome-scale proteomics reveals Arabidopsis thaliana gene models and proteome dynamics.
      ,
      • Gallien S.
      • Perrodou E.
      • Carapito C.
      • Deshayes C.
      • Reyrat J.-M.
      • Van Dorsselaer A.
      • Poch O.
      • Schaeffer C.
      • Lecompte O.
      Ortho-proteogenomics: multiple proteomes investigation through orthology and a new MS-based protocol.
      ,
      • Tanner S.
      • Shen Z.
      • Ng J.
      • Florea L.
      • Guigó R.
      • Briggs S.P.
      • Bafna V.
      Improving gene annotation using peptide mass spectrometry.
      ,
      • Xia D.
      • Sanderson S.J.
      • Jones A.R.
      • Prieto J.H.
      • Yates J.R.
      • Bromley E.
      • Tomley F.M.
      • Lal K.
      • Sinden R.E.
      • Brunk B.P.
      • Roos D.S.
      • Wastling J.M.
      The proteome of Toxoplasma gondii: integration with the genome provides novel insights into gene expression and annotation.
      ). However, the low mass accuracy data acquired by these instruments required large mass tolerances when searching six-frame translation databases, resulting in prohibitively large search spaces, long search times and high proportions of false positive peptide identifications using conventional target-decoy FDR thresholding (
      • Elias J.E.
      • Gygi S.P.
      Target-decoy search strategy for mass spectrometry-based proteomics.
      ).
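      To make the notion of target-decoy FDR thresholding concrete, the following sketch computes a score cutoff from a list of peptide-spectrum matches. It assumes a concatenated target-decoy search and uses decoys/targets as the FDR estimate, which is one common estimator among several; the function name and the input layout are hypothetical, and ties in score are not treated specially.

```python
def fdr_score_cutoff(psms, max_fdr=0.01):
    """Return the lowest score cutoff whose estimated FDR is <= max_fdr.

    psms: iterable of (score, is_decoy) pairs, higher score = better match.
    The FDR at a given cutoff is estimated as (#decoys / #targets) among
    all PSMs scoring at or above that cutoff. Returns None if no cutoff
    satisfies the threshold.
    """
    targets = decoys = 0
    cutoff = None
    for score, is_decoy in sorted(psms, key=lambda p: p[0], reverse=True):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            cutoff = score  # keep extending the accepted list
    return cutoff
```

      In a six-frame search most database entries are never expressed, so random matches to both targets and decoys become more frequent at any given score. Under this estimator the cutoff therefore has to move to higher scores to hold the same FDR, and fewer true identifications survive.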
      The invention of a new generation of mass spectrometers revolutionized the field of proteomics, providing high-resolution and high-accuracy MS data at an expanded dynamic range (
      • Hu Q.
      • Noll R.J.
      • Li H.
      • Makarov A.
      • Hardman M.
      • Graham Cooks R.
      The Orbitrap: a new mass spectrometer.
      ,
      • Cox J.
      • Mann M.
      Is proteomics the new genomics?.
      ). High mass accuracy data has the intrinsic capability to reduce the database search space by allowing for small mass tolerances and an associated decrease in plausible candidate sequences, something of importance when searching large genomic six-frame translation databases (
      • Krug K.
      • Nahnsen S.
      • Macek B.
      Mass spectrometry at the interface of proteomics and genomics.
      ). Further improvements in MS technology enabled the acquisition of high-resolution data at both the MS and MS/MS level without compromising sequencing speed (
      • Michalski A.
      • Damoc E.
      • Lange O.
      • Denisov E.
      • Nolting D.
      • Müller M.
      • Viner R.
      • Schwartz J.
      • Remes P.
      • Belford M.
      • Dunyach J.-J.
      • Cox J.
      • Horning S.
      • Mann M.
      • Makarov A.
      Ultra high resolution linear ion trap Orbitrap mass spectrometer (Orbitrap Elite) facilitates top down LC MS/MS and versatile peptide fragmentation modes.
      ,
      • Scheltema R.A.
      • Hauschild J.-P.
      • Lange O.
      • Hornburg D.
      • Denisov E.
      • Damoc E.
      • Kuehn A.
      • Makarov A.
      • Mann M.
      The Q Exactive HF, a Benchtop mass spectrometer with a pre-filter, high-performance quadrupole and an ultra-high-field Orbitrap analyzer.
      ,
      • Eliuk S.
      • Makarov A.
      Evolution of orbitrap mass spectrometry instrumentation.
      ).
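      The effect of mass accuracy on search space can be illustrated with a small calculation. The sketch below converts a precursor tolerance in parts per million into an absolute mass window and counts how many candidate peptide masses fall inside it; the candidate masses are made-up numbers chosen only to show the effect, and the function names are hypothetical.

```python
def mass_window(mass_da, tol_ppm):
    """Return (low, high) bounds in Da for a tolerance given in ppm."""
    delta = mass_da * tol_ppm / 1e6
    return mass_da - delta, mass_da + delta


def count_candidates(candidate_masses, observed_mass, tol_ppm):
    """Count candidate peptide masses within tol_ppm of the observed mass."""
    low, high = mass_window(observed_mass, tol_ppm)
    return sum(low <= m <= high for m in candidate_masses)


# Illustrative (invented) candidate masses around an observed 1000.0 Da precursor.
candidates = [999.6, 999.995, 1000.0, 1000.004, 1000.3]
wide = count_candidates(candidates, 1000.0, tol_ppm=500)    # low-accuracy setting
narrow = count_candidates(candidates, 1000.0, tol_ppm=10)   # high-accuracy setting
```

      With these invented values the wide window admits all five candidates and the narrow window admits three. In a real six-frame database the same narrowing applies to a far larger number of candidate sequences per spectrum.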
      Despite these technological improvements, proteomics still suffers from low sequence coverage even in simple prokaryotic genomes (
      • Krug K.
      • Carpy A.
      • Behrends G.
      • Matic K.
      • Soares N.C.
      • Macek B.
      Deep coverage of the Escherichia coli proteome enables the assessment of false discovery rates in simple proteogenomic experiments.
      ). A typical LC-MS/MS proteomics strategy employs the enzyme trypsin to digest proteins by specific cleavage after lysine (K) and arginine (R) residues. Any sequence overlap in the resulting pool of peptides will be an artifact of cleavage sites missed by trypsin. Regions of a protein in which K and R residues are spaced fewer than 6 or more than 30 amino acids apart will tend not to be observed by the mass spectrometer, and peptides with extremes of hydrophilicity or hydrophobicity will not be readily bound to or eluted from the LC column, thereby limiting sequence coverage. Recent CPTAC studies report median protein sequence coverage of about 25% across >12,000 proteins in human cancer cohorts (
      • Mertins P.
      • Mani D.R.
      • Ruggles K.V.
      • Gillette M.A.
      • Clauser K.R.
      • Wang P.
      • Wang X.
      • Qiao J.W.
      • Cao S.
      • Petralia F.
      • Kawaler E.
      • Mundt F.
      • Krug K.
      • Tu Z.
      • Lei J.T.
      • Gatza M.L.
      • Wilkerson M.
      • Perou C.M.
      • Yellapantula V.
      • Huang K.
      • Lin C.
      • McLellan M.D.
      • Yan P.
      • Davies S.R.
      • Townsend R.R.
      • Skates S.J.
      • Wang J.
      • Zhang B.
      • Kinsinger C.R.
      • Mesri M.
      • Rodriguez H.
      • Ding L.
      • Paulovich A.G.
      • Fenyö D.
      • Ellis M.J.
      • Carr S.A.
      • NCI CPTAC
      Proteogenomics connects somatic mutations to signalling in breast cancer.
      ) using extensive sample fractionation. Sequence coverage can be increased by use of multiple proteases generating different peptide species at the expense of additional experimental costs. Nucleotide sequencing-based methods to measure gene expression, such as RNA-Seq and Ribo-Seq, typically achieve higher genome coverage and are routinely used to complement the annotation of newly sequenced genomes (
      • Tisserant E.
      • Da Silva C.
      • Kohler A.
      • Morin E.
      • Wincker P.
      • Martin F.
      Deep RNA sequencing improved the structural annotation of the Tuber melanosporum transcriptome.
      ,
      • Martin J.
      • Zhu W.
      • Passalacqua K.D.
      • Bergman N.
      • Borodovsky M.
      Bacillus anthracis genome organization in light of whole transcriptome sequencing.
      ,
      • Hoff K.J.
      • Lange S.
      • Lomsadze A.
      • Borodovsky M.
      • Stanke M.
      BRAKER1: Unsupervised RNA-Seq-Based Genome Annotation with GeneMark-ET and AUGUSTUS.
      ,
      • Coleman S.J.
      • Zeng Z.
      • Wang K.
      • Luo S.
      • Khrebtukova I.
      • Mienaltowski M.J.
      • Schroth G.P.
      • Liu J.
      • MacLeod J.N.
      Structural annotation of equine protein-coding genes determined by mRNA sequencing.
      ).
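      The coverage limit imposed by tryptic digestion can be estimated in silico. The sketch below cleaves after every K or R and counts only peptides of 6 to 30 residues as observable, following the length range given above. It is a simplification under stated assumptions: missed cleavages, the reduced cleavage efficiency before proline, and hydrophobicity effects are all ignored, and the function names are hypothetical.

```python
def tryptic_peptides(protein):
    """Split a protein sequence after every K or R (fully specific, no missed cleavages)."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        if residue in "KR":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide without K/R
    return peptides


def observable_coverage(protein, min_len=6, max_len=30):
    """Fraction of residues that lie in tryptic peptides of observable length."""
    covered = sum(
        len(p) for p in tryptic_peptides(protein) if min_len <= len(p) <= max_len
    )
    return covered / len(protein)
```

      Running the same calculation with a second cleavage rule (for example, a protease with a different specificity) and taking the union of covered residues gives a rough idea of how much additional sequence a multi-protease experiment could reach.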

      Personalized Protein Sequence Databases

      Reference protein sequence databases, such as those from Ensembl or RefSeq, are typically used to identify mass spectra through peptide spectrum matching. Because these databases lack sample-specific sequence variation, including single amino acid variants (SAAVs), insertions, deletions, alternative splice junctions and novel gene fusions, studies using this approach are unable to identify the corresponding variant peptides present in the MS/MS data. This is a particularly important limitation to consider in cancer studies, where patients acquire tumor-specific somatic variation. Analyzing nonsynonymous somatic mutations at the proteome level has the potential to yield novel insights into tumor biology (
      • Alfaro J.A.
      • Sinha A.
      • Kislinger T.
      • Boutros P.C.
      Onco-proteogenomics: cancer proteomics joins forces with genomics.
      ). To do this, genome and RNA sequencing have been used to generate personalized protein sequence databases by incorporating nonsynonymous variants into reference protein sequences. Several informatics pipelines have emerged in recent years for generating these databases (
      • Fan J.
      • Saha S.
      • Barker G.
      • Heesom K.J.
      • Ghali F.
      • Jones A.R.
      • Matthews D.A.
      • Bessant C.
      Galaxy integrated omics: web-based standards-compliant workflows for proteomics informed by transcriptomics.
      ,
      • Krug K.
      • Popic S.
      • Carpy A.
      • Taumer C.
      • Macek B.
      Construction and assessment of individualized proteogenomic databases for large-scale analysis of nonsynonymous single nucleotide variants.
      ,
      • Li Y.
      • Wang X.
      • Cho J.-H.
      • Shaw T.I.
      • Wu Z.
      • Bai B.
      • Wang H.
      • Zhou S.
      • Beach T.G.
      • Wu G.
      • Zhang J.
      • Peng J.
      JUMPg: An integrative proteogenomics pipeline identifying unannotated proteins in human brain and cancer cells.
      ,
      • Ruggles K.V.
      • Tang Z.
      • Wang X.
      • Grover H.
      • Askenazi M.
      • Teubl J.
      • Cao S.
      • McLellan M.D.
      • Clauser K.R.
      • Tabb D.L.
      • Mertins P.
      • Slebos R.
      • Erdmann-Gilmore P.
      • Li S.
      • Gunawardena H.P.
      • Xie L.
      • Liu T.
      • Zhou J.-Y.
      • Sun S.
      • Hoadley K.A.
      • Perou C.M.
      • Chen X.
      • Davies S.R.
      • Maher C.A.
      • Kinsinger C.R.
      • Rodland K.D.
      • Zhang H.
      • Zhang Z.
      • Ding L.
      • Townsend R.R.
      • Rodriguez H.
      • Chan D.
      • Smith R.D.
      • Liebler D.C.
      • Carr S.A.
      • Payne S.
      • Ellis M.J.
      • Fenyő D.
      An analysis of the sensitivity of proteogenomic mapping of somatic mutations and novel splicing events in cancer.
      ,
      • Wang X.
      • Zhang B.
      customProDB: an R package to generate customized protein databases from RNA-Seq data for proteomics search.
      ,
      • Wen B.
      • Xu S.
      • Zhou R.
      • Zhang B.
      • Wang X.
      • Liu X.
      • Xu X.
      • Liu S.
      PGA: an R/Bioconductor package for identification of novel peptides using a customized database derived from RNA-Seq.
      ,
      • Woo S.
      • Cha S.W.
      • Merrihew G.
      • He Y.
      • Castellana N.
      • Guest C.
      • MacCoss M.
      • Bafna V.
      Proteogenomic database construction driven from large scale RNA-seq data.
      ,
      • Zickmann F.
      • Renard B.Y.
      MSProGene: integrative proteogenomics beyond six-frames and single nucleotide polymorphisms.
      ). Fig. 1 illustrates the core processes these pipelines perform, and Table I provides a list of typical, currently available software.
      Table I. Software pipelines for generating personalized protein sequence databases
      Name | URL | Reference | Remarks
      customProDB | https://www.bioconductor.org/packages/release/bioc/html/customProDB.html | (
      • Wang X.
      • Zhang B.
      customProDB: an R package to generate customized protein databases from RNA-Seq data for proteomics search.
      )
      R-package to generate customized protein database from NGS data, including SNVs, INDELs and novel splice junctions
      Galaxy Integrated Omics | https://bessantlab.org/software/gio/ | (
      • Krug K.
      • Popic S.
      • Carpy A.
      • Taumer C.
      • Macek B.
      Construction and assessment of individualized proteogenomic databases for large-scale analysis of nonsynonymous single nucleotide variants.
      )
      Curated collection of Galaxy-based tools to generate sample-specific databases from RNA-seq data
      Galaxy-P | https://toolshed.g2.bx.psu.edu/view/galaxyp/proteomics_rnaseq_sap_db_workflow/3a11830963e3 | (
      • Sheynkman G.M.
      • Johnson J.E.
      • Jagtap P.D.
      • Shortreed M.R.
      • Onsongo G.
      • Frey B.L.
      • Griffin T.J.
      • Smith L.M.
      Using Galaxy-P to leverage RNA-Seq for the discovery of novel protein variations.
      )
      Collection of Galaxy-P workflows for customized database construction from RNA-seq data
      QUILTS | http://quilts.fenyolab.org | (
      • Ruggles K.V.
      • Tang Z.
      • Wang X.
      • Grover H.
      • Askenazi M.
      • Teubl J.
      • Cao S.
      • McLellan M.D.
      • Clauser K.R.
      • Tabb D.L.
      • Mertins P.
      • Slebos R.
      • Erdmann-Gilmore P.
      • Li S.
      • Gunawardena H.P.
      • Xie L.
      • Liu T.
      • Zhou J.-Y.
      • Sun S.
      • Hoadley K.A.
      • Perou C.M.
      • Chen X.
      • Davies S.R.
      • Maher C.A.
      • Kinsinger C.R.
      • Rodland K.D.
      • Zhang H.
      • Zhang Z.
      • Ding L.
      • Townsend R.R.
      • Rodriguez H.
      • Chan D.
      • Smith R.D.
      • Liebler D.C.
      • Carr S.A.
      • Payne S.
      • Ellis M.J.
      • Fenyő D.
      An analysis of the sensitivity of proteogenomic mapping of somatic mutations and novel splicing events in cancer.
      )
      Web-based tool to generate sample-specific databases with special focus on novel splice isoforms
      MutationDB | http://proteomics.ucsd.edu/software-tools/cancer-proteogenomics-4/ | (
      • Woo S.
      • Cha S.W.
      • Merrihew G.
      • He Y.
      • Castellana N.
      • Guest C.
      • MacCoss M.
      • Bafna V.
      Proteogenomic database construction driven from large scale RNA-seq data.
      )
      Proteogenomic database construction from all types of mutational variants driven from large-scale RNA-seq data
      SpliceDB | http://proteomics.ucsd.edu/software-tools/cancer-proteogenomics-4/ | (
      • Woo S.
      • Cha S.W.
      • Merrihew G.
      • He Y.
      • Castellana N.
      • Guest C.
      • MacCoss M.
      • Bafna V.
      Proteogenomic database construction driven from large scale RNA-seq data.
      )
      Proteogenomic database construction with special focus on novel splice junction identifications driven from large-scale RNA-seq data
      PGA | http://bioconductor.org/packages/PGA/ | (
      • Wen B.
      • Xu S.
      • Zhou R.
      • Zhang B.
      • Wang X.
      • Liu X.
      • Xu X.
      • Liu S.
      PGA: an R/Bioconductor package for identification of novel peptides using a customized database derived from RNA-Seq.
      )
      R-package to build customized protein databases based on RNA-Seq data with or without a reference genome guide
      JUMPg | https://github.com/gatechatl/JUMPg | (
      • Li Y.
      • Wang X.
      • Cho J.-H.
      • Shaw T.I.
      • Wu Z.
      • Bai B.
      • Wang H.
      • Zhou S.
      • Beach T.G.
      • Wu G.
      • Zhang J.
      • Peng J.
      JUMPg: An integrative proteogenomics pipeline identifying unannotated proteins in human brain and cancer cells.
      )
      Perl-based pipeline to build customized search databases and to perform subsequent filtering and visualization of search results
      MSProGene | http://sourceforge.net/projects/msprogene/ | (
      • Zickmann F.
      • Renard B.Y.
      MSProGene: integrative proteogenomics beyond six-frames and single nucleotide polymorphisms.
      )
      Java program that constructs customized protein databases from RNA-seq data
      PPLine | https://sourceforge.net/projects/ppline/ | (
      • Krasnov G.S.
      • Dmitriev A.A.
      • Kudryavtseva A.V.
      • Shargunov A.V.
      • Karpov D.S.
      • Uroshlev L.A.
      • Melnikova N.V.
      • Blinov V.M.
      • Poverennaya E.V.
      • Archakov A.I.
      • Lisitsa A.V.
      • Ponomarenko E.A.
      PPLine: An Automated Pipeline for SNP, SAP, and Splice Variant Detection in the Context of Proteogenomics.
      )
      Python-based suite to process raw RNA-seq or Exome-seq data for customized database construction
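      The step these pipelines have in common, incorporating a nonsynonymous variant into a reference protein sequence, can be illustrated as follows. This is a minimal sketch and does not reproduce the behavior of any tool listed in Table I: the function names, the 1-based coordinate convention and the FASTA header format are assumptions made for the example.

```python
def apply_saav(protein, position, ref_aa, alt_aa):
    """Return the protein sequence with a single amino acid variant applied.

    position is 1-based. The reference residue is checked so that a variant
    called against a different transcript or build is not silently applied.
    """
    if not 1 <= position <= len(protein):
        raise ValueError(f"position {position} outside protein of length {len(protein)}")
    if protein[position - 1] != ref_aa:
        raise ValueError(
            f"expected {ref_aa} at position {position}, found {protein[position - 1]}"
        )
    return protein[:position - 1] + alt_aa + protein[position:]


def variant_fasta_entry(protein_id, protein, position, ref_aa, alt_aa):
    """Format the variant protein as a FASTA entry (hypothetical header style)."""
    variant = apply_saav(protein, position, ref_aa, alt_aa)
    return f">{protein_id}_{ref_aa}{position}{alt_aa}\n{variant}"
```

      Appending such entries to the reference database lets a search engine match spectra from variant peptides. In practice only the tryptic peptide spanning the variant is informative, so some workflows store just that peptide rather than the full-length protein.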
      Because the likelihood of experimentally observing a peptide decreases as one progresses from a reference database to each variant database (SAAV; novel splice junctions; and novel coding loci in putative intergenic regions), MS/MS data sets are sensibly searched in an iterative fashion through individual databases with separate FDR estimations for PSMs (
      • Mertins P.
      • Mani D.R.
      • Ruggles K.V.
      • Gillette M.A.
      • Clauser K.R.
      • Wang P.
      • Wang X.
      • Qiao J.W.
      • Cao S.
      • Petralia F.
      • Kawaler E.
      • Mundt F.
      • Krug K.
      • Tu Z.
      • Lei J.T.
      • Gatza M.L.
      • Wilkerson M.
      • Perou C.M.
      • Yellapantula V.
      • Huang K.
      • Lin C.
      • McLellan M.D.
      • Yan P.
      • Davies S.R.
      • Townsend R.R.
      • Skates S.J.
      • Wang J.
      • Zhang B.
      • Kinsinger C.R.
      • Mesri M.
      • Rodriguez H.
      • Ding L.
      • Paulovich A.G.
      • Fenyö D.
      • Ellis M.J.
      • Carr S.A.
      • NCI CPTAC
      Proteogenomics connects somatic mutations to signalling in breast cancer.
      ,
      • Zhang H.
      • Liu T.
      • Zhang Z.
      • Payne S.H.
      • Zhang B.
      • McDermott J.E.
      • Zhou J.-Y.
      • Petyuk V.A.
      • Chen L.
      • Ray D.
      • Sun S.
      • Yang F.
      • Chen L.
      • Wang J.
      • Shah P.
      • Cha S.W.
      • Aiyetan P.
      • Woo S.
      • Tian Y.
      • Gritsenko M.A.
      • Clauss T.R.
      • Choi C.
      • Monroe M.E.
      • Thomas S.
      • Nie S.
      • Wu C.
      • Moore R.J.
      • Yu K.-H.
      • Tabb D.L.
      • Fenyö D.
      • Bafna V.
      • Wang Y.
      • Rodriguez H.
      • Boja E.S.
      • Hiltke T.
      • Rivers R.C.
      • Sokoll L.
      • Zhu H.
      • Shih I.-M.
      • Cope L.
      • Pandey A.
      • Zhang B.
      • Snyder M.P.
      • Levine D.A.
      • Smith R.D.
      • Chan D.W.
      • Rodland K.D.
      • Investigators CPTAC
      Integrated proteogenomic characterization of human high-grade serous ovarian cancer.
      ,
      • Ruggles K.V.
      • Tang Z.
      • Wang X.
      • Grover H.
      • Askenazi M.
      • Teubl J.
      • Cao S.
      • McLellan M.D.
      • Clauser K.R.
      • Tabb D.L.
      • Mertins P.
      • Slebos R.
      • Erdmann-Gilmore P.
      • Li S.
      • Gunawardena H.P.
      • Xie L.
      • Liu T.
      • Zhou J.-Y.
      • Sun S.
      • Hoadley K.A.
      • Perou C.M.
      • Chen X.
      • Davies S.R.
      • Maher C.A.
      • Kinsinger C.R.
      • Rodland K.D.
      • Zhang H.
      • Zhang Z.
      • Ding L.
      • Townsend R.R.
      • Rodriguez H.
      • Chan D.
      • Smith R.D.
      • Liebler D.C.
      • Carr S.A.
      • Payne S.
      • Ellis M.J.
      • Fenyő D.
      An analysis of the sensitivity of proteogenomic mapping of somatic mutations and novel splicing events in cancer.
      ,
      • Woo S.
      • Cha S.W.
      • Bonissone S.
      • Na S.
      • Tabb D.L.
      • Pevzner P.A.
      • Bafna V.
      Advanced proteogenomic analysis reveals multiple peptide mutations and complex immunoglobulin peptides in colon cancer.
      ). As the total size of the protein database increases, identifying high confidence PSMs requires increased spectral quality with increasingly complete peptide fragmentation. Detailed statistical considerations for iterative search strategy design and FDR estimation in a proteogenomic paradigm have recently been thoroughly described (
      • Nesvizhskii A.I.
      Proteogenomics: concepts, applications and computational strategies.
      ).
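      One possible reading of the iterative search strategy with separate FDR estimation is sketched below. Spectra are searched stage by stage (for example reference, then SAAV, then novel splice junctions), each stage gets its own target-decoy cutoff, and spectra confidently assigned at an earlier stage are withheld from later ones. The data layout and function names are hypothetical, the FDR estimator (decoys/targets) is one common choice, and the cited studies may differ in the details.

```python
def score_cutoff(psms, max_fdr):
    """Lowest score cutoff with estimated FDR (decoys/targets) <= max_fdr, or None."""
    targets = decoys = 0
    cutoff = None
    for _, score, is_decoy in sorted(psms, key=lambda p: p[1], reverse=True):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            cutoff = score
    return cutoff


def iterative_search(stages, max_fdr=0.01):
    """Assign spectra to database stages in order, with a per-stage FDR cutoff.

    stages: ordered list of (stage_name, psms), where psms is a list of
    (spectrum_id, score, is_decoy) with one PSM per spectrum per stage.
    Returns a dict mapping spectrum_id to the stage that explained it.
    """
    assigned = {}
    for stage_name, psms in stages:
        # Only spectra left unexplained by earlier stages are considered.
        remaining = [p for p in psms if p[0] not in assigned]
        cutoff = score_cutoff(remaining, max_fdr)
        if cutoff is None:
            continue
        for spectrum_id, score, is_decoy in remaining:
            if not is_decoy and score >= cutoff:
                assigned[spectrum_id] = stage_name
    return assigned
```

      Estimating FDR within each stage keeps the many confident reference matches from masking a much higher error rate among the comparatively rare variant or novel peptide identifications.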

      Sequencing of Antibodies

      In addition to its role in novel peptide identification and gene annotation, sequence-centric proteogenomics has played a significant role in antibody sequencing (
      • Scheid J.F.
      • Mouquet H.
      • Ueberheide B.
      • Diskin R.
      • Klein F.
      • Oliveira T.Y.K.
      • Pietzsch J.
      • Fenyo D.
      • Abadir A.
      • Velinzon K.
      • Hurley A.
      • Myung S.
      • Boulad F.
      • Poignard P.
      • Burton D.R.
      • Pereyra F.
      • Ho D.D.
      • Walker B.D.
      • Seaman M.S.
      • Bjorkman P.J.
      • Chait B.T.
      • Nussenzweig M.C.
      Sequence and structural convergence of broad and potent HIV antibodies that mimic CD4 binding.
      ,
      • Cheung W.C.
      • Beausoleil S.A.
      • Zhang X.
      • Sato S.
      • Schieferl S.M.
      • Wieler J.S.
      • Beaudet J.G.
      • Ramenani R.K.
      • Popova L.
      • Comb M.J.
      • Rush J.
      • Polakiewicz R.D.
      A proteomics approach for the identification and cloning of monoclonal antibodies from serum.
      ,
      • Muellenbeck M.F.
      • Ueberheide B.
      • Amulic B.
      • Epp A.
      • Fenyo D.
      • Busse C.E.
      • Esen M.
      • Theisen M.
      • Mordmüller B.
      • Wardemann H.
      Atypical and classical memory B cells produce Plasmodium falciparum neutralizing antibodies.
      ,
      • Fridy P.C.
      • Li Y.
      • Keegan S.
      • Thompson M.K.
      • Nudelman I.
      • Scheid J.F.
      • Oeffinger M.
      • Nussenzweig M.C.
      • Fenyö D.
      • Chait B.T.
      • Rout M.P.
      A robust pipeline for rapid production of versatile nanobody repertoires.
      ). In vertebrates, antibodies allow an organism to differentiate between itself and the environment and provide a mechanism to fight a diverse range of infections. To meet these needs, antibodies possess a tremendous level of sequence diversity, which is achieved by combining alternative gene segments and introducing mutations. Three types of antibody gene segments are encoded in the genome: Variable (V), Diversity (D), and Joining (J), and these are combined through a process called V(D)J recombination (
      • Roth D.B.
      • Craig N.L.
      VDJ recombination.
      ) to produce a diverse library of heavy and light chains that are combined pairwise, and together provide the antigen binding specificity. The affinity maturation process (
      • Di Noia J.M.
      • Neuberger M.S.
      Molecular mechanisms of antibody somatic hypermutation.
      ) further optimizes antibodies that recognize foreign objects by introducing point mutations at a high rate, followed by selection for the strongest binders. At the same time, antibodies recognizing self are eliminated, a process that is defective in autoimmune diseases. Although antibody sequences are encoded in the genome, the way in which they are combined and matured makes it impossible to predict the final repertoire of antibody sequences for an individual from the genome alone. However, RNA-Seq can be applied to sequence the variable region of the light and heavy chains and thereby sample the antibody diversity of an individual.
      MS has also been applied to sequencing both recombinant and circulating antibodies. For recombinant antibodies where the sequence is known, MS can be used to confirm the sequence and check for purity. For studying circulating antibodies, the most widely used approach is to use the antigen as bait to enrich for the circulating high affinity antibodies and analyze them with MS/MS. Multiple aliquots are digested with proteases of different specificity to generate comprehensive coverage with overlapping peptides; the spectra can then be interpreted using de novo sequencing approaches (
      • Guthals A.
      • Gan Y.
      • Murray L.
      • Chen Y.
      • Stinson J.
      • Nakamura G.R.
      • Lill J.R.
      • Sandoval W.
      • Bandeira N.
      De Novo MS/MS sequencing of native human antibodies.
      ,
      • Guthals A.
      • Clauser K.R.
      • Frank A.M.
      • Bandeira N.
      Sequencing-grade de novo analysis of MS/MS triplets (CID/HCD/ETD) from overlapping peptides.
      ,
      • Guthals A.
      • Clauser K.R.
      • Bandeira N.
      Shotgun protein sequencing with meta-contig assembly.
      ,
      • Tran N.H.
      • Rahman M.Z.
      • He L.
      • Xin L.
      • Shan B.
      • Li M.
      Complete de novo assembly of monoclonal antibody sequences.
      ,
      • Vincke C.
      • Loris R.
      • Saerens D.
      • Martinez-Rodriguez S.
      • Muyldermans S.
      • Conrath K.
      General strategy to humanize a camelid single-domain antibody and identification of a universal humanized nanobody scaffold.
      ). This requires high-quality data and limits the number of antibodies that can be sequenced in a mixture. An alternative is to use a proteogenomics approach: performing targeted RNA-Seq of the variable region of the light and heavy chains, assembling the reads and translating the assemblies to create a protein sequence database for searching. This approach has been employed to identify high-affinity circulating antibodies in infected individuals against HIV (
      • Scheid J.F.
      • Mouquet H.
      • Ueberheide B.
      • Diskin R.
      • Klein F.
      • Oliveira T.Y.K.
      • Pietzsch J.
      • Fenyo D.
      • Abadir A.
      • Velinzon K.
      • Hurley A.
      • Myung S.
      • Boulad F.
      • Poignard P.
      • Burton D.R.
      • Pereyra F.
      • Ho D.D.
      • Walker B.D.
      • Seaman M.S.
      • Bjorkman P.J.
      • Chait B.T.
      • Nussenzweig M.C.
      Sequence and structural convergence of broad and potent HIV antibodies that mimic CD4 binding.
      ) and malaria (
      • Muellenbeck M.F.
      • Ueberheide B.
      • Amulic B.
      • Epp A.
      • Fenyo D.
      • Busse C.E.
      • Esen M.
      • Theisen M.
      • Mordmüller B.
      • Wardemann H.
      Atypical and classical memory B cells produce Plasmodium falciparum neutralizing antibodies.
      ) surface proteins that are potentially broadly neutralizing. The approach has also been applied to produce single chain llama antibodies to be used as reagents (
      • Fridy P.C.
      • Li Y.
      • Keegan S.
      • Thompson M.K.
      • Nudelman I.
      • Scheid J.F.
      • Oeffinger M.
      • Nussenzweig M.C.
      • Fenyö D.
      • Chait B.T.
      • Rout M.P.
      A robust pipeline for rapid production of versatile nanobody repertoires.
      ). Llamas and camels produce single-chain antibodies in addition to paired heavy-light chain antibodies. The advantage of single-chain antibodies as reagents is that they are small (∼15 kDa) and robust and, once sequenced, can easily be expressed in E. coli to provide a reproducible resource. These single-chain antibodies can be humanized (
      • Fridy P.C.
      • Li Y.
      • Keegan S.
      • Thompson M.K.
      • Nudelman I.
      • Scheid J.F.
      • Oeffinger M.
      • Nussenzweig M.C.
      • Fenyö D.
      • Chait B.T.
      • Rout M.P.
      A robust pipeline for rapid production of versatile nanobody repertoires.
      ) and conjugated with drugs (
      • Arias J.L.
      • Unciti-Broceta J.D.
      • Maceira J.
      • Del Castillo T.
      • Hernández-Quero J.
      • Magez S.
      • Soriano M.
      • García-Salcedo J.A.
      Nanobody conjugated PLGA nanoparticles for active targeting of African Trypanosomiasis.
      ); they are, therefore, highly promising candidates for developing therapeutics.
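      The idea of digesting aliquots with different proteases to obtain overlapping peptides, mentioned above for de novo antibody sequencing, rests on assembling peptides by their shared ends. The toy sketch below merges peptide strings greedily by their longest exact suffix-prefix overlap. It is only an illustration under strong assumptions: real de novo assemblers must handle sequencing errors, isobaric residues such as I and L, and ambiguous overlaps, none of which is modeled here, and the function names are hypothetical.

```python
def overlap(a, b, min_overlap=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_overlap)."""
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(peptides, min_overlap=3):
    """Greedily merge peptides into contigs by longest exact overlap."""
    contigs = list(dict.fromkeys(peptides))  # remove duplicates, keep order
    # Drop peptides fully contained in another peptide.
    contigs = [p for p in contigs if not any(p != q and p in q for q in contigs)]
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_overlap)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [
            c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)
        ] + [merged]
```

      The proteogenomic alternative described above sidesteps this assembly problem by searching spectra against sequences translated from RNA-Seq assemblies, so that peptides only need to be matched rather than stitched together.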

      Viral Infections and Activation of Transposable Elements

      Beyond expression of host genes, viral infections and transposons contribute potential protein-level expression in eukaryotic organisms that can be studied using proteogenomic techniques. Viral infections alter gene expression and protein production as the virus hijacks cellular processes that allow for self-replication (
      • Davis Z.H.
      • Verschueren E.
      • Jang G.M.
      • Kleffman K.
      • Johnson J.R.
      • Park J.
      • Von Dollen J.
      • Maher M.C.
      • Johnson T.
      • Newton W.
      • Jäger S.
      • Shales M.
      • Horner J.
      • Hernandez R.D.
      • Krogan N.J.
      • Glaunsinger B.A.
      Global mapping of herpesvirus-host protein complexes reveals a transcription strategy for late genes.
      ,
      • Jäger S.
      • Kim D.Y.
      • Hultquist J.F.
      • Shindo K.
      • LaRue R.S.
      • Kwon E.
      • Li M.
      • Anderson B.D.
      • Yen L.
      • Stanley D.
      • Mahon C.
      • Kane J.
      • Franks-Skiba K.
      • Cimermancic P.
      • Burlingame A.
      • Sali A.
      • Craik C.S.
      • Harris R.S.
      • Gross J.D.
      • Krogan N.J.
      Vif hijacks CBF-β to degrade APOBEC3G and promote HIV-1 infection.
      ,
      • Jean Beltran P.M.
      • Mathias R.A.
      • Cristea I.M.
      A portrait of the human organelle proteome in space and time during cytomegalovirus infection.
      ,
      • Luo Y.
      • Jacobs E.Y.
      • Greco T.M.
      • Mohammed K.D.
      • Tong T.
      • Keegan S.
      • Binley J.M.
      • Cristea I.M.
      • Fenyö D.
      • Rout M.P.
      • Chait B.T.
      • Muesing M.A.
      HIV-host interactome revealed directly from infected cells.
      ), and can lead to cell death or cancerous growth of the host cells (
      • Crawford D.H.
      • Johannessen I.
      • Rickinson A.B.
      ). Eukaryotic genomes also contain mobile elements or transposons that are remnants of ancient viral infections that were incorporated in the host germline genome and then spread throughout the genome through gene copying. It is estimated that about half of the human genome is transposon sequence albeit most are no longer active (
      • Huang C.R.L.
      • Burns K.H.
      • Boeke J.D.
      Active transposition in genomes.
      ). There is, however, a small subset of LINE-1 (Long Interspersed Element-1) retrotransposons that is capable of autonomous retrotransposition through a copy-and-paste mechanism using an RNA intermediate, but these elements are most commonly inactive in somatic cells. However, increased retrotransposition activity has been observed in some disease states including many cancer types (
      • Rodić N.
      • Sharma R.
      • Sharma R.
      • Zampella J.
      • Dai L.
      • Taylor M.S.
      • Hruban R.H.
      • Iacobuzio-Donahue C.A.
      • Maitra A.
      • Torbenson M.S.
      • Goggins M.
      • Shih I.-M.
      • Duffield A.S.
      • Montgomery E.A.
      • Gabrielson E.
      • Netto G.J.
      • Lotan T.L.
      • De Marzo A.M.
      • Westra W.
      • Binder Z.A.
      • Orr B.A.
      • Gallia G.L.
      • Eberhart C.G.
      • Boeke J.D.
      • Harris C.R.
      • Burns K.H.
      Long interspersed element-1 protein expression is a hallmark of many human cancers.
      ). The role of retrotransposition in cancer biology is currently unclear: it is not known whether it promotes or suppresses tumors, or whether it is merely a consequence of genome instability (
      • Ardeljan D.
      • Taylor M.S.
      • Burns K.H.
      • Boeke J.D.
      • Espey M.G.
      • Woodhouse E.C.
      • Howcroft T.K.
      Meeting report: the role of the mobilome in cancer.
      ). Also, the mechanism for suppression and activation of retrotransposition is unknown (
      • Burns K.H.
      • Boeke J.D.
      Human transposon tectonics.
      ) but could provide important insights into cancer biology. Human LINE-1 has two open reading frames: ORF1 and ORF2. ORF1 is an RNA binding protein and ORF2 is an endonuclease and a reverse transcriptase. ORF1 can be quantified using MS-based proteomics, and has been observed by deep MS-based proteomics in many tumors including breast, prostate and ovarian tumors (

      LINE-1 ORF1 Observations in GPMDB http://gpmdb.thegpm.org/protein/accession/gi%7C74753422%7C

      ). Interesting proteogenomics questions for future research include: Do higher levels of LINE-1 transcripts and ORF1 protein concentrations correlate with tumor progression? Which human transcripts and protein levels correlate with ORF1 protein concentration? Does ORF1 protein concentration correlate with a higher number of somatic LINE-1 insertions?

      Metaproteogenomics

      Metaproteomics (
      • Banfield J.F.
      • Verberkmoes N.C.
      • Hettich R.L.
      • Thelen M.P.
      Proteogenomic approaches for the molecular characterization of natural microbial communities.
      ,
      • Wilmes P.
      • Bond P.L.
      The application of two-dimensional polyacrylamide gel electrophoresis and downstream analyses to a mixed community of prokaryotic microorganisms.
      ), or proteomic investigation of multiorganism communities, represents a unique application for proteomic and genomic data integration. Metagenomic studies typically collect genome data only, representing the functional potential of organisms present in a community, whereas metaproteogenomics adds a layer of proteomics data to elucidate which cellular functions are being expressed and utilized by community members. Unfortunately, reference genome sequence databases lack many of the protein sequences present in natural consortia samples, as many organisms within the biological sample may not have a sequenced genome. Despite the meteoric rise in the number of sequenced genomes, there remain large swaths of bacterial diversity that are unrepresented in databases like GenBank and UniProt. A recent paper by the Banfield group (
      • Hug L.A.
      • Baker B.J.
      • Anantharaman K.
      • Brown C.T.
      • Probst A.J.
      • Castelle C.J.
      • Butterfield C.N.
      • Hernsdorf A.W.
      • Amano Y.
      • Ise K.
      • Suzuki Y.
      • Dudek N.
      • Relman D.A.
      • Finstad K.M.
      • Amundson R.
      • Thomas B.C.
      • Banfield J.F.
      A new view of the tree of life.
      ) highlights numerous phyla which were previously unknown (not just unsequenced). Indeed, they observe a new phyla radiation with dozens of bacterial phyla entirely separate from known taxonomy. Recent predictions for bacterial diversity on the planet approach 1 trillion distinct species (
      • Locey K.J.
      • Lennon J.T.
      Scaling laws predict global microbial diversity.
      ). Thus, we expect that reference genome sequence databases will be incomplete at the species level for the foreseeable future.
      A further motivation for using genomic data in metaproteomics experiments is the inaccuracy of genome annotation for novel species. Proteomics has frequently been used not only to improve the set of known proteins in a single organism, but also as training and/or testing data for improving gene calling algorithms (
      • Tripp H.J.
      • Sutton G.
      • White O.
      • Wortman J.
      • Pati A.
      • Mikhailova N.
      • Ovchinnikova G.
      • Payne S.H.
      • Kyrpides N.C.
      • Ivanova N.
      Toward a standard in structural genome annotation for prokaryotes.
      ). Moreover, protein annotation is less accurate for genes which have not been previously characterized, a common observation for samples in natural (not laboratory) conditions.
      Absent algorithmic advances which allow spectra to be identified without an exact sequence match to a database (
      • Horlacher O.
      • Lisacek F.
      • Müller M.
      Mining large scale tandem mass spectrometry data for protein modifications using spectral libraries.
      ,
      • Na S.
      • Payne S.H.
      • Bandeira N.
      Multi-species identification of polymorphic peptide variants via propagation in spectral networks.
      ,
      • Ye D.
      • Fu Y.
      • Sun R.-X.
      • Wang H.-P.
      • Yuan Z.-F.
      • Chi H.
      • He S.-M.
      Open MS/MS spectral library search to identify unanticipated post-translational modifications and increase spectral identification rate.
      ), the path forward for metaproteomics involves obtaining sequencing data for the biological sample at hand. This can be either metagenomic or metatranscriptomic data. Each provides crucial information about the potential protein sequences that should be considered when searching tandem mass spectra. Although this introduces additional costs to the project, it has so far been the best way to improve the number of identified peptides in a metaproteomic analysis (
      • Wilmes P.
      • Heintz-Buschart A.
      • Bond P.L.
      A decade of metaproteomics: where we stand and what the future holds.
      ).
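      As an illustration of how sequencing data from the sample can be turned into candidate protein sequences for a tandem mass spectra search, the following is a minimal sketch of six-frame translation of an assembled contig. Six-frame translation is one common option and is an assumption here, not a method prescribed by the text; the function names, the `min_len` cutoff, and the use of the standard codon assignments are likewise illustrative choices.

```python
from itertools import product

# Standard codon assignments in NCBI order (bases T, C, A, G).
# Assumption: the standard code is adequate for the organisms in the sample.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def translate(seq):
    """Translate DNA codon by codon; unknown codons (e.g. containing N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_orfs(contig, min_len=3):
    """Return (strand, frame, peptide) for every stop-to-stop segment
    of at least min_len residues in all six reading frames."""
    contig = contig.upper()
    orfs = []
    for strand, seq in (("+", contig), ("-", reverse_complement(contig))):
        for frame in range(3):
            for segment in translate(seq[frame:]).split("*"):
                if len(segment) >= min_len:
                    orfs.append((strand, frame, segment))
    return orfs


if __name__ == "__main__":
    for strand, frame, peptide in six_frame_orfs("ATGGCCAAATAA"):
        print(strand, frame, peptide)
```

      In practice the resulting segments would be written to a FASTA file and filtered by a realistic minimum length before database searching; a larger search space also raises the score threshold needed to control false discoveries.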

      Analysis of Proteogenomic Relationships

      In this section we cover the integration of profiling data from different omics platforms to help elucidate information flow from DNA to RNA to proteins and, most importantly, to phenotype.
      This includes studies aimed at understanding whether protein abundance can be reliably predicted from mRNA measurements (mRNA-Protein Correlation); assessing the genetic control of mRNA and protein abundance (Genetic Control of mRNA and Protein Abundance); and determining the impact of genetic aberrations on post-translational modification (PTM) and signaling (Relating Mutations to Post-translational Modifications and Signaling).

      mRNA-Protein Correlation

      Correlation between mRNA and protein profiling data has been a topic of considerable research during the past decade, and an excellent review on this has recently been published (
      • Liu Y.
      • Beyer A.
      • Aebersold R.
      On the dependency of cellular protein levels on mRNA abundance.
      ). Early studies focused on the correlation between steady state mRNA and protein abundance for all genes in a single sample, and it was noted in various organisms that relative abundance of proteins in a sample cannot be adequately explained by the corresponding mRNA abundance (
      • Vogel C.
      • Marcotte E.M.
      Insights into the regulation of protein abundance from proteomic and transcriptomic analyses.
      ). This can be explained, at least in part, by our understanding that protein abundance is determined by a combination of mRNA abundance, translational regulation, and protein degradation (
      • Vogel C.
      • Marcotte E.M.
      Insights into the regulation of protein abundance from proteomic and transcriptomic analyses.
      ). With the availability of paired mRNA and protein data for large sample cohorts, studies on gene-wise correlations between mRNA and protein abundance across many samples also reported modest correlations (
      • Foss E.J.
      • Radulovic D.
      • Shaffer S.A.
      • Goodlett D.R.
      • Kruglyak L.
      • Bedalov A.
      Genetic variation shapes protein networks mainly through non-transcriptional mechanisms.
      ,
      • Ghazalpour A.
      • Bennett B.
      • Petyuk V.A.
      • Orozco L.
      • Hagopian R.
      • Mungrue I.N.
      • Farber C.R.
      • Sinsheimer J.
      • Kang H.M.
      • Furlotte N.
      • Park C.C.
      • Wen P.-Z.
      • Brewer H.
      • Weitz K.
      • Camp D.G.
      • Pan C.
      • Yordanova R.
      • Neuhaus I.
      • Tilford C.
      • Siemers N.
      • Gargalovic P.
      • Eskin E.
      • Kirchgessner T.
      • Smith D.J.
      • Smith R.D.
      • Lusis A.J.
      Comparative analysis of proteome and transcriptome variation in mouse.
      ,
      • Gry M.
      • Rimini R.
      • Strömberg S.
      • Asplund A.
      • Pontén F.
      • Uhlén M.
      • Nilsson P.
      Correlations between RNA and protein expression profiles in 23 human cell lines.
      ). More recently, the CPTAC consortium has explored mRNA-protein correlations in breast, colorectal, and ovarian cancer samples (
      • Zhang B.
      • Wang J.
      • Wang X.
      • Zhu J.
      • Liu Q.
      • Shi Z.
      • Chambers M.C.
      • Zimmerman L.J.
      • Shaddox K.F.
      • Kim S.
      • Davies S.R.
      • Wang S.
      • Wang P.
      • Kinsinger C.R.
      • Rivers R.C.
      • Rodriguez H.
      • Townsend R.R.
      • Ellis M.J.C.
      • Carr S.A.
      • Tabb D.L.
      • Coffey R.J.
      • Slebos R.J.C.
      • Liebler D.C.
      • NCI CPTAC
      Proteogenomic characterization of human colon and rectal cancer.
      ,
      • Mertins P.
      • Mani D.R.
      • Ruggles K.V.
      • Gillette M.A.
      • Clauser K.R.
      • Wang P.
      • Wang X.
      • Qiao J.W.
      • Cao S.
      • Petralia F.
      • Kawaler E.
      • Mundt F.
      • Krug K.
      • Tu Z.
      • Lei J.T.
      • Gatza M.L.
      • Wilkerson M.
      • Perou C.M.
      • Yellapantula V.
      • Huang K.
      • Lin C.
      • McLellan M.D.
      • Yan P.
      • Davies S.R.
      • Townsend R.R.
      • Skates S.J.
      • Wang J.
      • Zhang B.
      • Kinsinger C.R.
      • Mesri M.
      • Rodriguez H.
      • Ding L.
      • Paulovich A.G.
      • Fenyö D.
      • Ellis M.J.
      • Carr S.A.
      • NCI CPTAC
      Proteogenomics connects somatic mutations to signalling in breast cancer.
      ,
      • Zhang H.
      • Liu T.
      • Zhang Z.
      • Payne S.H.
      • Zhang B.
      • McDermott J.E.
      • Zhou J.-Y.
      • Petyuk V.A.
      • Chen L.
      • Ray D.
      • Sun S.
      • Yang F.
      • Chen L.
      • Wang J.
      • Shah P.
      • Cha S.W.
      • Aiyetan P.
      • Woo S.
      • Tian Y.
      • Gritsenko M.A.
      • Clauss T.R.
      • Choi C.
      • Monroe M.E.
      • Thomas S.
      • Nie S.
      • Wu C.
      • Moore R.J.
      • Yu K.-H.
      • Tabb D.L.
      • Fenyö D.
      • Bafna V.
      • Wang Y.
      • Rodriguez H.
      • Boja E.S.
      • Hiltke T.
      • Rivers R.C.
      • Sokoll L.
      • Zhu H.
      • Shih I.-M.
      • Cope L.
      • Pandey A.
      • Zhang B.
      • Snyder M.P.
      • Levine D.A.
      • Smith R.D.
      • Chan D.W.
      • Rodland K.D.
      • Investigators CPTAC
      Integrated proteogenomic characterization of human high-grade serous ovarian cancer.
      ) finding predominantly positive mRNA-protein correlations for all genes with moderate (0.3–0.45) median correlations (Fig. 2A).
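      The gene-wise analysis described above can be sketched as follows: for each gene measured on both platforms, correlate its mRNA and protein abundance across samples, then summarize the distribution by its median. This is a minimal sketch; the choice of Spearman rank correlation, the dictionary input layout, and all function names are assumptions made for illustration, not the procedure used in the cited studies.

```python
from statistics import median


def average_ranks(values):
    """Return 1-based ranks, assigning tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def genewise_correlations(mrna, protein):
    """mrna and protein map gene -> list of abundances in the same sample order.
    Only genes present in both data sets are correlated."""
    shared = sorted(set(mrna) & set(protein))
    return {gene: spearman(mrna[gene], protein[gene]) for gene in shared}


def median_correlation(correlations):
    """Median of the gene-wise correlations, ignoring undefined (NaN) values."""
    return median(r for r in correlations.values() if r == r)
```

      Missing values, which are common in proteomics data, are not handled here; a real analysis would restrict each gene to the samples in which both measurements are available.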
      Because both mRNA and protein profiling data are noisy (
      • Hundertmark C.
      • Fischer R.
      • Reinl T.
      • May S.
      • Klawonn F.
      • Jänsch L.
      MS-specific noise model reveals the potential of iTRAQ in quantitative proteomics.
      ,
      • Lahens N.F.
      • Kavakli I.H.
      • Zhang R.
      • Hayer K.
      • Black M.B.
      • Dueck H.
      • Pizarro A.
      • Kim J.
      • Irizarry R.
      • Thomas R.S.
      • Grant G.R.
      • Hogenesch J.B.
      IVT-seq reveals extreme bias in RNA sequencing.
      ), it is unclear how much of the reported low correlation between mRNA and protein expression is due to technical issues rather than underlying biology. Statistical methods that attempt to model stochastic and systematic errors in mRNA and protein profiling data have produced higher mRNA-protein correlations (
      • Fu J.
      • Keurentjes J.J.B.
      • Bouwmeester H.
      • America T.
      • Verstappen F.W.A.
      • Ward J.L.
      • Beale M.H.
      • de Vos R.C.H.
      • Dijkstra M.
      • Scheltema R.A.
      • Johannes F.
      • Koornneef M.
      • Vreugdenhil D.
      • Breitling R.
      • Jansen R.C.
      System-wide molecular evidence for phenotypic buffering in Arabidopsis.
      ,
      • Jovanovic M.
      • Rooney M.S.
      • Mertins P.
      • Przybylski D.
      • Chevrier N.
      • Satija R.
      • Rodriguez E.H.
      • Fields A.P.
      • Schwartz S.
      • Raychowdhury R.
      • Mumbach M.R.
      • Eisenhaure T.
      • Rabani M.
      • Gennert D.
      • Lu D.
      • Delorey T.
      • Weissman J.S.
      • Carr S.A.
      • Hacohen N.
      • Regev A.
      Immunogenetics. Dynamic profiling of the protein life cycle in response to pathogens.
      ), and thus it has been suggested that transcription predominantly determines protein abundance (
      • Li J.J.
      • Biggin M.D.
      Gene expression. Statistics requantitates the central dogma.
      ). Recent studies by the CPTAC consortium have reported nonrandom associations between the level of mRNA-protein correlation and biological functions of the genes (
      • Zhang B.
      • Wang J.
      • Wang X.
      • Zhu J.
      • Liu Q.
      • Shi Z.
      • Chambers M.C.
      • Zimmerman L.J.
      • Shaddox K.F.
      • Kim S.
      • Davies S.R.
      • Wang S.
      • Wang P.
      • Kinsinger C.R.
      • Rivers R.C.
      • Rodriguez H.
      • Townsend R.R.
      • Ellis M.J.C.
      • Carr S.A.
      • Tabb D.L.
      • Coffey R.J.
      • Slebos R.J.C.
      • Liebler D.C.
      • NCI CPTAC
      Proteogenomic characterization of human colon and rectal cancer.
      ,
      • Mertins P.
      • Mani D.R.
      • Ruggles K.V.
      • Gillette M.A.
      • Clauser K.R.
      • Wang P.
      • Wang X.
      • Qiao J.W.
      • Cao S.
      • Petralia F.
      • Kawaler E.
      • Mundt F.
      • Krug K.
      • Tu Z.
      • Lei J.T.
      • Gatza M.L.
      • Wilkerson M.
      • Perou C.M.
      • Yellapantula V.
      • Huang K.
      • Lin C.
      • McLellan M.D.
      • Yan P.
      • Davies S.R.
      • Townsend R.R.
      • Skates S.J.
      • Wang J.
      • Zhang B.
      • Kinsinger C.R.
      • Mesri M.
      • Rodriguez H.
      • Ding L.
      • Paulovich A.G.
      • Fenyö D.
      • Ellis M.J.
      • Carr S.A.
      • NCI CPTAC
      Proteogenomics connects somatic mutations to signalling in breast cancer.
      ,
      • Zhang H.
      • Liu T.
      • Zhang Z.
      • Payne S.H.
      • Zhang B.
      • McDermott J.E.
      • Zhou J.-Y.
      • Petyuk V.A.
      • Chen L.
      • Ray D.
      • Sun S.
      • Yang F.
      • Chen L.
      • Wang J.
      • Shah P.
      • Cha S.W.
      • Aiyetan P.
      • Woo S.
      • Tian Y.
      • Gritsenko M.A.
      • Clauss T.R.
      • Choi C.
      • Monroe M.E.
      • Thomas S.
      • Nie S.
      • Wu C.
      • Moore R.J.
      • Yu K.-H.
      • Tabb D.L.
      • Fenyö D.
      • Bafna V.
      • Wang Y.
      • Rodriguez H.
      • Boja E.S.
      • Hiltke T.
      • Rivers R.C.
      • Sokoll L.
      • Zhu H.
      • Shih I.-M.
      • Cope L.
      • Pandey A.
      • Zhang B.
      • Snyder M.P.
      • Levine D.A.
      • Smith R.D.
      • Chan D.W.
      • Rodland K.D.
      • Investigators CPTAC
      Integrated proteogenomic characterization of human high-grade serous ovarian cancer.
      ). For example, metabolic functions such as amino acid, fatty acid and nucleotide metabolism are enriched for genes with high mRNA-protein correlations, whereas ribosomal and mRNA splicing functions are enriched for genes with low or negative mRNA-protein correlations. A more systematic study using mRNA and protein profiling data from the three CPTAC cancer types showed that proteomic data strengthened the link between gene expression and function for at least 75% of Gene Ontology (GO) biological processes and 90% of KEGG pathways (
      • Wang J.
      • Ma Z.
      • Carr S.A.
      • Mertins P.
      • Zhang H.
      • Zhang Z.
      • Chan D.W.
      • Ellis M.J.C.
      • Townsend R.R.
      • Smith R.D.
      • McDermott J.E.
      • Chen X.
      • Paulovich A.G.
      • Boja E.S.
      • Mesri M.
      • Kinsinger C.R.
      • Rodriguez H.
      • Rodland K.D.
      • Liebler D.C.
      • Zhang B.
      Proteome profiling outperforms transcriptome profiling for co-expression based gene function prediction.
      ). Thus, mRNA-protein discrepancies cannot be explained simply by experimental error, and biological functions arise from both mRNA- and protein-level regulation.

      Genetic Control of mRNA and Protein Abundance

      Genetic variation plays an important role in determining mRNA and protein abundance. mRNA and protein expression data from a cohort of samples can be integrated with DNA variation information to study the underlying genetic determinants of gene expression variation. This type of analysis is an extension of the traditional quantitative trait locus (QTL) mapping, in which a section of DNA (the locus) is correlated with variation in a phenotype (i.e. quantitative trait). When expression levels of mRNAs are treated as quantitative traits, the QTL analysis is named eQTL analysis, a method that has become well-established in the field of genetics (
      • Gilad Y.
      • Rifkin S.A.
      • Pritchard J.K.
      Revealing the architecture of gene regulation: the promise of eQTL studies.
      ) (Fig. 2B). eQTLs may be cis- or trans-acting, determined by their physical distance from the gene they regulate. Specifically, cis-eQTLs affect expression of genes at the same locus as the variant, whereas trans-eQTLs affect expression of genes at a different locus. Although many cis-eQTLs have been reported, mapping trans-eQTLs has been less successful (
      • Nica A.C.
      • Dermitzakis E.T.
      Expression quantitative trait loci: present and future.
      ). It remains unclear whether the difficulty in mapping trans-eQTLs reflects true biology (i.e. eQTLs primarily act in cis) or computational and statistical challenges. More recently, ribosome occupancy and protein abundance have been used as quantitative traits to identify ribosome occupancy QTLs (rQTLs) and protein abundance QTLs (pQTLs), respectively (
      • Ghazalpour A.
      • Bennett B.
      • Petyuk V.A.
      • Orozco L.
      • Hagopian R.
      • Mungrue I.N.
      • Farber C.R.
      • Sinsheimer J.
      • Kang H.M.
      • Furlotte N.
      • Park C.C.
      • Wen P.-Z.
      • Brewer H.
      • Weitz K.
      • Camp D.G.
      • Pan C.
      • Yordanova R.
      • Neuhaus I.
      • Tilford C.
      • Siemers N.
      • Gargalovic P.
      • Eskin E.
      • Kirchgessner T.
      • Smith D.J.
      • Smith R.D.
      • Lusis A.J.
      Comparative analysis of proteome and transcriptome variation in mouse.
      ).
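      The two ingredients of a QTL scan described above, an association between genotype and a quantitative trait and a distance-based cis/trans label, can be sketched as follows. This is a minimal sketch under stated assumptions: genotypes are coded as allele dosages (0, 1, 2), the effect is estimated by simple least squares without covariates or significance testing, and the 1 Mb cis window is an illustrative default rather than a value given in the text. The same functions apply whether the trait is mRNA abundance (eQTL), ribosome occupancy (rQTL), or protein abundance (pQTL).

```python
def classify_qtl(variant_chrom, variant_pos, gene_chrom, gene_start, gene_end,
                 window=1_000_000):
    """Label a variant-gene pair 'cis' if the variant lies on the same
    chromosome within `window` bases of the gene, otherwise 'trans'.
    The window size is an assumption; studies choose different values."""
    if variant_chrom != gene_chrom:
        return "trans"
    if gene_start - window <= variant_pos <= gene_end + window:
        return "cis"
    return "trans"


def additive_effect(dosages, trait):
    """Least-squares slope of the trait on allele dosage (0, 1, 2):
    the estimated change in the trait per copy of the alternate allele."""
    n = len(dosages)
    mean_d = sum(dosages) / n
    mean_t = sum(trait) / n
    sxx = sum((d - mean_d) ** 2 for d in dosages)
    if sxx == 0:
        raise ValueError("variant is monomorphic in this cohort")
    sxy = sum((d - mean_d) * (t - mean_t) for d, t in zip(dosages, trait))
    return sxy / sxx


if __name__ == "__main__":
    # Hypothetical variant 500 kb upstream of a hypothetical gene.
    print(classify_qtl("chr1", 1_500_000, "chr1", 2_000_000, 2_010_000))
    print(additive_effect([0, 1, 2, 0, 1, 2], [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]))
```

      A genome-wide scan repeats the association for every variant-trait pair, which is why multiple-testing correction dominates the statistical design, particularly for trans effects.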
      An integrative multi-omics study on a set of HapMap Yoruba lymphoblastoid cell lines found that most QTLs were associated with mRNA expression levels, but their impact on protein expression levels was significantly reduced (
      • Battle A.
      • Khan Z.
      • Wang S.H.
      • Mitrano A.
      • Ford M.J.
      • Pritchard J.K.
      • Gilad Y.
      Genomic variation. Impact of regulatory variation from RNA to protein.
      ). This buffering of protein levels may allow cells to cope with noisy genetic variation and attenuate its impact on downstream phenotypes. Interestingly, a set of cis-QTLs that affect protein abundance showed little or no effect on messenger RNA or ribosome levels, suggesting a potential role in post-translational regulation. Both the buffering effect and protein abundance-specific QTLs have been reported in earlier studies in yeast (
      • Foss E.J.
      • Radulovic D.
      • Shaffer S.A.
      • Goodlett D.R.
      • Kruglyak L.
      • Bedalov A.
      Genetic variation shapes protein networks mainly through non-transcriptional mechanisms.
      ,
      • Foss E.J.
      • Radulovic D.
      • Shaffer S.A.
      • Ruderfer D.M.
      • Bedalov A.
      • Goodlett D.R.
      • Kruglyak L.
      Genetic basis of proteome variation in yeast.
      ), Arabidopsis (
      • Fu J.
      • Keurentjes J.J.B.
      • Bouwmeester H.
      • America T.
      • Verstappen F.W.A.
      • Ward J.L.
      • Beale M.H.
      • de Vos R.C.H.
      • Dijkstra M.
      • Scheltema R.A.
      • Johannes F.
      • Koornneef M.
      • Vreugdenhil D.
      • Breitling R.
      • Jansen R.C.
      System-wide molecular evidence for phenotypic buffering in Arabidopsis.
      ), mouse (
      • Ghazalpour A.
      • Bennett B.
      • Petyuk V.A.
      • Orozco L.
      • Hagopian R.
      • Mungrue I.N.
      • Farber C.R.
      • Sinsheimer J.
      • Kang H.M.
      • Furlotte N.
      • Park C.C.
      • Wen P.-Z.
      • Brewer H.
      • Weitz K.
      • Camp D.G.
      • Pan C.
      • Yordanova R.
      • Neuhaus I.
      • Tilford C.
      • Siemers N.
      • Gargalovic P.
      • Eskin E.
      • Kirchgessner T.
      • Smith D.J.
      • Smith R.D.
      • Lusis A.J.
      Comparative analysis of proteome and transcriptome variation in mouse.
      ), and human (
      • Lappalainen T.
      • Sammeth M.
      • Friedländer M.R.
      • 't Hoen P.A.C.
      • Monlong J.
      • Rivas M.A.
      • Gonzàlez-Porta M.
      • Kurbatova N.
      • Griebel T.
      • Ferreira P.G.
      • Barann M.
      • Wieland T.
      • Greger L.
      • van Iterson M.
      • Almlöf J.
      • Ribeca P.
      • Pulyakhina I.
      • Esser D.
      • Giger T.
      • Tikhonov A.
      • Sultan M.
      • Bertier G.
      • MacArthur D.G.
      • Lek M.
      • Lizano E.
      • Buermans H.P.J.
      • Padioleau I.
      • Schwarzmayr T.
      • Karlberg O.
      • Ongen H.
      • Kilpinen H.
      • Beltran S.
      • Gut M.
      • Kahlem K.
      • Amstislavskiy V.
      • Stegle O.
      • Pirinen M.
      • Montgomery S.B.
      • Donnelly P.
      • McCarthy M.I.
      • Flicek P.
      • Strom T.M.
      • Geuvadis Consortium
      • Lehrach H.
      • Schreiber S.
      • Sudbrak R.
      • Carracedo A.
      • Antonarakis S.E.
      • Häsler R.
      • Syvänen A.-C.
      • van Ommen G.-J.
      • Brazma A.
      • Meitinger T.
      • Rosenstiel P.
      • Guigó R.
      • Gut I.G.
      • Estivill X.
      • Dermitzakis E.T.
      Transcriptome and genome sequencing uncovers functional variation in humans.
      ). These studies all suggest that integrating high-throughput proteomic data into QTL analysis could provide new insights into gene expression regulation.
      Similarly, analysis of the correlation between copy number alteration (CNA) and mRNA or protein abundance has been used to infer the impact of CNAs on mRNA and protein abundance, including both cis-effects on the abundance of genes in the same loci and trans-effects on the abundance of genes at other loci in the genome. Visualization of the resulting correlation matrix in a heatmap can help highlight statistically significant cis- and trans- correlations. Furthermore, visually and statistically comparing the correlation heatmaps for mRNA and protein can reveal relationships between these profiles: cis- and trans-effects in protein (and also phosphoprotein) are generally subsets of mRNA cis- and trans-effects respectively, with more directionally uniform effects at the protein level (
      • Zhang B.
      • Wang J.
      • Wang X.
      • Zhu J.
      • Liu Q.
      • Shi Z.
      • Chambers M.C.
      • Zimmerman L.J.
      • Shaddox K.F.
      • Kim S.
      • Davies S.R.
      • Wang S.
      • Wang P.
      • Kinsinger C.R.
      • Rivers R.C.
      • Rodriguez H.
      • Townsend R.R.
      • Ellis M.J.C.
      • Carr S.A.
      • Tabb D.L.
      • Coffey R.J.
      • Slebos R.J.C.
      • Liebler D.C.
      • NCI CPTAC
      Proteogenomic characterization of human colon and rectal cancer.
      ,
      • Mertins P.
      • Mani D.R.
      • Ruggles K.V.
      • Gillette M.A.
      • Clauser K.R.
      • Wang P.
      • Wang X.
      • Qiao J.W.
      • Cao S.
      • Petralia F.
      • Kawaler E.
      • Mundt F.
      • Krug K.
      • Tu Z.
      • Lei J.T.
      • Gatza M.L.
      • Wilkerson M.
      • Perou C.M.
      • Yellapantula V.
      • Huang K.
      • Lin C.
      • McLellan M.D.
      • Yan P.
      • Davies S.R.
      • Townsend R.R.
      • Skates S.J.
      • Wang J.
      • Zhang B.
      • Kinsinger C.R.
      • Mesri M.
      • Rodriguez H.
      • Ding L.
      • Paulovich A.G.
      • Fenyö D.
      • Ellis M.J.
      • Carr S.A.
      • NCI CPTAC
      Proteogenomics connects somatic mutations to signalling in breast cancer.
      ,
      • Zhang H.
      • Liu T.
      • Zhang Z.
      • Payne S.H.
      • Zhang B.
      • McDermott J.E.
      • Zhou J.-Y.
      • Petyuk V.A.
      • Chen L.
      • Ray D.
      • Sun S.
      • Yang F.
      • Chen L.
      • Wang J.
      • Shah P.
      • Cha S.W.
      • Aiyetan P.
      • Woo S.
      • Tian Y.
      • Gritsenko M.A.
      • Clauss T.R.
      • Choi C.
      • Monroe M.E.
      • Thomas S.
      • Nie S.
      • Wu C.
      • Moore R.J.
      • Yu K.-H.
      • Tabb D.L.
      • Fenyö D.
      • Bafna V.
      • Wang Y.
      • Rodriguez H.
      • Boja E.S.
      • Hiltke T.
      • Rivers R.C.
      • Sokoll L.
      • Zhu H.
      • Shih I.-M.
      • Cope L.
      • Pandey A.
      • Zhang B.
      • Snyder M.P.
      • Levine D.A.
      • Smith R.D.
      • Chan D.W.
      • Rodland K.D.
      • Investigators CPTAC
      Integrated proteogenomic characterization of human high-grade serous ovarian cancer.
      ) (Fig. 2B). These correlation matrices can also be used to identify candidate driver genes whose copy number alterations directly drive significant trans-effects by comparing with functional knockdown data in large public databases like LINCS (Library of Integrated Network-based Cellular Signatures) (
      • Mertins P.
      • Mani D.R.
      • Ruggles K.V.
      • Gillette M.A.
      • Clauser K.R.
      • Wang P.
      • Wang X.
      • Qiao J.W.
      • Cao S.
      • Petralia F.
      • Kawaler E.
      • Mundt F.
      • Krug K.
      • Tu Z.
      • Lei J.T.
      • Gatza M.L.
      • Wilkerson M.
      • Perou C.M.
      • Yellapantula V.
      • Huang K.
      • Lin C.
      • McLellan M.D.
      • Yan P.
      • Davies S.R.
      • Townsend R.R.
      • Skates S.J.
      • Wang J.
      • Zhang B.
      • Kinsinger C.R.
      • Mesri M.
      • Rodriguez H.
      • Ding L.
      • Paulovich A.G.
      • Fenyö D.
      • Ellis M.J.
      • Carr S.A.
      • NCI CPTAC
      Proteogenomics connects somatic mutations to signalling in breast cancer.
      ).
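      The correlation matrix underlying such heatmaps can be sketched as follows: each row is a gene's copy number profile across samples, each column is a gene's mRNA (or protein) abundance profile, and each cell holds their correlation. In this minimal sketch the diagonal (same gene) is treated as the cis-effect and every off-diagonal cell as a trans-effect, which is a simplification of "the same locus"; Pearson correlation, the dictionary layout, and the function names are assumptions, and no significance testing or multiple-testing correction is performed.

```python
def pearson(x, y):
    """Pearson correlation; returns NaN if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


def cna_correlation_matrix(cna, abundance):
    """cna and abundance map gene -> list of values in the same sample order.
    Returns a nested dict: matrix[cna_gene][abundance_gene] = correlation."""
    return {
        c: {a: pearson(cna[c], abundance[a]) for a in sorted(abundance)}
        for c in sorted(cna)
    }


def split_cis_trans(matrix):
    """Separate same-gene (cis) correlations from all other (trans) pairs."""
    cis, trans = {}, {}
    for cna_gene, row in matrix.items():
        for abundance_gene, r in row.items():
            if cna_gene == abundance_gene:
                cis[cna_gene] = r
            else:
                trans[(cna_gene, abundance_gene)] = r
    return cis, trans
```

      Building one matrix from mRNA and another from protein abundance with the same copy number input allows the two sets of cis- and trans-effects to be compared cell by cell.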
      Efforts have also been made to study the roles of miRNAs in gene expression regulation. miRNAs are small noncoding RNAs that pair to the messenger RNAs (mRNAs) of protein-coding genes to suppress their expression (
      • Bartel D.P.
      MicroRNAs: genomics, biogenesis, mechanism, and function.
      ). Several studies have shown that in addition to downregulating mRNA levels, miRNAs also directly repress translation of hundreds of genes (
      • Guo H.
      • Ingolia N.T.
      • Weissman J.S.
      • Bartel D.P.
      Mammalian microRNAs predominantly act to decrease target mRNA levels.
      ,
      • Selbach M.
      • Schwanhäusser B.
      • Thierfelder N.
      • Fang Z.
      • Khanin R.
      • Rajewsky N.
      Widespread changes in protein synthesis induced by microRNAs.
      ,
      • Baek D.
      • Villén J.
      • Shin C.
      • Camargo F.D.
      • Gygi S.P.
      • Bartel D.P.
      The impact of microRNAs on protein output.
      ). To investigate all miRNAs simultaneously in their endogenous context, Liu et al. performed an integrative analysis of global miRNA, mRNA, and protein profiles in nine colorectal cancer cell lines using a correlation-based method (
      • Liu Q.
      • Halvey P.J.
      • Shyr Y.
      • Slebos R.J.C.
      • Liebler D.C.
      • Zhang B.
      Integrative omics analysis reveals the importance and scope of translational repression in microRNA-mediated regulation.
      ) (Fig. 2B). This study showed that translational repression was involved in more than half, and played a major role in a third, of all predicted miRNA-target interactions. These predicted miRNA-target interactions can be further confirmed by more focused miRNA perturbation studies. Interestingly, sequence features known to drive site efficacy in mRNA decay, such as an 8mer seed site, site positioning within the 3′ UTR, local AU-rich context, and additional 3′ pairing, are generally not applicable to translational repression (
      • Liu Q.
      • Halvey P.J.
      • Shyr Y.
      • Slebos R.J.C.
      • Liebler D.C.
      • Zhang B.
      Integrative omics analysis reveals the importance and scope of translational repression in microRNA-mediated regulation.
      ). A key unanswered question is what sequence features determine selectivity for miRNA-mediated translational repression.

      Relating Mutations to Post-translational Modifications and Signaling

      Millions of nonsynonymous single nucleotide polymorphisms (nsSNPs) identified by next-generation sequencing (NGS) and genome-wide association (GWAS) studies have been correlated with certain phenotypes and diseases (
      • Bush W.S.
      • Moore J.H.
      Chapter 11: Genome-wide association studies.
      ,
      • Shastry B.S.
      SNPs in disease gene mapping, medicinal drug development and evolution.
      ). However, the functional mechanisms underlying these associations are often poorly understood or completely unknown. One likely explanation is that a subset of these SNPs results in amino acid changes at PTM sites, including sites of phosphorylation (serines, threonines, and tyrosines) and of acetylation and ubiquitylation (lysines), directly perturbing cell signaling networks (
      • Ryu G.-M.
      • Song P.
      • Kim K.-W.
      • Oh K.-S.
      • Park K.-J.
      • Kim J.H.
      Genome-wide analysis to predict protein sequence variations that change phosphorylation sites or their corresponding kinases.
      ,
      • Erxleben C.
      • Liao Y.
      • Gentile S.
      • Chin D.
      • Gomez-Alegria C.
      • Mori Y.
      • Birnbaumer L.
      • Armstrong D.L.
      Cyclosporin and Timothy syndrome increase mode 2 gating of CaV1.2 calcium channels through aberrant phosphorylation of S6 helices.
      ,
      • Gentile S.
      • Martin N.
      • Scappini E.
      • Williams J.
      • Erxleben C.
      • Armstrong D.L.
      The human ERG1 channel polymorphism, K897T, creates a phosphorylation site that inhibits channel activity.
      ,
      • Keegan S.
      • Cortens J.P.
      • Beavis R.C.
      • Fenyö D.
      g2pDB: A database mapping protein post-translational modifications to genomic coordinates.
      ,
      • Yang C.-Y.
      • Chang C.-H.
      • Yu Y.-L.
      • Lin T.-C.E.
      • Lee S.-A.
      • Yen C.-C.
      • Yang J.-M.
      • Lai J.-M.
      • Hong Y.-R.
      • Tseng T.-L.
      • Chao K.-M.
      • Huang C.-Y.F.
      PhosphoPOINT: a comprehensive human kinase interactome and phospho-protein database.
      ). Because these four amino acids account for 22.2% of all amino acids in the human proteome (
      • UniProt Consortium
      UniProt: a hub for protein information.
      ), they are expected to be disproportionately affected by missense mutations. Substitutions of amino acids that are targets of PTMs can result in destruction, genesis, or constitutive activation of PTM sites (
      • Gentile S.
      • Martin N.
      • Scappini E.
      • Williams J.
      • Erxleben C.
      • Armstrong D.L.
      The human ERG1 channel polymorphism, K897T, creates a phosphorylation site that inhibits channel activity.
      ). Moreover, mutations affecting proximal flanking positions of PTM sites might alter the recognition motif for the corresponding transferases; for example, protein kinases recognize, among other factors, specific sequence motifs on their substrate proteins (
      • Ryu G.-M.
      • Song P.
      • Kim K.-W.
      • Oh K.-S.
      • Park K.-J.
      • Kim J.H.
      Genome-wide analysis to predict protein sequence variations that change phosphorylation sites or their corresponding kinases.
      ,
      • Ren J.
      • Jiang C.
      • Gao X.
      • Liu Z.
      • Yuan Z.
      • Jin C.
      • Wen L.
      • Zhang Z.
      • Xue Y.
      • Yao X.
      PhosSNP for systematic analysis of genetic polymorphisms that influence protein phosphorylation.
      ).
      To address this, several studies have assessed the effect of SNP-induced changes to PTM sites, predominantly serine, threonine and tyrosine phosphorylation. In 2008 Ryu et al. (
      • Ryu G.-M.
      • Song P.
      • Kim K.-W.
      • Oh K.-S.
      • Park K.-J.
      • Kim J.H.
      Genome-wide analysis to predict protein sequence variations that change phosphorylation sites or their corresponding kinases.
      ) used data from Swiss-Prot and Swiss-variant (
      • UniProt Consortium
      UniProt: a hub for protein information.
      ) databases to develop software predicting phosphorylation sites accompanied by a database for human phosphovariants, which the authors defined as genetic variations that change phosphorylation sites or their interacting kinases. In this study variants were classified into three groups depending on whether the variant directly affects a phosphorylation site, the flanking region or the kinase itself (Fig. 2C). Two years later, Yang et al. (
      • Yang C.-Y.
      • Chang C.-H.
      • Yu Y.-L.
      • Lin T.-C.E.
      • Lee S.-A.
      • Yen C.-C.
      • Yang J.-M.
      • Lai J.-M.
      • Hong Y.-R.
      • Tseng T.-L.
      • Chao K.-M.
      • Huang C.-Y.F.
      PhosphoPOINT: a comprehensive human kinase interactome and phospho-protein database.
      ) used phosphosites annotated in Phospho.ELM (
      • Dinkel H.
      • Chica C.
      • Via A.
      • Gould C.M.
      • Jensen L.J.
      • Gibson T.J.
      • Diella F.
      Phospho.ELM: a database of phosphorylation sites—update 2011.
      ), the Human Protein Reference Database (HPRD) (
      • Keshava Prasad T.S.
      • Goel R.
      • Kandasamy K.
      • Keerthikumar S.
      • Kumar S.
      • Mathivanan S.
      • Telikicherla D.
      • Raju R.
      • Shafreen B.
      • Venugopal A.
      • Balakrishnan L.
      • Marimuthu A.
      • Banerjee S.
      • Somanathan D.S.
      • Sebastian A.
      • Rani S.
      • Ray S.
      • Harrys Kishore C.J.
      • Kanth S.
      • Ahmed M.
      • Kashyap M.K.
      • Mohmood R.
      • Ramachandra Y.L.
      • Krishna V.
      • Rahiman B.A.
      • Mohan S.
      • Ranganathan P.
      • Ramabadran S.
      • Chaerkady R.
      • Pandey A.
      Human Protein Reference Database–2009 update.
      ) and SwissProt (
      • UniProt Consortium
      UniProt: a hub for protein information.
      ) together with SNPs annotated in the NCBI dbSNP database (
      • Sherry S.T.
      • Ward M.H.
      • Kholodov M.
      • Baker J.
      • Phan L.
      • Smigielski E.M.
      • Sirotkin K.
      dbSNP: the NCBI database of genetic variation.
      ), to identify 64 phosphorylation sites that potentially result in a disease phenotype, including schizophrenia and hypertension, when substituted by a nonphosphorylatable amino acid. In total, 1451 nsSNPs present in dbSNP (downloaded May 2007) occurred within a ±7 amino acid flanking region of a phosphosite, thereby potentially influencing recognition of the substrate by its kinase. In a related study, Ren et al. (
      • Ren J.
      • Jiang C.
      • Gao X.
      • Liu Z.
      • Yuan Z.
      • Jin C.
      • Wen L.
      • Zhang Z.
      • Xue Y.
      • Yao X.
      PhosSNP for systematic analysis of genetic polymorphisms that influence protein phosphorylation.
      ) carried out a genome-wide analysis of SNPs that potentially influence protein phosphorylation status. The authors used a combination of dbSNP, predicted kinase-specific phosphosites, and experimentally detected phosphosites to identify and classify SNPs affecting phosphosignaling. Based on the predicted phosphosites, the authors estimated that ∼70% of nsSNPs have the potential to affect phosphosignaling, suggesting that a large portion of nsSNPs play an important role in rewiring biological pathways. Creixell et al. (
      • Creixell P.
      • Schoof E.M.
      • Simpson C.D.
      • Longden J.
      • Miller C.J.
      • Lou H.J.
      • Perryman L.
      • Cox T.R.
      • Zivanovic N.
      • Palmeri A.
      • Wesolowska-Andersen A.
      • Helmer-Citterich M.
      • Ferkinghoff-Borg J.
      • Itamochi H.
      • Bodenmiller B.
      • Erler J.T.
      • Turk B.E.
      • Linding R.
      Kinome-wide decoding of network-attacking mutations rewiring cancer signaling.
      ) described a similar computational approach (ReKINect) to systematically classify and interpret such network-attacking mutations (NAMs) specifically in phosphosignaling. The authors used exome sequencing, bioinformatics and phosphoproteomics to demonstrate as a proof-of-principle the existence of six types of NAMs in human cancer cell lines.
      Additionally, Reimand et al. (
      • Reimand J.
      • Bader G.D.
      Systematic analysis of somatic mutations in phosphorylation signaling predicts novel cancer drivers.
      ) developed a computational method (ActiveDriver) based on a gene-centric generalized linear regression model to detect aberrant mutation rates proximal to phosphorylation sites. This method was used to analyze 800 genomes spanning eight cancer types to detect mutations specifically targeting the phosphorylation machinery, identifying 44 genes with significantly higher mutation rates in regions with detected phosphosites compared with the rest of the gene sequence, taking its structured and disordered regions into account. The identified mutations comprised both known driver mutations and novel candidates. The authors then extended their approach to the TCGA pan-cancer data set, containing more than 3000 genomes from 12 cancer types, and found mutations affecting phosphosignaling in about 90% of all tumors (
      • Reimand J.
      • Wagih O.
      • Bader G.D.
      The mutational landscape of phosphorylation signaling in cancer.
      ).
      Two databases valuable for PTM proteogenomic analysis that have not yet been described are PhosphoSitePlus (PSP) (
      • Hornbeck P.V.
      • Zhang B.
      • Murray B.
      • Kornhauser J.M.
      • Latham V.
      • Skrzypek E.
      PhosphoSitePlus, 2014: mutations, PTMs and recalibrations.
      ) and g2pDB (
      • Keegan S.
      • Cortens J.P.
      • Beavis R.C.
      • Fenyö D.
      g2pDB: A database mapping protein post-translational modifications to genomic coordinates.
      ). PSP is a manually curated database of mammalian PTM sites containing over 330,000 nonredundant PTMs (
      • Hornbeck P.V.
      • Zhang B.
      • Murray B.
      • Kornhauser J.M.
      • Latham V.
      • Skrzypek E.
      PhosphoSitePlus, 2014: mutations, PTMs and recalibrations.
      ). In its latest release (2014), PSP introduced the “PTMVar” data set, which intersects missense mutations and PTMs, detailing 25,000 PTMs impacted by known variants, about 75% of which relate to phosphorylation. The remaining PTM sites comprise ubiquitylation, acetylation, mono-methylation, and succinylation sites. These additional modifications, despite their low coverage, enable researchers to relate genomic mutations to PTMs beyond phosphorylation. g2pDB (
      • Keegan S.
      • Cortens J.P.
      • Beavis R.C.
      • Fenyö D.
      g2pDB: A database mapping protein post-translational modifications to genomic coordinates.
      ) is a database mapping protein PTMs to genomic coordinates for all phosphorylations, acetylations and ubiquitylations that are available in the Global Proteome Machine Database (GPMDB) (
      • Craig R.
      • Cortens J.P.
      • Beavis R.C.
      Open source system for analyzing, validating, and storing protein identification data.
      ,
      • Fenyö D.
      • Beavis R.C.
      The GPMDB REST interface.
      ). Overlaying the genome-mapped PTM sites with genome coordinates of known, disease-associated SNPs might reveal a role of these PTM sites in the respective disease. A list of all relevant tools and databases can be found in Table II.
      Table II. Computational frameworks and resources for intersecting PTMs and mutations
      Name | URL | Reference | Remarks
      PhosphoPOINT | http://kinase.bioinformatics.tw* | Yang C.-Y. et al., PhosphoPOINT: a comprehensive human kinase interactome and phospho-protein database. | Human kinase interactome and phospho-protein database
      PhosSNP | http://phossnp.biocuckoo.org/ | Ren J. et al., PhosSNP for systematic analysis of genetic polymorphisms that influence protein phosphorylation. | Database of mutations predicted to impact phosphorylation status of proteins
      PhosphoVariant | NA | Ryu G.-M. et al., Genome-wide analysis to predict protein sequence variations that change phosphorylation sites or their corresponding kinases. | Database for definite and possible variants changing phosphosites
      PTMVar | phosphosite.org | Hornbeck P.V. et al., PhosphoSitePlus, 2014: mutations, PTMs and recalibrations. | Database intersecting non-synonymous SNPs and PTM sites
      ActiveDriver | http://individual.utoronto.ca/reimand/ActiveDriver/ | Reimand J., Bader G.D., Systematic analysis of somatic mutations in phosphorylation signaling predicts novel cancer drivers. | Prediction of ‘active’ phosphosites in proteins that are specifically and significantly mutated in cancer genomes
      ReKINect | http://rekinect.science/home | Creixell P. et al., Kinome-wide decoding of network-attacking mutations rewiring cancer signaling. | Prediction of network attacking mutations (NAMs) from NGS data
      MIMP | http://mimp.baderlab.org/ | Wagih O., Reimand J., Bader G.D., MIMP: predicting the impact of mutations on kinase-substrate phosphorylation. | Characterization of genetic variants that specifically alter kinase-binding sites in proteins
      g2pDB | www.g2pdb.org | Keegan S. et al., g2pDB: A database mapping protein post-translational modifications to genomic coordinates. | Database of auto-curated PTM sites mapped to their genomic locations
      * Link not working at the time of writing.
      All the aforementioned studies focused on the classification of mutations as either directly or indirectly (i.e. in a proximal flanking region) affecting the PTM site. However, there is no consensus on the length of flanking regions, the number of classes, or their nomenclature, making it difficult to directly compare the findings of different studies. The integrative analysis of genomic mutations and PTM-mediated signaling shows great promise in providing insights into the mode of action of disease-associated mutations. This type of analysis can aid in discriminating tumor driver mutations from functionally neutral passenger mutations, and ultimately lead to novel personalized treatments. More importantly, PTMs are not accessible to genomic sequencing technologies, and the ever-increasing collection of published, global PTM-omes at single amino acid resolution demonstrates the indispensable value of state-of-the-art MS-based proteomics in the era of precision medicine.
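      The direct-versus-flanking distinction discussed above can be illustrated with a minimal sketch. The function name, the class labels ("direct", "flanking", "distal"), and the example positions are hypothetical and do not follow the nomenclature of any cited study; the default window of 7 residues mirrors the ±7 flanking region reported for Yang et al., and changing it shows how the choice of window alters the classification.

```python
from typing import Iterable


def classify_variant(variant_pos: int, ptm_sites: Iterable[int], flank: int = 7) -> str:
    """Classify a missense variant relative to known PTM sites on one protein.

    Positions are 1-based residue indices. Returns "direct" if the variant
    hits a PTM residue, "flanking" if it lies within `flank` residues of
    one, and "distal" otherwise. The labels are illustrative only.
    """
    sites = set(ptm_sites)
    if variant_pos in sites:
        return "direct"
    if any(abs(variant_pos - site) <= flank for site in sites):
        return "flanking"
    return "distal"


# Hypothetical protein with phosphosites at residues 15 and 120.
sites = [15, 120]
labels = {pos: classify_variant(pos, sites) for pos in (15, 20, 113, 60)}

# The same variant can change class when the flanking window is narrowed.
narrow = classify_variant(20, sites, flank=3)
```

      With the default window, the variant at residue 20 counts as flanking, whereas with a window of 3 it is classified as distal, which is one concrete reason results from studies using different window lengths are hard to compare.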

      Integrative Modeling of Proteogenomic Data

      Integrative modeling involves the application of statistical, machine learning and network-modeling tools to data obtained from one or more omics platforms. In this section, we focus on the application of integrative modeling to proteogenomic analyses. Models can be developed on combined omics data sets (e.g. genomics and proteomics), or applied to each omics data type separately, and the results comparatively analyzed. We review clustering (Unsupervised Clustering) and predictive modeling (Predictive Modeling)—usually termed class discovery and class prediction, respectively—which are both orthogonal approaches to gaining insight from biological data; and network modeling (Pathway and Network Modeling), which interprets data in the context of prior biological knowledge and promotes understanding at the level of pathways and cellular mechanisms.

      Unsupervised Clustering

      Clustering is a method of grouping similar entities (e.g. samples, genes, or proteins) together based on a similarity metric. Because meta-data about the entities (such as phenotypes, mutations, or disease type) are not used in the clustering process, the algorithms are termed “unsupervised,” and are primarily used to discover new groups or classes, in addition to computationally validating known biology. Most proteogenomic analyses include unsupervised clustering of proteome and/or phosphoproteome data, followed by comparison of the resulting clusters to known subgroups, cluster labels derived from genomic data, or other mutation, survival, or clinical data.
      Clustering of proteome data is performed using a variety of algorithms including hierarchical (
      • Zhang B.
      • Wang J.
      • Wang X.
      • Zhu J.
      • Liu Q.
      • Shi Z.
      • Chambers M.C.
      • Zimmerman L.J.
      • Shaddox K.F.
      • Kim S.
      • Davies S.R.
      • Wang S.
      • Wang P.
      • Kinsinger C.R.
      • Rivers R.C.
      • Rodriguez H.
      • Townsend R.R.
      • Ellis M.J.C.
      • Carr S.A.
      • Tabb D.L.
      • Coffey R.J.
      • Slebos R.J.C.
      • Liebler D.C.
      • NCI CPTAC
      Proteogenomic characterization of human colon and rectal cancer.
      ), k-means (
      • Mertins P.
      • Mani D.R.
      • Ruggles K.V.
      • Gillette M.A.
      • Clauser K.R.
      • Wang P.
      • Wang X.
      • Qiao J.W.
      • Cao S.
      • Petralia F.
      • Kawaler E.
      • Mundt F.
      • Krug K.
      • Tu Z.
      • Lei J.T.
      • Gatza M.L.
      • Wilkerson M.
      • Perou C.M.
      • Yellapantula V.
      • Huang K.
      • Lin C.
      • McLellan M.D.
      • Yan P.
      • Davies S.R.
      • Townsend R.R.
      • Skates S.J.
      • Wang J.
      • Zhang B.
      • Kinsinger C.R.
      • Mesri M.
      • Rodriguez H.
      • Ding L.
      • Paulovich A.G.
      • Fenyö D.
      • Ellis M.J.
      • Carr S.A.
      • NCI CPTAC
      Proteogenomics connects somatic mutations to signalling in breast cancer.
      ) and model-based clustering (
      • Zhang H.
      • Liu T.
      • Zhang Z.
      • Payne S.H.
      • Zhang B.
      • McDermott J.E.
      • Zhou J.-Y.
      • Petyuk V.A.
      • Chen L.
      • Ray D.
      • Sun S.
      • Yang F.
      • Chen L.
      • Wang J.
      • Shah P.
      • Cha S.W.
      • Aiyetan P.
      • Woo S.
      • Tian Y.
      • Gritsenko M.A.
      • Clauss T.R.
      • Choi C.
      • Monroe M.E.
      • Thomas S.
      • Nie S.
      • Wu C.
      • Moore R.J.
      • Yu K.-H.
      • Tabb D.L.
      • Fenyö D.
      • Bafna V.
      • Wang Y.
      • Rodriguez H.
      • Boja E.S.
      • Hiltke T.
      • Rivers R.C.
      • Sokoll L.
      • Zhu H.
      • Shih I.-M.
      • Cope L.
      • Pandey A.
      • Zhang B.
      • Snyder M.P.
      • Levine D.A.
      • Smith R.D.
      • Chan D.W.
      • Rodland K.D.
      • Investigators CPTAC
      Integrated proteogenomic characterization of human high-grade serous ovarian cancer.
      ). Although the clustering algorithms can vary, consensus clustering (
      • Monti S.
      • Tamayo P.
      • Mesirov J.
      • Golub T.
      Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data.
      ) is a common approach used to assess cluster stability and define the natural number of clusters in the data. Visualization of the consensus matrix, along with the delta-area plot and silhouette plots (
      • Rousseeuw P.J.
      Silhouettes: A graphical aid to the interpretation and validation of cluster analysis.
      ) is an effective way to determine the number of clusters in the data.
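      The idea behind a consensus matrix can be sketched as follows. This is a simplified illustration, not the implementation used in any of the cited studies: it uses a toy k-means written in plain Python, an 80% subsampling fraction, 50 resamples, and a small artificial data set, all of which are assumptions chosen for brevity. Each entry of the matrix is the fraction of resampling runs in which two samples landed in the same cluster, out of the runs in which both were drawn.

```python
import random


def kmeans(points, k, rng, n_iter=20):
    """Plain Lloyd's k-means on a list of equal-length tuples; returns one label per point."""
    centers = rng.sample(points, k)  # initialise centers at k distinct data points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def consensus_matrix(points, k, n_resamples=50, frac=0.8, seed=0):
    """Fraction of subsampled runs in which each pair of points was co-clustered."""
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]  # times i and j shared a cluster
    sampled = [[0] * n for _ in range(n)]   # times i and j were drawn together
    for _ in range(n_resamples):
        idx = rng.sample(range(n), int(frac * n))
        labels = kmeans([points[i] for i in idx], k, rng)
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                sampled[i][j] += 1
                together[i][j] += labels[a] == labels[b]
    return [
        [together[i][j] / sampled[i][j] if sampled[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


# Toy data: two well-separated groups of five samples each.
points = [(0.1 * i, 0.1 * i) for i in range(5)] + [(10 + 0.1 * i, 10 + 0.1 * i) for i in range(5)]
consensus = consensus_matrix(points, k=2)
```

      For stable clusters the entries are close to 0 or 1; intermediate values indicate sample pairs whose assignment depends on the subsample, which is the signal used when comparing candidate numbers of clusters.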
      Once the proteome or phosphoproteome clusters are identified, the samples constituting these clusters can be characterized by enrichment tests for known subgroups (e.g. PAM-50 classification or RPPA groups in breast cancer, methylation subtype in colon cancer, mutation status for relevant genes, or other clinical or survival data). In addition, supervised marker selection methods combined with pathway enrichment analysis (e.g. SAM (
      • Tusher V.G.
      • Tibshirani R.
      • Chu G.
      Significance analysis of microarrays applied to the ionizing radiation response.
      ) marker selection followed by Gene Set Enrichment Analysis (GSEA) (
      • Subramanian A.
      • Tamayo P.
      • Mootha V.K.
      • Mukherjee S.
      • Ebert B.L.
      • Gillette M.A.
      • Paulovich A.
      • Pomeroy S.L.
      • Golub T.R.
      • Lander E.S.
      • Mesirov J.P.
      Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles.
      ) for pathway enrichment) can also be used to characterize proteome or phosphoproteome clusters by identifying pathways or gene sets that are selectively up- or down-regulated in each cluster.
      An alternative approach to clustering proteome or phosphoproteome data is to project the original data to pathway space and then cluster the projected data. This approach was used by Mertins et al. (
      • Mertins P.
      • Mani D.R.
      • Ruggles K.V.
      • Gillette M.A.
      • Clauser K.R.
      • Wang P.
      • Wang X.
      • Qiao J.W.
      • Cao S.
      • Petralia F.
      • Kawaler E.
      • Mundt F.
      • Krug K.
      • Tu Z.
      • Lei J.T.
      • Gatza M.L.
      • Wilkerson M.
      • Perou C.M.
      • Yellapantula V.
      • Huang K.
      • Lin C.
      • McLellan M.D.
      • Yan P.
      • Davies S.R.
      • Townsend R.R.
      • Skates S.J.
      • Wang J.
      • Zhang B.
      • Kinsinger C.R.
      • Mesri M.
      • Rodriguez H.
      • Ding L.
      • Paulovich A.G.
      • Fenyö D.
      • Ellis M.J.
      • Carr S.A.
      • NCI CPTAC
      Proteogenomics connects somatic mutations to signalling in breast cancer.
      ) to cluster phosphoproteome pathways, resulting in a unique cluster not directly observed either in the proteome or phosphoproteome data. The projection to pathway space is performed using single-sample gene set enrichment analysis (ssGSEA) (
      • Barbie D.A.
      • Tamayo P.
      • Boehm J.S.
      • Kim S.Y.
      • Moody S.E.
      • Dunn I.F.
      • Schinzel A.C.
      • Sandy P.
      • Meylan E.
      • Scholl C.
      • Fröhling S.
      • Chan E.M.
      • Sos M.L.
      • Michel K.
      • Mermel C.
      • Silver S.J.
      • Weir B.A.
      • Reiling J.H.
      • Sheng Q.
      • Gupta P.B.
      • Wadlow R.C.
      • Le H.
      • Hoersch S.
      • Wittner B.S.
      • Ramaswamy S.
      • Livingston D.M.
      • Sabatini D.M.
      • Meyerson M.
      • Thomas R.K.
      • Lander E.S.
      • Mesirov J.P.
      • Root D.E.
      • Gilliland D.G.
      • Jacks T.
      • Hahn W.C.
      Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1.
      ), where the enrichment of curated pathways (MSigDB C2 gene sets, http://software.broadinstitute.org/gsea/msigdb) in each sample is evaluated. The enrichment scores are then subjected to unsupervised clustering, followed by characterization of the derived clusters using the pathways constituting the data set.

      Coclustering

      In coclustering, data from multiple modalities (e.g. mRNA and proteome) are treated as independent “samples,” and clustering is performed over the collection of disparate omics profiles. The key here is to either transform the data so that different modalities are comparable (e.g. z-scores), or to use a similarity metric that is agnostic to the scale of values in the data (e.g. Spearman correlation (
      • Daniel W.W.
      Spearman rank correlation coefficient.
      )). In the study by Mertins et al., coclustering of mRNA and proteome data (using hierarchical clustering), after filtering to retain genes or proteins with moderate to high correlation, showed that the mRNA profiles of samples are closest to their corresponding proteome profiles, thereby validating sample quality and mitigating concerns regarding tumor heterogeneity (
      • Mertins P.
      • Mani D.R.
      • Ruggles K.V.
      • Gillette M.A.
      • Clauser K.R.
      • Wang P.
      • Wang X.
      • Qiao J.W.
      • Cao S.
      • Petralia F.
      • Kawaler E.
      • Mundt F.
      • Krug K.
      • Tu Z.
      • Lei J.T.
      • Gatza M.L.
      • Wilkerson M.
      • Perou C.M.
      • Yellapantula V.
      • Huang K.
      • Lin C.
      • McLellan M.D.
      • Yan P.
      • Davies S.R.
      • Townsend R.R.
      • Skates S.J.
      • Wang J.
      • Zhang B.
      • Kinsinger C.R.
      • Mesri M.
      • Rodriguez H.
      • Ding L.
      • Paulovich A.G.
      • Fenyö D.
      • Ellis M.J.
      • Carr S.A.
      • NCI CPTAC
      Proteogenomics connects somatic mutations to signalling in breast cancer.
      ) (Fig. 3A).
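      The use of a scale-agnostic similarity metric can be illustrated with a short sketch. It covers only the similarity step, not the hierarchical clustering itself; the sample names, the five-gene profiles, and the helper functions are hypothetical. Spearman correlation is computed as the Pearson correlation of ranks, so mRNA values on a count-like scale and protein values on a log-ratio scale can be compared directly.

```python
def ranks(values):
    """Average ranks (1-based), with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical profiles over the same five genes, on very different scales.
mrna = {"S1": [500, 20, 300, 80, 1000], "S2": [30, 900, 60, 400, 10]}
prot = {"S1": [0.8, -1.2, 0.3, -0.5, 1.5], "S2": [-0.9, 1.1, -0.4, 0.6, -1.3]}

# For each mRNA profile, find the proteome profile it is most similar to.
nearest = {s: max(prot, key=lambda p: spearman(mrna[s], prot[p])) for s in mrna}
```

      In this toy example each mRNA profile is most similar to the proteome profile of the same sample, which is the pattern the coclustering check described above looks for.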

      Multi-omic Clustering

      Clustering over sample profiles obtained from two or more omic platforms is referred to as multi-omic clustering. Unlike coclustering, where the multi-omic data from each sample provides independent items that are clustered simultaneously, multi-omic clustering attempts to derive an integrative clustering to assign each sample to a single cluster based on combined evidence from multi-omic data. An overall review of multi-omic clustering methods is presented in (
      • Wang D.
      • Gu J.
      Integrative clustering methods of multi-omics data for molecule-based cancer classifications.
      ) where algorithms are grouped by strategy:
      • Direct integrative clustering methods use a combined multi-omics data set as input to the clustering analysis. Examples in this category include iCluster+ (
        • Mo Q.
        • Wang S.
        • Seshan V.E.
        • Olshen A.B.
        • Schultz N.
        • Sander C.
        • Powers R.S.
        • Ladanyi M.
        • Shen R.
        Pattern discovery and cancer gene identification in integrated cancer genomic data.
        ), LRAcluster (
        • Wu D.
        • Wang D.
        • Zhang M.Q.
        • Gu J.
        Fast dimension reduction and integrative clustering of multi-omics data using low-rank approximation: application to cancer molecular classification.
        ) and moCluster (
        • Meng C.
        • Helm D.
        • Frejno M.
        • Kuster B.
        moCluster: identifying joint patterns across multiple omics data sets.
        ).
      • Clustering of clusters is an approach where clustering is initially performed on each omics data set and the results integrated into final cluster assignments. Examples include COCA (
        • Hoadley K.A.
        • Yau C.
        • Wolf D.M.
        • Cherniack A.D.
        • Tamborero D.
        • Ng S.
        • Leiserson M.D.M.
        • Niu B.
        • McLellan M.D.
        • Uzunangelov V.
        • Zhang J.
        • Kandoth C.
        • Akbani R.
        • Shen H.
        • Omberg L.
        • Chu A.
        • Margolin A.A.
        • Van't Veer L.J.
        • Lopez-Bigas N.
        • Laird P.W.
        • Raphael B.J.
        • Ding L.
        • Robertson A.G.
        • Byers L.A.
        • Mills G.B.
        • Weinstein J.N.
        • Van Waes C.
        • Chen Z.
        • Collisson E.A.
        • Cancer Genome Atlas Research Network
        • Benz C.C.
        • Perou C.M.
        • Stuart J.M.
        Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin.
        ) and SNF (
        • Wang B.
        • Mezlini A.M.
        • Demir F.
        • Fiume M.
        • Tu Z.
        • Brudno M.
        • Haibe-Kains B.
        • Goldenberg A.
        Similarity network fusion for aggregating data types on a genomic scale.
        ).
      • Regulatory integrative clustering harnesses molecular regulatory structures and/or networks to integrate different omics data sets in a robust manner. Examples in this group include PARADIGM (
        • Vaske C.J.
        • Benz S.C.
        • Sanborn J.Z.
        • Earl D.
        • Szeto C.
        • Zhu J.
        • Haussler D.
        • Stuart J.M.
        Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM.
        ) and iRafNet (
        • Petralia F.
        • Wang P.
        • Yang J.
        • Tu Z.
        Integrative random forest for gene regulatory network inference.
        ).
      Many of the clustering algorithms use de novo or regulatory network graphs to model interactions in each omics domain, and to drive integration across different omics data sets. A review of clustering methods from this orthogonal perspective is provided in (
      • Bersanelli M.
      • Mosca E.
      • Remondini D.
      • Giampieri E.
      • Sala C.
      • Castellani G.
      • Milanesi L.
      Methods for the integration of multi-omics data: mathematical aspects.
      ).
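      As a rough illustration of the clustering-of-clusters strategy, the sketch below encodes per-platform cluster assignments as binary indicator columns, which can then serve as input to a second round of clustering. It is an assumption-laden simplification rather than the COCA implementation: the platform names, cluster labels, and sample identifiers are invented, and the second-level clustering is replaced here by a simple pairwise agreement score.

```python
def indicator_matrix(assignments):
    """Turn {platform: {sample: cluster_label}} into binary indicator rows.

    Returns (samples, columns, rows), where each column is a
    (platform, cluster_label) pair and rows[sample] is a 0/1 list.
    """
    samples = sorted(set.intersection(*(set(a) for a in assignments.values())))
    columns = [
        (platform, cluster)
        for platform in sorted(assignments)
        for cluster in sorted(set(assignments[platform].values()))
    ]
    rows = {
        s: [int(assignments[platform][s] == cluster) for platform, cluster in columns]
        for s in samples
    }
    return samples, columns, rows


def agreement(rows, a, b, n_platforms):
    """Fraction of platforms on which samples a and b share a cluster."""
    return sum(x * y for x, y in zip(rows[a], rows[b])) / n_platforms


# Hypothetical platform-specific cluster assignments for four tumors.
assignments = {
    "mRNA": {"T1": "a", "T2": "a", "T3": "b", "T4": "b"},
    "proteome": {"T1": "x", "T2": "x", "T3": "y", "T4": "x"},
    "phospho": {"T1": "p", "T2": "p", "T3": "q", "T4": "q"},
}
samples, columns, rows = indicator_matrix(assignments)
score_t1_t2 = agreement(rows, "T1", "T2", len(assignments))
score_t1_t3 = agreement(rows, "T1", "T3", len(assignments))
```

      Samples that are grouped together on every platform have an agreement of 1, and samples never grouped together have 0; an integrative method would cluster the indicator rows rather than inspect pairs by hand.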

      Predictive Modeling

      Predictive modeling is a statistical approach in which models are built to predict a future outcome based on data attributes. Machine learning, pattern recognition and predictive analytics all fall under the umbrella of predictive modeling, and this method of analysis has been rapidly gaining traction across most scientific disciplines. Predictive modeling and machine learning techniques applied to proteogenomics can greatly improve our ability to diagnose disease accurately, guide prognosis, and select treatment. For example, global molecular profiling of tissues and tumors enables a shift from nonspecific treatment strategies toward a more targeted, personalized approach based on the presence or absence of predictive genetic and/or protein signatures. Typical supervised classification methods used for predictive modeling from omics data include Support Vector Machines (SVMs) (
      • Guyon I.
      • Weston J.
      • Barnhill S.
      • Vapnik V.
      Gene selection for cancer classification using support vector machines.
      ), Bayesian logistic regression (
      • Ma J.
      • Stingo F.C.
      • Hobbs B.P.
      Bayesian predictive modeling for genomic based personalized treatment selection.
      ), and random forests (
      • Breiman L.
      Random Forests.
      ).

      Machine Learning

      To date, machine learning and statistical modeling techniques applied to genomics and transcriptomics data have identified genetic profiles predictive of disease diagnoses (
      • Janssens A.C. J.W.
      • Aulchenko Y.S.
      • Elefante S.
      • Borsboom G.J. J.M.
      • Steyerberg E.W.
      • van Duijn C.M.
      Predictive testing for complex diseases using multiple genes: fact or fiction?.
      ) and drug response (
      • Barretina J.
      • Caponigro G.
      • Stransky N.
      • Venkatesan K.
      • Margolin A.A.
      • Kim S.
      • Wilson C.J.
      • Lehár J.
      • Kryukov G.V.
      • Sonkin D.
      • Reddy A.
      • Liu M.
      • Murray L.
      • Berger M.F.
      • Monahan J.E.
      • Morais P.
      • Meltzer J.
      • Korejwa A.
      • Jané-Valbuena J.
      • Mapa F.A.
      • Thibault J.
      • Bric-Furlong E.
      • Raman P.
      • Shipway A.
      • Engels I.H.
      • Cheng J.
      • Yu G.K.
      • Yu J.
      • Aspesi P.
      • de Silva M.
      • Jagtap K.
      • Jones M.D.
      • Wang L.
      • Hatton C.
      • Palescandolo E.
      • Gupta S.
      • Mahan S.
      • Sougnez C.
      • Onofrio R.C.
      • Liefeld T.
      • MacConaill L.
      • Winckler W.
      • Reich M.
      • Li N.
      • Mesirov J.P.
      • Gabriel S.B.
      • Getz G.
      • Ardlie K.
      • Chan V.
      • Myer V.E.
      • Weber B.L.
      • Porter J.
      • Warmuth M.
      • Finan P.
      • Harris J.L.
      • Meyerson M.
      • Golub T.R.
      • Morrissey M.P.
      • Sellers W.R.
      • Schlegel R.
      • Garraway L.A.
      The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity.
      ,
      • Daemen A.
      • Griffith O.L.
      • Heiser L.M.
      • Wang N.J.
      • Enache O.M.
      • Sanborn Z.
      • Pepin F.
      • Durinck S.
      • Korkola J.E.
      • Griffith M.
      • Hur J.S.
      • Huh N.
      • Chung J.
      • Cope L.
      • Fackler M.J.
      • Umbricht C.
      • Sukumar S.
      • Seth P.
      • Sukhatme V.P.
      • Jakkula L.R.
      • Lu Y.
      • Mills G.B.
      • Cho R.J.
      • Collisson E.A.
      • van't Veer L.J.
      • Spellman P.T.
      • Gray J.W.
      Modeling precision treatment of breast cancer.
      ,
      • Garnett M.J.
      • Edelman E.J.
      • Heidorn S.J.
      • Greenman C.D.
      • Dastur A.
      • Lau K.W.
      • Greninger P.
      • Thompson I.R.
      • Luo X.
      • Soares J.
      • Liu Q.
      • Iorio F.
      • Surdez D.
      • Chen L.
      • Milano R.J.
      • Bignell G.R.
      • Tam A.T.
      • Davies H.
      • Stevenson J.A.
      • Barthorpe S.
      • Lutz S.R.
      • Kogera F.
      • Lawrence K.
      • McLaren-Douglas A.
      • Mitropoulos X.
      • Mironenko T.
      • Thi H.
      • Richardson L.
      • Zhou W.
      • Jewitt F.
      • Zhang T.
      • O'Brien P.
      • Boisvert J.L.
      • Price S.
      • Hur W.
      • Yang W.
      • Deng X.
      • Butler A.
      • Choi H.G.
      • Chang J.W.
      • Baselga J.
      • Stamenkovic I.
      • Engelman J.A.
      • Sharma S.V.
      • Delattre O.
      • Saez-Rodriguez J.
      • Gray N.S.
      • Settleman J.
      • Futreal P.A.
      • Haber D.A.
      • Stratton M.R.
      • Ramaswamy S.
      • McDermott U.
      • Benes C.H.
      Systematic identification of genomic markers of drug sensitivity in cancer cells.
      ,
      • Sos M.L.
      • Michel K.
      • Zander T.
      • Weiss J.
      • Frommolt P.
      • Peifer M.
      • Li D.
      • Ullrich R.
      • Koker M.
      • Fischer F.
      • Shimamura T.
      • Rauh D.
      • Mermel C.
      • Fischer S.
      • Stückrath I.
      • Heynck S.
      • Beroukhim R.
      • Lin W.
      • Winckler W.
      • Shah K.
      • LaFramboise T.
      • Moriarty W.F.
      • Hanna M.
      • Tolosi L.
      • Rahnenführer J.
      • Verhaak R.
      • Chiang D.
      • Getz G.
      • Hellmich M.
      • Wolf J.
      • Girard L.
      • Peyton M.
      • Weir B.A.
      • Chen T.-H.
      • Greulich H.
      • Barretina J.
      • Shapiro G.I.
      • Garraway L.A.
      • Gazdar A.F.
      • Minna J.D.
      • Meyerson M.
      • Wong K.-K.
      • Thomas R.K.
      Predicting drug susceptibility of non-small cell lung cancers based on genetic lesions.
      ). One would expect predictive analysis of proteome and phosphoproteome data to be more informative about clinical outcomes than analysis of NGS data, because these data modalities are more proximal to the disease. Machine learning and statistical modeling techniques have likewise been applied to proteomics data to classify clinically relevant disease subtypes in cancer (
      • Deeb S.J.
      • Tyanova S.
      • Hummel M.
      • Schmidt-Supprian M.
      • Cox J.
      • Mann M.
      Machine Learning-based Classification of Diffuse Large B-cell Lymphoma Patients by Their Protein Expression Profiles.
      ,
      • Tyanova S.
      • Albrechtsen R.
      • Kronqvist P.
      • Cox J.
      • Mann M.
      • Geiger T.
      Proteomic maps of breast cancer subtypes.
      ,
      • Iglesias-Gato D.
      • Wikström P.
      • Tyanova S.
      • Lavallee C.
      • Thysell E.
      • Carlsson J.
      • Hägglöf C.
      • Cox J.
      • Andrén O.
      • Stattin P.
      • Egevad L.
      • Widmark A.
      • Bjartell A.
      • Collins C.C.
      • Bergh A.
      • Geiger T.
      • Mann M.
      • Flores-Morales A.
      The proteome of primary prostate cancer.
      ), to define prognosis (
      • Gonzalez-Angulo A.M.
      • Hennessy B.T.
      • Meric-Bernstam F.
      • Sahin A.
      • Liu W.
      • Ju Z.
      • Carey M.S.
      • Myhre S.
      • Speers C.
      • Deng L.
      • Broaddus R.
      • Lluch A.
      • Aparicio S.
      • Brown P.
      • Pusztai L.
      • Symmans W.F.
      • Alsner J.
      • Overgaard J.
      • Borresen-Dale A.-L.
      • Hortobagyi G.N.
      • Coombes K.R.
      • Mills G.B.
      Functional proteomics can define prognosis and predict pathologic complete response in patients with breast cancer.
      ), and to identify biomarkers predicting drug sensitivity (
      • Gonzalez-Angulo A.M.
      • Hennessy B.T.
      • Meric-Bernstam F.
      • Sahin A.
      • Liu W.
      • Ju Z.
      • Carey M.S.
      • Myhre S.
      • Speers C.
      • Deng L.
      • Broaddus R.
      • Lluch A.
      • Aparicio S.
      • Brown P.
      • Pusztai L.
      • Symmans W.F.
      • Alsner J.
      • Overgaard J.
      • Borresen-Dale A.-L.
      • Hortobagyi G.N.
      • Coombes K.R.
      • Mills G.B.
      Functional proteomics can define prognosis and predict pathologic complete response in patients with breast cancer.
      ,
      • Niepel M.
      • Hafner M.
      • Pace E.A.
      • Chung M.
      • Chai D.H.
      • Zhou L.
      • Schoeberl B.
      • Sorger P.K.
      Profiles of Basal and stimulated receptor signaling networks predict drug response in breast cancer lines.
      ,
      • Timpe L.C.
      • Li D.
      • Yen T.-Y.
      • Wong J.
      • Yen R.
      • Macher B.A.
      • Piryatinska A.
      Mining the Breast Cancer Proteome for Predictors of Drug Sensitivity.
      ).
      Although predictive modeling has been applied to genomics and proteomics independently, studies integrating the two are less common. Several studies that used “multimodal” integration of data types, including RNA-Seq, exon expression, and Reverse Phase Protein Array (RPPA) data, to predict clinical phenotypes and drug response found no advantage in combining data modalities over individual platform analysis and showed gene expression data to be consistently more predictive than RPPA-based proteomics (
      • Ray B.
      • Henaff M.
      • Ma S.
      • Efstathiadis E.
      • Peskin E.R.
      • Picone M.
      • Poli T.
      • Aliferis C.F.
      • Statnikov A.
      Information content and analysis methods for multi-modal high-throughput biomedical data.
      ,
      • Daemen A.
      • Griffith O.L.
      • Heiser L.M.
      • Wang N.J.
      • Enache O.M.
      • Sanborn Z.
      • Pepin F.
      • Durinck S.
      • Korkola J.E.
      • Griffith M.
      • Hur J.S.
      • Huh N.
      • Chung J.
      • Cope L.
      • Fackler M.J.
      • Umbricht C.
      • Sukumar S.
      • Seth P.
      • Sukhatme V.P.
      • Jakkula L.R.
      • Lu Y.
      • Mills G.B.
      • Cho R.J.
      • Collisson E.A.
      • van't Veer L.J.
      • Spellman P.T.
      • Gray J.W.
      Modeling precision treatment of breast cancer.
      ). Similarly, Ma et al. (
      • Ma S.
      • Ren J.
      • Fenyö D.
      Breast Cancer Prognostics Using Multi-Omics Data.
      ) found that in machine learning models predicting ten-year survival from 77 breast tumors (
      • Mertins P.
      • Mani D.R.
      • Ruggles K.V.
      • Gillette M.A.
      • Clauser K.R.
      • Wang P.
      • Wang X.
      • Qiao J.W.
      • Cao S.
      • Petralia F.
      • Kawaler E.
      • Mundt F.
      • Krug K.
      • Tu Z.
      • Lei J.T.
      • Gatza M.L.
      • Wilkerson M.
      • Perou C.M.
      • Yellapantula V.
      • Huang K.
      • Lin C.
      • McLellan M.D.
      • Yan P.
      • Davies S.R.
      • Townsend R.R.
      • Skates S.J.
      • Wang J.
      • Zhang B.
      • Kinsinger C.R.
      • Mesri M.
      • Rodriguez H.
      • Ding L.
      • Paulovich A.G.
      • Fenyö D.
      • Ellis M.J.
      • Carr S.A.
      • NCI CPTAC
      Proteogenomics connects somatic mutations to signalling in breast cancer.
      ), fusion of four data types (genome, transcriptome, MS/MS-based proteome and phosphoproteome) did not improve the predictive performance of the model. However, they did find that models based on proteomics data outperformed models based on genomics and transcriptomics data in survival prediction (
      • Ma S.
      • Ren J.
      • Fenyö D.
      Breast Cancer Prognostics Using Multi-Omics Data.
      ). As this is still fairly uncharted territory in proteogenomics, we anticipate a wealth of future studies assessing the predictive power of proteomics and phosphoproteomics in disease prognosis, diagnosis and drug response (Fig. 3B).
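
The comparison described above, single-modality models versus a model built on fused data, can be illustrated with a minimal sketch. It assumes early fusion by simple feature concatenation and a nearest-centroid classifier scored by leave-one-out cross-validation. These are illustrative choices, not the models used in the cited studies, and all values, labels, and function names are hypothetical.

```python
import math


def nearest_centroid_predict(train_x, train_y, sample):
    """Assign the label whose class centroid is closest (Euclidean) to sample."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return min(centroids, key=lambda lab: math.dist(centroids[lab], sample))


def loocv_accuracy(features, labels):
    """Leave-one-out cross-validated accuracy of the nearest-centroid rule."""
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if nearest_centroid_predict(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)


def fuse(*modalities):
    """Early fusion: concatenate per-sample feature vectors across modalities."""
    return [sum(rows, []) for rows in zip(*modalities)]


# Hypothetical toy data: six samples, two outcome classes.
labels = ["good", "good", "good", "poor", "poor", "poor"]
transcriptome = [[1.0, 0.2], [1.2, 0.1], [0.9, 0.3],
                 [1.1, 0.2], [1.0, 0.1], [1.05, 0.35]]
proteome = [[0.1], [0.2], [0.15], [0.9], [1.0], [0.95]]

fused = fuse(transcriptome, proteome)
acc_transcriptome = loocv_accuracy(transcriptome, labels)
acc_proteome = loocv_accuracy(proteome, labels)
acc_fused = loocv_accuracy(fused, labels)

print("transcriptome:", acc_transcriptome)
print("proteome:", acc_proteome)
print("fused:", acc_fused)
```

In this toy setting the fused model cannot exceed the proteome-only model, which already separates the classes. Real comparisons would need proper feature selection inside each cross-validation fold and far larger cohorts.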

      Supervised Analysis for Marker Selection

      Aside from machine learning, supervised analysis has been used to derive markers for a variety of distinctions, including intrinsic disease subtypes (e.g. PAM-50 subtype in breast cancer or HRD status in ovarian cancer), subtypes identified by clustering, samples with and without mutations in genes of interest (e.g. PIK3CA or TP53 mutations), and differences in survival. For examples, see (
      • Mertins P.
      • Mani D.R.
      • Ruggles K.V.
      • Gillette M.A.
      • Clauser K.R.
      • Wang P.
      • Wang X.
      • Qiao J.W.
      • Cao S.
      • Petralia F.
      • Kawaler E.
      • Mundt F.
      • Krug K.
      • Tu Z.
      • Lei J.T.
      • Gatza M.L.
      • Wilkerson M.
      • Perou C.M.
      • Yellapantula V.
      • Huang K.
      • Lin C.
      • McLellan M.D.
      • Yan P.
      • Davies S.R.
      • Townsend R.R.
      • Skates S.J.
      • Wang J.
      • Zhang B.
      • Kinsinger C.R.
      • Mesri M.
      • Rodriguez H.
      • Ding L.
      • Paulovich A.G.
      • Fenyö D.
      • Ellis M.J.
      • Carr S.A.
      • NCI CPTAC
      Proteogenomics connects somatic mutations to signalling in breast cancer.
      ) and (
      • Zhang H.
      • Liu T.
      • Zhang Z.
      • Payne S.H.
      • Zhang B.
      • McDermott J.E.
      • Zhou J.-Y.
      • Petyuk V.A.
      • Chen L.
      • Ray D.
      • Sun S.
      • Yang F.
      • Chen L.
      • Wang J.
      • Shah P.
      • Cha S.W.
      • Aiyetan P.
      • Woo S.
      • Tian Y.
      • Gritsenko M.A.
      • Clauss T.R.
      • Choi C.
      • Monroe M.E.
      • Thomas S.
      • Nie S.
      • Wu C.
      • Moore R.J.
      • Yu K.-H.
      • Tabb D.L.
      • Fenyö D.
      • Bafna V.
      • Wang Y.
      • Rodriguez H.
      • Boja E.S.
      • Hiltke T.
      • Rivers R.C.
      • Sokoll L.
      • Zhu H.
      • Shih I.-M.
      • Cope L.
      • Pandey A.
      • Zhang B.
      • Snyder M.P.
      • Levine D.A.
      • Smith R.D.
      • Chan D.W.
      • Rodland K.D.
      • Investigators CPTAC
      Integrated proteogenomic characterization of human high-grade serous ovarian cancer.
      ).
      Common marker selection methods include the t test, ANOVA (F test), moderated tests (
      • Smyth G.K.
      Linear models and empirical bayes methods for assessing differential expression in microarray experiments.
      ) and SAM (
      • Tusher V.G.
      • Tibshirani R.
      • Chu G.
      Significance analysis of microarrays applied to the ionizing radiation response.
      ), in addition to nonparametric tests such as the Mann-Whitney test and the Kruskal-Wallis test. Although these tests are in most cases applied to a single type of omics data, the marker rankings they produce can be combined across multiple omics data sets to derive a global rank using rank aggregation algorithms (
      • Aerts S.
      • Lambrechts D.
      • Maity S.
      • Van Loo P.
      • Coessens B.
      • De Smet F.
      • Tranchevent L.-C.
      • De Moor B.
      • Marynen P.
      • Hassan B.
      • Carmeliet P.
      • Moreau Y.
      Gene prioritization through genomic data fusion.
      ,
      • Kolde R.
      • Laur S.
      • Adler P.
      • Vilo J.
      Robust rank aggregation for gene list integration and meta-analysis.
      ).
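
As a minimal sketch of this workflow, the example below scores each marker with Welch's t statistic in two hypothetical omics data sets, ranks the markers within each data set, and combines the rankings by mean rank. Mean rank is a deliberately simple stand-in for the published rank aggregation algorithms cited above, and the gene names and values are invented for illustration.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    se = math.sqrt(variance(group_a) / len(group_a)
                   + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


def rank_markers(scores):
    """Map each marker to its rank (1 = largest absolute score)."""
    ordered = sorted(scores, key=lambda m: abs(scores[m]), reverse=True)
    return {marker: i + 1 for i, marker in enumerate(ordered)}


def aggregate_ranks(rank_tables):
    """Order markers by mean rank over the data sets that measured them."""
    markers = set().union(*rank_tables)
    mean_rank = {m: mean(t[m] for t in rank_tables if m in t) for m in markers}
    return sorted(markers, key=lambda m: (mean_rank[m], m))


# Hypothetical data: marker -> (values in group 1, values in group 2).
rna = {
    "GENE_A": ([5.1, 5.3, 5.0], [3.0, 3.2, 2.9]),
    "GENE_B": ([4.0, 4.1, 3.9], [4.0, 4.2, 3.8]),
    "GENE_C": ([2.0, 2.5, 2.2], [3.0, 3.1, 2.8]),
}
protein = {
    "GENE_A": ([7.0, 7.2, 6.9], [5.0, 5.1, 5.3]),
    "GENE_C": ([1.0, 1.1, 0.9], [1.0, 1.2, 0.8]),
}

rank_tables = [
    rank_markers({m: welch_t(a, b) for m, (a, b) in data.items()})
    for data in (rna, protein)
]
global_ranking = aggregate_ranks(rank_tables)
print(global_ranking)
```

Markers missing from one data set (here GENE_B at the protein level) are ranked only on the data sets in which they were measured. This is one of several possible conventions and should be chosen deliberately.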

      Pathway and Network Modeling

      Historically, the field of biomedicine has operated under the “molecular biology paradigm,” in which it is assumed that biological function can be explained through the comprehensive knowledge of genes and their associated proteins, and that these proteins operate in linear pathways (
      • Beadle G.W.
      • Tatum E.L.
      Genetic control of biochemical reactions in Neurospora.
      ). Despite large-scale efforts to link genotype and phenotype under this paradigm, the relationships between the two remain unresolved and surprisingly complex. The systems, or network, biology approach instead attempts to account for these complex relationships to better understand the genotype-phenotype connection (
      • Bensimon A.
      • Heck A.J.R.
      • Aebersold R.
      Mass spectrometry-based proteomics and network biology.
      ,
      • Vidal M.
      • Cusick M.E.
      • Barabási A.-L.
      Interactome networks and human disease.
      ,
      • Arkin A.P.
      • Schaffer D.V.
      Network news: innovations in 21st century systems biology.
      ). Studies in proteogenomics can build upon current models of network biology, both by contributing to network annotation and by using established pathway and gene ontology tools in gene-protein enrichment analyses.

      Network Annotation

      In network biology, nodes represent the molecules of interest (genes, proteins, metabolites) and edges represent a functional, physical, or enzymatic relationship. Genetic and physical interaction networks are commonly used models for studying complex systems and disease. These networks can reflect a static system, built from information in a single condition, or a differential system that highlights changes in network connections between two distinct states and reveals state-specific and disease-specific interactions (see reviews (
      • Ideker T.
      • Krogan N.J.
      Differential network biology.
      ,
      • Hu J.X.
      • Thomas C.E.
      • Brunak S.
      Network biology concepts in complex disease comorbidities.
      )).
      Biological networks are typically built in three ways: (1) curation of available physical or biochemical interaction data; (2) computational predictions based on sequence similarity, gene cooccurrence, or gene coexpression; and (3) comprehensive assessment of whole genomes or proteomes (
      • Vidal M.
      • Cusick M.E.
      • Barabási A.-L.
      Interactome networks and human disease.
      ). MS-based proteomics and PTM-omics can be layered atop these scaffold networks both to fine-tune the biological network representation and to identify network rewiring in a disease state (
      • Vidal M.
      A biological atlas of functional maps.
      ) (Fig. 3C). For example, Zhang et al. identified protein-protein interaction network modules that were enriched in down-regulated proteins in a poor-prognosis colorectal cancer subtype (
      • Zhang B.
      • Wang J.
      • Wang X.
      • Zhu J.
      • Liu Q.
      • Shi Z.
      • Chambers M.C.
      • Zimmerman L.J.
      • Shaddox K.F.
      • Kim S.
      • Davies S.R.
      • Wang S.
      • Wang P.
      • Kinsinger C.R.
      • Rivers R.C.
      • Rodriguez H.
      • Townsend R.R.
      • Ellis M.J.C.
      • Carr S.A.
      • Tabb D.L.
      • Coffey R.J.
      • Slebos R.J.C.
      • Liebler D.C.
      • NCI CPTAC
      Proteogenomic characterization of human colon and rectal cancer.
      ). Similarly, analysis of gene-protein coexpression found differential interaction patterns in a subset of network modules in basal-enriched and luminal-enriched breast cancer subgroups (
      • Mertins P.
      • Mani D.R.
      • Ruggles K.V.
      • Gillette M.A.
      • Clauser K.R.
      • Wang P.
      • Wang X.
      • Qiao J.W.
      • Cao S.
      • Petralia F.
      • Kawaler E.
      • Mundt F.
      • Krug K.
      • Tu Z.
      • Lei J.T.
      • Gatza M.L.
      • Wilkerson M.
      • Perou C.M.
      • Yellapantula V.
      • Huang K.
      • Lin C.
      • McLellan M.D.
      • Yan P.
      • Davies S.R.
      • Townsend R.R.
      • Skates S.J.
      • Wang J.
      • Zhang B.
      • Kinsinger C.R.
      • Mesri M.
      • Rodriguez H.
      • Ding L.
      • Paulovich A.G.
      • Fenyö D.
      • Ellis M.J.
      • Carr S.A.
      • NCI CPTAC
      Proteogenomics connects somatic mutations to signalling in breast cancer.
      ).
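
The idea of a differential network can be sketched as follows: build a coexpression network for each of two states by thresholding pairwise Pearson correlations, then report the edges present in only one state. This is a simplified illustration and not the analysis used in the cited studies. The protein names, state names, abundance values, and the 0.8 threshold are all hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def coexpression_edges(profiles, threshold=0.8):
    """Return {(node_a, node_b): r} for node pairs with |r| >= threshold."""
    edges = {}
    for a, b in combinations(sorted(profiles), 2):
        r = pearson(profiles[a], profiles[b])
        if abs(r) >= threshold:
            edges[(a, b)] = r
    return edges


def differential_edges(edges_state1, edges_state2):
    """Edges present in exactly one of the two state-specific networks."""
    return set(edges_state1) ^ set(edges_state2)


# Hypothetical abundance profiles (five samples per state).
state_1 = {
    "P1": [1, 2, 3, 4, 5],
    "P2": [2, 4, 6, 8, 10.5],
    "P3": [5, 1, 4, 2, 3],
}
state_2 = {
    "P1": [1, 2, 3, 4, 5],
    "P2": [3, 1, 4, 2, 5],
    "P3": [2.1, 3.9, 6.2, 8.0, 9.9],
}

edges_1 = coexpression_edges(state_1)
edges_2 = coexpression_edges(state_2)
rewired = differential_edges(edges_1, edges_2)
print(sorted(rewired))
```

A hard threshold on correlation is the simplest possible edge definition. With realistic sample sizes, a test on the difference between the two correlations would be a more defensible criterion for calling an edge rewired.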

      Pathway and Gene Ontology (GO) Enrichment

      Several approaches for pathway and GO enrichment analysis have been developed, including over-representation analysis and Gene Set Enrichment Analysis (GSEA). Over-representation analysis uses Fisher's exact test to identify pathways and GO terms with significant over-representation in a gene or protein list of interest, which should be predefined based on differential expression, clustering, or other upstream analyses. Representative tools in this category include DAVID (
      • Dennis G.
      • Sherman B.T.
      • Hosack D.A.
      • Yang J.
      • Gao W.
      • Lane H.C.
      • Lempicki R.A.
      DAVID: Database for annotation, visualization, and integrated discovery.
      ) and WebGestalt (
      • Wang J.
      • Duncan D.
      • Shi Z.
      • Zhang B.
      WEB-based GEne SeT AnaLysis Toolkit (WebGestalt): update 2013.
      ) (Table III). GSEA (
      • Subramanian A.
      • Tamayo P.
      • Mootha V.K.
      • Mukherjee S.
      • Ebert B.L.
      • Gillette M.A.
      • Paulovich A.
      • Pomeroy S.L.
      • Golub T.R.
      • Lander E.S.
      • Mesirov J.P.
      Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles.
      ) ranks genes or proteins in the entire data set based on differential expression or association to a continuous phenotype, and then uses a modified version of the Kolmogorov-Smirnov test (
      • Daniel W.W.
      Kolmogorov-Smirnov one-sample test.
      ) to identify pathways, signatures and GO terms in which the gene members are enriched at the top or bottom of the ranked list (Fig. 3C). As both the over-representation method and the GSEA approach ignore pathway topology when performing enrichment analysis, an additional tool, SPIA (
      • Tarca A.L.
      • Draghici S.
      • Khatri P.
      • Hassan S.S.
      • Mittal P.
      • Kim J.-S.
      • Kim C.J.
      • Kusanovic J.P.
      • Romero R.
      A novel signaling pathway impact analysis.
      ), was established to address this limitation. Further, because these methods all perform enrichment analysis at the gene level, they do not allow for phosphosite-level enrichment analysis, which is critical for understanding kinase-phosphate signal transduction in phosphoproteome profiling studies. PHOXTRACK (
      • Weidner C.
      • Fischer C.
      • Sauer S.
      PHOXTRACK-a tool for interpreting comprehensive data sets of post-translational modifications of proteins.
      ) was developed for this purpose, and modifies the GSEA approach to search for an enrichment of known kinase targets in an uploaded phosphoproteomics profile data set (Table III).
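
The two enrichment strategies described above can be contrasted in a short sketch. The first function computes the one-sided Fisher's exact (hypergeometric) p value used in over-representation analysis. The second computes an unweighted Kolmogorov-Smirnov-like running-sum statistic over a ranked list. The published GSEA method additionally weights each step by the ranking metric and assesses significance by permutation, and both are omitted here. Gene identifiers and counts are hypothetical.

```python
from math import comb


def ora_p_value(n_universe, n_in_set, n_selected, n_overlap):
    """One-sided Fisher's exact test: P(overlap >= n_overlap) by chance."""
    upper = min(n_in_set, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


def enrichment_score(ranked_ids, gene_set):
    """Unweighted running sum; returns the maximum deviation from zero."""
    hits = [g in gene_set for g in ranked_ids]
    n_hits = sum(hits)
    n_miss = len(ranked_ids) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must cover part, not all, of the ranked list")
    running, best = 0.0, 0.0
    for is_hit in hits:
        # Step up at set members, down at non-members.
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Over-representation: 20 quantified proteins, a pathway with 5 members,
# 5 differentially expressed proteins, 4 of which belong to the pathway.
p_value = ora_p_value(n_universe=20, n_in_set=5, n_selected=5, n_overlap=4)

# Ranked-list enrichment: members cluster near the top of the ranking.
ranked = ["A", "B", "C", "D", "E", "F", "G", "H"]
es = enrichment_score(ranked, {"A", "B", "D"})

print(p_value, es)
```

Over-representation analysis needs a predefined cutoff to form the list of interest, whereas the running-sum approach uses the entire ranking. This is why the latter can detect coordinated but individually modest changes.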
      Table III. Computational resources for pathway and gene ontology enrichment
      Name | URL | Reference | Remarks
      DAVID | http://david.abcc.ncifcrf.gov/ | (
      • Dennis G.
      • Sherman B.T.
      • Hosack D.A.
      • Yang J.
      • Gao W.
      • Lane H.C.
      • Lempicki R.A.
      DAVID: Database for annotation, visualization, and integrated discovery.
      )
      GO/Pathway annotation and enrichment
      GoMiner | http://discover.nci.nih.gov/gominer/ | (
      • Zeeberg B.R.
      • Feng W.
      • Wang G.
      • Wang M.D.
      • Fojo A.T.
      • Sunshine M.
      • Narasimhan S.
      • Kane D.W.
      • Reinhold W.C.
      • Lababidi S.
      • Bussey K.J.
      • Riss J.
      • Barrett J.C.
      • Weinstein J.N.
      GoMiner: a resource for biological interpretation of genomic and proteomic data.
      )
      GO analysis
      GSEA | http://software.broadinstitute.org/gsea/ | (
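      • Subramanian A.
      • Tamayo P.
      • Mootha V.K.
      • Mukherjee S.
      • Ebert B.L.
      • Gillette M.A.
      • Paulovich A.
      • Pomeroy S.L.
      • Golub T.R.
      • Lander E.S.
      • Mesirov J.P.
      Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles.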
      )
      Identifies pathways/GO terms with gene enrichment based on gene/protein ranking
      InnateDB | http://www.innatedb.com/ | (
      • Lynn D.J.
      • Winsor G.L.
      • Chan C.
      • Richard N.
      • Laird M.R.
      • Barsky A.
      • Gardy J.L.
      • Roche F.M.
      • Chan T.H.W.
      • Shah N.
      • Lo R.
      • Naseer M.
      • Que J.
      • Yau M.
      • Acab M.
      • Tulpan D.
      • Whiteside M.D.
      • Chikatamarla A.
      • Mah B.
      • Munzner T.
      • Hokamp K.
      • Hancock R.E.W.
      • Brinkman F.S.L.
      InnateDB: facilitating systems-level analyses of the mammalian innate immune response.
      )
      GO/Pathway annotation and enrichment, visualization
      KEGG Atlas | http://www.kegg.jp/kegg/atlas/ | (
      • Okuda S.
      • Yamada T.
      • Hamajima M.
      • Itoh M.
      • Katayama T.
      • Bork P.
      • Goto S.
      • Kanehisa M.
      KEGG Atlas mapping for global analysis of metabolic pathways.