Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, Texas 77030;Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030;
To whom correspondence may be addressed: The Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA 02142, Tel.:+1-617-714-7483, E-mail: ; Tel.:+1-713-798-1443, E-mail: ; or Tel.:+1-212-263-2216, E-mail:.
Department of Biochemistry and Molecular Pharmacology, New York University School of Medicine, New York, New York 10016;§Institute for Systems Genetics, New York University School of Medicine, New York, New York 10016
* KVR, SHP, and DF were funded by National Cancer Institute (NCI) CPTAC award U24 CA210972. KVR and DF were funded by contract 13XS068 from Leidos Biomedical Research, Inc. SHP was funded by an Early Career Award from the U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research. XW, JW, and BZ were funded by National Cancer Institute (NCI) CPTAC awards U24CA159988 and U24CA210954 and by contract 13XS029 from Leidos Biomedical Research, Inc. BZ is a Cancer Prevention & Research Institute of Texas (CPRIT RR160027) Scholar and a McNair Medical Institute Scholar. KK, KRC and DRM were funded by National Cancer Institute (NCI) CPTAC awards U24CA160034, U24CA210986 and U24CA210979. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. ¶¶ Cofirst authors.
With combined technological advancements in high-throughput next-generation sequencing and deep mass spectrometry-based proteomics, proteogenomics, i.e. the integrative analysis of proteomic and genomic data, has emerged as a new research field. Early efforts in the field were focused on improving protein identification using sample-specific genomic and transcriptomic sequencing data. More recently, integrative analysis of quantitative measurements from genomic and proteomic studies has yielded novel insights into gene expression regulation, cell signaling, and disease. Many methods and tools have been developed or adapted to enable an array of integrative proteogenomic approaches. In this article, we systematically classify published methods and tools into four major categories: (1) Sequence-centric proteogenomics; (2) Analysis of proteogenomic relationships; (3) Integrative modeling of proteogenomic data; and (4) Data sharing and visualization. We provide a comprehensive review of methods and available tools in each category and highlight their typical applications.
The last decade has witnessed the rapid emergence of proteogenomics, a new research field at the interface of genomics and proteomics. The term proteogenomics came into use following a publication by George Church's group in 2004 describing a proteogenomic mapping technique that harnessed proteomics data to improve genome annotation of Mycoplasma pneumoniae (
). The reach of proteogenomics has since expanded with technological advancements enabling rapid and economical high-throughput DNA and RNA sequencing and deep mass spectrometry (MS)-based proteomics. These advancements have proved particularly useful for integrating nucleotide sequencing and MS data from the same sample, where genomic sequencing data can be used to improve protein identification through comprehensive protein sequence database construction. Proteomic data can then be used to demonstrate the validity and functional relevance of novel findings based on large-scale RNA and DNA sequencing projects, including coding sequence variants and novel coding transcripts. In addition to sequence-centric proteogenomic data integration, combined quantitative analysis of data from genomic and proteomic studies has also been used to provide novel insights into multilevel gene expression regulation (
). In this review, we subscribe to an expansive view of proteogenomics, encompassing all areas of proteomic and genomic integrative data analysis, and cover the range of tools developed to tackle the associated challenges.
To complement already published review papers that focus on specific sub-domains of the broad proteogenomics research area (
), we systematically classified existing methods and tools for various types of integrative proteogenomic studies into four major sections. “Sequence-centric Proteogenomics” describes aspects of sequence-centric proteogenomics and the combined use of genomic and proteomic data to augment gene or protein annotation (Fig. 1). “Analysis of Proteogenomic Relationships” explores relationships between genomic and proteomic data using correlation, with application to deciphering the effect of mutations on signaling (Fig. 2). “Integrative Modeling of Proteogenomic Data” summarizes integrative modeling and analysis of proteogenomic data using statistical and machine learning approaches (Fig. 3). “Data Sharing and Visualization” discusses genome (Fig. 4) and network visualization (Fig. 5), along with challenges in data sharing. All four sections of the review assume tandem MS (MS/MS) as the core proteomics technology for generating peptide sequence data.
Fig. 1. Sequence-centric proteogenomics. Sequencing-based technologies to sequence DNA (whole genome sequencing, WGS; whole exome sequencing, WXS) and RNA (RNA-seq) generate millions of short sequencing reads that are assembled into genomes, exomes or transcriptomes by either de novo or template-based approaches by alignment to a reference sequence. Sample-specific sequence aberrations are determined and nucleotide sequences are transformed into personalized, amino acid-centric sequence databases. Peptide mass spectra derived by LC-MS/MS analysis from a matching sample are then scored and validated against the personalized database enabling the detection of sample-specific peptide sequences. Depending on the scope of the proteogenomic project, these peptides can then be used to (1) aid genome annotation by detection of peptides in unannotated genome regions; (2) identify tumor-specific mutations translated into the proteome as well as novel protein splice variants; and (3) detect species-specific peptides in microbial communities.
Fig. 2. Proteogenomic relationships. A, Correlation analysis of mRNA and protein pairs across samples enables the assessment of global correlation structure which typically centers between correlation coefficients of 0.3 and 0.5. B, Regulatory effects on RNA and protein expression levels caused by copy number aberrations (CNA), genetic variants (eQTL) and microRNAs (miRNAs) can be studied by different correlation-based approaches. CNA cis and trans effects on RNA, protein and PTM expression can be determined by correlating each gene copy number at a given locus to all quantified features in RNA, protein or PTM space across all samples. Expression quantitative trait loci (eQTL) analysis can be used to identify DNA sequence variants affecting RNA/protein expression levels in the sample population being studied. Global miRNA analysis accompanied with mRNA or protein profiling enables the assessment of miRNA mediated regulation of mRNA and protein expression. C, Integrative analysis of genetic variants and PTM sites like phosphorylation can identify functional consequences of genetic variants at the molecular level. Mutations that directly affect serine, threonine and tyrosine residues can result in destruction or genesis of phosphosites (I); mutations adjacent to phosphosites can result in removal or addition of phosphosites (II) or change the kinase that recognizes the phosphorylation site (III).
Fig. 3. Integrative modeling. Overview of sub-topics in integrative modeling of proteogenomic data. A, Clustering techniques, illustrated by a schematic of multi-omic hierarchical clustering analysis resulting in the identification of two subtypes. B, Predictive modeling for disease diagnosis, prognosis, drug response and drug toxicity using multiple data modalities. C, Proteogenomic pathway and network modeling, including informing network composition and pathway and GO term enrichment.
Fig. 4. Genome-based visualization, using proBAM as an example. proBAM is a data format to integrate mass spectrometry data with the genome. In this example, we show the visualization of 10 colorectal cancer cell lines in proBAM format. The PSMs result from a search against a customized database built from matched RNA-Seq data, which are also incorporated into the visualization. A, Integrative Genomics Viewer (IGV) snapshot visualizes peptides and RNA-Seq reads mapped to KRAS in one window. The upper panel shows proteomic data from 10 colon cancer cell lines indicated by different colors. The bottom three panels illustrate RNA-Seq data from cell lines HCT15, Caco-2 and SW480, respectively. B, Zoomed-in view of an exon region in KRAS. Similar to RNA-Seq reads (three bottom panels), peptides mapped to the genome can be classified into within exon peptides and junction peptides in the proBAM file (upper panel). C, The upper panel shows a zoomed-in view of mutations confirmed by both RNA-Seq and proteomic data in KRAS. A G13D mutation in HCT15 and a G12V mutation in SW480 are observed in both transcriptomic (second and fourth panel) and proteomics (first panel) data, whereas wild type peptide is observed in Caco-2 (third panel).
Fig. 5. NetGestalt-based analysis of colorectal cancer proteomics data. A, NetGestalt created a one-dimensional (1-D), linear order of all 12,112 genes in a protein-protein interaction network based on the hierarchical modular organization of the network. The ruler indicates coordinates of the genes in the resulting linear order. B, Each bar represents a module identified from the network and alternating bar colors (green and orange) are used to distinguish neighboring modules. C, Colorectal cancer proteomics data visualized as a heat map with each row representing a sample. Red and blue colors in the heat map represent relative over- and under-expression, respectively. All 12,112 genes in the network are visualized in the heat map, and missing values in the proteomic data are indicated in gray color. The samples (rows) of the data are ordered based on the five subtypes visualized beside the heat map. D, Signed minus log10 transformed p values of the difference between subtype 3 and other subtypes visualized by the bar plot. Missing values are indicated by yellow bars in the bar plot. E, Significantly over-expressed genes in subtype 3 (FDR<0.05). F–H, Zoomed-in view of a region corresponding to one of the over-represented modules. I, Genes in (H) visualized in a node-link diagram. The edges in the diagram represent protein-protein interactions.
In this section we review several areas of sequence-centric proteogenomics. This includes the integrative analysis of genomic and proteomic data for exome annotation in the form of gene discovery and gene model refinement (Proteomics Aiding Genome Annotation); protein-level detection of single amino acid variants (SAAVs), insertions, deletions, alternative splice junctions and novel gene fusions in relation to a reference genome sequence (Personalized Protein Sequence Databases); the application of proteomic sequencing to characterize antibodies (Sequencing of Antibodies); studying the effects of viral infections and transposons on gene expression in eukaryotic organisms (Viral Infections and Activation of Transposable Elements); and applications of proteogenomics to metaproteomic investigation (Metaproteogenomics).
The abbreviations used are: SAAV, single amino acid variant; DNA, deoxyribonucleic acid; RNA, ribonucleic acid; MS/MS, tandem mass spectrometry; EST, expressed sequence tag; ORF, open reading frame; TCGA, The Cancer Genome Atlas; CPTAC, Clinical Proteomic Tumor Analysis Consortium; FDR, false discovery rate; SNP, single nucleotide polymorphism; PTM, post-translational modification; PSM, peptide spectrum match; RPPA, reverse-phase protein array; GSEA, gene set enrichment analysis; ssGSEA, single-sample gene set enrichment analysis; SNV, single nucleotide variant; nsSNV, nonsynonymous single nucleotide variant; CNA, copy number aberration.
A concept integral to all five topics in this section is the importance of an inclusive and high-quality protein sequence database for peptide identification. In a typical proteomic experiment, peptide MS/MS spectra are interpreted using a database search algorithm that matches and scores the similarity of each experimental spectrum against model spectra constructed from peptide sequences contained in a user-supplied protein sequence database (
). This strategy is used, in part, because the fragmentation efficiency of current MS/MS instrumentation is unable to consistently yield spectra from which complete, unambiguous sequences can be interpreted de novo (the current state of automated de novo peptide interpretation has been reviewed elsewhere (
)). To address this, researchers use a protein sequence database, ideally containing all protein sequences one expects to be present in the sample, with minimal irrelevant sequences to reduce false spectral matches and search time. Limiting the number of candidate sequences in the form of a sequence database enables the sequence ambiguity present in a spectrum to be overcome, resulting in high confidence peptide spectrum matches (PSMs) (
Proteomics Aiding Genome Annotation
The use of MS-based proteome characterization to aid genome annotation has been widely exploited in various organisms and has been previously reviewed by several groups (
). Here, we highlight some of the pioneering studies using integrative analysis of genomics and proteomics data for genome (re)annotation (See related review (
Early studies integrating proteomic and genomic data date back to before the genome sequencing revolution of the early 21st century. Motivated by the lack of comprehensive and complete protein sequence databases and the emerging availability of nucleotide sequence data in the form of expressed sequence tags (ESTs), researchers interrogated peptide mass spectra using databases obtained by in silico translation of unassembled ESTs. In 1995, Yates et al. (
) demonstrated the use of nucleotide sequences translated into all six reading frames of amino acid sequence (six-frame translation) to identify mass spectra of unmodified and phosphorylated peptides from human, bovine, E. coli, and S. cerevisiae proteins. The intrinsic ability of searching peptide mass spectra against genomic sequence databases to identify novel, unannotated genes was employed shortly after by several groups (
) used information from the recently released human genome draft in 2001 to query all 23 human chromosomes with tandem mass spectra and compare the results to EST database searches. From this, they concluded that MS/MS searching of genomic DNA databases was of limited utility, as the presence of introns in the database prohibited matching exon-spanning peptides and prevented identification in roughly one quarter of the spectra. Additionally, the consensus sequence chosen for the reference genome also prevented the identification of individual SAAVs with EST evidence. These limitations have since been addressed through the incorporation of sample-specific or species-specific alternative splicing and SNPs into the protein sequence database (see Personalized Protein Sequence Databases).
In 2004, Jaffe et al. introduced the concept of a proteogenomic map as a complementary method for genome annotation, which used evidence of protein expression to predict ORFs in Mycoplasma pneumoniae (
). Proteogenomic gene annotation is most often carried out by searching peptide mass spectra against a six-frame translation of an associated reference genome sequence database. Peptides identified by this search are then mapped to the existing gene annotation model (Fig. 1). The detection and identification of these peptides provides direct and valuable evidence of protein translation, and such peptides have been used to train algorithms for gene model prediction (
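The six-frame translation described above can be illustrated with a minimal sketch. It assumes the standard genetic code and an uppercase or lowercase DNA string; the function names are hypothetical and are not taken from any tool discussed in this review. Codons containing ambiguous bases are rendered as X.

```python
# Standard genetic code, laid out in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return the three forward and three reverse reading frames.

    Keys are '+1', '+2', '+3' for the given strand and '-1', '-2', '-3'
    for the reverse complement. Stop codons appear as '*'.
    """
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translation("ATGGCCAAGTAA")
print(frames["+1"])  # MAK*
```

In practice, each frame would be split at stop codons and the resulting segments written to a FASTA file for database searching; that step is omitted here.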
A crucial prerequisite for genome refinement using MS-based proteomics is sufficient proteome coverage, which became feasible following major improvements in MS instrumentation (ion traps) as well as sample preparation protocols (multidimensional fractionation at the peptide level). In fact, liquid chromatography (LC)-MS/MS-based proteomics utilizing sensitive and fast scanning ion-trap mass analyzers dominated the field of proteogenomics for several years despite the low resolution and mass accuracy of the acquired spectra (
). However, the low mass accuracy data acquired by these instruments required large mass tolerances when searching six-frame translation databases, resulting in prohibitively large search spaces, long search times and high proportions of false positive peptide identifications using conventional target-decoy FDR thresholding (
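To make the link between mass tolerance and search space concrete, the following sketch counts how many database peptides fall within a precursor mass window. The peptide masses and the two tolerance values are made-up illustrative numbers, not measurements from any instrument, and the function name is hypothetical.

```python
from bisect import bisect_left, bisect_right


def candidates_within_tolerance(sorted_masses, precursor_mass, tol_ppm):
    """Count database peptides whose mass lies within +/- tol_ppm of the precursor.

    sorted_masses must be sorted in ascending order.
    """
    delta = precursor_mass * tol_ppm / 1e6
    lo = bisect_left(sorted_masses, precursor_mass - delta)
    hi = bisect_right(sorted_masses, precursor_mass + delta)
    return hi - lo


# Hypothetical peptide masses (Da) from a translated database.
masses = sorted([1000.0, 1000.004, 1000.4, 1001.0])

# A wide window admits more candidates than a narrow one.
wide = candidates_within_tolerance(masses, 1000.0, tol_ppm=500)
narrow = candidates_within_tolerance(masses, 1000.0, tol_ppm=10)
print(wide, narrow)  # 3 2
```

Every additional candidate must be scored against the spectrum, so in a six-frame database the number of comparisons, and with it the opportunity for random high-scoring matches, grows with the width of the window.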
The invention of a new generation of mass spectrometers revolutionized the field of proteomics, providing high-resolution and high-accuracy MS data at an expanded dynamic range (
). High mass accuracy data has the intrinsic capability to reduce the database search space by allowing for small mass tolerances and an associated decrease in plausible candidate sequences, which is important when searching large genomic six-frame translation databases (
). Further improvements in MS technology enabled the acquisition of high-resolution data at both the MS and MS/MS level without compromising sequencing speed (
Ultra high resolution linear ion trap Orbitrap mass spectrometer (Orbitrap Elite) facilitates top down LC MS/MS and versatile peptide fragmentation modes.
). A typical LC-MS/MS proteomics strategy employs the enzyme trypsin to digest proteins by specific cleavage after lysine (K) and arginine (R) amino acids. Any sequence overlap in the resulting pool of peptides will be an artifact of cleavage sites missed by trypsin. Regions of a protein in which consecutive K or R residues are spaced fewer than 6 or more than 30 amino acids apart will tend not to be observed by the mass spectrometer, and peptides with extremes of hydrophilicity or hydrophobicity will not be readily bound or eluted from the LC column, thereby limiting sequence coverage. Recent CPTAC studies report median protein sequence coverage of about 25% across >12,000 proteins in human cancer cohorts (
) using extensive sample fractionation. Sequence coverage can be increased by use of multiple proteases generating different peptide species at the expense of additional experimental costs. Nucleotide sequencing-based methods to measure gene expression, such as RNA-Seq and Ribo-Seq, typically achieve higher genome coverage and are routinely used to complement the annotation of newly sequenced genomes (
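The effect of tryptic peptide length on achievable sequence coverage can be sketched as follows. This is a simplified in silico digest that assumes cleavage after every K or R with no missed cleavages and no proline exception, and it treats the 6 to 30 residue window mentioned above as a hard cutoff; both are simplifying assumptions, and the function names are hypothetical.

```python
def tryptic_digest(protein):
    """Cleave after every K or R (simplified: no missed cleavages, no proline rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide without K/R
    return peptides


def observable_coverage(protein, min_len=6, max_len=30):
    """Fraction of residues that fall in peptides of an observable length."""
    peptides = tryptic_digest(protein)
    covered = sum(len(p) for p in peptides if min_len <= len(p) <= max_len)
    return covered / len(protein)


toy_protein = "MKAAAAAAKGGR"  # made-up sequence for illustration
print(tryptic_digest(toy_protein))  # ['MK', 'AAAAAAK', 'GGR']
print(round(observable_coverage(toy_protein), 3))  # 0.583
```

Running the same calculation with a second protease specificity and taking the union of covered residues would show how multiple proteases raise the attainable coverage.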
Personalized Protein Sequence Databases
Reference protein sequence databases, such as those from Ensembl or RefSeq, are typically used to identify mass spectra through peptide spectrum matching. Because these databases lack sample-specific sequence variation, including single amino acid variants (SAAVs), insertions, deletions, alternative splice junctions and novel gene fusions, studies using this approach are unable to identify the corresponding variant peptides present in the MS/MS data. This is a particularly important limitation to consider in cancer studies, where patients acquire tumor-specific somatic variation. Analyzing nonsynonymous somatic mutations at the proteome level has the potential to yield novel insights into tumor biology (
). To do this, genome and RNA sequencing have been used to generate personalized protein sequence databases by incorporating nonsynonymous variants into reference protein sequences. Several informatics pipelines have emerged in recent years for generating these databases (
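As a rough illustration of how nonsynonymous variants can be incorporated into reference protein sequences, the sketch below applies single amino acid variants written in the form G12V (reference residue, 1-based position, variant residue). The input format, the entry naming scheme and the function names are assumptions made for this example and do not describe any specific published pipeline; the sequence is a short toy fragment.

```python
import re


def apply_saav(reference, variant):
    """Return the protein sequence carrying one single amino acid variant.

    variant uses the form '<ref><1-based position><alt>', e.g. 'G12V'.
    Raises ValueError if the notation is malformed or the reference residue
    does not match the sequence.
    """
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", variant)
    if match is None:
        raise ValueError(f"unrecognized variant notation: {variant}")
    ref_aa, pos, alt_aa = match.group(1), int(match.group(2)), match.group(3)
    if pos < 1 or pos > len(reference) or reference[pos - 1] != ref_aa:
        raise ValueError(f"{variant} does not match the reference sequence")
    return reference[:pos - 1] + alt_aa + reference[pos:]


def build_personalized_db(reference_db, variants_by_protein):
    """Append one variant entry per SAAV to a copy of the reference database."""
    db = dict(reference_db)
    for protein_id, variants in variants_by_protein.items():
        for variant in variants:
            db[f"{protein_id}_{variant}"] = apply_saav(reference_db[protein_id], variant)
    return db


reference_db = {"P1": "MTEYKLVVVGAGGVGK"}  # toy fragment, hypothetical identifier
db = build_personalized_db(reference_db, {"P1": ["G12V"]})
print(db["P1_G12V"])  # MTEYKLVVVGAVGVGK
```

Insertions, deletions, splice junctions and fusions need coordinate-aware handling at the transcript level and are beyond this sketch.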
Because the likelihood of experimentally observing a peptide decreases as one progresses from a reference database to each variant database (SAAV; novel splice junctions; and novel coding loci in putative intergenic regions), MS/MS data sets are sensibly searched in an iterative fashion through individual databases with separate FDR estimations for PSMs (
). As the total size of the protein database increases, identifying high confidence PSMs requires increased spectral quality with increasingly complete peptide fragmentation. Detailed statistical considerations for iterative search strategy design and FDR estimation in a proteogenomic paradigm have recently been thoroughly described (
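The idea of estimating FDR separately for each peptide class can be sketched with a simple target-decoy calculation. This example assumes each PSM is already labeled with a score (higher is better), a decoy flag and a class such as reference or variant; it estimates FDR as decoys divided by targets above a score cutoff, which is one common convention. It covers only the separate FDR estimation, not the iterative searching itself, and the function names are hypothetical.

```python
from collections import defaultdict


def accept_at_fdr(psms, max_fdr=0.01):
    """Return target scores accepted at the largest cutoff meeting max_fdr.

    psms is a list of (score, is_decoy) tuples.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    cutoff = 0  # number of top-ranked PSMs to keep
    for i, (_, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= max_fdr:
            cutoff = i
    return [score for score, is_decoy in ranked[:cutoff] if not is_decoy]


def classwise_fdr(psms, max_fdr=0.01):
    """Apply accept_at_fdr independently within each peptide class.

    psms is a list of (score, is_decoy, peptide_class) tuples.
    """
    by_class = defaultdict(list)
    for score, is_decoy, peptide_class in psms:
        by_class[peptide_class].append((score, is_decoy))
    return {c: accept_at_fdr(v, max_fdr) for c, v in by_class.items()}


# Made-up scores for illustration only.
psms = [
    (10, False, "reference"), (9, False, "reference"),
    (8, True, "reference"), (7, False, "reference"),
    (10, False, "variant"), (9, True, "variant"), (8, False, "variant"),
]
print(classwise_fdr(psms, max_fdr=0.34))
# {'reference': [10, 9, 7], 'variant': [10]}
```

With the same threshold, the variant class accepts fewer PSMs than the reference class because its decoys are not diluted by the larger number of reference targets, which is the motivation for keeping the estimates separate.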
Sequencing of Antibodies
In addition to its role in novel peptide identification and gene annotation, sequence-centric proteogenomics has played a significant role in antibody sequencing (
). In vertebrates, antibodies provide the ability for an organism to differentiate between itself and the environment, and a mechanism to fight a diverse range of infections. In order to meet these needs, antibodies possess a tremendous level of sequence diversity, which is achieved through a combination of antibody sequence options and the introduction of mutations. Three types of antibody genes are encoded in the genome: Variable (V), Diversity (D), and Joining (J), and these are combined through a process called V(D)J recombination (
) to produce a diverse library of heavy and light chains that are combined pairwise, and together provide the antigen binding specificity. The affinity maturation process (
) further optimizes antibodies that recognize foreign objects by allowing for a high rate of point mutation, followed by selection for the strongest binders. At the same time, antibodies recognizing self are eliminated, a process that is defective in autoimmune diseases. Although antibody sequences are encoded in the genome, because of the process by which they are combined and matured, it is not possible to predict the final repertoire of antibody sequences for an individual from the genome alone. However, RNA-Seq can be applied to sequencing the variable region of the light and heavy chains to obtain a sampling of the antibody diversity of an individual.
MS has also been applied to sequencing both recombinant and circulating antibodies. For recombinant antibodies where the sequence is known, MS can be used to confirm the sequence and check for purity. For studying circulating antibodies, the most widely used approach is to use the antigen as bait to enrich for the circulating high affinity antibodies and analyze them with MS/MS. Multiple aliquots are digested with proteases of different specificity to generate comprehensive coverage with overlapping peptides; the spectra can then be interpreted using de novo sequencing approaches (
). This requires high quality data, and limits the number of antibodies that can be sequenced in a mixture. An alternative is to use a proteogenomics approach: performing targeted RNA-Seq of the variable region of the light and heavy chains, assembling the reads and translating the assemblies to create a protein sequence database for searching. This approach has been employed to identify high affinity circulating antibodies in infected individuals against HIV (
) surface proteins that are potentially broadly neutralizing. The approach has also been applied to produce single chain llama antibodies to be used as reagents (
). Llamas and camels produce single-chain antibodies in addition to paired heavy-light chain antibodies. The advantage of single-chain antibodies as reagents is that they are small (∼15 kDa) and robust, and once sequenced, they can easily be expressed in E. coli to provide a reproducible resource. These single-chain antibodies can be humanized (
); they are, therefore, highly promising candidates for developing therapeutics.
Viral Infections and Activation of Transposable Elements
Beyond expression of host genes, viral infections and transposons contribute potential protein-level expression in eukaryotic organisms that can be studied using proteogenomic techniques. Viral infections alter gene expression and protein production as the virus hijacks cellular processes that allow for self-replication (
). Eukaryotic genomes also contain mobile elements or transposons that are remnants of ancient viral infections that were incorporated in the host germline genome and then spread throughout the genome through gene copying. It is estimated that about half of the human genome is transposon sequence, although most transposons are no longer active (
). There is, however, a small subset of the LINE-1 (Long Interspersed Elements) retrotransposons that are capable of autonomous retrotransposition through a copy and paste mechanism using an RNA intermediary, but they are most commonly inactive in somatic cells. However, increased retrotransposition activity has been observed in some disease states including many cancer types (
). The role of retrotransposition in cancer biology is currently unclear: it is not known whether it promotes or suppresses tumors, or whether it is merely an effect of genome instability (
), but understanding it could provide important insights into cancer biology. Human LINE-1 has two open reading frames: ORF1 and ORF2. ORF1 is an RNA binding protein and ORF2 is an endonuclease and a reverse transcriptase. ORF1 can be quantified using MS-based proteomics, and has been observed by deep MS-based proteomics in many tumors including breast, prostate and ovarian tumors (
). Interesting proteogenomics questions for future research include: Do higher levels of LINE-1 transcripts and ORF1 protein concentrations correlate with tumor progression? Which human transcripts and protein levels correlate with ORF1 protein concentration? Does ORF1 protein concentration correlate with a higher number of somatic LINE-1 insertions?
Metaproteogenomics
Metaproteomics (
), or proteomic investigation of multiorganism communities, represents a unique application for proteomic and genomic data integration. Metagenomic studies typically collect genome data only, representing the functional potential of organisms present in a community, whereas metaproteogenomics adds the additional layer of proteomics data to elucidate what cellular functions are being expressed and utilized by community members. Unfortunately, reference genome sequence databases lack many protein sequences in natural consortia samples, as many organisms within the biological sample may not have a sequenced genome. Despite the meteoric rise in the number of sequenced genomes, there remain large swaths of bacterial diversity that are unrepresented in databases like GenBank and UniProt. A recent paper by the Banfield group (
) highlights numerous phyla which were previously unknown (not just unsequenced). Indeed, they observe a new phyla radiation with dozens of bacterial phyla entirely separate from known taxonomy. Recent predictions for bacterial diversity on the planet approach 1 trillion distinct species (
). Thus, we expect that reference genome sequence databases will be incomplete at the species level for the foreseeable future.
Therefore, a driving need for utilizing genomic data in metaproteomics experiments is the inaccuracy of annotating genomes for novel species. Proteomics has frequently been used not only to improve the set of known proteins in a single organism, but as training and/or testing data for improving gene calling algorithms (
). Moreover, protein annotation is less accurate for genes which have not been previously characterized, a common observation for samples in natural (not laboratory) conditions.
Absent algorithmic advances which allow spectra to be identified without an exact sequence match to a database (
), the path forward for metaproteomics involves obtaining sequencing data for the biological sample at hand. This can be either metagenomic or metatranscriptomic data. Each provides crucial information about the potential protein sequences that should be considered when searching tandem mass spectra. Although this introduces additional costs to the project, it has so far been the best way to improve the number of identified peptides in a metaproteomic analysis (
In this section we cover the integration of profiling data from different omics platforms to help elucidate information flow from DNA to RNA to proteins and, most importantly, to phenotype.
This includes studies aimed at understanding whether protein abundance can be reliably predicted from mRNA measurements (mRNA-Protein Correlation); assessing the genetic control of mRNA and protein abundance (Genetic Control of mRNA and Protein Abundance); and the impact of genetic aberrations on post-translational protein modification (PTM) and signaling (Relating Mutations to PTM and Signaling).
mRNA-Protein Correlation
Correlation between mRNA and protein profiling data has been a topic of considerable research during the past decade, and an excellent review on this has recently been published (
). Early studies focused on the correlation between steady state mRNA and protein abundance for all genes in a single sample, and it was noted in various organisms that relative abundance of proteins in a sample cannot be adequately explained by the corresponding mRNA abundance (
). This can be explained, at least in part, by our understanding that protein abundance is determined by a combination of mRNA abundance, translational regulation, and protein degradation (
). With the availability of paired mRNA and protein data for large sample cohorts, studies on gene-wise correlations between mRNA and protein abundance across many samples also reported modest correlations (
). It is unclear how much of the reported low correlation between mRNA and protein expression is because of technological issues versus underlying biology. Statistical methods that attempt to model stochastic and systematic errors in mRNA and protein profiling data have produced higher mRNA-protein correlations (
). Recent studies by the CPTAC consortium have reported nonrandom associations between the level of mRNA-protein correlation and biological functions of the genes (
). For example, metabolic functions such as amino acid, fatty acid and nucleotide metabolism are enriched for genes with high mRNA-protein correlations, whereas ribosomal and mRNA splicing functions are enriched for genes with low or negative mRNA-protein correlations. A more systematic study using mRNA and protein profiling data from the three CPTAC cancer types showed that proteomic data strengthened the link between gene expression and function for at least 75% of Gene Ontology (GO) biological processes and 90% of KEGG pathways (
). Thus, mRNA-protein discrepancy cannot be simply explained by experimental errors, and biological functions arise from both mRNA- and protein-level regulation.
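The gene-wise correlation across samples discussed above can be sketched in a few lines. This example assumes mRNA and protein abundances are stored as dictionaries mapping each gene to a list of values in the same sample order, and it uses Spearman rank correlation as one reasonable choice of statistic; the data layout and function names are assumptions for illustration.

```python
from statistics import mean


def ranks(values):
    """Assign 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns NaN if either vector is constant."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def genewise_correlation(mrna, protein):
    """Correlate mRNA and protein across samples for genes present in both."""
    shared = mrna.keys() & protein.keys()
    return {gene: spearman(mrna[gene], protein[gene]) for gene in shared}


# Made-up abundances for four samples.
mrna = {"geneA": [1, 2, 3, 4], "geneB": [1, 2, 3, 4]}
protein = {"geneA": [10, 20, 30, 40], "geneB": [4, 3, 2, 1], "geneC": [1, 1, 2, 2]}
print(genewise_correlation(mrna, protein))
```

The resulting per-gene coefficients could then be ranked and tested for enrichment of functional categories, as in the CPTAC analyses described above.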
Genetic Control of mRNA and Protein Abundance
Genetic variation plays an important role in determining mRNA and protein abundance. mRNA and protein expression data from a cohort of samples can be integrated with DNA variation information to study the underlying genetic determinants of gene expression variation. This type of analysis is an extension of the traditional quantitative trait locus (QTL) mapping, in which a section of DNA (the locus) is correlated with variation in a phenotype (i.e. quantitative trait). When expression levels of mRNAs are treated as quantitative traits, the QTL analysis is named eQTL analysis, a method that has become well-established in the field of genetics (
) (Fig. 2B). eQTLs may be cis- or trans-acting, depending on their physical distance from the gene they regulate. Specifically, cis-eQTLs affect expression of a gene at the same locus as the variant, whereas trans-eQTLs affect expression of a gene at a different locus. Although many cis-eQTLs have been reported, mapping trans-eQTLs has been less successful (
). It remains unclear whether the difficulty in mapping trans-eQTLs reflects true biology (i.e. eQTLs primarily act in cis) or computational and statistical challenges. More recently, ribosome occupancy and protein abundance have been used as quantitative traits to identify ribosome occupancy QTLs (rQTLs) and protein abundance QTLs (pQTLs), respectively (
An integrative multi-omics study on a set of HapMap Yoruba lymphoblastoid cell lines found that most QTLs were associated with mRNA expression levels, but their impact on protein expression levels was significantly reduced (
). This buffering of protein levels may allow cells to cope with noisy genetic variations and attenuate their impact on downstream phenotypes. Interestingly, a set of cis-QTLs that affect protein abundance showed little or no effect on mRNA or ribosome occupancy levels, suggesting potential roles in post-translational regulation. Both the buffering effect and protein abundance-specific QTLs have been reported in earlier studies in yeast (
). These studies all suggest that integrating high-throughput proteomic data into QTL analysis could provide new insights into gene expression regulation.
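A minimal sketch of the QTL logic described above follows. It estimates the per-allele effect of a variant on a quantitative trait (mRNA abundance for an eQTL, protein abundance for a pQTL) by ordinary least squares and labels a variant-gene pair as cis or trans. The 1 Mb window, the function names, and the toy values are assumptions made for illustration; the toy values were chosen to mimic a smaller effect at the protein level. Real QTL mapping additionally involves covariates, significance testing, and multiple-testing correction, none of which is shown here.

```python
from statistics import mean

# Assumed cis window; published studies use different cutoffs.
CIS_WINDOW_BP = 1_000_000


def ols_slope(dosages, trait):
    """Least-squares slope of a trait on allele dosage (0, 1, or 2 copies)."""
    mx, my = mean(dosages), mean(trait)
    sxx = sum((d - mx) ** 2 for d in dosages)
    if sxx == 0:
        raise ValueError("variant is monomorphic in this cohort")
    sxy = sum((d - mx) * (t - my) for d, t in zip(dosages, trait))
    return sxy / sxx


def classify_qtl(variant_chrom, variant_pos, gene_chrom, gene_start, gene_end,
                 window=CIS_WINDOW_BP):
    """Label a variant-gene pair 'cis' if the variant lies within the window
    around the gene on the same chromosome, otherwise 'trans'."""
    if variant_chrom != gene_chrom:
        return "trans"
    if gene_start - window <= variant_pos <= gene_end + window:
        return "cis"
    return "trans"


# Toy cohort (hypothetical): six samples genotyped at one variant.
dosages = [0, 0, 1, 1, 2, 2]
mrna_levels = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]
protein_levels = [2.0, 2.1, 2.2, 2.4, 2.5, 2.6]

eqtl_effect = ols_slope(dosages, mrna_levels)     # effect on mRNA per allele
pqtl_effect = ols_slope(dosages, protein_levels)  # effect on protein per allele
pair_label = classify_qtl("chr1", 1_500_000, "chr1", 2_000_000, 2_050_000)
```

Running the same regression on both traits for the same variant is what allows the mRNA-level and protein-level effects to be compared directly.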
Similarly, analysis of the correlation between copy number alteration (CNA) and mRNA or protein abundance has been used to infer the impact of CNAs on mRNA and protein abundance, including both cis-effects on the abundance of genes at the same locus and trans-effects on the abundance of genes at other loci in the genome. Visualization of the resulting correlation matrix in a heatmap can help highlight statistically significant cis- and trans-correlations. Furthermore, visually and statistically comparing the correlation heatmaps for mRNA and protein can reveal relationships between these profiles: cis- and trans-effects in protein (and also phosphoprotein) are generally subsets of mRNA cis- and trans-effects, respectively, with more directionally uniform effects at the protein level (
) (Fig. 2B). These correlation matrices can also be used to identify candidate driver genes whose copy number alterations directly drive significant trans-effects by comparing with functional knockdown data in large public databases like LINCS (Library of Integrated Network-based Cellular Signatures) (
Efforts have also been made to study the roles of miRNAs in gene expression regulation. miRNAs are small noncoding RNAs that pair to the messenger RNAs (mRNAs) of protein-coding genes to suppress their expression (
). To investigate all miRNAs simultaneously in their endogenous context, Liu et al. performed an integrative analysis of global miRNA, mRNA, and protein profiles in nine colorectal cancer cell lines using a correlation-based method (
) (Fig. 2B). This study showed that translational repression was involved in more than half of all predicted miRNA-target interactions and played a major role in a third of them. These predicted miRNA-target interactions can be further confirmed by more focused miRNA perturbation studies. Interestingly, sequence features known to drive site efficacy in mRNA decay, such as an 8mer seed site, site positioning within the 3′ UTR, local AU-rich context, and additional 3′ pairing, are generally not applicable to translational repression (
). A key unanswered question is what sequence features determine selectivity for miRNA-mediated translational repression.
Relating Mutations to Post-translational Modifications and Signaling
Millions of nonsynonymous single nucleotide polymorphisms (nsSNPs) identified by next-generation sequencing (NGS) and genome-wide association studies (GWAS) have been correlated with certain phenotypes and diseases (
). However, the functional mechanisms underlying these associations are often poorly understood or completely unknown. One likely explanation is that a subset of these SNPs results in amino acid changes at PTM target residues, including targets of phosphorylation (serines, threonines, and tyrosines) and of acetylation and ubiquitylation (lysines), thereby directly perturbing cell signaling networks (
), they are expected to be disproportionately affected by missense mutations. Substitutions of amino acids that are targets of PTMs can result in destruction, genesis, or constitutive activation of PTM sites (
). Moreover, mutations affecting positions flanking a PTM site might alter the recognition motif for the corresponding transferase; for example, protein kinases recognize, among other factors, specific sequence motifs on their substrate proteins (
To address this, several studies have assessed the effect of SNP-induced changes to PTM sites, predominantly serine, threonine and tyrosine phosphorylation. In 2008 Ryu et al. (
) databases to develop software predicting phosphorylation sites accompanied by a database for human phosphovariants, which the authors defined as genetic variations that change phosphorylation sites or their interacting kinases. In this study variants were classified into three groups depending on whether the variant directly affects a phosphorylation site, the flanking region or the kinase itself (Fig. 2C). Two years later, Yang et al. (
), to identify 64 phosphorylation sites that potentially result in a disease phenotype, including schizophrenia and hypertension, when substituted by a nonphosphorylatable amino acid. In total, 1451 nsSNPs present in dbSNP (downloaded May 2007) occurred within a ±7 amino acid flanking region of a phosphosite, thereby potentially influencing recognition of the site by its kinase. In a related study, Ren et al. (
) carried out a genome-wide analysis of SNPs that potentially influenced protein phosphorylation status. The authors used a combination of dbSNP, predicted kinase-specific phosphosites, and experimentally detected phosphosites to identify and classify SNPs affecting phosphosignaling. Based on the predicted phosphosites, the authors estimated that ∼70% of nsSNPs have the potential to affect phosphosignaling, suggesting that a large proportion of nsSNPs play an important role in rewiring biological pathways. Creixell et al. (
) described a similar computational approach (ReKINect) to systematically classify and interpret such network-attacking mutations (NAMs) specifically in phosphosignaling. The authors used exome sequencing, bioinformatics and phosphoproteomics to demonstrate as a proof-of-principle the existence of six types of NAMs in human cancer cell lines.
) developed a computational method (ActiveDriver) based on a gene-centric generalized linear regression model to detect aberrant mutation rates proximal to phosphorylation sites. This method was used to analyze 800 genomes spanning eight cancer types to detect mutations specifically targeting the phosphorylation machinery, identifying 44 genes with significantly higher mutation rates in regions with detected phosphosites than in the rest of the gene sequence, after accounting for structured and disordered regions. The identified mutations comprised both known driver mutations and novel candidates. The authors then extended their approach to the TCGA pan-cancer data set, containing more than 3000 genomes from 12 cancer types, and found mutations affecting phosphosignaling in about 90% of all tumors (
). In its latest release (2014), PSP introduced the “PTMVar” data set, which intersects missense mutations with PTMs and details 25,000 PTMs impacted by known variants, about 75% of which relate to phosphorylation. The remaining PTM sites comprise ubiquitylation, acetylation, mono-methylation, and succinylation sites. These additional modifications, despite their low coverage, enable researchers to relate genomic mutations to PTMs beyond phosphorylation. g2pDB (
) is a database mapping protein PTMs to genomic coordinates for all phosphorylations, acetylations and ubiquitylations that are available in the Global Proteome Machine Database (GPMDB) (
). Overlaying the genome-mapped PTM sites with genome coordinates of known, disease-associated SNPs might reveal a role of these PTM sites in the respective disease. A list of all relevant tools and databases can be found in Table II.
Table II. Computational frameworks and resources for intersecting PTMs and mutations
All the aforementioned studies focused on classifying mutations as either directly or indirectly (i.e. in a proximal flanking region) affecting the PTM site. However, there is no consensus on the length of flanking regions, the number of classes, or their nomenclature, making it difficult to directly compare the findings of different studies. The integrative analysis of genomic mutations and PTM-mediated signaling shows great promise in providing insights into the mode of action of disease-associated mutations. This type of analysis can aid in discriminating tumor driver mutations from functionally neutral passenger mutations, and may ultimately lead to novel personalized treatments. More importantly, PTMs are not accessible to genomic sequencing technologies, and the ever-increasing collection of published, global PTM-omes at single amino acid resolution demonstrates the indispensable value of state-of-the-art MS-based proteomics in the era of precision medicine.
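The direct versus flanking classification discussed in this section can be sketched in a few lines. This is a simplified illustration, not the implementation of any tool in Table II: the function name, the class labels, and the toy positions are hypothetical, and the default window of 7 residues merely follows the ±7 amino acid region mentioned above. Variants that affect the modifying enzyme itself are not modeled.

```python
def classify_variant(variant_pos, ptm_sites, flank=7):
    """Classify a missense variant by protein position relative to PTM sites.

    variant_pos -- 1-based residue position of the substituted amino acid
    ptm_sites   -- set of 1-based residue positions carrying a PTM
    flank       -- number of residues on each side treated as flanking
    Returns 'direct', 'flanking', or 'distal' (hypothetical labels).
    """
    if flank < 0:
        raise ValueError("flank must be non-negative")
    if variant_pos in ptm_sites:
        return "direct"
    if any(abs(variant_pos - site) <= flank for site in ptm_sites):
        return "flanking"
    return "distal"


# Toy protein (hypothetical) with modified residues at positions 15 and 100.
sites = {15, 100}
labels_flank7 = {pos: classify_variant(pos, sites) for pos in (15, 20, 22, 23)}
labels_flank10 = {pos: classify_variant(pos, sites, flank=10)
                  for pos in (15, 20, 22, 23)}
```

In this toy case the variant at position 23 is distal with a window of 7 but flanking with a window of 10, which illustrates why differing window definitions make results from different studies hard to compare.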
Integrative Modeling of Proteogenomic Data
Integrative modeling involves the application of statistical, machine learning and network-modeling tools to data obtained from one or more omics platforms. In this section, we focus on the application of integrative modeling to proteogenomic analyses. Models can be developed on combined omics data sets (e.g. genomics and proteomics), or applied to each omics data type separately, and the results comparatively analyzed. We review clustering (Unsupervised Clustering) and predictive modeling (Predictive Modeling)—usually termed class discovery and class prediction, respectively—which are both orthogonal approaches to gaining insight from biological data; and network modeling (Pathway and Network Modeling), which interprets data in the context of prior biological knowledge and promotes understanding at the level of pathways and cellular mechanisms.
Unsupervised Clustering
Clustering is a method of grouping similar entities (e.g. samples, genes, or proteins) based on a similarity metric. Because metadata about the entities (such as phenotypes, mutations, or disease type) are not used in the clustering process, the algorithms are termed “unsupervised” and are primarily used to discover new groups or classes, in addition to computationally validating known biology. Most proteogenomic analyses include unsupervised clustering of proteome and/or phosphoproteome data, followed by comparison of the resulting clusters to known subgroups, cluster labels derived from genomic data, or other mutation, survival, or clinical data.
Clustering of proteome data is performed using a variety of algorithms including hierarchical (
) is a common approach used to assess cluster stability and define the natural number of clusters in the data. Visualization of the consensus matrix, along with the delta-area plot and silhouette plots (
) is an effective way to determine the number of clusters in the data.
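As an illustration of how silhouette values help judge a clustering, the sketch below computes the silhouette width of each sample for a given set of cluster labels. It assumes Euclidean distance and assigns a width of 0 to samples in singleton clusters, which is a common convention; the function name and toy points are hypothetical. It does not perform consensus clustering itself.

```python
from math import dist
from statistics import mean


def silhouette_widths(points, labels):
    """Return the silhouette width of every point under the given labels."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette requires at least two clusters")

    widths = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            widths.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = mean(dist(points[i], points[j]) for j in own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            mean(dist(points[i], points[j]) for j in members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths


# Toy samples (hypothetical) forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
good = mean(silhouette_widths(points, [0, 0, 1, 1]))  # matches the structure
bad = mean(silhouette_widths(points, [0, 1, 0, 1]))   # splits each group
```

Comparing the mean width across candidate numbers of clusters is one way to choose among them; values near 1 indicate compact, well-separated clusters, and negative values indicate likely misassignment.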
Once the proteome or phosphoproteome clusters are identified, the samples constituting these clusters can be characterized by enrichment tests for known subgroups (e.g. PAM-50 classification or RPPA groups in breast cancer, methylation subtype in colon cancer, mutation status for relevant genes, or other clinical or survival data). In addition, supervised marker selection methods combined with pathway enrichment analysis (e.g. SAM (
) for pathway enrichment) can also be used to characterize proteome or phosphoproteome clusters by identifying pathways or gene sets that are selectively up- or down-regulated in each cluster.
An alternative approach to clustering proteome or phosphoproteome data is to project the original data to pathway space and then cluster the projected data. This approach was used by Mertins et al. (
) to cluster phosphoproteome pathways, resulting in a unique cluster not directly observed either in the proteome or phosphoproteome data. The projection to pathway space is performed using single-sample gene set enrichment analysis (ssGSEA) (
), where the enrichment of curated pathways (MSigDB C2 gene sets, http://software.broadinstitute.org/gsea/msigdb) in each sample is evaluated. The enrichment scores are then subject to unsupervised clustering, followed by characterization of the derived clusters using the pathways constituting the data set.
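The idea of projecting a gene-level matrix into pathway space can be sketched as follows. Note that this is deliberately not ssGSEA: as a simpler stand-in, each pathway score is the mean of per-gene z-scores over the pathway members in a sample. The function names, the minimum gene-set size, and the toy data are assumptions for illustration only.

```python
from statistics import mean, pstdev


def zscore_rows(matrix):
    """Standardize each gene (row) across samples; constant rows become 0."""
    scaled = {}
    for gene, values in matrix.items():
        mu, sd = mean(values), pstdev(values)
        scaled[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return scaled


def pathway_scores(matrix, gene_sets, min_genes=2):
    """Project a gene-by-sample matrix to a pathway-by-sample matrix.

    Each score is the mean z-score of the pathway's measured genes in that
    sample (a simplified stand-in for an ssGSEA enrichment score).
    """
    if not matrix:
        return {}
    z = zscore_rows(matrix)
    n_samples = len(next(iter(matrix.values())))
    scores = {}
    for name, genes in gene_sets.items():
        present = [g for g in genes if g in z]
        if len(present) < min_genes:
            continue  # too few measured members to score this pathway
        scores[name] = [mean(z[g][s] for g in present)
                        for s in range(n_samples)]
    return scores


# Toy data (hypothetical): three genes measured in three samples.
abundance = {
    "G1": [1.0, 2.0, 3.0],
    "G2": [2.0, 4.0, 6.0],
    "G3": [3.0, 2.0, 1.0],
}
gene_sets = {"PATHWAY_X": {"G1", "G2"}, "PATHWAY_Y": {"G3", "G9"}}
projected = pathway_scores(abundance, gene_sets)
```

The resulting pathway-by-sample matrix can then be passed to any unsupervised clustering method in place of the original gene-level matrix.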
Coclustering
In coclustering, data from multiple modalities (e.g. mRNA and proteome) are treated as independent “samples,” and clustering is performed over the collection of disparate omics profiles. The key here is to either transform the data so that different modalities are comparable (e.g. z-scores), or to use a similarity metric that is agnostic to the scale of values in the data (e.g. Spearman correlation (
)). Coclustering mRNA and proteome data (using hierarchical clustering) after filtering to retain genes or proteins with moderate to high correlation has been used to show that mRNA profiles of samples are closest to their corresponding proteome profiles, thereby validating sample quality and mitigating concerns regarding tumor heterogeneity in (
Clustering over sample profiles obtained from two or more omic platforms is referred to as multi-omic clustering. Unlike coclustering, where the multi-omic data from each sample provides independent items that are clustered simultaneously, multi-omic clustering attempts to derive an integrative clustering to assign each sample to a single cluster based on combined evidence from multi-omic data. An overall review of multi-omic clustering methods is presented in (
Direct integrative clustering methods use a combined multi-omics data set as input to the clustering analysis. Examples in this category include iCluster+ (
Clustering of clusters is an approach where clustering is initially performed on each omics data set and the results integrated into final cluster assignments. Examples include COCA (
Regulatory integrative clustering harnesses molecular regulatory structures and/or networks to integrate different omics data sets in a robust manner. Examples in this group include PARADIGM (
Many of the clustering algorithms use de novo or regulatory network graphs to model interactions in each omics domain, and to drive integration across different omics data sets. A review of clustering methods from this orthogonal perspective is covered in (
Predictive Modeling
Predictive modeling is a statistical approach in which models are built to predict a future outcome based on data attributes. Machine learning, pattern recognition and predictive analytics all fall under the umbrella of predictive modeling, and this method of analysis has been rapidly gaining traction across most scientific disciplines. Predictive modeling and machine learning techniques applied to proteogenomics can greatly improve our ability to diagnose disease accurately, guide prognosis, and select treatment. For example, global molecular profiling of tissues and tumors enables a shift from nonspecific treatment strategies toward a more targeted, personalized approach based on the presence or absence of predictive genetic and/or protein signatures. Typical supervised classification methods used for predictive modeling from omics data include Support Vector Machines (SVMs) (
To date, machine learning and statistical modeling techniques applied to genomics and transcriptomics data have identified genetic profiles predictive of disease diagnoses (
). One would expect the predictive analysis of proteome and phosphoproteome data to be more informative regarding clinical outcomes compared with NGS data, as these data modalities are more proximal to the disease. These techniques have been applied to proteomics data to classify clinically relevant disease subtypes in cancer (
Despite the use of predictive modeling in genomics and proteomics independently, studies integrating proteomics and genomics are less common. Several studies using “multimodal” integration of data types including RNA-Seq, exon expression, and Reverse Phase Protein Array (RPPA) data to predict clinical phenotypes and drug response found no advantage to combining data modalities compared with individual platform analysis and showed gene expression data to be consistently more predictive than RPPA-based proteomics (
), fusion of four data types (genome, transcriptome, MS/MS-based proteome and phosphoproteome) did not improve the predictive performance of the model. However, they did find proteomics to outperform models based on genomics and transcriptomics data in survival prediction (
). As this is still fairly uncharted territory in proteogenomics, we anticipate a wealth of future studies assessing the predictive power of proteomics and phosphoproteomics in disease prognosis, diagnosis and drug response (Fig. 3B).
Supervised Analysis for Marker Selection
Aside from machine learning, supervised analysis has been used to derive markers for a variety of distinctions including intrinsic disease subtypes (e.g. PAM-50 subtype in breast cancer or HRD status in ovarian cancer), subtypes identified by clustering, samples with and without mutations in genes of interest (e.g. PIK3CA or TP53 mutations) and survival analysis. For examples, see (
), in addition to nonparametric tests like the Mann-Whitney test and the Kruskal-Wallis test. Although these tests are in most cases applied to specific types of omics data, marker ranking from these tests can be combined across multiple omics data sets to derive a global overall rank using rank aggregation algorithms (
Pathway and Network Modeling
Historically, the field of biomedicine has operated under the “molecular biology paradigm,” in which it is assumed that biological function can be explained through the comprehensive knowledge of genes and their associated proteins, and that these proteins operate in linear pathways (
). Despite large-scale efforts to link genotype and phenotype under this paradigm, the relationships between the two are still wholly unresolved and surprisingly complex. Instead, the systems or network biology approach attempts to consider these complex relationships to better understand this genotype-phenotype connection (
). Studies in proteogenomics can build upon current models of network biology, both by contributing to network annotation and by applying established pathway and gene ontology tools in gene-protein enrichment analyses.
Network Annotation
In network biology, nodes represent the molecules of interest (gene, protein, metabolite) and edges represent a functional, physical, or enzymatic relationship. Genetic and physical interaction networks are commonly used models for studying complex systems and disease. These networks can reflect either a static system, built from information in a single condition, or a differential system, which highlights changes in network connections between two distinct states and reveals state-specific and disease-specific interactions (see reviews (
Biological networks are typically built in three ways: (1) curation of available physical or biochemical interaction data; (2) computational predictions based on sequence similarity, gene co-occurrence, or gene coexpression; and (3) comprehensive assessment of whole genomes or proteomes (
). MS-based proteomics and PTM-omics can be layered atop these scaffold networks to both fine-tune the biological network representation and identify network rewiring in a disease state (
) (Fig. 3C). For example, Zhang et al. identified protein-protein interaction network modules that were enriched in down-regulated proteins in a poor-prognosis colorectal cancer subtype (
). Similarly, analysis of gene-protein coexpression found differential interaction patterns in a subset of network modules in basal-enriched and luminal-enriched breast cancer subgroups (
Several approaches for pathway and GO enrichment analysis have been developed, including over-representation and Gene Set Enrichment Analysis (GSEA). Over-representation analysis uses Fisher's exact test to identify pathways and GO terms with significant over-representation in a gene or protein list of interest, which should be predefined based on differential expression, clustering, or other upstream analyses. Representative tools in this category include DAVID (
) ranks genes or proteins in the entire data set based on differential expression or association with a continuous phenotype, and then uses a modified version of the Kolmogorov-Smirnov test (
) to identify pathways, signatures and GO terms in which the gene members are enriched at the top or bottom of the ranked list (Fig. 3C). As both the over-representation method and the GSEA approach ignore pathway topology when performing enrichment analysis, an additional tool, SPIA (
), was established to address this limitation. Further, because these methods all perform enrichment analysis at the gene level, they do not allow for phosphosite-level enrichment analysis, which is critical for understanding kinase-phosphate signal transduction in phosphoproteome profiling studies. PHOXTRACK (
) was developed for this purpose, and modifies the GSEA approach to search for an enrichment of known kinase targets in an uploaded phosphoproteomics profile data set (Table III).
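The over-representation test described above can be sketched with the standard library alone. The code computes the upper-tail hypergeometric probability of the observed overlap, which is equivalent to the one-sided Fisher's exact test on the corresponding 2×2 table. The function names and the toy gene identifiers are hypothetical, and no multiple-testing correction is applied, although one would be needed when many gene sets are tested.

```python
from math import comb


def hypergeom_upper_tail(k, n_list, n_set, n_background):
    """P(X >= k) for the overlap X between a list of n_list genes and a gene
    set of n_set genes, both drawn from n_background background genes."""
    max_k = min(n_list, n_set)
    total = comb(n_background, n_list)
    tail = sum(
        comb(n_set, i) * comb(n_background - n_set, n_list - i)
        for i in range(k, max_k + 1)
    )
    return tail / total


def over_representation(gene_list, gene_sets, background):
    """Return {set name: (overlap size, one-sided p-value)}."""
    background = set(background)
    selected = set(gene_list) & background  # ignore genes outside background
    results = {}
    for name, members in gene_sets.items():
        members = set(members) & background
        overlap = selected & members
        p_value = hypergeom_upper_tail(
            len(overlap), len(selected), len(members), len(background)
        )
        results[name] = (len(overlap), p_value)
    return results


# Toy example (hypothetical identifiers): 20 background genes.
background = [f"g{i}" for i in range(20)]
gene_sets = {
    "SET_A": ["g0", "g1", "g2", "g3", "g4"],
    "SET_B": ["g15", "g16", "g17"],
}
gene_list = ["g0", "g1", "g2", "g10"]
enrichment = over_representation(gene_list, gene_sets, background)
```

The choice of background matters: restricting it to the genes or proteins actually quantified in the experiment avoids inflating significance for sets that are simply well covered by the platform.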
Table III. Computational resources for pathway and gene ontology enrichment