BioBix, Lab of Bioinformatics and Computational Genomics, Department of Mathematical Modeling, Statistics and Bioinformatics, Faculty of Bioscience Engineering, Ghent University, Ghent, Belgium; VIB-UGent Center for Medical Biotechnology, Ghent, Belgium
* Special Research Fund (BOF) of Ghent University [01D20615] to S.V. Postdoctoral Fellowship of the Research Foundation-Flanders (FWO-Vlaanderen) [12A7813N] to G.M. P.V.D. acknowledges funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (PROPHECY grant agreement No 803972). This article contains supplemental material. Conflict of interest: M. W. and B.K. are founders and shareholders of OmicScouts, Freising, Germany. They have no operational role in the company. S.G. is an employee of SAP SE, Potsdam, Germany.
PROTEOFORMER is a pipeline that enables the automated processing of data derived from ribosome profiling (RIBO-seq, i.e. the sequencing of ribosome-protected mRNA fragments). As such, genome-wide ribosome occupancies lead to the delineation of data-specific translation product candidates and these can improve the mass spectrometry-based identification. Since its first publication, different upgrades, new features and extensions have been added to the PROTEOFORMER pipeline. Some of the most important upgrades include P-site offset calculation during mapping, comprehensive data pre-exploration, the introduction of two alternative proteoform calling strategies and extended pipeline output features. These novelties are illustrated by analyzing ribosome profiling data of human HCT116 and Jurkat cells. The different proteoform calling strategies are used alongside one another and in the end combined with reference sequences from UniProt. Matching mass spectrometry data are searched against this extended search space with MaxQuant. Overall, besides annotated proteoforms, this pipeline leads to the identification and validation of different categories of new proteoforms, including translation products of up- and downstream open reading frames, 5′ and 3′ extended and truncated proteoforms, single amino acid variants, splice variants and translation products of so-called noncoding regions. Further, proof-of-concept is reported for the improvement of spectrum matching by including Prosit, a deep neural network strategy that adds extra fragmentation spectrum intensity features to the analysis. In the light of ribosome profiling-driven proteogenomics, it is shown that this allows validating the spectrum matches of newly identified proteoforms with elevated stringency. These updates and novel conclusions provide new insights and lessons for the ribosome profiling-based proteogenomic research field.
More practical information on the pipeline, raw code, the user manual (README) and explanations on the different modes of availability can be found at the GitHub repository of PROTEOFORMER: https://github.com/Biobix/proteoformer.
To expand our knowledge about proteome complexity, data from sequencing technologies can be used to construct a custom database for subsequent MS searches. For example, introducing RNA-seq results into the search space aided in identifying splice variants (
) takes this approach even a step further. With this recent technique, ribosome-protected mRNA fragments are analyzed with NGS leading to a genome-wide measurement of the translation landscape. Typically, ribosomes are halted on the position where they are translating the mRNA by using an antibiotic. In Eukaryotes, the antibiotic cycloheximide (CHX) stabilizes ribosomes on the mRNA sequence and prevents further ribosomal translocation, allowing the study of elongating ribosome profiles. Other antibiotics like harringtonine (HARR) or lactimidomycin (LTM), each with their characteristic mode of action, have the unique ability to only stabilize initiating ribosomes, opening up the opportunity to visualize translation initiation (
). After mapping the sequenced fragments to the reference genome, specific offsets allow the alignments to be pinpointed to the P-site (i.e. the exact base position where the ribosome was translating the mRNA into a peptide product). The offset between the 5′ end of the read and the P-site depends on the length of the ribosome-protected fragment (RPF). With a correct set of P-site offsets, subcodon resolution is obtained and important features of ribosome profiling, like triplet periodicity and translation patterns, can be unveiled.
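The length-dependent offset described above can be illustrated with a minimal Python sketch. It is not the PROTEOFORMER implementation: the function name, the 1-based inclusive coordinate convention and the example offsets are assumptions made for illustration, and spliced alignments are ignored for simplicity.

```python
def p_site_position(read_start, read_length, strand, offsets):
    """Return the genomic P-site coordinate of one aligned RPF.

    read_start  : leftmost aligned base (1-based, inclusive; assumed convention)
    read_length : RPF length in nucleotides
    strand      : "+" or "-"
    offsets     : dict mapping RPF length -> distance from the 5' end to the P-site
    Returns None when no offset is known for this read length.
    """
    offset = offsets.get(read_length)
    if offset is None:
        return None
    if strand == "+":
        # On the plus strand the 5' end is the leftmost aligned base.
        return read_start + offset
    # On the minus strand the 5' end is the rightmost aligned base.
    read_end = read_start + read_length - 1
    return read_end - offset


# Illustrative offsets only; in practice they are estimated per sample.
EXAMPLE_OFFSETS = {28: 12, 29: 12, 30: 13}

plus_site = p_site_position(100, 28, "+", EXAMPLE_OFFSETS)
minus_site = p_site_position(100, 28, "-", EXAMPLE_OFFSETS)
```

Reads whose length has no offset in the table are skipped here rather than assigned a guessed position, which is one possible design choice among several.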
Ribosome profiling gathers data closer to the stage of the final protein product than RNA-seq and as such, it serves as a better protein expression proxy for expanding the MS search space with sample-specific sequencing results. With the help of a ribosome profiling extended search space, alternative initiation events could be validated with matching MS data (
Deep proteome coverage based on ribosome profiling aids mass spectrometry-based protein and peptide discovery and provides evidence of alternative translation products and near-cognate translation initiation events.
), a complete pipeline for processing ribosome profiling information into a sequence database for MS-based validation. This allowed the identification of new protein forms (proteoforms) and helped the re-annotation of genomes.
) which infers open reading frames (ORFs) by modeling the experimental noise and the stochastic processes involved in RIBO-seq. From this model, the set of translated codons that generates the observed reads with maximum likelihood is determined. This is in turn the basis for the assembly of ORF candidates. Another new interesting tool is SPECtre (
). It focuses on modeling the triplet periodicity of ribosomal signals using a spectral coherence classifier. It is important to note that these two new techniques can function without a parallel initiation profile sample, a capability that was lacking in the former PROTEOFORMER pipeline (
), additionally includes quantitative analyses over multiple samples at once.
For most MS search engines, only the number of fragment ion matches is considered when comparing theoretical and experimental spectra. Nevertheless, it has been proven that adding intensities to the matching algorithm (MS/MS intensity-based proteomics) enhances the identification rates (
), on the other hand, is an MS post-processing tool that combines features and scores of different MS analysis tools. Based on a semi-supervised learning method with support vector machines, it provides a statistical framework to interpret the combined results. Percolator is thus able to join the additional features of Prosit with the results of canonical search engines.
Here we present all new features added to the PROTEOFORMER pipeline since its first publication. Ribosome profiling mapping, data pre-exploration, proteoform calling strategies and outputted features have been extensively improved and expanded. Based on multiple high-depth samples of matching MS/MS data of HCT116 and Jurkat human cell lines, the ribosome profiling-based sequence database was searched with MaxQuant. On top of the classical MS/MS search engines, Prosit was applied in combination with Percolator to enhance the peptide identification rate and provide extra confidence for (novel) proteoform events and genome re-annotation.
We first give a short technical overview of the new features in the experimental procedures before setting out the results of the HCT116 and Jurkat case studies. Afterwards, we discuss the implications of these new features and results on the general proteogenomics research field.
Ribosome profiling analysis with PROTEOFORMER
Data Quality Assessment, Read Preprocessing, Alignment and Translated Transcript Calling
The updated PROTEOFORMER pipeline (Fig. 1) was applied on matching RIBO-seq and shotgun proteomics data originating from two cell lines, human HCT116 (
). For both cell lines, the CHX-treated (elongating ribosome profile) and LTM-treated (initiating ribosome profile) samples were quality checked with FastQC. The quality reports of the raw HCT116 CHX- and LTM-treated, and Jurkat CHX- and LTM-treated read samples are respectively given in supplementary Files S1 and S2, and S3 and S4. Several plots (i.e. GC content, adaptor content, duplication levels) indicate that preprocessing and filtering of the reads is desirable. Therefore, the raw reads were quality-trimmed and adaptors were clipped. Subsequently, the reads were prefiltered against databases containing rRNA, tRNA, sn(o)RNA sequences and the PhiX bacteriophage genome. Finally, preprocessed reads were aligned to the human genome. For comparison purposes, Jurkat alignments were only retained if they mapped to one unique position in the genome, whereas HCT116 alignments were allowed to map to up to 16 positions. Mapping statistics of both prefiltering and genomic alignment can be found in supplemental Table S1. The resulting alignment SAM file was then checked again for quality with FastQC. Results for the aligned HCT116 CHX- and LTM-treated and the Jurkat CHX- and LTM-treated samples are respectively given in supplementary Files S5, S6, S7 and S8. General quality improves drastically and overall, contaminants and wet-lab related artifacts (e.g. translation inhibitor usage (
). We recommend visually checking the effects of these steps with FastQC, as embedded into PROTEOFORMER. Further, visualizing quality and metagenomic features is in our opinion good practice for gaining a general understanding of the data content before continuing with the rest of the analysis.
In addition, FastQC reports a slightly better quality for Jurkat than for HCT116, but one must bear in mind that the read coverage of Jurkat is around 2.5 times higher.
Ribosome profiling specific data exploration was done with mQC. The results for the CHX- and the LTM-treated HCT116 samples are respectively given in supplementary Files S9 and S10. For the CHX- and LTM-treated samples of Jurkat, plots are respectively available in supplementary Files S11 and S12. In these files, the results of the Plastid P-site offset calculation are shown as well. Overall, the P-site determination is sharper for LTM than for CHX. Also, the higher coverage in Jurkat allows a more precise offset calculation. In the metagenic annotation plots of HCT116, a notable percentage of the alignments lies in processed pseudogenes because of the non-unique mapping (this percentage dropped drastically when mQC was applied to uniquely mapped HCT116 data (results not shown)). In general, results of mQC comply with what can be expected (
) and a good triplet periodicity is observable for both cell lines.
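Triplet periodicity can be quantified as the share of P-sites falling in each reading frame of an annotated coding sequence. The Python sketch below illustrates that idea only; it is not how mQC computes its plots, and the function name and inputs (transcript-level P-site coordinates and a CDS start coordinate) are assumptions.

```python
from collections import Counter


def frame_fractions(p_site_positions, cds_start):
    """Return the fraction of P-sites in frame 0, 1 and 2 relative to cds_start.

    p_site_positions : iterable of P-site coordinates on the transcript
    cds_start        : coordinate of the first base of the start codon
    """
    counts = Counter((pos - cds_start) % 3 for pos in p_site_positions)
    total = sum(counts.values())
    if total == 0:
        return [0.0, 0.0, 0.0]
    return [counts[frame] / total for frame in range(3)]


# Toy example: three in-frame P-sites and one in frame 1.
example_fractions = frame_fractions([100, 103, 106, 101], cds_start=100)
```

A strong enrichment of one frame over the other two is what is usually meant by good triplet periodicity.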
Visualization on a more focused level is now possible by loading the generated BedGraph files into a genome browser, with the new option to generate RPF-specific BedGraph files.
The rule-based transcript calling was used for all analyses. During this calling, transcripts are recognized as truly translated if at least 85% of their exons have an elongating ribosome profile coverage higher than a predetermined threshold (more details on this can be found in (
)). In HCT116, it yielded 65,553 translated transcript isoforms. For Jurkat, 82,065 translated transcript isoforms were called.
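The 85% exon rule can be written down in a few lines. The following Python sketch assumes that a coverage value per exon has already been computed; how coverage is defined and which threshold is used are not specified here, so the function name and the threshold values in the examples are hypothetical.

```python
def is_translated(exon_coverages, coverage_threshold, min_exon_fraction=0.85):
    """Apply the rule-based call to one transcript.

    exon_coverages     : list with one RPF coverage value per exon
    coverage_threshold : predetermined coverage cutoff (hypothetical value)
    min_exon_fraction  : minimal share of exons that must exceed the cutoff
    """
    if not exon_coverages:
        return False
    covered = sum(1 for cov in exon_coverages if cov > coverage_threshold)
    return covered / len(exon_coverages) >= min_exon_fraction


# Toy transcripts with ten exons each: nine versus eight exons above the cutoff.
call_nine_of_ten = is_translated([5.0] * 9 + [0.0], coverage_threshold=1.0)
call_eight_of_ten = is_translated([5.0] * 8 + [0.0] * 2, coverage_threshold=1.0)
```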
In this study, three methods of proteoform calling were compared: (a) the subsequent combination of TIS calling, eventual SNP calling and proteoform assembly (this combination is termed ‘classic proteoform calling’ from hereon), (b) by using the PRICE algorithm (
). The two latter methods are novel introductions in the pipeline and have the big advantage that they do not require initiating ribosome profiles. This comes in handy as many recent ribosome profiling studies lack this translation initiation focused experiment (
). The SPECtre method uses a reference annotation in a GTF file as a basis for its analysis and is therefore not useful for finding completely novel proteoforms. However, it is useful for checking which canonical sequences show translation. A big drawback of this method, though, is its running time (149 h), which is substantially longer than for the other two methods (2–4 h). These run times were measured on 20 AMD Opteron™ processors (2.3 GHz) on a Linux server running Fedora Core 23 with 350 GB of RAM. PRICE, the other new method, is constructed to find new translated sequences solely based on ribosome profiling. Therefore, a score model needs to be used. The developers chose a default FDR of 10%. This means that it needs to allow quite some false positives to find its candidates (more details on the PRICE FDR calculation can be found in the online methods of (
)). A less stringent FDR is less of a concern for this pipeline, though, as ribosome profiling is used here to obtain ORF candidates. Stronger validation follows afterwards in the pipeline using MS/MS data. A looser default FDR threshold nevertheless means that many canonical sequences are missed, which is why the overlap with the other two techniques (and especially with SPECtre) is relatively small (Fig. 2A and supplemental Fig. S1). This observation extends to the MS level (Fig. 2C and supplemental Fig. S4), where PRICE also misses a considerable share of the sequences that the other two techniques do capture. Most of the MS-identified sequences missed by PRICE but picked up by classic proteoform calling and SPECtre start from annotated TISs. Nevertheless, in cases where no MS validation and no ribosome initiation profiling are at hand, the combination of SPECtre and PRICE should result in a complete set of all translated proteoforms based solely on the elongating RIBO-seq profile.
The classic proteoform calling (combination of TIS calling and proteoform assembly) has been upgraded over the years. This method does not work with a score model but is rule-based. Therefore, it is less stringent and thus aims at a subsequent MS validation to exclude the false positives from that phase. In Fig. 2A and supplemental Fig. S1, it is shown that the classic proteoform calling gives the most complete search space of the three techniques with a combination of both canonical sequences and new variants. This strategy also adds the most MS identifications attributable to one distinct proteoform calling technique (Fig. 2A and supplemental Fig. S4).
Selenocysteines were introduced in the different proteoform calling algorithms. In the classic proteoform calling, 134 (0.057%) and 258 (0.053%) of the candidate translation products contain one or multiple selenocysteines in HCT116 and Jurkat data, respectively. With the SPECtre algorithm, 47 (0.127%) and 43 (0.101%) of the candidate products, respectively, contain selenocysteines. The internal PRICE algorithm does not take selenocysteines into account and just classifies the “UGA” codon as a stop signal. Validation of selenocysteine-containing peptides in MS/MS was however not possible. As these sequences constitute only a very small portion of the search space and as selenoproteins have been reported to only occur in specific conditions (specialized MS strategies were even developed for picking them up (
For Jurkat data, SNPs were also included during the classic proteoform calling. Of the candidate products, 7.15% contain one or more SNPs as compared with the reference sequences. A trial version of PROTEOFORMER was implemented that took indels into account as well. Subsequent MS validation could not confirm any new protein variants by indel addition (unpublished), so for the moment, indel-aware proteoform calling is not included in the pipeline.
FASTA File Export and Database Combinations
FASTA files were generated for the three different applied proteoform calling methods for both HCT116 and Jurkat data. Redundancy can be removed when generating these files, so both files with and without remaining redundancy were exported for all methods. Afterwards, the FASTA files of the three different methods were merged into one comprehensive FASTA data set, for either their redundant or nonredundant forms. The overlap between methods found during this merging is shown in Fig. 2A and supplemental Fig. S1.
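The overlap between methods at the sequence level can be derived by tracking which methods produced each distinct sequence. The Python sketch below is a simplification under stated assumptions: redundancy is treated as exact sequence identity, which may be narrower than the redundancy removal applied in the pipeline, and the method names and toy sequences are placeholders.

```python
from collections import defaultdict


def venn_regions(sequences_by_method):
    """Count distinct sequences per exclusive combination of methods.

    sequences_by_method : dict mapping method name -> iterable of protein sequences
    Returns a dict mapping frozenset(method names) -> number of distinct sequences
    found by exactly that set of methods.
    """
    methods_by_sequence = defaultdict(set)
    for method, sequences in sequences_by_method.items():
        for sequence in sequences:
            methods_by_sequence[sequence].add(method)

    regions = defaultdict(int)
    for methods in methods_by_sequence.values():
        regions[frozenset(methods)] += 1
    return dict(regions)


# Toy input; the duplicated "MKT" in the first list collapses to one entry.
toy_calls = {
    "classic": ["MKT", "MAA", "MKT"],
    "PRICE": ["MAA", "MLL"],
    "SPECtre": ["MKT", "MAA"],
}
toy_regions = venn_regions(toy_calls)
```

Each key of the returned dictionary corresponds to one region of a three-set Venn diagram.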
In general, the overlap between the different methods is quite low, but this overlap enlarges tremendously once MS/MS validation is applied (Fig. 2C and supplemental Fig. S4). It is thus essential to keep in mind that ribosome profiling can lead to a candidate proteoform database, but not to a database of surely present proteins. High overlap is therefore not necessarily expected at the ribosome profiling level, in contrast to the MS/MS level.
If redundancy was not removed in the initial database exports, the database of the classic proteoform calling method in particular is somewhat larger. The redundant database (119,716 sequences) is 55.58% larger than the nonredundant one (76,945 sequences) for the classic proteoform calling in HCT116. In contrast, the PRICE and SPECtre databases grow by only 4.33% (14,077 nonredundant sequences) and 13.59% (26,383 nonredundant sequences), respectively, when keeping redundant sequences.
The classic proteoform calling of PROTEOFORMER is designed to include different protein variants, also from different TISs (i.e. N-terminal proteoforms) in the light of possible MS validation afterwards. Therefore, it initially contains a lot of overlapping sequences (e.g. extensions, truncations, splice variants…). SPECtre on the other hand starts from a canonical reference annotation, resulting in almost all identified candidate products starting from a canonical TIS location with almost no new variants. When redundancy is not removed, the sequences resulting from the classic proteoform calling method remain in the database in both their canonical and variant form. Therefore, a higher overlap between SPECtre and the classic method is seen. Also, detailed analysis revealed that none of the PRICE-unique translation product candidates start from a canonical TIS. All canonical TIS-starting candidates identified by PRICE are in the overlap regions with the other two methods (Fig. 2A and supplemental Fig. S1).
Another remarkable feature, seen in these plots, is the fact that there is much less overlap between PRICE and the other two approaches than between the classic proteoform calling method and SPECtre. Varying the false discovery rate (FDR) of PRICE (from 0.01 over 0.1 to 0.2) did not result in an overlap increase with the other two methods (results not shown). Specifically, the number of PRICE-unique variants is subject to a changing FDR, increasing with a less strict FDR. Contrarily, the canonical sequences in the overlap sections (with classic and SPECtre) do not change remarkably.
Next, the combined results of the three methods were merged with all protein info from UniProt (consisting of SwissProt and TrEMBL). As the MS analysis program can sort out the different redundant forms later by applying protein inference algorithms, the combined redundant FASTA files were chosen for fusion with UniProt. Besides, the database size increases only by a factor of 0.43 when keeping redundant sequences. As this is far from an exponential increase, the negative impact on peptide and protein scoring is limited. The overlap between sequences resulting from the PROTEOFORMER pipeline and those available in UniProt is presented in Fig. 2B and supplemental Fig. S2. For HCT116, versions of UniProt both with and without splice variants were examined. Splice variant inclusion enlarges the overlap between UniProt and PROTEOFORMER. Also, overlap with PROTEOFORMER is larger for the SwissProt part of UniProt compared with TrEMBL, and including splice variants makes this effect even more pronounced. For the Jurkat data, the overlap has not notably increased compared with HCT116, but the number of new variants delivered by PROTEOFORMER has expanded because of the higher coverage in this data set. Detailed investigation (based on the underlying SQLite database) reveals that all PRICE-unique candidates are new variants that do not overlap with UniProt. The classic proteoform calling method on the other hand gives a combination of both UniProt known sequences and new variants. SPECtre-unique sequences are mainly found overlapping with TrEMBL whereas sequences shared between SPECtre and classic proteoform calling are generally found overlapping with SwissProt.
Another new useful addition to the pipeline is the new PEFF format, following the definition of this new format by the HUPO PSI. This FASTA-derived format allows grouping of the different SNPs and proteoform variants of a common base sequence more logically together as one entry. An example of the different proteoforms of human transcript ENST00000000412 is given in supplemental Fig. S3. These proteoforms can be exported in PEFF format as can be seen in supplementary File S13, which is a snippet of a full PEFF file generated with PROTEOFORMER.
MS/MS-based Validation with MaxQuant
Matching high-depth LC-MS/MS data over 4 replicates were searched with the MaxQuant GUI. The full results of these searches, together with peak and raw data, have been deposited to the ProteomeXchange Consortium via the PRIDE (
) partner repository with the dataset identifier PXD011353. Search results at the protein and peptide level are also available in supplementary Files S14–S19.
First, data were searched against the FASTA files of the three combined proteoform calling techniques to see how much each methodology adds to the general identification rate at the MS level. For HCT116, both the combination process with and without the redundancy removal was tested; for Jurkat, only the redundant option was analyzed. A table of identification rates at protein group, peptide and PSM level for the different search spaces and samples can be found in supplemental Table S2. In this table, the high coverage of the MS/MS data is visible in the high amounts of protein groups identified. The results coming out of these MaxQuant analyses can be used to determine the share of each proteoform calling strategy in the pool of MS/MS validated proteins, as shown in Fig. 2C and supplemental Fig. S4. It can be observed that preserving the FASTA-level redundancy, tends to reduce the number of SPECtre-only validated sequences. Like the ribosome profiling stage, keeping redundancy leads to more overlap with the classic proteoform calling method, as stated earlier. The higher coverage in the Jurkat ribosome profiling data boosts the overlap between PRICE and the other two proteoform calling methods (classic and SPECtre-based), resulting in the largest number of validations residing in the overall union of the three methods (supplemental Fig. S4G). The higher coverage also leads to more classic proteoform called validations.
Peptides with selenocysteines, on the contrary, could not be identified with MaxQuant. Several arguments explain why selenocysteine could not be picked up in MS/MS data. First, selenocysteines are present in very low abundance in the ribosome profiling data: 43 ORFs with selenocysteines out of 42,452 ORFs in total (∼0.1%). Second, selenoproteins are known to be tissue specific and susceptible to expression changes as a result of processes like aging (
) could only pick up 22 known selenoproteins and 5 new candidates in MS/MS data. So, for the MS/MS run analyzed here, not specifically designed to pick up selenocysteine-containing peptides, it is not surprising to not detect these.
In a second round, the sequences generated by the PROTEOFORMER pipeline were merged with UniProt sequences and this combined database was used for MS/MS validation with MaxQuant. Identification rates are shown in supplemental Table S2. The shares of the PROTEOFORMER pipeline and UniProt in these MS searches are shown in supplemental Fig. S5. Most of the validations are shared, but the most interesting validations are of course to be found in the PROTEOFORMER part not overlapping with UniProt, as these contain MS/MS spectra validating new proteoforms and novel translation events. If the UniProt splice variants are included, roughly 50% of the earlier PROTEOFORMER-only validations can be explained by the added splice variants in the extended UniProt database. This illustrates the benefit of including alternative splice isoforms in the search space. The higher ribosome profiling coverage of the Jurkat data allows the identification of twice as many validated proteoforms, while the number of shared identified sequences remains roughly the same. The distribution of the UniProt sequences was further split up between SwissProt and TrEMBL in Fig. 2D and supplemental Fig. S6. Here, the overlap of PROTEOFORMER with TrEMBL is much smaller than the overlap with SwissProt and also smaller than the overlap between all three collections.
With a new PROTEOFORMER module, the group of MS/MS validated sequences found by the PROTEOFORMER pipeline but not yet present in UniProt (i.e. newly identified proteoforms) was subdivided in more detail based on the nature of their variations. In HCT116, 109 and 52 such proteoforms are found outside the canonical and the splice isoform-included UniProt information, respectively. In Jurkat, 107 new proteoforms outside the splicing-included reference are found. The classification of these newly validated proteoforms is shown in Fig. 3, supplemental Fig. S7–S8 and in supplementary File S20. Different sources of proteoform variation could be validated: N- and C-terminal extensions and truncations, new splice variants, SAVs, down- and upstream ORFs, out-of-frame ORFs and translation events in regions previously considered non-coding. Splice variants and non-coding region proteoforms could be further classified in subcategories. For HCT116, MS/MS searches against both the merge with the canonical (supplemental Fig. S7) and splicing-included version (supplemental Fig. S8) of UniProt were performed. Comparing the classifications of both experiments points to a reduction of splice variants, C-terminal extensions and N-terminal extensions and truncations in the splicing-included case. The added splicing information in UniProt is thus able to explain parts of certain proteoform categories found in the canonical UniProt analysis. Next, the classification for HCT116 (supplemental Fig. S8) and Jurkat data (Fig. 3) can be compared. The different categories present in the analysis of HCT116 data are also present for Jurkat data, but some overall differences are noticeable. An increase of proteoforms with only SAVs is observed for Jurkat data because SNP calling was executed for Jurkat and not for HCT116. Some of these called SNPs lead to single amino acid variants (SAVs), which could be validated in peptides during the MS/MS analysis.
Further, whereas for HCT116 there are validated proteoforms found in pseudogenes, for Jurkat, this subcategory is absent. In the metagenic plots of the earlier mentioned mQC reports (supplementary File S9), it was found that the non-unique mapping applied in HCT116 allows enrichment of reads in pseudogenic regions. In Jurkat on the other hand, unique mapping was performed. It is clear that the ribosomal signal in pseudogenic regions for non-unique mapped experiments is also observable in the form of validated pseudogenic peptides at the MS level. An example of a new MS/MS validated proteoform can be seen in Figs. 4A and 4B, whereas its PROTEOFORMER proof on ribosome profiling level can be visualized on a genome browser as shown in Fig. 4C.
Proof-of-Concept Proteogenomic Experiment Using Prosit and Percolator
The aim of this experiment was to test whether extending the Andromeda scores with additional features from Prosit, a neural network approach for fragment intensity prediction, adds value. Percolator was used to combine the Andromeda scores with the new features from Prosit.
A first experiment consisted of comparing the q values of the earlier MaxQuant identified PSMs between a Percolator run with only the Andromeda scores and a second run with a combination of both the scores of Andromeda and the new features from Prosit. This was performed for the HCT116 data with a search space of combined redundant PROTEOFORMER data merged with the canonical version of UniProt (Fig. 5). The number of identified PSMs decreases with more stringent q value filtering. By including not only the scores of Andromeda but also the features calculated by Prosit, it is possible to filter at lower q values. As such, the analysis where Prosit features are included, can be executed at higher levels of stringency while still maintaining a comparable number of validated PSMs which is desired in a proteogenomic setup where the search space tends to increase.
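The relation between q value filtering and the number of retained PSMs can be made concrete with a simplified target-decoy estimate. The Python sketch below is a generic textbook-style calculation and an assumption on our side, not Percolator's actual procedure; the function names and the toy scores are hypothetical.

```python
def q_values(psms):
    """Estimate q values for target PSMs from a scored target/decoy list.

    psms : list of (score, is_decoy) tuples, higher score meaning a better match
    Returns a list of (score, q_value) for the target PSMs, best score first.
    """
    ordered = sorted(psms, key=lambda psm: psm[0], reverse=True)

    # Running FDR estimate: decoys accepted so far divided by targets accepted so far.
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ordered:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # The q value is the minimal FDR at which a PSM is still accepted.
    q_list = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q_list.append(running_min)
    q_list.reverse()

    return [(score, q) for (score, is_decoy), q in zip(ordered, q_list) if not is_decoy]


def count_accepted(psms, q_threshold):
    """Number of target PSMs with a q value at or below the threshold."""
    return sum(1 for _, q in q_values(psms) if q <= q_threshold)


toy_psms = [(10, False), (9, False), (8, True), (7, False), (6, False)]
accepted_strict = count_accepted(toy_psms, 0.01)
accepted_loose = count_accepted(toy_psms, 0.25)
```

A scoring scheme that separates targets from decoys better pushes decoys further down the ranking, so more target PSMs survive a given q value cutoff.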
), a lot of novel implementations were added in the PROTEOFORMER pipeline (Fig. 1). Together with the usage of high coverage MS/MS data, our pipeline leads to the validation of a collection of novel proteoforms. We here want to discuss the overall implications of these novelties on the proteogenomics research field. Further, we want to point out what can be learnt from our approach for the future proteogenomic study of proteoforms and ribosome profiling-assisted re-annotation in general.
Data Quality Assessment, Read Preprocessing, and Alignment
Data quality checks and preliminary data exploration hold a very important position in the new PROTEOFORMER pipeline. FastQC (
) offers a very good way to visualize the effects of pre-mapping data clean-up and gives at the same time a metagenomic overview of the data. These aspects are indispensable for the downstream workflow.
) improves the base resolution of the alignments to their correct P-site by calculating the RPF length-specific offset based on the sample data. A sample-specific offset is in proteogenomics of course preferable over fixed offsets, as also outlined in (
) enables visualization of even more ribosome profiling-specific features of the aligned data. This adds a collection of quality control and general outlook visualizations to the PROTEOFORMER pipeline and opens up new ways of visualizing ribosome-specific features like triplet periodicity and codon usage. New quality tools for ribosome profiling are starting to gain ground (
) and proteogenomic studies should not hesitate to use them.
Taken together, proteogenomic approaches should give high priority to quality checks, data exploration and sample-specific P-site offsets.
As seen in the results section, each of the three proteoform calling methods has its own pros and cons. Depending on the goal of the analysis, one specific method or a combination could be more suitable. When MS validation follows, combining all three implemented methods optimally enriches the search space. If one had to rely on a single technique, classic proteoform calling still takes the lead with a view to subsequent MS validation, provided that initiation ribosome profiling data are available. This is not totally unexpected as the classic method was developed to function with subsequent MS (
), even under stringent settings. Therefore, PROTEOFORMER uses MS/MS as an important independent gold standard technique to validate the new candidates, proposed by ribosome profiling. The discussion on the false positive rate of ribosome profiling demonstrates the necessity of this subsequent MS/MS validation step.
Another point worth mentioning is that information from other transcriptome and translatome sequencing sources could be considered in the PROTEOFORMER pipeline. As RNA-seq sequences can already be mapped with PROTEOFORMER, the foundation for transcriptome mapping is already laid. Other translatome sequencing techniques like RNC-seq (
) could also be included in PROTEOFORMER in the future once these techniques become as commonly used as ribosome profiling. This would allow candidate proteoform search spaces to be constructed from different translatomic technical angles. Further, as some of these additional sequencing techniques acquire longer read lengths than ribosome profiling (
), this would open up interesting opportunities for developing pipeline modules to discover new splice variants.
FASTA File Export and Database Combinations
Different options were developed for combining the FASTA exports of different PROTEOFORMER analysis strategies as well as for merging with reference sequences from UniProt. The first release of PROTEOFORMER did not allow these merges and users needed to consecutively search spectral data against UniProt first and afterward against the custom PROTEOFORMER database (
). Now, by merging sequences from UniProt and PROTEOFORMER into one database, the proteomics matching is done for all sequences at once in one search run with a set search space size. This eliminates identification biases because of differing search space sizes and thus facilitates the overall evaluation and interpretation. We are convinced that this strategy is useful in many other proteogenomic research cases. Further, the devised combination options can also be used to compare sets of strategies other than the proteoform calling methods compared in this manuscript (e.g. different transcript calling methods, different PROTEOFORMER parameter sets…). On the downside, combined databases lead to bigger search space sizes, and this has a mild negative effect on the proteomics FDR. Therefore, novel MS/MS identification strategies can be applied to counter this problem (see infra).
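The idea of a single merged search space can be sketched as follows. This is a simplified illustration and not the PROTEOFORMER combination module: the function names and the `;`-joined header convention for identical sequences are hypothetical, and real FASTA headers carry more annotation than shown here.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into {header: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


def merge_databases(*databases):
    """Merge several {header: sequence} databases into one search space.

    Identical sequences are collapsed into one entry whose header lists
    all source headers (the ';' separator is an assumption of this sketch).
    """
    by_sequence = {}
    for database in databases:
        for header, sequence in database.items():
            by_sequence.setdefault(sequence, []).append(header)
    return {";".join(headers): seq for seq, headers in by_sequence.items()}


def to_fasta(database):
    """Serialize {header: sequence} back to FASTA text."""
    return "".join(f">{h}\n{s}\n" for h, s in database.items())


# Toy sequences, invented for illustration only.
reference = read_fasta(">sp|P1\nMKT\nAAA\n>sp|P2\nMLL\n")
custom = read_fasta(">tr_1\nMKTAAA\n>tr_2\nMQQ\n")
merged = merge_databases(reference, custom)
```

Searching `merged` once, instead of searching `reference` and `custom` consecutively, keeps the search space size identical for every sequence, which is the property the merge is meant to provide.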
)) are also starting to accept the PEFF format (http://www.psidev.info/peff), another new option included in PROTEOFORMER allows exporting the results in this new format. The rich and strictly defined PEFF header information, including details about sequence variants, promises to make it easier to communicate precise results between different proteogenomic tools (
). The more tools that are programmed to handle this format, the broader the applicability of this novel format will be.
MS/MS-based Validations with MaxQuant
First, we described MS/MS-based validation experiments with a focus on combining different proteoform calling strategies. Not much difference in the number of MS/MS identifications is observed between the combination with or without search space redundancy removal. This is mostly because the protein inference algorithm of MaxQuant (
) (or other MS/MS identification tools) bundles the redundant sequences into protein groups. It is however useful to keep the redundancy if a merge with UniProt is planned afterwards, as the canonical sequences will then not be removed in favor of their longer extension variants. Keeping this redundancy does not mean that the database size explodes exponentially as seen in RNA-seq assisted proteogenomic studies (
). As such, the effect on MS/MS search FDR is limited. Further, supplemental Table S2 shows that the number of identifications does not differ significantly between searches against a redundant and a nonredundant search space (4330 identified protein groups for redundant versus 4322 for nonredundant). So, in contrast to RNA-seq supported proteogenomics, for ribosome profiling-assisted experiments a reasonable redundancy can overall be presented to the protein inference mechanism without major influence. As such, the protein inference can be used as an asset to group redundant identifications in protein groups at the later stage of MS/MS searching.
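Why redundancy removal can drop a canonical sequence is easiest to see with a small example. The sketch below assumes a simple subsequence rule (a sequence is removed when it is fully contained in a longer retained one). The actual rule applied by the pipeline may differ, and the function name and toy sequences are hypothetical.

```python
def remove_subsequence_redundancy(database):
    """Keep only sequences that are not contained in a longer kept sequence.

    database: {header: sequence}. Sequences are visited from longest to
    shortest. The quadratic scan is fine for a sketch but too slow for a
    real search space.
    """
    kept = {}
    by_length = sorted(database.items(),
                       key=lambda item: len(item[1]), reverse=True)
    for header, sequence in by_length:
        if not any(sequence in longer for longer in kept.values()):
            kept[header] = sequence
    return kept


# Toy proteoforms of one gene, invented for illustration only.
proteoforms = {
    "canonical": "MKTAYMAK",
    "n_term_extension": "MAGSMKTAYMAK",
    "n_term_truncation": "MAK",
}
nonredundant = remove_subsequence_redundancy(proteoforms)
```

Under this rule only the N-terminally extended variant survives, so a later merge with a reference database could no longer tell the canonical form apart from its extension. Keeping the redundancy and leaving the grouping to protein inference avoids this.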
For the merge between PROTEOFORMER and UniProt, most identifications are, at first glance, found in the overlap. Nevertheless, the MaxQuant protein inference algorithm allows picking up new proteoforms from the combined search space that could not be explained by searching the UniProt database alone. As expected from the nature of the algorithms, these newly identified proteoforms were added to the database by the classic proteoform calling and PRICE methods and not by SPECtre, whose dependence on reference annotation makes it unsuited for detecting new proteoforms. Further, a higher coverage at the ribosome profiling level for Jurkat data compared with HCT116 did result in more translation product candidates in the total search space, but it did not remarkably increase the number of MS/MS identifications (supplemental Table S2). However, the number of novel proteoforms did roughly double (supplemental Fig. S5: panel S5G versus S5D), so it can be generally concluded that a higher ribosome profiling coverage (and thus a more comprehensive search space) leads to more novel protein variants without increasing the number of canonical protein identifications. Besides that, more sequencing depth enhances the quality and the power of the ribosome profiling analysis (
Next, in Fig. 2D, only 47 proteins out of 4477 (1.05%) are identified because of UniProt solely (SwissProt+TrEMBL). This raises the question of whether reference information is still useful or could eventually be substituted completely by custom search spaces in general proteomics experiments. It is worth discussing, but one should bear in mind that it takes time to generate these custom databases from sequencing information, whereas a reference database can be downloaded directly from its repository. Clearly, this is study-dependent and here, the combination of both a ribosome profiling-based search space and a reference is necessary to separate new proteoforms from known cases. For samples and species with no or insufficient protein reference information, however, this discussion appears in a totally different light. In that case, a custom database generated based on ribosome profiling is of very high value, as this custom database can fulfill the role of the deficient reference.
Further analysis classifies the proteoforms based on the nature of their variation in a semi-automated fashion. As such, different categories of proteoforms are validated following MS/MS. Comparing these classification results between datasets shows that analysis strategies tend to have an influence on the abundance of specific proteoform categories. It is thus important to keep in mind the origin and specificities of all data sources when evaluating the outcome of a proteogenomic experiment.
The different new proteoforms can also be manually and individually checked. At the MS/MS level, examples can be examined in the MaxQuant interface, and at the ribosome profiling level, evidence for these same examples can be loaded into a genome browser of choice using the PROTEOFORMER BedGraph files. As such, proteoforms can be studied and viewed in detail at different layers of evidence. Recently, an even more intuitive way of combining different visual information layers has become available by the definition of two proteogenomic-minded formats: proBAM and proBed (
). With these formats, results of proteomics analyses can be shown on the genomic and transcriptomic level, as the proteomic identifications can now be stored in adapted SAM/BAM or BED files, which are widely used and can be visualized in genome browsers. However, MaxQuant currently outputs a format which is not convertible to proBAM and proBed yet (
The applied approach, in which a custom search space for proteomics is obtained from analyzing matching ribosome profiling with PROTEOFORMER, enables the identification and validation of new proteoforms. At the same time, this opens opportunities for genome re-annotation and the results of this manuscript can be returned to initiatives like Ensembl (
). In that way, the feedback loop will be closed as the initially used reference information of this study can be complemented and adjusted. Further, our approach can be extended to data from other studies. All proteomics data in PRIDE (
) reported that its database now contains 2884 ribosome profiling data sets covering 29 species. This makes it possible to find matching ribosome profiling data for different proteomics samples in PRIDE. As such, an automated proteogenomics pipeline of PROTEOFORMER with subsequent MS/MS searching can even be set up to rescan proteomics data based on custom ribosome profiling-driven information, enabling a large-scale hunt for new proteoforms and genomic re-annotation. An example of this semi-automated setup was run on online RPFdb data of HEK293 cells (SRA project SRP014629). Analysis of matching N-terminomics MS/MS data (PRIDE dataset PXD005583) led to the identification of different classes of proteoforms, but especially uORFs and N-terminal variations, which is expected given the type of proteomic data (supplemental Fig. S10). This example demonstrates the usability of the PROTEOFORMER pipeline and its results for the broader proteomics and genomics communities.
Proof-of-Concept Proteogenomic Experiment Using Prosit and Percolator
) everything is taken together in one statistical framework. It is shown that these extra features allow more stringent filtering without lowering the number of validated PSMs. On the other hand, different strategies are applied in both Prosit (
) to avoid overfitting. Further, no additional overfitting is added by combining these two tools as they function as successive steps. Overall, this first trial case shows that there is a promising advantage of adding these additional features to the PSM validation framework and that this can help the proteoform validation strategy.
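To make the notion of an intensity-based PSM feature concrete, the sketch below computes a normalized spectral angle between predicted and observed fragment intensities. This similarity measure is commonly used for comparing predicted and measured spectra, but whether this exact feature was among those used in this manuscript is an assumption, and the function name and toy intensities are hypothetical.

```python
import math


def normalized_spectral_angle(predicted, observed):
    """Similarity between two fragment intensity vectors, in [0, 1].

    Both vectors must list non-negative intensities for the same fragment
    ions in the same order. 1 means identical relative intensities,
    0 means no shared signal.
    """
    if len(predicted) != len(observed):
        raise ValueError("intensity vectors must have the same length")

    dot = sum(p * o for p, o in zip(predicted, observed))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    norm_o = math.sqrt(sum(o * o for o in observed))
    if norm_p == 0 or norm_o == 0:
        return 0.0  # an empty spectrum carries no evidence

    # Clamp to guard against rounding slightly outside acos's domain.
    cosine = max(-1.0, min(1.0, dot / (norm_p * norm_o)))
    return 1.0 - 2.0 * math.acos(cosine) / math.pi


# Toy intensities, invented for illustration only.
good_match = normalized_spectral_angle([0.9, 0.1, 0.5], [0.8, 0.2, 0.5])
poor_match = normalized_spectral_angle([0.9, 0.1, 0.0], [0.0, 0.1, 0.9])
```

A score of this kind can be appended to the search engine scores of each PSM before rescoring, so that a correct peptide with a well-predicted fragmentation pattern is separated more clearly from a random match.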
The next step for this promising technique is to include the Prosit features in the protein inference algorithm (
) of Percolator to verify whether this results in extra protein identifications compared to a run with only the MaxQuant scores. These extra protein identifications can then lead to additional proteoform validations. Along with Prosit, other tools that include fragment intensities in their search algorithm like MS2PIP (
). Also, once these next-generation MS/MS search engines are tested, they can be encapsulated in wrappers, which would allow the complete PROTEOFORMER pipeline to be run in one go. For the moment, the pipeline can be run continuously up to the FASTA search space, but the MaxQuant search engine still depends on a GUI. As clearly demonstrated by these first results (Fig. 5), we believe these MS/MS intensity-based identification strategies, all based on machine learning, are part of the way forward in proteogenomics, as FDR calculation encounters challenges in this field because of the extended search space size. Because of ribosome profiling, this search space size explosion is somewhat tempered compared with a 3- or 6-frame ORF translation database from RNA-seq (
), but nevertheless, new approaches to lower the FDR will make it possible to work more stringently and validate proteogenomic outcomes with even more confidence.
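The link between score separation and FDR can be illustrated with a basic target-decoy estimate. This is a generic sketch and not the procedure of MaxQuant or Percolator: the function names, the decoys-over-targets estimator and the toy scores are assumptions for illustration.

```python
def fdr_at_threshold(psms, threshold):
    """Estimate the FDR among PSMs scoring at or above a threshold.

    psms: list of (score, is_decoy) tuples.
    The estimate is decoy hits divided by target hits.
    """
    targets = sum(1 for score, is_decoy in psms
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms
                 if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


def score_cutoff_for_fdr(psms, max_fdr=0.01):
    """Return the lowest score threshold whose estimated FDR is <= max_fdr.

    Returns None when no threshold satisfies the requested FDR.
    """
    best = None
    for threshold in sorted({score for score, _ in psms}, reverse=True):
        if fdr_at_threshold(psms, threshold) <= max_fdr:
            best = threshold
    return best


# Toy PSM list, invented for illustration only.
example_psms = [(10, False), (9, False), (8, False),
                (7, True), (6, False), (5, True)]
strict_cutoff = score_cutoff_for_fdr(example_psms, max_fdr=0.01)
lenient_cutoff = score_cutoff_for_fdr(example_psms, max_fdr=0.25)
```

A larger search space tends to produce more high-scoring random matches, which pushes the cutoff upward and costs true identifications. Features that separate correct from incorrect PSMs more sharply counteract this effect.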
We report on a complete makeover of the PROTEOFORMER pipeline, in which the newly implemented features drastically expand its possibilities. Combining different proteoform calling methods makes it possible to optimally expand the search space for MS/MS validation based on ribosome profiling. These efforts show the ability to identify a collection of MS/MS validated new proteoforms, distributed over different possible protein variant types. Moreover, a first step is taken to include MS/MS intensity-based approaches in a proteogenomics setup. Together, these results provide novel insights for ribosome profiling-assisted proteogenomics research.
Raw ribosome profiling reads used in this manuscript can be found in the Gene Expression Omnibus (datasets GSE58207 and GSE74279). More details on these data can be found in the supplemental experimental protocols.
Deep proteome coverage based on ribosome profiling aids mass spectrometry-based protein and peptide discovery and provides evidence of alternative translation products and near-cognate translation initiation events.
Author contributions: S.V. and G.M. designed the research; S.V., E.N., and G.M. implemented new features for the PROTEOFORMER pipeline; S.V. analyzed ribosome profiling and proteomics data; S.G. and M.W. calculated extra PSM features with Prosit; P.V.D. performed proteome analyses; S.V., S.G., M.W., P.V.D., and G.M. wrote the paper; G.M. supervised the research and P.V.D., W.V.C., and B.K. advised on research. All authors read and approved the final manuscript.