If you don't remember your password, you can reset it by entering your email address and clicking the Reset Password button. You will then receive an email that contains a secure link for resetting your password
If the address matches a valid account an email will be sent to __email__ with instructions for resetting your password
N-terminal peptides increase the overall chances to identify novel proteins.
We identify 2896 distinct protein N termini, of which 19 point to novel proteins.
Stringent filtering and curation are required to confidently report novel proteins.
Ribosome profiling has revealed translation outside canonical coding sequences, including translation of short upstream ORFs, long noncoding RNAs, overlapping ORFs, ORFs in UTRs, or ORFs in alternative reading frames. Studies combining mass spectrometry, ribosome profiling, and CRISPR-based screens showed that hundreds of ORFs derived from noncoding transcripts produce (micro)proteins, whereas other studies failed to find evidence for such types of noncanonical translation products. Here, we attempted to discover translation products from noncoding regions by strongly reducing the complexity of the sample prior to mass spectrometric analysis. We used an extended database as the search space and applied stringent filtering of the identified peptides to find evidence for novel translation events. We show that, theoretically our strategy facilitates the detection of translation events of transcripts from noncoding regions but experimentally only find 19 peptides that might originate from such translation events. Finally, Virotrap-based interactome analysis of two N-terminal proteoforms originating from noncoding regions showed the functional potential of these novel proteins.
). Proteoforms can arise from the usage of different promoters during gene transcription and, in eukaryotes, by differences in processing of immature mRNA molecules (alternative splicing). Also translation events (in-frame alternative translation initiation, ribosomal frameshifting, and stop codon read-through), protein modifications including processing by proteases, give rise to proteoforms (
). Ribosome profiling (ribosome sequencing [Ribo-Seq]), RNA-Seq, sequence conservation analysis, bioinformatics prediction tools, and proteogenomics have revealed that many transcripts contain more than one ORF (
). Such ORFs often do not resemble annotated ORFs, as they can be either situated within 5′ or 3′ UTRs or have alternative reading frames that overlap with annotated ORFs. In addition, several ORFs are derived from transcripts that are annotated as noncoding. The latter include, among others, long noncoding RNA (lncRNA), retained introns, and transcribed pseudogenes. ORFs currently nonannotated to code for proteins are often shorter than 100 codons and are therefore also referred to as small ORFs (sORFs) (
Although such unannotated ORFs gained increased attention over the years, their coding potential and possible biological functions remain a matter of debate. Targeted bioinformatics approaches and several ribosomal profiling approaches enabled the prediction, detection, and discovery of thousands of novel ORFs possibly being translated to proteins (
) showed that such technical aspects alone cannot explain this absence of lncRNA-encoded proteins in MS data. This discrepancy between the limited number of detected products from unannotated ORFs in mammalian cells and the large number of unannotated ORFs detected by ribosome profiling and computational methods leaves open the possibility that the protein products of such ORFs are rapidly degraded and therefore not detectable or are not translated as predicted (
) studied micropeptides originating from sORFs on a large scale by combining Ribo-Seq, MS, and CRISPR/Cas-based screens. They detected stable expression, localization, knockout, and rescue effects, as well as protein interactors of the translation products of six lncRNAs and seven uORFs. For example, a human lncRNA RP11-84A1.3 was found to encode a 70 amino acid (AA) long protein that localizes to the plasma membrane and interacts with several cell surface proteins. Another study by Ruiz Cuevas et al. identified 1529 peptide products from noncoding ORFs in B-cell lymphoma cells. Of note, these peptides were found to be associated with the major histocompatibility complex I complex and were only found upon analyzing the immunopeptidome. It was therefore predicted that the proteins from which they originated were more disordered and less stable, leading to their rapid degradation by the proteasome, which is the main source for generating major histocompatibility complex I–associated peptides (
). In general, such studies point to the protein-coding potential and functional importance of unannotated ORFs.
To improve the detection of novel protein products, proteogenomics approaches were developed that combine more comprehensive sequence databases with techniques to enrich small and/or low-abundant proteins in complex samples (
). In this study, we aimed to detect and characterize protein products from annotated noncoding regions/transcripts in human embryonic kidney 293 (HEK293) cells. We created a database containing UniProtKB-SwissProt entries and UniProt isoforms (
). We show, using in silico studies, that the proteome only contains 3.7% of unique peptides originating from noncoding genes. However, when focusing on Nt-peptides, this percentage raises to 25.4%, thus greatly improving the likelihood of detecting protein products from these noncoding genes. To increase proteome coverage, three different proteases to generate Nt-peptides were used in parallel (
Besides reducing proteome complexity, Nt-peptide enrichment offers other major advantages for proteoform discovery. The presence of an initiator methionine (iMet) and the acetylation state of the N terminus allows to verify if a proteoform originates from translation, as opposed to protein processing, and indicates the exact translation initiation site (TIS), allowing the distinction of related proteoforms. Indeed, only nascent proteins start with an iMet (
) that can be cotranslationally removed by methionine aminopeptidases (MetAPs), exposing the second amino acid at the protein’s N terminus, and providing a first level of evidence for a protein’s origin being because of a translation event. MetAPs remove iMet when the side chain of the second amino acid has a small gyration radius (Ala, Cys, Gly, Pro, Ser, Val, or Thr) (
). Next to the peptide sequence, we monitor protein Nt-acetylation, a modification that carries another level of evidence for protein synthesis. Nascent polypeptides can be acetylated by Nt , both on the iMet as well as on newly exposed residues upon iMet removal (
). Such post-translational processing events generate new N termini that are typically not acetylated. COFRADIC (combined fractional diagonal chromatography) allows to distinguish in vivo acetylated Nt-peptides originating from translation events from such nonacetylated neo-N-termini (
) (Fig. 1B). In our study, we thus possess three levels of evidence indicating if an Nt-peptide can be used as a proxy of translation (Fig. 2). Based on these three levels of evidence, we applied a stringent filtering approach on our cytosolic Nt-data to find high-confident peptide-level evidence for the translation of noncoding transcripts (NTRs). We obtained 2896 distinct N termini, with only 19 of them pointing to the translation of NTRs. Our study thus seems to prove that stringent filtering and careful inspection of proteomics data is required when one aims to identify novel proteins.
Custom Protein Database Generation
Ribo-Seq reads were downloaded from European Nucleotide Archive, including the Lee et al. (
) datasets of control (SRR1630828 and SRR1630831) and amino-acid starved (SRR1630830 and SRR1630833) HEK293 cells. Subsequently, Ribo-Seq reads pointing to translation initiation with lactimidomycin (LTM) or translation elongation with cycloheximide (CHX) were subjected to the PROTEOFORMER 2.0 pipeline (
) in a pairwise fashion for the corresponding LTM and CHX experiments. Human genome assembly GRCh38.p13 with Ensembl annotation 98 was used to generate indexes for the splice-aware mapper STAR (version 2.5.4b, https://github.com/alexdobin/STAR) with the following settings: --genomeSAindexNbases 14 --sjdbOverhang 35. Common contaminating sequences were retrieved: PhiX bacteriophage genome (NC_001422.1) and human rRNA sequences (search term: ‘(biomol_rrna) AND "Homo sapiens" [porgn:_txid9606]’) were obtained from the National Center for Biotechnology Information, human small nuclear RNA and small nucleolar RNA sequences were obtained from Biomart (Ensembl, version 98; using gene type filter, http://sep2019.archive.ensembl.org/biomart/martview/), whereas human tRNA sequences (v. hg38) were downloaded from gtrnadb.ucsc.edu. The mapping.pl suite of PROTEOFORMER 2.0 allowed integrating several subsequent steps of data processing. Read quality filtering and trimming by FASTX-Toolkit (version 0.0.14, http://hannonlab.cshl.edu/fastx_toolkit/) was performed for the Lee et al. dataset to remove polyA adaptors (48x A), whereas no adaptor removal was necessary for the Gao et al. dataset. STAR was subsequently used to filter out reads mapping to PhiX, rRNA, small nucleolar RNA, and tRNA, before the remaining reads were mapped nonuniquely to the human genome with the following settings: --readlength 36 --unique N --rpf_split Y --suite plastid. Finally, plastid (https://plastid.readthedocs.io/) was used to determine P-site offsets. Rule-based transcript calling was performed using Ensembl release 98, resulting in the elimination of transcripts without any reads and classification of other transcripts based on exon coverage. TIS calling was performed in a rule-based manner, jointly using the results of the corresponding LTM and CHX experiments. SNP calling was omitted. Near-cognate start sites were decoded to methionine, and known selenocysteines were included. Custom identifiers of proteoforms were composed of Ensembl transcript ID, TIS genomic position, and TIS annotation. The following TIS annotations were distinguished: aTIS (proteoform TIS corresponding to the Ensembl annotated TIS); CDS (proteoform TIS falls within the Ensembl annotated coding sequence [CDS]); 5UTR (proteoform TIS is located in the 5′UTR); ntr (proteoform TIS is located on an NTR based on Ensembl biotype, such as pseudogene) and 3UTR (proteoform TIS is located in the 3′UTR). A FASTA file of candidate proteoforms was generated for each Ribo-Seq experiment, without removing subsequent redundancy (--mflag 5), retaining all Ensembl aTIS that did not pass the TIS calling algorithm (--tis_call Y). FASTA files from the Lee et al. and Gao et al. datasets were subsequently combined (combine_dbs.py) and merged with the UniProt FASTA file (human canonical proteins and isoforms, version 2019_04; 42,425 entries) using the combine_with_uniprot.py module. The origin of the proteoforms is clear from the structure of the FASTA headers: redundant sequences are reduced to a single entry with one main ID, and all other IDs are kept in the description line, between square brackets. The UniProt ID is preferentially selected as the main ID, and otherwise, the main ID is selected based on the following order of importance: aTIS, CDS, 5UTR, ntr, and 3UTR. The last six characters of the main ID are reserved for bincodes denoting the Ribo-Seq dataset of origin, with the following order: (1) Lee et al. dataset, (2) Gao et al. dataset of normal conditions, and (3) Gao et al. dataset of starved cells. The combined FASTA file (containing 103,020 entries) was used as a custom protein database for identifying MS/MS spectra.
UniProt and NTR proteoform sequences were selected from the custom database using FASTA headers and the R package Biostring (Bioconductor, https://bioconductor.org/packages/Biostrings/). NTR entries matching completely any proteoform of another category (UniProt, Ensembl aTIS, 5UTR, CDS, and 3UTR) were removed. In silico protein digestion was performed using the R package cleaver (Bioconductor, https://bioconductor.org/packages/release/bioc/html/cleaver.html) with the following enzyme settings: ArgC specificity “R,” up to two missed cleavages (MCs); chymotrypsin specificity “[FLMWY],” up to 2 MC; V8/GluC specificity c(“[DE]", "[DE](?=P)"), up to 4 MC. Peptide mass and charge at a pH of 2 were calculated using the R package Peptides (CRAN, https://cran.r-project.org/web/packages/Peptides/index.html). In silico–generated peptides longer than six amino acids, with a maximal charge of +4 and an m/z ≤ 1500 Th, were considered to be MS-detectable peptides. Nt-peptides starting at position 1 were retrieved from the complete pool of peptides using the R cleaver and IRanges packages (Bioconductor, https://bioconductor.org/packages/release/bioc/html/IRanges.html). Methionine cleavage was considered and processed Nt-peptides starting at the second position in the protein sequence were in addition generated if the iMet was followed by A, S, G, P, T, or V. Nt-peptides were also subjected to the MS suitability criteria indicated previously. For the uniqueness analysis, the R package stringr (CRAN, https://cran.r-project.org/web/packages/stringr/index.html) was used, and leucine residues were considered indistinguishable from isoleucine. MS-detectable peptides were subjected to the DeepMSPeptide Convolutional Neural Network detectability prediction tool using the Python Tensorflow package, version 1.13.1 (https://www.tensorflow.org/). Data concerning MS detectability and uniqueness were grouped by protein using the R package dplyr (CRAN, https://cran.r-project.org/web/packages/dplyr/index.html). Biotype information was taken from Ensembl Biomart, version 98. R packages ggplot2, RColorBrewer, GeomSplitViolin, reshape2, scales, ggExtra, ggsci, and GGally (all packages listed here can be found on CRAN, https://cran.r-project.org/) were used for plotting.
HEK293T cells were cultured at 37 °C and 8% CO2 in Dulbecco's modified Eagle's medium (DMEM), supplemented with 10% fetal bovine serum, 25 units/ml penicillin, and 25 μg/ml streptomycin.
Cytosolic extracts were prepared from 2.5 × 107 HEK293T cells similar as described (
). The cell pellets were washed with ice-cold Dulbecco's PBS (Thermo Fisher Scientific; catalog no.: 14190250) and resuspended in 1.25 ml of cell-free systems buffer (220 mM mannitol, 170 mM sucrose, 5 mM NaCl, 5 mM MgCl2, 10 mM Hepes [pH 7.5], 2.5 mM KH2PO4, 0.02% digitonin, and cOmplete Protease Inhibitor Cocktail [Roche; catalog no.: 4693132001]) and kept on ice for 2 min. Lysates were cleared by centrifugation at 14,000g for 15 min at 4 °C. The supernatants, being the cytosolic extracts, were collected. The remaining pellets and a HEK293T total lysate serving as control were resuspended in 1 ml of radioimmunoprecipitation buffer (50 mM Tris–HCl [pH 7.4], 150 mM NaCl, 1% NP-40, 0.5% sodium deoxycholate, 0.1% SDS, and cOmplete Protease Inhibitor Cocktail) followed by three freeze–thaw cycles. Lysates were cleared by centrifugation at 16,000g for 15 min at room temperature. About 200 μl of each sample was used for Western blot analysis, whereas the rest of the sample was used for Nt-peptide enrichment by COFRADIC.
Western Blot Analysis
Proteins were denatured in XT sample buffer (Bio-Rad; catalog no.: 1610791) and XT reducing agent (Bio-Rad; catalog no.: 1610792), heated at 99 °C for 10 min and centrifuged for 5 min at 16,000g (room temperature). About 25 μg of each protein mixture was separated by SDS-PAGE on a Criterion XT 4 to 12% Bis–Tris gel (Bio-Rad Laboratories; catalog no.: 3450124). Proteins were transferred to a polyvinylidene difluoride membrane (Merck Millipore; catalog no.: IPFL00010) after which the membrane was blocked using Odyssey Blocking buffer (PBS) (LI-COR; catalog no.: 927-4000) diluted once with TBS-T (TBS supplemented with 0.1% Tween-20). Immunoblots were incubated overnight with primary antibodies against GAPDH (Abcam; catalog no.: ab8245), HSP60 (Santa Cruz; catalog no.: sc-13115), lamin B (Santa Cruz; catalog no.: sc-374015), ribophorin I (Santa Cruz; catalog no.: sc-12164), and γ-tubulin (Thermo Fisher Scientific; catalog no.: MA1-850) in Odyssey Blocking buffer (PBS) diluted once with TBS-T. Blots were washed four times with TBS-T, incubated with fluorescent-labeled secondary antibodies (IRDye 800CW Goat Antimouse IgG polyclonal 0.5 mg from LI-COR; catalog no.: 926-32210) and IRDye 800CW Donkey antigoat IgG polyclonal 0.5 mg from LI-COR (catalog no.: 926-32214) in Odyssey blocking buffer diluted once with TBS-T for 1 h. After three washes with TBS-T and an additional wash in TBS, immunoblots were imaged using the Odyssey infrared imaging system (LI-COR).
Nt-peptides were enriched by COFRADIC as described previously (
), however, without the pyroglutamate removal and SCX steps. In the following, we only mention the main differences with the published protocol. In brief, 1 mg of cytosolic proteins was used. As digitonin was used for cytosolic extraction in combination with the radioimmunoprecipitation lysis buffer, which interfere with LC–MS/MS analysis, the samples were cleaned up using Pierce Detergent removal spin columns (Thermo Fisher Scientific; catalog no.: 87777) according to the manufacturer’s instructions. Then, guanidinium hydrochloride was added to a final concentration (f.c.) of 4 M before proteins were reduced (with 15 mM f.c. Tris(2-carboxyethyl)phosphine) and alkylated (with 30 mM f.c. iodoacetamide) for 15 min at 37 °C. To enable the assignment of in vivo Nt-acetylation events, all primary protein amines were blocked using stable isotope–encoded acetate, that is, an N-hydroxysuccinimide (NHS)–ester of 13C1D3-acetate. This acetylation reaction was allowed to proceed for 1 h at 30 °C and was repeated once. Prior to digestion, the samples were desalted on a NAP-10 column in 50 mM freshly prepared ammonium bicarbonate (pH 7.8). Samples were digested either with trypsin in a trypsin/protein ratio of 1/50 (w/w) (Promega; catalog no.: V5111) and incubated overnight at 37 °C, chymotrypsin in a chymotrypsin/protein ratio of 1/20 (w/w) (Promega; catalog no.: V1061) and incubated overnight at 25 °C, or endoproteinase GluC in a GluC/protein ratio of 1/20 (w/w) (Promega; catalog no.: V1651) and incubated overnight at 37 °C. After vacuum drying, the samples were redissolved in 80 μl loading solvent A (2% acetonitrile [can] and 0.1% TFA in double-distilled water) before isolating Nt-peptides by two subsequent reversed-phase HPLC fractionations with a 2,4,6-trinitrobenzenesulfonic acid reaction in between.
LC–MS/MS Analysis of Nt-Peptides and Peptide Identification
LC–MS/MS analysis was similar as reported before (
). Each COFRADIC fraction was solubilized in 20 μl loading solvent A, and half of each fraction was injected for LC–MS/MS analysis on an Ultimate 3000 RSLCnano system in-line connected to an Orbitrap Fusion Lumos mass spectrometer (Thermo Fisher Scientific). Trapping was performed at 10 μl/min for 4 min in loading solvent A on a 20 mm trapping column (made in-house, 100 μm internal diameter [I.D.], 5 μm beads, C18 Reprosil-HD; Dr Maisch). The peptides were separated on a 200 cm μPAC column (C18-endcapped functionality, 300 μm wide channels, 5 μm porous-shell pillars, interpillar distance of 2.5 μm, and a depth of 20 μm; PharmaFluidics). The column was kept at a constant temperature of 50 °C. Peptides were eluted by a linear gradient reaching 33% MS solvent B (0.1% formic acid [FA] in water/ACN [2:8, v/v]) after 42 min, 55% MS solvent B after 58 min, and 99% MS solvent B at 60 min, followed by a 10-min wash at 99% MS solvent B and re-equilibration with MS solvent A (0.1% FA in water). The first 15 min, the flow rate was set to 750 nl/min, after which it was kept constant at 300 nl/min.
The mass spectrometer was operated in data-dependent mode, automatically switching between MS and MS/MS acquisition. Full-scan MS spectra (300–1500 m/z) were acquired in 3 s acquisition cycles at a resolution of 120,000 in the Orbitrap analyzer after accumulation to a target automatic gain control (AGC) value of 200,000 with a maximum injection time of 250 ms. The precursor ions were filtered for charge states (2–7 required), dynamic range (60 s; ±10 ppm window), and intensity (minimal intensity of 5E3). The precursor ions were selected in the ion routing multipole with an isolation window of 1.6 Da and accumulated to an AGC target of 10E3 or a maximum injection time of 40 ms and activated using collision-induced dissociation fragmentation (35% normalized collision energy [NCE]). The fragments were analyzed in the Ion Trap Analyzer at rapid scan rate.
Mascot Generic Files were created from the MS/MS data in each LC run using the Mascot Distiller software (version 184.108.40.206; Matrix Science). To generate these MS/MS peak lists, grouping of spectra was allowed with a maximum intermediate retention time of 30 s and a maximum intermediate scan count of 5. Grouping was done with a 0.005 Da precursor tolerance. A peak list was only generated when the MS/MS spectrum contained more than 10 peaks. There was no deisotoping, and the relative signal-to-noise limit was set at 2. The generated MS/MS peak lists were searched with Mascot using the Mascot Daemon interface (version 2.6.0; Matrix Science). MS data were matched against our custom-build database (containing UniProt, UniProt isoform entries appended with Ribo-Seq–derived protein sequences). The Mascot search parameters were as follows: heavy acetylation of lysine side chains (with 13C1D3-acetate), carbamidomethylation of cysteine and methionine oxidation to methionine sulfoxide were set as fixed modifications. Variable modifications were acetylation of N termini (both light and heavy because of the 13C1D3 label) and pyroglutamate formation of Nt-glutamine (both at the peptide level). The enzyme settings were: endoproteinase semi-Arg-C/P (semi-Arg-C specificity with Arg-Pro cleavage allowed) allowing for two MCs for the trypsin sample. For chymotrypsin and GluC, the enzyme settings were semi-Chymo and semi-GluC. For GluC, two MCs were allowed, whereas for chymotrypsin, four MCs were allowed. Mass tolerance was set to 10 ppm on the precursor ion and to 0.5 Da on fragment ions. In addition, the C13 setting of Mascot was set to 1. Peptide charge was set to 1+, 2+, and 3+, and instrument setting was put to electrospray ionization trap MS. Raw DAT-result files of MASCOT were further queried using ms_lims (
). Only peptides that were ranked first and scored above the threshold score set at 99% confidence were withheld. The false discovery rate (FDR) was estimated by searching a decoy database (a reversed version of the custom-generated database), which resulted in an FDR of 0.44% for the trypsin sample, 0.14% for the chymotrypsin sample, and 0.53% for the GluC sample at the peptide level. At protein level, this resulted in an FDR of 1.73% for the trypsin sample, 0.58% for the chymotrypsin sample, and 2.10% for the GluC sample
Selection of N Termini
From this dataset, Nt-peptides were selected and classified. The selection workflow was built in KNIME (see https://www.knime.com/). Selection was done per protease, and all identified peptides (cotranslationally acetylated, heavy acetylated [blocked] peptides, and N-terminally free peptides) were used as input. Peptides were grouped based on sequence and accession to get a list of distinct (unique) identified peptides. Information on multiple identifications of a given peptide was retained and, if possible, used to calculate an acetylation percentage. Internal (solely found as free NH2-starting peptide) and C-terminal peptides were removed. The remaining potential Nt-peptides were classified. High confident TIS/N termini encompass (1) all (partially) cotranslationally (in vivo acetylated) N termini and blocked (in vitro heavy acetylated) N termini, of which the start position corresponded with a UniProt, UniProt isoform, or Ensembl (Ribo-Seq) aTIS site; (2) cotranslationally acetylated peptides with a start position higher than two, and for which the iMet is retained or removed; (3) N termini matching TIS identified by ribosome profiling (either cotranslationally acetylated or heavy acetylated [blocked]). Low-confident TIS/N termini encompass: (1) cotranslationally acetylated peptides with a start position beyond position 2 that neither start nor are preceded by a Met, with no extra Ribo-Seq evidence and that are not preceded by a cleavage site recognized by the proteases used; (2) heavy acetylated (blocked) peptides with a start position higher than 2 that start with or are preceded by a Met (according to the iMet processing rules), with no extra Ribo-Seq evidence and that are not preceded by a proteolytic cleavage site. In a final step, the data from the three different proteases were merged to create a final list of distinct N termini.
Two peptides (ADDAGAAGGPGGPGGPEMGNRGGFRGGF and MDGEEKTCGGCEGPDAMYVKLISSDGHEFIVKR) were made in-house, whereas all other peptides were obtained from Thermo Fisher Scientific (standard peptide custom synthesis service). In-house peptide synthesis was done using Fmoc chemistry on an Applied Biosystems 433A Peptide Synthesizer. All required modifications, besides heavy acetylation of primary amines, were introduced during peptide synthesis. Primary amines were blocked after peptide synthesis by adding a 150 times molar excess of an NHS–ester of 13C1D3-acetate, and peptides were incubated for 1 h at 37 °C. This step was repeated once, after which the remainder of the NHS–ester was quenched by adding glycine to an f.c. of 30 mM and incubating the peptides for 10 min at room temperature. O-acetylation was reversed by adding hydroxylamine (75 mM f.c.) followed by an incubation for 10 min at room temperature. Next, peptides were purified on OMIX C18 Tips (Agilent), which were first washed with prewash buffer (0.1% TFA in water/ACN [20:80, v/v]) and pre-equilibrated with 0.1% TFA before sample loading. Tips were then washed with 0.1% TFA, and peptides were eluted with 0.1% TFA in water/ACN (40:60, v/v). Purified peptides were mixed and diluted to an f.c. of 100 fmol/μl (of each peptide).
About 1 pmol of the acetylated synthetic peptides was injected for LC–MS/MS analysis on an Ultimate 3000 RSLCnano system in-line connected to an Orbitrap Fusion Lumos mass spectrometer. Trapping was performed at 10 μl/min for 4 min in loading solvent A on a 20 mm trapping column (made in-house, 100 μm I.D., 5 μm beads, C18 Reprosil-HD). The peptides were separated on a 200 cm μPAC column (C18-endcapped functionality, 300 μm wide channels, 5 μm porous-shell pillars, interpillar distance of 2.5 μm, and a depth of 20 μm). The column was kept at a constant temperature of 50 °C. Peptides were eluted by a linear gradient reaching 26.4% MS solvent B after 20 min, 44% MS solvent B after 25 min, and 56% MS solvent B at 28 min, followed by a 5-min wash at 56% MS solvent B and re-equilibration with MS solvent A. The first 15 min, the flow rate was set to 750 nl/min, after which it was kept constant at 300 nl/min. The mass spectrometer was operated in data-dependent mode, automatically switching between MS and MS/MS acquisition with the m/z values of the precursors of the synthetic peptides as an inclusion list. Full-scan MS spectra (300–1500 m/z) were acquired in 3 s acquisition cycles at a resolution of 120,000 in the Orbitrap analyzer after accumulation to a target AGC value of 200,000 with a maximum injection time of 30 ms. The precursor ions not present in the inclusion list were filtered for charge states (2–7 required) and intensity (minimal intensity of 5E3). The precursor ions were selected in the ion routing multipole with an isolation window of 1.6 Da and accumulated to an AGC target of 10E3 or a maximum injection time of 40 ms and activated using collision-induced dissociation fragmentation (35% NCE). The fragments were analyzed in the Ion Trap Analyzer at rapid scan rate.
), Skyline-Daily V220.127.116.116), was used to compare the ranking of the fragment ions between the synthetic peptides and the possible NTR peptides. For each synthetic peptide, the top 10 most abundant fragment ions of the synthetic peptides were selected to perform the comparison. This ranking was compared with the ranking of the spectra of the same peptide (with the highest score) identified in our own Nt-COFRADIC dataset. The previously identified Nt-peptide (in our COFRADIC experiment) was considered matching to a synthetic peptide if the ranking of the fragment ions was in line with the ranking of the fragment ions of the synthetic peptide.
Generation of the NTR Clones
Gag-bait fusion constructs were generated as described (
). The CDSs for the full length and the proteoform of the selected gene (Ensembl accession: ACTBP8; ENSG00000220267) were ordered from IDT (gBlocks gene fragments) and transferred into the pMET7-GAG-sp1-RAS plasmid by classic cloning with restriction enzymes. The pMD2.G (expressing vesicular stomatitis virus G [VSV-G protein), pcDNA3-FLAG-VSV-G plasmids (available at Addgene #12259 and #80606), and the GAG-eDHFR (Escherichia coli dihydrofolate reductase) vector (serving as a control) were a gift from Sven Eyckerman (VIB-UGent Center for Medical Biotechnology).
Protein Complex Purification by Virotrap
For full details on the Virotrap protocol, we refer to Ref. (
). HEK293T cells were kept at low passage (<10) and cultured at 37 °C and 8% CO2 in DMEM, supplemented with 10% fetal bovine serum, 25 units/ml penicillin, and 25 μg/ml streptomycin. Each construct was analyzed in triplicate, and for every replicate, the day prior to transfection, a 75 cm2 falcon was seeded with 9 × 106 cells. Cells were transfected using polyethylenemine, with a DNA mixture containing 6.43 μg of bait plasmid (pMET7-GAG-bait), 0.71 μg of pcDNA3-FLAG-VSV-G plasmid, and 0.36 μg of pMD2.G plasmid. For the eDHFR control, cells were transfected with a DNA mixture containing 3.75 μg of eDHFR plasmid (pMET7-GAG-eDHFR), 2.68 μg of pSVsport plasmid, 0.71 μg of pcDNA3-FLAG-VSV-G plasmid, and 0.36 μg of pMD2.G plasmid. The medium was refreshed after 6 h with 8 ml of supplemented DMEM.
The cellular supernatant was harvested after 46 h and centrifuged for 3 min at 1250g to remove debris. The cleared supernatant was then filtered using 0.45 μm filters (Merck Millipore; catalog no.: SLHV033RB). For every sample, 20 μl MyOne Streptavidin T1 beads in suspension (10 mg/ml; Thermo Fisher Scientific; catalog no.: 65601) were first washed with 300 μl wash buffer containing 20 mM Tris–HCl (pH 7.5) and 150 mM NaCl and subsequently preloaded with 2 μl biotinylated anti-FLAG antibody (BioM2; Sigma; catalog no.: F9291). This was done in 500 μl wash buffer, and the mixture was incubated for 10 min at room temperature. Beads were added to the samples, and the viral-like particles (VLPs) were allowed to bind for 2 h at room temperature by end-over-end rotation. Bead–particle complexes were washed once with 200 μl washing buffer (20 mM Tris–HCl [pH 7.5] and 150 mM NaCl) and subsequently eluted with FLAG peptide (30 min at 37 °C; 200 μg/ml in washing buffer; Sigma; catalog no.: F3290) and lysed by addition of Amphipol A8–35 (Anatrace; catalog no.: A835) (
) to an f.c. of 1 mg/ml. After 10 min, the lysates were acidified (pH <3) by adding 2.5% FA. Samples were centrifuged for 10 min at >20,000g to pellet the protein/Amphipol A8–35 complexes. The supernatant was removed, and the pellet was resuspended in 20 μl 50 mM fresh triethylammonium bicarbonate. Proteins were heated at 95 °C for 5 min, cooled on ice to room temperature for 5 min, and digested overnight at 37 °C with 0.5 μg of sequencing-grade trypsin (Promega; catalog no.: V5111). Peptide mixtures were acidified to pH 3 with 1.5 μl 5% FA. Samples were centrifuged for 10 min at 20,000g. About 7.5 μl of the supernatant was injected for LC–MS/MS on an Ultimate 3000 RSLCnano system in-line connected to a Q Exactive HF Biopharma mass spectrometer (Thermo Fisher Scientific). Trapping was performed at 10 μl/min for 4 min in loading solvent A on a 20 mm trapping column (made in-house, 100 μm I.D., 5 μm beads, C18 Reprosil-HD). The peptides were separated on a 250 mm Waters nanoEase M/Z HSS T3 Column, 100 Å, 1.8 μm, 75 μm inner diameter (Waters Corporation) kept at a constant temperature of 50 °C. Peptides were eluted by a nonlinear gradient starting at 1% MS solvent B reaching 55% MS solvent B in 80 min, 97% MS solvent B in 90 min, followed by a 5-min wash at 97% MS solvent B and re-equilibration with MS solvent A. The mass spectrometer was operated in data-dependent mode, automatically switching between MS and MS/MS acquisition for the 12 most abundant ion peaks per MS spectrum. Full-scan MS spectra (375–1500 m/z) were acquired at a resolution of 60,000 in the Orbitrap analyzer after accumulation to a target value of 3,000,000. The 12 most intense ions above a threshold value of 13,000 were isolated with a width of 1.5 m/z for fragmentation at an NCE of 30% after filling the trap at a target value of 100,000 for maximum 80 ms. MS/MS spectra (200–2000 m/z) were acquired at a resolution of 15,000 in the Orbitrap analyzer.
The generated MS/MS spectra were processed with MaxQuant (version 18.104.22.168, https://www.maxquant.org/maxquant/) using the Andromeda search engine with default search settings, including an FDR set at 1% on both the peptide and protein levels. The sequences of the human proteins in the Swiss-Prot database (released January 2021 [20,394 entries]; complemented with a database containing 11 entries, which are relevant proteins expressed during the Virotrap protocol such as GAG, VSV-G, eDHFR, and the NTR [both full length and proteoform] sequences) were used as the search space. The enzyme specificity was set at trypsin/P, allowing for two MCs. Variable modifications were set to oxidation of methionine residues and Nt-protein acetylation; there were no fixed modifications set. Standard settings were used. In the settings of advanced identification, match between runs was implemented (with standard settings).The resulting peptide and protein identifications can be found in supplemental Table S1 (Sheet 1 contains the MaxQuant-generated ProteinGroups.txt, and sheet 2 contains the Peptide.txt file). Only proteins with at least one unique or razor peptide were retained, leading to the identification of 1569 proteins across all samples. Reverse proteins, proteins that are only identified by site, and potential contaminants were removed. Differential analysis of the Virotrap data was conducted using the limma R package (version 3.48.0, Bioconductor, https://bioconductor.org/packages/limma/) (
For the N-terminomics part, three samples were processed in parallel. Each sample was digested with a different digestion enzyme (either trypsin, chymotrypsin, or endoproteinase GluC) to increase the depth of analysis (
) before Nt-peptide enrichment by COFRADIC. We choose using an approach of combining data of three different digestion enzymes instead of replicates allowing us to detect new N termini while still allowing confirmation and check of reproducibility between the samples. During COFRADIC, samples are fractionated by HPLC and pooled to 36 fractions per sample, making a total of 108 LC–MS/MS samples. All MS/MS data were searched using a custom-build database allowing the detection of alternative TISs, with the threshold score set at 99% confidence. The FDR was estimated by a decoy database search resulting in an FDR of <0.6% at peptide level and an FDR of <2.2% at protein level. These data were used as input for a thorough selection workflow only retaining confident Nt-peptides. This confidence is based on three forms of translational evidence, as shown in Figure 2, Nt-acetylation, the presence or processing of iMet, and evidence of translation by Ribo-Seq. The remaining NTR peptides were further curated by BlastP, manual inspection of their MS spectra (by comparison with the spectra of synthetic peptides), and inspection of sequencing data. One protein was finally selected for interactome analysis. For this study, we used eDHFR as control and used both the full-length protein as well as a truncated Nt-proteoform to study the interactome of this novel protein. Here, the samples were prepared in triplicates and run in LC–MS/MS in a fully randomized order. In total, we thus have nine samples, three control samples (eDHFR-GAG as control bait), three samples of the full-length NTR protein, and three samples of the proteoform of the NTR protein. During the search, an FDR of 1% was used at peptide and protein levels. Proteins quantified with iBAQ values in all replicates of at least one condition were retained. To find the protein’s interaction partners, a pairwise contrast of interest between differentially treated samples was retrieved at a significance level of alpha 0.01, corresponding to Benjamini–Hochberg adjusted p value (FDR) cutoff.
A Comprehensive Sequence Database for Identifying Novel Proteoforms
A comprehensive database of known and putative protein sequences is essential for the MS-based identification of (novel) proteoforms. We used publicly available ribosome profiling (Ribo-Seq) datasets from HEK293 cells (
). The resulting (putative) protein sequences were stored in a database that was supplemented with annotated proteoforms from Ensembl and UniProt, both canonical sequences (a curated selection including one protein per gene), and sequences of annotated isoforms. Our final database contained 103,020 nonredundant protein sequences (Fig. 3A), including 60,043 (58.4%) annotated sequences, 25,564 (24.8%) new sequences of proteoforms in known translated transcripts, and 16,919 (16.4%) putative proteoforms originating from NTRs. We classified novel predicted proteoforms in protein-coding transcripts according to the position of the TIS, and most were found in 5′ UTRs (21,008 entries), followed by TIS within CDSs (3555 entries) and those in 3′ UTRs (1001 entries). NTRs were classified according to the Ensembl biotype and mostly originated from processed pseudogenes (pseudogenes generated through a genome insertion of reverse-transcribed mRNA, possibly with evidence of locus-specific transcription; 7519 entries), transcripts with retained intron (5887 entries), and lncRNAs (3192 entries; Fig. 3A), with only 321 NTRs belonging to other biotypes.
Assessing the MS Detectability of Proteins
We hypothesized that experimental procedures could be optimized to improve the chances of detecting NTR proteins. Therefore, we calculated and compared NTR and UniProt protein sequence features, such as length, number of protease cleavage sites, MS-identifiable and unique peptides.
NTR proteins were found to be significantly shorter than UniProt proteins (median protein lengths of 40 and 414, respectively; Wilcoxon test p value <2.2e-16, Fig. 3B), with >80% of NTR proteins derived from sORFs (less than 100 amino acids; Fig. 3C).
To predict the impact of the protease used for protein digestion and peptide enrichment strategies on protein sequence coverage and NTR protein identification, we performed in silico digestion on NTR and UniProt proteins with three different enzymes: endoproteinase ArgC, chymotrypsin, and endoproteinase GluC. Note that because lysine side chains are acetylated in our set-up, trypsin (that will be used for enriching Nt-peptides [see later]) will only cleave C-terminal to arginine, explaining why we studied the effect of endoproteinase ArgC. We considered both shotgun proteomics and N-terminomics approaches and only peptides longer than six amino acids, with a maximal charge of 4+ (at a pH of 2), and a m/z ratio ≤1500 Th were considered to be MS identifiable. These parameters correspond to the database search outcomes typically obtained in our experiments (see later). The majority of the predicted peptides were found to be MS identifiable, except for chymotrypsin-generated Nt-peptides, which were predicted to have an average peptide length of only eight amino acids for NTR proteins and nine amino acids for UniProt proteins. Furthermore, the NTR and UniProt proteins produced comparable fractions of MS-identifiable peptides for all conditions considered (Fig. 3D). However, in contrast to UniProt peptides, NTR peptides were rarely unique (Fig. 3E), meaning that they matched to more than one protein sequence among all NTR protein sequences. This effect was less pronounced for NTR Nt-peptides (Fig. 3E).
We next analyzed a theoretical proteome composed of both UniProt and NTR proteoforms and found that upon digestion, the corresponding peptide mixture is dominated by UniProt peptides and leaves only marginal opportunity to identify unique NTR peptides (1.4–3.7%; Fig. 4A). Note that we do not consider any additional possible disadvantages as for instance caused by lower expression of NTR proteins (
), so the actual detectability of NTR proteins is likely even overestimated, but enrichment of Nt-peptides seems to increase the likelihood of identifying NTR proteins (Fig. 4B). Unique peptides offer the strongest evidence of expression of a given proteoform. However, not every proteoform gives rise to unique peptides, inevitably leading to reduced proteome coverage. In contrast to 98% of UniProt proteins that produce at least one unique peptide, up to a third of NTR proteins cannot be uniquely identified (Fig. 4C). Enrichment of Nt-peptides seems to have an additional negative impact on the number of unique protein identifications, in both UniProt and NTR categories (Fig. 4D), because of the lack of unique and MS-identifiable N termini.
To conclude this part, it is predicted that by enriching for Nt-peptides, one is more likely to identify NTR proteins, albeit at a potential cost of fewer UniProt annotated proteins.
Next, we evaluated if combining the results after digesting proteomes with different proteases would lead to an overall higher coverage of proteomes. We investigated the proteases mentioned previously and found that Nt-peptides enriched after ArgC digestion should capture 89.1% and 95.4% of UniProt and NTR proteins, respectively, and both fractions further increase when also including the chymotrypsin and GluC digestion results (supplemental Fig. S1). When considering all MS-identifiable peptides, ArgC provides almost full coverage of UniProt proteins; 99.6% of the proteins generate at least one MS-identifiable peptide. In addition, this coverage is only increased to a limited extent when including the other two proteases. A somewhat less complete ArgC coverage of NTR proteins (97.4%) in the shotgun proteomics setup can be remedied using complementary digestion strategies.
Extracting Cytosolic Proteins
Since decreasing the proteome complexity increases the possibility of obtaining protein evidence from NTR proteins (
), and perform a highly stringent downstream data analysis. Virotrap (see later), a technology that favors cytosolic proteins as baits, would then be suited to identify potential protein interactors of newly discovered proteins. However, Virotrap is currently limited to HEK293T cells, which explains the selection of these cells.
Cytosolic proteins were enriched after permeabilization of the plasma membrane with 0.02% digitonin, leaving the organellar membranes intact (
). The efficiency of the cytosol isolation was evaluated by Western blot analysis of the cytosolic fraction, the remaining pellet (containing the organelles), and a total cell lysate (as control), using antibodies against several organelle markers (endoplasmic reticulum, cytosol, mitochondrion, cytoskeleton, and nucleus) (supplemental Fig. S2). Organelle markers were absent or strongly depleted in the cytosolic fraction, whilst the cytosolic marker was enriched in the cytosolic fraction and depleted in the organelle fraction, indicative of an efficient isolation of cytosolic proteins.
Following COFRADIC, Nt-peptides were analyzed by LC–MS/MS. We also evaluated the quality of cytosol isolation by evaluating Gene Ontology Cellular Component terms associated with the identified proteins for Gene Ontology terms containing “cytosol” (GO: 0005829) (Table 1). Among the three samples, a comparable fraction of cytosolic proteins (around 60%) was found. This number is higher than expected when analyzing a total lysate (Fisher’s exact test, p < 1e-5), given that The Human Protein Atlas (Cell Atlas) reports that 4740 proteins (24% of the human proteome) localizes to the cytosol (
Increasing Proteome Coverage by Using Three Proteases and Nt-Peptide Enrichment
Prior to digestion, primary amines in the cytosolic proteins were acetylated with an acetyl group carrying stable heavy isotopes, thus to distinguish in vivo N-terminally acetylated (Nt-acetylated) from in vivo free N termini, and both can serve as a proxy for translation initiation events (
). As mentioned, three different proteases, trypsin (with ArgC specificity), chymotrypsin, and endoproteinase GluC, were used in parallel. In the generated peptide mixtures, Nt-peptides are thus N-terminally acetylated (whereas internal peptides are not) and enriched by COFRADIC prior to LC–MS/MS analysis. In vivo acetylated peptides will be further referred to as cotranslationally modified peptides, whereas in vitro acetylated peptides will be referred to as N-terminally blocked peptides (Fig. 1B). The LC–MS/MS data were searched in the aforedescribed database, leading to the identification of 10,147, 6796, and 5373 unique peptide sequences in the samples digested by trypsin, chymotrypsin, and endoproteinase GluC, respectively.
To evaluate the identification gain from using three proteases (supplemental Fig. S1), we calculated the overlap of the identified unique Nt-peptides and proteins when using each protease. Here, we need to consider that different proteases may generate Nt-peptides with identical start positions yet with different end positions. Therefore, we coupled a peptide start position to its protein entry in the database, creating a unique proxy for a protein’s N terminus. However, the identified internal and C-terminal peptides may cause an overestimation of protease-specific start positions. Therefore, we applied a rule-based selection strategy to first remove internal and C-terminal peptides and to withhold a list of confidently identified and distinct N termini per sample. These data were submitted to BioVenn (
) to visualize the overlap (Fig. 5), and a list of peptides and proteins corresponding to each compartment was generated (supplemental Table S2).
In total, 2896 distinct N termini and 2420 distinct proteins were identified upon combining the data from the different protease setups. About half of all proteins were identified in at least two datasets (Fig. 5, A and B), this figure drops to 27% when considering matching peptide start sites (Fig. 5, C and D). This drop is explained by different Nt-peptides generated from different Nt-proteoforms that have to the same protein database entry. Trypsin seems to account for the largest part of identified proteins and Nt-peptides (with 38.9 and 42.1% of unique identifications, respectively), with chymotrypsin and GluC contributing additional unique sets of proteins (10.5 and 16.9%, respectively) and Nt-peptides (12 and 18%, respectively). Thus, the proteome coverage is indeed increased by using different proteases as each contributes with its own unique set of identified peptides, even exceeding our theoretical predictions (supplemental Fig. S1).
We also evaluated how efficient COFRADIC enriched for Nt-peptides (Table 2). Whereas Nt-peptides normally only account for less than 5% (based on shotgun proteomics data) of all peptides, we found that, following enrichment, 59.2% of the peptides generated by trypsin are Nt-peptides, which agrees with previous reports (
). COFRADIC was most efficient when using trypsin (73.2% of all peptides were Nt-peptides, when including both protein Nt-peptides and pyroglutamate-starting peptides, which are coenriched by COFRADIC. However, the efficiency of enriching chymotryptic Nt-peptides is much lower (43.6%); one possible explanation for this is due to the fact that this protease recognizes more residues, thus generates a larger pool of (shorter) peptides by which the actual chemical modification step used for sorting in COFRADIC becomes less efficient. In addition, chymotryptic peptides are less basic than tryptic peptides, which might negatively influence their ionization (and thus detection).
Table 2Overview of the numbers of distinct peptide sequences found in the different samples and Nt-enrichment efficiencies
). Identifications were grouped by peptide sequence. When the same peptide sequence was found several times with different Nt-modifications, priority was given to in vivo acetylated peptides. The enrichment efficiency was calculated by taking the sum of all peptides expected to be enriched by COFRADIC (these being Nt-peptides and pyroglutamyl-starting peptides) and dividing this by the total number of identified peptides.
All identified peptides were loaded into a KNIME selection pipeline to select with high stringency Nt-peptides of both known and novel proteins/proteoforms. Our selection was based on the cotranslational nature of protein Nt-acetylation, also considering the (possible) removal of the iMet by MetAPs, with extra translational evidence provided by Ribo-Seq. Our strategy is outlined in Figure 6 and explained in more detail later. It was applied on the dataset for each used protease, and all results were merged afterward. In this way, peptides matched to an NTR accession could be traced back to any of the three proteases (Table 3 and Fig. 7). More information about the NTR peptides and the proteins retained in each step can be found in supplemental Table S3.
Table 3Numbers of peptides with accessions indicating that the peptide originates from an NTR protein during the different steps of the selection procedure
)) using protein sequence databases with a high level of subsequence redundancy, such as the database we have used, is challenging. In general, of all the identified peptides, only 3246 (13.9%), 1731 (14.8%), and 1530 (14.6%) (for trypsin, chymotrypsin, and GluC, respectively) were unique. Peptides often matched to multiple proteins, both annotated and novel, with database entries to be chosen rather at random by Mascot upon identification of such peptides. This is because Mascot needed to be used at the peptide level and not at the protein level where the protein inference problem is better dealt with. To correct for this, we reordered all peptide-associated protein entries, prioritizing UniProt entries over UniProt isoform entries, and these over Ensembl entries (coming from the Ribo-Seq data) as the UniProt database is by far the most completely annotated and curated of these databases. As for the Ensembl entries, we prioritized the different biotypes as aTIS > CDS > 5′ UTR > 3′UTR > NTR, thus again prioritizing for the most confident or plausible origin of the peptide. Based on these criteria, we assigned each peptide with one main protein entry and listed all other entries in a separate column (the isoform column) as these can contain extra information, such as Ribo-Seq evidence for a translation initiation event at a matching position in a transcript. Furthermore, when a peptide was matched to several entries of the same category/biotype, we prioritize the entry holding the lowest start position, thus favoring matches at a protein’s utmost Nt-position. Finally, for any unresolved cases, we sorted database entries alphabetically, which was the case for 2299, 1250, and 1204 peptides in the trypsin, chymotrypsin, and GluC datasets, respectively. Do note that when prioritizing accessions, we actually do not remove potential NTR accessions but rather arrange protein accessions within a protein group to be able to evaluate which peptides match to other biotypes next to NTRs. By this, we thus exclude nonconclusive identifications (peptides matching several biotypes) and are able to focus on NTR-specific peptides.
This set of rules ensures that when a peptide was found by more than one protease, it was always matched to the same protein entry, simplifying the final merging of the data. In addition, these rules also imply that many peptides that were initially associated with NTR proteins were reassigned as we prioritized matches to known annotated proteoforms and protein-coding regions. Indeed, before filtering, about 6% of all identifications were matched to an NTR entry (regardless of the protease used). After filtering, only very few NTR entries (0.40, 0.58, and 0.14% for trypsin, chymotrypsin, and GluC, respectively) remained (Table 3). In addition, a huge fraction of the identified NTR peptides were reassigned to UniProt (isoforms), (93.37, 91.10, and 97.99% for trypsin, chymotrypsin, and GluC, respectively). For example, the cotranslationally acetylated peptide AVNVYSTSVTSDNLSR, identified in the trypsin sample matched to ENST00000522077_8_135625981_ntr_100db1. However, this is also the mature Nt-peptide of a UniProt entry, Q15691 (Microtubule-associated protein RP/EB family member 1) and was therefore matched to this entry.
While checking the influence of the database entry filtering on the overall distribution of the database entries, we noticed that the fraction of UniProt proteins, UniProt isoforms, and the different Ribo-Seq categories were equally distributed over the three datasets (before and after filtering), and that after database entry filtering, more proteins were matched to UniProt entries (supplemental Fig. S3). For all remaining NTR entries (just 176 in total), 94 (53.4%) of these were found to be unique. A list of all NTR entries and peptides before and after this first filtering step is shown in supplemental Table S3.
Deduplication and Removal of Non–Nt-Peptides
In the dataset, now consisting of all identified peptides filtered for database entries, some peptides were present several times as, for instance, they were identified in different fractions of a certain sample (e.g., KSAPSTGGVKKPH found in fractions B6 and B9 of the chymotrypsin sample). In the second filtering step, we removed such peptide sequence duplicates. Of note, some peptides were found with different Nt-modifications. For example, GDVVPKDANAAIATIKTKR (ENST00000530835_11_90283408_ntr_011db2), identified in the trypsin dataset, was found both with a free N terminus and with a cotranslationally acetylated one. Peptides were grouped according to their sequence, and cotranslational acetylation was prioritized over N-terminally blocked, over unmodified (free α-amine group), thereby giving more weight to (translational) evidence that an identified peptide indeed points to the N terminus of a translation product. When a peptide was identified several times with the same Nt-modification, the highest scoring peptide-to-spectrum match (PSM) was retained. How many times a peptide was identified, the degree of cotranslational acetylation and the different Nt-modifications the peptide was identified with (peptide forms) are listed. Thus, for the aforementioned example, both modifications (in vivo and in vitro acetylations) are reported, and the acetylation percentage was calculated, being 71% (based on PSM counts). This filtering step reduced the dataset from 23,420, 11,687, and 10,461, to 10,147, 6796, and 5373 unique peptide sequences for trypsin, chymotrypsin, and GluC, respectively. NTR peptides were also found several times or with different Nt-modifications (see aforementioned example), thus the number of NTR matches further reduced to 25, 20, and 13 for trypsin, chymotrypsin, and GluC, respectively.
Next, from this list of unique peptides, we focused on the Nt-peptides and excluded coenriched internal and C-terminal peptides. Both peptide classes can be distinguished from Nt-peptides as Nt-peptides are acetylated, either in vivo or in vitro, and internal peptides contain an unmodified N terminus (free α-amine), whereas the C-terminal peptides are not followed by an amino acid. This allows us to further reduce our dataset to 5876, 2403, and 2790 distinct Nt-peptides for trypsin, chymotrypsin, and GluC, respectively.
Some identified NTR peptides were internal or C-terminal peptides as the number of NTR matches further dropped to 16, 13, and 8 for trypsin, chymotrypsin, and GluC, respectively. However, as NTR proteins are generally short (Fig. 3B), we also retained peptides that are both N and C terminal and thus cover the complete sequence of the (micro)protein. We identified one such N-terminally blocked peptide in two different datasets (trypsin and chymotrypsin) that covers the complete protein sequence: MKEETKEDAEEKQ. This peptide was found to be a unique peptide belonging to ENST00000486575_22_20127011_ntr_100db1. Note that most of the NTR proteins that were removed in this step were identified by peptides that were neither cotranslationally acetylated nor N-terminally blocked. Therefore, they did not contain evidence to be further considered as Nt-peptides.
Among all distinct Nt-peptides, we noticed that many peptides were found to be C-terminally ragged, an artifact previously observed in COFRADIC datasets. Such ragged peptides share the same start position and are linked to the same database entry but are C-terminally shorter. One example is ENST00000556323_14_92026617_ntr_100db1 for which four different peptides were found: EKKEVVEEAENGRDAPAD, EKKEVVEEAENGRDAP, EKKEVVEEAENGR, and EKKEVVEEAEN. These peptides were identified in the chymotrypsin and trypsin datasets, but note that the C-terminal ends of the peptides do not comply with chymotrypsin’s specificity for cleavage (Y, W, F, L, or M). We grouped such peptides based on their start site and the coupled database entry and again prioritized for cotranslational acetylation. If such peptides held the same Nt-modification, the longest peptide sequence was kept. However, information on all shorter variants was also stored. This step further reduced the dataset and the number of NTR accessions found. For trypsin, chymotrypsin, and GluC, we now ended with 5063, 1813, and 2324 unique Nt-peptides, respectively, and only 15, 11, and 8 Nt-peptides remained matched to an NTR entry.
Further Filtering of Nt-Peptides Based on Cotranslational Modifications
The list of distinct Nt-peptides was further filtered for Nt-peptides that are proxies for translation by removing peptides that reported protein processing and peptides that were very unlikely to be Nt-peptides (as explained later and schematically summarized in Fig. 6B). For all remaining peptides, a confidence level (high or low) and a category, database-annotated or alternative N termini, were assigned. The latter points to Nt-peptides originating from an Nt-variant of a canonical protein. In this way, a high confident Nt-peptide originating from an NTR protein, combined with the Ribo-Seq data that were used to build the database, must provide solid evidence that a transcript from a presumed nontranslated region can be translated.
Database-Annotated Protein N Termini
When an Nt-peptide matched to a protein position 1 or 2 (following iMet removal), it was considered a highly confident database-annotated Nt-peptide. Such peptides may match to not only a regular UniProt protein but also to a UniProt isoform or an Ensembl entry (such as an NTR entry). About 1462, 715, and 868 database-annotated N termini were found for, trypsin, chymotrypsin, and GluC, respectively. From the 15, 11, and 8 NTR accessions in the trypsin, chymotrypsin, and GluC datasets, respectively, 5, 8, and 4 of the Nt-peptides were listed under this category.
Alternative Protein N Termini
We further filtered Nt-peptides that did not start at positions 1 or 2 as these could also originate from signal or transit peptide removal and/or other proteolytic activities in cells or whilst preparing the samples. For trypsin, chymotrypsin, and GluC, there were ,respectively, 3600, 1098, and 1455 presumed Nt-peptides that started beyond position 2. We differentiated between (in vivo) cotranslationally acetylated peptides and in vitro blocked peptides. The former holds more evidence that they originated from translation events, whereas the latter could also point to protein processing. The majority of N termini with a start position beyond 2 were found in vitro blocked. For the three proteases, 422, 113, and 214 cotranslationally acetylated peptides, and 3178, 985, and 1241 blocked peptides were identified for trypsin, chymotrypsin, and GluC, respectively. To further check if such peptides originated from translation events, we used extra evidence from Ribo-Seq and evaluated the presence of an iMet.
Cotranslationally Acetylated Peptides
If a peptide was cotranslationally acetylated and either started with or was preceded by a methionine, it was classified as a highly confident alternative Nt-peptide. About 222, 56, and 109 Nt-peptides belong to this category for trypsin, chymotrypsin, and GluC, respectively. Only one peptide matching to an NTR entry was classified as such: MASAASSSSLE (found in the GluC dataset) matching to position 12 of ENST00000403258_6_88276364_ntr_101db1.
Cotranslationally acetylated peptides that neither start with nor are preceded by a methionine were also identified (200, 57, and 105 for trypsin, chymotrypsin, and GluC, respectively). For such peptides, we first searched for extra evidence (from UniProt isoforms or Ribo-Seq) that could indicate that these peptides pointed to alternative starts of translation. For trypsin, chymotrypsin, and GluC, respectively, we only found three, two, and one peptides with extra Ribo-Seq evidence and categorized these peptides as highly confident alternative Nt-peptides. Of note, among these peptides, none were from NTR entries. For the remaining peptides without extra Ribo-Seq evidence and not starting with methionine, we verified if the preceding amino acid was a potential recognition site of the protease used for digestion. If not, such peptides were considered as low-confident alternative Nt-peptides; all other peptides were removed.
A missing iMet can be explained by a non-AUG codon that was used for initiating translation, which are generally also translated into a methionine (
). However, in protein databases, such as UniProt, non-AUG start codons will be indicated as translated into the amino acid they encode for instead of methionine. As such, when using protein databases, one would not be able to identify the corresponding Nt-peptide. About 106, 24, and 38 peptides were found in this category, with three, one, and zero Nt-peptides matching to an NTR entry in the trypsin, chymotrypsin, and GluC datasets, respectively.
N-terminally Blocked Peptides
For the Nt-blocked peptides (3178, 985, and 1241 peptides for trypsin, chymotrypsin, and GluC, respectively), extra evidence for translation was first evaluated (see aforementioned). UniProt isoforms or Ribo-Seq corroborated peptides were considered highly confident alternative Nt-peptides. As such, 38, 15, and 12 Nt-blocked peptides were assigned as highly confident (for trypsin, chymotrypsin, and GluC, respectively). Stringent filtering was then applied for peptides lacking extra evidence. The presence of a methionine at the peptide’s start may provide translational evidence; therefore, we scanned if the starting or preceding amino acid was methionine, and if this was not the case, peptides were removed. One example is AVGVIKAVDKKAAGAGKVT, starting at position 27 of ENST00000415278_1_96448151_ntr_010db2. It was found in the chymotrypsin dataset and is not initiated with a methionine. It is thus unlikely that this peptide points to a translation event, and it was therefore removed. Note that this step removed the majority of presumed Nt-peptides (3009, 886, and 1201 in the trypsin, chymotrypsin, and GluC datasets, respectively).
Peptides that survived this filtering step were further evaluated by checking if the preceding amino acid was a potential cleavage site for the protease used (as some unwanted transacetylation acitivity is possible (
)) and if iMet (processing) agreed with the specificity of the MetAPs. Peptides that met these criteria were retained as low-confident alternative Nt-peptides, leading to 102, 65, and 21 of such peptides in the trypsin, chymotrypsin, and GluC datasets, respectively.
Clearly, with this stringent filtering strategy, a large part of peptides that do not confidently point to a protein’s N terminus, but rather to processing events, was removed. In total, we find 263, 73, and 122 high-confident alternative N termini and 205, 88, and 58 low-confident alternative N termini (for trypsin, chymotrypsin, and GluC, respectively).
Finally, 1930, 876, and 1048 Nt-peptides were identified by COFRADIC in the cytosolic HEK293T proteome in, respectively, the trypsin-, chymotrypsin-, or GluC-digested samples (thus 3854 in total). Of the peptides that matched to an NTR entry, eight, nine, and five peptides (for trypsin, chymotrypsin, and GluC, respectively) survived the different filtering steps, thus 22 in total, which is merely 0.57% of all identified Nt-peptides. The majority (
) start at position 1 or 2 of the corresponding NTR protein and are thus listed in the highly confident category. For trypsin, the remaining three NTR matches are low-confident alternative Nt-peptides (cotranslationally modified but not starting with methionine). The same is observed for one peptide in the chymotrypsin dataset, whereas in the GluC dataset, we detected a cotranslationally acetylated peptide that starts with methionine and is thus a highly confident alternative N terminus (supplemental Table S3).
Final Merging of the Data
In the last step, the three different datasets were concatenated. As these proteases generated different ends at a given protein N terminus, merging based on peptide sequence was not possible. Therefore, we relied on the database entry and the peptide’s start position and retained the longest peptide sequence as this contained the most information, but, as indicated previously, also kept all information on shorter forms of this peptide. We also listed the datasets in which a peptide was identified and recalculated the degree of in vivo acetylation based on all identified peptides. This resulted in a final dataset of 2896 unique and confident Nt-peptides (supplemental Table S4), 19 of which that matched to an NTR entry (Table 4).
Table 4List of Nt-peptides matched to an NTR entry
This table lists all Nt-peptides that were identified and matched to an NTR protein. For each peptide, the accession (a detailed explanation of the information contained in the accessions from Ensembl can be found in the Experimental procedures section), the start and end positions, the sequence, the enzyme (this indicates by which protease the peptide was detected), modification (“Mod.,” which indicates the modification found on the protein’s N terminus), the confidence level assigned to the peptides (“Conf.,” which is either high or low [see main text]), isoform column, transcript information, protein length, “Prec. AA,” which indicates the preceding AA, and the spectral counts (#) are listed.
Several interesting observations can be made for the identified NTR peptides. For example, 12 of these 19 peptides do not end on an amino acid that corresponds to a cleavage site of the protease used. For seven of them, iMet processing did not seem to follow the known MetAPs rules, instead, peptides were identified starting with E, H, K, N, or R for which the iMet is normally not removed. About 13 peptides are only identified by one or just two PSMs, which points to the low abundance of NTR proteins. On the other hand, the majority of the identified NTR peptides are unique, supporting their NTR origin. For tubulin alpha pseudogene 2, two peptides were identified. However, one peptide is a shorter variant of the other missing the first four amino acids. Finally, concerning the biotypes of these transcripts in Ensembl, 12 are processed pseudogenes, five have a retained intron, one is a transcribed processed pseudogene, and one is a processed transcript.
Further Curation of Identified NTR Proteins by BlastP Analysis
After this stringent filtering, an additional curation step was performed, similar as described by Zhu et al. (
). The identified peptides from the NTR proteins were searched for homologous sequences using the BlastP algorithm (using standard settings, automatically adjusted for short sequences and an e-value of 200,000) against the human UniProtKB/Swiss-Prot database.
Strikingly, nine of the 19 peptides had an exact match to a UniProt protein sequence (supplemental Table S5). This can be explained by the semiprotease settings (semi-ArgC, semi-GluC, and semichymotrypsin) that were used during the database search, which imply that one end of the peptide (either C-terminal or Nt, not both) was allowed not to comply with the protease’s specificity rules, and such settings are required to identify alternative start positions (
). However, in the cases explained later, the peptides’ ends do not comply with the specificity of the protease used, whereas they were matched to a start position of an NTR protein. As such, a match against NTR proteins appears “forced” over a match to an internal peptide of a UniProt protein. For example, the peptide ADTFLEHM is found in the trypsin-digested sample. As the peptide does not end on R, its Nt-amino acid should follow a trypsin (acting as ArgC) cleavage site or be a start position. The peptide was matched to the N terminus (position 2, preceded by a methionine) of ENST00000454707 (pyruvate kinase M1/2 pseudogene 5, processed pseudogene). However, following BlastP analysis, this peptide was found to match to a peptide starting at position 22 of pyruvate kinase PKM (UniProtKB accession: P14618). Here, this peptide is preceded by a methionine, which is not a potential trypsin cleavage site but is likely an internal start site of P14618. Thus, similar to Zhu et al. (
), we assume that for all peptides with an exact match after BlastP analysis, the semisetting caused them to match against an NTR protein, whereas they likely originated from an annotated protein.
Of note, by removing such peptides, we lost peptides of which the iMet was removed though not in agreement with the specificity of the MetAPs. However, one case remains, being the removal of methionine leading to the EKKEVVEEAENGRDAPAD peptide.
The remaining 10 of 19 peptides have a strong homology match to a UniProt entry with just one (or two) amino acids that are different (Table 5). There are two hits for prothymosin alpha and tubulin alpha; this is because for both cases, two peptides were identified that differ at their N terminus but are linked to the same UniProt protein and contain the same single amino acid variation. Interestingly, nine of the 10 peptides just had a single amino acid difference with a UniProt reference protein, which for most of them could be explained by a potential SNP. For example, for ADDAGAAGGPGGPGGPEMGNRGGFRGGF (Ribosomal protein S2 pseudogene 46) and ADDAGAAGGPGGPGGPGMGNRGGFRGGF (40S ribosomal protein S2), both proteins were found cotranslationally acetylated and for both, the peptides were matched to the second position in the protein sequence. In the nucleotide sequence of the UniProt protein, glycine is encoded by GGG, which only differs one base from the GAG codon that encodes glutamic acid in the pseudogene. For other peptides, any straightforward explanation is less clear. For example, EKKEVVEEAENGRDAPAD (protymosin alpha pseudogene) and EKKEVVEEAENGRDAPAN (prothymosin alpha). The peptide was identified as being heavy acetylated and, in the pseudogene, preceded by a methionine (i.e., a potential Nt-peptide). However, it is unlikely that the iMet is removed when the second amino acid is glutamic acid (
). As for the UniProt protein, the peptide is preceded by lysine, which is not a trypsin (here acting as ArgC) nor chymotrypsin cleavage site however, the lysine is encoded as AAG and is surrounded by a Kozak-like sequence, which could point to a non-AUG alternative translation start site. Nonetheless, both cases appear unlikely, and the actual difference (Asn versus Asp) can be explained by an SNP (GAC instead of AAC).
Table 5List of all peptides that matched to an NTR entry and for which a strong homology match to a UniProt entry was found
ENST00000439303_10_10174230_ntr_100db1 Elongin C pseudogene 3, processed pseudogene
1 MDGEEKTCGGCEGPDAMYVKLISSDGHEFIVKR 33
1 AA difference Possible SNP? Yes
1 MDGEEKTYGGCEGPDAMYVKLISSDGHEFIVKR 33
ENST00000486575_22_20127011_ntr_100db1 RAN binding protein 1, Retained intron
1 MKEETKEDAEEKQ 13
P43487 Ran-specific GTPase-activating protein (RANBP1)
1 AA difference Possible SNP? No
189 VKEETKEDAEEKQ 201
ENST00000458332_17_19446852_ntr_111db1 Ribosomal protein S2 pseudogene 46, processed pseudogene
Q71U36 Tubulin alpha-1A chain P68363 Tubulin alpha-1B chain and other tubulin alpha chains: P0DPH7, Q6PEY2, Q9BQE3
1 AA difference Possible SNP? Yes
325 PKDVNAAIATIKTKR 339
For each peptide, the accessions (containing both the Ensembl accession and transcript information) are provided, the column “match” contains the peptide sequence of the identified peptide matched to the NTR and the corresponding sequence in the UniProt protein it was matched to, with the positions of the peptide within the NTR and UniProt proteins. The UniProt accession to which the NTR peptide is matched to, the enzyme (protease) the peptide was identified with and some extra information are listed.
Misidentification of MS/MS spectra is a considerable threat in MS-driven proteomics. This is especially true when considering unexpected and hence unaccounted for modifications, which can yield isobaric peptides with similar fragmentation patterns. In a recent proteogenomics study that also applied a highly stringent workflow, peptides with single amino acid substitutions were removed (
), we considered that inspection of MS/MS spectra by experts facilitates differentiation between correct and incorrect single amino acid variants as called by database searching. In the aforementioned study, SpectrumAI was used to perform this task at a large scale. Here, the inspection was done manually as only 10 spectra required examination. Upon inspection, two peptides (EKKEVVEEAENGRDAPAD and SDAAVDTSSEITTKDLKEKKEVVEEAENGRDAPAD), both pointing to the same NTR (prothymosin alpha), were removed, the other eight were evaluated correct and thus retained. For the removed peptides, the mass difference was just 1 Da, which equals the mass difference between the peptide from the NTR and UniProt protein (the difference is the last amino acid, D in the NTR protein and N in the UniProt protein). Besides this, the internal Asn-Gly motif makes the inspection of the spectra more difficult as this motif is prone to deamidation to Asp-Gly (
For all other NTR proteins, synthetic peptides were used to compare and validate the identified spectral matches. If for a peptide the same precursor ion (same m/z) was found, this ion was selected, and the top 10 fragment ions of the synthetic peptides were selected and compared with the ranking of the same fragment ions of the fragmented peptide ion that was identified in our COFRADIC samples.
For MDGEEKTCGGCEGPDAMYVKLISSDGHEFIVKR, we experienced issues with the synthesis of the synthetic peptide and cannot draw conclusions for this case. The ranking of the fragment ions of ASTGTAKAVGKVIPELNGKL identified in our sample was too different from the ranking of the synthetic peptide (supplemental Fig. S4). Therefore, ASTGTAKAVGKVIPELNGKL linked to glyceraldehyde-3-phosphate dehydrogenase pseudogene 72 (ENST00000402643_6_166064725_ntr_111db1) was removed, further reducing the amount of confident Nt-peptides of NTRs to seven. For all other NTR-linked Nt-peptides, the ranking of their fragment ions was (highly) similar between the peptides identified in our COFRADIC samples and the synthetic peptides, and therefore, these peptides were retained. For example, the fragment ions of MASAASSSSLE were highly comparable (Fig. 8, all the others can be found in supplemental Fig. S4).
Inspection of Sequencing Data
We used Ribo-Seq data to determine splice boundaries and exact nucleotide sequences of exons at homologous genomic loci, to verify the expression of NTR-specific transcripts and variants in HEK293T cells. Using transcript coordinates of custom database entries, we developed an approach to map identified peptides to their genomic positions. Subsequently, we visualized peptides next to Ribo-Seq sequencing reads using an Integrative Genome Viewer (
). Inspection of the seven most-confident NTR peptides, lacking an exact BlastP match to UniProt entries and retained after inspection of MS/MS spectra (Table 6), revealed sequencing evidence supporting four NTRs (Figs. 9 and S5). More specifically, we found NTR-specific alternatively spliced reads from a retained intron transcript matching the FPSNWNEIVDSFDDMNLSESLLR peptide of eukaryotic translation initiation factor 4A1 (Fig. 9). Interestingly, translation of this peptide is initiated at exon 1 in a different reading frame compared with the canonical proteoform of the same gene, for which the N terminus was also found. From exon 1, NTR continues translation directly to exon 3, thereby restoring the canonical reading frame. Splicing of RANBP1 (Ran-specific GTPase-activating protein) NTR transcript responsible for the MKEETKEDAEEKQ peptide was also confirmed (supplemental Fig. S5). From the remaining inspected cases, ribosome-protected fragments carrying NTR-specific nonsynonymous and synonymous variants in peptide MDGEEKTCGGCEGPDAMYVKLISSDGHEFIVKR (ELOCP3), next to synonymous variants in ADDAGAAGGPGGPGGPEMGNRGGFRGGF (RPS2P46) and MASAASSSSLE (ACTBP8) were found (indicated in green, supplemental Fig. S5). However, many NTR-specific nucleotide variants were not supported. Instead, Ribo-Seq reads at these positions were missing or mapped with a mismatch (indicated in red). To test if NTR peptides can be explained by known SNPs, NTR nucleotide sequences (encompassing the peptide) were compared with their closest annotated protein-coding match. None of the NTR-specific nonsynonymous sequence variations were previously reported in the canonical genes by the dbSNP database (build 142 hg38) (
For each peptide, the accessions (containing both the Ensembl accession and transcript information) are provided along with the identified peptide sequence and the enzyme (protease) the peptide was identified with.
To further investigate the functionality of NTR proteins, we selected one NTR protein to study its interactome by Virotrap. An Nt-peptide pointing to an Nt-proteoform (missing the first 11 amino acids) of ACTB pseudogene 8 (ENST00000403258_6_88276364_ntr_101db1) was identified, and we decided to select both forms (full length and proteoform) for interactome analysis. The proteoforms were coupled to HIV-1 GAG protein and expressed in HEK293T cells. Expression of the fusion protein initiates budding of VLPs from the cells, which contain the bait protein and its interaction partners. The VLPs can thereafter be purified from the growth medium as a genetic fusion protein of the VSV-G coupled to a FLAG tag is cotransfected and expressed on the surface of the VLPs (
). We performed triplicate Virotrap experiments for both NTR proteoforms to obtain specific interaction partners for the NTR protein. eDHFR fused to GAG was used as a negative control. MS of the VLPs revealed that both proteoforms were successfully expressed, and pairwise comparisons between either the full length or the Nt-proteoform with the eDHFR control samples revealed several potential interaction partners of the NTR protein (11 for the full length and 10 for the proteoform, not taking the bait itself into account, Fig. 10, A and B). Four common interaction partners were identified: CTNNA1, SNX2, HGS, and CTBP1/CTBP2. All significant proteins (adjusted p <0.01) in at least one comparison were also visualized in a heatmap (Fig. 10C), revealing 54 significant proteins in the control samples and 18 significant proteins in the bait samples. The NTR proteins seem to interact with several proteins that are found to localize at membranes (IRS4, TMEM219, CTNNA1, MAT2B, PI4KA, and CMTM6) and/or function in vesicles and protein transport (TFG, SNX2, HGS, RER1, and CTMT6), thus providing molecular functions to this novel protein.
The discrepancy between the number of NTR proteins predicted from RNA analyses (such as Ribo-Seq and RNA-Seq) (
) and the number of unambiguously detected NTR protein products using MS-based proteomics was the main motive for performing our study. We used an experimental setup that combined a reduced sample complexity with an extended database search space to improve the possibility of identifying NTR proteins. To cope with the increased search space, we introduced a rigorous workflow for data analysis and curation of the results.
) used a combination of Nt-peptide enrichment, a Ribo-Seq–based search database, and downstream filtering in their search for protein evidence of translation starting at noncanonical TISs. Similarly to their study, we reduced the sample complexity by focusing on protein Nt-peptides only, as these peptides are proxies for translation events and can provide direct evidence for NTR proteins. However, restricting ourselves to Nt-peptides comes with a cost. Indeed, by not using shotgun proteomics data, we will have missed identification of several NTR proteins. In addition, protein identification in N-terminomics studies is typically based on a single peptide, with not all of these peptides being identifiable. To counter this effect, we used three different proteases for proteome digestion as these generate different Nt-peptides. Our bioinformatics analysis revealed that our approach should, in theory, greatly improve identification of peptides from NTR protein products; however, given the limitations of our study, we found that less than 1% of the identified Nt-peptides was found to originate from such proteins.
With complex databases such as the one used in our study, the protein inference problem is highly prevalent. Therefore, our filtering approach started with the reassignment of protein entries to identified peptides and, when peptides were matched to database entries with different levels of evidence for protein expression and protein (functional) annotation, we gave priority to the best annotated ones (UniProtKB entries). In this way, we excluded nonconclusive identifications of NTR peptides. As expected, this led to the highest reduction in the apparent novel proteins identified (a decrease from 6% of all identifications to <0.6 %) as most (>91%) of the NTR peptides were now rematched to UniProt protein entries. Several other studies have used similar or slightly different strategies (e.g., filtering out all peptides originating from known proteins or filtering out all nonunique proteins) to find novel proteins in their proteogenomics workflows (
Deep proteome coverage based on ribosome profiling aids mass spectrometry-based protein and peptide discovery and provides evidence of alternative translation products and near-cognate translation initiation events.
). Filtering of our results continued with the deduplication and removal of non–Nt-peptides, reducing the number of NTR proteins identified to just 39. The remaining peptides were expected to originate either from translation or from protein processing. The former starts with an iMet, which can be cotranslationally removed by MetAPs. Cotranslational Nα-acetylation of the iMet or the exposed second amino acid is a frequent modification occurring on eukaryotic intracellular proteins. Both features were used to remove peptides that most likely originated from protein processing, followed by a final merge of the results from the three different proteases used. Ultimately, this leads to translational evidence for only 19 NTR proteins (or 0.78% of all identified proteins).
To account for the fact that isobaric amino acids (or amino acid combinations) and amino acid modifications might have influenced correct identification of MS/MS spectra, a BlastP analysis of these 19 NTR peptides was performed. This revealed nine exact matches to (high-abundant) annotated proteins. The remaining 10 were found to be highly similar to annotated proteins as well, only differing by one or two amino acids. When evaluating the PSMs of these peptides, two more peptides (pointing to the same NTR protein) were removed. Additional comparisons of the fragment ions of the peptides identified in our sample with synthetic peptides resulted in the removal of an extra peptide. To verify the coverage of NTR-specific transcript variants present in the identified Nt-peptides using Ribo-Seq data, we visualized their genomic locations with the corresponding Ribo-Seq reads, which showed that many of the NTR-specific nucleotide variants were not unambiguously covered by Ribo-Seq reads. In fact, only four of seven NTR proteins were highly supported.
In an attempt to reduce the number of false-positive identifications as much as possible, we decided to apply stringent curation and inspection of database search results as several shortcomings inherent to data analysis wrongly assigned peptides to NTR proteins. In this respect, Kim et al. (
) reported several peptides pointing to translation of noncoding RNAs, but upon checking some of these peptides by BlastP, we found an exact match to UniProt proteins. For instance, the peptide VLGSAPPPFTPSLLEQEVR was linked to a noncoding RNA (LOC113230), whereas BlastP revealed an exact match to MISP3 (starting at position 140; UniProt accession: Q96FF7). In fact, of the nine proteins reported to originate from novel protein-coding regions (more specifically noncoding RNAs), we found for six of them that the identified peptide(s) had an exact match to a UniProt protein (supplemental Table S6). This might point to misannotation of lncRNAs.
) hold sequences of novel proteins or proteins originating from NTRs. When available, these databases also report MS evidence for such predicted proteins. Considering the NTR proteins identified in our study, often (slightly) different protein sequences are reported from the same transcripts. When evaluating our 10 most confident NTR proteins in these databases, we find for several NTR proteins evidence in the OpenProt database. For MASAASSSSLE, we found a similar protein in the OpenProt database (accession: IP_591792, reported as an alternative protein), which is N-terminally seven amino acid longer (MCDIKEK) compared with the NTR protein (ENST00000403258) identified in our study. OpenProt reports several IP_591792 peptide matches found by MS, among these is EMASAASSSSLEK, which holds the Nt-peptide we have identified (MASAASSSSLE). The NTR protein linked to ADDAGAAGGPGGPGGPEMGNRGGFRGGF is also reported in OpenProt (accession: IP_711030); however, there is no supporting MS/MS or Ribo-Seq evidence for this protein, thus this OpenProt entry was only predicted. For MDGEEKTCGGCEGPDAMYVKLISSDGHEFIVKR, we found a similar situation as the NTR protein was also listed as predicted in the OpenProt database (IP_790862). For both proteins, we thus now find MS/MS evidence to support their existence. For (SDAAVDTSSEITTKDLK)EKKEVVEEAENGRDAPAD, the NTR protein is also present in the OpenProt database (IP_745694) with MS evidence (among the identified peptides for this protein is DLKEKKEVVEEAENGRDAPAD) covering the largest part of the Nt-peptide we identified and also covering the terminal aspartic acid residue, which is different from UniProt proteins. We also checked these 10 most confident peptides via ProteoMapper online (on the site of PeptideAtlas) to evaluate where they map to PeptideAtlas (
). For two peptides, SDAAVDTSSEITTKDLKEKKEVVEEAENGRDAPAD and EKKEVVEEAENGRDAPAD, we found a match to Q15203, a TrEMBL unreviewed entry based on genomic DNA translation. In conclusion, such matches to entries stored in databases further increase the confidence of our findings.
We also evaluated if Nt-peptides from Nt-proteoforms generated by alternative translation initiation or alternative splicing, or from translation of small uORFs located in the 5′UTR of a regular CDS, were identified. Indeed, we detected 31 N termini originating from an annotated start site in Ensembl (aTIS), nine N termini pointing to proteoforms located inside an annotated CDS in Ensembl (CDS), and 22 proteoforms located in the 5′UTR region. However, we were not able to detect protein products from 3′UTR regions, in line with previous data that translation from 5′UTRs is more frequent than from 3′UTRs (
). Furthermore, we identified 92 Nt-peptides that uniquely match to a UniProt isoform and 689 Nt-peptides that match to an internal position of a UniProt protein and thus point to possible Nt-proteoforms (supplemental Table S7).
Besides Ribo-Seq and proteogenomics, other methods exist to study or monitor protein synthesis. Some proteomics-based methods rely on the incorporation of an azidohomoalanine (a bio-orthogonal methionine analog), which can then be used for affinity-based purification (
). However, such methods were only able to identify a few hundreds of proteins. A more recent method called PUNCH-P recovers ribosome–nascent chain complexes from the cells by ultracentrifugation followed by labeling with biotin–puromycin and affinity purification before LC–MS/MS analysis. With this method, thousands of proteins could be detected (
) (and likely also depends on the cell cycle phase and stimuli). We likely miss several NTR proteins by only analyzing HEK293T proteomes under normal conditions. Clearly, by restricting to cytosolic proteins, we will also have missed proteins present at other subcellular localizations. As shown recently, it might be interesting to focus on the immunopeptidome as Cuevas et al. (
) showed that proteins originating from noncoding genes are more likely to be detected in the immunopeptidome compared with canonical proteins, hinting to the fact that they might be nonfunctional. Similar as in our study, they only found translational evidence (by MS) for 0.44% of the noncanonical proteins reported by their RNA-Seq and Ribo-Seq experiments. The detection of a protein is also dependent on several other characteristics such as its abundance (sensitivity) and physicochemical characteristics of its peptides that are generated upon digestion. Some of the latter (peptide length) were considered in our theoretical peptide MS detectability analysis; however, we could not consider protein abundance. As many unannotated ORFs were reported to have lower translation rates (
), we might have missed such NTR proteins as their peptide levels were below the limit of detection of the LC–MS/MS instrument used. Therefore, our reported peptide numbers are likely lower bound estimates because of our stringent filtering workflow.
Besides issues on the MS level (discussed previously), there might also be issues on the Ribo-Seq level as several articles have raised concerns that there is a need for standardization as biases in both sample preparation and data processing can greatly influence the translational evidence reported (
We performed an interactome analysis using Virotrap to evaluate apparently stable NTR proteins using one NTR protein that survived our filtering steps, ACTB pseudogene 8 (ENST00000403258_6_88276364_ntr_101db1). When used as a Virotrap bait, we found that this protein had 18 potential interaction partners that mainly function in vesicle/protein transport and/or were found to localize at membranes, thereby hinting to a possible function of this particular NTR protein that must however be further studied to draw more solid conclusions.
In summary, we show that, theoretically, our strategy facilitates the detection of NTR proteins. However, experimentally, we only find a limited number of confidently identified NTR proteins.
The MS proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE (
The MS data of the synthetic peptides and the comparison with the peptides identified in our samples have been deposited to the ProteomeXchange Consortium via the Panorama Public repository with the dataset identifier PXD030285.
We thank Dr Patrick Willems, Jarne Pauwels, and Jade Hawksworth of the Gevaert laboratory for their suggestions and helpful discussions.
Funding and additional information
This work was supported by The Research Foundation — Flanders (FWO), project number G008018N (to K. G.).
A. B., D. F., and K. G. conceptualization; A. B., D. F., A. S., and H. D. methodology; D. F. software; A. B., D. F., and A. S. formal analysis; A. B. and T. V. d. S. investigation; A. B. and D. F. writing–original draft; D. F. and K. G. writing–review & editing; A. B. and D. F. visualization; K. G. supervision; K. G. funding acquisition.
Deep proteome coverage based on ribosome profiling aids mass spectrometry-based protein and peptide discovery and provides evidence of alternative translation products and near-cognate translation initiation events.