
MS2Rescore: Data-Driven Rescoring Dramatically Boosts Immunopeptide Identification Rates

Open Access. Published: July 06, 2022. DOI: https://doi.org/10.1016/j.mcpro.2022.100266

      Highlights

      • MS2Rescore significantly boosts immunopeptide identification rates
      • Data-driven post-processing allows for a ten-fold increase in specificity
      • MS2PIP and DeepLC predictors are integrated with Percolator post-processing
      • MS2Rescore accepts identification results from MaxQuant, PEAKS, MS-GF+ and X!Tandem
      • MS2Rescore shows great promise to extend current neo- and xeno-epitope landscapes
      Immunopeptidomics aims to identify major histocompatibility complex (MHC)-presented peptides on almost all cells that can be used in anti-cancer vaccine development. However, existing immunopeptidomics data analysis pipelines suffer from the nontryptic nature of immunopeptides, complicating their identification. Previously, peak intensity predictions by MS2PIP and retention time predictions by DeepLC have been shown to improve tryptic peptide identifications when rescoring peptide-spectrum matches with Percolator. However, as MS2PIP was tailored toward tryptic peptides, we have here retrained MS2PIP to include nontryptic peptides. Interestingly, the new models not only greatly improve predictions for immunopeptides but also yield further improvements for tryptic peptides. We show that the integration of new MS2PIP models, DeepLC, and Percolator in one software package, MS2Rescore, increases spectrum identification rate and unique identified peptides by 46% and 36%, respectively, compared to standard Percolator rescoring at 1% FDR. Moreover, MS2Rescore also outperforms the current state-of-the-art in immunopeptide-specific identification approaches. Altogether, MS2Rescore thus allows substantially improved identification of novel epitopes from existing immunopeptidomics workflows.

      Abbreviations:

      CE (collision energy), FDR (false discovery rate), HCD (higher-energy collision-induced dissociation), MS2 (tandem mass spectrometry), MGF (Mascot generic format), PCC (Pearson correlation coefficient), PSM (peptide-spectrum match).
      The immune system is a complex, yet remarkable system that protects us from both invaders from outside the body, that is, pathogens, as well as from inside the body, that is, malignancies (
      • Sattler S.
      ). Increased understanding of the immune system allowed for great medical achievements such as vaccination, which is currently available for over 29 diseases, enabled the eradication of smallpox, and prevents over 3 million deaths each year (https://www.cdc.gov/vaccines/vpd/vaccines-diseases.html?CDC_AA_refVal=https%3A%2F%2Fwww.cdc.gov%2Fvaccines%2Fvpd-vac%2Fdefault.htm). However, many diseases such as Mycobacterium tuberculosis or malignancies lack effective vaccines due to improper T-cell activation. A key issue in developing effective vaccines for these diseases is the lack of accurately identified major histocompatibility complex (MHC)-presented epitopes or immunopeptides. These epitopes are presented on the cell surface and enable T-cells to discern healthy cells from infected or malignant cells. While much effort has recently been invested in accurate prediction of these epitopes in silico (
      • Raoufi E.
      • Hemmati M.
      • Eftekhari S.
      • Khaksaran K.
      • Mahmodi Z.
      • Farajollahi M.M.
      • et al.
      Epitope prediction by novel immunoinformatics approach: a state-of-the-art Review.
      ), these are mostly limited to viruses as these contain fewer potential protein antigens (
      • Mayer R.L.
      • Impens F.
      Immunopeptidomics for next-generation bacterial vaccine development.
      ). Moreover, these tools are not yet sufficiently precise to confidently identify epitopes (
      • Larsen M.V.
      • Lundegaard C.
      • Lamberth K.
      • Buus S.
      • Lund O.
      • Nielsen M.
      Large-scale validation of methods for cytotoxic T-lymphocyte epitope prediction.
      ,
      • Zhang H.
      • Lundegaard C.
      • Nielsen M.
      Pan-specific MHC class I predictors: a benchmark of HLA class I pan-specific prediction methods.
      ). Therefore, experimental immunopeptidomics workflows, such as epitope detection through LC-MS, are still the best way to accurately identify these immunopeptides (
      • Bassani-Sternberg M.
      • Pletscher-Frankild S.
      • Jensen L.J.
      • Mann M.
      Mass spectrometry of human leukocyte antigen class I peptidomes reveals strong effects of protein abundance and turnover on antigen presentation.
      ).
      While immunopeptidomics workflows have been readily developed and applied (
      • Solleder M.
      • Guillaume P.
      • Racle J.
      • Michaux J.
      • Pak H.S.
      • Müller M.
      • et al.
      Mass spectrometry based immunopeptidomics leads to robust predictions of phosphorylated HLA class I ligands.
      ), acquisition of immunopeptides through LC-MS suffers from some major problems. First, the acquisition of immunopeptide spectra is hampered due to the low abundance of immunopeptides and even more so of neo-epitopes. Infrequently occurring epitopes are still very challenging to identify through LC-MS, despite enrichment efforts and sample preprocessing (
      • Faridi P.
      • Purcell A.W.
      • Croft N.P.
      In immunopeptidomics we need a sniper instead of a shotgun.
      ). Second, in contrast to standard proteomics experiments where proteins are usually digested with trypsin before LC-MS, immunopeptides are captured through immune purification with antibodies followed by acidic elution resulting in mostly nontryptic peptides. The nontryptic nature of immunopeptides results in one less positive charge due to the missing arginine or lysine at the peptide’s C-terminus, causing many immunopeptides to be singly charged during MS acquisition. These singly charged peptides are much harder to analyze because, during fragmentation of the peptide, the charge resides on one of the fragments, leaving the other uncharged and thus lost (
      • Pfammatter S.
      • Bonneil E.
      • Lanoix J.
      • Vincent K.
      • Hardy M.-P.P.
      • Courcelles M.
      • et al.
      Extending the comprehensiveness of immunopeptidome analyses using isobaric peptide labeling.
      ). Moreover, most contaminants are singly charged as well, making identifications of immunopeptides much harder (
      • Purcell A.W.
      • Ramarathinam S.H.
      • Ternette N.
      Mass spectrometry–based identification of MHC-bound peptides for immunopeptidomics.
      ). The nontryptic nature of immunopeptides hampers not only the acquisition but also the identification of immunopeptide spectra. To match each acquired spectrum with the peptide from which the spectrum originated, proteomics database search engines such as SEQUEST (
      • Eng J.K.
      • McCormack A.L.
      • Yates J.R.
      An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database.
      ), X!Tandem (
      • Craig R.
      • Beavis R.C.
      TANDEM: matching proteins with tandem mass spectra.
      ), Andromeda (
      • Cox J.
      • Neuhauser N.
      • Michalski A.
      • Scheltema R.A.
      • Olsen J.V.
      • Mann M.
      Andromeda: a peptide search engine integrated into the MaxQuant environment.
      ), or PEAKS DB (
      • Zhang J.
      • Xin L.
      • Shan B.
      • Chen W.
      • Xie M.
      • Yuen D.
      • et al.
      Peaks DB: de novo sequencing assisted database search for sensitive and accurate peptide identification.
      ) generate in silico spectra for all potentially matching peptides. The complete list of peptides that could be in the sample is called the search space. It is important to note that spectra from peptides that are not included in this search space cannot be identified, even though they were acquired. In standard shotgun proteomics experiments, in silico tryptic digestion of relevant proteins yields a broad yet representable search space. In immunopeptidomics, however, the search space tends to be two orders of magnitude larger due to (i) seemingly random cleavage from the protein of origin, (ii) the variable length of MHC class I bound peptides, 8 to 11 amino acids, and MHC-II peptides, 6 to 24 amino acids (
      • Jiang J.
      • Natarajan K.
      • Margulies D.H.
      ) and (iii) the potential occurrence of conformational cis- and trans-spliced immunopeptides, which are nonlinear peptides that originate from the same or different proteins, respectively (
      • Faridi P.
      • Li C.
      • Ramarathinam S.H.
      • Vivian J.P.
      • Illing P.T.
      • Mifsud N.A.
      • et al.
      A subset of HLA-I peptides are not genomically templated: evidence for cis- and trans-spliced peptide ligands.
      ). Additionally, sequence variants and noncanonical protein sequences are often considered as well, even further increasing the search space. Such a search space expansion leads to considerably more ambiguity between candidate peptide-spectrum matches (PSMs) (
      • Colaert N.
      • Degroeve S.
      • Helsens K.
      • Martens L.
      Analysis of the resolution limitations of peptide identification algorithms.
      ), lower PSM scores, drastically elevated false discovery rate (FDR) score thresholds, and ultimately fewer identified immunopeptides (
      • Verheggen K.
      • Ræder H.
      • Berven F.S.
      • Martens L.
      • Barsnes H.
      • Vaudel M.
      Anatomy and evolution of database search engines—a central component of mass spectrometry based proteomic workflows.
      ). Furthermore, because tryptic peptides have been the longtime standard in proteomics, search engines as well as bioinformatics tools that aid in identifying LC-MS spectra are tailored toward tryptic peptides, making them less accurate or not applicable at all for immunopeptidomics.
      The high need for neo- and xeno-epitope discovery led to the development of many bioinformatics tools to improve or validate identifications in immunopeptidomics. On the one hand, motif deconvolution tools have been developed that leverage binding motifs of immunopeptides to validate immunopeptide identifications. On the other hand, full pipelines have been developed to improve immunopeptide identification. For example, MHCquant (
      • Bichmann L.
      • Nelde A.
      • Ghosh M.
      • Heumos L.
      • Mohr C.
      • Peltzer A.
      • et al.
      MHCquant: automated and reproducible data analysis for immunopeptidomics.
      ), which is a recent computational workflow designed specifically for neo-epitope identification, and PEAKS DB (
      • Zhang J.
      • Xin L.
      • Shan B.
      • Chen W.
      • Xie M.
      • Yuen D.
      • et al.
      Peaks DB: de novo sequencing assisted database search for sensitive and accurate peptide identification.
      ). Even though PEAKS DB is not specifically designed for immunopeptides, it is highly interesting due to its de novo–assisted database searches, which tend to work well for large search spaces. Even though these tools can help with immunopeptide identification, they do not use all available information, such as retention time and fragment ion intensity patterns. Previously, it has been proven that integrating retention time predictions in standard proteomics workflows can improve identification rates (
      • Dorfer V.
      • Maltsev S.
      • Winkler S.
      • Mechtler K.
      CharmeRT: boosting peptide identifications by chimeric spectra identification and retention time prediction.
      ). Similarly, adding peak intensity predictions to postprocessing tools such as Percolator can also improve identification rates drastically (
      • Silva A.S.C.
      • Bouwmeester R.
      • Martens L.
      • Degroeve S.
      Accurate peptide fragmentation predictions allow data driven approaches to replace and improve upon proteomics search engine scoring functions.
      ), which has already been proven to work for immunopeptides as well by efforts such as Prosit (
      • Li K.
      • Jain A.
      • Malovannaya A.
      • Wen B.
      • Zhang B.
      DeepRescore: leveraging deep learning to improve peptide identification in immunopeptidomics.
      ,
      • Wilhelm M.
      • Zolg D.P.
      • Graber M.
      • Gessulat S.
      • Schmidt T.
      • Schnatbaum K.
      • et al.
      Deep learning boosts sensitivity of mass spectrometry-based immunopeptidomics.
      ). Similarly, tools such as DeepLC (
      • Bouwmeester R.
      • Gabriels R.
      • Hulstaert N.
      • Martens L.
      • Degroeve S.
      DeepLC can predict retention times for peptides that carry as-yet unseen modifications.
      ) and MS2PIP (
      • Gabriels R.
      • Martens L.
      • Degroeve S.
      Updated MS2PIP web server delivers fast and accurate MS2 peak intensity prediction for multiple fragmentation methods, instruments and labeling techniques.
      ,
      • Degroeve S.
      • Maddelein D.
      • Martens L.
      MS2PIP prediction server: compute and visualize MS2 peak intensity predictions for CID and HCD fragmentation.
      ,
      • Degroeve S.
      • Martens L.
      MS2PIP: a tool for MS/MS peak intensity prediction.
      ) can provide accurate retention time predictions and peak intensity predictions, respectively, to aid in postprocessing. Indeed, when combined with Percolator, identification rates at a fixed FDR have been proven to substantially increase (
      • Silva A.S.C.
      • Bouwmeester R.
      • Martens L.
      • Degroeve S.
      Accurate peptide fragmentation predictions allow data driven approaches to replace and improve upon proteomics search engine scoring functions.
      ). However, currently, DeepLC and MS2PIP are solely trained on tryptic peptides. The absence of lysine and arginine at the C-terminus of nontryptic peptides is less of a problem for DeepLC, as the effect on retention time is small and is accounted for through feature encoding (
      • Ruiz Cuevas M.V.
      • Hardy M.-P.
      • Holly J.
      • Bonneil É.
      • Durette C.
      • Courcelles M.
      • et al.
      Most non-canonical proteins uniquely populate the proteome or immunopeptidome.
      ). However, this is not the case for MS2PIP, as alterations in peptide composition as well as fragmentation patterns and labeling methods heavily alter peak intensity patterns (
      • Gabriels R.
      • Martens L.
      • Degroeve S.
      Updated MS2PIP web server delivers fast and accurate MS2 peak intensity prediction for multiple fragmentation methods, instruments and labeling techniques.
      ). Therefore, we here present greatly improved MS2PIP models that include immunopeptides and nontryptic peptides in general. Moreover, we have integrated MS2PIP and DeepLC with Percolator into the free and open-source MS2Rescore software tool, which enables improved rescoring of peptide identifications from various proteomics search engines. Altogether, we show that well-adapted fragmentation spectrum and retention time predictions integrated into MS2Rescore drastically increase immunopeptide identification rates and outperform existing postprocessing methods.

      Experimental Procedures

      Training and Evaluation of New MS2PIP Spectrum Prediction Models for Immunopeptides

      To train and test new MS2PIP models, five publicly available immunopeptide data sets and one publicly available chymotrypsin-digestion data set were downloaded from PRIDE Archive (
      • Martens L.
      • Hermjakob H.
      • Jones P.
      • Adamski M.
      • Taylor C.
      • States D.
      • et al.
      PRIDE: the proteomics identifications database.
      ,
      • Perez-Riverol Y.
      • Csordas A.
      • Bai J.
      • Bernal-Llinares M.
      • Hewapathirana S.
      • Kundu D.J.
      • et al.
      The PRIDE database and related tools and resources in 2019: improving support for quantification data.
      ). Similarly, for evaluation on representative unseen data, four distinct data sets were downloaded: (i) a data set containing HLA-I immunopeptides, (ii) a data set containing HLA-II immunopeptides, (iii) the data set with tryptic peptides that was previously used to evaluate the existing MS2PIP higher-energy collision-induced dissociation (HCD) models, and (iv) a data set containing chymotrypsin-digested peptide data. The corresponding ProteomeXchange project identifiers as well as the number of unique peptides and HLA patterns for each data set are listed in supplemental Table S1.
      As tandem mass spectrometry (MS2) fragmentation patterns are highly dependent on the instrument, instrument settings, fragmentation method, and any applied labeling methods (
      • Gabriels R.
      • Martens L.
      • Degroeve S.
      Updated MS2PIP web server delivers fast and accurate MS2 peak intensity prediction for multiple fragmentation methods, instruments and labeling techniques.
      ), all MS2PIP train, test, and evaluation data must originate from experiments with the same experimental parameters. Unlabeled HCD data from Quadrupole-Orbitrap instruments was used, as this makes the newly trained models widely applicable and plenty of training data is readily available on public repositories. For each PRIDE Archive project, the raw mass spectrometry files were converted to Mascot Generic Format (MGF) files using ThermoRawFileParser (v1.3.4) (
      • Hulstaert N.
      • Shofstahl J.
      • Sachsenberg T.
      • Walzer M.
      • Barsnes H.
      • Martens L.
      • et al.
      ThermoRawFileParser: modular, scalable, and cross-platform RAW file conversion.
      ). The corresponding identification files were converted to MS2PIP input files using custom Python scripts and were further filtered to retain unique combinations of peptide sequence, modifications, and precursor charge at 1% FDR. Next, all spectra were combined into one MGF file. The universal spectrum identifier (
      • Deutsch E.W.
      • Perez-Riverol Y.
      • Carver J.
      • Kawano S.
      • Mendoza L.
      • Van Den Bossche T.
      • et al.
      Universal spectrum identifier for mass spectra.
      ) was used as unique identifier for each PSM to ensure reproducibility and a one-on-one mapping between peptide identifications and spectra. Data from each PRIDE Archive project was either used as train/test data or as evaluation data to ensure fully independent data sets, except for the chymotrypsin data, where the same project was used to provide both training/testing data (70%) and evaluation data (30%). This split was made after selecting unique peptide-modification-charge combinations, to ensure no overlap in samples between both splits.
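      The deduplicate-then-split step described above can be sketched as follows. This is a minimal illustration, not the authors' script: the dictionary keys (`sequence`, `modifications`, `charge`), the function name, and the fixed random seed are all assumptions made for the example.

```python
import random


def split_unique_psms(psms, eval_fraction=0.3, seed=42):
    """Deduplicate PSMs by (sequence, modifications, charge), then split.

    Splitting after deduplication guarantees that no peptide-modification-charge
    combination ends up in both the train/test set and the evaluation set.
    """
    unique = {}
    for psm in psms:
        key = (psm["sequence"], psm["modifications"], psm["charge"])
        unique.setdefault(key, psm)  # keep the first PSM seen for each key

    keys = sorted(unique)  # fixed order so the shuffle is reproducible
    random.Random(seed).shuffle(keys)

    n_eval = round(len(keys) * eval_fraction)
    eval_keys = set(keys[:n_eval])
    train = [unique[k] for k in keys if k not in eval_keys]
    evaluation = [unique[k] for k in keys if k in eval_keys]
    return train, evaluation


# Hypothetical input: ten unique peptidoforms plus four duplicate PSMs.
example = [
    {"sequence": f"PEPTIDE{chr(65 + i)}", "modifications": "", "charge": 2}
    for i in range(10)
]
example += example[:4]
train_set, eval_set = split_unique_psms(example)
print(len(train_set), len(eval_set))
```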
      Similarly to the 2019 MS2PIP models, new models were trained with the XGBoost machine learning algorithm (
      • Chen T.
      • Guestrin C.
      ). The Hyperopt (
      • Bergstra J.
      • Yamins D.
      • Cox D.D.
      30th international conference on machine learning.
      ) package (v0.2.5) was used in combination with four-fold cross-validation for hyperparameter optimization, allowing 400 boosting rounds and early stopping fixed at 10 rounds. Hyperparameters were optimized for each training data set separately, as well as for b- and y-ion models. All selected hyperparameters are shown in supplemental Table S2. To evaluate each model, the Pearson correlation coefficient (PCC) was calculated between observed and predicted b- and y-ion peak intensities for each spectrum. The model performances were further analyzed by peptide length and precursor charge. In total, three models were trained: the immunopeptide model solely trained on immunopeptides, the immuno-chymotrypsin model trained on immunopeptides supplemented with chymotrypsin-digested peptides, and the nontryptic immunopeptide model solely trained on nontryptic immunopeptides. Ultimately, two models were integrated into MS2PIP: (i) the immunopeptide model and (ii) the immuno-chymotrypsin model. The former can be used for immunopeptide peak intensity predictions and the latter for tryptic and more general nontryptic peptide predictions. For further analysis into rescoring immunopeptide PSMs, only the immunopeptide model was used, as it showed the best performance for both HLA-I and HLA-II immunopeptides.
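      The per-spectrum evaluation metric can be illustrated with a short sketch. It assumes that observed and predicted b- and y-ion intensities are available as two equally long lists on the same scale. The function name and the example values are hypothetical, and the hyperparameter search itself is not shown.

```python
import math
import statistics


def pearson_correlation(observed, predicted):
    """Pearson correlation between two equally long intensity vectors."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return cov / math.sqrt(var_o * var_p)


# Hypothetical spectra: (observed, predicted) intensity pairs.
spectra = [
    ([0.1, 0.4, 0.8, 0.2], [0.2, 0.5, 0.7, 0.1]),
    ([0.9, 0.1, 0.3, 0.6], [0.8, 0.2, 0.2, 0.7]),
]
pccs = [pearson_correlation(obs, pred) for obs, pred in spectra]
print(statistics.median(pccs))  # a model is summarized by its median PCC
```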
      To compare MS2PIP predictions with Prosit predictions, the same evaluation data sets were used as mentioned above. Prosit (v1.1.2) was downloaded from GitHub (https://github.com/kusterlab/prosit) and the hcd_hla and irt_prediction models were downloaded from Figshare (https://figshare.com/projects/prosit/35582). MS2PIP predictions were acquired for the general proteomics and chymotrypsin evaluation data with the immuno-chymotrypsin model and for the HLA-I and HLA-II evaluation data with the immunopeptide model. Peptides that were not included in the Prosit output were filtered out of the MS2PIP predictions. The performance was measured in both PCC and spectral angle to ensure a thorough comparison. Only correlations for singly charged fragment ions were taken into account, as the newly trained MS2PIP models only predict intensities for these ions.
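      The text does not spell out the spectral angle formula. The sketch below assumes the normalized spectral angle that is commonly reported alongside Prosit, that is, 1 - 2·arccos(cosine similarity)/π, so that identical spectra score 1 and orthogonal spectra score 0. The function name and the zero-vector convention are assumptions of this example.

```python
import math


def spectral_angle(observed, predicted):
    """Normalized spectral angle between two intensity vectors (assumed form)."""
    dot = sum(o * p for o, p in zip(observed, predicted))
    norm_o = math.sqrt(sum(o * o for o in observed))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    if norm_o == 0 or norm_p == 0:
        return 0.0  # convention chosen here for empty or all-zero spectra
    # Clamp to [-1, 1] to guard against floating-point overshoot before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_o * norm_p)))
    return 1 - 2 * math.acos(cosine) / math.pi


print(spectral_angle([0.1, 0.4, 0.8], [0.2, 0.5, 0.7]))
```

      Unlike the PCC, this measure is not mean-centered, which is one reason to report both metrics side by side.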

      Evaluation of MS2Rescore on HLA Class I Peptides and Comparison with Prosit Rescoring

      To validate the capacity of the new MS2PIP models to improve immunopeptide identification rates, the new models were implemented with DeepLC (v0.1.36) and Percolator (v3.5) into MS2Rescore. MS2Rescore calculates various meaningful features based on (i) the search engine output, (ii) the DeepLC-predicted and the observed retention times, and (iii) the MS2PIP-predicted and the observed MS2 peak intensities. These features are then passed to Percolator for PSM rescoring. Search engine features were selected based on the previous publication by Granholm et al. (
      • Granholm V.
      • Kim S.
      • Navarro J.C.F.
      • Sjölund E.
      • Smith R.D.
      • Käll L.
      Fast and accurate database searches with MS-GF+percolator.
      ) and replicated for use with MaxQuant search results (
      • Tyanova S.
      • Temu T.
      • Cox J.
      The MaxQuant computational platform for mass spectrometry-based shotgun proteomics.
      ). MS2PIP features were used as first described by Silva et al. (
      • Silva A.S.C.
      • Bouwmeester R.
      • Martens L.
      • Degroeve S.
      Accurate peptide fragmentation predictions allow data driven approaches to replace and improve upon proteomics search engine scoring functions.
      ). All features generated by MS2Rescore are listed in supplemental Table S3.
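      To make the feature hand-off concrete, the sketch below writes a few PSMs to Percolator's tab-delimited input layout (SpecId, Label, ScanNr, feature columns, Peptide, Proteins). The three feature names and the PSM dictionary keys are hypothetical stand-ins, not the actual MS2Rescore feature set, which is listed in supplemental Table S3.

```python
import csv
import io

# Hypothetical feature names, one per feature group described in the text.
FEATURES = ["search_engine_score", "abs_rt_error", "spectrum_pcc"]


def write_pin(psms, handle):
    """Write PSMs with illustrative features in Percolator's tab-delimited layout."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["SpecId", "Label", "ScanNr", *FEATURES, "Peptide", "Proteins"])
    for psm in psms:
        writer.writerow([
            psm["spec_id"],
            -1 if psm["is_decoy"] else 1,  # Percolator labels: 1 target, -1 decoy
            psm["scan"],
            psm["score"],
            abs(psm["observed_rt"] - psm["predicted_rt"]),  # retention time error
            psm["pcc"],  # correlation of predicted vs observed intensities
            f"-.{psm['peptide']}.-",  # flanking residues unknown in this sketch
            psm["protein"],
        ])


example_psm = {
    "spec_id": "run1_10909", "is_decoy": False, "scan": 10909, "score": 85.2,
    "observed_rt": 10.5, "predicted_rt": 10.0, "pcc": 0.94,
    "peptide": "KQHGVNVSV", "protein": "PROT_A",
}
buffer = io.StringIO()
write_pin([example_psm], buffer)
print(buffer.getvalue())
```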
      MS2Rescore was validated on a large-scale HLA class I data set (
      • Sarkizova S.
      • Klaeger S.
      • Le P.M.
      • Li L.W.
      • Oliveira G.
      • Keshishian H.
      • et al.
      A large peptidome dataset improves HLA class I epitope prediction across most of the human population.
      ), which was also used to validate the recently published Prosit-rescoring effort for immunopeptides (PXD021398) (
      • Wilhelm M.
      • Zolg D.P.
      • Graber M.
      • Gessulat S.
      • Schmidt T.
      • Schnatbaum K.
      • et al.
      Deep learning boosts sensitivity of mass spectrometry-based immunopeptidomics.
      ). This allows both an evaluation of the improved identification rates due to the new MS2PIP models and a straightforward comparison with Prosit rescoring. First, the msms.txt identification files for the project’s two MaxQuant searches (alkylated and nonalkylated samples), the Prosit-rescored Percolator output files, and the raw mass spectrometry files were downloaded from PRIDE Archive. The mass spectrometry files were then further processed with ThermoRawFileParser (v1.3.4) (
      • Hulstaert N.
      • Shofstahl J.
      • Sachsenberg T.
      • Walzer M.
      • Barsnes H.
      • Martens L.
      • et al.
      ThermoRawFileParser: modular, scalable, and cross-platform RAW file conversion.
      ) and the PSMs for each of the two MaxQuant searches were rescored separately. Two rescoring methods were evaluated: (i) using only search engine features, replicating a normal Percolator run, and (ii) using the full MS2Rescore feature set, including search engine-, MS2PIP-, and DeepLC-features. Additionally, these rescoring methods were compared with the original MaxQuant results and with the downloaded Prosit-rescoring results.
      Each rescoring method was evaluated at varying FDR thresholds in terms of identification rate and number of unique identified peptides. The contribution of the different feature sets in MS2Rescore was visualized using Percolator’s model weights, and the distributions of retention time difference and MS2PIP prediction correlations were compared between decoy PSMs, accepted target PSMs, and rejected target PSMs.
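      Counting identifications at a given FDR threshold can be sketched with a plain target-decoy estimate. This is a simplified stand-in: Percolator computes its own q-values with a more refined procedure, and the function name and the (score, is_decoy) input layout are assumptions of this example.

```python
def count_accepted_targets(psms, fdr_threshold=0.01):
    """Count target PSMs with q-value <= fdr_threshold.

    psms is an iterable of (score, is_decoy) pairs; higher scores are better.
    FDR at each rank is estimated as #decoys / #targets seen so far.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # q-value: the lowest FDR at which a PSM would still be accepted, that is,
    # the minimum FDR over its own rank and all lower-scoring ranks.
    q_values = fdrs[:]
    for i in range(len(q_values) - 2, -1, -1):
        q_values[i] = min(q_values[i], q_values[i + 1])

    return sum(
        1
        for (_, is_decoy), q in zip(ranked, q_values)
        if not is_decoy and q <= fdr_threshold
    )


# Hypothetical scores: 100 targets, one decoy, 50 more targets, then decoys.
scores = [(1000 - i, False) for i in range(100)]
scores += [(900, True)]
scores += [(899 - i, False) for i in range(50)]
scores += [(849 - i, True) for i in range(12)]
for threshold in (0.01, 0.005):
    print(threshold, count_accepted_targets(scores, threshold))
```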
      Additionally, as reported by Wilhelm et al. (
      • Wilhelm M.
      • Zolg D.P.
      • Graber M.
      • Gessulat S.
      • Schmidt T.
      • Schnatbaum K.
      • et al.
      Deep learning boosts sensitivity of mass spectrometry-based immunopeptidomics.
      ), sequence motif patterns for HLA pattern C∗12:03 were further analyzed with GibbsCluster (v2.0) (
      • Andreatta M.
      • Alvarez B.
      • Nielsen M.
      GibbsCluster: unsupervised clustering and alignment of peptide sequences.
      ), for the gained and lost peptides compared to rescoring with only search engine features.
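      The gained and lost peptide sets used for this comparison follow from simple set operations on the unique sequences accepted by each rescoring method. A minimal sketch, with a hypothetical function name and made-up sequences apart from the example peptide shown in Figure 1:

```python
def compare_identifications(baseline, rescored):
    """Split unique peptide sequences into shared, gained, and lost sets."""
    baseline, rescored = set(baseline), set(rescored)
    return {
        "shared": baseline & rescored,  # accepted by both methods
        "gained": rescored - baseline,  # accepted only after full rescoring
        "lost": baseline - rescored,    # accepted only with search engine features
    }


# Hypothetical accepted peptides at a fixed FDR threshold.
search_engine_only = ["KQHGVNVSV", "AAAPEPTIDE", "LLDVTAAV"]
full_feature_set = ["KQHGVNVSV", "LLDVTAAV", "SLYNTVATL", "GILGFVFTL"]
result = compare_identifications(search_engine_only, full_feature_set)
print({name: sorted(peptides) for name, peptides in result.items()})
```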

      Evaluation of MS2Rescore Across Collision Energy Settings and Peptide Abundances

      To further analyze MS2Rescore performance for various experimental collision energy settings, replicate LC-MS/MS runs were performed on HL60 cells at collision energy values of 25, 27, 30, 32, and 35 NCE (supplemental Methods). The resulting spectra were searched with the Andromeda search engine (MaxQuant v1.6.14.0) against the human UniProtKB-SwissProt (14-09-2020; 20,388 sequences, Taxonomy ID 9606) database without any enzyme specificity. A minimal peptide length of seven amino acids was required. Oxidation (M) was set as variable modification with a maximum of three modifications per peptide. Mass tolerances were set at 5 ppm and 20 ppm for MS1 and MS2 spectra, respectively. FDR was kept at 100% with the use of a decoy strategy for downstream rescoring with (i) only search engine features and (ii) the full MS2Rescore feature set. Furthermore, for all LC-MS/MS runs at all collision energy settings, precursor intensities were obtained from the MaxQuant msms.txt file to assess any differences in the performance of MS2Rescore between low and high abundant peptides.

      Evaluation of MS2Rescore on HLA Class II Peptides

      To validate MS2Rescore for HLA class II peptides, another set of raw mass spectrometry files was downloaded from PRIDE Archive (PXD015408). As the uploaded search engine results were already filtered at 5% FDR, the spectra were reanalyzed with PEAKS DB (v10.5) (
      • Zhang J.
      • Xin L.
      • Shan B.
      • Chen W.
      • Xie M.
      • Yuen D.
      • et al.
      Peaks DB: de novo sequencing assisted database search for sensitive and accurate peptide identification.
      ) with the same search parameters that were used in the original publication, that is, no enzyme specificity, precursor error tolerance of 10 ppm, fragment ion tolerance of 0.01 Da, with oxidation (M), deamidation (NQ), and trioxidation (C) as variable modifications, searched against the UniProtKB-SwissProt database (01-2021, 22,235 sequences, Taxonomy ID 10090). The mzIdentML identification file as well as the corresponding MGF files were exported from PEAKS DB and were rescored with MS2Rescore with (i) only search engine features and (ii) the full MS2Rescore feature set, as described above for the evaluation on HLA class I peptides.

      General Data Processing and Data Visualization

      All plots, unless specified, were generated in Jupyter notebooks (v6.4.0) using Python (v3.8.3) with the Matplotlib (v3.4.2) (
      • Hunter J.D.
      Matplotlib: a 2D graphics environment.
      ), Seaborn (v0.11.0) (
      • Waskom M.
      seaborn: statistical data visualization.
      ), UpSetPlot (v0.6.0), and spectrum_utils (v0.3.5) (
      • Bittremieux W.
      Spectrum-utils: a Python package for mass spectrometry data processing and visualization.
      ) libraries.

      Results

      Newly Trained MS2PIP Models Accurately Predict Immunopeptide Spectrum Peak Intensities

      In order to improve the identification rate of immunopeptides by leveraging peak intensity predictions, new models for MS2PIP were trained specifically for immunopeptides. Despite using different training set compositions, all newly trained models drastically improve predictions for both HLA-I and HLA-II data in comparison with the tryptic 2019 HCD model (Fig. 1A). Surprisingly, even for standard tryptic shotgun proteomics data, the predictions from the new models are slightly better, largely due to the portion of tryptic peptides within the immunopeptide training data. Indeed, when these peptides are left out of the training data, accuracy drops in comparison with the 2019 HCD model. While both immunopeptide models are well suited to predict peak intensities for tryptic peptides and immunopeptides, the performance on chymotrypsin-digested peptides is not as high (supplemental Fig. S1). Thus, even though immunopeptides and chymotrypsin-digested peptides are both considered nontryptic, they are still very different for MS2PIP peak intensity predictions. Overall, immunopeptide peak intensity predictions are drastically improved by all the newly trained models, with the immunopeptide model showing the highest accuracy (median PCC of 0.94). The exact median PCCs are listed in supplemental Table S4. An example prediction with a near-median PCC for the immunopeptide model and the corresponding, less accurate, 2019 model prediction are shown in Figure 1, B and C.
      Fig. 1. Performance evaluation of the new nontryptic MS2PIP models. Boxplots comparing Pearson correlation distributions between predicted and observed peak intensity values on spectrum level for the 2019 HCD model, immunopeptide model, nontryptic immunopeptide model, and immuno-chymotrypsin model. Performance is evaluated on general tryptic proteomics data, HLA-I immunopeptide data, HLA-II immunopeptide data, and chymotrypsin data (A). Predictions for immunopeptide “KQHGVNVSV” by new immunopeptide MS2PIP model (top) compared to observed MS2 spectrum (mzspec:PXD005231:20160513_TIL1_R1:scan:10909) (bottom) (B). This peptide was selected for visualization as its PCC lies close to the median for the immunopeptide model on all HLA-I evaluation data. Predictions by the 2019 HCD model for the same immunopeptide (C) as visualized in B. HCD, higher-energy collision-induced dissociation; MS2, tandem mass spectrometry; PCC, Pearson correlation coefficient.

      MS2Rescore Drastically Improves Immunopeptide Identification Rates

      Ultimately, the goal of these newly trained models is to improve immunopeptide identification rates by providing more accurate peak intensity predictions. Therefore, (i) identification results without rescoring, (ii) rescoring with solely search engine features (replicating a normal Percolator run), (iii) rescoring with MS2Rescore (including DeepLC and the new MS2PIP models), and (iv) rescoring with the recently published Prosit models were compared in terms of the total number of identifications as well as the number of unique identifications based on sequence. Overall, rescoring with both MS2Rescore and Prosit substantially improved the spectrum identification rate in comparison with rescoring with search engine features alone or not rescoring, at both 1% and 0.1% FDR. Indeed, MS2Rescore achieves an identification rate of 11.1%, out of 18 million spectra, compared to 7.6% for traditional rescoring (an increase of 46%), and only 1.9% for the MaxQuant search results, all at 1% FDR (Fig. 2, A and B). Moreover, 83% of the identified spectra at 1% FDR are retained when restricting the threshold to 0.1% FDR. Thus, providing peak intensity and retention time predictions to Percolator substantially increases the number of identified immunopeptides. This is clearly illustrated by analyzing the Percolator weights for each separate feature as well as the combined absolute weights for search engine features, MS2PIP features, and DeepLC features (supplemental Fig. S2). Similarly, the number of unique identified immunopeptides increases by 36% when adding MS2PIP and DeepLC features at 1% FDR, and even more so at 0.1% FDR, where the number of unique identified peptides reaches nearly 300% of the number of traditional Percolator identification results (Fig. 2, C–F). These gains are consistent across all 95 HLA class I alleles included in the data (supplemental Figs. S3–S4), showing that the newly trained MS2PIP model, and therefore MS2Rescore, is generalizable across different HLA types. In fact, MS2Rescore allows for a substantial increase in identification rate for HLA types with initially fewer identifications (e.g., A0101, A0204, B4402…), indicating that MS2Rescore especially improves the peptide identification coverage for harder-to-identify HLA alleles.
      Fig. 2. Percentage of identified spectra and unique identified peptides using different rescoring methods (PXD021398). Bar charts showing the spectrum identification rate out of 18,375,659 spectra (A and B), the total number of unique identified peptides in terms of sequence (C and D), and the shared (blue), gained (green), and lost (red) number of unique (by sequence) identified immunopeptides in relation to rescoring with only search engine features (E and F). All results are shown for the 1% FDR (A, C, and E) and 0.1% FDR (B, D, and F) thresholds. FDR, false discovery rate.
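      The relative gains quoted above follow directly from the reported identification rates. The short Python sketch below redoes that arithmetic using only the figures stated in the text (11.1%, 7.6%, and 1.9% of 18,375,659 spectra at 1% FDR). The function name relative_gain is introduced here for illustration, and the spectrum counts it prints are approximations, because they are derived from rounded percentages.

```python
# Identification rates (percent of all spectra) reported in the text at 1% FDR.
TOTAL_SPECTRA = 18_375_659
rates = {
    "MaxQuant, no rescoring": 1.9,
    "search engine features only": 7.6,
    "MS2Rescore": 11.1,
}


def relative_gain(new: float, baseline: float) -> float:
    """Relative increase of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


gain = relative_gain(rates["MS2Rescore"], rates["search engine features only"])
print(f"Relative gain over search engine rescoring: {gain:.0f}%")

# Approximate spectrum counts; the input percentages are rounded in the text.
for name, rate in rates.items():
    approx_spectra = round(TOTAL_SPECTRA * rate / 100)
    print(f"{name}: ~{approx_spectra:,} identified spectra")
```

      The first printed line reproduces the 46% increase stated in the text.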
      The power of providing these predictions to Percolator is further illustrated when visualizing the distributions for decoy PSMs, rejected target PSMs, and accepted target PSMs. Indeed, the distributions for decoy and rejected target PSMs are highly similar for both the retention time error and the PCC, while the accepted target PSMs accumulate around low retention time errors and high PCCs (Fig. 3, A and B). The accepted target PSMs are clearly separable from the decoy and rejected target PSMs using only the PCC and retention time error distributions (Fig. 3, C and D). Furthermore, while both metrics correlate with the search engine score, a large number of decoy and rejected target PSMs can only be separated from the accepted target PSMs by also including PCC or retention time error information (Fig. 3, E and F). This clearly illustrates how Percolator achieves its much-improved separation between true and false target PSMs when provided with peak intensity and retention time prediction features. Furthermore, PSMs that previously would have been incorrectly accepted below a 1% FDR because of a high search engine score alone are now rejected due to a low PCC, a high retention time error, or both. This most likely accounts for the small percentage of identified peptides that are lost after rescoring.
      Fig. 3. Distributions of MS2PIP-, DeepLC-, and search engine-based features. Density plots showing the distribution of the smallest retention time error between observed and predicted retention time (A) and the Pearson correlation between observed and predicted peak intensities (B) for each PSM, split into decoys (red), rejected targets with q value >0.01 (blue), and accepted targets with q value <0.01 (green) (note that the rejected targets distribution coincides with the decoy distribution). Scatterplots showing the relation between observed and predicted retention time (C), between Pearson correlation and retention time error (D), between retention time error and search engine score (E), and between Pearson correlation and search engine score (F) for decoys (red), rejected targets (blue), and accepted targets (green) for all PSMs from “GN20170531_SK_HLA_C0102_R1_01”. Note that all retention time errors are the smallest retention time error for each precursor (see rt_diff_best), all Pearson correlation coefficients are calculated for each full log2-transformed spectrum, and all zero values are caused by the fact that either the observed or predicted intensities for a given ion type are all zero. PSMs, peptide-spectrum matches.
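      To make the two features shown in Fig. 3 concrete, the following Python sketch computes a Pearson correlation on log2-transformed intensities and the smallest retention time error for a precursor. It is an illustration under stated assumptions, not the MS2Rescore implementation: the pseudocount added before the log2 transform, the function names, and the toy values are all chosen here. Returning 0.0 when one of the vectors has no variance mirrors the zero values mentioned in the caption.

```python
import math


def log2_transform(intensities, pseudocount=0.001):
    """Log2-transform intensities; the pseudocount (an assumption) avoids log2(0)."""
    return [math.log2(value + pseudocount) for value in intensities]


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector has zero variance."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # e.g. all observed or all predicted intensities are zero
    return cov / math.sqrt(var_x * var_y)


def spectrum_pcc(observed, predicted):
    """PCC between observed and predicted intensities after log2 transformation."""
    return pearson(log2_transform(observed), log2_transform(predicted))


def best_rt_error(observed_rts, predicted_rt):
    """Smallest absolute retention time error over all observations of a precursor."""
    return min(abs(rt - predicted_rt) for rt in observed_rts)


# Toy values, for illustration only.
observed = [0.0, 0.2, 0.5, 0.1]
predicted = [0.01, 0.25, 0.45, 0.05]
print(f"PCC: {spectrum_pcc(observed, predicted):.3f}")
print(f"Best RT error: {best_rt_error([10.2, 12.5, 30.0], 12.0):.2f}")
```

      A PSM with a high search engine score but a low PCC or a large retention time error would stand out in this feature space, which is the separation that Fig. 3, E and F visualize.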

      MS2Rescore Outperforms the Current State-of-the-Art

      The integration of MS2PIP-, DeepLC-, and search engine-based features in MS2Rescore has been shown to substantially increase the identification rate of immunopeptides and, furthermore, outperforms the recently published Prosit-rescoring method (Wilhelm et al., 2021). In comparison with Prosit, MS2Rescore gains 5% and 35% more identifications at 1% FDR and 0.1% FDR, respectively. This trend continues for the number of unique identified peptides, with respective increases of 8% and 57% (Fig. 2). Indeed, over 32,000 unique peptides were identified below the 0.1% FDR threshold by MS2Rescore, while Prosit rescoring only identified these peptides at the 1% FDR threshold, highlighting the gain in confidence in terms of unique identified immunopeptides (supplemental Fig. S5). MS2Rescore thus substantially increases the identification rate, especially for more stringent FDR thresholds.
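      The comparisons above count identifications at fixed FDR thresholds. As a reminder of how such counts arise, the sketch below shows a simplified target-decoy q-value calculation in Python. It is not the procedure implemented in Percolator, which uses its own scoring and statistics. The function names and the toy scores are made up for illustration.

```python
def q_values(psms):
    """Simplified target-decoy q-values.

    psms: list of (score, is_decoy) tuples; a higher score is better.
    Returns (score, is_decoy, q_value) tuples sorted by descending score.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR when accepting everything down to this score.
        fdrs.append(decoys / max(targets, 1))
    # q-value: lowest FDR at which this PSM would still be accepted.
    qs = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        qs.append(running_min)
    qs.reverse()
    return [(score, is_decoy, q) for (score, is_decoy), q in zip(ranked, qs)]


def count_accepted_targets(psms, threshold):
    """Number of target PSMs with a q-value at or below the threshold."""
    return sum(
        1 for _, is_decoy, q in q_values(psms) if not is_decoy and q <= threshold
    )


# Toy scores, for illustration only.
toy_psms = [(9.0, False), (8.0, False), (7.0, False),
            (6.0, True), (5.0, False), (4.0, True)]
print(count_accepted_targets(toy_psms, 0.01))
print(count_accepted_targets(toy_psms, 0.30))
```

      A rescoring method that separates targets from decoys better pushes decoys further down the ranking, so more targets are accepted before the estimated FDR reaches a stringent threshold such as 0.1%.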
      Peptides for which Prosit cannot predict MS2 peak intensities, that is, peptides carrying unmodified cysteine (C), cysteinylation (C), or acetylation (N-terminus), were left out of the Prosit-rescoring output. That is why MS2Rescore includes more PSMs in the unfiltered data set at 100% FDR, especially for the noIAA sample (supplemental Fig. S6, A and B). For a second, more thorough comparison, these PSMs were left out of the MS2Rescore output as well. However, this had a negligible impact on the number of identifications at 1% and 0.1% FDR (supplemental Fig. S6, C and D), and thus the difference in identification rates cannot be attributed to these filtered peptides.
      Furthermore, a comparison of the spectrum prediction accuracy of the newly trained MS2PIP models and Prosit indicates that the higher performance of MS2Rescore cannot be attributed to improved peak intensity predictions. Indeed, depending on the correlation metric that is used, either MS2PIP or Prosit performs slightly better than the other on the evaluation data (supplemental Fig. S7). It is important to note, however, that for Prosit, the correct collision energy (CE) value should be selected for optimal performance. To determine whether the difference in rescoring between MS2Rescore and Prosit is driven by a difference in features, all MS2Rescore features that do not have a close counterpart in Prosit rescoring were removed for a separate run. This includes all DeepLC retention time features and various search engine features (supplemental Table S3). As MS2Rescore and Prosit both include various similar peak intensity prediction features, these were retained. With this reduced feature set, the performance of MS2Rescore is slightly lower than that of Prosit rescoring (supplemental Fig. S6, A–D), which confirms that the retention time features and additional search engine features result in improved MS2Rescore performance over Prosit rescoring.
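      The feature ablation described above amounts to removing selected columns from the PSM feature table before it is passed to Percolator. The Python sketch below illustrates this step under loud assumptions: apart from rt_diff_best, which is mentioned in the Fig. 3 caption, all feature names are hypothetical placeholders, the values are made up, and the actual list of removed features is the one given in supplemental Table S3.

```python
# Hypothetical feature names; only `rt_diff_best` appears in the text.
DEEPLC_FEATURES = {"rt_diff_best", "rt_diff"}
UNMATCHED_SEARCH_ENGINE_FEATURES = {"mass_error_ppm"}


def drop_features(psm_rows, names_to_drop):
    """Return copies of the PSM feature rows without the listed features."""
    return [
        {name: value for name, value in row.items() if name not in names_to_drop}
        for row in psm_rows
    ]


# Toy PSM feature rows (made-up values, for illustration only).
psm_rows = [
    {"search_engine_score": 120.5, "mass_error_ppm": 1.2,
     "spec_pearson": 0.91, "rt_diff_best": 0.4, "rt_diff": 0.6},
    {"search_engine_score": 45.0, "mass_error_ppm": -3.8,
     "spec_pearson": 0.12, "rt_diff_best": 7.9, "rt_diff": 8.3},
]

reduced_rows = drop_features(
    psm_rows, DEEPLC_FEATURES | UNMATCHED_SEARCH_ENGINE_FEATURES
)
print(sorted(reduced_rows[0]))
```

      Rescoring once with the full rows and once with the reduced rows, on the same PSMs, isolates the contribution of the removed features, which is the logic of the comparison reported in supplemental Fig. S6.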
      Similarly to Prosit, the identified sequence motif for HLA type C∗12:03 was highly similar to the motif reported in the original publication of the data set (Wilhelm et al., 2021; Sarkizova et al., 2020), while the peptides that were removed by MS2Rescore compared to search engine rescoring showed quite different, less conserved sequence motifs (supplemental Fig. S8).

      MS2Rescore Generalizes Well Across Collision Energy Settings and Peptide Abundances

      Because MS2PIP does not account for CE in its predictions, MS2PIP and consequently MS2Rescore could potentially be biased toward spectra obtained with certain CE values. Therefore, search results of replicate mass spectrometry runs with CE settings varying from 25 to 35 were postprocessed with MS2Rescore. For each CE value, MS2Rescore shows a significant increase in identification rate. However, for larger (less optimal) CE values, the overall identification rate decreases (supplemental Fig. S9A). This is most likely due to a reduced quality of the fragmentation spectra, which is reflected in the decreased explained ion current and is in line with the b- and y-ion MS2PIP PCC distributions for accepted PSMs when using suboptimal CE values (supplemental Fig. S9, D–F). Most interestingly, the relative gain in unique identified peptides for MS2Rescore increases for higher (and therefore less optimal) CE values, approaching a 60% increase for CE 35. This is achieved by slightly shifting the feature weights away from fragmentation features in favor of DeepLC retention time features (supplemental Fig. S9, B and C). Consequently, MS2Rescore is able to recover peptides that would otherwise be lost due to lower-quality fragmentation spectra.
      A similar effect is observed for low-abundance peptides. Indeed, the largest relative gain achieved by MS2Rescore in terms of the number of unique identified peptides is seen for the lowest precursor intensities (supplemental Fig. S10, A and B), where traditional rescoring fails to recover most identifications. MS2Rescore is thus not only able to increase the number of identifications for immunopeptides in general, but can also recover peptides previously lost due to low precursor intensities, and thus lower-quality spectra, or nonoptimal instrument settings.

      MS2Rescore is Unbiased to Different HLA Classes

      To evaluate the performance of MS2Rescore on HLA class II peptides, another publicly available data set was reanalyzed. However, while for the HLA class I data set human immunopeptides and MaxQuant search engine (Cox et al., 2011) results were used, for this HLA class II data set, mouse data were searched with PEAKS DB (Zhang et al., 2012). As was the case for HLA class I peptides, MS2Rescore significantly increases the identification rate for HLA class II peptides, by 15% and 57% at the 1% and 0.1% FDR thresholds, respectively (supplemental Fig. S11). These increases are, however, slightly lower than for the HLA class I data set. Moreover, where conventional rescoring previously showed a significant increase in comparison with the search engine results without rescoring, here, the gain in comparison with no rescoring is lower for both the identification rate and the number of unique identified peptides. This is likely due to (i) the less extensive search engine features that are calculated for the PEAKS DB pipeline in MS2Rescore (supplemental Table S3) and (ii) the fact that PEAKS DB is likely better equipped to identify immunopeptides than MaxQuant due to its de novo–assisted database search (Zhang et al., 2012). Nevertheless, the full MS2Rescore feature set, including peak intensity and retention time predictions, still results in a significantly higher identification rate. Altogether, these results show that MS2Rescore generalizes well across HLA class I and class II immunopeptides and across different species, and that it can boost the performance of different search engines.

      Discussion

      By training new peak intensity prediction models, we were able to greatly enhance the immunopeptide identification rate through PSM rescoring. While all newly trained MS2PIP models greatly enhance peak intensity predictions for immunopeptides, the model trained solely on immunopeptides performed best. Even though the immuno-chymotrypsin model was trained on the same immunopeptide training set, the addition of the chymotrypsin-digested peptides slightly lowered its performance. Similarly, not including chymotrypsin-digested peptides in the training data resulted in lower accuracies for the chymotrypsin-digested peptides. Indeed, immunopeptides are generally much smaller and consequently carry a lower charge state; as a result, these immunopeptide-specific MS2PIP models are not able to predict the behavior of longer and higher charged peptides in the mass spectrometer. While both immunopeptides and chymotrypsin-digested peptides are considered nontryptic, their properties can be very different, leading to reduced accuracy of MS2PIP peak intensity predictions when a model is applied to a different type of nontryptic peptides. Interestingly, while immunopeptides and chymotrypsin-digested peptides are antagonistic, immunopeptides and tryptic peptides seem synergistic in terms of training data. This synergy can be explained by the fact that almost 50% of the immunopeptide training data consists of tryptic peptides. Immunopeptides are thus not necessarily nontryptic. However, the actual occurrence of tryptic peptides in immunopeptidomics samples is most likely much lower. This unrepresentatively large tryptic portion most likely originates from the tryptic bias in current immunopeptidomics workflows. Indeed, in previous studies, tryptic MHC peptide coverage could rise to 70% (Chen et al., 2018). By training new, nontryptic models of MS2PIP, we take a first step in decreasing this tryptic bias to ultimately be able to analyze an unbiased immunopeptide landscape.
      Moreover, by integrating the new immunopeptide model with retention time predictions and search engine features into MS2Rescore, we greatly enhanced the ability of Percolator to rescore immunopeptide PSMs, resulting in a much-improved immunopeptide identification workflow. Furthermore, rescoring drastically increases the number of unique identified peptides, which is of crucial importance for the discovery of potential neo-epitopes for cancer vaccination or xeno-epitopes for antibacterial and, to a lesser extent, antiviral vaccines. Moreover, while previously almost no identifications were found at a more confident 0.1% FDR threshold, MS2Rescore allows a lowering of the FDR threshold to 0.1% while retaining 83% of the peptides identified at 1% FDR. This illustrates the large increase in confidence of the identified PSMs that MS2Rescore introduces. Besides the increase in both PSM confidence and identification rate, MS2Rescore has been shown to be unbiased with regard to HLA patterns and CE settings. Most importantly, the relative identification gain introduced by MS2Rescore is even larger for HLA patterns that initially had fewer identifications, showing that MS2Rescore is able to broaden the view on the immunopeptide landscape for traditionally harder-to-identify HLA patterns. Moreover, MS2Rescore is able to recover peptide identifications that would have been lost due to lower-quality spectra by making use of DeepLC retention time features and can therefore recover substantial additional identifications for low-abundance peptides. This potentially enables the recovery of biologically relevant neo- or xeno-epitopes that occur less frequently in the sample. Furthermore, MS2Rescore is able to gain immunopeptide identifications regardless of the search engine used, for both HLA class I and class II peptides, and across different species.
      Additionally, MS2Rescore with DeepLC and the new immunopeptide MS2PIP models shows an improved identification rate over the recently published Prosit effort, especially for lower FDR thresholds. As Prosit has been shown to provide more accurate predictions compared to previous MS2PIP models, it is unlikely that MS2Rescore’s higher performance can be attributed to superior peak intensity predictions. Indeed, the peak intensity prediction accuracies of the new MS2PIP models and of Prosit are highly similar for immunopeptides (supplemental Fig. S7), even when Prosit has been optimized for the right CE. These negligible differences in peak intensity prediction correlations are therefore likely not the reason for the higher performance of MS2Rescore over Prosit. Instead, it is more likely that the main difference in rescoring performance is the result of the generation of more relevant MS2PIP-, DeepLC-, and search engine–derived features. Indeed, when the majority of the search engine features and all DeepLC retention time features were omitted, reflecting the more limited Prosit feature set, the performance of MS2Rescore drops as well (supplemental Fig. S6). By providing a more extensive feature set, MS2Rescore creates a unique feature space that allows Percolator to separate true from false identifications much better than when provided with limited features without retention time or peak intensity information (Fig. 3). The combination of all these calculated features is therefore likely to be the driver of MS2Rescore’s superior performance.
      MS2Rescore is freely available under the permissive Apache 2.0 open-source license on GitHub (https://github.com/compomics/ms2rescore) and can easily be installed locally through the cross-platform PyPI Python package as well as with a standalone Windows install script. Both a command line interface and a graphical user interface are available, various identification files from different search engines are accepted, and both MS2PIP and DeepLC can handle a variety of modifications, eliminating the need to filter identification files before rescoring. Altogether, these new models show great promise to substantially extend the immunopeptide landscape in existing and future immunopeptidomics experiments.

      Data Availability

      MS2Rescore is available at https://github.com/compomics/ms2rescore. All additional code used in this work is available at https://github.com/compomics/ms2rescore-immunopeptidomics-manuscript. The data used for training and evaluation of the newly trained models, the models themselves, and the MS2Rescore output are available on Zenodo at https://doi.org/10.5281/zenodo.6532013. Additional mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE (Perez-Riverol et al., 2019) partner repository with the data set identifier PXD033868.

      Supplemental data

      This article contains supplemental data (
      • Bassani-Sternberg M.
      • Pletscher-Frankild S.
      • Jensen L.J.
      • Mann M.
      Mass spectrometry of human leukocyte antigen class i peptidomes reveals strong effects of protein abundance and turnover on antigen presentation.
      ,
      • Sarkizova S.
      • Klaeger S.
      • Le P.M.
      • Li L.W.
      • Oliveira G.
      • Keshishian H.
      • et al.
      A large peptidome dataset improves HLA class I epitope prediction across most of the human population.
      ,
      • Racle J.
      • Michaux J.
      • Rockinger G.A.
      • Arnaud M.
      • Bobisse S.
      • Chong C.
      • et al.
      Robust prediction of HLA class II epitopes by deep motif deconvolution of immunopeptidomes.
      ,
      • Chong C.
      • Marino F.
      • Pak H.
      • Racle J.
      • Daniel R.T.
      • Müller M.
      • et al.
      High-throughput and sensitive immunopeptidomics platform reveals profound interferon γ-mediated remodeling of the human leukocyte antigen (HLA) ligandome.
      ,
      • Gfeller D.
      • Guillaume P.
      • Michaux J.
      • Pak H.-S.
      • Daniel R.T.
      • Racle J.
      • et al.
      The length distribution and multiple specificity of naturally presented HLA-I ligands.
      ,
      • Bassani-Sternberg M.
      • Bräunlein E.
      • Klar R.
      • Engleitner T.
      • Sinitcyn P.
      • Audehm S.
      • et al.
      Direct identification of clinically relevant neoepitopes presented on native human melanoma tissue by mass spectrometry.
      ,
      • Wang D.
      • Eraslan B.
      • Wieland T.
      • Hallström B.
      • Hopf T.
      • Zolg D.P.
      • et al.
      A deep proteome and transcriptome abundance atlas of 29 healthy human tissues.
      ,
      • Bassani-Sternberg M.
      • Chong C.
      • Guillaume P.
      • Solleder M.
      • Pak H.S.
      • Gannon P.O.
      • et al.
      Deciphering HLA-I motifs across HLA peptidomes improves neo-antigen predictions and identifies allostery regulating HLA specificity.
      ,
      • Marino F.
      • Semilietof A.
      • Michaux J.
      • Pak H.S.
      • Coukos G.
      • Müller M.
      • et al.
      Biogenesis of HLA ligand presentation in immune cells upon activation reveals changes in peptide length preference.
      ,
      • Gravina F.
      • Sanchuki H.S.
      • Rodrigues T.E.
      • Gerhardt E.C.M.
      • Pedrosa F.O.
      • Souza E.M.
      • et al.
      Proteome analysis of an Escherichia coli ptsN-null strain under different nitrogen regimes.
      ).

      Conflict of interest

      The authors declare that they have no conflict of interest with the contents of the article.

      Acknowledgments

      NanoLC-MS/MS instruments were supported by the French Proteomic Infrastructure (ProFI FR2048; ANR-10-INBS-08-03).

      Funding and additional information

      R. B. acknowledges funding from the Vlaams Agentschap Innoveren en Ondernemen under project number HBC.2020.2205; R. G. received funding from the Research Foundation Flanders (FWO) [1S50918N]. S. D. and L. M. acknowledge funding from the European Union’s Horizon 2020 Programme (H2020-INFRAIA-2018-1) [823839]; L. M. acknowledges funding from the Research Foundation Flanders (FWO) [G028821N] and from Ghent University Concerted Research Action [BOF21/GOA/033]. A. D. received funding from the Research Foundation Flanders (FWO) [1SE3722].

      Author contributions

      A. D., A. H., and R. G. methodology; A. D. software; A. D., R. B., and R. G. validation; A. D. and R. G. formal analysis; A. D. and R. G. writing–original draft; A. D., R. B., A. H., C. C., S. D., L. M., and R. G. writing–review and editing; A. D., R. B., C. C., S. D., L. M., and R. G. funding acquisition; A. D., R. B., S. D., L. M., and R. G. conceptualization; A. D., A. H., C. C., and R. G. investigation; C. C. and L. M. resources; L. M. and R. G. project administration; R. G. supervision.

      References

        • Sattler S.
        Advances in Experimental Medicine and Biology. Springer New York LLC, New York; 2017: 3-14
        • Raoufi E.
        • Hemmati M.
        • Eftekhari S.
        • Khaksaran K.
        • Mahmodi Z.
        • Farajollahi M.M.
        • et al.
        Epitope prediction by novel immunoinformatics approach: a state-of-the-art Review.
        Int. J. Pept. Res. Ther. 2020; 26: 1155-1163
        • Mayer R.L.
        • Impens F.
        Immunopeptidomics for next-generation bacterial vaccine development.
        Trends Microbiol. 2021; 29: 1034-1045
        • Larsen M.V.
        • Lundegaard C.
        • Lamberth K.
        • Buus S.
        • Lund O.
        • Nielsen M.
        Large-scale validation of methods for cytotoxic T-lymphocyte epitope prediction.
        BMC Bioinformatics. 2007; 8: 1-12
        • Zhang H.
        • Lundegaard C.
        • Nielsen M.
        Pan-specific MHC class I predictors: a benchmark of HLA class I pan-specific prediction methods.
        Bioinformatics. 2009; 25: 83-89
        • Bassani-Sternberg M.
        • Pletscher-Frankild S.
        • Jensen L.J.
        • Mann M.
        Mass spectrometry of human leukocyte antigen class i peptidomes reveals strong effects of protein abundance and turnover on antigen presentation.
        Mol. Cell Proteomics. 2015; 14: 658-673
        • Solleder M.
        • Guillaume P.
        • Racle J.
        • Michaux J.
        • Pak H.S.
        • Müller M.
        • et al.
        Mass spectrometry based immunopeptidomics leads to robust predictions of phosphorylated HLA class I ligands.
        Mol. Cell Proteomics. 2020; 19: 390-404
        • Faridi P.
        • Purcell A.W.
        • Croft N.P.
        In Immunopeptidomics we need a sniper instead of a shotgun.
        Proteomics. 2018; 18: e1700464
        • Pfammatter S.
        • Bonneil E.
        • Lanoix J.
        • Vincent K.
        • Hardy M.-P.P.
        • Courcelles M.
        • et al.
        Extending the comprehensiveness of immunopeptidome analyses using isobaric peptide labeling.
        Anal. Chem. 2020; 92: 9194-9204
        • Purcell A.W.
        • Ramarathinam S.H.
        • Ternette N.
        Mass spectrometry–based identification of MHC-bound peptides for immunopeptidomics.
        Nat. Protoc. 2019; 14: 1687-1707
        • Eng J.K.
        • McCormack A.L.
        • Yates J.R.
        An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database.
        J. Am. Soc. Mass Spectrom. 1994; 5: 976-989
        • Craig R.
        • Beavis R.C.
        TANDEM: matching proteins with tandem mass spectra.
        Bioinformatics. 2004; 20: 1466-1467
        • Cox J.
        • Neuhauser N.
        • Michalski A.
        • Scheltema R.A.
        • Olsen J.V.
        • Mann M.
        Andromeda: a peptide search engine integrated into the MaxQuant environment.
        J. Proteome Res. 2011; 10: 1794-1805
        • Zhang J.
        • Xin L.
        • Shan B.
        • Chen W.
        • Xie M.
        • Yuen D.
        • et al.
        Peaks DB: de novo sequencing assisted database search for sensitive and accurate peptide identification.
        Mol. Cell Proteomics. 2012; 11: M111.010587
        • Jiang J.
        • Natarajan K.
        • Margulies D.H.
        Advances in Experimental Medicine and Biology. Springer New York LLC, New York; 2019: 21-62
        • Faridi P.
        • Li C.
        • Ramarathinam S.H.
        • Vivian J.P.
        • Illing P.T.
        • Mifsud N.A.
        • et al.
        A subset of HLA-I peptides are not genomically templated: evidence for cis- and trans-spliced peptide ligands.
        Sci. Immunol. 2018; 3: eaar3947
        • Colaert N.
        • Degroeve S.
        • Helsens K.
        • Martens L.
        Analysis of the resolution limitations of peptide identification algorithms.
        J. Proteome Res. 2011; 10: 5555-5561
        • Verheggen K.
        • Ræder H.
        • Berven F.S.
        • Martens L.
        • Barsnes H.
        • Vaudel M.
        Anatomy and evolution of database search engines—a central component of mass spectrometry based proteomic workflows.
        Mass Spectrom. Rev. 2020; 39: 292-306
        • Bichmann L.
        • Nelde A.
        • Ghosh M.
        • Heumos L.
        • Mohr C.
        • Peltzer A.
        • et al.
        MHCquant: automated and reproducible data analysis for immunopeptidomics.
        J. Proteome Res. 2019; 18: 3876-3884
        • Dorfer V.
        • Maltsev S.
        • Winkler S.
        • Mechtler K.
        CharmeRT: boosting peptide identifications by chimeric spectra identification and retention time prediction.
        J. Proteome Res. 2018; 17: 2581-2589
        • Silva A.S.C.
        • Bouwmeester R.
        • Martens L.
        • Degroeve S.
        Accurate peptide fragmentation predictions allow data driven approaches to replace and improve upon proteomics search engine scoring functions.
        Bioinformatics. 2019; 35: 1401-1403
        • Li K.
        • Jain A.
        • Malovannaya A.
        • Wen B.
        • Zhang B.
        DeepRescore: leveraging deep learning to improve peptide identification in immunopeptidomics.
        Proteomics. 2020; 20: e1900334
        • Wilhelm M.
        • Zolg D.P.
        • Graber M.
        • Gessulat S.
        • Schmidt T.
        • Schnatbaum K.
        • et al.
        Deep learning boosts sensitivity of mass spectrometry-based immunopeptidomics.
        Nat. Commun. 2021; 12: 3346
        • Bouwmeester R.
        • Gabriels R.
        • Hulstaert N.
        • Martens L.
        • Degroeve S.
        DeepLC can predict retention times for peptides that carry as-yet unseen modifications.
        Nat. Methods. 2021; : 1-7
        • Gabriels R.
        • Martens L.
        • Degroeve S.
        Updated MS2PIP web server delivers fast and accurate MS2 peak intensity prediction for multiple fragmentation methods, instruments and labeling techniques.
        Nucl. Acids Res. 2019; 47: W295-W299
        • Degroeve S.
        • Maddelein D.
        • Martens L.
        MS2PIP prediction server: compute and visualize MS2 peak intensity predictions for CID and HCD fragmentation.
        Nucl. Acids Res. 2015; 43: W326-W330
        • Degroeve S.
        • Martens L.
        MS2PIP: a tool for MS/MS peak intensity prediction.
        Bioinformatics. 2013; 29: 3199-3203
        • Ruiz Cuevas M.V.
        • Hardy M.-P.
        • Holly J.
        • Bonneil É.
        • Durette C.
        • Courcelles M.
        • et al.
        Most non-canonical proteins uniquely populate the proteome or immunopeptidome.
        Cell Rep. 2020; 34: 108815
        • Martens L.
        • Hermjakob H.
        • Jones P.
        • Adamski M.
        • Taylor C.
        • States D.
        • et al.
        PRIDE: the proteomics identifications database.
        Proteomics. 2005; 5: 3537-3545
        • Perez-Riverol Y.
        • Csordas A.
        • Bai J.
        • Bernal-Llinares M.
        • Hewapathirana S.
        • Kundu D.J.
        • et al.
        The PRIDE database and related tools and resources in 2019: improving support for quantification data.
        Nucl. Acids Res. 2019; 47: D442-D450
        • Hulstaert N.
        • Shofstahl J.
        • Sachsenberg T.
        • Walzer M.
        • Barsnes H.
        • Martens L.
        • et al.
        ThermoRawFileParser: modular, scalable, and cross-platform RAW file conversion.
        J. Proteome Res. 2020; 19: 537-542
        • Deutsch E.W.
        • Perez-Riverol Y.
        • Carver J.
        • Kawano S.
        • Mendoza L.
        • Van Den Bossche T.
        • et al.
        Universal spectrum identifier for mass spectra.
        Nat. Methods. 2021; 18: 768-770
        • Chen T.
        • Guestrin C.
        Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York, NY; 2016
        • Bergstra J.
        • Yamins D.
        • Cox D.D.
        30th international conference on machine learning.
        ICML. 2013; 2013: 115-123
        • Granholm V.
        • Kim S.
        • Navarro J.C.F.
        • Sjölund E.
        • Smith R.D.
        • Käll L.
        Fast and accurate database searches with MS-GF+percolator.
        J. Proteome Res. 2014; 13: 890-897
        • Tyanova S.
        • Temu T.
        • Cox J.
        The MaxQuant computational platform for mass spectrometry-based shotgun proteomics.
        Nat. Protoc. 2016; 11: 2301-2319
        • Sarkizova S.
        • Klaeger S.
        • Le P.M.
        • Li L.W.
        • Oliveira G.
        • Keshishian H.
        • et al.
        A large peptidome dataset improves HLA class I epitope prediction across most of the human population.
        Nat. Biotechnol. 2020; 38: 199-209
        • Andreatta M.
        • Alvarez B.
        • Nielsen M.
        GibbsCluster: unsupervised clustering and alignment of peptide sequences.
        Nucl. Acids Res. 2017; 45: W458-W463
        • Hunter J.D.
        Matplotlib: a 2D graphics environment.
        Comput. Sci. Eng. 2007; 9: 90-95
        • Waskom M.
        seaborn: statistical data visualization.
        J. Open Source Softw. 2021; 6: 3021
        • Bittremieux W.
        Spectrum-utils: a Python package for mass spectrometry data processing and visualization.
        Anal. Chem. 2020; 92: 659-661
        • Chen R.
        • Fauteux F.
        • Foote S.
        • Stupak J.
        • Tremblay T.L.
        • Gurnani K.
        • et al.
        Chemical derivatization strategy for extending the identification of MHC class I immunopeptides.
        Anal. Chem. 2018; 90: 11409-11416
        • Racle J.
        • Michaux J.
        • Rockinger G.A.
        • Arnaud M.
        • Bobisse S.
        • Chong C.
        • et al.
        Robust prediction of HLA class II epitopes by deep motif deconvolution of immunopeptidomes.
        Nat. Biotechnol. 2019; 37: 1283-1286
        • Chong C.
        • Marino F.
        • Pak H.
        • Racle J.
        • Daniel R.T.
        • Müller M.
        • et al.
        High-throughput and sensitive immunopeptidomics platform reveals profound interferon γ-mediated remodeling of the human leukocyte antigen (HLA) ligandome.
        Mol. Cell. Proteomics. 2018; 17: 533-548
        • Gfeller D.
        • Guillaume P.
        • Michaux J.
        • Pak H.-S.
        • Daniel R.T.
        • Racle J.
        • et al.
        The length distribution and multiple specificity of naturally presented HLA-I ligands.
        J. Immunol. 2018; 201: 3705-3716
        • Bassani-Sternberg M.
        • Bräunlein E.
        • Klar R.
        • Engleitner T.
        • Sinitcyn P.
        • Audehm S.
        • et al.
        Direct identification of clinically relevant neoepitopes presented on native human melanoma tissue by mass spectrometry.
        Nat. Commun. 2016; 7: 1-16
        • Wang D.
        • Eraslan B.
        • Wieland T.
        • Hallström B.
        • Hopf T.
        • Zolg D.P.
        • et al.
        A deep proteome and transcriptome abundance atlas of 29 healthy human tissues.
        Mol. Syst. Biol. 2019; 15: e8503
        • Bassani-Sternberg M.
        • Chong C.
        • Guillaume P.
        • Solleder M.
        • Pak H.S.
        • Gannon P.O.
        • et al.
        Deciphering HLA-I motifs across HLA peptidomes improves neo-antigen predictions and identifies allostery regulating HLA specificity.
        PLOS Comput. Biol. 2017; 13: 1-28
        • Marino F.
        • Semilietof A.
        • Michaux J.
        • Pak H.S.
        • Coukos G.
        • Müller M.
        • et al.
        Biogenesis of HLA ligand presentation in immune cells upon activation reveals changes in peptide length preference.
        Front. Immunol. 2020; 11: 1981
        • Gravina F.
        • Sanchuki H.S.
        • Rodrigues T.E.
        • Gerhardt E.C.M.
        • Pedrosa F.O.
        • Souza E.M.
        • et al.
        Proteome analysis of an Escherichia coli ptsN-null strain under different nitrogen regimes.
        J. Proteomics. 2018; 174: 28-35