Re-analysis of ProteomicsDB using an accurate, sensitive and scalable false discovery rate estimation approach for protein groups

Open Access. Published: October 31, 2022. DOI: https://doi.org/10.1016/j.mcpro.2022.100437

      Highlights

      • Evaluating protein group FDR estimation methods with entrapment and simulated data
      • Accurate & sensitive protein group FDR method on databases with protein isoforms
      • Tool for combining multiple large-scale MaxQuant searches on protein group-level
      • Analysis on ProteomicsDB identified >1200 human genes with multiple protein groups

      Abstract

      Estimating false discovery rates (FDRs) of protein identification continues to be an important topic in mass spectrometry-based proteomics, particularly when analyzing very large data sets. One performant method for this purpose is the Picked Protein FDR approach, which is based on a target-decoy competition strategy on the protein level that ensures that FDRs scale to large data sets. Here, we present an extension to this method that can also deal with protein groups, i.e. proteins that share common peptides such as protein isoforms of the same gene. To obtain well-calibrated FDR estimates that preserve protein identification sensitivity, we introduce two novel ideas: first, the picked group target-decoy strategy and, second, the rescued subset grouping strategy. Using entrapment searches and simulated data for validation, we demonstrate that the new Picked Protein Group FDR method produces accurate protein group-level FDR estimates regardless of the size of the data set. The validation analysis also uncovered that applying the commonly used Occam’s razor principle leads to anti-conservative FDR estimates for large data sets. This is not the case for the Picked Protein Group FDR method. Re-analysis of deep proteomes of 29 human tissues showed that the new method identified up to 4% more protein groups than MaxQuant. Applying the method to the re-analysis of the entire human section of ProteomicsDB led to the identification of 18,000 protein groups at 1% protein group-level FDR. The analysis also showed that about 1,250 genes were represented by ≥2 identified protein groups. To make the method accessible to the proteomics community, we provide a software tool including a graphical user interface that enables merging results from multiple MaxQuant searches into a single list of identified and quantified protein groups.

      Abbreviations:

      FDR (False discovery rate), iBAQ (Intensity based absolute quantification), LFQ (Label free quantification), MS (Mass spectrometry), PEP (Posterior error probability), PSM (Peptide spectrum match), TMT (Tandem mass tags), TDS (target-decoy strategy), cT (classic TDS), pT (picked TDS), pgT (picked group TDS), nG (no protein grouping), sG (subset protein grouping), rsG (rescued subset protein grouping), dS (discard shared peptides), rS (razor peptides), mmP (multiplication of MaxQuant PEPs), bmP (best MaxQuant PEP), bpP (best Percolator PEP)

      Introduction

      Algorithms for protein quantification and identification from mass spectrometry (MS) data are continuously challenged by the ever growing trend towards large-scale experiments. Today, it is not at all uncommon to perform experiments resulting in hundreds, or even thousands, of MS data files [Wilhelm et al., "Mass-spectrometry-based draft of the human proteome"], [Kim et al., "A draft map of the human proteome"], [Huttlin et al., "Dual proteome-scale networks reveal cell-specific remodeling of the human interactome"], [Edwards et al., "The CPTAC data portal: a resource for cancer proteomics research"]. While such data is increasingly deposited into searchable public data repositories such as ProteomicsDB [Lautenbacher et al., "ProteomicsDB: toward a FAIR open-source resource for life-science research"], PRIDE [Perez-Riverol et al., "The PRIDE database and related tools and resources in 2019: improving support for quantification data"], MassIVE (http://massive.ucsd.edu) or PeptideAtlas [Desiere et al., "The PeptideAtlas project"], estimating the proportion of false positive identifications (false discovery rate; FDR) at the gene-level is a non-trivial task [Savitski et al., "A scalable approach for protein false discovery rate estimation in large proteomic data sets"], [The et al., "Fast and Accurate Protein False Discovery Rates on Large-Scale Proteomics Data Sets with Percolator 3.0"], [Omenn et al., "Progress Identifying and Analyzing the Human Proteome: 2021 Metrics from the HUPO Human Proteome Project"]. An even larger challenge is presented by distinguishing between different protein products from the same gene [Plubell et al., "Can we put Humpty Dumpty back together again? What does protein quantification mean in bottom-up proteomics?"], such as splice variants [Tapial et al., "An atlas of alternative splicing profiles and functional associations reveals new regulatory programs and genes that simultaneously express multiple major isoforms"] or single-nucleotide polymorphisms (SNPs) or when analysing mixtures of orthologous proteins from different species such as human/mouse xenografts or bacterial communities in metaproteomics [Rechenberger et al., "Challenges in clinical metaproteomics highlighted by the analysis of acute leukemia patients with gut colonization by multidrug-resistant enterobacteriaceae"].
      The most prevalent method for estimating FDRs makes use of so-called target-decoy models [Elias and Gygi, "Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry"], where decoy peptide/protein sequences are added to the collection of genuine (i.e. target) peptide/protein sequences to serve as a model for false hits. The underlying assumption is that the database search engine identification score distributions of decoy and incorrect target peptide-spectrum matches (PSMs), peptides and proteins are the same. Violations of this assumption can lead to inaccurate FDR estimates, referred to as loss of FDR control. A well-working FDR estimation procedure should be both accurate, i.e. reflecting the true proportion of false discoveries, and sensitive, i.e. maximizing the number of acceptable discoveries at a defined threshold (typically 1%). When large-scale data sets began to appear in the literature, it was soon recognized that naively compiling lists of identified proteins by combining large numbers of experiments led to loss of control of the protein FDR [Serang and Käll, "Solution to statistical challenges in proteomics is more statistics, not less"], [Ezkurdia et al., "Analyzing the First Drafts of the Human Proteome"] and that applying the simple target-decoy approach led to issues with reduced sensitivity [Wilhelm et al., "Mass-spectrometry-based draft of the human proteome"]. The latter turned out to be the result of an unintended asymmetry between decoy proteins and falsely identified target proteins. This is because false target PSMs may arise from both correct and false target proteins, whereas decoy PSMs only arise from decoy proteins, which are false by definition [Reiter et al., "Protein identification false discovery rates for very large proteomics data sets generated by tandem mass spectrometry"].
      This asymmetry was subsequently resolved by the development of the picked target-decoy strategy (picked TDS) [Savitski et al., "A scalable approach for protein false discovery rate estimation in large proteomic data sets"]. Briefly, the picked TDS compares the highest observed search engine PSM score for a target protein with its respective sequence-reversed or shuffled decoy protein and only retains the entry with the highest score. This strategy re-established the assumption that false target protein identifications have the same score distribution as decoy protein identifications. As a result, the picked TDS increased the number of confidently identified proteins compared to the classic TDS [Savitski et al., "A scalable approach for protein false discovery rate estimation in large proteomic data sets"], [The et al., "Fast and Accurate Protein False Discovery Rates on Large-Scale Proteomics Data Sets with Percolator 3.0"]. Savitski et al. ["A scalable approach for protein false discovery rate estimation in large proteomic data sets"] employed this picked TDS to calculate protein-level FDRs using a method called the Picked Protein FDR approach. This method was designed to operate at the gene level and, for simplicity, discarded identified peptides that are shared between multiple protein sequences. This choice was made because shared peptides are rare when only canonical protein sequences are considered. However, not considering shared peptides led to reduced sensitivity when searching databases containing multiple protein isoforms of a gene. This is because if a peptide is shared between two protein isoforms of a gene but is the only peptide identified for that gene, the picked TDS would discard this peptide and neither isoform of the gene would be reported.
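      The picking step described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function name `picked_target_decoy`, the input format of (protein, score) pairs and the assumption that decoys are marked with a `REV__` prefix are choices made for this example. Ties between a target and its decoy are resolved in favour of whichever entry is seen first, which is an arbitrary choice.

```python
def picked_target_decoy(psms, decoy_prefix="REV__"):
    """Keep, for each target/decoy pair, only the entry with the best PSM score.

    psms: iterable of (protein_id, score) pairs, higher scores are better.
    decoy_prefix: assumed naming convention for decoy proteins (hypothetical).
    Returns a list of (protein_id, score, is_decoy), best score first.
    """
    # Best PSM score observed for every target and decoy protein.
    best = {}
    for protein, score in psms:
        if protein not in best or score > best[protein]:
            best[protein] = score

    # Competition between each target and its decoy counterpart.
    winners = {}
    for protein, score in best.items():
        is_decoy = protein.startswith(decoy_prefix)
        base = protein[len(decoy_prefix):] if is_decoy else protein
        current = winners.get(base)
        if current is None or score > current[1]:
            winners[base] = (protein, score, is_decoy)

    return sorted(winners.values(), key=lambda entry: entry[1], reverse=True)


toy_psms = [("P1", 120.0), ("REV__P1", 35.0), ("P2", 20.0),
            ("REV__P2", 42.0), ("P1", 80.0), ("P3", 55.0)]
for entry in picked_target_decoy(toy_psms):
    print(entry)
```

      In this toy input, P2 loses against its decoy, so only the decoy of that pair remains in the list that would be used for FDR estimation.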
      As proteomic technology has improved over time to enable profiling proteomes at more and more depth, the above simplification may no longer be acceptable in many circumstances. Indeed, it has been estimated that up to 50% of all peptide sequences are shared between multiple protein sequences when considering all protein isoforms resulting from alternative RNA splicing [Dost et al., "Accurate Mass Spectrometry Based Protein Quantification via Shared Peptides"]. In turn, this leads to complications for protein identification and quantification [Nesvizhskii and Aebersold, "Interpretation of Shotgun Proteomic Data"]. If a peptide is shared between two protein isoforms of a gene and if it is the only peptide identified for that gene, one can confidently mark the gene as identified but one cannot be certain which of the two isoforms (or both) is identified. A popular way to address this issue is to combine proteins that share identified peptides into a group and treat such a group of proteins as a single entity [Serang and Noble, "A review of statistical methods for protein identification using tandem mass spectrometry"]. In the above simple example, the two protein isoforms would be combined and treated as one protein group with the interpretation that at least one of the two sequences was actually present in the sample [The et al., "How to talk about protein-level false discovery rates in shotgun proteomics"]. In reality, both shared and distinct (i.e. unique) peptides may be identified for protein isoforms of the same gene. Therefore, proteins are typically only grouped if a protein’s identified peptides are a subset of the identified peptides of another protein. The identification of even a single unique peptide for each of the isoforms would prevent the isoforms from being grouped even if they had (many) identified peptides in common. This may lead to the identification of multiple protein groups for a given gene and, in some cases, even particular protein isoforms.
      Protein grouping has become very popular because it provides some of the desired granularity for distinguishing different protein products from the same gene. However, the subject of estimating protein group-level FDR has received little attention [Serang and Noble, "A review of statistical methods for protein identification using tandem mass spectrometry"], [The et al., "How to talk about protein-level false discovery rates in shotgun proteomics"], [Audain et al., "In-depth analysis of protein inference algorithms using multiple search engines and well-defined metrics"]. In particular, three unresolved problems may be noted. First, the picked TDS cannot directly be applied to protein groups, as the decoy counterparts of the target proteins in a protein group are not necessarily also grouped together. This is because the decoy counterpart proteins have their own set of identified shared and unique (decoy) peptides upon which grouping is based. One could artificially group all decoy counterparts into a decoy protein group but this violates the need to treat target and decoy proteins in the same manner, which is at the very heart of the target-decoy idea. Second, protein grouping leads to a practical problem when combining or comparing large datasets. Because the composition of protein groups depends on the set of identified peptides, proteins that were grouped together in one dataset are likely not grouped with the same composition of proteins in another dataset. This makes comparisons between datasets difficult unless all the data is combined and searched again, which is not practical for very large datasets or when looking at entire repositories. Recently, a tool designed for metaproteomic datasets was released that produces protein grouping results for multiple Percolator output files [Schallert et al., "Pout2Prot: An Efficient Tool to Create Protein (Sub) groups from Percolator Output Files"]. However, to the best of our knowledge, no tools exist yet that can easily combine results from multiple MaxQuant searches with consistent protein grouping. Third, even with protein grouping, many high-confidence peptides may still be shared between multiple protein groups and can, therefore, not unequivocally be attributed to a single protein group. Some methods apportion such cases over the protein groups concerned using probabilistic models [Serang et al., "Efficient Marginalization to Compute Protein Posterior Probabilities from Shotgun Mass Spectrometry Data"], [Pfeuffer et al., "EPIFANY: A Method for Efficient High-Confidence Protein Inference"], [Li et al., "A Bayesian Approach to Protein Inference Problem in Shotgun Proteomics"], [Kim et al., "DeepPep: Deep proteome inference from peptide profiles"] but these often require the data to adhere to specific probability distributions, have problems with scaling and have hard-to-interpret models and results [Serang et al., "Recognizing uncertainty increases robustness and reproducibility of mass spectrometry-based protein inferences"]. To avoid such complications, a popular alternative is the Occam's razor heuristic (also known as the "law of parsimony" or "razor peptides"), most notably employed in the MaxQuant software platform [Tyanova et al., "The MaxQuant computational platform for mass spectrometry-based shotgun proteomics"]. Here, a typical decision rule for shared peptides is to assign them to the protein group with the highest number of identified unique peptides, and arbitrarily pick one of the protein groups in case of a tie. Although such rules may pick the correct protein group in a majority of cases, the number of cases where they pick the wrong protein group accumulates in large-scale experiments. Worse still, these cases do not arise from decoy protein groups because whichever decoy protein group a shared decoy peptide is attributed to is false by definition. Therefore, these false positive targets remain unaccounted for when computing FDRs, leading to a loss of FDR control and anti-conservative FDR estimates [Serang et al., "Recognizing uncertainty increases robustness and reproducibility of mass spectrometry-based protein inferences"].
      Note that gene-level FDRs and protein group-level FDRs differ fundamentally due to the type of entities in the lists they are calculated on, i.e. genes and protein groups respectively, and can, thus, generally not be compared to each other. However, one can calculate gene-level FDRs on a list of protein groups by, e.g., only retaining the best scoring protein group per gene. The reverse, i.e. calculating protein group-level FDRs from a list of genes, is only possible in the most trivial sense, where each gene is its own protein group. In the following, we will exclusively use protein group-level FDRs.
      To address the above issues, we introduce the Picked Protein Group FDR method, along with an accompanying software tool, which extends the Picked Protein FDR approach to protein groups. In particular, we show that this method identifies up to 4% more protein groups than the method used by MaxQuant while simultaneously correctly controlling the protein group-level FDR. Furthermore, we show that the method scales to very large numbers of experiments, exemplified by the re-analysis of the entire human section of ProteomicsDB. The tool comes with a graphical user interface operating under Windows as well as a Python package. The software also contains the option to merge results from multiple searches into a single list of quantified protein groups (LFQ, iBAQ, TMT) and provides an output similar to MaxQuant’s proteinGroups.txt. Instructions for download and use can be found at https://github.com/kusterlab/picked_group_fdr.

      Experimental procedures

      Datasets

      RAW files for a deep proteome study of 29 healthy human tissues (Wang et al. ["A deep proteome and transcriptome abundance atlas of 29 healthy human tissues"], PXD010154; HCD data only) corresponding to 50 million MS2 spectra were searched using MaxQuant (v1.5.3.8) against the human Swiss-Prot database including isoforms and TrEMBL entries downloaded from UniProt (accessed: 27 Mar 2020, 95,943 protein sequences), concatenated with a list of contaminants, as provided with MaxQuant, as well as reversed decoy sequences for all entries. Trypsin was specified as protease with up to 2 missed cleavages. Both peptide and protein-level FDR thresholds were set to 100% and default values were used for all other parameters. This data set is referred to in this manuscript as the Wang_base dataset.
      All PSMs assigned to human samples in ProteomicsDB (accessed: 08 Jun 2020), including both Mascot and MaxQuant search results for all proteases, formed the PrDB dataset used in the current study. It consists of 77 distinct projects of varying size (Supplementary Table S1), totaling 269 million target and 141 million decoy PSMs (100% PSM-level FDR, searched against the Swiss-Prot database including isoforms and TrEMBL entries).
      For both the Wang_base and PrDB datasets, we produced three lists of PSMs that were filtered to only include PSMs with peptide sequences that mapped to the canonical Swiss-Prot database, the Swiss-Prot database with isoforms and the Swiss-Prot+TrEMBL database with isoforms, respectively. These protein sequences were obtained from ProteomicsDB (accessed: 05 Jul 2020). Note that these filtered lists of PSMs will slightly differ from the results of a normal search against these respective databases. In the filtered lists, PSMs with sequences not present in the reduced databases are discarded instead of being matched to another sequence within the reduced database. As this equally affects incorrect target and decoy PSMs, this should not lead to any biases for FDR estimation.

      Entrapment searches

      To assess if protein group-level FDRs are well-calibrated for the protein group-level FDR estimation methods presented below, entrapment searches were performed [Granholm et al., "Determining the calibration of confidence estimation procedures for unique peptides in shotgun proteomics"]. Briefly, in entrapment searches, the target database is extended by an entrapment database, typically 5-10 times the size of the target database, which only contains protein sequences known to be false.
      Entrapment databases were constructed as follows (for a graphical overview, see Supplementary Figure 1A):
      1. The target database was in-silico digested with Trypsin/P as protease, without missed cleavages. Only peptides longer than 6 amino acids were retained.
      2. A fraction S (see below) of peptides was randomly selected to remain unchanged, thereby creating shared peptides between the target and entrapment databases [The et al., "Fast and Accurate Protein False Discovery Rates on Large-Scale Proteomics Data Sets with Percolator 3.0"].
      3. All other peptide sequences were shuffled, while keeping the C-terminal amino acid the same.
      4. Finally, these shuffled peptides replaced their original versions in the target protein sequence, resulting in an entrapment protein having the same number of shared peptides as the target version.
      5. Steps 2-4 were repeated four times to yield an entrapment database five times the size of the original target database.
      Entrapment databases were generated for two different fractions S of shared peptides, with S = 0.5 representing the shared ratio of the Swiss-Prot+Isoforms database and S = 0.04 representing the shared ratio of the Swiss-Prot database. The spectra from Wang_base were searched against each of these entrapment databases with MaxQuant (v1.5.3.8). Trypsin/P was specified as protease and no missed cleavages were allowed, to reflect the construction of the entrapment databases. The peptide-level FDR threshold was set to 10% (searching with 100% peptide-level FDR was prohibitively slow) and the protein-level FDR threshold to 100%. All other parameters were set to default values. These data sets are referred to as Wang_trap_0.5 and Wang_trap_0.04, respectively.
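      The construction of a single entrapment sequence can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the script used in this study: the function names are hypothetical, each distinct peptide is kept unchanged with probability S (a Bernoulli draw instead of selecting an exact fraction), the decision is cached so that a peptide occurring in several proteins is treated consistently, and peptides of 6 or fewer amino acids are left as they are.

```python
import random


def digest_trypsin_p(sequence):
    """Cleave after every K or R (Trypsin/P), allowing no missed cleavages."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        if residue in "KR":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # protein C-terminal peptide
    return peptides


def shuffle_keep_c_term(peptide, rng):
    """Shuffle all residues except the C-terminal one."""
    prefix = list(peptide[:-1])
    rng.shuffle(prefix)
    return "".join(prefix) + peptide[-1]


def entrapment_protein(sequence, shared_fraction, rng, cache, min_length=7):
    """Build one entrapment sequence from a target sequence.

    cache maps each original peptide to its entrapment version, so that a
    peptide occurring in several proteins is handled consistently (assumption).
    """
    entrapment = []
    for peptide in digest_trypsin_p(sequence):
        if len(peptide) < min_length:
            entrapment.append(peptide)  # assumption: short peptides stay unchanged
            continue
        if peptide not in cache:
            keep = rng.random() < shared_fraction  # Bernoulli draw, not an exact fraction
            cache[peptide] = peptide if keep else shuffle_keep_c_term(peptide, rng)
        entrapment.append(cache[peptide])
    return "".join(entrapment)


rng = random.Random(42)
target = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSR"
print(entrapment_protein(target, shared_fraction=0.5, rng=rng, cache={}))
```

      Because only the residues preceding each cleavage site are shuffled, the entrapment sequence has the same tryptic cleavage positions, peptide lengths and amino acid composition as its target counterpart. Calling the function once per repetition with a fresh cache would correspond to repeating steps 2-4.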

      Peptide-level filtering

      For protein-level FDR estimation methods, it is common to supply the list of PSMs without applying a peptide-level FDR threshold [Savitski et al., "A scalable approach for protein false discovery rate estimation in large proteomic data sets"], [The et al., "Fast and Accurate Protein False Discovery Rates on Large-Scale Proteomics Data Sets with Percolator 3.0"]. This allows the estimation of protein-level FDRs for proteins with only weak evidence. This could be especially relevant for distinguishing protein isoforms, which usually have few unique peptides. Note that, for parsimony-based protein inference methods that do not apply protein-level FDR estimation, it is vital to apply a strict peptide-level FDR threshold before protein grouping to prevent excessive accumulation of false protein groups.
      However, for MaxQuant’s method, we noted a decrease of up to 9% in the number of identified protein groups at 1% protein group-level FDR for the 100% peptide-level FDR cutoff compared to MaxQuant’s default 1% peptide-level FDR cutoff (Supplementary Figure 2). This decrease was due to the addition of low-confidence peptides that were filtered out in the 1% peptide-level FDR cutoff results. As MaxQuant computes protein group scores by a multiplication of PEPs, these low-confidence peptides actually decreased the confidence in their corresponding protein relative to proteins without low-confidence peptides. To make the comparison to MaxQuant fairer and to rule out the permissive FDR threshold as a confounding factor when comparing methods, we applied a 1% peptide-level FDR cutoff for all methods. This cutoff was applied per raw file, as is done in MaxQuant.

      Percolator PSM rescoring

      For the three Wang datasets, the evidence.txt MaxQuant output files were processed by a custom Python script to create a Percolator input file by extracting the following features: Andromeda score, Andromeda delta score, peptide length, charge (one-hot encoded), mass, enzymatic N-terminal, enzymatic C-terminal, missed cleavages, number of modifications, delta mass and absolute delta mass. This file was subsequently processed by Percolator v3.04. The resulting PSM target and decoy output files were merged back with the original evidence.txt and the Q-value and PEP columns were updated with the new values provided by the Percolator analysis.
      For the PrDB dataset, PSMs were grouped by project (n=77) and the same features as above were extracted, minus the delta mass and absolute delta mass features. These two columns were incomplete in ProteomicsDB and were, therefore, discarded. Percolator was applied to each project separately, so that the weights of the support vector machine (SVM) could be adjusted for each project. The Percolator results were extracted on peptide level, where only the best scoring PSM per peptide sequence was retained. The resulting peptide lists (one for each project) were merged into a single list, using the -log10(q-value) of the peptide as the score and only retaining the best PSM for each peptide. Peptide-level q-values were then re-calculated on this final list of (unique) peptide sequences.
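      The merging and re-calculation step for the PrDB dataset can be illustrated with the sketch below. It is not the code used in this study: the function names and the (peptide, is_decoy, score) tuple format are hypothetical, and the FDR at each score threshold is estimated as the number of decoys divided by the number of targets, which is one common estimator and assumed here.

```python
def merge_best_peptides(project_lists):
    """Merge per-project peptide lists, keeping the best score per peptide.

    project_lists: iterable of lists of (peptide, is_decoy, score) tuples,
    where higher scores are better, e.g. -log10(q-value).
    """
    best = {}
    for peptides in project_lists:
        for sequence, is_decoy, score in peptides:
            if sequence not in best or score > best[sequence][1]:
                best[sequence] = (is_decoy, score)
    return [(seq, is_decoy, score) for seq, (is_decoy, score) in best.items()]


def q_values(peptides):
    """Return (peptide, is_decoy, score, q_value) tuples, best score first."""
    ranked = sorted(peptides, key=lambda p: p[2], reverse=True)

    # Estimated FDR when accepting everything down to each rank.
    targets = decoys = 0
    fdrs = []
    for _, is_decoy, _ in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # q-value: lowest FDR at which the peptide would still be accepted.
    q_vals = [0.0] * len(fdrs)
    running_min = 1.0
    for i in range(len(fdrs) - 1, -1, -1):
        running_min = min(running_min, fdrs[i])
        q_vals[i] = running_min

    return [(seq, d, s, q) for (seq, d, s), q in zip(ranked, q_vals)]


project_a = [("PEPA", False, 3.0), ("PEPB", False, 1.0), ("DECX", True, 0.5)]
project_b = [("PEPA", False, 2.0), ("PEPB", False, 2.5),
             ("DECY", True, 2.0), ("PEPC", False, 0.2)]
for row in q_values(merge_best_peptides([project_a, project_b])):
    print(row)
```

      Re-calculating q-values on the merged list matters because the q-values of the individual projects refer to different score distributions and cannot simply be carried over to the combined list.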

      Simulated data

      Table 1
      Simulation procedure

      Input:
      n_exp: number of experiments
      n_prot_mean: mean number of proteins present per experiment
      n_prot_stdev: stdev number of proteins present per experiment
      tp_score_mean: mean of score distribution for true positives
      tp_score_stdev: stdev of score distribution for true positives
      fp_score_mean: mean of score distribution for false positives
      fp_score_stdev: stdev of score distribution for false positives
      incorrect_ratio: proportion of incorrect peptides without FDR threshold
      peptide_fdr: peptide FDR threshold, should be lower than protein_fdr
      protein_probs: probability for each protein to be present
      peptide_probs: probability for each peptide (including shared peptides) to be present

      Algorithm:
      1. For each experiment:
         a. Pick n_prot from N(n_prot_mean, n_prot_stdev) to be present
         b. Select n_prot true positive target proteins
         c. Calculate min_score = ppf(1 - peptide_fdr * (1 - incorrect_ratio) / incorrect_ratio, fp_score_mean, fp_score_stdev)
         d. For each true positive target protein:
            i. Draw peptides based on their peptide_probs
            ii. Draw a score for each peptide from the true positive distribution: trunc_norm(min_score, inf, tp_score_mean, tp_score_stdev)
         e. Draw 2*tp_peptides*peptide_fdr/(1-peptide_fdr) false positive peptides from all peptides in the target and decoy databases based on a uniform distribution
         f. Draw a score for each of these peptides from the false positive distribution: trunc_norm(min_score, inf, fp_score_mean, fp_score_stdev)
      2. Do protein grouping based on observed peptides
      3. Calculate classic/Picked Protein Group FDR
      For the simulated datasets in this manuscript, the input parameters were estimated from the Wang_base dataset, where one experiment corresponds to one of the 29 tissues analyzed: n_prot_mean = 10,000, n_prot_stdev = 1,000, tp_score_mean = 2.5, tp_score_stdev = 0.7, fp_score_mean = 0.0, fp_score_stdev = 0.7, incorrect_ratio = 0.6, peptide_fdr = 0.01. Data was simulated for 10, 100, 200, 300, 400 and 500 experiments, to assess the performance of the different protein group-level FDR estimation methods at different scales. Additionally, datasets were simulated where the combined list of experiments had a 10% peptide-level FDR per raw file or a global 1% peptide-level FDR.
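      The score-related steps of the simulation procedure (steps c, d.ii, e and f) can be sketched with the Python standard library. This is an illustrative sketch and not the simulation code itself: the function names are hypothetical, `ppf` is taken to be the inverse cumulative distribution function of a normal distribution and `trunc_norm` is implemented by inverse-CDF sampling. One way to read step c is that the threshold is chosen such that, if nearly all correct peptides pass it, the fraction of incorrect peptides among those passing equals peptide_fdr.

```python
import random
from statistics import NormalDist


def minimum_score(peptide_fdr, incorrect_ratio, fp_score_mean, fp_score_stdev):
    """Score threshold corresponding to the peptide-level FDR (step c)."""
    quantile = 1 - peptide_fdr * (1 - incorrect_ratio) / incorrect_ratio
    return NormalDist(fp_score_mean, fp_score_stdev).inv_cdf(quantile)


def truncated_normal(lower, mean, stdev, rng):
    """Draw from N(mean, stdev) restricted to [lower, inf) (steps d.ii and f)."""
    dist = NormalDist(mean, stdev)
    low_cdf = dist.cdf(lower)
    u = low_cdf + (1.0 - low_cdf) * rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # inv_cdf requires 0 < u < 1
    return max(lower, dist.inv_cdf(u))   # guard against rounding below the bound


def n_false_peptides(n_true_peptides, peptide_fdr):
    """Number of peptides drawn from the target and decoy databases (step e)."""
    return round(2 * n_true_peptides * peptide_fdr / (1 - peptide_fdr))


rng = random.Random(1)
threshold = minimum_score(peptide_fdr=0.01, incorrect_ratio=0.6,
                          fp_score_mean=0.0, fp_score_stdev=0.7)
true_scores = [truncated_normal(threshold, 2.5, 0.7, rng) for _ in range(5)]
false_scores = [truncated_normal(threshold, 0.0, 0.7, rng) for _ in range(5)]
print(threshold, true_scores, false_scores)
```

      The factor 2 in step e presumably reflects that false positive peptides are drawn from the target and decoy databases alike, so that only about half of them end up as false positive targets.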

      Protein grouping

      A common practice to alleviate the problem of distributing shared peptide identifications between multiple protein sequences is to group proteins into so-called protein groups. In this manuscript, we considered the following options:

      No grouping (nG)

      Each protein is considered its own protein group. In databases with many protein isoforms or homologous proteins, this will result in many shared peptides between the protein groups.

      Subset grouping (sG)

      Proteins are grouped if the peptides for one protein form a subset of the peptides for a second protein. This will, for example, group isoforms of the same gene if no isoform-specific peptides are identified. This method is, for example, used by MaxQuant.
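      As an illustration, subset grouping can be sketched in a few lines of Python. This is a simplified greedy variant and not the MaxQuant implementation: the function name and input format are hypothetical, and a protein whose peptides are a subset of several groups is simply placed in the first matching group.

```python
def subset_grouping(protein_peptides):
    """Group proteins whose peptide set is a subset of a group's leading peptide set.

    protein_peptides: dict mapping protein id -> set of identified peptides.
    Returns a list of groups; the first protein of each group has the largest peptide set.
    """
    # Visit proteins with many peptides first so that supersets become group leaders.
    ordered = sorted(protein_peptides, key=lambda p: (-len(protein_peptides[p]), p))
    groups = []  # list of (leader peptide set, member list)
    for protein in ordered:
        peptides = protein_peptides[protein]
        for leader_peptides, members in groups:
            if peptides <= leader_peptides:  # subset test
                members.append(protein)
                break
        else:
            groups.append((peptides, [protein]))
    return [members for _, members in groups]


example = {
    "P1-1": {"pepA", "pepB", "pepC"},
    "P1-2": {"pepA", "pepB"},  # subset of P1-1, joins its group
    "P2": {"pepC", "pepD"},    # shares pepC with P1-1 but is not a subset
}
grouped = subset_grouping(example)
```

      In this example the two isoforms P1-1 and P1-2 end up in one group, whereas P2 remains separate and pepC becomes a peptide shared between two protein groups.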

      Rescued subset grouping (rsG)

      The idea behind this new two-step procedure is to prevent protein groups from being split into multiple groups due to the presence of low-confidence PSMs. To achieve this, a regular subset protein grouping is first performed, producing a list of protein groups, PG1. Next, we filter the list of PSMs using a PSM-level threshold equivalent to a 1% protein group-level FDR, which is calculated based on PG1. Then, a second subset protein grouping is performed using this filtered list of PSMs, producing a second list of protein groups, PG2. Finally, PG2 is supplemented with protein groups from PG1 that did not contain proteins already present in a protein group in PG2. This last step ensures that protein groups above 1% protein group-level FDR also have FDR estimates.
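      The final "supplement" step can be sketched as follows, assuming that PG1 and PG2 have already been computed and are represented as lists of protein id lists. The function name and the toy groups are hypothetical.

```python
def rescue_groups(pg1, pg2):
    """Supplement PG2 with the PG1 groups that share no protein with any PG2 group."""
    proteins_in_pg2 = {protein for group in pg2 for protein in group}
    rescued = [group for group in pg1 if proteins_in_pg2.isdisjoint(group)]
    return pg2 + rescued


# PG1: grouping on all PSMs. A low-confidence peptide unique to A-2 splits A-1 and A-2.
pg1 = [["A-1"], ["A-2"], ["B"], ["C"]]
# PG2: grouping on PSMs above the cutoff. A-1 and A-2 merge; C has no PSMs left.
pg2 = [["A-1", "A-2"], ["B"]]
final_groups = rescue_groups(pg1, pg2)
```

      Here the merged group of A-1 and A-2 from PG2 replaces the two split groups from PG1, while the low-scoring group C is carried over from PG1 so that it still receives an FDR estimate.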

      Shared peptides

      After protein grouping, peptides have to be assigned to protein groups. This is simple if all proteins the peptides could originate from are all in the same protein group. However, one needs to decide on how to deal with peptides shared between proteins that are not in the same protein group. Two options are considered in this manuscript:

      Razor peptides (rS)

      The razor peptide strategy forces the assignment of a shared peptide to one of its associated protein groups based on which of these had the highest number of unique peptides. In case of a tie for the highest number of unique peptides, the tie is broken randomly.

      Discard shared peptides (dS)

      Here, all peptides that are shared between protein groups are discarded.
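      Both options for handling shared peptides can be expressed in one small function. The sketch below assumes a mapping from each peptide to the set of protein groups it could originate from; the function and variable names are hypothetical, and the seeded random generator is only used to make the tie-breaking reproducible.

```python
import random
from collections import defaultdict


def assign_peptides(peptide_groups, strategy="discard", rng=None):
    """Assign peptides to protein groups.

    peptide_groups: dict mapping peptide -> set of protein group ids.
    strategy: "discard" (dS) drops shared peptides, "razor" (rS) assigns each
    shared peptide to the candidate group with the most unique peptides.
    """
    rng = rng or random.Random(0)
    assigned = defaultdict(set)
    unique_counts = defaultdict(int)
    # Unique peptides are always assigned to their only protein group.
    for peptide, groups in peptide_groups.items():
        if len(groups) == 1:
            group = next(iter(groups))
            assigned[group].add(peptide)
            unique_counts[group] += 1
    if strategy == "razor":
        for peptide, groups in peptide_groups.items():
            if len(groups) > 1:
                best = max(unique_counts[g] for g in groups)
                candidates = sorted(g for g in groups if unique_counts[g] == best)
                assigned[rng.choice(candidates)].add(peptide)  # random tie-break
    return dict(assigned)


peptide_example = {
    "pep1": {"G1"},
    "pep2": {"G1"},
    "pep3": {"G2"},
    "pep4": {"G1", "G2"},  # shared between both groups
}
discarded = assign_peptides(peptide_example, "discard")
razor = assign_peptides(peptide_example, "razor")
```

      With the razor strategy pep4 goes to G1, which has two unique peptides against one for G2; with the discard strategy pep4 is not assigned to any group.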

      Protein group scoring

      To rank protein groups by confidence of identification, protein group scores are computed based on the identification probabilities of the constituent peptides. Here, three options for protein group scoring were evaluated.

      Multiplication of MaxQuant PEPs (mmP)

      The protein group score used by MaxQuant is based on multiplying peptide posterior error probabilities (PEPs). This score was re-implemented in Python and verified by comparing it to the protein group score that MaxQuant itself reported for each protein group. In reproducing the protein group scores reported by MaxQuant, we noted that the peptide PEPs were first divided by a constant, which appeared to be chosen with a view to maximizing the number of identified protein groups. Therefore, we implemented a grid search to optimize this constant before calculating the protein group scores.

      Best MaxQuant PEP (bmP)

      This takes the –log10(PEP) of the best scoring PSM for the protein as the protein group’s score.

      Best Percolator PEP (bpP)

      This takes the –log10(PEP) of the best scoring PSM after rescoring with Percolator as the protein group’s score.
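      The scoring options can be sketched as below. The best-PEP score follows directly from the description above. The multiplication score is only an approximation of the idea, since the exact MaxQuant formula is not given here: the `divisor` parameter stands in for the constant mentioned above, and its default value of 1.0 is a placeholder. PEPs are assumed to be strictly positive.

```python
import math


def best_pep_score(peps):
    """bmP / bpP: -log10 of the lowest (best) PEP among a protein group's PSMs."""
    return -math.log10(min(peps))


def multiplied_pep_score(peps, divisor=1.0):
    """mmP-style score: -log10 of the product of PEPs, each divided by a constant.

    Summing logarithms avoids numerical underflow of the product itself.
    """
    return -sum(math.log10(pep / divisor) for pep in peps)


peps = [0.01, 0.1]
best = best_pep_score(peps)              # driven by the PEP of 0.01 only
multiplied = multiplied_pep_score(peps)  # uses both PEPs
```

      The best-PEP score depends on a single PSM, whereas the multiplication score grows with every additional peptide, including low-confidence ones.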

      Protein (group) target-decoy strategy

      Before protein group-level FDR estimation, one can optionally perform a target-decoy competition step on the protein group-level. This competition has been shown to resolve problems with decreased sensitivity that results from an asymmetry between decoy proteins and falsely identified target proteins as noted in the introduction.

      Classic target-decoy strategy (cT)

      All proteins are passed to the FDR calculation without any protein-level target-decoy competition taking place.

      Picked target-decoy strategy (pT)

      This is a target-decoy competition step at the protein level. For each target protein, one notes down the corresponding decoy protein that was constructed by reversing or shuffling the target protein sequence. These two paired-up proteins are considered each other’s counterpart protein and only the highest scoring out of the two is retained.
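      A minimal sketch of this competition is shown below. It assumes that decoy proteins carry a `REV_` prefix on the accession of their target counterpart; the function name and the tie handling (the first protein seen wins) are likewise assumptions.

```python
def picked_tds(protein_scores, decoy_prefix="REV_"):
    """Keep only the higher scoring protein of each target-decoy pair.

    protein_scores: dict mapping protein id -> protein score (higher is better).
    """
    best = {}  # target accession -> (score, protein id) of the pair's current winner
    for protein, score in protein_scores.items():
        if protein.startswith(decoy_prefix):
            target = protein[len(decoy_prefix):]
        else:
            target = protein
        if target not in best or score > best[target][0]:
            best[target] = (score, protein)
    return {protein: score for score, protein in best.values()}


scores = {"A": 5.0, "REV_A": 1.0, "B": 0.5, "REV_B": 2.0, "C": 3.0}
picked = picked_tds(scores)
```

      Protein A beats its decoy, the decoy REV_B beats target B, and C is retained because its decoy counterpart was not observed.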

      Picked group target-decoy strategy (pgT)

      This is a target-decoy competition step at the protein group level. First, we mark all leading proteins for each protein group, i.e. proteins which cover all identified peptides associated with the protein group. When going down the list of protein groups (sorted by decreasing identification score), protein groups are removed for which one or more counterpart leading protein was observed in the current (higher scoring) protein group. Note that if no protein grouping (nG) is done, this procedure is equal to the picked TDS.
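      The procedure can be sketched as follows. Decoys are again assumed to carry a `REV_` prefix, each protein group is represented by its score and its list of leading proteins, and only retained groups are assumed to eliminate lower scoring groups. These are illustrative choices and may differ from the published implementation.

```python
def counterpart(protein, decoy_prefix="REV_"):
    """Return the decoy id for a target protein and vice versa."""
    if protein.startswith(decoy_prefix):
        return protein[len(decoy_prefix):]
    return decoy_prefix + protein


def picked_group_tds(groups):
    """groups: list of (score, leading_proteins) tuples.

    Returns the surviving groups sorted by decreasing score.
    """
    blocked = set()  # counterparts of leading proteins in retained, higher scoring groups
    retained = []
    for score, leading in sorted(groups, key=lambda g: g[0], reverse=True):
        if any(protein in blocked for protein in leading):
            continue  # a counterpart was observed in a higher scoring group
        retained.append((score, leading))
        blocked.update(counterpart(protein) for protein in leading)
    return retained


groups = [
    (10.0, ["D", "E", "F"]),
    (4.0, ["REV_D", "REV_F", "REV_H"]),  # contains counterparts of D and F
    (3.0, ["REV_X"]),                    # unrelated decoy group
    (2.0, ["REV_E"]),                    # counterpart of E
]
survivors = picked_group_tds(groups)
```

      In this example the target group of D, E and F eliminates both decoy groups that contain one of its counterparts, while the unrelated decoy group REV_X survives and counts towards the FDR estimate.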

      Protein group-level FDR

      The null hypothesis for the identification of a protein group [The M., Tasnim A., Käll L. How to talk about protein-level false discovery rates in shotgun proteomics.] was defined as “none of the proteins in the protein group had a correct PSM”, making the alternative hypothesis: “at least one of the proteins in the protein group had at least one correct PSM”. This then leads to the interpretation that if one rejects the null hypothesis, at least one (but potentially multiple or even all) of the proteins in the protein group was correctly identified by a PSM. Note that this hypothesis does not attempt to answer the question of absence and presence of a protein (proteins can be present in the group without having a correct PSM).
      To assess which protein groups were correctly identified, we make use of the protein group scores defined above and use the number of decoy protein groups as an estimate for the number of incorrect target protein groups. First, protein groups are sorted by protein group score, starting with the best scoring protein group. The protein group-level FDR for a protein group is estimated as the ratio of the number of decoy protein groups and target protein groups with a better score than the current protein group.
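      The decoy-based estimate can be sketched as a running ratio over the sorted list. In this sketch the current protein group is included in the counts, which is a common convention but an assumption here, and no monotonization of the estimates (as in q-values) is applied. The function name and input format are hypothetical.

```python
def protein_group_fdrs(groups):
    """Estimate the FDR at each position of the score-sorted protein group list.

    groups: list of (score, is_decoy) tuples.
    Returns a list of (score, is_decoy, fdr) sorted by decreasing score.
    """
    targets = decoys = 0
    results = []
    for score, is_decoy in sorted(groups, key=lambda g: g[0], reverse=True):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Number of decoys estimates the number of incorrect target groups.
        fdr = decoys / targets if targets else 1.0
        results.append((score, is_decoy, fdr))
    return results


toy = [(9.0, False), (8.0, False), (7.0, True), (6.0, False), (5.0, False)]
fdrs = [fdr for _, _, fdr in protein_group_fdrs(toy)]
```

      For the toy list, the estimate rises to 0.5 when the decoy group is encountered and falls again as further target groups are added.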

      Summary of methods

      The options listed above for peptide-level FDR threshold, protein grouping, usage of razor peptides, protein scoring and protein group FDR calculation can be combined in any desired constellation. Below is a summary of the methods used in the main text.

      MaxQuant

      Grouping: subset grouping (sG)
      Shared peptides: Occam’s razor (rS)
      Scoring: multiplication of MaxQuant PEPs (mmP)
      TDS: classic (cT)

      Savitski

      Grouping: no grouping (nG)
      Shared peptides: discard (dS)
      Scoring: best Percolator PEP (bpP)
      TDS: picked (pT)

      Picked Protein Group FDR

      Grouping: rescued subset grouping (rsG)
      Shared peptides: discard (dS)
      Scoring: best Percolator PEP (bpP)
      TDS: picked group (pgT)

      Savitski + Classic FDR

      Grouping: no grouping (nG)
      Shared peptides: discard (dS)
      Scoring: best Percolator PEP (bpP)
      TDS: classic (cT)

      Discard + Picked group

      Grouping: subset grouping (sG)
      Shared peptides: discard (dS)
      Scoring: best Percolator PEP (bpP)
      TDS: picked group (pgT)

      Razor + Picked group

      Grouping: subset grouping (sG)
      Shared peptides: Occam’s razor (rS)
      Scoring: best Percolator PEP (bpP)
      TDS: picked group (pgT)

      Classic Protein Group FDR

      Grouping: rescued subset grouping (rsG)
      Shared peptides: discard (dS)
      Scoring: best Percolator PEP (bpP)
      TDS: classic (cT)
      We observed that MaxQuant PSM-level PEPs are less well-calibrated than those generated by Percolator, which led to anti-conservative protein group-level FDR estimates for the MaxQuant PEPs but not for the PEPs generated by Percolator (Supplementary Figure 1). Using the Percolator PEPs had the added benefit of a 7% increase (164k vs 154k) in the number of peptide identifications at 1% peptide-level FDR compared to using the MaxQuant PEPs (Supplementary Figure 1E). Therefore, the best Percolator PEP (bpP) was selected as the default choice for protein scoring. Results are also shown for multiplication of MaxQuant's PEPs (mmP) and best MaxQuant PEPs (bmP) where these were relevant.

      Results

      Protein group-level FDR estimation

      Protein group-level FDR estimation can be broken down into multiple stages, where one of several available options at each stage has to be chosen. We define a method as a particular combination of chosen options. Any one method takes a list of PSMs as input and generates a list of protein groups with associated protein group-level FDRs as output. We used two large-scale datasets to evaluate several protein group-level FDR estimation methods in terms of accuracy and sensitivity (Figure 1). Specifically, we compare (1) MaxQuant’s method, (2) the Picked Protein FDR method by Savitski et al., and (3) a novel Picked Protein Group FDR method which will be introduced further below. We demonstrate that the two state-of-the-art methods have issues either with calibration or sensitivity, and that the Picked Protein Group FDR method resolves both of these problems (Figure 2A).
      Figure 1. Overview of datasets and evaluations. We used entrapment searches on a deep proteome study as well as simulated data to assess the accuracy of FDR estimates of different protein group-level FDR estimation methods. We also evaluated the sensitivity of the methods on three databases with increasing levels of redundancy (SwissProt canonical, SwissProt+isoforms and SwissProt+iso+TrEMBL) for the deep proteome study and the human section of ProteomicsDB.
      Figure 2. Overview of protein group-level FDR estimation methods, its constituent stages and the available options at each stage. (A) A protein group-level FDR estimation method takes a list of PSMs and chooses one of the available options at each of the three stages: protein grouping, handling of shared peptides and target-decoy strategy. (B) Subset grouping combines proteins into groups if the peptides of one protein (ProteinA) form a subset of the peptides of another protein (ProteinB). (C) The razor peptide approach assigns a shared peptide to the protein group with the most unique identifications. (D) The picked target-decoy strategy performs a target-decoy competition on protein level, retaining only the best scoring protein out of each target-decoy pair.
      The different methods, stages of data processing and associated options are described in detail in the Methods section but a brief overview is given here for convenience. First, one has to decide how proteins are grouped. The simplest option is to not group proteins at all (nG), i.e. each protein is its own protein group. Subset protein grouping (sG, Figure 2B) groups proteins if the identified peptides of one protein are a subset of the identified peptides of another protein. Rescued subset protein grouping (rsG) is an extension of subset protein grouping that will be introduced in more detail further below. Second, peptides shared between protein groups can either be discarded (dS) or assigned using Occam’s razor (rS, Figure 2C), i.e. assigned to the protein group with the highest number of unique peptides, with ties broken randomly. Next, one has to calculate a score for each protein based on the scores of peptide identifications (omitted from Figure 2A for simplicity). As the default choice, we used the best Percolator score (bpP) among all PSMs for a protein group, while also showing results for multiplication of MaxQuant's PEPs (mmP) and best MaxQuant PEPs (bmP) where relevant for the comparison of methods. Finally, one has to choose a target-decoy strategy (TDS) for protein groups. In the classic TDS (cT), all target and decoy protein groups are passed onto the FDR estimation stage. In the picked TDS (pT, Figure 2D), the score of each target protein is compared to its shuffled or reversed decoy counterpart protein and only the best scoring out of the two proteins is retained. The picked group TDS (pgT) is an evolution of the picked TDS that specifically deals with protein groups and which will also be introduced further below.
      Figure 3. Shared peptides lead to reduced sensitivity for the Savitski method. (A) Number of identified protein groups at 1% FDR for the Savitski and MaxQuant methods on the Wang_base dataset. The Savitski method has sensitivity issues for highly redundant databases. (B) Fraction of shared and unique peptides based on in-silico digests of three databases with different levels of redundancy. For databases that include isoforms, the fraction of shared peptides is >50%.
      As input to the protein group FDR estimation methods, we used the MaxQuant evidence.txt results filtered at 1% peptide-level FDR and 100% protein-level FDR and used Percolator to re-score the PSMs (see Methods). All options and methods for protein group-level FDR estimation have been implemented in a Python package named Picked Group FDR (https://pypi.org/project/picked-group-fdr/). The results shown below were generated with v0.3.0.

      The Savitski method shows reduced sensitivity for databases containing isoforms

      The protein FDR estimation method we proposed in Savitski et al., based on the picked TDS, has been shown to avoid the accumulation of decoy matches at the gene level in large datasets, while correctly controlling protein-level FDR [Savitski M.M., Wilhelm M., Hahne H., Kuster B., Bantscheff M. A scalable approach for protein false discovery rate estimation in large proteomic data sets.], [The M., MacCoss M.J., Noble W.S., Käll L. Fast and Accurate Protein False Discovery Rates on Large-Scale Proteomics Data Sets with Percolator 3.0.]. We re-analyzed a recently published dataset of deep proteomes of 29 human tissues [Wang D., Eraslan B., Wieland T., Hallström B., Hopf T., Zolg D.P., Zecha J., Asplund A., Li L.-h., Meng C. A deep proteome and transcriptome abundance atlas of 29 healthy human tissues.] by searching 50 million tandem mass spectra (MS2) against three human protein sequence databases of different size and containing increasing peptide level sequence redundancy (Swiss-Prot, Swiss-Prot+Isoforms, Swiss-Prot+Isoforms+TrEMBL). This dataset will be referred to as the Wang_base dataset.
      When only taking canonical protein sequences into account (Swiss-Prot), we observed a 3% gain in identified proteins for the Savitski method compared to MaxQuant’s method (Figure 3A). This modest increase is realized despite the fact that the Savitski method does not perform protein grouping (nG) and discards shared peptides (dS) whereas MaxQuant’s method uses subset protein grouping (sG) and razor peptides (rS), both of which can enhance sensitivity. When we applied the picked TDS to results of searches against protein databases including isoforms (Swiss-Prot+Isoforms) and unreviewed protein sequences (Swiss-Prot+Isoforms+TrEMBL), the sensitivity of the Savitski method drops compared to MaxQuant’s method (Figure 3A). We also obtained the counter-intuitive result that fewer proteins are identified the larger the database that is used. This drop in sensitivity can largely be attributed to the discarding of the shared peptides in the Savitski method. This is because the fraction of shared peptides increases drastically with the increasing sequence redundancy of tryptic peptides obtained by in-silico digestion (Figure 3B). For the most redundant database, Swiss-Prot+Isoforms+TrEMBL, 63% of all peptides are shared by at least two protein sequences. This effect is also evident in the MaxQuant search results, where 75% of the peptides were shared by two or more protein sequences for this database (Supplementary Figure 3).

      Development of the Picked Protein Group FDR method

      In light of the above, we hypothesized that the negative effect of a high rate of shared peptides on the identification of proteins could be addressed by a more appropriate method of protein grouping and subsequent adjusted FDR estimation. In order to be able to use the picked TDS, we had to solve the issue that target and decoy proteins are not necessarily grouped in a way that would allow fair competition. For example, a target protein group may consist of proteins D, E and F, whereas the decoy protein group that contains the decoy counterpart of protein D, REV_D, also contains REV_F and REV_H (Figure 4A), whereas REV_E forms a different protein group of its own. Therefore, we extended the picked TDS to handle such cases that arise from protein grouping and we term this extension the picked group TDS (pgT). This strategy first sorts the protein groups by descending protein identification score. Then, while going down the sorted protein group list, all protein groups that contain at least one counterpart protein of a leading protein in the current protein group (but with a lower score) are eliminated (Figure 4A, see Methods).
      Figure 4. The picked group target-decoy strategy (TDS) handles competition of protein groups but does not resolve the issues of the state-of-the-art methods. (A) The picked group TDS extends the picked TDS to handle protein groups. Protein groups are sorted by decreasing identification score. Going down the sorted list, all groups containing ≥1 counterpart of one of the leading proteins of the current group are eliminated. (B) Protein group-level FDR calibration plots using entrapment searches for two methods using the picked group TDS. The region between y=1.5x and y=0.67x (dashed lines) was deemed well-calibrated. The Razor + Picked group method produces anti-conservative FDR estimates, whereas Discard + Picked group has reduced sensitivity. (C) The results in panel (B) can be explained by this example. Low-confident, incorrect peptides 2-4 prevent proteins A and B from being grouped. Razor peptides (rS) lead to anti-conservative estimates due to the erroneous assignment of true positive peptide 1 to the incorrect protein B. If shared peptides are discarded (dS), the high-confident peptide 1 is discarded, leading to reduced sensitivity as neither protein is identified. (D) Schematic summary of the methods evaluated in panel (B).
      To verify that the picked group TDS leads to well-calibrated protein group-level FDRs, we searched the Wang et al. dataset against the Swiss-Prot+Isoforms+TrEMBL database augmented with an entrapment database [Granholm V., Navarro J.F., Noble W.S., Käll L. Determining the calibration of confidence estimation procedures for unique peptides in shotgun proteomics.] (Wang_trap_0.5 dataset, see Methods). This entrapment database was constructed in such a way that it mimicked the proportion of shared peptides as found in the Swiss-Prot+Isoforms database (50% shared peptide ratio). When using the picked group TDS (pgT) together with subset grouping (sG) and discarding shared peptides (dS), we observed that the resulting FDRs were well-calibrated over the entire FDR range (Figure 4B). However, when we changed this method to use razor peptides (rS), the results showed anti-conservative FDR estimates. A similar behavior was observed for the MaxQuant method, because it also uses razor peptides.
      The anti-conservative behavior when using razor peptides is a result of false positives that remain unaccounted for by the decoy model, as explained in the introduction and in Figure 4C. This effect becomes stronger as more false positives are included in the input list of PSMs, e.g. when applied to large-scale datasets or when permissive PSM-level FDR cutoffs are used. Neither using different scoring methods nor the picked group TDS was able to lead to well-calibrated protein group-level FDRs when using razor peptides (Supplementary Figure 4). When reducing the shared peptide ratio to 4% (Supplementary Figure 5; Wang_trap_0.04 dataset, mimicking Swiss-Prot’s shared peptide ratio), this effect was not as apparent anymore, as the effect of razor peptides is reduced.
      Combining the picked group TDS with discarding shared peptides led to well-calibrated FDR estimates but the loss in the number of identified protein groups was not completely resolved (Figure 4B). This is because a group of proteins with one (or more) high-confident shared peptide identifications can be split over two protein groups owing to the presence of low-confident peptides that are unique to particular isoforms and are present in the same group (Figure 4C). In such cases, high-confidence shared peptides are shared by two protein groups and are discarded by the picked group TDS. This results in neither protein group being identified, in turn, leading to reduced sensitivity of protein group identification.
      To overcome the shortcomings of (1) using razor peptides, leading to anti-conservative protein FDR estimates (Figure 4D, orange), and (2) discarding shared peptides, leading to reduced sensitivity (Figure 4D, purple), we propose an extension to subset grouping that we term rescued subset grouping (rsG, Figure 5A). First, a regular subset protein grouping is performed and the PSM-level score cutoff corresponding to a 1% protein group-level FDR is computed. Second, using this cutoff, the high-confident PSMs are retained from the original list and subset protein grouping is performed on this filtered list of PSMs. The final list of protein groups consists of the protein groups from the second grouping, supplemented with protein groups from the first grouping for which none of its proteins were already in a protein group from the second grouping. By removing the effect of low-confident PSMs on the protein grouping procedure, more high-confidence peptides can be uniquely mapped to a protein group. This reduces the fraction of discarded precursors from 0.21 for subset grouping (sG) to 0.15 for rescued subset grouping (rsG), a 30% decrease (Figure 5B). On the Wang_trap_0.5 dataset, the picked group TDS combined with discarding shared peptides and rescued subset grouping (rsG, dS, bpP, pgT) shows well-calibrated FDRs (Figure 5C). This combination of options in this new method will henceforth be referred to as the Picked Protein Group FDR method.
      Figure 5. The Picked Protein Group FDR method is sensitive and well-calibrated. (A) Rescued subset grouping employs a second protein grouping that prevents low-scoring false positives from breaking up protein groups. (B) Number of identified peptides assigned to a protein group or discarded for the different protein grouping options for Wang_base searched against the SP+iso+TrEMBL database. Rescued subset grouping further improves the number of assigned peptides to protein groups compared to subset grouping. (C) Protein group-level FDR calibration plots using entrapment searches. The Picked Protein Group FDR method shows good calibration, whereas MaxQuant method produces anti-conservative FDR estimates. (D) Bar plots of the number of identified protein groups at 1% FDR. The Picked Protein Group FDR method consistently identifies the most protein groups across the 3 databases with different levels of redundancy.
      We analyzed the Wang_base dataset using the new Picked Protein Group FDR method and compared the results to those obtained by MaxQuant’s method (sG, rS, mmP, cT), the Savitski method (nG, dS, bpP, pT) and rescued subset grouping with the classic TDS (rsG, dS, bpP, cT). For the database with the smallest ratio of shared peptides (Swiss-Prot), the picked TDS of the Savitski method led to the expected moderate increase in identified proteins at 1% protein group-level FDR compared to MaxQuant’s method which uses the classic TDS. The same was observed for the Picked Protein Group FDR method, which showed an increase of 4% in the number of identified protein groups relative to MaxQuant’s method. When including isoforms into the analysis (Swiss-Prot+Isoforms), the number of identified proteins using the Savitski method drops by 21%, as observed above. In contrast, this is not the case for methods that use rescued subset grouping, for which an increase of 4-5% in the number of identified protein groups was observed compared to the Swiss-Prot database. This is because one can now identify multiple protein groups per gene. For the Swiss-Prot+Isoforms+TrEMBL database, the number of protein groups at 1% FDR roughly doubled when using the Picked Protein Group FDR method compared to the Savitski method (Figure 5D). There was also an increase of 4% in the number of identified protein groups by switching from the classic TDS to the picked group TDS and a 2% increase compared to MaxQuant’s method. However, as demonstrated above, the FDR estimates of MaxQuant’s method are likely not well-calibrated due to the use of razor peptides. Hence, the actual FDR might be higher than the reported 1% on this list of protein groups. In summary, the Picked Protein Group FDR method obtained the highest number of identified protein groups across the three differently sized databases while correctly controlling protein group-level FDR. This led to 15,600 confidently identified protein groups in this human proteome represented by 29 healthy tissues.

      The Picked Protein Group FDR method scales to very large datasets

      Next, we evaluated if the Picked Protein Group FDR method would scale to analyzing very large datasets. Because entrapment database searches are computationally expensive when performed at scale and because they change the original database in terms of size and shared peptides, we instead used simulated data (see Methods) as a way to generate large-scale datasets to verify protein group-level FDR estimates. To verify the validity of the simulated data, we checked if we could recover the qualitative effects observed for the Wang_trap_0.5 and Wang_trap_0.04 datasets. To this end, we estimated the appropriate input parameters for the simulation from the entrapment experiments using the Wang et al. dataset. We then simulated PSMs for the two entrapment searches (4% and 50% shared peptide ratios), where each experiment was controlled at 10% or 1% peptide-level FDR. The results of the different protein group FDR methods for simulated and entrapment datasets were similar for both shared peptide ratios (Supplementary Figure 6). For example, we observed the aforementioned anti-conservative behavior resulting from allowing razor peptides in the 50% shared peptide ratio data which was hardly noticeable at the 4% shared peptide ratio.
      We next simulated data to investigate the effect of combining hundreds of experiments containing about 10,000 proteins each and searched against Swiss-Prot+Isoforms+TrEMBL, and controlled at 1% peptide-level FDR per experiment (Figure 6, Supplementary Figure 7). As expected, the more experiments were combined, the larger the anti-conservative effect of razor peptides became (Figure 6A). At the same time, the number of protein groups at 1% FDR for the Picked Protein Group FDR method increased as the number of combined experiments increased (Figure 6B). Furthermore, and as expected, using a 10% peptide-level FDR threshold per experiment greatly exacerbated the anti-conservative behavior of methods employing razor peptides (Supplementary Figure 8). Reassuringly, the current best practice in proteomics of applying a global 1% peptide-level FDR after combining all experiments did largely resolve the calibration issues caused by razor peptides, although some anti-conservative behavior could still be observed in the very low FDR region. However, this did not lead to more identified protein groups compared to the Picked Protein Group FDR method (Supplementary Figure 9).
      Figure 6. The Picked Protein Group FDR method performs well when combining hundreds of simulated experiments in a single analysis. (A) Protein group-level FDR calibration plot showing the entrapment FDR at 1% decoy FDR for 10 to 500 combined experiments with ∼10,000 proteins each. The Picked Protein Group FDR produces accurate FDR estimations regardless of the number of combined experiments. The two methods using razor peptides (MaxQuant and Razor+Picked group) produce increasingly anti-conservative estimates as more experiments are combined. (B) The number of protein groups at 1% entrapment FDR for different numbers of combined experiments. As desired, the Picked Protein Group FDR method increases the number of identified protein groups as more experiments are combined. Above 300 experiments, methods using razor peptides have too many high-scoring entrapment protein groups such that no protein groups make the 1% entrapment FDR.
      Finally, we applied the Picked Protein Group FDR method to the re-analysis of the entire human section of ProteomicsDB, which comprises 77 projects representing many different types of proteomic applications and 19,800 LC-MS/MS runs, leading to 410 million PSMs. As expected, we observed a sharp increase of 71% in the number of identified protein groups compared to the Savitski method (18,000 vs 10,500) when searching against Swiss-Prot+Isoforms+TrEMBL but identified an almost identical number when searching the canonical Swiss-Prot database only (15,600 vs 15,500; Figure 7A). Furthermore, the Picked Protein Group FDR method showed the expected behavior of identifying more protein groups when searching Swiss-Prot+Isoforms+TrEMBL (18,000) than Swiss-Prot+Isoforms (17,000) or Swiss-Prot (15,600). At the gene level, we observed the expected and desired behavior that the same genes were identified when searching Swiss-Prot+Isoforms or Swiss-Prot (Figure 7B). Searching Swiss-Prot+Isoforms, the Picked Protein Group FDR method resulted in 1,230 genes with multiple identified protein groups (Supplementary Table S2).
      Figure 7. Re-analysis of the human section of ProteomicsDB with the Picked Protein Group FDR method leads to increased information for databases including isoforms. (A) Bar plot showing the number of protein groups at 1% FDR. The Picked Protein Group FDR method exhibits an increase in the number of protein groups as isoforms and unreviewed proteins are included in the database. This is mainly because multiple protein groups can now be identified per gene. (B) Venn diagram on gene level comparing the Savitski method without isoforms (red) and the Picked Protein Group FDR method with isoforms (blue). The Picked Protein Group FDR method reveals information about protein isoforms, with 8% of the identified genes having multiple identified protein groups (gray).

      Discussion

      Here, we introduced the Picked Protein Group FDR method for calculating protein group-level FDRs. This method achieves higher sensitivity than alternative state-of-the-art methods, correctly controls protein group-level FDR and scales well to repository-sized datasets. We developed a Python package and an accompanying graphical user interface that implement this method. This software tool can be applied directly to MaxQuant search results, with the option of combining multiple search results in a single protein group analysis. On a deep proteome study of 29 healthy human tissues (Wang_base), the number of identified protein groups increased by up to 500 (+4%) compared to MaxQuant’s method. The re-analysis of the human section of ProteomicsDB resulted in 15,600 identified genes, similar to the number previously obtained with the Savitski method. Of these identified genes, 1,230 had multiple identified protein groups when searched against the Swiss-Prot database with isoforms included. This number is substantially higher than the 246 genes previously reported by Abascal et al. [2015]. These authors purposely employed very stringent selection criteria to minimize the number of false positives, as they had observed artifacts of unaccounted false positives on peptide level [Tress et al., 2017]. Here, we used a 1% protein group-level FDR filter, which corresponded to a rather stringent 0.12% peptide-level FDR filter. Manual inspection of the results indicates that, for the vast majority of such genes, high-confidence peptides unique to each of the protein groups are available. However, further research will be needed to assess the validity of these results.
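      As a minimal sketch of how such an FDR filter can be applied to a scored list of targets and decoys, the Python snippet below estimates q-values from decoy counts and counts the targets accepted at a given threshold. The function names and the plain decoys/targets estimator are assumptions made for illustration and are not taken from the accompanying software.

```python
from typing import List, Tuple

Scored = Tuple[float, bool]  # (score, is_decoy); higher scores are better


def estimate_qvalues(scored: List[Scored]) -> List[Tuple[float, bool, float]]:
    """Return (score, is_decoy, q_value) tuples sorted by descending score."""
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)

    # FDR estimate at each rank: decoys seen so far / targets seen so far.
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # q-value: the smallest FDR estimate at this rank or any lower-scoring rank.
    qvalues = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        qvalues.append(running_min)
    qvalues.reverse()

    return [(score, is_decoy, q) for (score, is_decoy), q in zip(ranked, qvalues)]


def count_targets_at_fdr(scored: List[Scored], threshold: float = 0.01) -> int:
    """Count target entries whose q-value is at or below the threshold."""
    return sum(
        1
        for _, is_decoy, q in estimate_qvalues(scored)
        if not is_decoy and q <= threshold
    )
```

      The same routine can be run on peptide-level or protein group-level scores. The peptide-level FDR implied by a protein group-level cutoff depends on the data set.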
      To resolve sensitivity and calibration issues of the state-of-the-art methods, the Picked Protein Group FDR method introduces rescued subset grouping for protein grouping and the picked group TDS for target-decoy competition of protein groups. Rescued subset grouping extends regular subset grouping with a second protein grouping step in which low-confidence PSMs are ignored. This is comparable to the current best practice of applying a global 1% peptide-level FDR cutoff before protein grouping but has the benefit of producing FDR estimates for protein groups without peptides below the peptide-level FDR cutoff. The picked group TDS extends the picked TDS to handle protein groups. It is easy to implement and is identical to the picked TDS when protein grouping is not performed. However, it should be noted that the picked group TDS is a heuristic rule that relies on the similarity in composition, e.g. number of proteins and shared peptides, of incorrect target protein groups and decoy protein groups. One can indeed construct (artificial) examples where the picked group TDS produces undesirable results, e.g. an incorrect target protein group consisting of one protein eliminating a decoy protein group with 10 decoy proteins that happen to share one peptide. Nevertheless, we demonstrated here through our calibration experiments that, in practice, the protein groups competing against each other are similar enough to ensure a fair competition and, thereby, accurate FDR estimates.
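      The competition rule can be illustrated with a short Python sketch. It is one plausible reading of the picked group TDS, consistent with the artificial example above, and not the code of the accompanying package. Protein groups are visited from best to worst score, and a group is dropped if any of its proteins, or their target/decoy counterparts, already belong to a retained higher-scoring group. The `REV__` decoy prefix, the function names and the choice that only retained groups block later ones are assumptions of this sketch.

```python
from typing import List, Set, Tuple

DECOY_PREFIX = "REV__"  # assumption: MaxQuant-style decoy prefix

ProteinGroup = Tuple[List[str], float]  # (protein identifiers, group score)


def base_name(protein: str) -> str:
    """Map a target or decoy identifier to the shared target identifier."""
    if protein.startswith(DECOY_PREFIX):
        return protein[len(DECOY_PREFIX):]
    return protein


def picked_group_competition(groups: List[ProteinGroup]) -> List[ProteinGroup]:
    """Keep a group only if none of its proteins (or their target/decoy
    counterparts) occur in a retained group with a higher score."""
    seen: Set[str] = set()
    survivors: List[ProteinGroup] = []
    for proteins, score in sorted(groups, key=lambda g: g[1], reverse=True):
        names = {base_name(p) for p in proteins}
        if names & seen:
            continue  # lost the competition against a better-scoring group
        seen |= names  # sketch choice: only retained groups block later ones
        survivors.append((proteins, score))
    return survivors


# Artificial example: a single-protein target group eliminates a decoy group
# of 10 proteins because one of the decoys is its counterpart.
decoy_group = [f"REV__P{i}" for i in range(10)]
example_groups = [
    (["P0"], 10.0),
    (decoy_group, 9.0),
    (["REV__Q"], 8.0),
    (["Q"], 7.0),
    (["R1", "R2"], 5.0),
]
for proteins, score in picked_group_competition(example_groups):
    print(score, proteins)
```

      With single-protein groups only, the rule reduces to keeping the better-scoring member of each target-decoy pair, i.e. the picked TDS.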
      Furthermore, we demonstrated that the use of razor peptides can lead to anti-conservative protein group-level FDR estimates. Fortunately, we observed in our simulation experiments that this bias will likely be minor if the best practice of applying a global 1% peptide-level FDR cutoff before protein grouping is used. We acknowledge that razor peptides increase the number of peptides for a protein group and thereby stabilize protein abundance estimates. However, it cannot be guaranteed that the assignment to one of the protein groups in question is indeed correct. Using razor peptides might, therefore, lead to a false sense of confidence in the presence and abundance of specific isoforms.
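      To make the concern concrete, the following Python sketch assigns shared peptides under a commonly used razor rule: each shared peptide goes to the candidate protein group with the most peptides overall. The function name, the alphabetical tie-breaking and the toy identifiers are assumptions made for illustration. Search engines may implement the rule differently.

```python
from collections import defaultdict
from typing import Dict, List, Set


def assign_razor_peptides(peptide_to_groups: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Assign every peptide to exactly one protein group.

    Unique peptides stay with their only group. A shared peptide goes to the
    candidate group with the most peptides (ties broken alphabetically).
    """
    # Count all peptides (unique and shared) that map to each group.
    counts: Dict[str, int] = defaultdict(int)
    for groups in peptide_to_groups.values():
        for group in groups:
            counts[group] += 1

    assignment: Dict[str, Set[str]] = defaultdict(set)
    for peptide, groups in peptide_to_groups.items():
        # max() returns the first maximal element, so sorting the candidates
        # first makes ties resolve alphabetically.
        winner = max(sorted(groups), key=lambda group: counts[group])
        assignment[winner].add(peptide)
    return dict(assignment)


# Toy example: PEP3 is shared between two isoforms of the same gene.
toy_peptides = {
    "PEP1": ["iso1"],
    "PEP2": ["iso1"],
    "PEP3": ["iso1", "iso2"],
    "PEP4": ["iso2"],
}
razor_assignment = assign_razor_peptides(toy_peptides)
print(razor_assignment)
```

      In this toy case the shared peptide is credited to iso1 although it could equally originate from iso2, which illustrates why razor assignments do not by themselves confirm a specific isoform.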
      More concerning for protein quantification, isoform-specific peptides are frequently identified in only a small fraction of samples. This often leads to high levels of missing values for isoforms, regardless of whether razor peptides are used or not. One way to address this issue could be to take the abundances of the peptides shared between isoforms into account [Gerster et al., 2014], [Jacob et al., 2019]. As such methods still have to prove their reliability, we recommend performing differential abundance analysis on gene level and using isoform-level quantification only in cases where enough information is available.
      In summary, the current study presents a method for protein group FDR estimation that is both correct and sensitive. The accompanying software as well as the data simulation scripts are open-source, providing the proteomics community with useful new tools to design, develop and test methods for estimating protein group-level FDRs. We also expect that the ability of the software to generate consistent protein group identifications when combining search results from different (and possibly large) datasets will make proteomic experiments more comparable without the need for expending large computational resources.

      Data availability

      Raw files for the Wang et al. dataset are available on PRIDE (PXD010154). The MaxQuant search results and result files of the Picked Protein Group FDR analysis are available on Zenodo (10.5281/zenodo.7157677). The software and graphical user interface are freely available on GitHub (https://github.com/kusterlab/picked_group_fdr) and as a Python package (https://pypi.org/project/picked-group-fdr/). This includes the scripts to reproduce the figures in this manuscript as well as those for simulating PSMs for large-scale datasets.

      Supplemental data

      This article contains supplemental data.

      Competing Financial Interest

      M.W. and B.K. are founders and shareholders of OmicScouts GmbH and MSAID GmbH, both operating in the field of proteomics. They have no operational role in either company.

      Acknowledgments

      This work was in part funded by the German Federal Ministry of Education and Research (BMBF, grant no. 031L0168) and an ERC Advanced Grant (grant no. 833710).

      References

        • Wilhelm M.
        • Schlegl J.
        • Hahne H.
        • Gholami A.M.
        • Lieberenz M.
        • Savitski M.M.
        • Ziegler E.
        • Butzmann L.
        • Gessulat S.
        • Marx H.
        • others
        Mass-spectrometry-based draft of the human proteome.
        Nature. 2014; 509: 582-587
        • Kim M.-S.
        • Pinto S.M.
        • Getnet D.
        • Nirujogi R.S.
        • Manda S.S.
        • Chaerkady R.
        • Madugundu A.K.
        • Kelkar D.S.
        • Isserlin R.
        • Jain S.
        A draft map of the human proteome.
        Nature. 2014; 509: 575-581
        • Huttlin E.L.
        • Bruckner R.J.
        • Navarrete-Perea J.
        • Cannon J.R.
        • Baltier K.
        • Gebreab F.
        • Gygi M.P.
        • Thornock A.
        • Zarraga G.
        • Tam S.
        • Szpyt J.
        • Gassaway B.M.
        • Panov A.
        • Parzen H.
        • Fu S.
        • Golbazi A.
        • Maenpaa E.
        • Stricker K.
        • Guha Thakurta S.
        • Zhang T.
        • Rad R.
        • Pan J.
        • Nusinow D.P.
        • Paulo J.A.
        • Schweppe D.K.
        • Vaites L.P.
        • Harper J.W.
        • Gygi S.P.
        Dual proteome-scale networks reveal cell-specific remodeling of the human interactome.
        Cell. May 2021; 184 (e28): 3022-3040
        • Edwards N.J.
        • Oberti M.
        • Thangudu R.R.
        • Cai S.
        • McGarvey P.B.
        • Jacob S.
        • Madhavan S.
        • Ketchum K.A.
        The CPTAC data portal: a resource for cancer proteomics research.
        Journal of proteome research. 2015; 14: 2707-2713
        • Lautenbacher L.
        • Samaras P.
        • Muller J.
        • Grafberger A.
        • Shraideh M.
        • Rank J.
        • Fuchs S.T.
        • Schmidt T.K.
        • The M.
        • Dallago C.
        • others
        ProteomicsDB: toward a FAIR open-source resource for life-science research.
        Nucleic acids research. 2022; 50: D1541-D1552
        • Perez-Riverol Y.
        • Csordas A.
        • Bai J.
        • Bernal-Llinares M.
        • Hewapathirana S.
        • Kundu D.J.
        • Inuganti A.
        • Griss J.
        • Mayer G.
        • Eisenacher M.
        • others
        The PRIDE database and related tools and resources in 2019: improving support for quantification data.
        Nucleic acids research. 2019; 47: D442-D450
        • Desiere F.
        • Deutsch E.W.
        • King N.L.
        • Nesvizhskii A.I.
        • Mallick P.
        • Eng J.
        • Chen S.
        • Eddes J.
        • Loevenich S.N.
        • Aebersold R.
        The PeptideAtlas project.
        Nucleic Acids Research. 2006; 34: D655-D658
        • Savitski M.M.
        • Wilhelm M.
        • Hahne H.
        • Kuster B.
        • Bantscheff M.
        A scalable approach for protein false discovery rate estimation in large proteomic data sets.
        Molecular & Cellular Proteomics. 2015; (mcp–M114)
        • The M.
        • MacCoss M.J.
        • Noble W.S.
        • Käll L.
        Fast and Accurate Protein False Discovery Rates on Large-Scale Proteomics Data Sets with Percolator 3.0.
        Journal of The American Society for Mass Spectrometry. 2016; 27: 1719
        • Omenn G.S.
        • Lane L.
        • Overall C.M.
        • Paik Y.-K.
        • Cristea I.M.
        • Corrales F.J.
        • Lindskog C.
        • Weintraub S.
        • Roehrl M.H.A.
        • Liu S.
        Progress Identifying and Analyzing the Human Proteome: 2021 Metrics from the HUPO Human Proteome Project.
        Journal of proteome research. 2021; 20: 5227-5240
        • Plubell D.L.
        • Käll L.
        • Webb-Robertson B.-J.
        • Bramer L.
        • Ives A.
        • Kelleher N.L.
        • Smith L.M.
        • Montine T.J.
        • Wu C.C.
        • MacCoss M.J.
        Can we put Humpty Dumpty back together again? What does protein quantification mean in bottom-up proteomics?.
        bioRxiv. 2021;
        • Tapial J.
        • Ha K.C.H.
        • Sterne-Weiler T.
        • Gohr A.
        • Braunschweig U.
        • Hermoso-Pulido A.
        • Quesnel-Vallières M.
        • Permanyer J.
        • Sodaei R.
        • Marquez Y.
        • Cozzuto L.
        • Wang X.
        • Gómez-Velázquez M.
        • Rayon T.
        • Manzanares M.
        • Ponomarenko J.
        • Blencowe B.J.
        • Irimia M.
        An atlas of alternative splicing profiles and functional associations reveals new regulatory programs and genes that simultaneously express multiple major isoforms.
        Genome Research. October 2017; 27: 1759-1768
        • Rechenberger J.
        • Samaras P.
        • Jarzab A.
        • Behr J.
        • Frejno M.
        • Djukovic A.
        • Sanz J.
        • González-Barberá E.M.
        • Salavert M.
        • López-Hontangas J.L.
        • others
        Challenges in clinical metaproteomics highlighted by the analysis of acute leukemia patients with gut colonization by multidrug-resistant enterobacteriaceae.
        Proteomes. 2019; 7: 2
        • Elias J.E.
        • Gygi S.P.
        Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry.
        Nature methods. 2007; 4: 207-214
        • Serang O.
        • Käll L.
        Solution to statistical challenges in proteomics is more statistics, not less.
        Journal of proteome research. 2015; 14: 4099-4103
        • Ezkurdia I.
        • Vázquez J.
        • Valencia A.
        • Tress M.
        Analyzing the First Drafts of the Human Proteome.
        Journal of Proteome Research. August 2014; 13: 3854-3855
        • Reiter L.
        • Claassen M.
        • Schrimpf S.P.
        • Jovanovic M.
        • Schmidt A.
        • Buhmann J.M.
        • Hengartner M.O.
        • Aebersold R.
        Protein identification false discovery rates for very large proteomics data sets generated by tandem mass spectrometry.
        Molecular & Cellular Proteomics. 2009; 8: 2405-2417
        • Dost B.
        • Bandeira N.
        • Li X.
        • Shen Z.
        • Briggs S.P.
        • Bafna V.
        Accurate Mass Spectrometry Based Protein Quantification via Shared Peptides.
        Journal of Computational Biology. April 2012; 19: 337-348
        • Nesvizhskii A.I.
        • Aebersold R.
        Interpretation of Shotgun Proteomic Data.
        Molecular & Cellular Proteomics. October 2005; 4: 1419-1440
        • Serang O.
        • Noble W.
        A review of statistical methods for protein identification using tandem mass spectrometry.
        Statistics and its interface. 2012; 5: 3
        • The M.
        • Tasnim A.
        • Käll L.
        How to talk about protein-level false discovery rates in shotgun proteomics.
        Proteomics. 2016; 16: 2461-2469
        • Audain E.
        • Uszkoreit J.
        • Sachsenberg T.
        • Pfeuffer J.
        • Liang X.
        • Hermjakob H.
        • Sanchez A.
        • Eisenacher M.
        • Reinert K.
        • Tabb D.L.
        In-depth analysis of protein inference algorithms using multiple search engines and well-defined metrics.
        Journal of proteomics. 2017; 150: 170-182
        • Schallert K.
        • Verschaffelt P.
        • Mesuere B.
        • Benndorf D.
        • Martens L.
        • Van Den Bossche T.
        Pout2Prot: An Efficient Tool to Create Protein (Sub) groups from Percolator Output Files.
        Journal of Proteome Research. 2022;
        • Serang O.
        • MacCoss M.J.
        • Noble W.S.
        Efficient Marginalization to Compute Protein Posterior Probabilities from Shotgun Mass Spectrometry Data.
        Journal of Proteome Research. October 2010; 9: 5346-5357
        • Pfeuffer J.
        • Sachsenberg T.
        • Dijkstra T.M.H.
        • Serang O.
        • Reinert K.
        • Kohlbacher O.
        EPIFANY: A Method for Efficient High-Confidence Protein Inference.
        Journal of proteome research. 2020; 19: 1060-1072
        • Li Y.F.
        • Arnold R.J.
        • Li Y.
        • Radivojac P.
        • Sheng Q.
        • Tang H.
        A Bayesian Approach to Protein Inference Problem in Shotgun Proteomics.
        Journal of Computational Biology. August 2009; 16: 1183-1193
        • Kim M.
        • Eetemadi A.
        • Tagkopoulos I.
        DeepPep: Deep proteome inference from peptide profiles.
        PLoS computational biology. 2017; 13: e1005661
        • Serang O.
        • Moruz L.
        • Hoopmann M.R.
        • Käll L.
        Recognizing uncertainty increases robustness and reproducibility of mass spectrometry-based protein inferences.
        Journal of proteome research. 2012; 11: 5586-5591
        • Tyanova S.
        • Temu T.
        • Cox J.
        The MaxQuant computational platform for mass spectrometry-based shotgun proteomics.
        Nature protocols. 2016; 11: 2301-2319
        • Wang D.
        • Eraslan B.
        • Wieland T.
        • Hallström B.
        • Hopf T.
        • Zolg D.P.
        • Zecha J.
        • Asplund A.
        • Li L.-h.
        • Meng C.
        A deep proteome and transcriptome abundance atlas of 29 healthy human tissues.
        Molecular systems biology. 2019; 15: e8503
        • Granholm V.
        • Navarro J.F.
        • Noble W.S.
        • Käll L.
        Determining the calibration of confidence estimation procedures for unique peptides in shotgun proteomics.
        Journal of Proteomics. 2013; 80: 123-131
        • Abascal F.
        • Ezkurdia I.
        • Rodriguez-Rivas J.
        • Rodriguez J.M.
        • Pozo A.d.
        • Vázquez J.
        • Valencia A.
        • Tress M.L.
        Alternatively Spliced Homologous Exons Have Ancient Origins and Are Highly Expressed at the Protein Level.
        PLOS Computational Biology. June 2015; 11: e1004325
        • Tress M.L.
        • Abascal F.
        • Valencia A.
        Alternative splicing may not be the key to proteome complexity.
        Trends in biochemical sciences. 2017; 42: 98-110
        • Gerster S.
        • Kwon T.
        • Ludwig C.
        • Matondo M.
        • Vogel C.
        • Marcotte E.M.
        • Aebersold R.
        • Bühlmann P.
        Statistical approach to protein quantification.
        Molecular & cellular proteomics. 2014; 13: 666-677
        • Jacob L.
        • Combes F.
        • Burger T.
        PEPA test: fast and powerful differential analysis from relative quantitative proteomics data using shared peptides.
        Biostatistics. 2019; 20: 632-647