From the ‡Proteomics Center of Excellence, Northwestern University, Evanston, Illinois;§Department of Molecular Biosciences, Northwestern University, Evanston, Illinois;Department of Chemistry and the Feinberg School of Medicine, Northwestern University, Evanston, Illinois
* This research was supported by NIDA P30DA018310 and the Paul G. Allen Family Foundation (Award #11715). The work was also supported by the Sherman Fairchild Foundation and performed in collaboration with the National Resource for Translational and Developmental Proteomics under Grant P41 GM108569 from the National Institute of General Medical Sciences, National Institutes of Health. [S] This article contains supplemental material.
Within the last several years, top-down proteomics has emerged as a high-throughput technique for protein and proteoform identification. This technique has the potential to identify and characterize thousands of proteoforms within a single study, but the absence of accurate false discovery rate (FDR) estimation could hinder the adoption and consistency of top-down proteomics in the future. In automated identification and characterization of proteoforms, FDR calculation strongly depends on the context of the search. The context includes MS data quality, the database being interrogated, the search engine, and the parameters of the search. Particular to top-down proteomics, there are four molecular levels of study: proteoform spectral match (PrSM), protein, isoform, and proteoform. Here, a context-dependent framework for calculating an accurate FDR at each level was designed, implemented, and validated against a manually curated training set with 546 confirmed proteoforms. We examined several search contexts and found that an FDR calculated at the PrSM level under-reported the true FDR at the protein level by an average of 24-fold. We present a new open-source tool, the TDCD_FDR_Calculator, which provides a scalable, context-dependent FDR calculation that can be applied post-search to enhance the quality of results in top-down proteomics from any search engine.
determination of protein and proteoform identifications is needed to improve top-down proteomics for large-scale, automated proteoform discovery (qualitative analysis) and relative quantification (quantitative analysis). Discovery top-down proteomics uses LC-MS/MS to analyze complex samples to determine the proteoform composition without protease digestion and employs various search algorithms to identify proteoforms.
Over time, the bottom-up proteomics community has developed more accurate, global FDR solutions that scale well. A major complexity in bottom-up is the “protein inference problem,” in which individual peptides can be shared between highly related proteins of different genes, or between isoforms and proteoforms of a single protein (
) do not consider the individual molecular levels discovered by top-down (Fig. 1) and thus cannot be used in this context. Likewise, targeted top-down analysis tools have been available for many years, but the accuracy of FDR determination remains understudied and has not been standardized within the community.
In high-throughput top-down proteomics, search algorithms function by scoring the match between a set of theoretical proteoforms and an observed set of MS1 and MS2 data. As in bottom-up proteomics, the MS1 and MS2 data object in top-down will here be referred to as a Spectrum, and it contains both intact and fragment masses. Likewise, a match between a Spectrum and a theoretical proteoform can be called a Proteoform Spectral Match (PrSM) (Fig. 1). Typically, spectral data are converted to the neutral mass regime. Matches between Spectra and theoretical proteoforms, PrSMs, can then be manually validated by mass spectrometrists to determine which proteoforms are present in the sample. Although this approach is very labor intensive and subject to human interpretation, it has been used successfully in the past (
) see Fig. 1. Proteins represent a collection of expressed proteoforms, and although the proteoform is a form of a given protein, a protein is usually expressed as multiple proteoforms. Each measured proteoform results from a series of molecular processing events starting with transcription and ending with post-translational modifications. Each gene typically is associated with a canonical amino acid sequence. This sequence is often different from that observed in biological samples: there may be different alleles or coding SNPs creating sequence variants, transcripts from a given gene can be alternatively spliced to form multiple isoforms which also translate into different amino acid sequences, and proteoforms can have covalently attached, site-specific features enzymatically added to form post-translational modifications (PTMs). The sum of these events leads to a population of multiple molecules, each being a unique proteoform. This is further complicated by isoforms. Some proteins, such as the human high mobility group protein (P17096-1 and P17096-2), come in multiple isoforms. These isoforms have differing amino acid sequences and different modifications. In systems where a confidence metric specifically at the isoform level is desired, FDRs can now be calculated for this purpose. However, we expect that most studies will focus mainly on protein and proteoform-level FDR values.
The concept of an expressed protein exists to help us simplify the complexity associated with understanding biological function at the molecular level and has worked well for bottom-up proteomics. However, the presence of multiple proteoforms arising from a single gene makes the concept of an expressed protein more complex, as it is the proteoforms that are expressed. Although unmodified protein sequences, as directly encoded by a gene, are frequently expressed in bacteria, in eukaryotes this appears to occur less often than the expression of modified sequences. (For example, in one study only 106 of 1046, or 10.1%, of the discovered proteoforms are unmodified, whereas the Catherman data set presented below has only 4.5% unmodified proteoforms.) This duality, that the protein entity encoded by a gene is rarely expressed, while modified proteoforms more commonly occur in the physical world, complicates FDR determination in top-down proteomics.
Observing proteoforms is further complicated by issues of identification and characterization. To avoid confusion, the term Protein Identification refers simply to the act of assigning a gene product to a gene, or in practice to a protein accession in a gene-centric protein knowledgebase. It should be noted that there may or may not be data that allow the exact determination of proteoforms with complete molecular specificity. Thus, a protein may be identified as present in a sample even if no proteoforms are fully characterized. In contrast, full characterization only occurs when the molecular specificity of a proteoform can be determined within the context of the search. Proteoforms containing unknown mass shifts or incompletely localized PTMs are said to be partially characterized. In top-down proteomics, it is less useful to discuss characterizing an isoform or protein, except in the case of very simple systems where there is only a single proteoform produced by a given gene.
Identification can occur at three independent molecular levels (Fig. 1). The protein entry is a gene-level identification and is denoted with an accession number from a gene-centric database like UniProtKB, whereas a proteoform identification refers to one combination of modifications with a single primary structure. A proteoform should be reported with an accession number from a proteoform-centric database like the one maintained by the Consortium for Top-Down Proteomics. Between the protein entry and proteoform level, there exists an isoform entry level. Identifying an isoform means that there is evidence for the presence of a given sequence arising from alternative splicing or translational start site in the sample, independent of which PTMs might be present.
Top-down search engines assign a numeric score to the degree of matching between a Spectrum and a candidate proteoform. ProSight (
). This score is a nonlinear transformation of the number of fragment ions matching between the candidate proteoform and the PrSM, where the non-linear response is governed by the search parameters. Well-suited for targeted studies, this score allows gene-level identification but cannot automatically distinguish partial characterizations, a shortcoming alleviated by the C-score (
The fundamental problem, regardless of search engine, is determining when a continuous score (i.e. one that can have continuous values over a given range) is sufficiently good to allow the assertion that the Spectrum represents a specific proteoform. The search engine returns a set of putative discoveries which can be ranked by score from best to worst, and a cutoff can be found which determines the so called “selected discoveries.” Ideally, the cutoff separates true discoveries from false discoveries. The level of this cutoff must be determined by the needs of the individual experiment. For example, studies interested in biomarker discovery (
The distribution of scores will vary with the context of the search. The context is defined here as the set of the Spectra searched, the database searched against, the search engine, and the parameters used. To determine whether a score is good enough to allow identification within a given context, we need to infer the null distribution of the score within that context. The null distribution is an unknown probability distribution associated with a context that reports the probability of a given score (or better) occurring by chance alone.
One approach to inferring the null distribution is to reverse or scramble the searched database such that none of the candidate proteoforms in the new decoy database could represent real ions measured by the observed Spectra. Then, by duplicating the search context against this decoy database, any PrSMs returned are known to be false, and are called Decoy PrSMs. The distribution of these false scores can then be used as a surrogate for the null distribution (
Two approaches, parametric and non-parametric, have been used to determine the PrSM-level FDR from a decoy distribution. With a non-parametric approach, for a given list of PrSMs, the FDR is a function of the number of Decoy PrSMs scoring equal to or better than the observed Forward PrSM (
). Both parametric and non-parametric approaches have their strengths and weaknesses. Non-parametric approaches are more robust against irregularities in the decoy distribution but fail to provide any information about the FDR associated with Forward PrSMs scoring above the best Decoy PrSM. Likewise, parametric solutions suffer from errors in correctly modeling the null distribution but provide FDR information about all Forward PrSMs.
Tools using non-parametric approaches are available. These tools use a model-free approach to empirically estimate null distributions and control the FDR at the PrSM level. Such a non-parametric system has been used by early versions of the TDPortal, a customized Galaxy search portal. The TDPortal is the first top-down tool to control FDR at not just the PrSM level, but also at the proteoform, isoform, and protein levels.
Here we present a logical structure for calculating an identification FDR at the proteoform, isoform, and protein level using PrSMs from their given search context. We supply software for performing this calculation, and a large set of curated spectral results as training data. We show that the FDR functions and scales correctly on the training data and on previously published results.
An algorithm for calculating the context-dependent FDR (CD FDR) associated with PrSM, proteoform, isoform, and protein identification was developed. This algorithm uses a non-parametric FDR for identification but enhances this with a parametric FDR for all those molecular entities identified with a score better than the best decoy score. The algorithm calculates a separate decoy distribution at each molecular level encountered in top-down proteomics. A training dataset with 546 manually validated PrSMs was built from data from two representative publications from two different laboratories. TDCD_FDR_Calculator, software that implements the algorithm on search engine output, was developed and tested against the training dataset. Both the training datasets and the TDCD_FDR_Calculator tool are made available and described in detail below.
Non-parametric Estimation of FDR
Prior to FDR calculation, each spectral data file was deconvoluted and deisotoped. Here the algorithm was tested on ProSight Absolute Mass (
), and was designed to work with results from any top-down search engine. Following the specifications for TDCD_FDR_Calculator, comma separated value (CSV) files were used for input, one for each LC-MS/MS run.
The method for calculating the CD FDR was implemented according to the following outline:
1. Search Spectra and Pool Resulting PrSMs
All Spectra from a given run are searched separately within a single context and the resulting PrSMs are pooled across the entire experiment. The Spectra are searched against both a forward and a decoy database. Here, a decoy database was created by scrambling each isoform sequence in the forward database and then shotgun annotating all encoded PTMs at the amino acid residue of their original location. Other database scrambling strategies are possible, and the system presented here will accept any user-defined strategy.
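The sequence-scrambling part of decoy construction can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used for the published databases: the helper names, the `DECOY_` accession prefix, and the seeding scheme are all hypothetical, and the re-annotation of PTMs at their original residue positions is not modeled here.

```python
import random


def scramble_sequence(sequence, seed=0):
    """Return a shuffled copy of an amino acid sequence (hypothetical helper)."""
    rng = random.Random(seed)  # seeded so the decoy database is reproducible
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


def build_decoy_database(forward_db, seed=0):
    """Map each isoform accession to a scrambled decoy sequence.

    forward_db: dict of accession -> amino acid sequence.
    The "DECOY_" prefix is an illustrative convention, not taken from the paper.
    """
    decoys = {}
    for index, (accession, sequence) in enumerate(sorted(forward_db.items())):
        decoys["DECOY_" + accession] = scramble_sequence(sequence, seed + index)
    return decoys


forward = {"ISO_1": "MKTAYIAKQR", "ISO_2": "ACDEFGHIK"}
decoy = build_decoy_database(forward)
```

Each decoy keeps the length and amino acid composition of its forward isoform, so the decoy search space stays comparable in size to the forward one.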
PrSMs are defined as Spectra that match a proteoform with at least one MS2 fragment ion. Each ProSight PrSM is returned from the search engine with two scores; a P-score (
) that measures the uniqueness of the Spectrum's match relative to the other entries in the database. Each Informed Proteomics PrSM is returned with an E-value, and an FDR value, q, which we refer to as the PrSM-level FDR.
From here forward, different calculations are performed depending on the molecular entity under consideration (Fig. 1); thus, when calculating a proteoform CD FDR, only proteoforms associated with PrSMs are considered, and for protein CD FDRs only protein entries associated with PrSMs are considered.
2. Determine Context-Dependent Score for each Entity
Multiple PrSMs can support the existence of a given molecular entity within a run. However, each entity must be assigned a single score to represent it for downstream processing. This algorithm uses the best score from the set of PrSM scores. For example, at the proteoform level, multiple PrSMs may report on the same proteoform, but the PrSM with the best score (with the lowest score in the case of the P-score) is selected as the proteoform's score. The same follows at the protein level. The best score represents the protein from all the PrSMs that report on it. The same logic is used on both forward and decoy results.
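The best-score rule described above can be sketched in a few lines. The function name and the `(entity, score)` tuple layout are assumptions made for illustration; the same function would be applied separately to forward and decoy PrSMs and at each molecular level.

```python
def best_score_per_entity(prsms, lower_is_better=True):
    """Collapse PrSM scores to one representative score per molecular entity.

    prsms: iterable of (entity_id, score) pairs, where entity_id is the
    proteoform, isoform, or protein that the PrSM reports on.
    lower_is_better: True for scores such as the P-score.
    """
    best = {}
    for entity, score in prsms:
        if entity not in best:
            best[entity] = score
        elif lower_is_better:
            best[entity] = min(best[entity], score)
        else:
            best[entity] = max(best[entity], score)
    return best


# Hypothetical proteoform-level example with P-scores (lower is better).
prsms = [("PFR_A", 1e-10), ("PFR_A", 1e-25), ("PFR_B", 1e-5)]
proteoform_scores = best_score_per_entity(prsms)
```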
3. Estimate Posterior Probability for each Forward Entity using Decoys
The scores from decoy entities are used to estimate the corresponding null distribution. Using a non-parametric approach, a posterior probability is derived for each molecular entity (PrSM, proteoforms, isoforms, and proteins) by counting the proportion of decoy to forward entities scoring better than the forward entity in question (
4. Filter Forward Entities before Multiple Testing Correction
The posterior probability determined from the decoy null distribution is an estimate of the probability of the Forward entity scoring as well as it did or higher because of chance alone. Yet, this probability does not consider the multiple hypothesis testing issue associated with testing a set of forward entities. As more forward entities are interrogated, the standard of evidence required for each entity increases. For this reason, it is useful to filter obviously incorrect results from the forward entity list. We require the best supporting PrSM for an Absolute Mass search to have at least 3 matching fragments, 6 for a Biomarker search, and an E-value less than 1 for MSPathFinder. This represents a heuristic rule for the minimum amount of supporting evidence required to accept an identification, regardless of its PrSM-level FDR.
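The heuristic filter can be expressed as a small predicate. The thresholds are the ones stated above; the function name, the search-type labels, and the argument layout are hypothetical and chosen only for this sketch.

```python
# Minimum matching fragments required for the best supporting PrSM.
MIN_FRAGMENTS = {"absolute_mass": 3, "biomarker": 6}


def passes_filter(search_type, matching_fragments=None, e_value=None):
    """Return True if the best supporting PrSM meets the minimum evidence rule."""
    if search_type == "mspathfinder":
        # MSPathFinder results are filtered on E-value rather than fragment count.
        return e_value is not None and e_value < 1.0
    minimum = MIN_FRAGMENTS[search_type]
    return matching_fragments is not None and matching_fragments >= minimum
```

Entities that fail this check would be removed from the forward list before the multiple testing correction, which reduces the number of hypotheses being tested.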
5. Apply Multiple Testing Correction to obtain q values
The posterior probability scores for each molecular entity are then corrected for multiple testing. The critical p values derived from the Benjamini and Hochberg calculation are converted algebraically to q values (q) by multiplying the posterior probability (p) by the total number of forward entities (n) and dividing by the entity's rank (r), giving q = p × n/r. This value is the q value and represents the FDR for the entire study set at each molecular level.
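Steps 3 and 5 can be sketched together as follows. Several choices here are assumptions rather than details taken from the text: the posterior probability is taken as the fraction of decoy entities scoring at least as well as the forward entity, entities with no such decoy are assigned the bound 1/(number of decoys), scores are oriented so that larger is better, and the final running-minimum step is the standard Benjamini and Hochberg monotonicity adjustment, which the description above does not mention explicitly.

```python
from bisect import bisect_left


def posterior_probabilities(forward_scores, decoy_scores):
    """Estimate, per forward entity, the chance of its score arising from the null."""
    decoys = sorted(decoy_scores)
    n_decoy = len(decoys)
    probs = []
    for score in forward_scores:
        n_as_good = n_decoy - bisect_left(decoys, score)  # decoys with score >= this one
        # With no decoy scoring as well, only the bound 1/n_decoy can be stated.
        probs.append(max(n_as_good, 1) / n_decoy)
    return probs


def q_values(probs):
    """Apply q = p * n / r, with rank r assigned by ascending p (ties keep input order)."""
    n = len(probs)
    order = sorted(range(n), key=lambda i: probs[i])
    q = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):  # walk from the worst rank to the best
        i = order[rank - 1]
        running_min = min(running_min, probs[i] * n / rank)
        q[i] = running_min
    return q


decoy_scores = list(range(1, 101))       # 100 hypothetical decoy entity scores
forward_scores = [150, 99.5, 90.5, 50.5]  # hypothetical forward entity scores
q = q_values(posterior_probabilities(forward_scores, decoy_scores))
```

Running the same two functions on protein-level, isoform-level, and proteoform-level scores yields a separate set of q values for each molecular level.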
6. Take Characterization into Account when Reporting Proteoforms
To avoid misrepresenting proteoform discoveries, some care should be taken to separate instances of partial characterizations from fully characterized proteoforms. Here, proteoforms from ProSight searches with a C-score greater than 40 are considered characterized and included in characterized proteoform counts. It should be remembered that with all search engines, proteoforms are only characterized within the context of the search. There is always the possibility that there exists another proteoform that could explain the observed data.
Aggregation to higher molecular levels: Note that Steps 2 to 6 are performed separately for each molecular level. Thus, the list of proteoforms with a CD FDR of 1% is different from the list of proteins with a CD FDR of 1%. Care must always be taken to refer correctly to each list. As a default, we anchor proteoform reporting to those that map to identifications matching a 1% CD FDR at the protein level (
The non-parametric estimate of the q value is robust and can estimate the FDR even in the presence of any unknown issues within the search context. Unfortunately, it provides no estimation of the q value for entities where the best scoring PrSM is better than the best decoy. In this case, the posterior probability is simply less than the reciprocal of the total number of decoy PrSMs considered. For example, if there are one thousand decoy PrSMs, any forward PrSM scoring better than the best decoy PrSM can only be said to have a posterior probability less than one in one thousand. This is enough for generating lists of identified proteoforms, but it fails to show increasing support for forward PrSMs with better scores.
If desired, a parametric model can be fit to the null distribution, and an FDR estimate calculated for entities scoring beyond the highest scoring decoy PrSM. TDCD_FDR_Calculator fits a gamma distribution to the null data and uses the area under the curve from the forward PrSM to infinity as the parametric estimate of the posterior probability. The tool only provides these for entities scoring better than the best decoy PrSM. These scores are reported as the Enhanced FDR. Note that even if the parameterized model is skewed, these best forward results will preserve their relative rankings (e.g. a result with a q value of 10^−80 carries more information than one with 10^−40) and can be used for downstream processing such as gene-set enrichment analysis.
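The parametric tail estimate can be illustrated with a self-contained sketch. It is not the TDCD_FDR_Calculator implementation, which relies on MathNet.Numerics: the method-of-moments fit, the function names, and the numerical routines for the incomplete gamma function (a power series and a continued fraction evaluated with the modified Lentz method) are all assumptions chosen so that the example has no external dependencies. Scores are assumed positive, with larger being better.

```python
import math


def fit_gamma_moments(scores):
    """Method-of-moments gamma fit; returns (shape, scale)."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean * mean / var, var / mean


def _lower_series(a, x):
    """Regularized lower incomplete gamma P(a, x) by power series (for x < a + 1)."""
    term = 1.0 / a
    total = term
    n = a
    for _ in range(1000):
        n += 1.0
        term *= x / n
        total += term
        if abs(term) < abs(total) * 1e-15:
            break
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))


def _upper_cf(a, x):
    """Regularized upper incomplete gamma Q(a, x) by continued fraction (for x >= a + 1)."""
    tiny = 1e-300
    b = x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, 1000):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-15:
            break
    return math.exp(-x + a * math.log(x) - math.lgamma(a)) * h


def gamma_sf(score, shape, scale):
    """Area under the fitted gamma density from `score` to infinity."""
    x = score / scale
    if x <= 0.0:
        return 1.0
    if x < shape + 1.0:
        return 1.0 - _lower_series(shape, x)
    return _upper_cf(shape, x)


shape, scale = fit_gamma_moments([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical decoy scores
tail = gamma_sf(200.0, shape, scale)  # forward score far beyond the best decoy
```

The continued fraction branch evaluates the tail directly rather than as one minus a cumulative probability, which is what allows very small values to remain distinguishable from one another.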
Manual Spectral Validation
Training data sets with known results are a useful approach to validating FDR calculations. With sufficiently large datasets, validating an FDR requires nothing more than verifying that the number of false and true positives are in the correct ratio. To build such a training dataset, the ProSight and MSPathFinder search engines were used to generate a list of candidate proteoforms (detailed below). Each spectrum supporting each proteoform was then compared with theoretical isotopic distributions (both MS1 and MS2) generated using Mercury7. A set of experts in top-down mass spectrometry then confirmed the matching PrSM using experience and a set of standardized metrics (Table I). This involves a careful examination of the averaged precursor spectrum, the averaged fragmentation spectrum, and plausibility of the proteoform candidate itself in a given biological context (e.g. the protein and proteoform levels). TDValidator (Proteinaceous Inc., Evanston, IL) was used to assist with most of the expert-driven validation (
The training set was constructed using data acquired from two laboratories and using different search engines. For each Spectrum in the set, an mzML file was generated that contains all the MS1 and MS2 scans involved. There is also a summary CSV file for each result set that maps the mzML files to the proteoform found, and a copy of the RAW data files.
From Catherman et al. 2013: Tandem MS data were acquired in the course of a previously published study on human proteins. Briefly, in that study mitochondrial membrane proteins were isolated from H1299 cells and separated using a GELFrEE 8100 Fractionation system (Expedeon). Data from a single GELFrEE fraction containing ∼15–20 kDa proteins (with 846 MS1 and 889 MS2 spectra) were selected and Spectra were manually created by combining multiple MS1 and MS2 scans. Each Spectrum was then searched against ProSight Warehouses (.pwf files) originally built from UniProt release 2012_02 of the human proteome. Spectra with a single MS1 precursor and a single, unambiguous “correct” proteoform result were selected and verified by at least two practitioners trained in top-down proteomics data analysis, to be considered as true answers. This process resulted in 429 Spectra that show evidence for 193 unique proteoforms and 164 unique protein entries, all of which were considered true positives.
Park et al. Validated Results: Park et al. (2017) provide an ovarian tumor replicate run containing 1147 MS1 and 6882 MS2 spectra. Of the 359 protein entries (3042 Spectra) identified at 1% FDR in the manuscript (Fig. 6 of Park et al.), 117 were unique to Informed Proteomics when compared with the TDPortal. These 117 protein entries and their corresponding Spectra were manually validated as above. This yielded 95 true positive and 22 true negative protein entries. In addition, further manual validation was performed to generate Fig. 3A up to a 2% FDR.
Included with the training set, but not used here, are 2705 mzML files with corresponding protein identifications which are the Spectra from the ovarian tumor replicate from Park et al. (2017) that were also found by the TDPortal v3.0. These results represent a 1% FDR, with the TDPortal using the CD FDR described here.
TDCD_FDR_CALCULATOR Software Tool
The top-down context-dependent FDR calculator application, TDCD_FDR_Calculator, performs the FDR calculation detailed here, and its source code is provided. It was implemented in C# 7.0 using Microsoft Visual Studio 2017 Update 4 and runs as a .NET Core 2.0 console application. The Gamma fitting portion depends on version 3.20 of the MathNet.Numerics NuGet package. Three arguments are required: a path to the sorted forward data, a path to the sorted decoy data, and a path to write the output. The input files must be CSV files where the first column is a generic text tag and the second column is a score (where larger is better, sorted ascending). The output CSV will be the same as the forward input file, but with two additional columns: non-parametric q value and Enhanced q value. The source code for this open source tool is available at: https://github.com/NRTDP/TDCD_FDR_Calculator.
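Input files in the described layout could be prepared with a short script such as the one below. This is a sketch under assumptions: the helper names are hypothetical, no header row is written because the text does not say whether one is expected, and the negative log10 transform is only one possible way to turn a lower-is-better P-score into a larger-is-better score.

```python
import csv
import io
import math


def p_score_to_larger_is_better(p_score):
    """Convert a lower-is-better P-score to a larger-is-better score (assumed transform)."""
    return -math.log10(p_score)


def write_calculator_input(entities, handle):
    """Write (tag, score) rows as two-column CSV, sorted ascending by score."""
    writer = csv.writer(handle)
    for tag, score in sorted(entities, key=lambda item: item[1]):
        writer.writerow([tag, score])


# Hypothetical forward entities; in practice `handle` would be an open file.
forward = [("protein_A", 5.0), ("protein_B", 2.0)]
buffer = io.StringIO()
write_calculator_input(forward, buffer)
```

The same function would be called once for the forward entities and once for the decoy entities, giving the two input paths the tool requires.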
In discovery top-down proteomics, each context (the combination of spectra, database, and search parameters) has a unique search space. Further, each molecular level in this search space has its own decoy distribution of scores (Fig. 2). Within a given context, the silhouette of a decoy distribution will be similar between proteins and proteoforms, but they are not identical. This difference comes from multiple PrSMs supporting a single proteoform, and multiple proteoforms supporting a single protein. There is always more decoy information available to estimate the null distribution of proteoforms than for isoforms, and more information on the null distribution of isoforms than proteins.
The consideration immediately above implies that it is best practice to generate an FDR calculation specific to each molecular level shown in Fig. 1 (i.e. protein, isoform, etc.). The list of proteins discovered with a 1% FDR is not the same list as the list of proteins resulting from PrSMs discovered at a 1% FDR. The latter list will have a higher FDR at the protein level than the former, and this difference can be quite striking. For example, the data used in the training set from Park et al. yields 298 proteins when aggregated with a 1% protein-level CD FDR, but there are 324 proteins when aggregated at the PrSM level and naïvely merged. The latter would yield an effective 8.0% FDR while appearing to have a 1% FDR, an 8.0-fold error in FDR.
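The quoted figures are consistent with the following arithmetic, offered here as one reading rather than as the stated derivation: it assumes that the 26 proteins gained by merging at the PrSM level are all counted as false discoveries.

```latex
\frac{324 - 298}{324} = \frac{26}{324} \approx 0.080 = 8.0\%,
\qquad
\frac{8.0\%}{1\%} = 8.0\text{-fold}
```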
A training set with known correct answers was created from two recently published manuscripts. This allowed the magnitude of the loss of control on FDR to be observed. The training dataset contains 546 mzML files, each containing a single Spectrum, and a spreadsheet of correct proteoform identifications and corresponding search engine scores. These represent 524 expertly validated proteoform identifications and 22 true negatives, as described above. All files can be downloaded from MassIVE: https://massive.ucsd.edu/ProteoSAFe/dataset.jsp?task=bced230c79f849c3bd87a10e08c929fb.
Using the training dataset, it is possible to determine the true FDR for a given search context and assess the quality of FDR estimation within a given context or tool used for searching and scoring in top-down proteomics. Fig. 3A shows that each of four contexts responds differently to increasing FDR cutoff levels (0.1%, 0.5%, 1%, and 2%), yet in all cases the protein-level FDR is higher than the PrSM-level FDR, on average 23.7 times higher. Therefore, PrSM-level FDRs cannot be used to accurately determine proteoform- or protein-level discoveries with a reliable FDR cutoff (Fig. 3A).
With a CD FDR algorithm calibrated against known data, it is possible to explore the role that increasing the number of runs has on FDR inflation. Park et al. describe a quantitative study with 5 LC-MS/MS runs each from two treatment categories. For each of four FDR cutoff levels (0.1%, 0.5%, 1%, and 2%), either two, four, or eight runs were chosen randomly (without replacement) and the number of proteins identified was determined. This was done five times at each FDR cutoff level. In each case, the number of identified proteins was compared with the number identified using the CD FDR algorithm, and the CD FDR cutoff producing the discovered number of proteins was taken as an estimate of the FDR achieved using a PrSM-level cutoff. The results are shown in Fig. 3B. Notice that for any FDR cutoff, the greater the number of runs pooled, the larger the effect of the loss of control on FDR.
Park et al. used a PrSM-level FDR of 1% to identify proteins. The actual protein-level FDR, after manual validation, is at least 7%. This discrepancy, as in all the cases shown in Fig. 3, comes from proteins supported by a single spectrum. Given the uncorrelated nature of single spectrum proteins (supplemental Fig. S1), the number of false positives will increase when merging search results from multiple runs. Fig. 4 shows the increase in proteins and proteoforms supported by multiple spectra in the ten runs reported in the quantitative study in Park et al. Unfortunately, one should not simply disregard proteins supported by a single PrSM, as even at ten runs only 62% of the proteoforms are supported by multiple PrSMs. Discarding the single PrSM proteoforms needlessly excludes many true positives. For example, in the qualitative study reported by Park et al., no fewer than 44 manually validated, true positive protein identifications were supported by only one PrSM each.
The CD FDR outperforms a naïve interpretation of PrSM-level FDR, though there is a generally monotonic relationship between the two (supplemental Fig. S2). To visualize the relationship between CD FDR and PrSM-level FDR, 1,173 putative protein identifications from the validated data in Park et al. were ranked by FDR. Fig. 5 shows the 200 proteins that span the transition from true to false discoveries. To the left on this figure are the true positives (accept) and to the right are true negatives (reject). Unfortunately, when ranked by either CD FDR or PrSM FDR the transition from true positive to true negative is not sharp, but rather the true and false positives mix together. The goal of automated proteomics is to pick a cutoff value that separates the true positives in the sample from the true negatives while controlling the FDR (and without requiring manual verification). In this case, the CD FDR is consistently conservative. As shown in Fig. 5, the CD FDR places the cutoff to the left of, or below, the true cutoff, while the PrSM-based approach used by MSPathFinder has the opposite behavior.
The first 284 protein discoveries in the Park et al. validated data all have a reported PrSM-level q value of zero. This phenomenon has been reported before with Big Mascot, and it occurs when no decoy PrSMs score high enough to differentiate the top-scoring discovered proteins. This failure to differentiate top-scoring proteins creates a rank-ordered list with hundreds of ties for first place, even though there is a noticeable difference in spectral support among the tied proteins. Increasing the number of decoy PrSM results can potentially alleviate this problem, but that is usually impractical given computational constraints. The Enhanced FDR uses a parameterized model to estimate the FDR for all molecular entities scoring above the top decoy hit. These optional data are provided in a separate column in the TDCD_FDR_Calculator output and can be used whenever finer detail and striation of the top scoring hits in a run is desired.
In top-down proteomics, search context is critical for estimating FDRs. The context includes the MS data searched and the database searched against, but it also includes the search engine used, the parameters of the search, and the molecular level (proteoform, isoform, or protein) of interest. Naïve PrSM-level FDR estimation loses control of the true FDR when reporting at the proteoform, isoform, and protein levels. This problem gets worse at higher molecular levels, and at the protein level the problem is significant; we found an average loss of control of 24-fold over four contexts (Fig. 3A). Therefore, it is not enough to report top-down proteomic results with a simple PrSM FDR; instead, FDRs must be calculated at the level for which results are being reported. We recommend, as a conservative reporting standard, that discovered proteins be those with a 1% protein-level FDR that are supported by at least one proteoform at a 1% proteoform-level FDR.
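The recommended reporting standard amounts to a simple two-level filter, sketched below. The function name and the dictionary-based inputs are hypothetical; the q values are assumed to have been calculated separately at the protein and proteoform levels, as described in the methods.

```python
def reportable_proteins(protein_q, proteoform_q, proteoform_to_protein, cutoff=0.01):
    """Proteins passing the protein-level cutoff that also have at least one
    proteoform passing the proteoform-level cutoff.

    protein_q: dict of protein accession -> protein-level q value.
    proteoform_q: dict of proteoform id -> proteoform-level q value.
    proteoform_to_protein: dict of proteoform id -> parent protein accession.
    """
    supported = {
        proteoform_to_protein[proteoform]
        for proteoform, q in proteoform_q.items()
        if q <= cutoff
    }
    return sorted(
        protein
        for protein, q in protein_q.items()
        if q <= cutoff and protein in supported
    )


# Hypothetical example: protein P2 passes at the protein level but has no
# proteoform at 1%, so it is not reported.
protein_q = {"P1": 0.001, "P2": 0.005, "P3": 0.2}
proteoform_q = {"PFR_1": 0.002, "PFR_2": 0.05, "PFR_3": 0.001}
parents = {"PFR_1": "P1", "PFR_2": "P2", "PFR_3": "P3"}
reported = reportable_proteins(protein_q, proteoform_q, parents)
```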
We have manually validated two publicly available datasets; one published by our laboratory and the other from Park et al. We have identified the correct proteoforms associated with 546 spectra. Additionally, 2705 spectra representing the proteins identified in common from two independent search engines are identified, curated, and made publicly available in a large dataset. While we have used this training set to explore the role of search context on FDR calculations, these data are available to be used to train and calibrate other search algorithms for top-down proteomics.
In our training set, the loss of control of FDR comes from proteins and proteoforms supported by a single PrSM. Unfortunately, many correct proteins and proteoforms are also supported by a single PrSM, so the simple expedient of removing all “one-hit wonders” is ill-advised. Instead, we provide TDCD_FDR_Calculator which implements a conservative algorithm which can be applied post-search to correct the FDR estimation for either ProSightPC or MSPathFinder.
TDCD_FDR_Calculator is conservative in FDR estimation, and several points remain to be considered in the future. When applied to the training data, the current CD FDR algorithm is overly conservative. This may be corrected by refining the estimation process using techniques from bottom-up proteomics, but such an analysis will benefit from an even larger training set. In addition, multiple searches are frequently pooled to create a final list of molecular entities. This analysis provides no guidance on how to correctly pool searches while maintaining overall control on the FDR.
It is possible that statistical dependences exist within the CD FDR. Intuitively, fragment ions can belong to multiple proteins or proteoforms and therefore affect multiple scores. This violates the independence assumption of the Benjamini and Hochberg FDR analysis. The FDR analysis can be modified to account for such dependences. Analyzing the role of dependence is a likely next step in improving the CD FDR.
In conclusion, an FDR on a protein or proteoform identification is dependent on the context of the search. The entire context must be considered, including all identifications inferred from the data. We advocate a conservative approach to FDR estimation in which all data are pooled from a given study, and separate inferences made depending on the molecular level in question. Thus, the Context Dependent solution described here is accurate, designed to scale without loss of FDR control and provides conservative control regarding the inclusion of proteoforms supported by a single PrSM. This approach is suitable for use in future top-down proteomic studies run at very large scale (e.g. millions of tandem mass spectra (
Author contributions: R.D.L., R.T.F., B.P.E., D.P.S., P.M.T., and N.L.K. designed research; R.D.L., R.T.F., B.P.E., J.B.G., and D.P.S. analyzed data; R.D.L., R.T.F., D.P.S., P.M.T., and N.L.K. wrote the paper; R.T.F., B.P.E., J.B.G., and D.P.S. performed research.