Selection of Features with Consistent Profiles Improves Relative Protein Quantification in Mass Spectrometry Experiments*

Summaries of protein abundance are undermined by mass spectrometric features inconsistent with the overall protein profile. We develop an automated statistical approach that detects such inconsistent features. These features can be separately investigated, curated, or removed. Accounting for such inconsistent features improves the estimation and the detection of differential protein abundance between conditions. The approach is implemented as an option in the open-source R-based software MSstats. Graphical Abstract Highlights Automated statistical approach for detecting uninformative features and outliers. Improved performance on relative protein quantification. An option in the open-source R-based software MSstats. In bottom-up mass spectrometry-based proteomics, relative protein quantification is often achieved with data-dependent acquisition (DDA), data-independent acquisition (DIA), or selected reaction monitoring (SRM). These workflows quantify proteins by summarizing the abundances of all the spectral features of the protein (e.g. precursor ions, transitions or fragments) in a single value per protein per run. When abundances of some features are inconsistent with the overall protein profile (for technological reasons such as interferences, or for biological reasons such as post-translational modifications), the protein-level summaries and the downstream conclusions are undermined. We propose a statistical approach that automatically detects spectral features with such inconsistent patterns. The detected features can be separately investigated, and if necessary, removed from the data set. We evaluated the proposed approach on a series of benchmark-controlled mixtures and biological investigations with DDA, DIA and SRM data acquisitions. The results demonstrated that it could facilitate and complement manual curation of the data. Moreover, it can improve the estimation accuracy, sensitivity and specificity of detecting differentially abundant proteins, and reproducibility of conclusions across different data processing tools. The approach is implemented as an option in the open-source R-based software MSstats.

Bottom-up proteomics is a method to analyze peptides from enzymatic digestion of proteins with liquid chromatography coupled with mass spectrometry (LC-MS) 1 . It is the most widespread proteomic workflow (2,3) compatible with data-dependent acquisition (DDA), data-independent acquisition (DIA) (28,11), and selected reaction monitoring (SRM) (18).
In DDA, the mass spectrometer alternates between acquiring a full MS1 spectrum of the co-eluting precursor ions for analyte quantification and acquiring the fragment MS/MS spectra of the peptide ions for identification. SRM, an important method for targeted analysis, requires an a priori knowledge of the unique peptide sequences of the protein targets. The peptides are selectively profiled based on their precursor ions and representative fragment ions, and quantified by the intensity of each pair of precursor/fragment, also known as transition. Label-based SRM workflows further enhance the accuracy of the quantification by spiking an isotopically labeled reference counterpart of each targeted peptide into the samples. In DIA, the mass spectrometer covers the entire relevant mass range with a set of wide windows. All the precursor ions in a window are co-isolated and subjected to fragmentation, to produce their combined fragment spectrum.
A common use of DDA, SRM and DIA is relative protein quantification. Protein quantification summarizes all the quantitative information pertaining to a protein across the LC-MS runs, and performs statistical analyses that detect differentially abundant proteins between conditions. Because mass spectrometers quantify peptide ions or their fragments instead of proteins, deriving protein-level conclusions can be challenging. First, the quantitative profile of a peptide ion or a fragment can deviate from the representative protein profile for biological reasons (e.g. in presence of post-translational modifications or unknown protein isoforms), or for technological reasons (e.g. because of interferences of co-eluting precursor ions such as ion suppression). Second, some of the precursor ions or fragments can be incorrectly identified or quantified across the LC-MS runs. This is particularly common for low abundant analytes. Third, in DDA workflows, the stochastic nature of the acquisition of fragmentation spectra makes it challenging to consistently identify and measure the same species in multiple LC-MS runs, and produces missing values. Finally, computational processing of the spectra adds uncertainty to protein quantification. A variety of computational tools, e.g. MaxQuant (9,8), Progenesis (Nonlinear Dynamics, Newcastle upon Tyne, UK), Proteome Discoverer (Thermo Fisher Scientific, Waltham, MA), Skyline (15), DIA Umpire (26), OpenSWATH (20) and Spectronaut (4), identify, quantify and align the peaks across the LC-MS runs. Each tool optimizes different computational tasks, and the resulting list of differentially abundant proteins may be, at least to some extent, tool-dependent. Overall, these issues undermine the accuracy and reproducibility of protein-level summarization and relative quantification.
This manuscript aims to alleviate these challenges. We propose an automated statistical approach, which takes as input a set of identified and quantified spectral features reported by a data processing tool, characterizes the information content in the individual features, and performs relative protein quantification using an informative subset of the features. We consider DDA, SRM and DIA acquisition types, and use the generic term feature to refer to precursor ions in DDA, transitions in SRM, or fragments in DIA, identified and quantified across multiple LC-MS runs. We define informative features as features that contain few missing values, and for which the intensities are consistent with most of the protein intensity profiles across the runs. We define uninformative features as features that do not satisfy these properties. Among those, low-coverage features contain many missing values. Outliers are inconsistent with the overall protein profile. Finally, noisy features exhibit substantial variation between the LC-MS runs, beyond the variation of most of the features of the protein.

Background
Detecting uninformative features in LC-MS data often involves manual curation. A feature of a peptide is evaluated by plotting the proportion of its intensity in the sum of the intensities of all the features of the peptide. For the informative features, this proportion is roughly constant across all the LC-MS runs in the investigation. Deviations from the constant pattern indicate that the feature may be compromised by noise or artifacts and should be further investigated. Such manual curation is limited in throughput. Also, when performed separately for each peptide as in common practice, it does not account for potentially distinct peptide profiles of the same protein.
Automatic approaches for detecting uninformative features have been introduced as alternatives to manual curation. They can be loosely classified into two groups: those detecting interferences or errors in identification and quantification within a single LC-MS run, and those detecting these patterns across multiple LC-MS runs. Many methods consider properties of spectral features in a single LC-MS run. These methods evaluate feature-specific patterns such as relative intensities, chromatographic traces, or patterns of isotopic distributions. The patterns are compared with their expected patterns for the analyte (or, in the case of SRM, to the observed patterns of their isotopically labeled reference counterparts). For example, the scoring model in mProphet (19) combines multiple quality metrics, and filters out peaks with likely incorrect identification, presence of interfering ions, etc. Originally developed for SRM, the mProphet algorithm has since been implemented in Spectronaut, OpenSWATH and Skyline for DIA workflows. Similar steps are implemented in other data processing tools, e.g. SWATHProphet (14) and EncyclopeDIA (21) for DIA.
To select informative features in multiple LC-MS runs, a common heuristic, called "top-n", selects n features with the highest average intensities across the runs. The heuristic is based on the insight that higher intensity features are less impacted by interference, and therefore have a higher information content. More recent approaches make more sophisticated considerations. For SRM, AuDIT (1) calculates the intensity ratio between each pair of features (i.e. transitions) generated from a same precursor in all runs, compares them to the ratios of the isotopically labeled reference counterparts, and flags significantly deviating ratios using the t test. Although the method is simple and performs well, it requires high-quality information from the standards. For DIA, mapDIA (25) implements a suite of data processing steps, including intensity normalization, fragment filtering, and detection of differentially abundant proteins. The fragment filtering step includes outlier detection and uses pairwise correlation to select features with consistent profiles across the runs. The step is tightly integrated with the rest of the mapDIA workflow and requires tuning multiple parameters that control the cross-fragment correlation and the number of peptides and fragments. Using several quality metrics, Avant-garde (27) automatically curates features with a genetic algorithm and adjusts their peak boundaries for quantification. The method is integrated as an external tool of Skyline. Diffacto (29) performs protein summarization through robust factor analysis of the features, views the features corresponding to the majority factor as representative of the protein, but does not select a subset of informative features.
This manuscript aims to improve upon the existing methods above. We propose a statistical approach that takes as input the identified and quantified features that pass single-run filtering criteria (e.g. by mProphet), and complements the filtering using the information available from multiple LC-MS runs. It selects a subset of informative spectral features for detecting differentially abundant proteins. The uninformative features can be separately investigated, curated, or removed.
The approach is compatible with multiple data acquisition workflows and data processing tools.

EXPERIMENTAL PROCEDURES
The input data sets are briefly described below and summarized in Tables I and II. The data sets are available on either MassIVE.quant (5) or Panorama (23). Details about data processing, R script for statistical analysis, and the differential abundance analysis results are available on MassIVE.quant. Data set identifiers and reanalysis container identifiers on MassIVE.quant are listed in Tables I and II. Additionally, the R script for summarizing all the analysis results in this manuscript is available at https://github.com/tsunghengtsai/MSstats-feature-selection.
Benchmark Controlled Mixtures-We evaluated the proposed approach on four benchmark-controlled mixtures of known composition. Raw spectra from a same investigation were processed with up to four data processing tools. Data from two controlled mixtures were acquired with DDA: (1) the iPRG benchmark (6), where the data were processed by MaxQuant, Progenesis, and Skyline; and (2) the Choi benchmark (7), where the data were processed by MaxQuant, Progenesis, Proteome Discoverer, and Skyline. Data from two controlled mixtures were acquired with DIA: (1) the Bruderer benchmark (4), where the data were processed by DIA Umpire, Skyline, and Spectronaut; and (2) the Navarro benchmark (16), where the data were processed by DIA Umpire, OpenSWATH, Skyline, and Spectronaut.
This work did not aim to compare the data processing tools. Rather, we used these data sets to evaluate the impact of the proposed approach on accuracy of fold-change estimation, sensitivity and specificity of detecting differentially abundant proteins, and reproducibility of conclusions across the data processing tools. We compared the performance to (1) all-features protein quantification, as currently implemented in MSstats, and (2) protein quantification based on the n features with the highest average intensities across runs (i.e. top-n features) using MSstats.
Biological Investigations-We evaluated the proposed approach using two biological investigations. The first is a label-based SRM experiment by Dunkley et al. (10), investigating in vivo neuro-devel-opment. In addition to the biological samples, the experiment included 35 replicate runs from a pooled quality control (QC) sample to assess the technical performance. Also, spike-in experiments with pooled isotope-labeled peptides spiked into the QC sample were performed to characterize the linear portion of the dynamic range for each analyte. The data were processed with Skyline, and the quality of each feature was manually curated. The manual curation consisted of two steps: (1) removing peptides with no signal, and transitions with interference, and (2) removing peptides that failed to meet a set of thresholds based on coefficient of variation (CV) of the QC runs, and the range of intensity values compared with the linear portion of the dynamic range. Whenever either an endogenous or a reference feature did not meet the quality thresholds, the entire pair was removed. Further details regarding the manual curation can be found in (10). We used this data set to evaluate the capability of the proposed approach to automatically detect uninformative features, as compared with the manual curation.
The second biological investigation is a DIA-based experiment by Selevsek et al. (22). The main investigation quantified the proteome of Saccharomyces cerevisiae in response to osmotic stress at six time points. In addition, a separate refinement experiment performed eight technical DIA runs on a same biological sample. Two analyses of the resulting data were performed in the study, both using Skyline. The first analysis, which we call Full, considered all the peptide precursors in the main investigation, and quantified each precursor with the top-6 fragments. The second analysis, which we call LowCV, selected peptides detected in at least 50% of the refinement runs, and with CV Ͻ 20% in those runs. The selected peptides were then quantified in the biological samples in the main investigation. After these steps, the features selected in both analyses were further subjected to mProphet to characterize the quality of the peaks in individual LC-MS runs in terms of q-values. Compared with the Full analysis, the LowCV analysis queried against a reduced spectral library. Therefore, a peak in the two analyses could have different q-values. We considered two types of q-filters: i) the Sparse q-filter eliminated peaks with q-value Ͼ 0.01, and ii) the 50% q-filter eliminated the entire LC-MS features where the q-values of over 50% of the peaks exceeded 0.01. Combinations of the two data analyses with the two q-filter strategies produced four data sets: Full_Sparse, Full_50%, LowCV_Sparse, and LowCV_50%. We used these data sets to evaluate the impact of the Simulated Data Sets-In addition to the experimental data, computer simulations evaluated the effect of outliers, and of different number of features and runs, on the performance of the proposed approach. The details are in supplementary Section S1.
Proposed Statistical Approach for Selecting Informative Features-Overview and Notation-The input to the approach is a list of logtransformed intensities (peak areas, heights at apex, etc.), identified, quantified and aligned across LC-MS runs by a data processing tool. We denote y ijk the log-intensity from protein i (i ϭ 1, . . ., I), feature j (i ϭ 1, . . ., J) and run k (i ϭ 1, . . ., K). Some values y ijk can be missing because of the stochastic nature of data acquisition (e.g. in DDA), or because of filtering steps (e.g. with respect to q-values). We denote n ij the number of observed log-intensities in protein i and feature j. We order the features of a protein by descending order of the number such that n ij Ն n i, jϩ1 for j ϭ 1, . . ., J Ϫ 1. The proposed approach consists of: i) developing a probability model for a representative pattern of missing log-intensities of a protein, and detecting features that deviate from this pattern, and ii) developing a probability model for a representative pattern of observed log-intensities of a protein, and detecting features that deviate from this pattern in terms of outlying log-intensities and between-run variation. We describe these aspects below, and illustrate each step in Fig. 1A, using as an example protein P32915 from the DDA iPRG benchmark, quantified with MaxQuant.
Detection of Low-coverage Features-Some amount of missing log-intensities is expected in any experiment, in particular when protein abundance is close to the limit of detection, or when a protein is absent in the samples in some conditions. Our first step is to detect low-coverage features, which contain fewer log-intensities than what is representative for the protein.
Statistical Model-We assume that the number n ij of observed log-intensities in protein i and feature j follows a Binomial distribution: where K is the number of runs and i is the representative probability of observing log-intensities in protein i. The Binomial model characterizes both the expected number and the variability of the observed log-intensities in every feature of protein i. Estimation of Representative Probability i -The value of parameter i in Eq. 1 is unknown and must be estimated from the data. We estimate i as the average proportion of observed log-intensities among all the features of protein i For example, protein P32915 ( Fig. 1A) has five features quantified across 12 runs. One feature (TWIEISGTSPR_2, highlighted in dark gray) has six observed log-intensities, and each of the other four features has 12 observed log-intensities. The representative probability of observing log-intensities is estimated as i ϭ ((6ϩ12 ⅐ 4)/12)/5 ϭ 0.9.
Detection of Low-coverage Features-According to the Binomial distribution, the cumulative probability of observing n ij or fewer logintensities in protein i and feature j is nϭ0 nij P͑n͉K, i ͒. We label a feature as low-coverage if this probability is below a pre-defined, typically very low confidence level ␣ c (e.g. ␣ c ϭ 0.01): The low-coverage features are excluded from the subsequent analyses. After this step, the protein has JЈ remaining features JЈ Յ J, and for each of the remaining features nϭ0 nij P͑n͉K, i ͒ Ͼ ␣ c . For example, the first panel of Fig. 1A illustrates that the cumulative probability for feature TWIEISGTSPR_2, n ϭ 0 6 P͑n͉12, 0.9͒ is below the confidence level (Յ 0.01). Therefore, the feature is labeled as low-coverage, and is excluded. Note that, in general, the data set may still contain missing values after this filtering.

Detection of Outliers and Noisy Features-
Statistical Model-The model describes the patterns of systematic and random variations across LC-MS runs, expected in situations where all the remaining features of protein i are informative. We represent the log-intensities of all the features of a protein with a relatively simple linear model Here i is the mean log-intensity of the protein i across all the features and all the runs, ␣ ij is the deviation of the mean logintensity of the feature j from the overall mean (expressing differences in ionization efficiency, fragmentation patterns, etc.), and ␤ ik is the deviation of the log-intensity of the run k from the overall mean (expressing differences induced by conditions, biological replicates, technical replicates, etc.). The term ijk expresses the stochastic variation unexplained by the terms above. Note that the stochastic variation needs not to be Normally distributed. We only assume that it is non-systematic (i.e. E͓ ijk ͔ ϭ 0) and has proteinspecific variance i 2 . Estimation of Representative Protein Profiles-The values of the parameters in Eq. 3 are unknown and must be estimated from the data. Because we anticipate the presence of missing values, outliers and features with noise, it is important to develop an estimation procedure that is robust to such artifacts, especially in experiments with few features and runs. Multiple robust methods for estimating the parameters in Eq. 3 exist, e.g. Tukey's median polish (TMP) implemented for run-level summarization in MSstats (7). However, when a feature systematically deviates from the representative protein profile, the robustness of TMP is quickly reduced (12). Here we propose to estimate the parameters by robust regression with M-estimation (13), a generalization of the maximum-likelihood estimation. The estimator defines the residuals r ijk ϭ y ijk Ϫ ͑ i ϩ ␣ ij ϩ ␤ ik ͒ and minimizes the objective function where (r) is the Huber function The constant c is set to 1.345 ϫ i , such that the estimator is 95% as efficient as the least squares estimator when the stochastic variation is Normally distributed (13). The standard deviation i is unknown, and is estimated with either Huber's proposal 2 (13) or with the median absolute deviation (MAD, i ϭ median͕͉r ijk ͉͖/0.6745), where the constant 0.6745 scales the MAD to approximately match to the standard deviation of a Normal distribution. Note that the Huber function (r) requires an estimate of standard deviation i , which depends on the residuals with the estimated parameters in Eq. 4. An iteratively reweighted least squares method is therefore required to estimate i , i , ␣ ij , and ␤ ik . For estimating i , we chose Huber's proposal 2, based on the simulations in supplementary Section S1. The example protein has five features quantified across 12 runs, where feature TWIEISGTSPR_2 (highlighted in dark gray in the first panel) has six observed log-intensities, which are significantly fewer than as expected. The feature is labeled as low-coverage, and is excluded. The remaining four features are used to estimate the representative protein profile and variation, as highlighted in yellow in the second panel. The log-intensity from feature GQIGIYPIK_2 and run 12 significantly deviates from the estimated patterns, and is labeled as an outlier (highlighted in dark gray in the third panel). Finally, feature GQIGIYPIK_2 (highlighted in dark gray in the fourth panel) exhibits substantial variation between the runs, and is labeled as a noisy feature. B, Application of the proposed feature selection approach for protein-level summarization and statistical inference. The profile plots of the protein with all the features and with the selected informative features are shown in the first two panels, where the detected outlier and uninformative features (TWIEISGTSPR_2 and GQIGIYPIK_2) are removed in the second panel. The informative features are used as input to perform subsequent statistical analyses. The third panel shows the results of protein-level summarization, using the informative features (shown in light gray, solid lines), or using all the features that also include the uninformative features (shown in dashed lines). The protein-level summary with the proposed approach and that with all the features are shown in yellow and dark gray, respectively.
With the estimated parameters, the protein abundance in run k is given by ͑ i ϩ ␤ ik ͒, as highlighted in yellow in the second panel of Fig.  1A, and the residuals are Estimation of Representative Variation-Because the estimates i 2 can be unstable in experiments with a small number of features per protein or a few runs, we enhance the estimation with the empirical Bayes (EB) shrinkage estimator, originally implemented in the limma package (24). The EB estimator leverages the information across all the proteins in the data set to stabilize the variance estimates. Briefly, it assumes that the estimate i 2 follows a scaled chi-square distribution with d i degrees of freedom and the true protein-specific variance i 2 is in turn sampled from a scaled inverse chi-square distribution with degrees of freedom d 0 and location s 0 Under these assumptions, i 2 follows a scaled F-distribution on d i and d 0 degrees of freedom and log i 2 follows Fisher's z-distribution (24), which is roughly symmetric and has finite moments of all orders.
The degree of freedom d i in Eq. 7 is determined by the experimental design, as in two-way fixed effects models. For example, in a balanced data set (where each feature has fully observed log-intensities across runs), d i ϭ ͑ JЈ Ϫ 1͒͑K Ϫ 1͒. The hyperparameters s 0 2 and d 0 in Eq. 8 are obtained by matching the first two moments (i.e. mean and variance) of the empirical distribution of log i 2 to that of the z-distribution (24): where ͑͒ and Ј͑͒ are the digamma and trigamma functions, respectively. As the moment matching estimation is sensitive to the extreme values of i 2 , we employ a robust solution originally proposed in limma (17), and only consider the 5-90% percentiles of the empirical distribution of log i 2 for the match. After obtaining the hyperparameters s 0 2 and d 0 , the posterior mean The i 2 is the EB shrinkage estimator of the representative protein variation. As shown in the second panel of Fig. 1A, the EB variance estimates "shrink" toward the consensus value s 0 2 .
Detection of Outlying Log-intensities-Residuals r ijk of outlying logintensities exceed the representative protein variation i 2 . Therefore, if the residual we label the corresponding log-intensity as an outlier, and replace the log-intensity and the residual with NA. For example, in protein P32915 with i ϭ 0.32 (Fig. 1A), the log-intensity from feature GQIGIYPIK_2 and run 12 has a residual greater than the cutoff 3 i ϭ 0.96, and is labeled as an outlier (highlighted in dark gray in the third panel). With this cutoff, we expect a 0.3% false positive rate of outliers when the residuals are Normally distributed and have a known standard deviation.
Detection of Noisy Features-Residual variation of noisy features exceeds the representative protein variation i 2 . We evaluate the variation in feature j of protein i according to the second moment of its residuals, scaled by the variance estimate of the protein, i.e., ij ϭ 1 where larger values indicate noisier features. Note that the summation does not include the outlying log-intensities replaced with NA in the previous step, and nЈ ij is the number of non-outlying log-intensities in protein i and feature j. The scaling factor i 2 allows us to compare all the features of all the proteins simultaneously. To determine a significance threshold, we generate a reference summary for each feature where y i is the average log-intensity over runs and features in protein i, and it can be viewed as a less accurate prediction of y ijk ignoring the contributions from both the runs and the features in Eq. 3. Therefore, in most cases the squared deviations ͑ y ijk Ϫ y i ͒ 2 are larger than the squared residuals r ijk 2 (Eq. 6), and the values of the reference summary ij ref are generally larger than the targets ij . We use the empirical distribution of the reference summary ij ref to determine the largest acceptable value of ij , where a threshold corresponding to a 5% false positive rate on the lower tail of the distribution is selected. For example, in the fourth panel of Fig. 1A, the empirical distributions of the feature summary ij and the reference ij ref are shown in log scale. With the reference distribution, a threshold for log ij is set to Ϫ0.23. The dark gray feature GQIGIYPIK_2 exceeds the threshold, and is labeled as a noisy feature.
Detection of Outliers and Noisy Features in SRM Experiments with Labeled Reference Peptides-In these experiments, the proposed approach is performed separately for features from endogenous and reference peptides. For reference peptides, the procedure is modified to (1) skip the outlier detection/removal step (Eq. 10), and (2) consider all the log-intensities when calculating the feature summary and the reference (Eqs. 11 and 12). The rest of the procedure is applied identically to the endogenous and reference peptides. Fig. S3 presents the pseudocode of the proposed approach for label-free DDA, DIA and SRM workflows. The approach is implemented as an option within the open-source Rbased software MSstats v3.16.2 (or later), which can be downloaded through Bioconductor (https://bioconductor.org/packages/release/ bioc/html/MSstats.html). It is applied after the data processing steps such as normalization, and before the protein-level summarization and statistical inference such as class comparison, sample summarization, or sample size calculation, as illustrated in Fig. 1B.

Implementation in the Open-source R-based Software Package MSstats-Supplemental
The approach is integrated in the dataProcess function as the option featureSubset ϭ 'highQuality'. The option highlights the uninformative features, which can be separately investigated, curated, or removed. The option remove_uninformative_feature_outlier ϭ TRUE removes the detected uninformative features and outliers. The imple-mentation of the robust regression is based on the rlm function in the R package MASS, and the implementation of the empirical Bayes shrinkage estimator is based on the squeezeVar function in limma. Further details about the usage of the feature selection approach can be found in the vignette and user manual of MSstats, available at http://msstats.org/. RESULTS We evaluated the proposed approach on a series of benchmark-controlled mixtures and biological investigations, analyzed by multiple data processing tools. All the downstream statistical analyses were performed with MSstats. Using the benchmark data sets, we evaluated the impact of the proposed selection of informative features on accuracy of fold change estimation, sensitivity and specificity of detecting differentially abundant proteins, and reproducibility of conclusions across the data processing tools. We compared the performance to that with all the features and with the top-n features. Moreover, we used the biological investigations to compare the proposed approach with the manual curation, and to evaluate the impact of the proposed approach on relative protein quantification under different choices of data processing.
In this section, we assume that all the inconsistencies in the features are because of noise, and should be removed. However, in many investigations the inconsistencies may be because of genuine biological reasons such as presence of post-translational modifications. Such features should be considered separately, as discussed in the Discussion. When removing the uninformative features, some proteins may be eliminated from the data set. As a result, the number of comparisons between conditions can change before and after the feature selection procedure.

Evaluation with the DIA Benchmarks
For DIA experiments, we used the Bruderer benchmark and the Navarro benchmark to evaluate the proposed feature selection approach. The two benchmarks differ in the number of differentially abundant proteins and their fold changes. In the Bruderer benchmark, three mixes of 12 non-human proteins were spiked into a constant background proteome (HEK-293), covering a range of fold changes from 1.1 to 50. On the other hand, the Navarro benchmark had tryptic digests of human, yeast and Escherichia coli proteins mixed in defined proportions (1:1 for human, 2:1 for yeast, and 1:4 for Escherichia coli proteins), resulting in a large number of true differentially abundant proteins (around 3,000) with fold changes in the range of 2 to 4. Data from both benchmarks were analyzed with a least three data processing tools.
Improved Estimation Accuracy and Detection of Differential Abundance- Fig. 2A illustrates the number of features per protein across the DIA benchmarks, before and after the proposed selection of informative features. As shown in the figure, the number of features varied substantially, and de-pended on both the experimental design and the data processing tool. With the proposed approach, the number of selected informative features became more consistent, with a median around 10 -20. Fig. 2B and Fig. 2C detail the proposed approach in the case of protein C8ZIG9 from the Navarro benchmark. Skyline and Spectronaut quantified the protein with different numbers of features, and with somewhat different patterns of betweenrun variation and outliers. In both cases, protein-level summarization with top-3 features was compromised by outliers in the high-intensity features, and the estimated log-fold changes were in the opposite direction of the truth. In the case of Skyline, the change in the opposite direction was large enough to warrant statistical significance (adjusted p value 0.002). The all-features summarization was in turn compromised by the noise in low-intensity features. Although the estimates of the log-fold changes were in the right direction, the variation inflated the standard errors, and undermined the adjusted p values. In contrast, the proposed approach effectively eliminated the uninformative features, and produced the most accurate estimation of log-fold changes, as well as adjusted p values consistent with differential abundance. Additional examples illustrating the impact of missing values are shown in supplemental Fig. S4.
Beyond considering individual proteins, Fig. 3A and Fig. 3B summarize the absolute errors of the fold change estimates and associated standard errors, considering multiple benchmarks and data processing tools, and focusing on background proteins with no differential abundance. The figures compare protein quantification with the selected features to that with top-n for a range of n. The top-10 quantification approximately matches the median number of features per protein selected by the proposed approach. With a small n, the top-n quantification was vulnerable to uninformative features and outliers, and increasing n increased its robustness to data artifacts. Selecting informative features performed consistently better than the top-n in terms of estimates of log-fold changes and their standard errors. Table III summarizes the results of detecting differentially abundant proteins across conditions at the false discovery rate (FDR) of 0.05, across the benchmarks and the data processing tools, using all the features and using the informative features. Selecting informative features consistently improved the sensitivity, while maintaining the specificity of detecting the changes. Supplemental Figs. S5 and S6 show the receiver operating characteristic (ROC) curves for all the considered methods for the Bruderer and Navarro benchmarks. The performance of the top-n quantifications varied across the data sets. For example, the top-3 quantification had the highest sensitivity in the Bruderer benchmark analyzed by Skyline (supplemental Fig. S5), but performed the worst in the Navarro benchmark data sets (supplemental Fig.  S6). In contrast, selecting informative features consistently optimized the detection of differential abundance in these comparisons.
Reproducibility of Detecting True Differentially Abundant Proteins Across Tools-Because data processing may introduce discrepancy in protein quantification, we evaluated the ability of the proposed selection of informative features to improve the between-tool reproducibility of relative protein quantification. Fig. 4 shows the agreement of detected true changes in protein abundance between the data processing tools for the two DIA benchmarks. DIA Umpire identified sub-stantially fewer proteins in the Navarro benchmark than the other tools (the fact also noted in the original publication (16), and shown in Table I). We therefore excluded it from this evaluation. Selecting informative features maximized the agreement in detected true changes in protein abundance across the tools. Similarly to the discussion in Improved Estimation Accuracy and Detection of Differential Abundance, performance of the top-3 and top-5 methods in detecting differentially abundant proteins was inconsistent across the data sets.

Evaluation with the DDA Benchmarks
For DDA experiments, we used the Choi benchmark and the iPRG benchmark to evaluate the proposed selection of informative features. In the Choi benchmark, three mixes of 30 proteins were spiked into a constant background of Escherichia coli proteome in different amounts, with fold changes of 1, 2 and 4. The iPRG benchmark considered six individual proteins spiked into a constant Saccharomyces cerevisiae proteome, covering a wide range of true fold changes (from 1.1 to 833). Data from both benchmarks were analyzed with a least three data processing tools.
Effective Detection of Uninformative Features and Outliers-DDA experiments typically quantify fewer features than DIA, and have more missing values and noise. Fig. 5A illustrates the number of quantified features per protein across the DDA benchmarks, before and after the selection of informa- tive features. The number of features was smaller than in DIA ( Fig. 2A). Similarly to the DIA, the original number of features per protein depended on the experimental design and on the data processing. For example, although the iPRG benchmark analyzed by Progenesis produced fewer features per protein (the median of 3 features) than Skyline, the Choi benchmark analyzed by Progenesis produced more features than Skyline. Selecting informative features mitigated this discrepancy and resulted in a more comparable number of features per protein (median of 5) in most data sets. Fig. 5B and Fig. 5C illustrate the proposed approach in the case of protein P32898 from the iPRG benchmark. Quantified with Progenesis, the protein had only four features, including two with inconsistent, high-intensity profile in runs 4 -6. The noisy features undermined the protein-level summaries and resulted in a false positive detection of differential abundance (adjusted p value close to 0 with both all-features and top-3 quantifications). Selection of informative features excluded one noisy feature and successfully corrected this artifact. Skyline reported more features for this

. Performances of the all-features and proposed methods in terms of sensitivity and specificity were evaluated on the DIA benchmark-controlled mixtures. Sensitivity is calculated as TP/(TP ϩ FN), where TP is the number of true positives and FN is the number of false negatives. Specificity is calculated as TN/(FP ϩ TN), where FP is the number of false positives and TN is the number of true negatives
Benchmark and data processing

FIG. 4. Euler diagrams for detected true changes across data processing tools in (A) the Bruderer benchmark and (B) the Navarro benchmark, using the proposed selection of informative features, and the all-features and top-n quantifications.
protein. Although some of the low-intensity features contained outliers, most feature profiles formed a consistent pattern, and all the methods led to the correct conclusion of no change in abundance.
Performance in Terms of Relative Protein Quantification-Using the same evaluation measures as in Evaluation with the DIA Benchmarks, we evaluated the impact of the proposed approach on relative protein quantification with the DDA benchmarks. Similarly to Fig. 3, supplemental Fig. S7 illustrates the effect of selecting informative features on the estimates of log-fold changes and their standard errors for the DDA benchmarks processed with different tools. The pro-posed approach more accurately estimated log-fold changes and had smaller standard errors than the top-3 and top-5 quantifications but performed similarly to the all-features quantification. Table IV summarizes the results for detecting differentially abundant proteins between conditions at FDR ϭ 0.05. Overall, selecting informative features improved the sensitivity over the all-features method while maintaining a similar specificity. The improvement, however, was not as pronounced as in the DIA data sets. The ROC curves for the DDA benchmark data sets are shown in supplemental Figs. S8 and S9. In the Choi benchmark analyzed with Proteome Discoverer, the proposed approach had a reduced sensitivity compared with the allfeatures quantification (Table IV and supplemental Fig. S8), with two fewer detected true changes at the FDR cutoff of 0.05. The proteins with missed true changes by the proposed approach typically had few or no features with a consistent profile across all the conditions. Removing uninformative features alone did not fully eliminate the inconsistency and the relative protein quantification was compromised, as illustrated in supplemental Fig. S10. These cases would require further process to curate or re-quantify the features.
In terms of reproducibility of detecting true positives across data processing tools, there was no clear change between the proposed approach and the all-features quantification. The top-n methods, on the other hand, resulted in worse reproducibility across tools (supplemental Fig. S11). Overall, compared with the all-features quantification, the proposed approach effectively removed uninformative features in the DDA benchmarks, but did not affect the relative protein quantification.

The Selected Features Overlapped with Manually Curated Features in the SRM Investigation
As a targeted method, SRM quantifies a protein with fewer features than DIA and DDA, and most of the quantified features are of high quality. Thus, the evaluation of the proposed approach for SRM focused on individual uninformative features. As described in Experimental Procedures, the SRM investigation by Dunkley et al. manually curated the quality of each feature. We used this data set to compare the proposed approach with the manual curation.
For the 246 proteins in the data set, four proteins labeled as iRT standard and three quantified with only one feature were excluded from the evaluation. Among the remaining 239 proteins, 89 with inconsistent features were detected by the proposed approach as well as the manual curation. Supplemental Fig. S12 illustrates an example of protein P63104, where the features of peptide TAFDEAIAELDTL-SEESYK had significant run-to-run variation, compared with those of peptides DIC[ϩ57]NDVSLLEK and FLIP-NASQAESK. Qualities of these features were also reflected on their chromatographic profiles and intensity proportions. In the case of peptide FLIPNASQAESK, the features were quantified in the expected range of retention time and had high consistency across runs (supplemental Fig. S13). In contrast, the uninformative features of peptide TAFDEA-IAELDTLSEESYK were inconsistent in terms of their chromatographic profiles, retention times and intensity proportions (supplemental Fig. S14).
The proposed approach detected uninformative features that were not detected by manual curation in 64 more proteins. In the case of protein P51532 (supplemental Fig. S15A), the uninformative features of peptide AIEEGTLEEIEEEVR were only detected by the proposed approach. Although these features were consistent among themselves, their profiles deviated from most other features of the protein. It is therefore difficult to detect such an unwanted deviation with the per-peptide manual curation.
Moreover, 33 proteins were characterized as having uninformative features by the manual curation alone. This was particularly common in situations where a protein was targeted with two peptides with distinct quantitative profiles. When a consensus profile of the features was unclear or even nonexistent, the proposed approach had difficulties establishing a consensus protein profile. Among the 33 proteins with bad-quality features flagged by the manual curation alone, 32 were targeted with one or two peptides. In the case of protein Q9ULK0 with two peptides (supplemental Fig. S15B), all features were manually labeled as bad-quality, because the endogenous peptides had inconsistent chromatographic profiles (e.g. GFSIDVLDALAK in supplemental Fig. S16). Based on the log-intensities, however, the proposed approach was undermined by the lack of consensus protein profile and did not detect any of these features as uninformative. For proteins with few quantified features, manual curation may still be necessary.

Appropriate Data Processing Facilitated the Selection of Informative Features
We evaluated the impact of data processing on the proposed selection of informative features and protein-level conclusions, using the DIA experiment in Selevsek et al. Two Skyline analyses were performed: the Full analysis considered all the peptide precursors and the LowCV analysis selected peptides that were consistently detectable and with a low CV. We also considered two types of q-filter strategies, based on individual peaks (the Sparse q-filter) and entire LC-MS features (the 50% q-filter). Combinations of the two analyses and two q-filter strategies produced four data sets: Full_Sparse, Full_50%, LowCV_Sparse, and LowCV_50%. Full_Sparse was processed with less stringent criteria and had more proteins than the other three data sets. Fig. 6A illustrates the fact that the choice of data processing affected the detection of differentially abundant proteins.
In the case of protein YPL117C, Full_Sparse quantification had several noisy features and many missing values (Fig. 6B). Features of better consistency were reported with either the 50% q-filter (Full_50%, Fig. 6C) or low-CV analysis (Low-CV_Sparse, Fig. 6D). This difference affected the reproducibility of detecting differential abundance. In the T2-T0 comparison, the all-features quantification gave inconsistent conclusions across these analyses, in terms of estimated log-fold change and adjusted p value. Selecting informative features reduced this discrepancy. Moreover, for all the detected changes in T2-T0, the proposed approach enhanced the reproducibility across data processing options ( Fig. 6E and Fig. 6F), especially for Full_50%, LowCV_Sparse and LowCV_50%. The remaining discrepancy with Full_Sparse was likely because of more missing values and noise in the data set, which undermined the selection of informative features.

Relationship Between the Initial Number of Features and the Performance of the Proposed Approach
Accurate estimation of representative patterns for each protein is key to the proposed selection of informative features, with different levels of challenges across proteins and across data sets. In the previous sections we saw that selecting informative features improved the detection of differentially abundant proteins (as compared with the all-features quantification) in the DIA benchmarks (Evaluation with the DIA benchmarks), but had a limited benefit in the DDA benchmarks (Evaluation with the DDA benchmarks). Using the benchmark data sets, we investigated whether the initial number of features impacted the performance of the proposed approach. Supplemental Fig. S17 illustrates that the errors of the estimated log-fold changes decreased with an increase in the number of features. Compared with the all-features quantification, the proposed selection of informative features had a reduced or similar estimation error across proteins with different number of features in the DIA benchmarks (supplemental Fig. S17A). The reduction in the estimation error was the most pronounced for proteins with 7-20 features. There was only a moderate improvement for proteins with 1-6 features, likely because of less reliable estimates of the protein patterns with fewer features. On the other hand, with enough features (e.g. Ͼ 20), the TMP summarization in MSstats was less affected by outliers and uninformative features, and the proposed selection of informative features only resulted in marginal improvement.
For the DDA benchmarks, the impact of the proposed approach was moderate and appeared to be tool-dependent (supplemental Fig. S17B). It did not result in substantial reduction in estimation error across proteins with different number of features and across the evaluated data sets, except for proteins with 7-20 features in the iPRG benchmark quantified with Progenesis. DDA experiments typically have larger technical variation than DIA, which makes it more challenging to obtain reliable estimates of protein patterns. Some proteins even lacked a consensus profile among the quantified features across all the conditions (supplemental Fig. S10), and selecting informative features did not improve the performance. For proteins with Ͼ 20 features, the discrepancy across data processing and quantification options was minimized. Overall, the efficacy of the proposed approach depended on both the number of features and the data processing, as well as other factors such as technical variation. DISCUSSION The proposed approach automatically detects uninformative features and outliers to be separately investigated, curated, or removed. For estimation and detection of differential abundance, the uninformative features can be confidently removed in a data set with a large number of good-quality features (e.g. DIA), where a detailed and efficient curation may not be possible. In such cases the proposed approach was robust to the uninformative features and outliers (Fig. 2), increased the estimation accuracy (Fig. 3) and the sensitivity of detecting true changes (supplemental Fig. S5 and Fig. S6), and enhanced the reproducibility of the results across data processing tools (Fig. 4).
The proposed selection of informative features relies on accurate estimation of representative patterns for each protein. DDA workflows typically quantify much fewer features, and do this in a less consistent manner, than DIA. This makes it challenging to obtain a reliable estimate of protein abundance. Although the proposed approach is useful in detecting uninformative features and outliers (Fig. 5), and helps direct the search for poor-quality data, further evaluation of the detected features is recommended. The evaluation may be based on manual curation, which is more manageable compared with examining the entire data set. It may also be useful to obtain additional evidence using other quality metrics. For SRM data sets, special attention is needed as a consensus profile may not be existent from the data if the targeted peptides have distinct quantitative profiles.
It should be noted that the proposed approach cannot eliminate the impact of poor experimental design (e.g. lack of randomization) or flawed data processing (e.g. inappropriate normalization).
The proposed selection of informative features assumes that features of the same protein share a representative pattern. This assumption may be violated because of biological reasons such as polymorphisms, alternative splice variants, or PTMs. For quantitative analysis of PTMs, it is more suitable to proceed with site-level quantification and inference. Although the proposed approach is applicable for the site-level analysis, it would require enough features for each PTM site, in order to accurately estimate the representative patterns.
The proposed approach evaluates information content for individual features. For design of targeted assay, instead of focusing on individual features, the approach can be extended to detect peptides whose quantitative profiles deviate from the patterns of the targeted protein. In label-based experiments, the proposed approach analyzes endogenous and reference peptides separately. When both endogenous and reference features are required, it is possible to integrate both counterparts, prior to applying the proposed approach as in label-free analysis.
Combining information from multiple sources is a common strategy in many statistical and machine learning practices, which may as well improve the selection of informative features. The goal to distinguish between informative features and uninformative features without a ground-truth is an illposed classification problem, and supervised methods for classification are not directly applicable. However, methods for unsupervised learning may be well suited for facilitating the estimation of protein profile, which can be framed as a clustering problem. When manual curation is performed on a subset of the data, semi-supervised methods may further improve the detection of uninformative features, by using the curated features to train predictors with learned representations.