Bayesian Proteoform Modeling Improves Protein Quantification of Global Proteomic Measurements

As the capability of mass spectrometry-based proteomics has matured, tens of thousands of peptides can be measured simultaneously, which has the benefit of offering a systems view of protein expression. However, a major challenge is that, with an increase in throughput, protein quantification estimation from the native measured peptides has become a computational task. A limitation to existing computationally driven protein quantification methods is that most ignore protein variation, such as alternate splicing of the RNA transcript and post-translational modifications or other possible proteoforms, which will affect a significant fraction of the proteome. The consequence of this assumption is that statistical inference at the protein level, and consequently downstream analyses, such as network and pathway modeling, have only limited power for biomarker discovery. Here, we describe a Bayesian Proteoform Quantification model (BP-Quant)(1) that uses statistically derived peptides signatures to identify peptides that are outside the dominant pattern or the existence of multiple overexpressed patterns to improve relative protein abundance estimates. It is a research-driven approach that utilizes the objectives of the experiment, defined in the context of a standard statistical hypothesis, to identify a set of peptides exhibiting similar statistical behavior relating to a protein. This approach infers that changes in relative protein abundance can be used as a surrogate for changes in function, without necessarily taking into account the effect of differential post-translational modifications, processing, or splicing in altering protein function. We verify the approach using a dilution study from mouse plasma samples and demonstrate that BP-Quant achieves similar accuracy as the current state-of-the-art methods at proteoform identification with significantly better specificity. BP-Quant is available as a MatLab® and R packages.


INTRODUCTION
The application of MS-based proteomics has resulted in large-scale studies in which the set of measured, and subsequently identified, peptides are often used to estimate protein abundance. In particular, label-free MS-based proteomics is highly effective for identification of peptides and measurement of relative peptide abundances (1,2), but it does not directly yield protein quantities. The importance of accurate protein quantification cannot be understated; it is the essential component of identifying biomarkers of disease, or defining the relationship between gene regulations, protein interactions and signaling networks in a cellular system (3,4). The major challenge is that protein abundance depends not only on transcription rates of the gene but also on additional control mechanisms, such as mRNA stability, translational regulation and protein degradation. Moreover, the functional activity of proteins can be altered through a variety of post-translational modifications (PTMs) or proteolytic processing and alternative splicing, events which selectively alter the abundance of some selected peptides while leaving others unchanged (4). This complexity of the proteome, in addition to issues associated with the measurement, and identification errors, present a significant challenge to accurate relative protein quantification (5).
Smith et al., (3) recently describes the importance of capturing protein variation in all forms (e.g., PTMs, splice variants), all of which are collectively referred to as proteoforms. To date, little has been described in respect to automated identification of proteoforms. Most work on improving protein quantification for label-free data focuses on removing variation from the data through peptide filtering, such as removing shared peptides or those that do not meet frequency and coefficient of variation (CV) thresholds (6). Several recent methodologies have described approaches to deal with shared peptides either through linear programming (7), hierarchical modeling (8), or peptide detectability (9) to improve protein-level quantification. However, these approaches do not identify or quantify distinct proteoforms of a protein. The current state-of-the-art method for proteoform discovery uses a combination of by guest on March 6, 2020 https://www.mcponline.org Downloaded from correlation and clustering to identify distinct patterns (10). Protein quantification by peptide quality control (PQPQ) to date has only been applied to labeled data, but is generic and can be applied to labelfree experiments as well.
Generating a parsimonious protein list is a well-known practice in protein inference. The procedure involves assembling the most concise set of proteins across the assigned peptide sequences observed in an experiment (11). We present the novel approach of statistically informed peptide selection using a Bayesian Proteoform Quantification (BP-Quant) approach for parsimonious, relative protein quantification (Fig. 1). Research objectives are at the foundation of BP-Quant. Peptides to be used for protein quantification are selected using a peptide-specific signature vector that is determined by statistical hypothesis tests relevant to the research objective(s) of the experiment as well as the directionality of the effect(s) (i.e., upward or downward abundance). The statistical signature-based approach does not omit shared peptides, although it does not directly account for the fact that a peptide is shared. BP-Quant-based protein quantification is a multi-step process that 1) generates the peptide signature vectors based on the research hypotheses, 2) assigns probabilities to potential proteoform configurations and 3) selects peptides with an over-representation of a similar signature to be used for relative protein abundance estimates.

EXPERIMENTAL PROCEDURES
Discovery of candidate plasma biomarkers is of interest to many applications, such as disease specificity, drug toxicity, drug response, and fundamental research (12)(13)(14)(15). Plasma samples collected from standard inbred mice under the National Institutes of Health National Institute of Environmental Health Sciences Biomarkers of Exposure project (http://www.niehs.nih.gov/health/topics/science/biomarkers/) were used for a dilution study. Although the focus of this larger project was to understand the development and progression of complex disease due to exposure to environmental stressors in the presence of risk factors such as obesity or exposure to inhaled endotoxins (e.g., lipopolysaccharide) (16), here we constructed an by guest on March 6, 2020 experiment that yields expected protein ratios and for which known concentrations can be used to define Reversed-phase capillary LC-MS analyses -Diluted peptide samples (64 total = 16 samples by four dilutions) , analyzed in duplicate, were balanced and randomized across a 4-column custom-built capillary LC system coupled online to a LTQ-Orbitrap Velos mass spectrometer (Thermo Scientific, San Jose, CA) by way of an in-house manufactured electrospray ionization interface, as previously described (18). Electrospray emitters were custom made using 150 µm outer diameter (o.d.) x 20 µm inner diameter (i.d.) chemically etched fused silica capillaries, as previously described (19). Reversed-phase capillary columns were prepared by slurry packing 3-µm Jupiter C18 bonded particles (Phenomenex, Torrence, CA) into a 75 µm x 65 cm fused silica capillary (Polymicro Technologies, Phoenix, AZ) using 0.5 cm sol-gel plugs for particle retention (20). Mobile phases consisted of (a) 0.1% formic acid in water and (b) 0.1% formic acid in acetonitrile, and were degassed on-line using a Degasys Model DG-2410 vacuum degasser (Dionex, Germany); the HPLC system was equilibrated at 10,000 psi with 100% mobile by guest on March 6, 2020 phase (a) for initial starting conditions. After loading 2.5 µg of peptides onto the column, the mobile phase was held at 100% mobile phase (a) for 50 min. Exponential gradient elution was initiated 50 min after sample loading with an initial column flow rate of 400 nL/min, and the mobile phase was ramped from 0% to 55% mobile phase (b) over 100 min using a 2.5 mL stainless steel mixing chamber, followed by a rapid increase to ~100% (b) for 10 min to wash the column. The temperature of the heated capillary and the electrospray ionization (ESI) voltage were 200°C and 2.2 kV, respectively. Data was collected over the mass range 400-2,000 m/z.
Peptide identification -Quantitative LC-MS data were processed using the PRISM Data Analysis system (21), which is a series of software tools developed in-house (e.g. Decon2LS (22) and VIPER (23); freely available at http://panomics.pnnl.gov/software/). The first step involved deisotoping of the raw MS data to give the monoisotopic mass, charge state, and intensity of the major peaks in each mass spectrum (22).
The data were next examined in a 2-D fashion to identify groups of mass spectral peaks observed in sequential spectra using an algorithm that computes a Euclidean distance in n-dimensional space for combinations of peaks (23). Each group, generally ascribed to one detected species and referred to as an "LC-MS feature", has a median monoisotopic mass, central normalized elution time (NET), and abundance estimate computed by summing the intensities of the MS peaks that comprise the entire feature. The peptide identities of detected features in each dataset (here a dataset is equivalent to a single LC-MS analysis) was determined by comparing their measured monoisotopic masses and NETs to the calculated monoisotopic masses and observed NETs of each of the peptides in a mouse plasma/Shewanella accurate mass and time tag database (24) within initial search tolerances of ± 6 ppm and ± 0.025 NET for monoisotopic mass and elution time, respectively. The peptides identified from this matching process were retained as a matrix for subsequent data analysis and are available in Excel format in supplementary data and online at http://omics.pnl.gov/view/publication_1088.html. by guest on March 6, 2020 Statistical Pre-Processing of Peptide Abundance Dataset -Peptide abundance data were transformed to the log 2 scale and missing data values were left as blank (not imputed) prior to processing. Peptides with an insufficient amount of data to perform a qualitative or quantification statistical difference test (e.g., G-test or Analysis of Variance (ANOVA)) across the set of biological replicates were removed (25). The full collection of mouse plasma and S. oneidensis peptides were evaluated for evidence of unusual peptide abundance distributions across the pool of LC-MS analyses (26). The peptides associated with the mouse plasma were extracted, technical replicates were averaged, and additional peptide filtering was performed to ensure sufficient data across the pool of samples. The mouse plasma data were normalized using a 2step process. First, the undiluted samples consisting of only mouse peptides (0.50 µg/µL, n=16) were mean centered using a rank invariant peptide subset (27). The remaining three dilution sets (0.25 µg/µL, 0.125 µg/µL, and 0.0625 µg/µL) were then normalized as a function of the expected dilution ratio to the normalized peptide set consisting of only mouse peptides. Step 1 defines the research goal, for example identifying proteins with any statistically significant differential expression (i.e., abundance) between control and two conditions (T1 and T2).

BP-QUANT
Step 2 treats each peptide as an independent source of information and evaluates each using an appropriate statistical test, such as an ANOVA with a Tukey's post-hoc test to adjust for the multiple comparisons within the peptide (28). In Step 3, the statistical results, typically p-values, are translated into signatures based on the trinary descriptors (-1,0 or 1) where -1 and 1 indicate the treatment group has lower or higher expression, respectively, than the comparison group and 0 indicates no statistical difference in peptide abundance. In particular, the significant features are checked for each individual contrast such as 1) control versus T1, 2) control versus T2, and 3) T1 versus T2. In practice, the primary signature vector is based on the comparison of quantitative peptide abundance values across groups, where groups are defined by the by guest on March 6, 2020 research objectives. A secondary signature vector can also be defined based on the comparison of the frequency of response (i.e., presence/absence) across comparison groups, which is highly relevant for proteomics data given the large fraction of missing values. If the primary signature vector contains any NaNs, which indicates inadequate data for a quantitative statistical test, then they are replaced using the corresponding secondary signature vector value(s) which are based on qualitative test results. The fourth step computes the probability of all possible proteoform configurations, thus identifying how many proteoforms are present and the peptides associated with each proteoform. The last step then quantifies protein level expression using standard approaches (29).
Bayesian peptide selection/proteoform identification -The goal of the Bayesian inference problem is to determine a specific proteoform configuration given the information for protein i. We assume that each unique peptide signature can result in a unique proteoform and thus each signature is defined as a Bernoulli random variable where 1  ij x if proteoform j is present for protein i and 0 otherwise. For each protein i we observe a count for signature j ( ij n ) and similar to prior work in protein inference (30,31) the goal is to determine the maximum a posteriori (MAP) proteoform configuration, which is dependent upon the observed signature counts: [1] The MAP proteoform configuration (Eq. 1) is found by evaluating the posterior probability of each configuration where all random variables are described in Table 1: by guest on March 6, 2020 The probability that we would observe a specific number of peptides displaying signature j for protein i is dependent upon the proteoform configuration for that signature. Under the assumption that each signature is independent we can simplify Eq. 2 to where ] , , , iJ In practice, the assumption of independence of peptides will not hold holistically, however this assumption works relatively well for initial model development similar to protein inference. We model the probability that we would observe a specific number of observations for a given signature as a binomial distribution. For example, if we have a background frequency () of 0.15 for signature j of protein i, there is only a 2.1% chance that we would observe this signature 4 or more times for a protein containing 8 peptides. If the proteoform is present, x ij is one, we sum over the binomial probabilities from the lower tail of a binomial. Thus, if the likelihood of observing n ij by chance is very low, the probability in the lower tail will be high. We utilize this distributional information to determine the likelihood of over-represented signatures that are indicative of a proteoform. Since x is binary, the prior, ) ( ij x P , is modeled as a Bernoulli random variable; [4] The computational time to solve equation 5 is minimal because the full possible set of solutions for  Table 1.

RESULTS
To explore the capability of BP-Quant to correctly identify the number of proteoforms associated with a protein we use the dilution series, which consists of 100 proteins where each protein contains from 2 to 188 peptides. We compare these results to a standard correlation-based approach (PQPQ) that uses clusters defined from correlation to define proteoforms (10,32). We used the publicly available version of PQPQ (http://www.mcponline.org/content/10/10/M111.010264/suppl/DC1), which runs in MATLAB 2013a with all defaults. Since the PQPQ software is designed for labeled data, we defined the highest dilution as our control set and allowed PQPQ to perform the analysis on the ratios to this group. For consistency, the peptide relationships to proteins and proteoforms were extracted from the output and protein quantification was performed identical to BP-Quant using a reference-based approaches -R-Rollup (29,33). pattern is an initial increase from 0.25 followed by a decrease for the last two dilutions. In this example ( Fig.3), these are the six peptides and associated abundance values identified from protein A1AG1_Mouse, for which simply the last three peptides were permuted to the second dilution ordering. Therefore, for this protein we can clearly define that we have two distinct proteoforms. For every protein in the dilution series we first select a defined number of proteoforms, second identify which peptides will belong to each proteoform (minimum of 3) and lastly determine the dilution ordering (permutation) for each proteoform.
To assess variability in the final result we generate the complete dataset 50 times.
Across the 50 datasets there were on average 423.5 proteins (of 100 total) with multiple proteoforms and the number of proteoforms for these proteins ranged from 2-6 (average of 3.1). For each of the datasets the predicted number of proteoforms to the known number of proteoforms was compared, and the accuracy was quantified. The accuracy was defined based on true positive (TP) and true negatives (TN) where TP and FN are the total the number of proteins that correctly identified the number of proteoforms -100 ) ( TN TP  . Alternately, metrics such as root mean square error can be used, but return similar results (data not shown). Fig. 4 shows a comparison of BP-Quant versus PQPQ in respect to accuracy for each of the 50 datasets (circles). The accuracies are relatively similar, average of 76.7% and 76.0%, with standard deviations of 3.3% and 3.3%, for BP-Quant and PQPQ, respectively. In 64% of the by guest on March 6, 2020 cases BP-Quant had a higher accuracy, which was statistically significant by a Wilcoxon Rank Sum test (p ~0.007).
The measure of accuracy in Fig. 4 does not directly consider the specificity of the test, i.e., the number of times a protein with a single proteoform (TN) is falsely identified as having multiple proteoforms, a false positive (FP), or conversely, a false negative (FN) Thus, we also evaluate the F1score, which is the harmonic mean of precision and sensitivity;     . Under the circumstance of no false identifications, the F1-score will be 1 and will decrease as the number of false positives and negatives increase. BP-Quant has a significantly higher F1-score than PQPQ based on a Wilcoxon signed rank test (p < 3e-4), larger in 74% of the datasets. Fig. 5 shows the confidence interval for the F1-score for each approach as well as a boxplot of the difference inset in the figure.
Lastly, to explore the impact of noisy data with no proteoform present we used the dilution series data to construct negative sets by simply randomly permuting the data for each peptide and leaving the dilution ordering constant. In this manner, there should be no statistical difference between groups and those for which this is the case would be outliers. The accuracy of BP-Quant on this negative set is displayed in Fig. 4 (green triangles). The average accuracy of BP-Quant on the negative set was 98.5% with a standard deviation of 1.25% while PQPQ resulted in an average accuracy of only 72.9% with a standard deviation of 2.64%. This is similar to the observations for the dilution series with no permutations where PQPQ found 5 of the 54 proteins (>10%) were identified to have multiple proteoforms when only one proteoform is possible.

DISCUSSION
The use of proteomic measurements for biomarker discovery results in a vast amount of information that, by necessity, requires summarization to facilitate further biological conclusions. There is great benefit to a rigorous statistical investigation of the data in the context of the research hypothesis and by guest on March 6, 2020 the subsequent peptide and protein information available for biological modeling. Statistically informed peptide selection -BP-Quant -is essential for any of the relative protein quantification approaches presented when dealing with complex samples, such as those used in environmental, medical or clinical studies. Such approaches are particularly important for lower abundances species, which are often of greatest interest in biomarker discovery efforts. In addition, since BP-Quant is a signature-based peptide selection methodology it can be used as a precursor to any desired protein quantification method (e.g., reference-based or linear models).
Proteoform identification is one of the major challenges of the protein quantification field and is a necessity to facilitate biomarker discovery and improve fundamental knowledge of biological systems at the pathway level (3). The BP-Quant approach facilitates the discovery of significant biological patterns in the presence of substantial noise and also facilitates the discovery of multiple significant biological patterns (i.e., possible proteoforms), by selecting specific peptides that display unique patterns of expression. In many cases such specific proteoforms, such as those arising from PTMs, have been identified as associated with diseases such as Alzheimer's and Parkinson's (34). BP-Quant showed excellent specificity and similar accuracy to a correlation-based clustering approach to proteoform identification. The current implementation of BP-Quant assumes that peptides are independent, which clearly is not the case. Future explorations will evaluate using protein level information, such as exon structure or peptide sequence overlap, or correlation, such as PQPQ, to identify these dependencies and model them in the Bayesian framework.
In the absence of the BP-Quant approach, biologists have been presented with a sort of 'consensus behavior' of the peptides mapped to a given protein, even as individual peptides may have been significantly modified as a specific response to experimental conditions. Because BP-Quant has the ability to effectively segregate the 'consensus' peptides from the uniquely altered peptides, biologists can begin to assess the functional significance of specific proteoforms within a biological context. This is a by guest on March 6, 2020 powerful tool for systems biology studies in which the flow of information is as important as mere changes in abundance. In may also convey an additional layer of specificity on biomarker discovery efforts, greatly improving the ability to identify proteoform-specific biomarkers.

FIG. 1. Bayesian Statistically Informed Peptide Selection (BP-Quant) for parsimonious protein quantification is driven by research objectives that define statistical patterns of interest.
Relative protein abundance values are estimated using peptides that share similar statistical behavior.  The expected frequency of signature (j=1) defined as the pattern the represents no statistical change between any groups [0,0,0….,0]