A Novel Approach for Untargeted Post-translational Modification Identification Using Integer Linear Optimization and Tandem Mass Spectrometry

A novel algorithm, PILOT_PTM, has been developed for the untargeted identification of post-translational modifications (PTMs) on a template sequence. The algorithm analyzes an MS/MS spectrum via an integer linear optimization model and outputs a rank-ordered list of PTMs that best match the experimental data. Each MS/MS spectrum is first analyzed by a preprocessing algorithm to reduce spectral noise and label potential complementary, offset, isotope, and multiply charged peaks. Postprocessing of the rank-ordered list from the integer linear optimization model resolves fragment mass errors and reorders the list of PTMs based on the cross-correlation between the experimental and theoretical MS/MS spectra. PILOT_PTM is instrument-independent, capable of handling multiple fragmentation technologies, and can address the universe of PTMs for every amino acid on the template sequence. The various features of PILOT_PTM are presented, and it is tested on several modified and unmodified data sets including chemically synthesized phosphopeptides, histone H3-(1-50) polypeptides, histone H3-(1-50) tryptic fragments, and peptides generated from proteins extracted from chromatin-enriched fractions. The data sets consist of spectra derived from fragmentation via collision-induced dissociation, electron transfer dissociation, and electron capture dissociation. The capability of PILOT_PTM is then benchmarked against five state-of-the-art methods: InsPecT, Virtual Expert Mass Spectrometrist (VEMS), Modi, Mascot, and X!Tandem. PILOT_PTM demonstrates superior accuracy on both the small and large scale proteome experiments. Finally, a protocol is developed for the analysis of a complete LC-MS/MS scan using template sequences generated from SEQUEST and is demonstrated on over 270,000 MS/MS spectra collected from a total chromatin digest.

Identification of the types of post-translational modifications (PTMs) of various organisms is currently a major challenge in the field of proteomics. MS/MS has been shown to be an excellent tool for de novo peptide sequence prediction and database peptide identification and is indispensable in determining PTMs (1,2). Many research groups have incorporated modification discovery into their respective identification algorithms and utilize multiple databases, including UniMod (29), RESID (30), and Delta Mass, to build a list of variable modifications that can exist on a candidate peptide. To date, there exist two types of algorithms for identification of PTMs: (a) hybrid sequence tag/database approaches (3-9, 28), which develop a sequence tag and subsequently compare this tag with a database to extract a candidate peptide sequence and determine the set of PTMs that best explain the MS/MS spectrum, and (b) pure database-based approaches (10-15, 23-27), which directly compare the experimental peak list with a theoretical peak list derived from candidate peptides in a database. Both approaches have had success in validating known modifications and in discovering novel ones. To our knowledge, there is no de novo approach for identification of PTMs using a comprehensive variable modification list.
The hybrid methods, denoted as (a), are beneficial because the derivation of the sequence tag may limit the size of the database to proteins that contain that sequence tag. This approach (3) can allow for a richer set of variable modifications to be considered on candidate peptide sequences due to the database size reduction. For example, the InsPecT algorithm (4) will generate de novo sequence tags of a fixed length and scan a trie-based database for all instances of the tag. Each distinct variable modification combination, or decoration, is entered in a mass-ordered list prior to database searching. When a peptide matching the tag is found, the algorithm will attempt to increase the length of the tag with an amino acid sequence if the mass of the sequence plus the mass of one decoration is equal to the predetermined mass gap. The Virtual Expert Mass Spectrometrist (VEMS) (6) uses both a database-independent search for generation of sequence tags and a database-dependent search to determine possible peptides. Any sequence tag that is not validated by a peptide found in the database-dependent search is compared with the list of proteins containing peptides found in the database-dependent search to generate candidate amino acid sequences. All combinations of variable modifications that equal the difference between the parent mass and the candidate sequence mass are tested to derive the best possible modification (6). The Modi algorithm (7) assumes that the database has been reduced a priori to a candidate subset of 20 proteins. After filtering the MS/MS spectrum, the algorithm generates a list of sequence tags derived from the spectrum and attempts to explain the mass gaps using any of the modifications from the UniMod database.
Although sequence tags have proven to be very capable in determining the candidate peptide sequence, the success of these methods relies greatly on the accurate prediction of the sequence tag. Pure database methods, denoted earlier as (b), remove this need by directly obtaining the peptide sequence (with or without modifications) from a database. These approaches are also beneficial because they use all of the MS/MS spectrum peak information at once when analyzing a candidate peptide. That is, for each candidate sequence in the database, a full set of theoretical ion fragments may be compared with the experimental MS/MS spectrum peaks to derive a score for the candidate sequence. The potential drawback of the pure database algorithms is the limitation of variable modifications that can be analyzed. Each variable modification will create an additional copy of the amino acid that must be analyzed when developing a theoretical candidate peptide from the database.
When analyzing the entire database with a small modification set, these algorithms have been very effective in identifying modified spectra. The SEQUEST algorithm (10) uses a technique known as cross-correlation to mathematically compare the overlap between the theoretical spectrum from a candidate database peptide and the experimental spectrum. Mascot (11) incorporates probability-based searching to locate a candidate peptide sequence that scores above a certain expectation threshold dependent on the size of the database. X!Tandem (14) also uses a probabilistic search method to determine the best peptide match to a spectrum.
A major limitation of many of the preceding algorithms is the inability to interpret electron transfer dissociation (ETD) (32-34) or electron capture dissociation (ECD) (35,36) spectra. ECD and ETD both involve the reaction of an electron with a protonated cation to form an odd electron peptide. This process induces large amounts of backbone cleavage to yield c-ions and z•-ions (32,35) that are analogous to the b-ions and y-ions produced from CID. Although the c-ions and z•-ions are often the most abundant ions present, both ETD and ECD spectra have been known to show b-ions, y-ions, and their neutral losses as well (37). ECD and ETD enhance the diversity of peptides that can be fragmented because they can analyze larger peptides with higher charge states. In fact, a recent decision tree model (38) was developed to differentiate which parent mass and charge states are most appropriate for CID and ETD. Generally, CID will provide the most fragmentation for peptides of charge 2 or high mass peptides of charge 3. Low mass peptides of charge 3 and all peptides of charge greater than 3 may have better fragmentation using ETD or ECD (38). Unlike CID, ECD and ETD cleavage is only very weakly affected by the amino acid sequence and generally provides more complete coverage than CID alone when used on peptides with higher charge density. Depending on the precursor charge and basic residue location, one can expect a large fraction of complementary c-ions and z•-ions to be present in the spectral data. Additionally, both ECD and ETD prevent cleavage of labile modifications (33,36). Although the mechanism of cleavage during ECD and ETD is still debated, PTMs are fully present on the c-ions and z•-ions produced during cleavage. As ETD/ECD fragmentation techniques become more readily available, they will serve as a complement to CID technology, and hence it is desirable that computational algorithms be able to handle inputs from all three techniques.
A further limitation of most of the preceding algorithms is the inability to search for a large number of variable modifications. Enumerating all combinations of the variable modifications will lead to an exponential increase in the search time and can pose a significant problem when the database size is large. This cost may be reduced by implementing a two-pass approach (39-41) where the database is initially scanned either with no modifications or a small subset of variable modifications to eliminate proteins that did not score above a given threshold (based on the peptide hits). Mascot (40), X!Tandem (39), and InsPecT (41) will run a first pass search with a small set of variable modifications to analyze spectra that are either unmodified or contain the queried modifications. Because of the reduced database size, additional variable modifications as well as missed cleavages and other unusual digestion/fragmentation information can be incorporated into the second pass search.
Several groups (16, 17, 19, 21-23, 28, 40, 42) have developed untargeted algorithms to assign integer mass modifications to candidate peptide sequences. These algorithms place a restriction on the number of modification sites to enhance computational efficiency and reduce the false detection of low mass modifications. Alternatively, the Modi algorithm (7) currently uses the entire UniMod (29) database as a variable modification list, allows a user to input as many additional modifications as necessary, and does not place an upper bound on the number of modification sites. A recent approach involving selective mass screening (18) has also been developed to identify low abundance modifications using the Modi (7) algorithm. To aid in the development of de novo identification algorithms for modified peptides, MS-Profile (43) was recently developed to generate spectral profiles of tandem mass spectra. Using the forward-backward algorithm, MS-Profile is able to determine the probability that a spectral peak corresponds to a peptide prefix mass without explicitly enumerating the complete spectral dictionary for the MS/MS spectrum.
We have developed a novel method, PILOT_PTM (Fig. 1), for untargeted PTM prediction via integer linear optimization (ILP) and tandem mass spectrometry. ILP has been an integral tool in the de novo sequencing algorithm PILOT (44,45) and the hybrid algorithm PILOT_SEQUEL (46). Similar to these previous methods, our objective function seeks to maximize the sum of intensity contributions from theoretical peak matches to the experimental spectrum given a set of logical constraints. We expand on the previous methodology by directly incorporating the intensity contribution from both sets of complementary ion peaks as well as all corresponding offsets in the objective function. Given a template sequence of amino acids, the model will seek to determine the optimal set of modifications among a "universal" list based on the MS/MS spectral data assuming that all template positions can contain a PTM. This universal list (supplemental Table 1) consists of 912 known PTMs, chemical derivatives, amino acid substitutions, non-enzymatic modifications, isotopic labels, and artifacts in the UniMod (29), RESID (30), and Delta Mass databases. No upper bound is placed on the number of modification types or modification sites that can exist on the template. The method rigorously guarantees the optimal set of modifications without having to enumerate all combinations of possible modifications.

FIG. 1. a, overall framework for PILOT_PTM. b, identification of globally and locally significant peaks. The highest intensity filtered peaks are labeled as globally significant. Any other filtered peak is labeled as locally significant if the peak intensity is greater than all other peaks within a 5.0-Da mass window. c, a set of singly charged support peaks (red) for a candidate ion peak (blue). Each peak labeled as globally or locally significant will be assigned a set of supporting peaks based on the filtered MS/MS. d, output from an optimal solution of PILOT_PTM. The blue peaks on the left represent all MS/MS peaks that are activated for the optimal solution. The sets of blue peaks on the right represent all candidate ion peak sets that will activate the optimal combination of MS/MS peaks. e, cross-correlation example for the template sequence KSTGGKAPR with N-terminal propionylation, Lys-1 dimethylation, and Lys-6 acetylation.

Sample Preparation and Annotation
This section will detail the preparation of each of the data sets and the annotation procedure for determination of PTMs. Annotated spectra for all test data sets are provided as supplemental material. The fragmentation methods, MS/MS instruments, and total scans for each data set are given in Table I.

Test Set A: Phosphopeptides
Three sets of chemically synthesized phosphopeptides (AnaSpec) were prepared for mass spectrometry by using 50% acetonitrile. The peptides were analyzed using a data-dependent mode setting measuring the parent mass followed by MS/MS fragmentation using alternating CID/ETD scans (Table I). All peptides were isolated based on parent mass, and a total of 218 spectra were manually validated.

Test Set B: Histone H3-(1-50) N-terminal Tail
Histone H3 was isolated from HeLa cells and prepared for mass spectrometry as described previously (47). The H3-(1-50) N-terminal tail was analyzed using an 8.5-tesla quadrupole FTMS instrument. The 8+ charge state was enhanced using the quadrupole and second octopole of the instrument for selective ion accumulation. Species were selected for fragmentation by ECD, and 58 spectra were manually validated (47). Although multiple modified forms may be present in each MS/MS spectrum, the annotation assigned corresponds to the most abundant modified form.

Test Set C: Propionylated Histone Fragments
Histone H3 was isolated from mouse embryonic fibroblast cells and prepared for mass spectrometry as described previously (48). Propionylated H3 peptides were analyzed by nanoflow reverse-phase HPLC-MS/MS using a linear quadrupole ion trap-Orbitrap mass spectrometer (ThermoFisher, San Jose, CA) operated in the data-dependent mode with one full MS acquired in the Orbitrap followed by seven data-dependent MS/MS spectra acquired via CID in the ion trap. A representative set of parent masses was selected using the five peptide fragments associated with the H3-(1-50) N-terminal tail and modifications that are commonly found on histone H3 (Lys methylation, Lys dimethylation, and Lys acetylation) or are artifacts of the propionylation procedure (Lys propionylation, Lys methylated propionylation, N-terminal propionylation, N-terminal acetylation, and C-terminal methylation), and a total of 553 spectra were isolated and manually annotated.

Test Set D: Total Chromatin Fraction
HeLa S3 cells were cultured and harvested as described recently (47). Chromatin fractions from the HeLa cells were roughly prepared according to published procedures (49). Extracted protein was separated using one-dimensional SDS-PAGE and in-gel digested by trypsin following treatment with iodoacetamide. Peptide digests were then analyzed by nanoflow LC-MS/MS on an Orbitrap mass spectrometer as described previously (50).
To develop an annotated test set, we initially utilized the SEQUEST algorithm (10) with a set of eight variable modifications (Met oxidation, Lys methylation, Lys dimethylation, Lys acetylation, Arg methylation, Ser phosphorylation, N-terminal acetylation, and C-terminal amidation). We scanned the NCBInr database with human taxonomy and allowed up to three missed cleavages. The fragment ion tolerance was set to 0.5 Da, and the parent ion tolerance was set to 0.1 Da. The cutoff XCorr for an annotation was 2.0 for a charge 1 precursor, 2.2 for charge 2, 2.50 for charge 3, and 3.0 for charge 4. All such assignments were then analyzed using Mascot (11) with the same variable modification list and search parameters. The assignments that were validated by Mascot with an expectation value of at most 0.1 were retained.
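The two-stage acceptance rule above can be sketched in a few lines. This is only an illustration: the function and constant names are hypothetical, the comparisons are assumed to be inclusive, and precursor charges outside 1-4 are assumed to be rejected because the text gives no cutoff for them.

```python
# Charge-dependent SEQUEST XCorr cutoffs quoted in the text.
XCORR_CUTOFF = {1: 2.0, 2: 2.2, 3: 2.5, 4: 3.0}
# Mascot expectation value must be at most this value.
MASCOT_MAX_EXPECTATION = 0.1


def keep_annotation(charge, xcorr, mascot_expectation):
    """Return True if an assignment passes both the SEQUEST and Mascot filters.

    Assumptions (not stated in the text): cutoffs are inclusive, and charges
    without a listed cutoff are rejected.
    """
    cutoff = XCORR_CUTOFF.get(charge)
    if cutoff is None:
        return False
    return xcorr >= cutoff and mascot_expectation <= MASCOT_MAX_EXPECTATION


if __name__ == "__main__":
    print(keep_annotation(2, 2.5, 0.05))  # passes both filters
    print(keep_annotation(3, 2.4, 0.01))  # fails the XCorr cutoff for charge 3
```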
The annotations were then manually examined to remove assignments that appeared to correspond to low quality spectra. Each peptide annotation was theoretically fragmented to develop a list of predicted b- and y-ions. Using an appropriate noise threshold (51), we remove all MS/MS spectra that do not contain at least 50% (Quality

Ion Trap Peptides-These spectra from the organism Mycobacterium smegmatis are available from the Open Proteomics Database (52). A test set of 36 spectra was verified by Mascot (11) and SEQUEST (10) and further filtered based on the number of b-ions and y-ions above the noise level as described previously (44).
Q-TOF Peptides-These spectra were derived from a publicly available data set (53). The spectra were collected with Q-TOF2 and Q-TOF-Global mass spectrometers using a mixture of alcohol dehydrogenase (yeast), myoglobin (horse), albumin (bovine; BSA), and cytochrome c (horse). A test set of 37 spectra was obtained using only "acceptable spectra" as defined previously (44).
Orbitrap Peptides-Stock solutions of a 16-peptide mixture were prepared containing equal amounts of each protein as described previously (46). The proteins were digested with trypsin and analyzed by automated microcapillary liquid chromatography and an LTQ-Orbitrap hybrid mass spectrometer (Thermo Finnigan, San Jose, CA). Both MS and MS/MS spectra were recorded on the instrument, and a test set of 401 spectra was annotated using the SEQUEST algorithm (10).

Novel PILOT_PTM Algorithm
The framework for PILOT_PTM (Fig. 1a) begins with a preprocessing algorithm that filters the raw spectrum to extract all globally and locally significant peaks based on their intensity (Fig. 1b). The preprocessor is capable of handling inputs from multiple fragmentation methods including CID, ETD, and ECD and will label candidate b-ion (CID) or c-ion peaks (ETD/ECD), the appropriate complementary y-ion (CID) or z•-ion (ETD/ECD), and any supporting peaks (isotopes, neutral offsets, etc.) that may exist (Fig. 1c). The ILP model will derive a rank-ordered list of activated globally significant peaks for the template peptide sequence based on one or more sets of candidate ion peaks (Fig. 1d). A complete list of candidate modified sequences that satisfy the appropriate mass conservation constraints for each candidate ion peak set is then constructed. The postprocessing algorithm uses a cross-correlation function to mathematically verify the overlap between the experimental MS/MS spectrum and the theoretical spectrum created by a candidate sequence (Fig. 1e). Each candidate sequence is assigned a cross-correlation score and placed in a rank-ordered list. The modified sequence that best explains the experimental data will have the highest cross-correlation score.

ILP Model
Given a template amino acid sequence of length $K$, each amino acid is assigned an index, $k$, corresponding to the position in the template sequence. Without loss of generality, the N-terminal amino acid will correspond to $k = 1$, and the C-terminal amino acid will correspond to $k = K$. During the preprocessing stage, a list of candidate ion peaks, $j$, is generated that represent possible b-ions (for CID) or c-ions (for ETD/ECD). The peak masses will correspond to singly charged ions, and the choice of ion type is arbitrary as y-ions or z•-ions could easily be used in the formulation of the problem.
Sets-The set $CS_k$ (Equation 2) consists of all candidate ion peaks $j$ that are valid peaks for the template amino acid sequence at position $k$. Given the universal list of modifications, the theoretical lower ($m_k^L$) and upper ($m_k^U$) bounds on the masses of the ion peaks used to construct the candidate sequence can be easily calculated. We can then efficiently construct each set $CS_k$ by enumerating all $j$ such that $m_k^L \le m_j \le m_k^U$ and there exists an amino acid path from $j$ to both the N-terminal and C-terminal boundary conditions (N-term and C-term B.C.) (54). The set $Pos_j$ is simply a list of all template positions for which $j$ can be a candidate ion peak (Equation 3).

$$CS_k = \{\, j : m_k^L \le m_j \le m_k^U,\ \exists \text{ an amino acid path to the N-term and C-term B.C.} \,\} \quad \text{(Eq. 2)}$$

$$Pos_j = \{\, k : j \in CS_k \,\} \quad \text{(Eq. 3)}$$
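As a rough illustration of how the mass-window part of Equation 2 and the set $Pos_j$ of Equation 3 could be computed, consider the sketch below. It is not the authors' implementation: the function names are hypothetical, the bounds are built for singly charged b-ions only (c-ions would carry an additional fixed offset), the per-position minimum and maximum residue masses are assumed to be supplied by the caller from the modification list, and the amino acid path condition of Equation 2 is deliberately omitted.

```python
def mass_bounds(residue_min, residue_max, proton=1.007276):
    """Cumulative lower and upper bounds on the singly charged b-ion mass.

    residue_min[k-1] and residue_max[k-1] are the smallest and largest
    (possibly modified) residue masses allowed at template position k.
    """
    lower, upper = [], []
    lo = hi = proton
    for rmin, rmax in zip(residue_min, residue_max):
        lo += rmin
        hi += rmax
        lower.append(lo)
        upper.append(hi)
    return lower, upper


def candidate_sets(peak_masses, lower, upper):
    """Build CS_k (mass window only) and Pos_j from candidate peak masses.

    Positions k are 1-based to match the text; peaks j are 0-based indices.
    The path-existence condition of Equation 2 is NOT checked here.
    """
    cs = {
        k: [j for j, m in enumerate(peak_masses)
            if lower[k - 1] <= m <= upper[k - 1]]
        for k in range(1, len(lower) + 1)
    }
    pos = {j: [k for k in cs if j in cs[k]] for j in range(len(peak_masses))}
    return cs, pos


if __name__ == "__main__":
    # Toy two-residue template with a rounded proton mass for readability.
    lower, upper = mass_bounds([57.0, 71.0], [99.0, 113.0], proton=1.0)
    cs, pos = candidate_sets([60.0, 130.0, 300.0], lower, upper)
    print(cs)
    print(pos)
```

A full construction would additionally discard any peak that cannot be linked to both termini by a chain of valid residue masses.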
For each candidate ion peak $j$, we construct the set of supporting MS/MS spectrum peaks, $Support_j$ (Equation 4), using the globally and locally significant peaks (indexed over $i$) determined from the preprocessor (Fig. 1b). $Support_j$ is intended to capture as much information as possible about the candidate ion peak $j$ and is dependent on the fragmentation method used. For ETD/ECD spectra, $Support_j$ consists of c2+-ions, z•-ions, z•2+-ions, b-ions, y-ions, and their corresponding +1 and +2 isotopes. For CID spectra, the appropriate ions are b2+-ions, y-ions, y2+-ions, their corresponding +1 and +2 isotopes, and their corresponding offsets (i.e. −H2O, −NH3, and −CO). The y-ion or z•-ion series can be calculated from the modified parent mass by the relation c-ion + z•-ion = $m_P + m_H + 2\,m_{H^+}$ for ETD/ECD spectra and y-ion + b-ion = $m_P + 2\,m_{H^+}$ for CID spectra, respectively. The set $Mult_i$ is the set of all $j$ such that $i$ is a supporting peak for $j$ (Equation 5).

$$Support_j = \{\, i : i \text{ is a supporting MS/MS spectrum ion peak for candidate ion peak } j \,\} \quad \text{(Eq. 4)}$$

$$Mult_i = \{\, j : i \in Support_j \,\} \quad \text{(Eq. 5)}$$

Binary Variables-We use binary variables (Equations 6 and 7) to model the logical use of a candidate ion peak $j$ at a template position $k$ ($p_{j,k}$) as well as the logical use of an MS/MS spectrum ion peak $i$ as supporting information ($y_i$). These variables are defined as follows.

$$p_{j,k} = \begin{cases} 1 & \text{if candidate ion peak } j \text{ is used at template position } k \\ 0 & \text{otherwise} \end{cases} \quad \text{(Eq. 6)}$$

$$y_i = \begin{cases} 1 & \text{if MS/MS spectrum ion peak } i \text{ is used as supporting information} \\ 0 & \text{otherwise} \end{cases} \quad \text{(Eq. 7)}$$

Mathematical Model-The constraints of the problem are chosen to ensure proper use of the logical binary variables. At most one candidate ion peak $j$ can be assigned to a template position $k$ (Equation 9). Additionally, we allow for missing candidate ion peaks $j$ associated with a template position $k$ but require that there be no more than three consecutive missing candidate ion peaks (Equation 10). We also enforce the constraint that a candidate ion peak $j$ can be used at most once in the construction of a modified sequence (Equation 11).
Constraints are also introduced to ensure that an MS/MS spectrum peak $i$ is used properly as supporting information. An MS/MS spectrum peak $i$ can only be activated if at least one of the corresponding candidate ion peaks $j$ in the set $Mult_i$ is activated for any valid template position $k$ in the set $Pos_j$ (Equation 12). We also ensure that a candidate ion peak $j$ is not activated if the corresponding MS/MS spectrum ion peak $i$ is not activated (Equation 13). The objective of the problem is to maximize the total intensity of the MS/MS spectrum peaks $i$ used to construct the modified sequence (Equation 8).

This ILP model can be solved to global optimality using CPLEX (55) to obtain a set of MS/MS spectrum peaks that correspond to one or more modified sequences. Using integer cuts (56), a rank-ordered list of the top 10 sets of MS/MS spectrum peak variables will be generated. CPLEX uses a branch-and-cut algorithm (57) where a subset of the integer variables is fixed and the remaining integer variables are relaxed so they can take on continuous values, and a linear programming (LP) relaxation is formed. A branch-and-bound tree keeps track of which integer variables are fixed in a given relaxation and stores them in a "node" in the tree. The algorithm then parses through the tree and solves an LP relaxation at each node to get a theoretical upper bound on the optimal solution of the original problem. The traversal of the tree is dependent on the search techniques being used, some of which include depth-first search, breadth-first search, and best-bound search. Complete enumeration is avoided using fathoming criteria after each LP relaxation is solved. For further detail, the reader is directed to Refs. 56 and 57. A complete description of the ILP model and detailed solution strategies can be found in the supplemental material. Note that the ILP model can be formulated using network-based constraints (44, 45, 58-62).
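The displayed forms of Equations 8-13 are not reproduced in the text above, so the block below sketches one formulation that is consistent with the verbal description. It should be read as an illustration rather than as the authors' exact equations: the intensity weight $\lambda_i$, the mapping $i(j)$ from a candidate ion peak to the MS/MS peak it was derived from, the handling of the terminal positions in the window constraint, and the index set $A^{n}$ of peaks activated in the $n$-th ranked solution are all notational assumptions introduced here.

```latex
\begin{align}
\max_{p_{j,k},\,y_i}\quad & \sum_{i} \lambda_i\, y_i \tag{8}\\
\text{s.t.}\quad
& \sum_{j \in CS_k} p_{j,k} \le 1 && \forall k \tag{9}\\
& \sum_{k'=k}^{k+3}\ \sum_{j \in CS_{k'}} p_{j,k'} \ge 1 && \forall k \le K-3 \tag{10}\\
& \sum_{k \in Pos_j} p_{j,k} \le 1 && \forall j \tag{11}\\
& y_i \le \sum_{j \in Mult_i}\ \sum_{k \in Pos_j} p_{j,k} && \forall i \tag{12}\\
& \sum_{k \in Pos_j} p_{j,k} \le y_{i(j)} && \forall j \tag{13}\\
& p_{j,k} \in \{0,1\},\quad y_i \in \{0,1\}
\end{align}

% A standard integer cut that excludes the n-th solution, whose activated
% peak set is A^n, before re-solving for the next entry of the ranked list:
\begin{equation*}
\sum_{i \in A^{n}} y_i \;-\; \sum_{i \notin A^{n}} y_i \;\le\; |A^{n}| - 1
\end{equation*}
```

Constraint 10 states that every window of four consecutive template positions contains at least one assigned peak, which is equivalent to allowing no more than three consecutive missing peaks. Adding one integer cut per solved problem and re-solving yields the next-best peak set, so ten solves produce the top 10 list.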
Cutting Plane Constraints-When incorporating all of the previous constraints, it is still possible to obtain linear programming relaxations that consider a set of $p_{j,k}$ at adjacent template positions that do not correspond to the mass difference of a modified or unmodified amino acid. For each $p_{j,k}$, we determine $Inv_{j,k,k'}^{L}$ and $Inv_{j,k,k'}^{U}$, the sets of candidate ion peaks $j'$ at template position $k'$ ($k' < k$ and $k' > k$, respectively) such that no jump exists between $j$ and $j'$.
The improper assignment of any invalid peak combination at adjacent template positions is prevented with Equations 14 and 15. For candidate ion peaks $j$ and $j'$ at template positions $k$ and $k'$, respectively, where $|k' - k| > 1$, we would like to prevent an invalid combination only if a candidate peak is not activated at any template position $k''$ between $k$ and $k'$. This is enforced using Equations 16 and 17. The next set of constraints will be added when the linear relaxation activates candidate ion peaks at adjacent template positions where the mass difference between them is less than the smallest modified amino acid or greater than the largest modified amino acid. Thus, for each candidate ion peak $j$ at template position $k$, we establish $m_{j,k}^{J,L}$ and $m_{j,k}^{J,U}$, which are the minimum and maximum masses that can be reached from $j$, respectively. All $j' \in CS_{k+1}$ for which $m_{j'} > m_{j,k}^{J,U}$ or $m_{j'} < m_{j,k}^{J,L}$ correspond to candidate ion peaks outside the minimum and maximum possible mass peak boundaries. The improper assignment of peak variables can be prevented by Equations 18 and 19.
Incorporating all of these equations in the initial formulation of the problem results in a large number of constraints, many of which are not activated for the optimal solution. To circumvent this computational burden, we apply them dynamically as cuts. That is, for a given ILP relaxation, the violations of Equations 14-19 are checked, and cuts are then added when needed.
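The dynamic use of cuts can be illustrated with a small separation routine that inspects a relaxed solution and reports which invalid peak combinations are currently violated. This is a sketch under assumptions: the exact forms of Equations 14-19 are not shown in the text, so a simple pairwise cut $p_{j,k} + p_{j',k'} \le 1$ is assumed for each invalid pair, and the function name, data layout, and tolerance are hypothetical.

```python
def violated_pair_cuts(p_values, invalid_pairs, eps=1e-6):
    """Return the invalid pairs whose relaxed activations sum to more than 1.

    p_values maps (j, k) to the value of p_{j,k} in the current relaxation.
    invalid_pairs is a list of ((j, k), (j_prime, k_prime)) combinations that
    cannot both be active. The pairwise cut form is an assumption made here.
    """
    cuts = []
    for first, second in invalid_pairs:
        total = p_values.get(first, 0.0) + p_values.get(second, 0.0)
        if total > 1.0 + eps:
            cuts.append((first, second))
    return cuts


if __name__ == "__main__":
    relaxed = {(0, 1): 0.6, (1, 2): 0.7, (2, 2): 0.3}
    invalid = [((0, 1), (1, 2)), ((0, 1), (2, 2))]
    # Only the first pair exceeds 1 in this relaxed solution.
    print(violated_pair_cuts(relaxed, invalid))
```

Only the returned pairs would be added to the model before the relaxation is solved again, which keeps the constraint set small compared with adding every possible cut up front.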

Preprocessing Algorithm
The preprocessing algorithm begins by removing all peaks that are associated with the precursor ion. For CID spectra, this includes the precursor ion, its +1 and +2 isotopes, and any neutral losses (i.e. −H2O, −NH3, and −CO) (63). For ETD/ECD spectra, we must remove all peaks that correspond to distinct charge states of the precursor ion and their isotopes. Additionally, all peaks that correspond to a common neutral loss of a charge-reduced form of the precursor ion (37) are removed. The MS/MS spectrum is then filtered to remove any peak that is within an appropriate tolerance of another peak of higher intensity. The filtered MS/MS spectrum is scanned to extract the peaks with the highest intensity (Fig. 1b). All locally significant peaks are then extracted if the peak intensity is greater than all other peaks within an appropriate mass window (Fig. 1b).
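The extraction of globally and locally significant peaks can be sketched as follows. This is a minimal illustration, not the PILOT_PTM preprocessor: the function name and the number of global peaks (`n_global`) are hypothetical, and the 5.0-Da window from the Fig. 1 legend is assumed to be applied as a half-width on each side of the peak.

```python
def significant_peaks(peaks, n_global=3, window=5.0):
    """Split filtered peaks into globally and locally significant indices.

    peaks is a list of (mass, intensity) tuples. The n_global most intense
    peaks are globally significant. Any other peak is locally significant if
    it is more intense than every other peak within +/- window Da (assumed
    half-width). A peak with no neighbours in the window is therefore
    locally significant.
    """
    by_intensity = sorted(range(len(peaks)),
                          key=lambda i: peaks[i][1], reverse=True)
    global_idx = set(by_intensity[:n_global])
    local_idx = set()
    for i, (mass, intensity) in enumerate(peaks):
        if i in global_idx:
            continue
        neighbours = [other_int for t, (other_mass, other_int) in enumerate(peaks)
                      if t != i and abs(other_mass - mass) <= window]
        if all(intensity > other for other in neighbours):
            local_idx.add(i)
    return sorted(global_idx), sorted(local_idx)


if __name__ == "__main__":
    spectrum = [(100.0, 10.0), (102.0, 50.0), (104.0, 5.0),
                (200.0, 8.0), (203.0, 3.0), (300.0, 40.0)]
    print(significant_peaks(spectrum, n_global=2))  # ([1, 5], [3])
```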
The preprocessor scans and removes all peaks that are determined to be +1 or +2 isotopes. If any doubly or triply charged peaks are found based on isotopic offsets, the appropriate singly charged peak of the same intensity is constructed. For CID spectra, all neutral offsets are removed if the offset does not have a complementary peak. The preprocessor then queries all candidate peaks and determines a full list of supporting peaks (Fig. 1c) for each candidate ion peak. For CID spectra, this will include +1 and +2 isotopic offsets, neutral losses (i.e. −H2O, −NH3, and −CO), and doubly charged peaks. For ETD/ECD spectra, this will include isotopic offsets and doubly charged peaks.
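Two of the mass conversions used when labeling peaks can be written down directly: the construction of a singly charged peak from a multiply charged one, and the complementary-ion relations given in the ILP Model section. The sketch below is illustrative only. The function names are hypothetical, `parent_mass` is assumed to be the neutral (modified) peptide mass $m_P$, and the physical constants are standard monoisotopic values rather than numbers taken from the text.

```python
PROTON = 1.007276    # mass of H+ in Da
HYDROGEN = 1.007825  # mass of a hydrogen atom in Da


def to_singly_charged(mz, charge):
    """Convert an m/z observed at the given charge to its singly charged m/z."""
    return mz * charge - (charge - 1) * PROTON


def complementary_mass(ion_mass, parent_mass, etd=False):
    """Mass of the complementary singly charged ion.

    CID:      b-ion + y-ion   = m_P + 2 * m_H+
    ETD/ECD:  c-ion + z.-ion  = m_P + m_H + 2 * m_H+
    parent_mass is assumed to be the neutral modified peptide mass m_P.
    """
    total = parent_mass + 2 * PROTON
    if etd:
        total += HYDROGEN
    return total - ion_mass


if __name__ == "__main__":
    print(to_singly_charged(500.0, 2))
    print(complementary_mass(300.0, 1000.0))
    print(complementary_mass(300.0, 1000.0, etd=True))
```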

Postprocessing Algorithm
A postprocessing algorithm is used to score the candidate modified peptide sequences that are derived from the peak sets in the ILP rank-ordered list. A cross-correlation technique (10) is used to measure the mathematical overlap between the theoretical ions produced from the candidate PTM set and the experimental spectrum. A generalized model is established that is similar to that used in PILOT (44,45) and PILOT_SEQUEL (46). A mathematical overlap between the theoretical and experimental spectra is then calculated based on monoisotopic masses for each candidate modified peptide (Fig. 1e). Each candidate modified peptide is assigned a cross-correlation score and inserted into a rank-ordered list. Similar to SEQUEST, this score is a measure of how well the "expected" fragmentation pattern of a particular modified peptide matches the experimental data and is not a probabilistic metric (10). The modified peptide thought to best explain the experimental data is given the highest cross-correlation score.
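A SEQUEST-style cross-correlation score can be sketched as the zero-shift dot product of two binned spectra minus the average dot product over a range of nonzero shifts. The code below is only an illustration of that general idea: the bin width, the shift range of 75 bins, the use of the maximum intensity per bin, and all function names are assumptions, and the intensity assignment actually used by PILOT_PTM is not specified here.

```python
def bin_spectrum(peaks, n_bins, bin_width=1.0):
    """Place (mass, intensity) peaks into fixed-width bins, keeping the max."""
    binned = [0.0] * n_bins
    for mass, intensity in peaks:
        b = int(mass / bin_width)
        if 0 <= b < n_bins:
            binned[b] = max(binned[b], intensity)
    return binned


def correlation(x, y, shift):
    """Dot product of x[i] with y[i + shift], ignoring out-of-range bins."""
    n = len(x)
    return sum(x[i] * y[i + shift] for i in range(n) if 0 <= i + shift < n)


def xcorr(experimental, theoretical, max_shift=75):
    """Zero-shift correlation minus the mean correlation over nonzero shifts."""
    zero = correlation(experimental, theoretical, 0)
    background = [correlation(experimental, theoretical, s)
                  for s in range(-max_shift, max_shift + 1) if s != 0]
    return zero - sum(background) / len(background)


if __name__ == "__main__":
    exp = bin_spectrum([(100.2, 1.0), (200.7, 2.0)], n_bins=400)
    good = bin_spectrum([(100.0, 1.0), (200.0, 1.0)], n_bins=400)
    bad = bin_spectrum([(150.0, 1.0), (250.0, 1.0)], n_bins=400)
    print(xcorr(exp, good), xcorr(exp, bad))
```

Subtracting the shifted background penalizes theoretical spectra that overlap the experimental peaks only by chance, which is why the score is a relative measure of fit rather than a probability.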
Once all peak intensities are assigned, the postprocessor scans each set of candidate ion peaks $j$ output from the ILP model. If the mass difference between two candidate ion peaks $j$ and $j'$ that are at least two template positions apart is equal to the sum of the intermediate unmodified residue masses, but the activated candidate ion peaks in between $j$ and $j'$ indicate a possible modification, then these intermediate candidate ion peak assignments are checked by looking for the presence of peaks in the MS/MS spectrum that indicate unmodified residues. If enough supporting information exists, then the intermediate candidate ion peaks are reassigned to those of the unmodified sequence and subsequently rescored.

Algorithm Scoring
The accuracy of an algorithm is measured using three metrics: residue prediction accuracy, peptide prediction accuracy, and subsequence accuracy. The definitions of each accuracy metric for PILOT_PTM and all compared algorithms are given below.

Residue Prediction Accuracy
For a given template amino acid, we will define the residue prediction accuracy for PILOT_PTM as 1 for the assignment of a modification (or lack thereof) with mass within 0.1 Da (0.01 Da for ECD spectra) of the proper annotated modification mass and 0 otherwise. For alternative algorithms, the residue prediction accuracy will be equal to 1 if the algorithm assigned any amino acid (modified or unmodified) with mass within 0.1 Da (0.01 Da for ECD spectra) of the proper modified residue and 0 otherwise. When the size of the peptide predicted by a competing algorithm is not equal to that of the annotated peptide, we look for the alignment between the predicted peptide and the annotated peptide that will yield the highest number of correct residues. If multiple peptides report the same "best" score for an algorithm, then the peptide that has the highest number of correct residues is selected for accuracy quantitation.
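For the PILOT_PTM case, where the predicted and annotated peptides share the same template, the residue-level score reduces to a per-position mass comparison. The sketch below assumes equal-length inputs given as modification masses (0.0 for an unmodified residue); the function name is hypothetical and the alignment step used for competing algorithms is not covered.

```python
def residue_accuracy(predicted_masses, annotated_masses, tol=0.1):
    """Score each template position 1 or 0 by modification mass agreement.

    Both inputs list one modification mass per residue (0.0 if unmodified).
    Use tol=0.1 Da in general and tol=0.01 Da for ECD spectra, as in the text.
    Assumes both lists have the same length.
    """
    return [1 if abs(p - a) <= tol else 0
            for p, a in zip(predicted_masses, annotated_masses)]


if __name__ == "__main__":
    predicted = [0.0, 79.966, 0.0, 14.016]
    annotated = [0.0, 79.966, 42.011, 14.016]
    print(residue_accuracy(predicted, annotated))  # [1, 1, 0, 1]
```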

Peptide Prediction Accuracy
The complete peptide prediction accuracy is set to 1 if all residues have been correctly annotated with the proper modification (or lack thereof) and 0 otherwise. The complete prediction accuracy within N residues is set to 1 if at most N residues are assigned incorrectly and 0 otherwise. When the size of the peptide predicted by a competing algorithm is not equal to that of the annotated peptide, the peptide prediction accuracy is calculated using the alignment found during calculation of the residue prediction accuracy.
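Given the per-residue scores, the peptide-level metric follows directly. A minimal sketch, with a hypothetical function name, where `n_allowed` plays the role of N in the definition above:

```python
def peptide_accuracy(residue_scores, n_allowed=0):
    """Return 1 if at most n_allowed residues are incorrect, else 0.

    residue_scores is a list of 1/0 residue prediction accuracies.
    n_allowed=0 gives the complete peptide prediction accuracy.
    """
    errors = residue_scores.count(0)
    return 1 if errors <= n_allowed else 0


if __name__ == "__main__":
    scores = [1, 1, 0, 1]
    print(peptide_accuracy(scores))               # 0
    print(peptide_accuracy(scores, n_allowed=1))  # 1
```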

Subsequence Length Accuracy
The subsequence length of an MS/MS spectrum is the length of the longest string of consecutive residues that were annotated correctly. That is, if an MS/MS spectrum is assigned a subsequence with length L, then there exist L consecutive amino acids that were assigned a residue prediction accuracy of 1. The subsequence accuracy for a given length across a data set is then determined by dividing the number of peptides that contain a properly annotated subsequence of at least that length by the total number of peptides of at least that length.
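Both quantities can be computed from the per-residue scores. The sketch below uses hypothetical function names and returns `None` when no peptide in the data set is long enough, a convention assumed here because the text does not address that case.

```python
def subsequence_length(residue_scores):
    """Length of the longest run of consecutive correctly annotated residues."""
    best = run = 0
    for score in residue_scores:
        run = run + 1 if score == 1 else 0
        best = max(best, run)
    return best


def subsequence_accuracy(all_scores, length):
    """Fraction of sufficiently long peptides with a correct run >= length.

    all_scores holds one list of 1/0 residue scores per peptide. Only
    peptides with at least `length` residues enter the denominator.
    """
    eligible = [s for s in all_scores if len(s) >= length]
    if not eligible:
        return None
    hits = sum(1 for s in eligible if subsequence_length(s) >= length)
    return hits / len(eligible)


if __name__ == "__main__":
    data = [[1, 1, 0, 1, 1, 1], [1, 1, 1, 1], [1, 0]]
    print(subsequence_length(data[0]))    # 3
    print(subsequence_accuracy(data, 3))  # 1.0
    print(subsequence_accuracy(data, 4))  # 0.5
```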

Algorithm Validation
The proposed method was tested on four modified test data sets, including 218 phosphopeptides fragmented via ETD and CID (A1-A3), 58 histone H3-(1-50) N-terminal tail spectra fragmented via ECD (47) (B), 553 propionylated histone H3-(1-50) peptides fragmented via CID (48) (C), and 525 peptides from a total chromatin fraction fragmented via CID (D1). PILOT_PTM was able to accurately identify 100% of the modified residues from data set A1, 93.8% from A2, 89.7% from A3, 97.9% from B, 98.6% from C, and 96.5% from D1 (Table II). The decrease in accuracy between data set A1 and data sets A2 and A3 was thought to be due to the lack of fragmentation of the MS/MS spectrum in these data sets. We note that the prediction accuracy for modified residues is the highest for data sets A1 and C. These data sets generally had the best fragmentation and thus contained many of their singly charged ion peaks (supplemental annotations). For all modified data sets, PILOT_PTM was able to accurately identify 2,339 of the 2,393 modified residues (97.7%) and 15,752 of the 15,864 unmodified residues (99.3%).
PILOT_PTM was also tested on two unmodified data sets, including the 6,025 unmodified spectra identified from the total chromatin fraction (D2) and 474 unmodified spectra fragmented via CID using ion trap, Q-TOF, and Orbitrap instruments (44-46) (E1-E3). PILOT_PTM achieves a residue prediction accuracy of 99.9% for the total chromatin data set, 98.2% for the ion trap data set, 99.7% for the Q-TOF data set, and 99.8% for the Orbitrap data set. These results are evidence both of the ability to properly identify no modifications on unmodified peptides for various types of spectral instruments and the potential for enhanced accuracy with better spectral resolution (Table II).
The peptide prediction accuracy for a data set is defined as the total number of peptides with the correct modification (or lack thereof) assigned to all residues in the peptide and is displayed in Table III. PILOT_PTM reports a complete peptide prediction accuracy of 100% for data set A1, 93.8% for data set A2, 89.7% for data set A3, 89.7% for data set B, 94.4% for data set C, and 95.2% for data set D1. Note that the accuracies for data sets A1, A2, and A3 are exactly equal to the modified residue prediction accuracy (Table II) because each of the peptides contains exactly one phosphorylated residue. The high prediction accuracies of data sets C and D1 show the ability of PILOT_PTM to fully annotate peptides that are either highly modified (data set C) or part of a very complex sample (data set D1). We also note that PILOT_PTM was able to fully annotate 52 of the 58 histone N-terminal tail peptides in data set B. This is an important result because these peptides are the longest in all test data sets with 50 amino acids, the fragmentation near the middle of the peptide is not nearly as strong as it is near the termini (supplemental annotations), and these spectra contain additional modified peptides at lower stoichiometric amounts that could reduce the ability of an algorithm to identify the most prevalent form (47). The local inclusion of high resolution peaks in the PILOT_PTM preprocessor as well as the accurate identification of peaks of charge 2+ and charge 3+ from isotopic information allow for the proper assignment of the lysine modifications near the middle of the peptide. Table III also shows the improvement in the peptide prediction accuracy when allowing for up to one or two incorrect residues.
When allowing for two incorrect modifications, PILOT_PTM was able to annotate 6,497 of the 6,499 unmodified peptides (100%) and 1,323 of the 1,354 modified peptides (97.7%), showing that PILOT_PTM was still able to annotate a majority of the peptide even when some residues are incorrectly identified (Table III).

Comparative Studies
To benchmark the capability of the method, PILOT_PTM was compared with five state-of-the-art algorithms using the modified data sets B, C, and D1. The compared algorithms include three hybrid sequence tag/database approaches (InsPecT (4), Modi (7), and VEMS (6)) and two pure database approaches (Mascot (11) and X!Tandem (14)). Data sets A1, A2, and A3 were not used because all spectra are chemically synthesized and thus did not necessarily correspond to a peptide that would be found in a database as a result of a tryptic digest. These data sets were instead analyzed with phosphopeptide site assignment software Phosida (31). Details about the algorithm parameters used for each data set are given in the supplemental methods.
Test Set A: Chemically Synthesized Phosphopeptides. As a large majority of the phosphopeptides used in this data set did not correspond to a tryptically digested peptide found in the NCBInr database, a comparison with the five software packages listed above could not be done. A comparison can be made with current phosphopeptide site assignment software such as Phosida (31), which attempts to localize a phosphorylation modification on a template amino acid sequence. Phosida is only capable of predicting serine and threonine phosphorylations on peptides that contain at least 13 amino acids. Of the four peptides meeting this criterion (P1, DLDVPIPGRFDRRVpSVAAE; P2, FQpSEEQQQTEDELQDK; P3, RPVSSAApSVYAGAC; and P4, SFVLNPTNIGMpSKSSQGHVTK), Phosida was only able to assign the correct phosphorylation residue to P2 and P3. The serine at position 15 was incorrectly assigned the modification for P4, and no phosphorylation was assigned for P1.

TABLE II PILOT_PTM residue prediction accuracy
An amino acid residue (modified or unmodified) is assigned a value of 1 if the correct modification (or lack thereof) was assigned to the residue and 0 otherwise. The accuracy for a data set is simply the total residue prediction accuracy for all peptides in the data set. The percent of correct annotations is given in parentheses next to the number of correct annotations. N/A, not applicable.

TABLE III PILOT_PTM peptide prediction accuracy
Peptide prediction accuracy is defined as the total number of peptides with the correct modification (or lack thereof) assigned to all residues. The accuracy within one or two residues represents the total number of peptides with at most one or two incorrect residues, respectively. The percent of correct annotations is given in parentheses next to the number of correct annotations.

Test Set B: Histone H3-(1-50) N-terminal Tail. Mascot was the only compared algorithm for data set B because it is the only algorithm of the five that is specifically capable of handling the highly modified ECD spectra. Although X!Tandem is able to search for c- and z•-ions, the algorithm imposes an upper bound of one modification type for each amino acid. As all of the histone MS/MS spectra in data set B contain more than one type of lysine modification, it was expected that X!Tandem would not be able to accurately identify the modifications in this data set. In fact, when tested, X!Tandem was unable to assign a sequence to any of the 58 spectra.
Of the 58 MS/MS spectra in data set B, Mascot was not able to completely annotate any of the peptides and only correctly annotated 1 (1.7%) when allowing for up to two incorrect modifications (Table IV). Alternatively, PILOT_PTM was able to completely annotate 52 peptides (89.7%). Of the six peptides that PILOT_PTM did not completely annotate, the acetylation on lysine 14 was improperly assigned to lysine 18, indicating that PILOT_PTM was still able to assign the proper modification type. The total number of correct residues is also higher for PILOT_PTM (2,888; 99.6%) than for Mascot (2,595; 89.5%) even though Mascot only allows for modifications on lysine, arginine, serine, threonine, and the termini (supplemental methods), whereas PILOT_PTM utilizes the universal list. This is clear evidence of the ability of PILOT_PTM to accurately predict modification types when given a high resolution MS/MS spectrum. The authors note that Mascot is able to search using the entire list of modifications found in the UniMod database (29) in an error-tolerant search (40). The results of the error-tolerant search were slightly worse than the original search. Mascot retained 99.2% of the original residue annotations with some of the previously unmodified residues now containing small mass shifts associated with deamidation or amidation.
It is speculated that lysine methylation, dimethylation, trimethylation, and acetylation interact together on the histone H3-(1-50) N-terminal tail to give rise to a potential histone "code" (47). It is essential that a PTM prediction algorithm be capable of accurately identifying not only the types of modifications but also the appropriate residues. Thus, we focused on the annotation of the eight lysine residues in the H3-(1-50) N-terminal tail (Table IV). PILOT_PTM was able to accurately identify 452 of the 464 (97.4%) lysine residues (modified or unmodified), whereas Mascot was only able to identify 246 (53.0%) residues. Moreover, PILOT_PTM was able to correctly annotate the lysines at positions 9, 14, 23, 27, 36, and 37 for all 58 spectra. In fact, the highest scoring lysine residues for Mascot are at positions closest to the termini (4, 14, and 37) where the fragmentation is most prevalent for the annotated modified form (supplemental annotations). Mascot scored very poorly for the lysine residues at positions 18 and 23, possibly resulting from the weaker fragmentation and the likely presence of other modified forms of lower abundance (Table IV).
spectra (data set B)
There exist 464 lysine residues and 2,900 total residues for the 58 spectra. Lysine residues 9, 14, 23, 27, and 36 correspond to the modified residues for each spectrum. The percent of correct annotations is given in parentheses next to the number of correct annotations. PILOT_PTM data is reported in boldface font.

Test Sets C and D1: Algorithm Comparison Protocol. To compare the capability of PILOT_PTM against alternative prediction algorithms for test data sets C and D1, a testing protocol was developed for those algorithms that place an upper bound on the number of modification sites or types. We begin by constructing the set S_Ann, which is the set of modifications that was used to create the annotated spectra. Note that S_Ann will be different for data sets C and D1. For each data set, we then create a superset of common modifications, S_Test, from the set S_Ann by adding additional modifications that have been reported on the peptides (data set C) (30) or are commonly found (data set D1) (11). A set of modifications is chosen for a trial as follows. 1) Select a set of modifications, S_Known, from the annotated set S_Ann that are known to be in the sample. This reflects the user's knowledge of the sample in question and the PTMs thought to be present. The set S_Known was fixed to be the four most prevalent modifications in the annotated data set. 2) Select at random a set of unknown modifications, S_Unk, from the remaining modifications in S_Test until the number of known and unknown modifications totals nine, which is the upper bound for variable modification types for Mascot. This reflects the user's uncertainty about the additional test modifications that may or may not be present. The set of nine modifications extracted from steps 1 and 2 represents the variable modifications that will be checked. The modification list used in the protocol is presented in Table V. The modifications that comprise S_Test for each data set will be marked as either present (P) in the test set only, present in the annotated set (A), or present in the annotated set and the known set (C). The modifications in S_Ann will either be marked as A or C, and the modifications in S_Known are marked as C. All remaining modifications are marked as not present in the data set (N).
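The two-step construction of a trial's variable modification list can be sketched as below. This is a minimal sketch, not the authors' code: `build_trial_list` and the placeholder modification labels are hypothetical, and the actual contents of the known and test sets are those listed in Table V.

```python
import random
from typing import Iterable, List, Optional


def build_trial_list(s_known: Iterable[str],
                     s_test: Iterable[str],
                     limit: int = 9,
                     rng: Optional[random.Random] = None) -> List[str]:
    """Keep the known modifications fixed, then draw unknown modifications
    at random from the rest of the test set until `limit` types are reached."""
    rng = rng or random.Random()
    known = list(s_known)
    if len(known) > limit:
        raise ValueError("more known modifications than the allowed limit")
    # Sorting makes the draw reproducible for a seeded generator
    pool = sorted(set(s_test) - set(known))
    n_unknown = min(limit - len(known), len(pool))
    return known + rng.sample(pool, n_unknown)


# Placeholder labels only; these are not the entries of Table V
s_test = [f"mod_{i:02d}" for i in range(1, 16)]
s_known = s_test[:4]  # stands in for the four most prevalent modifications
for seed in range(3):
    print(build_trial_list(s_known, s_test, rng=random.Random(seed)))
```

Each call with a different random state yields one variable modification list, that is, one trial for the algorithms that cap the number of modification types.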
To estimate the expected prediction accuracy, we create multiple modification lists using the above methodology to be tested with the methods Mascot (11), X!Tandem (14), InsPecT (4), and VEMS (6). Note that multiple modification lists are not needed for PILOT_PTM or Modi (7) because these methods place no restriction on the number of variable modifications. For each variable modification list, a separate trial was conducted for Mascot, X!Tandem, InsPecT, and VEMS. We report both the average results and aggregate results over all trials. Average results are calculated by first calculating the accuracy of an algorithm for each trial and then determining the average result over all trials. The aggregate result is calculated by first finding the highest scoring peptide for each spectrum over all trials and then performing the accuracy calculations on these peptides.
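The difference between the average and aggregate results can be made concrete with a small sketch. The data layout is an assumption made for illustration: each trial is a dictionary mapping a spectrum identifier to a pair (search score, correctness flag), and every trial is assumed to report every spectrum.

```python
from typing import Dict, List, Tuple

Trial = Dict[str, Tuple[float, int]]  # spectrum id -> (score, 1 if correct else 0)


def average_accuracy(trials: List[Trial]) -> float:
    """Accuracy computed per trial, then averaged over the trials."""
    per_trial = [sum(ok for _, ok in t.values()) / len(t) for t in trials]
    return sum(per_trial) / len(per_trial)


def aggregate_accuracy(trials: List[Trial]) -> float:
    """Keep the highest scoring peptide per spectrum over all trials,
    then compute the accuracy on those peptides."""
    best: Trial = {}
    for trial in trials:
        for spectrum, (score, ok) in trial.items():
            if spectrum not in best or score > best[spectrum][0]:
                best[spectrum] = (score, ok)
    return sum(ok for _, ok in best.values()) / len(best)


# Two made-up trials over two spectra
trials = [
    {"s1": (50.0, 1), "s2": (20.0, 0)},
    {"s1": (40.0, 0), "s2": (35.0, 1)},
]
print(average_accuracy(trials))    # 0.5
print(aggregate_accuracy(trials))  # 1.0
```

In this toy example each trial is only half correct, but the best-scoring hit for each spectrum is correct, so the aggregate exceeds the average. That ordering is not guaranteed in general, since a high-scoring incorrect peptide can displace a correct one.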
For data set D1, the protocol is slightly modified for the Modi algorithm to account for the fact that Modi can handle the universal list of modifications but requires a database of at most 20 proteins. Thus, we first determine the 10 proteins that correspond to the largest total number of modified peptides and then randomly select 10 additional proteins that contain at least one modified or unmodified spectrum in data sets D1 or D2. Average and aggregate results are calculated in a way similar to that for the above methods. Note that this procedure is not necessary for data set C because all of the test peptides

TABLE VI Peptide and residue accuracies for comparison using propionylated histone fragments (data set C) and total chromatin peptides (data set D1)
Data set C contained 553 spectra with a total of 5,790 residues. Data set D1 contained 525 spectra with a total of 6,892 residues. Parameters for each of the algorithms were chosen to reflect the quality of the spectra as well as the possibility for multiple modifications and missed cleavages. The results for Mascot, InsPecT, VEMS, X!Tandem, and Modi contain both averaged (Avg.) and aggregated (Agg.) results based on the protocol described in the text. For data set D1, InsPecT was also run in unrestricted search mode (Unr.) while allowing up to two modifications. The percent of correct annotations is given in parentheses next to the number of correct annotations. PILOT_PTM data is reported in boldface font.

TABLE V Testing protocol modification list
Each modification is either marked as "N," not present in the test set S_Test, "P," present in the test set only, "A," present in the test set and in the annotated set S_Ann, or "C," present in the test and annotated sets and kept constant during the testing protocol (present in S_Known).

Test Sets C and D1. The peptide and residue prediction accuracy for all algorithms is presented in Table VI. We first note that the aggregate score for any algorithm is higher than the corresponding average score. This is not surprising because the aggregate accuracy comprises the highest scoring results from each trial, but it should be noted that multiple trials are needed to determine the aggregate score. Thus, there is clearly an added cost in terms of the number of trials that need to be run to achieve the aggregate score. We look to the average score for an indication of how accurate an algorithm is on any given trial and expect the accuracy to improve to the aggregate accuracy as more trials are run. Note that PILOT_PTM, Modi (data set C only), and InsPecT in unrestricted mode (data set D1 only) do not require more than one trial. Although the average and aggregate scores are reported for the residue prediction accuracy (Table VI), the peptide prediction accuracy (Table VI), and the subsequence accuracy (Table VII and Fig. 2), the following discussion will solely focus on the aggregate results.
PILOT_PTM is able to fully predict 522 (94.4%) of the peptides from data set C and 500 (95.2%) from data set D1. Alternatively, Mascot is only able to fully identify 474 (85.7%) peptides from data set C, whereas InsPecT fully identifies 484 (87.5%) from data set C and 482 (91.8%) from data set D1 (Table VI). The accuracy of the remaining algorithms for data set C was significantly lower than that for Mascot with X!Tandem reporting the highest of the remainder (213 peptides). X!Tandem was also able to fully identify 471 (89.7%) of the peptides in data set D1 followed by VEMS with a total of 390 (74.3%). The blind search of InsPecT reported only 274 (52.2%) fully annotated peptides from data set D1, the lowest of all algorithms (Table VI, aggregate scores only). When allowing for up to two incorrect residues, PILOT_PTM is able to predict 536 peptides (96.9%) from data set C and 511 (97.3%) from data set D1. InsPecT is the next highest in data set C with 524 identified peptides (94.8%) and ties X!Tandem with 493 peptides (93.9%) for data set D1. Mascot scores the next highest in data set C with 501 peptides (90.6%), and VEMS follows InsPecT and X!Tandem in data set D1 with 441 peptides (84.0%). We also see that the accuracy of the InsPecT blind search improves to 73.3% (385 peptides) when allowing for two incorrect modifications, although it is still 10.7 percentage points lower than VEMS.
The ability to predict a subsequence of a given length gives insight into the effectiveness of an algorithm to sequence a portion of the peptide using appropriate spectral information (Fig. 2 and Table VII). PILOT_PTM reports a subsequence accuracy of 100% for all L ≤ 4 in data set C and for all L ≤ 3 in data set D1. This implies that PILOT_PTM was able to correctly annotate four consecutive amino acids for all 553 spectra in data set C and three consecutive amino acids for all 525 spectra in data set D1. Additionally, PILOT_PTM outperforms all competing algorithms for each listed length for both data set C and set D1 and maintains an accuracy that is at least 3.5% greater than the next highest scoring algorithm for data set C and at least 1.3% greater for data set D1 (Table VII).

The results for Mascot, InsPecT, VEMS, X!Tandem, and Modi contain both averaged (Avg.) and aggregated (Agg.) results based on the protocol described in the text. For data set D1, InsPecT was also run in unrestricted search mode (Unr.) while allowing up to two modifications. The percent of correct annotations is given in parentheses next to the number of correct annotations. PILOT_PTM data is reported in boldface font.

Complete MS Analysis
Complete LC-MS/MS Untargeted Modification Search Protocol. To run an untargeted modification search with PILOT_PTM on a complete MS scan, we must first generate a set of candidate template sequences for use with the algorithm. The sequences will be generated by initially scanning the data using a peptide sequencing algorithm to uncover all spectra that are either unmodified or contain oxidized methionine. Using the SEQUEST algorithm (10), a protein list (PL; probability >5e-5) was generated, and a superset of candidate template sequences (TS) is then defined as the non-redundant list of peptide sequences found in the search. We augment PL with a dummy "No match" protein and then map all peptides in TS to their corresponding proteins in PL. Any peptide that was not assigned a protein from the SEQUEST search is assigned No match.
FIG. 2. Subsequence accuracy results from comparisons using data sets C and D1. The results from Mascot (data set C), InsPecT (restricted), VEMS, X!Tandem, and Modi (data set D1) are calculated using many trials, each consisting of a distinct variable modification list. The results for PILOT_PTM, Modi (data set C), and InsPecT in blind search mode (unrestricted (Unr.); data set D1) are reported for only one trial because these algorithms do not place a restriction on the types of variable modifications considered. a, aggregate (Agg.) subsequence accuracies for the histone fragment test data set. b, average (Avg.) subsequence accuracies for the histone fragment test data set. c, aggregate subsequence accuracies for the chromatin test data set. d, average subsequence accuracies for the chromatin test data set.

The spectra not annotated by SEQUEST were subject to filtering where those that did not contain at least 50 ion peaks were removed. For each remaining spectrum, we run the following sequence of steps. 1) Determine a three-amino acid unmodified sequence tag based on the experimental data. 2) Search all peptides in TS to derive a list of template sequences that exactly contain the sequence tag. 3) Run the PILOT_PTM algorithm for each template sequence. 4) Compare the cross-correlation score of the top modified peptide for each template sequence and select the peptide that has the highest score. Because of the large number of template sequences generated during step 2 of the above procedure, we imposed a window on the possible mass gaps for possible modifications (step 3). We set the lower bound to be -50 Da and the upper bound to be 250 Da.
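Steps 2 through 4 and the mass-gap window can be outlined in a short sketch. This is an outline under stated assumptions, not the PILOT_PTM code: the integer linear optimization model is replaced by a caller-supplied function `run_model` (hypothetical) that returns the top modified peptide and its cross-correlation score for a template, or `None` if no assignment is possible, and the tag is assumed to have been derived already in step 1.

```python
from typing import Callable, Iterable, Optional, Tuple

Result = Tuple[str, float]  # (top modified peptide, cross-correlation score)


def within_mass_window(mass_gap: float,
                       lower: float = -50.0,
                       upper: float = 250.0) -> bool:
    """Mass gaps outside [lower, upper] Da are not considered in step 3."""
    return lower <= mass_gap <= upper


def annotate_spectrum(tag: str,
                      template_sequences: Iterable[str],
                      run_model: Callable[[str], Optional[Result]]
                      ) -> Optional[Result]:
    """Steps 2-4: filter templates by tag, score each, keep the best."""
    best: Optional[Result] = None
    for template in template_sequences:
        if tag not in template:       # step 2: exact substring match
            continue
        result = run_model(template)  # step 3: stand-in for the ILP model
        if result is not None and (best is None or result[1] > best[1]):
            best = result             # step 4: highest cross-correlation
    return best


# Toy usage with made-up templates and scores
fake_scores = {"AKSTGGKAPR": 2.1, "KSTGGK": 3.4}
model = lambda t: (t, fake_scores[t]) if t in fake_scores else None
print(annotate_spectrum("STG", ["AKSTGGKAPR", "KSTGGK", "PEPTIDE"], model))
```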
Case Study: Total Chromatin Extraction. The above protocol was tested on several data sets generated from a total chromatin extraction. All spectra that had a minimum XCorr value (1.5 for z = 1, 2.0 for z = 2, 2.5 for z = 3, and 3.0 for z = 4) were annotated with the associated peptide and oxidized methionine modifications (if applicable), and PL and TS were generated as described above. A total of 466,905 spectra were initially analyzed with the SEQUEST algorithm. A total of 81,961 unmodified spectra and 4,838 modified spectra were found, yielding 19,250 distinct peptides and 1,913 distinct proteins. After applying the ion peak filtering, a total of 273,733 spectra were analyzed with PILOT_PTM. Prior to searching, all isotopic labels were removed from the universal list of modifications as these modifications will not be present in the chromatin data.
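The charge-dependent XCorr cutoff and the 50-peak filter from the protocol can be expressed as two small predicates. This is a sketch with hypothetical names; rejecting charge states above 4 is an assumption, since the text lists thresholds only for z = 1 to 4.

```python
# Minimum XCorr per precursor charge state, as listed in the text
XCORR_MIN = {1: 1.5, 2: 2.0, 3: 2.5, 4: 3.0}
MIN_ION_PEAKS = 50


def passes_xcorr(charge: int, xcorr: float) -> bool:
    """True if a SEQUEST hit meets the minimum XCorr for its charge state."""
    threshold = XCORR_MIN.get(charge)
    # Assumption: charge states without a listed threshold are rejected
    return threshold is not None and xcorr >= threshold


def keep_for_untargeted_search(annotated: bool, n_peaks: int) -> bool:
    """Unannotated spectra with at least 50 ion peaks go on to PILOT_PTM."""
    return (not annotated) and n_peaks >= MIN_ION_PEAKS


print(passes_xcorr(2, 2.3))                   # True
print(passes_xcorr(3, 2.3))                   # False
print(keep_for_untargeted_search(False, 72))  # True
```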
The sequence tag generation step reduced the number of template sequences per MS/MS spectrum to ~10 on average. Some of these sequences will not be able to generate appropriate sets of valid candidate ion peaks (CS_k) for several consecutive amino acids because of the inability to connect candidate ion peaks with an appropriate series of jumps. The total number of template sequences fully analyzed by PILOT_PTM is ~6 per MS/MS spectrum on average (46,610 sequences).
PILOT_PTM assigned a sequence with modifications or amino acid substitutions to 7,641 spectra, including a total of 6,356 modifications and 11,391 substitutions. We report the histogram of modifications for all annotated spectra in Fig. 3. Oxidized methionine (1,957 PTMs) was removed from Fig. 3 to show the counts of the other modifications in higher detail. C-terminal methylation and N-terminal acetylation appeared 829 and 330 times, respectively. However, these modifications are likely the result of sample preparation and not post-translational modification. Methylation is the most prevalent modification appearing on the N terminus (257 PTMs), lysine (179 PTMs), arginine (110 PTMs), aspartic acid (109 PTMs), asparagine (88 PTMs), threonine (67 PTMs), and glutamine (33 PTMs). Dimethylation is the next most abundant modification appearing on the N terminus (247 PTMs), lysine (184 PTMs), arginine (102 PTMs), and asparagine (35 PTMs). Acetylation is annotated on serine (110 PTMs) and lysine (46 PTMs); deamidation is annotated on glutamine (163 PTMs), asparagine (83 PTMs), and arginine (15 PTMs); formylation is annotated on serine (65 PTMs) and threonine (62 PTMs); and hydroxylation is annotated on proline (121 PTMs), valine (103 PTMs), and aspartic acid (39 PTMs).

FIG. 3. Modification histogram for untargeted analysis of a chromatin fraction. Each modification is given an abbreviation (Abb.) of the form AA-M where AA is the name of the amino acid and M is the modification type. Note that CT and NT refer to the C terminus and N terminus, respectively. a, histogram of the total amount of modification counts present in all 7,668 annotated spectra. b, numerical table for the modifications in a.

DISCUSSION
A novel integer linear framework for the assignment of PTMs on a template sequence was developed. PILOT_PTM utilizes the universal list of modifications while placing no restrictions on the number of modification types or modification sites for a given peptide.
The case studies presented above demonstrate the high accuracy of the PILOT_PTM algorithm when analyzing modified spectra that come from different mass spectrometers as well as different fragmentation patterns. The superior ability of PILOT_PTM when compared with five current PTM prediction algorithms is demonstrated using highly modified histone H3-(1-50) peptides and peptides from a large scale chromatin-enriched fraction. The performance of PILOT_PTM may be due to the number of peaks selected from the MS/MS spectrum for analysis. Database and hybrid methods may use fewer peaks to discriminate between correct and incorrect results, but it is often necessary to utilize lower abundance peaks to properly assign the modification type and modification site when a large variable modification list is considered. To maintain the efficiency of PILOT_PTM when the MS/MS spectrum contains many peaks, a strict filtering algorithm is used during the preprocessing stage (supplemental methods) to eliminate all possible isotopes, neutral losses, and multiply charged ions from consideration in the candidate peak list. In fact, the preprocessing stage is crucial for the ECD data where the spectral resolution often enables proper assignment of many charge states that can be converted into the appropriate singly charged peak or removed from consideration.
The computational run time for a completely automated run of PILOT_PTM for a single template sequence is shown in detail for the data sets in Table VIII. The time is reported for both the single thread and parallelized version of CPLEX (55) (eight threads) on average on an Intel Pentium 4 3.0-GHz Linux-based computer. For each data set, we calculate the average number of residues, R, per peptide by dividing the total number of residues (Table II) by the total number of peptides (Table III). The average CPU time for all data sets ranged from 8.7 to 18.3 CPU s with ranges from 9.1 to 16.1 for all data sets except set B. The increase in CPU time for this data set is due to the large number of peaks present in the ECD data set (R = 50) that were retained by PILOT_PTM. The average run time is reduced on average by a factor of 4.85 if a parallelized version of CPLEX is used. With an average computational time of 2.8 CPU s per spectrum, the total time required to run all stages of PILOT_PTM and output results for all 7,853 spectra is 6.1 CPU h. Furthermore, addition of modifications to the universal list does not result in an increase in PILOT_PTM run time because the number of binary variables and constraints will remain unchanged in the ILP. The raw spectral data and source code for the PILOT_PTM algorithm are available upon request.

* This work was supported, in whole or in part, by National Institutes of Health Grant R01LM009338 (to C. A. F.). This work was also supported by the United States Environmental Protection Agency Science to Achieve Results Program through Grant R 832721-010.
This article contains supplemental Table 1, methods, and

The average time to process a spectrum was measured using CPLEX version 11.1 on a Pentium 4 3.0-GHz Linux-based computer. The parallel time utilized the parallel CPLEX software package on an eight-thread unit. The reported time is taken as the average over all spectra in that data set. The average number of residues per peptide, R, is calculated for each data set as the total number of residues divided by the total number of peptides.