|
|
||||||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Molecular & Cellular Proteomics 5:652-670, 2006.
© 2006 by The American Society for Biochemistry and Molecular Biology, Inc.
,
,¶
,||
,||

,||
From the
Institute for Systems Biology, Seattle, Washington 98103, and || Swiss Federal Institute of Technology, CH-8092 Zurich, Switzerland
| ABSTRACT |
|---|
|
|
|---|
Shotgun proteomics creates significant computational challenges (58). Large numbers (on the order of 105) of MS/MS spectra acquired in each experiment need to be computationally processed to identify peptides that produced them and to infer what proteins were present in the original sample. In most high throughput studies, peptide identification is performed by searching MS/MS spectra against protein sequence databases. A number of automated database search tools have been developed for that purpose, including commercial and open source programs (917). These programs correlate the experimental MS/MS spectra with theoretical fragmentation patterns of peptides obtained from a sequence database and use various scoring schemes to find the best matching peptide sequence. This high throughput protein identification process, however, is prone to false positives resulting from incorrect peptide assignments to MS/MS spectra by the database search tools (5, 1821). The problem of false positives has received significant attention in recent years. As a result, statistical approaches and computational tools were developed for assigning confidence measures to peptide and protein identifications and for estimating the false identification rates. These tools reduce the need for time-consuming manual verification of peptide assignments and allow faster and more consistent analysis of large scale datasets (5).
Despite the progress in developing new database search tools and methods for statistical validation of peptide assignments to MS/MS spectra, the number of spectra that remain "unassigned" (i.e. the sequence of the peptide that produced the spectrum is not known) in any experiment is significant. In fact, of all MS/MS spectra acquired in a typical shotgun proteomic experiment only a relatively small fraction (e.g. typically less than a half in the case of the ion trap instruments) is assigned a peptide with high confidence. The use of higher quality instruments having better mass accuracy and resolution alleviates the problem but does not eliminate it. Reasons for such a high failure rate include the deficiencies of the scoring schemes used to quantify the degree of similarity between the experimental spectrum and those predicted for database peptides, ambiguities in the determination of the charge state of the peptide ions selected for fragmentation (in low mass accuracy instruments), the presence of spectra arising from non-peptidic contaminants, concurrent fragmentation of multiple different precursor ions, and the low quality of many spectra due to excessive noise or incomplete peptide fragmentation (5, 7, 22, 23). However, a significant fraction of high quality, peptide-derived spectra also remain unassigned, and the failure of the database search tools to interpret them correctly cannot be explained by the reasons mentioned above.
One source of unassigned high quality spectra are peptides containing a post-translational or chemical modification. To speed up the analysis, MS/MS database searching is often performed in a way that does not anticipate the presence of modifications in the peptides analyzed, and the modified peptides are therefore missed. Another source for such spectra are peptides whose sequence is not present in the searched protein sequence database, e.g. peptides corresponding to unanticipated alternative splice forms or sequence variants of known proteins (polymorphisms). Although identification of modified, mutated, or novel1 peptides can be of high significance, the high quality spectra that remain unassigned after the initial database search pass through the data in most high throughput experiments are not further analyzed. Finding such spectra would normally entail manually sifting through large amounts of low quality data, which is rarely practiced due to the ever increasing number of newly acquired datasets waiting to be processed and interpreted. Thus, there is a clear need to develop computational tools for automated spectrum quality assessment that could be used to detect unassigned high quality spectra and mark them for subsequent reanalysis. Such quality assessment tools can also be used for a different task, e.g. to filter out low quality spectra prior to database searching to reduce the computational time or to assist in the process of discriminating between correct and incorrect peptide assignments.
In this work, we present a dynamic quality scoring approach for finding high quality unassigned spectra in large shotgun proteomic datasets. The basic idea behind the method is that in the first database search pass through the data high confidence peptide assignments are generally based on high quality MS/MS spectra. Therefore, the notion of what constitutes a high quality spectrum can be learned from the analyzed data itself, i.e. without relying on a training dataset created using spectra from a different experiment (24). This way the statistical classifier is automatically developed for each dataset anew, ensuring the robustness of the method toward variations in the MS/MS spectrum properties caused by differences in acquisition methods or instrument to instrument variability. This distinguishes this method from other recently described approaches (25, 2729)2 that can produce inaccurate results when applied to spectra that differ significantly from those used for training. The statistical spectrum quality classifier is computed using an extended set of spectrum features, including those designed to take into account the knowledge of the peptide fragmentation process.
The accuracy of the method is evaluated using a dataset of MS/MS spectra from a recent experiment on human lipid rafts (30). The results indicate that the spectrum quality classifier enables fast and automated detection of high quality spectra left unassigned during the initial computational pass through the data. We also demonstrate that by interrogating those unassigned high quality spectra more comprehensively using existing protein sequence databases and by searching against genomic databases, one can significantly increase the number of identified peptides, including peptides containing modifications and sequence polymorphisms. Furthermore reanalysis of high quality spectra can lead to the identification of novel peptides, e.g. peptides confirming computationally predicted alternative splice forms.
| EXPERIMENTAL PROCEDURES |
|---|
|
|
|---|
1) The Haemophilus influenzae dataset consisted of 15 LC-MS/MS runs and was used previously as a test dataset for the development of a statistical model for validation of peptide assignments to MS/MS spectra (19). Sample proteins were digested using trypsin, and the resulting peptide mixtures were separated using reverse phase and strong cation exchange (SCX)3 chromatography prior to MS/MS sequencing. The dataset contained more than 16,000 multiply charged MS/MS spectra generated from the membrane fraction of the sample. The spectra were searched using SEQUEST as described previously (19), and peptide assignments to the spectra were processed using PeptideProphet. Of the 31,728 search results (multiply charged spectra were searched twice, assuming 2+ or 3+ charge state, because the exact charge could not be determined), 4229 MS/MS spectra were assigned a peptide with probability above 0.9 (3457 and 730 assignments to spectra of doubly charged and triply charged precursor ions, respectively). This dataset was used to optimize the method for computing spectrum quality scores and to test the accuracy of the method as a function of the dataset size.
2) The Arabidopsis dataset4 was acquired from four protein mixtures derived from cultured Arabidopsis thaliana cells. The crude extract was loaded on a one-dimensional SDS gel, and single bands were cut out. Each band was digested with trypsin, and peptides were sequenced by LC-MS/MS. In total, 3420 MS/MS spectra were submitted to a SEQUEST database search against a protein database of Arabidopsis and known contaminants allowing for semitryptic (tryptic at one terminus only) peptides with a mass tolerance of 3 Da and with specifying carboxyamidomethylation of cysteines and methionine oxidation as variable modifications. This search resulted in 6720 peptide assignments (counting 2+/3+ duplicates). Of those, 924 were peptide assignments to spectra with a probability of being correct greater than 0.90 as computed by PeptideProphet. Additional database searches were performed using this dataset to test the ability of the method to recover high quality unassigned spectra.
3) The human lipid raft dataset was taken from a large scale quantitative proteomic experiment on lipid raft plasma membrane domains from human Jurkat T cells (30). The full dataset is available from the PeptideAtlas MS/MS data repository5 (31). Nine LC/MS/MS runs (flow-through sample, SCX fractions 31 through 39) were selected for the analysis in this work. The spectra, 12,864 in total, were first searched using SEQUEST against the human International Protein Index (IPI) database (32), version 2.35, allowing for semitryptic peptides with a mass tolerance of 3 Da and allowing for methionine oxidation as a variable modification. This resulted in 24,197 peptide assignments, counting 2+/3+ duplicates (711 singly charged, 12,153 doubly charged, and 12,044 triply charged). Of those, 4171 were peptide assignments to spectra with a PeptideProphet probability of being correct equal or greater than 0.9 (4034 peptide assignments to 2+/3+ spectra and 137 assignments to 1+ spectra). The results were further analyzed using ProteinProphet (19), resulting in 315 proteins having a ProteinProphet probability above 0.9 identified by 660 unique peptides. The number of proteins was determined by counting the number of entries in the minimal list of proteins sufficient to explain all observed peptides (19). The list of peptide assignments and the corresponding list of protein accession numbers are provided in Supplemental Table S1 (the results of the analysis of the entire dataset, including ICAT samples, can be found on the PeptideAtlas website). In calculating the number of assigned spectra, a spectrum was counted only if it was assigned a peptide corresponding to a protein with ProteinProphet probability greater than 0.9. To eliminate random matches to proteins correctly identified by other peptides, an additional peptide level constraint was applied: a spectrum was counted as assigned only if the peptide was identified (from this or another spectrum in the dataset) with PeptideProphet probability equal to or greater than 0.5.
To determine what types of peptides generated the high quality spectra unassigned in the initial search, a subset of the initially unassigned MS/MS spectra was reanalyzed using a number of additional searches.
1) "Large mass tolerance" search: SEQUEST, semitryptic, 5-Da mass tolerance (larger mass tolerance compared with the initial 3-Da search).
2) "4+/5+ charge state" search: SEQUEST search, semitryptic, 3-Da mass tolerance, assuming 4+ or 5+ charge state.
3) "Pyro-Glu" search: SEQUEST, semitryptic, 3-Da mass tolerance, allowing for conversion of N-terminal glutamine to pyroglutamic acid (loss of 17 Da) as a variable modification. Due to a limitation of SEQUEST, the search was performed allowing for modified residues to be located anywhere in the sequence; peptides that did not contain glutamine at the N terminus were filtered out at the data validation stage.
4) "Mascot" searches: Mascot, tryptic peptides only (note that the Mascot tryptic search allows for removal of the initiating Met), two missed cleavages or less, 3-Da mass tolerance, allowing for N-terminal acetylation (+42 Da) or carbamylation (+43 Da) as a variable modification.
5) Miscellaneous searches: X! Tandem (13) search allowing for more than one type of modification per peptide; SEQUEST and Mascot searches allowing for modifications not specified in the previous searches (e.g. conversion of N-terminal glutamic acid residues to pyroglutamic acid, phosphorylation, guanidination, etc.).
These searches were performed against the same version of the IPI database as the initial search (version 2.35). The results from different searches were processed using Interact and PeptideProphet and then manually scrutinized. Validated peptide sequences were imported in a relational structured query language database, and the total number of assigned spectra, the number of unique peptides, and the minimum number of protein identifications sufficient to explain all identified peptides were calculated. In addition, the spectra were searched against a genomic database as described under "Results."
Computational Method
Separation of Spectra into Peak List Subsets (Low and High Intensity Components)
Prior to computing spectrum features, two peak list subsets are extracted from each spectrum. The main reason for this step is a reduction of the number of noise peaks in the spectrum whose presence can lower the discriminating power of some spectral features. The two subsets of peaks are extracted for each spectrum in parallel using the following approaches.
1) A signal intensity cutoff is applied using a robust percentile-based approach, creating the "high intensity" peak subset. To account for the significant variability of the fragment ion intensities across the spectrum, the spectrum is divided into five equally sized m/z sections. Within each section, the peaks are sorted according to their intensity, and the intensity at a given percentile is used as the cutoff value (i.e. the signal intensity at the 50th percentile would be equal to the median intensity of the peaks in a section). All peaks with intensity above the cutoff are assigned to the high intensity peak subset. Depending on the instrument type and settings, MS/MS spectra may contain a large number of noise peaks. To make the percentile approach more robust, the number of peaks taken into account is restricted to the 400 most intense peaks per 1000-Da interval.
2) Spectrum deisotoping is performed based on a Poisson distribution model, thus creating the "deisotoped" peak subset. The peak heights of the isotope patterns are predicted by a Poisson distribution model with fragment masses, and the "heavy isotope excess" constant is estimated by dividing the sum of the average masses of all amino acids by the sum of the monoisotopic masses. The theoretical distribution and the actual spectrum are compared, and the
2 test statistic is calculated to measure the quality of the fit. All peaks with the
2 value below 20 are assigned to the deisotoped peak list subset. This represents a relatively loose threshold appropriate for low mass resolution data where the observed isotope distributions of heavy ions tend to deviate substantially from the theoretical distributions.
Spectrum features such as sequence tags, complementary fragments, and neutral losses (described below) are calculated exclusively on the high intensity peak subset as defined by the percentile cutoff, whereas the general spectrum features are calculated using the full spectrum as well as the two peak list subsets (the high intensity peak subset and the deisotoped peak subset). The percentile cutoff was optimized separately for each spectrum feature as described under "Results."
Spectrum Features
Several classes of spectrum features were introduced to measure the quality of MS/MS spectra. These features were designed to characterize the overall distribution of peaks in the spectrum (general spectrum features) as well as to detect certain patterns (sequence tags, complementary fragment ions, and neutral losses) expected to be present in high quality MS/MS spectra but absent in low quality spectra.
General Spectrum Features
Eight general descriptive features were calculated for each peak list subset.
(a) Number of peaks, square root-transformed.
(b) Arithmetic mean of the peak intensities, log-transformed.
(c) Standard deviation of the peak intensities, log-transformed.
(d) Smallest m/z range containing 95% of the total peak intensity.
(e) Smallest m/z range containing 50% of the total peak intensity.
(f) Total ion current per m/z (total ion current divided by feature d), log-transformed.
(g) Standard deviation of the consecutive m/z gaps between all peaks, log-transformed.
(h) Average number of neighbor peaks within a 2-Da interval around any peak.
The log or square root transformation of some of the spectrum features was applied to obtain a more symmetric shape of the distributions and to minimize the variance across the spectra and resulted in improved discriminating power of the composite spectrum quality classifier derived using the linear discriminant function approach (described under "Spectrum Quality Score"). For each spectrum, the entire peak list and the two subsets (the high intensity peak subset and the deisotoped peak subset) were extracted and characterized by the eight descriptive features, resulting in 24 general spectrum features. Clearly many of these spectrum features are redundant, and there is a particularly high degree of correlation between those spectrum features that are derived by applying the same descriptive feature to the three different peak subsets. Nevertheless having each spectrum feature computed in triplicate slightly improved the overall performance of the method.
Sequence Tags
A high quality peptide MS/MS spectrum should contain an extended series of peaks separated by the mass differences corresponding to one of the 20 amino acids. Thus, if a spectrum does not allow extraction of at least a short sequence tags (partial sequence) it is unlikely to be of high quality.6 The sequence tag-based spectrum features are calculated using the high intensity peak subset only. The following four spectrum features measuring the presence of amino acid mass differences in the spectrum are computed.
(a) The length of the longest sequence tag that can be extracted from the spectrum (24). The tags are extracted using a fast dynamic programming algorithm in a single pass through the spectrum and assuming that all peaks correspond to singly charged fragment ions from the same ion series (e.g. all y+ or b+ ions).
(b) The average (per peak) length of all extracted sequence tags, calculated by summing up the lengths of the tags involving each peak in the high intensity subset of the spectrum and dividing it by the number of peaks.
(c) The number of peak pairs corresponding to an amino acid mass difference (sequence tag of length one).
(d) A derived version of the above feature computed using the peak intensities as weighting factors to account for the increased likelihood of more intense peaks being true fragment ions. For every pair of peaks, the geometric mean of their intensities is calculated, and the spectrum feature is computed as a ratio of the sum of the geometric means of peak pairs corresponding to an amino acid mass divided by the sum of the geometric means of all peak pairs.
In the case of peptides containing modifications, the sequence tags are bound to be interrupted by mass differences not corresponding to any of the standard amino acid residue masses. Although the masses of typical modifications can be included in the list of amino acid masses, this would lead to an increased number of peak pairs occurring at random. The additional measures counting the isolated amino acid mass differences (features c and d) are thus useful in the case of modified peptides where extraction of long sequence tags using the standard set of 20 amino acids might not be possible.
Complementary Fragment Ions and Neutral Losses
The fragmentation of multiply charged peptide ions via collision-induced dissociation results in the charges being distributed among the resulting fragment ions. The doubly charged parent ions most often dissociate into two singly charged fragments, which leads to a high degree of symmetry in the spectrum. In contrast, the triply charged parent ions dissociate into two charged fragments, but the charges are unevenly distributed with one fragment carrying two charges. When a precursor ion with the m/z value mzp and charge zp breaks down into two ions (in the case of ion trap mass spectrometers, primarily b ions and y ions), the overall mass and charge are preserved but split up between the two fragment ions, thus meeting the following criteria (33, 34).
![]() |
Although the fragmentation process is strongly dependent on the sequence of the peptide and the resulting fragmentation patterns observed in MS/MS spectra can be quite complex, a high quality spectrum should still contain a significant number of complementary peak pairs (fragment ions whose masses add up to the mass of the precursor ion). Three spectrum features are computed (using the high intensity peak subset only) in the following way.
(a) The number of complementary peak pairs satisfying the above conditions (Equation 1).
(b) The number of complementary peak pairs (feature a) divided by the background count (the number of complementary pairs expected by chance). The background count is estimated by averaging the numbers of peak pairs satisfying the same conditions (Equation 1) but using 10 incorrect m/z values of the precursor ion obtained by shifting the correct precursor ion mass by ±10, 20, 30, 40, and 50 Da.
(c) Similar to feature b but with the background count subtracted from the number of complementary peak pairs instead of taking the ratio.
When the charge state of the precursor peptide is not known (e.g. multiply charged ions in the case of low mass resolution data), the number of complementary peaks is computed first assuming the 2+ charge state (in which case all observed fragment ions are expected to be singly charged) and then assuming the 3+ charge state (in which case both singly charged and doubly charged fragments are allowed). Then the largest of the two counts is selected and used as the number of complementary peak pairs feature (feature a), and the two derivative features (b and c) are computed as described. It should be noted that some of the spectrum features used here for quality assessment purposes, especially the number of complementary fragments and its derivatives, can be used to assist in the determination of the precursor ion charge state directly from the MS/MS spectrum (33, 34). In the case of high resolution mass spectrometers where the charge state can be accurately determined, the entire analysis can be performed separately for spectra with different precursor ion charge states.
Similarly neutral losses such as loss of ammonia (loss of 17 Da), water (18 Da), and carbon monoxide (28 Da) are often found in MS/MS spectra of peptide ions (see e.g. Ref. 35). The presence of neutral losses in the spectrum is useful for distinguishing peptide ion spectra from those of contaminants or background noise. They are detected in the same way as the complementary fragments described above, thus adding nine more spectrum features and bringing the total number to 40.
Spectrum Quality Score
Improved discrimination between assigned and unassigned spectra (and thus, to a large degree between high quality and low quality spectra7) can be achieved by combining all separate spectrum features into a single discriminant score (the spectrum quality score (SQS)). The spectrum quality score is a weighted combination of the individual scores si described above computed according to the discriminant function (Equation 2).
![]() |
In the linear discriminant function (LDF) approach used in this work the constant c0 and the weights ci are derived in such a way that the ratio of between-class variation to within-class variation is maximized under the assumption of multivariate normality. This assumption does not necessarily hold for all of the spectrum features, and nonlinear classification methods should in general allow better discrimination. Nevertheless the LDF analysis is a robust multivariate technique that has been proven to be a valuable chemometric tool in similar applications, e.g. the elucidation of chemical structure of organic compounds from mass spectrometry data (36). The main advantage of the LDF approach and the rationale for its use in this work over more sophisticated nonlinear methods was the ease of its implementation as a part of the dynamic quality scoring algorithm.
Deriving the discriminant function requires a training dataset, which in the present method is created automatically using the confidence values of peptide assignments to spectra computed by PeptideProphet as described above. Based on that training dataset, the variance-covariance matrix and the vector of the group differences are calculated, and the coefficients ci are computed using simple matrix algebra (37). The resulting score computed according to the discriminant function is used as a final quality score in place of the original numerical spectrum features.
| RESULTS |
|---|
|
|
|---|
|
Third, all spectrum features are combined into a single composite spectrum quality score (see "Experimental Procedures"). The weighing factors, measuring the relative contribution of each spectrum feature toward the composite quality score, are determined using the linear discriminant analysis. Discriminant function coefficients are determined automatically for each analyzed dataset anew ("dynamic" quality scoring). Finally the statistical classifier is applied to the spectra left unassigned in the initial search to find those that are of high quality. Such spectra are the most interesting candidates for subsequent reanalysis because they are likely to contain valuable biological information that is worthwhile to be explored further.
Method Optimization and Testing
The computational method was first tested and optimized using several datasets. The optimization of the spectrum features was performed using the H. influenzae dataset. Spectrum features such as sequence tags, complementary fragment ions, and neutral losses were found to be sensitive to the peak intensity cutoff. As the peak intensity cutoff increases, an increasingly higher fraction of the noise peaks are eliminated, thus reducing the number of peak pairs and sequence tags occurring by chance.8 This, in turn, improves the discriminating power of the spectrum features. This trend, however, holds up only to a certain point after which the performance starts degrading due to exclusion of true fragment ions from the high intensity peak subset. Fig. 2A illustrates this in more detail for one particular feature, the length of the longest sequence tag that can be extracted from the spectrum. The plot shows the frequency of extracting a tag (the longest tag for each spectrum) of a certain length among assigned and unassigned spectra in the H. influenzae dataset when using all peaks in the spectrum and after applying an 80% intensity cutoff (i.e. using only the top 20% of most intense peaks). For both assigned and unassigned spectra, the distributions are shifted toward shorter tag lengths when the sequence tags are extracted from the high intensity peak subsets compared with using all peaks. The average length of the longest tag extracted from the assigned or unassigned spectra was around 15 and 6 amino acids, respectively, when the tags were extracted using all peaks. These numbers dropped to
7 and 2 at the 80% intensity threshold. To find the optimal intensity cutoff, the significance of each spectrum feature was measured using the non-parametric two group comparison test (Mann-Whitney z score). Fig. 2B shows the statistical significance of several spectrum features that are representative of each group (sequence tags, complementary fragments, and neutral losses) plotted as a function of the intensity cutoff. For each class of spectrum features, the discriminating power was evaluated, and the parameters were optimized. The plot demonstrates that in the case of sequence tag-based features, the optimal intensity cutoff (the cutoff resulting in maximum discriminating power) was around 80%. A similar analysis was performed for other classes of spectrum features, and the intensity cutoffs of 70 and 50% for the complementary fragments and neutral losses, respectively, were chosen. In the case of neutral losses, the choice of cutoff was slightly arbitrary in the sense that a whole contiguous range of cutoffs yielded similar performance, reflecting the absence of a pronounced maximum in the curve on Fig. 3B. In addition, similar trends were investigated for the mass tolerance parameter, and the tolerance of 0.7 Da was selected for computing the sequence tag-based features, 0.9 Da was selected for the complementary fragments, and 0.5 Da was selected for neutral losses. In general, these parameters should be optimized separately for each type of mass spectrometer. Ideally the optimization process should be carried out using the same on the fly created training dataset used to determine the discriminant function coefficients in Equation 2, thus making the method even more robust. However, preliminary tests indicated that slight changes in the model parameters did not significantly affect the overall performance of the spectrum quality classifier, and the implementation of the dynamical optimization process was deferred to future work.
|
|
Within each class of spectrum features, not all individual features contributed equally to the discrimination. Furthermore combining several redundant features did not result in a significantly improved discrimination overall. For example, using the length of the longest sequence tag extracted from the spectrum as the only sequence tag-based feature allowed almost as good overall discrimination as using all four features from that category combined. The effect of inclusion of redundant or marginally discriminating features on the accuracy of the classifier was investigated by performing a stepwise analysis (addition of one new feature at each step), and ROC curves were plotted to measure the performance of the classifier at each step (data not shown). Importantly addition of redundant features or features with low discriminating power did not negatively affect the performance of the classifier but resulted in a slight improvement. Furthermore the use of multiple related features made the method more robust toward unexpected variations in the spectrum properties. Based on these considerations, all 40 spectrum features were kept in the analysis.
Another important consideration is the minimum size of the dataset needed to realize the advantages of dynamic training. This was investigated by applying the method to several H. influenzae datasets of decreasing size. The results of this analysis are shown in Fig. 3B. Overall the plot demonstrates that the method can be used even with datasets of relatively small size. As expected, the best results were achieved with large datasets consisting of more than several LC-MS/MS runs. However, even for a dataset consisting of a single LC-MS/MS run (with 111 and 362 assigned and unassigned spectra, respectively, in the dynamically created training dataset), the classification performance was satisfactory and only marginally worse than that observed for larger datasets. The accuracy of the classifier started degrading to a significant degree only when applied to very small datasets (with less than 50 spectra assigned in the initial search). Thus, in a typical shotgun analysis of complex protein mixtures, in most cases even a single LC-MS/MS run generated on current mass spectrometers should be sufficient to train the classifier because it is unlikely that it would contain fewer than several thousand spectra of which minimally a few hundred spectra would be identified confidently in the initial database search. In case of very small datasets, rather than training the classifier dynamically it would better to use a larger dataset generated in another (but similar) experiment. The larger size of the training dataset would then compensate for the loss of accuracy due to the differences in the spectrum properties between the dataset of interest and the training dataset.
Evaluation of the Method
The ability of the method to effectively recover unassigned high quality spectra was evaluated using the Arabidopsis dataset. The same dataset was used to investigate the robustness of the method toward the presence of misclassified (high quality) spectra in the unassigned subset of the on the fly created training set. The tests were carried out in the following way. First the spectra were search against the full database and allowing for Met + 16 as a variable modification (normal search). The same spectra were than searched against a modified database, or allowing for no modifications, thus restricting the search space to prevent the identification of some spectra (called masked spectra) that were successfully assigned in the normal search (see Fig. 4A). These flawed searches were performed in three different ways by making the following changes compared with the normal search.
|
2) Search against a modified database created by removing sequences of four high abundance proteins, resulting in 122 masked spectra.
3) Search with the mass of serine statically increased by 5 Da, resulting in 402 masked spectra.
The spectrum quality classifier was developed using the results of the flawed searches. The performance of the method was then evaluated by plotting the distribution of quality scores for the masked spectra and comparing it with the distribution for the assigned spectra (see Fig. 4, B, C, and D). In the first two tests (Fig. 4, B and C), the distribution of quality scores for the masked spectra remained similar to the distribution observed for the assigned spectra. The performance was worse in the third case (the search with incorrect serine mass) where the number of high quality spectra in the unassigned subset of the dynamically created training dataset far exceeded the number of assigned spectra, a situation that is unlikely to occur in practice (Fig. 4D).
Overall the tests described above demonstrate that the approach should be able to detect unassigned high quality spectra based on their similarity to successfully identified spectra in any typical experiment. In those cases where it is expected that only a small fraction of all high quality spectra would be successfully assigned in the initial database search (e.g. proteomic studies using organisms with unsequenced or partially sequenced genome), a better approach would be to use a classifier trained on a different, more suitable dataset generated using the same type of mass spectrometer.
Application of the Method to a Human Lipid Raft Dataset
The tests described in the previous section demonstrate that the method is able to recover high quality spectra left unassigned due to intentionally introduced flaws in the database search parameters or in the sequence database itself. However, for practical applications, it is more important to estimate how many high quality spectra are left unassigned in a typical analysis, to test whether the dynamic quality scoring method is able to recover them, and to investigate what types of peptides produced those spectra. Answering these questions is also important for designing more efficient computational strategies for peptide identification.
To address these questions, a dataset of ion trap MS/MS data, a subset of a large human lipid raft experiment (30), was chosen for more detailed analysis (see "Experimental Procedures"). The spectra were initially searched against the human IPI database (version 2.35) using SEQUEST with 3-Da precursor ion tolerance, allowing for tryptic and semitryptic peptides, and with variable Met + 16 modification. These search parameters are equivalent to those implemented in the PeptideAtlas high throughput data analysis pipeline (31) and are quite typical for data generated using ion trap mass instruments. In total, the initial database search resulted, after applying ProteinProphet (19), in 315 proteins having a ProteinProphet probability above 0.9 (the minimal number of proteins sufficient to explain all observed peptides; see Refs. 19 and 40) identified by 653 unique peptides9 from 4172 assigned spectra (see Table I). The list of peptide assignments and the corresponding list of protein accession numbers are provided in Supplemental Tables S1 and S2.
|
2.5% percent were assignments to semitryptic peptides, and 2% were assigned to peptides containing oxidized methionine (see Table I). The sequences of
65% of all peptides identified with high probability contained no missed trypsin cleavage sites, and only a small percentage of the peptides contained more than two missed cleavages. It is also worth mentioning here that no protein identifications would be lost if the search was limited to unmodified peptides only, and only four proteins would be lost (or statistical significance of the identification drop below the 0.9 ProteinProphet probability threshold) if the search was limited to unmodified tryptic peptides. The results of this initial database search were used to train the spectrum quality classifier, which was then applied to all multiply charged spectra. Singly charged MS/MS spectra were removed from further analysis due to their relatively small contribution. The distribution of quality scores for the entire dataset of multiply charged spectra as well as for the assigned and unassigned subsets is shown in Fig. 5. The plot indicates that close to 80% of all unassigned MS/MS spectra are of relatively low quality; these spectra might not be worth additional efforts to reanalyze them. However, a significant number of unassigned spectra are actually of high quality, which manifests itself in the presence of an extended right tail in the quality score distribution for the unassigned spectra. Fitting the quality score distribution observed for the assigned spectra to match the extended right tail of the quality score distribution for the unassigned spectra indicates that it should be possible to extract at least 25% more peptide identifications in addition to those obtained in the initial database search (see Fig. 5).
|
|
In addition, a significant number of peptides containing modifications were identified among which peptides with the N-terminal glutamine converted to pyroglutamic acid (116 spectra, 48 unique peptides) and N-terminal acetylation or carbamylation (68 spectra, 33 unique peptides) were the most common (in addition to the peptides containing oxidized methionine identified in the initial search). Other types of detected modifications included sodiation, tryptophan oxidation, lysine carbamylation, and histidine methylation (15 spectra, eight unique peptides in total). Fig. 7A shows one such example, the identification of a tryptic peptide, YPIEHGIVTNWDDMEK, from Actin, cytoplasmic 1 protein (Swiss-Prot accession number P02570) containing methylated histidine at position 5 in the peptide sequence (position 73 in the protein sequence) that is a known post-translational modification annotated in the Swiss-Prot database.
|
Of the 48 unique peptides (116 MS/MS spectra) identified containing N-terminal pyroglutamic acid, all except one were not seen previously in the unmodified form in this dataset. However, this modification is likely attributable to sample handling because all of the identified peptides containing this modification were tryptic and not located at the N terminus of the protein. It is known that pyroglutamic acid readily forms at the N terminus of peptides beginning with glutamine when peptides are kept in acidic solution for 15 min or more, which is often the case in shotgun proteomic analysis. Nevertheless identification of peptides containing this modification was informative: it resulted in 10 new protein identifications and four protein identifications with increased confidence (identified by unmodified peptides in the initial search but with ProteinProphet probability below 0.9). Eight of those proteins could not be identified with high confidence based on the results of the initial SEQUEST search even when the search was extended to include the entire human lipid raft dataset of Ref. 30. These proteins include several membrane proteins that are likely to be relevant to the analysis of lipid rafts, e.g. KIAA0143 protein (IPI0009217, Swiss-Prot accession number Q14156). The accession numbers of those proteins and the sequences of their corresponding peptides can be found in Supplemental Table S4 with one MS/MS spectrum for each of the identified peptides shown in Supplemental Table S5.
Of the 33 unique peptides (68 MS/MS spectra) identified containing N-terminal acetylation or carbamylation,10 none were seen in the unmodified form in the selected dataset (SCX flow-through sample, fractions 31 through 39 of the entire human lipid raft dataset), resulting in three new protein identifications and three identifications with increased confidence. However, additional analysis indicated that 21 of those peptides were also present in the unmodified form in other SCX fractions of the original dataset (flow-through sample, SCX fractions 43 and higher). All of the peptides that were identified in both modified and unmodified form (considering all SCX fractions) were tryptic, were located in the middle of the protein sequence, and were arising mostly from high abundance proteins. This strongly suggests that in those peptides the observed modification was a carbamylation, an artifact that occurred after protein digestion. N-terminal carbamylation is a common chemical modification known to occur when protein samples are digested in a solution containing urea, which was the case with the human lipid raft dataset used in this work (30).
In contrast, of the 12 peptides that were identified in the modified form only (considering all SCX fractions), 10 were located at the N terminus of their corresponding proteins starting with the residue immediately following the initiating methionine. In those 10 peptides the modification was likely an acetylation of the protein N terminus that occurred prior to digestion. The sequences of those peptides started with either Ala (four peptides), Ser (four peptides), or Thr (two peptides). As reviewed in Ref. 41, these residues are among the most frequently acetylated residues with the frequency of acetylation of Ala and Ser close to 100% and 80% in the case of Thr. For comparison, only three of the 21 peptides identified in both modified and unmodified form (considering all SCX fractions) started with Ala, Ser, or Thr. The identification of N-acetylated peptides is important because they confirm the translation initiation start sites and assist in the determination of the mature form of the protein present in the sample (40, 42, 43). It is also worth noting that three of the N-acetylated peptides identified here correspond to three low molecular weight proteins that could not be identified even when the initial SEQUEST search (i.e. search without looking for N-acetylated peptides) was extended to the entire human lipid raft dataset, including two less characterized proteins containing predicted transmembrane domains: RER1 protein (IPI00005728, Swiss-Prot accession number O15258) and HSPC009 protein (IPI00022277, Swiss-Prot/TrEMBL accession number Q9Y2R0) (see Supplemental Tables S4 and S5).
All database searches described above were performed against the human IPI database. This database is often used in the analysis of shotgun proteomic data because it represents a good tradeoff between the completeness of protein sequences and the degree of sequence redundancy (40). In particular, the sequence- and identifier-based construction of the IPI database significantly reduces the need for manual filtering of redundant sequences while maintaining cross-references to all its source data. However, minor sequence variants, e.g. polymorphisms, are not represented in the IPI database. Furthermore this database is likely to be not complete with respect to alternative splicing. Thus, it was interesting to investigate whether any of the high quality spectra unassigned in the initial search against the human IPI database could be assigned by searching more complete sequence databases.
To identify novel peptides or peptides containing sequence polymorphisms, the unassigned spectra were searched against a compressed human EST sequence database created and maintained by N. Edwards.11 This database represents the minimum size sequence database containing 30-amino acid-long sequences from the human EST database six-frame translation (44).12 To reduce the number of entries with sequencing errors, or simply redundant entries, the database was constrained to contain only those sequences that appear in an open reading frame of at least 50 unambiguous amino acids and that appear at least twice in the database. Importantly the compact nature of the database allowed an order of magnitude reduction in the database search time compared with the search time against the original EST database. More detailed information regarding the construction of this database can be found on the corresponding website.
The search against the EST database resulted in the identification of seven peptides from 12 MS/MS spectra whose sequences were not present in the then current version of the IPI database used in the initial search (version 2.35). The sequences of two of those peptides are now present in the latest version of that database (version 3.01), a positive development indicating continuing improvement of the protein sequence databases by their developers and annotators. In addition, the EST database search resulted in the identification of several sequence variants of known proteins. For example, a tryptic peptide, LSGLVFFPHLDR, from endonuclease G-like 1 protein (Swiss-Prot accession number Q9Y2C4) was identified from three MS/MS spectra, revealing a single nucleotide polymorphism (SNP) annotated in the dbSNP database (46) maintained by the National Center for Biotechnology Information (NCBI) (Val/Gly at the third position in the peptide sequence, refSNP identification number rs1141223). In another example, a tryptic peptide, LQTASDESYKDPTNIQLSK, was identified from two MS/MS spectra corresponding to an alternative transcript of the protein Spectrin
chain, brain (Swiss-Prot accession number Q13813, isoform 2), annotated in Swiss-Prot but not present as a separate sequence entry in the human IPI database.
Importantly the EST database search also resulted in the identification of two novel peptides (see Supplemental Tables S4 and S5). One of the peptides, tryptic peptide LQGSATAAEAQVGHQTAR, is particularly interesting and deserves special attention. This peptide is present in four EST sequences and was identified by SEQUEST (Xcorr score of 2.8) and additionally confirmed by Mascot (expectation value, 5 x 105) from one high quality MS/MS spectrum (see Fig. 7B). This intron-exon-spanning peptide identifies a novel splice variant of the Lck-interacting transmembrane adaptor 1 protein (LIME1). At the moment, the NCBI Entrez Protein sequence database contains only the standard form of this protein, accession number NP_060276, which was shown to be a raft-associated protein in several recent studies (47, 48). However, the identified peptide is present in several predicted alternative splice forms of LIME1 according to the AceView13 gene models (November 2004 version). All these predicted forms also contain a transmembrane domain and a putative raft-targeting motif, suggesting that all of them can be raft-associated. To gain further insight, the entire dataset (all SCX fractions from both the flow-through and ICAT experiments and without quality filtering) was researched against a database of all predicted LIME1 splice forms created based on the AceView gene models. This resulted in the identification of another MS/MS spectrum produced by the same peptide but in a different charge state, thus further increasing the confidence in this identification. However, no peptides from the standard form (or any other predicted form) were identified, suggesting that different forms of this protein can be expressed in different cell types. Although comprehensive follow-up analysis of these and other interesting findings was beyond the scope of this work, these results clearly illustrate that new biological insight can be gained by reanalysis of existing datasets. They also demonstrate the potential of mass spectrometry-derived data as a new source of information useful for validation and annotation of the genome (31, 40, 43, 4951).
| DISCUSSION |
|---|
|
|
|---|
The basic idea behind the dynamic quality assessment method is that having the information on what spectra got assigned a peptide with high or low confidence during the first pass through the data, the notion of what constitutes a bad or good quality spectrum can be learned from the analyzed data itself, i.e. without relying on a training dataset of spectra taken from a different experiment. The statistical classifier is thus automatically developed for each dataset anew, making the method more robust toward variations in the MS/MS data than the static classifiers. Importantly it does not require creating training datasets using manual expert ranking (29),2 which is time-consuming and complicates adoption of the method to different instrument types, or generation of MS/MS data using samples of purified proteins (27, 28) in which case the training dataset may not accurately represent the complexity of MS/MS spectra observed in a shotgun proteomic experiment using complex protein samples. It also incorporates a larger set of spectral features than any other recently described spectrum quality scoring methods (25, 2729, 53).2
In addition to the development and testing of the MS/MS quality assessment method, a significant effort in this work was devoted to the reanalysis of high quality spectra left unassigned in a typical shotgun proteomic experiment. The results of those analyses should be useful for designing more efficient peptide identification strategies, especially those concerned with the identification of modified peptides. Allowing for multiple variable modifications of amino acids when searching spectra against a protein sequence database significantly increases the number of candidate peptides found in the database that must be evaluated. Depending on the frequency of occurrence of the amino acid, the increase in computation time often becomes prohibitive. In addition, the potential for false positives also increases. The analysis performed here generally confirms the validity of the subset database search approach implemented in several existing database searching tools, including X! Tandem, Mascot, and SpectrumMill. Many of the peptides containing modifications identified in the human lipid raft dataset corresponded to a relatively small number of high abundance proteins identified with sufficiently high confidence by at least one unmodified peptide in the initial search. Similarly the dominant majority of the identified semitryptic peptides were fragments of tryptic peptides also identified in the dataset, indicating that they are the product of in-source fragmentation. This suggests that in some cases it should be sufficient to search for peptides containing certain types of modifications (e.g. oxidation or sodiation) and for semitryptic peptides using a subset database comprised of proteins already identified in the initial, fast (tryptic peptides only, no modifications) scan through the data, thereby reducing the computational time. The same subset database search approach should be sufficient for reanalysis of MS/MS spectra in an attempt to interpret the spectra arising from peptide ions with 4+ or higher charge states or spectra with inaccurately determined precursor ion mass.
At the same time, the subset database search approach would likely be less successful in identifying peptides containing certain types of modifications, e.g. peptides with N-terminal acetylation. The strategy described in this work enables efficient analysis without limiting the protein search space because more comprehensive reanalysis is carried out only on those spectra that are of high quality and remain unassigned after the initial search. Both strategies, spectrum quality filtering and subset database searching, can be combined for even better computational efficiency. Quality scoring can also be combined with spectrum clustering (54, 55) to achieve even higher reduction in the number of spectra that need to be analyzed in each experiment. However, the development of efficient iterative database search