|
Advertisement | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Molecular & Cellular Proteomics 8:53-69, 2009.
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ABSTRACT |
|---|
|
|
|---|
Similar to generating the covering set of tags (that in most applications limited to tags of length 3), one can try to generate the covering sets of full-length peptide reconstructions that with high probability contain the correct peptide (spectral dictionary). Spectral dictionaries take the peptide sequence tag approach one step further by generating peptide reconstructions and ensuring that one of them is correct. They also have the potential to improve the filtration efficiency of tag-based tools (2, 3); for example, the filtration efficiency of 1000 de novo reconstructions of length 10 is orders of magnitude higher than even a single tag of length 3. However, although spectral dictionaries have important advantages over spectral tags, generating them remains an open problem.
The spectral dictionaries could be searched efficiently against a protein database resulting in a hybrid approach to peptide identification (Fig. 1). Although the idea of spectral dictionaries is almost as old as the idea of peptide sequence tags (6), the software tool RAId based on this approach was described only recently (7). However, although RAId generated promising initial results, it was based on a heuristic exhaustive search and turned out to be rather slow (2–4 min per spectrum) thus limiting its applicability. Also RAId was benchmarked on a small sample thus making it difficult to evaluate its performance on large MS/MS data sets. Here we describe a fast approach to generating spectral dictionaries that takes
0.1 s per spectrum and benchmark it on a data set of over 20,000 peptides.
|
3 billion amino acids after removing repeats). Indeed most of the previous proteogenomics studies were limited to searches against the six-frame translations of bacterial genomes. The largest proteogenomics analysis conducted so far was the search against the six-frame translation of Arabidopsis thaliana that resulted in the discovery of nearly 800 new genes using InsPecT.1 However, even fast tag-based tools like InsPecT become impractical in searches of the 20 times larger six-frame translation of the human genome. Below we show that MS-Dictionary is able to search the six-frame translation of the human genome in roughly the same time as it takes to search the 100 times smaller database of all human proteins. Spectral dictionaries make the size of the database almost irrelevant because the spectral dictionary can be matched against the six-frame translation as efficiently as against a much smaller database of known proteins. Because many genes remain unidentified even in the well studied organisms (see Siepel et al. (16) and Stark et al. (17) for the recent discovery of over 1000 new protein-coding genes in human and fruit fly genomes), the searches in six-frame translation represent a valuable tool for proteogenomics annotations.2
De novo peptide sequencing represents a fast alternative to MS/MS database search. Although the best de novo algorithms are orders of magnitude faster than the fastest database search tools (even on moderately sized databases), they are less accurate. However, the superior accuracy of the database search tools becomes less pronounced with the increase in the database size. Moreover we show that for very large databases our de novo peptide sequencing algorithm compares favorably to MS/MS database search tools. Thus, searches in very large databases represent an important niche where de novo-based approaches are accurate and orders of magnitude faster than the traditional database search approaches. A number of de novo methods have been developed, including Lutefisk (6, 19), Sherenga (20), PepNovo (21), PEAKS (22), EigenMS (23), NovoHMM (24), AUDENS (25), MSNovo (26), and PILOT (27) (see also Refs. 28–30). Most de novo tools use the spectrum graph approach where a spectrum is represented as a graph with peaks as vertices that are connected by edges if their mass difference corresponds to the mass of an amino acid.
De novo peptide sequencing can also be viewed as a database search in the database of all possible peptides. Even if this time-consuming search were feasible, it would remain unclear which peptide in the database of all peptides represents the real peptide that generated the spectrum. We estimate that in 50–95% of the cases (depending on the peptide length), the existing database search tools (2, 31–35) will fail to identify the correct peptide in such an ultimate test because its score will be lower than the score of an incorrect peptide. We therefore argue that any de novo peptide sequencing algorithm should output multiple peptide reconstructions rather than a single reconstruction. Matching these peptides against a database results in a hybrid spectral dictionary approach that bypasses the time-consuming matching of spectra against the database.
Spectral dictionaries allow one to turn every MS/MS database search tool into a de novo peptide sequencing software (by simply running this tool on all peptides from the spectral dictionary and selecting the top scoring peptide). After such "conversion," one can estimate how well both database search tools and de novo tools would perform on very large databases. This experiment reveals a disappointing performance of both de novo and database search tools. Only 35–42% of peptides of length 10 (charge 2) are correctly reconstructed in such experiments (35, 38, and 42% for X!Tandem, PepNovo, and InsPecT, correspondingly). Our MS-Dictionary algorithm correctly reconstructs 50% of such peptides, a significant improvement over existing approaches.3 We further show that MS-Dictionary can search a six-frame translation of the entire human genome, the largest database ever searched for spectral interpretations.
The key problem in the spectral dictionary approach is deciding which and how many reconstructions must be generated. Generating too few peptides will lead to high false negative error rates, whereas generating too many peptides will lead to high false positive error rates. Some de novo algorithms output a single or a fixed number (decided before the search) of peptides. For example, RAId (7) generates 1000 de novo reconstructions and matches them against a database.4 We argue that for some spectra generating only one reconstruction is sufficient for finding the correct peptide, whereas in other cases (even with the same parent mass), a thousand reconstructions may be insufficient. We propose an approach for dynamically determining how many reconstructions must be generated for each spectrum and then actually generating them.5
Our MS-Dictionary software (available as open source) generates spectral dictionaries based on the recently introduced concept of the generating function of tandem mass spectra borrowed from statistical mechanics. The generating function approach efficiently analyzes the peptide reconstructions with the optimal and suboptimal scores and determines the statistical significance (spectral probability of those reconstructions (for more details, refer to Ref. 38).6
| EXPERIMENTAL PROCEDURES |
|---|
|
|
|---|
Let s = s1 ... sn be a Boolean string called a spectrum and
=
1 ...
n be a Boolean string called a peptide. The probability of peptide
generating spectrum s is defined as Prob(s|
) =
i = 1n Prob(si|
i) where Prob(x|y) is a 2 x 2 matrix (see Fig. 2).
|
, we are interested in solving the problem of finding max

Prob(s|
). Below we focus on the sets
that are relevant to tandem mass spectrometry. Let V = {0, 1, ... , n} and G(V, E) be a topological ordering of a directed acyclic graph (DAG) such that i < j for every directed edge (i, j) in E. Every path from 0 to n in G corresponds to a G-peptide
=
1 ...
n such that
i = 1 if vertex i belongs to the path (see Fig. 2). We are interested in the following Peptide Sequencing Problem (41): given a spectrum s and a DAG7 G, find a G-peptide
maximizing Prob(s|
) over all G-peptides.
In de novo peptide sequencing it is assumed that (i, j)
E if (j – i) equals the integer mass of an amino acid. Such graphs are referred to as amino acid graphs (38) (compare with spectrum graphs (20, 42)). As a first approximation, an MS/MS spectrum with parent mass n can be represented as a string of ones (peak present) and zeros (peak missing) with a 0/1 for every 1-Da interval. Similarly sequences of amino acid masses (peptides) can also be represented as strings of zeros and ones. An amino acid with an integer mass
is represented as a string of
– 1 zeros followed by a single one. Then a peptide is simply a concatenation of Boolean strings corresponding to its amino acids. In this context,
0.05 (probability of observing a noise peak) and
0.7 (probability of observing a b-ion) represent typical values of
and
for ion trap MS/MS spectra (Fig. 2). This somewhat simplistic Boolean model can be modified for any mass resolution, peptide fragmentation rules, and peak intensities (4, 28, 29) (see below). Moreover the more realistic model can be analyzed with exactly the same algorithm as the Boolean model (20).
The model above does not capture the fact that MS/MS spectra represent both prefix ions (b-ions series) and suffix ions (y-ions series). To reflect this we represent peptides as strings in three-letter alphabet: 1 (theoretical b-cut), –1 (theoretical y-cut), and 0 (no cut). Given a peptide
=
1 ...
n, we define its reverse as the peptide
* = –
n ... –
1, i.e.
i* = –
n – i+ 1. We now redefine the probability of peptide
generating spectrum s as Prob(s|
) =
i = 1n Prob(si|
i)·Prob(si|
i*), where Prob(x|y) is a 2 x 3 matrix.
From Boolean Spectra to MS/MS Spectra
Accounting for Peak Intensities—
Although the simple model described above led to an accurate peptide sequencing algorithm (20), it does not capture the intensities of fragment ion in MS/MS spectra. The experimental spectra represent real valued vectors s1 ... sn rather than Boolean vectors (si is the peak intensity at mass i). One can argue that the same model based on probabilities p(x, y) where x is a (real valued) peak intensity and y
{–1, 0, +1} would take into account the intensities of mass spectra. However, this model faces difficulties because (i) intensities vary between different spectra of the same peptide and (ii) the value of intensity seems to be less important than the distribution of intensities over different peaks (26). As a result, most peptide sequencing algorithms use heuristic approaches and do not try to come up with a rigorous model of spectra generation that accounts for intensities. We argue that peak ranks rather than peak intensities may lead to an adequate model of spectra generation. Peak ranks proved to be valuable in peptide identification; for example InsPecT (2) utilizes peak ranks in its scoring function. Below we show how to rigorously utilize peak ranks in de novo peptide sequencing and to solve the corresponding Peptide Sequencing Problem.
We now define a spectrum s = s1 ... sn as a string in the alphabet I (ranks of peaks) and a peptide
=
1 ...
n as a string in the alphabet F (types of neutral losses). The probability of peptide
generating spectrum s is defined as Prob(s|
) =
i = 1n Prob(si|
i)·
i = 1n Prob(si|
i*), where Prob(x|y) is an arbitrary |I| x |F| matrix representing the probability that a symbol y in the peptide generates a symbol x in the spectrum.
The spectrum strings s = s1 ... sn are generated from tandem mass spectra as follows. For simplicity, we retain top k peaks from every MS/MS spectrum (up to k = 150 in our implementation). Spectra are filtered to remove noisy peaks as follows: given a peak at mass M, we retain the peak if it is among the top five peaks within a window of size 100 Da around M. Let us say this procedure gives t peaks, which are ranked from 1 to t. If t > k, we keep only the top k peaks; if t < k, we reinsert the top k – t peaks that were filtered out and assign them ranks t + 1 to k. We define si as the rank of the peak at mass i (if there is a peak at mass i) and define si = 0 if there is no peak at mass i.
The peptide strings
=
1 ...
n are generated from amino acid sequences as follows. We define an alphabet of fragment ions as a set of integers corresponding to neutral losses, for example ion fragments b, b – H2O, and b – NH3 correspond to neutral losses {0, 18, 17}. Given a set of neutral losses {x1 ... xt}, we represent every amino acid of mass
as a string s1 ... s
of length
with
– t zeros and t non-zero symbols 1, 2, ..., t located at positions
– x1,
– x2, ...
– xt. The peptide string
=
1 ...
k is simply a concatenation of strings corresponding to amino acids from the peptide. To make the model more accurate, we further added the doubly charged b- and y-ions as additional types of ions generated by the peptide strings.
MS-Dictionary Scoring Function—
When applying the above model for peptide identification, we are interested in the ratio of probabilities that a spectrum is generated by a given peptide
versus probability that a spectrum is generated by a string consisting of all zeros (noise). This can be represented as Prob(s|
)/Prob(s|0) =
i=1n Prob(si|
i)/
i=1n Prob(si|0). We further express it as the sum of log odds ratios as follows.
![]() |
Using the training data set (described below), we learn the values of (Prob(si|
i))/(Prob(si|0)). The learning is done separately for the lower and the higher halves of the mass range (peaks corresponding to doubly charged ions only appear in the lower part of the spectrum). A smoothing function was applied on these values for lower intensity peaks (ranks 11–150); for each ion type, the value at any rank was set to the average value in a window of five ranks around the given rank. The distribution of these values for each peak rank is shown in supplemental Table S1 for three different spectrum lengths for the low and the high mass region. These statistics vary with the length; however, the differences between similar lengths (like 7 and 8) are typically small as compared with differences between very different lengths (like 7 and 20). Thus, specific length-dependent scoring can be applied using the approximate length inferred from the parent mass of the spectrum.
The MS-Dictionary scoring function described here was compared with the scoring functions of the popular database search tools SEQUEST (33), X!Tandem (31), and InsPecT (2). 50,000 spectra were chosen randomly from the Shewanella data set and searched with Sequest, X!Tandem, and InsPecT. The score of the best peptide for each spectrum from the database search was compared with the MS-Dictionary score for the same spectrum-peptide pair. We found good correlation between the MS-Dictionary scoring function and the scoring functions used in the database search tools; the correlation coefficients are 0.87 for SEQUEST, 0.90 for X!Tandem, and 0.96 for InsPecT (Fig. 3). These correlations are even better than the correlation between the database search tools themselves (for example, InsPecT and X!Tandem raw scores have a correlation coefficient of only 0.75).
|
The dynamic programming table is constructed for all mass values between 0 and parent mass + 0.5 with a resolution of 0.1 Da. The number of reconstructions is computed by summing up the results for all mass values in a window of 1 Da around the exact parent mass to account for the low accuracy of ion trap mass spectrometers. In case of precision mass spectrometry (e.g. FTMS), accurate solutions (with low parent mass error) can be obtained by increasing the resolution and reducing the size of the window around the parent mass. For efficient computation, Ile and Leu are treated as the same amino acid, resulting in a 19-letter amino acid alphabet at the time of generating reconstructions. In the low accuracy setting, Gln and Lys are also treated as the same amino acid.
Symmetric Versus Antisymmetric de Novo Reconstructions—
Some de novo reconstructions may be symmetric, i.e. the same peak in the spectrum may contribute to the score up to four times as a singly charged or doubly charged b-ion or y-ion. The algorithm to alleviate this problem was proposed by Chen et al. (28) and further improved by others (22, 29). Later Lu and Chen (43) designed an algorithm for generating all antisymmetric peptide reconstructions. We have chosen not to use the antisymmetric path approach in MS-Dictionary because (i) it leads to a significant time overhead when many reconstructions are generated and (ii) it does not take into account doubly charged ion fragments that often have high intensities and thus contribute significantly to MS-Dictionary scores. To accurately score the symmetric reconstruction, MS-Dictionary rescores the obtained peptide reconstructions to exclude multiple contributions from the same peak. Starting with the highest scoring reconstructions, we check the peptide sequence to determine whether there are any peaks that have multiple contributions to the score. These peptides are rescored by using only the largest contributions from such peaks.
Template-free Spectral Recalibration—
Recalibration of tandem mass spectra is important for correcting systematic mass errors. All existing spectral recalibration tools use templates (interpreted spectra with known b/y-peaks) to perform linear recalibration using either least squares fit (19, 22, 44) or least median of squares fit (23). In the de novo peptide sequencing framework the reliable templates are hard to obtain thus reducing the utility of spectral recalibration to Q-TOF and LTQ-FT data. In the low mass accuracy setting, the applications of template-based spectral recalibration are mainly limited to validating candidate peptide identifications. As a result, de novo peptide sequencing programs commonly default to a rather high fragment mass tolerance (e.g. 0.5 Da for ion trap data) and thus result in many erroneous spectral interpretations. We describe a template-free spectral recalibration procedure for ion trap mass spectra and demonstrate that it reduces the required mass tolerance from 0.5 to 0.2 Da. We further show that this recalibration leads to significant improvement in MS-Dictionary accuracy.
The fractional masses of amino acids may be as large as 0.1 for arginine (mass, 156.1 Da). The first step of our MS-Recalibration tool is rescaling all peaks in the spectrum by multiplying all masses by 0.9995 to minimize the theoretical fractional masses of amino acids. After rescaling the fractional mass of arginine is 0.02 (156.02 Da), and the fractional masses of all other amino acids are below 0.04 (the average fractional mass is reduced 3-fold from 0.06 to 0.02).
MS-Calibration further filters the rescaled spectra to retain the high intensity peaks using a sliding window as described above. Using Int(m) and Frac(m) to denote the integer and fractional part of mass m (respectively), our goal is to find
and β minimizing the sum
(Frac(
·m + β))2 over all masses m in the rescaled filtered spectrum (Fig. 4a). The coefficients
and β are computed with the least squares fit algorithm and are used to recalibrate all peaks in the rescaled spectrum. Although MS-Recalibration has no information about the peptide that produced the spectrum, Fig. 4b illustrates that it achieves almost the same accuracy as the template-based approaches that recalibrate the spectra based on the information about the correct positions of b/y-ions. After applying MS-Recalibration, one can safely set the mass tolerance to 0.2 Da (and retain 96% of b/y-peaks) as compared with the 0.5 Da in existing approaches. Another advantage of our method is that it makes the mass error distributions centered around zero regardless of their positions in the spectrum. This feature is important for designing a new scoring function that carefully account for errors in peak positions (see below).
|
MSNovo uses a unified peak error model (Gaussian distribution) and peak rank model (exponential distribution) independent of the ion type, rank, and position of each peak. However, Fig. 5a illustrates that different fragment ions have different error models. Fig. 5b reveals that peak ranks and mass errors (that are assumed to be independent in MSNovo) are strongly correlated. Also Fig. 5b reveals subtle irregularity in noise peaks indicating that the noise model in Mo et al. (26) needs to be adjusted. MS-Dictionary takes these observations into account and incorporates the mass errors into its scoring function using a more adequate error model than Mo et al. (26). Below we briefly describe the error-dependent scoring for Boolean spectra (this model can be extended to MS/MS spectra as described above).
|
i generates the spectrum symbol si at exactly the same position. We now extend this model by assuming that the peptide symbol
i can generate spectrum symbol si +
where
represents a mass measurement error. We assume that errors are "small," i.e. they do not exceed a threshold
max (
max is typically 0.5 for ion trap spectra). Incorporating errors into the spectrum generation model requires introducing the three-dimensional matrix Prob(x,
|y) where –
max
+
max and x and y are Boolean as before. The probability of peptide
generating a spectrum s with error
=
1, ... ,
n can now be defined as Prob(s,
|
). The Peptide Sequencing Problem can now be reformulated as the Peptide Sequencing Problem with Errors: given a spectrum s and a DAG G, find a G-peptide
and mass errors
maximizing Prob(s,
|
) =
i=1n Prob(si+
i,
i|
i) over all G-peptides and over all mass errors
.
The matrix Prob(x,
|y) was learned from the training sample, and the learned parameters were further used in the dynamic programming algorithm as described before. Table I compares the performance of MS-Dictionary with PepNovo version 1.03 and illustrates that MS-Dictionary outperforms PepNovo for all peptide lengths.
|
| RESULTS |
|---|
|
|
|---|
Generating Multiple de Novo Reconstructions—
A spectrum may have many reconstructions with the optimal score, and in these cases, reporting only one reconstruction is clearly deficient. For example, Fig. 6 shows a spectrum for which two distinct peptides, LHEALPDPEK and HLEALGAFYK, receive the optimal de novo score of 90.
|
60% of length 10 spectra, the correct peptide has a suboptimal PepNovo score (
50% for MS-Dictionary score), and this fraction quickly increases with the peptide length (Fig. 8). Because the existing de novo approaches fail to identify the correct peptide as the optimal reconstruction in a large fraction of the spectra, a de novo method should consider multiple reconstructions with suboptimal scores.
|
|
|
|
As the spectrum length increases, the size of the peptide search space increases dramatically, making it harder to generate the spectral dictionary. Thus all de novo search methods yield lower accuracy for longer peptides. The generating function approach allows one to dynamically determine the number of peptide reconstructions and increase the chance of finding the correct peptide in the set of de novo reconstructions (see Fig. 9).
The number of reconstructions obtained for these same length spectra varies over orders of magnitude. Although the peak of the distribution of the number of reconstruction is at log2(size of spectral dictionary)
10 (comparable to the number of reconstructions generated in Alves and Yu (7)), some of these spectra have fewer than 100 or more than 10,000 reconstructions. This remarkable variance in the size of spectral dictionaries illustrates the point that different spectra have a different number of plausible reconstructions and raises a concern about de novo methods that return a fixed number of peptides.
Recently Frank et al. (46) described de novo peptide sequencing for data acquired from FT-ICR instruments when both the parent mass and the peak positions are accurate. However, acquiring such spectra remains time-consuming, and an intermediate approach that is gaining prominence is to acquire mass spectra with high precision at MS1 stage and lower precision at MS/MS stage, giving accurate parent mass but inaccurate peak positions. However, the existing de novo search methods are aimed toward ion traps or other low accuracy mass spectrometers, which may have parent mass errors on the order of 1 dalton. Because vertices in the spectrum graph are constructed based on low accuracy peaks, it is not clear how to exploit the accurate parent mass information that is available from new high accuracy instruments. Availability of accurate parent mass values can be effectively utilized in MS-Dictionary to filter the reconstructions. The number of reconstructions for 5-ppm accuracy is typically 4–16 times smaller than the corresponding numbers for 0.5-dalton accuracy (data are not shown).
Using MS-Dictionary for Database Search—
In any database search, a large number of spectra remain unidentified. This may happen due to several reasons: these spectra may have many missing or noisy peaks making them difficult to interpret, the corresponding peptide may not be present in the database, or the peptide may have a post-translational modification not captured by the search algorithm. In the case of S. oneidensis MR-1, only
10% of the 14.5 million spectra were reliably identified (14). We show that MS-Dictionary is able to find identifications for some previously unidentified spectra.
We selected all (
600 thousands) spectra of charge 2 from the Shewanella data set within the parent mass range from 1100 to 1200 Da (the typical mass range for length 10 peptides). All these spectra were searched against the Shewanella proteome with MS-Dictionary (generated with spectral probability 1e–9), InsPecT, and X!Tandem. The same analysis was repeated with a decoy database of the same size. A spectrum is considered identified if any of the reconstructions are present in the Shewanella database (target database). Fig. 10 demonstrates that InsPecT and MS-Dictionary significantly improve on X!Tandem (at 5% FDR, X!Tandem, InsPecT, and MS-Dictionary identified 3272, 4184, and 4137 peptides, respectively). We further rescored InsPecT identifications using MS-GF spectral probabilities achieving an even better performance for the hybrid InsPecT
MS-GF hybrid tool (4299 peptide identifications at 5% FDR). Fig. 11 shows the Venn diagrams of peptides identified by X!Tandem, InsPecT, MS-Dictionary, and InsPecT
MS-GF. To further illustrate the applicability of MS-Dictionary in proteogenomics applications we extended the analysis of Shewanella proteome described above (Fig. 10) to the 7 times larger six-frame translation of Shewanella. We selected all spectra from the Shewanella data set that were not identified in the InsPecT database search with the ParentMass range from 1100 to 1200 Da and with MS-GF scores above 50 (24,814 spectra).10 MS-Dictionary generated spectral dictionaries for these spectra at three different values of SpectralProbability. The same analysis was repeated with a decoy database of the same size. A spectrum is considered identified if any of the reconstructions are present in the six-frame translation of the Shewanella genome (target database). Table III shows the number of new peptides identified by MS-Dictionary in each database that were not found in the earlier database search. For SpectralProbability = 10–10, 1007 new peptides are identified from 6211 spectra in the target database, whereas only six peptides (from six spectra) are identified in the decoy database, corresponding to a peptide-level false discovery rate of 0.6%. As the SpectralProbability is lowered, the false discovery rate turns into zero at 2 x 10–11) with 794 peptide identifications. 280 of them were identified previously by InsPecT (from other higher quality spectra), but 514 represent new peptide identifications. Interestingly 512 (99.6%) of them map to the known protein sequences (including contaminants), providing further confirmation that these identifications are correct. Indeed because the size of the Shewanella protein database is only
15% of the size of six-frame Shewanella translation, one expects that only 15% of these proteins would hit the Shewanella database by chance. Moreover of 512 peptides, 508 are matched to expressed proteins (confirmed by at least two InsPecT identifications in Gupta et al. (14)), and two are matched to proteins with a single identified peptide, confirming the expression of these proteins. Supplemental Table S2 lists each of the new peptide identifications.
|
|
|
We note that peptide identifications reported here are based on the spectra in the 1100–1200-Da parent mass range only, and their number is expected to be much larger if spectra of other masses are also included. Supplemental Table S3 shows that spectra in lower or higher mass ranges also show similar trends as spectra in the 1100–1200-Da range. MS-Dictionary thus has the potential to provide a significant number of new peptide identifications from spectra that were missed in the traditional database searches.
Searching the Six-frame Translation of the Human Genome with MS-Dictionary—
Although mass spectrometry has been successfully used for bacterial gene predictions (8–11, 14, 15, 50), the proteogenomics studies of large eukaryotic genomes are still in infancy. Even the fastest MS/MS database search tools become impractical in such studies because they require searches in huge databases resulting from the six-frame translations of eukaryotic genomes (
2.5 billion amino acids for repeat-masked human genome). Tanner et al. (13) and Edwards (51) made a step toward proteogenomics searches of the human genome by combining the EST and MS/MS analysis. Although this approach is very valuable it can only be successful if the same exons are supported by both EST and MS/MS data. The largest proteogenomics analysis conducted so far is the search of the six-frame translation of A. thaliana that resulted in the discovery of nearly 800 new genes using InsPecT.2 Although InsPecT is 10 times faster than X!Tandem and 60 times faster than SEQUEST (see Payne et al. (52)), it becomes too slow in searches of the translated mammalian genomes. Because neither InsPecT nor X!Tandem can search the translated human genome,11 we ran InsPecT on a 124 times smaller database and assumed that its running time is proportional to the database size. The running time of InsPecT is estimated at 42 s per spectrum,12 whereas MS-Dictionary takes less than 1 s per spectrum on average on a desktop machine with a 2.16-GHz Intel processor. Below we demonstrate that MS-Dictionary can search the translated human genome and identify over 10,000 human peptides with low FPR. Recently Tanner et al. (13) demonstrated that such peptides can significantly improve the accuracy of traditional de novo gene prediction tools and boosted the accuracy of GeneID predictions by 0.65 correct exons per gene on average.
MS-Dictionary generates the spectral dictionary for each spectrum and uses fast pattern matching to match the spectral dictionary against the indexed database.13 We used a simple partitioning/indexing that divides the translated human genome into 124 equally sized subgenomes. Generating a spectral dictionary with 10,000 reconstructions takes 0.1 s per spectrum, and pattern matching of a spectral dictionary against all 124 databases (including file input and output overhead) takes 0.8 s per spectral dictionary on average. This results in less than 1-s running time, a 40-fold speedup over InsPecT.14
To benchmark MS-Dictionary we used the human HEK293 MS/MS data set generated in Steve Briggs laboratory. We focus on 48,926 doubly charged peptides with tryptic C terminus identified by InsPecT15 (InsPecT version 20070613, human IPI database version 3.18) with 2.5% false discovery rate (for a detailed description see Refs. 13 and 53). We removed 17,821 peptides that span the exon boundaries (these peptides cannot be identified by searching the translated human genome) resulting in 31,105 peptides. Because most peptides in HEK293 are represented by multiple spectra, we randomly selected one spectrum of all spectra of the same peptide. We further searched 31,105 spectra against the translated human genome (version 48 from Ensembl) with masked repeats and with corrected parent mass as described before. For each spectrum, we generated a spectral dictionary with SpectralProbability = 10–11 and limited the maximum size of spectral dictionaries to 10,000. Each peptide in the spectral dictionary was matched (without errors) against the translated human genome.
The searches in the translated human genome are not expected to identify all spectra reliably identified in the human protein database. Indeed Castellana et al.2"lost"
30% of all identifications of peptides falling within exons after switching from the protein database to the translated genome database of A. thaliana. Such losses are unavoidable because many reliable identifications in the protein database turn into statistically insignificant identifications in the much larger translated genome. For example, although SpectralProbability = 10–10 makes sense for searching the human protein database, it results in very high error rates (FPR = 25%) in a
100 times larger translated human genome. Therefore, all peptide identifications with SpectralProbability
10–10 will be lost after switching from the protein database to the translated human genome.16 We have therefore chosen SpectralProbability = 10–11 as a threshold resulting in estimated FPR = DatabaseSize·SpectralProbability = 2.5·109·10–11 = 0.025. Because 9470 of 31,105 peptides (30%) have SpectralProbability exceeding 10–11, they cannot be identified in any sensible database search against the translated human genome. It leaves us with 21,635 peptides that can be potentially identified in the translated human genome.
MS-Dictionary identified 10,266 of 21,635 spectra in the translated human genome. 98.9% of the identified peptides fall into the human proteins, and only 1.1% fall into non-coding regions.17 To further estimate FPR of our experiment, we selected a single run (25,746 spectra), picked out unidentified doubly charged spectra in this run (16,205 spectra), and used MS-Dictionary to generate spectral dictionaries and match them against the translated human genome. MS-Dictionary identified only 71 spectra in this experiment, corresponding to an FPR of 0.44%.
Therefore, MS-Dictionary reliably identifies
10,000 peptides from human proteins without knowing the human proteome. However, it also "loses"
11,000 peptides that can be potentially identified in searches of the translated human genome. Fig. 12 illustrates that although MS-Dictionary identifies a large fraction of peptides of length 10–13 the performance deteriorates for shorter and longer peptides. Because the SpectralProbability threshold has to be low in proteogenomics applications, only very high quality spectra of shorter peptides represent reliable identifications (only 23% of spectra of length 9). This does not indicate the poor performance of MS-Dictionary but rather reflects the stringent threshold. For the spectra of length more than 14 aa, the performance of MS-Dictionary deteriorates because of the limited size of spectral dictionaries. Further algorithmic developments (e.g. generating dictionaries of long tags) are needed to address this shortcoming of MS-Dictionary.
|
| DISCUSSION |
|---|
|
|
|---|
Future work will focus on developing this hybrid approach into a viable tool for peptide identification by extending it to highly charged spectra and improving the efficiency of this approach in the case of longer peptides. Deteriorated performance for highly charged and long peptides is an important limitation of all de novo approaches to spectral interpretations. The existing de novo peptide sequencing tools are aimed at charge 2 peptides with the single exception of the greedy best strong tag algorithm (30) that is best suited for tag generation rather than full-length de novo peptide sequencing, which is the focus of this study. All tools we tested also deteriorated while searching longer peptides in very large databases. For example, InsPecT and X!Tandem would correctly identify only 16 and 11% of all length 14 peptides in the de novo peptide sequencing framework (Table II). Although MS-Dictionary improves on these tools, its accuracy is also rather low (18%). This observation reveals the shortcomings of existing de novo and database search tools that often score the incorrect peptides higher than the correct peptides. Frank et al. (46) recently discussed the "homeometric peptides" that represent the key obstacle for developing better de novo algorithms (they become more pronounced with the increase in the peptide length). This problem is partially alleviated by generating all reconstructions with a given SpectralProbability and further matching them against a database (Fig. 9).
| ACKNOWLEDGMENTS |
|---|
| FOOTNOTES |
|---|
Published, MCP Papers in Press, August 14, 2008, DOI 10.1074/mcp.M800103-MCP200
1 N. Castellana, S. Payne, Z. Shen, M. Stanke, S. Briggs, and V. Bafna, submitted manuscript. ![]()
2 Spectral dictionaries are also helpful in searches for fusion peptides that are common in tumor proteomes but not explicitly present in protein databases (18). ![]()
3 Although MS-Dictionary compares well with X!Tandem and InsPecT for charge 2 spectra, the performance of all existing de novo tools (including MS-Dictionary and PepNovo) deteriorates for highly charged peptides (3+). The problem of de novo analysis of highly charged spectra has been addressed recently by Cao and Nesvizhskii (36). ![]()
4 Although it may appear that matching 1000 peptides against the database is rather time-consuming, the combinatorial pattern-matching algorithms (37) are able to do it in negligible time. ![]()
5 The problem of generating varying numbers of reconstructions for each spectrum becomes particularly important for long peptides. For instance, PepNovo (4) accurately reconstructs 54% of peptides of length 7 and only 0.4% of peptides of length 20. ![]()
6 Although the accuracy of MS-Dictionary in the standard de novo peptide sequencing improves on the state-of-the-art tool PepNovo (21), optimizing de novo peptide sequencing is an important but not the crucial goal for our main application. As Alves and Yu (7) pointed out, de novo peptide sequencing and spectral dictionary approaches have similar but distinct goals: an outstanding de novo algorithm is not a prerequisite for the spectral dictionary approach to perform well. ![]()
7 The abbreviations used are: DAG, directed acyclic graph; FDR, false discovery rate; EST, expressed sequence tag; aa, amino acids; FPR, false positive rate; Prob, probability. ![]()
8 The spectra were acquired on an ion trap MS (LCQ, ThermoFinnigan, San Jose, CA) using ESI. The program extract_msn (ThermoFinnigan) was used to generate the dta files with standard options. ![]()
9 Although this analysis loses peptides at the C terminus of proteins, it will have a minor effect on the reported statistics. ![]()
10 Although most spectra with MS-GF scores above 50 correspond to high quality peptide identifications by both InsPecT and X!Tandem, a significant portion of them may have borderline InsPecT/X!Tandem scores. As discussed previously (38), such low scores may reflect deficiencies of the underlined scoring approaches. ![]()
11 Both tools report unexpected errors on the translated human genome. ![]()
12 This is a lower bound that does not account for overhead caused by indexing/partitioning of large databases. ![]()
13 Indexing the entire six-frame translation of the human genome takes less than an hour. ![]()
14 We estimate that optimized indexing/partitioning or running MS-Dictionary on a large shared memory machine would further reduce the running time. ![]()
15 Although MS-Dictionary generates both tryptic and non-tryptic peptides, we selected doubly charged peptides with tryptic C terminus to simplify benchmarking. ![]()
16 In particular, all peptides of length 8 and shorter are likely to be lost because SpectralProbability even of a single peptide of length 8 is rather high (
0.4·10–10). ![]()
17 Although most spectral dictionaries have zero or one hit in the human genome, 1.8% of them have multiple hits (typically two hits). ![]()
* This work was supported, in whole or in part, by National Institutes of Health Grant 5R01RR016522-05. This work was also supported a by Howard Hughes Medical Institute professor award. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact. ![]()
S The on-line version of this article (available at http://www.mcponline.org) contains supplemental material. ![]()
¶ To whom correspondence should be addressed. E-mail: ppevzner{at}ucsd.edu
| REFERENCES |
|---|
|
|
|---|
k, V., Addona, T., Clauser, K., Vath, J., and Pevzner, P. (1999
) De novo protein sequencing via tandem mass-spectrometry.
J. Comp. Biol. 6, 327
–341[CrossRef]This article has been cited by other articles:
![]() |
S. Kim, N. Bandeira, and P. A. Pevzner Spectral Profiles, a Novel Representation of Tandem Mass Spectra and Their Applications for de Novo Peptide Sequencing and Identification Mol. Cell. Proteomics, June 1, 2009; 8(6): 1391 - 1400. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
| All ASBMB Journals | Journal of Biological Chemistry |
| Journal of Lipid Research | ASBMB Today |