Spectral Profiles, a Novel Representation of Tandem Mass Spectra and Their Applications for de Novo Peptide Sequencing and Identification*

Despite many efforts in the last decade, the progress in de novo peptide sequencing has been slow with only 30–45% of all peptides correctly reconstructed. We argue that accurate full-length peptide sequencing may be an unattainable goal for some spectra and demonstrate how to accurately sequence gapped peptides instead. We further argue that gapped peptides are nearly as useful as full-length peptides for error-tolerant database searches. Gapped peptides occupy a niche between long but inaccurate full-length reconstructions and short but accurate peptide sequence tags. Our MS-Profile tool uses spectral profiles, a new representation of tandem mass spectra, to generate gapped peptides that are longer and more accurate than peptide sequence tags of length 3 traditionally used to speed up database searches in proteomics. In addition, spectral profiles also enable intuitive visualization of all high scoring de novo reconstructions of tandem mass spectra.


Sangtae Kim, Nuno Bandeira, and Pavel A. Pevzner

Molecular & Cellular Proteomics 8: 1391-1400, 2009.
Recent advances in de novo peptide sequencing have enabled tag-based peptide identification tools (e.g. InsPecT (1) and Paragon (2)) that are orders of magnitude faster than traditional MS/MS database search approaches (e.g. Sequest (3) and Mascot (4)). However, reliable full-length de novo peptide sequencing remains an elusive goal, and even the most accurate de novo tools correctly reconstruct only 30–45% of peptides (5). We argue that accurate full-length de novo peptide sequencing may be an unattainable goal for many spectra because they do not provide enough information to disambiguate between correct and incorrect reconstructions. Spectra often have variable local quality (along the peptide length), making some regions not amenable to de novo sequencing. For example, spectra of peptides DGEAAENTDAQK and DSVAAENTDAQK are very similar (supplemental Fig. S1), making it nearly impossible to reliably reconstruct these peptides de novo (the combined mass of Gly and Glu is close to the combined mass of Ser and Val). In such cases, it makes more sense to reconstruct a gapped peptide D[186]AAENTDAQK rather than a contiguous peptide (186 Da refers to the rounded combined mass of Gly and Glu or Ser and Val). Although gapped peptides are less informative than full-length peptides, we argue that there is little practical difference between these two representations. Indeed in most applications, de novo peptide sequencing is not the final goal in analyzing a spectrum but rather a prelude to error-tolerant database searches and other applications like metaproteomics (6–9). We argue that long gapped peptides are nearly as good for such applications as full-length de novo reconstructions. For example, the gapped peptide D[186]AAENTDAQK has 9 contiguous amino acids and thus, for all practical applications, is at least as useful as any peptide of length 9 (or length 11 if one counts Asp and [186] as separate "letters"). Because most mass spectrometrists view peptides of length 9 as being as useful as peptides of length 12, generating sufficiently long gapped peptides is nearly as useful as generating full-length reconstructions (the full length of D[186]AAENTDAQK is 12).
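The Gly+Glu versus Ser+Val ambiguity can be checked with a few lines of arithmetic. The sketch below is only an illustration: the residue masses are standard monoisotopic values rounded to five decimals, and the helper name `residues_mass` is ours rather than part of any tool discussed here.

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
RESIDUE_MASS = {"G": 57.02146, "E": 129.04259, "S": 87.03203, "V": 99.06841}


def residues_mass(residues: str) -> float:
    """Sum the residue masses of a short amino acid string (hypothetical helper)."""
    return sum(RESIDUE_MASS[r] for r in residues)


ge = residues_mass("GE")  # Gly + Glu
sv = residues_mass("SV")  # Ser + Val

# Both pairs round to 186 Da and differ by only a few hundredths of a dalton.
print(f"GE = {ge:.5f} Da, SV = {sv:.5f} Da, difference = {abs(ge - sv):.5f} Da")
```

The two sums differ by less than 0.04 Da, far below the 0.5-Da fragment mass tolerance used for the tools compared later, which is why a single [186] gap is the more honest report.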
In this study we introduce the notion of a spectral profile (Fig. 1) that enables accurate de novo sequencing of gapped peptides and reveals the variable spectral quality along the peptide length. For example, for peptides of length 11-12, our MS-Profile tool correctly reconstructs 65% of gapped peptides as compared with 46, 28, and 26% correct reconstructions of full or truncated full-length peptides by PepNovo+ (5, 10), MS-Dictionary (11), and PEAKS (12), respectively. Gapped peptides occupy a niche between peptide sequence tags (that in most applications are limited to tags of length 3) and full-length reconstructions: they are nearly as accurate as short tags and, at the same time, typically have a unique match in the protein database. For example, for peptides of length 12, the average length of gapped reconstructions is 8.9, typically resulting in a single hit even when searching against the largest databases used in proteomics today.
A spectral profile is a novel representation of tandem mass spectra with "intensities" of all masses varying from 0 to 1. Every peptide of length n defines n prefix masses representing masses of the first i amino acids (for 1 ≤ i ≤ n). The spectral profile at mass x is the proportion of peptides with prefix mass x among all high scoring interpretations of the spectrum. Thus, the spectral profile compactly represents information about all high scoring de novo reconstructions (spectral dictionary) even if there are billions of such reconstructions (see Ref. 11). Spectral profiles are conceptually similar to the motif profiles (13) (supplemental Fig. S2) that are used in various areas of bioinformatics (e.g. in regulatory genomics). Whereas motif profiles in regulatory genomics compactly represent all known binding sites of a transcription factor, a spectral profile compactly represents all high scoring de novo reconstructions of an MS/MS spectrum.
However, whereas motif profiles represent the center of gravity of known motifs, spectral profiles represent the center of gravity of unknown high scoring de novo reconstructions (spectral dictionaries). This makes computing spectral profiles challenging because in many cases spectral dictionaries cannot be explicitly generated (11). This study extends Kim et al. (11) by showing how to compute the spectral profile of any spectrum without explicitly generating its spectral dictionary. We further show how to use spectral profiles for generating reliable gapped peptides.
The difficult challenge in de novo spectral interpretations is how to figure out which ion type every peak represents (e.g. how to distinguish b-series peaks from y-series peaks) and how to analyze the widely varying intensities in a single probabilistic framework. The spectral profile collapses all possible ion type interpretations and varying intensities into a single ion type (b-ion) with rigorously defined probability. In contrast to real MS/MS spectra (that contain peaks corresponding to b- and y-ions, various neutral losses, etc.), spectral profiles only represent (putative) b-ions. In some sense, spectral profiles represent a trade-off between (hard-to-interpret but compact) real spectra and (easy-to-interpret but huge) spectral dictionaries. We emphasize that spectral profiles are different from "scored spectra" (e.g. sequence spectra (14, 15) or prefix residue mass spectra (1)) that are commonly used for de novo sequencing and MS/MS database searches. Although profile probabilities are global (i.e. they take into account complex dependences between all peaks in the spectrum), scored spectra take into account only a few local satellite peaks explaining a given mass.

FIG. 1. An example of the spectral profile.
Top, an MS/MS spectrum of the peptide STVAGESGSADTVR with b- and y-peaks painted green and blue, respectively. Middle, the spectral profile of the spectrum above. The overall height of each peak represents the probability of the peak being a correct prefix mass. Each peak is represented as a multicolored bar where various colors (subpeaks stacked on top of each other) correspond to various amino acids (amino acids are color-coded). Similarly to the motif profile in supplemental Fig. S2, the height of each colored subpeak (corresponding to an amino acid X) represents the probability of a prefix with terminal amino acid X ending at the given mass position. Bottom, the database match (DBMatch), full-length de novo reconstruction (DeNovo), and gapped peptide (Gapped) of the spectrum in the top panel. The painted rectangles represent the tags of length 1 ending at each position of the de novo reconstruction: the width of each rectangle corresponds to the mass of the amino acid, and the height corresponds to the probability of the length 1 tag being correct. Although the full-length de novo reconstruction is incorrect, the gapped peptide reconstruction (generated using the spectral profile) is correct. The consecutive amino acids Ser and Leu are represented as a 200-Da gap because the value of the spectral profile at the position separating Ser and Leu is low.

Similar to the diverse applications of motif profiles, spectral profiles have a multitude of applications that we describe below. Fig. 2 illustrates recently implemented alternative approaches to peptide identification: peptide sequence tag approaches (1, 2, 16, 17) and full-length de novo reconstruction approaches (7, 8, 11, 18) (see also the lookup peak approach (19)). Although these approaches significantly speed up conventional peptide identification tools, each of them presents certain challenges, leading to deteriorating performance on long (15+ aa) peptides. Most of these approaches do not automatically adjust to varying spectral qualities or different peptide lengths. For example, InsPecT generates the same number of tags for every spectrum, although a more sensible approach would be to generate a larger number of tags for long peptides (tag generation deteriorates for longer peptides) or for low quality spectra. Although MS-Dictionary (11) generates an adaptive but large number of full-length reconstructions (for both high and low quality spectra), dictionaries of spectra of long peptides may become so large that their generation becomes impractical. To overcome this problem, we show how to quickly construct spectral profiles of even huge dictionaries without explicitly generating them.
MS-Profile currently works in two modes (Fig. 3). In the first mode, the input is an MS/MS spectrum and a spectral probability threshold (described below), and the output is a spectral profile. In the second mode, the input is the constructed spectral profile, a de novo reconstruction, and a MinProbability threshold (described below), and the output is a gapped peptide. MS-Profile in the second mode represents a new de novo peptide sequencing tool that improves accuracy of de novo reconstructions produced by other tools (e.g. PepNovo+, PEAKS, or MS-Dictionary). In particular, it generates gapped peptides that can be used for mutation-tolerant database searches and to speed up existing database search tools. MS-Profile is available both as open source software and as a Web server.

FIG. 2. Various filtering approaches to peptide identifications.
The tag-based approach (e.g. InsPecT (1) and Paragon (2)) extracts short (usually length 3) peptide sequence tags and filters databases by considering only peptides that match tags. Full-length de novo approaches either reconstruct a single full-length peptide and find sequence matches (e.g. MS-BLAST (6), OpenSea (7), and SPIDER (8)) or generate multiple full-length reconstructions and find sequence matches to the protein database (RAId (18) and MS-Dictionary (11)). Spectral profiles represent an alternative approach to peptide identification generating gapped peptides and matching them to the database.

FIG. 3. Overview of the MS-Profile tool.
MS-Profile works in two modes: mode 1 is for the spectral profile generation and mode 2 is for the gapped peptide generation.

EXPERIMENTAL PROCEDURES
What Is the Spectral Profile?-For the sake of simplicity, we will first introduce the notion of a spectral profile under the assumption that amino acid masses are integers. Given a peptide $P_1 \ldots P_n$, we define its prefix masses as the series $\mathrm{mass}(P_1)$, $\mathrm{mass}(P_1) + \mathrm{mass}(P_2)$, ..., $\sum_{j=1}^{i} \mathrm{mass}(P_j)$, ..., $\sum_{j=1}^{n} \mathrm{mass}(P_j)$, where $\sum_{j=1}^{n} \mathrm{mass}(P_j) = k$ is defined as the parent mass. We further represent the peptide $P_1 \ldots P_n$ as a k-mer boolean vector $P = x_1 \ldots x_k$ where $x_t = 1$ if $t$ represents a prefix mass and $x_t = 0$ otherwise (see Refs. 11, 20, and 21 for applications of boolean spectra and peptides). Given a set of boolean peptides Dictionary $= \{P_1, \ldots, P_m\}$, we define the spectral profile as simply the center of gravity of all peptides (boolean vectors) in the set, i.e. $\mathrm{Profile}(\mathrm{Dictionary}) = \frac{1}{m} \sum_{j=1}^{m} P_j$. This definition assumes that all peptides in the Dictionary are equally likely.
Kim et al. (11) introduced the notion of spectral dictionary and described an MS-Dictionary approach to peptide identification. Given a spectrum Spectrum and a score threshold Threshold, Dictionary(Spectrum, Threshold) is defined as the set of all peptides with scores above the Threshold. We define the spectral profile Profile(Spectrum, Threshold) as Profile(Dictionary(Spectrum, Threshold)).
When the Dictionary is explicitly given, computing Profile(Dictionary) amounts to computing the center of gravity of k-dimensional boolean vectors from Dictionary. Although MS-Dictionary (11) is capable of quickly generating spectral dictionaries for short peptides (less than 15 aa), the spectral dictionaries of spectra of long peptides are so large (even for sensible choices of Threshold) that MS-Dictionary becomes impractical. For example, for a typical spectrum of a 15-aa long peptide, the spectral dictionary consists of ≈4·10^9 high scoring peptides that would typically result in statistically significant database hits (11). Below we show how to quickly generate spectral profiles of such huge dictionaries without explicitly generating the dictionary. MS-Profile takes only ≈0.2 s to generate the spectral profiles even for spectra of long peptides. Thus MS-Profile bypasses the need to explicitly generate large spectral dictionaries that limited applications of MS-Dictionary in the case of long peptides.
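For a dictionary small enough to enumerate, the center-of-gravity definition can be sketched directly. The example below is a minimal illustration, not the MS-Profile implementation: it uses the toy model of Fig. 4 (two amino acids of masses 2 and 3 Da, boolean spectrum 011010100, score equal to the number of matching peaks), and all function names are hypothetical.

```python
from fractions import Fraction


def enumerate_peptides(parent_mass, aa_masses):
    """Yield every peptide (tuple of amino acid masses) with the given parent mass."""
    if parent_mass == 0:
        yield ()
        return
    for a in aa_masses:
        if a <= parent_mass:
            for rest in enumerate_peptides(parent_mass - a, aa_masses):
                yield (a,) + rest


def to_boolean(peptide, k):
    """Boolean vector x_1..x_k with x_t = 1 exactly at the prefix masses."""
    vec = [0] * k
    total = 0
    for a in peptide:
        total += a
        vec[total - 1] = 1
    return vec


def score(peptide_vec, spectrum_vec):
    """Toy score: number of positions where peptide and spectrum both have a 1."""
    return sum(p * s for p, s in zip(peptide_vec, spectrum_vec))


def profile_from_dictionary(vectors):
    """Center of gravity of equally weighted boolean vectors."""
    m = len(vectors)
    return [Fraction(sum(v[i] for v in vectors), m) for i in range(len(vectors[0]))]


spectrum = [0, 1, 1, 0, 1, 0, 1, 0, 0]  # toy spectrum of Fig. 4, parent mass 9
k = len(spectrum)
peptides = [to_boolean(p, k) for p in enumerate_peptides(k, (2, 3))]
dictionary = [v for v in peptides if score(v, spectrum) >= 2]
profile = profile_from_dictionary(dictionary)
print([str(f) for f in profile])
# ['0', '2/3', '1/3', '1/3', '2/3', '0', '1', '0', '1']
```

This brute-force route scales with the size of the dictionary, which is exactly what becomes infeasible for spectra of long peptides.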
Computing Spectral Profiles-The transformation of spectra into spectral profiles can be done efficiently by the forward-backward dynamic programming algorithm (22). For the sake of simplicity, we first represent a spectrum with parent mass $k$ as a boolean spectrum $S = s_1 \ldots s_k$ where $s_i = 1$ if there is a peak at mass $i$ in the spectrum and $s_i = 0$ otherwise. This representation assumes that spectra are discretized and that all masses are integers. Below we use the term mass of peptide/spectra to refer to the dimension of the corresponding vectors (parent mass $k$). The score (denoted as Score(P, S)) between a boolean peptide $P = p_1 \ldots p_k$ and a boolean spectrum $S = s_1 \ldots s_k$ (of the same mass) is defined as $\sum_{j=1}^{k} p_j \cdot s_j$. When peptide $P$ and spectrum $S$ differ in mass, we define Score(P, S) as $-\infty$.
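We define $S_i^{\mathrm{prefix}}$ as $s_1 \ldots s_i$ and $S_i^{\mathrm{suffix}}$ as $s_{k-i+1} \ldots s_k$. Given a spectrum $S = s_1 \ldots s_k$, we define $\mathrm{prefix}(i, t)$ as the set of all boolean peptides $P$ with length $i$ and $\mathrm{Score}(P, S_i^{\mathrm{prefix}}) = t$. Let $x_{\mathrm{fwd}}(i, t)$ be the size of $\mathrm{prefix}(i, t)$. The variable $x_{\mathrm{fwd}}(i, t)$ can be computed using the forward dynamic programming as follows, starting from $x_{\mathrm{fwd}}(0, 0) = 1$.

$$x_{\mathrm{fwd}}(i, t) = \sum_{\text{all amino acids } a} x_{\mathrm{fwd}}(i - \mathrm{mass}(a),\ t - s_i)$$

Given a spectrum $S = s_1 \ldots s_k$, we define $\mathrm{suffix}(i, t)$ as the set of all boolean peptides $P$ with length $k - i$ and $\mathrm{Score}(P, S_{k-i}^{\mathrm{suffix}}) = t$. Let $x_{\mathrm{bwd}}(i, t)$ be the size of $\mathrm{suffix}(i, t)$. The variable $x_{\mathrm{bwd}}(i, t)$ can be computed using the reverse dynamic programming as follows, starting from $x_{\mathrm{bwd}}(k, 0) = 1$.

$$x_{\mathrm{bwd}}(i, t) = \sum_{\text{all amino acids } a} x_{\mathrm{bwd}}(i + \mathrm{mass}(a),\ t - s_{i + \mathrm{mass}(a)})$$

Given a score threshold Threshold to generate a Dictionary, it is easy to see that the size of the Dictionary can be computed as follows.

$$|\mathrm{Dictionary}| = \sum_{t > \mathrm{Threshold}} x_{\mathrm{fwd}}(k, t)$$

Below we demonstrate that Profile(Spectrum, Threshold) $= f_1 \ldots f_k$ can be computed using the forward-backward algorithm. A peptide in the Dictionary has prefix mass $i$ exactly when it is the concatenation of a peptide from $\mathrm{prefix}(i, t)$ and a peptide from $\mathrm{suffix}(i, t')$ with $t + t' > \mathrm{Threshold}$, so that

$$f_i = \frac{1}{|\mathrm{Dictionary}|} \sum_{t + t' > \mathrm{Threshold}} x_{\mathrm{fwd}}(i, t) \cdot x_{\mathrm{bwd}}(i, t')$$

Because $f_i$ depends only on $x_{\mathrm{fwd}}(i, t)$ and $x_{\mathrm{bwd}}(i, t')$, $f_i$ can be computed by the forward-backward algorithm as described above. Fig. 4 illustrates computation of a spectral profile.
In practice, we compute spectral profiles for a fixed spectral probability (23) (rather than for a fixed score threshold). The spectral probability of a Peptide-Spectrum Match (PSM) is defined as the total probability of all peptides with scores exceeding the score of the PSM. One can also define a spectral probability depending on a score Threshold as the total probability of all peptides with scores above Threshold (the total probability of all peptides in the corresponding spectral dictionary). Given a spectral probability $p$, one can approximate the expectation value as $p \cdot \mathrm{DatabaseSize}$. See Refs. 11 and 23 for the background on spectral probabilities and spectral dictionaries. For each spectrum, MS-Profile dynamically sets Threshold as the minimum score $s$ such that the spectral probability of the reconstructions with scores above $s$ does not exceed a predefined spectral probability (e.g. $10^{-8}$) and computes the spectral profile. For example, the spectral profile in Fig. 1 was computed for spectral probability $10^{-8}$. Supplemental Fig. S3 illustrates that the spectral profile remains rather stable for a range of spectral probabilities.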
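The forward-backward computation can be sketched for the simple boolean model as follows. This is a minimal illustration under stated assumptions, not the MS-Profile code: the recurrences are written as we read them from the definitions of prefix(i, t) and suffix(i, t), the dictionary is taken to contain peptides with scores strictly above the threshold, and the function name `spectral_profile` is hypothetical.

```python
from fractions import Fraction


def spectral_profile(spectrum, aa_masses, threshold):
    """Return (dictionary size, profile f_1..f_k) for peptides scoring > threshold.

    spectrum[i - 1] holds s_i; aa_masses are integer amino acid masses.
    """
    k = len(spectrum)
    s = [0] + list(spectrum)  # shift to 1-based indexing: s[i] == s_i
    max_score = sum(spectrum)

    # fwd[i][t]: number of peptides of mass i with score t against s_1..s_i.
    fwd = [[0] * (max_score + 1) for _ in range(k + 1)]
    fwd[0][0] = 1
    for i in range(1, k + 1):
        for t in range(s[i], max_score + 1):
            fwd[i][t] = sum(fwd[i - a][t - s[i]] for a in aa_masses if a <= i)

    # bwd[i][t]: number of peptides of mass k - i with score t against s_{i+1}..s_k.
    bwd = [[0] * (max_score + 1) for _ in range(k + 1)]
    bwd[k][0] = 1
    for i in range(k - 1, -1, -1):
        for t in range(max_score + 1):
            bwd[i][t] = sum(
                bwd[i + a][t - s[i + a]]
                for a in aa_masses
                if i + a <= k and t >= s[i + a]
            )

    size = sum(fwd[k][t] for t in range(max_score + 1) if t > threshold)
    if size == 0:
        raise ValueError("no peptide scores above the threshold")

    profile = []
    for i in range(1, k + 1):
        # Count dictionary peptides that have a prefix mass at position i.
        count = sum(
            fwd[i][t] * bwd[i][u]
            for t in range(max_score + 1)
            for u in range(max_score + 1)
            if t + u > threshold
        )
        profile.append(Fraction(count, size))
    return size, profile


# Toy example of Fig. 4: amino acid masses 2 and 3, spectrum 011010100, threshold 1.
size, profile = spectral_profile([0, 1, 1, 0, 1, 0, 1, 0, 0], (2, 3), 1)
print(size, [str(f) for f in profile])
# 3 ['0', '2/3', '1/3', '1/3', '2/3', '0', '1', '0', '1']
```

The running time is polynomial in the parent mass and the maximum score, independent of how many peptides the dictionary contains, which is the point of avoiding explicit enumeration.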
Note that the simple boolean model for scoring peptide-spectrum matches can easily be extended to more complicated models without any algorithmic changes. Indeed MS-Profile uses the scoring model of MS-Dictionary (23) that considers various features such as ion types, peak intensities, and mass errors.

RESULTS
Data Set-We used the Standard Protein Mix database consisting of 1.1 million spectra generated from 18 proteins using eight different mass spectrometers (24). For this study, we considered only the charge 2 spectra generated by Thermo Electron LTQ where 1388 peptides of length between 7 and 20 are reliably identified with false discovery rate 2.5% using Sequest (3) and PeptideProphet (25) in the search against the Haemophilus influenzae database appended with sequences of the 18 proteins (567,460 residues). Although this study focuses on doubly charged spectra, MS-Profile can also be applied to MS/MS spectra of higher charges as long as an additive scoring model for highly charged MS/MS spectra is available.
For each peptide, we randomly selected one representative spectrum and formed a data set of 1388 PSMs grouped by the length of their peptide identifications. To avoid computational artifacts introduced by errors in the parent mass, the parent masses of the spectra are corrected according to the Sequest identifications. Below we refer to this data set as the Standard data set. Throughout the paper we measure accuracy of a de novo sequencing tool as the percentage of spectra with error-free reconstructions among all spectra in the Standard data set.
Fig. 5 shows the distribution of spectral probabilities (false positive rates) of the Standard data set. Most PSMs (91%) have spectral probabilities lower than $10^{-8}$. We used $10^{-8}$ as the spectral probability threshold to generate spectral profiles.
Table I shows the results of de novo peptide sequencing of the Standard data set with PEAKS, PepNovo+, and MS-Dictionary. PEAKS and MS-Dictionary correctly reconstructed peptides for less than 30% of the spectra, and the accuracy of both tools greatly deteriorates as the peptide length increases. PepNovo+ reported shorter de novo reconstructions (especially for spectra of long peptides) by allowing gaps in the start and the end of the peptides, resulting in accuracy better than that of the other tools. Below we show that MS-Profile improves the accuracy of these tools at the cost of a small reduction in the length of reconstructed peptides.
De Novo Sequencing of Gapped Peptides-De novo peptide sequencing algorithms usually correctly recover some amino acids within a peptide and misinterpret others. The key challenge is to figure out which portions of the peptide are reconstructed incorrectly and to limit reconstructions to highly accurate portions. Gapped peptide reconstruction addresses this challenge by reporting only reliably reconstructed regions of the peptide.
Given a Peptide $= x_1 \ldots x_k$, a Profile $= f_1 \ldots f_k$, and a parameter MinProbability, we define GappedPeptide(Peptide, Profile, MinProbability) $= g_1 \ldots g_k$ as $g_i = x_i$ if $f_i \geq \mathrm{MinProbability}$ and $g_i = 0$ otherwise. Fig. 1 shows a spectral profile for the spectrum of peptide STVAGESGSADTVR and (incorrect) de novo reconstruction SSLAGESGSADTVR. One can notice that although profile values for most prefix masses in STVAG-

FIG. 4. An example of the dynamic programming algorithm for computing the spectral profile of a "toy" boolean spectrum 011010100 with four peaks at masses 2, 3, 5, and 7 (parent mass, 9).
The MS-Profile algorithm is illustrated with the help of a toy amino acid model (only two amino acids with masses 2 and 3 Da) and a simplified discretized spectrum. The scoring function Score(Peptide, Spectrum) used for this illustration is the number of matching peaks between boolean peptide and boolean spectrum. There are only five peptides with parent mass 9 Da: 3222 (score 3), 2322 (score 3), 2232 (score 2), 2223, and 333 (both score 1). These five peptides correspond to nine-dimensional boolean vectors: 001010101, 010010101, 010100101, 010101001, and 001001001. If one considers all peptides with scores 1 and above, the spectral profile is a nine-dimensional vector (0, 3/5, 2/5, 2/5, 2/5, 2/5, 3/5, 0, 1) representing the center of gravity of these five vectors. However, if one considers the dictionary of all peptides with scores 2 and above then the spectral profile (0, 2/3, 1/3, 1/3, 2/3, 0, 1, 0, 1) is the center of gravity of three peptides 001010101, 010010101, and 010100101. The forward-backward dynamic programming generates the spectral profile without explicitly generating any of the peptides in the dictionary. For the threshold 1 (peptides of scores 2 and above are considered), the size of the spectral dictionary is 3, and the spectral profile of the dictionary is (0, 2/3, 1/3, 1/3, 2/3, 0, 1, 0, 1). Numbers in bold and dashed (mass 2) and solid (mass 3) arrows represent paths to reach the dictionary with the threshold 1.

MS-Profile generates gapped peptides as follows. For each spectrum, it first constructs the spectral profile and generates optimal de novo reconstructions by backtracking its forward matrix. Indeed because MS-Profile uses the MS-Dictionary scoring (11), the reconstructions are the same as reconstructions generated by MS-Dictionary. Both PEAKS and MS-Dictionary may generate (a small number of) multiple optimal de novo reconstructions, and we first convert them into a single consensus reconstruction. For example, the set of reconstructions YWAGELTR, YWASVLTR, YWAVSLTR, and YWAEGLTR will be converted into a single consensus reconstruction YWA[186]LTR by retaining only the prefix masses present in all reconstructions. Next MS-Profile discards all prefix masses in the consensus reconstruction whose corresponding profile values are below MinProbability as described above. The remaining prefix masses represent the gapped peptide generated by applying MS-Profile to MS-Dictionary (referred to as MS-Profile(MS-Dictionary)). Fig. 6 compares the accuracy of de novo reconstructions generated by MS-Dictionary and the gapped peptide generated by MS-Profile(MS-Dictionary). Applying MS-Profile increases the percentage of correct reconstructions from 28 to 42% while decreasing the average length of reconstructions from 12.8 to 9.1 amino acids when MinProbability = 0.1. We remark that the Standard data set contains some low quality spectra that are nearly impossible to reconstruct in de novo fashion. Supplemental Fig. S4 illustrates the performance of MS-Profile when the Standard data set is restricted to high quality spectra. Supplemental Fig.
S5 illustrates similar comparisons for the different MinProbability thresholds. One can increase the accuracy by increasing the MinProbability threshold. For example, when MinProbability = 0.2, the accuracy increases to 50% while the average length of gapped peptide decreases to 7.9. When MinProbability = 0.3, the accuracy increases to 54% while the average length of gapped peptide becomes 7.2.
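The consensus and MinProbability steps described above can be sketched as follows. This is only an illustration under stated assumptions: nominal integer residue masses stand in for real masses, the profile is passed as a hypothetical dictionary from prefix mass to profile value, the profile values in the second example are made up, and none of the function names come from MS-Profile.

```python
from itertools import accumulate

# Nominal (integer) residue masses in Da for the residues used in the example.
NOMINAL_MASS = {"A": 71, "E": 129, "G": 57, "L": 113, "R": 156,
                "S": 87, "T": 101, "V": 99, "W": 186, "Y": 163}


def prefix_masses(peptide):
    """Cumulative masses of the first i residues, for i = 1..n."""
    return list(accumulate(NOMINAL_MASS[r] for r in peptide))


def consensus_prefix_masses(reconstructions):
    """Keep only the prefix masses shared by all reconstructions."""
    common = set(prefix_masses(reconstructions[0]))
    for peptide in reconstructions[1:]:
        common &= set(prefix_masses(peptide))
    return sorted(common)


def filter_by_profile(masses, profile, min_probability):
    """Drop prefix masses whose profile value is below min_probability."""
    return [m for m in masses if profile.get(m, 0.0) >= min_probability]


def render(peptide, kept):
    """Write kept prefix masses as letters where adjacent, else as [gap mass]."""
    masses = prefix_masses(peptide)
    # Map each prefix mass to (preceding prefix mass, residue ending there).
    step = {m: (masses[j - 1] if j else 0, peptide[j]) for j, m in enumerate(masses)}
    out, prev = [], 0
    for m in kept:
        if m in step and step[m][0] == prev:
            out.append(step[m][1])
        else:
            out.append(f"[{m - prev}]")
        prev = m
    return "".join(out)


recs = ["YWAGELTR", "YWASVLTR", "YWAVSLTR", "YWAEGLTR"]
consensus = consensus_prefix_masses(recs)
print(render(recs[0], consensus))  # YWA[186]LTR

# Made-up profile values: the first prefix mass (163) is unreliable.
toy_profile = {163: 0.05, 349: 0.9, 420: 0.8, 606: 0.7, 719: 0.9, 820: 0.95, 976: 1.0}
gapped = filter_by_profile(consensus, toy_profile, 0.1)
print(render(recs[0], gapped))  # [349]A[186]LTR
```

Raising `min_probability` removes more prefix masses and so merges more residues into gaps, mirroring the accuracy versus length trade-off reported for MinProbability values of 0.1, 0.2, and 0.3.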
PepNovo+ and PEAKS represent some of the most accurate de novo peptide sequencing tools. MS-Profile can be used to convert PepNovo+ and PEAKS reconstructions into gapped peptides resulting in MS-Profile(PepNovo+) and MS-Profile(PEAKS) tools. Applying MS-Profile to PepNovo+ increases the percentage of correct reconstructions from 46 to 65% while decreasing the average length of reconstructions from 11.0 to 8.9 amino acids (MinProbability = 0.1). Applying MS-Profile to PEAKS increases the percentage of correct reconstructions from 26 to 48% while decreasing the average length of reconstructions from 12.6 to 9.2 amino acids (MinProbability = 0.1). Although gapped peptides generated by MS-Profile(PepNovo+) and MS-Profile(PEAKS) are shorter than PepNovo+ and PEAKS reconstructions, they are still long enough to uniquely identify most peptides even in large protein databases. Fig. 7 compares the accuracy and lengths of PepNovo+, PEAKS, MS-Profile(PepNovo+), and MS-Profile(PEAKS) reconstructions.

November 7, 2007) were run with parent mass and fragment mass tolerances of 0.5 Da, fixed modification of Cys + 57, no optional modifications, and without any enzyme preference. In contrast to PEAKS and MS-Dictionary, PepNovo+ allows gaps at the start/end of peptides thus giving PepNovo+ significant leverage when it comes to the reported accuracy of reconstruction. Although MS-Dictionary is designed for generating spectral dictionaries (rather than ensuring that the correct reconstruction has the top score), it can be used in de novo mode as well (it has slightly higher accuracy than PEAKS while generating slightly longer peptides). PEAKS and MS-Dictionary have a tendency to output de novo reconstructions that are longer than the correct peptides (e.g. for peptides of length 11-12, the average length of PEAKS and MS-Dictionary reconstructions is 12.1 and 12.2, respectively). Accuracy of each tool is defined as the percentage of the error-free reconstructions among all reconstructions for the Standard data set.

PepNovo+ allows users to generate up to 2000 reconstructions per spectrum. When multiple reconstructions are generated, the probability of at least one of them being correct increases. For each reconstruction, we generate a gapped peptide using MS-Profile(PepNovo+). Because different PepNovo+ reconstructions may correspond to the same gapped peptide, the number of gapped peptides generated by MS-Profile(PepNovo+) is typically smaller than the original number of PepNovo+ reconstructions. Supplemental Table S1 illustrates that although the number of gapped peptides generated by MS-Profile(PepNovo+) is 3-15 times smaller than the number of PepNovo+ reconstructions, the length of the reconstructed gapped peptides is typically sufficient to ensure a unique database hit.
The improved performance of MS-Profile(PepNovo+) in generating gapped peptides suggests that it can be used for database filtration in the same way as peptide sequence tags are used in InsPecT (1). For the Standard data set, we ran InsPecT to generate 1, 10, and 25 tags of lengths 3 and 4 and measured for how many spectra InsPecT generates at least one correct tag (Fig. 9). The same number of gapped peptides are also generated by MS-Profile(PepNovo+). It turned out that the best gapped peptide is longer and more accurate than the best tag of length 3 (the gapped peptide is correct for 65% of spectra, whereas the best 3-aa long tag is correct for 44% of spectra). Also the top 10 and 25 gapped peptides are roughly as accurate as the same number of tags of length 3. For 83% of spectra, at least one of the top 10 gapped peptides is correct, whereas for 80% of spectra, at least one of the top 10 tags of 3 aa is correct. For 86% of spectra, at least one of the top 25 gapped peptides is correct, whereas for 88% of spectra, at least one of the top 25 tags of 3 aa is correct.
This is surprising because gapped peptides generated by MS-Profile(PepNovo+) represent a much better filter for database search than InsPecT tags. To test the filtering efficiency, we matched the top gapped peptide of each spectrum and its top 3-aa tag against the Swiss-Prot database  it is roughly proportional to the time required for peptide identification (1). Therefore, a 50-fold reduction in the number of false matches can potentially translate into a 50-fold speedup as compared with (already fast) InsPecT. The contrast between gapped peptides and tags is particularly pronounced in searches against very large databases like proteogenomic six-frame translation searches of the repeat-masked human genome of size 2.7 billion residues (11). Gapped peptides longer than 8 aa (63% of spectra in the Standard data set) are expected to have only 0.24 false matches in this database, whereas 3-aa tags are expected to have 1400 false matches on average.
This comparison suggests that MS-Profile can significantly improve on previous filtration approaches to MS/MS database searches. In contrast to peptide sequence tags (that typically have many false hits in a database), gapped peptides typically have few false hits (if any) thus speeding up the database searches. We comment that use of gapped seeds in traditional BLAST-like genomics searches is well studied (26).
Evaluating Spectral Profile Probabilities-Some de novo sequencing programs output the reliability of predicted amino acids. For example, PepNovo+ defines features that reflect the reliability of each predicted amino acid and converts the feature vectors into probabilities (27). PEAKS recently added a similar function that computes the reliability of an amino acid a by locally permuting the reconstruction around a, computing the score difference between the original and permuted reconstructions, and using the prelearned distribution of the difference to assign the reliability of a (28). MS-Profile differs from these tools because, instead of learning reliabilities, it rigorously computes the probability that a prefix mass is present in a high scoring de novo peptide reconstruction.
We show that the spectral profile probabilities approximate the empirical accuracy of the prefix mass (represented by the profile peak) being correct. To compute the accuracy of the profile value $p$ (for $p = 0.1, 0.2, \ldots, 0.9$, and $1.0$), we bundled all the profile peaks with values between $p - 0.05$ and $p + 0.05$ and measured the fraction of correct peaks among them. If the empirically computed fraction of correct peaks of the profile value $p$ is close to $p$ then our estimate of profile probabilities is unbiased. Fig. 10 shows that it is indeed the case: the empirical accuracy of the profile peaks with probability $p$ is slightly above $p$. The slightly higher empirical accuracy (as compared with profile values) is likely a consequence of using the same spectral probability threshold of $10^{-8}$ for all spectra, whereas in reality most PSMs have much lower spectral probabilities (Fig. 5).

DISCUSSION
Although peptide sequence tags were first proposed in 1994 (29), it took 10 years for this idea to become an integral part of the new generation of fast MS/MS database search tools (1, 2). It took such a long time because a seemingly simple problem of generating accurate sequence tags turned out to be more difficult than originally thought. We demonstrated that gapped peptides occupy an important niche between long but inaccurate full-length peptide reconstructions and short but more accurate peptide sequence tags. This niche provides certain advantages because gapped peptides represent a more stringent filter that may enable very fast MS/MS database searches that in many cases will amount to a simple lookup in a database. Spectral profiles reveal poor quality spectra (or poor quality regions within long peptides) that other methods have difficulties analyzing. MS-Profile follows a different route to error-tolerant peptide identifications than OpenSea (7) and SPIDER (8).
Instead of trying to generate (unreliable) full-length reconstructions and approximately matching them against the database, MS-Profile generates reliable gapped peptides and matches them against the database exactly. Some de novo sequencing tools such as Lutefisk (30), PEAKS (12), and PepNovo+ (10) can generate gapped peptides, typically by trimming the full-length peptides in the beginning/end. Even when internal gaps are allowed (Lutefisk and PEAKS), they are limited to gaps of 2 aa or shorter. For long peptides where multiple consecutive peaks are missing, it is hard to generate correct gapped peptides when only short gaps are allowed. On the other hand, PepNovo+ improves on these tools by allowing long gaps in the start/end of the peptides. As a result, PepNovo+ has a tendency to generate incorrect solutions when it tries to reconstruct all amino acids in the middle. To the best of our knowledge, MS-Profile is the only program that allows both short and long gaps regardless of the position. Also MS-Profile can convert any de novo reconstructions into gapped peptides, thus making it a useful addition to various de novo peptide sequencing tools.