Molecular & Cellular Proteomics 6:1188-1197, 2007.
© 2007 by The American Society for Biochemistry and Molecular Biology, Inc.
From the
Laboratory for Biological and Medical Mass Spectrometry, Biomedical Centre, Box 583, Uppsala University, SE-75123 Uppsala, Sweden,
Department of Pharmaceutical Biosciences, Biomedical Centre, Uppsala University, SE-75124 Uppsala, Sweden, and ¶ The Rockefeller University, New York, New York 10021
| ABSTRACT |
|---|
MS is a powerful tool utilized for thorough analytical profiling of a large number of neuropeptides (5). The MS methodology in combination with either ESI (9, 10) or MALDI (11) permits sensitive detection of peptide changes in complex mixtures of hundreds of different peptides simultaneously (5). The resolution and specificity of a neuropeptide analysis are further enhanced by coupling MS to LC or other high resolution separation techniques.
Neuropeptidomics MS experiments, aimed at understanding the healthy and diseased mammalian brain, generate a large amount of data. To efficiently analyze these large datasets, reliable tools for automatic identification are needed. Such tools should be fast, yield few false peptide identifications (false positives), and leave few correct peptides unidentified (false negatives). So far, the main focus of the proteomics field has been on developing tools for identification of proteins, which are typically digested with trypsin, i.e. an enzyme with high specificity (12), limiting the search space of possible peptides. In contrast, endogenous peptide precursors are often processed by several enzymes (13), and some of these have unknown specificity, making it difficult to accurately predict the sequence of mature endogenous peptides. Therefore, when searching for endogenous peptides, the entire proteome is often cleaved assuming an enzyme with no specificity (i.e. cleaving between any pair of amino acids). This creates a very large search space and yields poor results because only peptides that have strong experimental support can be identified. In a typical peptidomics experiment many hundreds of peptides are detected (5), but about an order of magnitude fewer are identified confidently.
Many bioactive endogenous peptides are post-translationally modified, and it is common that a peptide contains more than one modification, further complicating the identification process. Important peptide modifications include acetylation, amidation, phosphorylation, and sulfation (7). Approximately 300 different modifications have so far been reported for proteins (14–17). For example, 30% of the mammalian proteins are believed to be phosphorylated at one time or another (18). C-terminal amidation, a common neuropeptide modification, seems to modify 50% of all bioactive peptides (19, 20). In brief, the unknown specificity of the processing enzymes and the number of possible modifications make the identification of endogenous peptides difficult. Another difficulty stems from the less informative and inadequately understood fragmentation patterns of endogenous peptides compared with those of tryptic peptides.
The aim of this study was to investigate how to optimize the identification process for endogenous peptides analyzed by tandem mass spectrometry by improving the sequence collections used by the search engines. During this study, several previously uncharacterized peptides were discovered from mouse brain tissue. Some of these peptides are potential novel neuropeptides as they are processed from proteins, known to contain neuropeptides, at sites that are characteristic for neuropeptides. Identifying novel neuropeptides is important for the understanding of the biochemical processes in the mammalian brain. This study demonstrates the importance of using optimized sequence collections when identifying endogenous peptides.
| EXPERIMENTAL PROCEDURES |
|---|
SwePep Precursor
The SwePep precursor sequence collection includes the sequences from the mouse peptide precursor proteins annotated in SwePep. Many precursor proteins, such as pro-opiomelanocortin, contain several known endogenous peptides (22) and a number of possible cleavage sites for endogenous peptides. Therefore this sequence collection should contain many of the endogenous peptides despite its moderate size of 123 protein sequences with a total number of 23,601 amino acid residues. Using unspecific cleavage and a maximum peptide length of 50 amino acid residues 4,406,615 peptides were derived from this sequence collection.
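To make the notion of unspecific cleavage concrete, the following minimal sketch enumerates every contiguous subsequence of a protein up to a maximum length. The function names are hypothetical, and the sketch counts only unmodified subsequences (duplicates included); it is not meant to reproduce the peptide totals reported for the sequence collections, which presumably also depend on search engine settings not detailed in this section.

```python
from typing import Iterator


def unspecific_peptides(seq: str, max_len: int = 50, min_len: int = 1) -> Iterator[str]:
    """Yield every contiguous subsequence of seq with min_len..max_len residues.

    This mimics cleavage between any pair of amino acids (no enzyme specificity).
    """
    for start in range(len(seq)):
        longest_end = min(start + max_len, len(seq))
        for end in range(start + min_len, longest_end + 1):
            yield seq[start:end]


def count_unspecific_peptides(length: int, max_len: int = 50) -> int:
    """Count the subsequences of 1..max_len residues in a protein of the given length."""
    return sum(length - k + 1 for k in range(1, min(length, max_len) + 1))


if __name__ == "__main__":
    toy = "ABC"  # toy sequence, not a real protein
    print(list(unspecific_peptides(toy, max_len=2)))
    print(count_unspecific_peptides(len(toy), max_len=2))
```

The count grows roughly linearly with the number of residues in the collection (about `max_len` candidate peptides per residue), which is why the whole-proteome search space is so much larger than that of the targeted collections.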
SwePep Peptides
The SwePep peptide sequence collection contains the sequences of the endogenous peptides annotated in SwePep from Mus musculus. It consists of 245 sequences and 6,776 amino acid residues. When using unspecific cleavage and a maximum peptide length of 50 amino acid residues, this sequence collection generates 1,142,680 peptides.
SwePep Predicted
Endogenous neuropeptides are processed in many steps to become active peptides. Predominantly they are cleaved from their precursor at the C terminus of two basic amino acids, separated by 0, 2, 4, or 6 other residues, by endopeptidases such as prohormone convertase 1 (PC1/3)1 and PC2 (13, 23). The basic residues at the C terminus are then removed by carboxypeptidase E (24). In the last step, the peptide may be modified. Important modifications on neuropeptides include C-terminal amidation and N-terminal acetylation (7).
By using the existing neuropeptide processing knowledge, possible peptide sequences were predicted from the mouse proteome (International Protein Index (IPI) mouse version 3.15, www.ebi.ac.uk/IPI/IPImouse.html) according to the following template: (K/R)Xm(K/R)-Xk-(K/R)Xn(K/R), where m and n = 0, 2, 4, or 6; X is any amino acid; and k = 3–50. The flanking motifs (K/R)Xm(K/R) and (K/R)Xn(K/R) are not part of the final (detected) sequence. The C-terminal basic residues, (K/R)Xn(K/R), were removed, and the sequences Xk were stored in the SwePep predicted sequence collection.
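The template can be illustrated with a short sketch that scans a protein sequence for segments flanked by the two basic motifs. This is an illustration only, not the procedure used to build the SwePep predicted collection: the function names are hypothetical, the length range of 3 to 50 residues is read from the template above, and X is allowed to be any residue, as stated in the template.

```python
BASIC = frozenset("KR")      # lysine and arginine
SPACERS = (0, 2, 4, 6)       # allowed values of m and n in the template


def has_n_terminal_motif(seq: str, start: int) -> bool:
    """True if a (K/R)Xm(K/R) motif ends immediately before index start."""
    if start < 1 or seq[start - 1] not in BASIC:
        return False
    return any(
        start - 2 - m >= 0 and seq[start - 2 - m] in BASIC for m in SPACERS
    )


def has_c_terminal_motif(seq: str, end: int) -> bool:
    """True if a (K/R)Xn(K/R) motif begins at index end."""
    if end >= len(seq) or seq[end] not in BASIC:
        return False
    return any(
        end + 1 + n < len(seq) and seq[end + 1 + n] in BASIC for n in SPACERS
    )


def predict_peptides(seq: str, min_len: int = 3, max_len: int = 50) -> set:
    """Return every segment Xk (min_len..max_len residues) flanked by both motifs."""
    peptides = set()
    for start in range(len(seq)):
        if not has_n_terminal_motif(seq, start):
            continue
        for k in range(min_len, max_len + 1):
            end = start + k
            if end > len(seq):
                break
            if has_c_terminal_motif(seq, end):
                peptides.add(seq[start:end])  # flanking motifs are not stored
    return peptides


if __name__ == "__main__":
    # Toy sequence, not a real precursor: KR...KR flanks a five-residue segment.
    print(predict_peptides("AAKRGGGGGKRAA"))
```

Because the segment Xk may itself contain basic residues, a single precursor can yield several overlapping predicted peptides, which is consistent with the large size of the predicted collection.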
It is possible to define digestion rules for the search engines so that the theoretical digest of the proteome is performed at dibasic sites on the fly, but the SwePep predicted sequence collection speeds up the search, and it can be curated to include special cases and peptides from more than one type of cleavage.
The SwePep predicted sequence collection was developed as a complement to the SwePep precursor and SwePep peptide sequence collections for identification of uncharacterized peptides and peptides from precursors not known to contain endogenous peptides. Peptides identified from the SwePep predicted sequence collection are likely to be biologically active because this collection only contains peptide sequences that have the specific cleavage pattern for neuropeptides. The SwePep predicted collection is constituted of precleaved sequences, and the searches are performed without any cleavage, i.e. the tandem mass spectra are directly matched against the sequences in the sequence collection. There are 3,413,034 predicted peptide sequences with 83,182,326 amino acid residues in this sequence collection. When using X! Tandem and its refinement function (25) this sequence collection generates 15,499,268 peptides with a maximum peptide length of 50 amino acid residues.
Mouse Proteome
To compare this new identification approach with the commonly used identification approach, a sequence collection constituted of the whole mouse proteome (IPI mouse version 3.15, www.ebi.ac.uk/IPI/IPImouse.html) was searched using unspecific cleavage. The sequence collection of the mouse proteome consists of 68,222 protein sequences with a total number of 27,668,712 amino acid residues. When using unspecific cleavage and a maximum peptide length of 50 amino acid residues 250,809,615 peptides are generated.
Search Engines
This study was performed using two different search engines, X! Tandem (26) and Mascot (27), for searching the four sequence collections described above.
Search parameters were as follows. The SwePep peptides sequence collection, the SwePep precursor sequence collection, and mouse proteome sequence collection were searched using unspecific cleavage, and the precleaved SwePep predicted sequence collection was searched using no cleavage. The databases were searched using a peptide mass tolerance of ±2 Da and a fragment mass tolerance of ±0.7 Da. The first dataset was searched with a number of possible post-translational modifications (N-terminal acetylation, N-terminal pyroglutamic acid of glutamine, C-terminal amidation, deamidation of asparagine and glutamine, and oxidation of methionine). A full specification of search parameters is presented in the supplemental data. For X! Tandem the refinement function was used to allow unspecific cleavage of a precursor if one or more peptides have been identified from it (25).
Mass Spectrometry Datasets
Two different MS datasets were used for searching the sequence collections. One set contained 86 tandem mass spectra with manually identified peptides in the mass range from 500 to 3500 Da and with charge states 1, 2, 3, or 4. All tandem mass spectra were manually evaluated, and the peptides were unambiguously identified. Because this dataset was manually composed of spectra with known identities it does not reflect a typical collection of tandem mass spectra from an LC-MS analysis of a peptidomic sample. Therefore, a second dataset was evaluated. This dataset was obtained by analyzing a peptidomic sample from mouse hypothalamus with nanoflow capillary LC-ESI-MS/MS and contained 2,867 tandem mass spectra.
Sample Preparation and Mass Spectrometry Analysis
The brain tissue was suspended in cold extraction solution (0.25% acetic acid) and homogenized by microtip sonication (Vibra cell 750, Sonics & Materials Inc., Newtown, CT) to a concentration of 0.2 mg of tissue/µl as described previously (4, 5). Briefly, the suspension was centrifuged at 20,000 × g for 30 min at 4 °C. The protein- and peptide-containing supernatant was transferred to a centrifugal filter device (Microcon YM-10, Millipore, Bedford, MA) with a molecular mass limit of 10,000 Da and centrifuged at 14,000 × g for 45 min at 4 °C. Finally, the peptide filtrate was frozen and stored at −80 °C until analysis.
Five microliters of peptide filtrate (equivalent to 1.0 mg of brain tissue) was desalted on a nano-precolumn (LC Packings, Amsterdam, The Netherlands) at 10 µl/min using a nano-LC system (Ettan MDLC, GE Healthcare). The filtrate was then separated using a fused silica capillary column (75-µm inner diameter, 15-cm length, NAN75-15-03-C18PM; LC Packings) by an isocratic flow of buffer A (0.25% acetic acid in water) for 35 min and eluted during a 60-min gradient from buffer A to B (35% acetonitrile in 0.25% acetic acid). The eluted peptides were analyzed by a linear trap quadrupole ion trap mass spectrometer (Thermo Electron, San Jose, CA). The spray voltage was 1.8 kV, the capillary temperature was 160 °C, and 35 units of collision energy were used to obtain fragment spectra. Four MS/MS spectra of the most intense peaks were obtained following each full-scan mass spectrum (Xcalibur 1.4 SR1). The dynamic exclusion feature was enabled to obtain MS/MS spectra on co-eluting peptides. Raw linear trap quadrupole data were converted to dta files by Xcalibur 1.4 SR1 and assembled by an in-house developed script to Mascot generic files.
Verification of Search Results
An important step in the identification process is to verify the result from the search engines. One way to do this is to estimate the probability of false identifications (28, 29). This is often achieved by calculating an expectation value or by searching a decoy sequence collection, calculating the number of hits over a threshold, and dividing this number by the number of matches from the targeted sequence collection search (30, 31). A commonly used acceptance criterion for considering a protein to be identified is that at least two peptides have to be identified from that protein with scores over a calculated threshold. In contrast, each endogenously processed peptide may have a unique biological function, and it is therefore important to obtain sufficiently high quality data to trust the suggested identity. The thresholds selected for endogenous peptide identification therefore have to be more stringent.
The false-positive rate for the first dataset in this study, containing known peptide identities, was calculated by dividing the number of falsely identified peptides by the number of true hits. For the second dataset, the false-positive rate was estimated by searching the reversed sequence collections.
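The decoy-based estimate described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the `REV_` prefix, and the score lists are hypothetical, and the scores stand in for whatever score a search engine reports for each match.

```python
from typing import Dict, Iterable


def reverse_collection(sequences: Dict[str, str]) -> Dict[str, str]:
    """Build a decoy collection by reversing every sequence.

    The 'REV_' prefix is an arbitrary label chosen for this sketch.
    """
    return {f"REV_{name}": seq[::-1] for name, seq in sequences.items()}


def estimated_false_positive_rate(
    decoy_scores: Iterable[float],
    target_scores: Iterable[float],
    threshold: float,
) -> float:
    """Decoy hits over the threshold divided by target hits over the threshold."""
    decoy_hits = sum(score > threshold for score in decoy_scores)
    target_hits = sum(score > threshold for score in target_scores)
    if target_hits == 0:
        raise ValueError("no target hits above the threshold")
    return decoy_hits / target_hits


if __name__ == "__main__":
    print(reverse_collection({"P1": "ABC"}))  # toy entry, not a real protein
    # Hypothetical scores from a reversed and a targeted search of the same spectra.
    print(estimated_false_positive_rate([10, 40], [35, 50, 60, 20], threshold=30))
```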
The threshold suggested by the search engine was used as the first verification step to evaluate the search result. All peptides with a score below the suggested threshold were manually verified or discarded. Second, to increase the stringency, the individual scores of the peptides were considered. After sorting out only the significant hits, the second best hit for each tandem mass spectrum was compared with the best hit to check whether the scores of the first (s1) and second (s2) best hits were too close to each other, i.e. (s1 − s2) < 1. If the scores for the first and second best hits were too close, manual inspection was used for determining the correct sequence.
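The score-difference check can be expressed in a few lines. The sketch below assumes that the candidate scores for one tandem mass spectrum are available as a plain list; the function name and the example scores are hypothetical.

```python
from typing import Sequence


def needs_manual_inspection(scores: Sequence[float], min_gap: float = 1.0) -> bool:
    """Flag a spectrum whose best (s1) and second best (s2) scores differ by < min_gap."""
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2:
        return False  # nothing to compare against
    s1, s2 = ranked[0], ranked[1]
    return (s1 - s2) < min_gap


if __name__ == "__main__":
    print(needs_manual_inspection([45.2, 44.6, 30.1]))  # scores too close
    print(needs_manual_inspection([45.2, 38.0, 30.1]))  # clear best hit
```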
| RESULTS AND DISCUSSION |
|---|
Manually Identified Brain Peptides (Dataset 1)
Dataset 1 consisted of 86 tandem mass spectra. Fig. 1 shows the search result obtained by searching Dataset 1 against the different sequence collections. The highest number of identified peptides was obtained when searching the SwePep peptide sequence collection. The fewest peptides were identified when searching the mouse proteome. This is because the cutoff score for a significant identification increases with the size of the sequence collection, as the probability that a hit is random increases (33). For example, the cutoff score for a significant identification in Mascot increases from 31 to 70 when going from the smallest to the largest sequence collection (5% false-positive rate). Fig. 2 shows the logarithm of the expectation value as a function of the logarithm of the number of peptide sequences in the sequence collection for a few peptides identified by X! Tandem. This means that the number of false negatives will increase as the size of the sequence collection increases. For Dataset 1, the only false identity was suggested by Mascot when the SwePep predicted sequence collection was searched. When searching the more targeted sequence collections, 3 times as many peptides were identified compared with searching the mouse proteome. The search of the mouse proteome did not contribute any identities that were not identified when searching the more targeted sequence collections. Another drawback of searching large sequence collections using unspecific cleavage is that it is time-consuming, especially if the search includes a number of different post-translational modifications.
The sequence collections all generated somewhat different results in the present study. Table I summarizes the advantages and disadvantages for the different sequence collections. Each sequence collection, except the one of the mouse proteome, generated sequence collection-specific peptide identities; although they existed in the other sequence collections, they were not significantly identified. For more information about Dataset 1 see Table 1 in the supplemental data.
When Dataset 2 was searched against the three different collections of sequences, 85 peptides were identified. Some of the peptides were well characterized neuropeptides such as melanotropin, little SAAS, WE-14, and Met-enkephalin-Arg-Phe. Others were fragments of characterized neuropeptides as well as previously uncharacterized peptides. A total of 27 uncharacterized peptides were identified (Table II). See the supplemental data for more detailed information and tandem mass spectra for the peptides. Many of these peptides are processed from known peptide precursors at sites that correspond to the cleavage sites of the proprotein convertases PC1/3 and PC2. Fig. 3 shows in which sequence collection the peptides were identified. There are overlaps between the sequence collections; however, some of the peptides were only identified in one of the sequence collections, showing the importance of using all of them. Some of these previously uncharacterized peptides have the potential to be novel biologically active neuropeptides. From a single peptidomics experiment it is not possible to determine whether the observed peptides are endogenous or degradation products. However, in a time course study where the peptide levels are measured as a function of postmortem time, to a first-order approximation the level of endogenous peptides will decrease and the level of degradation products will increase as the postmortem time increases, i.e. the dynamics of the changes in peptide level can be used to give an indication of whether an observed peptide is endogenous or a degradation product.
Some of the uncharacterized peptides had to be manually validated because they did not fulfill the above stated criteria for the identification process of endogenous peptides. One of the peptides was not suggested as a primary hit when verifying the search result by searching the mouse proteome using Mascot. The peptide suggested from the searches of the more targeted sequence collections was pyro-Glu(Q)EDAELQPR. When validating the search result by searching the mouse proteome using Mascot, the primary hit for this tandem mass spectrum was KQPASQAIPQdeamide-amide (data not shown). This peptide sequence suggested by Mascot is likely to be incorrect. First, the peptide sequence suggested by Mascot has a basic residue, lysine, at the N terminus. This would imply that the b-ion series should be the most prominent, but when examining the tandem mass spectrum the most intense peaks are assigned as y-ions. Second, the tandem mass spectrum of the sequence pyro-Glu(Q)EDAELQPR showed poor fragmentation. Because this peptide is singly charged, the C-terminal residue is an arginine, and the sequence contains an aspartic acid, it is likely that there will be "charge remote fragmentation" (35). This will generate a tandem mass spectrum in which the most intense peak corresponds to cleavage between aspartic acid and alanine. The most abundant ion series should be the y-ion series because the basic residue is positioned at the C terminus. Taken together, the peptide sequence suggested by Mascot using the mouse proteome sequence collection is most likely a false positive.
This explanation is in line with the work published by Kapp et al. (36) where the effect of proton mobility on patterns in peptide fragmentation was investigated. It was noted that peptides without a mobile proton often receive low scores from search engines. It was also suggested that the use of an additional proton mobility-based scoring would compensate for this effect. An automatic implementation of such proton mobility scoring would be of great value for the identification of endogenous peptides.
Another example of when manual inspection of the Mascot search result is needed was when the difference between the scores for the primary and secondary hits was too small (i.e. <1). The primary hit corresponded to the sequence acetyl-DTNSIAKAIKTRGEGIHQKL+2 deamidation as the most likely match for the tandem mass spectrum. The secondary hit corresponded to the sequence ELSAERPLNEQIAEAEADKI. The manual inspection of the expected fragmentation patterns showed that the secondary hit was presumably the right one (Fig. 4). The sequence ELSAERPLNEQIAEAEADKI has two basic residues; according to the study performed by Tabb et al. (34) arginine is the dominant residue, and it is the position of arginine that decides which type of ions will be most abundant. In this case the b-ions should be the most intense peaks, and indeed 16 of the 21 most intense peaks in the tandem mass spectrum could be assigned as b-ions.
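Manual inspection of this kind rests on comparing the observed peaks with the theoretical b- and y-ion masses of each candidate sequence. The following minimal sketch computes singly charged b- and y-ion m/z values from standard monoisotopic residue masses. It handles unmodified peptides only (no acetylation, deamidation, amidation, or pyroglutamic acid), the function names are hypothetical, and it is not the procedure used by Mascot or X! Tandem.

```python
# Standard monoisotopic residue masses (Da).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # H2O
PROTON = 1.007276   # charge carrier for singly protonated ions


def b_ions(peptide: str) -> list:
    """Singly charged b-ion m/z values b1..b(n-1) (N-terminal fragments)."""
    masses, total = [], 0.0
    for residue in peptide[:-1]:
        total += MONO[residue]
        masses.append(total + PROTON)
    return masses


def y_ions(peptide: str) -> list:
    """Singly charged y-ion m/z values y1..y(n-1) (C-terminal fragments)."""
    masses, total = [], 0.0
    for residue in reversed(peptide[1:]):
        total += MONO[residue]
        masses.append(total + WATER + PROTON)
    return masses


if __name__ == "__main__":
    candidate = "ELSAERPLNEQIAEAEADKI"  # secondary hit discussed above
    for i, (b, y) in enumerate(zip(b_ions(candidate), y_ions(candidate)), start=1):
        print(f"b{i}: {b:9.4f}   y{i}: {y:9.4f}")
```

Matching the most intense observed peaks against these two lists, within the fragment mass tolerance, gives the kind of b-ion versus y-ion tally used in the comparison above.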
AREPSNATQLDGPA R****R where * signifies a non-basic amino acid (not Lys or Arg) (Fig. 5A) from vasopressin-neurophysin 2-copeptin precursor (P35455; all accession numbers cited are from UniProt). The peptide is processed from the precursor protein at sites that are specific for neuropeptides (20, 24, 37), and it is the N-terminal part of the known biologically active peptide copeptin. The precursor also contains the bioactive peptide Arg-vasopressin; many precursor proteins contain more than one bioactive peptide (38). Another example is the peptide KK acetyl-LLYEKMKGGQ RR (Fig. 5B) from 7B2 (P12961). 7B2 was first discovered in 1982 (39, 40); it is a bifunctional polypeptide that is highly conserved in multiple species (41–46). The N-terminal peptide of the polypeptide binds to PC2 and is essential for its activation; in contrast, the C-terminal peptide functions as its inhibitor (47, 48). The peptide identified in this study is the C-terminal part of the N-terminal peptide; it has both characteristic cleavage sites and has been identified both with and without an N-terminal acetylation. Two peptides, KR YPQSKWQEQE KN (Fig. 5C) and R**R DPADASGTRWASS RE (Fig. 5D), from secretogranin-1 (P16014) were also identified in this study. Both of the peptides have characteristic cleavage sites at the N terminus but not at the C terminus. Many peptides derived from secretogranin-1 have been discovered in mouse and rat brain tissue during the last couple of years (1, 2, 5, 32, 49, 50).
| FOOTNOTES |
|---|
Published, MCP Papers in Press, March 30, 2007, DOI 10.1074/mcp.M700016-MCP200
1 The abbreviation used is: PC, prohormone convertase.
* This study was supported by Swedish Research Council (VR) Grant 11565, 2004-3417; the Swedish Foundation for International Cooperation in Research and Higher Education (STINT) institutional grant; the K&A Wallenberg Foundation; and the Karolinska Institutet Centre for Medical Innovations, Research Program in Medical Bioinformatics. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
S The on-line version of this article (available at http://www.mcponline.org) contains supplemental material.
|| To whom correspondence should be addressed: Laboratory for Biological and Medical Mass Spectrometry, Uppsala University, Box 583 Biomedical Centre, SE-75123 Uppsala, Sweden. Tel.: 46-18-471-7206; Fax: 46-18-471-4422; E-mail: per.andren{at}bmms.uu.se
| REFERENCES |
|---|
-amidating monooxygenase: a multifunctional protein with catalytic, processing, and routing domains. Protein Sci. 2, 489–497