“ChopNSpice,” a Mass Spectrometric Approach That Allows Identification of Endogenous Small Ubiquitin-like Modifier-conjugated Peptides

Conjugation of small ubiquitin-like modifier (SUMO) to substrates is involved in a large number of cellular processes. Typically, SUMO is conjugated to lysine residues within a SUMO consensus site; however, an increasing number of proteins are sumoylated on non-consensus sites. To appreciate the functional consequences of sumoylation, the identification of SUMO attachment sites is of critical importance. Discovery of SUMO acceptor sites is usually performed by a laborious mutagenesis approach or using MS. In MS, identification of SUMO acceptor sites in higher eukaryotes is hampered by the large tryptic fragments of SUMO1 and SUMO2/3. MS search engines in combination with known databases lack the possibility to search MSMS spectra for larger modifications, such as sumoylation. Therefore, we developed a simple and straightforward database search tool (“ChopNSpice”) that successfully allows identification of SUMO acceptor sites from proteins sumoylated in vivo and in vitro. By applying this approach we identified SUMO acceptor sites in, among others, endogenous SUMO1, SUMO2, RanBP2, and Ubc9.

Post-translational modification with ubiquitin and ubiquitin-like modifiers (Ubls) such as SUMO plays an important role in most, if not all, cellular processes (1-6). Conjugation of Ubls to their targets involves an isopeptide bond between the carboxyl group of the modifier and the ε-amino group of a lysine residue within the targets. Attachment of Ubls to specific targets involves an enzymatic cascade. First the Ubls are processed to expose their C-terminal diglycine motif. The mature Ubl is then transferred to its target via a cascade of E1 (activating), E2 (conjugating), and E3 (ligase) enzymes. The conjugation system for SUMO consists of a heterodimeric activating enzyme, Aos1/Uba2; a conjugating enzyme, Ubc9; and E3 ligases, such as RanBP2 or members of the PIAS family. The conjugation status undergoes perpetual change and is governed by a small family of SUMO proteases that hydrolyze the isopeptide bond between SUMO and its target (7,8). Although in lower eukaryotes only one SUMO is present, vertebrates express at least three different SUMO paralogs: SUMO1, SUMO2, and SUMO3. Mature SUMO2 and SUMO3 (referred to as SUMO2/3) are 97% identical but differ substantially from SUMO1 (~50% identity).
Although the list of known SUMO substrates is growing rapidly, our understanding of the functional consequences for many of these targets is lagging behind. At a molecular level, the functional consequences of SUMO conjugation can be explained by a gain or loss of interaction with other macromolecules (3,4). SUMO-dependent intramolecular conformational changes have also been described (9,10). Thus, to appreciate the role that SUMO plays in the regulation of specific substrates, identification of the acceptor site(s) for SUMO conjugation is of key importance.
So far, identification of SUMO acceptor sites has relied largely on mutation of the SUMO consensus site, which consists of a short motif with the sequence ψKXE (ψ represents a bulky hydrophobic residue, and X represents any amino acid). This motif is recognized by Ubc9 if presented in an extended conformation (11-13). However, an increasing number of proteins, such as PCNA, E2-25K, Daxx, and USP25, turned out to be sumoylated on lysine residues that do not conform to the SUMO consensus site (14-17). For this category of proteins, as well as for proteins that contain a large number of SUMO consensus sites, the identification of acceptor lysines is a burdensome task that often involves mutagenesis of each lysine residue within the substrate in turn.
MS is currently one of the state-of-the-art technologies to identify protein factors and their post-translational modifications in an unbiased and sensitive manner. Several groups have shown that, using overexpressed tagged SUMO, MS can be efficiently exploited to identify endogenous substrates for SUMO conjugation (18-20). However, the identification of SUMO acceptor lysines using MS has remained a more challenging task (18,21,23,24). So far, using tagged SUMO, unbiased identification of acceptor lysines for endogenous substrates has only been observed in Saccharomyces cerevisiae (18). The identification of substrates in higher eukaryotes has been hampered by the large conjugated SUMO peptide that arises upon tryptic digestion (>2154 Da with human SUMO1 and >3568 Da with human SUMO2/3 compared with 484 Da for Smt3 in S. cerevisiae). Such large fragments, in addition to the mass of the conjugated peptide, can impede their in-gel digestion, extraction, detection, and sequencing in MS. To overcome some of these limitations, several different strategies have been developed: 1) mutation of the tryptic fragment of SUMO, yielding a smaller tryptic fragment (23), 2) development of an automated recognition pattern tool (SUMmOn) (24), and 3) identification of targets using an in vitro to in vivo approach (21). Although these approaches have been applied successfully for the identification of SUMO conjugates in vitro and in vivo, unbiased identification of SUMO conjugates in vivo has not been achieved in higher eukaryotes. Another hurdle to such identification of SUMO conjugates is the variety of masses that can theoretically arise for just one SUMO-conjugated lysine in a given protein because of tryptic miscleavages. Thus, the unambiguous identification of SUMO acceptor sites requires the mass of the modified peptide carrying the conjugated SUMO (fragment) to be measured with high accuracy, and most importantly, it requires sequence analysis of the modified peptides.
Because available proteomics search engines lack the possibility to search MSMS spectra for larger modifications, e.g. those that occur upon sumoylation, we developed a novel, simple, and straightforward database search tool ("ChopNSpice") that, in combination with current proteomics search engines (such as MASCOT (25) or SEQUEST (26)), allows one to identify SUMO1 and SUMO2/3 acceptor sites unambiguously. We confirmed this strategy in vitro on various substrates and demonstrate the power of this technique by the identification of acceptor lysines within several endogenous targets from HeLa cells.

EXPERIMENTAL PROCEDURES
Software-ChopNSpice is written in PHP. The software tools that we have developed and presented in this study, along with further documentation, are freely available on line and also released as open source under the terms of the General Public License v3 (GPLv3).
Cell Culture, Immunoprecipitation, and Immunoblotting-HeLa-S3 cells were maintained in Joklik's medium supplemented with 10% fetal bovine serum and antibiotics. To immunoprecipitate SUMO1 conjugates, 1 × 10^8 HeLa cells were washed twice with PBS containing 10 mM NEM and lysed in 2 pellet volumes of radioimmune precipitation assay buffer (20 mM NaP, pH 7.4, 150 mM NaCl, 1% Triton, 0.5% sodium deoxycholate, 0.1% SDS) supplemented with protease inhibitors and 10 mM NEM. Lysates were centrifuged (16,000 × g for 15 min at 4°C) and filtered (0.45 µm) prior to addition of 25 µg of monoclonal α-SUMO1 antibodies. After 2-h incubation at 4°C, the lysates were centrifuged (16,000 × g for 15 min at 4°C), and the supernatant was incubated for another 2 h at 4°C with protein G-agarose. After collection and extensive washing of bound proteins, samples were eluted with 2× sample buffer and separated by SDS-PAGE followed by Coomassie staining or Western blotting. In a second larger experiment, 1 × 10^9 cells were lysed in TB (with 0.1% Triton and 10 mM ATP) and treated with 10 mM NEM after lysis. Immunoprecipitation using 100 µg of GMP1 antibodies was similar to that described above. The SUMO acceptor site in RanGAP1 was observed in both purification methods, whereas the other targets were identified in the second scaled up experiment.
Mass Spectrometry and Data Analysis-SUMO-conjugated proteins were excised from the gel, reduced with 50 mM DTT for 1 h, alkylated for 1 h with 100 mM iodoacetamide, and in-gel digested with modified trypsin (Promega) overnight, all at 37°C. SUMO-conjugated proteins from solution were reduced with 50 mM DTT for 1 h, alkylated for 1 h with 100 mM iodoacetamide, and subsequently digested with modified trypsin overnight, all at 37°C. Tryptic peptides were dissolved in 2 µl of 50% acetonitrile with 0.1% formic acid and added to 18 µl of 0.1% formic acid for further MS analysis. MS analysis was performed by nanoscale LC-MSMS using an LTQ-Orbitrap mass spectrometer (Thermo Fisher Scientific) equipped with a nanoelectrospray ion source and coupled to an Agilent 1100 HPLC system (Agilent Technologies) fitted with a self-made C18 column. Tryptic peptides were first loaded at a flow rate of 10 µl/min onto a C18 trap column (1.5 cm, 360-µm outer diameter, 150-µm inner diameter, Reprosil-Pur 120 Å, 5 µm, C18-AQ, Dr. Maisch GmbH, Ammerbuch-Entringen, Germany). Retained peptides were eluted and separated on an analytical C18 capillary column (15 cm, 360-µm outer diameter, 75-µm inner diameter, Reprosil-Pur 120 Å, 5 µm, C18-AQ, Dr. Maisch GmbH) at a flow rate of 300 nl/min with a gradient from 7.5 to 37.5% ACN in 0.1% formic acid for 60 min. Typical MS conditions were as follows: spray voltage of 1.8 kV, heated capillary temperature of 150°C, and normalized CID collision energy of 37.5% for MSMS in the LTQ. An activation q = 0.25 and activation time of 30 ms were used. The mass spectrometer was operated in the data-dependent mode to automatically switch between MS and MSMS acquisition. Survey full-scan MS spectra (from m/z 350 to 2000) were acquired in the orbitrap with resolution R = 30,000 at m/z 400 (after accumulation to a "target value" of 1,000,000 in the orbitrap).
The five most intense ions were isolated sequentially and fragmented in the linear ion trap using CID at a target value of 100,000. For all measurements with the orbitrap detector a lock mass ion from ambient air (m/z 445.120025) was used for internal calibration. For high mass data-dependent mode, the mass range for selecting MS data-dependent masses was 2154-1,000,000 and 3568-1,000,000 for SUMO1 and SUMO2/3, respectively, using m/z values as masses. For protein identification, all MSMS spectra were searched against a Swiss-Prot database using MASCOT with the following parameters: mass tolerance of 10 ppm in MS mode and 0.8 Da in MSMS mode; allow up to two missed cleavages; consider methionine oxidation and cysteine carboxyamidomethylation as variable modifications. The sequence of the protein of interest was manually saved to a FASTA file, and ChopNSpice was used to create a new FASTA file with the following parameters: spice species was H. sapiens; spice sequences were SUMO1 and SUMO2, respectively; spice site was KX; spice mode was once per fragment; include unmodified fragments in output; enzyme was trypsin (Lys/Arg, do not cleave at Pro); allow up to three protein miscleavages; allow up to one miscleavage in the "spice sequence"; output formatting was FASTA (single protein sequence); mark all cleaved sites ("J"); retain comments in FASTA format without line breaks in FASTA output. For sumoylated site identification with MASCOT or SEQUEST, all MSMS spectra were searched against the new FASTA file created by ChopNSpice with the following parameters: mass tolerance of 10 ppm in MS mode and 0.8 Da in MSMS mode; allow zero missed cleavages; consider methionine oxidation and cysteine carboxyamidomethylation as variable modifications; the enzyme was set to cleave N- and C-terminally to J for MASCOT, whereas no enzyme was used for SEQUEST.
If the search was performed with the in-house MASCOT server, the file "quant_subs.pl" must be changed from J ≥ 0 to J ≥ 0.05 in line 3653. All MSMS spectra were confirmed manually to identify the SUMO acceptor site. The symbol of the amino acid that was before and after the identified SUMO-conjugated peptide must be J. All high abundance peaks had to be assigned to y- or b-ion series.

RESULTS
ChopNSpice-A typical work flow in MS-based proteomics comprises digestion of proteins with endoproteinases, separation of the generated peptides by LC, and ionization and subsequent fragmentation of the peptides. Finally, automated searching of the fragment spectra against a database allows identification of the corresponding protein (for a review, see Aebersold and Mann (32)). Identification of post-translational modifications by MS requires, in addition to a highly accurate mass determination of the precursor, sequencing of the peptide that contains the modification.
Accordingly, our approach to identify SUMO acceptor sites is based on the fragmentation pattern of conjugated sumoylated peptides after digestion with trypsin. Such digestion results in peptides in which a missed (i.e. non-cleaved because of SUMO modification) lysine residue is branched with a SUMO tryptic peptide (Fig. 1A). In practice, we and others observed that the MSMS fragmentation of such a branched peptide is similar to the fragmentation of a linear tryptic peptide that has a miscleaved lysine residue and the SUMO peptide at its N terminus (Fig. 1, A and B) (21,33). Identification of SUMO acceptor lysines using such MSMS spectra in a database search is only possible when the peptide sequences within the database are also modified by SUMO. However, available search engines for experimental fragment spectra do not include SUMO as a putative modification at lysine residues. Simple addition of the molecular weight of the tryptic SUMO fragment to that of a lysine residue within the target protein, without obtaining sequence information, would generate a large number of false positive hits in database searches. In addition, because sumoylation can theoretically occur at every lysine residue within a protein, manual construction of such artificial peptides is a time-consuming process. Accordingly, we generated an algorithm to automate the generation of such SUMO-modified FASTA sequences of proteins in silico ( Fig. 2A). Subsequently, the novel FASTA sequences are implemented in a database search with commonly used search engines to identify acceptor sites for SUMO conjugation (Fig. 2B).
More specifically, the FASTA sequence of a putatively sumoylated protein is "chopped" into tryptic fragments (allowing 0, 1, 2, or n missed cleavages). The tryptic "spice" sequence (e.g. tryptic peptides from SUMO1 or any other ubiquitin-like protein) is attached to the N terminus of each tryptic peptide that contains a Lys as a missed cleavage site. It is of note that the ubiquitin-like proteins are also allowed to contain 0, 1, 2, or n miscleavage(s). To prevent the appearance of non-natural peptides, a virtual amino acid, J, is attached to the C terminus of each tryptic fragment before ligation of the generated tryptic fragments into one large FASTA sequence. This large artificial protein sequence is submitted to the database search in which the virtual cleavage site J is recognized by an artificial endoproteinase that directly cleaves N- and C-terminally to J to generate the tryptic fragments for the selected missed cleavages. Subsequently, the SUMO acceptor site can be identified by using the applied search engine (e.g. MASCOT, X!Tandem, or SEQUEST). A work flow to set up a modified FASTA sequence in which certain proteins (or entire databases) can be modified by a user-defined modifier is implemented in the program ChopNSpice.
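The chopping and spicing steps described above can be illustrated with a minimal Python sketch. This is not the ChopNSpice implementation (which is written in PHP); the function names, the toy substrate and modifier sequences, and the rule that a fragment is spiced whenever it contains a lysine that is not its C-terminal residue are assumptions made for illustration. Only the EQIGG remnant of yeast Smt3 is taken from the text.

```python
def tryptic_sites(seq):
    """Positions after which trypsin cleaves: after Lys/Arg, but not before Pro."""
    return [i + 1 for i, aa in enumerate(seq[:-1])
            if aa in "KR" and seq[i + 1] != "P"]


def chop(seq, max_missed):
    """Tryptic fragments of seq with 0..max_missed missed cleavages."""
    bounds = [0] + tryptic_sites(seq) + [len(seq)]
    last = len(bounds) - 1
    frags = []
    for i in range(last):
        for j in range(i + 1, min(i + 1 + max_missed, last) + 1):
            frags.append(seq[bounds[i]:bounds[j]])
    return frags


def spice_fragments(modifier, max_missed):
    """C-terminal tryptic peptides of the modifier with 0..max_missed miscleavages."""
    starts = ([0] + tryptic_sites(modifier))[::-1]
    return [modifier[s:] for s in starts[:max_missed + 1]]


def chop_n_spice(protein, spices, max_missed=3, keep_unmodified=True):
    """Concatenate plain and spiced fragments, each followed by the virtual residue J."""
    out = []
    for frag in chop(protein, max_missed):
        if keep_unmodified:
            out.append(frag)
        # Assumed rule: spice once per fragment if it holds a non-C-terminal Lys.
        if "K" in frag[:-1]:
            out.extend(spice + frag for spice in spices)
    return "".join(frag + "J" for frag in out)


# Hypothetical toy input; the modifier merely ends in the yeast remnant EQIGG.
substrate = "MTKAEKGR"
modifier = "AAKCCREQIGG"
spices = spice_fragments(modifier, max_missed=0)
database_entry = chop_n_spice(substrate, spices, max_missed=1)
print(database_entry)
```

In this sketch each spiced entry is a linear peptide whose mass equals that of the branched conjugate, and the J residues allow a search engine to release exactly the generated fragments without further in silico digestion.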
In practical terms, after enrichment of endogenous SUMO-conjugated proteins or proteins sumoylated in vitro, putative SUMO substrates are identified by a standard MS-based protein identification; i.e. samples are digested with trypsin, and the tryptic fragments are separated by LC, detected, and sequenced by MS. Corresponding proteins in the sample are identified by (i) the highly accurate mass of the peptide and (ii) searching the fragment spectra against a database using e.g. MASCOT, X!Tandem, or SEQUEST as search engine. A second MS and MSMS analysis under "high mass" conditions is performed where only those precursors are selected for sequencing that exceed a certain size, i.e. ≥2154 Da for SUMO-1 and ≥3568 Da for SUMO-2/3 (see also below).
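The size criterion for the "high mass" analysis can be expressed as a simple filter on the neutral precursor mass. The sketch below is an offline illustration only and does not reproduce the instrument settings; the function names and the example precursors are hypothetical, whereas the two thresholds are those given in the text.

```python
PROTON = 1.007276  # mass of a proton in Da

# Thresholds from the text (Da).
MIN_MASS = {"SUMO1": 2154.0, "SUMO2/3": 3568.0}


def neutral_mass(mz, charge):
    """Neutral monoisotopic mass from the m/z of an [M + zH]z+ ion."""
    return (mz - PROTON) * charge


def select_high_mass(precursors, min_mass):
    """Keep (m/z, charge) pairs whose neutral mass reaches the threshold."""
    return [(mz, z) for mz, z in precursors if neutral_mass(mz, z) >= min_mass]


# Hypothetical precursors as (m/z, charge) pairs.
precursors = [(600.0, 2), (800.0, 3), (1000.0, 4)]
print(select_high_mass(precursors, MIN_MASS["SUMO1"]))
print(select_high_mass(precursors, MIN_MASS["SUMO2/3"]))
```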
Once one or several putatively sumoylated proteins have been identified in both the analyses after merging the data/results, MS and MSMS data are resubmitted for search against the database containing the virtual sumoylated protein sequence generated by ChopNSpice (Fig. 2B). In a subsequent experiment, the same sample can be reinvestigated by extended/modified LC-MSMS analysis to identify the SUMO acceptor site(s).
Note that both of the search engines used in this study (MASCOT and SEQUEST) have some shortcomings. MASCOT for instance does not efficiently search fragment spectra that contain fragment ions with a charge state higher than 2; as a consequence, larger sumoylated peptides with a charge state of 4+ show a very low score in MASCOT searches or are not identified at all (data not shown). This problem can be circumvented by using either SEQUEST or other search engines (e.g. X!Tandem) or, alternatively, by using the software tool Raw2msn to deconvolute the higher charge states of the fragment ions in the raw data to singly charged fragment ions for MASCOT search (34). However, a prerequisite for deconvolution is that MSMS spectra (generated either by CID or by high energy collision-induced dissociation) are recorded in the FT analyzer/detector of the orbitrap with sufficient resolution for charge state recognition, and this in turn decreases sensitivity (35). A comparison between the different systems for processing raw data and the different detection modes of the orbitrap mass spectrometer is shown in supplemental Fig. S1. SEQUEST on the other hand does not allow for cleavage with endoproteinase both N- and C-terminally to J but rather either N-terminally or C-terminally. Therefore, cleavage of the FASTA sequence is performed unspecifically; i.e. no enzyme is used in silico, and matched spectra are validated manually. Confidence in the results from the search engine is achieved by the high mass accuracy of the orbitrap instrument (<10 ppm) and by the fact that the validated sequence must be preceded or followed by the virtual amino acid J. Furthermore, all the abundant fragment ions must be assigned to y- and/or b-ion series. However, as a very simple alternative, the single concatenated peptide sequences can be submitted to the database without merging them into a single new FASTA sequence.
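Two of the validation criteria named above, the precursor mass accuracy and the requirement that a matched sequence be bounded by the virtual residue J, lend themselves to a simple automated pre-check before manual inspection. The following sketch is an assumption about how such a check might look; the function names are hypothetical, and the start of the concatenated sequence is treated as equivalent to a preceding J.

```python
def ppm_error(observed, theoretical):
    """Relative mass deviation in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def within_tolerance(observed, theoretical, tol_ppm=10.0):
    """True if the precursor mass deviates by no more than tol_ppm."""
    return abs(ppm_error(observed, theoretical)) <= tol_ppm


def flanked_by_j(database_seq, peptide):
    """True if some occurrence of peptide is bounded by J on both sides
    (the start of the sequence counts as a left boundary)."""
    start = database_seq.find(peptide)
    while start != -1:
        end = start + len(peptide)
        left_ok = start == 0 or database_seq[start - 1] == "J"
        right_ok = end < len(database_seq) and database_seq[end] == "J"
        if left_ok and right_ok:
            return True
        start = database_seq.find(peptide, start + 1)
    return False


# Hypothetical J-delimited entry and candidate matches from a no-enzyme search.
entry = "MTKJMTKAEKJEQIGGMTKAEKJ"
for candidate in ("EQIGGMTKAEK", "GGMTKAEK"):
    print(candidate, flanked_by_j(entry, candidate))
```

A match such as GGMTKAEK, which a no-enzyme search could in principle return, is rejected here because it begins inside a generated fragment rather than at a J boundary.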

Identification of SUMO Conjugation Sites in Vitro-To validate our approach, we subjected RanGAP1, Sp100, and p53 to an in vitro sumoylation reaction with SUMO1 (Fig. 3, A-C) and SUMO2 (data not shown). Proteins migrating on SDS-PAGE with a higher apparent molecular weight than the original proteins were considered to be sumoylated and were processed by LC-MSMS as described above. For identification of sumoylated peptides, we first tested SUMO as a variable modification of lysines (2154 Da for SUMO1 and 3568 Da for SUMO2) using two commonly used peptide identification tools, MASCOT and SEQUEST. However, like other groups (24), we were unable to identify any sumoylated peptides by the standard LC-MS proteomics and subsequent database search (see supplemental Data S1). Although manual identification of SUMO conjugation sites was possible, it required extensive searching in the MS spectra for modified peptides (17). In contrast, by using the ChopNSpice software on the identified protein sequences and subsequent database search with MASCOT and SEQUEST, we readily identified SUMO modification of RanGAP1 on lysine 526, of p53 on lysine 386, and of Sp100 on lysine 297 (Fig. 3, A-C). In addition, we observed several minor acceptor sites, also observed by others (Fig. 3, A-C, and supplemental Table S1; corresponding MASCOT search results and annotated spectra for RanGAP1, Sp100, and p53 are listed in supplemental Data S2) (36-38). Furthermore, we discovered that numerous, so far unidentified, lysine residues within the SUMO E1 activating enzyme Uba2 are conjugated with SUMO1 and SUMO2 (Fig. 3D and supplemental Data S2 and Table S1). Consistent with the identification of multiple acceptor sites, mutations of single lysine residues within Uba2 did not significantly impair its sumoylation (data not shown).

FIG. 2 legend (continued). ... digested with endoproteinases and analyzed by LC-MSMS. The corresponding proteins are identified by a database search using search engines (MASCOT and/or SEQUEST). Putatively sumoylated protein sequences are "chopped and spiced" (see A), and the spiced FASTA sequences are added to the database. The search with the search engine is repeated to identify the sumoylated peptide with its corresponding acceptor site (see text for details).
Increasing Sensitivity Using High Mass Acquisition-In earlier work, we mapped two SUMO conjugation sites within USP25: one site (lysine 141) was identified using a mutagenesis approach, whereas the other (lysine 99) was identified using an MS approach. It is of note that in our previous study we used a small fragment of USP25 that was conjugated with SUMO2 in bacteria followed by purification by gel filtration and anion-exchange chromatography (17). However, manual examination of full-length USP25 sumoylated in vitro did not reveal any SUMO acceptor site. To test whether our ChopNSpice method has an increased sensitivity to identify the acceptor sites of this more complex sample, we conjugated full-length USP25 with SUMO2 in vitro, using the E3 ligase PIASXα, as described previously (17). Next, the mixture was digested with trypsin in solution. Subsequently, to increase sensitivity for the identification of SUMO acceptor sites, we also used high mass MSMS acquisition conditions (Fig. 4A, compare the standard (upper panel) with the high mass (lower panel)). Under these conditions, only peptides with a mass exceeding 2154 Da (for SUMO1) or 3568 Da (for SUMO2/3) are selected (see above). This approach is highly suitable for the accurate detection and sequencing of larger peptides and additionally facilitates detection of lower abundance SUMO conjugates (see also Fig. 4A and below). A database search against modified sequences (achieved by the program ChopNSpice in combination with MASCOT) demonstrated that sumoylated peptides were enriched by high mass MSMS acquisition (supplemental Data S3 and Table S2). Using this strategy, we went on to identify several additional SUMO acceptor sites within full-length USP25 (supplemental Data S3 and Table S2), including lysine 141, which had previously been identified only by a mutational approach. In addition, we observed lysine 5 in SUMO2 as an acceptor site for chain formation, consistent with a previous report (21).
Identification of SUMO-conjugated Sites in Vivo-Although the identification of SUMO conjugation sites in endogenous proteins from yeast has been performed before (18), unbiased identification of SUMO acceptor sites in higher eukaryotes has remained a technical challenge. This can partly be accounted for by the high mass of SUMO after hydrolysis with trypsin in higher eukaryotes combined with the low abundance of post-translational modifications per se as compared with the amount of non-modified protein. Additionally, chemical enrichment for SUMO modifications prior to MS, as is available for instance for phosphorylation (39-41), has not been described.
To examine the power of our strategy for identification of SUMO conjugation sites, we purified endogenous SUMO1 conjugates from HeLa cells (Fig. 5A). Although the overall protein composition in the immunoprecipitation of SUMO1 conjugates seems indistinguishable from the control immunoprecipitation in Coomassie (Fig. 5A), the Western blot clearly demonstrates enrichment of SUMO1 conjugates in the immunoprecipitation (Fig. 5A, left panel). The gel was cut into slices, and the proteins specifically present in the SUMO1 immunoprecipitation were identified by LC-MSMS (supplemental Table S3). One of the most prominent SUMO1 conjugates was found at 90 kDa and represents RanGAP1 conjugated with SUMO1 (30). By applying our ChopNSpice approach (Fig. 2B), we were able to identify conjugation of lysine 524 in endogenous RanGAP1 with endogenous SUMO1 (Fig. 5B). Importantly, in a subsequent experiment, we could additionally identify SUMO acceptor lysine residues in SUMO1, SUMO2/3, Ubc9, RanBP2, and others. Although several of these proteins were known as SUMO targets, the SUMO acceptor sites within RanBP2 have not been described before. Interestingly, in the SUMO1 immunoprecipitate we also observed SUMO2 conjugated to SUMO2 on lysine 11 (Table I, supplemental Table S4, and supplemental Data S4 for annotated raw MS and MSMS spectra). Thus, our MS approach proved to be highly reliable, and it easily and specifically identified SUMO acceptor sites both in vitro and in vivo. Thereby, our method increases the sensitivity of the identification of SUMO conjugation sites in mammalian cells.

FIG. 5. Identification of SUMO acceptor sites in endogenous proteins.
A, SUMO1-conjugated proteins were isolated from HeLa cells using SUMO1 antibodies (Ab) coupled to protein G-agarose or control (Ctr) protein G-agarose. Immunoprecipitates were extensively washed and eluted with sample buffer. Five percent of the sample was loaded to detect SUMO1-conjugated species by Western blot; the rest of the sample was used to identify SUMO acceptor sites by MS. RanGAP1 conjugated with SUMO1 is indicated by the arrows. B, MSMS CID spectrum of a tryptic peptide (m/z = 962.2370) derived from RanGAP1 encompassing positions 516-530 with fragment ions recorded in the FT analyzer of the Orbitrap. MSMS in combination with database searches of the modified RanGAP1 sequence (by ChopNSpice) confirmed the known Lys 524 as the actual SUMO site. y- and b-type ions are shown in the spectrum and at their respective positions in the conjugated peptide. XCorr is the score in the database search using SEQUEST as search engine.

TABLE I In vivo sumoylated proteins derived after immunoprecipitation from HeLa cells using anti-SUMO1 antibody (see "Experimental Procedures")
The sequence of the sumoylated peptide and the positions of SUMO acceptor sites as determined by MS and MSMS using ChopNSpice in combination with MASCOT and SEQUEST as search engines are listed. For details, see supplemental Table S4. Dashes (-) indicate that the corresponding sumoylated peptide or its actual conjugation position could not be identified and thus was not scored by the search engines.

DISCUSSION
In this study, we present a freely available computational approach to identify post-translational modifications by mass spectrometry that cannot easily be explored by using common search engines such as SEQUEST and/or MASCOT. We demonstrate that our approach is of value in MS-based analysis and subsequent database search for the identification of SUMO conjugation sites within proteins that have been sumoylated either in vitro or in vivo.
In particular, mammalian sumoylated proteins and peptides present a challenge in MS-based detection. In contrast to yeast (S. cerevisiae), where after digestion of sumoylated proteins only an EQIGG peptide (484 Da) is conjugated to its respective SUMO acceptor, the large tryptic fragments of mammalian SUMO1 (2154 Da) and SUMO2/3 (3568 Da) are not easily identified in MS. These difficulties are in part due to the presence of long peptide conjugates, which resemble cross-linked peptides (but without cross-linker). Consequently, MS and MSMS result in fragment ion spectra that are too complex to interpret manually. To circumvent these problems, a mutational approach has been proposed to yield a smaller tryptic fragment of SUMO that simplifies the identification of SUMO acceptor sites by mass spectrometry (23,42). Although this method has proved efficient for the identification of SUMO acceptor sites from proteins sumoylated in vitro, the tailored SUMO proteins may be conjugated/deconjugated less efficiently in vivo. Another MS-based method that has been utilized to identify SUMO acceptor sites is a software tool (SUMmOn) designed to interpret the complex fragment ion pattern that allows one to work with low accuracy mass spectrometers (24). In that study as well, relatively simple in vitro conjugation mixtures were examined, whereas more complex samples from in vivo experiments are expected to cause problems in the unambiguous identification of SUMO acceptor sites. We also have used the SUMmOn pattern recognition software to identify SUMO acceptor sites in proteins sumoylated in vitro and in vivo. In fact, the analysis of our raw data with SUMmOn delivered a similar, but smaller set of sites compared with ChopNSpice in conjunction with a MASCOT-based database search (see supplemental Table S5). In addition, another software tool (Ubl finder) is available, but it suffers from the weakness that only ubiquitin and SUMO (T95R) mutants can be searched (23). By making use of highly accurate and resolving MS techniques, Matic et al. used an in vitro to in vivo approach (21). In vitro sumoylated proteins were analyzed for SUMO acceptor sites in an Orbitrap mass spectrometer and were subsequently confirmed in vivo.
We followed a different approach and combined high end MS with a commonly used database search that was slightly modified. The prerequisite for the detection of post-translational modifications per se by MS is the unambiguous identification of the site of modification within the peptide. This in turn requires MSMS sequence analyses and subsequent database searches using search engines that compare the m/z values of experimental data (i.e. the MSMS fragment spectra) with the m/z values generated in silico. In this manner, (post-translational) modifications that are attached to any amino acid can also be identified through the extra mass of the modification that is added to all the respective amino acids in the database. In a similar manner, putative ubiquitylation sites after tryptic digestion (GG diamino acid conjugated to its acceptor site) can be identified with available search engines. Nonetheless, even highly accurate MS analysis can lead to false positive identification when only the exact mass of the modification is taken into account. For example, it has recently been reported that iodoacetamide-induced artifacts mimic ubiquitylation in mass spectrometry (22). Thus, it is of the utmost importance, in particular when one is dealing with longer conjugates, to obtain sequence information not only from the substrate peptide but also from the modifier. However, although search engines are capable of taking experimental parameters (e.g. proteases used and modifications) into account, they rely solely on databases that contain putative protein sequences for identification and, in the case of modifications, the extra mass added to a particular amino acid. Search engines such as MASCOT and/or SEQUEST are commonly used by proteomics researchers, and the output formats of these search engines (including their scoring systems) are widely accepted in the community.
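The risk of relying on exact mass alone can be made concrete with a short calculation. The sketch below assumes that the artifact reported in (22) arises from a double alkylation of lysine by iodoacetamide; this attribution and the elemental compositions used are supplied here for illustration and are not taken from the text. Under that assumption, two carbamidomethyl adducts have the same elemental composition, and hence the same exact mass, as the GG remnant of ubiquitin.

```python
# Monoisotopic masses of the elements (Da).
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}


def mono_mass(formula):
    """Monoisotopic mass of an elemental composition given as {element: count}."""
    return sum(MONO[element] * count for element, count in formula.items())


# Assumed compositions: a glycine residue and a carbamidomethyl adduct
# are both C2H3NO.
glycine_residue = {"C": 2, "H": 3, "N": 1, "O": 1}
carbamidomethyl = {"C": 2, "H": 3, "N": 1, "O": 1}

gg_remnant = 2 * mono_mass(glycine_residue)
double_alkylation = 2 * mono_mass(carbamidomethyl)

print(round(gg_remnant, 4), round(double_alkylation, 4))
print("mass difference:", gg_remnant - double_alkylation)
```

Because the two species are isobaric at any mass accuracy, only fragment ions that report on the modifier itself can distinguish them, which is the argument made above for sequencing the modifier.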
To that end, we developed a software tool that makes use of these search engines and adds new modified protein sequences (sumoylated sequences) to the standard databases, against which standard MS search engines can then search, and we have made the new tool freely available.
The program ChopNSpice for the identification of SUMO acceptor sites is unique in its ability to allow the user (i) to combine two protein sequences in a linear manner, (ii) to generate any modified linear protein sequence that contains any modifications at the N terminus of the novel fused sequence, (iii) to introduce defined extra masses in either of the two protein sequences so that also peptide-peptide crosslinks (using a cross-linking reagent) after tryptic digestion of cross-linked proteins can be searched and identified, and (iv) to generate an m/z list of all linearly fused peptides. The latter is particularly useful when users do not have access to e.g. an Orbitrap mass spectrometer but instead would like to use a simple peptide mass fingerprint analysis by MALDI MS of putatively sumoylated proteins. In addition, the list serves as an inclusion list in LC-MSMS analysis such that predicted modified (e.g. sumoylated) peptides are chosen for fragmentation within the mass spectrometer.
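The m/z list mentioned under (iv) can be illustrated as follows. This is a minimal sketch and not the output routine of ChopNSpice; the function names, the choice of charge states, and the example peptide are assumptions. The residue masses are standard monoisotopic values, and the mass of a linearly fused peptide equals that of the corresponding branched conjugate because both contain the same residues plus one water.

```python
# Standard monoisotopic residue masses (Da).
RESIDUE = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def peptide_mass(seq):
    """Neutral monoisotopic mass of an unmodified linear peptide."""
    return sum(RESIDUE[aa] for aa in seq) + WATER


def mz(mass, charge):
    """m/z of the [M + zH]z+ ion."""
    return (mass + charge * PROTON) / charge


def mz_list(peptides, charges=(2, 3, 4)):
    """(peptide, charge, m/z) tuples, e.g. for use as an inclusion list."""
    return [(pep, z, round(mz(peptide_mass(pep), z), 4))
            for pep in peptides for z in charges]


# Hypothetical J-delimited entry; fragments are recovered by splitting at J.
entry = "AEKGRJEQIGGAEKGRJ"
fused = [frag for frag in entry.split("J") if frag]
for row in mz_list(fused):
    print(row)
```

As a consistency check, the residue masses of EQIGG sum to about 484.23 Da, in agreement with the 484 Da quoted in the text for the Smt3 remnant.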
A similar strategy for the generation of concatenated peptides (proteins) has been discussed in conjunction with the analysis of protein-protein cross-linking MS data (33), but to date, no software is publicly available to facilitate the generation of the required FASTA library files, and Maiolica et al. (33) did not describe how the user should generate a dedicated database containing concatenated peptides. Against this background and for the first time, our approach provides a broad community with the possibility to generate every type of FASTA sequence, including various modifications, that can then be used for a database search using common search engines, if required, in a high throughput approach. For the latter, entire databases (e.g. Swiss-Prot human) can be modified with ChopNSpice to generate e.g. sumoylated proteins from each entry. In addition to this feature, a number of modified databases (e.g. sumoylated Swiss-Prot human) are available via the ChopNSpice web site for added convenience.
We further show that the database search of the MSMS fragment spectra (values), including the modified linear sequence(s), is highly specific. Importantly, no hits with MASCOT or SEQUEST were obtained when the modifier sequence was reversed and attached to the C terminus of the tryptic peptides (data not shown). Moreover, a search against the human Swiss-Prot database in which all proteins were modified with SUMO1 and SUMO2/3 by ChopNSpice gave the same hits for a distinct sumoylated protein as in a search where only the protein sequence of interest was modified with ChopNSpice and submitted to the Swiss-Prot database (data not shown). As we aim to reach a broad proteomics community by this approach, we determined the rate of false positives in a decoy database search, finding it to be ≤0.33% (see supplemental Table S6), and thus demonstrate that our approach can be applied to shotgun proteomics projects. Importantly, the false positive rate remains low because of the applied proteomics work flow (see "Results"), although in some cases, we observed mainly product ions of the SUMO peptide and less of the product ions derived from the acceptor peptide (see supplemental Data S3).
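For readers unfamiliar with decoy searches, the following sketch shows one common way to estimate a false positive rate. The text does not specify how the decoy database was built or how the rate was computed, so the sequence reversal and the ratio of decoy to target hits used here are assumptions, and the counts in the example are hypothetical rather than taken from supplemental Table S6.

```python
def decoy_sequence(seq):
    """Assumed decoy construction: the reversed sequence."""
    return seq[::-1]


def false_positive_rate(n_decoy_hits, n_target_hits):
    """Assumed estimate: decoy hits divided by target hits above the same score cutoff."""
    if n_target_hits <= 0:
        raise ValueError("at least one target hit is required")
    return n_decoy_hits / n_target_hits


# Hypothetical counts for illustration only.
print(decoy_sequence("EQIGG"))
print(f"{false_positive_rate(1, 200):.2%}")
```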
In summary, here we present an approach to identify SUMO acceptor sites in endogenous proteins by mass spectrometry in a rapid and sensitive manner, and we describe examples of its successful application. We believe that this approach has the potential to be widely used mainly because (i) the necessary software for the generation of modified protein sequence (ChopNSpice) is provided, (ii) it uses established search engines for protein identification, and (iii) it facilitates the identification of sites of modification in large immunoprecipitation studies and shotgun approaches. Importantly, the idea of the generation of novel modified sequences is not restricted to ubiquitin modifiers or Ubls but can be applied to any type of (user-defined) modification.