Sensitive and Specific Identification of Wild Type and Variant Proteins from 8 to 669 kDa Using Top-down Mass Spectrometry

Top-down and bottom-up mass spectrometry methods can generate gas phase fragments and use these to identify proteins. Top-down methods, in addition, can provide the mass of the protein itself and therefore additional structural information. Despite the conceptual advantage of top-down methods, the market share advantage belongs to bottom-up methods as a result of their more robust sample preparation, fragmentation, and data processing methods. Here we report improved fragmentation and data processing methods for top-down mass spectrometry. Specifically we report the use of funnel-skimmer dissociation, a variation of nozzle-skimmer dissociation, and compare its performance with electron capture dissociation. We also debut BIG Mascot, an extended version of Mascot with incorporated top-down MS2 search ability and the first search engine that can perform both bottom-up and top-down searches. Using BIG Mascot, we demonstrated the ability to identify proteins 1) using only intact protein MS1, 2) using only MS2, and 3) using the combination of MS1 and MS2. We correctly identified proteins with a wide range of masses, including 13 amyotrophic lateral sclerosis-associated variants of the protein Cu/Zn-superoxide dismutase, and extended the upper mass limit of top-down protein identification to 669 kDa by identifying thyroglobulin.

Our objectives here are (i) to help develop an extended version of Mascot that incorporates top-down MS 2 search ability and (ii) to infer its search sensitivity and selectivity. Here we present BIG Mascot, the first general use search engine that can perform both bottom-up and top-down MS 2 searches. Mascot has been one of the most widely used database search engines because of its probability-based scoring method (28), its superior performance (37,38), and its fast operating algorithm. However, the applicability of Mascot to top-down work flows is severely limited as the commercially available version has a precursor ion mass limit of 16 kDa. Using a version of Mascot referred to as BIG Mascot that was modified to allow precursor ions of up to 110 kDa, we demonstrate that either intact protein MS 1 or MS 2 by themselves yield sufficient information for identification of many proteins. To identify variants of a single protein, including mutations or PTMs, however, the masses of both the intact protein and the fragments are required. Finally by identifying the protein thyroglobulin using a single step analysis, we extend the mass of intact protein identification beyond the previous limit of 229 kDa (9) to 669 kDa.
BSA (66 kDa), bovine thyroglobulin (669 kDa as a dimer under native conditions (42-45)), and equine myoglobin (17 kDa) were purchased from Sigma-Aldrich. UCHL-1 and DJ-1 were expressed and purified as described elsewhere (for UCHL-1, see Ref. 46; for DJ-1, see Ref. 47). Proteins were desalted using an Amicon Ultra-4 centrifugal membrane filter (Millipore) with 5-kDa nominal molecular mass limit (SOD1 variants were found to aggregate under these same conditions, hence the difference in desalting methodologies). An initial wash using 10 mM ammonium bicarbonate was performed followed by centrifugation. Three more wash steps were performed using HPLC grade water. Amicon-desalted samples were diluted to a final concentration of 1 μM (ubiquitin and myoglobin) or 10 μM (thyroglobulin) in 50% acetonitrile, 0.1% formic acid. A separate thyroglobulin aliquot was diluted in 10 mM ammonium acetate (to maintain dimeric quaternary structure), pH 6.8, for additional analysis.
MS1 and MS2 Analyses-Samples were directly infused using a syringe pump and nanosprayed into a dual ion funnel (48) (Apollo II) ion source connected to a hybrid quadrupole FT-ICR mass spectrometer (9.4 teslas, apex Qe-94, Bruker Daltonics Inc., Billerica, MA). The capillary temperature was set at 160 °C. The instrument is equipped with a hollow cathode (49) for ECD. External calibration of the m/z scale was performed using electrospray tuning mixture (part number G2431A) from Agilent (Santa Clara, CA) using peaks at 622, 922, 1522, and 2122 m/z.
After desolvation the ions were transferred from a source hexapole to the quadrupole region where isolation and/or transmission into a second hexapole (collision cell) was performed. Ions accumulated in the second hexapole were then transferred through the ion optics into the ICR cell where fragmentation via ECD and/or detection was performed. For certain ECD experiments, isolation and activation of a single charge state in the ICR cell was performed that involved a 105-V peak to peak, 1000-Hz off-resonant, 4-ms sustained off-resonance irradiation (SORI) during the ECD pulse (4,5,33,50).
With the exception of thyroglobulin, all intact proteins were observed, and masses were determined from deisotoped and reconstructed mass spectra (Data Analysis, Bruker Daltonics Inc.). For fragmentation, funnel-skimmer dissociation or ECD was used. ECD involved a filament bias of 1-4 V and pulse duration of 2-4 ms. What we term funnel-skimmer dissociation is analogous to nozzle-skimmer dissociation (51) and involved modulating the source declustering potential. Because the ambient pressure in the first stage of the dual stage funnel is ~3 torr, the momentum necessary to collisionally activate ions exiting the source capillary cannot be achieved, thereby preventing CAD at the location analogous to the traditional "nozzle." Instead the voltage of the ion optic labeled skimmer 1 is raised significantly to accelerate ions into the second stage of the ion funnel stack where the pressure is a few millitorr. Although this effectively creates a shallower direct current voltage gradient across the first stage of the radio frequency funnel, no impact on sensitivity is observed. Performing CAD in the second funnel stage is advantageous because the funnel has a wide acceptance angle, which mitigates scattering effects. Thus we favor the more descriptive term funnel-skimmer dissociation because dissociation takes place in the absence of a nozzle and in the presence of an ion funnel. The voltage of skimmer 1 was increased 2-fold (80 V) or more (120 V for BSA and 100 V for thyroglobulin). Skimmer 1 voltages of 40 and 80 V correspond to ΔSF (skimmer 1 voltage − funnel 2 voltage) of 33 and 83 V under our conditions. Ion optics voltages downstream of the skimmer 1 electrode were not modified for these experiments.
Database Searches-BIG Mascot was used for all database searches described. Its graphical user interface and options are identical to that of Mascot. BIG Mascot benefits from a precursor ion mass limit of 110,000 Da instead of 16,000 Da. This feature enabled us to submit the intact protein mass in the place of precursor mass in Mascot generic files. As a result of a BIG Mascot search, the user obtains one "peptide" identified per protein with a sequence coverage of 100%. In this study we accepted the highest ranking protein as the BIG Mascot identification of our "unknown." We did not filter our results according to significance thresholds reported by BIG Mascot.
Because BIG Mascot retains the Mascot graphical user interface, it requires that an enzyme be specified. The enzyme definition encodes a set of operations that will be performed in a sequence-specific manner, e.g. "trypsin" will cleave a primary sequence following Arg and Lys residues. We therefore developed an enzyme definition named "IntactProtein" that essentially does nothing other than cleave the N-terminal methionine. Specifically we altered the enzyme definitions file of the BIG Mascot server to include the following code: "Cleavage:M, Restrict:ABCDEFGHIJKLMNOPQRSTUVYZ, Cterm." We report significance thresholds from searches with IntactProtein because our proteins were intact.
For searches in which the protein is larger than 110,000 Da, the intact protein mass is not detected, or the protein is modified in a way that is not accounted for in the database, we used what we term an MS2-only search. Normally BIG Mascot requires a precursor ion mass input. We bypassed this requirement by specifying a nominal 10,000-Da precursor mass and ±1% (maximum) error tolerance and selecting "None" as the enzyme specificity. This setup allows an MS2-only search of all sequences and subsequences of proteins larger than 9,900 Da (the 10,000 − 100 Da difference arises from the tolerance window). In other words, the nominal 10-kDa precursor mass specified acts as the lower mass limit of the protein (an MS2-only search with a 10-kDa nominal precursor mass would not identify 8-kDa ubiquitin; thus we used a 5-kDa nominal mass for ubiquitin). In summary, MS2-only searches do not require knowledge of the intact protein mass, only a nominal minimum protein mass. Searches using None as the enzyme definition yielded the same scores and matches but much larger significance thresholds for BIG Mascot scores (between 60 and 70 instead of 10-20) because the number of sequences searched increases immensely with the inclusion of subsequences. Note that MS2-only searches specifying any precursor mass from 10 to 110 kDa (the limit of BIG Mascot) yielded thyroglobulin as the first hit; the critical parameter for MS2-only searches is therefore that the specified precursor mass is lower than the database-derived mass.
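The tolerance-window arithmetic behind the MS2-only search can be sketched as follows. This is a minimal illustration rather than BIG Mascot code: the function names are hypothetical, and the reachability rule simply restates the text's observation that the lower edge of the window acts as a minimum protein mass.

```python
def precursor_window(nominal_mass_da, tolerance_percent):
    """Return the (low, high) bounds in Da of the precursor mass window."""
    delta = nominal_mass_da * tolerance_percent / 100.0
    return nominal_mass_da - delta, nominal_mass_da + delta


def is_reachable(protein_mass_da, nominal_mass_da, tolerance_percent=1.0):
    """Hypothetical helper: with enzyme specificity "None", a protein can be
    matched only if it is at least as heavy as the lower window bound,
    because only then can it (or one of its subsequences) fall in the window."""
    low, _ = precursor_window(nominal_mass_da, tolerance_percent)
    return protein_mass_da >= low


low, high = precursor_window(10_000, 1.0)
print(low, high)                      # lower bound is the 9,900 Da quoted in the text
print(is_reachable(8_000, 10_000))    # 8-kDa ubiquitin with a 10-kDa nominal mass
print(is_reachable(8_000, 5_000))     # same protein with a 5-kDa nominal mass
```

With a 10-kDa nominal mass the 8-kDa protein falls below the window, which is why a 5-kDa nominal mass was needed for ubiquitin.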
MSDB (released August 31, 2006) was augmented with sequences of 102 different ALS-associated variants of SOD1 and the sequence of GPLGS-tagged UCHL-1. All database searches utilized monoisotopic masses obtained via automated data reduction of the highly complex FT-ICR MS2 spectral data (Sophisticated Numerical Annotation Procedure (SNAP) version 2.0, Bruker Daltonics Inc.). SNAP uses a given mass, charge, and mean molecular constitution (in our case averagine (52)) to calculate an isotopic distribution and obtain a nonlinear fit to the data in a similar fashion to the THRASH (thorough high resolution analysis of spectra by Horn) algorithm introduced by Horn et al. (35). This fit is used to calculate the monoisotopic mass. Manual isotopic peak determination was also performed on all SOD1 (wild type and variants) MS2 data for verification. Default error tolerances were set to ±2 Da for MS1 data and ±0.5 Da for ECD or ±0.2 Da for funnel-skimmer dissociation. Default instrument definition was set as FTMS-ECD for ECD and ESI-FTICR for funnel-skimmer dissociation searches. The MS1 tolerance of ±2 Da does not reflect the accuracy of the instrument (10 ppm or less) but instead accounted for the partial reduction of disulfide bonds in SOD1 that we observed. N-terminal acetylation was specified as a fixed modification for all SOD1 variants.
Following these MS2 searches, an additional set of CAD "internal fragment" MS2 searches was performed. Instrument fragment type definitions of BIG Mascot were edited to allow for internal fragments of by-type as large as 30 kDa to be searched in addition to b- and y-type fragments. MS2 error tolerance was more stringent at 0.05 Da for these internal fragment searches to avoid random matches. To match this stringency, internal calibration of the CAD MS2 data using six to eight high intensity b or y fragment ions was performed. The m/z values corresponding to the theoretical monoisotopic masses and observed charge state of each fragment were calculated manually from the BIG Mascot fragmentation report of the protein.
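The manual conversion from a theoretical monoisotopic mass and an observed charge state to a calibrant m/z value can be sketched as below. The sketch assumes positive-mode ions whose charge comes entirely from added protons and that the reported mass is the neutral monoisotopic mass; both are assumptions, since the text does not spell out the convention used in the fragmentation report.

```python
PROTON_MASS = 1.007276  # Da; standard value, not taken from the source


def mz_from_neutral_mass(monoisotopic_mass, charge):
    """m/z of an ion carrying `charge` protons on a neutral species."""
    if charge <= 0:
        raise ValueError("charge must be a positive integer")
    return (monoisotopic_mass + charge * PROTON_MASS) / charge


def neutral_mass_from_mz(mz, charge):
    """Inverse conversion: neutral monoisotopic mass from m/z and charge."""
    if charge <= 0:
        raise ValueError("charge must be a positive integer")
    return charge * (mz - PROTON_MASS)


# Hypothetical calibrant: a 5000.0-Da fragment observed at charge 4+.
print(mz_from_neutral_mass(5000.0, 4))
```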
Using the same parameters, we performed additional BIG Mascot automatic decoy database searches of randomized sequences. Each random sequence has the same composition and weight as the database sequence. As a result, BIG Mascot displays the number of random sequences matched to data with a score above significance thresholds (identity and homology) and generates a separate report for decoy database search. As we have a small number of spectra, it is not useful to display false discovery rates (for our case the false discovery rate was either 0.0% (no significant random hits) or undefined (BIG Mascot search result has no significant hits)). Nevertheless in this study we report the score of the highest scoring random sequence to compare with the score of the BIG Mascot-identified protein sequence.
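A decoy with the same composition and mass as its target can be produced by shuffling the residues, as in the sketch below. This illustrates the general idea only and is not a description of how BIG Mascot generates its decoy entries internally; the function name and the seeding are assumptions made for reproducibility.

```python
import random


def shuffled_decoy(sequence, seed=None):
    """Return a randomized sequence with the same residue composition
    (and therefore the same mass) as the input sequence."""
    rng = random.Random(seed)
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


target = "ATKAVCVLKGDGPVQGIINFEQK"  # hypothetical example sequence
decoy = shuffled_decoy(target, seed=1)
print(decoy)
```

Because shuffling preserves composition, the decoy shares the intact mass of the target, so any difference in score comes from the fragment matches alone.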

Evaluation of the Upper Mass Limit of Top-down MS-A frequently raised criticism of top-down MS is the mass limit imposed by the largest detectable protein mass, which in routine operation of FTMS instruments is 70-100 kDa (53). We therefore calculated the masses of all known proteins in the IPI (54) human database version 3.31 to determine the fraction below this upper mass limit (see Fig. 1). In-house software was used to calculate the monoisotopic and average mass of each entry and to export entry numbers and corresponding masses as a text file. This file was imported into an Excel spreadsheet and plotted. Fig. 1 illustrates that 80% of human proteins weigh less than 70 kDa and are therefore currently amenable to top-down MS. Moreover the median mass of proteins in the human IPI database is 31.4 kDa, and the highest density region (mode) of the database is around 14.5 kDa. Methods are presented below that identify the protein thyroglobulin, a protein larger than all but 0.6% of all human primary sequences.
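A calculation of this kind can be sketched in a few lines. The code below is not the in-house software described above; it assumes unmodified sequences made of the 20 standard residues, uses commonly tabulated average residue masses (not values from this paper), and the function names are hypothetical.

```python
from statistics import median

# Average residue masses in Da (commonly tabulated values; an assumption here).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0513, "A": 71.0779, "S": 87.0773, "P": 97.1152, "V": 99.1311,
    "T": 101.1039, "C": 103.1429, "L": 113.1576, "I": 113.1576, "N": 114.1026,
    "D": 115.0874, "Q": 128.1292, "K": 128.1723, "E": 129.1140, "M": 131.1961,
    "H": 137.1393, "F": 147.1739, "R": 156.1857, "Y": 163.1733, "W": 186.2099,
}
WATER_AVERAGE_MASS = 18.0153  # one water per chain for the free termini


def average_mass(sequence):
    """Average mass in Da of an unmodified chain; raises KeyError on
    non-standard residue codes such as X, B, Z, or U."""
    return sum(AVERAGE_RESIDUE_MASS[r] for r in sequence) + WATER_AVERAGE_MASS


def summarize_masses(masses, limit_da=70_000.0):
    """Return (fraction of entries below limit_da, median mass)."""
    below = sum(1 for m in masses if m < limit_da)
    return below / len(masses), median(masses)


# Hypothetical masses standing in for database entries.
fraction, mid = summarize_masses([10_000.0, 30_000.0, 50_000.0, 60_000.0, 90_000.0])
print(fraction, mid)
```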
Evaluation of the Hypothetical Requirements for the Identification of Proteins Using only MS2 Data-To explore the prospect of protein identification using only fragmentation (no intact protein mass), we investigated the required quality of fragmentation data for BIG Mascot protein identification using MS2-only searches. Fig. 2 illustrates that for the proteins that we analyzed a single sequence tag with seven sequential fragments was sufficient to identify a protein when the mass accuracy of analysis is 100 ppm, whereas at 1 ppm accuracy four or five sequential fragments were sufficient. We observed that fewer fragments or lower mass accuracy decreases the specificity of identification, whereas the size of the fragments does not significantly affect the specificity of identification (1-8 kDa; data not shown). Notably consecutive C- and N-terminally derived fragments in this size range are generally well represented in experimental data. Because of the astronomical number of possible combinations, we could not perform an exhaustive evaluation of non-sequential fragments. A cursory evaluation indicated that higher search specificity could be achieved with similar numbers of non-sequential fragments such that a search with sequential fragments at 1 ppm yields specificity similar to that with non-sequential fragments searched at 10 ppm. This could be due to extra information encoded between two separate sequence tags (a b13, b14, and b15 sequence tag yields residues 14 and 15, whereas a b13, b15, and b16 sequence tag additionally yields residue 16).
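How consecutive fragments encode a sequence tag, and why mass accuracy changes its specificity, can be illustrated with a small sketch. It is not the search performed by BIG Mascot: the residue masses are standard monoisotopic values, the tolerance is taken as ppm of the heavier fragment (a simplifying assumption), and the fragment masses in the example are hypothetical.

```python
# Monoisotopic residue masses in Da (standard values; not from the source).
MONO_RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def residues_for_delta(delta, tolerance_da):
    """Residues whose mass matches a fragment-to-fragment mass difference."""
    return sorted(r for r, m in MONO_RESIDUE_MASS.items()
                  if abs(m - delta) <= tolerance_da)


def read_tag(fragment_masses, ppm):
    """Read one residue (or an ambiguous set such as "K/Q") from each pair
    of consecutive fragment masses; "?" marks an unexplained difference."""
    masses = sorted(fragment_masses)
    tag = []
    for lighter, heavier in zip(masses, masses[1:]):
        tolerance_da = heavier * ppm * 1e-6
        candidates = residues_for_delta(heavier - lighter, tolerance_da)
        tag.append("/".join(candidates) if candidates else "?")
    return tag


# Three hypothetical consecutive fragments (n, n+1, n+2) yield two residues.
fragments = [1500.0, 1557.02146, 1685.11642]
print(read_tag(fragments, ppm=1.0))    # K and Q are distinguished
print(read_tag(fragments, ppm=100.0))  # K and Q become ambiguous
```

Leu and Ile are isobaric and remain ambiguous at any accuracy, whereas Lys and Gln (about 0.036 Da apart) separate only at the tighter tolerance, consistent with the need for more fragments at 100 ppm than at 1 ppm.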
Protein Identification using FT-ICR-MS 1 and -MS 2 Data-Although the masses of intact proteins are in many cases the most informative feature of "top-down" data, intact protein mass is not sufficient for the identification of structural isomers (which have the same mass). The ultimate test would be to distinguish the locations of the same mutations occurring at different parts of the protein. Thus, for evaluating combined performance of BIG Mascot and our top-down methods including funnel-skimmer dissociation (Fig. 3), we acquired FT-ICR-MS 1 and -MS 2 data of 13 different ALS-related variants of human SOD1. Despite reduction with tris(2-carboxyethyl)phosphine, disulfide-oxidized hSOD1 was sometimes observed, resulting in less efficient dissociation and an uncertainty of 2 Da in intact protein mass.
We augmented the MSDB to include ALS-causing SOD1 variants, including four distinct His→Arg variants, three different Gly→Ser variants, three distinct Ala→Thr variants, etc. As a result, identification of a given variant required both intact protein mass and protein fragmentation data. Fig. 4 illustrates spectra of wild type and G93A SOD1 MS1, MS2, and zoomed MS2 data (see supplemental Figs. 2 and 3 for G93A BIG Mascot search results). All 13 variants were uniquely identified and ranked first using ECD, and all except D76Y ranked first using funnel-skimmer dissociation (Table I). Selectivity in identifying some SOD1 variants depended upon mass accuracies commonly achieved by Fourier transform mass spectrometers, for example D124V funnel-skimmer dissociation data (±0.1-Da MS1 (6 ppm) and ±0.01-Da MS2 error tolerances). From the SORI-ECD results of the 153-residue wild type SOD1, BIG Mascot matched 66 masses including 41 c-, four y-, and 21 z-type ions.
Mascot determines the significance threshold by considering the database size searched and absolute probabilities. Our BIG Mascot searches with an enzyme definition of None yielded significance thresholds between scores of 60 and 70, values higher than most of the ion scores, even though the correct protein ranked first. In contrast, the IntactProtein enzyme definition reported a significance level between scores of 10 and 20.
Protein Identification Using Only FT-ICR-MS2 Data-We developed a way to search only fragmentation data using BIG Mascot ("Experimental Procedures"). This method was successful for the identification of all wild type proteins (Table I). For protein variants, these types of searches often (70% of the time) identified the correct gene while failing to identify the correct protein variant (35% correct protein variant). This illustrates the importance of MS1 information for accurate identification of protein variants. The case of UCHL-1 illustrates the utility of MS2 for searches of proteins with unknown modifications. Funnel-skimmer dissociation of UCHL-1 yielded high quality MS1 and MS2 spectra, yet no protein was identified when using a 2-Da MS1 tolerance. However, when searched using MS2 data only as described under "Experimental Procedures," UCHL-1 was the single significant hit, and 32 y- and no b-type fragment ions were matched. The protein mass differed by 411 Da from the database prediction, and we were then able to deduce that the UCHL-1 sequence contained an N-terminal addition (GPLGS-) that was an artifact of recombinant expression and purification techniques. This modified UCHL-1 was then added to our database, and a subsequent search matched an additional 21 b-type fragment ions. This indicates the need for observation of the complete sequence, either the intact protein mass and/or complementary fragments (b/y or c/z), for confident assignment of modification status.
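The 411-Da discrepancy can be checked against the proposed GPLGS- addition by summing residue masses, as in this short sketch. The monoisotopic residue masses are standard values rather than numbers from the paper, and the function name is hypothetical.

```python
# Monoisotopic residue masses in Da for the residues in the tag
# (standard values; not from the source).
MONO_RESIDUE_MASS = {"G": 57.02146, "P": 97.05276, "L": 113.08406, "S": 87.03203}


def added_residue_mass(extension):
    """Mass shift caused by appending residues to a terminus: each added
    residue contributes its residue mass, with no extra water."""
    return sum(MONO_RESIDUE_MASS[r] for r in extension)


shift = added_residue_mass("GPLGS")
print(round(shift, 2))  # compare with the observed 411-Da difference
```

An N-terminal extension also explains the fragment pattern described above: y-type ions contain the unchanged C terminus and still match the database sequence, whereas every b-type ion is shifted by the same amount and goes unmatched until the tagged sequence is added.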
The case of thyroglobulin illustrates the utility of MS2-only searches for the identification of intact proteins that are either heterogeneous or too large to detect. Moreover the current version of BIG Mascot does not allow precursor masses larger than 110 kDa, and therefore MS2-only searches are necessary to overcome this limitation for any protein larger than 110 kDa whether or not the intact protein is detected. We attempted to detect intact thyroglobulin without success. We observed only low signal to noise, unresolved signal at 2500-4000 m/z (data not shown) using electrospray, and no signal using MALDI-TOF (data not shown). Thyroglobulin is not only large in size but is also heterogeneous in composition with many PTMs including iodination, glycosylation, and disulfide bonds (44). We performed funnel-skimmer dissociation from two different solvents (Fig. 5). Using 50% acetonitrile, 0.1% formic acid (Fig. 5, top panel) and performing MS2-only searches we identified internal fragments of thyroglobulin resulting from the cleavage of the peptide bond between Val2203 and Pro2204 (supplemental Figs. 4-6). Specifically BIG Mascot identified fragments between Pro2204 and the C terminus, suggesting that thyroglobulin fragmented at the N terminus of Pro2204 and then fragmented further to give the sequence tag. Other fragments matched included a b-ion series that terminated at Pro2234 (supplemental Fig. 6). The second spectrum, acquired using 10 mM ammonium acetate, pH 6.8, was considerably different and contained some higher mass fragments (Fig. 5). Use of this spectrum still yielded the identification of thyroglobulin (supplemental Fig. 7) but from a different region of the sequence (Thr200-Gln1098), consistent with the observation that solvent additives affect patterns of protein cleavage (9).
The second spectrum (10 mM ammonium acetate, pH 6.8) suggests that thyroglobulin fragmented at Gln1098 and then fragmented further to give the sequence tag identified by BIG Mascot. Other than this sequence tag, the second spectrum contains many higher mass ions assigned as other internal fragments (supplemental Fig. 8).
Funnel-Skimmer Dissociation Yields Abundant Internal Fragmentation whereas ECD Does Not-When searching funnel-skimmer dissociation data using the standard "ESI-FTICR" (note "FTMS-ECD" searches c/z-type fragments) associated search parameters of Mascot, which do not search for internal fragments, we observed a large proportion of unmatched ions. After the modification of these "instrument definitions" to search for internal fragments, extensive internal fragment ions of by-type (55) were identified (Table II).
Including internal fragments in BIG Mascot searches increases the chance of random matches for the following reasons. 1) There are comparatively more hypothetical internal fragments per protein than N or C terminus-containing fragments (for wild type SOD1, a 153-residue protein, there are 11,325 possible internal fragments in contrast to 898 b- and y-type fragments including ammonia and water losses). 2) As discussed in Zhai et al. (34), many internal fragment sequences within the sequence of a given protein have identical masses and pose an ambiguity problem. 3) A general problem, but one that is amplified by internal fragment searches, is that BIG Mascot defines the fragment mass tolerance in mass units rather than in ppm: at 500 Da a 0.05-Da error is 100 ppm, whereas at 15,000 Da a 0.05-Da error is 3.3 ppm. Therefore, a mass tolerance chosen to suit a fragment of a given mass results in a loss of sensitivity for fragments above that mass and false positives for fragments below that mass. For these reasons, incorrect and ambiguous internal fragment matches occur and sometimes yield a number of matches higher than the number of experimental ions Mascot used (Table II). To reduce the number of random internal fragment matches, we therefore internally calibrated our data and performed searches with a tighter fragment mass tolerance of 0.05 Da for instrument fragment definitions that included by-type internal fragments.
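Two of the numbers in this argument are easy to verify with a short sketch. The fragment count assumes that an internal fragment excludes both terminal residues and spans at least two residues; that convention is an assumption, chosen because it reproduces the 11,325 quoted for a 153-residue protein. The function names are hypothetical.

```python
def da_to_ppm(tolerance_da, mass_da):
    """Express an absolute mass tolerance (Da) as a relative one (ppm)."""
    return tolerance_da / mass_da * 1e6


def count_internal_fragments(n_residues, min_length=2):
    """Count contiguous fragments that contain neither terminal residue
    and are at least `min_length` residues long (assumed convention)."""
    interior = n_residues - 2
    return sum(interior - length + 1
               for length in range(min_length, interior + 1))


print(da_to_ppm(0.05, 500.0))        # relative tolerance for a small fragment
print(da_to_ppm(0.05, 15_000.0))     # relative tolerance for a large fragment
print(count_internal_fragments(153)) # wild type SOD1
```

A fixed 0.05-Da window is thus roughly 30 times looser in relative terms for a 500-Da fragment than for a 15-kDa fragment, which is the asymmetry the text describes.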
Including internal fragments in searches did not alter the ranks of the correctly identified proteins and increased BIG Mascot scores except in two cases (D125H and D90A SOD1). In the case of D76Y the rank of the variant increased to first place. We observed a significant improvement in proportion of assigned peaks especially for SOD1 variants, BSA, and thyroglobulin.
In the case of BSA, in addition to internal fragments, we observed 11 a-ions. Whenever a-ions were observed, the corresponding b-ions were observed at higher intensity. We did not observe significant a-ions in other proteins. For these reasons, and because the addition of each ion series introduces the chance of random matches, we do not report a-ion searches except in the case of BSA and do not recommend the inclusion of a-ions for the search of unknown proteins.

Comparison of Mascot and ProSight PTM Scoring Algorithms-ProSight PTM is the gold standard for top-down database searching and modification characterization. Thus, we compared the Mascot and ProSight PTM (version 1.0) scoring algorithms, specifically how each would score identical MS1 and MS2 data using an identical database, by calculating the probabilities of a random match reported by each algorithm. The BIG Mascot-assigned probability of a random match was calculated from the ion scores using the following formula: Score = −10 × log(p). We did not perform ProSight PTM searches; instead the numbers of matched and submitted fragments from BIG Mascot searches and a mass deviation threshold of 0.2 Da were input into the ProSight PTM Poisson formula as used in Meng et al. (36). In that formula, f is the number of fragment values submitted, n is the number of random fragment ion hits, and M_a is the mass accuracy. The probability reported in Fig. 6 does not relate to the number of protein forms in the database. The number of protein forms is neither specified nor used for the calculation of the ion scores of Mascot or the calculation of the Poisson probability, thus giving a measure that is independent of the database. As a result of the probability calculations explained above, BIG Mascot consistently provided a higher probability of a false match (was more stringent) compared to ProSight PTM (Fig. 6).
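The conversion between a Mascot-style score and a probability of a random match, Score = −10 × log(p), can be sketched as follows, taking the logarithm as base 10. A generic Poisson probability is included for orientation only: how its expected count is derived from f and M_a in the ProSight PTM formula is not reproduced here, so `expected_random_hits` is a caller-supplied, hypothetical parameter.

```python
import math


def score_to_probability(score):
    """Invert Score = -10 * log10(p)."""
    return 10 ** (-score / 10.0)


def probability_to_score(p):
    """Score = -10 * log10(p) for 0 < p <= 1."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -10.0 * math.log10(p)


def poisson_probability(n_hits, expected_random_hits):
    """Generic Poisson probability of exactly n_hits random matches given
    an expected count; not the ProSight PTM parameterization itself."""
    return (expected_random_hits ** n_hits
            * math.exp(-expected_random_hits) / math.factorial(n_hits))


print(score_to_probability(20.0))  # a score of 20 corresponds to p = 0.01
print(probability_to_score(0.05))  # score equivalent of p = 0.05
```

On this scale the significance thresholds of 60 to 70 reported for enzyme "None" searches correspond to probabilities of 1e-6 to 1e-7.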

DISCUSSION
Here we debut both funnel-skimmer dissociation and the BIG Mascot top-down database search engine. Using these tools we 1) identified 13 variants of the same protein, 2) extended the upper mass limit of top-down methods to 669 kDa, 3) demonstrated that funnel-skimmer dissociation generates frequent internal fragmentation and interresidue cleavage N-terminal to proline, and 4) identified proteins using either intact protein mass, protein fragments (MS2-only), or a combination of the two. Finally we compared the BIG Mascot and ProSight PTM top-down database search engines and demonstrated that the BIG Mascot scoring algorithm is currently too stringent. Here we used protein variants to evaluate the specificity of our methods and identified SOD1 variants correctly out of a possible 102 distinct variants, a feat we could not achieve using bottom-up methods.
Funnel-skimmer dissociation occurs following the skimmer 1 region and under a pressure of a few millitorrs in analogy to prefolding dissociation (9). Our setup, however, includes 1) an ion funnel instead of a nozzle and 2) an additional funnel and skimmer stack following the first skimmer where activation occurs. Funnel-skimmer dissociation is an in-source fragmentation technique that has the following advantages. 1) It does not require automated real time ion selection software, 2) it has a high efficiency, and 3) it allows for dissociation of first generation fragment ions (pseudo-MS 3 ). Limitations of in-source dissociation techniques include 1) the inability to purify the precursor within the mass spectrometer, therefore requiring separation or purification, LC, and/or ion mobility separation, and 2) a higher percentage of unassigned fragments, including internal fragments, compared to ECD. We found ECD to be more specific but it exhibited lower fragmentation efficiency, which could potentially be increased using activation techniques (56).
We showed fragmentation of larger proteins using funnel-skimmer dissociation and demonstrated fragmentation efficiency similar to the nozzle-skimmer results with BSA (51) and the additional identification of significant numbers of internal fragments (Table II). We have also demonstrated the dissociation of thyroglobulin (669 kDa as a dimer). Thyroglobulin is reported to be a very stable molecule, held together by not only noncovalent forces but also many intrachain disulfide and dityrosine bonds (44). The fact that we observed a heterogeneous unresolved spectrum is in accord with electrophoretic analy-

FIG. 5. Funnel-skimmer dissociation spectrum of thyroglobulin sprayed from two different solvent compositions. Assignments of m/z and charges were made using SNAP2 reflecting monoisotopic masses. Each sequence shown is the best sequence part identified. Intens., intensity; FA, formic acid; FSD, funnel-skimmer dissociation.
Internal fragments are often overlooked, although they are observed in both in-source and in-cell fragmentation methods (34). Overall internal fragments are important because their identification increases the confidence in a correct protein match by 1) decreasing the number of unmatched ions and 2) allowing changes in the primary sequence or PTMs to be localized with higher spatial resolution, especially when primary ion series are missing. We observed the second case for SOD1 variants as y-ion series were virtually undetected. Internal fragments are much harder to interpret because of the high chance of redundancy. Thus the internal fragment search needs to be carefully performed. Here we used the following work flow: 1) b- and y-ion searches with externally calibrated data, 2) internal calibration based upon the fragment matches of the top protein hit, and 3) b, y, and internal fragment search using internally calibrated data with lower MS2 error tolerance (0.05 Da) together with a decoy search. Alternatively, one can identify an unknown protein without looking for internal fragments and then search the suspected or assumed sequence using sequence searching software such as SequenceEditor (Bruker Daltonics Inc.) to match other fragments. We did perform these searches to verify the fragments identified by BIG Mascot and especially to correct for false fragment matches due to the use of fragment error in Da rather than ppm by Mascot. For large proteins, we currently lack confidence in individual internal fragment matches assigned by BIG Mascot because 1) multiple matches often occur and 2) mass changes due to PTMs, especially frequent disulfide bonds, make it probable that large internal fragments in particular have a different mass than that of the matched unmodified sequence. However, we observed that internal fragment assignments from BIG Mascot were consistent across the variants of SOD1.
A high probability of fragmentation N-terminal to proline is a recurring theme in collisionally activated dissociation (60,61). Additionally there is evidence of proline contributing to internal fragmentation (62). We clearly observed this trend 1) in thyroglobulin spectra, where the internal fragment series in the first spectrum of thyroglobulin originates from a fragment starting with proline (supplemental Fig. 6), and 2) in the BSA spectrum, where b-ion series start N-terminal to proline.
2 Sigma-Aldrich certificate of analysis for product number T1001.
An instrument definition was created to allow internal fragments as large as 30 kDa to be searched in addition to b- and y-ions. This was used to search our CAD data generated by funnel-skimmer dissociation. Because the number of masses searched against is increased significantly, we included decoy database searches and report the score of the highest scoring protein hit. The score of 1020 is caused by a glitch that we observed when no fragments were assigned to any hits, and therefore scoring changes over to peptide mass fingerprint mode. Thus the random hit has zero score in MS2 mode scoring.
Table I was calculated using BIG Mascot scores (empty circles) and by the Poisson method described previously (36) (filled circles).
We used the observations in Figs. 1 and 2 to devise a strategy of database searching (similar to that suggested previously (63)) for top-down proteomics. Starting with an intact protein with known modification status or an annotated database, searching accurate mass (MS1) is enough to identify the protein (see Fig. 1 and supplemental Fig. 1). Masses of fragments alone can give information specific enough to uniquely identify a protein but are insufficient for distinguishing different variants of the protein (see Fig. 2 and Table I). Combination of MS1 and MS2 enables the complete and accurate identification of variants and PTMs. In the bottom-up approach, confident protein identification using peptide mass alone is generally not possible (64) for the following reason: tryptic digestion of proteins removes the broad, natural distribution of masses (Fig. 1) and replaces it with a narrow distribution of many more peptides distributed mostly between 400 and 2000 Da. Using more specific proteases such as Lys-C widens this distribution and offers a compromise between top-down and bottom-up approaches (referred to as "middle-down") (20,50,65).
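The effect of digestion on the mass distribution can be seen with a simple in silico digest. The sketch below cleaves after every Lys or Arg (trypsin-like) or after every Lys (Lys-C-like), as described in the text; it ignores missed cleavages and the commonly applied exception for proline following the cleavage site, and the function name and example sequence are hypothetical.

```python
def digest(sequence, cleave_after="KR"):
    """Split a sequence after each residue listed in `cleave_after`."""
    peptides, start = [], 0
    for index, residue in enumerate(sequence):
        if residue in cleave_after:
            peptides.append(sequence[start:index + 1])
            start = index + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide
    return peptides


protein = "ATKAVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSR"
tryptic = digest(protein, cleave_after="KR")
lys_c = digest(protein, cleave_after="K")
print(len(tryptic), [len(p) for p in tryptic])
print(len(lys_c), [len(p) for p in lys_c])
```

One intact mass is replaced by many short peptides, and restricting cleavage to Lys alone gives fewer, longer pieces, in line with the "middle-down" compromise.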
We found that probability estimates reported by BIG Mascot were always lower with respect to ProSight PTM scores (11) for top-down MS searches (Fig. 6). The question then arises as to whether Mascot is too stringent or ProSight PTM is not stringent enough. We consider Mascot to be too stringent for the following reasons. 1) The ProSight PTM scoring mechanism is published, has been peer reviewed, and is based upon well established statistical principles. 2) On more than one occasion (A4V and D125H SOD1 CAD; S134N SOD1 and UCHL1 ECD) BIG Mascot assigned a non-significant score to a protein that it ranked first out of a 1,943,018-protein database. Based upon our results, we conclude that the Mascot scoring algorithm requires adjustment for top-down data (it was originally calibrated empirically using bottom-up data). We suggest that the following improvements need to be made to further improve the BIG Mascot search engine: 1) parts per million MS 2 search error tolerance units should be included in the parameters, 2) the upper mass limit needs to be extended to equal the largest protein in the database, 3) scoring needs to be revised to reflect a realistic specificity of top-down MS data in matching the protein reported, and 4) a glitch that causes false identifications to yield a score of 1020 needs to be fixed. To test and improve either database search engine, studies with larger data sets are needed.
We suggest that BIG Mascot will prove useful for attempts to automate the identification of proteins. Strengths of the approach include: 1) its sensitivity, the ability to identify the correct protein within all mammalian subsequences; 2) its specificity, its ability demonstrated by distinguishing very similar protein variants; 3) the contribution of MS 1 to the scoring system within the significance threshold calculation; and 4) the ability to search for internal fragments that are significant especially for in-source dissociation techniques.
Note Added in Proof-Following acceptance of the manuscript, BIG Mascot was branded as MascotTD.