Originally published In Press as doi:10.1074/mcp.M300110-MCP200 on March 28, 2004.
Molecular & Cellular Proteomics 3:625-659, 2004.
© 2004 by The American Society for Biochemistry and Molecular Biology, Inc.


Research

Depth of Proteome Issues

A Yeast Isotope-Coded Affinity Tag Reagent Study*,S

Kenneth C. Parker‡,§, Dale Patterson‡, Brian Williamson‡, Jason Marchese‡, Armin Graber, Feng He||, Allan Jacobson||, Peter Juhasz‡ and Stephen Martin‡

From the ‡ Discovery Proteomics and Small Molecule Research Center, Applied Biosystems, Framingham, MA 01701; Biocrates Life Sciences, 66 Innrain, Innsbruck, Austria; and || Department of Medical Genetics and Microbiology, University of Massachusetts Medical School, Worcester, MA 01655-0122


    ABSTRACT
As a test case for optimizing how to perform proteomics experiments, we chose a yeast model system in which the UPF1 gene, which encodes a protein involved in nonsense-mediated mRNA decay, was knocked out by homologous recombination. The results from five complete isotope-coded affinity tag (ICAT) experiments were combined, two using matrix-assisted laser desorption/ionization (MALDI) tandem mass spectrometry (MS/MS) and three using electrospray MS/MS. We sought to assess the reproducibility of peptide identification and to develop an informatics structure that characterizes the identification process as well as possible, especially with regard to tenuous identifications. The cleavable form of the ICAT reagent system (Gygi et al. (1999) Nat. Biotechnol. 17, 994–999) was used for quantification. Most proteins did not change significantly in expression as a consequence of the upf1 knockout. As expected, the Upf1 protein itself was down-regulated, and there were reproducible increases in expression of proteins involved in arginine biosynthesis. Initially, it seemed that about 10% of the proteins had changed in expression level, but after more thorough examination of the data it turned out that most of these apparent changes could be explained by artifacts of quantification caused by overlapping heavy/light pairs. About 700 proteins altogether were identified with high confidence and quantified. Many peptides with chemical modifications were identified, as well as peptides with noncanonical tryptic termini. Nearly all of these modified peptides corresponded to the most abundant yeast proteins, and some would otherwise have been attributed to "single hit" proteins at low confidence. To improve our confidence in the identifications, the parent masses for the peptides in MALDI experiments were calibrated against nearby components. In addition to the Mascot score that was originally used, five novel parameters reflecting different aspects of identification were collected for each spectrum. The interrelationship between these scoring parameters and confidence in protein identification is discussed.


One of the goals of proteomic research is to identify (correctly) and quantify as many proteins as possible in the biological system of choice (see Refs. 1 and 2 for general reviews). We chose a yeast system for this study because it is possible to get large quantities of biological material from yeast cells using fermentation, and the yeast genome encodes <6600 rather well-characterized proteins whose abundance can be estimated based on the codon adaptation index (3). At the biological level, we chose to study two yeast strains that were related by the knockout of Upf1p (the protein encoded by the UPF1 gene), a protein that is crucial to the process of nonsense-mediated mRNA decay (4, 5).1 We used the isotope-coded affinity tag (ICAT)2 reagent approach (6) to quantify the relative expression levels of each protein based on the relative intensity of the heavy and light forms of Cys-containing tryptic peptides as measured by mass spectrometry. More than 700 yeast mRNAs are regulated by UPF1; that is, their expression levels increase 2-fold or more when Upf1p is inactivated (7), and their respective extents of translation may thus be comparably enhanced. However, genes encoding the most abundant proteins in yeast have evolved such that their mRNA levels are not affected by this pathway (7), making this system a good test for detection of small changes in expression levels.

One simple statistic for determining the relative success of a proteomics experiment is to count the number of correctly identified peptides and proteins. Because identification of peptides by electrospray is dependent on appropriate automated selection of precursor ions, repetition of experiments has frequently been observed to result in a higher number of identifications. In the matrix-assisted laser desorption/ionization (MALDI) approach, the output of the reversed-phase columns is deposited on the MALDI plate, so that time is much less of a limitation in selecting the most informative set of precursors for fragmentation. However, even in this approach, one can expect a larger number of identifications upon repetition because of slight changes in elution times, differences in matrix crystallization, experiment-specific loss of peptides prior to reversed-phase chromatography by absorption, or limitations on sample consumption by the acquisition process itself. For both approaches, another desired output is the expression ratio: the ratio of expression of each protein in the experimental sample versus the control. In these experiments, in addition to learning about the ramifications of the deletion of UPF1, we sought to determine what limitations there are in both identification and quantification that might apply to proteomics experiments in general. In particular, we sought to develop additional scoring parameters so as to more easily identify false positives in a semi-automatic fashion, while retaining borderline identifications that are consistent with what is known about correct identifications. To that end, we report here on the results of five complete ICAT experiments, starting from exactly the same samples, so that biological variability does not contribute.

In these experiments, we combined together identifications from two different instrument types that initially used two different search engines. After combining the data into a common relational database, we submitted all of the spectra corresponding to identified peptides to Mascot. In addition, we developed new measures of reliability of identification and quantification that proved useful in resolving discrepancies in identifications. These quantities are adapted from the parameters we recently described for peptide mass fingerprinting experiments (8). In addition to the percent intensity matched parameter, we describe a percent ChemScore matched parameter that semiquantitatively assesses what percentage of the critically important ion fragments were detected. A third parameter was the internal mass consistency of the fragment ions. We also describe a fourth parameter, called Fragment TriScore, that gives higher credit to spectra in which the fragments with the highest ChemScore are also the most intense ions. These parameters were combined in an overall MatchScore (MatSc) parameter that is useful in documenting the credibility of an identification, especially in marginal cases. The parent mass accuracy parameter was left separate as an independent measure of credibility of identification. Upon calibration, based on added calibrants or masses that can be used as calibrants because they can be confidently assigned, the mass of a precursor ion should not deviate from the theoretical peptide mass by more than 20 ppm.

To determine which factors were most significant in preventing additional protein identifications, considerable attention was focused on spectra that were not readily identifiable. In many cases, these spectra could be attributed to chemically modified forms of peptides that had already been identified many times and were therefore surely very abundant. A second class of these spectra corresponded to peptides with one noncanonical tryptic terminus from these same abundant proteins. An additional problem was spectra whose measured masses differed by several mass units from the mass of the peptide to which they were matched. To quantify these problems, each spectrum was classified into groups corresponding to chemical modification status, mass accuracy, and tryptic specificity.


    MATERIALS AND METHODS
Yeast
Two strains of yeast were studied in these experiments. The strain we describe herein as "wild-type" has been designated HFY1200 (5); it has mutations in ade2, his3, leu2, trp1, and can1, which come into play when the yeast is grown in restricted media. The upf1 knockout strain has been designated HFY871 (5). It has the same genetic background as HFY1200, but has the HIS3 gene inserted in place of the UPF1 gene. Wild-type yeast and the upf1 knockout strain were grown to mid-log phase (OD600 = 0.7) in 2 liters of yeast extract-peptone-dextrose (YPD) medium at 30 °C in a fermentor. All subsequent procedures were performed at 4 °C. Yeast cells were collected by centrifugation at 4,000 x g for 5 min and were washed with 200 ml of water and then 200 ml of 50 mM Tris-Cl, pH 7.5 (buffer A). The yeast extracts were prepared using the liquid nitrogen (LN2) grinding method (9). The cell pellets were resuspended in 1/10 volume of buffer A and then carefully mixed into LN2 to form beads. The beads were crushed and ground to fine powder in LN2 using a prechilled mortar and pestle. The fine powder was stored at –70 °C. The soluble fraction of the yeast extracts was prepared by thawing the fine powder on ice for 15 min and then collecting the supernatant by centrifugation at 14,000 rpm for 5 min using a microcentrifuge. The protein concentration of the soluble fraction was determined using a Bradford assay (10). Each 2-liter culture yields about 4 g of cell pellet, and the estimated protein yield for each soluble fraction is about 400 mg.

Peptide Chemistry
ICAT Reagent Labeling Procedure—
Two 500-µg aliquots from each strain were resuspended in 6 M guanidine-HCl, 1% Triton X-100, 50 mM Tris HCl, pH 8.5 (buffer B). The proteins were then reduced by the addition of 10 µl of 50 mM tricarboxyethylphosphine and boiled at 100 °C for 10 min. After cooling for 5 min to room temperature, 1 mg of the acid-cleavable form of the ICAT light reagent, dissolved in acetonitrile (ACN), was added to the wild type, whereas 1 mg of the acid-cleavable form of the ICAT heavy reagent was added to the upf1 knockout sample. After incubation for 2 h at 37 °C, the two aliquots were combined and precipitated with acetone (6:1 volume of acetone:volume of sample). The precipitated proteins were centrifuged for 10 min at 13,000 x g, the acetone was decanted, and the pellet was resuspended in 100 µl of ACN. The sample was then diluted with 900 µl of 50 mM Tris, pH 8.5, 10 mM CaCl2, 20% ACN. Then 12 µg of porcine trypsin (Promega, Madison, WI) was added, the sample was incubated for 2 h at 37 °C, then another 12 µg of porcine trypsin was added, followed by overnight digestion.

Ion Exchange Chromatography (IEX)—
The sample (1 ml) was diluted to 10 ml with 10 mM K3PO4, 25% ACN, pH ~2.5 (buffer C). In two batches, the sample was injected onto a 4.6 x 100 mm polysulfoethyl A cation exchange column at a flow rate of 1 ml/min. The high salt buffer contained 350 mM KCl, 10 mM K3PO4, 25% ACN, pH ~2.5 (buffer D). To separate the peptides as efficiently as possible, they were eluted over four linear gradient segments using an Applied Biosystems Vision Workstation (Applied Biosystems, Foster City, CA): 2 min to 10% buffer D, 15 min to 20% buffer D, 3 min to 45% buffer D, and 10 min to 100% buffer D. Seven to 23 fractions (see Table V, column No. IEX) of 1.5 ml each were collected beginning 4 min into the gradient. Prior to affinity chromatography, 250 µl of 100 mM Na3PO4, 1500 mM NaCl, pH 10 was added to each fraction, which was sufficient to bring the pH to ~7.2.
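The four-segment gradient and the fraction timing described above can be sketched numerically. This is only an illustration: it assumes that the gradient starts at 0% buffer D, that the flow rate during elution is the same 1 ml/min quoted for injection, and that buffer D is held at 100% after the last segment. None of these points is stated explicitly in the text, and the function names are hypothetical.

```python
from bisect import bisect_right

# Segment durations (min) and end points (% buffer D) taken from the text:
# 2 min to 10%, 15 min to 20%, 3 min to 45%, 10 min to 100%.
# ASSUMPTION: the gradient starts at 0% buffer D and holds at 100% afterwards.
SEGMENTS = [(2.0, 10.0), (15.0, 20.0), (3.0, 45.0), (10.0, 100.0)]


def breakpoints(segments, start_percent=0.0):
    """Convert (duration, end %) segments into cumulative (time, %) points."""
    points = [(0.0, start_percent)]
    elapsed = 0.0
    for duration, end_percent in segments:
        elapsed += duration
        points.append((elapsed, end_percent))
    return points


def percent_d(t, points):
    """Linearly interpolate % buffer D at time t (minutes)."""
    if t <= points[0][0]:
        return points[0][1]
    if t >= points[-1][0]:
        return points[-1][1]
    times = [p[0] for p in points]
    i = bisect_right(times, t)
    (t0, p0), (t1, p1) = points[i - 1], points[i]
    return p0 + (p1 - p0) * (t - t0) / (t1 - t0)


def fraction_windows(n_fractions, start_min=4.0, volume_ml=1.5, flow_ml_min=1.0):
    """Start and end time of each fraction (1.5 ml at an ASSUMED 1 ml/min)."""
    width = volume_ml / flow_ml_min
    return [(start_min + i * width, start_min + (i + 1) * width)
            for i in range(n_fractions)]


POINTS = breakpoints(SEGMENTS)
```

Under these assumptions the programmed gradient ends at 30 min, so the later fractions of a 23-fraction run would be collected during the hold at 100% buffer D.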


TABLE V Spectra vs. experiment

Ex, experiment number. In experiment 1, the noncleavable ICAT reagent was used, and those data are not included here. EM, refers to electrospray (E) = 2 vs. MALDI (M) = 1. No. IEX, the number of ion exchange fractions analyzed. No. HPLC, the number of RP-HPLC fractions analyzed. Date, the month the data were collected. Spectra, the number of spectra that were deposited into either GPS (MALDI) or ProICAT (electrospray). A larger number of electrospray spectra were collected (especially at the beginning and end of each HPLC gradient) that were of such low quality that they were not processed by ProICAT. Upon direct examination, small numbers of these spectra can be matched with reasonable confidence. Nor, the normalization constant that was applied to the raw HL ratio to correct for slight differences between experiments. Stringency ICAT L, (low stringency) the number of CICAT identifications (Y2 = 1) with yb > 3, MatSc > 5000, MAc = 1. Stringency ICAT H, (high stringency) the number of CICAT identifications (Y2 = 1) with yb > 3, MatSc > 5000, Sc ≥ 20, MAc = 1, MBC = 0, 0.2 < HL < 5. Stringency others L, the number of low-stringency identifications where Y2 = 0, 2, or 3. Stringency others H, the number of high-stringency identifications where Y2 = 0, 2, or 3 (unmodified, KICAT, or incompletely alkylated); no HL restriction. N, the number of high-stringency identifications where Y2 = 0 (unmodified peptides). X, the number of high-stringency KICAT identifications where Y2 = 2. Z, the number of high-stringency identifications where Y2 = 3 (incompletely alkylated peptides). Met, the number of high-stringency CICAT identifications that also contained unmodified Met. MetOx, the number of high-stringency CICAT identifications that also contained methionine sulfoxide. % Ox, the percentage of high-stringency CICAT identifications in which Met was oxidized (Y2 = 1). CICAT_Ox, the number of high-stringency spectra containing oxidized CICAT (ChM = 11).
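The stringency classes in this legend amount to simple predicates over the per-spectrum fields. The sketch below is one reading of the legend, not the authors' queries: the `Spectrum` class and function names are hypothetical, the fields MAc, MBC, and Sc are treated as opaque values whose meanings are given elsewhere in the paper, and it is assumed that the high-stringency "others" class uses the same thresholds as the CICAT class apart from the HL restriction.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Spectrum:
    """HYPOTHETICAL container for the legend's field names."""
    y2: int               # 1 = CICAT; 0 = unmodified; 2 = KICAT; 3 = incompletely alkylated
    yb: int               # field "yb" from the legend
    mat_sc: float         # MatSc
    sc: float             # field "Sc" from the legend
    mac: int              # field "MAc" from the legend
    mbc: int              # field "MBC" from the legend
    hl: Optional[float]   # HL ratio, None when not measured


def _low(s):
    # Thresholds shared by every class: yb > 3, MatSc > 5000, MAc = 1.
    return s.yb > 3 and s.mat_sc > 5000 and s.mac == 1


def low_stringency_icat(s):
    return s.y2 == 1 and _low(s)


def high_stringency_icat(s):
    return (low_stringency_icat(s) and s.sc >= 20 and s.mbc == 0
            and s.hl is not None and 0.2 < s.hl < 5)


def high_stringency_other(s):
    # ASSUMPTION: same thresholds as the CICAT class, without the HL restriction.
    return s.y2 in (0, 2, 3) and _low(s) and s.sc >= 20 and s.mbc == 0
```

Counting the spectra that satisfy each predicate, grouped by experiment, would give the kind of tallies reported in the stringency columns.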

 
Avidin Affinity Chromatography—
Each ion exchange fraction was separately purified using the monomeric avidin beads supplied with the ICAT reagent kit (Applied Biosystems), according to the instructions.

Cleavage of Biotin—
Each eluate was dried completely using reduced pressure. A 200-µl aliquot of ICAT cleaving reagent from the ICAT reagent kit was added, followed by incubation at 37 °C for 2 h. Once again the sample was dried under reduced pressure until time for reversed-phase separation. At that time, each sample was resuspended in 100 µl of 2% ACN, 0.1% trifluoroacetic acid (TFA).

Mass Spectrometry
Electrospray Analysis—
Three or four dependent scans were collected per mass spectrometry (MS) scan using a QStar® Pulsar I (Applied Biosystems) equipped with a nanospray source using Analyst software. Typically, data were collected over 110 min at a flow rate of 0.3 µl/min, with 3-s parent scans from 300 to 1500 m/z, and with tandem MS (MS/MS) scans from 70 to 1500 m/z every 3 s. Information-dependent acquisition was used to select the most intense parent masses, excluding singly charged ions and using dynamic exclusion to prevent the same parent mass from being chosen for fragmentation within a 45-s window. Samples were injected onto a capillary trap cartridge (Captrap, Michrom Bioresources, CA) at a flow rate of 20 µl/min to concentrate and desalt the samples. After 10 min, the trap cartridge was automatically switched in-line with the analytical column. Peptides were separated using a 75-µm x 15-cm reversed-phase C18 column, 3-µm particle size (PepMap; LC Packings, Hercules, CA), by means of an Ultimate™ System (Dionex Corporation, Sunnyvale, CA). The gradient was typically from 5 to 30% buffer B, where buffer B is 85% ACN/10% water/5% n-propanol/0.1% formic acid/0.01% TFA and buffer A is 98% water/2% ACN/0.1% formic acid/0.01% TFA.
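The precursor selection logic described here (most intense first, no singly charged ions, 45-s dynamic exclusion) can be illustrated with a short sketch. It is not the Analyst implementation: the function name, the data layout, and in particular the m/z matching tolerance `mz_tol` are assumptions introduced for illustration.

```python
def select_precursors(survey_peaks, scan_time_s, recent, max_dependent=3,
                      exclusion_s=45.0, mz_tol=0.1):
    """Pick the most intense multiply charged precursors not recently fragmented.

    survey_peaks: list of (mz, charge, intensity) from one MS scan.
    recent: list of (mz, time_s) already fragmented; updated in place.
    mz_tol is a HYPOTHETICAL matching tolerance; the text does not give one.
    """
    # Forget exclusion entries that are older than the exclusion window.
    recent[:] = [(mz, t) for mz, t in recent if scan_time_s - t < exclusion_s]
    chosen = []
    by_intensity = sorted(survey_peaks, key=lambda p: p[2], reverse=True)
    for mz, charge, intensity in by_intensity:
        if charge < 2:
            continue  # singly charged ions are excluded
        if any(abs(mz - ex_mz) <= mz_tol for ex_mz, _ in recent):
            continue  # still inside the dynamic exclusion window
        chosen.append(mz)
        recent.append((mz, scan_time_s))
        if len(chosen) == max_dependent:
            break
    return chosen
```

Calling the function once per survey scan with a shared `recent` list shows why repeated runs can identify additional peptides: less intense precursors are only reached once the more intense ones are temporarily excluded.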

MALDI Analysis—
After cleavage, the peptides were separated using an Ultimate Chromatography system (Dionex-LC Packings, Hercules, CA) equipped with a Probot MALDI spotting device. A total of 50 µl of each digested protein fraction was injected and captured on a 0.3 x 5-mm trap column (3 µm, C18; Dionex-LC Packings) with a 0.1 x 150-mm resolving column (3 µm, C18; Dionex-LC Packings) connected in series. Peptides were resolved by dual-solvent gradient elution at a flow rate of 800 nl/min, using a gradient of 5–45% buffer B over 35 min, followed by a gradient of 35–90% buffer B over 5 min, where buffer A is 98% water, 2% ACN, 0.1% TFA and buffer B is 85% ACN, 10% water, 5% isopropanol, 0.1% TFA. Column effluent was monitored using a 3-nl ultraviolet flow cell and spotted directly onto a MALDI target using the Probot. Column effluent was mixed 1:2 with MALDI matrix (7.5 mg/ml α-cyano-4-hydroxycinnamic acid dissolved in 75:25 ACN:water containing 0.15 mg/ml dibasic ammonium citrate) by means of a 25-nl mixing tee (Upchurch Scientific, Oak Harbor, WA) and was spotted onto the target in a 12 x 12 array at 20-s intervals. All MALDI spectra were acquired using a 4700 Proteomics Analyzer (Applied Biosystems) equipped with GPS Explorer version 1.0, and peaks were selected for MS/MS analysis using a strategy to collect as many useful spectra as possible without regard to heavy-to-light (HL) ratio (11). To accomplish this, the parent spectra were collected first, and masses were chosen for fragmentation from the spots in which they were most intense using the PeakPicker program (11).
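The idea of choosing, for each parent mass, the spot in which it is most intense can be sketched as follows. This is not the PeakPicker program itself, whose details are in Ref. 11; the function name, the input layout, and the grouping tolerance `mass_tol` are assumptions made for the illustration.

```python
def pick_spots(parent_peaks, mass_tol=0.05):
    """For each parent mass, choose the spot where it is most intense.

    parent_peaks: iterable of (spot_index, mass, intensity) from the MS-only pass.
    mass_tol (in mass units) is a HYPOTHETICAL tolerance for grouping peaks
    from neighboring spots as the same component.
    Returns a list of (mass, spot_index, intensity), one per distinct mass.
    """
    best = []  # one [mass, spot, intensity] entry per distinct mass
    for spot, mass, intensity in sorted(parent_peaks, key=lambda p: p[1]):
        if best and abs(mass - best[-1][0]) <= mass_tol:
            # Same component seen in another spot: keep the more intense one.
            if intensity > best[-1][2]:
                best[-1] = [mass, spot, intensity]
        else:
            best.append([mass, spot, intensity])
    return [tuple(entry) for entry in best]
```

Because the whole chromatogram is already on the plate, this choice can be made after all parent spectra are in hand, which is the main difference from the real-time selection used in the electrospray runs.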

Peptide Identification Methods—
QStar files were submitted to the ProICAT software package (Applied Biosystems) for both identification and quantification. The database used was Swiss-Prot release 36. ProICAT generates a set of tables housed in an Access relational database. From these tables, two new tables were generated using SQL queries: a table that contained one row for each mass spectrum (see Table III) and a table that contained the peak list information from each mass spectrum (see Table VI). Although the HL ratio and peptide sequence were originally in distinct tables, these values were incorporated into Table III.
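The restructuring into one per-spectrum table and one peak-list table can be illustrated with a small relational sketch. It uses SQLite through Python purely as a stand-in for Access; every table name, column name, and data value below is hypothetical and made up for the example, since the ProICAT schema is not described in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# HYPOTHETICAL source tables standing in for the ProICAT output; values are made up.
cur.executescript("""
CREATE TABLE msms (msms_id INTEGER PRIMARY KEY, parent_mass REAL);
CREATE TABLE peptide_hits (msms_id INTEGER, sequence TEXT);
CREATE TABLE ratios (msms_id INTEGER, hl REAL);
CREATE TABLE fragments (msms_id INTEGER, mass REAL, intensity REAL);
INSERT INTO msms VALUES (1, 1250.62), (2, 980.45);
INSERT INTO peptide_hits VALUES (1, 'ITLHVDcLR');
INSERT INTO ratios VALUES (1, 1.1);
INSERT INTO fragments VALUES (1, 175.12, 300.0), (1, 288.20, 120.0), (2, 110.07, 50.0);
""")

# One row per spectrum, with sequence and HL ratio folded in (NULL when absent).
cur.execute("""
CREATE TABLE spectra AS
SELECT m.msms_id AS sp_id, m.parent_mass, p.sequence, r.hl
FROM msms AS m
LEFT JOIN peptide_hits AS p ON p.msms_id = m.msms_id
LEFT JOIN ratios AS r ON r.msms_id = m.msms_id
""")

# One row per fragment peak, keyed by the same spectrum identifier.
cur.execute("""
CREATE TABLE peak_lists AS
SELECT msms_id AS sp_id, mass, intensity FROM fragments
""")

spectra_rows = cur.execute("SELECT * FROM spectra ORDER BY sp_id").fetchall()
peak_count = cur.execute("SELECT COUNT(*) FROM peak_lists").fetchone()[0]
```

The left joins keep unidentified or unquantified spectra in the per-spectrum table with empty fields, which matches the statement that HL is listed only when measured.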


TABLE IIIA Spectra

The meanings of the column headings are described in "Crude Results." For electrospray data, SpID was derived from the MS/MS ID number from ProICAT, with values of 100,000, 600,000, or 700,000 added to prevent overlaps; for MALDI data, SpID was derived from the Peak List ID from GPS. Pla was derived from the Plate ID from GPS, or from the run id from ProICAT. Many of the QStar spectra were not processed so as to extract the parent ion intensity, and that field is therefore blank. HL is listed only when measured. Fields HL_S, HLI_S, HL_1, and HLI_1 are defined for MALDI spectra only. AutoSeq was calculated only in rare instances. ISD has meaning only when ChM = 10 or 12, or NT = 0 or 1, because only in these circumstances is it likely that the parent mass may have arisen from in-source decay. The comment field is not displayed, but is used for tracking observations and manual modifications.

 

TABLE IIIB Spectra

The legend is the same as for Table IIIA.

 

TABLE IIIC Spectra

The legend is the same as for Table IIIA.

 

TABLE IIID Fields for spectrum table

The meanings of these columns are described in "Crude Results."

 

TABLE VI Peak lists

No., the SpID for the peak list. In Table VI, 1 corresponds to the spectrum in Fig. 3A (original SpID 52964), which has been identified as ITLHVDcLR; 2 and 3 both correspond to SpID 604059. The 2 peak list was automatically obtained using ProICAT (Fig. 3B). The 3 peak list was derived from the same spectrum (Fig. 3D), except that manual intervention was used to ensure that each peak listed was appropriately de-isotoped and de-charged. For this spectrum, it was difficult to find peak extraction parameters that worked well without causing poor peak extraction for many other spectra. Because the evidence for the identification of ITLHVDcLR was marginal, nearly any error in the peak list depresses the Mascot score below the significance threshold. Mass, the mass for the fragment (charge assumed to be 1). Int, the area of the isotope cluster corresponding to Mass. In the case of 3, this number was obtained manually.

 
The MALDI spectra were identified using Mascot version 1.9 (12). The database used was MSDB (Matrix Science Ltd., London, United Kingdom) from the June 1, 2003 release, containing 9722 Saccharomyces cerevisiae sequences. MALDI files were originally stored in an Oracle database generated by GPS 1.0. All of the mass and intensity information was extracted into tables in Access, along with the sequences of the matching peptides proposed by the Mascot search engine. The isotope cluster intensities for the HL pair peaks of the parent spectrum that had been deposited into the Oracle database were used for quantification. Like the electrospray data, these data were housed in Tables III and VI. These steps correspond to Fig. 1, Workflow 1, steps 1–3.



FIG. 1. Workflows used to identify spectra. Workflow 1, Automated procedure used to identify peptides. In some cases, spectra were processed several times, resulting in conflicting assignments especially with borderline identifications. Conflicts were resolved using MatSc_Calc. Workflow 2, Spectra (usually selected by parent mass) were automatically searched for evidence that they could be explained by the same peptides already identified. The same procedure can be used to identify spectra that match to a specific peptide of particular interest. Workflow 3, Manual identification processes. Five criteria (1a–1e) are listed by which certain spectra were chosen for special attention. Some spectra were carefully tested using any of the eight methods listed in Workflow 3, section 2. Workflow 4, Follow-up automated searches based on what was learned from manual identifications, or to test for the presence of modified peptides in certain classes.

 
A Visual Basic program (MatSc_Calc) was written that calculates MatSc (see below in "Calculation of Overall Score") as well as most of the additional parameters in Table III starting from the proposed sequence, the experimental mass, and the peak/intensity list. In order to compare identifications from electrospray data to MALDI data, the peak lists for identified peptides from both electrospray and MALDI were combined and resubmitted to Mascot (Fig. 1, Workflow 1, step 5). This generates a Mascot score for the electrospray data. In addition, the protein consolidation capability of Mascot ensures that the smallest possible list of proteins is deposited into Table III, with common accession numbers regardless of which databases were initially searched (Fig. 1, Workflow 1, step 6). Efforts were made to ensure that no identifications were lost in this process, which centered on determining which spectra were most crucial for the identifications of each peptide and protein (Fig. 1, Workflow 1, step 7). In many instances, spectra were examined manually (Fig. 1, Workflow 1, step 8). In some cases, MatSc_Calc was enabled to overwrite the initial sequence with alternative sequences (for example, sequences that Mascot or ProICAT had identified with lower confidence) when the alternative sequence had a higher MatSc (see Fig. 1, Deciding Between Alternative Sequences). Occasionally, queries were performed to determine how many identifications were changed to different sequences by this means, and MatSc_Calc or its input parameters were altered so that those sequences that appeared to be correct by manual examination of selected spectra resulted in the highest MatSc. The final set of input parameters used is listed in Table I. Many of the tracking parameters in Table III were manually appended using combinations of additional programs and update queries within Access.
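The overwrite step, in which an alternative sequence replaces the initial one only when it has a higher MatSc, can be sketched as below. The calculation of MatSc itself is not reproduced; the scores are taken as given inputs, and the function names and example values are hypothetical.

```python
def resolve_conflict(candidates):
    """Return the (sequence, mat_sc) pair with the highest MatSc.

    candidates: list of (sequence, mat_sc) proposed for ONE spectrum, with the
    initial assignment listed first. Ties keep the earliest candidate, so an
    alternative must score strictly higher to replace the initial sequence.
    """
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[1])


def assign_sequences(proposals):
    """Map each spectrum ID to a single winning sequence.

    proposals: dict of sp_id -> list of (sequence, mat_sc).
    """
    return {sp_id: resolve_conflict(cands)[0]
            for sp_id, cands in proposals.items() if cands}
```

Each spectrum ends up with at most one sequence, which is the invariant the workflow maintains regardless of how many searches were run.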


TABLE I MS/MS ChemScore rules

There is one rule per row, numbered by the No. column. Rules 1–8 are initial value rules. The ChemScore of the fragment in question is first defined by rules 1–4, which are mutually exclusive. Thus, His is not treated differently than other aa. Rules 5–8 are applied whenever the sequence of the fragment contains the condition described. Thus, if the fragment’s sequence ends in Pro, rule 6 applies. If the fragment begins with Pro, but the preceding residue is Asp, then both rules 5 and 7 apply, and the ChemScore gets multiplied by 4.5. Rules 9–13 adjust the ChemScore according to the ion type, by multiplying the ChemScore calculated by rules 1–8 by the factor listed in column Value. The SeqString, which is calculated separately, is based on a separate score for each peptide bond in the peptide. It is calculated by adding the score in column SeqString for each ion type, with a maximum value of 9. The SeqString is based on rules 9–13 only. Rules 14–21 are rules that allow the ChemScores of ion types to be adjusted in a sequence-specific fashion. Rule 14 applies only to y-17 or b-17 ions in which the first residue is Gln. These ions are often very prominent, and rule 14 sets the ChemScore for such ions equal to the corresponding y or b ion. Rule 15 applies to any y-17 or b-17 that contains a Lys, Gln, or His except when rule 14 applies. Rules 16 and 17 are analogous to rules 14 and 15, but apply to CICAT in peptides. Rule 18 causes a y-18 or b-18 ion to be calculated only if the N-terminal residue of the fragment is Glu. Rules 19 and 20 adjust the ChemScores for the a2 and b2 ions only. Rule 21 applies to a special +18 amu form of the penultimate b ion, which is especially prominent if there is an internal Arg in the fragment. Rule 22 should apply only to MALDI data (but has little effect on QStar data, where it need not apply). 
Because in MALDI MS/MS spectra there is often detectable transmission of the unselected member of an HL pair, both heavy and light forms of all y and b ions containing CICAT or KICAT are calculated. The unselected HL form gets penalized 10-fold. In rules 23 and 24, peptides containing MetOx or oxidized CICAT often have prominent neutral losses. Therefore, all y and b ions are duplicated for each fragment that contains m, 7 or 8 (see Table IV A). These ions get assigned one-half the value of the corresponding y or b ions. Because Mascot does not consider these ions, it has difficulty matching these modified peptides. Rules 25–40 assign such little value to the immonium ions (listed according to single aa code and mass) that they play little role in distinguishing between sequences, which is often desirable when there are overlapping precursors. The His immonium ion at 110 amu is by far the most reliable. Rule 41 sets PpmMin for calculating FrT and MatSc. Rule 42 sets the highest allowed ppm error for matching a fragment ion. Rule 43 sets the lower mass limit for calculation of ppw. Rule 44 sets the highest allowed ppm error for matching a fragment ion with mass < 200 amu (set by rule 43).
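Two of the adjustments in this legend are stated numerically and can be sketched directly: the unselected member of an HL pair is penalized 10-fold (rule 22), and neutral-loss ions from fragments containing MetOx or oxidized CICAT receive one-half the value of the corresponding y or b ion (rules 23 and 24). The sketch below takes the base ChemScore as an input, since the values of the earlier rules are in Table I and are not reproduced here; the function and key names are hypothetical.

```python
def expand_ion(base_score, has_icat_label, has_neutral_loss_residue):
    """Return {variant_name: ChemScore} for one y or b ion.

    base_score: ChemScore already obtained from the earlier rules
    (its calculation is not reproduced here).
    has_icat_label: the fragment contains CICAT or KICAT.
    has_neutral_loss_residue: the fragment contains MetOx or oxidized CICAT.
    """
    variants = {"selected": base_score}
    if has_icat_label:
        # Rule 22: the unselected member of the HL pair is penalized 10-fold.
        variants["unselected_HL"] = base_score / 10.0
    if has_neutral_loss_residue:
        # Rules 23 and 24: neutral-loss ions get one-half the y or b value.
        variants["neutral_loss"] = base_score / 2.0
    return variants
```

For example, a labeled fragment with a base score of 100 would contribute 100 for the selected form and 10 for the unselected form of the pair.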

 
To obtain the identifications in Table III, certain spectra were submitted to multiple additional rounds of Mascot searching (Fig. 1, Workflows 2–4). In most cases, whenever a peptide was identified with high confidence, all spectra with parent masses within 2 amu were searched (using MatSc_Calc) to find additional spectra that matched the same peptide (Fig. 1, Workflow 2). A modified form of MatSc_Calc was used to search for the oxidized form of some of the previously identified Cys-containing ICAT reagent modified (CICAT) peptides (Fig. 1, Workflow 2, step c). In other cases, MatSc_Calc was used in search mode to look for peptides of particular interest, for example, from Upf1p (Fig. 1, Workflow 2, step d). In these cases, MatSc_Calc was routinely set to identify any spectrum that had the minimal criteria for identification; typically, >3 yb ions, a MatSc >5000, and parental mass accuracy of 300 ppm. In other cases, these criteria were lowered so as to annotate any spectrum that matched within 1 amu of the proposed sequence that could not be better explained by alternative sequences. This normally did not result in additional identifications with any confidence (except for e.g. human keratin peptides that were missed at the database level). Instead, this process enabled us to annotate parent masses from which additional spectra could in principle be acquired to substantiate or to rule out such an identification.
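The minimal criteria and the 2-amu search window described above can be expressed compactly. This is a sketch under stated assumptions rather than the MatSc_Calc code: the function names are hypothetical, the ppm error uses the usual definition relative to the theoretical mass, and the 300-ppm limit is assumed to apply to the absolute error.

```python
def ppm_error(observed_mass, theoretical_mass):
    """Signed mass error in parts per million, relative to the theoretical mass."""
    return (observed_mass - theoretical_mass) / theoretical_mass * 1e6


def meets_minimal_criteria(n_yb, mat_sc, observed_mass, theoretical_mass,
                           max_ppm=300.0):
    """More than 3 yb ions, MatSc > 5000, and parent mass within max_ppm."""
    return (n_yb > 3 and mat_sc > 5000
            and abs(ppm_error(observed_mass, theoretical_mass)) <= max_ppm)


def candidate_spectra(spectra, peptide_mass, window_amu=2.0):
    """IDs of spectra whose parent mass lies within window_amu of the peptide.

    spectra: list of (sp_id, parent_mass).
    """
    return [sp_id for sp_id, mass in spectra
            if abs(mass - peptide_mass) <= window_amu]
```

A typical use would first call `candidate_spectra` for a confidently identified peptide and then score each candidate, keeping only those for which `meets_minimal_criteria` holds.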

A large number of spectra were examined manually (Fig. 1, Workflow 3). Special attention was paid to spectra that were derived from unmatched, intense MS/MS spectra (Fig. 1, Workflow 3, step 1a), or intense precursors that were not automatically matched (Fig. 1, Workflow 3, step 1b). We were also particularly interested in spectra that had precursor masses that appeared to belong to HL pairs that indicated differential expression (Fig. 1, Workflow 3, step 1c). Other spectra were selected for special attention because they matched to proteins that we did not expect to be abundant, or that contained unusual modifications, or did not conform to trypsin cleavage rules (Fig. 1, Workflow 3, step 1d). Some spectra were examined randomly as a spot check of the annotation process (Fig. 1, Workflow 3, step 1e). Some of the methods used to identify these spectra are listed in Fig. 1, Workflow 3, section 2. For some spectra, the precursor mass or charge state was adjusted (Fig. 1, Workflow 3, step 2b). In other cases, the peak list was refined manually because of inappropriate de-isotoping, or because some fragment ions appeared to be derived from substances other than peptides (Fig. 1, Workflow 3, step 2c). In this case, the peak list was usually not permanently altered, and MatSc was calculated using the peptide sequence that was determined manually.

After we had uncovered evidence for peptides with new modifications, selected high-performance liquid chromatography (HPLC) runs were resubmitted to Mascot, for example, with lysine-modified ICAT reagent (KICAT), oxidized CICAT, or N-terminal ICAT reagent modified as alternative variable modifications (Fig. 1, Workflow 4). Once again, MatSc_Calc was used as the criterion to resolve conflicts. Regardless of how many searches were performed, in the end, each spectrum was allowed to match to no more than one peptide sequence. Some spectra were manually placed into Y1 class 99 (see Table IVB), because manual examination of the spectrum indicated that a proposed sequence was implausible, yet MatSc was significant.


TABLE IVB Overall classifications

Y1, A collection of identifications, corresponding to line 17 in Table IIID. All Y1 classes require MAc < 4, yb > 3 and MatSc ≥ 10,000 except classes 0, 13, 14, 98, and 99. All Y1 classes except 0, 5, 8, and 9 are dependent on Y2 classifications. Y2, A collection of identifications with six general categories, based mostly on ChM without regard to confidence of identification. The Y2 category is listed for each Y1 when that is appropriate. In cases where there is more than one Y2 category for a single Y1 category, Y2 is left blank. Spectra classified as Y1 = 98 or 99 are defined as Y2 = 0. No., the number of identifications in each category. M, the number of MALDI identifications. E, the number of electrospray identifications.

 
Spectrum Classification
Tables I and II list the input parameters used in the calculation of MatSc.


TABLE II Additional rules

Two internal calibrants are sought within each MS/MS spectrum. Rule 1 is used to select the low mass calibrant. Such a calibrant must be among the 20 most intense ions, it must be either a b or y ion, and it must have a mass > 0.2 times the parent mass and < 0.4 times the parent mass. The most intense qualifying ion is selected. Rules 2–5 select the high mass calibrant. It also must be either a y or b ion, and is first sought among the 30 most intense ions with a mass higher than 0.8 times the parent mass. If such an ion cannot be found, rules 3–5 apply in turn. Rules 6 and 7 apply to calculation of Ppw and PIM. Rule 6 specifies that for both parameters, only ions of masses > 200 are assessed. Similarly, rule 7 specifies that no ion within 50 amu of the parent ion is counted. This excludes neutral losses of ammonia and water while retaining losses of Gly or the often abundant neutral loss of 64 amu from MetOx. Rules 8 and 9 prevent unusually intense ions from dominating and distorting parameters PIM and FrT, respectively. This is accomplished by sorting these parameters in decreasing order, and then replacing the top three values with the fourth highest value. Rules 10–12 apply to counting y and b ions. Rule 10 indicates that the highest y ion does not count (it is the parent). Rule 11 specifies that only the unmodified form of the y ion is counted, even if other fragments have equivalent or nearly equivalent ChemScore. Rule 12 ensures that only the correct member of an HL pair is counted, even if both are detected.

 
Peak List Filtering—
For intensity truncation, the initial peak lists were extracted by GPS 1.0 for MALDI analysis, and by ProICAT for electrospray analysis. To a first approximation, the higher the percentage of the intensity that can be accounted for by a proposed match (the Percent Intensity Matched (PIM)), the more confident the identification. However, we found that if the spectrum contained from one to three very intense masses, these intense masses often eliminated the discriminating power of the PIM term because they overwhelmed the contributions by less-intense matched ions. To avoid this problem, the intensities of the most intense three daughter ions were "truncated" to the intensity of the fourth most intense daughter ion, thereby allowing the weaker ions to contribute to the calculations (Table II, rule 8). In the case of a "rich" spectrum with many nearly equally strong peaks, truncation is not necessary and has little effect.
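The truncation step can be illustrated with a minimal Python sketch. The function name and the list-of-intensities input are assumptions made for illustration; only the "cap the top three at the fourth highest value" behavior comes from the text.

```python
def truncate_top_intensities(intensities, n_top=3):
    """Cap the n_top most intense peaks at the (n_top + 1)-th highest intensity.

    Hypothetical helper: the article describes the rule (Table II, rule 8)
    but not an implementation.
    """
    if len(intensities) <= n_top:
        # Too few peaks to have a "fourth most intense" value; leave unchanged.
        return list(intensities)
    cap = sorted(intensities, reverse=True)[n_top]
    return [min(value, cap) for value in intensities]
```

For a spectrum with intensities 100, 50, 40, 10, and 5, the three strongest peaks are all reduced to 10, so the weaker ions carry comparable weight in later sums.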

For peak density filtration, in many instances, the most important ions for determining the correct sequence are relatively high in molecular mass but weak in overall intensity. To ensure that the peak list contains the most significant ions from all regions of the mass spectrum, the peak list was filtered to eliminate all but the six most intense masses every 100 amu. Later, these same processed peak lists were resubmitted to Mascot so that the Mascot scores listed were derived from these altered peak lists.
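A rough sketch of the density filter follows. It assumes fixed, non-overlapping 100-amu bins aligned at multiples of 100; the text does not say whether the window was fixed or sliding, so that choice, like the function name, is an assumption.

```python
from collections import defaultdict


def density_filter(peaks, window=100.0, keep=6):
    """Keep the `keep` most intense peaks in each `window`-amu bin.

    peaks: iterable of (mass, intensity) tuples.
    Assumes fixed bins starting at multiples of `window` (not stated in the article).
    """
    bins = defaultdict(list)
    for mass, intensity in peaks:
        bins[int(mass // window)].append((mass, intensity))
    kept = []
    for members in bins.values():
        # Sort each bin by decreasing intensity and retain the strongest peaks.
        members.sort(key=lambda peak: peak[1], reverse=True)
        kept.extend(members[:keep])
    return sorted(kept)  # return in order of increasing mass
```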

Percent Intensity Matched—
PIM is calculated beginning with the filtered peak list. Only masses that are greater than 200 amu and more than 50 amu below the precursor mass are allowed to contribute to PIM because most masses outside this range are not sequence specific (Table II, rules 6 and 7).
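As a minimal sketch of this calculation, assuming the peak list has already been filtered and truncated, and that matched peaks are supplied as a set of observed masses (both the input layout and the function name are illustrative assumptions):

```python
def percent_intensity_matched(peaks, matched_masses, precursor_mass,
                              low_cutoff=200.0, precursor_gap=50.0):
    """Percent of eligible intensity explained by the proposed sequence.

    peaks: list of (mass, intensity) after filtering and truncation.
    matched_masses: set of observed masses assigned to predicted ions.
    Only peaks above low_cutoff and more than precursor_gap below the
    precursor contribute (Table II, rules 6 and 7).
    """
    eligible = [(m, i) for m, i in peaks
                if m > low_cutoff and m < precursor_mass - precursor_gap]
    total = sum(i for _, i in eligible)
    if total == 0:
        return 0.0
    matched = sum(i for m, i in eligible if m in matched_masses)
    return 100.0 * matched / total
```

With a precursor at 1000 amu, a matched peak at 980 amu or at 150 amu is ignored entirely, so it neither raises nor lowers PIM.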

Calculation of Fragment Ion ChemScore—
The Fragment Ion ChemScore is designed to approximate the theoretical expected intensity for each ion type (see Refs. 13 and 14 for alternative methods to do this). In this article, the ChemScore for each fragment ion is calculated based on the rules listed in Table I. If it was desired, these assumptions could be tuned to the instrument type (electrospray versus MALDI), but in this article we have used the same settings for all spectra.

In this scheme, it is possible to count any desired ion type. In this article, we have counted only y ions, b ions, a ions, y-17 ions, b-17 ions, certain immonium ions, and y and b ions generated by neutral loss of 64 amu from oxidized Met.

The first principle is that y ions and b ions are the most important. y ions have been assigned an arbitrary score of 6000, versus 5000 for b ions, etc. (Table I, rules 9 and 10). The ChemScore for any ion that contains a Lys or Arg is further multiplied by 1.5 (Table I, rules 3 and 4). Because we and others (15, 16) have observed preferential fragmentation before Pro (in all ion series), and after Asp, and to a lesser degree Glu, the ChemScores of all such ions have been multiplied by the factors listed in Table I, rules 5, 7, and 8. In addition, the ChemScores of all ions C-terminal to Pro have been decreased by 1.5-fold (Table I, rule 6), because these ions are less frequently detected (13).

We have observed that y-17 and b-17 ions are more intense when the N-terminal amino acid (aa) is Gln, possibly because of facile elimination of ammonia by cyclization; thus such ions are given the same ChemScore as the corresponding y or b ion (Table I, rule 14). The remaining (y-17) and (b-17) ions are given 20-fold more credit than other -17 ions if they contained R, H, or Q (Table I, rule 15). We have also observed unusually intense y-17 and b-17 ions after ICAT reagent labeling and acid cleavage; thus ICAT reagent-labeled Cys-containing (CICAT) fragments are dealt with the same as R, H, and Q (Table I, rules 16 and 17).

Most neutral losses of water have not been considered here, except in the case of N-terminal Glu, for which these losses are often unusually intense (Table I, rule 18).

Because the b2 ion and a2 ion often seem to be more intense than other a and b ions, the ChemScore for the b2 ion has been multiplied by 1.5-fold, whereas the ChemScore for the a2 ion is multiplied by 6.6-fold (Table I, rules 19 and 20).

Finally, it has been observed that an ion having the molecular mass of the parent ion minus the C-terminal aa is often present, especially when the sequence contains an internal arginine (which in general makes fragmentation of any kind more difficult). In this report, the ChemScore of the b(n–1)+18 ion is counted the same as an ordinary b ion (Table I, rule 21).

In order to promote detection of ions, the timed ion selector in MALDI mode is typically relaxed so that small amounts of ions from the unselected HL pair are often detected. We decided to add these ions to the fragment list and assign them ChemScores 10-fold below the value for the ICAT reagent-labeled peptide that was selected (Table I, rule 22).

Differential ChemScore values have also been applied to the most common immonium ions. The immonium ion for His at ~ 110 is usually the most reliable immonium ion, thus it was assigned a ChemScore of 100. This value is too low to have much impact on scoring, but these small values would break ties between peptides that otherwise seemed to be equally plausible. If the starting peptide samples were less complex, then the reliability of immonium ions for identification could be increased, but even small amounts of His-containing peptides appear to render the His 110 ion detectable (Table I, rules 25–40).

It has also been observed that when methionine sulfoxide is present, a second series of ions is observed that is 64 amu below the canonical y and b ions (Table I, rule 23). A similar neutral loss from the oxidized ICAT side-chain has also been observed (Table I, rule 24).

Calculation of Percent ChemScore Matched (PCM)—
All of the masses in the filtered peak list that match to predicted ions contribute to PCM. The total ChemScore for the sequence in question is the sum of the Fragment Ion ChemScores of all of the considered ions. The ChemScore Matched is calculated by summing the Fragment Ion ChemScores for those ions that were matched to masses in the filtered peak list within tolerance (Table I, rule 42). Thus, PCM is calculated by:

where ti is the total number of ions considered, MFM is the number of ions matched, m is the set of matched ions, and n is the set of all ions. If two theoretical ions matched within tolerance to the same mass, then the Fragment Ion ChemScore for both ions was used. Because of the quantitative nature of the Fragment ChemScore index, PCM is not significantly affected by consideration of additional ions, so long as these additional ions get assigned low ChemScores. In contrast, consideration of large numbers of additional ions can by itself destroy the usefulness of the PIM term, because almost any mass can be explained by some ion.
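Read from the definitions above, PCM is presumably 100 times the summed ChemScore of the matched ions divided by the summed ChemScore of all considered ions. The Python sketch below implements that reading; the function name, the (mass, ChemScore) input layout, and the use of a single ppm tolerance for every ion are assumptions for illustration.

```python
def percent_chemscore_matched(theoretical_ions, peak_masses, tol_ppm=500.0):
    """Percent of the total ChemScore carried by ions found in the peak list.

    theoretical_ions: list of (mass, chemscore) for every considered ion.
    peak_masses: observed masses from the filtered peak list.
    Two theoretical ions matching the same observed mass both contribute,
    as described in the text.
    """
    total = sum(score for _, score in theoretical_ions)
    if total == 0:
        return 0.0
    matched = 0.0
    for mass, score in theoretical_ions:
        if any(abs(obs - mass) / mass * 1e6 <= tol_ppm for obs in peak_masses):
            matched += score
    return 100.0 * matched / total
```

Because each ion is weighted by its ChemScore, adding many low-scoring ion types to `theoretical_ions` changes the result only slightly, which is the property the text contrasts with PIM.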

Calculation of Peptide Fragment TriScore (FrT)—
It is to be expected that the ions with the highest ChemScore (ChS) should correspond to the most intense ions. In addition, the observed masses are more credibly identified if they match the calculated fragment ion mass within experimental accuracy. To make this quantitative, we define PpmMin as a lower limit of mass error in parts per million (ppm) (Table I, rule 41). Ions that match to a tolerance below PpmMin get increasingly less additional credit for doing so. PpmMin is in principle an instrument-specific factor, although in this study a value of 400 ppm was used for both instruments. The upper limits for matching were set to 500 ppm for ions >200 amu (Table I, rules 42 and 43) and to 0.5 amu for ions <200 amu (Table I, rule 44). To prevent a small number of ions from dominating FrT, it was arbitrarily decided to truncate ChS and intensity (Int) parameters to the fourth highest value (Table II, rules 8 and 9). To calculate FrT, the theoretical ion list is sorted by decreasing ChS, and the peak list is sorted by decreasing Int. To normalize the value of FrT to the intensity distribution observed, the MaximumFrT was calculated as follows:

where i is the number of elements in the shorter of the two lists. If i < 5, then MaximumFrT is calculated using:

The MatchedFrT was then calculated for each matched fragment according to:

where ThIM is the number of matched ions, and the list is sorted by decreasing ChS × Int. As with MaximumFrT, the equation for MatchedFrT may need to be reduced to suit the number of elements. Finally, FrT was calculated according to:

so that the peptide would receive a value of 100 if the intensity distribution of ions exactly matched the theoretical distribution postulated above, and with a mass accuracy of <<PpmMin. Significantly lower scores result whenever the intensity distribution of ions does not conform to expectations. Note that FrT can still have a high value (near 100) if a small number of ions are detected, so long as these ions are predicted to be the most intense. This is one reason why no match is considered plausible unless the sum of b and y ions matched exceeds 3.

The FrT term acts to buffer against devaluation of the PIM term by random matching of intense masses to minor ion types, because when this takes place, although PIM increases, FrT decreases significantly.

Internal Calibration of MSMS Spectra—
In order to improve the internal accuracy of the fragment ions for the MALDI spectra, for each tentative identification, masses were selected to be used to calibrate the remaining fragments. To accomplish this, two masses corresponding to y or b ions were sought that were as intense as possible, and also well separated so that a slope measurement derived from them would be as accurate as possible. The rules to find appropriate masses are listed in Table II (rules 1–5). First, a low-molecular-mass fragment was sought that had a mass greater than 0.2 times the mass of the parent ion, and no greater than 0.4 times the mass of the parent ion. The second mass was then selected from the fragments >0.6 times the parent mass using the rules in Table II. A two-point calibration was then performed. If no appropriate masses could be found, then no calibration was performed.
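The two-point calibration can be sketched as a straight line through the two calibrant ions. The text does not give the functional form, so the linear mapping, the function name, and the argument names below are assumptions.

```python
def two_point_calibration(obs_low, theo_low, obs_high, theo_high):
    """Return a function that maps observed mass to calibrated mass.

    Assumes a linear correction passing through both calibrants:
    (obs_low -> theo_low) and (obs_high -> theo_high).
    """
    if obs_high == obs_low:
        raise ValueError("calibrants must have distinct observed masses")
    slope = (theo_high - theo_low) / (obs_high - obs_low)
    intercept = theo_low - slope * obs_low

    def calibrate(mass):
        return slope * mass + intercept

    return calibrate
```

The slope is better determined when the calibrants are far apart, which is consistent with the requirement that the two masses be well separated.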

Intensity Weighted Mass Error (ppw)—
So that matches to low-intensity ions do not distort the mass error term, intensity-weighting was performed. Because the immonium ion region was often poorly calibrated after this procedure, no fragments less than a value of 200 amu were allowed to contribute to ppw.
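One plausible reading of ppw is an intensity-weighted mean of the absolute ppm errors of the matched fragments. The exact formula is not given in the text, so the sketch below, including its name and input layout, is an assumption.

```python
def intensity_weighted_ppm(matches, low_cutoff=200.0):
    """Intensity-weighted mean absolute mass error in ppm.

    matches: list of (observed_mass, theoretical_mass, intensity).
    Fragments below low_cutoff (the immonium ion region) are ignored.
    Returns None when no fragment qualifies.
    """
    weighted_error = 0.0
    total_intensity = 0.0
    for observed, theoretical, intensity in matches:
        if theoretical < low_cutoff:
            continue
        ppm = abs(observed - theoretical) / theoretical * 1e6
        weighted_error += intensity * ppm
        total_intensity += intensity
    if total_intensity == 0:
        return None
    return weighted_error / total_intensity
```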

Calculation of Overall Score (MatSc)—
The overall score (MatSc) is a compound index that includes contributions from many of the parameters that can be used to judge the quality of an identification. It is not yet clear how these parameters should be optimally combined, as each of the individual parameters has limitations under certain conditions (for example, when the peak list is too large). Sophisticated mathematical techniques have been used by others to optimize the weighting of such parameters (17).

In this article, MatSc is calculated according to:

where PpmMin is the minimum ppm value below which matches are of no greater significance. In these calculations, a PpmMin of 400 ppm was used. This seems rather high, but there were a few identifications that by all other criteria appeared to be correct for which this value was appropriate. In most instances, MatSc would have higher discriminating power if PpmMin were around 50 ppm. Note that the PpmMin term can break ties between peptides that are identical except for Lys versus Gln (0.036 amu difference) even when PpmMin is 400 ppm.

Calculation of SeqString—
The SeqString describes how the ions that match the proposed peptide are dispersed along the length of the sequence of the peptide. It is not used for spectrum classification, but is a useful visual guide. A similar scheme has been used previously by others (David Fenyo, personal communication). Each ion type is assigned a score, as listed in Table I, rules 9–13, column SeqString. Each peptide bond in the sequence is assigned the sum of those scores, with the exception that the sum must not exceed a value of 9. The most meaningful way to interpret the SeqString is to place it in TrueType font directly over the sequence to which it corresponds, shifted by half a space.

For example, the SeqString 9004555929 might correspond to the sequence ACDEFGHILMK. It could be displayed as:

This would indicate that both b and y ions were found that support the AC peptide bond, that no ions were found to corroborate the CD and DE bonds, a b ion corroborates the EF bond, etc.
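The construction and display of a SeqString can be sketched as follows. The per-ion scores used here (4 for a b ion, 5 for a y ion) are inferred from the example above and are assumptions; the actual values are those in Table I, rules 9–13. Spacing the characters out is one way to imitate the half-space shift in plain text.

```python
# Hypothetical per-ion scores inferred from the example; see Table I, rules 9-13.
ION_SCORES = {"b": 4, "y": 5}


def seq_string(sequence, ions_per_bond):
    """Build a SeqString with one digit per peptide bond, capped at 9.

    ions_per_bond[k] lists the ion types supporting the bond between
    residues k and k + 1 of `sequence`.
    """
    if len(ions_per_bond) != len(sequence) - 1:
        raise ValueError("need one entry per peptide bond")
    digits = []
    for ions in ions_per_bond:
        total = sum(ION_SCORES[ion] for ion in ions)
        digits.append(str(min(total, 9)))
    return "".join(digits)


def render(sequence, seqstr):
    """Place each digit between the two residues whose bond it describes."""
    top = " " + " ".join(seqstr)
    bottom = " ".join(sequence)
    return top + "\n" + bottom
```

Calling `render("ACDEFGHILMK", "9004555929")` yields two lines in which each digit sits between the pair of residues it refers to.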

Counting y and b Ions—
Probably the simplest way of assessing the quality of a database identification is to count the number of y and b ions matched (see Ref. 18 for an alternative way to do this). If this number is small (e.g. ≤3), then the identification is uncertain. If three y + b ions match, and the peak list contains only three or four masses, then the identification might be correct. If the peak list is much larger, then the number of b and y ions matched is no longer so useful, and it must be balanced with a term like % Intensity Matched. In this report, four or more y + b ions were required for all classifications of identified spectra. Table II, rules 10–12 list several additional considerations that apply to counting ions.


    CRUDE RESULTS
List of Spectra with Identifications—
All 73,009 spectra from experiments 2–6 are listed in Supplemental Table S3, with fields that define in what fraction they were obtained, the intensity, the HL ratio, and various parameters that define the confidence of the identification. Because Table S3 (as well as Tables S6–10, S12, and S13) is so large, a subset of this table is shown to exemplify the information it contains. Table III shows a subset of Table S3, which corresponds to seven spectra. The first three of these spectra correspond to Upf1p, the next two are from His3p, and the final two are from pyruvate decarboxylase. Table IIID lists the fields in Table III, many of which are useful for filtering and processing these data. Because in many cases multiple rounds of database searches were performed, the Mascot scores are not always exactly comparable to one another, even though this score is based mainly on the number of ions that match. One reason for this is that the same mass tolerances were not always used for each search. In the discussion below, the significance of the Mascot score will be addressed. Regardless of how the tentative sequence was identified, it is possible to tabulate how well the spectrum corresponds to the proposed sequence, for example, by using MatSc. In Table IIID, the column definitions for Table III are divided into five categories: those involved in tracking which HPLC run the spectrum belongs to, those that characterize the parent ion that was fragmented, those (labeled MS/MS) that describe how well the proposed peptide sequence corresponds to the spectrum, those that characterize the HL ratio, and those that characterize the peptide and protein sequence that was matched (if any).

Table S6 contains each of the peak lists that were automatically extracted for each of the 73,009 spectra in Table S3. This is an enormous table, and three examples of peak lists corresponding to two distinct spectra (one of which was obtained manually from the peak list in Table S6) are shown in Table VI.

Tracking—
The tracking fields include such parameters as spectrum number SpID; instrument type EM, where E designates electrospray and M designates MALDI; experiment number Ex; elution time (electrospray) or well number (MALDI), designated TiW; the ion exchange fraction number Fr; and a plate number (MALDI) or replicate number (electrospray) Pla. An experiment consists of all data from the same initial ICAT reagent labeling reaction and is instrument-specific. In some cases, replicate electrospray HPLC runs were performed on the same IEX fraction, resulting in the need for Pla. Because in some cases one HPLC run was collected on more than one plate, there is also an HPLC run index EID. Whenever the same spectrum was submitted to more than one database search, MatSc_Calc was used to resolve sequence discrepancies, using SpID as a key field.

Parent Ions—
The parent ion fields include the recalibrated parent ion mass CalMW; its intensity IntP; the mass of the corresponding peptide sequence PepMW; and the difference between these two masses in ppm Df. A fifth field defines the mass bin MB of the parent ion, which is obtained by dividing the parent ion mass by 1.0005, and then rounding to the nearest integer. MB is especially useful for comparing the results of multiple experiments, or adjacent ion exchange fractions, and is particularly useful as an index. Because a large number of identifications were made to peptides where CalMW-PepMW was nearly equal to small integers, another pair of fields were calculated to classify these spectra. The mass bin class field MBC corresponds to the integer itself. When MBC = 1, CalMW is about 1 amu larger than PepMW. This happens if the peptide is deamidated, or if CalMW was incorrectly assigned to the second isotope of the isotope cluster, rather than to the monoisotopic mass. This is more likely to happen at higher peptide molecular masses where the monoisotopic mass is harder to distinguish. Because deamidation is more interesting than a mistake in de-isotoping, the mass difference in ppm DfC was calculated, assuming an ideal mass difference of 0.984 amu, which is the mass difference between an acid and an amide. If the explanation is de-isotoping, then the mass difference should be 1.0033 amu, which is the mass difference between 13C and 12C. Spectra were also classified according to mass accuracy class MAc, where class 1 is defined as ≤20 ppm for MALDI spectra or ≤80 ppm for electrospray spectra. Class MAc 2 is defined as ≤80 ppm (MALDI only), whereas class 3 is defined as ≤0.5 amu. Spectra of class 4 are off by more than 3.5 amu, because any mass difference between 0.5 amu and 3.5 amu would be grouped in a different MBC prior to calculation of MAc. Identifications with MAc 4 may be correct if the wrong isotope cluster was assigned to the spectrum, or nearly correct if the peptide contains an unassigned chemical modification.
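The mass bin and the integer-offset fields can be sketched in Python as follows. MB follows the definition in the text; computing MBC by rounding CalMW − PepMW, and expressing DfC relative to PepMW, are assumptions, as are the function names.

```python
def mass_bin(parent_mass):
    """MB: parent mass divided by 1.0005, rounded to the nearest integer."""
    return round(parent_mass / 1.0005)


def mass_bin_class(cal_mw, pep_mw):
    """MBC: the small integer nearest to CalMW - PepMW (assumed rounding)."""
    return round(cal_mw - pep_mw)


def ppm_vs_ideal(cal_mw, pep_mw, ideal_shift):
    """Residual mass error in ppm after subtracting an ideal shift.

    ideal_shift = 0.984 tests the deamidation hypothesis (as for DfC);
    ideal_shift = 1.0033 tests the mis-assigned isotope hypothesis.
    """
    return (cal_mw - pep_mw - ideal_shift) / pep_mw * 1e6
```

For a peptide near 1500 amu the two hypotheses differ by about 0.019 amu, or roughly 13 ppm, so the residual distinguishes them only when the parent mass accuracy is comparable to the MAc class 1 limit for MALDI.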

In most cases, in the MALDI experiments, masses that were identified with high confidence were used to calibrate the spot in which the MS/MS spectrum was collected, as well as immediately adjacent spots. Such masses have a value of 1 in the Cal field. Thus, if the ppm for the parent ion is 0, it was probably used as a calibrant. In a few cases, the ppm difference rounded to 0 to four significant figures even when the mass in question was not used as a calibrant.

MS/MS Ions—
The MS/MS fields include the Mascot score Sc; the overall match score MatSc, defined above; the number of y and b ions matched yb; SeqString defined above; and the four major components of MatSc: namely, Percent ChemScore Matched PCM, Percent Intensity Matched PIM, intensity-weighted average ppm deviation ppw, and Fragment TriScore FrT. The number of masses detected prior to peak filtering MFD, after peak filtering MFU, the number of masses matched MFM, and the number of theoretical ions matched ThIM are also listed. Note that it is possible for more than one theoretical ion to match the same mass, and vice versa. IntD lists the sum of the intensity of the ions counted in MFU. Note that all of these scores are dependent on peak detection. In an ideal world, it would be possible to detect peaks reliably and reproducibly. In these experiments, the peak lists for the MALDI data are usually reliable; that is, upon manual inspection, the peaks deposited in the database appear to reflect faithfully what one would expect from careful inspection of the raw spectra. For the electrospray data, however, it is difficult to define peak detection, de-isotoping, and de-charging parameters that result in a good peak list for all spectra. For this reason, we spent a lot of time validating electrospray identifications, as is regular practice. We hope to develop more robust peak extraction methods in the future.

HL Fields—
The HL fields include the measured heavy-to-light ratio HL. In some cases, the ratio was not measurable because no HL partner was detected. HL_S summarizes how many possible HL partners were detected. HL_S has six digits, each of which can be either 0 or 1. If the first position = 1, then a potential HL partner was detected about 9.03 amu above the parent mass. The remaining five positions of HL_S correspond to the presence or absence of a peak 18, 27, –9, –18, and –27 amu above or below the parent mass in question. Under ideal circumstances, the HL pair would be nonambiguous, meaning there would be only one non-zero digit in HL_S, which would correspond to the number of CICAT residues in the peptide. In this case, the binary field HL_1 is set to 1. We have included as "identified" (Y1 class 1, see below) only those peptides in which HL_S is consistent with the number of Cys residues in the corresponding sequence. A second parameter, Ippm, is the ppm difference between the observed masses of the HL pair and the theoretical mass difference of the HL pair (not shown in Table III). When the HL pair was ambiguous, the value for Ippm listed corresponds to the listed sequence. A final HL parameter is the HLI_S. It is similar to HL_S, but lists instead whether there are any interfering masses within 2 amu of an HL pair at +9, +18, +27, –9, –18, or –27. When HL_S is nonambiguous, and HLI_S is all zeroes, then the ratio of the HL pair is more reliable (depending on the intensity of the peaks), and the binary field HLI_1 is set to 1. When HL_S is ambiguous, the exact value of the HL ratio may be unknowable, because more than one peptide may contribute to the intensity of either the c0 form of the ICAT reagent or the c9 form of the ICAT reagent. This problem is lessened if it turns out that the ambiguous ICAT reagent-labeled mass belongs instead to a nonoverlapping HL pair. For example, if HL_S was 111000 and the corresponding peptide had one Cys, the potentially confounding masses could correspond to a second unrelated HL pair whose masses were 18 and 27 amu above the first peptide’s mass. Because there are slight differences in HL ratio between experiments due to mixing inaccuracies, the HL values must be normalized. In the five experiments listed here, these normalization constants were small (between 0.855 and 1.11, see Table V, column Nor).
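The six-digit HL_S flag can be sketched as below. The matching tolerance of 0.2 amu and the use of exact multiples of 9.03 amu for every offset are assumptions made for illustration, as are the function names.

```python
def hl_s(parent_mass, ms_masses, tol=0.2):
    """Build the six-digit HL_S flag for a parent mass.

    Digit order follows the text: +9, +18, +27, -9, -18, -27 amu.
    tol is a hypothetical matching tolerance in amu.
    """
    digits = []
    for n in (1, 2, 3, -1, -2, -3):
        target = parent_mass + n * 9.03
        hit = any(abs(mass - target) <= tol for mass in ms_masses)
        digits.append("1" if hit else "0")
    return "".join(digits)


def is_unambiguous(flag, n_cys):
    """True when exactly one digit is set and it matches the number of labeled Cys."""
    if flag.count("1") != 1:
        return False
    position = flag.index("1")
    # Positions 0 and 3 correspond to one label, 1 and 4 to two, 2 and 5 to three.
    return position % 3 + 1 == n_cys
```

A flag such as 111000 fails the check for any number of Cys residues, which matches the ambiguous case discussed above.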

Peptide and Protein Fields—
These fields indicate which peptide sequence Seq was identified, and which protein was matched to that sequence, which is designated by (usually) a Swiss-Prot accession number Acc. The amino acid preceding the peptide is listed in < (SeqA in Access), whereas the amino acid following the peptide is listed in > (SeqZ in Access). If field "<" is empty, then the peptide was N terminal; similarly for field ">" and C-terminal peptides. When modified amino acids were detected, the normal capital letter abbreviation for the amino acid is altered in field Seq, whereas DSeq corresponds exactly to the sequence in the database (not shown in Table III). Table IVA lists all of the single-letter amino acid codes for altered amino acids in column Sym, where column RMW lists the molecular mass difference between the modified aa and the natural aa, or in the case of modified CICAT peptides the molecular mass difference between the modified and unmodified form of the CICAT residue. For example, "C" refers to the c0 form of the ICAT reagent, whereas "c" refers to the c9 form. "Z" refers to unmodified Cys. If an Asn or Gln has been deamidated, it is labeled as the corresponding acid residue in Seq, but not in DSeq. In some cases, the accession numbers are PIR accession numbers from csc-fserve.hh.med.ic.ac.uk/delphos.html (referred to hereafter as DelPhos) because there was no corresponding protein in Swiss-Prot. Table IVA also lists how many spectra are classified into each grouping, either at high confidence (field HiMac), at high confidence but also considering different MBC bins (field High), or at any confidence (field Low).


TABLE IVA Chemical modification classes

Sym, the Symbol used in single aa code to represent the modified aa. Name, the chemical name for the modified aa. Row 27 applies to the last four columns, whereas row 28 applies to the last three columns. RMW, the rounded molecular weight of the modification, as added onto the aa listed in column aa. In rows 12, 13, and 17–24, this mass is added in addition to the mass of the CICAT-modified Cys. In rows 25 and 26, this mass is added to the peptide, as the N-terminal aa of the peptide bears the modification. aa, the single letter code for the aa that is modified, where applicable. ChM, the chemical modification class that is defined by the modification. Similar modifications were grouped together (like sodiation). Both CICAT-modified peptides and unmodified peptides belong to ChM 0. They are distinguished by field Y2 (parameter 18 in Table IIID). Peptides that contain more than one of the modifications in rows 3–26 are defined as ChM 99. HiMAc, the number of high-stringency, high-mass-accuracy identifications in each ChM, where Sc ≥ 20; MatSc ≥ 5000; yb > 3; MAc = 1; MBC = 0. The number reported is tabulated based on ChM, not on the previous columns. High, the number of high-stringency identifications in each ChM, where Sc ≥ 20; MatSc ≥ 5000; yb > 3; MBC < 4. Low, the number of low-stringency identifications in each ChM, where MatSc ≥ 5000; yb > 3; MBC < 4. Row 28 lists the total number of identifications in each stringency class for all modifications.

 
The HL_Type field lists whether the peptide was modified with the c0 or the c9 form of the ICAT reagent, and how many such modifications there are. The CysN field lists the number of modified Cys residues in the peptide. The AutoSeq field lists the best sequence obtained by either manual de novo sequencing or automatic de novo sequencing, if that was determined (not shown in Table III). NT lists the number of tryptic cleavage sites; a value of 2 indicates that both ends correspond to trypsin cleavage rules. In these tables, all terminal Arg and Lys residues are defined to be acceptable tryptic termini, including sequences in which the next aa is Pro. In addition, peptides that include the protein’s N or C terminus are counted as tryptic. Finally, peptides whose N terminus is consistent with annotated biological cleavages are also counted as tryptic. This includes removal of an N-terminal Met or cleavage by signal peptidase. We did not happen upon any instances of known biologically relevant C-terminal cleavages. Some peptides contain missed trypsin cleavage sites; these are counted in field IT. In field IT, KP or RP sequences are not counted as internal missed cleavage sites. In some cases, a peptide may be generated from a longer peptide by fragmentation in the source region of the mass spectrometer. This is especially likely for certain chemically modified peptides, especially peptides containing "Z" for unmodified Cys, which commonly elute together with a corresponding Cys-alkylated peptide. If a suitable precursor for in-source decay was detected in the same MALDI spot or within 1 min of elution time, then this is noted in field ISD. Ideally, assignment of in-source decay in electrospray experiments should also require exact co-migration of the ISD ion and its precursor.
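The NT and IT counts can be sketched in Python as below. The sketch assumes that Lys and Arg appear as uppercase K and R in the sequence and that an empty flanking residue marks a protein terminus, as in the "<" and ">" fields; it deliberately omits the annotated biological cleavages (N-terminal Met removal, signal peptidase), which would require protein annotation.

```python
def tryptic_termini(seq, preceding, following):
    """NT: number of peptide ends consistent with trypsin cleavage (0, 1, or 2).

    preceding / following: flanking residues in the protein, '' at a protein terminus.
    A following Pro does not disqualify a terminus, as stated in the text.
    """
    n_ok = preceding == "" or preceding in ("K", "R")
    c_ok = following == "" or seq[-1] in ("K", "R")
    return int(n_ok) + int(c_ok)


def internal_missed_cleavages(seq):
    """IT: internal K or R residues not followed by P."""
    count = 0
    for i in range(len(seq) - 1):
        if seq[i] in ("K", "R") and seq[i + 1] != "P":
            count += 1
    return count
```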

Overall Fields—
Two final parameters, Y1 and Y2, group spectra into classes of overall reliability. The most important class is Y1 class 1, which corresponds to the highest confidence category of CICAT peptides. Table IVB lists each of these classes. Note that most of the classifications of spectra according to fields Y1 and Y2 depend on the classifications in fields ChM, MAc, NT, HL_S, HL, yb, MatSc, or Sc. Y1 classes 8 and 9 are special exceptions that correspond to Cys-modified tryptic peptides that derive from trypsin or human keratin, respectively, and therefore do not contribute to any of the yeast peptide or protein statistics. Column No. lists how many spectra are in each Y1 class. Field Y2 groups spectra according to how they are modified. Y2 class 1 refers to CICAT peptides, whereas Y2 class 3 corresponds to Lys-modified (KICAT) peptides. Y2 class 2 refers to peptides that are not modified at all, whereas Y2 class 4 refers to peptides that have an unalkylated Cys; in most cases, the latter appears to be a result of ISD. Y2 class 5 comprises spectra matched to peptides with chemical modifications that correspond to more than one of classes 1–4, whereas Y2 class 0 comprises spectra that have fewer than three y or b ions matched and are therefore essentially unmatched. Note, however, that such a spectrum could be matched with high confidence due to co-migration with a second spectrum that has a high Mascot score (Sc) and yb. In addition, in MALDI experiments, such spectra could become identifiable after subsequent MS/MS experimentation.

Data Processing
Low Stringency Identifications—
The first task in extracting information from proteomics experiments is to determine which identifications are to be considered reliable. We chose to start with identifications using either ProICAT (for electrospray samples) or Mascot (for MALDI samples). Normally, threshold criteria are chosen that appear to exclude the bulk of the incorrect assignments while retaining the bulk of the correct assignments. In these experiments, we were particularly interested in studying the borderline identifications, and therefore have collected statistics on many identifications that would normally be discarded. We decided upon four minimal requirements for tentative identification: 1) at least four y and b ions must match within 300 ppm; 2) the sequence must be at least 6 aa long; 3) MatSc must exceed 5000; and 4) the tentative sequence must match the spectrum in question better than any other considered sequence. This last requirement calls for some judgment regarding which chemical modifications are at all plausible for consideration. Generally, a chemical modification is considered reasonable only if the same modification has been found on a peptide that was matched with high confidence to a high-quality spectrum from within the same experiment, or alternatively if the unmodified form of the peptide in question is known to be so abundant that nearly any chemical modification seems plausible. There is a special category of spectra that we excluded from further consideration: spectra that by the criteria of MatSc and yb seem to match a certain sequence, yet manual inspection of the spectrum indicates that the identification is not credible (Y1 class 99). In some cases, this occurs because the spectrum is so weak that the process of extracting a peak list failed. In other cases, examination of the spectrum indicates that the substance that was fragmented does not correspond to a peptide at all and is probably derived from contamination. In a few cases, the spectrum seems to correspond to a peptide, but has intense ions that cannot be explained easily by the proposed sequence. This manual inspection has been selectively applied to the data, with special attention paid to spectra that uniquely define proteins. Therefore, there are surely still examples of spectra that are matched to peptide sequences that can be shown by manual examination to correspond to other sequences, or to substances other than peptides. Next, we classified spectra as 1) ICAT reagent modified on Cys (CICAT), 2) not ICAT reagent labeled, 3) ICAT reagent modified on Lys (KICAT), or 4) incompletely alkylated. These classifications correspond to Y2 classes 1–4.
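The four minimal requirements for tentative identification can be written as a single predicate. The sketch below is illustrative only: the `Match` record and its field names are hypothetical, the ion count is assumed to have been made already at the 300 ppm tolerance, and the "best candidate" flag stands in for the judgment-based comparison against other plausible sequences. The MatSc threshold is coded as a strict inequality, following the wording "must exceed 5000"; the figure and table legends state the same cutoff as MatSc ≥ 5000.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """Hypothetical record for one spectrum-to-sequence assignment."""
    sequence: str            # tentative peptide sequence (single letter code)
    yb: int                  # y and b ions matched within 300 ppm
    mat_sc: float            # MatSc value for this assignment
    is_best_candidate: bool  # True if no other considered sequence fits better


def is_tentative_identification(m: Match) -> bool:
    """Apply the four minimal requirements listed in the text."""
    return (
        m.yb >= 4                  # 1) at least four y and b ions matched
        and len(m.sequence) >= 6   # 2) sequence at least 6 aa long
        and m.mat_sc > 5000        # 3) MatSc must exceed 5000
        and m.is_best_candidate    # 4) best match among considered sequences
    )
```

Spectra rejected on manual inspection (Y1 class 99) are not modeled here, because that step depends on examining the spectrum itself.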

Using the criteria for tentative identification described above, there are 12,249 identifications in Y2 class 1, corresponding to 2181 distinct CICAT peptides, or 1029 distinct proteins, some of which are surely misidentifications (Fig. 2A).



FIG. 2. Degree of overlap of peptides and proteins across instrument types. Identifications of CICAT peptides, segregated by instrument type. The number in parentheses indicates the number of distinct identifications. There were 7550 measurements of CICAT peptides that were chromatographically distinct. The Venn diagram in the middle of A shows that 990 peptides were identified by both techniques, out of 2181 distinct peptides. On the right, the diagram shows that 609 proteins were identified by both techniques. These identifications had yb > 3; MatSc ≥ 5000; len > 5, Y1 <> 8; Y1 <> 9; Y2 = 1; NT = 2; MBC = 0; MAc < 4; ChM 0 or 1. B, High-stringency identifications. The number of identifications decreases significantly. Many of the peptides and proteins that are unique to an instrument type in B were identified with low confidence in A. In addition to the requirements in A, MatSc ≥ 10,000; Sc ≥ 20; Y1 = 1; MAc = 1. C, Same as B, except all spectra also had HL ratios where 0.2 < HL < 5.
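The three sets of criteria in the legend to Fig. 2 are nested, and the Venn diagram counts follow from set operations on the filtered identifications. The following is a minimal sketch under stated assumptions: records are plain dictionaries keyed by the field names used in the legend, the `seq` and `instrument` keys and the labels "MALDI" and "ESI" are hypothetical, and the functions are not the queries actually used to produce the figure.

```python
def passes_a(r):
    """Criteria listed for panel A."""
    return (
        r["yb"] > 3 and r["MatSc"] >= 5000 and len(r["seq"]) > 5
        and r["Y1"] not in (8, 9) and r["Y2"] == 1 and r["NT"] == 2
        and r["MBC"] == 0 and r["MAc"] < 4 and r["ChM"] in (0, 1)
    )


def passes_b(r):
    """Panel B: the panel A criteria plus the high-stringency cutoffs."""
    return (
        passes_a(r) and r["MatSc"] >= 10000 and r["Sc"] >= 20
        and r["Y1"] == 1 and r["MAc"] == 1
    )


def passes_c(r):
    """Panel C: the panel B criteria plus 0.2 < HL < 5."""
    hl = r.get("HL")
    return passes_b(r) and hl is not None and 0.2 < hl < 5


def overlap(records, predicate):
    """Count distinct peptides (MALDI only, shared, ESI only) passing a filter."""
    by_instrument = {}
    for r in records:
        if predicate(r):
            by_instrument.setdefault(r["instrument"], set()).add(r["seq"])
    maldi = by_instrument.get("MALDI", set())
    esi = by_instrument.get("ESI", set())
    return len(maldi - esi), len(maldi & esi), len(esi - maldi)
```

The same `overlap` function could be applied to protein accession numbers instead of peptide sequences to obtain the protein-level counts.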

 
Protein Identification and Consolidation—The ICAT Dictionary—
Whenever more than one protein contains the same peptide, one needs to determine, if possible, which protein is the most likely source protein. This can sometimes be accomplished if there are additional MS/MS identifications pointing to one protein but not the alternatives, or if there is independent evidence that one protein is more likely to be encountered. In any case, the data need to be annotated with a consistent protein accession number for tracking purposes. All of these issues can be addressed by making a database of all possible CICAT peptides in the yeast proteome, annotated with protein accession numbers. We refer to this database as an ICAT dictionary. In yeast, another useful tracking field is the codon bias index (3), which correlates positively with protein abundance (3), and therefore is useful for tie-breaking purposes.
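The idea of an ICAT dictionary with codon bias tie-breaking can be sketched as a simple grouping step. This is an illustration in Python rather than the database implementation used in the study; the function name, the input layout of (peptide, accession, codon bias) tuples, and the output keys are assumptions.

```python
from collections import defaultdict


def build_icat_dictionary(records):
    """Map each peptide to one tracking accession.

    records: iterable of (peptide, accession, codon_bias) tuples, one per
    protein that contains the peptide (hypothetical input layout).
    """
    by_peptide = defaultdict(dict)
    for peptide, accession, codon_bias in records:
        by_peptide[peptide][accession] = codon_bias

    dictionary = {}
    for peptide, proteins in by_peptide.items():
        # Tie-break: keep the protein with the highest codon bias.
        best = max(proteins, key=proteins.get)
        dictionary[peptide] = {
            "accession": best,
            "codon_bias": proteins[best],
            "n_proteins": len(proteins),  # how many proteins share the peptide
        }
    return dictionary
```

A peptide found in only one protein maps to that protein with `n_proteins` equal to 1, so the tie-break only matters for shared peptides.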

It is not possible to derive the optimized list of proteins until all of the relevant data are consolidated together (see Ref. 19 for a discussion of this issue). In this report, the relevant data include identifications from all five experiments. Another complication is that some peptides are virtually identical by mass spectrometry (I versus L, and combinations of amino acids that add up to the same mass without distinguishing backbone fragment ions). Because of this, there is a danger that two distinct peptide sequences could get mapped to indistinguishable MS/MS spectra. To derive an optimized protein list, we submit the entire list of identified CICAT peptides to the Mascot search engine in the form of theoretical y ion spectra. Mascot then automatically selects the smallest number of distinct protein sequences that can explain the spectra. Later, the sequences are mapped to the other tracking fields like codon bias using the ICAT dictionary.
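The I versus L ambiguity mentioned above can be made explicit by comparing peptides on a canonical key in which both residues are written the same way. The sketch below covers only that one case; combinations of amino acids that add up to the same mass are not handled, and the function names are hypothetical.

```python
from collections import defaultdict


def canonical_key(sequence):
    """Write Ile as Leu so that I/L variants share one key."""
    return sequence.replace("I", "L")


def group_indistinguishable(peptides):
    """Return groups of distinct sequences that differ only by I versus L."""
    groups = defaultdict(set)
    for peptide in peptides:
        groups[canonical_key(peptide)].add(peptide)
    # Keep only keys that collect more than one distinct sequence.
    return {key: seqs for key, seqs in groups.items() if len(seqs) > 1}
```

Sequences flagged this way are candidates for being mapped to indistinguishable MS/MS spectra and may need to be consolidated before a protein list is derived.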

To generate our theoretical ICAT dictionary, we started with the yeast proteome at genome-ftp.stanford.edu/pub/yeast, updated ~1/2003. In this database, there are 23,308 distinct Cys-containing peptides with no missed trypsin cleavage sites that contain at least 6 aa and have a MW of <4000. Of these, 481 (2%) are encoded by more than one distinct protein. The ICAT dictionary has a degeneracy field that enumerates how many genes encode each peptide. After the list of these sequences was generated, the sequences were grouped using an SQL query so that each unique sequence would be paired with the protein accession number that corresponded to the protein with the highest codon bias, as well as a number that listed how many distinct genes encoded the peptide. In some cases, it was necessary to add new sequences to the ICAT dictionary to accommodate additional sequences that were derived from other databases, for example, the Mascot nonredundant database. The final protein list contains Swiss-Prot accession numbers whenever possible, and PIR-derived "S" numbers from the Mascot nonredundant database when the proteins in question were not in Swiss-Prot. Extensive use was made of the web site csc-fserve.hh.med.ic.ac.uk/delphos.html to resolve questions of protein identity. At this web site, one can