Transit Peptide Cleavage Sites of Integral Thylakoid Membrane Proteins*

A set of 58 nuclearly encoded thylakoid-integral membrane proteins from four plant species was identified, and their amino termini were assigned unequivocally based upon mass spectrometry of intact proteins and peptide fragments. The dataset was used to challenge the Web tools ChloroP, TargetP, SignalP, PSORT, Predotar, and MitoProt II for predicting organelle targeting and transit peptide proteolysis sites. ChloroP and TargetP reliably predicted chloroplast targeting but only reliably predicted transit peptide cleavage sites for soluble proteins targeted to the stroma. SignalP (eukaryote settings) accurately predicted the transit peptide cleavage site for soluble proteins targeted to the lumen. SignalP (Gram-negative bacteria settings) reliably predicted peptide cleavage of integral thylakoid proteins inserted into the membrane via the “spontaneous” pathway. The processing sites of more common thylakoid-integral proteins inserted by the signal recognition particle-dependent pathway were not well predicted by any of the programs. The results suggest the presence of a second thylakoid processing protease that recognizes the transit peptide of integral proteins inserted via the spontaneous mechanism and that this mechanism may be related to the secretory mechanism of Gram-negative bacteria.

expressed by a particular cell type or tissue, is largely possible because of recent technological and computational advances. Contemporary proteomics is based on the following three components: analytical separation of proteins from complex starting material, identification of the proteins, and subsequent sorting and classification of the datasets using bioinformatics software tools. The availability of complete genome sequences, improvements in techniques for protein separation and displays (4-6), and new developments in mass spectrometry (MS) for protein identification (7-10) have arguably outpaced developments in bioinformatics. Using the complete mitochondrial, plastid, and nuclear Arabidopsis thaliana genomes (2, 11, 12) and results from recent proteomics experiments (13-16), it is possible to test the performance of algorithms that are commonly used to predict organellar targeting of nuclear gene products and transit peptide cleavage sites. Thus, it is timely to reevaluate the various tools at our disposal.
The chloroplast is a fascinating organelle not only because of its vital photosynthetic function but also because its development requires the coordinated expression and interaction of all three genomes. Chloroplasts have three membrane systems, the outer and inner envelopes and the thylakoids, which form the boundaries of three separate soluble compartments, the interenvelope space, stroma, and thylakoid lumen. Chloroplast proteins, encoded by nuclear genes, have amino-terminal transit sequences that aid in targeting the polypeptides to the correct chloroplast membrane or compartment (17-22). After translation in the cytosol, the transit peptide of the proprotein is phosphorylated by a cytosolic Ser/Thr kinase before import into the chloroplast (23). The phosphorylated proproteins are bound by the Toc complex (translocon of the outer chloroplast envelope) (24), transported across the outer envelope membrane, dephosphorylated in the interenvelope space, and then transported across the inner envelope by the Tic complex (20). Homologs for the Toc and Tic complexes have been found in the genome of Synechocystis sp. PCC 6803 (20), which supports the model that chloroplasts are reduced cyanobacterial endosymbionts (25-28). Once in the stroma the transit peptide is cleaved by a stromal processing protease (SPP).
Proteins targeted to the thylakoid lumen have bipartite transit peptides. The stromal targeting information is in the amino-proximal portion of the transit peptide, and the carboxyl-proximal portion, similar to signal peptides of secreted proteins in bacteria, contains the information for targeting to the lumen. Two different pathways for targeting to the lumen have been characterized. The first is a Sec-dependent pathway related to the SecYEG export mechanism in bacteria (29, 30). The second is a ΔpH-dependent mechanism characterized by two conserved sequential arginines in the transit peptide, the Tat (twin arginine translocase) pathway (31-33). Proteins targeted to the lumen via the Sec pathway are generally translocated in an unfolded state, whereas proteins imported via the Tat pathway are translocated in a folded state (22). Proteins imported into the lumen by either pathway are processed by a thylakoid processing protease that removes the carboxyl-proximal portion of the transit peptide.
Integral membrane proteins (IMPs) destined for the thylakoid membrane are targeted by one of two additional translocation mechanisms. It is thought that most thylakoid-bound proteins are targeted by information contained within the mature protein sequence. This insertion mechanism, which orients the amino terminus of the protein into the stroma, requires a signal recognition particle and a putative unidentified thylakoid-bound translocation complex (34-37). A few integral thylakoid proteins have bipartite transit peptides that are cleaved first in the stroma and again after membrane insertion. These proteins are inserted via a novel signal recognition particle-independent mechanism that appears to require no protein or nucleotide co-factors (the spontaneous pathway) and has the unique characteristic of leaving the amino terminus of the mature protein oriented into the lumen (38-42). Chloroplast thylakoids have been used as a model system for membrane proteomics because of the size of the proteome, the limited number of post-translational modifications (16), the ease of collecting large amounts of membranes, and the ability to easily subfractionate detergent-resistant domains (43).
Proteomics studies using two-dimensional (2D) gel electrophoresis have provided useful insights for soluble proteins of the lumen and stroma as well as peripheral thylakoid proteins (13, 44-47) and integral thylakoid proteins (48), but these methods do not allow convenient determination of the amino terminus where blocked, as is common in the chloroplast. In this work, a dataset of 35 nuclear encoded integral thylakoid membrane proteins from A. thaliana, where the sites of secondary amino-terminal processing were explicitly determined by MS, was used to challenge several bioinformatics tools. After initial trials the dataset was expanded to include photosystem (PS) II thylakoid-associated proteins from pea and spinach published previously (16, 49) and a small number of tobacco IMPs, providing a total of 58 nuclear encoded integral thylakoid membrane proteins, each with an experimentally determined amino terminus. The results demonstrate that although some programs (ChloroP, TargetP) reliably predict trafficking to the chloroplast, they variably predict the processing sites of the transit peptides. The results highlight inadequacies of currently available bioinformatics tools for prediction of secondary amino-terminal processing of transit peptides of integral thylakoid proteins and demonstrate the need for improvements in the datasets used to train the algorithms.

MATERIALS AND METHODS
Plant Growth Conditions-A. thaliana var. Columbia seeds were soaked in water (4°C) for 1-2 d before sowing. Seeds were sown in trays, and seedlings were transferred to 144 cm² × 12 cm high pots (~5 plants/pot) containing Scotts Pro-Gro professional potting mix (The Scotts Company, Marysville, OH) approximately 1 week after germination. Plants were grown in a growth chamber maintained at 23 ± 2°C in the light and 16 ± 2°C in the dark. The plants were illuminated with a 12-h light period and constant relative humidity of 70%. Fluorescent lamps (Sylvania F72T12/CW/VHO, 160 watts) supplemented with Sylvania 100-W incandescent bulbs were used as the light source. Quantum (400-700 nm) flux density was 700-800 µmol of photons m⁻² s⁻¹ at the level of the leaves. Plants were watered as necessary and supplemented once per week with 0.25× Hoagland's solution (50).
Leaves were harvested from 5-week-old A. thaliana plants, placed in square glass beakers, covered with grinding buffer (51), and ground with a chilled Polytron homogenizer (Brinkmann Instruments/Kinematica). The homogenate was filtered over 8 layers of diaper liners (Gerber), and the green filtrate was layered over (40%) Percoll gradients in 1.6-ml microfuge tubes (52). Thylakoids were collected from the Percoll pads, diluted in grinding buffer, and pelleted at full speed in a tabletop centrifuge. The green membrane pellets were resuspended to ~1.1 mg/ml chlorophyll in extraction buffer (51) and frozen in liquid nitrogen. This procedure typically required 10-15 min, and all manipulations were performed on ice or in a 4°C refrigerator under low light. Tobacco PS II thylakoid proteins were isolated as described previously (16).
Mass spectra were recorded on a PerkinElmer Life Sciences Sciex API III triple-quadrupole mass spectrometer with an Ionspray™ source as described by Whitelegge et al. (5). The instrument was scanned from m/z 600-2300 (0.3 step size, 1-ms dwell time, 6-s scan speed, orifice at 65 V). Manufacturer-supplied software was used for the computations of measured protein molecular weight (MacSpec  (Table I).
Cyanogen Bromide Cleavage-Aliquots of fractions (10 µl) collected during LCMS+ were treated with saturated cyanogen bromide (CNBr) (1 µl, 1 g/ml) for 4 h at room temperature in the dark. The reaction mixture was either spotted directly (0.3 µl plus 0.5 µl of matrix solution) or dried by centrifugal evaporation (SpeedVac). Dried reaction mixtures were redissolved in 5 µl of 70% acetic acid and analyzed (0.2 µl plus 0.5 µl of matrix) by matrix-assisted laser desorption ionization (MALDI) coupled to delayed extraction time-of-flight MS in the reflector mode (Voyager DE STR, Applied Biosystems) using α-cyano-4-hydroxycinnamic acid as matrix (10 mg/ml of solution in water/acetonitrile/trifluoroacetic acid 30/70/0.1) and internal/external calibration with bovine insulin. Manufacturer-supplied default settings for a method optimized for peptides less than 6000 Da were used for all samples. Some proteins have no internal Met or no CNBr peptides less than 8000 Da (MALDI in the reflectron mode has an upper limit for high resolution data acquisition of approximately 8000 Da); consequently some abundant proteins will not be detected during MALDI analysis, and peptides from low abundance proteins may appear to be more highly expressed. The MALDI data were compared with the fragmentation pattern predicted by the MS-Tag tool in ProteinProspector, version 4.0.4 (prospector.ucsf.edu/) (90). MS-Tag calculates the predicted fragment masses with the transit peptide sequence still included. There are very few, if any, amino-terminal peptide masses for chloroplast-targeted proteins predicted by MS-Tag. The MS-Digest tool was used to predict the fragment masses after manually removing the transit peptide from each prospective match assigned by MS-Tag.
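The manual step described above (remove the transit peptide, then predict CNBr fragments) can be illustrated with a minimal sketch. It is not the MS-Digest implementation: the function name and the toy sequence are hypothetical, cleavage is assumed to be complete and to occur only on the carboxyl side of Met, and fragment masses (including the conversion of the carboxyl-terminal Met to homoserine or its lactone) are deliberately not computed.

```python
def cnbr_fragments(precursor, mature_start):
    """Split the mature protein after every Met residue.

    mature_start is the 1-based position of the first residue of the mature
    protein, counting the initiating Met of the precursor as residue 1.
    """
    mature = precursor[mature_start - 1:]  # drop the transit peptide
    fragments, current = [], ""
    for residue in mature:
        current += residue
        if residue == "M":  # CNBr cleaves on the carboxyl side of Met
            fragments.append(current)
            current = ""
    if current:  # carboxyl-terminal fragment without a Met
        fragments.append(current)
    return fragments


# Hypothetical toy precursor, not a real protein sequence.
toy_precursor = "MSTAKLRSAVLGKMEAFTKMPQR"
print(cnbr_fragments(toy_precursor, mature_start=9))
```

Only the first fragment in the returned list depends on the assumed cleavage site, which is why the amino-terminal CNBr peptide is diagnostic for the processing site while internal fragments are not.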
Analysis of Tryptic Peptide Sequence Tags by Tandem Mass Spectrometry-Samples were analyzed by microliquid chromatography-MSMS with data-dependent acquisition (LCQ-DECA, ThermoFinnigan, San Jose, CA) after dissolution in 5 µl of 70% acetic acid (v/v). A reverse-phase column (200 µm × 10 cm, PLRP/S 5 µm, 300 Å; Michrom Biosciences, San Jose, CA) was equilibrated for 10 min at 1.5 µl/min with 95% A, 5% B (A, 0.1% formic acid in water; B, 0.1% formic acid in acetonitrile) prior to sample injection. A linear gradient was initiated 10 min after sample injection ramping to 60% A, 40% B after 50 min and 20% A, 80% B after 65 min. Column eluent was directed to a coated glass electrospray emitter (TaperTip, TT150-50-50-CE-5, New Objective) at 3.3 kV for ionization without nebulizer gas. The mass spectrometer was operated in "triple-play" mode with a survey scan (400-1500 m/z), data-dependent zoom scan, and MSMS. Individual sequencing experiments were matched to a custom A. thaliana sequence database using Sequest software (ThermoFinnigan). "No enzyme" was set such that Sequest considered all possible peptide sequence permutations rather than just tryptic ones. To identify N-acetylated peptides, a static modification of +42 Da was set for "amino terminus peptides."
Data Analysis-ChloroP, version 1.1 (www.cbs.dtu.dk/services/ChloroP/) (91), is a neural network method for identifying probable chloroplast transit peptide sequences and predicting the proteolytic cleavage site of each transit peptide. ChloroP presents its prediction of chloroplast targeting as a "Y" or "N" output based upon the predicted presence of a chloroplast transit peptide. TargetP, version 1.01 (www.cbs.dtu.dk/services/TargetP/) (92), is a layered neural network method for predicting subcellular targeting based upon the type of targeting/transit peptide predicted to be at the amino terminus of each protein.
TargetP predicts whether the protein in question is trafficked to the chloroplast, mitochondria, secretory pathway, or "other" subcellular location. PSORT (World Wide Web version, Oct 8, 1999, psort.ims.u-tokyo.ac.jp/) (93) is an expert system using a knowledge-base setup as an "if-then" cascade. PSORT predicts subcellular localization with much finer resolution than any of the other programs examined. The four subcellular/suborganellar localizations with highest scores are ranked in order. The output of PSORT is listed in the footnotes to Table IV. PSORT includes a hydrophobic moment analysis for chloroplast proteins as one of its expert analysis programs. The usefulness of this calculation is based on the assumption that all chloroplast proteins have a similar stromal targeting domain in the amino-terminal targeting peptide. The hydrophobic moment analysis in PSORT distinguishes chloroplast protein status as negative, positive, or undetermined. Predotar, version 0.5 (www.inra.fr/predotar/), is a program still under development designed to be a Web-based method for distinguishing chloroplast- from mitochondria-targeting sequences. Predotar predicts localization to the chloroplast,  (94), is a computational method for predicting mitochondrial targeting sequences and for predicting the proteolytic cleavage sites of the targeting peptides. MitoProt predicts targeting based upon a calculated probability score that the protein being examined is localized to the mitochondria. SignalP (www.cbs.dtu.dk/services/SignalP/) (95) allows discrimination between eukaryotic, Gram-positive bacterial and Gram-negative bacterial signal peptides. All three cases were tested with the dataset. Four parameters are calculated for a yes/no prediction of the presence of a signal peptide. Each organism group tested gave a predicted cleavage site. Our dataset was used to test each of the programs listed above. Default settings were used for ChloroP, MitoProt, and Predotar.
User-defined settings for TargetP were selected as follows: origin of sequences, plant, perform cleavage site predictions, no cutoff, and winner-takes-all. PSORT was set to plant source of input sequence. SignalP was set to the default analysis by all three prediction routines.

Assembly of the Dataset
A dataset was assembled from 58 nuclear encoded thylakoid-associated proteins, each with experimentally verified amino termini. The dataset includes 18 proteins from pea and spinach that have been described already (16,49), 35 proteins from A. thaliana (Table II), and five proteins from tobacco (Table III). The majority are integral membrane proteins isolated from thylakoids. For comparative purposes the set includes a small number of peripheral proteins localized to the stromal or luminal surface.

Assignment of Proteins and Post-translational Modifications
The 35 nuclear encoded proteins from A. thaliana thylakoids were defined by IMTs, and the identities were confirmed by CNBr peptide mass tags and/or MSMS sequencing of tryptic fragments (Table II). IMTs for chloroplast-encoded proteins targeted to the thylakoid will be presented independently. The results show that the amino termini of the Lhcb4 (LHCIIa/CP29) and Lhcb5 (LHCIIc/CP26) chlorophyll a/b-binding light-harvesting proteins are N-acetylvaline at position 33 (initiating Met is residue 1) and N-acetylleucine at position 38, respectively. The amino terminus of Lhcb4 was confirmed by MSMS sequencing of the amino-terminal tryptic peptide (acetyl-³³VFGFGK³⁸) from a fraction collected during LCMS+ concomitant with elution of the intact protein of 28,212.7 Da (LCMS+) (Fig. 1 and Table II) consistent with previous predictions (96). Note that this amino terminus gives a calculated mass of 28,212.07 Da for the gene product such that full agreement between measured and calculated mass (within measurement error, 0.01% or 2.8 Da) has been achieved.
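The agreement criterion quoted for Lhcb4 (measured 28,212.7 Da, calculated 28,212.07 Da, tolerance 0.01%) amounts to a one-line comparison. The sketch below only restates that arithmetic; the function name is hypothetical, and the tolerance is assumed to be taken relative to the calculated mass.

```python
def within_tolerance(measured_da, calculated_da, tolerance_percent=0.01):
    """Compare an intact mass tag with a calculated mass.

    Returns (agrees, absolute error in Da, allowed error in Da).
    """
    error_da = abs(measured_da - calculated_da)
    allowed_da = calculated_da * tolerance_percent / 100.0
    return error_da <= allowed_da, error_da, allowed_da


# Values for Lhcb4 as given in the text.
agrees, error_da, allowed_da = within_tolerance(28212.7, 28212.07)
print(agrees, round(error_da, 2), round(allowed_da, 1))
```

For these values the difference is 0.63 Da against an allowance of about 2.8 Da, so the measured and calculated masses agree within the stated measurement error.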
CNBr peptide mass tags (Fig. 2A) show that the amino-terminal amino acid of processed Lhcb5 is (N-acetyl) Leu 38 rather than Lys 41, which was predicted previously (96). The assignment of Lhcb5 was confirmed by MSMS sequence of tryptic fragments in the appropriate liquid chromatography fraction (Fig. 2B).
The peptide at m/z 4900.7 (Fig. 3) is assigned as the acetylated amino terminus of A. thaliana PsbS providing full agreement of measured and calculated mass (Table II) in confirmation of the previous identification of a 22,457.6-Da IMT as PsbS in spinach (16). Thus in A. thaliana the amino-terminal residue of PsbS is N-acetylleucine at position 53 (Fig. 3) rather than Ala 60 (Swiss-Prot accession number Q9XF91) or Ala 61 (96).
Three proteins (Lhcb1 II1, Lhcb1 II2, and PsbO III) were identified by CNBr peptide mass tags but not detected as IMTs during LCMS. Each of these proteins is encoded by one copy of a highly conserved multigene family and is present in low abundance compared with its respective paralog. The paralogs exhibit similar masses and similar retention times due to their high level of sequence conservation and therefore high conservation of biochemical/biophysical properties. Hence, the low abundance paralogs were "masked" by their higher abundance paralog(s) using LCMS (Fig. 4A). However, unique CNBr mass tags from each paralog could be detected in fractions collected during LCMS+. As an example, in A. thaliana the largest subunit of the oxygen-evolving enzyme (OEE1 or PsbO) is encoded on two paralogous genes on chromosomes III and V. The mature proteins encoded by these genes co-elute and differ by only 0.02% or 6 Da (26,571.7 Da (PsbO III) versus 26,565.7 Da (PsbO V)). Molecular mass spectra of PsbO show a mass peak at 26,566.7 Da that corresponds to PsbO V and a possible shoulder at 26,570 Da that may correspond to PsbO III (Fig. 4A). When the LCMS+ data was analyzed unique internal CNBr fragments enabled discrimination between the two paralogs (Fig. 4B). Assuming similar ionization efficiency, it is estimated that PsbO V is present in steady-state amounts about 5× that of PsbO III under the described growth conditions.
a Tobacco protein annotation using the standard gene names. b Intact mass tag of zero charge average mass determined by LCMS as described in Table II. c Average zero charge mass predicted as described in Table II.
FIG. 1. MSMS spectrum of the tryptic peptide sequence tag of the amino terminus of Lhcb4. The complete b- and y-ion series are indicated on the spectrum, but only the observed ions are shown in the inset. The peptide sequence acVFGVGKc (ac, acetyl amino terminus) was assigned with an Xcorr value of (2.108) when the database was modified to include the correct Lhcb4 amino terminus and static modification of the amino terminus was set to +42 Da (acetylation). Identities of the b- and y-ion series are indicated below the spectra for ease of alignment. The presence of Lhcb4 in the sample was confirmed by the presence of several additional internal peptides (
To accommodate the mass range of interest the spectrum was collected in the linear mode; thus the 13C isotopes of the peptide are not resolved, and average instead of monoisotopic masses were recorded. Identified peaks are labeled with letters, and unassigned peaks are labeled with their respective mass. The sample was spiked with an insulin standard for internal calibration. Underlined peak labels indicate carboxyl-terminal peptides, and boxed labels indicate amino-terminal peptides. The identified peaks are as follows: Peak A (measured mass (meas. mass) = 3885.8
y-ion series are indicated on the spectrum, but only the observed ions are shown in the inset. The peptide sequence HLSDPFGNNLLTVIAG-DTER was assigned with an Xcorr value of (5.382). Identities of the b- and y-ion series are indicated below the spectra for ease of alignment.
In addition, five membrane proteins from tobacco (Nicotiana tabacum) PS II membrane preparations were identified by their corresponding IMTs (Table III). The lack of sequence data from the nuclear genome complicates assignment of tobacco IMTs. Tentative identifications of these IMTs were based upon coincidence of measured and predicted mass after removal of the transit peptide (plus possible amino-terminal acetylation), and confidence in the assignments came from similarity between their liquid chromatography retention times and the retention times of orthologs from other plants (16).
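Matching an intact mass tag to a precursor sequence "after removal of the transit peptide (plus possible amino-terminal acetylation)" can be sketched as a scan over candidate amino termini. The following is an illustration only, not the procedure used for Table III: the function names and the toy sequence are hypothetical, the residue masses are standard average values rounded to four decimals, the 0.01% tolerance is borrowed from the measurement error quoted for Lhcb4, and carboxyl-terminal processing and other modifications are ignored.

```python
# Average residue masses in Da (standard values, rounded to 4 decimals).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.0153   # added once per polypeptide chain
ACETYL = 42.0367  # net mass added by amino-terminal acetylation


def average_mass(sequence, acetylated=False):
    """Average zero-charge mass of an unmodified or N-acetylated chain."""
    mass = sum(AVERAGE_RESIDUE_MASS[r] for r in sequence) + WATER
    return mass + ACETYL if acetylated else mass


def candidate_n_termini(precursor, measured_da, tolerance_percent=0.01):
    """List (start, acetylated, calculated mass) for every matching terminus.

    start is 1-based, counting the initiating Met as residue 1.
    """
    hits = []
    for start in range(1, len(precursor) + 1):
        mature = precursor[start - 1:]
        for acetylated in (False, True):
            mass = average_mass(mature, acetylated)
            if abs(mass - measured_da) <= mass * tolerance_percent / 100.0:
                hits.append((start, acetylated, round(mass, 2)))
    return hits


# Hypothetical toy precursor; the "measured" mass is simulated from it.
toy_precursor = "MSTAKLRSAVLGKMEAFTKMPQR"
simulated_mass = average_mass(toy_precursor[8:], acetylated=True)
print(candidate_n_termini(toy_precursor, simulated_mass))
```

With real proteins of 20-30 kDa the tolerance window is several daltons wide, so more than one candidate can match; this is one reason the text relies on retention times and amino-terminal peptides as independent confirmation.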

Tests of Web-based Protein Analysis Programs
Chloroplast Targeting-The targeting predictions of five publicly available software packages listed on the ExPasy proteomic tools Web page (ca.expasy.org/tools/; ChloroP, TargetP, PSORT, Predotar, and MitoProt II) were tested with the 58-protein dataset (Table IV). ChloroP and TargetP correctly predicted chloroplast targeting in 57 and 56 instances, respectively. Previous proteomic surveys of soluble chloroplast proteins using 2D gel separation followed by tryptic digestion and assignment of the peptide fragments by matrix-assisted laser desorption ionization time-of-flight (MALDI-TOF) MS are in agreement with these results. ChloroP predicted that 31 of 34 (13) and TargetP predicted that 74 of 75 (46) chloroplast proteins identified by internal mass or sequence tags are targeted to the chloroplast. However, without an equivalent dataset of non-chloroplast proteins with explicitly determined amino termini, it is not possible to test for false positives using the same predictive criteria. A recent study of 80 proteins identified from the A. thaliana mitochondria proteome by 2D gel separation/MALDI-MS showed that TargetP predicted 54 to be targeted to the mitochondria. Sixteen (20%) of the 80 proteins were predicted to be chloroplast-targeted, including mitochondrial proteins such as succinyl-CoA ligase, cytochrome c oxidase, and the mitochondrial elongation factor Tu (14).
Predotar predicted that 40 (69%) of the proteins in our dataset are targeted to the chloroplast, 2 are targeted to the mitochondria, 1 is targeted to both organelles, and 15 are targeted to neither. Predotar had similar success in predicting plastid targeting (58/75, 77%) when analyzing soluble chloroplast proteins identified by 2D gel/MALDI-MS (46). From our dataset, each protein predicted by Predotar not to be targeted to either organelle had a closely related paralog or ortholog (except for PsaH) predicted to be plastid-targeted. The proteins in this "neither" category appear to have no species bias. It is of interest that A. thaliana Lhcb1 Ia and Lhcb1 Ib were not predicted to be targeted to the same location despite having identical amino acid sequences except at position 20 in the transit peptide (Lys 20 in Lhcb1 Ia and Asn 20 in Lhcb1 Ib ). In this case an additional positive charge (3 versus 2) in the transit peptide is apparently sufficient to predict an alternate location for the chloroplast protein.
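The success rates quoted above follow directly from the counts in the text. The short tally below reproduces them as a consistency check; the dictionary and function names are hypothetical, and only the counts stated in the text are used.

```python
DATASET_SIZE = 58

# Proteins correctly predicted to be chloroplast-targeted (counts from the text).
correct_chloroplast_calls = {"ChloroP": 57, "TargetP": 56, "Predotar": 40}

# Predotar outcomes for the same dataset (counts from the text).
predotar_outcomes = {"chloroplast": 40, "mitochondria": 2, "both": 1, "neither": 15}


def percent_correct(n_correct, n_total):
    """Fraction of correct calls expressed as a percentage."""
    return 100.0 * n_correct / n_total


for program, n_correct in correct_chloroplast_calls.items():
    print(program, round(percent_correct(n_correct, DATASET_SIZE), 1))
```

Because the dataset contains only chloroplast proteins, these figures measure sensitivity alone; as noted in the text, false positive rates cannot be estimated from it.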
PSORT attempts a more ambitious prediction of protein localization by assigning a suborganellar destination for nuclear encoded proteins. PSORT does not make a single prediction but ranks the four suborganellar compartments with the highest scores. One of the expert programs of PSORT is a hydrophobic moment analysis for predicting chloroplast proteins. The PSORT results can be interpreted in several   (13). Less strict interpretations are reliable if the suborganellar destination of the protein being examined is already known. It is therefore suggested that assignment of an unknown/hypothetical protein as a chloroplast protein should only be considered when three of the four PSORT suborganellar predictions are to chloroplast compartments. Furthermore, the tentative assignment should be confirmed by further experimentation.
MitoProt was initially used as a negative control, but surprisingly it predicted that a large number of stromal proteins and thylakoid IMPs (16/47, cutoff >0.90; 32/47, cutoff >0.80) would be targeted to the mitochondria (Table IV). MitoProt predicted that proteins targeted to the lumen would not go to the mitochondria. Although the size of the datasets for stromal and luminal proteins is small, the trend is striking. MitoProt identified 53 of 80 (14) and 39 of 48 (15) confirmed mitochondrial proteins as being trafficked to the mitochondria using a cutoff of >0.85. The presence of chloroplast proteins in tests of MitoProt reduced the reliability of the predictions (94). Here it is demonstrated that for certain chloroplast proteins the probability of MitoProt incorrectly predicting a mitochondrial destination is high, and thus one recommendation that emerged from these tests is that MitoProt results from plastid-containing eukaryotes be screened against ChloroP/TargetP to help reduce false positive predictions.
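The recommended screen of MitoProt results against ChloroP can be expressed as a simple filter. This is a minimal sketch under stated assumptions, not a tool used in this work: the function name, protein identifiers, and scores are hypothetical, the ChloroP result is assumed to be available as its "Y"/"N" output, and the default cutoff of 0.85 is the one used in the cited mitochondrial studies.

```python
def screen_mitochondrial_calls(mitoprot_scores, chlorop_calls, cutoff=0.85):
    """Separate MitoProt hits into retained calls and probable plastid proteins.

    mitoprot_scores maps protein name to MitoProt probability score.
    chlorop_calls maps protein name to the ChloroP "Y"/"N" output.
    """
    retained, flagged = [], []
    for protein, score in mitoprot_scores.items():
        if score <= cutoff:
            continue  # not predicted to be mitochondrial at this cutoff
        if chlorop_calls.get(protein) == "Y":
            flagged.append(protein)  # likely false positive: plastid transit peptide
        else:
            retained.append(protein)
    return retained, flagged


# Hypothetical scores and calls for illustration only.
scores = {"protA": 0.95, "protB": 0.91, "protC": 0.40}
calls = {"protA": "Y", "protB": "N", "protC": "Y"}
print(screen_mitochondrial_calls(scores, calls))
```

Flagged proteins are set aside rather than discarded, since a positive ChloroP call does not by itself establish plastid localization.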
Transit Peptide Cleavage Prediction-The transit peptide is cleaved by SPP after translocation across the chloroplast envelope into the stroma. Proteins destined for the lumen have a bipartite transit peptide that is removed by the thylakoid processing protease (TPP) (22). ChloroP, TargetP, and MitoProt each contain transit peptide cleavage predictions. TargetP uses the ChloroP cleavage assignment if it predicts the protein is chloroplast-targeted. Only two of the test proteins were predicted by TargetP to go to the mitochondria, therefore only the ChloroP results are presented (Table V). SignalP was tested because the two targeting pathways to the lumen, Sec-dependent and ΔpH-dependent, have analogous systems in eukaryotic, Gram-positive (Gm+), and Gram-negative (Gm−) bacterial secretory pathways (22, 46).
a A. thaliana and tobacco protein annotations as described in Tables II and III. Pea and spinach annotations using standard genes. b At, A. thaliana; Ps, Pisum sativum; So, Spinacia oleracea; Nt, N. tabacum. c Chloroplast localization. S, stroma; T+, thylakoid membrane via the signal recognition particle-dependent pathway; T−, thylakoid membrane via the spontaneous pathway; Ls, lumen via the Sec pathway; Lt, lumen via the Tat pathway.
None of the programs adequately predict the transit peptide processing site of the membrane-spanning proteins inserted into the thylakoid via the signal recognition particle-dependent pathway. ChloroP correctly predicted the transit peptide cleavage site for one thylakoid IMP (A. thaliana PetC IV, the cytochrome b6/f complex Rieske Fe-S protein), but for the spinach and pea orthologs it predicted a cleavage site 7 amino acids carboxyl-proximal and 10 amino acids amino-proximal, respectively (Table V). The majority of the correct predictions for MitoProt are for the Lhcb1 and Lhcb2 proteins.
If MitoProt predicted the presence of a mitochondrial transit peptide, then it consistently predicted a cleavage site 2 amino acids carboxyl-proximal from the actual site in the Lhcb1 and Lhcb2 proteins. ChloroP predictions ranged from 11 amino acids amino-proximal to 12 amino acids carboxyl-proximal to the actual site for the same set of proteins (Table V). The appearance that MitoProt does better at predicting the transit peptide cleavage site of thylakoid IMPs is more likely due to its consistency at predicting the same cleavage site for orthologous proteins than due to recognition of some specific thylakoid-localizing domain. It is somewhat surprising that the MitoProt algorithm, which was not designed to work with chloroplast-targeted proteins, is much more consistent in its predictions of where a putative cleavage site is located than ChloroP, which was specifically created to analyze chloroplast proteins.
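The offsets discussed here can be computed with a signed difference. The sketch below assumes the convention stated in the Table V footnotes: a positive value means the predicted cleavage site is amino-proximal to the observed amino terminus, a negative value means it is carboxyl-proximal, and a prediction within ±2 residues is counted as acceptable. The function names are hypothetical; the example positions are the Lhcb5 and PsbS termini quoted earlier in the text, used only to illustrate the sign convention.

```python
def cleavage_offset(observed_n_terminus, predicted_n_terminus):
    """Signed offset between observed and predicted amino-terminal residues.

    Positions are 1-based with the initiating Met as residue 1.
    Positive: prediction is amino-proximal; negative: carboxyl-proximal.
    """
    return observed_n_terminus - predicted_n_terminus


def is_acceptable(offset, window=2):
    """True if the prediction lies within the +/- window used for boxing."""
    return abs(offset) <= window


# Lhcb5: observed Leu 38 versus the earlier assignment of Lys 41.
print(cleavage_offset(38, 41), is_acceptable(cleavage_offset(38, 41)))
# PsbS: observed position 53 versus the annotated Ala 60.
print(cleavage_offset(53, 60), is_acceptable(cleavage_offset(53, 60)))
```

Under this convention the consistent MitoProt error of 2 residues carboxyl-proximal for Lhcb1 and Lhcb2 gives an offset of -2, which falls inside the ±2 window even though the predicted site is not the true one.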
PsbW and PsbX c (footnote 2) are the only thylakoid-bound proteins in the dataset that have amino termini oriented into the thylakoid lumen (39, 41). They are also the only proteins in the dataset that are targeted to the thylakoid by the spontaneous insertion pathway (42). The Gram-negative SignalP program correctly predicted the transit peptide processing sites for the two PsbW and single PsbX c examples in the study. We also tested the two other membrane proteins, PsbY and AtpG, which are inserted into the thylakoid via the spontaneous mechanism (42). Two PsbY proteins (spinach and A. thaliana, Swiss-Prot accession numbers P80470 and O49347, respectively) and two AtpG proteins (subunit II of the CF0 ATPase; spinach, Swiss-Prot accession number P31853; and A. thaliana, Swiss-Prot accession number Q42139) are listed in the Swiss Protein/TrEMBL databases in addition to the single PsbX c and two PsbW proteins in the dataset. Bacterial signal peptides are cleaved according to the (−3, −1) rule where the −3 and −1 positions before the cleavage site are small neutral amino acids (usually Ala) (95, 97). Eukaryotic and Gm+ signal peptide consensus sequences generally have hydrophobic amino acids in the region between −16 and −25, whereas Gm− signal peptides are more likely to have hydrophilic or positively charged amino acids in this region, like the transit peptides of thylakoid proteins imported via the spontaneous path. The Gm− SignalP program correctly predicted the processing site of all seven proteins that are imported into the thylakoid membrane by this mechanism (data not shown). This unexpected observation suggests that the spontaneous thylakoid import mechanism may be related to the secretory mechanism of Gram-negative bacteria.
2 psbT is the name of the chloroplast gene encoding the PS II reaction center T protein. psbT was also used as the name for the nuclear gene encoding the soluble PS II Mr 5000 oxygen-evolving enzyme protein targeted to the lumen, later renamed psbX. Unfortunately, psbX also was used for a gene encoding a membrane-spanning PS II protein of unclear function in cyanobacteria and the plastids of rhodophyta and cryptophyta. Orthologs of this cyanobacterial gene have been found in A. thaliana (98)
Tables II and III. Pea and spinach annotations using standard gene names. 2 Species abbreviations as in Table IV. 3 Chloroplast localization as in Table IV. 4 Δ, difference between observed and predicted amino-terminal residue. Positive numbers indicate the predicted cleavage site is amino-proximal to the observed amino terminus. Negative numbers indicate the predicted cleavage site is carboxyl-proximal to the observed amino terminus. Boxes indicate that the predicted amino terminus is within ± 2 amino acids of the amino-terminal amino acid observed by MS. Gray boxes indicate program that best predicts the cleavage site for proteins in the stroma and the lumen. The special case of PsbW and PsbX c is discussed in the text. Euk, eukaryotic. 5 Amino terminus determined by LCMS. n, unblocked alpha amine; ac, acetylated alpha amine.
DISCUSSION
There is clearly a need for reliable bioinformatics software for predicting protein destinations and processing sites in large proteomics datasets. However, the requirement for experimentally verified amino and carboxyl termini of the mature proteins is rarely met because of the typically incomplete protein coverage obtained in standard peptide mass or sequence tag protein identification experiments. Thus, the LCMS+ approach is indispensable because it provides intact protein mass information as well as eliminating peptide losses associated with recovery of peptides from gels. The intact mass usually yields the amino-terminal cleavage site, although incorrect assignment of the carboxyl terminus could potentially confound assignments based upon the intact mass alone. Finding the amino-terminal peptide in digested LCMS+ fractions using standard software packages remains challenging because protein masses are generally calculated from translated mRNA and include the transit peptide in the mass calculation. By not specifying a particular digest it is possible to detect non-tryptic peptides potentially representing the amino terminus unless they are acetylated at the amino terminus, in which case it is necessary to modify all peptide amino termini for searching. Although a non-tryptic cleavage at the amino terminus of a peptide near the predicted cleavage site may locate the amino terminus, this is hypothetical unless it is also Nα-acetylated or in agreement with the measured intact mass. The current state of the art does not permit rapid prediction of the mass of putative chloroplast proteins after transit peptide cleavage, and there is no method for predicting the acetylation of the amino terminus that is common for thylakoid-bound proteins.
All the programs that were tested were deficient in one or more of their predictive algorithms, and none adequately predicts the transit peptide processing site of the membrane-spanning proteins in the thylakoid. The training dataset used for the neural networks of ChloroP is the only one readily available for examination (www.cbs.dtu.dk/services/ChloroP/pages/datasets.html). This dataset was created by extracting from Swiss-Prot release 35 all entries with an FT line containing the key word "TRANSIT" and description word "CHLOROPLAST." Entries with a chloroplast transit peptide marked "BY SIMILARITY," "PROBABLE," or "POTENTIAL" were removed. The remaining entries were screened with SignalP, and those that showed the presence of the bipartite transit peptide for trafficking to the thylakoid or lumen were removed. A final set of 75 proteins was obtained after homology reduction and removal of a few mistargeted proteins (91). The removal of thylakoid- and lumen-targeted proteins essentially makes ChloroP a stroma-targeting and processing prediction algorithm (Table VI). Because the thylakoid- and lumen-localized proteins must pass through the chloroplast envelope to the stroma, the chloroplast-targeting predictions are correct, but the transit peptide cleavage predictions of ChloroP are only for proteins cleaved by SPP. Thus the ChloroP Web page is somewhat misleading in presenting results of the transit peptide cleavage prediction as the amino terminus for all chloroplast-targeted proteins rather than just the stromal ones.
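The selection steps described above can be sketched as a simple filter. This is an illustrative reconstruction, not the ChloroP authors' code: it assumes whitespace-separated feature-table lines of the form `FT   TRANSIT   start   end   description`, and the SignalP screen is represented by a hypothetical caller-supplied function, `looks_bipartite`, because no real SignalP interface is assumed here. Homology reduction and removal of mistargeted proteins are not modeled.

```python
from typing import Callable, Mapping, Sequence

# Qualifiers that mark a transit peptide annotation as not experimentally verified.
UNCERTAIN = ("BY SIMILARITY", "PROBABLE", "POTENTIAL")


def has_verified_chloroplast_transit(ft_lines: Sequence[str]) -> bool:
    """True if any FT TRANSIT line names CHLOROPLAST without an uncertainty qualifier."""
    for line in ft_lines:
        fields = line.split(None, 4)  # "FT", key, start, end, description
        if len(fields) < 5 or fields[0] != "FT" or fields[1] != "TRANSIT":
            continue
        description = fields[4].upper()
        if "CHLOROPLAST" in description and not any(q in description for q in UNCERTAIN):
            return True
    return False


def select_training_entries(entries: Mapping[str, Sequence[str]],
                            looks_bipartite: Callable[[str], bool]) -> list[str]:
    """Keep accessions with a verified chloroplast transit peptide and no
    bipartite (thylakoid or lumen) signal according to the supplied screen."""
    return [accession for accession, ft_lines in entries.items()
            if has_verified_chloroplast_transit(ft_lines)
            and not looks_bipartite(accession)]
```

A filter of this kind trusts the annotation qualifiers completely, so an entry whose qualifier was omitted passes unchallenged.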
Our re-examination of the ChloroP dataset (Table VI) shows several problems that are probably common to all such training/testing datasets. Of the 75 proteins in the dataset only 43 (57%) had explicitly determined transit peptide cleavage sites. Nine of the remaining sequences had a cleavage site that should have been labeled in the Swiss-Prot entry "BY SIMILARITY" because the amino terminus indicated in the entry was predicted from an ortholog that had an experimentally verified amino terminus. There were 15 entries where no experimental evidence confirming the cleavage site had been obtained from any plant or alga and where the annotation "POTENTIAL" or "PROBABLE" was omitted. Five more entries that should have had the transit peptide labeled "PROBABLE" have been shown previously by experimental peptide sequence data to be cleaved at a site different from the annotation in the Swiss Protein Database (accession numbers P11893, P17067, P09195, P12360, and P07370). Three of the five entries listed above had experimental confirmation of the amino terminus from orthologous proteins published several years prior to Swiss-Prot release 35 and are still incorrectly annotated (70, 87, 99). Of the three remaining entries, two were cleavage sites determined from proteins made in cell-free extracts and imported into chloroplasts of heterologous species (Swiss-Prot accession numbers P15102 and P00873), and one was based on the amino terminus found after recovery of protein expressed in transgenic Escherichia coli (Swiss-Prot accession number P12629). Examples of misannotations include the entries for the tomato chloroplast biosynthetic threonine dehydratase (Swiss-Prot accession number P25306), the barley chloroplast photosystem I reaction center subunit XI (PsaL) (Swiss-Prot accession number P23993), and the safflower chloroplast acyl-[acyl-carrier protein] desaturase (Swiss-Prot accession number P22243).
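The breakdown of the 75 entries can be checked with a short tally. The counts are taken from the paragraph above; the category labels are paraphrases introduced here for illustration, and the helper name `shares` is hypothetical.

```python
# Counts as stated in the text; labels are paraphrased for this sketch.
CHLOROP_ENTRY_COUNTS = {
    "explicitly determined cleavage site": 43,
    "should have been labeled BY SIMILARITY": 9,
    "no experimental evidence, qualifier omitted": 15,
    "annotation contradicted by peptide sequence data": 5,
    "heterologous import or E. coli expression": 3,
}


def shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each category as a percentage of the total number of entries."""
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


if __name__ == "__main__":
    for label, percent in shares(CHLOROP_ENTRY_COUNTS).items():
        print(f"{label}: {percent:.0f}%")
```

The five categories sum to 75, so every entry in the dataset is accounted for once, and the first category reproduces the 57% quoted in the text.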
The tomato threonine dehydratase has Lys52 annotated as the experimentally confirmed amino terminus of the processed preprotein. The work referenced for this annotation did not include protein sequencing, and the 10-amino acid sequence annotated as the amino-terminal protein sequence in the Swiss Protein Database was from a figure where a 10-amino acid sequence was underlined to indicate the proposed region for the transit peptide cleavage site (100). The barley PsaL protein is blocked at the amino terminus of the processed protein, and the annotated amino terminus (Ala41) found in the Swiss Protein Database was one of two proposed sites for the transit peptide cleavage site (101). The amino terminus of the processed PsaL protein has not been verified in vascular plants and should have been annotated "POTENTIAL." The amino terminus of the safflower acyl-[acyl-carrier protein] desaturase is also blocked (102), and the annotation indicating an experimentally determined amino terminus (Ala34) is from two overlapping peptide sequences at the amino-proximal end of a peptide map that begins with the same amino acid. This transit peptide should have been annotated as "PROBABLE." It is not surprising that the gargantuan task of annotating the Swiss Protein Database results in frequent unclear or misannotated entries. A suggested improvement in the Swiss Protein Database would be to link the "BY SIMILARITY" annotation to the entry to which it is referring. Although this would not solve the problems encountered when the annotation is missing, it would greatly speed up cross-referencing of predicted modifications and functional groups. The current situation requires time-consuming, careful, and diligent confirmation of the annotation of each entry before attempting to correlate intact protein mass tags with translated genomic sequence information, especially for proteins that have amino-terminal trafficking peptides.
The same sort of diligence should also be applied when creating software training/testing datasets, thereby minimizing ambiguities or errors that decrease confidence in the bioinformatics tools trained/tested with these datasets.
The range of species used in the ChloroP training set is narrow due to the lack of experimental verification of post-translational modifications for most proteins (see Table I), and the dataset is skewed toward proteins from pea and spinach. The training set also has a significant number of algal sequences (Table VI), but algal transit peptides are 32 amino acids shorter, on average, than those from vascular plants (103). Where direct comparisons between orthologous proteins are possible, the transit peptide cleavage sites are not conserved between the algae and vascular plants. It is therefore reasonable to expect that separate training datasets are needed for vascular plants and green algae, as is done in SignalP for the eukaryotes, Gram-positive bacteria, and Gram-negative bacteria. Additionally, it seems reasonable to use separate training sets for each suborganellar compartment or for each characterized translocation mechanism. In this way the training sets would narrow the focus of the neural net programs and help to avoid the confusion created by attempting to find a generalized transit peptide cleavage motif that may not be universally applicable for all photosynthetic organisms.
The results in Table V suggest that there may be at least four proteases in the chloroplast involved in removing transit peptides: a stromal processing protease, for which the cleavage site can be predicted by ChloroP for proteins targeted to the stroma; a lumen processing protease, for which the cleavage site can be predicted by SignalP (eukaryotic) for proteins imported into the lumen via the Sec or Tat mechanisms; a thylakoid processing protease, for which the cleavage site is predicted by SignalP (Gm−) for proteins inserted into the membrane via the spontaneous mechanism; and a thylakoid protease, for which no reliable cleavage prediction program exists for proteins inserted into the membrane via a signal recognition particle-dependent mechanism. Attempts to identify common features of either transit peptides or mature proteins that might be useful for such predictions have thus far been unfruitful.
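The scoring convention used for the cleavage predictions and the program suggested for each class of protein can be summarized in a small sketch. It assumes that positions are 1-based residue numbers of the first mature residue, so that a predicted site amino-proximal to the observed amino terminus gives a positive Δ; the dictionary keys and function names are hypothetical labels introduced here.

```python
def delta(observed_start: int, predicted_start: int) -> int:
    """Difference between observed and predicted amino-terminal residue.

    Positive: the predicted site is amino-proximal to the observed terminus.
    Negative: the predicted site is carboxyl-proximal to the observed terminus.
    """
    return observed_start - predicted_start


def within_box(observed_start: int, predicted_start: int, window: int = 2) -> bool:
    """True when the prediction falls within +/- window residues of the observation."""
    return abs(delta(observed_start, predicted_start)) <= window


# Program suggested by the results for each class of protein.
# None marks the class for which no reliable predictor was found.
SUGGESTED_PREDICTOR = {
    "stroma": "ChloroP",
    "lumen (Sec or Tat)": "SignalP (eukaryotic)",
    "thylakoid-integral, spontaneous": "SignalP (Gm-)",
    "thylakoid-integral, SRP-dependent": None,
}
```

Keeping the last class explicit, with no program assigned, reflects that any prediction for such proteins would need experimental confirmation of the amino terminus.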
Previous proteomic studies of A. thaliana organelles that tested the trafficking predictions of various predictive programs (13, 14) are in general agreement with the results we observe for trafficking. However, because the earlier studies identified proteins based upon internal sequence or peptide mass tags, they were unable to assess the cleavage prediction routines. Subsequent studies of luminal proteins in A. thaliana analyzed the cleavage prediction routines of TargetP and SignalP (46, 47). Ten proteins overlap between these studies and ours: six lumen (PsbQ VI1, PsbQ VIα, PsbP I, PsaN V, PsbO V, and PsbO III with The Institute for Genomic Research (TIGR) chromosome locus numbers At4g21280, At4g05180, At1g06680, At5g64040, At5g66570, and At3g50820, respectively), two stroma (PsaD IV and PsaE IV with TIGR chromosome locus numbers At4g02770 and At4g28570, respectively), and two thylakoid-integral (AtpC IV and AtpD IV with TIGR chromosome locus numbers At4g04640 and At4g09650, respectively). SignalP was better at predicting the amino terminus of luminal proteins than TargetP (46), which is in agreement with our observations; however, for only three of the luminal proteins that overlapped with our study (PsbP I, PsbO V, and PsbO III) was the correct amino terminus used when testing the programs. In contrast, our results agree completely with the experimentally determined amino termini for the oxygen-evolving enzyme proteins in the second study (47).
Integral membrane proteins (IMPs) are predicted to account for at least one third of the open reading frames in the A. thaliana genome (104). The 2D methods currently in use are biased against low abundance proteins (105, 106) and IMPs. However, IMPs with masses up to and exceeding 100,000 Da and containing up to 15 membrane-spanning α-helices have now been successfully characterized by LCMS using intact mass tags (IMTs) (5, 16, 49, 88, 107-111).3 LCMS+ allows us to subject fractions to off-line digestion to generate peptide fragments for mass tag (112-115) and sequence tag (116, 117) experiments to confirm the IMT data (49, 118, 119). Characterizing a protein based upon its intact mass tag allows us to simultaneously determine secondary post-translational modifications for several proteins. LCMS+ is the only method currently available to rapidly determine such modifications for a large number of proteins. Consequently, LCMS+ should be the method of choice for obtaining datasets to train/test predictive bioinformatics tools for trafficking and post-translational processing.