From the Pasarow Mass Spectrometry Laboratory, Department of Psychiatry and Biobehavioral Sciences, Department of Chemistry and Biochemistry, and Neuropsychiatric Institute, University of California, Los Angeles, California 90095, ¶ Biosphere 2, Columbia University, Oracle, Arizona 85623, and || College of Natural Sciences, California State University, Chico, California 95929
| ABSTRACT |
|---|
Proteomics, the systematic study of large sets of proteins expressed by a particular cell type or tissue, is largely possible because of recent technological and computational advances. Contemporary proteomics is based on the following three components: analytical separation of proteins from complex starting material, identification of the proteins, and subsequent sorting and classification of the datasets using bioinformatics software tools. The availability of complete genome sequences, improvements in techniques for protein separation and displays (4–6), and new developments in mass spectrometry (MS) 1 for protein identification (7–10) have arguably outpaced developments in bioinformatics. Using the complete mitochondrial, plastid, and nuclear Arabidopsis thaliana genomes (2, 11, 12) and results from recent proteomics experiments (13–16), it is possible to test the performance of algorithms that are commonly used to predict organellar targeting of nuclear gene products and transit peptide cleavage sites. Thus, it is timely to reevaluate the various tools at our disposal.
The chloroplast is a fascinating organelle not only because of its vital photosynthetic function but also because its development requires the coordinated expression and interaction of all three genomes. Chloroplasts have three membrane systems, the outer and inner envelopes and the thylakoids, which form the boundaries of three separate soluble compartments, the interenvelope space, stroma, and thylakoid lumen. Chloroplast proteins, encoded by nuclear genes, have amino-terminal transit sequences that aid in targeting the polypeptides to the correct chloroplast membrane or compartment (17–22). After translation in the cytosol, the transit peptide of the proprotein is phosphorylated by a cytosolic Ser/Thr kinase before import into the chloroplast (23). The phosphorylated proproteins are bound by the Toc complex (translocon of the outer chloroplast envelope) (24), transported across the outer envelope membrane, dephosphorylated in the interenvelope space, and then transported across the inner envelope by the Tic complex (20). Homologs for the Toc and Tic complexes have been found in the genome of Synechocystis sp. PCC 6803 (20), which supports the model that chloroplasts are reduced cyanobacterial endosymbionts (25–28). Once in the stroma the transit peptide is cleaved by a stromal processing protease (SPP).
Proteins targeted to the thylakoid lumen have bipartite transit peptides. The stromal targeting information is in the amino-proximal portion of the transit peptide, and the carboxyl-proximal portion, similar to signal peptides of secreted proteins in bacteria, contains the information for targeting to the lumen. Two different pathways for targeting to the lumen have been characterized. The first is a Sec-dependent pathway related to the SecYEG export mechanism in bacteria (29, 30). The second is a ΔpH-dependent mechanism characterized by two conserved sequential arginines in the transit peptide, the Tat (twin arginine translocase) pathway (31–33). Proteins targeted to the lumen via the Sec pathway are generally translocated in an unfolded state, whereas proteins imported via the Tat pathway are translocated in a folded state (22). Proteins imported into the lumen by either pathway are processed by a thylakoid processing protease that removes the carboxyl-proximal portion of the transit peptide.
Integral membrane proteins (IMPs) destined for the thylakoid membrane are targeted by one of two additional translocation mechanisms. It is thought that most thylakoid-bound proteins are targeted by information contained within the mature protein sequence. This insertion mechanism, which orients the amino terminus of the protein into the stroma, requires a signal recognition particle and a putative unidentified thylakoid-bound translocation complex (34–37). A few integral thylakoid proteins have bipartite transit peptides that are cleaved first in the stroma and again after membrane insertion. These proteins are inserted via a novel signal recognition particle-independent mechanism that appears to require no protein or nucleotide co-factors (the spontaneous pathway) and has the unique characteristic of leaving the amino terminus of the mature protein oriented into the lumen (38–42). Chloroplast thylakoids have been used as a model system for membrane proteomics because of the size of the proteome, the limited number of post-translational modifications (16), the ease of collecting large amounts of membranes, and the ability to easily subfractionate detergent-resistant domains (43).
Proteomics studies using two-dimensional (2D) gel electrophoresis have provided useful insights for soluble proteins of the lumen and stroma as well as peripheral thylakoid proteins (13, 44–47) and integral thylakoid proteins (48), but these methods do not allow convenient determination of the amino terminus where blocked, as is common in the chloroplast. In this work, a dataset of 35 nuclear encoded integral thylakoid membrane proteins from A. thaliana, where the sites of secondary amino-terminal processing were explicitly determined by MS, was used to challenge several bioinformatics tools. After initial trials the dataset was expanded to include photosystem (PS) II thylakoid-associated proteins from pea and spinach published previously (16, 49) and a small number of tobacco IMPs, providing a total of 58 nuclear encoded integral thylakoid membrane proteins, each with an experimentally determined amino terminus. The results demonstrate that although some programs (ChloroP, TargetP) reliably predict trafficking to the chloroplast, they variably predict the processing sites of the transit peptides. The results highlight inadequacies of currently available bioinformatics tools for prediction of secondary amino-terminal processing of transit peptides of integral thylakoid proteins and demonstrate the need for improvements in the datasets used to train the algorithms.
| MATERIALS AND METHODS |
|---|
5 plants/pot) containing Scotts Pro-Gro professional potting mix (The Scotts Company, Marysville, OH) approximately 1 week after germination. Plants were grown in a growth chamber maintained at 23 ± 2 °C in the light and 16 ± 2 °C in the dark. The plants were illuminated with a 12-h light period and constant relative humidity of 70%. Fluorescent lamps (Sylvania F72T12/CW/VHO, 160 watts) supplemented with Sylvania 100-W incandescent bulbs were used as the light source. Quantum (400–700 nm) flux density was 700–800 µmol of photons m⁻² s⁻¹ at the level of the leaves. Plants were watered as necessary and supplemented once per week with 0.25x Hoagland's solution (50).
Leaves were harvested from 5-week-old A. thaliana plants, placed in square glass beakers, covered with grinding buffer (51), and ground with a chilled Polytron homogenizer (Brinkmann Instruments/Kinematica). The homogenate was filtered over 8 layers of diaper liners (Gerber), and the green filtrate was layered over (40%) Percoll gradients in 1.6-ml microfuge tubes (52). Thylakoids were collected from the Percoll pads, diluted in grinding buffer, and pelleted at full speed in a tabletop centrifuge. The green membrane pellets were resuspended to 1.1 mg/ml chlorophyll in extraction buffer (51) and frozen in liquid nitrogen. This procedure typically required 10–15 min, and all manipulations were performed on ice or in a 4 °C refrigerator under low light. Tobacco PS II thylakoid proteins were isolated as described previously (16).
LCMS+
A. thaliana thylakoid protein samples (66 µg of chlorophyll) were prepared by acetone precipitation prior to dissolution in 60% acetic acid. Reverse-phase chromatography was performed as described previously (5, 49) using a poly(styrene-divinylbenzene) copolymer (Polymer Labs PLRP/S, 5 µm x 300 Å, 2.1 x 300 mm) stationary-phase column equilibrated in aqueous 0.1% trifluoroacetic acid containing 5% acetonitrile and eluted (100 µl/min at 40 °C) with a stepped increasing concentration of acetonitrile (min/% acetonitrile: 0/5, 5/5, 10/25, 130/75, and 150/100) initiated at the moment of sample injection (100 µl/injection). The absorption of the effluent from the column was recorded at 280 nm. The effluent was split, and a portion was sent to a fraction collector (typically about 75%, 2-min fractions); the remainder was directed straight into the MS ion source (LCMS+). Tobacco protein samples were measured by LCMS as described previously (16).
Mass spectra were recorded on a PerkinElmer Life Sciences Sciex API III triple-quadrupole mass spectrometer with an Ionspray™ source as described by Whitelegge et al. (5). The instrument was scanned from m/z 600–2300 (0.3 step size, 1-ms dwell time, 6-s scan speed, orifice at 65 V). Manufacturer-supplied software was used for the computations of measured protein molecular weight (MacSpec 3.3) and zero-charge molecular weight reconstructions (BioMultiView 1.3.1). Calculated average molecular weights were generated from translated gene sequences using PeptideMass (expasy.cbr.nrc.ca/tools/peptide-mass.html) after manually removing the transit peptide.
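The calculation of an average molecular weight after removal of the transit peptide can be sketched as follows. This is a minimal illustration and not the PeptideMass implementation: the function name, the example sequence, and the optional amino-terminal acetylation flag are assumptions introduced here, and the residue masses are standard average values.

```python
# Average residue masses (Da) for the 20 standard amino acids.
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.0153   # one water is added per intact polypeptide chain
ACETYL = 42.0367  # average mass added by amino-terminal acetylation


def mature_average_mass(precursor, transit_length, n_acetylated=False):
    """Average mass of the chain left after removing `transit_length` residues.

    `precursor` is the translated sequence in one-letter code; `transit_length`
    is the number of amino-terminal residues removed by processing
    (hypothetical parameter names).
    """
    mature = precursor[transit_length:]
    if not mature:
        raise ValueError("transit peptide is as long as the precursor")
    mass = sum(AVERAGE_RESIDUE_MASS[aa] for aa in mature) + WATER
    if n_acetylated:
        mass += ACETYL
    return mass


# Toy example: removing one residue from "MAG" leaves the dipeptide Ala-Gly.
print(round(mature_average_mass("MAG", 1), 3))
```

Comparing the value obtained with and without the acetylation flag against a measured intact mass is one simple way to test whether an amino terminus is likely to be blocked.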
Protein Identification
Thirty-three intact mass tags (IMTs) from the A. thaliana LCMS experiments were correlated with the predicted masses calculated by modifying the entries at The Institute for Genomic Research A. thaliana database (www.tigr.org/tdb/e2k1/ath1/) with experimentally verified amino termini of orthologs (Table I).
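The correlation of measured intact mass tags with predicted masses amounts to a tolerance search, which might look like the sketch below. The 100 ppm window, the function name, and the example masses are assumptions chosen for illustration; they are not taken from the experiments described here.

```python
def match_intact_mass_tags(measured, predicted, tolerance_ppm=100.0):
    """Map each measured mass to the names of predicted masses within tolerance.

    `measured` is a list of zero-charge masses (Da); `predicted` maps a protein
    name to the mass calculated for its processed form. The default tolerance
    is an assumed value, not one reported in the text.
    """
    matches = {}
    for mass in measured:
        window = mass * tolerance_ppm / 1e6
        matches[mass] = sorted(
            name for name, calc in predicted.items() if abs(calc - mass) <= window
        )
    return matches


# Hypothetical masses: only protA falls inside the 2.5-Da window around 25000.0.
hits = match_intact_mass_tags([25000.0], {"protA": 25001.5, "protB": 25010.0})
print(hits)
```

A measured mass with no candidate, or with several, is a cue to revisit the assumed processing site or to seek confirmation from peptide mass or sequence tags.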
α-cyano-4-hydroxycinnamic acid as matrix (10 mg/ml of solution in water/acetonitrile/trifluoroacetic acid 30/70/0.1) and internal/external calibration with bovine insulin. Manufacturer-supplied default settings for a method optimized for peptides less than 6000 Da were used for all samples. Some proteins have no internal Met or no CNBr peptides less than 8000 Da (MALDI in the reflectron mode has an upper limit for high resolution data acquisition of approximately 8000 Da); consequently some abundant proteins will not be detected during MALDI analysis, and peptides from low abundance proteins may appear to be more highly expressed. The MALDI data were compared with the fragmentation pattern predicted by the MS-Tag tool in ProteinProspector, version 4.0.4 (prospector.ucsf.edu/) (90). MS-Tag calculates the predicted fragment masses with the transit peptide sequence still included. There are very few, if any, amino-terminal peptide masses for chloroplast-targeted proteins predicted by MS-Tag. The MS-Digest tool was used to predict the fragment masses after manually removing the transit peptide from each prospective match assigned by MS-Tag.
Trypsin Cleavage
Selected fractions collected during LCMS+ were reduced, alkylated, and treated with trypsin (Promega sequencing grade modified by reductive methylation). Dithiothreitol (15 µl, 10 mM in 50 mM ammonium bicarbonate; 30 min, 24 °C), iodoacetamide (15 µl, 55 mM in 50 mM ammonium bicarbonate; 20 min, 24 °C), and finally trypsin (12.5 µl, 6 ng/µl in 50 mM ammonium bicarbonate; 3 h, 37 °C) were added to aliquots of fractions (10 µl). After incubation, samples were dried by centrifugal evaporation and stored at -20 °C prior to analysis by microliquid chromatography-MSMS.
Analysis of Tryptic Peptide Sequence Tags by Tandem Mass Spectrometry
Samples were analyzed by microliquid chromatography-MSMS with data-dependent acquisition (LCQ-DECA, ThermoFinnigan, San Jose, CA) after dissolution in 5 µl of 70% acetic acid (v/v). A reverse-phase column (200 µm x 10 cm, PLRP/S 5 µm, 300 Å; Michrom Biosciences, San Jose, CA) was equilibrated for 10 min at 1.5 µl/min with 95% A, 5% B (A, 0.1% formic acid in water; B, 0.1% formic acid in acetonitrile) prior to sample injection. A linear gradient was initiated 10 min after sample injection ramping to 60% A, 40% B after 50 min and 20% A, 80% B after 65 min. Column eluent was directed to a coated glass electrospray emitter (TaperTip, TT150-50-50-CE-5, New Objective) at 3.3 kV for ionization without nebulizer gas. The mass spectrometer was operated in "triple-play" mode with a survey scan (400–1500 m/z), data-dependent zoom scan, and MSMS. Individual sequencing experiments were matched to a custom A. thaliana sequence database using Sequest software (ThermoFinnigan). "No enzyme" was set such that Sequest considered all possible peptide sequence permutations rather than just tryptic ones. To identify N-acetylated peptides, a static modification of +42 Da was set for "amino terminus peptides."
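The difference between a tryptic search and a "no enzyme" search can be illustrated with a short sketch. It is not Sequest code: the function names are hypothetical, the tryptic rule used (cleavage after Lys or Arg unless followed by Pro) is the common textbook rule, and "all possible peptides" is taken here to mean every contiguous subsequence.

```python
def tryptic_peptides(seq):
    """Fully tryptic peptides: cleave after K or R unless the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        next_aa = seq[i + 1] if i + 1 < len(seq) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])
    return peptides


def no_enzyme_peptides(seq, min_len=1):
    """Every contiguous subsequence of at least `min_len` residues."""
    return [
        seq[i:j]
        for i in range(len(seq))
        for j in range(i + min_len, len(seq) + 1)
    ]


print(tryptic_peptides("AKRPGK"))
print(len(no_enzyme_peptides("AKG")))
```

Because a processed amino terminus rarely coincides with a tryptic site, only the unrestricted enumeration contains the peptide that begins at the true start of the mature protein, which is why the broader and slower search is useful for locating processing sites.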
Data Analysis
ChloroP, version 1.1 (www.cbs.dtu.dk/services/ChloroP/) (91), is a neural network method for identifying probable chloroplast transit peptide sequences and predicting the proteolytic cleavage site of each transit peptide. ChloroP presents its prediction of chloroplast targeting as a "Y" or "N" output based upon the predicted presence of a chloroplast transit peptide. TargetP, version 1.01 (www.cbs.dtu.dk/services/TargetP/) (92), is a layered neural network method for predicting subcellular targeting based upon the type of targeting/transit peptide predicted to be at the amino terminus of each protein. TargetP predicts whether the protein in question is trafficked to the chloroplast, mitochondria, secretory pathway, or "other" subcellular location. PSORT (World Wide Web version, Oct 8, 1999, psort.ims.u-tokyo.ac.jp/) (93) is an expert system using a knowledge-base setup as an "if-then" cascade. PSORT predicts subcellular localization with much finer resolution than any of the other programs examined. The four subcellular/suborganellar localizations with highest scores are ranked in order. The output of PSORT is listed in the footnotes to Table IV. PSORT includes a hydrophobic moment analysis for chloroplast proteins as one of its expert analysis programs. The usefulness of this calculation is based on the assumption that all chloroplast proteins have a similar stromal targeting domain in the amino-terminal targeting peptide. The hydrophobic moment analysis in PSORT distinguishes chloroplast protein status as negative, positive, or undetermined. Predotar, version 0.5 (www.inra.fr/predotar/), is a program still under development designed to be a Web-based method for distinguishing chloroplast- from mitochondria-targeting sequences. Predotar predicts localization to the chloroplast, mitochondria, both organelles, or neither organelle. 
MitoProt II, version 1.0a4 (www.mips.biochem.mpg.de/cgi-bin/proj/medgen/mitofilter) (94), is a computational method for predicting mitochondrial targeting sequences and for predicting the proteolytic cleavage sites of the targeting peptides. MitoProt predicts targeting based upon a calculated probability score that the protein being examined is localized to the mitochondria. SignalP (www.cbs.dtu.dk/services/SignalP/) (95) allows discrimination between eukaryotic, Gram-positive bacterial and Gram-negative bacterial signal peptides. All three cases were tested with the dataset. Four parameters are calculated for a yes/no prediction of the presence of a signal peptide. Each organism group tested gave a predicted cleavage site.
| RESULTS |
|---|
Tests of Web-based Protein Analysis Programs
Chloroplast Targeting
The targeting predictions of five publicly available software packages listed on the ExPASy proteomic tools Web page (ca.expasy.org/tools/; ChloroP, TargetP, PSORT, Predotar, and MitoProt II) were tested with the 58-protein dataset (Table IV). ChloroP and TargetP correctly predicted chloroplast targeting in 57 and 56 instances, respectively. Previous proteomic surveys of soluble chloroplast proteins using 2D gel separation followed by tryptic digestion and assignment of the peptide fragments by matrix-assisted laser desorption ionization time-of-flight (MALDI-TOF) MS are in agreement with these results. ChloroP predicted that 31 of 34 (13) and TargetP predicted that 74 of 75 (46) chloroplast proteins identified by internal mass or sequence tags are targeted to the chloroplast. However, without an equivalent dataset of non-chloroplast proteins with explicitly determined amino termini, it is not possible to test for false positives using the same predictive criteria. A recent study of 80 proteins identified from the A. thaliana mitochondria proteome by 2D gel separation/MALDI-MS showed that TargetP predicted 54 to be targeted to the mitochondria. A further 16 (20%) proteins were predicted to be chloroplast-targeted, including mitochondrial proteins such as succinyl-CoA ligase, cytochrome c oxidase, and the mitochondrial elongation factor Tu (14).
Predotar predicted that 40 (69%) of the proteins in our dataset are targeted to the chloroplast, 2 are targeted to the mitochondria, 1 is targeted to both organelles, and 15 are targeted to neither. Predotar had similar success in predicting plastid targeting (58/75, 77%) when analyzing soluble chloroplast proteins identified by 2D gel/MALDI-MS (46). From our dataset, each protein predicted by Predotar not to be targeted to either organelle had a closely related paralog or ortholog (except for PsaH) predicted to be plastid-targeted. The proteins in this "neither" category appear to have no species bias. It is of interest that A. thaliana Lhcb1Ia and Lhcb1Ib were not predicted to be targeted to the same location despite having identical amino acid sequences except at position 20 in the transit peptide (Lys20 in Lhcb1Ia and Asn20 in Lhcb1Ib). In this case an additional positive charge (3 versus 2) in the transit peptide is apparently sufficient to predict an alternate location for the chloroplast protein.
PSORT attempts a more ambitious prediction of protein localization by assigning a suborganellar destination for nuclear encoded proteins. PSORT does not make a single prediction but ranks the four suborganellar compartments with the highest scores. One of the expert programs of PSORT is a hydrophobic moment analysis for predicting chloroplast proteins. The PSORT results can be interpreted in several ways depending upon the criteria used for determining a "hit." Under the strictest conditions, namely that the highest score must match the actual suborganellar destination and that the hydrophobic moment must predict that it is a chloroplast protein, PSORT does poorly, predicting 0/5 in the stroma, 6/42 in the thylakoid, and 6/11 in the lumen. If the requirement is that only one of the four highest scores matches the correct destination, then PSORT does much better (stroma, 3/5; thylakoid, 37/42; and lumen, 8/11). If the criteria for PSORT are similar to ChloroP and Predotar (chloroplast targeting or not), then the results are comparable with Predotar. PSORT correctly predicts 49/58 if at least one of the four highest scores is a chloroplast domain, 39/58 if at least two of the four high scoring results are to the chloroplast, and 25/58 if at least three of the four are in the chloroplast. PSORT was used previously to test a smaller set of soluble chloroplast proteins identified by mass and sequence tags. In this study PSORT predicted 13/34 proteins target to the chloroplast if only one of the four predicted destinations is in the chloroplast, 10/34 if two of the four are chloroplast domains, and 1/34 if three of the four predicted targets are chloroplast domains (13). Less strict interpretations are reliable if the suborganellar destination of the protein being examined is already known. 
It is therefore suggested that assignment of an unknown/hypothetical protein as a chloroplast protein should only be considered when three of the four PSORT suborganellar predictions are to chloroplast compartments. Furthermore, the tentative assignment should be confirmed by further experimentation.
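The alternative criteria for scoring a PSORT "hit" described above can be expressed as a small sketch. The compartment labels, function names, and example ranking are assumptions for illustration and do not reproduce PSORT output verbatim.

```python
# Assumed labels for the chloroplast compartments in a ranked PSORT-style list.
CHLOROPLAST_COMPARTMENTS = {
    "chloroplast stroma",
    "chloroplast thylakoid membrane",
    "chloroplast thylakoid space",
}


def chloroplast_votes(ranked):
    """Count how many of the four top-ranked locations are chloroplast compartments."""
    return sum(1 for loc in ranked[:4] if loc in CHLOROPLAST_COMPARTMENTS)


def is_strict_hit(ranked, actual, hydrophobic_moment_positive):
    """Strictest criterion: top score matches and the hydrophobic moment is positive."""
    return bool(ranked) and ranked[0] == actual and hydrophobic_moment_positive


def is_lenient_hit(ranked, actual):
    """Lenient criterion: any of the four top-ranked locations matches."""
    return actual in ranked[:4]


def call_chloroplast(ranked, min_votes=3):
    """Suggested rule for unknown proteins: at least three of four chloroplast calls."""
    return chloroplast_votes(ranked) >= min_votes


# Hypothetical ranking for a thylakoid membrane protein.
ranked = [
    "chloroplast thylakoid membrane",
    "mitochondrial inner membrane",
    "chloroplast stroma",
    "chloroplast thylakoid space",
]
print(chloroplast_votes(ranked), call_chloroplast(ranked))
```

Raising `min_votes` trades sensitivity for specificity, which is the same trade visible in the 49/58, 39/58, and 25/58 counts reported above.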
MitoProt was initially used as a negative control, but surprisingly it predicted that a large number of stromal proteins and thylakoid IMPs (16/47, cutoff >0.90; 32/47, cutoff >0.80) would be targeted to the mitochondria (Table IV). MitoProt predicted that proteins targeted to the lumen would not go to the mitochondria. Although the size of the datasets for stromal and luminal proteins is small, the trend is striking. MitoProt identified 53 of 80 (14) and 39 of 48 (15) confirmed mitochondrial proteins as being trafficked to the mitochondria using a cutoff of >0.85. The presence of chloroplast proteins in tests of MitoProt reduced the reliability of the predictions (94). Here it is demonstrated that for certain chloroplast proteins the probability of MitoProt incorrectly predicting a mitochondrial destination is high, and thus one recommendation that emerged from these tests is that MitoProt results from plastid-containing eukaryotes be screened against ChloroP/TargetP to help reduce false positive predictions.
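The recommended cross-check of MitoProt results against a chloroplast predictor could be applied as in the following sketch. The function name and return labels are hypothetical; the 0.85 default mirrors the cutoff quoted above, and the chloroplast call stands in for a ChloroP or TargetP result obtained separately.

```python
def screen_mitoprot(mitoprot_probability, predicted_chloroplast, cutoff=0.85):
    """Classify a MitoProt call after screening it against a chloroplast prediction.

    `mitoprot_probability` is the MitoProt score for mitochondrial import and
    `predicted_chloroplast` is True when ChloroP/TargetP predicts a chloroplast
    transit peptide (both supplied by the caller).
    """
    if mitoprot_probability <= cutoff:
        return "not mitochondrial"
    if predicted_chloroplast:
        # High MitoProt score but a chloroplast transit peptide is also predicted.
        return "possible false positive"
    return "mitochondrial"


print(screen_mitoprot(0.93, True))
print(screen_mitoprot(0.93, False))
```

Flagged entries are not resolved by this rule; they simply mark proteins whose location needs experimental confirmation.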
Transit Peptide Cleavage Prediction
The transit peptide is cleaved by SPP after translocation across the chloroplast envelope into the stroma. Proteins destined for the lumen have a bipartite transit peptide that is removed by the thylakoid processing protease (TPP) (22). ChloroP, TargetP, and MitoProt each contain transit peptide cleavage predictions. TargetP uses the ChloroP cleavage assignment if it predicts the protein is chloroplast-targeted. Only two of the test proteins were predicted by TargetP to go to the mitochondria, therefore only the ChloroP results are presented (Table V). SignalP was tested because the two targeting pathways to the lumen, Sec-dependent and ΔpH-dependent, have analogous systems in eukaryotic, Gram-positive (Gm+), and Gram-negative (Gm-) bacterial secretory pathways (22, 46).
None of the programs adequately predict the transit peptide processing site of the membrane-spanning proteins inserted into the thylakoid via the signal recognition particle-dependent pathway. ChloroP correctly predicted the transit peptide cleavage site for one thylakoid IMP (A. thaliana PetCIV, the cytochrome b6/f complex Rieske Fe-S protein), but for the spinach and pea orthologs it predicted a cleavage site 7 amino acids carboxyl-proximal and 10 amino acids amino-proximal, respectively (Table V). The majority of the correct predictions for MitoProt are for the Lhcb1 and Lhcb2 proteins. If MitoProt predicted the presence of a mitochondrial transit peptide, then it consistently predicted a cleavage site 2 amino acids carboxyl-proximal from the actual site in the Lhcb1 and Lhcb2 proteins. ChloroP predictions ranged from 11 amino acids amino-proximal to 12 amino acids carboxyl-proximal to the actual site for the same set of proteins (Table V). The appearance that MitoProt does better at predicting the transit peptide cleavage site of thylakoid IMPs is more likely due to its consistency at predicting the same cleavage site for orthologous proteins than due to recognition of some specific thylakoid-localizing domain. It is somewhat surprising that the MitoProt algorithm, which was not designed to work with chloroplast-targeted proteins, is much more consistent in its predictions of where a putative cleavage site is located than ChloroP, which was specifically created to analyze chloroplast proteins.
PsbW and PsbXc2 are the only thylakoid-bound proteins in the dataset that have amino termini oriented into the thylakoid lumen (39, 41). They are also the only proteins in the dataset that are targeted to the thylakoid by the spontaneous insertion pathway (42). The Gram-negative SignalP program correctly predicted the transit peptide processing sites for the two PsbW and single PsbXc examples in the study. We also tested the two other membrane proteins, PsbY and AtpG, which are inserted into the thylakoid via the spontaneous mechanism (42). Two PsbY proteins (spinach and A. thaliana, Swiss-Prot accession numbers P80470 and O49347, respectively) and two AtpG proteins (subunit II of the CF0 ATPase; spinach, Swiss-Prot accession number P31853; and A. thaliana, Swiss-Prot accession number Q42139) are listed in the Swiss Protein/TrEMBL databases in addition to the single PsbXc and two PsbW proteins in the dataset. Bacterial signal peptides are cleaved according to the (-3, -1) rule where the -3 and -1 positions before the cleavage site are small neutral amino acids (usually Ala) (95, 97). Eukaryotic and Gm+ signal peptide consensus sequences generally have hydrophobic amino acids in the region between -16 and -25, whereas Gm- signal peptides are more likely to have hydrophilic or positively charged amino acids in this region, like the transit peptides of thylakoid proteins imported via the spontaneous path. The Gm- SignalP program correctly predicted the processing site of all seven proteins that are imported into the thylakoid membrane by this mechanism (data not shown). This unexpected observation suggests that the spontaneous thylakoid import mechanism may be related to the secretory mechanism of Gram-negative bacteria.
| DISCUSSION |
|---|
-acetylated or in agreement with the measured intact mass. The current state of the art does not permit rapid prediction of the mass of putative chloroplast proteins after transit peptide cleavage, and there is no method for predicting the acetylation of the amino terminus that is common for thylakoid-bound proteins. All the programs that were tested were deficient in one or more of their predictive algorithms, and none adequately predict the transit peptide processing site of the membrane-spanning proteins in the thylakoid. The training dataset used for the neural networks of ChloroP is the only one readily available for examination (www.cbs.dtu.dk/services/ChloroP/pages/datasets.html). This dataset was created by extracting from Swiss-Prot release 35 all entries with an FT line containing the key word "TRANSIT" and description word "CHLOROPLAST." Entries with a chloroplast transit peptide marked "BY SIMILARITY," "PROBABLE," or "POTENTIAL" were removed. The remaining entries were screened with SignalP, and those that showed the presence of the bipartite transit peptide for trafficking to the thylakoid or lumen were removed. A final set of 75 proteins was obtained after homology reduction and removal of a few mistargeted proteins (91). The removal of thylakoid- and lumen-targeted proteins essentially makes ChloroP a stroma-targeting and processing prediction algorithm (Table VI). Because the thylakoid- and lumen-localized proteins must pass through the chloroplast envelope to the stroma the chloroplast-targeting predictions are correct, but the transit peptide cleavage predictions of ChloroP are only for proteins cleaved by SPP. Thus the ChloroP Web page is somewhat misleading in presenting results of the transit peptide cleavage prediction as the amino terminus for all chloroplast-targeted proteins rather than just the stromal ones.
Examples of misannotations include the entries for the tomato chloroplast biosynthetic threonine dehydratase (Swiss-Prot accession number P25306), the barley chloroplast photosystem I reaction center subunit XI (PsaL) (Swiss-Prot accession number P23993), and the safflower chloroplast acyl-[acyl-carrier protein] desaturase (Swiss-Prot accession number P22243). The tomato threonine dehydratase has Lys52 annotated as the experimentally confirmed amino terminus of the processed preprotein. The work referenced for this annotation did not include protein sequencing, and the 10-amino acid sequence annotated as the amino-terminal protein sequence in the Swiss Protein Database was from a figure where a 10-amino acid sequence was underlined to indicate the proposed region for the transit peptide cleavage site (100). The barley PsaL protein is blocked at the amino terminus of the processed protein, and the annotated amino terminus (Ala41) found in the Swiss Protein Database was one of two proposed sites for the transit peptide cleavage site (101). The amino terminus of the processed PsaL protein has not been verified in vascular plants and should have been annotated "POTENTIAL."
The amino terminus of the safflower acyl-[acyl-carrier protein] desaturase is also blocked (102), and the annotation indicating an experimentally determined amino terminus (Ala34) is from two overlapping peptide sequences at the amino-proximal end of a peptide map that begins with the same amino acid. This transit peptide should have been annotated as "PROBABLE." It is not surprising that the gargantuan task of annotating the Swiss Protein Database results in frequent unclear or misannotated entries. A suggested improvement in the Swiss Protein Database would be to link the "BY SIMILARITY" annotation to the entry to which it is referring. Although this would not solve the problems encountered when the annotation is missing, it would greatly speed up the ability to cross-reference predicted modifications and functional groups. The current situation requires time-consuming, careful, and diligent confirmation of the annotation of each entry before attempting to correlate intact protein mass tags with translated genomic sequence information, especially for proteins that have amino-terminal trafficking peptides. The same sort of diligence should also be utilized when creating software training/testing datasets, thereby minimizing ambiguities or errors that decrease confidence in the bioinformatics tools trained/tested with these datasets.
The range of species used in the ChloroP training set is narrow due to the lack of experimental verification of post-translational modifications for most proteins (see Table I), and the dataset is skewed toward proteins from pea and spinach. The training set also has a significant number of algal sequences (Table VI), but algal transit peptides are 32 amino acids shorter, on average, than those from vascular plants (103). Where direct comparisons between orthologous proteins are possible, the transit peptide cleavage sites are not conserved between the algae and vascular plants. It is reasonable that separate training datasets are needed for vascular plants and green algae, as is done in SignalP for the eukaryotes, Gram-positive bacteria, and Gram-negative bacteria. Additionally, it seems reasonable to use separate training sets for each suborganellar compartment or training sets for each characterized translocation mechanism. In this way the training sets would narrow the focus of the neural net programs and help to avoid the confusion created by attempting to find a generalized transit peptide cleavage motif that may not be universally applicable for all photosynthetic organisms.
The results in Table V suggest that there may be at least four proteases in the chloroplast involved in removing transit peptides: a stromal processing protease, for which the cleavage site can be predicted by ChloroP for proteins targeted to the stroma; a lumen processing protease, for which the cleavage site can be predicted by SignalP (eukaryotic) for proteins imported into the lumen via the Sec or Tat mechanisms; a thylakoid processing protease, for which the cleavage site is predicted by SignalP (Gm-) for proteins inserted into the membrane via the spontaneous mechanism; and a thylakoid protease, for which no reliable cleavage prediction program exists for proteins inserted into the membrane via a signal recognition particle-dependent mechanism. Attempts to identify common features of either transit peptides or mature proteins that might be useful for such predictions have thus far been unfruitful.
Previous proteomic studies of A. thaliana organelles that tested the trafficking predictions of various predictive programs (13, 14) are in general agreement with the results we observe for trafficking. However, because the earlier studies identified proteins based upon internal sequence or peptide mass tags, they were unable to assess the cleavage prediction routines. Subsequent studies of luminal proteins in A. thaliana analyzed the cleavage prediction routines of TargetP and SignalP (46, 47). Ten proteins overlap between these studies and ours: six lumen (PsbQVI1, PsbQVI, PsbPI, PsaNV, PsbOV, and PsbOIII with The Institute for Genomic Research (TIGR) chromosome locus numbers At4g21280, At4g05180, At1g06680, At5g64040, At5g66570, and At3g50820, respectively), two stroma (PsaDIV and PsaEIV with TIGR chromosome locus numbers At4g02770 and At4g28570, respectively), and two thylakoid-integral (AtpCIV and AtpDIV with TIGR chromosome locus numbers At4g04640 and At4g09650, respectively). SignalP was better at predicting the amino terminus of luminal proteins than TargetP (46), which is in agreement with our observations; however, only three (PsbPI, PsbOV, and PsbOIII) of the luminal proteins that overlapped with our study used the correct amino terminus when testing the programs. In contrast, our results agree completely with the experimentally determined amino termini for the oxygen-evolving enzyme proteins in the second study (47).
Integral membrane proteins are predicted to account for at least one-third of the open reading frames in the A. thaliana genome (104). The 2D methods currently in use are biased against low abundance proteins (105, 106) and IMPs. However, IMPs with masses up to and exceeding 100,000 Da and containing up to 15 membrane-spanning α-helices have now been successfully characterized by LCMS using intact mass tags (5, 16, 49, 88, 107–111). 3 LCMS+ allows us to subject fractions to off-line digestion to generate peptide fragments for mass tag (112–115) and sequence tag (116, 117) experiments to confirm the IMT data (49, 118, 119). Characterizing a protein based upon its intact mass tag allows us to simultaneously determine secondary post-translational modifications for several proteins. LCMS+ is the only method currently available to rapidly determine such modifications for a large number of proteins. Consequently, LCMS+ should be the method of choice for obtaining datasets to train/test predictive bioinformatics tools for trafficking and post-translational processing.
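The way an intact mass tag can test a predicted cleavage site may be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used in this study: it uses standard average residue masses, ignores all modifications other than removal of the transit peptide, and applies a hypothetical fractional tolerance of 0.01%. The function names and the toy sequence in the usage note are ours.

```python
from typing import List

# Standard average residue masses (Da) of the 20 common amino acids.
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # average mass of H2O added for the free termini


def average_mass(sequence: str) -> float:
    """Calculated average mass (Da) of an unmodified polypeptide."""
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in sequence) + WATER


def matching_cleavage_sites(
    precursor: str, measured_mass: float, tolerance: float = 1e-4
) -> List[int]:
    """Return transit peptide lengths whose mature protein matches the IMT.

    A length n is reported when the calculated mass of precursor[n:] lies
    within the fractional tolerance (assumed 0.01% here) of the measured mass.
    """
    sites = []
    for n in range(len(precursor)):
        calculated = average_mass(precursor[n:])
        if abs(calculated - measured_mass) <= tolerance * measured_mass:
            sites.append(n)
    return sites
```

For a made-up precursor `"MASTRAGGKL"` and a measured mass equal to the calculated mass of `"AGGKL"`, the function reports a single transit peptide length of 5 residues. A predicted cleavage site that does not appear in the returned list would be inconsistent with the measured intact mass, unless additional modifications account for the difference.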
| FOOTNOTES |
|---|
1 The abbreviations used are: MS, mass spectrometry; MSMS, tandem mass spectrometry; 2D, two-dimensional; Gm+, Gram-positive; Gm-, Gram-negative; IMP, integral membrane protein; IMT, intact mass tag; LCMS, liquid chromatography coupled to electrospray mass spectrometry; LCMS+, LCMS with fractions collected at a flow splitter between liquid chromatography and MS for MSMS; MALDI, matrix-assisted laser desorption ionization; TOF, time-of-flight; PS, photosystem; SPP, stromal processing protease.
2 psbT is the name of the chloroplast gene encoding the PS II reaction center T protein. psbT was also used as the name for the nuclear gene encoding the soluble PS II Mr 5000 oxygen-evolving enzyme protein targeted to the lumen, later renamed psbX. Unfortunately, psbX was also used for a gene encoding a membrane-spanning PS II protein of unclear function in cyanobacteria and the plastids of rhodophyta and cryptophyta. Orthologs of this cyanobacterial gene have been found in A. thaliana (98) and rice (Sasaki, T., Matsumoto, T., and Yamamoto, K. (2001) GenBank™/EBI accession number AP004300), and this gene is UV-B-repressed in pea (Liu, L., White, M. J., and MacRae, T. H. (2001) GenBank™/EBI accession number AY065654). We label this gene as psbXc. The use of PsbX for unrelated soluble and membrane-bound PS II proteins has created considerable confusion.
3 J. P. Whitelegge and S. J. Karlish, unpublished data.
* This work was supported by National Institutes of Health Grants AI-12601 and AI-29733 (to J. P. W.) and United States Department of Energy Grant DE-FG03-01ER15253 (to J. P. W. and K. F. F.). The National Institutes of Health, the Pasarow Foundation, and the W. M. Keck Foundation provided funds toward instrument purchases for this study. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
Present address: Pocagua Agricultural Systems, 718 10th St. SW, Albuquerque, NM 87102.
** To whom correspondence should be addressed. Tel.: 310-794-5156; Fax: 310-206-2616; E-mail: jpw{at}chem.ucla.edu
| REFERENCES |
|---|
ΔpH-driven twin-arginine translocation pathway requires a specific signal in the hydrophobic domain in conjunction with the twin-arginine motif. FEBS Lett. 434, 425–430
ΔpH-, and signal recognition particle-dependent protein targeting pathways, but not for CFoII integration. Plant J. 10, 149–155