Genome-wide Analyses of Carboxyl-terminal Sequences*

Sequence motifs at the protein carboxyl termini in linear polypeptides are uniquely positioned and functionally capable of serving as recognition signatures for a variety of cellular and biochemical processes. At the proteome level, it is unknown whether and what carboxyl-terminal sequences might be particularly conserved, which may be directly related to specific biological functions shared among certain groups of proteins. To investigate this question, we analyzed the terminal sequences of reported yeast open reading frames, which presumably constitute the predicted, entire proteome of Saccharomyces cerevisiae. The results show that there are both known and novel terminal sequences. They are conserved at a frequency similar to that of functionally important, experimentally confirmed signals such as the HDEL sequence that mediates the endoplasmic reticulum retention and/or retrieval. The findings support the notion that there may be additional carboxyl-terminal signals, and the conserved motifs could be experimentally tested for currently unknown biological functions. Similar analyses were also applied to the limited proteome databases of other organisms with overall consistent findings. Therefore, indexing a proteome according to its carboxyl-terminal sequences may provide a means for functional classification and determination of proteins.

Each protein has a single terminal α-carboxyl group that links directly to the adjacent, last peptide bond. This position, referred to here as the carboxyl terminus, combined with the preceding residues, often serves as a signature recognition motif for a variety of biochemical reactions that are restricted to this position of the protein and are essential for physiological function. Some of the known functions include protein trafficking, subcellular anchoring of proteins, targeted protein degradation, and the static and dynamic formation of macromolecular complexes (for a review, see Ref. 1).
In an increasing number of systems, protein carboxyl-terminal sequences are found to be highly conserved among homologues of different species. Structural studies have identified a large number of protein domains that specifically recognize protein carboxyl termini (2,3). Using random peptide selection, these protein domains were shown to display high specificity for binding to carboxyl-terminal sequences (4,5). For example, cystic fibrosis transmembrane conductance regulator (CFTR) proteins, whose malfunction causes cystic fibrosis, have an identical carboxyl-terminal sequence (TRL) in species from Xenopus to human (5). This sequence is necessary for appropriate subcellular expression of the CFTR channel in heterologous systems such as polarized Madin-Darby canine kidney cells (6). In addition, the motif is responsible for binding to CAP70 (CFTR-associated protein, 70 kDa), a protein that contains four protein interaction PDZ domains. The multivalent binding of CAP70 to two or more CFTR molecules potentiates the chloride channel activity (7). Similarly, within a given species, conserved carboxyl-terminal motifs have also been found among structurally and/or functionally distinct proteins. These conserved motifs often encode specific biological activities, such as HDEL, which is a recognition signal for ER retention and/or retrieval (8), and CAAX, which serves as a substrate site for lipidation (9). Because an increasing number of genomes have been sequenced and the corresponding proteome information is becoming available, it is now feasible to investigate whether and to what extent protein carboxyl-terminal motifs are conserved within a given proteome. If so, do those motifs indeed confer a conserved functionality that has been determined experimentally? In addition, the conserved sequences with unknown function may be topics of further experimental studies.
To determine whether and what carboxyl-terminal sequence motifs are conserved in yeast, we compiled the terminal sequences of 6,213 predicted yeast proteins that are longer than 50 amino acid residues. The analyses of the frequency of contiguous carboxyl-terminal sequences suggest that conserved motifs are directly correlated with certain shared functions including but not limited to protein targeting. The function of the identified motifs may be investigated experimentally.
Information regarding yeast protein expression, localization, and functional properties was taken from the following websites unless specifically cited: genome-www.stanford.edu/Saccharomyces/, genome-www4.stanford.edu/cgi-bin/SGD, www.proteome.com/, and www.ncbi.nlm.nih.gov/locuslink/refseq.html.

Database for
Statistical Analysis-The programs used in database downloading, parsing, and subsequent statistical analysis were written in Perl 5.6 and run on a Pentium 700 PC. The data were output in Microsoft Excel format. Protein sequence alignment was performed using MegAlign of DNAStar™.

RESULTS
Certain carboxyl-terminal tri- or tetra-amino acid sequences, such as SKL for peroxisome targeting (10) and HDEL (or KDEL, depending on the species) for ER retention and/or retrieval (8), are recognition sequences for trafficking proteins to appropriate subcellular compartments or microdomains. To investigate specific common sequence motifs that are shared by a group of otherwise different proteins, we have compiled an abundance of carboxyl-terminal peptide sequences from several species including S. cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, and Homo sapiens. Because the amino acid usage bias is found primarily at the last 8 residues (11,12), our calculation was performed for dimers through octamers. The sequenced yeast genome contains very few gene duplications, which makes it particularly suitable for this analysis. The top 30 hits for each peptide length are shown in Table I. The complete lists of hits are available from our website (www.molecularinteraction.org/listofpublication.htm).
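The tallying step described above can be illustrated with a minimal sketch. It is not the authors' Perl program; the function name `cterm_kmer_counts`, the toy sequences, and the handling of a trailing `*` are assumptions made only for illustration. The length filter (longer than 50 residues) and the range of peptide lengths (dimers through octamers) follow the text.

```python
from collections import Counter

def cterm_kmer_counts(proteins, k, min_length=50):
    """Count the last k residues of every protein longer than min_length."""
    counts = Counter()
    for seq in proteins.values():
        seq = seq.rstrip("*")  # assumption: drop a trailing stop symbol if present
        if len(seq) > min_length:
            counts[seq[-k:]] += 1
    return counts

# Toy input (hypothetical sequences, not real yeast ORFs).
toy = {
    "protA": "M" + "A" * 60 + "HDEL",
    "protB": "M" + "G" * 70 + "HDEL",
    "protC": "M" + "S" * 55 + "SKL",
    "short": "MKTAYHDEL",  # 9 residues: excluded by the length filter
}

for k in range(2, 9):  # dimers through octamers
    print(k, cterm_kmer_counts(toy, k).most_common(3))
```

Sorting each tally by count, as `most_common` does here, gives the kind of ranked list of hits that the text reports in Table I.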
For the 6,213 yeast proteins that are longer than 50 amino acids, we analyzed the frequency of tripeptide and tetrapeptide terminal sequences (Fig. 1). Fig. 1A shows that 31% of the proteins (1,917) have unique tripeptide terminal sequences, and 25% (1,536, or 768 × 2) have terminal sequences that are found only twice. When the same analysis was applied to the tetrapeptide sequences, about 83% of the proteins (5,151) have distinct carboxyl-terminal tetrapeptide sequences, and ~12% (766, or 383 × 2) have terminal sequences found in only one other protein (Fig. 1B). The remaining 297 proteins (~5%) have terminal sequences shared by three or more proteins. Analyses of the highest 30 hits identify several conserved motifs. The three most abundant tetrapeptide hits are EVGE (16 hits), KWIH (16 hits), and TIAN (14 hits). EVGE is found in YRF1-like helicases in multiple alleles. KWIH is a motif found in an open reading frame of Ty transposons, which are presumably present in multiple locations of the yeast genome. TIAN is the terminus shared by multiple, nearly identical seripauperin proteins.

FIG. 1. Frequency distribution of S. cerevisiae proteins according to the sequence of the last 3 (A) and 4 residues (B).
In the bar graphs, the y axis indicates the number of protein groups that have unique carboxyl-terminal ends, and the x axis represents how many proteins belong to each group and the corresponding percentage in the genome. Therefore, Σxy = 6,213. Numbers inside the pie chart represent the percentage of total proteins comprising each group. The percentage value of each group is also identified in parentheses below each group.
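The relationship between the two axes of Fig. 1 can be made concrete with a short sketch. The tally `toy_counts` and the function name `group_size_distribution` are hypothetical; only the definitions of x (proteins per group), y (number of groups), and the identity Σxy = total proteins come from the caption.

```python
from collections import Counter

def group_size_distribution(cterm_counts):
    """Map group size x -> number of distinct termini y shared by exactly x proteins."""
    return Counter(cterm_counts.values())

# Hypothetical tally: terminal tripeptide -> number of proteins ending with it.
toy_counts = {"SKL": 3, "KKK": 2, "LSK": 2, "TRL": 1, "AEV": 1, "QWF": 1}

dist = group_size_distribution(toy_counts)
total = sum(x * y for x, y in dist.items())  # sum of x*y recovers the protein count

for x in sorted(dist):
    y = dist[x]
    print(f"x={x}  y={y}  proteins={x * y}  ({100 * x * y / total:.0f}%)")
print("sum of x*y =", total)
```

Applied to a full proteome tally, the same sum would be expected to return the number of proteins analyzed, which serves as a simple consistency check on the counts.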
These motifs are the termini of the same sets of nearly identical proteins that appear to have been duplicated in the genome. Consistent with this notion, almost identical numbers of hits of these three sequences are also found as the top hits of carboxyl-terminal 5-mers, 6-mers, 7-mers, and 8-mers (Table I). Other motifs include the HDEL (11 hits) and A(M/L)LL (11 hits) tetrapeptide motifs, the SKL (13 hits) and LSK (12 hits) tripeptide motifs, and two different dilysine motifs (XKK (152 hits) and KKXX (112 hits)) (see Table I). Because these hits come from proteins of unrelated sequences, they are consistent with the notion that the conservation of these motifs reflects certain shared properties. It is also quite apparent that the most abundant tri- or tetrapeptide motifs are limited to less than 0.2% of total proteins in the proteome. A similar percentage but with different sequences has also been found in other proteomes such as D. melanogaster, C. elegans, and H. sapiens (see Table V).
The carboxyl-terminal HDEL motif is a well known recognition signal that directs proteins to ER retention and/or retrieval. Interestingly, 10 of the 11 hits have been reported to be ER-specific proteins despite the fact that they are quite different in terms of overall polypeptide length, protein domain organization, biochemical function, and membrane topology (Table II). The only one within the group that is not strictly localized to the ER is Sec20, an integral membrane protein that is involved in vesicular trafficking in the ER and Golgi apparatus (14,15). While the localization information is incomplete, six of the eight HDEL sequence entries from the Drosophila sequence database are ER-resident proteins (Table III and the data from www.molecularinteraction.org/listofpublication.htm). Thus, the HDEL-ended proteins, which represent ~0.2% of total proteins in the yeast proteome, could have been used to predict their shared properties with high confidence, which raises questions concerning other conserved but functionally unknown motifs.
Another highly conserved tetrapeptide motif is A(M/L)LL, which also has 11 hits. Of the 11 proteins, with the exceptions of one reported to be in nuclei and one whose localization is currently unknown, the remaining nine proteins are found to have cell wall-related function (Table IV). These nine proteins differ in length and/or domain organization (Fig. 2). Among them, the mannoprotein genes DAN1, TIR1, TIR2, TIR3, TIR4, and TIP1 were shown to be expressed after an anaerobic shift or during cold shock, whereas the CWP2 gene was down-regulated under the same conditions, suggesting that the seven mannoprotein genes are involved in remodeling of the cell wall (16). The remaining two of the nine proteins, Ylr110c and Ylr194c, are also found to be localized in the cell wall. It is interesting that these nine proteins are all involved in the same pathway and that a subset of these proteins was also identified by other informatic approaches (17). While the precise biochemical role of this motif is currently unknown, the terminal region might be recognized and cleaved prior to the surface expression of these proteins via a glycosylphosphatidylinositol anchor (12).
Conserved motifs with fewer than 10 hits may also be of great significance, but more analyses may be needed. For example, the KSKK motif was found in seven proteins (Rpl34b, Prp21, Cbf5, Ylr302c, Nop12, Rio1, and Svl3). Most of them are associated with RNA-involved structures or functions such as the spliceosome and RNA transport. Similar analyses also showed that 10 of 13 proteins with the SKL terminal sequence are peroxisomal proteins. This motif is also known as peroxisomal targeting signal type 1 (PTS1). This result is consistent with SKL as a recognition signal for peroxisomal localization (10,18). This conservation is also found in higher organisms. It should be noted, however, that not all conserved sequences give rise to detectable shared features such as subcellular localization. For example, the dilysine motif KKXX is thought to be a retention/retrieval signal in the secretory pathway (19). Among the 112 proteins with the KKXX terminus, no shared features were detectable. For processes such as protein ER retention, several other ER localization signals have been reported, and localization to the ER could be a transient step that serves as a checkpoint for proper protein assembly (13). Thus, the presence of conserved sequence motifs may be more suitable for grouping genes with related function or property. But the converse is not valid because it is known that many proteins found in the ER do not use the HDEL-mediated localization machinery.
The alternative transcripts of a gene are labeled with the suffixes -RA, -RB, -RC, etc. by FlyBase. These vary due to alternative splicing, variable exons, multiple poly(A) sites, or multiple promoters. Alternative transcripts may or may not encode different protein products.
The statistical significance of a given motif may be analyzed in a number of ways. For the yeast genome, the total number of genes (6,213) and the lengths of the corresponding proteins (a combined total length of 2,912,365 amino acids) are known, which permits a more detailed statistical analysis. Fig. 3 shows the overall frequency of the 20 amino acids (part IV) and the frequency of each amino acid at a given carboxyl-terminal position (−1 to −20, with −1 being the last residue at the carboxyl terminus) (part V). Most noticeable is the position-specific bias of lysine at the carboxyl terminus. While there is clearly a certain bias of amino acid abundance at terminal positions, the abundance-corrected probability of the HDEL sequence is similar for internal (7.76 × 10⁻⁶) and carboxyl-terminal (8.38 × 10⁻⁶) sequences (Fig. 3, part VI). There are 2,893,726 (2,912,365 − 6,213 × 3) possible (internal and terminal) tetrapeptide sequences but only 6,213 possible carboxyl-terminal tetrapeptide sequences. Thus, the probability that the entire yeast genome has one protein with the terminal HDEL motif by chance is 5% (6,213 × 8.38 × 10⁻⁶ ≈ 0.05). As shown, 11 different proteins with the terminal HDEL sequence have been identified (Table II), representing more than 200-fold above the expected frequency. Similar calculations were performed for A(M/L)LL and SKL, which were 26- and 2-fold above the expected frequency, respectively (Fig. 3, part VI). Using this information, one may also calculate the odds of finding the same sequence by chance.
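The HDEL arithmetic in the statistical analysis above can be checked with a short sketch. The constants are the values quoted in the text; the Poisson tail used for the "odds by chance" is an assumption introduced here for illustration, since the text does not state which model was used, and `poisson_tail` is a hypothetical helper.

```python
import math

N_PROTEINS = 6213          # yeast proteins longer than 50 residues (from the text)
TOTAL_RESIDUES = 2912365   # combined length of those proteins (from the text)
P_HDEL_CTERM = 8.38e-6     # abundance-corrected terminal HDEL probability (Fig. 3, part VI)
OBSERVED_HDEL = 11         # proteins ending in HDEL (Table II)

# A protein of length L contains L - 3 tetrapeptide windows.
tetra_windows = TOTAL_RESIDUES - 3 * N_PROTEINS

# Expected number of HDEL termini by chance, and the observed enrichment.
expected = N_PROTEINS * P_HDEL_CTERM
fold = OBSERVED_HDEL / expected

def poisson_tail(k, lam, terms=200):
    """P(X >= k) for X ~ Poisson(lam), summing terms directly from i = k upward."""
    return sum(
        math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
        for i in range(k, k + terms)
    )

print("tetrapeptide windows:", tetra_windows)
print(f"expected terminal HDEL count: {expected:.3f}")
print(f"fold over expectation: {fold:.0f}")
print(f"P(at least {OBSERVED_HDEL} by chance), Poisson assumption: "
      f"{poisson_tail(OBSERVED_HDEL, expected):.2e}")
```

Under this assumption the expected count of about 0.05 corresponds to the 5% figure, and 11 observed proteins corresponds to the enrichment of more than 200-fold reported in the text.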

DISCUSSION
Short linear terminal epitopes in proteins differ from internal sequences as recognition sites in a number of ways; most obviously, each protein has only one carboxyl terminus. Most proteins of known structure have their carboxyl terminus exposed, consistent with a role in recognition that could lead to several possible biochemical events including static binding, cleavage (exposing new terminal sequences), or posttranslational modification (for a review, see Ref. 1). The analyses of the yeast proteome (Fig. 3) suggest that it is possible to estimate the number of internal residues needed to confer a level of diversity similar to that conferred by 5 terminal residues. Generally there are about 500 pentapeptide sequences in a 500-amino acid-long protein but only one carboxyl-terminal pentapeptide. Thus, to confer the same specificity in an internal peptide one needs to make the probability of a random occurrence about 500 times less likely; this is approximately accomplished by extending the length by 2 residues ((1/20) × (1/20) = 1/400). In this context, it is intriguing that immunoglobulin or T-cell receptor binding, when recognizing an internal peptide, involves 7-9 amino acids. In contrast, a PDZ domain binding to a carboxyl-terminal peptide recognizes 3-5 residues. Indeed, the figure of about 500 pentapeptide sequences in a 500-residue-long protein is an overestimate since adjacent peptides differ by only 1 residue. Interestingly, the strongly biased positions in protein carboxyl termini are most pronounced at the last 4 or 5 residues, which according to the calculation here provide sufficient diversity for a typical proteome.
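The length estimate in the preceding paragraph can be sketched numerically. The sketch assumes 20 equiprobable amino acids, as the (1/20) × (1/20) factor in the text implies; the function names are hypothetical.

```python
ALPHABET = 20  # standard amino acids, assumed equiprobable as in the text's estimate

def n_windows(length, k):
    """Number of contiguous k-residue windows in a protein of the given length."""
    return max(length - k + 1, 0)

def dilution(extra):
    """Factor by which the chance-match probability drops per `extra` added residues."""
    return ALPHABET ** extra

windows = n_windows(500, 5)  # internal pentapeptide windows in a 500-residue protein
print("pentapeptide windows:", windows)
for extra in range(4):
    factor = dilution(extra)
    print(f"extra residues={extra}  dilution={factor}  dilution/windows={factor / windows:.2f}")
```

Two extra residues give a 400-fold reduction, the same order of magnitude as the roughly 500 internal windows, which is the approximation the paragraph relies on; three extra residues exceed it comfortably.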
The resultant data in this report have been deposited and are available at www.molecularinteraction.org/listofpublication.htm. The specific abundance of certain amino acids may be related to free energy considerations. For example, the preference for lysine at the carboxyl termini could stem from electrostatic stabilization of helix dipoles. The sequences may be mined in other ways as well. For example, the sequences may be sorted by predetermination of amino acid residues at any given position, such as CAAX or KKXX. Our result suggests that the conserved carboxyl-terminal sequence alone may confer certain important biochemical function, although some of these functions are currently unknown. Thus, these conserved motifs may serve as one criterion for grouping diverse proteins with certain shared properties including biochemical function, membrane association, and/or protein domain organization. It supports the notion that the high information contents at the carboxyl termini encode the signatures for certain fundamental functions including but not limited to subcellular localization (1). In C. elegans, HDEL and KDEL were found in seven and four proteins, respectively. Whether the C. elegans ER retention machinery is more degenerate in recognizing its anchor site than that of other species remains unanswered.
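Mining by fixing residues at chosen positions, as mentioned above for KKXX, can be sketched with regular expressions. The translation rules (X for any residue, a parenthesized slash-separated list for alternatives), the function names, and the toy sequences are assumptions for illustration, not the authors' implementation.

```python
import re

def motif_to_regex(motif):
    """Translate motif notation into a regex anchored at the carboxyl terminus.

    Assumed notation: X = any residue; (M/L) = either M or L.
    """
    body = re.sub(
        r"\(([A-Z](?:/[A-Z])+)\)",
        lambda m: "[" + m.group(1).replace("/", "") + "]",
        motif,
    )
    body = body.replace("X", ".")
    return re.compile(body + "$")

def count_motif(proteins, motif):
    """Number of proteins whose carboxyl terminus matches the motif."""
    rx = motif_to_regex(motif)
    return sum(1 for seq in proteins.values() if rx.search(seq))

# Toy sequences (hypothetical, not real proteins).
toy = {
    "p1": "MSTKKAV",
    "p2": "MAAGKK",
    "p3": "MTTAMLL",
    "p4": "MTTALLL",
    "p5": "MQQHDEL",
    "p6": "MQQKDEL",
}

for motif in ["KKXX", "XKK", "A(M/L)LL", "(H/K)DEL"]:
    print(motif, count_motif(toy, motif))
```

A motif such as CAAX would need an explicit residue class, because the A in that notation denotes an aliphatic residue rather than alanine; this sketch treats every letter other than X literally.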
It is important to note that conserved motifs are likely to be physiologically important and thus resistant to selection pressure during evolution. However, less conserved terminal motifs in the proteome may confer equally important but highly specific biochemical activity that is restricted to only one or two proteins in the proteome but conserved among homologues from different species (Table V). As more and better annotated protein sequences become available, it will be important to mine both types of motifs via informatic tools, which could provide interesting hypotheses that can be tested experimentally.