Increased Frequency of Cysteine, Tyrosine, and Phenylalanine Residues Since the Last Universal Ancestor

Analysis of extant proteomes has the potential of revealing how amino acid frequencies within proteins have evolved over biological time. Evidence is presented here that cysteine, tyrosine, and phenylalanine residues have substantially increased in frequency since the three primary lineages diverged more than three billion years ago. This inference was derived from a comparison of amino acid frequencies within conserved and non-conserved residues of a set of proteins dating to the last universal ancestor in the face of empirical knowledge of the relative mutability of these amino acids. The under-representation of these amino acids within last universal ancestor proteins relative to their modern descendants suggests their late introduction into the genetic code. Thus, it appears that extant ancient proteins contain evidence pertaining to early events in the formation of biological systems.

By remaining unchanged over the long course of molecular evolution, conserved residues of ancient proteins might possess significant information regarding early ancestral proteins. We sought to determine whether amino acid frequencies within conserved positions of proteins dating to the last universal ancestor (LUA) of all life indicate that any of the 20 amino acids occurred more or less frequently within early proteins than within their modern descendants. In part, we were motivated by the idea that the amino acid composition of proteins within the LUA might have reflected the order of addition of amino acids to the genetic code, i.e. that compared to modern proteins, the composition was relatively richer in amino acids added to the code early and poorer in those added late. Our approach is based on the insight that the amino acid composition of conserved residues of modern-day proteins has been determined by two factors: the composition of the ancestral proteins that gave rise to the extant proteins and the relative mutability of the various amino acids over the course of evolution of the sequences. Therefore, based solely on knowledge of the composition of conserved (i.e. unchanged) residues of extant sequences and the relative mutability of each amino acid, it may be possible to make inferences regarding the composition of early ancestral proteins.
The mutability of each amino acid has been determined empirically through pair-wise comparison of aligned homologous protein sequences; mutability is defined as the number of times an amino acid differs at analogous sites of two aligned sequences divided by the total occurrence of that amino acid within the pair of sequences (1). Thus, an amino acid that has mutated relatively frequently over the course of evolution is assigned a high mutability, whereas an amino acid that has mutated relatively infrequently is assigned a low mutability. Amino acids differ in mutability according to the ease with which each particular amino acid may be structurally or functionally replaced by any other within proteins. This depends on the size, shape, hydrophobicity, and charge of each amino acid side chain and its ability to form various types of weak bonds, as well as the structure of the genetic code.
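The definition above can be illustrated with a minimal Python sketch for a single aligned pair. The function name `relative_mutability`, the gap character `-`, and the choice to skip gapped columns are assumptions made for illustration; the published estimates (1, 11, 12) pool counts over many aligned pairs rather than one.

```python
from collections import Counter

def relative_mutability(seq_a: str, seq_b: str) -> dict:
    """Mutability per amino acid from one pair of aligned sequences.

    For each amino acid: (occurrences at sites where the two sequences
    differ) / (total occurrences within the pair of sequences).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    differs = Counter()  # occurrences at sites where the pair differs
    total = Counter()    # all occurrences in either sequence
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue  # assumption: gapped columns are ignored
        total[a] += 1
        total[b] += 1
        if a != b:
            differs[a] += 1
            differs[b] += 1
    return {aa: differs[aa] / total[aa] for aa in total}

# Toy alignment (hypothetical sequences): only the third column differs.
toy = relative_mutability("ACDA", "ACEA")
```

In this toy pair, alanine and cysteine never differ and receive a mutability of 0, whereas aspartate and glutamate differ at their only occurrence and receive 1.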
Our approach is based upon the following premise. An amino acid with relatively low mutability is by definition less likely to change over the course of sequence evolution than other amino acids. Therefore, as an original set of ancestral sequences gives rise to successive generations of descendants, the frequency of such an amino acid within conserved positions of those descendants (i.e. residues that are unchanged between ancestral and descendant sequences) will increase relative to its frequency within the entire ancestral sequence set. Consequently, the frequency of an amino acid with low mutability within conserved sequence positions of descendant sequences provides an upper limit on its frequency within the ancestral sequences, i.e. it must have occurred with a lower frequency within the ancestral sequences as a whole than within the conserved positions of descendant sequences. On the other hand, the frequency of an amino acid with relatively high mutability will decrease over evolution within conserved positions of descendant sequences relative to the entire ancestral sequence set; thus, its frequency within conserved positions provides a lower limit on its frequency within the ancestral sequences. It is important to recognize that these inferences regarding the upper and lower limits of amino acid frequencies within ancestral sequences are completely independent of substitution events occurring within non-conserved sequence positions.
As a consequence of the limits specified above, two general types of observations (Table I) would suggest that a change in frequency of an amino acid over evolution within a set of proteins had occurred; if an amino acid with low mutability occurs less frequently within conserved than within non-conserved residues of the extant protein set, its frequency must have increased over evolution, because its frequency within ancestral sequences can be inferred to have been lower than that within conserved residues. Conversely, if an amino acid with high mutability occurs with greater frequency within conserved than non-conserved residues, its frequency can be inferred to have decreased over evolution, because its frequency within ancestral sequences can be inferred to have been higher than that within conserved residues. It is worth remarking that, based on this approach, no inferences regarding changing amino acid frequencies may be made in cases in which an amino acid with low mutability occurs more frequently, or an amino acid with high mutability occurs less frequently, within conserved than non-conserved residues. Nonetheless, this approach may identify some amino acids that have changed in frequency over deep evolutionary time and thereby provide novel insights regarding early proteins. Guided by this rationale, we determined the frequency of each amino acid in conserved and non-conserved sequence elements of a set of extant proteins dating to the LUA in 26 species spanning the three primary lineages.
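The decision rule described in the two preceding paragraphs can be written as a short Python sketch. The function name `infer_trend` and the example frequencies are hypothetical; the labels "H", "L", and "?" follow the high, low, and undetermined mutability classes used in this paper.

```python
def infer_trend(mutability: str, freq_conserved: float,
                freq_non_conserved: float) -> str:
    """Apply the two informative cases of the rationale; otherwise abstain.

    mutability: "L" (low), "H" (high), or "?" (undetermined).
    """
    if mutability == "L" and freq_conserved < freq_non_conserved:
        # The conserved frequency is an upper limit on the ancestral
        # frequency, and the rest of the protein is richer still.
        return "increased"
    if mutability == "H" and freq_conserved > freq_non_conserved:
        # The conserved frequency is a lower limit on the ancestral
        # frequency, and the rest of the protein is poorer still.
        return "decreased"
    return "no inference"

# Hypothetical frequencies, chosen only to exercise each branch.
case_low = infer_trend("L", 0.004, 0.009)
case_high = infer_trend("H", 0.080, 0.070)
case_none = infer_trend("L", 0.090, 0.070)
```

The third case returns "no inference" because a low-mutability amino acid that is enriched in conserved positions is compatible with an increase, a decrease, or no change.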

EXPERIMENTAL PROCEDURES
Choice of Protein Set: Although the nature of the LUA has been the subject of debate (2), for the present work it is sufficient that the LUA was a heterogeneous or homogeneous population that diverged to form the three primary lineages. Consistent with this view, a set of proteins was selected that can be inferred to have been present within the LUA. The clusters of orthologous groups (COG) database (3), which groups proteins into families based on pairwise comparisons of the protein complements of fully sequenced genomes, was used to assist in the choice of proteins to include in the analysis. Twenty-six major lineages (19 eubacteria, six archaea, and one eukaryote) are represented in the COG database. Not all species contribute members to all families in the database; on the other hand, some species contribute more than one member to a particular family.
Our first requirement was that a member of a protein family be present in at least one species of each of the three primary lineages, because this criterion is used to infer that an ancestor of that family was present in the LUA (4). In fact, we required that for any protein family to be included in the study, at least one member had to be present in all 26 species selected from the COG database (for the list of species, see the legend to Table V). This made it possible to assemble a set containing members from the same protein families for each of these species. Although only one eukaryote, Saccharomyces cerevisiae, was included in the analysis, this did not in any way limit the ability to identify conserved sequence positions within the protein set or to draw conclusions based on the data obtained. In fact, the very wide phylogenetic representation of both eubacteria and archaea was more than sufficient to identify conserved residues, allowing inferences to be drawn regarding the frequency of certain amino acids within ancestral sequences in the LUA.
The inclusion of proteins that have been laterally transferred between the eubacterial and the archaeal/eukaryotic lineages would confound our ability to identify residues conserved since the LUA. The protein set was therefore chosen so as to minimize inclusion of laterally transferred proteins. The phylogenetic grouping of the archaea and eukaryotes within a lineage distinct from that of the eubacteria, originally based upon the small subunit rRNA tree (5), has been supported by whole genome analysis (6). Therefore, for any protein family to be included in the analysis, it was required that the family member from the one eukaryotic species, S. cerevisiae, and the members from the six archaeal species form a cluster that is separate from the members contributed by the eubacterial species on the phylogenetic tree provided with each COG (suggesting that proteins within this family have not been laterally transferred between the eubacterial and the archaeal/eukaryotic lineages). Finally, for the purpose of sequence reconstruction (see below), it had to be assumed that species and protein trees are congruent, an assumption potentially violated by inclusion of paralogs (homologs arising through gene duplication) that arose prior to speciation. Therefore, for inclusion of any COG family in the analysis, it had to have one homolog within each species, whether an ortholog or paralog, that did not invalidate the assumption of species and protein tree congruence.
After these requirements were fulfilled, our protein set consisted of 59 COG families (Table II). Forty-five of these proteins play some role in translation (many are ribosomal proteins), and another seven play a role in transcription, replication, or DNA repair. These all are classified as informational proteins (7), because they function in replication, transcription, or translation. The remaining seven proteins are classified as operational proteins (7), which perform metabolic and other housekeeping roles within the cell. Informational proteins have been found to be less likely to be laterally transferred than operational proteins (7), and because one of the goals in choosing the set was to avoid laterally transferred proteins, the high proportion of informational proteins in the set was both expected and reassuring.
Identification of Conserved Residues: The next step was to identify residues within the 59 proteins from each of the 26 species that have been conserved since the LUA. Sequences were aligned using ClustalW (8). Two approaches were then used to identify conserved residues within each of the descendant sequences. The first was to identify positions in which the amino acid residues in all 26 descendants are identical. We refer to such positions as "identical sites" to distinguish them from conserved residues identified using the second method described below. Identical sites are rare (~2% of sequence sites) and exclude many residues actually conserved between an ancestral sequence and any given descendant sequence.
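The first approach, finding identical sites, amounts to scanning the columns of a multiple alignment. The following Python sketch shows one way to do that; the function name `identical_sites`, the gap character `-`, and the toy alignment are assumptions for illustration and are not taken from the ClustalW output used in this work.

```python
def identical_sites(alignment: list) -> list:
    """Return 0-based column indices at which every aligned sequence
    carries the same residue (columns containing a gap are excluded)."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have equal length")
    sites = []
    for col in range(length):
        column = {seq[col] for seq in alignment}
        if len(column) == 1 and "-" not in column:
            sites.append(col)
    return sites

# Toy alignment of three hypothetical descendants.
toy_alignment = ["MKGLC", "MRGLA", "MKG-C"]
toy_sites = identical_sites(toy_alignment)
```

Here columns 0 and 2 are identical sites; column 3 is rejected because one sequence is gapped there.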
To identify conserved residues more accurately, maximum parsimony (9) was used to partially reconstruct the ancestral protein sequences in the LUA that gave rise to each family of aligned descendants. The protein parsimony software "protpars" included in the PHYLIP phylogenetic package (10) was used to partially reconstruct ancestral sequences, assuming the phylogenetic tree indicated by small subunit rRNA data (5). Using the inferred ancestral sequence, conserved and non-conserved sites within the descendant sequence of each species were identified. Because these ancient sequences have diverged to a great extent, only slightly more than a third (~37%) of the sites within the ancestral sequence could be reconstructed. At sequence positions for which no ancestral residue could be assigned, it was assumed that residues within none of the descendant sequences were conserved. The frequency of each amino acid within conserved and non-conserved residues of the sequence set in each species could then be determined.
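Once a partially reconstructed ancestral sequence is available, the classification of descendant residues and the tallying of frequencies can be sketched as below. This is an illustration only: the function names, the `?` symbol for unreconstructed ancestral positions, and the `-` gap character are assumptions, and the reconstruction itself (performed with protpars) is not reproduced.

```python
from collections import Counter

def split_counts(ancestor: str, descendant: str, unknown: str = "?"):
    """Count descendant residues at conserved and non-conserved sites.

    A residue is conserved only if the ancestral residue was assigned and
    matches it; unassigned ancestral positions count as non-conserved.
    """
    conserved, non_conserved = Counter(), Counter()
    for anc, res in zip(ancestor, descendant):
        if res == "-":
            continue  # assumption: gaps in the descendant carry no residue
        if anc != unknown and anc == res:
            conserved[res] += 1
        else:
            non_conserved[res] += 1
    return conserved, non_conserved

def frequencies(counts: Counter) -> dict:
    """Convert residue counts to relative frequencies."""
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}

# Toy example with a hypothetical ancestor and descendant.
cons, non_cons = split_counts("MK?G", "MRAG")
cons_freq = frequencies(cons)
```

Pooling across species would correspond to summing the `Counter` objects from every descendant before calling `frequencies`.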

RESULTS
Conserved sequence elements for the 26 species were pooled to determine frequencies of each amino acid in those positions; the same was done for the non-conserved sequence elements. Seven amino acids (glycine, histidine, leucine, proline, arginine, valine, and tryptophan) were more frequent in conserved than non-conserved sequence elements; the remaining 13 amino acids were more frequent in non-conserved sequence elements (Table III).
The relative mutability of the 20 amino acids has been determined empirically by several investigators, starting with Dayhoff et al. (1). Jones et al. (11) later updated the mutability estimates of Dayhoff et al. (1) for a much larger set of protein sequences, whereas Gonnet et al. (12) based their mutability estimates on a set of sequences similar to that of Jones et al. (11) but using a modification of the Dayhoff approach. Depending on the data set and approach, some variations occur in the relative mutability ranking (Table IV). Nonetheless, seven amino acids (valine, glutamine, isoleucine, threonine, alanine, serine, and asparagine) consistently fall within the top half, and seven (tryptophan, cysteine, phenylalanine, tyrosine, glycine, proline, and arginine) fall within the bottom half of the ranking. There is, however, a lack of consensus as to whether the remaining amino acids (leucine, lysine, histidine, methionine, and glutamic and aspartic acids) are of high or low mutability. Accordingly, amino acids are assigned high, low, or undetermined mutability in Table III. Based on both their relative mutabilities and their relative frequencies in conserved and non-conserved sequence elements, the following three amino acids may be inferred to have changed in frequency in the protein set since the LUA: cysteine, tyrosine, and phenylalanine (Table III). Because all three of these amino acids are of low mutability and are more abundant in non-conserved than conserved residues, they must have increased in frequency over time (Table I). Although valine, being of high mutability and occurring more frequently in conserved than non-conserved sequence elements, also satisfies the criteria summarized in Table I, its difference in frequency between these subsets is not statistically significant as determined using a chi-square test. 
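The text reports a chi-square test for the difference between subsets but gives neither the underlying residue counts nor the exact form of the test. Purely as an illustration, a Pearson chi-square test on a 2 x 2 table (residue of interest versus all other residues, conserved versus non-conserved) could be sketched as follows. The function name and the counts are hypothetical.

```python
from math import erfc, sqrt

def chi_square_2x2(a: int, b: int, c: int, d: int):
    """Pearson chi-square (1 d.f., no continuity correction) for the table
        [[a, b],   conserved:     residue of interest, all other residues
         [c, d]]   non-conserved: residue of interest, all other residues
    Returns (statistic, p_value).
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-square distribution with 1 d.f.
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts, not taken from the paper.
stat, p = chi_square_2x2(10, 90, 20, 80)
```

A difference would be called significant at the 5% level when the returned p value falls below 0.05; with real data the four counts would come from the pooled conserved and non-conserved residues.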
The remaining amino acids either lack consensus regarding their relative mutability (see above) or fall into one of the two categories in Table I for which no inferences may be made; glycine, proline, arginine, and tryptophan are of low mutability and are more frequent in conserved than non-conserved residues, whereas alanine, isoleucine, asparagine, glutamine, serine, and threonine are of high mutability and are less frequent in conserved than non-conserved residues.
The frequencies of cysteine, tyrosine, and phenylalanine within conserved residues are 0.0039, 0.0231, and 0.0331, respectively (Table III). Because of their low mutability, the frequencies of these amino acids within conserved residues provide an upper limit on their frequencies within this protein set in the LUA. By comparison, the frequencies of cysteine, tyrosine, and phenylalanine within the protein set as a whole are 0.0074, 0.0297, and 0.0374, respectively. It can therefore be inferred that the frequency of cysteine has increased by at least ~90% (nearly doubling) within this protein set between the LUA and today, whereas that of tyrosine has increased by at least 29% and that of phenylalanine by at least 13%.
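The minimum increases follow directly from the two sets of frequencies quoted above: because the conserved-residue frequency is an upper limit on the ancestral frequency, the ratio of the whole-set frequency to the conserved frequency is a lower limit on the fold change. A short Python check of that arithmetic (variable and function names are illustrative):

```python
# Frequencies quoted in the text (Table III).
conserved = {"Cys": 0.0039, "Tyr": 0.0231, "Phe": 0.0331}
whole_set = {"Cys": 0.0074, "Tyr": 0.0297, "Phe": 0.0374}

def minimum_increase(aa: str) -> float:
    """Lower bound on the fractional increase since the LUA."""
    return whole_set[aa] / conserved[aa] - 1.0

min_increase = {aa: minimum_increase(aa) for aa in conserved}
# Cys ~0.90, Tyr ~0.29, Phe ~0.13
```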
To gain insight into whether cysteine, tyrosine, and phenylalanine might still be increasing in frequency today, we determined whether they are present in modern proteomes at frequencies predicted by neutral evolution. The neutral theory of molecular evolution predicts that an amino acid within a proteome should eventually reach an equilibrium frequency determined primarily by the number of codons assigned to that amino acid, adjusted for the nucleotide composition of its codons and the nucleotide composition of the genomic coding sequences (14). The probability of observing amino acid j in a specific genome is given by $p_j = \alpha \sum_i x_i y_i z_i$, where i ranges over the codons assigned to amino acid j; $x_i$, $y_i$, and $z_i$ represent the frequency of occurrence of the first, second, and third nucleotides, respectively, of codon i within coding sequences of that genome; and $\alpha$ is a constant such that the sum over all amino acids is equal to one. The normalization constant compensates for probabilities assigned to stop codons.
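The neutral expectation described above can be sketched in Python using the standard genetic code. Several points are assumptions rather than details confirmed by the text: the function name `neutral_frequencies`, the name `alpha` for the normalization constant, and the option of supplying separate nucleotide frequencies for each codon position (passing the same dictionary three times gives a single overall composition instead).

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; "*" marks the three stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}

def neutral_frequencies(pos1: dict, pos2: dict, pos3: dict) -> dict:
    """Expected amino acid frequencies from nucleotide frequencies.

    pos1, pos2, pos3 map each base to its frequency at the first, second,
    and third codon position (assumption: position-specific inputs).
    """
    raw = {}
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue  # stop codons are excluded and absorbed by alpha
        weight = pos1[codon[0]] * pos2[codon[1]] * pos3[codon[2]]
        raw[aa] = raw.get(aa, 0.0) + weight
    alpha = 1.0 / sum(raw.values())  # makes the 20 frequencies sum to one
    return {aa: alpha * w for aa, w in raw.items()}

# Sanity check with equal base frequencies: every sense codon is equally
# likely, so a two-codon amino acid such as cysteine gets 2/61.
uniform = {base: 0.25 for base in BASES}
expected = neutral_frequencies(uniform, uniform, uniform)
```

With real data, the three dictionaries would be filled from genomic coding-sequence nucleotide frequencies for the species of interest.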
Using genomic coding sequence nucleotide frequency data derived from the Codon Usage Database (15), the frequencies of cysteine, tyrosine, and phenylalanine in the proteome of each species predicted by neutral evolution were determined (Table V). The observed frequency of cysteine is significantly less than that predicted in all 26 species (p ≪ 0.01), the mean over all species being one-third of that predicted. In contrast, the observed frequency of tyrosine is less than predicted in only 15 of the species (p = 0.28, which is not statistically significant), and the mean observed frequency of tyrosine, 0.0335, is close to that predicted, 0.0358. For phenylalanine, the observed frequency is higher than predicted in 25 species (p ≪ 0.01), the mean observed frequency, 0.0437, being ~40% higher than predicted. Therefore, the observed frequency of cysteine is less than, and of phenylalanine is greater than, that predicted by neutral evolution, whereas that of tyrosine agrees with the prediction of neutral evolution.

TABLE III Frequency of each amino acid in conserved and non-conserved sequence residues and in the entire protein set, pooled among the 26 species
Amino acids that consistently fall within the top half of the mutability ranking (see Table IV) are assigned high (H) relative mutability; those consistently in the bottom half, low (L) relative mutability; otherwise, undetermined (?) relative mutability.
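The text does not name the test behind the p values reported for the species counts in Table V. One candidate, assumed here purely for illustration, is a one-sided sign test: under the null hypothesis each species is equally likely to fall above or below the prediction, so the count of species on one side follows a Binomial(26, 0.5) distribution. The function name is hypothetical.

```python
from math import comb

def sign_test_one_sided(successes: int, n: int) -> float:
    """P(X >= successes) for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, k) for k in range(successes, n + 1)) / 2 ** n

p_tyr = sign_test_one_sided(15, 26)  # tyrosine: 15 of 26 below prediction
p_phe = sign_test_one_sided(25, 26)  # phenylalanine: 25 of 26 above
p_cys = sign_test_one_sided(26, 26)  # cysteine: all 26 below
# 15 of 26 gives about 0.28; the other two are far below 0.01.
```

Under this assumption the tyrosine count reproduces the reported value of 0.28, and the cysteine and phenylalanine counts give probabilities many orders of magnitude below 0.01.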

DISCUSSION
It is generally assumed that those amino acids believed to have been absent from the prebiotic environment were added to the genetic code later, as enzymes for their biosynthesis evolved (16). Thus, very early versions of the code would have included only prebiotically available amino acids. Because cysteine, tyrosine, and phenylalanine are absent from simulations of the prebiotic environment of the Earth (17), they are commonly held to be late additions to the genetic code. Although we do not propose a specific mechanism for addition of these amino acids to the evolving primitive code, we do make the assumption that codon reassignments would have occurred in a fashion that introduced them into proteins gradually, because the impact upon protein structure of introducing these amino acids en masse was more likely to be detrimental than beneficial (18). Specifically, these amino acids most likely adopted codons that occurred infrequently within coding sequences. This idea is consistent with the fact that both cysteine and tyrosine share four-codon blocks with at least one stop codon; it is quite possible that the code had only recently evolved to use those codons to specify other amino acids (through modification of existing tRNAs) when cysteine and tyrosine "captured" them.
Consequently, we propose that upon their introduction into the code, these three amino acids would have gone from being non-existent to being rare within early coded proteins. Furthermore, because of the distinct physicochemical properties of these amino acids, the majority of subsequent coding sequence mutations introducing them into proteins presumably would have been deleterious, causing their increase in frequency to be gradual (that of cysteine especially so). Because our data indicate that these three amino acids increased in frequency between the LUA and today, they must not have reached their equilibrium frequencies by the time of the LUA. According to this scenario, the under-representation of these amino acids in the LUA relative to today is consistent with their late addition to the genetic code. It has conventionally been assumed that the time between the origin of proteins and today has been sufficient for all amino acids to reach their equilibrium frequencies and therefore, that an observed frequency of an amino acid distinct from that predicted by neutral evolution is evidence of some strict requirement of protein structure or function that places unusual selection on that amino acid (14). However, because our findings suggest that at the time of the LUA, cysteine, tyrosine, and phenylalanine had yet to reach equilibrium frequencies, change of amino acid composition toward that predicted by neutral evolution may be a process requiring very long time periods. Indeed, the observation that the frequency of cysteine is so much lower than that predicted by neutral evolution in modern proteomes may be evidence that the increase in usage of this particular amino acid has been especially gradual over evolution. Consequently, the possibility that even today cysteine continues to move toward its equilibrium frequency through neutral evolution, as the vast range of all possible sequence space is gradually searched, cannot be ruled out. 
On the other hand, over time phenylalanine has become more frequent in proteins than predicted by neutral evolution. In fact, it is possible that the frequency of phenylalanine, too, will increase further with evolution. In any case, positive selection for phenylalanine has caused any initial rarity of this amino acid in the earliest proteins to be overcome. The same may be argued for tyrosine, the observed frequency of which does not differ significantly from that predicted by neutral evolution.
Although our approach did not produce evidence for a change in frequency of any of the other 17 amino acids over the course of evolution, this does not imply that no other amino acids have changed in frequency. Using our rationale, it is not possible to reach a definite conclusion regarding the change in frequency (or lack thereof) of those amino acids of high mutability that are less frequent in conserved than non-conserved positions and those of low mutability that are more frequent in conserved than non-conserved positions. Moreover, our ability to make inferences was limited by the lack of consensus on the relative mutability of six amino acids (see Table III and Table IV). It is therefore possible that amino acids other than cysteine, tyrosine, and phenylalanine have increased in frequency since the LUA. With the increase in frequency of these three (and perhaps other) amino acids, there must have been a concomitant decrease in frequency of at least one other amino acid. Because valine is of high mutability and is present at greater frequency in conserved than non-conserved sequence elements (although not to a statistically significant extent), it may indeed have decreased in frequency over time. An alternative approach will be required to determine with certainty which amino acids other than cysteine, tyrosine, and phenylalanine have in fact changed in frequency over evolution.
It is not immediately evident how amino acid composition and structure have co-evolved in the ancient protein set investigated. Studies of protein evolution suggest that structure and function can be well conserved even as protein sequence diverges extensively (see Ref. 19, but see Ref. 20 for a contrary view). However, evolution of amino acid composition may have impacted structure in newly arising proteins of the proteome. Each amino acid has a specific predisposition to occur in different secondary structures, i.e. in α-helices, β-sheets, or random coils (21,22), and negative selection preserving structure would have been relatively relaxed in this later protein set. Further investigation will be required to elucidate structural consequences of changes in proteomic amino acid composition.