Automated Identification of Putative Methyltransferases from Genomic Open Reading Frames

We have analyzed existing methodologies and created novel methodologies for the automatic assignment of S-adenosylmethionine (AdoMet)-dependent methyltransferase functionality to genomic open reading frames based on predicted protein sequences. A large class of the AdoMet-dependent methyltransferases shares a common binding motif for the AdoMet cofactor in the form of a seven-strand twisted β-sheet; this structural similarity is mirrored in a degenerate sequence similarity that we refer to as methyltransferase signature motifs. These motifs are the basis of our assignments. We find that simple pattern matching based on the motif sequence is of limited utility and that a new method of “sensitized matrices for scoring methyltransferases” (SM²) produced with modified versions of the MEME and MAST tools gives greatly improved results for the Saccharomyces cerevisiae yeast genome. From our analysis, we conclude that this class of methyltransferases makes up ∼0.6–1.6% of the genes in the yeast, human, mouse, Drosophila melanogaster, Caenorhabditis elegans, Arabidopsis thaliana, and Escherichia coli genomes. We provide lists of unidentified genes that we consider to have a high probability of being methyltransferases for future biochemical analyses. Molecular & Cellular Proteomics 2:525–540, 2003.

ists that study methylation have been blessed in that many of the AdoMet-dependent methyltransferases share common three-dimensional signatures (notably in the AdoMet binding regions) that are imperfectly reflected in similarities in their primary sequences (4). There are, at present, at least three structurally defined types of AdoMet-dependent methyltransferases. The major class (Class I) is based on a seven-strand twisted β-sheet structure (4,5). A second recently described class (Class II) is exemplified by the SET proteins (6). Finally, a last class (Class III) is the set of membrane-associated enzymes with multiple membrane-spanning regions (7).
Herein is described the unification of developed methods to mine the information available in gene primary sequences and the screening of entire genomes in the attempt to completely assign in silico all known and novel AdoMet-dependent methyltransferases of the major seven-strand twisted β-sheet family. The common motifs for Class I AdoMet-dependent methyltransferases were first recognized in 1989 when three regions of similarity were noticed between the protein L-isoaspartyl O-methyltransferase and certain nucleic acid and small molecule methyltransferases (8). Over the years, these regions were expanded, largely by manual inspection of sequences, into Motif I, Post I, Motif II, and Motif III (9). These motifs were ultimately used for the first time in 1999 to scan the entire genome of Saccharomyces cerevisiae for putative methyltransferases (10). The result of the 1999 analysis was a list of 26 candidate S. cerevisiae open reading frames (ORFs).
The techniques used to perform the 1999 search relied heavily on the BLAST algorithm (11), a tool that performs sequence similarity searches. In this work, we describe three extensions of the search protocol for novel methyltransferases. Firstly, we have redefined the motifs using a positionally sensitive scoring matrix, for example where the first letter in the motif might be considered more important for a match than the third letter. Secondly, we have defined these motifs using an assortment of known methyltransferases with different substrate specificities. Finally, we have automated these tasks for easy refinement as more methyltransferases are discovered and to allow for the rapid screening of new genomes as they are sequenced. The results of motif analyses were verified and in some cases extended using sequence profile analysis implemented in PSI-BLAST (12) and HMMer (13), arguably two of the best tools for detection of remote sequence homology.

EXPERIMENTAL PROCEDURES
Development of a Methyltransferase-specific Database for Automated ORF Tracking and Scoring-To track the progress of automated methyltransferase assignment methodologies, a database of existing yeast methyltransferases was built that could be queried by automated scoring systems. The database system chosen was MySQL (www.mysql.com/), a freely available SQL (structured query language) implementation. The layout of the database, designated MSD for "methyltransferase-specific database," is shown in Table I and is populated as described below.
The MSD is a hand-curated database of methyltransferases combining annotations of genes identified by literature review and genes identified from our automated identification methodologies. Each entry (record) of the MSD is characterized by a number of pieces of information (fields) useful specifically for work with methyltransferases. These include the class of the methyl-accepting substrate, the source organism, the ORF and gene names, and a confidence number that we assigned based on the biochemical evidence in the literature for methyltransferase function. Records are added to the database as new information in the literature becomes available or as candidates are selected based on automatic methyltransferase prediction algorithms. The confidence number in a record runs from 3 (strong experimental support for methyltransferase activity associated with the gene product) to −3 (strong experimental evidence against being a methyltransferase); an entry of 0 denotes no information is available. This MSD is the only database of manually collected and annotated methyltransferases that we are aware of and is available at www.methyltransferase.org/.
In addition to the MSD that we have built, the Saccharomyces Genome Database (SGD; footnote 2) provides two databases of gene annotation. We have regularly downloaded these from the SGD and loaded them into local MySQL tables with similar table definitions to those used by SGD. The two databases are available as ftp://genome-ftp.stanford.edu/pub/yeast/tables/ORF_Descriptions/orf_geneontology.tab and ftp://genome-ftp.stanford.edu/pub/yeast/gene_registry/registry.genenames.tab.
One of the advantages to using MySQL is the multitude of programmatic interfaces. Using the methods described below, lists of putative methyltransferases will be generated, which then can be automatically scored by querying either our MSD or the SGD using locally built programs.

2 Dolinski, K., Balakrishnan, R., Christie, K. R., Costanzo, M. C., Dwight, S. S., Engel, S. R., Fisk, D. G., Hong, E. L., Issel-Tarver, L., Sethuraman, A., Theesfeld, C. L., Binkley, G., Lane, C., Schroeder, M., Dong, S., Weng, S., Andrada, R., Botstein, D., and Cherry, J. M., Saccharomyces Genome Database at genome-www.stanford.edu/Saccharomyces/.

TABLE I (footnote b). Below are descriptions of the fields in the methyltransferase table. Comments of * are described here. orf represents the common ORF name for the reading frame examined. genename is the common name used to refer to the gene product, if one exists. beenviewed reflects that entries can automatically be added to the database; if beenviewed is 1 it reflects that the database curator has viewed and commented on a particular record. mt_verif_status is a rating (−3 to 3) of how well the evidence refutes (−3, very refuted) or supports (3, very supported) whether a particular ORF is a methyltransferase; 0 represents no experimental evidence. annotation is used to hold the description from an outside database (e.g. SGD or Proteome (IncyteGenomics Yeast Proteome Database at www.incyte.com/bioknowledge)). autoid_sets is a description of why the entry was added to the database (e.g. which programmatic run identified this ORF as a methyltransferase).
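The field descriptions above can be illustrated with a small table definition and lookup. This is only a sketch under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for MySQL, the column types and the CHECK constraint are guesses, and the function names (build_msd, verification_status) and the example rows are hypothetical rather than taken from the MSD.

```python
import sqlite3

# Hypothetical, simplified "methyltransferase" table. Column names follow the
# field descriptions in the text; the types and constraint are assumptions.
SCHEMA = """
CREATE TABLE methyltransferase (
    orf             TEXT PRIMARY KEY,   -- common ORF name
    genename        TEXT,               -- common gene name, if one exists
    beenviewed      INTEGER DEFAULT 0,  -- 1 once a curator has reviewed the record
    mt_verif_status INTEGER DEFAULT 0
        CHECK (mt_verif_status BETWEEN -3 AND 3),
    annotation      TEXT,               -- description from an outside database
    autoid_sets     TEXT                -- which automated run added the entry
)
"""


def build_msd(rows):
    """Create an in-memory table and load (orf, genename, status) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO methyltransferase (orf, genename, mt_verif_status) "
        "VALUES (?, ?, ?)",
        rows,
    )
    return conn


def verification_status(conn, orf):
    """Return the confidence number for an ORF, or None if it is not recorded."""
    row = conn.execute(
        "SELECT mt_verif_status FROM methyltransferase WHERE orf = ?", (orf,)
    ).fetchone()
    return None if row is None else row[0]


# Placeholder records for illustration only, not curated data.
conn = build_msd([("YBR133C", "HSL7", 3), ("YXX000W", None, 0)])
```

An automated scoring program would call something like verification_status for each candidate ORF and fall back to the SGD tables when it returns None.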
Non-weighted ("Canonical") Degenerate Pattern Searching-The most straightforward method of motif generation and searching is the process of aligning the amino acid sequences of known methyltransferases in the conserved motif regions and making a consensus sequence based on those regions. This is described as a degenerate pattern as each position can possibly be one of several amino acids and as non-weighted because no position is considered more or less important than another (Table II).
To search for these non-weighted degenerate motifs in the translated yeast genome, a FASTA format file containing translations of all of the yeast genomic and mitochondrial genes (from SGD, "orf_trans.fasta") was modified to remove line breaks within sequences and then searched using the standard UNIX utility, "grep," with appropriate regular expressions describing the degenerate motif (e.g. [GPY]"). A complete schematic for the formulation and use of these non-weighted degenerate motifs is shown in Fig. 1, and a list of the 18 methyltransferases used in the motif definition is shown in Table III.
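The same kind of search can be sketched in Python instead of grep. The example below joins wrapped FASTA lines and then applies a regular expression built from two degenerate motifs separated by a variable gap. The two motif patterns are hypothetical and chosen only for illustration (the patterns actually used are those of Table V, part C), and the function names are assumptions.

```python
import re


def read_fasta(text):
    """Parse FASTA text into {name: sequence}, joining wrapped sequence lines."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


# Hypothetical degenerate patterns, for illustration only.
MOTIF_I = r"[VIL][VIL]D[VIL]G[CAS]G[TPS]G"
POST_I = r"[VIL][VIL][GA][VIL][DE]"
# Motif I, then 10 to 30 residues of any kind, then Post I.
PATTERN = re.compile(MOTIF_I + r".{10,30}" + POST_I)


def matching_orfs(records, pattern=PATTERN):
    """Return the names of all sequences that contain the pattern."""
    return [name for name, seq in records.items() if pattern.search(seq)]
```

Because every bracketed position accepts each listed residue equally, a match carries no score; the output is an unranked list of ORF names.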
Weighted Position-based Motif Searching-The program MEME (17) was used to automatically scan a training set of known methyltransferase amino acid sequences and produce a list of log-odds matrices of amino acids and positions that described putative methyltransferase motifs. These log-odds matrices were then used to scan the S. cerevisiae genome using the program MAST (18). The two major obstacles were the formulation of the initial training set and trying to generate motifs that simulated the known variation in spacing between motifs.
The MEME training set was built as follows and is represented graphically in Fig. 2. Entrez (the National Center for Biotechnology Information database query tool, www.ncbi.nlm.nih.gov/) was queried for the keyword "methyltransferase." Of the 5845 matches, all entries not from the RefSeq database were removed; the RefSeq database is the National Center for Biotechnology Information's curated set of entries that are designed to reflect the most highly accurate entries. The remaining 1064 entries were pruned using BLAST such that the final set did not contain any two sequences that matched with an expect value less than the desired threshold using the Blosum62 scoring matrix; the purpose of this culling is to remove entries that are highly similar to one another, which would lead to overrepresentation of certain sequences. With a cutoff expect value of 10⁻²⁰, 289 sequences were in the final training set, of which 173 contributed to the definition of Motif I; the other 116 did not have regions similar enough to contribute to the Motif I definition and may represent Class II or III methyltransferases.
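The culling step can be sketched as a greedy filter. This is an assumption about how such pruning might be organized, not a description of the procedure actually used: the text states only the property of the final set (no two members match with an expect value below the threshold). The evalue argument is a hypothetical caller-supplied function standing in for a BLAST comparison, which is not run here.

```python
def cull_redundant(ids, evalue, threshold=1e-20):
    """Greedy pruning of a sequence list.

    A sequence is kept only if it has no match with an expect value below
    `threshold` to any sequence already kept. `evalue(a, b)` is a hypothetical
    interface returning the pairwise BLAST expect value.
    """
    kept = []
    for candidate in ids:
        if all(evalue(candidate, k) >= threshold for k in kept):
            kept.append(candidate)
    return kept
```

With this formulation a smaller threshold (for example 1e-50 instead of 1e-20) removes fewer sequences, since fewer pairs fall below it. The result of a greedy pass depends on the input order.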
The output from the MEME program is a list of motifs described as matrices with dimensions of (motif length) × 20, each entry of which represents the log-odds for that amino acid occurring at that motif position. A sample motif is shown in Table IV.
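To make the use of such a matrix concrete, the sketch below scores a sequence window by summing, for each motif position, the entry belonging to the residue found there, and slides the matrix along a sequence to find the best window. The column order follows the amino acid order given with Table IV. The toy_matrix helper and its values are invented for illustration; real MEME matrices hold graded log-odds values, and MAST's actual scoring and statistics are more involved than this.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # column order of the matrix
COLUMN = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def window_score(matrix, window):
    """Sum the matrix entries selected by each residue of the window."""
    return sum(row[COLUMN[aa]] for row, aa in zip(matrix, window))


def best_window(matrix, sequence):
    """Slide the matrix along the sequence; return (score, start) of the
    highest-scoring window (leftmost on ties), or None if the sequence is
    shorter than the motif."""
    width = len(matrix)
    scores = [
        (window_score(matrix, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scores, key=lambda s: s[0]) if scores else None


def toy_matrix(consensus, match=2.0, mismatch=-1.0):
    """Toy matrix rewarding the consensus residue at each position.
    The numbers are made up and are not MEME output."""
    return [[match if aa == c else mismatch for aa in AMINO_ACIDS]
            for c in consensus]
```

Unlike the non-weighted patterns, every window receives a numeric score, so candidate ORFs can be ranked. Residues outside the 20 standard letters would need separate handling in this sketch.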
TABLE III Sequence and spacing of motifs in known yeast methyltransferases
Based on literature and database review up to September 2002, 18 highly experimentally supported methyltransferases were selected and visually inspected for the motif sequences. The names of the selections, the motif sequences, and the spacing between them (SP) are shown in the table. Less confident motif assignments are marked with a question mark, and motif identification different than that described previously by Niewmierzycka and Clarke (10) is shown in italics. At the bottom of the table is a graphical representation of the consensus sequence for each of the motifs where the bigger the letter is, the more it occurred in that position in that motif.

Modified Weighted Position-based Motif Searching-Because of the high degeneracy and narrow width of the Post I motif, it could not be automatically identified. At best MEME was found to return a description of a Motif I, some interleaving residues of low significances, and a Post I motif. However, this description is only applicable to a very limited number of methyltransferases. Instead a matrix describing Post I was hand-forged based on the amino acid frequencies of the Post I motifs in the known S. cerevisiae methyltransferases (as of September 2002) as shown in Table III. Additionally, MAST, the tool that searches genomes based on the MEME motifs, does not allow searching for two motifs separated by a variable gap size. Therefore a series of matrices was built with the MEME-determined definition of Motif I and the hand-built definition of Post I separated by between 10 and 35 score-neutral entries in the matrix. Because MAST will, if possible, match multiple motifs to a target sequence (a situation almost guaranteed by the degenerate description of Post I), the source code to MAST needed to be modified to only consider the best fitting motif to any given target.
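The variable-gap workaround described above can be sketched as follows: one composite matrix is built per gap length by inserting rows of zeros (score-neutral positions) between the Motif I and Post I matrices, and only the single best-scoring composite is reported for a given sequence. This is an illustrative reimplementation in Python, not the modified MAST source; the function names are assumptions and a row of zeros is assumed to be what "score-neutral" means.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
COLUMN = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
NEUTRAL_ROW = [0.0] * len(AMINO_ACIDS)  # adds nothing to the score


def composite_matrices(motif_i, post_i, min_gap=10, max_gap=35):
    """Return {gap: matrix}, one matrix per spacer length."""
    return {
        gap: motif_i + [NEUTRAL_ROW] * gap + post_i
        for gap in range(min_gap, max_gap + 1)
    }


def best_composite_hit(matrices, sequence):
    """Return (score, start, gap) for the single best-fitting composite
    matrix, or None if the sequence is shorter than every matrix."""
    best = None
    for gap, matrix in matrices.items():
        width = len(matrix)
        for start in range(len(sequence) - width + 1):
            window = sequence[start:start + width]
            score = sum(row[COLUMN[aa]] for row, aa in zip(matrix, window))
            if best is None or score > best[0]:
                best = (score, start, gap)
    return best
```

Reporting only the best composite avoids counting the same Motif I occurrence once per gap length, which is the situation the MAST modification was meant to prevent.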
Automated Scoring of Candidates-Automated methyltransferase identification methods produce lists of gene names (open reading frames) that are putatively methyltransferases. The results of these searches need to be evaluated, including the rather large lists generated by the non-weighted degenerate searches. Evaluation is the process of taking a list of generated candidates and deciding for each candidate whether it is a known methyltransferase (a "hit"), whether it is known not to be a methyltransferase (a "miss" or "false positive"), or whether it is neither of the above (a "putative methyltransferase"). Systematic evaluation is performed as follows. If the candidate is in the MSD, assignment is based on that score (2 or 3 is a hit, −2 or −3 is a miss, and −1, 0, or 1 is a putative methyltransferase). Otherwise the annotation of the two SGD tables (orf_geneontology.tab and registry.genenames.tab) is queried. If the annotations are marked as "unknown" the candidate is considered a putative methyltransferase. If the annotations contain the word "methyltransferase" the candidate is considered a methyltransferase. Otherwise, the candidate is considered an incorrect prediction (a false positive).
There are a number of inconsistencies in the SGD that can lead to inaccurate scoring. For example, HSL7, GCD14, and HemK are still not annotated as methyltransferases in the SGD (although they are in the MSD). This reflects that some genes are annotated as part of a pathway or have a phenotype but that the role as a methyltransferase was not initially known; for example HSL7 (YBR133c) is annotated as a negative regulator of the SWE1 kinase, but experimental evidence has confirmed the prediction of HSL7 as a methyltransferase (19).
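The evaluation rules above can be written as a short function. This is a sketch under assumptions: the MSD and SGD lookups are represented by plain dictionaries rather than database queries, "marked as unknown" is interpreted as every available annotation containing the word "unknown," and a candidate with no annotation at all is treated as putative, which the text does not specify.

```python
def evaluate_candidate(orf, msd_status, sgd_annotations):
    """Classify one candidate ORF as 'hit', 'miss', or 'putative'.

    msd_status:      dict mapping ORF -> confidence number (-3..3)
    sgd_annotations: dict mapping ORF -> list of annotation strings
    Both dictionaries are hypothetical stand-ins for the database queries.
    """
    if orf in msd_status:
        score = msd_status[orf]
        if score >= 2:
            return "hit"
        if score <= -2:
            return "miss"
        return "putative"
    notes = [a.lower() for a in sgd_annotations.get(orf, [])]
    # Assumption: no annotation at all is handled like "unknown".
    if all("unknown" in a for a in notes):
        return "putative"
    if any("methyltransferase" in a for a in notes):
        return "hit"
    return "miss"
```

Checking the MSD first means that curated entries such as HSL7 are scored correctly even when the SGD annotation does not mention methyltransferase activity.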
Profile Searches Using PSI-BLAST and HMMer-A compilation of protein sequences in SCOP 1.61 (astral.stanford.edu/) and non-redundant SwissProt and TrEMBL databases (ftp://us.expasy.org/databases/sp_tr_nrdb/fasta/) was iteratively searched using the PSI-BLAST program (12). Each potential methyltransferase ORF sequence was used as the query with a profile inclusion E-value threshold of 0.001 and composition-based statistics turned on (20). The iterations were carried out for five rounds (or until convergence), and PSI-BLAST checkpoint files were saved for future use. The results of searches were inspected after each iteration to ensure that no compositionally biased sequences or spurious matches were included in the profile. To increase the sensitivity in the second step, candidate sequences and their corresponding checkpoint files from the first step were used as inputs for PSI-BLAST to scan the yeast proteome (genome-www.stanford.edu/Saccharomyces/). The searches were done for one iteration with the E-value set at 1e-5 to account for the smaller size of the yeast proteome compared with the database used to construct the profile. Potential methyltransferase ORF sequences were also individually compared with the Pfam 8.0 database (pfam.wustl.edu/), a collection of profile-hidden Markov models built from manually curated alignments of more than 5000 protein families (21). The searches employed the hmmpfam module of HMMer (13) (hmmer.wustl.edu/), and the E-value threshold was set at 1.

TABLE IV MEME position-specific log-odds description of Motif I
The 287-sequence training set of known methyltransferases described herein was supplied to the MEME program. MEME produces a log-odds matrix (shown below) with one line per position in the motif. Each line has 20 entries, one for each of the amino acids; the order of the amino acids is ACDEFGHIKLMNPQRSTVWY. A key of amino acid positions has been added above the matrix. In the description of Motif I shown below, the most predominant amino acid score for each position is in bold. The most predominant sequence as described in this motif is VLDVGCGTG.

RESULTS
Canonical Pattern Searching Markedly Loses Discrimination with Increasing Sensitivity and Does Not Rank Results-The 18 known Class I AdoMet-dependent yeast methyltransferases, based on literature review and database annotation at the time the search was performed, were used to build a set of consensus sequences for the various motifs as shown in Table III. The specification of Motif I and Post I is shown in Table V, part C.
The results of searching with these patterns are shown in Table V. A first search of the yeast genome with Motif I returned 62 ORFs, including 21 known methyltransferases, 23 false positives, and 18 unknowns. When searched with the Motif I-Post I set, 30 ORFs were found, including 21 known methyltransferases, 2 false positives, and 7 unknowns. In the latter analysis, the number of false positives was dramatically reduced, but the number of putative methyltransferases was also much smaller.
In addition to simple searches with the listed patterns, the sensitivity was increased by allowing errors (deviations from the prescribed pattern) to be introduced. As shown in Table V, part A, the number of results grows quickly as multiple deviations are allowed. However, the number of false positives (candidates that have a known non-methyltransferase function) also increases rapidly as deviations are allowed, suggesting that this approach is not a good one for identifying new methyltransferases. The large number of false positives comes from the fact that a best match at each position is accepted just as readily as a worst match at each position. For example, VLDVGCGPG is treated no differently than GSVTAAAVD; the latter would not be considered an acceptable Motif I based on known methyltransferase sequences.
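Searching with a fixed number of allowed errors can be sketched by counting, for each window, the positions whose residue falls outside the allowed set. The pattern below is hypothetical and is not the pattern of Table V, part C; the point of the sketch is only that every deviation costs the same, whether it hits a highly conserved position or a variable one.

```python
# Hypothetical degenerate pattern: one set of allowed residues per position.
PATTERN = [set("VIL"), set("VIL"), set("D"), set("VIL"), set("G"),
           set("CAS"), set("G"), set("TPS"), set("G")]


def deviations(window, pattern=PATTERN):
    """Count positions whose residue is not in the allowed set."""
    return sum(aa not in allowed for aa, allowed in zip(window, pattern))


def find_matches(sequence, max_errors=0, pattern=PATTERN):
    """Return (start, window) for every window with at most `max_errors`
    deviations from the pattern."""
    width = len(pattern)
    return [
        (start, sequence[start:start + width])
        for start in range(len(sequence) - width + 1)
        if deviations(sequence[start:start + width], pattern) <= max_errors
    ]
```

With this pattern, replacing the central glycine (VLDVACGTG) and replacing the residue at the variable eighth position (VLDVGCGKG) both count as exactly one error, so the two windows are accepted or rejected together.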
In an attempt to reduce false positives, a restricted search motif was created by removing the unusual amino acids from the patterns (Table V, part C). The results from searching with the restricted motif sets are shown in Table V, part B. Although the initial number of matches is lower, the amount of information returned is similar (for a given number of results, the partitioning of the results into "correct," "incorrect," and unknown is similar to that seen in Table V, part A). It is clear from these results that there is a very low limit of information that can be derived from these types of canonical searches before the signal-to-noise ratio drops well below an acceptable limit.
TABLE V
A translated database of yeast genomic and mitochondrial genes (n = 6312) was searched for exact matches to canonically defined motifs. "Search set" (sections A and B) lists the motifs used for a given search as defined in section C. The notation of [10..30] reflects that between 10 and 30 amino acids must be between the two flanking motifs. "No. of errors" refers to how many deviations from the described motif are allowed in that search run. "Correct identifications," "Incorrect identifications," and "Unknown" break down the results into those that are methyltransferases, are not methyltransferases, and are of unknown methyltransferase status, respectively. "% Correct" is (number of correct/(number of correct + number of incorrect)) × 100. All results are based on a search of MSD followed by searches of SGD if the MSD search was not productive.

Unsupervised Automatic Motif-based Searches Are Similar to Human-mediated BLAST Searches and Can Be Greatly Improved with Minor Parameter Modification-We then took a second approach to finding new methyltransferases using automated motif identification processes. To answer the question as to how good default "out-of-the-box" motif identification and searching is, the 1064 RefSeq matches for the keyword search "methyltransferase" were BLASTed against themselves to return sets in which no entry was homologous to any other entry with a significance greater than a certain expect value. The two expect values used were 10⁻²⁰ and 10⁻⁵⁰, which returned training sets of 289 and 495 sequences, respectively. The motif-searching program MEME (17) was trained with the 10⁻²⁰ set without parameter modification and used in the default mode to detect five motifs. The matrices obtained were then used by the MAST program (18) to search a yeast-translated ORF database for matches (Table VI, MEME expect 10⁻²⁰, all motifs). Here, 9 methyltransferases were returned, with 5 false positives and 17 unknowns.
During inspection of the generated motifs, it was noted that MEME was generating motifs that were not specific to all Class I AdoMet-dependent methyltransferases; for example the NPPY motif common to only the DNA N6-adenine methyltransferases and protein glutamine methyltransferases (22–24) was found. The searches were thus repeated using only the motifs returned that were similar to the already known methyltransferase motifs and again only with the automatically generated Motif I. This modification resulted in an improved performance with 13 correct methyltransferases returned along with 5 false positives and 26 unknowns (Table VI, MEME expect 10⁻²⁰, Motif I).
Use of the automated MEME-MAST tool set in its default configuration was able to create lists of putative methyltransferases that were similar to those that were obtained by hand using BLAST and manual sequence inspection (10). The advantage in using the automated tools is that they involve less effort and can therefore be rapidly applied to other genomes. Additionally, the results returned by the MEME-MAST tool set were significantly improved over the manual method by performing a first pass analysis of the results and rerunning the search after removing the non-Motif I confounding elements that were specific to only certain subclasses of methyltransferases or that may represent motifs for distinct types of enzymes such as the related NAD/NADP dehydrogenases.
It is worthwhile to note that a major difference between this method and the canonical method described above is that this method begins with a list of gene sequences, which are then ordered in terms of likelihood of each entry being a methyltransferase. Using default settings, only the top percentage of entries is returned. However, with reduced reporting stringency, the entire genome can be ordered by the likelihood of each ORF being a methyltransferase.

A Less Stringent Training Set Produces Slightly Improved Results When Combined with a Variably Distal Hand-coded Post I Motif: Sensitized Methyltransferase-scoring Matrices (SM²)-The MAST program returns a score on every hit that represents how well the subsequence of the ORF (motif) fits the MEME-derived scoring matrix. However, this score is not directly a probability of the gene product of a sequence being a methyltransferase. To compare sets of results returned from MAST, which vary in both order and motif match significance scores, we have arbitrarily chosen a cutoff point at the fifth known incorrect identification. Comparing the results of the 10⁻²⁰ and 10⁻⁵⁰ training sets yields very similar results (Table VI; MEME expect 10⁻²⁰, Motif I and MEME expect 10⁻⁵⁰, Motif I) with the 10⁻²⁰ results being slightly better (one additional positive match and two additional candidates); this is the more stringent of the two sets.
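The cutoff at the fifth known incorrect identification, and the "% Correct" figure computed from the truncated list, can be sketched as below. The label strings ("hit", "miss", "putative") and the function names are assumptions introduced for the example.

```python
def truncate_at_false_positive(ranked_labels, limit=5):
    """Keep ranked results up to and including the `limit`-th 'miss'."""
    kept, misses = [], 0
    for label in ranked_labels:
        kept.append(label)
        if label == "miss":
            misses += 1
            if misses == limit:
                break
    return kept


def summarize(labels):
    """Count each category and compute % Correct as
    100 * correct / (correct + incorrect); None if nothing is known."""
    correct = labels.count("hit")
    incorrect = labels.count("miss")
    unknown = labels.count("putative")
    known = correct + incorrect
    percent = 100.0 * correct / known if known else None
    return {"correct": correct, "incorrect": incorrect,
            "unknown": unknown, "percent_correct": percent}
```

For the Motif I search reported above (13 correct, 5 incorrect, 26 unknown), this formula gives 13/18 × 100, or about 72%.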
TABLE VI
The results of the 1999 yeast methyltransferase search (10) are compared to a subset of the weighted position-specific motif ("MEME") searches. MAST results are listed (in order of motif match significance) until and including the fifth known false positive. Results are broken down into positives ("Correct ID's"), false positives ("Incorrect ID's"), and putatives ("Unknown"). % Correct is (Correct/(Correct + Incorrect)) × 100. Result order is shown graphically with the X's being positives, Y's being negatives, 0's being unknowns, and u's being SGD-identified but without the keyword "methyltransferase."

Because the Post I motif is highly conserved, albeit degenerate, a set of hand-coded matrices describing the Post I motif was appended to the description of Motif I in an attempt to improve the search sensitivity. The set varied only in the number of score-neutral elements that separated the Motif I and Post I motifs. Two spacings considered were 5–25 and 10–30. The results are shown in Table VI (MEME, expect 10⁻²⁰, Motif ...). The results for all four sets are quite similar to one another and slightly improved over the non-Post I searches (14–16 correct identifications and 27–28 candidates). Although the ordering of the ORFs was different, the significance of the results was similar based on the number of correct identifications and number of candidates returned for the 5–25 and 10–30 spacing. The 10⁻⁵⁰ training set returned slightly better results than the 10⁻²⁰ training set with two additional correct identifications and one additional candidate ORF. We describe this optimized scoring system as sensitized matrices for scoring methyltransferases (SM²). The results from the best training set are expanded in Table VII, which represents our new best list of putative methyltransferases in yeast. Descriptions of all the currently known S. cerevisiae methyltransferases are shown in Table VIII.
Table VII were probed individually using PSI-BLAST and HMMer, two powerful profile-based search tools that have been used in recent years with great success to detect remote sequence homology. Each sequence was first searched with PSI-BLAST against the non-redundant protein database in an attempt to provide support for its inclusion into the methyltransferase superfamily. Those candidates that matched known methyltransferases at E-value <0.001 before the sequence in question was included in the profile were considered true positives. Here, all true positives matched numerous methyltransferases, sometimes even in the first iteration. The subsequent iterations were important in generating checkpoint files, which correspond to position-specific scoring matrices. The checkpoint files were then used to increase the sensitivity of search against the yeast proteome. We annotated as true positives all sequences that identified a known methyltransferase in the yeast proteome at an E-value <1e-5, or those that were recovered themselves by another query using the same E-value threshold.
As can be seen in Table VII, most candidates with percent correct values of 80 or greater pass as true positives according to PSI-BLAST criteria. Therefore, it appears that a percent correct value of 80 can be used in most cases as a safe threshold for automatic functional assignments. However, this analysis also showed that two candidates with lower percent correct values (YDR083w and YLR285w) are likely to be true positives, cautioning against a strict threshold. Finally, three ORFs not originally included in Table VII (YDR120c, YNL022c, and YBR141c) were identified as potential methyltransferases because other queries matched them at an E-value <1e-5. These proteins were subsequently used as queries with the non-redundant protein database and fulfilled the criteria outlined above for inclusion in the methyltransferase superfamily.

TABLE VII-continued
a This table represents our current best list of putative methyltransferases in order of SM² significance. ORFs are listed in two columns; the "Putative ORF" column is the list of ORF names (and, if available, common names) of unknown function, and the "Known ORF" column shows ORF names of known function. Within the known ORF column, green entries in capital letters on the left are experimentally confirmed methyltransferases, and red entries in italics on the right are proteins with identified non-methyltransferase function(s). "Cumulative Percent Correct" is based on the correct and incorrect matches in the known ORF column. All ORFs identified in 1999 (shown in bold type with the previous F-designation in parentheses) (10) are identified here, except for those that fell below the significance cutoff for the table: yjr072c (F19), ylr137w (F21), and the known false positive yal061w (F14; FUN50).
b A plus is recorded if a PSI-BLAST search of the putative methyltransferase entry against the non-redundant protein database as described under "Experimental Procedures" recovers any AdoMet-dependent methyltransferase with an E-value <0.001.
c A plus is recorded if a search of the SGD with the profiles generated in the PSI-BLAST search matches a known methyltransferase with an E-value <1e-5.
d A plus is recorded if a search of the Pfam 8.0 database recovers any methyltransferase with an E-value <0.1. A plus/minus is recorded if 0.1 < E-value <1.0.
e spe3 and spe4 encode spermidine and spermine synthases, respectively. The encoded amino acid sequences are very similar to those of plant putrescine N-methyltransferases, but no methyltransferase activity of the yeast proteins has been shown.
f Indirect evidence has been presented for the function of the ylr285w gene product as a nicotinamide N-methyltransferase (25).
g These ORFs, although below the inclusion threshold of the rest of the table entries, are included because they appear with high significance in the PSI-BLAST analysis.

TABLE VIII AdoMet-dependent methyltransferases in S. cerevisiae
This table lists all of the currently identified S. cerevisiae methyltransferases. Genes marked with "*" are genes that are not listed in Table VII; HSL7 is found in the 38th cumulative percentile, and TRM1 is not found through the 30th cumulative percentile (this is expected considering its very unusual Motif I, "ILEALSATG," Table III). The entry marked with "1" is for MTF1; although there is no enzymatic evidence for this entry being a methyltransferase, the crystal structure is very similar to other known AdoMet-dependent methyltransferase structures (14).
Sequence comparisons with HMMer tools and the Pfam 8.0 database provided further support for slightly more than half of PSI-BLAST true positives but were ultimately less informative than the SM² method described here despite the fact that Pfam 8.0 contains HMMs for more than 30 methyltransferase families, including some families that are presently annotated as uncharacterized. Although it is formally possible that some true positives from the SM² and PSI-BLAST searches represent false predictions and as such were not confirmed by HMMer, it is clear that the coverage of the methyltransferase superfamily in Pfam 8.0 is far from reaching saturation.
SM² Methodologies Are Easily Applied to Other Genomes and Show Results Similar to Those Seen in S. cerevisiae-To generalize these results, translated ORFs from six additional recently sequenced genomes (human, mouse, Drosophila, Caenorhabditis elegans, Arabidopsis, and Escherichia coli) were ordered based on likelihood of being a methyltransferase using the MAST tool in the SM² configuration with the "expect 10⁻⁵⁰, Motif I-[10-30]-Post I" criterion described in Table VI. Lists of putative methyltransferases generated by this method are in the on-line supplement to this paper.
The methods developed here appear to have similar success in finding methyltransferases in these other genomes. The efficacy of ordering the genome in terms of likelihood of being a methyltransferase is shown graphically in Figs. 3 and 4. After the genome is ordered in this fashion, one can look at the genes of known function and compute an overall cumulative percent methyltransferases value that is similar to the scoring methodology used earlier in this article and shown graphically in Fig. 3. A possibly more telling view is to look, at each point in the ordered genome, at the local percent methyltransferases, that is, the percent methyltransferases in a small window surrounding that position. Fig. 4 shows this graphically using a window size of 0.01% of the genome size. As can be seen, the percent likelihood of finding a methyltransferase rapidly falls off after the top scores in 2-3% of the genome are analyzed.
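The two views described above, cumulative and local percent methyltransferases along the ranked genome, can be sketched as follows. The input encoding (True for an annotated methyltransferase, False for another known function, None for unknown function) and the half_window parameter are assumptions made for the example; the exact window handling used for Fig. 4 is not specified here.

```python
def cumulative_percent(flags):
    """Running percentage of methyltransferases among genes of known
    function, in rank order. `flags` holds True, False, or None (unknown)."""
    out, hits, known = [], 0, 0
    for flag in flags:
        if flag is not None:
            known += 1
            hits += bool(flag)
        out.append(100.0 * hits / known if known else None)
    return out


def local_percent(flags, half_window):
    """Percentage of methyltransferases among genes of known function in a
    window of `half_window` positions on either side of each rank."""
    out = []
    for i in range(len(flags)):
        window = [f for f in flags[max(0, i - half_window):i + half_window + 1]
                  if f is not None]
        out.append(100.0 * sum(window) / len(window) if window else None)
    return out
```

The cumulative curve changes slowly once many genes have been counted, whereas the local curve shows directly where in the ranking methyltransferases stop appearing.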
The main difference between the scoring used here and the scoring used earlier in this paper with the yeast genome is that the MSD was used there to confirm the assignment of function. Here, the shortcut of looking solely at the provided gene annotation is used.
The final calculation in this section is the prediction of the total number of motif-bearing methyltransferases in a given genome. This calculation was done by taking the data from Fig. 3 ... These data are plotted in Fig. 5. It is predicted from the graph that all the genomes assayed have a similar percentage (0.6-1.6%) of genes that are of the Class I motif form of methyltransferases.

DISCUSSION
The purpose of this study was to identify novel methyltransferases using the primary sequence data available from genome sequencing projects. We have developed semi-automated methods that order the encoded amino acid sequences of the open reading frames of a genome in terms of their likelihood of being Class I methyltransferases (seven β-strand family). Using the criteria of getting as many of the known methyltransferases in our list as possible while, at the same time, keeping the number of known false positives to a minimum, we have identified candidate methyltransferases in yeast and other organisms. This system is automated enough to be easily applicable to new genomes as they are sequenced. It is also easy to recompute the training set as additional validated methyltransferases become known, allowing for the generation of updated candidate lists.
Including an ORF in a list of putative methyltransferases is obviously only a first step toward biochemically characterizing a new AdoMet-dependent methyltransferase. Even if we had a perfect method that identified all the AdoMet-dependent genes in a genome, we would still need to determine what their methyl-accepting substrates were to define their biological function. As enzymatic activity specification is the slow step in this process, it is sufficient at this point to have a partial list with even marginal confidence that each entry in the list is a methyltransferase. Having a list of 100 ORFs where each entry is 50% likely to be a methyltransferase is much better than having an entire genome ORF list where each entry is only 1-2% likely to be a methyltransferase. As time progresses and these early lists are exhausted, better techniques will hopefully evolve for protein identification that will allow establishing a complete catalog of the methyltransferase complement of an organism.
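The value of an enriched list can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that every ORF in a list has the same independent probability of being a methyltransferase, so that the expected number of ORFs assayed per confirmed enzyme is the reciprocal of that probability. The function names are hypothetical.

```python
def expected_hits(n_orfs: int, p: float) -> float:
    """Expected number of methyltransferases in a list of n_orfs entries."""
    return n_orfs * p


def expected_assays_per_hit(p: float) -> float:
    """Expected number of ORFs assayed per methyltransferase found.

    Assumes each ORF is an independent trial with success probability p.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return 1.0 / p


# A list of 100 ORFs at 50% likelihood versus unranked ORFs at 1-2%.
print(expected_hits(100, 0.5))
for p in (0.5, 0.02, 0.01):
    print(p, expected_assays_per_hit(p))
```

Under these assumptions, a 50% list requires about two assays per enzyme found, against roughly 50 to 100 assays for an unranked genome at 1-2%.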
In the end, only time will tell if we have, in fact, generated here "good" lists of candidate methyltransferases. We can say at this point, however, that our methodology does appear to be superior to that presently employed in a database such as Pfam (21). For example, of the 24 experimentally verified yeast methyltransferases described in Table VII, eight are not annotated as methyltransferases in version 8.0 of the Pfam database. Additionally, we note that the SM² methodology used here has identified six new candidates in the "100%" region and 33 new candidates in the "100-42%" region of Table VII that were not detected in the 1999 analysis of yeast proteins (10). We have been pleased to see a steady progression of our best yeast candidates into the class of experimentally supported methyltransferases. For example, just in the time between the completion of this manuscript and its revision, two of our high scoring candidates were identified as specific methyltransferases (15,16). Further evidence of this progress is that in 1999 only seven Class I methyltransferases had been described in yeast (10); the present number is 26 (Table VIII)! We note that the methods described here are only designed to reveal the Class I seven β-strand family of methyltransferases. Further work will be needed to analyze the Class II (SET) enzymes and the Class III (membrane-bound) enzymes. From the compilation in Table VIII of the 38 presently identified yeast methyltransferases, 26, or 68%, are of the Class I type.
Based on our results, it appears that we may have reached the limit of what is possible with the SM² methodology presented. Doubling the training set had minimal effect on the results. When we included information from the motif Post I, we did increase the number of correct positive identifications but only marginally improved the number of candidate methyltransferases returned above the five-false-positive threshold used in this study. It is clear that SM² may weakly score some methyltransferases (false negatives) because the motifs are divergent or because the spacing between them is different from the canonical spacing.
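One way to read the false-positive threshold mentioned above is as a cutoff on the ranked list: walk down from the best score and stop when the fifth known false positive is reached. The sketch below illustrates that reading only. The label names ("true", "false", "unknown"), the function name, and the choice to cut immediately before the threshold false positive are assumptions, not details taken from this study.

```python
from typing import Dict, List


def counts_above_threshold(labels: List[str], max_false: int = 5) -> Dict[str, int]:
    """Tally ranked ORFs that lie above the max_false-th known false positive.

    labels: one entry per ORF, best score first, each one of
        "true"    (known methyltransferase),
        "false"   (known non-methyltransferase),
        "unknown" (unannotated ORF, i.e. a candidate).
    The list is cut immediately before the max_false-th "false" entry.
    """
    counts = {"true": 0, "false": 0, "unknown": 0}
    for label in labels:
        if label not in counts:
            raise ValueError(f"unexpected label: {label!r}")
        if label == "false" and counts["false"] + 1 >= max_false:
            break  # this entry is the threshold false positive
        counts[label] += 1
    return counts


# Toy ranking with a threshold of two false positives.
toy = ["true", "unknown", "false", "true", "unknown", "false", "unknown"]
print(counts_above_threshold(toy, max_false=2))
```

Comparing the "unknown" count between two scoring configurations gives the kind of comparison described above, for example with and without the Post I motif.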
So how can these results be improved further? The next logical step would be the incorporation of countertraining sets using the false positive results to create a feature set that could be recognized and used to downgrade ORFs that had similar features. For example, many of the false positives either fit into a class of enzymes that could be identified (e.g. dehydrogenases or nucleotide-binding proteins) or were highly homologous and could be eliminated on that basis (e.g. the HXT proteins). Another avenue we are currently exploring is the use of motif-based profile HMMs that would automate functional assignments and provide more stringent statistical criteria for distinguishing true versus false positives.³ Despite these limitations, we now have a list of unidentified ORFs for which we are highly confident that a majority of the members will ultimately be characterized as methyltransferases.
* This work was supported by National Institutes of Health Grants GM26020 and AG18000 (to S. C.). The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.