|
|
||||||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Molecular & Cellular Proteomics 5:1520-1532, 2006.
© 2006 by The American Society for Biochemistry and Molecular Biology, Inc.
,
,¶
,||
From the Departments of
Chemistry and Biochemistry and ¶ Computer Science, University of Colorado, Boulder, Colorado 80309
| ABSTRACT |
|---|
|
|
|---|
The restriction on sequence identity arises because methods that detect similar patterns against a random background model cannot determine whether sequences are similar only because they are conserved from sequences that have had insufficient time to evolve in different directions. This effect is especially important when using shared motifs to define superfamilies (3, 4). Phylogenetic analyses of large numbers of sequences can also be extraordinarily time-consuming even with efficient algorithms (5). Choosing a smaller but representative set of sequences often provides the same phylogenetic tree far more efficiently (6, 7).
Because picking a divergent set manually is laborious, little attention has been paid to the reproducibility of programs that rely on divergent sets. How much does the taxon sampling or the precise choice of sequences from the same protein or RNA families affect the apparent functional motifs or relationships? In this study, we demonstrate one use of DivergentSet by comparing the motifs found by the popular motif-finding program MEME1 (1) using different sets of divergent sequences from the same initial alignment.
Divergent sets of sequences have typically been chosen manually. In this procedure, the distance between each pair of sequences is calculated. Any sequence that is too similar to a sequence already in the set is discarded. Taxonomic information can also be used, e.g. by taking one sequence from each genus. However, both methods based on taxonomic annotations and existing methods based on sequence similarity have substantial drawbacks. Thus, an automated method to choose sets of sequences based on sequence similarity is highly desirable.
To our knowledge, no fully automated system for choosing a divergent set based on sequence distances exists. Several programs, including BLASTCLUST, part of the BLAST package (8); nrdb90.pl, a script for removing nearly identical protein sequences from a database (9); and DOTUR (10) and FastGroup (11), two programs for choosing "operational taxonomic units" (OTUs) by sequence similarity, can assist to some extent in choosing divergent sets by identifying groups of related sequences. However, DOTUR and FastGroup were designed for community analysis and do not facilitate the task of randomly choosing representative sequences from each OTU. This task must still be completed by a time-consuming manual process that can take an experienced researcher many days in part because it is often desirable to include sequences from particular, well studied organisms. The ability to include sequences manually is especially desirable if representative crystal structures are available in the Protein Data Bank (12).
Choosing sequences based on taxonomy is especially problematic because a "genus" or a "family" represents a very different level of sequence divergence in different taxa. For example, genera of mammals and birds contain notoriously little sequence divergence, whereas bacteria are especially diverse. The mitochondrial small subunit rRNA in humans and chimpanzees is over 99% identical at the sequence level, but these species are in separate genera; in contrast, small subunit rRNA from species within the bacterial genus Mycoplasma can differ by more than 10% (representative sequences from GenBankTM, data not shown).
Ideally an automated system for choosing a divergent set of sequences should support manual adjustment of the divergent set. For example, disease-causing mutants might cause problems for motif analysis because the deleterious changes are likely to be at the active site. We identified seven essential design requirements for the system. The system should:
DivergentSet (available at bmf.colorado.edu/divergentset) is a web-based tool that meets these requirements. DivergentSet combines an easy to use user interface with a powerful Linux cluster back-end. To optimize the performance of the system, we compared several exact and approximate methods of calculating pairwise distances between sequences. This was critical because algorithms for clustering sequences spend most of their time performing pairwise comparisons. We also tested several methods of gathering additional homologs of the sequences in the initial set.
The overall work flow that DivergentSet supports is shown in Fig. 1. This work flow starts with a single identifier or sequence and produces a divergent set of sequences that can be used for downstream analyses such as motif finding. The user begins with a single sequence, identifier, or set of sequences. Additional sequences are recovered by BLAST (8), PSI-BLAST (13), or Shotgun (14). A tree relating the sequences is then used to choose a candidate divergent set. To ensure that the set is valid, all pairs of sequences in the candidate set are compared, and non-divergent sequences are discarded. The resulting divergent set can then be recovered either as database identifiers or as sequences and used for many different tasks including motif finding, phylogeny, and covariation analysis. Alternatively the set can be used for another round of refinement, or another divergent set can be chosen from the same initial set of sequences. For each step, we tried several different approaches. Here we compare the performance of these different approaches and describe the best practice methods available in the DivergentSet interface.
|
| EXPERIMENTAL PROCEDURES |
|---|
|
|
|---|
Dice coefficients (16) are used to compare strings without aligning them. The comparisons are instead performed by examining the fraction of shorter strings that the two strings have in common. The Dice coefficient between two strings A and B is defined as 2|A
B|/|A
B (Fig. 2). a and b are the sets of unique substrings of length k contained in the strings A and B, respectively. Dice coefficients provide a very fast method of comparing strings because they do not require that the sequences be aligned. The algorithm thus takes time proportional to sum of the lengths of the two strings rather than the product of the lengths.
|
We tested the approximate methods (Dice coefficients and MUSCLE) by comparing the tree distances from these techniques with the Needleman-Wunsch distances. Tree distances were the leaf-to-leaf distances between sequences using the UPGMA tree (18) calculated from each kind of distance matrix or, for MUSCLE, either the UPGMA tree or the neighbor-joining tree (19).
Gathering Additional Sequences
We tested two methods for gathering additional sequences. First, we performed one round of the Shotgun algorithm (14), which consolidates the results of many BLAST searches by keeping all sequences that appear in a minimum number of BLAST results from sequences in the input set. We used a BLAST E-value threshold of 1010 and a minimum Shotgun score of 1 or 3. These parameters can be easily modified in the user interface. Second, we performed PSI-BLAST (13), a method that increases the sensitivity of BLAST searches by iteratively building up a profile of conserved regions on each sequence for three iterations using an inclusion threshold of 1020. We tested these methods on alignments of real sequences and on simulated data consisting of several disjoint protein families that evolved separately.
Using a Phylogenetic Tree to Reduce the Number of Comparisons
To minimize the number of comparisons, we tried several methods that examine the sequences in an order defined by a phylogenetic tree. The tree allows us to eliminate some sequences from the analysis early, saving the time that would otherwise be required to compare them with all other sequences. Drawing the tree with the root (the ancestor) at the left and the leaves (the modern sequences) at the right, the children of each node are placed in an arbitrary order from top to bottom. The tree order can also conveniently produce different, randomly sampled divergent sets.
The first method assumes that the phylogenetic tree provides a complete picture of sequence evolution and that the rates of evolution are constant in every lineage. In this "trust tree" method, sequences are compared in tree order (Fig. 3). Starting with the first sequence i, we compare each successive sequence j until a sequence more divergent than the percent identity threshold is encountered. We measure divergence in terms of "percent identity," which is the number of bases or residues that are the same in the optimal alignment of both sequences divided by the length of the shorter sequence. All sequences that appear later in the tree order (j + 1, j + 2, ..., j + n) are assumed to be at least as divergent as i is from j and are not examined.
|
The second method, which we call "good pairs," compares each sequence only with the sequences already chosen for the divergent set (rather than with all sequences). This method is similar to the trust tree method except that the comparisons with sequence i do not stop when the first divergent sequence j is encountered but instead continue until all sequences below i that have not already been deleted from the tree have been examined. This method produces no false positives because every sequence that remains in the divergent set has explicitly been compared with each other sequence in the divergent set. It provides large time savings when few sequences are in the divergent set but more modest savings when the percent identity threshold is low, and therefore most sequences must be compared with most other sequences. One disadvantage of this method is that it is difficult to parallelize because the comparisons performed during each pass through the tree determine which other comparisons need to be performed and thus cannot be known at the start of the procedure. This method is theoretically the most efficient of the methods we tested when running on a single CPU without parallelization.
Finally we used a method that we call "candidate pairs." In this method, we use the trust tree approach to pick a candidate divergent set and then perform all pairwise comparisons within this reduced set (not all pairs in this candidate set will be divergent because the trust tree method is approximate). This method also produces no false positives and reduces the number of comparisons considerably. It also has the advantage that it was easily parallelized (as described under "Implementation" below). This natural parallelism arises because all pairwise comparisons in the reduced set must be performed, and thus blocks of comparisons can be distributed among multiple CPUs without synchronization. This method is theoretically the most efficient of the methods we tested when running in parallel on a cluster if considering elapsed time for the user rather than total CPU time.
We chose to implement these methods using a phylogenetic tree rather than a distance matrix for two reasons. First, the tree clusters together the sequences that are most closely related to one another. Thus, when comparing each sequence with the sequences that are already in the divergent set, similarities that would cause the new sequence to be rejected are encountered earlier in the process, reducing the total number of comparisons. Second, we found during user testing that displaying the divergent set in the context of a phylogenetic tree makes the data much easier to interpret than displaying the full distance matrix especially when the number of sequences is large (a tree with 1000 taxa is much easier for most researchers to interpret than the equivalent distance matrix with 1,000,000 elements). In practice, constructing the tree takes a relatively small fraction of the total run time (about 1%), so the benefits of using it outweigh the cost of its construction.
Maximal Versus Random Divergent Sets
One important issue when choosing divergent sets is whether the application requires a random divergent set or a maximal divergent set. A random divergent set ensures that the sample of sequences included in the divergent set is unbiased, which can be important for phylogenetic applications. A maximal divergent set contains as many sequences from the original set as possible, which can be important when the amount of data is limiting.
We produce both random and maximal divergent sets using a tree-based approach. To generate a random divergent set, we randomize the order of the branches at each level of the tree using the Mersenne Twister random number generator (20) as implemented in the Python standard library. Then for each sequence starting with the first sequence (i.e. the sequence found by following the first child in each successive node from the root), we compare the current sequence with each other sequence that is below it in the tree, ignoring already deleted sequences. Any sequence that is too similar to the current sequence is deleted. A new, random set can be produced by randomizing the branch order. The branch order determines which of each set of topologically equivalent sequences is encountered first and therefore which sequence will be kept when there is an equal choice between groups of sequences.
To generate maximal sets of sequences, we build a fully connected graph (each sequence is connected to each other sequence). In this graph, the vertices represent the sequences, and the edges represent the pairwise percent identity scores. We iteratively remove the vertices with greatest number of edges that are above the percent identity threshold until all edges are below the percent identity threshold. This process yields a maximal set of divergent sequences.
Keeping Specific Sequences in the Divergent Set
It is often important to keep specific sequences in the divergent set. Sequences may be preferred because their crystal structures are known or because the organism in which they are found is genetically tractable. Because the order in which each sequence appears in the tree determines whether it will be kept in preference to other sequences in the same subtree, we can conveniently "promote" user-specified sequences to be preferentially kept by moving each of these sequences to the first position in its parents lists of children (Fig. 3, topmost in the tree shown). We mark sequences as "preferred" by checking whether they are in a user-supplied list of identifiers to keep or whether they match specified criteria for preferential inclusion. These criteria include the presence of associated crystal structures (Protein Data Bank records). Randomization can be achieved by permuting the preferred and non-preferred children of each node separately such that all the preferred children of a node always appear before all the non-preferred children. Inclusion in the preferred list is propagated from the leaves to the root such that any node that has a preferred descendant is itself preferred. This method allows random sampling from the distribution of divergent sets that include as many of the desired sequences as possible (if a pair of preferred sequences are similar to one another, only one sequence from that pair will be kept in the divergent set).
Data Sources
To test the performance of the different algorithms, we used two test data sets: one downloaded from KEGG (21) on August 25, 2005 and one simulated. We chose these sets to represent a variety of possible applications.
The KEGG set, "LysRS," consists of the 497 non-redundant sequences that match a text search in KEGG for the core translational enzyme "lysine tRNA synthetase" consisting of both class I and class II LysRS sequences (the two classes are not homologous to one another and represent two independent origins of the same biochemical activity (22)).
The simulated set, "sim," consists of 1000 sequences families, each with 150 sequences evolved using a random breakpoint phylogeny, also known as the equiprobable model (23). Each sequence family started from a single ancestral sequence with equal amino acid frequencies. Sister taxa could evolve a total sequence divergence in the range 150%. This set was designed to test false positive rates and false negative rates for inclusion in the divergent sets. False positives are sequences that are included in the set inappropriately because they are too similar to one of the other sequences in the set; false negatives are sequences that were inappropriately excluded from the divergent set. Measuring false positive and false negative rates requires simulated data because the true protein families for the biological sequences in GenBankTM are not precisely known. This uncertainty makes it difficult to tell whether BLAST hits to "hypothetical proteins" are finding new family members or are hitting arbitrary sequences by chance. This sequence set is
1% of the current size of GenBankTM and models the situation of finding genes in a discrete taxonomic group (e.g. the vertebrates).
Implementation
Most of the source code described here was written in Python 2.4 (Python Software Foundation) and tested on MacOSX and on a Linux Beowulf cluster (Beowulf Project). The exceptions are the Needleman-Wunsch algorithm, which we implemented in C for performance, and the BLAST, PSI-BLAST, MUSCLE, and MEME programs for which we used the published implementations. The web interface uses Apache (Apache Software Foundation) and mod_python. Parallelization is achieved via PBS/TORQUE (Cluster Resources Inc.) running on our Beowulf cluster (bmf.colorado.edu/divergentset). The following is a high level overview of the algorithm used by DivergentSet to pick divergent sets of sequences.
This task-farming approach allowed us to obtain nearly linear speed-up for the BLAST and pairwise comparison steps, greatly reducing the overall analysis time.
| RESULTS |
|---|
|
|
|---|
Speed of Distance Calculation Methods
Fig. 4 shows the performance characteristics of the different distance methods. All these comparisons use the distances computed to build UPGMA trees. We initially prototyped the Dice coefficients and Needleman-Wunsch distance comparisons in Python. Although the performance of the Python implementation of Dice coefficients was acceptable, we were able to increase the performance of the Needleman-Wunsch algorithm by a factor of almost 10,000 by recoding it in C using arrays of small integers rather than character strings. Results shown are for the C implementation. Using MUSCLE and using Dice coefficients were both much faster than performing all-pairs Needleman-Wunsch alignments even using our optimized C implementation. MUSCLE uses a heuristic method to cut down the number of comparisons.
|
Speed and Accuracy of Set Picking Methods
Fig. 5 shows time and accuracy characteristics of the different set picking methods we used. All points in this figure are averages of 20 independent runs in which a different random divergent set was chosen. The total time (Fig. 5a) is, as expected, constant for all pairs using either Dice or Needleman-Wunsch because all sequences must be compared. The good pairs method can be almost 2 orders of magnitude faster than all pairs when few sequences are more divergent than the threshold but quickly slows down to within a factor of 4 of all pairs when the threshold is high and more sequences are in the divergent set and must be compared with one another. The Dice coefficient method was about 2 orders of magnitude faster than even our optimized Needleman-Wunsch implementation.
|
The number of sequences found (Fig. 5c, note log scale) is similar using each method except that Dice sometimes returns either too few (when iterated) or too many (when all-pairs comparisons are performed) sequences depending on the threshold and on whether all pairs are compared or the tree is used. This poor performance of the Dice method is due to the nonlinear relationship between the Dice coefficients and the percent identity calculated from pairwise alignments. Fig. 6 shows the relationship between Dice coefficients with word size of 2 or 3 with Needleman-Wunsch distances. Although it might seem that a smooth function could be fit relating the Dice coefficients, different protein families give very different relationships (data not shown). The large number of Needleman-Wunsch comparisons required to fit the function accurately for each protein family, in conjunction with the high variance once the distance exceeds about 0.5, eliminates any speed advantage that might otherwise be obtained by using Dice coefficients.
|
The false positive rate (Fig. 5e) shows that using Dice coefficients, although fast, is not an accurate method, whereas using the good pairs method does not ever produce false positives. The false positive rate (Fig. 5f) shows that neither "all pairs" nor good pairs ever produces false positives, a highly desirable property. Even with five rounds of iteration to randomize the tree order, all of the methods that trust the tree and all pairs using Dice coefficients produce alarmingly high false positive rates (often exceeding 50%). Overall although Dice coefficients and trusting the tree can be extremely fast, the good pairs is accurate and is substantially faster than calculating the Needleman-Wunsch alignment scores for all pairs of sequences.
Speed and Accuracy of Methods for Finding Additional Sequences
We tested the accuracy of the methods for finding additional sequences (PSI-BLAST and Shotgun) on both the LysRS and the simulated data set. For the simulated set (Fig. 7, ad), we were able to measure false negative and false discovery rates because we knew the membership of each sequence in each family in advance. For the real sequences (Fig. 7, e and f), we could only measure the number of known LysRS sequences recovered and missed. For the simulated data, at a given BLAST threshold (data shown are for an E-value of 103), PSI-BLAST was able to recover a much larger fraction of the sequences (Fig. 7, a and b), although it typically took 56 times longer to run (Fig. 7d). For example, PSI-BLAST took 242 s to run on a set of 30 seed sequences as opposed to 65 s for BLAST.
|
The poor performance of Shotgun on these data is probably due to the stringent E-value threshold of 103. The strength of Shotgun is that it can find matches by combining multiple weak hits (14). With the E-value we used, the hits were individually significant, and relatively few sequences matched more than one of the initial query sequences. Running Shotgun with lower E-values, e.g. 1, can result in much better sensitivity with little loss of specificity (data not shown). We kept the E-value the same for all analyses in the results shown so that the techniques could be compared directly when used with exactly the same BLAST parameter settings.
Effect of Set Choice on Motifs Identified by MEME
Finally we tested the effect of the set choice on the stability of the motifs recovered by MEME (1). For this analysis, we used the default MEME settings except that the maximum number of motifs was raised to 20, and a significance threshold was set at 0.01 to minimize noise from non-significant motifs (MEME will otherwise return as many motifs as are needed to reach the number specified).
From the LysRS set, we chose random divergent sets of sequences at divergence thresholds ranging from 30 to 90% identity (Fig. 8). We found that the motifs that MEME identified from the same set of initial sequences varied considerably depending on which divergent set of sequences we used. The average length of the motifs recovered was 31 amino acids.
|
For very similar motifs with no more than two mismatches, the average recovery of motifs (as defined by the fraction of pairs of runs in which the same motif was recovered in both runs) never exceeded 50%. For exact matches, this recovery fell below 10% for exact matches for intermediate divergences (5060% sequence identity). The recovery was substantially higher when the criterion for similarity was relaxed, averaging 5060% when five mismatches or gaps were allowed and reaching 90100% when 10 mismatches or gaps were allowed. However, at these levels, the motifs differ substantially, and the core motifs that match between the samples would typically not have been significant by themselves.
Interestingly the approximate match using the reduced alphabet made very little difference to the results. In the reduced alphabet, similar amino acids (e.g. isoleucine, valine, and leucine) are coded as the same character, resulting in five groups of amino acids. On average, using the reduced alphabet led to less than a 10% increase in recovery (Fig. 8, compare a and b), suggesting that the differences are not primarily due to interchange between similar amino acids. Fig. 9 gives examples of some of the motif pairs with different numbers of changes on the normal and reduced alphabets.
|
Additional Analyses
In addition to the analyses described here and above under "Experimental Procedures," we tried several variations on the distance metric and the tree-building algorithm. We tested Dice coefficients based on substrings of lengths 2, 3, and 4 and the average of the Dice coefficients for substrings of lengths 2 and 3 (results similar to those shown as "Dice", which was calculated with words of length 3). We also tried an alternative tree-based comparison algorithm that compared each of the sequences to the sequences in the "good" set rather than deleting "bad" sequences from the tree (results very similar to those shown as good pairs). Finally we tried alternative methods of building the tree with results very similar to those shown in all cases; we tried UPGMA clustering of the Dice coefficients and pairwise Needleman-Wunsch alignment scores, but our implementations were typically slower than the MUSCLE output, and no improvement in accuracy was obtained (data not shown).
| DISCUSSION |
|---|
|
|
|---|
Dice coefficients, although very fast to calculate, are unsuitable for choosing divergent sets. Dice coefficients did not map accurately onto the overall percent identity between aligned pairs of sequences. Consequently a given threshold in terms of Dice coefficients both excluded pairs of sequences less similar than the threshold and included sequences more similar than the threshold. In particular, sequences that are very similar overall often differ radically in their Dice coefficients for k > 2, leading to the inclusion of nearly identical sequences in the final result. Because the main point of choosing divergent sets is to eliminate such sequences, we concluded that they were unsuitable for this application. Although it would be possible to fit a function to relate Dice coefficients and pairwise distances for a random sample of pairs of sequences from a given input set, the statistical error in the relationship is likely to be so large that the false positive and false negative rates remain unacceptably high.
The ability to find additional sequences is greatly improved when all the sequences come from the same sequence family. Both PSI-BLAST and Shotgun performed poorly in cases where many unrelated sequences are used as seeds. DivergentSet itself could be used to assist in such analyses by screening a set of input sequences for those that are significantly similar, rather than those that are significantly different, to identify the families of sequences that are similar enough to be used as seeds for subsequent searches. When the number of input sequences was large, Shotgun found as many sequences as PSI-BLAST in a much shorter time. The Shotgun results can also be improved by lowering the E-value threshold, which imposes essentially no time penalty but may affect specificity. We kept the E-value threshold across methods in this analysis to permit a direct comparison using the same BLAST parameters, but we should note that the strength of Shotgun is in using very low E-values, e.g. 1, to find very distant homologies using a combination of hits that are individually weaker than threshold (14).
The particular divergent set chosen can greatly influence the motifs found. Under some conditions, less than 10% of the motifs were the same between two different randomly chosen divergent sets. These results underscore the importance of using many different divergent sets from the same sequences for analyses such as motif finding. Such a striking degree of inconsistency calls into question the biological significance of motifs that are identified from a single divergent set of proteins, although it also suggests that the false positive rate of motif-finding algorithms may be greatly reduced by using multiple divergent sets chosen from the same larger set of sequences. This need to repeat the analysis using different divergent sets underscores the need for an efficient method for set choice: if each divergent set requires days of manual effort to put together, confirming motif analyses by using many different sets is infeasible.
We achieved a dramatic speed improvement both over choosing the divergent set manually and over the simple algorithm of computing the full distance matrix and choosing sequences below the percent identity threshold. The process of choosing divergent sets by hand has traditionally been extremely tedious, taking days for an experienced researcher, which, along with the inability to randomize the sets, has discouraged repetition of analyses to check for stability. Using the full distance matrix and Needleman-Wunsch alignments automated this process to take a few hours of computer time (on a set of
1000 sequences) without manual intervention. By taking advantage of a phylogenetic tree built from the data and using less time-consuming comparison methods, we were able to further reduce this time to a few minutes. By parallelizing the slow steps (the pairwise comparisons and BLAST searches), we were able to further reduce the time to a minute or two from the users perspective.
The ability to sample random divergent sets coupled with the ability to conveniently retrieve additional sequences from the database in minutes rather than hours will greatly improve our ability to test the robustness of a wide range of bioinformatics analyses to the choice of divergent set. DivergentSet will also facilitate the development of improved methods for drawing reliable conclusions from many of these analyses.
| ACKNOWLEDGMENTS |
|---|
| FOOTNOTES |
|---|
Published, MCP Papers in Press, June 11, 2006, DOI 10.1074/mcp.T600022-MCP200
1 The abbreviations used are: MEME, Motif Elicitation by Maximum Entropy; BLAST, Basic Local Alignment Search Tool; KEGG, Kyoto Encyclopedia for Genes and Genomes; LysRS, lysyl-tRNA synthetase; PSI-BLAST, Position-Specific Iterated Basic Local Alignment Search Tool; OTU, operational taxonomic unit; UPGMA, unweighted pair group method with arithmetic mean; CPU, central processing unit; PBS, portable batch system. ![]()
* This work was supported in part by a seed grant from the W. M. Keck RNA Bioinformatics Initiative and by Air Force Office of Scientific Research Proposal 0705.01.1102b. ![]()
Both authors contributed equally to this work and should be considered joint first authors. ![]()
|| To whom correspondence should be addressed. Tel.: 303-492-1984; Fax: 303-492-7744; E-mail: rob{at}spot.colorado.edu
| REFERENCES |
|---|
|
|
|---|
This article has been cited by other articles:
![]() |
C. A. Lozupone and R. Knight Global patterns in bacterial diversity PNAS, July 3, 2007; 104(27): 11436 - 11440. [Abstract] [Full Text] [PDF] |
||||
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||