Musite, a Tool for Global Prediction of General and Kinase-specific Phosphorylation Sites*

Reversible protein phosphorylation is one of the most pervasive post-translational modifications, regulating diverse cellular processes in various organisms. High throughput experimental studies using mass spectrometry have identified many phosphorylation sites, primarily from eukaryotes. However, the vast majority of phosphorylation sites remain undiscovered, even in well studied systems. Because mass spectrometry-based experimental approaches for identifying phosphorylation events are costly, time-consuming, and biased toward abundant proteins and proteotypic peptides, in silico prediction of phosphorylation sites is potentially a useful alternative strategy for whole proteome annotation. Because of various limitations, current phosphorylation site prediction tools were not well designed for comprehensive assessment of proteomes. Here, we present a novel software tool, Musite, specifically designed for large scale predictions of both general and kinase-specific phosphorylation sites. We collected phosphoproteomics data in multiple organisms from several reliable sources and used them to train prediction models by a comprehensive machine-learning approach that integrates local sequence similarities to known phosphorylation sites, protein disorder scores, and amino acid frequencies. Application of Musite on several proteomes yielded tens of thousands of phosphorylation site predictions at a high stringency level. Cross-validation tests show that Musite achieves some improvement over existing tools in predicting general phosphorylation sites, and it is at least comparable with those for predicting kinase-specific phosphorylation sites. In Musite V1.0, we have trained general prediction models for six organisms and kinase-specific prediction models for 13 kinases or kinase families. Although the current pretrained models were not correlated with any particular cellular conditions, Musite provides a unique functionality for training customized prediction models (including condition-specific models) from users' own data. In addition, with its easily extensible open source application programming interface, Musite is aimed at being an open platform for community-based development of machine learning-based phosphorylation site prediction applications. Musite is available at http://musite.sourceforge.net/.

With many genomes being sequenced at an increasingly fast pace, a key and challenging issue is inferring protein function and downstream regulatory networks. As a pervasive regulatory mechanism, reversible protein phosphorylation plays an important role in signaling networks (1). Annotation of phosphorylation and other modification sites in proteomes is a critical first step toward decoding such signaling networks.
In recent years, protein phosphorylation data have accumulated rapidly due to large scale mass spectrometry studies of protein phosphorylation in different organisms (2)(3)(4)(5)(6)(7)(8)(9) and development of associated web resources (10 -18). In particular, there are currently about 100,000 annotated phosphorylation sites in all organisms in UniProt/Swiss-Prot (V57.8). About 27,000 of these sites are from human. Nevertheless, our knowledge of protein phosphorylation is still limited. The majority of proteins are estimated to be phosphorylated at multiple sites (Ͼ100,000 sites in the human proteome alone) (19). Furthermore, our understanding of phosphorylation events in signaling networks is even more lacking, largely due to the lag in elucidating kinase-substrate interactions. For example, fewer than 5,000 (5%) of reported phosphorylation sites in UniProt/Swiss-Prot are annotated for their cognate protein kinases.
Despite improvements in phosphopeptide enrichment and mass spectrometry analysis, experimental identification of phosphorylation sites in a global manner is still a difficult, expensive, and time-consuming task. In addition, high throughput proteomics techniques have some limitations. Because only proteotypic peptides are observed, mass spectrometry tends to provide fractional sequence coverage for proteins. Detection of low abundance proteins is also problematic. Consequently, a significant portion of phosphorylation sites are missed by current techniques. Moreover, it is even harder to characterize kinase-substrate interactions experimentally. Hence, in silico prediction of phosphorylation events can be highly valuable in many cases. As genome and proteome data in various organisms have been increasing dramatically, comprehensive and accurate prediction of protein phosphorylation sites is becoming more advantageous for proteome annotation and large scale experimental design. For example, in hypothesis-driven experiments, the research-ers may want to use prediction tools to focus on putative phosphorylation sites above a high stringency level.
More than a dozen phosphorylation site prediction tools have been developed; they can be divided into two categories: tools for general phosphorylation site prediction and tools for kinase-specific phosphorylation site prediction. DISPHOS (20), NetPhos (21), and scan-x (22) fall into the first category. The latter category includes Scansite (23), Net-PhosK (24), GPS (25), KinasePhos (26), Predikin (27), CRPhos (28), AutoMotif (29), pkaPS (30), PPSP (31), PhoScan (32), PredPhospho (33), and NetPhorest (34). More information about these tools is given in supplemental Table S1. Although kinase-specific prediction is of interest because of its essential role in constructing signaling networks, general prediction is also important because the majority of phosphorylation sites remain undiscovered, and the kinase-specific predictors may only be able to unveil a small fraction of them.
Despite the availability of various phosphorylation site prediction tools, they have limitations when applied to whole proteomes. The most important issue of phosphorylation site prediction is accuracy. Because different training data and techniques were used with these programs, prediction performance varies greatly among them as discussed later. Another notable issue is that most tools were only released as web servers and have restrictions for the data uploaded by users (see supplemental Table S1). This makes large scale predictions a laborious or impossible task. Besides web servers, GPS 2.1 (25) and PhoScan (32) were also released as stand-alone tools, capable of handling large data sets, but both tools only support kinase-specific predictions. NetPhos 2.0 and NetPhosK 1.0 were also released as both web servers and stand-alone applications under Unix/Linux, but prediction performance could be improved as we demonstrate in this study. In Schwartz et al. (22), proteome scale scans on human, mouse, fly, and yeast were performed using motif-x (35) and scan-x, and the prediction results were accessible. However, the tool scan-x has not been publicly released (as of May 26, 2010), and hence "on-the-fly" predictions for user-uploaded sequences are not possible. Another concern regarding the current tools is the stringency control of predictions. User control on the prediction stringency is important, especially for large scale predictions, because typically a user is interested only in predictions above a certain confidence threshold, and different users may have different requirements on the threshold. However, current tools either preset the threshold and do not support stringency adjustment or only support several predefined stringency levels from which a user can choose that may not meet every user's requirement.
To address the limitations of existing tools and to take advantage of the large magnitude of experimentally verified phosphorylation sites, we developed a bioinformatics tool, Musite, specifically designed for large scale prediction of both general and kinase-specific phosphorylation sites. As a stand-alone application, Musite can be easily used to perform large scale phosphorylation site prediction in an automated fashion. We modeled phosphorylation site prediction as an unbalanced binary classification problem and solve it with a comprehensive machine-learning approach. Reliable experimental phosphoproteomics data in multiple organisms were collected from several sources and utilized to train phosphorylation site prediction models by a comprehensive machinelearning procedure termed bootstrap aggregating. Three sets of features (k nearest neighbor (KNN) 1 scores, disorder scores, and amino acid frequencies) were extracted from the collected data and combined using support vector machine (SVM) to make predictions. KNN scores capture local sequence similarity around sites phosphorylated by the same kinase or kinase family whether or not the kinase-substrate interactions are known. Disorder scores reflect the higher probability of phosphorylated residues to be in disordered regions, which are segments of proteins that lack a stable tertiary structure. Phosphorylation sites have been shown to be preferentially located within disordered regions (20,36); this was confirmed on phosphoproteomics data in six organisms by this study.
Applications of Musite on several proteomes yielded tens of thousands of putative phosphorylation sites with high stringency. Cross-validation tests and comparisons with other tools show that Musite performs better on general predictions and at least comparably with existing methods on kinasespecific predictions. In Musite V1.0, we have trained general prediction models for six organisms and kinase-specific prediction models for 13 kinases or kinase families. It is noted, however, that using the current pretrained models users cannot correlate prediction results with any particular cellular condition. To do so, users can utilize a unique functionality in Musite for training customized prediction models from their own condition-specific phosphorylation data. In addition, Musite supports continuous stringency adjustment to meet different confidence requirements for users. Taken together, Musite provides a valuable tool for biologists to predict phosphorylation sites up to the whole proteome level. In addition, with its open source, well designed, and easily extensible application programming interface (API), Musite is also beneficial to bioinformaticians as a platform to build their own machine learning-based applications for phosphorylation site prediction.

Data Collection
Phosphorylation data for six model organisms, Homo sapiens, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans, Sac-charomyces cerevisiae, and Arabidopsis thaliana, from several sources including UniProt/Swiss-Prot (11) Table I lists the phosphorylation data sources and statistics for different phosphorylation types of each organism. Because a serine/threonine-specific kinase can often phosphorylate both serine and threonine residues (38), phosphoserines and phosphothreonines in each organism were combined when training prediction models in Musite.
We used the same type of residues (serine/threonine or tyrosine), excluding known phosphorylation sites as the negative training data (non-phosphorylation sites). Although not all these sites are necessarily true negatives, it is reasonable to believe that a large majority of them are. The data were extracted from organism-wise complete proteomes annotated in UniProt (as of October 2, 2009) for H. sapiens, D. melanogaster, C. elegans, and S. cerevisiae. Because the complete proteome for M. musculus was not provided in UniProt, Swiss-Prot protein entries of M. musculus were used. For A. thaliana, TAIR9 gene models and annotations were used (39).

Non-redundant Data Set Construction
For each of the six organisms, after combining the positive and negative data, protein sequences with high similarities were removed to build a non-redundant (NR) protein data set using BLAST-Clust in BLAST (40) package version 2.2.19 with a sequence identity threshold of 50%. Proteins with similar sequences were first clustered into groups by BLASTClust. Within each group, we selected the protein with the largest number of known phosphorylation sites into the NR data set; if there were no phosphorylation sites in any of the proteins in a group, we selected the longest protein. In this way, the maximum number of NR phosphorylation sites was kept in the NR data set.

Machine Learning
Phosphorylation site prediction can be formulated as a binary classification problem; namely, each serine/threonine or tyrosine can be classified as either a phosphorylation site or a non-phosphorylation site. As with all general binary classification problems, there are three key issues: (i) having a well collected and curated data set including positive and negative data, (ii) having a set of effective features to characterize the commonalities in each category and the difference between the two categories, and (iii) developing a classifier trained from the known data and capable of making reliable predictions for new data.
In this study, for different types of phosphorylation (i.e. phosphoserine/threonine and phosphotyrosine) in each organism, separate prediction models were trained from the NR data sets as summarized in Fig. 1. Output from a protein disorder predictor, KNN scores, and amino acid frequencies around the phosphorylation sites were taken as features. More features as listed in supplemental Table S2 will be evaluated and incorporated into our prediction scheme if they improve the prediction accuracy and are computationally feasible. We used the aggregation of multiple SVMs (41) as the classifier.

Feature Extraction
KNN Features-Local sequence clusters often exist around phosphorylation sites because substrate sites of the same kinase or kinase family usually share similar patterns in local sequences (42). To take advantage of such cluster information of local sequences for predicting phosphorylation sites, we took the local sequence around a possible phosphorylation site in a query protein and extracted features from its similar sequences in both positive and negative sets by a KNN algorithm as follows.
I. For a query site (possible phosphorylation site), find its k nearest neighbors in positive and negative sets, respectively, according to local sequence similarity. For two local sequences, s 1 ϭ {s 1 (Ϫw), s 1 (Ϫwϩ 1), …, s 1 (0), …, s 1 (w Ϫ 1), s 1 (w)} and s 2 ϭ {s 2 (Ϫw), s 2 (Ϫw ϩ 1), …, s 2 (0), …, s 2 (w Ϫ 1), s 2 (w)} define the distance Dist(s 1 , s 2 ) between s 1 and s 2 as where w is the number of residues included in the window in each side, and hence, the window size is 2w ϩ 1 (in Musite, w ϭ 6 by default; "*," which represents gaps in the BLOSUM matrix, will be added to the termini if the upstream or downstream regions of the sites have less than w residues) and Sim, the amino acid similarity where a and b are two amino acids, M is the substitution matrix, and max/min{M} represent the largest/smallest number in the matrix, respectively.
II. The corresponding KNN feature is then extracted as follows.
A. Form a set of neighbors by combining the positive and negative sets. B. Calculate the average distances from the query sequence s to all neighbors. C. Sort the neighbors by the distances and pick the k nearest neighbors. D. Calculate the KNN score, the percentage of positive neighbors (phosphorylation sites) in its k nearest neighbors. III. To take advantage of different properties of neighbors with various similarities, steps I and II were repeated for different k values to obtain multiple features for the phosphorylation predictor. In Musite, by default, k was chosen to be 0.25, 0.5, 1, 2, and 4% of the size of the bootstrapped training data set, and thus, five KNN scores were extracted as features for phosphorylation prediction. It is worthwhile mentioning that the nearest neighbors represent only local sequence similarity as global sequence similarities are removed in NR data sets.
Disorder Features-Phosphorylation sites have been observed to have a strong tendency to be located in disordered regions (20,36). In Iakoucheva et al. (20), predicted disorder scores for phosphorylation sites were used as features in the phosphorylation predictor DISPHOS. In this study, we extracted the disorder information for surrounding residues of each possible phosphorylation site in the query protein and combined them to form a set of disorder features in the following procedure.
I. Predict the disordered scores for the query protein sequence by means of a widely used disorder prediction tool, VSL2B (44). II. Extract the disorder prediction scores for the residues around each possible phosphorylation site. III. Take the average scores surrounding each site with different window sizes as features for the phosphorylation predictor. In Musite, by default, the window sizes are chosen to be 1, 5, and 13 (i.e. with 0, 2, and 6 surrounding residues from each side). Therefore, three disorder scores were extracted as features for phosphorylation site prediction. If there are not enough residues in either side, we use a truncated window starting from or ending at the terminus. For example, if the phosphorylation site is at the fourth position in the protein, we averaged the disorder scores over positions 1-10 for the window size of 13.
Amino Acid Frequency Features-Iakoucheva et al. (20) analyzed the amino acid composition of the surrounding sequences of phosphorylation sites and found that rigid, buried, neutral amino acids (Trp, Cys, Phe, Ile, Tyr, Val, and Leu) were significantly depleted, whereas flexible, surface-exposed amino acids (Ser, Pro, Glu, and Lys) were significantly enriched. This conclusion was confirmed in the current study as illustrated under "Results." Hence, the amino acid frequencies can be useful features for phosphorylation site prediction. The procedure to extract amino acid frequency features is as follows. We calculated the amino acid frequencies in the sequence surrounding the query site (the site itself is not counted). There are 20 types of amino acids, and thus 20 frequencies are calculated, the sum of which is 1. In Musite, by default, the window size is 13; i.e. 6 residues at each side were included to calculate the frequencies. If there are not enough residues in either side, a truncated window starting from or ending at the terminus is used in the same manner as described above in the disorder feature extraction procedure.

Model Training
We used the extracted features to train phosphorylation prediction models. As illustrated in Fig. 1, the training procedure consisted of two subprocedures: bootstrap aggregation and specificity estimation. Note that the training data and control data were randomly separated with no overlaps.
Bootstrap Aggregating-The sizes of positive and negative data sets in this study were highly unbalanced. The size of negative data sets was 2 orders of magnitude larger than the positive data sets as shown in Table I. To handle this problem, we used an ensemble meta-algorithm in machine learning called bootstrap aggregating or bagging (45). Given the features extracted from positive and negative data sets, the bootstrap aggregation procedure is as follows.
I. Bootstrap: generate a training set by sampling with replacement from positive and negative data sets randomly. This training set is called one bootstrap sample. By default, in one bootstrap sample, 2,000 data points were sampled from the positive data set, and another 2,000 were sampled from the negative data set. Therefore, the training set was balanced after bootstrapping. II. Classifier training: take the bootstrap sample as a training data set to train an SVM classifier. We used the package SVM light V6.02 (46) in this study. Areas under ROC curves (AUCs) were also calculated based on the trapezoidal approximation. As part of the training procedure, a portion of non-phosphorylation sites (10,000 by default) was randomly selected for specificity estimation as shown in Fig. 1. When making predictions, a user can choose stringency levels based on estimated specificities.

RESULTS AND DISCUSSION
KNN Scores as Features-A KNN score measures whether the local sequence surrounding a query site is more similar to the sequences containing phosphorylation sites in the positive set or those with non-phosphorylation sites in the negative set. A score greater than 0.5 means the query site is more similar to the positive set; a score smaller than 0.5 means it is more similar to the negative set. The larger the KNN score, the more similar the site is to some known phosphorylation sites, and thus, the more likely it is a phosphorylation site. Fig. 2 compares the KNN scores of phosphorylation sites with those of non-phosphorylation sites. Overall, phosphorylation sites have larger KNN scores than non-phosphorylation sites. For phosphoserines/threonines, the average KNN scores with different sizes of nearest neighbors are within 0.6 -0.8 for all six organisms; for phosphotyrosines, the average KNN scores are around 0.6. Therefore, the local sequences surrounding known phosphorylation sites are more similar to their nearest neighbors in the positive set (excluding self-matches) on average as expected. Note that such similarities are not due to protein homology as the global sequence similarity between any two proteins in our NR data sets is either insignificant or low. This finding confirms that phosphorylation-related clusters exist in local sequences around phosphorylation sites. For non-phosphorylation sites, the average KNN scores are around 0.5, which means overall that the sequences in the negative set are not predominantly more similar to nearest neighbors in either the positive or negative set. This is not surprising because phosphorylation-related sequence clusters are unlikely to exist in the negative set, and thus, the sequences in the negative set have a similar chance to find close neighbors in either the positive or negative set. In short, the KNN scores capture the cluster information in the local FIG. 2. Comparison of KNN scores between phosphorylation sites and non-phosphorylation sites. KNN scores of 1,000 phosphorylation sites and 1,000 non-phosphorylation sites randomly selected from each non-redundant data sets for six organisms were plotted. A, box plots of KNN scores (H. sapiens serine/threonine data only) for phosphorylation sites (red) and non-phosphorylation sites (blue). The horizontal axis represents the size of nearest neighbors (in percentage of the bootstrapped data set size). The vertical axis represents the KNN score. The bottom and top of the box are the 25th and 75th percentiles, respectively; the central band is the median; the whiskers extend to the most extreme data points that are not considered outliers; and the outliers are plotted individually as plus marks (ϩ). B, comparison of mean KNN scores between phosphorylation sites (pentagrams) and non-phosphorylation sites (circles) in six organisms. sequence around phosphorylation sites and hence distinguish them from the background. Therefore, KNN scores are suitable to be used as features for phosphorylation site prediction.
KNN scores are very effective features when used for predicting both general and kinase-specific phosphorylation sites. For general phosphorylation site predictions, KNN scores can automatically capture the local sequence similarity between substrates of a common kinase or kinase family without relying on knowledge of kinase-substrate interactions or kinase-binding sequence motifs. This means that although most known phosphorylation sites have no annotation about their corresponding kinases KNN scores can still utilize the inherent cluster information in them. KNN scores are also useful for predicting kinase-specific sites. Oftentimes, one kinase corresponds to multiple local sequence motifs, and using a single sequence profile may not be as effective as KNN, which better handles diverse sequence clusters.
Protein Phosphorylation and Protein Disorder-In this section, we will demonstrate the effectiveness of disorder scores as features for phosphorylation site prediction by studying the preference of phosphorylation sites in protein disordered re-gions. Fig. 3, A and C and B and D, plot the histograms of the disorder scores for the surrounding residues of phosphoserines/threonines and non-phosphoserines/threonines, respectively. For both H. sapiens (Fig. 3A) and A. thaliana (Fig.  3C), the number of phosphoserines/threonines increases exponentially when the disorder score increases from 0 to 1; the phosphoserines/threonines with disorder scores larger than 0.9 are dominant in the data set. In contrast, Fig. 3, B and D, show a different pattern for non-phosphoserines/threonines. The number of non-phosphorylation sites with disorder scores larger than 0.9 is still higher than those in the other subranges. This may be due to the fact that many of the non-phosphorylation sites are actually unassigned phosphorylation sites. Alternatively, this could also reflect a general preference of serine/threonine to be located in disordered regions. A third possibility is that these serines/threonines are actually not phosphorylated because they lack a surrounding kinase-specific motif in which case their non-phosphorylation status should be indicated by the KNN feature. In any case, it is clear that phosphoserines/threonines in H. sapiens and A. thaliana are much more overrepresented in disordered regions than non-phosphoserines/threonines. In fact, the ma- All phosphorylation sites and non-phosphorylation sites that have 6 or more residues at both sides were used. A, histogram of disorder scores of residues around phosphoserines/threonines (23,907 in total) in the H. sapiens NR data set. The horizontal axis represents the disorder score predicted by VSL2B, divided evenly into 10 subranges from 0 to 1; the vertical axis represents the occurrence (the number of sites) in the corresponding disorder subrange. Different colors from blue to red in each bar stand for 13 different residue positions in the window from the upstream Ϫ6 to downstream ϩ6 residues as indicated in the color bar on the right. B, histogram of disorder scores of residues around non-phosphoserines/threonines (1,171,139 in total) in the H. sapiens NR data set. C, histogram of disorder scores of residues around phosphoserine/threonine sites (3,512 in total) in the A. thaliana NR data set. D, histogram of disorder scores of residues around non-phosphoserine/threonine sites (986,481 in total) in the A. thaliana NR data set. E, histogram of disorder scores of residues around phosphotyrosine sites (2,504 in total) in the H. sapiens NR data set. F, histogram of disorder scores of residues around non-phosphotyrosine sites (221,322 in total) in the H. sapiens NR data set. jority (91.0% in Fig. 3A and 87.6% in Fig. 3C) of the phosphorylation sites in the example have disorder scores larger than 0.5 (note that VSL2B predicts a residue to be in the disordered region when its predicted value is larger than 0.5), whereas the corresponding percentages are only 54.9 and 50.5% for non-phosphoserines/threonines in Fig. 3, B and D, respectively. Fig. 3, E and F, plot the histograms of the disorder scores for the surrounding residues of phosphotyrosines and non-phosphotyrosines in H. sapiens, respectively. Although the phosphotyrosines are not predominantly distributed in disordered regions, there is a clear shift toward disordered regions in comparison with non-phosphotyrosines. This pattern is consistent in all six organisms studied (supplemental Fig. S1). In summary, there is a clear preference of known phosphorylation sites to be within disordered regions, which justifies the use of disorder scores as features for phosphorylation site prediction.
Amino Acid Composition surrounding Phosphorylation Sites-In this section, we will study the difference between the amino acid composition surrounding phosphorylation sites and that surrounding non-phosphorylation sites. In Fig.  4, from left to right, the amino acids vary from being enriched to being depleted in the surrounding sequences of phosphorylation sites. With slight variations among different organisms, the overall trends are similar. For phosphoserine/threonine sites (Fig. 4A), amino acids Pro, Arg, Asp, Glu, Ser, Lys, and Gly are enriched in the surrounding sequences, whereas Cys, Trp, Tyr, Phe, Ile, Met, Leu, His, Thr, and Val are depleted. For phosphotyrosine sites (Fig. 4B), Asp, Glu, Pro, Ser, and Gly are enriched, whereas Trp, Cys, Phe, Leu, His, Met, and Ile are depleted. The different compositions of amino acids surrounding phosphorylation sites and non-phosphorylation sites justify the use of amino acid frequencies as features for phosphorylation site prediction.
General and Kinase-specific Prediction for Multiple Organisms-One of the unique features of Musite is that it can be used to perform both general and kinase-specific phosphorylation site predictions. Both types of predictions use the same process except that kinase-specific phosphorylation site predictions use phosphorylation sites corresponding to a specific kinase or a kinase family as the positive training data. For general predictions, we have trained six phosphoserine/ threonine prediction models for six organisms (H. sapiens, M. musculus, D. melanogaster, C. elegans, S. cerevisiae, and A. thaliana), one combined phosphoserine/threonine prediction model for general eukaryotes, and two phosphotyrosine prediction models for H. sapiens and M. musculus, respectively. We did not train phosphotyrosine prediction models for the other four organisms because there were not enough data for training in those organisms. Kinase-specific prediction models were trained from phosphorylation data in H. sapiens for 13 kinases or kinase families, including ataxia telangiectasia mutated (ATM) kinase, cyclin-dependent kinase (CDK) family, CDK1, CDK2, casein kinase 1 (CK1), CK2, mitogen-activated protein kinase (MAPK) family, MAPK1, MAPK3, protein kinase A (PKA) family, protein kinase B (PKB) family, protein kinase C (PKC) family, and proto-oncogenic tyrosine kinases (Src). Supplemental Table S3 lists the detailed information about the FIG. 4. Comparisons of amino acid compositions in positive and negative data sets. A, comparisons between phosphoserines/ threonines and non-phosphoserines/threonines in six organisms. The vertical axis represents the log 2 ratio between amino acid frequencies surrounding phosphoserines/threonines and those surrounding non-phosphoserines/threonines. A value larger than 0 means the corresponding amino acid is enriched surrounding phosphoserines/threonines. The horizontal axis represents the 20 amino acids sorted in descending order by the mean log 2 ratio. B, similarly, comparisons between phosphotyrosines and non-phosphotyrosines in H. sapiens and M. musculus (phosphotyrosine data in the other four organisms are too sparse to derive meaningful statistics). prediction models we trained so far. All the prediction models are downloadable at http://musite.sourceforge.net/.
Evaluation of General Phosphorylation Site Prediction Performance of Musite-To evaluate the performance of Musite for general phosphorylation site prediction, we carried out a 10-fold cross-validation test on each type of phosphorylation in each organism as follows. Taking phosphoserine/threonine in H. sapiens for example, the phosphoserines/threonines in the H. sapiens NR data set were randomly divided into 10 groups. Each group was then combined with the same number of non-phosphoserines/threonines randomly selected from the H. sapiens NR data set to form 10 sub-data sets. A single sub-data set was retained as validation data, and all the remaining positive and negative data in the NR data set was used as training data to train a prediction model. The validation data were then submitted to this trained model for prediction. The cross-validation process was repeated 10 times with each sub-data set used exactly once as the validation data. Sensitivities at different specificity levels in each crossvalidation run were calculated according to Equations 3 and 4. Average sensitivities and specificities over 10 cross-validations were calculated. By taking different thresholds, we then calculated the ROC curves as plotted in Fig. 5 and the AUCs as shown in supplemental Table S4.
The most confident predictions are those with high specificities. From the ROC curves, for phosphoserines/threonines, at 95% specificity, the prediction sensitivities vary from 36 to 62%; at 99% specificity, most predictions have sensitivity around 10%, whereas predictions for A. thaliana and C. elegans achieve 20 and 32% sensitivity, respectively. Interestingly, from the ROC curves, the predictions for C. elegans perform significantly better than those for the other five organisms. This can be explained by the fact that both KNN and amino acid frequency features in C. elegans show stronger patterns in distinguishing between positive and negative data as shown in Figs. 2 and 4. For phosphotyrosines, we tested our results only on H. sapiens and M. musculus. The performances for both organisms achieve around 10 and 25% in sensitivity at the 99 and 95% specificity levels, respectively.
The prediction performance using each of the three sets of features (KNN, amino acid frequency, and disorder) was also evaluated as shown in supplemental Fig. S2. The result shows that the combined features yield more accurate predictions as expected. When testing the features separately, KNN features performed the best in ROC.
Proteome-wide Phosphorylation Site Prediction-The general phosphorylation site prediction model for each organism was used to scan the corresponding proteome. Nearly 100,000 phosphorylation sites were predicted at the 99% specificity level for all six organisms combined as shown in Table II. Prediction results are available for download at http:// musite.sourceforge.net/. These predictions provide useful hypotheses for experimental validations.
Cross-species Prediction of General Phosphorylation Sites-The performance of cross-species predictions using Musite was evaluated against six organisms. A test data set was built by randomly selecting 100 proteins that contain phosphoserines/threonines and 100 non-phosphoproteins from the NR data set of each organism. The remaining proteins in each NR data set formed the training data set to build a prediction model for every organism. We also built a combined prediction model using the data by combining the six training data sets and running the non-redundant data set building procedure with an identity threshold of 30%. Note that there was no overlap between training and test data sets. Pairwise tests were then performed by submitting all serines/  H. sapiens,  M. musculus, D. melanogaster, C. elegans, S. cerevisiae, and A. thaliana. Each curve represents the average sensitivities and specificities for difference thresholds over 10 cross-validation runs. The bottom right figure is the zoomed-in region with high prediction specificities (0.9 -1). threonines in each of the six test data sets to each of the seven prediction models. The specificities, sensitivities, and AUCs were then calculated as shown in Table III. For all test data sets, prediction results from the model trained using data in the same organism performed the best (considering only AUC). For each of the six test data sets, the performances did not have large variations using different models trained on data from different organisms. As a possible explanation, although kinases and their substrates vary in different species, the biophysical mechanism of enzyme-substrate binding remained the same, and the generic features utilized by Musite (i.e. disorder scores and amino acid frequencies) could have captured such a mechanism. Supplemental Fig. S3 and supplemental Table S5 provide the scatter plots and correlation coefficients of prediction scores among the seven models for all 15,980 serines/threonines in the H. sapiens test data set, which show high positive associations among predictions from different models. The results suggest that Musite and the associated prediction models can be used for cross-species predictions of general phosphoserines/threonines, which is especially useful when phosphorylation data in species of interest is not enough for training a prediction model. Interestingly, for cross-species predictions, there is no apparent evidence that using model trained on data from an evolutionarily closer species would perform better. As an example, for the H. sapiens test data set, the predictions from the A. thaliana model performed better than others except the combined model and the H. sapiens model itself. Given the small test data size, this may not be statistically significant. The performance variations in various models may be partially due to different quantities and qualities of phosphorylation data in different organisms.
Comparison with Other General Prediction Tools-To further evaluate the performance of general phosphorylation site prediction by Musite, we compared it with three existing tools, NetPhos, DISPHOS, and scan-x. We applied the same H. sapiens test data set as that used in cross-species prediction evaluation containing 9,943 serines (with 390 known phosphoserines) and 6,037 threonines (with 77 known phosphothreonines). Sequences of these 100 phosphoproteins and 100 non-phosphoproteins were submitted to NetPhos, scan-x, and DISPHOS for prediction. To compare the result, we trained a model for Musite using the remaining proteins in the H. sapiens NR data set and predicted the phosphorylation sites in the test data set. The ROC curves comparing the predictions of different tools are shown in Fig. 6 and supplemental Table S6.
NetPhos is an artificial neural network-based general phosphorylation site predictor for eukaryotic proteins (21). We submitted the H. sapiens test data set to the web server of NetPhos 2.0 (http://www.cbs.dtu.dk/services/NetPhos/). The default predictions of NetPhos achieved sensitivity and specificity of 77.7 and 61.9%, respectively. In addition to the default predictions, the program also provided the prediction scores for all serines/threonines in query sequences. By taking different thresholds on the scores, NetPhos achieved sensitivities of 27.4 and 6.4% at 95.2 and 99.4% specificity levels, respectively. To compare, when predicting phosphoserines/threonines on the same data set using the models trained from the H. sapiens NR data set excluding the test data set, Musite achieved sensitivities of 88.7, 42.8, and 11.3% at specificities of 61.9, 95.2, and 99.4%, respectively.
DISPHOS was the first phosphorylation site predictor that made use of protein disorder information (20). The web server DISPHOS 1.3 (http://core.ist.temple.edu/pred/) successfully predicted 190 of 200 protein sequences in the H. sapiens test data set, whereas the remaining 10 sequences failed after multiple trials. We did not include these 10 sequences when evaluating the performance of DISPHOS. The default predic-   scan-x is a tool to predict phosphorylation sites in different organisms using phosphorylation motifs determined by motif-x (22,35). Proteome scale results on H. sapiens, M. musculus, D. melanogaster, and S. cerevisiae using scan-x are available for searching on the scan-x web site (http://scanx.med.harvard.edu/). By searching the protein sequences in the H. sapiens test data set, the predictions of scan-x achieved a sensitivity of 34.5% at a specificity of 95.2% and a sensitivity of 15.8% at a specificity of 98.7%. To compare, Musite achieved sensitivities of 42.8 and 18.2% at specificities of 95.2 and 98.7%, respectively.
Note that, when performing the comparisons, we used a prediction model that trained from a data set excluding the protein sequences in the test data set. However, for Net-Phos, DISPHOS, and scan-x, some of the test proteins might have been included in their training processes, and thus, the sensitivity performances are biased favorably toward these tools in the comparisons. This means that the performance improvement of Musite over these tools could be underestimated.
The consistency among different tools is shown in Fig. 7. At a specificity of 95%, each tool predicted about 900 phosphorylation sites. 109 total predicted sites are common among all four tools. About one-third to one-half of the predictions from each tool have no overlap with any other tools. 37 (34%) of the 109 commonly predicted sites are known phosphorylation sites, whereas the percentage for each separate tool is lower (Musite, 21%; DISPHOS, 9%; NetPhos, 9%; and scan-x, 18%). This suggests that a metapredictor combining the multiple tools to perform a consensus prediction might boost the prediction accuracy, although the data presented here are too sparse to be conclusive. An alternative explanation could be that the training data used by different tools were complementary.
Comparison with Other Kinase-specific Prediction Tools-We also evaluated the performance of kinase-specific predictions using Musite by comparison with four widely used tools, Scansite 2.0 (23), NetPhosK 1.0 (24), GPS 2.1 (25), and pkaPS (30). The well known PKA family, CK2, and MAPK family were used for comparison. The substrate proteins of each kinase or kinase family were extracted from the H. sapiens NR data set. Another 200 non-phosphoproteins were randomly selected from the H. sapiens NR data set. Combining these 200 non-phosphoproteins and the substrate proteins forms a test data set for each kinase or kinase family. The test data set for PKA was submitted to Scansite, NetPhosK, GPS, and pkaPS. The test data set of CK2 was submitted to Scansite, NetPhosK, and GPS. The test data set for MAPK was submitted to GPS.
To evaluate the performance of Musite, for each of the four kinases, a modified leave-one-out cross-validation test was performed. Each time, one phosphorylation site and the nonphosphorylation sites in the corresponding test data set formed the validation set, and the remaining phosphorylation and non-phosphorylation sites in the H. sapiens NR data set formed the training set. A prediction model was trained from the training set using Musite (1,000 data points from the positive data set and 1,000 data points from the negative data set were sampled when bootstrapping; k was chosen to be 2.5, 5, 10, and 20% of the bootstrapped sample size, i.e. 2,000, for KNN features; other parameters were by default). The model was used to predict phosphorylation sites in the validation set. The specificity level for the validation phosphorylation site was then calculated by counting the percentage of correctly classified non-phosphorylation sites in the validation set by setting the threshold as the prediction score of the validation phosphorylation site. This was repeated such that each phosphorylation site was used once in the validation set, and hence, each had a specificity level. Combining all validations, sensitivities were calculated by counting the percentage of phosphorylation sites that had specificities above certain levels. The leave-one-out cross-validation results of Musite were then compared with the prediction results of other tools. Like the comparisons of general phosphorylation site predictions, some of the test proteins might have been trained in other tools, and thus the performance is biased favorably to them when comparing the results. Therefore, we also performed self-consistency tests; i.e. we trained prediction models using all available sites for each kinase and predicted the sequences in the corresponding test data set. Both results of leave-one-out cross-validation tests and self-consistency tests by Musite were compared with the results of other tools.
All of the four other tools that we compared have predefined several stringency levels, and all except Scansite also provided prediction scores for all potential sites. Therefore, for Scansite, we calculated the specificity and sensitivity at its predefined three stringency levels, but for the other three tools and Musite, we adjusted the prediction thresholds to set the specificity levels as close as possible to 99.99, 99.9, 99.8, 99.5, 98.0, 97.0, 95.0, and 90.0% and compared the corresponding sensitivities. Table IV shows that Musite self-consistency tests performed superiorly to Scansite, NetPhosK, and pkaPS. Musite leave-one-out tests performed better than Scansite and NetPhosK in most cases except the results of NetPhosK at sensitivities of 99.99 and 99.90%. Musite leaveone-out tests performed comparably with pkaPS. GPS performed slightly better than the Musite leave-one-out tests; however, in most cases, Musite self-consistency tests outperformed the results of GPS, especially at high stringency levels. Note that because GPS is a new and well maintained tool it is likely that most, if not all, of our training phosphorylation sites have been included in their training process. In any case, the prediction performance of Musite is at least comparable with other kinase-specific prediction tools.
Software Implementation-We developed a stand-alone software system, Musite, to implement the described phosphorylation site prediction method. Currently, Musite V1.0 has been released for Windows, Mac OS X, and Linux/Unix plat-forms. Written in Java and released under a GNU general public license open source license, the Musite project provides an open platform for development of machine learningbased applications in predicting protein phosphorylation sites. With well designed API, Musite can be easily extended by programming. For example, other sequence features, such as protein secondary structure and solvent accessibility, can be easily incorporated in phosphorylation prediction by extending Musite API. Furthermore, Musite can be extended to train models and make predictions for other types of posttranslational modifications. Musite, together with its source code, is available at http://musite.sourceforge.net/.
To make Musite user-friendly, we have implemented an easy-to-use graphical user interface. The most important utility of Musite is phosphorylation prediction, and therefore, in the main dialog, a user can submit protein sequences, a FASTA file, or a Musite XML file (Musite XML is a customized XML file format used by Musite for storing phosphorylation data in a compact yet comprehensive way). Fig. 8 shows an example of the result panel after submitting the human p53 protein sequence for phosphoserine/threonine prediction. Musite supports continuous adjustment of specificity cutoff from 0 to 1, rather than predefined confidence levels, to meet all stringency requirements of different studies. The predicted phosphorylation sites above the cutoff are rendered in different colors according to their stringency levels. The tab "Predicted Sites" contains a table with detailed information about each predicted site, such as its position, prediction score, and specificity level. The predicted result can be saved for future analysis or exported as a tab-delimited text file.
Musite makes it possible for a user to perform proteome scale prediction of phosphorylation sites in an automated fashion. We have performed the proteome scans in Table III on a standard work station (2.13-GHz dual core processor and 2-GB memory). Processing time was ϳ18 h to predict general phosphorylation sites in all 20,319 proteins in the H. sapiens complete proteome from UniProt using a model trained with the default parameters (2,000 boots and five SVM classifiers). The running time will decrease by using a model with fewer boots and fewer SVM classifiers. If a user is only interested in predicting selected residues in the proteome, to save computing time, one can label each of those residues by appending a mark, "?," e.g. replacing the serines ("S") of interest by "S?", and then Musite will make predictions only for the labeled residues.
Musite also provides other related functionalities, such as customized prediction model training, file format conversion, file statistics, and NR data set building tool integration (37). Customized prediction model training is a unique utility provided by Musite that enables users to train their own models from any phosphorylation data sets. As phosphorylation sites in various species are accumulating rapidly, it is difficult for a phosphorylation prediction tool to keep track of all available phosphorylation data. We have built phosphorylation predic-tion models in six model eukaryotic species, and we are in the progress of building more models in other species. Customized model training makes it possible for users, especially for the non-computational biologists, to train models using phosphorylation data specific to their own work. To train a customized prediction model based on the default parameters, a user only needs to specify the training sequence file (in the format of Musite XML or FASTA), the output model name, and amino acid types of sites. The user must annotate the phosphorylation sites in the input file, which is a simple and straightforward task. For example, if the input file is in the FASTA format, a user can specify a serine residue "S" as a phosphorylation site, appending a mark "#." Although keep-ing simplicity, we also managed to maintain the flexibility of the program. The "Advanced Option" in the training dialog allows users to customize all the parameters for training models using their own data. Details on how to set the training parameters are explained in the tutorials at http://musite-.sourceforge.net. It is worth mentioning that we have provided tools to convert the UniProt XML format to the Musite XML format and extract the phosphorylation data in user-defined organisms so that the users can easily make use of the latest phosphorylation annotations in UniProt/Swiss-Prot when training their own models.
Limitations and Future Work-Although Musite provides a useful alternative strategy for annotating phosphorylation Sensitivities (Sn) at different specificities (Sp) were compared. Different specificity levels were taken as similar as possible (in each column) among different tools. The best performed result in each specificity level (column) for each kinase or kinase family is highlighted in bold. LOO and Self stand for "leave-one-out cross-validation test" and "self-consistency test," respectively. events in proteomes, it has some limitations, most of which are common for all current phosphorylation site prediction tools. First of all, although computational predictions indicate the possibilities that query sites can/cannot be phosphorylated, our predicted results have not been correlated to different cell states or tissue conditions. This is one of the reasons that we provided a utility of customized model training in Musite. However, users still have to provide high quality training data related to specific cellular conditions, which are sparse in the current phosphorylation studies. Second, the phosphorylation sites used in the training data were mostly identified by mass spectrometry methods, which may have inherent bias in terms of representing the global phosphorylation events and hence affect the prediction performance. As techniques like electron transfer dissociation and alternative proteases are helping to resolve technology limitations, more complete phosphorylation data sets will be released. We will adapt our program and prediction models as the new data become available. Another limitation of the data is that we have only labeled positive data, but we do not have labeled negative data (i.e. we do not know whether the non-phosphorylation sites are truly negatives), and therefore, if some of them are predicted as phosphorylation sites, we do not know whether they are false positives. This fact makes it hardly feasible to estimate the precision (TP/(TP ϩ FP)) where TP is true positive and FP is false positive) of any phosphorylation site prediction tool. Moreover, ignorance of the inherent prior ratio between the positive and negative data, which is hard to estimate, also created some bias in predictions. For future work, we will explore other methods, such as semisupervised learning, to address these limitations. We will include more kinases/kinase families and more organisms, extend Musite to other types of post-translational modifications, and integrate Musite in work flows of experimental phosphorylation studies. CONCLUSION Annotation of protein phosphorylation sites in proteomes is a crucial step to decode the signaling networks in living cells. In recent years, tens of thousands of phosphorylation sites in various species have been identified by large scale mass spectrometry-based studies. However, the vast majority of phosphorylation sites, especially in non-mammal species, still remain undiscovered. Considering limitations of mass spectrometry-based experimental studies, a more practical and efficient approach will be in silico large scale phosphorylation site prediction. In this work, we presented a new bioinformatics tool, Musite, specifically designed for large scale prediction of phosphorylation sites. Musite modeled phosphorylation site prediction as a binary classification problem with highly unbalanced data sets and solved it with a comprehensive machine-learning approach. After studying the properties of phosphorylation sites and their surrounding sequences, we adopted three sets of features to distinguish phosphorylation sites from non-phosphorylation sites: KNN scores, protein disorder scores, and amino acid frequencies. KNN scores were utilized to take advantage of the sequence cluster information around phosphorylation sites. Disorder scores and amino acid frequencies were used to characterize the generic patterns of phosphorylation sites. By combining both sequence and generic features, Musite is capable of identifying both phosphorylation sites with local sequence patterns similar to known phosphorylation sites and those beyond local sequence similarities. Combining all three sets of features, we have trained models based on a bootstrap aggregating procedure for predicting both general and kinase-specific phosphorylation sites in multiple organisms. It should be noted that the pretrained models in Musite V1.0 were not correlated to any particular cellular conditions. To perform condition-specific predictions, users can train customized models from phosphorylation data of a certain cellular condition. Proteome-wide predictions of phosphorylation sites were performed for six organisms. Cross-validation tests and comparisons with other tools show that Musite performs better on general predictions and at least comparably with existing methods on kinase-specific predictions.

PKA
Musite provides a unique application system, specifically designed for large scale prediction of both general and kinase-specific phosphorylation sites and for better utilizing the large magnitude of experimentally verified phosphorylation sites. Musite is the first tool that provides utility for training a phosphorylation site prediction model from users' own data and supports continuous adjustment of stringency levels. With its user-friendly graphic interface, Musite can be easily used by biologists to make predictions on their sequences and train prediction models from phosphorylation data of their own interest. Unlike experimental approaches, computational predictions are capable of proteome-wide predictions without inherent technical biases. Furthermore, Musite could provide an even more powerful and cost-effective approach by combining experimental and computational methods iteratively, which could be especially useful for some hypothesis-driven experiments. Alternatively, for bioinformaticians, Musite can serve as an open platform for building machine-learning applications for phosphorylation site prediction. In conclusion, Musite provides a unique tool for large scale phosphorylation site identification, and it is our hope that Musite will accelerate accumulation of our knowledge on protein phosphorylation and hence help explore the corresponding regulatory networks in living cells.