
GPS 2.0, a Tool to Predict Kinase-specific Phosphorylation Sites in Hierarchy

  • Yu Xue §
  • Jian Ren §
  • Xinjiao Gao
  • Changjiang Jin
  • Longping Wen
    Correspondence
    To whom correspondence may be addressed. Tel.: 86-551-3600051; Fax: 86-551-3600426
  • Xuebiao Yao ¶
    Correspondence
    To whom correspondence may be addressed. Tel.: 86-551-3606304; Fax: 86-551-3607141
  • Affiliations
    All authors: Hefei National Laboratory for Physical Sciences at Microscale and School of Life Sciences, University of Science and Technology of China, Hefei, Anhui 230027, China
    Xuebiao Yao also: Department of Physiology and Cancer Biology Program, Morehouse School of Medicine, Atlanta, Georgia 30310
  • Author Footnotes
    § Both authors contributed equally to this work.
    ¶ A Georgia Cancer Coalition Eminent Scholar.
      Identification of protein phosphorylation sites with their cognate protein kinases (PKs) is a key step to delineate molecular dynamics and plasticity underlying a variety of cellular processes. Although nearly 10 kinase-specific prediction programs have been developed, numerous PKs have been casually classified into subgroups without a standard rule. For large scale predictions, the false positive rate has also never been addressed. In this work, we adopted a well established rule to classify PKs into a hierarchical structure with four levels, including group, family, subfamily, and single PK. In addition, we developed a simple approach to estimate the theoretically maximal false positive rates. The on-line service and local packages of the GPS (Group-based Prediction System) 2.0 were implemented in Java with the modified version of the Group-based Phosphorylation Scoring algorithm. As the first stand alone software for predicting phosphorylation, GPS 2.0 can predict kinase-specific phosphorylation sites for 408 human PKs in hierarchy. A large scale prediction of more than 13,000 mammalian phosphorylation sites by GPS 2.0 was exhibited with great performance and remarkable accuracy. Using Aurora-B as an example, we also conducted a proteome-wide search and provided systematic prediction of Aurora-B-specific substrates including protein-protein interaction information. Thus, the GPS 2.0 is a useful tool for predicting protein phosphorylation sites and their cognate kinases and is freely available on line.
      The abbreviations used are: PTM, post-translational modification; PK, protein kinase; FPR, false positive rate; GPS, Group-based Prediction System; Sn, sensitivity; Sp, specificity; Pr, precision; LOO, leave-one-out validation; PSP, phosphorylation site peptide; PKA, protein kinase A; PKB, protein kinase B; BLAST, Basic Local Alignment Search Tool; CDK, cyclin-dependent kinase; MAPK, mitogen-activated protein kinase; AUR, Aurora; GRK, G-protein-coupled receptor kinase; CaMK, Ca2+/calmodulin-dependent protein kinase; TK, tyrosine kinase; PIKK, phosphoinositide 3-kinase-related kinase; ATM, ataxia telangiectasia mutated; PEK, pancreatic eukaryotic initiation factor-2α kinase; sub., substrate; PPI, protein-protein interaction; OS, operating system; TP, true positive; TN, true negative; FP, false positive; FN, false negative; PPSP, prediction of PK-specific phosphorylation; AGC, protein kinase A, G and C family; CMGC, CDKs, MAPKs, GSKs, and CLKs kinase family.
      Post-translational modification of proteins provides a reversible means to regulate the function of a protein in space and time. Recently computational studies of post-translational modifications (PTMs) of proteins have attracted much attention. Various PTMs regulate the functions and dynamics of proteins through specific modifications and are implicated in almost all cellular processes. In contrast to the labor-intensive and expensive experimental methods, in silico prediction of PTM-specific substrates with their sites has emerged as a popular alternative approach. To date, more than 32 computational prediction tools have been developed (
      • Zhou F.F.
      • Xue Y.
      • Yao X.
      • Xu Y.
      A general user interface for prediction servers of proteins' post-translational modification sites.
      ).
      In the field of computational PTMs, protein phosphorylation is the most studied example. To predict general phosphorylation sites, several tools have been developed, such as DISPHOS (
      • Iakoucheva L.M.
      • Radivojac P.
      • Brown C.J.
      • O'Connor T.R.
      • Sikes J.G.
      • Obradovic Z.
      • Dunker A.K.
      The importance of intrinsic disorder for protein phosphorylation.
      ), NetPhos (
      • Blom N.
      • Gammeltoft S.
      • Brunak S.
      Sequence and structure-based prediction of eukaryotic protein phosphorylation sites.
      ), NetPhosYeast (
      • Ingrell C.R.
      • Miller M.L.
      • Jensen O.N.
      • Blom N.
      NetPhosYeast: prediction of protein phosphorylation sites in yeast.
      ), and GANNPhos (
      • Tang Y.R.
      • Chen Y.Z.
      • Canchaya C.A.
      • Zhang Z.
      GANNPhos: a new phosphorylation site predictor based on a genetic algorithm integrated neural network.
      ). As the need for performing large scale predictions and constructing reliable phosphorylation networks evolves, robust prediction of kinase-specific phosphorylation sites has become necessary and challenging. For example, Neuberger et al. (
      • Neuberger G.
      • Schneider G.
      • Eisenhaber F.
      pkaPS: prediction of protein kinase A phosphorylation sites with the simplified kinase-substrate binding model.
      ) used pkaPS to predict potential protein kinase A (PKA) sites in the human proteome directly. With Predikin, Brinkworth et al. (
      • Brinkworth R.I.
      • Munn A.L.
      • Kobe B.
      Protein kinases associated with the yeast phosphoproteome.
      ) predicted cognate PKs for 383 unannotated phosphorylation sites of 216 peptide sequences in yeast. Chang et al. (
      • Chang E.J.
      • Begum R.
      • Chait B.T.
      • Gaasterland T.
      Prediction of cyclin-dependent kinase phosphorylation substrates.
      ) predicted 91 highly probable CDK substrates in budding yeast using the position-specific scoring matrix motif approach. Recently Linding et al. (
      • Linding R.
      • Jensen L.J.
      • Ostheimer G.J.
      • van Vugt M.A.
      • Jorgensen C.
      • Miron I.M.
      • Diella F.
      • Colwill K.
      • Taylor L.
      • Elder K.
      • Metalnikov P.
      • Nguyen V.
      • Pasculescu A.
      • Jin J.
      • Park J.G.
      • Samson L.D.
      • Woodgett J.R.
      • Russell R.B.
      • Bork P.
      • Yaffe M.B.
      • Pawson T.
      Systematic discovery of in vivo phosphorylation networks.
      ) developed NetworKIN and constructed a human phosphorylation network, which has gained diversified interest not only for human phosphorylation network prediction but also for general implication in cell biology. To predict kinase-specific phosphorylation sites, several on-line Web services have been implemented using various algorithms, including our previous work of GPS (
      • Xue Y.
      • Zhou F.
      • Zhu M.
      • Ahmed K.
      • Chen G.
      • Yao X.
      GPS: a comprehensive www server for phosphorylation sites prediction.
      ,
      • Zhou F.F.
      • Xue Y.
      • Chen G.L.
      • Yao X.
      GPS: a novel group-based phosphorylation predicting and scoring method.
      ) and PPSP (
      • Xue Y.
      • Li A.
      • Wang L.
      • Feng H.
      • Yao X.
      PPSP: prediction of PK-specific phosphorylation site with Bayesian decision theory.
      ), NetPhosK (
      • Blom N.
      • Sicheritz-Ponten T.
      • Gupta R.
      • Gammeltoft S.
      • Brunak S.
      Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence.
      ), ScanSite (
      • Obenauer J.C.
      • Cantley L.C.
      • Yaffe M.B.
      Scansite 2.0: proteome-wide prediction of cell signaling interactions using short sequence motifs.
      ), KinasePhos (
      • Huang H.D.
      • Lee T.Y.
      • Tzeng S.W.
      • Horng J.T.
      KinasePhos: a web tool for identifying protein kinase-specific phosphorylation sites.
      ,
      • Wong Y.H.
      • Lee T.Y.
      • Liang H.K.
      • Huang C.M.
      • Wang T.Y.
      • Yang Y.H.
      • Chu C.H.
      • Huang H.D.
      • Ko M.T.
      • Hwang J.K.
      KinasePhos 2.0: a web server for identifying protein kinase-specific phosphorylation sites based on sequences and coupling patterns.
      ), PredPhospho (
      • Kim J.H.
      • Lee J.
      • Oh B.
      • Kimm K.
      • Koh I.
      Prediction of phosphorylation sites using SVMs.
      ), Predikin (
      • Brinkworth R.I.
      • Breinl R.A.
      • Kobe B.
      Structural basis and prediction of substrate specificity in protein serine/threonine kinases.
      ), PhoScan (
      • Li T.
      • Li F.
      • Zhang X.
      Prediction of kinase-specific phosphorylation sites with sequence features by a log-odds ratio approach.
      ), pkaPS (
      • Neuberger G.
      • Schneider G.
      • Eisenhaber F.
      pkaPS: prediction of protein kinase A phosphorylation sites with the simplified kinase-substrate binding model.
      ), etc.
      Although ∼10 predictors are already available, two essential issues remain unresolved. First, previous work used no standard rule for protein kinase (PK) classification: we and others clustered PKs into subgroups ad hoc, by sequence similarity from BLAST results (
      • Linding R.
      • Jensen L.J.
      • Ostheimer G.J.
      • van Vugt M.A.
      • Jorgensen C.
      • Miron I.M.
      • Diella F.
      • Colwill K.
      • Taylor L.
      • Elder K.
      • Metalnikov P.
      • Nguyen V.
      • Pasculescu A.
      • Jin J.
      • Park J.G.
      • Samson L.D.
      • Woodgett J.R.
      • Russell R.B.
      • Bork P.
      • Yaffe M.B.
      • Pawson T.
      Systematic discovery of in vivo phosphorylation networks.
      ,
      • Xue Y.
      • Zhou F.
      • Zhu M.
      • Ahmed K.
      • Chen G.
      • Yao X.
      GPS: a comprehensive www server for phosphorylation sites prediction.
      ,
      • Zhou F.F.
      • Xue Y.
      • Chen G.L.
      • Yao X.
      GPS: a novel group-based phosphorylation predicting and scoring method.
      ,
      • Xue Y.
      • Li A.
      • Wang L.
      • Feng H.
      • Yao X.
      PPSP: prediction of PK-specific phosphorylation site with Bayesian decision theory.
      ,
      • Blom N.
      • Sicheritz-Ponten T.
      • Gupta R.
      • Gammeltoft S.
      • Brunak S.
      Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence.
      ,
      • Huang H.D.
      • Lee T.Y.
      • Tzeng S.W.
      • Horng J.T.
      KinasePhos: a web tool for identifying protein kinase-specific phosphorylation sites.
      ,
      • Wong Y.H.
      • Lee T.Y.
      • Liang H.K.
      • Huang C.M.
      • Wang T.Y.
      • Yang Y.H.
      • Chu C.H.
      • Huang H.D.
      • Ko M.T.
      • Hwang J.K.
      KinasePhos 2.0: a web server for identifying protein kinase-specific phosphorylation sites based on sequences and coupling patterns.
      ,
      • Kim J.H.
      • Lee J.
      • Oh B.
      • Kimm K.
      • Koh I.
      Prediction of phosphorylation sites using SVMs.
      ,
      • Li T.
      • Li F.
      • Zhang X.
      Prediction of kinase-specific phosphorylation sites with sequence features by a log-odds ratio approach.
      ). The thresholds of PK classification varied in the previous publications, and the final PK subgroups were also quite different. Another issue is control of false positive rate (FPR) for large scale predictions. Usually the bona fide phosphorylation sites are only a small proportion of total Ser/Thr or Tyr residues present within a protein sequence. Thus, many false positive hits in the total prediction results could be generated even for a small FPR.
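      The effect of a small FPR on a large pool of negative residues can be made concrete with a short calculation. The sketch below is only an illustration: the function name and all input numbers (1,000 candidate residues, 20 real sites, Sn of 0.8, FPR of 0.05) are hypothetical and are not taken from this work.

```python
def expected_hits(n_candidates, n_true_sites, sn, fpr):
    """Expected true/false positive counts and precision for one prediction run.

    All arguments are illustrative assumptions, not values from the paper.
    """
    n_negative = n_candidates - n_true_sites
    tp = sn * n_true_sites          # real sites recovered
    fp = fpr * n_negative           # negatives wrongly predicted
    precision = tp / (tp + fp)
    return tp, fp, precision


# Hypothetical substrate set: 1,000 Ser/Thr residues, 20 of them real sites.
tp, fp, precision = expected_hits(1000, 20, sn=0.8, fpr=0.05)
print(f"TP={tp:.0f} FP={fp:.0f} precision={precision:.2f}")
```

      With these assumed inputs, false positive hits outnumber true hits about three to one even though the FPR is only 5%, which is why the FPR has to be controlled explicitly for large scale predictions.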
      In this work, we refined the GPS software (Group-based Prediction System, version 2.0) for predicting kinase-specific phosphorylation sites in hierarchy. We adopted a PK classification established by Manning et al. (
      • Manning G.
      • Whyte D.B.
      • Martinez R.
      • Hunter T.
      • Sudarsanam S.
      The protein kinase complement of the human genome.
      ) as the standard rule to cluster the human PKs into a hierarchical structure with four levels, including group, family, subfamily, and single PK. The training data were taken from Phospho.ELM 6.0 (
      • Diella F.
      • Cameron S.
      • Gemund C.
      • Linding R.
      • Via A.
      • Kuster B.
      • Sicheritz-Ponten T.
      • Blom N.
      • Gibson T.J.
      Phospho. ELM: a database of experimentally verified phosphorylation sites in eukaryotic proteins.
      ), and the modified version of the Group-based Phosphorylation Scoring algorithm (
      • Xue Y.
      • Zhou F.
      • Zhu M.
      • Ahmed K.
      • Chen G.
      • Yao X.
      GPS: a comprehensive www server for phosphorylation sites prediction.
      ,
      • Zhou F.F.
      • Xue Y.
      • Chen G.L.
      • Yao X.
      GPS: a novel group-based phosphorylation predicting and scoring method.
      ) was used. Also we defined a simple rule to calculate the theoretically maximal FPRs. Three cutoffs of high, medium, and low thresholds were established with FPRs of 2, 6, and 10% for serine/threonine kinases and 4, 9, and 15% for tyrosine kinases, respectively. The performance and robustness of the prediction system were extensively evaluated by self-consistency, leave-one-out validation, and 4-, 6-, 8-, and 10-fold cross-validations. Compared with other existing tools, GPS 2.0 carries a greater computational power with superior performance. The on-line Web server version and local packages of GPS 2.0 were implemented in Java and can predict kinase-specific phosphorylation sites for 408 PKs in human. Moreover we used GPS 2.0 to conduct a large scale prediction of more than 13,000 mammalian phosphorylation sites in which GPS 2.0 exhibited remarkable performance. Finally we demonstrated the accuracy of GPS 2.0 prediction based on a proteome-wide search for Aurora-B cognate substrates. Taken together, GPS 2.0 offers greater precision and computing power on predicting protein phosphorylation and enzyme-substrate relationship.

      EXPERIMENTAL PROCEDURES

       Protein Kinase Classification for the Training Data Set—

      The training data set was derived from Phospho.ELM 6.0 (
      • Diella F.
      • Cameron S.
      • Gemund C.
      • Linding R.
      • Via A.
      • Kuster B.
      • Sicheritz-Ponten T.
      • Blom N.
      • Gibson T.J.
      Phospho. ELM: a database of experimentally verified phosphorylation sites in eukaryotic proteins.
      ), including 13,615 experimentally verified phosphorylation sites. First the redundant records were removed leaving 13,577 non-redundant entries. Then 3,161 non-redundant sites with respective kinase information were used for training. Because most of the verified sites were mammalian (13,254 of 13,579, ∼97.6%), we adopted a well established rule for human PK classification (
      • Manning G.
      • Whyte D.B.
      • Martinez R.
      • Hunter T.
      • Sudarsanam S.
      The protein kinase complement of the human genome.
      ,
      • Caenepeel S.
      • Charydczak G.
      • Sudarsanam S.
      • Hunter T.
      • Manning G.
      The mouse kinome: discovery and comparative genomics of all mouse protein kinases.
      ) to cluster various PKs with their verified sites into a hierarchical structure with four levels, including group, family, subfamily, and single PK (
      • Manning G.
      • Whyte D.B.
      • Martinez R.
      • Hunter T.
      • Sudarsanam S.
      The protein kinase complement of the human genome.
      ,
      • Caenepeel S.
      • Charydczak G.
      • Sudarsanam S.
      • Hunter T.
      • Manning G.
      The mouse kinome: discovery and comparative genomics of all mouse protein kinases.
      ) (see supplemental Table S1). PK groups with fewer than three sites were excluded from this study.
      The training data could be reused several times and included in different PK clusters (Fig. 1). For example, in the AGC group, the experimental sites with PK information of PKB_group, PKBβ, PKAα, PKA_group, and other AGC kinases were used as the training data. In the AGC/AKT family, the verified sites with PK information of PKB_group and PKBβ were used. For AGC/AKT/AKT2, only the verified sites with PK information of PKBβ were used. Likewise, for the AGC/PKA family, only the verified sites with PK information of PKAα and PKA_group were used. Currently only two PKAα sites have been identified. Thus, the PK cluster of AGC/PKA/PKAα was not used in GPS 2.0.
      Fig. 1. The training data could be reused several times and included in different PK clusters based on their cognate PK information.
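      The reuse of training sites across hierarchy levels can be sketched as follows. This is a minimal illustration, not the GPS 2.0 implementation: the mapping `LABEL_TO_PATH`, the function `build_clusters`, and the site identifiers are hypothetical, and only the cluster names and the three-site minimum come from the text above.

```python
from collections import defaultdict

# Hypothetical mapping from a kinase annotation to its path in the hierarchy
# (group/family/subfamily-or-PK), using the AGC examples from the text.
LABEL_TO_PATH = {
    "PKB_group": "AGC/AKT",
    "PKBβ": "AGC/AKT/AKT2",
    "PKA_group": "AGC/PKA",
    "PKAα": "AGC/PKA/PKAα",
}


def build_clusters(annotated_sites, min_sites=3):
    """Add every site to the cluster of its kinase and to all ancestor clusters."""
    clusters = defaultdict(set)
    for site_id, label in annotated_sites:
        parts = LABEL_TO_PATH[label].split("/")
        for depth in range(1, len(parts) + 1):
            clusters["/".join(parts[:depth])].add(site_id)
    # Clusters with fewer than `min_sites` sites are not used for training.
    return {path: sites for path, sites in clusters.items() if len(sites) >= min_sites}


# Made-up site identifiers, for illustration only.
sites = [
    ("s1", "PKBβ"), ("s2", "PKBβ"), ("s3", "PKBβ"), ("s4", "PKB_group"),
    ("s5", "PKAα"), ("s6", "PKAα"), ("s7", "PKA_group"), ("s8", "PKA_group"),
]
clusters = build_clusters(sites)
for path in sorted(clusters):
    print(path, len(clusters[path]))
```

      In this toy input the AGC/PKA/PKAα cluster holds only two sites and is dropped, mirroring the exclusion described above, while the same two sites still contribute to AGC/PKA and to AGC.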
      It has been reported that there are 518 human PKs identified (
      • Manning G.
      • Whyte D.B.
      • Martinez R.
      • Hunter T.
      • Sudarsanam S.
      The protein kinase complement of the human genome.
      ). After careful curation, we found that PKG1 had two paralogs in human rather than one gene. In this regard, the total human kinome contains 519 unique PKs. As previously described, we used the experimentally verified phosphorylation sites as the positive data (+), whereas all other residues (Ser/Thr or Tyr) in the same substrates were regarded as the negative data (−) (
      • Xue Y.
      • Zhou F.
      • Zhu M.
      • Ahmed K.
      • Chen G.
      • Yao X.
      GPS: a comprehensive www server for phosphorylation sites prediction.
      ,
      • Zhou F.F.
      • Xue Y.
      • Chen G.L.
      • Yao X.
      GPS: a novel group-based phosphorylation predicting and scoring method.
      ,
      • Xue Y.
      • Li A.
      • Wang L.
      • Feng H.
      • Yao X.
      PPSP: prediction of PK-specific phosphorylation site with Bayesian decision theory.
      ,
      • Huang H.D.
      • Lee T.Y.
      • Tzeng S.W.
      • Horng J.T.
      KinasePhos: a web tool for identifying protein kinase-specific phosphorylation sites.
      ,
      • Wong Y.H.
      • Lee T.Y.
      • Liang H.K.
      • Huang C.M.
      • Wang T.Y.
      • Yang Y.H.
      • Chu C.H.
      • Huang H.D.
      • Ko M.T.
      • Hwang J.K.
      KinasePhos 2.0: a web server for identifying protein kinase-specific phosphorylation sites based on sequences and coupling patterns.
      ,
      • Kim J.H.
      • Lee J.
      • Oh B.
      • Kimm K.
      • Koh I.
      Prediction of phosphorylation sites using SVMs.
      ).

       Evaluation of Prediction Performance and Robustness of GPS 2.0—

      The self-consistency validation was performed to evaluate the prediction performance. The leave-one-out (jackknife) validation and 4-, 6-, 8-, and 10-fold cross-validations were performed to evaluate the robustness and stability of the prediction system. Four standard measurements, accuracy (Ac), sensitivity (Sn), specificity (Sp), and the Matthews correlation coefficient (MCC), were defined as follows.
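      \[ Sn = \frac{TP}{TP + FN} \]
      (Eq. 1)


      \[ Sp = \frac{TN}{TN + FP} \]
      (Eq. 2)


      \[ Ac = \frac{TP + TN}{TP + FP + TN + FN} \]
      (Eq. 3)


      \[ MCC = \frac{(TP \times TN) - (FN \times FP)}{\sqrt{(TP + FN) \times (TN + FP) \times (TP + FP) \times (TN + FN)}} \]
      (Eq. 4)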


      The results of n-fold cross-validation were very similar to those with the leave-one-out validation (see supplemental Fig. S1). To simplify the analysis, we only adopted performances of the self-consistency and leave-one-out validation for further analysis. The receiver operating characteristic curves were drawn for 70 PK groups with ≥30 sites with the x axis of 1 − Specificity and the y axis of Sensitivity (see supplemental Fig. S2).

       The Modified Version of the Group-based Phosphorylation Scoring Method Algorithm—

      To predict kinase-specific phosphorylation sites, we used our previous Group-based Phosphorylation Scoring method with improvement (
      • Xue Y.
      • Zhou F.
      • Zhu M.
      • Ahmed K.
      • Chen G.
      • Yao X.
      GPS: a comprehensive www server for phosphorylation sites prediction.
      ,
      • Zhou F.F.
      • Xue Y.
      • Chen G.L.
      • Yao X.
      GPS: a novel group-based phosphorylation predicting and scoring method.
      ). First we defined a phosphorylation site peptide PSP(m, n) as a serine (Ser), threonine (Thr), or tyrosine (Tyr) amino acid flanked by m residues upstream and n residues downstream. The chief hypothesis of the algorithm is that if two short peptides share high sequence homology they may also bear similar three-dimensional structures and biochemical properties. Then we used the amino acid substitution matrix BLOSUM62 to calculate the similarity between two PSP(7, 7) peptides.
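      Extracting PSP(m, n) peptides from a substrate sequence can be sketched as below. The function name is hypothetical, and padding sites near the termini with a placeholder character is an assumption made for this illustration; the text does not say how GPS 2.0 treats such sites.

```python
def extract_psps(sequence, m=7, n=7, pad="*"):
    """Return (1-based position, residue, PSP(m, n) peptide) for every S/T/Y.

    Assumption: windows that run past either terminus are filled with `pad`.
    """
    padded = pad * m + sequence + pad * n
    psps = []
    for i, residue in enumerate(sequence):
        if residue in "STY":
            # Position i in `sequence` sits at index i + m in `padded`,
            # so the window starts at i and spans m + n + 1 characters.
            psps.append((i + 1, residue, padded[i:i + m + n + 1]))
    return psps


print(extract_psps("AAAAAAASAAAAAAA"))
print(extract_psps("SA"))
```

      Each returned peptide has length m + n + 1 (15 for PSP(7, 7)) with the candidate residue at the center.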
      As described previously (
      • Xue Y.
      • Zhou F.
      • Zhu M.
      • Ahmed K.
      • Chen G.
      • Yao X.
      GPS: a comprehensive www server for phosphorylation sites prediction.
      ,
      • Zhou F.F.
      • Xue Y.
      • Chen G.L.
      • Yao X.
      GPS: a novel group-based phosphorylation predicting and scoring method.
      ), for two amino acids a and b, let the substitution score between them in BLOSUM62 be Score(a, b). The similarity between two PSP(7, 7) peptides (15 amino acids) A and B is defined as follows.
      \[ S(A, B) = \sum_{1 \le i \le 15} Score(A[i], B[i]) \]
      (Eq. 5)


      If S(A, B) < 0, we simply redefine S(A, B) = 0.
      A putative PSP(7, 7) peptide is compared pairwise with all known sites, and a similarity score is calculated for each comparison. The average of these similarity scores is the final prediction score of the given site. The basic idea of the Group-based Phosphorylation Scoring algorithm is diagrammed in Fig. 2. The gray dots represent the positive sites. Shorter distances indicate higher similarity scores between two sites. Given a putative PSP(7, 7) peptide, we can calculate its score and then judge whether the given site is a potential phosphorylation site under different thresholds.
      Fig. 2. The basic idea of the Group-based Phosphorylation Scoring algorithm. The gray dots represent the positive sites. The nearer distances indicate higher similarity scores between two sites. Given a putative PSP(7, 7) peptide, we can calculate its score. Then we can judge whether the given site is a potentially real phosphorylation site under different thresholds.
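      The scoring step described above (Eq. 5, the reset of negative similarities to zero, and the average over all known sites) can be sketched in a few lines. The function names are hypothetical, and `toy_score` is a deliberately simple stand-in for BLOSUM62 so that the example runs without external data; a real run would look up BLOSUM62 entries instead.

```python
def similarity(a, b, score):
    """Eq. 5: position-wise substitution scores summed, negatives reset to 0."""
    total = sum(score(x, y) for x, y in zip(a, b))
    return max(total, 0)


def gps_score(candidate, known_sites, score):
    """Average similarity between a candidate PSP and all known site peptides."""
    return sum(similarity(candidate, k, score) for k in known_sites) / len(known_sites)


def toy_score(x, y):
    """Stand-in substitution score (assumption): +1 for identity, -1 otherwise."""
    return 1 if x == y else -1


known = ["AAAAAAASAAAAAAA", "GGGGGGGSGGGGGGG"]
print(gps_score("AAAAAAASAAAAAAA", known, toy_score))
```

      In this toy case the candidate matches the first known site at all 15 positions (similarity 15) and the second only at the central residue (sum of -13, reset to 0), so the prediction score is 7.5. The site would be predicted positive if this score reaches the cutoff chosen for the PK cluster.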
      In previous versions (GPS 1.0 and 1.10), we hypothesized that the bona fide pattern for PK recognition and modification might be compromised by heterogeneity of multiple structural determinants with different features. Accordingly, all known phosphorylation sites were automatically partitioned into several clusters with the Markov cluster algorithm to improve the prediction performance (
      • Xue Y.
      • Zhou F.
      • Zhu M.
      • Ahmed K.
      • Chen G.
      • Yao X.
      GPS: a comprehensive www server for phosphorylation sites prediction.
      ,
      • Zhou F.F.
      • Xue Y.
      • Chen G.L.
      • Yao X.
      GPS: a novel group-based phosphorylation predicting and scoring method.
      ). However, only ∼11% of the PK groups (eight of 71) could be divided into more than one cluster with improved performances. Thus, the clustering method was not used in GPS 2.0.
      To improve the robustness of the prediction system globally without influencing the prediction performance significantly, we developed a simple method of matrix mutation (Fig. 3). First the amino acid substitution matrix BLOSUM62 was chosen as the initial matrix. The performance (Sn and Sp) of leave-one-out validation for each PK group was calculated. Then we fixed Sp at 90% and improved Sn by matrix mutation. The process of matrix mutation was halted when the Sn value no longer increased. Although other types of matrix mutation were also valid, the method we used in this study improved the leave-one-out validation significantly, whereas the self-consistency was only influenced moderately. Thus, this procedure made GPS 2.0 more robust and stable.
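      A possible shape for this procedure is sketched below. The text does not specify how individual mutations are proposed, so everything concrete here is an assumption: the function names, the ±1 change to one randomly chosen matrix entry, the fixed number of rounds in place of the stopping rule, and the caller-supplied `evaluate` function, which is expected to return the leave-one-out Sn at Sp = 90% for a given matrix.

```python
import random


def sn_at_fixed_sp(pos_scores, neg_scores, sp=0.90):
    """Sn at the lowest cutoff whose Sp is at least `sp` (positive if score >= cutoff)."""
    for cutoff in sorted(set(pos_scores) | set(neg_scores)):
        spec = sum(s < cutoff for s in neg_scores) / len(neg_scores)
        if spec >= sp:
            return sum(s >= cutoff for s in pos_scores) / len(pos_scores)
    return 0.0


def mutate_matrix(matrix, evaluate, rounds=200, seed=0):
    """Greedy hill climbing: keep a mutated matrix only if Sn strictly increases.

    `matrix` maps residue pairs to integer scores; it is copied, not modified.
    """
    rng = random.Random(seed)
    best = dict(matrix)
    best_sn = evaluate(best)
    keys = sorted(best)
    for _ in range(rounds):
        trial = dict(best)
        trial[rng.choice(keys)] += rng.choice((-1, 1))  # assumed mutation step
        sn = evaluate(trial)
        if sn > best_sn:
            best, best_sn = trial, sn
    return best, best_sn
```

      Because a mutation is accepted only when Sn improves, the returned Sn can never fall below the value obtained with the initial matrix.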

       Control of FPR—

      To estimate the FPR, we tried to construct a near-negative data set by several approaches. The first method was to generate PSP(7, 7) peptides randomly, with all amino acids equally likely. However, the abundances of the 20 amino acids are not equal in eukaryotes, so this method was not used because it could not reflect the real distributions of PSP(7, 7) peptides in proteomes. The second method was to retrieve negative sites randomly from eukaryotic proteomes. However, this method needs a large sequence file to retrieve PSP(7, 7) peptides, which would slow the computation. In this study, we chose a simple and fast third method to construct the near-negative data set. First we calculated the amino acid composition in six organisms, including Saccharomyces cerevisiae, Schizosaccharomyces pombe, Caenorhabditis elegans, Drosophila melanogaster, Mus musculus, and Homo sapiens. Then we randomly generated PSP(7, 7) peptides based on the real frequencies of the 20 amino acids. The FPR values obtained with the latter two methods were very similar. By this method, we randomly generated 10,000 PSP(7, 7) peptides and used GPS 2.0 to estimate the theoretically maximal FPR. The process was repeated 20 times, and the mean value was taken as the final FPR.
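      The composition-based estimate can be sketched as follows. The function names are hypothetical, `predict_score` stands for any scoring function supplied by the caller, and fixing the central residue to the residue type of interest is an assumption; the text does not state how the center of a random peptide was chosen.

```python
import random
from collections import Counter
from statistics import mean


def composition(sequences):
    """Amino acid frequencies observed in a collection of protein sequences."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def random_psp(freqs, center, rng, m=7, n=7):
    """Random PSP(m, n): flanks drawn from `freqs`, center fixed (assumption)."""
    residues = list(freqs)
    weights = [freqs[aa] for aa in residues]
    up = rng.choices(residues, weights=weights, k=m)
    down = rng.choices(residues, weights=weights, k=n)
    return "".join(up) + center + "".join(down)


def estimate_fpr(predict_score, cutoff, freqs, center="S",
                 n_peptides=10_000, repeats=20, seed=0):
    """Mean fraction of random peptides scoring at or above `cutoff`."""
    rng = random.Random(seed)
    rates = []
    for _ in range(repeats):
        hits = sum(predict_score(random_psp(freqs, center, rng)) >= cutoff
                   for _ in range(n_peptides))
        rates.append(hits / n_peptides)
    return mean(rates)
```

      The defaults of 10,000 peptides and 20 repeats follow the numbers given above. Since every random peptide is treated as a negative, any real site generated by chance is counted as a false hit, which is consistent with describing the result as a theoretically maximal FPR.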

       Threshold Setting—

      Threshold setting was also a difficult problem. In general, we and others have chosen different thresholds for every PK group (
      • Neuberger G.
      • Schneider G.
      • Eisenhaber F.
      pkaPS: prediction of protein kinase A phosphorylation sites with the simplified kinase-substrate binding model.
      ,
      • Xue Y.
      • Zhou F.
      • Zhu M.
      • Ahmed K.
      • Chen G.
      • Yao X.
      GPS: a comprehensive www server for phosphorylation sites prediction.
      ,
      • Zhou F.F.
      • Xue Y.
      • Chen G.L.
      • Yao X.
      GPS: a novel group-based phosphorylation predicting and scoring method.
      ,
      • Xue Y.
      • Li A.
      • Wang L.
      • Feng H.
      • Yao X.
      PPSP: prediction of PK-specific phosphorylation site with Bayesian decision theory.
      ,
      • Blom N.
      • Sicheritz-Ponten T.
      • Gupta R.
      • Gammeltoft S.
      • Brunak S.
      Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence.
      ,
      • Obenauer J.C.
      • Cantley L.C.
      • Yaffe M.B.
      Scansite 2.0: proteome-wide prediction of cell signaling interactions using short sequence motifs.
      ,
      • Huang H.D.
      • Lee T.Y.
      • Tzeng S.W.
      • Horng J.T.
      KinasePhos: a web tool for identifying protein kinase-specific phosphorylation sites.
      ,
      • Wong Y.H.
      • Lee T.Y.
      • Liang H.K.
      • Huang C.M.
      • Wang T.Y.
      • Yang Y.H.
      • Chu C.H.
      • Huang H.D.
      • Ko M.T.
      • Hwang J.K.
      KinasePhos 2.0: a web server for identifying protein kinase-specific phosphorylation sites based on sequences and coupling patterns.
      ,
      • Kim J.H.
      • Lee J.
      • Oh B.
      • Kimm K.
      • Koh I.
      Prediction of phosphorylation sites using SVMs.
      ,
      • Brinkworth R.I.
      • Breinl R.A.
      • Kobe B.
      Structural basis and prediction of substrate specificity in protein serine/threonine kinases.
      ,
      • Li T.
      • Li F.
      • Zhang X.
      Prediction of kinase-specific phosphorylation sites with sequence features by a log-odds ratio approach.
      ). Here we propose a uniform rule to choose cutoff values based on calculated FPRs. For serine/threonine kinases, the high, medium, and low thresholds were established with FPRs of 2, 6, and 10%. For tyrosine kinases, the high, medium, and low thresholds were selected with FPRs of 4, 9, and 15%. The high threshold was validated by a large scale prediction of mammalian phosphorylation sites with satisfactory performance. The medium threshold is less stringent and is useful in small scale experiments. The low threshold sacrifices Sp to improve Sn considerably; this is very useful in extensive experimental identification of all potential phosphorylation sites in substrates.
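      Given the scores of a near-negative data set, a cutoff for each target FPR can be read off directly. The sketch below is one straightforward way to do this and is not claimed to be the GPS 2.0 procedure: the function name and the toy score list are hypothetical, and only the target FPR values come from the text.

```python
# Target FPRs from the text, by kinase type and threshold level.
TARGET_FPR = {
    "Ser/Thr": {"high": 0.02, "medium": 0.06, "low": 0.10},
    "Tyr": {"high": 0.04, "medium": 0.09, "low": 0.15},
}


def cutoff_for_fpr(negative_scores, target_fpr):
    """Lowest observed score c such that the share of scores >= c is <= target_fpr."""
    n = len(negative_scores)
    for cutoff in sorted(set(negative_scores)):
        if sum(s >= cutoff for s in negative_scores) / n <= target_fpr:
            return cutoff
    return float("inf")  # no observed score is strict enough


# Toy near-negative scores 0..99, for illustration only.
scores = list(range(100))
for level, fpr in TARGET_FPR["Ser/Thr"].items():
    print(level, cutoff_for_fpr(scores, fpr))
```

      A lower target FPR yields a higher cutoff, so the high threshold is the most stringent of the three.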

      RESULTS

       Construction of the GPS 2.0 Software—

      The construction of the GPS 2.0 software is summarized in Fig. 4. A widely adopted hypothesis for predicting kinase-specific phosphorylation sites is that PKs in the same group/subfamily recognize similar sequence patterns of substrates for modification (
      • Linding R.
      • Jensen L.J.
      • Ostheimer G.J.
      • van Vugt M.A.
      • Jorgensen C.
      • Miron I.M.
      • Diella F.
      • Colwill K.
      • Taylor L.
      • Elder K.
      • Metalnikov P.
      • Nguyen V.
      • Pasculescu A.
      • Jin J.
      • Park J.G.
      • Samson L.D.
      • Woodgett J.R.
      • Russell R.B.
      • Bork P.
      • Yaffe M.B.
      • Pawson T.
      Systematic discovery of in vivo phosphorylation networks.
      ,
      • Xue Y.
      • Zhou F.
      • Zhu M.
      • Ahmed K.
      • Chen G.
      • Yao X.
      GPS: a comprehensive www server for phosphorylation sites prediction.
      ,
      • Zhou F.F.
      • Xue Y.
      • Chen G.L.
      • Yao X.
      GPS: a novel group-based phosphorylation predicting and scoring method.
      ,
      • Xue Y.
      • Li A.
      • Wang L.
      • Feng H.
      • Yao X.
      PPSP: prediction of PK-specific phosphorylation site with Bayesian decision theory.
      ,
      • Blom N.
      • Sicheritz-Ponten T.
      • Gupta R.
      • Gammeltoft S.
      • Brunak S.
      Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence.
      ,
      • Obenauer J.C.
      • Cantley L.C.
      • Yaffe M.B.
      Scansite 2.0: proteome-wide prediction of cell signaling interactions using short sequence motifs.
      ,
      • Huang H.D.
      • Lee T.Y.
      • Tzeng S.W.
      • Horng J.T.
      KinasePhos: a web tool for identifying protein kinase-specific phosphorylation sites.
      ,
      • Wong Y.H.
      • Lee T.Y.
      • Liang H.K.
      • Huang C.M.
      • Wang T.Y.
      • Yang Y.H.
      • Chu C.H.
      • Huang H.D.
      • Ko M.T.
      • Hwang J.K.
      KinasePhos 2.0: a web server for identifying protein kinase-specific phosphorylation sites based on sequences and coupling patterns.
      ,
      • Kim J.H.
      • Lee J.
      • Oh B.
      • Kimm K.
      • Koh I.
      Prediction of phosphorylation sites using SVMs.
      ,
      • Brinkworth R.I.
      • Breinl R.A.
      • Kobe B.
      Structural basis and prediction of substrate specificity in protein serine/threonine kinases.
      ,
      • Li T.
      • Li F.
      • Zhang X.
      Prediction of kinase-specific phosphorylation sites with sequence features by a log-odds ratio approach.
      ). In previous work, numerous PKs were classified into several groups simply based on sequence comparison by BLAST (
      • Linding R.
      • Jensen L.J.
      • Ostheimer G.J.
      • van Vugt M.A.
      • Jorgensen C.
      • Miron I.M.
      • Diella F.
      • Colwill K.
      • Taylor L.
      • Elder K.
      • Metalnikov P.
      • Nguyen V.
      • Pasculescu A.
      • Jin J.
      • Park J.G.
      • Samson L.D.
      • Woodgett J.R.
      • Russell R.B.
      • Bork P.
      • Yaffe M.B.
      • Pawson T.
      Systematic discovery of in vivo phosphorylation networks.
      ,
      • Xue Y.
      • Zhou F.
      • Zhu M.
      • Ahmed K.
      • Chen G.
      • Yao X.
      GPS: a comprehensive www server for phosphorylation sites prediction.
      ,
      • Zhou F.F.
      • Xue Y.
      • Chen G.L.
      • Yao X.
      GPS: a novel group-based phosphorylation predicting and scoring method.
      ,
      • Xue Y.
      • Li A.
      • Wang L.
      • Feng H.
      • Yao X.
      PPSP: prediction of PK-specific phosphorylation site with Bayesian decision theory.
      ,
      • Blom N.
      • Sicheritz-Ponten T.
      • Gupta R.
      • Gammeltoft S.
      • Brunak S.
      Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence.
      ,
      • Obenauer J.C.
      • Cantley L.C.
      • Yaffe M.B.
      Scansite 2.0: proteome-wide prediction of cell signaling interactions using short sequence motifs.
      ,
      • Huang H.D.
      • Lee T.Y.
      • Tzeng S.W.
      • Horng J.T.
      KinasePhos: a web tool for identifying protein kinase-specific phosphorylation sites.
      ,
      • Wong Y.H.
      • Lee T.Y.
      • Liang H.K.
      • Huang C.M.
      • Wang T.Y.
      • Yang Y.H.
      • Chu C.H.
      • Huang H.D.
      • Ko M.T.
      • Hwang J.K.
      KinasePhos 2.0: a web server for identifying protein kinase-specific phosphorylation sites based on sequences and coupling patterns.
      ,
      • Kim J.H.
      • Lee J.
      • Oh B.
      • Kimm K.
      • Koh I.
      Prediction of phosphorylation sites using SVMs.
      ,
      • Brinkworth R.I.
      • Breinl R.A.
      • Kobe B.
      Structural basis and prediction of substrate specificity in protein serine/threonine kinases.
      ,
      • Li T.
      • Li F.
      • Zhang X.
      Prediction of kinase-specific phosphorylation sites with sequence features by a log-odds ratio approach.
      ). Because the kinomes of several eukaryotic organisms have been comprehensively identified, phylogenetically analyzed, and classified into a hierarchical structure, including group, family, subfamily, and single PK (
      • Manning G.
      • Whyte D.B.
      • Martinez R.
      • Hunter T.
      • Sudarsanam S.
      The protein kinase complement of the human genome.
      ), and because most of the phosphorylation sites in the public database have been experimentally verified in mammals (13,254 of 13,579, ∼97.6%), we directly used the classification of the human kinome as the standard rule for GPS 2.0 (
      • Manning G.
      • Whyte D.B.
      • Martinez R.
      • Hunter T.
      • Sudarsanam S.
      The protein kinase complement of the human genome.
      ). To date, the specific substrates of many kinases have not been identified. To predict potential phosphorylation sites for these kinases, we adopted the hypothesis that kinases in the same group, family, or subfamily recognize similar patterns or motifs in their substrates. For example, both the CDK and MAPK families belong to the CMGC group (see supplemental Table S1) and recognize a general motif of (pS/pT)P (where pS is phosphoserine and pT is phosphothreonine) (
      • Puntervoll P.
      • Linding R.
      • Gemund C.
      • Chabanis-Davidson S.
      • Mattingsdal M.
      • Cameron S.
      • Martin D.M.
      • Ausiello G.
      • Brannetti B.
      • Costantini A.
      • Ferrè F.
      • Maselli V.
      • Via A.
      • Cesareni G.
      • Diella F.
      • Superti-Furga G.
      • Wyrwicz L.
      • Ramu C.
      • McGuigan C.
      • Gudavalli R.
      • Letunic I.
      • Bork P.
      • Rychlewski L.
      • Küster B.
      • Helmer-Citterich M.
      • Hunter W.N.
      • Aasland R.
      • Gibson T.J.
      ELM server: a new resource for investigating short functional sites in modular eukaryotic proteins.
      ). In addition to identifying substrates of well known PKs, GPS 2.0 can also predict phosphorylation sites for many novel or less characterized PKs. Its prediction coverage is also broader than that of the existing programs examined. For example, GPS 1.0 and 1.10 (
      • Xue Y.
      • Zhou F.
      • Zhu M.
      • Ahmed K.
      • Chen G.
      • Yao X.
      GPS: a comprehensive www server for phosphorylation sites prediction.
      ,
      • Zhou F.F.
      • Xue Y.
      • Chen G.L.
      • Yao X.
      GPS: a novel group-based phosphorylation predicting and scoring method.
      ) could predict specific sites for Aurora-A and Aurora-B, respectively, whereas KinasePhos 2.0 could predict sites for the Aurora group (AUR family; see supplemental Table S1). GPS 2.0 provides predictors for the AUR family, Aurora-A, and Aurora-B. Because of data limitations, certain kinases have very few known phosphorylation sites. For example, the numbers of GRK-1, GRK-2, GRK-3, GRK-4, and GRK-5 sites were 8, 28, 4, 4, and 11, respectively (see supplemental Table S3), whereas the number of GRK family sites was 84. When a data set is too small, the prediction is not robust. Because GPS 2.0 provides a hierarchical classification, the experimentalist can choose the predictor at the most appropriate level. The training data set was taken from Phospho.ELM 6.0 (
      • Diella F.
      • Cameron S.
      • Gemund C.
      • Linding R.
      • Via A.
      • Kuster B.
      • Sicheritz-Ponten T.
      • Blom N.
      • Gibson T.J.
      Phospho.ELM: a database of experimentally verified phosphorylation sites in eukaryotic proteins.
      ), which contains 3,161 verified phosphorylation sites with their kinase information. These sites were then hierarchically clustered into groups, families, subfamilies, and single kinases. GPS 2.0 was implemented in Java as both an on-line service and stand alone software (Fig. 5). The current version contains 144 serine/threonine and 69 tyrosine PK clusters and can predict kinase-specific phosphorylation sites for 408 human PKs in hierarchy (see supplemental Tables S1, S2, and S3).
      Fig. 4. The process of construction of the GPS 2.0 software. The training data were taken from the Phospho.ELM 6.0 database. All sites with kinase information were retained. These verified sites and their kinases were then separated into a hierarchical structure with four levels: group, family, subfamily, and single PK. A modified version of the Group-based Phosphorylation Scoring algorithm was used, and matrix mutation was applied to improve the robustness of the prediction system. We then set the high, medium, and low thresholds based on the calculated FPR for each PK cluster. Finally GPS 2.0 was implemented in Java as the first stand alone software for computational prediction of phosphorylation sites.
      Fig. 5. Screen snapshot of the GPS 2.0 software. The protein sequence of rat Spinophilin is used as an example, and the PKA-specific sites predicted with the medium threshold are shown. DMPK, myotonic dystrophy protein kinase; PKC, protein kinase C; PKG, protein kinase G; RSK, ribosomal S6 kinase; SGK, serum- and glucocorticoid-regulated protein kinase; TKL, tyrosine kinase-like.
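The hierarchical clustering step can be pictured with a short sketch. It assumes each training site is annotated with a slash-separated path such as `CMGC/CDK/CDC2/CDC2` (group, family, subfamily, kinase), in the style of the cluster names used in this section, and it adds every site to the cluster of each level on its path. The function name, the record layout, and the toy peptides are hypothetical.

```python
from collections import defaultdict


def cluster_sites(annotated_sites):
    """Collect peptides under every level of their kinase path.

    annotated_sites: iterable of (peptide, "group/family/.../kinase").
    Returns a dict mapping each hierarchy prefix to its list of peptides.
    """
    clusters = defaultdict(list)
    for peptide, kinase_path in annotated_sites:
        levels = kinase_path.split("/")
        # A site annotated at a deep level also trains every parent level.
        for depth in range(1, len(levels) + 1):
            clusters["/".join(levels[:depth])].append(peptide)
    return dict(clusters)


# Hypothetical toy records; the peptides are placeholders, not real sites.
toy_sites = [
    ("AAAAAAASPAAAAAA", "CMGC/CDK/CDC2/CDC2"),
    ("GGGGGGGTPGGGGGG", "CMGC/MAPK"),
    ("AARRAAASAAAAAAA", "AGC/PKA"),
]

clusters = cluster_sites(toy_sites)
for name in sorted(clusters):
    print(name, len(clusters[name]))
```

Pooling sites at the parent levels is what lets a sparsely annotated kinase fall back on its family or group predictor, as in the GRK example above.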

       Matrix Mutation to Improve the Robustness of the Prediction System—

      In our previous work, the BLOSUM62 matrix was used to score the similarity between known phosphorylation sites and a given site (
      • Xue Y.
      • Zhou F.
      • Zhu M.
      • Ahmed K.
      • Chen G.
      • Yao X.
      GPS: a comprehensive www server for phosphorylation sites prediction.
      ,
      • Zhou F.F.
      • Xue Y.
      • Chen G.L.
      • Yao X.
      GPS: a novel group-based phosphorylation predicting and scoring method.
      ,
      • Xue Y.
      • Li A.
      • Wang L.
      • Feng H.
      • Yao X.
      PPSP: prediction of PK-specific phosphorylation site with Bayesian decision theory.
      ). However, the performance of BLOSUM62 in comparison with other matrices was not evaluated. Here we used PKA as an example to illustrate the matrix selection. We tested the PKA prediction performance of ∼60 matrices (BLOSUM30–100, PAM10–500, etc.). Both self-consistency and leave-one-out validation were calculated for comparison. In theory, the self-consistency and leave-one-out (jackknife) performances of a perfect predictor should be very similar. Performance comparisons for eight typical matrices are shown (Fig. 6). Although the self-consistency performances of BLOSUM90, PAM10, and PAM90 were very high, their leave-one-out performances were quite low. The leave-one-out performances of BLOSUM30, BLOSUM45, PAM250, and PAM500 were closer to their self-consistency performances, but both were lower than those of BLOSUM62. To balance the prediction performance and robustness of the prediction system, the BLOSUM62 matrix was adopted in GPS 2.0.
      Fig. 6. Comparison of various scoring matrices. Self, self-consistency. The BLOSUM62 matrix was adopted to balance the prediction performance and robustness of GPS 2.0.
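The gap between self-consistency and leave-one-out validation can be made concrete with a small sketch. It assumes a group-based score defined as the average peptide-to-peptide similarity against the known sites, with similarity summed position by position from a substitution function. A simple identity function stands in for BLOSUM62 here, and the five-residue peptides are invented for the example, so the numbers only illustrate the mechanism.

```python
def similarity(a, b, subst):
    """Sum of substitution scores over aligned positions of two peptides."""
    return sum(subst(x, y) for x, y in zip(a, b))


def group_score(peptide, training, subst):
    """Average similarity between a peptide and a set of known sites."""
    return sum(similarity(peptide, t, subst) for t in training) / len(training)


def self_consistency_scores(positives, subst):
    """Score each known site against the full set, itself included."""
    return [group_score(p, positives, subst) for p in positives]


def leave_one_out_scores(positives, subst):
    """Score each known site against the set with that site removed."""
    scores = []
    for i, p in enumerate(positives):
        rest = positives[:i] + positives[i + 1:]
        scores.append(group_score(p, rest, subst))
    return scores


def identity(x, y):
    """Toy stand-in for a substitution matrix: 1 for a match, else 0."""
    return 1 if x == y else 0


# Hypothetical five-residue peptides, not real phosphorylation sites.
positives = ["RRASV", "RRSSL", "KRASI"]

self_scores = self_consistency_scores(positives, identity)
loo_scores = leave_one_out_scores(positives, identity)
print(self_scores)
print(loo_scores)
```

Each site scores higher when it is allowed to match itself, which is why self-consistency alone overstates performance and why a matrix with a large gap between the two is considered less robust.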
      Because different matrices yield different performances, an interesting question is whether we can find an optimal or near-optimal matrix for each PK group to improve the system stability without significantly affecting the prediction performance. To address this question, we developed a simple method to automatically mutate BLOSUM62 into a near-optimal matrix for each PK group. First the performance (Sn and Sp) of leave-one-out validation for each PK group was calculated. Then we fixed Sp at 90% and improved Sn by matrix mutation. Using this approach, the leave-one-out validations of most of the PK groups were improved significantly, whereas the self-consistency performances were only influenced moderately (Fig. 7). For example, with an Sp of 90%, the leave-one-out validation (LOO) Sn values of AGC/PKA, AGC/AKT, CaMK/CaMKII, and CMGC/CDK were increased from 80.7, 85.7, 67.4, and 81.8% to 89.6, 92.9, 81.4, and 88.7%, respectively, whereas their self-consistency Sn values were altered from 87.5, 96.4, 97.7, and 88.5% to 91.1, 98.8, 96.5, and 92.1%, respectively (Table I).
      Fig. 7. Prediction performances before and after matrix mutations. Twelve randomly chosen PK clusters are compared. In most cases the leave-one-out validations improved significantly, whereas the self-consistencies were only enhanced moderately. Thus, the process of matrix mutation improved both performance and robustness of GPS 2.0. MM, matrix mutation; Self, self-consistency; PKC, protein kinase C; RSK, ribosomal S6 kinase; MAPKAPK, MAPK-activated protein kinase.
      Table I. Matrix mutation (all values are Sn in %)
      PK cluster              Before MM(a)        After MM(b)
                              Self(c)   LOO(d)    Self(c)   LOO(d)
      AGC/PKA                  87.5      82.2      91.1      89.6
      AGC/PKC/α                90.0      49.2      78.3      68.3
      Atypical/PIKK            90.1      62.6      96.7      86.8
      CaMK/CaMKII/CaMKII-α    100.0      69.0     100.0      93.1
      AGC/AKT                  96.4      86.9      98.8      92.9
      AGC/GRK                  90.5      52.3      98.8      61.9
      AGC/PKC                  72.7      64.1      79.2      75.0
      AGC/RSK                  98.2      76.8     100.0      83.9
      CaMK/CaMKII              97.7      67.4      96.5      81.4
      CaMK/MAPKAPK            100.0      43.8     100.0      62.5
      CMGC/CDK                 88.5      83.8      92.1      88.1
      Other/CK2                77.6      74.3      80.2      78.8
      a Before matrix mutation.
      b After matrix mutation.
      c Self, self-consistency Sn.
      d LOO, the Sn of leave-one-out validation.
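The text describes matrix mutation only in outline, so the following is a sketch under stated assumptions rather than the authors' procedure: a greedy random search that changes one symmetric matrix entry by ±1 at a time and keeps the change only if an objective improves. In the setting above the objective would be the leave-one-out Sn at an Sp of 90%; here a toy objective on a two-letter alphabet is used so the example runs on its own. All names are hypothetical.

```python
import random


def mutate_matrix(matrix, objective, steps=200, seed=0):
    """Greedy random mutation of a symmetric substitution matrix.

    matrix: dict mapping (a, b) residue pairs to integer scores,
            containing both (a, b) and (b, a).
    objective: callable taking a matrix and returning a number to maximize.
    """
    rng = random.Random(seed)
    best = dict(matrix)
    best_value = objective(best)
    keys = sorted(best)
    for _ in range(steps):
        a, b = rng.choice(keys)
        delta = rng.choice((-1, 1))
        trial = dict(best)
        trial[(a, b)] += delta
        trial[(b, a)] = trial[(a, b)]  # keep the matrix symmetric
        value = objective(trial)
        if value > best_value:  # accept strict improvements only
            best, best_value = trial, value
    return best, best_value


# Toy starting matrix on a two-letter alphabet: 1 on the diagonal, 0 elsewhere.
start = {(a, b): 1 if a == b else 0 for a in "ST" for b in "ST"}


def toy_objective(m):
    """Hypothetical objective that is maximal when the S/T score equals 2."""
    return -abs(m[("S", "T")] - 2)


mutated, value = mutate_matrix(start, toy_objective)
print(mutated, value)
```

Accepting only strict improvements means entries that do not affect the objective stay at their starting values, which keeps the mutated matrix close to the original.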

       Comparisons of GPS 2.0 with Other Existing Tools—

      Here we compared the prediction performances of GPS 2.0 with several other existing tools, including ScanSite (
      • Obenauer J.C.
      • Cantley L.C.
      • Yaffe M.B.
      Scansite 2.0: proteome-wide prediction of cell signaling interactions using short sequence motifs.
      ), KinasePhos (1.0 and 2.0) (
      • Huang H.D.
      • Lee T.Y.
      • Tzeng S.W.
      • Horng J.T.
      KinasePhos: a web tool for identifying protein kinase-specific phosphorylation sites.
      ,
      • Wong Y.H.
      • Lee T.Y.
      • Liang H.K.
      • Huang C.M.
      • Wang T.Y.
      • Yang Y.H.
      • Chu C.H.
      • Huang H.D.
      • Ko M.T.
      • Hwang J.K.
      KinasePhos 2.0: a web server for identifying protein kinase-specific phosphorylation sites based on sequences and coupling patterns.
      ), NetPhosK (
      • Blom N.
      • Sicheritz-Ponten T.
      • Gupta R.
      • Gammeltoft S.
      • Brunak S.
      Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence.
      ), and pkaPS (
      • Neuberger G.
      • Schneider G.
      • Eisenhaber F.
      pkaPS: prediction of protein kinase A phosphorylation sites with the simplified kinase-substrate binding model.
      ). Because the leave-one-out validations for these programs were unavailable, we focused on the comparison of the self-consistency performances.
      We chose four well known PK groups for comparison: AGC/PKA, atypical/PIKK/ATM, CMGC/CDK/CDC2/CDC2, and TK/Src/Src. The positive and negative data sets that we tested with GPS 2.0 were submitted directly to these on-line services, and Sn and Sp were calculated for each program. We then fixed the Sp of GPS 2.0 to be nearly equal to that of each other tool and compared the Sn values (Table II). For PKA site prediction, only ScanSite with a high threshold (Sp of 99.91%) was better than GPS 2.0, with an Sn of 16.91 versus 8.61%. However, when the medium or low threshold was chosen, GPS 2.0 was better than ScanSite. For CDC2, ScanSite under medium and high thresholds, KinasePhos 1.0 with 100% Sp, and KinasePhos 2.0 were better, although the performance of GPS 2.0 was comparable with that of the three tools. For both ATM and Src, GPS 2.0 was the best predictor in all circumstances. Taken together, GPS 2.0 is better than or at least comparable with previously established programs.
      Table II. Comparison of GPS 2.0 with previous prediction tools, including ScanSite, KinasePhos 1.0 and 2.0, NetPhosK, and pkaPS (all values in %)
      Predictors and threshold   PKA               ATM               CDC2              Src
                                 Sn       Sp       Sn       Sp       Sn       Sp       Sn       Sp
      ScanSite
        Low                      69.14    95.02    54.55    93.67    73.08    95.13    28.68    95.28
        Medium                   42.43    99.17    27.27    98.57    29.23    99.26    11.76    99.37
        High                     16.91    99.91    18.18    99.70     8.46    99.84     3.68    99.94
      KinasePhos 1.0
        90%                      85.16    90.64    89.09    83.86    72.31    86.37    47.06    89.93
        95%                      80.12    94.50    87.27    89.76    63.08    92.69    38.24    93.91
        100%                     58.46    98.42    81.82    96.04    48.46    97.99    25.00    97.84
      KinasePhos 2.0             55.19    89.20    89.09    38.12    13.08    99.72    86.86    55.97
      NetPhosK                   77.74    91.18    85.45    97.60    16.92    87.79    33.09    95.39
      pkaPS                      89.61    90.81    -        -        -        -        -        -
      GPS 2.0
                                 83.09    95.04   100.00    94.03    77.96    95.16    54.02    95.34
                                 49.26    99.17    72.73    98.62    23.12    99.26    17.24    99.43
                                  8.61    99.91    32.73    99.70     7.53    99.84     3.83    99.93
                                 89.91    90.75    a        a        93.01    86.41    66.28    89.96
                                 84.57    94.49    -        -        89.78    92.71    57.09    94.05
                                 64.39    98.43    98.18    96.04    46.77    97.99    37.93    97.85
                                 91.69    89.25    -        -         9.14    99.72    91.19    56.03
                                 89.61    91.26    87.27    97.61    91.94    87.84    52.87    95.44
                                 89.91    90.91    -        -        -        -        -        -
      a Not compared because both Sn and Sp of GPS 2.0 were better.
      -, no value given.
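The matched-Sp comparison behind Table II can be sketched as follows: compute Sn and Sp at a cutoff, then pick the cutoff whose Sp is closest to the Sp reported by another tool and read off Sn there. The convention that a site is a hit when its score is at least the cutoff, the function names, and the toy scores are assumptions for this example.

```python
def sn_sp(pos_scores, neg_scores, cutoff):
    """Sensitivity and specificity when a hit means score >= cutoff."""
    sn = sum(s >= cutoff for s in pos_scores) / len(pos_scores)
    sp = sum(s < cutoff for s in neg_scores) / len(neg_scores)
    return sn, sp


def sn_at_matched_sp(pos_scores, neg_scores, target_sp):
    """Return (cutoff, Sn, Sp) for the cutoff whose Sp is closest to target_sp."""
    candidates = sorted(set(pos_scores) | set(neg_scores))
    best = min(
        candidates,
        key=lambda c: abs(sn_sp(pos_scores, neg_scores, c)[1] - target_sp),
    )
    sn, sp = sn_sp(pos_scores, neg_scores, best)
    return best, sn, sp


# Hypothetical scores: four known sites and ten negative sites.
pos_scores = [6, 8, 9, 10]
neg_scores = list(range(1, 11))

result = sn_at_matched_sp(pos_scores, neg_scores, target_sp=0.8)
print(result)
```

Fixing Sp first makes the Sn values of two predictors directly comparable, which is the design choice the comparison above relies on.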

       A Large Scale Prediction of Kinase-specific Phosphorylation Sites in Mammals—

      Estimation and control of false positive prediction is the key point in large scale predictions of kinase-specific phosphorylation sites. The FPR is the proportion of negative sites that are erroneously predicted as positive hits. From our analysis, the real phosphorylation sites were only a very small part of all Ser/Thr residues in proteins (see supplemental Tables S2 and S3). For 144 serine/threonine PK groups, the ratios of positive sites versus the negative sites range from 1:13.2 (other/PEK: 16 positive sites and 211 negative sites) to 1:141.2 (CaMK/CaMKI/CaMKIV: nine positive sites with 1,271 negative sites) with the average being 1:49. And for 69 tyrosine PK groups, the ratios of positive sites versus the negative sites range from 1:1.6 (TK/Trk/TRKA: five positive sites with eight negative sites) to 1:28.2 (TK/Csk: five positive sites and 141 negative sites) with the average being 1:9.7. Thus, even a very small FPR could generate too many false positive hits.
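A short calculation shows why a small FPR still matters at these ratios. The sketch below uses the average ratios quoted above (1:49 for serine/threonine and 1:9.7 for tyrosine PK groups) together with the high-threshold FPRs of 2 and 4%. It assumes, optimistically, an Sn of 100%, and the quantity it reports is the ordinary positive predictive value, not the Pr defined later in this section.

```python
def false_positives_per_true_site(negatives_per_positive, fpr):
    """Expected number of false hits for every real site in the data."""
    return negatives_per_positive * fpr


def positive_predictive_value(sn, fpr, negatives_per_positive):
    """Fraction of predicted hits that are real, per real site in the data."""
    true_hits = sn
    false_hits = fpr * negatives_per_positive
    return true_hits / (true_hits + false_hits)


# Average ratios from the text; Sn = 1.0 is an optimistic assumption.
ser_thr_false = false_positives_per_true_site(49, 0.02)
ser_thr_ppv = positive_predictive_value(1.0, 0.02, 49)
tyr_ppv = positive_predictive_value(1.0, 0.04, 9.7)

print(round(ser_thr_false, 2), round(ser_thr_ppv, 3), round(tyr_ppv, 3))
```

Under these assumptions roughly one false hit accompanies every real serine/threonine site, so about half of the predicted hits would be false even with perfect sensitivity.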
      Given a data set containing only non-phosphorylation sites, the real FPR could be easily computed. However, the FPR cannot be calculated precisely because a “gold standard” negative data set is lacking. Here we randomly generated 10,000 PSP(7, 7) peptides to construct a near-negative data set based on the real frequencies of the 20 amino acids in eukaryotic proteomes. Although a few of these random peptides may resemble real sites, their proportion would be very small. The process was repeated 20 times, and the average FPR calculated by GPS 2.0 was taken as the theoretically maximal FPR. Then for large scale predictions, we defined the precision (Pr) as follows.
      $$\mathrm{Pr} = \frac{M - (N \times \mathrm{FPR})}{M} \qquad \text{(Eq. 6)}$$


      Here N is the number of sites (Ser/Thr or Tyr) for prediction; M is the number of predicted sites by GPS 2.0. Because the FPR is the theoretically maximal false positive rate, the Pr is the minimal proportion of correct predictions.
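The near-negative data set used to estimate the FPR can be sketched as below. The code draws 15-residue peptides with a fixed central residue and 14 flanking residues sampled from a frequency table, scores them, and averages the hit rate over repeated runs. The uniform frequency table, the toy scorer, and all function names are placeholders; a real run would use proteome-derived frequencies and the actual predictor.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Placeholder frequencies (uniform); real proteome frequencies would go here.
UNIFORM = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def random_psp77(center, frequencies, rng):
    """Random 15-mer: 7 residues, the candidate residue, 7 residues."""
    residues = list(frequencies)
    weights = [frequencies[r] for r in residues]
    flank = rng.choices(residues, weights=weights, k=14)
    return "".join(flank[:7]) + center + "".join(flank[7:])


def estimate_fpr(scorer, cutoff, frequencies, center="S",
                 n_peptides=10_000, repeats=20, seed=0):
    """Average fraction of random peptides scoring at or above the cutoff."""
    rng = random.Random(seed)
    rates = []
    for _ in range(repeats):
        hits = sum(
            scorer(random_psp77(center, frequencies, rng)) >= cutoff
            for _ in range(n_peptides)
        )
        rates.append(hits / n_peptides)
    return sum(rates) / repeats


def toy_scorer(peptide):
    """Hypothetical scorer: counts R/K at positions -3 and -2 of the site."""
    return sum(peptide[i] in "RK" for i in (4, 5))


# Smaller sample than in the text, to keep the example quick.
fpr = estimate_fpr(toy_scorer, cutoff=2, frequencies=UNIFORM, n_peptides=1000)
print(f"estimated FPR: {fpr:.4f}")
```

Because some random peptides may happen to resemble real sites, a rate obtained this way is an upper bound on the true FPR, which is why the text treats it as the theoretical maximum.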
      For any given kinase, the total Ser/Thr or Tyr residues in a proteome could be divided into three groups, including sites phosphorylated by the kinase, sites phosphorylated by other kinases, and non-phosphorylation sites. For the kinase, sites of the latter two groups would be regarded as “negative hits” for prediction. Because most sites in a proteome are non-phosphorylation sites, the number of negative sites for the kinase is too large. Thus, it would not make sense to carry out a large scale prediction for a proteome directly. Currently there are many small scale and large scale experiments to identify phosphorylation sites. And most of these sites are integrated in the Phospho.ELM database (
      • Diella F.
      • Cameron S.
      • Gemund C.
      • Linding R.
      • Via A.
      • Kuster B.
      • Sicheritz-Ponten T.
      • Blom N.
      • Gibson T.J.
      Phospho.ELM: a database of experimentally verified phosphorylation sites in eukaryotic proteins.
      ). From Phospho.ELM 6.0, there were 13,254 mammalian sites, including 9,717 Ser(P), 1,818 Thr(P), and 1,719 Tyr(P) sites (Table III). These sites were experimentally identified, but the kinase information of more than 10,000 sites still remains to be annotated. Most importantly, in the data set, the non-phosphorylation sites were excluded. And the number of potentially negative hits for a given kinase was greatly reduced. In this regard, a properly defined FPR will be useful to evaluate the prediction accuracy.
      Table III. Data analyses of a large scale prediction for kinase-specific sites in mammalian proteomes
      Phospho.ELM 6.0       Mammalian   Predicted   Coverage (%)
      Total    Sites        13,254      12,219      92.19
               Proteins      4,291       4,071      94.87
      Ser(P)   Sites         9,717       9,195      94.63
               Proteins      3,444       3,325      96.54
      Thr(P)   Sites         1,818       1,551      85.31
               Proteins      1,200       1,048      87.33
      Tyr(P)   Sites         1,719       1,473      85.69
               Proteins        885         768      86.78
      In this work, we performed a large scale prediction of kinase-specific phosphorylation sites in mammals and compared the results with the phosphorylation sites in Phospho.ELM 6.0. The high threshold of GPS 2.0 was chosen, with an FPR of 2% for serine/threonine kinases and 4% for tyrosine kinases. The predictor for budding yeast IPL1 was not used. For each PK, we divided the data set into three groups: the known substrates of that PK (Known sub.), the known substrates of other kinases (Other’s sub.), and the sites without PK information (Unknown sub.) (supplemental Table S4). For example, 306 sites were experimentally identified as PKA sites in mammals, 1,993 sites were verified as substrates of other PKs, and 9,236 sites were unannotated. For the first group (Known sub.), the Sn was calculated as the proportion of existing sites that were correctly predicted. For each of the latter two groups, the Pr was calculated to estimate the minimal accuracy of large scale predictions.
      For 143 serine/threonine and 69 tyrosine PK groups, the Sn values for known substrates and the Pr values for unknown data were calculated. Most of the predictions performed satisfactorily (see supplemental Fig. S2). For example, GPS 2.0 predicted 200 of 306 known PKA sites as positive hits, an Sn of 65.36%. Of the 1,993 sites phosphorylated by other PKs, GPS 2.0 predicted 220 as positive hits with a Pr of 81.88%, meaning that at least 81.88% of the 220 predicted sites might be positive sites. Of the 9,236 unannotated sites, GPS 2.0 predicted 959 as positive sites with a Pr of 80.74%. However, if a data set contains very few real positive sites, fewer sites may be predicted than would be expected from randomly generated data, and the Pr value can be very small or even below 0, which indicates that substrates of the subject kinase are under-represented in the data set. In our analysis, 53 PK groups (25% of 212 PK groups) showed low performance. In total, 12,219 sites were predicted for at least one PK, a total coverage of 92.19% (Table III).
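The Pr values quoted for PKA follow directly from Eq. 6 and can be reproduced with a few lines. The function name is illustrative; the inputs are the numbers given in the text, with the high-threshold FPR of 2% for serine/threonine kinases.

```python
def minimal_precision(n_sites, n_predicted, fpr):
    """Eq. 6: Pr = (M - N * FPR) / M, with N sites scanned and M predicted."""
    if n_predicted == 0:
        raise ValueError("Pr is undefined when no site is predicted")
    return (n_predicted - n_sites * fpr) / n_predicted


# PKA, sites known to be phosphorylated by other PKs: N = 1,993, M = 220.
pr_other = minimal_precision(1993, 220, 0.02)

# PKA, unannotated sites: N = 9,236, M = 959.
pr_unknown = minimal_precision(9236, 959, 0.02)

# A hypothetical case with fewer hits than the FPR alone would produce.
pr_negative = minimal_precision(1000, 10, 0.02)

print(f"{pr_other:.2%} {pr_unknown:.2%} {pr_negative:.2%}")
```

The last case shows how Pr drops below 0 when the number of predicted sites is smaller than N × FPR, the situation the text describes as under-representation of a kinase's substrates.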

       Prediction of Potential Aurora-B Substrates from Its Interacting Proteins—

      As described previously, protein kinase Aurora-B is a component of the Aurora/Ipl1 family and plays important roles in chromosome segregation (
      • Gorbsky G.J.
      Mitosis: MCAK under the aura of Aurora B.
      ,
      • Lan W.
      • Zhang X.
      • Kline-Smith S.L.
      • Rosasco S.E.
      • Barrett-Wilt G.A.
      • Shabanowitz J.
      • Hunt D.F.
      • Walczak C.E.
      • Stukenberg P.T.
      Aurora B phosphorylates centromeric MCAK and regulates its localization and microtubule depolymerization activity.
      ,
      • Honda R.
      • Korner R.
      • Nigg E.A.
      Exploring the functional interactions between Aurora B, INCENP, and survivin in mitosis.
      ) and progression of cytokinesis (
      • Kawajiri A.
      • Yasui Y.
      • Goto H.
      • Tatsuka M.
      • Takahashi M.
      • Nagata K.
      • Inagaki M.
      Functional significance of the specific sites phosphorylated in desmin at cleavage furrow: Aurora-B may phosphorylate and regulate type III intermediate filaments during cytokinesis coordinatedly with Rho-kinase.
      ). During mitosis, Aurora-B localizes on the kinetochore and forms a protein complex with Survivin, INCENP (inner centromere protein), and Borealin in metaphase (
      • Honda R.
      • Korner R.
      • Nigg E.A.
      Exploring the functional interactions between Aurora B, INCENP, and survivin in mitosis.
      ). Then it moves to the midbody in cytokinesis (
      • Kawajiri A.
      • Yasui Y.
      • Goto H.
      • Tatsuka M.
      • Takahashi M.
      • Nagata K.
      • Inagaki M.
      Functional significance of the specific sites phosphorylated in desmin at cleavage furrow: Aurora-B may phosphorylate and regulate type III intermediate filaments during cytokinesis coordinatedly with Rho-kinase.
      ). Phosphorylation by Aurora-B regulates the functions and dynamics of its substrates during cell division. In this regard, identification of Aurora-B substrates with their sites will be important for understanding the molecular mechanisms of cell division.
      In this study, we performed a comprehensive prediction of Aurora-B substrates and their phosphorylation sites in humans. As discussed previously, a short peptide flanking a site is not sufficient to provide full specificity for PK modification in vivo (
      • Biondi R.M.
      • Nebreda A.R.
      Signalling specificity of Ser/Thr protein kinases through docking-site-mediated interactions.
      ,
      • Holland P.M.
      • Cooper J.A.
      Protein modification: docking sites for kinases.
      ). Numerous mechanisms have also been proposed to account for the specificity of PK recognition, such as subcellular co-localization of PKs with their substrates, co-complex formation, or direct interaction (
      • Biondi R.M.
      • Nebreda A.R.
      Signalling specificity of Ser/Thr protein kinases through docking-site-mediated interactions.
      ,
      • Holland P.M.
      • Cooper J.A.
      Protein modification: docking sites for kinases.
      ,
      • Yaffe M.B.
      • Leparc G.G.
      • Lai J.
      • Obata T.
      • Volinia S.
      • Cantley L.C.
      A motif-based profile scanning approach for genome-wide prediction of signaling pathways.
      ). Thus, in vivo a PK should at least “kiss” its substrates and then say farewell by direct or indirect interactions. Here we adopted this “kiss-then-farewell” model and predicted Aurora-B substrates with their sites from its interacting proteins.
      Both the experimental and predicted protein-protein interaction databases were used. The human experimental protein-protein interaction (PPI) data were derived from the Database of Interacting Proteins (DIP) (
      • Salwinski L.
      • Miller C.S.
      • Smith A.J.
      • Pettit F.K.
      • Bowie J.U.
      • Eisenberg D.
      The Database of Interacting Proteins: 2004 update.
      ), BioGrid (
      • Stark C.
      • Breitkreutz B.J.
      • Reguly T.
      • Boucher L.
      • Breitkreutz A.
      • Tyers M.
      BioGRID: a general repository for interaction datasets.
      ), the Molecular Interaction Database (MINT) (
      • Zanzoni A.
      • Montecchi-Palazzi L.
      • Quondam M.
      • Ausiello G.
      • Helmer-Citterich M.
      • Cesareni G.
      MINT: a Molecular INTeraction database.
      ), the Biomolecular Interaction Network Database (BIND) (
      • Alfarano C.
      • Andrade C.E.
      • Anthony K.
      • Bahroos N.
      • Bajec M.
      • Bantoft K.
      • Betel D.
      • Bobechko B.
      • Boutilier K.
      • Burgess E.
      • Buzadzija K.
      • Cavero R.
      • D'Abreo C.
      • Donaldson I.
      • Dorairajoo D.
      • Dumontier M.J.
      • Dumontier M.R.
      • Earles V.
      • Farrall R.
      • Feldman H.
      • Garderman E.
      • Gong Y.
      • Gonzaga R.
      • Grytsan V.
      • Gryz E.
      • Gu V.
      • Haldorsen E.
      • Halupa A.
      • Haw R.
      • Hrvojic A.
      • Hurrell L.
      • Isserlin R.
      • Jack F.
      • Juma F.
      • Khan A.
      • Kon T.
      • Konopinsky S.
      • Le V.
      • Lee E.
      • Ling S.
      • Magidin M.
      • Moniakis J.
      • Montojo J.
      • Moore S.
      • Muskat B.
      • Ng I.
      • Paraiso J.P.
      • Parker B.
      • Pintilie G.
      • Pirone R.
      • Salama J.J.
      • Sgro S.
      • Shan T.
      • Shu Y.
      • Siew J.
      • Skinner D.
      • Snyder K.
      • Stasiuk R.
      • Strumpf D.
      • Tuekam B.
      • Tao S.
      • Wang Z.
      • White M.
      • Willis R.
      • Wolting C.
      • Wong S.
      • Wrong A.
      • Xin C.
      • Yao R.
      • Yates B.
      • Zhang S.
      • Zheng K.
      • Pawson T.
      • Ouellette B.F.
      • Hogue C.W.
      The Biomolecular Interaction Network Database and related tools 2005 update.
      ), and the Human Protein Reference Database (HPRD) (
      • Mishra G.R.
      • Suresh M.
      • Kumaran K.
      • Kannabiran N.
      • Suresh S.
      • Bala P.
      • Shivakumar K.
      • Anuradha N.
      • Reddy R.
      • Raghavan T.M.
      • Menon S.
      • Hanumanthu G.
      • Gupta M.
      • Upendran S.
      • Gupta S.
      • Mahesh M.
      • Jacob B.
      • Mathew P.
      • Chatterjee P.
      • Arun K.S.
      • Sharma S.
      • Chandrika K.N.
      • Deshpande N.
      • Palvankar K.
      • Raghavnath R.
      • Krishnakanth R.
      • Karathia H.
      • Rekha B.
      • Nayak R.
      • Vishnupriya G.
      • Kumar H.G.
      • Nagini M.
      • Kumar G.S.
      • Jose R.
      • Deepthi P.
      • Mohan S.S.
      • Gandhi T.K.
      • Harsha H.C.
      • Deshpande K.S.
      • Sarker M.
      • Prasad T.S.
      • Pandey A.
      Human protein reference database—2006 update.
      ) with 1,397, 38,217, 8,127, 43,412, and 33,710 entries, respectively. These data sets were integrated into a non-redundant set with a total of 51,529 records. For predicted PPI data, we used the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING database) with 690,143 precalculated PPI entries (
      • von Mering C.
      • Jensen L.J.
      • Snel B.
      • Hooper S.D.
      • Krupp M.
      • Foglierini M.
      • Jouffre N.
      • Huynen M.A.
      • Bork P.
      STRING: known and predicted protein-protein associations, integrated and transferred across organisms.
      ). Both experimentally verified and predicted PPI data were mapped to the UniProt database by BLAST to normalize protein accession numbers. In total, 140 human proteins in Phospho.ELM 6.0, containing 605 Ser(P)/Thr(P) sites, were identified as Aurora-B-interacting proteins. The high threshold of GPS 2.0 was used with an FPR of 2%, and 48 sites from 32 proteins were predicted as positive hits (Table IV). The total Pr of the prediction was calculated as (48 − (605 × 2%))/48 = 74.79%.
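The filtering step of the “kiss-then-farewell” approach can be sketched as a merge of interaction lists followed by a lookup of the kinase's partners. The sketch treats an interaction as an unordered pair, so the same pair reported by two databases counts once. The gene symbols, the tiny interaction lists, and the function names are illustrative only and do not reproduce the actual database contents.

```python
def merge_interactions(*sources):
    """Union of interaction lists, treating (a, b) and (b, a) as one record."""
    merged = set()
    for pairs in sources:
        for a, b in pairs:
            merged.add(frozenset((a, b)))
    return merged


def partners(interactions, protein):
    """All proteins that share at least one interaction with `protein`."""
    return {
        other
        for pair in interactions
        if protein in pair
        for other in pair
        if other != protein
    }


def candidate_sites(sites_by_protein, interactors):
    """Keep only the known phosphorylation sites located on interactors."""
    return {
        protein: sites
        for protein, sites in sites_by_protein.items()
        if protein in interactors
    }


# Hypothetical toy inputs standing in for three interaction databases.
source_a = [("AURKB", "INCENP")]
source_b = [("INCENP", "AURKB"), ("AURKB", "BIRC5")]
source_c = [("PLK1", "CDC25B")]

interactions = merge_interactions(source_a, source_b, source_c)
aurkb_partners = partners(interactions, "AURKB")

# Hypothetical site table: protein -> phosphorylated positions.
known_sites = {"INCENP": [897, 898], "PLK1": [137]}
to_score = candidate_sites(known_sites, aurkb_partners)
print(len(interactions), sorted(aurkb_partners), to_score)
```

Only the sites that survive this filter would be passed to the predictor, which is what reduces the number of negative sites before Pr is computed.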
      Table IV. A proteome-wide prediction of Aurora-B-specific substrate sites
      Substrate  Phospho.ELM  Site  Peptide          GPS score  Known kinases
      HEC1       O14777       5     ---MKRSSVSSGGAG  6.1515     Aurora-B
                 O14777       15    SGGAGRLSMQELRSQ  4.8788     Aurora-B
                 O14777       49    KLSINKPTSERKVSL  3.8182     Aurora-B
                 O14777       55    PTSERKVSLFGKRTS  5.3636     Aurora-B
                 O14777       62    SLFGKRTSGHGSRNS  3.2424
                 O14777       69    SGHGSRNSQLGIFSS  4.303      Aurora-B
      AURKA      O14965       288   APSSRRTTLCGTLDY  5.8485     PKA, Aurora-A
      PPP1R12A   O14974       696   ARQSRRSTQGVTLTD  5.0606     ROCK1
      Survivin   O15392       117   KNKIAKETNNKKKEF  4.2424     Aurora-B
      VIM        P08670       65    GVYATRSSAVRLRSS  4.303      PAK
                 P08670       72    SAVRLRSSVPGVRLL  4.5758     PAK, ROCK, Aurora-B
      GFAP       P14136       7     -MERRRITSAARRSY  8.303      ROCK, Aurora-B
                 P14136       13    ITSAARRSYVSSGEM  7.5152     ROCK, PKC, CaMKII, Aurora-B
                 P14136       38    LGPGTRLSLARMPPP  4.5455     ROCK, PKC, CaMKII, Aurora-B
      STMN1      P16949       62    AAEERRKSHEAEVLK  5.8182     PKA
      DES        P17661       11    YSSSQRVSSYRRTFG  4.7576     Aurora-B
                 P17661       16    RVSSYRRTFGGAPGF  5.2727     ROCK, Aurora-B
                 P17661       59    VYQVSRTSGGAGGLG  4.8788     Aurora-B
      LMNB1      P20700       27    PLSPTRLSRLQEKEE  3.303
      PSMA3      P25788       242   AEKYAKESLKEEDES  3.1818     CK2
      CDC25B     P30305       353   VQNKRRRSVTPPEEQ  6          Aurora-A
      BDKRB2     P30411       373   SMGTLRTSISVERQI  3.6364     GRK-4, PKC
      NES        P48681       767   ETQQRRRSLGEQDQM  6.8788
      CENPA      P49450       7     -MGPRRRSRKPEAPR  6.3636     Aurora-A, Aurora-B
      PLK1       P53350       137   LELCRRRSLLELHKR  3.2424
                 P53350       210   YDGERKKTLCGTPNY  3.2121     LOK
      H3.1       P68431       10    TKQTARKSTGGKAPR  10.1212    Aurora-A, Aurora-B
                 P68431       28    ATKAARKSAPATGGV  8.9091     MAPK, Aurora-B
      MDM2       Q00987       157   SHLVSRPSTSSRRRA  4.0909
      KIF23      Q02241       911   NGSRKRRSSTVAPAQ  6.303
                 Q02241       912   GSRKRRSSTVAPAQP  6.3939
      RELA       Q04206       276   SMQLRRPSDRELSEP  3.2121     RSK-5
      BPTF       Q12830       77    PRVHRPRSPILEEKD  3.0303
      TP53BP1    Q12888       1460  GAGALRRSDSPEIPF  3.2424
      CBX3       Q13185       93    KDGTKRKSLSDSESD  5.1212
      PIN1       Q13526       16    PGWEKRMSRSSGRVY  3.1818     PKA
      IFI16      Q16666       132   GAQKRKKSTKEKAGP  4.8485     CK2
      RCC1       Q6NT97       11    KRIAKRRSPPADAIP  4.4545
      FLJ37981   Q8N1Q3       73    ETSSLRNSQSENSSL  5
      MCAK       Q99661       95    IQKQKRRSVNSKIPA  7.5455     Aurora-B
      RACGAP1    Q9H0H5       387   ETGLYRISGCDRTVK  3.7879     Aurora-B
      INCENP     Q9NQS7       897   KPRYHKRTSSAVWNS  4.1515     Aurora-B
                 Q9NQS7       898   PRYHKRTSSAVWNSP  5.7273     Aurora-B
                 Q9NQS7       899   RYHKRTSSAVWNSPP  3.1818     Aurora-B
      TD-60      Q9P258       43    RERPERCSSSSGGGS  4.1515
      BAZ1B      Q9UIG0       189   EDEGRRESINDRARR  5.7576
                 Q9UIG0       1342  KRSSRRQSLELQKCE  4.6061
      CDC23      Q9UJX2       582   NTPTRRVSPLNLSSV  3.7879
      Our analysis correctly predicted 21 of 26 (Sn of ∼81%) experimentally verified Aurora-B sites in human proteins (Table IV). In addition, several novel substrates with potential sites were identified in silico. For example, although human TD-60 is co-localized with Survivin on the kinetochore (
      • Mollinari C.
      • Reynaud C.
      • Martineau-Thuillier S.
      • Monier S.
      • Kieffer S.
      • Garin J.
      • Andreassen P.R.
      • Boulet A.
      • Goud B.
      • Kleman J.P.
      • Margolis R.L.
      The mammalian passenger protein TD-60 is an RCC1 family member with an essential role in prometaphase to metaphase progression.
      ), its phosphorylation by Aurora-B has not been reported. We predicted that human TD-60 could be phosphorylated by Aurora-B at Ser-43. In addition, although HP1γ/CBX3 is localized on the centromeric region near the kinetochore (
      • Obuse C.
      • Iwasaki O.
      • Kiyomitsu T.
      • Goshima G.
      • Toyoda Y.
      • Yanagida M.
      A conserved Mis12 centromere complex is linked to heterochromatic HP1 and outer kinetochore protein Zwint-1.
      ), its phosphorylation by Aurora-B remains unclear. Here we predicted that HP1γ/CBX3 could be phosphorylated by Aurora-B at Ser-93. Moreover, we predicted another kinetochore-associated kinase, PLK1 (
      • Arnaud L.
      • Pines J.
      • Nigg E.A.
      GFP tagging reveals human Polo-like kinase 1 at the kinetochore/centromere region of mitotic chromosomes.
      ), as a novel substrate of Aurora-B, potentially phosphorylated at both Ser-137 and Thr-210.
      Taken together, using GPS 2.0 and protein-protein interaction information, we predicted 32 proteins containing 48 Ser(P)/Thr(P) sites as potential Aurora-B substrates, 21 of which are experimentally verified Aurora-B sites (Table IV). Although the accuracy and physiological relevance of the remaining sites must still be validated experimentally, our analyses with GPS 2.0 provide an outline of how mitotic Aurora-B phosphorylation regulates protein-protein interaction plasticity and dynamics.

      DISCUSSION

      In this work, we refined our previously established protein phosphorylation prediction program, GPS 1.10 (Group-based Phosphorylation Scoring), into version 2.0. In addition, the software was renamed Group-based Prediction System because numerous PKs were clustered into a hierarchical structure with four levels: group, family, subfamily, and single PK (
      • Manning G.
      • Whyte D.B.
      • Martinez R.
      • Hunter T.
      • Sudarsanam S.
      The protein kinase complement of the human genome.
      ). Then the on-line server and local packages of GPS 2.0 were implemented in Java with a modified version of the Group-based Phosphorylation Scoring algorithm (
      • Xue Y.
      • Zhou F.
      • Zhu M.
      • Ahmed K.
      • Chen G.
      • Yao X.
      GPS: a comprehensive www server for phosphorylation sites prediction.
      ,
      • Zhou F.F.
      • Xue Y.
      • Chen G.L.
      • Yao X.
      GPS: a novel group-based phosphorylation predicting and scoring method.
      ). The GPS 2.0 Web server was tested on several Internet browsers, including Internet Explorer 6.0, Netscape Browser 8.1.3, and Firefox 2 under the Windows XP operating system (OS); Mozilla Firefox 1.5 under Fedora Core 6 (Linux); and Safari 3.0 under Apple Mac OS X 10.4 (Tiger) and 10.5 (Leopard). On Windows and Linux systems, the Java Runtime Environment (JRE) package of Sun Microsystems (Java 1.4.2 or later) must be preinstalled to use the GPS 2.0 program. On Mac OS, GPS 2.0 can be used directly without any additional packages. Furthermore, users can install the local packages of GPS 2.0 on their own computers. The local packages likewise support three major OSs: Windows, Unix/Linux, and Mac.
      The performance and robustness of the prediction system were extensively evaluated by self-consistency, leave-one-out validation, and 4-, 6-, 8-, and 10-fold cross-validations. We then compared the prediction performance of GPS 2.0 with that of several existing tools, including ScanSite (
      • Obenauer J.C.
      • Cantley L.C.
      • Yaffe M.B.
      Scansite 2.0: proteome-wide prediction of cell signaling interactions using short sequence motifs.
      ), KinasePhos (1.0 and 2.0) (
      • Huang H.D.
      • Lee T.Y.
      • Tzeng S.W.
      • Horng J.T.
      KinasePhos: a web tool for identifying protein kinase-specific phosphorylation sites.
      ,
      • Wong Y.H.
      • Lee T.Y.
      • Liang H.K.
      • Huang C.M.
      • Wang T.Y.
      • Yang Y.H.
      • Chu C.H.
      • Huang H.D.
      • Ko M.T.
      • Hwang J.K.
      KinasePhos 2.0: a web server for identifying protein kinase-specific phosphorylation sites based on sequences and coupling patterns.
      ), NetPhosK (
      • Blom N.
      • Sicheritz-Ponten T.
      • Gupta R.
      • Gammeltoft S.
      • Brunak S.
      Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence.
      ), and pkaPS (
      • Neuberger G.
      • Schneider G.
      • Eisenhaber F.
      pkaPS: prediction of protein kinase A phosphorylation sites with the simplified kinase-substrate binding model.
      ). ScanSite constructs a position-specific scoring matrix for each kinase based on its known phosphorylation sites (
      • Obenauer J.C.
      • Cantley L.C.
      • Yaffe M.B.
      Scansite 2.0: proteome-wide prediction of cell signaling interactions using short sequence motifs.
      ). KinasePhos 1.0 uses a maximal dependence decomposition strategy and constructs a profile hidden Markov model for each kinase (
      • Huang H.D.
      • Lee T.Y.
      • Tzeng S.W.
      • Horng J.T.
      KinasePhos: a web tool for identifying protein kinase-specific phosphorylation sites.
      ), whereas KinasePhos 2.0 retrieves the coupling patterns (XdZ where amino acid types X and Z are separated by d amino acids) from the known phosphorylation sites and uses the Support Vector Machines algorithm to train the model (
      • Wong Y.H.
      • Lee T.Y.
      • Liang H.K.
      • Huang C.M.
      • Wang T.Y.
      • Yang Y.H.
      • Chu C.H.
      • Huang H.D.
      • Ko M.T.
      • Hwang J.K.
      KinasePhos 2.0: a web server for identifying protein kinase-specific phosphorylation sites based on sequences and coupling patterns.
      ). NetPhosK uses an artificial neural network method for training (
      • Blom N.
      • Sicheritz-Ponten T.
      • Gupta R.
      • Gammeltoft S.
      • Brunak S.
      Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence.
      ). These tools first retrieve information from each position flanking the modified residue (Ser/Thr or Tyr). An implicit assumption in their models is that the information, function, and evolution of each position are independent of its neighboring residues. In reality, positions are not entirely independent. GPS 1.0 and 1.10 (
      • Xue Y.
      • Zhou F.
      • Zhu M.
      • Ahmed K.
      • Chen G.
      • Yao X.
      GPS: a comprehensive www server for phosphorylation sites prediction.
      ,
      • Zhou F.F.
      • Xue Y.
      • Chen G.L.
      • Yao X.
      GPS: a novel group-based phosphorylation predicting and scoring method.
      ), GPS 2.0, and pkaPS (
      • Neuberger G.
      • Schneider G.
      • Eisenhaber F.
      pkaPS: prediction of protein kinase A phosphorylation sites with the simplified kinase-substrate binding model.
      ) hypothesize that if two PSPs share high sequence homology, they may also have similar three-dimensional structures and biological functions. Thus, the information of whole PSPs, rather than of single positions, is considered. In this regard, the methods used in GPS 1.0 and 1.10 (
      • Xue Y.
      • Zhou F.
      • Zhu M.
      • Ahmed K.
      • Chen G.
      • Yao X.
      GPS: a comprehensive www server for phosphorylation sites prediction.
      ,
      • Zhou F.F.
      • Xue Y.
      • Chen G.L.
      • Yao X.
      GPS: a novel group-based phosphorylation predicting and scoring method.
      ), GPS 2.0, and pkaPS (
      • Neuberger G.
      • Schneider G.
      • Eisenhaber F.
      pkaPS: prediction of protein kinase A phosphorylation sites with the simplified kinase-substrate binding model.
      ) are expected to be superior to other strategies. Prediction performance also improves with a larger training data set, and the training data set of GPS 2.0 was much larger than those of the other tools. Furthermore, we noticed that prediction performance differed among amino acid substitution matrices. BLOSUM62 and other matrices are optimized to evaluate the similarity between homologous proteins but may not be optimal for the similarity of two PSPs. To improve system stability without significantly affecting prediction performance, we developed a simple method to automatically mutate BLOSUM62 into a near-optimal matrix for each PK group. The prediction performance of GPS 2.0 was further improved by this approach. By comparison, GPS 2.0 was better than or at least comparable with previous approaches on several well studied PKs. Moreover, GPS 2.0 can predict kinase-specific phosphorylation sites for 408 human PKs, demonstrating broad coverage and computational power.
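      To make the peptide-level comparison concrete, the following is a minimal Python sketch of scoring a candidate PSP by its similarity to a set of known PSPs, together with a leave-one-out pass over the known set. It is an illustration under stated assumptions, not the GPS 2.0 implementation: `toy_substitution` is a hypothetical stand-in that does not contain BLOSUM62 values, averaging is assumed here as the rule for combining pairwise similarities, and all function names are invented for this example. The example peptides are taken from Table IV.

```python
from typing import Callable, List, Sequence

# A substitution function maps a pair of residues to an integer score.
Substitution = Callable[[str, str], int]


def toy_substitution(a: str, b: str) -> int:
    """Hypothetical stand-in for a substitution matrix (NOT BLOSUM62 values)."""
    return 2 if a == b else -1


def peptide_similarity(p: str, q: str, sub: Substitution = toy_substitution) -> int:
    """Sum position-wise substitution scores of two equal-length peptides."""
    if len(p) != len(q):
        raise ValueError("peptides must have the same length")
    return sum(sub(a, b) for a, b in zip(p, q))


def group_score(candidate: str, known: Sequence[str],
                sub: Substitution = toy_substitution) -> float:
    """Average similarity of a candidate to known PSPs (assumed combination rule)."""
    if not known:
        raise ValueError("need at least one known peptide")
    return sum(peptide_similarity(candidate, k, sub) for k in known) / len(known)


def leave_one_out_scores(known: Sequence[str],
                         sub: Substitution = toy_substitution) -> List[float]:
    """Score each known PSP against all the others, holding it out in turn."""
    peptides = list(known)
    return [
        group_score(pep, peptides[:i] + peptides[i + 1:], sub)
        for i, pep in enumerate(peptides)
    ]


# Aurora-B sites from Table IV: H3.1 Ser-10, H3.1 Ser-28, MCAK Ser-95.
known_aurora_b = ["TKQTARKSTGGKAPR", "ATKAARKSAPATGGV", "IQKQKRRSVNSKIPA"]
# TD-60 Ser-43, a predicted site with no known kinase listed in Table IV.
candidate = "RERPERCSSSSGGGS"

print(group_score(candidate, known_aurora_b))
print(leave_one_out_scores(known_aurora_b))
```

      Passing the substitution scores in as a function keeps the scoring logic separate from the matrix, so a per-group mutated matrix could be swapped in without changing the rest of the sketch.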
      Previously, the control and calculation of the FPR had not been addressed. Here we developed a simple approach to estimate the theoretically maximal FPR for each PK cluster. We also defined the Pr factor to estimate the proportion of real phosphorylation sites among predicted results. Previously, the precision was defined as TP/(TP + FP) (
      • Huang H.D.
      • Lee T.Y.
      • Tzeng S.W.
      • Horng J.T.
      KinasePhos: a web tool for identifying protein kinase-specific phosphorylation sites.
      ). However, TP is usually unknown when an uncharacterized data set is used for prediction. Thus, a hidden assumption behind such a precision is that the ratio of TP to FP is the same in any given data set, and the precision is precalculated from the training data set. When the composition of a given data set differs from that of the training data set, such a precision is no longer valid or useful. In this regard, the Pr value should be flexible and reflect the enrichment of substrates of the subject kinase in any given data set. Given a data set for prediction (N sites), if all of the sites were true negatives, the theoretically maximal number of false positive hits is N × FPR. The Pr value can then be calculated as (M − (N × FPR))/M, where M is the total number of predicted hits. Because the data set might contain real phosphorylation sites, our approach will underestimate the real precision.
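      The claim that Pr underestimates the real precision can be checked with a small Python sketch. The scenario is hypothetical: the number of real sites (40) and the sensitivity (80%) are illustrative values chosen here, not results from this study, and the function names are invented for the example. Only the Pr formula itself comes from the text.

```python
def pr_lower_bound(predicted_hits: float, total_sites: float, fpr: float) -> float:
    """Pr = (M - N * FPR) / M, treating every site as a possible true negative."""
    return (predicted_hits - total_sites * fpr) / predicted_hits


def expected_precision(true_sites: float, total_sites: float,
                       sn: float, fpr: float) -> float:
    """Expected TP / (TP + FP) when the number of real sites is known."""
    tp = true_sites * sn
    fp = (total_sites - true_sites) * fpr
    return tp / (tp + fp)


# Hypothetical data set: 605 sites, of which 40 are real substrates,
# predicted at Sn = 80% and FPR = 2% (illustrative values only).
n_sites, real_sites, sn, fpr = 605, 40, 0.80, 0.02
expected_hits = real_sites * sn + (n_sites - real_sites) * fpr

precision = expected_precision(real_sites, n_sites, sn, fpr)
pr = pr_lower_bound(expected_hits, n_sites, fpr)
print(f"expected precision = {precision:.4f}, Pr = {pr:.4f}")
```

      In this setting the gap between the two values is (real sites × FPR)/M, because N × FPR also counts false positives among sites that are in fact real. The gap shrinks as the FPR decreases or as real sites become rarer in the data set.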
      As an application to demonstrate this computational power, we performed a large scale prediction of more than 13,000 phosphorylation sites in mammals with high precision. The high threshold was chosen with an FPR of 2% for serine/threonine kinases and 4% for tyrosine kinases. In addition, we provided a proteome-wide prediction of Aurora-B-specific substrates that incorporates protein-protein interaction information. As the first stand-alone software for computational phosphorylation site prediction, GPS 2.0 will accelerate experimentation for delineating kinase-coupled phosphoregulatory networks and pathways underlying cellular plasticity and dynamics.

      Acknowledgments

      We thank the anonymous reviewer, whose suggestion has greatly improved the presentation of this manuscript.

      REFERENCES

        • Zhou F.F.
        • Xue Y.
        • Yao X.
        • Xu Y.
        A general user interface for prediction servers of proteins' post-translational modification sites.
        Nat. Protocol. 2006; 1: 1318-1321
        • Iakoucheva L.M.
        • Radivojac P.
        • Brown C.J.
        • O'Connor T.R.
        • Sikes J.G.
        • Obradovic Z.
        • Dunker A.K.
        The importance of intrinsic disorder for protein phosphorylation.
        Nucleic Acids Res. 2004; 32: 1037-1049
        • Blom N.
        • Gammeltoft S.
        • Brunak S.
        Sequence and structure-based prediction of eukaryotic protein phosphorylation sites.
        J. Mol. Biol. 1999; 294: 1351-1362
        • Ingrell C.R.
        • Miller M.L.
        • Jensen O.N.
        • Blom N.
        NetPhosYeast: prediction of protein phosphorylation sites in yeast.
        Bioinformatics. 2007; 23: 895-897
        • Tang Y.R.
        • Chen Y.Z.
        • Canchaya C.A.
        • Zhang Z.
        GANNPhos: a new phosphorylation site predictor based on a genetic algorithm integrated neural network.
        Protein Eng. Des. Sel. 2007; 20: 405-412
        • Neuberger G.
        • Schneider G.
        • Eisenhaber F.
        pkaPS: prediction of protein kinase A phosphorylation sites with the simplified kinase-substrate binding model.
        Biol. Direct. 2007; 2: 1
        • Brinkworth R.I.
        • Munn A.L.
        • Kobe B.
        Protein kinases associated with the yeast phosphoproteome.
        BMC Bioinformatics. 2006; 7: 47
        • Chang E.J.
        • Begum R.
        • Chait B.T.
        • Gaasterland T.
        Prediction of cyclin-dependent kinase phosphorylation substrates.
        PLoS ONE. 2007; 2: e656
        • Linding R.
        • Jensen L.J.
        • Ostheimer G.J.
        • van Vugt M.A.
        • Jorgensen C.
        • Miron I.M.
        • Diella F.
        • Colwill K.
        • Taylor L.
        • Elder K.
        • Metalnikov P.
        • Nguyen V.
        • Pasculescu A.
        • Jin J.
        • Park J.G.
        • Samson L.D.
        • Woodgett J.R.
        • Russell R.B.
        • Bork P.
        • Yaffe M.B.
        • Pawson T.
        Systematic discovery of in vivo phosphorylation networks.
        Cell. 2007; 129: 1415-1426
        • Xue Y.
        • Zhou F.
        • Zhu M.
        • Ahmed K.
        • Chen G.
        • Yao X.
        GPS: a comprehensive www server for phosphorylation sites prediction.
        Nucleic Acids Res. 2005; 33: W184-W187
        • Zhou F.F.
        • Xue Y.
        • Chen G.L.
        • Yao X.
        GPS: a novel group-based phosphorylation predicting and scoring method.
        Biochem. Biophys. Res. Commun. 2004; 325: 1443-1448
        • Xue Y.
        • Li A.
        • Wang L.
        • Feng H.
        • Yao X.
        PPSP: prediction of PK-specific phosphorylation site with Bayesian decision theory.
        BMC Bioinformatics. 2006; 7: 163
        • Blom N.
        • Sicheritz-Ponten T.
        • Gupta R.
        • Gammeltoft S.
        • Brunak S.
        Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence.
        Proteomics. 2004; 4: 1633-1649
        • Obenauer J.C.
        • Cantley L.C.
        • Yaffe M.B.
        Scansite 2.0: proteome-wide prediction of cell signaling interactions using short sequence motifs.
        Nucleic Acids Res. 2003; 31: 3635-3641
        • Huang H.D.
        • Lee T.Y.
        • Tzeng S.W.
        • Horng J.T.
        KinasePhos: a web tool for identifying protein kinase-specific phosphorylation sites.
        Nucleic Acids Res. 2005; 33: W226-W229
        • Wong Y.H.
        • Lee T.Y.
        • Liang H.K.
        • Huang C.M.
        • Wang T.Y.
        • Yang Y.H.
        • Chu C.H.
        • Huang H.D.
        • Ko M.T.
        • Hwang J.K.
        KinasePhos 2.0: a web server for identifying protein kinase-specific phosphorylation sites based on sequences and coupling patterns.
        Nucleic Acids Res. 2007; 35: W588-W594
        • Kim J.H.
        • Lee J.
        • Oh B.
        • Kimm K.
        • Koh I.
        Prediction of phosphorylation sites using SVMs.
        Bioinformatics. 2004; 20: 3179-3184
        • Brinkworth R.I.
        • Breinl R.A.
        • Kobe B.
        Structural basis and prediction of substrate specificity in protein serine/threonine kinases.
        Proc. Natl. Acad. Sci. U. S. A. 2003; 100: 74-79
        • Li T.
        • Li F.
        • Zhang X.
        Prediction of kinase-specific phosphorylation sites with sequence features by a log-odds ratio approach.
        Proteins. 2008; 70: 404-414
        • Manning G.
        • Whyte D.B.
        • Martinez R.
        • Hunter T.
        • Sudarsanam S.
        The protein kinase complement of the human genome.
        Science. 2002; 298: 1912-1934
        • Diella F.
        • Cameron S.
        • Gemund C.
        • Linding R.
        • Via A.
        • Kuster B.
        • Sicheritz-Ponten T.
        • Blom N.
        • Gibson T.J.
        Phospho.ELM: a database of experimentally verified phosphorylation sites in eukaryotic proteins.
        BMC Bioinformatics. 2004; 5: 79
        • Caenepeel S.
        • Charydczak G.
        • Sudarsanam S.
        • Hunter T.
        • Manning G.
        The mouse kinome: discovery and comparative genomics of all mouse protein kinases.
        Proc. Natl. Acad. Sci. U. S. A. 2004; 101: 11707-11712
        • Puntervoll P.
        • Linding R.
        • Gemund C.
        • Chabanis-Davidson S.
        • Mattingsdal M.
        • Cameron S.
        • Martin D.M.
        • Ausiello G.
        • Brannetti B.
        • Costantini A.
        • Ferrè F.
        • Maselli V.
        • Via A.
        • Cesareni G.
        • Diella F.
        • Superti-Furga G.
        • Wyrwicz L.
        • Ramu C.
        • McGuigan C.
        • Gudavalli R.
        • Letunic I.
        • Bork P.
        • Rychlewski L.
        • Küster B.
        • Helmer-Citterich M.
        • Hunter W.N.
        • Aasland R.
        • Gibson T.J.
        ELM server: a new resource for investigating short functional sites in modular eukaryotic proteins.
        Nucleic Acids Res. 2003; 31: 3625-3630
        • Gorbsky G.J.
        Mitosis: MCAK under the aura of Aurora B.
        Curr. Biol. 2004; 14: R346-R348
        • Lan W.
        • Zhang X.
        • Kline-Smith S.L.
        • Rosasco S.E.
        • Barrett-Wilt G.A.
        • Shabanowitz J.
        • Hunt D.F.
        • Walczak C.E.
        • Stukenberg P.T.
        Aurora B phosphorylates centromeric MCAK and regulates its localization and microtubule depolymerization activity.
        Curr. Biol. 2004; 14: 273-286
        • Honda R.
        • Korner R.
        • Nigg E.A.
        Exploring the functional interactions between Aurora B, INCENP, and survivin in mitosis.
        Mol. Biol. Cell. 2003; 14: 3325-3341
        • Kawajiri A.
        • Yasui Y.
        • Goto H.
        • Tatsuka M.
        • Takahashi M.
        • Nagata K.
        • Inagaki M.
        Functional significance of the specific sites phosphorylated in desmin at cleavage furrow: Aurora-B may phosphorylate and regulate type III intermediate filaments during cytokinesis coordinatedly with Rho-kinase.
        Mol. Biol. Cell. 2003; 14: 1489-1500
        • Biondi R.M.
        • Nebreda A.R.
        Signalling specificity of Ser/Thr protein kinases through docking-site-mediated interactions.
        Biochem. J. 2003; 372: 1-13
        • Holland P.M.
        • Cooper J.A.
        Protein modification: docking sites for kinases.
        Curr. Biol. 1999; 9: R329-R331
        • Yaffe M.B.
        • Leparc G.G.
        • Lai J.
        • Obata T.
        • Volinia S.
        • Cantley L.C.
        A motif-based profile scanning approach for genome-wide prediction of signaling pathways.
        Nat. Biotechnol. 2001; 19: 348-353
        • Salwinski L.
        • Miller C.S.
        • Smith A.J.
        • Pettit F.K.
        • Bowie J.U.
        • Eisenberg D.
        The Database of Interacting Proteins: 2004 update.
        Nucleic Acids Res. 2004; 32: D449-D451
        • Stark C.
        • Breitkreutz B.J.
        • Reguly T.
        • Boucher L.
        • Breitkreutz A.
        • Tyers M.
        BioGRID: a general repository for interaction datasets.
        Nucleic Acids Res. 2006; 34: D535-D539
        • Zanzoni A.
        • Montecchi-Palazzi L.
        • Quondam M.
        • Ausiello G.
        • Helmer-Citterich M.
        • Cesareni G.
        MINT: a Molecular INTeraction database.
        FEBS Lett. 2002; 513: 135-140
        • Alfarano C.
        • Andrade C.E.
        • Anthony K.
        • Bahroos N.
        • Bajec M.
        • Bantoft K.
        • Betel D.
        • Bobechko B.
        • Boutilier K.
        • Burgess E.
        • Buzadzija K.
        • Cavero R.
        • D'Abreo C.
        • Donaldson I.
        • Dorairajoo D.
        • Dumontier M.J.
        • Dumontier M.R.
        • Earles V.
        • Farrall R.
        • Feldman H.
        • Garderman E.
        • Gong Y.
        • Gonzaga R.
        • Grytsan V.
        • Gryz E.
        • Gu V.
        • Haldorsen E.
        • Halupa A.
        • Haw R.
        • Hrvojic A.
        • Hurrell L.
        • Isserlin R.
        • Jack F.
        • Juma F.
        • Khan A.
        • Kon T.
        • Konopinsky S.
        • Le V.
        • Lee E.
        • Ling S.
        • Magidin M.
        • Moniakis J.
        • Montojo J.
        • Moore S.
        • Muskat B.
        • Ng I.
        • Paraiso J.P.
        • Parker B.
        • Pintilie G.
        • Pirone R.
        • Salama J.J.
        • Sgro S.
        • Shan T.
        • Shu Y.
        • Siew J.
        • Skinner D.
        • Snyder K.
        • Stasiuk R.
        • Strumpf D.
        • Tuekam B.
        • Tao S.
        • Wang Z.
        • White M.
        • Willis R.
        • Wolting C.
        • Wong S.
        • Wrong A.
        • Xin C.
        • Yao R.
        • Yates B.
        • Zhang S.
        • Zheng K.
        • Pawson T.
        • Ouellette B.F.
        • Hogue C.W.
        The Biomolecular Interaction Network Database and related tools 2005 update.
        Nucleic Acids Res. 2005; 33: D418-D424
        • Mishra G.R.
        • Suresh M.
        • Kumaran K.
        • Kannabiran N.
        • Suresh S.
        • Bala P.
        • Shivakumar K.
        • Anuradha N.
        • Reddy R.
        • Raghavan T.M.
        • Menon S.
        • Hanumanthu G.
        • Gupta M.
        • Upendran S.
        • Gupta S.
        • Mahesh M.
        • Jacob B.
        • Mathew P.
        • Chatterjee P.
        • Arun K.S.
        • Sharma S.
        • Chandrika K.N.
        • Deshpande N.
        • Palvankar K.
        • Raghavnath R.
        • Krishnakanth R.
        • Karathia H.
        • Rekha B.
        • Nayak R.
        • Vishnupriya G.
        • Kumar H.G.
        • Nagini M.
        • Kumar G.S.
        • Jose R.
        • Deepthi P.
        • Mohan S.S.
        • Gandhi T.K.
        • Harsha H.C.
        • Deshpande K.S.
        • Sarker M.
        • Prasad T.S.
        • Pandey A.
        Human protein reference database—2006 update.
        Nucleic Acids Res. 2006; 34: D411-D414
        • von Mering C.
        • Jensen L.J.
        • Snel B.
        • Hooper S.D.
        • Krupp M.
        • Foglierini M.
        • Jouffre N.
        • Huynen M.A.
        • Bork P.
        STRING: known and predicted protein-protein associations, integrated and transferred across organisms.
        Nucleic Acids Res. 2005; 33: D433-D437
        • Mollinari C.
        • Reynaud C.
        • Martineau-Thuillier S.
        • Monier S.
        • Kieffer S.
        • Garin J.
        • Andreassen P.R.
        • Boulet A.
        • Goud B.
        • Kleman J.P.
        • Margolis R.L.
        The mammalian passenger protein TD-60 is an RCC1 family member with an essential role in prometaphase to metaphase progression.
        Dev. Cell. 2003; 5: 295-307
        • Obuse C.
        • Iwasaki O.
        • Kiyomitsu T.
        • Goshima G.
        • Toyoda Y.
        • Yanagida M.
        A conserved Mis12 centromere complex is linked to heterochromatic HP1 and outer kinetochore protein Zwint-1.
        Nat. Cell Biol. 2004; 6: 1135-1141
        • Arnaud L.
        • Pines J.
        • Nigg E.A.
        GFP tagging reveals human Polo-like kinase 1 at the kinetochore/centromere region of mitotic chromosomes.
        Chromosoma. 1998; 107: 424-429