
An Integrated Machine Learning System to Computationally Screen Protein Databases for Protein Binding Peptide Ligands*S

  • Ling Zhang
  • Chen Shao
  • Dexian Zheng
  • Youhe Gao
    Correspondence
    To whom correspondence should be addressed. Tel.: 86-010-6521-2284; Fax: 86-010-6521-2284;
  • Affiliations (all authors)
    From the Proteomics Research Center, National Key Laboratory of Medical Molecular Biology, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences/Peking Union Medical College, 5 Dong Dan San Tiao, 100005 Beijing, China
  • Author Footnotes
    * This work was supported in part by The National Basic Research Program Grant 2004CB520804 and National Natural Science Foundation Grants 30270657, 30230150, and 3037030. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
    S The on-line version of this article (available at http://www.mcponline.org) contains supplemental material.
    ‡ Both authors contributed equally to this work.
Open Access. Published: March 29, 2006. DOI: https://doi.org/10.1074/mcp.M500346-MCP200
      A fairly large set of protein interactions is mediated by families of peptide binding domains, such as Src homology 2 (SH2), SH3, PDZ, major histocompatibility complex, etc. To identify their ligands by experimental screening is not only labor-intensive but almost futile in screening low abundance species due to the suppression by high abundance species. An ideal way of studying protein-protein interactions is to use high throughput computational approaches to screen protein sequence databases to direct the validating experiments toward the most promising peptides. Predictors with only good cross-validation were not good enough to screen protein databases. In the current study we built integrated machine learning systems using three novel coding methods and screened the Swiss-Prot and GenBank™ protein databases for potential ligands of 10 SH3 and three PDZ domains. A large fraction of predictions has already been experimentally confirmed by other independent research groups, indicating a satisfying generalization capability for future applications in identifying protein interactions.
      The abbreviations used are: SH, Src homology; SVM, support vector machine; PNN, probabilistic neural network; SP, screening precision; ESP, estimated screening precision; MHC, major histocompatibility complex; MCC, Matthews correlation coefficient; BLU, Boehringer light unit.
      Experimental screening for protein binding peptides is not only labor-intensive but almost futile in screening for low abundance binding species due to the suppression by high abundance species. A more plausible way of studying protein-protein interactions is to use high throughput computational predictions rather than experimental approaches to screen protein sequence databases for interactions and so direct the validating experiments toward the most promising peptides. A prediction can only be called successful when the overall efficiency and cost of computational prediction plus biological validation are much better than those of experimental screening. A fairly large set of protein interactions is mediated by families of peptide binding domains, such as SH2, SH3, PDZ, MHC, etc., that act as receptors to accommodate, in their binding pockets, short peptides in an extended conformation (
      • Pawson T.
      • Nash P.
      Assembly of cell regulatory systems through protein interaction domains.
      ,
      • Pawson T.
      • Scott J.D.
      Signaling through scaffold, anchoring, and adaptor proteins.
      ). We studied two common domain-ligand interactions mediated by SH3 domain and PDZ domain. SH3 domains selectively bind peptides of 8–11 amino acids on their ligand proteins; PDZ domains bind 4–8 amino acids in the C termini of their partners. Ligands of both PDZ and SH3 domains are of high diversity. Two good reviews provide detailed information on SH3 domain (
      • Mayer B.J.
      SH3 domains: complexity in moderation.
      ) and PDZ domain (
      • Nourry C.
      • Grant S.G.
      • Borg J.P.
      PDZ domain proteins: plug and play!.
      ).
      Two major categories of methods have been developed to predict domain-ligand interactions. The first category is based on structure information as exemplified by the work on predicting MHC-specific epitopes with protein docking methods (
      • Tong J.C.
      • Tan T.W.
      • Ranganathan S.
      Modeling the structure of bound peptide ligands to major histocompatibility complex.
      ,
      • Michielin O.
      • Karplus M.
      Binding free energy differences in a TCR-peptide-MHC complex induced by a peptide mutation: a simulation analysis.
      ). This type of method needs intensive computation and the prior knowledge of the three-dimensional structures of the bait proteins. Instead of using exact protein structures, Brannetti et al. (
      • Brannetti B.
      • Via A.
      • Cestra G.
      • Cesareni G.
      • Helmer-Citterich M.
      SH3-SPOT: an algorithm to predict preferred ligands to different members of the SH3 gene family.
      ) and Altuvia and Margalit (
      • Altuvia Y.
      • Margalit H.
      A structure-based approach for prediction of MHC-binding peptides.
      ) extracted ligand-contacting residues from the known domain-ligand complex structures to approximately represent domain interface structures of SH3 and MHC, respectively. The interactions of amino acids at each position of the ligand with its contacting residues on the target domain were then scored using a statistical amino acid-amino acid pairwise potential table. By this means, peptide binding score was calculated by simply summing up scores over all the positions of the peptide. This type of method does not reflect the interrelations between different positions on the same peptide.
      In contrast to the structure-based methods that include comprehensive and high quality information relevant to interactions in three-dimensional spaces, the second category of prediction methods is based on sequence information from only ligands or from both ligands and domains. One widely used method, Scansite (
      • Obenauer J.C.
      • Cantley L.C.
      • Yaffe M.B.
      Scansite 2.0: proteome-wide prediction of cell signaling interactions using short sequence motifs.
      ) (scansite.mit.edu), calculates the position-specific scoring matrix from the known binding and non-binding peptides of a certain domain to characterize binding profiles. Tong et al. (
      • Tong A.H.
      • Drees B.
      • Nardelli G.
      • Bader G.D.
      • Brannetti B.
      • Castagnoli L.
      • Evangelista M.
      • Ferracuti S.
      • Nelson B.
      • Paoluzi S.
      • Quondam M.
      • Zucconi A.
      • Hogue C.W.
      • Fields S.
      • Boone C.
      • Cesareni G.
      A combined experimental and computational strategy to define protein interaction networks for peptide recognition modules.
      ) have predicted 20 yeast SH3 binding ligands by the position-specific scoring matrix method. Machine learning approaches like artificial neural network (
      • Honeyman M.C.
      • Brusic V.
      • Stone N.L.
      • Harrison L.C.
      Neural network-based prediction of candidate T-cell epitopes.
      ,
      • Brusic V.
      • Rudy G.
      • Honeyman G.
      • Hammer J.
      • Harrison L.
      Prediction of MHC class II-binding peptides using an evolutionary algorithm and artificial neural network.
      ) and support vector machine (SVM) (
      • Dönnes P.
      • Elofsson A.
      Prediction of MHC class I binding peptides, using SVMHC.
      ,
      • Bhasin M.
      • Raghava G.P.
      SVM based method for predicting HLA-DRB1*0401 binding peptides in an antigen sequence.
      ) have been used in predicting MHC binding epitopes. All these algorithms need to collect a large amount of interaction data for each domain to build one predictor. If a domain can bind ligands belonging to different classes, two or more predictors should be built to characterize different binding profiles (
      • Dönnes P.
      • Elofsson A.
      Prediction of MHC class I binding peptides, using SVMHC.
      ,
      • Rammensee H.
      • Bachmann J.
      • Emmerich N.P.
      • Bachor O.A.
      • Stevanovic S.
      SYFPEITHI: database for MHC ligands and peptide motifs.
      ). Dönnes and Elofsson (
      • Dönnes P.
      • Elofsson A.
      Prediction of MHC class I binding peptides, using SVMHC.
      ) have reported that at least 20 ligand peptides of a single class for an MHC class I molecule are required for a reliable prediction, and with up to 50 peptides a slight improvement can be achieved. Even more data are required to build two or more predictors for ligands belonging to different classes. Martin et al. (
      • Martin S.
      • Roe D.
      • Faulon J.L.
      Predicting protein-protein interactions using signature products.
      ) have combined the full-length sequence information of both domains and ligands, devised a protein descriptor called signature products to represent interactions between pairs of amino acid sequences, and predicted ligands of SH3 domains using SVM. Because primary sequences are easy to accumulate, the sequence-based methods can be applied to a large variety of domains. However, primary sequences contain insufficient structure information, and these methods do not emphasize enough the quality of data.
      The existing computational models have been assessed only by cross-validation, and no evidence has been provided of their performance in screening a whole protein database. To computationally screen protein databases and focus the experimental efforts on the most likely interactions, we set up an integrated prediction system, taking advantage of both types of methods. We extracted high quality structural data and a large quantity of aligned sequences of both interacting partners and processed the data with machine learning approaches. (a) Quality of data. We extracted only information relevant to interaction by taking potential interface residues rather than using the full-length sequence. (b) Quantity of data. We collected and aligned a family of domains and their ligands and combined the interacting partners into a single prediction system. Because there are structural and sequence similarities among the domains in each family, one known interacting pair can provide complementary interaction information for other similar pairs. Therefore, fewer data were needed for each domain to train the predictors. (c) Presentation of data. We developed three novel coding methods to represent different aspects of interactions between interface residues, namely orthogonal dot product, physicochemical product, and structural matrix. (d) Processing of data. We used two machine learning approaches, SVM and probabilistic neural network (PNN), to set up three independent predictors. (e) Neural network ensemble. We combined the results of the three independent predictors to achieve better generalization capability. (f) Biological filtering. We filtered the candidate ligands with biological information, i.e. the protein subcellular localization, to further improve the system. The flow chart of our method is shown in Fig. 1.
      Fig. 1. Flow chart of experiment design. Step 1, extracting potential interface residues on PDZ and SH3 domains. Step 2, setting up three independent machine learning predictors with encoded interaction and non-interaction PDZ and SH3 data and then assembling three predictors for PDZ and SH3. Step 3, computational screening of 11,146 proteins from Swiss-Prot human protein database for three PDZ domains and 13,000 peptides total from Swiss-Prot and GenBank™ for 10 SH3 domains to predict their potential ligands. The prediction results were compared with the peptide SPOT array experiment results in which SH3 domains were tested against the same 13,000 peptides as those tested in computational prediction, but PDZ domains were tested against only a subset of Swiss-Prot (6,223 proteins). DB, Database.

      EXPERIMENTAL PROCEDURES

      SH3 and PDZ Structural Alignments and Interface Residue Extraction

      539 SH3 domains and 613 PDZ domains in Swiss-Prot protein database (a few from TrEMBL database) (www.ebi.ac.uk/swissprot/) (
      • Boeckmann B.
      • Bairoch A.
      • Apweiler R.
      • Blatter M.C.
      • Estreicher A.
      • Gasteiger E.
      • Martin M.J.
      • Michoud K.
      • O’Donovan C.
      • Phan I.
      • Pilbout S.
      • Schneider M.
      The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003.
      ) (release 45.1) were identified by SMART (Simple Modular Architecture Research Tool) (
      • Schultz J.
      • Milpetz F.
      • Bork P.
      • Ponting C.P.
      SMART, a simple modular architecture research tool: identification of signaling domains.
      ) (smart.embl-heidelberg.de/). 15 SH3-ligand and 12 PDZ-ligand complex structures were collected from Protein Data Bank (
      • Berman H.M.
      • Bhat T.N.
      • Bourne P.E.
      • Feng Z.
      • Gilliland G.
      • Weissig H.
      • Westbrook J.
      The Protein Data Bank and the challenge of structural genomics.
      ) (www.rcsb.org/pdb). The 15 SH3 sequences were multiple-aligned to make a structural profile by ClustalW 3.0 (
      • Thompson J.D.
      • Higgins D.G.
      • Gibson T.J.
      CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.
      ) (available at www.ebi.ac.uk/clustalw/). The profile was manually adjusted by putting together regions of the same secondary structure and inserting gaps into loop regions. Other SH3 sequences with no resolved structures were multiple-aligned according to the structural profile with some manual adjustment. An SH3 residue was defined as an interface residue when, in any of the 15 complexes of resolved crystallographic structures, the shortest distance between the residue and the SH3 binding ligands is less than 3 Å. The interface residues were considered most relevant to protein interaction and were used to represent full-length sequences of the SH3 domain. A similar procedure was used in extracting PDZ interface residues. Ultimately 27 positions on SH3 and 23 positions on PDZ sequences were extracted as potential interface residues. (Structural alignments of 15 SH3 domains and 12 PDZ domains and interface residue extraction can be found in Supplemental Fig. S1; the full-length sequences and extracted interface residues of 539 SH3 and 613 PDZ domains can be found in Supplemental Table S1.)

      Pretreatment of Ligand Peptides

      For proteins interacting with PDZ domains, the C-terminal 8 amino acids were treated as ligand peptides and arrayed according to their positions. SH3 ligands in the training set were aligned by fitting into a very loose consensus and trimmed to 10 amino acids. Class I ligands were fit to the motif XXXXXX(P/φ)XX(P/φ), and class II ligands were fit to X(P/φ)XX(P/φ)XXXXX (X, any amino acid; φ, hydrophobic amino acid). Atypical peptides were set with Pro at position 5 or 7 of the 10 amino acids. Peptides shorter than 10 amino acids were padded with random amino acids. After alignment of the training set, Pro residues of the peptides were aligned at position 2, 5, 7, or 10. Therefore, each peptide in the test set was pretreated so that Pro fell at any one of positions 2, 5, 7, or 10. Pretreatment of peptides in the SH3 test set is not mandatory; nonetheless it saves computational screening time by reducing the number of peptide segments in the protein database.
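      The pretreatment of test peptides can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that "pretreated with Pro at position 2, 5, 7, or 10" means keeping every 10-residue window of a protein that has Pro at one of those positions, and the function names are hypothetical.

```python
PRO_POSITIONS = (2, 5, 7, 10)  # 1-based positions, taken from the text


def pdz_ligand(protein_seq, length=8):
    """Return the C-terminal `length` residues, or None if the protein is shorter."""
    if len(protein_seq) < length:
        return None
    return protein_seq[-length:]


def sh3_candidate_windows(protein_seq, length=10, pro_positions=PRO_POSITIONS):
    """Slide a window over the sequence and keep windows with Pro at an allowed position.

    Assumption: this filtering is one reading of the pretreatment step; the
    source does not spell out the exact procedure.
    """
    windows = []
    for start in range(len(protein_seq) - length + 1):
        window = protein_seq[start:start + length]
        if any(window[p - 1] == "P" for p in pro_positions):
            windows.append(window)
    return windows
```

      Filtering in this way reduces the number of segments passed to the predictors, which is the time saving the text refers to.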

      Interaction Data Representation

      Dot Product of Orthogonal Coding—

      Each interface residue on domains and each residue on ligands was coded by the traditional orthogonal method as a 20-dimensional 0-1 vector. A domain and a ligand could then be described using vectors $a = (a_1, \ldots, a_n)^T \in \mathbb{R}^n$ and $b = (b_1, \ldots, b_m)^T \in \mathbb{R}^m$, respectively, where n indicates the number of interface residues on the domain and m indicates the number of residues on the ligand. To describe interactions between pairs of amino acids, a tensor product between vectors a and b was defined to be $(a_1 b_1, a_1 b_2, \ldots, a_1 b_m, a_2 b_1, \ldots, a_n b_m)^T \in \mathbb{R}^{nm}$. Using this definition, the coding for pairs of amino acids, called orthogonal dot product, was established.
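      A minimal sketch of this coding is given below. It rests on an assumption the text does not state explicitly: that each product a_i b_j is the flattened outer product of two 20-dimensional one-hot vectors, giving 400 dimensions with a single 1 per residue pair. The alphabet order and function names are illustrative, and gap characters from the alignment are not handled.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residues
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(residue):
    """20-dimensional orthogonal (0-1) coding of one amino acid."""
    vec = [0] * 20
    vec[INDEX[residue]] = 1
    return vec


def pair_code(domain_residue, ligand_residue):
    """Flattened outer product of two one-hot vectors: 400 dims, exactly one 1."""
    a, b = one_hot(domain_residue), one_hot(ligand_residue)
    return [x * y for x in a for y in b]


def orthogonal_dot_product(interface_residues, ligand):
    """Concatenate pair codes in the order a1b1, a1b2, ..., a1bm, a2b1, ..., anbm."""
    code = []
    for d in interface_residues:
        for l in ligand:
            code.extend(pair_code(d, l))
    return code
```

      Under this reading, a domain with n interface residues and a ligand of m residues yields a sparse vector of n × m × 400 entries containing n × m ones, which is the kind of sparse, high dimensional input for which SVM is well suited.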

      Physicochemical Product Coding—

      Each amino acid was represented by five physicochemical properties: hydrophobicity (
      • Kyte J.
      • Doolittle R.F.
      A simple method for displaying the hydropathic character of a protein.
      ), charge (
      • Bull H.B.
      • Breese K.
      Surface tension of amino acid solutions: a hydrophobicity scale of the amino acid residues.
      ), van der Waals volume (
      • Chothia C.
      Structural invariants in protein folding.
      ), flexibility (
      • Bhaskaran R.
      • Ponnuswamy P.K.
      Dynamics of amino acid residues in globular proteins.
      ), and bulkiness (
      • Zimmerman J.M.
      • Eliezer N.
      • Simha R.
      The characterization of amino acid sequences in proteins by statistical methods.
      ). Every property was classified into five groups by its value (the property classification of each of the 20 amino acids can be found in Supplemental Fig. S2). For one property, one amino acid was assigned to class 1, 2, 3, 4, or 5; the combination of two amino acids could be 11, 22, 33, 44, 55, 12, 13, 14, 15, 23, 24, 25, 34, 35, or 45 (in total 15 combinations) coded in 15 binaries. A pair of amino acids was represented by five properties, each with 15 combinations, so that a pair of amino acids was coded in 15 × 5 = 75 binaries, including 70 dimensions of zero and 5 dimensions of one. Each interface residue on domains and each residue on ligands were paired and coded in a set of 75 binaries. Then the interaction between a domain (m contacting residues) and a ligand (n residues) was coded in m × n × 75 binaries.
      Previous studies have used the measurements of the physicochemical properties of amino acids directly to code peptide sequences. But for the same property of the same amino acid, different literature reported slightly different measurements. We found that when using the previously published physicochemical coding method even a slight difference between measurements led to a different prediction performance. The differences did not interfere with the present physicochemical product coding method.
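      The 75-binary pair code can be illustrated with a short sketch. The class assignments below are hypothetical placeholders for two residues only (the real classification is in Supplemental Fig. S2, which is not reproduced here), and the ordering of the 15 combinations is arbitrary.

```python
from itertools import combinations_with_replacement

# The 15 unordered combinations of classes 1..5: (1,1), (1,2), ..., (5,5).
CLASS_PAIRS = list(combinations_with_replacement(range(1, 6), 2))
PAIR_INDEX = {pair: i for i, pair in enumerate(CLASS_PAIRS)}

# Property order: hydrophobicity, charge, van der Waals volume, flexibility, bulkiness.
# HYPOTHETICAL class values, for illustration only.
HYPOTHETICAL_CLASSES = {
    "P": (3, 3, 2, 1, 4),
    "R": (1, 5, 5, 4, 3),
}


def pair_code(res_a, res_b, classes=HYPOTHETICAL_CLASSES):
    """Code a residue pair as 5 properties x 15 combinations = 75 binaries."""
    code = []
    for class_a, class_b in zip(classes[res_a], classes[res_b]):
        block = [0] * 15
        # The combination is unordered, so (2, 4) and (4, 2) share one slot.
        block[PAIR_INDEX[tuple(sorted((class_a, class_b)))]] = 1
        code.extend(block)
    return code
```

      Because only the class of each property enters the code, small differences between published measurements leave the code unchanged as long as the class assignment does not change.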

      Matrix of Structurally Interacting Potentials—

      Betancourt and Thirumalai (
      • Betancourt M.R.
      • Thirumalai D.
      Pair potentials for protein folding: choice of reference states and sensitivity of predicted native states to variations in the interaction schemes.
      ) have given a scale for interaction energies between the naturally occurring amino acid residues called potential matrix. The affinity of each amino acid pair composed of any of the 20 amino acids is expressed as a constant in the 20 × 20 matrix. Interaction between an interface residue on a domain and a residue on a ligand is coded by an element in the matrix.
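      As a sketch of this coding, each domain-ligand residue pair is replaced by one number looked up in the potential matrix. The values below are hypothetical placeholders, not the published Betancourt-Thirumalai potentials, and the lookup assumes a symmetric matrix.

```python
# HYPOTHETICAL pair potentials for illustration; a real table has 20 x 20 entries.
HYPOTHETICAL_POTENTIAL = {
    ("L", "P"): -0.3,
    ("D", "R"): -0.7,
    ("L", "R"): 0.2,
    ("D", "P"): 0.1,
}


def lookup(potential, a, b):
    """Symmetric lookup: the pair (a, b) and the pair (b, a) share one value."""
    if (a, b) in potential:
        return potential[(a, b)]
    return potential[(b, a)]


def matrix_code(interface_residues, ligand, potential=HYPOTHETICAL_POTENTIAL):
    """One real-valued feature per (interface residue, ligand residue) pair."""
    return [lookup(potential, d, l) for d in interface_residues for l in ligand]
```

      This coding is dense and low dimensional (one value per residue pair), in contrast to the two sparse binary codings above.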

      Machine Learning Approaches

      SVM is a machine learning approach based on statistical learning theory. A full coverage of SVM is given by Vapnik (
      • Vapnik V.N.
      ). The decision rules developed by the system generate a discrete decision (>0, interaction; <0, no interaction) upon introduction of a new set of putative interaction pairs. SVM learning was implemented using SVMlight (
      • Joachims T.
      Making large-scale SVM learning practical.
      ) (svmlight.joachims.org).
      PNN is a kind of artificial neural network that is suitable for classification problems. We used the neural network toolbox of the commercial software MATLAB (Version 6.5) to design PNNs. An introduction to this toolbox is available at www.mathworks.com.
      PNN shows good classification and generalization capabilities but has difficulty in dealing with high dimensional input vectors; SVM is suitable for the classification of sparse, high dimensional inputs. We used SVM to process orthogonal dot product and physicochemical product-encoded datasets and PNN to process matrix-encoded datasets.
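      For readers unfamiliar with PNN, the idea can be sketched as a Gaussian kernel density classifier: each training sample contributes a radial basis function, and the class with the largest average response wins. This is a generic textbook formulation written for illustration; it is not the MATLAB toolbox implementation, whose exact scaling of the spread parameter may differ.

```python
import math


def pnn_classify(x, training, spread):
    """Classify x with a simple probabilistic neural network.

    training: list of (feature_vector, label) pairs.
    spread:   width of the Gaussian radial basis functions.
    """
    sums, counts = {}, {}
    for vec, label in training:
        sq_dist = sum((xi - vi) ** 2 for xi, vi in zip(x, vec))
        response = math.exp(-sq_dist / (2.0 * spread ** 2))
        sums[label] = sums.get(label, 0.0) + response
        counts[label] = counts.get(label, 0) + 1
    # Pick the class whose samples respond most strongly on average.
    return max(sums, key=lambda label: sums[label] / counts[label])
```

      Because every response depends on a distance over all input dimensions, such a classifier is better matched to the compact matrix-encoded data than to the sparse, high dimensional binary codes.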

      5-Fold Cross-validation

      The training set was randomly divided into five equally sized subsets. Each subset was used in turn as a test set, whereas the remaining four subsets were used to train the predictors. TP (true positive), FP (false positive), TN (true negative), and FN (false negative) in the five test sets were counted. The performance of each predictor was evaluated by calculating the mean value of three indices in the five tests with accompanying errors: precision, TP/(TP + FP); sensitivity, TP/(TP + FN); and Matthews correlation coefficient (MCC) (
      • Matthews B.W.
      Comparison of the predicted and observed secondary structure of T4 phage lysozyme.
      ).
      $$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN) \times (TP + FP) \times (TN + FP) \times (TN + FN)}}$$
      (Eq. 1)


      In a totally correct prediction, MCC = 1; in a totally incorrect prediction, MCC = −1.
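      The fold assignment and the three indices can be sketched as below. The helper names are illustrative, the random seed is an arbitrary choice, and the metric function does not guard against empty denominators (for example a fold with no predicted positives).

```python
import math
import random


def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Randomly split sample indices into n_folds subsets of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def metrics(tp, fp, tn, fn):
    """Return (precision, sensitivity, MCC) from the four confusion counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    denominator = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator
    return precision, sensitivity, mcc
```

      Each subset returned by `five_fold_indices` serves once as the test set while the other four are used for training; the indices are then averaged over the five tests.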

      Test Set Obtained from Peptide SPOT Arrays

      Three PDZ domains (Erbin, Lap2_Human; Af-6, Afad_Human; and Sna1, Sna1_Human) were tested against the C-terminal peptide of 11 amino acids from 6,223 human proteins in peptide SPOT array experiments by Boisguerin et al. (
      • Boisguerin P.
      • Leben R.
      • Ay B.
      • Radziwill G.
      • Moelling K.
      • Dong L.
      • Volkmer-Engert R.
      An improved method for the synthesis of cellulose membrane-bound peptides with free C termini is useful for PDZ domain binding studies.
      ) and Wiedemann et al. (
      • Wiedemann U.
      • Boisguerin P.
      • Leben R.
      • Leitner D.
      • Krause G.
      • Moelling K.
      • Volkmer-Engert R.
      • Oschkinat H.
      Quantification of PDZ domain specificity, prediction of ligand affinity and rational design of super-binding peptides.
      ). The strongest 100 interactions were selected as positive ligands by the authors for each PDZ. In SH3 experiments by Landgraf et al. (
      • Landgraf C.
      • Panni S.
      • Montecchi-Palazzi L.
      • Castagnoli L.
      • Schneider-Mergener J.
      • Volkmer-Engert R.
      • Cesareni G.
      Protein interaction networks by proteome peptide scanning.
      ), 10 SH3 domains (Abp1, Abp1_yeast; Boi1, Bob1_yeast; Boi2, Boi2_yeast; Myo5, Mys5_yeast; Rvs167, R167_yeast; Sho1, Ss81_yeast; Yhr016c, Yhh6_yeast; Yfr024c, Yfj4_yeast; Amphiphysin, Amph_human; and Endophilin, Sh32_human) were chosen to screen peptide arrays containing 672–2,032 peptides using Boehringer light unit (BLU) to represent the affinity between each SH3 and each 13-amino acid peptide sequence. The positive ligands were those with BLU larger than mean BLU plus 2 times standard deviation among total peptides tested. The rest of the peptides were regarded as non-binders.
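      The binder threshold for the SH3 arrays can be written out as a short sketch. Whether the original analysis used the population or the sample standard deviation is not stated; the population form is assumed here, and the peptide identifiers are hypothetical.

```python
import statistics


def positive_ligands(blu_by_peptide, n_sd=2.0):
    """Return peptides whose BLU exceeds mean + n_sd * standard deviation.

    Assumption: population standard deviation over all peptides on the array.
    """
    values = list(blu_by_peptide.values())
    cutoff = statistics.mean(values) + n_sd * statistics.pstdev(values)
    return {peptide for peptide, blu in blu_by_peptide.items() if blu > cutoff}
```

      All peptides not returned by this function would be treated as non-binders, as described above.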

      RESULTS

      System Setup—

      Based on secondary structure conformation, 539 SH3 and 613 PDZ domains were structurally aligned. 27 positions on SH3 sequences and 23 positions on PDZ sequences were extracted as potential interface residues. SH3 ligands were trimmed to 10 amino acids of protein internal sequence; PDZ ligands were the protein C-terminal 8-residue peptide.
      598 interactions between 42 SH3 domains and their ligands along with 770 non-interaction pairs of 19 SH3 domains and the corresponding peptides (
      • Brannetti B.
      • Via A.
      • Cestra G.
      • Cesareni G.
      • Helmer-Citterich M.
      SH3-SPOT: an algorithm to predict preferred ligands to different members of the SH3 gene family.
      ,
      • Tong A.H.
      • Drees B.
      • Nardelli G.
      • Bader G.D.
      • Brannetti B.
      • Castagnoli L.
      • Evangelista M.
      • Ferracuti S.
      • Nelson B.
      • Paoluzi S.
      • Quondam M.
      • Zucconi A.
      • Hogue C.W.
      • Fields S.
      • Boone C.
      • Cesareni G.
      A combined experimental and computational strategy to define protein interaction networks for peptide recognition modules.
      ) were collected as the training set for SH3 interaction prediction. 338 interactions between 105 PDZ domains and 210 proteins (
      • Beuming T.
      • Skrabanek L.
      • Niv M.Y.
      • Mukherjee P.
      • Weinstein H.
      PDZBase: a protein-protein interaction database for PDZ-domains.
      ) were collected as the training set for PDZ interaction prediction (details are in Supplemental Table S2). To create negative data, the ligand sequence from each of these 338 interacting pairs was randomly rearranged. The protein domain and the shuffled sequence were regarded as negative and entered into the training set. SH3 positive data were duplicated to make the ratio of positive to negative data in the training set approximately 1:1.
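      The generation of PDZ negative data by random rearrangement can be sketched as follows. The retry loop, which avoids returning a shuffle identical to the original ligand, is an addition made for this illustration and is not described in the text; the sketch also does not check whether a shuffled peptide happens to be a true ligand.

```python
import random


def shuffled_negative(ligand, rng, max_tries=20):
    """Return a random rearrangement of `ligand` that differs from it, or None."""
    residues = list(ligand)
    for _ in range(max_tries):
        rng.shuffle(residues)
        candidate = "".join(residues)
        if candidate != ligand:
            return candidate
    return None  # e.g. a ligand made of one repeated residue
```

      Shuffling preserves the amino acid composition of each ligand, so the negatives differ from the positives only in the positions of the residues.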
      Each coding method emphasizes one of three different aspects that affect protein and ligand interactions. Orthogonal dot product was a product modification of orthogonal coding (
      • Baldi P.
      • Brunak S.
      ), which sets each amino acid pair furthest apart in a distinct dimension. Physicochemical product represented the physical and chemical features of amino acid combinations. Structural matrix indicated the likelihood of two amino acids to interact in three-dimensional structures. All three predictors are oriented around interaction information between protein domains and their binding partners with no other irrelevant information.
      Three predictors were trained independently with the three encoded training sets by machine learning approaches. The performance of each predictor was tested using 5-fold cross-validation (Fig. 2) by adjusting several parameters. The performance of each predictor was evaluated by calculating three indices: precision, sensitivity, and MCC. To computationally screen protein database with higher reliability and to reduce the false positives, we set the predictors to high precision with compromised sensitivity (Table I) rather than choosing the predictor with the best MCC as usually done in other studies. More detailed rationales can be found under “Discussion.”
      Fig. 2. Performance of three predictors on training sets. Each curve illustrates 5-fold cross-validation results of a predictor modified with varied parameters. A, precision versus sensitivity curve of three predictors for PDZ domains. B, precision versus sensitivity curve of three predictors for SH3 domains. C, precision versus MCC curve of three predictors for PDZ domains. D, precision versus MCC curve of three predictors for SH3 domains. + illustrates the optimized system used for database screening.
      Table I. 5-Fold cross-validation results
      Domain | Predictor | MCC | Precision (%) | Sensitivity (%) | Parameters
      PDZ | Dot product (SVM) | 0.732 (0.0555) | 95.8 (3.15) | 74.8 (3.77) | Polynomial kernel: degree = 2
      PDZ | Physicochemical product (SVM) | 0.6989 (0.0297) | 94.91 (2.23) | 71.60 (1.62) | RBF kernel: γ = 0.004, c = 100
      PDZ | Structural matrix (PNN) | 0.6140 (0.0335) | 93.05 (1.41) | 62.73 (5.31) | Spread of radial basis functions = 0.12
      PDZ | Assembled system | 0.6991 (0.0237) | 94.93 (2.17) | 71.61 (1.76) |
      SH3 | Dot product (SVM) | 0.6947 (0.0246) | 97.09 (0.85) | 66.73 (3.04) | Polynomial kernel: degree = 3
      SH3 | Physicochemical product (SVM) | 0.6769 (0.0378) | 97.41 (0.81) | 62.99 (5.28) | RBF kernel: γ = 0.0027, c = 100
      SH3 | Structural matrix (PNN) | 0.4754 (0.0346) | 93.26 (2.00) | 40.80 (4.12) | Spread of radial basis functions = 0.1
      SH3 | Assembled system | 0.6682 (0.0413) | 97.79 (1.22) | 62.41 (5.34) |
      The prediction results of the three independent component predictors were combined by majority voting to build an assembled system. A ligand was finally classified as positive or negative if it was predicted to be binding or non-binding, respectively, by two or more predictors. The 5-fold cross-validation results of the assembled system are also shown in Table I. The assembled system performed no better than the best-performing individual predictor on 5-fold cross-validation. However, the neural network ensemble did improve the accuracy of prediction in database screening as shown in Tables II and III. An ensemble of the three independent predictors would dramatically improve the generalization capability (
      • Hansen L.K.
      • Salamon P.
      Neural network ensembles.
      ), which is the key for database screening but nonetheless cannot be shown by the cross-validation result generated from very limited data.
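      The majority vote over the three predictors is simple enough to state as code. In this sketch the two SVM outputs are taken as decision values (positive means interaction, following the decision rule given under "Machine Learning Approaches"), and the PNN output is taken as a boolean; the function name and argument layout are illustrative.

```python
def ensemble_predict(svm_dot_score, svm_physchem_score, pnn_is_binder):
    """Return True (binder) when at least two of the three predictors agree."""
    votes = [svm_dot_score > 0, svm_physchem_score > 0, bool(pnn_is_binder)]
    return sum(votes) >= 2
```

      With three voters there is never a tie, so every candidate peptide receives a definite positive or negative call.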
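      Table II. Prediction results of three PDZ domains and verification results by peptide SPOT array experiments
      Domain | Positive samples learned | Positive/whole set | Dot product (SVM) Pred | Dot product (SVM) Conf | Physicochemical product (SVM) Pred | Physicochemical product (SVM) Conf | Structural matrix (PNN) Pred | Structural matrix (PNN) Conf | Ensemble Pred | Ensemble Conf
      Erbin | 4 | 99/11,143 | 252 | 41 | 134 | 37 | 453 | 14 | 134 | 29
      Af-6 | 10 | 96/11,138 | 277 | 28 | 107 | 16 | 450 | 25 | 96 | 14
      Sna1 | 5 | 96/11,141 | 431 | 41 | 473 | 44 | 257 | 16 | 227 | 35
      Total | 19 | 291/33,422 | 960 | 110 | 714 | 97 | 1,160 | 55 | 457 | 78
      SP (%): Dot product (SVM), 11.46; Physicochemical product (SVM), 13.59; Structural matrix (PNN), 4.74; Ensemble, 17.07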
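      Table III. Prediction results of 10 SH3 domains and verification results by peptide SPOT array experiments
      Domain | Positive samples learned | Positive/whole set | Dot product (SVM) Pred | Dot product (SVM) Conf | Physicochemical product (SVM) Pred | Physicochemical product (SVM) Conf | Structural matrix (PNN) Pred | Structural matrix (PNN) Conf | Ensemble Pred | Ensemble Conf
      Abp1 | 13 | 14/1,327 | 59 | 6 | 8 | 1 | 3 | 1 | 10 | 1
      Boi1 | 9 | 27/1,379 | 13 | 4 | 4 | 0 | 15 | 3 | 3 | 3
      Boi2 | 8 | 10/1,343 | 12 | 3 | 5 | 0 | 1 | 0 | 1 | 0
      Myo5 | 15 | 28/1,141 | 58 | 20 | 51 | 9 | 135 | 17 | 49 | 17
      Rvs167 | 13 | 15/672 | 28 | 7 | 3 | 1 | 30 | 5 | 7 | 4
      Sho1 | 8 | 34/1,023 | 45 | 20 | 25 | 14 | 8 | 3 | 11 | 10
      Yhr016c | 11 | 11/672 | 35 | 8 | 35 | 6 | 29 | 4 | 23 | 5
      Yfr024c | 24 | 80/1,379 | 249 | 72 | 321 | 56 | 87 | 20 | 174 | 55
      Amphiphysin | 12 | 98/2,032 | 497 | 89 | 418 | 77 | 47 | 12 | 310 | 76
      Endophilin | 3 | 95/2,032 | 395 | 72 | 109 | 24 | 47 | 9 | 101 | 27
      Total | 116 | 412/13,000 | 1,391 | 301 | 979 | 188 | 407 | 74 | 689 | 198
      SP (%): Dot product (SVM), 21.64; Physicochemical product (SVM), 19.20; Structural matrix (PNN), 18.18; Ensemble, 28.74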

      Prediction of Ligands for 13 Domains and Comparison with Peptide SPOT Array Screening Experiments—

      The three PDZ predictors computationally screened 11,146 human proteins from Swiss-Prot protein database with redundant C-terminal 8-amino acid sequences pre-excluded to predict potential ligands for three PDZ domains. The three SH3 predictors screened 13,000 peptides from 3,657 proteins (subsets of Swiss-Prot human protein database and GenBank™ yeast protein database (
      • Benson D.A.
      • Karsch-Mizrachi I.
      • Lipman D.J.
      • Ostell J.
      • Wheeler D.L.
      GenBank.
      )) for potential ligands for 10 SH3 domains (
      • Landgraf C.
      • Panni S.
      • Montecchi-Palazzi L.
      • Castagnoli L.
      • Schneider-Mergener J.
      • Volkmer-Engert R.
      • Cesareni G.
      Protein interaction networks by proteome peptide scanning.
      ). The prediction results were compared with the experimental screening results from peptide SPOT arrays, excluding any examples learned in the training set. After ensemble, the average overlap of the SH3 and PDZ predictions with the experiments was 28.74% and 17.07%, respectively (Fig. 3). The prediction results of each predictor and the overlap between predictions and experiments for the 13 domains are shown in Tables II and III (details of the sequences predicted by the assembled system and of the predictions validated by experiments can be found in Supplemental Table S3).
      Fig. 3. The overlap between our prediction and the Peptide SPOT (Pep-spot) Array experiment results. a, prediction and experiment results for the three PDZ domains. b, prediction and experiment results for the 10 SH3 domains. DB, database.
      A measure called screening precision (SP) was used to evaluate a predictor’s performance when the whole ligand database was screened to predict the binding ligands of a domain. SP reflects the generalization capability of a predictor.
      $$\mathrm{SP} = \frac{\text{confirmed predicted ligands}}{\text{totally predicted ligands}} \times 100\%$$
      (Eq. 2)


      The last row in Tables II and III shows the SP of each predictor and of the assembled system. Although each domain learned fewer than 10 positive ligands on average, the prediction and experimental results overlapped well. Because the PDZ peptide SPOT array experiments screened only part of the Swiss-Prot database (6,223 proteins out of a total of 11,146), experimentally unconfirmed predictions are not necessarily false positives; some may be real positives that were not included in the experiments. The current SP for PDZ therefore likely underestimates the performance of the system.
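      As a small worked check of Eq. 2, the sketch below recomputes SP from the "Total" rows of Tables II and III. The function name and the dictionary layout are illustrative choices, not part of the original system.

```python
def screening_precision(confirmed: int, predicted: int) -> float:
    """Return SP (Eq. 2) in percent: confirmed predictions over all predictions."""
    if predicted <= 0:
        raise ValueError("predicted must be a positive count")
    return 100.0 * confirmed / predicted


# (predicted, confirmed) totals copied from the "Total" rows of Tables II and III.
PDZ_TOTALS = {
    "dot product (SVM)": (960, 110),
    "physicochemical product (SVM)": (714, 97),
    "structural matrix (PNN)": (1160, 55),
    "ensemble": (457, 78),
}
SH3_TOTALS = {
    "dot product (SVM)": (1391, 301),
    "physicochemical product (SVM)": (979, 188),
    "structural matrix (PNN)": (407, 74),
    "ensemble": (689, 198),
}

for family, totals in (("PDZ", PDZ_TOTALS), ("SH3", SH3_TOTALS)):
    for predictor, (predicted, confirmed) in totals.items():
        sp = screening_precision(confirmed, predicted)
        print(f"{family} {predictor}: SP = {sp:.2f}%")
```

      Run as is, this reproduces the SP rows of both tables, for example 17.07% for the PDZ ensemble and 28.74% for the SH3 ensemble.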
      The assembled prediction system improved the screening precision over every individual predictor (17.07% versus 4.74 to 13.59% for PDZ and 28.74% versus 18.18 to 21.64% for SH3; Tables II and III). The ensemble has several advantages. (a) Each predictor was trained to focus on one property rather than on a universal code containing all kinds of information, which reduces the complexity of the machine learning task (
      • Baum E.
      • Haussler D.
      What size net gives valid generalization?.
      ). (b) Because the errors shared by the three independent predictors were far fewer than the errors of any single predictor, the assembled system achieved higher accuracy (
      • Perrone M.
      • Cooper L.
      ). (c) The assembled system integrated information from many aspects. A new predictor handling new aspects of information could easily be incorporated into the system to improve the generalization capability.

      Generalization of Ligands Belonging to Different Classes—

      SH3 domains bind two major classes of common motifs, (+XφPXφP) and (PXφPX+) (X, any of the 20 amino acids; +, basic amino acid; φ, hydrophobic amino acid), as well as other uncommon variants (
      • Mayer B.J.
      SH3 domains: complexity in moderation.
      ). There are two major classes of PDZ binding ligands, characterized as X(S/T)Xφ-COOH and XφXφ-COOH, and a large variety of atypical ligands (
      • Nourry C.
      • Grant S.G.
      • Borg J.P.
      PDZ domain proteins: plug and play!.
      ). We integrated the ligands of all classes into a single system for learning. Ligands for each of the 13 domains were then predicted by screening the database, and the results were classified and compared with the experimentally evaluated class I, class II, and other atypical ligands.
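      The motif classes above can be written as regular expressions. The sketch below is only an illustration of the motif notation, not the authors' classifier. It assumes that the first motif listed for each domain family is class I and the second is class II, that "+" means R or K, and that φ means one of A, V, I, L, M, F, W, or Y; the residue sets and the example peptides are assumptions made for this example.

```python
import re

# Assumed residue sets (not specified in the text): "+" and "φ" in the motifs.
BASIC = "RK"
HYDROPHOBIC = "AVILMFWY"

# "." stands for X (any residue); input is assumed to be one-letter amino acid codes.
SH3_PATTERNS = {
    "class I": re.compile(f"[{BASIC}].[{HYDROPHOBIC}]P.[{HYDROPHOBIC}]P"),  # +XφPXφP
    "class II": re.compile(f"P.[{HYDROPHOBIC}]P.[{BASIC}]"),                # PXφPX+
}
# PDZ motifs are anchored at the free C terminus, hence the "$".
PDZ_PATTERNS = {
    "class I": re.compile(f".[ST].[{HYDROPHOBIC}]$"),                   # X(S/T)Xφ-COOH
    "class II": re.compile(f".[{HYDROPHOBIC}].[{HYDROPHOBIC}]$"),       # XφXφ-COOH
}


def motif_classes(peptide: str, patterns: dict) -> list:
    """Return the names of all motif classes found in the peptide."""
    peptide = peptide.upper()
    return [name for name, pattern in patterns.items() if pattern.search(peptide)]


print(motif_classes("RPLPPLP", SH3_PATTERNS))  # SH3 peptide with the first motif
print(motif_classes("PPLPPR", SH3_PATTERNS))   # SH3 peptide with the second motif
print(motif_classes("ESDV", PDZ_PATTERNS))     # C terminus with S at position -2
print(motif_classes("EVFI", PDZ_PATTERNS))     # C terminus with V at position -2
```

      An empty list corresponds to an atypical ligand under these assumed definitions. A peptide may match more than one class, which is why the function returns a list rather than a single label.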
      To illustrate the generalization capability of our method on different classes of ligands, we show the results for three domains as examples (Table IV). For each of these three domains, our prediction system learned only a single class of ligands. Yhr016c SH3 and Af-6 PDZ were predicted, and experimentally confirmed, to bind at least two classes of ligands, probably because of information shared from other similar domains. On the other hand, the prediction system did not falsely predict a second class of ligands for a domain that was experimentally shown to be specific for only one class. Sho1 SH3 was predicted to bind only class I ligands, and the predicted absence of class II ligands was confirmed by the peptide SPOT array experiment. The remaining 10 domains, not shown in Table IV, were all predicted to bind the classes of ligands that were learned in the training set.
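      Table IV. The generalization capability of our prediction system on ligands of different classes

      | | Yhr016c SH3, Class I | Yhr016c SH3, Class II | Af-6 PDZ, Class I | Af-6 PDZ, Class II | Sho1 SH3, Class I | Sho1 SH3, Class II |
      |---|---|---|---|---|---|---|
      | Training positives | 0 | 11 | 0 | 10 | 5 | 0 |
      | Predicted | 5 | 0 | 57 | 31 | 10 | 0 |
      | Confirmed | 1 | 0 | 10 | 3 | 9 | 0 |
      | Experiment positives | 4 | 0 | 37 | 48 | 21 | 0 |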

      Biological Filters Complemented to Predictions—

      To reduce the number of experiments needed for validation and to increase the experimental success rate, we recommend selecting among the prediction results with biological filters, e.g. similar gene expression profiles, subcellular co-localization, or shared function. Using a protein subcellular localization database, DBSubLoc (
      • Guo T.
      • Hua S.
      • Ji X.
      • Sun Z.
      DBSubLoc: database of protein subcellular localization.
      ) (www.bioinfo.tsinghua.edu.cn/SubLoc/), we excluded the proteins that were not co-localized in the same cellular compartment as the proteins containing the target domain. Because the existing database (August 2005) does not yet annotate localization for every protein in Swiss-Prot, we provide only a few examples of the effect of biological filtering in Table V (detailed localization information for each predicted interactor can be found in Supplemental Table S4). The screening precision of our system increased by more than 5 percentage points, from 24.52% (76 of 310) to 31.25% (14 of 43) for Amphiphysin SH3 and from 14.58% (14 of 96) to 19.64% (11 of 53) for Af-6 PDZ. Biological filters will play a more important role as more information becomes available in databases. Other biological information could help further improve the prediction.
      Table V. Improvement on prediction precision by filtering with protein subcellular localization information

      | | Amphiphysin SH3 | Af-6 PDZ |
      |---|---|---|
      | Subcellular localization | Cytoskeleton | Membrane |
      | Possible interactor localization | Cytosol, inner membrane, cytoskeleton | Membrane, cytosol, cytoskeleton |
      | Predicted and co-localized | 43 | 53 |
      | Co-localized and confirmed | 14 | 11 |
      | Screening precision (%) | 31.25 | 19.64 |
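      The co-localization filter can be sketched as a set intersection between a predicted protein's annotated compartments and the compartments allowed for interactors of the target domain. The allowed set below is the "possible interactor localization" row for Amphiphysin SH3 in Table V. The protein identifiers, their annotations, the function name, and the handling of unannotated proteins are hypothetical and chosen only for illustration; the real annotations would come from DBSubLoc.

```python
from typing import Dict, List, Set

# Allowed interactor compartments for Amphiphysin SH3, taken from Table V.
ALLOWED_FOR_AMPHIPHYSIN: Set[str] = {"cytosol", "inner membrane", "cytoskeleton"}


def filter_by_localization(
    predicted: List[str],
    localization: Dict[str, Set[str]],
    allowed: Set[str],
    keep_unannotated: bool = False,
) -> List[str]:
    """Keep predicted proteins annotated in at least one allowed compartment.

    Proteins without any annotation are dropped unless keep_unannotated is True;
    how to treat them is an assumption of this sketch.
    """
    kept = []
    for protein in predicted:
        compartments = localization.get(protein)
        if compartments is None:
            if keep_unannotated:
                kept.append(protein)
            continue
        if compartments & allowed:
            kept.append(protein)
    return kept


# Hypothetical predictions and annotations, for illustration only.
predicted = ["P1", "P2", "P3", "P4"]
localization = {
    "P1": {"cytosol"},
    "P2": {"nucleus"},
    "P3": {"cytoskeleton", "nucleus"},
}
print(filter_by_localization(predicted, localization, ALLOWED_FOR_AMPHIPHYSIN))
```

      In this toy input, P2 is removed because it is annotated only in the nucleus, and P4 is removed because it has no annotation.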

      DISCUSSION

      Previous prediction methods have reported their performance by cross-validation on the training set but provided no evidence of their practical performance in database screening. However, there is a broad gap between performance in database screening and performance in cross-validation. In our prediction, the screening precision of each predictor was much lower than the precision from cross-validation on the training set. There are two main explanations. First, unlike in the training dataset, the percentage of interactors in the database is very low. Because most peptides in the database are negative for binding to a particular domain, even a small proportion of falsely predicted ligands amounts to a large number relative to the true positives in the prediction result. Second, good performance in cross-validation cannot guarantee a high generalization capability because the cross-validation result is based on very limited data.
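      The first explanation is a base-rate effect, which a few lines of arithmetic make concrete. In the sketch below the sensitivity (0.80), the false positive rate (0.05), and the set sizes are invented for illustration and are not measurements from this study; only the qualitative contrast between a balanced set and a database with few binders is the point.

```python
def precision_from_counts(
    n_pos: int, n_neg: int, sensitivity: float, false_positive_rate: float
) -> float:
    """Precision (in percent) of a classifier applied to n_pos binders and n_neg non-binders."""
    true_pos = sensitivity * n_pos
    false_pos = false_positive_rate * n_neg
    return 100.0 * true_pos / (true_pos + false_pos)


# Same hypothetical classifier, two different class ratios.
balanced = precision_from_counts(100, 100, 0.80, 0.05)     # 1:1, as in a training set
screening = precision_from_counts(100, 9_900, 0.80, 0.05)  # 1 binder in 100 peptides

print(f"balanced set:  {balanced:.1f}%")   # 94.1%
print(f"1:99 database: {screening:.1f}%")  # 13.9%
```

      The classifier is unchanged between the two calls; only the proportion of non-binders differs, and that alone lowers the precision from about 94% to about 14%.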
      To train a balanced learning machine, we set the ratio of positive/negative examples in the training set to 1:1 (
      • Baldi P.
      • Brunak S.
      ); however, the proportion of binding ligands for a domain in the protein database (defined as r) is usually one in hundreds or even lower. The precision of a predictor in database screening can be estimated from the cross-validation indices by the formula below. (For the derivation of Eq. 3, refer to the supplemental materials.)
      $$\text{Estimated screening precision (ESP)} = \frac{\text{precision} \times r}{\text{precision} \times r + (1 - \text{precision}) \times (1 - r)} \times 100\%$$
      (Eq. 3)


      In the formula above, precision was taken from the cross-validation result on the training set. The formula is based on the hypothesis that the predictor generalizes well to the whole protein database. In other words, for any independent part of the protein database taken as a test set with equal numbers of positive and negative data, the precision and sensitivity of the predictor are assumed to be constant. In our prediction, r_PDZ and r_SH3 were estimated to be 1 in 115 (291 of 33,422) and 1 in 32 (412 of 13,000), respectively. The ESP of the PDZ dot product predictor, for example, was calculated to be only 17.49% by the above formula; this is much lower than the precision from cross-validation (95.80%).
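      A minimal sketch of Eq. 3 follows. It treats precision and r as fractions between 0 and 1, and takes r for PDZ and SH3 from the positive/whole set totals of Tables II and III. The cross-validation precisions in the loop are illustrative inputs rather than the authors' values, so the printed numbers are not expected to match the ESP figures quoted in the text.

```python
def estimated_screening_precision(precision: float, r: float) -> float:
    """Eq. 3: ESP in percent, from cross-validation precision and binder fraction r."""
    if not 0.0 <= precision <= 1.0:
        raise ValueError("precision must be a fraction in [0, 1]")
    if not 0.0 < r < 1.0:
        raise ValueError("r must be a fraction in (0, 1)")
    true_part = precision * r
    false_part = (1.0 - precision) * (1.0 - r)
    return 100.0 * true_part / (true_part + false_part)


# Binder fractions from the "Total" rows of Tables II and III.
r_pdz = 291 / 33_422   # roughly 1 in 115
r_sh3 = 412 / 13_000   # roughly 1 in 32

# Illustrative cross-validation precisions (assumed values).
for cv_precision in (0.90, 0.95, 0.99):
    esp_pdz = estimated_screening_precision(cv_precision, r_pdz)
    esp_sh3 = estimated_screening_precision(cv_precision, r_sh3)
    print(f"precision {cv_precision:.2f}: ESP(PDZ) = {esp_pdz:.1f}%, ESP(SH3) = {esp_sh3:.1f}%")
```

      Two properties of the formula are easy to see from the sketch: when r is 0.5 the ESP equals the cross-validation precision, and for a fixed precision the ESP falls as r falls, which is why the same precision yields a lower ESP for PDZ than for SH3.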
      However, the ESP calculated by the formula still overestimated the practical precision in database screening. The premise of the formula, that the predictor generalizes well, is rarely met in reality, for the following reasons. (a) When the training set is very small, performance obtained from cross-validation cannot guarantee the actual performance on a large sample space (
      • Vapnik V.N.
      ). Patterns learned from very limited known interactions can rarely be generalized to the whole protein database. (b) It was difficult to obtain experimental negative data; hence shuffled ligands were created as negative data for training. It is intrinsically easier to separate real protein sequences from shuffled ones than to distinguish the real sequences of binding proteins from real sequences of non-binding proteins. The results generated by predictors fed with shuffled sequences might not adequately reflect the true precision (
      • Lo S.L.
      • Cai C.Z.
      • Chen Y.Z.
      • Chung M.C.
      Effect of training datasets on support vector machine prediction of protein-protein interactions.
      ); therefore cross-validation overrates the actual classification capability of the predictor, and hence ESP overrates SP. (c) The generalization capabilities of current machine learning approaches, such as SVM and PNN, are still imperfect.
      From the discussion above, it can be concluded that cross-validation precision is higher than ESP and that ESP overrates SP. Performance in cross-validation does not represent practical performance in database screening. For a new prediction model, the best testing strategy is experimental validation (
      • Brusic V.
      • Bajic V.B.
      • Petrovsky N.
      Computational methods for prediction of T-cell epitopes—a framework for modelling, testing, and applications.
      ).
      When no experimental validation is available, how should one choose the predictor most likely to perform well in database screening? The choice can be guided by the ESP formula. Taking the dot product PDZ predictor as an example, when the predictor was optimized to the best MCC (MCC = 0.8106, precision = 90.53%), ESP was calculated to be only 8.25%. ESP dropped substantially even though the precision of this predictor was not much lower than that of the predictor we used to screen the database (precision = 97.01%). Classifiers with lower precision would suffer from many more false positives and would not be effective in guiding the design of biological experiments. (A prediction will be called successful only when the total efficiency of the computational screen and the consequent experimental validation is higher than that of de novo experimental screening.) To reduce false positive predictions in database screening, we chose a predictor with high precision and moderate sensitivity in cross-validation rather than the one with the highest MCC.
      The SH3 prediction results appeared to be more successful than the PDZ prediction results. There are three possible explanations. (a) The SP indices might underestimate the real performance of the PDZ predictors because the PDZ peptide SPOT array experiments, on which SP was calculated, screened only part of the Swiss-Prot database. (b) The proportion of SH3 binding ligands in the database was higher than that of PDZ binding ligands. According to the formula, the ESP of the SH3 predictors was higher than that of the PDZ predictors, and thus the SH3 predictors would perform better if they generalized well. (c) The SH3 training set included experimental negatives in addition to the shuffled sequences. The predictors were therefore better trained to discriminate binders from non-binders, resulting in a higher success rate. Using experimentally tested negative data in the training set could improve the practical performance of a predictor.

      CONCLUSION

      We predicted peptide ligands of 10 SH3 and three PDZ domains. Of these predictions, 20–30% have already been experimentally confirmed by other independent research groups. Compared with previous sequence-based prediction methods, which require a large amount of interaction data for each domain (
      • Brusic V.
      • Bajic V.B.
      • Petrovsky N.
      Computational methods for prediction of T-cell epitopes—a framework for modelling, testing, and applications.
      ), our system learned fewer than 10 ligands per domain on average. The system could also generalize to predict ligands belonging to classes not included in the training set, which is impossible for methods based solely on one class of ligands. Predictions could be further improved by filters based on supplemental biological information.
      This system can potentially be used to predict ligands of other peptide binding domains (SH2, PTB, etc.). Domain-domain interactions, such as those between G-proteins and G-protein-coupled receptors, could also be predicted by similar procedures after extracting the interface residues from each domain.

      Acknowledgments

      We thank Sucan Ma, Rui Tian, Shijuan Gao, and Ali Song for generous help in biological experiments; Xiaolin Yang and Fuxin Li for valuable suggestions in SVM and artificial neural network; and Weizhi Chen for critical reading and revision of the manuscript.

      REFERENCES

        • Pawson T.
        • Nash P.
        Assembly of cell regulatory systems through protein interaction domains.
        Science. 2003; 300: 445-452
        • Pawson T.
        • Scott J.D.
        Signaling through scaffold, anchoring, and adaptor proteins.
        Science. 1997; 278: 2075-2080
        • Mayer B.J.
        SH3 domains: complexity in moderation.
        J. Cell Sci. 2001; 114: 1253-1263
        • Nourry C.
        • Grant S.G.
        • Borg J.P.
        PDZ domain proteins: plug and play!.
        Sci. STKE. 2003; 2003: RE7
        • Tong J.C.
        • Tan T.W.
        • Ranganathan S.
        Modeling the structure of bound peptide ligands to major histocompatibility complex.
        Protein Sci. 2004; 13: 2523-2532
        • Michielin O.
        • Karplus M.
        Binding free energy differences in a TCR-peptide-MHC complex induced by a peptide mutation: a simulation analysis.
        J. Mol. Biol. 2002; 324: 547-569
        • Brannetti B.
        • Via A.
        • Cestra G.
        • Cesareni G.
        • Helmer-Citterich M.
        SH3-SPOT: an algorithm to predict preferred ligands to different members of the SH3 gene family.
        J. Mol. Biol. 2000; 298: 313-328
        • Altuvia Y.
        • Margalit H.
        A structure-based approach for prediction of MHC-binding peptides.
        Methods. 2004; 34: 454-459
        • Obenauer J.C.
        • Cantley L.C.
        • Yaffe M.B.
        Scansite 2.0: proteome-wide prediction of cell signaling interactions using short sequence motifs.
        Nucleic Acids Res. 2003; 31: 3635-3641
        • Tong A.H.
        • Drees B.
        • Nardelli G.
        • Bader G.D.
        • Brannetti B.
        • Castagnoli L.
        • Evangelista M.
        • Ferracuti S.
        • Nelson B.
        • Paoluzi S.
        • Quondam M.
        • Zucconi A.
        • Hogue C.W.
        • Fields S.
        • Boone C.
        • Cesareni G.
        A combined experimental and computational strategy to define protein interaction networks for peptide recognition modules.
        Science. 2002; 295: 321-324
        • Honeyman M.C.
        • Brusic V.
        • Stone N.L.
        • Harrison L.C.
        Neural network-based prediction of candidate T-cell epitopes.
        Nat. Biotechnol. 1998; 16: 966-969
        • Brusic V.
        • Rudy G.
        • Honeyman G.
        • Hammer J.
        • Harrison L.
        Prediction of MHC class II-binding peptides using an evolutionary algorithm and artificial neural network.
        Bioinformatics. 1998; 14: 121-130
        • Dönnes P.
        • Elofsson A.
        Prediction of MHC class I binding peptides, using SVMHC.
        BMC Bioinformatics. 2002; 3: 25-38
        • Bhasin M.
        • Raghava G.P.
        SVM based method for predicting HLA-DRB1*0401 binding peptides in an antigen sequence.
        Bioinformatics. 2004; 20: 421-423
        • Rammensee H.
        • Bachmann J.
        • Emmerich N.P.
        • Bachor O.A.
        • Stevanovic S.
        SYFPEITHI: database for MHC ligands and peptide motifs.
        Immunogenetics. 1999; 50: 213-219
        • Martin S.
        • Roe D.
        • Faulon J.L.
        Predicting protein-protein interactions using signature products.
        Bioinformatics. 2005; 21: 218-226
        • Boeckmann B.
        • Bairoch A.
        • Apweiler R.
        • Blatter M.C.
        • Estreicher A.
        • Gasteiger E.
        • Martin M.J.
        • Michoud K.
        • O’Donovan C.
        • Phan I.
        • Pilbout S.
        • Schneider M.
        The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003.
        Nucleic Acids Res. 2003; 31: 365-370
        • Schultz J.
        • Milpetz F.
        • Bork P.
        • Ponting C.P.
        SMART, a simple modular architecture research tool: identification of signaling domains.
        Proc. Natl. Acad. Sci. U S A. 1998; 95: 5857-5864
        • Berman H.M.
        • Bhat T.N.
        • Bourne P.E.
        • Feng Z.
        • Gilliland G.
        • Weissig H.
        • Westbrook J.
        The Protein Data Bank and the challenge of structural genomics.
        Nat. Struct. Biol. 2000; 7: 957-959
        • Thompson J.D.
        • Higgins D.G.
        • Gibson T.J.
        CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.
        Nucleic Acids Res. 1994; 22: 4673-4680
        • Kyte J.
        • Doolittle R.F.
        A simple method for displaying the hydropathic character of a protein.
        J. Mol. Biol. 1982; 157: 105-132
        • Bull H.B.
        • Breese K.
        Surface tension of amino acid solutions: a hydrophobicity scale of the amino acid residues.
        Arch. Biochem. Biophys. 1974; 161: 665-670
        • Chothia C.
        Structural invariants in protein folding.
        Nature. 1975; 254: 304-308
        • Bhaskaran R.
        • Ponnuswamy P.K.
        Dynamics of amino acid residues in globular proteins.
        Int. J. Pept. Protein Res. 1984; 24: 180-191
        • Zimmerman J.M.
        • Eliezer N.
        • Simha R.
        The characterization of amino acid sequences in proteins by statistical methods.
        J. Theor. Biol. 1968; 21: 170-201
        • Betancourt M.R.
        • Thirumalai D.
        Pair potentials for protein folding: choice of reference states and sensitivity of predicted native states to variations in the interaction schemes.
        Protein Sci. 1999; 8: 361-369
        • Vapnik V.N.
        The Nature of Statistical Learning Theory. 2nd Ed. Springer, New York, 2000
        • Joachims T.
        Making large-scale SVM learning practical.
        in: Schölkopf B. Burges C. Smola A. Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge, MA, 1999: 169-184
        • Matthews B.W.
        Comparison of the predicted and observed secondary structure of T4 phage lysozyme.
        Biochim. Biophys. Acta. 1975; 405: 442-451
        • Boisguerin P.
        • Leben R.
        • Ay B.
        • Radziwill G.
        • Moelling K.
        • Dong L.
        • Volkmer-Engert R.
        An improved method for the synthesis of cellulose membrane-bound peptides with free C termini is useful for PDZ domain binding studies.
        Chem. Biol. 2004; 11: 449-459
        • Wiedemann U.
        • Boisguerin P.
        • Leben R.
        • Leitner D.
        • Krause G.
        • Moelling K.
        • Volkmer-Engert R.
        • Oschkinat H.
        Quantification of PDZ domain specificity, prediction of ligand affinity and rational design of super-binding peptides.
        J. Mol. Biol. 2004; 343: 703-718
        • Landgraf C.
        • Panni S.
        • Montecchi-Palazzi L.
        • Castagnoli L.
        • Schneider-Mergener J.
        • Volkmer-Engert R.
        • Cesareni G.
        Protein interaction networks by proteome peptide scanning.
        PLoS Biol. 2004; 2: 94-103
        • Beuming T.
        • Skrabanek L.
        • Niv M.Y.
        • Mukherjee P.
        • Weinstein H.
        PDZBase: a protein-protein interaction database for PDZ-domains.
        Bioinformatics. 2005; 21: 827-828
        • Baldi P.
        • Brunak S.
        Bioinformatics: the Machine Learning Approach. 2nd Ed. MIT Press, Cambridge, MA, 2001: 97, 115, and 126
        • Hansen L.K.
        • Salamon P.
        Neural network ensembles.
        IEEE Trans. Pattern Anal. Mach. Intell. 1990; 12: 993-1001
        • Benson D.A.
        • Karsch-Mizrachi I.
        • Lipman D.J.
        • Ostell J.
        • Wheeler D.L.
        GenBank.
        Nucleic Acids Res. 2005; 33: D34-D38
        • Baum E.
        • Haussler D.
        What size net gives valid generalization?.
        Neural Comput. 1989; 1: 151-160
        • Perrone M.
        • Cooper L.
        When Networks Disagree: Ensemble Method for Neural Networks. Chapman-Hall, London, 1993
        • Guo T.
        • Hua S.
        • Ji X.
        • Sun Z.
        DBSubLoc: database of protein subcellular localization.
        Nucleic Acids Res. 2004; 32: D122-D124
        • Lo S.L.
        • Cai C.Z.
        • Chen Y.Z.
        • Chung M.C.
        Effect of training datasets on support vector machine prediction of protein-protein interactions.
        Proteomics. 2005; 5: 876-884
        • Brusic V.
        • Bajic V.B.
        • Petrovsky N.
        Computational methods for prediction of T-cell epitopes—a framework for modelling, testing, and applications.
        Methods. 2004; 34: 436-443