An Integrated Machine Learning System to Computationally Screen Protein Databases for Protein Binding Peptide Ligands*S

A fairly large set of protein interactions is mediated by families of peptide binding domains, such as Src homology 2 (SH2), SH3, PDZ, major histocompatibility complex, etc. To identify their ligands by experimental screening is not only labor-intensive but almost futile in screening low abundance species due to the suppression by high abundance species. An ideal way of studying protein-protein interactions is to use high throughput computational approaches to screen protein sequence databases to direct the validating experiments toward the most promising peptides. Predictors with only good cross-validation were not good enough to screen protein databases. In the current study we built integrated machine learning systems using three novel coding methods and screened the Swiss-Prot and GenBank™ protein databases for potential ligands of 10 SH3 and three PDZ domains. A large fraction of predictions has already been experimentally confirmed by other independent research groups, indicating a satisfying generalization capability for future applications in identifying protein interactions.

Experimental screening for protein binding peptides is not only labor-intensive but almost futile in screening for low abundance binding species due to the suppression by high abundance species. A more plausible way of studying protein-protein interactions is by using high throughput computational predictions rather than experimental approaches to screen for interactions from protein sequence databases to direct the validating experiments toward the most promising peptides. A prediction can only be called successful when the overall efficiency and cost of computational prediction plus biological validation are much better than those of experimental screening. A fairly large set of protein interactions are mediated by families of peptide binding domains, such as SH2, 1 SH3, PDZ, MHC, etc. that act as receptors to accom-modate, in their binding pockets, short peptides in an extended conformation (1,2). We studied two common domainligand interactions mediated by SH3 domain and PDZ domain. SH3 domains selectively bind peptides of 8 -11 amino acids on their ligand proteins; PDZ domains bind 4 -8 amino acids in the C termini of their partners. Ligands of both PDZ and SH3 domains are of high diversity. Two good reviews provide detailed information on SH3 domain (3) and PDZ domain (4).
Two major categories of methods have been developed to predict domain-ligand interactions. The first category is based on structure information as exemplified by the work on predicting MHC-specific epitopes with protein docking methods (5,6). This type of method needs intensive computation and the prior knowledge of the three-dimensional structures of the bait proteins. Instead of using exact protein structures, Brannetti et al. (7) and Altuvia and Margalit (8) extracted ligandcontacting residues from the known domain-ligand complex structures to approximately represent domain interface structures of SH3 and MHC, respectively. The interactions of amino acids at each position of the ligand with its contacting residues on the target domain were then scored using a statistical amino acid-amino acid pairwise potential table. By this means, peptide binding score was calculated by simply summing up scores over all the positions of the peptide. This type of method does not reflect the interrelations between different positions on the same peptide.
In contrast to the structure-based methods that include comprehensive and high quality information relevant to interactions in three-dimensional spaces, the second category of prediction methods is based on sequence information from only ligands or from both ligands and domains. One widely used method, Scansite (9) (scansite.mit.edu), calculates the position-specific scoring matrix from the known binding and non-binding peptides of a certain domain to characterize binding profiles. Tong et al. (10) have predicted 20 yeast SH3 binding ligands by the positionspecific scoring matrix method. Machine learning approaches like artificial neural network (11,12) and support vector machine (SVM) (13,14) have been used in predicting MHC binding epitopes. All these algorithms need to collect a large amount of interaction data for each domain to build one predictor. If a domain can bind ligands belonging to different classes, two or more predictors should be built to characterize different binding profiles (13,15). Dö nnes and Elofsson (13) have reported that at least 20 ligand peptides of a single class for an MHC class I molecule are required for a reliable prediction, and with up to 50 peptides a slight improvement can be achieved. Even more data are required to build two or more predictors for ligands belonging to different classes. Martin et al. (16) have combined the fulllength sequence information of both domains and ligands, devised a protein descriptor called signature products to represent interactions between pairs of amino acid sequences, and predicted ligands of SH3 domains using SVM. Because primary sequences are easy to accumulate, the sequence-based methods can be applied to a large variety of domains. However, primary sequences contain insufficient structure information, and these methods do not emphasize enough the quality of data.
The existing computational models have only been assessed by cross-validation and have failed to provide any evidence of their performance on screening of the whole protein database. To computationally screen protein databases and focus the experimental efforts on the most likely interactions, we set up an integrated prediction system, taking advantage of both types of methods. We extracted high quality structural data and a large quantity of aligned sequences of both interacting partners and processed the data with machine learning approaches. (a) Quality of data. We extracted only information relevant to interaction by taking potential interface residues rather than using the full-length sequence. (b) Quantity of data. We collected and aligned a family of domains and their ligands and combined the interacting partners into a single prediction system. Because there are structural and sequential similarities among the domains in each family, one pair of known interaction can complementarily provide interaction information for other similar ones. Therefore, less data were needed for each domain to train the predictors. (c) Presentation of data. We developed three novel coding methods to represent different aspects of interactions between interface residues, namely orthogonal dot product, physicochemical product, and structural matrix. (d) Processing of data. We used two machine learning approaches, SVM and probabilistic neural network (PNN), to set up three independent predictors. (e) Neural network ensemble. We assembled the result of the three independent predictors to achieve better generalization capability. (f) Biological filtering. We filtered the candidate ligands with biological information, i.e. the protein subcellular localization, to make further improvement of the system. The flow chart of our method is shown in Fig. 1 (20) (available at www.ebi.ac.uk/clustalw/). The profile was manually adjusted by putting together regions of the same secondary structure and inserting gaps into loop regions. Other SH3 sequences with no resolved structures were multiple-aligned according to the structural profile with some manual adjustment. An SH3 residue was defined as an interface residue when, in any of the 15 complexes of resolved crystallographic structures, the shortest distance between the residue and the SH3 binding ligands is less than 3 Å. The interface residues were considered most relevant to protein interaction and were used to represent full-length sequences of the SH3 domain. A similar procedure was used in extracting PDZ interface residues. Ultimately 27 positions on SH3 and 23 positions on PDZ sequences were extracted as potential interface residues. (Structural alignments of 15 SH3 domains and 12 PDZ domains and interface residue extraction can be found in Supplemental Fig. S1; the full-length sequences and extracted interface residues of 539 SH3 and 613 PDZ domains can be found in Supplemental Table S1.)

Pretreatment of Ligand Peptides
For proteins interacting with PDZ domains, the C-terminal 8 amino acids were treated as ligand peptides and arrayed according to their positions. SH3 ligands in the training set were aligned by fitting into a very loose consensus and trimmed to 10 amino acids. Class I ligands were fit to motif XXXXXX(P/)XX(P/), and class II were fit to X(P/ )XX(P/)XXXXX. Atypical peptides were set with Pro at position 5 or 7 of the 10 amino acids (X, any amino acid; , hydrophobic amino acid). Peptides less than 10 amino acids were complemented with random amino acids. After alignment of the training set, Pro residues of the peptides were aligned at position 2, 5, 7, or 10. Therefore, each peptide in the test set was pretreated with Pro at any one position of 2, 5, 7, or 10. Pretreatment of peptides in the SH3 test set is not mandatory; nonetheless it saves computational screening time by reducing the number of peptide segments in the protein database.

Interaction Data Representation
Dot Product of Orthogonal Coding-Each interface residue on domains and each residue on ligands were coded in traditional orthogonal method by 20 dimensions of 0-1 vectors. Then a domain and a ligand could be described using vectors a ϭ (a 1 , . . . , a n ) T ⑀ Rn and b ϭ (b 1 , .., b m ) T ⑀ Rm, respectively, where n indicates the number of interface residues on the domain and m indicates the number of residues on the ligand. To describe interactions between pairs of amino acids, a tensor product between vectors a and b was defined to be (a 1 b 1 , a 1 b 2 ,.., a 1 b m , a 2 b 1 ,.., a n b m ) T ⑀ Rnm. Using this definition, the coding for pairs of amino acids, called orthogonal dot product, was established.
Physicochemical Product Coding-Each amino acid was represented by five physicochemical properties: hydrophobicity (21), charge (22), van der Waals volume (23), flexibility (24), and bulkiness (25). Every property was classified into five groups by their value (The property classification of each of the 20 amino acids can be found in Supplemental Fig. S2.). For one property, one amino acid was as-signed to class 1, 2, 3, 4, or 5; the combination of two amino acids could be 11, 22, 33, 44, 55, 12, 13, 14, 15, 23, 24, 25, 34, 35, or 45 (in total 15 combinations) coded in 15 binaries. A pair of amino acids was represented by five properties, each with 15 combinations, so that a pair of amino acids was coded in 15 ϫ 5 ϭ 75 binaries, including 70 dimensions of zero and 5 dimensions of one. Each interface residue on domains and each residue on ligands were paired and coded in a set of 75 binaries. Then the interaction between a domain (m contacting residues) and a ligand (n residues) were coded in m ϫ n ϫ 75 binaries.
Previous studies have used the measurements of the physicochemical properties of amino acids directly to code peptide sequences. But for the same property of the same amino acid, different literature reported slightly different measurements. We found that when using the previously published physicochemical coding method even a slight difference between measurements led to a different prediction performance. The differences did not interfere with the present physicochemical product coding method.
Matrix of Structurally Interacting Potentials-Betancourt and Thirumalai (26) have given a scale for interaction energies between the naturally occurring amino acid residues called potential matrix. The affinity of each amino acid pair composed of any of the 20 amino acids is expressed as a constant in the 20 ϫ 20 matrix. Interaction between an interface residue on a domain and a residue on a ligand is coded by an element in the matrix.

Machine Learning Approaches
SVM is a machine learning approach based on statistical learning theory. A full coverage of SVM is given by Vapnik (27). The decision rules developed by the system generate a discrete decision (Ͼ0, interaction; Ͻ0, no interaction) upon introduction of a new set of putative interaction pairs. SVM learning was implemented using SVMlight (28) (svmlight.joachims.org).
PNN is a kind of artificial neural network that is suitable for classification problems. We used the neural network toolbox from commercial software, MATLAB (Version 6.5), to design PNNs. Introductions of this toolbox are available at www.mathworks.com.
PNN shows good classification and generalization capabilities but has difficulty in dealing with high dimensional input vectors; SVM is suitable for the classification of sparse, high dimensional inputs. We used SVM to process orthogonal dot product and physicochemical product-encoded datasets and PNN to process matrix-encoded datasets.

5-Fold Cross-validation
The training set was randomly divided into five equally sized subsets. Each subset was used in turn as a test set, whereas the remaining four subsets were used to train the predictors. TP (true positive), FP (false positive), TN (true negative), and FN (false negative) in the five test sets were counted. The performance of each predictor was Step 1, extracting potential interface residues on PDZ and SH3 domains.
Step 2, setting up three independent machine learning predictors with encoded interaction and non-interaction PDZ and SH3 data and then assembling three predictors for PDZ and SH3.
Step 3, computational screening of 11,146 proteins from Swiss-Prot human protein database for three PDZ domains and 13,000 peptides total from Swiss-Prot and GenBank TM for 10 SH3 domains to predict their potential ligands. The prediction results were compared with the peptide SPOT array experiment results in which SH3 domains were tested against the same 13,000 peptides as those tested in computational prediction, but PDZ domains were tested against only a subset of Swiss-Prot (6,223 proteins). DB, Database. evaluated by calculating the mean value of three indices in the five tests with accompanying errors: precision, TP/(TP ϩ FP); sensitivity, TP/(TP ϩ FN); and Matthews correlation coefficient (MCC) (29).
In a totally correct prediction, MCC ϭ 1; in a totally incorrect prediction, MCC ϭ Ϫ1.

RESULTS
System Setup-Based on secondary structure conformation, 539 SH3 and 615 PDZ domains were structurally aligned. 27 positions on SH3 sequences and 23 positions on PDZ sequences were extracted as potential interface residues. SH3 ligands were trimmed to 10 amino acids of protein inter-nal sequence; PDZ ligands were the protein C-terminal 8-residue peptide.
598 interactions between 42 SH3 domains and their ligands along with 770 non-interaction pairs of 19 SH3 domains and the corresponding peptides (7, 10) were collected as the training set for SH3 interaction prediction. 338 interactions between 105 PDZ domains and 210 proteins (33) were collected as the training set for PDZ interaction prediction (details are in Supplemental Table S2.). To create negative data, the ligand sequence from each of these 338 interacting pairs was randomly rearranged. The protein domain and the shuffled sequence were regarded as negative and entered into the training set. SH3 positive data were duplicated to make the positive and negative ratio in the training set 1:1 approximately.
Each coding method emphasizes each of the three different aspects that affect protein and ligand interactions. Orthogonal dot product was a product modification of orthogonal coding (34), which sets each amino acid pair furthest apart in a distinct dimension. Physicochemical product represented the physical and chemical features of amino acid combinations. Structural matrix indicated the likelihood of two amino acids to interact in three-dimensional structures. All three predictors are oriented around interaction information between protein domains and their binding partners with no other irrelevant information.
Three predictors were trained independently with the three encoded training sets by machine learning approaches. The performance of each predictor was tested using 5-fold crossvalidation (Fig. 2)  three indices: precision, sensitivity, and MCC. To computationally screen protein database with higher reliability and to reduce the false positives, we set the predictors to high precision with compromised sensitivity (Table I) rather than choosing the predictor with the best MCC as usually done in other studies. More detailed rationales can be found under "Discussion." The prediction results of the three independent component predictors are combined by majority voting to build an assembled system. A ligand would finally be determined positive or negative if it was predicted to be binding or nonbinding by two or more predictors. The 5-fold cross-validation results of the assembled system are also shown in Table I. The assembled system preformed no better than the best performed individual predictor on 5-fold cross-validation. However, neural network ensemble did improve the accuracy of prediction in database screening as shown in Tables II and III. Ensemble of the three independent predictors would dramatically improve the generalization capability (35), which is the key for database screening but nonetheless cannot be shown by the cross-validation result generated from very limited data.

Prediction of Ligands for 13 Domains and Comparison with Peptide SPOT Array Screening Experiments-
The three PDZ predictors computationally screened 11,146 human proteins from Swiss-Prot protein database with redundant C-terminal 8-amino acid sequences pre-excluded to predict potential ligands for three PDZ domains. The three SH3 predictors screened 13,000 peptides from 3,657 proteins (subsets of Swiss-Prot human protein database and GenBank TM yeast protein database (36)) for potential ligands for 10 SH3 domains (32). The prediction results were compared with the experimental screening results in peptide SPOT arrays, excluding any examples learned in training set. After ensemble, the average overlap of SH3 and PDZ prediction with experiments was 28.74 and 17.07%, respectively (Fig. 3). The prediction results of each predictor and the overlap between predictions and experiments for the 13 domains are shown in Tables II and III (Details of the predicted sequences by the assembled system and the predictions validated by experiments can be found in Supplemental Table S3.).
A measurement called screening precision (SP) was used to evaluate a predictor's performance, whereas screening total ligand database was used to predict binding ligands of a domain. SP represents the generalization capability of predictors.
SP ϭ confirmed predicted ligands totally predicted ligands ϫ 100% (Eq. 2) The last row in Tables II and III shows SP of each predictor and the assembled system. Despite that each domain learned less than 10 positive ligands on average, the prediction and experimental results overlapped well. Because PDZ peptide SPOT array experiments screened only a part of the Swiss- Prot database (6,223 proteins out of a total of 11,146), experimentally unconfirmed prediction results may not be false positives but real positives not included in the experiments. The current SP of PDZ undervalues the performance of the system. The assembled prediction system dramatically improved the screening precision. The ensemble has many superior features. (a) Each predictor was trained to focus on one property rather than trained with a universal code containing all kinds of information. The complexity in machine learning is therefore reduced (37). (b) Because the error overlapping the three independent predictors was greatly decreased, the assembled system acquired higher accuracy (38). (c) The assembled system integrated information from many aspects. A new predictor handling new aspects of information could easily be incorporated into the system to improve the generalization capability.
To illustrate the generalization capability of our method on different classes of ligands, we show the results of three domains as examples (Table IV). Our prediction system learned only a single class of ligands for the three domains in Table IV. Yhr016c SH3 and Af-6 PDZ were predicted and experimentally proved to bind at least two classes of ligands probably due to the shared information provided from other similar domains. On the other hand, the prediction system would not falsely predict a second class of ligands for a domain that was experimentally proved to be specific for only one class of ligands. Sho1 SH3 was predicted to bind only class I ligands; the prediction of its disfavor for class II ligands was proved by the peptide SPOT array experiment. The rest of the 10 domains, not shown in Table IV, were all predicted to bind the classes of ligands that were learned in the training set.
Biological Filters Complemented to Predictions-To reduce the number of experiments necessary for validating and increase the experimental success rate, it is recommended to select the prediction results by biological filters, e.g. similar gene expression profiles, subcellular co-localization, shared function, etc. Using a protein subcellular localization database, DBSubLoc (39), (www.bioinfo.tsinghua.edu.cn/Sub-Loc/), we excluded the proteins that were not co-localized in the same cellular compartment as proteins containing the target domain. As the existing database (August 2005) has not annotated localization information for every protein in Swiss-Prot, we only provide several examples to show the effect of biological filtering in Table V (Detailed localization information of each predicted interactor can be found in Supplemental  Table S4.). The screening precision of our system was increased by more than 5 percentage points from 24.52% (76 of 310) to 31.25% (14 of 43) for Amphiphysin SH3 and from 14.58% (14 of 96) to 19.64% (11 of 53) for Af-6 PDZ. Biological filters will play more important roles as more information in databases are available. Other biological information could help further improve the prediction. DISCUSSION Previous prediction methods have shown their performances by cross-validation on the training set but provided no evidence of their practical performances on database screening. However, there is a broad gap between the performance on database screening and cross-validation. In our prediction, screening precision of each predictor was much lower than the precision from cross-validation on the training set. There are two main explanations. First, unlike in the training dataset, the percentage of interactors in the database is very low. Because most peptides in the database are negative for binding to a particular domain, a small proportion of falsely predicted ligands would appear to be a relatively large number compared with that of the true positives in the prediction result. Second, good performance on cross-validation cannot guarantee the predictor a high generalization capability because the cross-validation result is based on very limited data.
To train a balanced learning machine, we set the ratio of positive/negative examples in the training set to 1:1 (34);  Training positives are the number of positive ligands of each particular class in the training set. Typical class I, class II, and atypical ligands were included in both training and test sets. However, in SH3 peptide array experiments, there were a number of peptides having ϩXPXPXϩ motif, which satisfies the consensus of both class I and class II motifs (X, any of the 20 amino acids; ϩ, basic amino acid; , hydrophobic amino acid). The peptides with the ϩXPXPXϩ motif in SH3 tests were not classified as either class I or class II for this analysis.
In the formula above, precision was adopted from the cross-validation result on the training set. The formula here is based on the hypothesis that the predictor generalizes well on the whole protein database. In other words, taking any independent part of the protein database for the test set, in which the numbers of positive and negative data are equal, precision and sensitivity of the predictor are constant. In our prediction, r PDZ and r SH3 were estimated to be 1 in 115 (291 of 33,442) and 1 in 32 (412 of 13,000), respectively. ESP of PDZ dot product predictor, for example, was calculated to be only 17.49% by the above formula; this is much lower than the precision from cross-validation (95.80%).
However, ESP calculated by the formula still overestimated the practical precision in database screening. Note that the premise of this formula that the predictor has good generalization capability is rarely achieved in reality. That is because of the following. (a) When the training set is very small, performance obtained from cross-validation cannot guarantee the actual performance on a large sample space (27). Patterns learned from very limited known interactions can rarely be generalized to the whole protein database. (b) It was difficult to obtain experimental negative data; hence shuffled ligands were created as negative data for training. It is intrinsically easier to separate real protein sequences from shuffled ones than to distinguish the real sequences of binding proteins from real sequences of non-binding proteins. The results generated by predictors fed with shuffled sequences might not adequately reflect the true precision (40); therefore crossvalidation overrates the actual classification capability of the predictor, and hence ESP overrates SP. (c) The generalization capabilities of machine learning approaches, such as SVM and PNN, are not perfect nowadays.
From the discussion above, it could be concluded that cross-validation precision is higher than ESP, and ESP overrates SP. The performance on cross-validation could not represent the practical performance in database screening. For a new prediction model, the best testing strategy is experimental validation (41).
In the cases when no experimental validation is available, how to choose a predictor that might have higher database screening capability? It can be estimated according to the ESP formula. Taking the dot product PDZ predictor for example, when the predictor was optimized to the best MCC (MCC ϭ 0.8106, precision ϭ 90.53%), ESP was cal-culated to be only 8.25%. ESP decreased a lot, even though precision of this predictor was not too much lower than the predictor we used to screen the database (precision ϭ 97.01%). Classifiers with lower precision would suffer from much more false positives and not be effective in helping the biological experimental designs. (A prediction will be called successful only when the total efficiency of the computational screen and the consequent experimental validation are higher than that of de novo experimental screening.) To reduce false positive predictions in database screening, we chose a predictor with high precision and moderate sensitivity on cross-validation rather than the one with the highest MCC.
SH3 prediction results seemed to be more successful than PDZ prediction results. There are three possible explanations. (a) The SP indices might underestimate the PDZ predictors in reality because the peptide SPOT array experiments of PDZ domains based on which SP was calculated screened only a part of the Swiss-Prot database. (b) The positive odds of SH3 binding ligands in the database was higher than that of PDZ binding ligands. Calculated by the formula, ESP of SH3 predictors was higher than PDZ predictors, and thus SH3 predictors would exhibit better performance if generalized well. (c) In the SH3 training set, experimental negatives in addition to created shuffled sequences were included. The predictors were better trained to discriminate the binders and non-binders, resulting in a better successful prediction rate. Using experimentally tested negative data in the training set could improve the practical performance of a predictor. CONCLUSION We predicted peptide ligands of 10 SH3 and three PDZ domains. 20 -30% of the predictions have already been experimentally confirmed by other independent research groups. Compared with the previous sequence-based prediction method, which requires a large number of interaction data for each domain (41), our system learned less than 10 ligands per domain on average. The system could also generalize to predict ligands belonging to different classes not included in the training set; it was impossible to predict different classes of ligands using methods solely based on one class of ligands. Predictions could be further improved by filters based on supplemental biological information.
This system can potentially be used to predict ligands of other peptide binding domains (SH2, PTB, etc.). Domain-domain interactions, such as between G-proteins and G-protein-coupled receptors, could also be predicted in similar procedures by extracting interface residues from each domain. * This work was supported in part by The National Basic Research Program Grant 2004CB520804 and National Natural Science Foundation Grants 30270657, 30230150, and 3037030. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact. □ S The on-line version of this article (available at http://www. mcponline.org) contains supplemental material.