Molecular & Cellular Proteomics 5:1224-1232, 2006.
© 2006 by The American Society for Biochemistry and Molecular Biology, Inc.



From the Proteomics Research Center, National Key Laboratory of Medical Molecular Biology, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences/Peking Union Medical College, 5 Dong Dan San Tiao, 100005 Beijing, China
| ABSTRACT |
|---|
Two major categories of methods have been developed to predict domain-ligand interactions. The first category is based on structure information as exemplified by the work on predicting MHC-specific epitopes with protein docking methods (5, 6). This type of method needs intensive computation and the prior knowledge of the three-dimensional structures of the bait proteins. Instead of using exact protein structures, Brannetti et al. (7) and Altuvia and Margalit (8) extracted ligand-contacting residues from the known domain-ligand complex structures to approximately represent domain interface structures of SH3 and MHC, respectively. The interactions of amino acids at each position of the ligand with its contacting residues on the target domain were then scored using a statistical amino acid-amino acid pairwise potential table. By this means, peptide binding score was calculated by simply summing up scores over all the positions of the peptide. This type of method does not reflect the interrelations between different positions on the same peptide.
In contrast to the structure-based methods that include comprehensive and high quality information relevant to interactions in three-dimensional spaces, the second category of prediction methods is based on sequence information from only ligands or from both ligands and domains. One widely used method, Scansite (9) (scansite.mit.edu), calculates the position-specific scoring matrix from the known binding and non-binding peptides of a certain domain to characterize binding profiles. Tong et al. (10) have predicted 20 yeast SH3 binding ligands by the position-specific scoring matrix method. Machine learning approaches like artificial neural network (11, 12) and support vector machine (SVM) (13, 14) have been used in predicting MHC binding epitopes. All these algorithms need to collect a large amount of interaction data for each domain to build one predictor. If a domain can bind ligands belonging to different classes, two or more predictors should be built to characterize different binding profiles (13, 15). Dönnes and Elofsson (13) have reported that at least 20 ligand peptides of a single class for an MHC class I molecule are required for a reliable prediction, and with up to 50 peptides a slight improvement can be achieved. Even more data are required to build two or more predictors for ligands belonging to different classes. Martin et al. (16) have combined the full-length sequence information of both domains and ligands, devised a protein descriptor called signature products to represent interactions between pairs of amino acid sequences, and predicted ligands of SH3 domains using SVM. Because primary sequences are easy to accumulate, the sequence-based methods can be applied to a large variety of domains. However, primary sequences contain insufficient structure information, and these methods do not emphasize enough the quality of data.
The existing computational models have only been assessed by cross-validation and have failed to provide any evidence of their performance on screening of the whole protein database. To computationally screen protein databases and focus the experimental efforts on the most likely interactions, we set up an integrated prediction system, taking advantage of both types of methods. We extracted high quality structural data and a large quantity of aligned sequences of both interacting partners and processed the data with machine learning approaches. (a) Quality of data. We extracted only information relevant to interaction by taking potential interface residues rather than using the full-length sequence. (b) Quantity of data. We collected and aligned a family of domains and their ligands and combined the interacting partners into a single prediction system. Because there are structural and sequence similarities among the domains in each family, one known interacting pair can provide complementary interaction information for other, similar pairs. Therefore, fewer data were needed for each domain to train the predictors. (c) Presentation of data. We developed three novel coding methods to represent different aspects of interactions between interface residues, namely orthogonal dot product, physicochemical product, and structural matrix. (d) Processing of data. We used two machine learning approaches, SVM and probabilistic neural network (PNN), to set up three independent predictors. (e) Neural network ensemble. We assembled the results of the three independent predictors to achieve better generalization capability. (f) Biological filtering. We filtered the candidate ligands with biological information, i.e. the protein subcellular localization, to further improve the system. The flow chart of our method is shown in Fig. 1.
| EXPERIMENTAL PROCEDURES |
|---|
Pretreatment of Ligand Peptides
For proteins interacting with PDZ domains, the C-terminal 8 amino acids were treated as ligand peptides and arrayed according to their positions. SH3 ligands in the training set were aligned by fitting them to a very loose consensus and trimmed to 10 amino acids. Class I ligands were fit to the motif XXXXXX(P/Φ)XX(P/Φ), and class II ligands were fit to X(P/Φ)XX(P/Φ)XXXXX. Atypical peptides were placed with Pro at position 5 or 7 of the 10 amino acids (X, any amino acid; Φ, hydrophobic amino acid). Peptides shorter than 10 amino acids were padded with random amino acids. After alignment of the training set, Pro residues of the peptides were aligned at position 2, 5, 7, or 10. Therefore, each peptide in the test set was pretreated to have Pro at one of positions 2, 5, 7, or 10. Pretreatment of peptides in the SH3 test set is not mandatory; nonetheless, it saves computational screening time by reducing the number of peptide segments in the protein database.
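The pretreatment of test peptides described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `pdz_ligand` and `sh3_candidates` are hypothetical, and the sliding-window scan over a protein sequence is an assumed way of generating 10-residue segments that the text does not spell out.

```python
PRO_POSITIONS = (2, 5, 7, 10)  # 1-based positions named in the text


def pdz_ligand(protein_seq, length=8):
    """Return the C-terminal `length` residues, or None if the protein is too short."""
    if len(protein_seq) < length:
        return None
    return protein_seq[-length:]


def sh3_candidates(protein_seq, length=10):
    """Yield each 10-residue window with Pro at position 2, 5, 7, or 10.

    Assumption: candidate segments come from a plain sliding window.
    """
    for start in range(len(protein_seq) - length + 1):
        window = protein_seq[start:start + length]
        if any(window[pos - 1] == "P" for pos in PRO_POSITIONS):
            yield window
```

Windows without Pro at any of the four positions are skipped, which is how the filter reduces the number of segments that the predictors have to score.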
Interaction Data Representation
Dot Product of Orthogonal Coding
Each interface residue on domains and each residue on ligands was coded by the traditional orthogonal method as a 20-dimensional 0-1 vector. A domain and a ligand could then be described using vectors $a = (a_1, \ldots, a_n)^T \in R^n$ and $b = (b_1, \ldots, b_m)^T \in R^m$, respectively, where $n$ indicates the number of interface residues on the domain and $m$ indicates the number of residues on the ligand. To describe interactions between pairs of amino acids, a tensor product between vectors $a$ and $b$ was defined to be $(a_1 b_1, a_1 b_2, \ldots, a_1 b_m, a_2 b_1, \ldots, a_n b_m)^T \in R^{nm}$. Using this definition, the coding for pairs of amino acids, called orthogonal dot product, was established.
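One possible reading of this coding is sketched below. It assumes that the product of two residues is the outer product of their 20-dimensional one-hot vectors, flattened to 400 values, so that each amino acid pair occupies its own dimension. The function names and the alphabet ordering are hypothetical, and the text does not confirm this exact layout.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residues


def one_hot(residue):
    """Orthogonal (one-hot) coding of a single residue."""
    vec = [0] * len(AMINO_ACIDS)
    vec[AMINO_ACIDS.index(residue)] = 1
    return vec


def pair_product(res_a, res_b):
    """Flattened outer product of two one-hot vectors (assumption: 400 values)."""
    va, vb = one_hot(res_a), one_hot(res_b)
    return [x * y for x in va for y in vb]


def encode_orthogonal(domain_interface, ligand):
    """Concatenate the pair codes in the order a1b1, a1b2, ..., anbm."""
    features = []
    for a in domain_interface:
        for b in ligand:
            features.extend(pair_product(a, b))
    return features
```

Under this reading a domain with n interface residues and a ligand with m residues yields n x m x 400 mostly zero features, the kind of sparse, high dimensional input that the text later assigns to SVM.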
Physicochemical Product Coding
Each amino acid was represented by five physicochemical properties: hydrophobicity (21), charge (22), van der Waals volume (23), flexibility (24), and bulkiness (25). Every property was classified into five groups by value (the property classification of each of the 20 amino acids can be found in Supplemental Fig. S2). For one property, one amino acid was assigned to class 1, 2, 3, 4, or 5; the combination of two amino acids could be 11, 22, 33, 44, 55, 12, 13, 14, 15, 23, 24, 25, 34, 35, or 45 (in total 15 combinations) coded in 15 binaries. A pair of amino acids was represented by five properties, each with 15 combinations, so that a pair of amino acids was coded in 15 x 5 = 75 binaries, including 70 dimensions of zero and 5 dimensions of one. Each interface residue on domains and each residue on ligands were paired and coded in a set of 75 binaries. The interaction between a domain (m contacting residues) and a ligand (n residues) was then coded in m x n x 75 binaries.
Previous studies have used the measurements of the physicochemical properties of amino acids directly to code peptide sequences. But for the same property of the same amino acid, different literature reported slightly different measurements. We found that when using the previously published physicochemical coding method even a slight difference between measurements led to a different prediction performance. The differences did not interfere with the present physicochemical product coding method.
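The 75-binary pair code can be illustrated as follows. The class assignments in `PLACEHOLDER_CLASSES` are invented for the example, because the real classification is in Supplemental Fig. S2 and is not reproduced here; the function names and the ordering of the 15 combinations are also assumptions.

```python
from itertools import combinations_with_replacement

# The 15 unordered class pairs: (1, 1), (1, 2), ..., (5, 5).
COMBOS = list(combinations_with_replacement(range(1, 6), 2))
COMBO_INDEX = {combo: i for i, combo in enumerate(COMBOS)}

# Hypothetical classes for (hydrophobicity, charge, volume, flexibility,
# bulkiness); placeholder values, NOT the published classification.
PLACEHOLDER_CLASSES = {
    "P": (3, 3, 2, 1, 4),
    "L": (5, 3, 4, 2, 5),
    "K": (1, 5, 4, 4, 3),
    "Y": (4, 3, 5, 2, 4),
}


def encode_pair(res_a, res_b, classes=PLACEHOLDER_CLASSES):
    """Code one residue pair as 5 blocks of 15 binaries (75 in total)."""
    bits = []
    for class_a, class_b in zip(classes[res_a], classes[res_b]):
        block = [0] * len(COMBOS)
        block[COMBO_INDEX[tuple(sorted((class_a, class_b)))]] = 1
        bits.extend(block)
    return bits


def encode_interaction(domain_interface, ligand, classes=PLACEHOLDER_CLASSES):
    """Concatenate the 75-binary codes of every domain-ligand residue pair."""
    bits = []
    for a in domain_interface:
        for b in ligand:
            bits.extend(encode_pair(a, b, classes))
    return bits
```

Because only the class of each residue enters the code, small differences between published measurements leave the encoding unchanged as long as they do not move a residue across a class boundary.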
Matrix of Structurally Interacting Potentials
Betancourt and Thirumalai (26) have given a scale for interaction energies between the naturally occurring amino acid residues called potential matrix. The affinity of each amino acid pair composed of any of the 20 amino acids is expressed as a constant in the 20 x 20 matrix. Interaction between an interface residue on a domain and a residue on a ligand is coded by an element in the matrix.
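A minimal sketch of the matrix coding follows. The table built by `placeholder_potential` contains made-up numbers and only stands in for the published 20 x 20 scale, which is not reproduced here; all names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residues


def placeholder_potential():
    """Build a symmetric 20 x 20 table of made-up values (NOT the published scale)."""
    table = {}
    for i, a in enumerate(AMINO_ACIDS):
        for j, b in enumerate(AMINO_ACIDS):
            table[(a, b)] = -0.01 * (min(i, j) + 1) * (max(i, j) + 1)
    return table


def encode_matrix(domain_interface, ligand, potential):
    """One real-valued feature per (interface residue, ligand residue) pair."""
    return [potential[(a, b)] for a in domain_interface for b in ligand]
```

Each residue pair contributes a single number here, so the resulting input vector is far shorter than in the two binary codings.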
Machine Learning Approaches
SVM is a machine learning approach based on statistical learning theory. A full coverage of SVM is given by Vapnik (27). The decision rules developed by the system generate a discrete decision (>0, interaction; <0, no interaction) upon introduction of a new set of putative interaction pairs. SVM learning was implemented using SVMlight (28) (svmlight.joachims.org).
PNN is a kind of artificial neural network that is suitable for classification problems. We used the Neural Network Toolbox of the commercial software MATLAB (Version 6.5) to design PNNs. An introduction to this toolbox is available at www.mathworks.com.
PNN shows good classification and generalization capabilities but has difficulty in dealing with high dimensional input vectors; SVM is suitable for the classification of sparse, high dimensional inputs. We used SVM to process orthogonal dot product and physicochemical product-encoded datasets and PNN to process matrix-encoded datasets.
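For readers unfamiliar with PNN, the idea can be sketched as a Gaussian Parzen-window classifier: each class receives the average kernel response of its training vectors, and the class with the largest response wins. This is a generic illustration, not the MATLAB toolbox implementation used in the study; the function name and the smoothing parameter `sigma` are assumptions.

```python
import math


def pnn_classify(x, train_vectors, train_labels, sigma=1.0):
    """Return the label whose Gaussian Parzen-window density at x is highest."""
    totals, counts = {}, {}
    for vec, label in zip(train_vectors, train_labels):
        sq_dist = sum((xi - vi) ** 2 for xi, vi in zip(x, vec))
        kernel = math.exp(-sq_dist / (2 * sigma ** 2))
        totals[label] = totals.get(label, 0.0) + kernel
        counts[label] = counts.get(label, 0) + 1
    # Compare class-averaged kernel responses.
    return max(totals, key=lambda label: totals[label] / counts[label])
```

Every training vector is compared with every query, and distances become less informative as the dimension grows, which fits the remark that PNN was reserved for the compact matrix-encoded inputs.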
5-Fold Cross-validation
The training set was randomly divided into five equally sized subsets. Each subset was used in turn as a test set, whereas the remaining four subsets were used to train the predictors. TP (true positive), FP (false positive), TN (true negative), and FN (false negative) in the five test sets were counted. The performance of each predictor was evaluated by calculating the mean value of three indices in the five tests with accompanying errors: precision, TP/(TP + FP); sensitivity, TP/(TP + FN); and Matthews correlation coefficient (MCC) (29).
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
In a totally correct prediction, MCC = 1; in a totally incorrect prediction, MCC = -1.
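The three indices can be computed from the four counts as in the sketch below, which uses the standard definition of MCC. Returning 0 when the denominator vanishes is a common convention and an assumption here, not something stated in the text.

```python
import math


def precision(tp, fp):
    """Fraction of predicted interactions that are real: TP / (TP + FP)."""
    return tp / (tp + fp)


def sensitivity(tp, fn):
    """Fraction of real interactions that are recovered: TP / (TP + FN)."""
    return tp / (tp + fn)


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from the four confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column of the table is empty
    return (tp * tn - fp * fn) / denom
```

In 5-fold cross-validation these functions would be applied to the counts of each test subset and the five values averaged.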
Test Set Obtained from Peptide SPOT Arrays
Three PDZ domains (Erbin, Lap2_Human; Af-6, Afad_Human; and Sna1, Sna1_Human) were tested against the C-terminal peptide of 11 amino acids from 6,223 human proteins in peptide SPOT array experiments by Boisguerin et al. (30) and Wiedemann et al. (31). The strongest 100 interactions were selected as positive ligands by the authors for each PDZ. In SH3 experiments by Landgraf et al. (32), 10 SH3 domains (Abp1, Abp1_yeast; Boi1, Bob1_yeast; Boi2, Boi2_yeast; Myo5, Mys5_yeast; Rvs167, R167_yeast; Sho1, Ss81_yeast; Yhr016c, Yhh6_yeast; Yfr024c, Yfj4_yeast; Amphiphysin, Amph_human; and Endophilin, Sh32_human) were chosen to screen peptide arrays containing 672 to 2,032 peptides using Boehringer light unit (BLU) to represent the affinity between each SH3 and each 13-amino acid peptide sequence. The positive ligands were those with BLU larger than the mean BLU plus 2 times the standard deviation among all peptides tested. The rest of the peptides were regarded as non-binders.
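The cutoff for SH3 positives can be written as a short function. Whether the population or the sample standard deviation was used is not stated; the sketch assumes the population form (`pstdev`), and the function and argument names are hypothetical.

```python
from statistics import mean, pstdev


def positive_ligands(blu_by_peptide, k=2.0):
    """Return peptides whose BLU exceeds mean + k * standard deviation."""
    values = list(blu_by_peptide.values())
    cutoff = mean(values) + k * pstdev(values)  # assumption: population SD
    return {peptide for peptide, blu in blu_by_peptide.items() if blu > cutoff}
```

With k = 2 only signals well above the array-wide background count as binders; every other peptide on the array is treated as a non-binder.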
| RESULTS |
|---|
598 interactions between 42 SH3 domains and their ligands along with 770 non-interaction pairs of 19 SH3 domains and the corresponding peptides (7, 10) were collected as the training set for SH3 interaction prediction. 338 interactions between 105 PDZ domains and 210 proteins (33) were collected as the training set for PDZ interaction prediction (details are in Supplemental Table S2). To create negative data, the ligand sequence from each of these 338 interacting pairs was randomly rearranged. Each pair of a protein domain and its shuffled ligand sequence was regarded as a negative example and entered into the training set. SH3 positive data were duplicated to bring the ratio of positive to negative examples in the training set to approximately 1:1.
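Creating a shuffled negative from a positive ligand might look like the sketch below. Retrying until the rearranged sequence differs from the original is an added assumption, as are the function name and the `max_tries` parameter.

```python
import random


def shuffled_negative(ligand, rng, max_tries=100):
    """Return a random rearrangement of `ligand` that differs from it, or None."""
    residues = list(ligand)
    for _ in range(max_tries):
        rng.shuffle(residues)
        candidate = "".join(residues)
        if candidate != ligand:  # assumption: identical shuffles are rejected
            return candidate
    return None  # e.g. a homopolymer has no distinct rearrangement
```

A call such as `shuffled_negative("RETSVLKQ", random.Random(0))` returns a peptide with the same amino acid composition in a different order, which is then paired with the original domain as a negative example.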
Each coding method emphasizes each of the three different aspects that affect protein and ligand interactions. Orthogonal dot product was a product modification of orthogonal coding (34), which sets each amino acid pair furthest apart in a distinct dimension. Physicochemical product represented the physical and chemical features of amino acid combinations. Structural matrix indicated the likelihood of two amino acids to interact in three-dimensional structures. All three predictors are oriented around interaction information between protein domains and their binding partners with no other irrelevant information.
Three predictors were trained independently with the three encoded training sets by machine learning approaches. The performance of each predictor was tested using 5-fold cross-validation (Fig. 2) while adjusting several parameters. The performance of each predictor was evaluated by calculating three indices: precision, sensitivity, and MCC. To computationally screen the protein database with higher reliability and to reduce false positives, we set the predictors to high precision at the cost of sensitivity (Table I) rather than choosing the predictor with the best MCC, as is usually done in other studies. A more detailed rationale can be found under "Discussion."
The last row in Tables II and III shows the SP of each predictor and of the assembled system. Although each domain learned fewer than 10 positive ligands on average, the prediction and experimental results overlapped well. Because the PDZ peptide SPOT array experiments screened only part of the Swiss-Prot database (6,223 proteins out of a total of 11,146), experimentally unconfirmed predictions may not be false positives but rather real positives not included in the experiments. The current SP of PDZ therefore undervalues the performance of the system.
The assembled prediction system dramatically improved the screening precision. The ensemble has many superior features. (a) Each predictor was trained to focus on one property rather than trained with a universal code containing all kinds of information. The complexity in machine learning is therefore reduced (37). (b) Because the error overlapping the three independent predictors was greatly decreased, the assembled system acquired higher accuracy (38). (c) The assembled system integrated information from many aspects. A new predictor handling new aspects of information could easily be incorporated into the system to improve the generalization capability.
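How the three outputs were combined is not specified in this section. Purely as an illustration of why an ensemble suppresses uncorrelated errors, the sketch below accepts an interaction only when enough predictors agree; the voting rule, the function name, and the default `min_votes=3` are all assumptions rather than the authors' method.

```python
def ensemble_predict(votes, min_votes=3):
    """Accept an interaction when at least `min_votes` predictors say yes.

    `votes` holds one boolean per predictor (assumed order: dot product,
    physicochemical product, structural matrix).
    """
    return sum(bool(v) for v in votes) >= min_votes
```

Requiring unanimous agreement favors precision over sensitivity, whereas `min_votes=2` would give a majority vote.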
Generalization of Ligands Belonging to Different Classes
There are two major classes of common motifs that SH3 binds, (+XΦPXΦP) and (PXΦPX+) (X, any of the 20 amino acids; +, basic amino acid; Φ, hydrophobic amino acid), and other uncommon variants (3). There are two major classes of PDZ binding ligands, characterized as X(S/T)XΦ-COOH and XΦXΦ-COOH, and a large variety of atypical ligands (4). We integrated the ligands of different classes into one system to learn. Then prediction for each of the 13 domains was performed by screening the database, and the results were classified and compared with the data from experimentally evaluated class I, class II, and other atypical ligands.
To illustrate the generalization capability of our method on different classes of ligands, we show the results of three domains as examples (Table IV). Our prediction system learned only a single class of ligands for each of the three domains in Table IV. Yhr016c SH3 and Af-6 PDZ were predicted and experimentally shown to bind at least two classes of ligands, probably because of the shared information provided by other similar domains. On the other hand, the prediction system did not falsely predict a second class of ligands for a domain that was experimentally shown to be specific for only one class of ligands. Sho1 SH3 was predicted to bind only class I ligands; the prediction of its disfavor for class II ligands was confirmed by the peptide SPOT array experiment. The remaining 10 domains, not shown in Table IV, were all predicted to bind the classes of ligands that were learned in the training set.
| DISCUSSION |
|---|
To train a balanced learning machine, we set the ratio of positive/negative examples in the training set to 1:1 (34); however, the odds of a domain binding ligands in the protein database (defined as r) is usually one in hundreds if not more. The precision in database screening of a predictor can be estimated from cross-validation indices by the formula below. (For derivation of Eq. 3, refer to the supplemental materials.)
![]() |
In the formula above, precision was adopted from the cross-validation result on the training set. The formula here is based on the hypothesis that the predictor generalizes well on the whole protein database. In other words, taking any independent part of the protein database for the test set, in which the numbers of positive and negative data are equal, precision and sensitivity of the predictor are constant. In our prediction, rPDZ and rSH3 were estimated to be 1 in 115 (291 of 33,442) and 1 in 32 (412 of 13,000), respectively. ESP of PDZ dot product predictor, for example, was calculated to be only 17.49% by the above formula; this is much lower than the precision from cross-validation (95.80%).
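The effect of the prior odds on precision can be illustrated with a standard Bayes-style adjustment. This sketch is not claimed to be the authors' Eq. 3, whose derivation is in the supplemental materials, and it may not reproduce the percentages quoted in the text. It assumes that sensitivity and false positive rate measured on a balanced 1:1 test set carry over to a database in which a fraction r of the candidates are true ligands. With cross-validation precision P, the balanced set gives FPR/sensitivity = (1 - P)/P, so the expected precision on the database is rP / (rP + (1 - r)(1 - P)). The function and parameter names are hypothetical.

```python
def prior_adjusted_precision(balanced_precision, positive_fraction):
    """Expected precision when positives make up `positive_fraction` of the data.

    Assumes the sensitivity and false positive rate observed on a balanced
    (1:1) test set are unchanged on the full database.
    """
    p, r = balanced_precision, positive_fraction
    return (r * p) / (r * p + (1 - r) * (1 - p))
```

When r = 0.5 the function returns the cross-validation precision unchanged, and as r falls toward one in a hundred even a predictor with 90% balanced precision is expected to be wrong for most of its positive calls.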
However, ESP calculated by the formula still overestimated the practical precision in database screening. The premise of this formula, that the predictor has good generalization capability, is rarely achieved in reality, for the following reasons. (a) When the training set is very small, performance obtained from cross-validation cannot guarantee the actual performance on a large sample space (27). Patterns learned from very limited known interactions can rarely be generalized to the whole protein database. (b) It was difficult to obtain experimental negative data; hence shuffled ligands were created as negative data for training. It is intrinsically easier to separate real protein sequences from shuffled ones than to distinguish the real sequences of binding proteins from the real sequences of non-binding proteins. The results generated by predictors fed with shuffled sequences might not adequately reflect the true precision (40); therefore cross-validation overrates the actual classification capability of the predictor, and hence ESP overrates SP. (c) The generalization capabilities of current machine learning approaches, such as SVM and PNN, remain imperfect.
From the discussion above, it could be concluded that cross-validation precision is higher than ESP, and ESP overrates SP. The performance on cross-validation could not represent the practical performance in database screening. For a new prediction model, the best testing strategy is experimental validation (41).
When no experimental validation is available, how should one choose a predictor that is likely to have higher database screening capability? This can be estimated with the ESP formula. Taking the dot product PDZ predictor as an example, when the predictor was optimized to the best MCC (MCC = 0.8106, precision = 90.53%), ESP was calculated to be only 8.25%. ESP decreased substantially even though the precision of this predictor was not much lower than that of the predictor we used to screen the database (precision = 97.01%). Classifiers with lower precision would suffer from many more false positives and would not be effective in guiding the design of biological experiments. (A prediction will be called successful only when the total efficiency of the computational screen and the consequent experimental validation is higher than that of de novo experimental screening.) To reduce false positive predictions in database screening, we chose a predictor with high precision and moderate sensitivity on cross-validation rather than the one with the highest MCC.
SH3 prediction results seemed to be more successful than PDZ prediction results. There are three possible explanations. (a) The SP indices might underestimate the PDZ predictors in reality because the peptide SPOT array experiments of PDZ domains based on which SP was calculated screened only a part of the Swiss-Prot database. (b) The positive odds of SH3 binding ligands in the database was higher than that of PDZ binding ligands. Calculated by the formula, ESP of SH3 predictors was higher than PDZ predictors, and thus SH3 predictors would exhibit better performance if generalized well. (c) In the SH3 training set, experimental negatives in addition to created shuffled sequences were included. The predictors were better trained to discriminate the binders and non-binders, resulting in a better successful prediction rate. Using experimentally tested negative data in the training set could improve the practical performance of a predictor.
| CONCLUSION |
|---|
This system can potentially be used to predict ligands of other peptide binding domains (SH2, PTB, etc.). Domain-domain interactions, such as between G-proteins and G-protein-coupled receptors, could also be predicted in similar procedures by extracting interface residues from each domain.
| ACKNOWLEDGMENTS |
|---|
| FOOTNOTES |
|---|
Published, MCP Papers in Press, March 29, 2006, DOI 10.1074/mcp.M500346-MCP200
1 The abbreviations used are: SH, Src homology; SVM, support vector machine; PNN, probabilistic neural network; SP, screening precision; ESP, estimated screening precision; MHC, major histocompatibility complex; MCC, Matthews correlation coefficient; BLU, Boehringer light unit.
* This work was supported in part by The National Basic Research Program Grant 2004CB520804 and National Natural Science Foundation Grants 30270657, 30230150, and 3037030. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
S The on-line version of this article (available at http://www.mcponline.org) contains supplemental material.
Both authors contributed equally to this work.
To whom correspondence should be addressed. Tel.: 86-010-6521-2284; Fax: 86-010-6521-2284; E-mail: gaoyouhe{at}pumc.edu.cn
| REFERENCES |
|---|