Originally published In Press as doi:10.1074/mcp.M500346-MCP200 on March 29, 2006.
Molecular & Cellular Proteomics 5:1224-1232, 2006.
© 2006 by The American Society for Biochemistry and Molecular Biology, Inc.


Research

An Integrated Machine Learning System to Computationally Screen Protein Databases for Protein Binding Peptide Ligands

Ling Zhang‡, Chen Shao‡, Dexian Zheng, and Youhe Gao§

From the Proteomics Research Center, National Key Laboratory of Medical Molecular Biology, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences/Peking Union Medical College, 5 Dong Dan San Tiao, 100005 Beijing, China


    ABSTRACT
A fairly large set of protein interactions is mediated by families of peptide binding domains, such as Src homology 2 (SH2), SH3, PDZ, and major histocompatibility complex (MHC) domains. Identifying their ligands by experimental screening is not only labor-intensive but almost futile for low abundance species because of suppression by high abundance species. An ideal way of studying protein-protein interactions is to use high throughput computational approaches to screen protein sequence databases and so direct the validating experiments toward the most promising peptides. Predictors that merely perform well in cross-validation are not good enough to screen protein databases. In the current study we built integrated machine learning systems using three novel coding methods and screened the Swiss-Prot and GenBank™ protein databases for potential ligands of 10 SH3 and three PDZ domains. A large fraction of the predictions has already been experimentally confirmed by other independent research groups, indicating a satisfactory generalization capability for future applications in identifying protein interactions.


Experimental screening for protein binding peptides is not only labor-intensive but almost futile for low abundance binding species because of suppression by high abundance species. A more plausible way of studying protein-protein interactions is to use high throughput computational predictions, rather than experimental approaches, to screen for interactions in protein sequence databases and so direct the validating experiments toward the most promising peptides. A prediction can only be called successful when the overall efficiency and cost of computational prediction plus biological validation are much better than those of experimental screening. A fairly large set of protein interactions is mediated by families of peptide binding domains, such as SH2, SH3, PDZ, MHC, etc., that act as receptors to accommodate, in their binding pockets, short peptides in an extended conformation (1, 2). We studied two common domain-ligand interactions, mediated by the SH3 domain and the PDZ domain. SH3 domains selectively bind peptides of 8–11 amino acids on their ligand proteins; PDZ domains bind 4–8 amino acids at the C termini of their partners. Ligands of both PDZ and SH3 domains are highly diverse. Two good reviews provide detailed information on the SH3 domain (3) and the PDZ domain (4).

Two major categories of methods have been developed to predict domain-ligand interactions. The first category is based on structure information as exemplified by the work on predicting MHC-specific epitopes with protein docking methods (5, 6). This type of method needs intensive computation and the prior knowledge of the three-dimensional structures of the bait proteins. Instead of using exact protein structures, Brannetti et al. (7) and Altuvia and Margalit (8) extracted ligand-contacting residues from the known domain-ligand complex structures to approximately represent domain interface structures of SH3 and MHC, respectively. The interactions of amino acids at each position of the ligand with its contacting residues on the target domain were then scored using a statistical amino acid-amino acid pairwise potential table. By this means, peptide binding score was calculated by simply summing up scores over all the positions of the peptide. This type of method does not reflect the interrelations between different positions on the same peptide.

In contrast to the structure-based methods that include comprehensive and high quality information relevant to interactions in three-dimensional spaces, the second category of prediction methods is based on sequence information from only ligands or from both ligands and domains. One widely used method, Scansite (9) (scansite.mit.edu), calculates the position-specific scoring matrix from the known binding and non-binding peptides of a certain domain to characterize binding profiles. Tong et al. (10) have predicted 20 yeast SH3 binding ligands by the position-specific scoring matrix method. Machine learning approaches like artificial neural network (11, 12) and support vector machine (SVM) (13, 14) have been used in predicting MHC binding epitopes. All these algorithms need to collect a large amount of interaction data for each domain to build one predictor. If a domain can bind ligands belonging to different classes, two or more predictors should be built to characterize different binding profiles (13, 15). Dönnes and Elofsson (13) have reported that at least 20 ligand peptides of a single class for an MHC class I molecule are required for a reliable prediction, and with up to 50 peptides a slight improvement can be achieved. Even more data are required to build two or more predictors for ligands belonging to different classes. Martin et al. (16) have combined the full-length sequence information of both domains and ligands, devised a protein descriptor called signature products to represent interactions between pairs of amino acid sequences, and predicted ligands of SH3 domains using SVM. Because primary sequences are easy to accumulate, the sequence-based methods can be applied to a large variety of domains. However, primary sequences contain insufficient structure information, and these methods do not emphasize enough the quality of data.

The existing computational models have been assessed only by cross-validation and have provided no evidence of their performance in screening a whole protein database. To computationally screen protein databases and focus the experimental efforts on the most likely interactions, we set up an integrated prediction system that takes advantage of both types of methods. We extracted high quality structural data and a large quantity of aligned sequences of both interacting partners and processed the data with machine learning approaches. (a) Quality of data. We extracted only information relevant to interaction by taking potential interface residues rather than using the full-length sequence. (b) Quantity of data. We collected and aligned a family of domains and their ligands and combined the interacting partners into a single prediction system. Because there are structural and sequence similarities among the domains in each family, one known interacting pair can provide complementary interaction information for other, similar pairs. Therefore, fewer data were needed for each domain to train the predictors. (c) Presentation of data. We developed three novel coding methods to represent different aspects of interactions between interface residues, namely orthogonal dot product, physicochemical product, and structural matrix. (d) Processing of data. We used two machine learning approaches, SVM and probabilistic neural network (PNN), to set up three independent predictors. (e) Neural network ensemble. We combined the results of the three independent predictors to achieve better generalization capability. (f) Biological filtering. We filtered the candidate ligands with biological information, i.e. the protein subcellular localization, to further improve the system. The flow chart of our method is shown in Fig. 1.


FIG. 1. Flow chart of experiment design. Step 1, extracting potential interface residues on PDZ and SH3 domains. Step 2, setting up three independent machine learning predictors with encoded interaction and non-interaction PDZ and SH3 data and then assembling three predictors for PDZ and SH3. Step 3, computational screening of 11,146 proteins from Swiss-Prot human protein database for three PDZ domains and 13,000 peptides total from Swiss-Prot and GenBank™ for 10 SH3 domains to predict their potential ligands. The prediction results were compared with the peptide SPOT array experiment results in which SH3 domains were tested against the same 13,000 peptides as those tested in computational prediction, but PDZ domains were tested against only a subset of Swiss-Prot (6,223 proteins). DB, Database.

 

    EXPERIMENTAL PROCEDURES
SH3 and PDZ Structural Alignments and Interface Residue Extraction
539 SH3 domains and 613 PDZ domains in Swiss-Prot protein database (a few from TrEMBL database) (www.ebi.ac.uk/swissprot/) (17) (release 45.1) were identified by SMART (Simple Modular Architecture Research Tool) (18) (smart.embl-heidelberg.de/). 15 SH3-ligand and 12 PDZ-ligand complex structures were collected from Protein Data Bank (19) (www.rcsb.org/pdb). The 15 SH3 sequences were multiple-aligned to make a structural profile by ClustalW 3.0 (20) (available at www.ebi.ac.uk/clustalw/). The profile was manually adjusted by putting together regions of the same secondary structure and inserting gaps into loop regions. Other SH3 sequences with no resolved structures were multiple-aligned according to the structural profile with some manual adjustment. An SH3 residue was defined as an interface residue when, in any of the 15 complexes of resolved crystallographic structures, the shortest distance between the residue and the SH3 binding ligands is less than 3 Å. The interface residues were considered most relevant to protein interaction and were used to represent full-length sequences of the SH3 domain. A similar procedure was used in extracting PDZ interface residues. Ultimately 27 positions on SH3 and 23 positions on PDZ sequences were extracted as potential interface residues. (Structural alignments of 15 SH3 domains and 12 PDZ domains and interface residue extraction can be found in Supplemental Fig. S1; the full-length sequences and extracted interface residues of 539 SH3 and 613 PDZ domains can be found in Supplemental Table S1.)
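The distance criterion above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the input layout (a dictionary from alignment position to atom coordinates for each complex) and the function names `min_distance` and `interface_positions` are assumptions made for the example, and parsing of Protein Data Bank files is left out.

```python
import numpy as np

def min_distance(residue_atoms, ligand_atoms):
    """Shortest distance (in Å) between any residue atom and any ligand atom."""
    res = np.asarray(residue_atoms, dtype=float)   # shape (a, 3)
    lig = np.asarray(ligand_atoms, dtype=float)    # shape (b, 3)
    diff = res[:, None, :] - lig[None, :, :]       # all atom-atom difference vectors
    return float(np.sqrt((diff ** 2).sum(axis=-1)).min())

def interface_positions(complexes, cutoff=3.0):
    """Alignment positions that lie within `cutoff` of the ligand in ANY complex.

    `complexes` is a hypothetical layout: a list of (residues, ligand_atoms),
    where `residues` maps an alignment position to that residue's atom coordinates.
    """
    positions = set()
    for residues, ligand_atoms in complexes:
        for position, atoms in residues.items():
            if min_distance(atoms, ligand_atoms) < cutoff:
                positions.add(position)
    return sorted(positions)
```

Taking the union over all complexes mirrors the "in any of the 15 complexes" wording; a stricter rule (for example, requiring contact in several complexes) would yield fewer positions.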

Pretreatment of Ligand Peptides
For proteins interacting with PDZ domains, the C-terminal 8 amino acids were treated as ligand peptides and arrayed according to their positions. SH3 ligands in the training set were aligned by fitting to a very loose consensus and trimmed to 10 amino acids. Class I ligands were fit to the motif XXXXXX(P/φ)XX(P/φ), and class II ligands were fit to X(P/φ)XX(P/φ)XXXXX (X, any amino acid; φ, hydrophobic amino acid). Atypical peptides were set with Pro at position 5 or 7 of the 10 amino acids. Peptides shorter than 10 amino acids were padded with random amino acids. After alignment of the training set, Pro residues of the peptides were aligned at position 2, 5, 7, or 10. Therefore, each peptide in the test set was pretreated to have Pro at any one of positions 2, 5, 7, or 10. Pretreatment of peptides in the SH3 test set is not mandatory; nonetheless it saves computational screening time by reducing the number of peptide segments in the protein database.
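One possible reading of the SH3 test-set pretreatment is a sliding-window filter that keeps only 10-residue segments with Pro at position 2, 5, 7, or 10. The sketch below follows that reading; the function name `candidate_windows` and the exact filtering rule are assumptions, not details given in the text.

```python
PRO_POSITIONS = (2, 5, 7, 10)  # 1-based positions within the 10-residue window

def candidate_windows(protein, length=10, pro_positions=PRO_POSITIONS):
    """Return (1-based start, segment) for every window with Pro at an allowed position."""
    windows = []
    for start in range(len(protein) - length + 1):
        segment = protein[start:start + length]
        # Keep the segment if at least one allowed position holds a proline.
        if any(segment[p - 1] == "P" for p in pro_positions):
            windows.append((start + 1, segment))
    return windows
```

For a protein with a single proline, only four of the overlapping windows survive, which illustrates how the filter reduces the number of segments to be scored.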

Interaction Data Representation
Dot Product of Orthogonal Coding—
Each interface residue on domains and each residue on ligands were coded in the traditional orthogonal method by 20-dimensional 0-1 vectors. Then a domain and a ligand could be described using vectors $a = (a_1, \ldots, a_n)^T \in \mathbb{R}^n$ and $b = (b_1, \ldots, b_m)^T \in \mathbb{R}^m$, respectively, where $n$ indicates the number of interface residues on the domain and $m$ indicates the number of residues on the ligand. To describe interactions between pairs of amino acids, a tensor product between vectors $a$ and $b$ was defined to be $(a_1b_1, a_1b_2, \ldots, a_1b_m, a_2b_1, \ldots, a_nb_m)^T \in \mathbb{R}^{nm}$. Using this definition, the coding for pairs of amino acids, called orthogonal dot product, was established.
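A minimal sketch of this coding follows, under one assumption that the text does not spell out: each pairwise term is taken to be the outer product of two 20-dimensional one-hot vectors, so every domain-ligand residue pair contributes a 400-dimensional block with a single 1. The amino acid ordering and the function names are illustrative only.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"          # arbitrary but fixed ordering
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(residue):
    """Orthogonal (one-hot) 20-dimensional code of a single amino acid."""
    vec = np.zeros(20)
    vec[INDEX[residue]] = 1.0
    return vec

def orthogonal_product(domain_interface, ligand):
    """Concatenate the pairwise products in the order a1b1, a1b2, ..., anbm."""
    blocks = [np.outer(one_hot(d), one_hot(l)).ravel()
              for d in domain_interface      # n interface residues
              for l in ligand]               # m ligand residues
    return np.concatenate(blocks)
```

With 27 SH3 interface positions and 10-residue ligands this gives a sparse vector of 27 × 10 × 400 entries containing exactly 270 ones, which is the kind of sparse, high dimensional input for which SVM is described as suitable.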

Physicochemical Product Coding—
Each amino acid was represented by five physicochemical properties: hydrophobicity (21), charge (22), van der Waals volume (23), flexibility (24), and bulkiness (25). Every property was classified into five groups by its value. (The property classification of each of the 20 amino acids can be found in Supplemental Fig. S2.) For one property, one amino acid was assigned to class 1, 2, 3, 4, or 5; the combination of two amino acids could be 11, 22, 33, 44, 55, 12, 13, 14, 15, 23, 24, 25, 34, 35, or 45 (in total 15 combinations) coded in 15 binaries. A pair of amino acids was represented by five properties, each with 15 combinations, so that a pair of amino acids was coded in 15 × 5 = 75 binaries, including 70 dimensions of zero and 5 dimensions of one. Each interface residue on domains and each residue on ligands were paired and coded in a set of 75 binaries. Then the interaction between a domain (n contacting residues) and a ligand (m residues) was coded in n × m × 75 binaries.
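The 75-binary pair code can be sketched as below. The class table passed to `encode_interaction` is hypothetical (the actual class assignments are in Supplemental Fig. S2 and are not reproduced here), and the order in which the 15 combinations are indexed is an arbitrary choice of this sketch.

```python
from itertools import combinations_with_replacement

# The 15 unordered class combinations (1,1), (1,2), ..., (5,5), each given an index.
PAIR_INDEX = {pair: i for i, pair in
              enumerate(combinations_with_replacement(range(1, 6), 2))}

def encode_pair(classes_a, classes_b):
    """Code one amino acid pair; each argument holds 5 class labels (1-5), one per property."""
    code = []
    for ca, cb in zip(classes_a, classes_b):
        bits = [0] * 15
        bits[PAIR_INDEX[tuple(sorted((ca, cb)))]] = 1   # order of the two residues is ignored
        code.extend(bits)
    return code                                          # 75 binaries, exactly 5 ones

def encode_interaction(domain_interface, ligand, class_table):
    """Concatenate the 75-binary codes of every interface-residue/ligand-residue pair."""
    code = []
    for d in domain_interface:
        for l in ligand:
            code.extend(encode_pair(class_table[d], class_table[l]))
    return code
```

Because only class membership enters the code, small disagreements between published property scales change the encoding only when they move a residue across a class boundary.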

Previous studies have used measurements of the physicochemical properties of amino acids directly to code peptide sequences. However, for the same property of the same amino acid, different sources report slightly different measurements. We found that with the previously published physicochemical coding method even a slight difference between measurements led to a different prediction performance. Such differences did not interfere with the present physicochemical product coding method.

Matrix of Structurally Interacting Potentials—
Betancourt and Thirumalai (26) have given a scale for interaction energies between the naturally occurring amino acid residues called potential matrix. The affinity of each amino acid pair composed of any of the 20 amino acids is expressed as a constant in the 20 x 20 matrix. Interaction between an interface residue on a domain and a residue on a ligand is coded by an element in the matrix.
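The matrix coding amounts to a table lookup for every domain-ligand residue pair. In the sketch below the 20 × 20 matrix is a random symmetric placeholder, not the published Betancourt and Thirumalai values, and the function name `matrix_code` is illustrative.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

# Placeholder potential matrix: random symmetric values, NOT the published scale.
_rng = np.random.default_rng(0)
_m = _rng.normal(size=(20, 20))
PLACEHOLDER_POTENTIAL = (_m + _m.T) / 2.0

def matrix_code(domain_interface, ligand, potential=PLACEHOLDER_POTENTIAL):
    """One potential value per (interface residue, ligand residue) pair, row-major."""
    rows = [INDEX[d] for d in domain_interface]
    cols = [INDEX[l] for l in ligand]
    return potential[np.ix_(rows, cols)].ravel()
```

The resulting vector has only n × m real-valued entries, far fewer dimensions than the two binary codings, which is consistent with this encoding being paired with PNN.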

Machine Learning Approaches
SVM is a machine learning approach based on statistical learning theory. A full coverage of SVM is given by Vapnik (27). The decision rules developed by the system generate a discrete decision (>0, interaction; <0, no interaction) upon introduction of a new set of putative interaction pairs. SVM learning was implemented using SVMlight (28) (svmlight.joachims.org).

PNN is a kind of artificial neural network that is suitable for classification problems. We used the neural network toolbox of the commercial software MATLAB (Version 6.5) to design PNNs. Documentation for this toolbox is available at www.mathworks.com.

PNN shows good classification and generalization capabilities but has difficulty in dealing with high dimensional input vectors; SVM is suitable for the classification of sparse, high dimensional inputs. We used SVM to process orthogonal dot product and physicochemical product-encoded datasets and PNN to process matrix-encoded datasets.
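For readers unfamiliar with PNN, the idea can be sketched in a few lines of NumPy: each class is scored by the average of Gaussian kernels centered on its training examples, and the class with the highest score wins. This is a generic textbook-style sketch, not the MATLAB toolbox implementation used in the study; the smoothing parameter `sigma` and the choice of averaging the kernels are assumptions.

```python
import numpy as np

def pnn_predict(train_x, train_y, query_x, sigma=1.0):
    """Classify each query by the class with the largest mean Gaussian kernel value."""
    train_x = np.asarray(train_x, dtype=float)
    train_y = np.asarray(train_y)
    query_x = np.atleast_2d(np.asarray(query_x, dtype=float))
    # Squared Euclidean distance between every query and every training example.
    sq_dist = ((query_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=-1)
    kernel = np.exp(-sq_dist / (2.0 * sigma ** 2))          # pattern layer
    classes = np.unique(train_y)
    scores = np.stack([kernel[:, train_y == c].mean(axis=1)  # summation layer
                       for c in classes], axis=1)
    return classes[scores.argmax(axis=1)]                    # decision layer
```

Every training example takes part in every prediction, so the cost grows with both the training set size and the input dimension, which is one reason to reserve PNN for the compact matrix-encoded data.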

5-Fold Cross-validation
The training set was randomly divided into five equally sized subsets. Each subset was used in turn as a test set, whereas the remaining four subsets were used to train the predictors. TP (true positive), FP (false positive), TN (true negative), and FN (false negative) in the five test sets were counted. The performance of each predictor was evaluated by calculating the mean value of three indices in the five tests with accompanying errors: precision, TP/(TP + FP); sensitivity, TP/(TP + FN); and Matthews correlation coefficient (MCC) (29).

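$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad \text{(Eq. 1)}$$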

In a totally correct prediction, MCC = 1; in a totally incorrect prediction, MCC = –1.
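The fold split and the three indices can be sketched as follows, using the standard definition of MCC. The helper names and the fixed random seed are choices of this sketch; folds differ in size by at most one example when the set size is not a multiple of five.

```python
import math
import random

def five_fold_indices(n_examples, seed=0):
    """Randomly split example indices into five (nearly) equally sized folds."""
    order = list(range(n_examples))
    random.Random(seed).shuffle(order)
    return [order[i::5] for i in range(5)]

def precision(tp, fp):
    return tp / (tp + fp)

def sensitivity(tp, fn):
    return tp / (tp + fn)

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Each fold serves once as the test set while the other four are used for training; the indices are then averaged over the five tests.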

Test Set Obtained from Peptide SPOT Arrays
Three PDZ domains (Erbin, Lap2_Human; Af-6, Afad_Human; and Sna1, Sna1_Human) were tested against the C-terminal peptide of 11 amino acids from 6,223 human proteins in peptide SPOT array experiments by Boisguerin et al. (30) and Wiedemann et al. (31). The strongest 100 interactions were selected as positive ligands by the authors for each PDZ. In SH3 experiments by Landgraf et al. (32), 10 SH3 domains (Abp1, Abp1_yeast; Boi1, Bob1_yeast; Boi2, Boi2_yeast; Myo5, Mys5_yeast; Rvs167, R167_yeast; Sho1, Ss81_yeast; Yhr016c, Yhh6_yeast; Yfr024c, Yfj4_yeast; Amphiphysin, Amph_human; and Endophilin, Sh32_human) were chosen to screen peptide arrays containing 672–2,032 peptides using Boehringer light unit (BLU) to represent the affinity between each SH3 and each 13-amino acid peptide sequence. The positive ligands were those with BLU larger than mean BLU plus 2 times standard deviation among total peptides tested. The rest of the peptides were regarded as non-binders.
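The SH3 positive-ligand rule (BLU above the mean plus 2 standard deviations) can be sketched as below. Whether the original analysis used the population or the sample standard deviation is not stated; the population form is assumed here, and the function name is illustrative.

```python
import statistics

def positive_ligands(blu_by_peptide):
    """Peptides whose BLU exceeds mean + 2 * SD over all peptides tested."""
    values = list(blu_by_peptide.values())
    # Assumption: population standard deviation (pstdev), not the sample form.
    cutoff = statistics.mean(values) + 2 * statistics.pstdev(values)
    return {peptide for peptide, blu in blu_by_peptide.items() if blu > cutoff}
```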


    RESULTS
System Setup—
Based on secondary structure conformation, 539 SH3 and 613 PDZ domains were structurally aligned. 27 positions on SH3 sequences and 23 positions on PDZ sequences were extracted as potential interface residues. SH3 ligands were trimmed to 10 amino acids of protein internal sequence; PDZ ligands were the protein C-terminal 8-residue peptide.

598 interactions between 42 SH3 domains and their ligands, along with 770 non-interaction pairs of 19 SH3 domains and the corresponding peptides (7, 10), were collected as the training set for SH3 interaction prediction. 338 interactions between 105 PDZ domains and 210 proteins (33) were collected as the training set for PDZ interaction prediction (details are in Supplemental Table S2). To create negative data, the ligand sequence from each of these 338 interacting pairs was randomly rearranged. Each protein domain paired with its shuffled sequence was regarded as a negative example and entered into the training set. SH3 positive data were duplicated to make the ratio of positive to negative examples in the training set approximately 1:1.
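Generating shuffled negatives for the PDZ training set can be sketched as below. The function names and the seed are assumptions; the text does not say how a shuffle that happens to reproduce the original ligand was handled, so this sketch does not treat that case specially.

```python
import random

def shuffled_negative(ligand, rng):
    """Randomly rearrange the residues of one ligand sequence."""
    residues = list(ligand)
    rng.shuffle(residues)
    return "".join(residues)

def build_negatives(positive_pairs, seed=0):
    """Pair each domain with a shuffled copy of its own ligand."""
    rng = random.Random(seed)
    return [(domain, shuffled_negative(ligand, rng))
            for domain, ligand in positive_pairs]
```

Shuffling preserves amino acid composition while destroying positional information, so the negative set matches the positives one for one.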

Each coding method emphasizes one of three different aspects that affect protein and ligand interactions. Orthogonal dot product is a product modification of orthogonal coding (34), which sets each amino acid pair furthest apart in a distinct dimension. Physicochemical product represents the physical and chemical features of amino acid combinations. Structural matrix indicates the likelihood that two amino acids interact in three-dimensional structures. All three predictors are oriented around interaction information between protein domains and their binding partners, with no other irrelevant information.

Three predictors were trained independently with the three encoded training sets by machine learning approaches. The performance of each predictor was tested using 5-fold cross-validation (Fig. 2) by adjusting several parameters. The performance of each predictor was evaluated by calculating three indices: precision, sensitivity, and MCC. To computationally screen protein database with higher reliability and to reduce the false positives, we set the predictors to high precision with compromised sensitivity (Table I) rather than choosing the predictor with the best MCC as usually done in other studies. More detailed rationales can be found under "Discussion."


FIG. 2. Performance of three predictors on training sets. Each curve illustrates 5-fold cross-validation results of a predictor modified with varied parameters. A, precision versus sensitivity curve of three predictors for PDZ domains. B, precision versus sensitivity curve of three predictors for SH3 domains. C, precision versus MCC curve of three predictors for PDZ domains. D, precision versus MCC curve of three predictors for SH3 domains. + illustrates the optimized system used for database screening.

 

TABLE I 5-Fold cross-validation results

Predictors in the table are the ones used in the screening of whole protein databases for potential domain binding ligands. Values in parentheses describe the standard deviations of the indices. RBF, radial basis function.

 
The prediction results of the three independent component predictors were combined by majority voting to build an assembled system. A ligand was finally determined to be positive or negative if it was predicted to be binding or non-binding, respectively, by two or more predictors. The 5-fold cross-validation results of the assembled system are also shown in Table I. The assembled system performed no better than the best-performing individual predictor on 5-fold cross-validation. However, the neural network ensemble did improve the accuracy of prediction in database screening, as shown in Tables II and III. An ensemble of the three independent predictors can dramatically improve the generalization capability (35), which is the key for database screening but cannot be shown by a cross-validation result generated from very limited data.
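The majority vote over the three predictors can be sketched as follows. The dictionary layout (candidate ligand mapped to a binding/non-binding call) and the function name are assumptions made for the example.

```python
def ensemble_vote(dot_product, physicochemical, structural):
    """Combine three predictors; each argument maps ligand -> True (binding) or False."""
    result = {}
    for ligand in dot_product:
        votes = (dot_product[ligand], physicochemical[ligand], structural[ligand])
        result[ligand] = sum(votes) >= 2      # positive when two or more predictors agree
    return result
```

A false positive survives the vote only when at least two predictors make the same mistake, which is why predictors built on different encodings are useful components.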


TABLE II Prediction results of three PDZ domains and verification results by peptide SPOT array experiments

Pred, number of predicted ligands; Conf, number of ligands confirmed by peptide SPOT arrays. The third column shows the data adopted from peptide SPOT array experiments that overlap the databases we screened excluding the ligands in training sets in the form of number of positive ligands(positive)/number of screened peptides (whole set). The following three columns show numbers of ligands predicted by each predictor and validated by experiments. The last column shows the ensemble predictions. SP is the percentage of experimentally validated true positive ligands in prediction results, indicating the generalization capability of a predictor on a domain family (PDZ).

 

TABLE III Prediction results of 10 SH3 domains and verification results by peptide SPOT array experiments

Pred, number of predicted ligands; Conf, number of ligands confirmed by peptide SPOT arrays. The third column shows the data adopted from peptide SPOT array experiments that overlap the databases we screened excluding the ligands in training sets in the form of number of positive ligands(positive)/number of screened peptides (whole set). The following three columns show numbers of ligands predicted by each predictor and validated by experiments. The last column shows the ensemble predictions. SP is the percentage of experimentally validated true positive ligands in prediction results, indicating the generalization capability of a predictor on a domain family (SH3).

 
Prediction of Ligands for 13 Domains and Comparison with Peptide SPOT Array Screening Experiments—
The three PDZ predictors computationally screened 11,146 human proteins from Swiss-Prot protein database with redundant C-terminal 8-amino acid sequences pre-excluded to predict potential ligands for three PDZ domains. The three SH3 predictors screened 13,000 peptides from 3,657 proteins (subsets of Swiss-Prot human protein database and GenBank™ yeast protein database (36)) for potential ligands for 10 SH3 domains (32). The prediction results were compared with the experimental screening results in peptide SPOT arrays, excluding any examples learned in training set. After ensemble, the average overlap of SH3 and PDZ prediction with experiments was 28.74 and 17.07%, respectively (Fig. 3). The prediction results of each predictor and the overlap between predictions and experiments for the 13 domains are shown in Tables II and III. (Details of the predicted sequences by the assembled system and the predictions validated by experiments can be found in Supplemental Table S3.)


FIG. 3. The overlap between our prediction and the Peptide SPOT (Pep-spot) Array experiment results. a, prediction and experiment results for the three PDZ domains. b, prediction and experiment results for the 10 SH3 domains. DB, database.

 
A measure called screening precision (SP) was used to evaluate a predictor's performance when the total ligand database was screened to predict the binding ligands of a domain. SP represents the generalization capability of predictors.

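$$\mathrm{SP} = \frac{\text{number of predicted ligands confirmed by experiments}}{\text{number of predicted ligands}} \times 100\% \qquad \text{(Eq. 2)}$$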

The last row in Tables II and III shows the SP of each predictor and of the assembled system. Although fewer than 10 positive ligands on average were learned for each domain, the prediction and experimental results overlapped well. Because the PDZ peptide SPOT array experiments screened only a part of the Swiss-Prot database (6,223 proteins out of a total of 11,146), experimentally unconfirmed predictions may not be false positives but real positives not included in the experiments. The current SP of PDZ therefore undervalues the performance of the system.

The assembled prediction system dramatically improved the screening precision. The ensemble has several advantages. (a) Each predictor was trained to focus on one property rather than trained with a universal code containing all kinds of information. The complexity in machine learning is therefore reduced (37). (b) Because the errors shared by the three independent predictors were greatly reduced, the assembled system achieved higher accuracy (38). (c) The assembled system integrated information from many aspects. A new predictor handling new aspects of information could easily be incorporated into the system to improve the generalization capability.

Generalization of Ligands Belonging to Different Classes—
There are two major classes of common motifs that SH3 binds: (+XφPXφP) and (PXφPX+) (X, any of the 20 amino acids; +, basic amino acid; φ, hydrophobic amino acid) and other uncommon variants (3). There are two major classes of PDZ binding ligands, characterized as X(S/T)Xφ-COOH and XφXφ-COOH, and a large variety of atypical ligands (4). We integrated the ligands of different classes into one system to learn. Then prediction for each of the 13 domains was performed by screening the database, and the results were classified and compared with the data from experimentally evaluated class I, class II, and other atypical ligands.

To illustrate the generalization capability of our method on different classes of ligands, we show the results of three domains as examples (Table IV). Our prediction system learned only a single class of ligands for the three domains in Table IV. Yhr016c SH3 and Af-6 PDZ were predicted and experimentally proved to bind at least two classes of ligands probably due to the shared information provided from other similar domains. On the other hand, the prediction system would not falsely predict a second class of ligands for a domain that was experimentally proved to be specific for only one class of ligands. Sho1 SH3 was predicted to bind only class I ligands; the prediction of its disfavor for class II ligands was proved by the peptide SPOT array experiment. The rest of the 10 domains, not shown in Table IV, were all predicted to bind the classes of ligands that were learned in the training set.


TABLE IV The generalization capability of our prediction system on ligands of different classes

Training positives are the number of positive ligands of each particular class in the training set. Typical class I, class II, and atypical ligands were included in both training and test sets. However, in SH3 peptide array experiments, there were a number of peptides having +XφPXφPX+ motif, which satisfies the consensus of both class I and class II motifs (X, any of the 20 amino acids; +, basic amino acid; φ, hydrophobic amino acid). The peptides with the +XφPXφPX+ motif in SH3 tests were not classified as either class I or class II for this analysis.

 
Biological Filters Complemented to Predictions—
To reduce the number of experiments necessary for validation and to increase the experimental success rate, it is recommended to select among the prediction results with biological filters, e.g. similar gene expression profiles, subcellular co-localization, shared function, etc. Using a protein subcellular localization database, DBSubLoc (39) (www.bioinfo.tsinghua.edu.cn/SubLoc/), we excluded the proteins that were not co-localized in the same cellular compartment as proteins containing the target domain. Because the existing database (August 2005) does not annotate localization information for every protein in Swiss-Prot, we provide only several examples to show the effect of biological filtering in Table V. (Detailed localization information of each predicted interactor can be found in Supplemental Table S4.) The screening precision of our system was increased by more than 5 percentage points, from 24.52% (76 of 310) to 31.25% (14 of 43) for Amphiphysin SH3 and from 14.58% (14 of 96) to 19.64% (11 of 53) for Af-6 PDZ. Biological filters will play more important roles as more information becomes available in databases. Other biological information could help further improve the prediction.
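The co-localization filter and its effect on screening precision can be sketched as below. The data layout, the function names, and the handling of proteins without annotation (dropped by default, controlled by the hypothetical `keep_unannotated` flag) are assumptions; the text does not specify how unannotated proteins were treated, and no DBSubLoc access is shown.

```python
def filter_by_colocalization(candidates, localization, domain_compartments,
                             keep_unannotated=False):
    """Keep candidates that share at least one compartment with the domain's protein."""
    kept = []
    for protein in candidates:
        compartments = localization.get(protein)
        if compartments is None:
            # Assumption: unannotated proteins are dropped unless the caller opts in.
            if keep_unannotated:
                kept.append(protein)
            continue
        if set(compartments) & set(domain_compartments):
            kept.append(protein)
    return kept

def screening_precision(predicted, confirmed):
    """Fraction of predicted ligands that were confirmed by experiment."""
    predicted = set(predicted)
    return len(predicted & set(confirmed)) / len(predicted) if predicted else 0.0
```

Comparing `screening_precision` before and after the filter on the same candidate list shows whether the removed proteins were mostly unconfirmed predictions.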


TABLE V Improvement on prediction precision by filtering with protein subcellular localization information

Results of one PDZ and one SH3 are given as examples because the subcellular localization information of most domains we predicted is unknown.

 

    DISCUSSION
Previous prediction methods have shown their performances by cross-validation on the training set but provided no evidence of their practical performances on database screening. However, there is a broad gap between the performance on database screening and cross-validation. In our prediction, screening precision of each predictor was much lower than the precision from cross-validation on the training set. There are two main explanations. First, unlike in the training dataset, the percentage of interactors in the database is very low. Because most peptides in the database are negative for binding to a particular domain, a small proportion of falsely predicted ligands would appear to be a relatively large number compared with that of the true positives in the prediction result. Second, good performance on cross-validation cannot guarantee the predictor a high generalization capability because the cross-validation result is based on very limited data.

To train a balanced learning machine, we set the ratio of positive to negative examples in the training set to 1:1 (34); however, the odds that a peptide in the protein database binds a given domain (defined as r) are usually one in hundreds, if not lower. The precision of a predictor in database screening can be estimated from cross-validation indices by the formula below. (For the derivation of Eq. 3, refer to the supplemental materials.)

[Formula 3] (Eq. 3)

In the formula above, precision was adopted from the cross-validation result on the training set. The formula here is based on the hypothesis that the predictor generalizes well on the whole protein database. In other words, taking any independent part of the protein database for the test set, in which the numbers of positive and negative data are equal, precision and sensitivity of the predictor are constant. In our prediction, rPDZ and rSH3 were estimated to be 1 in 115 (291 of 33,442) and 1 in 32 (412 of 13,000), respectively. ESP of PDZ dot product predictor, for example, was calculated to be only 17.49% by the above formula; this is much lower than the precision from cross-validation (95.80%).
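The effect of a low proportion of binders on precision can be illustrated with a standard Bayes-rule rescaling. The sketch below is an assumption made for illustration and is not Eq. 3 itself, whose derivation is in the supplemental materials; it therefore need not reproduce the ESP values quoted in the text. The function name is hypothetical, r is treated as the fraction of positives in the screened set, and the true and false positive rates are assumed to carry over unchanged from the balanced test set.

```python
def rescaled_precision(cv_precision: float, positive_fraction: float) -> float:
    """Rescale precision measured on a 1:1 set to a set with fewer positives.

    On a balanced set, precision p = TPR / (TPR + FPR), so FPR / TPR = (1 - p) / p.
    If TPR and FPR are unchanged on a set where positives make up a fraction r,
    precision becomes r*TPR / (r*TPR + (1 - r)*FPR), which simplifies to
    r*p / (r*p + (1 - r)*(1 - p)).
    """
    if not 0.0 < cv_precision <= 1.0:
        raise ValueError("cv_precision must be in (0, 1]")
    if not 0.0 < positive_fraction < 1.0:
        raise ValueError("positive_fraction must be in (0, 1)")
    p, r = cv_precision, positive_fraction
    return (r * p) / (r * p + (1.0 - r) * (1.0 - p))


# Illustrative values only: 90% precision on a balanced set.
for r in (0.5, 0.1, 0.01):
    print(f"positive fraction {r}: precision {rescaled_precision(0.90, r):.4f}")
```

With a positive fraction of 0.5 the function returns the cross-validation precision unchanged, and the value falls quickly as positives become rarer, which is the qualitative behavior described above.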

However, the ESP calculated by the formula still overestimated the practical precision in database screening. The premise of the formula, that the predictor generalizes well, is rarely met in reality, for three reasons. (a) When the training set is very small, performance obtained from cross-validation cannot guarantee the actual performance on a large sample space (27). Patterns learned from very limited known interactions can rarely be generalized to the whole protein database. (b) Experimental negative data were difficult to obtain; hence shuffled ligands were created as negative data for training. It is intrinsically easier to separate real protein sequences from shuffled ones than to distinguish real sequences of binding proteins from real sequences of non-binding proteins. The results generated by predictors trained on shuffled sequences might not adequately reflect the true precision (40); therefore cross-validation overrates the actual classification capability of the predictor, and hence ESP overrates SP. (c) The generalization capabilities of current machine learning approaches, such as SVM and PNN, are still imperfect.
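Point (b) refers to negatives made by shuffling known ligands. One simple way to build such negatives is sketched below. The exact shuffling procedure is not described in this section, so the composition-preserving permutation, the function name, and the example peptides are assumptions for illustration only.

```python
import random
from typing import List


def shuffled_negatives(ligands: List[str], seed: int = 0, max_tries: int = 100) -> List[str]:
    """Create one shuffled peptide per ligand, keeping its amino acid composition."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    positives = set(ligands)
    negatives = []
    for peptide in ligands:
        residues = list(peptide)
        for _ in range(max_tries):
            rng.shuffle(residues)
            candidate = "".join(residues)
            # Reject shuffles that reproduce a known ligand.
            if candidate not in positives:
                negatives.append(candidate)
                break
        # A peptide with no distinct permutation (e.g. "AAAA") yields no negative.
    return negatives


# Hypothetical peptides, not taken from the training set.
ligands = ["PPLPPRNRPRL", "RPLPVAPGSS"]
for positive, negative in zip(ligands, shuffled_negatives(ligands)):
    print(positive, "->", negative)
```

Each shuffled peptide keeps the residue composition of its source but loses the residue order, which is why a classifier can separate such negatives from real ligands more easily than it can separate real non-binding sequences.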

From the discussion above, it can be concluded that cross-validation precision is higher than ESP and that ESP in turn overrates SP. Performance in cross-validation therefore does not represent practical performance in database screening. For a new prediction model, the best testing strategy is experimental validation (41).

When no experimental validation is available, how can one choose a predictor likely to have a higher database screening capability? This can be estimated with the ESP formula. Taking the dot product PDZ predictor as an example, when the predictor was optimized to the best MCC (MCC = 0.8106, precision = 90.53%), ESP was calculated to be only 8.25%. ESP dropped sharply even though the precision of this predictor was not much lower than that of the predictor we used to screen the database (precision = 97.01%). Classifiers with lower precision would suffer from many more false positives and would not be effective in guiding the design of biological experiments. (A prediction is called successful only when the total efficiency of the computational screen and the subsequent experimental validation is higher than that of de novo experimental screening.) To reduce false positive predictions in database screening, we chose a predictor with high precision and moderate sensitivity in cross-validation rather than the one with the highest MCC.
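The trade-off between the highest MCC and the highest precision can be made concrete with a small calculation. The sketch below uses the standard definitions of precision, sensitivity, and the Matthews correlation coefficient; the two operating points and their confusion-matrix counts are hypothetical and are not results from this study.

```python
import math


def precision_from_counts(tp: int, fp: int) -> float:
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def sensitivity_from_counts(tp: int, fn: int) -> float:
    """Sensitivity = TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from a 2x2 confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical operating points on a balanced set of 100 positives and 100 negatives.
operating_points = {
    "lenient": {"tp": 90, "fn": 10, "fp": 10, "tn": 90},
    "strict": {"tp": 60, "fn": 40, "fp": 2, "tn": 98},
}

for name, c in operating_points.items():
    print(
        name,
        f"precision={precision_from_counts(c['tp'], c['fp']):.3f}",
        f"sensitivity={sensitivity_from_counts(c['tp'], c['fn']):.3f}",
        f"MCC={mcc(c['tp'], c['tn'], c['fp'], c['fn']):.3f}",
    )

best_by_mcc = max(operating_points, key=lambda n: mcc(
    operating_points[n]["tp"], operating_points[n]["tn"],
    operating_points[n]["fp"], operating_points[n]["fn"]))
best_by_precision = max(operating_points, key=lambda n: precision_from_counts(
    operating_points[n]["tp"], operating_points[n]["fp"]))
print("best by MCC:", best_by_mcc, "| best by precision:", best_by_precision)
```

In this toy comparison the lenient point has the higher MCC, whereas the strict point gives up sensitivity for fewer false positives, which is the kind of predictor preferred here for database screening.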

SH3 prediction results appeared to be more successful than PDZ prediction results. There are three possible explanations. (a) The SP indices might underestimate the real performance of the PDZ predictors because the peptide SPOT array experiments on PDZ domains, from which SP was calculated, screened only part of the Swiss-Prot database. (b) The odds of SH3-binding ligands in the database were higher than those of PDZ-binding ligands. Calculated by the formula, the ESP of the SH3 predictors was higher than that of the PDZ predictors, and thus the SH3 predictors would exhibit better performance if they generalized well. (c) The SH3 training set included experimental negatives in addition to the shuffled sequences. The predictors were therefore better trained to discriminate binders from non-binders, resulting in a higher rate of successful predictions. Using experimentally tested negative data in the training set could improve the practical performance of a predictor.


    CONCLUSION
 
We predicted peptide ligands of 10 SH3 and three PDZ domains. Of these predictions, 20–30% have already been experimentally confirmed by independent research groups. Compared with the previous sequence-based prediction method, which requires a large amount of interaction data for each domain (41), our system learned from fewer than 10 ligands per domain on average. The system could also generalize to predict ligands belonging to classes not included in the training set, which is impossible for methods based solely on one class of ligands. Predictions could be further improved by filters based on supplemental biological information.

This system can potentially be used to predict ligands of other peptide binding domains (SH2, PTB, etc.). Domain-domain interactions, such as those between G-proteins and G-protein-coupled receptors, could also be predicted by a similar procedure in which interface residues are extracted from each domain.


    ACKNOWLEDGMENTS
 
We thank Sucan Ma, Rui Tian, Shijuan Gao, and Ali Song for generous help in biological experiments; Xiaolin Yang and Fuxin Li for valuable suggestions in SVM and artificial neural network; and Weizhi Chen for critical reading and revision of the manuscript.


   FOOTNOTES
 
Received, October 24, 2005, and in revised form, January 24, 2006.

Published, MCP Papers in Press, March 29, 2006, DOI 10.1074/mcp.M500346-MCP200

1 The abbreviations used are: SH, Src homology; SVM, support vector machine; PNN, probabilistic neural network; SP, screening precision; ESP, estimated screening precision; MHC, major histocompatibility complex; MCC, Matthews correlation coefficient; BLU, Boehringer light unit.

* This work was supported in part by The National Basic Research Program Grant 2004CB520804 and National Natural Science Foundation Grants 30270657, 30230150, and 3037030. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

S The on-line version of this article (available at http://www.mcponline.org) contains supplemental material.

‡ Both authors contributed equally to this work.

§ To whom correspondence should be addressed. Tel.: 86-010-6521-2284; Fax: 86-010-6521-2284; E-mail: gaoyouhe@pumc.edu.cn


    REFERENCES
 

  1. Pawson, T., and Nash, P. (2003) Assembly of cell regulatory systems through protein interaction domains. Science 300, 445–452

  2. Pawson, T., and Scott, J. D. (1997) Signaling through scaffold, anchoring, and adaptor proteins. Science 278, 2075–2080

  3. Mayer, B. J. (2001) SH3 domains: complexity in moderation. J. Cell Sci. 114, 1253–1263

  4. Nourry, C., Grant, S. G., and Borg, J. P. (2003) PDZ domain proteins: plug and play! Sci. STKE 2003, RE7

  5. Tong, J. C., Tan, T. W., and Ranganathan, S. (2004) Modeling the structure of bound peptide ligands to major histocompatibility complex. Protein Sci. 13, 2523–2532

  6. Michielin, O., and Karplus, M. (2002) Binding free energy differences in a TCR-peptide-MHC complex induced by a peptide mutation: a simulation analysis. J. Mol. Biol. 324, 547–569

  7. Brannetti, B., Via, A., Cestra, G., Cesareni, G., and Helmer-Citterich, M. (2000) SH3-SPOT: an algorithm to predict preferred ligands to different members of the SH3 gene family. J. Mol. Biol. 298, 313–328

  8. Altuvia, Y., and Margalit, H. (2004) A structure-based approach for prediction of MHC-binding peptides. Methods 34, 454–459

  9. Obenauer, J. C., Cantley, L. C., and Yaffe, M. B. (2003) Scansite 2.0: proteome-wide prediction of cell signaling interactions using short sequence motifs. Nucleic Acids Res. 31, 3635–3641

  10. Tong, A. H., Drees, B., Nardelli, G., Bader, G. D., Brannetti, B., Castagnoli, L., Evangelista, M., Ferracuti, S., Nelson, B., Paoluzi, S., Quondam, M., Zucconi, A., Hogue, C. W., Fields, S., Boone, C., and Cesareni, G. (2002) A combined experimental and computational strategy to define protein interaction networks for peptide recognition modules. Science 295, 321–324

  11. Honeyman, M. C., Brusic, V., Stone, N. L., and Harrison, L. C. (1998) Neural network-based prediction of candidate T-cell epitopes. Nat. Biotechnol. 16, 966–969

  12. Brusic, V., Rudy, G., Honeyman, G., Hammer, J., and Harrison, L. (1998) Prediction of MHC class II-binding peptides using an evolutionary algorithm and artificial neural network. Bioinformatics 14, 121–130

  13. Dönnes, P., and Elofsson, A. (2002) Prediction of MHC class I binding peptides, using SVMHC. BMC Bioinformatics 3, 25–38

  14. Bhasin, M., and Raghava, G. P. (2004) SVM based method for predicting HLA-DRB1*0401 binding peptides in an antigen sequence. Bioinformatics 20, 421–423

  15. Rammensee, H., Bachmann, J., Emmerich, N. P., Bachor, O. A., and Stevanovic, S. (1999) SYFPEITHI: database for MHC ligands and peptide motifs. Immunogenetics 50, 213–219

  16. Martin, S., Roe, D., and Faulon, J. L. (2005) Predicting protein-protein interactions using signature products. Bioinformatics 21, 218–226

  17. Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M. C., Estreicher, A., Gasteiger, E., Martin, M. J., Michoud, K., O’Donovan, C., Phan, I., Pilbout, S., and Schneider, M. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 31, 365–370

  18. Schultz, J., Milpetz, F., Bork, P., and Ponting, C. P. (1998) SMART, a simple modular architecture research tool: identification of signaling domains. Proc. Natl. Acad. Sci. U. S. A. 95, 5857–5864

  19. Berman, H. M., Bhat, T. N., Bourne, P. E., Feng, Z., Gilliland, G., Weissig, H., and Westbrook, J. (2000) The Protein Data Bank and the challenge of structural genomics. Nat. Struct. Biol. 7 (suppl.), 957–959

  20. Thompson, J. D., Higgins, D. G., and Gibson, T. J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673–4680

  21. Kyte, J., and Doolittle, R. F. (1982) A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 157, 105–132

  22. Bull, H. B., and Breese, K. (1974) Surface tension of amino acid solutions: a hydrophobicity scale of the amino acid residues. Arch. Biochem. Biophys. 161, 665–670

  23. Chothia, C. (1975) Structural invariants in protein folding. Nature 254, 304–308

  24. Bhaskaran, R., and Ponnuswamy, P. K. (1984) Dynamics of amino acid residues in globular proteins. Int. J. Pept. Protein Res. 24, 180–191

  25. Zimmerman, J. M., Eliezer, N., and Simha, R. (1968) The characterization of amino acid sequences in proteins by statistical methods. J. Theor. Biol. 21, 170–201

  26. Betancourt, M. R., and Thirumalai, D. (1999) Pair potentials for protein folding: choice of reference states and sensitivity of predicted native states to variations in the interaction schemes. Protein Sci. 8, 361–369

  27. Vapnik, V. N. (2000) The Nature of Statistical Learning Theory, 2nd Ed., Springer, New York

  28. Joachims, T. (1999) Making large-scale SVM learning practical, in Advances in Kernel Methods: Support Vector Learning (Schölkopf, B., Burges, C., Smola, A., eds) pp. 169–184, MIT Press, Cambridge, MA

  29. Matthews, B. W. (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta 405, 442–451

  30. Boisguerin, P., Leben, R., Ay, B., Radziwill, G., Moelling, K., Dong, L., and Volkmer-Engert, R. (2004) An improved method for the synthesis of cellulose membrane-bound peptides with free C termini is useful for PDZ domain binding studies. Chem. Biol. 11, 449–459

  31. Wiedemann, U., Boisguerin, P., Leben, R., Leitner, D., Krause, G., Moelling, K., Volkmer-Engert, R., and Oschkinat, H. (2004) Quantification of PDZ domain specificity, prediction of ligand affinity and rational design of super-binding peptides. J. Mol. Biol. 343, 703–718

  32. Landgraf, C., Panni, S., Montecchi-Palazzi, L., Castagnoli, L., Schneider-Mergener, J., Volkmer-Engert, R., and Cesareni, G. (2004) Protein interaction networks by proteome peptide scanning. PLoS Biol. 2, 94–103

  33. Beuming, T., Skrabanek, L., Niv, M. Y., Mukherjee, P., and Weinstein, H. (2005) PDZBase: a protein-protein interaction database for PDZ-domains. Bioinformatics 21, 827–828

  34. Baldi, P., and Brunak, S. (2001) Bioinformatics: the Machine Learning Approach, 2nd Ed., pp. 97, 115, and 126, MIT Press, Cambridge, MA

  35. Hansen, L. K., and Salamon, P. (1990) Neural network ensembles. IEEE Trans. Pattern Anal. Mach. Intell. 12, 993–1001

  36. Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., and Wheeler, D. L. (2005) GenBank. Nucleic Acids Res. 33, D34–D38

  37. Baum, E., and Haussler, D. (1989) What size net gives valid generalization? Neural Comput. 1, 151–160

  38. Perrone, M., and Cooper, L. (1993) When Networks Disagree: Ensemble Method for Neural Networks, Chapman-Hall, London

  39. Guo, T., Hua, S., Ji, X., and Sun, Z. (2004) DBSubLoc: database of protein subcellular localization. Nucleic Acids Res. 32, D122–D124

  40. Lo, S. L., Cai, C. Z., Chen, Y. Z., and Chung, M. C. (2005) Effect of training datasets on support vector machine prediction of protein-protein interactions. Proteomics 5, 876–884

  41. Brusic, V., Bajic, V. B., and Petrovsky, N. (2004) Computational methods for prediction of T-cell epitopes: a framework for modelling, testing, and applications. Methods 34, 436–443





