Submitted on October 24, 2005
Revised on January 24, 2006
Accepted on March 29, 2006
An integrated machine learning system to computationally screen protein databases for protein binding peptide ligands
Ling Zhang, Chen Shao, Dexian Zheng, and Youhe Gao
Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences/Peking Union Medical College, Beijing, Beijing 100005
Corresponding Author: gaoyouhe{at}pumc.edu.cn
A fairly large set of protein interactions is mediated by families of peptide-binding domains, such as SH2, SH3, PDZ, and MHC. Identifying their ligands by experimental screening is not only labor-intensive but almost futile for low-abundance species, whose signals are suppressed by high-abundance species. An ideal way to study protein-protein interactions is to use high-throughput computational approaches to screen protein sequence databases, so as to direct validation experiments towards the most promising peptides. Predictors that merely perform well in cross-validation have not been good enough to screen protein databases. In the current study, we built integrated machine learning systems using three novel coding methods and screened the Swiss-Prot and GenBank protein databases for potential ligands of 10 SH3 and 3 PDZ domains. A large fraction of the predictions have already been experimentally confirmed by independent research groups, indicating satisfactory generalization capability for future applications in identifying protein interactions.
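To make the database-screening idea concrete, the following is a minimal sketch of how candidate peptides might be enumerated from a protein sequence and ranked by a predictor. It is not the coding scheme or the learning system described in this paper: the function names, the window width, the threshold, and the toy scorer (which only checks for the well-known P-x-x-P motif associated with SH3 ligands) are all hypothetical placeholders standing in for a trained model.

```python
from typing import Callable, Iterator, List, Tuple


def sliding_windows(sequence: str, width: int) -> Iterator[Tuple[int, str]]:
    """Yield (start_index, peptide) for every window of the given width."""
    for start in range(len(sequence) - width + 1):
        yield start, sequence[start:start + width]


def toy_pxxp_score(peptide: str) -> float:
    """Hypothetical stand-in for a trained predictor.

    Returns 1.0 if the peptide contains a P-x-x-P pattern, else 0.0.
    A real system would encode the peptide and apply a learned model.
    """
    for i in range(len(peptide) - 3):
        if peptide[i] == "P" and peptide[i + 3] == "P":
            return 1.0
    return 0.0


def screen_sequence(
    sequence: str,
    width: int,
    score: Callable[[str], float],
    threshold: float,
) -> List[Tuple[int, str, float]]:
    """Score every window and keep those at or above the threshold.

    Hits are returned sorted by descending score, so the most promising
    candidates for experimental validation come first.
    """
    hits = []
    for start, peptide in sliding_windows(sequence, width):
        value = score(peptide)
        if value >= threshold:
            hits.append((start, peptide, value))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


if __name__ == "__main__":
    # Made-up sequence for illustration only; not a real database entry.
    for hit in screen_sequence("GGGGPAAPGG", 4, toy_pxxp_score, 0.5):
        print(hit)
```

In a full screen, the same loop would run over every entry of a sequence database, with `score` replaced by the trained predictor for a given domain.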