Abstract
In the discovery of new drugs, lead identification and optimization have assumed critical importance given the number of drug targets generated from genetic, genomics, and proteomic technologies. Highthroughput experimental screening assays have been complemented recently by “virtual screening” approaches to identify and filter potential ligands when the characteristics of a target receptor structure of interest are known. Virtual screening mandates a reliable procedure for automatic ranking of structurally distinct ligands in compound library databases. Computing a rank score requires the accurate prediction of binding affinities between these ligands and the target. Many current scoring strategies require information about the target threedimensional structure. In this study, a new method to estimate the free binding energy between a ligand and receptor is proposed. We extend a central idea previously reported (Bock, J. R., and Gough, D. A. (2001) Predicting proteinprotein interactions from primary structure. Bioinformatics 17, 455–460; Bock, J. R., and Gough, D. A. (2002) Wholeproteome interaction mining. Bioinformatics, in press) that uses simple descriptors to represent biomolecules as input examples to train a support vector machine (Smola, A. J., and Schölkopf, B. (1998) A Tutorial on Support Vector Regression, NeuroCOLT Technical Report NCTR98030, Royal Holloway College, University of London, UK) and the application of the trained system to previously unseen pairs, estimating their propensity for interaction. Here we seek to learn the function that maps features of a receptorligand pair onto their equilibrium free binding energy. These features do not comprise any direct information about the threedimensional structures of ligand or target. In crossvalidation experiments, it is demonstrated that objective measurements of prediction error rate and rankordering statistics are competitive with those of several other investigations, most of which depend on threedimensional structural data. The size of the sample (n = 2,671) indicates that this approach is robust and may have widespread applicability beyond restricted families of receptor types. It is concluded that newly sequenced proteins, or those for which threedimensional crystal structures are not easily obtained, can be rapidly analyzed for their binding potential against a library of ligands using this methodology.
The process of developing a new drug involves seven major steps (1). (i) First, a disease is identified, and then (ii) drug targets (usually proteins), the activation or inhibition of which is thought to alter the disease state, within the cell are hypothesized. Once targets are hypothesized, the next task is to (iii) identify potential lead compounds that will bind to the target. These leads are subsequently (iv) optimized with respect to their structural characteristics in the context of the target binding site and then subjected to (v) preclinical and (vi) clinical trials to determine their bioavailability and therapeutic potential. The final step is to (vii) optimize efficacy, toxicity, and pharmacokinetic properties. This may involve the use of pharmacogenomic techniques to tailor compounds to a subset of the patient population that is predisposed to a disease.
Pharmaceutical companies are exposed to great financial risk in the course of identifying viable drugs to treat a certain condition or disease. There are also tremendous direct and indirect (opportunity) costs associated with delaying the removal of nonviable drugs from this drug discovery “pipeline” until the latest stages of the process.
A huge number of drug targets have been generated from genetic, genomic, and proteomic technologies. Accordingly, the lead identification and optimization steps have assumed critical importance. Highthroughput experimental screening assays (30) have been complemented recently by computational (“virtual screening”) approaches to identify and filter potential ligands when the characteristics of the target receptor structure of interest are known (3, 35). In virtual screening, databases of compound libraries are searched, and scoring or discrimination functions are used to select the “best” candidate compounds for biological activity analysis (32).
The scoring of ligands in virtual screening is often associated with computational docking simulations that mate receptor and cognate small molecule ligand in threedimensional space. To provide broad generalization in “chemical diversity” space, computing this score requires the accurate prediction of binding affinities of many structurally distinct ligands (31). Three main methodologies have been identified for free binding energy calculations. In order of computational complexity, these are: 1) knowledgebased scoring functions, 2) partitioning the binding energy into biophysical energy terms, and 3) molecular dynamics (7). The most accurate computations are represented by molecular dynamic techniques, but their inherent computational intensity precludes their application to industrialsize chemical databases.
Regressionbased scoring functions, as exemplified by the work of Böhm (8), are fast but require a threedimensional structure of the receptor. This prohibits their use in cases where the structure is difficult to obtain, such as with transmembrane proteins. The accuracy of such methods has also been called into question. A recent investigation concluded that “no significant correlation” existed between Böhmtype scores and experimentally determined binding affinities for a group of 15 complexes (33).
In this study, we propose a new method to estimate the free binding energy between a ligand and receptor. We extend a central idea developed in previous investigations (10, 11) that uses simple descriptors to represent biomolecules as input examples to train a support vector machine (19) and the application of the trained system to previously unseen pairs, estimating their propensity for interaction. Here we seek to learn the function that maps features of a receptorligand pair onto their equilibrium free binding energy.
EXPERIMENTAL PROCEDURES
Thermodynamics of Binding
For our purposes, consider that a single protein P binds a single small molecule ligand L to form complex C, or Assuming that this reaction is in thermodynamic equilibrium, the Gibbs free energy change on binding ΔG^{0} is written where R is the gas constant, T is the temperature (K), and K_{a} is the equilibrium binding constant between protein and ligand.^{1} K_{a} is defined as where [C], [P], and [L] are molar concentrations of complex product, protein, and ligand reactants, respectively. Often the equilibrium dissociation constant K_{d} is used to quantify ligand binding strength. It is simply the inverse of the binding constant, or and represents the concentration of ligand required to saturate half of the available binding sites of the protein.
Calculation of ΔG^{0} usually entails its partitioning into various energetic components accounting for rotatable bond entropy, hydrogen bonds and ionic interaction forces, lipophilic proteinligand contact surface, and others (13).
Database of LigandReceptor Objects
The data set used in this investigation was aggregated automatically using information located in a number of disparate online resources coupled with local computations. An object database was constructed from this data and subsequently sampled to generate examples for training and testing the performance of the regression estimation system. The experimental database consisted of 2,956 objects, each having attributes as summarized in this section.
LigandReceptor Complex—
Ligandreceptor data were extracted from the Computed Ligand Binding Energy (CLiBE)^{2},^{3} database, a compendium of information on complexed receptors and ligands. Each record in CLiBE contains computed values for the total ligandreceptor potential energy field ΔG^{0}, given by where the righthandside partitioning represents energy contributions due to nonbonded van der Waals interactions, hydrogen bonds, electrostatic forces, and ligand desolvation energies, respectively (34). Methods underlying the computation of binding energies comprising the database subject to this investigation are described in Ref. 2.
The complexes within this resource are themselves based on “heterogen” records found in the Protein Data Bank (PDB) (16)^{4} for which a chemical identity has been assigned to the ligand. PDB is a public domain repository of experimentally determined structures of biological macromolecules.
Ligand Structures and Chemical Names—
Data files with entries representing ligand structures and their associated chemical names were obtained from the NCI, National Institutes of Health Open Database of Compounds.^{5} The data entries were represented as “SMILES” strings, where SMILES (Simplified Molecular Input Line Entry System) is a specification and nomenclature for describing molecules as a compact, onedimensional strings of characters, including atoms, bonds, aromatic rings, and branches (17).
Molecular Connectivity—
The SMILES representation for each ligand molecule was converted to a twodimensional connectivity matrix using a computational chemistry package (JOElib,^{6} Ref. 18). The rows and columns of this matrix reflect the cardinality of constituent atoms established by the SMILES representation. At row i and column j, a unitvalued entry is made if the corresponding atoms in the molecule are covalently connected; otherwise the value of that matrix element is zero. Diagonal elements of this matrix store the appropriate atomic number as suggested previously (15).
Molecular Synonyms—
To maximize the chemical diversity of objects potentially available for numerical experiments, a list of common chemical synonyms corresponding to each ligand were obtained using the online ChemFinder service.^{7} Each ligand synonym within its list was used in a lexical similarity search of the NCI, National Institutes of Health compound files to obtain SMILES representations in cases where different chemical names were used for identical ligands across databases.
Support Vector Regression
The support vector algorithm, based on statistical learning theory, is applicable to both 1) binary classification and 2) regression estimation (19). In previous work, we developed methods to train a support vector machine classifier to learn to predict proteinprotein interactions using descriptors based on physicochemical properties of paired amino acid sequences (10, 11). In the present application, we propose to exploit the support vector algorithm to solve a regression problem. The concept to be learned is the functional mapping between a set of ligandreceptor features and the total free binding energy of the complex. The basic idea in support vector regression (SVR) is to map a set of input patterns X = {x_{1}, x_{2}, …, x_{l}} ∈ R^{n} onto a highdimensional feature space F via a nonlinear mapping Φ:R^{n} → R^{D} (D ≫ n), and then perform linear regression in F. Each pattern vector x_{i} has a matching target value y_{i} ∈ R. The goal is to find a function y = f(x) representing the realvalued pairs {z_{i}  z_{i} = (x_{i}, y_{i}), i ∈ 1, …, l} within a certain acceptable maximum deviation level ε (12). Practical implementation issues with SVR are presented in Refs. 12 and 14, and theory and algorithms for extension to regression estimation with noisy data appear in Ref. 20.
Feature Representation
Each ligandreceptor complex was transformed into a vector of numerical features presumed salient for learning the target concept. Receptor and ligand feature vectors constructed as outlined in this section are concatenated and labeled with the value of their total free binding energy. These vectors are subjected to support vector machine regression training and crossvalidation testing to evaluate how keenly the system learned the concept as posed.
Receptor—
Receptor protein features were generated as described previously (10) considering tabulated physicochemical properties (charge, hydrophobicity, and surface tension) of the amino acid sequence thought to be prototypical of binding characteristics of the receptor. Each residue in sequence was replaced by floating point numbers with values corresponding to these physical properties. This vector of numbers was then mapped onto a fixedlength interval to provide a basis for comparison between receptor proteins of varying sequence length.
Ligand—
Exemplars for the ligand component of each molecular complex required a novel approach. The design ethos followed here dictates beginning with a minimal, elemental group of features to develop intuition regarding the feature space. In accordance with this approach, the twodimensional molecular connection matrix described under “Database of LigandReceptor Objects” was supplemented by additional arrays, each of which contained numerical values for fundamental, measurable chemical properties characterizing the atoms comprising the molecule. These properties included the atomic ionization potential energy, which represents the energy necessary to remove the outermost electron from the ground state of a neutral atom, and the electron affinity, which is a measure of energy change upon adding an electron to a neutral atom (21). Ionization energies are always positively valued, while electron affinities may assume either positive or negative numerical values.
For each small molecule ligand, three twodimensional arrays representing molecular topology, electronic structure, and chemical behavior of the component elements were concatenated into a single, wide matrix. The resulting aggregate data matrix was then factorized using the singular value decomposition (22). The singular values computed in this factorization are extracted, representing a projection onto onedimensional space of the essential characteristics of molecular bond topology and, it is hypothesized, the spatial distribution of molecular properties important for binding with a receptor.
Burden (15) introduced the idea of computing the eigenvalues of a hydrogensuppressed molecular bond graph with atomic number on the diagonal and numbers indicating bond presence and type at off diagonal positions. This matrix was used as a means to group substructures for chemical similarity search. In that work, it was maintained that the smallest eigenvalue embodied information on all molecules and therefore was sufficient as a topological descriptor. Here all singular values are retained, regardless of their relative magnitudes, as discarding the entire set is not justifiable. This vector is finally stretched (or compressed) onto a fixedlength interval as was performed for the receptor features.
IMPLEMENTATION
Learning Concept—
The concept to be learned is the function y = f(x) that maps ligandprotein feature vectors x to the corresponding free energy of binding y. How well the SVR machine learns this concept will be quantified using the statistics described under “Evaluation of Machine Learning” collected from observations of the crossvalidation protocol as described under “Crossvalidation Experiments.”
Evaluation of Machine Learning—
One measure of effectiveness for regression estimation is the normalized mean squared error (nmse), given by where N is the number of target points predicted, σ^{2} is the actual sample variance, and y_{k} and ŷ_{k} are the actual and estimated target values of the kth data point, respectively (23). Because nmse is normalized by the sample variance, it may be used to compare different regression studies on a more equitable basis than would be possible using the conventional root mean square error; intuitively, a given prediction experiment is less challenging where the variance in the data is small. Notice that if we replace the prediction terms ŷ_{k} with the arithmetic mean y̅ in Equation 6, the value of the statistic is 1. This trivial case results when the predictor simply outputs the mean value of the data. Low values of nmse indicate good overall predictive acuity.
Pointwise predictions of ligand binding may be evaluated using the normalized mean absolute error (nmae), defined by This statistic is normalized by the sample variance for the same reasons as were cited for nmse above. Furthermore, its value may be interpreted as the number of standard deviations, on average, that predictions differ from the target values across the test set. The lower the value of nmae, the better the system pointwise predictive ability.
In some ligand screening situations (such as virtual screening, Ref. 3), predicting the relative ranking of binding strengths among a set of ligandreceptor pairs may be desired. The output of such an analysis would be a list of predicted binding energies sorted according to predicted magnitudes ΔĜ^{0}. In such cases a measurement of nonparametric or rank correlation, such as represented by Kendall’s τ coefficient (24), is informative. In crossvalidation, given an ordered array of N “(actual, predicted)” values (y_{1}, ŷ_{1}), . . ., (y_{N}, ŷ_{N}), we systematically compare the numerical signs of individual bivariate pairs X = (y_{i}, ŷ_{i}) and Y = (y_{j}, ŷ_{j}) for i = 1, …, N, j = (i + 1), …, N.
If either (a) y_{i} > y_{j} and ŷ_{i} > ŷ_{j} or (b) y_{i} < y_{j} and ŷ_{i} < ŷ_{j} is observed, X and Y are said to be “concordant.” Otherwise the points are “discordant.” Kendall’s τ expresses the tendency of two ordered lists y and ŷ to coordinately increase or decrease and is computed as where N_{C} is the total number of concordant pairs, N_{D} is the number of discordant pairs, and T_{X}, T_{Y} are counts of the “ties” found in X and Y pairs, respectively. A large positive (negative) value of τ indicates that the rank ordered values within {y} and {ŷ} are positively (negatively) correlated.
Crossvalidation Experiments—
To estimate the generalization error of the trained support vector regression system, we averaged the results of 10 separate 10fold crossvalidation experiments. In kfold crossvalidation, k random, equalsized, disjoint partitions (folds) of the example data are constructed, and an “inducer” (here, an SVR engine) is trained on (k − 1) folds with the excluded fold being used to test the trained system performance. After k such experiments, the results are averaged, and the observed error rate may be taken as an estimate of the error rate expected upon generalization to new data (25). To reduce further the effects of chance in randomly sampling the data, we averaged the results of 10 different 10fold crossvalidation experiments, performing 100 different training/testing procedures. The results we present are crossvalidation averages for the statistics nmse, nmae, and τ as described under “Evaluation of Machine Learning.”
The total sample used in these experiments comprised 2,671 distinct ligandreceptor complexes. The output of the trained system is a predicted level of binding free energy y (in kcal/mol) given a set of features abstracted from a given input complex x. A qualitative glimpse of typical results from one complete 10fold crossvalidation test is offered in Fig. 1, which shows a scatter plot of actual versus predicted binding energy. Fig. 1 shows that some degree of correlation between prediction and truth exists. This correlation will be examined on an objective basis in the discussion of “Crossvalidation Results.”
CROSSVALIDATION RESULTS
The principal results obtained in this investigation are summarized in Table I and in Fig. 1. Table I compares the 10 10fold crossvalidation error estimates with a number of studies reported in the literature. In contrast to the present results (shown in boldface), all of the competing methodologies shown in Table I are derived from scoring functions or simulations predicated upon knowledge of the threedimensional structure of receptor and ligand complex. The columns in Table I comprise test sample size N, the mean target binding energy y̅, and standard deviation σ^{2} in kcal/mol, nmse (Equation 6), nmae (Equation 7), and Kendall’s τ (Equation 8).
The records in the table are listed in order of increasing nmse. This statistic is proposed as the primary objective indicator of accuracy for direct prediction of binding free energy.
DISCUSSION
Of particular note on consideration of Table I are the sample size and mean free binding energies characterizing the ligandreceptor data used here when contrasted to the other investigations. The current sample size (n = 2,671) is a factor 42 times larger than the next largest data set. The mean free binding energy is seen to be −38 kcal/mol, significantly stronger than the other data summarized in the Table I. Moreover, it can be seen that the present data set is highly variable as the standard deviation (35 kcal/mol) is on the same order as the mean.
Recall from the previous discussion that nmse values on the order of 1 are tantamount to trivial prediction of the mean value of a test data set. Lower values of nmse are associated with genuine learning of underlying patterns in the data and effective generalization. On this basis, the highest predictive accuracy (entry 1, nmse = 0.198) observed in this comparative study was realized by Head and coworkers (26), who present a hybrid approach combining ligandreceptor threedimensional structural information and parameters derived from molecular mechanics. The test set comprised 14 ligandreceptor complexes.
The second best nmse in this group was achieved by Böhm (8) using a regressionbased empirical scoring function based on hydrogen bonds, electrostatics, complementary surface areas, and other characteristics of receptorligand pairs where the threedimensional structure has been previously determined.
Next in our list of prediction results is the investigation reported in Wang et al. (4). Their approach uses another empirical scoring function for binding free energy that explicitly accounts for contributions due to Van der Waals interactions, metalligand bonding, hydrogen bonds, desolvation energies, and different kinematic effects. A regression equation is developed using these terms derived from known receptorligand complexes. All 11 data points in the test sample were based on endothiapepsin receptor complexes.
The current method, based on support vector regression, obtained the fourth best prediction error (nmse = 0.419) averaged over 10 different 10fold crossvalidation tests. We suggest that this error rate represents a significant step for the following reasons.
The error rate and rank correlation value are surprisingly competitive with other investigations in light of the relatively large variance and extremely large sample size of the underlying data set. Note that the fifth lowest nmse value in Table I was also obtained by Head et al. (26) for a different data set than they used in entry 1. Group 5 comprised 13 HIV1 proteaseHIV protease inhibitor complexes and showed a value of nmse = 0.440. So the same methodology by the same research group, applied on a different data set, realized much different predictive results. This demonstrates the variability in results that are possible when using small sample sizes while providing confidence in the robustness of our current method and results, which were based on a sample size n = 2,671.
The features used to represent the ligandprotein complexes in the support vector regression do not require any information about threedimensional structure. All that is required as input data are the amino acid sequence of the receptor, a connection table representing the ligand structure in two dimensions, and the atom characteristics at the nodes of this connection table.
There is no limitation on the protein family membership of the putative receptor(s), on the type (organic or synthetic), or on the size of ligand used.
The results obtained in this study suggest that it may be possible to infer binding energies for complexes involving newly sequenced or difficulttocrystallize proteins or for ligands that only exist in computer memory, awaiting synthesis upon successful in silico screening.
Rank Correlation—
We draw the reader’s attention to the trend in Kendall’s rank correlation statistic τ in Table I. It is apparent that there is a general inverse correlation between the magnitude of binding energy prediction errors (nmse and nmae) and the value of τ. That is, low values of prediction error are associated with high values of the correlation coefficient. τ measures the tendency of two ordinal random variables (here actual and predicted binding energy rank) to increase or decrease coordinately. If direct prediction of the physical binding energy is reasonably accurate, we would expect to see a positive and nontrivial correlation between the corresponding rankordered variables.
Computing biomolecular binding energies to higher accuracy remains a challenging problem (6). One author recently noted that current computational docking simulations, used to search for the best (lowest energy) “fit” of ligand into a target receptor cavity, still “suffer from insufficient precision of the scoring functions” (5). In Ref. 9, molecular dynamics simulations focused on biotin binding to avidin and streptavidin indicated that the energies of protein and ligand reorganization were found to be significant contributors to proteinligand binding free energy in molecular dynamic simulations. These reorganization energies were estimated to be on the order of 10–30 and 4.5–6 kcal/mol for protein and ligand, respectively. Because of the large variance in protein reorganization energy, the authors concluded that precise predictions of binding free energy were suspect.
Given these difficulties, the ability to reliably rank a set of ligandreceptor complexes during lead optimization (versus directly computing binding energy) remains important in the area of drug discovery. Such a procedure may add value, for example, as a decision aid when downselecting a set of ligands for chemical synthesis. In connection with the current methodology, we recognize that training the SVR requires example data representing estimated or measured values of binding free energy. The output of a computational technique cannot exceed the accuracy of its input; this is especially true with systems that learn from examples. Therefore, at present the qualitative analysis or ranking of potential ligands may be the main utility of the SVR technique.
The prediction evaluation statistics appearing in Table I are presented in the form of a bar chart in Fig. 2. The investigations numbered along the horizontal axis appear in order of increasing nmse and correspond to the numbering in Table I. This visualization provides a different perspective on the opposing trends of nmse, nmae, and τ as discussed above.
Conclusions—
In this work, we have introduced a new methodology, showing that it is possible to predict the binding free energy between ligand and receptor without direct information about their threedimensional structures.
In crossvalidation experiments, we have demonstrated that objective measurements of prediction error rate and rankordering statistics are competitive with several other investigations, most of which depend on threedimensional structural data. The size of the sample used (n = 2,671) indicates that this approach is robust and may have widespread applicability beyond restricted families of receptor types. Newly sequenced proteins, or those for which threedimensional crystal structures are not easily obtained, can be rapidly analyzed for their binding potential against a library of ligands using this methodology.
Footnotes

Published, MCP Papers in Press, November 6, 2002, DOI 10.1074/mcp.M200054MCP200

↵1 Under physiological conditions (310 K, 1 atm, 1.0 m), the value of RT is about 2.577 kJ/mol or 0.616 kcal/mol.

↵2 The abbreviations used are: CLiBE, Computed Ligand Binding Energy; SMILES, Simplified Molecular Input Line Entry System; PDB, Protein Data Bank; nmse, normalized mean square error; nmae, normalized mean absolute error; SVR, support vector regression; HIV, human immunodeficiency virus.

↵3 CLiBE circa August 2002 has 14,731 records with 2,803 distinct ligands and 2,256 distinct receptors; see xin.cz3.nus.edu.sg/group/clibe/clibe.asp.

↵4 The PDB contains 18,294 structures as of July 23, 2002; see www.rcsb.org/pdb.

↵5 Available at cactvs.cit.nih.gov/ncidb2/download.html; this resource currently contains over 250,000 compounds.

↵6 Open source available at joelib.sourceforge.net.

↵* This work was supported in part by grants from the CNRS (Action Concertée Incitative program Protéomique et Génie des Protéines), the Association pour la Recherche contre le Cancer (Contrat N 4435 Réseau Alliance des Recherches sur le Cancer, Protéomique et Cancer), the Conseil Régional MidiPyrénées, and the Consortium Pierre Fabre Médicament (Program Après Séquençage Génomique, Protéomique et nouvelles cibles thérapeutiques). The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
 Received September 5, 2002.
 © 2002 The American Society for Biochemistry and Molecular Biology