A New Method to Estimate Ligand-Receptor Energetics*

In the discovery of new drugs, lead identification and optimization have assumed critical importance given the number of drug targets generated from genetic, genomics, and proteomic technologies. High-throughput experimental screening assays have been complemented recently by “virtual screening” approaches to identify and filter potential ligands when the characteristics of a target receptor structure of interest are known. Virtual screening mandates a reliable procedure for automatic ranking of structurally distinct ligands in compound library databases. Computing a rank score requires the accurate prediction of binding affinities between these ligands and the target. Many current scoring strategies require information about the target three-dimensional structure. In this study, a new method to estimate the free binding energy between a ligand and receptor is proposed. We extend a central idea previously reported (Bock, J. R., and Gough, D. A. (2001) Predicting protein-protein interactions from primary structure. Bioinformatics 17, 455–460; Bock, J. R., and Gough, D. A. (2002) Whole-proteome interaction mining. Bioinformatics, in press) that uses simple descriptors to represent biomolecules as input examples to train a support vector machine (Smola, A. J., and Schölkopf, B. (1998) A Tutorial on Support Vector Regression, NeuroCOLT Technical Report NC-TR-98-030, Royal Holloway College, University of London, UK) and the application of the trained system to previously unseen pairs, estimating their propensity for interaction. Here we seek to learn the function that maps features of a receptor-ligand pair onto their equilibrium free binding energy. These features do not comprise any direct information about the three-dimensional structures of ligand or target. In cross-validation experiments, it is demonstrated that objective measurements of prediction error rate and rank-ordering statistics are competitive with those of several other investigations, most of which depend on three-dimensional structural data. The size of the sample (n = 2,671) indicates that this approach is robust and may have widespread applicability beyond restricted families of receptor types. It is concluded that newly sequenced proteins, or those for which three-dimensional crystal structures are not easily obtained, can be rapidly analyzed for their binding potential against a library of ligands using this methodology.

The process of developing a new drug involves seven major steps (1). (i) First, a disease is identified, and then (ii) drug targets (usually proteins), the activation or inhibition of which is thought to alter the disease state, within the cell are hypothesized. Once targets are hypothesized, the next task is to (iii) identify potential lead compounds that will bind to the target. These leads are subsequently (iv) optimized with respect to their structural characteristics in the context of the target binding site and then subjected to (v) preclinical and (vi) clinical trials to determine their bioavailability and therapeutic potential. The final step is to (vii) optimize efficacy, toxicity, and pharmacokinetic properties. This may involve the use of pharmacogenomic techniques to tailor compounds to a subset of the patient population that is predisposed to a disease.
Pharmaceutical companies are exposed to great financial risk in the course of identifying viable drugs to treat a certain condition or disease. There are also tremendous direct and indirect (opportunity) costs associated with delaying the removal of non-viable drugs from this drug discovery "pipeline" until the latest stages of the process.
A huge number of drug targets have been generated from genetic, genomic, and proteomic technologies. Accordingly, the lead identification and optimization steps have assumed critical importance. High-throughput experimental screening assays (30) have been complemented recently by computational ("virtual screening") approaches to identify and filter potential ligands when the characteristics of the target receptor structure of interest are known (3,35). In virtual screening, databases of compound libraries are searched, and scoring or discrimination functions are used to select the "best" candidate compounds for biological activity analysis (32).
The scoring of ligands in virtual screening is often associated with computational docking simulations that mate receptor and cognate small molecule ligand in three-dimensional space. To provide broad generalization in "chemical diversity" space, computing this score requires the accurate prediction of binding affinities of many structurally distinct ligands (31). Three main methodologies have been identified for free binding energy calculations. In order of computational complexity, these are: 1) knowledge-based scoring functions, 2) partitioning the binding energy into biophysical energy terms, and 3) molecular dynamics (7). The most accurate computations are represented by molecular dynamic techniques, but their inherent computational intensity precludes their application to industrial-size chemical databases.
Regression-based scoring functions, as exemplified by the work of Bö hm (8), are fast but require a three-dimensional structure of the receptor. This prohibits their use in cases where the structure is difficult to obtain, such as with transmembrane proteins. The accuracy of such methods has also been called into question. A recent investigation concluded that "no significant correlation" existed between Bö hm-type scores and experimentally determined binding affinities for a group of 15 complexes (33).
In this study, we propose a new method to estimate the free binding energy between a ligand and receptor. We extend a central idea developed in previous investigations (10,11) that uses simple descriptors to represent biomolecules as input examples to train a support vector machine (19) and the application of the trained system to previously unseen pairs, estimating their propensity for interaction. Here we seek to learn the function that maps features of a receptor-ligand pair onto their equilibrium free binding energy.

Thermodynamics of Binding
For our purposes, consider that a single protein P binds a single small molecule ligand L to form complex C, or P ϩ L^C (Eq. 1) Assuming that this reaction is in thermodynamic equilibrium, the Gibbs free energy change on binding ⌬G 0 is written where R is the gas constant, T is the temperature (K), and K a is the equilibrium binding constant between protein and ligand. 1 K a is defined as where [C], [P], and [L] are molar concentrations of complex product, protein, and ligand reactants, respectively. Often the equilibrium dissociation constant K d is used to quantify ligand binding strength. It is simply the inverse of the binding constant, or and represents the concentration of ligand required to saturate half of the available binding sites of the protein.
Calculation of ⌬G 0 usually entails its partitioning into various energetic components accounting for rotatable bond entropy, hydrogen bonds and ionic interaction forces, lipophilic protein-ligand contact surface, and others (13).

Database of Ligand-Receptor Objects
The data set used in this investigation was aggregated automatically using information located in a number of disparate on-line re-sources coupled with local computations. An object database was constructed from this data and subsequently sampled to generate examples for training and testing the performance of the regression estimation system. The experimental database consisted of 2,956 objects, each having attributes as summarized in this section.
Ligand-Receptor Complex-Ligand-receptor data were extracted from the Computed Ligand Binding Energy (CLiBE) 2,3 database, a compendium of information on complexed receptors and ligands. Each record in CLiBE contains computed values for the total ligandreceptor potential energy field ⌬G 0 , given by where the right-hand-side partitioning represents energy contributions due to non-bonded van der Waals interactions, hydrogen bonds, electrostatic forces, and ligand desolvation energies, respectively (34). Methods underlying the computation of binding energies comprising the database subject to this investigation are described in The complexes within this resource are themselves based on "heterogen" records found in the Protein Data Bank (PDB) (16) 4 for which a chemical identity has been assigned to the ligand. PDB is a public domain repository of experimentally determined structures of biological macromolecules.
Ligand Structures and Chemical Names-Data files with entries representing ligand structures and their associated chemical names were obtained from the NCI, National Institutes of Health Open Database of Compounds. 5 The data entries were represented as "SMILES" strings, where SMILES (Simplified Molecular Input Line Entry System) is a specification and nomenclature for describing molecules as a compact, one-dimensional strings of characters, including atoms, bonds, aromatic rings, and branches (17).
Molecular Connectivity-The SMILES representation for each ligand molecule was converted to a two-dimensional connectivity matrix using a computational chemistry package (JOElib, 6 Ref. 18). The rows and columns of this matrix reflect the cardinality of constituent atoms established by the SMILES representation. At row i and column j, a unit-valued entry is made if the corresponding atoms in the molecule are covalently connected; otherwise the value of that matrix element is zero. Diagonal elements of this matrix store the appropriate atomic number as suggested previously (15).
Molecular Synonyms-To maximize the chemical diversity of objects potentially available for numerical experiments, a list of common chemical synonyms corresponding to each ligand were obtained using the on-line ChemFinder service. 7 Each ligand synonym within its list was used in a lexical similarity search of the NCI, National Institutes of Health compound files to obtain SMILES representations in cases where different chemical names were used for identical ligands across databases.

Support Vector Regression
The support vector algorithm, based on statistical learning theory, is applicable to both 1) binary classification and 2) regression estimation (19). In previous work, we developed methods to train a support vector machine classifier to learn to predict protein-protein interactions using descriptors based on physicochemical properties of paired amino acid sequences (10,11). In the present application, we propose to exploit the support vector algorithm to solve a regression problem. The concept to be learned is the functional mapping between a set of ligand-receptor features and the total free binding energy of the complex. The basic idea in support vector regression (SVR) is to map a set of input patterns X ‫؍‬ {x 1 , x 2 , . . . , x l } ʦ R n onto a high-dimensional feature space F via a nonlinear mapping ⌽:R n 3 R D (D Ͼ Ͼ n), and then perform linear regression in F. Each pattern vector x i has a matching target value y i ʦ R. The goal is to find a function y ϭ f(x) representing the real-valued pairs {z i ͉ z i ϭ (x i , y i ), i ʦ 1, . . . , l} within a certain acceptable maximum deviation level ⑀ (12). Practical implementation issues with SVR are presented in Refs. 12 and 14, and theory and algorithms for extension to regression estimation with noisy data appear in Ref. 20.

Feature Representation
Each ligand-receptor complex was transformed into a vector of numerical features presumed salient for learning the target concept. Receptor and ligand feature vectors constructed as outlined in this section are concatenated and labeled with the value of their total free binding energy. These vectors are subjected to support vector machine regression training and cross-validation testing to evaluate how keenly the system learned the concept as posed.
Receptor-Receptor protein features were generated as described previously (10) considering tabulated physicochemical properties (charge, hydrophobicity, and surface tension) of the amino acid sequence thought to be prototypical of binding characteristics of the receptor. Each residue in sequence was replaced by floating point numbers with values corresponding to these physical properties. This vector of numbers was then mapped onto a fixed-length interval to provide a basis for comparison between receptor proteins of varying sequence length.
Ligand-Exemplars for the ligand component of each molecular complex required a novel approach. The design ethos followed here dictates beginning with a minimal, elemental group of features to develop intuition regarding the feature space. In accordance with this approach, the two-dimensional molecular connection matrix described under "Database of Ligand-Receptor Objects" was supplemented by additional arrays, each of which contained numerical values for fundamental, measurable chemical properties characterizing the atoms comprising the molecule. These properties included the atomic ionization potential energy, which represents the energy necessary to remove the outermost electron from the ground state of a neutral atom, and the electron affinity, which is a measure of energy change upon adding an electron to a neutral atom (21). Ionization energies are always positively valued, while electron affinities may assume either positive or negative numerical values.
For each small molecule ligand, three two-dimensional arrays representing molecular topology, electronic structure, and chemical behavior of the component elements were concatenated into a single, wide matrix. The resulting aggregate data matrix was then factorized using the singular value decomposition (22). The singular values computed in this factorization are extracted, representing a projection onto one-dimensional space of the essential characteristics of molecular bond topology and, it is hypothesized, the spatial distribution of molecular properties important for binding with a receptor.
Burden (15) introduced the idea of computing the eigenvalues of a hydrogen-suppressed molecular bond graph with atomic number on the diagonal and numbers indicating bond presence and type at off diagonal positions. This matrix was used as a means to group substructures for chemical similarity search. In that work, it was maintained that the smallest eigenvalue embodied information on all molecules and therefore was sufficient as a topological descriptor. Here all singular values are retained, regardless of their relative magnitudes, as discarding the entire set is not justifiable. This vector is finally stretched (or compressed) onto a fixed-length interval as was performed for the receptor features.

IMPLEMENTATION
Learning Concept-The concept to be learned is the function y ϭ f(x) that maps ligand-protein feature vectors x to the corresponding free energy of binding y. How well the SVR machine learns this concept will be quantified using the statistics described under "Evaluation of Machine Learning" collected from observations of the cross-validation protocol as described under "Cross-validation Experiments." Evaluation of Machine Learning-One measure of effectiveness for regression estimation is the normalized mean squared error (nmse), given by where N is the number of target points predicted, 2 is the actual sample variance, and y k and ŷ k are the actual and estimated target values of the kth data point, respectively (23). Because nmse is normalized by the sample variance, it may be used to compare different regression studies on a more equitable basis than would be possible using the conventional root mean square error; intuitively, a given prediction experiment is less challenging where the variance in the data is small. Notice that if we replace the prediction terms ŷ k with the arithmetic mean y in Equation 6, the value of the statistic is 1. This trivial case results when the predictor simply outputs the mean value of the data. Low values of nmse indicate good overall predictive acuity. Pointwise predictions of ligand binding may be evaluated using the normalized mean absolute error (nmae), defined by nmae ϭ This statistic is normalized by the sample variance for the same reasons as were cited for nmse above. Furthermore, its value may be interpreted as the number of standard deviations, on average, that predictions differ from the target values across the test set. The lower the value of nmae, the better the system pointwise predictive ability.
In some ligand screening situations (such as virtual screening, Ref. 3), predicting the relative ranking of binding strengths among a set of ligand-receptor pairs may be desired. The output of such an analysis would be a list of predicted binding energies sorted according to predicted magnitudes ⌬Ĝ 0 . In such cases a measurement of nonparametric or rank correlation, such as represented by Kendall's coefficient (24), is informative. In cross-validation, given an ordered array of N "(actual, predicted)" values (y 1 , ŷ 1 ), . . ., (y N , ŷ N ), we systematically compare the numerical signs of individual bivariate pairs X ϭ (y i , ŷ i ) and Y ϭ (y j , ŷ j ) for i ϭ 1, . . . , N, j ϭ (i ϩ 1), . . . , N.
If either (a) y i Ͼ y j and ŷ i Ͼ ŷ j or (b) y i Ͻ y j and ŷ i Ͻ ŷ j is observed, X and Y are said to be "concordant." Otherwise the points are "discordant." Kendall's expresses the tendency of two ordered lists y and ŷ to coordinately increase or decrease and is computed as where N C is the total number of concordant pairs, N D is the number of discordant pairs, and T X , T Y are counts of the "ties" found in X and Y pairs, respectively. A large positive (negative) value of indicates that the rank ordered values within {y} and {ŷ } are positively (negatively) correlated.
Cross-validation Experiments-To estimate the generalization error of the trained support vector regression system, we averaged the results of 10 separate 10-fold cross-validation experiments. In k-fold cross-validation, k random, equalsized, disjoint partitions (folds) of the example data are constructed, and an "inducer" (here, an SVR engine) is trained on (k Ϫ 1) folds with the excluded fold being used to test the trained system performance. After k such experiments, the results are averaged, and the observed error rate may be taken as an estimate of the error rate expected upon generalization to new data (25). To reduce further the effects of chance in randomly sampling the data, we averaged the results of 10 different 10-fold cross-validation experiments, performing 100 different training/testing procedures. The results we present are cross-validation averages for the statistics nmse, nmae, and as described under "Evaluation of Machine Learning." The total sample used in these experiments comprised 2,671 distinct ligand-receptor complexes. The output of the trained system is a predicted level of binding free energy y (in kcal/mol) given a set of features abstracted from a given input complex x. A qualitative glimpse of typical results from one complete 10-fold cross-validation test is offered in Fig. 1, which shows a scatter plot of actual versus predicted binding energy. Fig. 1 shows that some degree of correlation between prediction and truth exists. This correlation will be examined on an objective basis in the discussion of "Cross-validation Results."

CROSS-VALIDATION RESULTS
The principal results obtained in this investigation are summarized in Table I and in Fig. 1. Table I compares Table I are derived from scoring functions or simulations predicated upon knowledge of the three-dimensional structure of receptor and ligand complex. The columns in Table I comprise test sample  The records in the table are listed in order of increasing nmse. This statistic is proposed as the primary objective indicator of accuracy for direct prediction of binding free energy.

DISCUSSION
Of particular note on consideration of Table I are the sample size and mean free binding energies characterizing the ligandreceptor data used here when contrasted to the other investigations. The current sample size (n ϭ 2,671) is a factor 42 times larger than the next largest data set. The mean free binding energy is seen to be Ϫ38 kcal/mol, significantly stronger than the other data summarized in the Table I. Moreover, it can be seen that the present data set is highly variable as the standard deviation (35 kcal/mol) is on the same order as the mean.
Recall from the previous discussion that nmse values on the order of 1 are tantamount to trivial prediction of the mean value of a test data set. Lower values of nmse are associated with genuine learning of underlying patterns in the data and effective generalization. On this basis, the highest predictive accuracy (entry 1, nmse ϭ 0.198) observed in this comparative study was realized by Head and co-workers (26), who present a hybrid approach combining ligand-receptor threedimensional structural information and parameters derived from molecular mechanics. The test set comprised 14 ligandreceptor complexes.
The second best nmse in this group was achieved by Bö hm (8) using a regression-based empirical scoring function based on hydrogen bonds, electrostatics, complementary surface areas, and other characteristics of receptor-ligand pairs where the three-dimensional structure has been previously determined.
Next in our list of prediction results is the investigation reported in Wang et al. (4). Their approach uses another empirical scoring function for binding free energy that explicitly accounts for contributions due to Van der Waals interactions, metal-ligand bonding, hydrogen bonds, desolvation energies, and different kinematic effects. A regression equation is developed using these terms derived from known receptorligand complexes. All 11 data points in the test sample were based on endothiapepsin receptor complexes.
The current method, based on support vector regression, obtained the fourth best prediction error (nmse ϭ 0.419) averaged over 10 different 10-fold cross-validation tests. We suggest that this error rate represents a significant step for the following reasons.
1. The error rate and rank correlation value are surprisingly competitive with other investigations in light of the relatively large variance and extremely large sample size of the underlying data set. Note that the fifth lowest nmse value in Table I was also obtained by Head et al. (26) for a different data set than they used in entry 1.
Group 5 comprised 13 HIV-1 protease-HIV protease inhibitor complexes and showed a value of nmse ϭ 0.440. So the same methodology by the same research group, applied on a different data set, realized much different predictive results. This demonstrates the variability in results that are possible when using small sample sizes while providing confidence in the robustness of our current method and results, which were based on a sample size n ϭ 2,671. 2. The features used to represent the ligand-protein complexes in the support vector regression do not require any information about three-dimensional structure. All that is required as input data are the amino acid sequence of the receptor, a connection table representing the ligand structure in two dimensions, and the atom characteristics at the nodes of this connection table. 3. There is no limitation on the protein family membership of the putative receptor(s), on the type (organic or synthetic), or on the size of ligand used. 4. The results obtained in this study suggest that it may be possible to infer binding energies for complexes involving newly sequenced or difficult-to-crystallize proteins or for ligands that only exist in computer memory, awaiting synthesis upon successful in silico screening.
Rank Correlation-We draw the reader's attention to the trend in Kendall's rank correlation statistic in Table I. It is apparent that there is a general inverse correlation between the magnitude of binding energy prediction errors (nmse and nmae) and the value of . That is, low values of prediction error are associated with high values of the correlation coefficient.
measures the tendency of two ordinal random variables (here actual and predicted binding energy rank) to increase or decrease coordinately. If direct prediction of the physical binding energy is reasonably accurate, we would expect to see a positive and non-trivial correlation between the corresponding rank-ordered variables.
Computing biomolecular binding energies to higher accuracy remains a challenging problem (6). One author recently noted that current computational docking simulations, used to search for the best (lowest energy) "fit" of ligand into a target receptor cavity, still "suffer from insufficient precision of the scoring functions" (5). In Ref. 9, molecular dynamics simulations focused on biotin binding to avidin and streptavidin indicated that the energies of protein and ligand reorganization were found to be significant contributors to protein-ligand binding free energy in molecular dynamic simulations. These reorganization energies were estimated to be on the order of 10 -30 and 4.5-6 kcal/mol for protein and ligand, respectively. Because of the large variance in protein reorganization energy, the authors concluded that precise predictions of binding free energy were suspect.
Given these difficulties, the ability to reliably rank a set of ligand-receptor complexes during lead optimization (versus directly computing binding energy) remains important in the area of drug discovery. Such a procedure may add value, for example, as a decision aid when down-selecting a set of ligands for chemical synthesis. In connection with the current methodology, we recognize that training the SVR requires example data representing estimated or measured values of binding free energy. The output of a computational technique cannot exceed the accuracy of its input; this is especially true with systems that learn from examples. Therefore, at present the qualitative analysis or ranking of potential ligands may be the main utility of the SVR technique.
The prediction evaluation statistics appearing in Table I are presented in the form of a bar chart in Fig. 2. The investigations numbered along the horizontal axis appear in order of increasing nmse and correspond to the numbering in Table I. This visualization provides a different perspective on the opposing trends of nmse, nmae, and as discussed above.
Conclusions-In this work, we have introduced a new methodology, showing that it is possible to predict the binding free energy between ligand and receptor without direct information about their three-dimensional structures.
In cross-validation experiments, we have demonstrated that objective measurements of prediction error rate and rankordering statistics are competitive with several other investigations, most of which depend on three-dimensional structural data. The size of the sample used (n ϭ 2,671) indicates that this approach is robust and may have widespread applicability beyond restricted families of receptor types. Newly sequenced proteins, or those for which three-dimensional crystal structures are not easily obtained, can be rapidly analyzed for their binding potential against a library of ligands using this methodology.  Table I. Notice the general trend of inverse correlation between binding energy prediction errors (nmse and nmae) and rank correlation (). The present cross-validation results are represented as Investigation no. 4.