|
|
||||||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Molecular & Cellular Proteomics 4:752-761, 2005.
© 2005 by The American Society for Biochemistry and Molecular Biology, Inc.


,¶
From the
Department of Molecular Biology, The Scripps Research Institute and
Molsoft, LLC, La Jolla, California 92037
| ABSTRACT |
|---|
|
|
|---|
There are three independent sources of information that can be used to infer the location of possible ligand binding sites on the surface of a protein: (i) protein structure, (ii) evolutionary information (sequence alignments), and (iii) ligand/substrate information. A number of sophisticated algorithms using evolutionary information or algorithms predicting locations of binding sites for specific substrates have been published (13). Here we attempted to develop an algorithm that is based solely on the protein structure and without any prior knowledge about the nature of the substrate. We hypothesized that the structure itself is sufficiently informative, whereas the evolutionary conservation and the nature of the ligand can only be used as optional contributions.
Proteins are involved in several kinds of molecular interactions: with other proteins, DNA, RNA, peptides, and small molecules. In this study we present an algorithm to predict the binding envelopes near potential small ligand binding sites or areas that could be targeted with small "druglike" compounds. Once the ligand binding pocket is predicted, a high throughput ligand docking procedure or structure-based drug design (49) can be used to generate a list of the lead molecules. The properties of druglike molecules are well studied (10, 11) and cover a certain range of sizes, typically with molecular mass between 300 and 700 daltons. Therefore, we excluded from consideration very small ligands, such as metals and small solvent molecules, as well as very large substrates. However we wanted to develop an algorithm that within this size range does not depend on the nature of the ligand.
A number of structure-based pocket prediction algorithms have been published over the last 10 years. They can be divided into two general classes: (i) geometric algorithms and (ii) probe mapping/docking algorithms. Geometric approaches analyze protein surfaces to find clefts. SURFNET (12) detects the gap regions in proteins by fitting spheres into the spaces between protein atoms. The sphere fitting process results in a number of separate groups of interpenetrating spheres, which correspond to the cavities and clefts of the protein. LIGSITE (13), an improved version of POCKET (14), identifies clefts by putting the protein in a regular Cartesian grid and scanning along the x, y, and z axes and the cubic diagonals for areas that are enclosed on both sides by protein. APROPOS (15) and CAST (16) are based on the
-shape algorithm, which identifies pockets by comparing surfaces of the protein generated with different levels of detail. PASS (17) identifies the "active site points" by coating the protein surface with a layer of spherical probes and then filtering out those that clash with the protein or are not sufficiently buried. In addition to those pure geometrical methods, some approaches based on mapping/docking and scoring of molecular fragments have been proposed (1823). Two excellent recent reviews of computational tools for identification of small molecule binding sites in proteins give a good overview of the field (1, 24).
Pure geometric methods are relatively straightforward, but there is no direct physical meaning behind them. On the other hand, methods using molecular fragment mapping and ligand docking are better physically justified but computationally expensive and cannot always provide a good discrimination between correct and incorrect sites. Also the previously published methods were typically tested on relatively small data sets. Although APROPS used a relatively large test set of about 300 structures, others only used 1050 selected test cases. This is despite the fact that almost 30,000 x-ray structures have been deposited in the Protein Data Bank (25, 26). Finally, because the ultimate goal of the binding site prediction methods is to find active sites on uncharacterized structures, it is important to test and validate the algorithms on large sets of the "unbound" or apo structures. Only the PASS algorithm was tested on a data set of 21 apo structures; the other publications did not test the effect of induced conformational changes on the prediction accuracy. A benchmark test based on a large, systematic data set of apo structures is necessary for evaluating protein-ligand binding site identification methods.
In this study, we present and validate a novel algorithm for prediction of ligand binding pockets. The algorithm called PocketFinder is based on a transformation of the Lennard-Jones potential. Like pure geometric approaches, PocketFinder is fast and capable of identifying clefts and cavities regardless of the nature of the substrate while being more sensitive and specific. Furthermore, in contrast to other methods, the PocketFinder algorithm not only detects the location of the binding pocket but also predicts envelopes representing the shape and size of putative ligand binding volume. The method was tested on a systematically collected data set 2 orders of magnitude larger than previous benchmarks: 5,616 binding sites collected from ligand-protein complexes and 11,510 apo binding sites inferred from the complexes by homology. All small molecule binding envelopes from the human structural proteome were collected and clustered into a pocketome. The predicted binding envelopes were hierarchically clustered. The complete pocketome may be useful for understanding a complex network of interactions between small molecules and the cell proteins.
| MATERIALS AND METHODS |
|---|
|
|
|---|
Step 1
The first step was to create the grid potential map of the van der Waals force field using a probe atom (parameters for an aliphatic carbon were used) in orthogonal parallelepiped surrounding receptor atoms (27). The grid had 1.0-Å spacing, and a margin of 1.0 Å beyond the dimensions of the protein was added. The potential was calculated according to the Lennard-Jones formula,
![]() |
where rpl is the distance between the probe p placed at a grid node and the protein atom Xl. Parameters AXC and BXC are taken from the Empirical Conformational Energy Program for Peptides (ECEPP)/3 molecular mechanics force field. Pp0 values were further truncated into the min(Pp0), 0.8 range to retain only the attractive regions.
Step 2
The second step was to smooth (space average) the potential map to emphasize the regions with the van der Waals potential consistently low across significant space span and to avoid excessive density fragmentation. Smoothing was performed by an iterative averaging of the potential Pijkn on each grid node (i,j,k) with the adjacent nodes.
![]() |
Iterative application of this transformation closely approximates a (more computationally expensive) convolution with an averaging function.
![]() |
Ten iterations were performed, corresponding to the smoothing length parameter (radius)
of
2.6 Å.
Step 3
The third step was to create putative ligand envelopes by contouring the resulting map at a level Pcont calculated as follows,
![]() |
where the threshold parameter
= 4.6 was established on the training data set, mean(Pijk) and r.m.s.d.(Pijk) are an average and a root mean square difference of all map potential values Pijk.
Step 4
The fourth step was to sort the created envelopes by their volumes and filter out those smaller than 100 Å3. All parameters for the map transformations were optimized using a large, diverse set of binding sites. The algorithm was coded with the ICM1 scripting language (28, 29).
| RESULTS |
|---|
|
|
|---|
Compiling an Exhaustive Benchmark of Ligand Binding Sites
We started from the October 30, 2003 release of the Protein Data Bank that contained 17,730 crystallographic structures of proteins. Then we collected all structures that are complexed with heteromolecules (tagged by HET keywords in the Protein Data Bank) to compile a data set of observed binding sites from protein-ligand complexes (liganded-pocket set (LP-Set)). This resulted in 13,431 Protein Data Bank entries and 3,561 unique heteromolecules. Several filters were used to produce the liganded protein-ligand binding site data set. First, we introduced a ligand size and frequency filter. Heteromolecules containing fewer than seven heavy atoms were excluded in this research. This excluded metals and popular crystallization buffer components. To avoid a benchmark bias toward the most frequent ligands, we excluded high frequency cofactors or substrates, such as hemes, N-acetyl-D-glucosamine,
-D-mannose, certain sugars, etc. This filter reduced the number of entries to 7,275 (about 54% of all complexes). Second, a receptor quality filter was used. Only entries with resolution better than 2.5Å were retained. Proteins that consist of fewer than 50 or more than 2,000 residues were also excluded. This filter reduced the number of entries to 5,736. We applied other filters that did not reduce the size of the data set significantly but cleaned up the data as follows: (i) heteromolecules that are far away from the receptors (no atoms of the heteromolecules were within 3.5 Å from the receptor) were removed; (ii) heteromolecules that contact the symmetric parts of the receptor (within 3.5 Å in distance) were removed because their binding sites are formed between the asymmetric units, and building a correct model requires biological information; (iii) ion clusters were removed; and (iv) duplicate combinations of Protein Data Bank entries and ligands were removed. The final data set consisted of 5,616 protein-ligand binding sites (LP-Set). That is the combination of 4,711 Protein Data Bank entries with 2,175 unique ligands.
Collecting Protein-Ligand Binding Sites from the Uncomplexed Structures
Because our main goal was to predict a potential binding envelope from an uncomplexed structure, it was critical to compile a benchmark of unliganded pocket sites (UP-Set). This additional data set helped us to validate the pocket prediction algorithm in a more realistic situation. Unliganded pockets may not be as obvious as the liganded pockets due to the ligand-induced conformational changes. Side chains often obstruct a part of the pocket in the absence of the ligand. The UP-Set was collected by superimposing the protein structure in the LP-Set onto its uncomplexed homologues and mapping the ligand to them. The proteins (apo structures) in the UP-Set had to meet the following criteria. (i) The sequence identity between apo structure and complexed protein is over 95%. (ii) The resolution of the structure is better than 2.5 Å. (iii) There are no mutations among the surface residues within 8 Å around the mapped ligand. (iv) There are no other ligands within 8 Å around the mapped ligand in the apo structure.
Only the single chain receptors in the LP-Set were used to find the apo structures. The only reason for using the single chain receptors was to simplify the process of identifying homologues and superimposing the structures. We preserved all alternative binding pockets in the UP-Set to avoid an arbitrary choice of one representative unliganded pocket. A total of 11,510 unliganded pockets projections were collected from the eligible Protein Data Bank entries. These mapped binding sites corresponded to 1,445 binding sites from LP-Set. As a result, the UP-Set contains on average about eight different mappings of the same binding site in LP-Set.
Representation of the Predicted Ligand Binding Pockets
We calculated transformed version of the three-dimensional Lennard-Jones potential on a 1-Å grid surrounding the entire protein surface (see "Materials and Methods") and then contoured at an optimized level to create pocket envelopes. The pocket envelope was represented by a triangulated surface (Fig. 1). We chose this potential because in contrast to purely geometrical methods it has a clear physical meaning, and at the same time it does not require any knowledge of the chemical nature of a ligand. Additionally the van der Waals component of the binding energy is present in complexes of various physical natures, including hydrophobic, charged, polar, or mixed complexes.
|
![]() |
where AL is the solvent-accessible area of the receptor atoms within 3.5 Å from a bound ligand, and AE is the solvent-accessible area of the receptor atoms within 3.5 Å from the predicted envelope. A completely failed prediction would have RO equal to zero, whereas a perfect prediction would have RO close to 1.0. However, this requirement would be too strict because obviously the same pocket may bind different ligands, and all these ligands may share a core site but extend in different directions. We define the threshold of successful prediction as RO = 0.5, i.e. at least 50% of the accessible area AL has to be overlapped by the predicted area (AE).
Validating the Pocket Prediction Algorithm
We first applied PocketFinder to the LP-Set to evaluate its performance on holo structures. The method successfully identified 96.8% of the 5,616 ligand binding sites as having a relative overlap RO > 0.5 between the observed and predicted sites (see the definition of the RO measure above). Furthermore 55.2% of binding sites were perfectly identified (RO = 1.0), and 85.7% of binding sites were identified with RO higher than 0.8, i.e. greater than 80.0% overlap (Fig. 2).
|
We studied the effect of different conformational rearrangements observed in the unliganded binding sites of the UP-Set upon the prediction success. Because the 11,510 binding sites of the UP-Set were inferred based on 1,445 binding sites in the LP-Set, we grouped them into 1,445 clusters and sorted the clusters by the RO of LP-Set. The distribution of the RO values of UP-Set for 1,445 clusters is shown in Fig. 3. From this distribution we observed the following. (i) The overwhelming majority of UP-Set had RO > 0.5. (ii) For the clusters with high RO of LP-Set (RO > 0.9), the relative overlap of the UP-Set decreased compared with the LP-Set, but the majority were still above 0.8. However, some cases had much larger deterioration of the relative overlap. (iii) For the clusters with lower RO of LP-Set (RO < 0.8), the picture was mixed. We observed better overlap for the UP-Set conformations than the LP-Set, implying some "opening" of unliganded pockets. However, for the majority of 1,445 clusters the unliganded pockets appear more difficult to predict than the liganded pockets. Overall we may conclude that although the unliganded conformations of receptors were not as good as the liganded conformations for the binding site identification, there was only a marginal deterioration of recognition performance.
|
|
|
1.4 times larger in volume than the ligand. Considering that most ligands should fit inside their pockets and the ligands bound to those receptors were just some representative examples of all possible ligands, this value seems reasonable. For the predicted binding patch, the average ratio to the whole surface of the receptor was 4.7%. Fig. 5C shows the distribution of the ratio of contact area to the whole surface area of the protein for bound ligands and predicted envelopes. This distribution shows that the size of the predicted binding patch was close to the real binding area. Considered together with the high RO values of the prediction, we can conclude that the prediction was accurate.
Sample of Predicted Envelopes
Unfortunately there are too many pockets in the benchmark to show each prediction graphically. However, to leave a visual impression of the results, we created an unbiased subset of the benchmark pockets using the following selection procedure. We first annotated the proteins in the LP-Set based on the Structural Classification of Proteins (SCOP) data base (32). They belonged to 10 of 11 classes in SCOP, showing high diversity of our data set. Structures from 261 folds, 347 superfamilies, and 589 families were represented. These 20 representatives in Fig. 6 were selected from different folds based on the following criteria: (i) the envelope size between 300 and 700 Å3, the range of sizes that was most populated, and (ii) the folds that had only one representative in LP-Set were chosen to avoid any subjective selection of representative proteins from folds. We then sorted these folds by their SCOP fold ID and chose the first 20 folds. For each protein, only the largest and the second largest (if it existed) envelopes were displayed. Fig. 6 demonstrates the accuracy and reliability of the envelope prediction. It is important to note that in most cases the predicted envelopes captured the ligands correctly with size and shape being close to the bound ligands. The complete data sets (LP-Set and UP-Set) and prediction results are available upon request.
|
|
| DISCUSSION AND CONCLUSIONS |
|---|
|
|
|---|
A small number of binding sites could not be identified by the algorithm. For 5,616 binding sites from the LP-Set, 66 cases were completely missed (no relative overlaps, RO = 0) by the identification. We visually examined all these cases and found that these binding sites were either very small or very shallow. All of them could still be identified by adjusting the parameters (contouring level and volume cutoff of the envelope) of the program. However, adjusting the parameters to detect small or shallow binding sites may result in higher number of false positives. Practically if the program always returns a long list of putative binding sites, the prediction is of little value. However, when applying the method to a specific protein, the "average" optimal parameters can be adjusted based on the additional biological and structural data such as the flexibility of B-factors and the feedback from the inspection of the prediction result. If no pocket can be found by using the default average parameters or if the pockets are likely to be small/shallow, for example in the case of peptide binding sites, one can try to adjust the parameters to find smaller or shallower pockets if there are compelling reasons to expect their existence.
In conclusion, of the 5,616 protein-ligand binding sites of complexes we tested, the PocketFinder method correctly identified 96.8% of the known binding sites. 85.7% of the binding sites showed coverage of the known contact area higher than 80.0%. For correctly identified binding sites, 80.9% were the largest envelopes, and 11.8% were the second largest. The average ratio of the predicted binding patch to the total surface area of the protein was 4.7%, implying that the prediction was quite specific and had a low false positive rate. The prediction rate for 11,510 binding sites from apo structures (UP-Set) was 95.0%, close to that of the LP-Set. The RO in general was somewhat lower than that of the LP-Set, but 67.0% of the binding sites still had 80% or better overlap.
The proposed pocket prediction algorithm can be used to identify possible binding site locations for orphan receptors or for uncharacterized secondary binding sites of known receptors. Furthermore ligand design can be directed by the predicted envelope. It can also be used to prioritize novel targets by the "druggability" of identified pockets. In addition, applying the algorithm to separated protein subunits and evaluating the strength of "pocket potential" at the interface patches may help evaluate the feasibility of protein-protein interaction inhibition. Fig. 8 shows some examples of successful protein-protein interaction inhibition by small molecules targeting the transcriptional regulator MDM2 and LFA-1. We found pockets in the protein-protein interfaces from the apo structures that overlapped the bound ligands.
|
| ACKNOWLEDGMENTS |
|---|
| FOOTNOTES |
|---|
Published, MCP Papers in Press, March 9, 2005, DOI 10.1074/mcp.M400159-MCP200
1 The abbreviations used are: ICM, internal coordinate mechanics; LP-Set, liganded pocket set collected from complexes; UP-Set, unliganded pocket set collected from apo structures; RO, relative overlap of predicted patch to the real binding patch; SCOP, Structural Classification of Proteins; MDM2, ubiquitin-protein ligase e3 mdm2; LFA-1, lymphocyte function associated antigen 1. ![]()
* This work was supported in part by Department of Energy Grant ER63042. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact. ![]()
¶ To whom correspondence should be addressed: Dept. of Molecular Biology, The Scripps Research Inst., TPC-28, La Jolla, CA 92037. Tel.: 858-784-8595; Fax: 858-784-8299; E-mail: abagyan{at}scripps.edu
| REFERENCES |
|---|
|
|
|---|
This article has been cited by other articles:
![]() |
M. Dong, P. C.-H. Lam, D. I. Pinon, P. M. Sexton, R. Abagyan, and L. J. Miller Spatial Approximation between Secretin Residue Five and the Third Extracellular Loop of Its Receptor Provides New Insight into the Molecular Basis of Natural Agonist Binding Mol. Pharmacol., August 1, 2008; 74(2): 413 - 422. [Abstract] [Full Text] [PDF] |
||||
![]() |
K. Segers, O. Sperandio, M. Sack, R. Fischer, M. A. Miteva, J. Rosing, G. A. F. Nicolaes, and B. O. Villoutreix Design of protein membrane interaction inhibitors by virtual ligand screening, proof of concept with the C2 domain of factor V PNAS, July 31, 2007; 104(31): 12697 - 12702. [Abstract] [Full Text] [PDF] |
||||
![]() |
W. H. Bisson, A. V. Cheltsov, N. Bruey-Sedano, B. Lin, J. Chen, N. Goldberger, L. T. May, A. Christopoulos, J. T. Dalton, P. M. Sexton, et al. Discovery of antiandrogen activity of nonsteroidal scaffolds of marketed drugs PNAS, July 17, 2007; 104(29): 11927 - 11932. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
| All ASBMB Journals | Journal of Biological Chemistry |
| Journal of Lipid Research | ASBMB Today |