An Effective Way to Apply AI to the Design of New Drug Lead Compounds

Cloud Pharmaceuticals share their experience using AI for novel drug discovery



Artificial Intelligence can greatly benefit the drug discovery field by lowering the high failure rate, reducing the high cost, and finding novel intellectual property. However, hype surrounds the use of Machine Learning (ML) and Artificial Intelligence (AI) in almost every field today, and when AI/ML is applied to drug design, many problems hinder progress. An exciting option is Augmented Intelligence: the combination of AI methods (such as big data and ML, used to enhance available information) with computational chemistry and other non-AI algorithms. Augmented Intelligence overcomes several problems encountered in common applications of AI and ML in drug discovery, as we illustrate in this article.



Artificial Intelligence can greatly benefit the drug discovery field by lowering the high failure rate, reducing the high cost, and finding novel intellectual property. However, the hype surrounding artificial intelligence (AI) can be found in almost every field today. It is perhaps most prevalent in healthcare, and in drug discovery specifically. Successful applications of AI in image and text analysis have been shown in oncology (such as low-dose computed tomography (CT) screening) and in matching patients to the best clinical trials. A recent example is the FDA approval of IDx-DR, an AI-based tool that detects diabetic retinopathy from medical images. Applications in genome analysis, combined with literature text mining, are changing how we discover pharmaceutical targets.


The High Dimensionality Problem of Drug Design

Chemical space is vast and dimensionally complex, and in the rush to deploy AI in drug discovery we should not overlook the obvious. Small molecules with molecular weight below 850 (i.e. molecules that can become drugs) span a chemical space estimated at 10^65 molecules.(1) All previously discovered drugs, as well as novel ones, exist within this vast space. The question is how to locate the right molecule. Because of its size, enumerating this space is not possible. Currently the largest chemical space to be enumerated is GDB-13, which includes ~977.5 million small organic molecules containing up to 13 atoms of C, N, O, S and Cl. The smaller “known” chemical space of 230 million compounds (such as the ZINC database) includes all molecules that have ever been synthesized. However, these molecules are not novel. To obtain patentable molecules, novel chemical space is a functional requirement. This novel chemical space lies outside “known” chemical space but within the larger chemical space previously described.
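To make the scale concrete, the following back-of-envelope sketch uses only the three figures quoted above. The enumeration rate of one billion molecules per second is a hypothetical number chosen purely for illustration, as are the function names.

```python
# Figures quoted in the text.
DRUG_LIKE_SPACE = 10**65      # estimated size of drug-like chemical space (ref. 1)
GDB_13_SIZE = 977_500_000     # largest enumerated space quoted above
KNOWN_SPACE = 230_000_000     # "known" chemical space quoted above

# Hypothetical enumeration throughput, assumed only for illustration.
MOLECULES_PER_SECOND = 10**9
SECONDS_PER_YEAR = 365 * 24 * 3600


def fraction_covered(enumerated: int, total: int = DRUG_LIKE_SPACE) -> float:
    """Fraction of the estimated space that an enumerated library covers."""
    return enumerated / total


def years_to_enumerate(total: int = DRUG_LIKE_SPACE,
                       rate: int = MOLECULES_PER_SECOND) -> float:
    """Years needed to visit every molecule once at the assumed rate."""
    return total / (rate * SECONDS_PER_YEAR)


if __name__ == "__main__":
    print(f"GDB-13 covers    {fraction_covered(GDB_13_SIZE):.3e} of the space")
    print(f"Known space is   {fraction_covered(KNOWN_SPACE):.3e} of the space")
    print(f"Full enumeration {years_to_enumerate():.3e} years at the assumed rate")
```

Even under this generous assumption, both enumerated libraries cover a vanishingly small fraction of the estimated space, which is why a guided search is needed rather than exhaustive enumeration.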

Unlike text and image analysis, where data sets are abundant, available data sets in chemistry are sparse, small or niche. AI methods such as Machine or Deep Learning (ML or DL) are suited to data sets that are dense, with predictions made within “inlier space”. Most common ML algorithms learn a function from input to output based on clean, curated Big Data. ML works well for image analysis because hundreds of millions of pictures have been tagged. A recent publication in Nature Medicine highlights the size of the databases that are needed: Deep Convolutional Neural Networks were used to identify specific genetic disorders purely from facial images; the database included 500,000 images from 10,000 subjects.(2) This is not the case in drug design, where chemical synthesis and measurements are comparatively expensive. For example, one of the largest published datasets covers the binding of inhibitors to human β-secretase 1 (BACE-1): a collection of 1522 compounds from multiple literature sources.(3) In chemistry, 1522 molecules is considered a very large set, but it is a paltry set for training in ML. Another problem with this dataset (and others) is the quality of the data: it originates from multiple laboratories, using varied experimental settings and spanning several decades.


Molecules are 3D objects

Even after obtaining a sufficiently large molecular data set, there remains the issue of how to represent molecules for ML algorithms. Most successful examples of ML work on vectors or matrices; molecules, however, are 3D objects, and there are multiple ways to vectorize them. Previously, most vectorization schemes were 2D (such as Fingerprints or Graph Convolutional Networks), which lose information. Recently, several 3D vectorization schemes have gained significant traction. These include the Coulomb Matrix and Symmetry Functions, as well as the Grid Featurizer, which can be used to vectorize interactions between two 3D objects (such as a protein binding pocket and a bound ligand).
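As an illustration of 3D vectorization, here is a minimal sketch of the Coulomb Matrix in its commonly published form: the diagonal holds 0.5 * Z^2.4 for each atom and the off-diagonal holds Z_i * Z_j divided by the interatomic distance. This is not necessarily the variant used in any particular platform. The function names, the choice of flattening the upper triangle, and the H2 geometry (bond length 1.4, taken to be in atomic units) are assumptions for the example.

```python
import math
from typing import List, Sequence


def coulomb_matrix(charges: Sequence[float],
                   coords: Sequence[Sequence[float]]) -> List[List[float]]:
    """Build the Coulomb Matrix for one molecular geometry.

    Diagonal:      0.5 * Z_i ** 2.4
    Off-diagonal:  Z_i * Z_j / |R_i - R_j|
    """
    if len(charges) != len(coords):
        raise ValueError("need one coordinate triple per atom")
    n = len(charges)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                distance = math.dist(coords[i], coords[j])
                matrix[i][j] = charges[i] * charges[j] / distance
    return matrix


def upper_triangle(matrix: List[List[float]]) -> List[float]:
    """Flatten the symmetric matrix into a feature vector, row by row."""
    size = len(matrix)
    return [matrix[i][j] for i in range(size) for j in range(i, size)]


# Illustrative input only: H2 with an assumed bond length of 1.4.
h2 = coulomb_matrix([1, 1], [(0.0, 0.0, 0.0), (0.0, 0.0, 1.4)])
features = upper_triangle(h2)
```

Because the matrix is built from interatomic distances, it keeps the 3D geometry that a 2D fingerprint discards. In this raw form it depends on the order in which atoms are listed, so practical pipelines usually sort or otherwise canonicalize it before training.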


Augmented Intelligence for Drug Design

What the two preceding sections imply is that the decision of how and where to apply AI techniques to drug design is strategic. Because data sets are limited, the AI methods chosen for drug design must not discard known information, but rather leverage it maximally.

Consequently, we believe the best application of AI to drug discovery is an offshoot of the IBM-coined term “augmented intelligence”: the application of AI methods, utilizing big data and machine learning, to enhance computational chemistry and other non-AI algorithms and information (Figure 1). Augmented Intelligence combines the best of both AI and computational chemistry tools, and especially shines in cases of limited or inconsistent data. The field of AI/ML provides many useful tools, which can be combined with other tools to build a drug-design platform. We model problems with the best tools available, and fill in the gaps (those areas where modeling is insufficient) with one or more AI techniques.

Cloud Pharmaceuticals’ Quantum Molecular Design (QMD) workflow is a combination of such tools. It is a multi-algorithm, augmented-intelligence-based drug discovery platform in which drugs are designed from a large, targeted chemical space, based on binding affinity and desirable drug properties for a specific protein structure. QMD has three major steps. First, based on the protein target, Bayesian reasoning algorithms are used to build a huge chemical space (100 million molecules) and illuminate “hot spots” within it. Second, a multi-objective optimization heuristic search is run over the pre-identified chemical space to identify target molecules. Lastly, the identified molecular hit list is machine-evaluated with a medicinal chemistry expert system. The result is a short, highly focused hit list of completely novel molecules that are synthesizable and have strong drug-like properties, ready for synthesis and evaluation. We have previously described QMD in detail,(4) and here we highlight the use of specific augmented and artificial intelligence tools during workflow deployment (Figure 2). The decision on where to use each method is based on the available data, the suitability of a specific algorithm, and the level of accuracy needed. The combination of these tools contributes to the successful deployment of QMD: in multiple projects, small-molecule inhibitors were designed and synthesized, and their activity was successfully measured.
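The multi-objective idea in the second step can be illustrated with a generic Pareto filter, which keeps every candidate that no other candidate beats on all objectives at once. This is a textbook sketch and not QMD's actual search algorithm. The molecule names, the two objectives (a binding energy and a toxicity score, both treated as lower-is-better) and all numbers are hypothetical.

```python
from typing import Dict, List, Tuple

Scores = Tuple[float, ...]  # one value per objective; lower is better


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(candidates: Dict[str, Scores]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    front = []
    for name, scores in candidates.items():
        beaten = any(dominates(other_scores, scores)
                     for other, other_scores in candidates.items()
                     if other != name)
        if not beaten:
            front.append(name)
    return front


# Hypothetical scores: (binding energy, toxicity score), made up for illustration.
example = {
    "mol_a": (-9.0, 0.2),
    "mol_b": (-8.0, 0.1),
    "mol_c": (-7.0, 0.3),   # worse than mol_a on both objectives
    "mol_d": (-9.5, 0.6),
}
hits = pareto_front(example)
```

Here `mol_c` is dropped because `mol_a` binds more strongly and is less toxic, while the other three survive as different trade-offs. A heuristic search would apply a test of this kind repeatedly while exploring the chemical space, rather than scoring every molecule up front.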


Successful Applications of Augmented Intelligence in Drug Design

A good drug candidate requires the optimization of many drug-like properties: toxicity, solubility, ease of synthesis, and more. The available datasets for some of these properties are much more suitable for ML than those for the traditional goal of modeling binding affinity. For example, one of the first places where ML can be demonstrated in drug design is in predicting the toxic effects of a targeted drug. The EPA has a large set of measurements of known molecules and their toxic interactions with human proteins which can be used for ML.(5) The Tox21 set has ~7800 molecules and ToxCast has about 1,000 more. All measurements are made using the same apparatus, ensuring a high degree of consistency. This is a comparatively large and clean dataset, and a canonical classification problem (active vs. not active). Using this set, Cloud Pharmaceuticals (as well as others) have trained a Deep Neural Network (DNN) and are using it to accurately predict the toxicity of novel molecules.

Another approach to handling data scarcity is to create an ML-suitable dataset internally. Cloud Pharmaceuticals has deployed ML to predict the solvation energy of different conformers (Figure 3). During Cloud Pharmaceuticals’ QMD workflow, the solvation energy of each tested molecule is calculated as the Boltzmann average of the solvation energies of multiple conformers. To achieve the high accuracy of QMD, the solvation energy of each conformer is calculated using a lengthy QM/MM calculation.(6) In this project, an internally curated database of previously calculated solvation energies was used to train a DNN, where each conformer was vectorized, retaining its 3D information. The outcome of the project was an ML predictor of solvation energy, which resulted in a 45% reduction in the computational cost of deploying quantum chemistry to design novel drugs.
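For readers unfamiliar with Boltzmann averaging, a minimal sketch follows. It assumes that each conformer's weight comes from its relative energy at a given temperature, and that all energies are in kcal/mol. The text does not specify which energies QMD uses for the weights, so that choice, the function names and the default temperature are assumptions.

```python
import math
from typing import List, Sequence

# Gas constant in kcal/(mol*K), i.e. the Boltzmann constant per mole.
R_KCAL_PER_MOL_K = 0.0019872041


def boltzmann_weights(energies: Sequence[float],
                      temperature: float = 298.15) -> List[float]:
    """Normalized Boltzmann weights for conformer energies in kcal/mol."""
    kt = R_KCAL_PER_MOL_K * temperature
    lowest = min(energies)  # shift by the minimum for numerical stability
    factors = [math.exp(-(e - lowest) / kt) for e in energies]
    total = sum(factors)
    return [f / total for f in factors]


def boltzmann_average(values: Sequence[float],
                      energies: Sequence[float],
                      temperature: float = 298.15) -> float:
    """Boltzmann-weighted average of a per-conformer property."""
    if len(values) != len(energies):
        raise ValueError("need one energy per conformer value")
    weights = boltzmann_weights(energies, temperature)
    return sum(w * v for w, v in zip(weights, values))
```

With equal conformer energies the result reduces to a plain mean; as the energy gap grows, the lowest-energy conformer dominates. Since every conformer needs its own solvation energy, replacing the per-conformer calculation with a fast predictor reduces the cost of the whole average.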



The last two examples from Cloud Pharmaceuticals’ workflow represent a trend: you cannot simply use AI algorithms to predict chemical properties from existing datasets without relevant domain expertise. AI/ML tools are an easily available “commodity”; TensorFlow, AWS ML, Torch, and Keras are a few prominent examples. What drug design and discovery require is expertise in chemistry and biology to curate and analyze the dataset. Understanding the questions being asked, and their relevance to the available data, is of crucial importance. It is critical to avoid discarding or ignoring information from prior analysis and modeling. The dimensionality of drug space is simply too big, even when big data is available.

Computational drug design is finally achieving industry recognition and moving towards acceptance, as indicated by numerous high-profile deals in 2018. However, exact methods and applications are still the subject of active research. Here we have shown the methods Cloud Pharmaceuticals has successfully employed on numerous targets. Specifically, the use of Augmented Intelligence, instead of black-box AI, provides solutions to the challenges of traditional drug design. We have demonstrated this by leveraging a mixture of ML and other computational tools in drug design to fully utilize prior data and domain expertise.



Figure 1: Augmented Intelligence is the combination of data processing tools from multiple sources, including AI, human intuition and knowledge and traditional computational chemistry tools.


Figure 2: Parts of Cloud Pharmaceuticals’ Augmented Intelligence arsenal include heuristic search, Bayesian statistics and Deep Neural Networks on Big Data.


Figure 3: 3D vectorization of molecules using Coulomb Matrix & Deep Tensor Neural Networks for predicting solvation energy. Dataset produced and curated by Cloud Pharmaceuticals.



Shahar Keinan is the Chief Scientific Officer and cofounder of Cloud Pharmaceuticals. Her research interests include in-silico drug design and discovery, molecular materials design, and computational methods development. Dr. Keinan received a PhD in chemistry from the Hebrew University of Jerusalem. With 46 published research articles, she has been cited over 2,800 times and has an h-index of 22. Contact her at

William J. Shipman is the Chief Technology Officer at Cloud Pharmaceuticals. His research interests include AI, cloud computing and NP hard problems. Shipman received an MSc in computer science and information systems from the University of North Carolina at Wilmington. Contact him at

Ed Addison is the CEO of Cloud Pharmaceuticals, a therapeutics company focused on cloud-based drug design and development that he co-founded in 2009. Mr. Addison is a serial entrepreneur who has founded three previous ventures, two of which successfully merged with public companies in deals worth $55 million. He has a unique and strong blend of in-depth business and technical experience in biotechnology and in information technology.



  1. Reymond J-L. The Chemical Space Project. Acc Chem Res. 2015;48(3):722-30.
  2. Gurovich Y, Hanani Y, Bar O, Nadav G, Fleischer N, Gelbman D, et al. Identifying facial phenotypes of genetic disorders using deep learning. Nat Med. 2019;25(1):60-4.
  3. Subramanian G, Ramsundar B, Pande V, Denny RA. Computational Modeling of β-Secretase 1 (BACE-1) Inhibitors Using Ligand Based Approaches. J Chem Inf Model. 2016;56(10):1936-49.
  4. Keinan S, Frush EH, Shipman WJ. Leveraging Cloud Computing for In-Silico Drug Design Using the Quantum Molecular Design (QMD) Framework. Comp Sci Eng. 2018;20(4):66-73.
  5. TOX21 dataset. Available from: .
  6. Frush EH, Sekharan S, Keinan S. In Silico Prediction of Ligand Binding Energies in Multiple Therapeutic Targets and Diverse Ligand Sets—A Case Study on BACE1, TYK2, HSP90, and PERK Proteins. J Phys Chem B. 2017;121(34):8142-8.