Chemical space exploration: How AI changes the game

Flavie Prévost
11 min read · Oct 14, 2021
https://www.istockphoto.com/photos/chemical-reaction-green

Albert Einstein once said: “Two things are infinite, the universe and human stupidity. And I’m not so sure about the universe.” Albert might have been onto something. While we still don’t know the size of interstellar space, we do know of a space so broad it seems infinite: the chemical space.

The chemical space is a relatively new concept at the intersection of medicine and AI. In this article, I will go into more depth on both to explain how AI can accelerate drug discovery. I will also explain how Insilico is already helping with this mission thanks to its cutting-edge technologies.

Perhaps this article will infuse die-hard health nerds with a new love for things that aren’t so biological but are nonetheless crucial to the future of our healthcare.

What is the chemical space?

The chemical space is a concept that refers to the (large) set of all the molecules that fit certain given characteristics and conditions. It often comprises millions of possibilities, depending on the specified conditions.
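
To make the idea of "molecules that fit given conditions" concrete, here is a minimal sketch. The five molecules and their approximate molecular weights are real, but the two conditions are hypothetical and chosen only for illustration; a real chemical space would involve far more molecules and far richer criteria.

```python
# Approximate molecular weights in g/mol for a handful of well-known molecules.
molecules = {
    "water": 18.02,
    "ethanol": 46.07,
    "aspirin": 180.16,
    "caffeine": 194.19,
    "ibuprofen": 206.28,
}


def chemical_space(candidates, conditions):
    """Keep only the molecules that satisfy every condition."""
    return {
        name
        for name, weight in candidates.items()
        if all(condition(weight) for condition in conditions)
    }


# Hypothetical conditions: heavier than 100 g/mol and lighter than 200 g/mol.
conditions = [lambda w: w > 100, lambda w: w < 200]
print(sorted(chemical_space(molecules, conditions)))  # ['aspirin', 'caffeine']
```

Changing the conditions changes which molecules belong to the space, which is why its size depends so much on what we ask for.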

When we explore the chemical space, we aim to expand the list of molecules that are known to bear certain particularities. Successfully finding such a new molecule is like finding another Earth-like planet: potentially very useful.

Why is that? To understand the importance of discovering a new molecule, you should know that molecules are the basis of the drugs we have today. It will come as no surprise that we don’t have drugs to cure every disease. Chemical space exploration is changing that by expanding the list of candidate drug molecules for every studied disease, and by making this list readily available.

The chemical space can be most efficiently explored by AI, specifically by deep learning generative models. I will further explain the AI involved later in the article.

Introducing Insilico

Insilico is a tech company developing AI tools to accelerate drug discovery. Out of the many it offers, I’ll focus in this article on PandaOmics and Chemistry42. I chose to leave out InClinico, the last part of its Pharma.AI software suite, because I found it less relevant to drug discovery. It’s more of a clinical trial design tool.

PandaOmics for finding drug candidates

PandaOmics is software that searches through datasets to identify potential molecules that could act as drugs for the particular disease a researcher is studying. Of course, these molecules will bear different characteristics from disease to disease, which is why we said the chemical space corresponds to a set of molecules that satisfy certain conditions.

So what does the process actually look like? Let’s break it down.

First, the researcher will either input into PandaOmics their own dataset, collected through experiments, or use an existing dataset. The existing dataset will likely be larger and thus more likely to give rise to a suitable compound. The next image shows examples of the chemoinformatics datasets that researchers can use.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1347430/

But now, how does PandaOmics find new potential drugs based on the chosen dataset? That is thanks to the use of a molecular representation that the AI can understand. There are many ways to encode a molecule in a usable format, and a few of them are displayed in the table below.

To help you better understand what this representation is all about, I will briefly describe SMILES, short for Simplified Molecular-Input Line-Entry System. It works as follows. The system that converts a molecule into SMILES starts from a given atom in the molecule, detects which other atoms it is connected to and where they are located, and converts this information into a short string of ASCII characters that represents the structure of the molecule. The resulting string is the SMILES notation of that molecule. An issue with the SMILES notation is that it isn’t inherently standardized: multiple strings can be generated for the same molecule depending on which starting atom was chosen, which can hinder the AI’s learning. To solve this problem, we came up with canonicalization, a set of rules that fixes which SMILES string is used for each molecule. This, however, has yet to be broadly adopted.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1347430/
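
To illustrate why the starting atom matters for SMILES, here is a toy sketch. It only handles unbranched chains of single-letter atoms, and the function `canonical_linear_smiles` is a hypothetical helper written for this article, not part of any library. Real toolkits handle branches, rings and charges with much more involved graph algorithms.

```python
def canonical_linear_smiles(smiles: str) -> str:
    """Toy canonical form for an unbranched chain of single-letter atoms.

    A chain can be written starting from either end, so we pick the
    alphabetically smaller of the two spellings as the reference string.
    """
    if not (smiles.isalpha() and smiles.isupper()):
        raise ValueError("toy example: only chains of single-letter atoms, like 'CCO'")
    return min(smiles, smiles[::-1])


# Ethanol, written starting from the carbon end and from the oxygen end.
from_carbon = "CCO"
from_oxygen = "OCC"

# The raw strings differ even though the molecule is the same.
print(from_carbon == from_oxygen)  # False

# After canonicalization, both spellings map to one reference string.
print(canonical_linear_smiles(from_carbon) == canonical_linear_smiles(from_oxygen))  # True
```

Without that second step, a model would see "CCO" and "OCC" as two unrelated inputs.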

Assuming the dataset we are working with is presented to PandaOmics in an understandable format (which it will be, that’s one of the perks of using this tool), it will be searched in order to find patterns. Which patterns?

First, PandaOmics will search through publications, grants, patents, clinical trials and of course the dataset to find the genes relevant to the studied disease. Those connections aren’t always obvious, and just imagine how much time it would take for a human to do the same work!

After the relevant genes have been identified, PandaOmics will take those genes into account and find potential drug targets that influence those genes, based on available research. In other words, it is not suggesting new molecules yet: it’s suggesting that we try some known candidate molecules.

But how can PandaOmics generate drug outputs from data regarding a specific disease? That is thanks to machine learning.

Machine learning refers to a computer system adapting its behavior on its own, without following explicit instructions, by analysing sets of data and finding patterns. A widely used family of machine learning algorithms is neural networks.

A neural network is a set of layers (groups of neurons) that take in inputs and generate outputs through mathematical calculations, with the output of each layer becoming the input of the next. After the data has gone through all the layers, the final output is generated; in this case, it consists of molecular targets.
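
The layer-by-layer calculation can be sketched in a few lines. This is a minimal illustration with hand-picked, hypothetical weights and a sigmoid activation that I chose for simplicity; it says nothing about how PandaOmics is actually built.

```python
import math


def dense_layer(inputs, weights, biases):
    """One layer: each neuron computes a weighted sum of the inputs,
    adds a bias, and squashes the result with a sigmoid."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(1.0 / (1.0 + math.exp(-total)))
    return outputs


def forward(inputs, layers):
    """Feed the output of each layer into the next one."""
    activations = inputs
    for weights, biases in layers:
        activations = dense_layer(activations, weights, biases)
    return activations


# Hypothetical network: 3 inputs -> 2 hidden neurons -> 1 output neuron.
layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    ([[1.0, -1.0]], [0.0]),
]
print(forward([1.0, 0.0, 1.0], layers))
```

Training a network means adjusting those weights and biases until the final output is useful.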

There are many ways a neural network can learn from data. One method is supervised learning, where we train the neural network by feeding it data along with labels for that data. One perk of training a model that way is that it’s easier to verify that it is learning the right things, because we, as humans, are also able to classify that data. On the other hand, supervised training doesn’t work well when we don’t have the labels, which happens a lot when the dataset is very large. Then, we can use unsupervised training. Unsupervised training is the opposite of supervised training: making a neural network learn from unlabeled data. Since PandaOmics analyses a very large set of data that doesn’t carry a consistent set of labels (the research papers cover different subjects), it likely learned through a mix of supervised and unsupervised training.

So, thanks to that machine learning, PandaOmics takes the data and, from it, extracts the genes related to the disease we’re studying. If there are no mentions of genes in our dataset, it will look through the body of research available on the subject and still find genes related to the health condition based on clues from the data.
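
The difference between the two approaches can be shown on a tiny, made-up dataset. In this sketch, the measurements and the "active"/"inactive" labels are invented for illustration, and I use two deliberately simple methods (averaging per label, and a two-cluster k-means in one dimension) rather than neural networks.

```python
def train_supervised(values, labels):
    """Supervised: labels are given, so we average the values of each label."""
    sums, counts = {}, {}
    for value, label in zip(values, labels):
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(centroids, value):
    """Assign the label whose average is closest to the value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))


def train_unsupervised(values, steps=10):
    """Unsupervised: no labels, so two group centers settle on their own."""
    low, high = min(values), max(values)
    for _ in range(steps):
        group_low = [v for v in values if abs(v - low) <= abs(v - high)]
        group_high = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(group_low) / len(group_low)
        if group_high:
            high = sum(group_high) / len(group_high)
    return low, high


# Made-up measurements and labels, for illustration only.
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
labels = ["inactive", "inactive", "inactive", "active", "active", "active"]

centroids = train_supervised(values, labels)
print(predict(centroids, 4.0))     # uses the human-provided labels
print(train_unsupervised(values))  # finds two groups without any labels
```

The unsupervised version discovers that there are two groups, but it cannot name them: telling which group is "active" still takes a human or some labeled data.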

After PandaOmics finds target genes, this output is probably fed back into the neural network, where it becomes an input used to generate the final output: drugs that interact with the previously identified genes. A neural network that can use its outputs to generate further outputs is called a recurrent neural network, as opposed to a feedforward neural network, in which a layer’s output can only influence the next layer.

And there we have it, a candidate molecule to try for curing a disease. This is an example of hypothesis generation, as the AI makes the hypothesis that this or that drug could be useful for this or that disease. But what if we end up trying them and they don’t work? Then, we might want to generate a new molecule altogether. And that’s where Chemistry42 has us covered.
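
Here is a minimal sketch of what "using its own outputs" means, with a single neuron and hypothetical weights. It is only meant to contrast the two kinds of network, not to describe the actual architecture of PandaOmics, which I can only guess at.

```python
import math


def recurrent_step(x, previous_state, w_input=0.5, w_state=0.9):
    """The new state depends on the current input AND on the previous output."""
    return math.tanh(w_input * x + w_state * previous_state)


def run_recurrent(sequence):
    """Process a sequence while carrying the state from one step to the next."""
    state = 0.0
    states = []
    for x in sequence:
        state = recurrent_step(x, state)
        states.append(state)
    return states


def run_feedforward(sequence, w_input=0.5):
    """Each input is processed on its own; nothing is carried over."""
    return [math.tanh(w_input * x) for x in sequence]


sequence = [1.0, 0.0, 0.0]
print(run_feedforward(sequence))  # the first input has no effect on later steps
print(run_recurrent(sequence))    # the first input still echoes in later steps
```

In the recurrent version, the first input keeps influencing the later outputs even though the later inputs are zero.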

Chemistry42 for generating novel drug candidates

Chemistry42 is a piece of software that analyses existing molecules in order to generate plausible molecules that could act as drugs for our disease of interest but that we haven’t synthesized yet (and thus have never tried).

How could an AI possibly do that? That is thanks to generative neural networks. Generative neural networks are algorithms that seek to create new versions of the type of data presented to them. For example, we have generative neural networks that can create new pieces of art from a dataset of paintings, and others that can generate human faces that don’t actually exist. Just like those, Chemistry42 is a generative neural network because it will create molecules that “don’t yet exist”, at least within the healthcare field.

Now, to generate new molecules, Chemistry42 will need some real molecules as data, or input. For that, it will take the identified drug targets that PandaOmics generated, if the researcher chose to use PandaOmics as well, and the molecules that are likely to act on our disease according to available research, just as PandaOmics does.

The cool thing about Chemistry42 as an AI for molecule generation is that it is able to work with both structure-based molecular representations and ligand-based molecular representations.

https://www.researchgate.net/figure/Difference-Between-Two-General-Categories-Structure-Based-and-Ligand-Based_tbl1_319036203

Now, to make Chemistry42 work, we’ll need to train it.

As you can imagine, researchers don’t have to train Chemistry42 from scratch, but they do have to define reward and penalty rules for it to better understand what kind of molecule they want from it. This is called reinforcement learning, and it’s a commonly used method to bring AIs to successfully accomplish a task.

Reinforcement learning is one of the three main approaches to making a machine learn. When the machine takes a specific set of actions, it will either be rewarded or punished depending on whether the set of actions is close to the desired one. In this case, Chemistry42 is rewarded when the molecules it generates fit the needed characteristics and punished otherwise.

https://www.semanticscholar.org/paper/Deep-Reinforcement-Learning-for-Conversational-AI-Jadeja-Varia/100636ca50edf63d4336bb071f3e172cb0ebccaa
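
The reward-and-penalty loop can be sketched with a toy learner. Everything here is hypothetical: the three "fragment" actions, their molecular weights, and the single rule that rewards candidates of 500 g/mol or less (a threshold loosely borrowed from a common drug-likeness rule of thumb). Chemistry42's real rules and algorithms are not described in this article.

```python
import random


def reward(molecular_weight, max_weight=500.0):
    """Hypothetical rule: +1 if the candidate is within the weight limit, -1 otherwise."""
    return 1.0 if molecular_weight <= max_weight else -1.0


def train(actions, episodes=200, learning_rate=0.1, explore=0.2, seed=0):
    """Learn a value for each action from rewards and penalties alone."""
    rng = random.Random(seed)
    values = {name: 0.0 for name in actions}
    for _ in range(episodes):
        if rng.random() < explore:
            choice = rng.choice(list(actions))    # try something at random
        else:
            choice = max(values, key=values.get)  # reuse what has worked so far
        # Nudge the stored value toward the reward that was just received.
        values[choice] += learning_rate * (reward(actions[choice]) - values[choice])
    return values


# Hypothetical actions: each one yields a candidate with this molecular weight.
actions = {"small_fragment": 320.0, "medium_fragment": 480.0, "large_fragment": 650.0}
values = train(actions)
print(values)
```

The learner is never told the weight limit. It only sees rewards and penalties, and ends up avoiding the action that breaks the rule.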

Now that the reward and penalty rules are defined, Chemistry42 will know what our new drug-like molecule should look like, according to the laws of chemistry and our own set of conditions. In as little as a week, it’ll generate hundreds of new compounds that could cure our disease of interest.

What are the benefits over traditional research?

There is one huge benefit to using AIs such as the ones Insilico makes available, PandaOmics and Chemistry42, in drug development, and that is the sheer amount of time it saves. AIs can look through massive amounts of data, such as the totality of the evidence regarding a given disease, in very little time (the span of a few clicks, as Insilico puts it). As for Chemistry42, it can generate novel-like structures in just a week.

If we compare that to a human, the difference is enormous. It would take us months, if not years, to accomplish the same thing.

Using AIs for drug discovery comes with another perk: generating insights that humans might not have been able to see. Admittedly, the technology isn’t quite there yet when it comes to generating surprising molecules:

The hypotheses suggested so far are mostly in the realm of the relatively unsurprising ones. — Giovanni Colavizza, research data scientist at the Alan Turing Institute

However, sometimes, it does surprise us, and this is when it’s the most useful. It’s easy to imagine that as technology progresses, so will the help AI can provide for drug discovery.

What are the challenges to be solved?

There are many opportunities for improvement in the world of chemical space exploration AIs.

As I said previously, the SMILES representation of molecules isn’t yet standardized, but this is beginning to be solved thanks to popular cheminformatics packages such as OpenEye and RDKit. However, this may be the least of the challenges to be overcome.

A hard problem to solve will be to allow the generation of macromolecules by our AIs. Macromolecules are for the moment too complex to be “imagined” by our programs, and solving this problem will probably have to wait until we can throw quantum computers into the mix.

Finally, an additional problem, which isn’t so much the AI’s fault, is that we can’t always synthesize the molecules it suggests, making some of its suggestions useless. Solving this would require providing the AI with information regarding the current state of our technology, a bit like a new set of conditions, so that it takes that into account when generating “plausible” molecules.

Conclusion

When it comes to the future of healthcare, it’s likely that many fields will have to collide in order for us to progress. While knowledge of molecular biology will be crucial to develop new ways to synthesize molecules and proteins, the development of AI will allow us to get more insightful suggestions to apply, in the same way quantum computing could allow for the generation of new macromolecules.

While these fields may not yet be advanced enough to provide us with the best healthcare possible, and with freedom from all diseases, they do allow us to look forward to a bright future.

Sources

https://www.researchgate.net/figure/Difference-Between-Two-General-Categories-Structure-Based-and-Ligand-Based_tbl1_319036203
