MIT’s AI Learns Molecular Language for Rapid Material Development and Drug Discovery

A new AI system developed by researchers at MIT and the MIT-IBM Watson AI Lab reliably predicts molecular properties from minimal data, which could greatly simplify drug and materials discovery. To generate new molecules quickly and effectively, the system uses a “molecular grammar” that it learns through reinforcement learning. The approach has proven effective even on datasets with fewer than 100 samples.

This AI system only needs a small amount of data to predict molecular properties, which could speed up drug discovery and material development.

Manual, trial-and-error processes for discovering new materials and pharmaceuticals can take decades and cost tens of millions of dollars. Machine learning is frequently used by scientists to predict chemical properties and reduce the number of compounds that must be synthesized and tested in the lab.

Researchers from MIT and the MIT-IBM Watson AI Lab have developed a new, unified framework that improves both molecular property prediction and molecule generation.

Before a machine-learning model can accurately predict a molecule’s biological or mechanical properties, researchers typically need to train it on millions of labeled molecular structures. Because discovering molecules is expensive and hand-labeling millions of structures is difficult, large training datasets are often hard to come by, which limits the effectiveness of machine-learning approaches.

In contrast, the method developed by the MIT team accurately predicts molecular properties from only a small amount of data. Their system builds on an understanding of the rules that dictate how building blocks combine to form valid molecules. These rules capture the similarities between molecular structures, which helps the system generate new molecules and predict their properties efficiently.

The method outperformed existing machine-learning approaches on both small and large datasets, and it was able to reliably predict molecular properties and generate viable molecules when given a dataset with fewer than 100 samples.

Researchers from MIT and the MIT-IBM Watson AI Lab have created a unified framework that uses machine learning and a small amount of training data to predict molecular properties and generate new molecules. Image courtesy of Jose-Luis Olivares/MIT.

Minghao Guo, a graduate student in electrical engineering and computer science (EECS), explains, “Our goal with this project is to use some data-driven methods to speed up the discovery of new molecules, so you can train a model to do the prediction without all of these cost-heavy experiments.”

Guo’s co-authors on the paper are MIT-IBM Watson AI Lab researchers Veronika Thost, Payel Das, and Jie Chen; recent MIT graduates Samuel Song ’23 and Adithya Balachandran ’23; and senior author Wojciech Matusik, an EECS professor and MIT-IBM Watson AI Lab member who heads the Computational Design and Fabrication Group within CSAIL. The findings will be presented at the International Conference on Machine Learning.

Learning the language of molecules

To get the best results from machine-learning models, scientists need training datasets containing millions of molecules with properties similar to those they hope to discover. In practice, such domain-specific datasets tend to be quite small. Researchers therefore take models that have been pretrained on vast datasets of general molecules and apply them to a much smaller, more targeted dataset. However, these models typically perform poorly because they have not learned much domain-specific information.

The MIT group decided to take a different tack. Using just a tiny, domain-specific dataset, they developed a machine-learning system that can automatically learn a molecular grammar, or the “language” of molecules. With this “grammar,” it can build functional molecules and make educated guesses about their properties.

In language theory, words, sentences, and even entire paragraphs are generated from a set of grammar rules. A molecular grammar is similar in concept. It is a set of production rules that dictate how to build molecules and polymers out of smaller building blocks, such as atoms and substructures.

One molecular grammar can represent an extremely large number of molecules, much like a language grammar can generate a large number of sentences using the same principles. The system is trained to recognize commonalities in the production rules followed by groups of molecules that share structural similarities.
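The idea that a few production rules can describe a very large set of outputs can be sketched with a toy string grammar. This is only an illustration: the rule names and the fragments "A", "B", and "C" are hypothetical placeholders, and the researchers' grammar operates on molecular structures rather than on strings.

```python
from itertools import product

# Hypothetical toy grammar. Keys are nonterminals; any symbol that is not a
# key is treated as a terminal fragment. Not real chemistry.
RULES = {
    "CHAIN": [["UNIT"], ["UNIT", "CHAIN"]],
    "UNIT": [["A"], ["B"], ["C"]],
}


def expand(symbol, depth):
    """Return all terminal strings derivable from `symbol` when no branch
    of the derivation applies more than `depth` nested rules."""
    if symbol not in RULES:
        return {symbol}          # terminal: nothing left to rewrite
    if depth == 0:
        return set()             # out of budget before reaching terminals
    results = set()
    for right_hand_side in RULES[symbol]:
        parts = [expand(s, depth - 1) for s in right_hand_side]
        for combo in product(*parts):
            results.add("".join(combo))
    return results


if __name__ == "__main__":
    for depth in (2, 3, 4, 5):
        print(depth, len(expand("CHAIN", depth)))
```

Two rules for CHAIN and three for UNIT are enough to produce every sequence of fragments up to a given length, so the number of derivable strings grows quickly as the depth budget increases.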

Because molecules with similar structures tend to share similar properties, the system uses its underlying knowledge of molecular similarity to predict the properties of new molecules more efficiently.

“Once we have this grammar as a representation for all the different molecules,” Guo says, it can be used to improve property prediction.
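One simple way to picture how shared production rules could support property prediction is to compare molecules by the sets of rules they use and to average the known properties of the closest matches. The sketch below does this with Jaccard similarity and nearest-neighbor averaging. Both are stand-ins chosen for illustration, not the researchers' model, and the rule identifiers and property values are made up.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of rule identifiers."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def predict(query_rules, labeled, k=2):
    """Similarity-weighted average over the k most similar labeled examples."""
    scored = sorted(
        ((jaccard(query_rules, rules), value) for rules, value in labeled),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in scored)
    if total == 0:
        # No overlap with any neighbor: fall back to a plain average.
        return sum(value for _, value in scored) / len(scored)
    return sum(sim * value for sim, value in scored) / total


# Made-up rule sets and property values, for illustration only.
LABELED = [
    ({"r1", "r2", "r3"}, 10.0),
    ({"r1", "r2", "r4"}, 12.0),
    ({"r7", "r8"}, 50.0),
]

if __name__ == "__main__":
    print(predict({"r1", "r2"}, LABELED, k=2))
```

A query that shares rules only with the first two examples is pulled toward their values and ignores the structurally unrelated third example.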

The system learns the production rules for a molecular grammar through reinforcement learning, a trial-and-error process in which the model is rewarded for behavior that brings it closer to accomplishing a goal.
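The trial-and-error loop described above can be sketched in its simplest form as an epsilon-greedy bandit: the learner tries actions, observes a reward, and gradually favors the actions that paid off. This is a generic textbook setting, not the researchers' training procedure, and the candidate rule names and reward values are hypothetical.

```python
import random

# Hypothetical candidate rules with fixed, made-up rewards.
REWARDS = {"rule_a": 0.2, "rule_b": 0.5, "rule_c": 0.9}


def learn(rewards, steps=200, epsilon=0.1, seed=0):
    """Estimate the value of each action with an epsilon-greedy policy."""
    rng = random.Random(seed)
    actions = list(rewards)
    estimates = {a: 0.0 for a in actions}
    counts = {a: 0 for a in actions}
    for step in range(steps):
        if step < len(actions):
            action = actions[step]                    # try each action once
        elif rng.random() < epsilon:
            action = rng.choice(actions)              # explore
        else:
            action = max(actions, key=estimates.get)  # exploit best estimate
        reward = rewards[action]
        counts[action] += 1
        # Incremental running mean of the rewards seen for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts


estimates, counts = learn(REWARDS)
best = max(estimates, key=estimates.get)

if __name__ == "__main__":
    print(best, counts)
```

Because the rewards here are fixed, the estimates become exact after each action has been tried once. With noisy rewards, the exploration step is what lets the learner correct an unlucky early estimate.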

However, because there may be billions of ways to combine atoms and substructures, learning the grammar production rules would be prohibitively expensive to compute for anything but the smallest dataset.

To get around this, the researchers split the molecular grammar into two parts. The first part, called a metagrammar, is a general, broadly applicable grammar that they design by hand and give to the system at the outset. The system then only needs to learn a much smaller, molecule-specific grammar from the domain dataset. This hierarchical approach speeds up the learning process.
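A rough count shows how quickly the number of combinations grows. The sketch below counts only ordered sequences of building blocks, which is a simplification because real molecules are graphs rather than chains. The figures of 20 blocks and up to 8 positions are assumptions picked for illustration.

```python
def count_sequences(num_blocks, length):
    """Number of ordered sequences of `length` blocks, repetition allowed."""
    return num_blocks ** length


if __name__ == "__main__":
    for length in (2, 4, 8):
        print(length, count_sequences(20, length))
```

With 20 hypothetical blocks, sequences of length 8 already number in the tens of billions, which is why an exhaustive search over combinations is impractical.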

Big results, small datasets

In testing, the new method both generated viable molecules and polymers and accurately predicted their properties, even though the domain-specific datasets contained only a few hundred samples. It also avoids the expensive pretraining step that several alternative methods require.

The method excelled at predicting the glass transition temperature of polymers, the temperature at which an amorphous polymer changes from a hard, glassy state to a softer, rubbery one. Obtaining this data experimentally is usually very expensive because the tests require extremely high temperatures and pressures.

The researchers halved the size of one training set, leaving only 94 samples, to test the limits of their method. Results from their model were still competitive with those of approaches trained with the full dataset.
