New AI tool can generate millions of new molecules

A new AI system generated millions of realistic molecules, giving chemists a faster way to explore unknown chemical space.

Written By: Shy Cohen
Edited By: Joseph Shavit
AI model CoCoGraph generated millions of new, chemically valid molecules that could speed drug and materials discovery. (CREDIT: AI-generated image / The Brighter Side of News)

Chemists have long faced a maddening problem. The number of possible useful molecules is so vast that even the ones already known amount to only a tiny sliver of what might exist.

A research team at the Universitat Rovira i Virgili in Spain now says it has built an artificial intelligence system that can push into that unknown territory, generating millions of molecules that do not appear in current databases but still obey the rules of chemistry. The work, published in Nature Machine Intelligence, points to a faster way of exploring chemical space, the nearly unimaginable range of atom combinations that could one day lead to new drugs, materials, refrigerants, or other compounds.

The scale of that space is hard to overstate. The authors note that drug-like molecules may number around 10^60, far more than the number of water molecules in Earth’s oceans. That makes molecule discovery less like searching a library and more like sifting a universe.

At the center of the new work is a model called CoCoGraph, which works a bit like image-generating AI. Instead of producing pictures from noise, it generates molecular structures by learning how valid molecules can be broken apart and reassembled.

“Our algorithm does the same, but with molecules,” Roger Guimerà, an ICREA research professor in the Department of Chemical Engineering at URV, said in the source material.

Roger Guimerà, Manuel Ruiz-Botella and Marta Sales-Pardo, from the Department of Chemical Engineering, led the research. (CREDIT: Universitat Rovira i Virgili)

Rules first, invention second

The project tackles a central weakness in many earlier molecule-generating systems. Past approaches, including models based on variational autoencoders, generative adversarial networks, and graph neural networks, improved the field but often struggled with scale, efficiency, or chemical validity. Some could generate structures that looked inventive but broke basic chemical rules.

CoCoGraph takes a different route. Rather than asking the model to learn those rules from scratch, the researchers built some of them directly into the generation process. Each atom keeps the correct number of bonds, preserving valence and ensuring that the molecular formula remains fixed throughout the process.

That design choice matters. It means every molecule the system generates is chemically valid under the benchmark used in the study. According to the authors, CoCoGraph reached 100% chemical validity while still producing highly novel outputs.

Marta Sales-Pardo, also in URV’s Department of Chemical Engineering, described the process simply: “We start with a real molecule, break the bonds and create new ones at random. The model learns to reverse this process and reconstruct coherent structures.”

Unlike images, molecules are not smooth visual fields. They are discrete structures made of atoms and bonds, which makes the mathematics harder. To handle that, CoCoGraph uses what the team calls a constrained discrete diffusion process based on double edge swapping. In effect, it repeatedly swaps bonds while preserving the overall bonding requirements of the molecule.
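As a rough illustration of the bond-swapping idea, the Python sketch below applies random double edge swaps to a toy bond list and checks that every atom keeps the same number of bond ends. It is not the authors' implementation. The function names, the six-atom chain, the convention of listing a double bond as two identical pairs, and the cap of three parallel bonds are all assumptions made for this example. It covers only the forward, noise-adding direction, not the learned reversal.

```python
import random
from collections import Counter


def bond_ends_per_atom(bonds):
    """Count bond ends at each atom; a double bond is listed as two pairs."""
    counts = Counter()
    for a, b in bonds:
        counts[a] += 1
        counts[b] += 1
    return counts


def double_edge_swap(bonds, rng, max_order=3):
    """Rewire two bonds (a-b, c-d) into (a-d, c-b).

    Returns a new bond list, or the input unchanged when the proposal
    would bond an atom to itself or exceed `max_order` parallel bonds.
    """
    i, j = rng.sample(range(len(bonds)), 2)
    (a, b), (c, d) = bonds[i], bonds[j]
    if a == d or c == b:
        return bonds  # reject: an atom would be bonded to itself
    proposal = list(bonds)
    proposal[i], proposal[j] = (a, d), (c, b)
    orders = Counter(tuple(sorted(pair)) for pair in proposal)
    if max(orders.values()) > max_order:
        return bonds  # reject: more than a triple bond between two atoms
    return proposal


rng = random.Random(0)
chain = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]  # hypothetical six-atom chain
noised = chain
for _ in range(20):
    noised = double_edge_swap(noised, rng)

# Each atom has the same number of bond ends before and after the swaps.
print(bond_ends_per_atom(chain) == bond_ends_per_atom(noised))  # prints True
```

Because each swap only reassigns bond partners, the atom set never changes, which is why the molecular formula stays fixed in a scheme like this. The sketch does not check that the result stays connected as a single molecule.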

The system also includes a second model, called a time model, that estimates how close a partially reconstructed graph is to a realistic molecule. That extra signal helps the main diffusion model decide how to steer the denoising process.

Constrained collaborative graph diffusion model, CoCoGraph. (CREDIT: Nature Machine Intelligence)

Smaller model, stronger realism

The researchers compared CoCoGraph with six other leading molecule generators using the GuacaMol benchmark, a standard test suite in the field. They evaluated all models against a filtered PubChem reference database containing 94.7 million molecules with no overlap with training data.

Two versions of CoCoGraph were tested. The smaller BASE model used 534,000 parameters in total, while a fingerprint-enhanced version, called FPS, used 4.4 million. Even the larger one still had fewer parameters than most competing models.

Despite its lighter design, CoCoGraph performed strongly. Both versions achieved 100% chemical validity, uniqueness rates of 99.8% and 99.9%, and novelty of 95.7%. In the GuacaMol benchmark, the KL divergence scores for property matching reached 95.7% for the BASE model and 96.3% for the FPS version, beating the baselines the team compared against.

That matters because novelty alone is not enough. A useful model should generate molecules that are new but still plausible, with physicochemical properties that resemble those found in real chemistry.
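For readers who want to see how validity, uniqueness, and novelty relate to one another, here is a minimal Python sketch. It assumes the definitions commonly used with GuacaMol-style benchmarks: validity as the share of generated strings that pass a validity check, uniqueness as the share of valid outputs that are distinct, and novelty as the share of distinct valid outputs absent from a reference set. The study's exact procedure may differ. The function names and the toy validity check (balanced parentheses only) are placeholders; a real pipeline would rely on a cheminformatics toolkit.

```python
def toy_is_valid(molecule_string):
    """Placeholder validity check: parentheses must balance. Not real chemistry."""
    depth = 0
    for ch in molecule_string:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def generation_metrics(generated, reference, is_valid):
    """Return validity, uniqueness, and novelty as fractions between 0 and 1."""
    valid = [m for m in generated if is_valid(m)]
    unique = set(valid)
    novel = unique - set(reference)
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


# Made-up strings for illustration only.
generated = ["CCO", "CCO", "CCN", "C(C", "CCC"]
reference = {"CCO"}
metrics = generation_metrics(generated, reference, toy_is_valid)
print(metrics)
```

In this toy run, four of the five strings pass the check, three of those four are distinct, and two of those three are missing from the reference set.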

The authors also widened the test beyond the benchmark’s usual ten properties. Across 36 chemical properties, CoCoGraph outperformed competing systems in at least 66.6% of them. The team reported particular strength in topological features, electronic properties, and structural descriptors, traits that could matter in medicinal chemistry and drug discovery.

Can chemists tell the difference?

One of the most striking parts of the work came when the researchers stepped away from automated benchmarks and asked human experts to judge the results.

They built a database of 8.2 million synthetic molecules, with 7.1% redundancy. Based on the reported novelty rate, the database contains about 7.3 million new, unique, chemically valid molecules not found in PubChem.
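The arithmetic behind that estimate can be reproduced in a few lines of Python. This sketch assumes the 95.7% novelty rate from the benchmark applies to the unique molecules in the database; the variable names are illustrative.

```python
total_generated = 8_200_000  # molecules in the synthetic database
redundancy = 0.071           # share of repeated molecules
novelty = 0.957              # share of unique molecules not found in PubChem

unique_molecules = total_generated * (1 - redundancy)
novel_unique = unique_molecules * novelty

print(f"{unique_molecules / 1e6:.2f} million unique")
print(f"{novel_unique / 1e6:.2f} million unique and novel")
```

Removing the redundant entries leaves roughly 7.6 million unique molecules, and applying the novelty rate brings the figure to about 7.3 million.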

Then they ran what amounts to a molecular Turing test. A total of 121 participants with backgrounds in organic chemistry, biochemistry, and related fields were shown 20 pairs of molecules. In each pair, one molecule came from the original dataset and the other had been generated by CoCoGraph. Both shared the same molecular formula, forcing participants to judge structure rather than size or composition.

Across 2,420 assessments, the experts picked the real molecule correctly 62% of the time. Undergraduate participants scored 60%, while graduate participants scored 64%.

That is better than chance, but not by much.

For some categories, including acyclic molecules and predominantly aliphatic ones, performance was statistically compatible with random guessing. The authors are careful not to claim full indistinguishability, but they argue that the results show many generated molecules look convincing even to trained chemists.
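A short Python sketch shows how the headline figures fit together. The standard error here treats all assessments as independent, which is a simplification made for this example: each participant judged 20 pairs, so the answers are likely correlated. It is not the statistical analysis used in the study.

```python
import math

participants = 121
pairs_each = 20
accuracy = 0.62  # overall share of correct picks reported in the article

assessments = participants * pairs_each
approx_correct = round(accuracy * assessments)
margin_over_chance = accuracy - 0.5

# Standard error of a proportion at chance level (0.5), assuming independence.
se_at_chance = math.sqrt(0.5 * 0.5 / assessments)

print(assessments, approx_correct)
print(f"margin over chance: {margin_over_chance:.2f}")
print(f"standard error at chance: {se_at_chance:.3f}")
```

The 12-point margin over a coin flip is modest in absolute terms, even though it is large relative to the sampling noise expected from 2,420 independent guesses.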

A first step toward targeted design

Right now, CoCoGraph does not let a chemist type in a wish list and receive a perfect molecule in return. It cannot yet directly design compounds for a specific function.

Still, the study includes early demonstrations of how the model might become useful. The team searched its 8.2 million-molecule database for structures with physicochemical properties similar to paracetamol and identified top candidates using nine key properties. It also tested an inpainting-style approach that keeps part of an existing molecule fixed while adding small or medium fragments to create related variants.

50 random molecules generated by CoCoGraph FPS. (CREDIT: Nature Machine Intelligence)

That kind of controlled editing could matter in drug optimization, where researchers often want to preserve a molecular scaffold while adjusting other parts of the structure.

“For the moment, we are only generating molecules,” Manuel Ruiz-Botella, a doctoral student involved in the work, said in the source material. “The next step will be to apply specific objectives to this process.”

The study also outlines its limits. CoCoGraph fixes molecular formula during generation, which could restrict some applications. The model was also developed for molecules with up to 70 atoms, and the authors note that extending it to larger structures would require retraining and more computing resources. They also point to future uses in mass spectrometry and conditional molecule generation, but those remain directions for later work, not demonstrated outcomes.

Practical implications of the research

The immediate value of CoCoGraph is not that it has already delivered a new drug or material. It has not. What it offers is a more efficient way to search a chemical landscape that is far too large for humans to explore by hand.

By generating only chemically valid structures and doing so with fewer parameters than many rivals, the system could make large-scale molecule exploration less computationally expensive. Its 8.2 million-molecule database may also give researchers a starting point for screening realistic candidates in drug development or materials research.

More broadly, the work suggests that AI systems in chemistry may improve when they do not merely imitate known data but are built around the hard constraints of the field itself. In this case, the chemistry rules are not an afterthought. They are the reason the model works.

Research findings are available online in the journal Nature Machine Intelligence.

The original story "New AI tool can generate millions of new molecules" is published in The Brighter Side of News.





Shy Cohen is a Washington-based science and technology writer covering advances in artificial intelligence, machine learning, and computer science. Having published articles on MSN, AOL News, and Yahoo News, Shy reports news and writes clear, plain-language explainers that examine how emerging technologies shape society. Drawing on decades of experience, including long tenures at Microsoft and work as an independent consultant, he brings an engineering-informed perspective to his reporting. His work focuses on translating complex research and fast-moving developments into accurate, engaging stories, with a methodical, reader-first approach to research, interviews, and verification.