AI helps researchers discover several previously unknown molecules

Trained on millions of raw spectra, the DreaMS AI model reveals molecular structures and links hidden chemical patterns across life. (CREDIT: CC BY-SA 4.0)

Mass spectrometry, a powerful tool for studying tiny molecules, has long helped scientists unlock secrets hidden in plants, microbes, and even human tissues. But for all its strength, the method has a serious limitation: its output is hard to interpret. Each time a mass spectrometer analyzes a sample, it produces a complex fingerprint made of peaks and numbers, called a mass spectrum. Figuring out what each spectrum means has remained a major challenge, even as the volume of data grows.

That challenge is now being met with artificial intelligence.

A New Way to Read Molecular Fingerprints

A team of scientists led by Dr. Tomáš Pluskal from IOCB Prague, with Roman Bushuiev and collaborators at the Czech Technical University, has created a new AI model called DreaMS—short for Deep Representations Empowering the Annotation of Mass Spectra. This system can uncover the structure of molecules from raw spectral data faster and more accurately than previous methods. Their work, published in Nature Biotechnology, presents a major step forward in untangling the hidden language of nature’s chemistry.

From left: Dr. Tomáš Pluskal, head of the Biochemistry of Plant Specialized Metabolites research group at IOCB Prague; Roman Bushuiev, IOCB Prague; Anton Bushuiev, CIIRC CTU; Raman Samusevich, IOCB Prague; Dr. Josef Šivic, CIIRC CTU. (CREDIT: Tomáš Belloň/IOCB Prague)

DreaMS was trained using a method known as self-supervised learning. It learned from raw mass spectra drawn from a pool of more than 700 million in the GNPS repository, which contains data collected from environmental and biological samples around the world. Without being told what any specific spectrum meant, the model learned to spot patterns, similarities, and hidden features within the data.

Dr. Josef Šivic, one of the researchers, compares this process to how language models like ChatGPT learn to understand text. “ChatGPT can infer the meaning of words and the connections between them from large volumes of text,” he says. “DreaMS learns to recognize what molecular structures are hidden within spectra. It draws on data from millions of examples.”

The Challenge of Unknown Chemistry

Despite decades of research, scientists estimate that fewer than 10% of naturally occurring small molecules have been discovered. That means most of the world’s chemical diversity remains unexplored. These unknown molecules could be the key to breakthroughs in medicine, environmental safety, and even our understanding of life beyond Earth.

The main problem isn’t the ability to gather data—it’s the struggle to analyze it. When a mass spectrometer runs, it produces two types of data: MS1, which gives a broad overview of the molecules present, and MS2, which zooms in on fragments of a specific molecule.

These MS2 spectra hold the real clues to a molecule’s identity, but only about 2% of them can be matched to known structures using reference libraries. Even advanced machine learning tools can’t annotate more than 10% of spectra with confidence.
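To see why library matching leaves so many spectra unlabeled, consider a deliberately simplified sketch of the idea. It scores a query spectrum against each reference by the fraction of its peaks that also appear in the reference, and it gives up when no score clears a threshold. The compound names, m/z values, tolerance, and threshold are all invented for illustration, and real tools use more refined scores than this one.

```python
def shared_peak_fraction(query, reference, tol=0.01):
    """Fraction of query m/z values with a reference m/z within `tol`."""
    if not query:
        return 0.0
    hits = sum(any(abs(q - r) <= tol for r in reference) for q in query)
    return hits / len(query)


def best_match(query, library, min_score=0.7):
    """Return (name, score) for the best library entry.

    If no entry reaches `min_score` (an assumed cutoff), the name is None:
    the spectrum stays unannotated.
    """
    best_name, best_score = None, 0.0
    for name, reference in library.items():
        score = shared_peak_fraction(query, reference)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < min_score:
        return None, best_score
    return best_name, best_score


# Hypothetical reference library: compound name -> fragment m/z values.
library = {
    "compound_X": [85.03, 127.04, 163.06],
    "compound_Y": [91.05, 119.05, 147.04],
}

known = [85.03, 127.04, 163.06, 181.07]   # mostly overlaps compound_X
unknown = [70.07, 98.06, 140.11]          # overlaps nothing in the library

print(best_match(known, library))    # ('compound_X', 0.75)
print(best_match(unknown, library))  # (None, 0.0)
```

A molecule that has no entry in the library can never be matched, however clean its spectrum is, which is the gap the article describes.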

Previous tools relied heavily on limited spectral libraries or manual interpretation by experts. The well-known software SIRIUS, for instance, uses complex steps involving combinatorics, optimization, and support vector machines to guess a molecular fingerprint. While it performs well, it still depends on hand-crafted rules and curated data, which slows things down and limits its reach.

By contrast, DreaMS skips most of these steps. It learns directly from raw data, without needing human-designed shortcuts or annotated training sets. It predicts masked peaks in the spectra and estimates when certain chemicals will show up during a chromatography run. Through this process, it builds a 1,024-dimensional mathematical representation of each spectrum that captures detailed information about molecular structure.
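The masked-peak idea can be sketched in a few lines. This is only an illustration of how such a training example might be prepared, not the team's actual code: the 30% masking fraction, the placeholder value, and the toy spectrum are assumptions made for the example.

```python
import random

MASK = -1.0  # hypothetical placeholder standing in for a hidden m/z value


def mask_peaks(peaks, fraction=0.3, seed=0):
    """Hide the m/z value of a random subset of peaks.

    peaks: list of (mz, intensity) tuples.
    Returns (masked_peaks, targets), where targets maps the index of each
    hidden peak to its true m/z. A model would be trained to recover them.
    """
    rng = random.Random(seed)
    n_masked = min(len(peaks), max(1, round(len(peaks) * fraction)))
    hidden = set(rng.sample(range(len(peaks)), n_masked))
    masked = [
        (MASK if i in hidden else mz, intensity)
        for i, (mz, intensity) in enumerate(peaks)
    ]
    targets = {i: peaks[i][0] for i in hidden}
    return masked, targets


# Toy spectrum with invented values: (m/z, relative intensity).
spectrum = [(85.03, 0.12), (127.04, 0.40), (163.06, 1.00),
            (181.07, 0.25), (203.05, 0.08)]

masked, targets = mask_peaks(spectrum)
print(masked)   # same peaks, with two m/z values replaced by MASK
print(targets)  # the hidden m/z values the model should predict
```

Because the "answers" come from the spectrum itself, no human annotation is needed, which is what makes training on millions of unlabeled spectra possible.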

A Growing Map of the Chemical Universe

One of the most impressive outcomes of this project is the DreaMS Atlas. This massive, interconnected network links more than 200 million mass spectra. Each spectrum is like a webpage in a vast web. Similar to how websites are connected through hyperlinks, spectra in the DreaMS Atlas are connected based on chemical similarity.
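A network of this kind can be pictured as a nearest-neighbor graph over the model's numerical representations. The sketch below is a toy version under stated assumptions: the three-number vectors stand in for the 1,024-dimensional representations, the sample names and values are invented, and cosine similarity is used as one plausible measure of closeness rather than a confirmed detail of the Atlas.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_graph(embeddings, k=1):
    """Link each spectrum to its k most similar neighbors."""
    graph = {}
    for name, vec in embeddings.items():
        scored = [
            (cosine(vec, other_vec), other_name)
            for other_name, other_vec in embeddings.items()
            if other_name != name
        ]
        scored.sort(reverse=True)  # most similar first
        graph[name] = [other_name for _, other_name in scored[:k]]
    return graph


# Invented toy embeddings; real ones would come from the trained model.
embeddings = {
    "pesticide_A": [0.9, 0.1, 0.0],
    "skin_swab_1": [0.8, 0.2, 0.1],
    "plant_leaf": [0.0, 0.1, 0.9],
    "food_sample": [0.1, 0.0, 0.8],
}

graph = knn_graph(embeddings, k=1)
for name, neighbors in graph.items():
    print(name, "->", neighbors)
```

Comparing every pair, as this sketch does, would be far too slow for hundreds of millions of spectra; a network of that size would need an approximate nearest-neighbor search instead.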

Dr. Pluskal explains that this network helps scientists explore links they never noticed before. For example, DreaMS found surprising connections between pesticides, food, and human skin. It even led researchers to wonder whether certain pesticides could trigger autoimmune conditions like psoriasis. These kinds of insights were nearly impossible to find before.

The model isn’t just theoretical. It’s already helping with real-world tasks. It can guess what chemical elements are present in a molecule, how many fragments it has, and even whether it includes specific atoms like fluorine. This last task was particularly surprising.

The DreaMS neural network overcomes the limitation of mass spectral libraries. (CREDIT: Nature Biotechnology)

“Fluorine is present in about one-third of all drugs and agrochemicals, but we were previously unable to reliably detect it from the mass spectrum,” says Roman Bushuiev. After training DreaMS on millions of spectra and fine-tuning it with just a few thousand fluorine-containing samples, the model learned to identify fluorine correctly.

A Foundation for Future Discovery

DreaMS represents a turning point in the use of machine learning for chemistry. Instead of relying on small datasets or slow, rule-based tools, researchers now have a foundation model that can adapt to many different tasks. It works across different types of data and experimental conditions, which makes it flexible enough to be used in fields like drug development, environmental science, and even the search for life beyond Earth.

What makes DreaMS especially exciting is its potential to go further. The researchers are now working on the next step: teaching the model to predict full molecular structures. If successful, it could speed up the discovery of new chemicals and allow scientists to navigate the unknown parts of the chemical world with far more precision.

Sample-average DreaMS embeddings enable the sample-level analysis of metabolomics data, as exemplified on food LC–MS/MS datasets. (CREDIT: Nature Biotechnology)

This work also shows the power of self-supervised learning in science. By letting models learn patterns from raw data without human labels, researchers can uncover hidden relationships and insights that were previously out of reach.

As Dr. Pluskal notes, “The model was trained on tens of millions of spectra from diverse organisms and environments—plants, microbes, food, tissue, and soil samples. Thanks to this, it can uncover hidden similarities between spectra that, at first glance, seem unrelated.”

For scientists looking to better understand the building blocks of life, DreaMS offers a new path forward—one built not on guesswork, but on deep data and smarter machines.

Note: The article above was provided by The Brighter Side of News.
