New AI model reads the language of genes to detect diseases faster

New AI model learns how genes work together, helping scientists predict disease pathways and drug targets faster.

Written By: Shy Cohen
Edited By: Joseph Shavit

Mount Sinai scientists developed an AI system that learns how genes interact, opening new paths for disease research and drug discovery. (CREDIT: Shutterstock)

Artificial intelligence is helping scientists read gene behavior more like language, revealing how genes cluster, shift roles, and shape disease. A new model from Mount Sinai learns from vast datasets to predict missing links, spotlight obscure genes, and hint at faster biomedical discoveries.

Artificial intelligence has transformed how computers understand human language. Now, scientists at the Icahn School of Medicine at Mount Sinai are using a similar idea to decode one of biology’s biggest mysteries: how genes work together inside human cells.

In a new study, researchers introduced a gene set foundation model, or GSFM, designed to learn relationships between genes across millions of biological datasets. The system draws inspiration from large language models such as ChatGPT, which learn how words gain meaning from context. Instead of studying sentences, however, this new AI studies groups of genes.

The result is a system that can predict how genes interact, identify poorly understood genes and even suggest possible disease targets. Researchers believe the model could eventually improve drug discovery, diagnostics and the understanding of human disease.

“Genes rarely act alone. Instead, they participate in multiple biological processes, forming different molecular groupings depending on where and when they are active in the cell. A single gene can play different roles in different settings, much like a word can have different meanings in different sentences,” said Avi Ma’ayan, PhD, Professor of Pharmacological Sciences and Director of the Mount Sinai Center for Bioinformatics.

The denoising autoencoder-like architecture variations (A–E) illustrate how each variant collapses the network to a single vector at some point before the final prediction. (CREDIT: Patterns)

“Just as modern language models learn the meaning of words from context, we asked whether AI could learn the ‘meaning’ of genes in the same way. Our GSFM was designed to do exactly that.”

Teaching AI To Understand Biology

Human cells carry roughly 20,000 protein-coding genes, but those genes rarely work in isolation. They form networks, pathways and molecular teams that shift depending on the cell type, disease or environment.

Scientists have spent decades trying to understand those relationships. Traditional experiments can uncover gene functions, but they often require years of laboratory work. The new AI model aims to accelerate that process by learning patterns directly from massive amounts of biological data.

To build the system, researchers collected more than one million gene sets from published studies and transcriptomics datasets. These gene sets came from two major resources called Rummagene and RummaGEO.

Rummagene extracts gene lists from published scientific papers. RummaGEO generates gene sets from RNA sequencing studies stored in the Gene Expression Omnibus database. Together, these datasets covered thousands of diseases, tissues and experimental conditions.

The combined resource included more than 626,000 filtered gene sets spanning nearly 97,000 genes. Researchers then trained the AI using a puzzle-like strategy. The model received incomplete gene sets and had to predict the missing genes.

Over time, the system learned hidden biological patterns. Genes that appeared together repeatedly became linked within the model’s internal representation.
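The puzzle-like setup can be pictured with a few lines of Python. This is only a minimal sketch of the idea, not the researchers' code: the function name, the 50 percent masking fraction and the toy gene list are assumptions chosen for illustration.

```python
import random

def make_training_pair(gene_set, mask_fraction=0.5, seed=None):
    """Split one gene set into a visible part (model input)
    and a hidden part (the genes the model must predict)."""
    rng = random.Random(seed)
    genes = sorted(gene_set)      # sort first so a given seed is reproducible
    rng.shuffle(genes)
    # mask_fraction is an assumed value; the article does not state one
    n_hidden = max(1, int(len(genes) * mask_fraction))
    hidden = set(genes[:n_hidden])
    visible = set(genes[n_hidden:])
    return visible, hidden

# Toy gene set for illustration only
example_set = {"TP53", "MDM2", "CDKN1A", "BAX", "ATM", "CHEK2"}
visible, hidden = make_training_pair(example_set, mask_fraction=0.5, seed=0)
print("shown to the model:", sorted(visible))
print("to be predicted:   ", sorted(hidden))
```

Repeating this over many gene sets yields input and target pairs, and a model that scores well on them has to pick up which genes tend to travel together.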

The hyperparameters tested (gray) and selected (red) for each architecture from Figure 1 and against each of the benchmark datasets (Table 3) trained on Rummagene. (CREDIT: Patterns)

Turning Genes Into a Biological Language

The concept behind GSFM closely mirrors how language AI works. Large language models learn that words appearing in similar contexts often share related meanings. For example, the words “doctor” and “hospital” frequently appear together.

Genes behave similarly. Certain genes repeatedly appear together during immune responses, cancer growth or tissue repair. By learning these patterns, the AI creates mathematical representations called embeddings, which capture relationships between genes.
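A much simpler stand-in for learned embeddings shows why co-occurrence carries meaning. In the sketch below, each gene is represented by how often it shares a gene set with every other gene, and two genes are compared with cosine similarity. GSFM learns its embeddings with a neural network rather than by raw counting, and the four toy gene sets here are invented for illustration.

```python
import math
from collections import Counter
from itertools import combinations

# Toy gene sets, invented for illustration
toy_gene_sets = [
    {"IL6", "TNF", "IL1B"},
    {"IL6", "TNF", "CXCL8"},
    {"COL1A1", "COL3A1", "FN1"},
    {"COL1A1", "FN1", "TNF"},
]

def cooccurrence_vectors(gene_sets):
    """Represent each gene by its co-occurrence counts with every gene."""
    vocab = sorted(set().union(*gene_sets))
    counts = Counter()
    for gene_set in gene_sets:
        for a, b in combinations(sorted(gene_set), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return {g: [counts[(g, other)] for other in vocab] for g in vocab}

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

vectors = cooccurrence_vectors(toy_gene_sets)
sim_immune = cosine(vectors["IL6"], vectors["IL1B"])   # share partners in these sets
sim_mixed = cosine(vectors["IL6"], vectors["COL3A1"])  # share no partners here
print(sim_immune > sim_mixed)
```

In this toy data the two inflammation-related genes end up closer to each other than to the collagen gene, which is the kind of structure an embedding is meant to capture.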

“The organization of genes within cells remains one of the major unsolved questions in biology. The GSFM helps address this by learning from millions of gene groupings derived from published research and gene expression datasets,” said Dr. Ma’ayan.

Unlike earlier biological AI systems that relied mainly on single-cell gene expression data, GSFM learned directly from gene sets gathered across many research methods. That broader training allowed the system to integrate information from diseases, molecular studies and different biological conditions into one unified framework.

Researchers tested several AI architectures during development. Surprisingly, a relatively simple denoising autoencoder outperformed more complex systems such as variational autoencoders and transformer-based approaches.

The final model used a hidden layer size of 256 dimensions and reached peak performance after roughly 50 training cycles.
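To make the shape of such a network concrete, here is an untrained forward pass through a one-hidden-layer autoencoder written in plain Python. Only the hidden size of 256 comes from the article. The random weights, the ReLU and sigmoid activations, and the placeholder gene names are assumptions, so this is a sketch of the general technique rather than the published architecture.

```python
import math
import random

rng = random.Random(0)
vocab = ["GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"]  # placeholder names
hidden_size = 256  # hidden layer size reported in the article

def init_matrix(rows, cols):
    """Small random weights; a real model would learn these during training."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

W_enc = init_matrix(hidden_size, len(vocab))   # encoder: genes -> hidden vector
W_dec = init_matrix(len(vocab), hidden_size)   # decoder: hidden vector -> genes

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def forward(multi_hot):
    """Encode a corrupted gene set, then score every gene in the vocabulary."""
    hidden = [max(0.0, h) for h in matvec(W_enc, multi_hot)]  # ReLU (assumed)
    logits = matvec(W_dec, hidden)
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]       # sigmoid (assumed)

full_set = {"GENE_A", "GENE_B", "GENE_C"}
corrupted = full_set - {"GENE_C"}            # the "noise": one gene is dropped
x = [1.0 if g in corrupted else 0.0 for g in vocab]
scores = forward(x)                          # one score per gene in the vocabulary
```

Training would adjust the weights so that the dropped gene receives a high score, which is what "denoising" refers to.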

Outperforming Existing Biological Models

To test the system, scientists benchmarked GSFM against several well-known biological AI tools and gene databases. They evaluated how accurately the model could predict missing genes in known biological pathways and disease processes.

Benchmarking GSFM against gene-gene similarity matrices. (CREDIT: Patterns)

The model was tested using major biological libraries, including KEGG pathways, Gene Ontology Biological Processes, the GWAS Catalog and ChEA transcription factor datasets.

In these tests, researchers split known gene sets into two halves. One half was shown to the AI, while the other half was hidden. The model then attempted to predict the missing genes.

GSFM consistently outperformed competing methods across multiple benchmarks. It also successfully predicted gene-gene relationships and disease associations before they were experimentally confirmed in later studies.

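Researchers measured performance using AUROC, the area under the receiver operating characteristic curve, a common metric for how reliably a model ranks true answers above false ones. A score of 1.0 means perfect ranking, while 0.5 is no better than chance. The GSFM trained on Rummagene data achieved the strongest overall results.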
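For readers who want to see the metric in action, the sketch below scores a made-up ranking against a made-up hidden half of a gene set. It uses the pairwise definition of AUROC: the chance that a randomly chosen hidden gene is scored above a randomly chosen unrelated gene. The gene labels and scores are hypothetical, and the study's own evaluation code may differ.

```python
def auroc(scores, positives):
    """Fraction of (positive, negative) pairs in which the positive gene
    gets the higher score; ties count as half a win."""
    pos = [s for gene, s in scores.items() if gene in positives]
    neg = [s for gene, s in scores.items() if gene not in positives]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative gene")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical model scores for genes that were NOT shown to the model
scores = {"G1": 0.9, "G2": 0.8, "G3": 0.4, "G4": 0.3, "G5": 0.1}
held_out = {"G1", "G3"}   # the hidden half of the gene set

value = auroc(scores, held_out)
print(round(value, 3))    # 5 of 6 pairs ranked correctly -> 0.833
```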

The system also surpassed several prominent models, including Geneformer and scGPT, which were trained on tens of millions of single-cell gene expression profiles.

Scientists believe GSFM’s strength comes from its diversity of training data. Instead of focusing on one biological data type, the model learned from many different experimental contexts.

Predicting Disease Genes and Drug Targets

The AI model does more than predict missing genes. Researchers say it can also identify poorly understood genes, suggest disease-related pathways and uncover potential drug targets.

One major application involves gene set enrichment analysis, a widely used method that helps scientists interpret gene lists from experiments. GSFM improved performance in this task and helped identify biologically meaningful patterns more accurately than previous tools.
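In its classic form, enrichment analysis asks whether a gene list overlaps a known pathway more than chance would predict. The sketch below implements that standard overrepresentation test with a hypergeometric tail probability. It illustrates the traditional baseline, not how GSFM itself scores gene sets, and the numbers in the example, including the background of 20,000 genes, are assumptions.

```python
from math import comb

def hypergeom_pvalue(overlap, list_size, set_size, background):
    """Probability of seeing at least `overlap` pathway genes when drawing
    `list_size` genes at random from `background` genes, of which
    `set_size` belong to the pathway."""
    upper = min(list_size, set_size)
    total = comb(background, list_size)
    tail = sum(
        comb(set_size, k) * comb(background - set_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# Hypothetical experiment: 200 genes of interest, 10 of them fall in a
# 100-gene pathway, against an assumed background of 20,000 genes.
p_value = hypergeom_pvalue(overlap=10, list_size=200, set_size=100, background=20000)
print(p_value)
```

With only about one overlapping gene expected by chance in this example, ten shared genes produce a very small p-value, which is why the pathway would be flagged as enriched.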

Applying GSFM to predict gene functions for additional contexts. (CREDIT: Patterns)

The team also applied the system to protein-protein interaction prediction and gene-disease association studies. In one demonstration, researchers tested the model on ferroptosis, a type of cell death linked to iron and lipid damage.

Using a known ferroptosis gene set, GSFM predicted additional genes that may play roles in the process. Several of the top-ranked predictions later matched findings reported in scientific literature.

One example involved the gene PLIN2, which recent studies linked to ferroptosis in oligodendrocytes. The AI identified it as a likely candidate before widespread confirmation.

These kinds of predictions could help researchers prioritize laboratory experiments and uncover biological mechanisms faster than before.

Faster Science With Smaller Computing Needs

Despite being trained on enormous biological datasets, GSFM remains computationally efficient. Researchers said the training data required only about 1 gigabyte of storage, and training the model took roughly 30 minutes on standard hardware.

That makes the system far more accessible than some massive AI models requiring expensive computing clusters.

The researchers have also made the model publicly available online. Scientists can explore predictions, analyze gene sets and access downloadable benchmarking data.

The source code and pretrained model weights are available on GitHub and Hugging Face.

Benchmarking GSFM against other gene function prediction methods. (CREDIT: Patterns)

Looking Toward The Future

The Mount Sinai team plans to expand the model by combining it with other AI systems. One future goal involves linking GSFM with language-based models capable of generating plain-language explanations of gene function.

Another direction focuses on combining the model with drug-focused AI systems to predict how medicines interact with cells.

Researchers believe this type of biological AI could eventually help guide precision medicine, where treatments are tailored to an individual’s genetic profile.

The work also highlights a broader trend in science. AI systems once designed for human language are increasingly being adapted to biology, chemistry and medicine.

As biomedical datasets continue growing rapidly, tools like GSFM may help scientists uncover patterns too complex for humans to recognize alone.

Practical Implications of the Research

This research could significantly speed up biological discovery by helping scientists identify gene functions without years of laboratory testing. Researchers may use the model to uncover disease pathways faster, identify new biomarkers and prioritize promising drug targets for further study.

The system may also improve how scientists analyze large omics datasets, including genomics, proteomics and transcriptomics studies. Better interpretation of these datasets could support earlier disease detection and more personalized treatments.

Because GSFM integrates data from many biological conditions, it may help researchers discover hidden relationships between diseases that were previously difficult to detect. In the future, combining GSFM with drug-focused AI systems could improve predictions about how therapies interact with human cells and accelerate the development of new medicines.

Research findings are available online in the journal Patterns.

The original story "New AI model reads the language of genes to detect diseases faster" is published in The Brighter Side of News.





Shy Cohen, Science and Technology Writer


Shy Cohen is a Washington-based science and technology writer covering advances in artificial intelligence, machine learning, and computer science. Having published articles on MSN, AOL News, and Yahoo News, Shy reports news and writes clear, plain-language explainers that examine how emerging technologies shape society. Drawing on decades of experience, including long tenures at Microsoft and work as an independent consultant, he brings an engineering-informed perspective to his reporting. His work focuses on translating complex research and fast-moving developments into accurate, engaging stories, with a methodical, reader-first approach to research, interviews, and verification.