New AI model reconstructs human ancestry from DNA

AI tool reads mutation patterns in DNA to estimate ancestry faster than classical population genetics methods. (CREDIT: University of Oregon)

A stretch of DNA can look static on a screen, just long rows of A, T, C and G. But buried in those letters are small changes that mark where lineages split, rejoined and drifted over time.

Researchers at the University of Oregon say they have built an artificial intelligence system that can read those mutation patterns much the way a language model reads text, then use them to estimate when two genes last shared a common ancestor. The tool, called cxt, is described in the Proceedings of the National Academy of Sciences and is aimed at one of population genetics’ hardest jobs: reconstructing the hidden family history inside a genome.

Instead of predicting the next word in a sentence, the model predicts what the team calls the next coalescence, an estimate of shared ancestry along a chromosome.

Andrew Kern, a computational biologist in the University of Oregon College of Arts and Sciences, said the work draws on ideas behind generative AI but puts them to use in a field that has largely relied on math-heavy statistical methods. “Advances in generative AI and the architectures behind them are potentially useful to a number of fields outside a chatbot,” Kern said. “We’re borrowing strengths from the world of AI and applying them in this different context that’s largely been untapped.”

Traditional methods remain the standard for this kind of ancestry reconstruction. But they can be slow, especially with large genomic datasets, and they do not always handle incomplete data well.

Bones used for DNA extraction. (CREDIT: University of Oregon)

Kevin Korfmann, the study’s lead author and a former postdoctoral researcher at Oregon, said that made the problem a natural target for machine learning.

Teaching a model to read mutation patterns

The system is based on a modified GPT-2 architecture, an older language-model design better known for text generation. Here, though, it was not trained on books or websites. It learned from simulations of genetic evolution across a range of species, including primates, mosquitoes, rodents and bacteria.

That choice matters because evolution cannot be rerun in a lab on demand. “We can’t repeat evolution, so one of the key workflows we have is developing simulations,” Korfmann said. “The simulations mimic evolutionary processes, and then we use the outcomes as training data for our deep learning models.”

In simple terms, the program scans mutation density and related signals across windows of DNA. Regions with many differences usually point to a more distant shared ancestor. Regions with fewer tend to mark a more recent one. From those patterns, cxt estimates pairwise time to the most recent common ancestor, or TMRCA, a basic measure used to infer genealogical history.
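
The intuition that more differences imply an older ancestor can be made concrete with a textbook back-of-the-envelope estimate. Under a simple neutral model, two sequences whose common ancestor lived T generations ago are expected to differ at about 2 × μ × L sites in a window of L bases, where μ is the per-base mutation rate per generation. The Python sketch below inverts that relationship window by window. It is an illustration only, not how cxt works: the function name, the toy sequences and the deliberately inflated mutation rate are all assumptions made for the example.

```python
def windowed_tmrca(seq_a, seq_b, window, mu):
    """Naive per-window TMRCA estimate, in generations, for two aligned sequences.

    Hypothetical helper for illustration; this is NOT the cxt method.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    estimates = []
    for start in range(0, len(seq_a), window):
        a = seq_a[start:start + window]
        b = seq_b[start:start + window]
        # Count sites where the two sequences disagree in this window.
        diffs = sum(1 for x, y in zip(a, b) if x != y)
        # Expected differences = 2 * mu * L * T, so T is about diffs / (2 * mu * L).
        estimates.append(diffs / (2 * mu * len(a)))
    return estimates


# Toy data: the first window is identical, the second has two differences.
seq_a = "ACGTACGTAC" + "ACGTACGTAC"
seq_b = "ACGTACGTAC" + "TCGTACCTAC"
result = windowed_tmrca(seq_a, seq_b, window=10, mu=0.01)  # mu is a toy value
print(result)
```

The identical first window yields an estimate of zero, while the two differences in the second window yield about 10 generations under these toy settings. Estimates built from so few sites are extremely noisy, which is part of why the statistical problem is hard.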

The Oregon team says this makes the model a kind of language system for population genetics, not because it reads DNA letters the way some other biological AI models do, but because it learns a sequence of hidden evolutionary states from context. The research frames that as a translation problem, turning mutation patterns into coalescence times across a chromosome.

In tests, the model held up well against established approaches, particularly Singer+Polegon and SMC++. In a constant population-size scenario, cxt’s narrow model produced a mean squared error of 0.2531, close to Singer+Polegon’s 0.2470 and lower than SMC++ at 0.8685. In a more complex “sawtooth” demographic scenario, a broader version of cxt improved sharply, posting an error of 0.1796.
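
For readers unfamiliar with the metric, mean squared error averages the squared gap between each true and predicted value, so lower is better. The sketch below shows the calculation in Python. Whether the study computed errors on raw or log-scaled times is not stated here, so the log option, the function name and the numbers are assumptions for illustration.

```python
import math


def mean_squared_error(true_times, predicted_times, log_scale=True):
    """Average squared difference between true and predicted coalescence times.

    Hypothetical helper; log_scale=True compares natural logs of the times,
    which is an assumption and may not match the study's setup.
    """
    if len(true_times) != len(predicted_times) or not true_times:
        raise ValueError("inputs must be non-empty and the same length")
    if log_scale:
        if min(min(true_times), min(predicted_times)) <= 0:
            raise ValueError("times must be positive when log_scale=True")
        true_times = [math.log(t) for t in true_times]
        predicted_times = [math.log(p) for p in predicted_times]
    squared = [(t - p) ** 2 for t, p in zip(true_times, predicted_times)]
    return sum(squared) / len(squared)


# Made-up times in generations, not values from the paper.
truth = [100.0, 1000.0, 10000.0]
perfect = mean_squared_error(truth, truth)
# Every prediction is off by a factor of e, i.e. by exactly 1 on the log scale.
off_by_e = mean_squared_error(truth, [t * math.e for t in truth])
print(perfect, off_by_e)
```

On a log scale, an error near 0.25 corresponds to predictions that are typically off by a factor of about e^0.5, or roughly 1.6, which gives a rough sense of what the reported figures would mean if that scale applies.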

Andrew Kern, an academic expert in population genetics and machine learning, develops new tools for studying evolutionary biology. (CREDIT: Charlie Litchfield)

That result surprised the group. “You never really know what’s going to work when you’re essentially borrowing techniques from a totally different world and applying them to a new problem,” Kern said. “But this was a case where things worked really well.”

Fast enough for whole chromosomes

Speed is one reason the team thinks the approach could prove useful.

The authors report that all pairwise coalescence curves for a sample of 50 haploid chromosomes could be inferred in under five minutes on a single NVIDIA A100 GPU. They also found that cxt’s runtime scaled roughly linearly with the number of inferred pairs, and that adding GPUs produced near-linear speedups.
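
To put that in perspective, 50 chromosomes form 1,225 distinct pairs, so a five-minute run implies an average of at most about a quarter of a second per pair. The Python sketch below works through the arithmetic and extrapolates under an idealized assumption of perfectly linear scaling. The function names and the extrapolation itself are illustrative, not figures from the study.

```python
from math import comb


def pair_count(n_chromosomes):
    """Number of distinct unordered pairs among n sampled chromosomes."""
    return comb(n_chromosomes, 2)


def projected_seconds(n_chromosomes, seconds_per_pair, n_gpus=1):
    """Projected runtime, ASSUMING cost is exactly linear in the number of
    pairs and that GPUs split the work evenly (an idealization)."""
    return pair_count(n_chromosomes) * seconds_per_pair / n_gpus


pairs = pair_count(50)            # 1225 pairs for 50 chromosomes
per_pair_bound = 5 * 60 / pairs   # upper bound on average seconds per pair
projection = projected_seconds(100, per_pair_bound, n_gpus=4)
print(pairs, round(per_pair_bound, 3), round(projection, 1))
```

Because the number of pairs grows with the square of the sample size, doubling the sample roughly quadruples the work, which is why the reported multi-GPU scaling matters for larger datasets.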

That is a major contrast with likelihood-based and Markov chain Monte Carlo approaches, which tend to be more computationally demanding and more sensitive to parameter choices.

Korfmann said the time savings come from where the work happens. “Compared to classical inferential approaches, the AI tool doesn’t have to reason about every mutation individually,” he said. “It just reads the patterns because all of the expensive statistical work was done up front, during training, which sidesteps the bottleneck.”

The model also handled incomplete DNA data better than some standard tools because missing-data patterns were incorporated during training and fine-tuning. That became especially important when the researchers tested cxt on mosquito genomes, where patchy data and uneven sample sizes are common.

The broad version of the model also generalized reasonably well to species added later to the stdpopsim catalog, including pig, rat, porpoise, mouse and gorilla. Still, that flexibility had limits. Across some out-of-sample cases, Singer+Polegon achieved lower error, sometimes by a wide margin. The authors say cxt struggled most when mutation rates were low and recombination rates were high, a regime with a poor signal-to-noise ratio.

True versus predicted coalescence times for three inference approaches across two demographic scenarios: a constant population size and a fluctuating “sawtooth” demography. (CREDIT: PNAS)

There were other caveats. The system does not reconstruct full genealogical topologies, only pairwise coalescence times. And in structured population models, the training setup may have biased some within-population estimates because simulated samples often mixed individuals from different populations.

From human history to malaria mosquitoes

To see how the system behaved on real data, the team applied it to human genomes from the 1000 Genomes Project and mosquito genomes from the Ag1000G consortium.

In humans, cxt recovered a familiar pattern at the LCT region on chromosome 2, where lactase persistence is known to have risen under recent selection. The model found a pronounced dip in coalescence times there, with some estimates slightly above 10,000 years, consistent with the age of the sweeping haplotype.

At the HLA region on chromosome 6, the picture flipped. There, the model inferred much deeper genealogical structure, including several genes with TMRCAs on the order of tens of millions of years. That matches long-standing evidence that balancing selection has preserved ancient variation in this immune-related region.

Mosquito work and malaria vectors

The mosquito work may be even more practical. Kern studies malaria vectors, and one of the biggest problems in controlling them is insecticide resistance. “Insecticide resistance is being observed in all of these mosquito populations today,” he said. “A major challenge in preventing the spread of malaria has been understanding the evolution of insecticide resistance. Now, we can go in with our AI model, ask how long ago these resistance genes arose in the population, and learn about the evolutionary history of this critical carrier of malaria.”

At the Rdl locus in Anopheles gambiae, cxt detected reduced coalescence times that varied by region. Ghana showed a sharp local depression, while Uganda showed none, a pattern consistent with known geographic differences in resistance alleles.

The model also suggested that the youngest estimated times at the site, in the hundreds to thousands of years, likely predate modern insecticide use. The authors caution, however, that those dates could be pushed back by older standing variation and by the fact that the method averages across sites rather than dating the exact mutation that causes resistance.

Out-of-sample evaluation of the broad model on stdpopsim v0.3. Each panel shows inferred marginal coalescence distributions (dashed) against true distributions (shaded). (CREDIT: PNAS)

The team also used cxt to examine the ancient In(2L)a inversion on mosquito chromosome 2L, finding deeper coalescent times inside the inversion than outside it, with especially old signals near the breakpoints.

Practical implications of the research

The work points to a different way of doing population genetics, one that trades hand-built likelihood formulas for simulation-trained machine learning. That could help researchers process much larger genomic datasets, work with messier sequence data and move faster when studying evolution in humans, disease vectors or other species.

It does not replace the best theory-driven methods in every case. Singer+Polegon still came out more accurate in some scenarios, and cxt has clear limits in unfamiliar parameter regimes. But the Oregon team argues that its speed, flexibility and ability to adapt through fine-tuning make it useful for questions where classical methods are too slow or too rigid.

Kern and Korfmann say the next step is to push beyond pairs of lineages and move toward reconstructing fuller genealogical trees. That would bring the model closer to the broader ancestral recombination graphs that population geneticists ultimately want to recover.

Research findings are available online in the journal PNAS.

The original story "New AI model reconstructs human ancestry from DNA" is published in The Brighter Side of News.



Written By: Shy Cohen
Edited By: Joseph Shavit