Humanity’s last exam, the test that modern AI still struggles to pass
A new 2,500-question exam built by nearly 1,000 experts worldwide reveals how far AI still is from expert human knowledge.

Edited By: Joseph Shavit

A global team built a test so difficult that today’s AI models still fail most of it. (CREDIT: Shutterstock)
Artificial intelligence systems now breeze through many academic tests that once challenged both machines and people. That success created an unexpected problem. The benchmarks used to measure AI progress stopped being useful because top models were scoring too high.
A massive international research effort set out to fix that.
Nearly 1,000 experts from more than 50 countries collaborated to build a new assessment called Humanity’s Last Exam, or HLE, a 2,500-question test covering more than 100 subjects. The project, described in the journal Nature, aims to measure how far modern AI still falls short of expert human knowledge.
“When AI systems start performing extremely well on human benchmarks, it’s tempting to think they’re approaching human-level understanding,” said Tung Nguyen, an instructional associate professor in computer science and engineering at Texas A&M University who helped develop the exam. “But HLE reminds us that intelligence isn’t just about pattern recognition — it’s about depth, context and specialized expertise.”
The name sounds dramatic. The purpose is practical.
When benchmarks stop working
Large language models now exceed 90 percent accuracy on well-known tests such as Massive Multitask Language Understanding, or MMLU, which once represented the frontier of AI evaluation. As scores climbed, researchers lost a reliable way to track progress.
That saturation prompted the creation of a harder benchmark that would remain challenging even as technology improved.
HLE includes both text-based and image-based questions. About 14 percent require interpreting visual information alongside written prompts. Roughly one quarter are multiple choice, while the rest require precise answers that automated systems can verify.
Questions span an unusual range. Some involve translating ancient Palmyrene inscriptions. Others ask about bird microanatomy or details of Biblical Hebrew pronunciation. Many focus on advanced mathematics and technical reasoning.
Every question had to meet strict rules. It needed one correct answer, clear wording and resistance to simple internet lookup. Contributors also had to provide detailed solutions explaining how the answer was reached.
The goal was not to confuse people. It was to isolate weaknesses in AI.
Built to stay ahead of machines
The development process itself acted as a filter.
Before entering the dataset, each question was tested against leading AI models. If a system answered correctly, the question was rejected. More than 70,000 attempts were logged during this screening phase, producing about 13,000 candidate questions that initially stumped models. These then underwent multiple rounds of expert human review before final selection.
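The screening rule can be sketched in a few lines of code. This is a hypothetical illustration of the logic described above, not the project's actual tooling; the function and variable names are invented for clarity.

```python
# Illustrative sketch (not the project's real code): a candidate question
# survives screening only if every frontier model tested fails to answer it.
def screen_questions(candidates, models):
    """Keep only (question, answer) pairs that no model answers correctly."""
    kept = []
    for question, answer in candidates:
        if all(model(question) != answer for model in models):
            kept.append((question, answer))
    return kept

# Toy stand-ins for AI models: one always answers "42", one always "blue".
models = [lambda q: "42", lambda q: "blue"]
candidates = [("Q1", "42"), ("Q2", "7"), ("Q3", "blue")]

# Q1 and Q3 are answered correctly by a model, so only Q2 survives.
print(screen_questions(candidates, models))  # [('Q2', '7')]
```

In the real project this filter ran at far larger scale: more than 70,000 model attempts reduced the pool to about 13,000 candidates before human review.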
The final result deliberately sits just beyond current AI capability.
Early scores confirmed that design. GPT-4o achieved 2.7 percent accuracy, Claude 3.5 Sonnet reached 4.1 percent, and OpenAI’s o1 model scored 8 percent. More advanced systems such as Gemini 3.1 Pro and Claude Opus 4.6 have reached roughly 40 to 50 percent accuracy.
Models also showed another weakness. They often answered incorrectly with high confidence, producing calibration errors above 70 percent. In simple terms, the systems did not recognize when they were wrong.
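The calibration gap described above can be made concrete with a small sketch. This is an illustrative computation under simplifying assumptions (one common definition of calibration error is the average gap between a model's stated confidence and whether it was actually right); it is not the exact metric used in the HLE paper.

```python
# Illustrative only: mean absolute gap between stated confidence (0-1)
# and actual correctness (1 if right, 0 if wrong).
def calibration_error(confidences, correct):
    """Average |confidence - correctness| across all answers."""
    assert len(confidences) == len(correct)
    gaps = [abs(c - (1.0 if ok else 0.0)) for c, ok in zip(confidences, correct)]
    return sum(gaps) / len(gaps)

# A model that claims 90% confidence on every answer but is right
# only 2 times out of 10 has a large calibration error.
confs = [0.9] * 10
right = [True, True] + [False] * 8
print(round(calibration_error(confs, right), 2))  # 0.74
```

A well-calibrated model that is right 20 percent of the time would instead report roughly 20 percent confidence, bringing this gap close to zero.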
That matters for real-world use, where misplaced confidence can create risks.
A global academic effort
HLE draws questions from nearly 1,000 contributors affiliated with more than 500 institutions worldwide. Most participants hold advanced degrees and specialize in fields ranging from physics to linguistics.
Nguyen contributed 73 of the public questions, the second-highest number among authors, with many focused on mathematics and computer science.
“What made this project extraordinary was the scale,” he said. “Experts from nearly every discipline contributed. It wasn’t just computer scientists; it was historians, physicists, linguists, medical researchers. That diversity is exactly what exposes the gaps in today’s AI systems — perhaps ironically, it’s humans working together.”
To encourage high-quality submissions, organizers created a $500,000 prize pool. Top questions earned $5,000 each, while hundreds of additional contributors received smaller awards.
Some of the exam has been released publicly, while a private test set remains hidden to prevent models from memorizing answers.
Measuring progress, not predicting the future
Researchers emphasize that high performance on HLE would indicate expert-level ability on structured academic questions, not artificial general intelligence. The exam focuses on closed-ended problems rather than open-ended research or creative discovery.
There are also built-in limitations. Like earlier benchmarks, HLE could eventually become saturated as technology advances. The dataset also leans heavily toward math and science topics, reflecting contributor expertise.
Still, the benchmark offers something the field urgently needs: a common reference point.
“Without accurate assessment tools, policymakers, developers and users risk misinterpreting what AI systems can actually do,” Nguyen said. “Benchmarks provide the foundation for measuring progress and identifying risks.”
Practical implications of the research
A clearer picture of AI strengths and weaknesses helps guide both development and regulation.
Reliable benchmarks allow researchers to track improvements, identify safety concerns and avoid overestimating system capabilities.
For governments and industry, that information supports better decisions about deployment, oversight and investment as AI systems continue to evolve.
Research findings are available online in the journal Nature.
The original story "Humanity’s last exam, the test that modern AI still struggles to pass" was published in The Brighter Side of News.
Shy Cohen
Writer



