AI models tested in a classic economics game reveal major differences from human thinking

Researchers test leading AI models in a classic economics game, revealing where machines match and miss human reasoning.

Written By: Shy Cohen
Edited By: Joseph Shavit

A new study places leading AI models into a classic economic experiment, revealing how their strategic choices differ from humans. (CREDIT: AI-generated image / The Brighter Side of News)

We are living at a time when large language models increasingly make choices once reserved for people. From writing emails to guiding business decisions, these systems shape daily life. That shift has raised a pressing question for researchers: when AI must reason strategically, does it think like a human?

A new study tackles that question by placing artificial intelligence inside a well-known economic experiment. The research was led by Dmitry Dagaev, head of the Laboratory of Sports Studies at the Faculty of Economic Sciences at HSE University. He worked with colleagues Sofia Paklina and Petr Parshakov from HSE University–Perm and Iuliia Alekseenko from the University of Lausanne. Together, the team tested how leading AI models behave in the “Guess the Number” game, a modern version of the Keynesian beauty contest.

The game looks simple. Each participant chooses a number between 0 and 100. The winner is the player whose number comes closest to a fraction of the group’s average, often one half or two thirds. Yet decades of research show that human players rarely follow the mathematically optimal path. Instead, they reveal limits in reasoning, expectations about others, and emotional influences.
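
To make the mechanics concrete, here is a minimal sketch of a single round. It is our own illustration, not the study's code, and uses the common two-thirds variant: each player picks a number from 0 to 100, and the winner is whoever lands closest to the fraction of the group average.

```python
# Minimal sketch of one "Guess the Number" round (illustration only).
def play_round(guesses, fraction=2/3):
    target = fraction * (sum(guesses) / len(guesses))
    distances = [abs(g - target) for g in guesses]
    winner = distances.index(min(distances))   # index of the closest guess
    return target, winner

guesses = [50, 33, 22, 10]                     # hypothetical picks from four players
target, winner = play_round(guesses)
print(f"target = {target:.2f}, winning guess = {guesses[winner]}")
# target = 19.17, winning guess = 22
```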

The researchers wanted to know whether AI would make the same kinds of choices.

Putting AI in Human Shoes

To find out, the team evaluated five widely used language models available between 2024 and 2025. These included GPT-4o, GPT-4o Mini, Gemini-2.5-flash, Claude-Sonnet-4, and Llama-4-Maverick. Each model acted as a single participant across 16 scenarios drawn from classic experiments with human subjects.

The scenarios varied widely. Some changed the fraction used to calculate the winning number. Others altered how the group’s numbers were combined, using averages, medians, or maximums. Several descriptions focused on the opponents themselves, such as first-year economics students, conference experts, or players described as angry or analytical.

Each model received the same instructions. It had to choose a number, explain its reasoning, and assume its opponents matched the description provided. Every scenario was repeated 50 times per model, with no opportunity to learn from earlier rounds. This approach mirrored one-shot experiments used with human participants.
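
The description above amounts to a simple loop: one prompt per scenario, issued repeatedly to each model with no shared memory between calls. The sketch below is our own illustration of that protocol, not the authors' code; `query_llm` and the number-extraction step are hypothetical stand-ins.

```python
import re

def query_llm(model_name, prompt):
    """Hypothetical stand-in for a real API call; returns a canned reply here."""
    return "I choose 22, since most opponents will probably pick around 33."

def run_scenario(model_name, prompt, repetitions=50):
    """Issue the same one-shot prompt repeatedly; no learning across rounds."""
    choices = []
    for _ in range(repetitions):
        reply = query_llm(model_name, prompt)      # fresh context every call
        match = re.search(r"\b\d{1,3}\b", reply)   # pull out the chosen number
        if match:
            choices.append(int(match.group()))
    return choices

# 16 scenarios x 50 repetitions x 5 models = 4,000 responses in total.
```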

The first results were basic but essential. All 4,000 responses followed the rules and stayed within the 0 to 100 range. Almost every explanation showed some form of strategic reasoning. Only 23 responses lacked it.

Summary of experiments replicated in this paper with an AI player. CRT = cognitive reflection test. (CREDIT: Journal of Economic Behavior & Organization)

Where AI Matches and Misses Human Behavior

When the researchers compared AI choices with well-known human results, clear differences appeared. In classic experiments conducted by economist Rosemarie Nagel, human players averaged about 27 when the target was half the group average and nearly 37 when it was two thirds. In every comparable case, AI models chose lower numbers.

Some models moved very close to zero, which is the Nash equilibrium in most versions of the game. Others, such as GPT-4o, chose higher values but still below human averages. All of these differences were statistically significant.

When the game used different rules, the pattern held. In versions based on the maximum number, both humans and AI selected higher values. Yet even then, the models differed from one another. Claude Sonnet averaged about 35, while Llama played much smaller numbers.

"These results show that AI responds to changes in game structure much like people do. When theory predicts higher or lower choices, the models move in the same direction," Dagaev told The Brighter Side of News.

"However, we found that a key gap emerged. In two-player versions of the game, choosing zero is always weakly dominant. It never performs worse than any other choice. None of the models identified or explained this logic. They relied instead on step-by-step reasoning about what others might do. This absence of dominant-strategy thinking marks a clear difference from formal economic training," he continued.

Replication results for Nagel (1995) with an LLM player. In Nagel (1995), n varies from 15 to 18 in different sessions. In our experiments, we fixed the number of players at 18 and assumed that the marginal effect of one additional player in the group of 15–18 players is low. (CREDIT: Journal of Economic Behavior & Organization)

Differences Between Models and the Power of Scale

Not all AI behaved the same way. In most pairwise comparisons, models produced distinct averages. GPT-4o and Claude Sonnet often landed in the middle. Gemini Flash shifted between cautious and aggressive guesses. Llama showed the widest range.

The team explored this further by testing Llama models of different sizes, from 1 billion to 405 billion parameters. The pattern was striking. Smaller models chose numbers closer to typical human guesses, often near 50. Larger models moved steadily toward theoretical predictions, selecting much lower values.

As model size increased, so did the depth of reasoning. Bigger models appeared to anticipate more layers of thought, pushing their choices closer to equilibrium.
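
That pattern matches the standard level-k account of the game. As a back-of-the-envelope illustration (ours, not the paper's), assume the one-half-the-average variant: a level-0 player guesses 50, and each additional layer of reasoning best-responds by halving the previous guess, so deeper thinking pushes choices toward the equilibrium at zero.

```python
fraction = 0.5
guess = 50.0                  # level-0: pick the middle of the range
for level in range(1, 6):
    guess *= fraction         # each level best-responds to the one below it
    print(f"level-{level} guess: {guess:.2f}")
# level-1: 25.00, level-2: 12.50, level-3: 6.25, level-4: 3.12, level-5: 1.56
```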

Emotion, Framing, and Social Context

The researchers also tested how sensitive AI was to context. They rewrote prompts using different words, framed the game as a television contest, and assigned emotional states to opponents.

Replication results for Grosskopf and Nagel (2008) with an LLM player. (CREDIT: Journal of Economic Behavior & Organization)

The results closely mirrored human behavior. When opponents were described as angry, both people and AI tended to choose higher numbers. Sadness led to smaller shifts. Players described as more analytical prompted lower guesses than intuitive ones.

Some models reacted more strongly to wording changes, especially GPT-4o Mini and Llama. Still, the overall structure of responses remained stable.

What the Findings Suggest

The study shows that modern AI can recognize strategic settings and adjust its behavior in predictable ways. In many cases, it behaves more “rationally” than human participants by choosing lower numbers. At the same time, it fails to identify simple dominant strategies and often assumes others are more sophisticated than they really are.

This matters beyond the lab. As Dagaev noted, “We are now at a stage where AI models are beginning to replace humans in many operations, enabling greater economic efficiency in business processes.” Yet he stressed that in many decisions, human-like behavior remains important.

Understanding where AI aligns with people, and where it does not, will shape how these systems are used in markets, policy, and everyday life.

Practical Implications of the Research

These findings help clarify how AI might behave in real economic settings. If models consistently expect others to act strategically, they may misjudge markets driven by emotion or limited reasoning. At the same time, their strong alignment with comparative trends suggests they can still be valuable tools for forecasting and analysis.

For researchers, the results highlight where AI needs refinement, especially in recognizing simple strategic dominance. For society, the work offers guidance on when to trust AI decisions and when human judgment remains essential.

Research findings are available online in the Journal of Economic Behavior & Organization.

Shy Cohen
Science & Technology Writer
Shy Cohen is a Washington-based science and technology writer covering advances in AI, biotech, and beyond. He reports news and writes plain-language explainers that analyze how technological breakthroughs affect readers and society. His work focuses on turning complex research and fast-moving developments into clear, engaging stories. Shy draws on decades of experience, including long tenures at Microsoft and in his independent consulting practice, to bridge engineering, product, and business perspectives. He has crafted technical narratives, multi-dimensional due-diligence reports, and executive-level briefs, experience that informs his source-driven journalism and rigorous fact-checking. He studied at the Technion – Israel Institute of Technology and brings a methodical, reader-first approach to research, interviews, and verification. Comfortable with data and documentation, he distills jargon into crisp prose without sacrificing nuance.