Thinking AI isn’t so smart after all, Apple study finds
New research reveals how AI reasoning models collapse under complex tasks, raising doubts about their thinking ability.

Apple’s new study shows large reasoning models collapse on complex logic puzzles, challenging assumptions about AI’s thinking power. (CREDIT: CC BY-SA 4.0)
New research has cast doubt on whether today's advanced artificial intelligence models are truly capable of deep reasoning. These models—known as Large Reasoning Models, or LRMs—were built to go beyond basic language tasks by generating long, step-by-step thought chains before reaching conclusions.
That kind of output, which mimics how humans solve problems, has impressed many researchers. But a recent study from Apple suggests these systems may not actually “think” as well as they appear.
Researchers at Apple evaluated how well current AI reasoning models perform when facing problems of increasing complexity. These weren’t just any problems—they were classic logic puzzles like the Tower of Hanoi, river-crossing games, and block stacking tasks.
Each of these puzzles follows clear, rule-based logic. The difficulty increases simply by adding more steps or items, making them ideal for testing how models scale their thinking as tasks get harder.
In the early stages, models like Claude 3.7 Sonnet Thinking and DeepSeek R1 handled simple and moderately difficult puzzles fairly well. But when researchers increased the difficulty, performance suddenly collapsed. Even when more computing power was added, the models began to give up—abandoning their chain-of-thought reasoning before reaching a conclusion.
Reasoning That Falls Apart
Apple's team used a new evaluation approach. Instead of relying on traditional math and coding benchmarks, which are often prone to data contamination, they turned to controlled puzzle environments. These puzzles were adjusted carefully so that each version became harder while keeping the underlying logic intact. This gave researchers the chance to track not just whether the models gave the right answers, but how they arrived at those answers.
Large reasoning models (LRMs) aim to mimic human thought by producing step-by-step written responses, often called “chain-of-thought reasoning.” This method is supposed to help AI solve problems through logic rather than guesswork. Researchers tested these models by putting them up against four classic puzzles that demand planning and foresight: Tower of Hanoi, checkers jumping, river crossing, and blocks world. Each puzzle was scaled in difficulty, starting with tasks like a one-disk Hanoi and building up to complex versions requiring over a million steps.
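To get a feel for how quickly these puzzles grow, consider the Tower of Hanoi: the shortest solution for n disks takes 2^n - 1 moves, so every extra disk roughly doubles the work. The short sketch below is an illustration of that arithmetic rather than code from the Apple study, and it shows how a 20-disk version crosses the million-move mark mentioned above.

```python
# The shortest Tower of Hanoi solution for n disks takes 2**n - 1 moves,
# so the puzzle's length explodes as disks are added.
for n in (1, 3, 8, 10, 20):
    print(f"{n} disks -> {2**n - 1:,} moves")

# 1 disks -> 1 moves
# 3 disks -> 7 moves
# 8 disks -> 255 moves
# 10 disks -> 1,023 moves
# 20 disks -> 1,048,575 moves
```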
The team wanted to see how well these models actually reasoned, not just whether they reached the right answer. They pointed out a key flaw in how most models are evaluated today. “Current evaluations primarily focus on established mathematical and coding benchmarks, emphasizing final answer accuracy,” the researchers wrote. That means tests often reward models for producing correct answers—even if those answers were guessed or pulled from similar examples in the training data—without checking whether any real reasoning took place.
By using puzzles that can't easily be solved through memorized examples, the researchers challenged the models to show genuine logical thinking. The results were not encouraging, and they mirrored related research on novel mathematical proofs, including problems from the USA Mathematical Olympiad (USAMO): across nearly 200 proof attempts, no model produced a flawless answer, most scored under 5 percent, and only one reached 25 percent.
Both studies showed a steep drop in performance when the problems required long chains of logic or deeper understanding. The LRMs often broke down when faced with tasks that demanded more than surface-level processing. This suggests that, while these models might look impressive on paper, they still fall short when it comes to true reasoning ability—especially when the path to the solution isn't already baked into their training data.
The findings were surprising. As puzzle complexity grew, models began to spend fewer tokens—or "reasoning steps"—despite having room to generate longer answers.
In simpler tasks, they often found the right solution early but kept exploring wrong ones, a kind of digital overthinking. At medium complexity, they needed to sort through several incorrect paths before finding the answer. But once the problems became too complex, accuracy dropped to zero. The AI simply stopped trying.
This suggests a deeper issue. These models weren’t limited by how much text they could output. They were failing because they didn’t know how to handle the logic itself. Even when researchers gave them the correct algorithm in the prompt—essentially spelling out what to do—the models continued to get stuck. That raised a critical concern: are these systems really reasoning, or are they just very good at mimicking reasoning patterns?
Not Everyone Agrees
The paper stirred strong reactions. AI skeptics saw it as confirmation that the current wave of intelligent systems is being overhyped. It provided a dose of realism to those concerned about rapid progress toward artificial general intelligence (AGI). Meanwhile, the AI boosters had to reckon with hard limits, even in state-of-the-art models.
But others questioned the study itself. One of the loudest critiques came from Alex Lawsen at Open Philanthropy, who wrote a detailed rebuttal titled “The Illusion of the Illusion of Thinking.” He argued that the Apple team had misunderstood their own results by focusing too much on output format and not enough on what the models were actually doing.
Lawsen pointed to several major flaws in Apple’s testing approach. First, he showed that some models were hitting token limits—essentially being cut off mid-thought. For example, Claude often stated in its output: “The pattern continues, but I’ll stop here to save tokens.” Apple counted these incomplete answers as failures, even when the model clearly understood the solution.
Second, some puzzles were unsolvable due to how they were built. In the river-crossing tests, Apple’s team created scenarios that couldn’t be solved under the rules provided. Models were penalized for recognizing the problem and refusing to solve it. That, Lawsen argued, was unfair.
Third, the evaluation scripts treated incomplete move lists as total failures, regardless of whether the model had started correctly or was stopped by design limits. In other words, the grading system didn’t separate actual reasoning problems from output formatting issues.
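Lawsen's first point, about token limits, comes down to simple arithmetic: writing out every move of a large Tower of Hanoi instance can exhaust a model's output budget long before its reasoning does. The sketch below makes that concrete with illustrative numbers; the tokens-per-move figure and the output cap are assumptions chosen for the example, not values taken from either paper.

```python
# Rough, illustrative estimate of how long a full move-by-move answer gets.
# TOKENS_PER_MOVE and OUTPUT_BUDGET are assumptions for this sketch,
# not figures from Apple's paper or Lawsen's rebuttal.
TOKENS_PER_MOVE = 10      # e.g. "move disk 3 from peg A to peg C"
OUTPUT_BUDGET = 64_000    # order-of-magnitude output cap for a large model

for disks in (8, 10, 12, 15):
    moves = 2**disks - 1
    tokens = moves * TOKENS_PER_MOVE
    verdict = "fits" if tokens <= OUTPUT_BUDGET else "exceeds the budget"
    print(f"{disks} disks: {moves:,} moves, about {tokens:,} tokens ({verdict})")
```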
To make his point, Lawsen ran the tests again using a different method. Instead of asking models to list every move, he asked them to write a recursive computer program in Lua that would generate a solution to the puzzle. With this setup, models like Claude, Gemini, and OpenAI’s o3 easily solved puzzles with 15 discs—more than twice the complexity that Apple claimed caused total failure.
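Lawsen's prompt asked for a Lua function; the sketch below expresses the same idea in Python, purely as an illustration of the approach. The point is that a few lines of recursion fully describe the strategy, so a model can demonstrate that it understands the puzzle without enumerating all 32,767 moves of a 15-disk solution by hand.

```python
def hanoi(n, src="A", dst="C", aux="B", moves=None):
    """Recursively build the optimal move list for an n-disk Tower of Hanoi."""
    if moves is None:
        moves = []
    if n == 1:
        moves.append((src, dst))
    else:
        hanoi(n - 1, src, aux, dst, moves)  # park the top n-1 disks on the spare peg
        moves.append((src, dst))            # move the largest disk to the target peg
        hanoi(n - 1, aux, dst, src, moves)  # stack the n-1 disks back on top of it
    return moves

solution = hanoi(15)
print(len(solution))  # 32767 -- the optimal move count for 15 disks
```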
Do LRMs Actually Reason?
The core debate centers on whether LRMs are failing because they can't think—or because our tests don't fairly measure their thinking. Apple’s research suggests a performance cliff: a point where reasoning effort suddenly collapses. But Lawsen believes the models may still understand the logic and just can’t express it properly due to artificial constraints.
It’s also worth noting that many humans struggle with these same puzzles. As AI expert Gary Marcus pointed out, most people can’t solve an 8-disc Tower of Hanoi puzzle either. And while the study makes it seem like models are collapsing, it doesn’t compare their performance to human reasoning at the same level of complexity.
Marcus, a longtime critic of AI overreach, agrees with the general message: LLMs are no replacement for well-defined algorithms. He wrote, “What the Apple paper shows, most fundamentally, regardless of how you define AGI, is that LLMs are no substitute for good well-specified conventional algorithms.”
In short, current AI models can simulate thought very well, especially when tasks are familiar. But they fall apart when asked to plan deeply or follow long-term logic. This reflects a broader weakness in how these systems were trained. Most large models learn from massive amounts of internet text. They become experts at prediction, not at understanding.
What Comes Next?
So, what should researchers and developers do with these findings?
Both sides agree that evaluations need to change. Future tests must separate reasoning skill from format constraints. Evaluators should verify that puzzles are solvable and consider how many tokens models are allowed to use. It’s also important to look at different ways of solving problems—like generating code instead of listing every move.
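On the point about verifying that puzzles are solvable, one straightforward safeguard is to brute-force the state space before grading anything. The sketch below does this for a classic missionaries-and-cannibals river crossing, used here as a stand-in for the river-crossing variant in the Apple paper; the rules encoded are the textbook ones, not necessarily Apple's exact setup.

```python
from collections import deque
from itertools import product

def river_crossing_solvable(pairs=3, boat=2):
    """Breadth-first search over missionaries-and-cannibals states.

    A state is (missionaries on the left bank, cannibals on the left bank,
    boat side), with 0 meaning the boat is on the left. Only the banks are
    checked for safety, as in the textbook version of the puzzle.
    """
    def safe(m, c):
        # Missionaries may never be outnumbered on either bank.
        left_ok = m == 0 or m >= c
        right_ok = (pairs - m) == 0 or (pairs - m) >= (pairs - c)
        return left_ok and right_ok

    start, goal = (pairs, pairs, 0), (0, 0, 1)
    seen, queue = {start}, deque([start])
    while queue:
        m, c, side = queue.popleft()
        if (m, c, side) == goal:
            return True
        sign = -1 if side == 0 else 1  # the boat leaves whichever bank it is on
        for dm, dc in product(range(boat + 1), repeat=2):
            if not 1 <= dm + dc <= boat:
                continue
            nm, nc = m + sign * dm, c + sign * dc
            if 0 <= nm <= pairs and 0 <= nc <= pairs and safe(nm, nc):
                nxt = (nm, nc, 1 - side)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return False

print(river_crossing_solvable(3, 2))  # True: the classic three-pair puzzle
print(river_crossing_solvable(4, 2))  # False: four pairs with a two-seat boat cannot cross
```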
At the same time, these findings highlight the need for models that combine language skills with true algorithmic reasoning. Some research groups are already working on hybrid models that mix LLM flexibility with traditional computation engines. Others are focused on making LRMs more efficient at planning and self-correction.
The Apple paper’s title, “The Illusion of Thinking,” may have been designed to grab attention. But the real story is more complex. These systems aren’t broken, and they aren’t magical either. They’re impressive tools with growing capabilities and well-defined limits. Knowing where those limits are helps everyone—from developers to the public—set better expectations.
As AI continues to shape the future, it’s important to understand not just what it can do, but how and why it works—or doesn’t. That’s the only way forward if the goal is truly intelligent machines.
Note: The article above was provided by The Brighter Side of News.

Joshua Shavit
Science & Technology Writer | AI and Robotics Reporter
Joshua Shavit is a Los Angeles-based science and technology writer with a passion for exploring the breakthroughs shaping the future. As a contributor to The Brighter Side of News, he focuses on positive and transformative advancements in AI, technology, physics, engineering, robotics and space science. Joshua is currently working towards a Bachelor of Science in Business Administration at the University of California, Berkeley. He combines his academic background with a talent for storytelling, making complex scientific discoveries engaging and accessible. His work highlights the innovators behind the ideas, bringing readers closer to the people driving progress.