Scientists are rethinking how much we can trust ChatGPT

A new study found ChatGPT improved slightly, but still gave inconsistent answers to the same research questions.

Written By: Shy Cohen
Edited By: Joseph Shavit
Study finds ChatGPT often answers research hypotheses correctly, but remains inconsistent and weak at spotting false claims. (CREDIT: Shutterstock)

Some answers looked right. Ask the same question again, and the answer might flip.

That was the unsettling pattern Washington State University professor Mesut Cicek and his colleagues found when they tested ChatGPT against 719 hypotheses pulled from business research papers. The team repeatedly fed the AI statements from scientific articles and asked a simple question: did the research support the hypothesis, yes or no?

The system often sounded confident. It was not always dependable.

In mid-2024, the free version of ChatGPT-3.5 answered correctly 76.5% of the time. When the researchers repeated the experiment in mid-2025 using GPT-5 mini, accuracy rose to 80%. That improvement was statistically significant, but small. Once the team adjusted for the fact that a true-or-false guess has a 50% chance of being right, the model’s effective performance dropped sharply.
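That chance-level adjustment can be sketched as a simple rescaling, under which random guessing scores 0 and perfect accuracy scores 1. The `chance_adjusted` helper below is illustrative; the paper may use a different correction formula.

```python
def chance_adjusted(accuracy: float, chance: float = 0.5) -> float:
    """Rescale raw accuracy so that random guessing maps to 0 and
    perfect accuracy maps to 1 (a standard correction for guessing)."""
    return (accuracy - chance) / (1.0 - chance)

# Raw accuracies reported in the study
print(round(chance_adjusted(0.765), 2))  # 2024, GPT-3.5 -> 0.53
print(round(chance_adjusted(0.80), 2))   # 2025, GPT-5 mini -> 0.6
```

On this view, a 76.5% raw score on true-or-false questions is only about halfway between guessing and perfect performance.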

In mid-2024, the free version of ChatGPT-3.5 answered correctly 76.5% of the time. (CREDIT: Shutterstock)

Cicek said that gap matters because a polished answer can create more trust than it deserves.

“We're not just talking about accuracy, we're talking about inconsistency, because if you ask the same question again and again, you come up with different answers,” said Cicek, an associate professor in the Department of Marketing and International Business in WSU’s Carson College of Business and lead author of the paper.

The findings were published in the Rutgers Business Review.

When the same prompt gets different answers

The researchers extracted 719 hypothesis statements from 127 open-access articles published since 2021 in nine marketing and management journals. Each hypothesis described a formal, testable relationship, such as a main effect, mediation, or moderation.

Then they gave each statement to ChatGPT 10 times using the exact same prompt.

That repetition exposed one of the study’s central concerns.

“We used 10 prompts with the same exact question. Everything was identical. It would answer true. Next, it says it’s false. It’s true, it’s false, false, true. There were several cases where there were five true, five false,” Cicek said.

This chart illustrates variations in ChatGPT’s estimation accuracy across ten identical prompts for two consecutive years, showing modest improvement but no consistent trend across prompts. (CREDIT: Rutgers Business Review)

On average, prompt-level consistency improved from 80.2% in 2024 to 86.8% in 2025. But the stricter measure was less reassuring. Only 66.3% of hypotheses in 2024 were answered correctly across all 10 repeated prompts. In 2025, that figure rose to 72.9%.

That still left more than a quarter of cases with at least one wrong answer, even though the wording never changed.
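The gap between those two measures can be sketched in a few lines of Python. The toy data below is invented for illustration; only the scoring logic mirrors the study's two metrics (average per-prompt accuracy versus the stricter all-ten-correct rate).

```python
# Toy data: for each hypothesis, ten repeated answers scored as
# correct (1) or incorrect (0). Values are illustrative only.
responses = {
    "H1": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],  # consistent and correct
    "H2": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],  # answer flips between runs
    "H3": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],  # one wrong repetition
}

# Lenient measure: accuracy averaged over every individual prompt.
per_prompt = sum(sum(r) for r in responses.values()) / sum(
    len(r) for r in responses.values()
)

# Strict measure: share of hypotheses answered correctly on all ten runs.
all_correct = sum(all(r) for r in responses.values()) / len(responses)

print(per_prompt)   # 24/30 = 0.8
print(all_correct)  # 1/3, since only H1 survives the strict test
```

As the toy numbers show, a model can look strong on the lenient measure while most hypotheses still fail the strict all-runs-correct test.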

The study also found a stubborn failure on unsupported hypotheses. ChatGPT correctly identified false statements only 13.6% of the time in 2024 and 16.4% in 2025. According to the authors, that suggests a tendency to favor confirming the statement it is given.

Fluent language, shallow reasoning

The paper argues that this is a deeper problem than occasional error. Large language models can produce persuasive, polished responses, yet still miss the logic behind a question.

“Current AI tools don't understand the world the way we do — they don't have a ‘brain,’” Cicek said. “They just memorize, and they can give you some insight, but they don't understand what they’re talking about.”

The pattern was especially clear when the researchers compared different types of hypotheses. ChatGPT performed best on mediation hypotheses, which follow a more linear chain of reasoning. It did less well on main effects and worst on moderation hypotheses, which require more contextual or conditional thinking.

The authors argue that this suggests the models are better at following linguistic structure than reasoning through shifting conditions or boundary effects. In their words, the systems can reproduce the language of logic without fully handling the logic itself.

Correct Responses per Accuracy Bracket. (CREDIT: Rutgers Business Review)

That may help explain why improvement from one year to the next was modest. Accuracy increased by 3.5 percentage points, but the researchers say the gain looked more like refinement in text processing than a leap in conceptual understanding.

Cicek’s co-authors were Sevincgul Ulu of Southern Illinois University, Can Uslay of Rutgers University, and Kate Karniouchina of Northeastern University.

A warning for managers, not a call to ditch AI

The paper does not argue that AI has no value. It argues that people should be careful about where they trust it.

The authors say generative AI can still be useful in structured tasks, especially when language is clear and the reasoning is straightforward. They point to settings such as A/B testing, experimental design, or campaign simulation as places where a first-pass tool may help spot patterns and speed up work.

But they also warn that managers, consultants, analysts, and researchers should not confuse speed with understanding. A fluent answer can hide weak logic, especially in high-stakes situations involving context, conditions, or indirect effects.

The study has limitations. The researchers note that even peer-reviewed studies do not guarantee that every supported hypothesis is truly “true” and every unsupported one is “false.” They also limited their consistency test to 10 identical prompts per hypothesis, though Cicek said similar checks with more prompts and other AI platforms produced comparable patterns.

“Always be skeptical,” he said. “I'm not against AI. I’m using it. But you need to be very careful.”

Practical implications of the research

For businesses and researchers, the message is simple: AI can help organize ideas, summarize material, and speed up routine analytical work, but it still needs human checking.

Repeating prompts, verifying results, and training employees to question confident AI output may matter as much as using the tools in the first place.

The study suggests that today’s systems work better as assistants than as decision-makers.

Research findings are available online in the journal Rutgers Business Review.

The original story "Scientists are rethinking how much we can trust ChatGPT" was published in The Brighter Side of News.





Shy Cohen, Science and Technology Writer

Shy Cohen is a Washington-based science and technology writer covering advances in artificial intelligence, machine learning, and computer science. He reports news and writes clear, plain-language explainers that examine how emerging technologies shape society. Drawing on decades of experience, including long tenures at Microsoft and work as an independent consultant, he brings an engineering-informed perspective to his reporting. His work focuses on translating complex research and fast-moving developments into accurate, engaging stories, with a methodical, reader-first approach to research, interviews, and verification.