The growing risks from chatbots that act just like you

New research shows chatbots can roleplay as human users and often go undetected, a finding that carries both promise and risk for the future of AI.

Researchers are teaching chatbots to roleplay as human users. The method makes AI more convincing than ever—and harder to distinguish from people. (CREDIT: Shutterstock)

Researchers are now showing that artificial intelligence can do something few people ever considered: convincingly play the part of the user in a conversation.

A team from the Technical University of Darmstadt has created a method called LLM Roleplay, which lets one chatbot impersonate a human while another plays the role of the responder. In many cases, people reading these conversations could not tell which dialogue came from a person and which came from a machine. The work was presented this summer at the Social Influence in Conversations workshop in Vienna.

The project set out to answer two questions. How close can AI-driven conversations come to mimicking real human–chatbot talk? And which models are the best at pretending? To find out, the researchers built a system that generates dialogues, then tested them against real ones to see if evaluators could spot the difference.

UBC computer scientist Dr. Vered Shwartz. (CREDIT: Alex Walls)

Why Simulate a User?

Gathering authentic user conversations for training is both expensive and narrow in scope. Real exchanges often cover only a handful of responses and leave little room for exploring new topics. They also usually assume an “average user,” ignoring how age, education, or language background shape the way people interact with chatbots.

That’s where roleplay comes in. The Darmstadt method begins by writing out a persona—maybe a college student, a retired teacher, or someone whose first language isn’t English—and pairs it with a specific goal such as solving a math problem or writing code.

The inquirer chatbot adopts this identity and chases the goal in a multi-turn conversation with another chatbot acting as the responder. By setting rules around goals, personas, and stopping points, the system produces conversations that feel more authentic.

How the Roleplay Works

The pipeline runs in three parts. First, the inquirer is primed with its persona and goal. Second, it produces a quoted prompt for the responder. Third, the responder’s answer is fed back, and the inquirer either continues or ends the exchange with the word “FINISH.”
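
To make those three steps concrete, here is a minimal sketch in Python of how such a loop could be wired up. The chat() helper, the message format, and the exact prompt wording are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the three-step roleplay loop described above (not the authors' code).
# The chat() helper, message format, and prompt wording are assumptions for illustration.

def chat(model, messages):
    # Placeholder: swap in a real LLM call here (e.g., a local Llama-2 or a hosted API).
    raise NotImplementedError

def roleplay_dialogue(inquirer_model, responder_model, persona, goal, max_turns=10):
    """Have an 'inquirer' LLM play a human user chasing a goal against a responder chatbot."""
    system_prompt = (
        f"You are roleplaying this persona: {persona}\n"
        f"Your goal: {goal}\n"
        'Write your next message to the chatbot in double quotes. '
        "When the goal is reached, reply with the single word FINISH."
    )
    inquirer_history = [{"role": "system", "content": system_prompt}]
    responder_history = []
    dialogue = []

    for _ in range(max_turns):
        # Step 1: the inquirer, primed with persona and goal, writes the next user prompt.
        inquirer_out = chat(inquirer_model, inquirer_history)
        if "FINISH" in inquirer_out:
            break  # the inquirer decided the goal has been reached

        # Step 2: the quoted prompt is passed to the responder chatbot.
        user_msg = inquirer_out.strip().strip('"')
        responder_history.append({"role": "user", "content": user_msg})
        answer = chat(responder_model, responder_history)
        responder_history.append({"role": "assistant", "content": answer})
        dialogue.append({"user": user_msg, "assistant": answer})

        # Step 3: the responder's answer is fed back so the inquirer can continue or stop.
        inquirer_history.append({"role": "assistant", "content": inquirer_out})
        inquirer_history.append({"role": "user", "content": answer})

    return dialogue
```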

Schematic illustration of the method: a textual description of a persona and a goal (top) is used to instruct the inquirer (SI) model to embody the given persona (left) and engage in a dialogue with the responder (SR) chatbot (right). (CREDIT: SICon 2025)

The setup also comes with detectors for common slip-ups. Sometimes the inquirer forgets to use quotes, repeats itself, answers its own prompt, or refuses to stop. Guardrails catch these problems. A safety filter from Llama-2 was also added, which the team said prevented unsafe content.
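
For illustration, checks of that kind might look something like the sketch below. The specific heuristics here are assumptions made for this example; the paper's detectors may be implemented quite differently.

```python
import re

def check_inquirer_turn(turn_text, previous_turns, turn_index, max_turns):
    """Flag common slip-ups in one inquirer turn (illustrative heuristics only)."""
    problems = []

    # Missing quotes: the prompt meant for the responder should appear in double quotes.
    if "FINISH" not in turn_text and not re.search(r'"[^"]+"', turn_text):
        problems.append("missing_quotes")

    # Repetition: the inquirer repeats one of its earlier prompts verbatim.
    normalized = turn_text.strip().lower()
    if any(normalized == prev.strip().lower() for prev in previous_turns):
        problems.append("repetition")

    # Self-answering: the turn both asks a question and appears to answer it.
    if "?" in turn_text and re.search(r"\bthe answer is\b|\bhere is how\b", turn_text.lower()):
        problems.append("possible_self_answer")

    # Refusing to stop: the turn limit is reached without a FINISH signal.
    if turn_index >= max_turns - 1 and "FINISH" not in turn_text:
        problems.append("no_finish")

    return problems
```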

To compare humans with roleplayed users, the researchers first recruited 20 people to complete 10 tasks each with a Llama-2 chatbot, producing 200 human–AI dialogues. The volunteers shared demographic details, but their identities were kept private.

Next, the same personas and tasks were handed to four different models: Llama-2, Mixtral, Vicuna, and GPT-4. Together, they generated 800 simulated conversations. On average, these roleplayed dialogues ran about five turns, longer than the dialogues in many typical training datasets.

Who Fooled People Best?

Another group of 20 reviewers judged pairs of conversations, one real and one synthetic, to see if they could tell them apart. In about one-third of cases, they couldn’t. Mixtral was the most convincing, fooling people 44% of the time, followed by GPT-4 at 35%, Llama-2 at 33.5%, and Vicuna at 22.5%.

The distribution of detectability rates (left) and undetectability rates (right) per model for Llama-2, Mixtral, Vicuna, and GPT-4. (CREDIT: SICon 2025)

Interestingly, GPT-4 often held out longer before people caught on. Reviewers usually made their decision after the third or fourth message in a GPT-4 dialogue, suggesting it produced more sustained and believable exchanges. Mixtral, however, balanced quality with ambiguity, leaving reviewers less confident in their guesses.

Strengths and Stumbles

Each model had its quirks. GPT-4 produced the longest conversations with few errors but often violated the rule of producing only one quoted prompt at a time. Mixtral kept things concise and balanced. Vicuna rarely broke the quoting rule but often ended up answering its own questions, which ruined the illusion. Llama-2 had the smoothest back-and-forth with its paired system.

The researchers also tested how much personas actually shaped responses. With Mixtral, adding sociodemographic features like age or education significantly increased the variety of words and phrases used. A fuller persona produced the richest and most human-like language.
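
One common way to quantify that kind of lexical variety is a distinct-n ratio: the share of unique words and word pairs in the generated text. The sketch below uses that generic measure as a stand-in; it is an assumption, not necessarily the metric the researchers used.

```python
def lexical_diversity(texts):
    """Distinct-1 and distinct-2 ratios: unique unigrams/bigrams over total counts."""
    tokens = [tok for text in texts for tok in text.lower().split()]
    bigrams = list(zip(tokens, tokens[1:]))
    return {
        "distinct_1": len(set(tokens)) / max(len(tokens), 1),
        "distinct_2": len(set(bigrams)) / max(len(bigrams), 1),
    }

# Usage idea: compare dialogues generated with and without sociodemographic persona details.
# richer = lexical_diversity(turns_with_full_persona)
# plainer = lexical_diversity(turns_with_goal_only)
```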

The point of LLM Roleplay isn’t just to trick readers. It’s to make better synthetic training data for future chatbots. By creating conversations that reflect different kinds of users, developers can train systems to handle a wider variety of voices and goals. This could help reduce bias and improve fairness, ensuring chatbots don’t just cater to one “average” style of interaction.

Analysis of human-evaluation results for detected, undetected, and total dialogues for Llama-2, Mixtral, Vicuna, and GPT-4, showing confidence statistics as occurrences (percentages in parentheses). (CREDIT: SICon 2025)

But the work also highlights risks. Personas touch on sensitive identity details, and even with safety filters, they could surface stereotypes or biases. The researchers stressed the need for further testing, broader participant pools, and stronger safeguards to make sure the tool is used responsibly.

When AI Persuades Better Than Humans

Meanwhile, researchers at the University of British Columbia (UBC) have been exploring a different concern: how persuasive AI can be. Dr. Vered Shwartz and her team tested GPT-4 against human persuaders in conversations about lifestyle choices such as becoming vegan, buying an electric car, or attending graduate school.

Participants were more swayed by the AI than by humans across all topics. The AI made longer arguments, used more elaborate vocabulary, and offered specific practical advice—like naming vegan brands or universities—that made it feel more authoritative. It also tended to agree more often and sprinkle in more pleasantries, which participants appreciated.

Humans, however, were better at asking probing questions, showing that people still have an edge in curiosity and adaptability.

Safeguards for the Future

Shwartz warns that the persuasiveness of AI has real consequences. Large language models are already used in marketing, news, and entertainment, and their ability to influence could easily be misused. She argues that the priority now should not be whether AI is allowed in these areas—it already is—but how to protect people from manipulative uses.

Education is one safeguard. People need to understand how these systems are trained, what they can and cannot do, and how to check the information they produce. Critical thinking is another. If something feels too polished or too alarming, it’s worth investigating before accepting it.

The UBC team also suggests that companies could build warning systems into chat platforms that flag when users appear to be in distress, and that regulators might consider different approaches to AI beyond today’s generative models.

Practical Implications of the Research

Together, these studies suggest both promise and peril. Tools like LLM Roleplay could create vast amounts of diverse training data, leading to chatbots that are more adaptable, fairer, and better at helping real people. At the same time, the finding that AI can be more persuasive than humans shows why society needs to tread carefully.

If models can play the role of the user and even out-argue human persuaders, then transparency, safeguards, and strong ethical guidelines become essential.

The benefit is clear: safer, more responsive AI that reflects the voices of different communities. But the risk is equally pressing: powerful tools that could be misused to manipulate or misinform.

The future impact will depend on how thoughtfully researchers, companies, and policymakers balance these two sides.

Research findings are available online in the proceedings of SICon 2025.




Shy Cohen
Science & Technology Writer

Shy Cohen is a Washington-based science and technology writer covering advances in AI, biotech, and beyond. He reports news and writes plain-language explainers that analyze how technological breakthroughs affect readers and society. His work focuses on turning complex research and fast-moving developments into clear, engaging stories. Shy draws on decades of experience, including long tenures at Microsoft and his independent consulting practice to bridge engineering, product, and business perspectives. He has crafted technical narratives, multi-dimensional due-diligence reports, and executive-level briefs, experience that informs his source-driven journalism and rigorous fact-checking. He studied at the Technion – Israel Institute of Technology and brings a methodical, reader-first approach to research, interviews, and verification. Comfortable with data and documentation, he distills jargon into crisp prose without sacrificing nuance.