Is AI voice technology becoming indistinguishable from human speech?
Even when people cannot tell AI voices from human ones, the brain may still detect subtle acoustic differences.

Edited By: Joseph Shavit

Study finds the brain detects subtle differences between AI and human speech even when people cannot consciously tell them apart. (CREDIT: Shutterstock)
Synthetic voice generation technology has progressed so quickly that many listeners may have difficulty determining whether a sentence was spoken by a human being or an artificially created voice. Voice-activated personal assistants, automated book narration systems, and automated telephone customer service systems can now deliver messages with a fluency that is nearly indistinguishable from natural human speech. However, our brains may still be able to detect subtle indicators that reveal the source of the speech.
Recent findings indicate that traces of these differences exist, even though humans cannot accurately point them out through conscious judgment.
A joint research project between researchers at Tianjin University and the Chinese University of Hong Kong examined whether human listeners could accurately distinguish between human speech and speech generated by artificial intelligence. The researchers also examined whether a brief period of training could increase listeners' ability to distinguish between those types of speech. Published in the journal eNeuro, the study involved behavioral assessments and a brain-mapping technique to understand how listeners responded to synthetic speech.
When Artificial Speech Fools Our Ear
The capabilities of artificial speech have expanded rapidly over the last decade. It is the backbone of everyday tools such as Apple Siri and Microsoft Cortana, and modern systems can produce highly realistic voices from only a small sample of recorded speech.
As attractive as its realism may be, the development of synthetic speech also poses a risk for misuse. AI-generated voices can impersonate real-life individuals, increasing opportunities for committing fraudulent acts. Scientists began to question whether people are able to consistently distinguish machine-generated speech from human speech.
In general, individuals with normal hearing can distinguish one voice from another. Each person's voice carries unique acoustic characteristics, such as its pitch, speaking rate, and accent, although some of these qualities can overlap between speakers.
Research has shown that even babies prefer the sound of their mother's voice. AI-generated speech, however, presents a new type of sound, one assembled from multiple elements of natural spoken language.
Testing Human Listeners
Synthetic voices created with advanced machine-learning algorithms can now closely match human voices on simple phonological and syntactic tests, tests that were traditionally thought to be too difficult for machines to pass.
The research team recruited 35 individuals aged 20 to 32 who were in good physical health. All participants were native Mandarin speakers.
In the study, participants listened to short phrases and indicated whether they heard a human voice or an AI-produced voice.
The three types of speech analyzed were speech recordings created by real human speakers, AI-generated voices trained using recordings of real humans, and AI-generated voices created without training.
Training and Behavioral Results
To create synthetic speech, researchers used the GPT-SoVITS speech synthesis program. All participants completed an unstructured testing session prior to training and then completed a structured training session that lasted around 12 minutes. During training, participants were given labeled samples of AI and real human speech so they could become familiar with the differences.
After training, participants again engaged in unstructured testing sessions.
The findings showed little difference between the unstructured and structured sessions in participants' ability to differentiate AI-generated speech from real human speech. Discrimination sensitivity remained low throughout all testing sessions, meaning participants struggled to tell the two types of voice apart.
According to the researchers' statistical analyses, the brief training produced only small performance improvements. Many participants remained unsure whether a voice came from an actual person or from a synthesizer.
However, the way in which they listened changed because of the training. After hearing labeled examples, participants appeared to adapt by taking a more cautious approach. They were more inclined to label voices as being produced by artificial intelligence.
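The pattern described above, low sensitivity combined with a shift toward answering "AI," is exactly what signal detection theory separates into a sensitivity measure (d') and a response criterion (c). As a rough illustration only, with invented trial counts rather than the study's actual data, the sketch below computes both quantities for a listener who labels most voices "AI" after training:

```python
from statistics import NormalDist

def sdt_measures(hits, misses, false_alarms, correct_rejections):
    """Compute d' (sensitivity) and c (criterion) from trial counts,
    treating a "said AI" response on an AI trial as a hit."""
    # Log-linear correction avoids infinite z-scores at rates of 0 or 1.
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf
    d_prime = z(hit_rate) - z(fa_rate)             # higher = better discrimination
    criterion = -0.5 * (z(hit_rate) + z(fa_rate))  # negative = bias toward "AI"
    return d_prime, criterion

# Hypothetical post-training listener: right on most AI trials,
# but only because they call almost everything "AI".
d, c = sdt_measures(hits=45, misses=5, false_alarms=40, correct_rejections=10)
```

Here d' stays low even though the hit rate looks high, while the strongly negative criterion captures the cautious "call it AI" strategy the participants adopted.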
Differences Throughout the Brain
The recordings of brain waves provided a different and compelling insight.
The researchers used EEG to record participants' brain activity as they listened to the sentences. Even though participants' behavior remained largely unchanged, the recorded neural signals began to show separation between human voices and AI voices following training.
This neural separation appeared at several latencies during auditory processing: approximately 55 milliseconds, 210 milliseconds, and 455 milliseconds after stimulus onset.
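To give a sense of how such latency effects are typically quantified, here is a toy sketch using simulated rather than real EEG data: epochs from two conditions are averaged and their mean amplitudes compared in a window around one of the reported latencies (210 ms). The sampling rate, effect size, and noise level are all assumptions for illustration, not values from the study.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 500                            # sampling rate in Hz (assumed)
t = np.arange(-0.1, 0.6, 1 / fs)    # epoch from -100 ms to 600 ms

def simulate_epochs(n_trials, amplitude):
    """Toy EEG epochs: Gaussian noise plus a small deflection near 210 ms."""
    noise = rng.normal(0.0, 5.0, size=(n_trials, t.size))
    bump = amplitude * np.exp(-((t - 0.21) ** 2) / (2 * 0.02 ** 2))
    return noise + bump

human = simulate_epochs(100, amplitude=0.0)
ai = simulate_epochs(100, amplitude=2.0)

# Compare mean amplitude in a window around the 210 ms latency.
window = (t >= 0.18) & (t <= 0.24)
diff = ai[:, window].mean() - human[:, window].mean()
```

Averaging across many trials washes out the noise and leaves the small condition difference visible, which is why EEG can reveal a separation that single responses, and conscious judgments, do not show.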
"The auditory brain is starting to be able to detect subtle differences that may not yet be able to produce a decision at that point since listeners have not been accurately producing decisions," states Dr. Teng. "This is a positive sign that training can assist individuals in the future to develop better methods for discerning between the two forms of speech. People are adjusting to AI-produced speech, and the fact that people have yet to discern all information may simply mean the right signals have not yet been employed."
Synthetic Speech's Acoustic Fingerprint
The researchers also examined the waveforms of both human and AI voices to find differences between the two groups.
Human and AI voices differ measurably in their modulation spectra, most notably in the frequency range of approximately 5.4 to 11.7 Hz. These modulation patterns shape how natural a speaker's rhythm and phonetic transitions sound to the listener.
Although current AI systems can mathematically reproduce many standard characteristics of human voices, they often cannot match the moment-to-moment variations in articulation and speed found in natural human speech.
The auditory system appears to be affected by these subtle individual differences. This occurs even though listeners may not notice those differences at the level of conscious thought.
The Benefits of Training on the Brain
As noted above, the research reveals a mismatch between what listeners consciously perceive and how their brains respond.
In essence, the auditory system may already detect signals that could help distinguish speech produced by AI from speech produced by humans. The small amount of training provided may give the brain an opportunity to begin tuning in to these signals, but they do not yet appear strong enough to drive consistent behavioral responses.
Future studies may determine whether extended training, repeated exposure to the same or similar voices, or increased familiarity with the context of speech will enhance a person's ability to discriminate between artificial and natural speech.
Practical Implications of the Research
As the use of AI-generated speech becomes more common, it will likely become important for humans to develop the capability to identify and distinguish AI-produced synthetic speech.
Based on the findings of the research, humans may already have the cognitive ability to detect artificial voices. However, they may require specific targeted training to increase the chances of identifying synthetic speech from natural speech.
The knowledge gained from this research may help create improved methods for identifying synthetic speech. These methods could include improved training systems, detection tools, or educational programs designed to reduce the risks associated with voice-based deepfakes.
Research findings are available online in the journal eNeuro.
The original story "Is AI voice technology becoming indistinguishable from human speech?" was published in The Brighter Side of News.
Related Stories
- AI systems learn better when trained with inner speech like humans
- AI reveals clues to how the human brain understands speech
- Breakthrough technology reconstructs speech directly from brain activity
Shy Cohen
Writer



