The Decoder February 16, 2025 IDIOCRACY

Why ChatGPT Therapy Replies Are Getting Hard to Spot

A study published in PLOS Mental Health found that 830 participants had trouble telling ChatGPT therapy responses from those written by human therapists. ChatGPT responses were often rated higher for empathy, therapeutic alliance, and cultural competence, but researchers also point to major limits and the need for careful clinical oversight.

A new psychotherapy study adds a sharper edge to a familiar question about artificial intelligence: when an answer feels supportive, detailed, and empathetic, does the reader know whether it came from a human being or a machine?

In research published in PLOS Mental Health, participants asked to judge therapeutic responses struggled to separate ChatGPT from human therapists. The findings do not turn ChatGPT into a therapist, but they do show why AI in mental health care is becoming harder for professionals, patients, and researchers to ignore.

What The Study Tested

The study applied the idea of the classic Turing test to psychotherapy. Alan Turing’s original concept asks whether people can tell if they are interacting with a machine or a person. Here, researchers used that logic in a mental health setting by asking 830 participants to identify whether therapeutic responses came from ChatGPT or from human therapists.

The comparison was based on 18 couples therapy case studies. The researchers evaluated responses from 13 experienced therapists alongside responses generated by ChatGPT.

The results were close to chance, which for a two-way choice is 50 percent. Participants correctly identified human therapist responses 56.1 percent of the time. They correctly identified ChatGPT responses 51.2 percent of the time.

Those numbers matter because they suggest that, in short written therapy scenarios, the boundary between AI-generated support and professional human language was not obvious to readers. The study did not examine full therapy relationships or live sessions, but within its format, ChatGPT was difficult to detect.

Why ChatGPT Scored So Well

The study found that ChatGPT did more than imitate therapeutic language. Its responses were rated higher than those of the human experts on measures of therapeutic quality, including therapeutic alliance, empathy, and cultural competence.

The source of that advantage appears to be partly linguistic. ChatGPT consistently produced longer responses. Its answers also had a more positive tone and used more nouns and adjectives. Those features likely made the replies feel more complete, attentive, and empathetic to the people rating them.

That does not prove the AI experienced empathy. It shows that the text it produced was often perceived as empathetic. In a written response, length, tone, detail, and word choice can strongly shape how supportive an answer feels.

For mental health professionals, that distinction is important. A response can read as caring without being grounded in human understanding. At the same time, if patients or readers experience the response as useful, the practical impact of AI-generated language becomes difficult to dismiss.

The Human Label Still Changes The Rating

The study also found a notable bias in how people judged the answers. When participants believed they were reading AI-generated responses, they rated them lower, regardless of whether the response was actually written by ChatGPT or by a human therapist.

The opposite pattern also appeared. AI-generated responses received their highest ratings when participants mistakenly believed they had been written by human therapists.

This creates a contradiction at the center of the findings. Readers had trouble reliably identifying the source of the responses, yet their opinion of the same kind of response shifted depending on whether they thought it came from AI or a person.

That matters for the future of AI in psychotherapy because trust is not based only on quality ratings. It is also shaped by the perceived source of care. A reader may value an answer less when it is labeled as AI, even if the answer is similar to one they would rate highly under a human label.

Where The Evidence Fits

The psychotherapy study is not the only research pointing to strong AI performance in advisory contexts. Research from the University of Melbourne and the University of Western Australia found that ChatGPT provided more balanced, comprehensive, and empathetic advice on social dilemmas than human advice columnists, with preference rates between 70 and 85 percent.

Even there, the same tension appeared. In the Australian study, 77 percent of participants said they would rather receive advice from humans, even though they could not reliably distinguish between AI and human responses.

Other evidence points in a similar direction. A study from April 2023 found that people rated AI responses to medical diagnoses as more empathetic and higher quality than responses from doctors. ChatGPT has also shown strong results on emotional awareness, scoring 98 out of 100 on the Levels of Emotional Awareness Scale (LEAS), a standardized test of emotional awareness, compared with typical human scores of 56 to 59 points.

Taken together, these findings suggest that AI systems can generate advice-like and care-like text that many people evaluate very favorably. But they also show that preference, trust, and perceived legitimacy do not always move in the same direction.

Why Researchers Still Urge Caution

The study’s own limits are significant. It relied on brief, hypothetical therapy scenarios, not real therapy sessions. The researchers also questioned whether findings from couples therapy would apply in the same way to individual counseling.

That means the study should be read as evidence about written responses in a controlled comparison, not as a full assessment of AI-led psychotherapy. Therapy involves context, continuity, judgment, and responsibility, and a comparison of brief written replies does not resolve those issues.

Researchers from Stanford University and the University of Texas also urge caution about ChatGPT’s use in psychotherapy. They argue that large language models lack a true "theory of mind" and cannot experience genuine empathy. They also call for an international research initiative to establish guidelines for safe integration of AI in psychology.

The practical takeaway is not that AI should replace clinicians. It is that mental health professionals need to understand these systems as their possible role in care grows. The researchers emphasize that responsible clinicians must carefully train and monitor AI models to maintain high standards of care.

ChatGPT’s performance in this study shows why the debate is no longer abstract. If people cannot easily tell AI therapy replies from human ones, and if those replies can score highly on qualities associated with care, then the central question becomes how to use that capability responsibly while recognizing what the system does not truly possess.