A new psychotherapy study adds a sharper edge to a familiar question about artificial intelligence: when an answer feels supportive, detailed, and empathetic, does the reader know whether it came from a human being or a machine?
In research published in PLOS Mental Health, participants asked to judge therapeutic responses struggled to separate ChatGPT from human therapists. The findings do not turn ChatGPT into a therapist, but they do show why AI in mental health care is becoming harder for professionals, patients, and researchers to ignore.
What The Study Tested
The study applied the idea of the classic Turing test to psychotherapy. Alan Turing’s original concept asks whether people can tell if they are interacting with a machine or a person. Here, researchers used that logic in a mental health setting by asking 830 participants to identify whether therapeutic responses came from ChatGPT or from human therapists.
The comparison was based on 18 couples therapy case studies. The researchers evaluated responses from 13 experienced therapists alongside responses generated by ChatGPT.
The results were close to chance. With only two possible sources, random guessing would be right about 50 percent of the time. Participants correctly identified human therapist responses 56.1 percent of the time and ChatGPT responses 51.2 percent of the time.
Those numbers matter because they suggest that, in short written therapy scenarios, the boundary between AI-generated support and professional human language was not obvious to readers. The study did not examine full therapy relationships or live sessions, but within its format, ChatGPT was difficult to detect.
Why ChatGPT Scored So Well
The study found that ChatGPT did more than merely imitate therapeutic language. Its responses outperformed human experts on measures of therapeutic quality, including therapeutic alliance, empathy, and cultural competence.
The source of that advantage appears to be partly linguistic. ChatGPT consistently produced longer responses. Its answers also had a more positive tone and used more nouns and adjectives. Those features likely made the replies feel more complete, attentive, and empathetic to the people rating them.
That does not prove the AI experienced empathy. It shows that the text it produced was often perceived as empathetic. In a written response, length, tone, detail, and word choice can strongly shape how supportive an answer feels.
For mental health professionals, that distinction is important. A response can read as caring without being grounded in human understanding. At the same time, if patients or readers experience the response as useful, the practical impact of AI-generated language becomes difficult to dismiss.
The Human Label Still Changes The Rating
The study also found a notable bias in how people judged the answers. When participants believed they were reading AI-generated responses, they rated them lower, regardless of whether the response was actually written by ChatGPT or by a human therapist.
The opposite pattern also appeared. AI-generated responses received their highest ratings when participants mistakenly believed they had been written by human therapists.
This creates a contradiction at the center of the findings. Readers had trouble reliably identifying the source of the responses, yet their opinion of the same kind of response shifted depending on whether they thought it came from AI or a person.
That matters for the future of AI in psychotherapy because trust is not based only on quality ratings. It is also shaped by the perceived source of care. A reader may value an answer less when it is labeled as AI, even if the answer is similar to one they would rate highly under a human label.
Where The Evidence Fits
The psychotherapy study is not the only research pointing to strong AI performance in advisory contexts. Research from the University of Melbourne and the University of Western Australia found that ChatGPT provided more balanced, comprehensive, and empathetic advice on social dilemmas than human advice columnists, with preference rates between 70 and 85 percent.
Even there, the same tension appeared. In the Australian study, 77 percent of participants said they would rather receive advice from humans, even though they could not reliably distinguish between AI and human responses.
Other evidence points in a similar direction. A study from April 2023 found that people rated AI responses to medical diagnoses as more empathetic and higher quality than responses from doctors. ChatGPT has also shown strong results on emotional awareness, scoring 98 out of 100 on the Levels of Emotional Awareness Scale (LEAS), a standardized test of emotional awareness, compared with typical human scores of 56 to 59 points.
Taken together, these findings suggest that AI systems can generate advice-like and care-like text that many people evaluate very favorably. But they also show that preference, trust, and perceived legitimacy do not always move in the same direction.
Why Researchers Still Urge Caution
The study’s own limits are significant. It relied on brief, hypothetical therapy scenarios, not real therapy sessions. The researchers also questioned whether findings from couples therapy would apply in the same way to individual counseling.
That means the study should be read as evidence about written responses in a controlled comparison, not as a full assessment of AI-led psychotherapy. Therapy involves context, continuity, judgment, and responsibility, and the study does not claim to resolve those issues.
Researchers from Stanford University and the University of Texas also urge caution about ChatGPT’s use in psychotherapy. They argue that large language models lack a true "theory of mind" and cannot experience genuine empathy. They also call for an international research initiative to establish guidelines for safe integration of AI in psychology.
The practical takeaway is not that AI should replace clinicians. It is that mental health professionals need to understand these systems as their possible role in care grows. The researchers emphasize that responsible clinicians must carefully train and monitor AI models to maintain high standards of care.
ChatGPT’s performance in this study shows why the debate is no longer abstract. If people cannot easily tell AI therapy replies from human ones, and if those replies can score highly on qualities associated with care, then the central question becomes how to use that capability responsibly while recognizing what the system does not truly possess.