Large language models are now matching or even exceeding human empathic accuracy from text alone, according to a new study that pits cutting-edge systems like GPT-4, Claude, and Gemini against human participants.
The study challenged models to infer emotional states from transcripts of deeply personal and emotionally complex narratives. Human participants were split: some read the same transcripts; others watched the original videos. The models had only the semantic content to work with. Remarkably, the AI systems performed on par with, or better than, the humans who also had visual and contextual cues.
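The article does not reproduce the study's querying setup, but the model side of such a design is straightforward to sketch. The snippet below, using the OpenAI Python client, shows one plausible way to elicit emotion ratings from a transcript; the prompt wording, the rating scale, and the `transcript` placeholder are assumptions, not the study's materials.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Placeholder: one narrator's transcript (semantic content only).
transcript = "..."

# Hypothetical prompt wording; the study's actual instructions are not public here.
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "system",
            "content": (
                "Rate how the narrator felt at each point in the story on a "
                "1-9 scale, from very negative to very positive. Reply with "
                "comma-separated numbers only."
            ),
        },
        {"role": "user", "content": transcript},
    ],
)
print(response.choices[0].message.content)
```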
Analysis across thousands of emotional prompts showed that the AI systems matched or exceeded human empathic accuracy for both positive and negative emotions. That suggests semantic information is far more powerful for gauging feelings than previously believed. The authors caution, however, that humans may not always fully exploit the cues available to them.
The research recruited 127 human participants for the transcript-only and video-viewing conditions, then evaluated GPT-4, Claude, and Gemini on the same emotional transcripts; the models inferred emotional states from text with accuracy equal to or surpassing human performance.
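The article doesn't include the study's scoring pipeline, but in the standard empathic-accuracy paradigm a perceiver's score is the correlation between their inferred emotion ratings and the narrator's own self-reports. A minimal sketch of that calculation, with entirely hypothetical ratings:

```python
import numpy as np

def empathic_accuracy(inferred, self_reported):
    """Pearson correlation between a perceiver's inferred emotion-intensity
    ratings and the narrator's self-reports: the usual operationalization
    of empathic accuracy."""
    return float(np.corrcoef(inferred, self_reported)[0, 1])

# Hypothetical 1-9 ratings for five moments in one narrative.
narrator    = [7, 3, 8, 2, 6]   # narrator's own self-reports
human_judge = [6, 4, 7, 3, 5]   # human perceiver, transcript only
model_judge = [7, 3, 7, 2, 6]   # LLM inference from the same transcript

print(f"human accuracy: {empathic_accuracy(human_judge, narrator):.2f}")
print(f"model accuracy: {empathic_accuracy(model_judge, narrator):.2f}")
```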
This methodology builds on growing scholarship suggesting that AI is not just mimicking emotional sensitivity but may genuinely read emotional nuance in language. In an earlier 2024 experiment, four state-of-the-art models (GPT-4, LLaMA-2-Chat, Gemini-Pro, and Mixtral-8x7B) were judged across 2,000 emotional dialogue prompts by 1,000 human raters. The models consistently outperformed humans in receiving "Good" empathy ratings, with GPT-4 registering about a 31 per cent gain over human baselines.
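Read as a relative improvement in the share of responses rated "Good", the reported figure is simple arithmetic; the human baseline below is illustrative, since only the roughly 31 per cent gain comes from the study.

```python
# Hypothetical human baseline; only the ~31% relative gain is from the study.
human_good_rate = 0.50                 # share of human responses rated "Good"
gpt4_good_rate = 0.655                 # share of GPT-4 responses rated "Good"

relative_gain = (gpt4_good_rate - human_good_rate) / human_good_rate
print(f"relative gain: {relative_gain:.0%}")  # -> relative gain: 31%
```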
Other recent work supports this shift. A 2024 study found that LLM responses to real-life prompts were rated more empathic than human responses by independent evaluators. Linguistic analysis in that work identified stylistic patterns, such as punctuation, word choice, and sentence structure, that distinguish AI-generated empathy from human-crafted empathy.
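The study's actual feature set isn't listed in the article, but surface analysis of this kind is easy to illustrate. The sketch below computes a few crude stylistic markers of the sort such work examines; the specific features are hypothetical examples, not the study's classifier.

```python
import re

def stylistic_features(text):
    """Crude surface features of the kind used to contrast AI-written and
    human-written empathic responses (illustrative feature set)."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    n_words = max(len(words), 1)
    return {
        "punct_per_word": len(re.findall(r"[.,;:!?]", text)) / n_words,
        "avg_word_len": sum(len(w) for w in words) / n_words,
        "type_token_ratio": len(set(words)) / n_words,  # lexical variety
        "exclamations": text.count("!"),
    }

print(stylistic_features("I'm so sorry you're going through this. That sounds really hard."))
```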
Newer research is adding nuance to how we understand empathic capability in AI. A 2025 paper comparing model judgments with those of expert annotators and crowdworkers found that LLMs nearly match experts at identifying empathic communication and exceed crowdworkers in consistency. Another study introduced "SENSE-7," a dataset capturing user perceptions of AI empathy across long dialogues; its results show that empathy judgments vary greatly with context and conversational continuity.
These developments force a rethinking of emotional interaction between humans and machines. If AI can accurately sense and respond to emotional states through text alone, its role in domains like mental health support, education, and companion systems becomes far more consequential.