AI models fall short in clinical conversations: Harvard study

Large language models like ChatGPT have performed well on medical exams, but they struggle with diagnostic accuracy in real-world clinical interactions. 

This is according to a new study led by researchers at Boston-based Harvard Medical School and Stanford (Calif.) University. To conduct the study, the team designed a testing framework, CRAFT-MD, to assess four AI models' conversational skills and diagnostic accuracy in scenarios mimicking real-world clinician-patient interactions. 

While all four models fared well on medical exam-style questions, they struggled with basic conversations that mimic real-world encounters. Specifically, they showed limitations in asking questions to gather relevant medical history and synthesizing scattered information to make accurate diagnoses. 

"The dynamic nature of medical conversations — the need to ask the right questions at the right time, to piece together scattered information, and to reason through symptoms — poses unique challenges that go far beyond answering multiple-choice questions," Pranav Rajpurkar, PhD, senior study author and assistant professor of biomedical informatics at Harvard Medical School, said in a news release. "When we switch from standardized tests to these natural conversations, even the most sophisticated AI models show significant drops in diagnostic accuracy."

The researchers recommended a set of criteria for developers and regulators to enhance the use of AI tools in clinical settings. These include incorporating open-ended questions that reflect physician-patient interactions into model design and training, and assessing the tools' ability to ask relevant questions and extract critical information.
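
The article describes CRAFT-MD only at a high level, but the general shape of such an evaluation harness, a simulated patient that reveals case facts only when asked, and a model under test that must question the patient before committing to a diagnosis, can be sketched. The Python below is purely illustrative and assumes nothing about the actual CRAFT-MD implementation; every name in it (Vignette, patient_reply, ask_model) is a hypothetical placeholder.

    # Minimal sketch of a simulated-patient evaluation loop in the spirit of
    # CRAFT-MD. Illustrative only; not the study's code or API.
    from dataclasses import dataclass

    @dataclass
    class Vignette:
        """One case: hidden facts the 'patient' reveals only when asked."""
        chief_complaint: str
        facts: dict        # question topic -> answer the patient gives
        diagnosis: str     # ground-truth label for grading

    def patient_reply(vignette, question):
        """Simulated patient: answer only what the question touches on."""
        for topic, answer in vignette.facts.items():
            if topic in question.lower():
                return answer
        return "I'm not sure."

    def evaluate(ask_model, vignette, max_turns=5):
        """Run a multi-turn encounter, then grade the final diagnosis."""
        transcript = [f"Patient: {vignette.chief_complaint}"]
        for _ in range(max_turns):
            question = ask_model(transcript)       # model asks next question
            if question.startswith("DIAGNOSIS:"):  # model commits to an answer
                guess = question.removeprefix("DIAGNOSIS:").strip()  # Python 3.9+
                return guess.lower() == vignette.diagnosis.lower(), transcript
            transcript.append(f"Doctor: {question}")
            transcript.append(f"Patient: {patient_reply(vignette, question)}")
        return False, transcript                   # ran out of turns

    # Toy run with a scripted 'model' standing in for a real LLM call:
    case = Vignette(
        chief_complaint="I've had a rash on my arm for two weeks.",
        facts={"itch": "Yes, it itches badly at night.",
               "travel": "I went camping last month."},
        diagnosis="contact dermatitis",
    )
    scripted = iter(["Does it itch?", "Any recent travel?",
                     "DIAGNOSIS: contact dermatitis"])
    correct, log = evaluate(lambda t: next(scripted), case)
    print(correct)  # True

A real harness would replace the scripted stand-in with calls to the model under test and grade many vignettes, but the loop structure (ask, listen, then commit to a diagnosis) is the part the study argues exam-style benchmarks fail to exercise.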
