Large language models are poor medical coders: Study

Researchers at the Icahn School of Medicine at New York City-based Mount Sinai found that large language models were poor medical coders.

In a study published April 19 in NEJM AI, the researchers gathered more than 27,000 distinct diagnosis and procedure codes from a year of routine care at Mount Sinai Health System. They then used the description of each code to prompt models from OpenAI, Google and Meta to generate the most precise medical codes.

The codes generated were then compared with the original ones, and any patterns in errors were examined, according to an April 22 news release from Mount Sinai.
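The study's exact prompts and scoring pipeline are not reproduced here, but a minimal sketch of this kind of evaluation, assuming a hypothetical query_model function and a small illustrative list of description-code pairs, might look like this:

```python
# Minimal sketch of the evaluation described above (not the study's actual code).
# Assumes a hypothetical query_model(prompt) function that returns the model's text reply,
# and a small illustrative list of (code description, original code) pairs.

def exact_match_rate(pairs, query_model):
    """Prompt the model with each code description and count exact matches."""
    matches = 0
    for description, original_code in pairs:
        prompt = f"Provide the most precise billing code for: {description}"
        generated = query_model(prompt).strip().upper()
        if generated == original_code.upper():
            matches += 1
    return matches / len(pairs)

# Example usage with two well-known ICD-10-CM codes (illustrative only):
sample_pairs = [
    ("Essential (primary) hypertension", "I10"),
    ("Type 2 diabetes mellitus without complications", "E11.9"),
]
# rate = exact_match_rate(sample_pairs, query_model)
# print(f"Exact match rate: {rate:.1%}")
```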

All of the large language models analyzed — GPT-4, GPT-3.5, Gemini-pro and Llama-2-70b — demonstrated accuracy below 50% in replicating the original medical codes, the researchers found.

GPT-4 showed the best performance among the models, achieving the highest exact match rates for ICD-9-CM (45.9%), ICD-10-CM (33.9%) and CPT codes (49.8%).
