Large language models can complete up to 50 simultaneous clinical tasks and drive a seventeenfold cost reduction, but performance deteriorates as more tasks are added, according to a study published Nov. 18 in Nature.
Researchers at the Icahn School of Medicine at Mount Sinai in New York City conducted more than 300,000 experiments to stress-test LLMs, according to lead author Eyal Klang, MD, director of Icahn Mount Sinai's generative AI research program.
The study used data from all 1,942,216 patient encounters at Mount Sinai Health System in 2023. The researchers evaluated 10 LLMs, including Meta's Llama-3-70B and OpenAI's GPT-4-turbo-128k, assessing how each performed as task demands drawn from this EHR data increased.
Those two LLMs could handle a grouping of 50 clinical tasks, such as reviewing medication safety and structuring research cohorts, without a significant decline in accuracy. Other models could handle fewer tasks before accuracy declined, the study found.
"After adjusting for possible failures, the concatenation strategy achieved roughly seventeenfold savings in cost for 50 tasks with GPT-4-turbo-128k," the authors said in conclusion. "The price difference of $0.24 per 50 tasks may seem limited but would amount to significant savings at the health system scale."