As regulatory frameworks for AI in healthcare remain in their infancy, Mass General Brigham, based in Somerville, Mass., is taking a proactive role with its Healthcare AI Challenge.
This initiative aims to establish an unbiased, scalable platform for evaluating AI models, providing insights to guide institutional decision-making and support the development of broader regulatory standards.
Keith Dreyer, PhD, chief data science officer at Mass General Brigham, traced the roots of this initiative to the organization's early integration of AI into clinical practice in 2016-17, initially focusing on medical imaging.
"Back then, convolutional neural networks were driving breakthroughs in narrow AI solutions," Dr. Dreyer told Becker's. "These models tackled specific clinical problems, such as detecting strokes or precancerous pulmonary nodules, leading to the FDA approving nearly 1,000 algorithms for clinical use over time."
With the emergence of large language models, AI's potential has expanded dramatically — but so have the challenges. A key obstacle is the absence of regulatory frameworks for these advanced models.
"The FDA has yet to clear any large language model solutions, despite their transformative potential," Dr. Dreyer said. "Healthcare providers see the immense promise, but safety and efficacy must be thoroughly evaluated before deploying these tools. That's precisely what the Healthcare AI Challenge and its collaborative testing platform, the Arena, are designed to address."
Mass General Brigham partnered with Emory Healthcare, based in Atlanta; the University of Wisconsin-Madison Department of Radiology; the University of Washington Department of Radiology, based in Seattle; and the American College of Radiology. Together, they launched the Healthcare AI Challenge — a virtual, interactive series of events aimed at uniting healthcare professionals across institutions to explore and assess AI technologies in real-world clinical scenarios. Participants work with curated AI solutions, tackling tasks like medical image interpretation in a simulated environment known as "the Arena."
"We knew early on that for this initiative to succeed, it needed to extend beyond our walls," Dr. Dreyer said. "By adopting a cloud-based architecture, we created a platform accessible to multiple organizations, allowing them to contribute their data, expertise, and unique challenges."
The initiative employs a scalable, event-driven framework to evaluate AI models across various healthcare use cases. The first challenge centers on medical imaging, testing AI's ability to analyze X-rays and generate outputs such as draft radiology reports, differential diagnoses, and follow-up recommendations. Models are graded on a scale that mirrors levels of clinical expertise, from attending physician down to medical student.
"This process generates a leaderboard that shows which models perform at expert levels," Dr. Dreyer said. "The top-performing models then undergo head-to-head comparisons with human clinicians in subsequent challenges."
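The article does not specify how the grading scale or leaderboard is computed; as a purely illustrative sketch, the ranking described above could work roughly like this, assuming an ordinal expertise scale and averaged per-case grades (the intermediate "resident" and "fellow" levels, the model names, and all scores here are hypothetical):

```python
# Hypothetical leaderboard sketch: expert reviewers assign each model output
# a grade on an ordinal scale of clinical expertise, and models are ranked
# by their average grade. All names and grades below are illustrative.
from statistics import mean

# Assumed ordinal scale mirroring clinical expertise (higher = more expert).
GRADE_SCALE = {"medical student": 1, "resident": 2, "fellow": 3, "attending": 4}

def leaderboard(graded_outputs):
    """graded_outputs maps model name -> list of per-case expertise grades.
    Returns (model, average score) pairs sorted best-first."""
    averages = {
        model: mean(GRADE_SCALE[grade] for grade in grades)
        for model, grades in graded_outputs.items()
    }
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

# Illustrative grades for two hypothetical models on three imaging cases.
grades = {
    "model_a": ["attending", "fellow", "attending"],
    "model_b": ["resident", "medical student", "fellow"],
}
print(leaderboard(grades))  # model_a ranks first with the higher average
```

In practice the Arena's methodology is richer than a single average (the article mentions subsequent head-to-head comparisons with clinicians), but an ordinal scale plus ranking captures the basic idea of a leaderboard keyed to expert-level performance.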
Future challenges will expand beyond imaging, targeting areas like summarizing electronic health records or predicting cancer risk.
"The design is endlessly adaptable," Dr. Dreyer said. "Each challenge is tailored to the specific data, experts, and outcomes we aim to evaluate."
Dr. Dreyer envisions the challenge as a learning opportunity for organizations, helping them understand the potential and limitations of AI.