An artificial intelligence (AI) model accurately solved quiz questions that test doctors’ ability to diagnose patients based on images and text summaries. However, it made mistakes when describing the images and explaining its decision-making rationale, according to a new study.
Details of the experiment were published in npj Digital Medicine on Tuesday. The study was led by the National Institutes of Health’s National Library of Medicine (NLM) and Weill Cornell Medicine in New York City.
Both the doctors and the AI model answered questions from the New England Journal of Medicine (NEJM) Image Challenge, an online quiz that presents real clinical images with short text descriptions. Users must select the correct diagnosis from a set of multiple-choice answers.
“Integration of AI into health care holds great promise as a tool to help medical professionals diagnose patients faster, allowing them to start treatment sooner,” Stephen Sherry, PhD, acting director of the NLM, said in a statement. “However, as this study shows, AI is not advanced enough yet to replace human experience, which is crucial for accurate diagnosis.”
Nine doctors from various medical specialties answered the questions in two sessions. In the first, the clinicians couldn’t use any additional materials, such as the internet; in the second, they could consult external resources. The researchers then gave the physicians the correct answer, along with the AI model’s answer and its corresponding rationale. Finally, the doctors scored the AI model’s ability to describe the image, summarize the relevant medical knowledge, and give a step-by-step rationale for how it arrived at its decision.
Both the AI model and the clinicians scored highly in selecting the correct diagnosis. In fact, the AI model selected the correct diagnosis more often than the doctors did in the closed-book sessions, when the doctors couldn’t use outside resources. But the doctors outperformed the AI model when they could use outside resources, especially on the more difficult questions.
The AI model often made mistakes when describing the image and explaining why it chose a specific diagnosis, even when its final choice was correct. For example, a photo of a patient’s arm showed two lesions. Because the lesions appeared at different angles, the AI model couldn’t recognize that they were caused by the same condition, but the doctor could.