A study in JAMA Pediatrics reveals that ChatGPT-4, an AI language model, performed poorly in evaluating children's health cases. With an 83% error rate, the study highlights the risks of relying on unvetted AI in healthcare. Researchers tested ChatGPT-4 against 100 pediatric case studies, finding it provided correct answers in only 17 instances. The inaccurate diagnoses raise concerns about the readiness of AI for medical applications. However, the study suggests ChatGPT-4 can be used as a supplementary tool for clinicians in complex cases.
A new study published in JAMA Pediatrics has thrown cold water on the hopes of some for AI-powered medical diagnoses, revealing that the popular language model ChatGPT-4 performed poorly in evaluating children's health cases. According to a report by Ars Technica, the study found a staggering 83% error rate, underscoring the dangers of relying on unvetted AI in high-stakes settings like healthcare.
Researchers from Cohen Children's Medical Center in New York tested ChatGPT-4 against 100 anonymised paediatric case studies, covering a range of common and complex conditions. The chatbot's dismal performance, missing vital clues and providing inaccurate diagnoses in the overwhelming majority of cases, raises serious concerns about the readiness of current AI technology for medical applications.
Out of 100 cases, ChatGPT provided correct answers in only 17 instances. In 72 cases, it gave inaccurate responses, and in the remaining 11 cases, it did not fully capture the correct diagnosis. Among the 83 incorrect diagnoses, 57 percent (47 cases) involved the same organ system as the correct diagnosis, as per the report.
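For readers who want to see how those figures fit together, the short Python sketch below simply re-derives the reported 83% error rate and the 57% same-organ-system share from the case counts in the study; it is purely illustrative arithmetic, not code from the researchers.

```python
# Illustrative arithmetic only; the counts come from the reported study figures.
correct = 17     # fully correct diagnoses
incorrect = 72   # outright wrong diagnoses
partial = 11     # answers that "did not fully capture the diagnosis"

total = correct + incorrect + partial          # 100 cases in all
errors = incorrect + partial                   # 83 cases counted as misses
error_rate = errors / total                    # 0.83 -> the reported 83% error rate

same_organ_system = 47                         # wrong answers within the correct organ system
same_organ_share = same_organ_system / errors  # ~0.57 -> the reported 57%

print(f"Error rate: {error_rate:.0%}")                 # Error rate: 83%
print(f"Same organ system: {same_organ_share:.0%}")    # Same organ system: 57%
```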
How was ChatGPT evaluated?
During ChatGPT's evaluation, the researchers inserted the pertinent text of each medical case into the prompt. Two qualified physician-researchers then assessed the AI-generated responses, categorising them as correct, incorrect, or "did not fully capture the diagnosis." In cases falling into the latter category, ChatGPT often provided a clinically related condition that was too broad or insufficiently specific to be deemed the accurate diagnosis. For example, in one child's case, ChatGPT identified a branchial cleft cyst (a lump in the neck or below the collarbone) when the correct diagnosis was Branchio-oto-renal syndrome. According to the report, this syndrome is a genetic condition leading to abnormal tissue development in the neck, along with malformations in the ears and kidneys. Notably, one of the indicators of this condition is the occurrence of branchial cleft cysts.
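The paper's own prompting code is not reproduced here, but the workflow it describes, pasting case text into a prompt, collecting the model's diagnosis, and having physicians grade the answer, is simple to sketch. The snippet below is a hypothetical illustration using the OpenAI Python client; the model name, prompt wording, and grading step are assumptions for illustration, not the researchers' actual setup.

```python
# Hypothetical sketch of the described workflow, NOT the study's actual code.
# Assumes the official OpenAI Python client (pip install openai) and an API key in OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()

GRADES = ("correct", "incorrect", "did not fully capture the diagnosis")

def ask_for_diagnosis(case_text: str) -> str:
    """Insert the pertinent case text into a prompt and return the model's suggested diagnosis."""
    response = client.chat.completions.create(
        model="gpt-4",  # stand-in for the GPT-4 version evaluated in the study
        messages=[
            {"role": "system", "content": "You are assisting with a paediatric diagnostic exercise."},
            {"role": "user", "content": f"Case description:\n{case_text}\n\nWhat is the most likely diagnosis?"},
        ],
    )
    return response.choices[0].message.content

def record_physician_grade(diagnosis: str) -> str:
    """In the study, two physician-researchers graded each answer; here a human grades it at the console."""
    print("Model diagnosis:", diagnosis)
    grade = input(f"Grade {GRADES}: ").strip().lower()
    return grade if grade in GRADES else "incorrect"
```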
However, the study did note that ChatGPT can serve as a supplementary tool. As part of the findings, the researchers wrote that "LLM-based chatbots could be used as a supplementary tool for clinicians in diagnosing and developing a differential list for complex cases."