ChatGPT significantly outperformed humans on a mock version of the American Board of Psychiatry and Neurology (ABPN) boards.
ChatGPT-4, the newer version of OpenAI's large language model, bested the mean human score on a question bank approved by the ABPN, answering 85% of questions correctly versus a mean human score of 73.8%, according to Varun Venkataramani, MD, PhD, of University Hospital Heidelberg in Germany, and co-authors.
An older model, ChatGPT-3.5, answered only 66.8% of questions correctly. "Both models used confident or very confident language, even when incorrect," Venkataramani and colleagues reported in JAMA Network Open.
The study is a good demonstration of the power and capabilities of large language models like ChatGPT, but its findings could be misinterpreted, noted Lyell Jones Jr., MD, of the Mayo Clinic in Rochester, Minnesota, who wasn't involved with the study.
"This paper demonstrates that ChatGPT can answer multiple choice questions correctly," Jones said. "It does not demonstrate that ChatGPT can practice clinical medicine or serve as a substitute for clinical decision-making."
"Tests, including multiple choice tests, are tools designed to assess medical knowledge, which is only one domain or competency required to practice medicine," Jones continued. "Transformer technologies like those employed by ChatGPT can predict text but they do not conduct interviews, do a physical exam, generate an assessment and plan, interpret clinical data, and communicate results."
While it's a great technical feat for the software to answer many questions correctly, "the error rate was still high, and the tendency to express certainty while still incorrect presents an additional risk or caution in the use of large language model tools," he added.
Transformer-based natural language processing tools like GPT could enhance neurology clinical care but come with limitations and risks, including fabricated facts. Risks and benefits were addressed in a recent paper that showed ChatGPT provided potentially dangerous advice for a young woman with epilepsy who wished to become pregnant. Research involving medical questions in other specialties has suggested that, despite improvements, neither ChatGPT 3.5 nor 4 should be relied on as a sole source of medical knowledge.
In their study, Venkataramani and co-authors used a question bank approved by the ABPN and categorized questions as either lower- or higher-order based on Bloom's taxonomy. Lower-order questions assessed remembering and basic understanding; higher-order questions measured applying, analyzing, or evaluating information.
The bank of 2,036 questions resembled the neurology board exam and was part of a self-assessment program that could be used for continuing medical education (CME) credit; a score of 70% was the threshold for CME. The researchers excluded 80 questions -- those with videos or images, and those based on preceding questions -- leaving 1,956 questions in the study.
Both large language models were server-contained and trained on more than 45 terabytes of text data from websites, books, and articles. Neither had the ability to search the internet.
GPT-3.5 matched human users on lower-order questions but lagged on higher-order questions. GPT-4 surpassed humans on both lower- and higher-order questions. GPT-4 performed better on questions in behavioral, cognitive, and psychological categories (89.8%) than on questions about epilepsy and seizures (70.9%) or neuromuscular topics (78.8%).
On a 5-point Likert scale, both models consistently rated their confidence in their answers as confident or highly confident, regardless of whether their answer was correct. When prompted with a right answer after a wrong one, both models apologized and agreed with the provided answer in all cases.
A study limitation is that official ABPN board exam questions could not be used due to their confidential and regulated nature, Venkataramani and co-authors said. In addition, the passing grade was an approximation based on the ABPN threshold for CME.
It's unclear what the clinical or educational utility of these findings is, Jones observed. "It's a great technical demonstration, but do we need software that can take tests designed for humans?" he asked.
"A more interesting study in this vein would be use of ChatGPT for generation of high-quality multiple-choice questions, educational cases, or other teaching materials," he suggested. "In any use, the error rates are high enough that any application of transformer technology in clinical or educational settings requires careful human validation and fact-checking."
Disclosures
Venkataramani had no disclosures. A co-author reported a patent for glioma treatment agents.
Jones has received publishing royalties from a publication relating to health care, has noncompensated relationships as a member of the board of directors of the Mayo Clinic Accountable Care Organization and the American Academy of Neurology Institute, and has received personal compensation for serving as an editor for the American Academy of Neurology.
Primary Source
JAMA Network Open
Schubert MC, et al "Performance of large language models on a neurology board–style examination" JAMA Netw Open 2023; DOI: 10.1001/jamanetworkopen.2023.46721.