As artificial intelligence (AI) continues to integrate into healthcare, its role in patient education is rapidly expanding. AI chatbots, such as ChatGPT, Bard, and Bing, are emerging as tools to provide medical information, potentially supplementing traditional resources like patient brochures.
However, the accuracy, comprehensiveness, and readability of the information these chatbots deliver remain critical factors in determining their reliability as sources of patient education. A study available through the National Library of Medicine examined how well AI chatbots perform when answering glaucoma-related questions adapted from patient education brochures, comparing their responses to materials provided by the American Academy of Ophthalmology (AAO).
The findings offer insights into the strengths and limitations of AI chatbots as resources for improving patient understanding and supporting physician-patient interactions.
As AI chatbots increasingly serve as a potential primary source of glaucoma-related information for patients, it becomes crucial to assess the quality of the information they provide. This evaluation helps healthcare providers tailor their discussions, anticipate patient concerns, and address misinformation.
The study analyzed chatbot-generated glaucoma information, focusing on accuracy, comprehensiveness, readability, word count, and character count, and compared the chatbot responses to glaucoma-related patient materials provided by the AAO.
Methods
Patient education brochure section headers from AAO glaucoma materials were converted into questions and posed five times to each chatbot (ChatGPT-4, Bard, and Bing). Two sets of responses from each chatbot were analyzed.
Glaucoma-trained ophthalmologists independently scored the accuracy of the AI chatbot responses and the AAO materials, as well as the comprehensiveness of the chatbot responses, on a scale of 1 to 5. Readability was evaluated using the Flesch-Kincaid Grade Level (FKGL), which corresponds to U.S. school grade levels. Additionally, word and character counts were measured for both the chatbot responses and the AAO brochure content.
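For context on the readability metric: FKGL combines average sentence length and average syllables per word into an approximate U.S. grade level, using the formula 0.39 x (words per sentence) + 11.8 x (syllables per word) - 15.59. The snippet below is a minimal Python sketch of that formula, not the tooling used in the study; the syllable counter is a rough vowel-group heuristic, and the sample text is invented purely for demonstration.

```python
import re

def count_syllables(word: str) -> int:
    """Rough syllable estimate: count vowel groups, minus a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and count > 1:
        count -= 1
    return max(count, 1)

def flesch_kincaid_grade(text: str) -> float:
    """FKGL = 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / max(len(sentences), 1))
            + 11.8 * (syllables / max(len(words), 1))
            - 15.59)

# Hypothetical chatbot-style answer, used only to demonstrate the metric.
sample = ("Glaucoma is a group of eye diseases that damage the optic nerve. "
          "Early treatment can often help protect your vision.")
words = re.findall(r"[A-Za-z']+", sample)
print(f"Estimated FKGL: {flesch_kincaid_grade(sample):.1f}")
print(f"Word count: {len(words)}, character count: {len(sample)}")
```

Published libraries such as textstat use more careful syllable counting, so exact scores will differ slightly, but the grade-level interpretation is the same: lower values indicate text that is easier to read.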
Results
Accuracy scores were as follows:
AAO (4.84), ChatGPT (4.26), Bing (4.53), and Bard (3.53). AAO materials were significantly more accurate than ChatGPT responses (p=0.002), while Bard had the lowest accuracy (Bard versus AAO, p<0.001; Bard versus ChatGPT, p<0.002; Bard versus Bing, p=0.001).
In terms of comprehensiveness:
ChatGPT scored highest at 3.32, compared with 2.79 for Bard and 2.16 for Bing (ChatGPT versus Bing, p<0.001; ChatGPT versus Bard, p=0.008).
The AAO materials and Bard responses were the most readable (AAO versus ChatGPT, AAO versus Bing, Bard versus ChatGPT, Bard versus Bing, all p<0.0001), with FKGL scores of 8.11 for AAO, 13.01 for ChatGPT, 11.73 for Bing, and 7.90 for Bard; lower scores indicate easier reading.
Bing provided the shortest responses in terms of word and character count.
Conclusions
AI chatbot-generated glaucoma information showed variability in accuracy, comprehensiveness, and readability. While ChatGPT delivered the most comprehensive responses, its readability levels were significantly higher than those of the AAO materials, which may limit accessibility. Bard had the least accurate responses but was among the most readable.
These findings highlight the need for improvements in AI chatbots to make them a more reliable supplemental resource for patient education. Physicians should remain aware of these limitations to better address patients’ existing knowledge and provide accurate, detailed, and accessible information.
Are you interested in how AI is changing healthcare? Subscribe to our newsletter, “PulsePoint,” for updates, insights, and trends on AI innovations in healthcare.