TY - JOUR
T1 - Reliability and Readability Assessment of Atrial Fibrillation Patient Information Delivered by Artificial Intelligence-Based Language Models (ChatGPT, YouChat, Gemini, and Perplexity AI) in English and Spanish
AU - Juan-Guardela, Emilio José Juan
AU - Beltrán-España, Jesús Andrés
AU - Ravagli-Baquero, María Paula
AU - Porras-Bueno, Cristian Orlando
AU - Cáceres-Méndez, Edward
AU - Ávila, Daniel Fernández
AU - Muñoz-Velandia, Oscar
AU - García-Peña, Ángel Alberto
N1 - Publisher Copyright:
© The Author(s) 2025. This article is distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 License (https://creativecommons.org/licenses/by-nc/4.0/) which permits non-commercial use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access pages (https://us.sagepub.com/en-us/nam/open-access-at-sage).
PY - 2025/10/31
Y1 - 2025/10/31
N2 - Background: Atrial fibrillation (AF) is the most prevalent arrhythmia and a significant cause of morbidity. Artificial intelligence (AI)-based language models represent a novel tool for searching for medical information; however, there is still uncertainty regarding their reliability and readability in different languages. Objective: To assess the reliability and readability of information provided by AI-based models for patients with AF. Methods: A cross-sectional study was conducted to assess the reliability and readability of the responses generated by ChatGPT, YouChat, Gemini, and Perplexity on AF in English and Spanish. Thirty standardised questions were posed in both languages. The quality of the responses was assessed by 2 independent reviewers via a standardised tool, and readability was assessed via the Flesch–Szigriszt formula. The results were then compared by tool and language. Results: ChatGPT demonstrated the highest interrater agreement (PA = 0.73 in Spanish, 0.80 in English), followed by Gemini in English (PA = 0.66). In Spanish, ChatGPT generated the highest percentage of complete responses (80%), followed by Perplexity (73%) and Gemini (47%). In English, Perplexity performed best, with 93% complete responses, followed by ChatGPT (73%) and Gemini (53%). A readability analysis revealed significant differences between the models (P < .01); ChatGPT performed best, although its content was moderately challenging in Spanish and highly challenging in English. Conclusion: ChatGPT and Perplexity emerged as the most reliable models, although readability remains a concern. There is a clear need for improvements to optimise the accuracy and accessibility of AI-generated medical information.
AB - Background: Atrial fibrillation (AF) is the most prevalent arrhythmia and a significant cause of morbidity. Artificial intelligence (AI)-based language models represent a novel tool for searching for medical information; however, there is still uncertainty regarding their reliability and readability in different languages. Objective: To assess the reliability and readability of information provided by AI-based models for patients with AF. Methods: A cross-sectional study was conducted to assess the reliability and readability of the responses generated by ChatGPT, YouChat, Gemini, and Perplexity on AF in English and Spanish. Thirty standardised questions were posed in both languages. The quality of the responses was assessed by 2 independent reviewers via a standardised tool, and readability was assessed via the Flesch–Szigriszt formula. The results were then compared by tool and language. Results: ChatGPT demonstrated the highest interrater agreement (PA = 0.73 in Spanish, 0.80 in English), followed by Gemini in English (PA = 0.66). In Spanish, ChatGPT generated the highest percentage of complete responses (80%), followed by Perplexity (73%) and Gemini (47%). In English, Perplexity performed best, with 93% complete responses, followed by ChatGPT (73%) and Gemini (53%). A readability analysis revealed significant differences between the models (P < .01); ChatGPT performed best, although its content was moderately challenging in Spanish and highly challenging in English. Conclusion: ChatGPT and Perplexity emerged as the most reliable models, although readability remains a concern. There is a clear need for improvements to optimise the accuracy and accessibility of AI-generated medical information.
KW - artificial intelligence
KW - atrial fibrillation
KW - large language models
KW - readability
UR - https://www.scopus.com/pages/publications/105020446665
UR - https://www.mendeley.com/catalogue/f042475e-fe01-3d8c-b6a4-3c2fbb7c666b/
U2 - 10.1177/11795468251383666
DO - 10.1177/11795468251383666
M3 - Article
SN - 1179-5468
VL - 19
SP - 1
EP - 10
JO - Clinical Medicine Insights: Cardiology
JF - Clinical Medicine Insights: Cardiology
M1 - 11795468251383666
ER -