Gemini 1.5 Flash provides the most reliable content while ChatGPT-4o offers the highest readability for patient education on meniscal tears

Çakmur, Başar; Koluman, Ali; Çiftçi, Mehmet; Aloğlu Çiftçi, Ebru; ZİROĞLU, NEZİH

doi:10.1002/ksa.70247

Çakmur B. B., Koluman A. C., Çiftçi M. U., Aloğlu Çiftçi E., ZİROĞLU N.

Knee Surgery, Sports Traumatology, Arthroscopy, vol. 34, no. 3, pp. 1141-1149, 2026 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 34, Issue: 3
Publication Date: 2026
DOI: 10.1002/ksa.70247
Journal Name: Knee Surgery, Sports Traumatology, Arthroscopy
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, CINAHL, MEDLINE
Pages: pp. 1141-1149
Keywords: ChatGPT, DeepSeek, Gemini, large language models, meniscal tear, patient education
Affiliated with Acıbadem Mehmet Ali Aydınlar University: Yes

Abstract

Purpose: This study compared the responses of three advanced artificial intelligence (AI) models, ChatGPT-4o (OpenAI), Gemini 1.5 Flash (Google) and DeepSeek-V3 (DeepSeek AI), to frequently asked patient questions about meniscal tears in terms of reliability, usefulness, quality and readability.
Methods: Responses from the three chatbots to 20 common patient questions regarding meniscal tears were evaluated. Three orthopaedic specialists independently scored reliability and usefulness on 7-point Likert scales and overall response quality using the 5-point Global Quality Scale. Readability was analysed with six established indices. Inter-rater agreement was examined with intraclass correlation coefficients (ICCs) and Fleiss' Kappa, while between-model differences were tested using Kruskal–Wallis and ANOVA with Bonferroni adjustment.
Results: Gemini 1.5 Flash achieved the highest reliability, significantly outperforming both ChatGPT-4o and DeepSeek-V3 (p = 0.001). While usefulness scores were broadly similar, Gemini 1.5 Flash was superior to DeepSeek-V3 (p = 0.045). Global Quality Scale scores did not differ significantly among models. In contrast, ChatGPT-4o consistently provided the most readable content (p < 0.001). Inter-rater reliability was excellent across all evaluation domains (ICC > 0.9).
Conclusion: All three AI models generated high-quality educational content regarding meniscal tears. Gemini 1.5 Flash demonstrated the highest reliability and usefulness, while ChatGPT-4o provided significantly more readable responses. These findings highlight the trade-off between reliability and readability in AI-generated patient education materials and emphasise the importance of physician oversight to ensure safe, evidence-based integration of these tools into clinical practice.
Level of Evidence: Level V, observation-based, expert opinion-based, or in vitro/artificial intelligence model evaluation.
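
The abstract names Fleiss' Kappa as one of the inter-rater agreement measures but does not say how it was computed. As an illustration only, the following is a minimal pure-Python sketch of the standard Fleiss' Kappa formula for a fixed number of raters per item. The function name `fleiss_kappa` and the example scores are hypothetical and are not taken from the study. Note that the statistic treats scores as nominal categories, so it ignores the ordering of Likert points.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' Kappa for items each scored by the same number of raters.

    ratings: list of per-item lists, one categorical score per rater.
    categories: iterable of every possible score value.
    """
    categories = list(categories)
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    # How many raters chose each category, per item.
    counts = [Counter(item) for item in ratings]

    # Observed agreement: proportion of agreeing rater pairs, averaged over items.
    pair_total = n_raters * (n_raters - 1)
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters) / pair_total
        for c in counts
    ]
    p_observed = sum(p_items) / n_items

    # Chance agreement: sum of squared overall category proportions.
    total = n_items * n_raters
    p_expected = sum((sum(c[cat] for c in counts) / total) ** 2 for cat in categories)

    if p_expected == 1.0:
        # Every rating falls in one category; kappa is undefined, report 1.0 by convention.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical 7-point Likert scores from three raters for four responses.
example_scores = [[7, 7, 7], [6, 6, 7], [5, 5, 5], [6, 7, 7]]
kappa = fleiss_kappa(example_scores, categories=range(1, 8))
print(f"Fleiss' kappa: {kappa:.3f}")
```

Because the scores in such a study are ordinal, an ICC (which the abstract also reports) or a weighted agreement statistic is usually the more informative measure; the sketch above only shows how the unweighted statistic is defined.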