Gemini 1.5 Flash provides the most reliable content while ChatGPT-4o offers the highest readability for patient education on meniscal tears


Çakmur B. B., Koluman A. C., Çiftçi M. U., Aloğlu Çiftçi E., ZİROĞLU N.

Knee Surgery, Sports Traumatology, Arthroscopy, vol. 34, no. 3, pp. 1141-1149, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Volume: 34 Issue: 3
  • Publication Date: 2026
  • DOI: 10.1002/ksa.70247
  • Journal Name: Knee Surgery, Sports Traumatology, Arthroscopy
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, CINAHL, MEDLINE
  • Page Numbers: pp. 1141-1149
  • Keywords: ChatGPT, DeepSeek, Gemini, large language models, meniscal tear, patient education
  • Affiliated with Acıbadem Mehmet Ali Aydınlar University: Yes

Abstract

Purpose: The aim of this study was to comparatively evaluate the responses generated by three advanced artificial intelligence (AI) models, ChatGPT-4o (OpenAI), Gemini 1.5 Flash (Google) and DeepSeek-V3, to frequently asked patient questions about meniscal tears in terms of reliability, usefulness, quality, and readability.

Methods: Responses from three AI chatbots, ChatGPT-4o (OpenAI), Gemini 1.5 Flash (Google) and DeepSeek-V3 (DeepSeek AI), were evaluated for 20 common patient questions regarding meniscal tears. Three orthopaedic specialists independently scored reliability and usefulness on 7-point Likert scales and overall response quality using the 5-point Global Quality Scale. Readability was analysed with six established indices. Inter-rater agreement was examined with intraclass correlation coefficients (ICCs) and Fleiss' Kappa, while between-model differences were tested using Kruskal–Wallis and ANOVA with Bonferroni adjustment.

Results: Gemini 1.5 Flash achieved the highest reliability, significantly outperforming both GPT-4o and DeepSeek-V3 (p = 0.001). While usefulness scores were broadly similar, Gemini was superior to DeepSeek-V3 (p = 0.045). Global Quality Scale scores did not differ significantly among models. In contrast, GPT-4o consistently provided the most readable content (p < 0.001). Inter-rater reliability was excellent across all evaluation domains (ICC > 0.9).

Conclusion: All three AI models generated high-quality educational content regarding meniscal tears. Gemini 1.5 Flash demonstrated the highest reliability and usefulness, while GPT-4o provided significantly more readable responses. These findings highlight the trade-off between reliability and readability in AI-generated patient education materials and emphasise the importance of physician oversight to ensure safe, evidence-based integration of these tools into clinical practice.
Level of Evidence: Level V, observation-based, expert opinion-based, or in vitro/artificial intelligence model evaluation.
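The between-model comparison described in the Methods (a non-parametric Kruskal–Wallis omnibus test followed by Bonferroni-adjusted pairwise comparisons) can be sketched as follows. This is a minimal illustration using hypothetical 7-point Likert scores, not the study's data, and Mann–Whitney U is assumed as the pairwise follow-up test; the authors' exact post-hoc procedure is not specified in the abstract.

```python
from itertools import combinations
from scipy import stats

# Hypothetical reliability scores (7-point Likert), one per question (n = 20);
# values are illustrative only and do not come from the published study.
scores = {
    "Gemini 1.5 Flash": [7, 6, 7, 6, 7, 6, 7, 7, 6, 7, 6, 7, 7, 6, 7, 6, 7, 7, 6, 7],
    "ChatGPT-4o":       [5, 6, 5, 6, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 6, 5, 6, 5, 5, 6],
    "DeepSeek-V3":      [5, 5, 6, 5, 5, 6, 5, 5, 5, 6, 5, 5, 6, 5, 5, 5, 6, 5, 5, 5],
}

# Omnibus test: do the three models differ in median score?
h_stat, p_omnibus = stats.kruskal(*scores.values())
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_omnibus:.4f}")

# Pairwise comparisons with a Bonferroni-adjusted significance threshold.
pairs = list(combinations(scores, 2))
alpha_adj = 0.05 / len(pairs)  # Bonferroni: alpha divided by number of comparisons
for a, b in pairs:
    u, p = stats.mannwhitneyu(scores[a], scores[b], alternative="two-sided")
    verdict = "significant" if p < alpha_adj else "n.s."
    print(f"{a} vs {b}: U = {u:.1f}, p = {p:.4f} ({verdict} at adjusted alpha)")
```

Bonferroni adjustment simply divides the significance threshold by the number of pairwise tests (here 0.05 / 3), controlling the family-wise error rate at the cost of statistical power.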