Gemini 1.5 Flash provides the most reliable content while ChatGPT-4o offers the highest readability for patient education on meniscal tears


Çakmur B. B., Koluman A. C., Çiftçi M. U., Aloğlu Çiftçi E., ZİROĞLU N.

Knee Surgery, Sports Traumatology, Arthroscopy, vol.34, no.3, pp.1141-1149, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article
  • Volume: 34 Issue: 3
  • Publication Date: 2026
  • Doi Number: 10.1002/ksa.70247
  • Journal Name: Knee Surgery, Sports Traumatology, Arthroscopy
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, CINAHL, MEDLINE
  • Page Numbers: pp.1141-1149
  • Keywords: ChatGPT, DeepSeek, Gemini, large language models, meniscal tear, patient education
  • Acibadem Mehmet Ali Aydinlar University Affiliated: Yes

Abstract

Purpose: The aim of this study was to comparatively evaluate the responses generated by three advanced artificial intelligence (AI) models, ChatGPT-4o (OpenAI), Gemini 1.5 Flash (Google) and DeepSeek-V3, to frequently asked patient questions about meniscal tears in terms of reliability, usefulness, quality, and readability.

Methods: Responses from three AI chatbots, ChatGPT-4o (OpenAI), Gemini 1.5 Flash (Google) and DeepSeek-V3 (DeepSeek AI), were evaluated for 20 common patient questions regarding meniscal tears. Three orthopaedic specialists independently scored reliability and usefulness on 7-point Likert scales and overall response quality using the 5-point Global Quality Scale. Readability was analysed with six established indices. Inter-rater agreement was examined with intraclass correlation coefficients (ICCs) and Fleiss’ Kappa, while between-model differences were tested using Kruskal–Wallis and ANOVA with Bonferroni adjustment.

Results: Gemini 1.5 Flash achieved the highest reliability, significantly outperforming both GPT-4o and DeepSeek-V3 (p = 0.001). While usefulness scores were broadly similar, Gemini was superior to DeepSeek-V3 (p = 0.045). Global Quality Scale scores did not differ significantly among models. In contrast, GPT-4o consistently provided the most readable content (p < 0.001). Inter-rater reliability was excellent across all evaluation domains (ICC > 0.9).

Conclusion: All three AI models generated high-quality educational content regarding meniscal tears. Gemini 1.5 Flash demonstrated the highest reliability and usefulness, while GPT-4o provided significantly more readable responses. These findings highlight the trade-off between reliability and readability in AI-generated patient education materials and emphasise the importance of physician oversight to ensure safe, evidence-based integration of these tools into clinical practice.
Level of Evidence: Level V, observation-based, expert opinion-based, or in vitro/artificial intelligence model evaluation.
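
Illustrative note: the Methods name Fleiss' Kappa as one of the inter-rater agreement measures. The sketch below shows how that statistic is commonly computed for a fixed number of raters per item. It is not the authors' analysis code; the function name `fleiss_kappa`, its arguments, and the toy ratings are all hypothetical and are not data from the study. Note also that Fleiss' Kappa treats the scale points as unordered categories, so it ignores how far apart two Likert scores are.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items each rated by the same number of raters.

    ratings:    list of per-item lists, one category label per rater
                (hypothetical input layout, chosen for this sketch).
    categories: every category label that raters could assign.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2:
        raise ValueError("at least two raters are required")
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")

    # counts[i][c] = number of raters who put item i in category c
    counts = [Counter(item) for item in ratings]

    # Observed agreement per item, then averaged over items
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Agreement expected by chance, from overall category proportions
    proportions = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_expected = sum(p ** 2 for p in proportions)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")

    return (p_observed - p_expected) / (1 - p_expected)


if __name__ == "__main__":
    # Toy example: 4 responses, 3 raters, 5-point scale (made-up scores)
    toy = [[5, 5, 5], [4, 4, 5], [3, 3, 3], [4, 5, 5]]
    print(round(fleiss_kappa(toy, categories=[1, 2, 3, 4, 5]), 3))
```

Values near 1 indicate agreement well above chance, values near 0 indicate chance-level agreement, and negative values indicate less agreement than chance would produce.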