Knee Surgery, Sports Traumatology, Arthroscopy, vol. 34, no. 3, pp. 1141-1149, 2026 (SCI-Expanded, Scopus)
Purpose: The aim of this study was to comparatively evaluate the responses generated by three advanced artificial intelligence (AI) models, ChatGPT-4o (OpenAI), Gemini 1.5 Flash (Google), and DeepSeek-V3 (DeepSeek AI), to frequently asked patient questions about meniscal tears in terms of reliability, usefulness, quality, and readability.

Methods: Responses from the three AI chatbots were evaluated for 20 common patient questions regarding meniscal tears. Three orthopaedic specialists independently scored reliability and usefulness on 7-point Likert scales and overall response quality on the 5-point Global Quality Scale. Readability was analysed with six established indices. Inter-rater agreement was examined with intraclass correlation coefficients (ICCs) and Fleiss' Kappa, while between-model differences were tested using Kruskal–Wallis and ANOVA with Bonferroni adjustment.

Results: Gemini 1.5 Flash achieved the highest reliability, significantly outperforming both GPT-4o and DeepSeek-V3 (p = 0.001). While usefulness scores were broadly similar, Gemini was superior to DeepSeek-V3 (p = 0.045). Global Quality Scale scores did not differ significantly among models. In contrast, GPT-4o consistently provided the most readable content (p < 0.001). Inter-rater reliability was excellent across all evaluation domains (ICC > 0.9).

Conclusion: All three AI models generated high-quality educational content regarding meniscal tears. Gemini 1.5 Flash demonstrated the highest reliability and usefulness, while GPT-4o provided significantly more readable responses. These findings highlight the trade-off between reliability and readability in AI-generated patient education materials and emphasise the importance of physician oversight to ensure safe, evidence-based integration of these tools into clinical practice.
Level of Evidence: Level V, observation-based, expert opinion-based, or in vitro/artificial intelligence model evaluation.
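The between-model comparison described in the Methods (an omnibus Kruskal–Wallis test followed by Bonferroni-adjusted pairwise comparisons) can be sketched as follows. This is a minimal illustration using SciPy with fabricated placeholder scores, not the study's data; the variable names and the choice of Mann–Whitney U as the pairwise follow-up are assumptions, and the actual analysis may have differed in detail.

```python
from scipy import stats

# Hypothetical 7-point Likert reliability ratings per model
# (placeholder values only; the study's real data are not reproduced here).
gpt4o    = [5, 6, 5, 5, 6, 5, 4, 5, 6, 5]
gemini   = [7, 6, 7, 6, 7, 6, 7, 6, 6, 7]
deepseek = [5, 5, 4, 5, 5, 6, 5, 4, 5, 5]

# Omnibus test: do the three models' score distributions differ?
h_stat, p_omnibus = stats.kruskal(gpt4o, gemini, deepseek)

# Pairwise follow-up with Bonferroni adjustment (3 comparisons).
pairs = {
    "GPT-4o vs Gemini":   (gpt4o, gemini),
    "GPT-4o vs DeepSeek": (gpt4o, deepseek),
    "Gemini vs DeepSeek": (gemini, deepseek),
}
n_comparisons = len(pairs)
p_adjusted = {}
for name, (a, b) in pairs.items():
    _, p = stats.mannwhitneyu(a, b, alternative="two-sided")
    # Bonferroni: multiply each raw p-value by the number of comparisons,
    # capped at 1.0.
    p_adjusted[name] = min(p * n_comparisons, 1.0)

print(f"Kruskal-Wallis: H = {h_stat:.3f}, p = {p_omnibus:.4f}")
for name, p_adj in p_adjusted.items():
    print(f"{name}: adjusted p = {p_adj:.4f}")
```

With placeholder data this separated, the omnibus test flags a difference and the pairwise comparisons localise it to the Gemini contrasts, mirroring the shape (though not the values) of the reported results.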