Evaluating Large Language Model Adherence to AAOS Knee Osteoarthritis Guidelines: A Comparative Study of ChatGPT and NotebookLM

Yildiz, Cemil; Gokmen, Mehmet Yigit; Utlu, Cetin; Karabay, Elif Sude; Tanik, Ugur; Pazarci, Ozhan

Date

2025

Publisher

Springer Heidelberg

Access Rights

info:eu-repo/semantics/closedAccess

Abstract

Purpose: This study evaluated how closely large language models (LLMs), specifically ChatGPT (OpenAI) and NotebookLM (Google), adhere to orthopedic guidelines. The objective was to determine whether AI-generated reasoning aligns with the 2021-2022 American Academy of Orthopaedic Surgeons (AAOS) clinical practice guidelines for knee osteoarthritis (OA).

Methods: A mixed-methods design combined quantitative concordance scoring with qualitative content analysis. Thirty-three decision points covering non-arthroplasty and surgical management were derived from AAOS guidelines. Structured Population-Intervention-Comparison-Outcome (PICO) prompts were presented to each model. Two orthopedic surgeons independently rated all outputs using a four-domain rubric assessing accuracy, evidence reasoning, additional information, and knowledge integration (0-4 scale). Concordance was classified as full (4), partial (3), or discordant (≤ 2), with disagreements resolved through consensus. Inter-rater reliability was almost perfect (weighted kappa = 0.87).

Results: ChatGPT achieved a mean composite score of 3.67 ± 0.92, and NotebookLM 3.55 ± 0.87, with no significant difference between models (p = 0.18). Full concordance with AAOS recommendations occurred in 84.8% of ChatGPT responses and 75.8% of NotebookLM responses. Both models performed consistently in high-evidence domains such as NSAID therapy, tranexamic acid use, and weight-loss counseling. Variability increased in limited-evidence or technology-driven areas. Partial concordance reflected the omission of evidence qualifiers, while discordant responses involved overstated or speculative interpretations.

Conclusion: Both LLMs demonstrated strong alignment with evidence-based orthopedic reasoning. ChatGPT showed slightly higher fidelity to recommendation strength, whereas NotebookLM provided broader contextual interpretation. Structured, guideline-oriented prompting may enhance AI reasoning consistency and support the potential role of LLMs as complementary tools for evidence translation and orthopedic education.
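
The concordance classification and the inter-rater statistic mentioned in the abstract can be illustrated with a short Python sketch. This is not the authors' analysis code. The function names are hypothetical, the scores are assumed to be integer consensus ratings on the 0-4 scale, and the abstract does not say whether the weighted kappa used linear or quadratic weights, so the weighting is left as a parameter (linear by default). The example score list is invented: only the count of 28 full-concordance responses out of 33 is chosen to be consistent with the reported 84.8%.

```python
from collections import Counter


def classify(score):
    """Map a 0-4 rubric score to a concordance category as the abstract defines it."""
    if score == 4:
        return "full"
    if score == 3:
        return "partial"
    return "discordant"  # scores of 2 or lower


def concordance_rates(scores):
    """Return the fraction of responses falling in each concordance category."""
    counts = Counter(classify(s) for s in scores)
    n = len(scores)
    return {k: counts.get(k, 0) / n for k in ("full", "partial", "discordant")}


def weighted_kappa(rater1, rater2, categories=5, power=1):
    """Cohen's weighted kappa for two raters using integer categories 0..categories-1.

    power=1 gives linear disagreement weights, power=2 quadratic weights.
    Which one the study used is not stated; linear is an assumption here.
    """
    n = len(rater1)
    # Observed joint proportions of (rater1, rater2) ratings.
    obs = [[0.0] * categories for _ in range(categories)]
    for a, b in zip(rater1, rater2):
        obs[a][b] += 1.0 / n
    row = [sum(obs[i]) for i in range(categories)]
    col = [sum(obs[i][j] for i in range(categories)) for j in range(categories)]
    observed = 0.0  # weighted observed disagreement
    expected = 0.0  # weighted disagreement expected by chance
    for i in range(categories):
        for j in range(categories):
            w = (abs(i - j) / (categories - 1)) ** power
            observed += w * obs[i][j]
            expected += w * row[i] * col[j]
    if expected == 0:
        return 1.0  # both raters used a single identical category
    return 1.0 - observed / expected


# Hypothetical score list: 28 full, 3 partial, 2 discordant out of 33.
example_scores = [4] * 28 + [3] * 3 + [2] * 2
rates = concordance_rates(example_scores)
print(round(rates["full"] * 100, 1))
```

With 33 decision points, 28 full-concordance responses correspond to 84.8% and 25 to 75.8%, which matches the percentages reported for the two models.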

Keywords

Artificial intelligence, Generative artificial intelligence, Large language models, Knee osteoarthritis, Total knee arthroplasty, Evidence-based orthopedics

Source

Indian Journal of Orthopaedics

WoS Q Value

Q3

Scopus Q Value

Q3

Link

https://doi.org/10.1007/s43465-025-01653-6
https://hdl.handle.net/20.500.12428/34885

Collection

WoS-Indexed Publications Collection
Scopus-Indexed Publications Collection