Evaluating Large Language Model Adherence to AAOS Knee Osteoarthritis Guidelines: A Comparative Study of ChatGPT and NotebookLM

dc.authorid: 0000-0002-2345-0827
dc.authorid: 0000-0001-7526-6071
dc.authorid: 0000-0003-1243-2057
dc.contributor.author: Yildiz, Cemil
dc.contributor.author: Gokmen, Mehmet Yigit
dc.contributor.author: Utlu, Cetin
dc.contributor.author: Karabay, Elif Sude
dc.contributor.author: Tanik, Ugur
dc.contributor.author: Pazarci, Ozhan
dc.date.accessioned: 2026-02-03T12:02:50Z
dc.date.available: 2026-02-03T12:02:50Z
dc.date.issued: 2025
dc.department: Çanakkale Onsekiz Mart Üniversitesi
dc.description.abstract:

Purpose: This study evaluated how closely large language models (LLMs), specifically ChatGPT (OpenAI) and NotebookLM (Google), adhere to orthopedic guidelines. The objective was to determine whether AI-generated reasoning aligns with the 2021-2022 American Academy of Orthopaedic Surgeons (AAOS) clinical practice guidelines for knee osteoarthritis (OA).

Methods: A mixed-methods design combined quantitative concordance scoring with qualitative content analysis. Thirty-three decision points covering non-arthroplasty and surgical management were derived from AAOS guidelines. Structured Population-Intervention-Comparison-Outcome (PICO) prompts were presented to each model. Two orthopedic surgeons independently rated all outputs using a four-domain rubric assessing accuracy, evidence reasoning, additional information, and knowledge integration (0-4 scale). Concordance was classified as full (4), partial (3), or discordant (≤ 2), with disagreements resolved through consensus. Inter-rater reliability was almost perfect (weighted κ = 0.87).

Results: ChatGPT achieved a mean composite score of 3.67 ± 0.92 and NotebookLM 3.55 ± 0.87, with no significant difference between models (p = 0.18). Full concordance with AAOS recommendations occurred in 84.8% of ChatGPT responses and 75.8% of NotebookLM responses. Both models performed consistently in high-evidence domains such as NSAID therapy, tranexamic acid use, and weight-loss counseling. Variability increased in limited-evidence or technology-driven areas. Partial concordance reflected the omission of evidence qualifiers, while discordant responses involved overstated or speculative interpretations.

Conclusion: Both LLMs demonstrated strong alignment with evidence-based orthopedic reasoning. ChatGPT showed slightly higher fidelity to recommendation strength, whereas NotebookLM provided broader contextual interpretation. Structured, guideline-oriented prompting may enhance AI reasoning consistency and support the potential role of LLMs as complementary tools for evidence translation and orthopedic education.
dc.identifier.doi: 10.1007/s43465-025-01653-6
dc.identifier.issn: 0019-5413
dc.identifier.issn: 1998-3727
dc.identifier.scopus: 2-s2.0-105024089258
dc.identifier.scopusquality: Q3
dc.identifier.uri: https://doi.org/10.1007/s43465-025-01653-6
dc.identifier.uri: https://hdl.handle.net/20.500.12428/34885
dc.identifier.wos: WOS:001631448600001
dc.identifier.wosquality: Q3
dc.indekslendigikaynak: Web of Science
dc.indekslendigikaynak: Scopus
dc.language.iso: en
dc.publisher: Springer Heidelberg
dc.relation.ispartof: Indian Journal of Orthopaedics
dc.relation.publicationcategory: Article - International Peer-Reviewed Journal - Institutional Faculty Member
dc.rights: info:eu-repo/semantics/closedAccess
dc.snmz: KA_WOS_20260130
dc.subject: Artificial intelligence
dc.subject: Generative artificial intelligence
dc.subject: Large language models
dc.subject: Knee osteoarthritis
dc.subject: Total knee arthroplasty
dc.subject: Evidence-based orthopedics
dc.title: Evaluating Large Language Model Adherence to AAOS Knee Osteoarthritis Guidelines: A Comparative Study of ChatGPT and NotebookLM
dc.type: Article

Files