Evaluating Large Language Model Adherence to AAOS Knee Osteoarthritis Guidelines: A Comparative Study of ChatGPT and NotebookLM
| dc.authorid | 0000-0002-2345-0827 | |
| dc.authorid | 0000-0001-7526-6071 | |
| dc.authorid | 0000-0003-1243-2057 | |
| dc.contributor.author | Yildiz, Cemil | |
| dc.contributor.author | Gokmen, Mehmet Yigit | |
| dc.contributor.author | Utlu, Cetin | |
| dc.contributor.author | Karabay, Elif Sude | |
| dc.contributor.author | Tanik, Ugur | |
| dc.contributor.author | Pazarci, Ozhan | |
| dc.date.accessioned | 2026-02-03T12:02:50Z | |
| dc.date.available | 2026-02-03T12:02:50Z | |
| dc.date.issued | 2025 | |
| dc.department | Çanakkale Onsekiz Mart Üniversitesi | |
| dc.description.abstract | Purpose: This study evaluated how closely large language models (LLMs), specifically ChatGPT (OpenAI) and NotebookLM (Google), adhere to orthopedic guidelines. The objective was to determine whether AI-generated reasoning aligns with the 2021-2022 American Academy of Orthopaedic Surgeons (AAOS) clinical practice guidelines for knee osteoarthritis (OA). Methods: A mixed-methods design combined quantitative concordance scoring with qualitative content analysis. Thirty-three decision points covering non-arthroplasty and surgical management were derived from the AAOS guidelines. Structured Population-Intervention-Comparison-Outcome (PICO) prompts were presented to each model. Two orthopedic surgeons independently rated all outputs using a four-domain rubric assessing accuracy, evidence reasoning, additional information, and knowledge integration (0-4 scale). Concordance was classified as full (4), partial (3), or discordant (≤ 2), with disagreements resolved through consensus. Inter-rater reliability was almost perfect (weighted κ = 0.87). Results: ChatGPT achieved a mean composite score of 3.67 ± 0.92 and NotebookLM 3.55 ± 0.87, with no significant difference between models (p = 0.18). Full concordance with AAOS recommendations occurred in 84.8% of ChatGPT responses and 75.8% of NotebookLM responses. Both models performed consistently in high-evidence domains such as NSAID therapy, tranexamic acid use, and weight-loss counseling. Variability increased in limited-evidence or technology-driven areas. Partial concordance reflected the omission of evidence qualifiers, while discordant responses involved overstated or speculative interpretations. Conclusion: Both LLMs demonstrated strong alignment with evidence-based orthopedic reasoning. ChatGPT showed slightly higher fidelity to recommendation strength, whereas NotebookLM provided broader contextual interpretation. Structured, guideline-oriented prompting may enhance AI reasoning consistency and support the potential role of LLMs as complementary tools for evidence translation and orthopedic education. | |
| dc.identifier.doi | 10.1007/s43465-025-01653-6 | |
| dc.identifier.issn | 0019-5413 | |
| dc.identifier.issn | 1998-3727 | |
| dc.identifier.scopus | 2-s2.0-105024089258 | |
| dc.identifier.scopusquality | Q3 | |
| dc.identifier.uri | https://doi.org/10.1007/s43465-025-01653-6 | |
| dc.identifier.uri | https://hdl.handle.net/20.500.12428/34885 | |
| dc.identifier.wos | WOS:001631448600001 | |
| dc.identifier.wosquality | Q3 | |
| dc.indekslendigikaynak | Web of Science | |
| dc.indekslendigikaynak | Scopus | |
| dc.language.iso | en | |
| dc.publisher | Springer Heidelberg | |
| dc.relation.ispartof | Indian Journal of Orthopaedics | |
| dc.relation.publicationcategory | Article - International Peer-Reviewed Journal - Institutional Faculty Member | |
| dc.rights | info:eu-repo/semantics/closedAccess | |
| dc.snmz | KA_WOS_20260130 | |
| dc.subject | Artificial intelligence | |
| dc.subject | Generative artificial intelligence | |
| dc.subject | Large language models | |
| dc.subject | Knee osteoarthritis | |
| dc.subject | Total knee arthroplasty | |
| dc.subject | Evidence-based orthopedics | |
| dc.title | Evaluating Large Language Model Adherence to AAOS Knee Osteoarthritis Guidelines: A Comparative Study of ChatGPT and NotebookLM | |
| dc.type | Article |
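The abstract reports inter-rater reliability between the two surgeon raters as a weighted kappa of 0.87 on the 0-4 rubric. As an illustration only (not the authors' analysis code, and the record does not state which weighting scheme was used), the following is a minimal pure-Python sketch of Cohen's weighted kappa with linear disagreement weights; the rating lists are hypothetical:

```python
def weighted_kappa(rater1, rater2, categories, weights="linear"):
    """Cohen's weighted kappa for two raters over ordered categories.

    weights: "linear" -> |i-j|/(k-1), "quadratic" -> (|i-j|/(k-1))^2
    """
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    n = len(rater1)

    # Observed joint proportion matrix O[i][j]
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater1, rater2):
        obs[idx[a]][idx[b]] += 1.0 / n

    # Marginal proportions for each rater
    p1 = [sum(row) for row in obs]
    p2 = [sum(obs[i][j] for i in range(k)) for j in range(k)]

    def w(i, j):
        d = abs(i - j) / (k - 1)
        return d if weights == "linear" else d * d

    # kappa_w = 1 - (sum w*O) / (sum w*E), E[i][j] = p1[i]*p2[j]
    num = sum(w(i, j) * obs[i][j] for i in range(k) for j in range(k))
    den = sum(w(i, j) * p1[i] * p2[j] for i in range(k) for j in range(k))
    return 1.0 - num / den


# Hypothetical rubric scores (0-4) from two raters: one near-miss disagreement.
kappa = weighted_kappa([4, 4, 3, 4, 2, 3, 4, 4],
                       [4, 3, 3, 4, 2, 3, 4, 4],
                       categories=[0, 1, 2, 3, 4])
```

Perfect agreement yields κ = 1 because all observed mass sits on the zero-weight diagonal; with weighted kappa, near-miss disagreements (e.g. a 4 vs a 3) are penalized less than distant ones, which suits an ordered 0-4 rubric like the one described in the abstract.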