Evaluating Large Language Model Adherence to AAOS Knee Osteoarthritis Guidelines: A Comparative Study of ChatGPT and NotebookLM

dc.authorid: 0000-0002-2345-0827
dc.authorid: 0000-0001-7526-6071
dc.authorid: 0000-0003-1243-2057
dc.contributor.author: Yildiz, Cemil
dc.contributor.author: Gokmen, Mehmet Yigit
dc.contributor.author: Utlu, Cetin
dc.contributor.author: Karabay, Elif Sude
dc.contributor.author: Tanik, Ugur
dc.contributor.author: Pazarci, Ozhan
dc.date.accessioned: 2026-02-03T12:02:50Z
dc.date.available: 2026-02-03T12:02:50Z
dc.date.issued: 2025
dc.department: Çanakkale Onsekiz Mart Üniversitesi
dc.description.abstract:

Purpose: This study evaluated how closely large language models (LLMs), specifically ChatGPT (OpenAI) and NotebookLM (Google), adhere to orthopedic guidelines. The objective was to determine whether AI-generated reasoning aligns with the 2021-2022 American Academy of Orthopaedic Surgeons (AAOS) clinical practice guidelines for knee osteoarthritis (OA).

Methods: A mixed-methods design combined quantitative concordance scoring with qualitative content analysis. Thirty-three decision points covering non-arthroplasty and surgical management were derived from AAOS guidelines. Structured Population-Intervention-Comparison-Outcome (PICO) prompts were presented to each model. Two orthopedic surgeons independently rated all outputs using a four-domain rubric assessing accuracy, evidence reasoning, additional information, and knowledge integration (0-4 scale). Concordance was classified as full (4), partial (3), or discordant (≤ 2), with disagreements resolved through consensus. Inter-rater reliability was almost perfect (weighted κ = 0.87).

Results: ChatGPT achieved a mean composite score of 3.67 ± 0.92 and NotebookLM 3.55 ± 0.87, with no significant difference between models (p = 0.18). Full concordance with AAOS recommendations occurred in 84.8% of ChatGPT responses and 75.8% of NotebookLM responses. Both models performed consistently in high-evidence domains such as NSAID therapy, tranexamic acid use, and weight-loss counseling. Variability increased in limited-evidence or technology-driven areas. Partial concordance reflected the omission of evidence qualifiers, while discordant responses involved overstated or speculative interpretations.

Conclusion: Both LLMs demonstrated strong alignment with evidence-based orthopedic reasoning. ChatGPT showed slightly higher fidelity to recommendation strength, whereas NotebookLM provided broader contextual interpretation. Structured, guideline-oriented prompting may enhance AI reasoning consistency and support the potential role of LLMs as complementary tools for evidence translation and orthopedic education.
dc.identifier.doi: 10.1007/s43465-025-01653-6
dc.identifier.issn: 0019-5413
dc.identifier.issn: 1998-3727
dc.identifier.scopus: 2-s2.0-105024089258
dc.identifier.scopusquality: Q3
dc.identifier.uri: https://doi.org/10.1007/s43465-025-01653-6
dc.identifier.uri: https://hdl.handle.net/20.500.12428/34885
dc.identifier.wos: WOS:001631448600001
dc.identifier.wosquality: Q3
dc.indekslendigikaynak: Web of Science
dc.indekslendigikaynak: Scopus
dc.language.iso: en
dc.publisher: Springer Heidelberg
dc.relation.ispartof: Indian Journal of Orthopaedics
dc.relation.publicationcategory: Article - International Peer-Reviewed Journal - Institutional Faculty Member
dc.rights: info:eu-repo/semantics/closedAccess
dc.snmz: KA_WOS_20260130
dc.subject: Artificial intelligence
dc.subject: Generative artificial intelligence
dc.subject: Large language models
dc.subject: Knee osteoarthritis
dc.subject: Total knee arthroplasty
dc.subject: Evidence-based orthopedics
dc.title: Evaluating Large Language Model Adherence to AAOS Knee Osteoarthritis Guidelines: A Comparative Study of ChatGPT and NotebookLM
dc.type: Article

Files