Is AI the future of evaluation in medical education? AI vs. human evaluation in objective structured clinical examination

dc.authoridKORKMAZ, GUNES/0000-0002-9060-5972
dc.authoridUYSAL, IBRAHIM/0000-0002-7507-3322
dc.contributor.authorTekin, Murat
dc.contributor.authorYurdal, Mustafa Onur
dc.contributor.authorToraman, Cetin
dc.contributor.authorKorkmaz, Gunes
dc.contributor.authorUysal, Ibrahim
dc.date.accessioned2025-05-29T02:57:38Z
dc.date.available2025-05-29T02:57:38Z
dc.date.issued2025
dc.departmentÇanakkale Onsekiz Mart Üniversitesi
dc.description.abstractBackground: Objective Structured Clinical Examinations (OSCEs) are widely used in medical education to assess students' clinical and professional skills. Recent advancements in artificial intelligence (AI) offer opportunities to complement human evaluations. This study aims to explore the consistency between human and AI evaluators in assessing medical students' clinical skills during OSCEs. Methods: This cross-sectional study was conducted at a state university in Turkey, focusing on pre-clinical medical students (Years 1, 2, and 3). Four clinical skills (intramuscular injection, square knot tying, basic life support, and urinary catheterization) were evaluated during OSCEs at the end of the 2023-2024 academic year. Video recordings of the students' performances were assessed by five evaluators: a real-time human assessor, two video-based expert human assessors, and two AI-based systems (ChatGPT-4o and Gemini Flash 1.5). The evaluations were based on standardized checklists validated by the university. Data were collected from 196 students, with sample sizes ranging from 43 to 58 for each skill. Consistency among evaluators was analyzed using statistical methods. Results: AI models consistently assigned higher scores than human evaluators across all skills. For intramuscular injection, the mean total score given by AI was 28.23, while human evaluators averaged 25.25. For knot tying, AI scores averaged 16.07 versus 10.44 for humans. In basic life support, AI scores were 17.05 versus 16.48 for humans. For urinary catheterization, mean scores were similar (AI: 26.68; humans: 27.02) but showed considerable variance across individual criteria. Inter-rater consistency was higher for visually observable steps, while auditory tasks led to greater discrepancies between AI and human evaluators. Conclusions: AI shows promise as a supplemental tool for OSCE evaluation, especially for visually based clinical skills. However, its reliability varies depending on the perceptual demands of the skill being assessed. The higher and more uniform scores given by AI suggest potential for standardization, yet refinement is needed for accurate assessment of skills requiring verbal communication or auditory cues.
dc.identifier.doi10.1186/s12909-025-07241-4
dc.identifier.issn1472-6920
dc.identifier.issue1
dc.identifier.pmid40312328
dc.identifier.scopus2-s2.0-105003978208
dc.identifier.scopusqualityQ1
dc.identifier.urihttps://doi.org/10.1186/s12909-025-07241-4
dc.identifier.urihttps://hdl.handle.net/20.500.12428/30114
dc.identifier.volume25
dc.identifier.wosWOS:001479968600001
dc.identifier.wosqualityQ1
dc.indekslendigikaynakWeb of Science
dc.indekslendigikaynakScopus
dc.indekslendigikaynakPubMed
dc.language.isoen
dc.publisherBMC
dc.relation.ispartofBMC Medical Education
dc.relation.publicationcategoryArticle - International Peer-Reviewed Journal - Institutional Academic Staff
dc.rightsinfo:eu-repo/semantics/closedAccess
dc.snmzKA_WOS_20250529
dc.subjectOSCE
dc.subjectClinical skills assessment
dc.subjectArtificial intelligence
dc.subjectMedical education
dc.subjectEvaluator consistency
dc.subjectInterrater reliability
dc.titleIs AI the future of evaluation in medical education? AI vs. human evaluation in objective structured clinical examination
dc.typeArticle
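
Note on the consistency analysis: the abstract states that consistency among evaluators was analyzed using statistical methods but does not name the specific statistic. As a rough, non-authoritative illustration of how such an inter-rater consistency check could be run for this design (n students each scored by 5 evaluators), the Python sketch below computes ICC(2,1), the Shrout & Fleiss two-way random effects, absolute-agreement, single-rater coefficient. The function name icc_2_1 and all scores are hypothetical placeholders, not data from the study.

import numpy as np

def icc_2_1(scores: np.ndarray) -> float:
    """ICC(2,1) for a (n_subjects, k_raters) score matrix (Shrout & Fleiss)."""
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)   # per-student means
    col_means = scores.mean(axis=0)   # per-evaluator means

    ss_total = ((scores - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()   # between-subjects
    ss_cols = n * ((col_means - grand) ** 2).sum()   # between-raters
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Two-way random effects, absolute agreement, single rater
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical total checklist scores: 50 students x 5 evaluators
    # (1 real-time human, 2 video-based humans, 2 AI models).
    true_skill = rng.normal(25, 3, size=(50, 1))
    ratings = true_skill + rng.normal(0, 2, size=(50, 5))
    print(f"ICC(2,1) = {icc_2_1(ratings):.3f}")

Other agreement statistics (e.g., Cohen's or Fleiss' kappa on individual checklist items, or Krippendorff's alpha) would be equally plausible choices; the sketch is only meant to show the shape of such an analysis.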

Files