Camera-Based Monocular Depth Estimation in Orthodontics: Vision Transformer vs. CNN Model Performance

dc.authorid: 0009-0004-8920-8605
dc.authorid: 0000-0001-6152-6178
dc.contributor.author: Arisan, Arda
dc.contributor.author: Duran, Gokhan Serhat
dc.date.accessioned: 2026-02-03T11:59:50Z
dc.date.available: 2026-02-03T11:59:50Z
dc.date.issued: 2025
dc.department: Çanakkale Onsekiz Mart Üniversitesi
dc.description.abstract: Background: Monocular Depth Estimation (MDE) is a computer vision approach that predicts spatial depth information from a single two-dimensional image. In orthodontics, where facial soft-tissue evaluation is integral to diagnosis and treatment planning, such methods offer a way to obtain sagittal profile information from standard frontal photographs. This study aimed to determine whether MDE can extract clinically meaningful information for facial profile assessment. Methods: Standardized frontal photographs and lateral cephalometric radiographs from 82 adult patients (48 Class I, 28 Class II, 6 Class III) were retrospectively analyzed. Three clinically relevant soft-tissue landmarks, Upper Lip Anterior (ULA), Lower Lip Anterior (LLA), and Soft Tissue Pogonion (Pog'), were annotated on the frontal photographs, while true vertical line (TVL) analysis from the cephalograms served as the reference standard. For each case, the anteroposterior (AP) relationships among the three landmarks were represented as an ordinal ranking based on predicted depth values, and accuracy was defined as complete agreement between the model-derived and reference rankings. Depth maps were generated using one vision transformer model (DPT-Large) and two CNN-based models (DepthAnything-v2 and ZoeDepth). Model performance was evaluated using accuracy, 95% confidence intervals, and effect size measures. Results: The transformer-based DPT-Large achieved clinically acceptable accuracy in 92.7% of cases (76/82; 95% CI: 84.8-97.3), significantly outperforming the CNN-based models DepthAnything-v2 (9.8%) and ZoeDepth (4.9%), both of which performed below the theoretical chance level of 16.7% (one in six possible orderings of three landmarks).
Conclusions: Vision transformer-based Monocular Depth Estimation shows potential for clinically meaningful soft-tissue profiling from frontal photographs, suggesting that depth information derived from two-dimensional images may serve as a supportive tool for facial profile evaluation. These findings provide a foundation for future studies on integrating depth-based analysis into digital orthodontic diagnostics.
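The ranking-based accuracy described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the depth values, and the convention that a larger value means a more anterior landmark are all assumptions made for illustration. It shows how a per-case ordinal ranking might be compared against a reference, and why three landmarks give a chance level of 1/6.

```python
from itertools import permutations

# Landmark names taken from the abstract.
LANDMARKS = ("ULA", "LLA", "Pog'")


def ap_ranking(values):
    """Order landmarks from most anterior to most posterior.

    Assumption (hypothetical convention): a larger value means the
    landmark lies further anterior. Ties are not handled specially.
    """
    return tuple(sorted(LANDMARKS, key=lambda name: values[name], reverse=True))


def ranking_accuracy(predicted_cases, reference_cases):
    """Fraction of cases whose predicted ranking matches the reference exactly."""
    matches = sum(
        ap_ranking(pred) == ap_ranking(ref)
        for pred, ref in zip(predicted_cases, reference_cases)
    )
    return matches / len(predicted_cases)


# Three landmarks can be ordered in 3! = 6 ways, so a random ranking
# agrees with the reference in 1 of 6 cases.
chance_level = 1 / len(list(permutations(LANDMARKS)))

# Made-up example values for two hypothetical cases.
predicted = [
    {"ULA": 0.82, "LLA": 0.79, "Pog'": 0.70},
    {"ULA": 0.60, "LLA": 0.66, "Pog'": 0.58},
]
reference = [
    {"ULA": 4.1, "LLA": 2.3, "Pog'": -1.0},
    {"ULA": 3.0, "LLA": 1.5, "Pog'": -2.2},
]

print(ranking_accuracy(predicted, reference))
print(round(chance_level * 100, 1))
```

Because only the order of the three values is compared, the predicted and reference values do not need to share a unit or scale, which is why relative depth maps can be checked against cephalometric measurements in this way.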
dc.identifier.doi: 10.3390/s25216512
dc.identifier.issn: 1424-8220
dc.identifier.issue: 21
dc.identifier.pmid: 41228735
dc.identifier.scopus: 2-s2.0-105021499198
dc.identifier.scopusquality: Q1
dc.identifier.uri: https://doi.org/10.3390/s25216512
dc.identifier.uri: https://hdl.handle.net/20.500.12428/34436
dc.identifier.volume: 25
dc.identifier.wos: WOS:001613049200001
dc.identifier.wosquality: Q2
dc.indekslendigikaynak: Web of Science
dc.indekslendigikaynak: Scopus
dc.indekslendigikaynak: PubMed
dc.language.iso: en
dc.publisher: MDPI
dc.relation.ispartof: Sensors
dc.relation.publicationcategory: Article - International Peer-Reviewed Journal - Institutional Faculty Member
dc.rights: info:eu-repo/semantics/openAccess
dc.snmz: KA_WOS_20260130
dc.subject: Monocular Depth Estimation
dc.subject: vision transformer
dc.subject: Convolutional Neural Networks
dc.subject: computer vision
dc.subject: medical imaging
dc.subject: orthodontic diagnostics
dc.title: Camera-Based Monocular Depth Estimation in Orthodontics: Vision Transformer vs. CNN Model Performance
dc.type: Article