Camera-Based Monocular Depth Estimation in Orthodontics: Vision Transformer vs. CNN Model Performance

Arisan, Arda; Duran, Gokhan Serhat

dc.authorid	0009-0004-8920-8605
dc.authorid	0000-0001-6152-6178
dc.contributor.author	Arisan, Arda
dc.contributor.author	Duran, Gokhan Serhat
dc.date.accessioned	2026-02-03T11:59:50Z
dc.date.available	2026-02-03T11:59:50Z
dc.date.issued	2025
dc.department	Çanakkale Onsekiz Mart Üniversitesi
dc.description.abstract	Background: Monocular Depth Estimation (MDE) is a computer vision approach that predicts spatial depth information from a single two-dimensional image. In orthodontics, where facial soft-tissue evaluation is integral to diagnosis and treatment planning, such methods offer new possibilities for obtaining sagittal profile information from standard frontal photographs. This study aimed to determine whether MDE can extract clinically meaningful information for facial profile assessment. Methods: Standardized frontal photographs and lateral cephalometric radiographs from 82 adult patients (48 Class I, 28 Class II, 6 Class III) were retrospectively analyzed. Three clinically relevant soft-tissue landmarks, Upper Lip Anterior (ULA), Lower Lip Anterior (LLA), and Soft Tissue Pogonion (Pog'), were annotated on frontal photographs, while true vertical line (TVL) analysis from cephalograms served as the reference standard. For each case, anteroposterior (AP) relationships among the three landmarks were represented as ordinal rankings based on predicted depth values, and accuracy was defined as complete agreement between model-derived and reference rankings. Depth maps were generated using one vision transformer model (DPT-Large) and two CNN-based models (DepthAnything-v2 and ZoeDepth). Model performance was evaluated using accuracy, 95% confidence intervals, and effect size measures. Results: The transformer-based DPT-Large achieved clinically acceptable accuracy in 92.7% of cases (76/82; 95% CI: 84.8-97.3), significantly outperforming the CNN-based models DepthAnything-v2 (9.8%) and ZoeDepth (4.9%), both of which performed below the theoretical chance level (16.7%). Conclusions: Vision transformer-based Monocular Depth Estimation demonstrates the potential for clinically meaningful soft-tissue profiling from frontal photographs, suggesting that depth information derived from two-dimensional images may serve as a supportive tool for facial profile evaluation. These findings provide a foundation for future studies exploring the integration of depth-based analysis into digital orthodontic diagnostics.
dc.identifier.doi	10.3390/s25216512
dc.identifier.issn	1424-8220
dc.identifier.issue	21
dc.identifier.pmid	41228735
dc.identifier.scopus	2-s2.0-105021499198
dc.identifier.scopusquality	Q1
dc.identifier.uri	https://doi.org/10.3390/s25216512
dc.identifier.uri	https://hdl.handle.net/20.500.12428/34436
dc.identifier.volume	25
dc.identifier.wos	WOS:001613049200001
dc.identifier.wosquality	Q2
dc.indekslendigikaynak	Web of Science
dc.indekslendigikaynak	Scopus
dc.indekslendigikaynak	PubMed
dc.language.iso	en
dc.publisher	MDPI
dc.relation.ispartof	Sensors
dc.relation.publicationcategory	Article - International Peer-Reviewed Journal - Institutional Faculty Member
dc.rights	info:eu-repo/semantics/openAccess
dc.snmz	KA_WOS_20260130
dc.subject	Monocular Depth Estimation
dc.subject	vision transformer
dc.subject	Convolutional Neural Networks
dc.subject	computer vision
dc.subject	medical imaging
dc.subject	orthodontic diagnostics
dc.title	Camera-Based Monocular Depth Estimation in Orthodontics: Vision Transformer vs. CNN Model Performance
dc.type	Article
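
Illustrative note (not part of the repository record): the abstract scores each case by whether the anteroposterior ordering of the three landmarks matches the reference ordering, and compares accuracy with a chance level of 16.7%, which is 1 in 3! = 6 possible orderings. The sketch below shows one way such a score and an exact binomial interval could be computed. The function names, the sample values, the "larger value means more anterior" convention, and the choice of a Clopper-Pearson interval are all assumptions made for illustration; the record does not state how the authors computed their interval.

```python
from math import comb, factorial


def ap_ranking(values, larger_is_anterior=True):
    """Order landmark names from most anterior to most posterior.

    `values` maps landmark name -> depth (or TVL distance). The sign
    convention is an assumption; flip `larger_is_anterior` if needed.
    """
    return tuple(sorted(values, key=values.get, reverse=larger_is_anterior))


def ranking_accuracy(predicted, reference):
    """Fraction of cases whose full predicted ordering equals the reference."""
    hits = sum(ap_ranking(p) == ap_ranking(r) for p, r in zip(predicted, reference))
    return hits / len(predicted)


def chance_level(n_landmarks):
    """Probability of guessing a full ordering of n landmarks at random."""
    return 1 / factorial(n_landmarks)


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) interval for k successes in n trials, by bisection."""
    def solve(is_below_root):
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if is_below_root(mid):
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: p at which P(X >= k) rises to alpha/2.
    lower = 0.0 if k == 0 else solve(lambda p: 1 - binom_cdf(k - 1, n, p) < alpha / 2)
    # Upper bound: p at which P(X <= k) falls to alpha/2.
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p) > alpha / 2)
    return lower, upper


if __name__ == "__main__":
    # Hypothetical values for two cases; not data from the study.
    predicted = [
        {"ULA": 0.82, "LLA": 0.79, "Pog'": 0.71},
        {"ULA": 0.60, "LLA": 0.66, "Pog'": 0.58},
    ]
    reference = [
        {"ULA": 3.1, "LLA": 1.4, "Pog'": -2.0},
        {"ULA": 2.5, "LLA": 1.0, "Pog'": -1.5},
    ]
    print("accuracy:", ranking_accuracy(predicted, reference))
    print("chance level:", chance_level(3))
    print("interval for 76/82:", clopper_pearson(76, 82))
```

Only the ordering is compared, so the predicted depths and the reference distances do not need to share units or scale. Ties fall back to input order here, which a real analysis would need to handle explicitly.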

Collections

WoS Indexed Publications Collection
PubMed Indexed Publications Collection
Scopus Indexed Publications Collection
