Camera-Based Monocular Depth Estimation in Orthodontics: Vision Transformer vs. CNN Model Performance

Arisan, Arda; Duran, Gokhan Serhat

Date

2025

Authors

Arisan, Arda

Duran, Gokhan Serhat

Publisher

MDPI

Access Rights

info:eu-repo/semantics/openAccess

Abstract

Background: Monocular Depth Estimation (MDE) is a computer vision approach that predicts spatial depth information from a single two-dimensional image. In orthodontics, where facial soft-tissue evaluation is integral to diagnosis and treatment planning, such methods offer new possibilities for obtaining sagittal profile information from standard frontal photographs. This study aimed to determine whether MDE can extract clinically meaningful information for facial profile assessment. Methods: Standardized frontal photographs and lateral cephalometric radiographs from 82 adult patients (48 Class I, 28 Class II, 6 Class III) were retrospectively analyzed. Three clinically relevant soft-tissue landmarks, Upper Lip Anterior (ULA), Lower Lip Anterior (LLA), and Soft Tissue Pogonion (Pog'), were annotated on frontal photographs, while true vertical line (TVL) analysis from cephalograms served as the reference standard. For each case, anteroposterior (AP) relationships among the three landmarks were represented as ordinal rankings based on predicted depth values, and accuracy was defined as complete agreement between model-derived and reference rankings. Depth maps were generated using one vision transformer model (DPT-Large) and two CNN-based models (DepthAnything-v2 and ZoeDepth). Model performance was evaluated using accuracy, 95% confidence intervals, and effect size measures. Results: The transformer-based DPT-Large reproduced the reference ranking in 92.7% of cases (76/82; 95% CI: 84.8-97.3), significantly outperforming the CNN-based models DepthAnything-v2 (9.8%) and ZoeDepth (4.9%), both of which performed below the theoretical chance level (16.7%, i.e., 1 of the 6 possible orderings of three landmarks). Conclusions: Vision transformer-based MDE demonstrates the potential for clinically meaningful soft-tissue profiling from frontal photographs, suggesting that depth information derived from two-dimensional images may serve as a supportive tool for facial profile evaluation. These findings provide a foundation for future studies exploring the integration of depth-based analysis into digital orthodontic diagnostics.
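The ranking-based accuracy and the chance level described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the toy depth values, and the sign convention (a larger value means more anterior) are assumptions made for illustration, and the Wilson score interval is swapped in as one common choice for a binomial proportion. The abstract does not state which interval method was used, so the bounds computed here need not match the reported 84.8-97.3.

```python
from itertools import permutations
from math import sqrt

# The three soft-tissue landmarks named in the abstract.
LANDMARKS = ("ULA", "LLA", "Pog'")


def ap_ranking(depths):
    """Order landmarks from most anterior to most posterior.

    Assumption: a larger value means more anterior. Ties keep the
    order given in LANDMARKS because Python's sort is stable.
    """
    return tuple(sorted(LANDMARKS, key=lambda name: depths[name], reverse=True))


def ranking_accuracy(predicted, reference):
    """Fraction of cases whose full three-landmark ordering matches."""
    hits = sum(ap_ranking(p) == ap_ranking(r) for p, r in zip(predicted, reference))
    return hits / len(predicted)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Three landmarks can be ordered in 3! = 6 ways, so a random guess
# matches the reference ordering with probability 1/6 (about 16.7%).
chance_level = 1 / len(list(permutations(LANDMARKS)))

# Hypothetical toy data: model depths (arbitrary units) and
# reference positions relative to the TVL (arbitrary units).
predicted = [
    {"ULA": 0.9, "LLA": 0.7, "Pog'": 0.5},
    {"ULA": 0.4, "LLA": 0.8, "Pog'": 0.6},
]
reference = [
    {"ULA": 3.0, "LLA": 1.0, "Pog'": -2.0},
    {"ULA": 2.0, "LLA": 1.5, "Pog'": -1.0},
]
toy_accuracy = ranking_accuracy(predicted, reference)

# Interval for the reported DPT-Large result of 76 correct out of 82.
low, high = wilson_interval(76, 82)
print(f"chance={chance_level:.3f} toy_accuracy={toy_accuracy:.2f} "
      f"wilson=({low:.3f}, {high:.3f})")
```

Because only the ordering is compared, the predicted depths and the reference measurements do not need to share units or scale, which is why relative depth maps can be scored against cephalometric measurements in this way.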

Keywords

Monocular Depth Estimation, vision transformer, Convolutional Neural Networks, computer vision, medical imaging, orthodontic diagnostics

Source

Sensors

WoS Quartile

Q2

Scopus Quartile

Q1

Volume

25

Issue

21

Link

https://doi.org/10.3390/s25216512
https://hdl.handle.net/20.500.12428/34436

Collection

WoS Indexed Publications Collection
PubMed Indexed Publications Collection
Scopus Indexed Publications Collection
