Multi-Source Auxiliary Tasks supported Monocular Depth Estimation
Conference Presentation (After Call) | FZJ-2025-00071
2024
Abstract: Monocular depth estimation (MDE) is a challenging task in computer vision, often hindered by the cost and scarcity of high-quality labeled datasets. We tackle this challenge by using auxiliary datasets from related vision tasks to jointly train a shared decoder on top of a pre-trained vision foundation model, while assigning a higher weight to MDE. In particular, we leverage a frozen DINOv2 ViT Giant model as a feature extractor, bypassing the need for fine-tuning, and jointly train a shared DPT decoder with auxiliary datasets from related tasks to improve MDE. We illustrate the qualitative and quantitative improvements of our method over the DINOv2 MDE baseline in Figures 1 and 2, respectively. Notably, compared to the recent Depth Anything, which reports no improvements when using a jointly fine-tuned DINOv2 ViT Large and task-specific decoders, our method successfully leverages auxiliary tasks. Through extensive experiments we demonstrate the benefits of incorporating various auxiliary datasets and tasks, improving MDE quality by ~11% on average for related datasets. Our experimental analysis shows that auxiliary tasks have different impacts, confirming the importance of task selection and highlighting that quality gains are not achieved by merely adding data. Remarkably, our study reveals that treating semantic segmentation datasets as multi-label dense classification often yields additional quality gains.
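The weighted joint-training objective described in the abstract (a shared decoder trained on several tasks, with the primary MDE task weighted higher) can be sketched as a weighted sum of per-task losses. This is a minimal illustrative sketch; the task names, weight values, and loss values below are assumptions for demonstration, not the authors' actual configuration.

```python
# Hedged sketch of a weighted multi-task training objective: each task
# contributes its loss, scaled by a task-specific weight, with the primary
# depth task weighted higher than the auxiliary tasks.

def joint_loss(task_losses, task_weights):
    """Combine per-task batch losses into one scalar objective.

    task_losses:  dict mapping task name -> loss value for the current batch
    task_weights: dict mapping task name -> scalar weight
    """
    return sum(task_weights[t] * task_losses[t] for t in task_losses)

# Hypothetical weights: depth (MDE) dominates; auxiliary tasks such as
# semantic segmentation contribute with smaller weights.
weights = {"depth": 1.0, "semseg": 0.2, "normals": 0.2}

# Hypothetical per-task losses for one training batch.
losses = {"depth": 0.50, "semseg": 0.80, "normals": 0.40}

total = joint_loss(losses, weights)  # 0.5*1.0 + 0.8*0.2 + 0.4*0.2 = 0.74
```

In practice, gradients from this combined loss would update only the shared decoder, since the DINOv2 backbone is kept frozen as a feature extractor.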