
DepthLab: From Partial to Complete

1 HKU   2 HKUST   3 Ant Group   4 Aalto University   5 Tongyi Lab

We propose a robust depth inpainting foundation model that can be applied to various
downstream tasks to enhance performance.

Abstract

Missing values remain a common challenge for depth data across its wide range of applications, stemming from various causes like incomplete data acquisition and perspective alteration. This work bridges this gap with DepthLab, a foundation depth inpainting model powered by image diffusion priors. Our model features two notable strengths: (1) it demonstrates resilience to depth-deficient regions, providing reliable completion for both continuous areas and isolated points, and (2) it faithfully preserves scale consistency with the conditioned known depth when filling in missing values. Drawing on these advantages, our approach proves its worth in various downstream tasks, including 3D scene inpainting, text-to-3D scene generation, sparse-view reconstruction with DUST3R, and LiDAR depth completion, exceeding current solutions in both numerical performance and visual quality.

Method


DepthLab Pipeline

We apply random masking to the ground truth depth to create the masked depth, followed by interpolation. Both the interpolated masked depth and the original depth undergo random scale normalization before being fed into the encoder. The Reference U-Net extracts RGB features, while the Estimation U-Net takes the noisy depth, masked depth, and encoded mask as input. Layer-by-layer feature fusion allows for finer-grained visual guidance, achieving high-quality depth predictions even in large or complex masked regions.
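
As a rough illustration of this pre-processing (not the actual training code), the masking, interpolation, and normalization steps can be sketched in a few lines of Python; the function and parameter names below are ours, and the percentile-based normalization is one reasonable choice among several:

    import numpy as np
    from scipy.interpolate import griddata

    def prepare_masked_depth(depth, mask_ratio=0.5, seed=0):
        """Illustrative pre-processing: random masking, interpolation, normalization."""
        rng = np.random.default_rng(seed)
        h, w = depth.shape

        # 1. Randomly drop a fraction of the ground-truth depth values.
        known = rng.random((h, w)) > mask_ratio  # True where depth is kept

        # 2. Fill the missing regions by nearest-neighbour interpolation so the
        #    conditioning signal fed to the encoder is dense.
        ys, xs = np.nonzero(known)
        grid_y, grid_x = np.mgrid[0:h, 0:w]
        interp = griddata((ys, xs), depth[known], (grid_y, grid_x), method="nearest")

        # 3. Normalize both maps with a shared scale and shift so values roughly
        #    fall in [-1, 1] before they are encoded.
        lo, hi = np.percentile(depth[known], [2, 98])

        def norm(d):
            return np.clip((d - lo) / max(hi - lo, 1e-6), 0.0, 1.0) * 2.0 - 1.0

        return norm(depth), norm(interp), known

In training, the masks range from isolated points to large continuous regions, matching the deficiency patterns described in the abstract.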

Comparisons



In the second column, black marks the known regions and white marks the regions to be predicted. To highlight the contrast, we paste the known ground-truth depth back into the corresponding positions of the depth-map visualizations on the right. Other methods exhibit clear geometric inconsistency between the predicted and known regions.
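
This reattachment is a simple composite of the two depth maps under the known-region mask; a minimal sketch (names are ours) looks like:

    import numpy as np

    def composite_for_visualization(pred_depth, gt_depth, known_mask):
        # Paste the known ground-truth depth back over the prediction so that any
        # scale or geometry mismatch at the mask boundary becomes visible.
        return np.where(known_mask, gt_depth, pred_depth)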

Downstream Tasks

3D Gaussian Inpainting

In 3D scenes, we first complete the depth of the inpainted image regions in the posed reference views, then unproject these points into 3D space to provide a strong initialization, which significantly improves both the quality and speed of 3D scene inpainting.
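
A minimal sketch of the unprojection step, assuming pinhole intrinsics K and a 4x4 camera-to-world pose in OpenCV convention (names are illustrative):

    import numpy as np

    def unproject_depth(depth, K, c2w):
        """Lift an (H, W) depth map to world-space points."""
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        # Pixel -> camera coordinates, scaled by depth.
        x = (u - K[0, 2]) / K[0, 0] * depth
        y = (v - K[1, 2]) / K[1, 1] * depth
        pts_cam = np.stack([x, y, depth, np.ones_like(depth)], axis=-1).reshape(-1, 4)
        # Camera -> world coordinates.
        return (c2w @ pts_cam.T).T[:, :3]

Only the points falling inside the inpainted image regions are kept for initializing the new Gaussians, e.g. pts = unproject_depth(completed_depth, K, c2w)[inpaint_mask.reshape(-1)].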

Text to Scene Generation

Our method substantially improves 3D scene generation from a single image by eliminating the need for depth alignment. This effectively mitigates the disjointed edges that previously arose from geometric inconsistencies.
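
For context, the alignment step that scale-ambiguous monocular estimators typically require (and that conditioning on the known depth lets us skip) is a least-squares scale-and-shift fit against the existing scene depth, roughly:

    import numpy as np

    def align_scale_shift(pred, known, overlap_mask):
        """Fit s, t such that s * pred + t ~= known on the overlap region, then
        apply the fit to the whole prediction (the step DepthLab avoids)."""
        p, k = pred[overlap_mask].ravel(), known[overlap_mask].ravel()
        A = np.stack([p, np.ones_like(p)], axis=1)
        (s, t), *_ = np.linalg.lstsq(A, k, rcond=None)
        return s * pred + t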


Sparse-view Gaussian Reconstruction with DUST3R

Our approach first generates a mask of pixels that have no matches in any source image; these unmatched regions are then refined with DepthLab. This sharpens the initial depth produced by DUST3R and substantially improves Gaussian-splatting rendering quality.
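
A hedged sketch of the masking step, assuming a per-pixel match count (and optionally a confidence map) aggregated from the pairwise matches; these names are ours, not the DUST3R API:

    import numpy as np

    def unmatched_mask(match_count, confidence=None, conf_thresh=1.0):
        """Mark pixels with no correspondence in any source image (optionally also
        pixels whose matching confidence is too low). The returned mask is the
        region handed to DepthLab for refinement."""
        mask = match_count == 0
        if confidence is not None:
            mask |= confidence < conf_thresh
        return mask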


Sparse Depth Completion

Unlike existing methods that are trained and tested on a single dataset, such as NYUv2, our approach achieves comparable results in a zero-shot setting and can deliver even better outcomes with minimal fine-tuning.
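
A minimal sketch of the usual sparse-completion protocol, where a few hundred random points of the ground-truth depth serve as the known input and standard metrics are computed on the completed map (the exact point count follows each benchmark; names are ours):

    import numpy as np

    def sample_sparse_depth(gt_depth, n_points=500, seed=0):
        """Keep only n_points random valid pixels as the 'known' sparse input."""
        rng = np.random.default_rng(seed)
        valid = np.flatnonzero(gt_depth > 0)
        keep = rng.choice(valid, size=min(n_points, valid.size), replace=False)
        sparse = np.zeros_like(gt_depth)
        sparse.flat[keep] = gt_depth.flat[keep]
        return sparse

    def rmse_absrel(pred, gt):
        """Standard completion metrics, evaluated on valid ground-truth pixels."""
        m = gt > 0
        rmse = np.sqrt(np.mean((pred[m] - gt[m]) ** 2))
        absrel = np.mean(np.abs(pred[m] - gt[m]) / gt[m])
        return rmse, absrel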



Discussion on Future Work

First, we discuss potential downstream tasks where our model could be applied, such as 4D scene generation or reconstruction, robotic navigation, editing in VR/AR, and a series of works related to DUST3R. In short, any task requiring depth estimation that comes with inherent known information (either partial ground truth obtained through rendering or sensors, or depth warped from a changed camera pose, as sketched below) can leverage our model for more accurate depth estimation and thereby improved results.
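
To make the warped-depth case concrete, here is a rough forward-warping sketch; target pixels that receive no source point stay empty and are exactly the regions a completion model would fill in (poses are 4x4 camera-to-world matrices, names are illustrative):

    import numpy as np

    def warp_depth(depth, K, src_c2w, tgt_c2w):
        """Forward-warp an (H, W) depth map from a source pose to a target pose.
        Target pixels hit by no source point stay at 0, i.e. 'missing'."""
        h, w = depth.shape
        v, u = np.nonzero(depth > 0)                      # valid source pixels
        z = depth[v, u]
        # Unproject to world space.
        x = (u - K[0, 2]) / K[0, 0] * z
        y = (v - K[1, 2]) / K[1, 1] * z
        world = src_c2w @ np.stack([x, y, z, np.ones_like(z)])   # (4, N)
        # Project into the target view.
        cam = np.linalg.inv(tgt_c2w) @ world
        zt = cam[2]
        ut = np.round(cam[0] / zt * K[0, 0] + K[0, 2]).astype(int)
        vt = np.round(cam[1] / zt * K[1, 1] + K[1, 2]).astype(int)
        ok = (zt > 0) & (ut >= 0) & (ut < w) & (vt >= 0) & (vt < h)
        warped = np.zeros_like(depth)
        warped[vt[ok], ut[ok]] = zt[ok]                   # z-buffering omitted for brevity
        return warped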

Next, we see several possible directions for further research:

  • How can the entire estimation process be accelerated, for example with LCM or Flow Matching techniques?
  • Can the same idea be applied to normal estimation?
  • If camera pose information were also incorporated into the model, could it improve performance in scenarios involving viewpoint transformations?
  • Our core idea is to use known information to achieve better depth estimation, which is even more critical in video depth estimation, since adjacent frames share a large amount of approximately redundant information. How can we design a video depth estimation model that exploits this shared information across adjacent frames to enhance temporal consistency?

We think these are all interesting questions. If you have any questions or would like to discuss further, please feel free to contact us at zhihengl0528@connect.hku.hk.

BibTeX

    @article{liu2024depthlab,
      author       = {Zhiheng Liu and Ka Leong Cheng and Qiuyu Wang and Shuzhe Wang and Hao Ouyang and Bin Tan and Kai Zhu and Yujun Shen and Qifeng Chen and Ping Luo},
      title        = {DepthLab: From Partial to Complete},
      journal      = {CoRR},
      volume       = {abs/xxxx.xxxxx},
      year         = {2024},
    }