[1] |
Yin W, Zhang J, Wang O, et al. Learning to recover 3D scene shape from a single image. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, TN, USA: IEEE, 2021: 204–213.
|
[2] |
Ranftl R, Bochkovskiy A, Koltun V. Vision transformers for dense prediction. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, QC, Canada: IEEE, 2021: 12159.
|
[3] |
Ranftl R, Lasinger K, Hafner D, et al. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44: 1623–1637. doi: 10.1109/tpami.2020.3019967
|
[4] |
Xian K, Zhang J, Wang O, et al. Structure-guided ranking loss for single image depth prediction. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, WA, USA: IEEE, 2020: 608–617.
|
[5] |
Li Z, Snavely N. MegaDepth: Learning single-view depth prediction from Internet photos. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Salt Lake City, UT, USA: IEEE, 2018: 2041–2050.
|
[6] |
Chen W, Qian S, Fan D, et al. OASIS: A large-scale dataset for single image 3D in the wild. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, WA, USA: IEEE, 2020: 676–685.
|
[7] |
Yin W, Liu Y, Shen C. Virtual normal: Enforcing geometric constraints for accurate and robust depth prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44: 7282–7295. doi: 10.1109/tpami.2021.3097396
|
[8] |
Chen W, Fu Z, Yang D, et al. Single-image depth perception in the wild. In: NIPS'16: Proceedings of the 30th International Conference on Neural Information Processing Systems. New York: ACM, 2016: 730–738.
|
[9] |
Bian J W, Zhan H, Wang N, et al. Unsupervised scale-consistent depth learning from video. International Journal of Computer Vision, 2021, 129 (9): 2548–2564. doi: 10.1007/s11263-021-01484-6
|
[10] |
Wang Y, Chao W L, Garg D, et al. Pseudo-LiDAR from visual depth estimation: Bridging the gap in 3D object detection for autonomous driving. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA: IEEE, 2019: 8437–8445.
|
[11] |
Peng W, Pan H, Liu H, et al. IDA-3D: Instance-depth-aware 3D object detection from stereo vision for autonomous driving. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, WA, USA: IEEE, 2020: 13012.
|
[12] |
Huang J, Chen Z, Ceylan D, et al. 6-DOF VR videos with a single 360-camera. In: 2017 IEEE Virtual Reality (VR). Los Angeles, CA, USA: IEEE, 2017: 37–44.
|
[13] |
Valentin J, Kowdle A, Barron J T, et al. Depth from motion for smartphone AR. ACM Transactions on Graphics, 2018, 37 (6): 1–19. doi: 10.1145/3272127.3275041
|
[14] |
Song S, Yu F, Zeng A, et al. Semantic scene completion from a single depth image. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA: IEEE, 2017: 1746–1754.
|
[15] |
Silberman N, Hoiem D, Kohli P, et al. Indoor segmentation and support inference from RGBD images. In: Fitzgibbon A, Lazebnik S, Perona P, et al., editors. Computer Vision – ECCV 2012. Berlin: Springer, 2012: 746–760.
|
[16] |
Song S, Lichtenberg S P, Xiao J. SUN RGB-D: A RGB-D scene understanding benchmark suite. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, MA, USA: IEEE, 2015: 567–576.
|
[17] |
Yang X, Zhou L, Jiang H, et al. Mobile3DRecon: Real-time monocular 3D reconstruction on a mobile phone. IEEE Transactions on Visualization and Computer Graphics, 2020, 26 (12): 3446–3456. doi: 10.1109/tvcg.2020.3023634
|
[18] |
Azinović D, Martin-Brualla R, Goldman D B, et al. Neural RGB-D surface reconstruction. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, LA, USA: IEEE, 2022: 6280–6291.
|
[19] |
Eftekhar A, Sax A, Malik J, et al. Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3D scans. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, QC, Canada: IEEE, 2021: 10766–10776.
|
[20] |
Bian J W, Zhan H, Wang N, et al. Auto-rectify network for unsupervised indoor depth estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44 (12): 9802–9813. doi: 10.1109/tpami.2021.3136220
|
[21] |
Godard C, Mac Aodha O, Firman M, et al. Digging into self-supervised monocular depth estimation. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea: IEEE, 2019: 3827–3837.
|
[22] |
Watson J, Mac Aodha O, Prisacariu V, et al. The temporal opportunist: Self-supervised multi-frame monocular depth. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, TN, USA: IEEE, 2021: 1164–1174.
|
[23] |
Luo X, Huang J B, Szeliski R, et al. Consistent video depth estimation. ACM Transactions on Graphics, 2020, 39 (4): 71. doi: 10.1145/3386569.3392377
|
[24] |
Kopf J, Rong X, Huang J B. Robust consistent video depth estimation. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, TN, USA: IEEE, 2021: 1611–1621.
|
[25] |
Park J, Joo K, Hu Z, et al. Non-local spatial propagation network for depth completion. In: Vedaldi A, Bischof H, Brox T, et al., editors. Computer Vision – ECCV 2020. Cham: Springer, 2020: 120–136.
|
[26] |
Wong A, Soatto S. Unsupervised depth completion with calibrated backprojection layers. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, QC, Canada: IEEE, 2021: 12727–12736.
|
[27] |
Wong A, Cicek S, Soatto S. Learning topology from synthetic data for unsupervised depth completion. IEEE Robotics and Automation Letters, 2021, 6 (2): 1495–1502. doi: 10.1109/lra.2021.3058072
|
[28] |
Yang Y, Wong A, Soatto S. Dense depth posterior (DDP) from single image and sparse range. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA: IEEE, 2019: 3348–3357.
|
[29] |
He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA: IEEE, 2016: 770–778.
|
[30] |
Liu Z, Lin Y, Cao Y, et al. Swin transformer: Hierarchical vision transformer using shifted windows. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, QC, Canada: IEEE, 2021: 9992–10002.
|
[31] |
Schönberger J L, Frahm J M. Structure-from-motion revisited. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA: IEEE, 2016: 4104–4113.
|
[32] |
Teed Z, Deng J. DeepV2D: Video to depth with differentiable structure from motion. arXiv: 1812.04605, 2018.
|
[33] |
Liu C, Gu J, Kim K, et al. Neural RGB→D sensing: Depth and uncertainty from a video camera. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA: IEEE, 2019: 10978–10987.
|
[34] |
Düzçeker A, Galliani S, Vogel C, et al. DeepVideoMVS: Multi-view stereo on video with recurrent spatio-temporal fusion. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, TN, USA: IEEE, 2021: 15319–15328.
|
[35] |
Zeng A, Song S, Nießner M, et al. 3DMatch: Learning local geometric descriptors from RGB-D reconstructions. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA: IEEE, 2017: 199–208.
|
[36] |
Mur-Artal R, Montiel J M M, Tardós J D. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 2015, 31: 1147–1163. doi: 10.1109/tro.2015.2463671
|
[37] |
Schönberger J L, Zheng E, Frahm J M, et al. Pixelwise view selection for unstructured multi-view stereo. In: European Conference on Computer Vision (ECCV). Cham: Springer, 2016: 501–518.
|
[38] |
Im S, Jeon H G, Lin S, et al. DPSNet: End-to-end deep plane sweep stereo. arXiv: 1905.00538, 2019.
|
[39] |
Wang C, Lucey S, Perazzi F, et al. Web stereo video supervision for depth prediction from dynamic scenes. In: 2019 International Conference on 3D Vision (3DV). Quebec City, QC, Canada: IEEE, 2019: 348–357.
|
[40] |
Chen W, Qian S, Deng J. Learning single-image depth from videos using quality assessment networks. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA: IEEE, 2019: 5597–5606.
|
[41] |
Xie S, Girshick R, Dollár P, et al. Aggregated residual transformations for deep neural networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA: IEEE, 2017: 5987–5995.
|
[42] |
Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv: 2010.11929, 2021.
|
Figure 1. The pipeline for dense 3D scene reconstruction. The main components of our method are a robust monocular depth estimation model trained on 6.3 million images, a locally weighted linear regression strategy, and TSDF fusion[35].
Figure 2. The per-pixel error maps of ground-truth depth and predicted depth aligned through (a) global recovery and (b) local recovery, respectively. (c) The scale map and (d) the shift map of local recovery. Distribution of prediction–GT pairs obtained via (e) global recovery and (f) local recovery, respectively.
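The "global recovery" in the caption above refers to fitting a single scale and shift that aligns an affine-invariant depth prediction to the ground truth, as in the scale-and-shift alignment used by Refs. [1] and [3]. A minimal sketch of that global step, assuming NumPy arrays and an illustrative function name (`global_align` is not from the paper):

```python
import numpy as np

def global_align(pred, gt, mask):
    """Recover one scale s and shift t minimizing ||s * pred + t - gt||^2
    over the valid pixels in `mask`, then apply them to the whole map."""
    p = pred[mask].ravel()
    g = gt[mask].ravel()
    # Closed-form least squares on the design matrix [pred, 1].
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    return s * pred + t
```

Local recovery, by contrast, would fit such a scale/shift pair per neighborhood (yielding the scale and shift maps in panels (c) and (d)) rather than one pair for the full image.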