ISSN 0253-2778

CN 34-1054/N

Open Access JUSTC Article 15 August 2023

Toward 3D scene reconstruction from locally scale-aligned monocular video depth

Cite this:
https://doi.org/10.52396/JUSTC-2023-0061
More Information
  • Author Bio:

    Guangkai Xu is currently a Ph.D. student at Zhejiang University. He received the B.E. degree from the University of Electronic Science and Technology of China (UESTC) and the M.E. degree from the University of Science and Technology of China. He is interested in computer vision and machine learning.

    Feng Zhao is currently a Professor in the Department of Automation at the University of Science and Technology of China (USTC). He received the B.E. degree from USTC, and the M.Phil. and Ph.D. degrees from The Chinese University of Hong Kong (CUHK), MMLab. He is interested in artificial intelligence, machine learning, and computer vision.

  • Corresponding author: E-mail: fzhao956@ustc.edu.cn
  • Received Date: 11 April 2023
  • Accepted Date: 11 May 2023
  • Available Online: 15 August 2023
  • Monocular depth estimation methods have achieved excellent robustness on diverse scenes, usually by predicting affine-invariant depth, which is determined only up to an unknown scale and shift, rather than metric depth, because large-scale affine-invariant depth training data are much easier to collect. However, in video-based scenarios such as video depth estimation and 3D scene reconstruction, the unknown scale and shift residing in per-frame predictions may cause the depth to be inconsistent across frames. To tackle this problem, we propose a locally weighted linear regression method that recovers the scale and shift maps from very sparse anchor points, which ensures consistency along consecutive frames. Extensive experiments show that our method reduces the Rel error of existing state-of-the-art approaches by up to 50% on several zero-shot benchmarks. Besides, we merge 6.3 million RGBD images to train robust depth models. With locally recovered scale and shift, our ResNet50-backbone model even outperforms the state-of-the-art DPT ViT-Large model. Combined with geometry-based reconstruction methods, we formulate a new dense 3D scene reconstruction pipeline that benefits from both the scale consistency of sparse points and the robustness of monocular methods. With simple per-frame predictions over a video, accurate 3D scene geometry can be recovered.
    The pipeline for dense 3D scene reconstruction. It is composed of a robust monocular depth estimation module, a metric depth recovery module, and an RGB-D fusion module.
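The locally weighted linear regression at the core of the metric depth recovery module can be sketched briefly. The code below is a minimal illustration, not the authors' implementation: for each pixel, it fits a scale and shift to the sparse metric anchors by Gaussian-weighted least squares, so nearby anchors dominate the local fit. The function name, the Gaussian kernel, the bandwidth, and the regularization term are all assumptions of this sketch.

```python
import numpy as np

def lwlr_scale_shift(pred, anchor_uv, anchor_depth, bandwidth=0.2):
    """Fit a per-pixel scale/shift map from sparse metric anchor points.

    pred         -- (H, W) affine-invariant depth prediction
    anchor_uv    -- (N, 2) integer anchor coordinates as (row, col)
    anchor_depth -- (N,) metric depth values at the anchors
    bandwidth    -- Gaussian kernel width, as a fraction of the image diagonal
    """
    H, W = pred.shape
    sigma = bandwidth * np.hypot(H, W)
    d_pred = pred[anchor_uv[:, 0], anchor_uv[:, 1]]        # (N,) predictions at anchors
    A = np.stack([d_pred, np.ones_like(d_pred)], axis=1)   # (N, 2) design matrix

    scale = np.empty_like(pred, dtype=np.float64)
    shift = np.empty_like(pred, dtype=np.float64)
    for y in range(H):
        for x in range(W):
            # Gaussian weights: anchors close to (y, x) dominate the local fit
            dist2 = (anchor_uv[:, 0] - y) ** 2 + (anchor_uv[:, 1] - x) ** 2
            w = np.exp(-dist2 / (2.0 * sigma ** 2))
            # Weighted least squares with a small Tikhonov term for stability:
            # (A^T W A + eps I) theta = A^T W d_anchor
            AtW = A.T * w
            theta = np.linalg.solve(AtW @ A + 1e-6 * np.eye(2), AtW @ anchor_depth)
            scale[y, x], shift[y, x] = theta

    return scale * pred + shift, scale, shift   # metric depth, scale map, shift map
```

The per-pixel loop costs O(H·W·N); a practical variant would solve the weighted fit only on a coarse grid of locations and bilinearly interpolate the resulting scale and shift maps to full resolution.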
  • [1]
    Yin W, Zhang J, Wang O, et al. Learning to recover 3D scene shape from a single image. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, TN, USA: IEEE, 2021: 204–213.
    [2]
    Ranftl R, Bochkovskiy A, Koltun V. Vision transformers for dense prediction. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, QC, Canada: IEEE, 2021: 12159.
    [3]
    Ranftl R, Lasinger K, Hafner D, et al. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44: 1623–1637. doi: 10.1109/tpami.2020.3019967
    [4]
    Xian K, Zhang J, Wang O, et al. Structure-guided ranking loss for single image depth prediction. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, WA, USA: IEEE, 2020: 608–617.
    [5]
    Li Z, Snavely N. MegaDepth: Learning single-view depth prediction from Internet photos. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE, 2018: 2041–2050.
    [6]
    Chen W, Qian S, Fan D, et al. OASIS: A large-scale dataset for single image 3D in the wild. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, WA, USA: IEEE, 2020: 676–685.
    [7]
    Yin W, Liu Y, Shen C. Virtual normal: Enforcing geometric constraints for accurate and robust depth prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44: 7282–7295. doi: 10.1109/tpami.2021.3097396
    [8]
    Chen W, Fu Z, Yang D, et al. Single-image depth perception in the wild. In: NIPS'16: Proceedings of the 30th International Conference on Neural Information Processing Systems. New York: ACM, 2016: 730–738.
    [9]
    Bian J W, Zhan H, Wang N, et al. Unsupervised scale-consistent depth learning from video. International Journal of Computer Vision, 2021, 129 (9): 2548–2564. doi: 10.1007/s11263-021-01484-6
    [10]
    Wang Y, Chao W L, Garg D, et al. Pseudo-LiDAR from visual depth estimation: Bridging the gap in 3D object detection for autonomous driving. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA: IEEE, 2019: 8437–8445.
    [11]
    Peng W, Pan H, Liu H, et al. IDA-3D: Instance-depth-aware 3D object detection from stereo vision for autonomous driving. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, WA, USA: IEEE, 2020: 13012.
    [12]
    Huang J, Chen Z, Ceylan D, et al. 6-DOF VR videos with a single 360-camera. In: 2017 IEEE Virtual Reality (VR). Los Angeles, CA, USA: IEEE, 2017: 37–44.
    [13]
    Valentin J, Kowdle A, Barron J T, et al. Depth from motion for smartphone AR. ACM Transactions on Graphics, 2018, 37 (6): 1–19. doi: 10.1145/3272127.3275041
    [14]
    Song S, Yu F, Zeng A, et al. Semantic scene completion from a single depth image. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA: IEEE, 2017: 1746–1754.
    [15]
    Silberman N, Hoiem D, Kohli P, et al. Indoor segmentation and support inference from RGBD images. In: Fitzgibbon A, Lazebnik S, Perona P, et al., editors. Computer Vision – ECCV 2012. Berlin: Springer, 2012: 746–760.
    [16]
    Song S, Lichtenberg S P, Xiao J. SUN RGB-D: A RGB-D scene understanding benchmark suite. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, MA, USA: IEEE, 2015: 567–576.
    [17]
    Yang X, Zhou L, Jiang H, et al. Mobile3DRecon: Real-time monocular 3D reconstruction on a mobile phone. IEEE Transactions on Visualization and Computer Graphics, 2020, 26 (12): 3446–3456. doi: 10.1109/tvcg.2020.3023634
    [18]
    Azinović D, Martin-Brualla R, Goldman D B, et al. Neural RGB-D surface reconstruction. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, LA, USA: IEEE, 2022: 6280–6291.
    [19]
    Eftekhar A, Sax A, Malik J, et al. Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3D scans. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, QC, Canada: IEEE, 2021: 10766–10776.
    [20]
    Bian J W, Zhan H, Wang N, et al. Auto-rectify network for unsupervised indoor depth estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44 (12): 9802–9813. doi: 10.1109/tpami.2021.3136220
    [21]
    Godard C, Mac Aodha O, Firman M, et al. Digging into self-supervised monocular depth estimation. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea: IEEE, 2019: 3827–3837.
    [22]
    Watson J, Mac Aodha O, Prisacariu V, et al. The temporal opportunist: Self-supervised multi-frame monocular depth. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, TN, USA: IEEE, 2021: 1164–1174.
    [23]
    Luo X, Huang J B, Szeliski R, et al. Consistent video depth estimation. ACM Transactions on Graphics, 2020, 39 (4): 71. doi: 10.1145/3386569.3392377
    [24]
    Kopf J, Rong X, Huang J B. Robust consistent video depth estimation. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, TN, USA: IEEE, 2021: 1611–1621.
    [25]
    Park J, Joo K, Hu Z, et al. Non-local spatial propagation network for depth completion. In: Vedaldi A, Bischof H, Brox T, et al., editors. Computer Vision – ECCV 2020. Cham: Springer, 2020: 120–136.
    [26]
    Wong A, Soatto S. Unsupervised depth completion with calibrated backprojection layers. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, QC, Canada: IEEE, 2021: 12727–12736.
    [27]
    Wong A, Cicek S, Soatto S. Learning topology from synthetic data for unsupervised depth completion. IEEE Robotics and Automation Letters, 2021, 6 (2): 1495–1502. doi: 10.1109/lra.2021.3058072
    [28]
    Yang Y, Wong A, Soatto S. Dense depth posterior (DDP) from single image and sparse range. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA: IEEE, 2019: 3348–3357.
    [29]
    He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA: IEEE, 2016: 770–778.
    [30]
    Liu Z, Lin Y, Cao Y, et al. Swin transformer: Hierarchical vision transformer using shifted windows. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, QC, Canada: IEEE, 2021: 9992–10002.
    [31]
    Schönberger J L, Frahm J M. Structure-from-motion revisited. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA: IEEE, 2016: 4104–4113.
    [32]
    Teed Z, Deng J. DeepV2D: Video to depth with differentiable structure from motion. arXiv: 1812.04605, 2018.
    [33]
    Liu C, Gu J, Kim K, et al. Neural RGB→D sensing: Depth and uncertainty from a video camera. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA: IEEE, 2019: 10978–10987.
    [34]
    Düzçeker A, Galliani S, Vogel C, et al. DeepVideoMVS: Multi-view stereo on video with recurrent spatio-temporal fusion. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, TN, USA: IEEE, 2021: 15319–15328.
    [35]
    Zeng A, Song S, Nießner M, et al. 3DMatch: Learning local geometric descriptors from RGB-D reconstructions. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA: IEEE, 2017: 199–208.
    [36]
    Mur-Artal R, Montiel J M M, Tardós J D. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 2015, 31: 1147–1163. doi: 10.1109/tro.2015.2463671
    [37]
    Schönberger J L, Zheng E, Frahm J M, et al. Pixelwise view selection for unstructured multi-view stereo. In: Computer Vision – ECCV 2016. Cham: Springer, 2016: 501–518.
    [38]
    Im S, Jeon H G, Lin S, et al. DPSNet: End-to-end deep plane sweep stereo. arXiv: 1905.00538, 2019.
    [39]
    Wang C, Lucey S, Perazzi F, et al. Web stereo video supervision for depth prediction from dynamic scenes. In: 2019 International Conference on 3D Vision (3DV). Quebec City, QC, Canada: IEEE, 2019: 348–357.
    [40]
    Chen W, Qian S, Deng J. Learning single-image depth from videos using quality assessment networks. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA: IEEE, 2019: 5597–5606.
    [41]
    Xie S, Girshick R, Dollár P, et al. Aggregated residual transformations for deep neural networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA: IEEE, 2017: 5987–5995.
    [42]
    Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv: 2010.11929, 2021.


    Figure  1.  The pipeline for dense 3D scene reconstruction. The robust monocular depth estimation model trained on 6.3 million images, locally weighted linear regression strategy, and TSDF fusion[35] are the main components of our method.
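    The final RGB-D fusion stage in Figure 1 integrates the recovered metric depth into a volumetric model. Below is a minimal sketch of TSDF fusion using Open3D's ScalableTSDFVolume; the voxel size, truncation distance, and depth-image conventions are assumptions of this sketch, not details taken from the paper.

```python
import open3d as o3d

def fuse_rgbd_frames(frames, intrinsic, voxel_length=0.02):
    """Integrate aligned RGB-D frames into a TSDF volume and extract a mesh.

    frames    -- iterable of (color_path, depth_path, extrinsic_4x4) tuples,
                 where depth is metric after scale/shift recovery
    intrinsic -- o3d.camera.PinholeCameraIntrinsic of the capturing camera
    """
    volume = o3d.pipelines.integration.ScalableTSDFVolume(
        voxel_length=voxel_length,
        sdf_trunc=0.08,  # truncation distance of the signed distance field
        color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8,
    )
    for color_path, depth_path, extrinsic in frames:
        color = o3d.io.read_image(color_path)
        depth = o3d.io.read_image(depth_path)
        rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
            color, depth,
            depth_scale=1000.0,   # millimeter depth images assumed
            depth_trunc=4.0,      # ignore depth beyond 4 m
            convert_rgb_to_intensity=False,
        )
        # extrinsic is the world-to-camera transform for this frame
        volume.integrate(rgbd, intrinsic, extrinsic)
    return volume.extract_triangle_mesh()
```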

    Figure  2.  The per-pixel error maps of ground-truth depth and predicted depth aligned through (a) global recovery and (b) local recovery, respectively. (c) The scale map and (d) the shift map of local recovery. Distribution of prediction-GT pairs obtained via (e) global recovery and (f) local recovery, respectively.
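    For reference, the global recovery baseline in panels (a) and (e) fits a single scale and shift for the whole frame by ordinary least squares over all pixels with valid ground truth. A minimal sketch, with the function name and masking convention assumed:

```python
import numpy as np

def global_scale_shift(pred, gt, mask):
    """Align an affine-invariant prediction to ground truth with one
    scale and one shift per frame (the 'global recovery' baseline).

    pred, gt -- (H, W) predicted and ground-truth depth maps
    mask     -- (H, W) boolean map of pixels with valid ground truth
    """
    d = pred[mask]
    A = np.stack([d, np.ones_like(d)], axis=1)
    # Ordinary least squares over all valid pixels
    (scale, shift), *_ = np.linalg.lstsq(A, gt[mask], rcond=None)
    return scale * pred + shift
```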

