ISSN 0253-2778

CN 34-1054/N

Open Access JUSTC Article 15 August 2023

Towards 3D Scene Reconstruction from Locally Scale-Aligned Monocular Video Depth

Cite this:
https://doi.org/10.52396/JUSTC-2023-0061
More Information
  • Corresponding author: E-mail: fzhao956@ustc.edu.cn
  • Accepted Date: 11 May 2023
  • Available Online: 15 August 2023
  • Monocular depth estimation methods have achieved excellent robustness on diverse scenes, usually by predicting affine-invariant depth, defined up to an unknown scale and shift, rather than metric depth, because it is much easier to collect large-scale affine-invariant depth training data. However, in video-based scenarios such as video depth estimation and 3D scene reconstruction, the unknown scale and shift in each per-frame prediction may make the predicted depth inconsistent across frames. To tackle this problem, we propose a locally weighted linear regression method that recovers the scale and shift map from very sparse anchor points, which ensures consistency along consecutive frames. Extensive experiments show that our method can reduce the Rel error of existing state-of-the-art approaches by up to 50% over several zero-shot benchmarks. In addition, we merge 6.3 million RGBD images to train robust depth models. By locally recovering scale and shift, our ResNet50-backbone model even outperforms the state-of-the-art DPT ViT-Large model. Combined with geometry-based reconstruction methods, we formulate a new dense 3D scene reconstruction pipeline that benefits from both the scale consistency of sparse points and the robustness of monocular methods. By performing simple per-frame prediction over a video, accurate 3D scene geometry can be recovered.
    Our pipeline for dense 3D scene reconstruction is composed of a robust monocular depth estimation module, a metric depth recovery module, and an RGB-D fusion module. With our robust depth model trained on enormous data and the proposed locally weighted linear regression method, we can achieve robust and accurate 3D scene shapes.
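The locally weighted linear regression idea described above can be sketched as follows. This is a minimal illustrative version, not the authors' implementation: for every pixel, an affine map (scale, shift) from the predicted affine-invariant depth to metric depth is fit by least squares over the sparse anchor points, with each anchor weighted by its spatial proximity to the pixel. The Gaussian distance weighting, the `sigma` bandwidth, and all names here are assumptions for illustration.

```python
import numpy as np

def recover_scale_shift_map(pred, anchor_uv, anchor_depth, sigma=0.5):
    """Fit metric ~ s(x, y) * pred(x, y) + t(x, y) per pixel via weighted
    least squares over sparse anchors, weighted by spatial proximity.

    pred:         (h, w) affine-invariant depth prediction
    anchor_uv:    (n, 2) integer pixel coordinates (x, y) of sparse anchors
    anchor_depth: (n,)   metric depth at the anchor points
    """
    h, w = pred.shape
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([xs / w, ys / h], axis=-1)           # (h, w, 2), normalized
    a_xy = anchor_uv / np.array([w, h], dtype=float)       # (n, 2), normalized
    a_pred = pred[anchor_uv[:, 1], anchor_uv[:, 0]]        # predictions at anchors
    X = np.stack([a_pred, np.ones_like(a_pred)], axis=1)   # design matrix (n, 2)
    scale = np.empty((h, w))
    shift = np.empty((h, w))
    for y in range(h):
        for x in range(w):
            d2 = np.sum((coords[y, x] - a_xy) ** 2, axis=1)
            wts = np.exp(-d2 / (2 * sigma ** 2))           # Gaussian distance weights
            # Weighted normal equations with a tiny ridge term for stability.
            A = X.T @ (wts[:, None] * X) + 1e-8 * np.eye(2)
            b = X.T @ (wts * anchor_depth)
            scale[y, x], shift[y, x] = np.linalg.solve(A, b)
    return scale * pred + shift, scale, shift
```

When the relation between prediction and ground truth happens to be globally affine, this recovers the same scale and shift at every pixel; when the misalignment varies spatially, each pixel gets its own correction driven by nearby anchors, which is what makes per-frame predictions consistent.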
  • [1]
    Wei Yin, Jianming Zhang, Oliver Wang, Simon Niklaus, Long Mai, Simon Chen, and Chunhua Shen. Learning to recover 3d scene shape from a single image. In: IEEE Conf. Comput. Vis. Pattern Recogn., 204–213, 2021.
    [2]
    René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In: IEEE Conf. Comput. Vis. Pattern Recogn., 12179–12188, 2021.
    [3]
    René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer. IEEE Trans. Pattern Anal. Mach. Intell., 2022, 44 (3): 1623–1637. doi: 10.1109/TPAMI.2020.3019967
    [4]
    Ke Xian, Jianming Zhang, Oliver Wang, Long Mai, Zhe Lin, and Zhiguo Cao. Structure-Guided Ranking Loss for Single Image Depth Prediction. In: IEEE Conf. Comput. Vis. Pattern Recogn., 611–620, 2020.
    [5]
    Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In: IEEE Conf. Comput. Vis. Pattern Recogn., 2041–2050, 2018.
    [6]
    Weifeng Chen, Shengyi Qian, David Fan, Noriyuki Kojima, Max Hamilton, and Jia Deng. OASIS: A large-scale dataset for single image 3d in the wild. In: IEEE Conf. Comput. Vis. Pattern Recogn., 679–688, 2020.
    [7]
    Wei Yin, Yifan Liu, and Chunhua Shen. Virtual normal: Enforcing geometric constraints for accurate and robust depth prediction. IEEE Trans. Pattern Anal. Mach. Intell., 2021, 44 (10): 7282–7295.
    [8]
    Weifeng Chen, Zhao Fu, Dawei Yang, and Jia Deng. Single-image depth perception in the wild. In: Adv. Neural Inform. Process. Syst., 730–738, 2016.
    [9]
    Jia-Wang Bian, Huangying Zhan, Naiyan Wang, Zhichao Li, Le Zhang, Chunhua Shen, Ming-Ming Cheng, and Ian Reid. Unsupervised scale-consistent depth learning from video. Int. J. Comput. Vis., 2021, 129 (9): 2548–2564. doi: 10.1007/s11263-021-01484-6
    [10]
    Yan Wang, Wei-Lun Chao, Divyansh Garg, Bharath Hariharan, Mark Campbell, and Kilian Q Weinberger. Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving. In: IEEE Conf. Comput. Vis. Pattern Recogn., 8445–8453, 2019.
    [11]
    Wanli Peng, Hao Pan, He Liu, and Yi Sun. Ida-3d: Instance-depth-aware 3d object detection from stereo vision for autonomous driving. In: IEEE Conf. Comput. Vis. Pattern Recogn., 13015–13024, 2020.
    [12]
    Jingwei Huang, Zhili Chen, Duygu Ceylan, and Hailin Jin. 6-DOF VR videos with a single 360-camera. In: IEEE Vir. Real., 37–44, 2017.
    [13]
    Julien Valentin, Adarsh Kowdle, Jonathan T Barron, Neal Wadhwa, Max Dzitsiuk, Michael Schoenberg, Vivek Verma, Ambrus Csaszar, Eric Turner, Ivan Dryanovski, et al. Depth from motion for smartphone AR. ACM Trans. Graph., 2018, 37 (6): 1–19.
    [14]
    Shuran Song, Fisher Yu, Andy Zeng, Angel X Chang, Manolis Savva, and Thomas Funkhouser. Semantic scene completion from a single depth image. In: IEEE Conf. Comput. Vis. Pattern Recogn., 1746–1754, 2017.
    [15]
    Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In: Eur. Conf. Comput. Vis., 746–760. Springer, 2012.
    [16]
    Shuran Song, Samuel P Lichtenberg, and Jianxiong Xiao. SUN RGB-D: A RGB-D scene understanding benchmark suite. In: IEEE Conf. Comput. Vis. Pattern Recogn., 567–576, 2015.
    [17]
    Xingbin Yang, Liyang Zhou, Hanqing Jiang, Zhongliang Tang, Yuanbo Wang, Hujun Bao, and Guofeng Zhang. Mobile3DRecon: real-time monocular 3D reconstruction on a mobile phone. IEEE Trans. Vis. Comput. Graph., 2020, 26 (12): 3446–3456. doi: 10.1109/TVCG.2020.3023634
    [18]
    Dejan Azinović, Ricardo Martin-Brualla, Dan B Goldman, Matthias Nießner, and Justus Thies. Neural RGB-D surface reconstruction. In: IEEE Conf. Comput. Vis. Pattern Recogn., 6290–6301, 2022.
    [19]
    Ainaz Eftekhar, Alexander Sax, Jitendra Malik, and Amir Zamir. Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3d scans. In: Int. Conf. Comput. Vis., 10786–10796, 2021.
    [20]
    Jia-Wang Bian, Huangying Zhan, Naiyan Wang, Tat-Jun Chin, Chunhua Shen, and Ian Reid. Auto-Rectify Network for Unsupervised Indoor Depth Estimation. IEEE Trans. Pattern Anal. Mach. Intell., 2021: 1–1.
    [21]
    Clément Godard, Oisin Mac Aodha, Michael Firman, and Gabriel J Brostow. Digging into self-supervised monocular depth estimation. In: Int. Conf. Comput. Vis., 3828–3838, 2019.
    [22]
    Jamie Watson, Oisin Mac Aodha, Victor Prisacariu, Gabriel Brostow, and Michael Firman. The temporal opportunist: Self-supervised multi-frame monocular depth. In: IEEE Conf. Comput. Vis. Pattern Recogn., 1164–1174, 2021.
    [23]
    Xuan Luo, Jia-Bin Huang, Richard Szeliski, Kevin Matzen, and Johannes Kopf. Consistent video depth estimation. ACM Trans. Graph., 2020, 39 (4): 71–1.
    [24]
    Johannes Kopf, Xuejian Rong, and Jia-Bin Huang. Robust consistent video depth estimation. In: IEEE Conf. Comput. Vis. Pattern Recogn., 1611–1621, 2021.
    [25]
    Jinsun Park, Kyungdon Joo, Zhe Hu, Chi-Kuei Liu, and In So Kweon. Non-local spatial propagation network for depth completion. In: Eur. Conf. Comput. Vis., 120–136. Springer, 2020.
    [26]
    Alex Wong and Stefano Soatto. Unsupervised depth completion with calibrated backprojection layers. In: Int. Conf. Comput. Vis., 12747–12756, 2021.
    [27]
    Alex Wong, Safa Cicek, and Stefano Soatto. Learning topology from synthetic data for unsupervised depth completion. IEEE Robot. Autom. Lett., 2021, 6 (2): 1495–1502. doi: 10.1109/LRA.2021.3058072
    [28]
    Yanchao Yang, Alex Wong, and Stefano Soatto. Dense depth posterior (ddp) from single image and sparse range. In: IEEE Conf. Comput. Vis. Pattern Recogn., 3353–3362, 2019.
    [29]
    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In: IEEE Conf. Comput. Vis. Pattern Recogn., 770–778, 2016.
    [30]
    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In: Int. Conf. Comput. Vis., 10012–10022, 2021.
    [31]
    Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In: IEEE Conf. Comput. Vis. Pattern Recogn., 4104–4113, 2016.
    [32]
    Zachary Teed and Jia Deng. DeepV2D: Video to Depth with Differentiable Structure from Motion. In: Int. Conf. Learn. Represent., 2020.
    [33]
    Chao Liu, Jinwei Gu, Kihwan Kim, Srinivasa G. Narasimhan, and Jan Kautz. Neural RGB → D sensing: Depth and uncertainty from a video camera. In: IEEE Conf. Comput. Vis. Pattern Recogn., 10986–10995, 2019.
    [34]
    Arda Duzceker, Silvano Galliani, Christoph Vogel, Pablo Speciale, Mihai Dusmanu, and Marc Pollefeys. DeepVideoMVS: Multi-view stereo on video with recurrent spatio-temporal fusion. In: IEEE Conf. Comput. Vis. Pattern Recogn., 15324–15333, 2021.
    [35]
    Andy Zeng, Shuran Song, Matthias Nießner, Matthew Fisher, Jianxiong Xiao, and Thomas Funkhouser. 3DMatch: Learning local geometric descriptors from RGB-D reconstructions. In: IEEE Conf. Comput. Vis. Pattern Recogn., 1802–1811, 2017.
    [36]
    Raúl Mur-Artal, J. M. M. Montiel, and Juan D. Tardós. ORB-SLAM: a Versatile and Accurate Monocular SLAM System. IEEE Transactions on Robotics, 2015, 31 (5): 1147–1163. doi: 10.1109/TRO.2015.2463671
    [37]
    Johannes L Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In: Eur. Conf. Comput. Vis., 501–518. Springer, 2016.
    [38]
    Sunghoon Im, Hae-Gon Jeon, Stephen Lin, and In So Kweon. DPSNet: End-to-end Deep Plane Sweep Stereo. In: Int. Conf. Learn. Represent., 2019.
    [39]
    Chaoyang Wang, Simon Lucey, Federico Perazzi, and Oliver Wang. Web stereo video supervision for depth prediction from dynamic scenes. In: Int. Conf. 3D Vis., 348–357, 2019.
    [40]
    Weifeng Chen, Shengyi Qian, and Jia Deng. Learning single-image depth from videos using quality assessment networks. In: IEEE Conf. Comput. Vis. Pattern Recogn., 5604–5613, 2019.
    [41]
    Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In: IEEE Conf. Comput. Vis. Pattern Recogn., 1492–1500, 2017.
    [42]
    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Int. Conf. Learn. Represent., 2021.


    Figure 1. The pipeline for dense 3D scene reconstruction. The robust monocular depth estimation model trained on 6.3 million images, the locally weighted linear regression strategy, and TSDF fusion[35] are the main components of our method.

    Figure 2. The per-pixel error maps of ground-truth depth and predicted depth aligned through (a) global recovery and (b) local recovery, respectively. (c) The scale map and (d) the shift map of local recovery. Distribution of prediction-GT pairs obtained via (e) global recovery and (f) local recovery, respectively.

