
Eye-image semantic segmentation network based on a lightweight network and a projection loss function

Lightweight deep network and projection loss for eye semantic segmentation

  • Abstract: Semantic segmentation of eye images is a complex task with important applications in human–computer interaction, cognitive science, and neuroscience. Real-time, accurate, and robust segmentation algorithms are essential in computationally constrained scenarios such as augmented and virtual reality. With the rapid development of deep learning, researchers have proposed a variety of network models for eye-image segmentation. Some models reduce the number of network parameters by decomposing segmentation into multiple stages and improve accuracy by post-processing the preliminary results; however, multi-stage segmentation significantly increases inference time. Other models adopt more complex encoder and decoder modules to produce end-to-end output, but these modules demand substantial computational resources. Balancing model size, accuracy, and computational complexity is therefore crucial for the eye-segmentation task. To address these challenges, this paper proposes a lightweight asymmetric UNet architecture together with a projection loss function. In the encoding stage, ResNet-3 blocks are used to improve feature-extraction efficiency; in the decoding stage, regular convolutions and skip connections upscale the feature maps from the latent space back to the original image size, balancing model size against segmentation accuracy. In addition, exploiting the geometric characteristics of the eye region, a projection loss function is designed that further improves segmentation accuracy without adding any inference-time computational cost. The proposed method is validated on the OpenEDS2019 virtual-reality eye-segmentation dataset, achieving a mean intersection over union (mIoU) of 95.33% with only 0.63M parameters at a segmentation speed of 350 FPS. Compared with the state-of-the-art model RITNet, our method reduces the parameter count by 32% while doubling the speed.

     

    Abstract: Semantic segmentation of eye images is a complex task with important applications in human–computer interaction, cognitive science, and neuroscience. Real-time, accurate, and robust segmentation algorithms are crucial for computationally limited portable devices such as augmented- and virtual-reality headsets. With the rapid advancement of deep learning, many network models have been developed specifically for eye-image segmentation. Some methods divide the segmentation process into multiple stages to shrink the model's parameter count, enhancing the output through post-processing techniques to improve segmentation accuracy; however, these approaches significantly increase inference time. Other networks adopt more complex encoding and decoding modules to achieve end-to-end output, which requires substantial computation. Therefore, balancing a model's size, accuracy, and computational complexity is essential. To address these challenges, we propose a lightweight asymmetric UNet architecture and a projection loss function. In the encoding stage, we utilize ResNet-3 layer blocks to enhance feature-extraction efficiency. In the decoding stage, we employ regular convolutions and skip connections to upscale the feature maps from the latent space to the original image size, balancing model size and segmentation accuracy. In addition, we leverage the geometric features of the eye region to design a projection loss function that further improves segmentation accuracy without adding any inference computational cost. We validate our approach on the OpenEDS2019 virtual-reality dataset and achieve state-of-the-art performance with 95.33% mean intersection over union (mIoU). Our model has only 0.63M parameters and runs at 350 FPS; relative to the state-of-the-art model RITNet, these figures are 68% of the parameters and 200% of the speed.
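The abstract does not give the exact form of the projection loss. As an illustration only, a minimal sketch of one plausible formulation (an assumption on our part, not the authors' definition) compares row-wise and column-wise projections of the predicted per-class probability maps against those of the one-hot ground truth, which captures the kind of axis-aligned geometric constraint the text describes:

```python
import numpy as np

def projection_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Hypothetical projection loss for segmentation maps.

    pred   -- predicted class probabilities, shape (C, H, W)
    target -- one-hot ground-truth labels, same shape

    Each class channel is projected onto the vertical axis (mean over
    columns) and the horizontal axis (mean over rows); the loss is the
    mean absolute difference between predicted and true profiles.
    Using means keeps every profile in [0, 1] regardless of image size.
    """
    row_pred, row_true = pred.mean(axis=2), target.mean(axis=2)  # (C, H)
    col_pred, col_true = pred.mean(axis=1), target.mean(axis=1)  # (C, W)
    return float(np.abs(row_pred - row_true).mean()
                 + np.abs(col_pred - col_true).mean())
```

Since the projections are computed only from the network's existing output, a term like this affects training but adds nothing to the inference path, consistent with the paper's claim of zero extra inference cost.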

     
