3-D object detection for point clouds based on voxel-keypoint feature aggregation network

  • Abstract: To address the problem that existing point-cloud detection methods ignore 3-D structural context, leading to weak feature representation and low candidate-box accuracy, a 3-D object detection model was developed based on voxel-keypoint encoding and multi-level feature fusion. Using the flexible receptive fields of ball queries and voxel queries, multi-level features of 3-D candidate boxes were extracted from 3-D sparse convolution features and bird's-eye-view image features, and long-range contextual dependencies among points were exploited to learn discriminative representations. A voxel-keypoint encoding module was developed to refine 3-D candidate boxes through a multi-layer self-attention keypoint sampling method and encoding strategy, and multi-level features were aggregated via convolutional attention to obtain the final 3-D candidate-box representation. Experiments were conducted on the large-scale point-cloud object detection datasets KITTI and Waymo. The results show that the method recognizes 3-D objects efficiently and accurately, achieving up to 93.88% 3-D detection precision for the cyclist category and outperforming mainstream methods in multiple detection scenarios. The approach has good application value for 3-D object detection in scenes with sparse distant points.

     

    Abstract:
    To address the weak feature representation and low candidate box accuracy caused by ignoring 3-D structural context in existing methods, this study proposes a 3-D object detection method for point clouds based on the voxel-keypoint feature aggregation network (VKMFANet).
    Existing point-cloud 3-D object detection methods have clear limitations. Point-based methods achieve high accuracy but are computationally expensive and perform poorly in real time. Voxel-based methods are efficient but lose 3-D structural context during feature conversion, which degrades accuracy. Although point-voxel aggregation methods offer better accuracy, they still suffer from feature-conversion loss and low sampling efficiency. Therefore, VKMFANet adopts a two-stage architecture that combines voxel-keypoint encoding with multi-level feature aggregation to enhance detection performance.
    In the first stage, a 3-D sparse convolution module extracted point-cloud features and projected them onto a bird's-eye view, after which a region proposal network generated candidate boxes. The second stage extracted multi-level features from the candidate boxes: (a) a 3-D sparse convolution feature extraction module aggregated contextual features of neighboring voxels via voxel queries to preserve 3-D structure; (b) a bird's-eye-view feature pooling module aligned coordinates with an affine transformation to reduce information loss; (c) an internal point-cloud spatial-structure feature extraction module introduced a multi-layer self-attention keypoint sampling method, selected keypoints with farthest point sampling (FPS), and encoded spatial relationships via multi-scale ball queries; (d) a convolutional attention aggregation module fused these features through channel- and point-level attention to produce the final features for classification and bounding-box regression. Sketches of two of these components follow below.
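    To make step (c) concrete, the following is a minimal NumPy sketch of farthest point sampling and a ball query grouped at several radii. The function names, point counts, radii, and padding strategy are illustrative assumptions; the abstract does not specify the authors' implementation.

```python
import numpy as np

def farthest_point_sampling(points: np.ndarray, n_samples: int) -> np.ndarray:
    """Iteratively pick the point farthest from all points chosen so far.

    points: (N, 3) array of xyz coordinates.
    Returns the indices of the n_samples selected keypoints.
    """
    n = points.shape[0]
    selected = np.zeros(n_samples, dtype=np.int64)
    # Squared distance from every point to its nearest already-selected keypoint.
    min_dist = np.full(n, np.inf)
    selected[0] = 0  # start from an arbitrary point
    for i in range(1, n_samples):
        diff = points - points[selected[i - 1]]
        min_dist = np.minimum(min_dist, np.einsum("ij,ij->i", diff, diff))
        selected[i] = int(np.argmax(min_dist))
    return selected

def ball_query(points: np.ndarray, centers: np.ndarray,
               radius: float, k: int) -> np.ndarray:
    """For each center, gather up to k neighbor indices within `radius`."""
    dists = np.linalg.norm(centers[:, None, :] - points[None, :, :], axis=-1)  # (M, N)
    groups = np.zeros((centers.shape[0], k), dtype=np.int64)
    for m in range(centers.shape[0]):
        idx = np.flatnonzero(dists[m] <= radius)
        if idx.size == 0:
            # Fall back to the single nearest point if the ball is empty.
            idx = np.array([int(np.argmin(dists[m]))])
        # Pad by cyclically repeating the found neighbors, as in
        # PointNet++-style grouping.
        groups[m] = np.resize(idx[:k], k)
    return groups

# Usage: sample 128 keypoints, then group neighbors at multiple scales.
pts = np.random.rand(2048, 3).astype(np.float32)
keys = farthest_point_sampling(pts, 128)
for r in (0.1, 0.3, 0.5):
    neighbors = ball_query(pts, pts[keys], radius=r, k=16)
```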
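    For step (d), the sketch below illustrates one common way to realize channel- and point-level attention fusion in PyTorch, in the spirit of CBAM-style convolutional attention. The module name, layer sizes, reduction ratio, and use of mean/max channel statistics are assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class ChannelPointAttentionFusion(nn.Module):
    """Fuse per-point multi-level features with channel then point attention.

    Input: a (B, C, N) tensor, e.g. concatenated voxel-query, BEV-pooling,
    and keypoint features for N sampled points inside a candidate box.
    """
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Channel attention: squeeze over points, excite over channels.
        self.channel_mlp = nn.Sequential(
            nn.Conv1d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels // reduction, channels, kernel_size=1),
        )
        # Point attention: one weight per point from its channel statistics.
        self.point_conv = nn.Conv1d(2, 1, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (B, C, 1): global average over points, then channel gating.
        ch = torch.sigmoid(self.channel_mlp(x.mean(dim=2, keepdim=True)))
        x = x * ch
        # (B, 2, N): mean and max over channels, then per-point gating.
        stats = torch.cat([x.mean(dim=1, keepdim=True),
                           x.amax(dim=1, keepdim=True)], dim=1)
        pt = torch.sigmoid(self.point_conv(stats))
        return x * pt

# Usage: 2 candidate boxes, 256 feature channels, 128 keypoints each.
feats = torch.randn(2, 256, 128)
fused = ChannelPointAttentionFusion(256)(feats)
```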
    Experiments were conducted on the KITTI and Waymo datasets, using an Intel Xeon CPU and an NVIDIA RTX 3090 GPU as the hardware and Python 3.9 with PyTorch 1.10.1 as the software environment. The results showed that VKMFANet achieved 93.88% 3-D detection average precision (AP) and 96.27% bird's-eye-view (BEV) detection AP for cyclists at the Easy level on the KITTI dataset, outperforming mainstream methods such as PV-RCNN. On the Waymo dataset at Level 1 difficulty, the mean average precision (mAP) values for cars, pedestrians, and cyclists were 58.1%, 68.67%, and 63.38%, respectively, and the advantage was maintained at Level 2 difficulty. Ablation experiments verified the effectiveness of each feature module, and the method ran 16 Hz faster than PV-RCNN, balancing accuracy and efficiency.
    The proposed method performs well in long-range sparse scenes, providing an efficient solution for 3-D object detection in autonomous driving and related fields, with clear practical value.

     
