Abstract: To enable efficient robotic-arm grasping in unstructured environments, this paper proposes a grasp recognition and localization method based on a lightweight residual convolutional neural network. The network uses depthwise separable convolutions to reduce parameter count and computational cost, and inverted residual connections to strengthen feature propagation and nonlinear representation, yielding an architecture suited to single-stage grasp detection. It consists of three parts: downsampling, feature extraction, and upsampling. During upsampling, a global-local feature fusion mechanism is introduced to improve perception of the contour and shape of the object to be grasped. From the generated grasp quality heatmap, an ellipse-fitting-based grasp pose optimization method improves the accuracy of pose estimation. In experiments on the Cornell and Jacquard datasets, grasp detection accuracy reached 98.4% and 95.2%, respectively. On an NVIDIA RTX 3090 GPU, inference takes 16 ms per image. These results indicate that the method achieves a good balance between accuracy and real-time performance.
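To illustrate why depthwise separable convolution reduces the parameter count, the following minimal sketch compares the number of weights in a standard convolution with that of its depthwise separable counterpart. The function names and the layer size (64 input channels, 128 output channels, 3 x 3 kernel) are hypothetical values chosen only for illustration and are not taken from the proposed network; biases are ignored.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer size, chosen only for illustration.
c_in, c_out, k = 64, 128, 3
std = standard_conv_params(c_in, c_out, k)
sep = depthwise_separable_params(c_in, c_out, k)
print(std, sep, round(sep / std, 3))  # prints: 73728 8768 0.119
```

For this example layer the separable form needs roughly 12% of the weights, which matches the general ratio 1/c_out + 1/k^2.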