Abstract: To address occlusion completion and communication constraints on urban roads, this paper proposes a vehicle-road-cloud collaborative method for multimodal semantic completion and compression. At the vehicle level, the method fuses cameras, radar, and an inertial navigation system to perform initial completion and transmits only compact latent representations. Roadside units aggregate latent codes from multiple vehicles for spatio-temporal alignment, generate region guidance vectors and occupancy/occlusion cues, and distribute them to the relevant vehicles. The cloud iteratively updates codebooks and model parameters based on regional heatmaps. Upload overhead is reduced through importance-driven, region-wise coding-rate allocation and an integrated keyframe/differential message format, while shape and semantic quality in occluded regions are enhanced via dual-head decoding and small-step refinement. Experimental results show that mIoU_occ improves by approximately 3%-4% on average over the baseline, uplink overhead at complex-scene quantiles is reduced, and end-to-end P95 latency is around 100 ms under typical bandwidth conditions, satisfying the real-time and bandwidth constraints of vehicle-road collaborative deployment.
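The importance-driven, region-wise coding-rate allocation mentioned above can be illustrated with a minimal sketch. This is an assumption-laden toy (the abstract does not specify the allocation rule): it simply splits a total uplink bit budget across regions in proportion to importance scores, with a per-region floor; the function name, proportional scheme, and parameters are hypothetical.

```python
# Illustrative sketch only, NOT the paper's actual algorithm:
# split a total uplink bit budget across regions proportionally
# to importance scores, guaranteeing each region a minimum rate.

def allocate_rates(importance, total_bits, min_bits=0):
    """Proportional bit allocation with a per-region floor."""
    n = len(importance)
    budget = total_bits - n * min_bits
    assert budget >= 0, "budget too small for the per-region floor"
    total_imp = sum(importance)
    if total_imp == 0:
        return [min_bits + budget // n] * n
    # round down, then hand leftover bits to the most important regions
    bits = [min_bits + int(budget * w / total_imp) for w in importance]
    leftover = total_bits - sum(bits)
    order = sorted(range(n), key=lambda i: importance[i], reverse=True)
    for i in range(leftover):
        bits[order[i % n]] += 1
    return bits

# Example: three regions; the occluded region (weight 0.6) gets most bits.
print(allocate_rates([0.6, 0.3, 0.1], total_bits=1000, min_bits=50))
# → [560, 305, 135]
```

In a real deployment the importance scores would come from the roadside units' occupancy/occlusion cues, and the per-region floor keeps low-importance regions decodable at minimum quality.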