Real-time, accurate reconstruction of the 3D semantic scene around a vehicle is essential for autonomous driving and intelligent transportation, both to ensure traffic safety and to optimize travel efficiency. To address complex traffic environments characterized by heterogeneous multi-source data, severe occlusion, and variable illumination, this paper proposes a multi-source fusion algorithm for real-time semantic scene completion. A cross-modal self-attention strategy fuses camera, LiDAR, and millimeter-wave radar information to achieve accurate perception and semantic inference in occluded regions. Spatio-temporal context modeling captures the dynamic changes of targets across sequential data, significantly improving the consistency and completeness of the completed scene. Experimental results show that the proposed algorithm offers significant advantages over mainstream baselines in both occlusion handling and inference speed.
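To make the cross-modal self-attention idea concrete, the following is a minimal sketch (not the paper's actual implementation) of how feature tokens from one modality can attend over a pooled token set drawn from camera, LiDAR, and radar. All token values, dimensions, and function names here are illustrative assumptions; a real system would use learned query/key/value projections and high-dimensional features.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_modal_attention(queries, keys, values):
    """Scaled dot-product attention: each query token attends over
    all key/value tokens, regardless of which sensor produced them."""
    d = len(keys[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in keys])
        fused = [sum(w * v[i] for w, v in zip(weights, values))
                 for i in range(len(values[0]))]
        out.append(fused)
    return out

# Toy 2-D feature tokens standing in for per-modality features.
camera = [[1.0, 0.0]]
lidar  = [[0.9, 0.1], [0.0, 1.0]]
radar  = [[0.8, 0.2]]
tokens = camera + lidar + radar           # shared cross-modal token set
fused = cross_modal_attention(camera, tokens, tokens)
```

Here a camera token is refined by softly pooling evidence from LiDAR and radar tokens, which is how attention can fill in appearance cues for regions the camera sees poorly (e.g. under occlusion).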