Abstract: To address the memory-access-intensive nature of layer normalization in high-performance computing, this paper proposes parallel layer normalization schemes for the Sunway multi-core processor platform. The core idea is to make efficient use of the computational resources and bandwidth of the SW26010P multi-core processor. The schemes partition the data using two different core distribution strategies and apply double buffering, DMA transfers, and SIMD vectorization to execute the computation in parallel. Experimental results show that the parallel approach achieves a maximum speedup of 55.48× over the serial algorithm running on the main core. Relative to the parallel layer normalization without SIMD instructions, SIMD optimization raises the maximum effective computing power to 28.25 GFLOPS.
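
For orientation, the computation being parallelized can be sketched as follows. This is a minimal illustration of standard layer normalization with a simple contiguous row-block split across cores, not the paper's implementation: the function names, the `num_cores` parameter, and the block partitioning are assumptions made for this example, and the two core distribution strategies, double buffering, DMA, and SIMD steps of the proposed schemes are not reproduced here.

```python
import math

def layer_norm_row(row, gamma, beta, eps=1e-5):
    """Normalize one row: reduce to mean and variance, then scale and shift."""
    n = len(row)
    mean = sum(row) / n
    var = sum((x - mean) ** 2 for x in row) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(x - mean) * inv_std * g + b for x, g, b in zip(row, gamma, beta)]

def partition_rows(num_rows, num_cores):
    """Split row indices into contiguous blocks, one per core (assumed scheme)."""
    base, extra = divmod(num_rows, num_cores)
    blocks, start = [], 0
    for core in range(num_cores):
        size = base + (1 if core < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks

def layer_norm(matrix, gamma, beta, num_cores=4, eps=1e-5):
    """Apply layer normalization row by row; each block is independent work."""
    out = [None] * len(matrix)
    # On real hardware each block would run on its own core; here the
    # blocks are processed sequentially to keep the sketch self-contained.
    for block in partition_rows(len(matrix), num_cores):
        for i in block:
            out[i] = layer_norm_row(matrix[i], gamma, beta, eps)
    return out
```

Each row is read for the reductions and again for the final scale and shift, while only a few arithmetic operations are performed per element, which is why the operation tends to be limited by memory access rather than by computation.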