Abstract: In scene text detection, the propagation path of localization information is too long and the semantic information in features of different scales is not fully exploited, which makes it difficult to predict text bounding boxes from low-level localization information. To address this problem, a scene text detection method based on the attention mechanism and multi-scale features (CM-STD) is proposed. First, the feature extraction module is improved: a feature extractor with an embedded criss-cross attention mechanism is constructed to extract multi-scale features and obtain context-aware information. Second, the Path Aggregation Network (PANet) is introduced as the feature fusion module to fuse feature maps of different scales and provide multi-scale contextual semantic information, so that the segmentation network generates finer boundary segmentation results. In addition, the loss function is redesigned in the prediction stage. To validate the effectiveness of the method, experiments are conducted on three public datasets, ICDAR2015, CTW1500 and Total-Text, where the comprehensive index F-value reaches 87.4%, 82.3% and 83.4%, respectively. CM-STD outperforms the classical EAST method and is comparable to the current LOMO method.
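The comprehensive F-value reported above is conventionally the harmonic mean of detection precision and recall. The following is a minimal sketch under that assumed standard definition; the function name `f_measure` and the sample precision and recall values are hypothetical, and the per-dataset rules for matching predicted boxes to ground truth are not modeled.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed standard F-value)."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    if precision + recall == 0:
        # No correct detections at all: define the score as 0.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical values for illustration only; these are not results from the paper.
example_score = f_measure(0.90, 0.80)
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a detector cannot reach a high F-value by trading away recall for precision or the reverse.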