Spacecraft Material and Device Test Data Identification System Based on OCR Technology

2023,31(1):282-288
Lu Junjie, Wei Yadong, Li Xiaofeng, Wang Cheng, Li Hongpu, Li Feng
CSIC Orient (Wuxi) Software Technology Co., Ltd. (中船重工奥蓝托无锡软件技术有限公司)
Abstract: Databases of spacecraft materials and devices depend on large volumes of data from domestic and international test reports. Tables are the most common form in which these data are stored and hold the largest share of them, but extracting table data by hand is tedious and error-prone. Taking tables in PDF documents as the research object, this paper proposes an OCR-based system for recognizing test data on spacecraft materials and devices. The system adopts a B/S (browser/server) architecture, is developed with EXT, Java, and Python, and provides PDF document conversion, table recognition, data extraction, and data editing. In line with the system design, layout analysis and PDFPlumber table detection are the key techniques used to recognize tables in PDF documents accurately and effectively, and the extracted data are presented in an EXT grid control. Tests show that the system performs batch recognition and data extraction for regular tables in PDF documents, which verifies the feasibility of the design and shows that the system meets its requirements for high recognition accuracy and fast recognition.
Key words: spacecraft materials and devices; data identification system; OCR; PDF documents; table recognition
Received: 2022-06-17
Funding:
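
The abstract names PDFPlumber table detection as the step that turns PDF tables into extractable data. The following is a minimal Python sketch of that step only, not the authors' implementation. It assumes the open-source pdfplumber package and a text-based PDF with regular, ruled tables; the helper names `clean_cell`, `table_to_records`, and `extract_records` are hypothetical, and scanned pages would first need the OCR and layout-analysis stages, which are not shown here.

```python
from typing import Dict, List, Optional

# One table row as pdfplumber returns it: cell text, or None for an empty cell.
Row = List[Optional[str]]


def clean_cell(cell: Optional[str]) -> str:
    """Collapse line breaks and repeated spaces inside a cell; None becomes ''."""
    if cell is None:
        return ""
    return " ".join(cell.split())


def table_to_records(table: List[Row]) -> List[Dict[str, str]]:
    """Treat the first non-empty row as the header and map each later row to a dict.

    Assumption: the table is regular (one header row, no merged cells).
    Rows shorter than the header are truncated by zip().
    """
    rows = [[clean_cell(c) for c in row] for row in table]
    rows = [r for r in rows if any(r)]  # drop rows whose cells are all empty
    if len(rows) < 2:
        return []
    header, *body = rows
    return [dict(zip(header, r)) for r in body]


def extract_records(pdf_path: str) -> List[Dict[str, str]]:
    """Collect records from every table pdfplumber detects in the document."""
    import pdfplumber  # third-party package, imported lazily

    records: List[Dict[str, str]] = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                records.extend(table_to_records(table))
    return records
```

Calling `extract_records("report.pdf")` on a hypothetical test report would return one dictionary per table row, a shape that a web front end such as an EXT grid can load as JSON for display and manual correction.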