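Research on a File Desensitization System Based on the National Cryptographic Algorithm SM4 and Format-Preserving Encryption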

2024,32(11):315-321
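Huang Jun (黄俊)1, Liu Jiafu (刘家甫), Cao Zhiwei (曹志威)2
1. Data Security Technology R&D Center, The Third Research Institute of the Ministry of Public Security; 2. Department of Medical Imaging, Naval Medical University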
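Abstract: Government, financial, and similar sectors impose strict confidentiality requirements on internal files, yet traditional desensitization methods applied to file data stored on removable media suffer from low efficiency and incomplete desensitization because of large data volumes and diverse data types. To address this, a file desensitization system for mixed data types is proposed that combines SM4 with FF1. The system desensitizes data of arbitrary type by segmenting the content, which improves the scope, accuracy, and efficiency of file desensitization. To further reduce the memory consumed while the system runs, an index conversion algorithm based on a Chinese character dictionary library is proposed. It builds a relative index relationship between the plaintext under inspection and the Chinese character encoding library, optimizing the key-value mapping that traditional desensitization systems build on hash tables. In a desensitization test on 1000 randomly generated test files, the unrecognizability rate for mixed-type text reached 99.8%, and the accuracy of desensitization and content recovery reached 99.9%. On 10 randomly generated test files with a total size of about 10 MB, the desensitization rate for plain text averaged 2500 characters per second.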
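Keywords: national cryptographic algorithm SM4, format-preserving encryption, data desensitization, national cryptographic algorithms, file desensitization system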
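The abstract says the system handles arbitrary data types by segmenting content before desensitizing it, but it does not describe the segmentation rules. The following is a minimal sketch of one way such a step could look, assuming the text is split into maximal runs of a single character class. The function names `char_class` and `segment`, the four class labels, and the code-point range used for Chinese characters are all illustrative assumptions, not the authors' design.

```python
from itertools import groupby


def char_class(ch: str) -> str:
    """Classify one character. The class names are hypothetical."""
    if "0" <= ch <= "9":
        return "digit"
    if "a" <= ch <= "z" or "A" <= ch <= "Z":
        return "latin"
    # Assumption: Chinese characters are taken from U+4E00..U+9FA5.
    if "\u4e00" <= ch <= "\u9fa5":
        return "hanzi"
    return "other"


def segment(text: str) -> list[tuple[str, str]]:
    """Split text into maximal runs of characters sharing one class."""
    return [(cls, "".join(run)) for cls, run in groupby(text, key=char_class)]


if __name__ == "__main__":
    for cls, run in segment("张三ID42"):
        print(cls, run)
```

Each run could then be passed to a cipher configured for that run's alphabet, and concatenating the runs in order restores the original layout of the text.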

Research on a File Desensitization System Based on the Domestic Algorithm SM4 and Format-Preserving Encryption

Abstract: A file desensitization system for hybrid data types, based on the SM4 and FF1 algorithms, desensitizes data of arbitrary type by content segmentation. The system addresses the low efficiency and incomplete desensitization caused by the large volume and diverse types of data that government and financial institutions store on removable media, and it improves the scope, accuracy, and efficiency of file desensitization. An index conversion algorithm based on a Chinese character dictionary library constructs a relative index relationship between the plaintext to be detected and the Chinese character encoding library. This optimizes the key-value mapping that traditional desensitization systems build on hash tables and further reduces the system's runtime memory consumption. In a desensitization test on 1000 randomly generated test files, the unrecognizability rate for mixed-type text reached 99.8%, and the accuracy of desensitization and content recovery reached 99.9%. On 10 randomly generated test files with a total size of about 10 MB, the desensitization rate for plain text averaged 2500 characters per second.
Key words: national cryptographic algorithm SM4, format-preserving encryption, data desensitization, national cryptographic algorithms, file desensitization system
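The index conversion idea in the abstract, relating each plaintext character to its position in a Chinese character encoding library instead of looking it up in a hash table, can be illustrated with a short sketch. It assumes the library is the contiguous code-point range U+4E00 to U+9FA5, so that a character's index is simply its offset from the first code point. That range, the names `to_index`, `from_index`, `mask`, and `unmask`, and the keyed shift are all hypothetical. The shift is a deliberately simple stand-in that keeps the output inside the same alphabet and is reversible. It is neither FF1 nor SM4 and is not meant to be secure.

```python
import hashlib
import hmac

# Assumed dictionary: the contiguous range U+4E00..U+9FA5 (20902 characters).
HANZI_BASE = 0x4E00
HANZI_SIZE = 0x9FA5 - 0x4E00 + 1


def to_index(ch: str) -> int:
    """Character -> dictionary index, computed by offset (no hash table)."""
    idx = ord(ch) - HANZI_BASE
    if not 0 <= idx < HANZI_SIZE:
        raise ValueError(f"{ch!r} is outside the assumed dictionary range")
    return idx


def from_index(idx: int) -> str:
    """Dictionary index -> character."""
    return chr(HANZI_BASE + idx)


def _offset(key: bytes, position: int) -> int:
    """Keyed per-position shift. A toy stand-in, NOT FF1 or SM4."""
    digest = hmac.new(key, position.to_bytes(8, "big"), hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % HANZI_SIZE


def mask(text: str, key: bytes) -> str:
    """Shift each character's index; the output stays in the same alphabet."""
    return "".join(
        from_index((to_index(c) + _offset(key, i)) % HANZI_SIZE)
        for i, c in enumerate(text)
    )


def unmask(text: str, key: bytes) -> str:
    """Invert mask() by subtracting the same per-position shift."""
    return "".join(
        from_index((to_index(c) - _offset(key, i)) % HANZI_SIZE)
        for i, c in enumerate(text)
    )


if __name__ == "__main__":
    key = b"demo-key"
    masked = mask("数据脱敏", key)
    print(masked, unmask(masked, key))
```

Because the index is pure arithmetic on the code point, no per-character table has to be held in memory, which is the kind of saving the abstract attributes to replacing hash-table lookups. In a real system the index sequence would be fed to a vetted format-preserving cipher instead of the toy shift.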
Received: 2024-09-29
Funding: Supported by the Key R&D Program of the Ministry of Science and Technology (No. 2021YFB3102002)