Research on load balancing of MapReduce laboratory system based on two-tier partition

2023, 31(4): 252-257
郑文丽1, 熊贝贝, 程立勋, 蔡伊娜, 包先雨2
1. Shenzhen Academy of Inspection and Quarantine (深圳市检验检疫科学研究院); 2. Shenzhen Academy of Inspection and Quarantine, Shenzhen 518045, 601459563@qq.com
Abstract: When a laboratory system processes massive volumes of raw data, real-world workloads often combine a high sampling rate with high data skewness, and a two-tier partitioning algorithm used to balance the load across Reducer nodes in a homogeneous environment cannot handle these cases effectively. To address this, MapReduce parallel processing is introduced to improve the utilization of sampled data in the laboratory system, and the ICSC (Improved Cluster Split Combination) partition scheduling algorithm is adopted to cope with high data skewness and high sampling rates. Experiments show that the MapReduce load balancing algorithm based on two-tier partition effectively reduces the idle time of Mapper and Reducer nodes. As data skewness increases, the execution time of the algorithm remains essentially unchanged; that is, data skewness has little effect on its execution time. As the data sampling rate increases, the ICSC partition scheduling algorithm likewise maintains the lowest time overhead among the compared models. The algorithm therefore weakens the dependency between Reducer nodes and improves the execution efficiency and fault tolerance of MapReduce tasks, achieving efficient load balancing for data processing in a laboratory system under the MapReduce framework.
Key words: two-tier partition; MapReduce; partition scheduling; load balancing; laboratory system
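The abstract names the ICSC algorithm but does not describe its steps. Purely as an illustration of the general split-then-combine idea that the name suggests, the Python sketch below splits oversized key clusters and then packs the fragments onto the least-loaded reducer. The function name `split_and_combine`, the ceiling-of-mean load target, the greedy placement rule, and the sample data are all assumptions made for this sketch; it is not the authors' algorithm.

```python
import heapq


def split_and_combine(cluster_sizes, num_reducers):
    """Assign key clusters to reducers, splitting oversized clusters first.

    cluster_sizes: dict mapping a cluster key to its record count.
    Returns (assignment, loads): assignment[r] is the list of (key, size)
    fragments placed on reducer r; loads[r] is the total size on reducer r.
    """
    total = sum(cluster_sizes.values())
    # Ceiling of the ideal per-reducer load; no fragment may exceed it.
    target = max(1, -(-total // num_reducers))

    # Split step (assumed): cut any cluster larger than the target.
    fragments = []
    for key, size in cluster_sizes.items():
        while size > target:
            fragments.append((key, target))
            size -= target
        if size > 0:
            fragments.append((key, size))

    # Combine step (assumed): largest fragment first, always placed on
    # the reducer that currently carries the smallest load.
    fragments.sort(key=lambda f: f[1], reverse=True)
    heap = [(0, r) for r in range(num_reducers)]
    assignment = [[] for _ in range(num_reducers)]
    for key, size in fragments:
        load, r = heapq.heappop(heap)
        assignment[r].append((key, size))
        heapq.heappush(heap, (load + size, r))

    loads = [sum(s for _, s in frags) for frags in assignment]
    return assignment, loads


if __name__ == "__main__":
    # Hypothetical skewed input: one hot key dominates the workload.
    sizes = {"a": 900, "b": 50, "c": 30, "d": 20}
    _, loads = split_and_combine(sizes, 4)
    print(loads)  # prints [250, 250, 250, 250]
```

With plain hash partitioning, the hot key "a" would place 900 of the 1000 records on a single reducer. Note that splitting one key across reducers only works when the reduce function can be merged afterwards, for example sums or counts.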
Received: 2022-11-11
Funding: National Key Research and Development Program of China (2019YFC1605401); General Administration of Customs project (2020HK109).