基于社区森林模型的分布式重叠社区发现算法
来源:用户上传
作者:张妍 刘滨 梅卫 许云峰 谷利东 于彭帅 石钰 魏西峰
摘 要:重B社区发现是复杂网络挖掘中的重要基础工作,可以应用于社交网络、通讯网络、蛋白质相互作用网络、代谢路径网络、交通网络等多种网络的数据分析,从而服务智慧交通、传染病防治、舆情分析、新药研制和人力资源管理等领域。传统的单机运算架构已经难以满足各类大规模复杂网络的分析和计算要求。人工智能领域的研究人员提出将社区发现应用到网络表示学习领域,以丰富网络中节点和边的特征,但传统的重叠社区发现算法在设计时未能考虑来自网络表示学习任务的相关要求,只重点关注节点的社区划分,缺乏对社区内部结构和外部边界的考虑,例如没有涉及节点在社区内部的权重和属于多个社区的归属度排序等,因而不能提供网络中节点和社区更丰富的特征信息,导致对网络表示学习任务支持不足。
针对传统单机重叠社区发现算法已经不适用于大规模复杂网络挖掘,以及不能满足网络表示学习任务的相关要求等问题,提出一种基于社区森林模型的分布式重叠社区发现算法(distributed community forest model,简称DCFM算法)。首先,将网络数据集存储到分布式文件系统,将数据分块,使用分布式计算框架在每个数据分块上执行CFM算法;然后,执行社区合并;最后,汇总社区划分结果,使用真实的DBLP数据集将算法运行于Spark集群上,采用F均值和运行时间对算法进行评估。结果表明,DCFM算法的F均值稍逊于CFM算法,但其运算时间随着节点的增加接近线性下降,在牺牲小部分F均值的同时,DCFM算法具备处理大规模网络数据的能力;分割份数对计算时间的影响很大,在com-dblp.ungraph.txt数据集上,CFM算法处理数据需要192 min,而DCFM算法在将数据分成6份时,需要约91 min,分成100份后仅需要约13 min。因此,在大数据平台上采用分布式计算骨干度,从而进行社区划分、合并的DCFM算法是一种可行的大规模复杂网络挖掘方法,通过分割网络,可以大幅加快社区划分速度,提高社区发现效率。
关键词:分布式处理系统;社交网络;重叠社区;社区森林模型;社区发现
中图分类号:TP311.13 文献标识码:A
Abstract:Overlapping community discovery is an important basic work in complex network mining.It can be applied to the data analysis of social networks,communication networks,protein interaction networks,metabolic path networks,transportation networks and other networks,so as to serve the fields of intelligent transportation,infectious disease prevention and control,public opinion analysis,new drug development and human resource management.The traditional stand-alone computing architecture has been difficult to meet the analysis and computing requirements of various large-scale complex networks.Researchers in the field of artificial intelligence propose to apply community discovery to the field of network representation learning to enrich the characteristics of nodes and edges in the network.However,the traditional overlapping community discovery algorithm fails to consider the relevant requirements from the network representation learning task in its design,only focuses on the community division of nodes,and lacks consideration of the internal structure and external boundary of the community.For example,it does not involve the weight of nodes within the community and the attribution ranking belonging to multiple communities,so it cannot provide richer characteristic information of nodes and communities in the network,resulting in insufficient support for network representation learning tasks.Aiming at the problem that the traditional single machine overlapping community discovery algorithm is not suitable for large-scale complex network mining and cannot support the relevant requirements of network representation learning tasks,a distributed overlapping community discovery algorithm based on community forest model (DCFM algorithm) was proposed.Firstly,the network dataset was stored in the distributed file system,the data were divided into blocks,and the distributed computing framework was used to execute the CFM algorithm on each data block;then,the community consolidation was performed;Finally,the community division results were summarized,and the algorithm was run on the spark cluster by using the real DBLP dataset and was evaluated by F Value and running time.The results show that the f-means of DCFM algorithm is slightly inferior to that of CFM algorithm,but its operation time decreases linearly with the increase of nodes.While sacrificing a small part of f-means,DCFM algorithm has the ability to process large-scale network data;the number of split copies has a great impact on the calculation time,which can be found in com DBLP ungraph.Txt data set,CFM algorithm needs 192 min to process data,while DCFM algorithm needs about 91 min to divide the data into 6 parts,and only about 13 min after dividing into 100 parts.Therefore,on the big data platform,DCFM algorithm uses distributed computing backbone to divide and merge communities,which is a feasible large-scale complex network mining method.By dividing the network,it can greatly improve the speed of community division and the efficiency of community discovery.
nlc202206151111
转载注明来源:https://www.xzbu.com/1/view-15433604.htm