Imbalanced Classification of Technical Documents Based on Few-shot Data Augmentation
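Authors: HUANG Jin-feng, GAO Yan, XU Tong, CHEN En-hong
Abstract: The rapid development of science and technology has produced a vast number of technical documents, whose effective management and retrieval depend on accurate automatic document classification. However, because disciplines are numerous and develop at different rates, the number of documents per category is severely imbalanced, which weakens the effectiveness of classification techniques. Although prior studies have confirmed that pre-trained language models perform well on text classification tasks, the strong domain specificity of technical documents makes it difficult for general-purpose pre-trained models to achieve good results. More importantly, the number of documents accumulated in different fields differs significantly, and the resulting imbalanced classification problem has not yet been fully solved. To address these problems, this paper introduces and improves several data augmentation strategies to increase the data diversity and classification robustness of few-shot categories, and then discusses, through multiple groups of experiments, the best combination of data augmentation strategies under different pre-trained models. The results show that the proposed framework effectively improves the accuracy of imbalanced classification of technical documents, laying a foundation for automatic classification and intelligent applications of technical documents.
Keywords: text classification; pre-trained model; class imbalance; data augmentation
CLC number: TP391.1; Document code: A; Article ID: 2097-0145(2022)03-0023-08; doi: 10.11847/fj.41.3.23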
Research of Imbalanced Classification for Technical Documents
Based on Few-shot Data Augmentation
HUANG Jin-feng, GAO Yan, XU Tong, CHEN En-hong
(School of Computer Science, University of Science and Technology of China, Hefei 230027, China)
Abstract: Recent years have witnessed the rapid development of science and technology, which has produced abundant technical documents. Automatic classification tools are therefore urgently required to support the management and retrieval of these documents. Although prior work has shown that pre-trained models can achieve competitive performance on text classification tasks, their effectiveness may still be limited by the domain-specific characteristics of technical documents. Even worse, because documents accumulate unevenly across research fields, a severe class imbalance arises, which impairs the effectiveness of classification tools. To deal with these issues, we propose a comprehensive framework that adapts multiple data augmentation strategies to improve the diversity and robustness of document samples in few-shot categories. Moreover, extensive experiments are conducted to reveal the most effective combination of data augmentation strategies under different pre-trained models. The results indicate that the proposed framework effectively improves performance on the imbalanced classification task and further supports intelligent services for technical documents.
Key words:text classification; pre-trained language model; class imbalance; data augmentation
1 Introduction
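In recent years, with continually increasing investment in scientific research, the substantial progress of all disciplines has produced a vast number of technical documents. Take the output of scientific papers, an important indicator of the level of scientific and technological development, as an example: over the 10 years since 2012, the number of papers by Chinese authors indexed in the SCI database has risen continuously, exceeding 500,000 in 2019. This trend reflects the flourishing of scientific research, but it also poses great challenges for the effective management and efficient retrieval of technical documents. Because the few keywords supplied by authors can hardly accommodate a hierarchically complex label system and dynamically changing classification standards, they often fail to deliver the required accuracy in practice. Automatic classification based on the rich text of technical documents, using machine learning, has therefore become a timely need.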
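In fact, natural language has complex semantic structure and rich diversity and ambiguity, and its meaning changes with the external context, so understanding and classifying long texts such as technical documents is inherently difficult. In recent years, with the introduction of pre-trained language models[1~3] such as BERT[1], more and more researchers have focused on transfer learning by pre-training followed by fine-tuning for text classification. The pre-trained language model learns text representations from massive unlabeled corpora according to designed proxy tasks, capturing the structural information contained in language. Thus, by taking a pre-trained language model and fine-tuning it on a specific downstream task, the information in massive unlabeled corpora can be effectively transferred to the downstream task, which has yielded good results across a variety of text classification tasks.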