
Chinese Sequence Labeling Based on Stack Pretraining Model


  DOI: 10.15938/j.jhust.2022.01.002    CLC number: TP391    Document code: A    Article ID: 1007-2683(2022)01-0008-06
  LIU Yupeng,LI Guodong
  (School of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150001, China)
  Abstract: Sequence labeling is an important task in natural language processing. In this paper, exploiting the relatedness of the tasks, we use a stacked pretraining model for feature extraction, word segmentation, and named entity recognition/chunk tagging. Through in-depth study of its internal structure, the Bidirectional Encoder Representations from Transformers (BERT) model is optimized while preserving its original accuracy, which reduces its complexity and the time cost of training and prediction. In the upper structure, in contrast to a traditional single long short-term memory network (LSTM), this paper uses a two-layer bidirectional LSTM: the bottom bidirectional LSTM (BiLSTM) performs word segmentation, and the top layer performs the sequence labeling task. The new semi-Markov conditional random field (NSCRF) combines the traditional semi-Markov conditional random field (Semi-CRF) with the conditional random field (CRF), considering both segment-level and word-level labels, which improves accuracy in training and decoding. We trained the model on the CCKS2019, MSRANER, and BosonNLP datasets and achieved substantial improvements, with F1 measures of 92.37%, 95.69%, and 93.75%, respectively.
  Keywords: stacked model based on BERT; pretrained model; named entity recognition; chunk analysis
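  As a rough illustration of the stacked structure described in the abstract, the sketch below wires pretrained features into a lower BiLSTM for segmentation and an upper BiLSTM for NER/chunk scores. It is a minimal PyTorch sketch under stated assumptions, not the authors' implementation: the class name, layer sizes, random stand-in for BERT features, and plain linear scoring heads are all illustrative, and the paper's BERT optimization and NSCRF decoding layer are not reproduced here.

import torch
import torch.nn as nn

class StackedBiLSTMTagger(nn.Module):
    # Hypothetical two-layer tagger: lower BiLSTM for word segmentation,
    # upper BiLSTM for NER/chunk labels, mirroring the stacking in the paper.
    def __init__(self, feat_dim=768, hidden=256, n_seg_tags=4, n_ner_tags=9):
        super().__init__()
        self.seg_lstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                                bidirectional=True)
        self.seg_head = nn.Linear(2 * hidden, n_seg_tags)   # segmentation scores
        self.ner_lstm = nn.LSTM(2 * hidden, hidden, batch_first=True,
                                bidirectional=True)
        self.ner_head = nn.Linear(2 * hidden, n_ner_tags)   # NER/chunk scores

    def forward(self, feats):
        seg_states, _ = self.seg_lstm(feats)       # (batch, seq, 2*hidden)
        ner_states, _ = self.ner_lstm(seg_states)  # upper layer reads lower states
        return self.seg_head(seg_states), self.ner_head(ner_states)

# Random stand-in for BERT output: 2 sentences, 16 tokens, 768-dim features.
feats = torch.randn(2, 16, 768)
seg_scores, ner_scores = StackedBiLSTMTagger()(feats)
print(seg_scores.shape, ner_scores.shape)  # (2, 16, 4) and (2, 16, 9)

  In the full model, the per-token scores of the upper layer would feed the NSCRF, which combines CRF-style word-level potentials with Semi-CRF-style segment-level potentials during training and decoding.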
  0 Introduction
  With the advent of the big-data era, the Internet has become the primary channel of information dissemination, yet the amount of text on the web grows exponentially every day. Efficiently mining the useful information buried in this massive volume of text has therefore become an important task in natural language processing (NLP) and related fields. Chinese sequence labeling is a crucial step toward computers understanding human language and enabling human-computer interaction, as it converts Chinese sentences into a form machines can process. Named entity recognition (NER) and chunking are low-level NLP techniques for identifying proper nouns and phrase structure in sentences: a trained NER model recognizes proper nouns such as person names, place names, and organization names in text, while chunking identifies the phrase-level blocks of a sentence. Their accuracy directly affects the performance of downstream tasks such as intelligence analysis, public opinion analysis, and literature analysis.
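  To make the task concrete, here is a hypothetical character-level example of the BIO labeling scheme widely used for Chinese NER; the sentence and tag inventory are illustrative assumptions, not data from the paper.

# Each character receives a BIO tag marking entity boundaries:
# B- opens an entity, I- continues it, O marks non-entity characters.
chars = list("刘宇鹏在哈尔滨工作")  # "Liu Yupeng works in Harbin"
tags = ["B-PER", "I-PER", "I-PER", "O",
        "B-LOC", "I-LOC", "I-LOC", "O", "O"]
for ch, tag in zip(chars, tags):
    print(f"{ch}\t{tag}")

  Chunk labels follow the same pattern, with tags such as B-NP/I-NP marking phrase boundaries instead of entity boundaries.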
