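Abstract: Natural language processing, an important branch of computer science and artificial intelligence, is applied ever more widely, and almost every system that handles Chinese text must perform word segmentation as a first step. Chinese information processing has a history of only two or three decades, and its value for Classical Chinese deserves further exploration. This paper addresses word segmentation of the pre-Qin text "MengZi" (《孟子》): given unseen "MengZi" text, the system splits it into individual words and assigns first-level (coarse) part-of-speech tags, so that this part of the pre-Qin literature can be processed more quickly and conveniently.
First, the fourteen chapters of the original "MengZi" text are selected as experimental material. Frequency statistics and frequency-intersection statistics are computed over these fourteen chapters for one-, two-, three-, and four-character strings, and the results are stored in a small Access database. A supervised segmentation method is then applied: the raw corpus is converted to the required encoding and format and segmented with a CRF (conditional random field) segmenter to obtain a preliminary segmentation, whose accuracy is measured. Next, words already recorded in the Access database are looked up to refine the segmentation, and the accuracy is measured again. Segmentation accuracy reaches 90%, which indicates that combining the two methods improves segmentation effectiveness.
Keywords: word-frequency intersection statistics, database, CRF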
Abstract: Natural language processing, an important branch of artificial intelligence and computer science, is becoming increasingly widespread in application. Almost any system that works with Chinese must first perform word segmentation. Chinese information processing has a history of only two or three decades, and its application to Classical Chinese is worth further exploration. This paper focuses on segmenting the pre-Qin text "MengZi": given unseen "MengZi" text, it divides the text into separate words and assigns first-level part-of-speech tags, so that this part of the pre-Qin literature can be handled more conveniently and efficiently. First, the fourteen chapters of the original "MengZi" corpus are selected as experimental material. Frequency statistics and frequency-intersection statistics are computed for one-, two-, three-, and four-character strings in the corpus, and the results are used to build a small Access database. Then a supervised segmentation method is applied: the original corpus is converted in encoding and format and segmented with a CRF segmenter to obtain a preliminary result, whose accuracy is measured. Existing words are then looked up in the Access database to refine the segmentation, and the accuracy is measured again. Segmentation accuracy reaches 90%. These results show that combining the two segmentation methods improves segmentation effectiveness.
Keywords: word-frequency intersection statistics, database, CRF
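As a rough illustration of the frequency statistics and corpus preparation summarized in the abstract, the following Python sketch counts character n-grams per chapter, keeps those shared across chapters, and converts a hand-segmented sentence into per-character tags of the kind a CRF segmenter is commonly trained on. Several points are assumptions rather than details given in the paper: "frequency intersection" is read here as "n-grams that occur in at least two chapters", the B/M/E/S tag set is one common convention, the function names are hypothetical, and the Access database is replaced by plain in-memory dictionaries.

```python
from __future__ import annotations

import re
from collections import Counter

# Punctuation and whitespace that should not appear inside an n-gram (assumed set).
_CLAUSE_BREAK = re.compile(r"[,。;:?!、“”‘’《》\s]+")


def ngram_counts(text: str, n: int) -> Counter:
    """Count runs of n consecutive characters, never crossing punctuation."""
    counts: Counter = Counter()
    for clause in _CLAUSE_BREAK.split(text):
        counts.update(clause[i:i + n] for i in range(len(clause) - n + 1))
    return counts


def shared_ngrams(chapters: list[str], n: int, min_chapters: int = 2) -> dict[str, int]:
    """Return n-grams found in at least min_chapters chapters, with total frequency.

    This is one possible reading of "frequency intersection statistics".
    """
    totals: Counter = Counter()        # total occurrences over all chapters
    chapter_hits: Counter = Counter()  # number of chapters containing the n-gram
    for chapter in chapters:
        counts = ngram_counts(chapter, n)
        totals.update(counts)
        chapter_hits.update(counts.keys())
    return {g: totals[g] for g in totals if chapter_hits[g] >= min_chapters}


def to_bmes(words: list[str]) -> list[tuple[str, str]]:
    """Turn a segmented sentence into (character, tag) pairs using B/M/E/S tags."""
    tagged: list[tuple[str, str]] = []
    for word in words:
        if len(word) == 1:
            tagged.append((word, "S"))            # single-character word
        else:
            tagged.append((word[0], "B"))         # word-initial character
            tagged.extend((c, "M") for c in word[1:-1])  # word-internal characters
            tagged.append((word[-1], "E"))        # word-final character
    return tagged


if __name__ == "__main__":
    chapters = ["孟子見梁惠王", "梁惠王曰"]
    print(shared_ngrams(chapters, 2))
    print(to_bmes(["孟子", "見", "梁惠王"]))
```

Shared n-grams of length one to four could serve as candidate lexicon entries for the dictionary lookup step, while the tagged pairs would be written out in whatever column format the chosen CRF tool expects.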
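Chinese information processing has a history of only two or three decades, and its value for Classical Chinese has not yet been fully realized. As Yuchi Zhiping (尉迟治平), a computational linguist of Classical Chinese, has urged: "We hope that software for automatic word segmentation, automatic sentence punctuation, and automatic annotation of electronic texts in the history of Chinese will appear soon; experts would then only need to correct errors and fill in omissions in the output, which would greatly reduce the labor of attribute-based annotation and speed up the work." This project takes the pre-Qin text "MengZi" (《孟子》) as its concrete entry point and is devoted to word segmentation of Classical Chinese. Its significance is threefold: (1) Automatic processing of ancient texts is a branch of natural language processing, which is an important component of artificial intelligence. (2) Classical Chinese texts carry much of China's cultural, medical, and scientific heritage (for example, the Analects (《论语》) and the Compendium of Materia Medica (《本草纲目》)), and fast translation of such texts would facilitate research in other scientific fields. (3) Computer-based intelligent recognition of ancient texts can also help people better understand the linguistic competence underlying Classical Chinese and the mechanisms of intelligence, thereby deepening research on Classical Chinese.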