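Abstract: Text classification is an important application area of natural language processing and plays a significant role in information retrieval, digital libraries, and text filtering. It also helps make document management more systematic and standardized, in line with the requirements of modern management practice.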
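This paper first introduces the research background and significance of text classification and surveys the state of research in China and abroad. It then describes in detail the techniques and algorithms used to build a text classification system. Building on this account of Chinese information processing and of text classification techniques and algorithms, it implements a Chinese text classification system based on the vector space model: feature selection is used to build a class model from the training sample set, a classifier is designed with that model as the basis for automatic classification, and texts are classified with the ROCCHIO and KNN methods in turn. Finally, the experimental results are analyzed and evaluated.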
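Automatic text classification consists of four stages: text modeling, training, classification, and performance evaluation. The text is first preprocessed and represented with a model, and features are extracted; a classifier is then constructed and trained; the trained classifier is applied to new texts; and finally the classification performance is evaluated.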
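The model, training, and classification stages can be illustrated with a minimal sketch. It assumes the documents have already been segmented into words, uses raw term frequencies as the vector space representation, and reduces ROCCHIO to its simplest centroid form (one averaged vector per class, compared by cosine similarity). The function names and the toy data are illustrative assumptions, not the system described in this paper.

```python
import math
from collections import Counter, defaultdict

def vectorize(tokens, vocabulary):
    """Map a token list to a term-frequency vector over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def train_rocchio(vectors, labels):
    """Build one class model (centroid) per label by averaging its training vectors."""
    grouped = defaultdict(list)
    for vec, label in zip(vectors, labels):
        grouped[label].append(vec)
    return {label: [sum(col) / len(vecs) for col in zip(*vecs)]
            for label, vecs in grouped.items()}

def classify_rocchio(centroids, vec):
    """Assign the label whose centroid is most similar to the document vector."""
    return max(centroids, key=lambda label: cosine(centroids[label], vec))

def classify_knn(vectors, labels, vec, k=3):
    """Assign the majority label among the k most similar training vectors."""
    ranked = sorted(zip(vectors, labels),
                    key=lambda pair: cosine(pair[0], vec), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy training set (hypothetical, pre-segmented documents).
train_docs = [
    (["network", "software", "computer"], "computer"),
    (["computer", "program", "software"], "computer"),
    (["match", "team", "goal"], "sports"),
    (["team", "athlete", "match"], "sports"),
]
vocabulary = sorted({term for tokens, _ in train_docs for term in tokens})
train_vectors = [vectorize(tokens, vocabulary) for tokens, _ in train_docs]
train_labels = [label for _, label in train_docs]

centroids = train_rocchio(train_vectors, train_labels)
query = vectorize(["software", "computer", "team"], vocabulary)
rocchio_label = classify_rocchio(centroids, query)
knn_label = classify_knn(train_vectors, train_labels, query, k=3)
```

On this toy data both classifiers assign the query to the computer class. The full ROCCHIO formulation also subtracts a weighted centroid of negative examples; that term is omitted here for brevity.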
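The Chinese corpus used in the experiments is divided into a training corpus and a test corpus and covers 10 categories: computer, environment, military, transportation, education, economy, sports, medicine, art, and politics. The training corpus contains 1430 documents and the test corpus 195, for a total of 1525. The experimental data show that classification performance with the MI feature extraction method changes markedly as the feature dimension increases, and that the choice of K in KNN also has a considerable effect on classifier performance. With the best feature dimension and K value, the KNN classifier reaches a macro-averaged precision of 91.9% and a macro-averaged recall of 90.8%, a satisfactory result. The ROCCHIO classifier reaches a macro-averaged precision of 54.9% and a macro-averaged recall of 45.1%, so its performance is unsatisfactory compared with KNN.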
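The macro-averaged figures quoted above are per-class precision and recall values averaged with equal weight across categories. The sketch below shows that computation under the usual definition; the function name and the convention of scoring an empty class as 0.0 are assumptions made for illustration.

```python
def macro_precision_recall(true_labels, predicted_labels):
    """Return (macro precision, macro recall) over all classes seen in either list."""
    classes = sorted(set(true_labels) | set(predicted_labels))
    precisions, recalls = [], []
    for c in classes:
        # Documents of class c that were also predicted as c.
        true_positive = sum(1 for t, p in zip(true_labels, predicted_labels)
                            if t == c and p == c)
        predicted_as_c = sum(1 for p in predicted_labels if p == c)
        actually_c = sum(1 for t in true_labels if t == c)
        # Assumption: a class with no predictions (or no members) scores 0.0.
        precisions.append(true_positive / predicted_as_c if predicted_as_c else 0.0)
        recalls.append(true_positive / actually_c if actually_c else 0.0)
    return sum(precisions) / len(classes), sum(recalls) / len(classes)
```

Because every category contributes equally to the average, a classifier that performs poorly on a few small categories is penalized more under macro averaging than under micro averaging.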
Keywords: text classification, vector space model, feature extraction, training samples
Abstract: Text classification is an important application of natural language processing and plays a significant role in information retrieval, digital libraries, and text filtering. It can make document management more systematic and standardized, adapting it to the requirements of modern management practice.
This article first introduces the research background and significance of text classification and its research status at home and abroad. Second, the technologies and algorithms used in implementing a text classification system are described in detail. Then, building on this introduction to Chinese information processing and to text classification techniques and algorithms, a Chinese text classification system based on the vector space model is implemented: through feature selection, a class model is built from the training sample set, and the classifier is designed with this model as the basis for automatic text classification. The ROCCHIO and KNN text classification methods are applied in turn to classify the texts. Finally, the experimental results are analyzed and evaluated.
Automatic text classification includes four processes: text modeling, training, classification, and performance evaluation. First, the text is preprocessed and represented with a model, and features are extracted; next, a classifier is constructed and trained; then the classifier is used to classify new texts; finally, the classification performance is evaluated.
The Chinese corpus used in this experiment is divided into two parts, a training corpus and a test corpus, covering 10 categories: computer, environment, military, transportation, education, economy, sports, medicine, art, and politics. The training corpus contains 1430 documents and the test corpus 195, for a total of 1525. The experimental data show that classification performance with the MI feature extraction method changes significantly as the feature dimension increases, and that the choice of K in KNN also has a considerable impact on classifier performance. When the feature dimension and K are both set to their best values, the KNN classifier achieves a macro-averaged precision of 91.9% and a macro-averaged recall of 90.8%, which is a satisfactory result. The ROCCHIO classifier achieves a macro-averaged precision of 54.9% and a macro-averaged recall of 45.1%; compared with KNN, its classification performance is unsatisfactory.
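The abstract names MI (mutual information) as the feature extraction method without giving its formula. As a hedged illustration, the sketch below assumes the document-frequency form common in the text categorization literature, MI(t, c) = log(A · N / ((A + B)(A + C))), where A is the number of documents in class c that contain term t, A + B the number of documents containing t, A + C the number of documents in c, and N the corpus size. Taking the maximum over classes as the term's score, and all function names, are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def mi_scores(docs, labels):
    """Score each term by its maximum pointwise mutual information with any class."""
    n = len(docs)
    class_count = Counter(labels)   # A + C for each class
    term_count = Counter()          # A + B for each term
    joint = defaultdict(Counter)    # joint[term][label] = A
    for tokens, label in zip(docs, labels):
        for term in set(tokens):    # document frequency: count each term once per doc
            term_count[term] += 1
            joint[term][label] += 1
    scores = {}
    for term, per_class in joint.items():
        # Classes that never co-occur with the term are skipped (log 0 is undefined).
        scores[term] = max(
            math.log(a * n / (term_count[term] * class_count[label]))
            for label, a in per_class.items()
        )
    return scores

def select_features(docs, labels, dimension):
    """Keep the `dimension` highest-scoring terms (ties broken alphabetically)."""
    scores = mi_scores(docs, labels)
    return sorted(scores, key=lambda term: (-scores[term], term))[:dimension]
```

The `dimension` argument corresponds to the feature dimension varied in the experiments. Pointwise MI of this form tends to favor rare terms, which is one plausible reason performance is sensitive to the number of features kept.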
Keywords: text classification, vector space model, feature extraction, training samples