Abstract: In multi-label learning, each training instance is associated with a set of labels, and the task of multi-label classification is to predict the label set of an unseen instance. The k-nearest-neighbor (kNN) method is a classic single-label classifier: it computes the distances between a new instance and the training instances, selects the k training instances closest to the new one, and decides its class by a vote over these neighbors' class labels. This strategy cannot be applied directly to multi-label problems. In this thesis we extend the kNN method to multi-label classification, focusing on the post-processing step that turns the neighbors' label information into a predicted label set. We collect and implement five post-processing methods: the k/2 rule, the discrete Bayesian method, logistic regression, the linear threshold function, and multi-output linear regression, and evaluate them on three data sets (Yeast, Image, and Scene). The experiments show that all five post-processing methods perform well on multi-label classification, with the discrete Bayesian method, multi-output linear regression, and logistic regression performing comparatively better; in addition, the choice of distance metric has a noticeable effect on the algorithms' performance.
Key words: multi-label classification; k-nearest neighbor; k/2 rule; discrete Bayesian method; linear threshold function; multi-output linear regression; logistic regression
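As a concrete illustration of the kNN extension summarized in the abstract, the sketch below shows a minimal multi-label kNN predictor with a per-label majority vote, which is one plausible reading of the k/2 rule: a label is assigned when at least half of the k nearest neighbors carry it. The function name knn_multilabel_k2, the binary label-matrix layout, and the use of Euclidean distance are illustrative assumptions, not the exact implementation used in the thesis.

    import numpy as np

    def knn_multilabel_k2(X_train, Y_train, x, k=10):
        """Predict a label set for x by kNN voting (illustrative sketch).

        X_train : (n, d) array of training features.
        Y_train : (n, q) binary label matrix (hypothetical layout).
        x       : (d,) query instance.
        Returns a length-q 0/1 vector of predicted labels.
        """
        # Euclidean distances from x to every training instance
        dists = np.linalg.norm(X_train - x, axis=1)
        # indices of the k nearest neighbors
        nn = np.argsort(dists)[:k]
        # per-label vote counts among those k neighbors
        votes = Y_train[nn].sum(axis=0)
        # "k/2 rule" as interpreted here: keep labels backed by at least k/2 neighbors
        return (votes >= k / 2).astype(int)

Replacing the fixed threshold votes >= k / 2 with a threshold learned per label, with a regression on the vote counts, or with a Bayesian decision on them yields the other post-processing variants compared in the thesis.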
Through implementing the programs for this graduation project and writing this thesis, I have gained a fairly comprehensive understanding of the basic methods used in multi-label classification, a subfield of pattern recognition, as well as of the aspects that still leave room for improvement. Of course, what the graduation project taught me goes beyond professional knowledge; more importantly, it trained my way of thinking. Learning how to quickly get acquainted with, become familiar with, and dig into an unknown field will greatly benefit my graduate studies and my life afterwards. It also made me realize the large gap between theory and practice: only through continuous study and exploration can we build the bridge between them, turning theory into practice and, in turn, letting practice better guide theory.