JCSE

JCSE, vol. 5, no. 2, pp. 150-160, 2011

DOI: 10.5626/JCSE.2011.5.2.150

Issues and Empirical Results for Improving Text Classification

Youngjoong Ko, Jungyun Seo
Department of Computer Engineering, Dong-A University, Busan, Korea; Department of Computer Engineering, Sogang University, Seoul, Korea

Abstract: Automatic text classification has a long history, and many studies have been conducted in this field. In particular, many machine learning algorithms and information retrieval techniques have been applied to text classification tasks. Despite this technical progress, there is still room for improvement. This paper discusses three remaining issues in improving text classification: automatic training data generation, noisy data treatment, and term weighting and indexing. It introduces four studies that address these issues, together with their empirical results. First, a semi-supervised learning technique is applied to text classification to create training data efficiently. Second, for effective noisy data treatment, a noisy data reduction method and a text classifier that is robust to noisy data are developed. Finally, the term weighting and indexing technique is revised so that the importance of sentences, estimated using summarization techniques, is reflected in the term weight calculation.

Keywords: Text classification; Semi-supervised learning; Noisy data reduction; Importance of sentence; Term weighting and indexing
