JCSE, vol. 5, no. 2, pp.150-160, June, 2011
DOI: 10.5626/JCSE.2011.5.2.150
Issues and Empirical Results for Improving Text Classification
Youngjoong Ko, Jungyun Seo
Department of Computer Engineering, Dong-A University, Busan, Korea
Department of Computer Engineering, Sogang University, Seoul, Korea
Abstract: Automatic text classification has a long history, and many studies have been conducted in this field. In particular, many machine learning algorithms and information retrieval techniques have been applied to text classification tasks. Although much technical progress has been made, there is still room for improvement. In this paper, we discuss the remaining issues in improving text classification. Three issues are presented: automatic training data generation, noisy data treatment, and term weighting and indexing. Four studies addressing these issues are introduced along with their empirical results. First, a semi-supervised learning technique is applied to text classification to create training data efficiently. Second, for effective noisy data treatment, a noisy data reduction method and a text classifier that is robust to noisy data are developed. Finally, the term weighting and indexing technique is revised by incorporating the importance of sentences, estimated using summarization techniques, into the term weight calculation.
Keywords:
Text classification; Semi-supervised learning; Noisy data reduction; Importance of sentence; Term weighting and indexing