JCSE, vol. 3, no. 3, pp.165-180, 2009
HKIB-20000 & HKIB-40075: Hangul Benchmark Collections for Text Categorization Research
Jinsuk Kim, Ho-Seop Choe, Beom-Jong You, Jeong-Hyun Seo, Suk-Hoon Lee, Dong-Yul Ra
Department of Information Technology Research, KISTI, Daejeon, Korea|Department of Cyber Environment Development, KISTI, Daejeon, Korea|Department of Information & Statistics, Chungnam National University, Daejeon, Korea|Computer & Telecommunication Engine
Abstract: The HKIB, or Hankookilbo, test collections are two archives of Korean newswire stories manually categorized with semi-hierarchical or hierarchical category taxonomies. The base newswire stories were made available by the Hankook Ilbo (The Korea Daily) for research purposes. First, Chungnam National University and KISTI collaborated to manually tag 40,075 news stories with categories from a semi-hierarchical, balanced three-level classification scheme, in which each news story has exactly one level-3 category (single-labeling). We refer to this original data set as the HKIB-40075 test collection. Yonsei University and KISTI then collaborated to select 20,000 newswire stories from the HKIB-40075 test collection, rearrange the classification scheme to be fully hierarchical but unbalanced, and assign one or more categories to each news story (multi-labeling). We refer to this modified data set as the HKIB-20000 test collection. We benchmark a k-NN categorization algorithm on both HKIB-20000 and HKIB-40075, illustrating properties of the collections, providing baseline results for future studies, and suggesting new directions for further research on Korean text categorization.