
JCSE, vol. 11, no. 2, pp. 39-48, June 2017

DOI: http://dx.doi.org/10.5626/JCSE.2017.11.2.39

Main Content Extraction from Web Pages Based on Node Characteristics

Qingtang Liu, Mingbo Shao, Linjing Wu, Gang Zhao, Guilin Fan, and Jun Li
School of Educational Information Technology, Central China Normal University, Wuhan, China; School of Information Engineering, Hubei University for Nationalities, Enshi, China

Abstract: Main content extraction from web pages is widely used in search engines, web content aggregation, and mobile Internet browsing. However, web pages contain a large amount of irrelevant information, such as advertisements, unrelated navigation links, and other junk content. Such information reduces the efficiency of web content processing in content-based applications. This paper proposes an automatic method for extracting the main content of web pages. The method uses two indicators to describe the characteristics of web pages: text density and hyperlink density. Because similar content tends to be distributed continuously on a page, we use an estimation algorithm to judge whether a node is a content node or a noise node based on the characteristics of the node and its neighboring nodes. This algorithm allows us to filter out advertisement nodes and irrelevant navigation. Experimental results on 10 news websites show that the algorithm achieves an average acceptable rate of 96.34%.
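The abstract does not give the exact formulas or the estimation algorithm, so the following Python sketch only illustrates the general idea under stated assumptions. It assumes text density is characters per tag, hyperlink density is the share of a node's text that sits inside links, and that a borderline node is accepted when an adjacent node is clearly content. The `Node` type, the scoring rule, and the `high` and `low` thresholds are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    """Hypothetical summary of one DOM node (not the paper's data structure)."""
    text: str        # all visible text under the node
    link_text: str   # the part of that text that sits inside <a> elements
    tag_count: int   # number of tags under the node, counting the node itself


def text_density(node: Node) -> float:
    # Assumed definition: characters of text per tag.
    return len(node.text) / max(node.tag_count, 1)


def hyperlink_density(node: Node) -> float:
    # Assumed definition: fraction of the node's text that is link text.
    return len(node.link_text) / len(node.text) if node.text else 0.0


def node_score(node: Node) -> float:
    # Illustrative combination: dense text with few links scores high.
    return text_density(node) * (1.0 - hyperlink_density(node))


def classify(nodes: List[Node], high: float = 50.0, low: float = 10.0) -> List[bool]:
    """Return True for content nodes and False for noise nodes.

    A node is content if its own score reaches `high`, or if it reaches
    `low` and at least one adjacent sibling reaches `high`. The second
    rule reflects the idea that similar content is distributed
    continuously on a page. Both thresholds are arbitrary placeholders.
    """
    scores = [node_score(n) for n in nodes]
    strong = [s >= high for s in scores]
    labels = []
    for i, s in enumerate(scores):
        neighbor_strong = (i > 0 and strong[i - 1]) or (
            i + 1 < len(scores) and strong[i + 1]
        )
        labels.append(strong[i] or (s >= low and neighbor_strong))
    return labels
```

With this rule, a navigation bar or advertisement made up entirely of links scores zero and is filtered regardless of its neighbors, while a short paragraph sandwiched between long paragraphs is kept. Real pages would need thresholds tuned on data, which the sketch does not attempt.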

Keywords: Content extraction; Web page; Text density; Hyperlink density

ⓒ Copyright 2010 KIISE – All Rights Reserved.    
Korean Institute of Information Scientists and Engineers (KIISE)   #401 Meorijae Bldg., 984-1 Bangbae 3-dong, Seo-cho-gu, Seoul 137-849, Korea
Phone: +82-2-588-9240    Fax: +82-2-521-1352    Homepage: http://jcse.kiise.org    Email: office@kiise.org