FiVaTech: Page-Level Web Data Extraction from Template Pages

Mohammed Kayed; Chia-hui Chang; Khaled Shaalan; Moheb R. Girgis

Citation: Kayed, M., C.-H. Chang, K. Shaalan, and M. R. Girgis, "FiVaTech: Page-Level Web Data Extraction from Template Pages", International Workshop on Data Mining in Web 2.0 Environments, Omaha, USA, IEEE Computer Society, pp. 15–20, 28 October 2007. Copy at www.tinyurl.com/j2q4r8d

Date Presented:

28 October

Abstract:

In this paper, we propose a new approach to Web data extraction called FiVaTech. FiVaTech is a page-level data extraction system that deduces the data schema and templates for input pages generated by a CGI program. It uses tree templates to model the generation of dynamic Web pages, and it can deduce the schema and templates for each individual Deep Web site, whether its pages contain a single data record or multiple data records. FiVaTech applies tree matching, tree alignment, and mining techniques to accomplish this task. Experiments show encouraging results on test pages used in many state-of-the-art Web data extraction works.
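
As a rough illustration of the tree-matching step named in the abstract, the sketch below scores how similar two DOM-like trees are. It uses the classic Simple Tree Matching algorithm as a stand-in; the abstract does not specify which matching procedure FiVaTech uses, so this should not be read as the authors' method. The `Node` class and the `simple_tree_matching` function are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A minimal DOM-like node (hypothetical; not from the paper)."""
    tag: str
    children: list["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the number of node pairs in a maximum order-preserving match.

    This is Simple Tree Matching, a generic technique assumed here for
    illustration, not FiVaTech's own matching score.
    """
    # Subtrees whose roots differ contribute nothing.
    if a.tag != b.tag:
        return 0

    m, n = len(a.children), len(b.children)
    # dp[i][j] = best match between the first i children of a
    # and the first j children of b, keeping sibling order.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            paired = dp[i - 1][j - 1] + simple_tree_matching(
                a.children[i - 1], b.children[j - 1]
            )
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], paired)

    # Add 1 for the matched roots themselves.
    return dp[m][n] + 1


# Two pages from the same (assumed) template: lists with 2 and 3 records.
page_a = Node("ul", [Node("li", [Node("b")]) for _ in range(2)])
page_b = Node("ul", [Node("li", [Node("b")]) for _ in range(3)])
score = simple_tree_matching(page_a, page_b)
```

A high score relative to the tree sizes suggests that the two pages share a template, while the unmatched third `li` hints at a repeated record. A template-induction system would then have to handle such repeats through alignment and pattern mining.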

Khaled Shaalan

Professor of Computer Science