Information Extraction
Chang, C. -hui, M. Kayed, M. Girgis, and K. Shaalan,
"A Survey of Web Information Extraction Systems",
IEEE Trans. on Knowl. and Data Eng., vol. 18, no. 10, Piscataway, NJ, USA, IEEE Educational Activities Department, pp. 1411–1428, oct, 2006.
Abstract: The Internet presents a huge amount of useful information which is usually formatted for its users, which makes it difficult to extract relevant data from various sources. Therefore, the availability of robust, flexible Information Extraction (IE) systems that transform the Web pages into program-friendly structures such as a relational database will become a great necessity. Although many approaches for data extraction from Web pages have been developed, there has been limited effort to compare such tools. Unfortunately, in only a few cases can the results generated by distinct tools be directly compared since the addressed extraction tasks are different. This paper surveys the major Web data extraction approaches and compares them in three dimensions: the task domain, the automation degree, and the techniques used. The criteria of the first dimension explain why an IE system fails to handle some Web sites of particular structures. The criteria of the second dimension classify IE systems based on the techniques used. The criteria of the third dimension measure the degree of automation for IE systems. We believe these criteria provide qualitative measures to evaluate various IE approaches.
Chang, C. -hui, M. Kayed, M. Girgis, and K. Shaalan,
"Criteria for Evaluating Information Extraction Systems",
The 3rd International Conference on Informatics and Systems (INFOS2008), Cairo, Egypt, Faculty of Computers and Information, mar, 2005.
Abstract: The Internet presents a huge amount of useful information which is usually formatted for its users, which makes it difficult to extract relevant data from various sources. Therefore, the availability of robust, flexible Information Extraction (IE) systems that transform the Web pages into program-friendly structures will become a great necessity. Although many approaches for data extraction from Web pages have been developed, there has been limited effort to compare such tools. In addition to briefly surveying the major data extraction approaches described in the literature, the paper mainly presents three classes of criteria for qualitatively analyzing these approaches. The criteria of the first class are concerned with the difficulties of an IE task, so these criteria are capable of determining why an IE system fails to handle some Web sites of particular structures. The criteria of the second class are concerned with the effort made by the user in the training process, so these criteria are capable of measuring the degree of automation for IE systems. The criteria of the third class are concerned with the techniques used in IE tasks, so these criteria are capable of measuring the performance of IE systems.
El-Beltagy, S., M. Said, and K. Shaalan,
"A Framework for Information Extraction, Storage and Retrieval",
International Computer Engineering Conference: New Technologies for the Information Society (ICENCO'2004), Cairo, Egypt, Faculty of Engineering, dec, 2004.
Abstract: This paper presents a set of tools that were developed in order to facilitate and speed up the process of building information extraction and retrieval systems for documents that exhibit a set of predefined characteristics. Specifically, the work presents a simple framework for extracting information found in publications or documents that are issued in large volumes and which cover similar concepts or issues within a given domain. The paper presents a simple model for defining background knowledge and for using that to automatically augment segments of input documents with metadata in order to assist users in easily locating information within these documents through a structured front end. The model presented makes use of both document structure and dynamically acquired background knowledge to achieve its goals.