Information Extraction

Chang, C. -hui, M. Kayed, M. Girgis, and K. Shaalan, "A Survey of Web Information Extraction Systems", IEEE Trans. on Knowl. and Data Eng., vol. 18, no. 10, Piscataway, NJ, USA, IEEE Educational Activities Department, pp. 1411–1428, oct, 2006. Abstractiesurvey2006.pdfWebsite

The Internet presents a huge amount of useful information which is usually formatted for its users, which makes it difficult to extract relevant data from various sources. Therefore, the availability of robust, flexible Information Extraction (IE) systems that transform the Web pages into program-friendly structures such as a relational database will become a great necessity. Although many approaches for data extraction from Web pages have been developed, there has been limited effort to compare such tools. Unfortunately, in only a few cases can the results generated by distinct tools be directly compared since the addressed extraction tasks are different. This paper surveys the major Web data extraction approaches and compares them in three dimensions: the task domain, the automation degree, and the techniques used. The criteria of the first dimension explain why an IE system fails to handle some Web sites of particular structures. The criteria of the second dimension classify IE systems based on the techniques used. The criteria of the third dimension measure the degree of automation for IE systems. We believe these criteria provide qualitatively measures to evaluate various IE approaches.

Shaalan, K., and H. Raza, "Person name entity recognition for Arabic", ACL 2007 Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources, Prague, Czech Republic, Association for Computational Linguistics, pp. 17–24, 28 June, 2007. Abstractpera_cameraready.pdf

Named entity recognition (NER) is nowadays an important task, which is responsible for the identification of proper names in text and their classification as different types of named entity such as people, locations, and organizations. In this paper, we present our attempt at the recognition and extraction of the most important proper name entity, that is, the person name, for the Arabic language. We developed the system, Person Name Entity Recognition for Arabic (PERA), using a rule-based approach. The system consists of a lexicon, in the form of gazetteer name lists, and a grammar, in the form of regular expressions, which are responsible for recognizing person name entities. The PERA system is evaluated using a corpus that is tagged in a semi-automated way. The system performance results achieved were satisfactory and confirm to the targets set forth for the precision, recall, and f-measure.

Shaalan, K., and H. Raza, "NERA: Named Entity Recognition for Arabic", J. Am. Soc. Inf. Sci. Technol., vol. 60, no. 8, New York, NY, USA, John Wiley & Sons, Inc., pp. 1652–1663, 2009. Abstractnera_paper.pdfWebsite

Name identification has been worked on quite intensively for the past few years, and has been incorporated into several products revolving around natural language processing tasks. Many researchers have attacked the name identification problem in a variety of languages, but only a few limited research efforts have focused on named entity recognition for Arabic script. This is due to the lack of resources for Arabic named entities and the limited amount of progress made in Arabic natural language processing in general. In this article, we present the results of our attempt at the recognition and extraction of the 10 most important categories of named entities in Arabic script: the person name, location, company, date, time, price, measurement, phone number, ISBN, and file name. We developed the system Named Entity Recognition for Arabic (NERA) using a rule-based approach. The resources created are: a Whitelist representing a dictionary of names, and a grammar, in the form of regular expressions, which are responsible for recognizing the named entities. A filtration mechanism is used that serves two different purposes: (a) revision of the results from a named entity extractor by using metadata, in terms of a Blacklist or rejecter, about ill-formed named entities and (b) disambiguation of identical or overlapping textual matches returned by different name entity extractors to get the correct choice. In NERA, we addressed major challenges posed by NER in the Arabic language arising due to the complexity of the language, peculiarities in the Arabic orthographic system, non-standardization of the written text, ambiguity, and lack of resources. NERA has been effectively evaluated using our own tagged corpus; it achieved satisfactory results in terms of precision, recall, and F-measure.}

Khaled Shaalan

Professor of Computer Science

Information Extraction