Publications


2009

Shaalan, K., and A. Farghaly, "Introduction to the Special Issue on Arabic Natural Language Processing", ACM Transactions on Asian Language Information Processing (TALIP), vol. 8, no. 4, New York, NY, USA, ACM, pp. 1–3, 2009.

Shaalan, K., and H. Raza, "NERA: Named Entity Recognition for Arabic", J. Am. Soc. Inf. Sci. Technol., vol. 60, no. 8, New York, NY, USA, John Wiley & Sons, Inc., pp. 1652–1663, 2009.

Name identification has been worked on quite intensively for the past few years, and has been incorporated into several products revolving around natural language processing tasks. Many researchers have attacked the name identification problem in a variety of languages, but only a few limited research efforts have focused on named entity recognition for Arabic script. This is due to the lack of resources for Arabic named entities and the limited amount of progress made in Arabic natural language processing in general. In this article, we present the results of our attempt at the recognition and extraction of the 10 most important categories of named entities in Arabic script: the person name, location, company, date, time, price, measurement, phone number, ISBN, and file name. We developed the system Named Entity Recognition for Arabic (NERA) using a rule-based approach. The resources created are: a Whitelist representing a dictionary of names, and a grammar, in the form of regular expressions, which are responsible for recognizing the named entities. A filtration mechanism is used that serves two different purposes: (a) revision of the results from a named entity extractor by using metadata, in terms of a Blacklist or rejecter, about ill-formed named entities and (b) disambiguation of identical or overlapping textual matches returned by different named entity extractors to get the correct choice. In NERA, we addressed major challenges posed by NER in the Arabic language arising due to the complexity of the language, peculiarities in the Arabic orthographic system, non-standardization of the written text, ambiguity, and lack of resources. NERA has been effectively evaluated using our own tagged corpus; it achieved satisfactory results in terms of precision, recall, and F-measure.
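The pipeline described in this abstract (whitelist lookup, a regular-expression grammar, then a filter that rejects blacklisted matches and resolves overlaps) can be sketched roughly as follows. This is only an illustrative sketch and not the NERA implementation: the whitelist entries, the two patterns, the blacklist entry, and the longest-match rule for overlaps are all assumptions made for the example.

```python
import re
from typing import NamedTuple


class Match(NamedTuple):
    start: int
    end: int
    label: str
    text: str


# Hypothetical resources; the actual NERA whitelist, grammar, and blacklist
# are not given in the abstract.
WHITELIST = {"location": ["دبي", "القاهرة"]}
GRAMMAR = {
    "phone": re.compile(r"\+?\d[\d\- ]{6,}\d"),
    "isbn": re.compile(r"ISBN[ :]*\d[\d\-]{8,}[\dX]"),
}
BLACKLIST = {"0000000000"}  # assumed example of an ill-formed "phone number"


def extract(text: str) -> list[Match]:
    candidates = []
    # Step 1: whitelist (dictionary) lookup.
    for label, names in WHITELIST.items():
        for name in names:
            for m in re.finditer(re.escape(name), text):
                candidates.append(Match(m.start(), m.end(), label, m.group()))
    # Step 2: regular-expression grammar.
    for label, pattern in GRAMMAR.items():
        for m in pattern.finditer(text):
            candidates.append(Match(m.start(), m.end(), label, m.group()))
    # Filter (a): reject matches listed in the blacklist.
    candidates = [c for c in candidates if c.text not in BLACKLIST]
    # Filter (b): among overlapping matches keep the longest one
    # (an assumed disambiguation rule, chosen for simplicity).
    candidates.sort(key=lambda c: (-(c.end - c.start), c.start))
    chosen = []
    for c in candidates:
        if all(c.end <= k.start or c.start >= k.end for k in chosen):
            chosen.append(c)
    return sorted(chosen)  # ordered by position in the text


sample = "دبي ISBN 977-10-1234-5 tel 0000000000"
labels = [m.label for m in extract(sample)]
```

In the sample, the digits of the ISBN also satisfy the phone pattern; the overlap filter keeps the longer ISBN match, and the blacklist removes the all-zero number.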

Shaalan, K., A. Abdel-Monem, and A. Rafea, "Syntactic Generation of Arabic in Interlingua-based Machine Translation Framework", Third workshop on Computational Approaches to Arabic Script-based Languages (CAASL3), Machine Translation Summit XII: ACL, 2009.

Arabic is a highly inflectional language, with a rich morphology, relatively free word order, and two types of sentences: nominal and verbal. Arabic natural language processing in general is still underdeveloped and Arabic natural language generation (NLG) is even less developed. In particular, Arabic natural language generation from Interlingua was only investigated using template-based approaches. Moreover, tools used for other languages are not easily adaptable to Arabic due to the Arabic language complexity at both the morphological and syntactic levels. In this paper, we report our attempt at developing a rule-based Arabic generator for task-oriented interlingua-based spoken dialogues. Examples of syntactic generation results from the Arabic generator will be given and will illustrate how the system works. Our proposed syntactic generator has been effectively evaluated using real test data and achieved satisfactory results.

2008

Hossny, A., K. Shaalan, and A. Fahmy, "Automatic Morphological Rule Induction for Arabic", The sixth international conference on Language Resources and Evaluation (LREC'08) workshop on HLT & NLP within the Arabic world: Arabic Language and local languages processing: Status Updates and Prospects, Marrakech, Morocco, LREC, pp. 97–101, May, 2008.

In this paper, we introduce an algorithm for morphological rule induction using meta-rules for Arabic morphology based on inductive logic programming. The processing resources are a set of example pairs (stem and inflected form) with their feature vectors, either positive or negative, and the linguistic background knowledge from the Arabic morphological analysis domain. Each example pair has two words to be analyzed vocally into consonants and vowels. The algorithm applies two levels of mapping: between the vocal representations of the two words (stem, morphed) and between their feature vectors. It differentiates between both mappings in order to accurately deduce which changes in the word structure led to which changes in its features. The paper also addresses the irregularity, productivity and model consistency issues. We have developed an Arabic morphological rule induction system (AMRIS). The evaluation showed that the system's performance was satisfactory.

Abo-Bakr, H., K. Shaalan, and I. Ziedan, "A Hybrid Approach for Converting Written Egyptian Colloquial Dialect into Diacritized Arabic", The 6th International Conference on Informatics and Systems (INFOS2008), Cairo, Egypt, Faculty of Computers and Information, Mar., 2008.

Recently the rate of written colloquial text has increased dramatically. It is being used as a medium of expressing ideas especially across the WWW, usually in the form of blogs and partially colloquial articles. Most of this written colloquial text has been in the Egyptian colloquial dialect, which is considered the most widely understood and used dialect throughout the Arab world. Modern Standard Arabic is the official Arabic language taught and understood all over the Arab world. Diacritics play a key role in disambiguating Arabic text. The reader is expected to infer or predict vowels from the context of the sentence. Inferring the full form of the Arabic word is also useful when developing Arabic natural language processing tools and applications. In this paper, we introduce a generic method for converting a written Egyptian colloquial sentence into its corresponding diacritized Modern Standard Arabic sentence, which could easily be extended to other dialects of Arabic. In spite of the non-availability of linguistic Arabic resources for this task, we have developed techniques for lexical acquisition of colloquial words which are used for transferring written Egyptian Arabic into Modern Standard Arabic. We successfully used a Support Vector Machine approach for the diacritization (aka vocalization or vowelling) of Arabic text.

Abo-Bakr, H., K. Shaalan, and I. Ziedan, "A Statistical Method for Adding Case Diacritics for Arabic Text", Language Engineering Conference: Ain Shams University, pp. 225–234, Dec., 2008.

In this paper, the issue of adding case ending diacritics to undiacritized Arabic text using statistical methods is addressed. The approach requires a large corpus of fully diacritized text for extracting the case endings. Training for detecting the case ending diacritics of each token is based on its part of speech (POS) and BP-chunk position, as well as the position of the token in the statement. The case ending diacritics are then efficiently obtained using the SVM technique. We present an evaluation of the proposed diacritization algorithm and discuss various modifications for improving the performance of this approach.

Shaalan, K., and H. Raza, "Arabic Named Entity Recognition from Diverse Text Types", Advances in Natural Language Processing, vol. 5221: Springer Berlin Heidelberg, pp. 440-451, 2008.

Name identification has been worked on quite intensively for the past few years, and has been incorporated into several products. Many researchers have attacked this problem in a variety of languages, but only a few limited research efforts have focused on Named Entity Recognition (NER) for Arabic text due to the lack of resources for Arabic named entities and the limited amount of progress made in Arabic natural language processing in general. In this paper, we present the results of our attempt at the recognition and extraction of the 10 most important named entities in Arabic script: the person name, location, company, date, time, price, measurement, phone number, ISBN and file name. We developed the system, Named Entity Recognition for Arabic (NERA), using a rule-based approach. The system consists of a whitelist representing a dictionary of names, and a grammar, in the form of regular expressions, which are responsible for recognizing the named entities. NERA is evaluated using our own corpora that are tagged in a semi-automated way, and the performance results achieved were satisfactory in terms of precision, recall, and F-measure.
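For readers unfamiliar with the evaluation measures named in this abstract, the following is a minimal sketch of how precision, recall, and the balanced F-measure are commonly computed for entity extraction. The set-based exact-match scoring and the sample spans are assumptions for illustration; the abstract does not state how NERA matches were scored.

```python
def precision_recall_f(gold: set, predicted: set) -> tuple[float, float, float]:
    """Score predicted entities against a gold standard using exact matches."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical (start, end, label) spans: 3 of 4 predictions are correct,
# and 3 of the 4 gold entities are found.
gold = {(0, 3, "location"), (10, 20, "person"), (25, 30, "date"), (40, 52, "phone")}
predicted = {(0, 3, "location"), (10, 20, "person"), (25, 30, "date"), (60, 65, "price")}
p, r, f = precision_recall_f(gold, predicted)
```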

Abdel-Monem, A., K. Shaalan, A. Rafea, and H. Baraka, "Generating Arabic text in multilingual speech-to-speech machine translation framework", Machine Translation, vol. 22, no. 4, Hingham, MA, USA, Kluwer Academic Publishers, pp. 205–258, 2008.

The interlingual approach to machine translation (MT) is used successfully in multilingual translation. It aims to achieve the translation task in two independent steps. First, meanings of the source-language sentences are represented in an intermediate language-independent (Interlingua) representation. Then, sentences of the target language are generated from those meaning representations. Arabic natural language processing in general is still underdeveloped and Arabic natural language generation (NLG) is even less developed. In particular, Arabic NLG from Interlinguas was only investigated using template-based approaches. Moreover, tools used for other languages are not easily adaptable to Arabic due to the language complexity at both the morphological and syntactic levels. In this paper, we describe a rule-based generation approach for task-oriented Interlingua-based spoken dialogue that transforms a relatively shallow semantic interlingual representation, called interchange format (IF), into Arabic text that corresponds to the intentions underlying the speaker's utterances. This approach addresses the handling of the problems of Arabic syntactic structure determination, and Arabic morphological and syntactic generation within the Interlingual MT approach. The generation approach is developed primarily within the framework of the NESPOLE! (NEgotiating through SPOken Language in E-commerce) multilingual speech-to-speech MT project. The IF-to-Arabic generator is implemented in SICStus Prolog. We conducted evaluation experiments using the input and output from the English analyzer that was developed by the NESPOLE! team at Carnegie Mellon University. The results of these experiments were promising and confirmed the ability of the rule-based approach in generating Arabic translation from the Interlingua taken from the travel and tourism domain.

2007

Shaalan, K., H. Bakr, and I. Ziedan, "Transferring Egyptian Colloquial Dialect into Modern Standard Arabic", International Conference on Recent Advances in Natural Language Processing (RANLP 2007), Borovets, Bulgaria, John Benjamins, pp. 525–529, Sep., 2007.

Arabic is rooted in Classical or Qur'anic Arabic, but over the centuries, the language has developed into what is now accepted as Modern Standard Arabic (MSA). Arab colloquial dialects are generally only spoken languages, but recently the rate of colloquial written text has increased dramatically as a medium of expressing ideas especially across the WWW, usually in the form of blogs and partially colloquial articles. Most of this written colloquial text has been in the Egyptian colloquial dialect, which is considered the most widely understood and used dialect throughout the Arab world. We are able to reuse MSA processing tools with colloquial Arabic by transferring colloquial Arabic words into their corresponding MSA words. The advantages of this lexical transfer are that it facilitates communication with colloquial Arabic speakers and restores the text to the standard language in use nowadays. This paper addresses the transfer techniques between colloquial Arabic and MSA, which have not yet been closely studied. In particular, we present a rule-based lexical transfer approach for converting Egyptian colloquial words into their corresponding MSA words. This process involves morphological analysis and lexical acquisition of colloquial words.

Shaalan, K., A. A. Monem, A. Rafea, and H. Baraka, "Generating Arabic text from Interlingua", the 2nd Workshop on Computational Approaches to Arabic Script-based Languages (CAASL), Stanford, California, USA, Linguistic Society of America Summer Institute, Stanford University, pp. 137–144, Jul., 2007.

In this paper, we describe a grammar-based generation approach for task-oriented interlingua-based spoken dialogue that transforms a shallow semantic interlingua representation called Interchange Format (IF) into Arabic text that corresponds to the intentions underlying the speakers' utterances. The generation approach is developed primarily within the framework of the NESPOLE! (NEgotiating through SPOken Language in E-commerce) multilingual speech-to-speech MT project. The IF-to-Arabic generator is implemented in SICStus Prolog. We conducted an evaluation experiment using the output from the English analyzer provided by Carnegie Mellon University (CMU). The results of this experiment were promising and confirmed the ability of the generation approach to generate Arabic text from the interlingua taken from the travel and tourism domain.

Magdy, M., K. Shaalan, and A. Fahmy, "Lexical Error Diagnosis for Second Language Learners of Arabic", The Seventh Conference on Language Engineering, Egyptian Society of Language Engineering (ELSE), Cairo, Egypt, Dec., 2007.

This paper addresses the development of an automated lexical error diagnosis system, which helps Arabic second language learners to learn well-formed weak verbs. The learners are encouraged to produce input freely in various situations and contexts and guided to recognize by themselves the erroneous or inappropriate functions of their misused expressions. In this system, we successfully used constraint relaxation and edit-distance techniques to provide error-specific diagnosis and feedback to second language learners of Arabic. We demonstrated the capabilities of these techniques to diagnose errors related to the Arabic weak verb which is formed using complex morphological rules. Furthermore, the developed system allows for individualization of the learning process by providing feedback that conforms to the learner’s expertise. Inexperienced learners might require detailed instruction while experienced learners benefit from higher level reminders and explanations.
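One of the two techniques named in this abstract, edit distance, can be illustrated with a short sketch: the learner's input is compared with a list of well-formed forms and the closest one is proposed as the intended word. The tiny lexicon, the learner input, and the `diagnose` helper are hypothetical, and the constraint-relaxation component of the described system is not modeled here.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


# Hypothetical lexicon of well-formed forms of the weak (hollow) verb "to say".
WELL_FORMED = ["قال", "يقول", "قلت"]


def diagnose(learner_input: str) -> tuple[str, int]:
    """Return the closest well-formed form and its distance from the input."""
    best = min(WELL_FORMED, key=lambda form: edit_distance(learner_input, form))
    return best, edit_distance(learner_input, best)


# The learner keeps the long vowel that should drop in this form.
suggestion, distance = diagnose("قولت")
```

A small distance suggests a specific, local error (here a single extra letter), which is the kind of information error-specific feedback can be built on.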

Farouk, A., A. Rafea, and K. Shaalan, "Analysis of Spoken Arabic into Interlingua Representation using Automatic Classification Approach", 3rd International Computer Engineering Conference: Smart Applications for the Information Society, Cairo, Egypt, Cairo University, Dec., 2007.

Semantic analysis takes a sentence as input and outputs a list of prominent concepts that characterize the contents of the input sentence and, for each concept, gives the set of attributes that discuss the concept along with their relevancies. This paper presents a system that employs a machine learning approach to automate the semantic analysis of spoken Arabic into an interlingua representation. An experiment has been conducted to measure the performance of our approach. The results were promising and confirmed the ability of this approach to capture the semantics of Arabic utterances taken from the travel and tourism domain.

Shaalan, K., and E. Othman, "Issues in the Morphological Analysis of the Arabic Passive Verb", The Seventh Conference on Language Engineering, Egyptian Society of Language Engineering (ELSE), Cairo, Egypt, Ain Shams University, Dec., 2007.

Arabic is a strongly structured and highly derivational language. Arabic morphology and syntax make it possible to add a large number of affixes to each word, which leads to a combinatorial increase in the number of possible words. In Arabic, passive voice is used as a writing style when: 1) the subject is unknown, 2) the subject is not important enough to be mentioned, or 3) the author wants to highlight the object. In this paper, the issues related to the recognition of Arabic passive verbs, which impact the automated understanding of Arabic sentences, are addressed. An experiment using the Buckwalter Arabic morphological analyzer, one of the mature Arabic morphological analyzers, was conducted in order to highlight the limitations in the analysis of Arabic passive verbs. Results indicated that there is a need to handle the problems related to the morphological analysis of passive verbs in order to improve the recognition accuracy of Arabic words.

Farouk, A., A. Rafea, and K. Shaalan, "Recognizing Semantic Concepts of Spoken Arabic Utterances using Genetic Technology", The Seventh Conference on Language Engineering, Egyptian Society of Language Engineering (ELSE), Cairo, Egypt, Ain Shams University, Dec., 2007.

Genetic algorithms (GAs) are a family of computational models inspired by evolution. GAs are mainly designed to solve optimization problems, which can be thought of as searching through a large number of candidates for the best one that can be found. In this paper we present a genetic model to solve the problem of recognizing deep semantic concepts from spoken Arabic utterances. The aim of this algorithm is to automatically generate the grammar that recognizes each concept in the domain of discourse. This grammar is used to extract the observed concepts from the utterance. An experiment has been conducted to measure the performance of our approach. The results were promising and confirmed the ability of this approach to identify the concepts of Arabic utterances taken from the travel and tourism domain.

Kayed, M., C.-H. Chang, K. Shaalan, and M. R. Girgis, "FiVaTech: Page-Level Web Data Extraction from Template Pages", International Workshop on Data Mining in Web 2.0 Environments, Omaha, USA, IEEE Computer Society, pp. 15–20, 28 October, 2007.

In this paper, we propose a new approach, called FiVaTech, for the problem of Web data extraction. FiVaTech is a page-level data extraction system which deduces the data schema and templates for the input pages generated from a CGI program. FiVaTech uses tree templates to model the generation of dynamic Web pages. FiVaTech can deduce the schema and templates for each individual Deep Web site, which contains either singleton or multiple data records in one Web page. FiVaTech applies tree matching, tree alignment, and mining techniques to achieve this challenging task. The experiments show an encouraging result for the test pages used in many state-of-the-art Web data extraction works.

Shaalan, K., and H. Raza, "Person name entity recognition for Arabic", ACL 2007 Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources, Prague, Czech Republic, Association for Computational Linguistics, pp. 17–24, 28 June, 2007.

Named entity recognition (NER) is nowadays an important task, which is responsible for the identification of proper names in text and their classification as different types of named entity such as people, locations, and organizations. In this paper, we present our attempt at the recognition and extraction of the most important proper name entity, that is, the person name, for the Arabic language. We developed the system, Person Name Entity Recognition for Arabic (PERA), using a rule-based approach. The system consists of a lexicon, in the form of gazetteer name lists, and a grammar, in the form of regular expressions, which are responsible for recognizing person name entities. The PERA system is evaluated using a corpus that is tagged in a semi-automated way. The system performance results achieved were satisfactory and conform to the targets set forth for precision, recall, and F-measure.

Shaalan, K., A. Abdel-Monem, and A. Rafea, "Arabic Morphological Generation from Interlingua: A Rule-based Approach", Intelligent Information Processing III, vol. 228: Springer US, pp. 441-451, 2007.

Arabic is a Semitic language that is rich in its morphology. Arabic has numerous and complex morphological rules. Arabic morphological analysis has long been the focus of Arabic natural language processing research in order to achieve the automated understanding of Arabic. With recent technological advances, Arabic natural language generation has received attention, making room for wider applications such as machine translation. For machine translation systems that support a large number of languages, interlingua-based machine translation approaches are particularly attractive. In this paper, we report our attempt at developing a rule-based Arabic morphological generator for task-oriented interlingua-based spoken dialogues. Examples of morphological generation results from the Arabic morphological generator will be given and will illustrate how the system works. We also discuss the issues related to the morphological generation of Arabic words from an interlingua representation, and present how we have handled them.

Shaalan, K., H. Talhami, and I. Kamel, "Automatic Morphological Generation for the Indexing of Arabic Speech Recordings", The International Journal of Computer Processing of Oriental Languages (IJCPOL), vol. 20, no. 1, pp. 1–14, 2007.

This paper presents a novel Arabic morphological generator (AMG) for Modern Standard Arabic (MSA) which is designed and implemented using Prolog. The AMG is used to generate inflected forms of words used for the indexing of Arabic audio. These words are also the relevant terms in the Arab authority system (library information retrieval system) used in this study. The AMG generates inflected Arabic words from the root according to pre-specified morphological features that can be extended as needed. The Arabic word is represented as a feature structure which is handled through unification during the morphological generation process. The inflected forms can then be inserted automatically into a speech recognition grammar which is used to identify these words in an audio sequence or utterance.
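The idea of representing a word as a feature structure and combining information through unification can be sketched as below. The AMG described in this abstract is written in Prolog, where unification is built in; this Python version is only an illustration, and the feature names and values (`root`, `agr`, `tense`, and so on) are hypothetical rather than taken from the paper.

```python
def unify(a, b):
    """Unify two feature structures (nested dicts with atomic leaf values).

    Returns the merged structure, or None when the two structures carry
    conflicting values for the same feature. None is reserved for failure,
    so it must not be used as a feature value.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        merged = dict(a)
        for feature, value in b.items():
            if feature in merged:
                sub = unify(merged[feature], value)
                if sub is None:
                    return None  # conflicting values: unification fails
                merged[feature] = sub
            else:
                merged[feature] = value
        return merged
    return a if a == b else None


# Hypothetical lexical entry and requested morphological features.
entry = {"root": "ktb", "cat": "verb", "agr": {"number": "sg"}}
request = {"tense": "past", "agr": {"gender": "masc"}}
combined = unify(entry, request)

# A request that contradicts the entry cannot be unified with it.
clash = unify(entry, {"agr": {"number": "pl"}})
```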

2006

Shaalan, K., A. Abdel-Monem, A. Rafea, and H. Baraka, "Mapping Interlingua Representations to Feature Structures of Arabic Sentences", The Challenge of Arabic for NLP/MT International Conference, the British Computer Society, London, UK, British Computer Society (BCS), pp. 149–159, Oct., 2006.

The interlingua approach to Machine Translation (MT) aims to achieve the translation task in two independent steps. First, the meanings of source language sentences are represented in an intermediate (interlingua) representation. Then, sentences of the target language are generated from those meaning representations. In the generation of the target sentence, determining sentence structures becomes more difficult, especially when the interlingua does not contain any syntactic information. Hence, the sentence structures cannot be transferred exactly from the interlingua representations. In this paper, we present a mapping approach for task-oriented interlingua-based spoken dialogue that transforms an interlingua representation, the so-called Interchange Format (IF), into a feature structure (FS) that reflects the syntactic structure of the target Arabic sentence. This approach addresses the handling of the problem of Arabic syntactic structure determination in the interlingua approach. A mapper is developed primarily within the framework of the NESPOLE! (NEgotiating through SPOken Language in E-commerce) multilingual speech-to-speech MT project. The IF-to-Arabic FS mapper is implemented in SICStus Prolog. Examples of Arabic syntactic mapping, using the output from the English analyzer provided by Carnegie Mellon University (CMU), will illustrate how the system works.

Chang, C.-H., M. Kayed, M. Girgis, and K. Shaalan, "A Survey of Web Information Extraction Systems", IEEE Trans. on Knowl. and Data Eng., vol. 18, no. 10, Piscataway, NJ, USA, IEEE Educational Activities Department, pp. 1411–1428, Oct., 2006.

The Internet presents a huge amount of useful information which is usually formatted for its users, which makes it difficult to extract relevant data from various sources. Therefore, the availability of robust, flexible Information Extraction (IE) systems that transform the Web pages into program-friendly structures such as a relational database will become a great necessity. Although many approaches for data extraction from Web pages have been developed, there has been limited effort to compare such tools. Unfortunately, in only a few cases can the results generated by distinct tools be directly compared since the addressed extraction tasks are different. This paper surveys the major Web data extraction approaches and compares them in three dimensions: the task domain, the techniques used, and the automation degree. The criteria of the first dimension explain why an IE system fails to handle some Web sites of particular structures. The criteria of the second dimension classify IE systems based on the techniques used. The criteria of the third dimension measure the degree of automation for IE systems. We believe these criteria provide qualitative measures to evaluate various IE approaches.

Khaled Shaalan

Professor of Computer Science
