Arabic

Shaalan, K., and H. Raza, "Person name entity recognition for Arabic", ACL 2007 Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources, Prague, Czech Republic, Association for Computational Linguistics, pp. 17–24, 28 June, 2007.

Named entity recognition (NER) is an important task that identifies proper names in text and classifies them into named entity types such as people, locations, and organizations. In this paper, we present our attempt at recognizing and extracting the most important proper name entity, the person name, for the Arabic language. We developed the system, Person Name Entity Recognition for Arabic (PERA), using a rule-based approach. The system consists of a lexicon, in the form of gazetteer name lists, and a grammar, in the form of regular expressions, which are responsible for recognizing person name entities. The PERA system is evaluated using a corpus that is tagged in a semi-automated way. The performance results were satisfactory and conform to the targets set for precision, recall, and F-measure.

Farouk, A., A. Rafea, and K. Shaalan, "Recognizing Semantic Concepts of Spoken Arabic Utterances using Genetic Technology", The Seventh Conference on Language Engineering, Egyptian Society of Language Engineering (ESLE), Cairo, Egypt, Ain Shams University, December, 2007.

Genetic algorithms (GAs) are a family of computational models inspired by evolution. GAs are mainly designed to solve optimization problems, which can be thought of as searching through a large number of candidates for the best one that can be found. In this paper, we present a genetic model to solve the problem of recognizing deep semantic concepts from spoken Arabic utterances. The aim of this algorithm is to automatically generate the grammar that recognizes each concept in the domain of discourse. This grammar is used to extract the observed concepts from the utterance. An experiment has been conducted to measure the performance of our approach. The results were promising and confirmed the ability of this approach to identify the concepts of Arabic utterances taken from the travel and tourism domain.

Abo-Bakr, H., K. Shaalan, and I. Ziedan, "A Statistical Method for Adding Case Diacritics for Arabic Text", Language Engineering Conference: Ain Shams University, pp. 225–234, December, 2008.

In this paper, we address the issue of adding case ending diacritics to undiacritized Arabic text using statistical methods. The approach requires a large corpus of fully diacritized text for extracting the case endings. Training to detect the case ending diacritic of each token is based on its part of speech (POS), its BP-chunk position, and the position of the token in the sentence. The case ending diacritics are then efficiently obtained using the SVM technique. We present an evaluation of the proposed diacritization algorithm and discuss various modifications for improving the performance of this approach.

Hossny, A., K. Shaalan, and A. Fahmy, "Automatic Morphological Rule Induction for Arabic", The sixth international conference on Language Resources and Evaluation (LREC'08) workshop on HLT & NLP within the Arabic world: Arabic Language and local languages processing: Status Updates and Prospects, Marrakech, Morocco, LREC, pp. 97–101, May, 2008.

In this paper, we introduce an algorithm for morphological rule induction using meta-rules for Arabic morphology based on inductive logic programming. The processing resources are a set of example pairs (stem and inflected form) with their feature vectors, either positive or negative, and the linguistic background knowledge from the Arabic morphological analysis domain. Each example pair has two words to be analyzed vocally into consonants and vowels. The algorithm applies two levels of mapping: between the vocal representations of the two words (stem, morphed) and between their feature vectors. It differentiates between the two mappings in order to accurately deduce which changes in the word structure led to which changes in its features. The paper also addresses the irregularity, productivity, and model consistency issues. We have developed an Arabic morphological rule induction system (AMRIS). An evaluation showed that the system's performance was satisfactory.

Abo-Bakr, H., K. Shaalan, and I. Ziedan, "A Hybrid Approach for Converting Written Egyptian Colloquial Dialect into Diacritized Arabic", The 6th International Conference on Informatics and Systems (INFOS2008), Cairo, Egypt, Faculty of Computers and Information, March, 2008.

Recently, the amount of written colloquial text has increased dramatically. It is being used as a medium for expressing ideas, especially across the WWW, usually in the form of blogs and partially colloquial articles. Most of this written colloquial text is in the Egyptian colloquial dialect, which is considered the most widely understood and used dialect throughout the Arab world. Modern Standard Arabic is the official Arabic language taught and understood all over the Arab world. Diacritics play a key role in disambiguating Arabic text. The reader is expected to infer or predict vowels from the context of the sentence. Inferring the full form of the Arabic word is also useful when developing Arabic natural language processing tools and applications. In this paper, we introduce a generic method for converting a written Egyptian colloquial sentence into its corresponding diacritized Modern Standard Arabic sentence, which could easily be extended to other dialects of Arabic. In spite of the non-availability of linguistic Arabic resources for this task, we have developed techniques for lexical acquisition of colloquial words, which are used for transferring written Egyptian Arabic into Modern Standard Arabic. We successfully used a Support Vector Machine approach for the diacritization (also known as vocalization or vowelling) of Arabic text.

Abdel-Monem, A., K. Shaalan, A. Rafea, and H. Baraka, "Generating Arabic text in multilingual speech-to-speech machine translation framework", Machine Translation, vol. 22, no. 4, Hingham, MA, USA, Kluwer Academic Publishers, pp. 205–258, 2008.

The interlingual approach to machine translation (MT) is used successfully in multilingual translation. It aims to achieve the translation task in two independent steps. First, meanings of the source-language sentences are represented in an intermediate language-independent (Interlingua) representation. Then, sentences of the target language are generated from those meaning representations. Arabic natural language processing in general is still underdeveloped and Arabic natural language generation (NLG) is even less developed. In particular, Arabic NLG from Interlinguas was only investigated using template-based approaches. Moreover, tools used for other languages are not easily adaptable to Arabic due to the language complexity at both the morphological and syntactic levels. In this paper, we describe a rule-based generation approach for task-oriented Interlingua-based spoken dialogue that transforms a relatively shallow semantic interlingual representation, called interchange format (IF), into Arabic text that corresponds to the intentions underlying the speaker's utterances. This approach addresses the problems of Arabic syntactic structure determination, and Arabic morphological and syntactic generation within the Interlingual MT approach. The generation approach is developed primarily within the framework of the NESPOLE! (NEgotiating through SPOken Language in E-commerce) multilingual speech-to-speech MT project. The IF-to-Arabic generator is implemented in SICStus Prolog. We conducted evaluation experiments using the input and output from the English analyzer that was developed by the NESPOLE! team at Carnegie Mellon University. The results of these experiments were promising and confirmed the ability of the rule-based approach to generate Arabic translation from the Interlingua taken from the travel and tourism domain.

Abo-Bakr, H., K. Shaalan, and I. Ziedan, "A Statistical Method for Detecting the Arabic Empty Category", The Second International Conference on Arabic Language Resources and Tools, Cairo, Egypt, The MEDAR Consortium, 22 April, 2009.

In this paper, we introduce a statistical approach for detecting the position of the Empty-Category as presented in the Arabic Treebank. This can help in detecting the position of the elliptic personal pronoun and, in some cases, in identifying dropped words within a sentence, given the free word order of Arabic. The proposed approach requires a large corpus. The training for detecting the Empty-Category for each token is based on its Part Of Speech (POS), Base Phrase (BP)-chunk position, and the position of the token in the sentence. The Empty-Category detection is efficiently obtained using the Support Vector Machines (SVM) technique. We conducted an evaluation of the proposed algorithm, discussed the obtained results, and proposed various modifications for improving the performance of this approach.

Hossny, A., K. Shaalan, and A. Fahmy, "Machine translation model using inductive logic programming", The 2009 IEEE International Conference on Natural Language Processing and Knowledge Engineering (IEEE NLP-KE’09), Dalian, China, pp. 1–8, September, 2009.

Rule-based machine translation systems face different challenges in building the translation model in the form of transfer rules. Among them is the enormous human effort required to state rules and to keep them consistent, since different human linguists write different rules for the same sentence. Moreover, a human linguist states rules to be understood by humans rather than machines. The proposed translation model (from Arabic to English) tackles this problem of building the translation model. The model employs Inductive Logic Programming (ILP) to learn the language model from a set of example pairs acquired from parallel corpora and represents the language model in a rule-based format that maps Arabic sentence patterns to English sentence patterns. When tested on a small set of data, the model generated translation rules with a logarithmic growth rate and a word error rate of 11%.

Shaalan, K., and A. Farghaly, "Introduction to the Special Issue on Arabic Natural Language Processing", ACM Transactions on Asian Language Information Processing (TALIP), vol. 8, no. 4, New York, NY, USA, ACM, pp. 1–3, 2009.

Farghaly, A., and K. Shaalan, "Arabic Natural Language Processing: Challenges and Solutions", ACM Transactions on Asian Language Information Processing (TALIP), vol. 8, no. 4, New York, NY, USA, ACM, pp. 1–22, 2009.

The Arabic language presents researchers and developers of natural language processing (NLP) applications for Arabic text and speech with serious challenges. The purpose of this article is to describe some of these challenges and to present some solutions that would guide current and future practitioners in the field of Arabic natural language processing (ANLP). We begin with general features of the Arabic language in Sections 1, 2, and 3 and then move to more specific properties of the language in the rest of the article. In Section 1 of this article, we highlight the significance of the Arabic language today and describe its general properties. Section 2 presents the feature of Arabic Diglossia, showing how the sociolinguistic aspects of the Arabic language differ from other languages. The stability of Arabic Diglossia and its implications for ANLP applications are discussed, and ways to deal with this problematic property are proposed. Section 3 deals with the properties of the Arabic script and the explosion of ambiguity that results from the absence of short vowel representations and overt case markers in contemporary Arabic texts. We present in Section 4 specific features of the Arabic language, such as the nonconcatenative property of Arabic morphology, Arabic as an agglutinative language, Arabic as a pro-drop language, and the challenge these properties pose to ANLP. We also present solutions that have already been adopted by some pioneering researchers in the field. In Section 5, we point out the lack of formal and explicit grammars of Modern Standard Arabic, which impedes the progress of more advanced ANLP systems. In Section 6, we draw our conclusion.

Khaled Shaalan

Professor of Computer Science
