Natural Language Processing
Shaalan, K., A. Abdel-Monem, and A. Rafea,
"Syntactic Generation of Arabic in Interlingua-based Machine Translation Framework",
Third workshop on Computational Approaches to Arabic Script-based Languages (CAASL3), Machine Translation Summit XII: ACL, 2009.
Abstract: Arabic is a highly inflectional language, with a rich morphology, relatively free word order, and two types of sentences: nominal and verbal. Arabic natural language processing in general is still underdeveloped and Arabic natural language generation (NLG) is even less developed. In particular, Arabic natural language generation from Interlingua has only been investigated using template-based approaches. Moreover, tools used for other languages are not easily adaptable to Arabic because of the complexity of the language at both the morphological and syntactic levels. In this paper, we report our attempt at developing a rule-based Arabic generator for task-oriented interlingua-based spoken dialogues. Examples of syntactic generation results from the Arabic generator are given to illustrate how the system works. Our proposed syntactic generator has been evaluated using real test data and achieved satisfactory results.
Farghaly, A., and K. Shaalan,
"Arabic Natural Language Processing: Challenges and Solutions",
ACM Transactions on Asian Language Information Processing (TALIP), vol. 8, no. 4, New York, NY, USA, ACM, pp. 1–22, 2009.
Abstract: The Arabic language presents researchers and developers of natural language processing (NLP) applications for Arabic text and speech with serious challenges. The purpose of this article is to describe some of these challenges and to present some solutions that would guide current and future practitioners in the field of Arabic natural language processing (ANLP). We begin with general features of the Arabic language in Sections 1, 2, and 3 and then move to more specific properties of the language in the rest of the article. In Section 1 we highlight the significance of the Arabic language today and describe its general properties. Section 2 presents the feature of Arabic diglossia, showing how the sociolinguistic aspects of the Arabic language differ from those of other languages. The stability of Arabic diglossia and its implications for ANLP applications are discussed, and ways to deal with this problematic property are proposed. Section 3 deals with the properties of the Arabic script and the explosion of ambiguity that results from the absence of short vowel representations and overt case markers in contemporary Arabic texts. In Section 4 we present specific features of the Arabic language, such as the nonconcatenative property of Arabic morphology, Arabic as an agglutinative language, and Arabic as a pro-drop language, and the challenges these properties pose to ANLP. We also present solutions that have already been adopted by some pioneering researchers in the field. In Section 5 we point out the lack of formal and explicit grammars of Modern Standard Arabic, which impedes the progress of more advanced ANLP systems. In Section 6 we draw our conclusions.
Shaalan, K., and A. Farghaly,
"Introduction to the Special Issue on Arabic Natural Language Processing",
ACM Transactions on Asian Language Information Processing (TALIP), vol. 8, no. 4, New York, NY, USA, ACM, pp. 1–3, 2009.
Hossny, A., K. Shaalan, and A. Fahmy,
"Machine translation model using inductive logic programming",
The 2009 IEEE International Conference on Natural Language Processing and Knowledge Engineering (IEEE NLP-KE’09), Dalian, China, pp. 1–8, September 2009.
Abstract: Rule-based machine translation systems face several challenges in building the translation model in the form of transfer rules. Stating the rules and keeping them consistent requires enormous human effort: different human linguists write different rules for the same sentence, and a human linguist states rules to be understood by humans rather than by machines. The proposed translation model (from Arabic to English) tackles this problem of building the translation model. It employs Inductive Logic Programming (ILP) to learn the language model from a set of example pairs acquired from parallel corpora, and represents the language model in a rule-based format that maps an Arabic sentence pattern to an English sentence pattern. When tested on a small set of data, the model generated translation rules at a logarithmic growth rate and with a word error rate of 11%.
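Word error rate, the metric quoted in this abstract, is conventionally the word-level edit distance between a system output and a reference translation, divided by the reference length. The sketch below assumes that standard definition; the abstract does not say how the authors scored their output, and the function name `word_error_rate` is illustrative only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: whitespace tokenization and unit cost for each
    substitution, insertion, and deletion.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

For example, a hypothesis that drops one word from a six-word reference scores 1/6, or roughly 17%.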
Abo-Bakr, H., K. Shaalan, and I. Ziedan,
"A Statistical Method for Detecting the Arabic Empty Category",
The Second International Conference on Arabic Language Resources and Tools, Cairo, Egypt, The MEDAR Consortium, 22 April, 2009.
Abstract: In this paper we introduce a statistical approach for detecting the position of the Empty-Category presented in the Arabic Treebank. This can help in detecting the position of the elliptic personal pronoun and, in some cases, in identifying dropped words within a sentence given the free word order nature of Arabic. The proposed approach requires a large corpus. The training for detecting the Empty-Category for each token is based on its Part Of Speech (POS), Base Phrase (BP)-chunk position, and the position of the token in the sentence. The Empty-Category detection is efficiently obtained using the Support Vector Machines (SVM) technique. We conducted an evaluation of the proposed diacritization algorithm, discussed the obtained results, and proposed various modifications for improving the performance of this approach.
Abo-Bakr, H., K. Shaalan, and I. Ziedan,
"A Hybrid Approach for Converting Written Egyptian Colloquial Dialect into Diacritized Arabic",
The 6th International Conference on Informatics and Systems (INFOS2008), Cairo, Egypt, Faculty of Computers and Information, March 2008.
Abstract: Recently the amount of written colloquial text has increased dramatically. It is being used as a medium for expressing ideas, especially across the WWW, usually in the form of blogs and partially colloquial articles. Most of this written colloquial text has been in the Egyptian colloquial dialect, which is considered the most widely understood and used dialect throughout the Arab world. Modern Standard Arabic is the official Arabic language taught and understood all over the Arab world. Diacritics play a key role in disambiguating Arabic text. The reader is expected to infer or predict vowels from the context of the sentence. Inferring the full form of the Arabic word is also useful when developing Arabic natural language processing tools and applications. In this paper, we introduce a generic method for converting a written Egyptian colloquial sentence into its corresponding diacritized Modern Standard Arabic sentence, which could easily be extended to other dialects of Arabic. Despite the lack of Arabic linguistic resources for this task, we have developed techniques for lexical acquisition of colloquial words, which are used for transferring written Egyptian Arabic into Modern Standard Arabic. We successfully used a Support Vector Machine approach for the diacritization (also known as vocalization or vowelling) of Arabic text.
Hossny, A., K. Shaalan, and A. Fahmy,
"Automatic Morphological Rule Induction for Arabic",
The Sixth International Conference on Language Resources and Evaluation (LREC'08), Workshop on HLT & NLP within the Arabic World: Arabic Language and Local Languages Processing: Status Updates and Prospects, Marrakech, Morocco, LREC, pp. 97–101, May 2008.
Abstract: In this paper, we introduce an algorithm for morphological rule induction using meta-rules for Arabic morphology based on inductive logic programming. The processing resources are a set of example pairs (stem and inflected form) with their feature vectors, either positive or negative, and the linguistic background knowledge from the Arabic morphological analysis domain. Each example pair has two words to be analyzed vocally into consonants and vowels. The algorithm applies two levels of mapping: between the vocal representations of the two words (stem, morphed) and between their feature vectors. It differentiates between the two mappings in order to accurately deduce which changes in the word structure led to which changes in its features. The paper also addresses the irregularity, productivity, and model consistency issues. We have developed an Arabic morphological rule induction system (AMRIS). An evaluation was performed and showed that the system's performance was satisfactory.
Abo-Bakr, H., K. Shaalan, and I. Ziedan,
"A Statistical Method for Adding Case Diacritics for Arabic Text",
Language Engineering Conference: Ain Shams University, pp. 225–234, December 2008.
Abstract: In this paper, the issue of adding case ending diacritics to undiacritized Arabic text using statistical methods is addressed. The approach requires a large corpus of fully diacritized text for extracting the case endings. The training for detecting the case ending diacritics of each token is based on its Part Of Speech (POS) and BP-chunk position as well as the position of the token in the sentence. The case ending diacritics are then efficiently obtained using the SVM technique. We present an evaluation of the proposed diacritization algorithm and discuss various modifications for improving the performance of this approach.
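The per-token features named in this abstract (POS tag, BP-chunk position, and position in the sentence) could be assembled along the following lines before being handed to an SVM trainer. This is a sketch under assumptions, not the authors' feature set: the `Token` layout, the tag strings, the neighbouring-token context window, and the `<PAD>` boundary marker are all hypothetical.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical token layout: (surface form, POS tag, BP-chunk tag).
Token = Tuple[str, str, str]
PAD = "<PAD>"  # assumed marker for positions outside the sentence


def token_features(
    sentence: List[Token], index: int, window: int = 1
) -> Dict[str, Union[str, int]]:
    """Build a feature dictionary for the token at `index`.

    The context window over neighbouring tokens is an illustrative
    assumption; the abstract only names POS, chunk, and position.
    """
    _, pos, chunk = sentence[index]
    features: Dict[str, Union[str, int]] = {
        "pos": pos,
        "chunk": chunk,
        "position": index,
    }
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        neighbour = index + offset
        if 0 <= neighbour < len(sentence):
            _, n_pos, n_chunk = sentence[neighbour]
        else:
            n_pos, n_chunk = PAD, PAD
        features[f"pos[{offset:+d}]"] = n_pos
        features[f"chunk[{offset:+d}]"] = n_chunk
    return features
```

Dictionaries of this shape would still need to be turned into numeric vectors (for example, by one-hot encoding the categorical values) before any SVM library could consume them.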
Shaalan, K., and H. Raza,
"Arabic Named Entity Recognition from Diverse Text Types",
Advances in Natural Language Processing, vol. 5221: Springer Berlin Heidelberg, pp. 440–451, 2008.
Abstract: Name identification has been worked on quite intensively for the past few years, and has been incorporated into several products. Many researchers have attacked this problem in a variety of languages, but only a few studies have focused on Named Entity Recognition (NER) for Arabic text, owing to the lack of resources for Arabic named entities and the limited progress made in Arabic natural language processing in general. In this paper, we present the results of our attempt at the recognition and extraction of the 10 most important named entities in Arabic script: person name, location, company, date, time, price, measurement, phone number, ISBN, and file name. We developed the system, Name Entity Recognition for Arabic (NERA), using a rule-based approach. The system consists of a whitelist representing a dictionary of names, and a grammar, in the form of regular expressions, which is responsible for recognizing the named entities. NERA is evaluated using our own corpora, which are tagged in a semi-automated way, and the performance results achieved were satisfactory in terms of precision, recall, and f-measure.
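The whitelist-plus-regular-expression design described in this abstract can be illustrated with a toy recognizer. The sketch below is not NERA: the whitelist entries, the two patterns, and the function name `recognize` are hypothetical, and Latin-script examples are used only for readability.

```python
import re
from typing import Dict, List, Pattern, Set, Tuple

# Hypothetical whitelist (a dictionary of known names per entity type).
WHITELIST: Dict[str, Set[str]] = {
    "person": {"Ahmed Ali"},
    "location": {"Cairo", "Marrakech"},
}

# Hypothetical grammar: one regular expression per pattern-based entity type.
GRAMMAR: Dict[str, Pattern[str]] = {
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "isbn": re.compile(r"\bISBN[: ]+(?:\d[- ]?){9}[\dX]\b"),
}


def recognize(text: str) -> List[Tuple[str, str]]:
    """Return (entity type, surface string) pairs in order of appearance.

    Overlapping matches are not resolved; a fuller system would need a
    policy for that.
    """
    found: List[Tuple[int, str, str]] = []
    # Whitelist lookup: exact, word-bounded matches of known names.
    for entity_type, names in WHITELIST.items():
        for name in names:
            for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
                found.append((match.start(), entity_type, match.group()))
    # Grammar lookup: pattern-based entities such as dates and ISBNs.
    for entity_type, pattern in GRAMMAR.items():
        for match in pattern.finditer(text):
            found.append((match.start(), entity_type, match.group()))
    found.sort()
    return [(entity_type, surface) for _, entity_type, surface in found]
```

Calling `recognize("Ahmed Ali flew to Cairo on 22/04/2009.")` yields the person, the location, and the date in that order. Splitting the rules this way keeps the name dictionary editable without touching the patterns.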
Abdel-Monem, A., K. Shaalan, A. Rafea, and H. Baraka,
"Generating Arabic text in multilingual speech-to-speech machine translation framework",
Machine Translation, vol. 22, no. 4, Hingham, MA, USA, Kluwer Academic Publishers, pp. 205–258, 2008.
Abstract: The interlingual approach to machine translation (MT) is used successfully in multilingual translation. It aims to achieve the translation task in two independent steps. First, meanings of the source-language sentences are represented in an intermediate language-independent (Interlingua) representation. Then, sentences of the target language are generated from those meaning representations. Arabic natural language processing in general is still underdeveloped and Arabic natural language generation (NLG) is even less developed. In particular, Arabic NLG from Interlinguas has only been investigated using template-based approaches. Moreover, tools used for other languages are not easily adaptable to Arabic because of the complexity of the language at both the morphological and syntactic levels. In this paper, we describe a rule-based generation approach for task-oriented Interlingua-based spoken dialogue that transforms a relatively shallow semantic interlingual representation, called interchange format (IF), into Arabic text that corresponds to the intentions underlying the speaker's utterances. This approach addresses the problems of Arabic syntactic structure determination, and of Arabic morphological and syntactic generation, within the Interlingual MT approach. The generation approach is developed primarily within the framework of the NESPOLE! (NEgotiating through SPOken Language in E-commerce) multilingual speech-to-speech MT project. The IF-to-Arabic generator is implemented in SICStus Prolog. We conducted evaluation experiments using the input and output from the English analyzer that was developed by the NESPOLE! team at Carnegie Mellon University. The results of these experiments were promising and confirmed the ability of the rule-based approach to generate Arabic translations from the Interlingua in the travel and tourism domain.