Arabic

Shaalan, K., A. Abdel-Monem, and A. Rafea, "Syntactic Generation of Arabic in Interlingua-based Machine Translation Framework", Third workshop on Computational Approaches to Arabic Script-based Languages (CAASL3), Machine Translation Summit XII: ACL, 2009.

Arabic is a highly inflectional language, with a rich morphology, relatively free word order, and two types of sentences: nominal and verbal. Arabic natural language processing in general is still underdeveloped and Arabic natural language generation (NLG) is even less developed. In particular, Arabic natural language generation from Interlingua was only investigated using template-based approaches. Moreover, tools used for other languages are not easily adaptable to Arabic due to the Arabic language complexity at both the morphological and syntactic levels. In this paper, we report our attempt at developing a rule-based Arabic generator for task-oriented interlingua-based spoken dialogues. Examples of syntactic generation results from the Arabic generator will be given and will illustrate how the system works. Our proposed syntactic generator has been effectively evaluated using real test data and achieved satisfactory results.

Shaalan, K., and H. Raza, "NERA: Named Entity Recognition for Arabic", J. Am. Soc. Inf. Sci. Technol., vol. 60, no. 8, New York, NY, USA, John Wiley & Sons, Inc., pp. 1652–1663, 2009.

Name identification has been worked on quite intensively for the past few years, and has been incorporated into several products revolving around natural language processing tasks. Many researchers have attacked the name identification problem in a variety of languages, but only a few limited research efforts have focused on named entity recognition for Arabic script. This is due to the lack of resources for Arabic named entities and the limited amount of progress made in Arabic natural language processing in general. In this article, we present the results of our attempt at the recognition and extraction of the 10 most important categories of named entities in Arabic script: the person name, location, company, date, time, price, measurement, phone number, ISBN, and file name. We developed the system Named Entity Recognition for Arabic (NERA) using a rule-based approach. The resources created are a Whitelist, representing a dictionary of names, and a grammar, in the form of regular expressions, which is responsible for recognizing the named entities. A filtration mechanism is used that serves two different purposes: (a) revision of the results from a named entity extractor by using metadata, in terms of a Blacklist or rejecter, about ill-formed named entities and (b) disambiguation of identical or overlapping textual matches returned by different named entity extractors to get the correct choice. In NERA, we addressed major challenges posed by NER in the Arabic language arising due to the complexity of the language, peculiarities in the Arabic orthographic system, non-standardization of the written text, ambiguity, and lack of resources. NERA has been effectively evaluated using our own tagged corpus; it achieved satisfactory results in terms of precision, recall, and F-measure.
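
The pipeline outlined in this abstract (whitelist lookup, a regular-expression grammar, blacklist rejection, and disambiguation of overlapping matches) can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the word lists, the date pattern, and the longest-match rule for resolving overlaps are assumptions made for illustration, not NERA's actual resources or rules.

```python
import re

# Hypothetical toy resources; NERA's real whitelist, grammar, and
# blacklist are not given in the abstract.
WHITELIST = {"القاهرة": "LOCATION", "مصر": "LOCATION", "بنك مصر": "COMPANY"}
BLACKLIST = {"2009/13/45"}                       # ill-formed matches to reject
DATE_RE = re.compile(r"\d{4}/\d{1,2}/\d{1,2}")   # assumed date pattern


def extract(text):
    """Return (surface form, label) pairs in order of appearance."""
    candidates = []

    # Extractor 1: dictionary (whitelist) lookup.
    for name, label in WHITELIST.items():
        for m in re.finditer(re.escape(name), text):
            candidates.append((m.start(), m.end(), label))

    # Extractor 2: regular-expression grammar (dates only in this sketch).
    for m in DATE_RE.finditer(text):
        candidates.append((m.start(), m.end(), "DATE"))

    # Filter (a): drop matches listed in the blacklist.
    candidates = [c for c in candidates if text[c[0]:c[1]] not in BLACKLIST]

    # Filter (b): resolve overlaps by preferring the longest match
    # (an assumed policy, chosen here only for simplicity).
    candidates.sort(key=lambda c: (-(c[1] - c[0]), c[0]))
    chosen = []
    for start, end, label in candidates:
        if all(end <= s or start >= e for s, e, _ in chosen):
            chosen.append((start, end, label))

    return [(text[s:e], label) for s, e, label in sorted(chosen)]
```

In this sketch, the input "بنك مصر" yields a single COMPANY entity, because the shorter LOCATION match "مصر" overlaps the longer company name and is discarded.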

Shaalan, K., H. Abo-Bakr, and I. Ziedan, "A hybrid approach for building Arabic diacritizer", the 12th European Chapter of the Association for Computational Linguistics (EACL 2009) Workshop on Computational Approaches to Semitic Languages, Athens, Greece, Association for Computational Linguistics, pp. 27–35, March, 2009.

Modern Standard Arabic is usually written without diacritics, which makes Arabic text processing difficult. Diacritization helps clarify the meaning of words and disambiguate any vague spellings or pronunciations, as some Arabic words are spelled the same but differ in meaning. In this paper, we address the issue of adding diacritics to undiacritized Arabic text using a hybrid approach. The approach requires an Arabic lexicon and a large corpus of fully diacritized text for training purposes in order to detect diacritics. Case ending is treated as a separate post-processing task using syntactic information. The hybrid approach relies on lexicon retrieval, bigram, and SVM-statistical prioritized techniques. We present results of an evaluation of the proposed diacritization approach and discuss various modifications for improving the performance of this approach.
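
One way to read the "prioritized" combination of techniques is as a fallback chain: use the lexicon when it gives a single answer, and consult bigram statistics only when the lexicon is ambiguous. The Python sketch below illustrates that reading under stated assumptions. The lexicon entries, the bigram counts, and the Latin transliteration of the words are toy data invented for the example, and the SVM stage and case-ending post-processing mentioned in the abstract are omitted.

```python
# Toy data in Latin transliteration; both tables are hypothetical.
LEXICON = {
    "Alwld": ["Alwaladu"],          # unambiguous entry
    "ktb": ["kataba", "kutubN"],    # ambiguous: verb vs. noun reading
}
BIGRAM_COUNTS = {
    ("Alwaladu", "kataba"): 5,
    ("Alwaladu", "kutubN"): 1,
}


def diacritize(words):
    """Diacritize a list of undiacritized words, left to right."""
    output = []
    for word in words:
        options = LEXICON.get(word, [])
        if len(options) == 1:
            # Priority 1: the lexicon alone decides.
            choice = options[0]
        elif options:
            # Priority 2: pick the option seen most often after the
            # previously chosen word; ties keep lexicon order.
            prev = output[-1] if output else None
            choice = max(options, key=lambda c: BIGRAM_COUNTS.get((prev, c), 0))
        else:
            # Unknown word: leave it undiacritized in this sketch.
            choice = word
        output.append(choice)
    return output
```

A statistical classifier such as the SVM named in the abstract would slot in as a further stage for the cases the first two stages cannot settle.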

Shaalan, K. F., M. Magdy, and A. Fahmy, "Morphological Analysis of Ill-Formed Arabic Verbs in Intelligent Language Tutoring Framework", The 23rd International Florida Artificial Intelligence Research Society Conference (FLAIRS-23), Florida, USA, FLAIRS, pp. 277–282, May, 2010.

Arabic is a language of rich and complex morphology. The nature and peculiarity of Arabic make its morphological and phonological rules confusing for second language learners (SLLs). The conjugation of Arabic verbs is central to the formulation of an Arabic sentence because of its richness of form and meaning. In this paper, we address issues related to the morphological analysis of ill-formed Arabic verbs in order to identify the source of errors and provide informative feedback to SLLs of Arabic. The edit distance and constraint relaxation techniques are used to demonstrate the capability of the proposed approach in generating all possible analyses of erroneous Arabic verbs written by SLLs. Filtering mechanisms are applied to exclude the irrelevant constructions and determine the target stem. A morphological analyzer has been developed and effectively evaluated using real test data. It achieved satisfactory results in terms of the recall rate.

Shaalan, K., "Rule-based Approach in Arabic Natural Language Processing", the International Journal on Information and Communication Technologies (IJICT), vol. 3, no. 3: Serial Publications, pp. 11–19, 2010.

The rule-based approach has successfully been used in developing many natural language processing systems. Systems that use rule-based transformations are based on a core of solid linguistic knowledge. The linguistic knowledge acquired for one natural language processing system may be reused to build knowledge required for a similar task in another system. The advantage of the rule-based approach over the corpus-based approach is clear for: 1) less-resourced languages, for which large corpora, possibly parallel or bilingual, with representative structures and entities are neither available nor easily affordable, and 2) morphologically rich languages, which even with the availability of corpora suffer from data sparseness. These advantages have motivated many researchers to fully or partially follow the rule-based approach in developing their Arabic natural language processing tools and systems. In this paper we describe our successful efforts that involved the rule-based approach for different Arabic natural language processing tasks.

Shaalan, K., M. Magdy, and D. Samy, "Towards Resolving Morphological Ambiguity in Arabic Intelligent Language Tutoring Framework", The seventh international conference on Language Resources and Evaluation (LREC'10) Workshop on Supporting eLearning with Language Resources and Semantic Data, Valletta, Malta, LREC, 2010.

Ambiguity, which occurs when multiple interpretations of the same language phenomenon are produced, is a major issue in any NLP application. Given the complexity of the Arabic morphological system, it is difficult to determine the writer's intended meaning. Moreover, Intelligent Language Tutoring Systems, which need to analyze erroneous learner answers, generally introduce techniques, such as constraint relaxation, that produce more interpretations than systems designed for processing well-formed input. This paper addresses issues related to the morphological disambiguation of corrected interpretations of erroneous Arabic verbs that were written by beginner to intermediate Second Language Learners. The morphological disambiguation has been developed and effectively evaluated using real test data. It achieved satisfactory results in terms of the recall rate.

Shaalan, K., A. Hendam, and A. Rafea, "An English-Arabic Bi-directional Machine Translation Tool in the Agriculture Domain", Intelligent Information Processing V, vol. 340, Berlin, Heidelberg, Springer Boston, pp. 281–290, 2010.

The present work reports our attempt at developing an English-Arabic bi-directional Machine Translation (MT) tool in the agriculture domain. It aims to achieve automated translation of expert systems. In particular, we describe the translation of the knowledge base, including prompts, responses, explanation text, and advice. In the central laboratory for agricultural expert systems, this tool is found to be essential in developing bi-directional (English-Arabic) expert systems because both English and Arabic versions are needed for development, deployment, and usage purposes. The tool follows the rule-based transfer MT approach. A major design goal of this tool is that it can be used as a stand-alone tool and can also be integrated with a general (English-Arabic) MT system for Arabic scientific text. The paper also discusses our experience with the developed MT system and reports on results of its application on real agricultural expert systems.

Shaalan, K., R. Aref, and A. Fahmy, "An Approach for Analyzing and Correcting Spelling Errors for Non-native Arabic learners", The 7th International Conference on Informatics and Systems (INFOS2010), Cairo, Egypt, Faculty of Computers and Information, 2010.

Spell checkers are widely used in many software products for identifying errors in users' writing. However, they are not designed to address spelling errors made by non-native learners of a language. As a matter of fact, spelling errors made by non-native learners are more than just misspellings. Non-native learners' errors require special handling in terms of detection and correction, especially when it comes to morphologically rich languages such as Arabic, which have few related resources. In this paper, we address common error patterns made by non-native Arabic learners and suggest a two-layer spell-checking approach, including spelling error detection and correction. The proposed error detection mechanism is applied on top of Buckwalter's Arabic morphological analyzer in order to demonstrate the capability of our approach in detecting possible spelling errors. The correction mechanism adopts a rule-based edit distance algorithm. Rules are designed in accordance with common spelling error patterns made by Arabic learners. Error correction uses a multiple filtering mechanism to propose final corrections. The approach utilizes semantic information given in exercise questions in order to achieve highly accurate detection and correction of spelling errors made by non-native Arabic learners. Finally, the proposed approach was evaluated using real test data and promising results were achieved.
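
A rule-based edit distance can be sketched as a standard Levenshtein distance in which substitutions matching a known learner error pattern cost less than arbitrary ones. The Python sketch below is only an illustration of that idea: the three confusable letter pairs, the 0.5 cost, and the function names are assumptions, not the rules or weights used in the paper.

```python
# Hypothetical confusable letter pairs (assumed for illustration).
CONFUSABLE = {
    frozenset(("\u0647", "\u0629")),  # ha / ta marbuta
    frozenset(("\u0627", "\u0623")),  # alef / alef with hamza above
    frozenset(("\u0649", "\u064A")),  # alef maqsura / ya
}


def substitution_cost(a, b):
    """0 for identical letters, 0.5 for an assumed common confusion, else 1."""
    if a == b:
        return 0.0
    return 0.5 if frozenset((a, b)) in CONFUSABLE else 1.0


def weighted_edit_distance(source, target):
    """Levenshtein distance with rule-dependent substitution costs."""
    prev = [float(j) for j in range(len(target) + 1)]
    for i, a in enumerate(source, start=1):
        curr = [float(i)]
        for j, b in enumerate(target, start=1):
            curr.append(min(
                prev[j] + 1.0,                          # delete a
                curr[j - 1] + 1.0,                      # insert b
                prev[j - 1] + substitution_cost(a, b),  # substitute or match
            ))
        prev = curr
    return prev[-1]


def rank_corrections(word, lexicon):
    """Order candidate corrections from closest to farthest."""
    return sorted(lexicon, key=lambda w: (weighted_edit_distance(word, w), w))
```

With these assumed costs, a misspelling such as "مدرسه" is closer to "مدرسة" (distance 0.5) than to "مدرس" (distance 1.0), so the ta marbuta correction is ranked first.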

Khaled Shaalan

Professor of Computer Science
