Arabic

Shaalan, K., and M. Attia, "Handling Unknown Words in Arabic FST Morphology", The 10th edition of the International Workshop on Finite State Methods and Natural Language Processing (FSMNLP 2012), San Sebastian, Spain, 23 July 2012.

A morphological analyser only recognizes words that it already knows in the lexical database. It needs, however, a way of sensing significant changes in the language in the form of newly borrowed or coined words with high frequency. We develop a finite-state morphological guesser in a pipelined methodology for extracting unknown words, lemmatizing them, and giving them a priority weight for inclusion in a lexicon. The processing is performed on a large contemporary corpus of 1,089,111,204 words and passed through a machine-learning-based annotation tool. Our method is tested on a manually-annotated gold standard of 1,310 forms and yields good results despite the complexity of the task. Our work shows the usability of a highly non-deterministic finite state guesser in a practical and complex application.

Shaalan, K., Y. Samih, M. Attia, P. Pecina, and J. van Genabith, "Arabic Word Generation and Modelling for Spell Checking", The Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey, 24 May 2012.

Arabic is a language known for its rich and complex morphology. Although many research projects have focused on the problem of Arabic morphological analysis using different techniques and approaches, very few have addressed the issue of generation of fully inflected words for the purpose of text authoring.
Available open-source spell checking resources for Arabic are too small and inadequate. Ayaspell, for example, the official resource used with OpenOffice applications, contains only 300,000 fully inflected words. We try to bridge this critical gap by creating an adequate, open-source and large-coverage word list for Arabic containing 9,000,000 fully inflected surface words. Furthermore, from a large list of valid forms and invalid forms we create a character-based tri-gram language model to approximate knowledge about permissible character clusters in Arabic, creating a novel method for detecting spelling errors. Testing of this language model gives a precision of 98.2% at a recall of 100%. We take our research a step further by creating a context-independent spelling correction tool using a finite-state automaton that measures the edit distance between input words and candidate corrections, the Noisy Channel Model, and knowledge-based rules. Our system performs significantly better than Hunspell in choosing the best solution, but it is still below the MS Spell Checker.
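The idea of learning permissible character clusters from a word list can be sketched roughly as follows. This is a simplified illustration, not the model described in the paper: it keeps only a set of character trigrams seen in valid words (no probabilities, no invalid-form list), and the names `BOUNDARY`, `char_trigrams`, `train_trigram_set` and `looks_misspelled` are hypothetical.

```python
from typing import Iterable, List, Set

BOUNDARY = "#"  # hypothetical padding symbol marking word start and end


def char_trigrams(word: str) -> List[str]:
    """Return the character trigrams of a word, padded at both ends."""
    padded = BOUNDARY * 2 + word + BOUNDARY
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train_trigram_set(valid_words: Iterable[str]) -> Set[str]:
    """Collect every trigram that occurs in the list of valid forms."""
    seen: Set[str] = set()
    for word in valid_words:
        seen.update(char_trigrams(word))
    return seen


def looks_misspelled(word: str, seen: Set[str]) -> bool:
    """Flag a word if it contains at least one trigram never seen in training."""
    return any(t not in seen for t in char_trigrams(word))


# Toy word list (assumption: four forms of the root k-t-b).
valid = ["كتاب", "كاتب", "مكتب", "كتب"]
seen = train_trigram_set(valid)
print(looks_misspelled("كتاب", seen))   # a known form
print(looks_misspelled("كككك", seen))   # an implausible character cluster
```

A real model would assign probabilities to trigrams and choose a threshold on held-out data; the set-membership test above only shows the shape of the approach.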

Attia, M., K. Shaalan, L. Tounsi, and J. van Genabith, "Automatic Extraction and Evaluation of Arabic LFG Resources", The Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey, 22 May 2012.

This paper presents the results of an approach to automatically acquire large-scale, probabilistic Lexical-Functional Grammar (LFG) resources for Arabic from the Penn Arabic Treebank (ATB). Our starting point is the earlier work of Tounsi et al. (2009) on automatic LFG f(eature)-structure annotation for Arabic using the ATB. They exploit tree configuration, POS categories, functional tags, local heads and trace information to annotate nodes with LFG feature-structure equations. We utilize this annotation to automatically acquire grammatical function (dependency) based subcategorization frames and paths linking long-distance dependencies (LDDs). Many state-of-the-art treebank-based probabilistic parsing approaches are scalable and robust but often also shallow: they do not capture LDDs and represent only local information. Subcategorization frames and LDD paths can be used to recover LDDs from such parser output to capture deep linguistic information. Automatic acquisition of language resources from existing treebanks saves the time and effort involved in creating such resources by hand. Moreover, data-driven automatic acquisition naturally associates probabilistic information with subcategorization frames and LDD paths. Finally, based on the statistical distribution of LDD path types, we propose empirical bounds on traditional regular expression based functional uncertainty equations used to handle LDDs in LFG.

Shaalan, K., and M. Magdy, "Adaptive Feedback Message Generation for Second Language Learners of Arabic", Recent Advances in Natural Language Processing (RANLP 2011), Hissar, Bulgaria, 12 September 2011.

This paper addresses the generation of feedback messages for errors in Arabic verbs made by second language learners (SLLs). The proposed approach allows for individualization. When an SLL of Arabic writes an incorrect verb, the system analyses the input and distinguishes between different lexical error types. For each class of error, it issues intelligent feedback that conforms to the learner’s proficiency level. The proposed system has been evaluated using real test data and achieved satisfactory results.

Shaalan, K., and H. Raza, "Arabic Named Entity Recognition from Diverse Text Types", Advances in Natural Language Processing, vol. 5221: Springer Berlin Heidelberg, pp. 440-451, 2008.

Name identification has been worked on quite intensively for the past few years, and has been incorporated into several products. Many researchers have attacked this problem in a variety of languages, but only limited research has focused on Named Entity Recognition (NER) for Arabic text, due to the lack of resources for Arabic named entities and the limited progress made in Arabic natural language processing in general. In this paper, we present the results of our attempt at the recognition and extraction of the 10 most important named entities in Arabic script: person name, location, company, date, time, price, measurement, phone number, ISBN and file name. We developed the system, Named Entity Recognition for Arabic (NERA), using a rule-based approach. The system consists of a whitelist representing a dictionary of names, and a grammar, in the form of regular expressions, which is responsible for recognizing the named entities. NERA is evaluated using our own corpora, which are tagged in a semi-automated way, and the performance results achieved were satisfactory in terms of precision, recall, and f-measure.
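The combination of a whitelist (a dictionary of names) with a regular-expression grammar can be illustrated with a minimal sketch. It is not NERA's actual rule set: the whitelist entries, the two patterns, the labels and the function name `tag_entities` are all assumptions chosen for illustration.

```python
import re
from typing import List, Tuple

# Hypothetical whitelist: a tiny dictionary mapping known names to entity types.
WHITELIST = {
    "القاهرة": "LOCATION",
    "دبي": "LOCATION",
}

# Hypothetical grammar: one regular expression per pattern-like entity type.
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "FILE_NAME": re.compile(r"\b[\w\-]+\.(?:pdf|doc|txt)\b"),
}


def tag_entities(text: str) -> List[Tuple[str, str]]:
    """Return (surface string, entity type) pairs found in the text."""
    found: List[Tuple[str, str]] = []
    # Dictionary lookup on whitespace-separated tokens.
    for token in text.split():
        if token in WHITELIST:
            found.append((token, WHITELIST[token]))
    # Regular-expression rules applied to the whole text.
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.group(0), label))
    return found


sample = "سافر إلى القاهرة في 12/05/2008 وأرسل report.pdf"
for surface, label in tag_entities(sample):
    print(label, surface)
```

A practical system would also need to handle Arabic clitics attached to names (for example a prefixed conjunction or preposition), which a plain token lookup like this one misses.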

Shaalan, K., and H. Talhami, "Arabic Error Feedback in an Online Arabic Learning System", Advances in Natural Language Processing, Research in Computing Science (RCS) Journal, vol. 18, pp. 203-212, 2006.

Arabic is a Semitic language that is rich in its morphology and syntax. The very numerous and complex grammar rules of the language could be confusing even for Arabic native speakers. Many Arabic intelligent computer assisted language-learning (ICALL) systems have neither deep error analysis nor sophisticated error handling. In this paper, we report an attempt at developing an error analyzer and error handler for Arabic as an important part of the Arabic ICALL system. In this system, the learners are encouraged to construct sentences freely in various contexts and are guided to recognize by themselves the errors or inappropriate usage of their language constructs. We used natural language processing (NLP) tools such as a morphological analyzer and a syntax analyzer for error analysis and to give feedback to the learner.
Furthermore, we propose a mechanism of correction by the learner, which allows the learner to correct the typed sentence independently. This will result in the learner being able to figure out what the error is. Examples of error analysis and error handling will be given and will illustrate how the system works.

Magdy, M., K. Shaalan, and A. Fahmy, "Lexical Error Diagnosis for Second Language Learners of Arabic", The Seventh Conference on Language Engineering, Egyptian Society of Language Engineering (ELSE), Cairo, Egypt, Dec., 2007.

This paper addresses the development of an automated lexical error diagnosis system, which helps Arabic second language learners to learn well-formed weak verbs. The learners are encouraged to produce input freely in various situations and contexts and guided to recognize by themselves the erroneous or inappropriate functions of their misused expressions. In this system, we successfully used constraint relaxation and edit-distance techniques to provide error-specific diagnosis and feedback to second language learners of Arabic. We demonstrated the capabilities of these techniques to diagnose errors related to the Arabic weak verb which is formed using complex morphological rules. Furthermore, the developed system allows for individualization of the learning process by providing feedback that conforms to the learner’s expertise. Inexperienced learners might require detailed instruction while experienced learners benefit from higher level reminders and explanations.
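As an illustration of the edit-distance part of such a diagnosis, the sketch below ranks well-formed candidate forms by their Levenshtein distance to the learner's input. It does not reproduce the system's constraint relaxation or its morphological rules; the function names, the `max_distance` parameter and the transliterated toy forms are assumptions.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def closest_forms(learner_input: str, valid_forms: List[str],
                  max_distance: int = 2) -> List[Tuple[int, str]]:
    """Return (distance, form) pairs within max_distance, nearest first."""
    scored = [(edit_distance(learner_input, form), form) for form in valid_forms]
    return sorted(pair for pair in scored if pair[0] <= max_distance)


# Toy transliterated verb forms (assumption, for readability only).
print(closest_forms("qaala", ["qaala", "qaalat", "yaquulu"], max_distance=1))
```

The position and type of the differing characters in the nearest form could then be mapped to an error-specific message, with more or less detail depending on the learner's level.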

Shaalan, K., A. Abdel-Monem, and A. Rafea, "Arabic Morphological Generation from Interlingua: A Rule-based Approach", Intelligent Information Processing III, vol. 228: Springer US, pp. 441-451, 2007.

Arabic is a Semitic language with rich morphology and numerous, complex morphological rules. Arabic morphological analysis has long been a focus of Arabic natural language processing research, with the aim of achieving automated understanding of Arabic. With recent technological advances, Arabic natural language generation has received attention because it opens the way to wider applications such as machine translation. For machine translation systems that support a large number of languages, interlingua-based machine translation approaches are particularly attractive. In this paper, we report our attempt at developing a rule-based Arabic morphological generator for task-oriented interlingua-based spoken dialogues. Examples of morphological generation results from the generator illustrate how the system works. We also discuss the issues related to the morphological generation of Arabic words from an interlingua representation, and present how we have handled them.

Ezzat, M., K. Shaalan, and A. Fahmy, "Component Composition Analysis for Arabic Natural Language Processing", The 6th Conference on Language Engineering, Egyptian Society of Language Engineering (ELSE), Cairo, Egypt, Dec., 2006.

Building NLP applications from scratch is a difficult task that takes a lot of time and requires considerable NLP knowledge. For a rich language like Arabic, the difficulty increases significantly. In this paper, we investigate how to build a tool that helps NLP application developers build applications rapidly and robustly. The approach involves two steps: first, building common NLP tools using COM object technology; second, building NLP applications that use these tools, accessing them either locally or remotely. We have demonstrated the capabilities of COM objects in developing NLP tools such as a morphological analyzer, and used it to build two Arabic NLP applications.

Rafea, A., and K. Shaalan, "Lexical Analysis of Inflected Arabic words using Exhaustive Search of an Augmented Transition Network", Software Practice and Experience, vol. 23, no. 6, New York, NY, USA, John Wiley & Sons, Inc., pp. 567–588, 1993.


Khaled Shaalan

Professor of Computer Science
