Morphology

Shaalan, K., M. Magdy, and A. Fahmy, "Analysis and Feedback of Erroneous Arabic Verbs", Journal of Natural Language Engineering, vol. 21, issue 2, pp. 271-323, 2015.

Arabic is a strongly structured language and is considered one of the most highly inflected and derivational languages. Learning Arabic morphology is a basic step for language learners in developing language skills such as listening, speaking, reading, and writing. Arabic morphology is non-concatenative and allows a large number of affixes to be attached to each root or stem, which leads to a combinatorial increase in the number of possible inflected words. As such, Arabic lexical (morphological and phonological) rules may be confusing for second language learners. Our study indicates that research and development efforts on spelling and grammar checking do not provide adequate interpretations of second language learners’ errors. In this paper we address issues related to error diagnosis and feedback for second language learners of Arabic verbs and how they impact the development of a web-based intelligent language tutoring system. The major aim is to develop an Arabic intelligent language tutoring system that solves these issues and helps second language learners to improve their linguistic knowledge. Learners are encouraged to produce input freely in various situations and contexts, and are guided to recognize by themselves the erroneous functions of their misused expressions. Moreover, we propose a framework that allows for the individualization of the learning process and provides intelligent feedback that conforms to the learner’s expertise for each class of error. Error diagnosis is not possible with current Arabic morphological analyzers, so constraint relaxation and edit distance techniques are successfully employed to provide error-specific diagnosis and adaptive feedback to learners. We demonstrate the capabilities of these techniques in diagnosing errors related to Arabic weak verbs formed using complex morphological rules. As a proof of concept, we have implemented the components that diagnose learners’ errors and generate feedback, and evaluated them against test data acquired from a real teaching environment. The experimental results were satisfactory, with a recall rate of 74.34 percent.
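
As a rough illustration of the edit distance technique named in this abstract (not the authors' implementation), the sketch below computes the Levenshtein distance between a learner's form and a few candidate correct forms, then picks the closest one. The transliterated verb forms and the candidate list are hypothetical, and the function names `levenshtein` and `closest_form` are introduced here for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def closest_form(learner_form: str, candidates: list[str]) -> str:
    """Return the candidate with the smallest edit distance to the learner's form."""
    return min(candidates, key=lambda c: levenshtein(learner_form, c))


# Hypothetical example: a learner treats the hollow root q-w-l as regular.
candidates = ["qAl", "qlt", "yqwlwn"]
print(closest_form("qwl", candidates))
```

A real diagnosis component would presumably combine such a distance with morphological constraints to explain the error, rather than only selecting the nearest string.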

Shaalan, K., M. Magdy, and A. Fahmy, "Morphological Analysis of Ill-formed Arabic Verbs for Second Language Learners", Applied Natural Language Processing: Identification, Investigation and Resolution, Hershey, PA, USA, IGI Global, pp. 1-659, 2012.

Arabic is a language of rich and complex morphology. The nature and peculiarity of Arabic make its morphological and phonological rules confusing for second language learners (SLLs). The conjugation of Arabic verbs is central to the formulation of an Arabic sentence because of its richness of form and meaning. In this research, we address issues related to the morphological analysis of ill-formed Arabic verbs in order to identify the source of errors and provide informative feedback to SLLs of Arabic. The edit distance and constraint relaxation techniques are used to demonstrate the capability of the proposed system in generating all possible analyses of erroneous Arabic verbs written by SLLs. Filtering mechanisms are applied to exclude irrelevant constructions and determine the target stem, which is used as the base for constructing the feedback to the learner. The proposed system has been developed and evaluated using real test data. It achieved satisfactory results in terms of the recall rate.

Attia, M., Y. Samih, K. Shaalan, and J. van Genabith, "The Floating Arabic Dictionary: An Automatic Method for Updating a Lexical Database through the detection and lemmatization of the Unknown Words", The International Conference on Computational Linguistics (COLING), Mumbai, India, 15 December, 2012.

Unknown words, or out-of-vocabulary (OOV) words, cause a significant problem for morphological analysers, syntactic parsers, MT systems and other NLP applications. Unknown words make up 29% of the word types in a large Arabic corpus used in this study. With today's corpus sizes exceeding 10^9 words, it becomes impossible to manually check corpora for new words to be included in a lexicon. We develop a finite-state morphological guesser and integrate it with a machine-learning-based pre-annotation tool in a pipeline architecture for extracting unknown words, lemmatizing them, and giving them a priority weight for inclusion in a lexical database. The processing is performed on a corpus of contemporary Arabic of 1,089,111,204 words. Our method is tested on a manually-annotated gold standard and yields encouraging results despite the complexity of the task. Our work shows the usability of a highly non-deterministic morphological guesser in a practical and complex application.

Shaalan, K., and M. Attia, "Handling Unknown Words in Arabic FST Morphology", The 10th edition of the International Workshop on Finite State Methods and Natural Language Processing (FSMNLP 2012), San Sebastian, Spain, 23 July, 2012.

A morphological analyser only recognizes words that it already knows in the lexical database. It needs, however, a way of sensing significant changes in the language in the form of newly borrowed or coined words with high frequency. We develop a finite-state morphological guesser in a pipelined methodology for extracting unknown words, lemmatizing them, and giving them a priority weight for inclusion in a lexicon. The processing is performed on a large contemporary corpus of 1,089,111,204 words and passed through a machine-learning-based annotation tool. Our method is tested on a manually-annotated gold standard of 1,310 forms and yields good results despite the complexity of the task. Our work shows the usability of a highly non-deterministic finite state guesser in a practical and complex application.

Shaalan, K., Y. Samih, M. Attia, P. Pecina, and J. van Genabith, "Arabic Word Generation and Modelling for Spell Checking", The eighth international conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey, 24 May, 2012.

Arabic is a language known for its rich and complex morphology. Although many research projects have focused on the problem of Arabic morphological analysis using different techniques and approaches, very few have addressed the issue of generation of fully inflected words for the purpose of text authoring.
Available open-source spell checking resources for Arabic are too small and inadequate. Ayaspell, for example, the official resource used with OpenOffice applications, contains only 300,000 fully inflected words. We try to bridge this critical gap by creating an adequate, open-source and large-coverage word list for Arabic containing 9,000,000 fully inflected surface words. Furthermore, from a large list of valid forms and invalid forms we create a character-based tri-gram language model to approximate knowledge about permissible character clusters in Arabic, creating a novel method for detecting spelling errors. Testing of this language model gives a precision of 98.2% at a recall of 100%. We take our research a step further by creating a context-independent spelling correction tool using a finite-state automaton that measures the edit distance between input words and candidate corrections, the Noisy Channel Model, and knowledge-based rules. Our system performs significantly better than Hunspell in choosing the best solution, but it is still below the MS Spell Checker.
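
The idea of using character trigrams to approximate permissible character clusters can be sketched as follows. This is a deliberately simplified, set-based stand-in and not the model described in the paper: it only records which trigrams occur in a list of valid words and flags any word containing an unseen trigram, whereas the paper's model is also built from invalid forms. The toy transliterated word list and the function names are assumptions made for illustration.

```python
def char_trigrams(word: str) -> list[str]:
    """Split a word into overlapping character trigrams,
    with '#' marking the word boundaries."""
    padded = "#" + word + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train(valid_words: list[str]) -> set[str]:
    """Collect every trigram observed in the list of valid words."""
    seen: set[str] = set()
    for word in valid_words:
        seen.update(char_trigrams(word))
    return seen


def looks_misspelled(word: str, seen: set[str]) -> bool:
    """Flag a word if it contains a trigram never seen in valid words."""
    return any(t not in seen for t in char_trigrams(word))


# Hypothetical toy lexicon (transliterated forms of the root k-t-b).
seen = train(["ktb", "ktAb", "kAtb"])
print(looks_misspelled("ktb", seen))   # every trigram was seen in training
print(looks_misspelled("kbbt", seen))  # contains unseen clusters such as "kbb"
```

A probabilistic model would assign scores to trigrams instead of a hard seen/unseen decision, which is what makes a trade-off between precision and recall possible.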

Nabhan, A., A. Rafea, and K. Shaalan, "Enhancing Phrase Extraction from Word Alignments Using Morphology", The 5th Conference on Language Engineering, Egyptian Society of Language Engineering (ELSE), Cairo, Egypt, Ain Shams University, pp. 57–65, September, 2005.

We propose a technique for effective extraction of bilingual phrases from word alignments using morphological processing. Morphological processing increases the frequency of words in the corpus and consequently reduces the Alignment Error Rate (AER). Intuitively, better word alignments enhance the quality of the bilingual phrases extracted. Using alignments of a stemmed corpus for phrase extraction, instead of alignments of a raw one, shows significant improvements in translation quality, especially with small corpora.

Shaalan, K., H. Talhami, and I. Kamel, "A Morphological Generator for the Indexing of Arabic Audio", the Proceedings of The IASTED International Conference on Artificial Intelligence and Soft Computing (ASC), Benidorm, Spain, ACTA Press, pp. 307–312, September, 2005.

This paper presents a novel Arabic morphological generator (AMG) for Modern Standard Arabic (MSA) which is designed and implemented using Prolog. The AMG is used to generate inflected forms of words used for the indexing of Arabic audio. These words are also the relevant terms in the Arab authority system (library information retrieval system) used in this study. The AMG generates inflected Arabic words from the root according to pre-specified morphological features that can be extended as needed. The Arabic word is represented as a feature structure which is handled through unification during the morphological generation process. The inflected forms can then be inserted automatically into a speech recognition grammar which is used to identify these words in an audio sequence or utterance.
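
To make the notion of handling a feature structure through unification more concrete, here is a minimal sketch that treats feature structures as nested dictionaries. The AMG itself is implemented in Prolog, so this Python version is only an analogy; the feature names (`root`, `agr`, `num`, `gen`) and the `unify` function are hypothetical.

```python
def unify(fs1: dict, fs2: dict):
    """Unify two feature structures represented as nested dicts.
    Returns the merged structure, or None if two values clash."""
    result = dict(fs1)  # shallow copy so fs1 itself is not modified
    for key, value in fs2.items():
        if key not in result:
            result[key] = value  # feature only present in fs2
        elif isinstance(result[key], dict) and isinstance(value, dict):
            sub = unify(result[key], value)  # unify nested structures
            if sub is None:
                return None
            result[key] = sub
        elif result[key] != value:
            return None  # conflicting atomic values
    return result


# Hypothetical word entry and a requested set of morphological features.
word = {"root": "ktb", "agr": {"num": "sg"}}
request = {"agr": {"gen": "fem"}}
print(unify(word, request))

# Incompatible number values cannot be unified.
print(unify({"agr": {"num": "sg"}}, {"agr": {"num": "pl"}}))
```

In Prolog this matching comes from the language's built-in term unification, which is presumably one reason it suits this kind of generator.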

Hossny, A., K. Shaalan, and A. Fahmy, "Automatic Morphological Rule Induction for Arabic", The sixth international conference on Language Resources and Evaluation (LREC'08) workshop on HLT & NLP within the Arabic world: Arabic Language and local languages processing: Status Updates and Prospects, Marrakech, Morocco, LREC, pp. 97–101, May, 2008.

In this paper, we introduce an algorithm for morphological rule induction using meta-rules for Arabic morphology based on inductive logic programming. The processing resources are a set of example pairs (stem and inflected form) with their feature vectors, either positive or negative, and the linguistic background knowledge from the Arabic morphological analysis domain. Each example pair has two words to be analyzed vocally into consonants and vowels. The algorithm applies two levels of mapping: between the vocal representations of the two words (stem, morphed) and between their feature vectors. It differentiates between both mappings in order to accurately deduce which changes in the word structure led to which changes in its features. The paper also addresses the irregularity, productivity and model consistency issues. We have developed an Arabic morphological rule induction system (AMRIS). An evaluation has been performed and showed that the system's performance was satisfactory.

Shaalan, K., M. Magdy, and D. Samy, "Towards Resolving Morphological Ambiguity in Arabic Intelligent Language Tutoring Framework", The seventh international conference on Language Resources and Evaluation (LREC'10) Workshop on Supporting eLearning with Language Resources and Semantic Data, Valletta, Malta, LREC, 2010.

Ambiguity is a major issue in any NLP application; it occurs when multiple interpretations of the same language phenomenon are produced. Given the complexity of the Arabic morphological system, it is difficult to determine what the intended meaning of the writer is. Moreover, Intelligent Language Tutoring Systems, which need to analyze erroneous learner answers, generally introduce techniques such as constraint relaxation that produce more interpretations than systems designed for processing well-formed input. This paper addresses issues related to the morphological disambiguation of corrected interpretations of erroneous Arabic verbs that were written by beginner to intermediate Second Language Learners. The morphological disambiguation has been developed and evaluated using real test data. It achieved satisfactory results in terms of the recall rate.

Khaled Shaalan

Professor of Computer Science
