Natural Language Processing

Shaalan, K., and M. Oudah, "A hybrid approach to Arabic named entity recognition", Journal of Information Science, vol. 40, no. 1, pp. 67-87, 2014.

In this paper, we propose a hybrid named entity recognition (NER) approach that takes advantage of both rule-based and machine learning-based approaches in order to improve the overall system performance and overcome the knowledge elicitation bottleneck and the lack of resources for underdeveloped languages that require deep language processing, such as Arabic. The complexity of Arabic poses special challenges to researchers of Arabic NER, which is essential for both monolingual and multilingual applications. We used the hybrid approach to develop an Arabic NER system that is capable of recognizing 11 types of Arabic named entities: Person, Location, Organization, Date, Time, Price, Measurement, Percent, Phone Number, ISBN and File Name. Extensive experiments were conducted using decision trees, Support Vector Machines and logistic regression classifiers to evaluate the system performance. The empirical results indicate that the hybrid approach outperforms both the rule-based and the ML-based approaches when they are processed independently. More importantly, our system outperforms the state of the art in Arabic NER in terms of accuracy when applied to the standard ANERcorp dataset, with F-measures of 0.94 for Person, 0.90 for Location and 0.88 for Organization.

Oudah, M., and K. Shaalan, "Person Name Recognition Using the Hybrid Approach", Natural Language Processing and Information Systems, vol. 7934, Berlin Heidelberg, Springer, pp. 237-248, 2013.

Arabic Person Name Recognition has been tackled mostly using either of two approaches: a rule-based or a Machine Learning (ML) based approach, each with its strengths and weaknesses. In this paper, the problem of Arabic Person Name Recognition is tackled by integrating the two approaches in a pipelined process to create a hybrid system with the aim of enhancing the overall performance of Person Name Recognition tasks. Extensive experiments are conducted using three different ML classifiers to evaluate the overall performance of the hybrid system. The empirical results indicate that the hybrid approach outperforms both the rule-based and the ML-based approaches. Moreover, our system outperforms the state of the art in Arabic Person Name Recognition in terms of accuracy when applied to the ANERcorp dataset, with a precision of 0.949, a recall of 0.942 and an F-measure of 0.945.
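
As a quick consistency check on the figures reported above, the following minimal sketch computes the balanced F-measure as the harmonic mean of precision and recall. It assumes the abstract reports the balanced (beta = 1) F-measure, which the abstract does not state explicitly; the function name f_measure is illustrative and not taken from the paper.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing was retrieved correctly
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the abstract for the ANERcorp dataset.
reported_precision = 0.949
reported_recall = 0.942

f1 = f_measure(reported_precision, reported_recall)
print(round(f1, 3))  # 0.945, matching the reported F-measure
```

Under that assumption, the three reported numbers agree with one another.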

Shaalan, K., S. Al-Sheikh, and F. Oroumchian, "Query Expansion Based-on Similarity of Terms for Improving Arabic Information Retrieval", Intelligent Information Processing VI, vol. 385, Berlin Heidelberg, Springer, pp. 167-176, 2012.

This research suggests a method for query expansion in Arabic Information Retrieval using Expectation Maximization (EM). We employ the EM algorithm in the process of selecting relevant terms for expanding the query and weeding out the non-related terms. We tested our algorithm on the INFILE test collection of CLEF 2009, and the experiments show that query expansion that considers similarity of terms both improves precision and retrieves more relevant documents. The main finding of this research is that we can increase the recall while keeping the precision at the same level by this method.

Shaalan, K., and A. H. Hossny, "Automatic rule induction in Arabic to English machine translation framework", Challenges for Arabic Machine Translation, Amsterdam, The Netherlands, John Benjamins Publishing Company, 2012.

This paper addresses exploiting a supervised machine learning technique to automatically induce Arabic-to-English transfer rules from chunks of parallel aligned linguistic resources. The induced structural transfer rules encode the linguistic translation knowledge for converting an Arabic syntactic structure into a target English syntactic structure. These rules are going to be an integral part of an Arabic-English transfer-based machine translation system. In addition, a novel morphological rule induction method is employed for learning Arabic morphological rules that are applied in our Arabic morphological analyzer. To demonstrate the capability of the automated rule induction technique, we conducted rule-based translation experiments that use induced rules from a relatively small data set. The translation quality of the hybrid translation experiments achieved good results in terms of word error rate (WER).

Shaalan, K., M. Magdy, and A. Fahmy, "Morphological Analysis of Ill-formed Arabic Verbs for Second Language Learners", Applied Natural Language Processing: Identification, Investigation and Resolution, Hershey, PA, USA, IGI Global, pp. 1 - 659, 2012.

Arabic is a language of rich and complex morphology. The nature and peculiarity of Arabic make its morphological and phonological rules confusing for second language learners (SLLs). The conjugation of Arabic verbs is central to the formulation of an Arabic sentence because of its richness of form and meaning. In this research, we address issues related to the morphological analysis of ill-formed Arabic verbs in order to identify the source of errors and provide informative feedback to SLLs of Arabic. The edit distance and constraint relaxation techniques are used to demonstrate the capability of the proposed system in generating all possible analyses of erroneous Arabic verbs written by SLLs. Filtering mechanisms are applied to exclude the irrelevant constructions and determine the target stem, which is used as the base for constructing the feedback to the learner. The proposed system has been developed and effectively evaluated using real test data. It achieved satisfactory results in terms of the recall rate.
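
The abstract above names edit distance as one of the techniques used to relate an ill-formed verb to well-formed candidates. The following is a minimal sketch of the standard Levenshtein distance and of ranking candidate forms by it. It is not the authors' implementation: the abstract does not say which edit-distance variant or filtering rules are used, and the transliterated forms and the helper rank_candidates are hypothetical, chosen only for illustration.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(target) + 1))  # distances from the empty prefix
    for i, s_char in enumerate(source, start=1):
        current = [i]  # deleting i characters turns source[:i] into ""
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def rank_candidates(learner_form: str, candidates: list[str]) -> list[tuple[int, str]]:
    """Sort candidate forms by their distance to the learner's form."""
    return sorted((edit_distance(learner_form, c), c) for c in candidates)


# Hypothetical transliterated forms, used only to exercise the functions.
ranking = rank_candidates("katabaa", ["kutiba", "kataba", "kattaba"])
print(ranking[0][1])  # the closest candidate
```

In a system like the one described, such a ranking would only propose candidates; the filtering step mentioned in the abstract would still decide which analysis is the intended target.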

Abdallah, S., K. Shaalan, and M. Shoaib, "Integrating Rule-Based System with Classification for Arabic Named Entity Recognition", Computational Linguistics and Intelligent Text Processing, vol. 7181, Berlin, Heidelberg, Springer, pp. 311-322, 2012.

Named Entity Recognition (NER) is a subtask of information extraction that seeks to recognize and classify named entities in unstructured text into predefined categories such as the names of persons, organizations, locations, etc. The majority of researchers have used machine learning, while few researchers have used handcrafted rules to solve the NER problem. We focus here on NER for the Arabic language (NERA), an important language with its own distinct challenges. This paper proposes a simple method for integrating machine learning with rule-based systems and implements this proposal using the state-of-the-art rule-based system for NERA. Experimental evaluation shows that our integrated approach increases the F-measure by 8 to 14% when compared to the original (pure) rule-based system and the (pure) machine learning approach, and the improvement is statistically significant for different datasets. More importantly, our system outperforms the state-of-the-art machine-learning system in NERA over a benchmark dataset.

Oudah, M., and K. Shaalan, "A Pipeline Arabic Named Entity Recognition Using a Hybrid Approach", The International Conference on Computational Linguistics (COLING), Mumbai, India, 14 December, 2012.

Most Arabic Named Entity Recognition (NER) systems have been developed using either of two approaches: a rule-based or a Machine Learning (ML) based approach, each with its strengths and weaknesses. In this paper, the problem of Arabic NER is tackled by integrating the two approaches in a pipelined process to create a hybrid system with the aim of enhancing the overall performance of NER tasks. The proposed system is capable of recognizing 11 different types of named entities (NEs): Person, Location, Organization, Date, Time, Price, Measurement, Percent, Phone Number, ISBN and File Name. Extensive experiments are conducted using three different ML classifiers to evaluate the overall performance of the hybrid system. The empirical results indicate that the hybrid approach outperforms both the rule-based and the ML-based approaches. Moreover, our system outperforms the state of the art in Arabic NER in terms of accuracy when applied to the ANERcorp dataset, with F-measures of 94.4% for Person, 90.1% for Location, and 88.2% for Organization.

Attia, M., Y. Samih, K. Shaalan, and J. van Genabith, "The Floating Arabic Dictionary: An Automatic Method for Updating a Lexical Database through the detection and lemmatization of the Unknown Words", The International Conference on Computational Linguistics (COLING), Mumbai, India, 15 December, 2012.

Unknown words, or out-of-vocabulary (OOV) words, cause a significant problem for morphological analysers, syntactic parsers, MT systems and other NLP applications. Unknown words make up 29% of the word types in a large Arabic corpus used in this study. With today's corpus sizes exceeding 10^9 words, it becomes impossible to manually check corpora for new words to be included in a lexicon. We develop a finite-state morphological guesser and integrate it with a machine-learning-based pre-annotation tool in a pipeline architecture for extracting unknown words, lemmatizing them, and giving them a priority weight for inclusion in a lexical database. The processing is performed on a corpus of contemporary Arabic of 1,089,111,204 words. Our method is tested on a manually-annotated gold standard and yields encouraging results despite the complexity of the task. Our work shows the usability of a highly non-deterministic morphological guesser in a practical and complex application.

Attia, M., P. Pecina, Y. Samih, K. Shaalan, and J. van Genabith, "Improved Spelling Error Detection and Correction for Arabic", The International Conference on Computational Linguistics (COLING), Mumbai, India, 14 December, 2012.

A spelling error detection and correction application is based on three main components: a dictionary (or reference word list), an error model and a language model. While most of the attention in the literature has been directed to the language model, we show how improvements in any of the three components can lead to significant cumulative improvements in the overall performance of the system. We semi-automatically develop a dictionary of 9.3 million fully inflected Arabic words using a morphological transducer and a large corpus. We improve the error model by analysing error types and creating an edit distance based re-ranker. We also improve the language model by analysing the level of noise in different sources of data and selecting the optimal subset to train the system on. Testing and evaluation experiments show that our system significantly outperforms Microsoft Word 2010, OpenOffice Ayaspell and Google Docs.
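
To make the three components named above concrete, here is a toy sketch of how a dictionary, an error model and a language model can cooperate. It is an illustration under simplifying assumptions, not the system described in the paper: plain Levenshtein distance stands in for the error model, unigram counts stand in for the language model, the word list is a tiny English placeholder, and the function correct is a hypothetical name.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, used here as a stand-in error model."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def correct(word: str, dictionary: set[str], unigram_counts: Counter,
            max_distance: int = 2) -> str:
    """Return the word itself if known, else the best-ranked dictionary entry."""
    if word in dictionary:  # detection: known words are accepted as-is
        return word
    scored = [(edit_distance(word, entry), -unigram_counts[entry], entry)
              for entry in dictionary]
    scored = [s for s in scored if s[0] <= max_distance]
    if not scored:
        return word  # no plausible candidate, leave the word unchanged
    # Prefer fewer edits, then higher frequency (hence the negated count).
    return min(scored)[2]


# Placeholder data, assumed only for this illustration.
corpus = "language model error model dictionary mode".split()
unigram_counts = Counter(corpus)
dictionary = set(corpus)

print(correct("modl", dictionary, unigram_counts))
```

Here "model" and "mode" are both one edit away from "modl", so the frequency counts break the tie. This mirrors the abstract's point that each component contributes to the final ranking, although a real system would use a far larger dictionary and a stronger language model.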

Shaalan, K., and M. Attia, "Handling Unknown Words in Arabic FST Morphology", The 10th edition of the International Workshop on Finite State Methods and Natural Language Processing (FSMNLP 2012), San Sebastian, Spain, 23 July, 2012.

A morphological analyser only recognizes words that it already knows in the lexical database. It needs, however, a way of sensing significant changes in the language in the form of newly borrowed or coined words with high frequency. We develop a finite-state morphological guesser in a pipelined methodology for extracting unknown words, lemmatizing them, and giving them a priority weight for inclusion in a lexicon. The processing is performed on a large contemporary corpus of 1,089,111,204 words and passed through a machine-learning-based annotation tool. Our method is tested on a manually-annotated gold standard of 1,310 forms and yields good results despite the complexity of the task. Our work shows the usability of a highly non-deterministic finite state guesser in a practical and complex application.