Natural Language Processing

Atia, S., and K. Shaalan, "Increasing the Accuracy of Opinion Mining in Arabic", The International Conference on Arabic Computational Linguistics (ACLing), Cairo, Egypt, 18 April, 2015.

Opinion Mining is a rising research field, with applications driven by market needs to analyze product reviews or to assess public opinion, for political reasons, during presidential campaigns. In this paper, we address an approach for improving the accuracy of Opinion Mining in Arabic. In order to conduct our study we need Arabic linguistic resources for opinion mining. Investigating the available resources, we found that the OCA corpus is available and sufficient to prove our approach. Experimental results showed that applying different parameters of the machine learning classifiers on the OCA corpus leads to increasing the accuracy of Arabic Opinion Mining.

Al-Chalabi, H., S. Ray, and K. Shaalan, "Semantic Based Query Expansion for Arabic Question Answering Systems", The International Conference on Arabic Computational Linguistics (ACLing), Cairo, Egypt, 17 April, 2015.

Question Answering Systems have emerged as a good alternative to search engines, as they produce the desired information in a very precise way in real time. However, one serious concern with Question Answering Systems is that, despite having the answers to the questions in the knowledge base, they are not able to retrieve the answer due to a mismatch between the words used by users and those used by content creators. There has been a lot of research in the field of English and some European language Question Answering Systems to handle this issue. However, Arabic Question Answering Systems could not match the pace due to some inherent difficulties with the language itself as well as the lack of tools available to assist researchers. In this paper, we present a method to add semantically equivalent keywords to the questions by using semantic resources. The experiments suggest that the proposed research can deliver highly accurate answers for Arabic questions.
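
The idea of adding semantically equivalent keywords to a question can be sketched as follows. This is a minimal illustration only, not the method of the paper: the `SYNONYMS` table and the `expand_query` function are hypothetical stand-ins for a lookup in a real semantic resource, and English words are used for readability.

```python
# Hypothetical toy synonym table; a real system would query a semantic resource.
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "buy": ["purchase"],
}


def expand_query(keywords, synonyms=SYNONYMS):
    """Return the keywords plus their equivalent terms, in order, without duplicates."""
    expanded = []
    seen = set()
    for word in keywords:
        # Keep the original keyword first, then append its known equivalents.
        for term in [word] + synonyms.get(word, []):
            if term not in seen:
                seen.add(term)
                expanded.append(term)
    return expanded


print(expand_query(["buy", "car"]))
```

The expanded keyword list can then be matched against the knowledge base, so that a question phrased with one word can still retrieve content written with another.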

Meselhi, M., H. Abo Bakr, I. Ziedan, and K. Shaalan, "Hybrid Named Entity Recognition - Application to Arabic Language", The International Conference on Computer Engineering & Systems (ICCES), Egypt, 23 December, 2015.

Most Named Entity Recognition (NER) systems follow either a rule-based approach or a machine learning approach. In this paper, we introduce our attempt at developing a hybrid NER system, which combines the rule-based approach with a machine learning approach in order to obtain the advantages of both approaches and overcome their problems. The system is able to recognize eight types of named entities: Location, Person, Organization, Date, Time, Price, Measurement and Percent. Experimental results on the ANERcorp dataset indicated that our hybrid approach outperforms the rule-based approach and the machine learning approach when they are applied separately. Moreover, our hybrid approach outperforms the state of the art in Arabic NER.

Chalabi, H. A., S. Ray, and K. Shaalan, "Question Classification for Arabic Question Answering Systems", The International Conference on Information and Communication Technology Research (ICTRC), UAE, 17 May, 2015.

Due to the very fast growth of information in the last few decades, getting precise information in real time is becoming increasingly difficult. Search engines such as Google and Yahoo help in finding information, but they return it in the form of documents, which takes up a lot of the user's time. Question Answering Systems have emerged as a good alternative to search engines, as they produce the desired information in a very precise way in real time. This saves a lot of time for the user. There has been a lot of research in the field of English and some European language Question Answering Systems. However, Arabic Question Answering Systems could not match the pace due to some inherent difficulties with the language itself as well as the lack of tools available to assist researchers. Question classification is a very important module of Question Answering Systems. In this paper, we present a method to accurately classify Arabic questions in order to retrieve precise answers. The proposed method gives promising results.

Al-Emran, M., S. Zaza, and K. Shaalan, "Parsing Modern Standard Arabic using Treebank Resources", The International Conference on Information and Communication Technology Research (ICTRC), UAE, 18 May, 2015.

A Treebank is a linguistic resource that is composed of a large collection of manually annotated and verified syntactically analyzed sentences. Statistical Natural Language Processing (NLP) approaches have been successful in using these annotations for developing basic NLP tasks such as tokenization, diacritization, part-of-speech tagging, and parsing, among others. In this paper, we address the problem of exploiting treebank resources for statistical parsing of Modern Standard Arabic (MSA) sentences. Statistical parsing is significant for NLP tasks that use parsed text as an input, such as Information Retrieval and Machine Translation. We conducted an experiment on 2000 sentences from the Penn Arabic Treebank (PATB), and the parsing performance obtained in terms of Precision, Recall, and F-measure was 82.4%, 86.6%, and 84.4%, respectively.
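
As a quick check of how the three reported figures relate, the sketch below computes the F-measure from Precision and Recall. It assumes the reported F-measure is the balanced F1 score (the harmonic mean of Precision and Recall), which the abstract does not state explicitly; the function name `f_measure` is illustrative.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2 * precision * recall / (precision + recall)


# Precision and Recall as reported in the abstract above.
score = f_measure(0.824, 0.866)
print(f"{score:.3f}")  # prints 0.844
```

Under that assumption, the computed value agrees with the reported 84.4%.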

Al-Zoghby, A., and K. Shaalan, "Semantic Search for Arabic", International Florida Artificial Intelligence Research Society Conference (FLAIRS), USA, 19 May, 2015.

There is a growing interest in Arabic web content worldwide due to its importance for culture, religion, and economics. In the literature, research that addresses searching Arabic web content using semantic web technology is still insufficient compared to Arabic’s actual importance as a language. In this research, we propose an Arabic semantic search approach that is applied to Arabic web content. This approach is based on the Vector Space Model (VSM). It uses the Universal WordNet ontology to build a rich concept-space index instead of the traditional term-space index. The proposed index is used for enhancing the capability of the semantic-based VSM. Moreover, the approach introduces a new incidence measurement to calculate the semantic significance degree of the document's concepts, which is more suitable than the traditional term frequency measure. Furthermore, a novel method for calculating the semantic weight of a concept is introduced in order to determine the semantic similarity of two vectors. As a proof of concept, a system is applied to a full dump of the Arabic Wikipedia. The experimental results in terms of Precision, Recall and F-measure showed a change in performance from 77%, 56%, and 63% to 71%, 96%, and 81%, respectively.
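
The difference between a term-space and a concept-space index can be illustrated with a small sketch. It is not the approach of the paper: it uses plain concept counts and cosine similarity in place of the proposed incidence measurement and semantic weight, and `TERM_TO_CONCEPT` is a hypothetical mapping standing in for an ontology lookup.

```python
import math
from collections import Counter

# Hypothetical term-to-concept mapping; stands in for an ontology lookup.
TERM_TO_CONCEPT = {
    "car": "VEHICLE",
    "automobile": "VEHICLE",
    "physician": "DOCTOR",
    "doctor": "DOCTOR",
}


def concept_vector(tokens, mapping=TERM_TO_CONCEPT):
    """Count concepts instead of surface terms; unmapped tokens are kept as-is."""
    return Counter(mapping.get(token, token) for token in tokens)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u if key in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


document = ["automobile", "doctor"]
query = ["car", "physician"]

term_score = cosine(Counter(document), Counter(query))               # no shared terms
concept_score = cosine(concept_vector(document), concept_vector(query))  # shared concepts
```

In term space the document and the query share no words and score zero, while in concept space they map to the same concepts and match, which is the motivation for indexing concepts rather than terms.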

Shaalan, K., M. Magdy, and A. Fahmy, "Analysis and Feedback of Erroneous Arabic Verbs", Journal of Natural Language Engineering, vol. 21, issue 2, pp. 271-323, 2015.

The Arabic language is strongly structured and is considered one of the most highly inflected and derivational languages. Learning Arabic morphology is a basic step for language learners to develop language skills such as listening, speaking, reading, and writing. Arabic morphology is non-concatenative and allows a large number of affixes to be attached to each root or stem, which leads to a combinatorial increase in possible inflected words. As such, Arabic lexical (morphological and phonological) rules may be confusing for second language learners. Our study indicates that research and development endeavors on spelling and grammatical error checking do not provide adequate interpretations of second language learners’ errors. In this paper we address issues related to error diagnosis and feedback for second language learners of Arabic verbs and how they impact the development of a web-based intelligent language tutoring system. The major aim is to develop an Arabic intelligent language tutoring system that solves these issues and helps second language learners to improve their linguistic knowledge. Learners are encouraged to produce input freely in various situations and contexts, and are guided to recognize by themselves the erroneous functions of their misused expressions. Moreover, we propose a framework that allows for the individualization of the learning process and provides intelligent feedback that conforms to the learner’s expertise for each class of error. Error diagnosis is not possible with current Arabic morphological analyzers, so constraint relaxation and edit distance techniques are successfully employed to provide error-specific diagnosis and adaptive feedback to learners. We demonstrate the capabilities of these techniques in diagnosing errors related to Arabic weak verbs formed using complex morphological rules. As a proof of concept, we have implemented the components that diagnose learners’ errors and generate feedback, which have been effectively evaluated against test data acquired from a real teaching environment. The experimental results were satisfactory, and the performance achieved was 74.34 percent in terms of recall rate.
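
The edit distance technique mentioned above can be sketched with the standard Levenshtein distance. This is a generic illustration, not the diagnosis component described in the paper: the helper `closest_form` and the English word forms are hypothetical, chosen only to show how a learner's input might be matched to the nearest correct form.

```python
def edit_distance(source, target):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(target) + 1))  # distances from the empty prefix
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def closest_form(learner_input, candidate_forms):
    """Pick the candidate form with the smallest edit distance to the input."""
    return min(candidate_forms, key=lambda form: edit_distance(learner_input, form))


print(closest_form("wrot", ["write", "wrote", "written"]))
```

Once the nearest correct form is known, the specific edit operations that separate it from the learner's input can serve as the basis for error-specific feedback.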

Eltaher, A., H. A. Bak, I. Zidan, and K. Shaalan, "An Arabic CCG Approach for Determining Constituent Types from Arabic Treebank", Journal of King Saud University - Computer and Information Sciences, vol. 26, issue 4, pp. 441-449, 2014.

Converting a treebank into a CCGbank opens the respective language to the sophisticated tools developed for Combinatory Categorial Grammar (CCG) and enriches cross-linguistic development. The conversion is primarily a three-step process: determining constituents’ types, binarization, and category conversion. Usually, this process involves a preprocessing step to the Treebank of choice for correcting brackets and normalizing tags for any changes that were introduced during the manual annotation, as well as extracting morpho-syntactic information that is necessary for determining constituents’ types. In this article, we describe the required preprocessing step on the Arabic Treebank, as well as how to determine Arabic constituents’ types. We conducted an experiment on parts 1 and 2 of the Penn Arabic Treebank (PATB) aimed at converting the PATB into an Arabic CCGbank. The performance of our algorithm when applied to ATB1v2.0 & ATB2v2.0 was 99% identification of head nodes and 100% coverage over the Treebank data.

Meselhi, M., H. Abo Bakr, I. Ziedan, and K. Shaalan, "A Novel Hybrid Approach to Arabic Named Entity Recognition", Machine Translation: Communications in Computer and Information Science: Springer, 2014.

The Named Entity Recognition (NER) task is an essential preprocessing task for many Natural Language Processing (NLP) applications such as text summarization, document categorization, and Information Retrieval, among others. NER systems follow either a rule-based approach or a machine learning approach. In this paper, we introduce a novel NER system for Arabic using a hybrid approach, which combines a rule-based approach and a machine learning approach in order to improve the performance of Arabic NER. The system is able to recognize three types of named entities: Person, Location and Organization. Experimental results on the ANERcorp dataset showed that our hybrid approach achieves better performance than the rule-based approach and the machine learning approach when they are applied separately. It also outperforms the state-of-the-art hybrid Arabic NER systems.

Shaalan, K., "A Survey of Arabic Named Entity Recognition and Classification", Computational Linguistics, vol. 40, issue 2: MIT Press, pp. 469 - 510, 2013, 2014.

As more and more Arabic textual information becomes available through the Web in homes and businesses, via Internet and Intranet services, there is an urgent need for technologies and tools to process the relevant information. Named Entity Recognition (NER) is an Information Extraction task that has become an integral part of many other Natural Language Processing (NLP) tasks, such as Machine Translation and Information Retrieval. Arabic NER has begun to receive attention in recent years. The characteristics and peculiarities of Arabic, a member of the Semitic languages family, make dealing with NER a challenge. The performance of an Arabic NER component affects the overall performance of the NLP system in a positive manner. This article attempts to describe and detail the recent increase in interest and progress made in Arabic NER research. The importance of the NER task is demonstrated, the main characteristics of the Arabic language are highlighted, and the aspects of standardization in annotating named entities are illustrated. Moreover, the different Arabic linguistic resources are presented and the approaches used in Arabic NER field are explained. The features of common tools used in Arabic NER are described, and standard evaluation metrics are illustrated. In addition, a review of the state of the art of Arabic NER research is discussed. Finally, we present our conclusions. Throughout the presentation, illustrative examples are used for clarification.

Khaled Shaalan

Professor of Computer Science