Machine Learning

Nabhan, A. R., and K. Shaalan, "Keyword identification using text graphlet patterns", Natural Language to Information Systems: 21st International Conference on Applications of Natural Language to Information Systems (NLDB 2016), Berlin, Springer, 2016.

Keyword identification is an important task that provides useful information for NLP applications, including document retrieval, clustering, and categorization. State-of-the-art methods rely on local features of words (e.g. lexical, syntactic, and presentation features) to assess their candidacy as keywords. In this paper, we propose a novel keyword identification method that relies on the representation of text abstracts as word graphs. The significance of the proposed method stems from a flexible data representation that expands the context of words to span multiple sentences and thus enables the capture of important non-local graph topological features. Specifically, graphlets (small subgraph patterns) were efficiently extracted and scored to reflect the statistical dependency between these graphlet patterns and words labeled as keywords. Experimental results demonstrate the capability of the graphlet patterns in a keyword identification task when applied to MEDLINE, a standard research abstract dataset.
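As a rough illustration of the idea of word graphs and graphlets, the sketch below links words that co-occur within a small window and counts the 3-node patterns (triangles and open paths) centred on a given word. The function names, the window-based linking rule, and the sample sentences are assumptions made for illustration; the paper's actual graph construction and its statistical scoring of graphlets against keyword labels are not reproduced here.

```python
from collections import defaultdict
from itertools import combinations


def build_word_graph(sentences, window=2):
    """Link each word to the next `window` words in the same sentence.

    Because the graph is shared across sentences, a word's neighbourhood
    can span several sentences (an assumed, simplified construction).
    """
    graph = defaultdict(set)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            for other in tokens[i + 1 : i + 1 + window]:
                if other != word:
                    graph[word].add(other)
                    graph[other].add(word)
    return graph


def count_three_node_graphlets(graph, word):
    """Count 3-node graphlets in which `word` is adjacent to both other nodes."""
    neighbours = sorted(graph.get(word, set()))
    triangles = paths = 0
    for a, b in combinations(neighbours, 2):
        if b in graph[a]:
            triangles += 1  # all three words are mutually linked
        else:
            paths += 1      # `word` sits in the middle of an open path
    return {"triangle": triangles, "path_center": paths}


# Hypothetical two-sentence "abstract" used only for demonstration.
sentences = [
    "protein binding regulates gene expression",
    "gene expression controls protein binding",
]
graph = build_word_graph(sentences, window=2)
print(count_three_node_graphlets(graph, "gene"))
```

Counts like these could serve as per-word features; how they relate to keyword labels would still have to be learned from annotated data.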

Atia, S., and K. Shaalan, "Increasing the Accuracy of Opinion Mining in Arabic", The International Conference on Arabic Computational Linguistics (ACLing), Cairo, Egypt, 18 April, 2015.

Opinion Mining is a rising research field, with applications driven by market needs to analyze product reviews or to assess public opinion, for political reasons, during presidential campaigns. In this paper, we present an approach for improving the accuracy of Opinion Mining in Arabic. To conduct our study, we needed Arabic linguistic resources for opinion mining. Investigating the available resources, we found that the OCA corpus is available and sufficient to validate our approach. Experimental results showed that tuning the parameters of the machine learning classifiers on the OCA corpus increases the accuracy of Arabic Opinion Mining.

Meselhi, M., H. Abo Bakr, I. Ziedan, and K. Shaalan, "A Novel Hybrid Approach to Arabic Named Entity Recognition", Machine Translation: Communications in Computer and Information Science: Springer, 2014.

Named Entity Recognition (NER) is an essential preprocessing task for many Natural Language Processing (NLP) applications, such as text summarization, document categorization, and Information Retrieval. NER systems follow either a rule-based approach or a machine learning approach. In this paper, we introduce a novel NER system for Arabic using a hybrid approach, which combines a rule-based approach and a machine learning approach in order to improve the performance of Arabic NER. The system is able to recognize three types of named entities: Person, Location and Organization. Experimental results on the ANERcorp dataset showed that our hybrid approach achieves better performance than the rule-based approach and the machine learning approach when each is used separately. It also outperforms the state-of-the-art hybrid Arabic NER systems.

Shaalan, K., "A Survey of Arabic Named Entity Recognition and Classification", Computational Linguistics, vol. 40, issue 2: MIT Press, pp. 469 - 510, 2013, 2014.

As more and more Arabic textual information becomes available through the Web in homes and businesses, via Internet and Intranet services, there is an urgent need for technologies and tools to process the relevant information. Named Entity Recognition (NER) is an Information Extraction task that has become an integral part of many other Natural Language Processing (NLP) tasks, such as Machine Translation and Information Retrieval. Arabic NER has begun to receive attention in recent years. The characteristics and peculiarities of Arabic, a member of the Semitic languages family, make dealing with NER a challenge. The performance of an Arabic NER component affects the overall performance of the NLP system in a positive manner. This article attempts to describe and detail the recent increase in interest and progress made in Arabic NER research. The importance of the NER task is demonstrated, the main characteristics of the Arabic language are highlighted, and the aspects of standardization in annotating named entities are illustrated. Moreover, the different Arabic linguistic resources are presented and the approaches used in Arabic NER field are explained. The features of common tools used in Arabic NER are described, and standard evaluation metrics are illustrated. In addition, a review of the state of the art of Arabic NER research is discussed. Finally, we present our conclusions. Throughout the presentation, illustrative examples are used for clarification.

Shaalan, K., and M. Oudah, "A hybrid approach to Arabic named entity recognition", Journal of Information Science, vol. 40, no. 1, pp. 67-87, 2014.

In this paper, we propose a hybrid named entity recognition (NER) approach that takes advantage of both rule-based and machine learning-based approaches in order to improve the overall system performance and overcome the knowledge elicitation bottleneck and the lack of resources for underdeveloped languages that require deep language processing, such as Arabic. The complexity of Arabic poses special challenges to researchers of Arabic NER, which is essential for both monolingual and multilingual applications. We used the hybrid approach to develop an Arabic NER system that is capable of recognizing 11 types of Arabic named entities: Person, Location, Organization, Date, Time, Price, Measurement, Percent, Phone Number, ISBN and File Name. Extensive experiments were conducted using decision trees, Support Vector Machines and logistic regression classifiers to evaluate the system performance. The empirical results indicate that the hybrid approach outperforms both the rule-based and the ML-based approaches when they are processed independently. More importantly, our system outperforms the state of the art in Arabic NER in terms of accuracy when applied to the ANERcorp standard dataset, with F-measures of 0.94 for Person, 0.90 for Location and 0.88 for Organization.

Oudah, M., and K. Shaalan, "Person Name Recognition Using the Hybrid Approach", Natural Language Processing and Information Systems, vol. 7934, Berlin Heidelberg, Springer, pp. 237-248, 2013.

Arabic Person Name Recognition has been tackled mostly using either of two approaches: a rule-based or a Machine Learning (ML) based approach, each with its strengths and weaknesses. In this paper, the problem of Arabic Person Name Recognition is tackled by integrating the two approaches in a pipelined process to create a hybrid system, with the aim of enhancing the overall performance of Person Name Recognition tasks. Extensive experiments are conducted using three different ML classifiers to evaluate the overall performance of the hybrid system. The empirical results indicate that the hybrid approach outperforms both the rule-based and the ML-based approaches. Moreover, our system outperforms the state of the art in Arabic Person Name Recognition in terms of accuracy when applied to the ANERcorp dataset, with precision 0.949, recall 0.942 and F-measure 0.945.
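For readers unfamiliar with the metric, the reported F-measure is consistent with the standard harmonic mean of the reported precision and recall. The short sketch below checks that arithmetic; the function name and the `beta` parameter are illustrative choices, and the abstract itself does not state which F-measure variant was used (the balanced F1 is assumed here).

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Precision and recall as reported in the abstract above.
print(round(f_measure(0.949, 0.942), 3))  # 0.945
```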

Shaalan, K., and A. H. Hossny, "Automatic rule induction in Arabic to English machine translation framework", Challenges for Arabic Machine Translation, Amsterdam, The Netherlands, John Benjamins Publishing Company, 2012.

This paper addresses the use of a supervised machine learning technique to automatically induce Arabic-to-English transfer rules from chunks of parallel aligned linguistic resources. The induced structural transfer rules encode the linguistic translation knowledge for converting an Arabic syntactic structure into a target English syntactic structure. These rules are intended to be an integral part of an Arabic-English transfer-based machine translation system. In addition, a novel morphological rule induction method is employed for learning Arabic morphological rules that are applied in our Arabic morphological analyzer. To demonstrate the capability of the automated rule induction technique, we conducted rule-based translation experiments that use rules induced from a relatively small data set. The translation quality of the hybrid translation experiments achieved good results in terms of word error rate (WER).
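WER is conventionally the word-level edit distance between a system translation and a reference, divided by the reference length. The sketch below shows that common definition; the function name and the English sample sentences are invented for illustration, and the exact WER variant used in the chapter is not specified in the abstract.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Two substitutions against a five-word reference.
print(word_error_rate("the boy read the book", "the boy reads a book"))
```

Lower values are better; a WER of 0 means the hypothesis matches the reference word for word.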

Abdallah, S., K. Shaalan, and M. Shoaib, "Integrating Rule-Based System with Classification for Arabic Named Entity Recognition", Computational Linguistics and Intelligent Text Processing, vol. 7181, Berlin, Heidelberg, Springer, pp. 311-322, 2012.

Named Entity Recognition (NER) is a subtask of information extraction that seeks to recognize and classify named entities in unstructured text into predefined categories such as the names of persons, organizations, locations, etc. The majority of researchers have used machine learning, while few have used handcrafted rules to solve the NER problem. We focus here on NER for the Arabic language (NERA), an important language with its own distinct challenges. This paper proposes a simple method for integrating machine learning with rule-based systems and implements this proposal using the state-of-the-art rule-based system for NERA. Experimental evaluation shows that our integrated approach increases the F-measure by 8 to 14% when compared to the original (pure) rule-based system and the (pure) machine learning approach, and the improvement is statistically significant for different datasets. More importantly, our system outperforms the state-of-the-art machine learning system in NERA over a benchmark dataset.

Oudah, M., and K. Shaalan, "A Pipeline Arabic Named Entity Recognition Using a Hybrid Approach", The International Conference on Computational Linguistics (COLING), Mumbai, India, 14 December, 2012.

Most Arabic Named Entity Recognition (NER) systems have been developed using either of two approaches: a rule-based or Machine Learning (ML) based approach, with their strengths and weaknesses. In this paper, the problem of Arabic NER is tackled through integrating the two approaches together in a pipelined process to create a hybrid system with the aim of enhancing the overall performance of NER tasks. The proposed system is capable of recognizing 11 different types of named entities (NEs): Person, Location, Organization, Date, Time, Price, Measurement, Percent, Phone Number, ISBN and File Name. Extensive experiments are conducted using three different ML classifiers to evaluate the overall performance of the hybrid system. The empirical results indicate that the hybrid approach outperforms both the rule-based and the ML-based approaches. Moreover, our system outperforms the state-of-the-art of Arabic NER in terms of accuracy when applied to ANERcorp dataset, with f-measures 94.4% for Person, 90.1% for Location, and 88.2% for Organization.

Shafic, S., K. Shaalan, and A. Rafea, "Macro Association Rule Discovery: Impact of Environmental indicators Changes on Life Assurance Business", Egyptian Informatics Journal, vol. 3, no. 2: Faculty of Computers and Information, pp. 96–114, Dec., 2002.

Knowledge discovery systems in financial organizations have been built and operated mainly to support decision making, using knowledge as a strategic factor. In this paper, we investigate the use of association rule mining as an underlying technology for knowledge discovery in the insurance business. Existing association rule algorithms and their extensions are inefficient at mining association rules from data with such characteristics. We introduce algorithms for discovering knowledge in the form of association rules that are suited to these data characteristics. The proposed data mining technique is a hybrid of clustering, partitioning and multi-level rule induction. The proposed tool is managed by a repository meta-model instantiated by meta-data libraries specific to the insurance domain. It is implemented on a PC running MS Windows 2000. Samples of life assurance data covering ten years are extracted from different geographical locations of an Egyptian insurance company. Using the induced rules, the decision-maker can plan the horizontal expansion of marketing activities into new geographical areas, or vertically strengthen the marketing forces in existing geographical areas. Keywords: insurance data characteristics, macro association rules, clustering partitioning, preprocessing & transformation, OLAP aggregation, ontology, data warehouse
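As background on what an association rule measures, the sketch below computes the two standard quantities, support and confidence, for a rule of the form "antecedent implies consequent". The records, attribute names and values are invented toy data, not taken from the paper, and the paper's clustering and multi-level induction steps are not shown.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated probability of `consequent` given `antecedent`."""
    antecedent = frozenset(antecedent)
    consequent = frozenset(consequent)
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, antecedent | consequent) / base


# Hypothetical records: each set mixes an environmental indicator
# with a business outcome (assumed names, for illustration only).
records = [
    {"region=north", "inflation=high", "lapse=high"},
    {"region=north", "inflation=high", "lapse=high"},
    {"region=south", "inflation=low", "lapse=low"},
    {"region=north", "inflation=high", "lapse=low"},
]

rule_support = support(records, {"inflation=high", "lapse=high"})
rule_confidence = confidence(records, {"inflation=high"}, {"lapse=high"})
print(rule_support, round(rule_confidence, 3))
```

A rule is typically kept only if both values exceed user-chosen thresholds.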

Khaled Shaalan

Professor of Computer Science
