Hossny, A., K. Shaalan, and A. Fahmy,
"Machine translation model using inductive logic programming",
the 2009 IEEE International Conference on Natural Language Processing and Knowledge Engineering (IEEE NLP-KE’09), Dalian, China, pp. 1–8, sep, 2009.
AbstractRule based machine translation systems face different challenges in building the translation model in a form of transfer rules. Some of these problems require enormous human effort to state rules and their consistency. This is where different human linguists make different rules for the same sentence. A human linguist states rules to be understood by human rather than machines. The proposed translation model (from Arabic to English) tackles the mentioned problem of building translation model. This model employs Inductive Logic Programming (ILP) to learn the language model from a set of example pairs acquired from parallel corpora and represent the language model in a rule-based format that maps Arabic sentence pattern to English sentence pattern. By testing the model on a small set of data, it generated translation rules with logarithmic growing rate and with word error rate 11%.
Shaalan, K., H. Abo-Bakr, and I. Ziedan,
"A hybrid approach for building Arabic diacritizer",
the 12th European Chapter of the Association for Computational Linguistics (EACL 2009) Workshop on Computational Approaches to Semitic Languages, Association for Computational Linguistics, Athens, Greece, Association for Computational Linguistics, pp. 27–35, mar, 2009.
AbstractModern standard Arabic is usually written without diacritics. This makes it difficult for performing Arabic text processing. Diacritization helps clarify the meaning of words and disambiguate any vague spellings or pronunciations, as some Arabic words are spelled the same but differ in meaning. In this paper, we address the issue of adding diacritics to undiacritized Arabic text using a hybrid approach. The approach requires an Arabic lexicon and large corpus of fully diacritized text for training purposes in order to detect diacritics. Case-Ending is treated as a separate post processing task using syntactic information. The hybrid approach relies on lexicon retrieval, bigram, and SVM-statistical prioritized techniques. We present results of an evaluation of the proposed diacritization approach and discuss various modifications for improving the performance of this approach.
Damankesh, A., J. Singh, F. Jahedpari, K. Shaalan, and F. Oroumchian,
"Using Human Plausible Reasoning as a Framework for Multilingual Information Filtering",
CLEF 2009 Workshop, in conjunction with ECDL2009, 13th European Conference on Digital Libraries, Corfu, Greece, 30 September , 2009.
AbstractIn this paper the application of the theory of Human Plausible Reasoning (HPR) has been investigated in the domain of filtering and cross language information retrieval. The theory of Human Plausible Reasoning first has been introduced by Collins and Michalski on early 1990s; it has been applied to IR since 1995. This work is an extension to those experiments which focuses on building a framework for cross language information retrieval. The system built in these experiments utilizes plausible inferences to infer new, unknown knowledge from existing knowledge to retrieve not only documents which are indexed by the query terms but also those which are plausibly relevant.
Abo-Bakr, H., K. Shaalan, and I. Ziedan,
"A Statistical Method for Detecting the Arabic Empty Category",
The Second International Conference on Arabic Language Resources and Tools, Cairo, Egypt, The MEDAR Consortium, 22 April, 2009.
AbstractIn this paper we introduce a statistical approach for detecting the position of Empty-Category presented in Arabic Treebank. This can help in detecting the position of the elliptic personnel pronoun and overcoming, for some cases, the identification of dropped words within a sentence given the free word order nature of Arabic. The proposed approach requires a large corpus. The training for detecting the Empty-Category for each token is based on its Part Of Speech (POS), Base Phrase (BP)-chunk position, and the position of the token in the sentence. The Empty-Category detection is efficiently obtained using the Support Vector Machines (SVM) technique. We conducted an evaluation of the proposed diacritization algorithm, discussed the obtained results, and proposed various modifications for improving the performance of this approach.
Farghaly, A., and K. Shaalan,
"Arabic Natural Language Processing: Challenges and Solutions",
ACM Transactions on Asian Language Information Processing (TALIP), vol. 8, no. 4, New York, NY, USA, ACM, pp. 1-22, 2009.
AbstractThe Arabic language presents researchers and developers of natural language processing (NLP) applications for Arabic text and speech with serious challenges. The purpose of this article is to describe some of these challenges and to present some solutions that would guide current and future practitioners in the field of Arabic natural language processing (ANLP). We begin with general features of the Arabic language in Sections 1, 2, and 3 and then we move to more specific properties of the language in the rest of the article. In Section 1 of this article we highlight the significance of the Arabic language today and describe its general properties. Section 2 presents the feature of Arabic Diglossia showing how the sociolinguistic aspects of the Arabic language differ from other languages. The stability of Arabic Diglossia and its implications for ANLP applications are discussed and ways to deal with this problematic property are proposed. Section 3 deals with the properties of the Arabic script and the explosion of ambiguity that results from the absence of short vowel representations and overt case markers in contemporary Arabic texts. We present in Section 4 specific features of the Arabic language such as the nonconcatenative property of Arabic morphology, Arabic as an agglutinative language, Arabic as a pro-drop language, and the challenge these properties pose to ANLP. We also present solutions that have already been adopted by some pioneering researchers in the field. In Section 5 we point out to the lack of formal and explicit grammars of Modern Standard Arabic which impedes the progress of more advanced ANLP systems. In Section 6 we draw our conclusion.