Diacritization

Abo-Bakr, H., K. Shaalan, and I. Ziedan, "A Statistical Method for Adding Case Diacritics for Arabic Text", Language Engineering Conference: Ain Shams University, pp. 225–234, Dec. 2008.

This paper addresses the problem of adding case-ending diacritics to undiacritized Arabic text using statistical methods. The approach requires a large corpus of fully diacritized text from which the case endings are extracted. We train the system to detect the case-ending diacritic of each token based on its Part Of Speech (POS) tag, its BP-chunk position, and the position of the token in the sentence. The case-ending diacritics are then obtained efficiently using the SVM technique. We present an evaluation of the proposed diacritization algorithm and discuss various modifications for improving the performance of this approach.
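The feature set described in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the `Token` type, the feature names, the three-way position bucket, the neighbouring-POS context, and the toy tag values are all hypothetical. The sketch only shows how a token might be turned into the kind of categorical features an SVM classifier could consume.

```python
from typing import Dict, List, NamedTuple


class Token(NamedTuple):
    form: str   # undiacritized surface form (transliterated here)
    pos: str    # part-of-speech tag (hypothetical tagset)
    chunk: str  # base-phrase chunk tag in IOB style, e.g. "B-NP"


def case_ending_features(sentence: List[Token], i: int) -> Dict[str, str]:
    """Build a categorical feature dict for the token at index i.

    Uses the three cues named in the abstract (POS, BP-chunk position,
    position in the sentence); the neighbouring POS tags are an added
    assumption.
    """
    tok = sentence[i]
    prev_pos = sentence[i - 1].pos if i > 0 else "<S>"
    next_pos = sentence[i + 1].pos if i + 1 < len(sentence) else "</S>"

    # Coarse position of the token in the sentence.
    if i == 0:
        where = "first"
    elif i == len(sentence) - 1:
        where = "last"
    else:
        where = "middle"

    return {
        "pos": tok.pos,
        "chunk": tok.chunk,
        "position": where,
        "prev_pos": prev_pos,
        "next_pos": next_pos,
    }


# Toy sentence with illustrative tags only.
sentence = [
    Token("ktb", "VBD", "B-VP"),
    Token("Alwld", "NN", "B-NP"),
    Token("Aldrs", "NN", "B-NP"),
]
features = [case_ending_features(sentence, i) for i in range(len(sentence))]
```

Each feature dict would then be vectorized (for example by one-hot encoding) and paired with the case-ending label taken from the diacritized corpus before training a classifier.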

Abo-Bakr, H., K. Shaalan, and I. Ziedan, "A Hybrid Approach for Converting Written Egyptian Colloquial Dialect into Diacritized Arabic", The 6th International Conference on Informatics and Systems (INFOS2008), Cairo, Egypt, Faculty of Computers and Information, Mar. 2008.

The volume of written colloquial text has recently increased dramatically. It is used as a medium for expressing ideas, especially across the Web, usually in the form of blogs and partially colloquial articles. Most of this written colloquial text is in the Egyptian colloquial dialect, which is considered the most widely understood and used dialect throughout the Arab world. Modern Standard Arabic is the official Arabic language taught and understood all over the Arab world. Diacritics play a key role in disambiguating Arabic text, yet the reader is usually expected to infer or predict vowels from the context of the sentence. Inferring the full form of the Arabic word is also useful when developing Arabic natural language processing tools and applications. In this paper, we introduce a generic method for converting a written Egyptian colloquial sentence into its corresponding diacritized Modern Standard Arabic sentence; the method could easily be extended to other dialects of Arabic. Despite the lack of Arabic linguistic resources for this task, we have developed techniques for lexical acquisition of colloquial words, which are used for transferring written Egyptian Arabic into Modern Standard Arabic. We successfully used a Support Vector Machine approach for the diacritization (also known as vocalization or vowelling) of Arabic text.

Shaalan, K., H. Abo-Bakr, and I. Ziedan, "A hybrid approach for building Arabic diacritizer", The 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2009), Workshop on Computational Approaches to Semitic Languages, Athens, Greece, Association for Computational Linguistics, pp. 27–35, Mar. 2009.

Modern Standard Arabic is usually written without diacritics, which makes Arabic text processing difficult. Diacritization helps clarify the meaning of words and disambiguate vague spellings or pronunciations, as some Arabic words are spelled the same but differ in meaning. In this paper, we address the issue of adding diacritics to undiacritized Arabic text using a hybrid approach. The approach requires an Arabic lexicon and a large corpus of fully diacritized text for training purposes in order to detect diacritics. Case ending is treated as a separate post-processing task using syntactic information. The hybrid approach relies on lexicon retrieval, bigram, and SVM-statistical prioritized techniques. We present the results of an evaluation of the proposed diacritization approach and discuss various modifications for improving its performance.
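One way to read "prioritized techniques" is as a fallback chain, sketched below under stated assumptions. The ordering (unambiguous lexicon entry first, then a bigram table, then a statistical fallback), the function name `diacritize`, the data layout, and the toy transliterated entries are all hypothetical and chosen for illustration; the abstract does not specify these details, and the fallback here is a stub standing in for an SVM-based classifier.

```python
from typing import Callable, Dict, List, Tuple


def diacritize(
    words: List[str],
    lexicon: Dict[str, List[str]],
    bigrams: Dict[Tuple[str, str], str],
    fallback: Callable[[str], str],
) -> List[str]:
    """Pick a diacritized form for each word using a priority chain.

    1. Lexicon retrieval: use the entry if it is unambiguous.
    2. Bigram: use the form seen after the previous (undiacritized) word.
    3. Fallback: defer to a statistical model (a stub in this sketch).
    """
    result: List[str] = []
    prev = "<S>"  # sentence-start marker
    for word in words:
        candidates = lexicon.get(word, [])
        if len(candidates) == 1:
            choice = candidates[0]
        elif (prev, word) in bigrams:
            choice = bigrams[(prev, word)]
        else:
            choice = fallback(word)
        result.append(choice)
        prev = word
    return result


# Toy resources in transliteration; the entries are illustrative only.
toy_lexicon = {"fy": ["fiy"], "ktb": ["kataba", "kutub"]}
toy_bigrams = {("fy", "ktb"): "kutub"}


def leave_unchanged(word: str) -> str:
    # Stub fallback: returns the word undiacritized.
    return word


output = diacritize(["fy", "ktb", "xyz"], toy_lexicon, toy_bigrams, leave_unchanged)
```

In this sketch case endings are not handled; per the abstract, they would be assigned in a separate post-processing step that uses syntactic information.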

Khaled Shaalan

Professor of Computer Science
