Statistical Approach

Al-Emran, M., S. Zaza, and K. Shaalan, "Parsing Modern Standard Arabic using Treebank Resources", The International Conference on Information and Communication Technology Research (ICTRC), UAE, 18 May, 2015.

A treebank is a linguistic resource composed of a large collection of sentences that have been syntactically analyzed, manually annotated, and verified. Statistical Natural Language Processing (NLP) approaches have used these annotations successfully to develop basic NLP tasks such as tokenization, diacritization, part-of-speech tagging, and parsing, among others. In this paper, we address the problem of exploiting treebank resources for statistical parsing of Modern Standard Arabic (MSA) sentences. Statistical parsing is significant for NLP tasks that take parsed text as input, such as Information Retrieval and Machine Translation. We conducted an experiment on 2000 sentences from the Penn Arabic Treebank (PATB); the parsing performance obtained in terms of Precision, Recall, and F-measure was 82.4%, 86.6%, and 84.4%, respectively.
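As a quick consistency check on the figures above, the F-measure can be computed as the harmonic mean of Precision and Recall. This is a minimal sketch, assuming the balanced (F1) form of the measure; the abstract does not state which weighting was used, and the function name `f_measure` is illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): the harmonic mean of precision and recall.

    Assumption: equal weighting of precision and recall. The abstract
    does not say which variant of the F-measure was reported.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and Recall as reported in the abstract, in percent.
reported_precision = 82.4
reported_recall = 86.6

f1 = f_measure(reported_precision, reported_recall)
print(f"F-measure: {f1:.2f}%")
```

Under that assumption the result lands within a tenth of a point of the reported 84.4%.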

Nabhan, A., A. Rafea, and K. Shaalan, "Enhancing Phrase Extraction from Word Alignments Using Morphology", The 5th Conference on Language Engineering, Egyptian Society of Language Engineering (ELSE), Cairo, Egypt, Ain Shams University, pp. 57–65, September, 2005.

We propose a technique for effective extraction of bilingual phrases from word alignments using morphological processing. Morphological processing increases the frequency of words in the corpus and consequently reduces the Alignment Error Rate (AER). Intuitively, better word alignments improve the quality of the extracted bilingual phrases. Using alignments of a stemmed corpus for phrase extraction, instead of alignments of a raw one, yields significant improvements in translation quality, especially with small corpora.
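For readers unfamiliar with the metric, the sketch below computes AER using the commonly cited definition over sure (S) and possible (P) gold links, AER = 1 - (|A∩S| + |A∩P|) / (|A| + |S|). It assumes the paper follows that definition, which the abstract does not confirm; the function name and the toy alignment links are illustrative only.

```python
def alignment_error_rate(hypothesis, sure, possible):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|).

    hypothesis: iterable of (source_index, target_index) links to score (A)
    sure:       gold links annotated as certain (S)
    possible:   gold links annotated as acceptable (P); S is treated as a
                subset of P, as in the usual definition
    """
    a = set(hypothesis)
    s = set(sure)
    p = set(possible) | s
    if not a and not s:
        return 0.0  # nothing to align and nothing proposed
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))


# Toy example with made-up links, not data from the paper.
gold_sure = {(0, 0), (1, 1)}
gold_possible = {(2, 1)}
proposed = {(0, 0), (1, 1), (2, 2)}  # the link (2, 2) is wrong

print(alignment_error_rate(proposed, gold_sure, gold_possible))
```

In the toy case, two of the three proposed links match sure links, so the rate comes out at roughly 0.2; a perfect alignment scores 0.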

Abo-Bakr, H., K. Shaalan, and I. Ziedan, "A Statistical Method for Detecting the Arabic Empty Category", The Second International Conference on Arabic Language Resources and Tools, Cairo, Egypt, The MEDAR Consortium, 22 April, 2009.

In this paper, we introduce a statistical approach for detecting the position of the Empty Category as annotated in the Arabic Treebank. This can help in detecting the position of the elliptic personal pronoun and, in some cases, in identifying dropped words within a sentence, a task made harder by the free word order of Arabic. The proposed approach requires a large corpus. Training is performed per token, using the token's Part Of Speech (POS), its Base Phrase (BP) chunk position, and its position in the sentence. The Empty Category is then detected using the Support Vector Machines (SVM) technique. We conducted an evaluation of the proposed algorithm, discussed the obtained results, and proposed various modifications for improving the performance of this approach.
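The three per-token features named above could be assembled along the following lines before being handed to an SVM. This is only a sketch of the feature-extraction step: the function name, the dictionary keys, the added relative-position value, and the tag strings in the toy sentence are all assumptions for illustration, not the authors' actual feature encoding, and the classifier itself is left out.

```python
def token_features(tagged_tokens):
    """Build one feature dictionary per token.

    tagged_tokens: list of (word, pos_tag, bp_chunk_tag) triples.
    The keys below are hypothetical names for the three features
    described in the abstract.
    """
    last_index = max(len(tagged_tokens) - 1, 1)
    features = []
    for index, (_word, pos_tag, chunk_tag) in enumerate(tagged_tokens):
        features.append({
            "pos": pos_tag,            # Part Of Speech of the token
            "bp_chunk": chunk_tag,     # Base Phrase chunk position (B-/I- style)
            "position": index,         # absolute position in the sentence
            "relative_position": index / last_index,  # assumed extra feature
        })
    return features


# Placeholder tokens and hypothetical tags, not Arabic Treebank data.
sample = [
    ("w1", "VBD", "B-VP"),
    ("w2", "NN", "B-NP"),
    ("w3", "JJ", "I-NP"),
]

for row in token_features(sample):
    print(row)
```

Each dictionary would still need to be converted to a numeric vector, for example by one-hot encoding the categorical tags, before SVM training.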

Khaled Shaalan

Professor of Computer Science
