Parsing

Al-Emran, M., S. Zaza, and K. Shaalan, "Parsing Modern Standard Arabic using Treebank Resources", The International Conference on Information and Communication Technology Research (ICTRC), UAE, 18 May, 2015.

A treebank is a linguistic resource composed of a large collection of sentences that have been syntactically analyzed, manually annotated, and verified. Statistical Natural Language Processing (NLP) approaches have successfully used these annotations to develop basic NLP tasks such as tokenization, diacritization, part-of-speech tagging, and parsing, among others. In this paper, we address the problem of exploiting treebank resources for statistical parsing of Modern Standard Arabic (MSA) sentences. Statistical parsing is significant for NLP tasks that take parsed text as input, such as Information Retrieval and Machine Translation. We conducted an experiment on 2000 sentences from the Penn Arabic Treebank (PATB); the parsing performance obtained in terms of Precision, Recall, and F-measure was 82.4%, 86.6%, and 84.4%, respectively.
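The three reported figures are related by the usual harmonic-mean definition of the F-measure. The sketch below assumes the balanced F-measure (F1) and a PARSEVAL-style comparison of labeled brackets; the abstract does not say which scoring tool or bracket definition was used, so the function names and the `(label, start, end)` bracket representation are illustrative assumptions.

```python
def f_measure(precision, recall):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def parseval(gold_brackets, predicted_brackets):
    """Score predicted labeled brackets against gold ones.

    Each bracket is a hypothetical (label, start, end) tuple.
    Returns (precision, recall, f_measure) as fractions in [0, 1].
    """
    gold = set(gold_brackets)
    predicted = set(predicted_brackets)
    matched = len(gold & predicted)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall)


# The reported precision and recall are consistent with the reported F-measure.
print(round(f_measure(82.4, 86.6), 1))  # 84.4
```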

Eltaher, A., H. A. Bak, I. Zidan, and K. Shaalan, "An Arabic CCG Approach for Determining Constituent Types from Arabic Treebank", Journal of King Saud University - Computer and Information Sciences, vol. 26, issue 4, pp. 441-449, 2014.

Converting a treebank into a CCGbank opens the respective language to the sophisticated tools developed for Combinatory Categorial Grammar (CCG) and enriches cross-linguistic development. The conversion is primarily a three-step process: determining constituents’ types, binarization, and category conversion. Usually, this process involves a preprocessing step on the treebank of choice that corrects brackets, normalizes tags to account for any changes introduced during manual annotation, and extracts the morpho-syntactic information needed to determine constituents’ types. In this article, we describe the required preprocessing step on the Arabic Treebank, as well as how to determine Arabic constituents’ types. We conducted an experiment on parts 1 and 2 of the Penn Arabic Treebank (PATB) aimed at converting the PATB into an Arabic CCGbank. When applied to ATB1v2.0 and ATB2v2.0, our algorithm identified 99% of head nodes and achieved 100% coverage of the Treebank data.
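To illustrate the binarization step, here is a minimal sketch of head-driven binarization of a single n-ary constituent. It is not the authors' algorithm: the tree representation (a `(label, children)` tuple), the function name, and the choice to attach right siblings before left siblings are all assumptions made for the example. It presumes the head child has already been identified, which is what the constituent-type step provides.

```python
def binarize(label, children, head_index):
    """Turn an n-ary constituent into nested binary nodes around its head.

    children   -- list of subtrees or leaf strings (hypothetical format)
    head_index -- position of the head child, assumed to be known already
    Returns the head child unchanged when it is the only child.
    """
    if not 0 <= head_index < len(children):
        raise ValueError("head_index is out of range")

    node = children[head_index]
    # Attach siblings to the right of the head, nearest first.
    for right in children[head_index + 1:]:
        node = (label, [node, right])
    # Then attach siblings to the left of the head, nearest first.
    for left in reversed(children[:head_index]):
        node = (label, [left, node])
    return node


# A flat node X -> A H B C with head H becomes (X A (X (X H B) C)).
tree = binarize("X", ["A", "H", "B", "C"], head_index=1)
```

Every node produced this way has exactly two children, which is the shape that the later category-conversion step needs.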

Shaalan, K., "Extending Prolog for Better Natural Language Analysis", 1st Conference on Language Engineering, Cairo, Egypt, Ain Shams University, pp. 225–236, March, 1998.

Prolog supports natural language parsing with a clean semantics and additional constructs such as definite clause grammars (DCGs). While it provides excellent computational support, we claim that it lacks a notation that improves the readability and maintainability of natural language analysis programs. In this paper we explore an alternative solution: a general notational extension to Prolog programs that allows definitions to be expressed concisely. This notational extension results in a powerful and convenient logic programming language that is well suited to natural language analysis programming. Programs translate to Prolog in a way similar to DCGs; that is, they have a specific syntax and can be loaded and expanded into Prolog code. This expansion is transparent to the user. To demonstrate the language's capabilities, we present an example of an Arabic morphological analyzer.
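As a point of reference for how such a translation works, the sketch below mimics the textbook DCG expansion for rules whose bodies contain only nonterminals: each nonterminal gains two extra arguments that thread the remaining input through the clause. It does not reproduce the paper's notational extension, whose syntax is not given here. The function name and the generated variable names are assumptions, and real Prolog systems also handle terminals, pushback, and embedded goals, which this sketch ignores.

```python
def expand_dcg(head, body):
    """Expand a nonterminal-only DCG rule into a Prolog clause string.

    head -- name of the rule's left-hand side, e.g. "s"
    body -- non-empty list of nonterminal names, e.g. ["np", "vp"]
    """
    if not body:
        raise ValueError("this sketch only handles non-empty bodies")

    # Input positions S0, S1, ..., ending in S for the final remainder.
    positions = [f"S{i}" for i in range(len(body))] + ["S"]
    goals = [
        f"{nonterminal}({positions[i]}, {positions[i + 1]})"
        for i, nonterminal in enumerate(body)
    ]
    return f"{head}(S0, S) :- {', '.join(goals)}."


# The DCG rule  s --> np, vp.  expands to an ordinary clause:
print(expand_dcg("s", ["np", "vp"]))  # s(S0, S) :- np(S0, S1), vp(S1, S).
```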

Shaalan, K., A. Farouk, and A. Rafea, "Towards An Arabic Parser for Modern Scientific Text", The International Conference on Artificial Intelligence for Decision, Control and Automation in Engineering and Industrial Applications (ACIDCA'2000), Monastir, Tunisia, pp. 228–235, March, 2000.

The present work reports our attempt at developing an Arabic parser for modern scientific text. The parser is written in Definite Clause Grammar (DCG) and is intended to be part of a machine translation system. Developing the parser was a two-step process. In the first step, we acquired the rules that constitute a grammar for Arabic, one that gives a precise account of what it means for a sentence to be grammatical. The grammar covers text from the domain of agricultural extension documents. The second step was to implement the parser, which assigns a grammatical structure to an input sentence. An experiment on a real extension document was performed. The paper also describes our experience with the developed parser and the results of applying it to a real agricultural extension document.

Othman, E., K. Shaalan, and A. Rafea, "A chart parser for analyzing modern standard Arabic sentence", MT Summit IX Workshop on Machine Translation for Semitic Languages: Issues and Approaches, New Orleans, Louisiana, USA, ACL, pp. 37–44, September, 2003.

Parsing Arabic sentences is a necessary prerequisite for many natural language processing applications such as machine translation and information retrieval. In this paper we report our attempt to develop an efficient chart parser for analyzing Modern Standard Arabic (MSA) sentences. From a practical point of view, the parser is able to satisfy syntactic constraints, thereby reducing parsing ambiguity. Lexical semantic features are also used to disambiguate the sentence structure. We also explain an Arabic morphological analyzer based on the ATN technique. Both the Arabic parser and the Arabic morphological analyzer are implemented in Prolog. The linguistic rules were acquired from a set of MSA sentences in the agriculture domain.
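The core idea of chart parsing, storing every constituent found over every span so that it is computed only once, can be sketched with a small CKY-style recognizer. This is an illustration in Python rather than the authors' Prolog implementation, and it omits the syntactic constraints and lexical semantic features the abstract mentions. The toy English lexicon and binary rules are invented for the example, since the paper's Arabic rules are not listed here.

```python
from itertools import product


def cky_recognize(words, lexicon, rules, start="S"):
    """Return True if the words can be derived from the start category.

    lexicon -- maps a word to the set of categories it can have
    rules   -- maps a (left, right) category pair to the set of parents
    """
    n = len(words)
    # chart[i][j] holds every category that spans words[i:j].
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        chart[i][i + 1] = set(lexicon.get(word, ()))

    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point between the two halves
                for left, right in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= rules.get((left, right), set())
    return start in chart[0][n]


# Hypothetical toy grammar, not taken from the paper.
LEXICON = {"the": {"Det"}, "farmer": {"N"}, "plants": {"V"}, "wheat": {"NP"}}
RULES = {("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}, ("NP", "VP"): {"S"}}

accepted = cky_recognize("the farmer plants wheat".split(), LEXICON, RULES)
```

A full chart parser would also record back-pointers in each cell so that the parse trees, and not just a yes/no answer, can be recovered.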

Shaalan, K., "Arabic GramCheck: a grammar checker for Arabic", Software Practice and Experience, vol. 35, no. 7, New York, NY, USA, John Wiley & Sons, Inc., pp. 643–665, 2005.

Arabic is a Semitic language that is rich in its morphology and syntax. The numerous and complex grammar rules of the language may be confusing for the average user of a word processor. In this paper, we report our attempt at developing a grammar checker program for Modern Standard Arabic, called Arabic GramCheck. Arabic GramCheck can help the average user by checking his/her writing for certain common grammatical errors; it describes the problem for him/her and offers suggestions for improvement. The use of the Arabic grammatical checker can increase productivity and improve the quality of the text for anyone who writes Arabic. Arabic GramCheck has been successfully implemented using SICStus Prolog on an IBM PC. The current implementation covers a well-formed subset of Arabic and focuses on people trying to write in a formal style. Successful tests have been performed using a set of Arabic sentences. Comparing the results with the output of a commercially available Arabic grammar checker, we conclude that the approach is promising.

Abo-Bakr, H., K. Shaalan, and I. Ziedan, "A Statistical Method for Detecting the Arabic Empty Category", The Second International Conference on Arabic Language Resources and Tools, Cairo, Egypt, The MEDAR Consortium, 22 April, 2009.

In this paper we introduce a statistical approach for detecting the position of the Empty-Category present in the Arabic Treebank. This can help in detecting the position of the elliptic personal pronoun and, in some cases, in identifying dropped words within a sentence despite the free word order of Arabic. The proposed approach requires a large corpus. The training for detecting the Empty-Category for each token is based on its Part Of Speech (POS), its Base Phrase (BP)-chunk position, and the position of the token in the sentence. The Empty-Category detection is efficiently obtained using the Support Vector Machines (SVM) technique. We conducted an evaluation of the proposed diacritization algorithm, discussed the obtained results, and proposed various modifications for improving the performance of this approach.
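The per-token features named in the abstract (POS tag, BP-chunk position, and position in the sentence) could be assembled as in the sketch below before being passed to a classifier such as an SVM. The feature names, the dictionary layout, and the example tags are assumptions made for illustration; the paper's actual feature encoding, tag set, and any context-window features are not described here.

```python
def token_features(pos_tags, chunk_tags):
    """Build one feature dictionary per token of a sentence.

    pos_tags   -- POS tag of each token, in sentence order
    chunk_tags -- base-phrase chunk tag of each token (e.g. IOB-style)
    """
    if len(pos_tags) != len(chunk_tags):
        raise ValueError("pos_tags and chunk_tags must be the same length")

    length = len(pos_tags)
    return [
        {
            "pos": pos,             # part of speech of the token
            "chunk": chunk,         # base-phrase chunk position
            "index": index,         # position of the token in the sentence
            "sentence_length": length,
        }
        for index, (pos, chunk) in enumerate(zip(pos_tags, chunk_tags))
    ]


# Hypothetical three-token sentence with illustrative tags.
features = token_features(["VBD", "NN", "NN"], ["B-VP", "B-NP", "I-NP"])
```

Each dictionary would then be converted to a numeric vector (for example by one-hot encoding the categorical values) and paired with a label that says whether an Empty-Category follows the token.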