Attia, M., Y. Samih, K. Shaalan, and J. van Genabith, "The Floating Arabic Dictionary: An Automatic Method for Updating a Lexical Database through the detection and lemmatization of the Unknown Words", The International Conference on Computational Linguistics (COLING), Mumbai, India, 15 December, 2012.

Unknown words, also called out-of-vocabulary (OOV) words, pose a significant problem for morphological analysers, syntactic parsers, MT systems and other NLP applications. Unknown words make up 29% of the word types in a large Arabic corpus used in this study. With today's corpus sizes exceeding 10^9 words, it becomes impossible to manually check corpora for new words to be included in a lexicon. We develop a finite-state morphological guesser and integrate it with a machine-learning-based pre-annotation tool in a pipeline architecture for extracting unknown words, lemmatizing them, and giving them a priority weight for inclusion in a lexical database. The processing is performed on a corpus of contemporary Arabic of 1,089,111,204 words. Our method is tested on a manually-annotated gold standard and yields encouraging results despite the complexity of the task. Our work shows the usability of a highly non-deterministic morphological guesser in a practical and complex application.
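
As a rough illustration of the extraction and prioritization steps named in the abstract, the sketch below filters tokens against a lexicon and ranks the unknown ones. It is not the paper's method: raw corpus frequency stands in for the priority weight as an assumption, the function names `oov_type_rate` and `rank_unknown_words` are hypothetical, and the finite-state guesser and lemmatization steps are omitted entirely.

```python
from collections import Counter


def oov_type_rate(tokens, lexicon):
    """Share of distinct word types in `tokens` that are missing from `lexicon`."""
    types = set(tokens)
    if not types:
        return 0.0
    return len(types - set(lexicon)) / len(types)


def rank_unknown_words(tokens, lexicon):
    """Return (word, count) pairs for unknown tokens, most frequent first.

    Assumption: raw frequency is used as a stand-in priority weight.
    The abstract does not say how the actual weight is computed.
    """
    known = set(lexicon)
    counts = Counter(t for t in tokens if t not in known)
    # Sort by descending count, then alphabetically so ties are deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    # Toy, made-up data for illustration only.
    tokens = ["kitab", "blog", "blog", "tweet", "kitab", "blog"]
    lexicon = {"kitab"}
    print(oov_type_rate(tokens, lexicon))
    print(rank_unknown_words(tokens, lexicon))
```

The type rate is measured over distinct word forms rather than running tokens, which matches how the abstract reports the 29% figure (as a share of word types).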