Handling Unknown Words in Arabic FST Morphology

Shaalan, K., and M. Attia, "Handling Unknown Words in Arabic FST Morphology", The 10th International Workshop on Finite State Methods and Natural Language Processing (FSMNLP 2012), San Sebastian, Spain, 23 July 2012. Copy at www.tinyurl.com/zptjvoh

Date Presented:

23 July


A morphological analyser recognizes only the words already listed in its lexical database. It therefore needs a way to detect significant changes in the language, namely newly borrowed or coined words that occur with high frequency. We develop a finite-state morphological guesser within a pipelined methodology that extracts unknown words, lemmatizes them, and assigns each a priority weight for inclusion in a lexicon. The processing is performed on a large contemporary corpus of 1,089,111,204 words, which is passed through a machine-learning-based annotation tool. Our method is tested on a manually annotated gold standard of 1,310 forms and yields good results despite the complexity of the task. Our work shows the usability of a highly non-deterministic finite-state guesser in a practical and complex application.
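
The extract, lemmatize, and weight steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' finite-state implementation: plain affix stripping stands in for the FST guesser, the lexicon and affix lists (KNOWN_LEMMAS, PREFIXES, SUFFIXES) are hypothetical transliterated toy data, and the priority weight is assumed to be a simple count of supporting corpus tokens, which the abstract does not specify.

```python
from collections import Counter

# Hypothetical toy lexicon and affix lists (transliterated, illustrative only;
# none of these values come from the paper).
KNOWN_LEMMAS = {"kitab", "qalam"}
PREFIXES = ["wa", "al", "bi"]
SUFFIXES = ["at", "un", "ha"]


def guess_lemmas(word):
    """Return every candidate lemma obtained by stripping at most one
    prefix and at most one suffix (a deliberately non-deterministic guess)."""
    stems = {word}
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            stems.add(word[len(prefix):])
    candidates = set(stems)
    for stem in stems:
        for suffix in SUFFIXES:
            if stem.endswith(suffix) and len(stem) - len(suffix) >= 3:
                candidates.add(stem[:-len(suffix)])
    return candidates


def rank_unknown(tokens):
    """Skip tokens that the toy lexicon can analyse, then rank the guessed
    lemmas of the remaining tokens by the number of tokens supporting them."""
    support = Counter()
    for token in tokens:
        candidates = guess_lemmas(token)
        if candidates & KNOWN_LEMMAS:
            continue  # analysable with the existing lexicon, so not unknown
        for candidate in candidates:
            support[candidate] += 1
    return support.most_common()


if __name__ == "__main__":
    corpus = ["alkitab", "tilfizyun", "altilfizyun", "qalam"]
    for lemma, weight in rank_unknown(corpus):
        print(lemma, weight)
```

Because the guesser returns every candidate rather than a single best analysis, the ranking step is what separates plausible lemmas from noise; a real system would likely draw on richer evidence, such as the annotation tool's output, than this raw count.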

Related External Link

unk_fsmnlp_2012-acl-anthology__short_04.pdf (217.34 KB)