Arabic Preprocessing Schemes for Statistical Machine Translation

Author	Habash, N.; Sadat, F.
Format	Text, Article
Conference	Human Language Technology Conference/North American Chapter of the Association for Computational Linguistics (HLT/NAACL) 2006, June 5-7, 2006, New York City, New York, USA
Abstract	In this paper, we study the effect of different word-level preprocessing decisions for Arabic on SMT quality. Our results show that given large amounts of training data, splitting off only proclitics performs best. However, for small amounts of training data, it is best to apply English-like tokenization using part-of-speech tags, and sophisticated morphological analysis and disambiguation. Moreover, choosing the appropriate preprocessing produces a significant increase in BLEU score if there is a change in genre between training and test data.
Publication date	2006
In	Proceedings of Human Language Technology Conference/North American Chapter of the Association for Computational Linguistics (HLT/NAACL) 2006.
Language	English
NRC number	NRCC 48759
NPARC number	9167805
Record identifier	07fcd97a-570b-45e4-a32c-5edef880e6c1
Record created	2009-06-29
Record modified	2020-10-09