Searching for poor quality machine translated text : learning the difference between human writing and machine translations

From National Research Council Canada

DOI	https://doi.org/10.1007/978-3-642-30353-1
Author	Carter, Dave¹; Inkpen, Diana
Affiliation	National Research Council of Canada. NRC Institute for Information Technology
Format	Text, Article
Conference	25th Canadian Conference on Artificial Intelligence, Canadian AI 2012, 28-30 May 2012, Toronto, Ontario, Canada
Abstract	As machine translation (MT) tools have become mainstream, machine translated text has increasingly appeared on multilingual websites. Trustworthy multilingual websites are used as training corpora for statistical machine translation tools; large amounts of MT text in training data may make such products less effective. We performed three experiments to determine whether a support vector machine (SVM) could distinguish machine translated text from human written text (both original text and human translations). Machine translated versions of the Canadian Hansard were detected with an F-measure of 0.999. Machine translated versions of six Government of Canada web sites were detected with an F-measure of 0.98. We validated these results with a decision tree classifier. An experiment to find MT text on Government of Ontario web sites using Government of Canada training data was unfruitful, with a high rate of false positives. Machine translated text appears to be learnable and detectable when using a similar training corpus.
Publication date	2012-05
In	Advances in Artificial Intelligence (May 2012): 49–60.
Series	Lecture Notes in Artificial Intelligence (LNAI) 7310.
Language	English
Peer reviewed	Yes
NPARC number	20496817
Record identifier	cf9f7d1a-96a1-4b36-8355-6c808a7f3f4d
Record created	2012-08-16
Record modified	2020-04-21

Date modified: 2024-04-20