Truecasing For The Portage System

Conference: International Conference on Recent Advances in Natural Language Processing (RANLP-05), September 21-24, 2005, Borovets, Bulgaria
Abstract: This paper presents a truecasing technique, that is, a technique for restoring the normal case form to text that has been entirely lowercased or only partially cased. The technique combines several statistical components: an N-gram language model, a case mapping model, and a specialized language model for unknown words. The system can also distinguish between “title” and “non-title” lines and apply different statistical models to each type of line. The system was trained on data taken from the English portion of the Canadian parliamentary Hansard corpus and on some English-language texts taken from a corpus of China-related stories; it was tested on a separate set of texts from the China-related corpus. The system achieved 96% case accuracy when the China-related test corpus had been completely lowercased; this represents an 80% relative error rate reduction over the unigram baseline technique. Subsequently, our technique was implemented as a module called Portage-Truecasing inside a machine translation system called Portage, and its effect on the overall performance of Portage was tested. In this paper, we explore the truecasing concept, and then we explain the models used.
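To make the evaluation terms in the abstract concrete, the following is a minimal Python sketch of what a unigram baseline truecaser could look like, together with the case accuracy and relative error rate reduction measures. It is not the paper's implementation: the function names, the toy training sentences, and the exact definitions of the two measures are assumptions made for illustration, and the 0.80 baseline accuracy in the final line is a hypothetical input, not a figure reported in the abstract.

```python
from collections import Counter, defaultdict


def train_unigram_truecaser(cased_sentences):
    """Learn the most frequent cased form of each lowercased token."""
    counts = defaultdict(Counter)
    for sentence in cased_sentences:
        for token in sentence.split():
            counts[token.lower()][token] += 1
    return {lower: forms.most_common(1)[0][0] for lower, forms in counts.items()}


def truecase(model, lowercased_sentence):
    """Map each token to its most frequent cased form; unknown tokens are left unchanged."""
    return " ".join(model.get(tok, tok) for tok in lowercased_sentence.split())


def case_accuracy(predicted, reference):
    """Fraction of reference tokens whose case form was restored exactly (assumed definition)."""
    pred_tokens, ref_tokens = predicted.split(), reference.split()
    correct = sum(p == r for p, r in zip(pred_tokens, ref_tokens))
    return correct / len(ref_tokens)


def relative_error_reduction(baseline_accuracy, system_accuracy):
    """(baseline error - system error) / baseline error (assumed definition)."""
    baseline_error = 1.0 - baseline_accuracy
    system_error = 1.0 - system_accuracy
    return (baseline_error - system_error) / baseline_error


# Toy training data, invented for this example.
training = [
    "the House met in Ottawa",
    "the minister visited China",
    "The House adjourned",
]
model = train_unigram_truecaser(training)

output = truecase(model, "the house met in china")
print(output)
print(case_accuracy(output, "The House met in China"))

# Hypothetical baseline accuracy of 0.80 against the reported 0.96.
print(relative_error_reduction(0.80, 0.96))
```

The toy run shows a typical weakness of a unigram approach: the sentence-initial "The" is not restored, because the token is chosen without regard to context. Under the assumed definition of relative error rate reduction, a system at 96% accuracy would show an 80% reduction against a baseline with roughly 20% case error.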
Affiliation: NRC Institute for Information Technology; National Research Council Canada
Peer reviewed: No
NRC number: 48515
NPARC number: 5763859
Record identifier: eab2cda0-07bf-403a-af3c-835ae30583ab
Record created: 2009-03-29
Record modified: 2016-05-09