Pattern Recognition, International Conference on

Abstract

The BBN Byblos OCR system implements a script-independent methodology for OCR using Hidden Markov Models (HMMs). We have successfully ported the system to Arabic, Pashto, English, and Chinese. In this paper, we discuss our recent effort to configure the system to recognize noisy machine-printed Japanese documents. The data for our experiments was taken from the University of Washington (UW-II) Japanese OCR corpus and the LDC Japanese Business News Supplement corpus. We evaluated the performance of a whole-character configuration in which each character was modeled using a separate HMM. As in the case of our Chinese OCR system [Multilingual Machine Printed OCR], we also used a sub-character modeling approach [Porting the BBN BYBLOS OCR System to New Languages] in which each Japanese character was spelled using a shared set of automatically generated sub-characters. We experimentally evaluated the performance of different sub-character clusters as well as different HMM topologies to identify the best overall system configuration. On a fair test using noisy/degraded images from the UW-II corpus, the best sub-character configuration resulted in a character error rate of 20.13%. On relatively cleaner data, consisting of scanned newspaper images, the system delivered an error rate of 7.85%. Using a whole-character configuration, the corresponding error rates were 11.94% and 4.55%, respectively.
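
The abstract reports results as character error rates but does not say how they were scored. The sketch below shows one common way to compute the metric, assuming the usual definition: the Levenshtein (edit) distance between the recognized text and the reference transcription, divided by the reference length. The function names are illustrative and are not taken from the Byblos system.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, unit cost per edit."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # delete a reference character
                curr[j - 1] + 1,          # insert a hypothesis character
                prev[j - 1] + (r != h),   # substitute, or match at zero cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Character error rate in percent, under the assumed definition above."""
    if not ref:
        raise ValueError("reference transcription must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)
```

Because Python strings are sequences of Unicode code points, the same functions apply to Japanese text without modification, for example `character_error_rate("日本語の文字", "日本語文宇")`, where one character is dropped and one is substituted.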
