A segment-based hidden markov model for real-setting pinyin-to-chinese conversion

Xiaohua Zhou; Xiaohua Hu; Xiaodan Zhang; Xiajiong Shen

doi:10.1145/1321440.1321602

Conference proceeding

Proceedings of the sixteenth ACM conference on information and knowledge management, pp 1027-1030

06 Nov 2007

DOI: https://doi.org/10.1145/1321440.1321602

Abstract

Keywords: chinese input; pinyin; segment-based hidden markov model

Hidden Markov models (HMMs) are frequently used for Pinyin-to-Chinese conversion, but a first-order HMM captures only the dependency on the preceding character. Higher-order Markov models can bring higher accuracy but are computationally unaffordable on an average PC. We propose a segment-based hidden Markov model (SHMM), which has the same order of complexity as a first-order HMM but yields higher decoding accuracy. SHMM distinguishes a word from a bigram connecting two words and assigns a reasonable probability to each word as a whole. It is more powerful than HMM at decoding words of more than two characters. We conduct a comprehensive Pinyin-to-Chinese conversion evaluation on the Lancaster corpus. The experiment shows that perfect-sentence accuracy improves from 34.7% (HMM) to 43.3% (SHMM), and one-error sentence accuracy increases from 72.7% to 78.3%. Furthermore, SHMM integrates seamlessly with pinyin typing correction, acronym pinyin input, user-defined words, and self-adaptive learning, all of which are a must for a commercial Pinyin-to-Chinese conversion product in order to improve the efficiency of pinyin input.
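For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of first-order HMM decoding (Viterbi search) for pinyin input. It is not the SHMM proposed in the paper and is not taken from it. The candidate table, the probabilities, and the names `CANDIDATES`, `START`, `TRANS`, and `viterbi` are all hypothetical, chosen only to show how each character is scored against the single character before it.

```python
import math

# Hypothetical toy model: every number below is invented for illustration.
# Candidate characters for each pinyin syllable. Emission is treated as
# deterministic (each candidate character is read as exactly that syllable).
CANDIDATES = {
    "zhong": ["中", "种"],
    "guo": ["国", "过"],
}

# Assumed start probabilities P(c_1).
START = {"中": 0.6, "种": 0.4}

# Assumed transition probabilities P(c_i | c_{i-1}).
TRANS = {
    ("中", "国"): 0.7, ("中", "过"): 0.3,
    ("种", "国"): 0.1, ("种", "过"): 0.9,
}

FLOOR = 1e-9  # small probability for unseen events, to avoid log(0)


def viterbi(syllables):
    """Return the most probable character string for a syllable sequence."""
    # best[c] = (log-probability of the best path ending in c, that path)
    best = {
        c: (math.log(START.get(c, FLOOR)), [c])
        for c in CANDIDATES[syllables[0]]
    }
    for syl in syllables[1:]:
        nxt = {}
        for c in CANDIDATES[syl]:
            # A first-order model looks only at the immediately preceding character.
            score, path = max(
                (
                    (prev_score + math.log(TRANS.get((p, c), FLOOR)), prev_path)
                    for p, (prev_score, prev_path) in best.items()
                ),
                key=lambda t: t[0],
            )
            nxt[c] = (score, path + [c])
        best = nxt
    _, path = max(best.values(), key=lambda t: t[0])
    return "".join(path)


print(viterbi(["zhong", "guo"]))
```

Because each step conditions on one preceding character only, a model of this kind has no notion of a multi-character word as a unit, which is the limitation the abstract says SHMM is designed to address.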

Metrics

11 Record Views

5 citations in Scopus

Details

Title: A segment-based hidden markov model for real-setting pinyin-to-chinese conversion
Creators: Xiaohua Zhou - Drexel University
Xiaohua Hu - Drexel University
Xiaodan Zhang - Drexel University
Xiajiong Shen - Henan University
Publication Details: Proceedings of the sixteenth ACM conference on information and knowledge management, pp 1027-1030
Conference: 16th ACM conference on information and knowledge management
Series: CIKM '07
Publisher: Association for Computing Machinery (ACM)
Number of pages: 1
Resource Type: Conference proceeding
Language: English
Academic Unit: Information Science
Scopus ID: 2-s2.0-63449083354
Other Identifier: 991019173561104721
