Exploiting poly-lingual documents for improving text categorization effectiveness

Chih-Ping Wei; Chin-Sheng Yang; Ching-Hsien Lee; Huihua Shi; Christopher C. Yang

doi:10.1016/j.dss.2013.08.001

Journal article

Peer reviewed

Decision Support Systems, v 57(1), pp 64-76

Jan 2014

DOI: https://doi.org/10.1016/j.dss.2013.08.001

Abstract

Document management

Feature reinforcement

Poly-lingual text categorization

Text categorization

Text mining

With the globalization of business environments and rapid emergence and proliferation of the Internet, organizations or individuals often generate, acquire, and then archive documents written in different languages (i.e., poly-lingual documents). Prevalent document management practice is to use categories to organize this ever-increasing volume of poly-lingual documents for subsequent searches and accesses. Poly-lingual text categorization (PLTC) refers to the automatic learning of text categorization models from a set of preclassified training documents written in different languages and the subsequent assignment of unclassified poly-lingual documents to predefined categories on the basis of the induced text categorization models. Although PLTC can be approached as multiple, independent monolingual text categorization problems, this naïve PLTC approach employs only the training documents of the same language to construct a monolingual classifier and thus fails to exploit the opportunity offered by poly-lingual training documents. In this study, we propose a feature-reinforcement-based PLTC (FR-PLTC) technique that takes into account the training documents of all languages when constructing a monolingual classifier for a specific language. Using the independent monolingual text categorization (MnTC) approach as a performance benchmark, the empirical evaluation results show that our proposed FR-PLTC technique achieves higher classification accuracy than the benchmark technique. In addition, our empirical results suggest the superiority of the proposed FR-PLTC technique over its counterpart across a range of training sizes.

• We identify the problem of poly-lingual text categorization (PLTC).
• We propose a feature-reinforcement-based PLTC (FR-PLTC) technique.
• Our FR-PLTC technique considers the training documents of all languages.
• Our FR-PLTC technique outperforms the benchmark technique.
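The independent monolingual (MnTC) benchmark described in the abstract can be illustrated with a minimal sketch: training documents are grouped by language, one classifier is induced per language from that language's documents only, and an unclassified document is routed to the classifier for its own language. Everything concrete here is an assumption made for illustration: the abstract does not name the underlying learner, so a small multinomial naive Bayes model stands in for it, and the function names (`train_nb`, `train_mntc`, `classify_mntc`), the language codes, and the toy documents are hypothetical. The sketch covers the benchmark only, not the FR-PLTC technique, whose feature-reinforcement step the abstract does not detail.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """Train a multinomial naive Bayes model from (tokens, category) pairs.

    Naive Bayes is an assumed stand-in; the abstract does not name the learner.
    """
    cat_counts = Counter()              # number of training documents per category
    term_counts = defaultdict(Counter)  # term frequencies per category
    vocab = set()
    for tokens, cat in docs:
        cat_counts[cat] += 1
        term_counts[cat].update(tokens)
        vocab.update(tokens)
    return cat_counts, term_counts, vocab


def classify_nb(model, tokens):
    """Return the category with the highest log-posterior (Laplace smoothing)."""
    cat_counts, term_counts, vocab = model
    total_docs = sum(cat_counts.values())
    best_cat, best_score = None, -math.inf
    for cat, n_docs in cat_counts.items():
        score = math.log(n_docs / total_docs)
        denom = sum(term_counts[cat].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:  # terms unseen in training are ignored
                score += math.log((term_counts[cat][tok] + 1) / denom)
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat


def train_mntc(training_docs):
    """Build one independent classifier per language.

    training_docs: iterable of (language, tokens, category) triples.
    Each classifier sees only the training documents of its own language.
    """
    by_lang = defaultdict(list)
    for lang, tokens, cat in training_docs:
        by_lang[lang].append((tokens, cat))
    return {lang: train_nb(docs) for lang, docs in by_lang.items()}


def classify_mntc(models, lang, tokens):
    """Route an unclassified document to the classifier for its language."""
    return classify_nb(models[lang], tokens)


# Hypothetical toy corpus in two languages (illustrative data only).
training = [
    ("en", ["stock", "market", "price"], "finance"),
    ("en", ["football", "match", "goal"], "sports"),
    ("zh", ["股票", "市场"], "finance"),
    ("zh", ["足球", "比赛"], "sports"),
]
models = train_mntc(training)
print(classify_mntc(models, "en", ["market", "price"]))
print(classify_mntc(models, "zh", ["足球"]))
```

In this sketch the English classifier never sees the Chinese training documents and vice versa, which is the limitation the abstract attributes to the naïve approach.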

Metrics

14 Record Views

8 citations in Web of Science

7 citations in Scopus

Details

Title: Exploiting poly-lingual documents for improving text categorization effectiveness
Creators: Chih-Ping Wei - National Taiwan University
Chin-Sheng Yang - Department of Information Management, Yuan Ze University, Chung-Li, Taiwan, ROC
Ching-Hsien Lee - National Taiwan University
Huihua Shi - AU Optronics
Christopher C. Yang - Yuan Ze University
Publication Details: Decision Support Systems, v 57(1), pp 64-76
Publisher: Elsevier
Resource Type: Journal article
Language: English
Academic Unit: Information Science
Web of Science ID: WOS:000330909700007
Scopus ID: 2-s2.0-84892373973
Other Identifier: 991019169585904721

InCites Highlights

Data related to this publication, from InCites Benchmarking & Analytics tool:

Collaboration types: Industry collaboration; Domestic collaboration; International collaboration
Web of Science research areas: Computer Science, Artificial Intelligence; Computer Science, Information Systems; Operations Research & Management Science
