Cross-lingual text categorization: Conquering language boundaries in globalized environments
Journal article   Open access   Peer reviewed

Chih-Ping Wei, Yen-Ting Lin and Christopher C Yang
Information processing & management, v 47(5), pp 786-804
2011
url
http://ntur.lib.ntu.edu.tw/bitstream/246246/245894/-1/52.pdf

Abstract

Keywords: Text mining; Cross-lingual text categorization; Document management; Text categorization
► Our CLTC technique uses a statistical-based bilingual thesaurus for translations.
► Our proposed CLTC technique demonstrates its cross-lingual capability.
► The cluster-based category assignment method outperforms the individual-based method.
► The individual-based method is less sensitive to the number of training examples.

Text categorization pertains to the automatic learning of a text categorization model from a training set of preclassified documents on the basis of their contents, and the subsequent assignment of unclassified documents to appropriate categories. Most existing text categorization techniques deal with monolingual documents (i.e., documents written in the same language) both when learning the text categorization model and when assigning (or predicting) categories for unclassified documents. However, with the globalization of business environments and advances in Internet technology, an organization or individual may generate documents in one language, organize them into categories, and subsequently archive documents in other languages into those existing categories, which necessitates cross-lingual text categorization (CLTC). Specifically, cross-lingual text categorization deals with learning a text categorization model from a set of training documents written in one language (e.g., L1) and then classifying new documents written in a different language (e.g., L2). Motivated by the significance of this demand, this study aims to design a CLTC technique with two different category assignment methods, namely, individual-based and cluster-based. Using monolingual text categorization as a performance reference, our empirical evaluation results demonstrate the cross-lingual capability of the proposed CLTC technique. Moreover, the classification accuracy achieved by the cluster-based category assignment method is statistically significantly higher than that attained by the individual-based method.
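
The general CLTC setting described in the abstract (train on L1 documents, map L2 documents into the L1 term space through a bilingual thesaurus, then assign a category) can be illustrated with a minimal sketch. This is not the authors' technique: the centroid classifier, the hand-written `THESAURUS`, the toy training data, and all function names are assumptions made for illustration, and the sketch does not attempt to reproduce the individual-based or cluster-based assignment methods, whose details are not given here.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy bilingual thesaurus (an assumption, written by hand here):
# each L2 term maps to related L1 terms with an association weight.
THESAURUS = {
    "banco": {"bank": 0.9, "finance": 0.4},
    "prestamo": {"loan": 0.8},
    "futbol": {"football": 0.9, "match": 0.3},
    "gol": {"goal": 0.9},
}

# Hypothetical preclassified training documents in L1 (tokens, category).
TRAIN_L1 = [
    (["bank", "loan", "interest"], "finance"),
    (["finance", "bank", "market"], "finance"),
    (["football", "goal", "match"], "sports"),
    (["match", "team", "goal"], "sports"),
]


def normalize(vec):
    """Scale a term-weight mapping to unit length (empty if all-zero)."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def train_centroids(labeled_docs):
    """Learn one unit-length term-frequency centroid per category from L1 docs."""
    totals = defaultdict(Counter)
    for tokens, label in labeled_docs:
        totals[label].update(tokens)
    return {label: normalize(counts) for label, counts in totals.items()}


def translate(tokens_l2, thesaurus):
    """Map L2 tokens to a weighted L1 term vector; unknown L2 terms are dropped."""
    vec = Counter()
    for tok in tokens_l2:
        for term, weight in thesaurus.get(tok, {}).items():
            vec[term] += weight
    return normalize(vec)


def classify(vec, centroids):
    """Return the category whose centroid has the highest cosine similarity."""
    def score(label):
        centroid = centroids[label]
        return sum(w * centroid.get(t, 0.0) for t, w in vec.items())
    return max(centroids, key=score)


centroids = train_centroids(TRAIN_L1)
print(classify(translate(["banco", "prestamo"], THESAURUS), centroids))  # finance
print(classify(translate(["futbol", "gol"], THESAURUS), centroids))      # sports
```

In this sketch the model never sees an L2 document during training; all cross-lingual knowledge lives in the thesaurus, so the quality of the term mapping bounds the quality of the category assignment.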

Metrics

14 Record Views
7 citations in Scopus

Details

UN Sustainable Development Goals (SDGs)

This publication has contributed to the advancement of the following goals:

#3 Good Health and Well-Being

InCites Highlights

Data related to this publication, from InCites Benchmarking & Analytics tool:

Collaboration types
Domestic collaboration
International collaboration
Web of Science research areas
Computer Science, Information Systems
Information Science & Library Science