Cross-lingual text categorization: Conquering language boundaries in globalized environments

Chih-Ping Wei; Yen-Ting Lin; Christopher C Yang

doi:10.1016/j.ipm.2011.01.011

Journal article

Open access

Peer reviewed

Information processing & management, v 47(5), pp 786-804

2011

DOI: https://doi.org/10.1016/j.ipm.2011.01.011

Featured in Collection : UN Sustainable Development Goals @ Drexel

Files and links (1)

http://ntur.lib.ntu.edu.tw/bitstream/246246/245894/-1/52.pdf

Submitted; Open Access (License Unspecified)

Abstract

Keywords: Text mining; Cross-lingual text categorization; Document management; Text categorization

► Our CLTC technique uses a statistical-based bilingual thesaurus for translations.
► Our proposed CLTC technique demonstrates its cross-lingual capability.
► The cluster-based category assignment method outperforms the individual-based method.
► The individual-based method is less sensitive to the size of training examples.

Text categorization pertains to the automatic learning of a text categorization model from a training set of preclassified documents on the basis of their contents and the subsequent assignment of unclassified documents to appropriate categories. Most existing text categorization techniques deal with monolingual documents (i.e., written in the same language) during the learning of the text categorization model and category assignment (or prediction) for unclassified documents. However, with the globalization of business environments and advances in Internet technology, an organization or individual may generate and organize into categories documents in one language and subsequently archive documents in different languages into existing categories, which necessitates cross-lingual text categorization (CLTC). Specifically, cross-lingual text categorization deals with learning a text categorization model from a set of training documents written in one language (e.g., L1) and then classifying new documents in a different language (e.g., L2). Motivated by the significance of this demand, this study aims to design a CLTC technique with two different category assignment methods, namely, individual- and cluster-based. Using monolingual text categorization as a performance reference, our empirical evaluation results demonstrate the cross-lingual capability of the proposed CLTC technique. Moreover, the classification accuracy achieved by the cluster-based category assignment method is statistically significantly higher than that attained by the individual-based method.
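The general idea in the abstract (train on documents in one language, map documents in a second language through a bilingual thesaurus, then assign a category) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the toy thesaurus entries and weights, the Spanish/English language pair, the training sentences, the centroid-plus-cosine classifier, and the function names `train_centroids`, `translate`, `cosine`, and `classify`. The abstract does not specify how the thesaurus is built or how the individual- and cluster-based assignment methods work, so the sketch simply classifies each document on its own.

```python
from collections import Counter
import math

# Hypothetical toy thesaurus: each L2 term maps to weighted L1 terms.
# A statistically derived thesaurus would be learned from a corpus;
# these entries and weights are made up for illustration.
THESAURUS = {
    "mercado": {"market": 0.9, "trade": 0.4},
    "acciones": {"stock": 0.8, "shares": 0.7},
    "partido": {"match": 0.8, "game": 0.5},
    "gol": {"goal": 0.9},
}

# Hypothetical preclassified training documents written in L1.
TRAIN_L1 = [
    ("stock market shares rise", "finance"),
    ("trade market report", "finance"),
    ("goal wins the match", "sports"),
    ("game ends after late goal", "sports"),
]


def train_centroids(docs):
    """Sum term counts per category to form one centroid vector each."""
    centroids = {}
    for text, label in docs:
        centroids.setdefault(label, Counter()).update(text.split())
    return centroids


def translate(l2_text, thesaurus):
    """Map L2 tokens to a weighted bag of L1 terms; unknown tokens are dropped."""
    vec = Counter()
    for token in l2_text.split():
        for l1_term, weight in thesaurus.get(token, {}).items():
            vec[l1_term] += weight
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(l2_text, centroids, thesaurus):
    """Translate one L2 document into L1 term space and pick the closest category."""
    vec = translate(l2_text, thesaurus)
    return max(centroids, key=lambda c: cosine(vec, centroids[c]))


centroids = train_centroids(TRAIN_L1)
print(classify("mercado acciones", centroids, THESAURUS))
print(classify("gol partido", centroids, THESAURUS))
```

With this toy data the first L2 document lands in "finance" and the second in "sports", because their translated term vectors overlap only with the corresponding L1 centroid. A realistic system would need a much larger thesaurus, term weighting, and a stronger classifier.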

Metrics

14 Record Views

7 citations in Web of Science

7 citations in Scopus

Details

Title: Cross-lingual text categorization: Conquering language boundaries in globalized environments
Creators: Chih-Ping Wei - Department of Information Management, College of Management, National Taiwan University, Taipei, Taiwan, ROC
Yen-Ting Lin - Science & Technology Policy Research and Information Center, National Applied Research Laboratories, Taipei, Taiwan, ROC
Christopher C Yang - College of Information Science and Technology, Drexel University, Philadelphia, PA, USA
Publication Details: Information processing & management, v 47(5), pp 786-804
Publisher: Elsevier
Resource Type: Journal article
Language: English
Academic Unit: Information Science
Web of Science ID: WOS:000294087400014
Scopus ID: 2-s2.0-79960841151
Other Identifier: 991014878452404721

InCites Highlights

Data related to this publication, from InCites Benchmarking & Analytics tool:

Collaboration types: Domestic collaboration; International collaboration
Web of Science research areas: Computer Science, Information Systems; Information Science & Library Science
