Building parallel corpora by automatic title alignment using length-based and text-based approaches

Christopher C. Yang; Kar Wing Li

doi:10.1016/j.ipm.2003.11.002

Journal article

Peer reviewed

Information processing & management, v 40(6), pp 939-955

01 Nov 2004

DOI: https://doi.org/10.1016/j.ipm.2003.11.002

Abstract

Keywords: Covert translation; Cross-lingual information retrieval; Parallel corpus; Sentence alignment

Cross-lingual semantic interoperability has drawn significant attention in recent digital library and World Wide Web research as the volume of information in languages other than English has grown exponentially. Cross-lingual information retrieval (CLIR) across European languages, such as English, Spanish, and French, has been widely explored; however, CLIR between European and Oriental languages is still at an early stage. For crossing language boundaries, the corpus-based approach is a promising way to overcome the limitations of the knowledge-based and controlled-vocabulary approaches, but collecting parallel corpora between European and Oriental languages is not an easy task. Length-based and text-based approaches are the two major approaches to aligning parallel documents. In this paper, we investigate several techniques using these approaches and compare their performance in aligning the English and Chinese titles of parallel documents available on the Web.

Metrics

8 Record Views

16 citations in Web of Science

17 citations in Scopus

Details

Title: Building parallel corpora by automatic title alignment using length-based and text-based approaches
Creators: Christopher C. Yang (Chinese University of Hong Kong); Kar Wing Li
Publication Details: Information processing & management, v 40(6), pp 939-955
Publisher: Elsevier
Resource Type: Journal article
Language: English
Academic Unit: Information Science
Web of Science ID: WOS:000223985200004
Scopus ID: 2-s2.0-4243049091
Other Identifier: 991021855283104721

InCites Highlights

Data related to this publication, from InCites Benchmarking & Analytics tool:

Web of Science research areas: Computer Science, Information Systems; Information Science & Library Science
