Computer Science - Computation and Language; Computer Science - Information Retrieval
The online health community (OHC) is the primary channel for laypeople to
share health information. To analyze the health consumer-generated content
(HCGC) from the OHCs, identifying the colloquial medical expressions used by
laypeople is a critical challenge. The open-access and collaborative consumer
health vocabulary (OAC CHV) is the controlled vocabulary for addressing such a
challenge. Nevertheless, OAC CHV is only available in English, limiting its
applicability to other languages. This research proposes a cross-lingual
automatic term recognition framework for extending the English OAC CHV into a
cross-lingual one. Our framework requires an English HCGC corpus and a
non-English (i.e., Chinese in this study) HCGC corpus as inputs. Two
monolingual word vector spaces are learned using the skip-gram algorithm so that
each space encodes common word associations from laypeople within a language.
Based on an isometry assumption, the framework aligns the two monolingual spaces into a
bilingual word vector space, where we employ cosine similarity as a metric for
identifying semantically similar words across languages. In the experiments,
our framework demonstrates that it can effectively retrieve similar medical
terms, including colloquial expressions, across languages and further
facilitate the compilation of a cross-lingual CHV.
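
The alignment and retrieval steps described above can be illustrated with a minimal sketch. It assumes the alignment is solved as an orthogonal Procrustes problem over a small seed dictionary, which is one common way to exploit an isometry assumption; the abstract does not state which solver the framework actually uses. The word lists, the random vectors standing in for skip-gram embeddings, and the function names `learn_orthogonal_map` and `nearest_by_cosine` are all hypothetical.

```python
import numpy as np


def learn_orthogonal_map(src, tgt):
    """Return the orthogonal W minimizing ||src @ W - tgt||_F (Procrustes)."""
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


def nearest_by_cosine(query, matrix, labels, k=1):
    """Rank the rows of `matrix` by cosine similarity to `query`."""
    q = query / np.linalg.norm(query)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = m @ q
    order = np.argsort(-scores)[:k]
    return [(labels[i], float(scores[i])) for i in order]


# Hypothetical vocabularies; row i of each matrix is the vector of word i.
en_words = ["hypertension", "high blood pressure", "diabetes", "headache", "fever"]
zh_words = ["高血压", "血压高", "糖尿病", "头痛", "发烧"]

# Random stand-ins for skip-gram vectors (assumption: real vectors would be
# trained separately on the English and Chinese HCGC corpora).
rng = np.random.default_rng(0)
en_vecs = rng.normal(size=(5, 3))

# Simulate a perfectly isometric Chinese space: a hidden rotation of the
# English space. Real embedding spaces are only approximately isometric.
hidden_rotation, _ = np.linalg.qr(rng.normal(size=(3, 3)))
zh_vecs = en_vecs @ hidden_rotation

# Seed dictionary: every pair except index 1, which is held out for testing.
seed = [0, 2, 3, 4]
W = learn_orthogonal_map(en_vecs[seed], zh_vecs[seed])

# Map the held-out English term into the Chinese space and retrieve its
# nearest neighbour by cosine similarity.
matches = nearest_by_cosine(en_vecs[1] @ W, zh_vecs, zh_words, k=1)
```

In this idealized setting the held-out term maps onto its counterpart, because the two toy spaces differ only by a rotation. With real corpora the match is approximate, so the ranked candidates would more plausibly serve as suggestions for human review when compiling vocabulary entries.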
Details
Title
Constructing Cross-lingual Consumer Health Vocabulary with Word-Embedding from Comparable User Generated Content