Logo image
Constructing Cross-lingual Consumer Health Vocabulary with Word-Embedding from Comparable User Generated Content
Preprint   Open access

Constructing Cross-lingual Consumer Health Vocabulary with Word-Embedding from Comparable User Generated Content

Chia-Hsuan Chang, Lei Wang and Christopher C Yang
arXiv (Cornell University)
23 Jun 2022
url
https://doi.org/10.48550/arxiv.2206.11612View
Preprint (Author's original)arXiv.org - Non-exclusive license to distribute Open

Abstract

Computer Science - Computation and Language Computer Science - Information Retrieval
The online health community (OHC) is the primary channel for laypeople to share health information. To analyze the health consumer-generated content (HCGC) from the OHCs, identifying the colloquial medical expressions used by laypeople is a critical challenge. The open-access and collaborative consumer health vocabulary (OAC CHV) is the controlled vocabulary for addressing such a challenge. Nevertheless, OAC CHV is only available in English, limiting the applicability to other languages. This research aims to propose a cross-lingual automatic term recognition framework for extending the English OAC CHV into a cross-lingual one. Our framework requires an English HCGC corpus and a non-English (i.e., Chinese in this study) HCGC corpus as inputs. Two monolingual word vector spaces are determined using skip-gram algorithm so that each space encodes common word associations from laypeople within a language. Based on isometry assumption, the framework align two monolingual spaces into a bilingual word vector space, where we employ cosine similarity as a metric for identifying semantically similar words across languages. In the experiments, our framework demonstrates that it can effectively retrieve similar medical terms, including colloquial expressions, across languages and further facilitate compilation of cross-lingual CHV.

Metrics

17 Record Views

Details

Logo image