Journal article
A Framework for Semisupervised Feature Generation and Its Applications in Biomedical Literature Mining
IEEE/ACM transactions on computational biology and bioinformatics, v 8(2), pp 294-307
01 Mar 2011
PMID: 20876938
Featured in Collection : UN Sustainable Development Goals @ Drexel
Abstract
Feature representation is essential to machine learning and text mining. In this paper, we present a feature coupling generalization (FCG) framework for generating new features from unlabeled data. It selects two special types of features, i.e., example-distinguishing features (EDFs) and class-distinguishing features (CDFs), from the original feature set, and then generalizes EDFs into higher-level features based on their coupling degrees with CDFs in unlabeled data. The advantage is that EDFs with extreme sparsity in labeled data can be enriched by their co-occurrences with CDFs in unlabeled data, so that the performance of these low-frequency features can be greatly boosted and new information from unlabeled data can be incorporated. We apply this approach to three tasks in biomedical literature mining: gene named entity recognition (NER), protein-protein interaction extraction (PPIE), and text classification (TC) for gene ontology (GO) annotation. New features are generated from over 20 GB of unlabeled PubMed abstracts. The experimental results on BioCreative 2, the AIMED corpus, and the TREC 2005 Genomics Track show that 1) FCG can make good use of the sparse features ignored by supervised learning; 2) it improves the performance of supervised baselines by 7.8 percent, 5.0 percent, and 5.8 percent, respectively, in the three tasks; and 3) our methods achieve F-scores of 89.1 and 64.5 and a normalized utility of 60.1 on the three benchmark data sets.
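The coupling step described in the abstract can be illustrated with a small sketch. The abstract does not say how the coupling degree is measured, so the code below assumes pointwise mutual information (PMI) over document-level co-occurrence; the function name `coupling_features`, the toy documents, and the EDF/CDF selections are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from itertools import product


def coupling_features(unlabeled_docs, edfs, cdfs):
    """Map each EDF to a vector of coupling scores, one per CDF.

    Assumption: the coupling degree is PMI computed from document
    co-occurrence counts. The paper may use a different measure.
    """
    edf_set, cdf_set = set(edfs), set(cdfs)
    n_docs = 0
    single = Counter()  # number of documents containing each feature
    joint = Counter()   # number of documents containing an (EDF, CDF) pair

    for doc in unlabeled_docs:
        tokens = set(doc)
        n_docs += 1
        seen_edfs = tokens & edf_set
        seen_cdfs = tokens & cdf_set
        single.update(seen_edfs | seen_cdfs)
        joint.update(product(seen_edfs, seen_cdfs))

    features = {}
    for e in edfs:
        vector = []
        for c in cdfs:
            co = joint[(e, c)]
            if co == 0:
                # No co-occurrence observed: fall back to a neutral score.
                vector.append(0.0)
            else:
                vector.append(math.log(co * n_docs / (single[e] * single[c])))
        features[e] = vector
    return features


# Toy unlabeled corpus (hypothetical data).
docs = [
    ["p53", "gene"],
    ["p53", "gene"],
    ["aspirin", "drug"],
    ["p53", "drug"],
]
feats = coupling_features(docs, edfs=["p53", "aspirin"], cdfs=["gene", "drug"])
```

In this sketch, a rare EDF such as `aspirin` receives a dense vector of scores against the CDFs, which a downstream classifier could use in place of, or alongside, the sparse original feature.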
Details
- Title
- A Framework for Semisupervised Feature Generation and Its Applications in Biomedical Literature Mining
- Creators
- Yanpeng Li - Dalian University of Technology
- Xiaohua Hu - Drexel University
- Hongfei Lin - Dalian University of Technology
- Zhihao Yang - Dalian Univ Technol, Coll Comp Sci & Technol, Dalian 116024, Liaoning, Peoples R China
- Publication Details
- IEEE/ACM transactions on computational biology and bioinformatics, v 8(2), pp 294-307
- Publisher
- IEEE
- Number of pages
- 14
- Grant note
- 60673039; 60973068; 90920005 / Natural Science Foundation of China; National Natural Science Foundation of China (NSFC)
- 20090041110002 / Ministry of Education of China; Ministry of Education, China
- NSF CCF 0905291; NSF IIS 1049864; NSF IIP 0934197 / National Science Foundation (USA); National Science Foundation (NSF)
- 2006AA01Z151 / National High Tech Research and Development Plan of China; National High Technology Research and Development Program of China
- Resource Type
- Journal article
- Language
- English
- Academic Unit
- Information Science
- Web of Science ID
- WOS:000286146600003
- Scopus ID
- 2-s2.0-79551655977
- Other Identifier
- 991019167702304721
InCites Highlights
Data related to this publication, from InCites Benchmarking & Analytics tool:
- Collaboration types
- Domestic collaboration
- International collaboration
- Web of Science research areas
- Biochemical Research Methods
- Computer Science, Interdisciplinary Applications
- Mathematics, Interdisciplinary Applications
- Statistics & Probability