Exploiting external/domain knowledge to enhance traditional text mining using graph-based methods

Xiaodan Zhang

doi:10.17918/etd-3076

Dissertation

Open access

Doctor of Philosophy (Ph.D.), Drexel University

Jun 2009

DOI: https://doi.org/10.17918/etd-3076

Files and links (1)

Zhang_Xiaodan_2009 (PDF, 879.10 kB), Open Access (License Unspecified)

Abstract

Information science

Markov random fields

Computer Science

Finding the best way to use external/domain knowledge to enhance traditional text mining has been a challenging task. The difficulty centers on the lack of a means of representing a document with external/domain knowledge integrated. Graphs are versatile tools, useful in various subfields of science and engineering for their simple illustration of complicated problems. However, graph-based approaches to knowledge representation and discovery remain relatively unexplored. In this thesis, I propose a graph-based text mining system to incorporate semantic knowledge, document section knowledge, document linkage knowledge, and document category knowledge into the tasks of text clustering and topic analysis. I design a novel term-level graph knowledge representation and a graph-based clustering algorithm to incorporate semantic and document section knowledge for biomedical literature clustering and topic analysis. I present a Markov Random Field (MRF) with a Relaxation Labeling (RL) algorithm to incorporate document linkage knowledge. I evaluate different types of linkage among documents, including explicit linkage such as hyperlinks and citation links, implicit linkage such as coauthor links and co-citation links, and pseudo linkage such as similarity links. I develop a novel semantic-based method to integrate external knowledge from Wikipedia concepts and categories into traditional document clustering. To support these new approaches, I develop two automated algorithms to extract multiword phrases and ontological concepts, respectively. Evaluations on news, web, and biomedical literature data demonstrate the effectiveness of the proposed methods. In the document clustering experiments, the proposed term-level graph-based method not only outperforms the baseline k-means algorithm in all configurations but is also more efficient. The MRF-based algorithm significantly improves spherical k-means and model-based k-means clustering on the datasets containing explicit or implicit linkage; the Wikipedia knowledge-based clustering also improves on clustering based on document content alone. On the task of topic analysis, the proposed representation, subgraph detection, and graph ranking algorithms can effectively identify corpus-level and cluster-level topic terms.
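As a rough illustration of what a term-level graph with graph ranking can look like, the sketch below builds a term co-occurrence graph from a toy corpus and ranks terms by weighted degree. It is not the algorithm from the dissertation: the whitespace tokenization, the document-level co-occurrence weighting, the weighted-degree score, and the names `build_term_graph` and `rank_terms` are all assumptions chosen for brevity.

```python
from collections import Counter
from itertools import combinations


def build_term_graph(docs):
    """Return edge weights: the number of documents in which two terms co-occur.

    Hypothetical helper; tokenization is a plain lowercase whitespace split.
    """
    edges = Counter()
    for doc in docs:
        terms = sorted(set(doc.lower().split()))
        for a, b in combinations(terms, 2):
            edges[(a, b)] += 1
    return edges


def rank_terms(edges):
    """Rank terms by weighted degree (sum of incident edge weights).

    Weighted degree stands in here for a real graph ranking algorithm.
    """
    score = Counter()
    for (a, b), weight in edges.items():
        score[a] += weight
        score[b] += weight
    # Sort by descending score, then alphabetically for a stable order.
    return sorted(score.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy corpus (invented for illustration only).
docs = [
    "gene protein expression",
    "protein binding gene",
    "market stock price",
]
edges = build_term_graph(docs)
ranking = rank_terms(edges)
print(ranking[:2])
```

On this toy corpus, "gene" and "protein" co-occur in two documents and end up with the highest scores, which is the kind of signal a corpus-level topic-term ranking relies on.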

Details

Title: Exploiting external/domain knowledge to enhance traditional text mining using graph-based methods
Creators: Xiaodan Zhang - DU
Contributors: Xiaohua Hu (Advisor) - Drexel University (1970-)
Awarding Institution: Drexel University
Degree Awarded: Doctor of Philosophy (Ph.D.)
Publisher: Drexel University; Philadelphia, Pennsylvania
Resource Type: Dissertation
Language: English
Academic Unit: College of Information Science and Technology (1995-2013); Drexel University
Other Identifier: 3076; 991014632330704721
