Text clustering and active learning using a LSI subspace signature model and query expansion

Weizhong Zhu

doi:10.17918/etd-3077

In this dissertation research, we developed a novel Latent Semantic Indexing Subspace Signature Model (LSISSM) for semantic content representation of unstructured text, based on the Singular Value Decomposition (SVD). The model represents the meanings of terms according to the distribution of their statistical contribution across the top-ranking LSI latent concept dimensions. Each LSI latent concept dimension is related to one or more themes and their contexts, which are composed of semantically coherent topics, entities and social indicators and are supported by a set of related documents. The model provides feature reduction and finds a low-rank approximation of the scalable and sparse term-document matrix. First, the top-ranking conceptual terms or term clusters are selected to represent the corpora according to their global statistical contribution to the LSI term subspace. Second, terms and documents are defined as spectral signatures, which are represented by the distribution of their local statistical contribution over the same LSI latent concept dimensions. Then, two novel similarity measures are defined to evaluate the associations between the concept signatures and the document signatures while reducing noise. Finally, the model bridges the LSI subspaces naturally and produces a low-dimensional term-document matrix. Experiments suggest that this model significantly improves both the efficiency and the effectiveness of clustering algorithms such as basic K-means and the Self-Organizing Map (SOM), compared with the Vector Space Model and the traditional LSI model. Our model is also suitable for active learning, which significantly decreases the number of training examples through bootstrapped sampling and iterative learning without degrading the performance of the classifiers. The LSI Subspace Signature Model selects the document samples iteratively according to their statistical contribution to the LSI document subspace.
The sampling method is evaluated in the context of text categorization with three classic classifiers on three standard news corpora. The results indicate that our approach improves the selection of the sampling subsets in terms of both the sampling distribution and the learning performance of the classifiers. The method picks the most important samples and preserves the sampling distribution across the text categories, including outlier categories. The tests demonstrated that the sample subsets with the optimized feature sets substantially improve both the effectiveness and the efficiency of the three classifiers: Naïve Bayes, K-Nearest Neighbor and Rocchio. Four types of ontology-based query expansion strategies are applied to MEDLINE abstracts and OCR news text. UMLS-based term re-weighting and user relevance feedback with IPTC hyponyms significantly improve the performance of the IR models VSM and BM25. In addition, a novel random-walk centrality measure is developed to overcome the rank sink problem of the PageRank algorithm.
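
The signature idea described in the abstract can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function names lsi_signatures and top_terms are hypothetical, and "statistical contribution" is assumed here to mean squared coordinate magnitude in the truncated SVD subspace, which the abstract does not specify. The two similarity measures mentioned above are not attempted.

```python
import numpy as np

def lsi_signatures(A, k):
    """Sketch of LSI signatures for a term-document matrix A (terms x docs).

    Assumption (not from the source): the contribution of a term or document
    to a latent dimension is its squared coordinate on that dimension.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Coordinates in the top-k LSI subspace, scaled by the singular values.
    term_coords = U[:, :k] * s[:k]        # shape (n_terms, k)
    doc_coords = Vt[:k, :].T * s[:k]      # shape (n_docs, k)

    def to_signature(coords):
        # Local contribution: share of each row's energy per latent dimension.
        energy = coords ** 2
        totals = energy.sum(axis=1, keepdims=True)
        return np.divide(energy, totals,
                         out=np.zeros_like(energy), where=totals > 0)

    # Global contribution: total energy of each term in the subspace.
    term_global = (term_coords ** 2).sum(axis=1)
    return to_signature(term_coords), to_signature(doc_coords), term_global

def top_terms(term_global, n):
    """Indices of the n terms with the largest global contribution."""
    return np.argsort(-term_global)[:n]

# Toy matrix: two topical blocks of terms plus one term used in every document.
A = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [0, 0, 1, 2],
              [0, 0, 2, 1],
              [1, 1, 1, 1]], dtype=float)
term_sig, doc_sig, term_global = lsi_signatures(A, k=2)
selected = top_terms(term_global, n=4)
```

Each row of term_sig and doc_sig is a distribution over the same k latent dimensions, so terms and documents become directly comparable in a k-column representation rather than in the full sparse term-document space.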
