Semantic Smoothing for Bayesian Text Classification with Small Training Data

Xiaohua Zhou; Xiaodan Zhang; Xiaohua Hu

Journal article

Society for Industrial and Applied Mathematics. Proceedings of the SIAM International Conference on Data Mining, pp.289-289

01 Jan 2008

Abstract

Bayesian analysis; Classifiers; Materials handling; Semantics; Signatures; Smoothing; Texts; Training

Bayesian text classifiers commonly suffer from the data sparsity problem, especially when the training set is very small. The frequently used Laplacian smoothing and corpus-based background smoothing do not handle it effectively. We instead propose a novel semantic smoothing method to address the sparsity problem. Our method extracts explicit topic signatures (e.g., words, multiword phrases, and ontology-based concepts) from a document and then statistically maps them into single-word features. We conduct comprehensive experiments on three test collections (OHSUMED, LATimes, and 20NG) to compare semantic smoothing with other approaches. When the number of training documents is small, the Bayesian classifier with semantic smoothing not only outperforms the classifiers with background smoothing and Laplacian smoothing, but also beats state-of-the-art active learning classifiers and SVM classifiers. We also compare the three types of topic signatures with respect to their effectiveness and efficiency for semantic smoothing.
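
As a rough illustration of how the three smoothing schemes named in the abstract differ, the Python sketch below estimates class-conditional word probabilities on toy data. The abstract gives no formulas, so everything here is an assumption made for illustration and not the authors' implementation: the linear-interpolation form, the coefficients `alpha`, `beta`, and `lam`, the hand-written `translation` table (which stands in for the statistical mapping from topic signatures to single words), and all counts.

```python
from collections import Counter

def laplace(word_counts, vocab, alpha=1.0):
    """Add-alpha (Laplacian) estimate of p(w | class)."""
    total = sum(word_counts[w] for w in vocab)
    denom = total + alpha * len(vocab)
    return {w: (word_counts[w] + alpha) / denom for w in vocab}

def background(word_counts, corpus_counts, vocab, beta=0.5):
    """Interpolate the class estimate with a corpus-wide (background) model."""
    total = sum(word_counts[w] for w in vocab)
    corpus_total = sum(corpus_counts[w] for w in vocab)
    return {
        w: (1 - beta) * word_counts[w] / total
           + beta * corpus_counts[w] / corpus_total
        for w in vocab
    }

def semantic(base, signature_counts, translation, vocab, lam=0.4):
    """Mix a base model with word mass mapped from topic signatures.

    Assumed form: p(w|c) = (1 - lam) * base(w|c)
                           + lam * sum_t p(w|t) * p(t|c)
    """
    sig_total = sum(signature_counts.values())
    smoothed = {}
    for w in vocab:
        mapped = sum(
            translation[t].get(w, 0.0) * n / sig_total
            for t, n in signature_counts.items()
        )
        smoothed[w] = (1 - lam) * base[w] + lam * mapped
    return smoothed

# Toy data (hypothetical): one class observed in very few training documents.
vocab = ["space", "shuttle", "orbit", "nasa"]
class_counts = Counter({"space": 3, "shuttle": 1})  # "orbit", "nasa" unseen
corpus_counts = Counter({"space": 10, "shuttle": 5, "orbit": 3, "nasa": 2})

# One multiword topic signature seen twice in the class, with an assumed
# table of p(word | signature); each row sums to 1.
signature_counts = Counter({"space shuttle": 2})
translation = {
    "space shuttle": {"space": 0.3, "shuttle": 0.3, "orbit": 0.2, "nasa": 0.2}
}

p_laplace = laplace(class_counts, vocab)
p_background = background(class_counts, corpus_counts, vocab)
p_semantic = semantic(p_background, signature_counts, translation, vocab)

for name, dist in [("laplace", p_laplace),
                   ("background", p_background),
                   ("semantic", p_semantic)]:
    print(name, {w: round(p, 3) for w, p in dist.items()})
```

In this toy setup the words "orbit" and "nasa" never occur in the class's training text, yet the semantic estimate gives them more probability mass than background smoothing does, because the observed signature "space shuttle" maps onto them. That is the intuition behind mapping topic signatures to single-word features when training data is scarce.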

Details

Title: Semantic Smoothing for Bayesian Text Classification with Small Training Data
Creators: Xiaohua Zhou; Xiaodan Zhang; Xiaohua Hu
Publication Details: Society for Industrial and Applied Mathematics. Proceedings of the SIAM International Conference on Data Mining, pp.289-289
Conference: Society for Industrial and Applied Mathematics. SIAM International Conference on Data Mining
Number of pages: 1
Resource Type: Journal article
Language: English
Academic Unit: Information Science (Informatics)
Identifiers: 991019189043204721
