Conference proceeding
Semantic smoothing for model-based document clustering
ICDM 2006: SIXTH INTERNATIONAL CONFERENCE ON DATA MINING, PROCEEDINGS, pp 1193-1198
01 Jan 2006
Abstract
A document is often full of class-independent "general" words and short of class-specific "core" words, which makes document clustering difficult. We argue that both problems can be relieved by suitable smoothing of document models in agglomerative approaches and of cluster models in partitional approaches, thereby improving clustering quality. To the best of our knowledge, most model-based clustering approaches use Laplacian smoothing to prevent zero probabilities, while most similarity-based approaches employ the heuristic TF*IDF scheme to discount the effect of "general" words. Inspired by a series of statistical translation language models for text retrieval, we propose in this paper a novel smoothing method, referred to as context-sensitive semantic smoothing, for document clustering. The comparative experiment on three datasets shows that model-based clustering approaches with semantic smoothing are effective in improving cluster quality.
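As a rough illustration of the two kinds of smoothing the abstract contrasts, the sketch below shows add-alpha (Laplacian) smoothing of a unigram cluster model, followed by a generic linear mixture with a translation-style component. This record does not give the paper's actual context-sensitive semantic smoothing formula, so the mixture is only an assumed stand-in. The function names, the `lam` coefficient, and the toy translation probabilities are all hypothetical.

```python
from collections import Counter


def laplace_unigram(tokens, vocabulary, alpha=1.0):
    """Add-alpha (Laplacian) smoothed unigram model p(w | cluster).

    Every vocabulary word gets a non-zero probability, even if unseen.
    """
    counts = Counter(tokens)
    # Only count tokens that belong to the vocabulary so the result sums to 1.
    total = sum(counts[w] for w in vocabulary) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}


def interpolate(simple_model, translation_model, lam=0.5):
    """Hypothetical linear mixture: (1 - lam) * p_simple(w) + lam * p_trans(w).

    `translation_model` stands in for probabilities derived from a
    translation-style component; it is an assumption, not the paper's model.
    """
    return {
        w: (1.0 - lam) * p + lam * translation_model.get(w, 0.0)
        for w, p in simple_model.items()
    }


# Toy example: "protein" never occurs in the cluster text.
vocabulary = ["gene", "protein", "the"]
tokens = ["the", "the", "gene"]

simple = laplace_unigram(tokens, vocabulary)
# Assumed translation probabilities that favour the "core" word.
translation = {"protein": 1.0}
mixed = interpolate(simple, translation, lam=0.5)
```

In this toy case the Laplacian model only lifts "protein" off zero, whereas the mixture moves substantial probability mass onto it, which is the kind of effect a semantic component is meant to have on "core" words.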
Details
- Title
- Semantic smoothing for model-based document clustering
- Creators
- Xiaodan Zhang - Drexel University; Xiaohua Zhou - Drexel University; Xiaohua Hu - Drexel University
- Contributors
- C W Clifton (Editor); N Zhong (Editor); J M Liu (Editor); B W Wah (Editor); X D Wu (Editor)
- Publication Details
- ICDM 2006: SIXTH INTERNATIONAL CONFERENCE ON DATA MINING, PROCEEDINGS, pp 1193-1198
- Series
- IEEE International Conference on Data Mining
- Publisher
- IEEE
- Number of pages
- 6
- Grant note
- 240205; 240196 / PA Dept of Health Tobacco Settlement Formula NSF IIS 0448023 / NSF; National Science Foundation (NSF) 239667 / PA Dept of Health 0514679 / NSF CCF
- Resource Type
- Conference proceeding
- Language
- English
- Academic Unit
- Information Science
- Web of Science ID
- WOS:000245601900152
- Scopus ID
- 2-s2.0-52649114770
- Other Identifier
- 991019170400304721
InCites Highlights
Data related to this publication, from InCites Benchmarking & Analytics tool:
- Web of Science research areas
- Computer Science, Information Systems