Conference proceeding
Semantic Smoothing of Document Models for Agglomerative Clustering
20TH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, pp.2922-2927
01 Jan 2007
Abstract
In this paper, we argue that agglomerative clustering with the vector cosine similarity measure performs poorly for two reasons. First, the nearest neighbors of a document often belong to different classes, because any pair of documents shares many "general" words. Second, the sparsity of class-specific "core" words causes documents with the same class label to be grouped into different clusters. Both problems can be resolved by suitably smoothing the document models and using the Kullback-Leibler divergence between two smoothed models as the pairwise document distance. Inspired by recent work in information retrieval, we propose a novel context-sensitive semantic smoothing method that automatically identifies multiword phrases in a document and then statistically maps those phrases to individual document terms. We evaluate the new model-based similarity measure on three datasets, using the complete linkage criterion for agglomerative clustering, and find that it significantly improves clustering quality over the traditional vector cosine measure.
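The pipeline the abstract outlines (smooth each document model, take the Kullback-Leibler divergence between smoothed models as the distance, then cluster with complete linkage) can be illustrated with a minimal sketch. It rests on several assumptions that do not come from the paper: simple linear interpolation with a collection-wide background model stands in for the context-sensitive semantic smoothing, which would additionally map multiword phrases to terms; the divergence is symmetrised so that it can serve as a distance; and the toy documents, the function names, and the weight `lam` are all hypothetical.

```python
import math
from collections import Counter


def smoothed_model(tokens, background, lam=0.5):
    """Interpolate a document's unigram model with a background model.

    Assumption: plain background smoothing, not the paper's semantic smoothing.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: (1 - lam) * counts[w] / total + lam * p_bg
            for w, p_bg in background.items()}


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)


def pairwise_distance(p, q):
    """Symmetrised KL divergence (an assumption made for this sketch)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


def complete_linkage(dist, k):
    """Merge the two clusters with the smallest maximum pairwise distance
    until only k clusters remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical toy corpus: two finance documents and two sports documents.
docs = [
    "the stock market fell as investors sold shares".split(),
    "investors bought shares as the market rallied".split(),
    "the team won the match with a late goal".split(),
    "a late goal gave the team the match".split(),
]

# Background model: relative term frequencies over the whole collection.
all_tokens = [t for d in docs for t in d]
background = {w: c / len(all_tokens) for w, c in Counter(all_tokens).items()}

models = [smoothed_model(d, background) for d in docs]
dist = [[pairwise_distance(p, q) for q in models] for p in models]
clusters = complete_linkage(dist, k=2)
print(clusters)
```

Because every smoothed model assigns non-zero probability to every term in the collection vocabulary, the divergence stays finite even when two documents share no words, which is one practical reason for smoothing before the distance is computed.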
Details
- Title
- Semantic Smoothing of Document Models for Agglomerative Clustering
- Creators
- Xiaohua Zhou - Drexel Univ, Coll Informat Sci & Technol, Philadelphia, PA 19104 USA
- Xiaodan Zhang - Drexel Univ, Coll Informat Sci & Technol, Philadelphia, PA 19104 USA
- Xiaohua Hu - Drexel University
- Contributors
- M M Veloso (Editor)
- Publication Details
- 20TH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, pp.2922-2927
- Conference
- 20TH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 20th
- Publisher
- Ijcai-Int Joint Conf Artif Intell
- Number of pages
- 6
- Resource Type
- Conference proceeding
- Language
- English
- Academic Unit
- Information Science (Informatics)
- Identifiers
- 991019170403804721
InCites Highlights
These are selected metrics from the InCites Benchmarking & Analytics tool that relate to this output.
- Web of Science research areas
- Computer Science, Artificial Intelligence
- Computer Science, Theory & Methods