Conference proceeding
Information-theoretic Term Weighting Schemes for Document Clustering
JCDL'13: PROCEEDINGS OF THE 13TH ACM/IEEE-CS JOINT CONFERENCE ON DIGITAL LIBRARIES, pp.143-152
01 Jan 2013
Abstract
We propose a new theory to quantify information in probability distributions and derive a new document representation model for text clustering. By extending Shannon entropy to accommodate a non-linear relation between information and uncertainty, the proposed Least Information theory (LIT) provides insight into how terms can be weighted based on their probability distributions in documents versus in the collection. We derive two basic quantities in the document clustering context: 1) LI Binary (LIB), which quantifies information due to the observation of a term's (binary) occurrence in a document; and 2) LI Frequency (LIF), which measures information for the observation of a randomly picked term from the document. Both quantities are computed given term distributions in the document collection as prior knowledge and can be used separately or combined to represent documents for text clustering. Experiments on four benchmark text collections demonstrate strong performance of the proposed methods compared to classic TF*IDF. In particular, the LIB*LIF weighting scheme, which combines LIB and LIF, consistently outperforms TF*IDF on multiple evaluation metrics. The least information measure has a potentially broad range of applications beyond text clustering.
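For orientation, the classic TF*IDF baseline that the abstract compares against can be sketched as follows. This is a minimal illustration, not the paper's method: the abstract does not state which TF*IDF variant was used, so raw term frequency and an unsmoothed log(N/df) are assumptions, and the function name `tf_idf` and the toy documents are hypothetical. The LIB and LIF formulas are not given in this record, so they are not reproduced here.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Weight terms by raw term frequency times log(N / document frequency).

    docs is a list of token lists. Returns one {term: weight} dict per document.
    Assumed variant: no smoothing and no length normalization.
    """
    n_docs = len(docs)

    # Document frequency: number of documents in which each term occurs.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    weights = []
    for tokens in docs:
        tf = Counter(tokens)  # raw within-document term counts
        weights.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return weights


# Hypothetical toy collection, used only to exercise the function.
docs = [
    ["term", "weighting", "clustering"],
    ["term", "entropy", "entropy"],
    ["term", "clustering", "documents"],
]
vectors = tf_idf(docs)
```

A term that occurs in every document (here "term") gets weight zero under this variant, while a term concentrated in a single document (here "entropy") gets the largest weight. Both the binary document-level signal and the within-document frequency signal are visible in this baseline, which are the two kinds of observation the abstract associates with LIB and LIF respectively.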
Details
- Title
- Information-theoretic Term Weighting Schemes for Document Clustering
- Creators
- Weimao Ke - Drexel University, ACM
- Publication Details
- JCDL'13: PROCEEDINGS OF THE 13TH ACM/IEEE-CS JOINT CONFERENCE ON DIGITAL LIBRARIES, pp.143-152
- Conference
- JCDL'13: 13TH ACM/IEEE-CS JOINT CONFERENCE ON DIGITAL LIBRARIES, 13th
- Series
- ACM-IEEE Joint Conference on Digital Libraries JCDL
- Publisher
- Assoc Computing Machinery
- Number of pages
- 10
- Grant note
- LG-06-11-0332-11 / IMLS
- Resource Type
- Conference proceeding
- Language
- English
- Academic Unit
- Information Science (Informatics)
- Identifiers
- 991019170448804721
InCites Highlights
These are selected metrics from InCites Benchmarking & Analytics tool, related to this output
- Web of Science research areas
- Information Science & Library Science