Journal article
Information-theoretic term weighting schemes for document clustering and classification
International journal on digital libraries, v 16(2), pp 145-159
01 Jun 2015
Featured in Collection: UN Sustainable Development Goals @ Drexel
Abstract
We propose a new theory for quantifying information in probability distributions and derive from it a new document representation model for text clustering and classification. By extending Shannon entropy to accommodate a non-linear relation between information and uncertainty, the proposed least information (LI) theory provides insight into how terms can be weighted based on their probability distributions in documents versus in the collection. We derive two basic quantities for document representation: (1) LI Binary (LIB), which quantifies the information due to observing a term's (binary) occurrence in a document; and (2) LI Frequency (LIF), which measures the information due to observing a randomly picked term from the document. Both quantities are computed from terms' prior distributions in the entire collection and their posterior distributions in a document. LIB and LIF can be used individually or combined to represent documents for text clustering and classification. Experiments on four benchmark text collections demonstrate strong performance of the proposed methods compared to classic TF*IDF. In particular, the LIB*LIF weighting scheme, which combines LIB and LIF, consistently outperforms TF*IDF on multiple evaluation metrics. The least information measure has a potentially broad range of applications beyond text clustering and classification.
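As a rough illustration of the inputs the abstract refers to, the sketch below computes each term's prior distribution over the whole collection, its posterior distribution within each document, and the classic TF*IDF baseline used for comparison. It does not implement LIB or LIF, since the abstract does not state their formulas. The function names `term_distributions` and `tf_idf`, the tokenized-list input format, and the unsmoothed `tf * log(N / df)` variant of TF*IDF are assumptions made for this example, not details taken from the article.

```python
import math
from collections import Counter


def term_distributions(docs):
    """Return (prior, posteriors) for a list of tokenized documents.

    prior:      term -> relative frequency over the entire collection
    posteriors: one dict per document, term -> relative frequency in it
    (Hypothetical helper; the article may define these differently.)
    """
    collection_counts = Counter()
    for doc in docs:
        collection_counts.update(doc)
    total = sum(collection_counts.values())
    prior = {t: c / total for t, c in collection_counts.items()}

    posteriors = []
    for doc in docs:
        if not doc:
            # An empty document has no term distribution.
            posteriors.append({})
            continue
        counts = Counter(doc)
        posteriors.append({t: c / len(doc) for t, c in counts.items()})
    return prior, posteriors


def tf_idf(docs):
    """Baseline weights: raw term frequency times log(N / document frequency).

    This is one common unsmoothed variant, assumed here for illustration.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document

    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: f * math.log(n_docs / doc_freq[t]) for t, f in tf.items()}
        )
    return weights


if __name__ == "__main__":
    toy_docs = [["entropy", "term", "entropy"], ["term", "cluster"]]
    prior, posteriors = term_distributions(toy_docs)
    print(prior)
    print(posteriors)
    print(tf_idf(toy_docs))
```

In this toy collection, a term that appears in every document receives a TF*IDF weight of zero, which is the behavior that any alternative weighting based on prior and posterior distributions would be compared against.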
Details
- Title: Information-theoretic term weighting schemes for document clustering and classification
- Creators: Weimao Ke - Drexel University
- Publication Details: International journal on digital libraries, v 16(2), pp 145-159
- Publisher: Springer Nature
- Number of pages: 15
- Resource Type: Journal article
- Language: English
- Academic Unit: Information Science
- Web of Science ID: WOS:000409778200005
- Scopus ID: 2-s2.0-84929959191
- Other Identifier: 991019167697104721
InCites Highlights
Data related to this publication, from InCites Benchmarking & Analytics tool:
- Web of Science research areas: Information Science & Library Science