We proposed a least information theory (LIT) to quantify the meaning of
information in probability distribution changes, from which we developed a new
information retrieval (IR) model. We observed several important characteristics of
the proposed theory and derived two quantities in the IR context for document
representation. Given probability distributions in a collection as prior
knowledge, LI Binary (LIB) quantifies the least information due to the binary
occurrence of a term in a document, whereas LI Frequency (LIF) measures the least
information based on the probability of drawing a term from a bag of words.
Three fusion methods were also developed to combine LIB and LIF quantities for
term weighting and document ranking. Experiments on four benchmark TREC
collections for ad hoc retrieval showed that LIT-based methods performed
very strongly compared with classic TF*IDF and BM25, especially for
verbose queries and hard search topics. The least information theory offers a
new approach to measuring semantic quantities of information and provides
valuable insight into the development of new IR models.
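
As a rough illustration of the two kinds of probability that LIB and LIF start from, the sketch below computes a binary-occurrence probability (the fraction of documents containing a term) and a bag-of-words draw probability (the chance of picking the term at random from a token list). It does not reproduce the LIT formulas, which are not stated here; the function names and the toy collection are hypothetical and chosen only for this example.

```python
from collections import Counter


def binary_occurrence_prob(term, docs):
    """Fraction of documents in the collection that contain the term."""
    return sum(1 for doc in docs if term in doc) / len(docs)


def draw_prob(term, tokens):
    """Probability of drawing the term at random from a bag of words."""
    return Counter(tokens)[term] / len(tokens)


# Toy collection (hypothetical data): each document is a list of tokens.
docs = [
    ["least", "information", "theory"],
    ["information", "retrieval", "model", "information"],
    ["term", "weighting"],
]
collection = [token for doc in docs for token in doc]

# Collection-level priors for the term "information".
p_doc = binary_occurrence_prob("information", docs)  # binary view: 2 of 3 documents
p_coll = draw_prob("information", collection)        # frequency view: 3 of 9 tokens

# Document-level probability of drawing the term from the second document.
p_in_doc = draw_prob("information", docs[1])         # 2 of 4 tokens
```

A term weight in the spirit of LIB or LIF would then measure how far the document-level distribution departs from the collection-level prior; the exact measure is defined by the theory itself and is not sketched here.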
Title: Least Information Modeling for Information Retrieval