Journal article
Eigen-Entropy: A metric for multivariate sampling decisions
INFORMATION SCIENCES, v 619
Jan 2023
Featured in Collection: UN Sustainable Development Goals @ Drexel
Abstract
Sampling is a technique to help identify a representative data subset that captures the characteristics of the whole dataset. Most existing sampling algorithms require distribution assumptions of the multivariate data, which may not be available beforehand. This study proposes a new metric called Eigen-Entropy (EE), which is based on information entropy for the multivariate dataset. EE is a model-free metric because it is derived based on eigenvalues extracted from the correlation coefficient matrix without any assumptions on data distributions. We prove that EE measures the composition of the dataset, such as its heterogeneity or homogeneity. As a result, EE can be used to support sampling decisions, such as which samples and how many samples to consider with respect to the application of interest. To demonstrate the utility of the EE metric, two sets of use cases are considered. The first use case focuses on classification problems with an imbalanced dataset, and EE is used to guide the rendering of homogeneous samples from minority classes. Using 10 public datasets, it is demonstrated that two oversampling techniques using the proposed EE method outperform reported methods from the literature in terms of precision, recall, F-measure, and G-mean. In the second experiment, building fault detection is investigated where EE is used to sample heterogeneous data to support fault detection. Historical normal datasets collected from real building systems are used to construct the baselines by EE for 14 test cases, and experimental results indicate that the EE method outperforms benchmark methods in terms of recall. We conclude that EE is a viable metric to support sampling decisions. © 2022 Published by Elsevier Inc.
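The abstract describes EE as an entropy derived from the eigenvalues of the correlation coefficient matrix, but does not state the formula. The following is a minimal sketch of one plausible reading, not the authors' definition: it assumes the eigenvalues are normalized to sum to one and then passed through Shannon entropy with the natural logarithm. The function name `eigen_entropy`, the normalization, and the log base are all assumptions made for illustration.

```python
import numpy as np


def eigen_entropy(X):
    """Hypothetical sketch: Shannon entropy of normalized eigenvalues
    of the correlation matrix of X (rows = samples, columns = features).

    The normalization and the natural log are assumptions; the abstract
    does not specify them.
    """
    X = np.asarray(X, dtype=float)
    # Correlation coefficient matrix between the feature columns.
    corr = np.corrcoef(X, rowvar=False)
    # The matrix is symmetric, so eigvalsh returns real eigenvalues.
    eigvals = np.linalg.eigvalsh(corr)
    # Clip tiny negative values caused by floating-point round-off.
    eigvals = np.clip(eigvals, 0.0, None)
    # Assumed step: rescale eigenvalues so they sum to one.
    p = eigvals / eigvals.sum()
    # Drop zeros so that 0 * log(0) is treated as 0.
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Three independent features: eigenvalues are close to equal.
    independent = rng.normal(size=(5000, 3))
    # Three perfectly collinear features: one dominant eigenvalue.
    x = rng.normal(size=5000)
    collinear = np.column_stack([x, 2.0 * x, -x])
    print(eigen_entropy(independent))
    print(eigen_entropy(collinear))
```

Under these assumptions, the value approaches log(d) for d uncorrelated features and approaches zero when the features are perfectly correlated. How such values map onto the heterogeneity or homogeneity of a sample set is established in the article itself and is not reproduced here.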
Details
- Title
- Eigen-Entropy: A metric for multivariate sampling decisions
- Publication Details
- INFORMATION SCIENCES, v 619
- Publisher
- ELSEVIER SCIENCE INC; NEW YORK
- Grant note
- This research was supported by funds from the National Science Foundation Award under grant number IIP #1827757. The U.S. Government is authorized to reproduce and distribute for governmental purposes notwithstanding any copyright annotation of the work by the author(s). The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of NSF or the U.S. Government.
- Resource Type
- Journal article
- Language
- English
- Academic Unit
- Drexel University
- Web of Science ID
- WOS:000901771900006
- Scopus ID
- 2-s2.0-85141927735
- Other Identifier
- 991021861277604721
InCites Highlights
Data related to this publication, from InCites Benchmarking & Analytics tool:
- Collaboration types
- Domestic collaboration
- International collaboration
- Web of Science research areas
- Computer Science, Information Systems