Information-theoretic approaches to SVM feature selection for metagenome read classification
Journal article   Peer reviewed

Elaine Garbarine, Joseph DePasquale, Vinay Gadia, Robi Polikar and Gail Rosen
Computational biology and chemistry, v 35(3), pp 199-209
Jun 2011
PMID: 21704267

Abstract

Keywords: Support vector machines; Metagenomics; Information theory
► We find that feature selection improves the performance of metagenomic taxonomic classification with SVMs.
► The mRMR information-theoretic method is the best feature selection method, especially at the phylum level.
► Using feature selection with 9mers does not improve over 6mer feature selection.

Analysis of DNA sequences isolated directly from the environment, known as metagenomics, produces a large quantity of genome fragments that need to be classified into specific taxa. Most composition-based classification methods use all features instead of a subset of features that may maximize classifier accuracy. We show that feature selection methods can boost the performance of taxonomic classifiers. This work proposes three different filter-based feature selection methods that stem from information theory: (1) a technique that combines Kullback–Leibler divergence, mutual information, and distance information, (2) a text mining technique, TF-IDF, and (3) minimum-redundancy maximum-relevance (mRMR). The feature selection methods are compared by how well they improve support vector machine classification of genomic reads. Overall, the 6mer mRMR method performs well, especially at the phylum level. If the total number of features is very large, feature selection becomes difficult because a small subset of features that captures a majority of the data variance is less likely to exist. Therefore, we conclude that there is a trade-off between feature set size and feature selection method when optimizing classification performance. For larger feature set sizes, TF-IDF works better at finer taxonomic resolutions, while for N=6 mRMR performs best of all methods at every taxonomic level.
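To make the idea of filter-based selection over N-mer features concrete, the following is a minimal sketch, not the authors' implementation. It assumes features are discrete values per read (for example, presence or absence of an N-mer) and uses the common "relevance minus mean redundancy" form of greedy mRMR; the article may use a different variant. The function names `kmer_counts`, `mutual_information`, and `mrmr` are hypothetical. For scale, 6mers over A, C, G, T give 4^6 = 4096 candidate features and 9mers give 4^9 = 262144.

```python
from collections import Counter
from math import log2


def kmer_counts(read, k):
    """Count overlapping k-mers in a read, skipping windows with non-ACGT symbols."""
    counts = Counter()
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def mutual_information(xs, ys):
    """Mutual information in bits between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def mrmr(features, labels, n_select):
    """Greedy mRMR sketch (assumed difference form).

    features: dict mapping a feature name to a list of discrete values,
              one value per read.
    labels:   list of class labels (taxa), one per read.
    Returns the names of the selected features in selection order.
    """
    relevance = {f: mutual_information(v, labels) for f, v in features.items()}
    selected = []
    while len(selected) < min(n_select, len(features)):
        best, best_score = None, None
        for f in sorted(features):  # sorted for deterministic tie-breaking
            if f in selected:
                continue
            if selected:
                # Mean mutual information with the features already chosen.
                redundancy = sum(mutual_information(features[f], features[s])
                                 for s in selected) / len(selected)
            else:
                redundancy = 0.0
            score = relevance[f] - redundancy
            if best_score is None or score > best_score:
                best, best_score = f, score
        selected.append(best)
    return selected


# Toy data: "b" duplicates "a", so it adds no new information once "a" is chosen.
toy_features = {"a": [0, 0, 1, 1], "b": [0, 0, 1, 1], "c": [0, 1, 0, 1]}
toy_labels = [0, 1, 2, 3]
chosen = mrmr(toy_features, toy_labels, 2)
```

In this toy case all three features are equally relevant to the labels, but the redundancy term steers the second pick away from the duplicate feature. In a full pipeline the selected N-mer features would then be passed to an SVM, which is omitted here.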

Metrics

8 Record Views
17 citations in Scopus

Details

UN Sustainable Development Goals (SDGs)

This publication has contributed to the advancement of the following goals:

#3 Good Health and Well-Being

InCites Highlights

Data related to this publication, from InCites Benchmarking & Analytics tool:

Collaboration types
Domestic collaboration
Web of Science research areas
Biology
Computer Science, Interdisciplinary Applications