A text-mining approach for classification of genomic fragments
Conference proceeding   Open access

V Gadia and G Rosen
2008 IEEE International Conference on Bioinformatics and Biomedicine Workshops
Nov 2008
url
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.909.6525

Abstract

Keywords: Performance evaluation, Euclidean distance, Frequency, Spatial databases, Phylogeny, Testing, Bioinformatics, DNA, Data Analysis, Genomics
Genome identification is an emerging area of interest due to the study of environmental DNA samples. We show that performance approaches 50% for classifying 500 bp fragments when using 12-mer features and, more importantly, that performance increases linearly for large N. Second, we determine that an inverted TF-IDF measure performs 16% better when using only 80% of the words than when using the full set (100%). This increase implies that, while too sparse a feature subset does not produce good results, a carefully selected set has the potential to improve genome classification over a random feature set. Computing even 80% of all possible features can result in significant savings in computation. The Euclidean classifier and TF-IDF measures will pave the way for more discriminative classification techniques.
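
The pipeline outlined in the abstract (N-mer word frequencies, a TF-IDF-based choice of which words to keep, and a Euclidean nearest-profile classifier) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the toy genomes, the small word length `k = 2`, and the smoothed textbook TF-IDF formula are not taken from the paper. The abstract does not define its "inverted" TF-IDF measure, so the `descending` flag below is only a placeholder for the ranking direction.

```python
from collections import Counter
from itertools import product
import math

def kmer_counts(seq, k):
    """Count every overlapping k-mer (word) in a DNA string."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def profile(seq, k, vocab):
    """Relative frequency of each vocabulary word in seq."""
    counts = kmer_counts(seq, k)
    total = sum(counts[w] for w in vocab) or 1  # avoid division by zero
    return [counts[w] / total for w in vocab]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(fragment, genomes, k, vocab):
    """Return the name of the genome whose profile is nearest to the fragment's."""
    frag = profile(fragment, k, vocab)
    return min(genomes,
               key=lambda name: euclidean(frag, profile(genomes[name], k, vocab)))

def tfidf_scores(genomes, k, vocab):
    """Textbook TF-IDF per word, summed over genomes (one genome = one document).
    The smoothed IDF is an assumption, not the paper's formula."""
    counts = {name: kmer_counts(seq, k) for name, seq in genomes.items()}
    n = len(genomes)
    scores = {}
    for w in vocab:
        df = sum(1 for c in counts.values() if c[w] > 0)
        idf = math.log((1 + n) / (1 + df)) + 1
        tf = sum(c[w] / max(1, sum(c.values())) for c in counts.values())
        scores[w] = tf * idf
    return scores

def select_words(scores, fraction=0.8, descending=True):
    """Keep the given fraction of words, ranked by score."""
    ranked = sorted(scores, key=scores.get, reverse=descending)
    return ranked[:max(1, int(len(ranked) * fraction))]

# Toy data (hypothetical): two synthetic "genomes" and one short fragment.
k = 2
vocab = ["".join(p) for p in product("ACGT", repeat=k)]
genomes = {
    "A_rich": "AAAAATAAAACAAAAGAAAA" * 5,
    "GC_rich": "GCGCGGCCGCGCCGGCGCGC" * 5,
}
words = select_words(tfidf_scores(genomes, k, vocab), fraction=0.8)
label = classify("AAAATAAAAGAAAA", genomes, k, words)
```

With 12-mers the vocabulary grows to 4^12 words, so a practical version would likely store sparse counts rather than dense lists; the sketch keeps dense vectors only for readability.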

Metrics

15 Record Views
6 citations in Scopus

Details

UN Sustainable Development Goals (SDGs)

This publication has contributed to the advancement of the following goals:

#3 Good Health and Well-Being

InCites Highlights

Data related to this publication, from InCites Benchmarking & Analytics tool:

Web of Science research areas
Engineering, Biomedical