Conference proceeding
KPSpotter: a flexible information gain-based keyphrase extraction system
Proceedings of the 5th ACM international workshop on web information and data management, pp 50-53
07 Nov 2003
Abstract
To tackle the issue of information overload, we present an Information Gain-based KeyPhrase Extraction System, called KPSpotter. KPSpotter is a flexible web-enabled keyphrase extraction system, capable of processing various formats of input data, including web data, and generating the extraction model as well as the list of keyphrases in XML. In KPSpotter, the following two features were selected for training and extracting keyphrases: 1) TF*IDF and 2) Distance from First Occurrence. Input training and testing collections were processed in three stages: 1) Data Cleaning, 2) Data Tokenizing, and 3) Data Discretizing. To measure the system performance, the keyphrases extracted by KPSpotter are compared with the ones that the authors assigned. Our experiments show that the performance of KPSpotter was evaluated to be equivalent to KEA, a well-known keyphrase extraction system. KPSpotter, however, is differentiated from other extraction systems in the followings: First, KPSpotter employs a new keyphrase extraction technique that combines the Information Gain data mining measure and several Natural Language Processing techniques such as stemming and case-folding. Second, KPSpotter is able to process various types of input data such as XML, HTML, and unstructured text data and generate XML output. Third, the user can provide input data and execute KPSpotter through the Internet. Fourth, for efficiency and performance reason, KPSpotter stores candidate keyphrases and its related information such as frequency and stemmed form into an embedded database management system.
Metrics
13 Record Views
Details
- Title
- KPSpotter
- Creators
- Min SongIl-Yeol SongXiaohua Hu
- Publication Details
- Proceedings of the 5th ACM international workshop on web information and data management, pp 50-53
- Conference
- 5th ACM international workshop on web information and data management, 5th
- Series
- WIDM '03
- Publisher
- Association for Computing Machinery (ACM)
- Number of pages
- 1
- Resource Type
- Conference proceeding
- Language
- English
- Academic Unit
- Information Science
- Other Identifier
- 991014878187804721