Journal article
An expert-in-the-loop method for domain-specific document categorization based on small training data
Journal of the Association for Information Science and Technology, vol. 74(6), p. 669
Jun 2023
Featured in Collection: UN Sustainable Development Goals @ Drexel
Abstract
Automated text categorization methods are of broad relevance for domain experts since they free researchers and practitioners from manual labeling, save their resources (e.g., time, labor), and enrich the data with information helpful to study substantive questions. Despite a variety of newly developed categorization methods that require substantial amounts of annotated data, little is known about how to build models when (a) labeling texts with categories requires substantial domain expertise and/or in-depth reading, (b) only a few annotated documents are available for model training, and (c) no relevant computational resources, such as pretrained models, are available. In a collaboration with environmental scientists who study the socio-ecological impact of funded biodiversity conservation projects, we develop a method that integrates deep domain expertise with computational models to automatically categorize project reports based on a small sample of 93 annotated documents. Our results suggest that domain expertise can improve automated categorization and that the magnitude of these improvements is influenced by the experts' understanding of categories and their confidence in their annotation, as well as data sparsity and additional category characteristics such as the portion of exclusive keywords that can identify a category.
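One category characteristic named in the abstract, the portion of exclusive keywords that can identify a category, can be illustrated with a minimal sketch. It assumes that a keyword is "exclusive" when it appears in the keyword list of only one category. The function name `exclusive_keyword_share` and the keyword lists are hypothetical, invented for illustration; they are not taken from the article or its data.

```python
from collections import Counter


def exclusive_keyword_share(category_keywords):
    """Return, per category, the fraction of its keywords found in no other category.

    Assumption: "exclusive" means the keyword occurs in exactly one category's list.
    """
    # Count in how many categories each distinct keyword occurs.
    counts = Counter(
        kw for kws in category_keywords.values() for kw in set(kws)
    )
    shares = {}
    for category, kws in category_keywords.items():
        unique = set(kws)
        if not unique:
            # A category without keywords has no exclusive keywords.
            shares[category] = 0.0
            continue
        exclusive = sum(1 for kw in unique if counts[kw] == 1)
        shares[category] = exclusive / len(unique)
    return shares


# Hypothetical keyword lists, invented for illustration only.
example = {
    "habitat": ["forest", "wetland", "restoration", "monitoring"],
    "livelihoods": ["income", "training", "monitoring"],
}
shares = exclusive_keyword_share(example)
```

In this toy input, "monitoring" is shared by both categories, so it does not count as exclusive for either one.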
Details
- Publisher
- Wiley; Hoboken
- Grant note
- We gratefully acknowledge the support from the John D. and Catherine T. MacArthur Foundation.
- Resource Type
- Journal article
- Language
- English
- Academic Unit
- Drexel University
- Web of Science ID
- WOS:000865435400001
- Scopus ID
- 2-s2.0-85139480111
- Other Identifier
- 991021861297604721
InCites Highlights
Data related to this publication, from the InCites Benchmarking & Analytics tool:
- Collaboration types
- Domestic collaboration
- Web of Science research areas
- Computer Science, Information Systems
- Information Science & Library Science