Two-Stage Fine-Tuning with ChatGPT Data Augmentation for Learning Class-Imbalanced Data
Journal article; Open access; Peer reviewed

Hualou Liang, Taha ValizadehAslani, Yiwen Shi, Jing Wang, Ping Ren, Yi Zhang, Meng Hu and Liang Zhao
Neurocomputing [e-journal], v 592, 127801
11 May 2024
Featured in Collection: Research Supported by Drexel Libraries' OA Programs
URL: https://doi.org/10.1016/j.neucom.2024.127801
Published, Version of Record (VoR); Open Access via Drexel Libraries Read and Publish Program 2024; CC BY-NC-ND V4.0; Open

Abstract

Machine Learning; Natural Language Programming
Classification of long-tailed data is a challenging problem: serious class imbalance leads to poor performance on tail classes, which have only a few samples. Owing to this paucity of samples, learning on the tail classes is especially challenging for fine-tuning when transferring a pretrained model to a downstream task. In this work, we present a simple modification of standard fine-tuning to cope with these challenges. Specifically, we propose a two-stage fine-tuning procedure. In Stage 1, we fine-tune the final layer of the pretrained model with class-balanced augmented data, generated using ChatGPT. As a large generative language model, ChatGPT is capable of generating novel and contextually similar responses to a given prompt, which makes it an excellent candidate for data augmentation. In Stage 2, we perform standard fine-tuning. Our modification has several benefits: (1) it leverages pretrained representations by fine-tuning only a small portion of the model parameters while keeping the rest untouched; (2) it allows the model to learn an initial representation of the specific task; and, importantly, (3) it protects the learning of tail classes from being at a disadvantage during model updates. We conduct extensive experiments on synthetic datasets for both two-class and multi-class text classification tasks, as well as on a real-world application to ADME (i.e., absorption, distribution, metabolism, and excretion) semantic drug labeling. The experimental results show that the proposed two-stage fine-tuning outperforms vanilla fine-tuning and state-of-the-art methods on the above datasets.
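
The two-stage schedule described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the tiny two-layer model, the `augment` function (a feature-jitter placeholder standing in for ChatGPT-generated text), the helper names, the learning rate, and the epoch counts are all hypothetical. It also assumes that Stage 2 trains on the original, imbalanced data, which the abstract does not specify. The point is only the control flow: first update the final layer on class-balanced data, then update every parameter.

```python
import copy
import math
import random

def forward(params, x):
    """Tiny stand-in model: a 'pretrained' body (W) and a final layer (v, b)."""
    h = [math.tanh(sum(w * xj for w, xj in zip(row, x))) for row in params["W"]]
    z = sum(vi * hi for vi, hi in zip(params["v"], h)) + params["b"]
    z = max(-30.0, min(30.0, z))  # keep exp() in a safe range
    return h, 1.0 / (1.0 + math.exp(-z))

def sgd_epoch(params, data, lr, train_body):
    """One SGD pass on binary cross-entropy; the body moves only if train_body."""
    for x, y in data:
        h, p = forward(params, x)
        err = p - y  # dLoss/dLogit for sigmoid + cross-entropy
        grad_h = [err * vi for vi in params["v"]]  # taken before the head update
        for i, hi in enumerate(h):
            params["v"][i] -= lr * err * hi
        params["b"] -= lr * err
        if train_body:
            for i, row in enumerate(params["W"]):
                g = grad_h[i] * (1.0 - h[i] ** 2)  # tanh derivative
                for j, xj in enumerate(x):
                    row[j] -= lr * g * xj

def augment(x, rng):
    """Hypothetical placeholder for ChatGPT augmentation: jitter the features."""
    return [xj + rng.gauss(0.0, 0.1) for xj in x]

def balance(data, rng):
    """Add augmented samples until every class matches the largest class."""
    by_class = {}
    for x, y in data:
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    balanced = list(data)
    for y, xs in by_class.items():
        for _ in range(target - len(xs)):
            balanced.append((augment(rng.choice(xs), rng), y))
    rng.shuffle(balanced)
    return balanced

def two_stage_fine_tune(params, data, rng, epochs=(20, 20), lr=0.1):
    balanced = balance(data, rng)
    for _ in range(epochs[0]):  # Stage 1: final layer only, balanced data
        sgd_epoch(params, balanced, lr, train_body=False)
    after_stage1 = copy.deepcopy(params)
    for _ in range(epochs[1]):  # Stage 2: all parameters (assumed: original data)
        sgd_epoch(params, data, lr, train_body=True)
    return after_stage1, params

rng = random.Random(0)
# Imbalanced toy data: 40 head-class samples (label 0), 4 tail-class samples (label 1).
data = [([rng.gauss(-1, 0.5), rng.gauss(-1, 0.5)], 0) for _ in range(40)]
data += [([rng.gauss(1, 0.5), rng.gauss(1, 0.5)], 1) for _ in range(4)]
pretrained = {"W": [[0.5, -0.2], [0.1, 0.4]], "v": [0.0, 0.0], "b": 0.0}
stage1, final = two_stage_fine_tune(copy.deepcopy(pretrained), data, rng)
```

In this sketch, `stage1["W"]` is identical to `pretrained["W"]` because the body is frozen during Stage 1, while `final["W"]` differs once Stage 2 updates all parameters. In a real setting the body would be a pretrained language model and the augmented samples would be generated text rather than jittered feature vectors.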

Metrics

81 Record Views
11 citations in Scopus

Details

UN Sustainable Development Goals (SDGs)

This publication has contributed to the advancement of the following goals:

#3 Good Health and Well-Being

InCites Highlights

Data related to this publication, from InCites Benchmarking & Analytics tool:

Collaboration types
Domestic collaboration
Web of Science research areas
Computer Science, Artificial Intelligence