Efficient large language model reinforcement learning fine-tuning on single-node multi-GPU setups

Keith Truongcao

doi:10.17918/00011392

[From introduction] In recent years, large language models (LLMs) have transformed natural language processing (NLP). Models based on the Transformer architecture generalize well across diverse linguistic and reasoning tasks [3]. More recent systems, including large-scale open models such as DeepSeek-V3, further demonstrate that improvements in scale, data, architecture, and post-training can produce strong reasoning and instruction-following behavior [27]. However, as these models grow, often to tens or even hundreds of billions of parameters, the computational cost of adapting them also rises substantially. This trend has increased demand for efficient fine-tuning methods for domain-specific and task-oriented applications, where specialized improvements can yield significant practical benefits. Despite this interest, fine-tuning large-scale LLMs remains a major computational challenge. Models of this size typically require multi-GPU or distributed training setups, which are notoriously difficult to configure and optimize. These systems must balance computation, communication overhead, and memory consumption across multiple devices, and small misconfigurations can severely degrade both performance and cost-efficiency. Furthermore, reinforcement learning (RL)-based fine-tuning techniques such as Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) add further computational cost through the iterative policy optimization, repeated sampling, reward computation, and log-probability evaluation they require during training [13, 14, 11, 17, 18]. Given these challenges, improving the efficiency of reinforcement learning for large models has become a key research focus.
This thesis contributes to this area by exploring strategies to make RL-based fine-tuning more memory- and compute-efficient. In particular, it investigates the combination of Fully Sharded Data Parallelism (FSDP2) with Quantized Low-Rank Adaptation (QLoRA) and memory-optimized algorithms for the Group Relative Policy Optimization (GRPO) framework. Integrating these methods is intended to minimize memory usage, reduce communication overhead, and accelerate training without changing the underlying reinforcement learning objective. By optimizing both the distributed-training and memory-management aspects of RL fine-tuning, this work aims to make large-scale model adaptation more accessible and practical. The proposed methods not only address scalability limitations but also lay the groundwork for more sustainable and cost-effective training of next-generation language models. Ultimately, this research aims to bridge the gap between the theoretical potential of large language models and their efficient real-world deployment for specialized tasks.
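
For readers unfamiliar with GRPO, its central step can be illustrated briefly. GRPO avoids a separately trained value model by scoring each sampled completion relative to the other completions drawn for the same prompt. The following is a minimal sketch of that group-relative normalization as it is commonly described in the literature; it is not taken from the thesis, and the function name, the `eps` constant, and the example rewards are all hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the group sampled for the same prompt.

    Assumption: population standard deviation plus a small eps for stability.
    Some implementations use the sample standard deviation instead.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one value")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Completions better than the group average get a positive advantage,
    # worse ones a negative advantage; an all-equal group yields zeros.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical verifiable rewards for four completions of one prompt
# (1.0 = answer verified correct, 0.0 = incorrect).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline comes from the group itself, no critic network has to be held in memory, which is one reason GRPO is a natural fit for memory-constrained fine-tuning. The cost is that several completions must be sampled and scored per prompt.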

Drexel University