Towards efficient biomedical natural language processing with large language models

Yiwen Shi

doi:10.17918/00011217

Recent breakthroughs in Natural Language Processing (NLP), particularly in large language models (LLMs), have advanced biomedical text understanding and generation. However, deploying these models in real biomedical settings remains challenging. Biomedical datasets are often small, specialized, and highly imbalanced. Specialized documents can be long, unstructured, and dense with domain-specific details. Strict data privacy requirements, together with limited computational resources, often require the use of small open-source models for downstream tasks. These factors raise a central question for this thesis: how can we design effective and efficient NLP methods that make LLMs truly useful in real biomedical environments? First, we address the extraction of ADME (Absorption, Distribution, Metabolism, and Excretion) information from reference listed drug (RLD) labeling, an important step in Product-Specific Guidance (PSG) assessment. Because ADME datasets are inherently imbalanced, standard finetuning performs poorly on minority classes. To improve robustness in this setting, we introduce pre-finetuning, an intermediate training stage between pretraining and finetuning that better aligns model representations with the target text and improves performance on minority ADME classes. Second, we address the summarization of long biomedical documents, such as FDA food-effect study reviews and doctor–patient dialogue transcripts. To achieve concise yet detail-preserving summaries, we first propose an iterative prompting strategy that guides LLMs through multi-turn refinement, enhancing summary accuracy through user interaction. In addition, we develop a Dynamic Retrieval-Augmented Generation (RAG) framework that reduces token cost by retrieving only the most relevant chunks of the source document. Dynamic RAG can generate high-quality summaries with substantially greater token efficiency than full-dialogue baselines. Finally, real-world biomedical applications often require strict data privacy, which limits the use of commercial LLMs on internal datasets and calls for specialized, lightweight open-source models that remain both effective and computationally affordable. To address this need, we propose a method for automatically generating and optimizing prompts for prompt-based finetuning of BERT in few-shot learning scenarios, enabling small models to perform competitively in constrained biomedical environments. Together, these three components form a framework for efficient biomedical NLP. Pre-finetuning improves data-efficient learning on imbalanced datasets, iterative prompting and Dynamic RAG enable token-efficient summarization of long biomedical documents, and automatic prompt optimization supports resource-efficient deployment of small open-source models under privacy constraints. These methods provide practical and scalable solutions for real-world biomedical applications, including PSG development.
