Journal article
Reasoning with large language models for medical question answering
Journal of the American Medical Informatics Association : JAMIA, ocae131
03 Jul 2024
Abstract
OBJECTIVES
To investigate approaches of reasoning with large language models (LLMs) and to propose a new prompting approach, ensemble reasoning, to improve medical question answering performance with refined reasoning and reduced inconsistency.
MATERIALS AND METHODS
We used multiple choice questions from the USMLE Sample Exam question files on 2 closed-source commercial and 1 open-source clinical LLM to evaluate our proposed approach ensemble reasoning.
RESULTS
On GPT-3.5 turbo and Med42-70B, our proposed ensemble reasoning approach outperformed zero-shot chain-of-thought with self-consistency on Steps 1, 2, and 3 questions (+3.44%, +4.00%, and +2.54%) and (2.3%, 5.00%, and 4.15%), respectively. With GPT-4 turbo, there were mixed results with ensemble reasoning again outperforming zero-shot chain-of-thought with self-consistency on Step 1 questions (+1.15%). In all cases, the results demonstrated improved consistency of responses with our approach. A qualitative analysis of the reasoning from the model demonstrated that the ensemble reasoning approach produces correct and helpful reasoning.
CONCLUSION
The proposed iterative ensemble reasoning has the potential to improve the performance of LLMs in medical question answering tasks, particularly with the less powerful LLMs like GPT-3.5 turbo and Med42-70B, which may suggest that this is a promising approach for LLMs with lower capabilities. Additionally, the findings show that our approach helps to refine the reasoning generated by the LLM and thereby improve consistency even with the more powerful GPT-4 turbo. We also identify the potential and need for human-artificial intelligence teaming to improve the reasoning beyond the limits of the model.
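For readers unfamiliar with the baseline named in the abstract, self-consistency is commonly implemented as a majority vote over the final answers of several independently sampled reasoning chains. The following is a minimal sketch of that vote only, not the authors' ensemble reasoning method. The function name `majority_vote`, the sample answers, and the agreement rate used here as a rough proxy for response consistency are all illustrative assumptions and do not come from the article.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer and the share of samples that agree.

    `answers` holds the final answer choices parsed from independently
    sampled reasoning chains (hypothetical input, not from the article).
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # most_common(1) yields the (answer, count) pair with the highest count.
    choice, votes = counts.most_common(1)[0]
    return choice, votes / len(answers)


# Hypothetical answers from five sampled chains for one multiple choice question.
sampled = ["B", "B", "C", "B", "A"]
choice, agreement = majority_vote(sampled)
print(choice, agreement)
```

A higher agreement rate means the sampled chains converge on the same choice more often; the article's own consistency measure may be defined differently.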
Details
- Title
- Reasoning with large language models for medical question answering
- Creators
- Mary M Lucas - Drexel University
- Justin Yang - University of Maryland, College Park
- Jon K Pomeroy - Penn Center for AIDS Research
- Christopher C Yang - Drexel University
- Publication Details
- Journal of the American Medical Informatics Association : JAMIA, ocae131
- Publisher
- JAMIA
- Resource Type
- Journal article
- Language
- English
- Academic Unit
- Information Science
- Web of Science ID
- WOS:001261047600001
- Scopus ID
- 2-s2.0-85199780977
- Other Identifier
- 991021892011904721
InCites Highlights
Data related to this publication, from InCites Benchmarking & Analytics tool:
- Collaboration types
- Domestic collaboration
- Web of Science research areas
- Computer Science, Information Systems
- Computer Science, Interdisciplinary Applications
- Health Care Sciences & Services
- Information Science & Library Science
- Medical Informatics