Logo image
Reasoning with large language models for medical question answering
Journal article   Peer reviewed

Reasoning with large language models for medical question answering

Mary M Lucas, Justin Yang, Jon K Pomeroy and Christopher C Yang
Journal of the American Medical Informatics Association : JAMIA, ocae131
03 Jul 2024
url
https://doi.org/10.1093/jamia/ocae131View
Published, Version of Record (VoR) Open

Abstract

To investigate approaches of reasoning with large language models (LLMs) and to propose a new prompting approach, ensemble reasoning, to improve medical question answering performance with refined reasoning and reduced inconsistency.OBJECTIVESTo investigate approaches of reasoning with large language models (LLMs) and to propose a new prompting approach, ensemble reasoning, to improve medical question answering performance with refined reasoning and reduced inconsistency.We used multiple choice questions from the USMLE Sample Exam question files on 2 closed-source commercial and 1 open-source clinical LLM to evaluate our proposed approach ensemble reasoning.MATERIALS AND METHODSWe used multiple choice questions from the USMLE Sample Exam question files on 2 closed-source commercial and 1 open-source clinical LLM to evaluate our proposed approach ensemble reasoning.On GPT-3.5 turbo and Med42-70B, our proposed ensemble reasoning approach outperformed zero-shot chain-of-thought with self-consistency on Steps 1, 2, and 3 questions (+3.44%, +4.00%, and +2.54%) and (2.3%, 5.00%, and 4.15%), respectively. With GPT-4 turbo, there were mixed results with ensemble reasoning again outperforming zero-shot chain-of-thought with self-consistency on Step 1 questions (+1.15%). In all cases, the results demonstrated improved consistency of responses with our approach. A qualitative analysis of the reasoning from the model demonstrated that the ensemble reasoning approach produces correct and helpful reasoning.RESULTSOn GPT-3.5 turbo and Med42-70B, our proposed ensemble reasoning approach outperformed zero-shot chain-of-thought with self-consistency on Steps 1, 2, and 3 questions (+3.44%, +4.00%, and +2.54%) and (2.3%, 5.00%, and 4.15%), respectively. With GPT-4 turbo, there were mixed results with ensemble reasoning again outperforming zero-shot chain-of-thought with self-consistency on Step 1 questions (+1.15%). In all cases, the results demonstrated improved consistency of responses with our approach. A qualitative analysis of the reasoning from the model demonstrated that the ensemble reasoning approach produces correct and helpful reasoning.The proposed iterative ensemble reasoning has the potential to improve the performance of LLMs in medical question answering tasks, particularly with the less powerful LLMs like GPT-3.5 turbo and Med42-70B, which may suggest that this is a promising approach for LLMs with lower capabilities. Additionally, the findings show that our approach helps to refine the reasoning generated by the LLM and thereby improve consistency even with the more powerful GPT-4 turbo. We also identify the potential and need for human-artificial intelligence teaming to improve the reasoning beyond the limits of the model.CONCLUSIONThe proposed iterative ensemble reasoning has the potential to improve the performance of LLMs in medical question answering tasks, particularly with the less powerful LLMs like GPT-3.5 turbo and Med42-70B, which may suggest that this is a promising approach for LLMs with lower capabilities. Additionally, the findings show that our approach helps to refine the reasoning generated by the LLM and thereby improve consistency even with the more powerful GPT-4 turbo. We also identify the potential and need for human-artificial intelligence teaming to improve the reasoning beyond the limits of the model.

Metrics

58 Record Views
11 citations in Scopus

Details

InCites Highlights

Data related to this publication, from InCites Benchmarking & Analytics tool:

Collaboration types
Domestic collaboration
Web of Science research areas
Computer Science, Information Systems
Computer Science, Interdisciplinary Applications
Health Care Sciences & Services
Information Science & Library Science
Medical Informatics
Logo image