Computer Science - Artificial Intelligence; Computer Science - Computation and Language; Computer Science - Learning
As Large Language Models (LLMs) are integrated into critical real-world
applications, their strategic and logical reasoning abilities are increasingly
crucial. This paper evaluates LLMs' reasoning abilities in competitive
environments through game-theoretic tasks, e.g., board and card games that
require pure logic and strategic reasoning to compete with opponents. We first
propose GTBench, a language-driven environment comprising 10 widely recognized
tasks across a comprehensive game taxonomy: complete versus incomplete
information, dynamic versus static, and probabilistic versus deterministic
scenarios. Then, we investigate two key problems: (1) Characterizing
game-theoretic reasoning of LLMs; (2) LLM-vs-LLM competitions as reasoning
evaluation. We observe that (1) LLMs behave differently across gaming
scenarios; for example, LLMs fail in complete and deterministic games,
yet they are competitive in probabilistic gaming scenarios; (2) Open-source
LLMs, e.g., CodeLlama-34b-Instruct, are less competitive than commercial LLMs,
e.g., GPT-4, in complex games. In addition, code-pretraining greatly benefits
strategic reasoning, while advanced reasoning methods such as Chain-of-Thought
(CoT) and Tree-of-Thought (ToT) do not always help. Detailed error profiles are
also provided for a better understanding of LLMs' behavior.
Details
Title
GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations
Creators
Jinhao Duan
Renming Zhang
James Diffenderfer
Bhavya Kailkhura
Lichao Sun
Elias Stengel-Eskin
Mohit Bansal
Tianlong Chen
Kaidi Xu
Publication Details
arXiv.org
Resource Type
Preprint
Language
English
Academic Unit
Computer Science (Computing)
Other Identifier
991021871355904721