Towards Automated Penetration Testing: Introducing LLM Benchmark, Analysis, and Improvements

Isamu Isozaki; Manil Shrestha; Rick Console; Edward Kim

doi:10.1145/3708319.3733804

Conference proceeding

Open access

Adjunct Proceedings of the 33rd ACM Conference on User Modeling, Adaptation and Personalization, pp 404-419

16 Jun 2025

DOI: https://doi.org/10.1145/3708319.3733804

Featured in Collection : Research Supported by Drexel Libraries' OA Programs

Files and links (1)

https://doi.org/10.1145/3708319.3733804

Published, Version of Record (VoR); Open Access via Drexel Libraries Read and Publish Program 2025; CC BY V4.0, Open

Abstract

Security and privacy -- Software security engineering

Hacking poses a significant threat to cybersecurity, inflicting billions of dollars in damages annually. To mitigate these risks, ethical hacking, or penetration testing, is employed to identify vulnerabilities in systems and networks. Recent advancements in large language models (LLMs) have shown potential across various domains, including cybersecurity. However, there is currently no comprehensive, open, end-to-end penetration testing benchmark to drive progress and evaluate the capabilities of these models in security contexts. This paper introduces a novel open benchmark for LLM-based penetration testing, addressing this critical gap. We first evaluate the performance of LLMs, including GPT-4o and Llama 3.1-405B, using the state-of-the-art PentestGPT tool. Our findings reveal that while Llama 3.1 demonstrates an edge over GPT-4o, both models currently fall short of performing end-to-end penetration testing, even with some minimal human assistance. Next, we advance the state of the art and present ablation studies that provide insights into improving the PentestGPT tool. Our research illuminates the challenges LLMs face in each aspect of penetration testing, e.g., enumeration, exploitation, and privilege escalation. This work contributes to the growing body of knowledge on AI-assisted cybersecurity and lays the foundation for future research in automated penetration testing using large language models.

Metrics

10 Record Views

1 citation in Web of Science

1 citation in Scopus

Details

Title: Towards Automated Penetration Testing: Introducing LLM Benchmark, Analysis, and Improvements
Creators: Isamu Isozaki - Drexel University
Manil Shrestha - Drexel University
Rick Console - Drexel University
Edward Kim (Corresponding Author) - Drexel University
Publication Details: Adjunct Proceedings of the 33rd ACM Conference on User Modeling, Adaptation and Personalization, pp 404-419
Conference: UMAP '25: 33rd ACM Conference on User Modeling, Adaptation and Personalization
Series: ACM Conferences
Publisher: Association for Computing Machinery
Number of pages: 16
Resource Type: Conference proceeding
Language: English
Academic Unit: Computer Science; College of Computing and Informatics
Web of Science ID: WOS:001525524600073
Scopus ID: 2-s2.0-105011050891
Other Identifier: 991022057938704721

InCites Highlights

Data related to this publication, from InCites Benchmarking & Analytics tool:

Web of Science research areas: Computer Science, Artificial Intelligence; Computer Science, Information Systems; Computer Science, Interdisciplinary Applications; Computer Science, Theory & Methods
