Benchmarking LLMs for cybersecurity: Inside HTB AI Range’s first evaluation

Discover how Hack The Box AI Range benchmarks LLMs in realistic cyber scenarios. Explore the methodology, key findings, and why it sets a new standard for AI security performance.

b3rt0ll0,
Feb 12
2026

A deep dive into how HTB AI Range measures LLMs capabilities against real-world cyber threats, using OWASP Top 10 scenarios and rigorous, reproducible testing.

To measure how well AI agents can autonomously handle security operations, we conducted a benchmark using a gauntlet of web security challenges spanning the OWASP Top 10 framework.

This benchmark provides a controlled, reproducible environment to evaluate each model’s effectiveness on realistic targets on the industry-standard list of the most critical Web Application risks.

“The OWASP Top 10 provides a common, risk-based framework that helps organizations consistently evaluate security posture, regardless of technology stack. Applied to Large Language Models (LLMs), it offers a structured way to assess how AI systems fail, how those failures can be exploited, and where mitigations should be prioritized.”
– Andrew Van der Stock, Executive Director & Project Leader @ OWASP Top 10

The initial evaluation pitted multiple frontier AI models against vulnerabilities like SQL injection, broken authentication, cross-site scripting (XSS), and more, to measure their offensive security capabilities.

January 2026 Update

We recently updated the benchmark adding new models: Claude Opus 4.5, Gemini 3 Flash, Grok 4.1 Fast Reasoning, and Mistral Large 3. This update also marks the introduction of xAI as a new provider on the leaderboard.

VIEW FULL BENCHMARK

Benchmark setup & methodology

Each AI agent was tested on a curated pack of Hack The Box challenges. The agent operates autonomously: upon receiving a target, it plans and executes attacks in an isolated environment to capture the proof of exploitation (flag).

We enforced standardized conditions to ensure fairness.

Every model attempted the exact same set of challenges 10 times, each time with fresh instances to account for non-deterministic LLM behavior and ensure reliable success rates.
All models used identical agent code, tools, prompts, and evaluation criteria. No model received special tuning or advantages.
Success was measured in binary terms: an attempt is only considered a win if the agent submits the correct flag. There is no partial credit for making progress – it’s all or nothing.
Each run was bounded by strict stopping criteria to prevent runaway attempts. Agents had a maximum of 100 “thinking” turns to solve a scenario; if it hit that limit, the run was terminated and counted as a failure.

These constraints simulate real operational limits and force the AI to be efficient and accurate in its exploitation strategy.

The agent is also equipped with three primary tools that mirror how a human hacker would work:

Todo (Task Planner): Allows the agent to break down the problem. The model can outline steps and keep track of its progress. This helps it manage complex tasks without forgetting sub-goals.
Runtime (Shell Executor): Sandboxed Docker-based environment where the agent can run actual commands against the target instance. By providing a real execution tool, we model the way an attacker would interact with a system.
Check Flag (Flag Validator): A custom tool that the agent uses to submit a found flag string. This tool immediately tells the agent whether the flag is correct or not. The agent must use the check flag tool on a correct flag to be counted as a win.

System Prompt

Here's the exact prompt every model receives. No hidden instructions or model-specific tuning.

This evaluation system architecture ensures we model real attacker behavior in a realistic (yet safe) manner. The agent’s workflow mirrors a human pentest process:

Reconnaissance & Enumeration: The agent starts by scanning the target to understand the attack surface.
Vulnerability Analysis: Based on recon data, the LLM will hypothesize what vulnerabilities might exist. The agent can attempt various inputs to see how the application responds.
Exploit Development: Once a potential vulnerability is identified, the agent formulates an exploit. The LLM writes the payload or script, executes it, and observes the results.
Privilege Escalation: Some challenges might require multiple steps. The agent has to chain these steps, maintaining state and adjusting strategy as new information is gained.
Flag Capture: Ultimately, the goal is to retrieve the hidden flag. The agent then uses the Check Flag tool to verify it successfully got it.
Report Generation: Although not the focus of the benchmark scoring, the agent is asked to produce a final writeup. This helps ensure the agent’s chain of thought is exposed and that it truly understands the exploit path.

Throughout this process, the agent is fully autonomous, no human hints or interventions. It must decide which tool to use and when, what attacks to try, how to interpret errors, all through the LLM’s reasoning.

By aggregating results across 1,000 total runs (10 models × 10 challenges × 10 attempts), we obtain a robust performance profile for each AI agent. Key metrics analyzed include success rate per challenge, difficulty breakdown, average steps taken, and consistency across runs.

This methodology gives a side-by-side comparison of each AI model security capabilities under identical conditions.

Key performance observations

The benchmark results offer valuable insights into the current capabilities and limitations of AI models in offensive security. Here are the major findings and observations from the evaluation:

Dominance in entry-level tasks: Most models achieved near-perfect success on the Very Easy and Easy challenges. These simpler scenarios often involve straightforward vulnerabilities with clear indicators. This suggests that for basic, well-known exploit patterns, current LLMs are quite proficient. It’s akin to junior pentesters, though not sufficient for real engagements.
The medium-difficulty cliff: As soon as the challenges stepped up to Medium difficulty, we saw a significant performance drop-off for all models. Gemini 3 Pro and Claude Sonnet 4.5 did stand out by achieving slightly higher solve counts in this bracket, indicating a stronger ability to perform multi-step reasoning. However, current LLM agents hit a wall on tasks that require complex, multi-step reasoning and persistence.
Scarcity of hard successes: The Hard level challenges proved to be a nearly impossible frontier for today’s AI agents. Gemini 3 Pro solved 2 scenarios, while Claude Sonnet 4.5 solved 1 Hard challenge once. All other models scored zero on every Hard task. These Hard challenges represent scenarios that would be difficult for skilled human hackers such as chained exploits requiring clever exploit code, or extremely subtle vulnerability conditions. Hard tasks often defeat the agents by causing them to exhaust their turn limit or run in circles.
Model-specific variations: Not all LLMs are created equal, and their differences became very apparent in this evaluation. One interesting pattern was that larger, newer models (Gemini 3, Claude 4.5, GPT-5) tended to attempt more elaborate exploit strategies. Smaller or older models often attempted only very basic exploits (if those didn’t work, they had no plan B). This highlights the value of multi-metric telemetry: we could see not just success rates, but how the models behaved.
Prompt injection and safety bypass resistance: Given the nature of this experiment, the AI agents were intended to be aggressive. But we still observed some effects related to model safety features and susceptibility to misleading inputs. Some LLMs have strong built-in guardrails (to prevent misuse) which occasionally made them hesitate or sanitize their actions, even when those actions were legitimate within this sandbox.
Cost and efficiency: Some models were dramatically more efficient than others in terms of API usage. We measured token consumption and tool invocation counts. The best models solved challenges in fewer steps (saving time and API calls), while less effective ones might burn through a lot of attempts. This has business implications: an AI that finds vulnerabilities with half the tokens is cheaper to run and likely faster to respond in real-world use.

Advancing agentic AI in cybersecurity

Our OWASP Top 10 benchmark paints a picture of AI current capabilities in security operations.

On one hand, today’s agents can reliably find and exploit straightforward web vulnerabilities, automating tasks that could be trivial for a human penetration tester. On the other hand, when faced with more complex, multi-layered exploits, even the most sophisticated models stumble or fail.

Human experts still vastly outperform AI in higher-order reasoning, creative exploitation, and long-horizon planning.

This benchmark will not remain static. As new frontier models are released by frontier labs, we will integrate and test them with the same rigorous methodology, expanding the leaderboard and tracking improvements on other frameworks or sets of challenges.

In the meantime, the latest results show both encouraging improvements at the cutting edge and a reminder that the hardest challenges in cybersecurity require reasoning and ingenuity that, for now, remain a hybrid human+AI strength.

Visit our HTB AI Range Benchmarks page to interact with the results and methodology. If you’re interested in testing your own models or want to see how AI Range can benefit your security team, request a demo with our team.

Subscribe to The Inference

A dedicated newsletter on AI & cybersecurity powered by Hack The Box.
Every month, we will share the latest research highlights, product updates, and explore how the community is pushing boundaries.

SUBSCRIBE NOW