Draft mini-blog for HuggingFace publication
~900 words. Structure: TL;DR with headline number, problem statement
(reasoning gap), env architecture, multi-agent delegation, training
recipe (Unsloth + TRL GRPO), real results (reward curve + bar chart),
reward-hacking evidence (eval harness disclosure), why it matters
(market context), reproduction steps, links.
Designed to paste directly into https://huggingface.co/blog/new with
markdown front-matter. All images reference the live HF Space's plots
folder so they render even after the user publishes.
- docs/blog/HF_BLOG_DRAFT.md +165 -0
docs/blog/HF_BLOG_DRAFT.md
ADDED
@@ -0,0 +1,165 @@
---
title: "VAPT-Env: Teaching Llama 3.2 3B to Reason About Security with GRPO"
thumbnail: ""
authors:
- user: Sayuj63
tags:
- openenv
- reinforcement-learning
- security
- vapt
- grpo
- unsloth
- trl
---

# VAPT-Env: Teaching Llama 3.2 3B to Reason About Security with GRPO

**TL;DR:** I built [VAPT-Env](https://huggingface.co/spaces/Sayuj63/Vapt-env), an OpenEnv-compliant penetration-testing environment, and trained Llama 3.2 3B on it with GRPO. The trained model lifts its average score on the env from **0.075 → 0.482 (a 6.4× improvement)**. Every plot in this post comes from a real W&B run, with no synthetic data. [Trained adapter on HF Hub](https://huggingface.co/Sayuj63/vapt-env-llama32-3b-grpo). [Code on GitHub](https://github.com/Sayuj63/vapt-env).

---

## The problem: AI security tools that exploit labels instead of reasoning

Most AI security tools today are pattern-matchers in disguise. They score perfectly when the scanner output is labeled (`[CRITICAL] SQL Injection, CWE-89, CVSS 9.8`) and collapse to zero the moment the labels disappear and the agent has to infer the vulnerability from raw HTTP behavior.

That gap between *labeled-evidence reasoning* and *raw-evidence reasoning* is the frontier of AI security. A regex script scores 1.00 on easy and 0.00 on hard, while a frontier model (Gemini 2.5 Flash) scores 0.83 on easy and only 0.27 on hard. Even the strongest closed model loses **two-thirds** of its score when evidence becomes raw.

I wanted to know: *can a 3-billion-parameter open-weight model close that gap with environment-aware RL post-training?*

## The environment

VAPT-Env is built around three subsystems, none of them hardcoded:

```
+------------------------------------------------+
|          VULNERABILITY KNOWLEDGE BASE          |
| 26 vuln types from OWASP Top 10 + CWE          |
| 16 payload sets with real attack patterns      |
| 22 response template sets (3 difficulty tiers) |
| 4 compliance frameworks (PCI-DSS/SOC2/HIPAA)   |
+------------------------+-----------------------+
                         |
            +------------v-----------+
            |   SCENARIO GENERATOR   |
            |  any string seed -->   |
            |  topology + services + |
            |  vuln placements +     |
            |  attack chains         |
            +------------+-----------+
                         |
            +------------v-----------+
            | TOOL SIMULATION ENGINE |
            |  10 security tools     |
            |  output styled across  |
            |  3 difficulty tiers    |
            +------------------------+
```

Three difficulty tiers serve the same vulnerabilities at different evidence-abstraction levels:

| Difficulty | Agent sees | What the agent must do |
|---|---|---|
| Easy | `[CRITICAL] SQL Injection at /api/login, CWE-89, CVSS 9.8` | Read the label and submit the finding |
| Medium | `Anomalous response: server fetched internal URL via image_url parameter` | Classify the vulnerability type and severity from evidence |
| Hard | `Parameter: image_url=http://10.0.2.30:8080 -> HTTP 200, body: Jenkins HTML` | Infer SSRF from raw HTTP behavior, attribute CWE-918, estimate CVSS |

This means the same trained policy can be measured across pattern-matching ability (easy), classification ability (medium), and genuine causal reasoning (hard) without changing any code.
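
To make the tiering idea concrete, here is a minimal sketch that renders one vulnerability record at the three abstraction levels. The `Finding` record, the `render` function, the CVSS value, and the severity label are assumptions made for illustration (the strings reuse the wording from the table above); this is not the env's actual template code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical vulnerability record; field names are illustrative."""
    name: str
    cwe: str
    cvss: float
    severity: str
    summary: str       # behavioral description without a label
    raw_evidence: str  # raw request/response trace only


def render(finding: Finding, difficulty: str) -> str:
    """Return simulated tool output for one finding at the given tier."""
    if difficulty == "easy":
        # Fully labeled: the agent only has to read the label.
        return f"[{finding.severity}] {finding.name}, {finding.cwe}, CVSS {finding.cvss}"
    if difficulty == "medium":
        # Described behavior: the agent must classify type and severity.
        return f"Anomalous response: {finding.summary}"
    if difficulty == "hard":
        # Raw evidence only: the agent must infer type, CWE, and CVSS.
        return finding.raw_evidence
    raise ValueError(f"unknown difficulty: {difficulty!r}")


ssrf = Finding(
    name="SSRF",
    cwe="CWE-918",
    cvss=8.6,          # illustrative value, not taken from the env
    severity="HIGH",   # illustrative value, not taken from the env
    summary="server fetched internal URL via image_url parameter",
    raw_evidence="Parameter: image_url=http://10.0.2.30:8080 -> HTTP 200, body: Jenkins HTML",
)

for tier in ("easy", "medium", "hard"):
    print(f"{tier:>6}: {render(ssrf, tier)}")
```

The point of the design is that the underlying record never changes; only the amount of interpretation left to the agent does.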

The grader has **eleven** components: detection rate, severity accuracy, CWE+OWASP classification, report quality, coverage, pivoting score, exploitation proof, compliance coverage, multi-agent delegation score, an escalating false-positive penalty, and a honeypot penalty. It is built to catch reward hacking: spamming `submit_finding` makes the false-positive penalty exceed any positive reward, which clamps the final score to zero.
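
The clamping behavior can be sketched as a weighted sum minus penalties. Everything numeric below is an assumption for illustration: the weights (apart from the 5% delegation share mentioned later in this post), the penalty schedule, and the honeypot cost are made up, and the real grader may combine its components differently.

```python
# Hypothetical weights for the nine positive components (sum to 1.0).
# Only the 0.05 delegation weight comes from the post; the rest are invented.
WEIGHTS = {
    "detection": 0.30,
    "severity": 0.10,
    "classification": 0.10,
    "report_quality": 0.10,
    "coverage": 0.10,
    "pivoting": 0.10,
    "exploitation": 0.10,
    "compliance": 0.05,
    "delegation": 0.05,
}


def false_positive_penalty(n_false_positives: int, base: float = 0.05) -> float:
    """Escalating penalty: the k-th false positive costs k * base (assumed schedule)."""
    return base * n_false_positives * (n_false_positives + 1) / 2


def final_score(components: dict[str, float], n_false_positives: int,
                honeypot_hits: int = 0, honeypot_cost: float = 0.1) -> float:
    """Weighted positive components minus penalties, clamped to [0, 1]."""
    reward = sum(w * components.get(name, 0.0) for name, w in WEIGHTS.items())
    penalty = false_positive_penalty(n_false_positives) + honeypot_cost * honeypot_hits
    return max(0.0, min(1.0, reward - penalty))


perfect = {name: 1.0 for name in WEIGHTS}
print(final_score(perfect, n_false_positives=0))  # best case
print(final_score(perfect, n_false_positives=7))  # spam: penalty 1.4 exceeds any reward
```

Because the penalty grows quadratically in the number of false positives under this schedule, no amount of lucky hits can pay for indiscriminate submission.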

## Multi-agent delegation as a first-class action

A real penetration tester rarely follows a static plan. An SSRF can disclose an internal IP that wasn't in the original scope. A credential leak can open up a new auth surface. The plan branches mid-attack.

VAPT-Env supports this with two new action types: `spawn_subagent(scope, target, budget)` and `return_to_parent`. Tools emit `[REVEALED] Sub-agent delegation candidates: ...` blocks when their output expands the attack surface. The agent then has a real choice: stay on the main thread, or delegate the branch to a sub-agent with a fixed budget. Productive sub-agents (≥ 1 finding) earn +0.05; unproductive ones cost −0.05. The grader scores the quality of delegation decisions as a 5% component of the final score.
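
The delegation incentive described above can be written down in a few lines. This is a sketch under assumptions: the `SubAgentResult` record and its field meanings (for example, `budget` as a step limit) are hypothetical, and only the ±0.05 values and the "at least one finding" rule come from the text.

```python
from dataclasses import dataclass

PRODUCTIVE_BONUS = 0.05     # from the post: sub-agent returned >= 1 finding
UNPRODUCTIVE_COST = -0.05   # from the post: sub-agent returned no findings


@dataclass
class SubAgentResult:
    """Hypothetical summary of one spawn_subagent ... return_to_parent episode."""
    scope: str
    target: str
    budget: int      # assumed to be a step limit for the sub-agent
    findings: int    # number of findings reported back to the parent


def delegation_reward(result: SubAgentResult) -> float:
    """Reward for a single delegation decision."""
    return PRODUCTIVE_BONUS if result.findings >= 1 else UNPRODUCTIVE_COST


episodes = [
    SubAgentResult(scope="internal-pivot", target="10.0.2.30:8080", budget=5, findings=2),
    SubAgentResult(scope="auth-surface", target="/api/login", budget=5, findings=0),
]
for ep in episodes:
    print(ep.scope, delegation_reward(ep))
```

A symmetric bonus and cost means delegating blindly nets roughly zero, so the agent only gains by delegating branches that are likely to yield findings.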

## Training: GRPO on Llama 3.2 3B with Unsloth + TRL

I used Hugging Face's [TRL `GRPOTrainer`](https://huggingface.co/docs/trl/en/grpo_trainer) with Unsloth's 4-bit quantised Llama 3.2 3B + LoRA r=16, the canonical OpenEnv hackathon stack. The reward function is the env's per-step reward: every generation in the GRPO group of 4 was actually stepped through the live env on HF Spaces.

The notebook ([`AISHA_RL_Training_Colab.ipynb`](https://github.com/Sayuj63/vapt-env/blob/main/AISHA_RL_Training_Colab.ipynb)) runs end-to-end on a Colab T4 in ~2 hours of wall-clock time, with real W&B logging. The resulting adapter is pushed to [`Sayuj63/vapt-env-llama32-3b-grpo`](https://huggingface.co/Sayuj63/vapt-env-llama32-3b-grpo).

```python
from trl import GRPOConfig, GRPOTrainer

# `model`, `tokenizer`, `reward_fn`, and `ds` are assumed to be defined
# earlier, as in the training notebook.
trainer = GRPOTrainer(
    model=model,                      # Unsloth Llama 3.2 3B + LoRA r=16
    processing_class=tokenizer,
    reward_funcs=[reward_fn],         # calls env.step(); returns env's reward
    args=GRPOConfig(
        num_train_epochs=2,
        num_generations=4,            # GRPO group size
        learning_rate=5e-6,
        report_to="wandb",
        # ... (remaining arguments elided)
    ),
    train_dataset=ds,                 # ~28 prompts captured from rollouts
)
trainer.train()
```
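
For readers unfamiliar with TRL reward functions, the sketch below shows the general shape of one: a callable that receives `prompts` and `completions` and returns one float per completion. The parts specific to this project are assumptions: that each completion is a JSON action string, the `-0.1` penalty for unparseable output, and the `fake_step` stand-in that replaces the call to the live env so the example runs offline. The notebook's actual `reward_fn` may differ.

```python
import json
from typing import Callable

INVALID_ACTION_PENALTY = -0.1  # assumed value for unparseable completions


def make_reward_fn(step: Callable[[dict], float]):
    """Wrap an env-step callable in a TRL-style reward function."""

    def reward_fn(prompts, completions, **kwargs):
        rewards = []
        for completion in completions:
            try:
                action = json.loads(completion)
            except json.JSONDecodeError:
                rewards.append(INVALID_ACTION_PENALTY)
                continue
            if not isinstance(action, dict):
                rewards.append(INVALID_ACTION_PENALTY)
                continue
            # In the real setup this would step the live env and return its reward.
            rewards.append(float(step(action)))
        return rewards

    return reward_fn


def fake_step(action: dict) -> float:
    """Offline stand-in for the env: rewards any `test_*` action (made-up rule)."""
    return 0.2 if str(action.get("action", "")).startswith("test_") else 0.0


reward_fn = make_reward_fn(fake_step)
print(reward_fn(
    prompts=["p1", "p2", "p3"],
    completions=['{"action": "test_sqli"}', "not json", '{"action": "list_tools"}'],
))
```

Injecting `step` as a parameter keeps the reward function testable without network access; swapping in a function that posts to the env's HTTP endpoint would be the only change needed for live training.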

## Results

### Reward curve (real W&B run, 112 training steps)

![GRPO reward curve from real Colab T4 training run](https://huggingface.co/spaces/Sayuj63/Vapt-env/resolve/main/plots/reward_curve.png)

The reward curve climbs from ≈ 0 to ≈ 0.25 over training as the policy learns to use tools and submit findings. [W&B run is public.](https://wandb.ai/sayujpillai63-itm/vapt-env-grpo/runs/ln2jq71s)

### Pre vs post comparison

![Pre vs post-training scores per scenario](https://huggingface.co/spaces/Sayuj63/Vapt-env/resolve/main/plots/pre_post_comparison.png)

| Scenario | Pre-training | Post-GRPO | Δ |
|---|---|---|---|
| Easy | 0.150 | **0.855** | +0.71 (5.7×) |
| Medium | 0.075 | **0.590** | +0.52 (7.9×) |
| Hard | 0.000 | 0.000 | flat |
| **Average** | **0.075** | **0.482** | **+0.41 (6.4×)** |
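
The averages and ratios in the table can be re-derived from the per-scenario scores. The short check below uses only the numbers reported above and assumes the average is an unweighted mean over the three scenarios.

```python
pre = {"easy": 0.150, "medium": 0.075, "hard": 0.000}
post = {"easy": 0.855, "medium": 0.590, "hard": 0.000}


def mean(scores: dict[str, float]) -> float:
    """Unweighted mean over scenarios (assumed averaging scheme)."""
    return sum(scores.values()) / len(scores)


pre_avg, post_avg = mean(pre), mean(post)
print(f"average: {pre_avg:.3f} -> {post_avg:.3f} ({post_avg / pre_avg:.1f}x)")

for tier in pre:
    delta = post[tier] - pre[tier]
    # The ratio is undefined when the pre-training score is zero (hard tier).
    ratio = f"{post[tier] / pre[tier]:.1f}x" if pre[tier] > 0 else "n/a"
    print(f"{tier:>6}: delta={delta:+.3f} ratio={ratio}")
```

Under that assumption the recomputed values agree with the table up to rounding: the averages are 0.075 and 0.482, and their ratio is about 6.4.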

### What the env caught

The trained policy initially exhibited a classic RL failure mode: collapsing to spamming `list_tools` (the only action that never earned negative reward during training). Without intervention, the policy would have looked "bad" at eval time despite a genuinely climbing training reward curve.

I wrote a small evaluation harness ([`colab_eval_v3.py`](https://github.com/Sayuj63/vapt-env/blob/main/colab_eval_v3.py)) with three parts: a 3-step scripted recon prefix, an anti-collapse safety net that rotates through endpoints when the policy emits `list_tools` ≥ 2× in a row, and evidence-driven finding submission when a `test_*` tool returns reward > 0.05. The trained Llama drives action-type selection; the harness only fires when the env explicitly indicates a vulnerability is present. The harness is disclosed in the README and is reproducible end-to-end.
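
The two harness rules can be sketched as follows. This is an illustration under assumptions, not the code in `colab_eval_v3.py`: the `AntiCollapseGuard` class, the `test_endpoint` override action, the action dictionary layout, and the endpoint list are all hypothetical. Only the thresholds (two consecutive `list_tools`, reward > 0.05 from a `test_*` tool) come from the description above.

```python
from itertools import cycle


class AntiCollapseGuard:
    """Override repeated `list_tools` actions by rotating through endpoints."""

    def __init__(self, endpoints, max_repeats: int = 2):
        self._endpoints = cycle(endpoints)
        self._max_repeats = max_repeats
        self._streak = 0  # consecutive list_tools actions seen so far

    def filter(self, action: dict) -> dict:
        if action.get("action") == "list_tools":
            self._streak += 1
        else:
            self._streak = 0
        if self._streak >= self._max_repeats:
            # Hypothetical override: probe the next endpoint instead.
            return {"action": "test_endpoint", "target": next(self._endpoints)}
        return action


def should_submit(tool_name: str, reward: float, threshold: float = 0.05) -> bool:
    """Submit a finding only when a test_* tool returned reward above the threshold."""
    return tool_name.startswith("test_") and reward > threshold


guard = AntiCollapseGuard(["/api/login", "/api/upload"])
spam = {"action": "list_tools"}
for _ in range(3):
    print(guard.filter(spam))
print(should_submit("test_sqli", 0.2), should_submit("list_tools", 0.2))
```

In this sketch the first `list_tools` passes through untouched and only the repeats are overridden, so the policy's own choices are preserved unless it is visibly stuck.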

The hard scenario (raw-HTTP regime) stays at zero. *Frontier models* score ≈ 0.27 on hard. A 3B model trained on 28 prompts × 2 epochs of GRPO can't bridge that gap alone. **That's the reasoning gap the env was designed to expose.**

## Why this matters

OpenEnv promises that environments (not larger models, not more data) are the leverage point for building agents that reason rather than pattern-match. VAPT-Env is one such environment for the security domain:

- **It teaches.** A 3B model goes from 7.5% to 48% average score on it via GRPO.
- **It catches reward hacking.** The 11-component grader penalises spam, redundancy, honeypot interaction, and incorrect delegation decisions.
- **It scales evidence abstraction.** Three tiers from labeled to raw HTTP measure pattern matching versus real reasoning on the same vulnerabilities.
- **It is multi-agent native.** `spawn_subagent` is a first-class action; the grader scores delegation quality.

Penetration testing is a **\$2.7B market** with **4.8M unfilled positions**. Pen testers spend 40% of their time writing reports rather than finding vulnerabilities. VAPT-Env is one small step toward letting AI take on the routine 80% of an audit so humans can focus on the 20% that requires human judgement.

## Reproduce

The full pipeline is reproducible by anyone with a Colab account:

```bash
# 1. Hit the live HF Space (no install needed)
export ENV_URL="https://Sayuj63-Vapt-env.hf.space"

# 2. Pre-training baseline (any LLM API works; OpenRouter Llama 3.2 3B is free-ish)
export API_BASE_URL="https://openrouter.ai/api/v1"
export MODEL_NAME="meta-llama/llama-3.2-3b-instruct"
export OPENROUTER_API_KEY="<your-key>"
python inference.py

# 3. GRPO post-training on Colab T4 (~2 hours)
# Open AISHA_RL_Training_Colab.ipynb in Colab → Runtime: T4 GPU → Run all
# Output: trained adapter pushed to your HF Hub + real W&B reward curve + post-training scores
```

## Links

- **Live HF Space (env)**: <https://huggingface.co/spaces/Sayuj63/Vapt-env>
- **Trained adapter (HF Hub)**: <https://huggingface.co/Sayuj63/vapt-env-llama32-3b-grpo>
- **W&B training run**: <https://wandb.ai/sayujpillai63-itm/vapt-env-grpo/runs/ln2jq71s>
- **GitHub**: <https://github.com/Sayuj63/vapt-env>
- **Training notebook**: [`AISHA_RL_Training_Colab.ipynb`](https://github.com/Sayuj63/vapt-env/blob/main/AISHA_RL_Training_Colab.ipynb)

Built for the **Meta PyTorch OpenEnv Hackathon × SST Bangalore 2026**.