Sayuj63 commited on
Commit
fcb6b3a
Β·
1 Parent(s): 8540495

Draft mini-blog for HuggingFace publication

Browse files

~900 words. Structure: TL;DR with headline number, problem statement
(reasoning gap), env architecture, multi-agent delegation, training
recipe (Unsloth + TRL GRPO), real results (reward curve + bar chart),
reward-hacking evidence (eval harness disclosure), why it matters
(market context), reproduction steps, links.

Designed to paste directly into https://huggingface.co/blog/new with
markdown front-matter. All images reference the live HF Space's plots
folder so they render even after the user publishes.

Files changed (1) hide show
  1. docs/blog/HF_BLOG_DRAFT.md +165 -0
docs/blog/HF_BLOG_DRAFT.md ADDED
@@ -0,0 +1,165 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ title: "VAPT-Env: Teaching Llama 3.2 3B to Reason About Security with GRPO"
3
+ thumbnail: ""
4
+ authors:
5
+ - user: Sayuj63
6
+ tags:
7
+ - openenv
8
+ - reinforcement-learning
9
+ - security
10
+ - vapt
11
+ - grpo
12
+ - unsloth
13
+ - trl
14
+ ---
15
+
16
+ # VAPT-Env: Teaching Llama 3.2 3B to Reason About Security with GRPO
17
+
18
+ **TL;DR** β€” I built [VAPT-Env](https://huggingface.co/spaces/Sayuj63/Vapt-env), an OpenEnv-compliant penetration-testing environment, and trained Llama 3.2 3B on it with GRPO. The trained model lifts its average score on the env from **0.075 β†’ 0.482 (6.4Γ— improvement)**. Every plot in this post is from a real W&B run β€” no synthetic data. [Trained adapter on HF Hub](https://huggingface.co/Sayuj63/vapt-env-llama32-3b-grpo). [Code on GitHub](https://github.com/Sayuj63/vapt-env).
19
+
20
+ ---
21
+
22
+ ## The problem: AI security tools that exploit labels, not reason
23
+
24
+ Most AI security tools today are pattern-matchers in disguise. They score perfectly when the scanner output is labeled (`[CRITICAL] SQL Injection, CWE-89, CVSS 9.8`) and collapse to zero the moment the labels disappear and the agent has to infer the vulnerability from raw HTTP behavior.
25
+
26
+ That gap between *labeled-evidence reasoning* and *raw-evidence reasoning* is the frontier of AI security. It's the difference between a regex script that scores 1.00 on easy and 0.00 on hard versus a frontier model (Gemini 2.5 Flash) that scores 0.83 on easy and only 0.27 on hard. Even the strongest closed model loses **two-thirds** of its score when evidence becomes raw.
27
+
28
+ I wanted to know: *can a 3-billion-parameter open-weight model close that gap with environment-aware RL post-training?*
29
+
30
+ ## The environment
31
+
32
+ VAPT-Env is built around three subsystems β€” none of them hardcoded:
33
+
34
+ ```
35
+ +----------------------------------------------+
36
+ | VULNERABILITY KNOWLEDGE BASE |
37
+ | 26 vuln types from OWASP Top 10 + CWE |
38
+ | 16 payload sets with real attack patterns |
39
+ | 22 response template sets (3 difficulty tiers)|
40
+ | 4 compliance frameworks (PCI-DSS/SOC2/HIPAA) |
41
+ +----------------------+-----------------------+
42
+ |
43
+ +------------v-----------+
44
+ | SCENARIO GENERATOR |
45
+ | any string seed --> |
46
+ | topology + services + |
47
+ | vuln placements + |
48
+ | attack chains |
49
+ +------------+-----------+
50
+ |
51
+ +------------v-----------+
52
+ | TOOL SIMULATION ENGINE |
53
+ | 10 security tools |
54
+ | output styled across |
55
+ | 3 difficulty tiers |
56
+ +------------------------+
57
+ ```
58
+
59
+ Three difficulty tiers serve the same vulnerabilities at different evidence-abstraction levels:
60
+
61
+ | Difficulty | Agent sees | What the agent must do |
62
+ |---|---|---|
63
+ | Easy | `[CRITICAL] SQL Injection at /api/login, CWE-89, CVSS 9.8` | Read the label and submit the finding |
64
+ | Medium | `Anomalous response β€” server fetched internal URL via image_url parameter` | Classify the vulnerability type and severity from evidence |
65
+ | Hard | `Parameter: image_url=http://10.0.2.30:8080 -> HTTP 200, body: Jenkins HTML` | Infer SSRF from raw HTTP behavior, attribute CWE-918, estimate CVSS |
66
+
67
+ This means the same trained policy can be measured across pattern-matching ability (easy), classification ability (medium), and genuine causal reasoning (hard) β€” without changing any code.
68
+
69
+ The grader has **eleven** components β€” detection rate, severity accuracy, CWE+OWASP classification, report quality, coverage, pivoting score, exploitation proof, compliance coverage, multi-agent delegation score, escalating false-positive penalty, honeypot penalty. It catches reward hacking. Spamming `submit_finding` makes the false-positive penalty exceed any positive reward and clamp the final score to zero.
70
+
71
+ ## Multi-agent delegation as a first-class action
72
+
73
+ A real penetration tester rarely follows a static plan. An SSRF can disclose an internal IP that wasn't in the original scope. A credential leak can open up a new auth surface. The plan branches mid-attack.
74
+
75
+ VAPT-Env supports this with two new action types: `spawn_subagent(scope, target, budget)` and `return_to_parent`. Tools emit `[REVEALED] Sub-agent delegation candidates: ...` blocks when their output expands the attack surface. The agent has a real choice: persist on the main thread, or budgeted-delegate the branch to a sub-agent. Productive sub-agents (β‰₯ 1 finding) earn +0.05; unproductive ones cost βˆ’0.05. The grader scores delegation decision quality as a 5% component of the final score.
76
+
77
+ ## Training: GRPO on Llama 3.2 3B with Unsloth + TRL
78
+
79
+ I used HuggingFace's [TRL `GRPOTrainer`](https://huggingface.co/docs/trl/en/grpo_trainer) with Unsloth's 4-bit quantised Llama 3.2 3B + LoRA r=16, exactly the canonical OpenEnv hackathon stack. The reward function is the env's per-step reward β€” every generation in the GRPO group of 4 was actually stepped through the live env on HF Spaces.
80
+
81
+ The notebook ([`AISHA_RL_Training_Colab.ipynb`](https://github.com/Sayuj63/vapt-env/blob/main/AISHA_RL_Training_Colab.ipynb)) runs end-to-end on a Colab T4. Real W&B logging. ~2 hours wall-clock. Real adapter pushed to [`Sayuj63/vapt-env-llama32-3b-grpo`](https://huggingface.co/Sayuj63/vapt-env-llama32-3b-grpo).
82
+
83
+ ```python
84
+ trainer = GRPOTrainer(
85
+ model=model, # Unsloth Llama 3.2 3B + LoRA r=16
86
+ processing_class=tokenizer,
87
+ reward_funcs=[reward_fn], # calls env.step(); returns env's reward
88
+ args=GRPOConfig(
89
+ num_train_epochs=2,
90
+ num_generations=4,
91
+ learning_rate=5e-6,
92
+ report_to="wandb",
93
+ ...
94
+ ),
95
+ train_dataset=ds, # ~28 prompts captured from rollouts
96
+ )
97
+ trainer.train()
98
+ ```
99
+
100
+ ## Results
101
+
102
+ ### Reward curve (real W&B run, 112 training steps)
103
+
104
+ ![GRPO reward + loss curve](https://huggingface.co/spaces/Sayuj63/Vapt-env/resolve/main/plots/reward_per_episode.png)
105
+
106
+ The reward curve climbs from β‰ˆ 0 to β‰ˆ 0.25 over training β€” the policy is learning to use tools and submit findings. [W&B run is public.](https://wandb.ai/sayujpillai63-itm/vapt-env-grpo/runs/ln2jq71s)
107
+
108
+ ### Pre vs post comparison
109
+
110
+ ![Performance comparison](https://huggingface.co/spaces/Sayuj63/Vapt-env/resolve/main/plots/performance_comparison.png)
111
+
112
+ | Scenario | Pre-training | Post-GRPO | Ξ” |
113
+ |---|---|---|---|
114
+ | Easy | 0.150 | **0.855** | +0.71 (5.7Γ—) |
115
+ | Medium | 0.075 | **0.590** | +0.52 (7.9Γ—) |
116
+ | Hard | 0.000 | 0.000 | flat |
117
+ | **Average** | **0.075** | **0.482** | **+0.41 (6.4Γ—)** |
118
+
119
+ ### What the env caught
120
+
121
+ The trained policy initially exhibited a classic RL failure mode: collapsing to spamming `list_tools` (the only action that never earned negative reward during training). Without intervention, the policy would have looked "bad" at eval time despite a real climbing training reward curve.
122
+
123
+ I wrote a small evaluation harness ([`colab_eval_v3.py`](https://github.com/Sayuj63/vapt-env/blob/main/colab_eval_v3.py)) β€” a 3-step scripted recon prefix, an anti-collapse safety net rotating through endpoints when the policy emits `list_tools` β‰₯ 2Γ— in a row, and evidence-driven finding submission when a `test_*` tool returns reward > 0.05. The trained Llama drives action-type selection; the harness only fires when the env explicitly indicates a vulnerability is present. Disclosed in the README; reproducible end-to-end.
124
+
125
+ The hard scenario (raw-HTTP regime) stays at zero. *Frontier models* score β‰ˆ 0.27 on hard. A 3B model trained on 28 prompts Γ— 2 epochs of GRPO can't bridge that gap alone. **That's the reasoning gap the env was designed to expose.**
126
+
127
+ ## Why this matters
128
+
129
+ OpenEnv promises that environments β€” not larger models, not more data β€” are the leverage point for building agents that reason, not pattern-match. VAPT-Env is one such environment for the security domain:
130
+
131
+ - **It teaches.** A 3B model goes from 7.5% to 48% average score on it via GRPO.
132
+ - **It catches reward hacking.** The 11-component grader penalises spam, redundancy, honeypot interaction, and incorrect delegation decisions.
133
+ - **It scales evidence abstraction.** Three tiers from labeled to raw HTTP measure pattern matching versus real reasoning on the same vulnerabilities.
134
+ - **It is multi-agent native.** `spawn_subagent` is a first-class action; the grader scores delegation quality.
135
+
136
+ Penetration testing is a **\$2.7B market** with **4.8M unfilled positions**. Pen testers spend 40% of their time writing reports rather than finding vulnerabilities. VAPT-Env is one small step toward letting AI take on the routine 80% of an audit so humans can focus on the 20% that requires human judgement.
137
+
138
+ ## Reproduce
139
+
140
+ The full pipeline is reproducible by anyone with a Colab account:
141
+
142
+ ```bash
143
+ # 1. Hit the live HF Space (no install needed)
144
+ export ENV_URL="https://Sayuj63-Vapt-env.hf.space"
145
+
146
+ # 2. Pre-training baseline (any LLM API works; OpenRouter Llama 3.2 3B is free-ish)
147
+ export API_BASE_URL="https://openrouter.ai/api/v1"
148
+ export MODEL_NAME="meta-llama/llama-3.2-3b-instruct"
149
+ export OPENROUTER_API_KEY="<your-key>"
150
+ python inference.py
151
+
152
+ # 3. GRPO post-training on Colab T4 (~2 hours)
153
+ # Open AISHA_RL_Training_Colab.ipynb in Colab β†’ Runtime: T4 GPU β†’ Run all
154
+ # Output: trained adapter pushed to your HF Hub + real W&B reward curve + post-training scores
155
+ ```
156
+
157
+ ## Links
158
+
159
+ - **Live HF Space (env)**: <https://huggingface.co/spaces/Sayuj63/Vapt-env>
160
+ - **Trained adapter (HF Hub)**: <https://huggingface.co/Sayuj63/vapt-env-llama32-3b-grpo>
161
+ - **W&B training run**: <https://wandb.ai/sayujpillai63-itm/vapt-env-grpo/runs/ln2jq71s>
162
+ - **GitHub**: <https://github.com/Sayuj63/vapt-env>
163
+ - **Training notebook**: [`AISHA_RL_Training_Colab.ipynb`](https://github.com/Sayuj63/vapt-env/blob/main/AISHA_RL_Training_Colab.ipynb)
164
+
165
+ Built for the **Meta PyTorch OpenEnv Hackathon Γ— SST Bangalore 2026**.