sukritvemula committed · Commit cb4f42d · verified · 1 Parent(s): b39aef8

Upload WebScrapeAgent_Training_v2.ipynb with huggingface_hub

Files changed (1)
  1. WebScrapeAgent_Training_v2.ipynb +320 -0
WebScrapeAgent_Training_v2.ipynb ADDED
@@ -0,0 +1,320 @@
+ {
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {"provenance": [], "gpuType": "T4"},
+ "kernelspec": {"name": "python3", "display_name": "Python 3"},
+ "language_info": {"name": "python"},
+ "accelerator": "GPU"
+ },
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# 🕷️ WebScrapeAgent v2: Fine-tune Qwen2.5-7B for Autonomous Web Scraping\n",
+ "\n",
+ "Trains **Qwen2.5-7B-Instruct** with **Unsloth + QLoRA** to scrape **any website**, including React/Next.js SPAs, sites behind Cloudflare/Akamai/DataDome, pages with shadow DOM, infinite scroll, and JS-rendered content.\n",
+ "\n",
+ "### What the model learns:\n",
+ "| Skill | Examples |\n",
+ "|---|---|\n",
+ "| **HTML → JSON extraction** | Tables, nested structures, malformed HTML, data attributes |\n",
+ "| **React/Next.js/Vue/Angular** | `__NEXT_DATA__`, `__INITIAL_STATE__`, XHR interception, hydration wait |\n",
+ "| **Anti-bot bypass** | Cloudflare (TLS + JS challenge), Akamai (behavioral), DataDome (fingerprint), PerimeterX (cookie replay) |\n",
+ "| **Strategy escalation** | HTTP → curl_cffi → stealth browser → residential proxy → CAPTCHA service |\n",
+ "| **Shadow DOM / Web Components** | shadowRoot traversal, `>>>` piercing selector, JS extraction |\n",
+ "| **Authentication** | Cookie replay, form login, token injection, browser profile loading |\n",
+ "| **Error recovery** | 403→strategy switch, timeout→JS fallback, rate limit→backoff, CAPTCHA→graceful degradation |\n",
+ "\n",
+ "**Dataset**: [sukritvemula/webscrape-agent-training-data](https://huggingface.co/datasets/sukritvemula/webscrape-agent-training-data) (45K+ examples)\n",
+ "\n",
+ "**Hardware**: Free Colab T4 (16GB VRAM). Training takes ~2-4 hours.\n",
+ "\n",
+ "### Training recipe (paper-backed):\n",
+ "- ScrapeGraphAI-100k (arXiv:2602.15189): QLoRA + completion-only loss → Key F1=0.887\n",
+ "- BrowserAgent (arXiv:2510.10666): Multi-turn SFT on Qwen2.5 → +20% over baselines\n",
+ "- A3-Annotators (arXiv:2604.07776): Assistant-only loss + reasoning chains → 41.5% WebArena"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": ["## 1. Install"]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%%capture\n",
+ "!pip install unsloth\n",
+ "!pip install --no-deps trl peft accelerate bitsandbytes xformers"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": ["## 2. Config: edit your username here"]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# ============ EDIT THIS ============\n",
+ "HF_USERNAME = \"sukritvemula\"  # your HF username\n",
+ "OUTPUT_MODEL = f\"{HF_USERNAME}/WebScrapeAgent-7B-v2\"  # where the model gets pushed\n",
+ "# ===================================\n",
+ "\n",
+ "MODEL_NAME = \"unsloth/Qwen2.5-7B-Instruct-bnb-4bit\"  # pre-quantized, fast load\n",
+ "DATASET = \"sukritvemula/webscrape-agent-training-data\"  # 45K+ examples\n",
+ "\n",
+ "# hyperparams (from ScrapeGraphAI + BrowserAgent papers)\n",
+ "MAX_SEQ = 4096  # covers 95%+ of examples\n",
+ "LORA_R = 32  # rank 32 for complex structured output\n",
+ "LORA_A = 32  # alpha = rank\n",
+ "LR = 1e-4  # LoRA/QLoRA typically uses a ~10x higher LR than full fine-tuning\n",
+ "EPOCHS = 2\n",
+ "BS = 1  # per-device (T4-safe)\n",
+ "GA = 16  # gradient accumulation → effective batch = BS * GA = 16"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from huggingface_hub import login\n",
+ "login()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": ["## 3. Load model + LoRA"]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import unsloth  # import first so Unsloth can patch transformers/trl\n",
+ "import torch\n",
+ "from unsloth import FastLanguageModel, is_bfloat16_supported\n",
+ "from unsloth.chat_templates import get_chat_template, train_on_responses_only\n",
+ "\n",
+ "print(f\"GPU: {torch.cuda.get_device_name()} | VRAM: {torch.cuda.get_device_properties(0).total_memory/1e9:.1f}GB\")\n",
+ "\n",
+ "model, tokenizer = FastLanguageModel.from_pretrained(\n",
+ "    model_name=MODEL_NAME, max_seq_length=MAX_SEQ, dtype=None, load_in_4bit=True,\n",
+ ")\n",
+ "\n",
+ "model = FastLanguageModel.get_peft_model(\n",
+ "    model, r=LORA_R,\n",
+ "    target_modules=[\"q_proj\",\"k_proj\",\"v_proj\",\"o_proj\",\"gate_proj\",\"up_proj\",\"down_proj\"],\n",
+ "    lora_alpha=LORA_A, lora_dropout=0.0, bias=\"none\",\n",
+ "    use_gradient_checkpointing=\"unsloth\", random_state=42,\n",
+ ")\n",
+ "\n",
+ "tokenizer = get_chat_template(tokenizer, chat_template=\"qwen-2.5\")\n",
+ "\n",
+ "t = sum(p.numel() for p in model.parameters() if p.requires_grad)\n",
+ "a = sum(p.numel() for p in model.parameters())\n",
+ "print(f\"Trainable: {t:,} / {a:,} ({t/a*100:.2f}%)\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": ["## 4. Load & format dataset"]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from datasets import load_dataset\n",
+ "\n",
+ "ds = load_dataset(DATASET, split=\"train\")\n",
+ "print(f\"Examples: {len(ds)}\")\n",
+ "\n",
+ "def fmt(examples):\n",
+ "    texts = []\n",
+ "    for msgs in examples[\"messages\"]:\n",
+ "        try:\n",
+ "            texts.append(tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=False))\n",
+ "        except Exception:  # fall back to hand-built ChatML if the template rejects the conversation\n",
+ "            t = \"\"\n",
+ "            for m in msgs:\n",
+ "                t += f\"<|im_start|>{m['role']}\\n{m['content']}<|im_end|>\\n\"\n",
+ "            texts.append(t)\n",
+ "    return {\"text\": texts}\n",
+ "\n",
+ "ds = ds.map(fmt, batched=True, num_proc=2, remove_columns=ds.column_names)\n",
+ "\n",
+ "# drop examples longer than MAX_SEQ tokens\n",
+ "def filt(ex):\n",
+ "    return len(tokenizer(ex[\"text\"], truncation=False)[\"input_ids\"]) <= MAX_SEQ\n",
+ "\n",
+ "n = len(ds)\n",
+ "ds = ds.filter(filt, num_proc=2)\n",
+ "print(f\"After length filter: {len(ds)}/{n} ({len(ds)/n*100:.1f}%)\")\n",
+ "print(f\"Sample: {ds[0]['text'][:200]}...\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 5. Train\n",
+ "\n",
+ "Completion-only loss (assistant tokens only): **+15% schema compliance** per the ScrapeGraphAI paper."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from trl import SFTTrainer, SFTConfig\n",
+ "\n",
+ "trainer = SFTTrainer(\n",
+ "    model=model, tokenizer=tokenizer, train_dataset=ds,\n",
+ "    args=SFTConfig(\n",
+ "        output_dir=\"./checkpoints\",\n",
+ "        num_train_epochs=EPOCHS,\n",
+ "        per_device_train_batch_size=BS,\n",
+ "        gradient_accumulation_steps=GA,\n",
+ "        optim=\"adamw_8bit\",\n",
+ "        learning_rate=LR,\n",
+ "        weight_decay=0.01,\n",
+ "        lr_scheduler_type=\"cosine\",\n",
+ "        warmup_ratio=0.03,\n",
+ "        max_grad_norm=0.3,\n",
+ "        fp16=not is_bfloat16_supported(),\n",
+ "        bf16=is_bfloat16_supported(),\n",
+ "        max_seq_length=MAX_SEQ,\n",
+ "        dataset_text_field=\"text\",\n",
+ "        packing=False,\n",
+ "        logging_steps=10,\n",
+ "        logging_first_step=True,\n",
+ "        save_strategy=\"steps\",\n",
+ "        save_steps=500,\n",
+ "        save_total_limit=2,\n",
+ "        push_to_hub=True,\n",
+ "        hub_model_id=OUTPUT_MODEL,\n",
+ "        hub_strategy=\"end\",\n",
+ "        seed=42,\n",
+ "        dataset_num_proc=2,\n",
+ "    ),\n",
+ ")\n",
+ "\n",
+ "# CRITICAL: train only on assistant responses (ChatML turn markers passed explicitly)\n",
+ "trainer = train_on_responses_only(trainer, instruction_part=\"<|im_start|>user\\n\", response_part=\"<|im_start|>assistant\\n\")\n",
+ "print(f\"Ready: {EPOCHS} epochs, effective batch {BS*GA}, lr {LR}\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "stats = trainer.train()\n",
+ "print(f\"\\n✅ Done! Loss: {stats.training_loss:.4f}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": ["## 6. Save & push"]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "model.save_pretrained(\"lora-adapter\")\n",
+ "tokenizer.save_pretrained(\"lora-adapter\")\n",
+ "\n",
+ "print(\"Pushing merged 16-bit model...\")\n",
+ "model.push_to_hub_merged(OUTPUT_MODEL, tokenizer, save_method=\"merged_16bit\")\n",
+ "\n",
+ "print(\"Pushing LoRA adapter...\")\n",
+ "model.push_to_hub(OUTPUT_MODEL + \"-lora\")\n",
+ "tokenizer.push_to_hub(OUTPUT_MODEL + \"-lora\")  # push the tokenizer separately\n",
+ "\n",
+ "print(f\"\\n✅ Model: https://huggingface.co/{OUTPUT_MODEL}\")\n",
+ "print(f\"✅ LoRA: https://huggingface.co/{OUTPUT_MODEL}-lora\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": ["## 7. Test: extraction + anti-bot reasoning"]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "FastLanguageModel.for_inference(model)\n",
+ "\n",
+ "# Test 1: React SPA extraction\n",
+ "msgs = [\n",
+ "    {\"role\": \"system\", \"content\": \"You are WebScrapeAgent. You can scrape any website including React/JS SPAs and sites with anti-bot protection.\"},\n",
+ "    {\"role\": \"user\", \"content\": \"Task: Extract product data from a React e-commerce site\\nURL: https://shop.example.com/products\\nThe site is a Next.js SPA.\"},\n",
+ "]\n",
+ "inputs = tokenizer.apply_chat_template(msgs, tokenize=True, add_generation_prompt=True, return_tensors=\"pt\").to(\"cuda\")\n",
+ "out = model.generate(input_ids=inputs, max_new_tokens=512, temperature=0.3, do_sample=True)\n",
+ "print(\"=== React SPA Test ===\")\n",
+ "print(tokenizer.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Test 2: Anti-bot recovery\n",
+ "msgs2 = [\n",
+ "    {\"role\": \"system\", \"content\": \"You are WebScrapeAgent. Available actions: NAVIGATE, NAVIGATE_BROWSER, SWITCH_STRATEGY, EXECUTE_JS, INTERCEPT_REQUESTS, RETURN_RESULT. Think in <thought> blocks.\"},\n",
+ "    {\"role\": \"user\", \"content\": \"Task: Extract prices\\nURL: https://store.example.com/deals\"},\n",
+ "    {\"role\": \"assistant\", \"content\": \"<thought>Let me try direct HTTP first.</thought>\\n\\nACTION: NAVIGATE\\n```json\\n{\\\"url\\\": \\\"https://store.example.com/deals\\\"}\\n```\"},\n",
+ "    {\"role\": \"user\", \"content\": \"Observation: HTTP 403 Forbidden\\nHeaders: cf-ray: abc123, server: cloudflare, set-cookie: _abck=xyz...\\n\\n<html><body>Access Denied</body></html>\"},\n",
+ "]\n",
+ "inputs = tokenizer.apply_chat_template(msgs2, tokenize=True, add_generation_prompt=True, return_tensors=\"pt\").to(\"cuda\")\n",
+ "out = model.generate(input_ids=inputs, max_new_tokens=512, temperature=0.3, do_sample=True)\n",
+ "print(\"=== Anti-Bot Recovery Test ===\")\n",
+ "print(tokenizer.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 8. (Optional) Export to GGUF for local deployment\n",
+ "Uncomment to create a quantized GGUF file for llama.cpp / Ollama."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# model.save_pretrained_gguf(\"gguf-model\", tokenizer, quantization_method=\"q4_k_m\")\n",
+ "# model.push_to_hub_gguf(OUTPUT_MODEL + \"-GGUF\", tokenizer, quantization_method=\"q4_k_m\")"
+ ]
+ }
+ ]
+ }
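
The notebook's training step relies on completion-only loss, where only assistant tokens contribute to the loss. The sketch below illustrates that idea on plain Python lists. It is not Unsloth's implementation: the helper names `find_sub` and `mask_non_assistant` and the toy token ids are hypothetical, chosen only to show how non-assistant positions can be set to `-100`, the label value PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def find_sub(seq, sub, start=0):
    """Return the first index >= start where sub occurs in seq, or -1."""
    for i in range(start, len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return -1


def mask_non_assistant(input_ids, response_marker, end_marker):
    """Build labels that keep only assistant spans.

    An assistant span starts right after `response_marker` and runs through
    the next `end_marker` (inclusive). Every other position is IGNORE_INDEX.
    """
    labels = [IGNORE_INDEX] * len(input_ids)
    pos = 0
    while True:
        start = find_sub(input_ids, response_marker, pos)
        if start == -1:
            break
        begin = start + len(response_marker)
        end = find_sub(input_ids, end_marker, begin)
        # If the turn is never closed, keep everything up to the end.
        stop = len(input_ids) if end == -1 else end + len(end_marker)
        labels[begin:stop] = input_ids[begin:stop]
        pos = stop
    return labels


# Toy token ids (hypothetical): 1 = <|im_start|>, 2 = <|im_end|>,
# 10 = "user", 11 = "assistant", 12 = newline, 100+ = message content.
ids = [1, 10, 12, 101, 102, 2, 1, 11, 12, 201, 202, 2]
labels = mask_non_assistant(ids, response_marker=[1, 11, 12], end_marker=[2])
print(labels)
```

In a multi-turn agent trace, tool observations arrive as user turns, so they are masked the same way and the model is only trained to produce its own thoughts and actions.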