sukritvemula/WebScrapeAgent-7B-v1

+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+  "colab": {"provenance": [], "gpuType": "T4"},
+  "kernelspec": {"name": "python3", "display_name": "Python 3"},
+  "language_info": {"name": "python"},
+  "accelerator": "GPU"
+ },
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 🕷️ WebScrapeAgent v2 — Fine-tune Qwen2.5-7B for Autonomous Web Scraping\n",
+    "\n",
+    "Trains **Qwen2.5-7B-Instruct** with **Unsloth + QLoRA** to scrape **any website** — including React/Next.js SPAs, sites behind Cloudflare/Akamai/DataDome, pages with shadow DOM, infinite scroll, and JS-rendered content.\n",
+    "\n",
+    "### What the model learns:\n",
+    "| Skill | Examples |\n",
+    "|---|---|\n",
+    "| **HTML → JSON extraction** | Tables, nested structures, malformed HTML, data attributes |\n",
+    "| **React/Next.js/Vue/Angular** | `__NEXT_DATA__`, `__INITIAL_STATE__`, XHR interception, hydration wait |\n",
+    "| **Anti-bot bypass** | Cloudflare (TLS + JS challenge), Akamai (behavioral), DataDome (fingerprint), PerimeterX (cookie replay) |\n",
+    "| **Strategy escalation** | HTTP → curl_cffi → stealth browser → residential proxy → CAPTCHA service |\n",
+    "| **Shadow DOM / Web Components** | shadowRoot traversal, `>>>` piercing selector, JS extraction |\n",
+    "| **Authentication** | Cookie replay, form login, token injection, browser profile loading |\n",
+    "| **Error recovery** | 403→strategy switch, timeout→JS fallback, rate limit→backoff, CAPTCHA→graceful degradation |\n",
+    "\n",
+    "**Dataset**: [sukritvemula/webscrape-agent-training-data](https://huggingface.co/datasets/sukritvemula/webscrape-agent-training-data) (45K+ examples)\n",
+    "\n",
+    "**Hardware**: Free Colab T4 (16GB VRAM). Training takes ~2-4 hours.\n",
+    "\n",
+    "### Training recipe (paper-backed):\n",
+    "- ScrapeGraphAI-100k (arXiv:2602.15189): QLoRA + completion-only loss → Key F1=0.887\n",
+    "- BrowserAgent (arXiv:2510.10666): Multi-turn SFT on Qwen2.5 → +20% over baselines\n",
+    "- A3-Annotators (arXiv:2604.07776): Assistant-only loss + reasoning chains → 41.5% WebArena"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": ["## 1. Install"]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%capture\n",
+    "!pip install unsloth\n",
+    "!pip install --no-deps trl peft accelerate bitsandbytes xformers"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": ["## 2. Config — edit your username here"]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# ============ EDIT THIS ============\n",
+    "HF_USERNAME = \"sukritvemula\"                              # your HF username\n",
+    "OUTPUT_MODEL = f\"{HF_USERNAME}/WebScrapeAgent-7B-v2\"      # where model gets pushed\n",
+    "# ===================================\n",
+    "\n",
+    "MODEL_NAME = \"unsloth/Qwen2.5-7B-Instruct-bnb-4bit\"      # pre-quantized, fast load\n",
+    "DATASET    = \"sukritvemula/webscrape-agent-training-data\"  # 45K+ examples\n",
+    "\n",
+    "# hyperparams (from ScrapeGraphAI + BrowserAgent papers)\n",
+    "MAX_SEQ   = 4096   # covers 95%+ of examples\n",
+    "LORA_R    = 32     # rank 32 for complex structured output\n",
+    "LORA_A    = 32     # alpha = rank\n",
+    "LR        = 1e-4   # QLoRA typically uses ~10x the LR of full fine-tuning\n",
+    "EPOCHS    = 2\n",
+    "BS        = 1      # per-device (T4-safe)\n",
+    "GA        = 16     # gradient accumulation → effective batch = 16"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from huggingface_hub import login\n",
+    "login()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": ["## 3. Load model + LoRA"]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import unsloth\n",
+    "import torch\n",
+    "from unsloth import FastLanguageModel, is_bfloat16_supported\n",
+    "from unsloth.chat_templates import get_chat_template, train_on_responses_only\n",
+    "\n",
+    "print(f\"GPU: {torch.cuda.get_device_name()} | VRAM: {torch.cuda.get_device_properties(0).total_memory/1e9:.1f}GB\")\n",
+    "\n",
+    "model, tokenizer = FastLanguageModel.from_pretrained(\n",
+    "    model_name=MODEL_NAME, max_seq_length=MAX_SEQ, dtype=None, load_in_4bit=True,\n",
+    ")\n",
+    "\n",
+    "model = FastLanguageModel.get_peft_model(\n",
+    "    model, r=LORA_R,\n",
+    "    target_modules=[\"q_proj\",\"k_proj\",\"v_proj\",\"o_proj\",\"gate_proj\",\"up_proj\",\"down_proj\"],\n",
+    "    lora_alpha=LORA_A, lora_dropout=0.0, bias=\"none\",\n",
+    "    use_gradient_checkpointing=\"unsloth\", random_state=42,\n",
+    ")\n",
+    "\n",
+    "tokenizer = get_chat_template(tokenizer, chat_template=\"qwen-2.5\")\n",
+    "\n",
+    "t = sum(p.numel() for p in model.parameters() if p.requires_grad)\n",
+    "a = sum(p.numel() for p in model.parameters())\n",
+    "print(f\"Trainable: {t:,} / {a:,} ({t/a*100:.2f}%)\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": ["## 4. Load & format dataset"]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from datasets import load_dataset\n",
+    "\n",
+    "ds = load_dataset(DATASET, split=\"train\")\n",
+    "print(f\"Examples: {len(ds)}\")\n",
+    "\n",
+    "def fmt(examples):\n",
+    "    texts = []\n",
+    "    for msgs in examples[\"messages\"]:\n",
+    "        try:\n",
+    "            texts.append(tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=False))\n",
+    "        except Exception:  # fall back to manual ChatML formatting\n",
+    "            t = \"\"\n",
+    "            for m in msgs:\n",
+    "                t += f\"<|im_start|>{m['role']}\\n{m['content']}<|im_end|>\\n\"\n",
+    "            texts.append(t)\n",
+    "    return {\"text\": texts}\n",
+    "\n",
+    "ds = ds.map(fmt, batched=True, num_proc=2, remove_columns=ds.column_names)\n",
+    "\n",
+    "# filter to max seq length\n",
+    "def filt(ex):\n",
+    "    return len(tokenizer(ex[\"text\"], truncation=False)[\"input_ids\"]) <= MAX_SEQ\n",
+    "\n",
+    "n = len(ds)\n",
+    "ds = ds.filter(filt, num_proc=2)\n",
+    "print(f\"After length filter: {len(ds)}/{n} ({len(ds)/n*100:.1f}%)\")\n",
+    "print(f\"Sample: {ds[0]['text'][:200]}...\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. Train\n",
+    "\n",
+    "Completion-only loss (assistant tokens only) — **+15% schema compliance** per ScrapeGraphAI paper."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from trl import SFTTrainer, SFTConfig\n",
+    "\n",
+    "trainer = SFTTrainer(\n",
+    "    model=model, tokenizer=tokenizer, train_dataset=ds,\n",
+    "    args=SFTConfig(\n",
+    "        output_dir=\"./checkpoints\",\n",
+    "        num_train_epochs=EPOCHS,\n",
+    "        per_device_train_batch_size=BS,\n",
+    "        gradient_accumulation_steps=GA,\n",
+    "        optim=\"adamw_8bit\",\n",
+    "        learning_rate=LR,\n",
+    "        weight_decay=0.01,\n",
+    "        lr_scheduler_type=\"cosine\",\n",
+    "        warmup_ratio=0.03,\n",
+    "        max_grad_norm=0.3,\n",
+    "        fp16=not is_bfloat16_supported(),\n",
+    "        bf16=is_bfloat16_supported(),\n",
+    "        max_seq_length=MAX_SEQ,\n",
+    "        dataset_text_field=\"text\",\n",
+    "        packing=False,\n",
+    "        logging_steps=10,\n",
+    "        logging_first_step=True,\n",
+    "        save_strategy=\"steps\",\n",
+    "        save_steps=500,\n",
+    "        save_total_limit=2,\n",
+    "        push_to_hub=True,\n",
+    "        hub_model_id=OUTPUT_MODEL,\n",
+    "        hub_strategy=\"end\",\n",
+    "        seed=42,\n",
+    "        dataset_num_proc=2,\n",
+    "    ),\n",
+    ")\n",
+    "\n",
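+    "# CRITICAL: train only on assistant responses (Qwen2.5 uses ChatML markers)\n",
+    "trainer = train_on_responses_only(\n",
+    "    trainer,\n",
+    "    instruction_part=\"<|im_start|>user\\n\",\n",
+    "    response_part=\"<|im_start|>assistant\\n\",\n",
+    ")\n",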
+    "print(f\"Ready — {EPOCHS} epochs, effective batch {BS*GA}, lr {LR}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "stats = trainer.train()\n",
+    "print(f\"\\n✅ Done! Loss: {stats.training_loss:.4f}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": ["## 6. Save & push"]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "model.save_pretrained(\"lora-adapter\")\n",
+    "tokenizer.save_pretrained(\"lora-adapter\")\n",
+    "\n",
+    "print(\"Pushing merged 16-bit model...\")\n",
+    "model.push_to_hub_merged(OUTPUT_MODEL, tokenizer, save_method=\"merged_16bit\")\n",
+    "\n",
+    "print(\"Pushing LoRA adapter...\")\n",
+    "model.push_to_hub(OUTPUT_MODEL + \"-lora\", tokenizer)\n",
+    "\n",
+    "print(f\"\\n✅ Model: https://huggingface.co/{OUTPUT_MODEL}\")\n",
+    "print(f\"✅ LoRA:  https://huggingface.co/{OUTPUT_MODEL}-lora\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": ["## 7. Test — extraction + anti-bot reasoning"]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "FastLanguageModel.for_inference(model)\n",
+    "\n",
+    "# Test 1: React SPA extraction\n",
+    "msgs = [\n",
+    "    {\"role\": \"system\", \"content\": \"You are WebScrapeAgent. You can scrape any website including React/JS SPAs and sites with anti-bot protection.\"},\n",
+    "    {\"role\": \"user\", \"content\": \"Task: Extract product data from a React e-commerce site\\nURL: https://shop.example.com/products\\nThe site is a Next.js SPA.\"},\n",
+    "]\n",
+    "inputs = tokenizer.apply_chat_template(msgs, tokenize=True, add_generation_prompt=True, return_tensors=\"pt\").to(\"cuda\")\n",
+    "out = model.generate(input_ids=inputs, max_new_tokens=512, temperature=0.3, do_sample=True)\n",
+    "print(\"=== React SPA Test ===\")\n",
+    "print(tokenizer.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Test 2: Anti-bot recovery\n",
+    "msgs2 = [\n",
+    "    {\"role\": \"system\", \"content\": \"You are WebScrapeAgent. Available actions: NAVIGATE, NAVIGATE_BROWSER, SWITCH_STRATEGY, EXECUTE_JS, INTERCEPT_REQUESTS, RETURN_RESULT. Think in <thought> blocks.\"},\n",
+    "    {\"role\": \"user\", \"content\": \"Task: Extract prices\\nURL: https://store.example.com/deals\"},\n",
+    "    {\"role\": \"assistant\", \"content\": \"<thought>Let me try direct HTTP first.</thought>\\n\\nACTION: NAVIGATE\\n```json\\n{\\\"url\\\": \\\"https://store.example.com/deals\\\"}\\n```\"},\n",
+    "    {\"role\": \"user\", \"content\": \"Observation: HTTP 403 Forbidden\\nHeaders: cf-ray: abc123, server: cloudflare, set-cookie: __cf_bm=xyz...\\n\\n<html><body>Access Denied</body></html>\"},\n",
+    "]\n",
+    "inputs = tokenizer.apply_chat_template(msgs2, tokenize=True, add_generation_prompt=True, return_tensors=\"pt\").to(\"cuda\")\n",
+    "out = model.generate(input_ids=inputs, max_new_tokens=512, temperature=0.3, do_sample=True)\n",
+    "print(\"=== Anti-Bot Recovery Test ===\")\n",
+    "print(tokenizer.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 8. (Optional) Export to GGUF for local deployment\n",
+    "Uncomment to create a quantized GGUF file for llama.cpp / Ollama."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# model.save_pretrained_gguf(\"gguf-model\", tokenizer, quantization_method=\"q4_k_m\")\n",
+    "# model.push_to_hub_gguf(OUTPUT_MODEL + \"-GGUF\", tokenizer, quantization_method=\"q4_k_m\")"
+   ]
+  }
+ ]
+}
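
Section 5 of the notebook relies on completion-only loss, where only assistant tokens contribute to the loss. As a rough illustration of the idea (not Unsloth's actual implementation of `train_on_responses_only`), the pure-Python sketch below sets every label outside an assistant span to -100, the value PyTorch's cross-entropy loss ignores by default. The function names and the single-token marker IDs (1, 2, 3) are hypothetical; real ChatML markers such as `<|im_start|>assistant\n` span several tokens, which is why the markers are passed as lists.

```python
from typing import List

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def find_subsequence(seq: List[int], pattern: List[int], start: int = 0) -> int:
    """Return the first index >= start where pattern occurs in seq, or -1."""
    if not pattern:
        return -1
    last = len(seq) - len(pattern)
    for i in range(start, last + 1):
        if seq[i:i + len(pattern)] == pattern:
            return i
    return -1


def mask_non_assistant(
    input_ids: List[int],
    response_marker: List[int],
    end_marker: List[int],
) -> List[int]:
    """Build labels that keep assistant spans and ignore everything else."""
    labels = [IGNORE_INDEX] * len(input_ids)
    pos = 0
    while True:
        start = find_subsequence(input_ids, response_marker, pos)
        if start == -1:
            break
        body = start + len(response_marker)  # first token after the marker
        end = find_subsequence(input_ids, end_marker, body)
        # Keep the end-of-turn marker in the loss so the model learns to stop.
        stop = len(input_ids) if end == -1 else end + len(end_marker)
        labels[body:stop] = input_ids[body:stop]
        pos = stop
    return labels


# Hypothetical toy vocabulary: 1 = user marker, 2 = assistant marker, 3 = end of turn.
toy_ids = [1, 10, 11, 3, 2, 20, 21, 3, 1, 12, 3, 2, 22, 3]
toy_labels = mask_non_assistant(toy_ids, response_marker=[2], end_marker=[3])
print(toy_labels)
```

In the toy sequence, only the tokens after each assistant marker (through the end-of-turn token) keep their IDs as labels, so user and system text never contributes gradient.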