# 🕷️ WebScrapeAgent-7B-v1

An autonomous web scraping agent built on Qwen2.5-7B-Instruct, fine-tuned to extract structured data from web pages automatically.

Give it a URL and a description of what you want → it comes back with clean, usable JSON data and an explicit status for every job.
## What It Does
| Capability | Description |
|---|---|
| HTML Reading | Understands page structure — tables, nested divs, lists, forms, data attributes, malformed HTML |
| Action Sequencing | Decides what tools to call and in what order to get the data |
| Authentication | Handles login pages via cookie replay, form submission, token injection, or browser profiles |
| Error Recovery | When something breaks (403, timeout, CAPTCHA, rate limit), switches approach instead of failing |
## How It Works
The model operates in an action loop:
```
User: "Extract product listings from example.com/shop"
        ↓
Model: <thought>Let me navigate there first.</thought>
       ACTION: NAVIGATE {"url": "example.com/shop"}
        ↓
System: HTTP 200 OK. <html>...</html>
        ↓
Model: <thought>I see product cards. Let me extract the data.</thought>
       ACTION: RETURN_RESULT {"status": "success", "data": [...]}
```
Each response includes a `status` field (`success`, `partial`, or `failed`), so the caller always knows where things stand.

Jobs are capped at 10 steps. If the agent can't finish within that budget, it returns whatever it has, along with a clear explanation.
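Because each `ACTION:` line is plain text, a thin runtime can drive the loop with a small parser. A minimal sketch of that step (the regex and function name are illustrative, not the actual `webscrape_agent.py` internals):

```python
import json
import re

# Matches lines like: ACTION: NAVIGATE {"url": "example.com/shop"}
ACTION_RE = re.compile(r"^ACTION:\s*([A-Z_]+)\s*(\{.*\})\s*$", re.MULTILINE)

def parse_action(model_turn: str):
    """Return (action_name, params) from a model turn, or None if absent."""
    match = ACTION_RE.search(model_turn)
    if match is None:
        return None
    return match.group(1), json.loads(match.group(2))
```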
## Quick Start

### Python API
```python
from webscrape_agent import WebScrapeAgent

agent = WebScrapeAgent("sukritvemula/WebScrapeAgent-7B-v1")

result = agent.scrape(
    url="https://example.com/products",
    task="Extract all product names, prices, and ratings",
    schema={
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "price": {"type": "string"},
                "rating": {"type": "string"},
            },
        },
    },
)

print(result.status)   # "success" | "partial" | "failed"
print(result.data)     # Clean JSON data
print(result.message)  # Human-readable explanation
```
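Since `result.data` comes back as parsed JSON, callers can sanity-check it against the declared schema before trusting it. A minimal sketch, hand-rolled for the three fields above rather than a full JSON Schema validator:

```python
REQUIRED_FIELDS = ("name", "price", "rating")

def items_match_schema(data) -> bool:
    """Check that data is a list of dicts carrying all required fields."""
    return isinstance(data, list) and all(
        isinstance(item, dict) and all(key in item for key in REQUIRED_FIELDS)
        for item in data
    )
```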
### With Authentication
```python
# Cookie-based auth
result = agent.scrape(
    url="https://dashboard.example.com/analytics",
    task="Get my usage statistics",
    auth={"method": "cookies", "cookies": {"session_id": "abc123"}},
)

# API token
result = agent.scrape(
    url="https://api.example.com/v2/data",
    task="Get all user records",
    auth={"method": "token", "token": "sk-xxx"},
)
```
### CLI

```bash
python webscrape_agent.py "https://example.com/pricing" "Extract all pricing tiers with features"
```
### Direct Model Usage
```python
import unsloth
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template

model, tokenizer = FastLanguageModel.from_pretrained(
    "sukritvemula/WebScrapeAgent-7B-v1",
    max_seq_length=4096,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)
tokenizer = get_chat_template(tokenizer, chat_template="qwen-2.5")

messages = [
    {"role": "system", "content": "You are WebScrapeAgent..."},
    {"role": "user", "content": "Task: Extract pricing data\nURL: https://example.com/pricing"},
]
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

outputs = model.generate(input_ids=inputs, max_new_tokens=1024, temperature=0.3)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```
## Available Actions
The model can output these actions in its loop:
| Action | Purpose | Example Params |
|---|---|---|
| `NAVIGATE` | Load a URL | `{"url": "...", "method": "GET", "headers": {...}}` |
| `CLICK` | Click an element | `{"selector": "#submit-btn"}` |
| `FILL_FORM` | Submit a form | `{"selector": "#login", "fields": {"email": "...", "password": "..."}}` |
| `WAIT` | Wait for dynamic content | `{"selector": ".results", "timeout_ms": 5000}` |
| `SET_COOKIES` | Inject auth cookies | `{"cookies": {"session": "abc"}}` |
| `SET_HEADERS` | Set HTTP headers | `{"headers": {"Authorization": "Bearer ..."}}` |
| `LOAD_BROWSER_PROFILE` | Use saved browser session | `{"profile_name": "work-chrome"}` |
| `EXECUTE_JS` | Run JavaScript | `{"script": "return document.querySelector('#app').innerHTML"}` |
| `SCROLL` | Scroll the page | `{"direction": "down", "amount": 500}` |
| `SWITCH_STRATEGY` | Change approach on failure | `{"new_strategy": "headless_browser", "reason": "403 blocked"}` |
| `RETURN_RESULT` | Return final data | `{"status": "success", "data": [...], "message": "..."}` |
## Training Details

### Recipe
Based on three published approaches:
| Paper | Key Contribution | Result |
|---|---|---|
| ScrapeGraphAI-100k | QLoRA + completion-only loss for HTML→JSON | Key F1 = 0.887 at 1.7B params |
| BrowserAgent | Multi-turn browser SFT on Qwen2.5-7B | +20% over baselines |
| A3-Annotators | Assistant-token-only loss + thought chains | 41.5% WebArena |
### Hyperparameters
| Parameter | Value | Source |
|---|---|---|
| Base model | Qwen/Qwen2.5-7B-Instruct | — |
| Method | QLoRA (4-bit NF4) | ScrapeGraphAI |
| LoRA rank | 32 | Increased from paper's 16 for structured output complexity |
| LoRA alpha | 32 | Standard (= rank) |
| LoRA targets | All linear (q,k,v,o,gate,up,down) | ScrapeGraphAI + A3 |
| Learning rate | 1e-4 | ScrapeGraphAI |
| LR schedule | Cosine with 3% warmup | A3-Annotators |
| Optimizer | AdamW 8-bit | Unsloth best practice |
| Epochs | 2 | ScrapeGraphAI + BrowserAgent |
| Effective batch | 16 | — |
| Max seq length | 4096 | — |
| Loss | Completion-only (assistant tokens) | All three papers |
| Gradient checkpointing | Unsloth custom | — |
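For orientation, the table above corresponds roughly to the flat config below. The key names are illustrative, not `train.py`'s actual argument names, and the per-device batch / gradient-accumulation split of 2 × 8 = 16 is an assumption consistent with the CLI defaults shown later:

```python
# Illustrative training config mirroring the hyperparameter table.
config = {
    "base_model": "Qwen/Qwen2.5-7B-Instruct",
    "lora_r": 32,
    "lora_alpha": 32,        # standard choice: alpha == rank
    "learning_rate": 1e-4,
    "lr_scheduler": "cosine",
    "warmup_ratio": 0.03,
    "epochs": 2,
    "per_device_batch": 2,   # assumed split: 2 x 8 = 16 effective
    "grad_accum_steps": 8,
    "max_seq_length": 4096,
}
# Effective batch size = per-device batch x gradient-accumulation steps.
effective_batch = config["per_device_batch"] * config["grad_accum_steps"]
```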
### Training Data

Dataset: `sukritvemula/webscrape-agent-training-data` (45,624 examples)
| Source | Count | % | What It Teaches |
|---|---|---|---|
| ScrapeGraphAI-100k | 25,244 | 55.3% | HTML→JSON extraction across real websites |
| BrowserAgent-Data | 20,361 | 44.6% | Multi-turn browser interaction and reasoning |
| Synthetic scenarios | 19 | 0.04% | Auth handling, error recovery, diverse HTML |
### Training Infrastructure
Designed for free GPU platforms:
- ✅ Google Colab (T4, 16GB)
- ✅ Kaggle (T4/P100, 16GB)
- ✅ Any 16GB+ GPU
## How to Train

### Option 1: Colab Notebook (Easiest)
Open WebScrapeAgent_Training.ipynb in Google Colab with a T4 GPU runtime. Everything is set up — just run all cells.
### Option 2: Command Line
```bash
pip install unsloth trl peft transformers accelerate datasets bitsandbytes

# Train with defaults (pushes to Hub)
python train.py

# Custom settings
python train.py \
    --model unsloth/Qwen2.5-7B-Instruct-bnb-4bit \
    --output your-username/WebScrapeAgent-7B-custom \
    --epochs 3 \
    --lr 5e-5 \
    --lora-r 64 \
    --batch-size 2 \
    --grad-accum 8

# Save locally only (no Hub push)
python train.py --no-push --save-local ./my-model
```
## Limitations
- CAPTCHA: Cannot solve visual CAPTCHAs. Returns partial results with an explanation.
- JavaScript-heavy SPAs: The default HTTP executor doesn't render JS. Use with Playwright/Selenium for full browser support (see the `ActionExecutor` class in `webscrape_agent.py`).
- Private networks: Cannot access internal/intranet URLs. Returns a clear failure message.
- Very large pages: HTML is truncated to ~8K chars to fit the context window, so data on extremely long pages may be missed.
- Data hallucination: The model is trained to never invent data, but always verify critical extractions.
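The large-page behavior can be pictured as a simple character cap on the raw HTML. A sketch under assumed values (the exact constant and truncation marker in the agent may differ):

```python
MAX_HTML_CHARS = 8000  # approximate cap so pages fit the 4096-token context

def truncate_html(html: str) -> str:
    """Cap raw HTML length, marking the cut so downstream code knows data is missing."""
    if len(html) <= MAX_HTML_CHARS:
        return html
    return html[:MAX_HTML_CHARS] + "\n<!-- truncated -->"
```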
## Files in This Repo
| File | Purpose |
|---|---|
| `webscrape_agent.py` | Runtime inference loop (Python API and CLI) |
| `train.py` | Standalone training script (CLI with args) |
| `WebScrapeAgent_Training.ipynb` | Colab/Kaggle training notebook |
| `evaluate.py` | Evaluation script testing all 4 core skills |
| `prepare_data.py` | Dataset preparation pipeline (builds the training data) |
## License
Apache 2.0 (same as base model)