# 🕷️ WebScrapeAgent-7B-v1

An autonomous web scraping agent built on Qwen2.5-7B-Instruct, fine-tuned to extract structured data from web pages automatically.

Give it a URL and a description of what you want → it comes back with clean, usable JSON data and an explicit status for every job.
## What It Does
| Capability | Description |
|---|---|
| HTML Reading | Understands page structure — tables, nested divs, lists, forms, data attributes, malformed HTML |
| Action Sequencing | Decides what tools to call and in what order to get the data |
| Authentication | Handles login pages via cookie replay, form submission, token injection, or browser profiles |
| Error Recovery | When something breaks (403, timeout, CAPTCHA, rate limit), switches approach instead of failing |
## How It Works
The model operates in an action loop:
```
User: "Extract product listings from example.com/shop"
        ↓
Model: <thought>Let me navigate there first.</thought>
       ACTION: NAVIGATE {"url": "example.com/shop"}
        ↓
System: HTTP 200 OK. <html>...</html>
        ↓
Model: <thought>I see product cards. Let me extract the data.</thought>
       ACTION: RETURN_RESULT {"status": "success", "data": [...]}
```
Each response includes a `status` field (`success`, `partial`, or `failed`), so the caller always knows where things stand.

Jobs are capped at 10 steps. If the agent can't finish within that budget, it returns whatever it has, along with a clear explanation.
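Because each `ACTION:` line is plain text, a thin runtime can drive the loop with a small parser. A minimal sketch of that step (the regex and function name are illustrative, not the actual `webscrape_agent.py` internals):

```python
import json
import re

# Matches lines like: ACTION: NAVIGATE {"url": "example.com/shop"}
ACTION_RE = re.compile(r"^ACTION:\s*([A-Z_]+)\s*(\{.*\})\s*$", re.MULTILINE)

def parse_action(model_turn: str):
    """Return (action_name, params) from a model turn, or None if absent."""
    match = ACTION_RE.search(model_turn)
    if match is None:
        return None
    return match.group(1), json.loads(match.group(2))
```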
## Quick Start

### Python API
```python
from webscrape_agent import WebScrapeAgent

agent = WebScrapeAgent("sukritvemula/WebScrapeAgent-7B-v1")

result = agent.scrape(
    url="https://example.com/products",
    task="Extract all product names, prices, and ratings",
    schema={
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "price": {"type": "string"},
                "rating": {"type": "string"},
            },
        },
    },
)

print(result.status)   # "success" | "partial" | "failed"
print(result.data)     # Clean JSON data
print(result.message)  # Human-readable explanation
```
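Since `result.data` comes back as parsed JSON, callers can sanity-check it against the declared schema before trusting it. A minimal sketch, hand-rolled for the three fields above rather than a full JSON Schema validator:

```python
REQUIRED_FIELDS = ("name", "price", "rating")

def items_match_schema(data) -> bool:
    """Check that data is a list of dicts carrying all required fields."""
    return isinstance(data, list) and all(
        isinstance(item, dict) and all(key in item for key in REQUIRED_FIELDS)
        for item in data
    )
```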
### With Authentication
```python
# Cookie-based auth
result = agent.scrape(
    url="https://dashboard.example.com/analytics",
    task="Get my usage statistics",
    auth={"method": "cookies", "cookies": {"session_id": "abc123"}},
)

# API token
result = agent.scrape(
    url="https://api.example.com/v2/data",
    task="Get all user records",
    auth={"method": "token", "token": "sk-xxx"},
)
```
### CLI

```bash
python webscrape_agent.py "https://example.com/pricing" "Extract all pricing tiers with features"
```
### Direct Model Usage
```python
import unsloth
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template

model, tokenizer = FastLanguageModel.from_pretrained(
    "sukritvemula/WebScrapeAgent-7B-v1",
    max_seq_length=4096,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)
tokenizer = get_chat_template(tokenizer, chat_template="qwen-2.5")

messages = [
    {"role": "system", "content": "You are WebScrapeAgent..."},
    {"role": "user", "content": "Task: Extract pricing data\nURL: https://example.com/pricing"},
]
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

outputs = model.generate(input_ids=inputs, max_new_tokens=1024, temperature=0.3)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```
## Available Actions
The model can output these actions in its loop:
| Action | Purpose | Example Params |
|---|---|---|
| `NAVIGATE` | Load a URL | `{"url": "...", "method": "GET", "headers": {...}}` |
| `CLICK` | Click an element | `{"selector": "#submit-btn"}` |
| `FILL_FORM` | Submit a form | `{"selector": "#login", "fields": {"email": "...", "password": "..."}}` |
| `WAIT` | Wait for dynamic content | `{"selector": ".results", "timeout_ms": 5000}` |
| `SET_COOKIES` | Inject auth cookies | `{"cookies": {"session": "abc"}}` |
| `SET_HEADERS` | Set HTTP headers | `{"headers": {"Authorization": "Bearer ..."}}` |
| `LOAD_BROWSER_PROFILE` | Use saved browser session | `{"profile_name": "work-chrome"}` |
| `EXECUTE_JS` | Run JavaScript | `{"script": "return document.querySelector('#app').innerHTML"}` |
| `SCROLL` | Scroll the page | `{"direction": "down", "amount": 500}` |
| `SWITCH_STRATEGY` | Change approach on failure | `{"new_strategy": "headless_browser", "reason": "403 blocked"}` |
| `RETURN_RESULT` | Return final data | `{"status": "success", "data": [...], "message": "..."}` |
## Training Details

### Recipe
Based on three published approaches:
| Paper | Key Contribution | Result |
|---|---|---|
| ScrapeGraphAI-100k | QLoRA + completion-only loss for HTML→JSON | Key F1 = 0.887 at 1.7B params |
| BrowserAgent | Multi-turn browser SFT on Qwen2.5-7B | +20% over baselines |
| A3-Annotators | Assistant-token-only loss + thought chains | 41.5% WebArena |
### Hyperparameters
| Parameter | Value | Source |
|---|---|---|
| Base model | Qwen/Qwen2.5-7B-Instruct | — |
| Method | QLoRA (4-bit NF4) | ScrapeGraphAI |
| LoRA rank | 32 | Increased from paper's 16 for structured output complexity |
| LoRA alpha | 32 | Standard (= rank) |
| LoRA targets | All linear (q,k,v,o,gate,up,down) | ScrapeGraphAI + A3 |
| Learning rate | 1e-4 | ScrapeGraphAI |
| LR schedule | Cosine with 3% warmup | A3-Annotators |
| Optimizer | AdamW 8-bit | Unsloth best practice |
| Epochs | 2 | ScrapeGraphAI + BrowserAgent |
| Effective batch | 16 | — |
| Max seq length | 4096 | — |
| Loss | Completion-only (assistant tokens) | All three papers |
| Gradient checkpointing | Unsloth custom | — |
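For orientation, the table above corresponds roughly to the flat config below. The key names are illustrative, not `train.py`'s actual argument names, and the per-device batch / gradient-accumulation split of 2 × 8 = 16 is an assumption consistent with the CLI defaults shown later:

```python
# Illustrative training config mirroring the hyperparameter table.
config = {
    "base_model": "Qwen/Qwen2.5-7B-Instruct",
    "lora_r": 32,
    "lora_alpha": 32,        # standard choice: alpha == rank
    "learning_rate": 1e-4,
    "lr_scheduler": "cosine",
    "warmup_ratio": 0.03,
    "epochs": 2,
    "per_device_batch": 2,   # assumed split: 2 x 8 = 16 effective
    "grad_accum_steps": 8,
    "max_seq_length": 4096,
}
# Effective batch size = per-device batch x gradient-accumulation steps.
effective_batch = config["per_device_batch"] * config["grad_accum_steps"]
```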
### Training Data

Dataset: `sukritvemula/webscrape-agent-training-data` (45,624 examples)
| Source | Count | % | What It Teaches |
|---|---|---|---|
| ScrapeGraphAI-100k | 25,244 | 55.3% | HTML→JSON extraction across real websites |
| BrowserAgent-Data | 20,361 | 44.6% | Multi-turn browser interaction and reasoning |
| Synthetic scenarios | 19 | 0.04% | Auth handling, error recovery, diverse HTML |
### Training Infrastructure
Designed for free GPU platforms:
- ✅ Google Colab (T4, 16GB)
- ✅ Kaggle (T4/P100, 16GB)
- ✅ Any 16GB+ GPU
## How to Train

### Option 1: Colab Notebook (Easiest)
Open WebScrapeAgent_Training.ipynb in Google Colab with a T4 GPU runtime. Everything is set up — just run all cells.
### Option 2: Command Line
```bash
pip install unsloth trl peft transformers accelerate datasets bitsandbytes

# Train with defaults (pushes to Hub)
python train.py

# Custom settings
python train.py \
    --model unsloth/Qwen2.5-7B-Instruct-bnb-4bit \
    --output your-username/WebScrapeAgent-7B-custom \
    --epochs 3 \
    --lr 5e-5 \
    --lora-r 64 \
    --batch-size 2 \
    --grad-accum 8

# Save locally only (no Hub push)
python train.py --no-push --save-local ./my-model
```
## Limitations
- CAPTCHA: Cannot solve visual CAPTCHAs. Returns partial results with an explanation.
- JavaScript-heavy SPAs: The default HTTP executor doesn't render JS. Use with Playwright/Selenium for full browser support (see the `ActionExecutor` class in `webscrape_agent.py`).
- Private networks: Cannot access internal/intranet URLs. Returns a clear failure message.
- Very large pages: HTML is truncated to ~8K chars to fit the context window, so data on extremely long pages may be missed.
- Data hallucination: The model is trained to never invent data, but always verify critical extractions.
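The large-page behavior can be pictured as a simple character cap on the raw HTML. A sketch under assumed values (the exact constant and truncation marker in the agent may differ):

```python
MAX_HTML_CHARS = 8000  # approximate cap so pages fit the 4096-token context

def truncate_html(html: str) -> str:
    """Cap raw HTML length, marking the cut so downstream code knows data is missing."""
    if len(html) <= MAX_HTML_CHARS:
        return html
    return html[:MAX_HTML_CHARS] + "\n<!-- truncated -->"
```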
## Files in This Repo
| File | Purpose |
|---|---|
| `webscrape_agent.py` | Runtime inference loop (Python API and CLI) |
| `train.py` | Standalone training script (CLI with args) |
| `WebScrapeAgent_Training.ipynb` | Colab/Kaggle training notebook |
| `evaluate.py` | Evaluation script testing all 4 core skills |
| `prepare_data.py` | Dataset preparation pipeline (builds the training data) |
## License
Apache 2.0 (same as base model)