Tags: text-generation · English · web-scraping · html-extraction · agent · structured-data · qwen2.5 · unsloth · lora

🕷️ WebScrapeAgent-7B-v1

An autonomous web-scraping agent built on Qwen2.5-7B-Instruct, fine-tuned to extract structured data from web pages.

Give it a URL and a description of what you want → it comes back with clean, structured JSON.

What It Does

| Capability | Description |
|---|---|
| HTML Reading | Understands page structure: tables, nested divs, lists, forms, data attributes, malformed HTML |
| Action Sequencing | Decides which tools to call, and in what order, to get the data |
| Authentication | Handles login pages via cookie replay, form submission, token injection, or browser profiles |
| Error Recovery | When something breaks (403, timeout, CAPTCHA, rate limit), switches approach instead of failing |

How It Works

The model operates in an action loop:

User: "Extract product listings from example.com/shop"
  ↓
Model: <thought>Let me navigate there first.</thought>
       ACTION: NAVIGATE {"url": "example.com/shop"}
  ↓
System: HTTP 200 OK. <html>...</html>
  ↓
Model: <thought>I see product cards. Let me extract the data.</thought>
       ACTION: RETURN_RESULT {"status": "success", "data": [...]}

Each response includes a status: success, partial, or failed — so the caller always knows where things stand.

Maximum 10 steps per job. If it can't finish, it returns what it has with a clear explanation.
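A caller driving this loop has to split each model turn into its thought and its action. Here is a minimal stdlib-only sketch of such a parser, assuming the `<thought>…</thought>` / `ACTION: NAME {json}` format shown above (`parse_turn` is an illustrative name, not part of the package):

```python
import json
import re

def parse_turn(text: str):
    """Split one model turn into (thought, action_name, params).

    Assumes the <thought>.../ACTION: NAME {...} format from the loop
    diagram; the agent's real parser may be more forgiving.
    """
    thought_m = re.search(r"<thought>(.*?)</thought>", text, re.DOTALL)
    action_m = re.search(r"ACTION:\s*([A-Z_]+)\s*(\{.*\})", text, re.DOTALL)
    thought = thought_m.group(1).strip() if thought_m else ""
    if not action_m:
        return thought, None, None
    return thought, action_m.group(1), json.loads(action_m.group(2))

turn = '<thought>Let me navigate there first.</thought>\nACTION: NAVIGATE {"url": "example.com/shop"}'
thought, name, params = parse_turn(turn)
# name == "NAVIGATE", params["url"] == "example.com/shop"
```

The caller feeds the tool result back as the next system message and repeats until it sees a `RETURN_RESULT` action or hits the step cap.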

Quick Start

Python API

from webscrape_agent import WebScrapeAgent

agent = WebScrapeAgent("sukritvemula/WebScrapeAgent-7B-v1")

result = agent.scrape(
    url="https://example.com/products",
    task="Extract all product names, prices, and ratings",
    schema={
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "price": {"type": "string"},
                "rating": {"type": "string"}
            }
        }
    }
)

print(result.status)   # "success" | "partial" | "failed"
print(result.data)     # Clean JSON data
print(result.message)  # Human-readable explanation
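The `schema` argument follows JSON Schema conventions, so a caller can sanity-check the returned `data` against it. A minimal stdlib-only checker for the flat object shape used above (`check_items` is a hypothetical helper, not a full JSON Schema validator; use a library like `jsonschema` for real validation):

```python
def check_items(data, item_props):
    """Verify each record is a dict whose declared string fields are strings.

    Covers only the flat {"type": "object", "properties": ...} case from
    the example schema above.
    """
    for record in data:
        if not isinstance(record, dict):
            return False
        for field, spec in item_props.items():
            if spec.get("type") == "string" and field in record:
                if not isinstance(record[field], str):
                    return False
    return True

props = {"name": {"type": "string"}, "price": {"type": "string"}, "rating": {"type": "string"}}
rows = [{"name": "Widget", "price": "$9.99", "rating": "4.5"}]
assert check_items(rows, props)
```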

With Authentication

# Cookie-based auth
result = agent.scrape(
    url="https://dashboard.example.com/analytics",
    task="Get my usage statistics",
    auth={"method": "cookies", "cookies": {"session_id": "abc123"}}
)

# API token
result = agent.scrape(
    url="https://api.example.com/v2/data",
    task="Get all user records",
    auth={"method": "token", "token": "sk-xxx"}
)
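Under the hood, cookie-based auth amounts to serializing the dict into an HTTP `Cookie` header on each request. A sketch of that serialization (the agent's actual implementation may differ):

```python
def cookie_header(cookies: dict) -> str:
    """Serialize a cookie dict into an HTTP Cookie header value,
    e.g. {"session_id": "abc123"} -> "session_id=abc123"."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())

print(cookie_header({"session_id": "abc123"}))  # session_id=abc123
```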

CLI

python webscrape_agent.py "https://example.com/pricing" "Extract all pricing tiers with features"

Direct Model Usage

import unsloth
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template

model, tokenizer = FastLanguageModel.from_pretrained(
    "sukritvemula/WebScrapeAgent-7B-v1",
    max_seq_length=4096,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)
tokenizer = get_chat_template(tokenizer, chat_template="qwen-2.5")

messages = [
    {"role": "system", "content": "You are WebScrapeAgent..."},
    {"role": "user", "content": "Task: Extract pricing data\nURL: https://example.com/pricing"},
]

inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
outputs = model.generate(input_ids=inputs, max_new_tokens=1024, do_sample=True, temperature=0.3)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))

Available Actions

The model can output these actions in its loop:

| Action | Purpose | Example Params |
|---|---|---|
| NAVIGATE | Load a URL | `{"url": "...", "method": "GET", "headers": {...}}` |
| CLICK | Click an element | `{"selector": "#submit-btn"}` |
| FILL_FORM | Submit a form | `{"selector": "#login", "fields": {"email": "...", "password": "..."}}` |
| WAIT | Wait for dynamic content | `{"selector": ".results", "timeout_ms": 5000}` |
| SET_COOKIES | Inject auth cookies | `{"cookies": {"session": "abc"}}` |
| SET_HEADERS | Set HTTP headers | `{"headers": {"Authorization": "Bearer ..."}}` |
| LOAD_BROWSER_PROFILE | Use saved browser session | `{"profile_name": "work-chrome"}` |
| EXECUTE_JS | Run JavaScript | `{"script": "return document.querySelector('#app').innerHTML"}` |
| SCROLL | Scroll the page | `{"direction": "down", "amount": 500}` |
| SWITCH_STRATEGY | Change approach on failure | `{"new_strategy": "headless_browser", "reason": "403 blocked"}` |
| RETURN_RESULT | Return final data | `{"status": "success", "data": [...], "message": "..."}` |
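On the runtime side, these actions are typically routed through a dispatch table with the 10-step budget enforced by the caller. A stdlib-only sketch of that shape (the real executor is the `ActionExecutor` class in `webscrape_agent.py`; the handler names here are illustrative):

```python
MAX_STEPS = 10  # matches the per-job step cap described earlier

def handle_navigate(params):
    # Stub: a real handler would perform the HTTP request.
    return f"GET {params['url']}"

def handle_return_result(params):
    return params

HANDLERS = {
    "NAVIGATE": handle_navigate,
    "RETURN_RESULT": handle_return_result,
    # ... one handler per action in the table above
}

def dispatch(name, params, step):
    """Route one parsed action to its handler, enforcing the step budget."""
    if step >= MAX_STEPS:
        return {"status": "partial", "message": "step budget exhausted"}
    handler = HANDLERS.get(name)
    if handler is None:
        return {"status": "failed", "message": f"unknown action {name}"}
    return handler(params)

out = dispatch("NAVIGATE", {"url": "https://example.com/shop"}, step=0)
# out == "GET https://example.com/shop"
```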

Training Details

Recipe

Based on three published approaches:

| Paper | Key Contribution | Result |
|---|---|---|
| ScrapeGraphAI-100k | QLoRA + completion-only loss for HTML→JSON | Key F1 = 0.887 at 1.7B params |
| BrowserAgent | Multi-turn browser SFT on Qwen2.5-7B | +20% over baselines |
| A3-Annotators | Assistant-token-only loss + thought chains | 41.5% WebArena |

Hyperparameters

| Parameter | Value | Source |
|---|---|---|
| Base model | Qwen/Qwen2.5-7B-Instruct | |
| Method | QLoRA (4-bit NF4) | ScrapeGraphAI |
| LoRA rank | 32 | Increased from paper's 16 for structured-output complexity |
| LoRA alpha | 32 | Standard (= rank) |
| LoRA targets | All linear (q, k, v, o, gate, up, down) | ScrapeGraphAI + A3 |
| Learning rate | 1e-4 | ScrapeGraphAI |
| LR schedule | Cosine with 3% warmup | A3-Annotators |
| Optimizer | AdamW 8-bit | Unsloth best practice |
| Epochs | 2 | ScrapeGraphAI + BrowserAgent |
| Effective batch | 16 | |
| Max seq length | 4096 | |
| Loss | Completion-only (assistant tokens) | All three papers |
| Gradient checkpointing | Unsloth custom | |
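Expressed as code, these hyperparameters map roughly onto the following Unsloth + TRL setup. This is a sketch, not the repo's actual `train.py`; argument names follow the public Unsloth and TRL APIs, and it requires a CUDA GPU to run:

```python
from unsloth import FastLanguageModel
from trl import SFTConfig

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Qwen2.5-7B-Instruct-bnb-4bit",  # 4-bit NF4 base
    max_seq_length=4096,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=32, lora_alpha=32,  # rank 32, alpha = rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # all linear layers
    use_gradient_checkpointing="unsloth",  # Unsloth's custom checkpointing
)
args = SFTConfig(
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    optim="adamw_8bit",
    num_train_epochs=2,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # effective batch = 2 x 8 = 16
)
```

Completion-only loss (training on assistant tokens only) is then applied when building the trainer, as all three source papers do.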

Training Data

Dataset: sukritvemula/webscrape-agent-training-data (45,624 examples)

| Source | Count | % | What It Teaches |
|---|---|---|---|
| ScrapeGraphAI-100k | 25,244 | 55.3% | HTML→JSON extraction across real websites |
| BrowserAgent-Data | 20,361 | 44.6% | Multi-turn browser interaction and reasoning |
| Synthetic scenarios | 19 | 0.04% | Auth handling, error recovery, diverse HTML |

Training Infrastructure

Designed for free GPU platforms:

  • ✅ Google Colab (T4, 16GB)
  • ✅ Kaggle (T4/P100, 16GB)
  • ✅ Any 16GB+ GPU

How to Train

Option 1: Colab Notebook (Easiest)

Open WebScrapeAgent_Training.ipynb in Google Colab with a T4 GPU runtime. Everything is set up — just run all cells.

Option 2: Command Line

pip install unsloth trl peft transformers accelerate datasets bitsandbytes

# Train with defaults (pushes to Hub)
python train.py

# Custom settings
python train.py \
    --model unsloth/Qwen2.5-7B-Instruct-bnb-4bit \
    --output your-username/WebScrapeAgent-7B-custom \
    --epochs 3 \
    --lr 5e-5 \
    --lora-r 64 \
    --batch-size 2 \
    --grad-accum 8

# Save locally only (no Hub push)
python train.py --no-push --save-local ./my-model

Limitations

  • CAPTCHA: Cannot solve visual CAPTCHAs. Returns partial results with explanation.
  • JavaScript-heavy SPAs: The default HTTP executor doesn't render JS. Use with Playwright/Selenium for full browser support (see ActionExecutor class in webscrape_agent.py).
  • Private networks: Cannot access internal/intranet URLs. Returns clear failure message.
  • Very large pages: HTML truncated to ~8K chars to fit context window. May miss data on extremely long pages.
  • Data hallucination: The model is trained never to invent data, but you should still verify critical extractions.
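The ~8K-character cap on page HTML can be applied before prompting. A naive stdlib sketch that cuts at a tag boundary so the model never sees a half-open tag (`truncate_html` is illustrative, not the agent's actual code):

```python
def truncate_html(html: str, limit: int = 8000) -> str:
    """Trim HTML to at most `limit` characters, cutting just after the
    last complete tag inside the budget."""
    if len(html) <= limit:
        return html
    cut = html.rfind(">", 0, limit)
    return html[: cut + 1] if cut != -1 else html[:limit]

page = "<div>" + "<p>item</p>" * 2000 + "</div>"
snippet = truncate_html(page)
assert len(snippet) <= 8000 and snippet.endswith(">")
```

Anything past the cut is simply dropped, which is why extremely long pages can lose data.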

Files in This Repo

| File | Purpose |
|---|---|
| webscrape_agent.py | Runtime inference loop — Python API and CLI |
| train.py | Standalone training script (CLI with args) |
| WebScrapeAgent_Training.ipynb | Colab/Kaggle training notebook |
| evaluate.py | Evaluation script testing all 4 core skills |
| prepare_data.py | Dataset preparation pipeline (builds the training data) |

License

Apache 2.0 (same as base model)

