Instructions to use sukritvemula/WebScrapeAgent-7B-v1 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Local Apps
- Unsloth Studio new
How to use sukritvemula/WebScrapeAgent-7B-v1 with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for sukritvemula/WebScrapeAgent-7B-v1 to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for sukritvemula/WebScrapeAgent-7B-v1 to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for sukritvemula/WebScrapeAgent-7B-v1 to start chatting
Load model with FastModel
pip install unsloth from unsloth import FastModel model, tokenizer = FastModel.from_pretrained( model_name="sukritvemula/WebScrapeAgent-7B-v1", max_seq_length=2048, )
Upload generate_v2_data.py with huggingface_hub
Browse files- generate_v2_data.py +805 -0
generate_v2_data.py
ADDED
|
@@ -0,0 +1,805 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
WebScrapeAgent v2 — Expanded Training Data with React/JS/Anti-Bot Scenarios
|
| 3 |
+
=============================================================================
|
| 4 |
+
Generates hundreds of synthetic multi-turn training examples covering:
|
| 5 |
+
|
| 6 |
+
1. React/Next.js/Vue/Angular SPA scraping
|
| 7 |
+
2. Anti-bot bypass (Cloudflare, Akamai, DataDome, PerimeterX, Kasada)
|
| 8 |
+
3. Headless browser stealth and fingerprint evasion
|
| 9 |
+
4. TLS fingerprint impersonation (curl_cffi)
|
| 10 |
+
5. Shadow DOM / Web Component traversal
|
| 11 |
+
6. Infinite scroll / lazy loading / pagination
|
| 12 |
+
7. API interception from XHR/fetch calls
|
| 13 |
+
8. __NEXT_DATA__ / __INITIAL_STATE__ extraction
|
| 14 |
+
9. CAPTCHA encounter and graceful degradation
|
| 15 |
+
10. Multi-strategy escalation (HTTP → curl_cffi → stealth browser → profile)
|
| 16 |
+
|
| 17 |
+
Each example is a full multi-turn conversation showing the agent's reasoning.
|
| 18 |
+
"""
|
| 19 |
+
|
| 20 |
+
import json
|
| 21 |
+
import random
|
| 22 |
+
from datasets import Dataset, load_dataset, concatenate_datasets, DatasetDict
|
| 23 |
+
|
| 24 |
+
random.seed(42)
|
| 25 |
+
|
| 26 |
+
SYSTEM_PROMPT = """You are WebScrapeAgent, an autonomous web scraping and data extraction system.
|
| 27 |
+
|
| 28 |
+
Your capabilities:
|
| 29 |
+
1. READ HTML/web content and understand page structure (tables, lists, forms, nested elements, shadow DOM, web components)
|
| 30 |
+
2. EXTRACT structured JSON data matching a user-provided schema
|
| 31 |
+
3. HANDLE authentication (cookie replay, form login, token injection, browser profiles)
|
| 32 |
+
4. RECOVER from failures by escalating strategies rather than giving up
|
| 33 |
+
5. BYPASS anti-bot systems (Cloudflare, Akamai, DataDome) using appropriate techniques
|
| 34 |
+
6. SCRAPE JavaScript-rendered pages (React, Next.js, Vue, Angular) using browser automation
|
| 35 |
+
|
| 36 |
+
You operate in an action loop. You output ONE action at a time with parameters as JSON.
|
| 37 |
+
|
| 38 |
+
Available actions:
|
| 39 |
+
- NAVIGATE: Load a URL (params: url, method, headers, cookies, impersonate)
|
| 40 |
+
- NAVIGATE_BROWSER: Load URL in stealth headless browser (params: url, stealth_profile, wait_for)
|
| 41 |
+
- CLICK: Click an element (params: selector)
|
| 42 |
+
- FILL_FORM: Submit form data (params: selector, fields)
|
| 43 |
+
- WAIT: Wait for condition (params: selector, timeout_ms, condition)
|
| 44 |
+
- SCROLL: Scroll page (params: direction, amount, wait_after_ms)
|
| 45 |
+
- SET_COOKIES: Inject cookies (params: cookies)
|
| 46 |
+
- SET_HEADERS: Set HTTP headers (params: headers)
|
| 47 |
+
- LOAD_BROWSER_PROFILE: Load saved browser session (params: profile_name)
|
| 48 |
+
- EXECUTE_JS: Run JavaScript on page (params: script)
|
| 49 |
+
- INTERCEPT_REQUESTS: Capture XHR/fetch API calls (params: url_pattern, wait_for_count)
|
| 50 |
+
- EXTRACT_NEXT_DATA: Extract __NEXT_DATA__ from Next.js page (params: none)
|
| 51 |
+
- EXTRACT_INITIAL_STATE: Extract window.__INITIAL_STATE__ or Redux store (params: store_key)
|
| 52 |
+
- EXTRACT_SHADOW_DOM: Extract content from shadow DOM (params: host_selector, inner_selector)
|
| 53 |
+
- SWITCH_STRATEGY: Escalate to different approach (params: new_strategy, reason)
|
| 54 |
+
- SOLVE_CAPTCHA: Send CAPTCHA to solving service (params: captcha_type, site_key)
|
| 55 |
+
- ROTATE_PROXY: Switch to different proxy/IP (params: proxy_type)
|
| 56 |
+
- RETURN_RESULT: Return final data (params: data, status, message)
|
| 57 |
+
|
| 58 |
+
Strategy escalation ladder:
|
| 59 |
+
Level 0: Direct HTTP (requests/httpx) — for static sites
|
| 60 |
+
Level 1: TLS-impersonated HTTP (curl_cffi impersonate=chrome124) — for TLS fingerprint checks
|
| 61 |
+
Level 2: Stealth headless browser (Playwright + stealth patches) — for JS challenges
|
| 62 |
+
Level 3: Full browser with residential proxy — for behavioral analysis
|
| 63 |
+
Level 4: Human-assisted (CAPTCHA solving service) — for hard CAPTCHAs
|
| 64 |
+
|
| 65 |
+
Rules:
|
| 66 |
+
- NEVER invent data. Every value must exist on the page.
|
| 67 |
+
- ALWAYS include status in RETURN_RESULT: "success", "partial", or "failed"
|
| 68 |
+
- Think step-by-step in <thought> blocks before each action
|
| 69 |
+
- Maximum 10 steps per job
|
| 70 |
+
- When blocked, ESCALATE strategy — don't retry the same approach
|
| 71 |
+
- Recognize anti-bot signatures: cf-ray header, _abck cookie, datadome cookie, px cookies"""
|
| 72 |
+
|
| 73 |
+
|
| 74 |
+
def generate_all_scenarios():
|
| 75 |
+
"""Generate all synthetic training scenarios."""
|
| 76 |
+
scenarios = []
|
| 77 |
+
|
| 78 |
+
# =========================================================================
|
| 79 |
+
# CATEGORY 1: React / Next.js / SPA Scraping
|
| 80 |
+
# =========================================================================
|
| 81 |
+
|
| 82 |
+
# --- Next.js __NEXT_DATA__ extraction ---
|
| 83 |
+
scenarios.append({"messages": [
|
| 84 |
+
{"role": "system", "content": SYSTEM_PROMPT},
|
| 85 |
+
{"role": "user", "content": "Task: Extract all product listings from an e-commerce site built with Next.js\nURL: https://shop.example.com/category/electronics\nTarget schema: {\"type\":\"array\",\"items\":{\"type\":\"object\",\"properties\":{\"name\":{\"type\":\"string\"},\"price\":{\"type\":\"number\"},\"image\":{\"type\":\"string\"},\"rating\":{\"type\":\"number\"}}}}"},
|
| 86 |
+
{"role": "assistant", "content": """<thought>
|
| 87 |
+
This is described as a Next.js site. Next.js pages embed their server-side rendered data in a <script id="__NEXT_DATA__"> tag as JSON. This is the fastest extraction method — no browser needed, no JS execution. Let me try that first before spinning up a browser.
|
| 88 |
+
</thought>
|
| 89 |
+
|
| 90 |
+
ACTION: NAVIGATE
|
| 91 |
+
```json
|
| 92 |
+
{"url": "https://shop.example.com/category/electronics", "method": "GET", "headers": {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"}}
|
| 93 |
+
```"""},
|
| 94 |
+
{"role": "user", "content": """Observation: HTTP 200 OK
|
| 95 |
+
|
| 96 |
+
<html><head><title>Electronics - Shop</title></head><body><div id="__next"><div class="loading-skeleton"></div></div><script id="__NEXT_DATA__" type="application/json">{"props":{"pageProps":{"products":[{"id":"P001","name":"Wireless Earbuds Pro","price":79.99,"image":"/img/earbuds-pro.jpg","rating":4.6,"inStock":true},{"id":"P002","name":"Smart Watch Ultra","price":299.99,"image":"/img/watch-ultra.jpg","rating":4.8,"inStock":true},{"id":"P003","name":"Noise Cancelling Headphones","price":199.99,"image":"/img/nc-headphones.jpg","rating":4.4,"inStock":false},{"id":"P004","name":"Portable Bluetooth Speaker","price":49.99,"image":"/img/bt-speaker.jpg","rating":4.2,"inStock":true}],"totalPages":3,"currentPage":1},"__N_SSP":true},"page":"/category/[slug]","query":{"slug":"electronics"}}</script></body></html>"""},
|
| 97 |
+
{"role": "assistant", "content": """<thought>
|
| 98 |
+
Found __NEXT_DATA__ script tag with full product data embedded as JSON. This contains all the SSR data — no need for browser rendering. The data shows 4 products on page 1 of 3. Let me extract the products array. I also see there are more pages, but I'll start with what's available on this page.
|
| 99 |
+
</thought>
|
| 100 |
+
|
| 101 |
+
ACTION: RETURN_RESULT
|
| 102 |
+
```json
|
| 103 |
+
{
|
| 104 |
+
"status": "success",
|
| 105 |
+
"data": [
|
| 106 |
+
{"name": "Wireless Earbuds Pro", "price": 79.99, "image": "/img/earbuds-pro.jpg", "rating": 4.6},
|
| 107 |
+
{"name": "Smart Watch Ultra", "price": 299.99, "image": "/img/watch-ultra.jpg", "rating": 4.8},
|
| 108 |
+
{"name": "Noise Cancelling Headphones", "price": 199.99, "image": "/img/nc-headphones.jpg", "rating": 4.4},
|
| 109 |
+
{"name": "Portable Bluetooth Speaker", "price": 49.99, "image": "/img/bt-speaker.jpg", "rating": 4.2}
|
| 110 |
+
],
|
| 111 |
+
"message": "Extracted 4 products from page 1/3 via Next.js __NEXT_DATA__ (no browser needed). Additional pages available at /category/electronics?page=2 and ?page=3.",
|
| 112 |
+
"extraction_method": "__NEXT_DATA__",
|
| 113 |
+
"steps_taken": 2
|
| 114 |
+
}
|
| 115 |
+
```"""}
|
| 116 |
+
], "source": "synthetic_v2", "task_type": "nextjs_extraction"})
|
| 117 |
+
|
| 118 |
+
# --- React SPA with XHR interception ---
|
| 119 |
+
scenarios.append({"messages": [
|
| 120 |
+
{"role": "system", "content": SYSTEM_PROMPT},
|
| 121 |
+
{"role": "user", "content": "Task: Extract job listings from a React-based careers page\nURL: https://careers.techcorp.io/jobs\nTarget schema: {\"type\":\"array\",\"items\":{\"type\":\"object\",\"properties\":{\"title\":{\"type\":\"string\"},\"location\":{\"type\":\"string\"},\"department\":{\"type\":\"string\"},\"salary\":{\"type\":\"string\"}}}}"},
|
| 122 |
+
{"role": "assistant", "content": """<thought>
|
| 123 |
+
This is likely a React SPA that fetches job data via an API call. I'll try a direct HTTP request first to see if the page has SSR data or if it's fully client-rendered.
|
| 124 |
+
</thought>
|
| 125 |
+
|
| 126 |
+
ACTION: NAVIGATE
|
| 127 |
+
```json
|
| 128 |
+
{"url": "https://careers.techcorp.io/jobs", "method": "GET"}
|
| 129 |
+
```"""},
|
| 130 |
+
{"role": "user", "content": """Observation: HTTP 200 OK
|
| 131 |
+
|
| 132 |
+
<html><head><title>Careers | TechCorp</title><script src="/static/js/main.abc123.js"></script></head><body><div id="root"></div><script>window.__APP_CONFIG__={"apiBase":"https://api.techcorp.io/v2","cdnBase":"https://cdn.techcorp.io"}</script></body></html>"""},
|
| 133 |
+
{"role": "assistant", "content": """<thought>
|
| 134 |
+
This is a fully client-rendered React SPA — the HTML only has an empty <div id="root"> and loads JavaScript bundles. No SSR data available. I can see the API base URL in window.__APP_CONFIG__: "https://api.techcorp.io/v2". I have two options:
|
| 135 |
+
|
| 136 |
+
1. Try calling the API directly (fastest, no browser needed)
|
| 137 |
+
2. Use a headless browser to render the page and intercept XHR calls
|
| 138 |
+
|
| 139 |
+
Let me try the API directly first — job listings are typically at /jobs or /positions endpoints.
|
| 140 |
+
</thought>
|
| 141 |
+
|
| 142 |
+
ACTION: NAVIGATE
|
| 143 |
+
```json
|
| 144 |
+
{"url": "https://api.techcorp.io/v2/jobs?page=1&limit=50", "method": "GET", "headers": {"Accept": "application/json", "Origin": "https://careers.techcorp.io", "Referer": "https://careers.techcorp.io/jobs"}}
|
| 145 |
+
```"""},
|
| 146 |
+
{"role": "user", "content": """Observation: HTTP 200 OK
|
| 147 |
+
|
| 148 |
+
Content-Type: application/json
|
| 149 |
+
|
| 150 |
+
{"data":[{"id":"J101","title":"Senior Frontend Engineer","location":"San Francisco, CA (Hybrid)","department":"Engineering","salary_range":"$180K - $250K","posted":"2025-01-02"},{"id":"J102","title":"ML Platform Lead","location":"Remote (US)","department":"AI/ML","salary_range":"$200K - $300K","posted":"2025-01-01"},{"id":"J103","title":"Product Designer","location":"New York, NY","department":"Design","salary_range":"$140K - $190K","posted":"2024-12-30"},{"id":"J104","title":"DevOps Engineer","location":"Austin, TX (Remote OK)","department":"Infrastructure","salary_range":"$160K - $220K","posted":"2024-12-28"}],"pagination":{"total":4,"page":1,"pages":1}}"""},
|
| 151 |
+
{"role": "assistant", "content": """<thought>
|
| 152 |
+
The API responded with clean JSON data containing all 4 job listings. No browser was needed — I discovered the API endpoint from the page source and called it directly. The pagination shows this is all the data (1 page total). Let me format the response.
|
| 153 |
+
</thought>
|
| 154 |
+
|
| 155 |
+
ACTION: RETURN_RESULT
|
| 156 |
+
```json
|
| 157 |
+
{
|
| 158 |
+
"status": "success",
|
| 159 |
+
"data": [
|
| 160 |
+
{"title": "Senior Frontend Engineer", "location": "San Francisco, CA (Hybrid)", "department": "Engineering", "salary": "$180K - $250K"},
|
| 161 |
+
{"title": "ML Platform Lead", "location": "Remote (US)", "department": "AI/ML", "salary": "$200K - $300K"},
|
| 162 |
+
{"title": "Product Designer", "location": "New York, NY", "department": "Design", "salary": "$140K - $190K"},
|
| 163 |
+
{"title": "DevOps Engineer", "location": "Austin, TX (Remote OK)", "department": "Infrastructure", "salary": "$160K - $220K"}
|
| 164 |
+
],
|
| 165 |
+
"message": "Extracted 4 job listings by discovering and calling the underlying REST API directly (https://api.techcorp.io/v2/jobs). No browser rendering needed — found API base URL in window.__APP_CONFIG__.",
|
| 166 |
+
"extraction_method": "direct_api_call",
|
| 167 |
+
"steps_taken": 3
|
| 168 |
+
}
|
| 169 |
+
```"""}
|
| 170 |
+
], "source": "synthetic_v2", "task_type": "react_api_interception"})
|
| 171 |
+
|
| 172 |
+
# --- Vue.js with __INITIAL_STATE__ ---
|
| 173 |
+
scenarios.append({"messages": [
|
| 174 |
+
{"role": "system", "content": SYSTEM_PROMPT},
|
| 175 |
+
{"role": "user", "content": "Task: Extract restaurant listings and their ratings from a Vue.js food delivery app\nURL: https://foodapp.example.com/restaurants?area=downtown"},
|
| 176 |
+
{"role": "assistant", "content": """<thought>
|
| 177 |
+
Food delivery apps are typically Vue or React SPAs. Let me fetch the page and check for __INITIAL_STATE__ (Vuex store) or __NUXT__ data (if Nuxt.js).
|
| 178 |
+
</thought>
|
| 179 |
+
|
| 180 |
+
ACTION: NAVIGATE
|
| 181 |
+
```json
|
| 182 |
+
{"url": "https://foodapp.example.com/restaurants?area=downtown", "method": "GET", "impersonate": "chrome124"}
|
| 183 |
+
```"""},
|
| 184 |
+
{"role": "user", "content": """Observation: HTTP 200 OK
|
| 185 |
+
|
| 186 |
+
<html><body><div id="app"><!--vue-ssr-outlet--></div><script>window.__INITIAL_STATE__={"restaurants":{"items":[{"id":"R1","name":"Sakura Sushi","cuisine":"Japanese","rating":4.7,"deliveryTime":"25-35 min","priceRange":"$$","image":"/img/sakura.jpg"},{"id":"R2","name":"Bella Pasta","cuisine":"Italian","rating":4.5,"deliveryTime":"30-40 min","priceRange":"$$","image":"/img/bella.jpg"},{"id":"R3","name":"Spice Garden","cuisine":"Indian","rating":4.8,"deliveryTime":"35-45 min","priceRange":"$","image":"/img/spice.jpg"}],"total":3,"area":"downtown"},"user":null}</script></body></html>"""},
|
| 187 |
+
{"role": "assistant", "content": """<thought>
|
| 188 |
+
Found window.__INITIAL_STATE__ containing the Vuex store data. The restaurants.items array has all 3 restaurants with full details. This is Vue.js SSR with Vuex state hydration. I can extract directly without rendering.
|
| 189 |
+
</thought>
|
| 190 |
+
|
| 191 |
+
ACTION: RETURN_RESULT
|
| 192 |
+
```json
|
| 193 |
+
{
|
| 194 |
+
"status": "success",
|
| 195 |
+
"data": [
|
| 196 |
+
{"name": "Sakura Sushi", "cuisine": "Japanese", "rating": 4.7, "delivery_time": "25-35 min", "price_range": "$$"},
|
| 197 |
+
{"name": "Bella Pasta", "cuisine": "Italian", "rating": 4.5, "delivery_time": "30-40 min", "price_range": "$$"},
|
| 198 |
+
{"name": "Spice Garden", "cuisine": "Indian", "rating": 4.8, "delivery_time": "35-45 min", "price_range": "$"}
|
| 199 |
+
],
|
| 200 |
+
"message": "Extracted 3 restaurants from Vue.js __INITIAL_STATE__ (Vuex store hydration). No browser rendering needed.",
|
| 201 |
+
"extraction_method": "__INITIAL_STATE__",
|
| 202 |
+
"steps_taken": 2
|
| 203 |
+
}
|
| 204 |
+
```"""}
|
| 205 |
+
], "source": "synthetic_v2", "task_type": "vuejs_state_extraction"})
|
| 206 |
+
|
| 207 |
+
# --- Infinite scroll with lazy loading ---
|
| 208 |
+
scenarios.append({"messages": [
|
| 209 |
+
{"role": "system", "content": SYSTEM_PROMPT},
|
| 210 |
+
{"role": "user", "content": "Task: Extract all image posts from an Instagram-like feed (infinite scroll)\nURL: https://photoshare.example.com/trending\nTarget schema: {\"type\":\"array\",\"items\":{\"type\":\"object\",\"properties\":{\"username\":{\"type\":\"string\"},\"caption\":{\"type\":\"string\"},\"likes\":{\"type\":\"integer\"},\"image_url\":{\"type\":\"string\"}}}}"},
|
| 211 |
+
{"role": "assistant", "content": """<thought>
|
| 212 |
+
Infinite scroll pages load content dynamically as you scroll. I need a browser to trigger the scroll events and wait for new content to load. I should also intercept the API calls that fetch new batches of posts — that's more reliable than parsing rendered HTML.
|
| 213 |
+
</thought>
|
| 214 |
+
|
| 215 |
+
ACTION: NAVIGATE_BROWSER
|
| 216 |
+
```json
|
| 217 |
+
{"url": "https://photoshare.example.com/trending", "stealth_profile": "chrome124_desktop", "wait_for": ".post-card"}
|
| 218 |
+
```"""},
|
| 219 |
+
{"role": "user", "content": """Observation: Browser loaded. Page rendered with 6 initial posts visible.
|
| 220 |
+
|
| 221 |
+
<div class="feed"><div class="post-card" data-id="1"><span class="username">@alice_photo</span><img src="https://cdn.photoshare.example.com/img/abc123.jpg"/><p class="caption">Golden hour at the beach 🌅</p><span class="likes">2,341</span></div><div class="post-card" data-id="2"><span class="username">@travel_mike</span><img src="https://cdn.photoshare.example.com/img/def456.jpg"/><p class="caption">Lost in Tokyo streets</p><span class="likes">5,892</span></div><div class="post-card" data-id="3"><span class="username">@foodie_jen</span><img src="https://cdn.photoshare.example.com/img/ghi789.jpg"/><p class="caption">Homemade sourdough day 45</p><span class="likes">1,203</span></div></div><div class="scroll-sentinel" data-loading="false"></div>"""},
|
| 222 |
+
{"role": "assistant", "content": """<thought>
|
| 223 |
+
I see 3 posts loaded initially (truncated in observation). There's a scroll sentinel element indicating infinite scroll. Let me scroll down to trigger more content loading, then intercept the API calls to capture all data efficiently.
|
| 224 |
+
</thought>
|
| 225 |
+
|
| 226 |
+
ACTION: INTERCEPT_REQUESTS
|
| 227 |
+
```json
|
| 228 |
+
{"url_pattern": "/api/feed*", "wait_for_count": 2}
|
| 229 |
+
```"""},
|
| 230 |
+
{"role": "user", "content": """Observation: Intercepted 2 API responses:
|
| 231 |
+
|
| 232 |
+
Response 1: GET /api/feed?cursor=abc&limit=10
|
| 233 |
+
{"posts":[{"id":"4","username":"@nature_lover","caption":"Morning mist in the forest","likes":3456,"image":"https://cdn.photoshare.example.com/img/jkl012.jpg"},{"id":"5","username":"@street_art","caption":"Found this mural downtown","likes":7821,"image":"https://cdn.photoshare.example.com/img/mno345.jpg"}],"next_cursor":"def","has_more":true}
|
| 234 |
+
|
| 235 |
+
Response 2: GET /api/feed?cursor=def&limit=10
|
| 236 |
+
{"posts":[{"id":"6","username":"@pet_paradise","caption":"My cat thinks she's a model","likes":12450,"image":"https://cdn.photoshare.example.com/img/pqr678.jpg"}],"next_cursor":null,"has_more":false}"""},
|
| 237 |
+
{"role": "assistant", "content": """<thought>
|
| 238 |
+
I intercepted the API calls triggered by scrolling. Combined with the initial page load, I now have all 6 posts. The last API response shows has_more=false and next_cursor=null, so this is the complete feed. Let me combine the initial HTML data with the API responses.
|
| 239 |
+
</thought>
|
| 240 |
+
|
| 241 |
+
ACTION: RETURN_RESULT
|
| 242 |
+
```json
|
| 243 |
+
{
|
| 244 |
+
"status": "success",
|
| 245 |
+
"data": [
|
| 246 |
+
{"username": "@alice_photo", "caption": "Golden hour at the beach 🌅", "likes": 2341, "image_url": "https://cdn.photoshare.example.com/img/abc123.jpg"},
|
| 247 |
+
{"username": "@travel_mike", "caption": "Lost in Tokyo streets", "likes": 5892, "image_url": "https://cdn.photoshare.example.com/img/def456.jpg"},
|
| 248 |
+
{"username": "@foodie_jen", "caption": "Homemade sourdough day 45", "likes": 1203, "image_url": "https://cdn.photoshare.example.com/img/ghi789.jpg"},
|
| 249 |
+
{"username": "@nature_lover", "caption": "Morning mist in the forest", "likes": 3456, "image_url": "https://cdn.photoshare.example.com/img/jkl012.jpg"},
|
| 250 |
+
{"username": "@street_art", "caption": "Found this mural downtown", "likes": 7821, "image_url": "https://cdn.photoshare.example.com/img/mno345.jpg"},
|
| 251 |
+
{"username": "@pet_paradise", "caption": "My cat thinks she's a model", "likes": 12450, "image_url": "https://cdn.photoshare.example.com/img/pqr678.jpg"}
|
| 252 |
+
],
|
| 253 |
+
"message": "Extracted all 6 posts from infinite scroll feed. Used browser to trigger scroll, intercepted 2 paginated API calls. Complete feed (has_more=false).",
|
| 254 |
+
"extraction_method": "browser_api_interception",
|
| 255 |
+
"steps_taken": 3
|
| 256 |
+
}
|
| 257 |
+
```"""}
|
| 258 |
+
], "source": "synthetic_v2", "task_type": "infinite_scroll"})
|
| 259 |
+
|
| 260 |
+
# =========================================================================
|
| 261 |
+
# CATEGORY 2: Anti-Bot Bypass — Cloudflare
|
| 262 |
+
# =========================================================================
|
| 263 |
+
|
| 264 |
+
# --- Cloudflare Level 1: TLS fingerprint → curl_cffi ---
|
| 265 |
+
scenarios.append({"messages": [
|
| 266 |
+
{"role": "system", "content": SYSTEM_PROMPT},
|
| 267 |
+
{"role": "user", "content": "Task: Extract cryptocurrency prices from a market data site\nURL: https://cryptodata.example.com/prices"},
|
| 268 |
+
{"role": "assistant", "content": """<thought>
|
| 269 |
+
Let me start with a standard HTTP request.
|
| 270 |
+
</thought>
|
| 271 |
+
|
| 272 |
+
ACTION: NAVIGATE
|
| 273 |
+
```json
|
| 274 |
+
{"url": "https://cryptodata.example.com/prices", "method": "GET", "headers": {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}}
|
| 275 |
+
```"""},
|
| 276 |
+
{"role": "user", "content": """Observation: HTTP 403 Forbidden
|
| 277 |
+
|
| 278 |
+
Headers: cf-ray: 8abc123-IAD, server: cloudflare
|
| 279 |
+
|
| 280 |
+
<html><head><title>Attention Required! | Cloudflare</title></head><body><div class="cf-error-details"><h1>Access denied</h1><p>Sorry, you have been blocked. This website is using a security service to protect itself.</p><p>Ray ID: 8abc123</p></div></body></html>"""},
|
| 281 |
+
{"role": "assistant", "content": """<thought>
|
| 282 |
+
Cloudflare is blocking the request. The cf-ray header and Cloudflare error page confirm this. This is likely a TLS fingerprint check — standard Python `requests` has a distinctive TLS fingerprint (JA3 hash) that Cloudflare recognizes as non-browser traffic.
|
| 283 |
+
|
| 284 |
+
Escalation plan:
|
| 285 |
+
1. First try: curl_cffi with Chrome TLS impersonation (handles ~70% of Cloudflare blocks)
|
| 286 |
+
2. If that fails: stealth headless browser
|
| 287 |
+
3. Last resort: residential proxy + stealth browser
|
| 288 |
+
|
| 289 |
+
Let me try curl_cffi first — it's fastest and doesn't need a browser.
|
| 290 |
+
</thought>
|
| 291 |
+
|
| 292 |
+
ACTION: SWITCH_STRATEGY
|
| 293 |
+
```json
|
| 294 |
+
{"new_strategy": "tls_impersonation", "reason": "Cloudflare 403 detected (cf-ray header present). Standard HTTP client blocked by TLS fingerprint check. Escalating to curl_cffi with Chrome 124 TLS impersonation."}
|
| 295 |
+
```"""},
|
| 296 |
+
{"role": "user", "content": "Observation: Strategy switched to 'tls_impersonation'. curl_cffi client ready with Chrome 124 TLS profile."},
|
| 297 |
+
{"role": "assistant", "content": """<thought>
|
| 298 |
+
Now let me retry with the impersonated TLS fingerprint. curl_cffi will present a JA3/JA4 hash identical to a real Chrome 124 browser.
|
| 299 |
+
</thought>
|
| 300 |
+
|
| 301 |
+
ACTION: NAVIGATE
|
| 302 |
+
```json
|
| 303 |
+
{"url": "https://cryptodata.example.com/prices", "method": "GET", "impersonate": "chrome124", "headers": {"Accept": "text/html,application/xhtml+xml", "Accept-Language": "en-US,en;q=0.9", "Accept-Encoding": "gzip, deflate, br"}}
|
| 304 |
+
```"""},
|
| 305 |
+
{"role": "user", "content": """Observation: HTTP 200 OK
|
| 306 |
+
|
| 307 |
+
<html><body><div class="price-table"><table><thead><tr><th>Coin</th><th>Price</th><th>24h Change</th><th>Market Cap</th></tr></thead><tbody><tr data-coin="BTC"><td>Bitcoin</td><td>$97,245.30</td><td class="positive">+2.4%</td><td>$1.92T</td></tr><tr data-coin="ETH"><td>Ethereum</td><td>$3,421.15</td><td class="positive">+1.8%</td><td>$411.2B</td></tr><tr data-coin="SOL"><td>Solana</td><td>$198.72</td><td class="negative">-0.5%</td><td>$93.4B</td></tr></tbody></table></div></body></html>"""},
|
| 308 |
+
{"role": "assistant", "content": """<thought>
|
| 309 |
+
TLS impersonation worked — Cloudflare passed us through. I can see the price table with 3 cryptocurrencies. Let me extract the data.
|
| 310 |
+
</thought>
|
| 311 |
+
|
| 312 |
+
ACTION: RETURN_RESULT
|
| 313 |
+
```json
|
| 314 |
+
{
|
| 315 |
+
"status": "success",
|
| 316 |
+
"data": [
|
| 317 |
+
{"coin": "Bitcoin", "symbol": "BTC", "price": "$97,245.30", "change_24h": "+2.4%", "market_cap": "$1.92T"},
|
| 318 |
+
{"coin": "Ethereum", "symbol": "ETH", "price": "$3,421.15", "change_24h": "+1.8%", "market_cap": "$411.2B"},
|
| 319 |
+
{"coin": "Solana", "symbol": "SOL", "price": "$198.72", "change_24h": "-0.5%", "market_cap": "$93.4B"}
|
| 320 |
+
],
|
| 321 |
+
"message": "Extracted 3 cryptocurrency prices. Initial request blocked by Cloudflare TLS fingerprinting. Bypassed using curl_cffi Chrome 124 TLS impersonation.",
|
| 322 |
+
"anti_bot_encountered": "Cloudflare",
|
| 323 |
+
"bypass_method": "tls_impersonation_curl_cffi",
|
| 324 |
+
"steps_taken": 4
|
| 325 |
+
}
|
| 326 |
+
```"""}
|
| 327 |
+
], "source": "synthetic_v2", "task_type": "cloudflare_tls_bypass"})
|
| 328 |
+
|
| 329 |
+
# --- Cloudflare Level 2: JS Challenge → Stealth Browser ---
|
| 330 |
+
scenarios.append({"messages": [
|
| 331 |
+
{"role": "system", "content": SYSTEM_PROMPT},
|
| 332 |
+
{"role": "user", "content": "Task: Extract flight prices from a travel comparison site\nURL: https://flights.example.com/search?from=JFK&to=LAX&date=2025-02-15"},
|
| 333 |
+
{"role": "assistant", "content": """<thought>
|
| 334 |
+
Travel sites often have strong anti-bot protection. Let me try with TLS impersonation first.
|
| 335 |
+
</thought>
|
| 336 |
+
|
| 337 |
+
ACTION: NAVIGATE
|
| 338 |
+
```json
|
| 339 |
+
{"url": "https://flights.example.com/search?from=JFK&to=LAX&date=2025-02-15", "method": "GET", "impersonate": "chrome124"}
|
| 340 |
+
```"""},
|
| 341 |
+
{"role": "user", "content": """Observation: HTTP 403 Forbidden
|
| 342 |
+
|
| 343 |
+
Headers: cf-ray: 9def456-SFO, cf-chl-bypass: 1
|
| 344 |
+
|
| 345 |
+
<html><head><title>Just a moment...</title><script src="/cdn-cgi/challenge-platform/scripts/jsd/main.js"></script></head><body><div id="challenge-running"><div class="loading-spinner"></div><p>Verifying you are human. This may take a few seconds.</p><noscript>Enable JavaScript and cookies to continue.</noscript></div></body></html>"""},
|
| 346 |
+
{"role": "assistant", "content": """<thought>
|
| 347 |
+
This is a Cloudflare JavaScript challenge (the "Just a moment..." interstitial page). The cf-chl-bypass header and /cdn-cgi/challenge-platform/ script confirm it. curl_cffi alone can't solve this — I need a real browser that can execute the JavaScript challenge.
|
| 348 |
+
|
| 349 |
+
Escalating to stealth headless browser with anti-detection patches.
|
| 350 |
+
</thought>
|
| 351 |
+
|
| 352 |
+
ACTION: SWITCH_STRATEGY
|
| 353 |
+
```json
|
| 354 |
+
{"new_strategy": "stealth_browser", "reason": "Cloudflare JS challenge detected (/cdn-cgi/challenge-platform/). TLS impersonation insufficient — need full JavaScript execution. Escalating to stealth Playwright with anti-detection patches."}
|
| 355 |
+
```"""},
|
| 356 |
+
{"role": "user", "content": "Observation: Strategy switched to 'stealth_browser'. Stealth Playwright instance launched with: navigator.webdriver=undefined, Chrome plugins injected, WebGL renderer spoofed, canvas noise added."},
|
| 357 |
+
{"role": "assistant", "content": """<thought>
|
| 358 |
+
Stealth browser ready with all anti-detection patches. Let me navigate to the URL and wait for the Cloudflare challenge to solve automatically.
|
| 359 |
+
</thought>
|
| 360 |
+
|
| 361 |
+
ACTION: NAVIGATE_BROWSER
|
| 362 |
+
```json
|
| 363 |
+
{"url": "https://flights.example.com/search?from=JFK&to=LAX&date=2025-02-15", "stealth_profile": "chrome124_desktop", "wait_for": ".flight-results, .search-results, [data-testid='flight-card']"}
|
| 364 |
+
```"""},
|
| 365 |
+
{"role": "user", "content": """Observation: Cloudflare challenge solved automatically (3.2s). Page loaded.
|
| 366 |
+
|
| 367 |
+
<html><body><div class="search-results"><div class="flight-card" data-id="F001"><div class="airline">Delta Air Lines</div><div class="route">JFK → LAX</div><div class="times"><span class="depart">8:00 AM</span><span class="arrive">11:15 AM</span></div><div class="duration">5h 15m (nonstop)</div><div class="price">$287</div></div><div class="flight-card" data-id="F002"><div class="airline">United Airlines</div><div class="route">JFK → LAX</div><div class="times"><span class="depart">10:30 AM</span><span class="arrive">1:55 PM</span></div><div class="duration">5h 25m (nonstop)</div><div class="price">$312</div></div><div class="flight-card" data-id="F003"><div class="airline">American Airlines</div><div class="route">JFK → LAX</div><div class="times"><span class="depart">2:00 PM</span><span class="arrive">5:30 PM</span></div><div class="duration">5h 30m (nonstop)</div><div class="price">$265</div></div></div></body></html>"""},
|
| 368 |
+
{"role": "assistant", "content": """<thought>
|
| 369 |
+
Stealth browser solved the Cloudflare JS challenge and the flight results loaded. I can see 3 flights. Let me extract them.
|
| 370 |
+
</thought>
|
| 371 |
+
|
| 372 |
+
ACTION: RETURN_RESULT
|
| 373 |
+
```json
|
| 374 |
+
{
|
| 375 |
+
"status": "success",
|
| 376 |
+
"data": [
|
| 377 |
+
{"airline": "Delta Air Lines", "departure": "8:00 AM", "arrival": "11:15 AM", "duration": "5h 15m (nonstop)", "price": "$287"},
|
| 378 |
+
{"airline": "United Airlines", "departure": "10:30 AM", "arrival": "1:55 PM", "duration": "5h 25m (nonstop)", "price": "$312"},
|
| 379 |
+
{"airline": "American Airlines", "departure": "2:00 PM", "arrival": "5:30 PM", "duration": "5h 30m (nonstop)", "price": "$265"}
|
| 380 |
+
],
|
| 381 |
+
"message": "Extracted 3 flights (JFK→LAX). Cloudflare JS challenge bypassed using stealth Playwright browser with anti-detection patches. Challenge solved in 3.2s.",
|
| 382 |
+
"anti_bot_encountered": "Cloudflare JS Challenge",
|
| 383 |
+
"bypass_method": "stealth_playwright_browser",
|
| 384 |
+
"steps_taken": 4
|
| 385 |
+
}
|
| 386 |
+
```"""}
|
| 387 |
+
], "source": "synthetic_v2", "task_type": "cloudflare_js_challenge"})
|
| 388 |
+
|
| 389 |
+
# --- DataDome detection + browser fingerprint evasion ---
|
| 390 |
+
scenarios.append({"messages": [
|
| 391 |
+
{"role": "system", "content": SYSTEM_PROMPT},
|
| 392 |
+
{"role": "user", "content": "Task: Extract sneaker inventory and prices from a streetwear retailer\nURL: https://kicks.example.com/new-releases"},
|
| 393 |
+
{"role": "assistant", "content": """<thought>
|
| 394 |
+
Sneaker/streetwear sites frequently use DataDome for bot protection due to high bot traffic from resellers. Let me try with TLS impersonation first and watch for DataDome signatures.
|
| 395 |
+
</thought>
|
| 396 |
+
|
| 397 |
+
ACTION: NAVIGATE
|
| 398 |
+
```json
|
| 399 |
+
{"url": "https://kicks.example.com/new-releases", "method": "GET", "impersonate": "chrome124"}
|
| 400 |
+
```"""},
|
| 401 |
+
{"role": "user", "content": """Observation: HTTP 403 Forbidden
|
| 402 |
+
|
| 403 |
+
Headers: x-datadome: protected, set-cookie: datadome=abc123xyz...
|
| 404 |
+
|
| 405 |
+
<html><body><div id="datadome-captcha"><iframe src="https://geo.captcha-delivery.com/captcha/?initialCid=abc123&cid=xyz789&t=fe&s=12345" width="100%" height="100%"></iframe></div></body></html>"""},
|
| 406 |
+
{"role": "assistant", "content": """<thought>
|
| 407 |
+
DataDome detected. I can see the x-datadome header and datadome cookie. It's serving a CAPTCHA via captcha-delivery.com iframe. DataDome checks browser fingerprint consistency — specifically:
|
| 408 |
+
- Chrome PDF Viewer plugin must be present
|
| 409 |
+
- Screen frame > 20px
|
| 410 |
+
- CPU cores ≥ 4
|
| 411 |
+
- Consistent device profile (UA must match screen resolution, touch points)
|
| 412 |
+
|
| 413 |
+
I need a full stealth browser with a complete, consistent fingerprint profile. Just impersonating TLS isn't enough for DataDome.
|
| 414 |
+
</thought>
|
| 415 |
+
|
| 416 |
+
ACTION: SWITCH_STRATEGY
|
| 417 |
+
```json
|
| 418 |
+
{"new_strategy": "stealth_browser_full_profile", "reason": "DataDome protection detected (x-datadome header, datadome cookie, captcha-delivery.com iframe). DataDome requires consistent browser fingerprint — need stealth browser with: PDF Viewer plugin, matching screen/UA/touch profile, canvas noise."}
|
| 419 |
+
```"""},
|
| 420 |
+
{"role": "user", "content": "Observation: Strategy switched to 'stealth_browser_full_profile'. Full fingerprint profile loaded: Chrome 124 on Windows 10, 1920x1080, 4 CPU cores, 8GB device memory, Chrome PDF Viewer plugin, WebGL ANGLE renderer, en-US locale."},
|
| 421 |
+
{"role": "assistant", "content": """<thought>
|
| 422 |
+
Full stealth profile ready. The key for DataDome is fingerprint consistency — all attributes match a real Windows Chrome user. Let me navigate with this profile.
|
| 423 |
+
</thought>
|
| 424 |
+
|
| 425 |
+
ACTION: NAVIGATE_BROWSER
|
| 426 |
+
```json
|
| 427 |
+
{"url": "https://kicks.example.com/new-releases", "stealth_profile": "chrome124_win10_full", "wait_for": ".product-grid .product-card"}
|
| 428 |
+
```"""},
|
| 429 |
+
{"role": "user", "content": """Observation: DataDome challenge passed. Page loaded (4.1s).
|
| 430 |
+
|
| 431 |
+
<div class="product-grid"><div class="product-card"><img src="/img/aj1-retro.jpg"/><h3>Air Jordan 1 Retro High OG</h3><span class="price">$180</span><span class="size-badge">Sizes: 8-13</span><span class="stock in-stock">In Stock</span></div><div class="product-card"><img src="/img/yeezy-350.jpg"/><h3>Yeezy Boost 350 V2</h3><span class="price">$230</span><span class="size-badge">Sizes: 7-12</span><span class="stock low-stock">Low Stock</span></div><div class="product-card"><img src="/img/nb-550.jpg"/><h3>New Balance 550</h3><span class="price">$110</span><span class="size-badge">Sizes: 6-14</span><span class="stock in-stock">In Stock</span></div></div>"""},
|
| 432 |
+
{"role": "assistant", "content": """<thought>
|
| 433 |
+
DataDome bypassed with the full consistent fingerprint profile. Got the product grid with 3 sneakers. Let me extract the data.
|
| 434 |
+
</thought>
|
| 435 |
+
|
| 436 |
+
ACTION: RETURN_RESULT
|
| 437 |
+
```json
|
| 438 |
+
{
|
| 439 |
+
"status": "success",
|
| 440 |
+
"data": [
|
| 441 |
+
{"name": "Air Jordan 1 Retro High OG", "price": "$180", "sizes": "8-13", "availability": "In Stock"},
|
| 442 |
+
{"name": "Yeezy Boost 350 V2", "price": "$230", "sizes": "7-12", "availability": "Low Stock"},
|
| 443 |
+
{"name": "New Balance 550", "price": "$110", "sizes": "6-14", "availability": "In Stock"}
|
| 444 |
+
],
|
| 445 |
+
"message": "Extracted 3 sneaker listings. DataDome bot protection bypassed using stealth browser with complete, consistent fingerprint profile (Chrome 124, Windows 10, 1920x1080, PDF Viewer plugin, matching CPU/memory/screen attributes).",
|
| 446 |
+
"anti_bot_encountered": "DataDome",
|
| 447 |
+
"bypass_method": "stealth_browser_full_fingerprint",
|
| 448 |
+
"steps_taken": 4
|
| 449 |
+
}
|
| 450 |
+
```"""}
|
| 451 |
+
], "source": "synthetic_v2", "task_type": "datadome_bypass"})
|
| 452 |
+
|
| 453 |
+
# --- Akamai Bot Manager + behavioral detection ---
|
| 454 |
+
scenarios.append({"messages": [
|
| 455 |
+
{"role": "system", "content": SYSTEM_PROMPT},
|
| 456 |
+
{"role": "user", "content": "Task: Extract product prices from a major retailer\nURL: https://megastore.example.com/deals/today"},
|
| 457 |
+
{"role": "assistant", "content": """<thought>
|
| 458 |
+
Major retailers typically use Akamai Bot Manager. Let me try with TLS impersonation and check for Akamai signatures.
|
| 459 |
+
</thought>
|
| 460 |
+
|
| 461 |
+
ACTION: NAVIGATE
|
| 462 |
+
```json
|
| 463 |
+
{"url": "https://megastore.example.com/deals/today", "method": "GET", "impersonate": "chrome124"}
|
| 464 |
+
```"""},
|
| 465 |
+
{"role": "user", "content": """Observation: HTTP 200 OK (but response is a challenge page)
|
| 466 |
+
|
| 467 |
+
Headers: set-cookie: _abck=ABC123...; bm_sz=DEF456...
|
| 468 |
+
|
| 469 |
+
<html><body><script type="text/javascript" src="https://megastore.example.com/akam/13/pixel_abc123"></script><noscript><img src="https://megastore.example.com/akam/13/pixel_abc123?noscript=1"/></noscript></body></html>"""},
|
| 470 |
+
{"role": "assistant", "content": """<thought>
|
| 471 |
+
Akamai Bot Manager detected! Key signatures:
|
| 472 |
+
- _abck cookie (Akamai Bot Cookie)
|
| 473 |
+
- bm_sz cookie (Bot Manager Size)
|
| 474 |
+
- /akam/13/ pixel script (Akamai sensor data collector)
|
| 475 |
+
|
| 476 |
+
Akamai requires:
|
| 477 |
+
1. Valid TLS fingerprint (✓ already have via curl_cffi)
|
| 478 |
+
2. Sensor data submission (JavaScript generates device/behavioral data)
|
| 479 |
+
3. Valid _abck cookie chain
|
| 480 |
+
|
| 481 |
+
TLS impersonation alone won't work — Akamai needs JS execution to generate sensor data. I need a full stealth browser with human-like behavioral simulation (mouse movements, timing randomization).
|
| 482 |
+
</thought>
|
| 483 |
+
|
| 484 |
+
ACTION: SWITCH_STRATEGY
|
| 485 |
+
```json
|
| 486 |
+
{"new_strategy": "stealth_browser_behavioral", "reason": "Akamai Bot Manager detected (_abck cookie, bm_sz cookie, /akam/ sensor script). Requires JS execution for sensor data generation + human-like behavioral patterns (mouse movement, scroll timing). Escalating to stealth browser with behavioral simulation."}
|
| 487 |
+
```"""},
|
| 488 |
+
{"role": "user", "content": "Observation: Strategy switched to 'stealth_browser_behavioral'. Stealth browser launched with: behavioral simulation enabled (random mouse movements, human-like scroll patterns, randomized timing between 800-2500ms), residential proxy active."},
|
| 489 |
+
{"role": "assistant", "content": """<thought>
|
| 490 |
+
Stealth browser with behavioral simulation ready, plus residential proxy to avoid IP-based blocking. Let me navigate to the deals page and wait for Akamai's sensor script to complete.
|
| 491 |
+
</thought>
|
| 492 |
+
|
| 493 |
+
ACTION: NAVIGATE_BROWSER
|
| 494 |
+
```json
|
| 495 |
+
{"url": "https://megastore.example.com/deals/today", "stealth_profile": "chrome124_behavioral", "wait_for": ".deal-card, .product-tile"}
|
| 496 |
+
```"""},
|
| 497 |
+
{"role": "user", "content": """Observation: Akamai sensor validation passed (2.8s). Page loaded.
|
| 498 |
+
|
| 499 |
+
<div class="deals-grid"><div class="deal-card"><h3>Samsung 65" 4K OLED TV</h3><span class="original-price">$1,799</span><span class="deal-price">$1,199</span><span class="discount">33% off</span></div><div class="deal-card"><h3>Apple iPad Air M2</h3><span class="original-price">$599</span><span class="deal-price">$499</span><span class="discount">17% off</span></div><div class="deal-card"><h3>Dyson V15 Vacuum</h3><span class="original-price">$749</span><span class="deal-price">$549</span><span class="discount">27% off</span></div></div>"""},
|
| 500 |
+
{"role": "assistant", "content": """<thought>
|
| 501 |
+
Akamai bypassed successfully with the behavioral simulation approach. Got 3 deal items. Let me extract.
|
| 502 |
+
</thought>
|
| 503 |
+
|
| 504 |
+
ACTION: RETURN_RESULT
|
| 505 |
+
```json
|
| 506 |
+
{
|
| 507 |
+
"status": "success",
|
| 508 |
+
"data": [
|
| 509 |
+
{"name": "Samsung 65\" 4K OLED TV", "original_price": "$1,799", "deal_price": "$1,199", "discount": "33% off"},
|
| 510 |
+
{"name": "Apple iPad Air M2", "original_price": "$599", "deal_price": "$499", "discount": "17% off"},
|
| 511 |
+
{"name": "Dyson V15 Vacuum", "original_price": "$749", "deal_price": "$549", "discount": "27% off"}
|
| 512 |
+
],
|
| 513 |
+
"message": "Extracted 3 deals. Akamai Bot Manager bypassed using stealth browser with behavioral simulation (mouse movements, timing randomization) + residential proxy.",
|
| 514 |
+
"anti_bot_encountered": "Akamai Bot Manager",
|
| 515 |
+
"bypass_method": "stealth_browser_behavioral_simulation",
|
| 516 |
+
"steps_taken": 4
|
| 517 |
+
}
|
| 518 |
+
```"""}
|
| 519 |
+
], "source": "synthetic_v2", "task_type": "akamai_bypass"})
|
| 520 |
+
|
| 521 |
+
# --- Shadow DOM extraction ---
|
| 522 |
+
scenarios.append({"messages": [
|
| 523 |
+
{"role": "system", "content": SYSTEM_PROMPT},
|
| 524 |
+
{"role": "user", "content": "Task: Extract review ratings from a site using web components with shadow DOM\nURL: https://reviews.example.com/product/X100"},
|
| 525 |
+
{"role": "assistant", "content": """<thought>
|
| 526 |
+
Web components with shadow DOM encapsulate their internal structure — standard CSS selectors can't reach inside. I'll need a browser to either use the >>> shadow-piercing combinator or execute JavaScript to traverse the shadow roots.
|
| 527 |
+
</thought>
|
| 528 |
+
|
| 529 |
+
ACTION: NAVIGATE_BROWSER
|
| 530 |
+
```json
|
| 531 |
+
{"url": "https://reviews.example.com/product/X100", "stealth_profile": "chrome124_desktop", "wait_for": "review-card"}
|
| 532 |
+
```"""},
|
| 533 |
+
{"role": "user", "content": """Observation: Page loaded. Contains custom web components <review-card> with shadow DOM. Standard querySelector returns empty for internal elements.
|
| 534 |
+
|
| 535 |
+
<html><body><div id="reviews"><review-card data-id="R1"></review-card><review-card data-id="R2"></review-card><review-card data-id="R3"></review-card></div></body></html>
|
| 536 |
+
|
| 537 |
+
Note: Internal content of <review-card> elements is inside shadow roots and not visible in outer DOM."""},
|
| 538 |
+
{"role": "assistant", "content": """<thought>
|
| 539 |
+
The review data is inside shadow DOM web components. I need to use JavaScript to access the shadowRoot of each <review-card> element and extract the content from within. Let me write a script that traverses all shadow roots.
|
| 540 |
+
</thought>
|
| 541 |
+
|
| 542 |
+
ACTION: EXECUTE_JS
|
| 543 |
+
```json
|
| 544 |
+
{"script": "return Array.from(document.querySelectorAll('review-card')).map(card => { const shadow = card.shadowRoot; if (!shadow) return null; return { id: card.dataset.id, author: shadow.querySelector('.author')?.textContent, rating: parseFloat(shadow.querySelector('.stars')?.dataset?.rating || '0'), text: shadow.querySelector('.review-text')?.textContent, date: shadow.querySelector('.review-date')?.textContent }; }).filter(Boolean)"}
|
| 545 |
+
```"""},
|
| 546 |
+
{"role": "user", "content": """Observation: JS result: [{"id":"R1","author":"Maria K.","rating":5,"text":"Absolutely incredible product. Best purchase I've made this year.","date":"Jan 3, 2025"},{"id":"R2","author":"James T.","rating":4,"text":"Solid build quality. The battery life could be better though.","date":"Jan 1, 2025"},{"id":"R3","author":"Sophie L.","rating":5,"text":"Exceeded my expectations in every way. Highly recommend!","date":"Dec 29, 2024"}]"""},
|
| 547 |
+
{"role": "assistant", "content": """<thought>
|
| 548 |
+
JavaScript traversal of shadow DOM worked perfectly. I extracted 3 reviews from within the web component shadow roots. All data matches the requested fields.
|
| 549 |
+
</thought>
|
| 550 |
+
|
| 551 |
+
ACTION: RETURN_RESULT
|
| 552 |
+
```json
|
| 553 |
+
{
|
| 554 |
+
"status": "success",
|
| 555 |
+
"data": [
|
| 556 |
+
{"author": "Maria K.", "rating": 5, "text": "Absolutely incredible product. Best purchase I've made this year.", "date": "Jan 3, 2025"},
|
| 557 |
+
{"author": "James T.", "rating": 4, "text": "Solid build quality. The battery life could be better though.", "date": "Jan 1, 2025"},
|
| 558 |
+
{"author": "Sophie L.", "rating": 5, "text": "Exceeded my expectations in every way. Highly recommend!", "date": "Dec 29, 2024"}
|
| 559 |
+
],
|
| 560 |
+
"message": "Extracted 3 reviews from shadow DOM web components using JavaScript shadowRoot traversal. Standard CSS selectors cannot reach inside shadow boundaries.",
|
| 561 |
+
"extraction_method": "shadow_dom_js_traversal",
|
| 562 |
+
"steps_taken": 3
|
| 563 |
+
}
|
| 564 |
+
```"""}
|
| 565 |
+
], "source": "synthetic_v2", "task_type": "shadow_dom"})
|
| 566 |
+
|
| 567 |
+
# --- Multi-strategy escalation: HTTP → TLS → Browser → CAPTCHA degradation ---
|
| 568 |
+
scenarios.append({"messages": [
|
| 569 |
+
{"role": "system", "content": SYSTEM_PROMPT},
|
| 570 |
+
{"role": "user", "content": "Task: Extract ticket prices and availability for a concert\nURL: https://tickets.example.com/event/rock-fest-2025"},
|
| 571 |
+
{"role": "assistant", "content": """<thought>
|
| 572 |
+
Ticket sites often have aggressive anti-bot measures. Let me start with the lightest approach and escalate.
|
| 573 |
+
</thought>
|
| 574 |
+
|
| 575 |
+
ACTION: NAVIGATE
|
| 576 |
+
```json
|
| 577 |
+
{"url": "https://tickets.example.com/event/rock-fest-2025", "method": "GET", "impersonate": "chrome124"}
|
| 578 |
+
```"""},
|
| 579 |
+
{"role": "user", "content": "Observation: HTTP 403 Forbidden\nHeaders: cf-ray: xyz789\n\n<html><head><title>Just a moment...</title></head><body><div class=\"cf-challenge\">Checking your browser...</div></body></html>"},
|
| 580 |
+
{"role": "assistant", "content": """<thought>
|
| 581 |
+
Cloudflare JS challenge. Need a browser. Let me escalate.
|
| 582 |
+
</thought>
|
| 583 |
+
|
| 584 |
+
ACTION: NAVIGATE_BROWSER
|
| 585 |
+
```json
|
| 586 |
+
{"url": "https://tickets.example.com/event/rock-fest-2025", "stealth_profile": "chrome124_desktop", "wait_for": ".ticket-section, .event-details"}
|
| 587 |
+
```"""},
|
| 588 |
+
{"role": "user", "content": """Observation: Cloudflare challenge solved. But page shows PerimeterX CAPTCHA:
|
| 589 |
+
|
| 590 |
+
<html><body><div id="px-captcha"><div class="px-challenge"><h2>Press & Hold</h2><p>Press and hold the button to verify you're human</p><button id="px-captcha-btn" data-sitekey="PX_abc123">Hold to verify</button></div></div></body></html>"""},
|
| 591 |
+
{"role": "assistant", "content": """<thought>
|
| 592 |
+
After Cloudflare passed, there's a second layer — PerimeterX (HUMAN) "Press & Hold" CAPTCHA. This requires actual human-like interaction (press and hold for a duration). I can try to simulate this with the browser, but PerimeterX measures exact timing and behavioral patterns. Let me attempt a simulated hold first — if that fails, I'll need a CAPTCHA solving service or return a partial result.
|
| 593 |
+
</thought>
|
| 594 |
+
|
| 595 |
+
ACTION: EXECUTE_JS
|
| 596 |
+
```json
|
| 597 |
+
{"script": "const btn = document.querySelector('#px-captcha-btn'); if (btn) { const rect = btn.getBoundingClientRect(); const mousedown = new PointerEvent('pointerdown', {clientX: rect.x + rect.width/2, clientY: rect.y + rect.height/2, bubbles: true}); btn.dispatchEvent(mousedown); await new Promise(r => setTimeout(r, 3200)); const mouseup = new PointerEvent('pointerup', {clientX: rect.x + rect.width/2, clientY: rect.y + rect.height/2, bubbles: true}); btn.dispatchEvent(mouseup); return 'hold_simulated'; }"}
|
| 598 |
+
```"""},
|
| 599 |
+
{"role": "user", "content": "Observation: JS executed. PerimeterX challenge still showing — simulated hold was detected as non-human."},
|
| 600 |
+
{"role": "assistant", "content": """<thought>
|
| 601 |
+
The simulated hold was detected by PerimeterX. Their behavioral analysis caught the automation. I have two options:
|
| 602 |
+
1. Send to a CAPTCHA solving service (costs money, takes 10-30s)
|
| 603 |
+
2. Return a failure with clear explanation
|
| 604 |
+
|
| 605 |
+
Since this is a hard CAPTCHA that can't be solved programmatically, I should return a clear failure with actionable suggestions.
|
| 606 |
+
</thought>
|
| 607 |
+
|
| 608 |
+
ACTION: RETURN_RESULT
|
| 609 |
+
```json
|
| 610 |
+
{
|
| 611 |
+
"status": "failed",
|
| 612 |
+
"data": null,
|
| 613 |
+
"message": "Unable to extract ticket data. Site has two layers of protection: (1) Cloudflare JS challenge — bypassed successfully, (2) PerimeterX 'Press & Hold' CAPTCHA — simulated interaction was detected. To access this site, you would need: (a) a CAPTCHA solving service like 2captcha or capsolver ($2-3/1000 solves), (b) a pre-authenticated browser session with valid _px cookies, or (c) manual access to export cookies for replay.",
|
| 614 |
+
"anti_bot_encountered": ["Cloudflare JS Challenge", "PerimeterX Press & Hold CAPTCHA"],
|
| 615 |
+
"bypass_attempted": ["tls_impersonation", "stealth_browser", "js_simulation"],
|
| 616 |
+
"steps_taken": 5
|
| 617 |
+
}
|
| 618 |
+
```"""}
|
| 619 |
+
], "source": "synthetic_v2", "task_type": "multi_layer_antibot"})
|
| 620 |
+
|
| 621 |
+
# --- Angular SPA with wait for zone.js ---
|
| 622 |
+
scenarios.append({"messages": [
|
| 623 |
+
{"role": "system", "content": SYSTEM_PROMPT},
|
| 624 |
+
{"role": "user", "content": "Task: Extract employee directory from a corporate Angular app\nURL: https://intranet.corp.example.com/directory\nAuthentication: Browser profile 'work-chrome' available"},
|
| 625 |
+
{"role": "assistant", "content": """<thought>
|
| 626 |
+
This is a corporate intranet Angular app requiring authentication. Angular apps use zone.js for change detection — I need to wait for Angular to finish rendering. Let me load the browser profile first for auth, then navigate.
|
| 627 |
+
</thought>
|
| 628 |
+
|
| 629 |
+
ACTION: LOAD_BROWSER_PROFILE
|
| 630 |
+
```json
|
| 631 |
+
{"profile_name": "work-chrome"}
|
| 632 |
+
```"""},
|
| 633 |
+
{"role": "user", "content": "Observation: Browser profile 'work-chrome' loaded. Session cookies and local storage restored."},
|
| 634 |
+
{"role": "assistant", "content": """<thought>
|
| 635 |
+
Profile loaded with auth state. Now navigate to the directory page and wait for Angular to finish rendering. I'll wait for the Angular-specific stability signal.
|
| 636 |
+
</thought>
|
| 637 |
+
|
| 638 |
+
ACTION: NAVIGATE_BROWSER
|
| 639 |
+
```json
|
| 640 |
+
{"url": "https://intranet.corp.example.com/directory", "wait_for": "app-employee-list, .employee-card, [ng-version]"}
|
| 641 |
+
```"""},
|
| 642 |
+
{"role": "user", "content": """Observation: Angular app loaded. Auth succeeded. Page shows employee directory.
|
| 643 |
+
|
| 644 |
+
<app-root ng-version="17.3.0"><app-directory><div class="directory"><app-employee-card *ngFor="let emp of employees"><div class="card"><h3>Sarah Chen</h3><p class="title">VP of Engineering</p><p class="dept">Engineering</p><p class="email">schen@corp.example.com</p></div></app-employee-card><app-employee-card><div class="card"><h3>Marcus Williams</h3><p class="title">Director of Product</p><p class="dept">Product</p><p class="email">mwilliams@corp.example.com</p></div></app-employee-card><app-employee-card><div class="card"><h3>Priya Patel</h3><p class="title">Staff ML Engineer</p><p class="dept">AI/ML</p><p class="email">ppatel@corp.example.com</p></div></app-employee-card></div></app-directory></app-root>"""},
|
| 645 |
+
{"role": "assistant", "content": """<thought>
|
| 646 |
+
Angular app rendered successfully with authentication from the browser profile. The ng-version attribute confirms Angular 17.3.0. I can see 3 employees in the directory. Let me extract.
|
| 647 |
+
</thought>
|
| 648 |
+
|
| 649 |
+
ACTION: RETURN_RESULT
|
| 650 |
+
```json
|
| 651 |
+
{
|
| 652 |
+
"status": "success",
|
| 653 |
+
"data": [
|
| 654 |
+
{"name": "Sarah Chen", "title": "VP of Engineering", "department": "Engineering", "email": "schen@corp.example.com"},
|
| 655 |
+
{"name": "Marcus Williams", "title": "Director of Product", "department": "Product", "email": "mwilliams@corp.example.com"},
|
| 656 |
+
{"name": "Priya Patel", "title": "Staff ML Engineer", "department": "AI/ML", "email": "ppatel@corp.example.com"}
|
| 657 |
+
],
|
| 658 |
+
"message": "Extracted 3 employees from Angular 17 corporate directory. Authenticated via saved browser profile. Waited for Angular zone.js rendering to complete.",
|
| 659 |
+
"extraction_method": "angular_browser_profile_auth",
|
| 660 |
+
"auth_method": "browser_profile",
|
| 661 |
+
"steps_taken": 3
|
| 662 |
+
}
|
| 663 |
+
```"""}
|
| 664 |
+
], "source": "synthetic_v2", "task_type": "angular_auth"})
|
| 665 |
+
|
| 666 |
+
# --- PerimeterX cookie replay + proxy rotation ---
|
| 667 |
+
scenarios.append({"messages": [
|
| 668 |
+
{"role": "system", "content": SYSTEM_PROMPT},
|
| 669 |
+
{"role": "user", "content": "Task: Monitor stock availability for a popular item\nURL: https://store.example.com/product/limited-edition-console\nAuthentication: Valid _px cookies from manual session available in vault"},
|
| 670 |
+
{"role": "assistant", "content": """<thought>
|
| 671 |
+
The user has valid PerimeterX cookies from a manual browser session. I can replay these cookies to bypass the PerimeterX check without needing to solve any challenge. Let me inject the cookies and navigate.
|
| 672 |
+
</thought>
|
| 673 |
+
|
| 674 |
+
ACTION: SET_COOKIES
|
| 675 |
+
```json
|
| 676 |
+
{"cookies": "vault:px_cookies", "domain": "store.example.com"}
|
| 677 |
+
```"""},
|
| 678 |
+
{"role": "user", "content": "Observation: Cookies loaded from vault. 5 cookies set for domain store.example.com including _px3, _pxvid."},
|
| 679 |
+
{"role": "assistant", "content": """<thought>
|
| 680 |
+
PerimeterX cookies injected. Let me navigate with TLS impersonation to match the browser fingerprint.
|
| 681 |
+
</thought>
|
| 682 |
+
|
| 683 |
+
ACTION: NAVIGATE
|
| 684 |
+
```json
|
| 685 |
+
{"url": "https://store.example.com/product/limited-edition-console", "method": "GET", "impersonate": "chrome124"}
|
| 686 |
+
```"""},
|
| 687 |
+
{"role": "user", "content": """Observation: HTTP 200 OK
|
| 688 |
+
|
| 689 |
+
<div class="product-page"><h1>Limited Edition Console</h1><span class="price">$499.99</span><div class="stock-status out-of-stock"><span>Currently Unavailable</span><button class="notify-btn">Notify When Available</button></div><div class="product-specs"><p>Storage: 1TB SSD</p><p>Color: Midnight Black</p><p>SKU: LEC-2025-001</p></div></div>"""},
|
| 690 |
+
{"role": "assistant", "content": """<thought>
|
| 691 |
+
PerimeterX cookie replay worked. The product page loaded showing the console is currently out of stock. Let me return the accurate status.
|
| 692 |
+
</thought>
|
| 693 |
+
|
| 694 |
+
ACTION: RETURN_RESULT
|
| 695 |
+
```json
|
| 696 |
+
{
|
| 697 |
+
"status": "success",
|
| 698 |
+
"data": {
|
| 699 |
+
"name": "Limited Edition Console",
|
| 700 |
+
"price": "$499.99",
|
| 701 |
+
"availability": "Currently Unavailable",
|
| 702 |
+
"storage": "1TB SSD",
|
| 703 |
+
"color": "Midnight Black",
|
| 704 |
+
"sku": "LEC-2025-001"
|
| 705 |
+
},
|
| 706 |
+
"message": "Product data extracted. Item is currently out of stock. PerimeterX bypassed using cookie replay from valid manual session.",
|
| 707 |
+
"anti_bot_encountered": "PerimeterX",
|
| 708 |
+
"bypass_method": "cookie_replay",
|
| 709 |
+
"steps_taken": 3
|
| 710 |
+
}
|
| 711 |
+
```"""}
|
| 712 |
+
], "source": "synthetic_v2", "task_type": "perimeterx_cookie_replay"})
|
| 713 |
+
|
| 714 |
+
# --- SPA with client-side routing, needs wait for hydration ---
|
| 715 |
+
scenarios.append({"messages": [
|
| 716 |
+
{"role": "system", "content": SYSTEM_PROMPT},
|
| 717 |
+
{"role": "user", "content": "Task: Extract course catalog from an online learning platform\nURL: https://learn.example.com/courses?topic=machine-learning"},
|
| 718 |
+
{"role": "assistant", "content": """<thought>
|
| 719 |
+
Online learning platforms are usually React/Next.js SPAs. Let me try fetching the HTML first to check for SSR data.
|
| 720 |
+
</thought>
|
| 721 |
+
|
| 722 |
+
ACTION: NAVIGATE
|
| 723 |
+
```json
|
| 724 |
+
{"url": "https://learn.example.com/courses?topic=machine-learning", "method": "GET", "impersonate": "chrome124"}
|
| 725 |
+
```"""},
|
| 726 |
+
{"role": "user", "content": """Observation: HTTP 200 OK
|
| 727 |
+
|
| 728 |
+
<html><head><meta name="next-head-count" content="12"/></head><body><div id="__next"><div data-reactroot=""><main class="course-catalog" aria-busy="true"><div class="skeleton-loader">Loading courses...</div></main></div></div><script id="__NEXT_DATA__" type="application/json">{"props":{"pageProps":{"courses":[],"isLoading":true},"__N_SSG":true},"page":"/courses","query":{"topic":"machine-learning"}}</script><script src="/_next/static/chunks/pages/courses-abc123.js"></script></body></html>"""},
|
| 729 |
+
{"role": "assistant", "content": """<thought>
|
| 730 |
+
Interesting — this is a Next.js site with SSG (Static Site Generation), but the __NEXT_DATA__ shows `courses: []` and `isLoading: true`. The data isn't embedded in SSR; it's loaded client-side after hydration. The page has a skeleton loader and aria-busy="true". I need a browser to wait for React hydration to complete and the data to load via client-side API calls.
|
| 731 |
+
</thought>
|
| 732 |
+
|
| 733 |
+
ACTION: NAVIGATE_BROWSER
|
| 734 |
+
```json
|
| 735 |
+
{"url": "https://learn.example.com/courses?topic=machine-learning", "stealth_profile": "chrome124_desktop", "wait_for": ".course-card:not(.skeleton)"}
|
| 736 |
+
```"""},
|
| 737 |
+
{"role": "user", "content": """Observation: Browser loaded. React hydrated. Courses loaded after 2.1s.
|
| 738 |
+
|
| 739 |
+
<main class="course-catalog" aria-busy="false"><div class="course-card"><h3>Deep Learning Foundations</h3><span class="instructor">Dr. Andrew Ng</span><span class="rating">4.9 (12,340 reviews)</span><span class="price">$49.99</span><span class="level">Intermediate</span><span class="duration">40 hours</span></div><div class="course-card"><h3>Practical NLP with Transformers</h3><span class="instructor">Hugging Face Team</span><span class="rating">4.8 (8,921 reviews)</span><span class="price">Free</span><span class="level">Advanced</span><span class="duration">25 hours</span></div><div class="course-card"><h3>ML Engineering in Production</h3><span class="instructor">Chip Huyen</span><span class="rating">4.7 (5,443 reviews)</span><span class="price">$39.99</span><span class="level">Advanced</span><span class="duration">30 hours</span></div></main>"""},
|
| 740 |
+
{"role": "assistant", "content": """<thought>
|
| 741 |
+
React hydration completed and courses loaded. The aria-busy changed from "true" to "false", confirming the data is ready. I see 3 courses. Let me extract them.
|
| 742 |
+
</thought>
|
| 743 |
+
|
| 744 |
+
ACTION: RETURN_RESULT
|
| 745 |
+
```json
|
| 746 |
+
{
|
| 747 |
+
"status": "success",
|
| 748 |
+
"data": [
|
| 749 |
+
{"title": "Deep Learning Foundations", "instructor": "Dr. Andrew Ng", "rating": "4.9 (12,340 reviews)", "price": "$49.99", "level": "Intermediate", "duration": "40 hours"},
|
| 750 |
+
{"title": "Practical NLP with Transformers", "instructor": "Hugging Face Team", "rating": "4.8 (8,921 reviews)", "price": "Free", "level": "Advanced", "duration": "25 hours"},
|
| 751 |
+
{"title": "ML Engineering in Production", "instructor": "Chip Huyen", "rating": "4.7 (5,443 reviews)", "price": "$39.99", "level": "Advanced", "duration": "30 hours"}
|
| 752 |
+
],
|
| 753 |
+
"message": "Extracted 3 ML courses. Next.js SSG page with client-side data loading (no SSR data in __NEXT_DATA__). Used browser to wait for React hydration and data fetch (2.1s).",
|
| 754 |
+
"extraction_method": "browser_react_hydration",
|
| 755 |
+
"steps_taken": 3
|
| 756 |
+
}
|
| 757 |
+
```"""}
|
| 758 |
+
], "source": "synthetic_v2", "task_type": "react_hydration_wait"})
|
| 759 |
+
|
| 760 |
+
return scenarios
|
| 761 |
+
|
| 762 |
+
|
| 763 |
+
def main():
|
| 764 |
+
print("Generating expanded v2 training data...")
|
| 765 |
+
|
| 766 |
+
v2_scenarios = generate_all_scenarios()
|
| 767 |
+
v2_ds = Dataset.from_list(v2_scenarios)
|
| 768 |
+
print(f"Generated {len(v2_ds)} new v2 scenarios")
|
| 769 |
+
|
| 770 |
+
# Load existing dataset
|
| 771 |
+
print("Loading existing dataset...")
|
| 772 |
+
existing = load_dataset("sukritvemula/webscrape-agent-training-data")
|
| 773 |
+
existing_train = existing["train"]
|
| 774 |
+
existing_test = existing["test"]
|
| 775 |
+
print(f"Existing: {len(existing_train)} train, {len(existing_test)} test")
|
| 776 |
+
|
| 777 |
+
# Add source/task_type to v2 if not present in existing
|
| 778 |
+
for col in ["source", "task_type"]:
|
| 779 |
+
if col not in existing_train.column_names:
|
| 780 |
+
existing_train = existing_train.add_column(col, ["legacy"] * len(existing_train))
|
| 781 |
+
|
| 782 |
+
# Combine
|
| 783 |
+
combined_train = concatenate_datasets([existing_train, v2_ds])
|
| 784 |
+
combined_train = combined_train.shuffle(seed=42)
|
| 785 |
+
|
| 786 |
+
print(f"\nCombined train: {len(combined_train)}")
|
| 787 |
+
|
| 788 |
+
# Stats
|
| 789 |
+
sources = {}
|
| 790 |
+
for item in combined_train:
|
| 791 |
+
src = item.get("source", "unknown")
|
| 792 |
+
sources[src] = sources.get(src, 0) + 1
|
| 793 |
+
print("Composition:")
|
| 794 |
+
for src, count in sorted(sources.items(), key=lambda x: -x[1]):
|
| 795 |
+
print(f" {src}: {count} ({count/len(combined_train)*100:.1f}%)")
|
| 796 |
+
|
| 797 |
+
# Push
|
| 798 |
+
ds_dict = DatasetDict({"train": combined_train, "test": existing_test})
|
| 799 |
+
print("\nPushing to Hub...")
|
| 800 |
+
ds_dict.push_to_hub("sukritvemula/webscrape-agent-training-data", private=False)
|
| 801 |
+
print("✅ Done!")
|
| 802 |
+
|
| 803 |
+
|
| 804 |
+
if __name__ == "__main__":
|
| 805 |
+
main()
|