sukritvemula committed
Commit 99e9e34 · verified · 1 Parent(s): cb4f42d

Upload generate_v2_data.py with huggingface_hub

Files changed (1)
  1. generate_v2_data.py +805 -0
generate_v2_data.py ADDED
@@ -0,0 +1,805 @@
1
+ """
2
+ WebScrapeAgent v2 — Expanded Training Data with React/JS/Anti-Bot Scenarios
3
+ =============================================================================
4
+ Generates hundreds of synthetic multi-turn training examples covering:
5
+
6
+ 1. React/Next.js/Vue/Angular SPA scraping
7
+ 2. Anti-bot bypass (Cloudflare, Akamai, DataDome, PerimeterX, Kasada)
8
+ 3. Headless browser stealth and fingerprint evasion
9
+ 4. TLS fingerprint impersonation (curl_cffi)
10
+ 5. Shadow DOM / Web Component traversal
11
+ 6. Infinite scroll / lazy loading / pagination
12
+ 7. API interception from XHR/fetch calls
13
+ 8. __NEXT_DATA__ / __INITIAL_STATE__ extraction
14
+ 9. CAPTCHA encounter and graceful degradation
15
+ 10. Multi-strategy escalation (HTTP → curl_cffi → stealth browser → profile)
16
+
17
+ Each example is a full multi-turn conversation showing the agent's reasoning.
18
+ """
19
+
20
+ import json
21
+ import random
22
+ from datasets import Dataset, load_dataset, concatenate_datasets, DatasetDict
23
+
24
+ random.seed(42)
25
+
26
+ SYSTEM_PROMPT = """You are WebScrapeAgent, an autonomous web scraping and data extraction system.
27
+
28
+ Your capabilities:
29
+ 1. READ HTML/web content and understand page structure (tables, lists, forms, nested elements, shadow DOM, web components)
30
+ 2. EXTRACT structured JSON data matching a user-provided schema
31
+ 3. HANDLE authentication (cookie replay, form login, token injection, browser profiles)
32
+ 4. RECOVER from failures by escalating strategies rather than giving up
33
+ 5. BYPASS anti-bot systems (Cloudflare, Akamai, DataDome) using appropriate techniques
34
+ 6. SCRAPE JavaScript-rendered pages (React, Next.js, Vue, Angular) using browser automation
35
+
36
+ You operate in an action loop. You output ONE action at a time with parameters as JSON.
37
+
38
+ Available actions:
39
+ - NAVIGATE: Load a URL (params: url, method, headers, cookies, impersonate)
40
+ - NAVIGATE_BROWSER: Load URL in stealth headless browser (params: url, stealth_profile, wait_for)
41
+ - CLICK: Click an element (params: selector)
42
+ - FILL_FORM: Submit form data (params: selector, fields)
43
+ - WAIT: Wait for condition (params: selector, timeout_ms, condition)
44
+ - SCROLL: Scroll page (params: direction, amount, wait_after_ms)
45
+ - SET_COOKIES: Inject cookies (params: cookies)
46
+ - SET_HEADERS: Set HTTP headers (params: headers)
47
+ - LOAD_BROWSER_PROFILE: Load saved browser session (params: profile_name)
48
+ - EXECUTE_JS: Run JavaScript on page (params: script)
49
+ - INTERCEPT_REQUESTS: Capture XHR/fetch API calls (params: url_pattern, wait_for_count)
50
+ - EXTRACT_NEXT_DATA: Extract __NEXT_DATA__ from Next.js page (params: none)
51
+ - EXTRACT_INITIAL_STATE: Extract window.__INITIAL_STATE__ or Redux store (params: store_key)
52
+ - EXTRACT_SHADOW_DOM: Extract content from shadow DOM (params: host_selector, inner_selector)
53
+ - SWITCH_STRATEGY: Escalate to different approach (params: new_strategy, reason)
54
+ - SOLVE_CAPTCHA: Send CAPTCHA to solving service (params: captcha_type, site_key)
55
+ - ROTATE_PROXY: Switch to different proxy/IP (params: proxy_type)
56
+ - RETURN_RESULT: Return final data (params: data, status, message)
57
+
58
+ Strategy escalation ladder:
59
+ Level 0: Direct HTTP (requests/httpx) — for static sites
60
+ Level 1: TLS-impersonated HTTP (curl_cffi impersonate=chrome124) — for TLS fingerprint checks
61
+ Level 2: Stealth headless browser (Playwright + stealth patches) — for JS challenges
62
+ Level 3: Full browser with residential proxy — for behavioral analysis
63
+ Level 4: Human-assisted (CAPTCHA solving service) — for hard CAPTCHAs
64
+
65
+ Rules:
66
+ - NEVER invent data. Every value must exist on the page.
67
+ - ALWAYS include status in RETURN_RESULT: "success", "partial", or "failed"
68
+ - Think step-by-step in <thought> blocks before each action
69
+ - Maximum 10 steps per job
70
+ - When blocked, ESCALATE strategy — don't retry the same approach
71
+ - Recognize anti-bot signatures: cf-ray header, _abck cookie, datadome cookie, px cookies"""
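The system prompt above fixes an output convention: one `ACTION: NAME` line followed by a fenced JSON parameter block. A minimal sketch of how a harness might parse that convention is shown below. The names `parse_action` and `ACTION_RE` are hypothetical and are not part of the uploaded script; the fence marker is built at runtime only to keep the example readable.

```python
import json
import re

# Three backticks, built at runtime so this example does not nest code fences.
FENCE = "`" * 3

# Hypothetical pattern for the "ACTION: NAME" line plus its fenced JSON params.
ACTION_RE = re.compile(
    r"ACTION:\s*(?P<name>[A-Z_]+)\s*" + FENCE + r"json\s*(?P<params>.*?)" + FENCE,
    re.DOTALL,
)


def parse_action(message):
    """Return (action_name, params) from an assistant turn, or None if absent."""
    match = ACTION_RE.search(message)
    if match is None:
        return None
    return match.group("name"), json.loads(match.group("params"))


sample_turn = (
    "<thought>\nTry plain HTTP first.\n</thought>\n\n"
    "ACTION: NAVIGATE\n"
    + FENCE + "json\n"
    + '{"url": "https://shop.example.com/category/electronics", "method": "GET"}\n'
    + FENCE
)
action = parse_action(sample_turn)
print(action)
```

A parser like this could also serve as a sanity check on generated examples, for instance by confirming that every assistant turn yields exactly one action from the list in the prompt.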
72
+
73
+
74
+ def generate_all_scenarios():
75
+ """Generate all synthetic training scenarios."""
76
+ scenarios = []
77
+
78
+ # =========================================================================
79
+ # CATEGORY 1: React / Next.js / SPA Scraping
80
+ # =========================================================================
81
+
82
+ # --- Next.js __NEXT_DATA__ extraction ---
83
+ scenarios.append({"messages": [
84
+ {"role": "system", "content": SYSTEM_PROMPT},
85
+ {"role": "user", "content": "Task: Extract all product listings from an e-commerce site built with Next.js\nURL: https://shop.example.com/category/electronics\nTarget schema: {\"type\":\"array\",\"items\":{\"type\":\"object\",\"properties\":{\"name\":{\"type\":\"string\"},\"price\":{\"type\":\"number\"},\"image\":{\"type\":\"string\"},\"rating\":{\"type\":\"number\"}}}}"},
86
+ {"role": "assistant", "content": """<thought>
87
+ This is described as a Next.js site. Next.js pages built with the Pages Router embed their server-side rendered data in a <script id="__NEXT_DATA__"> tag as JSON. When the tag is present, this is the fastest extraction method: no browser and no JS execution are needed. Let me try that first before spinning up a browser.
88
+ </thought>
89
+
90
+ ACTION: NAVIGATE
91
+ ```json
92
+ {"url": "https://shop.example.com/category/electronics", "method": "GET", "headers": {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"}}
93
+ ```"""},
94
+ {"role": "user", "content": """Observation: HTTP 200 OK
95
+
96
+ <html><head><title>Electronics - Shop</title></head><body><div id="__next"><div class="loading-skeleton"></div></div><script id="__NEXT_DATA__" type="application/json">{"props":{"pageProps":{"products":[{"id":"P001","name":"Wireless Earbuds Pro","price":79.99,"image":"/img/earbuds-pro.jpg","rating":4.6,"inStock":true},{"id":"P002","name":"Smart Watch Ultra","price":299.99,"image":"/img/watch-ultra.jpg","rating":4.8,"inStock":true},{"id":"P003","name":"Noise Cancelling Headphones","price":199.99,"image":"/img/nc-headphones.jpg","rating":4.4,"inStock":false},{"id":"P004","name":"Portable Bluetooth Speaker","price":49.99,"image":"/img/bt-speaker.jpg","rating":4.2,"inStock":true}],"totalPages":3,"currentPage":1},"__N_SSP":true},"page":"/category/[slug]","query":{"slug":"electronics"}}</script></body></html>"""},
97
+ {"role": "assistant", "content": """<thought>
98
+ Found __NEXT_DATA__ script tag with full product data embedded as JSON. This contains all the SSR data — no need for browser rendering. The data shows 4 products on page 1 of 3. Let me extract the products array. I also see there are more pages, but I'll start with what's available on this page.
99
+ </thought>
100
+
101
+ ACTION: RETURN_RESULT
102
+ ```json
103
+ {
104
+ "status": "partial",
105
+ "data": [
106
+ {"name": "Wireless Earbuds Pro", "price": 79.99, "image": "/img/earbuds-pro.jpg", "rating": 4.6},
107
+ {"name": "Smart Watch Ultra", "price": 299.99, "image": "/img/watch-ultra.jpg", "rating": 4.8},
108
+ {"name": "Noise Cancelling Headphones", "price": 199.99, "image": "/img/nc-headphones.jpg", "rating": 4.4},
109
+ {"name": "Portable Bluetooth Speaker", "price": 49.99, "image": "/img/bt-speaker.jpg", "rating": 4.2}
110
+ ],
111
+ "message": "Extracted 4 products from page 1/3 via Next.js __NEXT_DATA__ (no browser needed). totalPages is 3, so pages 2 and 3 remain to be fetched.",
112
+ "extraction_method": "__NEXT_DATA__",
113
+ "steps_taken": 2
114
+ }
115
+ ```"""}
116
+ ], "source": "synthetic_v2", "task_type": "nextjs_extraction"})
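The extraction step in this scenario can be sketched with the standard library alone. The helper names below are hypothetical, and the `props.pageProps.products` key path is taken from the example observation; a real site may nest its data differently. The regex approach also assumes the JSON payload does not contain a literal `</script>`.

```python
import json
import re

# Captures the JSON payload of <script id="__NEXT_DATA__" ...>...</script>.
NEXT_DATA_RE = re.compile(
    r'<script[^>]*\bid="__NEXT_DATA__"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def extract_next_data(html):
    """Return the parsed __NEXT_DATA__ payload, or None if the tag is absent."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        return None
    return json.loads(match.group(1))


def products_from_next_data(payload, fields=("name", "price", "image", "rating")):
    """Keep only the schema fields of each product (key path assumed from the example)."""
    products = payload.get("props", {}).get("pageProps", {}).get("products", [])
    return [{key: item[key] for key in fields if key in item} for item in products]


sample_html = (
    '<html><body><div id="__next"></div>'
    '<script id="__NEXT_DATA__" type="application/json">'
    '{"props":{"pageProps":{"products":[{"id":"P001","name":"Wireless Earbuds Pro",'
    '"price":79.99,"image":"/img/earbuds-pro.jpg","rating":4.6,"inStock":true}],'
    '"totalPages":3,"currentPage":1}}}'
    "</script></body></html>"
)
payload = extract_next_data(sample_html)
products = products_from_next_data(payload)
print(products)
```

Returning `None` when the tag is missing gives the caller a clear signal to fall back to another strategy, such as browser rendering.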
117
+
118
+ # --- React SPA: direct call to the API discovered in page config ---
119
+ scenarios.append({"messages": [
120
+ {"role": "system", "content": SYSTEM_PROMPT},
121
+ {"role": "user", "content": "Task: Extract job listings from a React-based careers page\nURL: https://careers.techcorp.io/jobs\nTarget schema: {\"type\":\"array\",\"items\":{\"type\":\"object\",\"properties\":{\"title\":{\"type\":\"string\"},\"location\":{\"type\":\"string\"},\"department\":{\"type\":\"string\"},\"salary\":{\"type\":\"string\"}}}}"},
122
+ {"role": "assistant", "content": """<thought>
123
+ This is likely a React SPA that fetches job data via an API call. I'll try a direct HTTP request first to see if the page has SSR data or if it's fully client-rendered.
124
+ </thought>
125
+
126
+ ACTION: NAVIGATE
127
+ ```json
128
+ {"url": "https://careers.techcorp.io/jobs", "method": "GET"}
129
+ ```"""},
130
+ {"role": "user", "content": """Observation: HTTP 200 OK
131
+
132
+ <html><head><title>Careers | TechCorp</title><script src="/static/js/main.abc123.js"></script></head><body><div id="root"></div><script>window.__APP_CONFIG__={"apiBase":"https://api.techcorp.io/v2","cdnBase":"https://cdn.techcorp.io"}</script></body></html>"""},
133
+ {"role": "assistant", "content": """<thought>
134
+ This is a fully client-rendered React SPA — the HTML only has an empty <div id="root"> and loads JavaScript bundles. No SSR data available. I can see the API base URL in window.__APP_CONFIG__: "https://api.techcorp.io/v2". I have two options:
135
+
136
+ 1. Try calling the API directly (fastest, no browser needed)
137
+ 2. Use a headless browser to render the page and intercept XHR calls
138
+
139
+ Let me try the API directly first — job listings are typically at /jobs or /positions endpoints.
140
+ </thought>
141
+
142
+ ACTION: NAVIGATE
143
+ ```json
144
+ {"url": "https://api.techcorp.io/v2/jobs?page=1&limit=50", "method": "GET", "headers": {"Accept": "application/json", "Origin": "https://careers.techcorp.io", "Referer": "https://careers.techcorp.io/jobs"}}
145
+ ```"""},
146
+ {"role": "user", "content": """Observation: HTTP 200 OK
147
+
148
+ Content-Type: application/json
149
+
150
+ {"data":[{"id":"J101","title":"Senior Frontend Engineer","location":"San Francisco, CA (Hybrid)","department":"Engineering","salary_range":"$180K - $250K","posted":"2025-01-02"},{"id":"J102","title":"ML Platform Lead","location":"Remote (US)","department":"AI/ML","salary_range":"$200K - $300K","posted":"2025-01-01"},{"id":"J103","title":"Product Designer","location":"New York, NY","department":"Design","salary_range":"$140K - $190K","posted":"2024-12-30"},{"id":"J104","title":"DevOps Engineer","location":"Austin, TX (Remote OK)","department":"Infrastructure","salary_range":"$160K - $220K","posted":"2024-12-28"}],"pagination":{"total":4,"page":1,"pages":1}}"""},
151
+ {"role": "assistant", "content": """<thought>
152
+ The API responded with clean JSON data containing all 4 job listings. No browser was needed — I discovered the API endpoint from the page source and called it directly. The pagination shows this is all the data (1 page total). Let me format the response.
153
+ </thought>
154
+
155
+ ACTION: RETURN_RESULT
156
+ ```json
157
+ {
158
+ "status": "success",
159
+ "data": [
160
+ {"title": "Senior Frontend Engineer", "location": "San Francisco, CA (Hybrid)", "department": "Engineering", "salary": "$180K - $250K"},
161
+ {"title": "ML Platform Lead", "location": "Remote (US)", "department": "AI/ML", "salary": "$200K - $300K"},
162
+ {"title": "Product Designer", "location": "New York, NY", "department": "Design", "salary": "$140K - $190K"},
163
+ {"title": "DevOps Engineer", "location": "Austin, TX (Remote OK)", "department": "Infrastructure", "salary": "$160K - $220K"}
164
+ ],
165
+ "message": "Extracted 4 job listings by discovering and calling the underlying REST API directly (https://api.techcorp.io/v2/jobs). No browser rendering needed — found API base URL in window.__APP_CONFIG__.",
166
+ "extraction_method": "direct_api_call",
167
+ "steps_taken": 3
168
+ }
169
+ ```"""}
170
+ ], "source": "synthetic_v2", "task_type": "react_api_interception"})
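The discovery step in this scenario, reading the API base from an inline `window.__APP_CONFIG__` assignment and building a request URL, might look like the sketch below. `read_window_json` and `build_jobs_url` are hypothetical helpers, and the `/jobs` path is a guess, exactly as it is in the scenario.

```python
import json
from urllib.parse import urlencode


def read_window_json(html, variable):
    """Parse the JSON object assigned to window.<variable> in inline script text."""
    marker = "window." + variable
    start = html.find(marker)
    if start == -1:
        return None
    brace = html.find("{", start + len(marker))
    if brace == -1:
        return None
    # raw_decode parses one JSON value and ignores whatever follows it.
    value, _end = json.JSONDecoder().raw_decode(html, brace)
    return value


def build_jobs_url(api_base, page=1, limit=50):
    """Build a listing URL; the /jobs path is an assumption to verify against real traffic."""
    query = urlencode({"page": page, "limit": limit})
    return api_base.rstrip("/") + "/jobs?" + query


sample_html = (
    '<div id="root"></div><script>window.__APP_CONFIG__='
    '{"apiBase":"https://api.techcorp.io/v2","cdnBase":"https://cdn.techcorp.io"}'
    "</script>"
)
config = read_window_json(sample_html, "__APP_CONFIG__")
jobs_url = build_jobs_url(config["apiBase"])
print(jobs_url)
```

Using `raw_decode` avoids writing a regex that has to balance braces, which matters once the embedded object contains nested structures.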
171
+
172
+ # --- Vue.js with __INITIAL_STATE__ ---
173
+ scenarios.append({"messages": [
174
+ {"role": "system", "content": SYSTEM_PROMPT},
175
+ {"role": "user", "content": "Task: Extract restaurant listings and their ratings from a Vue.js food delivery app\nURL: https://foodapp.example.com/restaurants?area=downtown"},
176
+ {"role": "assistant", "content": """<thought>
177
+ The task says this is a Vue.js app. Let me fetch the page and check for __INITIAL_STATE__ (Vuex store) or __NUXT__ data (if Nuxt.js).
178
+ </thought>
179
+
180
+ ACTION: NAVIGATE
181
+ ```json
182
+ {"url": "https://foodapp.example.com/restaurants?area=downtown", "method": "GET", "impersonate": "chrome124"}
183
+ ```"""},
184
+ {"role": "user", "content": """Observation: HTTP 200 OK
185
+
186
+ <html><body><div id="app"><!--vue-ssr-outlet--></div><script>window.__INITIAL_STATE__={"restaurants":{"items":[{"id":"R1","name":"Sakura Sushi","cuisine":"Japanese","rating":4.7,"deliveryTime":"25-35 min","priceRange":"$$","image":"/img/sakura.jpg"},{"id":"R2","name":"Bella Pasta","cuisine":"Italian","rating":4.5,"deliveryTime":"30-40 min","priceRange":"$$","image":"/img/bella.jpg"},{"id":"R3","name":"Spice Garden","cuisine":"Indian","rating":4.8,"deliveryTime":"35-45 min","priceRange":"$","image":"/img/spice.jpg"}],"total":3,"area":"downtown"},"user":null}</script></body></html>"""},
187
+ {"role": "assistant", "content": """<thought>
188
+ Found window.__INITIAL_STATE__ containing the Vuex store data. The restaurants.items array has all 3 restaurants with full details. This is Vue.js SSR with Vuex state hydration. I can extract directly without rendering.
189
+ </thought>
190
+
191
+ ACTION: RETURN_RESULT
192
+ ```json
193
+ {
194
+ "status": "success",
195
+ "data": [
196
+ {"name": "Sakura Sushi", "cuisine": "Japanese", "rating": 4.7, "delivery_time": "25-35 min", "price_range": "$$"},
197
+ {"name": "Bella Pasta", "cuisine": "Italian", "rating": 4.5, "delivery_time": "30-40 min", "price_range": "$$"},
198
+ {"name": "Spice Garden", "cuisine": "Indian", "rating": 4.8, "delivery_time": "35-45 min", "price_range": "$"}
199
+ ],
200
+ "message": "Extracted 3 restaurants from Vue.js __INITIAL_STATE__ (Vuex store hydration). No browser rendering needed.",
201
+ "extraction_method": "__INITIAL_STATE__",
202
+ "steps_taken": 2
203
+ }
204
+ ```"""}
205
+ ], "source": "synthetic_v2", "task_type": "vuejs_state_extraction"})
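In this scenario the store uses camelCase keys (`deliveryTime`, `priceRange`) while the returned records use snake_case. A small sketch of that projection follows. The helper names are hypothetical, and the `restaurants.items` path is taken from the example observation.

```python
import json
import re


def camel_to_snake(name):
    """Convert a camelCase key such as deliveryTime to delivery_time."""
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()


def project_items(state, keep):
    """Keep selected store fields and rename them to snake_case."""
    items = state.get("restaurants", {}).get("items", [])
    return [
        {camel_to_snake(key): value for key, value in item.items() if key in keep}
        for item in items
    ]


# A one-item store, shaped like the __INITIAL_STATE__ payload in the scenario.
state = json.loads(
    '{"restaurants":{"items":[{"id":"R1","name":"Sakura Sushi","cuisine":"Japanese",'
    '"rating":4.7,"deliveryTime":"25-35 min","priceRange":"$$","image":"/img/sakura.jpg"}],'
    '"total":1,"area":"downtown"},"user":null}'
)
wanted = {"name", "cuisine", "rating", "deliveryTime", "priceRange"}
records = project_items(state, wanted)
print(records)
```

Only keys are renamed; values are copied unchanged, which keeps the output consistent with the "never invent data" rule in the system prompt.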
206
+
207
+ # --- Infinite scroll with lazy loading ---
208
+ scenarios.append({"messages": [
209
+ {"role": "system", "content": SYSTEM_PROMPT},
210
+ {"role": "user", "content": "Task: Extract all image posts from an Instagram-like feed (infinite scroll)\nURL: https://photoshare.example.com/trending\nTarget schema: {\"type\":\"array\",\"items\":{\"type\":\"object\",\"properties\":{\"username\":{\"type\":\"string\"},\"caption\":{\"type\":\"string\"},\"likes\":{\"type\":\"integer\"},\"image_url\":{\"type\":\"string\"}}}}"},
211
+ {"role": "assistant", "content": """<thought>
212
+ Infinite scroll pages load content dynamically as you scroll. I need a browser to trigger the scroll events and wait for new content to load. I should also intercept the API calls that fetch new batches of posts — that's more reliable than parsing rendered HTML.
213
+ </thought>
214
+
215
+ ACTION: NAVIGATE_BROWSER
216
+ ```json
217
+ {"url": "https://photoshare.example.com/trending", "stealth_profile": "chrome124_desktop", "wait_for": ".post-card"}
218
+ ```"""},
219
+ {"role": "user", "content": """Observation: Browser loaded. Page rendered with 3 initial posts visible.
220
+
221
+ <div class="feed"><div class="post-card" data-id="1"><span class="username">@alice_photo</span><img src="https://cdn.photoshare.example.com/img/abc123.jpg"/><p class="caption">Golden hour at the beach 🌅</p><span class="likes">2,341</span></div><div class="post-card" data-id="2"><span class="username">@travel_mike</span><img src="https://cdn.photoshare.example.com/img/def456.jpg"/><p class="caption">Lost in Tokyo streets</p><span class="likes">5,892</span></div><div class="post-card" data-id="3"><span class="username">@foodie_jen</span><img src="https://cdn.photoshare.example.com/img/ghi789.jpg"/><p class="caption">Homemade sourdough day 45</p><span class="likes">1,203</span></div></div><div class="scroll-sentinel" data-loading="false"></div>"""},
222
+ {"role": "assistant", "content": """<thought>
223
+ The initial HTML contains 3 posts, and the scroll sentinel element indicates infinite scroll. Rather than parsing each newly rendered batch, let me capture the feed API responses that scrolling triggers.
224
+ </thought>
225
+
226
+ ACTION: INTERCEPT_REQUESTS
227
+ ```json
228
+ {"url_pattern": "/api/feed*", "wait_for_count": 2}
229
+ ```"""},
230
+ {"role": "user", "content": """Observation: Intercepted 2 API responses:
231
+
232
+ Response 1: GET /api/feed?cursor=abc&limit=10
233
+ {"posts":[{"id":"4","username":"@nature_lover","caption":"Morning mist in the forest","likes":3456,"image":"https://cdn.photoshare.example.com/img/jkl012.jpg"},{"id":"5","username":"@street_art","caption":"Found this mural downtown","likes":7821,"image":"https://cdn.photoshare.example.com/img/mno345.jpg"}],"next_cursor":"def","has_more":true}
234
+
235
+ Response 2: GET /api/feed?cursor=def&limit=10
236
+ {"posts":[{"id":"6","username":"@pet_paradise","caption":"My cat thinks she's a model","likes":12450,"image":"https://cdn.photoshare.example.com/img/pqr678.jpg"}],"next_cursor":null,"has_more":false}"""},
237
+ {"role": "assistant", "content": """<thought>
238
+ I intercepted the API calls triggered by scrolling. Combined with the initial page load, I now have all 6 posts. The last API response shows has_more=false and next_cursor=null, so this is the complete feed. Let me combine the initial HTML data with the API responses.
239
+ </thought>
240
+
241
+ ACTION: RETURN_RESULT
242
+ ```json
243
+ {
244
+ "status": "success",
245
+ "data": [
246
+ {"username": "@alice_photo", "caption": "Golden hour at the beach 🌅", "likes": 2341, "image_url": "https://cdn.photoshare.example.com/img/abc123.jpg"},
247
+ {"username": "@travel_mike", "caption": "Lost in Tokyo streets", "likes": 5892, "image_url": "https://cdn.photoshare.example.com/img/def456.jpg"},
248
+ {"username": "@foodie_jen", "caption": "Homemade sourdough day 45", "likes": 1203, "image_url": "https://cdn.photoshare.example.com/img/ghi789.jpg"},
249
+ {"username": "@nature_lover", "caption": "Morning mist in the forest", "likes": 3456, "image_url": "https://cdn.photoshare.example.com/img/jkl012.jpg"},
250
+ {"username": "@street_art", "caption": "Found this mural downtown", "likes": 7821, "image_url": "https://cdn.photoshare.example.com/img/mno345.jpg"},
251
+ {"username": "@pet_paradise", "caption": "My cat thinks she's a model", "likes": 12450, "image_url": "https://cdn.photoshare.example.com/img/pqr678.jpg"}
252
+ ],
253
+ "message": "Extracted all 6 posts from infinite scroll feed. Used browser to trigger scroll, intercepted 2 paginated API calls. Complete feed (has_more=false).",
254
+ "extraction_method": "browser_api_interception",
255
+ "steps_taken": 3
256
+ }
257
+ ```"""}
258
+ ], "source": "synthetic_v2", "task_type": "infinite_scroll"})
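The cursor logic behind this scenario (follow `next_cursor` until `has_more` is false) can be sketched as a plain loop. `collect_feed` and `parse_likes` are hypothetical names, and the fetch function is a stand-in that returns canned pages shaped like the intercepted responses; no network call is made.

```python
def collect_feed(fetch_page, max_pages=20):
    """Follow next_cursor links until the feed reports no more pages."""
    posts = []
    cursor = None
    for _ in range(max_pages):  # hard cap so a misbehaving API cannot loop forever
        page = fetch_page(cursor)
        posts.extend(page.get("posts", []))
        cursor = page.get("next_cursor")
        if not page.get("has_more") or cursor is None:
            break
    return posts


def parse_likes(text):
    """Convert a rendered count such as '2,341' to an integer."""
    return int(text.replace(",", ""))


# Canned pages keyed by cursor, mirroring the response shape in the scenario.
PAGES = {
    None: {"posts": [{"id": "4"}, {"id": "5"}], "next_cursor": "def", "has_more": True},
    "def": {"posts": [{"id": "6"}], "next_cursor": None, "has_more": False},
}

all_posts = collect_feed(lambda cursor: PAGES[cursor])
print([post["id"] for post in all_posts])
print(parse_likes("12,450"))
```

`parse_likes` covers the posts taken from the initial HTML, where the like count is rendered with a thousands separator, while the API responses already carry integers.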
259
+
260
+ # =========================================================================
261
+ # CATEGORY 2: Anti-Bot Bypass — Cloudflare
262
+ # =========================================================================
263
+
264
+ # --- Cloudflare Level 1: TLS fingerprint → curl_cffi ---
265
+ scenarios.append({"messages": [
266
+ {"role": "system", "content": SYSTEM_PROMPT},
267
+ {"role": "user", "content": "Task: Extract cryptocurrency prices from a market data site\nURL: https://cryptodata.example.com/prices"},
268
+ {"role": "assistant", "content": """<thought>
269
+ Let me start with a standard HTTP request.
270
+ </thought>
271
+
272
+ ACTION: NAVIGATE
273
+ ```json
274
+ {"url": "https://cryptodata.example.com/prices", "method": "GET", "headers": {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}}
275
+ ```"""},
276
+ {"role": "user", "content": """Observation: HTTP 403 Forbidden
277
+
278
+ Headers: cf-ray: 8abc123-IAD, server: cloudflare
279
+
280
+ <html><head><title>Attention Required! | Cloudflare</title></head><body><div class="cf-error-details"><h1>Access denied</h1><p>Sorry, you have been blocked. This website is using a security service to protect itself.</p><p>Ray ID: 8abc123</p></div></body></html>"""},
281
+ {"role": "assistant", "content": """<thought>
282
+ Cloudflare is blocking the request. The cf-ray header and Cloudflare error page confirm this. This is likely a TLS fingerprint check — standard Python `requests` has a distinctive TLS fingerprint (JA3 hash) that Cloudflare recognizes as non-browser traffic.
283
+
284
+ Escalation plan:
285
+ 1. First try: curl_cffi with Chrome TLS impersonation (often sufficient when the block relies only on the TLS fingerprint)
286
+ 2. If that fails: stealth headless browser
287
+ 3. Last resort: residential proxy + stealth browser
288
+
289
+ Let me try curl_cffi first — it's fastest and doesn't need a browser.
290
+ </thought>
291
+
292
+ ACTION: SWITCH_STRATEGY
293
+ ```json
294
+ {"new_strategy": "tls_impersonation", "reason": "Cloudflare 403 detected (cf-ray header present). Standard HTTP client blocked by TLS fingerprint check. Escalating to curl_cffi with Chrome 124 TLS impersonation."}
295
+ ```"""},
296
+ {"role": "user", "content": "Observation: Strategy switched to 'tls_impersonation'. curl_cffi client ready with Chrome 124 TLS profile."},
297
+ {"role": "assistant", "content": """<thought>
298
+ Now let me retry with the impersonated TLS fingerprint. curl_cffi will present a JA3/JA4 hash identical to a real Chrome 124 browser.
299
+ </thought>
300
+
301
+ ACTION: NAVIGATE
302
+ ```json
303
+ {"url": "https://cryptodata.example.com/prices", "method": "GET", "impersonate": "chrome124", "headers": {"Accept": "text/html,application/xhtml+xml", "Accept-Language": "en-US,en;q=0.9", "Accept-Encoding": "gzip, deflate, br"}}
304
+ ```"""},
305
+ {"role": "user", "content": """Observation: HTTP 200 OK
306
+
307
+ <html><body><div class="price-table"><table><thead><tr><th>Coin</th><th>Price</th><th>24h Change</th><th>Market Cap</th></tr></thead><tbody><tr data-coin="BTC"><td>Bitcoin</td><td>$97,245.30</td><td class="positive">+2.4%</td><td>$1.92T</td></tr><tr data-coin="ETH"><td>Ethereum</td><td>$3,421.15</td><td class="positive">+1.8%</td><td>$411.2B</td></tr><tr data-coin="SOL"><td>Solana</td><td>$198.72</td><td class="negative">-0.5%</td><td>$93.4B</td></tr></tbody></table></div></body></html>"""},
308
+ {"role": "assistant", "content": """<thought>
309
+ TLS impersonation worked — Cloudflare passed us through. I can see the price table with 3 cryptocurrencies. Let me extract the data.
310
+ </thought>
311
+
312
+ ACTION: RETURN_RESULT
313
+ ```json
314
+ {
315
+ "status": "success",
316
+ "data": [
317
+ {"coin": "Bitcoin", "symbol": "BTC", "price": "$97,245.30", "change_24h": "+2.4%", "market_cap": "$1.92T"},
318
+ {"coin": "Ethereum", "symbol": "ETH", "price": "$3,421.15", "change_24h": "+1.8%", "market_cap": "$411.2B"},
319
+ {"coin": "Solana", "symbol": "SOL", "price": "$198.72", "change_24h": "-0.5%", "market_cap": "$93.4B"}
320
+ ],
321
+ "message": "Extracted 3 cryptocurrency prices. Initial request blocked by Cloudflare TLS fingerprinting. Bypassed using curl_cffi Chrome 124 TLS impersonation.",
322
+ "anti_bot_encountered": "Cloudflare",
323
+ "bypass_method": "tls_impersonation_curl_cffi",
324
+ "steps_taken": 4
325
+ }
326
+ ```"""}
327
+ ], "source": "synthetic_v2", "task_type": "cloudflare_tls_bypass"})
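The signature rule from the system prompt (cf-ray header, `_abck` cookie, datadome cookie, px cookies) can be expressed as a small classifier. This is a sketch under assumptions: `detect_anti_bot` is a hypothetical helper, the vendor labels are illustrative, and a real detector would also consider the status code and response body, since headers such as `cf-ray` appear on successful Cloudflare-proxied responses too.

```python
def detect_anti_bot(headers, cookies):
    """Guess the anti-bot vendor from response header and cookie names."""
    header_names = {name.lower() for name in headers}
    cookie_names = {name.lower() for name in cookies}
    if "cf-ray" in header_names:
        return "cloudflare"
    if "_abck" in cookie_names:
        return "akamai"
    if "datadome" in cookie_names or "x-datadome" in header_names:
        return "datadome"
    if any(name.startswith("_px") for name in cookie_names):
        return "perimeterx"
    return None


# Header and cookie values modeled on the observations in the scenarios.
blocked = detect_anti_bot({"cf-ray": "8abc123-IAD", "server": "cloudflare"}, {})
print(blocked)
print(detect_anti_bot({"x-datadome": "protected"}, {"datadome": "abc123xyz"}))
```

The result identifies the vendor only; whether the request was actually blocked is a separate question that the caller answers from the status code and page content.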
328
+
329
+ # --- Cloudflare Level 2: JS Challenge → Stealth Browser ---
330
+ scenarios.append({"messages": [
331
+ {"role": "system", "content": SYSTEM_PROMPT},
332
+ {"role": "user", "content": "Task: Extract flight prices from a travel comparison site\nURL: https://flights.example.com/search?from=JFK&to=LAX&date=2025-02-15"},
333
+ {"role": "assistant", "content": """<thought>
334
+ Travel sites often have strong anti-bot protection. Let me try with TLS impersonation first.
335
+ </thought>
336
+
337
+ ACTION: NAVIGATE
338
+ ```json
339
+ {"url": "https://flights.example.com/search?from=JFK&to=LAX&date=2025-02-15", "method": "GET", "impersonate": "chrome124"}
340
+ ```"""},
341
+ {"role": "user", "content": """Observation: HTTP 403 Forbidden
342
+
343
+ Headers: cf-ray: 9def456-SFO, cf-chl-bypass: 1
344
+
345
+ <html><head><title>Just a moment...</title><script src="/cdn-cgi/challenge-platform/scripts/jsd/main.js"></script></head><body><div id="challenge-running"><div class="loading-spinner"></div><p>Verifying you are human. This may take a few seconds.</p><noscript>Enable JavaScript and cookies to continue.</noscript></div></body></html>"""},
346
+ {"role": "assistant", "content": """<thought>
347
+ This is a Cloudflare JavaScript challenge (the "Just a moment..." interstitial page). The cf-chl-bypass header and /cdn-cgi/challenge-platform/ script confirm it. curl_cffi alone can't solve this — I need a real browser that can execute the JavaScript challenge.
348
+
349
+ Escalating to stealth headless browser with anti-detection patches.
350
+ </thought>
351
+
352
+ ACTION: SWITCH_STRATEGY
353
+ ```json
354
+ {"new_strategy": "stealth_browser", "reason": "Cloudflare JS challenge detected (/cdn-cgi/challenge-platform/). TLS impersonation insufficient — need full JavaScript execution. Escalating to stealth Playwright with anti-detection patches."}
355
+ ```"""},
356
+ {"role": "user", "content": "Observation: Strategy switched to 'stealth_browser'. Stealth Playwright instance launched with: navigator.webdriver=undefined, Chrome plugins injected, WebGL renderer spoofed, canvas noise added."},
357
+ {"role": "assistant", "content": """<thought>
358
+ Stealth browser ready with all anti-detection patches. Let me navigate to the URL and wait for the Cloudflare challenge to solve automatically.
359
+ </thought>
360
+
361
+ ACTION: NAVIGATE_BROWSER
362
+ ```json
363
+ {"url": "https://flights.example.com/search?from=JFK&to=LAX&date=2025-02-15", "stealth_profile": "chrome124_desktop", "wait_for": ".flight-results, .search-results, [data-testid='flight-card']"}
364
+ ```"""},
365
+ {"role": "user", "content": """Observation: Cloudflare challenge solved automatically (3.2s). Page loaded.
366
+
367
+ <html><body><div class="search-results"><div class="flight-card" data-id="F001"><div class="airline">Delta Air Lines</div><div class="route">JFK → LAX</div><div class="times"><span class="depart">8:00 AM</span><span class="arrive">11:15 AM</span></div><div class="duration">5h 15m (nonstop)</div><div class="price">$287</div></div><div class="flight-card" data-id="F002"><div class="airline">United Airlines</div><div class="route">JFK → LAX</div><div class="times"><span class="depart">10:30 AM</span><span class="arrive">1:55 PM</span></div><div class="duration">5h 25m (nonstop)</div><div class="price">$312</div></div><div class="flight-card" data-id="F003"><div class="airline">American Airlines</div><div class="route">JFK → LAX</div><div class="times"><span class="depart">2:00 PM</span><span class="arrive">5:30 PM</span></div><div class="duration">5h 30m (nonstop)</div><div class="price">$265</div></div></div></body></html>"""},
368
+ {"role": "assistant", "content": """<thought>
369
+ Stealth browser solved the Cloudflare JS challenge and the flight results loaded. I can see 3 flights. Let me extract them.
370
+ </thought>
371
+
372
+ ACTION: RETURN_RESULT
373
+ ```json
374
+ {
375
+ "status": "success",
376
+ "data": [
377
+ {"airline": "Delta Air Lines", "departure": "8:00 AM", "arrival": "11:15 AM", "duration": "5h 15m (nonstop)", "price": "$287"},
378
+ {"airline": "United Airlines", "departure": "10:30 AM", "arrival": "1:55 PM", "duration": "5h 25m (nonstop)", "price": "$312"},
379
+ {"airline": "American Airlines", "departure": "2:00 PM", "arrival": "5:30 PM", "duration": "5h 30m (nonstop)", "price": "$265"}
380
+ ],
381
+ "message": "Extracted 3 flights (JFK→LAX). Cloudflare JS challenge bypassed using stealth Playwright browser with anti-detection patches. Challenge solved in 3.2s.",
382
+ "anti_bot_encountered": "Cloudflare JS Challenge",
383
+ "bypass_method": "stealth_playwright_browser",
384
+ "steps_taken": 4
385
+ }
386
+ ```"""}
387
+ ], "source": "synthetic_v2", "task_type": "cloudflare_js_challenge"})
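The two Cloudflare scenarios follow the escalation ladder from the system prompt: a plain block moves up one level, while a JavaScript challenge jumps straight to a browser. A minimal sketch of that decision is below. The strategy labels and `next_strategy` are hypothetical, and the challenge marker is the `/cdn-cgi/challenge-platform/` script path seen in the example observation.

```python
# Strategy names mirror the five levels of the ladder in the system prompt.
LADDER = [
    "direct_http",
    "tls_impersonation",
    "stealth_browser",
    "browser_residential_proxy",
    "human_assisted",
]

CHALLENGE_MARKER = "/cdn-cgi/challenge-platform/"


def next_strategy(current, status_code, body):
    """Pick the strategy for the next attempt, or None when the ladder is exhausted."""
    level = LADDER.index(current)
    challenged = CHALLENGE_MARKER in body
    if status_code == 200 and not challenged:
        return current  # not blocked, so keep the current strategy
    if challenged:
        # A JS challenge needs script execution: go at least to the stealth browser.
        level = max(level + 1, LADDER.index("stealth_browser"))
    else:
        level += 1
    return LADDER[level] if level < len(LADDER) else None


challenge_page = '<script src="/cdn-cgi/challenge-platform/scripts/jsd/main.js"></script>'
print(next_strategy("direct_http", 403, "<h1>Access denied</h1>"))
print(next_strategy("tls_impersonation", 403, challenge_page))
```

Never returning the same level after a block reflects the prompt's rule to escalate rather than retry the same approach.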
388
+
389
+ # --- DataDome detection + browser fingerprint evasion ---
390
+ scenarios.append({"messages": [
391
+ {"role": "system", "content": SYSTEM_PROMPT},
392
+ {"role": "user", "content": "Task: Extract sneaker inventory and prices from a streetwear retailer\nURL: https://kicks.example.com/new-releases"},
393
+ {"role": "assistant", "content": """<thought>
394
+ Sneaker/streetwear sites frequently use DataDome for bot protection due to high bot traffic from resellers. Let me try with TLS impersonation first and watch for DataDome signatures.
395
+ </thought>
396
+
397
+ ACTION: NAVIGATE
398
+ ```json
399
+ {"url": "https://kicks.example.com/new-releases", "method": "GET", "impersonate": "chrome124"}
400
+ ```"""},
401
+ {"role": "user", "content": """Observation: HTTP 403 Forbidden
402
+
403
+ Headers: x-datadome: protected, set-cookie: datadome=abc123xyz...
404
+
405
+ <html><body><div id="datadome-captcha"><iframe src="https://geo.captcha-delivery.com/captcha/?initialCid=abc123&cid=xyz789&t=fe&s=12345" width="100%" height="100%"></iframe></div></body></html>"""},
406
+ {"role": "assistant", "content": """<thought>
407
+ DataDome detected. I can see the x-datadome header and datadome cookie. It's serving a CAPTCHA via a captcha-delivery.com iframe. DataDome is reported to check browser fingerprint consistency, for example:
408
+ - Chrome PDF Viewer plugin must be present
409
+ - Screen frame > 20px
410
+ - CPU cores ≥ 4
411
+ - Consistent device profile (UA must match screen resolution, touch points)
412
+
413
+ I need a full stealth browser with a complete, consistent fingerprint profile. Just impersonating TLS isn't enough for DataDome.
414
+ </thought>
415
+
416
+ ACTION: SWITCH_STRATEGY
417
+ ```json
418
+ {"new_strategy": "stealth_browser_full_profile", "reason": "DataDome protection detected (x-datadome header, datadome cookie, captcha-delivery.com iframe). DataDome requires consistent browser fingerprint — need stealth browser with: PDF Viewer plugin, matching screen/UA/touch profile, canvas noise."}
419
+ ```"""},
420
+ {"role": "user", "content": "Observation: Strategy switched to 'stealth_browser_full_profile'. Full fingerprint profile loaded: Chrome 124 on Windows 10, 1920x1080, 4 CPU cores, 8GB device memory, Chrome PDF Viewer plugin, WebGL ANGLE renderer, en-US locale."},
421
+ {"role": "assistant", "content": """<thought>
422
+ Full stealth profile ready. The key for DataDome is fingerprint consistency — all attributes match a real Windows Chrome user. Let me navigate with this profile.
423
+ </thought>
424
+
425
+ ACTION: NAVIGATE_BROWSER
426
+ ```json
427
+ {"url": "https://kicks.example.com/new-releases", "stealth_profile": "chrome124_win10_full", "wait_for": ".product-grid .product-card"}
428
+ ```"""},
429
+ {"role": "user", "content": """Observation: DataDome challenge passed. Page loaded (4.1s).
430
+
431
+ <div class="product-grid"><div class="product-card"><img src="/img/aj1-retro.jpg"/><h3>Air Jordan 1 Retro High OG</h3><span class="price">$180</span><span class="size-badge">Sizes: 8-13</span><span class="stock in-stock">In Stock</span></div><div class="product-card"><img src="/img/yeezy-350.jpg"/><h3>Yeezy Boost 350 V2</h3><span class="price">$230</span><span class="size-badge">Sizes: 7-12</span><span class="stock low-stock">Low Stock</span></div><div class="product-card"><img src="/img/nb-550.jpg"/><h3>New Balance 550</h3><span class="price">$110</span><span class="size-badge">Sizes: 6-14</span><span class="stock in-stock">In Stock</span></div></div>"""},
432
+ {"role": "assistant", "content": """<thought>
433
+ DataDome bypassed with the full consistent fingerprint profile. Got the product grid with 3 sneakers. Let me extract the data.
434
+ </thought>
435
+
436
+ ACTION: RETURN_RESULT
437
+ ```json
438
+ {
439
+ "status": "success",
440
+ "data": [
441
+ {"name": "Air Jordan 1 Retro High OG", "price": "$180", "sizes": "8-13", "availability": "In Stock"},
442
+ {"name": "Yeezy Boost 350 V2", "price": "$230", "sizes": "7-12", "availability": "Low Stock"},
443
+ {"name": "New Balance 550", "price": "$110", "sizes": "6-14", "availability": "In Stock"}
444
+ ],
445
+ "message": "Extracted 3 sneaker listings. DataDome bot protection bypassed using stealth browser with complete, consistent fingerprint profile (Chrome 124, Windows 10, 1920x1080, PDF Viewer plugin, matching CPU/memory/screen attributes).",
446
+ "anti_bot_encountered": "DataDome",
447
+ "bypass_method": "stealth_browser_full_fingerprint",
448
+ "steps_taken": 4
449
+ }
450
+ ```"""}
451
+ ], "source": "synthetic_v2", "task_type": "datadome_bypass"})
452
+
453
+ # --- Akamai Bot Manager + behavioral detection ---
454
+ scenarios.append({"messages": [
455
+ {"role": "system", "content": SYSTEM_PROMPT},
456
+ {"role": "user", "content": "Task: Extract product prices from a major retailer\nURL: https://megastore.example.com/deals/today"},
457
+ {"role": "assistant", "content": """<thought>
458
+ Major retailers typically use Akamai Bot Manager. Let me try with TLS impersonation and check for Akamai signatures.
459
+ </thought>
460
+
461
+ ACTION: NAVIGATE
462
+ ```json
463
+ {"url": "https://megastore.example.com/deals/today", "method": "GET", "impersonate": "chrome124"}
464
+ ```"""},
465
+ {"role": "user", "content": """Observation: HTTP 200 OK (but response is a challenge page)
466
+
467
+ Headers: set-cookie: _abck=ABC123...; bm_sz=DEF456...
468
+
469
+ <html><body><script type="text/javascript" src="https://megastore.example.com/akam/13/pixel_abc123"></script><noscript><img src="https://megastore.example.com/akam/13/pixel_abc123?noscript=1"/></noscript></body></html>"""},
470
+ {"role": "assistant", "content": """<thought>
471
+ Akamai Bot Manager detected! Key signatures:
472
+ - _abck cookie (Akamai Bot Cookie)
473
+ - bm_sz cookie (Bot Manager Size)
474
+ - /akam/13/ pixel script (Akamai sensor data collector)
475
+
476
+ Akamai requires:
477
+ 1. Valid TLS fingerprint (✓ already have via curl_cffi)
478
+ 2. Sensor data submission (JavaScript generates device/behavioral data)
479
+ 3. Valid _abck cookie chain
480
+
481
+ TLS impersonation alone won't work — Akamai needs JS execution to generate sensor data. I need a full stealth browser with human-like behavioral simulation (mouse movements, timing randomization).
482
+ </thought>
483
+
484
+ ACTION: SWITCH_STRATEGY
485
+ ```json
486
+ {"new_strategy": "stealth_browser_behavioral", "reason": "Akamai Bot Manager detected (_abck cookie, bm_sz cookie, /akam/ sensor script). Requires JS execution for sensor data generation + human-like behavioral patterns (mouse movement, scroll timing). Escalating to stealth browser with behavioral simulation."}
487
+ ```"""},
488
+ {"role": "user", "content": "Observation: Strategy switched to 'stealth_browser_behavioral'. Stealth browser launched with: behavioral simulation enabled (random mouse movements, human-like scroll patterns, randomized timing between 800-2500ms), residential proxy active."},
489
+ {"role": "assistant", "content": """<thought>
490
+ Stealth browser with behavioral simulation ready, plus residential proxy to avoid IP-based blocking. Let me navigate to the deals page and wait for Akamai's sensor script to complete.
491
+ </thought>
492
+
493
+ ACTION: NAVIGATE_BROWSER
494
+ ```json
495
+ {"url": "https://megastore.example.com/deals/today", "stealth_profile": "chrome124_behavioral", "wait_for": ".deal-card, .product-tile"}
496
+ ```"""},
497
+ {"role": "user", "content": """Observation: Akamai sensor validation passed (2.8s). Page loaded.
498
+
499
+ <div class="deals-grid"><div class="deal-card"><h3>Samsung 65" 4K OLED TV</h3><span class="original-price">$1,799</span><span class="deal-price">$1,199</span><span class="discount">33% off</span></div><div class="deal-card"><h3>Apple iPad Air M2</h3><span class="original-price">$599</span><span class="deal-price">$499</span><span class="discount">17% off</span></div><div class="deal-card"><h3>Dyson V15 Vacuum</h3><span class="original-price">$749</span><span class="deal-price">$549</span><span class="discount">27% off</span></div></div>"""},
500
+ {"role": "assistant", "content": """<thought>
501
+ Akamai bypassed successfully with the behavioral simulation approach. Got 3 deal items. Let me extract.
502
+ </thought>
503
+
504
+ ACTION: RETURN_RESULT
505
+ ```json
506
+ {
507
+ "status": "success",
508
+ "data": [
509
+ {"name": "Samsung 65\\" 4K OLED TV", "original_price": "$1,799", "deal_price": "$1,199", "discount": "33% off"},
510
+ {"name": "Apple iPad Air M2", "original_price": "$599", "deal_price": "$499", "discount": "17% off"},
511
+ {"name": "Dyson V15 Vacuum", "original_price": "$749", "deal_price": "$549", "discount": "27% off"}
512
+ ],
513
+ "message": "Extracted 3 deals. Akamai Bot Manager bypassed using stealth browser with behavioral simulation (mouse movements, timing randomization) + residential proxy.",
514
+ "anti_bot_encountered": "Akamai Bot Manager",
515
+ "bypass_method": "stealth_browser_behavioral_simulation",
516
+ "steps_taken": 4
517
+ }
518
+ ```"""}
519
+ ], "source": "synthetic_v2", "task_type": "akamai_bypass"})
520
+
521
+ # --- Shadow DOM extraction ---
522
+ scenarios.append({"messages": [
523
+ {"role": "system", "content": SYSTEM_PROMPT},
524
+ {"role": "user", "content": "Task: Extract review ratings from a site using web components with shadow DOM\nURL: https://reviews.example.com/product/X100"},
525
+ {"role": "assistant", "content": """<thought>
526
+ Web components with shadow DOM encapsulate their internal structure, so document-level CSS selectors can't reach inside. I'll need a browser, then either a shadow-piercing selector (if the automation tool supports one) or JavaScript that walks each element's shadowRoot, which only works for open shadow roots.
527
+ </thought>
528
+
529
+ ACTION: NAVIGATE_BROWSER
530
+ ```json
531
+ {"url": "https://reviews.example.com/product/X100", "stealth_profile": "chrome124_desktop", "wait_for": "review-card"}
532
+ ```"""},
533
+ {"role": "user", "content": """Observation: Page loaded. Contains custom web components <review-card> with shadow DOM. Standard querySelector returns empty for internal elements.
534
+
535
+ <html><body><div id="reviews"><review-card data-id="R1"></review-card><review-card data-id="R2"></review-card><review-card data-id="R3"></review-card></div></body></html>
536
+
537
+ Note: Internal content of <review-card> elements is inside shadow roots and not visible in outer DOM."""},
538
+ {"role": "assistant", "content": """<thought>
539
+ The review data is inside shadow DOM web components. I need to use JavaScript to access the shadowRoot of each <review-card> element and extract the content from within. Let me write a script that traverses all shadow roots.
540
+ </thought>
541
+
542
+ ACTION: EXECUTE_JS
543
+ ```json
544
+ {"script": "return Array.from(document.querySelectorAll('review-card')).map(card => { const shadow = card.shadowRoot; if (!shadow) return null; return { id: card.dataset.id, author: shadow.querySelector('.author')?.textContent, rating: parseFloat(shadow.querySelector('.stars')?.dataset?.rating || '0'), text: shadow.querySelector('.review-text')?.textContent, date: shadow.querySelector('.review-date')?.textContent }; }).filter(Boolean)"}
545
+ ```"""},
546
+ {"role": "user", "content": """Observation: JS result: [{"id":"R1","author":"Maria K.","rating":5,"text":"Absolutely incredible product. Best purchase I've made this year.","date":"Jan 3, 2025"},{"id":"R2","author":"James T.","rating":4,"text":"Solid build quality. The battery life could be better though.","date":"Jan 1, 2025"},{"id":"R3","author":"Sophie L.","rating":5,"text":"Exceeded my expectations in every way. Highly recommend!","date":"Dec 29, 2024"}]"""},
547
+ {"role": "assistant", "content": """<thought>
548
+ JavaScript traversal of shadow DOM worked perfectly. I extracted 3 reviews from within the web component shadow roots. All data matches the requested fields.
549
+ </thought>
550
+
551
+ ACTION: RETURN_RESULT
552
+ ```json
553
+ {
554
+ "status": "success",
555
+ "data": [
556
+ {"author": "Maria K.", "rating": 5, "text": "Absolutely incredible product. Best purchase I've made this year.", "date": "Jan 3, 2025"},
557
+ {"author": "James T.", "rating": 4, "text": "Solid build quality. The battery life could be better though.", "date": "Jan 1, 2025"},
558
+ {"author": "Sophie L.", "rating": 5, "text": "Exceeded my expectations in every way. Highly recommend!", "date": "Dec 29, 2024"}
559
+ ],
560
+ "message": "Extracted 3 reviews from shadow DOM web components using JavaScript shadowRoot traversal. Standard CSS selectors cannot reach inside shadow boundaries.",
561
+ "extraction_method": "shadow_dom_js_traversal",
562
+ "steps_taken": 3
563
+ }
564
+ ```"""}
565
+ ], "source": "synthetic_v2", "task_type": "shadow_dom"})
566
+
567
+ # --- Multi-strategy escalation: HTTP → TLS → Browser → CAPTCHA degradation ---
568
+ scenarios.append({"messages": [
569
+ {"role": "system", "content": SYSTEM_PROMPT},
570
+ {"role": "user", "content": "Task: Extract ticket prices and availability for a concert\nURL: https://tickets.example.com/event/rock-fest-2025"},
571
+ {"role": "assistant", "content": """<thought>
572
+ Ticket sites often have aggressive anti-bot measures. Let me start with the lightest approach and escalate.
573
+ </thought>
574
+
575
+ ACTION: NAVIGATE
576
+ ```json
577
+ {"url": "https://tickets.example.com/event/rock-fest-2025", "method": "GET", "impersonate": "chrome124"}
578
+ ```"""},
579
+ {"role": "user", "content": "Observation: HTTP 403 Forbidden\nHeaders: cf-ray: xyz789\n\n<html><head><title>Just a moment...</title></head><body><div class=\"cf-challenge\">Checking your browser...</div></body></html>"},
580
+ {"role": "assistant", "content": """<thought>
581
+ Cloudflare JS challenge. Need a browser. Let me escalate.
582
+ </thought>
583
+
584
+ ACTION: NAVIGATE_BROWSER
585
+ ```json
586
+ {"url": "https://tickets.example.com/event/rock-fest-2025", "stealth_profile": "chrome124_desktop", "wait_for": ".ticket-section, .event-details"}
587
+ ```"""},
588
+ {"role": "user", "content": """Observation: Cloudflare challenge solved. But page shows PerimeterX CAPTCHA:
589
+
590
+ <html><body><div id="px-captcha"><div class="px-challenge"><h2>Press & Hold</h2><p>Press and hold the button to verify you're human</p><button id="px-captcha-btn" data-sitekey="PX_abc123">Hold to verify</button></div></div></body></html>"""},
591
+ {"role": "assistant", "content": """<thought>
592
+ After Cloudflare passed, there's a second layer — PerimeterX (HUMAN) "Press & Hold" CAPTCHA. This requires actual human-like interaction (press and hold for a duration). I can try to simulate this with the browser, but PerimeterX measures exact timing and behavioral patterns. Let me attempt a simulated hold first — if that fails, I'll need a CAPTCHA solving service or return a partial result.
593
+ </thought>
594
+
595
+ ACTION: EXECUTE_JS
596
+ ```json
597
+ {"script": "const btn = document.querySelector('#px-captcha-btn'); if (btn) { const rect = btn.getBoundingClientRect(); const mousedown = new PointerEvent('pointerdown', {clientX: rect.x + rect.width/2, clientY: rect.y + rect.height/2, bubbles: true}); btn.dispatchEvent(mousedown); await new Promise(r => setTimeout(r, 3200)); const mouseup = new PointerEvent('pointerup', {clientX: rect.x + rect.width/2, clientY: rect.y + rect.height/2, bubbles: true}); btn.dispatchEvent(mouseup); return 'hold_simulated'; }"}
598
+ ```"""},
599
+ {"role": "user", "content": "Observation: JS executed. PerimeterX challenge still showing — simulated hold was detected as non-human."},
600
+ {"role": "assistant", "content": """<thought>
601
+ The simulated hold was detected by PerimeterX. Their behavioral analysis caught the automation. I have two options:
602
+ 1. Send to a CAPTCHA solving service (costs money, takes 10-30s)
603
+ 2. Return a failure with clear explanation
604
+
605
+ Since this is a hard CAPTCHA that can't be solved programmatically, I should return a clear failure with actionable suggestions.
606
+ </thought>
607
+
608
+ ACTION: RETURN_RESULT
609
+ ```json
610
+ {
611
+ "status": "failed",
612
+ "data": null,
613
+ "message": "Unable to extract ticket data. Site has two layers of protection: (1) Cloudflare JS challenge — bypassed successfully, (2) PerimeterX 'Press & Hold' CAPTCHA — simulated interaction was detected. To access this site, you would need: (a) a CAPTCHA solving service like 2captcha or capsolver ($2-3/1000 solves), (b) a pre-authenticated browser session with valid _px cookies, or (c) manual access to export cookies for replay.",
614
+ "anti_bot_encountered": ["Cloudflare JS Challenge", "PerimeterX Press & Hold CAPTCHA"],
615
+ "bypass_attempted": ["tls_impersonation", "stealth_browser", "js_simulation"],
616
+ "steps_taken": 5
617
+ }
618
+ ```"""}
619
+ ], "source": "synthetic_v2", "task_type": "multi_layer_antibot"})
620
+
621
+ # --- Angular SPA with wait for zone.js ---
622
+ scenarios.append({"messages": [
623
+ {"role": "system", "content": SYSTEM_PROMPT},
624
+ {"role": "user", "content": "Task: Extract employee directory from a corporate Angular app\nURL: https://intranet.corp.example.com/directory\nAuthentication: Browser profile 'work-chrome' available"},
625
+ {"role": "assistant", "content": """<thought>
626
+ This is a corporate intranet Angular app requiring authentication. Angular apps use zone.js for change detection — I need to wait for Angular to finish rendering. Let me load the browser profile first for auth, then navigate.
627
+ </thought>
628
+
629
+ ACTION: LOAD_BROWSER_PROFILE
630
+ ```json
631
+ {"profile_name": "work-chrome"}
632
+ ```"""},
633
+ {"role": "user", "content": "Observation: Browser profile 'work-chrome' loaded. Session cookies and local storage restored."},
634
+ {"role": "assistant", "content": """<thought>
635
+ Profile loaded with auth state. Now navigate to the directory page and wait for the employee components to render; a bare [ng-version] match would only show that Angular has bootstrapped, not that the list is populated.
+ </thought>
+
+ ACTION: NAVIGATE_BROWSER
+ ```json
+ {"url": "https://intranet.corp.example.com/directory", "wait_for": "app-employee-card, .employee-card"}
641
+ ```"""},
642
+ {"role": "user", "content": """Observation: Angular app loaded. Auth succeeded. Page shows employee directory.
643
+
644
+ <app-root ng-version="17.3.0"><app-directory><div class="directory"><app-employee-card><div class="card"><h3>Sarah Chen</h3><p class="title">VP of Engineering</p><p class="dept">Engineering</p><p class="email">schen@corp.example.com</p></div></app-employee-card><app-employee-card><div class="card"><h3>Marcus Williams</h3><p class="title">Director of Product</p><p class="dept">Product</p><p class="email">mwilliams@corp.example.com</p></div></app-employee-card><app-employee-card><div class="card"><h3>Priya Patel</h3><p class="title">Staff ML Engineer</p><p class="dept">AI/ML</p><p class="email">ppatel@corp.example.com</p></div></app-employee-card></div></app-directory></app-root>"""},
645
+ {"role": "assistant", "content": """<thought>
646
+ Angular app rendered successfully with authentication from the browser profile. The ng-version attribute confirms Angular 17.3.0. I can see 3 employees in the directory. Let me extract.
647
+ </thought>
648
+
649
+ ACTION: RETURN_RESULT
650
+ ```json
651
+ {
652
+ "status": "success",
653
+ "data": [
654
+ {"name": "Sarah Chen", "title": "VP of Engineering", "department": "Engineering", "email": "schen@corp.example.com"},
655
+ {"name": "Marcus Williams", "title": "Director of Product", "department": "Product", "email": "mwilliams@corp.example.com"},
656
+ {"name": "Priya Patel", "title": "Staff ML Engineer", "department": "AI/ML", "email": "ppatel@corp.example.com"}
657
+ ],
658
+ "message": "Extracted 3 employees from Angular 17 corporate directory. Authenticated via saved browser profile. Waited for Angular zone.js rendering to complete.",
659
+ "extraction_method": "angular_browser_profile_auth",
660
+ "auth_method": "browser_profile",
661
+ "steps_taken": 3
662
+ }
663
+ ```"""}
664
+ ], "source": "synthetic_v2", "task_type": "angular_auth"})
665
+
666
+ # --- PerimeterX cookie replay ---
667
+ scenarios.append({"messages": [
668
+ {"role": "system", "content": SYSTEM_PROMPT},
669
+ {"role": "user", "content": "Task: Monitor stock availability for a popular item\nURL: https://store.example.com/product/limited-edition-console\nAuthentication: Valid _px cookies from manual session available in vault"},
670
+ {"role": "assistant", "content": """<thought>
671
+ The user has valid PerimeterX cookies from a manual browser session. I can replay these cookies to bypass the PerimeterX check without needing to solve any challenge. Let me inject the cookies and navigate.
672
+ </thought>
673
+
674
+ ACTION: SET_COOKIES
675
+ ```json
676
+ {"cookies": "vault:px_cookies", "domain": "store.example.com"}
677
+ ```"""},
678
+ {"role": "user", "content": "Observation: Cookies loaded from vault. 5 cookies set for domain store.example.com including _px3, _pxvid."},
679
+ {"role": "assistant", "content": """<thought>
680
+ PerimeterX cookies injected. Let me navigate with TLS impersonation to match the browser fingerprint.
681
+ </thought>
682
+
683
+ ACTION: NAVIGATE
684
+ ```json
685
+ {"url": "https://store.example.com/product/limited-edition-console", "method": "GET", "impersonate": "chrome124"}
686
+ ```"""},
687
+ {"role": "user", "content": """Observation: HTTP 200 OK
688
+
689
+ <div class="product-page"><h1>Limited Edition Console</h1><span class="price">$499.99</span><div class="stock-status out-of-stock"><span>Currently Unavailable</span><button class="notify-btn">Notify When Available</button></div><div class="product-specs"><p>Storage: 1TB SSD</p><p>Color: Midnight Black</p><p>SKU: LEC-2025-001</p></div></div>"""},
690
+ {"role": "assistant", "content": """<thought>
691
+ PerimeterX cookie replay worked. The product page loaded showing the console is currently out of stock. Let me return the accurate status.
692
+ </thought>
693
+
694
+ ACTION: RETURN_RESULT
695
+ ```json
696
+ {
697
+ "status": "success",
698
+ "data": {
699
+ "name": "Limited Edition Console",
700
+ "price": "$499.99",
701
+ "availability": "Currently Unavailable",
702
+ "storage": "1TB SSD",
703
+ "color": "Midnight Black",
704
+ "sku": "LEC-2025-001"
705
+ },
706
+ "message": "Product data extracted. Item is currently out of stock. PerimeterX bypassed using cookie replay from valid manual session.",
707
+ "anti_bot_encountered": "PerimeterX",
708
+ "bypass_method": "cookie_replay",
709
+ "steps_taken": 3
710
+ }
711
+ ```"""}
712
+ ], "source": "synthetic_v2", "task_type": "perimeterx_cookie_replay"})
713
+
714
+ # --- SPA with client-side routing, needs wait for hydration ---
715
+ scenarios.append({"messages": [
716
+ {"role": "system", "content": SYSTEM_PROMPT},
717
+ {"role": "user", "content": "Task: Extract course catalog from an online learning platform\nURL: https://learn.example.com/courses?topic=machine-learning"},
718
+ {"role": "assistant", "content": """<thought>
719
+ Online learning platforms are usually React/Next.js SPAs. Let me try fetching the HTML first to check for SSR data.
720
+ </thought>
721
+
722
+ ACTION: NAVIGATE
723
+ ```json
724
+ {"url": "https://learn.example.com/courses?topic=machine-learning", "method": "GET", "impersonate": "chrome124"}
725
+ ```"""},
726
+ {"role": "user", "content": """Observation: HTTP 200 OK
727
+
728
+ <html><head><meta name="next-head-count" content="12"/></head><body><div id="__next"><div data-reactroot=""><main class="course-catalog" aria-busy="true"><div class="skeleton-loader">Loading courses...</div></main></div></div><script id="__NEXT_DATA__" type="application/json">{"props":{"pageProps":{"courses":[],"isLoading":true},"__N_SSG":true},"page":"/courses","query":{"topic":"machine-learning"}}</script><script src="/_next/static/chunks/pages/courses-abc123.js"></script></body></html>"""},
729
+ {"role": "assistant", "content": """<thought>
730
+ Interesting: this is a Next.js page built with SSG (Static Site Generation), but __NEXT_DATA__ contains `courses: []` and `isLoading: true`, so the course data is not embedded in the prerendered page props; it is fetched client-side after hydration. The page shows a skeleton loader with aria-busy="true". I need a browser that waits for React hydration and the client-side API calls to finish.
731
+ </thought>
732
+
733
+ ACTION: NAVIGATE_BROWSER
734
+ ```json
735
+ {"url": "https://learn.example.com/courses?topic=machine-learning", "stealth_profile": "chrome124_desktop", "wait_for": ".course-card:not(.skeleton)"}
736
+ ```"""},
737
+ {"role": "user", "content": """Observation: Browser loaded. React hydrated. Courses loaded after 2.1s.
738
+
739
+ <main class="course-catalog" aria-busy="false"><div class="course-card"><h3>Deep Learning Foundations</h3><span class="instructor">Dr. Andrew Ng</span><span class="rating">4.9 (12,340 reviews)</span><span class="price">$49.99</span><span class="level">Intermediate</span><span class="duration">40 hours</span></div><div class="course-card"><h3>Practical NLP with Transformers</h3><span class="instructor">Hugging Face Team</span><span class="rating">4.8 (8,921 reviews)</span><span class="price">Free</span><span class="level">Advanced</span><span class="duration">25 hours</span></div><div class="course-card"><h3>ML Engineering in Production</h3><span class="instructor">Chip Huyen</span><span class="rating">4.7 (5,443 reviews)</span><span class="price">$39.99</span><span class="level">Advanced</span><span class="duration">30 hours</span></div></main>"""},
740
+ {"role": "assistant", "content": """<thought>
741
+ React hydration completed and courses loaded. The aria-busy changed from "true" to "false", confirming the data is ready. I see 3 courses. Let me extract them.
742
+ </thought>
743
+
744
+ ACTION: RETURN_RESULT
745
+ ```json
746
+ {
747
+ "status": "success",
748
+ "data": [
749
+ {"title": "Deep Learning Foundations", "instructor": "Dr. Andrew Ng", "rating": "4.9 (12,340 reviews)", "price": "$49.99", "level": "Intermediate", "duration": "40 hours"},
750
+ {"title": "Practical NLP with Transformers", "instructor": "Hugging Face Team", "rating": "4.8 (8,921 reviews)", "price": "Free", "level": "Advanced", "duration": "25 hours"},
751
+ {"title": "ML Engineering in Production", "instructor": "Chip Huyen", "rating": "4.7 (5,443 reviews)", "price": "$39.99", "level": "Advanced", "duration": "30 hours"}
752
+ ],
753
+ "message": "Extracted 3 ML courses. Next.js SSG page with client-side data loading (no SSR data in __NEXT_DATA__). Used browser to wait for React hydration and data fetch (2.1s).",
754
+ "extraction_method": "browser_react_hydration",
755
+ "steps_taken": 3
756
+ }
757
+ ```"""}
758
+ ], "source": "synthetic_v2", "task_type": "react_hydration_wait"})
759
+
760
+ return scenarios
761
+
762
+
763
+ def main():
+     print("Generating expanded v2 training data...")
+
+     v2_scenarios = generate_all_scenarios()
+     v2_ds = Dataset.from_list(v2_scenarios)
+     print(f"Generated {len(v2_ds)} new v2 scenarios")
+
+     # Load existing dataset
+     print("Loading existing dataset...")
+     existing = load_dataset("sukritvemula/webscrape-agent-training-data")
+     existing_train = existing["train"]
+     existing_test = existing["test"]
+     print(f"Existing: {len(existing_train)} train, {len(existing_test)} test")
+
+     # Backfill source/task_type on the existing splits when missing, so that
+     # train and test share the v2 schema before concatenating and pushing
+     for col in ["source", "task_type"]:
+         if col not in existing_train.column_names:
+             existing_train = existing_train.add_column(col, ["legacy"] * len(existing_train))
+         if col not in existing_test.column_names:
+             existing_test = existing_test.add_column(col, ["legacy"] * len(existing_test))
+
+     # Combine
+     combined_train = concatenate_datasets([existing_train, v2_ds])
+     combined_train = combined_train.shuffle(seed=42)
+
+     print(f"\nCombined train: {len(combined_train)}")
+
+     # Stats: count examples per source
+     sources = {}
+     for item in combined_train:
+         src = item.get("source", "unknown")
+         sources[src] = sources.get(src, 0) + 1
+     print("Composition:")
+     for src, count in sorted(sources.items(), key=lambda x: -x[1]):
+         print(f"  {src}: {count} ({count/len(combined_train)*100:.1f}%)")
+
+     # Push
+     ds_dict = DatasetDict({"train": combined_train, "test": existing_test})
+     print("\nPushing to Hub...")
+     ds_dict.push_to_hub("sukritvemula/webscrape-agent-training-data", private=False)
+     print("✅ Done!")
+
+
+ if __name__ == "__main__":
+     main()
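A possible pre-push sanity check, sketched here as a suggestion and not as part of the uploaded script: since each assistant turn embeds a JSON payload inside a Python string, a stray escape can silently produce invalid JSON in the training data. The helper below is hypothetical (`validate_scenario`, `JSON_BLOCK`, and `ACTION_LINE` are names introduced for illustration). It assumes that every assistant payload sits in a `json` fenced block and that `steps_taken` is meant to equal the number of `ACTION:` lines in the conversation, which is the convention most scenarios above appear to follow.

```python
import json
import re

# The fence marker is built at runtime so this snippet contains no literal fences.
FENCE = "`" * 3
JSON_BLOCK = re.compile(FENCE + r"json\s*(.*?)\s*" + FENCE, re.DOTALL)
ACTION_LINE = re.compile(r"^[ \t]*ACTION:[ \t]*(\w+)", re.MULTILINE)


def validate_scenario(scenario):
    """Return a list of problems found in one scenario dict (empty if none)."""
    problems = []
    action_count = 0
    reported_steps = None
    for index, message in enumerate(scenario.get("messages", [])):
        if message.get("role") != "assistant":
            continue
        content = message.get("content", "")
        # Count every ACTION line emitted by the assistant, including RETURN_RESULT.
        action_count += len(ACTION_LINE.findall(content))
        for body in JSON_BLOCK.findall(content):
            try:
                payload = json.loads(body)
            except json.JSONDecodeError as exc:
                problems.append(f"message {index}: invalid JSON ({exc.msg})")
                continue
            if isinstance(payload, dict) and "steps_taken" in payload:
                reported_steps = payload["steps_taken"]
    # Assumed convention: steps_taken equals the number of ACTION lines.
    if reported_steps is not None and reported_steps != action_count:
        problems.append(
            f"steps_taken is {reported_steps} but {action_count} ACTION lines were found"
        )
    return problems
```

One way to use it would be to loop over the list returned by `generate_all_scenarios()` before calling `Dataset.from_list` and print or raise on any non-empty result, so malformed examples are caught before they reach the Hub.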