---
title: Agentbee | GAIA Project | HuggingFace Course
emoji: 🕵🏻‍♂️
colorFrom: indigo
colorTo: indigo
sdk: gradio
sdk_version: 6.3.0
app_file: app.py
pinned: false
hf_oauth: true
hf_oauth_expiration_minutes: 480
---

Check out the configuration reference at <https://huggingface.co/docs/hub/spaces-config-reference>
## Project Overview

**Project Name:** Final_Assignment_Template

**Purpose:** Course assignment template for building an AI agent that passes the GAIA (General AI Assistants) benchmark. This project serves as a learning-focused workspace to support iterative agent development and experimentation.

**Target Users:** Students learning agent development through hands-on implementation

**Key Objectives:**

- Build production-ready code that passes GAIA test requirements
- Learn agent development through discovery-based implementation
- Develop a systematic approach to solving complex AI tasks
- Document the learning process and key decisions

## Project Architecture

**Technology Stack:**

- **Platform:** Hugging Face Spaces with OAuth integration
- **UI Framework:** Gradio with OAuth support
- **Agent Framework:** LangGraph (state graph orchestration)
- **LLM Providers (4-tier fallback):**
  - Google Gemini 2.0 Flash (free tier)
  - HuggingFace Inference API (free tier)
  - Groq (Llama 3.1 70B / Qwen 3 32B, free tier)
  - Anthropic Claude Sonnet 4.5 (paid tier)
- **Tools:**
  - Web Search: Tavily API / Exa API
  - File Parser: PyPDF2, openpyxl, python-docx, Pillow
  - Calculator: Safe expression evaluator
  - Vision: Multimodal LLM (Gemini/Claude)
- **Language:** Python 3.12+
- **Package Manager:** uv

**Project Structure:**

```
Final_Assignment_Template/
├── archive/              # Reference materials, previous solutions, static resources
├── input/                # Input files, configuration, raw data
├── output/               # Generated files, results, processed data
├── test/                 # Testing files, test scripts (99 tests)
├── dev/                  # Development records (permanent knowledge packages)
├── src/                  # Source code
│   ├── agent/            # Agent orchestration
│   │   ├── graph.py      # LangGraph state machine
│   │   └── llm_client.py # Multi-provider LLM integration with retry logic
│   └── tools/            # Agent tools
│       ├── __init__.py   # Tool registry
│       ├── web_search.py # Tavily/Exa web search
│       ├── file_parser.py # Multi-format file reader
│       ├── calculator.py # Safe math evaluator
│       └── vision.py     # Multimodal image/video analysis
├── app.py                # Gradio UI with OAuth, LLM provider selection
├── pyproject.toml        # uv package management
├── requirements.txt      # Python dependencies (generated from pyproject.toml)
├── .env                  # Local environment variables (API keys, config)
├── README.md             # Project overview, architecture, workflow, specification
├── CLAUDE.md             # Project-specific AI instructions
├── PLAN.md               # Active implementation plan (temporary workspace)
├── TODO.md               # Active task tracking (temporary workspace)
└── CHANGELOG.md          # Session changelog (temporary workspace)
```

**Core Components:**

- **GAIAAgent class** (src/agent/graph.py): LangGraph-based agent with state machine orchestration
  - Planning node: Analyze question and generate execution plan
  - Tool selection node: LLM function calling for dynamic tool selection
  - Tool execution node: Execute selected tools with timeout and error handling
  - Answer synthesis node: Generate factoid answer from evidence
- **LLM Client** (src/agent/llm_client.py): Multi-provider LLM integration
  - 4-tier fallback chain: Gemini → HuggingFace → Groq → Claude
  - Exponential backoff retry logic (3 attempts per provider)
  - Runtime config for UI-based provider selection
  - Few-shot prompting for improved tool selection
- **Tool System** (src/tools/):
  - Web Search: Tavily/Exa API with query optimization
  - File Parser: Multi-format support (PDF, Excel, Word, CSV, images)
  - Calculator: Safe expression evaluator with graceful error handling
  - Vision: Multimodal analysis for images/videos
- **Gradio UI** (app.py):
  - Test & Debug tab: Single-question testing with LLM provider dropdown
  - Full Evaluation tab: Run all GAIA questions with provider selection
  - Results export: JSON file download for analysis
  - OAuth integration for submission
- **Evaluation Infrastructure**: Pre-built orchestration (question fetching, submission, scoring)
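
To make the fallback behavior concrete, here is a minimal sketch of the chain. The provider order comes from this README, but the `call_provider` callable and the error handling are simplified assumptions, not the actual `llm_client.py` interface:

```python
# Hypothetical sketch of the 4-tier fallback chain (not the project's real API).
PROVIDERS = ["gemini", "huggingface", "groq", "claude"]

def call_with_fallback(prompt, providers, call_provider):
    """Try each provider in order; raise only if every tier fails."""
    errors = {}
    for name in providers:
        try:
            return call_provider(name, prompt)
        except Exception as exc:  # e.g. quota or rate-limit errors
            errors[name] = str(exc)
    raise RuntimeError(f"All providers failed: {errors}")
```

With this shape, a quota failure on the free tiers transparently falls through to the paid Claude tier, which is the resilience property the 4-tier design is after.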

**System Architecture Diagram:**

```mermaid
---
config:
  layout: elk
---
graph TB
    subgraph "UI Layer"
        GradioUI[Gradio UI<br/>LLM Provider Selection<br/>Test & Full Evaluation]
        OAuth[HF OAuth<br/>User authentication]
    end

    subgraph "Agent Orchestration (LangGraph)"
        GAIAAgent[GAIAAgent<br/>State Machine]
        PlanNode[Planning Node<br/>Analyze question]
        ToolSelectNode[Tool Selection Node<br/>LLM function calling]
        ToolExecNode[Tool Execution Node<br/>Run selected tools]
        SynthesizeNode[Answer Synthesis Node<br/>Generate factoid]
    end

    subgraph "LLM Layer (4-Tier Fallback)"
        LLMClient[LLM Client<br/>Retry + Fallback]
        Gemini[Gemini 2.0 Flash<br/>Free Tier 1]
        HF[HuggingFace API<br/>Free Tier 2]
        Groq[Groq Llama/Qwen<br/>Free Tier 3]
        Claude[Claude Sonnet 4.5<br/>Paid Tier 4]
    end

    subgraph "Tool Layer"
        WebSearch[Web Search<br/>Tavily/Exa]
        FileParser[File Parser<br/>PDF/Excel/Word]
        Calculator[Calculator<br/>Safe eval]
        Vision[Vision<br/>Multimodal LLM]
    end

    subgraph "External Services"
        API[GAIA Scoring API]
        QEndpoint["/questions endpoint"]
        SEndpoint["/submit endpoint"]
    end

    GradioUI --> OAuth
    OAuth -->|Authenticated| GAIAAgent
    GAIAAgent --> PlanNode
    PlanNode --> ToolSelectNode
    ToolSelectNode --> ToolExecNode
    ToolExecNode --> SynthesizeNode

    PlanNode --> LLMClient
    ToolSelectNode --> LLMClient
    SynthesizeNode --> LLMClient

    LLMClient -->|Try 1| Gemini
    LLMClient -->|Fallback 2| HF
    LLMClient -->|Fallback 3| Groq
    LLMClient -->|Fallback 4| Claude

    ToolExecNode --> WebSearch
    ToolExecNode --> FileParser
    ToolExecNode --> Calculator
    ToolExecNode --> Vision

    GAIAAgent -->|Answers| API
    API --> QEndpoint
    API --> SEndpoint
    SEndpoint -->|Score| GradioUI

    style GAIAAgent fill:#ffcccc
    style LLMClient fill:#fff4cc
    style GradioUI fill:#cce5ff
    style API fill:#d9f2d9
```

## Project Specification

**Project Context:**

This is a course assignment template for building an AI agent that passes the GAIA (General AI Assistants) benchmark. The project was started as a learning-focused workspace to support iterative agent development and experimentation.

**Current State:**

- **Status:** Stage 5 Complete - Performance Optimization
- **Development Progress:**
  - Stage 1-2: Basic infrastructure and LangGraph setup ✅
  - Stage 3: Multi-provider LLM integration ✅
  - Stage 4: Tool system and MVP (10% GAIA score: 2/20 questions) ✅
  - Stage 5: Performance optimization (retry logic, Groq integration, improved prompts) ✅
- **Current Performance:** Testing in progress (target: 25% accuracy, 5/20 questions)
- **Next Stage:** Stage 6 - Advanced optimizations based on Stage 5 results

**Data & Workflows:**

- **Input Data:** GAIA test questions fetched from the external scoring API (`agents-course-unit4-scoring.hf.space`)
- **Processing:** The BasicAgent class processes questions and generates answers
- **Output:** Agent responses submitted to the scoring endpoint for evaluation
- **Development Workflow:**
  1. Local development and testing
  2. Deploy to a Hugging Face Space
  3. Submit via the integrated evaluation UI

**User Workflow Diagram:**

```mermaid
---
config:
  layout: fixed
---
flowchart TB
    Start(["Student starts assignment"]) --> Clone["Clone HF Space template"]
    Clone --> LocalDev["Local development:<br>Implement BasicAgent logic"]
    LocalDev --> LocalTest{"Test locally?"}
    LocalTest -- Yes --> RunLocal["Run app locally"]
    RunLocal --> Debug{"Works?"}
    Debug -- No --> LocalDev
    Debug -- Yes --> Deploy["Deploy to HF Space"]
    LocalTest -- Skip --> Deploy
    Deploy --> Login["Login with HF OAuth"]
    Login --> RunEval@{ label: "Click 'Run Evaluation'<br>button in UI" }
    RunEval --> FetchQ["System fetches GAIA<br>questions from API"]
    FetchQ --> RunAgent["Agent processes<br>each question"]
    RunAgent --> Submit["Submit answers<br>to scoring API"]
    Submit --> Display["Display score<br>and results"]
    Display --> Iterate{"Satisfied with<br>score?"}
    Iterate -- "No - improve agent" --> LocalDev
    Iterate -- Yes --> Complete(["Assignment complete"])

    RunEval@{ shape: rect}
    style Start fill:#e1f5e1
    style LocalDev fill:#fff4e1
    style Deploy fill:#e1f0ff
    style RunAgent fill:#ffe1f0
    style Complete fill:#e1f5e1
```

**Technical Architecture:**

- **Platform:** Hugging Face Spaces with OAuth integration
- **Framework:** Gradio for the UI, Requests for API communication
- **Core Component:** BasicAgent class (student-customizable template)
- **Evaluation Infrastructure:** Pre-built orchestration (question fetching, submission, scoring display)
- **Deployment:** HF Space with environment variables (SPACE_ID, SPACE_HOST)

**Requirements & Constraints:**

- **Constraint Type:** Minimal at the current stage
- **Infrastructure:** Must run on the Hugging Face Spaces platform
- **Integration:** Fixed scoring API endpoints (the evaluation system cannot be modified)
- **Flexibility:** Students have full freedom to design agent capabilities

**Integration Points:**

- **External API:** `https://agents-course-unit4-scoring.hf.space`
  - `/questions` endpoint: Fetch GAIA test questions
  - `/submit` endpoint: Submit answers and receive scores
- **Authentication:** Hugging Face OAuth for student identification
- **Deployment:** HF Space runtime environment variables
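
A minimal sketch of how a client might talk to these endpoints with Requests. Only the base URL and endpoint paths are taken from this project; the helper names and the `session` parameter are hypothetical:

```python
import requests

API_BASE = "https://agents-course-unit4-scoring.hf.space"

def endpoint(path: str) -> str:
    """Build a full URL for a scoring-API path."""
    return f"{API_BASE}/{path.lstrip('/')}"

def fetch_questions(session=requests):
    """GET the GAIA question list from the /questions endpoint."""
    resp = session.get(endpoint("/questions"), timeout=30)
    resp.raise_for_status()
    return resp.json()
```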

**Development Goals:**

- **Primary:** Achieve competitive GAIA benchmark performance through systematic optimization
- **Focus:** Multi-tier LLM architecture with free-tier prioritization to minimize costs
- **Key Features:**
  - 4-tier LLM fallback for quota resilience (Gemini → HF → Groq → Claude)
  - Exponential backoff retry logic for quota/rate-limit errors
  - UI-based LLM provider selection for easy A/B testing in the cloud
  - Comprehensive tool system (web search, file parsing, calculator, vision)
  - Graceful error handling and degradation
  - Extensive test coverage (99 tests)
- **Documentation:** Full dev record workflow tracking all decisions and changes

## Key Features

### LLM Provider Selection (UI-Based)

**Local Testing (.env configuration):**

```bash
LLM_PROVIDER=gemini         # Options: gemini, huggingface, groq, claude
ENABLE_LLM_FALLBACK=false   # Disable fallback when debugging a single provider
```
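
For illustration, these flags might be consumed at startup roughly like this (a hypothetical loader; the project's actual config code may differ):

```python
import os

def load_llm_config(env=os.environ):
    """Read LLM_PROVIDER / ENABLE_LLM_FALLBACK into a small config dict."""
    provider = env.get("LLM_PROVIDER", "gemini").lower()
    fallback = env.get("ENABLE_LLM_FALLBACK", "true").lower() in ("1", "true", "yes")
    if provider not in {"gemini", "huggingface", "groq", "claude"}:
        raise ValueError(f"Unknown LLM_PROVIDER: {provider}")
    return {"provider": provider, "fallback": fallback}
```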

**Cloud Testing (HuggingFace Spaces):**

- Use the UI dropdowns in the Test & Debug tab or the Full Evaluation tab
- Select from: Gemini, HuggingFace, Groq, Claude
- Toggle fallback behavior with a checkbox
- No environment variable changes needed; provider switching is instant

**Benefits:**

- Easy A/B testing between providers
- Clear visibility into which LLM is in use
- Isolated testing for debugging
- Production safety with fallback enabled

### Retry Logic

- **Exponential backoff:** 3 attempts with 1s, 2s, 4s delays
- **Error detection:** 429 status, quota errors, rate limits
- **Scope:** All LLM calls (planning, tool selection, synthesis)
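
The policy above can be sketched as a small helper. This is an illustration of the described behavior (3 attempts, 1s/2s/4s delays, transient-error detection by message), not the project's actual implementation:

```python
import time

def retry_with_backoff(fn, attempts=3, base_delay=1.0):
    """Retry fn on quota/rate-limit style errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:
            msg = str(exc).lower()
            transient = "429" in msg or "quota" in msg or "rate" in msg
            if not transient or attempt == attempts - 1:
                raise  # non-transient, or out of attempts
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s
```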

### Tool System

**Web Search (Tavily/Exa):**

- Factual information, current events, statistics
- Wikipedia, company info, people

**File Parser:**

- PDF, Excel, Word, CSV, text, images
- Handles uploaded files and local paths
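
Multi-format parsing typically dispatches on the file extension before handing off to a format-specific library (PyPDF2, openpyxl, python-docx, Pillow). A hypothetical sketch of that dispatch step; `detect_format` is illustrative, not the real `file_parser.py` API:

```python
from pathlib import Path

def detect_format(path: str) -> str:
    """Map a file path to a parser category by extension."""
    suffix = Path(path).suffix.lower()
    return {
        ".pdf": "pdf", ".xlsx": "excel", ".docx": "word",
        ".csv": "csv", ".txt": "text",
        ".png": "image", ".jpg": "image", ".jpeg": "image",
    }.get(suffix, "unknown")
```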

**Calculator:**

- Safe expression evaluation
- Arithmetic, algebra, trigonometry, logarithms
- Functions: sqrt, sin, cos, log, abs, etc.
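
A "safe" evaluator usually walks the expression's AST and whitelists node types instead of calling `eval`. A minimal sketch of that idea, returning error dicts rather than raising (in the spirit of the Stage 5 relaxed validation); this is not the project's actual `calculator.py`:

```python
import ast
import math
import operator

# Whitelisted operators and functions only; everything else is rejected.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
        ast.Div: operator.truediv, ast.Pow: operator.pow, ast.USub: operator.neg}
_FUNCS = {"sqrt": math.sqrt, "sin": math.sin, "cos": math.cos,
          "log": math.log, "abs": abs}

def safe_eval(expr: str) -> dict:
    """Evaluate a math expression; return {'result': ...} or {'error': ...}."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.operand))
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name) \
                and node.func.id in _FUNCS:
            return _FUNCS[node.func.id](*[ev(a) for a in node.args])
        raise ValueError("disallowed expression")
    try:
        return {"result": ev(ast.parse(expr, mode="eval"))}
    except Exception as exc:
        return {"error": str(exc)}  # error dict instead of crashing
```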

**Vision:**

- Multimodal image/video analysis
- Describe content, identify objects, read text
- YouTube video understanding

### Performance Optimizations (Stage 5)

- Few-shot prompting for improved tool selection
- Graceful skipping of vision questions when quota is exhausted
- Relaxed calculator validation (returns error dicts instead of crashing)
- Improved tool descriptions with "Use when..." guidance
- Config-based provider debugging

## GAIA Benchmark Results

**Baseline (Stage 4):** 10% accuracy (2/20 questions correct)

**Stage 5 Target:** 25% accuracy (5/20 questions correct)

- Status: Testing in progress
- Expected improvements from retry logic, Groq integration, and improved prompts

**Test Coverage:** 99 passing tests (~2 min 40 s runtime)

> **Note:** This project implements the **Course Leaderboard** (20 questions, 30% target). See the [GAIA Submission Guide](../agentbee/docs/gaia_submission_guide.md) for the distinction between the Course and Official GAIA leaderboards.

## Workflow

### Dev Record Workflow

**Philosophy:** Dev records are the single source of truth. CHANGELOG/PLAN/TODO are temporary workspace files.

**Dev Record Types:**

- 🐛 **Issue:** Problem-solving, bug fixes, error resolution
- 🔨 **Development:** Feature development, enhancements, new functionality

### Session Start Workflow

#### Phase 1: Planning (Explicit)

1. **Create or identify a dev record:** `dev/dev_YYMMDD_##_concise_title.md`
   - Choose a type: 🐛 Issue or 🔨 Development
2. **Create PLAN.md ONLY:** Use the `/plan` command or write it directly
   - Document the implementation approach, steps, and files to modify
   - DO NOT create TODO.md or CHANGELOG.md yet

#### Phase 2: Development (Automatic)

3. **Create TODO.md:** Populate it automatically as you start implementing
   - Track tasks in real-time using the TodoWrite tool
   - Mark tasks in_progress/completed as you work
4. **Create CHANGELOG.md:** Populate it automatically as you make changes
   - Record file modifications/creations/deletions as they happen
5. **Work on the solution:** Update all three files during development

### Session End Workflow

#### Phase 3: Completion (Manual)

After AI completes all work and updates PLAN/TODO/CHANGELOG:

- AI stops and waits for user review (Checkpoint 3)
- User reviews PLAN.md, TODO.md, and CHANGELOG.md
- User manually runs `/update-dev dev_YYMMDD_##` when satisfied

When `/update-dev` runs:

1. Distills PLAN decisions → dev record "Key Decisions" section
2. Distills TODO deliverables → dev record "Outcome" section
3. Distills CHANGELOG changes → dev record "Changelog" section
4. Empties PLAN.md, TODO.md, CHANGELOG.md back to templates
5. Marks dev record status as ✅ Resolved

### AI Context Loading Protocol

**MANDATORY - Execute in exact order. Do NOT delegate initial context loading to sub-agents.**

**Phase 1: Current State (What's happening NOW)**

1. **Read workspace files:**

   - `CHANGELOG.md` - Active session changes (reverse chronological, newest first)
   - `PLAN.md` - Current implementation plan (if it exists)
   - `TODO.md` - Active task tracking (if it exists)

2. **Read actual outputs (CRITICAL - verify claims, don't trust summaries):**
   - Latest files in the `output/` folder (sorted by timestamp, newest first)
   - For GAIA projects: Read the latest `output/gaia_results_*.json` completely
     - Check `metadata.score_percent` and `metadata.correct_count`
     - Read ALL `results[].submitted_answer` values to understand failure patterns
     - Identify error categories (vision failures, tool errors, wrong answers)
   - For test projects: Read the latest test output logs
   - **Purpose:** Ground truth of what ACTUALLY happened, not what was claimed
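
Step 2 above can be sketched as a small loader. The JSON field names (`metadata.score_percent`, `metadata.correct_count`, `results[].submitted_answer`) come from this protocol; the helper itself is hypothetical:

```python
import glob
import json
import os

def load_latest_results(output_dir="output"):
    """Find the newest gaia_results_*.json and extract the ground-truth metrics."""
    paths = sorted(glob.glob(os.path.join(output_dir, "gaia_results_*.json")))
    if not paths:
        return None  # no evaluation has been run yet
    with open(paths[-1]) as f:
        data = json.load(f)
    return {
        "file": paths[-1],
        "score_percent": data["metadata"]["score_percent"],
        "correct_count": data["metadata"]["correct_count"],
        "answers": [r["submitted_answer"] for r in data.get("results", [])],
    }
```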

**Phase 2: Recent History (What was done recently)**

3. **Read the last 3 dev records from the `dev/` folder:**
   - Sort by filename (newest `dev_YYMMDD_##_title.md` first)
   - Read: Problem Description, Key Decisions, Outcome, Changelog
   - **Cross-verify:** Compare dev record claims with actual output files
   - **Red flag:** If a dev record says "25% accuracy" but the latest JSON shows "0%", prioritize the JSON as truth

**Phase 3: Project Structure (How it works)**

4. **Read README.md sections in order:**

   - Section 1: Overview (purpose, objectives)
   - Section 2: Architecture (tech stack, components, diagrams)
   - Section 3: Specification (current state, workflows, requirements)
   - Section 4: Workflow (this protocol)

5. **Read CLAUDE.md:**
   - Project-specific coding standards
   - Usually empty (inherits from global ~/.claude/CLAUDE.md)

**Phase 4: Code Structure (Critical files)**

6. **Identify critical files from the README.md Architecture section:**
   - Note main entry points (e.g., `app.py`)
   - Note core logic files (e.g., `src/agent/graph.py`, `src/agent/llm_client.py`)
   - Note tool implementations (e.g., `src/tools/*.py`)
   - **DO NOT read these yet** - only note their locations for later reference

**Verification Checklist (Before claiming "I have context"):**

- [ ] I personally read CHANGELOG.md, PLAN.md, TODO.md (not delegated)
- [ ] I personally read the latest output files (JSON results, test logs, etc.)
- [ ] I know the ACTUAL current accuracy/status from output files
- [ ] I read the last 3 dev records and cross-verified claims with output data
- [ ] I read README.md sections 1-4 completely
- [ ] I can answer: "What is the current status and why?"
- [ ] I can answer: "What were the last 3 major changes and their outcomes?"
- [ ] I can answer: "What specific problems exist based on latest outputs?"

**Anti-Patterns (NEVER do these):**

- ❌ Delegate initial context loading to Explore/Task agents
- ❌ Trust dev record claims without verifying against output files
- ❌ Skip reading actual output data (JSON results, logs, test outputs)
- ❌ Claim "I have context" after only reading summaries
- ❌ Read code files before understanding current state from outputs

**Context Priority:** Latest Outputs (ground truth) > CHANGELOG (active work) > Dev Records (history) > README (structure)