---
title: Agentbee | GAIA Project | HuggingFace Course
emoji: 🕵🏻‍♂️
colorFrom: indigo
colorTo: indigo
sdk: gradio
sdk_version: 6.3.0
app_file: app.py
pinned: false
hf_oauth: true
hf_oauth_expiration_minutes: 480
---

Check out the configuration reference at <https://huggingface.co/docs/hub/spaces-config-reference>
## Project Overview

**Project Name:** Final_Assignment_Template

**Purpose:** Course assignment template for building an AI agent that passes the GAIA (General AI Assistants) benchmark. This project serves as a learning-focused workspace to support iterative agent development and experimentation.

**Target Users:** Students learning agent development through hands-on implementation

**Key Objectives:**

- Build production-ready code that passes GAIA test requirements
- Learn agent development through discovery-based implementation
- Develop a systematic approach to solving complex AI tasks
- Document the learning process and key decisions

## Project Architecture

**Technology Stack:**

- **Platform:** Hugging Face Spaces with OAuth integration
- **UI Framework:** Gradio with OAuth support
- **Agent Framework:** LangGraph (state graph orchestration)
- **LLM Providers (4-tier fallback):**
  - Google Gemini 2.0 Flash (free tier)
  - HuggingFace Inference API (free tier)
  - Groq (Llama 3.1 70B / Qwen 3 32B, free tier)
  - Anthropic Claude Sonnet 4.5 (paid tier)
- **Tools:**
  - Web Search: Tavily API / Exa API
  - File Parser: PyPDF2, openpyxl, python-docx, Pillow
  - Calculator: Safe expression evaluator
  - Vision: Multimodal LLM (Gemini/Claude)
- **Language:** Python 3.12+
- **Package Manager:** uv

**Project Structure:**

```
Final_Assignment_Template/
├── archive/              # Reference materials, previous solutions, static resources
├── input/                # Input files, configuration, raw data
├── output/               # Generated files, results, processed data
├── test/                 # Testing files, test scripts (99 tests)
├── dev/                  # Development records (permanent knowledge packages)
├── src/                  # Source code
│   ├── agent/            # Agent orchestration
│   │   ├── graph.py      # LangGraph state machine
│   │   └── llm_client.py # Multi-provider LLM integration with retry logic
│   └── tools/            # Agent tools
│       ├── __init__.py   # Tool registry
│       ├── web_search.py # Tavily/Exa web search
│       ├── file_parser.py # Multi-format file reader
│       ├── calculator.py # Safe math evaluator
│       └── vision.py     # Multimodal image/video analysis
├── app.py                # Gradio UI with OAuth, LLM provider selection
├── pyproject.toml        # uv package management
├── requirements.txt      # Python dependencies (generated from pyproject.toml)
├── .env                  # Local environment variables (API keys, config)
├── README.md             # Project overview, architecture, workflow, specification
├── CLAUDE.md             # Project-specific AI instructions
├── PLAN.md               # Active implementation plan (temporary workspace)
├── TODO.md               # Active task tracking (temporary workspace)
└── CHANGELOG.md          # Session changelog (temporary workspace)
```

**Core Components:**

- **GAIAAgent class** (src/agent/graph.py): LangGraph-based agent with state machine orchestration
  - Planning node: Analyze question and generate execution plan
  - Tool selection node: LLM function calling for dynamic tool selection
  - Tool execution node: Execute selected tools with timeout and error handling
  - Answer synthesis node: Generate factoid answer from evidence
- **LLM Client** (src/agent/llm_client.py): Multi-provider LLM integration
  - 4-tier fallback chain: Gemini → HuggingFace → Groq → Claude
  - Exponential backoff retry logic (3 attempts per provider)
  - Runtime config for UI-based provider selection
  - Few-shot prompting for improved tool selection
- **Tool System** (src/tools/):
  - Web Search: Tavily/Exa API with query optimization
  - File Parser: Multi-format support (PDF, Excel, Word, CSV, images)
  - Calculator: Safe expression evaluator with graceful error handling
  - Vision: Multimodal analysis for images/videos
- **Gradio UI** (app.py):
  - Test & Debug tab: Single-question testing with LLM provider dropdown
  - Full Evaluation tab: Run all GAIA questions with provider selection
  - Results export: JSON file download for analysis
  - OAuth integration for submission
- **Evaluation Infrastructure**: Pre-built orchestration (question fetching, submission, scoring)
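
To make the fallback behavior concrete, here is a minimal sketch of the chain. The provider order comes from this README, but the `call_provider` callable and the error handling are simplified assumptions, not the actual `llm_client.py` interface:

```python
# Hypothetical sketch of the 4-tier fallback chain (not the project's real API).
PROVIDERS = ["gemini", "huggingface", "groq", "claude"]

def call_with_fallback(prompt, providers, call_provider):
    """Try each provider in order; raise only if every tier fails."""
    errors = {}
    for name in providers:
        try:
            return call_provider(name, prompt)
        except Exception as exc:  # e.g. quota or rate-limit errors
            errors[name] = str(exc)
    raise RuntimeError(f"All providers failed: {errors}")
```

With this shape, a quota failure on the free tiers transparently falls through to the paid Claude tier, which is the resilience property the 4-tier design is after.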

**System Architecture Diagram:**

```mermaid
---
config:
  layout: elk
---
graph TB
    subgraph "UI Layer"
        GradioUI[Gradio UI<br/>LLM Provider Selection<br/>Test & Full Evaluation]
        OAuth[HF OAuth<br/>User authentication]
    end

    subgraph "Agent Orchestration (LangGraph)"
        GAIAAgent[GAIAAgent<br/>State Machine]
        PlanNode[Planning Node<br/>Analyze question]
        ToolSelectNode[Tool Selection Node<br/>LLM function calling]
        ToolExecNode[Tool Execution Node<br/>Run selected tools]
        SynthesizeNode[Answer Synthesis Node<br/>Generate factoid]
    end

    subgraph "LLM Layer (4-Tier Fallback)"
        LLMClient[LLM Client<br/>Retry + Fallback]
        Gemini[Gemini 2.0 Flash<br/>Free Tier 1]
        HF[HuggingFace API<br/>Free Tier 2]
        Groq[Groq Llama/Qwen<br/>Free Tier 3]
        Claude[Claude Sonnet 4.5<br/>Paid Tier 4]
    end

    subgraph "Tool Layer"
        WebSearch[Web Search<br/>Tavily/Exa]
        FileParser[File Parser<br/>PDF/Excel/Word]
        Calculator[Calculator<br/>Safe eval]
        Vision[Vision<br/>Multimodal LLM]
    end

    subgraph "External Services"
        API[GAIA Scoring API]
        QEndpoint["/questions endpoint"]
        SEndpoint["/submit endpoint"]
    end

    GradioUI --> OAuth
    OAuth -->|Authenticated| GAIAAgent
    GAIAAgent --> PlanNode
    PlanNode --> ToolSelectNode
    ToolSelectNode --> ToolExecNode
    ToolExecNode --> SynthesizeNode

    PlanNode --> LLMClient
    ToolSelectNode --> LLMClient
    SynthesizeNode --> LLMClient

    LLMClient -->|Try 1| Gemini
    LLMClient -->|Fallback 2| HF
    LLMClient -->|Fallback 3| Groq
    LLMClient -->|Fallback 4| Claude

    ToolExecNode --> WebSearch
    ToolExecNode --> FileParser
    ToolExecNode --> Calculator
    ToolExecNode --> Vision

    GAIAAgent -->|Answers| API
    API --> QEndpoint
    API --> SEndpoint
    SEndpoint -->|Score| GradioUI

    style GAIAAgent fill:#ffcccc
    style LLMClient fill:#fff4cc
    style GradioUI fill:#cce5ff
    style API fill:#d9f2d9
```

## Project Specification

**Project Context:**

This is a course assignment template for building an AI agent that passes the GAIA (General AI Assistants) benchmark. The project was started as a learning-focused workspace to support iterative agent development and experimentation.

**Current State:**

- **Status:** Stage 5 Complete - Performance Optimization
- **Development Progress:**
  - Stage 1-2: Basic infrastructure and LangGraph setup ✅
  - Stage 3: Multi-provider LLM integration ✅
  - Stage 4: Tool system and MVP (10% GAIA score: 2/20 questions) ✅
  - Stage 5: Performance optimization (retry logic, Groq integration, improved prompts) ✅
- **Current Performance:** Testing in progress (target: 25% accuracy, 5/20 questions)
- **Next Stage:** Stage 6 - Advanced optimizations based on Stage 5 results

**Data & Workflows:**

- **Input Data:** GAIA test questions fetched from the external scoring API (`agents-course-unit4-scoring.hf.space`)
- **Processing:** The BasicAgent class processes questions and generates answers
- **Output:** Agent responses submitted to the scoring endpoint for evaluation
- **Development Workflow:**
  1. Local development and testing
  2. Deploy to a Hugging Face Space
  3. Submit via the integrated evaluation UI

**User Workflow Diagram:**

```mermaid
---
config:
  layout: fixed
---
flowchart TB
    Start(["Student starts assignment"]) --> Clone["Clone HF Space template"]
    Clone --> LocalDev["Local development:<br>Implement BasicAgent logic"]
    LocalDev --> LocalTest{"Test locally?"}
    LocalTest -- Yes --> RunLocal["Run app locally"]
    RunLocal --> Debug{"Works?"}
    Debug -- No --> LocalDev
    Debug -- Yes --> Deploy["Deploy to HF Space"]
    LocalTest -- Skip --> Deploy
    Deploy --> Login["Login with HF OAuth"]
    Login --> RunEval@{ label: "Click 'Run Evaluation'<br>button in UI" }
    RunEval --> FetchQ["System fetches GAIA<br>questions from API"]
    FetchQ --> RunAgent["Agent processes<br>each question"]
    RunAgent --> Submit["Submit answers<br>to scoring API"]
    Submit --> Display["Display score<br>and results"]
    Display --> Iterate{"Satisfied with<br>score?"}
    Iterate -- "No - improve agent" --> LocalDev
    Iterate -- Yes --> Complete(["Assignment complete"])

    RunEval@{ shape: rect}
    style Start fill:#e1f5e1
    style LocalDev fill:#fff4e1
    style Deploy fill:#e1f0ff
    style RunAgent fill:#ffe1f0
    style Complete fill:#e1f5e1
```

**Technical Architecture:**

- **Platform:** Hugging Face Spaces with OAuth integration
- **Framework:** Gradio for the UI, Requests for API communication
- **Core Component:** BasicAgent class (student-customizable template)
- **Evaluation Infrastructure:** Pre-built orchestration (question fetching, submission, scoring display)
- **Deployment:** HF Space with environment variables (SPACE_ID, SPACE_HOST)

**Requirements & Constraints:**

- **Constraint Type:** Minimal at the current stage
- **Infrastructure:** Must run on the Hugging Face Spaces platform
- **Integration:** Fixed scoring API endpoints (the evaluation system cannot be modified)
- **Flexibility:** Students have full freedom to design agent capabilities

**Integration Points:**

- **External API:** `https://agents-course-unit4-scoring.hf.space`
  - `/questions` endpoint: Fetch GAIA test questions
  - `/submit` endpoint: Submit answers and receive scores
- **Authentication:** Hugging Face OAuth for student identification
- **Deployment:** HF Space runtime environment variables
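
A minimal sketch of how a client might talk to these endpoints with Requests. Only the base URL and endpoint paths are taken from this project; the helper names and the `session` parameter are hypothetical:

```python
import requests

API_BASE = "https://agents-course-unit4-scoring.hf.space"

def endpoint(path: str) -> str:
    """Build a full URL for a scoring-API path."""
    return f"{API_BASE}/{path.lstrip('/')}"

def fetch_questions(session=requests):
    """GET the GAIA question list from the /questions endpoint."""
    resp = session.get(endpoint("/questions"), timeout=30)
    resp.raise_for_status()
    return resp.json()
```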

**Development Goals:**

- **Primary:** Achieve competitive GAIA benchmark performance through systematic optimization
- **Focus:** Multi-tier LLM architecture with free-tier prioritization to minimize costs
- **Key Features:**
  - 4-tier LLM fallback for quota resilience (Gemini → HF → Groq → Claude)
  - Exponential backoff retry logic for quota/rate-limit errors
  - UI-based LLM provider selection for easy A/B testing in the cloud
  - Comprehensive tool system (web search, file parsing, calculator, vision)
  - Graceful error handling and degradation
  - Extensive test coverage (99 tests)
- **Documentation:** Full dev record workflow tracking all decisions and changes

## Key Features

### LLM Provider Selection (UI-Based)

**Local Testing (.env configuration):**

```bash
LLM_PROVIDER=gemini         # Options: gemini, huggingface, groq, claude
ENABLE_LLM_FALLBACK=false   # Disable fallback when debugging a single provider
```
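
For illustration, these flags might be consumed at startup roughly like this (a hypothetical loader; the project's actual config code may differ):

```python
import os

def load_llm_config(env=os.environ):
    """Read LLM_PROVIDER / ENABLE_LLM_FALLBACK into a small config dict."""
    provider = env.get("LLM_PROVIDER", "gemini").lower()
    fallback = env.get("ENABLE_LLM_FALLBACK", "true").lower() in ("1", "true", "yes")
    if provider not in {"gemini", "huggingface", "groq", "claude"}:
        raise ValueError(f"Unknown LLM_PROVIDER: {provider}")
    return {"provider": provider, "fallback": fallback}
```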

**Cloud Testing (HuggingFace Spaces):**

- Use the UI dropdowns in the Test & Debug tab or the Full Evaluation tab
- Select from: Gemini, HuggingFace, Groq, Claude
- Toggle fallback behavior with a checkbox
- No environment variable changes needed; provider switching is instant

**Benefits:**

- Easy A/B testing between providers
- Clear visibility into which LLM is in use
- Isolated testing for debugging
- Production safety with fallback enabled

### Retry Logic

- **Exponential backoff:** 3 attempts with 1s, 2s, 4s delays
- **Error detection:** 429 status, quota errors, rate limits
- **Scope:** All LLM calls (planning, tool selection, synthesis)
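
The policy above can be sketched as a small helper. This is an illustration of the described behavior (3 attempts, 1s/2s/4s delays, transient-error detection by message), not the project's actual implementation:

```python
import time

def retry_with_backoff(fn, attempts=3, base_delay=1.0):
    """Retry fn on quota/rate-limit style errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:
            msg = str(exc).lower()
            transient = "429" in msg or "quota" in msg or "rate" in msg
            if not transient or attempt == attempts - 1:
                raise  # non-transient, or out of attempts
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s
```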

### Tool System

**Web Search (Tavily/Exa):**

- Factual information, current events, statistics
- Wikipedia, company info, people

**File Parser:**

- PDF, Excel, Word, CSV, text, images
- Handles uploaded files and local paths
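
Multi-format parsing typically dispatches on the file extension before handing off to a format-specific library (PyPDF2, openpyxl, python-docx, Pillow). A hypothetical sketch of that dispatch step; `detect_format` is illustrative, not the real `file_parser.py` API:

```python
from pathlib import Path

def detect_format(path: str) -> str:
    """Map a file path to a parser category by extension."""
    suffix = Path(path).suffix.lower()
    return {
        ".pdf": "pdf", ".xlsx": "excel", ".docx": "word",
        ".csv": "csv", ".txt": "text",
        ".png": "image", ".jpg": "image", ".jpeg": "image",
    }.get(suffix, "unknown")
```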

**Calculator:**

- Safe expression evaluation
- Arithmetic, algebra, trigonometry, logarithms
- Functions: sqrt, sin, cos, log, abs, etc.
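
A "safe" evaluator usually walks the expression's AST and whitelists node types instead of calling `eval`. A minimal sketch of that idea, returning error dicts rather than raising (in the spirit of the Stage 5 relaxed validation); this is not the project's actual `calculator.py`:

```python
import ast
import math
import operator

# Whitelisted operators and functions only; everything else is rejected.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
        ast.Div: operator.truediv, ast.Pow: operator.pow, ast.USub: operator.neg}
_FUNCS = {"sqrt": math.sqrt, "sin": math.sin, "cos": math.cos,
          "log": math.log, "abs": abs}

def safe_eval(expr: str) -> dict:
    """Evaluate a math expression; return {'result': ...} or {'error': ...}."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.operand))
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name) \
                and node.func.id in _FUNCS:
            return _FUNCS[node.func.id](*[ev(a) for a in node.args])
        raise ValueError("disallowed expression")
    try:
        return {"result": ev(ast.parse(expr, mode="eval"))}
    except Exception as exc:
        return {"error": str(exc)}  # error dict instead of crashing
```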

**Vision:**

- Multimodal image/video analysis
- Describe content, identify objects, read text
- YouTube video understanding

### Performance Optimizations (Stage 5)

- Few-shot prompting for improved tool selection
- Graceful skipping of vision questions when quota is exhausted
- Relaxed calculator validation (returns error dicts instead of crashing)
- Improved tool descriptions with "Use when..." guidance
- Config-based provider debugging

## GAIA Benchmark Results

**Baseline (Stage 4):** 10% accuracy (2/20 questions correct)

**Stage 5 Target:** 25% accuracy (5/20 questions correct)

- Status: Testing in progress
- Expected improvements from retry logic, Groq integration, and improved prompts

**Test Coverage:** 99 passing tests (~2 min 40 s runtime)

> **Note:** This project implements the **Course Leaderboard** (20 questions, 30% target). See the [GAIA Submission Guide](../agentbee/docs/gaia_submission_guide.md) for the distinction between the Course and Official GAIA leaderboards.

## Workflow

### Dev Record Workflow

**Philosophy:** Dev records are the single source of truth. CHANGELOG/PLAN/TODO are temporary workspace files.

**Dev Record Types:**

- 🐛 **Issue:** Problem-solving, bug fixes, error resolution
- 🔨 **Development:** Feature development, enhancements, new functionality

### Session Start Workflow

#### Phase 1: Planning (Explicit)

1. **Create or identify a dev record:** `dev/dev_YYMMDD_##_concise_title.md`
   - Choose a type: 🐛 Issue or 🔨 Development
2. **Create PLAN.md ONLY:** Use the `/plan` command or write it directly
   - Document the implementation approach, steps, and files to modify
   - DO NOT create TODO.md or CHANGELOG.md yet

#### Phase 2: Development (Automatic)

3. **Create TODO.md:** Populate it automatically as you start implementing
   - Track tasks in real-time using the TodoWrite tool
   - Mark tasks in_progress/completed as you work
4. **Create CHANGELOG.md:** Populate it automatically as you make changes
   - Record file modifications/creations/deletions as they happen
5. **Work on the solution:** Update all three files during development

### Session End Workflow

#### Phase 3: Completion (Manual)

After AI completes all work and updates PLAN/TODO/CHANGELOG:

- AI stops and waits for user review (Checkpoint 3)
- User reviews PLAN.md, TODO.md, and CHANGELOG.md
- User manually runs `/update-dev dev_YYMMDD_##` when satisfied

When `/update-dev` runs:

1. Distills PLAN decisions → dev record "Key Decisions" section
2. Distills TODO deliverables → dev record "Outcome" section
3. Distills CHANGELOG changes → dev record "Changelog" section
4. Empties PLAN.md, TODO.md, CHANGELOG.md back to templates
5. Marks dev record status as ✅ Resolved

### AI Context Loading Protocol

**MANDATORY - Execute in exact order. Do NOT delegate initial context loading to sub-agents.**

**Phase 1: Current State (What's happening NOW)**

1. **Read workspace files:**

   - `CHANGELOG.md` - Active session changes (reverse chronological, newest first)
   - `PLAN.md` - Current implementation plan (if it exists)
   - `TODO.md` - Active task tracking (if it exists)

2. **Read actual outputs (CRITICAL - verify claims, don't trust summaries):**
   - Latest files in the `output/` folder (sorted by timestamp, newest first)
   - For GAIA projects: Read the latest `output/gaia_results_*.json` completely
     - Check `metadata.score_percent` and `metadata.correct_count`
     - Read ALL `results[].submitted_answer` values to understand failure patterns
     - Identify error categories (vision failures, tool errors, wrong answers)
   - For test projects: Read the latest test output logs
   - **Purpose:** Ground truth of what ACTUALLY happened, not what was claimed
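
Step 2 above can be sketched as a small loader. The JSON field names (`metadata.score_percent`, `metadata.correct_count`, `results[].submitted_answer`) come from this protocol; the helper itself is hypothetical:

```python
import glob
import json
import os

def load_latest_results(output_dir="output"):
    """Find the newest gaia_results_*.json and extract the ground-truth metrics."""
    paths = sorted(glob.glob(os.path.join(output_dir, "gaia_results_*.json")))
    if not paths:
        return None  # no evaluation has been run yet
    with open(paths[-1]) as f:
        data = json.load(f)
    return {
        "file": paths[-1],
        "score_percent": data["metadata"]["score_percent"],
        "correct_count": data["metadata"]["correct_count"],
        "answers": [r["submitted_answer"] for r in data.get("results", [])],
    }
```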

**Phase 2: Recent History (What was done recently)**

3. **Read the last 3 dev records from the `dev/` folder:**
   - Sort by filename (newest `dev_YYMMDD_##_title.md` first)
   - Read: Problem Description, Key Decisions, Outcome, Changelog
   - **Cross-verify:** Compare dev record claims with actual output files
   - **Red flag:** If a dev record says "25% accuracy" but the latest JSON shows "0%", prioritize the JSON as truth

**Phase 3: Project Structure (How it works)**

4. **Read README.md sections in order:**

   - Section 1: Overview (purpose, objectives)
   - Section 2: Architecture (tech stack, components, diagrams)
   - Section 3: Specification (current state, workflows, requirements)
   - Section 4: Workflow (this protocol)

5. **Read CLAUDE.md:**
   - Project-specific coding standards
   - Usually empty (inherits from global ~/.claude/CLAUDE.md)

**Phase 4: Code Structure (Critical files)**

6. **Identify critical files from the README.md Architecture section:**
   - Note main entry points (e.g., `app.py`)
   - Note core logic files (e.g., `src/agent/graph.py`, `src/agent/llm_client.py`)
   - Note tool implementations (e.g., `src/tools/*.py`)
   - **DO NOT read these yet** - only note their locations for later reference

**Verification Checklist (Before claiming "I have context"):**

- [ ] I personally read CHANGELOG.md, PLAN.md, TODO.md (not delegated)
- [ ] I personally read the latest output files (JSON results, test logs, etc.)
- [ ] I know the ACTUAL current accuracy/status from output files
- [ ] I read the last 3 dev records and cross-verified claims with output data
- [ ] I read README.md sections 1-4 completely
- [ ] I can answer: "What is the current status and why?"
- [ ] I can answer: "What were the last 3 major changes and their outcomes?"
- [ ] I can answer: "What specific problems exist based on latest outputs?"

**Anti-Patterns (NEVER do these):**

- ❌ Delegate initial context loading to Explore/Task agents
- ❌ Trust dev record claims without verifying against output files
- ❌ Skip reading actual output data (JSON results, logs, test outputs)
- ❌ Claim "I have context" after only reading summaries
- ❌ Read code files before understanding current state from outputs

**Context Priority:** Latest Outputs (ground truth) > CHANGELOG (active work) > Dev Records (history) > README (structure)