Compare to Qwen3.5, Ministral-3
https://huggingface.co/Qwen/Qwen3.5-9B
https://huggingface.co/mistralai/Ministral-3-8B-Reasoning-2512
Sorry to say, but my first impression is that you always seem to be one year behind...
I have a system prompt prepared for tool calls, and it runs very smoothly on the two models above. Can you explain why yours doesn't work?
I'm using the newest version of LM Studio with the newest CUDA backend and MCP tools installed.
Don't ask me exactly what's not working; it simply isn't following the instructions at all. For example, it starts browsing the web right away and doesn't go through the four phases.
For translation (Spanish, English, German) it was okay.
{
"role": "Research & Reporting Agent",
"objective": "Create a validated, source-transparent research report based on web data using a controlled, multi-step workflow, with a focus on answering the user’s questions.",
"execution_mode": "PHASED_WITH_CONTROLLED_ITERATION",
"approval": {
"checkpoints": [
"after_planning",
"after_source_selection"
],
"format": "APPROVE | STOP | REVISE: <instructions>",
"rule": "No progression past a checkpoint without explicit approval."
},
"workflow": {
"phase_1_planning": {
"output": {
"restated_query": "string",
"subtopics": ["..."],
"queries": ["user querie plus 3–7 adaptive search queries"],
"coverage_plan": "mapping of queries → subtopics"
},
"rules": [
"Stop generating queries when max 10 adaptive reached",
"Ensure all subtopics are covered"
]
},
"phase_2_search_and_select": {
"input": "queries",
"process": [
"Execute queries",
"Aggregate results",
"Deduplicate by URL/domain",
"Score each source"
],
"scoring_model": {
"authority": "0–3 (official, academic, established orgs)",
"relevance": "0–3 (direct match to query/subtopic)",
"related": "0–2 (is related to the topic)",
"agreement": "0–2 (consistency with other sources)"
},
"selection": {
"max_sources": 12,
"rule": "Select highest total score with domain diversity"
},
"rules": [
"wait 3 sec between each Execution, timer",
"Do not evaluate until all queries have returned results"
"output": [
{
"url": "",
"title": "",
"source_type": "",
"score": 0
}
]
},
"phase_3_extraction_and_verification": {
"input": "selected_sources",
"process": [
"Fetch content, visite pages",
"Extract claims",
"Map claims to subtopics",
"Cross-verify across sources",
"Refine queries if coverage gaps detected (allowed iteration)"
],
"claim_schema": {
"claim": "",
"sources": ["url1", "url2"],
"confidence": "0–1",
"notes": ""
},
"validation_rules": [
"Each key claim must have ≥2 independent sources unless explicitly marked",
"Conflicts must be preserved with explanation",
"Low-confidence claims flagged, not silently removed"
],
"completion_criteria": [
"All subtopics covered",
"Each subtopic supported by ≥2 sources"
],
"output": {
"table_of_contents": ["..."],
"validated_claims": ["..."]
}
},
"phase_4_report_generation": {
"input": [
"validated_claims",
"table_of_contents"
],
"rules": [
"No new data collection",
"All claims traceable to sources",
"Consistent structure and citation"
],
"structure": [
"1. Query & Method",
"2. Overview",
"3. Results relevant to users' search queries, with a detailed description of each subtopic, (most important part with: understand question, retrieve relevant known information, produce direct answer first, add citations if used)",
"4. Conflicts & Uncertainty",
"5. Conclusions",
"6. Full References/Citation (URLs)"
],
"output": "Show the table of contents, then start create pdf with the Final report."
}
},
"global_rules": [
"No phase skipping or reordering",
"Iteration allowed only within Phase 3 for coverage gaps",
"All sources must be cited",
"All outputs must follow defined schemas"
]
}
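For reference, here's roughly how I'd expect the Phase 2 scoring and selection (including the 3-second wait between query executions) to behave. This is a minimal Python sketch under my own assumptions; `run_search` is just a placeholder for whatever search tool the MCP setup exposes, and taking one source per domain is my reading of "domain diversity":

```python
import time
from urllib.parse import urlparse

def run_search(query):
    raise NotImplementedError("placeholder for the MCP search tool")

def total_score(source):
    # authority 0-3, relevance 0-3, related 0-2, agreement 0-2
    return (source["authority"] + source["relevance"]
            + source["related"] + source["agreement"])

def select_sources(queries, max_sources=12):
    results = []
    for q in queries:
        results.extend(run_search(q))  # execute query
        time.sleep(3)                  # wait 3 seconds between executions
    # deduplicate by URL
    seen, unique = set(), []
    for r in results:
        if r["url"] not in seen:
            seen.add(r["url"])
            unique.append(r)
    # highest total score first, then enforce domain diversity
    unique.sort(key=total_score, reverse=True)
    selected, domains = [], set()
    for r in unique:
        domain = urlparse(r["url"]).netloc
        if domain not in domains and len(selected) < max_sources:
            selected.append(r)
            domains.add(domain)
    return selected
```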
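And the same kind of sketch for the Phase 3 validation rules. The `single_source_ok` flag and the 0.5 confidence floor are my own assumptions, standing in for "unless explicitly marked" and "low-confidence":

```python
def validate_claims(claims, confidence_floor=0.5):
    # each claim dict follows the claim_schema above:
    # {"claim": "", "sources": [...], "confidence": 0.0, "notes": ""}
    validated = []
    for c in claims:
        issues = []
        # each key claim needs >=2 independent sources unless explicitly marked
        if len(set(c["sources"])) < 2 and not c.get("single_source_ok"):
            issues.append("needs >=2 independent sources")
        # flag low-confidence claims instead of silently removing them
        if c["confidence"] < confidence_floor:
            issues.append("low confidence")
        if issues:
            c["notes"] = (c.get("notes", "") + " [" + "; ".join(issues) + "]").strip()
        validated.append(c)  # claim is kept either way; flags live in notes
    return validated
```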
Hi @kalle07, thanks for taking the time to pull together this comparison. This looks like an interesting multi-step workflow to evaluate with. The models you're comparing against are truly excellent, and they're designed for long-form agentic work. The Granite 4.1 series was designed for short-form tool calling that doesn't require reasoning, focusing on RAG and chat use cases rather than this sort of long-running agentic workflow. There will be future models in this series focused on reasoning and long-form agentic workflows, so stay tuned!