---
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-to-image
library_name: transformers
license: apache-2.0
---

# RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing

This model is part of the work presented in the paper [RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing](https://huggingface.co/papers/2512.16864).

Project Page | Paper (Hugging Face) | Paper (arXiv) | GitHub Code | Hugging Face Dataset | Hugging Face Model

Demo video of RePlan:
## Model Summary

This model is the **Planner module** of the **RePlan** framework, designed for complex instruction-based image editing. It is a fine-tuned version of **Qwen2.5-VL-7B**, trained with GRPO on only ~1k samples and without any paired images.

Given an input image and a natural-language editing instruction, the model performs Chain-of-Thought (CoT) reasoning to decompose the task. It outputs structured guidance containing:

1. **Reasoning:** Analysis of the image and instruction.
2. **Global Edits:** Instructions for the entire image (if necessary).
3. **Regional Edits:** Precise bounding boxes (`bbox_2d`) and specific prompts (`hint`) for local regions.

## Paper Introduction

**Paper Title:** RePlan: Reasoning-Guided Region Planning for Complex Instruction-Based Image Editing

Existing instruction-based image editing models often struggle with **Instruction-Visual Complexity (IV-Complexity)**: scenarios involving cluttered visuals, ambiguous instructions, or the need for multi-step reasoning. **RePlan** introduces a "Plan-then-Execute" strategy:

1. **Plan:** This VLM planner analyzes the scene, grounds the instruction to specific pixels, and generates a precise editing plan.
2. **Execute:** A diffusion model (equipped with a Training-Free Attention Region Injection mechanism) applies the edits based on the planner's guidance.

Experiments show that RePlan significantly outperforms baselines in visual reasoning and background consistency.

## Usage

### System Prompt

**Crucial:** To get the correct XML/JSON structured output, you **MUST** use the following system prompt.

```text
You are an expert AI image editing assistant. Your task is to carefully analyze a user's editing instruction and input image, reason step by step, and then decompose the necessary actions into global and local edits.

### Rules
1. **Global Edits:** Affect the entire image's style, lighting, color grading, or overall composition. Global edits should only be derived if they are essential for achieving the user's core instruction.
2. **Local Edits:** Target specific objects or areas. These instructions go into the local-edits tag.
3. **Hint Quality:** The `hint` text MUST be a concise, visually descriptive instruction for its specific region. It should clearly state the expected visual outcome.
4. **Strict Separation:** Instructions for local edits MUST NOT be duplicated in the global-edit prompt.
5. **Edge Case - No Global Edits:** If no global edits are necessary to achieve the user's goal, the global edit MUST be the placeholder "keep remaining part of image unchanged.".
6. **Edge Case - No Local Edits:** If no local edits are needed, the local edits must be an empty list `[]`.

### Output Format
Your entire output must follow this format, with no text outside the tags:

Reasoning process
Global edit instruction
[{"bbox_2d": [10,150,150,210], "point_2d": [30,175], "hint": "change the color of this one apple to blue"}, {"bbox_2d": [150,50,200,150], "point_2d": [175,75], "hint": "keep this one apple unchanged"}]
```

### Inference Code Example

```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

# 1. Load Model
model_path = "path/to/your/model"  # Replace with the Hugging Face repo id
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)

# 2. Define System Prompt (see section above)
SYSTEM_PROMPT = "..."  # Paste the full system prompt here

# 3. Prepare Inputs
image_path = "./example_image.jpg"
user_instruction = "Replace the cup that has been used and left on the desk with a small potted plant."
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": user_instruction},
        ],
    },
]

# 4. Processing
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# 5. Generate
generated_ids = model.generate(**inputs, max_new_tokens=2048)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text_list = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

# Output is the raw structured string containing the reasoning, global edit, and regional edits
response_text = output_text_list[0] if isinstance(output_text_list, list) else output_text_list
print(response_text)
```
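The raw response can then be turned into Python objects for a downstream executor. The snippet below is a minimal post-processing sketch, not part of the released code: `parse_regional_edits` is a hypothetical helper that pulls the regional-edit list out of `response_text` with a regex, since the exact tag markup around the fields may vary.

```python
import json
import re


def parse_regional_edits(response_text: str) -> list:
    """Extract the regional-edit list (dicts with `bbox_2d` / `hint`) from the raw planner output.

    Assumes the regional edits are emitted as a single JSON array of objects,
    as specified in the system prompt's output format.
    """
    match = re.search(r"\[\s*\{.*\}\s*\]", response_text, re.DOTALL)
    if match is None:
        return []
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return []


# Example usage with the response generated above
regional_edits = parse_regional_edits(response_text)
for edit in regional_edits:
    print(edit["bbox_2d"], edit.get("hint", ""))
```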
## Input / Output Examples

### Example 1: Complex Reasoning

**Input Image:**

**Instruction:** "Replace the cup that has been used and left on the desk with a small potted plant."

**Output Format Example:**

```xml
The user wants to replace a "used" cup. I see two vessels: a red coffee cup and a glass of water. The red cup appears empty and discarded, fitting the "used and left" description. The glass of water is still potentially in use. Therefore, I should target the red cup for replacement and explicitly keep the glass unchanged to avoid confusion.
keep remaining part of image unchanged.
[{"bbox_2d": [224, 372, 263, 431], "hint": "Replace this red cup with a small potted plant"}, {"bbox_2d": [175, 329, 220, 388], "hint": "Keep this glass unchanged"}]
```
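To sanity-check a plan before handing it to the diffusion executor, the predicted regions can be overlaid on the input image. The sketch below is for visualization only and is not part of the RePlan pipeline; it assumes `bbox_2d` is `[x1, y1, x2, y2]` in pixel coordinates, as in the example above, and `draw_plan` is a hypothetical helper.

```python
from PIL import Image, ImageDraw


def draw_plan(image_path: str, regional_edits: list, out_path: str = "plan_preview.jpg") -> str:
    """Overlay the planner's bounding boxes and hints on the input image."""
    image = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    for edit in regional_edits:
        x1, y1, x2, y2 = edit["bbox_2d"]
        draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
        draw.text((x1, max(0, y1 - 12)), edit.get("hint", ""), fill="red")
    image.save(out_path)
    return out_path


# Visualize the regions from Example 1
example_edits = [
    {"bbox_2d": [224, 372, 263, 431], "hint": "Replace this red cup with a small potted plant"},
    {"bbox_2d": [175, 329, 220, 388], "hint": "Keep this glass unchanged"},
]
draw_plan("./example_image.jpg", example_edits)
```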