---
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-to-image
library_name: transformers
license: apache-2.0
---
# RePlan: Reasoning-Guided Region Planning for Complex Instruction-Based Image Editing
This model is part of the work presented in the paper [RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing](https://huggingface.co/papers/2512.16864).
## Model Summary
This model is the **Planner module** of the **RePlan** framework, designed for complex instruction-based image editing. It is a fine-tuned version of **Qwen2.5-VL-7B**, trained with GRPO on only ~1k samples and without any paired images.
Given an input image and a natural language editing instruction, this model performs Chain-of-Thought (CoT) reasoning to decompose the task. It outputs structured guidance containing:
1. **Reasoning:** Analysis of the image and instruction.
2. **Global Edits:** Instructions for the entire image (if necessary).
3. **Regional Edits:** Precise bounding boxes (`bbox_2d`) and specific prompts (`hint`) for local regions.
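Concretely, the planner's guidance can be thought of as plain structured data. The sketch below is illustrative only (the key names `bbox_2d`, `point_2d`, and `hint` come from the output format later in this card; the values and top-level field names are made up for the example):

```python
# Illustrative sketch of the planner's structured guidance as Python data.
# Keys "bbox_2d", "point_2d", and "hint" mirror this card's output format;
# the top-level field names are hypothetical.
plan = {
    "reasoning": "The red cup is the 'used' one; the glass is still in use.",
    "global_edit": "keep remaining part of image unchanged.",
    "local_edits": [
        {
            "bbox_2d": [224, 372, 263, 431],  # [x1, y1, x2, y2] in pixels
            "point_2d": [243, 401],           # a point inside the region (illustrative)
            "hint": "Replace this red cup with a small potted plant",
        },
        {
            "bbox_2d": [175, 329, 220, 388],
            "point_2d": [197, 358],
            "hint": "Keep this glass unchanged",
        },
    ],
}

# Each local edit pairs a region with a concise, visually descriptive hint.
for edit in plan["local_edits"]:
    x1, y1, x2, y2 = edit["bbox_2d"]
    assert x1 < x2 and y1 < y2, "bbox_2d must be [x1, y1, x2, y2]"
```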
## Paper Introduction
**Paper Title:** RePlan: Reasoning-Guided Region Planning for Complex Instruction-Based Image Editing
Existing instruction-based image editing models often struggle with **Instruction-Visual Complexity (IV-Complexity)**: scenarios involving cluttered visuals, ambiguous instructions, or the need for multi-step reasoning.
**RePlan** introduces a "Plan-then-Execute" strategy:
1. **Plan:** This VLM planner analyzes the scene, grounds the instruction to specific pixels, and generates a precise editing plan.
2. **Execute:** A diffusion model (equipped with a Training-Free Attention Region Injection mechanism) applies the edits based on the planner's guidance.
Experiments show that RePlan significantly outperforms baselines in visual reasoning and background consistency.
## Usage
### System Prompt
**Crucial:** To get the correct XML/JSON structured output, you **MUST** use the following System Prompt.
```text
You are an expert AI image editing assistant. Your task is to carefully analyze a user's editing instruction and input image, reason step by step, and then decompose the necessary actions into global and local edits.
### Rules
1. **Global Edits:** Affect the entire image's style, lighting, color grading, or overall composition. Global edits should only be derived if they are essential for achieving the user's core instruction.
2. **Local Edits:** Target specific objects or areas. These instructions go into the `<local>` tag.
3. **Hint Quality:** The `hint` text MUST be a concise, visually descriptive instruction for its specific region. It should clearly state the expected visual outcome.
4. **Strict Separation:** Instructions for local edits in `<local>` MUST NOT be duplicated in the `<global>` prompt.
5. **Edge Case - No Global Edits:** If no global edits are necessary to achieve the user's goal, `<global>` MUST contain the placeholder 'keep remaining part of image unchanged.'.
6. **Edge Case - No Local Edits:** If no local edits are needed, `<local>` must be an empty list `[]`.
### Output Format
Your entire output must follow this format, with no text outside the tags.
<think>Reasoning process</think><global>Global edit instruction</global><local>[{"bbox_2d": [10,150,150,210], "point_2d": [30,175], "hint": "change the color of this one apple to blue"}, {"bbox_2d": [150,50,200,150], "point_2d": [175,75], "hint": "keep this one apple unchanged"}]</local>
```
### Inference Code Example
```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils
# 1. Load Model
model_path = "path/to/your/model" # Replace with huggingface repo id
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
model_path, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)
# 2. Define System Prompt (See section above)
SYSTEM_PROMPT = "..." # Paste the full system prompt here
# 3. Prepare Inputs
image_path = "./example_image.jpg"
user_instruction = "Replace the cup that has been used and left on the desk with a small potted plant."
messages = [
{"role": "system", "content": SYSTEM_PROMPT},
{
"role": "user",
"content": [
{"type": "image", "image": image_path},
{"type": "text", "text": user_instruction},
],
},
]
# 4. Processing
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
).to(model.device)
# 5. Generate
generated_ids = model.generate(**inputs, max_new_tokens=2048)
generated_ids_trimmed = [
out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text_list = processor.batch_decode(
generated_ids_trimmed,
skip_special_tokens=True,
clean_up_tokenization_spaces=False
)
# Output is the raw tagged string containing the reasoning, global edit, and local edits
response_text = output_text_list[0] if isinstance(output_text_list, list) else output_text_list
print(response_text)
```
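The raw response can then be parsed into its three sections with a small helper. The tag names used here (`<think>`, `<global>`, `<local>`) are assumptions for illustration; match them to whatever tags your system prompt actually requests:

```python
import json
import re

def parse_plan(response_text: str) -> dict:
    """Split the planner's tagged output into reasoning, global, and local parts.

    NOTE: the tag names below are assumed for illustration; adjust them to the
    tags specified in your system prompt.
    """
    def between(tag: str) -> str:
        m = re.search(rf"<{tag}>(.*?)</{tag}>", response_text, re.DOTALL)
        return m.group(1).strip() if m else ""

    local_raw = between("local")
    return {
        "reasoning": between("think"),
        "global_edit": between("global"),
        # The local-edit section is a JSON list of region dicts.
        "local_edits": json.loads(local_raw) if local_raw else [],
    }

# Example with a mock response (not a real model run):
mock = (
    "<think>Target the red cup.</think>"
    "<global>keep remaining part of image unchanged.</global>"
    '<local>[{"bbox_2d": [224, 372, 263, 431], '
    '"hint": "Replace this red cup with a small potted plant"}]</local>'
)
plan = parse_plan(mock)
print(plan["global_edit"])                # keep remaining part of image unchanged.
print(plan["local_edits"][0]["bbox_2d"])  # [224, 372, 263, 431]
```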
## Input / Output Examples
### Example 1: Complex Reasoning
**Input Image:**
**Instruction:** "Replace the cup that has been used and left on the desk with a small potted plant."
**Output Format Example:**
```xml
<think>The user wants to replace a "used" cup. I see two vessels: a red coffee cup and a glass of water. The red cup appears empty and discarded, fitting the "used and left" description. The glass of water is still potentially in use. Therefore, I should target the red cup for replacement and explicitly keep the glass unchanged to avoid confusion.</think>
<global>keep remaining part of image unchanged.</global>
<local>[{"bbox_2d": [224, 372, 263, 431], "hint": "Replace this red cup with a small potted plant"}, {"bbox_2d": [175, 329, 220, 388], "hint": "Keep this glass unchanged"}]</local>
```
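Before handing such a plan to the executor, it can help to visualize the planned regions. The sketch below uses Pillow, with a blank canvas standing in for the real input image, and draws the example's bounding boxes:

```python
from PIL import Image, ImageDraw

# Draw the planner's bounding boxes to sanity-check the plan.
# A blank 512x512 canvas stands in for the real input image here.
image = Image.new("RGB", (512, 512), "white")
draw = ImageDraw.Draw(image)

local_edits = [
    {"bbox_2d": [224, 372, 263, 431], "hint": "Replace this red cup with a small potted plant"},
    {"bbox_2d": [175, 329, 220, 388], "hint": "Keep this glass unchanged"},
]

for edit in local_edits:
    x1, y1, x2, y2 = edit["bbox_2d"]
    draw.rectangle([x1, y1, x2, y2], outline="red", width=2)
    draw.text((x1, max(0, y1 - 12)), edit["hint"][:30], fill="red")

image.save("plan_preview.png")
```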