UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
Abstract
UniGeo is a camera-controllable image editing framework that addresses geometric drift and structural degradation by injecting unified geometric guidance across representation, architecture, and loss function levels.
Camera-controllable image editing aims to synthesize novel views of a given scene under varying camera poses while strictly preserving cross-view geometric consistency. However, existing methods typically rely on fragmented geometric guidance, for example injecting point clouds only at the representation level even though the generative process is shaped at multiple levels, and they are mainly built on image diffusion models that operate on discrete view mappings. These two limitations jointly lead to geometric drift and structural degradation under continuous camera motion. We observe that video models provide continuous viewpoint priors for camera-controllable image editing, yet they still struggle to form a stable geometric understanding when geometric guidance remains fragmented. To address this systematically, we inject unified geometric guidance across the three levels that jointly determine the generative output: representation, architecture, and loss function. To this end, we propose UniGeo, a novel camera-controllable editing framework. At the representation level, UniGeo incorporates a frame-decoupled geometric reference injection mechanism to provide robust cross-view geometric context. At the architecture level, it introduces geometric anchor attention to align multi-view features. At the loss function level, it adopts a trajectory-endpoint geometric supervision strategy to explicitly reinforce the structural fidelity of target views. Comprehensive experiments across multiple public benchmarks, covering both extensive and limited camera motion settings, demonstrate that UniGeo significantly outperforms existing methods in both visual quality and geometric consistency.
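To make the architecture-level idea concrete, below is a minimal, hypothetical sketch of how "geometric anchor attention" could be realized: per-frame image tokens cross-attend to a shared set of geometry-derived anchor tokens so that all views are aligned against the same geometric reference. The class name, tensor shapes, and the use of standard multi-head cross-attention are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class GeometricAnchorAttention(nn.Module):
    """Illustrative sketch: each video frame's tokens attend to shared geometric
    anchor tokens (e.g., features lifted from a reference point cloud), so that
    multi-view features are aligned against one geometric reference.
    All names and shapes are assumptions, not the paper's implementation."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, frame_tokens: torch.Tensor, anchor_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens:  (B * F, N, D)  tokens of each frame / view
        # anchor_tokens: (B * F, M, D)  geometry anchors broadcast to every frame
        q = self.norm_q(frame_tokens)
        kv = self.norm_kv(anchor_tokens)
        aligned, _ = self.attn(q, kv, kv)   # every view attends to the same anchors
        return frame_tokens + aligned       # residual keeps per-frame content

# Toy usage: 2 scenes x 4 frames, 256 tokens per frame, 64 anchors, dim 128.
B, F, N, M, D = 2, 4, 256, 64, 128
frames = torch.randn(B * F, N, D)
anchors = torch.randn(B, M, D).repeat_interleave(F, dim=0)  # share anchors across frames
out = GeometricAnchorAttention(D)(frames, anchors)
print(out.shape)  # torch.Size([8, 256, 128])
```

Because the same anchor tokens are broadcast to every frame, gradients through this block encourage all views to agree on the anchored geometry, which is the intuition behind using geometry as a cross-view alignment target.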