paper seminar_251001
Reconstruction Alignment Improves Unified Multimodal Models
Paper
• 2509.07295
• Published
• 40
F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions
Paper
• 2509.06951
• Published
• 32
UMO: Scaling Multi-Identity Consistency for Image Customization via Matching Reward
Paper
• 2509.06818
• Published
• 29
Interleaving Reasoning for Better Text-to-Image Generation
Paper
• 2509.06945
• Published
• 15
RewardDance: Reward Scaling in Visual Generation
Paper
• 2509.08826
• Published
• 73
Q-Sched: Pushing the Boundaries of Few-Step Diffusion Models with Quantization-Aware Scheduling
Paper
• 2509.01624
• Published
• 7
Directly Aligning the Full Diffusion Trajectory with Fine-Grained Human Preference
Paper
• 2509.06942
• Published
• 17
Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation
Paper
• 2509.15185
• Published
• 29
LLM-I: LLMs are Naturally Interleaved Multimodal Creators
Paper
• 2509.13642
• Published
• 9
Image Tokenizer Needs Post-Training
Paper
• 2509.12474
• Published
• 8
InfGen: A Resolution-Agnostic Paradigm for Scalable Image Synthesis
Paper
• 2509.10441
• Published
• 31
HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning
Paper
• 2509.08519
• Published
• 128
MOSAIC: Multi-Subject Personalized Generation via Correspondence-Aware Alignment and Disentanglement
Paper
• 2509.01977
• Published
• 13
GenCompositor: Generative Video Compositing with Diffusion Transformer
Paper
• 2509.02460
• Published
• 26
Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning
Paper
• 2508.20751
• Published
• 89
Mixture of Contexts for Long Video Generation
Paper
• 2508.21058
• Published
• 35
MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer
Paper
• 2509.16197
• Published
• 58
Lynx: Towards High-Fidelity Personalized Video Generation
Paper
• 2509.15496
• Published
• 13
OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models
Paper
• 2509.17627
• Published
• 66
Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation
Paper
• 2509.19244
• Published
• 12
Hyper-Bagel: A Unified Acceleration Framework for Multimodal Understanding and Generation
Paper
• 2509.18824
• Published
• 23
VChain: Chain-of-Visual-Thought for Reasoning in Video Generation
Paper
• 2510.05094
• Published
• 38
Free Lunch Alignment of Text-to-Image Diffusion Models without Preference Image Pairs
Paper
• 2509.25771
• Published
• 11
Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation
Paper
• 2510.01284
• Published
• 37
Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
Paper
• 2510.02283
• Published
• 96
UltraGen: High-Resolution Video Generation with Hierarchical Attention
Paper
• 2510.18775
• Published
• 18