My thing - an EL102 Collection

updated 12 days ago

MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models

Paper • 2511.18373 • Published Nov 23, 2025 • 7
Multi-Agent Deep Research: Training Multi-Agent Systems with M-GRPO

Paper • 2511.13288 • Published Nov 17, 2025 • 19
Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens

Paper • 2511.19418 • Published Nov 24, 2025 • 29
SAM 3: Segment Anything with Concepts

Paper • 2511.16719 • Published Nov 20, 2025 • 134
Temporal Prompting Matters: Rethinking Referring Video Object Segmentation

Paper • 2510.07319 • Published Oct 8, 2025 • 3
OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe

Paper • 2511.16334 • Published Nov 20, 2025 • 94
O-Mem: Omni Memory System for Personalized, Long Horizon, Self-Evolving Agents

Paper • 2511.13593 • Published Nov 17, 2025 • 28
RynnVLA-002: A Unified Vision-Language-Action and World Model

Paper • 2511.17502 • Published Nov 21, 2025 • 28
VisMem: Latent Vision Memory Unlocks Potential of Vision-Language Models

Paper • 2511.11007 • Published Nov 14, 2025 • 15
Depth Anything 3: Recovering the Visual Space from Any Views

Paper • 2511.10647 • Published Nov 13, 2025 • 101
LightRAG: Simple and Fast Retrieval-Augmented Generation

Paper • 2410.05779 • Published Oct 8, 2024 • 34
PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

Paper • 2510.14528 • Published Oct 16, 2025 • 122
TradingAgents: Multi-Agents LLM Financial Trading Framework

Paper • 2412.20138 • Published Dec 28, 2024 • 33
OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation

Paper • 2410.17799 • Published Oct 23, 2024 • 12
PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image

Paper • 2511.13648 • Published Nov 17, 2025 • 53
MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

Paper • 2509.22186 • Published Sep 26, 2025 • 154
Skyfall-GS: Synthesizing Immersive 3D Urban Scenes from Satellite Imagery

Paper • 2510.15869 • Published Oct 17, 2025 • 50
GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization

Paper • 2511.15705 • Published Nov 19, 2025 • 98
FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable Reasoning

Paper • 2510.22543 • Published Oct 26, 2025 • 14
Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning

Paper • 2511.19900 • Published Nov 25, 2025 • 49
From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models

Paper • 2512.10867 • Published Dec 11, 2025 • 16
Full-Duplex-Bench: A Benchmark to Evaluate Full-duplex Spoken Dialogue Models on Turn-taking Capabilities

Paper • 2503.04721 • Published Mar 6, 2025 • 4
MetaClaw: Just Talk -- An Agent That Meta-Learns and Evolves in the Wild

Paper • 2603.17187 • Published 14 days ago • 134
MosaicMem: Hybrid Spatial Memory for Controllable Video World Models

Paper • 2603.17117 • Published 14 days ago • 87
Complementary Reinforcement Learning

Paper • 2603.17621 • Published 13 days ago • 36
AdaMem: Adaptive User-Centric Memory for Long-Horizon Dialogue Agents

Paper • 2603.16496 • Published 14 days ago • 13
Unified Spatio-Temporal Token Scoring for Efficient Video VLMs

Paper • 2603.18004 • Published 13 days ago • 12
Qianfan-OCR: A Unified End-to-End Model for Document Intelligence

Paper • 2603.13398 • Published 20 days ago • 152
Kinema4D: Kinematic 4D World Modeling for Spatiotemporal Embodied Simulation

Paper • 2603.16669 • Published 14 days ago • 70
Efficient Reasoning on the Edge

Paper • 2603.16867 • Published 14 days ago • 18
SparkVSR: Interactive Video Super-Resolution via Sparse Keyframe Propagation

Paper • 2603.16864 • Published 14 days ago • 16
Omnilingual MT: Machine Translation for 1,600 Languages

Paper • 2603.16309 • Published 14 days ago • 20
ViT-AdaLA: Adapting Vision Transformers with Linear Attention

Paper • 2603.16063 • Published 15 days ago • 2
OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data

Paper • 2603.15594 • Published 15 days ago • 148
Mixture-of-Depths Attention

Paper • 2603.15619 • Published 15 days ago • 79
Can Vision-Language Models Solve the Shell Game?

Paper • 2603.08436 • Published 22 days ago • 39
Multimodal OCR: Parse Anything from Documents

Paper • 2603.13032 • Published 18 days ago • 40
NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval

Paper • 2603.12824 • Published 18 days ago • 5