Daily papers that are inspiring (the abstract is enough)
Updated
• World Model on Million-Length Video And Language With RingAttention (arXiv:2402.08268, 40 upvotes)
• Improving Text Embeddings with Large Language Models (arXiv:2401.00368, 82 upvotes)
• Chain-of-Thought Reasoning Without Prompting (arXiv:2402.10200, 109 upvotes)
• FiT: Flexible Vision Transformer for Diffusion Model (arXiv:2402.12376, 48 upvotes)
• (title missing) (arXiv:2402.13144, 100 upvotes)
• Aria Everyday Activities Dataset (arXiv:2402.13349, 31 upvotes)
• VideoAgent: Long-form Video Understanding with Large Language Model as Agent (arXiv:2403.10517, 37 upvotes)
• Getting it Right: Improving Spatial Consistency in Text-to-Image Models (arXiv:2404.01197, 31 upvotes)
• InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD (arXiv:2404.06512, 30 upvotes)
• Adapting LLaMA Decoder to Vision Transformer (arXiv:2404.06773, 18 upvotes)
• Rho-1: Not All Tokens Are What You Need (arXiv:2404.07965, 94 upvotes)
• Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies (arXiv:2404.08197, 29 upvotes)
• LoRA Learns Less and Forgets Less (arXiv:2405.09673, 91 upvotes)
• Many-Shot In-Context Learning in Multimodal Foundation Models (arXiv:2405.09798, 32 upvotes)
• MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning (arXiv:2405.12130, 50 upvotes)
• OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework (arXiv:2405.11143, 41 upvotes)
• Octo: An Open-Source Generalist Robot Policy (arXiv:2405.12213, 29 upvotes)
• ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models (arXiv:2405.15738, 46 upvotes)
• LLMs achieve adult human performance on higher-order theory of mind tasks (arXiv:2405.18870, 17 upvotes)
• Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step (arXiv:2406.04314, 30 upvotes)
• Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation (arXiv:2406.06525, 71 upvotes)
• Vript: A Video Is Worth Thousands of Words (arXiv:2406.06040, 28 upvotes)
• Mixture-of-Agents Enhances Large Language Model Capabilities (arXiv:2406.04692, 59 upvotes)
• GenAI Arena: An Open Evaluation Platform for Generative Models (arXiv:2406.04485, 22 upvotes)
• What If We Recaption Billions of Web Images with LLaMA-3? (arXiv:2406.08478, 43 upvotes)
• VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs (arXiv:2406.07476, 36 upvotes)
• OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text (arXiv:2406.08418, 32 upvotes)
• DataComp-LM: In search of the next generation of training sets for language models (arXiv:2406.11794, 55 upvotes)
• MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens (arXiv:2406.11271, 21 upvotes)
• Instruction Pre-Training: Language Models are Supervised Multitask Learners (arXiv:2406.14491, 96 upvotes)
• ∇²DFT: A Universal Quantum Chemistry Dataset of Drug-Like Molecules and a Benchmark for Neural Network Potentials (arXiv:2406.14347, 102 upvotes)
• MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding (arXiv:2406.14515, 33 upvotes)
• Video-Infinity: Distributed Long Video Generation (arXiv:2406.16260, 30 upvotes)
• The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale (arXiv:2406.17557, 100 upvotes)
• Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs (arXiv:2406.18629, 42 upvotes)
• HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale (arXiv:2406.19280, 63 upvotes)
• LLaRA: Supercharging Robot Learning Data for Vision-Language Policy (arXiv:2406.20095, 18 upvotes)
• LiteSearch: Efficacious Tree Search for LLM (arXiv:2407.00320, 40 upvotes)
• OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation (arXiv:2407.02371, 54 upvotes)
• Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models (arXiv:2407.01906, 46 upvotes)
• Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision (arXiv:2407.06189, 27 upvotes)
• Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies (arXiv:2407.13623, 56 upvotes)