Collections
Discover the best community collections!
Collections including paper arxiv:2309.14322

- Post-LayerNorm Is Back: Stable, Expressive, and Deep
  Paper • 2601.19895 • Published • 24
- Elastic Attention: Test-time Adaptive Sparsity Ratios for Efficient Transformers
  Paper • 2601.17367 • Published • 34
- Small-scale proxies for large-scale Transformer training instabilities
  Paper • 2309.14322 • Published • 22
- Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training
  Paper • 2602.00747 • Published • 9

- Textbooks Are All You Need II: phi-1.5 technical report
  Paper • 2309.05463 • Published • 89
- When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale
  Paper • 2309.04564 • Published • 17
- Large-Scale Automatic Audiobook Creation
  Paper • 2309.03926 • Published • 56
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute
  Paper • 2309.11197 • Published • 5

- A Loss Curvature Perspective on Training Instability in Deep Learning
  Paper • 2110.04369 • Published
- Why Do We Need Weight Decay in Modern Deep Learning?
  Paper • 2310.04415 • Published
- Small-scale proxies for large-scale Transformer training instabilities
  Paper • 2309.14322 • Published • 22
- Transformers Can Navigate Mazes With Multi-Step Prediction
  Paper • 2412.05117 • Published • 5

- AutoCLIP: Auto-tuning Zero-Shot Classifiers for Vision-Language Models
  Paper • 2309.16414 • Published • 19
- Dynamic ASR Pathways: An Adaptive Masking Approach Towards Efficient Pruning of A Multilingual ASR Model
  Paper • 2309.13018 • Published • 9
- Robust Speech Recognition via Large-Scale Weak Supervision
  Paper • 2212.04356 • Published • 51
- Language models in molecular discovery
  Paper • 2309.16235 • Published • 10

- Language Modeling Is Compression
  Paper • 2309.10668 • Published • 84
- Small-scale proxies for large-scale Transformer training instabilities
  Paper • 2309.14322 • Published • 22
- Evaluating Cognitive Maps and Planning in Large Language Models with CogEval
  Paper • 2309.15129 • Published • 7
- Vision Transformers Need Registers
  Paper • 2309.16588 • Published • 86