Collections
Collections including paper arxiv:2004.05150

- Learning to Reason in 13 Parameters
  Paper • 2602.04118 • Published • 6
- LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters
  Paper • 2405.17604 • Published • 3
- mHC-lite: You Don't Need 20 Sinkhorn-Knopp Iterations
  Paper • 2601.05732 • Published • 1
- mHC: Manifold-Constrained Hyper-Connections
  Paper • 2512.24880 • Published • 307

- Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning
  Paper • 2211.04325 • Published • 1
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  Paper • 1810.04805 • Published • 26
- On the Opportunities and Risks of Foundation Models
  Paper • 2108.07258 • Published • 2
- Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks
  Paper • 2204.07705 • Published • 2

- Sequence Parallelism: Long Sequence Training from System Perspective
  Paper • 2105.13120 • Published • 6
- Ring Attention with Blockwise Transformers for Near-Infinite Context
  Paper • 2310.01889 • Published • 13
- Striped Attention: Faster Ring Attention for Causal Transformers
  Paper • 2311.09431 • Published • 4
- DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
  Paper • 2309.14509 • Published • 20

- Linear Transformers with Learnable Kernel Functions are Better In-Context Models
  Paper • 2402.10644 • Published • 81
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
  Paper • 2305.13245 • Published • 6
- ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
  Paper • 2402.15220 • Published • 20
- Sequence Parallelism: Long Sequence Training from System Perspective
  Paper • 2105.13120 • Published • 6

- Mamba: Linear-Time Sequence Modeling with Selective State Spaces
  Paper • 2312.00752 • Published • 150
- Elucidating the Design Space of Diffusion-Based Generative Models
  Paper • 2206.00364 • Published • 18
- GLU Variants Improve Transformer
  Paper • 2002.05202 • Published • 4
- StarCoder 2 and The Stack v2: The Next Generation
  Paper • 2402.19173 • Published • 152

- In Search of Needles in a 10M Haystack: Recurrent Memory Finds What LLMs Miss
  Paper • 2402.10790 • Published • 42
- LongAgent: Scaling Language Models to 128k Context through Multi-Agent Collaboration
  Paper • 2402.11550 • Published • 19
- A Neural Conversational Model
  Paper • 1506.05869 • Published • 2
- Data Engineering for Scaling Language Models to 128K Context
  Paper • 2402.10171 • Published • 25