docs: standardize model card for public release
README.md
CHANGED
@@ -1,65 +1,94 @@
 ---
-language:
-- en
-license: mit
 library_name: transformers
-
-
--
-- draft-model
-- jax
-- tpu
 base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
 ---

 # EAGLE3 Draft Head — DeepSeek-R1-Distill-Qwen-14B

-
-
-

 ## Usage

-

 ```bash
 python -m sglang.launch_server \
-
-
-
-
-
-
 ```

-

 ```bash
-python -m
-
-
-
 ```

 ## Training Details

-| | |
-|---|---|
-|
-|
-|
-|
-|
-|
-|
-|
-|
-|
-|

-##

-

 | Position | Acceptance Rate |
 |----------|----------------|
@@ -71,54 +100,40 @@ Token acceptance rates measured on the training distribution (54K mixed):
 | acc_5 | 55.7% |
 | acc_6 | 54.1% |

-

-
-|------------|-------|------|
-| epoch_1 | 52.0% | ~10.6 |
-| epoch_2 | 60.9% | 7.64 |
-| epoch_3 | 65.8% | 6.76 |
-
-Full training curves: [W&B run `li7xhsk7`](https://wandb.ai/gustavo-lujan-thoughtworks/ds-r1-qwen-14b-eagle3-experiments/runs/li7xhsk7)

 ## Model Architecture

-The draft head is a single-layer transformer that
-
-
-
-
-
-
-
-
-
-"num_attention_heads": 40,
-"num_key_value_heads": 8,
-"head_dim": 128,
-"vocab_size": 152064,
-"draft_vocab_size": 32000,
-"rope_theta": 1000000.0,
-"rms_norm_eps": 1e-05,
-"torch_dtype": "bfloat16"
-}
-```

-##

-

 ```bibtex
-@article{
-title={
 author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
-journal={arXiv},
-year={
 }
 ```
-
-## License
-
-This draft head is released under the MIT license.
-The base model ([DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B))
-is subject to its own license terms.
@@ -1,65 +1,94 @@
 ---
 library_name: transformers
+license: mit
+language:
+- en
 base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
+tags:
+- eagle3
+- speculative-decoding
+- sglang
+- draft-model
+- jax
+- tpu
+pipeline_tag: text-generation
 ---

 # EAGLE3 Draft Head — DeepSeek-R1-Distill-Qwen-14B

+A speculative decoding draft head for [deepseek-ai/DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B), trained using the [EAGLE3](https://arxiv.org/abs/2503.01840) method on Google Cloud TPU with the [SpecJAX](https://github.com/tails-mpt/SpecJAX) framework.
+
+EAGLE3 draft heads accelerate autoregressive generation by proposing multiple tokens per step that a target model then verifies in parallel, typically achieving 2-3x throughput gains with no change in output quality.
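
To make the draft-and-verify idea concrete, here is a toy sketch of greedy speculative decoding. It is not the EAGLE3 head or any SGLang API: `toy_draft_next` and `toy_target_next` are hypothetical stand-ins that map a list of integer tokens to the next token, and verification is written as a loop, whereas a real runtime checks all draft positions in one batched forward pass of the target model.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, num_draft: int) -> List[int]:
    """Run one draft-and-verify step and return the tokens to append."""
    # Draft phase: the cheap model proposes num_draft tokens autoregressively.
    draft: List[int] = []
    ctx = list(prefix)
    for _ in range(num_draft):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)

    # Verify phase: keep draft tokens while they match the target's greedy choice.
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in draft:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch with the target's token
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next(ctx))
    return accepted


def generate(prompt: List[int], draft_next: NextToken, target_next: NextToken,
             max_new_tokens: int, num_draft: int = 4) -> List[int]:
    """Generate with repeated speculative steps, truncated to max_new_tokens."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, draft_next, target_next, num_draft))
    return out[: len(prompt) + max_new_tokens]


def toy_target_next(ctx: List[int]) -> int:
    """Hypothetical target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft_next(ctx: List[int]) -> int:
    """Hypothetical draft model: agrees with the target except after token 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

Under greedy acceptance the output is identical to decoding with the target alone; the draft only changes how many target passes are needed to produce it.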

 ## Usage

+### SGLang (GPU)
+
+> **Note**: DeepSeek-R1-Distill-Qwen uses the Qwen2 architecture. EAGLE3 support requires a small patch to SGLang (adding `set_eagle3_layers_to_capture()` to the Qwen2 model). See the [SpecJAX inference guide](https://github.com/tails-mpt/SpecJAX/tree/main/inference) for details.

 ```bash
 python -m sglang.launch_server \
+  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-14B \
+  --speculative-algorithm EAGLE3 \
+  --speculative-draft-model-path thoughtworks/DeepSeek-R1-Distill-Qwen-14B-Eagle3 \
+  --speculative-num-steps 5 \
+  --speculative-eagle-topk 4 \
+  --dtype bfloat16
 ```

+### sglang-jax (TPU)
+
+> **Note**: Requires the same Qwen2 EAGLE3 patch applied to sglang-jax. The sglang-jax EAGLE3 pipeline is functional but not yet performance-optimized.

 ```bash
+python -m sgl_jax.launch_server \
+  --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-14B \
+  --speculative-algorithm EAGLE3 \
+  --speculative-draft-model-path thoughtworks/DeepSeek-R1-Distill-Qwen-14B-Eagle3 \
+  --speculative-eagle-topk 1 \
+  --speculative-num-steps 3 \
+  --speculative-num-draft-tokens 4 \
+  --tp-size 4 --dtype bfloat16
+```
+
+### Python (SGLang client)
+
+```python
+import sglang as sgl
+
+llm = sgl.LLM(
+    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-14B",
+    speculative_algorithm="EAGLE3",
+    speculative_draft_model_path="thoughtworks/DeepSeek-R1-Distill-Qwen-14B-Eagle3",
+    speculative_num_steps=5,
+    speculative_eagle_topk=4,
+    dtype="bfloat16",
+)
 ```

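
Once a server from one of the launch commands is running, a client addresses the target model by name and the draft head is applied on the server side. The sketch below builds a request for the OpenAI-compatible completions endpoint using only the standard library. The base URL `http://localhost:30000` is an assumption (the launch commands in this card do not set a port), and `build_completion_request` and `complete` are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: the server listens on localhost:30000; adjust to your --host/--port.
BASE_URL = "http://localhost:30000"
TARGET_MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"


def build_completion_request(prompt: str, max_tokens: int = 256,
                             temperature: float = 0.0) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /v1/completions route."""
    payload = {
        "model": TARGET_MODEL,  # clients name the target model, not the draft head
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, **kwargs) -> str:
    """Send the request and return the first completion's text."""
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs)) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["choices"][0]["text"]
```

Calling `complete("Once upon a time,")` requires a live server; `build_completion_request` can be inspected offline to check the payload.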
 ## Training Details

+| Parameter | Value |
+|-----------|-------|
+| Framework | [SpecJAX](https://github.com/tails-mpt/SpecJAX) (pure JAX, no Flax/PyTorch) |
+| Hardware | Google Cloud TPU v4-32 (4 hosts x 4 chips, TP=4, DP=4) |
+| Dataset | 54K mixed: ShareGPT (45%) + UltraChat-200K (35%) + Open-PerfectBlend (20%) |
+| Epochs | 3 |
+| Steps | 2,490 total |
+| Optimizer | AdamW, cosine LR decay, 3% warmup |
+| Learning rate | 8e-4 |
+| Batch size | B=2, sequence length T=512, gradient accumulation 8 |
+| TTT length | 7 (multi-step speculative rollout) |
+| Training time | ~84 minutes |
+| Precision | bfloat16 |
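
The table's numbers can be cross-checked with a little arithmetic. The sketch below rests on assumptions the card does not state: that B=2 is the batch per data-parallel replica, that the 2,490 steps are optimizer steps, that warmup is linear, and that the cosine schedule decays to zero. Under those assumptions each optimizer step sees 64 sequences and one epoch covers about 53K sequences, which is in line with the 54K dataset size.

```python
import math

# Values taken from the training table.
TOTAL_STEPS = 2490
EPOCHS = 3
PEAK_LR = 8e-4
WARMUP_FRACTION = 0.03
SEQ_LEN = 512
GRAD_ACCUM = 8
DATA_PARALLEL = 4
PER_REPLICA_BATCH = 2  # assumption: B=2 is per data-parallel replica

steps_per_epoch = TOTAL_STEPS // EPOCHS                        # 830
warmup_steps = round(WARMUP_FRACTION * TOTAL_STEPS)            # 75
seqs_per_step = PER_REPLICA_BATCH * GRAD_ACCUM * DATA_PARALLEL  # 64
seqs_per_epoch = steps_per_epoch * seqs_per_step               # 53,120
max_tokens_per_step = seqs_per_step * SEQ_LEN                  # 32,768


def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero (assumed schedule shape)."""
    if step < warmup_steps:
        return PEAK_LR * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, TOTAL_STEPS - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```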

+### Training Method

+This model uses [EAGLE3](https://arxiv.org/abs/2503.01840)'s training-time test (TTT) objective with a rollout length of 7. At each training step, the draft head autoregressively proposes 7 tokens, the target model provides ground-truth hidden states and logits for all positions, and a geometrically weighted loss (weight 0.8^k at position k) trains the draft to match the target at each position.
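
The geometric weighting can be illustrated with a few lines of Python. This is a sketch of the weighting only, not SpecJAX code: it assumes positions are indexed k = 0..6, that the weights are not normalized, and that a per-position loss value is already available. `ttt_weights` and `weighted_rollout_loss` are hypothetical names.

```python
from typing import List, Sequence


def ttt_weights(rollout_len: int = 7, decay: float = 0.8) -> List[float]:
    """Weight decay**k for rollout position k (assumed to start at k = 0)."""
    return [decay ** k for k in range(rollout_len)]


def weighted_rollout_loss(per_position_losses: Sequence[float],
                          decay: float = 0.8) -> float:
    """Sum per-position losses with geometric weights, so early positions dominate."""
    weights = ttt_weights(len(per_position_losses), decay)
    return sum(w * loss for w, loss in zip(weights, per_position_losses))
```

With a decay of 0.8 the last of 7 positions carries about a quarter of the weight of the first (0.8^6 ≈ 0.26), so errors on the first drafted token are penalized most.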
+
+## Performance
+
+Token acceptance rates on generic instruction-following data (ShareGPT-style prompts):

 | Position | Acceptance Rate |
 |----------|----------------|
@@ -71,54 +100,40 @@
 | acc_5 | 55.7% |
 | acc_6 | 54.1% |

+This model achieves the highest acc_0 among all SpecJAX-trained EAGLE3 draft heads.

+*Measured on held-out evaluation data. Actual throughput gains depend on hardware, prompt distribution, and runtime version.*
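
Per-position acceptance rates translate into an expected number of accepted draft tokens per verification step. The sketch below assumes a single draft chain (top-k of 1) and assumes each rate is conditional on all earlier positions having been accepted; the card does not state how `acc_k` is defined, so treat this as an illustration. The rates in the usage note are made-up values, not this model's measurements.

```python
from typing import Sequence


def expected_accepted_tokens(conditional_rates: Sequence[float]) -> float:
    """Expected count of accepted draft tokens for one chain of drafts.

    Position k only counts if positions 0..k were all accepted, so the
    expectation is the sum of the running products of the rates.
    """
    expected = 0.0
    survive = 1.0  # probability that every position so far was accepted
    for rate in conditional_rates:
        survive *= rate
        expected += survive
    return expected
```

For hypothetical rates of 0.8, 0.7 and 0.6, the expectation is 0.8 + 0.56 + 0.336 = 1.696 accepted draft tokens per step, in addition to the one token the target model always contributes.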

 ## Model Architecture

+The draft head is a single-layer transformer that operates on the target model's hidden states:
+
+| Parameter | Value |
+|-----------|-------|
+| Architecture | `LlamaForCausalLM` (1 decoder layer) |
+| Hidden size | 5120 |
+| Attention heads | 40 (GQA: 8 KV heads) |
+| Vocabulary size | 152,064 (full target vocab) |
+| Draft vocab size | 32,000 (top tokens by training frequency) |
+| Parameters | ~530M |
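
A few derived quantities follow directly from the table. The sketch below assumes the output projection is a bias-free matrix of shape hidden size by vocabulary size; the actual layer layout of the draft head is not spelled out in the card, so the parameter counts are estimates for that one matrix only.

```python
# Values from the architecture table.
HIDDEN_SIZE = 5120
NUM_ATTENTION_HEADS = 40
NUM_KV_HEADS = 8
VOCAB_SIZE = 152_064
DRAFT_VOCAB_SIZE = 32_000

head_dim = HIDDEN_SIZE // NUM_ATTENTION_HEADS              # 128
queries_per_kv_head = NUM_ATTENTION_HEADS // NUM_KV_HEADS  # 5 query heads share each KV head
kv_projection_width = NUM_KV_HEADS * head_dim              # 1024


def output_projection_params(vocab_size: int) -> int:
    """Weights in a bias-free hidden-to-vocabulary projection (assumed layout)."""
    return HIDDEN_SIZE * vocab_size


draft_head_params = output_projection_params(DRAFT_VOCAB_SIZE)  # 163,840,000
full_vocab_params = output_projection_params(VOCAB_SIZE)        # 778,567,680
```

Under that assumption a full-vocabulary projection alone would need about 779M weights, more than the ~530M reported for the whole draft head, while the 32,000-token draft vocabulary needs about 164M.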

+## Limitations

+- Trained on English-dominant instruction data; performance may degrade on non-English inputs or highly domain-specific content.
+- Acceptance rates are measured on generic chat data and will vary by prompt distribution.
+- This is a v1 checkpoint trained on generic data. A v2 with target-model-regenerated training data is planned.
+
+## License
+
+This model is released under the [MIT License](https://opensource.org/licenses/MIT). The base model ([DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B)) is subject to its own license terms.
+
+## References

 ```bibtex
+@article{li2025eagle3,
+  title={EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
   author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
+  journal={arXiv preprint arXiv:2503.01840},
+  year={2025}
 }
 ```