lujangusface committed on Mar 28

Commit 9dc0d9e (verified)

1 Parent(s): 73b48a5

docs: standardize model card for public release

Files changed (1)

README.md +94 -79

@@ -1,65 +1,94 @@
 ---
-language:
-- en
-license: mit
 library_name: transformers
-tags:
-- speculative-decoding
-- eagle3
-- draft-model
-- jax
-- tpu
 base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
 ---
 # EAGLE3 Draft Head — DeepSeek-R1-Distill-Qwen-14B
-An [EAGLE3](https://github.com/SafeAILab/EAGLE) speculative decoding draft head for
-[`deepseek-ai/DeepSeek-R1-Distill-Qwen-14B`](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B),
-trained on TPU v4 with [SpecJAX](https://github.com/thoughtworks/specjax) — a pure JAX port of the EAGLE3 training pipeline.
 ## Usage
-Load with [SGLang](https://github.com/sgl-project/sglang) (recommended):
 ```bash
 python -m sglang.launch_server \
-  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-14B \
-  --speculative-algorithm EAGLE3 \
-  --speculative-draft-model-path thoughtworks/DeepSeek-R1-Distill-Qwen-14B-Eagle3 \
-  --speculative-num-steps 5 \
-  --speculative-eagle-topk 4 \
-  --speculative-num-draft-tokens 16
 ```
-Or with [vLLM](https://github.com/vllm-project/vllm):
 ```bash
-python -m vllm.entrypoints.openai.api_server \
-  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-14B \
-  --speculative-model thoughtworks/DeepSeek-R1-Distill-Qwen-14B-Eagle3 \
-  --num-speculative-tokens 5
 ```
 ## Training Details
-| | |
-|---|---|
-| **Target model** | `deepseek-ai/DeepSeek-R1-Distill-Qwen-14B` |
-| **Training framework** | [SpecJAX](https://github.com/thoughtworks/specjax) — JAX/TPU port of EAGLE3 |
-| **Hardware** | TPU v4-32 (4 hosts × 4 chips, tp=4, dp=4) |
-| **Dataset** | 54K mixed (45% ShareGPT / 35% UltraChat-200K / 20% Open-PerfectBlend) |
-| **Epochs** | 3 |
-| **Steps** | 2,490 |
-| **Wall time** | ~84 min |
-| **Learning rate** | 8e-4, cosine decay, 3% warmup |
-| **Batch size** | 2 (grad accum 8, effective batch 16 per DP rank) |
-| **Max length** | 512 tokens |
-| **TTT length** | 7 (test-time training rollout positions) |
-## Results
-Token acceptance rates measured on the training distribution (54K mixed):
 | Position | Acceptance Rate |
 |----------|----------------|
@@ -71,54 +100,40 @@ Token acceptance rates measured on the training distribution (54K mixed):
 | acc_5 | 55.7% |
 | acc_6 | 54.1% |
-**Epoch progression** (no overfitting detected):
-| Checkpoint | acc_0 | Loss |
-|------------|-------|------|
-| epoch_1 | 52.0% | ~10.6 |
-| epoch_2 | 60.9% | 7.64 |
-| epoch_3 | 65.8% | 6.76 |
-Full training curves: [W&B run `li7xhsk7`](https://wandb.ai/gustavo-lujan-thoughtworks/ds-r1-qwen-14b-eagle3-experiments/runs/li7xhsk7)
 ## Model Architecture
-The draft head is a single-layer transformer that takes the target model's hidden states
-as input and predicts the next token using EAGLE3's feature-fusion approach.
-```json
-{
-  "architectures": ["LlamaForCausalLMEagle3"],
-  "model_type": "llama",
-  "hidden_size": 5120,
-  "intermediate_size": 13824,
-  "num_hidden_layers": 1,
-  "num_attention_heads": 40,
-  "num_key_value_heads": 8,
-  "head_dim": 128,
-  "vocab_size": 152064,
-  "draft_vocab_size": 32000,
-  "rope_theta": 1000000.0,
-  "rms_norm_eps": 1e-05,
-  "torch_dtype": "bfloat16"
-}
-```
-## Citation
-If you use this model, please cite the EAGLE3 paper:
 ```bibtex
-@article{li2024eagle3,
-  title={EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
   author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
-  journal={arXiv},
-  year={2024}
 }
 ```
-## License
-This draft head is released under the MIT license.
-The base model ([DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B))
-is subject to its own license terms.

 ---
 library_name: transformers
+license: mit
+language:
+  - en
 base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
+tags:
+  - eagle3
+  - speculative-decoding
+  - sglang
+  - draft-model
+  - jax
+  - tpu
+pipeline_tag: text-generation
 ---
 # EAGLE3 Draft Head — DeepSeek-R1-Distill-Qwen-14B
+A speculative decoding draft head for [deepseek-ai/DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B), trained using the [EAGLE3](https://arxiv.org/abs/2503.01840) method on Google Cloud TPU with the [SpecJAX](https://github.com/tails-mpt/SpecJAX) framework.
+EAGLE3 draft heads accelerate autoregressive generation by proposing multiple tokens per step that a target model then verifies in parallel — typically achieving 2-3x throughput gains with no change in output quality.
 ## Usage
+### SGLang (GPU)
+> **Note**: DeepSeek-R1-Distill-Qwen uses the Qwen2 architecture. EAGLE3 support requires a small patch to SGLang (adding `set_eagle3_layers_to_capture()` to the Qwen2 model). See the [SpecJAX inference guide](https://github.com/tails-mpt/SpecJAX/tree/main/inference) for details.
 ```bash
 python -m sglang.launch_server \
+    --model deepseek-ai/DeepSeek-R1-Distill-Qwen-14B \
+    --speculative-algorithm EAGLE3 \
+    --speculative-draft-model-path thoughtworks/DeepSeek-R1-Distill-Qwen-14B-Eagle3 \
+    --speculative-num-steps 5 \
+    --speculative-eagle-topk 4 \
+    --dtype bfloat16
 ```
+### sglang-jax (TPU)
+> **Note**: Requires the same Qwen2 EAGLE3 patch applied to sglang-jax. The sglang-jax EAGLE3 pipeline is functional but not yet performance-optimized.
 ```bash
+python -m sgl_jax.launch_server \
+    --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-14B \
+    --speculative-algorithm EAGLE3 \
+    --speculative-draft-model-path thoughtworks/DeepSeek-R1-Distill-Qwen-14B-Eagle3 \
+    --speculative-eagle-topk 1 \
+    --speculative-num-steps 3 \
+    --speculative-num-draft-tokens 4 \
+    --tp-size 4 --dtype bfloat16
+```
+### Python (SGLang client)
+```python
+import sglang as sgl
+llm = sgl.LLM(
+    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-14B",
+    speculative_algorithm="EAGLE3",
+    speculative_draft_model_path="thoughtworks/DeepSeek-R1-Distill-Qwen-14B-Eagle3",
+    speculative_num_steps=5,
+    speculative_eagle_topk=4,
+    dtype="bfloat16",
+)
 ```
 ## Training Details
+| Parameter | Value |
+|-----------|-------|
+| Framework | [SpecJAX](https://github.com/tails-mpt/SpecJAX) — pure JAX, no Flax/PyTorch |
+| Hardware | Google Cloud TPU v4-32 (4 hosts x 4 chips, TP=4, DP=4) |
+| Dataset | 54K mixed: ShareGPT (45%) + UltraChat-200K (35%) + Open-PerfectBlend (20%) |
+| Epochs | 3 |
+| Steps | 2,490 total |
+| Optimizer | AdamW, cosine LR decay, 3% warmup |
+| Learning rate | 8e-4 |
+| Batch size | B=2, sequence length T=512, gradient accumulation 8 |
+| TTT length | 7 (multi-step speculative rollout) |
+| Training time | ~84 minutes |
+| Precision | bfloat16 |
+### Training Method
+This model is trained with [EAGLE3](https://arxiv.org/abs/2503.01840)'s training-time test (TTT) objective with a rollout length of 7. At each training step, the draft head autoregressively proposes 7 tokens, the target model provides ground-truth hidden states and logits for all positions, and the per-position losses are combined with geometric weights (0.8^k for position k) so that the draft learns to match the target at each position.
+## Performance
+Token acceptance rates on generic instruction-following data (ShareGPT-style prompts):
 | Position | Acceptance Rate |
 |----------|----------------|
 | acc_5 | 55.7% |
 | acc_6 | 54.1% |
+This model achieves the highest acc_0 among all SpecJAX-trained EAGLE3 draft heads.
+*Measured on held-out evaluation data. Actual throughput gains depend on hardware, prompt distribution, and runtime version.*
 ## Model Architecture
+The draft head is a single-layer transformer that operates on the target model's hidden states:
+| Parameter | Value |
+|-----------|-------|
+| Architecture | `LlamaForCausalLM` (1 decoder layer) |
+| Hidden size | 5120 |
+| Attention heads | 40 (GQA: 8 KV heads) |
+| Vocabulary size | 152,064 (full target vocab) |
+| Draft vocab size | 32,000 (top tokens by training frequency) |
+| Parameters | ~530M |
+## Limitations
+- Trained on English-dominant instruction data; performance may degrade on non-English inputs or highly domain-specific content.
+- Acceptance rates are measured on generic chat data and will vary by prompt distribution.
+- This is a v1 checkpoint trained on generic data. A v2 with target-model-regenerated training data is planned.
+## License
+This model is released under the [MIT License](https://opensource.org/licenses/MIT). The base model ([DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B)) is subject to its own license terms.
+## References
 ```bibtex
+@article{li2025eagle3,
+  title={EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
   author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
+  journal={arXiv preprint arXiv:2503.01840},
+  year={2025}
 }
 ```
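
The geometric weighting described under "Training Method" in the new README (0.8^k over a rollout of 7 positions) can be illustrated with a minimal sketch. This is not SpecJAX code: the function names `ttt_position_weights` and `weighted_ttt_loss` are hypothetical, and it assumes that k starts at 0, that the weights are left unnormalized, and that each position already has a scalar loss (for example, a cross-entropy against the target model's logits).

```python
from typing import List, Sequence


def ttt_position_weights(ttt_length: int = 7, decay: float = 0.8) -> List[float]:
    """Return the weights decay**k for k = 0 .. ttt_length - 1.

    Assumption: indexing starts at k = 0, so the first rollout
    position gets weight 1.0 and later positions count for less.
    """
    return [decay ** k for k in range(ttt_length)]


def weighted_ttt_loss(per_position_losses: Sequence[float], decay: float = 0.8) -> float:
    """Combine per-position losses into one scalar with geometric weights.

    Assumption: the weights are not normalized to sum to 1; the README
    does not say whether the real training code normalizes them.
    """
    weights = ttt_position_weights(len(per_position_losses), decay)
    return sum(w * loss for w, loss in zip(weights, per_position_losses))


if __name__ == "__main__":
    # Hypothetical per-position losses for a rollout of length 7.
    example_losses = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6]
    print("weights:", ttt_position_weights())
    print("combined loss:", weighted_ttt_loss(example_losses))
```

Under these assumptions the last of the 7 positions carries a weight of 0.8^6, roughly 0.26, so errors at the first draft position influence the gradient about four times as strongly as errors at the last one.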