Training Code
https://github.com/juemifuji/eagle3-aimo3
Up to 42.8% inference speedup
| Throughput (gpt-oss-120b) | Throughput (gpt-oss-120b-eagle3-aimo3) | Speedup | Concurrency |
|---|---|---|---|
| 776.514 | 1059.43 | 36.40% | 8 |
| 686.717 | 956.431 | 39.30% | 7 |
| 596.596 | 851.647 | 42.80% | 6 |
| 518.76 | 680.951 | 31.30% | 5 |
| 465.702 | 657.682 | 41.20% | 4 |
| 379.48 | 541.304 | 42.60% | 3 |
| 297.553 | 422.232 | 41.90% | 2 |
| 190.023 | 268.132 | 41.10% | 1 |
Benchmark conditions: streaming inference with frequent-prefill workloads, at 1- to 8-way concurrency.
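The speedup column is just the ratio of EAGLE3-accelerated throughput to baseline throughput, minus one. A minimal sketch that reproduces two entries from the table above (concurrency 8 and concurrency 1):

```python
def speedup_pct(base_throughput: float, eagle_throughput: float) -> float:
    """Relative throughput gain of the EAGLE3 run over the baseline, in percent."""
    return (eagle_throughput / base_throughput - 1) * 100

# Values copied from the benchmark table above.
print(round(speedup_pct(776.514, 1059.43), 1))  # concurrency 8 -> 36.4
print(round(speedup_pct(190.023, 268.132), 1))  # concurrency 1 -> 41.1
```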
Serving
```bash
#!/usr/bin/env bash
# Tunables (override via environment variables).
TP="${TP:-8}"
MAX_NUM_SEQS="${MAX_NUM_SEQS:-256}"
GPU_MEMORY_UTILIZATION="${GPU_MEMORY_UTILIZATION:-0.9}"
HOST="${HOST:-0.0.0.0}"
PORT="${PORT:-8000}"
MAX_LEN="${MAX_LEN:-40960}"
STREAM_INTERVAL="${STREAM_INTERVAL:-1}"

# ====== speculative config (JSON) ======
SPECULATIVE_CONFIG='{"method":"eagle3","model":"gpt-oss-120b-eagle3-aimo3","num_speculative_tokens":3,"draft_tensor_parallel_size":1}'

exec python -m vllm.entrypoints.openai.api_server \
  --model openai/gpt-oss-120b \
  --served-model-name gpt-oss \
  --tensor-parallel-size "$TP" \
  --max-num-seqs "$MAX_NUM_SEQS" \
  --gpu-memory-utilization "$GPU_MEMORY_UTILIZATION" \
  --host "$HOST" \
  --port "$PORT" \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --max-model-len "$MAX_LEN" \
  --async-scheduling \
  --stream-interval "$STREAM_INTERVAL" \
  --speculative-config "$SPECULATIVE_CONFIG"
```
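Once the server is up, it serves the standard OpenAI-compatible chat completions API under the name given by `--served-model-name` (here `gpt-oss`). A minimal request sketch, assuming the script's default host and port; the payload shape follows the OpenAI chat completions schema:

```python
import json

# Endpoint assembled from the script's defaults (HOST=0.0.0.0, PORT=8000).
url = "http://localhost:8000/v1/chat/completions"

# "gpt-oss" must match --served-model-name in the launch script above.
payload = {
    "model": "gpt-oss",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": True,  # streaming inference, as in the benchmark setup
}
body = json.dumps(payload)
# With the server running, POST `body` to `url`, e.g.:
#   requests.post(url, data=body, headers={"Content-Type": "application/json"})
```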