Training Code

https://github.com/juemifuji/eagle3-aimo3

✅ 36% inference speedup at 8-way concurrency (31–43% across concurrency levels 1–8)

| Throughput, gpt-oss-120b (tok/s) | Throughput, gpt-oss-120b-eagle3-aimo3 (tok/s) | Speedup | Concurrency |
|---|---|---|---|
| 776.514 | 1059.43 | 36.4% | 8 |
| 686.717 | 956.431 | 39.3% | 7 |
| 596.596 | 851.647 | 42.8% | 6 |
| 518.76 | 680.951 | 31.3% | 5 |
| 465.702 | 657.682 | 41.2% | 4 |
| 379.48 | 541.304 | 42.6% | 3 |
| 297.553 | 422.232 | 41.9% | 2 |
| 190.023 | 268.132 | 41.1% | 1 |

(Benchmark conditions: concurrency from 1 to 8, streaming inference, frequent prefill workloads.)
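The speedup column is simply the ratio of draft-assisted to baseline throughput, minus one. A minimal check for the 8-way row, using the throughput values from the table above:

```shell
base=776.514    # gpt-oss-120b throughput at concurrency 8
eagle=1059.43   # gpt-oss-120b-eagle3-aimo3 throughput at concurrency 8
# speedup = eagle / base - 1, expressed as a percentage
awk -v b="$base" -v e="$eagle" 'BEGIN { printf "%.1f%%\n", (e/b - 1) * 100 }'
# prints: 36.4%
```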

Serving

```bash
#!/usr/bin/env bash
# All settings below can be overridden via environment variables.
TP="${TP:-8}"
MAX_NUM_SEQS="${MAX_NUM_SEQS:-256}"
GPU_MEMORY_UTILIZATION="${GPU_MEMORY_UTILIZATION:-0.9}"
HOST="${HOST:-0.0.0.0}"
PORT="${PORT:-8000}"
MAX_LEN="${MAX_LEN:-40960}"
STREAM_INTERVAL="${STREAM_INTERVAL:-1}"

# ====== speculative config (JSON) ======
SPECULATIVE_CONFIG='{"method":"eagle3","model":"gpt-oss-120b-eagle3-aimo3","num_speculative_tokens":3,"draft_tensor_parallel_size":1}'

exec python -m vllm.entrypoints.openai.api_server \
  --model openai/gpt-oss-120b \
  --served-model-name gpt-oss \
  --tensor-parallel-size "$TP" \
  --max-num-seqs "$MAX_NUM_SEQS" \
  --gpu-memory-utilization "$GPU_MEMORY_UTILIZATION" \
  --host "$HOST" \
  --port "$PORT" \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --max-model-len "$MAX_LEN" \
  --async-scheduling \
  --stream-interval "$STREAM_INTERVAL" \
  --speculative-config "$SPECULATIVE_CONFIG"
```
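Once the server is up, it can be queried through vLLM's OpenAI-compatible chat endpoint. A minimal sketch, assuming the script's default host/port (`0.0.0.0:8000`) and the `gpt-oss` name set via `--served-model-name`; the prompt text is just an example:

```shell
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gpt-oss",
        "messages": [{"role": "user", "content": "Explain speculative decoding in one sentence."}],
        "stream": true
      }'
```

Streaming (`"stream": true`) matches the benchmark conditions above; speculative decoding is transparent to the client, so no request-side changes are needed.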
Model size: 0.3B parameters (Safetensors; tensor types I64 · BF16 · BOOL)