---
base_model:
- Qwen/Qwen3.5-397B-A17B
tags:
- qwen
- fp8
- vllm
- compressed-tensors
name: RedHatAI/Qwen3.5-397B-A17B-FP8-dynamic
---

# FP8 Quantized Qwen3.5-397B-A17B

This is a preliminary (and subject to change) FP8-quantized version of the [Qwen/Qwen3.5-397B-A17B](https://huggingface.co/Qwen/Qwen3.5-397B-A17B) model.
Both weights and activations are quantized to FP8 with [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor).

It is compatible with and tested against vLLM main. Deploy it with: `vllm serve RedHatAI/Qwen3.5-397B-A17B-FP8-dynamic`.
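Once the server is up, it can be queried through the OpenAI-compatible chat-completions endpoint vLLM exposes. A minimal sketch, assuming the server is running locally on the default port 8000 (the prompt text is illustrative):

```python
# Hedged sketch: query the model through vLLM's OpenAI-compatible server.
# Assumes `vllm serve RedHatAI/Qwen3.5-397B-A17B-FP8-dynamic` is running
# locally on the default port 8000.
import json
import urllib.request

def build_chat_request(model: str, prompt: str, temperature: float = 0.0) -> dict:
    """Build an OpenAI-style chat-completions payload (greedy by default)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

payload = build_chat_request(
    "RedHatAI/Qwen3.5-397B-A17B-FP8-dynamic",
    "Explain FP8 quantization in one sentence.",
)
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=30) as resp:
        print(json.loads(resp.read())["choices"][0]["message"]["content"])
except OSError:
    print("Server not reachable; start it with `vllm serve` first.")
```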

# Preliminary Evaluations

1) GSM8k via vLLM's `tests/evals/gsm8k/gsm8k_eval.py` shows almost no accuracy degradation:

|          | Qwen/Qwen3.5-397B-A17B | RedHatAI/Qwen3.5-397B-A17B-FP8-dynamic<br> (this model) |
| -------- | :--------------------: | :------------------------------------: |
| Accuracy | 89.5                   | 89.4                                   |
| Recovery | \-                     | 99.9%                                   |
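The recovery row appears to be the ratio of quantized to baseline accuracy (an assumption about how the table is derived; the eval script itself reports raw accuracy):

```python
# Recovery as the ratio of quantized to baseline GSM8k accuracy
# (assumption about how the table's "Recovery" row is computed).
baseline_acc = 89.5   # Qwen/Qwen3.5-397B-A17B
quantized_acc = 89.4  # this FP8-dynamic model
recovery = quantized_acc / baseline_acc * 100
print(f"Recovery: {recovery:.1f}%")  # Recovery: 99.9%
```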

2) Under greedy sampling, the model generates text almost identical to the unquantized baseline. `Qwen/Qwen3.5-397B-A17B` is on the left; `RedHatAI/Qwen3.5-397B-A17B-FP8-dynamic` (this model) is on the right:


![image](https://cdn-uploads.huggingface.co/production/uploads/628e0ce4e53bbd334577fcb0/3RwIhv9s9LGJdEbG2FFDv.png)
**Note**: More rigorous evaluations are currently in progress and will be available soon.