Model conversion info
hi @kvaishnavi ,
are there any instructions on how this model was created?
how do you serve the model onnxruntime/gpt-oss-20b-onnx - what model server do you use? i assume there will not be a kv cache? any numbers on the accuracy and latency compared to the original version of the model if you use nvidia l40s or nvidia h100?
i assume some conversion from openai/gpt-oss-20b?
Thanks,
Gerald
The description in the original PR has more information. Here is the environment that was used.
The model is served inside Foundry Local. You can try it out and measure performance with the CLI or SDK.
hi @kvaishnavi
- how can i run this model with e.g. triton inference server - can you use nvcr.io/nvidia/tritonserver:<yy.mm>-py3-sdk image as shown here https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/tutorials/Quick_Deploy/ONNX/README.html ?
- how does the request prompt look like for this onnx model using e.g. triton inference server?
- is there a way to use kv cache with it - i assume no?
how can i run this model with e.g. triton inference server - can you use nvcr.io/nvidia/tritonserver:<yy.mm>-py3-sdk image as shown here https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/tutorials/Quick_Deploy/ONNX/README.html ?
Yes, that should work.
how does the request prompt look like for this onnx model using e.g. triton inference server?
You can follow OpenAI's Harmony format for correct prompting.
is there a way to use kv cache with it - i assume no?
The model already uses KV caches.
hi @kvaishnavi "You can follow OpenAI's Harmony format for correct prompting." - do you have a sample request to the model?
"The model already uses KV caches." - what does that mean?
also can you say something about the accuracy of the model compared to the original one? can you also set reasoning=low, medium or high?
Thanks
hi @kvaishnavi "You can follow OpenAI's Harmony format for correct prompting." - do you have a sample request to the model?
You should use a tokenizer to apply the format. For example, here's how you can use the tokenizer from Hugging Face's Transformers.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")
messages = [
    {"role": "system", "content": "You are a helpful AI assistant"},
    {"role": "user", "content": "What color is the sky?"},
]
# `tokenize=False` returns the formatted prompt as a string instead of token ids
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, reasoning_effort="low"
)
# `prompt` is a string that now contains the sample request to tokenize
print(prompt)
```
"The model already uses KV caches." - what does that mean?
The ONNX model contains past KV cache inputs and present KV cache outputs. You can open the ONNX model in Netron to see them.
In order to run inference, you will need to provide these inputs and use these outputs as inputs in each iteration of the generation loop. An inference solution such as ONNX Runtime GenAI already knows how to use them for you.
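The feedback pattern described above (past KV caches go in, present KV caches come back out and become the next iteration's past) can be sketched in plain Python. Everything below is an illustrative stand-in, not the real graph: in practice you would call `onnxruntime.InferenceSession.run` with the actual input/output names shown in Netron.

```python
# Sketch of the KV-cache feedback loop. `fake_model_step` stands in for one
# ONNX forward pass; real code would feed past caches as session inputs and
# read present caches from the session outputs.

def fake_model_step(token, past_key, past_value):
    """Stand-in for one forward pass: returns the next token plus 'present'
    caches, which are the past caches extended by one position."""
    present_key = past_key + [f"k({token})"]
    present_value = past_value + [f"v({token})"]
    next_token = token + 1  # dummy stand-in for sampling a token
    return next_token, present_key, present_value

def generate(first_token, steps):
    past_key, past_value = [], []  # empty caches on the first iteration
    token = first_token
    tokens = [token]
    for _ in range(steps):
        # Feed the last token plus past caches; the present caches returned
        # here are reused as the past caches of the next iteration.
        token, past_key, past_value = fake_model_step(token, past_key, past_value)
        tokens.append(token)
    return tokens, past_key

tokens, cache = generate(0, 3)
print(tokens)      # [0, 1, 2, 3]
print(len(cache))  # 3 - the cache grew by one entry per step
```

This is exactly the bookkeeping that ONNX Runtime GenAI handles for you; with plain ONNX Runtime you write this loop yourself.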
You can view the links on the model card for more information.
also can you say something to the accuracy of the model compared to the original one? can you also set reasoning=low, medium or high?
The accuracy is close to the original model. For example, we internally measured MMLU and the delta was 1% from the original model.
It depends on which framework you use to tokenize and detokenize. If you are using ONNX Runtime GenAI for tokenization, you will need to modify the output from `Tokenizer.ApplyChatTemplate` to set the reasoning level. It is currently a missing feature from that API. If you are using Hugging Face's Transformers for tokenization, you can provide `reasoning_effort` as an extra parameter to `Tokenizer.ApplyChatTemplate`.
"In order to run inference, you will need to provide these inputs and use these outputs as inputs in each iteration of the generation loop." - i want to use triton inference server with onnx backend. will i need to write python code for that generation loop? can you share example code?
"In order to run inference, you will need to provide these inputs and use these outputs as inputs in each iteration of the generation loop." - i want to use triton inference server with onnx backend. will i need to write python code for that generation loop? can you share example code?
It depends on how Triton Inference Server is configured. From the link you provided, it seems that only ONNX Runtime is being used and not ONNX Runtime GenAI as well. With both ONNX Runtime and ONNX Runtime GenAI, KV cache management is handled for you. With only ONNX Runtime, you would have to write the logic to update your inputs between iterations (here is a primitive example).
The Triton Inference Server maintainers can best answer your questions about how it is set up and any example code. You can reach out to them here.
There are fully working Python examples already linked in the model card. For serving, we provide the model through Foundry Local. We don't have a sample with Triton Inference Server, but you can follow their official tutorials and swap out the example model with any of the models from this repo.
also does this model still use reasoning - if yes low, medium, high? and what's the difference between this model and https://huggingface.co/onnx-community/gpt-oss-20b-ONNX/tree/main ?
also does this model still use reasoning - if yes low, medium, high?
It uses the default reasoning level of medium set in the chat template. As mentioned here, setting the reasoning level via `ApplyChatTemplate` is currently a missing feature. Alternatives include:
- You can modify the `chat_template.jinja` file at the above linked location to set a new default reasoning level.
- You can modify the formatted string returned from `ApplyChatTemplate` to use a different reasoning level. The pattern to match and replace for the reasoning level is `"Reasoning: " + reasoning_effort + "\n\n"` where `reasoning_effort` is `low`, `medium`, or `high`.
- You can use Hugging Face's tokenizer to change the default reasoning level and get the formatted string.
and what's the difference between this model and https://huggingface.co/onnx-community/gpt-oss-20b-ONNX/tree/main ?
This repo has three different optimized variants: CPU, CUDA, and WebGPU. The model uploaded in the ONNX community is the WebGPU variant because it can run on GPUs from multiple hardware vendors. The CUDA variant can only run on NVIDIA GPUs.