Model conversion info
hi @kvaishnavi ,
are there any instructions on how this model was created?
how do you serve the model onnxruntime/gpt-oss-20b-onnx - what model server do you use? i assume there will not be a kv cache? any numbers on the accuracy and latency compared to the original version of the model if you use nvidia l40s or nvidia h100?
i assume some conversion from openai/gpt-oss-20b?
Thanks,
Gerald
The description in the original PR has more information. Here is the environment that was used.
The model is served inside Foundry Local. You can try it out and measure performance with the CLI or SDK.
hi @kvaishnavi
- how can i run this model with e.g. triton inference server - can you use nvcr.io/nvidia/tritonserver:<yy.mm>-py3-sdk image as shown here https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/tutorials/Quick_Deploy/ONNX/README.html ?
- how does the request prompt look like for this onnx model using e.g. triton inference server?
- is there a way to use kv cache with it - i assume no?
how can i run this model with e.g. triton inference server - can you use nvcr.io/nvidia/tritonserver:<yy.mm>-py3-sdk image as shown here https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/tutorials/Quick_Deploy/ONNX/README.html ?
Yes, that should work.
how does the request prompt look like for this onnx model using e.g. triton inference server?
You can follow OpenAI's Harmony format for correct prompting.
is there a way to use kv cache with it - i assume no?
The model already uses KV caches.
hi @kvaishnavi "You can follow OpenAI's Harmony format for correct prompting." - do you have a sample request to the model?
"The model already uses KV caches." - what does that mean?
also can you say something about the accuracy of the model compared to the original one? can you also set reasoning=low, medium or high?
Thanks
hi @kvaishnavi "You can follow OpenAI's Harmony format for correct prompting." - do you have a sample request to the model?
You should use a tokenizer to apply the format. For example, here's how you can use the tokenizer from Hugging Face's Transformers.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")
messages = [
    {"role": "system", "content": "You are a helpful AI assistant"},
    {"role": "user", "content": "What color is the sky?"},
]
# `tokenize=False` returns the formatted prompt as a string instead of token ids
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, reasoning_effort="low"
)
# `prompt` is a string that now contains the sample request to tokenize
print(prompt)
```
"The model already uses KV caches." - what does that mean?
The ONNX model contains past KV cache inputs and present KV cache outputs. You can open the ONNX model in Netron to see them.
In order to run inference, you will need to provide these inputs and use these outputs as inputs in each iteration of the generation loop. An inference solution such as ONNX Runtime GenAI already knows how to use them for you.
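The feedback pattern described above (past KV caches go in, present KV caches come back out and become the next iteration's past) can be sketched in plain Python. Everything below is an illustrative stand-in, not the real graph: in practice you would call `onnxruntime.InferenceSession.run` with the actual input/output names shown in Netron.

```python
# Sketch of the KV-cache feedback loop. `fake_model_step` stands in for one
# ONNX forward pass; real code would feed past caches as session inputs and
# read present caches from the session outputs.

def fake_model_step(token, past_key, past_value):
    """Stand-in for one forward pass: returns the next token plus 'present'
    caches, which are the past caches extended by one position."""
    present_key = past_key + [f"k({token})"]
    present_value = past_value + [f"v({token})"]
    next_token = token + 1  # dummy stand-in for sampling a token
    return next_token, present_key, present_value

def generate(first_token, steps):
    past_key, past_value = [], []  # empty caches on the first iteration
    token = first_token
    tokens = [token]
    for _ in range(steps):
        # Feed the last token plus past caches; the present caches returned
        # here are reused as the past caches of the next iteration.
        token, past_key, past_value = fake_model_step(token, past_key, past_value)
        tokens.append(token)
    return tokens, past_key

tokens, cache = generate(0, 3)
print(tokens)      # [0, 1, 2, 3]
print(len(cache))  # 3 - the cache grew by one entry per step
```

This is exactly the bookkeeping that ONNX Runtime GenAI handles for you; with plain ONNX Runtime you write this loop yourself.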
You can view the links on the model card for more information.
also can you say something to the accuracy of the model compared to the original one? can you also set reasoning=low, medium or high?
The accuracy is close to the original model. For example, we internally measured MMLU and the delta was 1% from the original model.
It depends on which framework you use to tokenize and detokenize. If you are using ONNX Runtime GenAI for tokenization, you will need to modify the output from `Tokenizer.ApplyChatTemplate` to set the reasoning level. It is currently a missing feature from that API. If you are using Hugging Face's Transformers for tokenization, you can provide `reasoning_effort` as an extra parameter to `Tokenizer.ApplyChatTemplate`.
"In order to run inference, you will need to provide these inputs and use these outputs as inputs in each iteration of the generation loop." - i want to use triton inference server with onnx backend. will i need to write python code for that generation loop? can you share example code?
"In order to run inference, you will need to provide these inputs and use these outputs as inputs in each iteration of the generation loop." - i want to use triton inference server with onnx backend. will i need to write python code for that generation loop? can you share example code?
It depends on how Triton Inference Server is configured. From the link you provided, it seems that only ONNX Runtime is being used and not ONNX Runtime GenAI as well. With both ONNX Runtime and ONNX Runtime GenAI, KV cache management is handled for you. With only ONNX Runtime, you would have to write the logic to update your inputs between iterations (here is a primitive example).
The Triton Inference Server maintainers can best answer your questions about how it is set up and any example code. You can reach out to them here.
There are fully working Python examples already linked in the model card. For serving, we provide the model through Foundry Local. We don't have a sample with Triton Inference Server, but you can follow their official tutorials and swap out the example model with any of the models from this repo.
also does this model still use reasoning - if yes low, medium, high? and what's the difference between this model and https://huggingface.co/onnx-community/gpt-oss-20b-ONNX/tree/main ?
also does this model still use reasoning - if yes low, medium, high?
It uses the default reasoning level of medium set in the chat template. As mentioned here, setting the reasoning level via `ApplyChatTemplate` is currently a missing feature. Alternatives include:
- You can modify the `chat_template.jinja` file at the above linked location to set a new default reasoning level.
- You can modify the formatted string returned from `ApplyChatTemplate` to use a different reasoning level. The pattern to match and replace for the reasoning level is `"Reasoning: " + reasoning_effort + "\n\n"` where `reasoning_effort` is `low`, `medium`, or `high`.
- You can use Hugging Face's tokenizer to change the default reasoning level and get the formatted string.
and what's the difference between this model and https://huggingface.co/onnx-community/gpt-oss-20b-ONNX/tree/main ?
This repo has three different optimized variants: CPU, CUDA, and WebGPU. The model uploaded in the ONNX community is the WebGPU variant because it can run on GPUs from multiple hardware vendors. The CUDA variant can only run on NVIDIA GPUs.