Mistral-7B-Instruct-v0.2 ONNX models
This repository hosts the optimized versions of Mistral-7B-Instruct-v0.2 to accelerate inference with ONNX Runtime.
The Mistral-7B-Instruct-v0.2 Large Language Model (LLM) is an instruction fine-tuned version of Mistral-7B-v0.2.
Optimized Mistral models are published here in ONNX format to run with ONNX Runtime on CPU and GPU across devices, including server platforms and Windows, Linux, and Mac desktops, with the precision best suited to each of these targets.
DirectML support lets developers bring hardware acceleration to Windows devices at scale across AMD, Intel, and NVIDIA GPUs. Along with DirectML, ONNX Runtime provides cross-platform support for Mistral on a range of CPU and GPU devices.
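As a rough illustration of how an application might choose between these backends, the sketch below picks an ONNX Runtime execution provider from a list of available ones. The provider names are the standard ONNX Runtime identifiers; the preference order, the `pick_provider` helper, and the mapping to model variants are assumptions made for this example, not something this repository prescribes.

```python
# Preference order is an assumption for illustration only.
PREFERRED_PROVIDERS = [
    "CUDAExecutionProvider",  # NVIDIA GPUs via CUDA
    "DmlExecutionProvider",   # DirectML on Windows (AMD, Intel, NVIDIA GPUs)
    "CPUExecutionProvider",   # CPU fallback
]

# Hypothetical mapping from provider to the model variants published here.
VARIANT_FOR_PROVIDER = {
    "CUDAExecutionProvider": "fp16 CUDA or int4 CUDA",
    "DmlExecutionProvider": "int4 DML",
    "CPUExecutionProvider": "int4 CPU",
}


def pick_provider(available):
    """Return the first preferred provider that appears in `available`."""
    for name in PREFERRED_PROVIDERS:
        if name in available:
            return name
    raise RuntimeError(f"No supported execution provider in {available!r}")


# Example with a hard-coded list, as might be reported on a Windows machine.
provider = pick_provider(["DmlExecutionProvider", "CPUExecutionProvider"])
print(provider, "->", VARIANT_FOR_PROVIDER[provider])
```

In a real environment the list would typically come from `onnxruntime.get_available_providers()` rather than being hard-coded.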
To get started with Mistral, you can use Olive, our hardware-aware model optimization tool. See here for instructions on how to run it with Mistral.
ONNX Models
Here are some of the optimized configurations we have added:
- ONNX model for int4 DML: ONNX model for AMD, Intel, and NVIDIA GPUs on Windows, quantized to int4 using AWQ.
- ONNX model for fp16 CUDA: ONNX model for NVIDIA GPUs, in fp16 precision.
- ONNX model for int4 CUDA: ONNX model for NVIDIA GPUs using int4 quantization via RTN.
- ONNX model for int4 CPU: ONNX model for your CPU, using int4 quantization via RTN.
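For readers unfamiliar with RTN (round-to-nearest), the following is a minimal sketch of int4 weight quantization in NumPy. It assumes symmetric quantization with one scale per group of 32 weights; the group size, the symmetric scheme, and the function names are illustrative assumptions, and the settings actually used to produce the models in this repository are not stated here.

```python
import numpy as np


def rtn_quantize_int4(weights, group_size=32):
    """Symmetric round-to-nearest int4 quantization, one scale per group."""
    w = np.asarray(weights, dtype=np.float32).reshape(-1, group_size)
    # Map the largest magnitude in each group onto the int4 value 7.
    max_abs = np.abs(w).max(axis=1, keepdims=True)
    scales = np.where(max_abs == 0, 1.0, max_abs / 7.0)
    # Round each weight to the nearest representable level in [-8, 7].
    q = np.clip(np.rint(w / scales), -8, 7).astype(np.int8)
    return q, scales


def rtn_dequantize(q, scales, shape):
    """Recover approximate float weights from int4 codes and group scales."""
    return (q.astype(np.float32) * scales).reshape(shape)


rng = np.random.default_rng(0)
w = rng.normal(size=(64, 128)).astype(np.float32)
q, scales = rtn_quantize_int4(w)
w_hat = rtn_dequantize(q, scales, w.shape)
max_err = float(np.abs(w - w_hat).max())
print(f"max abs reconstruction error: {max_err:.4f}")
```

Each weight is stored as a 4-bit code plus a shared per-group scale, so the reconstruction error of any weight is bounded by half of its group's scale.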
Hardware Supported
The models are tested on:
- GPU SKU: RTX 4090 (DirectML)
- GPU SKU: 1 A100 80GB GPU, SKU: Standard_ND96amsr_A100_v4 (CUDA)
- CPU SKU: Standard F64s v2 (64 vcpus, 128 GiB memory)
Minimum Configuration Required:
- Windows: DirectX 12-capable GPU and a minimum of 4GB of combined RAM
- CUDA: NVIDIA GPU with compute capability >= 7.0 (sm_70, i.e. V100 or newer)
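Reading the CUDA requirement as a minimum compute capability of 7.0 (sm_70, the V100 generation), a quick check could look like the sketch below. The `meets_cuda_minimum` helper is hypothetical; it only compares a `(major, minor)` tuple, which is the form returned by, for example, `torch.cuda.get_device_capability()`.

```python
MIN_COMPUTE_CAPABILITY = (7, 0)  # sm_70, e.g. V100


def meets_cuda_minimum(capability):
    """Check a (major, minor) compute capability against the minimum."""
    return tuple(capability) >= MIN_COMPUTE_CAPABILITY


print(meets_cuda_minimum((8, 0)))  # A100 -> True
print(meets_cuda_minimum((6, 1)))  # Pascal-generation GPU -> False
```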
Model Description
- Developed by: Microsoft
- Model type: ONNX
- Language(s) (NLP): Python, C, C++
- License: Apache License Version 2.0
- Model Description: This is a conversion of the Mistral-7B-Instruct-v0.2 model for ONNX Runtime inference.
Additional Details
Appendix
Activation Aware Quantization
AWQ works by identifying the top 1% of weights that are most salient for maintaining accuracy and quantizing the remaining 99%. This results in less accuracy loss than many other quantization techniques. For more on AWQ, see here.
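The idea described above can be illustrated with a toy NumPy sketch: rank input channels by average activation magnitude, leave the top 1% of them untouched, and round-to-nearest quantize the rest. This is a simplified illustration under several assumptions (synthetic data, per-row scales, hypothetical function names) and is not the procedure used to build the models in this repository. As far as we understand, practical AWQ implementations protect salient channels by rescaling them before quantization rather than literally keeping them at full precision.

```python
import numpy as np


def rtn_int4_rows(w):
    """Round-to-nearest int4 quantize/dequantize, one scale per output row."""
    max_abs = np.abs(w).max(axis=1, keepdims=True)
    scales = np.where(max_abs == 0, 1.0, max_abs / 7.0)
    return np.clip(np.rint(w / scales), -8, 7) * scales


def salient_channels(activations, fraction=0.01):
    """Indices of the input channels with the largest mean |activation|."""
    importance = np.abs(activations).mean(axis=0)
    k = max(1, int(round(fraction * importance.size)))
    return np.sort(np.argsort(importance)[-k:])


def quantize_protecting_salient(w, activations, fraction=0.01):
    """Quantize all weights except the salient input channels (columns)."""
    keep = salient_channels(activations, fraction)
    w_hat = rtn_int4_rows(w)
    w_hat[:, keep] = w[:, keep]  # salient columns stay at full precision
    return w_hat, keep


rng = np.random.default_rng(0)
n_in, n_out = 200, 64
x = rng.normal(size=(512, n_in))
x[:, [3, 57]] *= 20.0  # two synthetic high-magnitude channels
w = rng.normal(size=(n_out, n_in))

w_rtn = rtn_int4_rows(w)
w_mixed, keep = quantize_protecting_salient(w, x)

# Compare the layer output error with and without protecting salient channels.
err_rtn = np.linalg.norm(x @ w.T - x @ w_rtn.T)
err_mixed = np.linalg.norm(x @ w.T - x @ w_mixed.T)
print("protected channels:", keep.tolist())
print(f"output error, plain RTN: {err_rtn:.2f}")
print(f"output error, salient protected: {err_mixed:.2f}")
```

Because the salient channels carry the largest activations, quantization error in their weights contributes disproportionately to the output error, which is why protecting a small fraction of weights can matter so much.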
Model Card Contact
sschoenmeyer, sunghcho, kvaishnavi